One of the major problems in computer science is the question of how best to represent or organize data, for example by a data structure or a database schema. Indeed, volumes have been written on data modeling, and numerous methodologies have been developed to codify this task; object-oriented design, relational design, object-relational design, and so on. Two measures by which these design-styles can be evaluated are (i) how well they can present the same information in multiple views or organizations and (ii) the related problem of how well they can compensate for changes or modifications to their schema/organization.
These two challenges form the basis of "Schema Fragility". Different users or applications may require the same data to be organized differently for ease of processing, or reflecting different models or levels of detail, or to provide different user communities with views customized for their perspective. This ability is not only critical to the "applicability" of a representation, but also its "longevity". One of the most important aspects of modeling a "real-world" system is that both it and our knowledge of it are changing continually. A robust design is one that can easily adapt to such inevitable changes.
Consider, for example, the issues involved in modeling a hierarchy. In such organizations, records are linked together like a family tree, such that each record has only one owner, e.g. an order is owned by only one customer. Indeed, hierarchical structures were widely used in the first mainframe management systems in the early 1970s.
As a case study, we'll consider the classic hierarchy from biology, that of phylogeny. Most biologists are aware that all living organisms can be placed in a hierarchy of kingdoms, genus and species. If it were truly as simple as this it might be reasonable to organize as a three-level tree. The Biota bioinformatics database, for example, goes further with kingdom, phylum, class, order, family, genus and species as seven different tables. But even this is the tip of the iceberg, as the NCBI taxonomy database, as used by GenBank, records things as
superkingdom, kingdom, subkingdom, superphylum, phylum, subphylum, superclass, class, subclass, infraclass, cohort, subcohort, superorder, order, suborder, infraorder, parvorder, superfamily, family, subfamily, tribe, subtribe, genus, subgenus, species group, species subgroup, species, subspecies, varietas and forma.
Hence if NCBI used a top-down hierarchical data organization, looping over all the organisms in a database only requires 30 nested loops. The problem is exasperated by the fact that the other major bioinformatics databases use a different hierarchy, and that different subtrees and branches have different levels.
Clearly, modeling such a hierarchy explicitly (even without the ambiguity of organisms belonging to multiple leaves of the hierarchy) has serious limitations.
More relevant to OEChem is the related problem of how to represent biomolecules. Once again, a naive structural biologist could be forgiven for assuming it's a simple matter of organizing atoms into residues, and residues into chains for a simple three-level hierarchy. Indeed, this was a fundamental mistake made by the immensely popular RasMol molecular graphics program that had exactly such a three-level structure. In reality, the organization of a PDB file also requires multiple NMR models, crystal related symmetries, secondary and tertiary structural elements, folding domains, (active) sites, connected components, XPLOR segment IDs and distinctions between proteins, nucleic acids, ligands and solvent, and the distinction between backbone vs. side-chain atoms, and categorization by ring system and cycle membership, heavy atoms and hydrogens. Finally, let's not forget the alternate conformation indicators for each atom or residue!
Notice, that this hierarchy also fails to be a "strict" hierarchy. A single syntactic chain may be split into multiple connected components, and multiple PDB chains may be covalently bonded into a single connected component. TER records normally serve as chain terminators but in several PDB files occur with a single residue. Most chains are either ATOM or HETATOM, but peptidic inhibitors and post-translationally modified proteins are mixtures of both. A single strand of a beta-sheet is always formed from a single chain, but a beta-sheet may be formed from stands from multiple chains.
It turns out that there are simple computer science solutions to these problems. Indeed, it was Codd's exposition of these principals for "relational" database systems, that completely killed off use of hierarchical and network database management systems within a decade of their introduction.
The premise is that rather than encode a single fragile hierarchy explicitly, each leaf or record instead maintains its identity or position within the organization. This allows the representation of arbitrary sets and/or partitions of a set of records. All ligand atoms of a molecule are denoted by the fact that they have a ligand property that is true, rather than it being implicit from where it is stored (in the abstract sense of an access path). Of course, each record may possess more than one property, allowing it to simultaneously exist in more than one set. Hence, this representation is generic enough to handle arbitrary Venn diagrams. Strict hierarchies and trees are therefore just an emergent property, where some sets are strict subsets of others. This allows elements to simultaneously be organized in more than one hierarchy, or to elide or introduce new levels into the hierarchy.
The next realization is that once sets, or levels in a hierarchy, are represented by boolean properties or predicates, that there's no need to have an explicit "name" or placeholder for a set. Instead, a set or partition can be defined/named by providing a representative member, and a binary predicate that determines whether another member/record is in the same set/partition as it. For example, to represent a protein chain, it is sufficient to specify an arbitrary atom in the appropriate chain, and provide a SameChain function. Similarly, a residue can be specified by providing the exact same atom and a SameResidue function.
The above explanation should go some way to explaining OEChem's decision to attach biopolymer information to each atom, rather than have container classes for residues and chains (and presumably connected components, NMR models, etc...). The OEResidue class is therefore an additional set of fields that may be associated with an atom. It does not denote or prescribe an amino or nucleic acid but instead stores atom-specific data such as atom serial number, b-factor and occupancy, in addition to residue information, chain information, fragment information, NMR model information, etc...
The residue information associated with an atom can be set with the OEAtomSetResidue function, and is retrieved with the OEAtomGetResidue function. The PDB and Macromodel file format readers parse this information from the input file format. Additionally, OEChem allows residue information to be perceived directly from the connection table using the OEPerceiveResidues function.
For many algorithms processing biomolecules, it is convenient to maintain the atoms of the OEMolBase in sorted order to group atoms in the same residue next to one another, and residues in the same chain sequentially. This can be done conveniently in OEChem using the OEPDBOrderAtoms function. Note, that OEPercieveResidues calls OEPDBOrderAtoms automatically.
A common idiom is therefore the following code snippet:
def MyPrepareProtein(mol):
if OEHasResidues(mol):
OEPDBOrderAtoms(mol)
else:
OEPerceiveResidues(mol)
As a teaching example, the following code demonstrates one way of reporting the number of different chains in an OEMolBase.
def MyCountChains(mol):
result = 0
first = True
prev = None
for atom in mol.GetAtoms():
res = OEAtomGetResidue(atom)
chain = res.GetChainID()
if first or chain != prev:
result += 1
first = False
prev = chain
return result
A slightly improved version would be to use OEChem's OESameChain function.
def MyCountChains2(mol):
result = 0
prev = None
for atom in mol.GetAtoms():
res = OEAtomGetResidue(atom)
if prev and OESameChain(res,prev):
continue
prev = res
result += 1
return result
Clearly, a MyCountResidues function would look almost identical but use the OESameResidue function instead of OESameChain. The slightly more complicated example below, reports the number of residues in each chain.
def MyReportResidues1(mol):
prevchain = None
for chain in mol.GetAtoms():
chainres = OEAtomGetResidue(chain)
if not prevchain or not OESameChain(chainres, prevchain):
prevres = None
count = 0
for residue in mol.GetAtoms():
resres = OEAtomGetResidue(residue)
if OESameChain(resres, chainres):
if not prevres or not OESameChain(resres,prevres):
prevres = resres
count += 1
print count,"residues in chain",chainres.GetChainID()
prevchain = chainres
Whilst the above example contains the doubly nested loops that some structural biologists like to see, the same output can be generated even more efficiently by:
def MyReportResidues2(mol):
count = 0
residue = None
chain = None
for atom in mol.GetAtoms():
res = OEAtomGetResidue(atom)
if not chain:
chain = res
elif not OESameChain(res,chain):
print count,"residues in chain",chain.GetChainID()
count = 0
if not residue or not OESameResidue(res, residue):
residue = res
count += 1
if count>0:
print count,"residues in chain",chain.GetChainID()
Of course, just because OEChem uses an extremely advanced representation of biopolymers, there's absolutely nothing to prevent a user slurping this information into a FORTRAN common block or whichever representation best suits their way of thinking about the problem.