The previous Fingerprint Generation chapter showed how to create fingerprints with default parameters. These default parameters are calibrated on the Briem-Lessel [Briem-Lessel-2000], Hert-Willett [Hert-Willett-2004], and Grant [Grant-2006] benchmarks.
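However, OEGraphSim also provides facilities to construct user-defined path fingerprints. When constructing a user-defined path fingerprint, the following parameters have to be considered: the size of the fingerprint in bits, the minimum and maximum length (in bonds) of the paths to encode, and the atom and bond typing.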
The following code snippet shows how to generate a 1024-bit fingerprint that encodes paths from 0 to 5 bonds in length, with default atom and bond properties defined by the OEFPAtomType_DefaultAtom and OEFPBondType_DefaultBond constants, respectively.
numbits = 1024
minbonds = 0
maxbonds = 5
OEMakePathFP(fp, mol, numbits, minbonds, maxbonds,
             OEFPAtomType_DefaultAtom, OEFPBondType_DefaultBond)
Warning
Two path-based fingerprints which are generated with different parameters will have different fingerprint types!
In Listing 1, two fingerprints are generated with different parameters: they have a different number of bits. This means that they also have different types, so no similarity value can be calculated between them.
Listing 1: Example of different path fingerprint types
#!/usr/bin/env python
from openeye.oechem import *
from openeye.oegraphsim import *

mol = OEGraphMol()
OEParseSmiles(mol, "c1ccccc1")

# Fingerprint A: 1024 bits, paths of 0 to 5 bonds
fpA = OEFingerPrint()
numbits = 1024
minbonds = 0
maxbonds = 5
OEMakePathFP(fpA, mol, numbits, minbonds, maxbonds,
             OEFPAtomType_DefaultAtom, OEFPBondType_DefaultBond)

# Fingerprint B: same path lengths and typing, but 2048 bits
fpB = OEFingerPrint()
numbits = 2048
OEMakePathFP(fpB, mol, numbits, minbonds, maxbonds,
             OEFPAtomType_DefaultAtom, OEFPBondType_DefaultBond)

print("same fingerprint types =", OEIsSameFPType(fpA, fpB))
# The fingerprint types differ, so this call reports a type mismatch
print(OETanimoto(fpA, fpB))
The output of Listing 1 is the following:
same fingerprint types = False
Fatal: fingerprint type mismatch!
In Listing 2, fingerprints with various atom and bond types are generated for two molecules (depicted in Example molecules). As fewer atom and bond properties are taken into account when the fingerprints are generated, the dissimilarity between the two molecules gradually fades away, resulting in larger Tanimoto similarity values. In the end, when only the topology of the two molecules is considered, i.e., whether or not their atoms and bonds belong to any ring system, the fingerprints of the two molecules become identical.
Listing 2: Similarity calculation with various atom/bond typing
#!/usr/bin/env python
from openeye.oechem import *
from openeye.oegraphsim import *


def PrintTanimoto(molA, molB, atype, btype):
    # Both fingerprints use the same parameters, so they share one type
    fpA = OEFingerPrint()
    fpB = OEFingerPrint()
    numbits = 2048
    minb = 0
    maxb = 5
    OEMakePathFP(fpA, molA, numbits, minb, maxb, atype, btype)
    OEMakePathFP(fpB, molB, numbits, minb, maxb, atype, btype)
    print("Tanimoto(A,B) = %.3f" % OETanimoto(fpA, fpB))


molA = OEGraphMol()
OEParseSmiles(molA, "Oc1c2c(cc(c1)CF)CCCC2")
molB = OEGraphMol()
OEParseSmiles(molB, "c1ccc2c(c1)c(cc(n2)CCl)N")

# Progressively coarser atom and bond typing
PrintTanimoto(molA, molB, OEFPAtomType_DefaultAtom, OEFPBondType_DefaultBond)
PrintTanimoto(molA, molB, OEFPAtomType_DefaultAtom|OEFPAtomType_EqAromatic,
              OEFPBondType_DefaultBond)
PrintTanimoto(molA, molB, OEFPAtomType_Aromaticity, OEFPBondType_DefaultBond)
PrintTanimoto(molA, molB, OEFPAtomType_InRing, OEFPBondType_InRing)
Example molecules
The output of Listing 2 is the following:
Tanimoto(A,B) = 0.166
Tanimoto(A,B) = 0.241
Tanimoto(A,B) = 0.592
Tanimoto(A,B) = 1.000
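The trend above can be imitated with a small pure-Python sketch. It does not call OEGraphSim and does not reproduce its numbers: the two toy graphs (a fluorobenzene-like and a 2-chloropyridine-like molecule), the three typing functions, and the unhashed feature sets are all illustrative assumptions, and only typed atoms and bonded pairs (paths of 0 and 1 bonds) are compared.

```python
def features(atoms, bonds, atom_type):
    """Typed atoms (0-bond paths) and typed bonded pairs (1-bond paths)."""
    feats = {(atom_type(a),) for a in atoms}
    for i, j in bonds:
        pair = sorted((atom_type(atoms[i]), atom_type(atoms[j])))
        feats.add(tuple(pair))  # sorted, so direction does not matter
    return feats


def tanimoto(a, b):
    """Set-based Tanimoto: shared features / all features."""
    return len(a & b) / len(a | b)


# Hypothetical toy molecules: each atom is (element, in_ring).
ring6 = [(i, (i + 1) % 6) for i in range(6)]
mol_a = ([("C", True)] * 6 + [("F", False)], ring6 + [(0, 6)])
mol_b = ([("C", True)] * 5 + [("N", True), ("Cl", False)], ring6 + [(0, 6)])

# Hypothetical typing schemes, from most to least specific.
typings = [
    ("element", lambda a: a[0]),
    ("halogens equivalent", lambda a: "X" if a[0] in {"F", "Cl", "Br", "I"} else a[0]),
    ("ring membership only", lambda a: "ring" if a[1] else "chain"),
]

scores = []
for name, atom_type in typings:
    score = tanimoto(features(*mol_a, atom_type), features(*mol_b, atom_type))
    scores.append(score)
    print("%-22s Tanimoto(A,B) = %.3f" % (name, score))
```

As the typing keeps fewer properties, more features of the two toy molecules coincide, and with ring membership alone their feature sets are the same.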
Path-based fingerprint generation involves molecular graph traversal to identify all unique paths. When a fingerprint is initialized, the minimum and maximum number of bonds of the paths that are encoded into the fingerprint can be specified.
Example molecule
For example, when generating a fingerprint of the molecule shown in Figure: Example Molecule with minimum and maximum length set to 0 and 3, respectively, only the paths listed in the first four rows of Table: Enumerated Paths are encoded into the fingerprint.
| Path length (in bonds) | Generated Unique Paths |
|---|---|
| 0 | C, N, O |
| 1 | C-C, C-N, C-O |
| 2 | C-C-C, C-C-N, C-C-O, C-N-C, N-C-O |
| 3 | C-C-C-C, C-C-C-N, C-C-C-O, C-C-N-C, C-N-C-O |
| 4 | C-C-C-C-C, C-C-C-C-N, C-C-C-C-O, C-C-C-N-C, C-C-N-C-C, O-C-N-C-C |
| 5 | C-C-C-C-C-N, C-C-C-C-C-O, C-C-C-C-N-C, C-C-C-N-C-C, C-C-C-N-C-O |
Figure: Example of path enumeration depicts the six unique paths of length four that are generated for the example molecule. Each unique path is encoded only once without considering its frequency.
Example of path enumeration
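The enumeration step can be sketched in plain Python as a depth-first walk that records each linear path once, regardless of direction or frequency. This is only an illustration under stated assumptions: the function name `enumerate_paths`, the element-only labels, and the acetamide-like toy graph are hypothetical, bond types are ignored, and OEGraphSim's actual traversal may differ.

```python
def enumerate_paths(labels, bonds, minbonds, maxbonds):
    """Return {path length in bonds: set of unique labelled paths}."""
    neighbors = {i: [] for i in range(len(labels))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    found = {n: set() for n in range(minbonds, maxbonds + 1)}

    def walk(path):
        nbonds = len(path) - 1
        if nbonds >= minbonds:
            forward = "-".join(labels[i] for i in path)
            backward = "-".join(labels[i] for i in reversed(path))
            # A path and its reverse are the same path: keep one spelling.
            found[nbonds].add(min(forward, backward))
        if nbonds == maxbonds:
            return
        for nxt in neighbors[path[-1]]:
            if nxt not in path:  # simple paths only, no atom visited twice
                walk(path + [nxt])

    for start in range(len(labels)):
        walk([start])
    return found


# Hypothetical toy graph: heavy atoms of an acetamide-like molecule.
labels = ["C", "C", "O", "N"]
bonds = [(0, 1), (1, 2), (1, 3)]

found = enumerate_paths(labels, bonds, 0, 3)
for nbonds in sorted(found):
    print(nbonds, sorted(found[nbonds]))
```

Storing the paths in sets mirrors the statement above that each unique path is encoded only once.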
In the example shown in Listing 3, fingerprints with various minimum and maximum path lengths are generated for pyrrole and pyridine. When only paths shorter than four bonds are enumerated, the fingerprints generated for the two molecules are identical. Once paths of four bonds are included, the fingerprints differ, because the four-bond pattern ccccc is present in pyridine but not in pyrrole. This results in a smaller Tanimoto similarity score.
Listing 3: Similarity calculation with various path lengths
#!/usr/bin/env python
from openeye.oechem import *
from openeye.oegraphsim import *


def PrintTanimoto(molA, molB, minb, maxb):
    # Only the path length range varies between calls
    fpA = OEFingerPrint()
    fpB = OEFingerPrint()
    numbits = 2048
    atype = OEFPAtomType_DefaultAtom
    btype = OEFPBondType_DefaultBond
    OEMakePathFP(fpA, molA, numbits, minb, maxb, atype, btype)
    OEMakePathFP(fpB, molB, numbits, minb, maxb, atype, btype)
    print("Tanimoto(A,B) = %.3f" % OETanimoto(fpA, fpB))


molA = OEGraphMol()
OEParseSmiles(molA, "c1ccncc1")    # pyridine
molB = OEGraphMol()
OEParseSmiles(molB, "c1cc[nH]c1")  # pyrrole

PrintTanimoto(molA, molB, 0, 3)
PrintTanimoto(molA, molB, 1, 3)
PrintTanimoto(molA, molB, 0, 4)
PrintTanimoto(molA, molB, 0, 5)
The output of Listing 3 is the following:
Tanimoto(A,B) = 1.000
Tanimoto(A,B) = 1.000
Tanimoto(A,B) = 0.950
Tanimoto(A,B) = 0.731
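Why the scores drop once four-bond paths are included can be checked with a toy model that compares raw path sets instead of hashed fingerprints. The sketch below is an assumption-laden illustration, not OEGraphSim code: the rings are labelled only with `c` and `n`, hydrogens and bond types are ignored, and the helper names `ring`, `path_set`, and `tanimoto` are hypothetical, so the values differ from the output of Listing 3.

```python
def ring(labels):
    """Build a single ring: atom i is bonded to atom i + 1 (cyclically)."""
    n = len(labels)
    return labels, [(i, (i + 1) % n) for i in range(n)]


def path_set(labels, bonds, minbonds, maxbonds):
    """Unique simple paths with minbonds..maxbonds bonds, as label strings."""
    nbrs = {i: set() for i in range(len(labels))}
    for i, j in bonds:
        nbrs[i].add(j)
        nbrs[j].add(i)
    paths = set()
    stack = [(i,) for i in nbrs]
    while stack:
        p = stack.pop()
        if len(p) - 1 >= minbonds:
            s = "".join(labels[i] for i in p)
            # Labels are single characters, so s[::-1] is the reversed path.
            paths.add(min(s, s[::-1]))
        if len(p) - 1 < maxbonds:
            stack.extend(p + (n,) for n in nbrs[p[-1]] if n not in p)
    return paths


def tanimoto(a, b):
    return len(a & b) / len(a | b)


pyridine = ring("cccncc")  # six-membered ring with one nitrogen
pyrrole = ring("cccnc")    # five-membered ring with one nitrogen

for minb, maxb in [(0, 3), (1, 3), (0, 4), (0, 5)]:
    a = path_set(*pyridine, minb, maxb)
    b = path_set(*pyrrole, minb, maxb)
    print("paths %d-%d: Tanimoto(A,B) = %.3f" % (minb, maxb, tanimoto(a, b)))

# The only path of up to four bonds that the toy pyrrole lacks:
print(path_set(*pyridine, 0, 4) - path_set(*pyrrole, 0, 4))
```

In this toy model, the two path sets are identical up to three bonds, and the first path that separates them is the all-carbon path of four bonds, which cannot fit in a five-membered ring that contains a nitrogen.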
The previous sections explain how atom and bond typing and path length can affect the similarity scores. Selecting an adequate fingerprint size is also crucial. The number of unique paths present in molecular structures is extremely large, so the generated paths have to be hashed into the fixed-length fingerprint. This means that a bit in a fingerprint does not correspond exclusively to a unique pattern (as it does in a structural key). A bit also has no particular structural meaning; each bit represents the presence of any of a number of structural patterns.
The smaller the fingerprints, the denser they become, raising the probability of collisions. A collision occurs when different paths are mapped to the same bit. This inherently results in information loss and weakens the power to discriminate between structurally similar and dissimilar structures. On the other hand, when the fingerprints are too large, they become very sparse. This reduces information loss, but the time spent calculating similarity scores increases.
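The trade-off can be made concrete with a minimal hashing sketch. It assumes, purely for illustration, that each path sets exactly one bit chosen by CRC-32 modulo the fingerprint size; the hash actually used by OEGraphSim is not described here, and the path strings are placeholders rather than real paths.

```python
import zlib


def fold_into_bits(paths, numbits):
    """Map each path string to one bit position; return the set of 'on' bits."""
    return {zlib.crc32(p.encode("utf-8")) % numbits for p in paths}


# Placeholder strings standing in for 300 unique paths.
paths = ["path-%d" % i for i in range(300)]

for numbits in (64, 512, 4096):
    on_bits = fold_into_bits(paths, numbits)
    # Every path beyond the first that lands on a given bit is a collision.
    collisions = len(paths) - len(on_bits)
    print("numbits=%4d  bits on=%3d  collisions=%3d  density=%.3f"
          % (numbits, len(on_bits), collisions, len(on_bits) / numbits))
```

With 64 bits, at least 236 of the 300 placeholder paths must collide, whatever the hash function. Larger sizes leave most bits unset but give each path a better chance of its own bit.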
The following table shows the number of unique paths generated for benzylpenicillin (depicted in Figure: Benzylpenicillin).
Note
The more atom and bond properties are taken into account and the longer the paths to enumerate, the larger the fingerprint has to be in order to encode the paths without a significant number of bit collisions.
Benzylpenicillin
| Atom typing, bond typing | Paths of 0-3 bonds | Paths of 0-5 bonds | Paths of 0-7 bonds |
|---|---|---|---|
| AtomicNumber, BondOrder | 56 | 149 | 297 |
| AtomicNumber \| HvyDegree, BondOrder | 111 | 265 | 453 |
| AtomicNumber \| HvyDegree \| Aromaticity, BondOrder \| InRing | 126 | 297 | 499 |
| DefaultAtom, DefaultBond | 147 | 362 | 617 |
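The path counts in the table can be turned into a rough estimate of how many collisions to expect at a given fingerprint size. The sketch below relies on an idealized model that is an assumption, not a description of OEGraphSim: each path sets exactly one bit, chosen uniformly at random, so the expected number of distinct bits set by n paths in m bits is m(1 - (1 - 1/m)^n). The function name `expected_collisions` is hypothetical.

```python
def expected_collisions(npaths, numbits):
    """Expected number of paths that land on an already-set bit,
    assuming one uniformly random bit per path (idealized hash)."""
    expected_on_bits = numbits * (1.0 - (1.0 - 1.0 / numbits) ** npaths)
    return npaths - expected_on_bits


# Path counts taken from the benzylpenicillin table above.
cases = [
    ("AtomicNumber, BondOrder, paths 0-5", 149),
    ("DefaultAtom, DefaultBond, paths 0-5", 362),
    ("DefaultAtom, DefaultBond, paths 0-7", 617),
]

for name, npaths in cases:
    for numbits in (512, 1024, 2048, 4096):
        print("%-36s numbits=%4d  expected collisions=%6.1f"
              % (name, numbits, expected_collisions(npaths, numbits)))
```

Under this model, richer typing and longer paths increase the expected number of collisions at a fixed size, and doubling the fingerprint size reduces it, which is the relationship stated in the note above.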