Similarity Measures

The basic idea underlying similarity-based measures is that molecules that are structurally similar are likely to have similar properties. In a fingerprint the presence or absence of a structural fragment is represented by the presence or absence of a set bit. This means that two molecules are judged as being similar if they have a large number of bits in common.

Measuring molecular similarity or dissimilarity has two basic components: the representation of molecular characteristics (such as fingerprints) and the similarity coefficient that is used to quantify the degree of resemblance between two such representations.

Built-in Similarity Measures

Since different similarity coefficients quantify different types of structural resemblance, several built-in similarity measures are available in OEGraphSim (see Table: Built-in similarity indices). The table below defines the four basic bit count terms that are used in fingerprint-based similarity calculations:

Table: Basic bit count terms

Symbol      Description
onlyA       number of bits set on in fingerprint A but not in B
onlyB       number of bits set on in fingerprint B but not in A
bothAB      number of bits set on in both fingerprints
neitherAB   number of bits set off in both fingerprints

Table: Built-in similarity indices

Similarity measure   Range         OEGraphSim function
Cosine               [0.0 - 1.0]   OECosine
Dice                 [0.0 - 1.0]   OEDice
Euclidean            [0.0 - 1.0]   OEEuclid
Manhattan            [1.0 - 0.0]   OEManhattan
Tanimoto             [0.0 - 1.0]   OETanimoto
Tversky              variable      OETversky
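As an illustration of the four bit count terms, the following minimal sketch computes them in plain Python for two toy fingerprints given as sets of on-bit positions. The helper name bit_counts and the 8-bit example data are hypothetical and are not part of OEGraphSim.

```python
def bit_counts(on_bits_a, on_bits_b, num_bits):
    """Return (onlyA, onlyB, bothAB, neitherAB) for two sets of on-bit positions.

    Hypothetical helper for illustration only; it is not an OEGraphSim function.
    """
    a, b = set(on_bits_a), set(on_bits_b)
    only_a = len(a - b)                # on in A but not in B
    only_b = len(b - a)                # on in B but not in A
    both_ab = len(a & b)               # on in both
    neither_ab = num_bits - len(a | b) # off in both
    return only_a, only_b, both_ab, neither_ab


# Toy 8-bit fingerprints (made-up data, not real molecules).
counts = bit_counts({0, 1, 2, 5}, {1, 2, 3}, num_bits=8)
print(counts)  # (2, 1, 2, 3)
```

The four terms always add up to the fingerprint length, which is a quick sanity check for any hand-rolled implementation.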

Cosine

Formula: Sim_{Cosine}(A,B) = \frac{bothAB}{\sqrt{(onlyA + bothAB) * (onlyB + bothAB)}}

Calculates the ratio of the bits in common to the geometric mean of the number of on bits in the two fingerprints.
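A minimal sketch of the Cosine formula written directly in terms of the bit counts. The function name cosine_from_counts is hypothetical, and returning 0.0 when one fingerprint has no bits set is an assumption of this sketch, not documented OECosine behavior.

```python
import math


def cosine_from_counts(only_a, only_b, both_ab):
    """Cosine similarity from bit counts (illustrative helper, not OEGraphSim)."""
    on_a = only_a + both_ab  # number of on bits in fingerprint A
    on_b = only_b + both_ab  # number of on bits in fingerprint B
    if on_a == 0 or on_b == 0:
        return 0.0  # assumption: an empty fingerprint is similar to nothing
    # bits in common divided by the geometric mean of the on-bit counts
    return both_ab / math.sqrt(on_a * on_b)


# onlyA = 2, onlyB = 1, bothAB = 2
print(round(cosine_from_counts(2, 1, 2), 3))  # 0.577
```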

Dice

Formula: Sim_{Dice}(A,B) = \frac{2 *bothAB}{onlyA + onlyB + 2 * bothAB}

Calculates the ratio of the bits in common to the arithmetic mean of the number of on bits in the two fingerprints.
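The sketch below checks, for one set of toy counts, that the Dice formula is the same as dividing the bits in common by the arithmetic mean of the on-bit counts. Both function names are hypothetical, and the 0.0 result for two empty fingerprints is an assumption of this sketch.

```python
def dice_from_counts(only_a, only_b, both_ab):
    """Dice similarity as written in the formula (illustrative helper)."""
    denom = only_a + only_b + 2 * both_ab
    if denom == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return 2 * both_ab / denom


def dice_as_mean_ratio(only_a, only_b, both_ab):
    """Same value, expressed as bits in common over the arithmetic mean."""
    on_a = only_a + both_ab
    on_b = only_b + both_ab
    arithmetic_mean = (on_a + on_b) / 2
    if arithmetic_mean == 0:
        return 0.0  # assumption, matching dice_from_counts
    return both_ab / arithmetic_mean


# onlyA = 2, onlyB = 1, bothAB = 2
print(round(dice_from_counts(2, 1, 2), 3))    # 0.571
print(round(dice_as_mean_ratio(2, 1, 2), 3))  # 0.571
```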

Euclidean

Formula: Sim_{Euclid}(A,B) = \sqrt{\frac{bothAB + neitherAB}{onlyA + onlyB + bothAB + neitherAB}}

Manhattan

Formula: Sim_{Manhattan}(A,B) = \frac{onlyA + onlyB}{onlyA + onlyB + bothAB + neitherAB}
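The Euclidean and Manhattan formulas are the only two above that use neitherAB. The sketch below evaluates both from bit counts. The function names are hypothetical, and the fingerprint is assumed to have at least one bit. With the formulas as given here, the Manhattan value is the fraction of bit positions in which the two fingerprints differ, so identical fingerprints score 0.0. This is one way to read the reversed [1.0 - 0.0] range in the table of built-in indices.

```python
import math


def euclid_from_counts(only_a, only_b, both_ab, neither_ab):
    """Euclidean measure from bit counts (illustrative helper)."""
    total = only_a + only_b + both_ab + neither_ab  # fingerprint length, assumed > 0
    return math.sqrt((both_ab + neither_ab) / total)


def manhattan_from_counts(only_a, only_b, both_ab, neither_ab):
    """Manhattan measure from bit counts (illustrative helper)."""
    total = only_a + only_b + both_ab + neither_ab  # fingerprint length, assumed > 0
    return (only_a + only_b) / total


# onlyA = 2, onlyB = 1, bothAB = 2, neitherAB = 3
print(round(euclid_from_counts(2, 1, 2, 3), 3))  # 0.791
print(manhattan_from_counts(2, 1, 2, 3))         # 0.375

# Identical fingerprints: every position agrees.
print(euclid_from_counts(0, 0, 4, 4))     # 1.0
print(manhattan_from_counts(0, 0, 4, 4))  # 0.0
```

Because the two numerators partition the fingerprint, the square of the Euclidean value plus the Manhattan value always equals 1 under these definitions.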

Tanimoto

Formula: Sim_{Tanimoto}(A,B) = \frac{bothAB}{onlyA + onlyB + bothAB}

Calculates the ratio of the number of bits set in both fingerprints to the number of bits set in either fingerprint. The more sparsely bits are set on, the smaller Sim_{Tanimoto} values generally become.
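One rough way to see the effect of sparseness is to plug in the bit counts expected for two unrelated fingerprints whose bits are each set independently with probability p. This independence model is a simplifying assumption of the sketch, not a statement about real fingerprints, and both function names are hypothetical.

```python
def tanimoto_from_counts(only_a, only_b, both_ab):
    """Tanimoto similarity from bit counts (illustrative helper)."""
    denom = only_a + only_b + both_ab
    if denom == 0:
        return 0.0  # assumption: two empty fingerprints score 0.0
    return both_ab / denom


def expected_random_tanimoto(p, num_bits=1024):
    """Tanimoto of the expected counts for two independent random fingerprints.

    Assumes every bit is set independently with probability p.
    """
    both = num_bits * p * p        # expected bothAB
    only = num_bits * p * (1 - p)  # expected onlyA (and onlyB)
    return tanimoto_from_counts(only, only, both)


# onlyA = 2, onlyB = 1, bothAB = 2
print(tanimoto_from_counts(2, 1, 2))  # 0.4

# The baseline score between unrelated fingerprints drops as p decreases.
for p in (0.5, 0.25, 0.1):
    print(p, round(expected_random_tanimoto(p), 3))
# 0.5 0.333
# 0.25 0.143
# 0.1 0.053
```

Under this model the value simplifies to p / (2 - p), so sparser fingerprints give a lower baseline score.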

Tversky

Formula: Sim_{Tversky}(A,B) = \frac{bothAB}{\alpha * onlyA + \beta * onlyB + bothAB}

The Tversky similarity measure is asymmetric whenever \alpha \neq \beta, because swapping A and B then changes the result. Setting the parameters \alpha = \beta = 1.0 makes it identical to the Tanimoto measure.

The factor \alpha weights the contribution of the first ‘reference’ molecule. The larger \alpha becomes, the more weight is put on the bit setting of the reference molecule.
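The following sketch evaluates the Tversky formula from bit counts to show the special cases and the asymmetry. The function name tversky_from_counts is hypothetical, and the argument order here does not necessarily match the OETversky signature.

```python
def tversky_from_counts(only_a, only_b, both_ab, alpha, beta):
    """Tversky similarity from bit counts (illustrative helper, not OETversky)."""
    denom = alpha * only_a + beta * only_b + both_ab
    if denom == 0:
        return 0.0  # assumption: avoid dividing by zero
    return both_ab / denom


# onlyA = 2, onlyB = 1, bothAB = 2
print(tversky_from_counts(2, 1, 2, 1.0, 1.0))            # 0.4, same as Tanimoto
print(round(tversky_from_counts(2, 1, 2, 0.5, 0.5), 3))  # 0.571, same as Dice

# Asymmetry: with alpha = 1, beta = 0 only the reference's unmatched bits count,
# so swapping the roles of A and B changes the score.
print(tversky_from_counts(2, 1, 2, 1.0, 0.0))            # 0.5
print(round(tversky_from_counts(1, 2, 2, 1.0, 0.0), 3))  # 0.667
```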

Similarity Calculation

The following example demonstrates how to calculate Tanimoto scores from fingerprints.

Figure: Example molecules

Listing 1: Calculating Tanimoto index

#!/usr/bin/env python
from openeye.oechem import *
from openeye.oegraphsim import *

molA = OEGraphMol()
OEParseSmiles(molA, "c1ccc2c(c1)c(c(oc2=O)OCCSC(=N)N)Cl")
fpA = OEFingerPrint()
OEMakeFP(fpA, molA, OEFPType_MACCS166)

molB = OEGraphMol()
OEParseSmiles(molB, "COc1cc2ccc(cc2c(=O)o1)NC(=N)N")
fpB = OEFingerPrint()
OEMakeFP(fpB, molB, OEFPType_MACCS166)

molC = OEGraphMol()
OEParseSmiles(molC, "COc1c(c2ccc(cc2c(=O)o1)NC(=N)N)Cl")
fpC = OEFingerPrint()
OEMakeFP(fpC, molC, OEFPType_MACCS166)

# print() with a single formatted string works in both Python 2 and Python 3
print("Tanimoto(A,B) = %.3f" % OETanimoto(fpA, fpB))
print("Tanimoto(A,C) = %.3f" % OETanimoto(fpA, fpC))
print("Tanimoto(B,C) = %.3f" % OETanimoto(fpB, fpC))

Molecules B and C (shown in Figure: Example Molecules) have the largest Tanimoto value since they share the largest number of common structural features.

The output of Listing 1 is the following:

Tanimoto(A,B) = 0.618
Tanimoto(A,C) = 0.709
Tanimoto(B,C) = 0.889

User-defined Similarity Measures

The following code snippet demonstrates how to implement the Yule similarity measure with the following formula:

Sim_{Yule}(A,B) = \frac{(bothAB * neitherAB) - (onlyA * onlyB)}{(bothAB * neitherAB) + (onlyA * onlyB)}

def CalculateYule(fpA, fpB):
    onlyA, onlyB, bothAB, neitherAB = OEGetBitCounts(fpA, fpB)
    yule  = float(bothAB * neitherAB - onlyA * onlyB)
    yule /= float(bothAB * neitherAB + onlyA * onlyB)
    return yule

The OEGetBitCounts function returns the four basic values (namely onlyA, onlyB, bothAB and neitherAB) from which any similarity measure can be calculated. For the definition of these values see Table: Basic bit count terms.

OEMakeFP(fpA, molA, OEFPType_Path)
OEMakeFP(fpB, molB, OEFPType_Path)
OEMakeFP(fpC, molC, OEFPType_Path)

print("Yule(A,B) = %.3f" % CalculateYule(fpA, fpB))
print("Yule(A,C) = %.3f" % CalculateYule(fpA, fpC))
print("Yule(B,C) = %.3f" % CalculateYule(fpB, fpC))
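Two properties of the quotient computed in CalculateYule are worth checking when writing a user-defined measure: the result can be negative, and the denominator can be zero (for example when bothAB and onlyA are both 0). The sketch below works on plain bit counts so that it runs without the toolkit. The function name yule_from_counts is hypothetical, and returning 0.0 for a zero denominator is an assumption of this sketch.

```python
def yule_from_counts(only_a, only_b, both_ab, neither_ab):
    """Yule quotient from bit counts with a zero-denominator guard (illustrative)."""
    agree = both_ab * neither_ab  # product of the two agreement counts
    differ = only_a * only_b      # product of the two disagreement counts
    denom = agree + differ
    if denom == 0:
        return 0.0  # assumption: treat the undefined case as "no association"
    return float(agree - differ) / denom


print(yule_from_counts(2, 1, 2, 3))  # 0.5
print(yule_from_counts(3, 3, 1, 1))  # -0.8, disagreements dominate
print(yule_from_counts(0, 0, 0, 8))  # 0.0, guarded zero denominator
```

The same guard could be added to CalculateYule after the call to OEGetBitCounts if fingerprints with no bits set are possible in the input.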

Warning

User-defined similarity measures can only be used with path (OEFPType_Path) and MACCS key (OEFPType_MACCS166) fingerprints but not with LINGO (OEFPType_Lingo).