1.1 Molecular Shape

What do we mean by shape? The word is often used without consideration of its precise meaning, but in this document we shall be very clear as to the definition of shape. Two entities have the same shape if their volumes exactly correspond. The more the volumes differ, the more the shapes differ. We give a precise mathematical exposition below, but it is worth noting that even at this most basic level shape is defined as a relative quantity, depending on reference to other shapes. In this we differ from approaches that attempt to provide absolute, canonical ``shapes'' by which to categorize molecules.

What do we mean by ``volume''? A volume is any scalar field. This means a function that has a single number, or ``scalar'', value at each point in space. The ``special case'' for the common understanding of volume is a specific scalar field that has a value of one inside an object and zero outside. The volume of a scalar field is:


\begin{displaymath}
V \mbox{(volume)} = \int f(x,y,z) dV
\end{displaymath} (1.1)

The volume function, $f$, is also referred to as the ``characteristic'' function. When the characteristic function corresponds to the common definition of a volume field, this integral gives what is commonly meant by volume. However, we are not restricted to such simple functions and can still calculate a $V$. In general the volume of a scalar field is a ``contraction'' of the information represented by that characteristic function. It is more precisely referred to as the zeroth-order contraction, or ``moment''. We will discuss other moments and their uses later, but one immediate observation is that two objects cannot have the same shape if their volumes are not the same. The converse is not true: two objects can have the same volume and not have the same shape. In this, volume is typical of most contractions of information.
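As a rough illustration of Equation 1.1, the sketch below estimates $V$ on a regular grid for two characteristic functions: the usual one-inside, zero-outside field of a sphere, and a smooth Gaussian field. The Gaussian, the grid bounds and resolution, and all function names are assumptions made for this example only; they are not part of any toolkit API.

```python
import math

def sphere_indicator(r):
    """Characteristic function: 1 inside a sphere of radius r at the origin, 0 outside."""
    return lambda x, y, z: 1.0 if x * x + y * y + z * z <= r * r else 0.0

def gaussian_field(p, a):
    """A smooth, non-binary scalar field p * exp(-a * r^2) (illustrative choice)."""
    return lambda x, y, z: p * math.exp(-a * (x * x + y * y + z * z))

def volume(f, lo=-4.0, hi=4.0, n=64):
    """Midpoint-rule estimate of the integral of f over the cube [lo, hi]^3."""
    h = (hi - lo) / n
    pts = [lo + (i + 0.5) * h for i in range(n)]  # voxel centres along one axis
    return sum(f(x, y, z) for x in pts for y in pts for z in pts) * h ** 3

v_sphere = volume(sphere_indicator(1.5))
v_gauss = volume(gaussian_field(1.0, 1.0))

# Compare with the analytic values (4/3) pi r^3 and p (pi / a)^(3/2).
print("sphere  ", v_sphere, "analytic", 4.0 / 3.0 * math.pi * 1.5 ** 3)
print("gaussian", v_gauss, "analytic", math.pi ** 1.5)
```

The sphere estimate carries a visible discretization error because the field jumps at the surface, whereas the smooth Gaussian field is integrated almost exactly at the same resolution.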

We can now write down a precise definition of shape similarity. Consider the integral:


\begin{displaymath}
S_1 = \int \vert f(x,y,z) - g(x,y,z)\vert dV
\end{displaymath} (1.2)

where $f$ and $g$ are two characteristic functions. If this integral is zero then $f$ and $g$ are the same function and therefore correspond to the same shape. The larger the integral, the more the shapes defined by $f$ and $g$ differ. It defines a metric quantity between the two fields $f$ and $g$. The word ``metric'' is often used loosely to mean any measure of difference, but here we mean the precise mathematical definition: i.e. a distance that 1) is never negative, 2) is zero if and only if the two entities are identical and 3) obeys the triangle inequality. The triangle inequality states that if entity A is distance $x$ from entity B and B is distance $y$ from entity C then the distance between A and C lies between $\vert x-y\vert$ and $x+y$. The type of comparison shown in $S_1$ is referred to as an $L_1$ metric. Another metric is the $L_2$ metric:


\begin{displaymath}
S_2=\sqrt{\int [f(x,y,z)-g(x,y,z)]^2 dV}
\end{displaymath} (1.3)

This integral is the standard one we will use to define shape similarity. The primary computational advantage of Equation 1.3 over Equation 1.2 is that it does not involve the absolute value function, which is not analytic.
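The two distances in Equations 1.2 and 1.3 can be sketched numerically by sampling both fields on a shared grid. The example below is only an illustration under assumed inputs: three hard-sphere indicator fields, a fixed grid, and helper names invented for this sketch. It checks the metric properties listed above on those samples.

```python
import math

def sphere(cx, r):
    """Indicator field of a sphere of radius r centred at (cx, 0, 0)."""
    return lambda x, y, z: 1.0 if (x - cx) ** 2 + y * y + z * z <= r * r else 0.0

def sample(f, lo=-3.0, hi=3.0, n=40):
    """Sample f at the voxel centres of an n^3 grid; return (values, voxel volume)."""
    h = (hi - lo) / n
    pts = [lo + (i + 0.5) * h for i in range(n)]
    return [f(x, y, z) for x in pts for y in pts for z in pts], h ** 3

def l1_distance(fv, gv, dv):
    """Grid estimate of S_1, the integral of |f - g| (Equation 1.2)."""
    return sum(abs(a - b) for a, b in zip(fv, gv)) * dv

def l2_distance(fv, gv, dv):
    """Grid estimate of S_2, the square root of the integral of (f - g)^2 (Equation 1.3)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fv, gv)) * dv)

fa, dv = sample(sphere(0.0, 1.5))
fb, _ = sample(sphere(0.5, 1.5))
fc, _ = sample(sphere(1.0, 1.0))

d_ab = l2_distance(fa, fb, dv)
d_bc = l2_distance(fb, fc, dv)
d_ac = l2_distance(fa, fc, dv)

print("S_2(a,a) =", l2_distance(fa, fa, dv))  # identical fields give zero
print("S_2(a,b), S_2(b,c), S_2(a,c) =", d_ab, d_bc, d_ac)
print("triangle inequality holds:", abs(d_ab - d_bc) <= d_ac <= d_ab + d_bc)
```

For one-or-zero indicator fields, $\vert f-g\vert$ and $(f-g)^2$ coincide at every point, so $S_1$ equals $S_2^2$; the two measures only diverge for fields that take other values.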

Multiplying the terms in the integral out gives:


\begin{displaymath}
S_2^2 = \int f(x,y,z)^2dV + \int g(x,y,z)^2dV - 2\int f(x,y,z)g(x,y,z)dV
\end{displaymath} (1.4)

This is the fundamental equation for shape comparison. Writing $S_{f,g}$ for $S_2^2$, we rewrite it as:


\begin{displaymath}
S_{f,g} = I_f + I_g - 2O_{f,g}
\end{displaymath} (1.5)

The $I$ terms are the self-volume overlaps of each entity (for our purposes, a molecule), while the $O$ term is the overlap between the two functions. These are the three terms we need to calculate to compare the shapes of two fields. The $I$ terms are independent of orientation, but $O$ is not. Finding the orientation that maximizes $O$, and hence minimizes $S_{f,g}$, is equivalent to finding the best overlay between the two objects (a quantity that has its own, distinct metric properties). We also note here that the quantity referred to as a Tanimoto coefficient may be derived by recombining the $I$ terms and $O$ as follows:


\begin{displaymath}
\mbox{Tanimoto}_{f,g} = \frac{O_{f,g}}{I_f+I_g-O_{f,g}}
\end{displaymath} (1.6)
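To make Equations 1.5 and 1.6 concrete, the following sketch computes $I_f$, $I_g$ and $O_{f,g}$ on a grid for an ellipsoid and a copy of it rotated about the z axis, then scans a few trial angles for the one that maximizes the overlap. The ellipsoid fields, the grid, the coarse angle scan and every function name are assumptions chosen for illustration; this is not how any particular program performs its overlay optimization.

```python
import math

def ellipsoid(a, b, c, theta=0.0):
    """Indicator field of an ellipsoid with semi-axes a, b, c, rotated by theta about z."""
    ct, st = math.cos(theta), math.sin(theta)
    def f(x, y, z):
        u = ct * x + st * y    # point expressed in the ellipsoid's own frame
        v = -st * x + ct * y
        return 1.0 if (u / a) ** 2 + (v / b) ** 2 + (z / c) ** 2 <= 1.0 else 0.0
    return f

def sample(f, lo=-2.5, hi=2.5, n=30):
    """Sample f at the voxel centres of an n^3 grid; return (values, voxel volume)."""
    h = (hi - lo) / n
    pts = [lo + (i + 0.5) * h for i in range(n)]
    return [f(x, y, z) for x in pts for y in pts for z in pts], h ** 3

def overlap(fv, gv, dv):
    """Grid estimate of the integral of f * g; overlap(fv, fv, dv) is the self-overlap I_f."""
    return sum(a * b for a, b in zip(fv, gv)) * dv

def shape_distance(i_f, i_g, o_fg):
    """S_{f,g} = I_f + I_g - 2 O_{f,g} (Equation 1.5)."""
    return i_f + i_g - 2.0 * o_fg

def tanimoto(i_f, i_g, o_fg):
    """Shape Tanimoto O / (I_f + I_g - O) (Equation 1.6)."""
    return o_fg / (i_f + i_g - o_fg)

def scan_rotations(a, b, c, angles_deg):
    """Return (angle, S, Tanimoto) for a rotated copy overlaid on the unrotated ellipsoid."""
    fv, dv = sample(ellipsoid(a, b, c))
    i_f = overlap(fv, fv, dv)
    rows = []
    for deg in angles_deg:
        gv, _ = sample(ellipsoid(a, b, c, math.radians(deg)))
        i_g = overlap(gv, gv, dv)  # constant in the continuum, varies slightly on a grid
        o_fg = overlap(fv, gv, dv)
        rows.append((deg, shape_distance(i_f, i_g, o_fg), tanimoto(i_f, i_g, o_fg)))
    return rows

rows = scan_rotations(2.0, 1.0, 1.0, (0, 30, 60, 90))
best = max(rows, key=lambda row: row[2])
for deg, s, t in rows:
    print("angle", deg, "S =", s, "Tanimoto =", t)
print("best trial angle:", best[0])
```

Because the two ellipsoids are identical, the scan recovers the unrotated pose, where $S_{f,g}$ is zero and the Tanimoto is one. Since $I_f + I_g - O_{f,g} = S_{f,g} + O_{f,g}$, the Tanimoto falls as the shape distance grows.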

Tanimoto coefficients will be familiar to those who use them for bitvector fingerprint comparison. An alternative measure is the Tversky coefficient, also mostly used for similarity between bitvector fingerprints. Similarly to the Tanimoto coefficient above, we can define a shape Tversky measure. The base equation for the Tversky coefficient is:


\begin{displaymath}
\mbox{Tversky}_{f,g} = \frac{O_{f,g}}{\alpha I_f+\beta I_g}
\end{displaymath} (1.7)

Normally $\alpha + \beta = 1$, and for our current use $\alpha$ is chosen to be 0.95. Since this introduces an asymmetry, the Tversky value depends on which molecule's self-overlap carries the $\alpha$ pre-factor. ROCS calculates two Tversky values, one with $\alpha$ as the pre-factor of the query molecule's self-overlap and a second with $\alpha$ as the pre-factor of the database molecule's self-overlap. Also, note that since shape is a field property rather than a simple bitvector, shape Tversky can be larger than 1.0, since the overlap $O_{f,g}$ can be larger than a molecule's self-overlap, $I_f$.
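A small numerical sketch of Equation 1.7 shows both the asymmetry and how a value above 1.0 can arise. The function names and the input numbers are hypothetical: the numbers correspond to a field $g = 2f$, for which $I_g = 4I_f$ and $O_{f,g} = 2I_f$, and are not taken from any real molecule or from ROCS output.

```python
def tversky(o_fg, i_alpha, i_beta, alpha=0.95):
    """Equation 1.7 with beta = 1 - alpha.

    i_alpha is the self-overlap that carries the alpha pre-factor,
    i_beta the self-overlap that carries the beta pre-factor.
    """
    beta = 1.0 - alpha
    return o_fg / (alpha * i_alpha + beta * i_beta)

def tversky_pair(o_fg, i_query, i_database, alpha=0.95):
    """Return (alpha on the query self-overlap, alpha on the database self-overlap)."""
    return (tversky(o_fg, i_query, i_database, alpha),
            tversky(o_fg, i_database, i_query, alpha))

# Illustrative values only: database field g = 2f, so I_g = 4 I_f and O = 2 I_f.
i_query, i_database, o_fg = 10.0, 40.0, 20.0
query_weighted, database_weighted = tversky_pair(o_fg, i_query, i_database)

print("alpha on query:   ", query_weighted)     # about 1.74, larger than 1.0
print("alpha on database:", database_weighted)  # about 0.52
```

With one-or-zero indicator fields the overlap can never exceed either self-overlap, so values above 1.0 in this sketch require fields that take other values, as in the scaled example used here.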

The OpenEye Shape Toolkit is a set of computational objects designed to facilitate the calculation of these field-metric quantities. ROCS is an application built on top of the Shape Toolkit.