Molecules are traditionally viewed as a set of fused spheres, sometimes
referred to as the CPK model. The common view of molecular volume is then of a
characteristic function that is one (1) inside at least one sphere and zero
(0) outside. How do we calculate the volume of such a seemingly simple
function? The volume of a single sphere is
but the complication
for two fused spheres is that we have to account for the shared
volume and not count it twice. For more than two atoms, there are triple
intersections that must be added back in if we have removed the three
pairs of intersections. The general formula for N spheres that explicitly
calculates the volume of every level of overlap and it's correct contribution
is:
![]() |
(2.13) |
This is easy to write, not so easy to solve because the analytic formulae for overlaps of increasing order are highly non-trivial (although they have been derived to arbitrary order). It is fair to say that this has hindered the development of shape comparison in many ways. Attempts to use analytic formulae led to very slow programs and approximate methods, for instance using grids of points that are turned ``in'' or ``out'' by each sphere, do not give smooth gradients required for minimization. Brian Masek (AstraZeneca) was the first to attempt to optimize overlaps of molecules using the analytic approach. His program would take several minutes per minimization. In addition it would often suffer from a common problem when using functions that vary sharply (such as solid spheres): it would often get stuck in local mimina. Nevertheless, Brian did have encouraging success using this method to find similarities not obvious from chemical structure.
The conceptual breakthrough in shape comparison came in 1992 in a paper by
Andrew Grant (AstraZeneca) and Barry Pickup (University of Sheffield). They
showed that if one let go of the concept of the characteristic function being
binary, and instead use a sum of continuous functions, i.e.
a Gaussian, that the solid-sphere volume, could be recovered to high
accuracy (typically
0.1%). A sphere has one defining parameter,
its radius, whereas a Gaussian has two defining parameters, its prefactor,
, and its width,
:
| (2.14) |
Grant and Pickup found that by fixing
to 2.7 and setting
for each atom
such that the volume integral for each atom agreed with its solid-sphere
volume, they achieved remarkable precision. In addition, the
overlap terms between any two atoms, and hence any higher-order overlaps, are
all Gaussian functions themselves because of the Gaussian Contraction formula
(shown here for one spatial variable):
| (2.15) |
i.e. two atomic-gaussians overlap to produce another gaussian. Likewise, a three atomic-Gaussian overlap is that of an overlap-Gaussian with an atomic-Gaussian, hence another Gaussian. The simplicity of these formulae and the formula for the volume of each individual Gaussian leads to very efficient algorithms for the calculation of the volume of a molecule so represented (the OpenEye method calculates several thousand volumes per second while calculating intersections up to sixth order).
In addition to simple calculation of molecular volume, which is the zeroth-order
moment of the characteristic function, the ease of evaluation of intersections
allows for accurate calculation of high-order moments: what Andy calls the
``steric multipoles''. For instance, if the product formulae for atomic and
intersection Gaussians yields
Gaussians, the first order moments are:
![]() |
(2.16) | ||
![]() |
(2.17) | ||
![]() |
(2.18) |
These integrals are easy to solve and their sum can be set to zero by an appropriate choice of origin: the center of ``mass'' for the sum of Gaussians. Second-order moments are found from integrals of the type:
![]() |
(2.19) |
where P and Q are chosen from (x,y,z), e.g. x
, xy etc. These moments
can be thought of as a symmetric 3*3 matrix which we refer to as the ``mass
matrix''. Rotating or translating the molecule will change the moments and the
transform that sets the first-order moments to zero and diagonalizes the
mass-matrix puts the molecule into its ``inertial frame''. By convention we
assign the x-axis to the largest eigenvector of the mass matrix, y-axis to the
median eigenvector, z-axis to the smallest. Note that this orientation is still
not uniquely defined: 180 degree rotations around any axis also diagonalize the
mass-matrix. The eight (
) possible transforms that can be generated by
combinations of such rotations actually lead to four unique inertial
orientations.
If a molecule is aligned to its inertial frame, all higher-order steric multipoles become invariant, ignoring certain sign-changes from the four-fold degeneracy of the inertial frame. As such they, as well as the second-order moments, are shape ``descriptors''. They are still contractions of the information contained in the characteristic field, i.e. two molecules can have the same steric moments and yet have different shapes. (Moments are ``complete'' in that if we calculate them to infinite order they do exactly define volume but this is seldom a practical approach!) Nevertheless, they do contain useful information and can be used as a rapid, approximate, filter for shape similarity.
The same advantages that allow for the calculation of molecular volume carry over to the calculation of molecular volume overlap. The overlap of volumes are Gaussian contractions, easily tabulated and efficiently retrieved. Andy re-wrote Brian's program and obtained an order-of-magnitude improvement in performance as well as another remarkable observation: if the starting orientation of each molecule is that given from the inertial frame then very few "false" minima are produced. The smoothness of the Gaussian characteristic function is enough to overcome the problems with convergence in Brian's program. The four possible "inertial" starting points were enough to find the best, global, overlay between two molecules. This observation and the Gaussian approach are the basis of the OpenEye Shape Toolkit and ROCS program for rapid shape overlay.
But note, despite the algorithmic advantages, a correlation with common perception has been lost. Because the pre-factor of each atomic Gaussian is not unity, the characteristic function does not correspond to the inside/outside description with which we are most comfortable. In the Gaussian model all points in space are to some degree inside and to some degree outside. That is, the Gaussian model typically shows about 0.1% error with respect to the solid sphere model due to the fact that is includes a portion of all points in space inside the volume.