The maximum common substructure (henceforth MCS) of two molecular graphs can be
identified using the OEMCSSearch class.
The Listing 17.3 demonstrates how to initialize an OEMCSSearch object,
perform a maximum common substructure search, and then retrieve the matches.
1 #include "openeye.h"
2 #include "oechem.h"
3 #include "oeplatform.h"
4
5 using namespace OEPlatform;
6 using namespace OEChem;
7 using namespace OESystem;
8
9 int main()
10 {
11 OEGraphMol pattern,target;
12 OEParseSmiles(pattern, "c1cc(O)c(O)cc1CCN");
13 OEParseSmiles(target, "c1c(O)c(O)c(Cl)cc1CCCBr");
14
15 unsigned int atomexpr = OEExprOpts::DefaultAtoms;
16 unsigned int bondexpr = OEExprOpts::DefaultBonds;
17 // create maximum common substructre object
18 OEMCSSearch mcss(pattern,atomexpr,bondexpr,OEMCSType::Exhaustive);
19 // set scoring function
20 mcss.SetMCSFunc(OEMCSMaxAtoms());
21 // ignore matches smaller than 6 atoms
22 mcss.SetMinAtoms(6);
23
24 bool unique = true;
25 unsigned int count = 1;
26 OEIter<OEMatchBase> match;
27 // loop over matches
28 for (match = mcss.Match(target,unique);match;++match)
29 {
30 OEIter< OEMatchPair<OEAtomBase> > apr;
31 oeout << "Match " << count << ':' << oeendl;
32 oeout << "pattern atoms: ";
33 for (apr = match->GetAtoms();apr;++apr)
34 oeout << apr->pattern->GetIdx() << ' ';
35 oeout << oeendl;
36 oeout << "target atoms: ";
37 for (apr = match->GetAtoms();apr;++apr)
38 oeout << apr->target->GetIdx() << ' ';
39 oeout << oeendl;
40 //create match subgraph
41 OEGraphMol m;
42 OESubsetMol(m,match,true);
43 std::string smi;
44 OECreateSmiString(smi,m);
45 oeout << "match smiles = " << smi << oeendl;
46 ++count;
47 }
48 return 0;
49 }
The first molecule, pattern, is dopamine, and the second molecule,
target, is a dopamine analog.
The OEMCSSearch instance is initialized with dopamine,
atom and bond expressions for the maximum common substructure query, and the type
of the search (see line 18 in Listing 17.3).
The atom and bond expressions define criteria for atom and bond equivalence
used during the search and are defined in the OEExprOpts
namespace .
See Section 17.4 for more information.
An OEMCSSearch object can also be constructed from a SMARTS string directly.
In this case standard SMARTS matching rules apply for what constitutes a match.
If it is constructed with an OEQMolBase, then whatever atom and bond expressions
have been applied to the OEQMolBase will apply in the MCS search.
The last argument of the initialization defines the search type, either OEMCSType::Approximate or OEMCSType::Exhaustive. The difference between the two search types is detailed in Section 17.2.1. This argument is optional, if it is not specified, then the faster approximate method is employed.
During the search process, each identified common substructure is evaluated by
a scoring function and only substructures with the best score are retained.
The SetMCSFunc (line 20)
provides an ability to set the scoring function of an OEMCSSearch object,
thereby influence the result of the maximum common substructure search process.
See Section 17.2.2 for more information.
The OEMCSSearch::Match method (line 28) returns an
iterator over the maximum common substructure(s).
The OEMatchBase is then passed as an argument to the
OESubsetMol function (line 42),
and subsequently converted into a smiles string.
The detected maximum common substructure of the example program is depicted in Figure 17.1, the output of the program is shown below.
Match 1: pattern atoms: 0 1 2 3 4 5 6 7 8 9 target atoms: 7 5 3 4 1 2 0 8 9 10 match smiles = CCc1ccc(c(c1)O)O
|
The maximum common substructure search can perform unique or non-unique substructure searching changing the second argument of the OEMCSSearch::Match function (see line 28 in Listing 17.3). Th default in a non-unique search.
By definition, a match or subgraph is considered unique if it differs from all other subgraphs found previously by at least one atom or bond. Additionally, it is also considered unique if the pattern subgraph is mapped to a different part of the target.
Figure 17.2 shows an example in which the two matches are identified using the unique search method. Even though the two obtained subgraphs are identical, they represent different mappings between the pattern and the target, therefore they are both considered unique. Using a non-unique search would result in four matches, since the phenol can flip, yielding two additional matches.
|
The search space of a maximum common subgraph determination can be restricted
by constraining pairs of atoms or bonds to be mapped
onto one another in all subgraph solutions. This is done using the
OEMCSSSearch::AddConstraint method. Failure to satisfy atom or
bond pairwise constraints will prevent any subgraph solutions from
being identified. Constraints are considered satisfied in subgraphs
which do not contain any constrained atoms or bonds in either the
pattern or target molecules. The AddConstraint method returns
true if a constraint is added successfully. If the pattern atom or
bond in the OEMatchPair does not exist as part of the query molecule
created in the initialization of the OESubSearch object then
AddConstraint will return false. Multiple calls to AddConstraint using
the same pattern atom or bond will cause previously stored constraints
to be overwritten as constraints are mutually exclusive. It is
impossible to satisfy multiple simultaneous constraints for a single
pattern atom or bond, hence the exclusivity.
A read-only reference to the query molecule (OEQMol) contained
in an instance of OEMCSSearch can be obtained with the
GetPattern method. Note that if the OEMCSSearch was
constructed with an OEMolBase , the returned OEQMol is a
separate object. Const OEQMolBase methods can be used to interrogate
the returned OEQMolBase reference.
The SetMaxMatches method alters the maximum number of maximum
common subgraph matches that will be returned by the
OEMCSSearch::Match method. The search for maximum common
substructures will not terminate immediately upon reaching this limit.
The maximum common subgraph cannot be known unless the MCS is composed
of all atoms and bonds of at least one of the graphs being compared.
The limit of subgraphs to be returned may be reached with a smaller
subgraph than the maximum. In such a case the search continues for
larger subgraphs until the search is exhausted.
OEMCSSearch::Match will return the first N maximum common
subgraphs where N is less than or equal to the maximum match limit.
The default limit set upon construction of an OEMCSSearch
instance is 1024 matches.
The SetMinAtoms method sets the minimum number of atoms
required of a subgraph match to be returned by a MCS search.
For example, changing the parameter of SetMinAtoms in Listing 17.3
line 22 to 11 would result in no solution since there are only 10
atoms of the largest maximum common substructure (see Figure 17.1).
A single atom can be a perfectly valid maximum common
subgraph, however, for many applications such a small subgraph may not
be considered useful. Setting the minimum number of atoms to a useful
size prevents unproductive subgraph matches from being returned by the
OEMCSSearch::Match method.
The default set upon construction of an OEMCSSearch instance for the minimum number of atoms is one.
The maximum common substructure search can be performed in two different modes: a very fast method (OEMCSType::Approximate) or a more comprehensive method (OEMCSType::Exhaustive).
The type of the OEMCSSearch can be set at initialization. The default
is OEMCSType::Approximate.
The approximate method is based on traversing through pre-defined paths of the query structure and trying to map the visited query atoms into target atoms. Because these pre-defined paths represent only a fraction of all possible paths of a compound, it is not guaranteed that the approximate method can find the largest and all common substructures. Significant difference of the detected matches of the two methods could exist in cases when the query or target structure contains complex ring systems (fused or bridged) or stereo centers. However, comparing the two methods for thousands of structures revealed that these cases are rare and the approximate method provides a good trade-off between identifying MCS matches accurately and doing it 3-6 times faster than the exhaustive method.
Figure 17.3 and Figure 17.4 shows an example where the substructure identified by the approximate method is smaller by one atom then the solution identified by the exhaustive method.
|
|
Using the approximate MCS is recommended if the speed of the search is crucial.
OEMCSFunc is an abstract base class that defines the API used
for scoring subgraph matches. The scores generated by implementations
of OEMCSFunc influence the sorting and retention of maximum
common subgraph matches generated by the OEMCSSearch class.
It is important to mention that using different scoring functions does not alter the way the search space is traversed to identify common substructures. It effects only how these identified substructures are evaluated.
Four implementations of the OEMCSFunc class are available in OEChem:
OEMCSMaxAtoms (See example in Figure 17.5)
The OEMCSMaxAtoms class is designed to order maximum common
substructure matches by the maximum number of atoms included in the graph match.
If two matches have the same number of atoms, then te tie is split based
on the number of bonds contained in the match.
Scoring function: number of mapped atoms + number of mapped bonds/100
|
OEMCSMaxBonds (See example in Figure 17.6)
The OEMCSMaxBonds class is designed to order maximum common
substructure matches by the maximum number of bonds included in the graph match.
If two matches have the same number of bonds, then the tie is split based on the
number of atoms contained in the match.
Scoring function: number of mapped bonds + number of mapped atoms/100
|
OEMCSMaxAtomsCompleteCycles (See example in Figure 17.7)
The OEMCSMaxAtomsCompleteCycles class
is the same as the OEMCSMaxAtoms with the addition of penalizing cyclic
query bonds that are not mapped to any target bonds, thereby giving priority
to matches which contain complete cycles common to both the pattern and the target
structure.
Scoring function: number of mapped atoms + nrumber of mapped bonds/100 - penalty
number. of unmapped cyclic query bonds
|
The default penalty for each unmapped cyclic query bond is 1.0.
OEMCSMaxBondsCompleteCycles (See example in Figure 17.8)
The OEMCSMaxBondsCompleteCycles class
is the same as the OEMCSMaxBonds with the addition of penalizing cyclic
query bonds that are not mapped to any target bonds, thereby giving priority
to matches which contain complete cycles common to both the pattern and the target
structure.
Scoring function: number of mapped bonds + number of mapped atoms/100 - penalty
number of unmapped cyclic query bonds
|
The default scoring function, OEMCSMaxAtoms, can be changed
using the SetMCSFunc method of the OEMCSSearch class.
The scoring function method is called automatically by the
OEMCSSearch class when a common subgraph of two molecules is
identified. The pattern (query) molecule and target molecule currently
being matched are passed as the first and second arguments to the
function. Arrays of pointers to atoms and bonds hold the atom and bond
correspondences between the pattern and target. The arrays are the
length of the maximum atom and bond indices of the pattern molecule.
The indices of the atoms and bonds in the pattern molecule can be used
to look up the corresponding atoms and bonds in the target molecule.
Subgraphs may not include all pattern atoms. Array positions for
unmatched atoms and bonds are assigned to the hex value 0x0.
The integer part of the double precision floating point value returned
by the method is used to determine maximal common subgraphs. All
integer part scores which are smaller than the maximum computed value
for any subgraph are discarded by OEMCSSearch. The decimal part
of the floating point value returned by the method is used to sort the
matches found by OEMCSSearch. For example, by scoring matches
using the function (number of atoms + (number of bonds/100)), all
matches which have the same number of subgraph atoms would be retained
but the matches would be returned in order of decreasing number of
bonds matched.
It is important to mention, that not only matches with the highest score are
retained, but also matches with scores higher than the best score rounded down
to the highest integer.
In the example shown in Figure 17.6 three common
substructures are detected using the OEMCSMaxBonds scoring functions.
The first two matches are scored 5.06, since they composed of 5 mapped bonds
and 6 mapped atoms.
There is only one other matche which scored higher than 5.0, this
is the third retained match with a 5.05 score.