Subsections

 
19.2 Maximum Common Substructure Search

The maximum common substructure (henceforth MCS) of two molecular graphs can be identified using the OEMCSSearch class. The Listing 19.3 demonstrates how to initialize an OEMCSSearch object, perform a maximum common substructure search, and then retrieve the matches.

 1 import openeye.oechem.*;
 2
 3 public class MaximumCommonSS
 4 {
 5   public static void main(String argv[])
 6   {
 7     OEGraphMol pattern = new OEGraphMol();
 8     oechem.OEParseSmiles(pattern, "c1cc(O)c(O)cc1CCN");
 9     OEGraphMol target = new OEGraphMol();
10     oechem.OEParseSmiles(target,  "c1c(O)c(O)c(Cl)cc1CCCBr");
11
12     int atomexpr = OEExprOpts.DefaultAtoms;
13     int bondexpr = OEExprOpts.DefaultBonds;
14     /* create maximum common substructre object */
15     OEMCSSearch mcss = new OEMCSSearch(pattern,atomexpr,bondexpr,OEMCSType.Exhaustive);
16     /* set scoring function */
17     mcss.SetMCSFunc(new OEMCSMaxAtoms());
18     /* ignore matches smaller than 6 atoms */
19     mcss.SetMinAtoms(6);
20
21     boolean unique = true;
22     int count = 1;
23     /* loop over matches */
24     for (OEMatchBaseIter mi = mcss.Match(target,unique); mi.hasNext();)
25     {
26       OEMatchBase match = mi.next();
27       System.out.println("Match "+count+ " :");
28       System.out.print("pattern atoms: ");
29       OEMatchPairAtomIter ma = match.GetAtoms();
30       for( ;ma.hasNext(); )
31         System.out.print(ma.next().getPattern().GetIdx()+" ");
32       System.out.println();
33       ma.ToFirst();
34       System.out.print("target atoms:  ");
35       for( ;ma.hasNext(); )
36         System.out.print(ma.next().getTarget().GetIdx()+" ");
37       System.out.println();
38       //create match subgraph
39       OEGraphMol m = new OEGraphMol();
40       oechem.OESubsetMol(m,match,true);
41       System.out.println("match smiles = "+oechem.OECreateCanSmiString(m));
42       count++;
43     }
44   }
45 }

Listing:19.3 Maximum common substructure search example

The first molecule, pattern, is dopamine, and the second molecule, target, is a dopamine analog. The OEMCSSearch instance is initialized with dopamine, atom and bond expressions for the maximum common substructure query, and the type of the search (see line 15 in Listing 19.3).

The atom and bond expressions define criteria for atom and bond equivalence used during the search and are defined in the OEExprOpts namespace . See Section 19.4 for more information.

An OEMCSSearch object can also be constructed from a SMARTS string directly. In this case standard SMARTS matching rules apply for what constitutes a match. If it is constructed with an OEQMolBase, then whatever atom and bond expressions have been applied to the OEQMolBase will apply in the MCS search.

The last argument of the initialization defines the search type, either OEMCSType.Approximate or OEMCSType.Exhaustive. The difference between the two search types is detailed in Section 19.2.1. This argument is optional, if it is not specified, then the faster approximate method is employed.

During the search process, each identified common substructure is evaluated by a scoring function and only substructures with the best score are retained. The SetMCSFunc (line 17) provides an ability to set the scoring function of an OEMCSSearch object, thereby influence the result of the maximum common substructure search process. See Section 19.2.2 for more information.

The OEMCSSearch.Match method (line 24) returns an iterator over the maximum common substructure(s). The OEMatchBase is then passed as an argument to the OESubsetMol function (line 40), and subsequently converted into a smiles string.

The detected maximum common substructure of the example program is depicted in Figure 19.1, the output of the program is shown below.

Match 1:
pattern atoms: 0 1 2 3 4 5 6 7 8 9
target atoms:  7 5 3 4 1 2 0 8 9 10
match smiles = CCc1ccc(c(c1)O)O

Figure 19.1: The maximum common substructure (highlighted by red) of dopamine and dopamine analog
 
1365

The maximum common substructure search can perform unique or non-unique substructure searching changing the second argument of the OEMCSSearch.Match function (see line 24 in Listing 19.3). Th default in a non-unique search.

By definition, a match or subgraph is considered unique if it differs from all other subgraphs found previously by at least one atom or bond. Additionally, it is also considered unique if the pattern subgraph is mapped to a different part of the target.

Figure 19.2 shows an example in which the two matches are identified using the unique search method. Even though the two obtained subgraphs are identical, they represent different mappings between the pattern and the target, therefore they are both considered unique. Using a non-unique search would result in four matches, since the phenol can flip, yielding two additional matches.

Figure 19.2: Example for unique maximum common substructures
 
1378

The search space of a maximum common subgraph determination can be restricted by constraining pairs of atoms or bonds to be mapped onto one another in all subgraph solutions. This is done using the OEMCSSSearch.AddConstraint method. Failure to satisfy atom or bond pairwise constraints will prevent any subgraph solutions from being identified. Constraints are considered satisfied in subgraphs which do not contain any constrained atoms or bonds in either the pattern or target molecules. The AddConstraint method returns true if a constraint is added successfully. If the pattern atom or bond in the OEMatchPair does not exist as part of the query molecule created in the initialization of the OESubSearch object then AddConstraint will return false. Multiple calls to AddConstraint using the same pattern atom or bond will cause previously stored constraints to be overwritten as constraints are mutually exclusive. It is impossible to satisfy multiple simultaneous constraints for a single pattern atom or bond, hence the exclusivity.

A read-only reference to the query molecule (OEQMol) contained in an instance of OEMCSSearch can be obtained with the GetPattern method. Note that if the OEMCSSearch was constructed with an OEMolBase , the returned OEQMol is a separate object. Const OEQMolBase methods can be used to interrogate the returned OEQMolBase reference.

The SetMaxMatches method alters the maximum number of maximum common subgraph matches that will be returned by the OEMCSSearch.Match method. The search for maximum common substructures will not terminate immediately upon reaching this limit. The maximum common subgraph cannot be known unless the MCS is composed of all atoms and bonds of at least one of the graphs being compared. The limit of subgraphs to be returned may be reached with a smaller subgraph than the maximum. In such a case the search continues for larger subgraphs until the search is exhausted. OEMCSSearch.Match will return the first N maximum common subgraphs where N is less than or equal to the maximum match limit. The default limit set upon construction of an OEMCSSearch instance is 1024 matches.

The SetMinAtoms method sets the minimum number of atoms required of a subgraph match to be returned by a MCS search. For example, changing the parameter of SetMinAtoms in Listing 19.3 line 19 to 11 would result in no solution since there are only 10 atoms of the largest maximum common substructure (see Figure 19.1).

A single atom can be a perfectly valid maximum common subgraph, however, for many applications such a small subgraph may not be considered useful. Setting the minimum number of atoms to a useful size prevents unproductive subgraph matches from being returned by the OEMCSSearch.Match method. The default set upon construction of an OEMCSSearch instance for the minimum number of atoms is one.

 
19.2.1 Exhaustive and approximate MCSS

The maximum common substructure search can be performed in two different modes: a very fast method (OEMCSType.Approximate) or a more comprehensive method (OEMCSType.Exhaustive).

The type of the OEMCSSearch can be set at initialization. The default is OEMCSType.Approximate.

The approximate method is based on traversing through pre-defined paths of the query structure and trying to map the visited query atoms into target atoms. Because these pre-defined paths represent only a fraction of all possible paths of a compound, it is not guaranteed that the approximate method can find the largest and all common substructures. Significant difference of the detected matches of the two methods could exist in cases when the query or target structure contains complex ring systems (fused or bridged) or stereo centers. However, comparing the two methods for thousands of structures revealed that these cases are rare and the approximate method provides a good trade-off between identifying MCS matches accurately and doing it 3-6 times faster than the exhaustive method.

Figure 19.3 and Figure 19.4 shows an example where the substructure identified by the approximate method is smaller by one atom then the solution identified by the exhaustive method.

Figure 19.3: Example for maximum common substructure identified by the approximate method
 
1432

Figure 19.4: Example for maximum common substructure identified by the exhaustive method
 
1441

Using the approximate MCS is recommended if the speed of the search is crucial.

 
19.2.2 MCS scoring functions: OEMCSFunc

OEMCSFunc is an abstract base class that defines the API used for scoring subgraph matches. The scores generated by implementations of OEMCSFunc influence the sorting and retention of maximum common subgraph matches generated by the OEMCSSearch class.

It is important to mention that using different scoring functions does not alter the way the search space is traversed to identify common substructures. It effects only how these identified substructures are evaluated.

Four implementations of the OEMCSFunc class are available in OEChem:

  1. OEMCSMaxAtoms (See example in Figure 19.5)

    The OEMCSMaxAtoms class is designed to order maximum common substructure matches by the maximum number of atoms included in the graph match. If two matches have the same number of atoms, then te tie is split based on the number of bonds contained in the match.

    Scoring function: number of mapped atoms + number of mapped bonds/100

    Figure 19.5: Retrieved matches using OEMCSMaxAtoms as scoring function
     
    1463

  2. OEMCSMaxBonds (See example in Figure 19.6)

    The OEMCSMaxBonds class is designed to order maximum common substructure matches by the maximum number of bonds included in the graph match. If two matches have the same number of bonds, then the tie is split based on the number of atoms contained in the match.

    Scoring function: number of mapped bonds + number of mapped atoms/100

    Figure 19.6: Retrieved matches using OEMCSMaxBonds as scoring function
     
    1476

  3. OEMCSMaxAtomsCompleteCycles (See example in Figure 19.7)

    The OEMCSMaxAtomsCompleteCycles class is the same as the OEMCSMaxAtoms with the addition of penalizing cyclic query bonds that are not mapped to any target bonds, thereby giving priority to matches which contain complete cycles common to both the pattern and the target structure.

    Scoring function: number of mapped atoms + nrumber of mapped bonds/100 - penalty $\times$ number. of unmapped cyclic query bonds

    Figure 19.7: Retrieved matches using OEMCSMaxAtomsCompleteCycles as scoring function
     
    1490

    The default penalty for each unmapped cyclic query bond is 1.0.

  4. OEMCSMaxBondsCompleteCycles (See example in Figure 19.8)

    The OEMCSMaxBondsCompleteCycles class is the same as the OEMCSMaxBonds with the addition of penalizing cyclic query bonds that are not mapped to any target bonds, thereby giving priority to matches which contain complete cycles common to both the pattern and the target structure.

    Scoring function: number of mapped bonds + number of mapped atoms/100 - penalty $\times$ number of unmapped cyclic query bonds

    Figure 19.8: Retrieved match using OEMCSMaxBondsCompleteCycles as scoring function
     
    1504
    The default penalty for each unmapped cyclic query bond is 1.0.

The default scoring function, OEMCSMaxAtoms, can be changed using the SetMCSFunc method of the OEMCSSearch class.

The scoring function method is called automatically by the OEMCSSearch class when a common subgraph of two molecules is identified. The pattern (query) molecule and target molecule currently being matched are passed as the first and second arguments to the function. Arrays of pointers to atoms and bonds hold the atom and bond correspondences between the pattern and target. The arrays are the length of the maximum atom and bond indices of the pattern molecule. The indices of the atoms and bonds in the pattern molecule can be used to look up the corresponding atoms and bonds in the target molecule. Subgraphs may not include all pattern atoms. Array positions for unmatched atoms and bonds are assigned to the hex value 0x0.

The integer part of the double precision floating point value returned by the method is used to determine maximal common subgraphs. All integer part scores which are smaller than the maximum computed value for any subgraph are discarded by OEMCSSearch. The decimal part of the floating point value returned by the method is used to sort the matches found by OEMCSSearch. For example, by scoring matches using the function (number of atoms + (number of bonds/100)), all matches which have the same number of subgraph atoms would be retained but the matches would be returned in order of decreasing number of bonds matched. It is important to mention, that not only matches with the highest score are retained, but also matches with scores higher than the best score rounded down to the highest integer. In the example shown in Figure 19.6 three common substructures are detected using the OEMCSMaxBonds scoring functions. The first two matches are scored 5.06, since they composed of 5 mapped bonds and 6 mapped atoms. There is only one other matche which scored higher than 5.0, this is the third retained match with a 5.05 score.