5.2 Frequency-based Optimizations

Most SMARTS matcher implementation use recursive depth-first backtracking matching strategy of comparing a query pattern against a target molecule. This matching strategy causes the performance of the pattern matcher to be dependent upon the input ordering or atom and bond expressions in a SMARTS pattern. Therefore SMARTS optimizer may improve performance by reordering and restructuring atom and bond expressions in a substructure query, using estimated atom and bond frequencies.

To speed up the matching of individual atom and bond expressions, the terms and factors can be permuted such that the evaluation of an expression completes earlier. For AND operators, if the first operand is false, the result will be false without testing the second operand. Hence for AND expressions, the subexpression that is least likely to match should be ordered first. Similarly, for OR operators, if the first operand is true, the result will be true without having to test the second operand. Hence for OR expressions, the most likely subexpression should appear first.

SMACK's SMARTS optimizers base their ``a priori'' probabilities upon an analysis of the National Cancer Institute's August 2000 compound database, NCI00, which contains 250,251 molecules. The probabilities for SMARTS bond primitives are based upon the statistics in the table below:

Expression Description Prob
!@- Acyclic single bond 35.20%
!@= Acyclic double bond 7.37%
!@# Acyclic triple bond 0.31%
@- Cyclic single bond 17.84%
@= Cyclic double bond 2.21%
@# Cyclic triple bond 0.00%
: Aromatic bond 37.07%

From which the following probability values are derived:

Expression Description Prob
- Single bond 53.04%
= Double bond 9.58%
# Triple bond 0.31%
@ Cyclic (ring) bond 57.12%
!@ Acyclic bond 42.88%

The probabilities for SMARTS atom primitives are based upon the statistics in the table below:

Expression Description Prob
#5 Boron atom 0.03%
#6 Carbon atom 73.04%
#7 Nitrogen atom 9.07%
#8 Oxygen atom 13.51%
#9 Fluorine atom 0.65%
#11 Sodium atom 0.01%
#14 Silicon atom 0.04%
#15 Phosphorus atom 0.20%
#16 Sulfur atom 1.45%
#17 Chlorine atom 1.38%
#26 Iron atom 0.02%
#27 Cobalt atom 0.02%
#28 Nickel atom 0.01%
#29 Copper atom 0.02%
#30 Zinc atom 0.01%
#32 Germanium atom 0.01%
#33 Arsenic atom 0.02%
#34 Selenium atom 0.01%
#35 Bromine atom 0.30%
#46 Palladium atom 0.01%
#50 Tin atom 0.03%
#53 Iodine atom 0.06%
#74 Tungsten atom 0.11%
#78 Platinum atom 0.03%
#80 Mercury atom 0.01%

Expression Description Prob
* Any atom (wildcard) 100.00%
!* No atom 0.00%
a Aromatic atom 38.91%
A Aliphatic atom 61.09%
R Cyclic atom 57.25%
R0 Acyclic atom 42.75%
-1 Formal Charge = -1 0.72%
+0 Formal Charge = 0 98.15%
+1 Formal Charge = +1 1.09%
+2 Formal Charge = +2 0.02%
+3 Formal Charge = +3 0.01%
H0 Total HCount = 0 43.78%
H1 Total HCount = 1 34.56%
H2 Total HCount = 2 14.23%
H3 Total HCount = 3 7.42%
h0 Implicit HCount = 0 43.78%
h1 Implicit HCount = 1 34.56%
h2 Implicit HCount = 2 14.23%
h3 Implicit HCount = 3 7.42%
X0 Total Bonds = 0 0.02%
X1 Total Bonds = 1 9.76%
X2 Total Bonds = 2 10.51%
X3 Total Bonds = 3 52.08%
X4 Total Bonds = 4 27.49%
X5 Total Bonds = 5 0.05%
X6 Total Bonds = 6 0.06%
D0 Explicit Bonds = 0 0.07%
D1 Explicit Bonds = 1 20.83%
D2 Explicit Bonds = 2 48.28%
D3 Explicit Bonds = 3 28.38%
D4 Explicit Bonds = 4 2.33%
D5 Explicit Bonds = 5 0.02%
D6 Explicit Bonds = 6 0.06%
v0 Valence = 0 0.02%
v1 Valence = 1 2.97%
v2 Valence = 2 13.78%
v3 Valence = 3 8.43%
v4 Valence = 4 74.04%
v5 Valence = 5 0.22%
v6 Valence = 6 0.50%