5.1 Boolean Optimizations

The composite atom and bond expressions used in SMARTS patterns are built up using standard Boolean operators, and can be simplified using standard algebraic techniques. This class of optimizations are completely independent of the form of the leaf nodes, and apply equally well to compiling source code as they do to simplifying substructure queries in chemoinformatics.

The first observation is that the high-precedence and low-precedence "and" operators in SMARTS, are just a syntactic convenience to allow representation and parsing of simple queries without requiring parenthesis. Internally, SMACK represents both ANDs identically, the precedence being implied by the abstract syntax tree (AST). The SMACK SMARTS parser ensures that the high and low precedence operators are parsed into the appropriately nested internal representation.

Internally, SMACK canonicalizes all Boolean expressions to a representation called ``disjunctive normal form''. In this representation, any expression tree is represented as the disjunction (or ``OR'') of a set of terms (possibly containing only a single term), where each term is the conjunction (or ``AND'') of a set of factors (possibly containing only a single factor), where each factor is either a leaf node, or the inversion (or ``NOT'') of a leaf node. Any Boolean expression can be reduced to disjunctive normal form, by the following simple transformation rules.

    not (not (x))  ->  x
    not (and (x, y))  ->  (or (not x) (not y))
    not (or (x, y))  ->  (and (not x) (not y))
    and (or (x, y), z)  ->  or (and (x, z), and (y, z))
    and (x, or (y, z))  ->  or (and (x, y), and (x, z))

In this manner consecutive NOTs can be eliminated, so that a NOT should never contain another NOT as its operand, a NOT can be pushed inside an AND or an OR, and an AND can always be pushed inside an OR.

Similar transformations allow one to arbitrarily reorder factors and terms, using associativity and commutativity rules, to maintain canonical Boolean expressions by maintaining terms and factors in "sorted" order.

    and (x , y)  ==  and (y, x)
    and (and (x, y), z)  == and (x, and (y, z))
    or (x, y)  ==  or (y, x)
    or (or (x, y), z)  ==  or (x, or (y, z))

Finally Boolean optimizations of the binary AND and OR operators allow our sets of "terms" and "factors" to never contain duplicates.

    and (x, x)  ->  x
    or (x, x)  ->  x