|
SMIRKS Primer
An introduction to SMIRKS and OEChem for reaction-based chemoinformatics algorithms.
-
History
-
The SMIRKS language was invented by David Weininger and Jack Delany of
Daylight Chemical Information Systems, and derived from the SMILES and
SMARTS languages for molecules and subgraph pattern matching,
respectively. SMIRKS is largely consistent in syntax and semantics with
SMILES and SMARTS but there are key differences. A SMIRKS represents a
"reaction transform", which, given a set of reactants, may match and
result in a set of products. A SMIRKS may or may not represent the
mechanism of the reaction. It may also represent general molecular
graph modifications that have no relation to real chemistry.
Atoms may be created or destroyed or change their atomic number.
-
Uses of SMIRKS
-
- Generating virtual combinatorial libraries from fragments
- Fragmentation
- Standardization/correction of valence models
- Evolutionary algorithms
- Representing reaction types for expert systems (e.g. retrosynthetic
analysis)
- Generic molecular manipulation
- Rigorous, reusable encoding of algorithmic intelligence
-
Why SMIRKS?
-
Like SMILES and SMARTS, SMIRKS is a compact line notation with
rigorous, documented and well understood syntax and semantics.
Software that supports these languages accurately provides added benefits
to users through interoperability and the use of shared linguistic and
conceptual chemoinformatics standards (inter-comprehensibility).
-
Basic syntax
-
A SMIRKS is a set of one or more dot disconnected reactants,
followed by ">>", followed by a set of one or more dot disconnected
products. Reactants and products are expressed in a format based very
closely on SMILES and SMARTS, and composed only of elements of these
languages. Several points can be illustrated by the following example:
[C&X4&H3:1][H].[O&X2&H1:2][H]>>[C:1]-[O:2]
- atom mapping:
- Atom mapping specifies correspondence between reactant and
product atoms. Atom map indices are integers following the
colon in an atom specification.
- unmapped atoms:
- In this example the two "[H]" hydrogens are unmapped.
Unmapped reactant atoms are destroyed by the reaction.
Unmapped product atoms are created by the reaction.
- reactant smarts:
- The reactant "C&X4&H3" specification is regarded as a smarts
which must match a carbon atom with exactly four attached
atoms of which three are hydrogens.
- hydrogens as atoms and properties:
- Hydrogens are treated both as atoms and as a property of heavy
atoms (the H-count). So the explicit methyl hydrogen atom
"[H]" is one of the three hydrogens counted by "H3", not a fourth.
- no need to match products:
- In this case the reaction bonds two atoms and destroys two
atoms. There is no need to use smarts to match the products.
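The points above can be checked mechanically on the SMIRKS string itself. The following Python sketch is plain string handling, not a chemistry toolkit: it assumes simple bracket atoms only (no recursive SMARTS and no component-level grouping), and the helper names split_smirks, map_indices and unmapped_atoms are hypothetical, introduced here for illustration.

```python
import re

# Content of a bracket atom, e.g. "C&X4&H3:1" from "[C&X4&H3:1]".
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# Atom map index: the integer after the final colon in the atom.
MAP_INDEX = re.compile(r":(\d+)$")


def split_smirks(smirks):
    """Split a SMIRKS into lists of reactant and product components."""
    reactant_part, product_part = smirks.split(">>")
    return reactant_part.split("."), product_part.split(".")


def map_indices(component):
    """Return the set of atom map indices used in one component."""
    indices = set()
    for body in BRACKET_ATOM.findall(component):
        match = MAP_INDEX.search(body)
        if match:
            indices.add(int(match.group(1)))
    return indices


def unmapped_atoms(component):
    """Return the bracket atoms of one component that carry no map index."""
    return [body for body in BRACKET_ATOM.findall(component)
            if not MAP_INDEX.search(body)]


smirks = "[C&X4&H3:1][H].[O&X2&H1:2][H]>>[C:1]-[O:2]"
reactants, products = split_smirks(smirks)

# Unmapped reactant atoms are the ones the transform destroys.
destroyed = [atom for r in reactants for atom in unmapped_atoms(r)]
# Unmapped product atoms are the ones the transform creates.
created = [atom for p in products for atom in unmapped_atoms(p)]

print(reactants)   # ['[C&X4&H3:1][H]', '[O&X2&H1:2][H]']
print(products)    # ['[C:1]-[O:2]']
print(destroyed)   # ['H', 'H']
print(created)     # []
```

For the example, both unmapped "[H]" atoms are reported as destroyed and nothing is created, which matches the description of the transform as bonding two atoms and destroying two atoms.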
-
OpenEye extensions to SMIRKS
-
Note that OEChem allows some smirks extensions beyond the Daylight
standard. Mostly these correspond with the strictSmirks flag used
when initializing OEUniMolecularRxn and OELibraryGen objects, as
described in the API manual entries for these classes.
Use of the Daylight standard may be preferable in many
cases. One reason is compatibility, since users may wish to use both
software packages, now or in future, or other smirks-capable
software. Another reason is pedagogical: smirks is tricky enough
to learn without added complications, and the extensions are easier
to appreciate once some experience has shown why they are
needed. From here on, "standard smirks" refers to the Daylight
smirks standard.
-
When should I use explicit hydrogens in SMIRKS?
-
If hydrogens are removed from a heavy atom, those hydrogens should be
specified explicitly in the reactant. These hydrogens may or may not
be mapped to explicit hydrogens in the product(s). If they are not
present in the product, they have been deleted by the transform.
-
Where can SMARTS be used in SMIRKS?
-
SMARTS atom specifications can be used for mapped atoms only. With standard
smirks and a forward transform, SMARTS are relevant only on the reactant side.
This does not hold for reverse transforms, which are implemented by Daylight
but not by OEChem. It also does not hold when OEChem is used to modify the
implicit H count with the "h" smarts primitive, which is an OE smirks
extension.
-
How can I preserve or change atom properties at a mapped atom?
-
Properties of mapped atoms are preserved by default in OEChem's
implementation of SMIRKS. (This is a difference between OpenEye and
Daylight.) The properties concerned are charge, stereo and mass. To
change a property, it must be specified in the products. Note that
implicit hydrogen count is also an atom property in most contexts.
However, with standard smirks, hydrogens are handled as explicit atoms.
-
Neutralizing an atom
-
To set the charge to zero, "+0" or "-0" should be specified explicitly
in the products as follows:
[*:1][N+:2](=[O:3])[O-:4]>>[*:1][N+0:2](=[O:3])=[O+0:4]
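To see which mapped atoms have their charge restated in the products, the SMIRKS text can be inspected directly. The Python sketch below is string handling only and makes several assumptions: simple bracket atoms, a single sign optionally followed by a digit count (so "++" is not handled), and no recursive SMARTS. The function names explicit_charges and charge_changes are hypothetical.

```python
import re

# A bracket atom with an optional map index: body in group 1, index in group 2.
ATOM = re.compile(r"\[([^\]:]*)(?::(\d+))?\]")
# A charge written as a sign with an optional magnitude, e.g. "+", "-", "+0".
CHARGE = re.compile(r"([+-])(\d*)")


def explicit_charges(side):
    """Map each atom map index to the charge written explicitly on that atom."""
    charges = {}
    for body, index in ATOM.findall(side):
        if not index:
            continue  # unmapped atom, nothing to correlate
        match = CHARGE.search(body)
        if match:
            sign = 1 if match.group(1) == "+" else -1
            magnitude = int(match.group(2)) if match.group(2) else 1
            charges[int(index)] = sign * magnitude
    return charges


def charge_changes(smirks):
    """Return {map index: (reactant charge, product charge)} for every
    mapped atom whose charge is specified on the product side."""
    reactant_side, product_side = smirks.split(">>")
    before = explicit_charges(reactant_side)
    after = explicit_charges(product_side)
    return {index: (before.get(index), after[index]) for index in sorted(after)}


nitro = "[*:1][N+:2](=[O:3])[O-:4]>>[*:1][N+0:2](=[O:3])=[O+0:4]"
print(charge_changes(nitro))  # {2: (1, 0), 4: (-1, 0)}
```

Atoms 1 and 3 do not appear in the result because no charge is written for them in the products, so under the preservation rule described above their properties are left as they were.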
-
OELibraryGen vs. OEUniMolecularRxn
-
OELibraryGen and OEUniMolecularRxn are two OEChem classes which implement
smirks within the overall OEChem molecular object model. As implied by
the name, OEUniMolecularRxn handles only single-reactant smirks and
reactions.
With OELibraryGen, the explicit-H property
(Get/SetExplicitHydrogens(bool)) is set to true by default, which is
consistent with standard smirks behavior. OEUniMolecularRxn has no such
setting or method, so the equivalent can only be achieved externally, by
making the hydrogens of the input molecule explicit or implicit.
OELibraryGen applies smirks transforms one time when the GetProducts() method
is called. In contrast, OEUniMolecularRxn applies transforms exhaustively,
that is, the transform is applied to products iteratively until no match
is found.
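The difference between the two application strategies can be sketched without any toolkit. In the Python below, a "transform" is a hypothetical callable that returns the modified molecule, or None when nothing matches. The toy transform is a plain string replacement on a SMILES, chosen only to make the control flow visible; it is not how OEChem matches or edits molecules.

```python
def apply_once(transform, molecule):
    """Apply the transform a single time (the one-pass strategy)."""
    result = transform(molecule)
    return molecule if result is None else result


def apply_exhaustively(transform, molecule, max_rounds=100):
    """Re-apply the transform to its own product until nothing matches."""
    for _ in range(max_rounds):
        result = transform(molecule)
        if result is None:
            return molecule
        molecule = result
    # Guard against transforms whose product always matches again.
    raise RuntimeError("transform did not converge")


def toy_nitro_transform(smiles):
    """Toy stand-in for a smirks: rewrite the first charge-separated nitro."""
    old, new = "[N+](=O)[O-]", "N(=O)=O"
    return smiles.replace(old, new, 1) if old in smiles else None


dinitro = "c1cc(ccc1[N+](=O)[O-])[N+](=O)[O-]"
print(apply_once(toy_nitro_transform, dinitro))
# c1cc(ccc1N(=O)=O)[N+](=O)[O-]
print(apply_exhaustively(toy_nitro_transform, dinitro))
# c1cc(ccc1N(=O)=O)N(=O)=O
```

The max_rounds guard is a design choice of this sketch: a transform whose product matches its own reactant pattern would otherwise loop forever under the exhaustive strategy.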
-
OELibraryGen, reactant numbers and count
-
An OELibraryGen instance is initialized with a smirks, and reactants are
specified using the SetStartingMaterial() method. Unlike Daylight's
reaction toolkit, OELibraryGen requires that reactants be specified
by number, and the number must correspond to the reactant's lexical
position in the input smirks. Exactly the correct number of reactants
must be specified. If a user wishes to apply a smirks to a "soup"
of reactants, only some of which may be involved, the combinatorics
must be coded separately from OELibraryGen.
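One way to code the combinatorics for a soup is to enumerate ordered assignments of soup members to reactant positions and keep only those where every member fits its position. The Python sketch below assumes a hypothetical callback matches(slot, molecule) that would, in real use, test a molecule against the reactant pattern at that position; here it is replaced by a crude substring test on SMILES so the example runs on its own. It also assumes one molecule may not fill two positions at once.

```python
from itertools import permutations


def reactant_assignments(soup, n_reactants, matches):
    """Yield tuples of soup members, one per reactant position, in the
    lexical order of the smirks, keeping only fully matching tuples."""
    for combo in permutations(range(len(soup)), n_reactants):
        if all(matches(slot, soup[i]) for slot, i in enumerate(combo)):
            yield tuple(soup[i] for i in combo)


def toy_matches(slot, smiles):
    """Toy stand-in for a substructure match (assumption, not chemistry):
    position 0 wants a carboxylic acid, position 1 wants a nitrogen."""
    required = {0: "C(=O)O", 1: "N"}
    return required[slot] in smiles


soup = ["CC(=O)O", "CN", "CCC", "NCC"]
assignments = list(reactant_assignments(soup, 2, toy_matches))
print(assignments)
# [('CC(=O)O', 'CN'), ('CC(=O)O', 'NCC')]
```

Each resulting tuple could then be passed, position by position, to SetStartingMaterial() on a library generator initialized with the two-reactant smirks.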
-
Representing individual, specific reactions
-
A subtle difference between OpenEye and Daylight reaction transforms:
the output of a Daylight reaction transform is normally a reaction,
whereas the output of an OE reaction transform is normally a set of products.
Daylight reactions are represented as reaction smiles, optionally with
atom maps to designate a partial or complete reaction mechanism.
Although the OELibraryGen and OEUniMolecularRxn methods
do not directly output such reactions, these reaction smiles can easily
be constructed. The OELibraryGen::SetAssignMapIdx() method
is used to maintain the map indices.
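Constructing the reaction smiles is a matter of joining strings, given reactant and product smiles that have already been written out by the toolkit. The Python sketch below uses a hypothetical helper, reaction_smiles, and illustrative input strings (an amide formation written by hand, not toolkit output).

```python
def reaction_smiles(reactants, products, agents=()):
    """Join component smiles into "reactants>agents>products".

    Components on each side are dot-separated; with no agents the
    middle field is empty, which gives the familiar ">>" separator.
    """
    return ">".join(".".join(part) for part in (reactants, agents, products))


# Illustrative strings only: acetic acid plus methylamine.
rxn = reaction_smiles(["CC(=O)O", "CN"], ["CC(=O)NC", "O"])
print(rxn)  # CC(=O)O.CN>>CC(=O)NC.O
```

If the component smiles carry atom map indices, they pass through unchanged, so the resulting reaction smiles keeps whatever mapping the inputs had.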
-
Discarding unwanted byproducts
-
A fundamental limitation of a smirks is that atoms can only be destroyed if they are
matched on the reactant side. That is, we must know they exist when writing the
smirks. This makes it impossible to implement with smirks alone a concept akin to
"discard whatever is attached here". Instead, such byproducts, a.k.a. "leaving groups",
must be discarded in a separate step after the smirks based transform.
As an example, let's say our task is to replace an alkyl group attached to an
aromatic ring by a single hydrogen atom. The following smirks could accomplish
this but also leave the alkyl group:
[a:1]-[C:2]>>[a:1][H].[H][C:2]
The following smirks illustrates a helpful device:
[a:1]-[C:2]>>[a:1][H].[Xe][C:2]
The xenon atom serves to tag the byproduct so that a subsequent step can easily recognize
unwanted byproducts as such and discard them.
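The discarding step can be as simple as filtering the dot-separated components of the product smiles. The Python sketch below assumes the tag is written exactly as "[Xe]" in the output (no isotope or charge on the xenon) and that "." only separates components; the function name discard_tagged and the sample product string are illustrative.

```python
def discard_tagged(product_smiles, tag="[Xe]"):
    """Drop every dot-separated component that contains the tag atom."""
    kept = [part for part in product_smiles.split(".") if tag not in part]
    return ".".join(kept)


# Illustrative product of the tagging smirks applied to ethylbenzene:
# the ring keeps a hydrogen, the alkyl byproduct carries the xenon tag.
products = "c1ccccc1.[Xe]CC"
print(discard_tagged(products))  # c1ccccc1
```

With a toolkit at hand, the same filter would more robustly be done on molecule objects, by checking each component for an atom with atomic number 54 rather than searching the text.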
rev: 2007 05 30
|