The oeiupac library currently processes NUL (zero) terminated ASCII character strings. Greek characters, symbols, fonts and superscripts must therefore be transliterated into the printable subset of ASCII. When parsing compound names, the oeiupac library treats space and tab characters as interchangeable, and any run of consecutive ‘whitespace’ characters is treated as a single space.
Currently, the name parsing is case insensitive, allowing arbitrary mixing of upper and lower case characters, e.g. initial letter capitalization.
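The whitespace and case rules above can be pictured with a small stand-alone sketch. The helper `normalize_name` is hypothetical and is not part of oeiupac. It only mimics the behavior described here (spaces and tabs interchangeable, runs collapsed to one space, case ignored) so that two spellings of a name can be compared.

```python
import re


def normalize_name(name: str) -> str:
    """Hypothetical helper: collapse runs of spaces/tabs and fold case."""
    # Spaces and tabs are treated alike; any run becomes a single space.
    collapsed = re.sub(r"[ \t]+", " ", name)
    # Parsing is described as case insensitive, so fold to lower case.
    return collapsed.lower()
```

Under these assumptions, ‘Acetic \t Acid’ and ‘acetic acid’ normalize to the same string.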
Greek characters are understood in a number of different
representations. The strings ‘$a’, ‘${a}’, ‘alpha’, ‘.alpha.’,
‘α’, ‘α’ and ‘α’ are all understood to represent the
Greek character alpha (α).
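As a rough illustration, the ASCII spellings of alpha listed above can be collected in a lookup set. The names `ALPHA_FORMS` and `is_alpha_token` are hypothetical and say nothing about how oeiupac actually tokenizes a name. Only alpha is covered, because it is the only letter whose spellings are enumerated here.

```python
# ASCII spellings of alpha taken from the text above.
# Other Greek letters are omitted because their spellings are not listed.
ALPHA_FORMS = {"$a", "${a}", "alpha", ".alpha."}


def is_alpha_token(token: str) -> bool:
    """Hypothetical check: does this token spell the Greek letter alpha?"""
    # Name parsing is described as case insensitive, so fold case first.
    return token.lower() in ALPHA_FORMS
```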
There is no special representation for italic characters. Compound names whose prefixes are conventionally italicized, such as the ‘tert’ in ‘tert-butyl’ and the ‘p’ in ‘p-aminobenzamidine’, are represented as the plain strings ‘tert-butyl’ and ‘p-aminobenzamidine’. Both the long and short forms of prefixes can be used, allowing the above examples to also be written as ‘t-butyl’ and ‘para-aminobenzamidine’.
Unrecognized functional groups, linkers or ring systems are denoted in the generated name by the string ‘BLAH’. As much of the name as possible is generated, resulting in compound names such as ‘dichloroBLAHcarboxylic acid’. Generated compound names are otherwise lower case, with no initial capitalization; upper case characters are generated only for locants and, as described above, for BLAH.
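Because the placeholder is upper case while the rest of a generated name is lower case, a caller can detect partially named compounds with a plain string search. The helper below is a hypothetical sketch and not an oeiupac function.

```python
def unnamed_fragment_count(name: str) -> int:
    """Hypothetical helper: count BLAH placeholders in a generated name."""
    # Generated names are lower case apart from locants and BLAH,
    # so a case-sensitive search is used here.
    return name.count("BLAH")
```

A count of zero suggests the whole structure was named. Any other value marks a name that should not be treated as complete.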
When generating Greek characters in compound names, the oeiupac
library currently uses the dollar character followed by a single letter
representation. In this formalism, ‘$a’ represents the Greek character
alpha (α), ‘$b’ the Greek character beta (β), ‘$g’ the Greek character
gamma (γ) and ‘$l’ the Greek character lambda (λ).
When generating superscripts, the oeiupac library currently uses the
caret and curly braces representation. Hence ‘$l^{5}’ represents the
Greek character lambda followed by a superscript five, i.e. λ⁵.
Similarly, ‘pentacyclo[4.2.0.0^{2,5}.0^{3,8}.0^{4,7}]octane’ would
be the von Baeyer system name for cubane, i.e.
pentacyclo[4.2.0.0²,⁵.0³,⁸.0⁴,⁷]octane.
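The dollar and caret conventions are easy to convert into Unicode for display. The sketch below is a hypothetical stand-alone helper, not the toolkit's own conversion routine. It handles only the four dollar codes named above and superscripts made of digits and commas.

```python
import re

# Dollar codes named in the text; other letters are not listed there.
GREEK = {"$a": "\u03b1", "$b": "\u03b2", "$g": "\u03b3", "$l": "\u03bb"}
# Map ASCII digits to their Unicode superscript forms.
SUPERSCRIPT = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_display(name: str) -> str:
    """Hypothetical helper: render '$x' and '^{...}' markup as Unicode."""
    # Replace each caret-and-braces group with superscript digits;
    # commas inside the group are kept as ordinary commas.
    name = re.sub(
        r"\^\{([^}]*)\}",
        lambda m: m.group(1).translate(SUPERSCRIPT),
        name,
    )
    # Replace the dollar codes with the Greek letters they stand for.
    for code, letter in GREEK.items():
        name = name.replace(code, letter)
    return name
```

With these assumptions, `to_display("$l^{5}")` yields the lambda character followed by a superscript five, and the bracketed von Baeyer descriptor keeps its punctuation while the locant pairs are raised.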
Multiple components in a disconnected molecule, apart from common salts and counter ions, are separated from each other by a semicolon followed by a space. Mixtures containing salts are written with the cations first, then the compound name, then the anions, and finally any common neutral molecules (e.g. hydrate or hydrochloride).
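Given the semicolon-and-space separator, a generated name for a mixture can be split back into its component names. The helper and the input string below are illustrative assumptions, and the string is not actual library output.

```python
from typing import List


def split_components(name: str) -> List[str]:
    """Hypothetical helper: split a generated name at '; ' separators."""
    # Components are described as separated by a semicolon and one space.
    return name.split("; ")


# Illustrative input only; not produced by running the library.
parts = split_components("benzene; toluene")
```

Salts and counter ions are not separated this way, so they stay within a single component string.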
The Lexichem compound naming functionality supports the generation of several styles of compound name. The currently predefined name styles are OpenEye (the default), IUPAC, CAS, Traditional and Systematic. OpenEye names loosely correspond to the kinds of names familiar to a medicinal chemist. These names are intended to be a subset of the IUPAC 2005 standard’s acceptable names, but not necessarily the PIN (Preferred IUPAC Name). These correspond to the types of names found in a Sigma-Aldrich catalog or a Journal of Medicinal Chemistry article for example.
IUPAC names are intended to follow the IUPAC 2005 recommendations for the Preferred IUPAC Name (PIN). Unfortunately, this functionality is relatively recent, so the best that can be hoped for these names is that they are more IUPAC-like than the default OpenEye name style. Future releases of Lexichem may further refine this definition to provide IUPAC2005, IUPAC93 and IUPAC79 name styles that reflect the corresponding standard’s preferred name.
The Lexichem CAS name style is intended to follow the Chemical Abstracts Service’s naming conventions, where they differ from IUPAC’s. Once again, as this functionality is relatively recent, the effect is to generate names that are more CAS-like than the default OpenEye name style.
The Traditional name style corresponds to forms of compound naming that are no longer acceptable under the IUPAC rules. For a trivial/common name that is acceptable to IUPAC but not preferred, the boundary between OpenEye and Traditional is blurred, with OpenEye attempting to follow the more prevalent usage.
Finally, Systematic names correspond to the fully systematic IUPAC names that the IUPAC preferred names are slowly converging towards.
Some of the concepts explained in the previous section are probably best clarified through some real examples.
The SMILES string O is called ‘water’ by the OpenEye name style, but ‘oxidane’ by the IUPAC and Systematic name styles.
The SMILES C#C is called ‘acetylene’ by the OpenEye and IUPAC name styles, but ‘ethyne’ by the Systematic name style.
The SMILES prefix *Nc1ccccc1 is called ‘anilino’ by OpenEye and IUPAC, but ‘phenylamino’ by Systematic.
The SMILES prefix *O[N+]#[C-] is called ‘fulminato’ by OpenEye, but ‘isocyanooxy’ by IUPAC and Systematic.
The SMILES prefix *C(=O)C is called ‘acetyl’ in OpenEye and IUPAC, but ‘ethanoyl’ in Systematic.
The SMILES string CC(=O)C is called ‘acetone’ in OpenEye, but ‘propan-2-one’ in IUPAC and Systematic.
The SMILES string C12C3C4C1C5C4C3C25 is called ‘cubane’ in OpenEye, but is currently named ‘BLAH’ in IUPAC and Systematic as we currently fail to name it as the preferred IUPAC2005 PIN: ‘pentacyclo[4.2.0.0^{2,5}.0^{3,8}.0^{4,7}]octane’.
The SMILES string C(=O)O is called ‘formic acid’ in OpenEye/IUPAC, but ‘methanoic acid’ in Systematic.
The SMILES string c1ccccc1CCCCCCC is named as ‘1-phenylheptane’ by OpenEye and IUPAC, but as ‘heptylbenzene’ by CAS.
The SMILES prefix *[BH2] is called ‘boranyl’ by OpenEye and IUPAC, but ‘boryl’ by CAS.
The SMILES prefix *S is called ‘sulfanyl’ by OpenEye and IUPAC, but ‘mercapto’ by Traditional.
The SMILES string CCCCCCCCC(=O)O is called ‘nonanoic acid’ by OpenEye and IUPAC, but ‘pelargonic acid’ by Traditional.
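Some of the whole-molecule examples above can be gathered into a small table, which may be handy as regression data when comparing name styles. This is a sketch only: `EXAMPLES` and `quoted_name` are hypothetical names, and the table records just the style and name pairs stated in the text, so a missing style means the text does not give a name for it.

```python
from typing import Optional

# Names quoted in the examples above, keyed by SMILES and then by style.
EXAMPLES = {
    "O": {"OpenEye": "water", "IUPAC": "oxidane", "Systematic": "oxidane"},
    "C#C": {"OpenEye": "acetylene", "IUPAC": "acetylene",
            "Systematic": "ethyne"},
    "CC(=O)C": {"OpenEye": "acetone", "IUPAC": "propan-2-one",
                "Systematic": "propan-2-one"},
    "C(=O)O": {"OpenEye": "formic acid", "IUPAC": "formic acid",
               "Systematic": "methanoic acid"},
    "CCCCCCCCC(=O)O": {"OpenEye": "nonanoic acid", "IUPAC": "nonanoic acid",
                       "Traditional": "pelargonic acid"},
}


def quoted_name(smiles: str, style: str) -> Optional[str]:
    """Return the name quoted in the text, or None if none is stated."""
    return EXAMPLES.get(smiles, {}).get(style)
```

For instance, `quoted_name("O", "IUPAC")` returns ‘oxidane’, while `quoted_name("O", "CAS")` returns `None` because no CAS name for water is given above.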