http://www.gnova.com/pgchem/cup2005/
TJ O'Donnell
O'Donnell Associates, San Diego, CA
tjo@acm.org
PostgreSQL
is a relational database management system (RDBMS)
based on POSTGRES, Version 4.2,
developed at the University of California at Berkeley Computer Science Department.
POSTGRES pioneered many concepts that only became available much later in some commercial
database systems.
Most RDBMS are
extensible,
allowing users to store and manipulate complex data.
For example, storage and searching of chemical structures in Oracle is handily accomplished
by the use of
cartridge extensions, such as those offered by
Accelrys,
CambridgeSoft,
ChemAxon,
ChemNavigator,
Daylight,
DeltaSoft,
InfoChem,
MDL,
and
Tripos.
Such a cartridge allows easy and efficient integration of chemical storage and searching into
the structured query language (SQL)
underlying most RDBMS applications.
I have written an extension to PostgreSQL using OEChem from OpenEye. It creates SQL
functions oe_cansmiles, oe_keksmiles, oe_impsmiles, oe_matches,
and oe_count_matches.
The use of oe_cansmiles, along with indexing features of PostgreSQL,
allows the creation of unique, indexed tables of chemical structures for fast lookup.
Substructure searches can be performed using oe_matches. The function oe_count_matches can be used
to count the number of times each substructure matches, for example to count the
number of halogens, hydrogen bond donors, aromatic heteroatoms, etc. in a given structure.
I will show examples of two test databases I have created using about 100,000
and 1,000,000 structures from vendor catalogs. Issues of efficiency and speed will
be discussed. Finally, further extensions and optimizations of existing extensions
will be considered.