Using OEChem to Extend PostgreSQL

TJ O'Donnell
O'Donnell Associates, San Diego, CA
tjo@acm.org

PostgreSQL is a relational database management system (RDBMS) based on POSTGRES, Version 4.2, developed at the University of California at Berkeley Computer Science Department. POSTGRES pioneered many concepts that only became available much later in some commercial database systems. Most RDBMS are extensible, allowing users to store and manipulate complex data. For example, storage and searching of chemical structures in Oracle is handily accomplished by the use of cartridge extensions, such as those offered by Accelrys, CambridgeSoft, ChemAxon, ChemNavigator, Daylight, DeltaSoft, InfoChem, MDL, and Tripos. Such a cartridge allows easy and efficient integration of chemical storage and searching into the structured query language (SQL) underlying most RDBMS applications.

I have written an extension to PostgreSQL using OEChem from OpenEye. It creates SQL functions oe_cansmiles, oe_keksmiles, oe_impsmiles, oe_matches, and oe_count_matches. The use of oe_cansmiles, along with indexing features of PostgreSQL, allows the creation of unique, indexed tables of chemical structures for fast lookup. Substructure searches can be performed using oe_matches. The function oe_count_matches can be used to count the number of times each substructure matches, for example to count the number of halogens, hydrogen bond donors, aromatic heteroatoms, etc. in a given structure. I will show examples of two test databases I have created using about 100,000 and 1,000,000 structures from vendor catalogs. Issues of efficiency and speed will be discussed. Finally, further extensions and optimizations of existing extensions will be considered.

outline.html