Title: "Creating a ChemInformatics Data System for Public Consumption"
Author: Evan Bolton, NIH/NCBI
Abstract: The processing pipeline used to create PubChem is outlined, discussed, and demonstrated. PubChem offers researchers public access to an array of structure and activity information for a diverse set of small molecules. It is organized as three linked databases, Substance, Compound, and BioAssay, within the Entrez/PubMed information retrieval system. PubChem contains the results of high-throughput biological screening experiments, and, when possible, PubChem's records are linked to other NCBI databases, such as the PubMed scientific literature database and NCBI's 3D protein structure database. Validation and standardization of chemical structure data are critical to PubChem, since they allow properties, descriptors, and similarity relationships among entries to be computed in a uniform and accurate way. Within PubChem, OEChem is used for file I/O and molecular data handling, SMARTS pattern matching, stereochemistry and aromaticity perception, and valence bond canonicalization, among other things. Ogham is being used to assign proper IUPAC names to chemical structures and will also be used to generate structures for the cases where the dataset has only names. In the future, PubChem is expected to include properties predicted using other software from OpenEye.
Creating_a_ChemInformatics_Data_System_for_Public_Consumption.pdf