Title:
OEChem in PubChem: Parsing Legacy Data and Cleaning Up PDB Small Molecules
Author:
Paul Thiessen
NIH/NCBI
Abstract:
OEChem is used extensively throughout the processing pipeline in PubChem,
which starts with original legacy data in a variety of formats and ends
with fully standardized chemical structures in PubChem's ASN.1/XML format.
This presentation will focus on issues that arise during the early stages
of parsing external data, such as SMILES or SDF. We place particular
emphasis on our efforts to bring small molecules from PDB into PubChem
via NCBI's MMDB database. The main challenge with this data is that PDB
does not contain explicit bond orders and often leaves out hydrogens.
Even basic element identity and atom connectivity are often suspect. We
therefore rely on OEChem's 3D coordinate-based perception to fill in the
missing information and produce molecules with complete chemical detail,
all without manual intervention. We will discuss how we decide what
constitutes a "small molecule," a variety of problems encountered along
this path, and tricks used to overcome noise in the data or even artifacts
of OEChem's own perception.
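
To give a flavor of what coordinate-based perception involves, here is a
minimal sketch of the simplest step: inferring connectivity from 3D
positions alone. It is not OEChem's algorithm and not the PubChem pipeline
code. It assumes a generic rule (two atoms are bonded if their distance is
within the sum of their covalent radii plus a tolerance); the function
name, the radii table, and the 0.45 angstrom tolerance are all choices
made for this illustration.

```python
import math

# Approximate single-bond covalent radii in angstroms (illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "P": 1.07, "S": 1.05}


def perceive_bonds(atoms, tolerance=0.45):
    """Guess connectivity from coordinates alone.

    atoms: list of (element_symbol, (x, y, z)) tuples, coordinates in angstroms.
    Returns a list of (i, j) index pairs with i < j for atoms judged bonded.
    The tolerance is an assumed value, not one taken from OEChem.
    """
    bonds = []
    for i in range(len(atoms)):
        elem_i, pos_i = atoms[i]
        for j in range(i + 1, len(atoms)):
            elem_j, pos_j = atoms[j]
            # Bonded if the distance is within the summed radii plus slack.
            cutoff = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + tolerance
            if math.dist(pos_i, pos_j) <= cutoff:
                bonds.append((i, j))
    return bonds


# Carbon dioxide given as bare coordinates, with no bond records at all.
co2 = [
    ("C", (0.00, 0.0, 0.0)),
    ("O", (1.16, 0.0, 0.0)),
    ("O", (-1.16, 0.0, 0.0)),
]
print(perceive_bonds(co2))  # [(0, 1), (0, 2)]
```

Even this toy version shows why the input matters: a wrong element
symbol changes the cutoff, and noisy coordinates can add or drop bonds.
Bond orders and missing hydrogens need further perception beyond this
distance test.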
ThiessenCUP6.ppt