For a general introduction to database preparation, please see the appropriate section of the introduction.
We recognize that not everyone has a multiconformer database of fragments readily available. For this reason, we have included a database of molecular fragments to search for bioisosteres. While we are certain that our databases are reasonably thorough, we recognize that some users may want to develop their own fragment databases. Fragmentation exercises are straightforward with many cheminformatics toolkits, and strategies are available in the literature (see [Lewell-1998]).
The BROOD distribution includes two programs to aid users in generating their own databases, CHOMP and MERGE.
The BROOD distribution includes the program CHOMP for fragmenting molecules into databases of potential bioisosteres. CHOMP fragments molecules by identifying critical bonds that can be broken, as specified with the -smarts flag. CHOMP includes three sets of bond-breaking patterns: RLF (rings, linkers and functional groups), RECAP rules [Lewell-1998], and ALL (which breaks all non-ring, non-resonance bonds). By default, CHOMP breaks all bonds identified by any of these three methods. A user can also specify a SMARTS file of their own bond identifiers. This file should contain a series of SMARTS patterns (one per line), each defining the two atoms on opposite ends of the bond to be broken. For example, a line with the SMARTS “-” will cause all single bonds to be broken, a line with the SMARTS “[R]-[!R]” will cause all bonds between ring atoms and non-ring atoms to be broken, and “[#6]-!@[!#6!#1]” will cause CHOMP to break every single non-ring bond between a carbon atom and a heteroatom.
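As a minimal sketch, the three example patterns above could be written to a SMARTS file with a few lines of Python. The helper name write_smarts_file is hypothetical and not part of the BROOD distribution; the script only prepares the file and does not run CHOMP.

```python
from pathlib import Path

# Bond-breaking patterns quoted in the text above, one per line in the file.
BOND_PATTERNS = [
    "-",                 # every single bond
    "[R]-[!R]",          # bonds between a ring atom and a non-ring atom
    "[#6]-!@[!#6!#1]",   # non-ring single bonds between carbon and a heteroatom
]


def write_smarts_file(path, patterns):
    """Write one SMARTS pattern per line and return the file path.

    Hypothetical helper for illustration; CHOMP itself only needs the file.
    """
    path = Path(path)
    path.write_text("".join(pattern + "\n" for pattern in patterns))
    return path
```

Calling write_smarts_file("bondids", BOND_PATTERNS) produces a file that can be passed to CHOMP as -smarts bondids.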
The RLF chemical heuristics seek to break compounds into three types of primary fragments: contiguous ring systems, functional groups, and linkers. Contiguous ring systems include any set of atoms that are bonded together by at least one ring bond. Thus fused rings and spiro rings are included as a single ring system, but biphenyl is broken into two ring systems. Functional groups are defined as any collection of bonded atoms including one or more heteroatoms or unsaturated carbons separated by at most a single fully saturated carbon atom. The linkers are the remaining saturated carbon skeletons. It should be noted that linkers, like functional groups and ring systems, can be terminal (i.e. degree 1).
CHOMP systematically identifies all of the molecular fragments that can be generated by breaking one or more of the specified bonds. CHOMP first eliminates fragments that have more than 15 heavy atoms, or that have more than 3 attachment points. CHOMP also filters the fragments based on commonly used molecular filters, eliminating unstable, reactive or toxic functional groups etc. As a final fragment generation step, for every fragment with 2 attachment points, CHOMP generates the 2 single attachment fragments and for every fragment with 3 attachment points, CHOMP generates the 6 fragments with 2 attachment points and the 3 fragments with single attachment points by capping the open attachment valence with hydrogens. CHOMP then eliminates any duplicate fragments.
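The size and attachment-point limits described above can be illustrated with a short sketch. This is not CHOMP's implementation: the Fragment record, the [*] notation for attachment points, and the duplicate check on exact SMILES strings are all assumptions made for illustration, and the chemical filters and hydrogen-capping step are left out.

```python
from dataclasses import dataclass

MAX_HEAVY_ATOMS = 15        # limit quoted in the text above
MAX_ATTACHMENT_POINTS = 3   # limit quoted in the text above


@dataclass(frozen=True)
class Fragment:
    """Hypothetical fragment record; [*] marks an attachment point."""
    smiles: str
    heavy_atoms: int
    attachment_points: int


def passes_size_filter(fragment):
    """Keep fragments within the heavy-atom and attachment-point limits."""
    return (fragment.heavy_atoms <= MAX_HEAVY_ATOMS
            and fragment.attachment_points <= MAX_ATTACHMENT_POINTS)


def unique_by_smiles(fragments):
    """Drop repeats, keeping the first occurrence of each SMILES string.

    Assumption: identical strings mean identical fragments. A real workflow
    would compare canonical SMILES instead.
    """
    seen = set()
    kept = []
    for fragment in fragments:
        if fragment.smiles not in seen:
            seen.add(fragment.smiles)
            kept.append(fragment)
    return kept


candidates = [
    Fragment("[*]c1ccccc1", 6, 1),
    Fragment("[*]CC[*]", 2, 2),
    Fragment("[*]C([*])([*])[*]", 1, 4),      # too many attachment points
    Fragment("[*]CCCCCCCCCCCCCCCC", 16, 1),   # too many heavy atoms
    Fragment("[*]c1ccccc1", 6, 1),            # duplicate of the first entry
]
kept = unique_by_smiles(f for f in candidates if passes_size_filter(f))
```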
At this point, CHOMP offers additional options for duplicate removal. If a user is generating a database to augment the default BROOD database, they can direct CHOMP to remove any duplicates with the default database using the -removeDuplicates flag. Similarly, if a user wants to eliminate any set of potential fragments, such as those from a previous proprietary fragment database, they can specify the file containing these fragments with the -userUnique flag.
In many instances, users already have sets of fragments generated by their own means. CHOMP will read these fragments directly using the -userFrags flag. This flag can be used to generate a BROOD database using only these fragments, or in conjunction with fragments generated using the molecular fragmentation process discussed above. In either case, BROOD will process the user-defined fragments first and process them in the order they are read from the -userFrags file. This feature is important for use in conjunction with BROOD's -quickLook flag because these fragments will be searched first.
This first portion of the CHOMP algorithm involves only graph processing and can proceed very quickly. Please note, however, that when CHOMPing molecular databases of more than a few million compounds, the CHOMP process will take significant amounts of memory. Memory usage grows quickly early in the algorithm, when many new fragments are being identified, and more slowly later. For this reason, we recommend that a user have about 1 GB of memory for every million molecules they wish to process. It is entirely possible to break a large molecular database into several groups, run each individually, and save the intermediate results before recombining them prior to the final stages of the CHOMP algorithm.
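The chunking strategy can be sketched as follows, assuming the input is a SMILES file with one molecule per line. The function names and the chunk file naming scheme are hypothetical; the memory estimate simply applies the rule of thumb given above.

```python
from pathlib import Path

GB_PER_MILLION_MOLECULES = 1.0  # rule of thumb from the text above


def estimated_memory_gb(n_molecules):
    """Rough memory estimate for fragmenting n_molecules in one run."""
    return GB_PER_MILLION_MOLECULES * n_molecules / 1_000_000


def split_smiles_file(path, molecules_per_chunk, out_dir):
    """Split a one-molecule-per-line SMILES file into numbered chunk files."""
    if molecules_per_chunk < 1:
        raise ValueError("molecules_per_chunk must be at least 1")
    path, out_dir = Path(path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths, chunk = [], []

    def flush():
        # Chunk names such as mymolecules_001.smi are an arbitrary choice.
        name = f"{path.stem}_{len(chunk_paths) + 1:03d}{path.suffix}"
        chunk_path = out_dir / name
        chunk_path.write_text("".join(chunk))
        chunk_paths.append(chunk_path)
        chunk.clear()

    with path.open() as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            chunk.append(line.rstrip("\n") + "\n")
            if len(chunk) == molecules_per_chunk:
                flush()
    if chunk:
        flush()  # write the final, possibly smaller, chunk
    return chunk_paths
```

Each chunk file can then be fragmented in a separate CHOMP run sized to the available memory.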
In order to write intermediate files, whether for the purpose of breaking a database into chunks or simply to investigate the fragments CHOMP is generating, a user can set the -omega flag to false. This will cause CHOMP to write a file as specified by the -out flag, but with the .ism suffix, which contains all of the fragments processed so far. These molecules can be examined, filtered, sorted, joined with additional SMILES and otherwise manipulated. The same molecules can then be used with the -userFrags flag to generate a BROOD database. Other than the changes made by manipulation of this intermediate output file, a database generated by stopping at this intermediate step will be the same as one generated by executing CHOMP with the -omega flag set to true from the start.
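A minimal sketch of recombining intermediate files is shown below. It assumes each intermediate .ism file holds one fragment per line and treats identical lines as duplicates, which is a simplification. The function names are hypothetical, and the argument lists use only the flags described in this section; they are built but not executed.

```python
from pathlib import Path


def merge_intermediate_files(ism_paths, merged_path):
    """Concatenate intermediate fragment files, dropping repeated lines.

    Returns the number of fragment lines written to merged_path.
    """
    seen = set()
    merged = []
    for ism_path in ism_paths:
        for line in Path(ism_path).read_text().splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    Path(merged_path).write_text("".join(line + "\n" for line in merged))
    return len(merged)


def intermediate_command(input_file, out_name):
    """Arguments for a run that stops before conformer generation."""
    return ["chomp", "-in", str(input_file), "-out", str(out_name),
            "-omega", "false"]


def final_command(merged_fragments, database_name):
    """Arguments for the final run that builds the database from fragments."""
    return ["chomp", "-userFrags", str(merged_fragments),
            "-out", str(database_name)]
```

Merging by exact line keeps the example short; a real workflow might canonicalize the SMILES before comparing them.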
The final two stages of the CHOMP algorithm are to generate 3D conformers and to write them to a BROOD database. CHOMP generates 3D conformers using the Omega algorithm (operated here with only a BROOD license). CHOMP’s version of Omega has a modified TORLIB in order to sample more generously around the attachment points. It also differs from the default Omega parameters in having tighter constraints for removal of RMSD duplicates (as the measure is very sensitive to size) and in having MAXCONFS and EWINDOW parameters scaled to the size of the fragments. Lastly, CHOMP writes the BROOD database. In order to write the fragments into a database, CHOMP carries out several precalculations, including generating physical properties and adding color atoms. CHOMP also segregates the fragment conformers into groups that are likely to match the same types of queries. All of these processes help BROOD search the databases more quickly.
Like the OMEGA application, the OMEGA algorithm inside CHOMP can take up to 1 or occasionally 2 seconds per fragment. For this reason, CHOMP can be run in multi-processor mode using MPI. For more details on the use of CHOMP with MPI, refer to the Open MPI section of the installation manual.
BROOD databases are currently stored as a directory (folder) that contains many files. The databases can be compressed with standard compression algorithms, moved, and uncompressed. Of the files in the database directory, only one is human-readable: the .info file. If you open the .info file with a text editor, you will see some information about the database, its version, and a manifest of the files that are supposed to be in the database. Whenever BROOD or MERGE reads a database, it checks this manifest as some assurance that the database has not been corrupted.
Finally, after a user generates a database, they may wish to combine the new database with prior databases, such as BROOD’s default database. To facilitate this, BROOD comes with a utility program called MERGE. MERGE has three required parameters: -in1, -in2, and -out. This allows a user to combine the two databases in the directories specified by the -in1 and -in2 flags and write a new database to the directory specified by the -out flag. MERGE will not overwrite an existing database.
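The MERGE invocation can be sketched as a small wrapper that checks the three required parameters before building the argument list. The executable name merge and the helper name merge_command are assumptions; the pre-check on the output directory mirrors the rule that MERGE will not overwrite a database.

```python
from pathlib import Path


def merge_command(db1, db2, out_db):
    """Build a MERGE argument list after checking the database directories.

    Assumes the executable is called "merge"; adjust to your installation.
    """
    db1, db2, out_db = Path(db1), Path(db2), Path(out_db)
    for database in (db1, db2):
        if not database.is_dir():
            raise FileNotFoundError(f"input database not found: {database}")
    if out_db.exists():
        # MERGE will not overwrite a database, so fail early.
        raise FileExistsError(f"output database already exists: {out_db}")
    return ["merge", "-in1", str(db1), "-in2", str(db2), "-out", str(out_db)]
```

The returned list could be passed to subprocess.run once the executable name has been confirmed.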
CHOMP is the algorithm used to generate BROOD databases. At its simplest, CHOMP takes molecules as input and fragments them to generate BROOD databases. When you first begin using BROOD, you should start with the default database that comes with the BROOD program. In some instances, the default BROOD database may be a separate download and installation from the BROOD application, but in all cases it is available for use with BROOD.
CHOMP offers single-step BROOD database generation. If you simply specify the molecules you would like to use to generate fragments (with -in) and the output database you would like to create (with -out), CHOMP will take care of the rest.
By default, CHOMP assumes you want to create a database from your proprietary collection in order to augment the default database. Thus, by default, CHOMP removes fragments that are duplicates of those already in the default database. If this is not your goal, turn off this duplicate removal with the -removeDuplicates flag. Failure to follow this step could lead to far fewer fragments than you expect.
The output from CHOMP will be a directory filled with a BROOD database. If you want to search the database with BROOD, specify the directory name with BROOD’s -db flag.
Executing CHOMP with no arguments will result in:
prompt> chomp
Chemical Heuristic for Optimal Molecular Pieces (CHOMP).
CHOMP version 2.0.0, 20100105
OEChem version 1.8.0, 20091211
Platform: microsoft-win32-msvc9-MD-x86
OpenEye Scientific Software, Inc.
Single processor
MPI Multiprocessor
=======================================
This executable supports single processor execution
No argument specified on the command line
Required parameters:
-out : Output fragment database name
For more help type:
chomp --help
A description of the command line interface can be obtained by executing CHOMP with the --help option.
prompt> chomp --help
will generate the following output:
Help functions:
chomp --help simple : Get a list of simple parameters
chomp --help all : Get a complete list of parameters
chomp --help defaults : List the defaults for all parameters
chomp --help <parameter> : Get detailed help on a parameter
chomp --help html : Create an html help file for this program
If you desire to see all of the basic command-line options use --help simple.
prompt> chomp --help simple
will generate the following output:
These two parameters represent the minimum useful set of parameters and the best place to start in learning to use CHOMP.
If you desire to see all of the command-line options use --help all.
prompt> chomp --help all
will generate the following output:
Complete parameter list
chomp
-in : Input molecule filename
-out : Output fragment database name
-removeDuplicates : Only process fragments that are not in the OpenEye
database
-param : Control parameter file
-prefix : Prefix for generic output files
-dots : Write dots to the screen to follow progress
-maxDegree : Only enumerate fragments with degree <= N
-minDegree : Only enumerate fragments with degree > N
-minFrequency : Only accept fragments with freq >= minFrequency
-maxMolWt : Only enumerate fragments <= maxMolWt a.u.
-maxHvy : Only enumerate fragments with <= maxHvy heavy atoms
-minHvy : Only enumerate fragments with > minHvy heavy atoms
-filter : Apply filter file (true, false, or filter.txt)
-userFrags : User fragments after chomp ready for omega & db prep
-userUnique : User database for duplicate removal
-smarts : SMARTS file for bonds to break (recap, both, rlf, all or file)
-capAttach : Use the input fragment coordinates
-omega : Build multi-conformers with Omega (license required)
-backmap : Build mapping between the fragments and the molecules
The defaults for each command-line parameter can be examined with the --help defaults option.
This section has a series of example CHOMP command-line executions. Each example is followed by a brief description of its behavior.
prompt> chomp -in mymolecules.smi -out myfrags.oeb.gz
prompt> chomp -i mymolecules.smi -o myfrags.oeb.gz
prompt> chomp mymolecules.smi myfrags.oeb.gz
All three of these command-lines specify exactly the same thing. In each case, CHOMP will read the molecules in mymolecules.smi and write the fragment file myfrags.oeb.gz. This is the most basic and most common CHOMP execution. Since the -dots flag defaults to true, dots will be written to standard output to indicate progress of the molecular fragmentation as the job progresses.
prompt> chomp -param oldrun.param
This execution of CHOMP will read the command-line parameters from the file oldrun.param. Every time CHOMP is executed, a file called chomp.param is written that records the command-line parameters used. This is useful for recalling what was used in a specific execution or for repeating a previous calculation as in the example here.
prompt> chomp -smarts bondids mymolecules.smi myfrags.oeb.gz
prompt> chomp mymolecules.smi myfrags.oeb.gz -smarts bondids
This example demonstrates two important principles. The first of these two command-lines will work, but the second will result in an error. When specifying a command-line with keyless arguments for the -in and -out files, these files must be the final two arguments on the command-line.
The second principle is use of non-default fragmentation patterns. In this example, CHOMP will use the SMARTS patterns in the file bondids to generate fragments rather than the default fragmentation scheme.
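The ordering rule for keyless arguments can be captured in a small helper that always places the input and output files last. This is an illustrative sketch: the function name chomp_command is hypothetical, and the command is only assembled, not executed.

```python
def chomp_command(input_file, output_name, options=None):
    """Return a CHOMP argument list with keyless files as the last two items.

    options maps flag names (for example "-smarts") to their values.
    """
    command = ["chomp"]
    for flag, value in (options or {}).items():
        if not flag.startswith("-"):
            raise ValueError(f"flag must start with '-': {flag}")
        command += [flag, str(value)]
    # Keyless input and output files must come after every keyed option.
    command += [str(input_file), str(output_name)]
    return command


example = chomp_command("mymolecules.smi", "myfrags.oeb.gz",
                        {"-smarts": "bondids"})
```

Here example holds the same arguments as the first, working command line above.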
prompt> chomp -simple mymolecules.smi myfrags.oeb.gz
This example will read the molecules in mymolecules.smi and fragment them using the default fragmentation pattern. The -simple flag indicates that only these primary fragments will be written to the output file myfrags.oeb.gz.