OEChem or OELib?
Matthew T. Stahl
Chief Geek, OpenEye Scientific Software
How OELib Became Open-Source
For the uninitiated, OELib is a
free (that is libre, not gratis according to the Free Software
Foundation) open-source C++ molecule toolkit. It began with the
need for a molecule handling toolkit on which OpenEye Scientific
Software could base some of its products. As OELib grew and became
more useful, a small number of users outside of OpenEye found that it
had utility in their own work. In an attempt to prevent the
further reinvention of the programmatic wheel with respect to
molecule handling functionality, OELib was released under the GNU
General Public License (GPL). The GPL was simply a convenient
mechanism to prevent OELib from being incorporated into commercial
closed-source projects. With the GPL attached, OELib
remained accessible to the desired audience of developers trying to
solve their own problems but not to commercial entities trying to
make money by solving other people's problems. The second
paragraph of the OELib Primer reads:
I admire open-source projects. Linux and the Free Software Foundation are both great examples of how summing the spare time of Geeks and Hackers can result in useful projects. Although the potential audience is small, I hope to take advantage many of the great minds writing chemical software by making OELib available to others. Two things can be accomplished by releasing source code to OELib. First, development time can be shortened by basing projects on OELib. The less people have to reinvent the wheel (or the function) the better. Second, by releasing the source code hopefully other programmers can contribute to the project. Joe Corkery, Brian Goldman, Anthony Nicholls, Roger Sayle, and Pat Walters have already made significant contributions to OELib. As the list of contributors grows all the users of OELib benefit.
OELib became a qualified success. By the summer of 2002, there were (very conservatively) over 30 individuals who had written or were writing software using OELib. The majority of the software applications sold by OpenEye are based on OELib. Released under an alternate (non-GPL) license, OELib became the basis on which Dock V and a commercial software product from Lion Biosciences were written. OELib development also forked into the OpenBabel project, which at the time of this writing is being actively developed on SourceForge (http://openbabel.sourceforge.net/). The original intent of OELib was to provide a library of existing functionality on which those authoring chemical software could build. As such, OELib was a qualified success, judged by its adoption by developers inside and outside of OpenEye.
Policy Change and the ‘New OELib’
In May of 2001, members of OpenEye
Scientific Software began a de novo redesign and rewrite of
OELib. By the end of August 2001, the decision was made that
although the API of the OELib rewrite would be open and published,
the source code of the project would be closed. The rewrite of
OELib, ultimately to be called OEChem, was released in August 2002.
This document attempts to explain why the rewrite of OELib was
necessary, the relationship between OELib and OEChem, and why OpenEye
is releasing OEChem as closed-source software instead of under the GPL
model used for OELib.
The rewrite of OELib was prompted by a number of design
flaws and technical failings of the library. Details of the
perceived deficiencies are covered in a later section in this
document. It was clear to members of OpenEye that continued
development of and reliance on the existing version of OELib was going
to have long-term deleterious effects on the company. The
amount of programmer time spent internally fixing acute bugs in OELib
was already detracting from the product development cycle. Design
flaws in OELib caused chronic failures, shortcomings, and
inconsistencies that were deemed to be impossible to address by
modifications to the code base. The need for a complete rewrite
thus became painfully obvious. A project that began life being
called 'new OELib' took on a very different character from its
predecessor and ultimately became OEChem.
In the initial
design phase of OEChem, all aspects of the OELib project were
questioned in order to identify its shortcomings and
avoid repeating mistakes. The examination soon
stretched beyond technical issues of how OEChem should work into
areas of licensing and distribution. It had been clear long
before work was begun on OEChem that the GPL did not prevent
commercial entities other than OpenEye from using OELib to generate
data that was then included in a commercial product. This was a
shortcoming of the GPL that OpenEye was willing to accept given the
extensive coverage provided by the GPL in other areas of greater
concern. In addition, the GPL did not prevent companies from
including OELib and derivative open-source products as parts of a
commercial distribution. Schrödinger, for example,
packaged the original Babel program along with Jaguar for the
purposes of file format conversion. OELib theoretically could
be used in an identical manner as part of a commercial distribution
without being physically included in a closed-source product. Again,
this was an unintended and undesirable use of OELib from the
perspective of OpenEye, but one that OpenEye was willing to tolerate
because of advantages of the GPL in other areas. The GPL was
meant to foster derivative works of software, with the anticipation
of contributions of the derivative works back into the original work.
GNU/Linux stands as a shining example of how the GPL allows a
community of developers to contribute to a body of software with an
agreed-upon set of rules for appropriate commercial and
non-commercial uses. Free and open-source software works well
when there is little or no barrier to back donation. In many
cases there is a strong incentive for software donation, as with drivers
for a hardware component. OELib, on the other hand, suffered as
an open-source project because its primary user base was individuals
in pharmaceutical companies. Because of the inherent nature of
pharmaceutical companies to protect proprietary data, donation of
source and bug fixes to OELib by programmers outside of OpenEye paled
in comparison to the amount of source being generated within OpenEye.
This statement is not meant to demean the contributions of
notable individuals like Pat Walters, Jens Sadowski, Brian Goldman,
Roger Sayle and others. Still, there were many more developers who
admitted to having forked in-house-only versions of OELib than there
were developers who made significant contributions to the project.
Developers in academia were much more willing to contribute
time and effort to the project, but again, the total contributions
were minimal in comparison to the amount of work being done by
OpenEye employees. In a few rare instances, the contributions
provided by external sources caused instability in releases of
OpenEye products. In summary, the open-source community model
of development was never achieved with OELib. The number of
misuses outnumbered the intended uses.
Again, OELib was a
qualified success. People used it and found it useful. The
failing of OELib was perceived to be the open-source nature of the
project, judged by the minimal number of contributions. At the
time when OpenEye was deciding on the license model for a rewrite of
OELib, it seemed as though there were two clear choices. At the
end of the project the final product could again be released under
the GPL, or it could be released as a commercial product with the API
remaining open. If OpenEye opted for the GPL, the final product
would have to have corners cut in areas like documentation and
testing, given the corporate financial resources available while the ‘new
OELib’ was being written. A commercial release that would
likely recover development costs would allow significantly more
development time and allow a higher quality product to be developed.
Potential and existing customers at the time, when asked their
preference, overwhelmingly voted in favor of the latter choice.
Customers regarded access to source code as secondary to a solid,
well-designed, thoroughly documented product.
Closing
the source code and concentrating on a commercial quality product
solved a number of issues for OpenEye. It relaxed time
constraints on the project since a commercial outcome was
anticipated. Control over the license strategy would prevent
further misuses of the eventual replacement of OELib. In a
company which developed both open- and closed-source software, making
decisions regarding the placement of new functionality could in some
cases be very difficult. In an environment where all source is
closed, releasing powerful features into a library is facilitated
because proprietary methods can be more easily guarded. The end
user gains access to powerful tools which, because of the need to
maintain a competitive advantage, OpenEye would only release behind
the protection of a closed-source system.
Design Flaws in OELib
The previous section gives a historical perspective surrounding the decision to rewrite OELib. Although a policy change regarding open versus closed source was made early on in the process of redesigning OELib, it was by no means a contributing factor in discarding OELib and writing OEChem. The decision to write OEChem originated with a perceived need to permanently resolve the flaws and technical deficiencies inherent in the design of OELib. There are well-defined classes of problems which OELib, and by extension OpenBabel, consistently fail to handle properly. In some cases the failings are unfortunately so fundamental to the design decisions at the lowest levels of OELib that correcting problems by modifying the existing code base would be exceedingly difficult.
The single most significant issue inherent in the design of OELib is the corruption of chemical data due to the inferred valence, formal charge, and pKa models built into OELib. Data in chemical file formats can be inconsistent or simply errant. Molecules can also be represented as a predominant form in a phase other than the targeted modeling environment. For example, acetic acid may be stored in a Sybyl Mol2 file protonated, as it would be in the gas phase, when an end-user application needs to model it in the aqueous phase. OELib’s solution was to normalize a molecule by discarding formal charges and then inferring the valence states and formal charges based on a set of rules. The rules consisted of molecule patterns associated with values for implicit hydrogens and formal charges. Retention of some types of chemical information was difficult or impossible to achieve, because the perception system in OELib always attempted to deduce valence and charge without regard to the input data. File formats that have strict valence rules, such as SMILES and MDL molfiles, cannot be interconverted reliably without potential loss of information, as OELib isn’t designed to retain strictly implied valence states. Furthermore, molecular properties that depend on valence states of atoms cannot, in all cases, be perceived reliably. Aromaticity, for example, is highly dependent on valence states of nitrogen atoms in rings. It can be difficult to simultaneously determine which set of nitrogens in a ring should be protonated (as in pyrrole) and whether the ring system to which the nitrogens belong is aromatic. Since the two are mutually dependent given insufficient valence information, the choices become somewhat arbitrary. Although molecular normalization and pKa state correction are important in many applications, these features should not be tied into the valence model for handling molecules, as they were in the design of OELib.
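The separation argued for above can be illustrated with a minimal C++ sketch. It is not OELib or OEChem code: the names Atom, Molecule, NetCharge, Deprotonated, and MakeAceticAcid are hypothetical, and the deprotonation rule is a toy. The point is only that the stored valence data stays exactly as read, while any charge-state adjustment is an explicit step the caller requests and that returns a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical atom record: holds what the input file stated.
// Nothing here is re-inferred behind the caller's back.
struct Atom {
    std::string element;  // element symbol, e.g. "C" or "O"
    int formalCharge;     // formal charge as given in the input
    int implicitH;        // hydrogen count as given or strictly implied
};

struct Molecule {
    std::vector<Atom> atoms;
};

// Net charge computed from the stored data only.
int NetCharge(const Molecule& mol) {
    int total = 0;
    for (const Atom& a : mol.atoms) total += a.formalCharge;
    return total;
}

// Explicit, opt-in normalization step (toy rule, assumption for
// illustration): remove one hydrogen from the atom the caller selects.
// The input molecule is left untouched; a modified copy is returned.
Molecule Deprotonated(const Molecule& in, std::size_t atomIndex) {
    assert(atomIndex < in.atoms.size());
    Molecule out = in;
    Atom& a = out.atoms[atomIndex];
    if (a.implicitH > 0) {
        a.implicitH -= 1;
        a.formalCharge -= 1;
    }
    return out;
}

// Heavy atoms of acetic acid, CH3-C(=O)-OH, as a file might store them.
Molecule MakeAceticAcid() {
    Molecule m;
    m.atoms.push_back(Atom{"C", 0, 3});
    m.atoms.push_back(Atom{"C", 0, 0});
    m.atoms.push_back(Atom{"O", 0, 0});
    m.atoms.push_back(Atom{"O", 0, 1});
    return m;
}
```

An application that wants the aqueous form would call Deprotonated(acid, 3) itself; a file converter would simply never call it, so the protonated record passes through unchanged.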
Even if a strict valence model were added to OELib, the types of chemical applications that could easily be built were severely limited. OELib lacked extensibility because there was no separation between interface and implementation. The molecule class in OELib was a heavyweight, three-dimensional, multiconformer molecule with a rich set of member functions that performed the vast majority of the operations on a molecule. No provision was made for having alternate molecule implementations. For instance, a database application that only required a lightweight connection table implementation of a molecule could not be written with OELib without completely redesigning a molecule from scratch. Perception routines built into the OELib molecule class could not be reused except by copying and pasting the implementations into another molecule. The behavior of the OELib molecule class could not be altered using standard features of the C++ language because of the manner of declaration of the molecule class. Task-specific molecule implementations could not be easily introduced into OELib. This single feature of OELib severely limits the domains in which it can act as a base chemical library on which applications can be written. Since reusability was a deciding factor in releasing OELib as open-source, the lack of extensibility is an unfortunate failing of the library.
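What separation of interface and implementation buys can be sketched in a few lines of standard C++. The class and function names below (MolBase, LightMol, CoordMol, CountHeavyAtoms) are hypothetical and are not taken from OELib or OEChem; the sketch only assumes that algorithms are written against an abstract base class, so that a lightweight and a heavyweight molecule can share them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical abstract interface: algorithms see only these functions.
class MolBase {
public:
    virtual ~MolBase() {}
    virtual std::size_t NumAtoms() const = 0;
    virtual int AtomicNum(std::size_t idx) const = 0;
};

// Lightweight implementation, e.g. for a database application.
// (A real connection table would also store bonds.)
class LightMol : public MolBase {
public:
    void AddAtom(int atomicNum) { elements_.push_back(atomicNum); }
    std::size_t NumAtoms() const override { return elements_.size(); }
    int AtomicNum(std::size_t idx) const override {
        assert(idx < elements_.size());
        return elements_[idx];
    }
private:
    std::vector<int> elements_;
};

// Heavier implementation that also carries 3D coordinates.
class CoordMol : public MolBase {
public:
    void AddAtom(int atomicNum, double x, double y, double z) {
        Atom3D a = {atomicNum, x, y, z};
        atoms_.push_back(a);
    }
    std::size_t NumAtoms() const override { return atoms_.size(); }
    int AtomicNum(std::size_t idx) const override {
        assert(idx < atoms_.size());
        return atoms_[idx].atomicNum;
    }
private:
    struct Atom3D { int atomicNum; double x, y, z; };
    std::vector<Atom3D> atoms_;
};

// A routine written once against the interface and reused by every
// implementation, with no copying and pasting between molecule classes.
std::size_t CountHeavyAtoms(const MolBase& mol) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < mol.NumAtoms(); ++i) {
        if (mol.AtomicNum(i) > 1) ++n;  // anything heavier than hydrogen
    }
    return n;
}
```

Adding a third, task-specific molecule then means deriving one more class from MolBase; CountHeavyAtoms and any other routine written against the interface works with it unmodified.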
Extensibility would have been improved in OELib had there been separation between the interface and implementation, but a well-designed application programming interface (API) would have been necessary as well. Again, the task-specific nature of the initial design of OELib to handle three-dimensional multiconformer molecules was reflected in the interface. As new applications were encountered, OELib’s API was modified and extended. One of the design goals of OELib should have been to achieve a reasonably constant API. Many of the numerous changes to the OELib API caused derived applications to be modified as well. Although it was hoped that OELib would become a ubiquitous library on which numerous chemistry applications would be built, a highly mutable API detracted from application programmers’ productivity.
Along with the more fundamental failings of OELib, there were a number of correctable but nagging issues with the library. In some cases OELib sported weak implementations that were prone to error. The SMILES reader, for example, does not always retain correct stereochemistry around chiral centers. The Mol2 writer does not always assign atom types according to the Mol2 format specification. Such deficient implementations are correctable, but in some cases the corrections would require rewriting significant sections of code. Although the molecule class in OELib was designed to store multiple conformers, it could have done it in a much more natural way by making conformers first-class objects. In other words, conformers could have acted as molecules themselves instead of being just another set of coordinates in a multiconformer molecule. Much of the functionality in OELib was implemented at a very high level to make application programming easier. In reality, lower level interfaces should have been provided in addition to high level interfaces to harness the full power of the implementations. OELib also had a single internal notion of chemistry, which meant that perception routines that identified features such as aromaticity and atom types frequently did not exactly agree with specifications provided by other companies. The interoperability of OELib with other software packages was therefore limited as the data influx and efflux were suspect.
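The idea of conformers as first-class objects can be sketched as follows. This is an illustration under stated assumptions, not the design of either library: MolView, Conformer, MultiConfMol, and Centroid are hypothetical names, and the molecule is reduced to an atom count plus coordinate sets. Each conformer is an object that satisfies the same molecule-like interface as anything else, so a routine written for a molecule accepts a conformer directly.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::array<double, 3> Point;

// Hypothetical molecule-like interface shared by molecules and conformers.
class MolView {
public:
    virtual ~MolView() {}
    virtual std::size_t NumAtoms() const = 0;
    virtual Point Coords(std::size_t atom) const = 0;
};

class MultiConfMol;

// A conformer is a lightweight view onto its parent: it shares the
// parent's atoms and supplies its own coordinate set.
class Conformer : public MolView {
public:
    Conformer(const MultiConfMol& parent, std::size_t index)
        : parent_(&parent), index_(index) {}
    std::size_t NumAtoms() const override;
    Point Coords(std::size_t atom) const override;
private:
    const MultiConfMol* parent_;
    std::size_t index_;
};

class MultiConfMol {
public:
    explicit MultiConfMol(std::size_t numAtoms) : numAtoms_(numAtoms) {}
    std::size_t NumAtoms() const { return numAtoms_; }
    std::size_t NumConformers() const { return coordSets_.size(); }
    void AddConformer(const std::vector<Point>& coords) {
        assert(coords.size() == numAtoms_);
        coordSets_.push_back(coords);
    }
    // Hands out a conformer object rather than a raw coordinate array.
    Conformer GetConformer(std::size_t i) const {
        assert(i < coordSets_.size());
        return Conformer(*this, i);
    }
    const std::vector<Point>& CoordSet(std::size_t i) const {
        return coordSets_[i];
    }
private:
    std::size_t numAtoms_;
    std::vector<std::vector<Point> > coordSets_;
};

inline std::size_t Conformer::NumAtoms() const {
    return parent_->NumAtoms();
}
inline Point Conformer::Coords(std::size_t atom) const {
    return parent_->CoordSet(index_)[atom];
}

// Works on anything molecule-like, including a single conformer.
Point Centroid(const MolView& mol) {
    Point c = {{0.0, 0.0, 0.0}};
    const std::size_t n = mol.NumAtoms();
    for (std::size_t i = 0; i < n; ++i) {
        const Point p = mol.Coords(i);
        for (std::size_t k = 0; k < 3; ++k) c[k] += p[k];
    }
    if (n > 0) {
        for (std::size_t k = 0; k < 3; ++k) c[k] /= static_cast<double>(n);
    }
    return c;
}
```

In this sketch a conformer stays valid only while its parent molecule is alive and unmodified, a trade-off a real design would have to address explicitly.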
In summary, OELib and OpenBabel contain numerous design limitations, deficiencies and flaws. The utility of OELib is in many cases limited, but not obviated, by the fundamental issues of the library. Indeed, at the time that this document was authored, more applications existed built on OELib than on OEChem. At some point in the future that will almost certainly not be the case. Every shortcoming, bug, and general failure of OELib informed the design and implementation of OEChem. In design and implementation, OELib and OpenBabel in anything resembling their existing forms will never achieve the extensibility, utility, and robustness of OEChem.
Conclusions
Exploring the history and origins
of OEChem and OELib provides a framework for understanding the
reasonable domains and utility of the respective libraries. OELib is
fine for use where minor amounts of data corruption are acceptable.
Molecular modeling is fraught with assumptions and errors that in
many cases are larger than the errors introduced into molecular
models by OELib. The open-source nature limits the commercial
utility of OELib; however, for instructional purposes and extension
into other open-source projects it is ideal. OELib effectively
provides “quick and dirty” solutions to a large number of
problems in handling molecules in a computer. OEChem, on the other
hand, is designed to be as error-free as possible when handling
chemical data. Molecular modeling is so fraught with
assumptions and errors that no additional uncertainties should be
introduced by a chemical toolkit or application. While the
closed-source commercial nature of OEChem limits its utility to
open-source projects, it is a well-documented, extensively tested,
robust, extensible, well-designed toolkit that can serve as
the basis for a vast array of possible chemical software applications.
If data integrity, speed of authoring applications, or execution time
are project requirements, then OEChem is clearly the correct choice.
OEChem far surpasses the original vision of the OELib project.
Unless a project is dependent on open-source or free software, OEChem
is a superior choice.