
OEChem or OELib?
How OELib Became Open-Source
For the uninitiated, OELib is a free (that is libre, not gratis according to the Free Software Foundation) open-source C++ molecule toolkit. It began with the need for a molecule handling toolkit on which OpenEye Scientific Software could base some of its products. As OELib grew and became more useful, a small number of users outside of OpenEye found that it had utility in their own work. In an attempt to prevent the further reinvention of the programmatic wheel with respect to molecule handling functionality, OELib was released under the GNU General Public License (GPL). The GPL was simply a convenient mechanism to prevent OELib from being incorporated into commercial closed-source projects. By attaching the GPL to OELib, it remained accessible to the desired audience of developers trying to solve their own problems but not to commercial entities trying to make money by solving other people's problems. The second paragraph of the OELib Primer reads:
I admire open-source projects. Linux and the Free Software Foundation are both great examples of how summing the spare time of Geeks and Hackers can result in useful projects. Although the potential audience is small, I hope to take advantage many of the great minds writing chemical software by making OELib available to others. Two things can be accomplished by releasing source code to OELib. First, development time can be shortened by basing projects on OELib. The less people have to reinvent the wheel (or the function) the better. Second, by releasing the source code hopefully other programmers can contribute to the project. Joe Corkery, Brian Goldman, Anthony Nicholls, Roger Sayle, and Pat Walters have already made significant contributions to OELib. As the list of contributors grows all the users of OELib benefit.
OELib became a qualified success. By the summer of 2002, there were (very conservatively) over 30 individuals who had written or were writing software using OELib. The majority of the software applications sold by OpenEye are based on OELib. Released under an alternate (non-GPL) license, OELib became the basis on which Dock V and a commercial software product from Lion Biosciences were written. OELib development also forked into the OpenBabel project, which at the time of this writing was being actively developed on SourceForge (http://openbabel.sourceforge.net/). The original intent of OELib was to provide a library of existing functionality on which those authoring chemical software could build. By that measure, its adoption by developers inside and outside of OpenEye made OELib a qualified success.
Policy Change and the ‘New OELib’
In May of 2001, members of OpenEye Scientific Software began a de novo redesign and rewrite of OELib. By the end of August 2001, the decision was made that although the API of the OELib rewrite would be open and published, the source code of the project would be closed. The rewrite of OELib, ultimately to be called OEChem, was released in August, 2002. This document attempts to explain why the rewrite of OELib was necessary, the relationship between OELib and OEChem, and why OpenEye is releasing OEChem as closed-source instead of the GPL model of OELib.
The rewrite of OELib was prompted by a number of design flaws and technical failings of the library. Details of the perceived deficiencies are covered in a later section in this document. It was clear to members of OpenEye that continued development of and reliance on the existing version of OELib were going to have long-term deleterious effects on the company. The amount of programmer time spent internally fixing acute bugs in OELib was already detracting from the product development cycle. Design flaws in OELib caused chronic failures, shortcomings, and inconsistencies that were deemed to be impossible to address by modifications to the code base. The need for a complete rewrite thus became painfully obvious. A project that began life being called 'new OELib' took on a very different character than its predecessor and ultimately became OEChem.
In the initial design phase of OEChem, all aspects of the OELib project were questioned in order to identify its shortcomings and avoid repeating mistakes. The examination soon stretched beyond technical issues of how OEChem should work into areas of licensing and distribution. It had been clear long before work was begun on OEChem that the GPL did not prevent commercial entities other than OpenEye from using OELib to generate data that was then included in a commercial product. This was a shortcoming of the GPL that OpenEye was willing to accept given the extensive coverage provided by the GPL in other areas of greater concern. In addition, the GPL did not prevent companies from including OELib and derivative open-source products as parts of a commercial distribution. Schrödinger, for example, packaged the original Babel program along with Jaguar for the purposes of file format conversion. OELib theoretically could be used in an identical manner as part of a commercial distribution without being physically included in a closed-source product. Again, this was an unintended and undesirable use of OELib from the perspective of OpenEye, but one that OpenEye was willing to tolerate because of advantages of the GPL in other areas.
The GPL was meant to foster derivative works of software, with the anticipation of contributions of the derivative works back into the original work. GNU/Linux stands as a shining example of how the GPL allows a community of developers to contribute to a body of software with an agreed-upon set of rules for appropriate commercial and non-commercial uses. Free and open-source software works well when there is little or no barrier to back donation. In many cases there is strong incentive for software donation, such as drivers for a hardware component. OELib, on the other hand, suffered as an open-source project because its primary user base was individuals in pharmaceutical companies.
Because of the inherent nature of pharmaceutical companies to protect proprietary data, donation of source and bug fixes to OELib by programmers outside of OpenEye paled in comparison to the amount of source being generated within OpenEye. This statement is not meant to demean the contributions of notable individuals like Pat Walters, Jens Sadowski, Brian Goldman, Roger Sayle and others. There were many more developers who admitted to having forked in-house only versions of OELib than there were developers who made significant contributions to the project. Developers in academia were much more willing to contribute time and effort to the project, but again, the total contributions were minimal in comparison to the amount of work being done by OpenEye employees. In a few rare instances, the contributions provided by external sources caused instability in releases of OpenEye products. In summary, the open-source community model of development was never achieved with OELib. The number of misuses outnumbered the intended uses.
Again, OELib was a qualified success. People used it and found it useful. The perceived failing of OELib was the open-source nature of the project, judged by the minimal number of contributions. At the time when OpenEye was deciding on the license model for a rewrite of OELib it seemed as though there were two clear choices. At the end of the project the final product could again be released under the GPL, or it could be released as a commercial product with the API remaining open. If OpenEye opted for the GPL, the final product would have to have corners cut in areas like documentation and testing, given the corporate financial resources available while the ‘new OELib’ was being written. A commercial release that would likely recover development costs would allow significantly more development time and allow a higher quality product to be developed. Potential and existing customers at the time, when asked their preference, overwhelmingly voted in favor of the latter choice. Customers regarded access to source code as secondary to a solid, well-designed, thoroughly documented product.
Closing the source code and concentrating on a commercial quality product solved a number of issues for OpenEye. It relaxed time constraints on the project since a commercial outcome was anticipated. Control over the license strategy would prevent further misuses of the eventual replacement of OELib. In a company which developed both open- and closed-source software, making decisions regarding the placement of new functionality could in some cases be very difficult. In an environment where all source is closed, releasing powerful features into a library is facilitated because proprietary methods can be more easily guarded. The end user gains access to powerful tools which, because of the need to maintain a competitive advantage, OpenEye would only release behind the protection of a closed-source system.
Design Flaws in OELib
The previous section gives a historical perspective surrounding the decision to rewrite OELib. Although a policy change regarding open versus closed source was made early on in the process of redesigning OELib, it by no means was a contributing factor in discarding OELib and writing OEChem. The decision to write OEChem originated with a perceived need to permanently resolve the flaws and technical deficiencies inherent in the design of OELib. There are well-defined classes of problems which OELib, and by extension OpenBabel, consistently fail to handle properly. In some cases the failings are unfortunately so fundamental to the design decisions at the lowest levels of OELib, that correcting problems by modifying the existing code base would be exceedingly difficult.
The single most significant issue inherent in the design of OELib is the corruption of chemical data due to the inferred valence, formal charge, and pKa models built into OELib. Data in chemical file formats can be inconsistent or simply errant. Molecules can also be represented as a predominant form in a phase other than the targeted modeling environment. For example, acetic acid may be stored in a Sybyl Mol2 file protonated, as it would be in the gas phase, when an end-user application needs to model it in the aqueous phase. OELib’s solution was to normalize a molecule by discarding formal charges and then inferring the valence states and formal charges based on a set of rules. The rules consisted of molecule patterns associated with values for implicit hydrogens and formal charges. Retention of some types of chemical information was difficult or impossible to achieve, because the perception system in OELib without fail attempted to deduce valence and charge without regard to the input data. File formats that have strict valence rules, such as SMILES and MDL molfiles, cannot be interconverted reliably without potential loss of information, as OELib isn’t designed to retain strictly implied valence states. Furthermore, molecular properties that depend on valence states of atoms cannot, in all cases, be perceived reliably. Aromaticity, for example, is highly dependent on valence states of nitrogen atoms in rings. It can be difficult to simultaneously determine which set of nitrogens in a ring should be protonated (as in pyrrole) and whether the ring system to which the nitrogens belong is aromatic. Since the two are mutually dependent given insufficient valence information, the choices become somewhat arbitrary. Although molecular normalization and pKa state correction are important in many applications, these features should not be tied into the valence model for handling molecules as designed in OELib.
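The separation argued for above can be sketched in C++. This is only an illustration under stated assumptions, not code from OELib or OEChem: the Atom record and the ReadAtom and DeprotonateHydroxyl functions are hypothetical names, and the deprotonation rule is a deliberately crude toy. The point is that the reader stores formal charge and hydrogen count exactly as given, while any charge-state adjustment is a separate step that the caller must request.

```cpp
#include <string>

// Hypothetical minimal atom record; not the OELib or OEChem data model.
struct Atom {
    std::string element;
    int formalCharge;  // stored exactly as supplied by the input
    int implicitH;     // stored exactly as supplied by the input
};

// Strict reader step: keep what the input said and infer nothing.
Atom ReadAtom(const std::string& element, int formalCharge, int implicitH) {
    return Atom{element, formalCharge, implicitH};
}

// Optional, explicit normalization step (toy rule, for illustration only):
// turn a neutral hydroxyl oxygen into its deprotonated, negatively charged form.
// Returns true if the atom was modified, false if it was left untouched.
bool DeprotonateHydroxyl(Atom& a) {
    if (a.element == "O" && a.formalCharge == 0 && a.implicitH == 1) {
        a.formalCharge = -1;
        a.implicitH = 0;
        return true;
    }
    return false;
}
```

Because normalization is never applied implicitly, a file conversion that only calls ReadAtom would write back the same valence and charge it read, and a modeling application that needs an aqueous-phase form can opt into the adjustment.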
Even if a strict valence model had been added to OELib, the types of chemical applications that could easily be built would still have been severely limited. OELib lacked extensibility because there was no separation between interface and implementation. The molecule class in OELib was a heavyweight, three-dimensional, multiconformer molecule with a rich set of member functions that performed the vast majority of the operations on a molecule. No provision was made for having alternate molecule implementations. For instance, a database application that only required a lightweight connection table implementation of a molecule could not be written with OELib without completely redesigning a molecule from scratch. Perception routines built into the OELib molecule class could not be reused except by copying and pasting the implementations into another molecule. The behavior of the OELib molecule class could not be altered using standard features of the C++ language because of the manner of declaration of the molecule class. Task-specific molecule implementations could not be easily introduced into OELib. This single feature of OELib severely limits the domains in which it can act as a base chemical library on which applications can be written. Since reusability was a deciding factor in releasing OELib as open-source, the lack of extensibility is an unfortunate failing of the library.
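To make the interface/implementation separation concrete, the following is a minimal C++ sketch. It does not reproduce the OELib or OEChem API: MolBase, ConnectionTableMol, and CountHeavyAtoms are hypothetical names chosen for illustration. An abstract base class defines what a molecule can be asked, a lightweight connection-table class provides one implementation, and a routine written against the base class works with any implementation.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical abstract molecule interface (illustrative names only).
class MolBase {
public:
    virtual ~MolBase() = default;
    virtual std::size_t NumAtoms() const = 0;
    virtual std::size_t NumBonds() const = 0;
    virtual int AtomicNum(std::size_t idx) const = 0;
};

// Lightweight connection-table implementation: no coordinates, no conformers.
class ConnectionTableMol : public MolBase {
public:
    // Appends an atom and returns its index.
    std::size_t AddAtom(int atomicNum) {
        atoms_.push_back(atomicNum);
        return atoms_.size() - 1;
    }
    // Records a bond between two atom indices.
    void AddBond(std::size_t a, std::size_t b) { bonds_.emplace_back(a, b); }

    std::size_t NumAtoms() const override { return atoms_.size(); }
    std::size_t NumBonds() const override { return bonds_.size(); }
    int AtomicNum(std::size_t idx) const override { return atoms_.at(idx); }

private:
    std::vector<int> atoms_;
    std::vector<std::pair<std::size_t, std::size_t>> bonds_;
};

// A routine written once against the interface; it would work unchanged
// with a heavyweight three-dimensional implementation of MolBase.
std::size_t CountHeavyAtoms(const MolBase& mol) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < mol.NumAtoms(); ++i) {
        if (mol.AtomicNum(i) > 1) ++n;
    }
    return n;
}
```

With this arrangement a database application could supply only the connection-table class, while perception-style routines such as CountHeavyAtoms are reused rather than copied into each new molecule type.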
Extensibility would have been improved in OELib had there been separation between the interface and implementation, but a well-designed application programming interface (API) would have been necessary as well. Again, the task-specific nature of the initial design of OELib to handle three-dimensional multiconformer molecules was reflected in the interface. As new applications were encountered, OELib’s API was modified and extended. One of the design goals of OELib should have been to achieve a reasonably constant API. Many of the numerous changes to the OELib API forced derived applications to be modified as well. Although it was hoped that OELib would become a ubiquitous library on which numerous chemistry applications would be built, a highly mutable API detracted from application programmers’ productivity.
Along with the more fundamental failings of OELib, there were a number of correctable but nagging issues with the library. In some cases OELib sported weak implementations that were prone to error. The SMILES reader, for example, does not always retain correct stereochemistry around chiral centers. The Mol2 writer does not always assign atom types according to the Mol2 format specification. Such deficient implementations are correctable, but in some cases the corrections would require rewriting significant sections of code. Although the molecule class in OELib was designed to store multiple conformers, it could have done so in a much more natural way by making conformers first-class objects. In other words, conformers could have acted as molecules themselves instead of being just another set of coordinates in a multiconformer molecule. Much of the functionality in OELib was implemented at a very high level to make application programming easier. In reality, lower level interfaces should have been provided in addition to high level interfaces to harness the full power of the implementations. OELib also had a single internal notion of chemistry, which meant that perception routines that identified features such as aromaticity and atom types frequently did not exactly agree with specifications provided by other companies. The interoperability of OELib with other software packages was therefore limited as the data influx and efflux were suspect.
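The idea of conformers as first-class objects can be illustrated with a short C++ sketch. The class names MultiConfMol and Conformer and all of their member functions are hypothetical and are not taken from OELib or OEChem. Each Conformer is a lightweight handle that answers molecule-style queries itself, so code written for a single molecule can be handed a conformer directly.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Coords = std::vector<std::array<double, 3>>;

// Hypothetical multiconformer molecule: one atom list, many coordinate sets.
class MultiConfMol {
public:
    explicit MultiConfMol(std::vector<int> atomicNums)
        : atoms_(std::move(atomicNums)) {}

    std::size_t NumAtoms() const { return atoms_.size(); }
    int AtomicNum(std::size_t i) const { return atoms_.at(i); }
    std::size_t NumConformers() const { return confs_.size(); }

    // Stores one coordinate set and returns its conformer index.
    std::size_t AddConformer(Coords xyz) {
        if (xyz.size() != atoms_.size())
            throw std::invalid_argument("one coordinate triple per atom required");
        confs_.push_back(std::move(xyz));
        return confs_.size() - 1;
    }
    const Coords& ConformerCoords(std::size_t c) const { return confs_.at(c); }

private:
    std::vector<int> atoms_;
    std::vector<Coords> confs_;
};

// First-class conformer: behaves like a single-conformer molecule.
// It refers to its parent, so it must not outlive the parent molecule.
class Conformer {
public:
    Conformer(const MultiConfMol& parent, std::size_t index)
        : parent_(&parent), index_(index) {}

    std::size_t NumAtoms() const { return parent_->NumAtoms(); }
    int AtomicNum(std::size_t i) const { return parent_->AtomicNum(i); }
    const std::array<double, 3>& Position(std::size_t i) const {
        return parent_->ConformerCoords(index_).at(i);
    }

private:
    const MultiConfMol* parent_;
    std::size_t index_;
};
```

The connection table is stored once in the parent, and each conformer contributes only its coordinates, which keeps the handle cheap to copy while still presenting a molecule-like interface.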
In summary, OELib and OpenBabel contain numerous design limitations, deficiencies and flaws. The utility of OELib is in many cases limited, but not obviated, by the fundamental issues of the library. Indeed, at the time this document was authored, more applications existed built on OELib than on OEChem. At some point in the future that will almost certainly not be the case. The lessons of every shortcoming, bug, and general failure of OELib went into the design and implementation of OEChem. In design and implementation, OELib and OpenBabel in anything resembling their existing forms will never achieve the extensibility, utility, and robustness of OEChem.
Conclusions
Exploring the history and origins of OEChem and OELib provides a framework for understanding the reasonable domains and utility of the respective libraries. OELib is fine for use where minor amounts of data corruption are acceptable. Molecular modeling is fraught with assumptions and errors that in many cases are larger than the errors introduced into molecular models by OELib. The open-source nature limits the commercial utility of OELib; however, for instructional purposes and extension into other open-source projects it is ideal. OELib effectively provides “quick and dirty” solutions to a large number of problems in handling molecules in a computer. OEChem, on the other hand, is designed to be as error free as possible when handling chemical data. Molecular modeling is so fraught with assumptions and errors that no additional uncertainties should be introduced by a chemical toolkit or application. While the closed-source commercial nature of OEChem limits its utility to open-source projects, it is a well-documented, extensively tested, robust, extensible, well-designed toolkit that has the ability to be the basis for a vast array of possible chemical software applications. If data integrity, speed of authoring applications, or execution time are project requirements then OEChem is clearly the correct choice. OEChem far surpasses the original vision of the OELib project. Unless a project is dependent on open-source or free software, OEChem is a superior choice.