A New Look at Clustering Large Datasets
by Robin Hewitt, Feb 2003
Abstract
This talk presents two new algorithms that use descriptor-count
information to improve the performance of standard chemical
clustering methods. Descriptor-count information is a neglected
resource that can be leveraged to avoid comparing every possible
pair of compounds when clustering. Performance tests on industry
datasets demonstrate that, with these algorithms, far fewer than
N² compound-to-compound comparisons (where N is the number of
compounds) are required to achieve good clustering. Further, the
larger the dataset, the greater the performance improvement.
Contents
- Introduction
- Descriptor Count
- Assumptions and Limitations
- BOOST (Bettering-Our-Odds Sort Tree)
- Theory and Method
- Performance
- DiET (Directed Exploration of Territories)
- Example Clusters
- Overall Performance
- Summary
- Thanks and Acknowledgements