A New Look at Clustering Large Datasets
by Robin Hewitt, Feb 2003

Abstract
This talk presents two new algorithms that use descriptor-count information to improve the performance of standard chemical clustering methods. Descriptor-count information is a neglected resource that can be used to avoid comparing every possible pair of compounds when clustering. Performance tests on industry datasets show that, with these algorithms, far fewer than N² compound-to-compound comparisons (N being the number of compounds) are required to achieve good clustering. Further, the larger the dataset, the greater the performance improvement.

Contents

  1. Introduction
  2. Descriptor Count
  3. Assumptions and Limitations
  4. BOOST (Bettering-Our-Odds Sort Tree)
    1. Theory and Method
    2. Performance
  5. DiET (Directed Exploration of Territories)
  6. Example Clusters
  7. Overall Performance
  8. Summary
  9. Thanks and Acknowledgements