MUG '00 -- 14th Daylight User Group Meeting -- 22-25 Feb 2000

Ties and Clustering

John MacCuish and Christos Nicolaou

Norah MacCuish


Clustering methods such as Ward's or complete link are commonly used in compound selection and diversity analysis, where binary representations of chemical structures are used to cluster the compound data. However, such methods can generate ambiguous results due to ties in proximity, i.e., compounds or clusters of compounds being equidistant from more than one compound or cluster in a given collection. These ties can affect downstream results, such as those obtained from level selection techniques employed to find the best set of clusters. The magnitude of the problem increases with the number of compounds to be clustered, and depends on the distribution and number of possible dissimilarities, given the length of the binary representation and the dissimilarity measure used.
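
To make the source of ties concrete, the following is a minimal sketch, not the authors' method. It assumes the Tanimoto dissimilarity (the abstract does not name a specific measure) and fingerprints stored as Python integers; all function names are hypothetical. It enumerates the dissimilarity values that a binary representation of a given length can produce, and counts how many fingerprints in a collection have a non-unique nearest neighbour.

```python
from fractions import Fraction


def tanimoto_dissimilarity(a: int, b: int) -> Fraction:
    """1 - |a AND b| / |a OR b| for fingerprints stored as ints.

    Assumption: two all-zero fingerprints are given dissimilarity 0.
    Exact fractions are used so that ties are not hidden by rounding.
    """
    union = bin(a | b).count("1")
    if union == 0:
        return Fraction(0)
    common = bin(a & b).count("1")
    return 1 - Fraction(common, union)


def possible_dissimilarities(n_bits: int) -> set:
    """Every Tanimoto dissimilarity attainable with n_bits-long fingerprints."""
    values = {Fraction(0)}
    for union in range(1, n_bits + 1):      # bits set in either fingerprint
        for common in range(union + 1):     # bits set in both fingerprints
            values.add(1 - Fraction(common, union))
    return values


def count_tied_nearest_neighbours(fps) -> int:
    """Number of fingerprints whose nearest neighbour is not unique.

    Expects at least two fingerprints.
    """
    tied = 0
    for i, a in enumerate(fps):
        dists = [tanimoto_dissimilarity(a, b)
                 for j, b in enumerate(fps) if j != i]
        if dists.count(min(dists)) > 1:
            tied += 1
    return tied
```

For example, `possible_dissimilarities(4)` contains only 7 distinct values, so any collection with many more pairwise comparisons than that must repeat dissimilarities. In the three-fingerprint collection `[0b1100, 0b0110, 0b1001]`, the first fingerprint is at dissimilarity 2/3 from both of the others, so a clustering method has to break that tie arbitrarily.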


Daylight Chemical Information Systems, Inc.