Ties and Clustering
February 24, Santa Fe, NM
John MacCuish (Bioreason, Inc.)
Christos Nicolaou (Bioreason, Inc.)
Norah MacCuish (Daylight, CIS)
The Problem with Clustering Bit Vector Representations
Two implementations of same clustering algorithm, or a single algorithm not stable with respect to input order.
Nevertheless, two different, but correct results
Ambiguity due to ties in proximity
Is this bad? How bad?
We cluster to find natural groups in data
Yet our "best" (favorite) algorithm may produce numerous vastly different clusterings.
How many ties in the dataset (e.g., in the proximity matrix, near neighbor table)
The representation effects on the number of ties
The effect of ambiguous ties in various clustering algorithms
Exploring the Problem
Similarity (Dissimilarity) Measures
Theory: The Actors in the Argument
Length of bit vector -- can be anywhere from 166 (public MACCS) to 2,048 (Daylight), or even larger - anywhere from 5,000 to 10,000 for BCI keys
Number of compounds to be clustered
Number of proximities given a similarity measure ... and their distribution (Godden, Xue, Bajorath, JCICS, '00, 40).
Susceptibility of clustering algorithms to ambiguous ties
in proximity (-ties).
Why are there so many ties: Birthday Problem
In a classroom of 23 (random) students, there is a greater than 50% chance that two of them will have the same birthday - a tie - in spite of the fact that there are 365 possible birthdays.
In clustering, we can think of all of the pair-wise compound
proximities as the students' birthdays. We need to know what is the total
possible number of proximities - analogous to 365 days in a year - given
a similarity measure and a fixed length bit vector.
Total Possible Proximities, given length N bit vectors
the sum of the Farey series N (sum of Euler's function from 1 to N)
Likelihood of Ties in Proximity
If n equals the total possible number of proximities, and m equal the total number of pair-wise proximities generated from a random (like our students) set of bit vectors. Then, if
the probability that there will be no ties is at most
To get a feel how the number of ties increases as m
increases with respect to n, it can be shown that for m=n,
for large n,the expected number of ties approaches .
Proximities are not necessarily equi-probable (uniformly distributed), but rather weighted probabilities
Euclidean proximities of bit vectors has a gaussian-like distribution centered at 0.5 (distances normalized between 0 and 1) for all possible bit vector proximities.
Tanimoto similarity has a log-normal-like shape, but is
a fractal-like distribution (self similar and scale invariant across a
certain range) with the mode at .
Euclidean-Soergel product is similar to the Tanimoto with a different shape and fractal pattern.
Ties and the Number of Compounds
In the birthday problem setting, m is the number of student birthdays, but in our problem m is really C(C-1)/2, where C is the number of compounds.
under the assumption that proximities are equi-probable, then it follows that if we want to be assured that there are few or no ties
Some Horrifying Numbers
But the (dis)similarity values are not equi-probable.
In fact, a small fraction of the Tanimoto values (reduced fractions containing
low order relative primes, i.e., ,
etc.) contain most of the probability.
Ambiguous Ties in Clustering
Clustering Algorithms in the Cheminfomatics Literature
Weighted Probabilities of the Tanimoto Similarities
Enumeration of a specific Tanimoto fraction (e.g., ):
Probability of Tanimoto reduced fraction (e.g., ), R:
It can be easily seen that the reduced Tanimoto fraction,
a+b+c=N has a
probability far less than where the reduced fraction has
Number of Compounds given the Tanimoto Similarity
Conjecture that for the Tanimoto
|Data Set||# of Compounds||Bit Vector||Length|
The jth nearest-neighbor cutoff may be tied to the jth +1nearest-neighbor, or repeatedly tied (jth +1, jth +2, etc.).
For an example we generated the nearest neighbor table
for 1,990 compounds from the SPRESI95demo database, using the Tanimoto
measure with Daylight fingerprints.
|jth Nearest Neighbor||Total Number of Ties|
Careful consideration of the use of:
Multi-domain solutions for clustering bit vector representations
More thorough and systematic empirical study, employing different algorithms, data sets, measures, and bit vectors.
Designing or modifying clustering algorithms, such that they avoid or incorporate ambiguous ties in proximity.
Impact of measure distributions and ties in database searching.
Conjectures and Open Problems
Number of possible proximities when using the Euclidean-Soergel measure?
To avoid ties when using the Tanimoto?