Ties and Clustering

MUG '00

February 24, Santa Fe, NM

John MacCuish (Bioreason, Inc.)

Christos Nicolaou (Bioreason, Inc.)

Norah MacCuish (Daylight, CIS)


Clustering Applications

Size of a clustering application - ranges from 102 to 106


The Problem with Clustering Bit Vector Representations

Two implementations of same clustering algorithm, or a single algorithm not stable with respect to input order.

Nevertheless, two different, but correct results

Ambiguity due to ties in proximity


Is this bad?  How bad?

We cluster to find natural groups in data

Yet our "best" (favorite) algorithm may produce numerous vastly different clusterings.

Three Issues

How many ties in the dataset (e.g., in the proximity matrix, near neighbor table)

The representation effects on the number of ties

The effect of ambiguous ties in various clustering algorithms


Exploring the Problem



Bit Vectors



Similarity (Dissimilarity) Measures



Theory: The Actors in the Argument

Length of bit vector -- can be anywhere from 166 (public MACCS) to 2,048 (Daylight), or even larger - anywhere from 5,000 to 10,000 for BCI keys

Number of compounds to be clustered

Number of proximities given a similarity measure ... and their distribution (Godden, Xue, Bajorath, JCICS, '00, 40).

Susceptibility of clustering algorithms to ambiguous ties in proximity (\( \alpha \)-ties).


Why are there so many ties: Birthday Problem

In a classroom of 23 (random) students, there is a greater than 50% chance that two of them will have the same birthday - a tie - in spite of the fact that there are 365 possible birthdays.

In clustering, we can think of all of the pair-wise compound proximities as the students' birthdays. We need to know what is the total possible number of proximities - analogous to 365 days in a year - given a similarity measure and a fixed length bit vector.


Total Possible Proximities, given length N bit vectors



Likelihood of Ties in Proximity

If n equals the total possible number of proximities, and m equal the total number of pair-wise proximities generated from a random (like our students) set of bit vectors. Then, if

\begin{displaymath}m=\left\lceil \sqrt{2n}+1\right\rceil \end{displaymath}

the probability that there will be no ties is at most \( \frac{1}{e} \)

To get a feel how the number of ties increases as m increases with respect to n, it can be shown that for m=n, for large n,the expected number of ties approaches \( \frac{n}{e}=\frac{m}{e} \).


Proximities are not necessarily equi-probable (uniformly distributed), but rather weighted probabilities

Euclidean proximities of bit vectors has a gaussian-like distribution centered at 0.5 (distances normalized between 0 and 1) for all possible bit vector proximities.

Tanimoto similarity has a log-normal-like shape, but is a fractal-like distribution (self similar and scale invariant across a certain range) with the mode at \( \frac{1}{3} \).
Euclidean-Soergel product is similar to the Tanimoto with a different shape and fractal pattern.


Ties and the Number of Compounds

In the birthday problem setting, m is the number of student birthdays, but in our problem m is really C(C-1)/2, where C is the number of compounds.


\begin{displaymath}m=\left\lceil \sqrt{2n}+1\right\rceil =C(C-1)/2\end{displaymath}

under the assumption that proximities are equi-probable, then it follows that if we want to be assured that there are few or no ties

\begin{displaymath}C\leq 4\left( n^{1/4}\right) \end{displaymath}


Some Horrifying Numbers
Measure N Total Prox(n) \( C\leq 4\left( n^{1/4}\right) \)
Euclid 166 167 7
Euclid 1,024 1,025 11
Euclid 2,048 2,049 13
Euclid 4,096 4,097 16
Tani 166 9,600 20
Tani 1,024 328,969 48
Tani 2,048 1,297,444 67
Tani 4,096 5,148,814 95

But the (dis)similarity values are not equi-probable. In fact, a small fraction of the Tanimoto values (reduced fractions containing low order relative primes, i.e., \( \frac{1}{2},\frac{1}{3},\frac{2}{5} \), etc.) contain most of the probability.


Ambiguous Ties in Clustering

Susceptibility of clustering algorithm to ties in proximity.
Distribution of bit vectors - compound set dependent
Both 1 and 2 in combination with the likelihood of ties in proximity - namely, length of the bit vector, similarity measure used, and the number of compounds.


Clustering Algorithms in the Cheminfomatics Literature


Weighted Probabilities of the Tanimoto Similarities
Enumeration of a specific Tanimoto fraction (e.g., \( \frac{2}{4} \)):

\begin{displaymath}\sum ^{b+c}_{k=0}\left( \begin{array}{c}N\\a+k\end{array}\right) \end{displaymath}

 Probability of Tanimoto reduced fraction (e.g., \( \frac{1}{2} \)), R:

\begin{displaymath}P\left( R\right) =\frac{\sum ^{\left\lfloor \frac{N}{a+b+c}\r......}N\\ia+j\end{array}\right) }{2^{N-1}\left( 2^{N}-1\right) }\end{displaymath}

It can be easily seen that the reduced Tanimoto fraction, where a=N-1and a+b+c=N has a probability far less than where the reduced fraction has a=1, and a+b+c=2.


Number of Compounds given the Tanimoto Similarity

Conjecture that for the Tanimoto

\begin{displaymath}C\leq 4\left( \frac{n^{1/4}}{O\left( \log N\right) }\right) \end{displaymath}


Empirical Experiments



Agglomerative (Wards)
Data Set # of Compounds Bit Vector Length
MAO 290 BR-MACCS 157
NCI 675 BR-MACCS 157
Interbioscreen 1,049 Daylight 1,024

Data Set Measure \( \alpha \)-Ties Hierarchies
MAO Euclid 21 ~4.7 mil.
MAO Tani 5 32
NCI Euclid 19 ~0.5 mil.
NCI Tani 5 32
Interbioscreen Euclid 9 512
Interbioscreen Tani 6 64



Jarvis Patrick

The jth nearest-neighbor cutoff may be tied to the jth +1nearest-neighbor, or repeatedly tied (jth +1, jth +2, etc.).

For an example we generated the nearest neighbor table for 1,990 compounds from the SPRESI95demo database, using the Tanimoto measure with Daylight fingerprints.
jth Nearest Neighbor Total Number of Ties
j=4 178
j=8 135
j=16 142




Careful consideration of the use of:

Awareness of the inherent ambiguity when clustering a large number of compounds with bit vector representations and how this effects conclusions based on clustering results.

Multi-domain solutions for clustering bit vector representations of compounds.





Future Work

More thorough and systematic empirical study, employing different algorithms, data sets, measures, and bit vectors.

Designing or modifying clustering algorithms, such that they avoid or incorporate ambiguous ties in proximity.

Impact of measure distributions and ties in database searching.


Conjectures and Open Problems

Number of possible proximities when using the Euclidean-Soergel measure?

\begin{displaymath}O\left( \Phi \left( N\right) +N\right) \end{displaymath}

To avoid ties when using the Tanimoto?

\begin{displaymath}C\leq 4\left( \frac{n^{1/4}}{O\left( \log N\right) }\right) \end{displaymath}