Why k-modes?

The bit values in Daylight fingerprints are categorical data.

The binary values of 0 and 1 indicate the absence (0) or possible presence (1) of a particular path.

There is no sense in which any of the 0 values along a particular fingerprint can be equated or related. The fact that a snake possesses neither wheels nor legs tells us nothing about the relative value of wheels and legs.

The same is true for the 1 values. In Daylight fingerprints this situation is complicated further by the ambiguous nature of the meaning of a single set bit.

It is therefore inappropriate to

As the popular k-means algorithm does both these things, it too is inappropriate as an algorithm to cluster objects described by Daylight fingerprints.

k-modes clustering (Chaturvedi, A., Green, P.E. et al. (2001) Journal of Classification 18, 35-55) gets around these problems by

  1. Using an association measure such as the Tanimoto coefficient to measure the relationship between two fingerprints
  2. Using modes instead of means to represent clusters
  3. Using a frequency-based method to update modes
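
The three ingredients listed above can be sketched in a few lines of Python. This is only an illustration, not the Daylight implementation: the function names (tanimoto, update_mode, assign), the tie-breaking rules and the toy 4-bit fingerprints are all assumptions made for the example.

```python
from typing import List, Sequence

Fingerprint = Sequence[int]  # each element is 0 or 1


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: bits set in both / bits set in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero fingerprints count as identical.
    return both / either if either else 1.0


def update_mode(members: List[Fingerprint]) -> List[int]:
    """Frequency-based update: each bit of the mode takes the value that
    occurs most often among the cluster members.
    Ties go to 0 here, which is an arbitrary choice for this sketch."""
    n = len(members)
    return [1 if 2 * sum(column) > n else 0 for column in zip(*members)]


def assign(fps: List[Fingerprint], modes: List[Fingerprint]) -> List[int]:
    """Give each fingerprint the index of its most similar mode
    (the first mode wins when similarities are equal)."""
    return [
        max(range(len(modes)), key=lambda k: tanimoto(fp, modes[k]))
        for fp in fps
    ]


# Hypothetical 4-bit fingerprints, invented for illustration only.
fps = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]

# One pass of the loop: start from two members as modes, assign, then update.
modes = [fps[0], fps[2]]
labels = assign(fps, modes)
new_modes = [
    update_mode([fp for fp, label in zip(fps, labels) if label == k])
    for k in range(len(modes))
]
```

In this toy data the two members of the first cluster disagree at the third bit. A mean would put 0.5 in that position, which is not a valid bit value, whereas the mode remains a fingerprint of 0s and 1s. A full clustering run would repeat the assign and update steps until the labels stop changing.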

We have implemented the algorithm as described in Huang, Z. (1998) Data Mining and Knowledge Discovery 2, 283-304.

Daylight Chemical Information Systems, Inc.
John Bradshaw.