Algorithm5: A Procedure for Fuzzy Similarity Clustering

Tom Doman
G. D. Searle

Daylight User's Group Meeting
2/27/96 - 3/1/96
Santa Fe, New Mexico


Limitations of Jarvis-Patrick Clustering


Requirements for an Improved Clustering Algorithm

  1. Generate reasonable clusters.

  2. Require no user-supplied parameters.

  3. Allow compounds to be in more than one cluster.


Algorithm5 in a Nutshell

Put Jarvis-Patrick to work on those areas of the inventory where it works best, then "fuzzy cluster" around the J.P. clusters.


Algorithm 5: General Overview

STEP 1:

Choose a subset of the compound space which consists only of those compounds which have a sufficient number of "close" neighbors.

STEP 2:

Apply the J-P algorithm to this subset using a number of nearest neighbors which would insure that the nearest neighbor list for each compound consisted only of "close" neighbors. (No undesired "chaining" can occur.)

STEP 3:

360 Create a fuzzy clustering of the entire database by assigning p's to each compound which describe the "degree of membership" to each of the clusters determined in STEP 2.

STEP 4:

Determine which compounds do not belong to any of the existing clusters on the basis of their p-values. If "almost all" compounds already "belong" to some cluster, STOP.

STEP 5:

Determine a new "sufficient number" to be used in STEP 1. Repeat the process with STEPS 1 and 2 only operating upon those compounds identified in STEP 4.


Algorithm5: Implementation Specifics

STEP 1:

Choose a subset of the compound space which consists only of those compounds which have a sufficient number of "close" neighbors.


Algorithm5: Implementation Specifics

STEP 2:

Apply the J-P algorithm to this subset using a number of nearest neighbors which would insure that the nearest neighbor list for each compound consisted only of "close" neighbors. (No undesired "chaining" can occur.)

Perform Jarvis-Patrick clustering on this subset, using fairly restrictive parameters (e.g. near=20, need=15); refer to resulting clusters as "crisp".


Algorithm5: Implementation Specifics

STEP 3:

Create a fuzzy clustering of the entire database by assigning p's to each compound which describe the "degree of membership" to each of the clusters determined in STEP 2.


Algorithm5: Implementation Specifics

STEP 4:

Determine which compounds do not belong to any of the existing clusters on the basis of their p-values. If "almost all" compounds already "belong" to some cluster, STOP.

STEP 5:

Determine a new "sufficient number" to be used in STEP 1. Repeat the process with STEPS 1 and 2 only operating upon those compounds identified in STEP 4.

Collect the next small portion of unclustered compounds and loop back through step 3 until finished.


Idealized Algorithm5 Cluster


Algorithm5 Implementation and Usage


Acknowledgements

John Cibulskis
Mike Cibulskis
Patrick McCray

Compounds shown in figures are from the "Master File":

Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR. Hydrophobic, Electronic, and Steric Constants.; American Chemical Society: Washington, D.C., 1995.