## Algorithm5: A Procedure for Fuzzy Similarity Clustering

Tom Doman
G. D. Searle

Daylight User's Group Meeting
2/27/96 - 3/1/96
Santa Fe, New Mexico

Limitations of Jarvis-Patrick Clustering

• No overlap between clusters.

• Under- or over-clustering ("chaining") is common.

• No explicit distance consideration between compounds (except neighbor list).

• Requires manual tuning, via the "need/near" parameters.

• Parameters are generally not transferable.

Requirements for an Improved Clustering Algorithm

1. Generate reasonable clusters.

2. Require no user-supplied parameters.

3. Allow compounds to be in more than one cluster.

### Algorithm5 in a Nutshell

Put Jarvis-Patrick to work on those areas of the inventory where it works best, then "fuzzy cluster" around the J.P. clusters.

### Algorithm 5: General Overview

STEP 1:

Choose a subset of the compound space which consists only of those compounds which have a sufficient number of "close" neighbors.

STEP 2:

Apply the J-P algorithm to this subset using a number of nearest neighbors which would insure that the nearest neighbor list for each compound consisted only of "close" neighbors. (No undesired "chaining" can occur.)

STEP 3:

360 Create a fuzzy clustering of the entire database by assigning p's to each compound which describe the "degree of membership" to each of the clusters determined in STEP 2.

STEP 4:

Determine which compounds do not belong to any of the existing clusters on the basis of their p-values. If "almost all" compounds already "belong" to some cluster, STOP.

STEP 5:

Determine a new "sufficient number" to be used in STEP 1. Repeat the process with STEPS 1 and 2 only operating upon those compounds identified in STEP 4.

### Algorithm5: Implementation Specifics

STEP 1:

Choose a subset of the compound space which consists only of those compounds which have a sufficient number of "close" neighbors.

• For each compound, determine a "cardinality" - # of compounds within a certain distance d0. (d0 = 0.15, Tc = 0.85).

• Collect a small portion (3%) of the most densely populated space (i.e., compounds with the highest card()).

### Algorithm5: Implementation Specifics

STEP 2:

Apply the J-P algorithm to this subset using a number of nearest neighbors which would insure that the nearest neighbor list for each compound consisted only of "close" neighbors. (No undesired "chaining" can occur.)

Perform Jarvis-Patrick clustering on this subset, using fairly restrictive parameters (e.g. near=20, need=15); refer to resulting clusters as "crisp".

### Algorithm5: Implementation Specifics

STEP 3:

Create a fuzzy clustering of the entire database by assigning p's to each compound which describe the "degree of membership" to each of the clusters determined in STEP 2.

• Find skeleton compounds of crisp cluster.

• Fuzzily cluster any non-crisply-clustered compounds which are within 2d0 of a skeleton compound.

### Algorithm5: Implementation Specifics

STEP 4:

Determine which compounds do not belong to any of the existing clusters on the basis of their p-values. If "almost all" compounds already "belong" to some cluster, STOP.

STEP 5:

Determine a new "sufficient number" to be used in STEP 1. Repeat the process with STEPS 1 and 2 only operating upon those compounds identified in STEP 4.

Collect the next small portion of unclustered compounds and loop back through step 3 until finished.

### Algorithm5 Implementation and Usage

• 60Implemented as a series of "C" programs which run on SGI machines.

• Inventories up to 300,000 compounds successfully clustered.

• Timing 1: 140 CPU hours on a 75 Mhz. R8000 Power Indigo2.

• Timing 2: (neighbor calculation split into 5 chunks of 60K) clustering completed in 40 hours.

• Upper limit: 500,000 compounds with 128M memory?

• Tuning: d0 may be varied.

### Acknowledgements

John Cibulskis
Mike Cibulskis
Patrick McCray

Compounds shown in figures are from the "Master File":

Hansch, C.; Leo, A.; Hoekman, D. Exploring QSAR. Hydrophobic, Electronic, and Steric Constants.; American Chemical Society: Washington, D.C., 1995.