Sphere Exclusion Clustering:

Sphere exclusion is a simple, intuitive selection method. Given a set of items, one proceeds by selecting items one at a time, usually at random, and excludes items which are near it from further consideration. Selection is complete once all items have either been selected or excluded.


Graphic from (Gobbi, MUG02)

There is an older contrib version of Sphere exclusion, based on Darko Butinas version (Butina). Man-Ling Lee, at MUG02, talked about an enhanced version called DISE (Directed Sphere Exclusion) (Gobbi). The big trick is, rather than selecting structures randomly, selection is based on the sorted similarity to three separate structures. There are two effects of this sorting:

We've implemented a general version; uses user-defined expressions and is reasonably fast (baseline: 2 x DISE). Also, rather than using external input structures for sorting, we use the bitcount on as the sort key. It's intrinsic to the data and works well for locality of reference.


$ spherex -help

SPHEREX USAGE SYNOPSIS:

  spherex [options] in.tdt [out.tdt]

  options ...... options, see below
  in.tdt ....... readable .tdt file containing FP data
  out.tdt ...... writable file to receive .tdt output (default: stdout)

  options:
    -t  ....... similarity threshold for spheres (default: 0.85)
    -expr  .... use user_expr for comparison (default: tanimoto)
    -comparison [DISTANCE|SIMILARITY]
                  ........ relative goodness of expr values
    -NNID val ............ identify run by `val' (default: don't) [-id]
    -FID val ............. use fingerprints with id `val' 
                           (default: use first) [-in]
    -RECORD_COUNT val .... expect `val' structures (default: 10000) [-m]

  NOTE: For the comparison option, DISTANCE means "lower is better" 
        while SIMILARITY means "higher is better" for the computed
        expression values.  The program will attempt to figure out the
        direction of the expression.  This option is only needed if
        the automatic computation is incorrect.


$ time spherex med03.fp_512 > med03.cl
spherex: reading input file (med03.fp_512)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.103 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.103 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.102 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.104 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.102 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.103 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.104 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.101 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.102 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 3776 in 0.077 sec
 48777 trees in, 48776 w/FP's processed
spherex: sorting 48776 fingerprints
spherex: Computing clusters of 48776 structures
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 40.847 sec, 5672 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 37.990 sec, 11456 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 33.894 sec, 17010 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 32.025 sec, 21880 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 29.885 sec, 26976 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 24.061 sec, 31910 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 18.450 sec, 36797 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 14.519 sec, 41184 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 9.565 sec, 45808 done
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3776 in 2.176 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.054 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.054 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.054 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.054 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.054 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.070 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.068 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.055 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.078 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3776 in 0.059 sec
 48776 new structures processed
spherex: normal exit
  245.06s real   244.44s user     0.07s system


$ more med03.cl
$CLG<na;spherex;4.90;FP;,SIMILARITY,0.850000>
|
$FPG<na;fingerprint;4.90;med03.smi;512,512,0.30,0/7>
|
$SMI<N#COc1ccccc1>
CLT<1>
CL<1742;6;na>
|
$SMI<CCCc1cc(I)cc(CNCCCN(C)C)c1O>
CL<6472;4;na>
|
$SMI<O=N(=O)c1ccc(N(=O)=O)c2c(cccc12)N(=O)=O>
CL<1514;25;na>
|
...

In addition to the cluster number for each structure (clusters are assigned based on the sphere which excluded each structure), the output includes a CLT<> dataitem for each structure which was selected as the center for the sphere. Note that this is often not near the centroid of the cluster. The output can be further processed with listclusters(1) and showclusters(1) to summarize results, generate statistics, etc.