K-modes Clustering:

K-modes is a variant of the well-known k-means clustering algorithm, adapted for categorical data. Here, the categorical data are fingerprints: each cluster is represented by a modal fingerprint rather than a numeric mean.
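
The general idea can be sketched in a few lines of Python. This is a minimal illustration only, not the kmodes program's actual implementation: it assumes fingerprints are held as Python integers, that the modal is a bitwise majority vote over cluster members, and that relocation repeats until no structure moves. The function names (tanimoto, modal, kmodes) and parameters are hypothetical.

```python
import random

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as ints (one bit per position)."""
    common = bin(a & b).count("1")
    total = bin(a | b).count("1")
    # Two all-zero fingerprints have no defined similarity; 0.0 is an arbitrary choice.
    return common / total if total else 0.0

def modal(members, nbits):
    """Assumed modal: a bit is set if more than half of the members set it."""
    m = 0
    for bit in range(nbits):
        on = sum((fp >> bit) & 1 for fp in members)
        if 2 * on > len(members):
            m |= 1 << bit
    return m

def kmodes(fps, k, nbits, seeds=None, max_iter=100, rand_seed=0):
    """Assign each fingerprint to its most similar modal, then recompute modals."""
    # With explicit seeds, k is ignored and the number of seeds decides the cluster count.
    modals = list(seeds) if seeds is not None else random.Random(rand_seed).sample(fps, k)
    assign = [None] * len(fps)
    for _ in range(max_iter):
        moved = 0
        for i, fp in enumerate(fps):
            best = max(range(len(modals)), key=lambda c: tanimoto(fp, modals[c]))
            if best != assign[i]:
                assign[i] = best
                moved += 1
        if moved == 0:  # stop when a relocation pass moves nothing
            break
        for c in range(len(modals)):
            members = [fps[i] for i in range(len(fps)) if assign[i] == c]
            if members:  # keep the old modal if a cluster is empty
                modals[c] = modal(members, nbits)
    return assign, modals
```

For example, calling kmodes(fps, k=2, nbits=8, seeds=[fps[0], fps[3]]) on six 8-bit fingerprints that fall into two obvious groups assigns the first three to one cluster and the last three to the other.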


$ kmodes -help

KMODES USAGE SYNOPSIS:

  kmodes [options] in.tdt [out.tdt]

  options ...... options, see below
  in.tdt ....... readable .tdt file containing FP data
  out.tdt ...... writable file to receive .tdt output (default: stdout)

  options:
    -k <modals> .......... number of modes to find (default: 100 or count
                           of input seeds)
    -km <max_modals> ..... maximum number of modes; causes splitting
                           (default: don't) 
    -seeds <seed_file> ... tdt file of initial modals/seeds (default: none)
    -d <cluster_size> .... if a cluster drops below this size due to 
                           relocations, eliminate it (default: 0)
    -fast <threshold> .... percentage of relocations in a pass to
                           terminate processing (default: 0.0)
    -partition ........... don't relocate modals; one assignment only
    -iter <max_iter> ..... maximum relocation iterations (default: none)
    -nomove .............. don't relocate during initial assignments
    -random .............. pick random seeds (default: don't)
    -randseed <###> ...... pick random seeds, use value as randomizer seed
    -EXPRESSION <expr> ... use expr for comparison (default: tanimoto)
    -COMPARISON [DISTANCE|SIMILARITY]
                  ........ relative goodness of expr values
                           (default: similarity)
    -JP_RUNID val ........ identify run by `val' (default: don't) [-id]
    -in val .............. use fingerprints with id `val' (default: first)
    -min val ............. use seed fingerprints with id `val' (default: first)
    -RECORD_COUNT val .... expect `val' structures (default: 10000) [-m]

  NOTE: For the comparison option, DISTANCE means "lower is better" 
        while SIMILARITY means "higher is better" for the computed
        expression values.  The program will attempt to figure out the
        direction of the expression.  This option is only needed if
        the automatic computation is incorrect.
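
To make the DISTANCE/SIMILARITY distinction concrete, the sketch below picks the best modal for one structure from a list of expression values. It is an illustration under stated assumptions, not the program's code: pick_best is a hypothetical helper, and the scores are made-up numbers.

```python
def pick_best(scores, comparison="SIMILARITY"):
    """Return the index of the best modal given one expression value per modal."""
    indices = range(len(scores))
    if comparison == "SIMILARITY":
        # Higher is better, e.g. Tanimoto similarity.
        return max(indices, key=scores.__getitem__)
    if comparison == "DISTANCE":
        # Lower is better, e.g. 1 - Tanimoto.
        return min(indices, key=scores.__getitem__)
    raise ValueError("comparison must be SIMILARITY or DISTANCE")

similarities = [0.2, 0.9, 0.5]                  # illustrative values only
distances = [1.0 - s for s in similarities]

# Both views of the same data should select the same modal.
assert pick_best(similarities, "SIMILARITY") == pick_best(distances, "DISTANCE")
```

If the direction is set wrongly, every structure is assigned to its worst-matching modal, which is why the override exists for cases where the automatic detection fails.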


$ time kmodes med03.fp_512 > med03.cl
kmodes: reading input file (/sfhome/jjdelany/TMP/zz/med03.fp_512)
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.118 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.100 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.099 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.102 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.111 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.100 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.099 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.098 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 5000 in 0.098 sec
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv 3776 in 0.076 sec
 48777 trees in, 48776 w/FP's processed
kmodes: Computing clusters of 48776 structures using 200 modals.
kmodes: Terminate when no structures move in a relocation pass.
Initial assignment in: 4.405 sec
relocated: 9158 in 3.247 sec
relocated: 4574 in 2.841 sec
relocated: 2987 in 2.702 sec
relocated: 2751 in 2.654 sec
relocated: 2437 in 2.653 sec
relocated: 3171 in 2.726 sec
relocated: 3953 in 2.799 sec
relocated: 2757 in 2.680 sec
relocated: 2154 in 2.637 sec
relocated: 1835 in 2.606 sec
relocated: 1466 in 2.576 sec
relocated: 1046 in 2.518 sec
relocated: 869 in 2.496 sec
relocated: 844 in 2.502 sec
relocated: 566 in 2.478 sec
relocated: 399 in 2.469 sec
relocated: 376 in 2.468 sec
relocated: 237 in 2.480 sec
relocated: 108 in 2.459 sec
relocated: 0 in 2.419 sec
ivar: 0.178263 ovar: 0.165578 kstart: 200  kfinal: 200 reloc: 20
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.109 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.097 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.093 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.091 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.092 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 5000 in 0.091 sec
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 3776 in 0.069 sec
 48776 new structures processed
kmodes: normal exit

real    0m59.46s
user    0m58.89s
sys     0m0.09s
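
The relocation counts in this log show how a -fast threshold could shorten a run. The sketch below converts each pass's count to a percentage of the 48776 structures and reports the first pass at or below a given threshold. It assumes that -fast takes a percentage (so 1.0 means 1%) and that the test is "at or below"; neither detail is confirmed by the help text, and the helper names are hypothetical.

```python
# Relocation counts per pass, copied from the log above.
RELOCATED = [9158, 4574, 2987, 2751, 2437, 3171, 3953, 2757, 2154, 1835,
             1466, 1046, 869, 844, 566, 399, 376, 237, 108, 0]
N_STRUCTURES = 48776

def relocation_percentages(counts, n_structures):
    """Percentage of structures relocated in each pass."""
    return [100.0 * c / n_structures for c in counts]

def first_pass_at_or_below(percentages, threshold):
    """1-based number of the first pass whose relocation percentage is <= threshold."""
    for pass_no, pct in enumerate(percentages, start=1):
        if pct <= threshold:
            return pass_no
    return None  # threshold never reached

pcts = relocation_percentages(RELOCATED, N_STRUCTURES)
print(first_pass_at_or_below(pcts, 0.0))  # default: run until nothing moves
print(first_pass_at_or_below(pcts, 1.0))  # assumed meaning of "-fast 1.0"
```

Under these assumptions, a 1% threshold would stop after pass 16 instead of pass 20, trading four passes of about 2.5 seconds each for a slightly less settled clustering.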

$ more med03.cl
$CLG<;kmodes;4.83;FP;TANIMOTO,SIMILARITY,200/200>
|
$FPG<na;fingerprint;4.83;med03.smi;512,512,0.30,0/7>
|
$SMI<N#COc1ccccc1>
FP<.0204.EE.U6.U...W.7.6.2..G.26.3...U.w.E..E2..A.U..2WE.2.3+.......0IoA4......E.6...2+4...1;512;57;512;57;1>
CL<79;208;>
FPM<.0604.EE2U..U...0.6.M.20.G..6.5...UUs.E..E2..A.2..2WE.2.3+..0....0II24......E.6...2+4...1;512;333;512;56;1>
|
$SMI<CCCc1cc(I)cc(CNCCCN(C)C)c1O>
FP<.100gA+O.kFZUKEXUY7+8F2.Sm++M03.6Mc+w.F.UE2..Q2UQk.W.+263P.OU.k.07KI456OG9G.tQe..+2F4...1;512;135;512;135;1>
CL<1;203;>
FPM<.1.0A.+E.k+2U.E0.26.8.2.0G+.M03.6Ec+s.E.YE2..A2.Q.2W.+2.3FE6..U.07II44.G070.NM8...2+4...1;512;135;512;87;1>
|
...
$SMI<COC(=O)C(C)NP(=O)(OCC1OC(C=C1)n2cc(C)c(=O)[nH]c2=O)Oc3cccc(c3)C(=O)C>
FP<DPOqQcbsDcfZiSrqb5zJ8wjSTytIRPTIzKWRwcznsux99g0uuxLXYShz3xzt0SfzhDyqBrNHCRTruRSzutTzqU..1;512;330;512;330;1>
CL<159;4923;>
|
$SMI<O=N(=O)C=C1SCCN1Cc2ccccc2>
FP<0XEKB.Fd2k7Mk6SB448EeU6V0me2c02.AEXHtUE1UkyUU25HO..W.MI0bHF8YU1E1BJEF4GGed4UtQ8F20.55...1;512;165;512;165;1>
CL<161;436;>
|
...

The output includes the cluster number for each structure (the CL<> data in the example above). The modal fingerprint of each cluster is written once, as an FPM<> datatype on the first entry seen for that cluster. The output can be further processed with listclusters(1) and showclusters(1) to summarize results, generate statistics, etc.
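
For a quick look at the results without those tools, the output can also be read with a short script. The sketch below groups SMILES by cluster number. It assumes that records are separated by lines containing only "|", that each datatype sits on its own line as TAG<data>, and that the first field of CL<> is the cluster number; the meaning of the remaining CL<> fields is not assumed. The function name clusters_from_tdt is hypothetical, and the FP<> and FPM<> lines are omitted from the sample for brevity.

```python
import re
from collections import defaultdict

# SMI and CL lines copied from the example output above.
SAMPLE = """\
$CLG<;kmodes;4.83;FP;TANIMOTO,SIMILARITY,200/200>
|
$SMI<N#COc1ccccc1>
CL<79;208;>
|
$SMI<CCCc1cc(I)cc(CNCCCN(C)C)c1O>
CL<1;203;>
|
$SMI<O=N(=O)C=C1SCCN1Cc2ccccc2>
CL<161;436;>
|
"""

# One datatype per line: optional leading "$", a tag, then <data>.
ITEM = re.compile(r"^\$?(\w+)<(.*)>$")

def clusters_from_tdt(text):
    """Map cluster number -> list of SMILES, reading one record at a time."""
    clusters = defaultdict(list)
    smiles = None
    for line in text.splitlines():
        line = line.strip()
        if line == "|":          # end of record
            smiles = None
            continue
        match = ITEM.match(line)
        if not match:
            continue
        tag, data = match.groups()
        if tag == "SMI":
            smiles = data
        elif tag == "CL" and smiles is not None:
            # Assumption: the first ";"-separated field is the cluster number.
            clusters[int(data.split(";")[0])].append(smiles)
    return dict(clusters)

for cluster_id, members in sorted(clusters_from_tdt(SAMPLE).items()):
    print(cluster_id, len(members), members[0])
```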