The Cluster Package generates clusters of compounds, using the Daylight fingerprint as the descriptor and the Jarvis-Patrick algorithm as the clustering method. It can be used to select subsets of large datasets and to add clustering data to TDT files for loading into Daylight databases. Keep the files produced in this exercise; they are used again in the Day 2 labs.
$DY_ROOT/bin/smi2tdt -t '$SMI' day1.cluster.smi day1.cluster.tdt
fingerprint -b 1024 -c 1024 -id day1 day1.cluster.tdt >day1.cl.fp.tdt
nearneighbors -fid day1 -NEIGHBORS 5 day1.cl.fp.tdt day1.cl_nn.tdt
sun1% $DY_ROOT/bin/smi2tdt -t '$SMI' day1.cluster.smi day1.cluster.tdt
sun1% fingerprint -b 1024 -c 1024 -id day1 day1.cluster.tdt >day1.cl.fp.tdt
..................................................500 TDTs, 500 fingerprints added
..................................................1000 TDTs, 1000 fingerprints added
1002 TDTs, 1002 fingerprints added, 0 errors
Done.
sun1% nearneighbors -fid day1 -NEIGHBORS 5 day1.cl.fp.tdt day1.cl_nn.tdt
nearneighbors: reading input file (day1.cl.fp.tdt)
vvvvvvvvvv 1002 in 0.703 sec
 1003 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1003 datatrees read in so far
      2004 datatrees contain SMILES ($SMI) data
      1002 datatrees contain valid fingerprints
nearneighbors: sorting 1002 fingerprints
nearneighbors: finding neighbors of 1002 new structures
^^^^^^^^^^ 1002 in 7.577 sec
 1002 new structures processed
nearneighbors: normal exit
sun1% more day1.cl_nn.tdt
$NNG<na;nearneighbors;4.62;FP,day1;5>
|
$FPG<day1;fingerprint;4.62;day1.cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.7.U....630...F.+6.2;1024;127;1024;127;1;day1>
NN<na;0,39,14,694,17;1.0000,0.5746,0.5588,0.5426,0.5401>
$SMI<CLOFOCTOL>
|
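The NN data item above lists each structure's nearest neighbors with their similarity values (1.0000 for the structure itself, then 0.5746, 0.5588, and so on). As a rough illustration of what a near-neighbor search computes, the sketch below ranks fingerprints by Tanimoto similarity. It assumes Tanimoto is the measure in use and represents each fingerprint as a Python set of on-bit positions; the function names are hypothetical and this is not the nearneighbors implementation.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def nearest_neighbors(fps, k):
    """For each fingerprint, return its k most similar entries as (index, similarity).

    Each structure is compared with every fingerprint, itself included, so its
    own index normally comes first with similarity 1.0, as in the NN item above.
    Ties are broken by the lower index (an arbitrary choice for this sketch).
    """
    result = []
    for fp in fps:
        scored = [(tanimoto(fp, other), j) for j, other in enumerate(fps)]
        scored.sort(key=lambda t: (-t[0], t[1]))
        result.append([(j, round(s, 4)) for s, j in scored[:k]])
    return result


# Toy fingerprints (hypothetical data, not taken from day1.cluster.smi)
toy = [{1, 2, 3}, {1, 2, 4}, {7, 8}]
print(nearest_neighbors(toy, 2))
```

This all-pairs loop is quadratic in the number of structures; it is only meant to show the ranking, not how a production tool would organize the search.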
jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 day1.cl_nn.tdt >jpscan.out
sun1% jpscan -NN_BEST_THRESHOLD 0.7 -JP_NEAR 5 day1.cl_nn.tdt >jpscan.out
vvvvvvvvvv 1002
jpscan: note, 1002 of 1004 input trees contain valid NN data

PERCENTAGE OF STRUCTURES CLUSTERED

        ------ NEED ------------------------------
           2    3    4    5
 NEAR    ---  ---  ---  ---
   2:     46    -    -    -
   3:     60   47    -    -
   4:     65   62   43    -
   5:     69   67   63   40

sun1% more jpscan.out
Program ......... jpscan
Version ......... Daylight Software Release 4.62
Function ........ scan Jarvis-Patrick clustering parameters
Input ........... NN (nearest neighbors) data
NN data set ..... na
   created by ... nearneighbors
   version ...... 4.62
   from ......... FP
   with params .. day1
Trees read in ... 1002
Clustered by .... standard Jarvis-Patrick method
Similarity threshold ... 0.700000
   resulted in ... 266 automatic singletons.

NUMBER OF STRUCTURES CLUSTERED

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:       462       -       -       -
   3:       600     467       -       -
   4:       656     621     433       -
   5:       689     675     629     401

PERCENTAGE OF STRUCTURES CLUSTERED

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:     46.10       -       -       -
   3:     59.88   46.60       -       -
   4:     65.46   61.97   43.21       -
   5:     68.76   67.36   62.77   40.01

NUMBER OF CLUSTERS

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:       231       -       -       -
   3:       212     198       -       -
   4:       184     193     170       -
   5:       155     169     187     156

AVERAGE CLUSTER SIZE

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:       2.0       -       -       -
   3:       2.8     2.3       -       -
   4:       3.5     3.2     2.5       -
   5:       4.4     3.9     3.3     2.5

SIZE OF LARGEST CLUSTER

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:         3       -       -       -
   3:         6       4       -       -
   4:        12       8       5       -
   5:        18      13       9       6

NUMBER OF SINGLETONS

        ------- NEED ---------------
              2       3       4       5
 NEAR    ------  ------  ------  ------
   2:       540       -       -       -
   3:       402     535       -       -
   4:       346     381     569       -
   5:       313     327     373     601
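The jpscan tables are not independent: for any NEAR/NEED cell, the singleton count, the percentage clustered and the average cluster size all follow from the number of structures clustered and the number of clusters. The sketch below recomputes them for one cell. The function name is hypothetical, and the truncation (rather than rounding) of the printed decimals is inferred from the numbers in this listing, not from jpscan documentation.

```python
def derive(total, clustered, clusters):
    """Recompute the dependent jpscan table entries for one NEAR/NEED cell."""
    singletons = total - clustered
    # Integer arithmetic so the decimals are truncated, which is what the
    # listing appears to do (e.g. 462/1002 = 46.107...% is printed as 46.10).
    pct = clustered * 10000 // total   # hundredths of a percent
    avg = clustered * 10 // clusters   # tenths of a structure
    return {
        "singletons": singletons,
        "percent": f"{pct // 100}.{pct % 100:02d}",
        "avg_size": f"{avg // 10}.{avg % 10}",
    }


# NEAR=5, NEED=3: 675 of 1002 structures clustered, in 169 clusters
print(derive(1002, 675, 169))
# {'singletons': 327, 'percent': '67.36', 'avg_size': '3.9'}
```

A check like this is a quick way to confirm that a table reconstructed from a listing was transcribed correctly.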
jarpat -JP_NEED 3 -JP_NEAR 5 day1.cl_nn.tdt >day1.35.cl.tdt
sun1% jarpat -JP_NEED 3 -JP_NEAR 5 day1.cl_nn.tdt >day1.35.cl.tdt
vvvvvvvvvv 1002
jarpat: note, 1002 of 1004 input trees contain valid NN data
^^^^^^^^^^ 1002
1002 total: 155 singletons; 847 (84.5%) in 180 clusters
sun1% more day1.35.cl.tdt
$CLG<na;jarpat;4.62;NN;3,5,0>
|
$NNG<na;nearneighbors;4.62;FP,day1;5>
|
$FPG<day1;fingerprint;4.62;day1.cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CC(C)(C)CC(C)(C)c1ccc(O)c(Cc2ccc(Cl)cc2Cl)c1>
FP<E+.0Y.+E..F5cIG3U.+..22.60+00U....U3E.EE0E...A.U8..W.2.c.1.A..c8.6IE2..0.5..EA..6+2d0.eU0+6.....6+0..2.0.0c...22...VE4.2+8..0+2+..0....8.U0+..M.2....2..+.7.U....630...F.+6.2;1024;127;1024;127;1;day1>
CL<0;3>
$SMI<CLOFOCTOL>
|
$SMI<CC(C)(C)NCC(O)c1cc(O)cc(O)c1>
FP<...020+E.U73UIE..2+...2.60+......EU+0EE....2.A2U7.2W....2+6M.0W0..oM2..G0+..Q6.0.0U30..U.0............U1.8U.M.2..0.VE0.U.8...62+..8.2..+...+..EU.+.U....+.3k..0..630...+.+..2;1024;108;1024;108;1;day1>
CL<1;11>
$SMI<TERBUTALI>
|
...
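To make the NEAR and NEED parameters concrete, here is a minimal sketch of the textbook Jarvis-Patrick rule: two structures join the same cluster when each appears in the other's list of NEAR nearest neighbors and the two lists share at least NEED entries. It assumes neighbor lists that start with the structure itself, as in the NN data produced earlier, and it counts those self entries in the overlap. Whether jarpat counts shared neighbors in exactly this way is an assumption, and the function name is hypothetical.

```python
def jarvis_patrick(neighbors, near, need):
    """Cluster indices 0..n-1 from ranked neighbor lists (textbook rule).

    neighbors[i] is the ranked neighbor list of structure i, itself first.
    Returns clusters as lists of indices, largest first; singletons included.
    """
    n = len(neighbors)
    top = [set(lst[:near]) for lst in neighbors]
    parent = list(range(n))  # union-find forest, one tree per cluster

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in top[i]:
            mutual = j > i and i in top[j]
            if mutual and len(top[i] & top[j]) >= need:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values(), key=lambda c: (-len(c), c[0]))


# Toy neighbor lists (hypothetical data)
toy = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [3, 4, 0], [4, 3, 1]]
print(jarvis_patrick(toy, near=3, need=2))  # [[0, 1, 2], [3, 4]]
print(jarvis_patrick(toy, near=3, need=3))  # [[0, 1, 2], [3], [4]]
```

Raising NEED splits the weaker pair (3, 4) into singletons, which mirrors the trend along each NEAR row of the jpscan tables: stricter NEED values leave more singletons.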
showclusters -h -q -v day1.35.cl.tdt >day1.35.cl.out
sun1% showclusters -h -q -v day1.35.cl.tdt >day1.35.cl.out
sun1% more day1.35.cl.out
HEADER AND SUMMARY:

program ................... showclusters
function .................. analysis and display of structure clusters
version ................... DCIS Release 4.62 (c) 1995
output requested .......... Summary Frequencies Sorted lists
singletons to be listed ... no
datatype(s) to show ....... all SMILES
display long data items ... normal
input file ................ day1.35.cl.tdt
tree allocation, initial .. 10000
tree allocation, final .... 10000
total datatrees read ...... 1005
trees with SMILES ......... 2004
cluster id required ....... none
trees with CL data ........ 1002
trees with FP data ........ 1002
trees with other data ..... 0 (0 items read)
trees used ................ 1002
clusters + singletons ..... 335
number of singletons ...... 155
number of clusters ........ 180
average cluster size ...... 4.7
largest cluster ........... 12

Generation of CLUSTERS:
   ID ........... na
   Program ...... jarpat
   Version ...... 4.62
   Source ....... NN (near neighbors)
   Parameters ... 3,5,0

Generation of NEAR NEIGHBORS:
   ID ........... na
   Program ...... nearneighbors
   Version ...... 4.62
   Source ....... FP,day1
   Parameters ... 5

Generation of FINGERPRINTS:
   ID ........... day1
   Program ...... fingerprint
   Version ...... 4.62
   Source ....... day1.cluster.tdt
   Parameters ... 1024,1024,0.30,0/7

FREQUENCIES OF CLUSTER SIZES:

     size | frequency       size | frequency       size | frequency
----------+----------  ----------+----------  ----------+----------
        1 | 155                5 | 24                 9 | 5
        2 | 44                 6 | 27                10 | 4
        3 | 20                 7 | 17                11 | 3
        4 | 28                 8 | 7                 12 | 1

CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):

CLUSTER 0 (100) size 12
   0.0   0.0468 CEFMENOXI
   0.1   0.0607 CEFOTIAM
   0.2   0.0672 CEFETAMET
   0.3   0.0764 CEFIXIME
   0.4   0.0771 CEFTERAM
   0.5   0.0810 CEFOTAXIM
   0.6   0.0848 CEFTAZIDI
   0.7   0.0930 CEFAMANDO
   0.8   0.1011 CEFORANID
   0.9   0.1100 CEFPIROME
   0.10  0.1253 CEFOTETAN
   0.11  0.1316 CEFBUPERA

CLUSTER 1 (1) size 11
   1.0   0.0171 ISOPRENAL
   1.1   0.0201 COLTEROL
   1.2   0.0256 ETILEFRIN
   1.3   0.0270 PHENYLEPH
   1.4   0.0278 DIOXIFEDR
   1.5   0.0314 NORADRENA
   1.6   0.0326 TERBUTALI
   1.7   0.0370 DIMETOFRI
   1.8   0.0446 DENOPAMIN
   1.9   0.0466 NORMETANE
   1.10  0.0480 ISOETARIN
...
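The summary figures in the showclusters header can be cross-checked against the cluster size frequencies in the same listing. The sketch below does that arithmetic; the frequencies are copied from the listing, while the function name and the rounding to one decimal are choices made for this illustration.

```python
# Cluster size -> number of clusters of that size, from the listing above
FREQ = {1: 155, 2: 44, 3: 20, 4: 28, 5: 24, 6: 27,
        7: 17, 8: 7, 9: 5, 10: 4, 11: 3, 12: 1}


def summarize(freq):
    """Return (structures, singletons, clusters, average cluster size)."""
    structures = sum(size * count for size, count in freq.items())
    singletons = freq.get(1, 0)
    clusters = sum(count for size, count in freq.items() if size > 1)
    average = round((structures - singletons) / clusters, 1)
    return structures, singletons, clusters, average


print(summarize(FREQ))  # (1002, 155, 180, 4.7)
```

The result agrees with the header: 1002 trees used, 155 singletons, 180 clusters, and an average cluster size of 4.7 (847 clustered structures over 180 clusters).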
listclusters -a day1.35.cl.tdt >day1.cl.tdt
sun1% listclusters -a day1.35.cl.tdt >day1.cl.tdt
sun1% more day1.cl.tdt
$CLG<na;jarpat;4.62;NN;3,5,0>
|
$NNG<na;nearneighbors;4.62;FP,day1;5>
|
$FPG<day1;fingerprint;4.62;day1.cluster.tdt;1024,1024,0.30,0/7>
|
$SMI<CEFTAZIDI>
CL<0;12>
|
$SMI<CEFORANID>
CL<0;12>
|
$SMI<CEFAMANDO>
CL<0;12>
|
$SMI<CEFOTETAN>
CL<0;12>
|
$SMI<CEFBUPERA>
CL<0;12>
|
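Because listclusters writes one datatree per structure with a CL item giving the cluster number and cluster size, the output is easy to post-process. The sketch below groups the root $SMI value of each tree by cluster number. It is a simplified reader that assumes no data item contains `<`, `>` or `|`, which holds for the listing above but is not a general TDT parser; the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches one TAG<data> item; adequate only for data without '<', '>' or '|'
ITEM = re.compile(r"(\$?\w+)<([^>]*)>")


def group_by_cluster(tdt_text):
    """Map cluster number -> list of root $SMI values, in file order."""
    groups = defaultdict(list)
    for tree in tdt_text.split("|"):  # '|' ends each datatree
        items = ITEM.findall(tree)
        roots = [data for tag, data in items if tag == "$SMI"]
        cl = [data for tag, data in items if tag == "CL"]
        if roots and cl:  # skip $CLG/$NNG/$FPG definition trees
            cluster_number = int(cl[0].split(";")[0])
            groups[cluster_number].append(roots[0])
    return dict(groups)


sample = """$CLG<na;jarpat;4.62;NN;3,5,0>
|
$SMI<CEFTAZIDI>
CL<0;12>
|
$SMI<CEFORANID>
CL<0;12>
|
"""
print(group_by_cluster(sample))  # {0: ['CEFTAZIDI', 'CEFORANID']}
```

Picking one representative per cluster from such a mapping is one simple way to select a diverse subset of a large dataset.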
sun1% smi2tdt drugs.smi >drugs.tdt
sun1% fingerprint -b 1024 -c 1024 -id day1 drugs.tdt > drugs.fp.tdt
.
10 TDTs, 10 fingerprints added, 0 errors
Done.
sun1% nearneighbors -NEIGHBORS 5 -UPDATE_FILE day1.cl_nn.tdt drugs.fp.tdt \
day1.cl_nn.update.tdt
nearneighbors: reading update file (day1.cl_nn.tdt)
vvvvvvvvvv 1002 in 0.800 sec
 1004 trees in, 1002 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1004 datatrees read in so far
      2004 datatrees contain SMILES ($SMI) data
      1002 datatrees contain valid fingerprints
nearneighbors: reading input file (drugs.fp.tdt)
 1012 in 0.008 sec
 1015 trees in, 1012 w/FP's processed
nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
      1015 datatrees read in so far
      2014 datatrees contain SMILES ($SMI) data
      1012 datatrees contain valid fingerprints
nearneighbors: updating old neighbor lists
^^^^^^^^^^ 1002 in 1.945 sec
 1002 old structures processed
nearneighbors: sorting 1012 fingerprints
nearneighbors: finding neighbors of 10 new structures
 10 in 0.073 sec
 10 new structures processed
nearneighbors: normal exit
 Daylight Chemical Information Systems Inc.