Daylight Summer School 1999, June 15-17, St. John's College, Santa Fe, NM

Daylight Worksheet - Cluster Package ... WITH HINTS!


The Cluster Package generates clusters of compounds using the Daylight fingerprint descriptor and the Jarvis-Patrick clustering algorithm. It can be used to select subsets of large datasets and to add clustering data to TDT files for insertion into Daylight databases. Keep track of the files from this exercise; they are used in the Day 2 labs.
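
For orientation, the idea behind fingerprint-based Jarvis-Patrick clustering can be sketched in a few lines of Python. This is a textbook-style illustration, not the Cluster Package implementation: the function names (tanimoto, nearest_neighbors, jarvis_patrick) are hypothetical, fingerprints are assumed to be plain Python integers used as bit sets, and the joining rule assumed here (two structures are merged when each appears in the other's neighbor list and they share at least min_shared neighbors) may differ in detail from what jarpat does.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints held as integer bit sets."""
    common = bin(a & b).count("1")   # bits set in both fingerprints
    total = bin(a | b).count("1")    # bits set in either fingerprint
    return common / total if total else 0.0


def nearest_neighbors(fps, k, threshold=0.0):
    """For each fingerprint, list the indices of up to k most similar others.

    Candidates scoring below `threshold` are dropped (assumed behavior).
    """
    result = []
    for i, fp in enumerate(fps):
        scored = [(tanimoto(fp, other), j)
                  for j, other in enumerate(fps) if j != i]
        scored = [(s, j) for s, j in scored if s >= threshold]
        scored.sort(key=lambda t: (-t[0], t[1]))  # best first, ties by index
        result.append([j for _, j in scored[:k]])
    return result


def jarvis_patrick(neighbors, min_shared):
    """Return one cluster label per item from its nearest-neighbor lists.

    Items i and j are joined when each is in the other's list and the two
    lists have at least `min_shared` entries in common.
    """
    parent = list(range(len(neighbors)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = [set(n) for n in neighbors]
    for i, ni in enumerate(sets):
        for j in ni:
            if j > i and i in sets[j] and len(ni & sets[j]) >= min_shared:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(neighbors))]


# Six toy 8-bit "fingerprints": two groups with no bits in common.
fps = [0b1111, 0b1110, 0b0111, 0b11110000, 0b11100000, 0b01110000]
nn = nearest_neighbors(fps, k=2)
labels = jarvis_patrick(nn, min_shared=1)
```

With real data the neighbor count and the number of shared neighbors required play the role of the parameters explored in the exercise; structures that are never joined to anything remain singletons.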

  1. Generate a TDT file containing a clustered dataset from the ~mug/data/day1.cluster.smi dataset, using fixed-length fingerprints, 5 nearest neighbors, a Tanimoto threshold of 0.7, and a "reasonable" JP clustering level chosen from the jpscan output.

  2. Pick a representative subset of the clustered dataset from step one by selecting only the cluster centroids and the singletons.

    listclusters -a day1.35.cl.tdt >day1.cl.tdt

    huxley% listclusters -a day1.35.cl.tdt >day1.cl.tdt
    huxley% more day1.cl.tdt
    
    $CLG<na;jarpat;4.62;NN;3,5,0>
    |
    $NNG<na;nearneighbors;4.62;FP,day1;5>
    |
    $FPG<day1;fingerprint;4.62;day1.cluster.tdt;1024,1024,0.30,0/7>
    |
    $SMI<CEFTAZIDI>
    CL<0;12>
    |
    $SMI<CEFORANID>
    CL<0;12>
    |
    $SMI<CEFAMANDO>
    CL<0;12>
    |
    $SMI<CEFOTETAN>
    CL<0;12>
    |
    $SMI<CEFBUPERA>
    CL<0;12>
    |

  3. Update the nearneighbors table generated from the day1.cluster.tdt dataset with the ~mug/data/drugs.smi dataset fingerprinted with the same parameter set used in step one.

    huxley% smi2tdt drugs.smi >drugs.tdt
    huxley% fingerprint -b 1024 -c 1024 -id day1 drugs.tdt > drugs.fp.tdt
    .
    10 TDTs, 10 fingerprints added, 0 errors
    Done.
    huxley% nearneighbors -NEIGHBORS 5 -UPDATE_FILE day1.cl_nn.tdt drugs.fp.tdt \
    day1.cl_nn.update.tdt
    nearneighbors: reading update file (day1.cl_nn.tdt)
    vvvvvvvvvv 1002 in 0.800 sec
     1004 trees in, 1002 w/FP's processed
    nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
          1004 datatrees read in so far
          2004 datatrees contain SMILES ($SMI) data
          1002 datatrees contain valid fingerprints
    nearneighbors: reading input file (drugs.fp.tdt)
     1012 in 0.008 sec
     1015 trees in, 1012 w/FP's processed
    nearneighbors: WARNING not all SMILES-rooted datatrees have fingerprints.
          1015 datatrees read in so far
          2014 datatrees contain SMILES ($SMI) data
          1012 datatrees contain valid fingerprints
    nearneighbors: updating old neighbor lists
    ^^^^^^^^^^ 1002 in 1.945 sec
     1002 old structures processed
    nearneighbors: sorting 1012 fingerprints
    nearneighbors: finding neighbors of 10 new structures
     10 in 0.073 sec
     10 new structures processed
    nearneighbors: normal exit
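
When checking the TDT files produced in these steps, it can help to tally datatrees outside the Daylight tools. The following is a minimal Python sketch, assuming the simple layout seen in the step 2 listing: each item has the form TAG<data>, a "|" ends a datatree, and no data field contains ">" or "|". The names FIELD and read_tdt are hypothetical, and the CL data is treated as an opaque string rather than interpreted.

```python
import re
from collections import Counter

# One TDT item: an optional "$", a tag name, then data in angle brackets.
# Assumption: the data itself never contains ">" (no quoted fields).
FIELD = re.compile(r"(\$?\w+)<([^>]*)>")


def read_tdt(text):
    """Split TDT text into datatrees, each a list of (tag, data) pairs."""
    trees = []
    for chunk in text.split("|"):      # "|" terminates a datatree
        fields = FIELD.findall(chunk)
        if fields:                     # skip the empty tail after the last "|"
            trees.append(fields)
    return trees


# Two datatrees copied from the step 2 listing.
sample = """$SMI<CEFTAZIDI>
CL<0;12>
|
$SMI<CEFORANID>
CL<0;12>
|
"""

trees = read_tdt(sample)
# Count how many datatrees carry each distinct CL value.
counts = Counter(dict(tree).get("CL") for tree in trees)
```

Because the items are matched by pattern rather than by line, the same function reads datatrees written one item per line or several items on one line.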
    

Daylight Chemical Information Systems Inc.
support@daylight.com