showclusters

Daylight v4.9
Release Date: 1 February 2008

Name

showclusters - display Jarvis-Patrick clusters and statistics

Unix Synopsis

showclusters [options] [in.tdt] [out.lis]

Description

showclusters reads a .tdt (Thor datatree) file containing "Cluster" (CL) data and tabulates the data for text output. The input file must be a .tdt file (in either "list" or "dump" format) containing data of the $SMI and CL datatypes; the FP datatype is optional. If no output file is specified, standard output is used. If no input file is specified either, input is read from standard input. showclusters can produce four kinds of output:

summary of the data related to clustering (-h)
frequency distribution of cluster sizes (-q)
list of cluster centroids (-x)
sorted list of all structures (-v)

The user must select one or more of these command line options to obtain output (there is no "default" type of output). Clusters are displayed sorted by cluster size (largest first). Output of "singleton" clusters (clusters containing a single structure) is suppressed by default. Structures are normally listed by their SMILES but output can include other data present in the input file.
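The ordering and singleton rules described above can be illustrated with a short Python sketch. It is not the showclusters implementation: the function names, the (cluster id, SMILES) record layout and the sample data are assumptions made for illustration, and the handling of ties between equal-sized clusters is a guess.

```python
from collections import Counter, defaultdict


def group_clusters(records):
    """Group (cluster_id, smiles) pairs into {cluster_id: [smiles, ...]}."""
    clusters = defaultdict(list)
    for cluster_id, smiles in records:
        clusters[cluster_id].append(smiles)
    return dict(clusters)


def clusters_by_size(clusters, include_singletons=False):
    """Return (cluster_id, members) pairs, largest cluster first.

    Singletons are dropped unless include_singletons is True,
    mirroring the documented default and the -a option.
    """
    items = [
        (cid, members)
        for cid, members in clusters.items()
        if include_singletons or len(members) > 1
    ]
    # sorted() is stable, so equal-sized clusters keep their input order
    # (an assumption; this page does not say how ties are broken).
    return sorted(items, key=lambda item: len(item[1]), reverse=True)


def size_distribution(clusters):
    """Map cluster size -> number of clusters of that size (compare -q)."""
    return Counter(len(members) for members in clusters.values())


# Hypothetical input: (cluster id, SMILES) pairs.
records = [
    (7, "CCO"), (3, "c1ccccc1"), (7, "CCCO"),
    (9, "CC(=O)O"), (7, "CCCCO"), (9, "CCC(=O)O"),
]
clusters = group_clusters(records)

for cid, members in clusters_by_size(clusters):
    print(cid, len(members), members)
```

Passing include_singletons=True corresponds to the behaviour documented for -a.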

Options

-a

List all clusters, including singletons. (Output of singleton clusters is suppressed by default.)

-d tags

Display all data items of datatypes listed in `tags' when listing structures, where `tags' is a space-delimited list of THOR datatags. Data will appear in the same order as in the input. Not compatible with option -f.

-in clid

Use only cluster (CL) data identified by `clid' as input rather than using the first occurring CL item in the datatree. Useful when CL data from more than one clustering run are present.

-f tags

Display the first data item of each datatype listed in `tags' when listing structures, where `tags' is a space-delimited list of THOR datatags. Not compatible with option -d.

-h

Display header and summary of clustering data.

-q

Display the frequency distribution of cluster sizes.

-t width

Truncate long data items at column `width' in list output. Not compatible with option -w.

-v

Display a list of all structures sorted by cluster size then by variance within the cluster.

-w width

Wrap long data items at column `width' in list output. Not compatible with option -t.

-x

Display a list of cluster centroids sorted by cluster size.

-RECORD_COUNT count

Initially allocate memory for `count' structures. Ideally, `count' should be equal to or slightly greater than the number of structures to be input. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. (Short form: -m.)

-EXPRESSION expr

Use the given expression to compute the variance displayed by the -v option. (Default: 1-tanimoto().) Note: the expression must be a distance measured from 0.0 (that is, lower is better, with a minimum distance of 0.0); otherwise the variance computation will give nonsensical results.
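To make the requirement on -EXPRESSION concrete, the following Python sketch computes a distance of the form 1 - Tanimoto for binary fingerprints. It illustrates only the standard Tanimoto definition (bits in common divided by bits set in either fingerprint); it is not the Daylight expression evaluator, and representing a fingerprint as a set of "on" bit positions, as well as the function names, are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumption: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(bits_a, bits_b):
    """Distance relative to 0.0: 0.0 for identical fingerprints, 1.0 for disjoint ones."""
    return 1.0 - tanimoto(bits_a, bits_b)


print(tanimoto_distance({1, 2, 3}, {1, 2, 3}))  # identical fingerprints
print(tanimoto_distance({1, 2, 3}, {2, 3, 4}))  # two of four bits shared
print(tanimoto_distance({1, 2}, {3, 4}))        # no bits shared
```

A similarity such as the raw Tanimoto coefficient (higher is better) would violate the "lower is better" requirement, which is why the default subtracts it from 1.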

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

showclusters: no output requested

This program requires that at least one of the options -h, -q, -v, -x be specified to obtain output.

showclusters: unknown option encountered: xxx

An invalid option was specified on the command line.

showclusters: can't open input file xxx

The file specified on the command line does not exist or is not readable.

showclusters: bad value for option -m <count>
showclusters: bad value for option -t <width>
showclusters: bad value for option -w <width>

The value of options -m, -t, and -w must be a non-negative integer.

showclusters: minimum value for option -t is 32
showclusters: minimum value for option -w is 32

The value given for a -t (truncate) or -w (wrap) option was below the hard-coded minimum of 32.

showclusters: not enough memory
showclusters: out of memory

There is not enough virtual memory for the initial allocation of memory for data ("not enough memory") or for reallocation of memory while reading input ("out of memory"). Use the -m option to set the initial allocation to the number of datatrees to be input, and do not use the -d or -f options. If it still fails, the computer does not have enough virtual memory to process the input file.
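The checks behind the -t and -w diagnostics listed above can be summarised in a small Python sketch. It is not taken from the showclusters source: parse_width and MIN_WIDTH are hypothetical names, and only the message texts and the minimum of 32 come from this page.

```python
MIN_WIDTH = 32  # hard-coded minimum documented for -t and -w


def parse_width(option, value):
    """Validate the argument of -t or -w and return it as an int.

    Raises ValueError carrying the diagnostic text documented above.
    """
    try:
        width = int(value)
    except ValueError:
        raise ValueError(f"showclusters: bad value for option {option} <width>") from None
    if width < 0:
        # Negative numbers are rejected as a bad value, not as "too small".
        raise ValueError(f"showclusters: bad value for option {option} <width>")
    if width < MIN_WIDTH:
        raise ValueError(f"showclusters: minimum value for option {option} is {MIN_WIDTH}")
    return width


for option, value in [("-w", "54"), ("-t", "16"), ("-w", "wide")]:
    try:
        print(option, parse_width(option, value))
    except ValueError as err:
        print(err)
```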

Examples

Display names of cluster centroids:

$ showclusters -x -f PCN -w 54 fromjarpat.tdt
======================== excerpt =====================
CLUSTER CENTROIDS LISTED BY CLUSTER SIZE:
count clust size var centroid
----- ----- ---- ------ --------------------------
0 49 20 0.3326 PCN P-I-PROPYLBENZOICACID,DIET
AMINOETHYLESTER
1 41 19 0.5083 PCN I-VALERICACID
2 13 17 0.1786 PCN HYDROCORTISONE-17-
PROPIONATE
3 15 15 0.1112 PCN 3'-O-METHYLADENOSINE
4 142 15 0.1792 PCN 4-QUINOLINAMINE,2-METHYL
5 46 14 0.3132 PCN 2,2',3,3',4,4',5,5',6,6'-
PCB
======================================================

Display names and CAS numbers of clustered structures, wrapping output at column 54:

$ showclusters -v -f '$CAS PCN' -w 54 fromjarpat.tdt
======================== excerpt =====================
CLUSTER 9 (64) size 9
9.0 0.0305 PCN IMIDAZOLIUM CHLORIDE,1-METHYL-2
-HYDROXYIMONMETHYL-3-(1,3,3-
TRIMETHYLBUTOXY)METHYL
$CAS 117941-48-7
9.1 0.0378 PCN IMIDAZOLIUM CHLORIDE,1-I-PROPOX
YMETHYL-2-HYDROXYIMINOMETHYL-3-
METHYL

9.2 0.0417 PCN IMIDAZOLIUM CHLORIDE,1-METHYL-2
-HYDROXYIMINO-3-(1-METHYL-3-
NITROPROPYL)OXYMETHYL
=====================================================

Display a sorted list of structures from the pomona91 database (which contains about 24800 structures):

$ showclusters -v -m 25000 pomona91jp.tdt
CLUSTERS LISTED BY SIZE, SMILES BY VAR(TANIMOTO):
======================== excerpt =====================
CLUSTER 1006 (1274) size 4
1006.0 0.0063 NS(=O)(=O)c1cc2c(s1)C(O)CCS2(=O)=O
1006.1 0.0065 NS(=O)(=O)c1cc2S(=O)CCC(O)c2s1
1006.2 0.0149 NS(=O)(=O)c1cc2SCCC(O)c2s1
1006.3 0.0196 NS(=O)(=O)c1cc2c(CC(O)CS2(=O)=O)s1
CLUSTER 1007 (1415) size 4
1007.0 0.0020 N#Cc1cccc2[nH]ccc12
1007.1 0.0030 N#Cc1ccc2[nH]ccc2c1
1007.2 0.0032 N#Cc1ccc2cc[nH]c2c1
1007.3 0.0042 N#Cc1c[nH]c2ccccc12
CLUSTER 1008 (1446) size 4
1008.0 0.3452 CCOP(=O)(OCC)Oc1ccc(SC)cc1C
1008.1 0.3484 CCOP(=O)(OCC)Oc1ccc(SC)c(C)c1
======================================================

Display summary of cluster data and frequency distribution of cluster sizes:

$ showclusters -h -q fromjarpat.tdt
...

Files

$DY_ROOT/bin/showclusters

Daylight License

programs: cluster

Bugs

It would be better if the -d and -f options were not mutually exclusive, so that the first data item of some datatypes could be output alongside all items of others.

Strictly speaking, the concepts "centroid" and "variance" do not apply to clusters based on metrics such as the Tanimoto coefficient when used with binary data. The structure referred to as the "centroid" is more accurately "the structure nearest to an approximation of the cluster centroid". The quantity referred to as "variance" is more accurately "the standard error of estimate using the structure nearest to the centroid as the predictor" or, granting the above definition of "centroid", "cluster variance unexplained by the centroid".