Release Date: 1 February 2008
Name
showclusters - display Jarvis-Patrick clusters and statistics
Unix Synopsis
showclusters [options] [in.tdt] [out.lis]
Description
showclusters reads a .tdt (Thor datatree) file containing "Cluster" (CL) data and tabulates the data for text output. The input file must be a .tdt file (either "list" or "dump" format) containing data of the $SMI and CL datatypes; the FP datatype is optional. Standard output is used if an output file is not specified. If the input file is also not specified, input is expected on standard input. showclusters is able to produce four kinds of output:
summary of the data related to clustering (-h)
frequency distribution of cluster sizes (-q)
list of all structures, sorted by cluster size and then by variance within the cluster (-v)
list of cluster centroids, sorted by cluster size (-x)
The user must select one or more of these command line options to obtain output (there is no "default" type of output). Clusters are displayed sorted by cluster size (largest first). Output of "singleton" clusters (clusters containing a single structure) is suppressed by default. Structures are normally listed by their SMILES, but output can include other data present in the input file.
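The ordering and singleton rules described above can be illustrated with a short Python sketch. It is not the program's implementation: the function names `order_clusters` and `size_distribution` and the sample SMILES are hypothetical, and whether the real frequency distribution counts singleton clusters is an assumption here.

```python
from collections import Counter


def order_clusters(clusters, include_singletons=False):
    """Return clusters sorted by size, largest first.

    Singleton clusters (exactly one structure) are dropped unless
    include_singletons is True, mirroring the documented default.
    """
    kept = [c for c in clusters if include_singletons or len(c) > 1]
    # sorted() is stable, so clusters of equal size keep their input order.
    return sorted(kept, key=len, reverse=True)


def size_distribution(clusters):
    """Map cluster size -> number of clusters of that size (assumed to include singletons)."""
    return Counter(len(c) for c in clusters)


# Hypothetical input: each cluster is a list of SMILES strings.
clusters = [
    ["CCO"],
    ["c1ccccc1", "Cc1ccccc1", "Oc1ccccc1"],
    ["CC(=O)O", "CCC(=O)O"],
]

for members in order_clusters(clusters):
    print(len(members), " ".join(members))
```

Passing `include_singletons=True` to the hypothetical `order_clusters` keeps the one-member cluster at the end of the listing.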
List all clusters, including singletons. (Output of singleton clusters is suppressed by default.)
-d tags
Display all data items of datatypes listed in `tags' when listing structures, where `tags' is a space-delimited list of THOR datatags. Data will appear in the same order as in the input. Not compatible with option -f.
-in clid
Use only cluster (CL) data identified by `clid' as input rather than using the first occurring CL item in the datatree. Useful when CL data from more than one clustering run are present.
-f tags
Display the first data item of each datatype listed in `tags' when listing structures, where `tags' is a space-delimited list of THOR datatags. Not compatible with option -d.
-h
Display header and summary of clustering data.
-q
Display the frequency distribution of cluster sizes.
-t width
Truncate long data items at column `width' in list output. Not compatible with option -w.
-v
Display a list of all structures sorted by cluster size then by variance within the cluster.
-w width
Wrap long data items at column `width' in list output. Not compatible with option -t.
-x
Display a list of cluster centroids sorted by cluster size.
-m count
Initially allocate memory for `count' structures. Ideally, `count' should be equal to or slightly greater than the number of structures to be input. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000.
-EXPRESSION expr
Use the given expression to compute the variance used for the -v computation (default: 1-tanimoto()). Note: if the expression is not a distance relative to 0.0 (i.e., lower is better, with a minimum distance of 0.0), the variance computation will give nonsensical results.
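To show why the default expression 1-tanimoto() qualifies as a distance relative to 0.0, here is a minimal Python sketch. It assumes fingerprints are represented as sets of "on" bit positions; the function names and sample fingerprints are hypothetical and are not part of the Daylight toolkit.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(a & b) / union


def tanimoto_distance(a, b):
    """1 - Tanimoto: 0.0 for identical fingerprints, 1.0 for disjoint ones."""
    return 1.0 - tanimoto(a, b)


# Hypothetical fingerprints sharing 2 bits out of 5 distinct bits.
fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}

print(tanimoto_distance(fp1, fp1))  # identical fingerprints: minimum distance
print(tanimoto_distance(fp1, fp2))  # 1 - 2/5
```

By contrast, a bare similarity such as `tanimoto(fp1, fp1)` evaluates to 1.0 for identical fingerprints, so it is not a distance relative to 0.0 and would be the kind of expression the note above warns about.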
Return Value
Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:
showclusters: no output requested
This program requires that at least one of the options -h, -q, -v, -x be specified to obtain output.
showclusters: unknown option encountered: xxx
An invalid option was specified on the command line.
showclusters: can't open input file xxx
The file specified on the command line does not exist or is not readable.
showclusters: bad value for option -m <count>
showclusters: bad value for option -t <width>
showclusters: bad value for option -w <width>
The value of options -m, -t, and -w must be a non-negative integer.
showclusters: minimum value for option -t is 32
showclusters: minimum value for option -w is 32
The value given for a -t (truncate) or -w (wrap) option is below the hardcoded minimum of 32.
showclusters: not enough memory
showclusters: out of memory
There is not enough virtual memory for the initial allocation of memory for data ("not enough memory") or for reallocation of memory while reading input ("out of memory"). Use the -m option to set the initial allocation to the number of datatrees to be input, and do not use the -d or -f options. If it still fails, your computer does not have enough virtual memory to process the input file.
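The option rules behind several of these diagnostics can be checked before the program is run. The Python sketch below is a hypothetical pre-flight check, not part of showclusters: the function name `check_options` is invented, the argument scan is deliberately naive, and the wording of the two "not compatible" messages is an assumption, since no diagnostic for those cases is documented above.

```python
def check_options(argv):
    """Return a list of problems found in a showclusters-style argument list."""
    problems = []

    # At least one output-selecting option is required.
    if not any(opt in argv for opt in ("-h", "-q", "-v", "-x")):
        problems.append("no output requested")

    # Documented incompatibilities (message text is hypothetical).
    if "-t" in argv and "-w" in argv:
        problems.append("-t and -w are not compatible")
    if "-d" in argv and "-f" in argv:
        problems.append("-d and -f are not compatible")

    # -m, -t and -w take non-negative integers; -t and -w have a minimum of 32.
    for opt, minimum in (("-m", 0), ("-t", 32), ("-w", 32)):
        if opt not in argv:
            continue
        idx = argv.index(opt) + 1
        value = argv[idx] if idx < len(argv) else ""
        try:
            number = int(value)
        except ValueError:
            number = -1
        if number < 0:
            problems.append(f"bad value for option {opt}")
        elif number < minimum:
            problems.append(f"minimum value for option {opt} is {minimum}")

    return problems


print(check_options(["-v", "-w", "20", "fromjarpat.tdt"]))
```

An empty list means the arguments satisfy the rules covered here; it says nothing about the input file or memory conditions, which only the program itself can report.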
Examples
Display names of cluster centroids, wrapping output at column 54:
$ showclusters -x -f PCN -w 54 fromjarpat.tdt
Display names and CAS numbers of clustered structures, reading standard input and wrapping output at column 54:
$ showclusters -v -f '$CAS PCN' -w 54 < fromjarpat.tdt
Display sorted list of structures from pomona91 database (contains about 24800 structures):
$ showclusters -v -m 25000 pomona91jp.tdt
Display summary of cluster data and frequency distribution of cluster sizes:
$ showclusters -h -q fromjarpat.tdt
Daylight License
programs: cluster
Related Topics
fingerprint(1) jarpat(1) jpscan(1) listclusters(1) mergeneighbors(1) nearneighbors(1) licensing(5)
Daylight Theory Manual
Bugs
It would be better if the -d and -f options were not mutually exclusive, resulting in output of the first data items of certain datatypes while allowing output of all items of others.
Strictly speaking, the concepts "centroid" and "variance" do not apply to clusters based on metrics such as the Tanimoto coefficient when used with binary data. The structure referred to as the "centroid" is more accurately "the structure nearest to an approximation of the cluster centroid". The quantity referred to as "variance" is more accurately "the standard error of estimate using the structure nearest to the centroid as the predictor" or, granting the above definition of "centroid", "cluster variance unexplained by the centroid".