listclusters

Daylight v4.9
Release Date: 1 February 2008

Name

listclusters - sort and reformat Jarvis-Patrick clustering data

Unix Synopsis

listclusters [options] [in.tdt] [ out.[tdt|smi] ]

Description

listclusters sorts and optionally reformats a Thor datatree file containing data from a clustering program (e.g. jarpat) for use with other programs. The input file must be a .tdt file (either "list" or "dump" format) containing data of $SMI and CL datatypes (FP and $CLG datatypes are optional). Datatrees are sorted by cluster and written to standard output. By default, the order in which structures appear within clusters is arbitrary.

The -v option causes the cluster "variance" unexplained by each structure to be computed within the cluster. The variance will be included in the output and structures will be ordered within clusters by this value. The structure with the lowest value is called the "centroid"; the -x option limits output to just centroids.
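
The ordering described above can be illustrated with a minimal sketch. It is not the listclusters implementation. It assumes that fingerprints are given as sets of on-bit positions, that the distance is 1 - Tanimoto, and that a structure's score is its mean squared distance to the other cluster members. The score formula and the names tanimoto and rank_cluster are hypothetical, chosen for illustration only.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def rank_cluster(fingerprints):
    """Return (index, score) pairs sorted by ascending score.

    ASSUMPTION: score = mean squared (1 - tanimoto) distance from a
    structure to every other member of the cluster. This is an
    illustrative stand-in for the "variance", not the program's formula.
    """
    scores = []
    for i, fp_i in enumerate(fingerprints):
        others = [
            1.0 - tanimoto(fp_i, fp_j)
            for j, fp_j in enumerate(fingerprints)
            if j != i
        ]
        # A singleton cluster has no other members; score it as 0.0.
        score = sum(d * d for d in others) / len(others) if others else 0.0
        scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1])


# Made-up three-member cluster; the first entry of the result plays
# the role of the "centroid".
cluster = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5, 6}]
ranked = rank_cluster(cluster)
centroid_index = ranked[0][0]
```

Under these assumptions the structure with the lowest score is the one that -x would retain, and the remaining order is the one -v would produce within the cluster.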

The default output format is .tdt "list" format:

$SMI<smiles>
CL<cluster no;cluster size;runid>
VAR<variance;runid>
... additional data ...
|
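
For downstream processing, output in this layout can be read with a few lines of Python. The sketch below assumes "list" format with one TAG<data> item per line, a line containing only "|" terminating each datatree, and ";" separating fields, with no quoted or escaped characters inside the data. The function name parse_tdt_list and the sample records are made up for illustration.

```python
import re

# One data item per line: TAG<field;field;...>
ITEM = re.compile(r"^([^<]+)<(.*)>$")


def parse_tdt_list(text):
    """Parse .tdt "list" text into a list of datatrees.

    Each datatree is a list of (tag, fields) tuples in input order.
    ASSUMPTION: no quoted or escaped ';', '<' or '>' inside the data.
    """
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "|":          # end of datatree
            if current:
                records.append(current)
            current = []
            continue
        match = ITEM.match(line)
        if match:
            current.append((match.group(1), match.group(2).split(";")))
    if current:                  # tolerate a missing final "|"
        records.append(current)
    return records


# Made-up sample data in the layout shown above.
sample = """$SMI<CCO>
CL<1;2;run1>
VAR<0.05;run1>
|
$SMI<CCN>
CL<1;2;run1>
VAR<0.12;run1>
|
"""
records = parse_tdt_list(sample)
```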

If option -s is used, output will be in .smi format:

smiles clusterno [variance] [additional data]

The options in effect determine which optional data fields are output:

The -id option causes the "runid" fields to be output.
The -v and -x options cause [variance] to be output.
The -d and -f options cause "additional data" to be output.
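
Because the optional fields depend on the options used, a reader of .smi output has to be told which fields to expect. The sketch below groups .smi lines by cluster. It assumes whitespace-separated fields, an integer cluster number, and no runid field, and it ignores any additional data. The function name group_smi_by_cluster and the has_variance flag are hypothetical.

```python
from collections import defaultdict


def group_smi_by_cluster(lines, has_variance=False):
    """Group .smi output lines by cluster number.

    Returns {cluster_no: [(smiles, variance_or_None), ...]}.
    ASSUMPTION: fields are whitespace-separated and the variance, when
    present (output produced with -v or -x), is the third field.
    """
    clusters = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:      # skip blank or malformed lines
            continue
        smiles, cluster_no = fields[0], int(fields[1])
        variance = None
        if has_variance and len(fields) > 2:
            variance = float(fields[2])
        clusters[cluster_no].append((smiles, variance))
    return dict(clusters)


# Made-up lines in the layout "smiles clusterno variance".
smi_lines = ["CCO 1 0.05", "CCN 1 0.12", "c1ccccc1 2 0.0"]
grouped = group_smi_by_cluster(smi_lines, has_variance=True)
```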

Options

-a

List all clusters, including singletons (singletons are suppressed by default).

-d tags

Append all data items of datatypes listed in `tags' to the output, where `tags' is a space-delimited list of THOR datatags. Data will appear in the same order as in the input. Not compatible with option -f.

-f tags

Append the first data item of each datatype listed in `tags' to the output, where `tags' is a space-delimited list of THOR datatags. Not compatible with option -d.

-in clid

Use only cluster (CL) data identified by `clid' as input rather than using the first occurring CL item in the datatree. Useful when CL data from more than one clustering run are present.

-s

Write output in .smi format (.tdt format is default).

-v

Sort structures by cluster size then by variance within the cluster. This option causes variance to be included in the output. Not compatible with option -x.

-x

Output only cluster centroids, sorted by cluster size. This option causes variance to be included in the output. Not compatible with option -v.

-RECORD_COUNT count

Initially allocate memory for `count' structures. Ideally, `count' should be equal to or slightly greater than the number of structures to be input. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. [-m]

-EXPRESSION expr

Use the given expression to compute the variance for the -v computation (default: 1-tanimoto()). Note: the expression must be a distance relative to 0.0 (i.e. lower is better, with a minimum distance of 0.0); otherwise the variance computation will give nonsensical results.

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

listclusters: unknown option encountered: xxx

An invalid option was specified on the command line.

listclusters: can't open input file xxx

The file specified on the command line does not exist or is not readable.

listclusters: bad value for option -m <limit>

The value for option -m is not a non-negative integer.

listclusters: not enough memory
listclusters: out of memory

There is not enough virtual memory for the initial allocation of memory for data ("not enough memory") or for reallocation of memory while reading input ("out of memory"). Use the -m option to set the limit to the number of datatrees to be input and don't use -d or -f options. If it still fails, your computer doesn't have enough virtual memory to process the input file.
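
One way to pick the limit is to count the datatrees in the input before running listclusters. The sketch below assumes every datatree starts with a line beginning "$SMI<", i.e. that $SMI is the root datatype. The function name count_structures and the sample lines are made up for illustration.

```python
def count_structures(lines):
    """Count datatrees by counting lines that begin with the $SMI root.

    ASSUMPTION: each datatree starts a new line with "$SMI<".
    """
    return sum(1 for line in lines if line.startswith("$SMI<"))


# Made-up "list" format input with two datatrees.
tdt_lines = [
    "$SMI<CCO>", "CL<1;2;run1>", "|",
    "$SMI<CCN>", "CL<1;2;run1>", "|",
]
n_structures = count_structures(tdt_lines)
```

For a real file, the same function can be applied to the open file object, for example count_structures(open("fromjp.tdt")), and the result passed as the limit.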

Examples

Extract SMILES, cluster number, local names and CAS Registry numbers from jarpat(1) output for subsequent processing:

$ listclusters -d '$CAS PCN' fromjp.tdt > output.tdt

Use listclusters to remove extraneous datatypes from jarpat output and add a runid for subsequent loading into a Thor database:

$ listclusters -id '10DEC91' fromjp.tdt > tothor.tdt

Create a .smi file containing one structure per cluster (the one closest to the cluster centroid) including singleton clusters:

$ listclusters -x -a -s fromjp.tdt > centroids.smi

Files

$DY_ROOT/bin/listclusters

Daylight License

programs: cluster

Bugs

It would be better if the -d and -f options were not mutually exclusive, resulting in output of the first data items of certain datatypes while allowing output of all items of others.

Strictly speaking, the concepts "centroid" and "variance" do not apply to clusters based on metrics such as the Tanimoto coefficient when used with binary data. The structure referred to as the "centroid" is more accurately "the structure nearest to an approximation of the cluster centroid". The quantity referred to as "variance" is more accurately "the standard error of estimate using the structure nearest to the centroid as the predictor" or, granting the above definition of "centroid", "cluster variance unexplained by the centroid".