kmodes

Daylight v4.9
Release Date: 1 February 2008

Name

kmodes - calculate K-modes clustering

Unix Synopsis

kmodes [options] in.tdt [out.tdt]

Description

kmodes(1) performs a K-modes clustering of the input dataset, which must contain fixed-size fingerprints. Its output is designed to be post-processed using the listclusters(1) and showclusters(1) programs. The input file must be a .tdt file (either "list" or "dump" format) containing fingerprint (FP) data. Input is copied to output with a "cluster" (CL) data item inserted after each fingerprint item. In addition, the representative cluster modals are written as part of the first datatree for each cluster (FPM). If the name of the output file is not specified, output will be written to standard output.

Optionally, the program writes a data item indicating whether the datatree member would pass a fingertest screen against the modal for its cluster; that is, whether the member contains all the bits of the cluster's modal (see -MODAL_TESTTAG).
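The screen described above amounts to a subset test on bits. A minimal sketch, assuming fingerprints are modelled as Python integers used as bit sets (the function name passes_modal_screen is hypothetical and not part of kmodes):

```python
def passes_modal_screen(member: int, modal: int) -> bool:
    """Return True if every bit set in the modal is also set in the member."""
    # Assumption: fingerprints are plain integers treated as bit sets.
    return (member & modal) == modal


modal = 0b0110
print(passes_modal_screen(0b1110, modal))  # True: both modal bits are present
print(passes_modal_screen(0b1010, modal))  # False: one modal bit is missing
```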

A "CL generation" ($CLG) datatree is also written to output which includes the run ID, program name, version number, and the parameters used.

Options

-FID fpid

Use only fingerprints identified by `fpid' rather than the first one encountered in each tree. This is chiefly useful for testing: in normal use, there is usually only one fingerprint per tree. (-in)

-JP_RUNID runid

Identify this run by `runid' in $CLG and CL output data. (-id)

-MODAL_TESTTAG tag

Write a dataitem with either 'Y' or 'N' to indicate whether or not the modal for the cluster passes fingertest for this datatree item. Uses the given tag for the datatype.

-MAX_PASSES val

The maximum number of relocation passes to perform. If zero, the program performs an initial partitioning pass and quits. The default is unlimited, which means that the program will perform relocation passes until it converges (see -MIN_RELOC) before terminating.

-MFID fpid

When selecting the initial modal seeds, use only fingerprints identified by `fpid' rather than the first one encountered in each tree. This option applies with the -SEED_FILE option.

-MIN_MODESIZE items

Specify the minimum number of items a cluster must contain to remain in processing. If a cluster's size drops below this value, the cluster is removed and all of its previous members are relocated to the remaining clusters. The default is zero.
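The elimination step might be sketched as follows. This is an illustration only: the dictionary layout and the name drop_small_clusters are assumptions, and the actual relocation of orphaned members to other clusters is not shown.

```python
from typing import Dict, List, Tuple


def drop_small_clusters(
    clusters: Dict[int, List[str]], min_modesize: int = 0
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Split clusters into those kept and the members orphaned by removal."""
    kept: Dict[int, List[str]] = {}
    orphans: List[str] = []
    for cluster_id, items in clusters.items():
        if len(items) < min_modesize:
            # Cluster is too small: its members must be relocated elsewhere.
            orphans.extend(items)
        else:
            kept[cluster_id] = items
    return kept, orphans


kept, orphans = drop_small_clusters({1: ["a", "b", "c"], 2: ["d"]}, min_modesize=2)
print(sorted(kept), orphans)
```

With the default of zero, no cluster is ever smaller than the minimum, so nothing is removed.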

-MIN_RELOC val

The percentage of relocations within a pass below which the program terminates. If the percentage of relocations in a pass is below `val', the program considers the clustering to have converged, writes the data to output, and terminates. The default is 0.0.
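The interaction between -MAX_PASSES and -MIN_RELOC can be sketched as a simple loop. This is not the kmodes implementation: the callback relocate_pass is hypothetical, the 0 to 100 percentage scale is an assumption, and treating a pass with no relocations as converged is an assumption made so that the default cutoff of 0.0 terminates.

```python
from typing import Callable, Optional


def run_relocation(
    relocate_pass: Callable[[], int],
    n_items: int,
    min_reloc: float = 0.0,
    max_passes: Optional[int] = None,
) -> int:
    """Run relocation passes and return how many were performed.

    relocate_pass performs one pass and returns the number of items moved.
    n_items must be greater than zero.
    """
    passes = 0
    # max_passes=None models the "unlimited" default; 0 skips relocation entirely.
    while max_passes is None or passes < max_passes:
        moved = relocate_pass()
        passes += 1
        percent = 100.0 * moved / n_items  # assumed 0-100 scale
        # Assumption: a pass that moves nothing also counts as converged.
        if moved == 0 or percent < min_reloc:
            break
    return passes


moves = iter([50, 10, 0])
print(run_relocation(lambda: next(moves), n_items=100, min_reloc=20.0))
```

In this example the second pass relocates 10 percent of the items, which is below the cutoff of 20, so the loop stops before the third pass.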

-MODES modes

Specify the initial number of clusters (K) for the run. This is also the maximum number of clusters found, because the program may eliminate clusters but will not add them. The default value is 100, or the count of seeds in the file given with -SEED_FILE.

-MODAL_THRESHOLD value

Specify the threshold percentage for computation of the modal. The default is 0.5.
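One plausible reading of the threshold is that a bit is set in the modal when the fraction of cluster members with that bit set reaches the threshold. The sketch below follows that reading. It is an assumption, not a statement of how kmodes computes modals: the name compute_modal, the integer fingerprint model, and the use of "at least" rather than "strictly greater than" are all illustrative choices.

```python
from typing import List


def compute_modal(members: List[int], nbits: int, threshold: float = 0.5) -> int:
    """Build a modal by setting each bit whose frequency reaches the threshold."""
    if not members:
        return 0
    modal = 0
    for bit in range(nbits):
        count = sum((fp >> bit) & 1 for fp in members)
        # Assumption: ">=" is used; the source does not say whether the test is strict.
        if count / len(members) >= threshold:
            modal |= 1 << bit
    return modal


members = [0b1100, 0b1010, 0b1001]
print(bin(compute_modal(members, nbits=4)))                 # 0b1000
print(bin(compute_modal(members, nbits=4, threshold=0.3)))  # 0b1111
```

Only the top bit is shared by all three members, so it alone survives the default threshold; lowering the threshold to 0.3 admits the bits that appear in one member out of three.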

-MOVE bool

Controls whether or not the modals are recomputed for each item added during the initial partitioning step. The default is TRUE, which means that, after each item is added to a cluster, the modal for that cluster is recomputed.

-RANDOM bool

When selecting the seeds from the input dataset, this controls whether to take the initial K items as the seeds (FALSE) or to select the K items randomly from the entire dataset (TRUE). The default is FALSE.

-RANDOM_SEED val

When -RANDOM is TRUE, this sets the initial numeric seed for the random number generator. It can be used to repeat a run with the same pseudo-random set of items selected by the -RANDOM option.
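The two seed-selection modes, and the role of a fixed random seed in making a run repeatable, can be illustrated as follows. The function select_seeds is hypothetical, and Python's random module stands in for whatever generator kmodes uses.

```python
import random
from typing import List, Optional, Sequence


def select_seeds(
    items: Sequence[str],
    k: int,
    use_random: bool = False,
    random_seed: Optional[int] = None,
) -> List[str]:
    """Pick k seed items: the first k, or a random sample of k."""
    k = min(k, len(items))
    if not use_random:
        # Mirrors -RANDOM FALSE: take the initial K items.
        return list(items[:k])
    # Mirrors -RANDOM TRUE: a fixed seed reproduces the same selection.
    rng = random.Random(random_seed)
    return rng.sample(list(items), k)


items = list("abcdefghij")
print(select_seeds(items, 3))
print(select_seeds(items, 3, use_random=True, random_seed=42)
      == select_seeds(items, 3, use_random=True, random_seed=42))
```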

-SEED_FILE file

When provided, the initial seed modals are read from this file. The default is to not read an external file and select the initial modal seeds from the input dataset.

-RECORD_COUNT count

Initially allocate memory for `count' structures. Ideally, `count' should be set to the number of structures to be input. It is good practice to specify a number equal to or slightly greater than this number. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. (-m)

-EXPRESSION expr

Use the given expression as the similarity measure for the neighbors list generation. The default is tanimoto().
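For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows that standard formula on integers used as bit sets; it is not the Daylight expression evaluator, and returning 0.0 for two empty fingerprints is an assumption.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto coefficient: bits in common divided by bits set in either."""
    common = bin(a & b).count("1")
    either = bin(a | b).count("1")
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return common / either if either else 0.0


# One shared bit out of three bits set in either fingerprint.
print(tanimoto(0b1100, 0b0110))
```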

-COMPARISON [DISTANCE|SIMILARITY]

Controls relative goodness of similarity comparisons for list ranking. SIMILARITY means that higher values are better; DISTANCE means that lower values are better. If not specified, the program attempts to derive the directionality of the measure given in the -EXPRESSION option by computing it at the endpoints.
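One way to read "computing it at the endpoints" is to evaluate the measure on a pair of identical fingerprints and on a pair with no bits in common, then compare the two values. The sketch below follows that interpretation. It is an assumption about the approach, the name infer_comparison is hypothetical, and the two example measures are simple stand-ins.

```python
from typing import Callable


def infer_comparison(measure: Callable[[int, int], float]) -> str:
    """Guess whether higher or lower values of a measure mean 'more alike'."""
    a, b = 0b11110000, 0b00001111  # b shares no bits with a
    identical = measure(a, a)
    disjoint = measure(a, b)
    if identical > disjoint:
        return "SIMILARITY"  # higher values are better
    if identical < disjoint:
        return "DISTANCE"    # lower values are better
    raise ValueError("measure does not distinguish the endpoints")


def tanimoto(x: int, y: int) -> float:
    return bin(x & y).count("1") / bin(x | y).count("1")


def hamming(x: int, y: int) -> float:
    return float(bin(x ^ y).count("1"))


print(infer_comparison(tanimoto))  # SIMILARITY
print(infer_comparison(hamming))   # DISTANCE
```

When the measure gives the same value at both endpoints, the direction cannot be inferred this way, which is a case where specifying -COMPARISON explicitly would be the safer choice.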

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

kmodes: input file not specified

An input file was not specified on the command line.

kmodes: can't open input file

The input file specified on the command line does not exist or is not readable.

kmodes: can't open output file

The output file specified on the command line can't be accessed for writing.

kmodes: problem with option manager

The option manager could not be initialized. Verify that DY_ROOT is set properly.

kmodes: note, x of x trees contain valid fingerprints

This (non-fatal) message appears if not all input trees contain fingerprints, and is intended to let the user know how much work is actually being done (trees without fingerprints are ignored in the computations). The number of trees with fingerprints is typically a few less than the total.

kmodes: no trees with valid fingerprints were found

No valid fingerprints were found, either because no FP items were in the input or their "Run ID" didn't match that specified with the -FID option.

Files

$DY_ROOT/bin/kmodes

Daylight License

programs: cluster