jpscan

Daylight v4.9
Release Date: 1 February 2008

Name

jpscan - calculate summary statistics for Jarvis-Patrick clustering

Unix Synopsis

jpscan [options] [in.tdt [out.tdt]]

Description

jpscan(1) repeatedly performs Jarvis-Patrick clustering while varying the parameters controlling the clustering process. The program is typically used to find Jarvis-Patrick parameters which work well for a particular data set, before clustering with jarpat(1). Input is a .tdt file (either "list" or "dump" format) containing "nearest neighbors" (NN) data (e.g., produced by nearneighbors(1)). Output is in the form of 80-column-wide ASCII tables, which are typically printed or saved to a file for later examination. Input and output default to standard input and output, respectively.

jpscan(1) operates on the same input as jarpat(1), i.e. nearest neighbor lists. The Jarvis-Patrick algorithm clusters two items if "need" of their "near" nearest neighbors are in common. To produce output tables, jpscan(1) varies these two parameters up to a specified limit (-JP_NEAR option). In general, high parameter values produce better defined clusters, while low values result in better linkage, more complete clustering, and discovery of smaller clusters. The "best" parameter values depend on the particular data set and the purpose for which the clustering is to be used. For instance, 12/16 might be used when characterizing structures in a highly diverse database, while 8/14 might be used for selecting a small number of representatives of a highly clustered corporate database for screening.
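
The need/near criterion and the parameter scan can be sketched in Python as follows. This is a minimal illustration, not the jpscan(1) implementation: the names jp_cluster and scan, the in-memory dictionary of neighbor lists, the use of union-find to merge linked items, and the assumption that a neighbor list does not contain the item itself are all choices made for the example.

```python
from itertools import combinations


def jp_cluster(nn, near, need):
    """Cluster items with the strict Jarvis-Patrick criteria.

    nn maps each item to its nearest-neighbor list, best neighbor first
    (assumed not to include the item itself). Two items are linked when
    each is among the other's first `near` neighbors and the two
    truncated lists share at least `need` items.
    """
    lists = {item: set(neighbors[:near]) for item, neighbors in nn.items()}
    parent = {item: item for item in nn}

    def find(item):
        # Union-find lookup with path halving.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in combinations(nn, 2):
        mutual = a in lists[b] and b in lists[a]
        if mutual and len(lists[a] & lists[b]) >= need:
            parent[find(a)] = find(b)

    clusters = {}
    for item in nn:
        clusters.setdefault(find(item), []).append(item)
    return list(clusters.values())


def scan(nn, max_near):
    """Return {(need, near): number of clusters with two or more members}."""
    table = {}
    for near in range(2, max_near + 1):
        for need in range(1, near + 1):
            clusters = jp_cluster(nn, near, need)
            table[(need, near)] = sum(1 for c in clusters if len(c) > 1)
    return table
```

Each (need, near) pair in the returned dictionary corresponds to one cell of an output table; the sketch fills only a cluster-count table, and the pairwise loop is quadratic, so it is suitable for small illustrations only.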

A header and six tables are written to output:

NUMBER OF STRUCTURES CLUSTERED
PERCENTAGE OF STRUCTURES CLUSTERED
NUMBER OF CLUSTERS
AVERAGE CLUSTER SIZE
SIZE OF LARGEST CLUSTER
NUMBER OF SINGLETONS
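
For one (need, near) pair, the six statistics could be derived from a clustering roughly as sketched below. The function name summarize and the exact definitions are assumptions: this page does not state how jpscan(1) computes each value, so the sketch assumes that a singleton is a cluster of size 1 and that singletons are excluded from the "clustered" counts, the cluster count, and the average.

```python
def summarize(clusters):
    """Compute one cell of each of the six tables for a single clustering.

    clusters is a list of lists of structure identifiers. Clusters of
    size 1 are treated as singletons (assumed definition).
    """
    real = [c for c in clusters if len(c) > 1]
    total = sum(len(c) for c in clusters)
    clustered = sum(len(c) for c in real)
    percentage = 100.0 * clustered / total if total else 0.0
    average = clustered / len(real) if real else 0.0
    return {
        "NUMBER OF STRUCTURES CLUSTERED": clustered,
        "PERCENTAGE OF STRUCTURES CLUSTERED": percentage,
        "NUMBER OF CLUSTERS": len(real),
        "AVERAGE CLUSTER SIZE": average,
        "SIZE OF LARGEST CLUSTER": max((len(c) for c in real), default=0),
        "NUMBER OF SINGLETONS": len(clusters) - len(real),
    }
```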

As with jarpat(1), jpscan(1) provides exhaustive and fast options to control the clustering search. See the description of jarpat(1) for more information on the method.
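
The difference between the search modes lies in how the first Jarvis-Patrick criterion (membership in each other's neighbor lists) is applied. The Python sketch below illustrates the three cases as they are described under the -JP_TYPE options; the function name first_criterion and the mode labels "strict", "fast", and "exhaustive" are hypothetical and are not jpscan(1) identifiers.

```python
def first_criterion(a, b, lists, mode="strict"):
    """Decide whether items a and b pass the first Jarvis-Patrick criterion.

    lists maps each item to the set of its first `near` neighbors.
    "strict" requires each item to be in the other's list, "fast"
    requires only one of the two memberships, and "exhaustive" drops
    the requirement entirely.
    """
    a_in_b = a in lists[b]
    b_in_a = b in lists[a]
    if mode == "strict":
        return a_in_b and b_in_a
    if mode == "fast":
        return a_in_b or b_in_a
    if mode == "exhaustive":
        return True
    raise ValueError("unknown mode: %r" % (mode,))
```

In every mode, a pair that passes this test must still share at least "need" neighbors before the two items are clustered.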

Options

-JP_TYPE EXHAUSTIVE

Do an exhaustive search for cluster members by fully relaxing the first Jarvis-Patrick criterion (don't require cluster members to be in each other's lists). [-e]

-JP_TYPE FAST

Do a fast search for cluster members by half relaxing the first Jarvis-Patrick criterion (require that only one cluster member be in another's list). [-f]

-NNID nnid

Only use nearest neighbor lists identified by `nnid', rather than the first one encountered in the tree. This is chiefly useful for testing: in normal use, only one set of nearest neighbors is ever generated. [-in]

-RECORD_COUNT count

Initially allocate memory for `count' structures. Ideally, `count' should be equal to or slightly greater than the number of structures to be input. If more than `count' structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. [-m]

-RESCUE_SIZE sos

Rescue singletons by putting them in the cluster which contains the plurality, but no fewer than `sos', of their `near' nearest neighbors (where `near' is as per option -JP_NEAR). Singletons are not rescued by default. Provides two additional tables of output statistics: rescued singletons, and singletons before rescue.
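
The rescue rule can be illustrated with the following Python sketch. It is an interpretation under stated assumptions, not the jpscan(1) code: the name rescue_singletons and the data layout are hypothetical, a tie for the largest vote count is assumed to mean that there is no plurality, and votes are counted against the clustering as it stood before any rescue.

```python
from collections import Counter


def rescue_singletons(nn, assignment, near, sos):
    """Return {singleton: cluster id} for the singletons that can be rescued.

    nn maps an item to its neighbor list, best neighbor first.
    assignment maps an item to its cluster id, or to None for a singleton.
    """
    rescued = {}
    for item, cluster in assignment.items():
        if cluster is not None:
            continue
        # Count how many of the first `near` neighbors sit in each cluster.
        votes = Counter(
            assignment[n]
            for n in nn[item][:near]
            if assignment.get(n) is not None
        )
        if not votes:
            continue
        ranked = votes.most_common(2)
        best, count = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == count
        if count >= sos and not tied:
            rescued[item] = best
    return rescued
```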

-JP_NEAR near

Specify the maximum number of nearest neighbors to be examined. This will become the highest row and column values in the output tables. Specifying a number higher than the actual length of the nearest neighbor lists in the input will result in an error. The default is 16. [-n]

-NN_BEST_THRESHOLD val

Causes any structure whose best neighbor is less similar than `val' to be considered a singleton. By default, no threshold is applied.

-COMPARISON [DISTANCE|SIMILARITY]

Controls relative goodness of similarity comparisons for tie-handling. SIMILARITY means that higher values are better; DISTANCE means that lower values are better. This option is used only in conjunction with the -NN_BEST_THRESHOLD value; otherwise it is ignored. The default is SIMILARITY.
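
One way to read the interaction between -NN_BEST_THRESHOLD and -COMPARISON is sketched below in Python. The function name forced_singletons and its arguments are hypothetical, and the treatment of a score exactly equal to the threshold (it passes) is an assumption, since this page does not specify it.

```python
def forced_singletons(best_scores, threshold, comparison="SIMILARITY"):
    """Return the set of items whose best neighbor fails the threshold.

    best_scores maps each item to the score of its best neighbor. With
    SIMILARITY, higher scores are better, so a score below the threshold
    fails; with DISTANCE, lower scores are better, so a score above the
    threshold fails.
    """
    if comparison == "SIMILARITY":
        def fails(score):
            return score < threshold
    elif comparison == "DISTANCE":
        def fails(score):
            return score > threshold
    else:
        raise ValueError("comparison must be SIMILARITY or DISTANCE")
    return {item for item, score in best_scores.items() if fails(score)}
```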

-JP_USE_ALL_TIES

Count all ties in the neighbor list as part of the NEED for clustering. The default behavior is to count only those common ties that can be part of the first NEAR neighbors in the list.

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

jpscan: can't open input file

The input file specified on the command line does not exist or is not readable.

jpscan: can't open output file

The output file specified on the command line can't be created for writing.

jpscan: bad value for option xx xxxxxxx

A non-numeric value was specified for an integer-valued option.

jpscan: bad [format|value] in NN<list>

Something unexpected happened when reading the nearest neighbor lists. A possible cause is that jpscan(1) was asked to examine more near neighbors than exist in the input lists. If a -n option was used to specify list length to nearneighbors(1), be sure to specify the same (or lower) value to jpscan(1).

jpscan: xx (trees with NN data) not equal to xx ($SMI trees)

Input file is not valid; a nearest neighbor list must be available for each structure input.

jpscan: option -n `mxnear' too low (must be 2 or more)

A nonsensical maximum limit was specified.

jpscan: options -e and -f are not compatible

Both exhaustive (-e) and fast (-f) searching were requested, which is not possible.

jpscan: unknown option encountered: xxx

An invalid option was specified on the command line.

jpscan: out of memory

The program was not able to allocate enough virtual memory to run the specified problem. Use the -m option to specify the number of datatrees to be input and don't specify an excessively large value for the -n option. If it still fails, your computer doesn't have enough virtual memory to process the input file.

jpscan: note, x of x input trees contain valid NN data

This (non-fatal) message appears if not all input trees contain nearest neighbor lists, and is intended to let the user know how much work is actually being done (trees without NN data are ignored in the clustering). The number of trees with nearest neighbors is typically a few less than the total.

jpscan: no trees with valid nearest neighbor lists were found

No valid nearest neighbor lists were found, either because no NN data were input or their "Run ID" didn't match that specified with the -in option.

Examples

jpscan(1) is typically used in concert with nearneighbors(1) and jarpat(1) for doing cluster analysis.

Examine Jarvis-Patrick clustering of the structures in a .tdt file containing 24718 datatrees:

$ fingerprint < p91.tdt > p91fp.tdt
$ nearneighbors p91fp.tdt > p91nn.tdt
$ jpscan -RECORD_COUNT 25000 p91nn.tdt

Files

$DY_ROOT/bin/jpscan

Daylight License

programs: cluster

Bugs

An option to control table width would be nice. An option to format output as per tbl(1) would be nice.