nearneighbors

Daylight v4.9
Release Date: 1 February 2008

Name

nearneighbors - calculate nearest neighbors

Unix Synopsis

nearneighbors [options] in.tdt [out.tdt]

Description

nearneighbors(1) finds the nearest neighbors for each item in a given list, using a binary Tanimoto metric to compare structural fingerprints. Its output is designed to be used for Jarvis-Patrick clustering rather than saved in a database (see jarpat(1)). The input file must be a .tdt file (either "list" or "dump" format) containing fingerprint (FP) data. Both fixed-width and folded fingerprints may be processed (comparison between fingerprints of different sizes is done by folding the larger one to the size of the smaller one). Input is copied to output with a "nearest neighbors" (NN) data item inserted after each fingerprint item. If the name of the output file is not specified, output will be written to standard output.

A "NN generation" ($NNG) datatree is also written to output which includes the run ID (if set via the fingerprint(1) -id option), program name, version number, and the nnear parameter used.

The "Distance" field (3rd field in NN output data) contains similarity values as a binary Tanimoto metric, defined as:

Nc / (N1 + N2 - Nc)

where N1, N2, and Nc are the number of fingerprint bits set in the structure, the neighbor, and in common, respectively. Used with structural fingerprints, this metric can be thought of as an approximation of the fraction of all substructures in both structures which are in common (see fingerprint(1)). A value of 0.0 indicates complete dissimilarity (farthest); 1.0 indicates substructural identity (closest). By definition, each structure is its own nearest neighbor.

WARNING:

Finding nearest neighbors in a non-Euclidean space is an order N-squared process (N is the number of structures). The nearneighbors(1) implementation provided by Daylight is highly optimized and runs faster than N-squared; however, running time is not predictable since the optimization effectiveness is dependent on the distribution of structures (runs faster on more highly clustered data sets). If you plan on using this program with large data sets (e.g. 100000+ structures) allow for the possibility of very long running times (e.g. days).

CALCULATIONS ON LARGE DATASETS:

There are two typical approaches to handling large datasets. The choice of approach depends on ones available computing resources: if one has computers with sufficient memory to run the entire dataset (approximately 10MB required per 100k fingerprints), the first approach should be chosen. If not, the second approach is preferred.

Approach 1: In this approach, the options -SKIP_RECORDS and -DO_RECORDS are used to calculate portions of the neighbors lists. Each neighbor list calculated contains the result from scanning the entire dataset. That is, each list contains the definitive set of neighbors for that fingerprint. The mergeneighbors utility, with the option -NN_MERGE_LISTS FALSE, can be used to combine multiple output files. For example, given a 100k fingerprint input file, the following sequence will generate a complete output nearneighbors list:

nearneighbors -DO_RECORDS 30000 -WRITE_SKIPS FALSE in.tdt tempout1.tdt
nearneighbors -SKIP_RECORDS 30000 -DO_RECORDS 30000 -WRITE_SKIPS FALSE in.tdt tempout2.tdt
nearneighbors -SKIP_RECORDS 60000 -WRITE_SKIPS FALSE in.tdt tempout3.tdt
mergeneighbors -NN_MERGE_LISTS FALSE in.tdt tempout1.tdt tempout2.tdt tempout3.tdt > out.tdt

In this case, concatenation of the three intermediate files could be substituted for the mergeneighbors step. The one advantage of using mergeneighbors is that it will issue a warning if any neighbors lists are missing.

Approach 2: In this approach, the -SKIP_INPUT and -DO_INPUT options are used. These options control which structures are considered as potential neighbors. Only the range of fingerprints between skip+1 and skip+do are loaded into memory (thus conserving memory). Each output neighbors list will only contain the best neighbors from within the range skip+1 and skip+do. Output files must be merged with the -NN_MERGE_LISTS TRUE option. Again, an example of a 100k fingerprint input file:

nearneighbors -DO_INPUT 30000 in.tdt tempout1.tdt
nearneighbors -SKIP_INPUT 30000 -DO_INPUT 30000 in.tdt tempout2.tdt
nearneighbors -SKIP_INPUT 60000 in.tdt tempout3.tdt
mergeneighbors -NN_MERGE_LISTS TRUE in.tdt tempout1.tdt tempout2.tdt tempout3.tdt > out.tdt

Tradeoffs: Note that approach 2 uses much more disk space to store the intermediate lists: in the example, each of the three intermediate files is about the same size as out.tdt. Thus, four time as much disk space is required. Approach 2 does have the advantage of being able to handle larger problems than can be loaded into memory of a single computer.

Options

-NNID runid

Identify this run by runid in $NNG and NN output data. (-id)

-FID fpid

Use only fingerprints identified by fpid rather than the first one encountered in each tree. Use the fingerprint(1) -id fpid option to add a FID field to FP dataitems. This is chiefly useful for testing: in normal use, there is usually only one fingerprint per tree. (-in)

-SKIP_RECORDS skip -DO_RECORDS do

Controls which portion of the input file for which to calculate nearest neighbors (i.e. the self TDTs). Calculation of neighbors begins with the skip+1 tree, and proceeds through the skip+do tree. Note that this parameter applies to the calculation of neighbors only, the skipped trees can be found as neighbors for other trees. The default value for skip is 0 records while the default for do is all records. Depending on the value of -WRITE_SKIPS, the skipped trees may be written to output. (-k)

-WRITE_SKIPS [TRUE|FALSE]

Controls whether or not to write to output trees for which neighbors have not been calculated. If true, the unaltered portions of the input file are written to output. That is, trees before skip in the input file and after skip + do are written unaltered.

-SKIP_INPUT skip -DO_INPUT do

Controls which portion of the input file to use as the potential neighbors (i.e. the others). Only the trees from skip+1 to skip+do are considered as potential neighbors. The default value for skip is 0 records while the default for do is all records. These options are incompatible with the -UPDATE_FILE option.

-NEIGHBORS nnear

Specify the length of nearest neighbor lists to be generated. This should be set to the highest number of nearest neighbors to be needed by subsequent processing. Lower values save time and space. Although any number greater than one is acceptable, 20 is about the largest value typically used for Jarvis-Patrick clustering. The default value is 16. (-n)

-MAX_NEIGHBORS nnear

Specify the maximum length of nearest neighbor lists to be generated, including any tied neighbors at the final position in the list. This must be larger than -NEIGHBORS.

-RECORD_COUNT count

Initially allocate memory for count structures. Ideally, count should be set to the number of structures to be input. It is good practice to specify a number equal to or slightly greater than this number. If more than count structures are encountered while reading input, memory will be reallocated as needed, resulting in a performance penalty and a possible "out of memory" error. The default is 10000. (-m)

-UPDATE_FILE nn.tdt

Update previous nearneighbors results in file nn.tdt. This is useful for clustering large databases which get updated with a relatively small number of structures relatively often. The output of nearneighbors run with the -u option is functionally equivalent to that which would result from a normal run with all structures.(-u)

-UPDATE_FID nnid

Use fingerprints identified by nnid when extracting data from previous results (this option is only useful with the -u option). If specified, the first NN data item following the selected fingerprint is taken as the corresponding nearest neighbors list. By default, the first fingerprint (FP data item) in each tree is used. (-uid)

-NUM_PROCESSES val

Use val child processes on multiprocessing machines, (solaris, SGI). For best performance, use two child processes per cpu. This allows the system to schedule the children as it sees fit. To use only part of a multi-cpu machine, set val to the number of desired cpus + 1 (eg. to use only 4 of an 8-cpu machine, set -NUM_PROCESSES to 5). Since the scheduling of child processes is imperfect, this will average to out to approximately the correct load.

-EXPRESSION expr

Uses given expression as the similarity measure for the neighbors list generation. (Default: tanimoto()).

-COMPARISON [DISTANCE|SIMILARITY]

Controls relative goodness of similarity comparisons for list ranking. SIMILARITY means that higher values are better; DISTANCE means that lower values are better. If not specified, the program attempts to derive the directionality of the measure given in the -EXPRESSION option by computing it at the endpoints.

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

nearneighbors: input file not specified

An input file was not specified on the command line.

nearneighbors: cannot open input file

The input file specified on the command line does not exist or is not readable.

nearneighbors: cannot open output file

The output file specified on the command line cannot be accessed for writing.

nearneighbors: problem with option manager

The option manager could not be initialized. Verify that DY_ROOT is set properly.

nearneighbors: bad value for option xx xxxxxxx

A non-numeric value was specified for an integer-valued option.

nearneighbors: value for option -NEIGHBORS val is too low

The shortest nearest neighbor list that can be generated by nearneighbors is 2 (for these purposes, structures are their own 1st-nearest neighbor).

nearneighbors: unknown option encountered: xxx

An invalid option was specified on the command line.

nearneighbors: out of memory

The program was not able to allocate enough virtual memory to run the specified problem. Use the -m option to set the limit to the number of datatrees to be input and limit the nearest neighbor list with the -n option. If it still fails, your computer doesnot have enough virtual memory to process the input file.

nearneighbors: note, x of x trees contain valid fingerprints

This (non-fatal) message appears if not all input trees contain fingerprints, and is intended to let the user know how much work is actually being done (trees without fingerprints are ignored in nearest neighbor computations). The number of trees with fingerprints is typically a few less than the total.

nearneighbors: no trees with valid fingerprints were found

No valid fingerprints were found, either because no FP items were in the input or their "Run ID" did not match that specified with the -FID option.

nearneighbors: (ouch!) old file missing NN dataitem
nearneighbors: bad format in NN<;list;>
nearneighbors: bad value (unexpected '.') in NN<;list;>
nearneighbors: bad value (cannot sscanf) in NN<;list;>

For update, the program could not successfully read the old neighbors list. Either a list was missing, the format was bad, or the list has a different number of neighbors than expected.

nearneighbors: Cannot use SKIP_INPUT or DO_INPUT for update

The update function, and the SKIP_INPUT and DO_INPUT functions are not compatible.

nearneighbors: Cannot use NN_BEST_THRESHOLD with SKIP_ or DO_INPUT

The NN_BEST_THRESHOLD and the SKIP_INPUT or DO_INPUT functions are incompatible. Since the SKIP_INPUT and DO_INPUT functions cause nearneighbors to ignore portions of the input file, the NN_BEST_THRESHOLD option would incorrectly throw away good neighbors lists and distort the results when the lists were merged. If using SKIP_INPUT and DO_INPUT, the NN_BEST_THRESHOLD option should be used during the jarpat step.

Files

$DY_ROOT/bin/nearneighbors

Daylight License

programs: cluster

Bugs

It would be useful if nearneighbors could provide an estimated time to completion as it runs.