Daylight v4.9
Release Date: 1 February 2008

Name

fingerprint - generate fingerprint data for SMILES TDTs

Unix Synopsis

fingerprint [-b minbits] [-c maxbits] [-d density] [-id runid] [-t TAG] [-x]
[-z] [-s minstep/maxstep]
[-m [-mb minbits] [-mt TAG] [-md dens] [-mz]]
[ in.tdt [ out.tdt ] ]

Description

fingerprint creates an information-dense representation of the structural characteristics of molecules and reactions (called "fingerprints"). Fingerprints thus generated are comprehensive, compact, ambiguous, redundantly-coded, and are ideal for use in substructure screening, similarity searching, and structural clustering. Input is copied to output with a "Fingerprint" (FP) data item inserted after each SMILES item. Note that fingerprints can be successfully generated for both Molecule and Reaction SMILES.

Input must be in .tdt (Thor datatree) format (either "list" or "dump" format) containing SMILES ($SMI) data. If an output file name is not specified on the command line, output is written to standard output; if the input file is also not specified, input is expected on standard input.
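The invocation forms described above can be assembled programmatically. The following is a minimal sketch in Python, not part of the Daylight distribution; the function name build_fingerprint_argv and its keyword arguments are hypothetical, and only options documented on this page are emitted.

```python
def build_fingerprint_argv(infile=None, outfile=None, minbits=None,
                           maxbits=None, density=None, runid=None):
    """Assemble an argument list for the fingerprint program.

    A value of None means "do not pass the option", which leaves the
    program default in place. (Hypothetical helper for illustration.)
    """
    if outfile is not None and infile is None:
        # The synopsis nests out.tdt inside in.tdt, so an output file
        # cannot be named without also naming an input file.
        raise ValueError("an output file requires an input file")
    argv = ["fingerprint"]
    if minbits is not None:
        argv += ["-b", str(minbits)]
    if maxbits is not None:
        argv += ["-c", str(maxbits)]
    if density is not None:
        argv += ["-d", str(density)]
    if runid is not None:
        argv += ["-id", str(runid)]
    # With no file names, input comes from standard input and output
    # goes to standard output; with one name, only output is standard.
    if infile is not None:
        argv.append(infile)
    if outfile is not None:
        argv.append(outfile)
    return argv
```

Calling build_fingerprint_argv("in.tdt", "out.tdt", minbits=512, maxbits=512) yields a fixed-size request with both files named; calling it with no arguments yields a filter that reads standard input and writes standard output.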

An "FP generation" ($FPG) datatree is also written to output which includes the run ID (if set via the -id option), input source, program name, version number, and parameters used.

The "Fingerprint" field (first field in FP output data) contains binary fingerprint data encoded in ASCII as per dt_bin2ascii(3). This data contains redundant bits indicating the presence of atom and bond relationships in paths, branches, ring closures, and other structural features. The program fingerprint does an exhaustive search for such features; as a result it runs quite slowly (ca. 25 structures per minute per MIPS).

This program can generate either folded or fixed-size fingerprints. Fingerprints are created at a given size (-c option) and are then repeatedly folded until they either reach the minimum allowed size (-b option) or rise above the minimum required density (-d option). Fixed-size fingerprints can be generated by setting the maximum and minimum sizes equal. Default parameters are -b 64 -c 2048 -d 0.3.
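The folding loop described above can be illustrated with a short Python sketch. This page does not say how a single fold is performed; the sketch assumes each fold halves the fingerprint by OR-ing its two halves together, and it assumes folding stops once the density is no longer below the threshold. The function names are hypothetical and this is not the program's actual implementation.

```python
def density(bits):
    """Fraction of bits that are set."""
    return sum(bits) / len(bits)

def fold_once(bits):
    """Halve the fingerprint by OR-ing its two halves (assumed method)."""
    half = len(bits) // 2
    return [a | b for a, b in zip(bits[:half], bits[half:])]

def fold(bits, minbits=64, min_density=0.3):
    """Fold until the minimum size is reached or the density is no
    longer below min_density, whichever comes first."""
    while len(bits) > minbits and density(bits) < min_density:
        bits = fold_once(bits)
    return bits

# A 2048-bit fingerprint with its first 100 bits set.
example = [1 if i < 100 else 0 for i in range(2048)]
folded = fold(example)
print(len(folded), round(density(folded), 3))  # prints: 256 0.391
```

In this example the densities after each fold are roughly 0.098 (1024 bits), 0.195 (512 bits), and 0.391 (256 bits), so folding stops at 256 bits. Passing minbits equal to the creation size leaves the fingerprint unfolded, which corresponds to the fixed-size case.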

Options

-b minbits
Specify the minimum fingerprint size allowed, in bits. This value must be a power of two and must be no larger than the creation size specified with the -c option. Fixed-size fingerprints will be created if the minimum size is set equal to the creation size. For best performance, it should also be a multiple of the size of an int in C (typically 32 bits). Up to a point dependent on the specific data used, larger values allow increased information storage at the expense of space. Default value is 64.
-c maxbits
Specify the maximum (creation) fingerprint size in bits. This value must be a power of two and must be at least as large as the minimum size specified with the -b option. Fixed-size fingerprints will be created if the creation size is set equal to the minimum size. Larger values allow increased storage of information for heterogeneous structures at the expense of speed. Default value is 2048.
-d density
Specify the density below which fingerprints may be folded. The highest information densities are obtained with a value of 0.41, but lower values may be used to effectively increase information content at the expense of information density (space). Similarity values computed using folded fingerprints are valid up to twice this value. Default value is 0.30.
-id runid
Identify this run by `runid' in $FPG and FP output data. If not specified, the `runid' field of FP data is omitted and the first field of the $FPG datum is set to "na".
-x Generate 'difference' fingerprints for reactions.
This option causes the program to generate the difference fingerprints for the given SMILES, rather than the 'normal' fingerprints. The difference fingerprints are generated using dt_fp_differencefp(3).
-z Don't copy ("zap") existing $FPG and FP data to output.
By default, all data in the input are copied to output.
-s minstep/maxstep
Compute bits for path lengths in this range. The default is 0/7. The upper limit on the path length is 32.
-m Generate mixture fingerprints
The following options apply to mixture fingerprints.
-mb minbits
The minimum fingerprint size allowed in bits. The default is 64.
-mt TAG
Use 'TAG' instead of 'FPP' for mixture fingerprints.
-md dens
Density below which fingerprints are folded. The default is 0.3.
-mz
Zap existing FPP data from TDT stream.
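The constraints stated for the -b, -c, -d, and -s options can be checked before the program is run. The following Python sketch is one possible pre-flight check; the function names are hypothetical, and two details are assumptions not stated on this page: that the density bounds 0.0 and 1.0 are inclusive, and that minstep must be non-negative and no greater than maxstep.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ...; False for zero and negatives."""
    return n > 0 and (n & (n - 1)) == 0

def check_sizes(minbits=64, maxbits=2048, density=0.3):
    """Return a list of problems with the -b, -c and -d values.

    An empty list means the values satisfy the documented rules.
    """
    problems = []
    if not is_power_of_two(minbits):
        problems.append("-b must be a power of two")
    if not is_power_of_two(maxbits):
        problems.append("-c must be a power of two")
    if minbits > maxbits:
        problems.append("-b must be no larger than -c")
    # Inclusive bounds are an assumption of this sketch.
    if not 0.0 <= density <= 1.0:
        problems.append("-d must lie between 0.0 and 1.0")
    return problems

def parse_step_range(text, limit=32):
    """Parse a 'minstep/maxstep' string such as '0/7' into two ints."""
    parts = text.split("/")
    if len(parts) != 2:
        raise ValueError("expected minstep/maxstep")
    lo, hi = int(parts[0]), int(parts[1])
    # The ordering and non-negativity checks are assumptions; the
    # upper limit of 32 comes from the -s option description.
    if lo < 0 or lo > hi or hi > limit:
        raise ValueError("step range out of bounds")
    return lo, hi
```

For instance, check_sizes() returns an empty list for the default values, and parse_step_range("0/7") returns the default path-length range as the pair (0, 7).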

Return Value

Returns 0 to its environment on success, or 1 on error, in which case a diagnostic message is printed:

fingerprint: can't open input file
The input file specified on the command line doesn't exist or isn't readable.
fingerprint: can't open output file
The output file specified on the command line can't be opened for writing.
fingerprint: more than two file names on command line
More than two arguments (other than options) were specified.
fingerprint: bad value for option xx xxxxxxx
A non-numeric value was specified for a numeric-valued option.
fingerprint: option -d `density' out of bounds 0.0 to 1.0
A nonsensical density was specified.
fingerprint: -b <minsize> greater than -c <maxsize>
The minimum fingerprint size is greater than the maximum fingerprint size. If not specified with the -b option, the minimum size defaults to 64. If not specified with the -c option, the maximum size defaults to 2048.
fingerprint: unknown option encountered: xxx
An invalid option was specified on the command line.
fingerprint: can't find EOT in $SMI-rooted tree
A datatree that is not terminated by the end-of-tree (EOT) delimiter "|" was encountered.
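A calling script can use the exit status to decide whether the output is usable. The following Python sketch wraps any command this way; the function name run_checked is hypothetical, and because this page does not say which stream carries the diagnostic message, the sketch captures both and prefers standard error.

```python
import subprocess

def run_checked(argv, stdin_text=""):
    """Run a command, returning its standard output on success.

    A non-zero exit status raises RuntimeError carrying whatever
    diagnostic text the command printed.
    """
    result = subprocess.run(argv, input=stdin_text,
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Assumption: the diagnostic may be on either stream.
        message = (result.stderr or result.stdout).strip()
        raise RuntimeError(
            f"{argv[0]} exited with status {result.returncode}: {message}")
    return result.stdout
```

With the program installed and on the search path, a call such as run_checked(["fingerprint", "in.tdt"]) would return the generated TDT text, or raise with the diagnostic if the program returned 1.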

Files

$DY_ROOT/bin/fingerprint

Daylight License

programs: fingerprint

Related Topics

nearneighbors(1) licensing(5)
Daylight Theory Manual

Bugs

None known.