Daylight Summer School 2002, June 5-7, Santa Fe, NM

Daylight Fingerprints

Fingerprints are bit arrays (aka "bitmaps") used for high-speed structural screening and similarity comparison.

Daylight molecular fingerprints contain:
- a pattern for each atom
- a pattern representing each atom and its nearest neighbors (plus the bonds that join them)
- a pattern representing each group of atoms and bonds connected by paths up to 2 bonds long
- ... atoms and bonds connected by paths up to 3 bonds long
- ... continuing, with paths up to 4, 5, 6, and 7 bonds long (the default maximum path length is 7 bonds; it can be set as high as 31)

Example:
The molecule OC=CN would generate the following patterns:

0-bond paths: C O N

1-bond paths: OC C=C CN

2-bond paths: OC=C C=CN

3-bond paths: OC=CN

Each pattern produces a set of bits (typically 4 or 5 bits per pattern), which is added to the fingerprint.
If a pattern is a substructure of a molecule, every bit that is set in the pattern's fingerprint will be set in the molecule's fingerprint.
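The path enumeration and bit setting described above can be sketched as follows. This is a minimal illustration, not Daylight's algorithm: the graph representation, the names `enumerate_paths` and `fingerprint`, the SHA-256 hashing, and the choice of 4 bits per pattern are all assumptions made for the example. Each path and its reverse are treated as the same pattern by keeping the lexicographically smaller string, so "OC" appears here as "CO".

```python
import hashlib

def enumerate_paths(atoms, bonds, max_bonds=7):
    """Return the set of linear path patterns containing 0..max_bonds bonds.

    atoms: list of element symbols; bonds: {(i, j): bond symbol}, where ""
    stands for a single bond. This is a hypothetical toy representation.
    """
    adj = {i: [] for i in range(len(atoms))}
    for (a, b), sym in bonds.items():
        adj[a].append((b, sym))
        adj[b].append((a, sym))

    found = set()

    def walk(path_atoms, path_bonds):
        tokens = [atoms[path_atoms[0]]]
        for sym, idx in zip(path_bonds, path_atoms[1:]):
            tokens += [sym, atoms[idx]]
        # A path read forwards or backwards is the same pattern.
        found.add(min("".join(tokens), "".join(reversed(tokens))))
        if len(path_bonds) == max_bonds:
            return
        for nbr, sym in adj[path_atoms[-1]]:
            if nbr not in path_atoms:  # simple paths only, no atom revisited
                walk(path_atoms + [nbr], path_bonds + [sym])

    for start in range(len(atoms)):
        walk([start], [])
    return found

def fingerprint(patterns, size=2048, bits_per_pattern=4):
    """OR together a few hash-derived bit positions for every pattern."""
    fp = 0
    for pattern in patterns:
        digest = hashlib.sha256(pattern.encode()).digest()
        for i in range(bits_per_pattern):
            pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % size
            fp |= 1 << pos
    return fp

# OC=CN and one of its substructures, C=CN
mol_paths = enumerate_paths(["O", "C", "C", "N"], {(0, 1): "", (1, 2): "=", (2, 3): ""})
sub_paths = enumerate_paths(["C", "C", "N"], {(0, 1): "=", (1, 2): ""})
mol_fp = fingerprint(mol_paths)
sub_fp = fingerprint(sub_paths)

print(sorted(mol_paths, key=len))
print("substructure bits all present:", sub_fp & mol_fp == sub_fp)
```

Because every path of the substructure is also a path of the molecule, every bit of `sub_fp` is necessarily set in `mol_fp`. The converse does not hold: a passing bit test only means the molecule cannot be ruled out.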
Fingerprints can be folded to shorter lengths, which decreases their size and increases their information density at the cost of some information loss.
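Folding can be illustrated with a short sketch. It assumes folding means OR-ing the two halves of the bitmap together and repeating while the bit density stays below a threshold and the size stays above a minimum. The function name `fold` is hypothetical, and the default values are borrowed from the `-d` and `-b` defaults listed in the program options.

```python
def fold(fp: int, size: int, min_density: float = 0.3, min_size: int = 64):
    """Halve the fingerprint until it is dense enough or reaches min_size.

    fp is a Python int used as a bitmap of `size` bits (a power of two).
    """
    while size > min_size and bin(fp).count("1") / size < min_density:
        half = size // 2
        # OR the upper half onto the lower half
        fp = (fp >> half) | (fp & ((1 << half) - 1))
        size = half
    return fp, size

# Three bits set in a sparse 2048-bit fingerprint
sparse = (1 << 3) | (1 << 1027) | (1 << 2000)
folded, folded_size = fold(sparse, 2048)
print(folded_size, bin(folded).count("1"))
```

In this example bits 3 and 1027 collide after folding (1027 mod 64 = 3), so three bits become two. That collision is the information loss the text refers to.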

Fingerprint program syntax/options:

fingerprint [-b minbits] [-c maxsize] [-d dens] [-id fpid] [-t TAG] [-x] [-z]
            [-s minstep/maxstep] [-m [-mb minbits] [-mt TAG] [-md dens] [-mz]]
            [ in.tdt [ out.tdt ] ]

  in.tdt ....... .tdt file containing $SMI data (default: stdin)
  out.tdt ...... .tdt file with $FPG and FP data added (default: stdout)

standard options:
  -b minbits .. minimum fingerprint size allowed, bits (default: 64)
  -c maxbits .. creation size of fingerprint, bits (default: 2048)
  -d dens ..... density below which fingerprints are folded (default: 0.3)
  -id fpid .... identify this run by `fpid'
  -t TAG ...... use `TAG' instead of `FP' for fingerprint dataitems
  -x .......... generate difference fingerprints (XFP<>)
  -z .......... zap existing FP and $FPG data
  -s min/max .. Compute bits for pathlength in this range (default: 0/7)

options for mixtures:
  -m .......... generate fingerprints for mixture components ("parts")
  -mb minbits . minimum fingerprint size allowed, bits (default: 64)
  -mt TAG ..... use 'TAG' instead of `FPP' for mixture fingerprints
  -md dens .... density below which fingerprints are folded (default: 0.3)
  -mz ......... zap existing FPP data from TDT stream

produces:
  FP<fp;obits;oset;nbits;nset;ver;fpid>
  FPP<part-ntuple;fpid>

Reaction Fingerprints

Structural Reaction Fingerprints - for structural screening
=: the fingerprint of the reactant part
 + the fingerprint of the product part
 + the bit-shifted fingerprint of the product part

Difference Fingerprints - reflect atom/bond changes in a reaction
=: reactant fingerprint XOR product fingerprint

Example

Sn2 displacement reaction:

[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI

The paths generated for the molecules would be as follows:

Enumerated Fingerprint Paths:
Path Length:	Reactant (count/path):	Product (count/path):
0	1 I, 1 Na, 3 C, 1 Br	1 I, 1 Na, 3 C, 1 Br
1	1 C=C, 1 C-C, 1 C-Br	1 C=C, 1 C-C, 1 C-I
2	1 C=C-C, 1 C-C-Br	1 C=C-C, 1 C-C-I
3	1 C=C-C-Br	1 C=C-C-I

Difference in Path Counts:
Path Length:	Difference (count/path):
0	0 I, 0 Na, 0 C, 0 Br
1	0 C=C, 0 C-C, 1 C-Br, 1 C-I
2	0 C=C-C, 1 C-C-Br, 1 C-C-I
3	1 C=C-C-Br, 1 C=C-C-I

After generating the difference in counts, we only use the six paths with non-zero differences to set bits in the difference fingerprint. These are the paths which walk through bonds that change during the reaction. By considering only these paths, we get a fingerprint which reflects the overall bond changes in the reaction.
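The count comparison can be reproduced in a few lines. This sketch simply transcribes the reactant and product path counts from the table above into `collections.Counter` objects and keeps the paths whose counts differ. The helper name `changed_paths` is hypothetical.

```python
from collections import Counter

# Path counts transcribed from the "Enumerated Fingerprint Paths" table
reactant = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-Br": 1,
    "C=C-C": 1, "C-C-Br": 1,
    "C=C-C-Br": 1,
})
product = Counter({
    "I": 1, "Na": 1, "C": 3, "Br": 1,
    "C=C": 1, "C-C": 1, "C-I": 1,
    "C=C-C": 1, "C-C-I": 1,
    "C=C-C-I": 1,
})

def changed_paths(r: Counter, p: Counter) -> dict:
    """Return {path: |count difference|} for paths whose counts differ."""
    # A Counter returns 0 for a missing key, so absent paths need no special case.
    return {path: abs(r[path] - p[path])
            for path in r.keys() | p.keys()
            if r[path] != p[path]}

diff = changed_paths(reactant, product)
for path in sorted(diff, key=lambda s: (len(s), s)):
    print(diff[path], path)
```

Only the six paths that pass through the C-Br or C-I bond survive, and these are the ones that would set bits in the difference fingerprint.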

Mixture Fingerprints - part-tuple fingerprints

Mixtures stored as dot-disconnected SMILES are fingerprinted component by component.
The FPP datatype holds the resulting tuple of component fingerprints.
Example:
$SMI<"[I-].[Na+].C=CCBr>>[Na+].[Br-].C=CCI"> FPP<..2..2...E2.2,..+.06...+..2,G60.EoH0e+o.2,..+U1A...+U.2,1.1.ME......2,vA5U.qPXXFw.2;> |
$D<FPP> _V<"Component fingerprints;Component fingerprints/ID"> _B<"FPP;FPP/ID"> _N<"PART_NTUPLE 1;"> _P<"*;"> _S<Component fingerprints> _M<System> _O<Daylight Chemical Information Systems Inc.> |

Other Kinds of Fingerprints

Modal Fingerprints - represent a group of compounds
Atom Fingerprints - characterize an atom in its molecular environment

Comparing Fingerprints

Three Similarity Metrics: Tanimoto, Euclidean, and Tversky

Terms:

Symbol	Definition	Description
bits(F)		A function that returns the number of "1" bits in a bitmap
BT		The total number of bits (the fingerprint's size); a constant
B1	bits(F1)	The number of 1's in F1
B2	bits(F2)	The number of 1's in F2
BC	bits( F1 AND F2 )	The number of 1's in common between F1 and F2
BI	bits(F1 XOR (NOT F2))	The number of identical bits (1's and 0's) between F1 and F2
BU1	bits(F1 AND (NOT F2))	The number of unique bits (1's) in F1
BU2	bits(F2 AND (NOT F1))	The number of unique bits (1's) in F2
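The terms in the table map directly onto bitwise operations. The sketch below uses Python integers as bitmaps. The function names `bits` and `terms` and the 8-bit example values are illustrative assumptions. Because Python integers are unbounded, NOT is masked to BT bits.

```python
def bits(f: int) -> int:
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")

def terms(f1: int, f2: int, bt: int) -> dict:
    """Compute the quantities from the terms table for two bt-bit fingerprints."""
    mask = (1 << bt) - 1  # keeps NOT results within bt bits
    return {
        "BT": bt,
        "B1": bits(f1),
        "B2": bits(f2),
        "BC": bits(f1 & f2),
        "BI": bits((f1 ^ ~f2) & mask),
        "BU1": bits(f1 & ~f2 & mask),
        "BU2": bits(f2 & ~f1 & mask),
    }

t = terms(0b11110000, 0b11001100, bt=8)
print(t)
```

For these two fingerprints, two 1 bits are shared, two are unique to each side, and four positions (two 1's and two 0's) are identical.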

Tanimoto Coefficient

The number of bits in common divided by the total number of bits that could be in common. Scale: 1.0 = identical fingerprints, 0.7 = highly similar, 0.5 = roughly similar.

TC = BC / (B1 + B2 - BC)
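A direct transcription of the formula, again using integers as bitmaps. The function name `tanimoto` is hypothetical, and returning 1.0 for two empty fingerprints is a convention assumed here, since the formula is undefined in that case.

```python
def bits(f: int) -> int:
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")

def tanimoto(f1: int, f2: int) -> float:
    """TC = BC / (B1 + B2 - BC)."""
    bc = bits(f1 & f2)
    union = bits(f1) + bits(f2) - bc
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return bc / union

tc_same = tanimoto(0b11110000, 0b11110000)
tc_half = tanimoto(0b11110000, 0b11001100)
print(tc_same, tc_half)
```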

Euclidean Distance

A measure of the geometric distance between two fingerprints. Scale: 0.0 = identical fingerprints, 0.3 = highly similar, 0.5 = roughly similar.

DE(F1,F2) = (BT - BI) / BT

The distance-as-substructure metric is:

DSE(F1,F2) = (B1 - bits(F1 AND F2)) / B1
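Both distance formulas can be checked on a small example. The function names `euclidean_distance` and `substructure_distance` are hypothetical, and the formulas are transcribed exactly as written above. The second one is asymmetric and is undefined when F1 has no bits set.

```python
def bits(f: int) -> int:
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")

def euclidean_distance(f1: int, f2: int, bt: int) -> float:
    """DE(F1,F2) = (BT - BI) / BT, with BI = bits(F1 XOR (NOT F2))."""
    mask = (1 << bt) - 1
    bi = bits((f1 ^ ~f2) & mask)
    return (bt - bi) / bt

def substructure_distance(f1: int, f2: int) -> float:
    """DSE(F1,F2) = (B1 - bits(F1 AND F2)) / B1; requires B1 > 0."""
    b1 = bits(f1)
    return (b1 - bits(f1 & f2)) / b1

small, large = 0b11000000, 0b11110000  # every bit of `small` is also in `large`
de = euclidean_distance(small, large, bt=8)
dse_forward = substructure_distance(small, large)
dse_reverse = substructure_distance(large, small)
print(de, dse_forward, dse_reverse)
```

Swapping the arguments changes the substructure distance from 0.0 to 0.5, while the Euclidean-style distance is the same in either direction.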

Tversky Similarity

For a complete description of Tversky similarity see John Bradshaw's MUG '97 presentation, "Introduction to Tversky similarity measure".

Tversky similarity compares features in a given structure (the "prototype") to features in database structures (the "variants"), with a user-specified weighting for each set of features. Here F1 is the prototype and F2 the variant; α weights the prototype's unique features and β the variant's.

TS = BC / (α*BU1 + β*BU2 + BC)

Example: Setting the weighting of prototype features to 100% and variant features to 100%, i.e., α=1, β=1, produces a symmetrical similarity metric identical to the Tanimoto metric.

Example: Setting the weighting of prototype and variant features asymmetrically produces a similarity metric in a more-substructural or more-superstructural sense. Setting the weighting of prototype features to 100% (α=1) and variant features to 0% (β=0) means that only the prototype features are important, i.e., this produces a "superstructure-likeness" metric. In this case, a Tversky similarity value of 1.0 means that all prototype features are represented in the variant, 0.0 that none are.

Example: Setting the weights to 0% prototype (α=0) / 100% variant (β=1) produces a "substructure-likeness" metric, where completely embedded structures have a 1.0 value and "near-substructures" have values near 1.0.

Tversky metrics where the two weightings add up to 100% (1.0) are of special interest (e.g., the 50/50 metric is known as the Dice index).
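The weighting behaviour can be tried out with a small sketch. It assumes the Tversky formula takes BC in the numerator and the weighted unique bits plus BC in the denominator, with the prototype weight called `alpha` and the variant weight called `beta`. The function name `tversky` is hypothetical, and at least one term of the denominator must be non-zero.

```python
def bits(f: int) -> int:
    """Number of 1 bits in a non-negative integer bitmap."""
    return bin(f).count("1")

def tversky(prototype: int, variant: int, alpha: float, beta: float) -> float:
    """TS = BC / (alpha*BU1 + beta*BU2 + BC)."""
    bc = bits(prototype & variant)
    bu1 = bits(prototype) - bc  # 1 bits unique to the prototype
    bu2 = bits(variant) - bc    # 1 bits unique to the variant
    return bc / (alpha * bu1 + beta * bu2 + bc)

f1, f2 = 0b11110000, 0b11001100
ts_tanimoto = tversky(f1, f2, 1.0, 1.0)  # same value as BC / (B1 + B2 - BC)
ts_dice = tversky(f1, f2, 0.5, 0.5)      # same value as 2*BC / (B1 + B2)

small, large = 0b11000000, 0b11110000
ts_proto_only = tversky(small, large, 1.0, 0.0)    # all prototype bits are in the variant
ts_variant_only = tversky(small, large, 0.0, 1.0)  # the variant has bits the prototype lacks
print(ts_tanimoto, ts_dice, ts_proto_only, ts_variant_only)
```

With weights of 1/1 the result matches the Tanimoto coefficient for the same pair, and with 0.5/0.5 it matches the Dice index.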

Daylight Chemical Information Systems Inc.
