Hamming-Grey Codes for Fingerprinting Descriptors

Roger Sayle
Bioinformatics Group,
Metaphorics LLC,
Santa Fe, New Mexico.

1. Introduction

This talk describes a number of schemes for encoding numeric and non-numeric values in binary fingerprints. Encodings are suggested that significantly reduce the number of bits required by existing methods. These reductions are achieved by defining the required "resolution" of similarity a priori. Different "codes" are proposed for both the "Hamming" and "Tanimoto" measures of similarity. Finally, we present example applications of these encoding methods to pseudo-structural similarity and to 3D pharmacophore matching. These solutions attempt to answer some of the problems posed by John Bradshaw in his MUG98 talk entitled "Adding non-structural data into fingerprints".

2. Fingerprints and Similarity

2.1 Overview

A "fingerprint" is a fixed length bit vector, often used to describe the features of a molecule in chemical information systems, such as those provided by Daylight or MDL. Conventionally each bit is used to encode the presence (or absence) of a particular structural "motif", such as the occurrence of a particular bond or atom type. These "binary" fingerprints may then be used for pre-screening database searches or as a measure of "similarity" between two database entries. The characteristics of encoding for these two applications are outlined below.

For pre-screening database searches, a fingerprint is considered only if it contains a prescribed bit pattern. Algebraically, if A is a query fingerprint and B a database fingerprint, and all the features of A appear in B, then A&B == A (where "&" is the bitwise AND operator). This technique is typically used for superstructure searching. This use of fingerprints is most effective when no value's bits are a subset of another value's bits, and when each value typically has several bits set. For example, encoding a descriptor by all zeros provides no ability to pre-screen a search for that value, while encoding a descriptor by all ones leads to false drops on all queries over that descriptor column.
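The pre-screen test above can be sketched in a few lines of Python (a minimal sketch; fingerprints are modelled as arbitrary-precision integers, and the function name is illustrative):

```python
def prescreen(query: int, candidate: int) -> bool:
    # the candidate survives the screen only if every bit of the
    # query also appears in the candidate: A & B == A
    return query & candidate == query

print(prescreen(0b0101, 0b1101))  # query bits are a subset -> True
print(prescreen(0b0101, 0b1001))  # bit 2 of the query is missing -> False
```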

For similarity comparisons between fingerprints, several measures are commonly used, including the Hamming, Euclidean, Tanimoto and Tversky indices. In this article, we restrict ourselves to describing encodings suitable for Hamming and Tanimoto similarity. Hamming "distance" is defined as the number of bits that are different between two binary fingerprints, [A^B] (where "[x]" is the number of bits set in "x", and "^" is the bitwise XOR operator). The Tanimoto coefficient is defined as the number of bits in common between the two fingerprints divided by the number of bits set in either fingerprint, [A&B]/[A|B] (where "|" is the bitwise OR operator). To encode descriptors suitable for Hamming comparison, similar encodings should have similar bit patterns. Additionally, for Tanimoto comparison all encodings should preferably have the same number of bits, with similar encodings sharing the same bits set.
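Both measures reduce to bit counts on the results of bitwise operations, as the following sketch shows (integers stand in for fingerprints; the convention of returning 1.0 for two empty fingerprints is an assumption, not part of the definition above):

```python
def popcount(x: int) -> int:
    # [x]: the number of bits set in x
    return bin(x).count("1")

def hamming(a: int, b: int) -> int:
    # [A ^ B]: the number of bits that differ between the fingerprints
    return popcount(a ^ b)

def tanimoto(a: int, b: int) -> float:
    # [A & B] / [A | B]; taken to be 1.0 when both fingerprints are empty
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0

print(hamming(0b1100, 0b1010))   # 2 bits differ
print(tanimoto(0b1100, 0b1010))  # 1 bit in common, 3 in the union
```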

2.2 Composition of Fingerprints

Finally, in this section we briefly mention composition of descriptor encodings. The use of fingerprint composition allows multiple descriptors to be compared in a single "logical" operation. Composition of fingerprints (written here as A:B, where A and B are "subfingerprints") is performed by concatenating the two bit vectors into a single fingerprint. A number of trivial identities hold over concatenated fingerprints: [A:B] = [A]+[B], (A:B)^(C:D) = (A^C):(B^D), etc.

Perhaps the most useful corollary of the above identities is that Hamming(A:B,C:D) = Hamming(A,C) + Hamming(B,D). This allows us to test for similarity over several descriptors simultaneously, potentially trading similarity of one against dissimilarity of another. This trade-off may be customized by choosing encodings such that the Hamming distance between any two values of a more significant descriptor is greater than that between any two values of a less significant descriptor.
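The additivity of Hamming distance under concatenation is easy to check numerically. In this sketch subfingerprints are modelled as integers, with concatenation performed by a shift (the explicit bit width of the low subfingerprint is needed to know how far to shift):

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def concat(a: int, b: int, width_b: int) -> int:
    # A:B -- shift A past B's field, then OR B into the low bits
    return (a << width_b) | b

a, c = 0b1010, 0b0110    # two 4-bit subfingerprints
b, d = 0b11001, 0b10101  # two 5-bit subfingerprints

lhs = popcount(concat(a, b, 5) ^ concat(c, d, 5))
rhs = popcount(a ^ c) + popcount(b ^ d)
print(lhs == rhs)  # Hamming(A:B, C:D) == Hamming(A,C) + Hamming(B,D)
```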

One trivial way of achieving this without affecting the descriptor encodings is simply to encode the more significant descriptors multiple times. Duplicating a subfingerprint effectively doubles the number of bits that are different in a similarity comparison: Hamming(A:A:B,C:C:D) = 2*Hamming(A,C) + Hamming(B,D). Although this method is effective, it is typically possible to achieve the same result by changing the descriptor encodings themselves.

Finally, notice that it is difficult to derive Tanimoto(A:B,C:D) from Tanimoto(A,C) and Tanimoto(B,D). However, such a definition is possible if every encoding has a constant number of bits. Rewriting Tanimoto(A,B) in the more common form [A&B]/([A]+[B]-[A&B]), and replacing [A] and [B] by the constant kAB, gives [A&B]/(kAB-[A&B]), which is a function only of the number of bits in common. Composition then results in Tanimoto(A:B,C:D) = ([A&C]+[B&D]) / ((kAC+kBD)-([A&C]+[B&D])), which once again is a function only of the number of bits in common.

2.3 Conversion of Hamming Encodings to Tanimoto Encodings

There is a clever trick for converting descriptor encodings intended for Hamming similarity into encodings suitable for Tanimoto similarity and database search pre-screening. This method was first introduced for error detecting codes by Berger in 1961. Given a descriptor encoding A, represent it in a fingerprint as A:~A (where "~" is the bitwise NOT operator), i.e. compose the encoding with its own complement. These Berger codes have the property that exactly half of their bits are set (making them a subset of Sperner codes), so all encodings of a descriptor have the same number of bits. Given n-bit codes A and B that are a distance x = Hamming(A,B) apart, we can derive y, the Tanimoto threshold for the corresponding Berger codes, Tanimoto(A:~A,B:~B). By analysis, [(A:~A)&(B:~B)] = n-x and [A:~A] = n, hence y = (n-x)/(n+x).
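The construction and the y = (n-x)/(n+x) relation can be verified directly (a small sketch; note that the complement must be masked to n bits, since Python integers have no fixed width):

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def berger(a: int, n: int) -> int:
    # compose an n-bit code with its own complement: A : ~A
    mask = (1 << n) - 1
    return ((a & mask) << n) | (~a & mask)

n = 4
a, b = 0b1010, 0b1000          # Hamming distance x = 1 apart
x = popcount(a ^ b)
ba, bb = berger(a, n), berger(b, n)
print(popcount(ba), popcount(bb))  # exactly half of the 2n bits are set: n, n
tan = popcount(ba & bb) / popcount(ba | bb)
print(tan == (n - x) / (n + x))    # Tanimoto equals (n - x) / (n + x)
```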

One other advantage of Berger codes (and their principal use in error detection and delay-insensitive asynchronous circuits) is that no code is a subset of another. As mentioned previously, this is a desirable property for pre-screening database searches.

3. Enumeration (Nominal) Data

Nominal data is where a descriptor may take any of m different states, but these states cannot be arranged in any meaningful order. A special case is where it has two states, usually represented by the digits 0 and 1, when it is called a binary or boolean variable. The only notion of similarity for such variables is equality. We do not consider the case where a variable may be in more than one state simultaneously.

3.1 Single Bit (1-of-m) Codes

A trivial, but commonly used, method is to represent such descriptors as an m-bit vector, where each encoding has only a single bit set. Although this encoding is inefficient for large m, it has several benefits: no code is a subset of another, all encodings contain the same number of bits, and every code is Hamming distance 2 from every other. However, the greatest benefits are that it is possible to encode variables in multiple states in the same scheme, and that the sparse fingerprints are easily folded.
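A minimal sketch of the 1-of-m scheme and its properties (function name illustrative):

```python
def one_of_m(state: int) -> int:
    # a single bit set at position `state` of the m-bit vector
    return 1 << state

# any two distinct codes are exactly Hamming distance 2 apart
print(bin(one_of_m(2) ^ one_of_m(5)).count("1"))  # 2

# a variable in several states at once is just the OR of the single-bit codes
multi = one_of_m(1) | one_of_m(3)
print(bin(multi))  # 0b1010
```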

3.2 Arbitrary Conventional Codes

Another obvious choice is to encode each state as an integer between 0..m-1, and encode the resulting value in the conventional ceil(log2(m)) bits. A Hamming distance greater than zero indicates that the descriptors are in different states. Note that no meaning may be assigned to distances greater than zero.

3.3 Hamming Codes

We have seen that the weighting of fingerprints can be increased by duplicating bits. A better encoding technique is to use Hamming codes. A Hamming code is a set of code words with at least a specified number of bits different between any two non-identical codes. The use of a parity bit can be seen as ensuring that all distinct codes have a Hamming distance of at least two. To encode m values with a minimum Hamming distance of k requires 2k-1 additional bits, resulting in an encoding scheme requiring ceil(log2(m))+2k-1 bits. Once again, note that no meaning may be ascribed to Hamming distances greater than or equal to k.
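One simple (if not always optimal) way of generating a set of code words with a guaranteed minimum distance is a greedy "lexicode" search, sketched below; for 7 bits and minimum distance 3 the greedy scan happens to recover the 16 code words of the classic Hamming (7,4) code:

```python
def lexicode(nbits: int, mindist: int):
    # greedy search: scan integers in increasing order, keeping each
    # word that is at least `mindist` (Hamming) from every word kept so far
    words = []
    for w in range(1 << nbits):
        if all(bin(w ^ kept).count("1") >= mindist for kept in words):
            words.append(w)
    return words

codes = lexicode(7, 3)
print(len(codes))  # 16 -- the Hamming (7,4) code
```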

Link 1: Hamming Codes for m=16

4. Numeric (Interval Scaled) Data

Numeric (or cardinal) data is where a descriptor may take any of m sequential integer values. These are nominal states that not only may be arranged in a meaningful order, but where the differences between successive values are also of equal size. Ordinal variables (as defined in John Bradshaw's talk), where successive values are not equally similar, may be handled by mapping these values onto suitable numeric values. For example, the formal charges -3,-2,-1,0,1,2,3 may be mapped onto the values 0,1,2,4,6,7,8, making same-signed charges similar, neutral-to-charged changes less similar, and sign changes less similar still. We do not consider the encoding of variables that may have more than one numeric value, or range information giving upper and/or lower bounds on values. We also do not consider cyclic values, where the largest value is considered similar to the smallest (such as times or angles).

4.1 Explicit Run Encoding

The traditional way of encoding m distinguishable values is to use m-1 bits, each value in the range 0..m-1 being represented by the appropriate number of consecutive 1 bits followed by the remaining 0 bits. Two values that differ by delta are a Hamming distance of delta apart. This representation becomes increasingly inefficient with larger values of m.
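This "thermometer" style encoding is a one-liner on integers (a sketch; the function name is illustrative):

```python
def run_encode(v: int) -> int:
    # value v in the range 0..m-1 becomes v consecutive 1 bits
    # in an (m-1)-bit field
    return (1 << v) - 1

print(bin(run_encode(3)))  # 0b111
# values delta apart are exactly delta bits apart
print(bin(run_encode(7) ^ run_encode(3)).count("1"))  # 4
```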

4.2 Binning Values

Another less than satisfactory solution is to "bin" values into a smaller number of subintervals. For example, percentage values between zero and one hundred could be placed in ten bins: 0-10, 11-20, 21-30, ..., 91-100. These 10 bin values can then be encoded either as nominal states (with no notion of similarity between bins) or as numeric data with a smaller range. One problem (as demonstrated above) is that it is not always possible to place values in equal-sized bins. However, the greatest disadvantage of such a scheme is its "boundary effects", where very similar values that fall either side of a partition are considered dissimilar, and where all values within a bin are considered equal.

4.3 Hamming-Grey Codes

In this talk, we present an improved solution to encoding numeric descriptors in binary fingerprints. The observation is initially made that if the complete range of similarity values needs to be ordered with respect to a query, then the "Explicit Run" encoding of section 4.1 can be shown to be optimal. However, in practice such complete orderings are rarely required; most queries only consider similarity over a restricted range and need not rank dissimilar values precisely. This is similar to the "Horizon Effect" in Tanimoto similarity, where all database entries that have no bits in common with a query look identical. Hence we propose a numeric descriptor encoding that can be used to rank all neighboring values within some delta, but is unable to distinguish values further than delta away.

The proposed solution combines the coding methods described by Hamming and Grey, and shall therefore be referred to as Hamming-Grey codes. As has already been mentioned in section 3.3, Hamming codes are n-bit codes that guarantee a Hamming distance of at least k between distinct values. Grey codes are an ordering of n-bit codes where each code differs from the previous code by a single bit, i.e. Hamming(Grey(i),Grey(i+1)) = 1. Example Grey codes for 16 values are given in link 2 below. Note that Grey codes do not describe an encoding per se, but an ordering on codes. Grey codes themselves cannot be used directly to encode similarity, as several pairs of codes (in addition to neighboring values) have a Hamming distance of 1, i.e. only a single bit difference. A Grey code can be envisioned as a non-intersecting path through an n-dimensional space, traversing only a single axis on each move. Hamming codes can be considered suitably distant points in the same n-dimensional space.
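The standard binary-reflected construction illustrates both points: neighboring values differ by exactly one bit, yet so do some non-neighboring pairs, which is why a Grey code alone cannot encode similarity (a minimal sketch):

```python
def grey(i: int) -> int:
    # binary-reflected code: successive values differ in exactly one bit
    return i ^ (i >> 1)

# neighboring values are Hamming distance 1 apart...
print(all(bin(grey(i) ^ grey(i + 1)).count("1") == 1 for i in range(15)))
# ...but so are some non-neighboring pairs, e.g. values 0 and 3
print(bin(grey(0) ^ grey(3)).count("1"))  # 1
```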

Hamming-Grey codes are an encoding of numeric values, parameterized by two values: the length of the code and the resolution delta of the encoding. Each code has the property that successive values differ by only a single bit, Hamming(Hamming-Grey(i),Hamming-Grey(i+1)) = 1. Similarly, if two values differ by x, less than delta, then their two encodings have a Hamming distance of x. Any pair of values further than delta apart are guaranteed to have a Hamming distance of at least delta. This is expressed by the constraint that if Hamming(Hamming-Grey(A),Hamming-Grey(B)) < delta, then Hamming(Hamming-Grey(A),Hamming-Grey(B)) = abs(A-B). This encoding allows queries to retrieve descriptors up to a maximum of delta-1 away from the query value. Hamming-Grey encodings can easily be found by searching through an n-dimensional space, changing one bit at a time, not changing that bit again for the next delta codes, and never coming within delta bits of a code seen more than delta values previously. Example Hamming-Grey encodings for 16 values are given in link 3 below.
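The search described above can be sketched as a small backtracking walk through the n-cube (illustrative only; the parameters in the example are deliberately tiny, not the codes tabulated in the links below):

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def hamming_grey(nbits: int, delta: int, count: int):
    # depth-first search for a path flipping one bit per step, such that
    # codes d < delta steps apart are exactly Hamming distance d, and
    # codes >= delta steps apart are at least delta bits apart
    def ok(codes, cand):
        for back in range(1, len(codes) + 1):
            d = popcount(codes[-back] ^ cand)
            if back < delta:
                if d != back:      # a bit was reused too soon
                    return False
            elif d < delta:        # an older code is too close
                return False
        return True

    def dfs(codes):
        if len(codes) == count:
            return codes
        for bit in range(nbits):
            cand = codes[-1] ^ (1 << bit)
            if ok(codes, cand):
                found = dfs(codes + [cand])
                if found:
                    return found
        return None

    return dfs([0])

print(hamming_grey(4, 2, 8))  # 8 values in 4 bits at resolution 2
```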

Given the definition above, the first question that springs to mind is how large a Hamming-Grey code has to be to encode m different values. Unfortunately, this is not a trivial question and I have been unable to derive an analytical formula; however, the coding is relatively efficient, and a table of Hamming-Grey code encoding densities is given in link 4 below.

An interesting reference point in the link 4 table is that a Hamming-Grey code of length 16 and resolution 5 can be used to encode at least 111 values. We can use such an encoding to solve the percentage problem described in section 4.2. With a resolution of 5, a query can retrieve all values within the 11-percentile interval x-5..x+5 by limiting the Hamming distance from x to 5 bits. This scheme does not suffer from the boundary problem (79 is encoded one bit away from 80), and there are no binning problems (70 is dissimilar to 89). Such a percentage encoding using a Hamming-Grey(16,5) code is given in link 5 below.

Link 2: Example Grey Codes for m=16
Link 3: Hamming Grey Codes for m=16
Link 4: Table of Hamming-Grey Code Ranges
Link 5: Example Encoding of Percentage Values

5. Example Applications

One potential application is in the selection of "similar" compounds based on higher-level structural features. For example, it would be easy to encode fingerprints containing the number of hydrogen bond donors and acceptors, the molecular weight, the number of rotatable bonds, etc. Then, given a particular query molecule, similar molecules that have no more than one or two fewer/more rotatable bonds and/or approximately the same molecular weight may be selected purely by specifying the appropriate Hamming or Tanimoto threshold.

Another potential application is in 3D pharmacophore searching. Consider encoding all 6 pairwise distances in a four-point pharmacophore as a single fingerprint. Each distance could be represented as one of 200 values with a resolution of 0.1 Angstroms (giving a 20 Angstrom maximum). Using either Hamming or Tanimoto similarity it is then possible to select molecule conformations whose total pairwise error over all six distances is less than 0.8 Angstroms. Such an encoding can be achieved with a Hamming-Grey(24,8) encoding, requiring only 18 bytes per small-molecule conformation.

6. Acknowledgements

I wish to thank John Bradshaw from Glaxo Wellcome, Stevenage, U.K., and David Weininger and Jack Delany from Daylight Chemical Information Systems for the discussions that first described the problem to me and much of the inspiration that led to the solutions described above.

7. Bibliography