Bioinformatics Group,

Metaphorics LLC,

Santa Fe, New Mexico.

For pre-screening database searches, a fingerprint is considered only if it contains a prescribed bit pattern. Algebraically, if A is a query fingerprint and B a database fingerprint, if all the features of A appear in B then A&B == A (where "&" is the bitwise AND operator). This technique is typically used for superstructure searching. This use of fingerprints is most effective when no value's bits are a subset of another value's bits, and when each value typically has several/many bits set. For example, encoding a descriptor by all zeros, provides no ability to pre-screen a search for that value, and encoding a descriptor by all ones, leads to false drops on all queries over that descriptor column.

For similarity comparisons between fingerprints, several measures are commonly used including Hamming, Euclidean, Tanimoto and Tversky indices. In this article, we restrict ourselves to describing Hamming and Tanimoto suitable encodings. Hamming "distance" is defined as the number of bits that are different between two binary fingerprints, [A^B] (where "[x]" is the number of bits set in "x", and "^" is the bitwise XOR operator). The Tanimoto coefficient is defined as the number of bits in common between the two fingerprints divided by the bits set in either fingerprint, [A&B]/[A|B] (where "|" is the bitwise OR operator). To encode descriptors suitable for Hamming comparison, similar encodings should have similar bit patterns. Additionally for Tanimoto comparison all encodings should preferably have the same number of bits, with similar encodings sharing the same bits set.

Perhaps the most useful corollary of the above idenities is that Hamming(A:B,C:D) = Hamming(A,C) + Hamming(B,D). This allows us to test for similarity over several descriptors simultaneously, potentially trading similarity of one with disimilarity of another. This trade-off may be customized by choosing encodings such the Hamming distance between any pair of related significant descriptors is higher than between any pair related by less significant descriptors.

One trivial way of achieving this without affecting the descriptor encodings, is to simply encode the more significant descriptors multiple times. Duplicating a subfingerprint twice effectively doubles the number of bits that are different in a similarity comparison, Hamming(A:A:B,C:C:D) = 2*Hamming(A,C) + Hamming(B,D). Although this method is effective typically it is possible to achieve the same result through changing the descriptor encodings.

Finally, notice that it is difficult to derive Tanimoto(A:B,C:D) from Tanimoto(A,C) and Tanimoto(B,D). However, such a definition is possible if any encoding has a constant number of bits. Rewriting Tanimoto(A,B) as the more common [A&B]/([A]+[B]-[A&B]), and replacing [A] and [B] by the constant "kAB", [A&B]/(kAB-[A&B]). Which is a function of the number of bits in common. Composition then results in Tanimoto(A:B,C:D) = ([A&C]+[B&D]) / ((kAC+kBD)-([A&C]+[B&D])), which once again is a function of the number of bits in common.

One other advantage of Berger codes (and their principal use in error correcting and delay-insensitive asynchronous circuits) is that no code is a subset of another. As mentioned previously this is a desirable property for pre-screening database searches.

Link 1: Hamming Codes for *m*=16

The proposed solution combines the coding methods described by Hamming
and Grey, and shall therefore be refered to as Hamming-Grey codes. As
has already been mentioned in section 3.3, Hamming codes are *n*
bit codes that are guarantee a Hamming distance of atleast *k*
between distinct values. Grey codes are an ordering of *n* bit
codes where each code differs from the previous code by a single bit,
i.e. Hamming(Grey(i),Grey(i+1))=1. Example Grey codes for 16 values
are given in link 2 below. Note that Grey
codes do not describe an encoding "per se", but an ordering on codes.
Grey codes themselves cannot be used directly to encode similarity as
several pairs of codes (in addition to neighboring values) have a
Hamming distance of 1, i.e. only a single bit difference. A Grey code
can be envisioned as a non-intersecting path through a *n*-dimensional
space traversing only a single axis on each move. Hamming codes can be
considered suitably distant points in the same *n*-dimensional space.

Hamming-Grey codes are an encoding of numeric values, paramterized by
two values, the length of the code and the resolution *delta* of
the encoding. Each code has the property that sucessive values differ
by only a single bit. Hamming(Hamming-Grey(i),Hamming-Grey(i+1)) = 1.
Similarly, if two values differ by x, less than *delta*, then
their two encodings have a Hamming distance of x. Any pair of values
further than *delta* apart are guaranteed to have a Hamming distance
of atleast *delta*. This is expressed by the constraint, that if
Hamming(Hamming-Grey(A),Hamming-Grey(B)) < *delta*, then
Hamming(Hamming-Grey(A),Hamming-Grey(B)) = **abs**(A-B). This
encoding allows different queries to retrieve descriptors up to a
maximum of *delta*-1 away from the query value. Hamming-Grey
encodings can easily be found by searching through an *n*-dimensional
space changing one bit at a time, and not changing that bit again
for the next *delta* codes, and never coming within *delta*
bits of a code seem more than *delta* values previously. Example
Hamming-Grey encodings for 16 values are given in
link 3 below.

Given the definition above, the first question that springs to mind is
how large does a Hamming-Grey code have to be to encode *m*
different values. Unfortunately, this is not a trivial question and
I have been unable to derive an analytical formula, however the coding
is relatively efficient an a table Hamming-Grey code encoding densities
is given in link 4 below.

An interesting reference point in the link 4 table, is that a Hamming-Grey code of length 16 and resolution 5 can be used to encode atleast 111 values. We can use such an encoding to solve the percentage problem described in section 4.2. With a resolution of 5, a query can retrieve all values within the 11 percentile interval x-5..x+5 by limiting the Hamming distance from x to be 5 bits. This scheme doesn't suffer from the boundary problem 79 is encoded one bit away from 80, and there are no binning problems 70 is disimilar to 89. Such a percentage encoding using a Hamming-Grey(16,5) code is given in link 5 below.

Link 2: Example Grey Codes for *m*=16

Link 3: Hamming Grey Codes for *m*=16

Link 4: Table of Hamming-Grey Code Ranges

Link 5: Example Encoding of Percentage Values

Another potential application is in 3D pharmacophore searching. Consider encoding all 6 pairwise distances in a four-point pharmacophore as a single fingerprint. Each distance could be represented as one of 200 values with a resolution of 0.1 Angstroms (giving a 20 Angstrom maximum). Using either Hamming or Tanimoto similarity it is then possible to select molecule conformations whose total pairwise error over all size distances is less than 0.8 Angstroms. Such an encoding can be achieved with a Hamming-Grey(24,8) encoding, only requiring 18 bytes per small molecule conformation.

- J.M. Berger, "
**A Note on Error Detection Codes for Asymmetric Channels**",*Information and Control*, Vol. 4, pp. 68-73, 1961. - B. Bose and T.R.N. Rao, "
**Theory of Unidirectional Error Correcting/Detecting Codes**",*IEEE Transactions on Computers*, Vol. C-31, No. 6, pp. 521-530, June 1982. - R.W. Hamming, "
**Error Detecting and Error Correcting Codes**",*Bell Systems Technology Journal*, Vol. 29, pp. 147-160, April 1950. - D.E. Knuth, "
**Efficient Balanced Codes**",*IEEE Transactions on Information Theory*, Vol. IT-32, pp. 51-53, 1986. - E. Sperner, "
**Ein Satz uber Untermengen Einer Endlichen Menge**",*Math. Z.*, Vol. 27, pp. 544-548, 1928. - Tom Verhoeff, "
**Delay-insensitive Codes: An Overview**",*Distributed Computing*, Vol 3, pp. 1-8, 1988.

roger@metaphorics.com