Finding lead compounds using extended chemical fingerprints

Martyn G Ford

University of Portsmouth, Centre for Molecular Design, King Henry Building, King Henry 1 Street, Portsmouth, Hants. PO1 2DY.


A procedure has been developed that uses bitstrings to search databases for in silico “hits” based on structural and chemical information. Binary strings were chosen because they provide a fast and efficient means of storing and retrieving information. New bitstring formats have been developed that encode additional information, which is then appended to chemical fingerprints. The extended fingerprints can be used for lead identification and hence for the design of compound libraries. Two formats that preserve the properties of continuous variables (real numbers) are proposed: the “Hamming-Gray” coding and the “Band” coding. Each provides a solution to the problem of describing a wide range of diversity while maintaining a high density of bits (the Density/Diversity problem of binary strings). The two systems have been applied to a small set of 75 compounds described by four numerical descriptors and their chemical fingerprints. The results show that encoded numerical descriptors appended to the chemical fingerprints give a greater hit rate than that obtained using either the fingerprints or the appended descriptors alone, even for a small set of compounds deemed to be structural outliers in their respective activity classes. Both methods are shown to be superior to compound selection by random sampling. Finally, it is shown that a pre-processing algorithm, Unsupervised Forward Selection (UFS), reduces the size of the binary strings, giving faster processing with minimal loss in the rate of hit retrieval.
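The general idea of appending an encoded continuous descriptor to a fingerprint can be illustrated with a minimal Python sketch. It is not the Hamming-Gray or Band coding proposed here, whose definitions are not given in this abstract. Instead it assumes a standard binary-reflected Gray code applied to a descriptor quantised over a known range, and it assumes Tanimoto similarity as the comparison measure. The function names, the 4-bit default width, and the descriptor ranges are all hypothetical choices made for illustration.

```python
def gray_encode(value, lo, hi, n_bits):
    """Quantise value over [lo, hi] into 2**n_bits levels and return the
    binary-reflected Gray code of that level as a list of bits.
    (Assumed encoding for illustration, not the abstract's own formats.)"""
    levels = (1 << n_bits) - 1
    clipped = min(max(value, lo), hi)          # keep out-of-range values in range
    level = round((clipped - lo) / (hi - lo) * levels)
    gray = level ^ (level >> 1)                # neighbouring levels differ in one bit
    return [(gray >> i) & 1 for i in reversed(range(n_bits))]


def extend_fingerprint(fingerprint, descriptors, ranges, n_bits=4):
    """Append the Gray-coded descriptors to a structural fingerprint."""
    bits = list(fingerprint)
    for value, (lo, hi) in zip(descriptors, ranges):
        bits.extend(gray_encode(value, lo, hi, n_bits))
    return bits


def tanimoto(a, b):
    """Tanimoto similarity of two equal-length bit lists
    (assumed here; the abstract does not name a similarity measure)."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0
```

In this sketch, two compounds with similar descriptor values receive codes that differ in few bits, so the appended bits contribute to the similarity score alongside the structural bits. Wider codes resolve the descriptor more finely at the cost of a longer string.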

Presentation slides

Daylight Chemical Information Systems, Inc.