Fingerprint-Based And Graph-Based Methods For Similarity Searching In Chemical Databases

Peter Willett

Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK


Current fingerprint-based methods for similarity searching typically use the Tanimoto coefficient, an association coefficient that has been found to provide an effective measure of similarity in a range of chemical database applications. There are, however, many other types of similarity coefficient that are available, and we here report a comparison of 22 different similarity coefficients when they are used for searching databases of 2D fragment bit-strings. Experiments with the NCI AIDS and ID Alert databases show that the coefficients fall into several well-marked clusters, in which the members of a cluster will produce comparable rankings of a set of molecules. These clusters provide a basis for selecting combinations of coefficients for use in data fusion experiments. The results of these experiments provide a simple way of increasing the effectiveness of fragment-based similarity searching systems [1].

Problems of computational efficiency have meant that similarity measures based on maximal common subgraph isomorphism algorithms are too time-consuming for use in a database context, where a target structure must be matched against each of the structures in a large database. We have recently described a new algorithm, called RASCAL (for RApid Similarity CALculation) for the identification of the maximal common edge subgraph, rather than the maximal common node subgraphs that have figured in previous work on searching 2D chemical graphs. RASCAL uses a two-stage screening procedure that eliminates database structures with a low degree of similarity to the target structure, followed by a novel clique detection algorithm that employs a range of pruning and upperbounding heuristics. Experiments with several chemical datasets show that RASCAL can calculate similarities at rates in excess of a thousand structures per second, thus providing a mechanism for graph-based searching of large chemical databases [2, 3].

  1. J. D. Holliday, C-Y. Hu and P. Willett, Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings, in the press.
  2. J.W. Raymond, E.J. Gardiner and P. Willett, RASCAL: Calculation of graph similarity using maximum common edge subgraphs, submitted for publication.
  3. J.W. Raymond, E.J. Gardiner and P. Willett, Heuristics for rapid similarity searching of chemical graphs using a maximum common edge subgraph algorithm, submitted for publication.

Presentation slides

Daylight Chemical Information Systems, Inc.