EuroMUG '05    5 - 7 October, 2005

Choosing the Right Similarity Measure

John Holliday
University of Sheffield


Previous studies have examined the use of a selection of similarity coefficients for searching chemical databases. The results showed that retrieval could be improved by using alternative coefficients or by combining two or more coefficients using data fusion techniques. However, there is no single best coefficient or combination of coefficients, and selecting an appropriate combination is difficult. Several studies are described that investigate alternatives to standard similarity techniques in order to improve retrieval performance.
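As an illustration of what combining coefficients by data fusion can look like, the following is a minimal sketch and not the method used in the studies. It assumes fingerprints are represented as Python sets of "on" bit positions, uses the Tanimoto and cosine coefficients as two example coefficients, and fuses them by summing ranks; the function names and the toy molecules (mol_a, mol_b, mol_c) are hypothetical.

```python
from math import sqrt


def tanimoto(a, b):
    """Tanimoto coefficient for two sets of 'on' bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """Cosine coefficient for two sets of 'on' bit positions."""
    common = len(a & b)
    denom = sqrt(len(a) * len(b))
    return common / denom if denom else 0.0


def rank_by(coefficient, query, database):
    """Map each database entry to its rank (1 = most similar to the query)."""
    ordered = sorted(
        database,
        key=lambda name: coefficient(query, database[name]),
        reverse=True,
    )
    return {name: position for position, name in enumerate(ordered, start=1)}


def fuse_rank_sum(query, database, coefficients):
    """Rank-sum fusion (an assumed fusion rule): add up the ranks that each
    coefficient assigns and order entries by the total, smallest first."""
    totals = {name: 0 for name in database}
    for coefficient in coefficients:
        for name, rank in rank_by(coefficient, query, database).items():
            totals[name] += rank
    return sorted(database, key=lambda name: totals[name])


# Toy fingerprints, invented purely for illustration.
query = {1, 2, 3, 4}
database = {
    "mol_a": {1, 2, 3, 4, 5},
    "mol_b": {1, 2},
    "mol_c": {7, 8, 9},
}

fused = fuse_rank_sum(query, database, [tanimoto, cosine])
print(fused)
```

Fusing ranks rather than raw scores avoids having to put coefficients with different value ranges on a common scale; other fusion rules (for example, taking the maximum score) would slot into the same structure.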

A machine learning approach has been used to identify an appropriate selection of coefficients for the data fusion methodology. This approach has been extended to include the automatic derivation of unique class-dependent formulae. An alternative to the fusion of coefficients is the fusion of different descriptor types. Specifically, this methodology has been used to investigate the relationship between descriptor pathlength and activity class.

A separate but related study investigated the effect of selecting seeds from known classes for k-modes cluster-based analysis in order to identify multi-modal classes.
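The idea of seeding k-modes with members of known classes can be sketched as follows. This is an assumption-laden illustration rather than the procedure used in the study: records are tuples of categorical (here binary) attributes, dissimilarity is the simple matching (Hamming) distance, the initial modes are the supplied seeds instead of random picks, and all names and data are hypothetical.

```python
from collections import Counter


def hamming(x, y):
    """Number of attribute positions at which two records differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def column_modes(rows):
    """Most frequent value in each attribute column of a group of records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def seeded_k_modes(records, seeds, max_iter=20):
    """k-modes clustering whose initial modes are the given seed records."""
    modes = [tuple(seed) for seed in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign every record to its nearest current mode.
        new_assignment = [
            min(range(len(modes)), key=lambda k: hamming(record, modes[k]))
            for record in records
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the modes will not change
        assignment = new_assignment
        # Recompute each mode from the records now assigned to it.
        for k in range(len(modes)):
            members = [r for r, c in zip(records, assignment) if c == k]
            if members:
                modes[k] = column_modes(members)
    return assignment, modes


# Toy binary records, invented for illustration.
records = [
    (1, 1, 0, 0),
    (1, 1, 1, 0),
    (0, 0, 1, 1),
    (0, 1, 1, 1),
]
# Seeds assumed to be drawn from compounds of known class.
seeds = [(1, 1, 0, 0), (0, 0, 1, 1)]

assignment, modes = seeded_k_modes(records, seeds)
print(assignment)
```

Under these assumptions, one way to look for a multi-modal class would be to supply several seeds from the same class and check whether its remaining members split across the resulting clusters.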

Daylight Chemical Information Systems, Inc.