Some thoughts on significant similarity and sufficient diversity

John Bradshaw and Roger A. Sayle
Glaxo Wellcome, Stevenage, Herts SG1 2NY, UK


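Two frequently asked questions about molecular similarity and diversity are:
Is the similarity between two chemicals significant?
Is the compound set I have sufficiently diverse?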

In general, no attempt has been made in the chemical world to answer such questions. Arbitrary rules of thumb have grown up around indices such as the Tanimoto coefficient as to what counts as high, medium and low similarity. What is clear from clustering work, and from studies of similarity in other areas such as psychology (A. Tversky, Psychological Review (1977) 84(4) 327-352), is that these measures are both set and target dependent. One solution is to opt for a totally non-parametric approach. However, users are not entirely comfortable with the consequences of this approach, and various parametric qualifiers have been added, e.g. NN_BEST_THRESHOLD, based on work at Sheffield University, Abbott Laboratories and Daylight CIS.

Alternatively, as we hold the view that the data structure should be determined by the data and not by some arbitrarily imposed Cartesian grid, we can attempt to add statistics that are set and target based in order to address these problems.

By borrowing the methods used in bioinformatics for similarity scores (J.F. Collins, A.F.W. Coulson and A. Lyall, CABIOS (1988) 4(1) 67-71), we attempt in this paper/poster to indicate possible answers to the two questions posed.

Is the similarity between two chemicals significant?

One way to answer this question is to recast it slightly as one of:
Is the similarity between the object I am considering and the target significantly greater than one would expect by chance?
Is the similarity between the object I am considering and the target significantly greater than the mean similarity to all other objects in the set?
In a preliminary experiment we looked at the distribution of similarity scores of a set of typical drugs ($DY_ROOT/data/drugs.smi) against the wdi971demo database provided in the 4.51 release of the DAYLIGHT software. If we ignore the nearest 100 neighbours then the distribution of similarity scores is almost normal. Note this is not true if we use folded fingerprints.

[Figures: distribution of Tanimoto scores against vitamin A, for folded fingerprints and for non-folded fingerprints]

If we require that significant neighbours are at least 5 standard deviations (i.e. a Z-score > 5) from the mean similarity of the non-immediate-neighbour group, then we get the table shown below.

[Table columns: Structure | Distribution of Tanimoto scores | Z-score | Number of significant neighbours]

Note that in the case of caffeine we have significant neighbours below 0.4 on a Tanimoto scale, illustrating that the significance is set dependent.
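
The test described above can be sketched in a few lines. This is a minimal illustration, not the Daylight toolkit code used for the poster: fingerprints are assumed to be Python sets of on-bit positions, and the names `tanimoto`, `significant_neighbours`, `skip` and `z_cutoff` are hypothetical. The defaults mirror the values quoted in the text (ignore the nearest 100 neighbours, require a Z-score > 5).

```python
from statistics import mean, pstdev


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def significant_neighbours(target, database, skip=100, z_cutoff=5.0):
    """Return (z, similarity, index) for entries whose Z-score exceeds z_cutoff.

    The background mean and standard deviation are taken from every score
    except the `skip` highest ones (the immediate neighbours).
    """
    # Score every database entry against the target, most similar first.
    scores = sorted(
        ((tanimoto(target, fp), i) for i, fp in enumerate(database)),
        reverse=True,
    )
    background = [s for s, _ in scores[skip:]]
    if len(background) < 2:
        raise ValueError("skip leaves too few scores to estimate a background")
    mu = mean(background)
    sigma = pstdev(background)
    if sigma == 0.0:
        raise ValueError("background scores are all identical; Z-score undefined")
    hits = []
    for s, i in scores:
        z = (s - mu) / sigma
        if z > z_cutoff:
            hits.append((z, s, i))
    return hits
```

Because the mean and standard deviation are recomputed for each target, the same Tanimoto value can be significant for one target and not for another, which is the set dependence noted for caffeine.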

Is the compound set I have sufficiently diverse?

We can possibly use this Z-score to get a handle on an answer to the second question. Again we recast the question slightly:
Will this new compound usefully increase the diversity of my compound collection?
One of the difficulties with the measurement of diversity is that it needs to be linked to appropriateness for the task. For example, if you wished to cross the Atlantic by diverse forms of transport, you would very rapidly eliminate a bicycle from your list of options, as it is not appropriate for the task. In the same way, certain compounds are inappropriate as drug lead candidates. However, appropriateness is difficult to measure.

One approach is to order the compounds in a collection by their Z-scores relative to the rest of the database. If you do this then the "odd-ball" compounds rise to the top. For instance, if we calculate the Z-scores for the wdi971demo database and sort the database by the score, the compounds with the highest scores would seem inappropriate as drug leads.

Table of 20 compounds with highest Z scores in wdi971demo

However, if we look at the compounds with the lowest overall Z-scores, they would seem to be much more appropriate as drug leads.
Initial studies indicate that a Z-score below 18 in this database appears reasonable for a drug lead candidate.

Table of 20 compounds with lowest Z scores in wdi971demo

Potentially, therefore, we have a way of 'cleaning' databases and removing inappropriate compounds without needing a rigorous a priori definition of appropriateness.
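
A rough sketch of such a ranking follows. The text does not spell out how a single Z-score is assigned to each compound, so the code makes a loudly stated assumption: each compound is scored by the Z-score that an exact match (Tanimoto 1.0) would receive against that compound's own background distribution. The function names, the set-of-bits fingerprint representation and the `skip` parameter are all hypothetical; the cutoff of 18 is the value quoted above for wdi971demo and should not be expected to transfer to other databases.

```python
from statistics import mean, pstdev


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def self_z_score(index, database, skip=100):
    """Z-score of a perfect match relative to this compound's background.

    ASSUMPTION: this definition is an interpretation, not taken from the source.
    """
    query = database[index]
    scores = sorted(
        (tanimoto(query, fp) for j, fp in enumerate(database) if j != index),
        reverse=True,
    )
    background = scores[skip:]  # drop the nearest neighbours
    if not background:
        raise ValueError("skip leaves no background scores")
    mu = mean(background)
    sigma = pstdev(background)
    if sigma == 0.0:
        # Flat background: the score is undefined, so rank it at the extreme.
        return float("inf")
    return (1.0 - mu) / sigma


def rank_by_z(database, skip=100):
    """Return (z, index) pairs, highest Z-score (most 'odd-ball') first."""
    scored = [(self_z_score(i, database, skip), i) for i in range(len(database))]
    return sorted(scored, reverse=True)


def plausible_leads(ranked, cutoff=18.0):
    """Indices of compounds whose Z-score falls below the cutoff."""
    return [i for z, i in ranked if z < cutoff]
```

Under this assumption a compound that resembles nothing else has a low, narrow background distribution and therefore a very high score, which is the behaviour described for the "odd-ball" compounds.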

In a single experiment we took the 1779 compounds in wdi971demo and calculated the Z-scores against the wdi971demo database and against the spresi95demo database.
The hope was that we could spread the compounds and get some handle on drug likeness. In the event the values were highly correlated:

Z(spresi) = 3.02 + 0.678 Z(wdi)

Compounds which were real outliers to wdi were also real outliers to the spresi sample, although less so. One interpretation would be that the wdi is a much more discriminating database than the collection of compounds in spresi in general, but compounds with really high Z-scores are rare in a global sense. This is understandable, given the molecular descriptors used, for compounds like oxygen, but less so for some of the other points marked on the plot.
The plot further shows that 99% of structures are contained in a region where the Z-scores are below 18.
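
For readers who want to repeat the comparison on their own data, a straight line of this kind can be fitted by ordinary least squares. The sketch below is illustrative only: `fit_line` and `predict_z_spresi` are hypothetical names, the text does not say which fitting procedure was used, and the default coefficients are simply the ones reported above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical; slope undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_z_spresi(z_wdi, intercept=3.02, slope=0.678):
    """Apply the reported linear relation to a wdi Z-score."""
    return intercept + slope * z_wdi
```

With the reported coefficients, a wdi Z-score of 18 corresponds to about 15.2 on the spresi scale; a slope below 1 is consistent with outliers being less extreme against spresi.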


Further work needs to be done to see whether this is a useful measure. The difficulty at the moment is that, for a database of N compounds, the process is O(N*N). We and others (see D.B. Turner, S.M. Tyrrell, P. Willett, JCICS (1997) 37(1) 18-22) have shown that sampling a large chemical database at about the 10% level does not affect global characteristics such as the mean pairwise similarity. Because the cost is quadratic, we could therefore reduce the time 100-fold for the calculation of the mean and standard error by taking the values from a 10% sample. Our method of choice has been to use the DEAL option on the thorlist command.
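
The saving from sampling can be illustrated as follows. This is a sketch under stated assumptions: Python's `random.sample` stands in for the thorlist-based sampling (it is not a reimplementation of that tool), fingerprints are sets of on-bit positions, and all function names are hypothetical.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def pair_count(n):
    """Number of distinct pairs among n compounds."""
    return n * (n - 1) // 2


def mean_pairwise_similarity(fps):
    """Exact mean Tanimoto over all distinct pairs: O(N*N) comparisons."""
    total = 0.0
    count = 0
    for a, b in combinations(fps, 2):
        total += tanimoto(a, b)
        count += 1
    if count == 0:
        raise ValueError("need at least two fingerprints")
    return total / count


def sampled_mean_pairwise_similarity(fps, fraction=0.1, seed=None):
    """Estimate the mean pairwise similarity from a random subset."""
    if len(fps) < 2:
        raise ValueError("need at least two fingerprints")
    rng = random.Random(seed)
    k = min(len(fps), max(2, round(len(fps) * fraction)))
    return mean_pairwise_similarity(rng.sample(fps, k))
```

For 1000 compounds there are 499500 distinct pairs, while a 10% sample of 100 compounds has 4950, which is roughly the 100-fold reduction mentioned above.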

We also need to estimate how many neighbours we need to skip when calculating the population mean. Currently we cannot use any of the fast methods for means of pairwise similarities, as the whole target set needs to be sorted before the exclusion. Experiments are being carried out excluding compounds which could not possibly be, say, within a Tanimoto similarity of 0.8 as they do not have sufficient bits set. Results are currently inconclusive.
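
The bit-count argument rests on a simple bound. If two fingerprints have a and b bits set and c bits in common, the Tanimoto coefficient is c / (a + b - c); since c cannot exceed min(a, b), the coefficient cannot exceed min(a, b) / max(a, b). A minimal sketch of a pre-filter built on that bound is given below; the function names are hypothetical and this is not the code used in the experiments described.

```python
def max_possible_tanimoto(bits_a, bits_b):
    """Upper bound on the Tanimoto coefficient given only the two bit counts."""
    if bits_a == 0 or bits_b == 0:
        return 0.0
    return min(bits_a, bits_b) / max(bits_a, bits_b)


def could_be_within(bits_a, bits_b, threshold=0.8):
    """True if the pair could reach the threshold; False means it never can."""
    return max_possible_tanimoto(bits_a, bits_b) >= threshold


def candidate_indices(target_bits, database_bits, threshold=0.8):
    """Indices of database entries that survive the bit-count pre-filter."""
    return [
        i
        for i, n in enumerate(database_bits)
        if could_be_within(target_bits, n, threshold)
    ]
```

Entries rejected by the filter can be assigned to the background group without computing or sorting their exact similarity; entries that pass still need the full comparison, because the bound says only what is possible, not what is achieved.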

Unfortunately the current code works against a merlin database and as the merlinload uses the *.DP file rather than a dt_stream() across the database, see dt_open(3), there is currently no simple way to produce "random" sample merlin bases from a parent Thor database. (See also Dave's comments for further detail.)

One of us has suggested that the ratio of the Z-scores may provide a useful similarity index. This will require investigation, not least because the resulting index will be directional.

This work would not affect a clustering method such as Jarvis-Patrick, as the ordering of nearest neighbours to a target is not affected. However, hierarchical methods become difficult, as the Z-scores are intrinsically directional.


We would like to thank Dr Jack Delaney of Daylight CIS, Professor Peter Willett of the University of Sheffield and Dr Aldo Feriani and Dr Giovanna Tedesco of Glaxo Wellcome Verona for their useful comments.
Daylight Chemical Information Systems, Inc.