EuroMUG '04 -- 4 - 5 Nov, 2004

A Mathematical Model for Screening Collections - requirements and implications for molecular fingerprints and similarity

Gavin Harper
GlaxoSmithKline R&D
Gunnels Wood Rd, Stevenage SG1 2NY


We introduce a quantitative model that relates chemical structural similarity to biological activity, and in particular to the activity of lead series of compounds in high-throughput assays. From this model we can derive the optimal make-up of a screening collection of a given fixed size, and identify the conditions under which a diverse collection of compounds, or a collection focused on particular regions of chemical space, is the appropriate strategy. This leads directly to a diversity function that may be used to assess compounds for acquisition, or libraries for combinatorial synthesis, by their ability to complement an existing screening collection.

The model, however, relies on structural clustering of millions of compounds, and each cluster in the model has two associated parameters. Practical parameterisation therefore depends on being able to assume that the same parameters apply to large groups of clusters. Whether this is a realistic aim depends in turn on the clustering technique and similarity measure used. From this viewpoint of what constitutes a high-quality clustering of the data, we consider the implications for how we represent molecules and compare molecular fingerprints.
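
To make the dependence on clustering technique and similarity measure concrete, the following is a minimal sketch only. The abstract does not say which similarity measure or clustering method the model uses; the Tanimoto coefficient and a simple single-pass "leader" clustering are assumptions chosen for illustration, as are the function names, the 0.5 threshold and the toy fingerprints (sets of on-bit positions).

```python
# Illustrative sketch: the measure (Tanimoto) and the method (leader
# clustering) are assumptions, not taken from the talk.

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def leader_cluster(fingerprints, threshold=0.5):
    """Single-pass clustering: each fingerprint joins the first cluster whose
    leader is at least `threshold` similar, otherwise it starts a new cluster.
    Returns a list of clusters, each a list of input indices."""
    leaders = []
    clusters = []
    for idx, fp in enumerate(fingerprints):
        for c, leader in enumerate(leaders):
            if tanimoto(fp, leader) >= threshold:
                clusters[c].append(idx)
                break
        else:
            # No existing leader was similar enough: start a new cluster.
            leaders.append(fp)
            clusters.append([idx])
    return clusters


# Hypothetical toy fingerprints, not real molecules.
toy = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({10, 11, 12})]
loose = leader_cluster(toy, threshold=0.5)
tight = leader_cluster(toy, threshold=0.7)
```

Raising the threshold splits the first two toy fingerprints into separate clusters, which is one way to see that the number and composition of clusters, and hence any per-cluster parameters, depend on the similarity measure and cut-off chosen.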

Presentation slides:

Daylight Chemical Information Systems, Inc.