Diversity Analysis in Combinatorial Libraries:
The Avoidance of Enumeration

John M. Barnard and Geoff M. Downs

Barnard Chemical Information Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, UK
Tel: +44 (0)114 233 3170; fax: +44 (0)114 234 3415; http://www.bci1.demon.co.uk

Poster Presentation in the Knowledge-Based Library Design (KBLD) session at the First Electronic Molecular Graphics and Modelling Society Conference, October 7-18, 1996

Note: An earlier version of this poster was presented orally at the 4th International Conference on Chemical Structures, Noorwijkerhout, The Netherlands, in June 1996, and has been submitted for publication in the Journal of Chemical Information and Computer Sciences.


Computer systems for handling "Markush" structures from chemical patents (which frequently cover unlimited numbers of individual compounds) use a number of techniques to avoid enumeration of the compounds themselves. This poster discusses the similarities and differences between the Markush structures in patents and those which may be used to describe combinatorial libraries. A program is described which uses Markush structure-handling techniques to generate structure fingerprints for the compounds in a library, without enumerating the compounds themselves, and thus offers substantial savings in processing time. Potential further application of these techniques to diversity analysis in combinatorial libraries is also discussed.

Markush Structures

Markush or "generic" structures are essentially structures involving R-groups, where a part of the molecule is defined by a series of alternatives. They occur frequently in chemical patent claims, where the inventor wants to protect an entire class of compounds rather than just the individual one he plans to sell, and the name Markush is derived from a patent applicant who was involved in a legal case of the validity of this type of claim in the 1920s. A long-running research project on Markush structure storage and retrieval at Sheffield University [Lynch and Holliday, 1996 ] found it useful to consider them as consisting of a (possibly vestigial) core, or invariant part, to which were attached variable groups (normally designated R1, R2, etc.) whose variability could be shown in one or more of four different ways:

Furthermore, each variable part can have further variable parts nested within it, to any depth.

The occurrence of homology variation means that a single Markush structure can cover open-ended (i.e. infinite) set of compounds and this clearly rules out enumeration as a technique for dealing with them. The Sheffield research project developed a number of techniques for representing and searching databases of Markush structures, without enumeration of their coverage, and several operational systems have been developed commercially, for use with patent databases [Barnard, 1991 ].

Markush structures have a number of applications outside the patent field, and the table below shows the types of variation which may be encountered in each. Most substructure search systems now provide some Markush features in their query languages, though the sophistication varies between systems; the results of QSAR analyses are frequently presented in the form of a Markush structure with a table of activity data for each combination of R-group values; combinatorial libraries may also be expressed as Markush structures.

  substituent position frequency homology
Patents X X X X
Substructure Search Queries X (X) (X) (X)
QSAR Studies X X    
Combinatorial Libraries X (X) (X) (((X)))

Though most combintorial libraries involve only substituent variation,it is likely that other variation types may become important as novel combinatorial synthesis techniques are developed; h-variation is clearly entirely absent in physical combinatorial Libraries, though it might be argued that it is potentially useful in description of extremely large virtual libraries.

Diversity Analysis in Combinatorial Libraries

A single combinatorial library (CL) of compounds may contain a very large number of individual compounds. Though it is now generally considered that only relatively modest CL's (a few thousands or tens of thousands of compounds) will actually be physically synthesised, computer systems will still be required to handle "virtual" CLs several orders of magnitude larger than this. Indeed, performing diversity analyses on such virtual CLs may be an essential step in deciding the optimum subset CL which is actually to be synthesised.This can cause problems for computer systems which depend on enumeration of these compounds in order to be able to perform any sort of analysis of their diversity. Though hardware solutions, such as the SpaceCrunch project being jointly developed by Tripos Inc. and Silicon Graphics Inc., may provide a partial solution,. problems such as clustering the compounds in a large CL, which may have time and/or space requirements proportional to the square of the number of the number of compounds, are likely to be too slow to be practicable. Recent work [Martin et al., 1995 ] has focused on considering the diversity of each building block pool in isolation, and though the current nature of combinatorial synthesis requires that subsets of each building block pool (rather than random combinations of members of different pools) must be chosen in order to build a subset CL, this approach pre-supposes that maximising the diversity in each pool subset will maximise the diversity of the compounds in the subset CL. Work reported elsewhere in this conference session suggests that such a presupposition may be unwarranted [Gillett, Willett and Bradshaw, 1996].

Some of the techniques used for searching databases of Markush structures from patents are also being employed in the search systems currently under development by major software vendors for CL information systems. At the simplest level, this involves storing and matching the common parts of the compounds covered by a CL only once, in order to avoid full enumeration. Such techniques are not especially new, and the earliest such system was designed as long ago as 1958 [Meyer et al., 1984 ].

Fingerprint Generation Without Compound Enumeration

As a first stage in applying Markush structure handling techniques to diversity analysis in combinatorial libraries, we have developed software to generate structure fingerprints for the compounds in a library, without the need to enumerate the compounds themselves. Though this still involves enumeration of the fingerprints (which might then be used for clustering, dissimilarity selection etc.) our initial purpose was to demonstrate the applicability of the approach, which in any case may offer significant savings in processing time; extension of the work to avoid even this enumeration is discussed below.

Our inital objectives were:

  1. To design an internal data structure for Markush structures suitable for representing CLs;
  2. To generate structure "fingerprints" (indicating the presence or absence of certain substructural fragments in each compound) for all the compounds in a CL directly from this;
  3. To compare these fingerprints with those obtained by first enumerating the compounds, and then generating fingerprints for each individually.

The figure below shows the basic processing flow involved. Two alternative routes were used to generate structure fingerprints from initial input of a CL representation: on the left-hand side, a file containing the enumerated connection tables for all the compounds in the CL is generated, which can then be used as input to standard fingerprint generation software. On the right-hand side new software is used to read the CL into the specially-designed internal representation, and then to generate the fingerprints directly from it. The fingerprint files on each side should, of course, be identical.

Processing Flow Diagram

Internal Representation of Libraries

In the Sheffield research work, a specially-designed internal representation, called the Extended Connection Table Representation (ECTR) [Barnard et al., 1982] was used for Markush structures, and two important algorithms were developed to process it. The ECTR represents a Markush structure as a logical "AND/OR" tree in which leaf nodes represent individual alternatives for each variable group ("partial structures"), and internal nodes show the logical relationships between them. The TreeTrace algorithm traverses the ECTR accessing each partial structure in turn, and the Bubble-Up algorithm accumulates information from each partial structure, keeping the correct logical relationships, and "bubbles it up" to the top (root) of the tree [Downs et al., 1989].

For the present work, a conceptually similar data structure was designed, called "ECTR For Combinatorial Libraries" (EFCL). This is implemented using dynamic arrays in the C language, and is intended to by created and held in random-access memory only for as long as is required for processing; no arbitrary limits are placed on the number of partial structures, the number of connections between any pair of partial structures, or the depth of the tree. With a view to possible future requirements, provision is also made for the inclusion of generic (homolgy-variant) partial structures. The figure below illustrates the logical structure of the EFCL for a simple combinatorial library, also shown as a CHORTLES notation. Each "leaf node" in the tree is a partial structure, representing one building block (monomer unit). The lines connected by small arcs indicate AND relationships between the partial structures at the ends of the lines, whilst those without arcs represent OR relationships.

EFCL Diagram

Software Development

For reasons of availability and convenience, significant use was made of the Monomer Toolkit supplied by Daylight Chemical Information Inc. CLs were represented as CHORTLES notations [Siani et al., 1995], with associated building block definitions. Daylight Monomer Toolkit routines were used to enumerate the individual compound connection tables which were then used as input to Barnard Chemical Information Ltd.'s standard fingerprint generation program, MAKEBITS. The Daylight fingerprint generation routines were not used, as these could not have been used for fingerprint generation directly from the EFCL.

Special routines (also using the Daylight Monomer Toolkit) were written in order to generate the EFCL from CHORTLES input, though there is no reason why it should not also be generated from other input formats such as Tripos Inc.'s Combinatorial Sybyl Line Notation or MDL's RGfile format, or possibly using Daylight's forthcoming Reaction Toolkit.

Adapted versions of certain routines from the MAKEBITS program were used to generate "partial fingerprints" on each partial structure, and modified versions of the TreeTrace and Bubble-Up algorithms were implemented to accumulate these to form one fingerprint for each compound covered. Fragments in three different families were generated:

Those fragments contained entirely within a single partial structure ("intra-PS" fragments) are assigned to a partial fingerprint bitstring associated with that partial structure, and the modified Bubble-Up algorithm logically ORs together the partial fingerprints for each partial structure combination required. Fragments spanning two or more partial structures ("inter-PS" fragments) are assembled from partial fragments during the bubble-up process itself, and added to the partial fingerprint for the highest-level partial structure with which they are associated.


Data from demonstration files provided with Daylight software were used to generate fingerprint files by both the enumeration and Markush routes. The PAM95 library contains 600 small molecules, the PEPTIDE95 library contains 1280 peptides, and the PEPTOID95 library contains 1280 peptoids (N-substituted glycines). Separate runs were made for three different fragment families; in each case, the fingerprint files generated by the two routes were identical, after sorting to deal with different enumeration orders.

The tables below compare the two routes in terms of processing times (elapsed seconds) for generation of files of structure fingerprints for Augmented Atoms and Atom Sequences (the code for Atom Pairs was not sufficiently complete to include timings for these too). Programs were run on a PC (Pentium 133). Note that the timings for the enumeration route do not include the time taken to generate the individual connection tables. Due to the reliance on the Daylight toolkit this had to be performed on a Silicon Graphics Indigo 2 computer and was a substantial additional overhead adding around 20s to 60s.

Augmented Atom Fragments
Library Number of
PAM95 600 4.22s 0.31s
PEPTIDE95 1280 8.49s 1.19s
PEPTOID95 1280 8.70s 1.23s

Atom Sequence Fragments
Library Number of
PAM95 600 39.6s 0.99s
PEPTIDE95 1280 39.14s 52.62s
PEPTOID95 1280 39.06s 30.33s


The Markush method for fingerprint generation generally provides significant time savings, which are likely to be greater for larger CLs, as the ratio of number of compounds to number of building blocks (partial structures in the EFCL) increases. The fragment Bubble-Up principle is also applicable to any structural descriptor which is essentially an additive property of its component parts, such as many topological indexes.

The savings are, however, much less marked, or even absent altogether, when generating atom sequence fragments from the PEPTIDE95 and PEPTOID95 libraries; the reason for this lies in the manner in which CHORTLES repesents these libraries. Effectively in these cases the "invariant part" of the molecule is vestigial, and consists solely of a sequence of linked variable groups, each of which can take up to 20 different values. This is, however, a somwhat artificial representation, since there is actually a common peptide or peptoid backbone off which the variable side-chains are hung. Much time is thus spent in generating all possible combinations of partial atom sequences from neighbouring building blocks, most of which eventually form the same sequence fragments, describing the invariant backbone; only seqeuences of at least six atoms would involve two different side chains, and because of their linear nature, no sequences would involve three or more side-chains. A different input representation (or more sophisticated software to build the EFCL), in which the common features of each alternative building block were extracted and placed in the root partial structure, should significantly increase the processing speed for this type of fragment. In the small-molecule PAM95 library, which does contain a substantial invarant part, the Markush method is substantially faster than the enumeration method for generating sequences.

Despite the advantages of more rapid generation of structural descriptors, the rate-limiting step for diversity analysis of a CL by clustering its compounds is likely to be the pairwise comparison of the fingerprints (for which the time requirements are related to the square of the number of compounds). However, the present demonstration of the applicability of Markush structure-handling techniques to diversity analysis in CLs also provides a basis for further work. For example, the Bubble-Up algorithm could be adapted for direct generation of a "modal fingerprint" [Shemetulskis et al., 1996], characteristic of a CL as a whole, in which each bit is set if the corresponding fragment appears in more than a certain proportion of the compounds covered.

For clustering purposes, in which pairwise similarities between the compounds are required, these could be calculated "on the fly" without the need to output a file of fingerprints. With a clustering method such as the popular Jarvis-Patrick non-hierarchical method, only the top 20 or so nearest neighbours of each compound are required, and upperbound calculations might be used in order to avoid full calculation of similarities which could not occur in the top-20 list. A similar approach might also be used to find reciprocal nearest neighbours, as required in some of the hierarchical clustering methods which have recently been found to perform significantly better than the Jarvis-Patrick method [Brown and Martin, 1996].


Thanks are due to John Bradshaw (GlaxoWellcome), Rob Brown, Bev Eccles and Yvonne Martin (Abbott Laboratories), for helpful discussions and financial support of this work. Also to Craig James, Yosi Taitz and Jeremy Yang (Daylight Chemical Information Systems Inc.) for provision of, and assistance with Daylight software; to Valerie Gillet and Peter Willett (Sheffield University) for assistance with computing facilities; and to Gary Wiggins (Indiana University) where JMB spent two months as a Visiting Scholar in Fall 1995, and gained experience in the use of the Daylight toolkits.

Bibliographic References

  1. Barnard, J. M.; Lynch, M. F.; Welford, S. M. Computer storage and retrieval of generic structures in chemical patents. Part 4. An extended connection table representation for generic structures. J. Chem. Inf. Comput. Sci., 1982, 22, 160-164.
  2. Barnard, J. M. A comparison of different approaches to Markush structure handling. J. Chem. Inf. Comput. Sci., 1991, 31, 64-68.
  3. Brown, R.D.; Martin, Y.C. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 1996, 36, 572-584.
  4. Carhart, R.E.; Smith, D.H.; Venkataragharvan, R. J. Chem. Inf. Comput. Sci., 1985, 25, 64-73
  5. Downs, G. M.; Gillet, V. J.; Holliday, J. D.; Lynch, M. F. Computer storage and retrieval of generic chemical structures in patents. 10. The generation and logical bubble-up of ring screens for structurally-explicit generics. J. Chem. Inf. Comput. Sci., 1989, 29, 215-224.
  6. Lynch, M. F.; Holliday, J. D. The Sheffield generic structures project - a retrospective review. J. Chem. Inf. Comput. Sci. 1996, in press.
  7. Martin, E. J.; Blaney, J. M.; Siani, M. A.; Spellmeyer, D. C.; Wong, A. K.; Moos, W. H. Measuring diversity: experimental design of combinatorial libraries for drug discovery. J. Med. Chem., 1995, 38, 1431-1436.
  8. Meyer, E.; Schilling, P.; Sens, E. "Experiences with input, translation and search in files containing Markush formulae". In Computer Handling of Generic Chemical Structures, ed. Barnard, J. M. Aldershot: Gower, 1984. pp. 83-95.
  9. Shemetulskis, N.E.; Weininger, D.; Blankley, C. J.; Yang. J. J.; Humblet, C. Stigmata: an algorithm to determine structural commonalties in diverse datasets. J. Chem. Inf. Comput. Sci. 1996, 36, 862-871.
  10. Siani, M.; Weininger, D.; James, C. A.; Blaney, J. M. CHORTLES: a method for representing oligomeric and template-based mixtures. J. Chem. Inf. Comput. Sci., 1995, 35, 1026-1033.