Use of Markush Structure Techniques to Avoid Enumeration in Diversity Analysis of Large Combinatorial Libraries

John M. Barnard and Geoff M. Downs

Barnard Chemical Information Ltd, 46 Uppergate Road, Stannington, Sheffield S6 6BX, UK
Tel: +44 (0)114 233 3170; fax: +44 (0)114 234 3415; http://www.bci1.demon.co.uk

This paper was presented orally at the MSI Combinatorial Chemistry Consortium Meeting, L'Auberge Del Mar, Del Mar, CA 92014, USA, February 11, 1997 and at the Daylight Chemical Information Systems MUG97 meeting, The Surf and Sand Hotel, Laguna Beach, CA, USA, February 27, 1997

Copyright (c) Barnard Chemical Information Ltd., 1997


Abstract

Computer systems for handling "Markush" structures from chemical patents (which frequently cover unlimited numbers of individual compounds) use a number of techniques to avoid enumeration of the compounds themselves. Though there are several important differences between the Markush structures in patents and those which may be used to describe combinatorial libraries, some of the same techniques can be used to avoid enumerating the individual compounds covered for diversity analysis purposes. The Daylight Monomer Toolkit has been used in a program which is able to generate structure fingerprints for the compounds in a library, without first enumerating the compounds themselves, and can thus offer substantial savings in processing time. Various extensions and improvements to this approach are currently under investigation.


The Combinatorial Explosion

A large "virtual" combinatorial library (CL) can encompass a very large number of compounds, of the order of 1012 or more. Though it may only be required actually to synthesise relatively modest CL's (103 or 104 compounds) the performance of diversity analyses on much larger virtual CLs may be an essential step in deciding the optimum subset CL which is actually to be synthesised. This can cause problems for computer systems which depend on enumeration of these compounds in order to be able to perform such analyses. There are a number of potential solutions to this problem.

A hardware-based approach, in which parallel machines are used to search enumerated CLs, is unlikely to provide a satisfactory solution since the resources (processors or time) required will rise at least linearly with the number of compounds covered, which means that even a single order of magnitude increase could present an intractible problem. The SpaceCrunchTM project being jointly developed by Tripos Inc. and Silicon Graphics Inc., incorporates some aspects of a hardware solution, and CLs of up to 1011 compounds can be searched in "acceptable" time, though full enumeration of the compounds is not required.

The most common approach, exemplified by the work of Martin et al. (1995), considers the diversity of each building block pool in isolation. This has the advantage of reducing the problem from one whose scale varies with the product of the number of alternatives for each building block to one whose scale varies with the sum of the number of such alternatives. However, it pre-supposes that maximising the diversity in each pool subset will maximise the diversity of the compounds in the subset CL. Recent work at Sheffield University [Gillett, Willett and Bradshaw, 1996] suggests that this supposition may be unwarranted, though it is important to remember that the current nature of combinatorial synthesis requires that subsets of each building block pool (rather than random combinations of members of different pools) must be chosen in order to build a subset CL. Furthermore, the approach ignores the context of building blocks in the eventual products.

The approach used in MSI's C2 Diversity system develops the previous approach, by looking at the building blocks "in context", but only varying one of them at a time. This retains the advantage of reducing the problem from the product of the number of alternatives to their sum, but still ignores the diversity of features involving multiple building blocks.

The approach discussed in this presentation involves the use of techniques developed in the 1980s for the storage and retrieval of Markush structures from patent documents. Such techniques allow the handling of very large (frequently infinitely large) sets of compounds without enumeration, and are used in two operational online systems, the MARPAT system from Chemical Abstracts (available on STN) and the Markush DARC system jointly developed by Derwent Information Ltd, Questel SA and the French Patent Office (INPI) (available on the Questel Orbit system) [Barnard, 1991].


Markush Structures

Markush structures are essentially structures involving R-groups, where a part of the molecule is defined by a series of alternatives; they thus specify sets of specific molecules. The occurrence of undelimited expressions such as "R1 is an optionally-substituted heterocycle" (which occur frequently in patent specifications) mean that they can be open-ended (i.e. infinite) sets.

Markush structures occur frequently in chemical patent claims, and enable the inventor to protect an entire class of compounds rather than just the individual one he plans to sell; the name Markush is derived from a patent applicant who was involved in a legal case concerning the validity of this type of claim in the 1920s. Markush structures also have a number of applications outside the patent field. Most substructure search systems now provide some Markush features in their query languages, though the sophistication varies between systems; the results of QSAR analyses are frequently presented in the form of a Markush structure with a table of activity data for each combination of R-group values; and combinatorial libraries may also be expressed as Markush structures.

The long-running research project on Markush structure storage and retrieval at Sheffield University [Lynch and Holliday, 1996 ] found it useful to consider them as consisting of a (possibly vestigial) core, or invariant part, to which were attached variable groups (normally designated R1, R2, etc.) whose variability could be shown in one or more of four different ways:

Furthermore, each variable part can have further variable parts nested within it, to any depth. The following simple Markush structure illustrates each type of variability

Variability in Markush structures

and the table below illustrates the different types of variablity which are typically found in different application areas.

  substituent position frequency homology
Patents X X X X
Substructure Search Queries X (X) (X) (X)
QSAR Studies X X    
Combinatorial Libraries X (X) (X) (((X)))

Although, at present at least, combinatorial libraries generally involve substituent variation only, it is likely that other variation types may become important as novel combinatorial synthesis techniques are developed. Homology-variation is clearly entirely absent in physical CLs, though it might be argued that it is potentially useful in description of extremely large virtual libraries.

The occurrence of homology variation, which means that a single Markush structure can cover an infinite set of compounds, clearly rules out enumeration as a technique for dealing with them. Developers of systems for handling Markush structures from patents were thus obliged to develop techniques for representing and searching databases of Markush structures, without enumeration. Some of these techniques are being employed in the search systems currently under development by major software vendors for CL information systems [Leland et al., 1997]. At the simplest level, for searching, this involves storing and matching the common parts of the compounds covered by a CL only once, in order to avoid full enumeration. Such techniques are not especially new, and indeed the earliest such system was designed as long ago as 1958 (though details were not published until a quarter of a century later [Meyer et al., 1984 ]), but work during the 1980s also developed special techniques to avoid problems associated with differing "segmentation" of the Markush structure into building blocks, which can cause particular difficulties, especially where homology variation is present.

In essence a Markush structure is an implicit representation of a set of specific molecules. In formal language terms, it is a "grammar" which specifies the rules by which valid "sentences" of a language (the individual compounds covered) may be generated. The techniques developed during the 1980s enable operations (comparison, analysis, search etc.) to be carried out on the finite grammar of the language, rather than on the potentially infinite set of valid sentences in it. Similarly, one determines whether or not a given sentence is a valid sentence in the English language, by determining whether or not it conforms to the rules of English grammar, and not by searching for the sentence in an enumerated database of valid English sentences.


Fingerprint Generation Without Compound Enumeration

The work described here is an initial study, aimed at demonstrating the applicability of Markush structure techniques to diversity analysis of combinatorial libraries, and has been supported by some of Barnard Chemical Information Ltd.'s pharmaceutical industry clients. Some preliminary results were presented at the 4th International Conference on Chemical Structures in the Netherlands in June 1996 [Downs and Barnard, 1997], though it remains "work in progress".

The inital objectives of the work were:

  1. To design an internal computer data structure for Markush structures suitable for representing CLs;
  2. To write software to generate structure "fingerprints" (indicating the presence or absence of certain substructural fragments in each compound) for all the compounds in a CL directly from this;
  3. To compare these fingerprints with those obtained by first enumerating the compounds, and then generating fingerprints for each individually; the two sets of fingerprints should, of course, be the same.

Though this process still involves enumeration of the fingerprints (which might then be used for clustering, dissimilarity selection etc.) our initial purpose was to demonstrate the applicability of the approach, which should in any case offer significant savings in processing time; extension of the work to avoid explicit generation of an enumerated fingerprint file is discussed below.

The figure below shows the basic processing flow involved. Two alternative routes were used to generate structure fingerprints from initial input of a CL representation: on the left-hand side, a file containing the enumerated connection tables for all the compounds in the CL is generated, which can then be used as input to standard fingerprint generation software. On the right-hand side new software is used to read the CL into the specially-designed internal representation, and then to generate the fingerprints directly from it.

Processing Flow Diagram

Internal Representation of Libraries

The internal data structure is constructed as an "AND/OR" logic tree, which links a set of "partial structures" (building blocks) in appropriate logical relationships. It is based very closely on the Extended Connection Table Representation (ECTR) [Barnard et al., 1982] developed as part of the Sheffield University research on Markush structures, and is thus called the "ECTR for Combinatorial Libraries" (EFCL). In the EFCL tree structure, leaf nodes represent individual building blocks (partial structures) and internal nodes show the logical relationships between them (AND or OR as appropriate). The EFCL is implemented using dynamic arrays in the C language, and is intended to be created and held in random-access memory only for as long as is required for processing; no arbitrary limits are placed on the number of partial structures, the number of connections between any pair of partial structures, or the depth of the tree. With a view to possible future requirements, provision is also made for the inclusion of generic (homolgy-variant) partial structures, and the data structure also includes certain other features for position and frequency variation.

The figure below illustrates the logical structure of the EFCL for a simple combinatorial library, also shown as a Daylight CHORTLES notation. Each "leaf node" in the tree is a partial structure, representing one building block (monomer unit). The lines connected by small arcs indicate AND relationships between the partial structures at the ends of the lines, whilst those without arcs represent OR relationships.

EFCL Diagram

Sample Libraries

Libraries encoded in the Daylight CHORTLES notation [Siani et al., 1995], with associated building block definitions (monomer tables) were used to test the software developed for fingerprint generation from the EFCL. This allowed Daylight Monomer Toolkit routines to be used to enumerate the libraries for comparison purposes, and to assist in generting the EFCL itself.

Three sample libraries supplied as demo files with the Daylight Monomer Toolkit were used. These were

In addition a much larger, non-proprietary library, containing 720,900 compounds with 3 points of diversity was supplied by Abbott Laboratories. The general structure involved is shown below:

ABBOTT Library

For input, each library was represented as two TDT files, one containing the CHORTLES and one containing the monomer definitions.

Software Development

For the enumeration route a simple Monomer Toolkit program was used to enumerate the individual compound connection tables which were then input to Barnard Chemical Information Ltd.'s standard fingerprint generation program, MAKEBITS. It was not possible to use the Daylight Fingerprint Toolkit, as it would have been difficult to adapt this for fingerprint generation directly from the EFCL, and it was essential that fingerprints generated by the two routes be comparable.

For the Markush route, special routines (also using the Daylight Monomer Toolkit) were written to generate the EFCL from CHORTLES and monomer table input. Routines from the MAKEBITS program were adapted and used to generate fingerprints for each combination of partial structures directly from the EFCL.

The fingerprints generated by MAKEBITS are based on identifying the presence of structure fragments in a molecule, and then looking up the fragment (and various generalised versions of it) in a dictionary file, which (if the fragment is present) specifies a bit position in the fingerprint to be set. For this work, three types of fragment were generated:

During the processing, an algorithm called TreeTrace traverses the EFCL visiting each partial structure in turn. Fragments wholly contained within a single partial structure ("intra-PS" fragments) are immediately assigned to "partial fingerprints" associated with that partial structure. Fragments spanning two or more partial structures ("inter-PS" fragments) are assembled from their constituent parts in the top-most partial structure with which they are associated (i.e. that closest to the root of the EFCL). Complete fingerprints for each individual compound are then constructed for each logical combination of partial structures, by ORing together the relevant intra-PS partial fingerprints, and assembling the inter-PS fragments and assigning the relevant bits in the final fingerprint.

Results

Separate runs were made to generate fingerprints based on the three different fragment families, and in each case the fingerprint files generated by the two routes were identical, after sorting to deal with different enumeration orders.

The tables below compare the processing times for the two routes. The actual timings given for the ABBOTT library are not directly comparable with those for the other three libraries, as the runs for this library were done on a different machine; however the figures for the different routes and fragment families within each library are comparable.

For the PAM95, PEPTIDE95 and PEPTOID libraries, the fragment generation programs were run on a single-user Pentium PC (133MHz) and the times given are elapsed seconds. (The Daylight-toolkit dependent routines to create the EFCL were run on a Silicon Graphics Indigo 2, and a text file representation of the EFCL transferred to the PC.) For the ABBOTT library the fingerprint generation programs were run on a Silicon Graphics Indigo 2, and the times given are cpu-seconds.

Note that the timings for the enumeration route represent only the fingerprint generation time and do not include the time taken to enumerate the individual connection tables. This added around 20s to 60s for the PAM95, PEPTIDE95 and PEPTOID95 libraries, and about 3500s for the ABBOTT library.

Augmented Atom Fragments
Library Number of
Compounds
# Building
Blocks
Enumeration
method
Markush
method
Speedup
PAM95 600 22 4.22 0.29 14.6
PEPTIDE95 1280 36 8.49 0.63 13.5
PEPTOID95 1280 36 8.70 0.72 12.8
ABBOTT 720,900 270 11,103     291     38.1

Atom Sequence Fragments
Library Number of
Compounds
# Building
Blocks
Enumeration
method
Markush
method
Speedup
PAM95 600 22 39.6 0.78 50.7
PEPTIDE95 1280 36 39.14 3.95 9.9
PEPTOID95 1280 36 39.06 2.92 13.4
ABBOTT 720,900 270 66,570       510       109.1

Atom Pair Fragments
Library Number of
Compounds
# Building
Blocks
Enumeration
method
Markush
method
Speedup
PAM95 600 22 8.22 0.64 12.8
PEPTIDE95 1280 36 15.84 16.90 0.9
PEPTOID95 1280 36 15.10 18.19 0.8
ABBOTT 720,900 270 32,160       944       34.1


Discussion

Fingerprint Generation Times

The tables of results shown above show that the Markush method for fingerprint generation provides significant time savings, which are greater for larger CLs, as the ratio of the number of compounds to number of building blocks (partial structures in the EFCL) increases. There are, however, marked differences between the different fragment types and the different libraries studied.

In the case of the PEPTOID95 and PEPTIDE95 libraries the Markush method is actually slower at generating fingerprints based on atom pair (AP) fragments. The reason for this lies in the manner in which CHORTLES represents these libraries. Effectively in these cases the "invariant part" of the molecule is vestigial, and consists solely of a sequence of linked variable groups, each of which can take a number of different values. This is a somwhat artificial representation, since there is actually a common peptide or peptoid backbone off which the variable side-chains are hung. When generating atom pair (AP) fragments involving atoms at either end of the peptide chain, much time is thus spent in identifying and measuring the lengths of all paths through the intervening "variable" partial structures, all of which ultimately turn out to be identical, and lead to the same bit position being set in the fingerprint.

A similar problem also occurred with atom sequence (AS) fragments in an early version of the program, and though refinements to the code were able to reduce its effect on processing times, it can be seen that the speedup offered for AS fragments is still substantially less in the PEPTIDE95 and PEPTOID95 libraries than in the small-molecule libraries where the common scaffold is explicitly identified. In contrast, there is much less difference in the speedup figures for much more localised augmented atom (AA) fragments in the different libraries. A procedure to "optimise" the EFCL by identifying common substructures, and transfering the atoms concerned to the "parental" partial structure, would be likely to improve processing times substantially for the longer-range AS and AP fragments.

Current Work

Though the files of enumerated fingerprints generated as described above can be used for similarity searching, clustering etc., our current work is focussing on the direct calculation of measures of library diversity from the EFCL, without the need to write out such a file.

Shemetulskis et al. (1996) have described the use of a "modal fingerprint", characteristic of a CL as a whole, in which each bit is set if the corresponding fragment appears in more than a certain user-specified threshhold proportion of the compounds covered. A "generalised" modal fingerprint (specifying the number of compounds in which each bit is set) can easily be cumulated during the fingerprint generation decribed above, and modal fingerprints at any desired threshhold extracted from it. Even faster techniques [Downs et al., 1989] could be used to generate the logical AND and logical OR of the individual fingerprints (which correspond to modal fingerprints at threshholds of 100% and 0% respectively) without the need to go via the individual fingerprints at all. Counts of bits set in the logical OR and logical AND of the individual fingerprints have been proposed as diversity measures [Martin et al., 1995].

Turner et al. (1997) have recently described the use of a "centroid" fingerprint, which is a vector giving a weighted sum of individual fingerprints, and have shown that if appropriate weights are chosen, the dot product of this vector with itself is equal to the sum of the pairwise similarities for all the molecules in the dataset. They have suggested that the mean intermolecular dissimilarity (trivially calculated from this) is a useful measure of diversity, avoiding a disadvantage of the "counting bits" measure, which is sensitivity to individual outlier compounds. Furthermore, using Turner et al.'s method, the measure can be calculated in O(N) time. As with the generalised modal fingerprint, the centroid fingerprint can be cumulated rapidly during generation of individual fingerprints (which need not be retained) from the EFCL.

Other Potential Applications

Several other applications of the Markush techniques described here to the handling of combinatorial libraries can be identified. For clustering purposes, where pairwise similarities between the compounds are required, these could be calculated "on the fly" without the need to output a file of fingerprints. With a clustering method such as the popular Jarvis-Patrick non-hierarchical method, only the top 20 or so nearest neighbours of each compound are required, and upperbound calculations might be used in order to avoid full calculation of similarities which could not occur in the top-20 list. A similar approach might also be used to find reciprocal nearest neighbours, as required in some of the hierarchical clustering methods which have recently been found to perform significantly better than the Jarvis-Patrick method [Brown and Martin, 1996].

A wide variety of other descriptor types could also be generated using the "bubble-up" principle used here for generating 2D-fragment-based fingerpints. The approach is applicable to any structural descriptor which is essentially an additive property of its component parts, such as molecular weights, many topological indexes, and CLOGPs.

For basic information-handling processes in library registration and searching, Markush structure techniques are already well-established in the operational patent information systems MARPAT and Markush DARC. They allow (sub-)structure searching of Markush structures without enumeration, and could be extended to allow identification of the overlap between libraries with different "segmentation" (division into building blocks).

Other Input Methods

The EFCL data structure is central to the Markush approach. In the software we have developed to date it is generated from CHORTLES and monomer table using Daylight monomer toolkit routines. It could also be generated from other formats used for representation of combinatorial libraries, such as Tripos Inc.'s Sybyl Line Notation (SLN) [Ash et al., 1997], and MDL's RGfiles.

The EFCL represents a set of product molecules, but in principle it can be built incrementally from a sequence of generic reactions (transforms) with sets of precursor reagents. The EFCL data structure has been designed to handle features such as nested R-groups, linked R-groups (such as might result from Diels-Alder cycloaddition reactions), and conditional R-group inter-dependencies. We are currently developing software to generate the EFCL from SMIRKS/SMILES input using the Daylight Reaction Toolkit, and this appears to be an appropriate way of using a reaction-based system for representation and manipulation of mixtures.


Acknowledgements

Financial support for this work from Abbott Laboratories, GlaxoWellcome and Parke-Davis is gratefully acknowledged. Thanks are also due Rob Brown and Mark Bures (Abbott Laboratories) for provision of the ABBOTT example library, and using it to test the programs.


Bibliographic References

  1. Ash, S.; Cline, M. A.; Homer, R. W.; Hurst, T.; Smith, G. B. SYBYL Line Notation (SLN): a versatile language for chemical structure representation. J. Chem. Inf. Comput. Sci. 1997, 37 71-79.
  2. Barnard, J. M.; Lynch, M. F.; Welford, S. M. Computer storage and retrieval of generic structures in chemical patents. Part 4. An extended connection table representation for generic structures. J. Chem. Inf. Comput. Sci., 1982, 22, 160-164.
  3. Barnard, J. M. A comparison of different approaches to Markush structure handling. J. Chem. Inf. Comput. Sci., 1991, 31, 64-68.
  4. Brown, R.D.; Martin, Y.C. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci., 1996, 36, 572-584.
  5. Carhart, R.E.; Smith, D.H.; Venkataragharvan, R. J. Chem. Inf. Comput. Sci., 1985, 25, 64-73
  6. Downs, G. M.; Gillet, V. J.; Holliday, J. D.; Lynch, M. F. Computer storage and retrieval of generic chemical structures in patents. 10. The generation and logical bubble-up of ring screens for structurally-explicit generics. J. Chem. Inf. Comput. Sci., 1989, 29, 215-224.
  7. Downs, G. M.; Barnard, J. M. Techniques for generating descriptive fingerprints in combinatorial libraries. J. Chem. Inf. Comput. Sci., 1997, 37, 59-61.
  8. Gillet, V. J.; Willett, P.; Bradshaw, J. The effectiveness of monomer pools for generating structurally-diverse combinatorial libraries. Poster presented in the Knowledge-Based Library Design (KBLD) session of the First Electronic Molecular Graphics and Modelling Society Conference, October 7-18, 1996.
  9. Leland, B. A.; Christie, B. D.; Nourse, J. G.; Grier, D. L.; Carhart, R. E.; Maffet, T.; Welford, S. M.; Smith, D. H. Managing the combinatorial explosion. J. Chem. Inf. Comput. Sci. 1997, 37 62-70.
  10. Lynch, M. F.; Holliday, J. D. The Sheffield generic structures project - a retrospective review. J. Chem. Inf. Comput. Sci. 1996, 36 930-936.
  11. Martin, E. J.; Blaney, J. M.; Siani, M. A.; Spellmeyer, D. C.; Wong, A. K.; Moos, W. H. Measuring diversity: experimental design of combinatorial libraries for drug discovery. J. Med. Chem., 1995, 38, 1431-1436.
  12. Meyer, E.; Schilling, P.; Sens, E. "Experiences with input, translation and search in files containing Markush formulae". In Computer Handling of Generic Chemical Structures, ed. Barnard, J. M. Aldershot: Gower, 1984. pp. 83-95.
  13. Shemetulskis, N.E.; Weininger, D.; Blankley, C. J.; Yang. J. J.; Humblet, C. Stigmata: an algorithm to determine structural commonalties in diverse datasets. J. Chem. Inf. Comput. Sci. 1996, 36, 862-871.
  14. Siani, M.; Weininger, D.; James, C. A.; Blaney, J. M. CHORTLES: a method for representing oligomeric and template-based mixtures. J. Chem. Inf. Comput. Sci., 1995, 35, 1026-1033.
  15. Turner, D. B.; Tyrell, S. M.; Willett, P. Rapid quantification of molecular diversity for selective database acquisition. J. Chem. Inf. Comput. Sci., 1997, 37, 18-22.