EMUG '99: John Bradshaw- Virtual Chemical Stores

EMUG '99 -- 28 October 1999

Virtual Chemical Stores

Rose Liang
Daylight CIS

Introduction

Talks at MUG meetings over the last year (MUG99, EMUG98) have laid the foundations for a project at Daylight to build a virtual chemical stores/stockroom (VCS).

The argument that chemical vendors' stores are simply logical extensions of a chemical user's compound store is a very powerful one. However, tools do not exist to allow the end user to make one simple chemistry-based query across a general set of inventories.

Chemistry-based queries fall into five classes, each with its own type of answer:

Is my structure the same as a molecule in the database? Returns: Boolean

Is my structure part of a molecule in the database? Returns: Sequence of Molecules

Is a molecule in my database part of my structure? Returns: Sequence of Molecules

Is my structure like a molecule in the database? Returns: Sequence of (Molecule, Metric) pairs

Is a molecule in my database like my structure? Returns: Sequence of (Molecule, Metric) pairs

We therefore need to know how a molecule is represented in the database in order that the query may be represented appropriately. We also need to know how the molecule is parameterised in the database to be able to evaluate any similarity metric. With the proliferation of vendors creating their own proprietary, undocumented database formats with no public API, large amounts of chemical information are not accessible to this single query approach.
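
The five query classes and the shape of their answers can be sketched as follows. This is only an illustration under loud assumptions: a "molecule" here is a set of fragment labels rather than a real structure, subset tests stand in for true substructure matching, and all function names are hypothetical. None of this is the Daylight toolkit API.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Toy stand-in (assumption): a molecule is a set of fragment labels.
Molecule = FrozenSet[str]
Metric = Callable[[Molecule, Molecule], float]


def is_same(query: Molecule, db: Iterable[Molecule]) -> bool:
    """Class 1: is my structure the same as a molecule in the database?"""
    return any(query == mol for mol in db)


def containing(query: Molecule, db: Iterable[Molecule]) -> List[Molecule]:
    """Class 2: database molecules of which my structure is a part."""
    return [mol for mol in db if query <= mol]


def contained_in(query: Molecule, db: Iterable[Molecule]) -> List[Molecule]:
    """Class 3: database molecules that are part of my structure."""
    return [mol for mol in db if mol <= query]


def query_like_db(query: Molecule, db: Iterable[Molecule],
                  metric: Metric, cutoff: float) -> List[Tuple[Molecule, float]]:
    """Class 4: score each pair as metric(query, molecule)."""
    scored = [(mol, metric(query, mol)) for mol in db]
    return [(mol, s) for mol, s in scored if s >= cutoff]


def db_like_query(query: Molecule, db: Iterable[Molecule],
                  metric: Metric, cutoff: float) -> List[Tuple[Molecule, float]]:
    """Class 5: score each pair as metric(molecule, query)."""
    scored = [(mol, metric(mol, query)) for mol in db]
    return [(mol, s) for mol, s in scored if s >= cutoff]


def coverage(x: Molecule, y: Molecule) -> float:
    """Hypothetical asymmetric metric: fraction of x's labels found in y."""
    return len(x & y) / len(x) if x else 0.0


db = [frozenset({"a", "b"}), frozenset({"a", "b", "c"}), frozenset({"a"})]
query = frozenset({"a", "b"})
```

Classes 4 and 5 only give different answers when the metric is asymmetric, which is why the sketch passes the metric in rather than fixing one.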

Various scenarios can be envisaged to try and solve this problem.

Leave the database alone

Keep the databases separate and send generic queries

Requires that the vendors have documented APIs, or that the database can be fooled into thinking it is getting a local query.

Requires that the query can be genericized without changing its sense.

Keep the databases separate and send multiple specific queries

Requires that the vendors have documented APIs, or that the database can be fooled into thinking it is getting a local query.

Requires that the query can be converted without changing its sense.

Both of these still require that you know what to do with the answers. For instance, suppose you did a substructure search in a Daylight database and wanted to merge the answers with those from an MDLI database. Even though the substructures exist in the MDLI database, the tools to identify them do not. In a multiple-database search the result may be misinterpreted as there being no matching substructures.
Again, Tanimoto-based similarity searching across the same pair of databases would lead to odd results because the descriptors used are different.
Note that in the first case the results are erroneous, whereas in the second case they are just different.
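
The second point can be illustrated with a small sketch. The Tanimoto coefficient itself is standard (shared features divided by total distinct features), but the two "descriptors" below are deliberately crude toys of our own, built from the characters of a SMILES string. They are not the fingerprints or keys used by either vendor.

```python
from typing import Callable, Set


def tanimoto(a: Set[str], b: Set[str]) -> float:
    """Tanimoto coefficient: shared features over total distinct features."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty feature sets
    return len(a & b) / len(union)


def single_chars(smiles: str) -> Set[str]:
    """Toy descriptor A (hypothetical): the distinct characters in the string."""
    return set(smiles)


def char_pairs(smiles: str) -> Set[str]:
    """Toy descriptor B (hypothetical): the distinct adjacent character pairs."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def score(x: str, y: str, descriptor: Callable[[str], Set[str]]) -> float:
    """Tanimoto similarity of two strings under the chosen descriptor."""
    return tanimoto(descriptor(x), descriptor(y))


# The same pair of structures scored under the two descriptor schemes.
score_a = score("CCOC", "CCNC", single_chars)  # 1 shared feature out of 3
score_b = score("CCOC", "CCNC", char_pairs)    # 1 shared feature out of 5
```

Neither score is wrong, but a cutoff chosen for one scheme means something different under the other, so hit lists from the two sources cannot simply be merged and ranked together.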

Preconvert the databases to the same format.

Convert into single large database

Requires that you have conversion tools

Requires that you can work within operating system limits, e.g. 2 Gbyte files

Make it appear to the user that they have a single large database (VDB)

Glaxo Wellcome, along with other companies, decided to explore the second approach of converting suppliers' databases to a standard format. The background to this has been described earlier; however, some points are worth reiterating.

Early work on WLN had shown the power of building a parent database. In general, some components of a molecule or mixture are more important than others. In this world, at least, these important pieces are called parents. For example, the HCl in hydrochloride salts can usually be ignored. Occasionally someone may want to ask questions like "How many commercially available drugs are hydrochloride salts?". The advantage of associating ranitidine hydrochloride with other ranitidine salts usually outweighs any advantage of associating ranitidine hydrochloride with sal ammoniac (smelling salts). CAOCI made it difficult to ask/answer the "hydrochloride salts" question, although permuted lists were very powerful. In the transformation to ACD, the concept of parent disappeared, making the usual question more difficult to ask/answer.

The DAYLIGHT thesaurus model seemed an ideal environment into which to merge these disparate sources. It required only that there was some sort of normalisation process of the SMILES over and above that available by creating the canonical SMILES. This covers concepts like:

How to represent oxides of nitrogen, azides, diazoketones etc.

How to generate a canonical tautomer which itself could be normalised.

How to generate a parent structure.

There is no right or wrong way to do the latter functions, other than the constraints of valency. These are usually referred to as business rules and are decoupled from the canonicalisation process.

After the MUG presentation, several of the companies represented there asked DAYLIGHT to explore building a large database of commercially available compounds, which could be made available for public browsing at www.daylight.com but could also be licensed in-house by those who so wished. This would require the vendors to be willing to supply their data to DAYLIGHT, which was not a potential customer; indeed, it could have been seen as a rival, as many of these vendors have their own web sites offering compounds. DAYLIGHT would need to stress to the compound vendors that this was extra advertising, over and above that available on the vendors' Web sites. In addition, by making the system installable within a customer company environment, if required, scientists would have access to the vendors' wares on an equal footing with all the other sources of compounds, internal and external. From the customer company's point of view, mechanisms could be put in place to both control and facilitate compound ordering and payment.

Glaxo Wellcome would supply a library of functions and applications to build a prototype. The library is documented, and the VCS tree format defined, in separate documents. On the advice of several of the MUGers, the parent perception was altered to incorporate a salts database. Currently, if a mixture cannot be reduced to exactly one component by removing salts, solvents and duplicates, all the components are registered individually into the database along with the mixture. This is currently hard-coded but is strictly a "business rule" and needs to be decoupled.
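
A minimal sketch of this parent perception rule is given below. It is not the Glaxo Wellcome library code. It assumes that components arrive as canonical SMILES separated by dots, that no ring-closure bond spans a dot, and that "all the components" means every distinct component of the original mixture. The salts table and the function name are hypothetical.

```python
from typing import FrozenSet, List

# Hypothetical salts/solvents table; a real one would be far larger and
# would hold canonical SMILES produced by the same normalisation process.
SALTS_AND_SOLVENTS: FrozenSet[str] = frozenset({"Cl", "[Na+]", "[Cl-]", "O"})


def entries_to_register(mixture: str,
                        ignorable: FrozenSet[str] = SALTS_AND_SOLVENTS) -> List[str]:
    """Return the SMILES strings to register for one incoming mixture."""
    # Split into components and drop duplicates, keeping first-seen order.
    distinct = list(dict.fromkeys(mixture.split(".")))
    # Remove anything listed as a salt or solvent.
    remaining = [c for c in distinct if c not in ignorable]
    if len(remaining) == 1:
        return remaining  # exactly one component left: this is the parent
    # Otherwise register every distinct component plus the mixture itself.
    return list(dict.fromkeys(distinct + [mixture]))
```

Passing the salts table in as a parameter is one small step towards the decoupling mentioned above: the same code can then be run with a different table for a different set of local rules.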

DAYLIGHT, for its part, employed Rose Liang as a summer intern to build the database, working with the DAYLIGHT California office to get the data files from the vendors. The prototype system is accessible for this meeting. We would be interested in feedback as to the data models and any scenarios for its use not covered by the Santa Fe group.

Use of a VCS

There seemed to be a spectrum of modes of use, which any system would need to respond to.

At one extreme were users who urgently needed novel compounds for their screening or synthesis exercise, which may or may not have structural requirements.

E.g. find all novel basic amines with molecular weight less than 300.

At the other extreme were users who had highly specific requirements and wanted to put together the best possible set and had the luxury of time to build the set.

E.g. find the most diverse set of 800 adenines with non-natural sugar isosteres put into 10 plates for a screen we will run in 6 months' time.

The first extreme requires the data to be current. The company has rapid ordering and acquisition processes and the biological screens or chemistries vary only on a long time cycle. A simple 'thor' look-up based triage system removes all compounds that are already known to the system leaving a reduced number to be refined or simply bought. There is no requirement for the information on the rejected compounds to be stored. The only role the VCS fulfils is to rapidly channel the information through a checking procedure in a constant format which will ensure nothing is missed. It may be regarded as a 'drift net' model.
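
The 'drift net' triage can be sketched as a simple streaming filter. This is an illustration only: an in-memory set of identifiers stands in for the 'thor' look-up, the identifiers are assumed to be already normalised, and the function name is hypothetical.

```python
from typing import Iterable, Iterator, Set


def triage(incoming: Iterable[str], known: Set[str]) -> Iterator[str]:
    """Yield only the compounds that are not already known to the system."""
    for identifier in incoming:
        if identifier in known:
            continue  # rejected compounds are dropped; nothing is stored
        yield identifier


known_compounds = {"CCO", "c1ccccc1"}          # stand-in for the look-up
vendor_batch = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
novel = list(triage(vendor_batch, known_compounds))
```

Because the filter is a generator, a vendor file can be streamed through it without the whole batch, or any record of the rejects, being held.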

The second extreme requires that the data be maintained in a current state. Selections may be made weeks, months or years after the data was first made available. Sufficient recovery of compounds is all that matters; there is no requirement that the selection be totally comprehensive. The fact that a compound was once available from a supplier may be sufficient for pursuit, if it were highly desirable. Maintaining the fishing metaphor, this is an 'angling' model. When new screens and new chemistries come along, the database needs to be 'trawled'. In the current DAYLIGHT model these are 'merlin' search processes.

Progress towards a VCS

Keeping databases current requires that much more active processes be put in place than are traditionally applied to small molecule databases. Mechanisms of the type used in bioinformatics data collection should be investigated. One of the major complaints of the group at the Santa Fe meeting was that, by the time an integrated, multi-vendor database arrived on their desktop, as many as 50% of the desired compounds were no longer available.

So far, work has concentrated on cutting down the latency period between the data arriving and being incorporated into the 'warehouse'. Scripts have been built to convert and load the data and datatypes databases. The database remains live during this period. Within GW there is a mechanism for these structures to be merlin searchable within a few seconds of being loaded; this needs to be implemented at DAYLIGHT and incorporated into the (VDB(?)) toolkit. Collaboration with vendors should allow the exploration of VNFS type solutions.

The maintenance of the database has been described. The thor model ensures that the correct relationship is kept between identifiers and data. The VCS tree format ensures that the identifiers are appropriate and have the correct data associated with them, and the indirect identifiers for the suppliers maintain the link to the original supply source.

The Future

Whilst the initial driving force for this project was compound acquisition for biological screening, it is clear that there are much wider possibilities. Storing compounds in general is clearly the first step, but there is no reason why reactions, both real and virtual, cannot also be stored. Agents and catalysts could be stored by what they do, rather than what they are. The thesaurus mechanism would then bring [Na+].[BH4-] and [Al](OC(C)C)(OC(C)C)OC(C)C.OC(C)C together on a page rooted in *C(=O)*>>*C(O)*.
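
The idea of filing agents by what they do can be sketched with a plain dictionary keyed on the generic transformation. This is only a toy picture of the grouping; it is not how a thesaurus database is actually implemented, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List

# Each "page" is rooted in a generic transformation and lists the agents
# known to carry it out.
pages: DefaultDict[str, List[str]] = defaultdict(list)


def file_agent(transformation: str, agent: str) -> None:
    """File an agent on the page rooted in the transformation it performs."""
    if agent not in pages[transformation]:
        pages[transformation].append(agent)


carbonyl_reduction = "*C(=O)*>>*C(O)*"
file_agent(carbonyl_reduction, "[Na+].[BH4-]")
file_agent(carbonyl_reduction, "[Al](OC(C)C)(OC(C)C)OC(C)C.OC(C)C")
```

A look-up on the transformation then returns both reagents, even though their structures have nothing in common.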

It is necessary that the query client understands the conventions used in the data model. It is also a requirement that there is seamless querying across in-house databases too. This requires both full "business rule" decoupling, and code to impose local rules on a rebuilt database.

Canonical tautomers are still a fraught issue but will be dealt with later in the meeting.

If DAYLIGHT is going to serve out a copy of this database, more suppliers need to be convinced to join in. Vendors are still more willing to supply their catalogues directly to their customers.

The advent of virtual databases will still require that the databases of any vendors without DAYLIGHT engines are converted to the VCS model. The next talk will explore these possibilities.

Daylight Chemical Information Systems, Inc.
support@daylight.com

John Bradshaw.