Virtual Chemical Stores
Talks at MUG meetings over the last year (MUG99, EMUG98) have laid the foundations for a project at Daylight to build a virtual chemical stores/stockroom (VCS).
The argument that chemical vendors' stores are simply logical extensions of a chemical user's compound store is a very powerful one. However, tools do not exist to allow the end user to make one simple chemistry-based query across a general set of inventories.
Chemistry-based queries fall into 5 classes
We therefore need to know how a molecule is represented in the database in order that the query may be represented appropriately. We also need to know how the molecule is parameterised in the database to be able to evaluate any similarity metric. With the proliferation of vendors creating their own proprietary, undocumented database formats with no public API, large amounts of chemical information are not accessible to this single-query approach.
Various scenarios can be envisaged to try and solve this problem.
Both of these still require that you know what to do with the answers. For instance, suppose you did a substructure search in a Daylight database and wanted to merge the answers with an MDLI database. Even though matching substructures exist in the MDLI database, the tools to identify them do not. In a multiple-database search the empty result may be misinterpreted as meaning that there are no matching substructures.
Again, Tanimoto-based similarity searching across the same pair of databases would lead to odd results because the descriptors used are different.
Note that in the first case the results are erroneous, whereas in the second case they are just different.
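The effect of differing descriptors on a Tanimoto score can be illustrated with a minimal sketch. It assumes fingerprints are held as sets of on-bit indices; the two "schemes" and their bit values are invented for illustration and do not correspond to any real Daylight or MDLI descriptor.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient for fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints (an assumption)
    return len(a & b) / union

# Hypothetical fingerprints for the SAME two molecules under two descriptor schemes.
scheme_a = {"mol1": frozenset({1, 2, 3, 4}), "mol2": frozenset({3, 4, 5, 6})}
scheme_b = {"mol1": frozenset({10, 11, 12}), "mol2": frozenset({10, 11, 13})}

sim_a = tanimoto(scheme_a["mol1"], scheme_a["mol2"])  # 2 shared bits out of 6
sim_b = tanimoto(scheme_b["mol1"], scheme_b["mol2"])  # 2 shared bits out of 4
```

Both values are valid within their own scheme, but a single similarity cut-off applied to a merged hit list would treat the same pair of molecules differently depending on which database supplied the score.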
Glaxo Wellcome, along with other companies, decided to explore the second approach of converting suppliers' databases to a standard format. The background to this has been described earlier; however, some points are worth reiterating.
There is no right or wrong way to do the latter functions, other than the constraints of valency. These are usually referred to as business rules and are decoupled from the canonicalisation process.
After the MUG presentation several of the companies represented there asked DAYLIGHT to explore building a large database of commercially available compounds, which could be made available for public browsing at www.daylight.com but could also be licensed in-house by those who so wished. It would require the vendors to be willing to supply their data to DAYLIGHT, which was not a potential customer. Indeed, DAYLIGHT could have been seen as a rival, as many of these vendors have their own web sites offering compounds. DAYLIGHT would need to stress to the compound vendors that this was extra advertising, over and above that available on the vendors' own web sites. In addition, by making the system installable within a customer company's environment, if required, scientists would have access to the vendors' wares on an equal footing with all the other sources of compounds, internal and external. From the customer company's point of view, mechanisms could be put in place to both control and facilitate compound ordering and payment.
Glaxo Wellcome would supply a library of functions and applications to build a prototype. Documentation of the library and the definition of the VCS tree format are provided separately. On the advice of several of the MUGers, the parent perception was altered to incorporate a salts database. Currently, if a mixture cannot be reduced to exactly one component by removing salts, solvents and duplicates, all the components are registered individually into the database along with the mixture. This is currently hard-coded but is strictly a "business rule" and needs to be decoupled.
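The mixture rule described above can be sketched roughly as follows. This is not the Glaxo Wellcome library: the function name, the contents of the salts/solvents set and the returned dictionary are all hypothetical, and the sketch assumes that each dot-separated component is already a canonical SMILES so that plain string comparison is enough to spot salts and duplicates. It also assumes that "the mixture" registered alongside the components is the original input string.

```python
# Hypothetical stand-in for the salts/solvents database.
SALTS_AND_SOLVENTS = frozenset({"[Na+]", "[Cl-]", "Cl", "O"})

def perceive_parent(smiles: str, strip=SALTS_AND_SOLVENTS) -> dict:
    """Reduce a dot-disconnected SMILES and decide what to register."""
    kept = []
    for component in smiles.split("."):
        # Drop known salts/solvents and components already seen.
        if component in strip or component in kept:
            continue
        kept.append(component)
    if len(kept) == 1:
        # Exactly one component survives: it is the parent.
        return {"parent": kept[0], "register": [kept[0]]}
    # Otherwise register every surviving component plus the mixture itself
    # (assumed here to be the original input string).
    return {"parent": None, "register": kept + [smiles]}

single = perceive_parent("CCN.Cl")
mixture = perceive_parent("CCO.CCN.[Na+].[Cl-]")
```

Passing the strip set as a parameter is one simple way the hard-coded behaviour could be turned into a replaceable business rule.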
DAYLIGHT, for their part, employed Rose Liang as a summer intern to build the database, working with the DAYLIGHT California office to get the data files from the vendors. The prototype system is accessible for this meeting. We would be interested in feedback on the data models and on any scenarios for its use not covered by the Santa Fe group.
Use of a VCS
There seemed to be a spectrum of modes of use, which any system would need to respond to.
The first extreme requires the data to be current. The company has rapid ordering and acquisition processes, and the biological screens or chemistries vary only on a long time cycle. A simple triage system based on 'thor' look-up removes all compounds that are already known to the system, leaving a reduced number to be refined or simply bought. There is no requirement for the information on the rejected compounds to be stored. The only role the VCS fulfils is to channel the information rapidly through a checking procedure in a constant format, which ensures nothing is missed. It may be regarded as a 'drift net' model.
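The 'drift net' triage amounts to an exact-identifier look-up followed by a filter. A minimal sketch, in which a Python set stands in for the thor look-up and the function name is hypothetical, and which assumes incoming structures are already canonical SMILES:

```python
def triage(incoming, known):
    """Return incoming identifiers not already known, preserving their order."""
    seen = set(known)  # stand-in for a thor look-up on canonical SMILES
    new = []
    for smiles in incoming:
        if smiles not in seen:
            seen.add(smiles)  # also suppresses repeats within the same delivery
            new.append(smiles)
    return new

to_review = triage(["CCO", "CCN", "c1ccccc1", "CCN"], known={"CCO"})
```

Nothing about the rejected compounds is retained, which matches the requirement that this mode only channels data through a check.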
The second extreme requires that the data be maintained in a current state. Selections may be made weeks, months or years after the data was first made available. Sufficient recovery of compounds is all that matters; there is no requirement that the selection be totally comprehensive. The fact that a compound was once available from a supplier may be sufficient for pursuit, if it were highly desirable. Maintaining the fishing metaphor, this is an 'angling' model. When new screens and new chemistries come along, the database needs to be 'trawled'. In the current DAYLIGHT model these are 'merlin' search processes.
Progress towards a VCS
Currency of databases requires that much more active processes be put in place to maintain that state than are traditionally applied to small-molecule databases. Mechanisms of the type used in bioinformatics data collection should be investigated. One of the major complaints of the group at the Santa Fe meeting was that, by the time an integrated, multi-vendor database arrived on their desktop, as many as 50% of the desired compounds were no longer available.
So far, work has concentrated on cutting down the latency period between the data arriving and being incorporated into the 'warehouse'. Scripts have been built to convert and load the data and datatypes databases. The database remains live during this period. Within GW there is a mechanism for these structures to be merlin searchable within a few seconds of being loaded; this needs to be implemented at DAYLIGHT and incorporated into the (VDB(?)) toolkit. Collaboration with vendors should allow the exploration of VNFS type solutions.
The maintenance of the database has been described. The thor model ensures the correct relationship is kept between identifiers and data. The VCS tree format ensures that the identifiers are appropriate and have the correct data associated with them, and the indirect identifiers for the suppliers maintain the link to the original supply source.
Whilst the initial driving force for this project was compound acquisition for biological screening, it is clear that there are much wider possibilities. Storing compounds in general is clearly the first step, but there is no reason why reactions, both real and virtual, cannot also be stored. Agents and catalysts could be stored by what they do, rather than what they are. The thesaurus mechanism would then bring [Na+].[BH4-] and [Al](OC(C)C)(OC(C)C)OC(C)C.OC(C)C together on a page rooted in *C(=O)*>>*C(O)*.
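The idea of indexing agents by what they do can be pictured as a mapping from a transformation to the reagents that perform it. The sketch below is only an illustration in plain Python, not the thor thesaurus mechanism itself; the function and variable names are hypothetical, and the reagent and transformation strings are the ones quoted above.

```python
from collections import defaultdict

# (reagent, transformation) pairs taken from the example in the text.
REAGENT_ROLES = [
    ("[Na+].[BH4-]", "*C(=O)*>>*C(O)*"),
    ("[Al](OC(C)C)(OC(C)C)OC(C)C.OC(C)C", "*C(=O)*>>*C(O)*"),
]

def build_pages(pairs):
    """Group reagents under the transformation they carry out."""
    pages = defaultdict(list)
    for reagent, transform in pairs:
        if reagent not in pages[transform]:  # avoid listing a reagent twice
            pages[transform].append(reagent)
    return dict(pages)

pages = build_pages(REAGENT_ROLES)
```

A look-up on the carbonyl reduction pattern then returns both reagents, regardless of how different their own structures are.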
It is necessary that the query client understands the conventions used in the data model. It is also a requirement that there is seamless querying across in-house databases too. This requires both full "business rule" decoupling, and code to impose local rules on a rebuilt database.
Canonical tautomers are still a fraught issue but will be dealt with later in the meeting.
If DAYLIGHT is going to serve out a copy of this database, more suppliers need to be convinced to join in. Vendors are still more willing to supply their catalogues directly to their customers.
The advent of virtual databases will still require that the databases of any vendors without DAYLIGHT engines are converted to the VCS model. The next talk will explore these possibilities.
Daylight Chemical Information Systems, Inc.