EuroMUG '01: VCS, John Bradshaw

Euromug01 13th-14th September 2001, Cambridge UK

VCS - Virtual Chemical Store, the current position

John Bradshaw
Daylight CIS Inc., Sheraton House, Castle Park, Cambridge, CB3 0AX, UK

Introduction

Following a talk at the MUG '99 meeting, at which the concept of a Virtual Chemical Store/Stockroom was introduced, Daylight agreed to explore the possibility of producing a commercial product that would make the information present in electronic compound catalogues available to end-users in a consistent format. The data model drew strongly on the Thor datatree structure and the compound hierarchy used in the Modgraph Compound Registry system. The original concept was to build a single large Thor database; however, the data model allows alternatives such as Virtual Databases, in which the constituent databases are not merged, or the Oracle cartridge, which retains the compound hierarchy by judicious use of tables. A further option is to merge, in a client, the results of a query run across multiple consistent databases.

The major sources of data are still the compound vendors, in particular those providing compounds for screening. Data are generally provided in the form of MDL sd files. However, as the concept is not limited to commercially available compounds, other sources, such as the World Drug Index or the NCI database, would benefit from this approach.

A decision has been made to provide a product which will allow customers to build their own database by providing tools which will produce either Thor datatrees for loading into a Thor-Merlin system or SQL*Plus scripts to produce an Oracle database. There is also an option to produce a flat source, which allows users to incorporate the data into an existing system or their own proprietary system. The earlier model has been improved so that there is now a predefined set of datatypes (Thor) or columns and tables (Oracle) into which any new data source can be added. We can therefore provide the datatypes database (Thor) or the schema (Oracle). In addition, there is no longer a need for an external Thor database when building the input. It is hoped this product will be released in version 4.8.

Chemical Suppliers' Data Model

The aim of most chemical suppliers' information handling is to produce a printed catalogue. So the fundamental data unit in this model is the catalogue entry from a particular supplier, which will be identified by some arbitrary string such as 0800, "BAS 00123" etc. Associated with this identifier will be items such as a chemical structure, a chemical name, and possibly other data. The suppliers will also use this identifier for their inventory and supply systems.

It is these records which form the basis of the MDL sd file which suppliers distribute to potential customers, usually on CD.

[Image: Asinex CD]

For example, if we take a sample page of the Tocris catalogue, we find

0774 identifying a sample with a name 4-[3-(Benzotriazol-1-yl)propyl]-1-(2-methoxyphenyl)-piperazine maleate with a molecular weight of 467.52. We also learn that this compound is "A potent pre- and postsynaptic 5-HT 1A receptor antagonist." and are given a reference to this property, Mokrosz et al (1994) Structure-activity relationship studies of central nervous system agents. 13. 4-[3-(Benzotriazol-1-yl)propyl]-1-(2-methoxy)piperazine, a new putative 5-HT 1A receptor antagonist, and its analogues. J.Med.Chem. 37, 2754. In addition there is a chemical structure corresponding to the systematic name.

Note that the compilers of the catalogue imply that the property of "A potent pre- and postsynaptic 5-HT 1A receptor antagonist." belongs to the sample they offer for sale. It is unlikely they have tested the particular sample they send out to you. The authors of the paper ascribe the property to what we will be referring to as the parent molecule; the maleic acid plays no part in this. It is implied that the property of activity at 5-HT 1A receptors is inherited by the version molecule and will be exhibited by any subsequent sample you buy of this version. On the other hand, the molecular weight of 467.52 is data about the version molecule and will also be exhibited by any subsequent sample you buy of this version. As the data are so confused in the supply source, it is unlikely that any automatic system will unravel them. They need to be dealt with, without prejudice.

The point to remember is that these identifiers are the link to the only items which have a real existence; everything else in this model is information and data. That is, they are the names which the customer uses to communicate with the supplier when placing an order for samples of chemicals such as these.

[Image: Asinex chemicals]

VCS Compound Hierarchy and Associated Data

In general, not all constituents of a multi-component compound are equally important for the process in hand. If we restrict ourselves to some sort of biological assay or property measure such as logD or pK_a, then these are easily understood as reflecting the presence of a parent molecule. Data such as α_D reflect the presence of an isomeric parent molecule. Note that neither of these parent molecules has an independent existence as a physical entity; in the Daylight sense they are identifiers or names, about which data exist.

Data such as melting point are clearly about a particular sample, as are the experimental results of both physicochemical and biological assays. A version molecule structure may also have been assigned to this sample, perhaps a salt or solvate or even an impure isomer.

Calculated data such as cLogP are about the parent molecule, whereas a Rubicon structure is a datum about the isomeric parent. An x-ray structure, on the other hand, is a datum about the sample and is related to the version structure. Molecular formulae or weights can refer to version or parent and thus must be clearly defined. Note that the sample is identified by some arbitrary name, in the current case by the supplier. The version structure is a datum about the sample identifier; it does not differ in principle from any other data assigned to the (hopefully) white crystalline powder.

As we are interested in a chemical information system we can use this hierarchy of structures to group the data together in a dendrite model. The sample identifiers can be grouped by the isomeric version structure. As there are many suppliers of a particular compound, there is a many to one relationship between sample identifier and isomeric version structure. As the valence bond model for representing a compound is not unambiguous, the isomeric version structures can be grouped by normalized isomeric version structures. If, as implied above, not all the components of a compound are equally important, we can group the normalized isomeric version structures by the isomeric parent structure. Above this there is a many to one relationship between these isomeric parent structures and a parent structure which contains no stereochemical or isomeric information.
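The chain of many to one relationships can be sketched in a few lines of Python. This is only an illustration and not the VCS implementation: the level names, the supplier identifiers and the hand-written SMILES strings are all assumptions made for the example.

```python
from collections import defaultdict

# Levels of the hierarchy, from the physical sample up to the parent.
# The names are hypothetical labels chosen for this sketch.
LEVELS = [
    "sample_id",
    "isomeric_version",
    "normalized_isomeric_version",
    "isomeric_parent",
    "parent",
]


def build_hierarchy(records):
    """Group each level by the level above it.

    Returns {level: {key_at_that_level: set(children one level down)}}.
    """
    tree = {level: defaultdict(set) for level in LEVELS[1:]}
    for rec in records:
        for child, parent in zip(LEVELS, LEVELS[1:]):
            tree[parent][rec[parent]].add(rec[child])
    return tree


# Hand-written, illustrative records: two suppliers sell the same salt
# drawn in different ways, and one sells the other enantiomer as free base.
RECORDS = [
    {"sample_id": "SUPPLIER_A 0001",
     "isomeric_version": "C[C@H](N)C(=O)O.Cl",
     "normalized_isomeric_version": "C[C@H](N)C(=O)O.Cl",
     "isomeric_parent": "C[C@H](N)C(=O)O",
     "parent": "CC(N)C(=O)O"},
    {"sample_id": "SUPPLIER_B X-17",
     "isomeric_version": "C[C@H]([NH3+])C(=O)O.[Cl-]",
     "normalized_isomeric_version": "C[C@H](N)C(=O)O.Cl",
     "isomeric_parent": "C[C@H](N)C(=O)O",
     "parent": "CC(N)C(=O)O"},
    {"sample_id": "SUPPLIER_B X-18",
     "isomeric_version": "C[C@@H](N)C(=O)O",
     "normalized_isomeric_version": "C[C@@H](N)C(=O)O",
     "isomeric_parent": "C[C@@H](N)C(=O)O",
     "parent": "CC(N)C(=O)O"},
]

tree = build_hierarchy(RECORDS)
# Both isomeric parents hang off the single stereo-free parent.
print(sorted(tree["parent"]["CC(N)C(=O)O"]))
```

Each record names its key at every level, so data can be attached to exactly the level they describe.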

It is also important to note that whilst the structure assigned to the sample may be the same as either the parent or isomeric parent, that is purely coincidental. This strict hierarchy ensures that the data are associated with the correct identifier in the tree, or key. This can be illustrated using the depict algorithm.

[Image: Daylight depiction]

The above picture is generated from the parent SMILES using smi2gif(), much as cLogP would be calculated using clogp(), whereas the following picture

[Image: depiction using 2D coordinates]

is generated from the version SMILES and the associated 2D coordinates, still using smi2gif().

A benefit of this model is that data which come from a flat file, such as an sd file, can be constrained within a Thor subtree, or in a row in an Oracle table, associating the supplier's data only with the supplier's identifier, not with a particular structure. In fact it is necessary to use two tables in Oracle as, a priori, we have no knowledge of the number of data items about a particular sample (vide infra).
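The need for a second table can be seen from a small sketch. A flat record with an unknown number of fields is split into one row that identifies the sample and one (identifier, name, value) row per data item. The field tags CAT_NO, NAME and MW are hypothetical, since tag names vary between suppliers; the values are those of the Tocris entry quoted earlier.

```python
def split_record(record, id_field):
    """Split one flat supplier record into a sample row and data rows.

    `record` maps field names to values, as might be read from an sd file.
    Every data row is keyed only by the supplier's identifier, never by a
    structure, so the number of data items per sample is unconstrained.
    """
    sample_id = record[id_field]
    sample_row = {"vcs_id": sample_id}
    data_rows = [
        (sample_id, name, value)
        for name, value in record.items()
        if name != id_field
    ]
    return sample_row, data_rows


# Hypothetical field tags; values taken from the catalogue entry above.
record = {
    "CAT_NO": "0774",
    "NAME": "4-[3-(Benzotriazol-1-yl)propyl]-1-(2-methoxyphenyl)-piperazine maleate",
    "MW": "467.52",
}

sample_row, data_rows = split_record(record, "CAT_NO")
print(sample_row)
for row in data_rows:
    print(row)
```

A supplier who adds a new field simply produces one more data row; neither table needs a new column.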

A further benefit of this model is that we now have total control over the relationships between structure and data and can make sure only appropriate values are stored.

Properties of parents

A set of routines has been written to calculate useful properties of molecules directly from the structures. These values can then be stored along with the structure from which they were derived. These functions are available via a program_object interface which allows them to be called from within DayCart®. The ones which are not grayed out are included in the VCS building routines, calculated from the parent SMILES. All values are returned as part_tuples except for PART_COUNT, which is an integer. If the SQL output is chosen, numeric part_tuples are set to NULL.

AVERAGE_MOL_WT: Reads SMILES, calculates molecular weight based on average atomic weights for naturally occurring elements.
MOL_FORM: Reads SMILES, calculates molecular formula in Hill order. Charges are ignored.
ROT_BONDS: Reads SMILES, calculates count of rotatable bonds. A bond from a secondary nitrogen to a trigonal carbon is treated as fixed, i.e.
[!$(*#*)&!D1]-&!@[!$(*#*)&!D1] rotates,
[N&H1&D2]-&!@[#6&X3] does not.
H_DONOR: Reads SMILES, calculates count of hydrogen bond donors.
Donors are [!#6;!H0] .
Note that the number of donatable H's is counted, not the heavy atoms to which they are attached.
H_ACCEPTOR: Reads SMILES, calculates count of hydrogen bond acceptors.
Acceptors are [$([!#6;+0]);!$([F,Cl,Br,I]);!$([o,s,nX3]);!$([Nv5,Pv5,Sv4,Sv6])].
Note that the number of acceptor sites is counted, not the heavy atoms.
PARACHOR: Reads SMILES, calculates parachor using McGowan's method, only works for C H N O S P F Cl Br I.
CLOGP: Not implemented in this version.
LOGP_STAR: Not implemented in this version.
CMR: Not implemented in this version.
ACCURATE_MASS: Reads SMILES, calculates molecular weight based on the accurate mass (IUPAC 1989) for most abundant isotope of each element.
These values will be replaced with IUPAC 1999 values in the release version.
MOLAR_VOLUME: Reads SMILES, calculates the average molar volume based on Schroeder's method, only works for C H N O S F Cl Br I.
RING_COUNT: Reads SMILES, calculates the count of rings in the SSSR, i.e. the smallest set of smallest rings.
RIGIDITY: Reads SMILES, calculates the Tanimoto similarity between the input molecule and the set of molecules formed by removing the rotatable bonds. A value of 1.00 implies the structure is totally rigid. Highly flexible molecules have much lower values, nearer zero.
FRAG_COUNT: Reads SMILES, calculates the count of the number of fragments formed by removal of the isolating carbons from the structure.
Isolating carbons are [$([#6]);!$(C(F)(F)F);!$(c(:[!c]):[!c]);!$([#6]=,#[!#6]);!$([#6;!+0])].
FLEXIBILITY: Reads SMILES, calculates the flexibility of the structure from the ratio of the count of rotatable bonds, to the total count of bonds.
Totally flexible structures have a value near 1.0.
Rigid structures return a value of zero.
FINGERPRINT: Reads SMILES, returns the lexical string version of the DAYLIGHT fingerprint. Default parameters are size 2048, min/max path 0/7.
DEPICTION: Reads SMILES, returns a set of pairs of coordinates, in input SMILES order. Only coordinates for stereo significant hydrogens are returned.
STEREO_COUNT: Reads SMILES, returns the number of potential stereo centres. These are necessary but not sufficient conditions for stereo.
Atom stereo [$([X4&!v6&!v5;H0,H1]),$([SX3]([#6])([#6])~O)]
Bond stereo [CX3;!H2]=[CX3;!H2]
Allene stereo [CX3;H0]=C=[CX3;H0,H1]
PART_COUNT: Reads SMILES, returns the number of components in a molecule.
POLAR_SURFACE_AREA: Reads SMILES, returns the topological polar surface area, according to the method of P. Ertl, B. Rohde, P. Selzer "Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-based Contributions and Its Application to the Prediction of Drug Transport Properties", J.Med.Chem. (2000), 43, 3714-3717.
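Two of the simpler entries in this list, MOL_FORM and FLEXIBILITY, can be sketched without a chemistry toolkit. The sketch below is not the Daylight code: it assumes the element counts and bond counts have already been derived from the SMILES by some other means, and the function names are hypothetical.

```python
def hill_formula(counts):
    """Molecular formula in Hill order from an element -> count mapping.

    Hill order: if carbon is present, C first, then H, then the remaining
    elements alphabetically; without carbon, all elements alphabetically.
    """
    counts = {el: n for el, n in counts.items() if n > 0}
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(el for el in counts if el not in ("C", "H"))
    else:
        order = sorted(counts)
    return "".join(
        el + (str(counts[el]) if counts[el] > 1 else "") for el in order
    )


def flexibility(rotatable_bonds, total_bonds):
    """Ratio of rotatable bonds to all bonds; zero for a rigid structure.

    Which bonds enter the total (for example, whether bonds to hydrogen
    count) is left to the caller, as the source text does not say.
    """
    if total_bonds == 0:
        return 0.0
    return rotatable_bonds / total_bonds


print(hill_formula({"O": 1, "H": 6, "C": 2}))  # ethanol
print(hill_formula({"H": 1, "Cl": 1}))         # hydrogen chloride
print(flexibility(3, 4))
```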

Properties of versions

The following are included in the VCS building routines, calculated from the version SMILES. All values are returned as a part_tuple. The depiction is only returned if there are no 2D data in the input file.

AVERAGE_MOL_WT: Reads SMILES, calculates molecular weight based on average atomic weights for naturally occurring elements.
MOL_FORM: Reads SMILES, calculates molecular formula in Hill order. Charges are ignored.
FINGERPRINT: Reads SMILES, returns the lexical string version of the DAYLIGHT fingerprint. Default parameters are size 2048, min/max path 0/7.
DEPICTION: Reads SMILES, returns a set of pairs of coordinates, in input SMILES order. Only coordinates for stereo significant hydrogens are returned.

Normalization rules

The structure representation is normalized by a few simple rules. Note that this only affects the grouping of compounds, it does not affect the original representation which is maintained or affect any in-house display business rules.

REMOVAL OF CHARGES: All charges are reduced to zero where possible by addition of hydrogens consistent with valence rules. 1,2 dipolar systems are retained.
ADDITION OF CHARGES: Group I metals always have +1 charge.
BOND FORMATION: Nitros and sulphoxides etc are converted to their uncharged double bond forms.
BOND CLEAVAGE: Group I metals are always ionic. Bonds to the metal are broken to ensure this and charges added. Group II metals only have covalent bonds to carbon as do divalent first row transition metals.

Creating parents

Parent isomeric structures are created from the normalized version isomeric structure by the following steps.

REMOVE DUPLICATES: All duplicate components in the normalized version are removed.
REMOVE SALTS: All components which have an entry in the salts table are removed.

If there are no components left, i.e. all components are in the salts table, there is a roll-back of the last step to give a structure which is treated like a mixture. If more than one component remains, the structure is likewise treated like a mixture.
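These steps can be sketched on dot-separated SMILES strings. The sketch makes two assumptions that are not from the source: every component is already written as a canonical SMILES, so that string equality means structural identity, and each dot separates a disconnected component. The function name and the example salts table are hypothetical.

```python
def make_parent(normalized_version, salts):
    """Return (parent_smiles, treated_as_mixture).

    `normalized_version` is a dot-separated SMILES; `salts` is a set of
    component SMILES standing in for the salts table.
    """
    # REMOVE DUPLICATES, keeping first-seen order.
    components = list(dict.fromkeys(normalized_version.split(".")))
    # REMOVE SALTS.
    kept = [c for c in components if c not in salts]
    if not kept:
        # Everything was a salt: roll back the removal and treat the
        # result like a mixture.
        return ".".join(components), True
    # More than one surviving component is also treated like a mixture.
    return ".".join(kept), len(kept) > 1


SALTS = {"Cl", "OC(=O)C=CC(=O)O"}  # illustrative entries only

print(make_parent("CCN.Cl.Cl", SALTS))
print(make_parent("Cl", SALTS))
print(make_parent("CCO.CCN", SALTS))
```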

Handling mixtures

In the case of a mixture, in the current version, all possible single parents are generated plus the multicomponent parent, i.e. there is a one to many relationship between the child and the parent. In a registration system, the registrar may take on the Solomon role and keep the simple tree structure of the data model. In the absence of other information, all potential parents are treated equally. This means duplication of data in a Thor model and spawns yet another table in Oracle.
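The one to many relationship can be written out as a list of (child, parent) pairs. As before, this is a sketch on dot-separated SMILES strings with a hypothetical function name, not the VCS code.

```python
def candidate_parents(mixture_smiles):
    """List every (child, parent) pair for a structure.

    A single component is its own parent; a mixture yields each single
    component as a parent plus the multicomponent parent itself.
    """
    components = mixture_smiles.split(".")
    if len(components) < 2:
        return [(mixture_smiles, mixture_smiles)]
    parents = components + [mixture_smiles]
    return [(mixture_smiles, parent) for parent in parents]


for child, parent in candidate_parents("CCO.CCN"):
    print(child, "->", parent)
```

A two-component mixture therefore produces three parent links, which is why the relationship needs a table of its own in a relational schema.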

Why such a complex model

Questions have been raised as to why we need such a complex model. What is wrong with simply grouping on matching version valence model, as in ACD? Aside from the value of classification and the fact that most chemical structure searches are at what we have described as the parent level, there are maintenance issues. Below are the fixes to the Tocris 2001 catalogue. In this model, changes are constrained to the subtree in Thor or the row in Oracle, as they are changes to the data about the catalogue number.

[Image: Tocris amendments]

Thor example

As was described in the introduction, this model fits well into the Thor paradigm. There are limitations as to the level of nesting, but a typical tree showing the relationships in the data is shown here. Note that the data about the sample are stored in a sub-tree rooted in the normalized primary supplier identifier. There is a user-controlled list, per catalogue, of secondary identifiers. All data from the supplier are stored in a two-field datatype.

Oracle example

As a rather obtuse example, the following SQL*Plus script will find the parent and version isomeric structures, along with their catalogue number and name and the hydrogen bond donor and acceptor counts, for fairly rigid compounds which are known toxins.

SELECT sample.pism, sample.vism, sample.vcs_name, sample_data.data_value,
       parent.h_donor, parent.h_accept
FROM sample_data, sample, parent
WHERE parent.rigidity > 0.9
  AND sample_data.data_name LIKE '%NAM%'
  AND sample_data.data_value LIKE '%toxin%'
  AND sample.par_id = parent.par_id
  AND sample.vcs_id = sample_data.vcs_id;

Running this query against the tocris2001 database gives

Parent	Version	Catalogue Number	Compound name	H-donors	H-Acceptors
		1128	Picrotoxin (a 1:1 mixture of picrotoxinin and picrotin)	1	12
		1128	Picrotoxin (a 1:1 mixture of picrotoxinin and picrotin)	2	14

This illustrates the handling of mixtures. Note that the predicate parent.rigidity > 0.9 only selects rows which have a non-NULL rigidity value. It is difficult to interpret a rigidity figure for a mixture, so it is set to NULL. As indicated above, properties like PART_COUNT do have meaning and are filled for mixtures.
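The effect of the NULL rigidity can be reproduced with a small in-memory database. The sketch uses Python's sqlite3 module rather than Oracle, and the table layout is an assumption reconstructed from the column names in the query. The catalogue number, compound name and donor/acceptor counts for the two single parents come from the result above; the rigidity values, the counts for the multicomponent parent and the structure strings are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the columns named in the query are known.
conn.executescript("""
CREATE TABLE parent (par_id INTEGER PRIMARY KEY, rigidity REAL,
                     h_donor INTEGER, h_accept INTEGER);
CREATE TABLE sample (vcs_id INTEGER, par_id INTEGER,
                     pism TEXT, vism TEXT, vcs_name TEXT);
CREATE TABLE sample_data (vcs_id INTEGER, data_name TEXT, data_value TEXT);
""")

conn.executemany("INSERT INTO parent VALUES (?, ?, ?, ?)", [
    (1, 0.95, 1, 12),   # single parent; rigidity is a placeholder
    (2, 0.95, 2, 14),   # single parent; rigidity is a placeholder
    (3, None, 3, 26),   # multicomponent parent; counts are placeholders
])
conn.executemany("INSERT INTO sample VALUES (?, ?, ?, ?, ?)", [
    (10, 1, "PARENT_A", "VERSION_AB", "1128"),
    (10, 2, "PARENT_B", "VERSION_AB", "1128"),
    (10, 3, "PARENT_AB", "VERSION_AB", "1128"),
])
conn.execute(
    "INSERT INTO sample_data VALUES (?, ?, ?)",
    (10, "NAME", "Picrotoxin (a 1:1 mixture of picrotoxinin and picrotin)"),
)

QUERY = """
SELECT sample.pism, sample.vism, sample.vcs_name, sample_data.data_value,
       parent.h_donor, parent.h_accept
FROM sample_data, sample, parent
WHERE parent.rigidity > 0.9
  AND sample_data.data_name LIKE '%NAM%'
  AND sample_data.data_value LIKE '%toxin%'
  AND sample.par_id = parent.par_id
  AND sample.vcs_id = sample_data.vcs_id
"""

# Sort by donor count so the printed order is stable.
rows = sorted(conn.execute(QUERY).fetchall(), key=lambda r: r[4])
for row in rows:
    print(row)

null_count = conn.execute(
    "SELECT COUNT(*) FROM parent WHERE rigidity IS NULL"
).fetchone()[0]
print("parents with NULL rigidity:", null_count)
```

The comparison NULL > 0.9 is neither true nor false, so the multicomponent parent drops out of the result without any explicit test for mixtures.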

Demo versions under Oracle and Thor are available during the meeting. The Thor version is a sample from approximately 40 sources, including the AIDS set from NCI. Under DayCart we have a single supplier, Tocris 2001.

Daylight Chemical Information Systems, Inc.
info@daylight.com