Getting back to your roots or a guide to responsible parenting


Over recent years, there have been major changes in the pharmaceutical industry, both organisationally and in the way we carry out drug discovery roles in the business. Along with the much-vaunted revolution in high-throughput screening and combinatorial chemistry, which have almost become industries in themselves, there are other, quieter revolutions which impact heavily on the process of finding potential drug leads.

All of these changes put pressure on those responsible for handling chemical structures to come up with systems which will be responsive to these new requirements. Whilst our in-house world was isolated, we could apply whatever rules we liked to the storage of chemical information, as long as they were self-consistent. This paper aims to show how we can use the power of the DAYLIGHT SMILES toolkit and THOR databases to move out effectively into the external world.

Example: Commercial Compound Acquisition

As companies came along offering large numbers of compounds cheaply and ready for test, people became suspicious. We knew

Other than the first and last points, these fears can be tested for and hopefully dispelled. As the numbers were large, we could not simply look at them. We set out to design a triage system, which would allow us to make a decision on whether to accept or reject a collection on the following bases

The results could easily be displayed as a series of pie charts.

However, answering the questions "is it the same?" or "is it like?" requires a consistent description of your chemical structures between the vendor's offering and your in-house registry file.
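A triage of this kind can be pictured as a simple tally. The sketch below is only an illustration in Python: the function name `triage`, the three categories and the 0.85 threshold are assumptions made for the example, not the criteria actually used, and the similarity scores are taken as already computed. The "same" test is a plain name lookup, so it is only meaningful if both sets of names were generated under the same structural conventions.

```python
from collections import Counter

def triage(vendor_names, registry_names, best_similarity, like_threshold=0.85):
    """Tally vendor compounds as 'same', 'like' or 'different'.

    vendor_names    : structure names (e.g. SMILES) from the vendor
    registry_names  : set of names already in the in-house registry
    best_similarity : name -> highest similarity to any in-house compound
    like_threshold  : hypothetical cut-off for calling a compound 'like'
    """
    counts = Counter()
    for name in vendor_names:
        if name in registry_names:
            counts["same"] += 1
        elif best_similarity.get(name, 0.0) >= like_threshold:
            counts["like"] += 1
        else:
            counts["different"] += 1
    return counts

# Illustrative data only; the similarity values are invented.
counts = triage(
    vendor_names=["CCO", "CCN", "CCCl"],
    registry_names={"CCO"},
    best_similarity={"CCN": 0.9, "CCCl": 0.2},
)
total = sum(counts.values())
for category, n in counts.items():
    print(category, n, f"{100 * n / total:.0f}%")
```

The counts per category are exactly what a pie chart needs as input.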


The other changes that have occurred, described above, require solutions predicated on the same requirement: an accurate knowledge of what the structure represents.

The solution: Read the Manual

The DAYLIGHT documentation points out the need for a unique name for a compound, i.e. the SMILES. However, as the manual also says, the SMILES depends on the valence bond representation of the molecule.

There have been several attempts over the years to define standards for chemical structure representation, for example SMD, CEX, the Procrustean MDL sd file and, more recently, CML. However, none of these attack the problem of choice in valence bond representation.

Let us see what happens if we do nothing about it. Let us try looking up morphine and Zantac across several databases, by name and by SMILES.



The decision about whether to represent a nitro group as A or B is purely arbitrary. Those of us who came from a WLN world tend to use form B, as it relates to the NW of WLN and retains the equivalence of the oxygens. The point is that neither form is incorrect. They are just local conventions or, in the jargon, business rules.
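The effect of such a business rule on a lookup can be shown with a toy example. The Python sketch below assumes the two forms in question are the pentavalent nitro group, N(=O)=O, and the charge-separated one, [N+](=O)[O-]. The function `to_house_style` is hypothetical and works by plain text substitution, which only catches the nitro group when it is written in exactly this order; a real implementation would operate on the molecular graph through a chemistry toolkit.

```python
# Two valence bond representations of the same nitro group (assumed forms).
NITRO_CHARGE_SEPARATED = "[N+](=O)[O-]"
NITRO_PENTAVALENT = "N(=O)=O"

def to_house_style(smiles: str) -> str:
    """Toy business rule: rewrite charge-separated nitro groups as pentavalent.

    Plain string substitution, for illustration only. It does not parse the
    SMILES and misses nitro groups written in any other atom order.
    """
    return smiles.replace(NITRO_CHARGE_SEPARATED, NITRO_PENTAVALENT)

# Nitrobenzene as two sources might write it.
vendor_smiles = "c1ccccc1[N+](=O)[O-]"
inhouse_smiles = "c1ccccc1N(=O)=O"

# Without a shared rule, a lookup by name fails.
print(vendor_smiles == inhouse_smiles)                                   # False
# With the same rule applied to both sides, the names agree.
print(to_house_style(vendor_smiles) == to_house_style(inhouse_smiles))   # True
```

The direction of the rewrite is as arbitrary as the convention itself; what matters is that the same rule is applied to both sides before they are compared.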


What we have done is to opt for a datawarehouse model, where the process of cleaning the data applies the business rules but we retain the originator's valence bond representation. By making use of the THOR model, i.e. carefully assigning identifier and data items and rigorously applying the rules about the relationships between them, we have a system that allows us to ask questions across databases.

The need for parents and roots

We are very careful about what we define as the root structure of the THOR datatree. There are basically three processes.

Again, note that these are arbitrary rules. Any choice is valid as long as it is consistent. Chapman and Hall, for instance, have a root structure and derivatives that are related by breaking or forming covalent bonds, e.g. oximes have the corresponding ketone as parent or root structure. This reflects the organisation of Heilbron's Dictionary of Organic Compounds.

If your business involves monomers for combinatorial chemistry, you may wish to create a parent R-*, where * represents a whole range of interconvertible groups, e.g. amides, carboxylic acids, nitriles. All chemically convertible monomers then automatically come together. Multifunctional compounds have multiple parents and occur on multiple pages.

However, we need to retain the relationship with the original structure and its representation. The approach we have adopted is to demote this structure and associate it with an arbitrary identifier, such as a supplier's code number. If you build trees in this way, all the power of SMILES and THOR databases is open to you. It is possible to look up information across several databases in one step. All salts of an amine, for instance, occur on the same page, each associated with an appropriate identifier.
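The shape of such a tree can be sketched with ordinary Python dictionaries. This is not the THOR API: the function names, the identifiers and the `toy_root` rule (keep the longest dot-separated component of the SMILES) are all assumptions made for illustration. A real system would derive the root with a chemistry toolkit and a unique SMILES, so that differently written but equivalent structures also meet on one page.

```python
from collections import defaultdict

def toy_root(smiles: str) -> str:
    """Crude stand-in for a root rule: keep the longest dot-separated part.

    Assumes counter-ions are written as separate, shorter components.
    """
    return max(smiles.split("."), key=len)

def build_pages(records):
    """Group (identifier, original SMILES) pairs under their root structure.

    The root is the key of each page. The originator's representation is
    demoted to a data item stored against its identifier, so it is not lost.
    """
    pages = defaultdict(dict)
    for identifier, original_smiles in records:
        pages[toy_root(original_smiles)][identifier] = original_smiles
    return dict(pages)

# Hypothetical identifiers; phenethylamine as free base and two salts.
records = [
    ("INHOUSE-000123", "NCCc1ccccc1"),
    ("SUPPLIER-A-0417", "NCCc1ccccc1.Cl"),
    ("SUPPLIER-B-9921", "NCCc1ccccc1.Br"),
]

pages = build_pages(records)
for root, entries in pages.items():
    print(root)
    for identifier, original in entries.items():
        print("   ", identifier, original)
```

One lookup on the root then returns every supplier's entry for that compound, together with the form in which each supplier originally drew it.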

Note that this is just a simple application of the thesaurus philosophy: storing what you know and what you know it to be about.

Here is an example of the process, starting from an MDL sd file.

Application to other areas

Once you have got the representation of the structures standardised, all sorts of other opportunities open up. Suppose, for example, a vendor or another company offers you a set of structures for exchange or sale, but does not wish to disclose the structures. You can exchange the fingerprints. These can be compared with the fingerprints of your own compounds to help you decide whether to continue with the exchange. They may be like compounds you wish to have, or very different; either way you can make an informed decision. Queries can be passed across different vendors' databases, as it is possible to compare like with like, on the fly, even if you do not opt for the datawarehouse solution described above.
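A fingerprint comparison of this kind needs nothing more than bit counting. The sketch below uses the Tanimoto coefficient, a common choice for comparing bit-string fingerprints, with Python integers standing in for the fingerprints. The bit patterns and compound names are invented, and the helper names are hypothetical. It assumes both parties generated their fingerprints with the same scheme and the same structural conventions.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints held as integers.

    Bits set in both, divided by bits set in either.
    """
    in_both = bin(fp_a & fp_b).count("1")
    in_either = bin(fp_a | fp_b).count("1")
    return in_both / in_either if in_either else 0.0

def nearest(offered_fp: int, own_fps: dict) -> tuple:
    """Return (similarity, name) of the in-house compound closest to the offer."""
    return max((tanimoto(offered_fp, fp), name) for name, fp in own_fps.items())

# Invented 8-bit fingerprints; real ones are far longer.
own_fps = {
    "cmpd-1": 0b11110000,
    "cmpd-2": 0b00001111,
}
offered_fp = 0b11100000  # received from the other party, structure undisclosed

print(nearest(offered_fp, own_fps))  # (0.75, 'cmpd-1')
```

A high score against an existing compound suggests the offer duplicates what you already hold; uniformly low scores suggest novelty. Neither requires the structure itself to change hands.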

Outstanding problems

The major problem not dealt with here is that of tautomers.
What is required is that the disodium salt of NBQX (below) generates the same tautomer as parent as does the free base.

Internally, we use a dictionary in our registration process, which is fine for small numbers of compounds over which we have control. However, the model above allows the accurate treatment of tautomers: the power of SMIRKS allows the reaction toolkit to be used to standardise structural representation whilst maintaining the original tautomeric information. The root of the tree must simply be consistent; it is helpful, but not necessary, that it is reasonable or accurate.

The whole aim of this work is to try to fulfil the statement:
"... Part of the power of SMILES comes from the fact that an algorithm exists (available in the Daylight Toolkit) to produce a unique SMILES. With standard SMILES, the name of a molecule is synonymous with its structure; with unique SMILES, the name is universal. Anyone in the world who uses unique SMILES to name a molecule will choose the exact same name ..."

We may not be able to get global agreement on how we represent or store chemical structures, but if we can understand properly how we differ, we can make sure we decide correctly when we compare them.

Daylight Chemical Information Systems, Inc.

John Bradshaw.