Introduction to Chemical Info Systems

Euromug02 24th-26th September 2002, Cambridge UK

John Bradshaw
Daylight CIS Inc., Sheraton House, Castle Park, Cambridge, CB3 0AX, UK

The Philosophy Bit

It is fundamental to any chemical information system that it can store chemical structures. These structures should be searchable, so that one can retrieve:

Structures which exactly match a target structure.
Structures which contain a target structure.
Structures which are contained in a target structure.
Structures which are tautomeric with the target structure.
Structures that have features in common with the target structure.
Structures which match a, possibly ambiguous, pattern of atoms and bonds.

In addition, if one is interested in chemoinformatics, the system should

Retrieve data which are about the target structure e.g. a depiction or a conformation.
Calculate data derived from the target structure such as physical properties, a conformation or the ring systems contained in it.
Carry out in silico chemistry on sets of compounds, such as combinatorial library enumeration.

All these requirements should be fulfilled using chemical structure ( the natural language of the chemist ) as an entry point.

It is essential that the method used to store the chemical structures is robust enough to allow for completion of these tasks.

There are three rôles involved in implementing and running a chemical information system:

1. The person who assigns the structure to a sample, which may be fictitious.
2. The person who curates and archives the structure on some medium.
3. The person who searches and retrieves the structure.

If we are using paper as our medium, then it is possible, if person 1 and person 3 have similar training, to interpret a picture or an icon such as the structure diagram. To quote the example from Pierre Laszlo:

...the structural formula of, say, p-rosaniline represents the same substance to Robert B. Woodward, say, in 1979 as it did to Emil Fischer in 1879, and even though such a signifier pointed to the same signified, nevertheless the entity signified had been enriched in meaning with time.

P. Laszlo in "Tools and Modes of Representation in the Laboratory Sciences"; Klein, U. Ed. Kluwer Academic Publishers; London 2001; p52

However, for this communication to be possible it requires

that Woodward and Fischer were both chemists
that the conventions for formula drawing had not changed significantly in the intervening century
that the medium on which the communication took place (paper) survived ( Berichte der Deutschen Chemischen Gesellschaft 12 (1879) 2344-2353 ).

The other point which Laszlo makes is that, in the intervening years, science has moved on. As techniques developed, much more became known about the dyestuff: its UV-visible spectrum, its crystal structure, pK_a, NMR and so on, none of which could possibly have been known to Fischer. Although these data were collected on different samples over the years, they were unified by being about the "entity signified" described by the structural formula of p-rosaniline.

This is where the remaining member of our triumvirate, the curator, comes in. It is incumbent upon the curator/archivist to store the structure in such a way that new data can be "attached" to the appropriate part of the molecule. An inappropriate choice of the level of abstraction for storage can lead to data loss. You cannot store 13C NMR chemical shift information about a peptide if the structure is represented as Ala-Phe-Gly. Chemical information science has had the advantage of over 200 years to learn these lessons. The answer is usually to store the structure at the most basic primitive level of atoms and bonds.
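
The point about the level of abstraction can be made concrete with a minimal sketch. The class names, the attach method and the shift values are all assumptions made for this illustration (the numbers are placeholders, not measurements). When atoms are the stored primitives there is a slot for each per-atom datum; a list of residue names has none.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    data: dict = field(default_factory=dict)  # observations about this atom


@dataclass
class Molecule:
    atoms: list   # Atom objects, in a fixed order
    bonds: list   # (atom index, atom index, bond order)

    def attach(self, atom_index, key, value):
        """Attach a datum to one specific atom of the structure."""
        self.atoms[atom_index].data[key] = value


# Ethanol, heavy atoms only: C-C-O
ethanol = Molecule(
    atoms=[Atom("C"), Atom("C"), Atom("O")],
    bonds=[(0, 1, 1), (1, 2, 1)],
)

# Placeholder shift values for illustration only, not measured data.
ethanol.attach(0, "c13_shift_ppm", 18.0)
ethanol.attach(1, "c13_shift_ppm", 58.0)

# A residue-level representation has no atoms to hang the data on.
peptide = ["Ala", "Phe", "Gly"]
```

Any datum collected later, on any sample, can be filed against the same atom index, which is the sense in which the data are "about" the structure.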

This is of more than passing philosophical interest. It is salutary to note that most of the chemistry used to prepare Zantac® is the best part of 100 years old ( Bradshaw J, "Ranitidine" In Lednicer, D. Ed. Chronicles of Drug Discovery, Vol 3 American Chemical Society, Washington DC 1993 p 45 ). The information was accessible to the chemists involved, as Allen and Hanburys Ltd had maintained complete runs of the paper copies of Beilstein and Chemical Abstracts in their library. As the saying goes, "two months in the laboratory saves two hours in the library."

The History Bit

Those who would question the present should investigate the past. Those who do not understand what is to come should look at what has gone before.
The Guanzi

The point at which chemical structures and formulae become recognizable today is in the early 19th century. The essays by Berzelius starting in 1813 ( Annals of Philosophy 2 443-454 ) paved the way.( My emphasis )

But, though we must acknowledge that these [alchemic] signs were very well contrived, and very ingenious, they were of no use; because it is easier to write an abbreviated word than to draw a figure, which has but little analogy with letters, and which, to be legible, must be made of a larger size than our ordinary writing. In proposing new chemical signs, I shall endeavour to avoid the inconveniences which rendered the old ones of little utility. I must observe here that the object of the new signs is not that, like the old ones, they should be employed to label vessels in the laboratory: they are destined solely to facilitate the expression of chemical proportions, and to enable us to indicate, without long periphrases, the relative number of volumes of the different constituents contained in each compound body. By determining the weight of the elementary volumes, these figures will enable us to express the numeric result of an analysis as simply, and in a manner as easily remembered, as the algebraic formulas in mechanical philosophy.
The chemical signs ought to be letters, for the greater facility of writing, and not to disfigure a printed book. Though this last circumstance may not appear of any great importance, it ought to be avoided whenever it can be done. I shall take, therefore, for the chemical sign, the initial letter of the Latin name of each elementary substance: but as several have the same initial letter, I shall distinguish them in the following manner:-- 1. In the class which I call metalloids, I shall employ the initial letter only, even when this letter is common to the metalloid and some metal. 2. In the class of metals, I shall distinguish those that have the same initials with another metal, or a metalloid, by writing the first two letters of the word. 3. If the first two letters be common to two metals, I shall, in that case, add to the initial letter the first consonant which they have not in common: for example, S = sulphur, Si = silicium, St = stibium (antimony), Sn = stannum (tin), C = carbonicum, Co = cobaltum (cobalt), Cu = cuprum (copper), O = oxygen, Os = osmium, &c.

Note that Berzelius was suggesting a system appropriate to the communication medium (paper) and method ( manuscript or print ) but also, more importantly, that the "word" he produced ( the collection of characters and numbers ) represented the underlying chemistry ( mainly analysis ). The idea was that this was a sort of algebra. Even today we "balance" chemical equations in the way pioneered by Berzelius, using, for the most part, the same symbols to represent atoms. In the early part of the 19th century figures and illustrations were printed separately, often on different paper, and bound at the end of books or journals; Berzelius's notation allowed the chemical information to be integrated with the rest of the text.
For a more detailed discussion of these formulae as tools in chemistry see Klein, U. "Berzelian Formulas as Paper Tools in early Nineteenth Century Chemistry" Foundations of Chemistry 3: 7-32, 2001.
Within a very few years these formulae appeared in textbooks ( Turner, E. "Elements of Chemistry: Including the Recent Discoveries and Doctrines of the Science", 4th Ed. London; John Taylor 1833 ) which shows the important rôle of textbooks in stabilizing notation.

As the 19th century progressed and organic chemistry grew, it became clear that there were frequently occurring groups in molecules, which, to a first approximation, were invariant in their properties. These were represented as shorthand, first in a Berzelian way CH₃-, C₆H₅- etc and then in a more convenient way as Me- and Ph- etc. There was however no limit to the number of these abbreviations, or indeed, any rules as Berzelius had proposed, for their construction. This defeated the whole object of being able to communicate structures, as the underlying vocabulary was undefined.

Being able to represent a molecule as a parent structure, substituted with various groups, is very appealing though. This approach was followed by people interested in the nomenclature and indexing of the rapidly growing number of compounds, in particular Beilstein, who published the first edition of his Handbuch in 1880. This introduced rigorous methods for classifying, naming and indexing compounds which brought together "related compounds". For this the Berzelian formulae were not sufficient, as it was important to know the spatial relationships between atoms. A key contributor to the development of these structures was Crum Brown in Edinburgh. Here we have his structure for phenol from 1861, which mathematically is a graph with atoms as nodes and bonds as edges.

Crum Brown Structure for phenol

It was also clear that the molecules which were being prepared were not two-dimensional, as was the medium on which they were being portrayed. There were two solutions.

Build a physical model, with the difficulty that you could not easily communicate that information other than to people who could see your model.
Project the 3-dimensional information onto the 2-dimensional medium for communication.

Hofmann was one of the chemists who adopted a modelling approach. His famous lecture in 1865, at the Royal Institution in London, used croquet balls as the atoms and steel rods as the bonds.

Hofmann's models

To this day modelling kits tend to use the same colours as croquet balls for the atoms.

Others, such as van 't Hoff, abandoned all pretence of atoms and bonds in an effort to get across the spatial arrangements of groups and the possibility of stereoisomerism.

Emil Fischer used the alternative strategy of projection. This is restricted to certain classes of compound and also requires that various conventions about orientation etc are applied. He did this by physically flattening a rubber model of tartaric acid made by his colleague Friedländer.

Fischer projection of tartaric acids

Once you move away from linear formulae constrained to read left to right by the text in which they are embedded, you need to provide a whole lot of information like numbering the atoms to ensure that all the readers get the same starting point for the eye movement which recognizes the structure.
For a more detailed discussion of the philosophy of this see P. Laszlo in "Tools and Modes of Representation in the Laboratory Sciences"; Klein, U. Ed. Kluwer Academic Publishers; London 2001; p52

Linear formulae continued to be used, embedded in text, but, with improvements in printing techniques and the imposition of typesetting standards and atom numbering by organizations such as Chemical Abstracts, graphical representations of structure began to predominate. There wasn't a need for a major change until the advent of computers in the middle of the last century. There were no good mechanisms in early computers to store the increasingly elaborate icons which had come to be used by one chemist to communicate with another. A simpler system was needed.

At this point Wiswesser returned to linear notations. If he could come up with a method to add structural and connectivity information to Berzelian molecular formulae, these could be handled by a computer. Wiswesser threw away some of the two-character symbols for elements such as chlorine and bromine. There were no fewer than four different symbols for nitrogen, K, N, M and Z, depending on the number of attached hydrogens and the charge. Carbon was reduced to a skeleton element, alkyl chains being replaced simply by the number of carbon atoms in the straight chain and branches by Y and X. Benzene rings, so long the marker for the chemist to orient a view of a molecule, were reduced to a branching point with a single character R. Somewhat obtusely, other ring systems did not suffer benzene's fate and were promoted to dominate the structure, building on the recent work of Patterson ( Patterson A.M.; Capell, L.T. "The Ring Index"; American Chemical Society: Washington D.C. 1940 ). So for a compound such as

6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid the WLN is L66J BMR& DSWQ INI&1

However, it turned out to be only a shorthand way of representing the systematic name. The advantage of this was that users could make use of all the sorting and indexing technologies which had been developed over the years for names, using permuted indices. The downside was that it ran into the same problems that the indexers and namers had.

What was the parent, usually most important, ring system? As new ring systems were found the priority ordering altered.
How were the atoms numbered in each ring system to establish the prefix for the connection? Only single character locants were allowed, so there was no way it could map onto Patterson's model where fusion atoms, especially in polycyclic aromatics, quite often are not numbered as they could not be substituted. Wiswesser took the opposite view and chose the fusion locants to be the priority ones.
How was stereochemistry to be shown? Cahn-Ingold-Prelog notation fell apart in the WLN system and appended text fields were used to carry the information.
What was the 'prime path' through the molecule? This was essential to get the correct ordering of the code.

Vast armies of people were kept busy coding compounds and teaching others to code them. The coder needed to know the rules to get the "right" WLN, since there was only one valid WLN for a compound; disputes went to a committee. There were no good automatic ways of generating the correct WLN from what had become the natural language of the chemist, the structure diagram. Equally, the representations could not easily be parsed left to right without backtracking.

Despite all of this Wiswesser's notation was embraced by both industry and content providers such as ISI and CAS. Substructure searching became available through the CROSSBOW program. However as soon as technology allowed the structures to be input by drawing, as had been done in the paper systems, WLN was abandoned. For whilst it was not possible to autogenerate WLN by machine, it was possible to convert it to a connection table.

The new graphics entry systems stored the information in the form of a connection table representing the underlying graph. As the structure was now represented as a graph, subgraph matches could be made using existing algorithms.
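
As a minimal sketch of what a subgraph match involves, the following uses a plain backtracking search. This is an illustrative stand-in, not the algorithm used by any particular system of the period; the function name and the graph layout (a list of element symbols plus a set of unordered atom pairs) are assumptions made for the example, and bond orders are ignored.

```python
def find_substructure(query, target):
    """Return one mapping (query atom i -> target atom mapping[i]) or None.

    Each graph is (labels, bonds): labels is a list of element symbols and
    bonds is a set of frozenset({i, j}) pairs of atom indices.
    """
    q_labels, q_bonds = query
    t_labels, t_bonds = target

    def extend(mapping):
        i = len(mapping)                      # next query atom to place
        if i == len(q_labels):
            return list(mapping)              # every query atom is placed
        for cand in range(len(t_labels)):
            if cand in mapping or t_labels[cand] != q_labels[i]:
                continue
            # Every query bond from atom i to an already placed atom j
            # must also be present between their images in the target.
            if all(frozenset((cand, mapping[j])) in t_bonds
                   for j in range(i)
                   if frozenset((i, j)) in q_bonds):
                mapping.append(cand)
                found = extend(mapping)
                if found is not None:
                    return found
                mapping.pop()                 # dead end: backtrack
        return None

    return extend([])


def bond_set(pairs):
    return {frozenset(p) for p in pairs}


# Target: a carbon bearing a sulphonic acid group, C-S(O)(O)O.
target = (["C", "S", "O", "O", "O"], bond_set([(0, 1), (1, 2), (1, 3), (1, 4)]))
# Query: sulphur with three attached oxygens.
sulphonyl = (["S", "O", "O", "O"], bond_set([(0, 1), (0, 2), (0, 3)]))
# Query: a carbon-nitrogen bond, which the target does not contain.
amine = (["C", "N"], bond_set([(0, 1)]))

match = find_substructure(sulphonyl, target)
no_match = find_substructure(amine, target)
```

Here match lists, for each query atom in turn, the target atom it was placed on, and no_match is None.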

Here we have our compound in MDL molfile format. Note the fixed column format required by FORTRAN code used in earlier times.

6-dimethylamino-4-phenylamino-naphthalene-2-sulphonic acid
  -ISIS-  02110115552D

 24 26  0  0  0  0  0  0  0  0999 V2000

   -1.7931    0.6000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.7984   -0.2273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0851   -0.6459    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0786    1.0072    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3647    0.5965    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.3692   -0.2315    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3450   -0.6497    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0642   -0.2368    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0648    0.5943    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3499    1.0046    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3375   -1.4750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.0542   -1.8875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.0504   -2.7160    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7662   -3.1326    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4807   -2.7176    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4790   -1.8859    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7668   -1.4772    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7792    1.0125    0.0000 S   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5208   -0.6375    0.0000 N   0  0  3  0  0  0  0  0  0  0  0  0
   -2.5250   -1.4667    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -3.2417   -0.2208    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.4824    1.4383    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3586    1.7304    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.3625    0.4292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  3  1  0  0  0  0
 11 12  1  0  0  0  0
  5  6  1  0  0  0  0
 12 13  2  0  0  0  0
  3  6  2  0  0  0  0
 13 14  1  0  0  0  0
  6  7  1  0  0  0  0
 14 15  2  0  0  0  0
  1  2  2  0  0  0  0
 15 16  1  0  0  0  0
  7  8  2  0  0  0  0
 16 17  2  0  0  0  0
 17 12  1  0  0  0  0
  5  4  2  0  0  0  0
  9 18  1  0  0  0  0
  8  9  1  0  0  0  0
  2 19  1  0  0  0  0
  4  1  1  0  0  0  0
 19 20  1  0  0  0  0
  9 10  2  0  0  0  0
 19 21  1  0  0  0  0
 10  5  1  0  0  0  0
 18 22  2  0  0  0  0
 18 23  2  0  0  0  0
  7 11  1  0  0  0  0
 18 24  1  0  0  0  0
M  END
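
Reading such a file is a matter of slicing fixed columns rather than splitting on white space. The sketch below assumes the usual V2000 layout (three header lines, a counts line, then the atom and bond blocks) and reads only the fields discussed here. The two-atom test record, its coordinates and the function name are made up for the illustration.

```python
# A tiny hypothetical record (methanol, heavy atoms only) in the same layout.
MOLFILE = "\n".join([
    "methanol",
    "  -sketch-",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
])


def parse_molfile(text):
    """Read atoms and bonds from a V2000 molfile by column position."""
    lines = text.splitlines()
    counts = lines[3]                      # line 4 is the counts line
    n_atoms = int(counts[0:3])             # columns 1-3
    n_bonds = int(counts[3:6])             # columns 4-6
    chiral_flag = int(counts[12:15])       # columns 13-15

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        symbol = line[31:34].strip()       # element symbol after one space
        atoms.append((symbol, x, y, z))

    bonds = []
    for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]:
        # Atom numbers in the bond block count from 1, not 0.
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    return {"chiral_flag": chiral_flag, "atoms": atoms, "bonds": bonds}


mol = parse_molfile(MOLFILE)
```

Because the fields are located by position, a single stray or missing space anywhere in a line silently shifts every value that follows it.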

Clearly these connection tables have some advantages. As they map the internal program storage so closely, the computer has no problem reading them. However limitations were imposed by the computer languages.

Many versions of FORTRAN required fixed array sizes so limiting the number of atoms which could be stored and hence the size of the molecule. For many years there was a 256 atom limit on MDL molfiles.
As noted above, entry lines were limited to 80 characters, as punched cards were, and the information had to be in fixed format.
The fixed format limits the range of numbers and the length of character strings. Compound names could not be longer than 80 characters for instance.
Connection tables of this type are not easily human readable. For instance, the 0 in the fifth field of the counts line ( the line ending in V2000 ) in the file above means that the chiral flag is not set, i.e. the molecule represented has no absolute stereo centres. A 1 in that field implies all the stereo centres are known. It is not possible to represent a molecule in an MDL molfile where only some of the stereocentres are known.
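
The effect of the fixed format on the range of numbers can be sketched by writing a counts line. The function below is an assumption made for illustration, not part of any MDL software. It fills the three-character fields of a V2000 counts line and refuses values that would not fit. The limit enforced by a given program could be lower than the one the field width imposes.

```python
def counts_line(n_atoms, n_bonds, chiral_flag=0):
    """Format a V2000 counts line; each count occupies exactly 3 columns."""
    for name, value in (("atoms", n_atoms), ("bonds", n_bonds)):
        if not 0 <= value <= 999:
            # A four-digit number would push every later field one column
            # to the right and corrupt the record.
            raise ValueError(f"number of {name} does not fit in 3 columns")
    if chiral_flag not in (0, 1):
        # One flag for the whole molecule: there is nowhere to say that
        # only some of the stereocentres are known.
        raise ValueError("chiral flag must be 0 or 1")
    return (f"{n_atoms:3d}{n_bonds:3d}  0  0{chiral_flag:3d}"
            "  0  0  0  0  0999 V2000")


line = counts_line(24, 26)

try:
    counts_line(1000, 1100)
    too_big_rejected = False
except ValueError:
    too_big_rejected = True
```

For 24 atoms and 26 bonds this reproduces the counts line of the file above.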

Limitations were also imposed by the need to store different information for different purposes so each vendor came up with differing formats. Attempts were made to unite the formats ( Garavelli, J.S. (1990) Chemical Design and Automation News 5 2-5 ) but to no avail. All that happened was they became even larger to accommodate everyone's needs.

Beilstein made a foray into the line notation area too. They came up with a notation called ROSDAL which is a linear way of representing the connection table. So the ROSDAL code for the naphthalene sulphonic acid above is

1=-5-=10=5,10-1,1-11N-12-=17=12,3-18S-19O,18=20O,18=21O,8-22N-23,22-24

Notice again that, as with WLN, carbon is treated differently from the other elements. This was designed to facilitate computer manipulation, not chemists' visual recognition. It is defined by means of a formal grammar expressed in Backus-Naur form, the meta notation commonly used in the definition of computer languages.

It was not until the invention of SMILES that we had a linear representation of structure which was valid for a paper system using manuscript or typesetting and also for a computer system using either the keyboard or graphical input. For the naphthalene sulphonic acid above, a SMILES is

c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3

SMILES are compact. There is no white space in a SMILES.
SMILES is a language with a defined vocabulary, syntax and semantics. It is therefore independent of the computer system or language which uses it.
SMILES represents the structure in terms of atoms and bonds, the fundamental primitives; allowing it to be an ideal lingua franca for other methods of representation.
SMILES uses only the symbols in the periodic table to represent atoms; it does not allow pseudo-atoms to be made up.
SMILES uses the symbols -,=,#,: to represent single, double, triple and aromatic bonds. In addition the full stop or period '.' represents no bond.
SMILES does not give undue precedence to any part of the structure such as rings or heteroatoms.
SMILES can be written uniquely, as the graph which the SMILES represents can be uniquely ordered. For the above example the unique SMILES is CN(C)c1ccc2cc(cc(Nc3ccccc3)c2c1)S(=O)(=O)O
SMILES encodes stereochemistry on a per centre basis. You can therefore accurately represent a structure in which not all stereocentres are assigned absolutely.
SMILES, because it is based on a valence bond model in common with most connection table formats, does not handle well those molecules which can only be represented in molecular orbital terms, e.g. carboranes and ferrocenes.
SMILES currently ( v4.73 ) does not handle stereochemistry well when the chirality is determined by features other than bonds between atoms, e.g. lone pairs or restricted rotation.
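
To show how little machinery the language needs, here is a minimal sketch of a reader for a small subset of SMILES. It is not the Daylight toolkit, and the function names are assumptions made for the example. It handles unbracketed atoms, explicit bond symbols, branches and single-digit ring closures; bracket atoms, charges, stereochemistry and '.' are deliberately left out.

```python
import re

# Two-letter symbols must be tried before their one-letter prefixes.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:]|[()]|\d")


def parse_smiles(smiles):
    """Return (atoms, bonds); bonds are (i, j, symbol or None if implicit)."""
    atoms, bonds = [], []
    prev = None            # atom the next atom will be bonded to
    pending = None         # explicit bond symbol awaiting its second atom
    branch_stack = []      # atoms to return to when a branch closes
    open_rings = {}        # ring digit -> (atom index, bond symbol)
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            raise ValueError(f"unsupported character at position {pos}")
        tok, pos = m.group(), m.end()
        if tok == "(":
            branch_stack.append(prev)
        elif tok == ")":
            prev = branch_stack.pop()
        elif tok in "-=#:":
            pending = tok
        elif tok.isdigit():
            if tok in open_rings:              # second use closes the ring
                start, symbol = open_rings.pop(tok)
                bonds.append((start, prev, pending or symbol))
            else:                              # first use opens it
                open_rings[tok] = (prev, pending)
            pending = None
        else:                                  # an atom symbol
            atoms.append(tok)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending))
            prev = len(atoms) - 1
            pending = None
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


def invariant(atoms, bonds):
    """Sorted (symbol, number of neighbours) pairs: order independent."""
    degree = [0] * len(atoms)
    for i, j, _ in bonds:
        degree[i] += 1
        degree[j] += 1
    return sorted(zip(atoms, degree))


as_written = parse_smiles("c1ccccc1Nc2cc(S(=O)(=O)O)cc3c2cc(N(C)C)cc3")
unique = parse_smiles("CN(C)c1ccc2cc(cc(Nc3ccccc3)c2c1)S(=O)(=O)O")
```

Both strings yield 24 atoms and 26 bonds, the counts in the molfile above, and the same invariant. Agreement of such an invariant is necessary but not sufficient for two graphs to be identical; producing the unique SMILES requires a true canonical ordering of the graph, which this sketch does not attempt.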

This fixed vocabulary and ordering ensure that the SMILES language can be used to create a unique and universal name. The link to the graph ensures that all the required searches can be carried out, and its representation at the atom and bond level allows additive-constitutive properties to be calculated. A useful consequence of the fixed ordering in the name is that atom and bond properties, such as 3-D coordinates, can also be kept in ordered lists.
