2. Molecules and Reactions in a Computer

There are atoms and space. Everything else is opinion. -- Democritus

The foundation of a chemical information system is its ability to represent molecules in a computer and to communicate a molecule's structure from one place to another. At first glance this seems like a simple problem, so easy solutions are often proposed and implemented. A close examination, however, reveals several subtle traps awaiting the unwary, and methods of avoiding them must be considered before an effective computer representation of a molecule can be designed.

2.1 Representing Molecules

To represent a molecule in a computer, we must first choose a particular physical model. Many models have served chemists, ranging from the Bohr model through the most modern quantum theory; all have had adherents, detractors, uses, and flaws. When using such models, we must always avoid the trap of arguing that a particular model is right rather than arguing that it is useful. Models are just that - models.

Daylight's system represents molecules using a fairly standard valence model. For example, the Daylight system understands the normal valences of organic compounds, and by counting the bonding electrons in a molecule, can fill in unspecified hydrogens, detect aromatic and anti-aromatic ring systems, and issue warnings when unlikely or impossible molecules are specified.
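As a rough illustration of how a valence model can fill in unspecified hydrogens, the sketch below subtracts the bond orders already present on an atom from its normal valence. The table NORMAL_VALENCE and the function implicit_hydrogens are hypothetical names chosen for this example, not part of the Daylight software; the table covers only a few neutral elements and ignores charge and aromaticity.

```python
# Simplified table of normal valences for a few neutral atoms (an assumption
# made for this sketch). Elements with several normal valences list them in
# increasing order.
NORMAL_VALENCE = {
    "B": [3], "C": [4], "N": [3, 5], "O": [2],
    "P": [3, 5], "S": [2, 4, 6],
    "F": [1], "Cl": [1], "Br": [1], "I": [1],
}


def implicit_hydrogens(symbol, bond_orders):
    """Return the hydrogens needed to reach the lowest normal valence
    that accommodates the bonds already present on the atom.

    Raises ValueError when the bonds exceed every normal valence, which is
    the kind of case where a real system would issue a warning.
    """
    used = sum(bond_orders)
    for valence in NORMAL_VALENCE[symbol]:
        if valence >= used:
            return valence - used
    raise ValueError(f"{symbol} with {used} bonds exceeds its normal valences")


methyl_carbon = implicit_hydrogens("C", [1])       # carbon with one single bond
carbonyl_carbon = implicit_hydrogens("C", [2, 1])  # one double, one single bond
water_oxygen = implicit_hydrogens("O", [])         # oxygen with no heavy neighbors
```

Here methyl_carbon is 3, carbonyl_carbon is 1, and water_oxygen is 2. A carbon given bond orders [2, 2, 1] raises ValueError, because five bonds exceed carbon's normal valence of four.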

The Daylight system represents a molecule as a graph in which the nodes are atoms and the edges are bonds. Each atom has several properties, including its atomic number, atomic weight, charge, and the number of attached hydrogens. If the atom is a chiral center, it can also have chiral specifications.

Bond properties are simpler: a bond is single, double, triple, or aromatic. The concept of aromaticity in the Daylight system is not a chemical one, but rather is a set of rules designed for a chemical nomenclature system (this is discussed more in the SMILES chapter).

There is some flexibility in this valence model. Molecules can be represented as a hydrogen-suppressed graph (hydrogen atoms are represented as a property of "heavy" atoms) or as a hydrogen-complete graph (hydrogens are represented the same way as other atoms). Bonds in cyclic structures can be represented as aromatic or as the alternating single/double bond Kekulé form. Isomeric and isotopic information, such as chirality and atomic mass, can be unspecified, partially specified, or completely specified.
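The graph model described above can be sketched as a small data structure. The classes Atom and Molecule and the function to_hydrogen_complete are hypothetical names for illustration only; they do not reflect how the Daylight system is implemented, and chirality is left out for brevity. The example shows the same molecule in hydrogen-suppressed and hydrogen-complete form.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Atom:
    atomic_number: int
    charge: int = 0
    hydrogens: int = 0           # attached hydrogens, stored as an atom property
    mass: Optional[int] = None   # None means the isotope is unspecified


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)
    # Each bond is (atom index, atom index, bond type), where the type is
    # one of "single", "double", "triple", or "aromatic".
    bonds: list = field(default_factory=list)


def to_hydrogen_complete(mol):
    """Return a copy in which every suppressed hydrogen is its own atom."""
    out = Molecule(
        atoms=[Atom(a.atomic_number, a.charge, 0, a.mass) for a in mol.atoms],
        bonds=list(mol.bonds),
    )
    for index, atom in enumerate(mol.atoms):
        for _ in range(atom.hydrogens):
            out.atoms.append(Atom(atomic_number=1))
            out.bonds.append((index, len(out.atoms) - 1, "single"))
    return out


# Ethanol as a hydrogen-suppressed graph: three heavy atoms and two bonds.
ethanol = Molecule(
    atoms=[Atom(6, hydrogens=3), Atom(6, hydrogens=2), Atom(8, hydrogens=1)],
    bonds=[(0, 1, "single"), (1, 2, "single")],
)
ethanol_full = to_hydrogen_complete(ethanol)
```

The hydrogen-complete copy has nine atoms and eight bonds, while the original hydrogen-suppressed graph is left unchanged.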

2.2 Analyzing Molecules

There are two classes of properties associated with a molecule and its constituent parts (atoms and bonds): explicit properties and derived properties. Explicit properties are those needed to completely specify the graph of a molecule: its atoms and bonds and their properties. Derived properties are properties that can be computed from the graph. The following sections describe the derived properties that are of interest in the Daylight system.

2.2.1 Cycles

When a molecule's atoms and bonds are specified, the graph may contain cycles. A bond is said to be a ring bond if it can be removed without breaking the graph (molecule) into two pieces. (In this case, the graph is said to be biconnected.) Atoms that are connected via ring bonds are ring atoms.
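The definition of a ring bond translates directly into a test: leave the bond out and check whether its two atoms can still reach each other. The sketch below does exactly that with a depth-first search. The function names are hypothetical, and this direct approach is chosen for clarity, not efficiency; it is not a description of the algorithm the Daylight system uses.

```python
def ends_still_connected(n_atoms, bonds, skipped):
    """Check whether the two atoms of bond `skipped` remain connected
    when that bond is left out of the graph."""
    start, goal = bonds[skipped]
    neighbors = {i: [] for i in range(n_atoms)}
    for index, (a, b) in enumerate(bonds):
        if index != skipped:
            neighbors[a].append(b)
            neighbors[b].append(a)
    seen, stack = {start}, [start]
    while stack:
        atom = stack.pop()
        if atom == goal:
            return True
        for nxt in neighbors[atom]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def ring_bonds(n_atoms, bonds):
    """Indices of bonds whose removal does not split the graph."""
    return [i for i in range(len(bonds))
            if ends_still_connected(n_atoms, bonds, i)]


def ring_atoms(n_atoms, bonds):
    """Atoms that take part in at least one ring bond."""
    atoms = set()
    for i in ring_bonds(n_atoms, bonds):
        atoms.update(bonds[i])
    return sorted(atoms)


# Methylcyclopropane: atoms 0, 1, 2 form the ring; atom 3 is the methyl carbon.
methylcyclopropane = [(0, 1), (1, 2), (2, 0), (0, 3)]
found_bonds = ring_bonds(4, methylcyclopropane)
found_atoms = ring_atoms(4, methylcyclopropane)
```

For methylcyclopropane, found_bonds is [0, 1, 2] and found_atoms is [0, 1, 2]; the bond to the methyl carbon is correctly left out.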

There are two parts to ring-detection in a graph:

  • Ring detection: Discovering those atoms and bonds that are part of ring systems.
  • Finding a smallest set of smallest rings (SSSR). In a multi-ring system there are many cyclic paths; for example, a naphthalene system has two paths of length six around the two "obvious" rings plus a path of length ten around the perimeter. Any two of these three "rings" would completely describe the ring system, but the shortest cyclic paths are what one normally calls the "right" ones. Note that there is more than one valid SSSR for many systems - a tetrahedron has four equivalent faces but only three rings, making four valid SSSRs (any three of the four faces).

The algorithms used for these tasks are beyond the scope of this document. However, we will point out that of the two tasks above, the first is relatively trivial and the second is surprisingly complex. Many papers have been written describing algorithms for efficiently detecting SSSRs (G. Downs et al., A Review of Ring Perception Algorithms for Chemical Graphs, J. Chem. Inf. Comput. Sci. 29:172; 1989).
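Although finding the rings themselves is complex, the number of rings in an SSSR is easy to compute. By a standard graph-theory identity (not stated in this manual), it equals the number of bonds minus the number of atoms plus the number of connected components. The sketch below applies this to the two examples above; the function names are hypothetical.

```python
def count_components(n_atoms, bonds):
    """Count connected components with a simple union-find."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(i) for i in range(n_atoms)})


def sssr_size(n_atoms, bonds):
    """Number of rings in any SSSR: bonds - atoms + components."""
    return len(bonds) - n_atoms + count_components(n_atoms, bonds)


# Naphthalene: ten atoms, ten perimeter bonds plus the ring-fusion bond.
naphthalene = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5)]
# Tetrahedron: four atoms, every pair bonded.
tetrahedron = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

naphthalene_rings = sssr_size(10, naphthalene)
tetrahedron_rings = sssr_size(4, tetrahedron)
```

This gives two rings for naphthalene and three for the tetrahedron, matching the counts in the text. It says nothing about which cyclic paths to choose, which is the hard part.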

2.2.2 Bond Type, Bond Order, and Aromaticity

Bonds are, in a sense, both an explicit and a derived property. Although you generally specify the bond type of each bond, the Daylight system will sometimes rearrange them, such as in Kekulé ring systems.

The Daylight system defines bond type and bond order as follows:

Bond Order
Bond order is one of single, double, or triple. Bond order is a formal property.

Bond Type
A derived property; one of single, double, triple, or aromatic. The Daylight system uses an extended version of Hueckel's rule to identify aromatic molecules and ions. To qualify as aromatic, all atoms in a ring must be sp2 hybridized and the number of available "excess" p-electrons must satisfy Hueckel's 4N+2 criterion. This is covered in more detail in the SMILES chapter.

Note that the definition of aromaticity is not intended to imply anything about the reactivity, magnetic resonance spectra, heat of formation, or odor of substances. Rather, the definition is designed to be useful in a chemical nomenclature system (SMILES) that is discussed in detail in the subsequent chapter.
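The arithmetic part of Hueckel's 4N+2 criterion can be sketched in a few lines. The function satisfies_huckel is a hypothetical name, and the electron counts are supplied by hand; deciding how many "excess" p-electrons each atom contributes is the difficult part and is left to the SMILES chapter.

```python
def satisfies_huckel(pi_electrons):
    """True when pi_electrons equals 4N+2 for some integer N >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


benzene = satisfies_huckel(6)                # six p-electrons, N = 1
cyclobutadiene = satisfies_huckel(4)         # four p-electrons, no valid N
cyclopentadienyl_anion = satisfies_huckel(6) # the charge supplies two electrons
cyclooctatetraene = satisfies_huckel(8)      # eight p-electrons, no valid N
```

Benzene and the cyclopentadienyl anion pass the test, while cyclobutadiene and cyclooctatetraene do not. Passing this check is necessary but not sufficient: every ring atom must also be sp2 hybridized.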

2.2.3 Symmetry

A molecule's symmetry is useful for many purposes, including generating canonical labelings (such as when generating a unique SMILES), classifying chirality, detecting degenerate chiral specifications, and eliminating redundant calculations. The Daylight system automatically detects symmetry in the molecules it represents, based on the two-dimensional molecular graph.
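One common way to approximate graph symmetry is iterative refinement: start from a simple atom label and repeatedly split classes according to the classes of each atom's neighbors until nothing changes. The sketch below is a generic illustration under that assumption, with the hypothetical name symmetry_classes; it is not the Daylight algorithm, and plain refinement is known to leave some non-equivalent atoms in the same class in highly regular graphs.

```python
def symmetry_classes(n_atoms, bonds, labels):
    """Assign each atom a class number; atoms sharing a number could not be
    told apart by repeated inspection of their neighborhoods."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def rank(keys):
        # Replace arbitrary sortable keys by small consecutive integers.
        order = {k: r for r, k in enumerate(sorted(set(keys)))}
        return [order[k] for k in keys]

    classes = rank(list(labels))
    while True:
        # Each atom's new key is its own class plus its neighbors' classes.
        keys = [
            (classes[i], tuple(sorted(classes[j] for j in neighbors[i])))
            for i in range(n_atoms)
        ]
        refined = rank(keys)
        if len(set(refined)) == len(set(classes)):
            return refined  # no class was split, so the partition is stable
        classes = refined


# Propane: the two terminal carbons are equivalent, the middle one is not.
propane = symmetry_classes(3, [(0, 1), (1, 2)], ["C", "C", "C"])
# Ethanol heavy atoms (C, C, O): all three atoms are distinct.
ethanol = symmetry_classes(3, [(0, 1), (1, 2)], ["C", "C", "O"])
```

In propane the two end atoms share a class and the central atom has its own; in ethanol every heavy atom ends up in a class by itself.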

2.2.4 Canonical Labeling

A computer representation of a molecule is often built in an arbitrary fashion; one can start with any atom on the molecule and add atoms and bonds in any order. If a "label" (typically a number) is assigned to each atom and bond as it is specified, the labeling is also arbitrary - a different input order of the same molecule results in a different set of labels.

Chemical nomenclature systems such as SMILES require a canonical labeling of the atoms and bonds - a numbering that is independent of the history of the molecule's representation. The Daylight system generates such a labeling whenever it generates a unique SMILES.
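To make the idea of an order-independent labeling concrete, the sketch below tries every renumbering of the atoms and keeps the lexicographically smallest description. This brute-force approach is an assumption made purely for illustration: it takes factorial time, ignores bond types, and is in no way the method the Daylight system uses. The name canonical_form is hypothetical.

```python
from itertools import permutations


def canonical_form(symbols, bonds):
    """Return the smallest (symbols, bonds) description over every
    possible renumbering of the atoms."""
    n = len(symbols)
    best = None
    for perm in permutations(range(n)):
        # perm[old] is the new label of the atom originally numbered `old`.
        new_symbols = [None] * n
        for old, new in enumerate(perm):
            new_symbols[new] = symbols[old]
        new_bonds = sorted(
            tuple(sorted((perm[a], perm[b]))) for a, b in bonds
        )
        candidate = (tuple(new_symbols), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best


# Ethanol heavy atoms entered in two different orders: C-C-O and O-C-C.
first = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
second = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether, C-O-C, has the same atoms but different connectivity.
ether = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])
```

The two ethanol inputs produce identical results regardless of input order, while dimethyl ether produces a different one, which is the property a canonical labeling must have.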

2.2.5 Chirality

Chirality, like bonds, is both an explicit property and a derived property. That is, the Daylight system accepts various chiral specifications on input, though it will sometimes change the specification to a different (but equivalent) form. Like the canonical labeling of atoms discussed above, a "canonical chiral representation" must be chosen if a unique SMILES is desired.

2.3 Representing Reactions

A reaction consists of a set of molecules, each of which plays a specific role in a reaction: reactant, product, or agent. Since reactions are made up of molecules, reactions naturally use the same valence model, bonding, aromaticity, and symmetry rules as molecules. At minimum, a reaction must contain valid molecules based on these rules.

In an ideal world (at least from an information-processing point of view), all reactions would be represented stoichiometrically (every relevant atom shown), and enough information would be present to tell unambiguously which atom was which between the reactants and products. This information would be provided by a pairwise mapping of the reactant and product atoms. In effect, the only difference between the reactant molecule(s) and product molecule(s) would be the bond changes and atom property changes (chirality, charge, aromaticity) that occur during the reaction. If these criteria are met, one can 'superimpose' the reactants and products on one another and represent the reaction as a reaction graph. This is both a complete and compact description of a reaction.

Unfortunately, these stringent requirements can rarely be met for reactions available in electronic form. The Daylight system is designed to be able to represent and store both completely specified (reaction graph-like) reactions and information-deficient reactions in a repeatable and searchable fashion. Although all of the molecules within a reaction must be chemically valid, an overall analysis of the reaction for chemical sensibility is not carried out.

The Daylight system is oriented towards single-step reactions with the following three roles for molecules defined:

Reactant
A starting material. It is expected to participate in the reaction by contributing one or more atoms to the products. This participation is not enforced.
Agent
Agents are molecules that neither contribute atoms to the products nor accept atoms from the reactants. Note that this definition of agents is manifested in the way atom maps are handled; they are ignored in the reaction object and are not part of the lexical definition of reactions (SMILES output). Agents are commonly used for catalysts, solvents, and other adjuncts which participate indirectly in a reaction.
Product
Molecules which are the final result of the reaction. All of the atoms in the product should come from the reactants. This is not enforced.

Note that the above distinctions between reactant, agent, and product all involve the participation of atoms in the reaction. This participation is recorded via the reaction atom map. The atom map simply maps the correspondence of the reactant and product atoms in the reaction. Agents never have meaningful atom maps, since by definition agent atoms do not participate directly in a reaction.
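The roles and the atom map can be pictured with a small data structure. The classes Component and Reaction, the integer map labels, and the method names below are all hypothetical, chosen only to illustrate the bookkeeping; they do not describe the Daylight reaction object. Because the expected participation of atoms is not enforced, the sketch reports unaccounted product atoms without rejecting the reaction.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    role: str    # "reactant", "agent", or "product"
    # One entry per atom: (element symbol, map label or None when unmapped).
    atoms: list


@dataclass
class Reaction:
    components: list = field(default_factory=list)

    def map_labels(self, role):
        """All map labels used by molecules playing the given role."""
        return {
            label
            for c in self.components if c.role == role
            for _, label in c.atoms if label is not None
        }

    def unaccounted_product_atoms(self):
        """Product atoms that are unmapped or have no reactant partner."""
        known = self.map_labels("reactant")
        return [
            (symbol, label)
            for c in self.components if c.role == "product"
            for symbol, label in c.atoms if label not in known
        ]


# Chloromethane + hydroxide -> methanol + chloride, with water as an agent.
complete = Reaction([
    Component("reactant", [("C", 1), ("Cl", 2)]),
    Component("reactant", [("O", 3)]),
    Component("agent", [("O", None)]),
    Component("product", [("C", 1), ("O", 3)]),
    Component("product", [("Cl", 2)]),
])
# The same reaction with the oxygen mapping missing on the product side.
deficient = Reaction([
    Component("reactant", [("C", 1), ("Cl", 2)]),
    Component("reactant", [("O", 3)]),
    Component("product", [("C", 1), ("O", None)]),
])
```

For the fully mapped reaction, unaccounted_product_atoms() returns an empty list. For the information-deficient version it returns the unmapped oxygen, yet both reactions can still be represented and stored.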

Clearly, reactions have additional data which one wants to store about them. The Daylight approach is to only encode the pure structural information in the lexical representation of the reaction and handle the additional data outside of the reaction. A standardized THOR database can allow coupling of the following data about the individual components of a reaction to those components:
  • Stoichiometry
  • Equilibrium constants
  • Yield
  • Conditions (amounts, times, temperature, pressure, operations), which may or may not be expressed in a rigorous language

2.4 Depictions

One of the most important jobs of a chemical information system is to communicate information to the chemist effectively. One of the best ways to do this is graphically, using the standard schematic representation of chemicals familiar to all chemists.

The Daylight system provides an algorithm for generating these schematic diagrams ab initio - a drawing can be made of any molecule or reaction, whether or not it has ever been seen before. When generating a schematic diagram, three criteria are critical:

  • Correctness - The schematic should correctly represent the molecule's graph.
  • Comprehensibility - The schematic should represent the atoms, bonds, and cycles in a sensible way, so that chemists will readily recognize familiar molecules and functional groups.
  • Appearance - The aesthetic appearance is a subjective matter, but the Daylight system attempts to generate depictions that are pleasing to the eye. In some cases this is not possible to achieve in a reasonable time, in which case correctness and comprehensibility take precedence.

