Chemical Information Systems

Cheminformatics is the application of computational tools to solve discovery chemistry problems such as facile exploration of structural and chemical or biological data, library design, and virtual screening. Such an application requires a chemical information system with the following key functions:
  • Effective storage of structures and data
  • Accurate retrieval of structures based upon matching specific search criteria
  • Calculation of structure-derived data such as physical properties and 3D-conformation
  • Execution of in silico chemistry operations such as combinatorial library enumeration
The first step in building such a system requires one to choose a chemical structure model. The structures generated by this model need to be standardized, easily interpreted by a computer, compact, and ideally human readable. The following sections describe the development of the currently available models from a historical perspective.

Jöns Jacob Berzelius introduced the classical system of chemical symbols and formulae, which, with a few minor changes, is still in use today. His system provided a way to easily integrate chemical information with the medium (paper) and method (print) of the day using a collection of simple characters and numbers in algebraic-like expressions.

As time passed, organic chemists recognized the need to represent a molecule as a parent structure substituted with various groups. In particular, Friedrich Konrad Beilstein introduced rigorous methods for classifying, naming, and indexing "related compounds" in his Handbuch, first published in 1880. However, it was Alexander Crum Brown who was a key contributor to the development of a method that represents the spatial relationships between atoms. Below is his 1861 structure for phenol. Note that while Crum Brown's representation provides a mathematical graph of a molecule, where atoms are the nodes and bonds are the edges, it also highlights the issue that molecules are not two-dimensional.


Chemists subsequently approached the issue of projecting three-dimensional information onto a two-dimensional medium in two ways. One way is to circumvent the issue by building physical models, as August Wilhelm Hofmann did by using croquet balls as the atoms and steel rods as the bonds.


The difficulty with this approach is that the information is communicated only to those who can physically see the model. Emil Fischer used the alternative strategy: he created a projection by flattening a physical model.


Linear formulae continued to be used, but with improvements in printing techniques and the adoption of certain standards for atom numbering, graphical representations of structures began to predominate. It was not until the advent of computers that better methods were required. In 1949 William Wiswesser led the way by improving on the standard Berzelian system, streamlining the symbols used and adding structural and connectivity information.

Name: N-ethylaminobenzene
Line Notation: 2MR

Although in practice Wiswesser's notation turned out to be primarily a shorthand for representing the systematic name of a compound, it was used widely until graphic entry technology was introduced. At that time, it became possible to store structure information as a connection table representing the underlying graph, thus expanding the possibilities for subgraph matching. Implicit in modern graph representations is a valence-bond model of structures where vertices are atoms with a variety of atomic properties and edges are bonds, usually constrained to a few defined types (single, double, triple, and aromatic). Below is the N-ethylaminobenzene compound represented in MDL molfile format.
N-ethylaminobenzene
APtclserve04110610582D 0   0.00000     0.00000NCI NS
 
 10 10  0  0  0  0  0  0  0  0999 V2000
    3.732    2.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732    1.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.866    0.750    0.000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.866   -0.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.000   -0.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.000   -1.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.866   -2.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732   -1.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732   -0.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.329    1.060    0.000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  2  0  0  0  0
  5  6  1  0  0  0  0
  6  7  2  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  4  9  1  0  0  0  0
  3 10  1  0  0  0  0
M  END
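
To make the graph view of a connection table concrete, the following is a minimal sketch of reading the molfile above into a list of atoms and a neighbor table. The function name `parse_molfile` is hypothetical, and the sketch assumes the V2000 layout shown here: three header lines, a counts line whose first two 3-character fields hold the atom and bond counts, then one line per atom and one per bond. Atom lines are split on whitespace because the coordinates in this example are not in the usual fixed-width columns.

```python
from collections import defaultdict

MOLFILE = """\
N-ethylaminobenzene
APtclserve04110610582D 0   0.00000     0.00000NCI NS

 10 10  0  0  0  0  0  0  0  0999 V2000
    3.732    2.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732    1.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.866    0.750    0.000 N   0  0  0  0  0  0  0  0  0  0  0  0
    2.866   -0.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.000   -0.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.000   -1.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.866   -2.250    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732   -1.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.732   -0.750    0.000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.329    1.060    0.000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  2  0  0  0  0
  5  6  1  0  0  0  0
  6  7  2  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  4  9  1  0  0  0  0
  3 10  1  0  0  0  0
M  END
"""


def parse_molfile(text):
    """Return (atoms, neighbors) for a V2000 molfile like the one above.

    atoms     -- element symbols; atom number n is atoms[n - 1]
    neighbors -- atom number -> list of (neighbor atom number, bond order)
    """
    lines = text.splitlines()
    counts = lines[3]                      # line 4 is the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []
    for line in lines[4:4 + n_atoms]:
        x, y, z, symbol = line.split()[:4]  # coordinates are not used here
        atoms.append(symbol)

    neighbors = defaultdict(list)
    first_bond = 4 + n_atoms
    for line in lines[first_bond:first_bond + n_bonds]:
        a, b, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        neighbors[a].append((b, order))     # bonds are undirected, so
        neighbors[b].append((a, order))     # record both directions
    return atoms, dict(neighbors)


atoms, neighbors = parse_molfile(MOLFILE)
print(atoms[2], sorted(neighbors[3]))       # the nitrogen and its neighbors
```

Atom 3 is the nitrogen; its neighbor list contains the ethyl carbon (atom 2), the ring carbon (atom 4), and the explicit hydrogen (atom 10).
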
Despite many size and format constraints, connection tables are easily handled by computers. Unfortunately, connection tables and many of the other linear notation systems developed during this era, such as ROSDAL, are not easily interpreted by humans. It was not until the invention of SMILES™ in 1989 that a structure could be represented as a graph both in text documents and in a computer system. A SMILES™ string, such as that shown below for the N-ethylaminobenzene compound, is human understandable, very compact, and, if canonicalized, a unique string that can be used as a universal identifier.

CCNc1ccccc1
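
As an illustration of how a SMILES string encodes the same graph as a connection table, here is a minimal sketch of a parser for a small subset of the notation. It is not a full SMILES implementation: the function name `parse_smiles` is hypothetical, and the sketch assumes unbracketed atoms, the bond symbols `-`, `=`, and `#`, parentheses for branches, and single-digit ring closures. Hydrogens stay implicit, so the string above yields 9 atoms and 9 bonds, one fewer of each than the molfile, which lists the N-H hydrogen explicitly.

```python
import re

# Tokens of the supported subset: unbracketed atoms, bond symbols,
# branch parentheses, and single-digit ring closures.
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[-=#]|[()]|\d")
BOND_NAMES = {"-": "single", "=": "double", "#": "triple"}


def parse_smiles(smiles):
    """Return (atoms, bonds) for a SMILES string in the supported subset."""
    atoms = []        # element symbols; lowercase marks an aromatic atom
    bonds = []        # (atom index, atom index, bond type)
    prev = None       # atom that the next atom will be bonded to
    pending = None    # explicit bond symbol waiting for its second atom
    branches = []     # stack of branch points for "(" ... ")"
    open_rings = {}   # ring digit -> (atom index, bond symbol at opening)

    def bond_type(a, b, symbol):
        if symbol is not None:
            return BOND_NAMES[symbol]
        both_aromatic = atoms[a].islower() and atoms[b].islower()
        return "aromatic" if both_aromatic else "single"

    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported SMILES syntax at position {pos}")
        token = match.group()
        pos = match.end()
        if token == "(":
            branches.append(prev)           # remember where the branch starts
        elif token == ")":
            prev = branches.pop()           # return to the branch point
        elif token in BOND_NAMES:
            pending = token
        elif token.isdigit():
            if token in open_rings:         # second occurrence closes the ring
                other, symbol = open_rings.pop(token)
                bonds.append((other, prev,
                              bond_type(other, prev, pending or symbol)))
            else:                           # first occurrence opens it
                open_rings[token] = (prev, pending)
            pending = None
        else:                               # an atom symbol
            atoms.append(token)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, bond_type(prev, index, pending)))
            prev = index
            pending = None
    return atoms, bonds


atoms, bonds = parse_smiles("CCNc1ccccc1")
print(len(atoms), "atoms,", len(bonds), "bonds")
```

The ring-closure digit `1` is what turns the otherwise linear string into a cyclic graph: it adds a bond between the first and last aromatic carbons.
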

The second step in building a chemical information system is to ensure that appropriate mechanisms are available to retrieve structures that match a target structure or pattern in a variety of ways, as in the following examples.

Target structure: Cytosine, NC1=NC(O)=NC=C1
Target pattern: Oxygen- or nitrogen-substituted pyrimidine, [N,O]C1=NC(O)=NC=C1

Search types:
  • Match structure
  • Tautomeric structure
  • Contain structure
  • Similar to structure
  • Contain pattern
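
A "contain pattern" search is, at its core, a subgraph match. The following is a minimal backtracking sketch, not the algorithm of any particular cheminformatics toolkit. The function name `find_match` and the hand-built graphs are assumptions made for illustration: each pattern atom is a set of allowed elements (so `{"N", "O"}` plays the role of `[N,O]`), bonds are stored as a dictionary from atom-index pairs to bond orders, and the structures are compared in the exact Kekulé form written above, with no handling of aromaticity or tautomers.

```python
def find_match(pattern_atoms, pattern_bonds, mol_atoms, mol_bonds):
    """Return a list mapping pattern atom i to a molecule atom, or None."""

    def order(bonds, a, b):
        # Each bond is stored once per pair, so look up both directions.
        return bonds.get((a, b)) or bonds.get((b, a))

    def extend(mapping):
        i = len(mapping)                    # next pattern atom to place
        if i == len(pattern_atoms):
            return mapping
        for j, element in enumerate(mol_atoms):
            if j in mapping or element not in pattern_atoms[i]:
                continue
            # Every pattern bond to an already placed atom must exist in
            # the molecule with the same order.
            consistent = all(
                order(pattern_bonds, k, i) is None
                or order(pattern_bonds, k, i) == order(mol_bonds, mapping[k], j)
                for k in range(i)
            )
            if consistent:
                found = extend(mapping + [j])
                if found is not None:
                    return found
        return None                         # dead end: backtrack

    return extend([])


# Graph of NC1=NC(O)=NC=C1, atoms numbered 0..7 in the order written.
BONDS = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (3, 4): 1,
         (3, 5): 2, (5, 6): 1, (6, 7): 2, (7, 1): 1}
cytosine = ["N", "C", "N", "C", "O", "N", "C", "C"]
oxygen_analog = ["O", "C", "N", "C", "O", "N", "C", "C"]
sulfur_analog = ["S", "C", "N", "C", "O", "N", "C", "C"]

# Pattern [N,O]C1=NC(O)=NC=C1: the first atom may be nitrogen or oxygen.
pattern = [{"N", "O"}, {"C"}, {"N"}, {"C"}, {"O"}, {"N"}, {"C"}, {"C"}]

for name, mol in [("cytosine", cytosine), ("oxygen analog", oxygen_analog),
                  ("sulfur analog", sulfur_analog)]:
    print(name, find_match(pattern, BONDS, mol, BONDS) is not None)
```

Cytosine and the oxygen analog match the pattern, while the sulfur analog does not. Backtracking of this kind has exponential worst-case cost, which is one reason large databases are usually pre-screened before atom-by-atom matching.
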

Incorporation of the ability to calculate structure-derived data such as physical properties and conformational information is the third step in building a chemical information system. A sample of typical calculated properties is listed below.

Name                 phentermine          morphine
Structure            CC(C)(N)Cc1ccccc1    Oc1ccc2CC3N(C)CCC45C3C=CC(O)C4Oc1c25
Molecular weight     149.26               285.37
H bond donors        2                    2
H bond acceptors     2                    7
Ring count           1                    5
ClogP                1.90                 0.76
Rigidity             0.5758               1.0000
Polar surface area   26.02                52.93
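
Some of these properties follow directly from the structure graph. As a small worked example, the ring count equals the number of ring-closure bonds, because each closure adds one bond beyond those needed to connect the atoms in a tree. The sketch below counts ring-closure digits in a SMILES string. The function name `ring_count` is hypothetical, and the sketch assumes single-digit ring closures only (no `%nn` closures); bracket atoms are stripped first so that digits inside them are not miscounted.

```python
import re


def ring_count(smiles):
    """Count ring closures in a SMILES string (single-digit closures only)."""
    # Drop bracket atoms such as [13C] or [NH4+]; their digits are not
    # ring-closure labels.
    unbracketed = re.sub(r"\[[^\]]*\]", "", smiles)
    digits = sum(ch.isdigit() for ch in unbracketed)
    if digits % 2:
        raise ValueError("unmatched ring-closure digit")
    return digits // 2                      # each ring uses a pair of digits


phentermine = "CC(C)(N)Cc1ccccc1"
morphine = "Oc1ccc2CC3N(C)CCC45C3C=CC(O)C4Oc1c25"
print(ring_count(phentermine), ring_count(morphine))
```

For the two structures above this gives 1 and 5, in agreement with the ring counts listed for phentermine and morphine.
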


Lastly, a good chemical information system should be able to execute various in silico operations on registered compounds. One example is the ability to perform a reaction transform on a set of compounds, as illustrated below.

[H][N:4]([H])[C] . [C:1](=[O:2])[Cl] >> [C:1](=[O:2])[N:4]([H])[C].[Cl][H]
         Amine       +  Acid chloride     >>      Amide
Transforms 1 to 4: each shown as reactants >> product.
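
Applying one transform across sets of reactants is the basis of combinatorial library enumeration. The sketch below is a deliberately simplified stand-in: it does not interpret the reaction transform above, but joins SMILES strings under a writing convention. It assumes that each amine is written with its NH2 nitrogen first and bonded to carbon (the string starts with `NC`), and that each acid chloride is written with the acyl chloride group last (the string ends with `C(=O)Cl`). The helper `amide_from` and the reactant lists are hypothetical, and the HCl byproduct is simply dropped.

```python
from itertools import product


def amide_from(amine, acid_chloride):
    """Join an amine and an acid chloride SMILES into an amide SMILES.

    Hypothetical helper that relies on how the inputs are written,
    not on real reaction-transform matching.
    """
    if not amine.startswith("NC") or amine.startswith("NCl"):
        raise ValueError("amine must be written N-first, bonded to carbon")
    if not acid_chloride.endswith("C(=O)Cl"):
        raise ValueError("acid chloride must end with C(=O)Cl")
    # Remove the leaving chlorine and attach the nitrogen in its place.
    return acid_chloride[:-2] + amine


amines = ["NCC", "NCc1ccccc1"]                      # ethylamine, benzylamine
acid_chlorides = ["CC(=O)Cl", "c1ccccc1C(=O)Cl"]    # acetyl, benzoyl chloride

# Full combinatorial library: every acid chloride with every amine.
library = [amide_from(a, c) for c, a in product(acid_chlorides, amines)]
for smiles in library:
    print(smiles)
```

Two amines and two acid chlorides give four amides. A production system would instead match the transform against each reactant graph, so that the result does not depend on how the input strings happen to be written.
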

Within the past 20 to 30 years, computers and cheminformatics software have evolved to the point where nearly all chemists routinely access huge chemical databases from their desktops. Many millions of structures of known compounds and many times that number of virtual compounds are stored as graphs and can be explored efficiently by chemists and biologists. Cheminformatics has revolutionized discovery chemistry.