MUG '05 -- 19th Daylight User Group Meeting -- 9-11 Mar 2005

Chemical information models

Dave Weininger
Daylight/Metaphorics

ABSTRACT

Information modeling is at the heart of informatics. The molecular information model is seriously cool. The ability to represent pure substances as molecular structures has transformed chemistry from a science of substances into a molecular science in less than 100 years.

A basic chemical information "trick" is representing a molecule in a computer. Doing just that is useful for things like molecular structure registration.

The next step is developing the information model itself -- how to connect properties to the representations of molecular entities in a computer in order to do something more useful than just making lists, e.g., connecting them to external information resources. For the most part, we have been stuck using the same data models that are used for registration. A number of different chemical information models will be presented with real-world examples, and the prospects for chemical data integration will be discussed.


MUG '05 : Chemical information models
How can one do anything useful with a computer?

One usually needs to represent ("model") real things in a computer program using digital representations ("models").


MUG '05 : Chemical information models : Preliminaries : 1

What chemical entities are useful to represent?


MUG '05 : Chemical information models : Preliminaries : 2

What operations on chemical entities are useful?


MUG '05 : Chemical information models : Preliminaries : 3

How can we represent chemical entities?

Various representations have different advantages

For instance:

MUG '05 : Chemical information models : Properties

Direct association of properties with chemical entities

  1. Direct properties of molecular structure(s)

    [These] are the [property] values of [these molecular structures].

    Simplest possible chemical information model: molecular structure is the identifier, properties are connected to it. Perfect for the clean, idealistic world of non-overlapping molecular chemistry. Very powerful, but not comprehensive and not very useful in real life (IRL).

    Examples: TSCA (Toxic Substances Control Act), Empath (metabolic pathways), primary tables in the CRC Handbook, Chemist's Companion, etc.

  2. Direct properties of hierarchical molecular structure(s)
    [These] are the [property] values of [these specific kinds of] [generic molecular structures].
    Entities are identified by molecular structure level-of-detail hierarchy. Properties are connected to entities at the appropriate level. Clean and idealistic yet more powerful and more useful than above.

    Examples: MedChem masterfile (generic/isomers/ionic forms), QSAR (sets of molecules/molecular patterns/compounds)
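
The two direct models above can be sketched in a few lines of Python. This is a minimal illustration, not how any Daylight product stores data. It assumes the SMILES strings used as keys are already canonical (no canonicalization is done here), and the names `direct`, `parent`, `hier` and `lookup` are hypothetical.

```python
# Model 1: the molecular structure itself is the identifier.
# Keys are SMILES strings, assumed here to be canonical already.
direct = {
    "CCO": {"formula": "C2H6O"},        # ethanol
    "CC(O)=O": {"formula": "C2H4O2"},   # acetic acid
}

# Model 2: structures sit in a level-of-detail hierarchy.
# `parent` maps a more specific structure to its more generic form
# (here: two stereoisomers of lactic acid -> the non-stereo structure).
parent = {
    "C[C@H](O)C(O)=O": "CC(O)C(O)=O",
    "C[C@@H](O)C(O)=O": "CC(O)C(O)=O",
}

# Properties are attached at the level where they apply.
hier = {
    "CC(O)C(O)=O": {"formula": "C3H6O3"},
    "C[C@H](O)C(O)=O": {"stereo": "one enantiomer"},
    "C[C@@H](O)C(O)=O": {"stereo": "other enantiomer"},
}

def lookup(structure, prop):
    """Return prop for structure, falling back to more generic levels."""
    node = structure
    while node is not None:
        value = hier.get(node, {}).get(prop)
        if value is not None:
            return value
        node = parent.get(node)  # None once the top of the hierarchy is passed
    return None

print(direct["CCO"]["formula"])
print(lookup("C[C@H](O)C(O)=O", "formula"))
```

A property stored on the generic structure is visible from each isomer, while a property stored on one isomer is not visible from the generic level.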


MUG '05 : Chemical information models : Properties, cont.

Associate properties with arbitrary molecular identifiers

  1. Properties of an arbitrary molecular identifier
    The [entity with this registration number] has [this molecular structure] and has [these] [property] values.
    Common model for traditional chemical registries. All possible molecular entities are represented by registration numbers; properties are assigned to these entities. Requires "god-like" (omniscient) structure identification and discrimination methods ... which IRL become unstable over time when used by normal human beings. Other problems include poor behavior with incomplete structural knowledge and the need to develop a religious "group or split" dogma. OK for closed, static, short-term delivery of homogeneous data.

    Examples: CAS, MACCS, WDI, some registration systems

  2. Properties of an arbitrary molecular set identifier
    The [entity with this registration number] contains [these molecular structures] and has [these] [property] values.

    Similar to the systems above, but for a multiplicity of molecules. The problem with god-like systems is even worse than for discrete entities.

    Example: Most "chemical" USPTO Patents (with legal caveat)
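
The registration-number models, and the "group or split" decision they force, might look like the following sketch. The `Registry` class, its `policy` flag and the `R0001`-style numbers are all hypothetical. Structures are compared as plain strings, so any real system would first need the reliable structure discrimination discussed above.

```python
from itertools import count

class Registry:
    """Toy registry: a registration number identifies a set of structures."""

    def __init__(self, policy="group"):
        self.policy = policy          # "group" or "split"
        self.records = {}             # regno -> {"structures": [...], "props": {...}}
        self._by_structures = {}      # frozenset of structures -> latest regno
        self._next = count(1)

    def register(self, structures, **props):
        key = frozenset(structures)
        # "group": same structure set -> same entity, properties are merged.
        if self.policy == "group" and key in self._by_structures:
            regno = self._by_structures[key]
            self.records[regno]["props"].update(props)
            return regno
        # "split" (or first sighting): every registration is a new entity.
        regno = f"R{next(self._next):04d}"
        self.records[regno] = {"structures": sorted(key), "props": dict(props)}
        self._by_structures[key] = regno
        return regno

grouped = Registry(policy="group")
a = grouped.register(["CCO"], source="lot 1")
b = grouped.register(["CCO"], source="lot 2")
print(a == b)   # the second registration lands on the same entity

split = Registry(policy="split")
c = split.register(["CCO"], source="lot 1")
d = split.register(["CCO"], source="lot 2")
print(c == d)   # each registration gets its own number
```

A one-element list gives the single-structure case; a longer list gives the set case. Under "group" the second registration overwrites `source`, and under "split" the same structure appears twice. Neither choice is obviously right, which is the dogma problem in miniature.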


MUG '05 : Chemical information models : Properties, cont.

Associate properties with arbitrary identifiers

  1. Property of arbitrary identifier
    The [entity with this registration number] has [these] [property] values.
    [This molecular identifier] is associated with [this entity].
    Often used for the chemical portion of a larger non-molecular database system. Entities and property-associations are not necessarily molecular. In such systems there are usually no requirements for uniqueness, exclusiveness, or comprehensiveness. This is the weakest chemical information model.
  2. Property of arbitrary set identifier
    The [entity with this registration number] contains [these molecular identifiers] and has [these] [property] values.
    Often used when database entities are mixtures of things which are not necessarily molecular, or when non-molecular components are essential to the definition of a database entity.

    Examples: FDA Orange Book (NDAs), MSDS collections
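
A sketch of the arbitrary-identifier models: properties hang on the entity, and molecular identifiers are merely associated with it in a separate table. The entity IDs, property names and values are invented for illustration, and nothing enforces uniqueness or completeness, as in the models described above.

```python
# Entities carry their own properties; they need not be molecules at all.
entities = {
    "P1": {"form": "solution"},
    "P2": {"form": "tablet"},
    "P3": {"form": "device"},        # no molecular component recorded
}

# Loose many-to-many association between entities and molecular identifiers.
# Duplicates and gaps are allowed; the table makes no promises.
associations = [
    ("P1", "CCO"),
    ("P1", "O"),
    ("P2", "CCO"),
]

def components(entity_id):
    """Molecular identifiers associated with one entity (possibly none)."""
    return [mol for ent, mol in associations if ent == entity_id]

def entities_containing(mol_id):
    """Entities associated with a given molecular identifier."""
    return sorted({ent for ent, mol in associations if mol == mol_id})

print(components("P1"))
print(entities_containing("CCO"))
```

Because the molecular identifier is not the key, a question like "what are the properties of this molecule?" can only be answered indirectly, through whichever entities happen to mention it.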


MUG '05 : Chemical information models : Properties, cont.

Reverse association of molecules with propertied entities

  1. Molecular elucidation
    The [entity with this registration name or number] has [these] [property] values. [These molecular structures] are associated with [this entity].
    This has a similar "shape" to the "Property of arbitrary identifier" model above, but in this case, molecular structure-property associations are derived from the existence of specific molecular structures in database entities.

    Examples: TCM (Traditional Chinese Medicines), sample databases at analytical labs

  2. Document databases
    This [database entity is a document] which states that [these molecular structures] have [these] [property] values.
    Document databases are special databases which normally contain non-molecular primary entities (e.g., journal articles, patents) which in turn reference molecular entities (e.g., compounds, reactions). They normally can be "inverted" to form a chemical information data set of one of the kinds mentioned above.

    Examples: Spresi, MSDS
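
The "inversion" of a document database mentioned above can be sketched as a simple re-keying: statements indexed by document become statements indexed by structure. The document IDs, the `role` property and its values are made up for the example, and `invert` is a hypothetical helper.

```python
from collections import defaultdict

# Primary entities are documents; each one makes statements of the form
# (molecular structure, property, value).
documents = {
    "doc-1": [("CCO", "role", "solvent")],
    "doc-2": [("CCO", "role", "reagent"), ("CC(O)=O", "role", "product")],
}

def invert(docs):
    """Re-key document statements by molecular structure."""
    by_structure = defaultdict(list)
    for doc_id, statements in docs.items():
        for structure, prop, value in statements:
            # Keep the source document so each value stays attributable.
            by_structure[structure].append((prop, value, doc_id))
    return dict(by_structure)

inverted = invert(documents)
print(inverted["CCO"])
```

The inverted form resembles the direct structure-property model, except that one structure may carry several, possibly conflicting, values for the same property, each traceable to its document.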


MUG '05 : Chemical information models : Example 1

Example: WDI - Derwent's World Drug Index


MUG '05 : Chemical information models : Example 2

Example: Empath - Metabolic Pathways


MUG '05 : Chemical information models : Example 3

Example: Orange - FDA Orange Book (NDAs)


MUG '05 : Chemical information models : Summary

Concluding thoughts about chemical information models