MUG '01 -- 15th Daylight User Group Meeting -- 6 - 9 Mar 2001

Using molecular structure to build bridges:
Chinese, allopathic and post-genomic medical traditions.

Dave Weininger
Daylight CIS.


ABSTRACT

A new type of information management system is being developed which treats relationships as primary entities. Even a modest database contains a vast number of relationships of possible interest, most of which are inaccessible when using conventional database management systems. The number of such relationships between data in different databases is vaster still: it's staggering. The challenge is to build a rigorous and effective system which provides access to such data, yet is easily understood by users.

This report summarizes progress of the fedora project which is currently alpha- or beta-testing 16 database services. Servers in three fields of knowledge are demonstrated here: Traditional Chinese medicine, allopathic pharmacology, and protein-ligand interactions. These non-trivial data sets illustrate the power and limitations of such an information system. The discussion covers strategy, design, implementation and actual deployment.

This presentation also serves as an announcement of a demo system available at online.daylight.com via the internet.


Relationships as knowledge

Most human knowledge is about relationships; most scientific research is about discovering relationships.

Contrariwise, most computer database systems (e.g., RDBs, ORDBMSs, Thor) encapsulate information in fixed relationships. Ironically, one can't store relationships as data in computer databases even in (especially in) relational database management systems, in which all possible data relationships are established beforehand (i.e., in the database schema). Fixed data models are extremely useful for storing and regurgitating well-understood information.

The same features which allow highly efficient data archival form a tremendous limitation for other purposes such as integration of diverse knowledge bases and discovery in general.

Relevance to scientific research

Scientific research starts with a hypothesis, which is often a relationship of one kind or another. Data is obtained (gathered or generated) which supports or refutes the original hypothesis. Very often, the result of such research is the generation of one or more additional hypotheses, i.e., more relationships to keep track of and to test. Researchers operate at a great disadvantage because the relationships themselves are not storable in (most) databases, e.g., entered as data, tabulated, corrected, searched, etc. From an IT point of view, scientists are a real problem: research scientists are notorious for noticing unexpected relationships. But it's not their fault, it's their job. The problem is that they then ask questions which were not anticipated when a given data system was designed. It seems self-evident that more flexible data systems would be a boon for research purposes.

This report

A new type of information management system is being developed which is not subject to such limitations, insofar as this is possible. We are 8-9 months into this project. One major false start was followed by a number of very successful proof-of-principle and prototype services. The system is currently being integrated into a production data system (fedora); the first 12 data services are in alpha or beta tests now. If all goes well, they will appear as production software by the end of this year (in Daylight Software Release 4.81).

The experiment

The work

The basic approach used here is to divide information into logically separate knowledge bases, represent each as well as possible, and federate the results into a cohesive whole (described in detail below).

The data

A relatively large number (>20) of different kinds of data were explored. Three are described here.

WDI, an allopathic pharmacopoeia

The World Drug Index from Derwent is a comprehensive collection of registered drugs and reported trial preparations along with structures, names and pharmacology. One big advantage is that its contents are well understood [like mother's milk to most of us]. It contains a rich set of pharmacological fields with intrinsic relationships which are not well represented by typical object-property models, e.g., pharmacological activity vs. mechanism of action, or precautions and warnings vs. contraindications vs. adverse effects. High-quality structures are available for a large majority of the entries, so we can conveniently leverage our expertise with structural relationships (e.g., substructure, superstructure, and similarity of molecular features). These are extended with newly-formalized molecular relationships such as "common-parent structure".
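The "common-parent structure" relationship can be illustrated with a toy heuristic: treat the largest component of a dot-disconnected SMILES as the parent, so that different salt forms of the same drug group together. Real parent/salt handling is far subtler; the structures and names below are invented for illustration.

```python
def parent(smiles: str) -> str:
    """Crude 'parent structure' guess: for a multi-component
    (dot-disconnected) SMILES, keep the largest component, dropping
    small counterions and solvates."""
    return max(smiles.split("."), key=len)

# Two hypothetical salt forms of the same (illustrative) drug.
forms = {
    "drug (hydrochloride)": "CNC(C)C(O)c1ccccc1.Cl",
    "drug (hemihydrate)":   "CNC(C)C(O)c1ccccc1.O",
}

# Grouping by parent reveals the common-parent relationship.
parents = {name: parent(smi) for name, smi in forms.items()}
```

Entries whose parents compare equal are related by "common-parent structure", regardless of which salt form was registered.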

The World Drug Index is implemented as a single fedora service:

wdi
The wdi server uses the same data (same datatrees) as the production Thor database. A demo version is available which contains under 5000 of the 65000+ entries in the current version (wdi011). wdi establishes data relationships with tcm and planet (as available). [Of the servers in this release, wdi has the most complete interface and documentation.]
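One kind of relationship-oriented query such a server makes possible is grouping entries by shared annotations. A minimal sketch of deriving "shares a mechanism of action" pairs between drugs; the records and field values below are invented for illustration, not taken from wdi011:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical WDI-style records: drug name -> mechanism-of-action
# annotations. Purely illustrative data.
records = {
    "drugA": {"nitric oxide synthase inhibitor"},
    "drugB": {"nitric oxide synthase inhibitor", "ACE inhibitor"},
    "drugC": {"ACE inhibitor"},
    "drugD": {"antibiotic"},
}

def shared_mechanism_pairs(records):
    """Invert drug->mechanism into mechanism->drugs, then emit every
    drug pair sharing at least one mechanistic annotation."""
    by_mech = defaultdict(set)
    for drug, mechs in records.items():
        for m in mechs:
            by_mech[m].add(drug)
    pairs = set()
    for drugs in by_mech.values():
        for a, b in combinations(sorted(drugs), 2):
            pairs.add((a, b))
    return pairs
```

The same inversion, applied to protein-oriented mechanistic fields, is what would let a system propose "operate on the same metabolic pathway" relationships.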

Interestingly, good ol' WDI is changing in ways which cry out for a more relationship-oriented data system. Two main changes were reported by Richard Lambert at EMUG'00. The "Mechanism of action" field is being revised to be more mechanistic, e.g., fields which might have originally contained "antibiotic" are being revised to things like "nitric oxide synthase inhibitor", as the information becomes available. The amount of such data already available is impressive: there are 4,800 such protein/enzyme-oriented mechanistic fields in wdi011. [This is neat, since we should be able to derive sensible relationships between drugs such as "operate on the same metabolic pathway".] The other change is the introduction of "combination preparations". This area needs more work, but these obviously form further relationships which must be considered when correlating observed pharmacology data. Many of the combination preps contain natural products and herbs, which brings us to ...

Traditional Chinese Medicines

Traditional Chinese medicine is almost entirely based on observations about relationships. Many of these relationships are intrinsic to the traditional Chinese model of the body, e.g., flow of qi, yin/yang, channels and organs, etc. There are also many informatic relationships, e.g., effects are known for specific TCMs (many-to-one), which are compounded from biological sources (mostly plants, one-to-many), which are ultimately sets of pseudo-equivalent species (one-to-many), for which we know chemical constituents (one-to-many), often with observed mechanisms of action (many-to-many). One confounding factor is that all this is represented in a variety of languages: medical knowledge in Han (Mandarin, with TCM names also in Pinyin), medicinal sources in Pinyin and Latin (e.g., family, genus, species), pharmacology information in English (e.g., preparation methods), and molecular structures (now) in SMILES. If that were not confusing enough, there are two distinct ways of writing Han (traditional and simplified) and though most Chinese practitioners (who are in the PRC) use simplified characters, most computerized data sources use traditional characters and few (so far) use Unicode. Also, there are a relatively large number of specialized characters for medicine and pharmacology which don't appear in normal bilingual dictionaries (or character sets, i.e., fonts!). It's a difficult set of data to represent properly, but given a system which can handle relationships with ease, not quite impossible.

One saving grace is that high-quality constituent molecular structures are available for most TCMs (~95%) and most of these (~75%) are familiar or very similar to something with known pharmacology in isolation. Molecular structure really is a universal language. American and Chinese chemists might not be able to say "Please pass the sugar" to each other over the dinner table, but when it gets down to the molecular structure of sucrose, they draw identical structures, biosynthetic pathways, physical properties, metabolic utilization, etc.

Chinese medical data is divided into four separate fedora information services:

tcm
An electronic version of the book Traditional Chinese Medicines by Yan, Zhou, Xie and Milne, as published by Ashgate. The tcm server uses the same data (same datatrees) as the soon-to-be-released Thor database tcm00. tcm's basic trick is a marriage of two kinds of data objects: biological entries (i.e., the sources of Chinese medicines) and chemical entries (structures observed in those sources). Most tcm data are in English and SMILES, with Latin and Pinyin data used for cross-referencing. tcm is loosely coupled to dcm and park, and establishes data relationships with wdi and planet.
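tcm's marriage of biological and chemical entries amounts to one-to-many mappings that can be inverted on demand. A minimal sketch in Python, with illustrative entries (the real tcm00 datatrees carry far more detail, and the species/constituent pairs below are merely plausible examples):

```python
# Hypothetical tcm-style linkage: biological sources (Latin binomials)
# to observed chemical constituents (SMILES). Illustrative data only.
source_constituents = {
    "Ephedra sinica":     ["CNC(C)C(O)c1ccccc1"],   # ephedrine
    "Ephedra intermedia": ["CNC(C)C(O)c1ccccc1"],   # pseudo-equivalent species
    "Camellia sinensis":  ["Cn1cnc2c1c(=O)n(C)c(=O)n2C"],  # caffeine
}

def constituents_to_sources(source_constituents):
    """Invert the one-to-many source->constituent mapping so a structure
    can be traced back to every source it has been observed in."""
    index = {}
    for source, smiles_list in source_constituents.items():
        for smi in smiles_list:
            index.setdefault(smi, []).append(source)
    return index
```

The inverted index is what connects a structure seen in one knowledge base (e.g., a wdi drug) back to its biological sources in tcm.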

dcm
An electronic version of the book A Practical Dictionary of Chinese Medicine by Wiseman and Feng, as published by Paradigm. The dcm server processes data in a specialized format used for multilingual typesetting (CTEX). dcm is more like an encyclopedia than a dictionary. dcm maintains a large number (220,000+) of searchable fields in four languages (English, traditional Chinese, Pinyin, and Latin) and a huge number (many millions) of derived relationships. dcm is tightly coupled to zi4, which allows Chinese readers the option of displaying traditional or simplified characters.

zi4
is the Chinese character for character. zi4 is also a non-interactive utility which provides various kinds of Chinese character image and mapping services to other servers. It knows about Unicode, traditional Chinese characters (BIG5), simplified Chinese characters (GB), and Pinyin tone symbols. It also knows about specialized Chinese characters used in medicine and pharmacology (this is not yet complete). zi4's main trick is delivering Chinese characters in a way which is useful on any web browser (GIF89a's). It also implements a neat algorithm for fast and reliable translation of Traditional Chinese to Simplified Chinese. [Someday, though not soon, we will all use Unicode and zi4 will not be needed.]
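The table-driven idea behind traditional-to-simplified conversion can be sketched in a few lines of Python. The three-character mapping below is hand-picked for illustration; zi4's actual algorithm and character tables are not reproduced here, and a production mapping must handle many-to-one cases with care:

```python
# Tiny illustrative traditional -> simplified character table.
# Characters identical in both scripts pass through unchanged.
TRAD_TO_SIMP = {
    "藥": "药",  # yao4, medicine
    "醫": "医",  # yi1, medical
    "學": "学",  # xue2, study
}
_table = str.maketrans(TRAD_TO_SIMP)

def to_simplified(text: str) -> str:
    """Character-by-character traditional-to-simplified conversion
    using a precomputed translation table."""
    return text.translate(_table)

print(to_simplified("中醫藥學"))  # -> 中医药学
```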

park
park is an HTTP-based archive for annotated images such as photographs. Its intended purpose here is to provide images of the sources of Chinese medicines (i.e., photos of plants and animals). There isn't much content yet, due to the [surprising] difficulty of obtaining good photographs of the correct species.

Planet, a protein ligand association network

Each primary data object in planet is a protein-ligand association, i.e., one or more proteins and one or more ligands, with an association such as an observed binding, computed docking, or molecular transformation. All planet associations are currently 3-D geometrical associations but this is not necessary in principle. For example, protein-ligand relationships can represent reactions (i.e., metabolic pathway steps) without known 3-D geometry. It is not yet clear whether a separate server for metabolic pathways will be required. Planet represents data objects canonically and addresses them by content (including relationships between data objects).

Planet is designed for exploratory data analysis of archived results rather than as a compute service (e.g., for docking). It serves both archival and exploratory functions. As an archive, original data is retained and served. As an exploratory tool, a number of search methods are provided for discovering relationships between proteins, ligands and their complexes. For instance, a novel method is used to evaluate ligand similarity which operates well with the poor oxidation state information typical of crystallographic data. It is implemented as a single fedora service:

planet
The planet server archives original PDB files as well as system-specific data (primarily CEX streams). Currently, the planet dataset consists of 177 observed complexes from the PDB. planet establishes data relationships with tcm and wdi (as available).

A side note about "post-genomic medical traditions". It seems a bit peculiar to be talking about post-genomic "traditions" in the same talk as Chinese medicine, where the traditions are 2-3 millennia old. However, they are surprisingly similar from the informatics POV. So now we have the human genome, the map of the human bean, the 100,000 (or maybe 35,000) "ideas" which make the human machine work. To be more precise, we have the native-human-encoding of the map. The "post-genomic" task is now to figure out how they interact with each other, a much bigger task by any measure. One important aspect of this is metabolism, in particular, enzyme-catalyzed biosynthesis and bioregulation. Much of the direct evidence for this reductionistic science comes from X-ray observations of crystal structures. The way we encode this information is historical and almost as nonsensical as evolution itself: the Protein Data Bank format. It isn't a long tradition, but it's an incredibly strong and bizarre one. After doing both, let me just note that it is much easier to encode Traditional Chinese Medicine semantics (Mandarin, Latin and all) than it is to encode the semantics of real-life PDB files in a rigorous manner (a task not yet complete).

The relationships

There are literally millions of intrinsic (within-database) relationships in these three data sets. A few examples: a drug and its mechanism of action in wdi; a biological source, its pseudo-equivalent species, and their chemical constituents in tcm; a protein, a ligand, and their observed complex in planet.

There are many more relationships between databases, e.g., a constituent structure in tcm matching (or similar to) a registered drug in wdi, or a mechanistic field in wdi naming a protein observed in a planet complex.

Similarity is as important a relationship as identity

The starting point of reductionistic informatics is that we can record something we know and associate it with something we know it to be about, i.e., the basis for the object-property information model. Insofar as this is true, identity is a sufficient operation and measures of similarity can be treated as subservient methods which are used to re-organize the pieces of perfectly-represented information. For instance, we have many databases of molecular properties which are associated with a valence model of a molecule. It is useful to select and sort them in various ways such as molecular weight, commonality of features, observed or predicted physical properties. Such operations do not affect the underlying representation of the information (the data itself) nor of the knowledge (the connection of the data to the model). This approach works well for much of chemistry and is why thor/merlin/daycart is useful.

Unfortunately, most knowledge is imperfect. Data recording errors are practically, though not theoretically, inevitable. Even if we had flawless information encoding, storage and retrieval, it would not be possible to create an information system which represents nature perfectly. All non-trivial scientific databases must somehow accommodate the fact that observations are subject to error: multiple, valid, conflicting observations exist. A deeper type of error is in the model of the observed object. Multiple, valid, conflicting ways exist for representing most things of interest in the world (i.e., models). In chemistry, for instance, molecules can be represented by empirical, valence, LCAO, MO and other models. Many valid representations are possible even within the domain of the "valence model". Working with multiple sources of data which use different models inevitably leads to some ambiguity.

One approach to addressing this problem is to dictate a single information model and thus define a "data universe" in which everything is consistent. This approach has two main problems: the lesser problem is that it's a lot of work, the greater problem is that information is lost during normalization. As information scope increases, so do adverse effects of these problems.

An effective alternative is to provide field-specific methods for representing different types of similarities which are appropriate to particular types of data. Our old favorite "binary Tanimoto similarity" is often suitable for chemical databases with well-known molecular structures, but a different similarity measure is required when oxidation state information is not available (e.g., crystallographic observation of ligands). Exact and approximate (e.g., ARES) substring matching is appropriate for some kinds of data (e.g., free English text), but other methods are required for different kinds of text (e.g., Chinese text, Pinyin phrases, peptide sequences). When examining relationships between information in different knowledge bases, such methods are more than organizational: they are the primary methods of defining relationships. Thus, similarity (selection and sorting criteria in general) becomes as important a concept as identity.
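Binary Tanimoto similarity itself is simple to state: the number of bits two fingerprints share, divided by the number of bits set in either. A sketch, modeling a fingerprint as the set of its "on" bit positions:

```python
def tanimoto(fp1: set, fp2: set) -> float:
    """Binary Tanimoto similarity: |A & B| / |A | B|.
    Fingerprints are modeled here as sets of 'on' bit positions."""
    if not fp1 and not fp2:
        return 1.0  # convention: two empty fingerprints are identical
    common = len(fp1 & fp2)
    return common / (len(fp1) + len(fp2) - common)

a = {1, 4, 7, 9}
b = {1, 4, 8, 9, 12}
print(tanimoto(a, b))  # 3 bits in common, 6 in the union -> 0.5
```

Which bits are set, and what molecular features they encode, is exactly where the field-specific choices described above come in.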

Implementation as self-contained web services

The strategy used here is to divide information into logically separate knowledge bases, represent each set of information as well as possible using the most appropriate available methods, then federate the resultant independent knowledge sources into an extensible and hopefully-cohesive whole.

The major advantage of this strategy is that enormously disparate types of information can be represented, e.g., Chinese medical observations encoded 1800 years ago in Mandarin and observed crystallographic binding of a ligand to an enzyme published as a PDB file last month. Two major disadvantages are that it's usually a lot of work to encode information in a semantically well-defined manner and, even when that's done, you end up with a heterogeneous data model.

This sounds like a formidable task. Just 20 years ago, it would have seemed nearly impossible. However, this is exactly what the World Wide Web has accomplished, albeit in a non-rigorous and evolutionary manner. [Not entirely by chance: HTTP was created specifically to support distributed collaborative information systems.]

The independent knowledge bases which form the core of this federation are implemented as virtual webservers. Each knows about one or more kinds of information, about the interrelationships of that information as knowledge, and about the relationships that can be formed between such knowledge and that in other knowledge bases.
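How one such virtual webserver might advertise its relationships to siblings can be sketched as hyperlink generation. The service names below are real (tcm, wdi, planet), but the URL paths and the SMILES-keyed lookup scheme are hypothetical, invented for illustration:

```python
# Peer services in the federation (hostname is real; paths assumed).
PEERS = {
    "tcm":    "http://online.daylight.com/tcm",
    "wdi":    "http://online.daylight.com/wdi",
    "planet": "http://online.daylight.com/planet",
}

def cross_links(smiles: str, home: str) -> list:
    """Build hyperlinks from one service's entry (keyed here by SMILES)
    to the corresponding structure in every peer service."""
    return [f"{url}/smiles?{smiles}"
            for name, url in sorted(PEERS.items()) if name != home]

links = cross_links("CNC(C)C(O)c1ccccc1", home="tcm")
```

Because each server generates such links itself, relationships appear to the user as ordinary hyperlinks, with no central schema binding the knowledge bases together.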

"Websurfing" is a suitable interface

One might imagine that it would be a nearly-insuperable task to create an effective user interface for such an open-ended and complex data model as is described here. Apparently not so! The same global phenomenon that created the WWW also proliferated hyperlinked web browsers. Hyperlinked documents are nearly optimal for management, presentation, and navigation of relationships between disparate sets of information. Most informatics users are already competent web-surfers. Such users may need to find out what information is available, but they already know how to explore it. Once information is available in a web-based format, the surf's up!

Deployment - platform selection

Requirements for a practical platform add up to something close to the formula for a super-webserver with unusually large computational requirements. As far as numbers go: 2 or more CPUs, 1-2 GB RAM, 2-4 Gflops, and 50+ GB of disk would be nice. Sounds like (what we used to call) a supercomputer.

An analysis of the current crop of computers led to a surprising result. Sun Enterprise-class servers running Solaris would do the job nicely, as would many other machines which are intended as production servers. Great machines, but pricey.

One long-shot which was evaluated was the Macintosh G4's running OS-X. Unlike earlier Mac OS's, OS-X is based on BSD Unix running a Mach kernel. With the Mach kernel to handle lightweight multiprocessing and a full complement of BSD Unix tools, it seems like a contender. Although OS-X is in beta until 24 March 2001, we broke our rule about never using beta OS's ... and wow, OS-X works great. The whole Daylight development system, the development http toolkit and all 16 http servers were ported and running in less than a week. Performance using the G4 chips exceeded equivalent machines using UltraSPARC (but only slightly).

The fedora "beta 4" system, as currently deployed on the internet as online.daylight.com, is running on the Macintosh G4 platform described below.

As it happened, Apple was having a "fire sale" on the 2000-model G4-DPs for ~$1800 (street price); who could resist? We bought three of them and added commodity memory ($560/GB) and disk ($4/GB). The Mac OS-X Public Beta was an extra $30. Additionally, an extended-runtime UPS (1.8 KVA, ~$600) was dedicated to this machine because power in NM is not terribly reliable.

Deployment - configuration

One of the goals of this project was to design and operate an ultra-reliable production service. online.daylight.com was therefore configured as a server appliance rather than a general purpose computer.

The machine is configured to run sixteen IP services: 14 http-toolkit based servers (including dcm, park, planet, tcm, wdi and zi4 as discussed here), grind (for JavaGrins), and thorserver (for bbq/qsar). One of the http-toolkit servers is sandman (server and daemon manager) which operates all the others and does recovery as needed (though it has not been needed, yet). The minimum (and only) updateable unit is a server (which contains both executable and data segments).
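The sandman idea, a supervisor that (re)starts any managed server that has died, can be caricatured in a few lines. Here a "server" is just a toy object with start/alive methods so the control flow is visible; the real sandman manages http-toolkit processes, and all names below are hypothetical:

```python
class Server:
    """Stand-in for a managed server process (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.running = False
        self.restarts = 0
    def alive(self):
        return self.running
    def start(self):
        self.running = True
        self.restarts += 1

def sweep(servers):
    """One supervision pass: (re)start anything not alive and report
    which servers were recovered."""
    recovered = []
    for s in servers:
        if not s.alive():
            s.start()
            recovered.append(s.name)
    return recovered
```

Run periodically, such a sweep gives the "operates all the others and does recovery as needed" behavior described above.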

Almost nothing else is running, e.g., no conventional "open-ended" webserver such as apache, no inetd services such as login, telnet, ftp, etc. Since an adequate amount of memory is available, everything is paged-in: after the initial startup, the magnetic disks spin down semi-permanently (until the software is updated). The only moving parts are the fan and (periodically, non-critically) the CD-R used for logging. The idea is that, with so little going on, there is very little to go wrong. We'll see.

Deployment - experience so far

The machine which became online.daylight.com was received new on 7 February and was put into service on 9 February. It was brought down once, on 18 February, to deinstall beta 3 servers and install the beta 4 servers. In the first few weeks the servers have collectively processed about 50,000 http requests successfully. A little over half of these were from internal clients, most of the others were beta testers in Beijing, Boston, and London. To our knowledge, no failures have occurred at the http or tcp/ip levels.

Plans

In general, and if all goes well, the http-toolkit and servers based on it are slated to become part of Daylight Software Release 4.81 (expected late 2001).

Beta versions are available now to Daylight beta sites. (However, most of the beta work is being done on "online" rather than on intranets as with other Daylight beta's.)

In principle, early-release versions could be made available under Daylight Release 4.72 (2Q2001: Linux, Irix and Solaris) if there were a compelling reason to do so.

In practice, sale of these servers as production software is limited by the status of specific databases and their owners (e.g., Ashgate, Derwent, BioByte, etc.) Everyone's saying the right things, but talk with Yosi about specifics.

Conclusions

Thanks to


Daylight Chemical Information Systems, Inc.
info@daylight.com