MUG '01
-- 15th Daylight User Group Meeting -- 6 - 9 Mar 2001
Using molecular structure to build bridges:
Chinese, allopathic and post-genomic medical traditions.
Dave Weininger
Daylight CIS.
ABSTRACT
A new type of information management system is being developed which
emphasizes relationships as primary entities.
Even a modest database contains a vast number of relationships which
are of possible interest, most of which are inaccessible when using
conventional database management systems.
The number of such relationships between data in different databases
is far vaster; it's staggering.
The challenge is to build a rigorous and effective system which provides
access to such data, yet is easily understood by users.
This report summarizes progress on the fedora project, which
is currently alpha- or beta-testing 16 database services.
Servers in three fields of knowledge are demonstrated here:
Traditional Chinese medicine,
allopathic pharmacology,
and protein-ligand interactions.
These non-trivial data sets illustrate the power and limitations of
such an information system.
The discussion covers strategy, design, implementation and actual deployment.
This presentation also serves as an announcement of a demo system
available on
online.daylight.com
via the internet
(also fedora.daylight.com at MUG'01).
Relationships as knowledge
Most human knowledge is about relationships;
most scientific research is about discovering relationships.
Contrariwise, most computer database systems (e.g., RDBs, ORDBMSs, Thor)
encapsulate information in fixed relationships.
Ironically, one can't store relationships as data in computer databases
even in (especially in) relational database management systems (RDBs),
in which all possible data relationships are established beforehand
(i.e., the database schema).
Fixed data models are extremely useful for storing and regurgitating
well-understood information.
The same features which allow highly efficient data archival form a
tremendous limitation for other purposes such as integration of diverse
knowledge bases and discovery in general.
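To make the distinction concrete, here is a minimal sketch (Python, purely
illustrative; it is not fedora's implementation) of relationships stored as
first-class data, so that new kinds of relationships can be added and
searched at any time with no schema change:

    # Each relationship is just a (subject, predicate, object) row that
    # can be entered, searched and corrected like any other datum.
    facts = set()

    def relate(subject, predicate, obj):
        facts.add((subject, predicate, obj))

    def query(subject=None, predicate=None, obj=None):
        """Return all stored relationships matching the given pattern."""
        return [f for f in facts
                if (subject is None or f[0] == subject)
                and (predicate is None or f[1] == predicate)
                and (obj is None or f[2] == obj)]

    # Relationships no schema designer anticipated can arrive later:
    relate("aspirin", "indicated-for", "thrombosis")
    relate("aspirin", "mechanism", "antiaggregant")
    relate("heparin", "indicated-for", "thrombosis")
    query(predicate="indicated-for", obj="thrombosis")  # both drugs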
Relevance to scientific research
Scientific research starts with a hypothesis which is often a relationship
of one kind or another.
Data is obtained (gathered or generated) which supports or refutes the
original hypothesis.
Very often, the result of such research is the generation of one or more
additional hypotheses, i.e., more relationships to keep track of and to test.
Researchers operate at a great disadvantage because the relationships
themselves are not storable in (most) databases,
i.e., they cannot be entered as data, tabulated, corrected, searched, etc.
From an IT point-of-view, scientists are a real problem.
Research scientists are notorious for noticing unexpected relationships.
But it's not their fault; it's their job.
The problem is that they then ask questions which were not expected when
a given data system was designed.
It seems self-evident that more flexible data systems would be a boon
for research purposes.
This report
A new type of information management system is being developed which
is not subject to such limitations, insofar as this is possible.
We are 8-9 months into this project.
One major false-start was followed by a number of very successful
proof-of-principle and prototype services.
The system is currently being integrated into a production data system
(fedora);
the first 12 data services are in alpha or beta tests now.
If all goes well, they will appear as production software by the end of
this year (in Daylight Software Release 4.81).
The experiment
- Is it possible to build information systems whose data
are fundamentally knowledge-bases of relationships?
- If so, is it possible to build information systems which encapsulate
relationships between non-trivial knowledge bases?
- If so, can a practical user interface be developed which
provides reasonable access to such non-trivial data?
- A "no-holds-barred" implementation is likely to be expensive in terms
of computer resources. Is it practical with today's off-the-shelf hardware?
Or is it blue-sky super-computer stuff?
The work
The basic approach used here is to
- encapsulate information in a set of well-defined and well-understood
languages,
- implement data services which read, understand, search,
correlate and deliver such information,
- implement communication methods between such services
(a correlation sketch follows this list),
- load up a number of services with non-trivial sets of information,
- establish a practical user interface,
- evaluate the utility of such a system.
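As a sketch of the communication and correlation steps above (hypothetical
endpoints and JSON record format; the real fedora services are Daylight
http-toolkit servers, not shown here), a client can join two independent
services on a shared key such as a canonical SMILES:

    import json
    from urllib.request import urlopen

    WDI_URL = "http://example.org/wdi"        # hypothetical endpoint
    PLANET_URL = "http://example.org/planet"  # hypothetical endpoint

    def fetch(url, query):
        """Assume each service answers with a JSON list of
           {"smiles": ..., "data": ...} records."""
        with urlopen(url + "?q=" + query) as response:
            return json.load(response)

    def correlate(query):
        drugs = {r["smiles"]: r for r in fetch(WDI_URL, query)}
        cplxs = {r["smiles"]: r for r in fetch(PLANET_URL, query)}
        # The inter-database relationship is simply the set of shared keys.
        return [(drugs[s], cplxs[s]) for s in drugs.keys() & cplxs.keys()]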
The data
A relatively large number (>20) of different kinds of data were explored.
Three are described here.
WDI, an allopathic pharmacopoeia
The World Drug Index from Derwent is a comprehensive collection of
registered drugs and reported trial preparations along with structures,
names and pharmacology.
One big advantage is that its contents are well understood
[like mother's milk to most of us].
It contains a rich set of pharmacological fields with intrinsic relationships
which are not well represented by typical object-property models, e.g.,
pharmacological activity vs. mechanism of action,
or precautions and warnings vs. contraindications vs. adverse effects.
High-quality structures are available for a large majority of the entries,
so we can conveniently leverage our expertise with structural relationships
(e.g., substructure, superstructure, and similarity of molecular features).
These are extended with newly-formalized molecular relationships such as
"common-parent structure".
The World Drug Index is implemented as a single fedora service:
- wdi
- The wdi server uses the same data (same datatrees) as the
production Thor database.
A demo version is available which contains under
5000 of the 65000+ entries in the current version (wdi011).
wdi establishes data relationships with
tcm and planet (as available).
[Of the servers in this release,
wdi has the most complete interface and documentation.]
Interestingly, good ol' WDI is changing in ways which cry out for
a more relationship-oriented data system.
Two main changes were reported by Richard Lambert at EMUG'00.
The "Mechanism of action" field is being revised to be more mechanistic,
e.g., fields which might have originally contained "antibiotic" are being
revised to things like "nitric oxide synthase inhibitor", as the information
becomes available.
The amount of such data already available is impressive:
there are 4,800 such protein/enzyme-oriented mechanistic fields in wdi011.
[This is neat, since we should be able to derive sensible relationships
between drugs such as "operate on the same metabolic pathway".]
The other change is the introduction of "combination preparations".
This area needs more work, but these obviously form further relationships
which must be considered when correlating observed pharmacology data.
Many of the combination preps contain natural products and herbs,
which brings us to ...
Traditional Chinese Medicines
Traditional Chinese medicine is almost entirely based on observations about
relationships.
Many of these relationships are intrinsic to the traditional Chinese model
of the body, e.g., flow of qi, yin/yang, channels and organs, etc.
There are also many informatic relationships, e.g.,
effects are known for specific TCMs (many-to-one),
which are compounded from biological sources (mostly plants, one-to-many),
which are ultimately sets of pseudo-equivalent species (one to many),
for which we know chemical constituents (one-to-many),
often with observed mechanisms of action (many-to-many).
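A sketch of that chain as a data model (hypothetical field names; the real
tcm server's records are far richer):

    from dataclasses import dataclass, field

    @dataclass
    class Constituent:                # chemical constituent of a source
        smiles: str
        mechanisms: list = field(default_factory=list)   # many-to-many

    @dataclass
    class Source:                     # biological source (plant, etc.)
        latin: str                    # family/genus/species in Latin
        constituents: list = field(default_factory=list) # one-to-many

    @dataclass
    class TCM:                        # a traditional Chinese medicine
        pinyin: str
        han: str                      # name in Han characters
        effects: list = field(default_factory=list)      # many-to-one
        sources: list = field(default_factory=list)      # one-to-many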
One confounding factor is that all this is represented in a variety of
languages: medical knowledge in Han (Mandarin, with TCM names also in
Pinyin), medicinal sources in Pinyin and Latin (e.g., family, genus, species),
pharmacology information in English (e.g., preparation methods),
and molecular structures (now) in SMILES.
If that were not confusing enough,
there are two distinct ways of writing Han (traditional and simplified)
and though most Chinese practitioners (who are in the PRC) use simplified
characters, most computerized data sources use traditional characters
and few (so far) use Unicode.
Also, there are a relatively large number of specialized characters for
medicine and pharmacology which don't appear in normal bilingual
dictionaries (or character sets, i.e., fonts!).
It's a difficult set of data to represent properly, but given a system which
can handle relationships with ease, not quite impossible.
One saving grace is that high-quality constituent molecular structures
are available for most TCMs (~95%) and most of these (~75%) are familiar
or very similar to something with known pharmacology in isolation.
Molecular structure really is a universal language.
American and Chinese chemists might not be able to say to each other over
the dinner table, "Please pass the sugar.",
but when it gets down to the molecular structure of sucrose, they draw
identical structures, biosynthetic pathways, physical properties,
metabolic utilization, etc.
Chinese medical data is divided into four separate
fedora information services:
- tcm
- An electronic version of the book
Traditional Chinese Medicines
by Yan, Zhou, Xie and Milne,
as published by Ashgate.
The tcm server uses the same data (same datatrees) as the
soon-to-be-released Thor database tcm00.
tcm's basic trick is a marriage of two kinds of data objects:
biological entries (i.e., the sources of Chinese medicines)
and chemical entries (structures observed in those sources).
Most tcm data are in English and SMILES,
with Latin and Pinyin data used for cross-referencing.
tcm is loosely coupled to dcm and park,
and establishes data relationships with wdi and planet.
- dcm
- An electronic version of the book
A Practical Dictionary of Chinese Medicine
by Wiseman and Feng,
as published by Paradigm.
The dcm server processes data in a specialized format used
for multilingual typesetting (CTEX).
dcm is more like an encyclopedia than a dictionary.
dcm maintains a large number (220,000+) of searchable fields
in four languages (English, traditional Chinese, Pinyin, and Latin)
and a huge number (many millions) of derived relationships.
dcm is tightly coupled to zi4
which allows Chinese readers the option of displaying
traditional or simplified characters.
- zi4
- 字 (zi4) is the Chinese character for "character".
zi4 is also a non-interactive utility which provides
various kinds of Chinese character image and mapping services
to other servers.
It knows about Unicode,
traditional Chinese characters (BIG5),
simplified Chinese characters (GB),
and Pinyin tone symbols.
It also knows about specialized Chinese characters used
in medicine and pharmacology (this is not yet complete).
zi4's main trick is delivering Chinese characters
in a way which is useful on any web browser (GIF89a's).
It also implements a neat algorithm for fast and reliable
translation of Traditional Chinese to Simplified Chinese
(sketched after this list).
[Someday, though not soon, we will all use Unicode and
zi4 will not be needed.]
- park
- park is an HTTP-based archive for annotated images
such as photographs.
Its intended purpose here is to provide images of the sources
of Chinese medicines (i.e., photos of plants and animals).
There isn't much content yet, due to the [surprising] difficulty
in obtaining good photographs of the correct species.
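Returning briefly to zi4: its actual translation algorithm is not described
here, but the table-driven idea looks like this sketch (Unicode strings
stand in for the BIG5/GB encodings zi4 actually handles, and a real table
must also cope with many-to-one and context-dependent mappings):

    T2S = {            # tiny excerpt of a traditional->simplified table
        "藥": "药",    # medicine
        "經": "经",    # channel / classic
        "氣": "气",    # qi
    }

    def to_simplified(text):
        """Map traditional characters to simplified; pass others through."""
        return "".join(T2S.get(ch, ch) for ch in text)

    to_simplified("中藥")   # -> "中药"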
Planet, a protein ligand association network
Each primary data object in planet is a protein-ligand association,
i.e., one or more proteins and one or more ligands, with an
association such as an observed binding, computed docking,
or molecular transformation.
All planet associations are currently 3-D geometrical associations,
but in principle this need not be so.
For example, protein-ligand relationships can represent reactions
(i.e., metabolic pathway steps) without known 3-D geometry.
It is not yet clear whether a separate server for metabolic pathways
will be required.
Planet represents data objects canonically and addresses them by content
(including relationships between data objects).
Planet is designed for exploratory data analysis of archived
results rather than as a compute service (e.g., for docking).
It serves both archival and exploratory functions.
As an archive, original data is retained and served.
As an exploratory tool, a number of search methods are provided
for discovering relationships between proteins, ligands and their complexes.
For instance, a novel method is used to evaluate ligand similarity
which operates well with the poor oxidation state information
typical of crystallographic data.
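A minimal sketch of what "addresses them by content" can mean, assuming a
canonical serialization already exists (fedora's actual addressing scheme
is not specified here):

    import hashlib

    def content_address(canonical_bytes):
        """The address is a digest of the canonical serialization, so
           identical objects get identical addresses regardless of origin,
           and relationships can be recorded address-to-address."""
        return hashlib.sha1(canonical_bytes).hexdigest()

    content_address(b"protein=...;ligand=...;assoc=binding")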
It is implemented as a single fedora service:
- planet
- The planet server archives original PDB files as well as
system-specific data (primarily CEX streams).
Currently, the planet dataset consists of 177 observed
complexes from the PDB.
planet establishes data relationships with
tcm and wdi (as available).
A side note about "post-genomic medical traditions".
It seems a bit peculiar to be talking about post-genomic "traditions"
in the same talk as Chinese medicine
where the traditions are 2-3 millennia old.
However, they are surprisingly similar from the informatics POV.
So now we have the human genome, the map of the human bean, the
100,000 (or maybe 35,000) "ideas" which make the human machine work.
To be more precise, we have the native-human-encoding of the map.
The "post-genomic" task is now to figure out how they interact with
each other, a much bigger task by any measure.
One important aspect of this is metabolism, in particular,
enzyme-catalyzed biosynthesis and bioregulation.
Much of the direct evidence for this reductionistic science comes from
X-ray observations of crystal structures.
The way we encode this information is historical and almost as nonsensical
as evolution itself: the Protein Data Bank format.
It isn't a long tradition, but it's an incredibly strong and bizarre one.
After doing both, let me just note that it is much easier to encode
Traditional Chinese Medicine semantics (Mandarin, Latin and all) than
it is to encode the semantics of real-life PDB files in a rigorous manner
(a task not yet complete).
The relationships
There are literally millions of intrinsic (within database)
relationships in these three data sets.
A few examples:
- WDI, drugs with common indications, same/different mechanism of action,
e.g., indication of thrombosis with
anticoagulant,
antiaggregant,
or other mechanism.
- WDI, drugs marketed as different formulations but with common active
component, e.g.,
benzyl penicillin itself,
as parent,
by name,
containing benzyl penicillin, or
similar to benzyl penicillin structure.
- TCMs with sources from common families but differing indications, e.g.,
horsetails,
vice versa, e.g.,
strangury,
and same structures from different source families, e.g.,
chitin.
- Protein-ligand complexes with similar proteins or ligands but not both, e.g.,
HIV-1 protease,
thymidylate synthase.
There are many more relationships between databases.
- Most chemical substances in TCMs are related to metabolic pathways,
many are used as allopathic drugs or supplements,
many have enzyme binding observations, e.g.,
biotin,
folate,
adenosine,
- Enzyme binding observations are available for some allopathic drugs, e.g.,
nevirapine
- Many allopathic drugs have natural origins, e.g.,
capsaicin,
digoxin,
morphine.
- Similar drugs are used in Chinese and allopathic medicine
for different purposes, e.g.,
methylcurine,
scutellarin/breviscapine
- Many Chinese and allopathic medicines with common indications
also have common molecular features, e.g.,
thevetin-A
- Many natural products do not have counterparts as allopathic drugs,
e.g.,
palustrine
- Associated information provides important clues to which components
are active, synergistic, agonistic or antagonistic.
Swordlike Atractylodes ->
atractylodes ->
bran-frying
-> clogp, etc.
Similarity is as important a relationship as identity
The starting point of reductionistic informatics is that we can
record something we know and associate it with something we know
it to be about,
i.e., the basis for the object-property information model.
Insofar as this is true, identity is a sufficient operation
and measures of similarity can be treated as subservient methods which
are used to re-organize the pieces of perfectly-represented information.
For instance, we have many databases of molecular properties which are
associated with a valence model of a molecule.
It is useful to select and sort them in various ways such as molecular weight,
commonality of features, or observed or predicted physical properties.
Such operations do not affect the underlying representation of the information
(the data itself) nor of the knowledge
(the connection of the data to the model).
This approach works well for much of chemistry
and is why thor/merlin/daycart is useful.
Unfortunately, most knowledge is imperfect.
Data recording errors are practically, though not theoretically, inevitable.
Even if we had flawless information encoding, storage and retrieval,
it would not be possible to create an information system which represents
nature perfectly.
All non-trivial scientific databases must somehow accommodate the fact
that observations are subject to error:
multiple, valid, conflicting observations exist.
A deeper type of error is in the model of the observed object.
Multiple, valid, conflicting ways exist for representing most things
of interest in the world (i.e., models).
In chemistry, for instance, molecules can be represented by empirical,
valence, LCAO, MO and other models.
Many valid representations are possible
even within the domain of the "valence model".
Working with multiple sources of data which use different models inevitably
leads to some ambiguity.
One approach to addressing this problem is to dictate a single information
model and thus define a "data universe" in which everything is consistent.
This approach has two main problems:
the lesser problem is that it's a lot of work,
the greater problem is that information is lost during normalization.
As information scope increases, so do adverse effects of these problems.
An effective alternative is to provide field-specific methods for
representing different types of similarities which are appropriate
to particular types of data.
Our old favorite "binary Tanimoto similarity" is often suitable for
chemical databases with well-known molecular structures, but a different
similarity measure is required when oxidation state information is not
available (e.g., crystallographic observation of ligands).
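For reference, the "old favorite" binary Tanimoto similarity is just the
ratio of shared to total on-bits in two fingerprints:

    def tanimoto(fp_a, fp_b):
        """fp_a, fp_b: fingerprints as Python ints used as bit vectors."""
        common = bin(fp_a & fp_b).count("1")
        total = bin(fp_a | fp_b).count("1")
        return common / total if total else 1.0

    tanimoto(0b101101, 0b100111)   # 3 shared / 5 total on-bits = 0.6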
Exact and approximate (e.g., ARES) substring matching is appropriate
for some kinds of data (e.g., free English text),
but other methods are required for different kinds of text
(e.g., Chinese text, Pinyin phrases, peptide sequences).
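ARES itself is a Daylight method and is not reproduced here; as a generic
stand-in, approximate substring matching can be done with the standard
dynamic-programming edit distance, with a free starting position in the text:

    def approx_find(pattern, text, k):
        """True if some substring of text is within k edits of pattern."""
        m = len(pattern)
        prev = list(range(m + 1))      # column for the empty text prefix
        for ch in text:
            cur = [0]                  # 0: a match may start anywhere
            for i in range(1, m + 1):
                cost = 0 if pattern[i - 1] == ch else 1
                cur.append(min(prev[i] + 1,          # skip a text char
                               cur[i - 1] + 1,       # skip a pattern char
                               prev[i - 1] + cost))  # substitute or match
            if cur[m] <= k:
                return True
            prev = cur
        return prev[m] <= k

    approx_find("strangury", "... stranguria and dribble ...", 2)  # True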
When examining relationships between information in different
knowledge bases, such methods are more than organizational:
they are the primary methods of defining relationships.
Thus, similarity (and selection and sorting criteria in general) becomes as
important a concept as identity.
Implementation as self-contained web services
The strategy used here is to divide information into logically separate
knowledge bases, represent each set of information as well as possible
using the most appropriate available methods,
then federate the resultant independent knowledge sources into
an extensible and hopefully-cohesive whole.
The major advantage of this strategy is that enormously disparate types of
information can be represented, e.g., Chinese medical observations encoded
1800 years ago in Mandarin and observed crystallographic binding of a
ligand to an enzyme published as a PDB last month.
Two major disadvantages are that it's usually a lot of work to encode
information in a semantically well-defined manner
and even when that's done,
you end up with a heterogeneous data model.
This sounds like a formidable task.
Just 20 years ago, it would have seemed nearly impossible.
However, this is exactly what the World Wide Web has accomplished,
albeit in a non-rigorous and evolutionary manner.
[Not entirely by chance: HTTP
was created specifically to support
distributed collaborative information systems.]
The independent knowledge bases which form the core of this federation
are implemented as virtual webservers.
Each knows about one or more kinds of information,
about the interrelationships of that information as knowledge,
and about the relationships that can be formed between such knowledge
and that in other knowledge bases.
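A toy version of such a virtual webserver (hypothetical names, port and peer
URL; the real servers are built on Daylight's http-toolkit, not Python): it
answers queries from its own small knowledge base and emits hyperlinks into
a peer service, which is how cross-knowledge-base relationships surface.

    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import urlparse, parse_qs

    KNOWLEDGE = {"capsaicin": "topical analgesic; natural origin"}
    PEER = "http://example.org/tcm"       # hypothetical peer service

    class Service(BaseHTTPRequestHandler):
        def do_GET(self):
            name = parse_qs(urlparse(self.path).query).get("name", [""])[0]
            entry = KNOWLEDGE.get(name)
            if entry:
                body = ('<p>%s: %s</p><a href="%s?name=%s">related TCM '
                        'entries</a>' % (name, entry, PEER, name))
            else:
                body = "<p>not found</p>"
            self.send_response(200 if entry else 404)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.end_headers()
            self.wfile.write(body.encode("utf-8"))

    if __name__ == "__main__":
        HTTPServer(("", 8080), Service).serve_forever()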
"Websurfing" is a suitable interface
One might imagine that it would be a nearly-insuperable task to
create an effective user interface for such an open-ended and
complex data model as is described here.
Apparently, not so!
The same global phenomenon that created WWW also proliferated
hyperlinked web-browsers.
Hyperlinked documents are nearly optimal for
management, presentation, and navigation of relationships
between disparate sets of information.
Most informatics users are already competent web-surfers.
Such users may need to find out what information is available,
but they already know how to explore it.
In short, once information is available in a web-based format, the surf's up!
Deployment - platform selection
Requirements for a practical platform include:
- multiple processors
- large amount of fast memory (1GB+) with efficient paging
- large, reliable (preferably RAIDable) disk storage
- reliable offline storage device
- fast TCP/IP connectivity
- sturdy development and performance measurement tools
This is close to the formula for a super-webserver with unusually large
computational requirements.
As far as numbers go:
2 or more CPUs, 1-2 GB RAM, 2-4 Gflops, and 50+ GB disk
would be nice.
Sounds like (what we used to call) a supercomputer.
An analysis of the current crop of computers led to a surprising result.
Sun Enterprise-class servers running Solaris would do the job nicely,
as would many other machines which are intended as production servers.
Great machines, but pricey.
One long-shot which was evaluated was the Macintosh G4 running OS-X.
Unlike earlier Mac OSes, OS-X is based on BSD Unix running on a Mach kernel.
With the Mach kernel to handle lightweight multiprocessing and a full
complement of BSD Unix tools, it seems like a contender.
Although OS-X is in beta until 24 March 2001, we broke our rule about
never using beta OS's ... and wow, OS-X works great.
The whole Daylight development system,
the development http toolkit and all 16 http servers
were ported and running in less than a week.
Performance using the G4
chips slightly exceeded that of equivalent UltraSPARC machines.
The fedora "beta 4" system, as currently deployed
on the internet as online.daylight.com, is running
on the following platform:
- G4-DP Macintosh (dual processor @450 MHz, 11.0 Gflops peak)
- 1 GB memory
- 180 GB disk (original 30GB unused, 2x75GB's mirrored)
- CD-R (for logging)
- gigabit TCP/IP (routed through 100baseT switch)
- Standard Mac OS-X (Public Beta)
As it happened, Apple was having a "fire sale" on the 2000 model G4-DPs
for ~$1800 (street price); who could resist?
We bought three of them and added commodity memory ($560/GB) and disk ($4/GB).
The Mac OS-X Public Beta was an extra $30.
Additionally, an extended-runtime UPS (1.8 KVA, ~$600) was dedicated to
this machine because power in NM is not terribly reliable.
Deployment - configuration
One of the goals of this project was to design and operate an
ultra-reliable production service.
online.daylight.com was therefore configured as a server appliance rather
than a general purpose computer.
The machine is configured to run sixteen IP services:
14 http-toolkit based servers
(including dcm, park, planet, tcm, wdi and zi4 as discussed here),
grind (for JavaGrins), and thorserver (for bbq/qsar).
One of the http-toolkit servers is sandman (server and daemon manager)
which operates all the others and does recovery as needed
(though it has not been needed, yet).
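sandman's internals are not described here; the general pattern it names
(start each service, watch it, restart on exit) looks roughly like this
sketch with hypothetical launch commands:

    import subprocess, time

    SERVICES = {                          # hypothetical launch commands
        "wdi": ["./wdi-server", "--port", "8001"],
        "tcm": ["./tcm-server", "--port", "8002"],
    }

    def supervise():
        running = {n: subprocess.Popen(c) for n, c in SERVICES.items()}
        while True:
            for name, proc in running.items():
                if proc.poll() is not None:          # service exited
                    running[name] = subprocess.Popen(SERVICES[name])
            time.sleep(5)                            # poll interval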
The minimum (and only) updateable unit is a server
(which contains both executable and data segments).
Almost nothing else is running,
e.g., no conventional "open-ended" webserver such as apache,
no inetd services such as login, telnet, ftp, etc.
Since an adequate amount of memory is available, everything is paged in:
after the initial startup, the magnetic disks spin down semi-permanently
(until the software is updated).
The only moving parts are the fan
and (periodically, non-critically) the CD-R used for logging.
The idea is that, with so little going on, there is very little to go wrong.
We'll see.
Deployment - experience so far
The machine which became online.daylight.com was received
new on 7 February and was put into service on 9 February.
It was brought down once, on 18 February, to deinstall beta 3 servers
and install the beta 4 servers.
In the first few weeks the servers have collectively processed about
50,000 http requests successfully.
A little over half of these were from internal clients,
most of the others were beta testers in Beijing, Boston, and London.
To our knowledge, no failures have occurred at the http or tcp/ip levels.
Plans
In general, and if all goes well, the http-toolkit and servers based
on it are slated to become part of Daylight Software Release 4.81
(expected late 2001).
Beta versions are available now to Daylight beta sites.
(However, most of the beta work is being done on "online" rather than
on intranets as with other Daylight betas.)
In principle, early-release versions could be made available under
Daylight Release 4.72 (2Q2001: Linux, Irix and Solaris)
if there were a compelling reason to do so.
In practice, sale of these servers as production software is limited
by the status of specific databases and their owners (e.g., Ashgate,
Derwent, BioByte, etc.)
Everyone's saying the right things, but talk with Yosi about specifics.
Conclusions
- Is it possible to build systems which represent
knowledge-bases of relationships?
- Yes, definitely.
However, they look different from conventional databases.
- Is it possible to build systems which encapsulate
relationships between non-trivial knowledge bases?
- Yes, in cases where we have a good model for the
relationships (e.g., molecular structure).
- Probably, in cases where both sides have knowledge
in formalized languages (e.g., Chinese and Western medicine).
- Unknown, in general.
- Can a practical user interface be developed which
makes such data easily accessible?
- Yes, definitely.
Web-based interfaces are nearly optimal for exploring relationship data.
Most users are already experienced www-surfers,
with 100's of hours of self-training which can be leveraged.
- Is "off-the-shelf" or "supercomputer" hardware required?
- Both.
For most data sets, current off-the-shelf hardware has the capacity to
provide effective data relationship management,
even doing it the "hard way".
If they weren't so cheap and easy to obtain, we'd call them supercomputers.
The largest data sets (e.g., 5e6 structures in spresi, 18e6 in reg) are
currently beyond the capacity of off-the-shelf desktop computers
... but probably not for long.
- Is this technology competitive with conventional DBMS's?
- No, at least not in the near future.
Data relationships are mainly useful for recording and discovering
knowledge rather than managing the underlying data per se.
Conventional DBMS's provide better stability, formality and efficiency
for management of well-understood data with fixed relationships.
That being said, the trends in both theoretical and practical
informatics are eating away at each of those advantages.
Thanks to
- Apple ... for still being there
- Dawn Abriel ... for medical expertise
- Jack Delany and Yosi Tatiz, Daylight
... for making room for this work
- Nigel Wiseman and Bob Felt, Paradigm ... for DCM
- Peter Nielsen and Bill Milne, Ashgate ... for TCM
- Ragu Bharadwaj, Daylight ... for JavaGrins and OS-X enthusiasm
- Roger Sayle, Metaphorics ... for rasmol and inspiration
- Scott Dixon, Metaphorics ... for planet
- Xinjian Yan, NIST ... for TCM
- Zhou Jiaju, Chinese Academy of Sciences ... for TCM
Daylight Chemical Information Systems, Inc.
info@daylight.com