online experiment

The online.daylight.com experiment

Dave Weininger, Daylight CIS
Euromug 2001, Cambridge

Question:

What does it take to deploy chemical information sources on WWW?

Considerations:

Content
full-strength, heterogeneous databases and compute servers

Strategy
Use computer resources to simplify access to information

User interface
intuitive HTML interfaces delivered over HTTP

Access

Scalability

Reliability and monitoring

Hardware

Experience with production services

online.daylight.com: content

Servers deliver very heterogeneous data.

A variety of implementation architectures are used, partly due to design constraints, partly for experimentation. Data are highly connected by logical links, many of which represent chemical relationships (i.e., identity or similarity).

Conventional chemical information

pow (P_o/w): pow serves information about n-octanol/water partitioning, including CLOGP (computed) and LogPstar (measured) values from BioByte, Inc. CLOGP computation is done by an external program object (clogptalk). LogPstar data is obtained from the same Thor datatrees used in the medchem00 database. This server demonstrates the implementation of mixed compute and data services as a single HTTP-oriented resource, as well as the ability to deliver encapsulated computational resources (program object). [Experimental. This is the only server in this set which is not implemented locally, due to the lack of an OS-X Fortran compiler.]
quasar (quantitative structure-activity relationships): quasar serves information about Quantitative Structure-Activity Relationships. It is implemented as an interface to a live Thor database (BioByte's QSAR database, bbq). quasar's primary data objects are observed relationships between molecular structure and biological activities (6500+) or physicochemical properties (7600+) for a given set of molecules. quasar is designed to serve the original data, relationships between the QSAR sets of molecules, primary QSAR relationships, and relationships between them (comparative QSAR). quasar is one of the earliest fedora services (implemented before the HTTP toolkit). This project is currently "on hold" with no firm release plans. Consequently, the quasar server has not been fully developed and it does not establish data relationships with other fedora servers. [Experimental.]
wdi (world drug index): The World Drug Index by Derwent, a large database of pharmaceutical information about named drugs including registered drugs and trial preparations. The wdi server uses the same data (same datatrees) as the production Thor database. The demo version available here contains under 5000 of the 65000+ entries in wdi003. wdi establishes data relationships with tcm and planet (as available). [Of the servers in this set, wdi has the most complete interface and documentation.]
savvy: savvy is a small, experimental server which provides molecular surface area and volume computations. This HTTP server emerged as a tool (rather than a product) which was useful in development of a computational algorithm. Currently, it only provides molecular volume calculation. But it's not without interest: it includes an implementation of the recent analytical solution to the general triple spherical intersection problem. It's not hooked up to anything else yet, but it should be ;-) [Experimental.]

Chinese medicine

tcm (traditional chinese medicines): An electronic version of the book Traditional Chinese Medicines by Yan, Zhou, Xie and Milne, as published by Ashgate. The tcm server uses the same data (same datatrees) as the production Thor database tcm00. tcm's basic trick is a marriage of two kinds of data objects: biological entries (i.e., the sources of Chinese medicines) and chemical entries (structures observed in those sources). Most tcm data are in English and SMILES, with Latin and Pinyin data used for cross-referencing. tcm is loosely coupled to dcm and park, and establishes data relationships with wdi and planet.
zi4 (chinese character server): zi4 is a non-interactive utility which provides various kinds of Chinese character image and mapping services to other servers. It knows about Unicode, traditional Chinese characters (BIG5), simplified Chinese characters (GB), and Pinyin tone symbols. It also knows about specialized Chinese characters used in medicine and pharmacology (this is not yet complete). zi4's main trick is delivering Chinese characters in a way which is useful on any web browser (GIF89a's). It also implements a neat algorithm for fast and reliable translation of Traditional Chinese to Simplified Chinese. [Someday, though not soon, we will all use Unicode and zi4 will not be needed.]
dcm (Dictionary of Chinese Medicine): An electronic version of the book A Practical Dictionary of Chinese Medicine by Wiseman and Feng, as published by Paradigm. The dcm server processes data in a specialized format used for multilingual typesetting (CTEX). dcm is more like an encyclopedia than a dictionary. dcm maintains a large number (220,000+) of searchable fields in four languages (English, traditional Chinese, Pinyin, and Latin) and a huge number (many millions) of derived relationships. dcm is tightly coupled to zi4 which allows Chinese readers the option of displaying traditional or simplified characters.
park (photo archive): park is an HTTP-based archive for annotated images such as photographs. Its intended purpose here is to provide images of the sources of Chinese medicines (i.e., photos of plants and animals). There isn't much content yet, due to the [surprising] difficulty in obtaining good photographs of the correct species. [Experimental.]
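
The Traditional-to-Simplified translation that zi4 offers can be pictured as a per-character mapping. The sketch below is only an illustration using a plain lookup table; it is not zi4's algorithm, which is not described here. The four-entry table `T2S` and the function name `to_simplified` are assumptions made for this example, and a real table would hold thousands of entries.

```python
# Hypothetical, tiny Traditional -> Simplified table (illustration only).
# Characters that have no entry are passed through unchanged.
T2S = str.maketrans({
    "醫": "医",  # medicine
    "藥": "药",  # drug
    "學": "学",  # study
    "國": "国",  # country
})

def to_simplified(text):
    """Replace each Traditional character that has a table entry."""
    return text.translate(T2S)

print(to_simplified("中醫藥學"))
```

Going from Traditional to Simplified is the easier direction for a table, since several Traditional characters often share one Simplified form; the reverse direction usually needs context.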

Biochemistry/bioinformatics/biocomplexes

ecbook (Enzyme Commission code book): Metaphorics' ecbook provides convenient searching of enzymes by functionality and name. ecbook was developed as an internal tool to aid pathos data entry, but it has been so useful that it might end up as a product in its own right. [Experimental.]
pathos (metabolic pathway server): Metaphorics' pathos server provides a model of the metabolic pathway chart. It is intended for both exploration of metabolism and as an index to data from other sources. Compounds, cofactors, enzymes, agents, regulators, steps and pathways are maintained as "live objects" (e.g., steps and pathways are represented by Daylight reaction objects). Landmarks, genetic diseases and other notes are also integrated. This server is a proof-of-principle; only 5-10% of data entry is complete. [Experimental.]
planet (protein-ligand association network): Each primary data object in Metaphorics' planet server represents a protein-ligand association, i.e., one or more proteins and one or more ligands, with an association such as an observed binding or computed docking. A number of search methods are provided for exploring the relationships of proteins, ligands and their complexes. E.g., a novel method is used to evaluate ligand similarity which operates well with the poor oxidation state information typical of crystallographic data. planet is designed for exploratory data analysis of archived results rather than as a compute service (e.g., for docking). Currently, the planet dataset consists of ~500 observed complexes from the PDB. planet establishes data relationships with ecbook, pathos, tcm and wdi (as available). [Alpha.]

Utilities

fedora (federation of research assets): The front door to the current set of fedora services. In privileged installations, the fedora server lives at port 80 (the default http service) so it can be accessed with just a machine name, e.g.,
http://online.daylight.com/
fedora will refer requests to other servers, e.g., the full URL of the tcm home page is
http://online.daylight.com:26551/tcm/index.html
but it can also be accessed via
http://online.daylight.com:80/fedora/tcm/index.html
or just:
http://online.daylight.com/tcm
Starting with beta-5, fedora also provides a login service.
gold and dlog (server and daemon journal, daemon log): gold is an HTTP server which maintains log files for fedora servers and daemons. dlog is an HTTP client program which submits log requests to gold. (Functionally, they are sort of mirror images of each other, in case you hadn't guessed from their names.) Both are utilities, mainly designed for use by system administrators rather than end-users. [beta]
grind (grins daemon): grind provides services which are needed to disseminate Daylight's JavaGrins molecular editor applet. grind does its work behind the scenes, e.g., when JavaGrins generates canonical SMILES or shows menus of templates, there's some grinding going on. grind simulates a Java RMI server: in practice it obviates the need for JRE on a server delivering JavaGrins. It is the oddball member of the fedora server family -- it doesn't use the HTTP toolkit (it talks Java RMI) -- although it does respond to HTTP requests for monitoring. Our plans are to migrate JavaGrins to use HTTP communication in future Daylight releases. In fact, dayutilserver 4.72 is the very same program as grind, but with gold logging disabled. [beta]
imagine (image engine): The imagine server is primarily a non-interactive utility which provides various kinds of image generation, layouts and conversions. Its services are normally accessed via URIs. In principle, it is designed to be a central resource for invented images. This imagine service is currently limited to generating molecular depictions and is used by other servers such as pow and savvy. [Experimental.]
sandman (server and daemon manager): sandman is the server which runs all other fedora servers. Its main job is to start the other servers and keep them running. sandman logs its activity with gold. sandman's HTTP user interface is limited to a monitoring page and documentation for system administrators. [beta]
testhttp (http-toolkit test program): The testhttp server is an example program which demonstrates basic functionality of Daylight's HTTP Toolkit. When this toolkit is released (v4.81) it is likely to contain source code for a program very much like testhttp, for the benefit of those programmers who will be using the toolkit. [beta]
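
The referral behaviour described for fedora above (a short front-door URL leading to a server on its own port) can be sketched as a table lookup. This is a minimal illustration, not fedora's implementation: only the tcm port (26551) and the example URLs come from the text, while the `SERVERS` table, the `refer` function, and the choice of index.html as the default page are assumptions.

```python
from urllib.parse import urlsplit

HOST = "online.daylight.com"

# Hypothetical registry of server name -> port.
# Only tcm's port is given in the text; others would be added here.
SERVERS = {"tcm": 26551}

def refer(url):
    """Map a front-door URL to the direct URL of the server behind it.

    Returns None when the request does not name a known server.
    """
    parts = [p for p in urlsplit(url).path.split("/") if p]
    # Accept both /fedora/<name>/... and the short form /<name>/...
    if parts and parts[0] == "fedora":
        parts = parts[1:]
    if not parts or parts[0] not in SERVERS:
        return None
    name = parts[0]
    rest = parts[1:] or ["index.html"]  # assumed default page
    return "http://%s:%d/%s/%s" % (HOST, SERVERS[name], name, "/".join(rest))

print(refer("http://online.daylight.com/tcm"))
```

Both the short form and the /fedora/ form resolve to the same direct URL, which keeps the port numbers out of anything a user has to type.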

online.daylight.com: fedora strategy

Use resources to simplify human and machine data access.

HTTP communication model
HTTP was specifically designed for "collaborative, distributed computation".
HTTP is probably the dominant form of data communication on this planet.

Two-way communication
Each fedora server is also an HTTP client.
- "What's related?"
- definitions, documentation, credits, help
- molecular depictions and other glyphs
- access control and licensing

Language processing

Do one thing well

KISS interface

HTTP isn't just HTML

online.daylight.com: access

Restricting access is tricky. Total access IRL is trickier.

Fedora provides a "service domain"

Fedora's default service IP domain is the local (A, B, or C) network.
Exception lists are provided.
Total access is also allowed (-sips '*.*.*.*')
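
A service IP domain of this kind can be sketched as pattern matching on dotted-quad addresses. The sketch assumes a per-octet wildcard syntax generalized from the one pattern shown ('*.*.*.*') and treats the exception list as a set of exclusions; neither detail is stated above, and the names `matches` and `allowed` and the sample addresses are hypothetical.

```python
def matches(ip, pattern):
    """True if a dotted-quad IP matches a pattern whose octets may be '*'."""
    ip_octets = ip.split(".")
    pat_octets = pattern.split(".")
    if len(ip_octets) != 4 or len(pat_octets) != 4:
        return False
    return all(p == "*" or p == o for o, p in zip(ip_octets, pat_octets))

def allowed(ip, service_patterns, exceptions=()):
    """Serve an address inside the service domain unless it is excepted.

    Assumption: exceptions are exclusions that override the service domain.
    """
    if any(matches(ip, p) for p in exceptions):
        return False
    return any(matches(ip, p) for p in service_patterns)

# Total access: every address is in the service domain.
print(allowed("10.1.2.3", ["*.*.*.*"]))
# A class C style local network with one excluded host (sample values).
print(allowed("192.168.1.7", ["192.168.1.*"], exceptions=["192.168.1.7"]))
```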

Fedora also provides a user/password registration

Information-service with cookie-based scheme
Great for demo/evaluation/beta sites
Reasonable compromise between convenience and security
Not secure enough for truly sensitive material (yet)
Centralized implementation could meet NSA's B-2 standards
Does anyone care?

"Total access" means "serve everyone"

Does that mean "all robots", too?
Automatic indexing is pretty neat, for most sites.
Densely-linked sites are a problem for robots (e.g., wdi, tcm).
Robots are a problem for densely linked sites.
Concerns about web-indexing robots are probably unfounded.

Robot-proofing

How much is "four plus VI + 9"?
(Ragu knows something that Google doesn't.)
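
The question above mixes an English number word, a Roman numeral and a digit, so answering it means recognizing all three notations. The sketch below shows what that takes for this one form of question. How online.daylight.com generated or checked such questions is not stated here, so the function names, the small word table and the restriction to addition are all assumptions.

```python
WORDS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(text):
    """Convert a Roman numeral, handling subtractive pairs such as IV."""
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # A smaller value written before a larger one is subtracted.
        if nxt in ROMAN and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total

def term_value(token):
    """Interpret one term as a digit string, number word, or Roman numeral."""
    if token.isdigit():
        return int(token)
    if token.lower() in WORDS:
        return WORDS[token.lower()]
    if token and all(ch in ROMAN for ch in token):
        return roman_to_int(token)
    raise ValueError("unrecognized term: %r" % token)

def solve(challenge):
    """Sum the terms of a challenge whose only operators are 'plus' and '+'."""
    tokens = challenge.replace("+", " plus ").split()
    return sum(term_value(t) for t in tokens if t.lower() != "plus")

print(solve("four plus VI + 9"))  # prints 19
```

A person reads the question at a glance; a general-purpose indexing robot has no reason to carry this kind of parser, which is the point of the challenge.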

online.daylight.com: scalability

Identical software should operate at all deployment scales.

WWW
The big one, e.g., where online.daylight.com lived.
Average over 4+ months: ~25,000 requests/day (peak day 85K+, 2 sigma 62K+). Average number of searches was > 3800/day. Performance was deemed acceptable in almost all cases (Chinese connectivity was occasionally unreliable).
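
To put these daily totals in perspective, they can be converted to per-second rates. The figures in this sketch are the ones quoted above, taken at their stated lower bounds; the reading of "2 sigma" as mean plus two standard deviations is an assumption, as are the variable names.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(requests_per_day):
    """Average request rate over a day, in requests per second."""
    return requests_per_day / SECONDS_PER_DAY

mean_day = 25000     # average requests/day
peak_day = 85000     # peak day (lower bound)
two_sigma = 62000    # "2 sigma" figure (lower bound)
searches_day = 3800  # average searches/day (lower bound)

# Assumption: "2 sigma" means mean + 2 * standard deviation.
implied_sigma = (two_sigma - mean_day) / 2
search_fraction = searches_day / mean_day

print("mean rate:      %.2f requests/s" % per_second(mean_day))
print("peak-day rate:  %.2f requests/s" % per_second(peak_day))
print("implied sigma:  %.0f requests/day" % implied_sigma)
print("search share:   %.1f%% of requests" % (100 * search_fraction))
```

Averaged over a day the load is well under one request per second, although real traffic is bursty and short-term peaks would be higher than these averages suggest.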

LAN
Operation on Daylight's heterogeneous LAN was good. Included 10/100BT copper, 11Mb wireless, frame-relay, sync and async DSL, async cable, and PPP over normal telephone lines.

MUGnet
~4 days of homogeneous 100BT LAN at MUG'01: moderate to occasionally very heavy use, no problem.

Isolated laptop
Loaded on PowerBook G4 (single processor 400 MHz "titanium") running OS-X for ACS meetings in San Diego (as isolated laptop) and Chicago (as both demo machine and isolated network server). Surprisingly wonderful performance as server (given 2+GB VM vs 1GB physical memory). A little pokey as both client and server, but no actual complaints.

online.daylight.com: reliability

Should meet highest standards of reliability and performance

reliability by design
- Dedicated host (G4-DP)
- Dedicated URL (online.daylight.com)
- No extraneous services (telnet, ftp, etc.)
- Adequate memory: 1.38 GB VM, 1.5 GB physical (0 pageouts)
- Dedicated UPS
- (-) non-redundant power supply
- (-) single network (nameserver was SPOF)
- (-) Mac OS-X "Public Beta" used all 4 months

reliability experience
- System taken down for MUG'01
- System started twice during MUG'01 setup (that Dalke!)
- System restarted once to clear wdi-hang
- 4 power failures, one exceeded limits, self-restarted in ~22 mins
- System restarted to update sandman server
- wdi server would hang/restart every other week (OS-X beta problem?)
- overall (from gold): 99.44% server uptime
- average server uptime (not counting MUG): 54 days
- we could do better ... we're already doing better, e.g., no fedora server failure on Solaris, Irix or OS-X 10.1.
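
The 99.44% figure can be turned into hours of downtime. The sketch assumes a 120-day period as a stand-in for the "4+ months" mentioned earlier; the exact period is not given. It also treats the percentage as a single overall number, whereas gold's figure may be an average across servers.

```python
def downtime_hours(uptime_percent, period_days):
    """Hours of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_days * 24

# Assumption: roughly 120 days of operation.
hours = downtime_hours(99.44, 120)
print("about %.1f hours of downtime" % hours)
```

Under that assumption the figure corresponds to roughly 16 hours in total, spread over the restarts and power failures listed above.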

security experience
- Experienced 3 sets of documented external attacks
  O/S was immune to two of them (Windoz-specific)
  nonexistence of telnet/ftp foiled the third
- Our captive test-geek managed to hang sandman once (fuzztest)
- Huge, sporadic robot-access might be considered a successful denial-of-service attack

online.daylight.com: hardware

Hardware selection is a mission-specific choice.

Macintosh OS-X
- Mac OS-X Public Beta was a curious choice for this test.
- Main reason was that Macs were readily available and cheap.
- Unexpected bonus was simplicity of scale-testing.
- Mac OS-X is a real O/S! 3-day Daylight port. 1-day fedora port.
- Macintosh hardware is limited on high-end (2x800 G4, 1.5 GB)
- Amazing performance/price. Cluster software available.

Solaris/Sun
- Solaris running on UltraSparc is still our high-end favorite.
- Developed on Ultra-60 workstation under 2.6, works fine.
- Ported to Enterprise 3000 and Sun-250 servers (2.7, 2.8), also fine.
- More work to be done to optimize for really big machines.

Irix/SGI
- Ported to and tested on a number of SGI workstations and servers (O2, Octane, Origin 200/2000, etc.)
- No problems, good performance.

Red Hat Linux/Intel & other
- Ported to and tested on a number of Red Hat Linux machines: Dell powerbook, 4-processor SGI, also LinuxPPC on Mac PowerBook.
- Some difficulty with gcc compiler due to a known problem handling large arrays. Conveniently resolved by adding a big swap disk.
- No problems with servers once made, very good performance, great performance/price.

online.daylight.com: experience

What did we learn?

Greatly simplified data services are feasible
Zero-learning-curve interfaces to sophisticated data are possible
But what of the power-user?

There appears to be a large demand for such interfaces
A very large number of users hit the online.daylight.com experiment (4,000+ repeat-day-users); lots of complaints when it went down.
Are they in our traditional market or not?

The molecular model can effectively bridge many disciplines
- wdi-tcm: an east-west connection
- tcm/wdi-pathos: the metabolic chart as a drug index
- tcm/wdi-planet: proteins are molecules, too
- clogp/qsar-spresi/imagine: nicer pictures for all
- wdi/planet-ecbook and tcm/dcm: non-structural peeking
- tcm/dcm/park/zi4: multi-lingual searching

Amazing increases in machine capacity change the nature of EDA
At $0.10/MB, everything of value belongs in RAM. Address-by-content becomes feasible. Encapsulated knowledge servers can be both data producers and consumers.

Great improvement in server implementation efficiency is possible
- Averaging ~1 server / month (during HTTP toolkit development!)
- Development is limited by understanding the information rather than programming. Understanding is never trivial.
- Is there interest in the HTTP toolkit as a product?