-- 16th Daylight User Group Meeting -- 26 Feb - 1 Mar 2002
Dave Weininger, Daylight CIS
To be covered:
- What is fedora, and why bother?
- Why federate rather than unify?
- Information models
- Making it work: a modern delivery mechanism
- Demonstration: dayflash
- Example servers.
- To everything, there is a season: fedora's place
What is fedora, and why bother?
- Federation of Research Assets
- Web-like database system using HTTP communication
- Language is the unifying communications element
- The molecular model provides integration of chemical information
What is fedora, and why bother?
- Integration of disparate information is necessary, difficult
Researchers are limited by fixed data models
- Computers get better, access to information gets more difficult
Integrating a new type of information is harder now than in 1990
- We're not even reinventing the wheel, we're duplicating it!
Encapsulate data once, then use it everywhere
- Researchers need to try out new things cheaply
Special-purpose applications don't serve researchers well
Why federate rather than unify?
- Information is best represented by a native data model.
- Each server does the best job representing its data,
regardless of application.
- Integration with other information sources is done by each
server using local, native data model.
- Servers don't have to know about each other's conventions.
- Cheap to try new things, due to minimal side effects.
- Each resource can be supported/tested/administered separately.
- Resource modularization allows efficient, specific deployment.
- Scale independence: isolated laptop to collaborative web.
- Potential for zero-administration implementation.
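The federated approach listed above can be illustrated with a minimal in-process sketch. It assumes two hypothetical servers, `LogPServer` and `DrugIndexServer` (names and field layouts invented for illustration), that keep their data in different native models and share only a query language, here a SMILES string. The stored values are illustrative placeholders, not actual database contents, and the exact string match stands in for real structure searching.

```python
from typing import Protocol


class Server(Protocol):
    """What a client needs from any server: a name and a lookup by SMILES."""
    name: str

    def lookup(self, smiles: str) -> list[str]: ...


class LogPServer:
    """Hypothetical server whose native model is a flat structure -> value table."""
    name = "logp"

    def __init__(self) -> None:
        self._table = {"CCO": -0.31}  # illustrative value only

    def lookup(self, smiles: str) -> list[str]:
        value = self._table.get(smiles)
        return [] if value is None else [f"logP = {value}"]


class DrugIndexServer:
    """Hypothetical server whose native model is a list of multi-field entries."""
    name = "drugs"

    def __init__(self) -> None:
        self._entries = [
            {"structures": ["CCO"], "names": ["ethanol", "ethyl alcohol"]},
        ]

    def lookup(self, smiles: str) -> list[str]:
        return [
            f"name = {n}"
            for entry in self._entries
            if smiles in entry["structures"]
            for n in entry["names"]
        ]


def federated_lookup(servers: list[Server], smiles: str) -> dict[str, list[str]]:
    """Ask every server the same question; no server knows about the others."""
    return {s.name: s.lookup(smiles) for s in servers}


result = federated_lookup([LogPServer(), DrugIndexServer()], "CCO")
```

Adding a third server in this sketch means appending one object to the list; neither existing class changes, which is the "minimal side effects" property in miniature.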
- HTTP was specifically designed for collaborative computing
- HTTP is now used for most information exchange on this planet
- Flexible: can transport all kinds of data (not just HTML)
- Scalable: from laptop to workgroup to corporate LAN to WWW
- Universal: virtually everyone already knows how to browse HTTP
- Stateless model is harder to secure
- Real-time applications are limited
e.g., molecular editing, 3-D display
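As a rough illustration of the points above, the following sketch uses Python's standard `http.server` module to serve chemical data as plain text rather than HTML. It is not fedora's implementation: the `STRUCTURES` table, the URL layout, and the `fetch` helper are assumptions made for the example. Each request is answered from the request path alone, which is the stateless model in practice.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical data source: name -> SMILES.
STRUCTURES = {"ethanol": "CCO", "benzene": "c1ccccc1"}


class StructureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Stateless: everything needed to answer is in the request path.
        name = self.path.strip("/")
        smiles = STRUCTURES.get(name)
        if smiles is None:
            self.send_error(404, "unknown name")
            return
        body = smiles.encode("ascii")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=us-ascii")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def fetch(name: str) -> tuple[str, str]:
    """Start a server on a free local port, fetch one record, shut down."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), StructureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/{name}"
        # Bypass any proxy configured in the environment for this local call.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as resp:
            return resp.headers.get_content_type(), resp.read().decode("ascii")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `fetch("benzene")` returns the content type and the SMILES text. Any HTTP client, including a browser, could make the same request.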
Here are some fedora database synopses, selected to illustrate the wide
variety of information models which are "native" to various kinds of data.
- observed hydrophobicity as log(Po/w)
- 11,053 structures, names, observed data of one kind
- "simplest" chemical information model is still non-trivial
- find exact/generic structure, similar structures, find by name
- World Drug Index, pharmacology of named drugs
- 67,059 formal entries containing ~800,000 fields
- 7 pharmacology types (100K items),
12 name types (400K items)
- discrete structures, combination preps and unknown structures
- full complement of structure/name/data searching methods needed
- Traditional Chinese Medicine: structures, indications and effects
- biological entries (sources of Chinese medicinals)
- chemical entries (structures observed in Chinese medicinals)
- information is represented in English, Chinese (Pinyin) and SMILES
- underlying information model is Traditional Chinese Medicine
- annotated photographic archive of medicinal sources
- photographs of medicinal sources which are mostly biological
- provides best identification available for such entities
- complements Latin/English/Chinese names which are not reliable
- annotations searchable in multiple languages
- Dictionary of Chinese Medicine, encyclopedic
- 220,000+ fields in 4 languages (English, Chinese, Pinyin, Latin)
- underlying information model is Traditional Chinese Medicine
- identifiers are expressed as traditional Chinese characters
- add'l info, e.g., western medical concepts, acupuncture channels
- intrinsic queries are multilingual-multiconceptual
- Chinese character service
- Chinese character image and mapping services
- Unicode, Traditional- and Simplified-Chinese characters
- Traditional to Simplified translation
- pragmatic utility handles non-trivial requirement
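One reason Traditional-to-Simplified translation is non-trivial is that the mapping is many-to-one, so it cannot be reversed without context. The sketch below shows this with a tiny hand-picked table. It is an assumption-laden illustration, not the character service's actual data or method, and the function names are invented.

```python
# A few well-known Traditional -> Simplified pairs; a real table has thousands.
TRAD_TO_SIMP = {
    "藥": "药",  # medicine
    "醫": "医",  # to heal
    "學": "学",  # study
    "發": "发",  # to emit
    "髮": "发",  # hair: merges with the character above when simplified
}


def to_simplified(text: str) -> str:
    """Map each character independently; unknown characters pass through."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)


def traditional_candidates(ch: str) -> list[str]:
    """Reverse lookup may yield several candidates, or the character itself."""
    return sorted(t for t, s in TRAD_TO_SIMP.items() if s == ch) or [ch]
```

In this sketch the forward direction is a plain per-character lookup, while the reverse direction returns a list that a caller would have to disambiguate.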
- Protein-Ligand Association NETwork
- each data object is a protein-ligand association, i.e., a
relationship between one or more proteins and one or more ligands
- e.g., observed binding, computed docking
- large/small molecule search/similarity methods are implemented
- robust in the face of relatively poor oxidation state information
typical of crystallographic data
- big and fast
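The data object described above, a relationship between one or more proteins and one or more ligands, might be modeled as follows. This is a sketch only: the `Association` class, its field names, and the placeholder identifiers are assumptions, and the ligand lookup uses exact string matching rather than real structure searching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    proteins: tuple[str, ...]  # hypothetical protein identifiers
    ligands: tuple[str, ...]   # ligand structures as SMILES
    kind: str                  # e.g. "observed binding" or "computed docking"

    def __post_init__(self) -> None:
        # The relationship needs at least one member on each side.
        if not self.proteins or not self.ligands:
            raise ValueError("need at least one protein and one ligand")


def find_by_ligand(associations: list[Association], smiles: str) -> list[Association]:
    """Return every association that involves the given ligand."""
    return [a for a in associations if smiles in a.ligands]


# Placeholder records for illustration only.
records = [
    Association(("protein-A",), ("CCO",), "observed binding"),
    Association(("protein-A", "protein-B"), ("CCO", "c1ccccc1"), "computed docking"),
    Association(("protein-C",), ("c1ccccc1",), "observed binding"),
]
hits = find_by_ligand(records, "CCO")
```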
- metabolic pathway chart
- models a modern metabolic pathway chart
- agents, cofactors, compounds, diseases, enzymes, landmarks, notes,
pathways, regulators, steps
- represents unified metabolism (plant, unicellular and animal)
- a natural index: most drugs operate in this realm
- massive image representing a massive reaction schema
- integrated name/structure/similarity/functionality searching
- Enzyme Commission codebook
- EC code index is a discrete model of enzyme functionality
- EC codes are dynamic with time
- many-to-many relationship required due to multifunctional enzymes
- primarily a crossreference and index to other databases
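The many-to-many requirement above can be sketched as a pair of indexes kept in step, so that lookups work in either direction. The `CrossReference` class and the enzyme names are hypothetical. The EC codes follow the usual four-part format but are used here only as sample keys.

```python
from collections import defaultdict


class CrossReference:
    """Many-to-many index between enzymes and EC codes."""

    def __init__(self) -> None:
        self._codes_by_enzyme: dict[str, set[str]] = defaultdict(set)
        self._enzymes_by_code: dict[str, set[str]] = defaultdict(set)

    def link(self, enzyme: str, ec_code: str) -> None:
        # Record the pair in both directions.
        self._codes_by_enzyme[enzyme].add(ec_code)
        self._enzymes_by_code[ec_code].add(enzyme)

    def codes_for(self, enzyme: str) -> set[str]:
        # .get avoids creating empty entries for unknown keys.
        return set(self._codes_by_enzyme.get(enzyme, ()))

    def enzymes_for(self, ec_code: str) -> set[str]:
        return set(self._enzymes_by_code.get(ec_code, ()))


xref = CrossReference()
xref.link("enzyme-X", "1.1.1.1")  # a multifunctional enzyme carries
xref.link("enzyme-X", "2.7.1.1")  # more than one EC code
xref.link("enzyme-Y", "1.1.1.1")  # and one code can cover several enzymes
```

Because EC codes change over time, a fuller version would also need to record retired or transferred codes; that is left out of this sketch.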
- Quantitative Structure-Activity Relationships
- Each primary entity is an observed relationship between molecular
structure and biological activity (6500+) or physical property (7600+)
- Raw data and references are included (13,000+ authors!)
- Search by both component (data) and relationship (QSAR) properties
- Supports relationships of QSAR relationships (comparative QSAR)
- However, other applications might be just as important
Making it work: a modern delivery mechanism
deliver large amounts of disparate information in
a manner which is robust, reliable, integrated, and operationally trivial.
- "Computers get better, access to information gets more difficult."
-- if we are to succeed, this needs to be reversed.
- Continuing improvements in computer technology make it possible
to simplify information access; let's use them.
- HTTP-oriented services form a logical beginning:
how do we implement a complete solution for research purposes?
- One attractive solution: embedded HTTP servers on flash memory devices.
- good capacity, excellent random-access performance
- promises zero-administration solution
- embedded hardware licensing can simplify installation
- provide all required resources: admin privileges not needed
- robust, no moving parts, 10 year persistence
- main disadvantage: this is not the way we already do things!
- This has been implemented using USB flash-memory-based drives.
I plan on plugging a dayflash device into a USB port on a laptop:
it should come up serving multiple data sources to the network with
no fuss, no bother.
- Plug-n-play: self configuring, zero-admin
- Truly minimal interaction, no command line
- Nothing written to disk: no traces, no priv's needed
- Delivers high performance network services
- Available at MUG'02
wdidemo, tcm, dcm, park, zi4, logpstar
above, plus full wdi, qsar, imagine, pathos, and some more
planet, pathos, ecbook, wdidemo
To everything, there is a season: fedora's place
- Excellent for wide dissemination of complex data sources
- Good for research; not perfect but better than what we have so far
- Does not compete with RDB of data with known/simple data models
- Limited number of databases to be released with Daylight 4.81
- HTTP toolkit also to be released with Daylight 4.81
Thanks are due to a large number of people and companies who have contributed
(and are still contributing) to this effort in various ways:
- Apple ... for still being there
- Derwent ... for WDI
- Jack Delany and Yosi Taitz, Daylight ... for making room for this work
- Nigel Wiseman and Bob Felt, Paradigm ... for DCM
- Peter Nielsen, Daylight and Bill Milne, Ashgate ... for TCM
- Dawn Abriel ... for medical expertise
- Ragu Bharadwaj, Daylight ... for JavaGrins and OS-X enthusiasm
- Roger Sayle, Metaphorics ... for rasmol and inspiration
- Scott Dixon, Metaphorics ... for planet
- Xinjian Yan, NIST ... for TCM
- Zhou Jiaju, Chinese Academy of Sciences ... for TCM
- Al Leo and Corwin Hansch, BioByte ... for c/logp and QSAR
Thank you for your time and interest.