Dayblob: an ORDBMS-oriented
chemical information package

Dave Weininger, Daylight

What is dayblob?


Design criteria


Interfaces

Our ambitious stability requirement called for a new object interface. We aspire to provide a component that not only continues to work correctly as machine architectures and programming languages evolve, but may also be asked to do its work on a different machine type than the controlling RDB server. Neither Daylight's Toolkit interface (dt) nor existing RDB API interfaces (e.g., OCI) handles this problem well.

A specialized object interface was developed for this project (the "db" interface). It is largely modelled on Daylight's "dt" object interface but is an entirely independent entity. Many of the same types of object are supported (e.g., integers, strings, objects, sequences of objects), but using neutral types allows us to change either side without changing our interface. For instance, one side of the partnership can move from pure 32-bit to LP64 without disrupting the other.
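To make the "neutral types" idea concrete, here is a minimal sketch of an encoding whose widths and byte order are fixed by the interface rather than by either machine. The tag values, the 8-byte integer width and the function names are assumptions made for illustration; the actual "db" encoding is not described here.

```python
import struct

# Hypothetical type tags; not taken from the real "db" interface.
TAG_INT, TAG_STR, TAG_SEQ = 1, 2, 3


def encode(value):
    """Encode ints, strings and lists with fixed widths and big-endian order."""
    if isinstance(value, int):
        # Always 8 bytes, whatever the word size of the sending machine.
        return struct.pack(">Bq", TAG_INT, value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        return struct.pack(">BI", TAG_STR, len(data)) + data
    if isinstance(value, list):
        body = b"".join(encode(item) for item in value)
        return struct.pack(">BI", TAG_SEQ, len(value)) + body
    raise TypeError("unsupported type: %r" % type(value))


def decode(buf, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    tag = buf[pos]
    if tag == TAG_INT:
        (number,) = struct.unpack_from(">q", buf, pos + 1)
        return number, pos + 9
    if tag == TAG_STR:
        (length,) = struct.unpack_from(">I", buf, pos + 1)
        start = pos + 5
        return buf[start:start + length].decode("utf-8"), start + length
    if tag == TAG_SEQ:
        (count,) = struct.unpack_from(">I", buf, pos + 1)
        pos += 5
        items = []
        for _ in range(count):
            item, pos = decode(buf, pos)
            items.append(item)
        return items, pos
    raise ValueError("unknown tag: %d" % tag)
```

Because every width is spelled out in the format string, a 32-bit caller and an LP64 callee would produce and accept exactly the same bytes.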

Support for some RDB-specific objects is provided. Row IDs are essential to the RDB data model and are treated as short variable length binary objects (BOBs). BLOBs ("binary large objects") are complex objects which are accessed via contexts (CXTs) and object locators (LOCs).
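As an illustration of treating Row IDs as short variable-length binary objects, the sketch below stores each one behind a 2-byte length prefix. The prefix width and the function names are assumptions; the text does not say how dayblob actually lays out its BOBs.

```python
import struct


def pack_rowids(rowids):
    """Concatenate row IDs, each preceded by a 2-byte big-endian length."""
    out = bytearray()
    for rowid in rowids:
        if len(rowid) > 0xFFFF:
            raise ValueError("row ID too long for a 2-byte length prefix")
        out += struct.pack(">H", len(rowid))
        out += rowid
    return bytes(out)


def unpack_rowids(buf):
    """Split a packed buffer back into the original list of row IDs."""
    rowids = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated row ID at offset %d" % pos)
        rowids.append(bytes(buf[pos:pos + length]))
        pos += length
    return rowids
```

The component never interprets the bytes of a Row ID; it only stores them and hands them back, which is what lets the Row ID format vary between servers.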


Architecture

The architecture of a high-performance database component normally involves a number of uncomfortable compromises. For flexibility, we need to handle things like Row IDs which vary in size. For performance, we need big chunks of fast, randomly-accessible storage (i.e., memory) which are nonetheless persistent and secure. For stability, we need a reliable method for accessing data which permits efficient caching (and which is likely to evolve over time). And in some cases (such as Oracle), we can't get access to the server's process space, no way, no how.

In a traditional RDB environment, such requirements are mutually contradictory. However, the novel approach of operating entirely within a BLOB (suggested by Sam Defazio and implemented by Cathy Trezza) appears to meet these requirements. The idea is that a component operates entirely within a single chunk of persistent storage (the BLOB). BLOB storage is managed by the RDB server via an ORDBMS interface such as OCI. From the database's POV, the BLOB is just a bunch of bits in a table and the software component is an extensible indexing method which uses these bits. From the component's POV, it can do anything it likes within the BLOB, e.g., run a fast specialized database of its own. Cool.
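The "component lives inside the BLOB" idea can be sketched with a toy record store whose entire state is one byte buffer. The header layout, record width and class name below are invented for illustration and say nothing about dayblob's real internal format.

```python
import struct

# Hypothetical layout: 8-byte header, then fixed-width 20-byte records.
HEADER = struct.Struct(">4sI")   # magic, record count
RECORD = struct.Struct(">I16s")  # record id, payload padded to 16 bytes
MAGIC = b"DEMO"


class BlobStore:
    """A tiny record store that keeps all of its state inside one bytearray."""

    def __init__(self, blob):
        self.blob = blob
        # Initialise an empty or unrecognised buffer with a fresh header.
        if len(blob) < HEADER.size or HEADER.unpack_from(blob, 0)[0] != MAGIC:
            blob[:] = HEADER.pack(MAGIC, 0)

    def count(self):
        return HEADER.unpack_from(self.blob, 0)[1]

    def append(self, record_id, payload):
        """Add a record at the end and return its index."""
        if len(payload) > 16:
            raise ValueError("payload longer than 16 bytes")
        index = self.count()
        self.blob += RECORD.pack(record_id, payload)
        HEADER.pack_into(self.blob, 0, MAGIC, index + 1)
        return index

    def get(self, index):
        """Fetch a record by index with a single offset calculation."""
        if not 0 <= index < self.count():
            raise IndexError(index)
        offset = HEADER.size + index * RECORD.size
        record_id, payload = RECORD.unpack_from(self.blob, offset)
        # Padding is stripped, so payloads ending in zero bytes are not preserved.
        return record_id, payload.rstrip(b"\0")
```

Whoever owns the buffer (here a plain bytearray, in the real system the RDB server) sees only bytes, while the store on top of it is free to organize those bytes however it likes.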

Aside from the calling program, dayblob requires two external utilities: one to define and manage "Row IDs" and one to manage the BLOB. BLOB management is done via contexts (CXTs) and object locators (LOCs). BLOBs are accessed in discrete read-only or read/write segments to enable efficient caching.
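Segment-wise access with caching might look roughly like the sketch below. The Locator and Context classes are hypothetical stand-ins for LOCs and CXTs, the segment size is arbitrary, and an in-memory bytearray plays the part of the server-managed BLOB.

```python
class Locator:
    """Hypothetical stand-in for a LOC: identifies the bytes of one BLOB."""

    def __init__(self, data):
        self.data = data  # bytearray standing in for server-managed storage
        self.reads = 0    # number of segment fetches from the backing store


class Context:
    """Hypothetical stand-in for a CXT: caches fixed-size BLOB segments."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.cache = {}  # (locator id, segment index) -> [bytearray, dirty flag]

    def _load(self, loc, index):
        key = (id(loc), index)
        if key not in self.cache:
            start = index * self.segment_size
            loc.reads += 1
            segment = bytearray(loc.data[start:start + self.segment_size])
            self.cache[key] = [segment, False]
        return self.cache[key]

    def read(self, loc, index):
        """Read-only access: return an immutable copy of one segment."""
        return bytes(self._load(loc, index)[0])

    def write(self, loc, index, offset, data):
        """Read/write access: modify the cached segment and mark it dirty."""
        entry = self._load(loc, index)
        if offset < 0 or offset + len(data) > len(entry[0]):
            raise ValueError("write does not fit inside the segment")
        entry[0][offset:offset + len(data)] = data
        entry[1] = True

    def flush(self, loc):
        """Write dirty segments of this BLOB back to the backing store."""
        for (loc_id, index), entry in self.cache.items():
            if loc_id == id(loc) and entry[1]:
                start = index * self.segment_size
                loc.data[start:start + len(entry[0])] = entry[0]
                entry[1] = False
```

Keeping read-only and read/write access separate means only segments that were actually modified need to travel back to the server.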


Capabilities

Most of the structure-oriented database capabilities in Daylight Release 4.61 are delivered, including:

Performance

The dayblob package was tested against the standard Daylight test suite used to evaluate merlinserver performance. Detailed results of this test are available.

For example, 243 superstructures of the cefalosporin-g1 moiety are known to exist among the 37037 structures in the test database wdi93. This search was done on the same machine using dayblob and merlinserver, then the results were examined for correctness and relative speed. In this case, dayblob completed the search correctly in 0.68 seconds, which is 85% of the time used by merlinserver461 (using 1 CPU only).
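The figures in this example imply a merlinserver time and a hit rate that are not stated directly; the short calculation below derives both from the numbers quoted above. The function and variable names are ours.

```python
def implied_baseline(seconds, fraction_of_baseline):
    """Time taken by the baseline, given our time as a fraction of it."""
    return seconds / fraction_of_baseline


# dayblob took 0.68 s, quoted as 85% of the merlinserver461 time.
merlin_seconds = implied_baseline(0.68, 0.85)

# 243 hits among the 37037 structures in wdi93.
hit_fraction = 243 / 37037

print(f"merlinserver461: {merlin_seconds:.2f} s, hit rate: {hit_fraction:.2%}")
```

In other words, the comparison is between roughly 0.7 and 0.8 seconds for a query that selects well under 1% of the database.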

Selecting a current high-end workstation (Sun Ultra60) for comparison, for typical structural searches, dayblob is about the same speed as the current merlinserver461 when run on 1 CPU (30% slower to 15% faster, depending on query). Including pathological cases, it is 2.5x slower.


Limitations

Limitations of dayblob v46107 include:

Limitations of the prototype cartridge include:

Chemistry Cartridge demos

A number of CGIs demonstrating the Chemistry Cartridge prototype are available here.

The real experiment: collaboration

As much as anything else, the dayblob project has been an experiment in collaborative work between two very different companies. To make this project succeed, each of us had to move towards the other's philosophy to some extent. The fact that this project is working attests to the willingness of the members to overcome the many obstacles encountered along the way. Outside the development environment, one rarely hears about such things. With the dayblob-based Chemistry Cartridge, we have shown that it is possible to produce a product in which each side does what it does best, i.e., Oracle develops the RDB side, Daylight the cheminfo algorithms. Contrary to historical precedent, it is not necessary for one company to own or control the other's realm. This speaks worlds for the force of will and the level of respect among the participants. Speaking for myself, I am as pleased by this aspect of the project as I am by any aspect of technical excellence in the product.

This is an ongoing project, and an ongoing process. Although we are "over the hump", there is much more work to be done. The two main areas of effort remaining have to do with applications (Who defines them? Which one first? Who produces them?) and business models (Whose product is this, anyway?). We have every reason to expect that these issues will be addressed with the same level of creativity and cooperation that has characterized this project thus far.


Acknowledgements

Sandeepan Banerjee, Oracle
Roger Critchlow, Daylight
Sam Defazio, Oracle
Jack Delany, Daylight
Bob Gouslin, Oracle
Steve Hagan, Oracle
Mike Mansell, Oracle
Norah McCuish, Daylight
Johnny Petersen, Oracle
Yosef Taitz, Daylight
Cathy Trezza, Oracle
Markus Visscher, Oracle
Dave Weininger, Daylight
and the Novartis "Chemical Datawarehouse" people
Daylight Chemical Information Systems, Inc.
info@daylight.com