Daylight Chemistry Database Cartridge

Oracle8i Implementation Overview

Sam DeFazio, Cathy Trezza
Oracle Corporation
New England Development Center
Nashua, NH 03062

Chemistry Database Cartridge Components

The cartridge is made up of several server-side components. This includes the Dayblob interface to build and update the index as well as the implementation of the extensible indexing API. The extensible index API requires handling initial creation of the index, keeping the index updated when DML operations occur, and of course, calling the appropriate query function during searches. The Source Table contains the SMILESs data and is referred to on index creation and is not altered. The index data itself is stored in the Index Table.
Oracle8i Extensible Index Table Structures
Source Table

Index Table

The source data is a simple string. You can store the string in a standard VARCHAR2 column or a VARCHAR2 object attribute. In this example you might have a table with the compound name (shown as NAME here), which is a string. You might also have the compound itself in the table (SMILES here) which contains the SMILEs description of the compound, also a string. There might also be other columns in the table.

The index data is stored in an ancillary index table. This table contains a couple of columns. One is a Binary Large Object, also known as a BLOB (shown as HASH here). This column contains the index structure used in any queries. The second column is called REFRESH_NUMBER, and is used for caching the index (which is discussed below). This table contains exactly one row.

Daylight Index Data Structure

The Daylight index structure is a hash table (actually several hash tables) and is opaque to the database. This structure is stored as a single BLOB and supports a variety of search operations, including Equals, a tautometric search (Tautomers), Contains, Matches, Similar with a rank or score, Nearest with a rank or score, and a LookFor operator (for index uniqueness).
Oracle8i Extensible Index Framework

Smiles Index Implementation

     create or replace type SmilesIndexIM as object
         -- Query context/state
         db_loc RAW(8), 
         lob_loc RAW(8),

         -- DDL Methods
         member function ODCIGetInterfaces(...),
         member function ODCIIndexCreate(...),
         member function ODCIIndexDrop(...),

         -- DML Methods
         ODCIIndexInsert (...)  return number,
         member function ODCIIndexDelete (...) return number,
         member function ODCIIndexUpdate (...) return number,

         -- Query Operator Methods
         -- IndexStart method for Contains
         member function ODCIIndexStart(..., strt number, 
                                        stop number,
                                        cmpval varchar2)
                                        return number,
         -- IndexStart method for SIMILAR
         member function ODCIIndexStart(..., strt number, 
                                        stop number,  
                                        cmpval varchar2, 
                                        threshold float)
                                        return number,
         member function ODCIIndexFetch(nrows number, 
                                        rids OUT
                                        return number,
         member function ODCIIndexClose return number

Smiles IndexType Definition

     create or replace indextype SmilesIndex
     using SmilesIndexIM ;

Sample Usage

     -- Create sample table of compounds
     create table Compound
     (ID	number,
      Name	varchar2(256),
      Smiles	varchar2(4000) ...

     -- Create index on smiles column
     create index on Compound(Smiles) 
            INDENXTYPE IS SmilesIndex ... ;

     -- Perform similarity query and return score
     select Score(1)
     from Compound
     where Similar(Smiles, 'S=PS', .1, 1)=1;

     -- Select compounds similar to given structure

     select Name, Smiles
     from Compound
     where Contains(Smiles, 'CN(C)CCCN')=1;

     -- Select the nearest 10 compounds to given structure
     select Score(1)
     from Compound
     where Nearest(Smiles, 'S=PS', 10)=1;

Resource Considerations

When a query is issued, the index table data is loaded from a disk file into the buffer cache, and is kept around until it is no longer used. Once the table data is loaded into the buffer cache, the database doesn't have to fetch the data from the disk when another query accesses the table.

The start, fetch, and close methods of a query operator run in a separate external process, called extproc. There is one external process per session. The external process is separate from the database process, and CAN NOT bring down the server in the event of a program error in the index. The data from the buffer cache is passed to the external process via standard OCI requests to read the data from the BLOB. This data is stored in memory and passed to the Dayblob interface to process the query. The results of the query are the ROWIDs which are then passed back to the database along with any ancillary information, such as score.

The index itself can be large. A 1 million compound table would have an index table that is about 136M, and the entire amount could be allocated for each session using the index. To deal with this issue, we recommend session multiplexing, discussed below.

Session Multiplexing

For databases with large numbers of compounds, even a large server is likely to be able to handle only a small number of concurrent sessions running queries using the cartridge, due to the large memory requirement per sesson.

Common queries run in a short period of time, so the best way to deal with this problem is to place an application tier between the user and the database and multiplex the requests to the available sessions in the database. This allows you to process all the queries but keep a handle on the memory usage on the server.

Index Caching

In addition to multiplexing the requests to the available processes in the server, we can speed the processing of the smaller, common queries by "caching" the memory used in the external process and not freeing it when a query is complete. This eliminates the calls to allocate memory in the external process and load the index from the buffer cache (after the first query). The result with queries against a large index that produce a small number of rows is a significant performance gain. The reason for this gain is that the time to calculate the results of the query is much less than the time it takes to load the index from the buffer cache. For complex queries, the gain from caching the index is not as noticible due to the long query processing time.

Caching is turned on and off via a procedure call and is under the user's control. The cache is automatically reloaded when a change occurs to the index. This synchronization is handled by a sequence number in the index table (REFRESH_NUMBER) and the sequence number of a package holding the memory for the index. When these numbers match, the memory is in synch with the table. When these numbers differ, such as when a transaction is commited in this or another session, the two numbers will be different and the index is marked to be reloaded at the start of the next query.

Oracle8i+ Resources