EMUG98 Talk

Daylight Chemistry Database Cartridge

Oracle8i Implementation Overview

Sam DeFazio, Cathy Trezza
Oracle Corporation
New England Development Center
Nashua, NH 03062
sdefazio@us.oracle.com
ctrezza@us.oracle.com

Chemistry Database Cartridge Components

Dayblob procedures

Index build/update
Search

Oracle8i extensible index methods

DDL
DML
Query

Source data table
Index table

The cartridge is made up of several server-side components. This includes the Dayblob interface to build and update the index as well as the implementation of the extensible indexing API. The extensible index API requires handling initial creation of the index, keeping the index updated when DML operations occur, and of course, calling the appropriate query function during searches. The Source Table contains the SMILESs data and is referred to on index creation and is not altered. The index data itself is stored in the Index Table.

Oracle8i Extensible Index Table Structures

Source data resides in a VARCHAR2 Smiles string

Table column
Object attribute

Daylight index data structure resides in a BLOB

**Source Table**
NAME VARCHAR2(100)	SMILES VARCHAR2(4000)	...
...	...	...

**Index Table**
HASH BLOB	REFRESH_NUMBER NUMBER

The source data is a simple string. You can store the string in a standard VARCHAR2 column or a VARCHAR2 object attribute. In this example you might have a table with the compound name (shown as NAME here), which is a string. You might also have the compound itself in the table (SMILES here) which contains the SMILEs description of the compound, also a string. There might also be other columns in the table.

The index data is stored in an ancillary index table. This table contains a couple of columns. One is a Binary Large Object, also known as a BLOB (shown as HASH here). This column contains the index structure used in any queries. The second column is called REFRESH_NUMBER, and is used for caching the index (which is discussed below). This table contains exactly one row.

Daylight Index Data Structure

Composite index data structure (Dave's invention)

Index structure supports a number of operations:

Equals
Tautomers
Contains
Matches
Similar, with Rank/Score
Nearest
LookFor

The Daylight index structure is a hash table (actually several hash tables) and is opaque to the database. This structure is stored as a single BLOB and supports a variety of search operations, including Equals, a tautometric search (Tautomers), Contains, Matches, Similar with a rank or score, Nearest with a rank or score, and a LookFor operator (for index uniqueness).

Oracle8i Extensible Index Framework

An Oracle8i extensible index implements the following:

DDL Methods

ODCIindexCreate: Create the index (including any index tables)

ODCIindexDrop: Drop the index

ODCIindexAlter: Alter the index (as in REBUILD)

ODCIindexTruncate: Truncate the index

DML Methods

ODCIindexInsert: Insert a new row

ODCIindexDelete: Delete a row

ODCIindexUpdate: Update a row

Query Operator Methods

ODCIindexStart: Called at the start of a query

ODCIindexFetch: Returns Query results as ROWID or array of ROWID

ODCIindexClose: Called at the end of a query

Index methods can be written using PL/SQL, C, Java

For each query operator:

Index Implementation
Functional Implementation

Smiles Index Implementation

create or replace type SmilesIndexIM as object ( -- Query context/state db_loc RAW(8), lob_loc RAW(8), -- DDL Methods member function ODCIGetInterfaces(...), member function ODCIIndexCreate(...), member function ODCIIndexDrop(...), -- DML Methods ODCIIndexInsert (...) return number, member function ODCIIndexDelete (...) return number, member function ODCIIndexUpdate (...) return number, -- Query Operator Methods -- IndexStart method for Contains member function ODCIIndexStart(..., strt number, stop number, cmpval varchar2) return number, -- IndexStart method for SIMILAR member function ODCIIndexStart(..., strt number, stop number, cmpval varchar2, threshold float) return number, member function ODCIIndexFetch(nrows number, rids OUT sys.odciridlist) return number, member function ODCIIndexClose return number )

Smiles IndexType Definition

create or replace indextype SmilesIndex for Contains(VARCHAR2,VARCHAR2), LookFor(VARCHAR2,VARCHAR2), Tautomers(VARCHAR2,VARCHAR2), Matches(VARCHAR2,VARCHAR2), Similar(VARCHAR2,VARCHAR2,FLOAT), Nearest(VARCHAR2,VARCHAR2,INTEGER) using SmilesIndexIM ;

Sample Usage

-- Create sample table of compounds create table Compound (ID number, Name varchar2(256), Smiles varchar2(4000) ... ); -- Create index on smiles column create index on Compound(Smiles) INDENXTYPE IS SmilesIndex ... ; -- Perform similarity query and return score select Score(1) from Compound where Similar(Smiles, 'S=PS', .1, 1)=1; -- Select compounds similar to given structure select Name, Smiles from Compound where Contains(Smiles, 'CN(C)CCCN')=1; -- Select the nearest 10 compounds to given structure select Score(1) from Compound where Nearest(Smiles, 'S=PS', 10)=1;

Resource Considerations

When a query is issued, the index table data is loaded from a disk file into the buffer cache, and is kept around until it is no longer used. Once the table data is loaded into the buffer cache, the database doesn't have to fetch the data from the disk when another query accesses the table.

The start, fetch, and close methods of a query operator run in a separate external process, called extproc. There is one external process per session. The external process is separate from the database process, and CAN NOT bring down the server in the event of a program error in the index. The data from the buffer cache is passed to the external process via standard OCI requests to read the data from the BLOB. This data is stored in memory and passed to the Dayblob interface to process the query. The results of the query are the ROWIDs which are then passed back to the database along with any ancillary information, such as score.

The index itself can be large. A 1 million compound table would have an index table that is about 136M, and the entire amount could be allocated for each session using the index. To deal with this issue, we recommend session multiplexing, discussed below.

Session Multiplexing

For databases with large numbers of compounds, even a large server is likely to be able to handle only a small number of concurrent sessions running queries using the cartridge, due to the large memory requirement per sesson.

Common queries run in a short period of time, so the best way to deal with this problem is to place an application tier between the user and the database and multiplex the requests to the available sessions in the database. This allows you to process all the queries but keep a handle on the memory usage on the server.

Index Caching

In addition to multiplexing the requests to the available processes in the server, we can speed the processing of the smaller, common queries by "caching" the memory used in the external process and not freeing it when a query is complete. This eliminates the calls to allocate memory in the external process and load the index from the buffer cache (after the first query). The result with queries against a large index that produce a small number of rows is a significant performance gain. The reason for this gain is that the time to calculate the results of the query is much less than the time it takes to load the index from the buffer cache. For complex queries, the gain from caching the index is not as noticible due to the long query processing time.

Caching is turned on and off via a procedure call and is under the user's control. The cache is automatically reloaded when a change occurs to the index. This synchronization is handled by a sequence number in the index table (REFRESH_NUMBER) and the sequence number of a package holding the memory for the index. When these numbers match, the memory is in synch with the table. When these numbers differ, such as when a transaction is commited in this or another session, the two numbers will be different and the index is marked to be reloaded at the start of the next query.

Oracle8i+ Resources