Daylight Summer School

Lab: DayCart extensions to SQL -- WITH HINTS AND ANSWERS

This set of labs guides the user through the use of Daycart functionality in SQL and PL/SQL.


The first operation will be to load the demo tables into your local Oracle instance. The data is available from the ftp server. The filenames are: tcm01.dmp, demo.sql, acd_demo.sql, wdi_demo.sql, and wombat_demo.sql.

From Import files:

The file "tcm01.dmp" is an Oracle dump file. It was generated using the Oracle "export" program, a utility which allows one to dump data from an Oracle database into a more portable binary format. The "import" utility will load this binary data into the database.

$ imp mug/coffee
Import: Release - Production on Mon May 24 15:52:51 2004

(c) Copyright 1999 Oracle Corporation.  All rights reserved.

Connected to: Oracle8i Enterprise Edition Release - Production
With the Partitioning and Java options
PL/SQL Release - Production

Import file: expdat.dmp > tcm01.dmp

Enter insert buffer size (minimum is 8192) 30720> <CR>

Export file created by EXPORT:V08.01.05 via conventional path

Warning: the objects were exported by TCM, not by you

import done in US7ASCII character set and US7ASCII NCHAR character set
List contents of import file only (yes/no): no > <CR>

Ignore create error due to object existence (yes/no): no > <CR>

Import grants (yes/no): yes > <CR>

Import table data (yes/no): yes > <CR>

Import entire export file (yes/no): no > yes

. importing TCM's objects into MUG
. . importing table                    "TCM01_MOL"       6780 rows imported
. . importing table                  "TCM01_PLANT"       1541 rows imported
. . importing table              "TCM01_PLANT_MOL"      12102 rows imported
. . importing table                    "TCM01_REF"        656 rows imported
. . importing table                "TCM01_MOL_REF"       7917 rows imported
Import terminated successfully without warnings.

From SQL Scripts:

Regular SQL scripts can be executed directly from within the SQLPlus environment:

$ sqlplus mug/coffee
SQL> @demo

...  (lots of output)

SQL> @acd_demo

...  (lots of output)

SQL> @wdi_demo

...  (lots of output)

SQL> @wombat_demo

...  (lots of output)

The main table of interest is: demo. It has chemical structures and ID numbers. The other tables are subsets of the ACD, WDI, and Wombat databases.

You can find out the names of the tables which are owned by the current user with a SQL select statement (from the data dictionary):

SQL>  select table_name from user_tables;

You can get information about the columns in the tables with the 'describe' command. For example:

SQL>  desc demo

Simple SQL Operators

  1. Find Dopamine (NCCc1ccc(O)c(O)c1) in the "demo" table.

    The SQL operator used for an exact match search is:

    operator exact (a IN VARCHAR2_OR_CLOB, b IN VARCHAR2_OR_CLOB) => NUMBER

    The SQL operator used for a absolute-SMILES match match search is:

    operator asmiles (smiles IN VARCHAR2_OR_CLOB, query IN VARCHAR2_OR_CLOB) => NUMBER

    The exact and role operators expect the query to be canonicalized in the same fashion as the column being searched. Since the demo table has absolute canonical SMILES, then the query must include the absolute canonical SMILES.

    There are several options here. Each will use a different index to accomplish the same operation:

  2. Find structures similar to Dopamine in the "demo" table. Impose a tanimoto cutoff of 0.7. Alternately, find the ten most similar structures to Dopamine in the "demo" table.

    The operators used for a tanimoto search is:

    operator tanimoto (fp_or_smi1 IN VARCHAR2_OR_CLOB, 
                       fp_or_smi2 IN VARCHAR2_OR_CLOB) => NUMBER
    operator tanimoto (fp_or_smi1 IN VARCHAR2_OR_CLOB,
                       fp_or_smi2 IN VARCHAR2_OR_CLOB, count IN NUMBER) => NUMBER

    The first SQL command is:

    select smi, tanimoto (smi, 'NCCc1ccc(O)c(O)c1')  from demo
      where tanimoto(smi, 'NCCc1ccc(O)c(O)c1') > 0.7
      order by tanimoto(smi, 'NCCc1ccc(O)c(O)c1') desc;

    Note that this query calls the tanimoto() function three times. The tanimoto() function in the where clause is what's used to actually perform the selection; the where clause eliminates rows which don't meet the criteria of a similarity greater than 0.7. The tanimoto() function in the select computes the simliarity value for the output of the statement, and the tanimoto() in the order-by clause is used to sort the output in descending order.

    The second option, finding the ten most similar structures, uses the optional third parameter to the tanimoto() function.

    select smi, tanimoto (smi, 'NCCc1ccc(O)c(O)c1')  from demo
      where tanimoto(smi, 'NCCc1ccc(O)c(O)c1', 10) > 0.0
      order by tanimoto(smi, 'NCCc1ccc(O)c(O)c1') desc;

  3. Count the number of compounds in the demo table which have thiophenol (Sc1ccccc1) as a substructure

    The function used for a substructure search is:

    operator contains (smiles1 IN VARCHAR2_OR_CLOB,
                       smiles2 IN VARCHAR2_OR_CLOB) => NUMBER

    The SQL command is:

    select count(1) from demo where contains(smi,'Sc1ccccc1') = 1;

  4. Perform a SMARTS query on the demo table.

    The function used for a SMARTS query is:

    operator matches (smiles IN VARCHAR2_OR_CLOB,
                      smarts IN VARCHAR2_OR_CLOB) => NUMBER

    The SQL command is:

    select smi from demo where matches(smi,'[N,O]ccC=O') = 1;

    This query finds ortho-substitued benzaldehyde structures and returns the SMILES strings.

PL/SQL Functions

  1. Write a simple PL/SQL routine to compute the number of atoms and rings in naphthalene (c1cccc2c1cccc2).

    Use the ddprop package functions, not the SQL operators, within the PL/SQL block.

    Use the dbms_output package with "set serveroutput on" to write simple results to the screen.

    set serveroutput on
      smi varchar2(20);
      natoms number;
      nrings number;
      smi := 'c1cccc2c1cccc2';
      natoms := c$dcischem.ddprop.fatom_count(smi);
      nrings := c$dcischem.ddprop.fring_count(smi);
      dbms_output.put_line('Number of atoms in naphthalene is: ' || natoms);
      dbms_output.put_line('Number of rings in naphthalene is: ' || nrings);

  2. Create a PL/SQL function which implements a new (trivial) molecular property, the ring ratio, here defined as the fraction of atoms in a molecule which are in a ring.

    Use the match_count() function.

    create function ring_ratio(smi in varchar2) return number
     natoms number;
     nringatoms number;
     natoms := c$dcischem.ddprop.fatom_count(smi);
     nringatoms := c$dcischem.ddprop.fmatch_count(smi, '[*r]');
     return nringatoms/natoms;

    Then try it with:

    select ring_ratio('c1ccccc1') from dual;
    select ring_ratio('CCCCCCc1ccccc1') from dual;

Oracle Data Manipulation

  1. Create a demo table as a subset of tcm01_mol. Only include the first 100 records from tcm01_mol.

    Use a variation the "CREATE TABLE" command ("create table ... as select ... ") and use the "rownum" pseudocolumn to limit the rows.

    create table tcm_subset as 
      select * from tcm01_mol
        where rownum <= 100;

  2. Add a column for average molecular weight. Populate the new table with calculated values.

    Use the "ALTER TABLE" command to add the column, then use the Daycart function to compute the average molecular weight and update the new column.

    alter table tcm_subset add new_mw number;
    desc tcm_subset;

    To populate the columns:

    update tcm_subset
      set new_mw=smi2amw(smi);

  3. Validate the calculated molecular weight (new_mw) against the stored molecular weight (mw). Look for a discrepancy greater than 1.0.

    The SQL function "ABS()" returns the absolute value of a number.

    select smi, mw, new_mw from tcm_subset
      where abs(mw - new_mw) > 1.0;

    Also note that we could do the validation by calculating the molecular weight on the fly, without using a new column:

    select smi, mw, smi2amw(smi) from tcm_subset
      where abs(mw - smi2amw(smi)) > 1.0;

    Note that the "mw" column in the table is actually a string (VARCHAR2). The SQL engine performs an implicit conversion to a number for the comparison. The same result is obtained if the explicit conversion, using the to_number() function, is used:

    select smi, mw, new_mw from tcm_subset
      where abs(to_number(mw) - new_mw) > 1.0;

  4. Add columns for rotatable bonds, hydrogen donors, and ClogP and populate the table.

    For clogp(), use the contrib script to create the fclogp() function and clogp() operator. Then:

    alter table tcm_subset add (rot_bond number, hdonor number, clogp number);
    update tcm_subset set

Multi-Table Queries

  1. Find out how many structures are common between the demo table and the tcm01_mol table.

    Since the SMILES in the two tables have been canonicalized and are stored as simple VARCHAR2 data, a simple join query will work.

    select count(1) from tcm01_mol, demo
      where tcm01_mol.smi = demo.smi;

  2. Refine the query and find all structures in both demo and tcm01_mol that contain the phenyl ethylamine (NCCc1ccccc1) substructure.

    Extend the previous answer with the contains() query or use an INTERSECT statement to join two SQL queries.

    There are several ways to answer this query.

    select demo.smi from demo, tcm01_mol
      where contains(demo.smi, 'NCCc1ccccc1') = 1
      and demo.smi = tcm01_mol.smi;

    The above query finds all SMILES which match the contains() query, and are also are common to both lists.

    An intersection query will also work.

    select smi from demo
      where contains(smi, 'NCCc1ccccc1') = 1
    select smi from tcm01_mol

    Note that the two queries above give different answers. In fact, the INTERSECT set operation removes duplicates, so the second query has less answers. The query below gives the same answers as the INTERSECT-based query.

    select distinct(demo.smi) from demo, tcm01_mol
      where contains(demo.smi, 'NCCc1ccccc1') = 1
      and demo.smi = tcm01_mol.smi;

Daylight Chemical Information Systems Inc.