MUG '05 -- 9 - 11 March, 2005

DAYCONVERT: A Daylight-Centric Conversion Tool for Chemical Information

Michael A. Kappler
Daylight CIS, Inc.

ABSTRACT

Dayconvert is a new tool for conversion of chemical information to and from Daylight native form. Dayconvert is available as an application or packaged as a DayCart operator. Our aim in the initial release is conversion of chemical information stored in the most widely used file formats in chemistry. Example conversions to and from SMILES, SMARTS, SMIRKS and ThorDataTrees (TDTs) will be shown.


Background

Movement of chemical information into and out of software is a common task. Naturally, there is a need to import and export structures and data into and out of Daylight software and databases. For this purpose, scientists have used open source, commercial and proprietary in-house software; however, these tools suffer from problems of accuracy, integration or supportability. Therefore, a Daylight-centric solution has been created to move chemical information into and out of Daylight software and databases.

Methods and Use

The core technology is a new product, the Conversion Toolkit. The Dayconvert application is built on the toolkit and performs file input and output. The Dayconvert operator in DayCart accesses the toolkit directly; in other words, the toolkit is built into the server side of DayCart. Conversion of chemical information is based on public technical documentation and the mapping of external representations to Daylight languages. The most widely used file formats in chemistry are MDL's CTfile formats. Therefore, our aim in the initial release is support of these formats.

The use cases of this technology are:
  1. import a structure
  2. import a query
  3. import a reaction or transformation
  4. import a database
  5. export a database
  6. export a structure
  7. depict a structure
  8. preserve a structure
  9. preserve a structure with data
Each of these use cases will be demonstrated using the Dayconvert application and operator (DayCart) as methods.

Use Case #1: Import a Structure

The following structure is L-Alanine (13C) from "CTfile Formats", page 9:

Method A: Application

The following converts the L-Alanine (13C) structure from molfile format to SMILES, given the above molfile in a file named ctfile009.mol:

To include isomeric information (isotopes and stereochemistry), use the isomeric option:

Method B: DayCart

The following converts the L-Alanine (13C) structure from molfile format to SMILES, given the above molfile in an Oracle table named ddtable and a column named molfile:

To include isomeric information (isotopes and stereochemistry), use the isomeric option:
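As background to the isomeric option, and separately from the Dayconvert examples in this section: per the public CTfile documentation, a V2000 molfile can carry an isotope label either in the mass-difference field of the atom line or on an `M  ISO` property line. The sketch below is not Dayconvert and does not use the Conversion Toolkit. It is a minimal Python illustration of reading atom symbols and `M  ISO` labels; the function name and the two-atom molfile are hypothetical, and the mass-difference field and stereochemistry are not handled.

```python
def read_atoms_with_isotopes(molfile_text):
    """Return [(symbol, isotope_or_None), ...] from a V2000 molfile.

    Illustrative only: reads the counts line, the atom symbols and
    any "M  ISO" property lines; everything else is ignored.
    """
    lines = molfile_text.split("\n")
    counts = lines[3]                  # line 4 of a molfile is the counts line
    n_atoms = int(counts[0:3])         # columns 1-3: number of atoms
    # In the atom block the symbol occupies columns 32-34 of each atom line.
    symbols = [lines[4 + i][31:34].strip() for i in range(n_atoms)]

    isotopes = {}
    for line in lines:
        if line.startswith("M  ISO"):
            fields = line[6:].split()  # entry count, then atom/mass pairs
            n_entries = int(fields[0])
            for k in range(n_entries):
                atom_number = int(fields[1 + 2 * k])   # 1-based atom number
                isotopes[atom_number] = int(fields[2 + 2 * k])

    return [(sym, isotopes.get(i + 1)) for i, sym in enumerate(symbols)]


# Hypothetical two-atom molfile: a 13C-labelled carbon bonded to oxygen.
EXAMPLE = "\n".join([
    "",
    "  hypothetical",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.4000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  ISO  1   1  13",
    "M  END",
])

print(read_atoms_with_isotopes(EXAMPLE))   # [('C', 13), ('O', None)]
```

A converter that ignores these labels would produce the same SMILES for labelled and unlabelled input, which is the difference the isomeric option is meant to preserve.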


Use Case #2: Import a Query

The following query represents aniline, phenol, or toluene:

Method A: Application

The following converts the substituted-benzene query from molfile format to SMARTS, given the above molfile in a file named phenylNCO.mol:

Method B: DayCart

The following converts the substituted-benzene query from molfile format to SMARTS, given the above molfile in an Oracle table named ddtable and a column named molfile:


Use Case #3: Import a Reaction or Transformation

The following reaction is acylation of benzene from "CTfile Formats", page 44:

Method A: Application

The following converts the acylation reaction from rxnfile format to Reaction SMILES, given the above rxnfile in a file named ctfile044.mol:

Method B: DayCart

The following converts the acylation reaction from rxnfile format to Reaction SMILES, given the above rxnfile in an Oracle table named ddtable and a column named rxnfile:

If the reaction were balanced with a chloride ion on the product side and contained a query feature, it could be converted to SMIRKS, e.g.:
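Separately from the Dayconvert examples above, it may help to see how an rxnfile is laid out, since that layout is what has to be mapped onto the `reactants>agents>products` form of a Reaction SMILES. Per the public CTfile documentation, an rxnfile starts with a `$RXN` line, three further header lines, and a counts line giving the number of reactants and products; each component molfile is then introduced by a `$MOL` line. The Python sketch below is an illustration under those assumptions, not the Conversion Toolkit; the function name is hypothetical and the molfile bodies are placeholders.

```python
def split_rxnfile(rxn_text):
    """Split an rxnfile into (reactant_molblocks, product_molblocks)."""
    lines = rxn_text.split("\n")
    if not lines or lines[0].strip() != "$RXN":
        raise ValueError("not an rxnfile: first line must be $RXN")

    counts = lines[4]                  # line 5: rrrppp
    n_reactants = int(counts[0:3])
    n_products = int(counts[3:6])

    molblocks = []
    current = None
    for line in lines[5:]:
        if line.strip() == "$MOL":     # start of the next component
            if current is not None:
                molblocks.append("\n".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        molblocks.append("\n".join(current))

    if len(molblocks) != n_reactants + n_products:
        raise ValueError("counts line does not match number of $MOL blocks")
    return molblocks[:n_reactants], molblocks[n_reactants:]


# Hypothetical rxnfile skeleton with placeholder molfile bodies.
EXAMPLE_RXN = "\n".join([
    "$RXN",
    "",
    "  hypothetical",
    "",
    "  2  1",
    "$MOL",
    "placeholder molfile: reactant 1",
    "$MOL",
    "placeholder molfile: reactant 2",
    "$MOL",
    "placeholder molfile: product 1",
])

reactants, products = split_rxnfile(EXAMPLE_RXN)
print(len(reactants), len(products))   # 2 1
```

Each molblock would then be converted on its own, with reactants and products joined by `.` on their respective sides of the `>>`.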


Use Case #4: Import a Database

The following is a file list showing an SDfile from Derwent (World Drug Index) and an RDfile from Sunset Molecular, LLC (WomBat):

Method A: Application

The following converts the WDI database from SDfile format to SMILES:

The following converts the WomBat database from RDfile format to SMILES:

Method B: DayCart

A SQL control file named import_sdf.ctl is provided to import an SDfile into Oracle:

The following converts the WDI database from SDfile format to SMILES, given the above SDfile in an Oracle table named ddtable and a column named sdfile:

A SQL control file named import_rdf.ctl is provided to import an RDfile into Oracle:

The following converts the WomBat database from RDfile format to SMILES, given the above RDfile in an Oracle table named ddtable and a column named rdfile:


Use Case #5: Export a Database

Method A: Application

The following converts the WDI and WomBat databases from SMILES to SDfile and RDfile formats, respectively:

Method B: DayCart

A PL/SQL procedure named export_file is provided to export a file from Oracle. The following converts the WDI and WomBat databases from SMILES to SDfile and RDfile formats, respectively, given the SMILES in an Oracle table named ddtable and columns named sdfile and rdfile:

A PL/SQL procedure named oracle2rdf is provided to export an RDfile from Oracle:


Use Case #6: Export a Structure

Method A: Application

The following converts the L-Alanine (13C) structure from SMILES to molfile:

Method B: DayCart

Note: Beginning in v4.91, stereochemical hydrogens are represented as atoms on molecules with the dt_smilin_addh entry point in the SMILES Toolkit.


Use Case #7: Depict a Structure

The ThorDataTree (TDT) is a Daylight format for structure and data.

Method A: Application

The following converts the L-Alanine (13C) structure from molfile format to TDT format, given the above molfile in a file named ctfile009.mol:

Method B: DayCart

The following converts the L-Alanine (13C) structure from molfile format to TDT format, given the above molfile in an Oracle table named ddtable and a column named molfile:

The 2D-coordinates (2D), bond styles (BST) and visibility (VIS) datatypes are used by the HTTP-based SMI2GIF application to preserve the original depiction.

Table 1: Newer Depictions are Better

  Version   Depiction
  v4.81     what the computer "thinks"
  v4.82     use of 2D-coordinates
  v4.83     aesthetics improvements
  v4.91     total control of layout

Note: The DEPICT Toolkit entry points dt_setcoord, dt_setbondstyle and dt_setvisible are used to observe 2D, BST, and VIS datatypes.


Use Case #8: Preserve a Structure

In addition to the depiction datatypes in the TDT, the atom index (ATI), bond index (BDI), and bond reversal (BDR) datatypes are used to reproduce the original molfile.

Method A: Application

The following converts the L-Alanine (13C) structure from TDT format to molfile format, reproducing the original input.

Method B: DayCart

The following converts the L-Alanine (13C) structure from TDT format to molfile format, reproducing the original input.


Use Case #9: Preserve a Structure or Reaction with Data

Additional chemical information, such as structural properties and data, is carried in SDfile and RDfile formats. The solution for preserving structures or reactions with data is a TDT. The following shows the structures L-trans-1,2 cyclohexane-dicarboxylic acid and 2-methyl furan from "CTfile Formats" (October 2003), page 40 (SDfile format), and their conversion to a TDT. The same solution is used for RDfile formats.


End Notes

The Oracle types used in the DayCart operators may be either a variable character array (VARCHAR2) or a character large object (CLOB).

The initial release of the Conversion Toolkit Application Programmer's Interface consists of four entry points and can be used to input and output strings and objects, such as molecules, reactions, patterns, transformations, depictions, and conformations.

A generic input type, such as 'mdl' or 'daylight', may be specified, which invokes a perception algorithm that attempts to detect the specific format type. Additionally, each input type is associated with an output type. So, input may be specified as "Generic MDL" and output may be specified as "Generic Daylight". Then, for example, if the input is perceived to be an SDfile it will be read as such, and output will be either SMILES or SMARTS depending on whether the input contains query features. An informational program, FILEFORMAT, is provided to show perceived types of CTfiles.
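The perception step can be pictured with a small sketch. The Python below is not FILEFORMAT and not the toolkit's perception algorithm, whose details are not given here; it is a hypothetical illustration that relies only on markers defined in the public CTfile documentation (`$RDFILE`, `$RXN`, the `$$$$` record delimiter, `M  END`). The query check is deliberately narrow: it treats an `M  ALS` atom-list line as the only sign of a query feature, which is an assumption made for brevity. All function names are invented for this example.

```python
def perceive_ctfile(text):
    """Guess the CTfile type of `text` from its structural markers."""
    lines = text.splitlines()
    first = lines[0].strip() if lines else ""
    if first.startswith("$RDFILE"):
        return "rdfile"
    if first.startswith("$RXN"):
        return "rxnfile"
    if any(line.strip() == "$$$$" for line in lines):
        return "sdfile"                # record delimiter of an SDfile
    if any(line.startswith("M  END") for line in lines):
        return "molfile"
    return "unknown"


def has_atom_list(text):
    """Narrow query test: is there an 'M  ALS' atom-list property line?"""
    return any(line.startswith("M  ALS") for line in text.splitlines())


def default_daylight_output(text):
    """Pick the output language as the paragraph above describes."""
    return "SMARTS" if has_atom_list(text) else "SMILES"


# Hypothetical fragments, just enough to exercise each branch.
print(perceive_ctfile("$RXN\n\n\n\n  1  1\n"))               # rxnfile
print(perceive_ctfile("\n\n\n  0  0\nM  END\n$$$$\n"))       # sdfile
print(perceive_ctfile("\n\n\n  0  0\nM  END\n"))             # molfile
print(default_daylight_output("M  ALS   7  3 F N   O   C   \nM  END\n"))  # SMARTS
```

Checking the first line before scanning for `$$$$` matters, because an RDfile or rxnfile contains embedded molfiles that would otherwise match the later tests.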

Another informational program, MDL2INFO, is provided to show the datatypes of an SDfile or an RDfile and can be used to select the datatype to be used for naming a structure for storage in a TDT. The default value is the first line of the header in a molfile, rxnfile, SDfile, or RDfile. The following is information for the WomBat 2004.1 database and shows that several of the top datatypes are unique and present for all records, and are therefore reasonable selections for naming the structure sample to be stored in the TDT.
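The selection criterion described above (a datatype that is present in every record and unique across records) can be checked with a short script. The Python sketch below is not MDL2INFO; it is a hypothetical illustration that assumes the SDfile layout from the public CTfile documentation: data headers of the form `>  <NAME>` after `M  END`, value lines ended by a blank line, and records ended by `$$$$`. The field names `ID` and `CLASS` and all function names are made up for the example.

```python
import re
from collections import defaultdict

FIELD_RE = re.compile(r"^>.*<([^>]+)>")   # data header line, e.g. ">  <ID>"


def iter_records(sd_text):
    """Yield each SDfile record as a list of lines (delimiter excluded)."""
    record = []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            yield record
            record = []
        else:
            record.append(line)
    if any(l.strip() for l in record):    # tolerate a missing final $$$$
        yield record


def data_items(record_lines):
    """Return {field_name: value} for the data items after 'M  END'."""
    items, name, past_molblock = {}, None, False
    for line in record_lines:
        if not past_molblock:
            past_molblock = line.startswith("M  END")
            continue
        match = FIELD_RE.match(line)
        if match:
            name = match.group(1)
            items[name] = []
        elif name is not None:
            if line.strip() == "":
                name = None               # blank line ends the value
            else:
                items[name].append(line)
    return {k: "\n".join(v) for k, v in items.items()}


def field_stats(sd_text):
    """Report, per field, whether it is present in all records and unique."""
    n_records = 0
    present = defaultdict(int)
    values = defaultdict(set)
    for record in iter_records(sd_text):
        n_records += 1
        for name, value in data_items(record).items():
            present[name] += 1
            values[name].add(value)
    return {name: {"present_in_all": present[name] == n_records,
                   "unique": len(values[name]) == present[name]}
            for name in present}


# Two hypothetical records with placeholder molblocks.
EXAMPLE_SD = "\n".join([
    "placeholder molblock", "M  END", ">  <ID>", "A-1", "",
    ">  <CLASS>", "x", "", "$$$$",
    "placeholder molblock", "M  END", ">  <ID>", "A-2", "",
    ">  <CLASS>", "x", "", "$$$$",
])

print(field_stats(EXAMPLE_SD))
```

In this example `ID` is both present in all records and unique, so it would be a reasonable naming field, while `CLASS` is present everywhere but repeats.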

For complete information on the Conversion Toolkit, Dayconvert Application and DayCart operators, see the Conversion Toolkit Programmer's Guide.

Conclusion

Dayconvert is a new tool for conversion of chemical information to and from Daylight native form. Dayconvert is available as an application or packaged as a DayCart operator. The core technology, the new Conversion Toolkit, is a robust and general Application Programmer Interface for conversion of Daylight languages or objects, to and from molecules, reactions, patterns, transformations, depictions, and conformations.
Copyright 2005, Daylight CIS, Inc.