MUG '04 -- 24 - 27 Feb, 2004
Michael A. Kappler
The Conversion Toolkit is aimed at facilitating migration of data into and out of Daylight software. The following describes the project goals, requirements, approach, design, issues, and current status of this new product.
- Facilitate migration to DayCart®. Mitigate architectural barriers by supporting third party data.
- Promote integration of Daylight Toolkits. The ability to handle third party data would give customers the opportunity to use our tools in other environments.
- Support creation of Daylight databases. Offer more content in Daylight native form.
The aim is to support structural information, data, queries, and reactions to and from SMILES, SMARTS, and SMIRKS languages.
- Convert popular chemical information file formats to Daylight native form.
- Generate these popular formats from Daylight native form.
- Preserve information to the maximum extent possible.
Widely used file formats for structures, reactions, and data in the chemical information software industry have been published by Molecular Design Limited (MDL) as the chemical table file (CTfile) Formats. There are an estimated two dozen file formats and another dozen or so in the making. The current position on the various formats is categorized as follows:
Group I: "must-have", supported in first release
MDL CTfile Formats (input sketcher, storage)
CambridgeSoft® ChemDraw (input sketcher)
- Ctab (structure)
- Ctab Atom List (query)
- Ctab Stext (data)
- Ctab Properties (query)
- Ctab Properties (Rgroup)
- Ctab Properties (Sgroup)
- Ctab Properties (3D)
- Ctab Enhancements (atom types, large numbers, data)
- molfile (one structure, a.k.a. V2000)
- RGfile (Rgroup query)
- SDfile (data, multiple structures)
- rxnfile (one reaction)
- RDfile (data, multiple reactions)
- XDfile (XML)
- extended molfile (structure, a.k.a. V3000 and mol3)
- extended Ctab Properties (query)
- extended Ctab Properties (Rgroup)
- extended Ctab Properties (Sgroup)
- extended Ctab Properties (3D)
- extended Ctab Properties (Collections)
- extended rxnfile (one reaction)
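Supporting both molfile generations listed above means a reader has to tell them apart. In the published CTfile Formats, the counts line of a connection table carries a version tag, V2000 or V3000. The following is a minimal sketch of that check in C. The function name ctab_version is hypothetical and not part of any Daylight Toolkit, and the sketch assumes the caller has already isolated the counts line (the fourth line of a molfile).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a Daylight Toolkit call.
 * Returns 3000 or 2000 according to the version tag found on a Ctab
 * counts line, or 0 when the line carries no tag. */
int ctab_version(const char *counts_line)
{
    assert(counts_line != NULL);
    if (strstr(counts_line, "V3000") != NULL) return 3000;
    if (strstr(counts_line, "V2000") != NULL) return 2000;
    return 0;
}
```

A line with no tag returns 0, so the caller decides whether to reject the record or fall back to a default interpretation.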
Group II: "valuable", supported in subsequent release
Group III: "potentially valuable", no plans
- Accelrys Common File Formats
- CSD FDAT/FCON
- GAMESS XYZ
- Gaussian Z-matrix
- ISISDraw TGF
- Sybyl MOL2
Group IV: "noted", not worthy of support
Successful conversion will be determined by comparison of input from CambridgeSoft ChemDraw and MDL® ISIS/Draw against output generated by the Conversion Toolkit.
The implementation will be a Daylight Toolkit, integrated with existing tools. The Toolkit approach will maximize robustness: the Toolkit itself should never fail, only decline to convert malformed input. Robustness and efficiency will be maximized using purification and quantification tools.
A compatibility assessment will be performed with other chemical drawing tools, i.e., CACTVS, ChemSymphony, JME, Marvin, and MDL® Draw Enterprise.
One interface concept is a sequence of string objects. Control will be described with "named properties" on a "Conversion Object". The interface will be extensible so additional or not-yet-developed formats can be supported in the future. Since no Daylight Toolkit performs direct I/O, a "Contrib" program would be offered for reading and writing data. The following is one possible interface design.
dt_Handle <= dt_alloc_conversion(dt_Integer request)
Allocate an object for a specific conversion request, e.g., DX_CONV_MOL2SMILES.
dt_Handle <= dt_convert(dt_Handle conversion, dt_Handle sequence)
Convert data based on properties. Input is a conversion object and a sequence of strings. Returns a sequence of strings.
Library in $DY_ROOT/lib
libdt_conv.a & libdt_conv.so
Applications in $DY_ROOT/bin
MOL2SMILES, MOL2SMARTS, & MOL2SMIRKS
SMILES2MOL, SMARTS2MOL, & SMIRKS2MOL
CLOB <= function ddpackage.fsmiles2mol(data IN CLOB)
dt_Handle <= du_file2seq(FILE *file, char *delimiter)
Read from a file into a sequence of strings
dt_Boolean <= du_seq2file(dt_Handle sequence, FILE *file)
Write a sequence of strings to a file
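The record-splitting step behind a file-to-sequence reader can be sketched in plain C. This is an illustration only, under stated assumptions: it works on text already held in memory rather than on a FILE *, it returns an ordinary array of strings rather than a dt_Handle sequence, and the names split_records, free_records, and copy_span are hypothetical, not Daylight functions. For an SDfile the delimiter would be the "$$$$" line that ends each record.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes of src into a fresh NUL-terminated string. */
static char *copy_span(const char *src, size_t len)
{
    char *out = malloc(len + 1);
    if (out == NULL) return NULL;
    memcpy(out, src, len);
    out[len] = '\0';
    return out;
}

/* Free an array returned by split_records. */
void free_records(char **records, size_t count)
{
    size_t i;
    if (records == NULL) return;
    for (i = 0; i < count; i++) free(records[i]);
    free(records);
}

/* Split text into records terminated by delim. The delimiter itself is
 * dropped. Non-empty text after the last delimiter becomes a final record.
 * Returns NULL with *count == 0 on allocation failure. */
char **split_records(const char *text, const char *delim, size_t *count)
{
    size_t dlen, cap = 8, n = 0;
    char **records;
    const char *start = text;

    assert(text != NULL && delim != NULL && count != NULL);
    dlen = strlen(delim);
    assert(dlen > 0);
    *count = 0;

    records = malloc(cap * sizeof *records);
    if (records == NULL) return NULL;

    while (*start != '\0') {
        const char *hit = strstr(start, delim);
        size_t len = hit ? (size_t)(hit - start) : strlen(start);

        if (n == cap) {  /* grow the array when it is full */
            char **grown = realloc(records, 2 * cap * sizeof *records);
            if (grown == NULL) { free_records(records, n); return NULL; }
            records = grown;
            cap *= 2;
        }
        records[n] = copy_span(start, len);
        if (records[n] == NULL) { free_records(records, n); return NULL; }
        n++;
        start = hit ? hit + dlen : start + len;
    }
    *count = n;
    return records;
}
```

The sketch matches the delimiter anywhere in the text, so a data field that happens to contain the delimiter string would be split incorrectly. A production reader would match it only at the start of a line.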
The extent to which we allow dialects is an issue. Do we take a hard line on violations? A soft approach to format interpretation is problematic - there is no "right" answer. A list of allowed line delimiters (LF, CR, CR/LF, etc.) is needed.
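One way to handle the line-delimiter question is to normalize input before any format parsing, so later stages see only LF. The following C sketch assumes that policy. It is not taken from the Toolkit, and the name normalize_newlines is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rewrite CR/LF pairs and lone CR as LF, in place.
 * Returns the length of the normalized string. */
size_t normalize_newlines(char *s)
{
    size_t r = 0, w = 0;  /* read and write positions */

    assert(s != NULL);
    while (s[r] != '\0') {
        if (s[r] == '\r') {
            s[w++] = '\n';
            /* Skip the LF of a CR/LF pair so it is not emitted twice. */
            r += (s[r + 1] == '\n') ? 2 : 1;
        } else {
            s[w++] = s[r++];
        }
    }
    s[w] = '\0';
    return w;
}
```

Normalizing up front is a soft approach to this one dialect question. A hard-line reader would instead reject any record whose delimiter is not on the allowed list.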
An Alpha version with basic functionality has been completed. A "Conversion Toolkit" was built; it depends on the SMILES, SMARTS, and DEPICT Toolkits. An initial interface has been implemented and conversion between MDL and SMILES has been done. A non-functional interface between MDL and both SMARTS and SMIRKS is in place. The estimated timeline for this project is as follows:
|Table 1: Conversion Toolkit Project Timeline
||identify goals and requirements, design approaches
||MUG '04, Alpha version, basic functionality
||Beta version, required features
||EuroMUG '04, compatibility assessment, extended progress
|sometime in '04
||Software Release v4.91, stable interface
||Software Release v4.92, extended capabilities
Data from the CTfile Formats is being used as test input. Send us your "problematic" data to ensure we cover known issues that affect you. Below, we see the new toolkit successfully convert data from molfile format as reported in the CTfile Formats document. Support for extended molfile format is planned.
% foreach CTFILE (ctfile*)
foreach? echo $CTFILE `cat $CTFILE | mol2smiles`
foreach? end
-  Dalby, A. et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci., 32 (1992), 244--255.
-  CTfile Formats (October 2003), MDL Website. http://www.mdl.com/downloads/public/ctfile/ctfile.jsp.
-  CambridgeSoft Website. http://www.cambridgesoft.com/.
-  ChemAxon Website. http://www.chemaxon.com/marvin/.
-  MolInspiration Website. http://www.molinspiration.com/jme/.
-  Chemical Markup Language Website. http://www.xml-cml.org/.
-  Accelrys Website. http://www.msg.ucsf.edu/local/programs/insightII/doc/life/insight2000.1/formats980/Files980TOC.doc.html.
-  Cambridge Structural Database Website. http://www.ccdc.cam.ac.uk/ and http://www.lrz-muenchen.de/services/software/chemie/unichem_doku/5500/5500_245.html#HEADING244.
-  Trinity University Website. http://hackberry.chem.trinity.edu/IJC/Text/xmolxyz.html.
-  UniChem User's Guide. http://www.lrz-muenchen.de/services/software/chemie/unichem_doku/5500/5500_248.html#HEADING247.
-  MDL Website. http://www.mdl.com/.
-  Atomic and Molecular Physical Data Website. http://www.jcamp.org/protocols.html.
-  Protein Data Bank Website. http://www.rcsb.org/pdb/.
-  Tripos Website. http://www.tripos.com/custResources/mol2Files/.
-  Thomson Dialog Website. http://library.dialog.com/bluesheets/html/bl0390.html.
-  MDL Website. http://www.mdl.com/products/framework/isis_draw/index.jsp.
-  University of Erlangen Website. http://www2.ccc.uni-erlangen.de/software/cactvs/.
-  Cherwell Scientific Website. http://www.chm.bris.ac.uk/chemsymphony/start_here.html.
-  MDL Website. http://www.mdl.com/products/framework/draw_enterprise/index.jsp.
Daylight Chemical Information Systems, Inc.