Daylight Summer School 2001, June 5-7, Santa Fe, NM

TDTs - THOR Data Trees

TDT is the standard Daylight file format for structures plus data. More precisely, a TDT is an external ASCII-stream representation of a THOR database.

The TDT format is made up of datatypes, dataitems, datafields, and datatrees.

EXAMPLE TDT

Datatype Definitions -- give meaning to data stored in THOR databases

STANDARDIZATIONS/NORMALIZATIONS

USMILES -- unique SMILES

Interpret the field as a SMILES, convert it to the unique SMILES.

ASMILES -- absolute SMILES

Interpret the field as an isomeric SMILES (contains isomeric and/or isotopic information); generate the unique isomeric SMILES.

USMILESANY -- unique SMILES, unrelated to root

Interpret the field as a SMILES, convert it to the unique SMILES. Unlike the USMILES normalization, there is no requirement that a USMILESANY field have any relationship to the root identifier. This might be used, for instance, to specify a solvent.

ASMILESANY -- absolute SMILES, unrelated to root

Like USMILESANY, except the field is interpreted as an absolute SMILES (contains isomeric and/or isotopic information).

WHITE0 -- zap spaces

Remove all "whitespace" (spaces, tabs, newlines, and carriage-returns) from a field.

WHITE1 -- zap 2 or more spaces

Convert all occurrences of two or more whitespace characters into a single space.

WHITE2 -- zap 3 or more spaces

Convert all occurrences of three or more whitespace characters into two spaces.
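
The three whitespace normalizations can be sketched in Python with regular expressions. This is only an illustration of the rules as described here, not THOR's implementation; the function names are hypothetical, and "whitespace" is assumed to mean exactly spaces, tabs, newlines, and carriage-returns.

```python
import re

# Assumption: "whitespace" is space, tab, newline, and carriage-return only.
_WS = r"[ \t\n\r]"

def white0(field: str) -> str:
    """WHITE0: remove every whitespace character."""
    return re.sub(_WS, "", field)

def white1(field: str) -> str:
    """WHITE1: replace each run of two or more whitespace characters with one space."""
    return re.sub(_WS + r"{2,}", " ", field)

def white2(field: str) -> str:
    """WHITE2: replace each run of three or more whitespace characters with two spaces."""
    return re.sub(_WS + r"{3,}", "  ", field)
```

Read literally, WHITE1 leaves a lone tab or newline untouched, because only runs of two or more characters are rewritten; the sketch follows that reading.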

UPCASE -- convert to upcase

Translate the characters a-z to their uppercase equivalents A-Z.

DOWNCASE -- convert to downcase

Translate the characters A-Z to their lowercase equivalents a-z.

NOPUNCT -- zap punctuation

Remove all punctuation. Punctuation characters are all non-alphanumeric characters.
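
A minimal Python sketch of the case and punctuation normalizations follows. The function names are hypothetical, and "alphanumeric" is assumed to mean the ASCII letters and digits, so NOPUNCT as written here also removes whitespace.

```python
import string

# Translation tables that touch only a-z / A-Z, as the descriptions specify.
_TO_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
_TO_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Assumption: "alphanumeric" means ASCII letters and digits only.
_ALNUM = set(string.ascii_letters + string.digits)

def upcase(field: str) -> str:
    """UPCASE: translate a-z to A-Z, leaving all other characters alone."""
    return field.translate(_TO_UPPER)

def downcase(field: str) -> str:
    """DOWNCASE: translate A-Z to a-z, leaving all other characters alone."""
    return field.translate(_TO_LOWER)

def nopunct(field: str) -> str:
    """NOPUNCT: drop every character that is not alphanumeric."""
    return "".join(ch for ch in field if ch in _ALNUM)
```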

CASNUM -- inserts hyphens and verifies checksum

Chemical Abstracts numbers have the form NNNN-NN-N, where the last digit is a checksum. The CASNUM standardization inserts the hyphens and verifies that the checksum is correct.
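
The checksum rule can be sketched as follows. The check digit of a CAS Registry Number is the sum of the other digits, each multiplied by its position counted from the right, taken modulo 10. The function name and the choice to raise ValueError on a bad number are assumptions made for this sketch; how THOR itself reports a failed checksum is not described here.

```python
def casnum(field: str) -> str:
    """Insert hyphens into a CAS number and verify its check digit (illustrative)."""
    digits = field.replace("-", "")
    # CAS Registry Numbers have 2-7 digits, then 2 digits, then the check digit.
    if not (digits.isascii() and digits.isdigit() and 5 <= len(digits) <= 10):
        raise ValueError(f"not a CAS number: {field!r}")

    body, check = digits[:-1], int(digits[-1])
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    if total % 10 != check:
        raise ValueError(f"bad CAS checksum: {field!r}")

    return f"{digits[:-3]}-{digits[-3:-1]}-{digits[-1]}"
```

For water, 7732-18-5, the weighted sum is 8*1 + 1*2 + 2*3 + 3*4 + 7*5 + 7*6 = 105, and 105 mod 10 = 5, which matches the final digit.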

INDIRECT -- indirect data field

The datafield is an indirect reference; on reading, the reference is replaced by the data it expands to. This is discussed in more depth in the INDIRECT section below.

INTEGER16 -- numeric data
INTEGER32
REAL32
REAL64

These normalizations indicate that the datafield is a 16- or 32-bit integer or a 32- or 64-bit real number, respectively. Thor does no actual range checking to verify that the datafield meets the specification. However, the Merlin system can often use these specifications to allocate storage more efficiently, and to greatly increase the speed of certain searching and/or sorting operations.

BINARY -- binary data

The datafield is binary; that is, it contains arbitrary 8-bit integers. Such datafields are invisibly converted by THOR to remove characters that would otherwise confuse the THOR server and clients. THOR and Merlin clients also use this when formatting a TDT for display.

SMILES_NTUPLE -- SMILES-order n-tuple data

It is often necessary to store data about individual atoms in a molecule, that is, data that have a one-to-one correspondence with the atomic symbols in a SMILES string.

$SMI<OCC>2D<1,2,3,4,5,6>| (original SMILES)
$SMI<CCO>2D<5,6,3,4,1,2>| (unique SMILES)
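
In the example above, each atom carries two values, and reversing the atom order (OCC to CCO) moves the values with their atoms. A minimal sketch of that reordering is given below. The function name is hypothetical, and the atom permutation is passed in by hand; in THOR it would come from the unique-SMILES generation, which is not reproduced here.

```python
from typing import Sequence

def reorder_ntuple(data: str, n: int, new_order: Sequence[int]) -> str:
    """Permute comma-separated per-atom data to follow a new atom order.

    n          -- number of values stored per atom
    new_order  -- new_order[i] is the original index of the atom that ends
                  up at position i in the reordered SMILES (assumed input)
    """
    values = data.split(",")
    if len(values) != n * len(new_order):
        raise ValueError("data length does not match n * number of atoms")

    # Group the flat list into one n-tuple per atom, in original order.
    groups = [values[i * n:(i + 1) * n] for i in range(len(new_order))]
    # Emit the groups in the new atom order.
    return ",".join(v for old_index in new_order for v in groups[old_index])

# OCC -> CCO reverses the three atoms, so the permutation is [2, 1, 0].
print(reorder_ntuple("1,2,3,4,5,6", 2, [2, 1, 0]))  # 5,6,3,4,1,2
```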

AUTOGEN -- Generate a new dataitem using this dataitem's contents

Takes the tag of a second datatype as its argument. Each time a dataitem of this type is entered, a new dataitem of the type specified by the tag is created, and that datatype's normalizations are then applied to it.

For example, one might have two datatypes, NAM and $NAM, the former with just the "AUTOGEN $NAM" normalization and the latter with WHITE0, UPCASE, and NOPUNCT. If we entered the dataitem NAM<1,2-dimethylgoo>, a new dataitem, $NAM<12DIMETHYLGOO>, would be created automatically.
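
The NAM example can be traced with a small sketch. Everything here is hypothetical scaffolding: the in-memory datatype table, the function names, and the assumptions that the AUTOGEN argument is the tag $NAM and that AUTOGEN copies the contents as entered. It is meant only to show the order of operations, not how THOR stores datatypes.

```python
import re

# Simplified stand-ins for the normalizations named in the example.
NORMALIZERS = {
    "WHITE0": lambda s: re.sub(r"[ \t\n\r]", "", s),
    "UPCASE": lambda s: s.upper(),
    "NOPUNCT": lambda s: re.sub(r"[^A-Za-z0-9]", "", s),
}

# Hypothetical datatype table: tag -> list of normalization specifications.
DATATYPES = {
    "NAM": ["AUTOGEN $NAM"],
    "$NAM": ["WHITE0", "UPCASE", "NOPUNCT"],
}

def enter_dataitem(tag, value):
    """Return the (tag, value) dataitems that result from entering one dataitem."""
    stored = value
    targets = []
    for spec in DATATYPES.get(tag, []):
        name, _, arg = spec.partition(" ")
        if name == "AUTOGEN":
            targets.append(arg)          # remember which datatype to generate
        else:
            stored = NORMALIZERS[name](stored)

    items = [(tag, stored)]
    for target in targets:
        # Assumption: the generated dataitem starts from the contents as entered.
        items.extend(enter_dataitem(target, value))
    return items

print(enter_dataitem("NAM", "1,2-dimethylgoo"))
# [('NAM', '1,2-dimethylgoo'), ('$NAM', '12DIMETHYLGOO')]
```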

MAKERXNMOL -- Generate component molecules for a reaction

This is a variation of the AUTOGEN normalization. It takes a comma-separated list of three datatype tags as a second field. Each time a dataitem is entered for this field, it is parsed into its dot-separated component parts. A new dataitem is created for each component within the datatree. The datatype tag for each component depends upon the role of the component in the reaction.

For example, consider a database in which the ISM<> datatype is defined with the "MAKERXNMOL $RMOL,$AMOL,$PMOL" normalization and the three datatypes $RMOL<>, $AMOL<>, and $PMOL<> each have the USMILESANY normalization. If the following datatree is entered:

$SMI<"BrCC=C>>ICC=C">
$RNO<12345>
ISM<"BrCC=C>CC(=O)C.CCC(=O)C>ICC=C">
|

the datatree actually stored in Thor, after normalization, would be as follows:

$SMI<"BrCC=C>>ICC=C">
$RNO<12345>
ISM<"BrCC=C>CC(=O)C.CCC(=O)C>ICC=C">
$RMOL<BrCC=C>
$AMOL<CC(=O)C>
$AMOL<CCC(=O)C>
$PMOL<ICC=C>
|
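
The splitting step in this example can be sketched in a few lines of Python. The function name is hypothetical, and the sketch deliberately skips the unique-SMILES normalization of each component, which would need a chemistry toolkit; it only shows how the reaction string is divided by role and by dot-separated component.

```python
def make_rxn_mol(reaction: str, tags: str):
    """Split a reaction SMILES into (tag, component) dataitems by role.

    reaction -- 'reactants>agents>products', components separated by '.'
    tags     -- comma-separated tags for reactant, agent, and product
    """
    role_tags = tags.split(",")
    if len(role_tags) != 3:
        raise ValueError("expected three comma-separated datatype tags")

    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError("expected a reaction of the form R>A>P")

    items = []
    for tag, part in zip(role_tags, parts):
        for component in part.split("."):
            if component:                # an empty role (as in 'R>>P') adds nothing
                items.append((tag, component))
    return items

for tag, smiles in make_rxn_mol("BrCC=C>CC(=O)C.CCC(=O)C>ICC=C", "$RMOL,$AMOL,$PMOL"):
    print(f"{tag}<{smiles}>")
```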

PART_NTUPLE -- Component-order n-tuple data

This normalization is of particular use with the FPP datatype, which consists of a set of N fingerprints corresponding with N dot-disconnected SMILES representing a mixture and/or library. As with SMILES_NTUPLE, an integer argument is required indicating the number of data per part.

GRAPH -- convert SMILES to GRAPH
MAKEGRAPH -- produce a GRAPH subtree

THOR uses the concept of a molecule's graph to allow retrieval of structures that might be tautomers, isomers, or otherwise an inexact match to a particular SMILES. One of the problems in representing molecules in a computer is that we must choose one valence model as the preferred representation, but there are many valid valence models. The graph of a molecule is an information-deficient representation that removes most valence-model information, allowing greater flexibility in retrieving data.

A molecule's graph is created by removing all isotopic, charge, and bond information from it. All bonds are set to "single", all charges are set to zero, and each atom's hydrogen count is set to the normal lowest valence consistent with its bond configuration. Having removed all of this information, the resulting "molecule" is used to generate a unique SMILES; this is the graph's identifier.

D3D -- compute 3D hash

The D3D standardization is designed primarily for use with the $D3D datatype. It has two effects: First, it is equivalent to specifying "SMILES_NTUPLE 3" for the 3D data. Second, it causes a "hash code" to be generated from the 3D data; this hash code is stored in the datafield two positions earlier in the dataitem.

INDIRECT

Indirect data are data that are stored separately from the regular data in a database. A field thus marked will contain an indirect reference rather than the actual data of interest. When the TDT is retrieved from the database, an indirect-reference expansion takes place: The indirect reference is looked up in an auxiliary database, and the expansion data replaces the original indirect reference.

EXAMPLE:

spresi95demo_datatypes.tdt:

$D<"$ISC">
_V<"indirect/citation;citation">
_B<"$ISC/id;$ISC/data">
_P<>
_S<Indirect reference for citation data, used in Spresi datatype JA>
_M<Indirect>
_C<Spresi datatype>
_O<Daylight CIS Inc.>
|


$D<"JA">
_V<"Journal article;Author(s);Institution;Citation;Keywords;Language;Year;Document ID">
_B<"article;art/author;art/inst;art/citation;art/keywords;art/lang;art/yr;art/DocID">
_P<"*;*;*;*;*;*;*;*">
_S<Data type describing a journal article (subclassed from Spresi DOK type).>
_M<Reference>
_C<Spresi datatype>
_O<Daylight CIS Inc.>
|

spresi95.tdt:

$SMI<Br.CCCN(CCC)C1CCc2cc(Cl)c(O)cc2C1>
CL<26439;19;;0.0066>
FP<e4.02.JEEdR5dQk2EZB.e.YU8Gl6DW326Okrv.uEYE43,AAEN.AX01YU3X2e2Ve7E6Iw45.O0177xQ7YInY34...1;2048;194;512;173;1;S95>
TS<199701100502.52>
$GRF<Br.CCCN(CCC)C1CCC2CC(Cl)C(O)CC2C1>
$SNO<1402584-201>
JA<130505;111512;;134485;5~3297~2;ENG;1986;06X0203 87>
SMP<227.00;227.00;;;;;methanol/diethyl ether;;06X0203 87>
|

spresi95_indirect.tdt:

$ISC<134485;J. MED. CHEM., 29,(1986) N 9, 1615-1627>
|
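
The expansion of the citation reference in the JA dataitem above can be sketched as a dictionary lookup. The function names are hypothetical, and the sketch assumes that only the fourth JA field (Citation) is expanded through $ISC and that an unknown reference is left unchanged; it ignores quoting and does not model the THOR server.

```python
def build_lookup(isc_bodies):
    """Map each $ISC id to its data, given the text inside $ISC<...>."""
    lookup = {}
    for body in isc_bodies:
        ref_id, _, data = body.partition(";")
        lookup[ref_id] = data
    return lookup

def expand_field(body, lookup, field_index):
    """Replace one semicolon-separated field by its indirect expansion."""
    fields = body.split(";")
    ref = fields[field_index]
    # Assumption: a reference with no entry in the lookup is left as-is.
    fields[field_index] = lookup.get(ref, ref)
    return ";".join(fields)

# Contents of the $ISC and JA dataitems from the example files.
lookup = build_lookup(["134485;J. MED. CHEM., 29,(1986) N 9, 1615-1627"])
ja = "130505;111512;;134485;5~3297~2;ENG;1986;06X0203 87"

# Citation is the fourth field, so its zero-based index is 3.
print("JA<" + expand_field(ja, lookup, 3) + ">")
```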

Daylight Chemical Information Systems Inc.