Thor/Merlin 4.5: Datafield quoting

The TDT datafield quoting convention

If a datafield contains any of the characters used in the Thor datatree syntax ( $ < ; > | " ), the whole field must be enclosed in doublequotes ("..."). Doublequotes in the data are indicated by doubling them, e.g., the name 4',4"-PCB appears in its dataitem as the quoted field "4',4""-PCB".
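The rule above is mechanical enough to sketch in a few lines. The following Python is only an illustration of the convention as stated here, not Daylight code; the names RESERVED and quote_field are hypothetical.

```python
# Characters reserved by the Thor datatree syntax, as listed above.
RESERVED = set('$<;>|"')

def quote_field(value: str) -> str:
    """Return a datafield in lexical (TDT) form.

    Hypothetical helper: quotes only when a reserved character is present,
    and doubles any doublequotes found inside the data.
    """
    if not any(ch in RESERVED for ch in value):
        return value  # nothing reserved, quoting is optional
    return '"' + value.replace('"', '""') + '"'

print(quote_field("4',4\"-PCB"))
print(quote_field("Ave MW"))
```

Fields without reserved characters pass through unchanged; quoting them anyway would also be legal.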

What's changed?

The quoting convention for TDT datafields hasn't changed, but as of version 4.51 it is uniformly enforced. Previous versions of the software made exceptions, e.g., for datatype definitions. The syntax of all datatrees is now unified; no exceptional cases remain.

What hasn't changed?

This change shouldn't affect toolkit programs, since the lexical (external) form of datatrees is not visible at the object (toolkit) level. Unquoted datafield values are set via dt_setstringvalue(datafield) and obtained via dt_stringvalue(datafield). Conversion between (quoted) lexical datatrees and Thor datatree objects is done via dt_thor_str2tdt() and dt_tdt2str().
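To make the distinction between the lexical form and the unquoted value concrete, here is a minimal Python sketch of the reverse mapping for a single field. It is not the toolkit's implementation (that conversion is handled for whole datatrees by dt_thor_str2tdt()); unquote_field is a hypothetical name and the logic simply inverts the convention described in this note.

```python
def unquote_field(lexical: str) -> str:
    """Map one lexical datafield to the unquoted value seen at object level.

    Hypothetical helper: strips enclosing doublequotes, if any, and
    collapses doubled doublequotes back to single ones.
    """
    if len(lexical) >= 2 and lexical[0] == '"' and lexical[-1] == '"':
        return lexical[1:-1].replace('""', '"')
    return lexical  # field was never quoted

print(unquote_field('"$SRNO"'))
```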

Datatype definitions of identifiers

The "datatype definitions" ($D datatype) of identifiers now always require quoting because the tag value contains '$', e.g., the SPRESI Reaction Registry Number, $SRNO:

$D<"$SRNO">
   _V<SPRESI Reaction Registry Number>
   _B<SPRESI Reaction ID>
   _M<Name, Lookup, Common, System>
   _C<Spresi datatype>

Datatype definitions of multi-field datatypes

Datatype definitions of multi-field identifiers also require quoting because the data values contain ';', e.g., the Fingerprint definition, FP:

   _V<"Fingerprint;Orig size;Obits on;Size;Bits on;Type;Run ID">
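The effect of the quotes in the FP example can be illustrated with a small splitter. This Python sketch assumes that an unquoted ';' separates the fields of a dataitem and that a quoted ';' is plain data; split_fields is a hypothetical helper, not a Daylight function.

```python
def split_fields(content: str) -> list[str]:
    """Split the text between '<' and '>' into unquoted field values."""
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(content):
        ch = content[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < len(content) and content[i + 1] == '"':
                    current.append('"')  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False    # closing quote
            else:
                current.append(ch)       # reserved characters are data here
        elif ch == '"':
            in_quotes = True
        elif ch == ';':
            fields.append(''.join(current))  # unquoted ';' ends a field
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append(''.join(current))
    return fields

print(split_fields('"Fingerprint;Orig size;Obits on;Size;Bits on;Type;Run ID"'))
```

Under these assumptions the quoted FP value stays a single field, whereas the same text without quotes would be read as seven separate fields.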

Reaction SMILES

Some SMILES data need to be quoted now, since reactions contain the reserved '>' character.

   BALDWIN JOHN E., REDDY V. PRAKASH;;J. ORG. CHEM., 54,(1989) N2, C. 5264-5267;

A simplification

Although only datafields containing reserved characters need to be quoted, there is no harm in quoting all datafields on input. The easiest way to "fix up" an old datatypes TDT file is to quote all fields, e.g., the two datatrees below are equivalent:

   _V<Ave molecular weight>
   _B<Ave MW>
   _S<Average molecular weight>
   _D<Average natural molecular weight>
   _M<System, Medchem, Calculated>

   _V<"Ave molecular weight">
   _B<"Ave MW">
   _S<"Average molecular weight">
   _D<"Average natural molecular weight">
   _M<"System, Medchem, Calculated">
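The "quote everything" approach could be sketched as follows. This is only an illustration under strong assumptions: each line holds exactly one dataitem of the form TAG<value>, the whole value is a single field, and the pattern ITEM and the function quote_all are hypothetical names. It is not a substitute for the tools supplied in the Daylight distribution.

```python
import re

# One dataitem per line: optional indent, a tag, then <value>.
ITEM = re.compile(r'^(\s*)([$\w/]+)<(.*)>\s*$')

def quote_all(line: str) -> str:
    """Quote the value of a single-field dataitem line unless already quoted."""
    m = ITEM.match(line)
    if m is None:
        return line  # not a dataitem line; leave untouched
    indent, tag, value = m.groups()
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return line  # already in quoted form
    return indent + tag + '<"' + value.replace('"', '""') + '">'

for old in ["_V<Ave molecular weight>", "_B<Ave MW>"]:
    print(quote_all(old))
```

Lines that are already quoted, or that do not look like dataitems, are returned unchanged, so running the sketch twice gives the same result as running it once.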


A practical consequence of the above-described change is that all extant Thor databases must be reloaded. Although it is not difficult to do so with the tools supplied in the Daylight distribution, the program thordbfix451 is supplied to do the job. This is a simple and robust shell script which "leads you by the hand" through the process. Its use is strongly recommended.

All databases supplied by Daylight with v4.51 (dated 1997 or later) are already in the updated form (of course!) and do not need to be thordbfix-ed.

A hint

It is sometimes difficult to tell which databases have been created and loaded by a particular version of the Daylight software (e.g., when you are halfway through converting your databases). As of version 4.51, the database .THOR file contains a "version" line. To find out what software version created a database, check the .THOR file. If it contains a line like "version: 4.51", that is the version which created it. If not, it was created by version 4.42 software (or earlier).
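When many databases need checking, the test can be scripted. The Python sketch below assumes only what is stated above, namely that a converted database's .THOR file contains a line like "version: 4.51"; the rest of the file format is not assumed, and thor_version is a hypothetical helper operating on the file's text.

```python
import re

def thor_version(thor_text: str):
    """Return the version recorded in .THOR file text, or None if absent.

    None suggests the database was created by version 4.42 or earlier.
    """
    m = re.search(r'^\s*version:\s*(\S+)', thor_text, re.MULTILINE)
    return m.group(1) if m else None

print(thor_version("version: 4.51\n"))
```

In practice the text would be read from each database's .THOR file before being passed to the function.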


Daylight Chemical Information Systems, Inc.