The main goal of database AAA is to serve as an example of how to build databases for MUG '97. The number of data and kinds of data will be kept to a minimum. The database should contain a only a small amount of easily-understood data. Data will be keyed to structures, including stereochemical information as available. Robust structure-based retrieval is desired (e.g., query as molecular amino acid or carboxylate). Stereochemical information will be recorded in the structures. Although there is no need here to be concerned about speed/space, indirect referencing will be used whenever appropriate for the purposes of illustration. AAA should be suitable for use in all possible current and future Thor and Merlin clients, including CD-ROMS, so we well choose file names which are valid under the (very restrictive) ISO-9660 rules. AAA should be expandable to other data about amino acids as they become available. AAA will not be clustered (due to the small number of structures), but that might change in the future.
|Amino acid||Entry||Solvent||Grams amino|
acid per 100
|Temp, C||Ref No.|
1. Cohn, E. J., McMeekin, T. L., Edsall, J. T., and Weare, J. H., J. Am. Chem. Soc., 58, 2270 (1934)
2. McMeekin, T., L., Cohn, E. J., and Weare, J. H., J. Am. Chem. Soc., 103, 2270 (1933)
3. Dunn, M. S. and Rose, F. J., J. Biol. Chem., 124, 300 (1938)
4. McMeekin, T., L., Cohn, E. J., and Weeare, J. H., J. Am. Chem. Soc, 58, 2173 (1936)
5. Abderhalden, E., and Zeissctw., Z. physiol. Chem., 196, 121 (1931)
6. Kapfhammer, J., and Eck, R., Z. physiol. Chem., 170, 294 (1927)
The Merlin system provides a spreadsheet-like view of a Thor database. The "normal" view where each row represents a single datatree (i.e., information about a generic chemical structure). This view may be augmented by separating subtree data for "distinguished identifiers" into their own rows, by setting the "pool inclusion flag" (_P) to `!'. Doing so doesn't affect how the data is stored, only how it is viewed. For purposes of this example, we will do so for the $DID identifier.
Here is the definition of the $DID datatype as a datatree:
$D<"$DID"> _V<Data source ID> _B<Data ID> _N<WHITE0 UPCASE> _P<!> _S<Identifies source of data> _D<Identifies precise source of data (compilation and entry)> _M<Lookup, Common> _O<Daylight Chemical Information Systems, Inc.> |
Reference data will be stored as indirect data. It's more efficient and reliable to do so, since such data contains a lot of repeat values. For this example, the reference field is split into two fields, "Author(s)" and "Citation". This isn't really necessary, but it makes a good example here.
Solvent data would normally also be stored as indirect data, but for the purposes of this example, we will store it as the named SMILES of the solvent using the "USMILESANY" normalization. This will allows us to display and search the solvent field as structures.
The other field, "Temp., C" in the table, might be called "Temperature of measurement", but here we call it "at Temp, C" for brevity. It is a simple NUMERIC field.
Putting this all together, this is the definition of the SOL datatype as a datatree:
$D<SOL> _V<"Solubility;Solvent;at Temp, C;Author(s);Citation"> _B<"Sol;Sol/solv;Sol/temp;Sol/auth;Sol/ref"> _N<"NUMERIC;USMILESANY;NUMERIC;INDIRECT $IAUTH;INDIRECT $ICITE;"> _P<"*;*;*;*;*"> _S<Solubility, grams per 100 grams of solvent> _D<"Solubility, grams / 100 grams of solvent;Solvent SMILES;Temp, C;Authors;Citation"> _M<Physical Property, Measured, Common> _O<Daylight Chemical Information Systems, Inc.> |
In this case, structures are obtained from the name in the table for a given entry, we will connect ISM data to the same $DID identifier as the data.
Since the database design requires structural and similarity searching, we will need to add fingerprints to the datatrees. The fingerprint program generates fingerprints only for $SMI-rooted datatrees. Although the above approach of loading ISM's and letting Thor generate the $SMI root is very convenient and accurate, it means that we don't have $SMI-rooted datatrees in the input data for fingerprinting. We might generate $SMI roots in some other way, but this introduces another possible source of error. A better way would be to get the Thor system to generate and fingerprint SMILES in an automatic fashion -- as will be shown, this can be done quite simply.
Note: If a database has many kinds of names, it's a good idea to crossreference all of them as $NAM's, so users can look up "Names" without also specifying the kind of name (IUPAC, USAN, INN, Trade Name, Common Name, etc.) This isn't the case here; there's only one kind of name in this example.
The most common way to do this is to write a sh(1) script and this approach works very well. A more powerful method is to describe the rules for building the database as a make(1) specification. We'll use make in this example and put the rules in a Makefile.
We have to pick a name for the database(s) at this point. In previous years, we might have used a name for the primary database such as "aminoacids" with "aminoacids_datatypes" and "aminoacids_indirect" for the auxiliary databases. Such choices would be just fine. But we've been experimenting with CD-ROM databases lately and the goals call for ISO-9660 compatibility, so we're limited to "8.3" names. We'll use the names "AAA", "AAA_DTS", and "AAA_IN" with the database extension ".TDB".
The Makefile delcares dependencies such that the three databases are created and loaded in a specific order. The datatypes database (AAA_DTS) must be created and loaded first, then the indirect database (AAA_IN) must be created next, then the primary database is created and loaded. All databases are created with thormake and loaded with thorload. The only unusual thing about the database loading is that indirect data is automatically generated (via the "+GENERATE_INDIRECT" option).
The database-building strategy used here allows the database to be created and fingerprinted from non-$SMI-rooted datatrees:
This Makefile has only one "automatic" target, the database AAA.TDB. Typing either "make all" (or just "make") will create the entire database (or bring it up to date). Two other convenience targets are are provided: "clean" to remove the database and "cleanall" which also removes the temporary files.
Note that all datatrees are $DID-rooted and that the only structural data are in ISM datatypes, some of which have chiral specifications and some do not not. Remember that the ISM datatype has been given the ASMILES normalization so ISM's will end up on the correct Thor page keyed to their unique SMILES.
Note that the $DID-rooted datatree is itself rooted at a generic SMILES, a lookup "Name" has been properly derived from the "Local name". Also, "Graph" ID, Fingerprint, and Timestamp data have been generated and added to the datatree.
Open the database in xvmerlin, create columns of various fields, perform structure searching (check that fingerprints are OK) and try a few string and range searches. The following xvmerlin screen shows the AAA database with rows selected which have solubility data measured at 25 C in aliphatic alcholic solvents (SMARTS "[CX4]-[OH]"):
$ cd /home2/daylight/newdb $ tar c AAA*Some people are more comfortable saving TDT files because they are stored as simple ASCII text. If you choose to do so, be sure to save tdt files for all the databases, e.g.,
$ thorlist ACS_DTS > ACS_DTS.TDT $ thorlist ACS > ACS.TDT $ thorlist ACS_IN > ACS_IN.TDTTo avoid indirect data dependency, you might choose to save the TDT's with indirect data "expanded", e.g.,
$ thorlist -KEEP_INDIRECT_DATA FALSE ACS > ACS_X.TDTFinally, for the purposes of database backup only, you can use the the low-level program thordump to dump all the trees in a from the primary thor database (.DP) directly. Be sure that the database is not loaded writable by any thorserver (check for a .LCK file) and use thordump like thorlist:
$ cd /home2/daylight/newdb $ thordump ACS_DTS.DP > ACS_DTS.TDT $ thordump ACS.DP > ACS.TDT $ thordump ACS_IN.DP > ACS_IN.TDT
The basis of Thor security is that the only way anyone except the Thor database administrator can get access to data in a Thor database is via thorserver. For these purposes, the administrator can not be the "root" super-user but another pseudo-user, conventionally named "thor". Check that only "thor" has access to the database files. (There's no sense in maintaining elaborate access control if unauthorized users can read the database files directly.)
Thor and Merlin servers control access via two separate mechanisms: user and host authorization. You can authorize access from specific machines or to specific (possibly passworded) users or (in "restricted hosts" mode) only to specific users when they are working on specific machines. Server access is controlled via sthorman. See the Daylight Administration Manual, Section 8. Security for more information.
Three passwords were supplied when the database was created, for read, write and executive privileges. By convention, completely unsecured databases have no read password and the passwords "write" and "exec". "Public" databases typically have no read password but do have passwords for write and executive access. Databases with restricted access should have a read password which is known only to authorized users. You can use sthorman or thorchange to change database passwords as needed (the executive password will be needed to do so.)