8. Merlin - Exploratory Data Analysis Program

Merlin is a high-speed exploratory data analysis (EDA) program, designed to explore today's large chemical-information databases. Where THOR deals with TDTs one at a time, Merlin treats a database as a single unit and performs operations (e.g. searches and sorts) on the whole database. By using high-speed in-memory techniques, Merlin performs these EDA tasks with unprecedented speed.

In general, a user of an EDA program does not know exactly what information is required before starting work - the Merlin system is designed to allow information to be "discovered" based on available data. EDA is similar to, but distinct from, data archival and retrieval. For example, a data archival/retrieval question might be, "What is the structure of Atromepine?" whereas EDA questions might be, "What known structures have names that are similar to Atromepine?" (e.g. Atropamine) and "What known compounds are structurally similar to Atromepine?" (e.g. Hyoscyamine).

Like the THOR system described in the previous section, Merlin provides access to a THOR database. However, the "view" Merlin gives of the database is quite different than THOR's view: THOR is a "microscope" for the database that provides a detailed view of individual datatrees, whereas Merlin might be thought of as a "macroscope" that performs operations on the database as a whole.

In the Daylight system, THOR provides data archival and retrieval services while Merlin provides exploratory analysis services. One advantage of using separate systems for archival and EDA is each can use the best strategies for the task at hand. This clear distinction differs from the design of most other systems in which a single methodology is used for all archival, retrieval, and search services.

8.1 In-Memory Searching

The basic idea behind Merlin is that data in a computer's main memory can be manipulated roughly five orders of magnitude faster than data on its disk. Throughout the history of computers, there has been a price-capacity-speed tradeoff for data storage: Large-capacity storage (tapes, drums, disks, CD-ROMS) is affordable but slow; high-speed storage ("core", RAM) is expensive but fast. Until recently, high-speed memory was so costly that even a modest amount of chemical information had to be stored on tapes or disks.

But technology has a way of overwhelming problems like this. The amount of chemical information is growing at an alarming rate, but the size of computer memories is growing even faster: at an exponential rate. In the mid-1980's it became possible for a moderately large minicomputer to fit a chemical database of several tens of thousands of structures into its memory. By the early 1990's, a desktop "workstation" could be purchased that could hold all of the known chemicals in the world (ca. 15 million structures) in its memory, along with a bit of information about each.

On the surface, in-memory operations seem like a straightforward good deal: A computer's memory is typically 105 times faster than its disk, so everything you could do on disk is 100000 times faster when you do it in memory. But these simple numbers, while impressive, don't capture the real differences between disk- and memory-based searches:

  • With disk-based systems, you formulate a search carefully, because it can take minutes to days to get your answer back. With Merlin it is usually much faster to get the answer than it is to think up the question. This has a profound effect on user's attitudes towards the EDA system.
  • In disk-based systems, you typically approach with a specific question, often a question of enough significance that you are willing to invest significant effort to find the answer. With Merlin, it is possible to "explore" the database in "real-time" - to poke around and see what is there. Searches are so fast that users adopt a whole new approach to exploratory data analysis.

There are three basic database operations in Merlin:

    Searching: Merlin provides a number of search services. You can search for particular text or "ranges" of text (e.g. names, properties, activity, etc.), numeric ranges, similar molecular structures, substructures, and superstructures.

    Sorting: Merlin can sort information using a variety of "comparison functions," including numeric, alphabetic, molecular formula, CAS number, etc.

    Selecting: You can use several techniques to select items of interest "by hand."

Merlin presents the database as a "chemical spreadsheet" called a pool. The data are seen in rows and columns. The results of searches are stored in hitlists.

    A pool is a THOR database that is loaded into the computer's memory.

    A row is Merlin's representation of a THOR datatree. That is, all data in one row are from a single TDT in one database.

    A hitlist is an ordered subset of the rows in the pool. Sorts and searches modify hitlists: A search adds or deletes rows from a hitlist, and a sort changes the order of the rows in a hitlist.

    A hit is a row that is currently in a hitlist.

    A column is a "vertical slice" through the pools, and contains data of one particular datatype. For example, a column might be for the datatype "Name," the column would contain a name from each row of the pool.

These concepts are discussed in more detail below.

8.2 Servers and Clients

The Merlin facility is organized in two parts: servers and clients. The Merlin server provides the basic capabilities in the Merlin system. The server creates and maintains all of the fundamental "objects" (pools, hitlists, and columns), and it carries out all of the main searching and sorting operations of the Merlin system. In addition, the server is entirely responsible for security: validating a user/password when a client connects, and validating database passwords when a database is open (security is discussed in more detail in the THOR_Merlin Administration Guide.

As with the THOR server, a Merlin server's primary purpose is to share resources. In THOR the resource being shared is access to disk-based databases, whereas in Merlin the resource being shared is the computer's memory. Merlin uses in-memory techniques for high-performance searching, so having sufficient memory and sufficient computer power is critical to Merlin's usefulness. A Merlin server allows many client programs to access the same data and computer, thus sharing the cost of the resources (memory and speed) among many users.

Although the Merlin server examines vast quantities of data during a typical search or sort operation, there is typically a relatively small amount of information that is communicated between the client and server. The client makes a request to the server, which might be a few to a few hundred bytes of information. The server carries out the request (e.g. a sort or search), then typically only sends back those data that are to be displayed via the client's user interface. The Merlin system is designed so that the client/server communication is at a "low density" point in the "layers" of software, so that interprocess communication is minimized.

Merlin clients are primarily responsible for the "user interface" portion of the Merlin system's task. A typical Merlin client has a "Merlin Window" that showed the contents of a hitlist and some columns, allows a user to set up and carry out searches, and has a way of storing or printing the results.

Like the THOR system, there are many possible Merlin clients. The Merlin server is a program provided by Daylight, but the Daylight Toolkit provides access to all Merlin capabilities. Using the Toolkit, many possible clients can be created.

Two examples of Merlin clients are programs available from Daylight. The program XVMerlin is an X-Windows-based user interface to Merlin's search capabilities. The program 'sthorman' is a "tty-style" management tool used to load pools into Merlin, and to control security in the Merlin system.

8.3 Pools

A Merlin pool consists of data loaded into memory from a THOR database. To create a pool, the Merlin server reads the database's datatype definitions, the "_P" Merlin-pool-inclusion flag indicates which data are to be part of the pool, then copies data from the disk file to the computer's main memory, where they reside until the pool is unloaded.

8.4 Columns and Cells

Although a pool typically contains a subset of the data in a THOR database, it often consists of a wide variety of types of information, including structure, reactivity, chemical properties, prices, catalog numbers, and so forth. At any particular moment, the typical user is only interested in a few types of data. To solve this problem, Merlin provides columns.

A column of data is conceptually a "vertical slice" through all datatrees; it selects a specific datum or derived-datum (the result of computations on real data) from each datatree. The particular datum used for a column is defined by two properties:

Datatype
When creating a column, you specify the datatype and the field within that datatype.

Function
A datatree may contain several of a particular datatype (for example, there may be many names). Several functions are available to select from among those available, including "first", "last", "average" (for numeric data), "least", "greatest", and so forth; these are explained in detail below.

The pool can be thought of as a "chemical spreadsheet," with one row for each datatree in the database and various columns for different types of data. The intersection of a row and a column is called a cell and is the basic unit, or datum, in Merlin. Note: Setting the pool inclusion flag (_P) to '!' separates subtree data for "distinguished identifiers" into their own rows. This does not affect how the information is stored, only how it is viewed.

A derived-data column is a special type of column whose data are not in the database, but are derived from the database or computed during operations on the database. For example, a SIMILARITY column contains data computed when you perform a structural-similarity search. There is no SIMILARITY data in the database itself; the data are created and changed as you work. Other types of derived-data columns are discussed in the section below.

8.5 Column-creation Functions

As discussed above, column-creation functions, or simply functions, allow you to specify which instance of a particular datatype to use when more than one occurs in a row of the pool. The functions are:

First
Use the first instance of the datafield in the row.

Last
Use the last instance of the datafield in the row.

Least
Use the lowest-valued instance of the datafield. For ASCII data, this is the lowest lexical value; for numeric data it is the lowest numeric value.

Greatest
Use the greatest-valued instance of the datafield.

Longest
Use the datafield that is the longest (contains the most characters). Not applicable to numeric data.

Shortest
the datafield that is the shortest.

Count
Creates a derived-data column containing the number of instances of the datafield.

Average
Creates a derived-data column containing the average of all instances of the datafield. Only applies to numeric datatypes.

Standard Deviation
Creates a derived-data column by computing the standard deviation of all instances of the specified datafield. Only applies to numeric datatypes.

8.6 Hitlists

A hitlist is an object that holds the results of sorts and searches. It is an ordered set of rows of data from a particular search pool. That is, it is a list of which rows are currently "hit" (selected), and the order in which you want the hits presented and operated on.

When a hitlist is first created, or after it is reset, it contains the set of all rows in the database in native order (native order is essentially random order, but it doesn't change as long as you have the pool open).

A hitlist's contents are changed by sorting operations (the hits are reordered) and by searching operations (rows are added to or deleted from the hitlist). These and other hitlist operations are discussed below.

The Merlin server allows clients to have several hitlists active at one time. For example, Daylight's XVMerlin program uses three: a primary hitlist, an "undo" hitlist, and a hitlist "memory". This allows users to undo mistakes, and to save a particular hitlist for later recall.


Go To First Chapter... 1. Introduction