We are still evaluating the need for a "Scheme" construct within the Daylight system. When we started this project, we envisioned that we'd ultimately extend the base reaction capabilities to handle multi-step reactions as separate entities.
This would mean supporting a new toolkit object which would be made up of multiple reaction objects which shared common reactants or products. This also implies a canonicalizable language for schemes, useful for a database representation.
It is not clear that either of these features is needed to continue future work, including the processing of multi-step reactions. There are two main areas that we'll be exploring related to multi-step reactions in the next year:
The main argument for supporting a scheme "entity" is the need to store them (as a unit) in a database and attach data to this unit. Another benefit would be to provide a higher level of abstraction in the toolkit which would simplify some operations.
Note that the lack of a scheme object within the system should not preclude any processing capabilities. If this turns out not to be the case, then we'll most certainly add a scheme object. Supporting a scheme object should make some operations more convenient. In either case, user interface elements are required for input and display.
There are some pitfalls, however. Supporting scheme objects would add complexity to the toolkit and databases. It would also increase the potential for confusion in the database model, since schemes and reactions would likely be stored together: where should the data go, and how are they cross-referenced?
Without using schemes, we can store scheme-oriented information efficiently using the current Thor system. We can rapidly link single-step reactions using Thor cross-references. This can be extended for scheme processing. In fact, the "scheme server" which we are contemplating will use both Thor and Merlin features for its operation, depending on the task.
Given the reactions A -> B -> C, reported as a single reaction sequence, and B -> C reported separately as a single-step reaction, a proposed representation for this information follows:
$SMI<"A>>B"> FP<> XFP<> CL<>
  $RNO<12345> ISM<"A>X>B"> /SCH_ID<77>
  $RMOL<A>
  $AMOL<X>
  $PMOL<B>
|
$SMI<"B>>C"> FP<> XFP<> CL<>
  $RNO<23456> ISM<"B>Y>C"> /SCH_ID<77>
  $RNO<87654> ISM<"B>>C"> CIT<"Tett. Lett.";1994;...>
  $RMOL<B>
  $AMOL<Y>
  $PMOL<C>
|
$SCH_ID<77> CIT<"Total synthesis of C";"J. Org. Chem.";1995;...>
|
One advantage of the above representation is that it simplifies the processing of explicit versus implicit schemes. An explicit scheme is a multi-step sequence of reactions which are reported in a single citation. An implicit scheme is a sequence of reactions which are assembled from unrelated citations.
In the above case, there is an implicit scheme which can be formed by combining the A -> B step with the B -> C reaction from the second citation (Tett. Lett.). During processing, one could choose to limit the search to explicit schemes (by checking that all of the single-step reactions share the same scheme-id) or simply allow all implicit schemes to be considered.
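The explicit-versus-implicit check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Step record, the find_chains and is_explicit helpers, and the flattening of each reaction into one reactant and one product are all hypothetical, and none of it is Thor or Merlin API. The registry numbers and the scheme-id are the ones from the example records.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One single-step reaction (hypothetical flattening of a datatree record)."""
    rno: int                      # registry number, as in $RNO<...>
    reactant: str
    product: str
    sch_id: Optional[int] = None  # scheme-id cross-reference, if any


# The three single-step entries from the example records.
STEPS = [
    Step(12345, "A", "B", 77),
    Step(23456, "B", "C", 77),
    Step(87654, "B", "C", None),  # reported separately, no scheme-id
]


def find_chains(steps: Sequence[Step], start: str, goal: str,
                path: Tuple[Step, ...] = ()) -> Iterator[List[Step]]:
    """Yield every chain of steps leading from start to goal (depth-first)."""
    if path and start == goal:
        yield list(path)
        return
    # Molecules already on the path; skipping them avoids cycles.
    seen = {start} | {s.reactant for s in path}
    for s in steps:
        if s.reactant == start and s.product not in seen:
            yield from find_chains(steps, s.product, goal, path + (s,))


def is_explicit(chain: Sequence[Step]) -> bool:
    """A chain is explicit when all of its steps share one scheme-id."""
    ids = {s.sch_id for s in chain}
    return len(ids) == 1 and None not in ids


chains = list(find_chains(STEPS, "A", "C"))
for chain in chains:
    kind = "explicit" if is_explicit(chain) else "implicit"
    print([s.rno for s in chain], kind)
```

Restricting a search to explicit schemes then amounts to filtering the chains with is_explicit; dropping the filter admits every implicit scheme as well.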
This is the database representation that we're leaning towards for the scheme server. Dave will talk about other aspects of this project in his "Futures" talk.
The other area of work is in mixture processing and how reactions can facilitate this important area. There are at least four distinct methods for representing mixture data in our system. These include:
Advantages: current tools and database models work correctly.
Limitations: Database size, members of a mixture often don't have measured data, some/many not actually formed, etc.
Advantages: Addresses most limitations of enumeration: compact, data attached to mixture, searches efficiently.
Limitations: Monomers are a pain, non-regular mixtures aren't handled well.
Advantages: Handles all mixtures (regular to completely random). Data attached to mixture. Searching efficient.
Limitations: Still some processing bottlenecks related to sizes of SMILES.
Advantages: Data attached to mixture. Captures experimental information.
Limitations: Searching inefficient (requires generation of target molecules for queries). Doesn't handle random mixtures.
The first three methods are product-oriented; structures and data for the hypothetical (or actual) members of the mixture are stored. None of these three methods addresses the need to record the methodology used to build a mixture at any level (this might include robot control instructions, synthetic operations, reactions performed, etc.). Our belief is that this information is being collected and stored ad hoc, and that it is not in a form which is readily retrieved and reused.
Reaction representations are experiment-oriented; structures are derived from the expected reactions. The "MTZ" language is a simple reaction-oriented language for representation of mixtures (contrib/src/c/transform/enumerate.c). It allows the specification of sets of molecules, transforms, and postprocessing operations (to prune unimportant byproducts). MTZ is by no means a complete language.
The method which holds the most promise right now (note: this is without critical scrutiny) is a combination of dot-separated SMILES and an experimental language (VCL- or MTZ-like). The reaction language would be used to generate the dot-separated SMILES, would capture the experimental specification, and would be quite reusable. Searching would occur over the dot-separated SMILES. For completely random mixtures, the experimental language wouldn't apply; only the dot-separated SMILES would be applicable.
This approach would allow the capture and registration of experimental methods and product-oriented searches of the mixtures, while retaining the flexibility to handle both regular and non-regular mixtures. Further exploration of this and other methods will be undertaken over the next year.
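To make the combined approach concrete, here is a toy Python sketch of an experiment-style specification (sets of molecules, a transform, and a pruning step) that generates a dot-separated SMILES for the mixture. Everything in it is an assumption made for illustration: the monomer fragments, the amide_coupling transform (plain string concatenation standing in for a real reaction transform), and the enumerate_mixture helper are hypothetical, and none of this is MTZ or VCL syntax.

```python
from itertools import product
from typing import Callable, Sequence

# Hypothetical monomer sets written as SMILES fragments: acyl parts are
# written without a leaving group, amine parts start at the nitrogen.
ACYL = ["CC(=O)", "c1ccccc1C(=O)"]
AMINE = ["NC", "NCC"]


def amide_coupling(acyl: str, amine: str) -> str:
    """Toy transform: join an acyl fragment to an amine fragment."""
    return acyl + amine


def enumerate_mixture(set_a: Sequence[str], set_b: Sequence[str],
                      transform: Callable[[str, str], str],
                      keep: Callable[[str], bool] = lambda smi: True) -> str:
    """Apply the transform to every pair, prune, and join members with dots."""
    members = [transform(a, b) for a, b in product(set_a, set_b)]
    return ".".join(m for m in members if keep(m))


# The specification (sets + transform + pruning rule) is the reusable,
# experiment-oriented part; the resulting string is the product-oriented part.
mixture = enumerate_mixture(ACYL, AMINE, amide_coupling)
print(mixture)

# Example of a postprocessing rule: drop the aromatic products.
aliphatic_only = enumerate_mixture(ACYL, AMINE, amide_coupling,
                                   keep=lambda smi: "c1" not in smi)
print(aliphatic_only)
```

A completely random mixture would have no such specification and would be registered directly as a dot-separated SMILES. Real searching would of course go through the toolkit's structure matching rather than string inspection.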