Transformation Generation and Reaction GAs

Jack Delany

DAYLIGHT Chemical Information Systems, Inc. Mission Viejo, CA USA


Overview:

This talk is a 'status report' describing ongoing work and recent results for reaction processing. Most of our focus recently has been the exploration of multi-step reaction processing ideas and tools.

To date, all of the Daylight reaction-processing tools are oriented towards single-step reactions. The reaction languages (SMILES, SMARTS, and SMIRKS) represent single-step reactions, and have no syntax for describing multi-step sequences. Although it is straightforward to assemble reactions into multi-step sequences for a specific application, we have yet to adopt and support a standardized method of doing so. Similarly, the Daylight toolkits deal with reaction objects as single-step entities.

We don't apologize for this focus, for several reasons. Tools to process single-step reactions are a natural extension of the molecule-processing tools we previously supported. By supporting single-step reactions, we are able to deliver useful tools which address clear chemical information needs (e.g., delivering and searching reaction databases).

Multi-step reaction processing tools must necessarily be built upon the current foundation of reaction tools. They also must solve some chemical information problem to be useful. At the time we developed the initial reaction toolkit, it wasn't clear that the payback would justify the added complexity. In fact, we decided that it would be better to further explore the need for a multi-step reaction representation before actually implementing one.

Our recent explorations are in several areas:

Transformation Generation:

Automatic transformation generation is a useful "tool-building" process. We can take reported specific reactions (e.g., from a commercial database) and generate transformations which can be used for combinatorial library design, synthesis search, etc.

Transformations are key to multi-step reaction exploration. Commercial databases of reactions are poorly connected graphs. That is, relatively few reactions share common reactants and products; hence relatively few multi-step sequences can be formed by connecting single-step reactions. Searching for multi-step sequences within this sparse graph is fairly uninteresting.
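To make the connectivity argument concrete, here is a minimal sketch of how one might look for two-step sequences in a set of specific reactions. It assumes the reaction SMILES are already canonical, so that plain string equality identifies the same molecule; the four reactions and the function names are hypothetical illustrations, not entries from any database.

```python
from collections import defaultdict

def parse_reaction(rxn_smiles):
    """Split 'A.B>agents>C' into (reactant set, product set); agents are ignored."""
    reactants, _agents, products = rxn_smiles.split(">")
    return (frozenset(filter(None, reactants.split("."))),
            frozenset(filter(None, products.split("."))))

def two_step_sequences(reactions):
    """Return (first, second) pairs where a product of the first
    reaction is a reactant of the second."""
    parsed = [parse_reaction(r) for r in reactions]
    consumers = defaultdict(list)  # molecule -> indices of reactions consuming it
    for idx, (reactants, _products) in enumerate(parsed):
        for mol in reactants:
            consumers[mol].append(idx)
    pairs = []
    for idx, (_reactants, products) in enumerate(parsed):
        for mol in products:
            for nxt in consumers.get(mol, []):
                if nxt != idx:
                    pairs.append((reactions[idx], reactions[nxt]))
    return pairs

# Hypothetical specific reactions (assumed canonical SMILES).
reactions = [
    "CCO>>CC=O",
    "CC=O>>CC(=O)O",
    "c1ccccc1>>c1ccccc1Br",
    "CC(=O)O.CO>>CC(=O)OC",
]

for first, second in two_step_sequences(reactions):
    print(first, "->", second)
```

Only reactions that share an exact molecule can be chained this way, which is why a database of specific reactions yields so few sequences.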

There are two ways to improve the connectivity of this graph. First, one can 'merge' nodes in the graph. If one can find sets of molecules which meet some structural criteria, then one can consider these molecules as a single node in the search graph. One might look at common substructures or similarities. Lawson et al. describe a method in their RABBIT approach where automatic functional group and protection/deprotection steps are used.

The other approach is to increase the number of reactions in the graph. If each "reaction" in the domain is a generic reaction which can apply to many molecules, this dramatically increases the connectivity of the graph for reaction sequence exploration. This is the typical approach used in traditional "synthesis planning" systems.

Our version of a transformation is much simpler (and dumber) than that in a typical synthesis planning system. Right now, we're simply using the observed reactant and product structures and some simple rules to discover structural features which are relevant to the transform (e.g., pi conjugation, atoms connected to the reacting center).

The basic procedure is as follows:

Note that the output is reaction SMILES with opened valences everywhere except for reacting center hydrogens (and they are explicit). This level of reaction SMILES is the common unambiguous portion of SMILES, SMARTS, and SMIRKS. Within all three languages these strings have exactly the same meaning! They also have the same meaning in the Merlin "substructure" search, which is discussed in the next section. By dealing with the hydrogens and aromaticity carefully, we avoid the "problem areas" of differing semantics between languages.

[There is one caveat: I'm not dealing with the difference in semantics for chirality. SMILES chirality is global in scope; SMARTS / SMIRKS chirality is local.]

Results from the conversion of CCR972 follow. CCR972 contains 268k unique reaction SMILES. Approximately 225k of them have atom maps; the remainder are 'overall reactions'. Of these 225k reactions, approximately 187k result in valid transforms.

Transforms generated from CCR972

DB Name      Algorithm                    Direction  SMILES roots  Unique Transforms  DB Size
groktxf_fwd  Core only                    Forward    21k           67k                18 MB
groktxf_rev  Core only                    Reverse    21k           67k                11 MB
groktxf      Core only                    Both       39k           134k               28 MB
ccrtxf_fwd   Core + 1-atom + conjugation  Forward    52k           110k               41 MB
ccrtxf_rev   Core + 1-atom + conjugation  Reverse    65k           110k               29 MB
ccrtxf       Core + 1-atom + conjugation  Both       106k          220k               70 MB

On the surface, this simplistic view of generic reactions seems weak; in fact it does underdescribe the knowledge contained in any given reaction or set of reactions. Regiochemical, steric and electronic effects and trends on reactivity aren't captured at all. This representation does have one important advantage: speed. Because we can process these transformations as SMILES, they can be fingerprinted, screened, searched, clustered, etc.

Using the tools and optimizations which we've developed over the years for molecule and reaction processing, we have the ability to screen huge numbers of specific transformations (>1 million/second). The ultimate question then is: Can this brute-force application of processing power make up for the simplicity of the representation of transformations?
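As a rough illustration of why screening can be this cheap, the sketch below shows the usual bitwise subset test behind fingerprint screens: a transformation can only apply to a molecule if every bit set in the transformation's reactant fingerprint is also set in the molecule's fingerprint. The fingerprints here are arbitrary small integers chosen by hand and the transform names are hypothetical; they are not real Daylight fingerprints, and this is not a description of the actual implementation.

```python
def passes_screen(pattern_fp, molecule_fp):
    """True if every bit set in pattern_fp is also set in molecule_fp."""
    return (pattern_fp & molecule_fp) == pattern_fp

def screen_transforms(molecule_fp, transform_fps):
    """Return the names of transforms that survive the screen.
    Survivors are only candidates; a full match is still needed."""
    return [name for name, fp in transform_fps.items()
            if passes_screen(fp, molecule_fp)]

# Hypothetical fingerprints, written as bit patterns for readability.
transform_fps = {
    "diels_alder_diene": 0b0000_0110,
    "dibromination":     0b0000_0010,
    "ester_hydrolysis":  0b1001_0000,
}
molecule_fp = 0b0100_0111

candidates = screen_transforms(molecule_fp, transform_fps)
print(candidates)
```

The screen can only rule transformations out. Anything that passes still has to be confirmed by a real substructure match, but the screen costs one AND and one comparison per fingerprint word.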

A Search Paradigm for Multi-step Sequences:

It should be apparent that one of the key operations for the exploration of multi-step reaction sequences is the repeated discovery and application of transformations to target molecules.

In either case, we have a molecule and need to find a transformation which applies to that molecule. And, we need to do it fast because we'll end up doing it thousands of times to answer a single question.

We've created a database architecture which can do this very quickly. For each transformation, we create one or more SMILES-rooted TDTs (one SMILES for each component on the reactant side of the transform, excluding hydrogen and non-atom mapped components).
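The indexing idea can be sketched in a few lines. This is a hedged approximation only: it strips atom maps with a regular expression and uses the resulting string as the root key, whereas a real system would canonicalize each component with a chemistry toolkit. The function names and the in-memory dictionary are hypothetical stand-ins for the database.

```python
import re
from collections import defaultdict

ATOM_MAP = re.compile(r":\d+(?=\])")        # the ':n' part of '[C:n]'
HYDROGEN_ONLY = re.compile(r"(\[H\])+")     # components made only of hydrogen

def reactant_roots(transform):
    """Return map-stripped reactant components, skipping hydrogen-only
    and non-atom-mapped components."""
    reactant_side = transform.split(">")[0]
    roots = []
    for component in reactant_side.split("."):
        if not ATOM_MAP.search(component):
            continue  # not atom-mapped
        stripped = ATOM_MAP.sub("", component)
        if HYDROGEN_ONLY.fullmatch(stripped):
            continue  # hydrogen component
        roots.append(stripped)
    return roots

def build_index(transforms):
    """Map each root to a list of (transform, partner components)."""
    index = defaultdict(list)
    for transform in transforms:
        roots = reactant_roots(transform)
        for i, root in enumerate(roots):
            partners = roots[:i] + roots[i + 1:]
            index[root].append((transform, partners))
    return index

transforms = [
    "[C:1]=[C:5][C:6]=[C:2].[C:3]=[C:7][C:8](=[O:4])[O:9]>>"
    "[O:9][C:8](=[O:4])[C:7]1[C:3][C:1][C:5]=[C:6][C:2]1",
    "[C:1]=[C:3][C:4]=[C:2]>>Br[C:1][C:3]=[C:4][C:2]Br",
]
index = build_index(transforms)
for root, entries in index.items():
    print(root, len(entries))
```

Because the keys are not canonical, the strings differ in atom ordering from the stored records that follow, which show the actual TDT form.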

$SMI<C=CC=C>
FP<.....U....+.U....2........+..UUE....0UE...k.2;2048;13;256;13;1>
TXF3<"[C:1]=[C:5][C:6]=[C:2].[C:3]=[C:7][C:8](=[O:4])[O:9]>>
[O:9][C:8](=[O:4])[C:7]1[C:3][C:1][C:5]=[C:6][C:2]1";[O]([C]=[O])[C]=[C];1>
TXF3<"[C:1]=[C:3][C:4]=[C:2]>>Br[C:1][C:3]=[C:4][C:2]Br";ZZZ;2>
|
$SMI<O(C=O)C=C>
FP<..+..U....++U.2E.I...+..2.+..UUE.....UN0..k.2;2048;22;256;22;1>
TXF3<"[C:1]=[C:5][C:6]=[C:2].[C:3]=[C:7][C:8](=[O:4])[O:9]>>
[O:9][C:8](=[O:4])[C:7]1[C:3][C:1][C:5]=[C:6][C:2]1";[C]=[C][C]=[C];1>
|

Note the following: