The Hitch Hiker’s Guide

to Chemical Space

by Anthony Nicholls,

OpenEye Scientific Software, Inc.

From the Introduction to "The Hitch Hiker’s Guide to Chemical Space":

Q: What is Chemical Space?

A: It's where 10^{200} small, possibly drug-like, molecules live.

Q: 10^{200}. Isn’t that a big number?

A: Yes, it's big. Really big. You just won't believe how vastly hugely mindbogglingly big it is. I mean, you think your average combinatorial library is big, but that's just peanuts compared to Chemical Space. Listen:

(it goes on for a bit here before finally settling down with things you really need to know)

Q. Sounds scary. Do you have anything helpful to say about finding your way around in Chemical Space?

A: Lots! For starters…

* *

The important Question:

*In 10^{200}, how many are really different?*

Only differences in:

- Scalar properties/descriptors, 1D
- Chemical Bonds, aka graphs, 2D
- Physical Structure, 3D

that produce differences in:

- Biological Activity
- Bioavailability
- Toxicity

really matter (for drug design).

Fundamental OpenEye Credo:

Shape Matters

Shape = Field Properties:

Such as:

Steric

Electrostatic

Functions derived from atom types

(e.g. hydrophobic, hydrogen bond potential)

Why we feel shape is probably sufficient

Consider the following problems in traditional energetic approaches to the important problem of calculating binding energies:

- General Hydrophobic Effect (binding event, plus unbound conformers ranking)
- Discrete Water Effects (binding)
- Entropy of Binding (Solid Body)
- Conformation Entropy (ligand and protein)
- Large Scale Protein Motions
- Polarization of Charge Distributions
- Hydrogen Bonds

i.e. why worry about "precise" vacuum energies?!?

To have a shape one needs a structure:

Computational Structure Generation

General Methods:

- Rule Based, e.g. CONCORD
- Distance Geometry, e.g. Rubicon
- Build Up
- Exhaustive Search

However, structure generation has to take account of the aqueous environment of biological processes, i.e. we need:

Solvation Modeling:

The output from Solvent Modeling has two uses:

- Energies, to Rank Structures
- Derivatives (Forces), to Minimize Structures

Possible Methods include:

- All atom simulations (FEP)
  - VERY time consuming
- Accessible Area (e.g. Ooi-Scheraga)
  - Fast but inaccurate
- GBSA (Still)
  - Relatively fast, accurate for small molecules
- Poisson-Boltzmann (PB) / Hydrophobic Area terms
  - PB much faster than FEP, slower than Area based methods

We use PB because we believe it provides the highest ratio of accuracy to CPU cycles, once certain implementation problems are addressed.

OpenEye Advances in PB

- Use of Gaussian Functions to derive a smooth dielectric function

- Quadratic interpolation to map properties to and from the grid upon which PB is solved

Results:

- Energies stable with respect to grid displacement (see graph below)

- Much faster solve and set up
- Derivatives for Solvent Minimization

Comparing Molecular Shapes

- Difference in Shape = Difference in Fields

(i.e. what is the overlap of A with B)

- A Field is an ordered set of numbers

(e.g. think of a grid/lattice representation)

- The difference between two ordered sets of numbers is a metric

(admittedly in an infinite dimensional space)

- Metrics are our friends.

(aka: the triangle inequality: d_{ab} + d_{bc} >= d_{ac} rules, ok)
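The points above can be made concrete with a minimal sketch. It assumes a steric field built as a sum of spherical Gaussians on atom centres and sampled on a cubic grid; the helper names (`make_grid`, `gaussian_field`, `field_distance`), the grid size and the Gaussian width are illustrative choices, not OpenEye code.

```python
import numpy as np

def make_grid(half_width=4.0, n=17):
    """Cubic lattice of n*n*n points, returned as an (n^3, 3) array."""
    axis = np.linspace(-half_width, half_width, n)
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)

def gaussian_field(centers, grid, width=1.0):
    """Sum of spherical Gaussians, one per atom centre, sampled at the grid points."""
    centers = np.asarray(centers, dtype=float)
    # Squared distance from every grid point to every centre: shape (n_grid, n_atoms)
    d2 = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2)).sum(axis=1)

def field_distance(f, g):
    """Euclidean distance between two fields sampled on the same grid."""
    return float(np.sqrt(((f - g) ** 2).sum()))

grid = make_grid()
# Three toy "molecules" (hypothetical atom positions)
field_a = gaussian_field([[0, 0, 0], [1.5, 0, 0]], grid)
field_b = gaussian_field([[0, 0, 0], [0, 1.5, 0]], grid)
field_c = gaussian_field([[0, 0, 0], [1.5, 0, 0], [0, 1.5, 0]], grid)

d_ab = field_distance(field_a, field_b)
d_bc = field_distance(field_b, field_c)
d_ac = field_distance(field_a, field_c)
print(d_ab + d_bc >= d_ac)  # the triangle inequality holds for any three fields
```

Because each field is just a long vector of grid values, the distance is an ordinary Euclidean norm, which is why the triangle inequality comes for free.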

Why are metrics so wonderful?

Example 1:

Exhaustive Search for the Best

Overlay = Minimal Field Difference

Traditionally a 6-dimensional search problem

But Consider:

- Two molecules A and B
- Each center of mass at (0,0,0)
- Generate a set of N positions and orientations of B relative to original {B}
- Order {B} in a distance tree, based on the overlap with B
- Find the best overlap of {B} with A via the distance tree
- Number of comparisons ~ log(N)
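The role of the triangle inequality in the steps above can be sketched as follows. This is a simplified stand-in, not a full distance tree: it uses a single reference (the original B) and precomputed distances to prune candidates, and it represents fields as plain vectors. The function name `pruned_nearest` and the random test data are assumptions for illustration.

```python
import numpy as np

def pruned_nearest(query, candidates, reference, ref_to_cand):
    """Return (best_index, best_distance, n_full_comparisons).

    ref_to_cand[i] is the precomputed distance from `reference` to candidates[i].
    """
    d_query_ref = np.linalg.norm(query - reference)  # one comparison with the reference
    # Triangle inequality: d(query, c_i) >= |d(query, ref) - d(ref, c_i)|
    lower = np.abs(d_query_ref - ref_to_cand)
    best_i, best_d, n_full = -1, np.inf, 1
    for i in np.argsort(lower):      # most promising candidates first
        if lower[i] >= best_d:       # no remaining candidate can beat the current best
            break
        d = np.linalg.norm(query - candidates[i])
        n_full += 1
        if d < best_d:
            best_i, best_d = int(i), float(d)
    return best_i, best_d, n_full

rng = np.random.default_rng(0)
reference = rng.normal(size=8)                       # stands in for the field of B
candidates = reference + rng.normal(size=(2000, 8))  # stands in for the set {B}
ref_to_cand = np.linalg.norm(candidates - reference, axis=1)
query = reference + rng.normal(size=8)               # stands in for the field of A

best_i, best_d, n_full = pruned_nearest(query, candidates, reference, ref_to_cand)
brute = np.linalg.norm(candidates - query, axis=1)   # exhaustive check
print(best_i == int(np.argmin(brute)), n_full, len(candidates))
```

A single reference only prunes modestly; a distance tree applies the same bound recursively at every node, which is where the ~log(N) behaviour comes from.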

Example 2:

Minimal Metrics =

Metric of a Different Order

Given a non-projective Operation T

i.e. the minimum field difference between two molecules is also a metric.

An example of T would be a rotation and/or translation, and hence the __best__ overlay between two molecules defines an interesting minimal metric, in particular for the space it "induces".
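One way to write this down, as a sketch rather than the formula from the original slide: assume d is a field metric that is unchanged when the same operation T is applied to both fields, and that the allowed operations can be composed (as rigid rotations and translations can). The symbols d_min, T_1 and T_2 are introduced here for illustration.

```latex
% Minimal metric: best field difference over all allowed operations T
d_{\min}(A,B) \;=\; \min_{T}\, d(A,\, T B)

% Triangle inequality, with T_1 the best operation for (A,B) and T_2 for (B,C):
d_{\min}(A,C) \;\le\; d(A,\, T_1 T_2 C)
            \;\le\; d(A,\, T_1 B) + d(T_1 B,\, T_1 T_2 C)
            \;=\;   d(A,\, T_1 B) + d(B,\, T_2 C)
            \;=\;   d_{\min}(A,B) + d_{\min}(B,C)
```

The third step is where the invariance of d under T is used, which is why the operation has to preserve the field difference for the minimum to remain a metric.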

Shape Space

The N*(N-1)/2 optimal overlays of N molecular fields form an N*N distance matrix, **D**.

From a distance matrix can be derived what is known as the metric matrix, **G**. If **G** is diagonalized, the number of significantly non-zero eigenvalues determines the "effective" dimensionality of the geometric space that N points can inhabit to give rise to those N*(N-1)/2 distances, as well as the positions of those points.

The space formed from the diagonalization, and subsequent culling of insignificant dimensions, of the metric matrix derived from a set of molecular overlay distances we refer to as Shape Space.
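The construction described above can be sketched with the standard metric-matrix recipe (classical multidimensional scaling). Here the origin is placed at the centroid, and the distances come from a hidden 3-D point set rather than real molecular overlays; the function names and the eigenvalue tolerance are illustrative assumptions.

```python
import numpy as np

def metric_matrix(D):
    """Metric (Gram) matrix G from a distance matrix D, with the origin at the centroid."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
    return -0.5 * J @ (D ** 2) @ J

def shape_space(D, rel_tol=1e-8):
    """Coordinates of the N points, keeping only the significant eigenvalues of G."""
    G = metric_matrix(D)
    evals, evecs = np.linalg.eigh(G)      # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]       # largest first
    evals, evecs = evals[order], evecs[:, order]
    keep = evals > rel_tol * evals[0]     # cull insignificant dimensions
    return evecs[:, keep] * np.sqrt(evals[keep]), evals

rng = np.random.default_rng(1)
points = rng.normal(size=(20, 3))         # hidden 3-D configuration of 20 points
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)

coords, evals = shape_space(D)
D_rebuilt = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
print(coords.shape[1])                    # effective dimensionality recovered from D alone
```

With real overlay distances the small eigenvalues would not vanish exactly, so the cut-off for "insignificant" dimensions becomes a judgment call.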

Q: What is the dimensionality of the Shape Space of **10^{200}** small molecules?

A: Probably that of a MUCH smaller set

Q. How much smaller? **10^{180}**?

We plan to find out via the fields of the structures corresponding to:

- Random smiles strings
- Combinatorial libraries
- Reaction Pathways

My Guess? Dimensionality < 100. Why? Because, as we shall show later, the steric field can usually be approximated very well by 30-40 variables. We anticipate a smaller dimensionality for electrostatic fields.

Uses of a Shape Space Decomposition:

- Only D+1 "difficult" overlays are ever required (where D = dimensionality of the Shape Space); after that, a ~10^{5} speed-up in finding the best overlays.
- Fundamental (geometric) measures of diversity, fundamental shape descriptors
- Clever organization of databases of molecular shapes
- Multiple properties form product Shape Spaces
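Why D+1 overlays suffice can be illustrated with a small sketch, assuming the Shape Space behaves as an ordinary Euclidean space of dimension D: the distances from a new molecule to D+1 reference molecules with known coordinates fix its position through a linear system. The function name `locate` and the random reference points are hypothetical.

```python
import numpy as np

def locate(refs, dists):
    """Position x with |x - refs[i]| = dists[i], from D+1 references in D dimensions."""
    refs = np.asarray(refs, dtype=float)
    dists = np.asarray(dists, dtype=float)
    # Subtracting the equation for reference 0 from the others removes |x|^2:
    #   2 (r_i - r_0) . x = |r_i|^2 - |r_0|^2 - d_i^2 + d_0^2
    A = 2.0 * (refs[1:] - refs[0])
    b = (refs[1:] ** 2).sum(axis=1) - (refs[0] ** 2).sum() - dists[1:] ** 2 + dists[0] ** 2
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

rng = np.random.default_rng(2)
dim = 4
refs = rng.normal(size=(dim + 1, dim))   # D+1 reference points in a 4-D shape space
x_true = rng.normal(size=dim)            # the "new molecule"
dists = np.linalg.norm(refs - x_true, axis=1)  # its D+1 "difficult" overlay distances

x_found = locate(refs, dists)
print(np.allclose(x_found, x_true))
```

Once a molecule has coordinates, its distance to any other placed molecule is a cheap vector operation rather than a full overlay search.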

PLUS: Tames **10^{200}** molecules, sort of.

All of the above concerns global similarity, but what of:

i.e. A and B fit poorly on top of each other, and yet they both have an essential element of common shape.

Local Shape Comparisons:

Two Novel Metric Approaches:

- Ellipsoidal Domain Decomposition

- Surface Metrics
  - Surface Contour - 1D
  - Pointwise Characteristic Functions - 0D

Ellipsoidal Domain Decomposition

The Idea:

Use a smooth Gaussian function for the molecular field (typically steric) and seed N ellipsoidal Gaussians within the molecule. Then minimize the difference between the sum of these Gaussians and the molecular field with respect to the variables describing the Gaussians.

The Result:

(for 1-3 Ellipsoidal Gaussians for a small, 2 ringed molecule)

One Ellipsoid Fit

Two Ellipsoid Fit

Three Ellipsoid Fit

By analyzing the fit of each ellipsoid to the underlying atoms we can robustly determine that the 2 ellipsoid fit is the best representation for this molecule.

Most molecules are fit by 3-4 ellipsoids to within 10-15 Å^{3}, i.e. less than the volume of one methyl group. As each ellipsoid has 10 free parameters, the steric field can thus be well represented by 30-40 parameters, and hence our hope for a shape space of similar dimensionality.
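A minimal sketch of the parameter count and of the quantity being minimized. It assumes each ellipsoidal Gaussian is described by an amplitude (1), a centre (3), an orientation given as a rotation vector (3) and three axis widths (3), for 10 numbers in total. This parameterization, the function names and the toy two-atom target are assumptions; the optimizer itself is left out.

```python
import numpy as np

def rotation_matrix(rotvec):
    """Rotation matrix from a 3-component rotation vector (Rodrigues' formula)."""
    rotvec = np.asarray(rotvec, dtype=float)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-12:
        return np.eye(3)
    k = rotvec / angle
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def ellipsoid_field(params, grid):
    """Ellipsoidal Gaussian from 10 numbers: amplitude, centre (3), rotation (3), widths (3)."""
    params = np.asarray(params, dtype=float)
    amplitude, centre, rotvec, widths = params[0], params[1:4], params[4:7], params[7:10]
    local = (grid - centre) @ rotation_matrix(rotvec)  # coordinates in the ellipsoid frame
    return amplitude * np.exp(-0.5 * ((local / widths) ** 2).sum(axis=1))

def fit_error(param_sets, target, grid, cell_volume):
    """Integrated squared difference between the sum of ellipsoids and the target field."""
    model = sum(ellipsoid_field(p, grid) for p in param_sets)
    return float(((model - target) ** 2).sum() * cell_volume)

# Toy target: two spherical "atoms" on the x axis, sampled on a cubic grid
axis = np.linspace(-5.0, 5.0, 21)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
cell_volume = (axis[1] - axis[0]) ** 3
atoms = [[1.0, -1.0, 0, 0, 0, 0, 0, 1.0, 1.0, 1.0],
         [1.0, 1.0, 0, 0, 0, 0, 0, 1.0, 1.0, 1.0]]
target = sum(ellipsoid_field(a, grid) for a in atoms)

sphere_guess = [1.5, 0, 0, 0, 0, 0, 0, 1.0, 1.0, 1.0]     # one spherical Gaussian
elongated_guess = [1.5, 0, 0, 0, 0, 0, 0, 1.6, 1.0, 1.0]  # stretched along x
err_sphere = fit_error([sphere_guess], target, grid, cell_volume)
err_elongated = fit_error([elongated_guess], target, grid, cell_volume)
print(err_sphere, err_elongated, 4 * len(sphere_guess))    # last value: 4 ellipsoids -> 40 parameters
```

Any general-purpose minimizer could be run over the concatenated parameter vectors with `fit_error` as the objective; the error has units of volume when the field is dimensionless, which matches the Å^{3} figures quoted above.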

Uses of Ellipsoidal Decompositions:

- Look Really Pretty

- Automatic Fragmentation

- Provide Injections to Shape Space

- Clique Detection/ Overlay

- Local Property Descriptions:
  - 2D Ellipsoidal Surface Functions
  - Scalar/Vector Characterization

- Ellipsoidal Diversity Measure

- Many More! (Very Rich Description)

Final Conclusions:

- Chemical Space is Vast but Mappable
- Useful Guide is Local Domain Decomposition
- Two approaches to the (almost) __Ultimate Question to Life, the Universe and Everything__:

"What looks like molecule X?":

- Retrieval from huge databases

- De Novo Generation, e.g. via GA

Finally, Like Hitch Hiking, Waiting for OpenEye Tools requires Patience!

Current State of OpenEye Code:

- Structure Generation ~ α
- Solvent Ranking / Minimization ~ β
- Fast Overlay > α
- Docking ~ α
- Ellipsoidal Decomposition ~ β
- Surface Metricization < α

Until then:

KEEP

BANGING

THE

ROCKS

TOGETHER!

Thanks to:

Peter Jeffs @GlaxoWellcome and Tony Wilkinson @Zeneca for funding for some of these projects

Andy Grant @Zeneca for his work, ideas and enthusiasm

Roger Sayle for help and wise thoughts

Daylight for the chance to present

Dave-I-Think-Big-Therefore-I-Am-Weininger for the courage to try