The aim of WATER is to allow users to compute simple models which
may give hints about the activity or inactivity of compounds based on prior knowledge
(the training set). A model is a piece of logic which looks at a compound and
gives its judgement about the compound's activity. As with most prophets you are supposed
to believe, but that is not always the right thing to do. On the other hand, WATER
tries to explain how it arrived at its judgement.
These models may then be applied to compounds with unknown activity to predict
their activity.
WATER tries to find commonalities between the active compounds in the training
set which are not present in the inactive compounds, and vice versa.
This is very similar to what is usually done in project work. Let's look
at an example. Suppose you look at the last compounds you have
synthesized and sent to screening, and you recognize that all those which
have a trifluoromethyl group beta to a carbonyl group are active, while
those which do not are inactive. You will then suppose that this is an essential
feature for activity in your compound class. This is exactly what WATER tries to
do in a systematic fashion.
There are three parts to a model: the descriptors, the weighting scheme, and the model itself.
Within WATER, anything which can be said to be true or false about a molecule
can potentially be used as a descriptor, for example the presence of a given
substructure.
Given a list of descriptors, some kind of statistic must be used to decide
which ones are relevant to the problem under consideration. This is done by
counting how often each descriptor is present in the active and in the
inactive compounds of the training set.
Assume we have a training set with 10 active and 100 inactive molecules and
that all actives have an amino group in beta position to a carbonyl group.
Now assume that none of the inactive molecules has an amino group beta
to a carbonyl group. You might then come to the conclusion that this substructure
is very important for activity in your test system. Of course, this could also be
due to a bad choice of molecules for your training set (more about this in the
caveats section).
As always, life is not that easy, and our training set will mostly not show a
single feature which is able to distinguish between actives and inactives.
In most cases we will have a number of features which are present more often
in actives than in inactives, and a number of features where the relationship
is reversed. Most features, on the other hand, will not be related to activity
and will therefore be more or less equally distributed between the actives and
inactives.
The weighting scheme should associate a larger weight with fragments
which are able to distinguish between actives and inactives. There are two
weighting schemes used within WATER: an information
theoretical weighting function and
an additive weighting function.
Given the percentage of active compounds having a feature, pa, and the percentage of inactive compounds having the feature, pi, the information theoretical weight is computed as:

wi = pa/100 * log(pa/pi) + (100-pa)/100 * log((100-pa)/(100-pi))
According to information theory (cf. Numerical Recipes in C, chapter 14), the higher this value, the more information the measurement of this feature will yield.

Given the percentage of active compounds having a feature, pa, and the percentage of inactive compounds having the feature, pi, the additive weight is computed as:
wa = log(pa/pi)
This is simply the logarithm of the ratio of the fraction of active versus inactive compounds having this fragment. According to information theory, the logarithm is required to ensure the additivity of the weights; it does, however, not alter the ordering of the features.

Again, given our training set with 10 actives and 100 inactives, look at the following table, which illustrates the differences between the weighting schemes:
| Feature No. | Actives having this feature (absolute) | Actives (%, pa) | Inactives having this feature (absolute) | Inactives (%, pi) | pa/pi | wa = log(pa/pi) | wi |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 10 | 1 | 1 | 10 | 3.32 | 0.29 |
| 2 | 5 | 50 | 1 | 1 | 50 | 5.64 | 2.67 |
| 3 | 9 | 90 | 1 | 1 | 90 | 6.49 | 5.74 |
| 4 | 1 | 10 | 9 | 9 | 1.11 | 0.15 | 0.01 |
| 5 | 5 | 50 | 9 | 9 | 5.56 | 2.47 | 1.11 |
| 6 | 9 | 90 | 9 | 9 | 10 | 3.32 | 2.89 |
| 7 | 1 | 10 | 90 | 90 | 0.11 | -3.17 | 0.54 |
| 8 | 5 | 50 | 90 | 90 | 0.56 | -0.84 | -0.07 |
| 9 | 9 | 90 | 90 | 90 | 1 | 0 | 0 |
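The two weighting functions can be sketched in a few lines of Python. The logarithm base is not stated in the text; base 2 is an assumption here, chosen because it reproduces the tabulated wa values:

```python
from math import log2

def additive_weight(pa, pi):
    """Additive weight wa = log(pa/pi).

    pa, pi: percentage of actives / inactives having the feature.
    With a base-2 logarithm this reproduces the wa column of the table.
    """
    return log2(pa / pi)

def info_weight(pa, pi):
    """Information theoretical weight, as given by the formula
    wi = pa/100*log(pa/pi) + (100-pa)/100*log((100-pa)/(100-pi)).
    Assumes 0 < pa < 100 and 0 < pi < 100.
    """
    return (pa / 100) * log2(pa / pi) \
        + ((100 - pa) / 100) * log2((100 - pa) / (100 - pi))

# Features 1 and 6 of the table both have pa/pi = 10:
print(round(additive_weight(10, 1), 2))  # 3.32
print(round(additive_weight(90, 9), 2))  # 3.32
```

Note that a feature present in every active (pa = 100) or absent from every inactive (pi = 0) would make the formulas degenerate, so such cases need special handling in practice.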
The difference between the two weighting schemes is visible in the wa and wi columns. The relative numbers of compounds having features 1 and 6 are equal: 10% of the active and 1% of the inactive compounds have feature number 1, while feature number 6 is present in 90% of the actives and in 9% of the inactives. This means that both features are present 10 times more frequently in the actives than in the inactives (pa/pi), which yields an equally high value of 3.32 for both wa. The information theoretical wi, on the other hand, gives quite different values: 0.3 for feature 1 and 2.9 for feature 6. The reason for this is that feature 1 is present in far fewer compounds; therefore you expect to get less information from the measurement of feature 1 than from the measurement of feature 6, simply because you will encounter feature 6 more often than feature 1.
The additive model is very simple but nevertheless effective. The additive weights of all fragments present in a molecule are added together to give a score for this compound. Suppose a compound has only features 1 and 6 from the table given above; then the scoring value would be 3.32 + 3.32 = 6.64. This scoring value may then be compared to the scores of the molecules in the training set or in a validation set to estimate the probability that the compound is active.
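The additive scoring can be sketched as follows. The weights are the wa values for features 1 and 6 from the table; the dictionary representation is an illustrative assumption, not WATER's actual data structure:

```python
# wa values for features 1 and 6 from the table above.
feature_weights = {1: 3.32, 6: 3.32}

def additive_score(features_present, weights):
    """Sum the additive weights of all features found in the compound.

    Features without a weight (i.e. not in the model) contribute nothing.
    """
    return sum(weights[f] for f in features_present if f in weights)

# A compound having only features 1 and 6 scores 3.32 + 3.32 = 6.64.
print(additive_score({1, 6}, feature_weights))  # 6.64
```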
The Information Theoretical Model was developed based on work presented by a group from CombiChem Inc. at the 1999 ACS Spring Meeting. The Information Theoretical model is a little more complicated, since it has two parameters: a compound is rated active if it has Y bits out of the most important X bits. The features are ordered by their wi values in decreasing order. All possible combinations of X and Y are evaluated, and for each combination the number of actives rated active by the model as well as the number of inactives rated active by the model is computed. Clearly the best case would be to have all actives (100%) and none of the inactives (0%) rated active; unfortunately the model usually is not perfect. As a measure of the quality of the Information Theoretical model, the difference between the % actives and % inactives rated active is taken:
q = % actives rated active - % inactives rated active
Clearly, the larger q, the better your model will distinguish actives from inactives. There are also two other quality parameters which can be used to judge an Information Theoretical model. The selectivity is defined as:

s = % actives rated active / % inactives rated active

The selectivity however may not be used alone, because it is maximal whenever no inactives are rated active, independently of the number of actives. In any case, let me know about your thoughts, since this is still experimental.
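The X/Y search described above can be sketched as follows. Each compound is represented as a set of feature indices, with features assumed to be pre-sorted by decreasing wi (so index 0 is the most important); the training data is made-up illustrative data, not from WATER:

```python
def rated_active(compound, x, y):
    """True if the compound has at least y of the x most important features."""
    return len(compound & set(range(x))) >= y

def best_xy(actives, inactives, n_features):
    """Try all X/Y combinations and return (q, x, y) maximizing
    q = % actives rated active - % inactives rated active."""
    best = (0.0, 0, 0)
    for x in range(1, n_features + 1):
        for y in range(1, x + 1):
            pct_act = 100 * sum(rated_active(c, x, y) for c in actives) / len(actives)
            pct_inact = 100 * sum(rated_active(c, x, y) for c in inactives) / len(inactives)
            q = pct_act - pct_inact
            if q > best[0]:
                best = (q, x, y)
    return best

# Hypothetical training data: actives tend to carry the top-ranked features.
actives = [{0, 1}, {0, 2}, {1, 2}, {0, 1, 2}]
inactives = [{3}, {2, 3}, set(), {3, 4}]
q, x, y = best_xy(actives, inactives, n_features=5)
print(q, x, y)  # 100.0 2 1: "at least 1 of the top 2 bits" separates perfectly
```

With real data q rarely reaches 100; the search simply keeps the first X/Y pair achieving the largest separation.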
Page Owner: Alberto Gobbi - Last updated: Apr 17, 1999