Physiological ionization and pKa prediction

Roger Sayle
Bioinformatics Group,
Metaphorics LLC,
Santa Fe, New Mexico.

1. Abstract

This talk addresses the problem of determining the ionization state of small molecules under physiological conditions. Most chemical information systems, such as supplier databases or registry systems, represent compounds in their neutral form. This representation has many advantages for data processing, such as equality testing, but creates problems for computational chemists and modellers who are interested in a molecule's activity and properties at physiological pHs.

2. Introduction

Virtually all drug-like molecules are weak acids or bases. This means that they contain at least one site that can reversibly disassociate or associate a proton (a hydrogen ion) to form a negatively charged anion or a positively charged cation. Molecules that disassociate protons are acids, and those that associate protons are bases. The reversibility, means that a sample is always in an equilibrium with some fraction protonated and the rest deprotonated.

HA <=> H+ + A- or HB+ <=> H+ + B

By varying the availability of protons, i.e. the acidity of the media, the balance of the equilibrium can be shifted. This provides a measure of the ease of proton disassociation of a site in a compound, the disassociation (or ionization) constant pKa, defined by the equation:

pKa = pH + log(protonated/unprotonated)

Alternatively, the pKa of a site can be thought of the pH at which the protonated and deprotonated fractions are equal. If the pH is higher than the pKa, the site is mostly deprotonated, and if the pH is lower than the pKa, the site is mostly protonated.

To take an example, consider the drug aspirin (acetyl salicylic acid). The structure of aspirin as given in the Merck Index, WDI or Medchem is shown on the left, CC(=O)Oc1ccccc1C(=O)O. However, the pKa of the carboxylic acid group in aspirin is 3.49, which means that under nearly all physiological conditions, this group is almost entirely deprotonated. Hence, it is the physical properties of the molecule on the right, CC(=O)Oc1ccccc1C(=O)[O-], that is of most relevant to computational chemists.

3. Ionization state vs. pKa prediction

Although pKa prediction itself is an interesting challenge, the closely related problem of physiological ionization state determination is of greater immediate value to computational chemists. This is the task of determining the protonation state of a molecule in one of a small number of tissue fluids. The following table lists the pH of various human tissue fluids of pharmacological interest. Typically, however, interest is restricted to a narrow range between about 7.2 and 7.4.

Tissue Fluid	pH
Aqueous humor	7.21
Blood (arterial)	7.40
Blood (venous)	7.39
Blood (maternal umbilical)	7.25
Cerebrospinal fluid	7.35
Duodenum	5.5
Intenstine (microsurface)	5.3
Lacrimal fluid (tears)	7.4
Milk (breast)	7.0
Muscle (skeletal)	6.0
Nasal secretions	6.0
Saliva	6.4
Stomach	1.0-3.5
Sweat	5.4
Urine (female)	5.8
Urine (male)	5.7

The fact that only the physiological ionization state is required greatly simplifies and increases the applicability of pKa prediction. There is little to be gained from knowing the pKa of an ionizable group if it is significantly greater than or less than pH 7.3. Similarly, a pKa that is close to 7.3, implies that a significant fraction of the molecule may be either protonated or deprotonated, and analysis should be applied to both forms.

Consider a docking calculation of the following molecule NSC577 from NCI95. It is unlikely that there are any accurate experimental values or Hammett equations of the effects of the aromatic ring systems on the ionizable iodine, but it is clear that the site is deprotonated (by about 16 log units!).

The currently implemented algorithm to determine physiological ionization is built upon a pKa predictor. The algorithm can start from either the input or neutral form of the molecule. The method then iteratively either protonates the most basic site or deprotonates the most acidic site until there are no remaining "acid" or "base" sites for the specified pH, or until a loop is encoutered. The pKa's of all sites must be recalculated on each iteration, so that pKa reflects the ionization state of the rest of the molecule. One pleasant feature of this approach is that it can determine whether a neutral molecule is predominantly zwitterionic.

4. pKa Prediction by Atom Typing

A simple but surprisingly effective method of pKa prediction is to assign pKa values based purely on local atom types. Typically, the two most influential factors governing pKa, is the atomic species undergoing protonation or deprotonation and the very local steric and electronic affects.

The table below contains a number of atom types and associated values used in an atom type based predictor. The pKa is given for the both the conjugate acid and base. For convenience, the SMARTS-like constraint [!H0] is ommitted from all entries in the "Acid" column. The rows are grouped by element, and the first matching type is used. In the table below, no more than two shells of an atoms neighbour need be inspected to determine an approximate pKa.

Conjugate Acid	Value	Conjugate Base	Model Compound
F	3.18	[F-]	HF
Cl	-6.50	[Cl-]	HCl
Br	-8.50	[Br-]	HBr
I	-9.00	[I-]	HI
c	43.00	[c-]	c1ccccc1
C#N	9.30	[C-]#N	C#N
C	50.00	[C-]	CH4
[nH]	16.50	[n-]	[nH]1cccc1
N=N=*	-99.99	[N-]=N=*	???
NC=O	22.00	[N-]C=O
NS(=O)=O	10.10	[N-]S(=O)=O	???
N	32.50	[N-]	NH3
???	-3.80	[nH]	[nH]1cccc1
[nH+]	5.23	n	n1ccccc1
[NH+]#*	-12.00	N#*	RC#N
[NH+]=C(N)N	14.4	N=C(N)N	NC(=N)N
[NH+]=C(N)a	11.6	N=C(N)a	N=C(N)c1ccccc1
[NH+]=CN	12.4	N=CN	CC(=N)N
[NH+]=*	-99.99	N=*	???
[NH+]a	4.69	Na	Nc1ccccc1
[NH+]C=O	4.74	NC=O	???
[NH+]C=N	-99.99	NC=N	???
[NH+]S(=O)=O	-99.99	NS(=O)=O	???
[NH+]	~10.5	N	CN(C)C
[NH2+]	~11.1	CNC
[NH3+]	10.6	N	CN
[NH4+]	9.25	N	NH4+
[OH2]	15.70	[OH-]	H2O
Oc	10.00	[O-]a	Oc1ccccc1
OC(=O)[O-]	10.33	[O-]C(=O)[O-]	CO(OH)2
OC(=O)a	4.20	[O-]C(=O)a	OC(=O)c1ccccc1
OC=O	4.80	[O-]C=O	OC(=O)C
OC	15.50	[O-]C	CH3OH
ON(=O)=O	-1.40	[O-]N(=O)=O	HNO3
O[N+]=O	-12.00	[O-][N+]=O	CN(=O)=O
ONC=O	9.40	[O-]NC=O	ONC(=O)C
ON=*	12.34	[O-]N=*	ON=CC
ON()	5.2	[O-]N()	ON(C)C
ON	5.96	[O-]N	ONC
OP([O-])([O-])=O	12.50	[O-]P([O-])([O-])=O	H3PO4
OP([O-])=O	6.70	[O-]P([O-])=O	H3PO4
OP=O	2.00	[O-]P=O	H3PO4
OP[O-]	99.99	[O-]P[O-]	???
OPa	2.10	[O-]Pa	OP(O)c1ccccc1
OP	3.08	[O-]P	CP(O)O
OS(=O)(=O)[O-]	2.0	[O-]S(=O)(=O)[O-]	H2SO4
OS(=O)(=O)	-99.99	[O-]S(=O)(=O)	H2SO4
OS(=O)[O-]	7.20	[O-]S(=O)[O-]	H2SO3
OS(=O)	1.80	[O-]S(=O)	H2SO3
O	99.99	[O-]	???
[OH+]	-1.74	O	H3O+
P	29.00	[P-]	PH3
[P+]	-13.00	P	PH4+
S*=O	3.52	[S-]*=O	CH3C(=O)S
Sa	6.52	[S-]a	Sc1ccccc1
[SH2]	7.00	[SH-]	SH2
S	12.00	[S-]	CH3SH
[SH+]	-7.00	S	CH3SH2+

The very generic nature of the atom types in the above table allow this approach to be applied rapidly to large datasets of diverse compounds.

5. pKa prediction by Hammett and Taft equations

A more accurate prediction of pKa, but for a small class of compounds, may be made using Hammett equations. The atom typing above reflects the major factors influencing a site's disassociation constant, the atomic species of the site and the very local steric and electronic effects. However no account is made of the longer range electronic (inductive, mesomeric and electrostatic field effects). In 1940, L.P. Hammett demonstrated that the effects on pKa of meta- and para- substitued aromatic compounds (benzoic acids) were linear and additive.

This leads to the Hammett Equation for pKa:

pKa = pKa⁰ - Rho * Sum(Sigma)

Where Sigma is a constant assigned to a particular substituent, Rho is a constant assigned to the particular acid or base functional group and pKa⁰ is the pKa of the unsubstituted acid or base. For benzoic acids, pKa⁰ is 4.20 and Rho is defined to be 1.0.

In the original formulation, two constants need to assigned to each substituent, Sigma_meta for meta-substitutions and Sigma_para for para-substitutions. This was soon extended to aliphatic systems by Taft, by introducing Sigma_star (also written Sigma^*). Currently over 40 forms of sigma constant have been defined, but many of these corrolate extremely well with each other.

As a worked example, consider the pKa prediction of the compound shown below, 4-chloro-3,5-dimethylphenol.

The Hammett equation for phenol has pKa⁰ = 9.92 and Rho = 2.23. The Sigma_meta for -CH3 is -0.06 and the Sigma_para is 0.24. Hence the predicted value of the pKa is 9.92 - 2.23*(0.24-0.06-0.06) = 9.70. This compares extremely well with the experimental value of 9.71.

Currently, a database of over 170 Hammett/Taft equations has been encoded as SMILES/SMARTS (including chirality) from the compilation given in Perrin and Serjeant's "silver book". Similarly, a database of sigma constants for several thousand functional groups has been compiled by Corwin Hansch and Al Leo, a early TDT version of which has been placed in Daylight "contrib" by John Bradshaw.

A major benefit of Hammet/Taft equations is their ability to handle special cases.

Tetronic acids (pKa ~3.39):
The duck-billed platypus of organic chemistry?

6. Estimation of Sigma Constants

Unfortunately, the Achilles heel of Hammett and Taft based pKa prediction is the dependence upon large databases, both for the substituent constants and for the acid/base functional group under consideration. The functional groups can be supplemented by the atom type based approach described above. Missing sigma constants, however, are a more serious problem. In a recent analysis by Peter Ertl, showed that only 63 of the 100 common substituents (taken from the logpstar database) had measured sigma constants.

One common approach, is to extend the set of known substituents with sigma transmission equations. For example, Sigma_star of -CH2-R can be estimated as 0.41 * the Sigma_star for R. Similarly, -CH=CH- has transmission coefficient 0.51 and -C6H4- has coefficient 0.30. Similar schemes include the Exner-Fiedler method for aliphatic cycles, and the Dewar-Grisdale method for polyaromatic systems. However, this approach cannot be used when a terminal group in unparameterized.

A second approach is to use molecular orbital methods to estimate sigma values when there is no experimental data. Using the strong corrolation between sigma_meta and sigma_para and charges calculated with MOPAC's AM1 hamiltonian, Peter Ertl of Novartis has been able to calculate sigma constants for over 80,000 organic functional groups. Fortunately, as described in my MUG97 talk, "Efficient Protein and Nucleic Acid Perception", there exist efficient algorithms for pattern matching large numbers of functional groups in constant time.

7. pKa Prediction of Bound Ligands

The intrinsic pKa of sites in a molecule as calculated above relate to their bulk properties in solvent. However, much of the interest in the ionization state of a small molecule is to predict the bound state, where the local environment of the protein can significantly alter the protonation state. Fortunately, such affects have been well studied for amino acid sidechains in proteins. The interested reader is refered to the papers by Antosiewicz et al. and Gilson in the bibliography below.

If the neutral state of the protein is taken as the reference state, its electrostatic energy in a given ionization state is:

where x_i is 1 when the group i is ionized, and 0 when it is neutral; is +1 for bases, and 1 for acids; pKa_intrinsic,i is the intrinsic pKa of group i; is the absolute value of the interaction energy of groups i and j; M is the number of ionizable groups in the protein; R is the gas constant; and T is absolute temperature.

The intrinsic pKa of group i is given by:

where is the difference between the free energy change of ionization for group i in the otherwise unionized protein and in a model compound of pKa_model,i. The intrinsic pKa thus represents the pKa of the group in the protein with all other titratable amino acids in the neutral state. Given values for and for all groups, the above equation provides the basis for the computation of pH-dependent properties of protein-ligand complexes.

Traditionally, this technique has only been used for determining the protonation states of amino acid sidechains in proetins, where the values of pKa_model,i are simply looked up in a small table. However, small molecule pKa prediction methods such as those described above could easily be incorporated, not only allowing the protein the affect the ionization state of the ligand but potentially the ligand affecting the ionization state of the protein.

8. Results

The table below lists the experimental and predicted pKa of a commonly used benchmark of 22 drugs. The predictions were made using the atom type method of section 4 above.

SMILES	Name	Exp. pKa	Calc. pKa
	Atropine	9.90	10.3
	Chloramphenicol	11.03	15.50
	Chlorothiazide	9.50	10.10
	Chlorpromazine	9.30	10.10
	Cimetidine	6.80	5.23
	Diazepam	3.30	4.69
	Diltiazem	8.91	10.10
	Diphenhydramine	9.00	10.10
	Disopyramide	10.40	10.50
	Flufenamic Acid	3.90	4.20
	Furosemide	3.90	4.20
	Haloperidol	8.30	10.50
	Imipramine	9.50	10.10
	Lidocaine	7.94	10.50
	Phenobarbital	7.44	9.50
	Phenytoin	8.30	9.50
	Procainamide	9.40	9.25
	Propafenone	9.30	11.10
	Propranolol	9.50	11.10
	Tetracaine	8.49	10.10
	Trimethoprim	7.20	5.23
	Verapamil	9.04	10.30

The results show an impressive agreement using a very simple atom type based method. The R² value is 0.800, and a standard deviation of 0.95. Far more accurate figures on this benchmark can be achieved using Hammett & Taft equations, where ACD Labs' ACD/pKa obtains an R² value is 0.978, and a standard deviation of 0.39.

The following table shows the input and output structures of the physiological state algorithm above on the ten ligands used in Metaphorics' DockIt crunch.

PDB Code	Before	After
1HYT (-2)
1RBP (0)
1HPV (0)
1SWD (-1)
1ADD (0)
2CBR (-1)
3CLA (0)
1RDS (-1)
4HVP (+2)
1ELA (+1)

9. Bibliography

Jan Antosiewicz, James M. Briggs, Adrian H. Elcock, Michael K. Gilson and J. Andrew McCammon, "Computing Ionization States of Proteins with a Detailed Charge Model", Journal of Computational Chemistry, Vol. 17, No. 14, pp. 1633-1644, 1996.
Peter Ertl, "Simple Quantum Chemical Parameters as an Alternative to the Hammett Sigma Constants in QSAR Studies", Quantitative Structure-Activity Relationships, Vol. 16, pp. 377-382, 1997.
Michael K. Gilson, "Multiple-Site Titration and Molecular Modelling: Two Rapid Methods for Computing Energies and Forces for Ionizable Groups in Proteins", Proteins: Structure, Function and Genetics, Vol. 15, pp. 266-282, 1993.
Corwin Hansch and Albert Leo, "Exploring QSAR Vol. 1: Fundamentals and Applications in Chemistry and Biology" and "Exploring QSAR Vol. 2: Hydrophobic, Electronic and Steric Constants", ACS Professional Reference Book, American Chemical Society Publishers, 1995.
Bertram G. Katzung, "Basic and Clinical Pharmacology", Chapter 1, 7th Edition, Appleton and Lange Publishers, 1998.
D.D. Perrin, Boyd Dempsey and E.P. Serjeant, "pKa Prediction for Organic Acids and Bases", Chapman and Hall Publishers, 1981.

roger@metaphorics.com