The main steps of the algorithm are:
2a. Each PDB file is considered internally delimited by both "END" and "ENDM" records. This allows processing of multiple models in NMR files, and of concatenated PDB files. A command line option controls whether just the first or all of the models/structures within a file are processed.
2b. All atom records containing a character other than " ", "A" or
"1" in the alternate location column (column 17) were ignored.
The convention on alternate locations is that the column be blank if the
atom is unambiguously located, or contain the sequence "A", "B", "C" etc..
for each potential conformation for ambiguous atoms. Some PDB files use
digits instead of letters, for example pdb code
2c. All atom records containing " Q" as the first two characters of the atom name were ignored. The element " Q" is commonly used in NMR processing to represent pseudo atoms used in refinement.
2d. The residue name "DUM" is officially used to denote dummy
atoms, such as in pdbcodes
represent unexplained electron density. All atom records with residue
name "DUM" are ignored.
2e. All atoms with X, Y and Z co-ordinates equal 9999.000 are also ignored. These dummy atoms are created by XPLOR when the actual atom is not used in the structure refinement.
Occasionally even the above rules are insufficient. For example,
1a34 contains the same atom, O3* of U1D, twice.
First as atom serial number 2960 and again as 2973, both with identical
1e8b) or incorrect atomic symbols (Nd in
1d00, U in
3bcc). Combined with the slow uptake of this field and the requirement to process pre-version 2.0 files, its is better to just process the atom name, which together with the residue name typically provide enough contextual information to determine the atomic number.
3a. The atom name " UNK" is interpreted as an atom of atomic number, and is assigned element number 0, represented "*" in SMILES.
3b. If the first character of the atom name is blank, and the third character is lower case, the PDB file was generated by CONCORD and the atomic symbol is taken from the 2nd and 3rd characters. As PDB files are normally only uppercase, CONCORD's incorrect column alignment is easy to recognize.
3c. If the first character is blank, and residue name denotes
a hetero group that must have a prefix, and the third character is
"H", "C", "N", "O", "P" or "S", the appropriate atomic number is used.
If the third character is not one of these, the 2nd character is used
as the atomic symbol. The current hetero groups that require a prefix
are "GPC", "NAD" and "NDP" (which fixes
3d. If the first character is blank, the second character is assumed to be the atomic symbol. If this character isn't recognized as an atomic symbol, and the third character contains "H", "C", "N", "O", "P" or "S", then the appropriate atomic number is used.
3e. If the first character is a digit, the second character is assumed to be the atomic symbol.
3f. If the first character is "H", and the residue is a
recognised amino acid, nucleic acid or special hetero group, then
the atom is assumed to be hydrogen. For remaining groups, if the
first two characters are a recognized atomic symbol, this is the
element, or hydrogen if the first two characters are not a recognized
atomic symbol. This fixes the Holmiums "Ho" and Mercuries "Hg" in
numerous files (including
The recognized amino acid residues are "ACE", "ALA", "ARG", "ASN", "ASP", "ASX", "CYS", "FOR", "GLN", "GLU", "GLX", "GLY", "HIS", "HYP", "ILE", "LEU", "LYS", "MET", "PHE", "PRO", "PCA", "SER", "THR", "TRP", "TYR", "UNK" and "VAL".
The recognized nucleic acid residues are " A", " C", " G", " T", " U", " +U", " YG", "1MA", "1MG", "2MG", "5MC", "5MU", "7MG", "H2U", "M2G", "OMC", "OMG" and "PSU".
The recognized special hetero groups are "101", "12A", "1AR", "1GL",
"2AS", "2GL", "3AA", "3AT", "3DR", "3PO", "6HA", "6HC", "6HG", "6HT",
"A26", "AA6", "ABD", "AC1", "ACO", "AIR", "AMU", "AMX", "AP5", "AMG",
"APU", "B9A", "BCA", "BNA", "CAA", "CBS", "CGS", "CMC", "CND", "CO8",
"COA", "COF", "COS", "DCA", "DGD", "FAB", "FAD", "FAG", "FAM", "FDA",
"GPC", "IB2", "NAD", "NAH", "NAI", "NAL", "NAP", "NBD", "NDP", "PAD",
"SAD", "SAE", "T5A", "tRE", "UP5" and "ZID". Although it shouldn't
be required including "BU1" fixes
3g. If the first character is one of the symbols """, "'" or *" then the second character is treated as the atomic symbol.
3h. If the residue name indicates a hetero group that uses
an suffix, the the first character of the atom name is treated as
the atomic symbol. The current hetero groups that require a suffix
are "AGF", "COT" amd "FVF" (which fixes
3i. If the residue name is one of the special hetero groups listed in 3f, the second character of the atom name is treated as the atomic symbol.
3j. The default, all remaining cases, treats the first two characters of the atom name as the atomic symbol. If the first two characters aren't a valid element, then the second character is treated as the atomic symbol.
3k. An exception to the above rule 3j is that "Nd"
(atomic number 60) doesn't occur in any of the amino acids listed
in section 3f, and occurances are corrected to nitrogen.
This fixes pdbcode
3l. Finally, exceptions to all of the above rules are
handled explicitly. Currently, the only exception is the atom
name "NSE1" which occurs in residues "SAD" and "SAE". This atom
name actually represents a selenium atom, encoded in the 2nd and
3rd columns! (which handles
4a. Metal atoms are prevented from forming covalent bonds. It appears that software used for protein x-ray crystallography (such as XPLOR and CNS) are poorly parameterized for inorganic compounds when compared with small molecule crystallography software (such as SHEL-X). The false positive rate when using proximity based connectivity perception is much higher in PDB than in the Cambridge Structural Database (CSD).
Unfortunately, this constraint means that algorithm fails for some of
the organo-metalics in PDB, including the ferrocene in
and the uridine vanadate in
There is a remarkable diversity of metalic elements in the PDB:
|Element||Symbol||Number||Example PDB Code|
4b. Two atoms are considered bonded if they are closer than the sum of their covalent radii plus a small tolerance. The values of covalent radii are those published by the Cambridge Crystallographic Data Center and are given in the table below. Although the CCDC recommend a tolerance of 0.4 Angstroms, the current algorithm uses the larger value 0.45 Angstroms used by Baber and Hodgkin, and by Hendlich et al.. This value is lower than the 0.56 Angstrom tolerance used by the molecular graphics program RasMol. A lower bound of 0.4A for bond lengths is also imposed.
4c. Bonds can only be formed between atoms with the same chain
identifier and that are not separated by a TER (chain terminator)
record in the PDB file. Unfortunately the TER record in PDB file
1atl is incorrectly placed, cleaving a methyl group.
4d. All solvent atoms are prevented from forming covalent bonds other than to the same residue. Close contacts with waters account for a frequent source of false positive bonds. Recognizing the residue names for water ("HOH", "H20", "WAT", "TIP", "SOL", "DOD" and "D20") and for other solvents ("EOH", "MOH", "PER", "PO4", "SO4" and "SUL"), and explicitly excluding these atoms from inter-residue bonding improves the resuting topologies.
4e. Any hydrogens that are two or more connected by the above procedure are then fixed. All its bonds, except the one to the closest non-hydrogen atom in the same residue, are deleted.
4f. Occasionally, atoms with identical co-ordinates are present in a PDB file. If two atoms in different PDB groups have the same co-ordinates and are bonded to a common third atom, all bonds between the these two groups are broken. If two atoms with the same co-ordinate remain bonded to a common third atom, the second atom is deleted.
4g. If any atom is bonded to more than a maximum number of
neighbors for that element, then bonds between the PDB group containing
that atom and groups connected via that atom are broken. This handles
multiple occupancy problems such as
1abe. The maximum
neighbor count for each element is four unless tabulated below.
1elafor example) and not always honored (pdbcodes
5a. All connected components containing more than 100 heavy atoms are considered to be protein (or nucleic acid). If the entire component originated from ATOM records, it is ignored. Otherwise, all ATOM atoms are deleted, except for those covalently bonded directly to HETATM atoms. The atoms are converted to an asterisk to indicate an attachment point, and bonds between pairs of asterisks are deleted. This should have the effect of eliminating all large proteins and nucleic acids, after cleaving all covalently bound ligands and post-translational modifications.
5b. The next step is a fragment size filter. All remaining connected components (after step 5a) are ignored if they contain more than 100 or less than six heavy atoms (typically metals, waters, sulphates and phosphates).
5c. The final optional step is to remove all protein and nucleic
acid fragments. These are all connected components (after 5b)
that contain no HETATM records. If the PDB file contained a connected
component of more than 100 atoms (a protein or nucleic acid), an exception
is made for components whose chain identifiers occured in no other
components. This correctly identifies peptide ligands (such as chain
4hvp, or chain "C" in
Unfortunately, the lysine in
1lst is still not correctly
perceived as a ligand.
6a. Initial assignment of hybridization state is on the basis of average bond angle. Two connected atoms with a bond angle of greater than 155 degrees are typed as sp-hybridized. Remaining atoms with an average bond angle greater than 115 degrees are typed as sp2-hybridized and those with an average bond angle less than or equal to 115 degrees as sp3-hybridized.
6b. Average bond angles are unable to descriminate the correct hybridization state of two connected atoms in five membered rings. For example, in an aromatic ring such as pyrole or furan, the sp2 carbons have bond angles near 108 degrees! A second pass sets the hybridization of all two connected atoms in a five membered ring, as sp2-hybridized when the average in-ring torsion is less than 7.5 degrees. A similar planarity test is used for six membered rings, where a 12 degree threshold is used.
6c. A final 'antialiasing' pass is used to detect and correct misassigned hybridization states. If an sp-hybridized atom doesn't have a sp-hybridized or terminal neighbor with unfilled valence, it is reassigned as sp2-hybridized. Similarly, if an sp2-hybridized atom doesn't have a sp2-hybridized or terminal neighbor with unfilled valence, it is reassigned as sp3-hybridized.
Because the recognition is applied after hybridization state assigment,
the a patterns can make use of geometry at each pattern position.
Typically, however, hybridization information is only used for ring
atoms. This correctly handles distorted carboxylic acid groups (such
1tdr) but also correctly handles difficult systems
When multiple patterns are applicable heuristics based upon atomic electronegativity are used to place the double bond. A double bond to a terminal oxygen is chosen over a double bond to a terminal sulphur. For guanadine groups, if the central carbon is in a ring a ring nitrogen is chosen over a non-ring nitrogen, otherwise a terminal nitrogen is chosen over a non-terminal nitrogen.
Under each of the atom types above are listed the number of electrons each contributes to the ring for the Huckel 4n+2 aromaticity calculation. Carbons typically contribute one, except when doubly bonded to a terminal oxygen atom. Oxygen, Sulphur and Selenium contribute two. Pyridine and N-oxide nitrogens contribute one, and pyrole nitrogens contribute two.
At this point, there are two ambiguous cases. The first is that a carbon of unfilled valence can double bond to the oxygen or the ring, resulting in zero or one electrons respectively. The second is that a two connected nitrogen of unfilled valence could be double bonded to the ring to become pyridine-like, or gain an implicit hydrogen to become pyrole-like, contributing one and two electrons respectively.
The heuristic used by the algorithm is that a sp2-hybridized ring should be made aromatic if possible. This is done in the following steps.
8a. First, a test is applied to the ambiguous cases (*-[C](O)-* and *-[N]-*) to see whether they can be resolved by their neighbours. If both neighbours have full valences or incident multiple bonds, both ring bonds must be single and their unique types are reassigned (*-C(=O)-* and *-[NH]-* respectively).
8b. If the count of electrons in the ring modulo four is one, and the ring contains an ambiguous nitrogen, it is retyped pyrole-like (with an implicit hydrogen).
8c. If the count of electrons in the ring modulo four is three, and the ring contains an ambiguous carbon, it is retyped with an exo double bond to the oxygen.
8d. If the count of electrons in the ring modulo four is three, and the ring contains a uncharged pyrole-like nitrogen, the nitrogen is charged.
8e. If, after all the above reassignments, the count of electrons in the ring modulo four is two, the Huckel condition, all the atoms and bonds are marked as potentially aromatic.
After the above processing has been applied to all rings in a molecule, the molecule is passed to a kekule form assignment routine that attempt to provide a kekule form of alternating single and double bonds. The actual kekule form assignment algorithm is complex and described in detail elsewhere (parts of the algorithm were presented in the tautomerism talk at EuroMug99).
9a. The first stage of final bond order assignment is to mark all ring bonds between sp2-hybridized atoms as aromatic, and call the kekule assignment algorithm a second time. This should assign alternating single and double bonds to conjugated ring systems including aromatic rings with more than seven atoms.
9b. The next step is to process any unsatisfied sp2-hybridized atom by checking each neighboring atom for a terminal oxygen, and when found testing the bond length against the table of bond lengths below. This pass preferencially selects the keto over the enol forms of conjugated molecules.
9c. The very last step is to check each bond using distance criteria. Only bonds between atoms with unfilled valences and without an incident multiple bond are considered. Bonds between sp-hybrized atoms or between sp-hybridized and terminal atoms are tested against triple bond lengths, and bonds between sp2-hybridized atoms or between sp2-hybridized atoms and terminal atoms are tested against double bond lengths. Bonds between a pair of terminal atoms are tested against both double and triple bond lengths.
9d. All remaining unfilled valences are filled with implicit hydrogen atoms. The atomic valences are the standard values used by the Daylight toolkit.
The most common ligands are heme and cytochrome (456+242 occurences), followed by N-acetyl-D-glucosamine "NAG" (341 occurences), followed by glycerol "GOL" (222 occurences). 2426 ligands occur only once, with the remaining 1075 occuring multiple times (135 >10, 39 >25, 23 >50 and 12 >100).
On test set 1 of protein ligands from the review by Ricketts et al.. The algorithm gets all 18 of the ligands correct. Back in 1996 the author did a review of existing techniques on a subset of 17 of these files. The results are shown in the table below. The column labelled "All Single" shows the structures correct when all bonds are left single. The "Covalent" method (inspired by Pauling's nature of the chemical bond) assigns a triple bond to lengths less than 81% of the sum of the attached covalent radii, a double bond between 81% and 87%, and a single bond above 87%. The "SMARTS" method used simple pattern matching of terminal groups (similar to the functional group recognition in section 7). The "Cambridge" and "Oxbridge" algorithms are based on the algorithm of Baber and Hodgkin. The "Cambridge" column are the results using just bond lengths for CSD, and "Oxbridge" is the full algorithm using bond lengths and bond angles. The "IDATM" column represents the algorithm of Meng and Lewis as encoded in Babel v1.4. The "COBRA" column is Andrew Leach's perception code as used by Oxford Molecular's COBRA package. The "BONDAGE" column contains the results of algorithm by Blaney, Dixon and Swanson in the DGEOM95 package.
|No.||PDB Code||All Single||Covalent||SMARTS||Cambridge||Oxbridge||IDATM||COBRA||Bondage||This work|
Note these results are slightly biased as results were used in the design and development of the algorithm presented above. Also note that the table is actually presented in approximate order of the quality of results, and although Baber and Hodgkin's Oxbridge algorithm looks impressive, for the structures is got, it performed far worse than the methods to its right.
On the DockIt test set of 10 PDB files, the program gets all 10 ligands
correct. The original FORTRAN bondage algorithm gets 8/10, and CCDC's
ReliBase, which is hand curated, also gets 8/10. ReliBase doesn't consider
the peptide inhibitor of HIV protease in
4hvp to be a ligand,
and gets the element typing in
Hendlich reports in his paper, that the BALI program gets the biotin
1bib wrong due to an abnormal bond length of 1.133A
which is typed as a triple bond. The described algorithm get 1bib
correct as a triple bond requires supporting sp-hybridization evidence.
Finally the 133 pdb codes below are taken from the extended GOLD benchmark suite of structures. The "This" column represents the algorithm described here, the "Gold" column represents the "ligand.mol2" structures distributed with gold, and the "Reli" column represents the structures in Relibase v4.0.
obsolete and have been replaced in the PDB by corrected versions.
Relibase consistently represents nitro groups incorrectly. The lysine
1lst is not perceived as a ligand in this work or
Relibase. The vanadium atom in
6rsa causes problems for
the described algorithm and GOLD (where it is replaced by a phosphorus).
1bbp is a particularly tricky chromophore. There's a
misplaced "TER" record in