In the previous post I briefly touched the question of protonation in docking and promised to come back with more details at a later time. Well, it turns out I could not go to sleep tonight until I get this out of me
So here it comes…
Several of the protein residues and many of the ligand functional groups may exist in different protonation states. In a docking study, the protonation states of the protein residues in the active site and of the functional groups of the ligand are very important. Most of the PDB structures obtained from X-ray crystallography do not contain the protons corresponding to the Hydrogen atoms because the resolution of the X-ray data is not fine enough to “see” the protons in the electron density map. On the other hand, most docking software requires the input files to be protonated. Therefore, some software technique (e.g. modeling packages) have to be used to add the protons. The ligand structures are often taken from a database, that may store the structures in a very compact form, e.g. SMILES string, and then protons are added again by some software prior to docking. Many simple software tools will generate the neutral form of the small molecules if possible, e.g. see the picture of
Aspartic acid on ChemSpider or Wikipedia, (although ChemSpider also has the de-protonated charged form — aspartate as a separate entry). Unfortunately, the neutral form is not the best choice for docking as this residue is more likely to be in the negatively charged de-protonated form in most proteins under in vivo conditions. But, please do notice the wording I used : “more likely” and “most”. Of course, there are lots of exceptions. Due to resonance, the role of the two oxygen atoms of the carboxylic acid are also symmetrical, so the H can be on either of them and in 3D on 2 possible positions on each oxygen.
OK, you say, enough of preaching to the choir, of course you are always using protonation states appropriate for physiologycal pH, so you are fine. Unfortunately, it is not that simple. There are many examples, e.g. metalloenzymes that will require the use of different protonation states, see the paper cited in the previous post. Thanks to the ZINC team, now you have a chance to download ligands in a protonation state specific to certain pH values. Some modeling software will also allow you to protonate your protein according to a given pH value. Hurray, you say, now we are good to go we just have to decide the appropriate pH value for the given study and we are all set. No, I am afraid, it is more complicated than that. Let’s look at an example 3D
snapshot from the center of the active site of HIV-1 protease. You see two carboxylates two Asp residues nicely facing each other at very close contact distance between the oxygens. The only way this is possible if there is a proton between them making a nice Hydrogen-bond, otherwise the two lone pairs of the oxygens (with strong partial negative charges on both) would be VERY uncomfortable to say the least. So, one of them is protonated, the other isn’t. Uh, oh! This isn’t good for the pH rule. Whatever value you chose, it fails, because we have the same residue behaving differently right next to each other.
There are even nastier cases. Let’s look at the Serine protease family: trypsin, thrombin, factor Xa. The key Ser-195 residue in the active site attacks the carbon of a peptide, cleaving the amide bond, because its alcohol group has “lost” the proton — now an alcohol group is not such a horribly strong acid to lose its proton so easily, so there must be some really basic conditions in that active site — one might assume. Well, not really, because the natural substrate is the Arginine side chain which must be in a protonated form to be attracted to the negatively charged Asp-189 at the bottom of the pocket (that long-range attraction “lures” the Arg into the pocket for the perfect position for the Ser-195 to attack the peptide bond at the neck of the residue. The key to this mystery is the Asp-102, His-57, Ser-195 catalytic triad. The mechanism is nicely explained here. The lesson to be learned regarding protonation states is that the local environment is the key. Global notions like pH only work in a nice uniform water solution with millions of copies of a given ligand and some specific ions floating around. In the binding site of a protein there are lots of local constraints that determine the interactions that occur.
Let’s pay a bit more attention to the middle player of that triad: the Histidine side chain. This residue on its own is another reason why pH is not sufficient to determine the protonation states. The two Nitrogen atoms of the aromatic ring are usually drawn with different connectivity (one has a double bond the other does not), but in fact due to the resonance of the aromatic ring, the role of the two is completely interchangeable (due to two different tautomer forms). At physiologycal pH one of them should be protonated (H-bond donor) the other should have a lone electron pair (H-bond acceptor) sticking out. However, which is which cannot be decided based on pH. Yet, such decisions are crucial for recognizing (scoring) the correct binding pose of a ligand. Sometimes, the intra-molecular H-bonding possibilities within the receptor can help to decide the correct state, e.g. in the case of the above mentioned catalytic triad (the Asp-102 dictates that the H must be on that side of the His-57).
Unfortunately, even full analysis of the receptor site isn’t enough in some cases. There are situations where some receptor residues change protonation states depending on which ligand binds to it, e.g. HIV-1 protease see this paper. Of course, the ligand protonation state may also be altered by the active site upon binding. There is no such thing as “correct” protonation state for a given receptor site without considering the ligand and also there is no correct protonation state for a ligand either without taking into account its interactions in the active site. This means, that the only correct treatment of protonation states is to leave the question open until the end of the docking process and make the decision separately for each result docking pose, e.g. using this method for the full complex. Of course, changing protonation states isn’t always free, usually it involves an energetic change, e.g. a carboxylate “prefers” to be de-protonated (lower energy form) — exceptions to this are the Histidine flipping the H between the two N, and a protonated carboxylate moving around the proton in the 4 possible positions on the two oxygens. So, a perfect scoring function must be able to consider various protonation states and also consider the energy cost associated with switching between them. The last release version of eHiTS (6.2) does the first part, i.e. it samples the protonation states on the fly for the result poses — independently for each functional group. The second part (energy cost of various protonation states) will be incorporated in the next release coming up this summer.

For other docking software that do not perform on the fly protonation changes, the only correct way to execute a screening study is to prepare multiple copies of each ligand enumerating all feasible protonation states (increasing the size of the screening library substantially, 150X for the example on the figure above), and also prepare multiple copies of the receptor with all combination of active site residue protonation states and perform a separate screening batch for each and select the best scoring solutions from all runs. This process raises the CPU time — possibly by several orders of magnitude. So, it is much more efficient to use eHiTS, or better yet eHiTS Lightning
An update in response to the comments by Andy and Danni:
Let me summarize how I see the process of correct protonation state determination and scoring:
- The correct docking pose has to be generated — possibly among several enumerated.
- The correct protonation state needs to be determined for each functional group involved — for each pose separately
- The energy has to be estimated correctly considering not only the interactions but also the cost of protonation state change relative to the lowest energy form.
Most docking programs will fail already at step 1 if the input receptor/ligand molecules are not in the right protonation state, because they will not put two donors or two acceptors facing each other at close interaction distance — and for those mentioned cases where the protonation dynamically changes upon binding as part of the recognition process, the input molecules will NOT be in the correct protonation state. In contrast eHiTS uses ambivalent flags for indicating the possibility of H/lp (donor/acceptor) variations and will generate all the possible poses — including the correct one.
Step 2 can still be done without QM calculations — it is a discrete optimization problem that can be solved optimally, for example using the method described by Paul Labute at the CCG site. So, now we have multiple poses, each of them with all the involved FG having the appropriate protonation state for the given pose. All this done on the fly in the docking process efficiently. What is missing now from eHiTS is step 3.
Step 3 is what requires the QM calculations to be able to estimate the energy of each pose accurately and select which one is the correct — lowest energy pose. In my opinion what we need to do ahead of time with QM is to compute two tables: the hydrogen bonding strengths between various functional groups and the internal energy difference between different protonation states of a given FG. Using these pre-computed values, we can estimate the total energy of the system. The question is, how big is the error in the estimation due to the use of pre-cooked FG values as opposed to computing the energy of the whole system at a high level of theory ? I do not know the answer to this question,and therefore I am not yet sure how accurate our estimate will be, but soon we will be able to see the results.
ZZ.