The question is asked and answered by a group of researchers from the University of Warsaw in a recently published paper (http://onlinelibrary.wiley.com/doi/10.1002/jcc.21643/abstract). They compared 7 docking and scoring programs, evaluating pose prediction and scoring accuracy on a large set of 1300 PDB complexes. The study is fairly thorough and asks some important questions, such as how the starting ligand conformation influences the results, and how the results differ between small and large ligands or between mostly hydrophobic and mostly polar interactions. The good news they report is that, statistically, the overall results do not seem to be influenced by the starting conformation, although some programs show a slight advantage when started from the X-ray conformation, which is understandable. The bad news is that ligand size does matter: while we are very successful with small, fairly rigid molecules, large floppy ones still prove hard to handle for all programs. The really ugly news is that none of the scoring functions provided adequate correlation with binding energy.
“On the basis of those results, we can order programs in the following way: GOLD ~ eHiTS > Surflex > Glide > LigandFit > FlexX > AutoDock. The best programs have the average RMSD top score around 2.7 A, and it increases to nearly 4.5 A for the weakest FlexX. As expected, better results were observed for best pose conformations (Fig. 4). For those poses, the mean RMSD value was even below 2 A for GOLD, eHiTS, and Surflex. … Moreover, the percentage of pairs for which top score conformation is below 2 A shows that even for the best programs the success rate is below 60%, and in some cases even below 40%.”
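To make the "below 2 A" success criterion concrete, here is a minimal sketch of how a success rate of this kind could be computed. It is only an illustration and not the authors' protocol: the function names are my own, the RMSD is a plain heavy-atom RMSD that assumes both poses list the same atoms in the same order in the same receptor frame (no symmetry correction), and the sample numbers are made up.

```python
import math

def rmsd(pose, reference):
    """Plain RMSD between two poses given as equally ordered lists of (x, y, z) atoms."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("poses must be non-empty and contain the same number of atoms")
    # Sum of squared coordinate differences over all atoms
    total = sum((p - r) ** 2
                for atom_p, atom_r in zip(pose, reference)
                for p, r in zip(atom_p, atom_r))
    return math.sqrt(total / len(pose))

def success_rate(rmsds, threshold=2.0):
    """Fraction of complexes whose pose RMSD is strictly below the threshold (in angstroms)."""
    if not rmsds:
        raise ValueError("need at least one RMSD value")
    return sum(1 for value in rmsds if value < threshold) / len(rmsds)

# Hypothetical RMSD values for five complexes (invented, not taken from the paper)
top_score_rmsds = [0.8, 1.5, 2.4, 3.9, 1.1]
print(success_rate(top_score_rmsds))
```

The same `success_rate` call can be applied to the top-scored pose of each complex or to the best pose among all generated ones, which is the distinction the quote draws between "top score" and "best pose" results.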
Based on the score-energy correlation performance, the authors divided the programs into three categories. The best one is “composed of functions implemented in eHiTS and in Surflex, which gave Pearson correlation 0.38 and 0.33, respectively. Moreover, for eHiTS scoring function very high-Spearman correlation was obtained…” The Pearson correlations for the middle and worst categories are in the range of 0.17-0.25 and less than 0.1, respectively. The authors rightly conclude that the score-energy correlation results are inadequate even “for the best program, namely, eHiTS”.
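For readers less familiar with the two statistics mentioned here, the following is a small sketch of how Pearson and Spearman correlations between docking scores and measured affinities can be computed. It uses only the Python standard library; the function names and the sample data are my own assumptions for illustration and have nothing to do with the paper's actual data set or tooling.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: linear association between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)

def ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical docking scores and binding energies (invented for illustration)
scores = [-7.2, -5.1, -9.0, -6.3, -4.8]
energies = [-8.0, -6.5, -8.4, -5.9, -6.1]
print(pearson(scores, energies), spearman(scores, energies))
```

Pearson measures how well the scores track the energies linearly, while Spearman only asks whether the two rank the ligands in the same order, which is why a scoring function can do noticeably better on one than on the other.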
Finally, in the ranking performance comparison (correlation of score with pose quality), AutoDock achieved the highest correlation at 0.32, with eHiTS a close second at ~0.3. So, what is the authors' final conclusion with regard to the question in the title? Here is the quote with the answer:
“Thus, can we trust docking programs? The answer must be given individually for two aspects of docking programs. In terms of pose prediction, we can say that GOLD and eHiTS performance is accurate enough … In the case of scoring functions, the answer must be negative, as virtually no correlations could be observed between docking score and in vitro binding affinities … the empirically derived functions have now reached the saturation of year-to-year improvement … The future direction should be either to use statistical approach based on increasing number of X-ray protein-ligand complexes, as can be determined from the results obtained by eHiTS scoring functions, or to develop completely new approaches in terms of predicting in vivo activity of the ligand.”
I am very happy to see that eHiTS came out among the top two contenders in all three aspects of the comparison (while the other top performer was a different program in each of the three aspects). On the other hand, I agree with the authors that there is still a lot of room and need for significant improvement, both in pose prediction (~60% success rate) and in score accuracy (~0.4 correlation). Furthermore, we definitely need more such thorough, large-scale performance comparisons in the future to keep assessing the state of the art until some programs (hopefully with eHiTS remaining in the lead) reach adequate performance.
Posted by Zsolt