Archive for June, 2008

Fastest supercomputer built on the Cell/BE

Friday, June 20th, 2008

As I already mentioned in May, RoadRunner, currently the world’s fastest supercomputer, is built on Cell BE processors, the same platform that eHiTS Lightning runs on. If the Los Alamos lab chooses Cell processors, then we chose well!

Looks like the mainstream media is now catching on to the news:

  • Infoworld reports: IBM’s Cell-based RoadRunner supercomputer is world’s fastest.
  • PC World: IBM’s Cell-based RoadRunner Supercomputer Is World’s Fastest
  • GamePro.com: PS3 “Cell” CPU to Power World’s Fastest Supercomputer
  • ITportal.com: IBM’s Roadrunner Runs Fastest
  • ScienceDaily: World-record Supercomputer Mimics Human Sight Brain Mechanisms
  • ITjungle.com: Beep, Beep: Roadrunner Linux Super Breaks the Petaflops Barrier
  • Chemical Engineering News: World’s Fastest Computer Debuts
  • ACM Technews picks: Europe Prays That Cathedrals to Computing Will Help Industry

The last article in the above list highlights: “Roadrunner was built using 6,912 dual-core Opteron processors from Advanced Micro Devices, and 12,960 IBM Cell eDP accelerators. Early tests indicate that the Cell processors have reached 1.33 petaflops while the Opterons reached 49.8 teraflops”. So not quite twice as many Cells (about 1.9 times as many) deliver about 26.7 times the crunching power of the dual-core Opterons. In an earlier blog post, I analyzed the advantages of the Cell BE over other acceleration technologies, like GPU and FPGA.
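As a quick check of that arithmetic, here is a minimal Python sketch using only the figures from the quote. The per-chip ratio at the end is my own derived number, not something the article states.

```python
# Figures quoted in the ACM TechNews item above.
opterons = 6_912        # dual-core Opteron processors
cells = 12_960          # Cell accelerators
cell_pflops = 1.33      # petaflops reached by the Cell processors
opteron_tflops = 49.8   # teraflops reached by the Opterons

# How many Cells per Opteron, and how much more throughput in total.
count_ratio = cells / opterons
power_ratio = (cell_pflops * 1000) / opteron_tflops

# Derived figure (not in the article): throughput advantage per chip.
per_chip_ratio = power_ratio / count_ratio

print(f"Cells per Opteron:         {count_ratio:.3f}")
print(f"Total throughput ratio:    {power_ratio:.1f}")
print(f"Per-chip throughput ratio: {per_chip_ratio:.1f}")
```

In other words, the 26.7x figure is a total, and on a per-chip basis the Cell advantage in these early tests works out to roughly 14x.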
ZZ

eHiTS and Score - Low RMSD/lnKd(IC50) Correlations.

Thursday, June 19th, 2008

Most flexible ligand/rigid protein and flexible ligand/flexible protein docking approaches do predict poses reproducing known structural solutions amongst their top ranking scored poses. But when assessing diverse top scoring poses for ligands in a screening exercise, I find it time consuming, even when employing some auxiliary pharmacophoric information, to decide which of my top scoring poses are the most pharmacologically relevant.

Let me simplify that statement! What I would really like is to have confidence that if I took my `top 5-10 poses’, there would be a much higher likelihood of finding the biologically relevant pose in that group than in the `next ten’. Moreover, we would all like our pose scores to bear some resemblance to IC50/Kd rankings of our screened ligands, be they agonists, inverse agonists, or inhibitors at the pharmacological endpoint!

I am a guy who early on naively believed that docking solutions relying on physics based approaches, for example carefully developed charge sets (1-6-12 potentials), could get me there. After all, I have done numerous MD simulations and employed MMPBSA approaches to estimate binding free energies to good effect (Proteins 55:895-914). But this worked well only if I had the right ligand pose(!) and if either the enthalpic components dominated or my crude estimate of entropic terms was adequate. The bottom line is that most docking approaches, while doing OK on pose prediction, do not to date give you good score-based good/bad RMSD separation, nor do they give you much confidence in using docking scores to `rank’ prospective ligands for synthesis (J. Med. Chem. 49:5912-5931).

eHiTS is an informatics based approach (J. Mol. Graphics Mod. 26:198-212), and what I have learned is that it is powerful in providing me with two major items on `my Christmas wish list’:
1. Good Score-RMSD correlation (good scores have low RMSD), and
2. Good correlation with ln(IC50)/ln(Kd)!

What I have learned this month is that if I `train a customized scoring function’ for my pharmacological protein target using ~5 co-crystallized complexes, I can achieve both items on my Christmas wish list. That is powerful.

How about an example of this?
Nicotinic acetylcholine receptors (nAChRs) are an important class of proteins in the superfamily of ligand `gated’ (allosterically modulated) ion channels. One of the surrogate proteins having binding motifs and pharmacology analogous to nAChRs is the acetylcholine binding protein (AChBP). We have begun investigations on this class of proteins because it poses a challenging problem: ligand recognition via conserved aromatic motifs in the binding pocket, through cation-pi interactions with a `box’ of Trp/Tyr residues. The upper left hand panel of the figure below shows three classic cationic ligands: acetylcholine, nicotine, and carbamylcholine (CCE). All of these ligands bind with considerable affinity to the binding pocket, with the cationic center interacting with the Trp/Tyr aromatic residues. Considerable experimental and computational evidence suggests that the cation-pi interactions make a substantial contribution to the binding free energy of these ligands to AChBP. The upper right panel shows this interaction motif for CCE in the crystal structure 1UV6. The lower left panel shows superimposed eHiTS docked poses of nicotine (NIC) and carbamylcholine (CCE) interacting with the surrounding pi electron system. While these poses have low scores with the default `out-of-the-box’ scoring function, that informatics based scoring function was not trained on this class of proteins, and the top plot (labeled `untrained’) shows no correlation between score and low RMSD poses (relative to the crystallographic pose). What happens to the score separation of low (good) RMSD and high RMSD poses if you train a scoring function specific to this class of proteins? The lower plot in the lower right panel tells the story: one obtains score-based separation (correlation) of the low RMSD poses from the high RMSD poses.
Which scoring function would you want to use for virtual screening of new leads for this class of protein? I think you know the answer to that one. You would want to use the one where you had confidence that your top scoring poses were the ones with probable low RMSD to the actual pharmacological pose.
Come back for the 2nd part of the story, eHiTS Score-lnKd correlations, in my next blog post.
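
One simple way to put a number on the score/RMSD separation discussed above is a rank correlation between pose scores and pose RMSDs. The sketch below is only an illustration, not how eHiTS evaluates its scoring functions: the two pose lists are made-up numbers, and I assume that a lower (more negative) score means a better pose.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical docking scores (lower = better), identical for both cases.
scores = [-9.1, -8.4, -7.9, -6.5, -5.2, -4.0]
# Hypothetical RMSDs (Angstrom) to the crystallographic pose.
rmsd_untrained = [4.1, 0.9, 5.5, 0.6, 2.8, 1.4]  # good scores, any RMSD
rmsd_trained = [0.6, 0.9, 1.4, 2.8, 4.1, 5.5]    # good scores, low RMSD

rho_untrained = spearman(scores, rmsd_untrained)
rho_trained = spearman(scores, rmsd_trained)
print(f"untrained: rho = {rho_untrained:.2f}")
print(f"trained:   rho = {rho_trained:.2f}")
```

A value near 1 means the best scoring poses are also the lowest RMSD poses, which is the behaviour you want from the scoring function you screen with.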

Posted by DLH

[Figure: EHITS_SCORE_RMSD_SEPARATION]

Public apology to CCDC

Wednesday, June 18th, 2008

My previous post about errors in crystal structures has triggered strong reactions from CCDC (not only a response post and direct email, but also an email to my former PhD supervisor in the UK asking him for a remedy and an explanation). Apparently, they interpreted my post as an attack on the quality of their services. Let me first clarify that I never intended to imply anything negative or derogatory about the CCDC services or software. My sincere apologies if my post came across that way. All I wanted to do was raise awareness in the docking/scoring community that small molecule crystallographic data is not free of errors. My understanding is that the data deposited in the CSD has been determined by thousands of people all over the world and published in various scientific journals, while CCDC aggregates the data and creates a comprehensive, validated and value-added database known as the Cambridge Structural Database (CSD). The complete CSD System (CSDS) includes the CSD itself and associated software for search, visualisation and analysis of stored information. I acknowledge that CCDC provides a valuable service to the community and that any error in the data is not their fault.

They have also sent us a “friendly reminder” that since our license to the CSD has expired, according to the signed agreement we are not allowed to retain or use any data downloaded from the CSD, not even any derived information or data. As I already stated in the update added to the previous blog entry, we have ceased using any data derived from the CSD to comply with the license. I have even removed the image of the molecule from the post (since that can also be considered derived data). We have not incorporated any data into our software. As I mentioned in the previous post, we intended to improve our scoring function with statistics collected from the CSD (while we had the license during 2007), but that did not prove to be useful, so we abandoned the approach and continue to use publicly available PDB crystal structure data, which has been used for all released versions of the software. We have not renewed our (rather pricey) license for 2008 for this reason.
One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), such as the data available from CrystalEye. When even non-profit organizations (registered as charities) use draconian license agreements to protect data created and published by others, fully commercial entities (like pharmaceutical companies) must be guarding their own data even more strongly. It is difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the organization that sells services on the data. It is ironic that the links expressing the need for open data and pointing to the open repository happen to lead to a web site within the same University where CCDC resides.
ZZ

Crystal structure errors — in CSD too

Friday, June 13th, 2008

Many of you involved in structure based drug discovery will know very well about the numerous problems and errors in the data found in the Protein Data Bank (PDB), especially concerning the ligand structures. There have been a lot of publications about such errors, e.g. Jones et al. J. Mol. Biol. (1997) 267:727, and I have heard various conference presentations on this topic too, e.g. by Gerard Kleywegt (University of Uppsala), titled “Protein crystallography: not as simple as ABC then?”, at the eChemInfo meeting at Bryn Mawr, Philadelphia (15-19 October 2007). The errors are often blamed on the low resolution of the structures, which involve large proteins (often thousands of atoms). One would assume that the small molecule crystal structures of the Cambridge Structural Database (CSD) do not have such errors, since they have much higher resolution and deal with small molecules. Let me correct that wrong assumption!

The scoring function of our eHiTS docking software relies on statistics of interaction patterns. Earlier we collected such statistics from thousands of PDB files, also considering the Gaussian distributions of the atom coordinates based on the given temperature factors to account for the uncertainty in the data. In the past year we collected some statistics from the CSD in the hope of improving the accuracy of our scoring function by using more reliable, more precise data. Unfortunately, we had to learn the hard way that the CSD isn’t so clean either. We found a lot of obvious errors, like atom centers falling within 0.2 Angstrom or less of each other when the crystal packing transformations are applied, and some completely impossible bond lengths and angles. We kept adding sanity checks to report and exclude data entries with various obvious errors. At the end of the automated cleaning process, almost 15% of the data had been dropped for one reason or another. We then assumed that the remaining data was good and that we could use it for collecting the statistics.
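
To give a flavour of such a sanity check, here is a minimal Python sketch of the overlapping-atom test. It is not our production code: the function name and the toy coordinates are made up for illustration, and it assumes the coordinates are Cartesian, in Angstrom, and already expanded by the crystal packing transformations.

```python
from itertools import combinations
from math import dist

def overlapping_atoms(coords, cutoff=0.2):
    """Return index pairs of atoms whose centers lie within `cutoff` Angstrom.

    `coords` is a list of (x, y, z) tuples. The check is a simple
    all-pairs scan, which is fine for small molecule entries.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(coords), 2)
        if dist(a, b) <= cutoff
    ]

# Hypothetical entry: atoms 0 and 1 sit only 0.1 Angstrom apart.
atoms = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.5, 0.0, 0.0)]
suspicious = overlapping_atoms(atoms)
print(suspicious)
```

An entry for which this list is not empty would be reported and excluded from the statistics.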

Now, the refined scoring function is nearing completion and we are running various tests. One of the tests was to compute the internal strain energy of various ligand structures, minimizing conformations from a systematically sampled set to identify global minima based on the new scoring function. This is an important exercise, part of the protein-ligand binding energy estimation problem, as Ashutosh Jogalekar blogged about today. Yesterday, one of our developers, Bashir Sadjad, showed me some data he collected running these minimizations on a few CSD structures. An intriguing point he raised was that several structures showed very high strain energies that could be resolved with fairly small dihedral changes. Of course, you cannot expect the CSD structures to all be at the global minimum conformation, because there are interactions in the crystal lattice that may force some compromise to reach a better H-bond or other interaction. However, I was expecting them to be at least at or near a local minimum conformation. Then Bashir pulled out one of the worst examples, where the X-ray structure had very high strain energy: CSD code [REDACTED] has two carboxylate groups as shown on the image [REDACTED]. The original structure from the CSD is displayed with thick bonds and the optimized one with thin bonds; you can see that the optimization has twisted the two carboxylates out of the plane of the aromatic ring in order to avoid two lone pairs facing each other. When I saw the image I immediately said: this must be a protonation error, because it looks to me as if, were one of the carboxylates protonated towards the other, then instead of a bad clash you would have a good H-bond between the two. Are there H atoms in the original?
It would make perfect sense to have the original non-twisted conformation in that case, but if there are lone pairs with negative charge on both carboxylates, then it is very likely that they would twist out of plane to avoid each other. Bashir said the structure did have H atoms, but NOT on the carboxylates; each oxygen appears deprotonated. OK, then I do not get it; there must be an error, I thought. Even with the N in the ring protonated, the carboxylates cannot both be deprotonated, because the whole structure would then have a -1 formal charge, which is impossible in a crystal: there is no salt in the lattice to counterbalance the charge, so the molecule must be neutral to form the crystal.

Today, Bashir came back with the explanation of the puzzle. He said:

This case was really annoying and I could not convince myself that it is only our scoring function that assigns the huge score to it, so I looked at the publication for this. Just looking at the figures, they all have a hydrogen between the two oxygens. In fact the title talks about C7H5NO4 while there are only 4 hydrogens in the original mol2 file! Finally, I looked carefully at the CIF file and in a ’special_details’ part it says:

“H5 bridges O2 and O3 with almost equal distances. H5 is not retained”

:-)

So, actually, it is not our scoring function that is wrong but the CSD entry!

So, the moral of the story: we can’t even trust the high resolution CSD data, let alone the PDB.
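
The mismatch Bashir spotted by hand (a published formula of C7H5NO4 against only 4 hydrogens in the coordinate file) is the kind of thing a simple automated check could flag. Below is a minimal Python sketch of such a check. The function names are made up, the atom list is a stand-in I constructed to match the counts described above rather than the actual file contents, and the formula parser only handles plain formulas without brackets or charges.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a simple formula such as 'C7H5NO4'."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def missing_atoms(formula, atom_elements):
    """Atoms required by the formula but absent from the coordinate list."""
    return parse_formula(formula) - Counter(atom_elements)

# Stand-in element list for the coordinate file: only 4 hydrogens present.
atoms_in_file = ["C"] * 7 + ["H"] * 4 + ["N"] + ["O"] * 4

missing = missing_atoms("C7H5NO4", atoms_in_file)
print(dict(missing))
```

A non-empty result means the coordinates do not account for every atom in the published formula, which is a good reason to read the special details of the entry before trusting it.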

ZZ

Update:

Since the posting of this blog entry, we have received 2 public comments (displayed in the standard way, like all comments, by the WordPress blog software) and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment, which explains the problems and puts the blame on my naivety and my wrong expectations about the data, was not displayed as prominently as the original article. To correct this problem, I will quote the entire comment here:

Author : J (IP: 131.111.113.139 , jenner.ccdc.cam.ac.uk) Says:
June 16th, 2008 at 8:39 am

These comments are interesting - not because they reveal anything that a small molecule crystallographer doesn’t know: More because they reveal that modellers have expectations of information that are a bit naive.

I’m a long time user of the CSD and, as a small molecule crystallographer, I understand the caveats behind crystallographic data. There are errors in the published crystal structures, and not all of them will get spotted during peer review or data curation. CSD users would be well advised to try to understand them and factor them in to their work. I particularly turn your attention to Points 3 and 4 below …

Point 1.

H-positions are sometimes hard to resolve in small molecule studies, and need to be treated warily in crystal structures. Ok - the entry QUICNA01 is a neutron study, so one would expect them to be better, but disordering is an issue.

One should always look at both the 2D and 3D structural information when working with crystal structures. If you look at the 2D representation in the CSD for QUICNA01 it is correct.

Point 2.

Undiagnosed disorder/symmetry can lead to problems: There are structural studies in the CSD where the crystallographer has missed a disorder, or missed some symmetry. AACRUB is an example of missed symmetry - and when you look at the study you see rather dubious bond lengths and angles, due to correlations in the refinement co-variance matrix.

Quite often, when this sort of thing happens, a later study will then correct the error: see AACRUB01 in this case.

Missed dis-orderings and symmetry are hard to spot, note: This is by far the most likely thing to trip up a modeller who ‘just wants the coordinates’.

Point 3.

Newer structural studies are more likely to be more reliable than older studies due to enormous improvements in equipment and software to undertake the studies. I think, in the case of QUICNA01 this is very pertinent. The structure was published in 1974 …. Ok - if this is the only structure then you may have to use it but ….

Point 4.

If there are several similar studies of a structure, they end up in a CSD refcode family. In the case of QUICNA01 you also have some later studies - namely QUICNA02 QUICNA03 QUICNA10 QUICNA11 QUICNA12 QUICNA13

QUICNA10, QUICNA11, QUICNA12 and QUICNA13 are all later studies of the structure, and they *all* have the proton to which you refer, since they are ‘deuterated’ compounds which will resolve better in neutron studies.

Now - you might quite reasonably say ‘but how do I know which one to pick?’ - There is this study

http://www.ccdc.cam.ac.uk/free_services/best_representative/

Though admittedly for QUICNA note that the choice is inconclusive based on the 4 lists given: I think the hydrogen list may not account for deuteration.

The other main point raised was that our CCDC license has expired since the data collection was made, and therefore we can no longer use any data (even derived data) from the CSD. We certainly fully obey this cease and desist order and will not use any of the data. We have not made any publications containing data from the CSD except for this blog entry (and I have now removed the code name and the image to comply with the order), and none of the released versions of our software contains such data either. By the way, the data did not help us improve the scoring function anyway, partly because errors similar to those in the PDB occur in the data, and because the PDB data is more relevant to docking, given the constraints present in the protein environment.

In my personal opinion, such restrictions on the use of scientific facts do not make much sense. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also, the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made the mistake of signing such a draconian agreement.

BIO-IT World’s “2008 Best of Show Competition” in the press

Monday, June 9th, 2008

The June 2008 issue of BioIT World reveals the finalists and winners of the “2008 Best of Show Competition”. eHiTS Lightning was a finalist in the “Life Sciences Informatics Applications” category. On page 17, they say:

Playstation Algorithm

eHiTS lightning is the first science application delivered on the very powerful, yet inexpensive PS3 Playstation….

read the full story here:
http://online.qmags.com/BIW0608/?sessionID=63E48403BEC324EAA3F26DFC5&cid=842554&eid=13014

or see it from our site:
http://www.simbiosys.ca/whatsnew/media/2008/BIW_20080601_Jun_2008.PDF

posted by Aniko

Chemical software quality 3 - polar surface area

Wednesday, June 4th, 2008

This is my third post in the ongoing debate on chemical software quality. To give a quick look-up table for those who join the discussion late, let me throw in some pointers in a time-line:

  1. Egon Willighagen wrote a blog about unit testing in CDK: Finding differences between IChemObjects
  2. Peter Murray Rust responded: The Blue Obelisk - Egon’s diff is boring making some general comments about chemical software quality
  3. Egon continued his unit testing story: Finding differences between IChemObjects #2
  4. Egon responded to PMR (2): Good Scientists Pimp there Research (was: Damn, I’m boring…)
  5. I responded to PMR (2, also citing 1,3,4): Research and software testing
  6. PMR responded to my post (5): Quality in chemical software - a debate
  7. I responded to PMR (6): Quality in chemical software - the debate continues
  8. PMR responded to (7): Quality is emerging in chemical software
  9. Egon responded to (2,5,6,7) in his post: Recovering full mass spectra from GC-MS data

In post (9) Egon points out that the annual competition results and benchmark results that I referred to in (7) have very little connection with unit testing and basic software quality (i.e. detecting, fixing and avoiding bugs), which is the focus of the unit testing discussed in (1,3,5). I agree, and that is exactly why I focused on that definition of software quality in my first post (5) in this debate. PMR has accepted my defense of closed source software quality in the software engineering sense (“I am prepared to believe that a company is able to reproduce its own results internally and I suspect that the quality is better than it was 10 years ago.”) and jumped to a different definition of chemical software quality in (6), one that has to do with assessing the scientific value of the answers provided by the software. This is what I addressed in (7), so I am not confusing the quality addressed by unit testing with that addressed by competitions or benchmarks, and I hope it is equally clear to everyone else that these are two very distinct issues. It seems everybody is in agreement that unit testing is crucial, and now we have a common understanding that it has always been (traditionally) applied in the commercial chemical software world. Now I will respond to the new points raised by PMR’s post (8):

PMR: By a tradition of quality I mean that there is a communal understanding that quality matters. Although quality is a wide term it is often difficult to discuss unless it is measured.

Indeed, and we already touched on two very different meanings of the word as I elaborated above.

PMR: Leaving aside the stochastic aspect - which we agree on (and which makes quality assessment much harder) my concern is not whether a given calculation is reproducible when confined to a manufacturers platform, but whether the results have been assessed as meaningful. Now I agree that this is not easy, but unless the manufacturers develop interoperable standards then the quality of the result is only assessable by public assessment, requiring standard data sets and standard results. I gave the example of “(total) polar surface area” which should, in principle, be computable reproducibly by all manufacturers. But only if it is defined in a manner that all agree upon. Otherwise we have as many different values as there are manufacturers. And I would content that - unless each has a clear defintions of the lagorithm and the proerty calculated - this is a lack of quality.

Well, the question “what is the (total) polar surface area of a molecule?” really belongs to the “computing non-observables” category. I would say it is about as well defined as the question “what is the favourite color?” Of course, there is no single “correct” answer for either one. You need to specify the question far more precisely if you want to get a meaningful answer. First, which surface are you talking about? The van der Waals surface, the solvent accessible surface, the solvent excluded (aka Connolly) surface, or one of the iso-surfaces, e.g. an electron density iso-surface at some given cut-off? If you choose one of the first three, what radius values should be used to define the atom spheres? Each force field has a different set of vdW radii. If you choose an electron density iso-surface, what cut-off value should be chosen? All of these choices will significantly alter the “correct answer” to the question. Then how do you define what part of the surface is considered polar? For the vdW surface area or SAS it may be defined based on atom type, e.g. O and N being polar and hydrophobic carbons being apolar. But what about aromatic carbons next to a nitrogen, or the carbon of a charged group like a carboxylate: should that be considered polar or not? Or should polarity be defined by computing the net charge effect of all atom based partial charges for every single point of the surface and summing that up via a surface integral? How do you assign partial charge values for that? Or do you use a QM method to compute the charges at surface points? What level of theory should be used? All these questions and choices are outside the scope of software correctness or quality. There is a correct answer for each variation of the question and each set of parameters chosen. Once the choices and parameters are fixed, you can ask how accurately a given piece of software computes the polar surface area for the given specification. So, simply put, this “property” isn’t a single property but a whole range of properties.
Oh, and don’t forget the conformation, because a lot of interesting molecules are flexible and the PSA will depend on which conformation you use to compute it. The lowest energy conformation may not be the most relevant if you are interested in bio-activity against a specific target; you need to know the bioactive conformation. So before we can even begin to address the question of software quality metrics, we need to define the problems precisely. Otherwise, you may get a totally different result from every software package, and all of them can be accurate.
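
To make the point concrete, here is a deliberately tiny Python sketch of how much the answer moves with the definition alone. It treats a single polar oxygen as an isolated sphere and ignores overlap with neighbouring atoms, so it is a toy model and not a PSA algorithm. The Bondi radii and the 1.4 Angstrom water probe are commonly used values; the “alternative” radius set is invented purely for illustration.

```python
from math import pi

# Bondi van der Waals radii in Angstrom (a common choice; force fields differ).
BONDI_RADII = {"O": 1.52, "N": 1.55}
# Hypothetical alternative radius set, invented for illustration only.
ALT_RADII = {"O": 1.40, "N": 1.50}
WATER_PROBE = 1.4  # commonly used solvent probe radius in Angstrom

def isolated_atom_area(element, radii, probe=0.0):
    """Surface area of one isolated atom sphere.

    probe=0.0 gives the van der Waals surface; probe>0 gives the
    solvent accessible surface of that single sphere. Overlap with
    neighbouring atoms is deliberately ignored in this toy model.
    """
    r = radii[element] + probe
    return 4.0 * pi * r * r

areas = {
    "vdW, Bondi": isolated_atom_area("O", BONDI_RADII),
    "vdW, alternative": isolated_atom_area("O", ALT_RADII),
    "SAS, Bondi": isolated_atom_area("O", BONDI_RADII, WATER_PROBE),
}
for label, area in areas.items():
    print(f"{label}: {area:.1f} A^2")
```

Each of the three numbers is a correct answer to a differently specified question, and that is before deciding which atoms count as polar or which conformation to use.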

PMR: I have not - and will not - claim that the Open Source movement in chemistry is of higher quality than closed source.

This statement is easy to refute by a verbatim quote from PMR’s post (2):

PMR: So the Blue Obelisk is emerging as the main area which takes quality in chemical software and chemical data seriously. More organisations are taking Open Source seriously. I met a chemical software company last week - no names - who is seriously looking at Open Source and thinking of integrating its competitors’ products. Perhaps not RSN, but they are looking at it.

And when they do they will find the Blue Obelisk is the only place for software and data quality.

Notice the word only (emphasis is mine) in the last sentence. It is clear that it is an even stronger statement than claiming OS to be higher quality than closed, it implies there is no quality in closed source chemical software at all. Incidentally, this is the statement that “inspired” me to enter this debate in the first place.

PMR: I said there was no tradition of quality. As a result of your post I will moderate this statement slightly.

PMR: I agree this, but note that many of these are very recent. So I would be prepared to say that in certain fields a tradition of quality metrics is starting to emerge. Almost all of these relate to docking into proteins and are driven, at least in part, by the tradition of competitions in proteins such as CASP which has for many years been involved in predicting protein structure.

So I wish them well and will now exclude docking (but not QSAR) from my remarks.

Thank you very much, Peter. I am glad you changed your view about docking. This area has been the main focus of my research and development for the past 6 years and I believe we have very good results measured with sound scientific quality metrics. Mind you, some vendors in the past have used rather questionable metrics to report good results, and I have explained in that post how such a metric can lead to ridiculous results.
ZZ

Quality in chemical software - the debate continues

Tuesday, June 3rd, 2008

Peter Murray Rust has responded to my previous blog post and has raised some important points to which I have to respond; see my comments section by section:

Quality in chemical software - a debate

ButSymBioSys Blog has replied to my post about unit testing in a long and thoughtful post. I don’t know who the individual is but the company sells a number of chemical software packages, a lot of which I recognize from Peter Johnson’s research group at Leeds.

Let me introduce myself: I am Zsolt Zsoldos, Chief Scientific/Technical Officer at SimBioSys. As Peter MR has correctly recognised, some of the software we market was developed in Peter Johnson’s research group at Leeds, including the Sprout de novo design software, which was my PhD project. Peter Johnson was my supervisor, and he is a scientific adviser and a director on the board of SimBioSys. There are a number of publications listed here covering my post-PhD work at SimBioSys, as well as various presentations I gave at conferences, just to give some background on my work.

PMR:

I’m confining my remarks to “chemoinformatics” software. I exclude quantum mechanics programs (which take considerable care to publish results and test against competitors) and instrumental software (such as for crystal structure determination and NMR. Any software which comes up against reality has to make sure it’s got the right answers as far as possible. But chemoinformatics largely computes non-observables.

Reproducibility of results and robustness is not the whole story of quality. There are tens of thousands of docking and QSAR studies done each year and many of them are published. Are they reproducible? I expect that if a different researcher in a different institution with different software ran the “same” calculation they would get different results.

I fail to see how the “tens of thousands” of docking studies can be considered to compute “non-observables” when we have tens of thousands of X-ray crystal structures to compare against. How is that less of a reality to come up against than quantum mechanics? There are experimentally measured binding affinities to compare scoring results against. What better metric does QM have? There is no exact analytical solution of the Schrödinger equation for molecules of practical interest, so all QM software computes approximations, and there is no absolute benchmark to compare against, because we cannot compute the exact solutions.

Are the docking and QSAR study results reproducible? With eHiTS and LASSO, the answer is definitely YES! I understand that many tools on the docking/QSAR market use stochastic (read: random) methods and therefore their results are inherently unreproducible. Again, I can only speak with authority about our own software, which uses strictly deterministic and reproducible techniques. So if a different researcher in a different location runs our software on the same input, they will get the same result. However, I do not see how one could run the “same calculation” using different software. By definition, if you are using different software (which embodies the calculation), then you are not running the same calculation. I can assure you the same is true for QM software as well, for the simple floating point error reasons I explained in a previous blog post. Any different QM implementation will necessarily perform computation steps in a different order (something as simple as a summation in a different order will suffice) and will therefore get slightly different results.
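
The summation-order effect is easy to demonstrate. The following minimal Python sketch adds the same three numbers in two different orders using standard double precision arithmetic. The values are arbitrary and have nothing to do with any particular QM or docking code.

```python
import math

values = [0.1, 0.2, 0.3]

# Same three numbers, two different summation orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False

# The results differ only in the last bits, so a tolerance-based
# comparison still treats them as the same number.
print(math.isclose(left_to_right, right_to_left))  # True
```

So two correct implementations can disagree in the last digits, and comparing their results needs a tolerance rather than an exact match.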

PMR:

Which manufacturers publish the source code of their algorithms? Without this the user depends completely on trust in the manufacturer.

Hmmm, very good point. Let me see, does Microsoft publish their source code? No. Then why do they have over 95% market share? They must be very trustworthy, right? Then why are they facing antitrust trials in the US, Europe and Japan? Perhaps my example is off-topic and off-target, since PMR advocates open source over closed proprietary software and standards, like OpenOffice over MS Office and ODF over OOXML? Nope, those links prove the exact opposite with statements like:

PMR:

The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use.

My worry about Open Office (which emits ODT) is that I don’t yet believe that has reached a state where I could evangelize it without it falling over or being too difficult to install.

So, let’s just agree that being open source does not automatically guarantee good quality and that, on the other hand, it is also possible to have good-quality software that is proprietary. I definitely see and acknowledge the quality values in open source, but in my opinion the open source model requires a critical mass (in terms of the number of developers and users) to achieve the “any bug is shallow for many eyes” state of Linux. Whether the user and developer base has reached that level for chemistry software is an interesting question worthy of its own debate. Let’s continue with our current debate:

PMR:

Many communities have annual software and data competitions. They use standard data sets and different groups have to predict observables. Examples are protein structure and crystal structures. In text-mining and information retrieval there are major competitions. They rely on standard data sets (“gold standards”) against which everyone can test their software.

But in chemical software these type of standards are rare. If companies feel strongly about quality they should be doing something publicly. Developing test cases. Collaborating on the publication of Open Standard data. Creating Gold Standards. Developing Ontologies - if we don’t agree what quantity we are calculating then we are likely to get different answers.

Yes, indeed many communities have annual software competitions, including the docking community: for example, the SAMPL competition by OpenEye, which Bio-IT World has reported on, or the CASP docking competition as published by Lang et al., J Biomol Screen. 2005; 10: 649-652. As for standard benchmarking data, how about the GOLD validation set, or the more recent Astex diverse validation set, specifically designed to be a high-quality benchmark set for docking, published as:

    Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance.
    M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. M. Mooij, P. N. Mortenson, C. W. Murray
    J. Med. Chem., 50, 726-741, 2007.
    [DOI:10.1021/jm061277y]

For binding energy estimation we have the PDB-bind database, and for enrichment studies the DUD data set at docking.org. As for community-based collaboration, I have personally participated (along with many others from industry and academia) in the eChemInfo “Virtual screening and docking - comparative methodology and best practice” workshop last year at Bryn Mawr College, Philadelphia. A recent special issue of the Journal of Computer-Aided Molecular Design (Vol 22, Num 3-4, March/April 2008, 131-266) has been devoted to “Recommendations for Evaluation of Computational Methods for Docking and Ligand-based Modeling”. As demonstrated by these links, it is unfair to say that standards, public data and collaboration do not exist in this area.

ZZ

Research and software testing

Tuesday, June 3rd, 2008

And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write. Not when you start, but they get easier.

Of course, writing unit tests for chemistry software is not chemistry research, so you do not get to write chemistry publications about it. However, it is an active topic in computer science. If you hop over to the ACM digital library and enter the search “unit test”, you get 19,314 hits, all in peer-reviewed journals. Here are just a few example hits:

Automatic extraction of abstract-object-state machines from unit-test executions
Tao Xie, Evan Martin, Hai Yuan
May 2006
ICSE ’06: Proceedings of the 28th international conference on Software engineering

Software unit test coverage and adequacy
Hong Zhu, Patrick A. V. Hall, John H. R. May
December 1997
ACM Computing Surveys (CSUR), Volume 29 Issue 4

Carving differential unit test cases from system test cases
Sebastian Elbaum, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil
November 2006
SIGSOFT ’06/FSE-14: Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering

When you read further in Peter’s blog entry, you see these statements:

The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics.

Now, this is a bold claim if I have ever seen one. I am sure most commercial vendors who produce chemical software employ computer science or software engineering graduates, who have been taught the industry's standard unit testing and regression practices as part of the standard curriculum. How do I know that? Because not only do I have a BSc and an MSc in computer science myself (my PhD is in computational chemistry, so that does not fall under CS), but I also spent 3 years as a teaching assistant at ELTE Budapest teaching programming methodology courses, including these techniques, to CS undergraduates.

Of course, I can only speak about my own chemical software company with authority, so let me elaborate on how we do software testing. Our system consists of several compact software modules with well-defined input and output data objects. These modules can be linked into a pipeline to perform complex tasks like docking or retrosynthetic analysis. Each of the modules has a unit test bed, which consists of a test engine, a set of test scripts, some input and output data files, and expected error report files. The test engine reads the test script, extracts the input data from the script, executes functions of the module and tests the responses and returned results by comparing them to the expected data from the script or data files. There are four distinct types of tests:

Func - functionality test; valid calls and parameters; checking certain scenarios to see if the module functions properly based on the script

Speed - performance test; valid calls and parameters; should be run with optimised compilation, debug turned off; measures speed

Error - testing of the exception handling; valid calls with parameters simulating extreme scenarios (e.g. file does not exist or incorrect file format used) that may happen in a valid usage scenario due to wrong data being passed to the program by the user

Robust - robustness test; invalid call sequences and/or parameters to see whether the sanity checks (asserts) are thorough and complete. These test for programming errors in the integration pipeline, e.g. NIL pointers passed for required data input or calls made to uninitialized objects.
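To make the four categories concrete, here is a minimal sketch built around a toy module. It is not SimBioSys code: `SumModule`, its methods and the error messages are all invented for this illustration, and exceptions stand in for the asserts so that a test can observe the failure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <exception>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical toy module: sums whitespace-separated integers.
class SumModule {
public:
    void init() { ready_ = true; }

    long sum(const std::string& text) const {
        // Sanity check on the call sequence (exercised by the Robust test).
        if (!ready_) {
            throw std::logic_error("sum() called before init()");
        }
        std::istringstream in(text);
        std::string token;
        long total = 0;
        while (in >> token) {
            total += parse(token);
        }
        return total;
    }

private:
    // Bad user data becomes a run-time error (exercised by the Error test).
    static long parse(const std::string& token) {
        std::size_t used = 0;
        long value = 0;
        try {
            value = std::stol(token, &used);
        } catch (const std::exception&) {
            throw std::runtime_error("bad number: " + token);
        }
        if (used != token.size()) {
            throw std::runtime_error("bad number: " + token);
        }
        return value;
    }

    bool ready_ = false;
};

// Func: valid calls and parameters, check the result.
bool func_test() {
    SumModule m;
    m.init();
    return m.sum("1 2 3") == 6;
}

// Speed: valid calls repeated many times; returns elapsed seconds.
double speed_test(int repeats) {
    assert(repeats > 0);
    SumModule m;
    m.init();
    long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        sink += m.sum("1 2 3 4 5");
    }
    auto stop = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double>(stop - start).count();
}

// Error: valid call, wrong user data; compare with the expected message.
bool error_test() {
    SumModule m;
    m.init();
    try {
        m.sum("1 two 3");
    } catch (const std::runtime_error& e) {
        return std::string(e.what()) == "bad number: two";
    }
    return false;
}

// Robust: invalid call sequence (init() deliberately skipped).
bool robust_test() {
    SumModule m;
    try {
        m.sum("1 2 3");
    } catch (const std::logic_error& e) {
        return std::string(e.what()) == "sum() called before init()";
    }
    return false;
}
```

In this sketch the expected error messages are hard-coded in the tests, where a real test bed would keep them in separate files. The split between run-time errors (user data) and logic errors (programming mistakes) mirrors the difference between the Error and Robust categories.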

The last two categories have associated expected error files, which list the error messages that are expected in the response from the module being tested. Here is an example functional test script from the MolFragGraph module. As you can see, it uses a simple language: one command per line, starting with a keyword followed by optional parameters and a data block. Of course, writing such scripts is boring, so we typically write only a few of them when a new module is developed. Then we add code like this to the program:

DBGMESSNLF(DEB_SCRIPT, "SCRIPT: MarkGridHead ClientID=0 NumLines="<
<<" NumLineItems="<
<<" Low="<<_p_info->unit_min
<<" Dim="<<_p_info->unit_dims
<<" CellSize="<<_p_info->cell_size<<"\n");

This is a macro call that is controlled by a debug flag (DEB_SCRIPT). If that flag is turned on at run time, the code writes a line into the log file, identified by the "SCRIPT:" header and containing one complete line for the test script along with its parameters and data. When we run an integrated software pipeline, we can generate a log file containing the actual data passed into and output from any given module, in the format required by the test bed scripts. This allows us to automatically generate test scripts for any of the modules by running an integrated software pipeline on a practical input case. If we find a bug, then as soon as we reproduce it with a debug version of the code we can generate a test script for each module involved and test them separately to identify where the root of the problem is. Once the bug is fixed, we can generate the correct expected output of each module for the test case. This comes in very handy for generating regression tests: if a later change to the code breaks any previously fixed functionality, we notice because the corresponding test script fails. Of course, the running of all these tests is automated in a nightly build and test script. Each module is assigned to a developer who is responsible for it. When a test script fails during the automated nightly test, the developer gets an email notification so he can fix it the next day. As a quality metric, we produce tables each night similar to the VTK dashboard (I cannot show you our own for confidentiality reasons). We have been doing development with quality control at SimBioSys since the start of the company in 1996.

I have also worked at a larger medical imaging software company where software development was carried out under an ISO 9001 certified methodology, and I have implemented the same principles (with some more automation) at SimBioSys, even though we have not applied for the certification, which is a long bureaucratic process with a significant cost.
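The idea of turning "SCRIPT:" log lines back into a test script can be sketched as follows. This is only an illustration under stated assumptions: `g_deb_script`, `log_script` and `extract_script` are hypothetical names, a plain function replaces the real macro, and the assumed log layout is that each script line starts with the "SCRIPT:" header.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the run-time DEB_SCRIPT debug flag.
static bool g_deb_script = false;

// Emit one test-script line into the log, only when the flag is on.
void log_script(std::ostream& log, const std::string& command) {
    if (g_deb_script) {
        log << "SCRIPT: " << command << '\n';
    }
}

// Pull the script commands back out of a mixed log.
std::vector<std::string> extract_script(std::istream& log) {
    const std::string tag = "SCRIPT:";
    std::vector<std::string> script;
    std::string line;
    while (std::getline(log, line)) {
        if (line.compare(0, tag.size(), tag) != 0) {
            continue;  // ordinary log line, not part of the test script
        }
        // Skip the header and any spaces that follow it.
        std::size_t start = line.find_first_not_of(' ', tag.size());
        script.push_back(start == std::string::npos ? std::string()
                                                    : line.substr(start));
    }
    return script;
}
```

Running the extractor over the log of a failing pipeline run yields the command lines that reached a module, ready to be replayed by that module's test engine. Once the bug is fixed, the same lines and the corrected output can be kept as a regression test.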

So what is the take-home message from this post? That software unit and regression testing is a very important and serious (although boring) part of chemistry software development, and it is not limited to (nor invented by) open source groups like the Blue Obelisk, which is NOT the only place for software and data quality, contrary to what PMR would like you to believe.

ZZ