Archive for May, 2008

ARChem is now available with a demo version of CrossFire Beilstein reaction database

Friday, May 30th, 2008

Here at SimBioSys we have been working on our ARChem retrosynthetic analysis system for about three years. It is already deployed and a publication is about to be submitted for peer review. Derived from the work on synthetic feasibility that produced our CAESA product, ARChem is the result of a collaboration among medicinal chemists in large pharma, Prof. Peter Johnson of Leeds University in the UK, who contributed his chemical knowledge, and our technical team here in Toronto. It also relies on a number of reaction databases, whose reactions are clustered to drive the most appropriate retrosynthetic analysis. You will hear a lot about ARChem in the coming few weeks as I discuss how the system works, how it can be used, its successes and, of course, its issues as a piece of software.

For now I am happy to announce to interested users that, in addition to the Accelrys reaction databases, we have just received a demo of the CrossFire Beilstein reaction database from Elsevier. We have sourced a 1 million reaction subset of CrossFire Beilstein, and these new reactions were clustered to enhance the performance of ARChem. If you are interested in seeing the system and will be attending the 8th International Conference on Chemical Structures in the Netherlands next week (June 1-5), please visit us at the Keymodule booth during the conference, or visit our website and request a demo account for the online version of ARChem that you can test. Let me know.

posted by Aniko

Video games are enabling life sciences research

Thursday, May 29th, 2008

Life sciences researchers should be thrilled, because the success of video games is enabling their research, and so is SimBioSys, says Salvatore Salamone of Bio-IT World. Read the full story here:
http://www.bio-itworld.com/inside-it/2008/05/gta4-and-life-sciences.html
Very interesting!

posted by Aniko

The future of HPC

Thursday, May 29th, 2008

On Tuesday (May 27) I attended the SHARCNET Symposium on GPU and CELL Computing at the University of Waterloo. There were speakers from IBM, AMD and NVIDIA, as well as Ben Berger from Los Alamos, where the new fastest supercomputer on earth is running the benchmarks as we speak; look out for the official announcement about breaking the PetaFLOPS barrier on June 10th. The common theme I heard from all hardware manufacturers is that the future is about many-core technologies. Moore’s law still holds in its original form, i.e. the number of transistors packed into a chip doubles every 18-24 months. For several decades, up to about 2003, this translated into exponential growth in processor speed. The clock speed increase has stopped under 4 GHz due to diminishing returns: energy requirements and heat grow quadratically and have reached the point where they become unmanageable (passive power has crossed over active power). The new trend is to keep the clock speed steady (at around 2-3 GHz) but increase the number of parallel computation cores. Intel and AMD have quad-core chips on the market, and 8-core CPUs are around the corner. At the same time, GPU accelerators already pack hundreds of cores into a chip at lower speeds, while the Cell BE has 8 vector processor cores (equivalent to 64 individual cores). With the exponential growth of Moore’s law we can expect thousands of cores in the CPUs of our desktops and laptops within ten years. However, to make use of this kind of parallel power, the software world needs to undergo a major change! The days of the lazy programmer are over: we cannot sit back and wait for a faster processor if our program is too slow. Single execution threads will not get any faster, so we need to make our code capable of running in a massively parallel way, and that is not easy.

Michael Perrone from IBM started his talk with a story about HP expecting a 2X performance increase when they introduced the first dual-core computers on the market, but getting only about 1.7X; when they went from 2 cores to 4 cores they expected another 1.7X but got only 1.35X. So what should they expect from 8 cores over 4 cores? How about 16, 32, 64 cores? Will the curve soon flatten out so that we get no more speed-up? The answer is: it is all about the data. Memory bandwidth is not keeping up with computation speed, so there is no use in increasing the computation capabilities if we are unable to feed the beast (food = data, beast = CPU core). And now we have to start feeding many beasts, and they will multiply exponentially.

Peter Murray-Rust is asking on his blog: Where should we get our computing? The answer is: from the multi-core accelerator technologies, like GPGPU and Cell BE. His worries about hardware cost and management can be reduced 50-100 fold by using these accelerators. It is no accident that the RoadRunner supercomputer is built on Cell BE processors for the computing (with the communication and file I/O handled by AMD Opterons), beating the previous fastest HPC system benchmark (held by IBM’s BlueGene) by over 4X.
As for the GPGPU versus Cell BE angle, this symposium has reinforced my belief that the Cell BE is a general purpose accelerator suitable for any task (just like a CPU), while the GPUs from AMD and NVIDIA are highly specialized tools that can deliver great performance for a very specific subset of problems. GPUs were designed for graphics, where the computation tasks are massively parallel (millions of 3D points and triangles to process) and completely independent (what needs to appear on each pixel is independent of the others, and so is the computation to be performed for different 3D points). Tasks that have these properties are suitable for GPGPU, e.g. image processing, some physics simulations (material science, plasma, laser, particles) and even some chemistry problems, like molecular dynamics simulation if one wants to compute the full atom pair matrix of forces. However, as soon as you want to be smart and compute only the forces within a cut-off range (which by itself can give a hundredfold speed-up if you work with proteins), and/or you need dynamically changing data sizes or inter-dependencies (like an N-body problem or QM), then the GPU is not a good choice. There can be non-trivial performance hurdles even for seemingly fitting problems like image processing. Michael Kinsner brought up an example in his talk: he had to learn the hard way that processing image blocks of 16×4 was fast, but 8×8 was much slower due to a peculiar memory access pattern issue. The input data pattern of the code has to map directly onto the underlying hardware architecture to get good performance on the GPU.

On the other hand, the Cell BE is an extension of the CPU architecture. It is completely general purpose and solves the memory access (hungry beast) problem by putting full control into the hands of the programmer, via direct programming of 9 separate memory flow controllers and a huge 300 GB/s data pipe. Of course, such control means the programming isn’t easy and worry-free, but we have the means; the challenge is upon us to program the beast so it does not starve.

ZZ

Floating point errors

Wednesday, May 28th, 2008

On the CCL mailing list Sina Türeli has posted this question:

“I am working on a project related to proteins and my precision for coordinates is there digits after the dot. When I do operations like rotation around a dihedral, the dihedrals which shouldn’t change change at about 0.01 angles and so. I am afraid though that is not much it might accumulate over time. So do you have any suggestions for reducing floating integer errors? Would’t be feasible if I turned lets say the coordinate 21.567 to 21567 and do my operations? Or maybe even 215670? “

I have spent a lot of time fighting similar problems, so I know how annoying this can be. To understand what is happening, look at how floating point numbers are stored and operated upon according to the IEEE 754 standard. Because of the binary representation, our decimal fractional numbers do not map exactly to float/double numbers. Even though the 23 fraction bits of a single precision floating point number correspond to roughly 6 decimal digits of precision, it still does not mean that every 3 digit decimal fraction can be represented exactly. On the other hand, integers can be stored exactly, which is the basis of the idea to store the number 21.567 as 21567 or 215670. This would work for storing the numbers and also for applying some basic arithmetic (addition, subtraction and multiplication) without any error. However, division already causes trouble, and trigonometric functions or the square root make the problem much worse than what you have with floating point numbers. Those functions generally produce irrational numbers, i.e. numbers that cannot be represented as a ratio of two integers. Unfortunately, rotations are typically defined by angles, and the coordinate transformation requires the sine and cosine of the angle. Depending on the task, the required transformation can sometimes be expressed in other ways, e.g. by quaternions, and sometimes we can compute transformations by simpler arithmetic if the goal is to move some atoms to specific positions, e.g. an overlay (rather than a rotation by a given angle). So, in short, an integer fixed point representation will not solve the problem of coordinate drift in 3D transformations, especially if rotations by a given angle are required. On the other hand, it can solve simpler problems.
How bad is the floating point problem, and what can be solved by fixed point integer arithmetic? Let me give you an example: you learned in school that a+b+c = c+a+b. Well, this simple rule breaks even for floating point numbers with a single digit after the decimal point! Consider the following simple C code (you can download the source, or a linux binary):

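#include <stdio.h>

/* Test whether a+b+c equals c+a+b for single precision floats. */
int main() {
float a, b, c, d, s;
for (a = 0.1; a < 9.9; a += 0.2) {
for (b=0.4; b < 9.9; b += 0.1) {
for (c=0.7; c < 9.9; c += 0.3) {
s = a + b;   /* first order: (a + b) + c */
s += c;
d = c + a;   /* second order: (c + a) + b */
d += b;
if ( s != d ) {
/* print the values and their raw bit patterns */
printf( "a=%f, b=%f, c=%f, a+b+c=%f, c+a+b=%f hex: a=0x%08x b=0x%08x c=0x%08x s=0x%08x d=0x%08x\n",
a, b, c, s, d, *(int *) &a,*(int *) &b,*(int *) &c,*(int *) &s, *(int *) &d );
}
}
}
}
return 0;
}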

Sorry for the lack of indentation, I could not convince WordPress to keep it pre-formatted.

You can see that if the summation in two orders were the same for all tested cases it would never print anything. However, when you run it, you get plenty of examples (34585 cases on my computer, about one third of all cases tested) printed where the rule breaks. Here are the first few examples you get:

a=0.100000, b=0.400000, c=3.700000, a+b+c=4.200000, c+a+b=4.199999
hex: a=0x3dcccccd b=0x3ecccccd c=0x406ccccb s=0x40866666 d=0x40866665
a=0.100000, b=0.400000, c=7.600002, a+b+c=8.100002, c+a+b=8.100001
hex: a=0x3dcccccd b=0x3ecccccd c=0x40f33337 s=0x4101999c d=0x4101999b
a=0.100000, b=0.500000, c=2.500000, a+b+c=3.100000, c+a+b=3.100000
hex: a=0x3dcccccd b=0x3f000000 c=0x401fffff s=0x40466666 d=0x40466665

As you can see from the hexadecimal version, the difference is only in the last bit. Nevertheless, it is enough to throw off the result. Imagine that you sum up score components, sort the results and select only the best N solutions. Suddenly, you may keep or lose a solution depending on the order of summation. Now, that is scary… This problem cannot be solved by using double precision, but it can be solved simply by using a fixed point integer representation. However, fixed point does not help with the rotation problem, I am afraid.

ZZ

Cognate docking accuracy measurement vs pre-optimized pose

Monday, May 26th, 2008

I have seen some papers published in peer reviewed journals where the authors have proposed (and executed) the following evaluation protocol for cognate docking:

  1. Optimize the crystal structure of the protein-ligand complex obtained by X-ray in order to remove any severe clashes, fix bad geometries etc. Save the protein receptor and the ligand into separate files. Note, that the optimization was performed together, not separately for receptor and ligand!
  2. Perform the docking into the protein structure obtained via the optimization of step 1. The starting conformation of the input ligand may be randomized or optimized in vacuum or in solvent.
  3. Compute the Root Mean Square Deviation (RMSD) between the heavy atoms of the solution pose and the pose saved in step 1 after optimization together with the receptor.

Upon a quick surface-scan (i.e. without really analyzing or thinking about the meaning) this may even sound like a reasonable protocol. But if one looks a little deeper, it becomes clear that the method is very seriously flawed, especially if the docking procedure involves an optimization step using the same force-field or scoring function that is used in step 1.

The calculated RMSD value is simply the distance between two selected local minima of the scoring function and, as such, has very little to do with docking accuracy. To better explain this statement, let’s take a simple hypothetical 2D function (the real docking pose search space is 6+n dimensional, where n is the number of rotatable bonds, so it would be hard to visualize) and follow what happens if we optimize some points (indicated by P and X on the figure of the 2D function) by following the steepest path to a local maximum. Suppose X represents the original X-ray ligand pose, and the points indicated by P represent various docking poses generated prior to local optimization. The black arrows show where the points end up after optimization. The distance indicated in white is the measured RMSD. You can see that each local optimum has an attraction region, and if you move the starting P or X points around within the same region, they still end up at the same point after optimization; thus the “measure” isn’t very sensitive to the P or X locations. You can also see that in the particular case I have drawn, there is actually another P position to the right of the X which was in fact closer to the X prior to the optimization than the one judged closest after optimization. It is also clear that if the pose generation sampling is fine enough to create one pose P within the attraction region of each local optimum (on each mountain of the figure), then the measured RMSD will be zero, because the X converges to the same peak as the P that fell onto the same mountain, even if that P started on the other side of it, quite far away. In other words, a sufficiently fine (exhaustive) sampling is guaranteed to produce a zero RMSD solution!

How fine would such sampling have to be? That depends only on how “rough” the scoring function is. If we choose a very nice function that has only a single minimum, then it is enough to generate a single pose anywhere and it will converge to the same point as the X-ray pose. If the previous 2D example wasn’t clear enough to convey this message, then look at the following figure, which shows a nice 1D function with a single minimum point.

[Figure: 1D function]

See, starting from any point, the optimization always ends up in the same place. Cool! Does that mean we could create a smart enough scoring function that would give us a cheap way to reach zero RMSD solutions following the above protocol? Of course it does!

Let me present you the perfect docking suite:

  1. First we define our scoring function, or force field, let’s call it Origin Optimized Potential System (OOPS), where the energy of any structure is computed with the following formula:
    [Formula image: OOPS energy function]
    We will use the OOPS scoring function to optimize the receptor-ligand complex prior to docking.
  2. For the second step we need a docking algorithm; let’s call it Zero Rmsd Ideal Docking Engine (ZRIDE). ZRIDE simply assigns zero (0.0) to all atom coordinates of the ligand: (x,y,z) := (0,0,0)
  3. Now, we are ready to calculate the RMSD between the OOPS optimized X-ray pose and the generated docking pose. Since the OOPS energy function has a single minimum point at the origin, any X-ray pose converges with all of its atoms at the origin, thus the RMSD from the docked pose is always ZERO (0.0).

If you still do not believe me, just download the following linux executable files (also available in a tar gzipped package) and try it out yourself for any receptor-ligand complex. The protocol you need to follow for the test:

Step 1. Minimize the X-ray structure in the context of the receptor:

./nanomodel -rec receptor.mol -lig xray.mol -min -out xray_min.mol

Step 2. Generate an input ligand from xray.mol using ANY energy minimizer tool (any force-field or conformation generator). Suppose you have produced input.mol, which has the altered conformation (you may also just copy xray.mol to input.mol if you wish to use the original coordinates).

Step 3. Perform the docking calculation:

./zride -rec receptor.mol -lig input.mol -out result.mol

This will immediately report the RMSD value (always zero, i.e. 0.000A), but you can also use any other external tool to compute the RMSD between result.mol and xray_min.mol. It is VERY important to compare to the minimized xray_min.mol and not to the raw xray.mol, since the latter may have a very bad geometry according to the OOPS force field. You can verify the score difference between them by running:

./oops -rec receptor.mol -lig xray.mol -score
./oops -rec receptor.mol -lig xray_min.mol -score

You will see that the raw X-ray file has a large positive score (repulsive conformation strain energy), while the minimized version has no strain, i.e. the score is zero. :)
This product suite has reached the ultimate docking accuracy following the protocol suggested above, the same one that some peer reviewed publications have followed. So, I rest my case: if that protocol is acceptable to anyone, then look no further, download the perfect solution free of charge today!

Disclaimer: Any resemblance between the above program names and real software tools (living on the market or dead) is an incredible coincidence by random chance.
ZZ

An empty binding site

Monday, May 19th, 2008

Derek Lowe has a blog post that brought to my attention a very interesting paper. The authors show the extreme case of a completely dehydrated empty binding site for the large nonpolar binding cavity in bovine β-lactoglobulin. They state it is not a matter of undetectable spatially delocalized waters — they have used techniques like water 17O and 2H magnetic relaxation dispersion (MRD), 13C NMR spectroscopy, molecular dynamics simulations, and free energy calculations to establish the absence of water from the binding cavity. The site appears to be empty — plain old vacuum. Based on the comments on Derek’s blog, many people find this very intriguing — just like myself.

I have seen many protein structures in the PDB with significant empty gaps, but I always assumed there must be water there with high enough entropy not to be “visible” in the electron density map of the X-ray. But this paper raises the question of whether this is such a rare case, or whether it perhaps happens on a smaller scale quite regularly, i.e. smaller gaps may occur with a significant probability in most hydrophobic pockets. Since this happened even in the crystal, it should happen more often, for short time periods, in the solvated dynamic environment. What would that mean for binding energy estimation and scoring? We need to re-think the de-solvation energy and entropy estimation components. The implicit homogeneous solvent models seem to be even more inadequate than we suspected (not that I trusted them too much anyway).

OK, maybe I should not get carried away, it is only one very strange case so far. But we better keep an eye open for similar reports…

ZZ

Correct protonation state for docking

Saturday, May 17th, 2008

In the previous post I briefly touched the question of protonation in docking and promised to come back with more details at a later time. Well, it turns out I could not go to sleep tonight until I get this out of me :-) So here it comes…

Several of the protein residues and many of the ligand functional groups may exist in different protonation states. In a docking study, the protonation states of the protein residues in the active site and of the functional groups of the ligand are very important. Most of the PDB structures obtained from X-ray crystallography do not contain the hydrogen atoms, because the resolution of the X-ray data is not fine enough to “see” them in the electron density map. On the other hand, most docking software requires the input files to be protonated. Therefore, some software (e.g. a modeling package) has to be used to add the protons. The ligand structures are often taken from a database that may store the structures in a very compact form, e.g. as SMILES strings, and then the protons are again added by some software prior to docking. Many simple software tools will generate the neutral form of the small molecules if possible; e.g. see the picture of aspartic acid on ChemSpider or Wikipedia (although ChemSpider also has the de-protonated charged form, aspartate, as a separate entry). Unfortunately, the neutral form is not the best choice for docking, as this residue is more likely to be in the negatively charged de-protonated form in most proteins under in vivo conditions. But please do notice the wording I used: “more likely” and “most”. Of course, there are lots of exceptions. Due to resonance, the roles of the two oxygen atoms of the carboxylic acid are also symmetrical, so the H can be on either of them, and in 3D in 2 possible positions on each oxygen.

OK, you say, enough of preaching to the choir; of course you are always using protonation states appropriate for physiological pH, so you are fine. Unfortunately, it is not that simple. There are many examples, e.g. metalloenzymes, that require the use of different protonation states; see the paper cited in the previous post. Thanks to the ZINC team, you now have a chance to download ligands in a protonation state specific to certain pH values. Some modeling software will also allow you to protonate your protein according to a given pH value. Hurray, you say, now we are good to go: we just have to decide the appropriate pH value for the given study and we are all set. No, I am afraid it is more complicated than that. Let’s look at an example: a 3D snapshot from the center of the active site of HIV-1 protease. You see the carboxylates of two Asp residues nicely facing each other, at very close contact distance between the oxygens. The only way this is possible is if there is a proton between them making a nice hydrogen bond; otherwise the lone pairs of the two oxygens (with strong partial negative charges on both) would be VERY uncomfortable, to say the least. So, one of them is protonated and the other isn’t. Uh, oh! This isn’t good for the pH rule. Whatever value you choose, it fails, because we have the same residue behaving differently right next to each other.

There are even nastier cases. Let’s look at the serine protease family: trypsin, thrombin, factor Xa. The key Ser-195 residue in the active site attacks the carbon of a peptide, cleaving the amide bond, because its alcohol group has “lost” the proton. Now, an alcohol group is not such a horribly strong acid that it loses its proton so easily, so there must be some really basic conditions in that active site, one might assume. Well, not really, because the natural substrate is the arginine side chain, which must be in a protonated form to be attracted to the negatively charged Asp-189 at the bottom of the pocket (that long-range attraction “lures” the Arg into the pocket, into the perfect position for the Ser-195 to attack the peptide bond at the neck of the residue). The key to this mystery is the Asp-102, His-57, Ser-195 catalytic triad. The mechanism is nicely explained here. The lesson to be learned regarding protonation states is that the local environment is the key. Global notions like pH only work in a nice uniform water solution with millions of copies of a given ligand and some specific ions floating around. In the binding site of a protein there are lots of local constraints that determine the interactions that occur.

Let’s pay a bit more attention to the middle player of that triad: the histidine side chain. This residue on its own is another reason why pH is not sufficient to determine the protonation states. The two nitrogen atoms of the aromatic ring are usually drawn with different connectivity (one has a double bond, the other does not), but in fact, due to the resonance of the aromatic ring and its two tautomeric forms, the roles of the two are completely interchangeable. At physiological pH one of them should be protonated (H-bond donor), while the other should have a lone electron pair (H-bond acceptor) sticking out. However, which is which cannot be decided based on pH. Yet such decisions are crucial for recognizing (scoring) the correct binding pose of a ligand. Sometimes the intra-molecular H-bonding possibilities within the receptor can help to decide the correct state, e.g. in the case of the above mentioned catalytic triad (the Asp-102 dictates that the H must be on that side of the His-57).

Unfortunately, even a full analysis of the receptor site isn’t enough in some cases. There are situations where some receptor residues change protonation states depending on which ligand binds, e.g. HIV-1 protease; see this paper. Of course, the ligand protonation state may also be altered by the active site upon binding. There is no such thing as a “correct” protonation state for a given receptor site without considering the ligand, and there is no correct protonation state for a ligand either without taking into account its interactions in the active site. This means that the only correct treatment of protonation states is to leave the question open until the end of the docking process and make the decision separately for each resulting docking pose, e.g. using this method for the full complex. Of course, changing protonation states isn’t always free; usually it involves an energetic change, e.g. a carboxylate “prefers” to be de-protonated (the lower energy form). Exceptions to this are the histidine flipping the H between the two N atoms, and a protonated carboxylate moving the proton around the 4 possible positions on the two oxygens. So, a perfect scoring function must be able to consider various protonation states and also consider the energy cost associated with switching between them. The latest release of eHiTS (6.2) does the first part, i.e. it samples the protonation states on the fly for the result poses, independently for each functional group. The second part (the energy cost of the various protonation states) will be incorporated in the next release, coming up this summer.

[Figure: protonation example]

For other docking software that does not perform on-the-fly protonation changes, the only correct way to execute a screening study is to prepare multiple copies of each ligand, enumerating all feasible protonation states (increasing the size of the screening library substantially, 150X for the example on the figure above), to also prepare multiple copies of the receptor with all combinations of active site residue protonation states, to perform a separate screening batch for each, and to select the best scoring solutions from all runs. This process raises the CPU time, possibly by several orders of magnitude. So, it is much more efficient to use eHiTS, or better yet eHiTS Lightning ;-)

An update in response to the comments by Andy and Danni:

Let me summarize how I see the process of correct protonation state determination and scoring:

  1. The correct docking pose has to be generated — possibly among several enumerated.
  2. The correct protonation state needs to be determined for each functional group involved — for each pose separately
  3. The energy has to be estimated correctly considering not only the interactions but also the cost of protonation state change relative to the lowest energy form.

Most docking programs will fail already at step 1 if the input receptor/ligand molecules are not in the right protonation state, because they will not put two donors or two acceptors facing each other at close interaction distance. And in the cases mentioned above, where the protonation changes dynamically upon binding as part of the recognition process, the input molecules will NOT be in the correct protonation state. In contrast, eHiTS uses ambivalent flags to indicate the possibility of H/lp (donor/acceptor) variations and will generate all the possible poses, including the correct one.
Step 2 can still be done without QM calculations — it is a discrete optimization problem that can be solved optimally, for example using the method described by Paul Labute at the CCG site. So, now we have multiple poses, each of them with all the involved FG having the appropriate protonation state for the given pose. All this done on the fly in the docking process efficiently. What is missing now from eHiTS is step 3.
Step 3 is what requires the QM calculations, to be able to estimate the energy of each pose accurately and select which one is the correct, lowest energy pose. In my opinion, what we need to do ahead of time with QM is to compute two tables: the hydrogen bonding strengths between various functional groups, and the internal energy differences between the different protonation states of a given FG. Using these pre-computed values, we can estimate the total energy of the system. The question is: how big is the error in the estimate due to the use of pre-cooked FG values, as opposed to computing the energy of the whole system at a high level of theory? I do not know the answer to this question, and therefore I am not yet sure how accurate our estimate will be, but soon we will be able to see the results.
ZZ.

protonation states and docking

Friday, May 16th, 2008

OK, this topic deserves a much longer and more detailed discussion. I will do that another time; for now, just a quick reaction to a question on the zinc-fans mailing list at docking.org:

> Why was there a need to create these subsets based
> on pH of ligands in ZINC database?

Metalloenzymes deprotonate thiols, sulfonamides, and hydroxamic acids, for example. Thus you must create the deprotonated forms to “get the right answer for the right reasons”. See Irwin JJ, Raushel FM, Shoichet BK, “Virtual screening against metalloenzymes for inhibitors and substrates.”, Biochemistry, 2005, 44(37),12316-28. DOI <http://dx.doi.org/10.1021/bi050801k>.


While it is definitely better to use ligands protonated at a specific pH that corresponds to the target environment, as suggested in the response quoted above, rather than neutral forms (which are typically what many software tools create and what dominates databases), I do not think it is sufficiently sophisticated. When a ligand is bound to a protein, the appropriate protonation states should be determined locally for each functional group. There are lots of examples of binding with protonation states that are “unexpected” at physiological pH. The correct choice of protonation states has to consider both the receptor and the ligand environment, with all the surrounding effects (e.g. the serine protease catalytic triad ASP-HIS-SER is capable of deprotonating even an alcohol). Therefore, the protonation state is not something that can be decided a priori without considering the docking pose. One correct (but very time consuming) solution is to enumerate all feasible protonation states and dock each of them. A more efficient correct solution is what eHiTS does: it chooses the correct protonation state on the fly during the docking run, considering the local environment and the score values that can be reached with different states for each functional group, so that a single run can find the correct pose with the right protonation state even if that differs from the state in the input file.

ZZ

Conformance problems: ODF and OOXML

Tuesday, May 13th, 2008

Apparently, the Wikipedia page I linked to in my previous post about ODF supporting software is overly optimistic, according to Peter Sefton’s blog. He demonstrates that only OpenOffice.org and StarOffice work properly with ODF, while the others have serious problems even with very basic formatting. There is also a very useful converter table posted by Peter.

OK, so that brings the ODF conformance count down to 2. However, this is still 2 more than the number of applications that conform to the OOXML standard, which is exactly zero at this moment according to these tests. So, the race is on, and the result is 2:0 so far with ODF in the lead :-)

ZZ. — A proud member of the ODF “cheer squad”

What is wrong with OOXML

Sunday, May 11th, 2008

Peter MR has voiced his opinions on his blog about the use of OOXML for archiving chemistry documents:

The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use. I’ll be demoing it publicly in a week’s time (more later). If we had material in ODT we’d use that, but we don’t.

My worry about Open Office (which emits ODT) is that I don’t yet believe that has reached a state where I could evangelize it without it falling over or being too difficult to install.

I would much rather recommend the ODF (ODT) format, which is a truly open ISO standard (approved on May 1st 2006). OpenOffice.org is only one of the tools that can generate it; there are several others, as well as various converters (e.g. Sun’s MS Office plugin, Clever Age ODF translator) available for MS Word users.
His point is that it is still better to use OOXML than the binary doc format of MS Word. I do not agree with this point. I think OOXML is just as bad as the binary Word doc, for these reasons:

  1. It is a single vendor format with patent encumbered binary extensions — so it might as well be called proprietary. OOXML cannot be implemented by open source software due to incompatibilities with the GPL.
  2. The national bodies have raised over a thousand unique objections about technical details of the format during the ISO process (see also the wiki collection). Fewer than 20 percent of these were discussed during the Ballot Resolution Meeting, and most of those were not resolved to the satisfaction of the opponents. You can find a good collection of remaining problems here.
  3. It has been accepted as a standard via blatant manipulation, ballot stuffing, and corruption at various levels; see some of the history here and here. More irregularities: Poland’s new rule: no vote equals yes, Cuba’s No vote counted as yes, Microsoft friendly “yes-men” invaded Belgium’s Technical Committee, Denmark voted yes by consensus while 50% opposed, interesting vote counting in Croatia (14 No + 3 Yes = Yes), how the Philippines changed their vote from no to yes.
  4. ISO has violated the WTO rules by allowing a duplicative standard to an existing one (ODF), according to Tineke Egyedi, president of the European Academy for Standardisation.
  5. OOXML reinvents the wheel, ignoring and replacing mature standards like SVG, MathML, XForms and even XML. The most prominent example is the neglect of MathML, where OOXML defines its own formula markup language (OOMML).
  6. OOXML requires undisclosed copyrighted material from Microsoft Office. The previous problem of Border Style art being undisclosed was acknowledged and fixed on February 22nd 2008; however, Part 4 2.18.94 ST_TextEffect (Animated Text Effects) describes VML art that is not included in the specification.
  7. OOXML does not provide the Binary to XML mapping which is required to fully represent the existing corpus of user documents. No other application supporting OOXML will be able to faithfully or fully recreate the look of Microsoft’s legacy binary documents. Although the binary Office document specifications have been posted by Microsoft (15 Feb 2008), no standardized mappings were offered during the BRM, as requested by the US, United Kingdom, Brazil, and Malaysia, amongst others.
  8. Markets cannot rely on ISO standards with calculation errors. Spreadsheet formulas still result in calculation errors. Although the CEILING function was recognised to have a legacy bug and was fixed during the BRM, there are more mathematical inaccuracies in OOXML’s spreadsheet functions. The FLOOR function has been identified as having similar mathematical inaccuracies for negative numbers. This is a problem that needs to be carefully studied. We recall that Intel faced a consumer scandal and losses when their new Pentium chip was found to have a calculation error. The Y2K problem, a standardization issue, resulted in billions being spent on damage control.
  9. Macro functionality is not properly defined. Section 2.16.5.41 defines a “MACROBUTTON” field that allows the definition of a button in the document that will trigger a macro. But little is said about how the macro is stored or bound, what APIs are available, or what the security model is for this feature. ECMA’s disposition (approved in batch by the BRM without discussion or opportunity for objection) was something quite different and unsatisfactory. ECMA simply added: “The mechanism by which the command specified by text in field-argument-1 is located and/or executed by an application is implementation-defined.” Unfortunately, with this addition, not only is it impossible to have cross-platform interoperability of this feature, it is unlikely that vendors will be able to implement a reasonable security policy to detect, scan or block macros included in documents.
  10. There are 850+ additional technical problems that were raised during the ISO process and have not been resolved; I will not list all of them here :)
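To make the rounding complaint in item 8 concrete, here is a small Python sketch comparing the mathematical floor and ceiling with a rounding rule that works on the magnitude of a negative number. The `away_from_zero_ceiling` and `toward_zero_floor` functions are my own illustration of how a legacy-style spreadsheet function can behave for negative arguments with a negative significance. They are an assumption for this example and are not copied from the OOXML specification text.

```python
import math


def math_ceiling(x, significance=1.0):
    """Smallest multiple of |significance| that is >= x."""
    step = abs(significance)
    return math.ceil(x / step) * step


def math_floor(x, significance=1.0):
    """Largest multiple of |significance| that is <= x."""
    step = abs(significance)
    return math.floor(x / step) * step


def away_from_zero_ceiling(x, significance):
    """Assumed legacy-style CEILING: rounds the magnitude up, keeps the sign."""
    step = abs(significance)
    return math.copysign(math.ceil(abs(x) / step) * step, x)


def toward_zero_floor(x, significance):
    """Assumed legacy-style FLOOR: rounds the magnitude down, keeps the sign."""
    step = abs(significance)
    return math.copysign(math.floor(abs(x) / step) * step, x)


for name, func, sig in [
    ("math ceiling", math_ceiling, 2),
    ("legacy-style ceiling", away_from_zero_ceiling, -2),
    ("math floor", math_floor, 2),
    ("legacy-style floor", toward_zero_floor, -2),
]:
    print(f"{name}(-2.5, {sig}) = {func(-2.5, sig)}")
```

For -2.5 the two "ceiling" results differ (-2.0 versus -4.0), and so do the two "floor" results (-4.0 versus -2.0), so two applications that both claim conformance can disagree on the same formula unless the standard pins down which rule is meant.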

In a single sentence: OOXML is nothing more than a marketing check-box for Microsoft, so that they can now claim to have an open ISO standard document format, but in reality it is neither open, nor standard by any rational definition of the words.

ZZ.