Archive for the 'Technology' Category

New hope for high resolution PDB structures ?

Friday, January 16th, 2009

Researchers at an I.B.M. laboratory have captured a three-dimensional image of a tobacco mosaic virus using magnetic resonance force microscopy with a spatial resolution down to four nanometers. The New York Times article reports: “Dr. Rugar and others were able to make an image of a single electron with the new technique. The new achievement is the dimensionality of the image. Magnetic resonance force microscopy employs an ultrasmall cantilever arm as a platform for specimens that are then moved in and out of proximity to a tiny magnet. At extremely low temperatures the researchers are able to measure the effect of a magnetic field on the protons in the hydrogen atoms found in the virus”. Since the technique does not require crystallization, it can be used to study structures that have proven elusive for X-ray crystallography (e.g. membrane proteins). I hope it will give us some high resolution 3D data to learn from and to improve our models.

ZZ

IBM’s white paper on the Cell technology and Molecular Modeling

Wednesday, October 22nd, 2008

Over the past year SimBioSys has been focused on the development of eHiTS Lightning. Any of you reading our blog will have seen the many announcements and discussions about WHY we chose the Cell processor as the platform for our development. For those of you who have not seen those comments I point you to our list of blog postings below…all good reading and we certainly do not regret our decision.

The fast and the furious: compare Cell/B.E., GPU and FPGA

The wow factor at the BIO-IT World Expo

Fastest supercomputer built on the Cell/BE

Our peers in the world of software development for virtual screening and docking have talked with us a number of times regarding “How’s it going with the Cell?”, always with interest and generally with a “that’s a major commitment to a platform” in their comments. We are committed to Cell, and we believe we will hear more about this processor for scientific applications in the future. In doing our port we have had the privilege of working with some excellent people at IBM on optimizing it for Cell. We are proud to say that their views can be summarized with “you guys did an excellent port!” from IBM Cell specialists. We have recently finished a white paper with IBM that will be distributed to the scientific community via both of our organizations, and we thank them for their work in putting it together with us. A copy of the white paper is available online here: Using Cell Broadband Engine Technology to Improve Molecular Modeling Applications

Fastest supercomputer built on the Cell/BE

Friday, June 20th, 2008

I already mentioned in May that RoadRunner, currently the world’s fastest supercomputer, is built on Cell BE processors, the same platform that eHiTS Lightning runs on. If the Los Alamos Lab chooses Cell processors, then we chose well!

Looks like the mainstream media is now catching on to the news:

  • Infoworld reports: IBM’s Cell-based RoadRunner supercomputer is world’s fastest.
  • PC World: IBM’s Cell-based RoadRunner Supercomputer Is World’s Fastest
  • GamePro.com: PS3 “Cell” CPU to Power World’s Fastest Supercomputer
  • ITportal.com: IBM’s Roadrunner Runs Fastest
  • ScienceDaily: World-record Supercomputer Mimics Human Sight Brain Mechanisms
  • ITjungle.com: Beep, Beep: Roadrunner Linux Super Breaks the Petaflops Barrier
  • Chemical Engineering News: World’s Fastest Computer Debuts
  • ACM Technews picks: Europe Prays That Cathedrals to Computing Will Help Industry

The last article in the above list highlights: “Roadrunner was built using 6,912 dual-core Opteron processors from Advanced Micro Devices, and 12,960 IBM Cell eDP accelerators. Early tests indicate that the Cell processors have reached 1.33 petaflops while the Opterons reached 49.8 teraflops”. So roughly twice as many Cells (12,960 versus 6,912) produce 26.7 times more crunching power than the dual-core Opterons. In an earlier blog post, I analyzed the advantages of the Cell BE over other acceleration technologies, like GPU and FPGA.
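The ratios can be checked with a few lines of C. This is only a sketch of the arithmetic on the figures quoted above; the function and constant names are made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted in the article above (throughput in teraflops). */
static const double CELL_COUNT     = 12960.0;  /* Cell eDP accelerators   */
static const double OPTERON_COUNT  = 6912.0;   /* dual-core Opteron chips */
static const double CELL_TFLOPS    = 1330.0;   /* 1.33 petaflops          */
static const double OPTERON_TFLOPS = 49.8;

/* Ratio of two positive quantities. */
double ratio(double numerator, double denominator) {
    assert(denominator > 0.0);
    return numerator / denominator;
}

/* Print the chip-count ratio, the throughput ratio and the per-chip ratio. */
void print_roadrunner_ratios(void) {
    double chips = ratio(CELL_COUNT, OPTERON_COUNT);
    double flops = ratio(CELL_TFLOPS, OPTERON_TFLOPS);
    printf("chip ratio: %.3f  throughput ratio: %.1f  per-chip ratio: %.1f\n",
           chips, flops, flops / chips);
}
```

Calling print_roadrunner_ratios() from a main() shows a chip ratio of 1.875 and a throughput ratio of about 26.7, so each Cell delivers roughly 14 times the throughput of one dual-core Opteron in these early tests.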

ZZ

Quality in chemical software - the debate continues

Tuesday, June 3rd, 2008

Peter Murray Rust has responded to my previous blog post and has raised some important points to which I have to respond, see comments section by section:

Quality in chemical software - a debate

But SymBioSys Blog has replied to my post about unit testing in a long and thoughtful post. I don’t know who the individual is but the company sells a number of chemical software packages, a lot of which I recognize from Peter Johnson’s research group at Leeds.

Let me introduce myself: I am Zsolt Zsoldos, Chief Scientific/Technical Officer at SimBioSys. As Peter MR has recognised correctly, some of the software we market has been developed in Peter Johnson’s research group at Leeds, including the Sprout de novo design software which was my PhD project and Peter Johnson was my supervisor, and he is a scientific adviser and a director on the board of SimBioSys. There are a number of publications listed here covering my post-PhD work at SimBioSys as well as various presentations I gave at conferences, just to give some background on my work.

PMR:

I’m confining my remarks to “chemoinformatics” software. I exclude quantum mechanics programs (which take considerable care to publish results and test against competitors) and instrumental software (such as for crystal structure determination and NMR. Any software which comes up against reality has to make sure it’s got the right answers as far as possible. But chemoinformatics largely computes non-observables.

Reproducibility of results and robustness is not the whole story of quality. There are tens of thousands of docking and QSAR studies done each year and many of them are published. Are they reproducible? I expect that if a different researcher in a different institution with different software ran the “same” calculation they would get different results.

I fail to see how the “tens of thousands” of docking studies can be considered to compute “non-observables” when we have tens of thousands of X-ray crystal structures to compare against. How is that less of a reality to come up against than quantum mechanics ? There are experimentally measured binding affinities to compare scoring results against. What better metric does QM have ? There is no exact analytical solution of the Schrödinger equation for molecules of practical size, so all QM software computes approximations and there is no absolute benchmark point to compare against, because we cannot compute the exact solutions.

Are the docking and QSAR study results reproducible ? With eHiTS and LASSO, the answer is definitely YES! I understand that many tools on the docking/QSAR market use stochastic (read random) methods and therefore their results are inherently unreproducible. Again, I can only speak with authority about our own software, which uses strictly deterministic and reproducible techniques. So if a different researcher in a different location runs our software on the same input they will get the same result. However, I do not see how one could run the “same calculation” using a different software. By definition, if you are using a different software (which embodies the calculation) then you are not running the same calculation. I can assure you the same is true for QM software as well, for the simple floating point error reasons I have explained in a previous blog post. So any different QM implementation will necessarily involve computation steps in different orders (as simple as summation in different order will suffice) and therefore get slightly different results.

PMR:

Which manufacturers publish the source code of their algorithms? Without this the user depends completely on trust in the manufacturer.

Hmmm, very good point. Let me see, does Microsoft publish their source code ? No. Then why do they have over 95% market share ? They must be very trustworthy, right ? Then why are they facing anti-trust trials in the US, Europe and Japan ? Perhaps my example is off-topic and off-target, since PMR advocates open source over closed proprietary software and standards, like OpenOffice over MS Office and ODF over OOXML ? Nope, those links prove the exact opposite with statements like:

PMR:

The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use.

My worry about Open Office (which emits ODT) is that I don’t yet believe that has reached a state where I could evangelize it without it falling over or being too difficult to install.

So, let’s just agree that being open source does not automatically guarantee good quality and that, on the other hand, it is also possible to have good quality software that is proprietary. I definitely see and acknowledge the quality values in open source, but in my opinion the open source model requires a critical mass (in terms of number of developers and users) to achieve the “any bug is shallow for many eyes” state of Linux. Whether the user and developer base has reached that level for chemistry software is an interesting question worthy of its own debate. Let’s continue with our current debate:

PMR:

Many communities have annual software and data competitions. They use standard data sets and different groups have to predict observables. Examples are protein structure and crystal structures. In text-mining and information retrieval there are major competitions. They rely on standard data sets (”gold standards”) against which everyone can test their software.

But in chemical software these type of standards are rare. If companies feel strongly about quality they should be doing something publicly. Developing test cases. Collaborating on the publication of Open Standard data. Creating Gold Standards. Developing Ontologies - if we don’t agree what quantity we are calculating then we are likely to get different answers.

Yes, indeed many communities have annual software competitions, including the docking community: for example, the SAMPL competition by OpenEye which Bio-IT World has reported about, or the CASP docking competition as published by Lang et al., J Biomol Screen. 2005; 10: 649-652. As for standard benchmarking data, how about the GOLD validation set, or the more recent Astex diverse validation set specifically designed to be a high quality benchmark set for docking, published as:

    Diverse, High-Quality Test Set for the Validation of Protein-Ligand Docking Performance.
    M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. M. Mooij, P. N. Mortenson, C. W. Murray
    J. Med. Chem., 50, 726-741, 2007.
    [DOI:10.1021/jm061277y]

For binding energy estimation we have the PDB-bind database, and for enrichment studies the DUD data set at docking.org. As for community based collaboration I have personally participated (among many others from the industry and academia) in the eChemInfo “Virtual screening and docking - comparative methodology and best practice” workshop last year at Bryn Mawr College, Philadelphia. A recent special issue of the Journal of Computer-Aided Molecular Design (Vol 22, Num 3-4 March/April 2008 131-266) has been devoted to “Recommendations for Evaluation of Computational Methods for Docking and Ligand-based Modeling”. As demonstrated by these links, it is unfair to say that standards, public data and collaboration do not exist in this area.

ZZ

Research and software testing

Tuesday, June 3rd, 2008

And one major way is writing “unit tests”. Is that boring? Extremely. Do you get publications by writing unit tests? No. Are they simple to write. Not when you start, but they get easier.

Of course, writing unit tests for chemistry software is not chemistry research and so you do not get to write chemistry publications about it. However, it is an active topic in computer science. If you hop over to the ACM digital library and enter the search “unit test”, you get 19,314 hits all in peer reviewed journals, just to show you a few example hits:

Automatic extraction of abstract-object-state machines from unit-test executions
Tao Xie, Evan Martin, Hai Yuan
May 2006
ICSE ‘06: Proceedings of the 28th international conference on Software engineering

Software unit test coverage and adequacy
Hong Zhu, Patrick A. V. Hall, John H. R. May
December 1997
ACM Computing Surveys (CSUR), Volume 29 Issue 4

Carving differential unit test cases from system test cases
Sebastian Elbaum, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil
November 2006
SIGSOFT ‘06/FSE-14: Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering

When you read further Peter’s blog entry you see these statements:

The chemical software and data industry has no tradition of quality. I’ve known it for 30 years and I’ve never seen a commercial company output quality metrics.

Now, this is a bold claim if I have ever seen one. I am sure most commercial vendors who produce chemical software employ computer science or software engineering graduates, who were taught the industry’s standard unit testing and regression practices as part of the standard curriculum. How do I know that ? Because not only do I have a BSc and an MSc myself in computer science (my PhD is in computational chemistry so that does not fall under CS), but I also spent 3 years as a teaching assistant at ELTE Budapest teaching programming methodology courses to CS undergraduates, including these techniques.

Of course, I can only speak about my own chemical software company with authority, so let me elaborate on how we do software testing. Our system consists of several compact software modules with well defined input and output data objects. These modules can be linked into a pipeline to perform complex tasks like docking or retrosynthetic analysis. Each of the modules has a unit test bed, which consists of a test engine, a set of test scripts, some input/output data files and expected error report files. The test engine reads the test script, extracts the input data from the script, executes functions of the module and tests the responses and returned results by comparing them to expected data from the script or data files. There are four distinct types of tests:

Func - functionality test; valid calls and parameters; checking certain scenarios to see if the module functions properly based on the script

Speed - performance test; valid calls and parameters; should be run with optimised compilation, debug turned off; measures speed

Error - testing of the exception handling; valid calls, parameters simulating extreme scenarios (e.g. file does not exist or incorrect file format used) that may happen in valid usage scenario due to wrong data being passed to the program by the user

Robust - robustness test; invalid call sequences and/or parameters to see whether the sanity checks (asserts) are thorough and complete. These tests catch programming errors in the integration pipeline, e.g. NIL pointers passed for required data input or calls made to uninitialized objects.

The last two categories have associated expected error files, where the error messages are listed that are expected to be in the response from the module that is being tested. An example functional test script is here from the MolFragGraph module. As you can see it contains a simple language, one command per line starting with a keyword followed by optional parameters and a data block. Of course, writing such scripts is boring, so we typically write only a few of them when a new module is developed. Then we add code like this to the program:

DBGMESSNLF(DEB_SCRIPT, "SCRIPT: MarkGridHead ClientID=0 NumLines=" << …
    << " NumLineItems=" << …
    << " Low=" << _p_info->unit_min
    << " Dim=" << _p_info->unit_dims
    << " CellSize=" << _p_info->cell_size << "\n");

This is a macro call that is controlled by a debug flag (DEB_SCRIPT). If that flag is turned on at run time, the code outputs a line into the log file, identified by the "SCRIPT:" header and containing one complete test script line along with its parameters and data. When we run an integrated software pipe, we can generate a log file containing the actual data being passed into and output from any given module in the format required by the test bed scripts. This allows us to automatically generate test scripts for any of the modules by running an integrated software pipe on a practical input case.

If we find a bug, then as soon as we reproduce it with a debug version of the code we can generate a test script for each module involved and test them separately to identify the root of the problem. Once the bug is fixed, we can generate the correct expected output of each module for the test case. This comes in very handy for generating regression tests: if later changes of the code break any of the previously fixed functionality, we notice because the corresponding test script fails.

Of course, the running of all these tests is automated in a nightly build and test script. Each module is assigned to a developer who is responsible for it. When a test script fails during the automated nightly test, the developer gets an email notification so he can fix it the next day. As a quality metric we produce tables each night similar to the VTK dash board (I cannot show you our own for confidentiality reasons). We have been doing development with quality control at SimBioSys since the start of the company in 1996. I have also worked at a larger software company for medical imaging, where software development was carried out under an ISO 9001 certified methodology, and I have implemented the same principles (with some more automation) at SimBioSys even though we have not applied for the certification, which is a long bureaucratic process with a significant cost.
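The log-to-script idea can be sketched in plain C. This is not the SimBioSys implementation: SCRIPT_LOG, debug_flags and extract_script are hypothetical names, the macro uses printf-style formatting instead of the streaming form shown above, and log lines are assumed to be shorter than 1024 characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define DEB_SCRIPT 0x1u              /* hypothetical debug flag bit */
static unsigned debug_flags = 0;     /* switched on at run time     */

/* Write one test-script command into the log, but only when the
 * DEB_SCRIPT flag is on. The "SCRIPT: " header marks the line. */
#define SCRIPT_LOG(log, ...)                 \
    do {                                     \
        if (debug_flags & DEB_SCRIPT) {      \
            fputs("SCRIPT: ", (log));        \
            fprintf((log), __VA_ARGS__);     \
            fputc('\n', (log));              \
        }                                    \
    } while (0)

/* Copy every "SCRIPT: " line of a log into a script file, dropping the
 * header and ignoring all other log output. Returns the number of
 * script lines written. */
int extract_script(FILE *log, FILE *script) {
    static const char header[] = "SCRIPT: ";
    const size_t header_len = sizeof header - 1;
    char line[1024];
    int count = 0;

    assert(log != NULL && script != NULL);
    while (fgets(line, sizeof line, log) != NULL) {
        if (strncmp(line, header, header_len) == 0) {
            fputs(line + header_len, script);
            ++count;
        }
    }
    return count;
}
```

A module would call something like SCRIPT_LOG(logfile, "MarkGridHead ClientID=%d", 0) at its entry points. Running extract_script over the log of a real pipeline run then yields a script that replays exactly the calls the module received.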

So what is the take-home message from this post? That software unit and regression testing is a very important and serious (although boring) part of chemistry software development, and it is not limited to (nor invented by) open source groups like the Blue Obelisk, which is NOT the only place for software and data quality, contrary to what PMR would like you to believe.

ZZ

The future of HPC

Thursday, May 29th, 2008

On Tuesday (May 27) I attended the SHARCNET Symposium on GPU and CELL Computing at the University of Waterloo. There were speakers from IBM, AMD, NVIDIA and Ben Berger from Los Alamos, where the new fastest supercomputer on earth is running the benchmarks as we speak (look out for the official announcement about breaking the PetaFLOPS barrier on June 10th). The common theme I heard from all hardware manufacturers is that the future is about many-core technologies. Moore’s law still holds up in its original form, i.e. that the number of transistors packed into a chip doubles every 18-24 months. For several decades, up to about 2003, it translated into exponential growth in processor speed. The clock speed increase has stopped under 4 GHz due to diminishing returns (energy requirements and heat increase quadratically and have reached the point where they become unmanageable; passive power crossed over active power). The new trend is to keep the clock speed steady (at around 2-3 GHz) but increase the number of parallel computation cores. Intel and AMD have quad-core chips on the market, and 8-core CPUs are around the corner. At the same time GPU accelerators already pack hundreds of cores into a chip at lower speeds, while the Cell BE has 8 vector processor cores (equivalent to 64 individual cores). With the exponential growth of Moore’s law we can expect thousands of cores in the CPU of our desktop/laptop within ten years. However, to make use of this kind of parallel power, the software world needs to undergo a major change! The days of the lazy programmers are over: we cannot sit back and wait for a faster processor if our program is too slow. Single execution threads will not get any faster, so we need to make our code capable of running in a massively parallel way, and that is not easy.

Michael Perrone from IBM started his talk with a story about HP expecting a 2X performance increase when they introduced the first dual core computers on the market but only getting about 1.7X; when they went from 2-core to 4-core they expected another 1.7X but only got 1.35X. So what should they expect from 8-core over 4-core ? How about 16, 32, 64 cores ? Will the curve soon flatten out so that we do not get any more speed-up ? The answer is: it is all about the data. Memory bandwidth is not keeping up with the computation speed, so it is no use increasing the computation capabilities if we are unable to feed the beast (food=data, beast=cpu-core). And now we have to start feeding many beasts and they will multiply exponentially.
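To put the quoted numbers in perspective, here is a small sketch that turns the two step-by-step gains into a cumulative speed-up and a parallel efficiency (speed-up divided by core count). Only the 1.7X and 1.35X figures come from the talk; the function names are made up for the illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Parallel efficiency: achieved speed-up divided by the number of cores. */
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / (double)cores;
}

/* Print the cumulative speed-up and efficiency for the quoted steps. */
void print_scaling(void) {
    /* Gains quoted in the talk: 1 -> 2 cores, then 2 -> 4 cores. */
    const double step_gain[] = { 1.7, 1.35 };
    double cumulative = 1.0;
    int cores = 1;

    for (int i = 0; i < 2; ++i) {
        cores *= 2;
        cumulative *= step_gain[i];
        printf("%d cores: speed-up %.3f, efficiency %.0f%%\n",
               cores, cumulative,
               100.0 * parallel_efficiency(cumulative, cores));
    }
}
```

With these figures, two cores run at about 85% efficiency and four cores at about 57% (a cumulative speed-up of roughly 2.3X), which is the flattening curve the questions above are pointing at.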

Peter Murray Rust is asking on his blog: Where should we get our computing ? The answer is: from the multi-core accelerator technologies, like GPGPU and Cell BE. His worries about hardware cost and management can be reduced by 50-100 fold using these accelerators. It is no accident that the RoadRunner supercomputer is built on Cell BE processors for the computing (with the communication and file I/O being handled by AMD Opterons), beating the previous fastest HPC system benchmark (held by IBM’s BlueGene) by over 4X.

As for the GPGPU versus Cell BE angle, this symposium has reinforced my belief that the Cell BE is a general purpose accelerator suitable for any task (just like a CPU) while the GPUs from AMD and NVIDIA are highly specialized tools that can get great performance for a very specific subset of problems. GPUs were designed for graphics, where the computation tasks are massively parallel (millions of 3D points and triangles to process) and completely independent (what needs to appear on each pixel is independent of the others and so is the computation to be performed for different 3D points). Tasks that have these properties are suitable for GPGPU, e.g. image processing, some physics simulations (material science, plasma, laser, particles) and even some chemistry problems, like molecular dynamics simulation if one wants to compute the full atom pair matrix of forces. However, as soon as you want to be smart and compute only forces within a cut-off range (which itself can gain a hundred fold speed-up if you work with proteins) and/or need dynamically changing data size or inter-dependencies (like an N-body problem or QM), then the GPU is not a good choice. There can be non-trivial performance hurdles even for seemingly fitting problems, like image processing. Michael Kinsner brought up an example in his talk, where he had to learn the hard way that processing image blocks of 16×4 was fast, but 8×8 was much slower due to some peculiar memory access pattern issue: the input data pattern of the code has to map directly to the underlying hardware architecture to get good performance on the GPU.

On the other hand, the Cell BE is an extension of the CPU architecture, completely general purpose, and solves the memory access (hungry beast) problem by giving full control into the hands of the programmer via direct programming of 9 separate memory flow controllers and a huge 300GB/s data pipe. Of course, such control means the programming isn’t easy and worry free, but we have the means. The challenge is upon us to program the beast so it does not starve.

ZZ

Floating point errors

Wednesday, May 28th, 2008

On the CCL mailing list Sina Türeli has posted this question:

“I am working on a project related to proteins and my precision for coordinates is there digits after the dot. When I do operations like rotation around a dihedral, the dihedrals which shouldn’t change change at about 0.01 angles and so. I am afraid though that is not much it might accumulate over time. So do you have any suggestions for reducing floating integer errors? Would’t be feasible if I turned lets say the coordinate 21.567 to 21567 and do my operations? Or maybe even 215670? “

I have spent a lot of time fighting similar problems, so I know how annoying this can be. To understand what is happening, look at how floating point numbers are stored and operated upon according to the IEEE 754 standard. Because of the binary representation, our decimal fractional numbers do not map exactly to float/double numbers. Even though the 23 fraction bits in a single precision floating point number map roughly to 6 decimal digits of precision, that still does not mean that all 3 digit decimal numbers can be represented precisely. On the other hand, integers can be stored exactly, which gives a basis to the idea of storing the number 21.567 as 21567 or 215670. This would work for storing the numbers and also for applying some basic arithmetic (addition, subtraction and multiplication) to them accurately, without any error. However, division starts to cause problems, and any trigonometric function or the square root function makes the problem much worse than what you have with floating point numbers. Those functions generally produce irrational numbers, i.e. numbers that cannot be represented as a ratio of two integers. Unfortunately, rotations are typically defined by angles and the coordinate transformation requires the sine and cosine of the angle. Depending on the task, the required transformation can sometimes be expressed in other ways, e.g. by quaternions, and sometimes we can compute transformations by simpler arithmetic if the goal is to transform some atoms to specific positions, e.g. an overlay (rather than rotation by a given angle). So, in short, using an integer fixed point representation will not solve the problem of coordinate drifting errors in 3D transformations, especially if rotations by a given angle are required. On the other hand, it can solve simpler problems.

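The storage part of the argument can be made concrete with a short sketch. It measures the error of keeping a coordinate such as 21.567 in a float and converts the same coordinate to integer thousandths. FIXED_SCALE, to_fixed and float_storage_error are names invented for this illustration, and the range check is an arbitrary safety margin.

```c
#include <assert.h>
#include <stdio.h>

#define FIXED_SCALE 1000L   /* three decimal digits, as in the question */

/* Convert a coordinate to integer thousandths, rounding to nearest. */
long to_fixed(double x) {
    double scaled = x * (double)FIXED_SCALE;
    assert(scaled > -2.0e9 && scaled < 2.0e9);   /* keep within long range */
    return (long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));
}

/* Error introduced by storing x in a single precision float. */
double float_storage_error(double x) {
    float stored = (float)x;
    return (double)stored - x;
}

/* Show both representations of one coordinate. */
void show_coordinate(double x) {
    printf("float: %.9f  storage error: %.3g  fixed: %ld\n",
           (double)(float)x, float_storage_error(x), to_fixed(x));
}
```

For 21.567 the float copy is off by a few parts in ten million, while the integer 21567 is exact. A value like 21.5, which happens to be a binary fraction, is stored in a float without any error.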
How bad is the floating point problem and what can be solved by fixed point integer arithmetic ? Let me give you an example: you learned in school that a+b+c = c+a+b. Well, this simple rule breaks even for floating point numbers with a single decimal digit after the point! Consider the following simple C code (you can download the source, or a Linux binary):

#include <stdio.h>
#include <string.h>

/* Return the raw bit pattern of a float (assumes a 32-bit unsigned int). */
static unsigned bits(float f) {
    unsigned u;
    memcpy(&u, &f, sizeof u);
    return u;
}

int main(void) {
    float a, b, c, d, s;
    for (a = 0.1; a < 9.9; a += 0.2) {
        for (b = 0.4; b < 9.9; b += 0.1) {
            for (c = 0.7; c < 9.9; c += 0.3) {
                s = a + b;   /* a+b+c, summed left to right */
                s += c;
                d = c + a;   /* c+a+b, summed left to right */
                d += b;
                if (s != d) {
                    printf("a=%f, b=%f, c=%f, a+b+c=%f, c+a+b=%f hex: a=0x%08x b=0x%08x c=0x%08x s=0x%08x d=0x%08x\n",
                           a, b, c, s, d, bits(a), bits(b), bits(c), bits(s), bits(d));
                }
            }
        }
    }
    return 0;
}

You can see that if the summation in two orders were the same for all tested cases it would never print anything. However, when you run it, you get plenty of examples (34585 cases on my computer, about one third of all cases tested) printed where the rule breaks. Here are the first few examples you get:

a=0.100000, b=0.400000, c=3.700000, a+b+c=4.200000, c+a+b=4.199999
hex: a=0x3dcccccd b=0x3ecccccd c=0x406ccccb s=0x40866666 d=0x40866665
a=0.100000, b=0.400000, c=7.600002, a+b+c=8.100002, c+a+b=8.100001
hex: a=0x3dcccccd b=0x3ecccccd c=0x40f33337 s=0x4101999c d=0x4101999b
a=0.100000, b=0.500000, c=2.500000, a+b+c=3.100000, c+a+b=3.100000
hex: a=0x3dcccccd b=0x3f000000 c=0x401fffff s=0x40466666 d=0x40466665

As you can see from the hexadecimal version, the difference is only in the last bit. Nevertheless, it is enough to throw off the result. Imagine that you sum up score components, sort them and select only the best N solutions. Suddenly, you may keep or lose a solution depending on the order of summation. Now, that is scary… This problem cannot be solved by using double precision, but it can be solved simply by using a fixed point integer representation. However, fixed point does not help with the rotation problem, I am afraid.
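Both claims can be checked with a variant of the program above. The sketch below repeats the experiment in double precision and with integer tenths; it uses its own grid (all triples of tenths up to a chosen limit) rather than the loops of the original program, and the function names are invented for the illustration.

```c
#include <assert.h>

/* (a + b) + c in double precision. The volatile qualifier forces every
 * partial sum to be rounded to double, even on hardware that keeps
 * wider intermediate results. */
double sum_abc(double a, double b, double c) {
    volatile double s = a + b;
    s += c;
    return s;
}

/* (c + a) + b in double precision. */
double sum_cab(double a, double b, double c) {
    volatile double d = c + a;
    d += b;
    return d;
}

/* Count the triples of tenths (0.1 .. max_tenths/10) whose sum depends
 * on the summation order. With use_fixed != 0 the values are kept as
 * integer tenths instead of being converted to double. */
long count_mismatches(int use_fixed, long max_tenths) {
    long count = 0;
    assert(max_tenths >= 1);
    for (long i = 1; i <= max_tenths; ++i)
        for (long j = 1; j <= max_tenths; ++j)
            for (long k = 1; k <= max_tenths; ++k) {
                if (use_fixed) {
                    if ((i + j) + k != (k + i) + j) ++count;
                } else {
                    double a = i / 10.0, b = j / 10.0, c = k / 10.0;
                    if (sum_abc(a, b, c) != sum_cab(a, b, c)) ++count;
                }
            }
    return count;
}
```

A well-known double precision case is a=0.2, b=0.3, c=0.1: (0.2+0.3)+0.1 gives 0.6, while (0.1+0.2)+0.3 gives 0.6000000000000001. The integer version never reports a mismatch, because integer addition is exact as long as it does not overflow.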

ZZ

Conformance problems: ODF and OOXML

Tuesday, May 13th, 2008

Apparently, the Wikipedia page I linked to in my previous post about ODF supporting software is overly optimistic according to Peter Sefton’s blog. He demonstrates that only OpenOffice.org and StarOffice work properly with ODF while others have serious problems even with very basic formatting. There is also a very useful converter table posted by Peter.

OK, so that brings the ODF conformance count down to 2, however, this is still 2 more than the number of applications that conform to the OOXML standard, which is exactly zero at this moment according to these tests. So, the race is on, the result is 2:0 so far with ODF in the lead :-)

ZZ. — A proud member of the ODF “cheer squad”

What is wrong with OOXML

Sunday, May 11th, 2008

Peter MR has voiced his opinions on his blog about the use of OOXML for archiving chemistry documents:

The reason I currently like OOXML is that we can make it work and that we have material in Word that we can use. I’ll be demoing it publicly in a week’s time (more later). If we had material in ODT we’d use that, but we don’t.

My worry about Open Office (which emits ODT) is that I don’t yet believe that has reached a state where I could evangelize it without it falling over or being too difficult to install.

I would much rather recommend the ODF (ODT) format, which is a truly open ISO standard (approved on May 1st 2006). OpenOffice.org is only one of the tools that can generate it; there are several others, as well as various converters (e.g. Sun’s MS Office plugin, Clever Age ODF translator) available for MS Word users.

His point is that it is still better to use OOXML than the binary doc format of MS Word. I do not agree with this point. I think OOXML is just as bad as the binary Word doc, for these reasons:

  1. It is a single vendor format with patent encumbered binary extensions — so it might as well be called proprietary. OOXML cannot be implemented by open source software due to incompatibilities with the GPL.
  2. The national bodies have raised over a thousand unique objections about technical details of the format during the ISO process (see also the wiki collection), less than 20 percent of which were discussed during the Ballot Resolution Meeting, and most of those were not resolved to the satisfaction of the opponents. You can find a good collection of remaining problems here
  3. It has been accepted as a standard via blatant manipulation, ballot stuffing, corruption in various levels, see some of the history here and here. More irregularities: Poland’s new rule: no vote equals yes, Cuba’s No vote counted as yes, Microsoft friendly “yes-men” invaded Belgium’s Technical Committee, Denmark voted yes by consensus while 50% opposed, interesting vote counting in Croatia (14 No + 3 Yes = Yes), how the Philippines changed their vote from no to yes.
  4. ISO has violated the WTO rules by allowing a duplicative standard to an existing one (ODF), according to Tineke Egyedi, president of the European Academy for Standardisation.
  5. OOXML reinvents the wheel, ignoring and replacing mature standards like SVG, MathML, XForms and even XML. The most prominent example is the neglect of MathML, where OOXML defines its own formula markup language (OMML).
  6. OOXML requires undisclosed copyrighted material from Microsoft Office. The previous problem of Border Style art being undisclosed was acknowledged and fixed on February 22nd 2008 however Part 4 2.18.94 ST_TextEffect (Animated Text Effects) describes VML art that is not included in the specification.
  7. OOXML does not provide the Binary to XML mapping which is required to fully represent the existing corpus of user documents. No other application supporting OOXML will be able to faithfully or fully recreate the look of Microsoft’s legacy binary documents. Although the binary Office document specifications have been posted by Microsoft (15 Feb 2008), no standardized mappings were offered during the BRM, as requested by the US, United Kingdom, Brazil, and Malaysia, amongst others.
  8. Markets cannot rely on ISO standards with calculation errors. Spreadsheet formulas still result in calculation errors. Although the CEILING function was recognised to have a legacy bug and was fixed during the BRM, more mathematical inaccuracies exist in OOXML’s spreadsheet functions. The FLOOR function has been identified as having a similar mathematical inaccuracy for negative numbers. This is a problem that needs to be studied carefully. We recall that Intel faced a consumer scandal and losses when their new Pentium chip was found to have a calculation error. The Y2K problem, a standardization issue, resulted in billions being invested in damage control.
  9. Macro functionality is not properly defined. Section 2.16.5.41 defines a “MACROBUTTON” field that allows the definition of a button in the document that will trigger a macro. But little is said about how the macro is stored or bound, what APIs are available, or what the security model is for this feature. ECMA’s disposition (approved in batch by the BRM without discussion or opportunity for objection) was something quite different and unsatisfactory. ECMA simply added: “The mechanism by which the command specified by text in field-argument-1 is located and/or executed by an application is ‘implementation-defined’.” Unfortunately, with this addition, not only is it impossible to have cross-platform interoperability of this feature, it is also unlikely that vendors will be able to implement a reasonable security policy to detect, scan or block macros included in documents.
  10. More than 850 additional technical problems were raised during the ISO process and have not been resolved; I will not list all of them here :)

In a single sentence: OOXML is nothing more than a marketing check-box for Microsoft, so that they can now claim to have an open ISO standard document format, but in reality it is neither open, nor standard by any rational definition of the words.

ZZ.

The fast and the furious: compare Cell/B.E., GPU and FPGA

Saturday, May 3rd, 2008

For decades we were spoiled by Moore’s law translating directly into an exponential speed increase. CPU clock rates rose exponentially until they reached 3GHz in 2003, but for the last 5 years they seem to have been stuck at that point. Instead, manufacturers now pack multiple cores into a chip. People have started to look for alternative ways to get faster computation (see MRSC 2008 conf.): Field Programmable Gate Arrays (FPGA), General Purpose computing on Graphics Processing Units (GPGPU) and most recently the Cell Broadband Engine (Cell/B.E.) from IBM-Sony-Toshiba.

Tony Williams over at the ChemConnector Blog has had a couple of people ask him which way to go and which one is better for a particular application. We’ve just invested two man-years of effort porting to the Cell/B.E., so not only do I have strong opinions, I also have enough “hands-on experience” to comment!
The April issue of Bio-IT World had an article about the use of GPUs for scientific computing. I then chatted at the Bio-IT World Expo with Attila Berces (CEO of Chemistry Logic), an FPGA expert who had presented a similarity search system implemented on FPGA. Meanwhile, we presented our docking software running on the Cell/B.E. With all these angles fresh in my mind, I have put together a comparative analysis.

Performance and capabilities

An FPGA allows hardware-level wiring of decision logic. It excels at integer arithmetic, but floating point operations are difficult to encode and do not perform very well compared to traditional CPUs, because CPUs run at several GHz while FPGAs have clock speeds of a few hundred MHz. Decision logic (branching) is bad for the deep pipelines of the CPU, GPU and Cell, but natural to the FPGA. Its parallelism can be very wide and massive, not limited by a fixed register width (128 or 256 bits for Cell and GPU). Therefore, the FPGA shines for logic-intensive tasks that do not need floating point calculations, e.g. discrete math graph algorithms, searching, matching and gene sequence alignment.

GPU and Cell/B.E. are close cousins from a hardware architecture point of view. Both rely on Single Instruction Multiple Data (SIMD) parallelism, a.k.a. vector processing; both run at high clock speed (>3GHz); and both implement floating point operations using RISC technology, achieving single-cycle execution even for complex operations like reciprocal or square root estimates. These come in very handy for 3D transformations and distance calculations (used a lot in both 3D graphics and scientific modeling). Both manage to pack over 200 GFlops (billions of floating point operations per second) into a single chip. They are excellent choices for applications like 3D molecular modeling, MM force field computations, docking, scoring, flexible ligand overlay and protein folding. There are some subtle differences between the two. For example, the Cell/B.E. supports double precision calculations while GPUs do not (though Nvidia is doing some work in that direction), which makes the Cell/B.E. the only suitable choice for quantum chemistry calculations. There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmer via direct DMA programming. This allows developers to keep “feeding the beast” with data using double buffering techniques, without ever hitting a cache miss that stalls the computation. Another difference is that GPUs use wider, 256-bit registers, while the Cell/B.E. uses 128-bit registers but has a dual pipeline that allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory look, but there is again a subtle difference. 128 bits house 4 floats, enough for a 3D transformation row or point coordinate (typically extended to 4 components instead of 3 to handle perspective), so the Cell/B.E. can execute 2 different operations on them, while the GPU can only apply the same operation to more data.
If the purpose is to apply one operation to a lot of data, the two come down to the same thing, but a more complex series of computations on a single 3D matrix can be done twice as fast on the Cell/B.E. The 8 Synergistic Processor Units of the Cell/B.E. can transfer data between each other’s memory via a bus with 192GB/s bandwidth, while the fastest GPU (GeForce 8800 Ultra) has a bandwidth of 103.7 GB/s and all others fall well below 100GB/s. The high-end GPUs have over 300GFlops of theoretical throughput, but due to memory bus speed limitations and cache miss latency, the practical throughput falls far short of that. The Cell/B.E., in contrast, has demonstrated benchmark results (e.g. for a real-time ray tracing application) far superior to those of the G80 GPU, despite its lower theoretical throughput.
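To make the "4 floats per 128-bit register" point concrete, here is a minimal sketch in portable C++. It only models the data layout (a point extended to 4 components and a transformation stored as 4 rows of 4 floats); the names Vec4, Mat4, dot and transform are made up for this illustration, and the loop stands in for what would be a single vector instruction on real SIMD hardware. It is not code from eHiTS.

```cpp
#include <array>

// Four 32-bit floats: the payload of one 128-bit SIMD register.
using Vec4 = std::array<float, 4>;
// A 3D transformation stored as four rows of four floats each.
using Mat4 = std::array<Vec4, 4>;

// Lane-wise multiply followed by a sum over the four lanes.
// On SIMD hardware the multiply is one vector operation; here it is a plain loop.
inline float dot(const Vec4& a, const Vec4& b) {
    float sum = 0.0f;
    for (int lane = 0; lane < 4; ++lane) {
        sum += a[lane] * b[lane];
    }
    return sum;
}

// Transform a point (x, y, z, w) by a 4x4 matrix: one dot product per matrix row.
inline Vec4 transform(const Mat4& m, const Vec4& p) {
    return { dot(m[0], p), dot(m[1], p), dot(m[2], p), dot(m[3], p) };
}
```

For example, with a matrix whose rows are (1,0,0,10), (0,1,0,20), (0,0,1,30), (0,0,0,1), the point (1,2,3,1) is moved to (11,22,33,1). The fourth component is what lets a translation be expressed as a matrix row.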

Cost comparison

A fair cost comparison requires measuring roughly equivalent processing power, but that is difficult because the FPGA is better at logic and integer computation, while the GPU and the Cell/B.E. are better at floating point computation. So which benchmark should we choose? I decided to use a chart from Attila Berces in which he compares an FPGA solution to 400 Intel CPU cores. Let’s use that as a reference performance point and see how many Cell/B.E. and GPU units we need to reach it. We also have to differentiate between theoretical throughput and practical sustained throughput (see above). I have chosen the practical throughput as the basis of the comparison in the table below:

Costs                         400 CPU cluster   FPGA    GPU     Cell BE
Hardware purchase             $200K-$400K       $60K    $30K    $4K-$40K
Electricity (power+cooling)   $180K-$360K       $6K     $18K    $3K
Total cost                    $380K-$760K       $66K    $48K    $7K-$43K

The range in the cost of the Cell/B.E. solution is due to the very different price points of the various options: the cheapest is the Sony PS3 at $400, providing 6 usable SPE cores; the Mercury CAB is about $8,000, providing 8 SPEs; and the IBM QS21 blade is about $10,000 with 16 SPEs. High-end GPUs have price points around one thousand dollars. FPGAs have a high entry point, which formed the basis of the FPGA figures in the table above.
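The spread between the Cell/B.E. options is easier to see as a price per SPE. The sketch below just divides the approximate prices quoted above by the SPE counts; the struct and function names are invented for this illustration, and the prices are the rough 2008 figures from the text, not current quotes.

```cpp
// One way to buy Cell/B.E. compute, using the approximate figures quoted above.
struct CellOption {
    const char* name;
    double priceUsd;    // approximate purchase price in US dollars
    int usableSpes;     // number of SPE cores available to the application
};

// Purchase price divided by the number of usable SPE cores.
inline double costPerSpe(const CellOption& option) {
    return option.priceUsd / option.usableSpes;
}

// The three options mentioned in the text.
const CellOption kSonyPs3    = {"Sony PS3",    400.0,   6};
const CellOption kMercuryCab = {"Mercury CAB", 8000.0,  8};
const CellOption kIbmQs21    = {"IBM QS21",    10000.0, 16};
```

With these inputs, costPerSpe gives roughly $67 per SPE for the PS3, $1,000 for the Mercury CAB and $625 for the QS21 blade. This ignores everything except the purchase price (memory size, interconnect, support), so it only illustrates why the hardware row for the Cell/B.E. spans a factor of ten.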

Programming effort, compatibility

Last but not least, let me address the programming effort needed to make use of these acceleration techniques. We have completed our first port of the eHiTS docking software from scalar Intel code to the Cell/B.E. We ported to the SPUs about 10% of the code, the part responsible for over 98% of the CPU time, amounting to a bit over 21,500 lines of code. The total effort, including the learning curve, took about 2 man-years of work. That may seem like a lot, but you have to consider not only the learning aspect (the technology was completely new to us when we started), but also that we went down to the lowest, assembly-level performance tuning, counting individual operation cycles and analyzing every single pipeline stall in the tight loops until the code was perfectly streamlined to run at near-peak performance. The vectorization (SIMD data arrangement and operations) would also have been necessary had we targeted GPUs. Programming GPUs has traditionally been much more complicated, done via fake OpenGL graphics calls. Recently, both Nvidia and AMD have issued libraries with more convenient APIs for programming GPUs for general-purpose computations. Nevertheless, you still need to transform the entire code and computation sequence into those API calls. In contrast, you can simply compile your existing C or C++ code for the Cell/B.E. SPU using a variant of the gcc compiler. Of course, if you only do that much, you will not reach very high performance: your code is still scalar, so all you gain is running on multiple cores (up to 8X performance, but due to branch penalties it is more likely to be around 4X). The advantage is that you can start out this way, with your code already on the SPUs and running about 4 times faster after a few weeks of work for a large application. Then you can profile where the bulk of the time is spent and focus your efforts on optimizing and vectorizing only the most important pieces of code.
In comparison, both GPU and FPGA require all-or-nothing commitment and effort. The effort required for FPGA is far more significant (several orders of magnitude) because the code has to be taken down way beyond the assembly coding level, all the way to the microelectronics gate logic level.
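Why it pays to port only the hot 10% of the code can be sketched with Amdahl's law. The post does not use this formula; it is brought in here as a standard textbook model, and the function name is made up for the example. If a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s).

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a fraction of the run time
// (acceleratedFraction, between 0 and 1) is made 'factor' times faster
// and the remainder runs at its original speed.
inline double overallSpeedup(double acceleratedFraction, double factor) {
    assert(acceleratedFraction >= 0.0 && acceleratedFraction <= 1.0);
    assert(factor > 0.0);
    return 1.0 / ((1.0 - acceleratedFraction) + acceleratedFraction / factor);
}
```

Taking p = 0.98 as an assumed round figure, overallSpeedup(0.98, 4.0) is about 3.8 and overallSpeedup(0.98, 8.0) is about 7.0, so the unported 2% costs little at that stage. The same model also says that with exactly 98% ported the overall gain could never exceed 50X however fast the ported part becomes; the text says "over 98%", so the real ceiling is somewhat higher.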
So while, as I described in our white paper, the Cell/B.E. requires a different kind of thinking and coding than a traditional CPU, the same is true for the GPU and the FPGA, and these latter ones require significantly more effort. Another important point is code compatibility and maintenance on multiple platforms. We have done all our vectorization and porting using C++ wrapper classes and functions for which we have two translations: one to the direct Cell/B.E. intrinsic API and another to simple scalar C code. This way we now have a single code base that runs both on the Cell/B.E. and on Intel/AMD platforms. In fact, the vectorization has slightly benefited the Intel code too: it runs about 10% faster than before the port. Of course, that is nothing compared to the 50-fold speedup we reached on the Cell/B.E. If you choose GPU or FPGA, then you need to maintain very different code bases for those and for traditional CPUs.
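A wrapper with two translations could look roughly like the sketch below. This is not the eHiTS code: the Float4 type and the scaleAndShift function are hypothetical names invented for this example. The Cell/B.E. branch assumes that the SPU compiler defines __SPU__ and that spu_intrinsics.h provides vec_float4, spu_add, spu_mul and spu_extract; that branch has not been compiled or tested here, only the scalar fallback has been checked.

```cpp
// Hypothetical 4-float wrapper type with two backends selected at compile time.
#ifdef __SPU__
#include <spu_intrinsics.h>   // Cell/B.E. SPU build only (assumed, untested here)

struct Float4 {
    vec_float4 v;
    Float4(float a, float b, float c, float d) { vec_float4 t = {a, b, c, d}; v = t; }
    explicit Float4(vec_float4 raw) : v(raw) {}
    float lane(int i) const { return spu_extract(v, i); }
};
inline Float4 operator+(const Float4& a, const Float4& b) { return Float4(spu_add(a.v, b.v)); }
inline Float4 operator*(const Float4& a, const Float4& b) { return Float4(spu_mul(a.v, b.v)); }

#else   // scalar fallback for ordinary CPUs

struct Float4 {
    float v[4];
    Float4(float a, float b, float c, float d) : v{a, b, c, d} {}
    float lane(int i) const { return v[i]; }
};
inline Float4 operator+(const Float4& a, const Float4& b) {
    return Float4(a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2], a.v[3] + b.v[3]);
}
inline Float4 operator*(const Float4& a, const Float4& b) {
    return Float4(a.v[0] * b.v[0], a.v[1] * b.v[1], a.v[2] * b.v[2], a.v[3] * b.v[3]);
}

#endif

// Application code is written once against Float4 and compiles for either backend.
inline Float4 scaleAndShift(const Float4& p, const Float4& scale, const Float4& shift) {
    return p * scale + shift;
}
```

For instance, scaleAndShift(Float4(1, 2, 3, 4), Float4(2, 2, 2, 2), Float4(1, 1, 1, 1)) yields lanes 3, 5, 7 and 9 on the scalar backend. The design choice is that only the small wrapper layer knows about the platform, so the bulk of the code base stays common.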

So, I hope I have managed to provide a good overview of the differences between FPGAs, GPUs and the Cell. I’m clearly biased but, I believe, rightly so!

ZZ