The future of HPC
On Tuesday (May 27) I attended the SHARCNET Symposium on GPU and CELL Computing at the University of Waterloo. There were speakers from IBM, AMD and NVIDIA, as well as Ben Berger from Los Alamos, where the new fastest supercomputer on earth is running its benchmarks as we speak. Look out for the official announcement about breaking the PetaFLOPS barrier on June 10th. The common theme I heard from all the hardware manufacturers is that the future is about many-core technologies.
Moore’s law still holds in its original form, i.e. the number of transistors packed into a chip doubles every 18-24 months. For several decades, up to about 2003, this translated into exponential growth in processor speed. The clock speed increase has stopped under 4 GHz due to diminishing returns: energy requirements and heat increase quadratically and have reached the point where they become unmanageable (passive power has crossed over active power). The new trend is to keep the clock speed steady (at around 2-3 GHz) but increase the number of parallel computation cores. Intel and AMD have quad-core chips on the market, and 8-core CPUs are around the corner. At the same time, GPU accelerators already pack hundreds of cores into a chip at lower clock speeds, while the Cell BE has 8 vector processor cores (equivalent to 64 individual cores). With the exponential growth of Moore’s law we can expect thousands of cores in the CPU of our desktop or laptop within ten years.
However, to make use of this kind of parallel power, the software world needs to undergo a major change! The days of the lazy programmer are over: we cannot sit back and wait for a faster processor if our program is too slow. Single execution threads will not get any faster, so we need to make our code capable of running in a massively parallel way, and that is not easy.
Michael Perrone from IBM started his talk with a story about HP expecting a 2X performance increase when they introduced the first dual-core computers on the market but getting only about 1.7X; when they went from 2-core to 4-core they expected another 1.7X but got only 1.35X. So what should they expect from 8-core over 4-core? How about 16, 32 or 64 cores? Will the curve soon flatten out so that we do not get any more speed-up? The answer is: it is all about the data. Memory bandwidth is not keeping up with computation speed, so there is no use in increasing the computation capabilities if we are unable to feed the beast (food = data, beast = CPU core). And now we have to start feeding many beasts, and they will multiply exponentially.
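To put numbers on the anecdote, here is a minimal Python sketch. The first part only multiplies the two speed-ups quoted in the talk. The second part is a toy "feed the beast" model in which the total speed-up is capped by memory bandwidth; the function `bandwidth_limited_speedup` and the figures passed to it (4 GB/s demanded per core, 10 GB/s available) are my own illustrative assumptions, not measurements from the talk.

```python
def cumulative_speedups(step_speedups):
    """Turn per-doubling speed-ups into speed-ups relative to a single core."""
    total = 1.0
    result = []
    for step in step_speedups:
        total *= step
        result.append(total)
    return result


def efficiency(speedup, cores):
    """Fraction of the ideal (linear) speed-up that was actually achieved."""
    return speedup / cores


def bandwidth_limited_speedup(cores, per_core_demand_gbs, memory_bandwidth_gbs):
    """Toy model (assumption): cores scale linearly until together they
    ask for more data per second than the memory system can deliver."""
    return min(cores, memory_bandwidth_gbs / per_core_demand_gbs)


# Figures quoted in the talk: 1 -> 2 cores gave 1.7X, 2 -> 4 cores gave 1.35X.
observed = cumulative_speedups([1.7, 1.35])
for cores, speedup in zip([2, 4], observed):
    print(f"{cores} cores: {speedup:.2f}X, efficiency {efficiency(speedup, cores):.0%}")

# Hypothetical machine: each core wants 4 GB/s, memory delivers 10 GB/s.
for cores in [1, 2, 4, 8, 16, 32, 64]:
    print(f"{cores:2d} cores: {bandwidth_limited_speedup(cores, 4.0, 10.0):.2f}X")
```

In this toy model the curve goes flat as soon as the cores together outrun the memory system, which is the flattening the questions above are worried about.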
Peter Murray-Rust is asking on his blog: Where should we get our computing? The answer is: from the multi-core accelerator technologies, like GPGPU and Cell BE. The hardware cost and management burden he worries about can be reduced 50-100 fold by using these accelerators. It is no accident that the RoadRunner supercomputer is built on Cell BE processors for the computing (with the communication and file I/O being handled by AMD Opterons), beating the previous fastest HPC system benchmark (held by IBM’s BlueGene) by over 4X.
As for the GPGPU versus Cell BE angle, this symposium reinforced my belief that the Cell BE is a general-purpose accelerator suitable for any task (just like a CPU), while the GPUs from AMD and NVIDIA are highly specialized tools that can get great performance for a very specific subset of problems. GPUs were designed for graphics, where the computation tasks are massively parallel (millions of 3D points and triangles to process) and completely independent (what needs to appear on each pixel is independent of the others, and so is the computation to be performed for different 3D points). Tasks that have these properties are suitable for GPGPU, e.g. image processing, some physics simulations (materials science, plasma, laser, particles) and even some chemistry problems, like molecular dynamics simulation if one wants to compute the full atom pair matrix of forces. However, as soon as you want to be smart and compute only the forces within a cut-off range (which by itself can gain a hundred-fold speed-up if you work with proteins), and/or you need dynamically changing data sizes or inter-dependencies (like an N-body problem or QM), then a GPU is not a good choice.
There can be non-trivial performance hurdles even for seemingly fitting problems, like image processing. Michael Kinsner brought up an example in his talk where he had to learn the hard way that processing image blocks of 16×4 was fast, but 8×8 was much slower due to a peculiar memory access pattern issue. The input data pattern of the code has to map directly onto the underlying hardware architecture to get good performance on the GPU.
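To make the molecular dynamics point above concrete, here is a small Python sketch contrasting the two ways of enumerating atom pairs. The full pair matrix is uniform, independent work. The cut-off version uses a cell list (a common technique that I am swapping in for illustration, not something named at the symposium), and the amount of work per cell depends on where the atoms happen to be. The box size, cut-off and atom count are arbitrary assumptions.

```python
import itertools
import random
from collections import defaultdict


def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairs_all(points):
    """Full pair matrix: N*(N-1)/2 identical, independent work items."""
    return list(itertools.combinations(range(len(points)), 2))


def pairs_within_cutoff(points, cutoff):
    """Only pairs closer than `cutoff`, found via a cell list.

    Atoms are binned into cubic cells of edge `cutoff`, so any pair within
    range must sit in the same cell or in one of the 26 neighbouring cells.
    """
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(int(c // cutoff) for c in p)].append(idx)

    result = []
    for cell, members in cells.items():
        for offset in itertools.product((-1, 0, 1), repeat=3):
            neighbour = tuple(c + o for c, o in zip(cell, offset))
            # .get() avoids creating empty cells while we iterate.
            for i in members:
                for j in cells.get(neighbour, ()):
                    if i < j and dist2(points[i], points[j]) < cutoff ** 2:
                        result.append((i, j))
    return result


# Illustrative (assumed) system: 500 random atoms in a 40 x 40 x 40 box.
random.seed(0)
BOX, CUTOFF = 40.0, 8.0
points = [tuple(random.uniform(0.0, BOX) for _ in range(3)) for _ in range(500)]

full = pairs_all(points)
near = pairs_within_cutoff(points, CUTOFF)
print(f"all pairs: {len(full)}, pairs within cut-off: {len(near)}")
```

The cut-off version does far less arithmetic, but the neighbour lists have different lengths for every cell and change as atoms move, which is the kind of dynamically changing data size described above.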
On the other hand, the Cell BE is an extension of the CPU architecture. It is completely general purpose and solves the memory access (hungry beast) problem by giving full control to the programmer, via direct programming of 9 separate memory flow controllers and a huge 300GB/s data pipe. Of course, such control means the programming isn’t easy or worry-free, but we have the means. The challenge is upon us to program the beast so that it does not starve.
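One common way to use that kind of explicit control over data movement is double buffering: fetch the next block of data while computing on the current one. The post does not name this technique, so treat the following Python timing model as my own sketch under simple assumptions (fixed transfer and compute time per block, exactly two buffers, one transfer in flight at a time); the function names and the numbers are hypothetical.

```python
def serial_time(blocks, transfer_s, compute_s):
    """Fetch a block, then compute on it, strictly one after the other."""
    return blocks * (transfer_s + compute_s)


def double_buffered_time(blocks, transfer_s, compute_s):
    """Fetch block k+1 into the second buffer while computing on block k.

    The first transfer and the last computation cannot be overlapped;
    every step in between costs whichever of the two is slower.
    """
    if blocks == 0:
        return 0.0
    return transfer_s + (blocks - 1) * max(transfer_s, compute_s) + compute_s


# Hypothetical workload: 100 blocks, 1 ms to transfer and 1 ms to compute each.
blocks, transfer_s, compute_s = 100, 0.001, 0.001
print(f"serial:          {serial_time(blocks, transfer_s, compute_s) * 1000:.1f} ms")
print(f"double buffered: {double_buffered_time(blocks, transfer_s, compute_s) * 1000:.1f} ms")
```

When transfer and compute times are balanced, the overlapped version approaches half the serial time; when transfers dominate, the core still starves, and no amount of overlap hides that.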
ZZ

June 20th, 2008 at 9:42 am
[…] I have already mentioned in May, that RoadRunner the world’s current fastest supercomputer is built on Cell BE processors, the same platform that eHiTS Lightning runs on. If the Los Alamos Lab chooses Cell Processors then we chose well! […]
December 18th, 2008 at 4:22 pm
[…] We’ve been having the conversation within our company that the two dials of speed and accuracy work counter to each other. So, we’ve been espousing that even when it comes to the eHiTS Lightning solution that higher accuracy does take longer. We still stand by that BUT what we are happy about is the type of accuracy we can achieve very quickly using the new eHiTS Lightning algorithms. This becomes more obvious when our results are compared to the results of others. There has been a proliferation of arguments for GPUs being used as acceleration processors – we actually believe this is simply because of the business driver of “looking for new markets” for the GPU manufacturers. Zsolt has discussed his views regarding the future of High Performance Computing previously and commented on GPUs. Our belief is that while GPUs are clearly more “common” our decision to work with the Cell BE processor can certainly lead to far superior results…don’t forget that the RoadRunner computer is based on the Cell Processor, not GPUs. Did we make the right decision? […]