<?xml version="1.0" encoding="UTF-8"?><!-- generator="wordpress/2.0.11" -->
<rss version="2.0" 
	xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
	<title>Comments on: The fast and the furious: compare Cell/B.E., GPU and FPGA</title>
	<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/</link>
	<description>Addressing the challenges of computational drug discovery</description>
	<pubDate>Fri, 03 Sep 2010 20:37:15 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.0.11</generator>

	<item>
		<title>by: cirus</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1531</link>
		<pubDate>Fri, 12 Jun 2009 08:51:29 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1531</guid>
					<description>I've heard of OpenCL as a tool that can put GPUs from different vendors to work together. Can Cell and GeForce be integrated this way?</description>
		<content:encoded><![CDATA[<p>I&#8217;ve heard of OpenCL as a tool that can put GPUs from different vendors to work together. Can Cell and GeForce be integrated this way?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Cam</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1529</link>
		<pubDate>Tue, 26 May 2009 16:10:56 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1529</guid>
					<description>&#62;&#62; In contrast, you can simply compile your 
&#62;&#62; existing C or C++ code for the Cell/B.E. 
&#62;&#62; SPU using a variant of the gcc compiler. 
&#62;&#62; Of course, if you only do that much, then ...
&#62;&#62; ...
&#62;&#62; ... already on the SPUs with a few weeks 
&#62;&#62; of work for a large application. Then you 
&#62;&#62; can start profiling where the bulk of the 
&#62;&#62; time is spent ...

It is "the same" for NVidia, 'write your code
and run it'; it won't be fast but it will run.

To "fix" it you go in and split up the loops into
threads and work to avoid data transfers...


One Video Card (from ATI or NVidia) offers
about a TeraFLOP of computing power, of which
you can _expect_ 10% and may sometimes get
30% of that: up to 300 GFLOPs of additional
power for your 200 to 300 MFLOPs computer.

Some motherboards are designed to accept FOUR
Video Cards (and the Cards are also designed
to allow pairing and quading) so a "home-user"
can "easily" have a TeraFLOP on their Desk
for under $4000.

The GPU (or "a Video Card") is in every (new)
Desktop Computer, ready to lend its power to
specially compiled programs available today.

The "Cell/B.E." or FPGA are not as good an
investment for most people since, unlike a
Video Card, they are not required.

Your powerful Video Card can be used for
its intended purpose (displaying output FAST
on your monitor) when it is not being used
for computation.

Your Sony PS3 (when not used for computation)
can double as a Video Game console. What one would
do with an FPGA when it is not used for its
intended purpose is anyone's guess.

If the Cell were cheaper in quantity and
available on a PCI Card it would be much
better, but that is not to be.

The FPGA is likely to be the fastest solution
but not the cheapest. Ultimately it will
be the GPU (or 4 of them) that wins this race.


&#62;&#62; In an earlier blog post, I have analyzed 
&#62;&#62; the advantages of the Cell BE over other
&#62;&#62; acceleration technologies, like GPU and FPGA. ZZ 

The GPU and FPGA (if on a PCI Card) can obtain
their input Data and transfer the calculation
result to the Host Computer faster than you
can send it over a Network from your Sony PS3.

Cam</description>
		<content:encoded><![CDATA[<p>&gt;&gt; In contrast, you can simply compile your<br />
&gt;&gt; existing C or C++ code for the Cell/B.E.<br />
&gt;&gt; SPU using a variant of the gcc compiler.<br />
&gt;&gt; Of course, if you only do that much, then &#8230;<br />
&gt;&gt; &#8230;<br />
&gt;&gt; &#8230; already on the SPUs with a few weeks<br />
&gt;&gt; of work for a large application. Then you<br />
&gt;&gt; can start profiling where the bulk of the<br />
&gt;&gt; time is spent &#8230;</p>
<p>It is &#8220;the same&#8221; for NVidia, &#8216;write your code<br />
and run it&#8217;; it won&#8217;t be fast but it will run.</p>
<p>To &#8220;fix&#8221; it you go in and split up the loops into<br />
threads and work to avoid data transfers&#8230;</p>
<p>One Video Card (from ATI or NVidia) offers<br />
about a TeraFLOP of computing power, of which<br />
you can _expect_ 10% and may sometimes get<br />
30% of that: up to 300 GFLOPs of additional<br />
power for your 200 to 300 MFLOPs computer.</p>
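<p>The throughput arithmetic above can be checked in a few lines. This is a minimal sketch that takes the comment&#8217;s round numbers as assumptions: about 1 TFLOP (1000 GFLOPs) peak per card, 10% expected and 30% best-case efficiency, and up to four cards per motherboard. The helper name <code>sustained_gflops</code> is hypothetical.</p>

```python
def sustained_gflops(peak_gflops, efficiency_pct, cards=1):
    # Sustained throughput = peak x efficiency, summed over identical cards.
    # Integer percentages keep the arithmetic exact.
    return peak_gflops * efficiency_pct * cards // 100


PEAK = 1000  # assumed: "about a TeraFLOP" per card, expressed in GFLOPs

for cards in (1, 4):
    low = sustained_gflops(PEAK, 10, cards)   # the 10% you can expect
    high = sustained_gflops(PEAK, 30, cards)  # the 30% you may sometimes get
    print(f"{cards} card(s): {low} to {high} GFLOPs sustained")
# prints:
# 1 card(s): 100 to 300 GFLOPs sustained
# 4 card(s): 400 to 1200 GFLOPs sustained
```

<p>On these assumptions, four cards land between 0.4 and 1.2 TFLOPs sustained. That is the sense in which a desktop can reach a TeraFLOP.</p>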
<p>Some motherboards are designed to accept FOUR<br />
Video Cards (and the Cards are also designed<br />
to allow pairing and quading) so a &#8220;home-user&#8221;<br />
can &#8220;easily&#8221; have a TeraFLOP on their Desk<br />
for under $4000.</p>
<p>The GPU (or &#8220;a Video Card&#8221;) is in every (new)<br />
Desktop Computer, ready to lend its power to<br />
specially compiled programs available today.</p>
<p>The &#8220;Cell/B.E.&#8221; or FPGA are not as good an<br />
investment for most people since, unlike a<br />
Video Card, they are not required.</p>
<p>Your powerful Video Card can be used for<br />
its intended purpose (displaying output FAST<br />
on your monitor) when it is not being used<br />
for computation.</p>
<p>Your Sony PS3 (when not used for computation)<br />
can double as a Video Game console. What one would<br />
do with an FPGA when it is not used for its<br />
intended purpose is anyone&#8217;s guess.</p>
<p>If the Cell were cheaper in quantity and<br />
available on a PCI Card it would be much<br />
better, but that is not to be.</p>
<p>The FPGA is likely to be the fastest solution<br />
but not the cheapest. Ultimately it will<br />
be the GPU (or 4 of them) that wins this race.</p>
<p>&gt;&gt; In an earlier blog post, I have analyzed<br />
&gt;&gt; the advantages of the Cell BE over other<br />
&gt;&gt; acceleration technologies, like GPU and FPGA. ZZ </p>
<p>The GPU and FPGA (if on a PCI Card) can obtain<br />
their input Data and transfer the calculation<br />
result to the Host Computer faster than you<br />
can send it over a Network from your Sony PS3.</p>
<p>Cam
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Gregg J. Macdonald</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1526</link>
		<pubDate>Sat, 09 May 2009 04:50:23 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1526</guid>
					<description>re: FPGA - I recently announced very basic PC/embedded PC functionality and am offering a developer's board with access to IP for those interested in an FPGA solution.  Once I get the rest of the main pieces up and running, it becomes easier for everyone else to hook into it . . . and reuse it.  FPGA costs are under good downward pressure.</description>
		<content:encoded><![CDATA[<p>re: FPGA - I recently announced very basic PC/embedded PC functionality and am offering a developer&#8217;s board with access to IP for those interested in an FPGA solution.  Once I get the rest of the main pieces up and running, it becomes easier for everyone else to hook into it . . . and reuse it.  FPGA costs are under good downward pressure.
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: patrick</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1522</link>
		<pubDate>Thu, 05 Mar 2009 05:11:39 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1522</guid>
					<description>This is just a bit over my head, and with that I do have a question: given their strengths in different areas, would it be possible to use them together, like a math co-processor in the past? Or am I way off base here?</description>
		<content:encoded><![CDATA[<p>This is just a bit over my head, and with that I do have a question: given their strengths in different areas, would it be possible to use them together, like a math co-processor in the past? Or am I way off base here?
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: SimBioSys Blog &#187; Blog Archive &#187; IBM&#8217;s white paper on the Cell technology and Molecular Modeling</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1513</link>
		<pubDate>Thu, 23 Oct 2008 02:57:51 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1513</guid>
					<description>[...] The fast and the furious: compare Cell/B.E., GPU and FPGA [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] The fast and the furious: compare Cell/B.E., GPU and FPGA [&#8230;]
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: SimBioSys Blog &#187; Blog Archive &#187; Bio-IT World article about eHiTS Lightning</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1504</link>
		<pubDate>Tue, 15 Jul 2008 07:07:14 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1504</guid>
					<description>[...] Mike May wrote an article for Bio-IT World about eHiTS Lightning and our efforts to accelerate docking on the Cell BE processor. There are more details about the topic in our white paper, and in some earlier blog posts here and here. Some people argue that speed is not the most important issue for docking and that accuracy is far more crucial. They fail to realize that speed is a factor that can enable us to use much finer pose sampling and more sophisticated scoring terms and still run in a reasonable time frame. Think about it this way: it is well known that quantum chemistry based methods, e.g. free energy perturbation (FEP), can provide the most accurate binding energy estimation, yet nobody has ever considered using such a technique for scoring in docking or virtual screening, simply because it takes many CPU hours to compute the energy for a single ligand pose with FEP, while a single docking run requires many thousands, possibly millions, of poses to be scored. If SimBioSys as a software vendor offered docking software with FEP scoring that required years of CPU time for a single docking run, nobody would buy such a product. But if we could do FEP-score based docking such that it ran in a few minutes per ligand, that would be a &#8220;killer application&#8221;. [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] Mike May wrote an article for Bio-IT World about eHiTS Lightning and our efforts to accelerate docking on the Cell BE processor. There are more details about the topic in our white paper, and in some earlier blog posts here and here. Some people argue that speed is not the most important issue for docking and that accuracy is far more crucial. They fail to realize that speed is a factor that can enable us to use much finer pose sampling and more sophisticated scoring terms and still run in a reasonable time frame. Think about it this way: it is well known that quantum chemistry based methods, e.g. free energy perturbation (FEP), can provide the most accurate binding energy estimation, yet nobody has ever considered using such a technique for scoring in docking or virtual screening, simply because it takes many CPU hours to compute the energy for a single ligand pose with FEP, while a single docking run requires many thousands, possibly millions, of poses to be scored. If SimBioSys as a software vendor offered docking software with FEP scoring that required years of CPU time for a single docking run, nobody would buy such a product. But if we could do FEP-score based docking such that it ran in a few minutes per ligand, that would be a &#8220;killer application&#8221;. [&#8230;]
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: SimBioSys Blog &#187; Blog Archive &#187; Fastest supercomputer built on the Cell/BE</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1501</link>
		<pubDate>Fri, 20 Jun 2008 16:20:22 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1501</guid>
					<description>[...] The last article in the above list highlights: &#8220;Roadrunner was built using 6,912 dual-core Opteron processors from Advanced Micro Devices, and 12,960 IBM Cell eDP accelerators. Early tests indicate that the Cell processors have reached 1.33 petaflops while the Opterons reached 49.8 teraflops&#8221;. So roughly twice as many Cells (12,960 vs. 6,912) produce 26.7 times the crunching power of the dual-core Opterons. In an earlier blog post, I analyzed the advantages of the Cell BE over other acceleration technologies, like GPU and FPGA. ZZ [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] The last article in the above list highlights: &#8220;Roadrunner was built using 6,912 dual-core Opteron processors from Advanced Micro Devices, and 12,960 IBM Cell eDP accelerators. Early tests indicate that the Cell processors have reached 1.33 petaflops while the Opterons reached 49.8 teraflops&#8221;. So roughly twice as many Cells (12,960 vs. 6,912) produce 26.7 times the crunching power of the dual-core Opterons. In an earlier blog post, I analyzed the advantages of the Cell BE over other acceleration technologies, like GPU and FPGA. ZZ [&#8230;]
</p>
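<p>The ratios in the quote can be reproduced from its own figures. This is a minimal sketch. The per-chip ratio is an extra step that is not in the original text, and it counts each dual-core Opteron as one chip.</p>

```python
cells, opterons = 12_960, 6_912               # processor counts from the quote
cell_tflops, opteron_tflops = 1_330.0, 49.8   # 1.33 petaflops = 1,330 teraflops

count_ratio = cells / opterons                # "twice as many" (exactly 1.875)
total_ratio = cell_tflops / opteron_tflops    # "26.7 times" the crunching power
per_chip_ratio = total_ratio / count_ratio    # throughput per Cell vs. per Opteron

print(f"count ratio:    {count_ratio:.3f}")
print(f"total ratio:    {total_ratio:.1f}")
print(f"per-chip ratio: {per_chip_ratio:.1f}")
# prints:
# count ratio:    1.875
# total ratio:    26.7
# per-chip ratio: 14.2
```

<p>On these figures, each Cell delivers roughly 14 times the throughput of each dual-core Opteron.</p>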
]]></content:encoded>
				</item>
	<item>
		<title>by: Petaflops and Cell Processors at The ChemConnector Blog by Antony Williams - Observations and Musings for the Chemistry Community By Antony Williams</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1498</link>
		<pubDate>Fri, 20 Jun 2008 05:30:54 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1498</guid>
					<description>[...] With the fastest computer in the world using the Cell processor as part of its architecture, and with the processor now proving itself for docking, the question is whether we will see this processor become even more mainstream in the foreseeable future. It&#8217;s NOT easy to port&#8230;but it can be done. Posted June 19th, 2008 &#124; 11:14 pm [...]</description>
		<content:encoded><![CDATA[<p>[&#8230;] With the fastest computer in the world using the Cell processor as part of its architecture, and with the processor now proving itself for docking, the question is whether we will see this processor become even more mainstream in the foreseeable future. It&#8217;s NOT easy to port&#8230;but it can be done. Posted June 19th, 2008 | 11:14 pm [&#8230;]
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: Flexy</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1496</link>
		<pubDate>Thu, 19 Jun 2008 15:21:07 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1496</guid>
					<description>great post</description>
		<content:encoded><![CDATA[<p>great post
</p>
]]></content:encoded>
				</item>
	<item>
		<title>by: RandomFeedFollower</title>
		<link>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1489</link>
		<pubDate>Thu, 05 Jun 2008 15:35:23 +0000</pubDate>
		<guid>http://www.simbiosys.com/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/#comment-1489</guid>
					<description>"Mike Says:
May 9th, 2008 at 1:10 pm

One other fairly important point I think ought to be made here concerns the ability to rely on the computed answers - what’s the use of all that speed if you can’t trust it, or have to run several iterations to build confidence?

That is to say, of the technologies (CPU, GPU, FPGA, and Cell/B.E.) only the CPU’s and Cell/B.E. have protection against Soft Errors (which are undetected corrupted data caused by things like cosmic rays). It’s a difference in a server heritage vs a consumer low cost target. Words like ECC, Parity, and CRC don’t appear in GPU’s (or typically FPGA’s - certainly they could be coded into FPGA’s, but at the cost of gates and effort etc)."

What a load of FUD.  Firstly, the cosmic rays issue: GPUs use similar plastic ball grid array packaging to CPUs.  In fact I've yet to see a GPU that doesn't completely enclose the silicon in a cavity, unlike, say, the P3 FBGA.  FPGAs?  They come in a variety of package options, also plastic, some for commercial applications, some for industrial, and some that are military radiation hardened.  Traditional general purpose CPUs can come in rad-hard packaging, but they're typically surface mounted and cost ~5x as much as the commercial parts (for you Mike, the ones you buy at BestBuy when you've saved up lawn mowing money).

Accuracy?  None of the GPU vendors match the IEEE standard exactly in their rounding schemes, but it's known that x86 isn't 100% spot on in all cases either.  There are methods to improve precision beyond what the floating point unit's width provides; in fact, one was invented by Newton a few hundred years before electricity.  With regard to FPGAs, the precision is your choice, or is left up to your choice of floating point core.  There are dozens of companies that make off-the-shelf IP (Intellectual Property) cores that are IEEE compliant.

ECC?  Typically you don't see this in on-chip memories, or at least I haven't in 12 years of hardware development.  For off-chip memories, it depends on the component: you can buy non-ECC or ECC memories, and the choice is usually made in the Project Requirements phase of any development project.  Do GPUs use ECC?  I'm sure vendors have and do; to my knowledge nVidia's chips support ECC (at least the GeForce 6, the last datasheet I've seen).  FPGAs?  Of course.

Parity?  GPUs?  No, it would be almost stupid to waste IO pins on the overhead given the project constraints.  Maybe some vendors do, but I've never seen it.  FPGAs?  Definitely, on every block RAM or distributed RAM I've ever generated on a Xilinx Spartan or Virtex FPGA.  At a cost of gates?  No, at a cost of one bit of parity per eight bits of data.

CRC?  Why would one implement this in hardware other than for acceleration?  It's a bit heavyweight for routine hardware-to-hardware tests.  However, it can easily be implemented on a GPU (why one would, I don't know), and as with most of the BS you've spewed, there are probably a dozen vendors that sell IP cores that do this.

The only real difference between GPUs and CPUs is that GPUs are logic dense with small amounts of on-chip memory (typically 70%-90% logic and 30%-10% memory), while almost 60% of a modern x86 CPU is on-chip memory (L1, on-die L2 cache, register file).  With FPGAs, it depends on the package selected, but they typically maintain a 90% logic to 10% memory ratio, at least in Xilinx's line.  As a result of this bias toward logic over on-chip memory, and their simple yet high-speed architecture, they can be applied to a variety of engineering solutions and be coupled with a wide range of peripheral components.

-Burned</description>
		<content:encoded><![CDATA[<p>&#8220;Mike Says:<br />
May 9th, 2008 at 1:10 pm</p>
<p>One other fairly important point I think ought to be made here concerns the ability to rely on the computed answers - what’s the use of all that speed if you can’t trust it, or have to run several iterations to build confidence?</p>
<p>That is to say, of the technologies (CPU, GPU, FPGA, and Cell/B.E.) only the CPU’s and Cell/B.E. have protection against Soft Errors (which are undetected corrupted data caused by things like cosmic rays). It’s a difference in a server heritage vs a consumer low cost target. Words like ECC, Parity, and CRC don’t appear in GPU’s (or typically FPGA’s - certainly they could be coded into FPGA’s, but at the cost of gates and effort etc).&#8221;</p>
<p>What a load of FUD.  Firstly, the cosmic rays issue: GPUs use similar plastic ball grid array packaging to CPUs.  In fact I&#8217;ve yet to see a GPU that doesn&#8217;t completely enclose the silicon in a cavity, unlike, say, the P3 FBGA.  FPGAs?  They come in a variety of package options, also plastic, some for commercial applications, some for industrial, and some that are military radiation hardened.  Traditional general purpose CPUs can come in rad-hard packaging, but they&#8217;re typically surface mounted and cost ~5x as much as the commercial parts (for you Mike, the ones you buy at BestBuy when you&#8217;ve saved up lawn mowing money).</p>
<p>Accuracy?  None of the GPU vendors match the IEEE standard exactly in their rounding schemes, but it&#8217;s known that x86 isn&#8217;t 100% spot on in all cases either.  There are methods to improve precision beyond what the floating point unit&#8217;s width provides; in fact, one was invented by Newton a few hundred years before electricity.  With regard to FPGAs, the precision is your choice, or is left up to your choice of floating point core.  There are dozens of companies that make off-the-shelf IP (Intellectual Property) cores that are IEEE compliant.</p>
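<p>The comment does not name the method. A common example of Newton&#8217;s method in this setting is Newton-Raphson refinement: the hardware returns a low-precision estimate (of a reciprocal, say), and each iteration roughly doubles the number of correct digits. This is a minimal sketch assuming a reciprocal and a deliberately coarse starting estimate. The name <code>refine_reciprocal</code> is hypothetical.</p>

```python
def refine_reciprocal(a, x0, steps):
    # Newton-Raphson on f(x) = 1/x - a gives x_next = x * (2 - a * x),
    # which needs only multiplies and a subtract (no division).
    x = x0
    for step in range(1, steps + 1):
        x = x * (2 - a * x)
        print(f"step {step}: x = {x!r}, relative error = {abs(1 - a * x):.1e}")
    return x


# Coarse "hardware" estimate of 1/3 with a 10% relative error.
refine_reciprocal(3.0, 0.3, steps=4)
```

<p>Since 1 - a&#183;x<sub>n+1</sub> = (1 - a&#183;x<sub>n</sub>)<sup>2</sup>, the relative error squares on every step (0.1, 0.01, 0.0001, ...). That is why a few iterations on top of a cheap estimate recover full precision.</p>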
<p>ECC?  Typically you don&#8217;t see this in on-chip memories, or at least I haven&#8217;t in 12 years of hardware development.  For off-chip memories, it depends on the component: you can buy non-ECC or ECC memories, and the choice is usually made in the Project Requirements phase of any development project.  Do GPUs use ECC?  I&#8217;m sure vendors have and do; to my knowledge nVidia&#8217;s chips support ECC (at least the GeForce 6, the last datasheet I&#8217;ve seen).  FPGAs?  Of course.</p>
<p>Parity?  GPUs?  No, it would be almost stupid to waste IO pins on the overhead given the project constraints.  Maybe some vendors do, but I&#8217;ve never seen it.  FPGAs?  Definitely, on every block RAM or distributed RAM I&#8217;ve ever generated on a Xilinx Spartan or Virtex FPGA.  At a cost of gates?  No, at a cost of one bit of parity per eight bits of data.</p>
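<p>Here is a minimal sketch of the one-parity-bit-per-byte scheme described above. It assumes even parity, since the comment does not say which convention the Xilinx RAMs use. Parity detects any single flipped bit, but it misses a double flip and cannot say which bit flipped. Correcting errors is what ECC adds. The function names are hypothetical.</p>

```python
def even_parity_bit(byte):
    # Even parity: the extra bit makes the total number of 1s (data + parity) even.
    if byte not in range(256):
        raise ValueError("expected an 8-bit value")
    return bin(byte).count("1") % 2


def check(byte, stored_parity):
    # True when the recomputed parity matches the stored bit, i.e. an even
    # number of bits (possibly zero) has flipped since it was written.
    return even_parity_bit(byte) == stored_parity


data = 0b1011_0010                 # four 1 bits, so the parity bit is 0
p = even_parity_bit(data)
single_flip = data ^ 0b0000_0100   # one bit flipped: detected
double_flip = data ^ 0b0000_0110   # two bits flipped: slips through
print(p, check(data, p), check(single_flip, p), check(double_flip, p))
# prints: 0 True False True
print("overhead:", 1 / 8)          # one check bit per 8 data bits
# prints: overhead: 0.125
```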
<p>CRC?  Why would one implement this in hardware other than for acceleration?  It&#8217;s a bit heavyweight for routine hardware-to-hardware tests.  However, it can easily be implemented on a GPU (why one would, I don&#8217;t know), and as with most of the BS you&#8217;ve spewed, there are probably a dozen vendors that sell IP cores that do this.</p>
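<p>To show how little logic a CRC needs, here is a bit-at-a-time CRC-32. It uses the common IEEE 802.3 variant, which is an assumption since the comment names no polynomial. The inner loop uses only shifts and XORs, the kind of work that maps easily onto dedicated logic. Python&#8217;s standard <code>zlib.crc32</code> serves as a cross-check.</p>

```python
import zlib


def crc32(data: bytes) -> int:
    # Bit-at-a-time CRC-32 (IEEE 802.3), using the reflected polynomial 0xEDB88320.
    crc = 0xFFFFFFFF                      # standard initial value
    for byte in data:
        crc ^= byte                       # fold the next byte into the low bits
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xEDB88320
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF               # standard final XOR


msg = b"123456789"
print(hex(crc32(msg)))                    # prints: 0xcbf43926 (the CRC-32 check value)
print(crc32(msg) == zlib.crc32(msg))      # prints: True
```

<p>Production implementations usually trade this per-bit loop for a 256-entry lookup table or dedicated instructions, but the result is the same.</p>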
<p>The only real difference between GPUs and CPUs is that GPUs are logic dense with small amounts of on-chip memory (typically 70%-90% logic and 30%-10% memory), while almost 60% of a modern x86 CPU is on-chip memory (L1, on-die L2 cache, register file).  With FPGAs, it depends on the package selected, but they typically maintain a 90% logic to 10% memory ratio, at least in Xilinx&#8217;s line.  As a result of this bias toward logic over on-chip memory, and their simple yet high-speed architecture, they can be applied to a variety of engineering solutions and be coupled with a wide range of peripheral components.</p>
<p>-Burned
</p>
]]></content:encoded>
				</item>
</channel>
</rss>
