Introduction
Rambus is a very hot topic. Intel has been promoting Rambus as the new memory standard since late 1996. Now, eighteen months later, a few DRAM manufacturers have prototype silicon in hand. Because it uses a completely new interface, Rambus requires a whole new generation of chip sets. Intel’s Rambus platform is targeted for mid-1999 using the Katmai processor. Because of the risks associated with Rambus, we also expect that Intel will offer a Katmai platform at the same time using a BX-like chip set that supports SDRAM.
Isn’t Rambus going to be really fast?
Remember, there are two kinds of fast: low latency and high bandwidth. Rambus offers extremely high bandwidth, but its latency is worse than even standard SDRAM’s. The higher latency will compromise CPU performance, while the extra bandwidth exceeds what the CPU can actually use. That does not translate to “fast”.
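To put some rough numbers on that, consider the time spent actually transferring one cache line once the data starts flowing. The sketch below is just back-of-the-envelope arithmetic, assuming a 32-byte cache line, a nominal 800MB/s peak for a 64-bit PC100 SDRAM bus, and a nominal 1.6GB/s peak for Rambus; none of these are measured figures from any real system.

```python
# Back-of-the-envelope sketch, not a measurement: how much transfer time
# does doubling peak bandwidth actually save on a single cache-line fill?

CACHE_LINE_BYTES = 32            # assumed typical CPU cache line of the era

def transfer_ns(bytes_moved, peak_mb_per_s):
    """Nanoseconds spent moving the data once the first word arrives."""
    return bytes_moved / (peak_mb_per_s * 1e6) * 1e9

sdram_xfer  = transfer_ns(CACHE_LINE_BYTES, 800)    # ~40 ns at 800MB/s
rambus_xfer = transfer_ns(CACHE_LINE_BYTES, 1600)   # ~20 ns at 1.6GB/s

print(f"SDRAM transfer : {sdram_xfer:.0f} ns")
print(f"Rambus transfer: {rambus_xfer:.0f} ns")
print(f"Bandwidth saves: {sdram_xfer - rambus_xfer:.0f} ns per cache-line fill")
# If Rambus adds more latency per access than the handful of nanoseconds
# saved here, the CPU sees each miss take longer, not shorter.
```

The point is simply that for CPU-sized requests the transfer time is small, so a modest latency penalty easily erases a large bandwidth advantage.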
Doesn’t Rambus run at 800MHz?
It is described as 800MHz DRAM, but the bus actually runs on a 400MHz clock with a double-data-rate approach like AGP and DDR SDRAM. In order to hit this clock speed, the bus width had to be reduced by 75%. At 16 bits wide, the bus is not wide enough to issue commands to the DRAM in the standard manner; the commands and data must be packetized and serialized between the controller and the DRAM chip. This adds delays in the path between the chip set and the DRAM, resulting in longer access latency.
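For those who like to see the arithmetic, here is where the headline numbers come from. This is only a bookkeeping sketch; the 64-bit, 100MHz SDRAM bus is included purely as the conventional point of comparison.

```python
# Where the "800MHz" and the peak-bandwidth figures come from.

RAMBUS_CLOCK_MHZ  = 400   # actual bus clock
TRANSFERS_PER_CLK = 2     # double data rate (data on both clock edges)
RAMBUS_BUS_BITS   = 16
SDRAM_BUS_BITS    = 64    # conventional PC memory bus
SDRAM_CLOCK_MHZ   = 100   # PC100 SDRAM

rambus_rate_mhz = RAMBUS_CLOCK_MHZ * TRANSFERS_PER_CLK    # 800 effective
rambus_peak_mb  = rambus_rate_mhz * RAMBUS_BUS_BITS / 8   # 1600 MB/s
sdram_peak_mb   = SDRAM_CLOCK_MHZ * SDRAM_BUS_BITS / 8    # 800 MB/s
width_cut       = 1 - RAMBUS_BUS_BITS / SDRAM_BUS_BITS    # 75% narrower

print(f"Rambus effective rate : {rambus_rate_mhz} MT/s (marketed as 800MHz)")
print(f"Rambus peak bandwidth : {rambus_peak_mb:.0f} MB/s")
print(f"PC100 SDRAM peak      : {sdram_peak_mb:.0f} MB/s")
print(f"Bus width reduction   : {width_cut:.0%}")
```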
What is “Fake Rambus”?
Because of the uncertainty surrounding Rambus, Intel is developing a version of the Rambus memory module that doesn’t use Rambus DRAM at all; it uses SDRAM. This type of module may be cheaper and easier to get than “Real Rambus”, but its performance will be even worse than real Rambus. Each module carries an additional translator chip that increases latency further, making fake Rambus probably the slowest high-speed memory on Earth. Intel may even use “Fake Rambus” as its SDRAM comparison point to demonstrate how Rambus is faster than SDRAM. Don’t fall for it.
Can ultra high bandwidth make up for poor latency?
Not really. Apply a little psychology to the question. First, mess up by being slow to deliver what the CPU requires, then try to make up for it by offering the CPU data that it may not be able to use – faster than it can accept it. Sounds like something men do when they forget their anniversary or wife’s birthday. It doesn’t work in relationships, or in PC architecture.
If you read the article “Bandwidth vs. Latency” posted here a few weeks ago, you may recall a chart showing how CPU performance, peak burst bandwidth and DRAM latency have progressively drifted out of balance. The Katmai/Rambus platform of 1999 is even worse in this respect. Intel seems to prefer to fix the part that is not broken, while further degrading the performance attributes that most desperately need attention.
For the full context of this analysis, please see the article “Bandwidth vs. Latency”.
This article will focus almost entirely on Rambus performance issues, but there are several other barriers that OEMs and users will face if they choose to adopt Rambus. We should expect Rambus to be rather expensive: it has a large die and a new, expensive packaging technology. It burns a lot of power and introduces new challenges in cooling and power management. For their first six months, Rambus platforms will not be able to support more than 256MB of memory. That seems more like the minimum configuration for a 500MHz Katmai platform, not the maximum. These and other issues will be covered in future articles. For now, let’s dive into the performance analysis.
Is Rambus Faster than SDRAM?
In this Rambus analysis I will repeat the same application-modeling technique that I applied to SDRAM and ESDRAM in a previous article. Because Rambus is positioned as a high-end technology, I have raised the CPU speeds a few notches; this time the range is 350 to 667MHz. As before, 2D (business apps), multimedia and 3D loads are evaluated in standard architecture platforms and in UMA platforms.
The chart below categorizes and averages the results of 192 system simulations, with Rambus and with standard SDRAM. The values displayed represent the average performance impact that Rambus introduces to each platform and its computational loads. The impact is not always positive.
Of the 96 comparisons, only 34 configurations showed an increase in performance, while 62 showed a decrease. The biggest performance advantage appeared on processors and platforms aimed at the mid-range and the low end.
A quick look at the average performance impact by CPU type below indicates that Rambus decreases benchmarkable performance by about 1% in standard architecture systems compared to SDRAM. However, the low-end UMA platform gets a 1-3% performance boost over SDRAM. That would be somewhat encouraging, except that Intel is not expected to use Rambus in its UMA systems anytime soon. If Intel can convince you that Rambus is better, they will want to use it as a hook to sell more high-end systems, not more low-end systems.
In these high-end systems, users pay hundreds of dollars for performance improvements of just a few percent. The unfortunate reality appears to be that Rambus will take some of that away, while probably driving the system cost up even higher.
This is a strange thing for a CPU vendor to do. Why would Intel deliberately promote a memory type that reduces CPU efficiency? I can’t answer that, but I must point out that the same question applies to the 740. Why would Intel promote a graphics chip architecture that needlessly sacrifices CPU performance?
In the case of the 740, Intel potentially degrades CPU performance by 10% in order to save a few dollars in graphics DRAM. Then, in the case of Rambus, Intel reverses its position and asks us to pay a premium for DRAM, while still suffering a reduction in performance. The whole thing seems terribly screwed up.
It seems to me that users are willing to shell out a few extra dollars to ensure that they have sufficient graphics memory, but I don’t think anyone wants to pay an excise tax on all of main system memory unless there is a clear performance advantage. Doesn’t this seem obvious? Does Intel see this? If so, what motive could they have for acting in this counter-intuitive manner?
I don’t know if I can answer this question without sounding like a crackpot, so let’s just stick with the facts. (BTW, have you seen the movie Conspiracy Theory? Just because you are paranoid doesn’t mean they are not out to get you. As a matter of fact, it was the illustrious Andy Grove who, shortly before retiring, graced us with a book entitled “Only the Paranoid Survive”. A prophetic warning?)
Now, back to the matter at hand…
Does Rambus help UMA? Is it the best solution?
Rambus does seem able to demonstrate some performance advantage over SDRAM in UMA applications. A UMA system is better positioned to benefit from Rambus because of its faster burst, and because the data stream used by the graphics controller does not have to synchronize with the CPU bus.
In any UMA system (and with AGP) there is some probability that the CPU will begin a new DRAM access at the very moment the graphics controller is reading from main memory. This is called an arbitration conflict, and it results in a longer CPU stall and a reduction in CPU performance. Rambus improves matters by allowing the graphics controller to complete its burst a little faster than SDRAM can, so the CPU regains access to DRAM a little sooner.
Rambus accomplishes this by trimming one or two clocks from the burst cycle. But the same effect can be achieved by trimming one or two clocks from latency instead. ESDRAM, for example, trims about four clocks from latency. That approach delivers a direct performance benefit to the CPU, in addition to offering UMA arbitration delays that are shorter than those of Rambus.
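A toy model illustrates why. The sketch below is not my performance model; the latency, burst length, and conflict probability are placeholder values picked only to show the asymmetry: trimming the burst helps only when a conflict actually occurs, while trimming latency helps every CPU access.

```python
# Simplified, hypothetical model of UMA arbitration (placeholder numbers,
# not results from the actual simulator used in this article).

def avg_cpu_access_clocks(latency, burst_len, conflict_prob):
    """Expected DRAM clocks per CPU access.

    latency       -- clocks from CPU request to first data
    burst_len     -- clocks a graphics burst occupies the DRAM
    conflict_prob -- chance a CPU access collides with a graphics burst
    """
    # On a collision the CPU waits, on average, for half the remaining burst.
    expected_wait = conflict_prob * (burst_len / 2)
    return latency + expected_wait

baseline      = avg_cpu_access_clocks(latency=9, burst_len=8, conflict_prob=0.2)
shorter_burst = avg_cpu_access_clocks(latency=9, burst_len=6, conflict_prob=0.2)  # Rambus-style win
lower_latency = avg_cpu_access_clocks(latency=7, burst_len=8, conflict_prob=0.2)  # ESDRAM-style win

print(f"baseline      : {baseline:.2f} clocks per access")
print(f"shorter burst : {shorter_burst:.2f} clocks per access")
print(f"lower latency : {lower_latency:.2f} clocks per access")
```

With these assumed numbers, shaving two clocks off the burst recovers only a fraction of a clock per access on average, while shaving two clocks off latency recovers the full two clocks on every access.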
When I ran ESDRAM against Rambus in the performance model, it produced the numbers below.
Low latency SDRAM not only outperforms Rambus for UMA systems, but for standard architecture systems as well.
Now would be a good time to describe the kinds of applications this performance model simulates, to keep everyone’s expectations in check. The 2D load is a synthetic business computing load characterized by the ZD Labs CPUmark32 benchmark. The multimedia load is an approximation of software motion-video decode; it assumes full CPU utilization, which is very difficult to achieve in real applications but does happen in multimedia benchmarks. The 3D load would be typical of a cache-thrashing D3D game with advanced game logic, user interaction, audio, communications, etc. The simulation would not correlate to 3D Winbench: 3D Winbench 98 is devoid of any game logic, user interaction or audio load; it is purely a geometry-stream-processing and accelerator test. Maybe the new version will change that.
I believe this is a broad enough representation to be considered “viable,” but you can be certain that Intel and Rambus will scour the earth for a few benchmarks that show a performance advantage for Rambus. Or worse yet, if they can’t find one, they will write a new one to satisfy their promotional goals. That would be a definite red flag.
Why do you think they call it “Bench-Marketing”?