DRAM Performance: Latency vs. Bandwidth
The industry is in the midst of a raging debate over DRAM performance. Today, chip makers are fighting it out, but very soon the battle zone will expand to system manufacturers and, ultimately, to individual users. The debate is over bandwidth vs. latency and DRAM chip interfaces.
The PC press is littered with inconclusive analyses of the many contenders, including Rambus, SLDRAM, DDR SDRAM, ESDRAM, Virtual Channel and others. Only one thing is clear – Intel is pushing Rambus, and almost everyone else is looking for alternatives. Why all of the anxiety? I can only answer that with another question: “Is Rambus about performance, or is Intel merely placing another proprietary bus roadblock in the path of its competitors?”
Aside from the politics (which run very deep), the goal is supposed to be improving system performance. As these new memory types become available, users must make their own choices. In order to avoid making expensive and disappointing mistakes, it is important to understand the real performance balance between DRAM, CPUs and buses.
Do faster DRAMs make a difference?
Size is first, speed second. If your memory configuration is too small, memory page swapping to the hard disk will severely limit performance. Beyond that, the system level performance impact of faster DRAM depends on your system architecture and on what kind of applications you run.
How can the new memory types improve performance?
Only two things can improve DRAM performance – faster latency (access time) or higher peak burst bandwidth. Predictably, some of the new DRAM types improve latency, while others crank up the burst rate.
What’s more important, faster latency or faster burst bandwidth?
That is the heart of the matter. As a rule of thumb for today’s desktop PC, faster latency will almost always deliver a performance benefit. Increasing peak burst bandwidth sometimes offers a performance benefit, but not in every case, and not usually as much.
What happens to CPU performance when latency is improved?
When the CPU experiences a cache miss, part or all of the CPU stalls for a surprisingly long period of time. Faster latency DRAM allows the CPU to resume operation more quickly, and the CPU realizes this benefit every time it accesses DRAM.
What happens when peak burst bandwidth is increased?
Currently, CPUs are incapable of ingesting burst data faster than one word per CPU bus clock. SDRAM already satisfies this requirement. Rambus, DDR and SLDRAM pump out data every half clock – twice as fast as the CPU can swallow it. Chip sets will have to buffer the data and slow it back down to the speed of SDRAM. Does this sound enticing?
Then why the heck does anyone want higher bandwidth?
Sometimes, the CPU must contend with bus-mastering peripherals for access to DRAM. If a peripheral is busy accessing DRAM at the precise moment that the CPU stalls on a cache miss, higher bandwidth DRAM can resolve the conflict a little faster. But fast latency DRAM can achieve the same result or better, depending on burst length.
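To see how the burst length trade-off plays out, here is a rough back-of-the-envelope sketch in Python. All of the timing figures are illustrative assumptions, not specs for any real part: a peripheral burst is modeled as first-word latency plus one transfer slot per remaining word.

# Rough sketch: how long a bus-master burst holds the DRAM while a
# stalled CPU waits behind it. All timing figures are assumptions.

def burst_occupancy_ns(latency_ns, words, ns_per_word):
    """First-word latency plus one transfer slot per remaining word."""
    return latency_ns + (words - 1) * ns_per_word

for words in (8, 32):
    fast_latency = burst_occupancy_ns(40, words, 10)   # low latency, slower burst
    high_bandwidth = burst_occupancy_ns(80, words, 5)  # high latency, faster burst
    print(f"{words}-word burst: fast-latency {fast_latency}ns, "
          f"high-bandwidth {high_bandwidth}ns")

For an 8-word burst the fast latency part wins (110ns vs. 115ns); only a long 32-word burst lets the high bandwidth part pull ahead (235ns vs. 350ns).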
Confused? Good. You need more information to make sense of all of this.
CPUs Need Latency – Caches Want Bandwidth
Before we go any further, let’s make sure that you understand how DRAM accesses are generated.
The CPU core screams along very nicely at half a gigahertz reading code and data from the caches, until it experiences a cache miss. At this point, all or part of the CPU comes to a screeching halt until the CPU’s need for the missing code or data is satisfied. The CPU then generates a 64-bit external read called the “demand word” access. When the demand word access is fulfilled, the CPU is able to continue processing. The period that the CPU must wait for the demand word is known as latency.
DRAM latency may be measured in nanoseconds, and can dynamically vary from under 40ns to over 100ns depending on many different factors. Latency may also be measured in terms of external CPU bus clocks, but in order to understand the CPU performance impact of latency, it must be evaluated in terms of core CPU clocks.
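Here is a quick Python sketch of that conversion, assuming a 100MHz bus and a 4x core multiplier (illustrative figures, not any particular system):

# Sketch: the same DRAM latency expressed in nanoseconds, bus clocks,
# and core CPU clocks. Bus speed and multiplier are assumptions.
import math

BUS_MHZ = 100                      # external CPU bus clock
MULTIPLIER = 4                     # core runs at 400MHz
NS_PER_BUS_CLK = 1000.0 / BUS_MHZ  # 10ns per bus clock

for latency_ns in (40, 70, 100):   # the dynamic range quoted above
    bus_clks = math.ceil(latency_ns / NS_PER_BUS_CLK)
    core_clks = bus_clks * MULTIPLIER
    print(f"{latency_ns}ns -> {bus_clks} bus clocks -> {core_clks} core clocks")

The same 70ns part that costs 7 bus clocks costs the 400MHz core 28 clocks – keep that number in mind for later.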
Next we move to the burst part of the cycle. After the CPU generates a demand word access and waits around for DRAM to respond, the L2 cache controller immediately kicks in with a short burst sequence which fills up part of the cache SRAM. The term “cache line fill” is often used to describe this transaction. The CPU may or may not need this data, but the cache controller fetches it just in case. It is also quite convenient that since the DRAM just went through the painful process of delivering the demand word, it is more than ready to pump out the neighboring data very quickly.
You have seen notations such as 7,1,1,1 or 5,2,2,2 used to describe bus transaction speed. The first value (5 or 7) is the number of bus clocks associated with latency. The next three values (1,1,1 or 2,2,2) are the bus clocks for each remaining 64-bit transfer of the burst cache line fill, so the full four-transfer fill moves 32 bytes.
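Decoding the notation takes one line of Python (a sketch, assuming the usual four-beat, 64-bit-wide burst):

# Sketch: total bus clocks for a four-beat, 64-bit-wide cache line fill.

def line_fill_clocks(timing):
    """timing = (leadoff, beat2, beat3, beat4) in bus clocks."""
    return sum(timing)

print(line_fill_clocks((7, 1, 1, 1)))  # 10 bus clocks for 32 bytes
print(line_fill_clocks((5, 2, 2, 2)))  # 11 bus clocks for 32 bytes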
Peak burst bandwidth may be calculated from the bus width and clock rate: a 64-bit (8-byte) bus at 100MHz hits 800MB/s. The math is easy to do and generates attractively huge numbers. Armed with this formula, anyone with a calculator and a pocket protector can assume that they have found the secret to evaluating memory performance or bus performance. Not true!
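Here is the same calculator math carried one step further – a sketch comparing the brochure’s peak figure against what a 7,1,1,1 line fill actually delivers (timings taken from the example above):

# Sketch: peak burst bandwidth vs. the effective bandwidth of a real
# 7,1,1,1 cache line fill. A 64-bit bus moves 8 bytes per transfer.

BUS_MHZ = 100
BUS_BYTES = 8

peak_mb_s = BUS_MHZ * BUS_BYTES                  # 800MB/s -- the brochure number

line_bytes = 4 * BUS_BYTES                       # 32-byte cache line
fill_ns = (7 + 1 + 1 + 1) * (1000.0 / BUS_MHZ)   # 100ns including latency
effective_mb_s = line_bytes / fill_ns * 1000.0   # ~320MB/s

print(peak_mb_s, round(effective_mb_s))          # 800 vs 320

Less than half of the headline number survives once latency is counted.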
Consider the jump from EDO to SDRAM at 66MHz. EDO has a peak burst bandwidth of a rather pathetic 266MB/s, while 66MHz SDRAM delivers a screaming 533MB/s. How many of you were blown away by the 2x performance boost when you first ripped out your EDO and dropped in SDRAM? You were lucky to see a 1% performance delta.
Unfortunately, peak burst bandwidth does not have a very direct relationship to CPU performance! (Hint: Consider LATENCY!!!)
Bandwidth vs. Latency – A Progress Report
The burst protocol was popularized in the PC architecture during the 486/33 era (due to the integration of the L1 cache). Today, CPUs are running at 400MHz – a clock speed increase of 12x. Over the same period, the peak burst bandwidth of the DRAM sub-system has also improved by 12x – having increased from 66MB/s in the 486 days, to 800MB/s today. CPU clock speeds and DRAM burst bandwidth seem pretty well aligned. But, a quick look at the latency situation leads to a very different conclusion.
During the same period, effective DRAM latency has not improved by 12x – it has not even held constant. In fact, measured in CPU core wait states, latency has become worse by a factor of more than 5x.
In the 486 days, the CPU core operated at its external bus speed (33MHz) and DRAM latency caused the CPU to stall for about 5 CPU clocks. In a P2/400 system, bus latency is a stiff seven clocks, but the CPU is running at a 4x clock multiplier. With calculator in hand, it is easy to see that when a 400MHz CPU stalls, it now takes an astronomical 28 core CPU clocks to resolve the stall and resume execution. From the perspective of the CPU core, latency has degraded by an incredible 5.6x.
What’s your guess… what really needs fixing, peak burst bandwidth or latency?
Silly as it may seem, some continue to insist that peak burst bandwidth is the main issue. The most prominent example is Rambus, which wants to push peak burst bandwidth all the way to 24x the 486-era baseline, while making latency even worse in the process.
Rambus, SLDRAM and DDR can all spew out burst cycles twice as fast as any X86 CPU can ingest them, but how each of these high bandwidth memory types rates on latency must be evaluated individually.
As a general rule, once you satisfy the CPU’s maximum burst ingestion rate, it is difficult to improve performance by adding more bandwidth. Under these circumstances, differences in CPU performance will be determined primarily by memory latency.
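A simple service-time model makes the rule concrete. The sketch below assumes a 32-byte line fill and illustrative figures – it is not a claim about any specific part:

# Sketch: cache-miss service time = latency + transfer time. Once the
# transfer is already fast, only the latency term is left to attack.

def miss_service_ns(latency_ns, line_bytes, mb_per_s):
    transfer_ns = line_bytes / mb_per_s * 1000.0
    return latency_ns + transfer_ns

base = miss_service_ns(70, 32, 800)      # 70 + 40 = 110ns
more_bw = miss_service_ns(70, 32, 1600)  # 70 + 20 = 90ns (double the bandwidth)
less_lat = miss_service_ns(35, 32, 800)  # 35 + 40 = 75ns (half the latency)

print(base, more_bw, less_lat)

Halving latency beats doubling bandwidth – and remember, the CPU cannot even ingest the doubled burst rate, so the 90ns figure is optimistic.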
How can I get fast latency DRAM?
Several types of special fast latency DRAM will be appearing on the scene over the next year. The most immediate of these is ESDRAM (Enhanced SDRAM) from Enhanced Memory Systems of Colorado Springs (subsidiary of Ramtron Inc). ESDRAM has been approved by JEDEC as a superset of the SDRAM standard. It is compatible with standard SDRAM and can be used in existing systems via plug-compatible DIMM and SO-DIMM modules.
It operates at bus speeds up to 133MHz, and offers better latency than ordinary DRAM at all speeds. When its special features are properly supported in a chip set, it improves latency by a whopping 35-50% depending on the bus speed. Be warned – this does not translate directly into an equivalent benchmarkable performance delta. The benchmarkable advantage will always be less, depending on how heavily an application actually uses main memory. I will post a follow-up article very soon that lays this out in detail (based on extensive performance modeling).
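Until that article appears, plain Amdahl’s Law gives a feel for the gap. The stall fractions in this sketch are assumptions for illustration, not results from my modeling:

# Sketch: Amdahl's Law applied to a 40% latency improvement. The stall
# fractions are assumptions, not measurements.

def overall_speedup(stall_fraction, latency_cut):
    """stall_fraction: share of run time spent waiting on DRAM latency.
    latency_cut: fractional reduction in that latency (0.40 = 40%)."""
    return 1.0 / (1.0 - stall_fraction * latency_cut)

for stall in (0.10, 0.25, 0.40):
    print(f"{stall:.0%} stalled -> {overall_speedup(stall, 0.40):.2f}x overall")

An application that spends 25% of its time stalled on DRAM sees about an 11% gain from a 40% latency cut – real, but smaller than the raw latency number.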
This kind of DRAM will carry a price premium, but its performance potential should prove interesting for many users and applications – particularly for overclockers.
If someone offers me massive bandwidth, should I “Just Say No”?
Never reject better performance. But don’t get fooled into buying slower memory that is hiding behind a smokescreen of “Bandwidth Overkill”, particularly if someone expects you to pay a significant price premium for it.
Ultra-high burst bandwidth can be genuinely needed in multiprocessor servers, which utilize the DRAM bus more fully. Usually, this need is satisfied nicely using wide configurations of standard DRAM. If DDR, SLDRAM or Rambus becomes available at no cost premium, these memory types could also be used in the server market.
Using a 5x clock multiplier, SDRAM at 133MHz can take us up to 667MHz CPU speeds. It is challenging, though not impossible, to migrate SDRAM to higher bus speeds; at that point it may become necessary to evaluate new DRAM solutions that combine faster latency with higher burst speeds.
Until we get to that point, fast latency SDRAM running at bus speeds between 66 and 133MHz will be able to deliver a satisfactory performance enhancement for most types of systems. This will easily take us into the next century.