Introduction
Predicting the system performance impact of a new type of DRAM is a tricky business. It is impossible to do this merely by evaluating how much faster the DRAM is by itself. If you can get your hands on the hardware, testing is the best approach. But if that cannot be done for some reason, modeling is the only option.
Because it is the newest JEDEC standard for DRAM, and because it offers excellent latency, I have chosen to do some performance modeling with ESDRAM (with all features enabled). The results are summarized at the end of this article.
Very soon, Tom & I expect to be able to do some hands-on testing of ESDRAM to evaluate performance. Because the BX chip set is not optimized for ESDRAM, we expect only a small performance impact in most cases, but we will also be looking for better reliability at higher bus speeds such as 133MHz.
As ESDRAM optimized chip sets and graphics controllers become available, we will offer more test results. Meanwhile, more questions have popped up that ought to be addressed.
What determines the performance impact of fast latency DRAM?
DRAM bus utilization is the primary factor. If DRAM is not used very heavily, the performance difference may be very small. If the DRAM bus is highly saturated with activity, the performance impact will be much more profound.
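As a rough illustration (my own back-of-envelope sketch, not a figure from the model described later), Amdahl's law shows why utilization gates the benefit: only the fraction of time actually spent waiting on DRAM can shrink, no matter how much faster the DRAM is.

```python
# Back-of-envelope: how DRAM bus utilization gates the gain from faster DRAM.
# Amdahl's law: only the DRAM-bound fraction of execution time can shrink.

def system_speedup(dram_time_fraction, dram_speedup):
    """Overall speedup when only the DRAM-bound share of time improves."""
    return 1.0 / ((1.0 - dram_time_fraction) + dram_time_fraction / dram_speedup)

# A DRAM that is 2x faster helps very little at 5% utilization...
print(round(system_speedup(0.05, 2.0), 3))   # -> 1.026
# ...but noticeably when half the time is spent waiting on memory.
print(round(system_speedup(0.50, 2.0), 3))   # -> 1.333
```

The utilization figures here are placeholders; the point is only the shape of the curve.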
Is the DRAM bus usually saturated, or idle?
When running normal business applications, bus utilization is usually very low. This is because the cache is doing its job. But some applications (e.g. multimedia and games) tend to beat up the cache and drive memory utilization much higher. More on this later.
Doesn’t the cache take care of the latency problem altogether?
It helps a lot, but also hurts a little. Caches reduce external bus activity and reduce the average latency as seen by the CPU. But behind the cache, memory accesses become much more random in nature. The result is a very high DRAM page miss rate. Page misses produce the worst possible latency from DRAM. This causes the average DRAM latency to be even worse than in systems without an L2 cache. Fortunately, the cache helps to offset much of this problem, but there is still room for improvement.
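A simple weighted average makes the effect concrete. The clock counts below are placeholders chosen for illustration, not measured figures:

```python
# Illustrative only: hit/miss latencies are placeholder clock counts,
# not measured figures for any particular DRAM.

def avg_dram_latency(miss_rate, hit_latency, miss_latency):
    """Average DRAM latency in bus clocks, weighted by the page hit/miss mix."""
    return (1.0 - miss_rate) * hit_latency + miss_rate * miss_latency

# Behind a cache, accesses turn random and the page miss rate climbs,
# so the average latency drifts toward the worst case.
print(avg_dram_latency(0.20, 3, 9))  # 4.2 clocks at a 20% miss rate
print(avg_dram_latency(0.80, 3, 9))  # 7.8 clocks at an 80% miss rate
```

Even though the hit and miss latencies never change, the average more than doubles once most accesses miss the open page.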
Memory Bus Saturation
The fastest uniprocessor PCs, when engaged in ordinary tasks, will usually exhibit very low DRAM bus utilization. While we find ourselves wrapped up in talk of 500MB/s, 800MB/s or even 1.6 gigabytes per second, most real applications hover at a startlingly low 5, 10 or 20MB/s.
Intel clarified this in a presentation given at the Intel Developers Forum in February of 1998. Intel engineers characterized the external bandwidth demand of several popular benchmarks. We should acknowledge that benchmarks will usually work a PC much harder than a normal human can. During the several minutes that a benchmark may be running, the PC accomplishes about as much work as a human can force it to do in a week. Yet, the external bandwidth demands remain quite low as demonstrated by Intel’s data below.
Corel Draw is a fairly challenging application. It is probably more challenging than Word, Excel or other business apps. Yet, during the run of the benchmark, the majority of the results are pegged to the bottom of the chart – very close to ZERO megabytes per second. Of course there are a few blips up to 15 or 30MB/s.
With a maximum peak bandwidth of 533MB/s, this load represents about 1% to 5% bus saturation. Increasing the available bandwidth to the Gigabyte level seems utterly senseless for this type of application. The only effect it would have is to create more unused bandwidth. But improving latency could still show a performance improvement – though quite small, due to the low bus utilization.
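The arithmetic behind that estimate is simple: divide sustained demand by the bus's peak bandwidth (533MB/s, the peak of a 64-bit bus at 66MHz):

```python
# Bus saturation = sustained demand / peak bandwidth.
# 533 MB/s is the peak of a 64-bit, 66MHz memory bus (8 bytes x 66.6M transfers/s).

PEAK_MB_S = 533.0

def saturation(demand_mb_s):
    """Fraction of peak bandwidth actually consumed."""
    return demand_mb_s / PEAK_MB_S

for demand in (5, 15, 30, 60, 100):
    print(f"{demand:3d} MB/s -> {saturation(demand):5.1%} of peak")
```

The Corel Draw blips of 5 to 30MB/s land at roughly 1% to 6% saturation, which is where the "1% to 5%" figure comes from.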
Other applications do drive the bus a little harder, though. Soft DVD decode averages about 60MB/s quite consistently. 3D games can range from 60 to about 100MB/s on average. These figures will increase if the cache is disabled or reduced in size, as CPU clock speeds rise, and with AGP or UMA architectures.
Above all, let us remember that high bus saturation is a problem, not an advantage. If an application burns a lot of external bandwidth, it may be because the code is not optimized, it is not making good use of the cache, and its performance will be disappointing. Well-refined games like Quake2 are very playable and acceleratable because they are optimized to fit in the caches and have relatively low external bandwidth requirements. Games that do not scale well as CPU speeds increase may be thrashing the cache as a result of poor code optimization.
Graphics Architectures
Many of you may be familiar with my criticism of AGP texturing for high performance systems. AGP texturing is a small scale form of UMA (Unified Memory Architecture). Fundamentally, the goal of UMA is to save dollars on memory in exchange for some degree of compromise in system performance.
While AGP uses main memory only for textures, full UMA uses main memory for all graphics-related functions. UMA is not for the power user, nor was it ever positioned as a high performance architecture. It is clearly driven by cost, and for the sub-$1K market, UMA can make a lot of sense. The problem is that main memory can get really hammered in UMA systems.
Intel’s plans for Whitney (integrating its north bridge with the 740), have inspired many of the chip set vendors and graphics chip vendors to begin looking at similar integration projects. Whitney will probably not be a full UMA architecture, but many of the competing products will be UMA.
Standard DRAM performance may be adequate for some UMA systems, but UMA platforms will need a broader range of DRAM performance options in order to satisfy large portions of the market. With adequate memory performance, UMA systems could become a very cost-effective way to build a low cost midrange system as well. I am speaking from the perspective of an OEM, not as an individual power junkie.
Of course, OEMs and chip vendors must make difficult decisions on the “bandwidth vs. latency” question. The key here lies in the concept of “Randomness”. DRAM accesses from a cached CPU are very random and benefit more from latency improvements than bandwidth improvements. Also, 3D graphics controllers demonstrate very random behavior as they jump from place to place in memory for texture reads, pixel writes, z-buffer reads and writes, CRT scanning, etc.
When running 3D graphics applications on a UMA system, CPU activity is interleaved with graphics controller activity – driving randomness to an extreme. As a result, the page miss rate becomes even higher. In these systems, page miss latency can make or break the system.
Optimizing Page Miss Performance
Many users, and even engineers, still automatically assume that page hit latency is the key to DRAM performance. Over the life of the PC, page miss rates have skyrocketed from about 20% up to 70 or 80%. This is a result of the caches. CPU activity that is neatly sequential or localized ends up being serviced by the cache. Accesses that are extremely random in nature have a high probability of missing the cache, and usually result in a page miss when presented to DRAM. The problem grows worse as CPU caches become larger and more efficient.
In the table above, ESDRAM shows a page hit rate at all speeds that is one or two clocks faster than SDRAM. But more importantly, when used with an optimized controller, page misses can be serviced as fast as page hits with ordinary SDRAM. This is an amazing feat. (These numbers include a one clock delay for address propagation and decode by the chip set.)
In order to estimate the CPU performance impact of this kind of optimization, I went through a rather exhaustive modeling exercise. I built 16 different model configurations representing different speed grades of the K6, Mendocino and the Pentium2. The speed grades ranged from 233 to 533MHz at external bus speeds of 66, 100 and 133MHz. Each entry below is an average of all of the configurations selected for that processor.
There are two tables below. One evaluating the performance impact on standard architecture PCs and another for low cost UMA style systems. Three types of system bandwidth loads are modeled. “2D” represents the CPU performance delta for a typical 2D business application (as simulated by the ZD Labs CPU Mark 32 benchmark). “MM” signifies the CPU performance delta to be had under a multimedia load such as motion video decode. “3D” models the CPU performance impact based on the system bandwidth load of a rather challenging game application.
AGP and UMA bandwidth demand is considered in this model, but not other forms of I/O. For example, if your application spends all of its time waiting for the hard disk, your results will be different.
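For readers who want a feel for how such a model behaves, here is a deliberately crude sketch in the same spirit. It is my simplification, not the actual spreadsheet behind the tables, and the latency figures are hypothetical: it estimates the DRAM-bound share of time from bus utilization, then scales it by the latency improvement.

```python
# A crude sketch of this kind of model (my simplification, not the author's
# actual spreadsheet). The average-latency figures below are hypothetical.

def perf_delta(demand_mb_s, peak_mb_s, avg_lat_old, avg_lat_new):
    """Estimated % speedup from a latency improvement, gated by utilization.
    Treats the time an application spends waiting on DRAM as roughly
    proportional to its bus utilization."""
    dram_fraction = demand_mb_s / peak_mb_s
    speedup = 1.0 / ((1.0 - dram_fraction)
                     + dram_fraction * (avg_lat_new / avg_lat_old))
    return (speedup - 1.0) * 100.0

# A 3D game at ~100 MB/s on a 533 MB/s bus, with a hypothetical 40% cut
# in average latency, lands in the high single digits.
print(round(perf_delta(100, 533, 7.8, 4.7), 1))  # ~8%, in the article's 3D range
```

A 2D load at 10MB/s run through the same formula yields well under 1%, which is why the business-application column stays negligible.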
In a standard architecture system (above), the performance impact of an ESDRAM optimized memory controller on ordinary business applications is negligible. For multimedia applications the performance impact grows to 3-6%. 3D graphics applications start to become interesting at 8-12%, a throughput increase equivalent to one CPU speed grade.
For UMA systems, the DRAM bus saturation is inherently higher, and so is the performance impact of ESDRAM. Most business applications are not screaming for more performance, so the small advantage in 2D is still insignificant. Multimedia increases by 8-15%, while 3D jumps to a 15-23% performance advantage. These applications are quite challenging for a UMA system, and ESDRAM offers a valuable performance impact.
Mendocino is the second processor in Intel’s Celeron product line. Its smaller 128K integrated L2 cache is why the performance impact is higher for this processor than for the others. The highest individual improvement in this simulation was shown by a 333MHz Mendocino, which achieved a 34% performance advantage for UMA 3D applications when using ESDRAM as compared to standard SDRAM.
The AMD K6, Cyrix M2, IDT WinChip and Mendocino should all be popular CPUs for the high volume sub $1K market. By offering SDRAM and ESDRAM support for these systems, OEMs will be better able to deliver a broader range of performance with a narrower range of platforms. SDRAM will be the choice for business PCs, and ESDRAM will be an option with faster CPUs or for midrange game PCs.
Though DDR could, in theory, be supported in the same way, Rambus and SLDRAM require a completely new controller that is not compatible with SDRAM. Besides, first-pass simulations of high-bandwidth DRAM show that it offers only about half of the performance benefit of low-latency DRAM. I will offer more detailed information on this in the future.