Intel has added another nifty feature that I want to bring to your attention in the context of the L1/L2 caches. If you think back to the Pentium III launch in February 1999, you might remember Intel's introduction of the Streaming SIMD Extensions. The 'streaming' part of 'SSE' is actually represented by the Pentium III's prefetch instructions, which enable software to load data into the caches before it is requested by the processor core.
Those instructions still exist in Pentium 4's instruction set, but with Pentium 4's new hardware prefetch feature a lot of this is done automatically. This new unit is able to recognize the data access patterns of the software running on Pentium 4, so that it 'guesses' which data will be needed next and 'pre-fetches' it into the cache.
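To give you an idea of what software prefetching looks like from a programmer's point of view, here is a minimal C sketch using the `_mm_prefetch` intrinsic, which is the compiler-level interface to the SSE prefetch instructions mentioned above. The function name, the look-ahead distance and the `_MM_HINT_T0` hint are illustrative choices of mine, not anything prescribed by Intel:

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, available since SSE (Pentium III) */

/* Sum a large array while hinting the cache to load data one cache
 * line ahead of the current position. PREFETCH_AHEAD = 16 floats is
 * an illustrative value (64 bytes, one Pentium 4 L1 cache line). */
float sum_with_prefetch(const float *data, size_t n)
{
    const size_t PREFETCH_AHEAD = 16;
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            /* _MM_HINT_T0 asks for the data in all cache levels */
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        sum += data[i];
    }
    return sum;
}
```

Note that a prefetch is only a hint: the processor is free to ignore it, and the loop computes exactly the same result with or without it. The only difference, ideally, is that the data is already in the cache when the addition needs it.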
The procedure might sound familiar to you from the complex caching algorithms of hard drives, and you might also be aware of how much those can speed up disk accesses under certain circumstances. Pentium 4's hardware prefetch is probably able to significantly accelerate the execution of software that works on a lot of large data arrays.
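A hardware prefetcher works by spotting regular address patterns, so code that walks memory at a fixed stride is exactly the kind of software that benefits. The following sketch shows such a pattern; the function and its dimensions are hypothetical examples of mine, not code from Intel:

```c
#include <stddef.h>

/* Row-major traversal of a matrix stored as one flat array. Each
 * load address is exactly sizeof(double) = 8 bytes past the last,
 * a fixed-stride pattern a hardware prefetcher can recognize and
 * run ahead of, fetching upcoming cache lines before they are used. */
double sum_rows(const double *m, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}
```

By contrast, chasing pointers through a linked list scattered across the heap produces no detectable pattern, which is why such code sees little benefit from hardware prefetching.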
Entering The Execution Pipeline - Pentium 4's Trace Cache
Our code has now passed the system bus and the L1 and L2 caches, so it's finally time to enter the execution path of Pentium 4. You remember that Pentium 4 does not use an L1 instruction cache, but a much niftier thing instead. Let me first explain what is bad about an L1 instruction cache.
With Pentium III or Athlon, which both have an L1 instruction cache, code is fetched into this cache and stored until it's about time to enter the execution path. It then enters the decoder unit, which e.g. in the case of Athlon consists of 3 'direct path' and 3 'vector path' decoders, producing the 'OPs' (as explained above) that can be executed by the execution units of the processor. This arrangement has a few glitches. First of all, some x86 instructions are rather complex and take a long time to be decoded by the slow 'vector path' decoders. In the worst case, all decoder units are busy with complex instructions, stalling the execution pipeline of the processor. Another problem is that x86 instructions which are executed repeatedly (e.g. in small loops) need to be decoded anew each time they enter the execution path, wasting a lot of time. Software branches are yet another wasteful situation for a processor with an L1 instruction cache that starts its pipeline at the decoder level.
Pentium 4's fancy Execution Trace Cache does not suffer from the problems described above. Once you have understood it, the idea of the trace cache is actually rather simple, but it takes quite a bit more silicon resources and design skill to replace the good old L1 instruction cache with something like Pentium 4's trace cache. Basically, the 'Execution Trace Cache' is nothing but an L1 instruction cache that lies BEHIND the decoders. Obviously it's quite a bit more complex than that, but once you have grasped this basic fact, you start to realize the benefits of the trace cache.