Entering The Execution Pipeline - Pentium 4's Trace Cache, Continued
As already mentioned in my description of the term 'µOP', those simple instructions are the language understood by the execution units. They have a defined size and are thus easier to sequence than x86-instructions, which are of variable length. Once instructions are in the trace cache, Pentium 4 saves the time it would take to re-decode them whenever they repeat. It can also check more easily for the dependencies that matter to the branch prediction process. The trace cache ensures that the processor pipeline is continuously fed with instructions, decoupling the execution path from the threat of a stall in the decoder units. This is particularly important for a high clock rate design like Pentium 4. The execution trace cache supplies the next pipeline stage with 6 µOPs every 2 clocks, i.e. 3 µOPs per clock, which is about as fast as what AMD's Athlon is able to do under ideal conditions.
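To make the "decode once, replay many times" idea concrete, here is a deliberately simplified toy model in Python. It is only a sketch: the cycle costs, the `run` function and the per-address cache are my own illustrative assumptions and not Pentium 4 figures. A real trace cache stores whole traces of µOPs rather than single instructions.

```python
# Toy model only. All names and cycle costs below are hypothetical
# assumptions for illustration; they are not Pentium 4 measurements.
DECODE_COST = 4  # assumed clocks to decode one x86 instruction
HIT_COST = 1     # assumed clocks to fetch already-decoded µOPs


def run(instruction_stream, use_trace_cache):
    """Return (number of decodes, total clocks) for a stream of addresses."""
    trace_cache = set()  # addresses whose decoded form is already stored
    decodes = 0
    clocks = 0
    for addr in instruction_stream:
        if use_trace_cache and addr in trace_cache:
            clocks += HIT_COST       # replay stored µOPs, decoder not involved
        else:
            clocks += DECODE_COST    # decoder has to do the work
            decodes += 1
            if use_trace_cache:
                trace_cache.add(addr)
    return decodes, clocks


# A three-instruction loop body executed ten times.
loop_body = [0x100, 0x104, 0x108]
stream = loop_body * 10

without_cache = run(stream, use_trace_cache=False)
with_cache = run(stream, use_trace_cache=True)
print("decodes, clocks without trace cache:", without_cache)
print("decodes, clocks with trace cache:   ", with_cache)

# Delivery rate quoted in the text: 6 µOPs every 2 clocks.
uops_per_clock = 6 / 2
print("µOPs per clock:", uops_per_clock)
```

In this toy the loop is decoded only on its first pass; every later iteration is served from the cache, which is the effect the paragraph above describes.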
Now there's quite a bit more to know about 'µOPs', decoders and the trace cache. First of all, those 'µOPs' are not exactly small. In fact they are considerably larger than an x86-instruction, although they contain less information (most x86-instructions are represented by more than one µOP). The µOPs of Pentium III are known to be as large as 118 bits. Intel has reported neither the physical size of the execution trace cache nor the size of Pentium 4's µOPs. We only know that the trace cache is supposed to hold about 12,000 µOPs. Looking at a picture that I took myself of one of the Pentium 4 dies supplied by Intel to the press at Comdex, and comparing the trace cache area with the L2-cache area, it looks as if the trace cache is about 92-96 KB in size. It therefore seems a good guess to put the size of a Pentium 4 µOP in the neighborhood of 64 bits.
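The arithmetic behind that guess can be checked in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 1 KB = 1024 bytes and that the whole estimated area holds nothing but µOP data, with no tags or other overhead. The function name is my own.

```python
KB = 1024            # assumption: binary kilobytes
UOP_COUNT = 12_000   # "about 12,000 µOPs", as stated above


def bits_per_uop(cache_kb, uop_count=UOP_COUNT):
    """Bits available per µOP if the cache held µOP data only."""
    return cache_kb * KB * 8 / uop_count


low = bits_per_uop(92)   # lower end of the die-photo estimate
high = bits_per_uop(96)  # upper end of the die-photo estimate
print(f"{low:.1f} to {high:.1f} bits per micro-op")

# Assumption, not from Intel: if "12,000" were really a binary 12K
# (12 * 1024 = 12,288 entries), 96 KB would work out to exactly 64 bits.
print(bits_per_uop(96, 12 * 1024))
```

Both ends of the 92-96 KB range land within a couple of bits of 64, which is why 64 bits is the natural round figure.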
96 KB is quite a considerable size, six times that of Pentium III's 16 KB L1 instruction cache. However, Intel hasn't been wasteful with the space offered by Pentium 4's execution trace cache. Because the trace cache stores decoded x86-instructions, Pentium 4 knows what each entry actually does, wants and represents. The decoder units that feed the trace cache ensure that only those µOPs that will actually be executed are stored in it.