It's the peak, the top, it's the Mona Lisa. It's the $64,000 Question: what processor will dominate 64-bit computing? Sixty-four bits holds the promise of new performance, new architectures, new compilers, and a new balance of power in CPU realpolitik. A clean break with the old, a new chance for the new.
What hardware or architectural changes are in store for 64 bits? Quite a lot, although few of them have to do with 64-bittedness per se. But 64-bit processors are at today's very high end, and they showcase all the best thinking in microprocessor design. This is the cutting edge, where silicon manufacturing, computer architecture, compiler technology, and marketing wizardry all come together. In the words of Calvin and Hobbes, scientific progress goes “Boink!”
For most of us waiting breathlessly on the sidelines, the 64-bit battle is between Intel's IA-64 and AMD's Hammer architectures. Separately, we'll evaluate the pros and cons of the “other” 64-bit processors used in workstations and servers, such as SPARC, Power, MIPS, and Alpha.
In this first segment of our 64-bit computing series, we'll launch into the wonder that is IA-64. You've probably seen plenty written about Itanium and the IA-64 architecture in the past few years, most of it a replay of Intel-generated information. We'll try to get beyond the standard facts and hype and take a critical look at Itanium and IA-64/EPIC, describing its features and delivering some critical analysis. We'll set the stage for an architectural comparison with Hammer and other 64-bit architectures in future segments.
To be clear, this 64-bit computing architecture series is not focused on performance testing. It's focused on architecture, and discusses long-term potential. We will, however, point you to a few Itanium performance studies on the Web.
IA-64 & Itanium
Intel's IA-64 (née Tahoe) architecture had a gestation period longer than that of an elephant. After first announcing their cooperation in 1994, Hewlett-Packard and Intel said the first offspring of their matrimony would arrive “not before 1998,” a prognostication that certainly proved to be true. In reality, the design was even longer in the making, for Intel and HP had stealthily begun working well before their mid-'94 announcement.
Ten years and 325 million transistors later, we behold Itanium (all the good names were taken). Originally code-named Merced, Itanium is the first-born of the IA-64 family and our first real look into how well IA-64 will–or won't–work. (The subsequent offspring, code-named McKinley, Madison, and Deerfield, are covered later in this article.) First and most obviously, Itanium, like all IA-64 processors, is not an x86 chip. It is a clean break from the long and legendary x86 (or IA-32, in Intel parlance) architecture that Intel invented, seemingly back when Earth was still cooling, and which propelled the Santa Clara company to such heights. Yes, Itanium is able to run x86 code in backward-compatibility mode, but that compatibility is tacked on; in its element, Itanium and all IA-64 chips are nothing at all like Pentium.
That's both good news and bad news, as we shall see. It's good to be free from the tyranny of the x86 architecture, considered by many programmers to be the worst 8-bit, 16-bit, or 32-bit (take your pick) CPU family ever developed. That it should have succeeded so spectacularly is enough to shake one's faith in divine forces. The bad news? IA-64 leaves behind everything that made x86 chips ubiquitous, and presumably replaces it all with new bugs, new quirks, and new head-scratchers, leaving us to wonder, “why the hell did they design it that way?”
The Heart of the Beast: A Modified VLIW Core
Internally, Itanium is a six-issue processor, meaning it can profitably handle six instructions simultaneously. It's also a VLIW (very long instruction word) machine with some enhancements for added flexibility in instruction groupings, less code expansion than classic VLIW designs, and better scalability, to permit wider parallel instruction issue in future IA-64 processors. Thus Intel prefers the term EPIC: Explicitly Parallel Instruction-set Computing.
Itanium has nine execution units and future IA-64 processors will probably have more. The nine are grouped into two integer units, two combo integer-and-load/store units, two floating-point units, and three branch units. These four groups are significant, as we shall see in a moment.
Here's a simplified Itanium block diagram:
And here's a more complex block diagram:
Itanium has a 10-stage pipeline, which is respectable but not impressive by today's standards. Again, future IA-64 processors may have different and probably longer pipes. For comparison, Pentium III has a 12-stage pipeline, the Alpha 21264 has just eight stages, Pentium 4 has 20 stages (counting from the point of fetching micro-ops from its trace cache), and Athlon has 10 stages.
Here's a basic Itanium pipeline diagram:
328 Registers and Counting
The Itanium processor has a massive register set, with 128 general-purpose integer registers (each 64 bits wide), 128 floating-point registers (each 82 bits wide), 64 1-bit predicate registers, 8 branch registers, and a whole bunch of other registers scattered among several different functions, including some for x86 backward compatibility. Like a lot of RISC processors, the first register (GR0) is hard-wired to a permanent zero, making it worthless for storage but useful as a constant for inputs and a bit bucket for outputs.
Here's a simplified diagram of key application registers:
Here's a detailed diagram of application and system-level register sets:
And of course Itanium supports the standard 32-bit x86 execution modes; the 32-bit registers are mapped onto the IA-64 registers. See details in the section titled “Don't Look Back: How Itanium Handles x86 Code” below.
What a far cry from the cramped, crowded register set of the x86! With 256 registers to play with, programmers have an embarrassment of riches. To avoid that embarrassment, IA-64 has two features that manage the register file: register frames and register rotation. These require some explanation…
Register Your Window Frames
Registers are great when your program is running, but pushing and popping 128 big ol' registers for subroutine calls is unpleasantly time-consuming (and usually not necessary anyway). It's traditional, but it's inefficient. One alternative is register windows, of which SPARC processors are a notable proponent. Register windows have their problems, too, and it's no coincidence that the only major RISC architecture to use register windows is also the slowest major RISC architecture still in production. IA-64 gets around the constant pushing and popping by using register frames.
The first 32 of the 128 integer registers are global, available to all tasks at all times. The other 96, though, can be framed, rotated, or both. Before a function call, you use Itanium's ALLOC instruction (which is unrelated to the C function of the same name) to shift the apparent arrangement of the general-purpose registers so that it appears that parameters are being passed from one function to another through shared registers. In reality, ALLOC changes the mapping of the logical (software-visible) registers to the physical registers, much as SPARC does. The similarities with SPARC's windows are strong and the differences mostly minor. With IA-64's frames, the frame size is arbitrary, unlike SPARC, which supports a few different fixed frame sizes. In the example illustration, the calling routine sets aside 11 registers (GR32 – GR42) for the called routine, with four registers overlapping. The overlapping registers are where the parameters will be passed, although they never really move. Regardless of what registers either routine physically uses, they will appear to be contiguous with the first 32 fixed registers, GR0 – GR31.
The maximum frame size is all 96 registers, plus the 32 globals that are always visible. Only the integer registers are framed; FP registers and predicate registers (described below) are not. The minimum frame size is one register, or you can choose not to use ALLOC at all.
On top of the frames, there's register rotation, a feature that helps loop unrolling more than parameter passing. With rotation, Itanium can shift up to 96 of its general-purpose registers (the first 32 are still fixed and global) by one or more apparent positions. Why? So that iterative loops that hammer on the same register(s) time after time can all be dispatched and executed at once without stepping on each other. Each instance of the loop actually targets different physical registers, allowing them all to be in flight at once.
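To make the frame/rotation idea concrete, here is a toy Python model of the remapping. It is purely illustrative: the class, its method names, and the simplified ALLOC/rotation semantics are our own inventions, not the real IA-64 mechanism, which also involves the RSE and per-frame rotation bases.

```python
# Toy model of IA-64-style register remapping (illustrative only; real
# ALLOC and rotation semantics are considerably more involved).

class RegisterFile:
    GLOBALS = 32          # GR0-GR31 are always visible and never move
    STACKED = 96          # GR32-GR127 can be framed and rotated

    def __init__(self):
        self.phys = [0] * (self.GLOBALS + self.STACKED)
        self.frame_base = 0   # set by our stand-in for ALLOC
        self.rotation = 0     # apparent shift of the stacked registers

    def alloc(self, frame_base):
        """Stand-in for ALLOC: remap GR32+ so the callee's registers
        appear contiguous with the 32 globals."""
        self.frame_base = frame_base

    def rotate(self, amount):
        """Shift the apparent position of the stacked registers,
        the way software-pipelined loops do."""
        self.rotation = amount % self.STACKED

    def _physical(self, logical):
        if logical < self.GLOBALS:
            return logical    # globals map straight through
        offset = (logical - self.GLOBALS + self.frame_base
                  + self.rotation) % self.STACKED
        return self.GLOBALS + offset

    def read(self, logical):
        return self.phys[self._physical(logical)]

    def write(self, logical, value):
        self.phys[self._physical(logical)] = value

rf = RegisterFile()
rf.write(32, 111)   # one loop iteration writes "GR32"
rf.rotate(1)        # rotate by one: logical GR32 now maps to physical GR33
rf.write(32, 222)   # the next iteration also writes "GR32"...
rf.rotate(0)
print(rf.read(32))  # 111 -- the first iteration's value was never clobbered
```

The point of the example is the last line: two loop iterations both targeted logical GR32, yet landed in different physical registers, so both can be in flight at once.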
If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is less generic than all-purpose register renaming like Athlon's, so it's easier to implement and faster to execute. Chip-wide register renaming like Athlon's adds gobs of multiplexers, adders, and routing, one of the big drawbacks of a massively out-of-order machine. On a smaller scale, ARM used this trick with its ill-fated Piccolo DSP coprocessor. At the high end, Cydrome also used this technique, a favorite feature that Cydrome alumnus and Itanium team member Bob Rau apparently brought with him.
So IA-64 has two levels of indirection for its own registers: the logical-to-virtual mapping of the frames and the virtual-to-physical mapping of the rotation. All this means that programs usually aren't accessing the physical registers they think they are, but that's nothing new to high-end microprocessors. Arcane as it seems, this method still uses less hardware trickery than the full register renaming of Athlon, Pentium III, or P4.
Frames and rotation help up to a point, but eventually even Itanium runs out of registers. When that happens, we're back to pushing and popping registers on and off the stack. Where Itanium differs from SPARC is that Intel makes this automatic: Itanium's register save engine (RSE) is a circuit within the processor that oversees filling and spilling registers to/from the stack when the register file overflows or underflows, invisibly to software. SPARC, in contrast, raises a fault that must be handled in software.
The RSE is more complicated than you might think. Ideally it would handle any kind of memory problem, page fault, exception, or error without bothering the processor. In Itanium, though, the RSE stalls the processor to do its work. In future IA-64 implementations, it will probably be handled more elegantly in the background.
The Good Stuff: Instruction Set
As we mentioned, IA-64 is an enhanced VLIW architecture, so its concept of “instruction” is a little different from that of, say, Pentium or Alpha. With IA-64, there are instructions, there are bundles, and there are groups. Get your notepads ready.
Instructions are 41 bits long. Yup – say goodbye to powers of two. It takes 7 bits to specify one of 128 general-purpose (or floating-point) registers, so two source-operand fields and a destination field eat up 21 bits right there, before you even get to the opcode. Another 6 bits select one of the 64 predicate registers (we discuss predication in more detail below), if any.
Instructions are delivered to the processor in “bundles.” Bundles are 128 bits: three 41-bit instructions (making 123 bits), plus one 5-bit template, which we'll get to in a minute. Still with us? Then there are instruction groups, which are collections of instructions that can theoretically all execute at once. The instruction groups are the compiler's way of showing the processor which instructions can be dispatched simultaneously without dependencies or interlocks. It's the responsibility of the compiler to get this right; the processor doesn't check. Groups can be of any arbitrary length, from one lonely instruction up to millions of instructions that can (hypothetically, at least) all run at once without interfering with each other. A bit in the template identifies the end of a group.
A bundle is not a group. That is, IA-64 instructions are physically packaged into 128-bit bundles because that's deemed the minimum width for an IA-64 processor's bus and decode circuitry. (Itanium dispatches two bundles, or 256 bits, at once.) A bundle just happens to hold three complete instructions. But logically, instructions can be grouped in any arbitrary amount, and it's the groups that determine how instructions interrelate.
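The arithmetic of the bundle layout is easy to check in code. Here is a small sketch that packs three 41-bit instruction slots and a 5-bit template into one 128-bit value, using the architecture's field ordering (template in the low 5 bits, then slot 0); the function names and sample slot values are our own.

```python
# Pack three 41-bit instruction slots plus a 5-bit template into a
# 128-bit IA-64 bundle: 5 + 41 + 41 + 41 = 128 bits exactly.

def pack_bundle(template, slot0, slot1, slot2):
    assert template < (1 << 5)
    assert all(s < (1 << 41) for s in (slot0, slot1, slot2))
    return template | (slot0 << 5) | (slot1 << 46) | (slot2 << 87)

def unpack_bundle(bundle):
    mask41 = (1 << 41) - 1
    return (bundle & 0x1F,            # 5-bit template
            (bundle >> 5) & mask41,   # slot 0
            (bundle >> 46) & mask41,  # slot 1
            (bundle >> 87) & mask41)  # slot 2

b = pack_bundle(0x10, 0x1AAAA, 0x2BBBB, 0x3CCCC)
assert b < (1 << 128)   # the whole bundle fits in 128 bits
assert unpack_bundle(b) == (0x10, 0x1AAAA, 0x2BBBB, 0x3CCCC)
```

Note that nothing in the packing ties a slot to an execution unit; that is the template's job, as the next section explains.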
All IA-64 instructions fall into one of four categories: integer, load/store, floating-point, and branch operations. These categories are significant in how they map onto the chip's hardware resources. Different IA-64 implementations (Itanium, McKinley, etc.) might have different hardware resources, but all will do their best to dispatch all the instructions in a group at once. And we'll see IA-64 compilers capable of optimizing binaries for different IA-64 processors, too.
It's hard not to think that Intel's institutionalized taste for baroque, ungainly, not to mention bizarre, instruction-set features crept in here somewhere. With so much elegance going for it, IA-64 falls down in the evening gown competition. First, IA-64 opcodes are not unique – they're reused up to four times. In other words, the same 41-bit pattern decodes into four completely different and unrelated operations depending on whether it's sent to the integer unit, the floating-point unit, the memory unit, or the branch unit. A C++ programmer would call this overloading. An assembly programmer would call it nuts. You'd think that Itanium's designers would have been satisfied with 2^41 different opcodes, but no…
The second eccentric feature, which is related to the first, explains how Itanium avoids confusing these identical-but-different opcodes (a process serious engineers call disambiguation). The five-bit template at the start of every 128-bit bundle helps route the three-instruction payload to the correct execution units. Those of you who are good at binary arithmetic are thinking, “wait a minute… five bits isn't enough.” And you'd be right–if you weren't designing Itanium. Rather than tagging each of the three instructions with its associated execution unit, or just extending the instruction width, IA-64 uses these five bits to select one of 24 different “templates” for an instruction bundle (the other eight combinations are reserved). A template spells out how the three instructions are arranged in a bundle, and where the end of the logical group is, if any. And yes, you're right again, 24 templates are not enough to define all possible combinations of integer, FP, branch, and memory operations within a bundle, as well as the presence of a group's logical stop. Deal with it.
You'll notice that it's impossible to have an FP instruction as the first instruction of a bundle, and that load/store instructions are not allowed at the end. You can't have two FP instructions in a bundle, yet you can have three branch instructions bundled together. This is not as counterproductive as it sounds–as long as two of the branches are conditional and evaluate false, they do no harm other than wasting space.
How Epic is EPIC?
Is EPIC really VLIW? Yes, by most definitions of the term. Pedantic computer architects may argue over abstruse differences, and Intel's marketing people will steam over the misuse of their trademark, but for all intents and purposes, EPIC is merely a more pronounceable rendition of VLIW with a few enhancements.
Few, if any, of EPIC's features discussed so far are unique to Itanium or to Intel. Broadsiding a processor with a volley of instructions at once is what VLIW is all about. EPIC corrupts, if you will, the pure ideal of VLIW by introducing its peculiar 5-bit instruction templates, which unnecessarily complicate multi-instruction issue and effectively eliminate several potential combinations of instructions. On the plus side, Intel gets credit for allowing flexible-sized instruction groupings, which help increase issue efficiency. This is likely to pay off handsomely in future IA-64 processors. IA-64's groups also reduce the code bloat seen in traditional VLIW designs (where fixed-width VLIW instruction slots may often go unused if the compiler cannot find independent instructions to group together from within a particular window of instructions).
Certainly there are plenty of processors with multiple execution units and microarchitectures that can keep them busy. Predicated execution is nothing new, either. Tiny embedded processors do it, and compiler writers are happy to manage the multiple predicate bits. Itanium's scoreboard bits, register frames, and svelte, RISC-like instruction set have all been seen before. Itanium doesn't even reorder instructions, for cryin' out loud, something even midrange 32-bitters do all day long. But then again, Intel's formally stated goal was to shift complexity out of the processor logic and into the compiler. Yet, if you read a presentation from the last Intel Developer Forum, you'll see that “Future Itanium Processor Family processors can have out-of-order execution.” Of course, this also implies that McKinley will be called Itanium II or something similar.
IA-64 doesn't really introduce anything all that new. It's more of an amalgam of concepts and techniques seen before and given the ol' Intel twist. That doesn't make it bad, but it's also not spectacular nerd porn.
Instruction Set Highlights
It would be tedious in the extreme to even summarize the entire IA-64 instruction set; you can refer here for the complete ISA (Instruction Set Architecture) listing. But there are some highlights in the ISA worth noting, such as conditional (predicated) execution, hinted and speculative loads, and the odd way in which Itanium handles integer math.
Pretty much any IA-64 instruction can be conditional, with its execution predicated on literally anything you care to define. Far beyond the simple Z (zero), V (overflow), S (sign), and N (negative) flags of our childhood, IA-64 has 64 free-form predicate bits, each considered a separate predicate register. You can set or clear a predicate bit any way you like, and its condition sticks indefinitely. Any subsequent instruction anywhere in the program can check that bit (or multiple bits) and behave accordingly. This allows you, for example, to evaluate two numbers in one part of a program, but not make a decision (conditional branch) until much later. The microprocessor cognoscenti consider predicate bits more elegant than flags; they scale more easily to larger sizes (more bits) and are easier for compilers to target. We'll cover predication in more detail below.
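The "compare now, decide much later" pattern is easier to see in a sketch. This toy Python model is ours, not real IA-64 semantics, though it borrows two real details: PR0 is hardwired true, and a compare typically sets a pair of complementary predicates.

```python
# Sticky predicate bits: a compare sets a predicate once, and any later
# instruction can be guarded by it. PR0 is hardwired true, as on the
# real chip.

class Predicates:
    def __init__(self):
        self.bits = [True] + [False] * 63   # PR0 always reads as 1

    def cmp_eq(self, a, b, p_true, p_false):
        # IA-64 compares usually write two complementary predicates
        self.bits[p_true] = (a == b)
        self.bits[p_false] = (a != b)

    def guarded(self, p, action):
        # "(qp) insn" -- the instruction takes effect only if bit p is set
        if self.bits[p]:
            action()

pr = Predicates()
pr.cmp_eq(7, 7, 1, 2)    # evaluate the condition now...
results = []
# ...arbitrarily much unrelated work can happen here...
pr.guarded(1, lambda: results.append("equal path"))
pr.guarded(2, lambda: results.append("not-equal path"))
print(results)           # ['equal path']
```

Because the bits stick until rewritten, the guarded instructions can sit anywhere downstream of the compare, which is exactly what makes predicates friendlier to compilers than a single volatile flags register.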
Loading Up the Stores
IA-64 is surprisingly stingy with memory-addressing modes. It has precisely one: register-indirect with optional post-increment. This seems horribly limiting but is very RISC-like in philosophy. Addresses are calculated just like any other number and deposited in a general-purpose register. By avoiding special addressing modes, Itanium avoids specialized hardware in the critical path. VLIW pushes complexity onto the compiler instead of the hardware.
Loads can be pretty uninteresting, but IA-64 manages to spice them up a bit. A load can “hint” to the cache that it would be beneficial to preload additional data after the load, whether that data is likely to be reused, and, if so, which of the three cache levels is most appropriate to hold it. These are not the kinds of things even dedicated assembly-language programmers are likely to know, but large-scale commercial developers might profile a new operating system or major application extensively, and use the feedback to provide prefetch and caching hints. These are just hints, too–the processor is under no obligation to act on the hints or the caching information.
Somewhat stronger than a hint is a speculative load, an instruction that tells the processor it might want to load data from memory. Programmers (or more realistically, advanced compilers) can sprinkle their code with speculative loads to try to snag data that might be needed soon. Itanium will do its best to comply, but if the system bus is busy, the speculative load might be postponed indefinitely. If a speculative load fails (such as from a memory fault or violation) the processor does not raise an exception. Hey, it was only speculative anyway.
Itanium can hoist loads above branches, which many high-end RISCs do, but it can also hoist loads above stores, which is much trickier. The usual problem with the latter is alias detection: the compiler can't be sure that loads and stores aren't to the same address. As long as there's a chance, it's dangerous to load from memory before all the stores to the same memory addresses are finished. Yet loads are time-consuming, so it's a big win if you can accelerate them.
IA-64 gets around this problem–with a little help from you–with the LD.A (load advanced) instruction. LD.A speculatively loads from memory, but also stuffs the load address into a special buffer called the Advanced Load Address Table (ALAT). Subsequent stores to memory are checked against addresses in the ALAT. If there's a match, the speculative load aborts (or, if it already completed, the contents are discarded). Using the data from an LD.A can be tricky, too. You need to validate it with a CHK.A instruction first. There's no guarantee that any calculations you did won't have to be redone with valid data. It's a bit of a gamble, but it can pay handsomely if you speculate wisely. Architecture imitates life.
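Here is the LD.A / store / CHK.A dance as a toy Python model. The class and method names are ours; the real ALAT is a small hardware table with its own replacement policy, but the aliasing logic works as sketched.

```python
# Toy model of the Advanced Load Address Table: an advanced load (LD.A)
# records its address; a later store to the same address invalidates the
# entry, and the check (CHK.A) decides whether the speculated value holds.

class ALAT:
    def __init__(self, memory):
        self.memory = memory
        self.entries = {}            # address -> speculatively loaded value

    def ld_a(self, addr):
        value = self.memory[addr]
        self.entries[addr] = value   # remember the address for later checks
        return value

    def store(self, addr, value):
        self.memory[addr] = value
        self.entries.pop(addr, None) # a matching store kills the entry

    def chk_a(self, addr):
        """True if the advanced load is still valid; False means redo it."""
        return addr in self.entries

mem = {0x100: 5, 0x200: 9}
alat = ALAT(mem)
x = alat.ld_a(0x100)          # load hoisted above the stores below
alat.store(0x200, 7)          # store to a different address: no conflict
assert alat.chk_a(0x100)      # speculation succeeded; keep using x
alat.store(0x100, 8)          # aliasing store: conflict
assert not alat.chk_a(0x100)  # must reload and redo any dependent work
```

The gamble the article describes is visible here: if the aliasing store happens, everything computed from `x` has to be thrown away and redone.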
FP You, Too
Bizarrely, Itanium's two floating-point units can't multiply two numbers together. They can't add, either. The FPU is designed for multiply-accumulate (MAC) operations, so if you want a conventional FP MUL you program it as an FP MAC with an addend of zero. Likewise, if you want a simple FP ADD you're forced to use a multiplicand of 1.0 along with the value you want to add.
Stranger still, Itanium has no integer multiply function at all. Any multiplication, whether it's integer or floating-point, has to happen in the FP MAC unit. Unfortunately, that means transferring a pair of integers from the general-purpose registers to the floating-point registers, then transferring the result back again. Fortunately, IA-64 includes a few instructions specifically for this eventuality. What were they thinking?
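The two identities the FPU leans on are trivial to state in code. (A caveat: Python's `a * b + c` is two rounded operations, whereas the hardware's fused MAC rounds once; for these exact-value examples the distinction doesn't matter.)

```python
# Everything is built on one primitive: multiply-accumulate, a*b + c.

def mac(a, b, c):
    return a * b + c         # the only FP operation the hardware provides

def fmul(a, b):
    return mac(a, b, 0.0)    # multiply = MAC with addend 0

def fadd(a, c):
    return mac(a, 1.0, c)    # add = MAC with multiplicand 1.0

assert fmul(3.0, 4.0) == 12.0
assert fadd(3.0, 4.0) == 7.0
```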
Branches: Going Out On a Limb
The longer the pipeline, the bigger the train wreck if the processor mispredicts a branch. And Itanium has a fairly long pipeline, so the potential for performance-robbing disaster looms ever larger. Predicting branches takes on paramount importance, and to that end, IA-64 has a number of tricks to help it avoid the dreaded mispredicted branch.
First, there's only one form of conditional branch, but its behavior can be based on any of the 64 predicate bits mentioned earlier. Branches can also be tagged with either static or dynamic branch prediction (that's prediction, not predication), which predicts whether the branch is likely, or not likely, to be taken this time around. Static prediction cannot be overridden; dynamic prediction leaves the decision to Itanium's own branch-prediction hardware. If you, as the programmer, know which way the branch is likely to go, stick with static prediction and the chip will assume you're always right. If you're unsure, let Itanium make up its own mind. If you're feeling especially clairvoyant, you can also suggest that Itanium fetch instructions from the predicted target of the branch, and even how far ahead of the branch target it should prefetch.
Predication is cool–it avoids short branches that inject bubbles into the pipeline. Rather than skip over short sections of code, predicated processors can plow straight ahead, either committing or discarding the results based on the predicate test. It effectively permits execution of both code paths of a branch at the same time. Predicated instruction sets have a mixed effect on code density. They improve code density slightly by eliminating branch instructions, but then hand back much of that improvement by usurping several bits (in IA-64's case, six bits per instruction specifying one of 64 predicate registers) from every instruction for the predicate field.
Predicated execution sacrifices execution units on the altar of branch latency. In other words, predicated instructions make it most of the way through the pipeline whether they're supposed to execute or not. In Itanium's case, all conditional instructions are predicated, so everything executes nearly to completion. It's only in the next-to-last DET (exception detection) stage of the pipeline that their effects are canceled if the predicate turns out to be false. By that time, the instruction has already commandeered one of Itanium's nine execution units for nothing, possibly preventing some other instruction from using it. Well, not entirely for nothing; it has served the greater good by avoiding a potential bubble in the pipeline. Better to waste a little work than to spin your wheels waiting for a branch to resolve.
It's small comfort, but predicated instructions that would stall waiting for an operand are killed early, because Itanium resolves the predicate (true/false) at about the same time that it detects the dependency. It won't stall instructions waiting for data that's irrelevant. That's the beauty of predicate bits set well ahead of time instead of flags that are updated every cycle.
Don't Look Back: How Itanium Handles x86 Code
Yes, Virginia, there is an x86-compatibility mode in Itanium. It's awkward and unnatural, but we know how attached you are to your old binaries. x86 support is not a natural part of IA-64, and it's entirely possible that some future IA-64 implementation might drop this feature or water it down, but for now your old Lotus 1-2-3 diskettes are safe.
Itanium supports all x86 instructions in one way or another, even MMX, SSE (not SSE2), Protected, Virtual 8086, and Real mode features. You can even run entire operating systems in x86 mode, or just run the applications under a new IA-64 OS. All the x86 registers map onto Itaniums own general-purpose registers, but some of the less orthogonal x86 registers appear in Itaniums “application registers” AR24 through AR31.
Switching modes appears trivial but isn't. There's one IA-64 instruction that switches the processor to x86 mode and another (newly defined) x86 instruction, JMPE, that switches to IA-64 mode. If the programmer so wishes, interrupts can switch automatically to IA-64 mode, or the machine can stay in x86 mode. In the latter case, you can reuse your x86 interrupt handlers.
Switching to x86 mode is a lot like booting a 386, because you have to set up memory segment descriptors, status registers, and flags. Also, x86 code likes to have its way with all the resources of the processor, either overwriting or ignoring many of Itanium's state bits and registers. It's also likely to upset your cache contents. In general, it's best to save the entire state of the processor before switching to x86 mode. It's awkward enough that you probably don't want to switch modes willy-nilly. Save it for dramatic changes, such as executing entire x86 applications.
Not that anyone was asking, but PA-RISC compatibility is handled offline through a software translator. IA-64 instructions don't directly support PA-RISC instructions, but they do map fairly closely (hey, RISC is RISC). The fact that x86 binaries are emulated in minute detail with enormous helpings of hardware while PA-RISC code is relegated to a translator before it has any hope of running says a lot about the relative importance of these two installed bases. It may also tell us something about the “equal” relationship between the HP and Intel engineers designing IA-64.
Oooooh, It's So Big
The definitions of chip, processor, and die become somewhat clouded with Itanium. The first IA-64 “chip” is really a metal-cased cartridge, somewhat like the Pentium II modules of yore. The cartridge – which is mechanically incompatible with anything ever seen before – contains at least five chips, including the processor itself and four cache SRAMs. The first- and second-level caches (L1 and L2) really are on the same die as the processor; the L3 cache takes up those four SRAMs that are off-chip but on-module. Got it?
Then there's the PAL. PAL is Intel's “processor abstraction layer,” a flash ROM inside the cartridge that, in Intel's words, “… maintain[s] a single software interface for multiple implementations of the processor silicon steppings.” Sounds like a “fudge ROM” for hiding, tweaking, or patching imperfections in processors that may not entirely live up to their data book specification.
The whole thing weighs in at about 325 million transistors: 25 million for the processor chip (including L1 and L2 caches) and about 75 million for each of the four L3 cache chips. We'll toss in the PAL for free. If 25 million transistors seems like a lot, remember that Pentium III has 24 million and Pentium 4 has 42 million. For a high-end 64-bit processor, Itanium is looking positively dinky.
You know what else is big? Itanium's code footprint. Poor code density is a hallmark of VLIW designs, and although IA-64 makes some improvements, as we mentioned, it's no exception to the rule. With no (public) code to look at it's hard to be sure, but educated estimates pin Itanium's code size at about one-third bigger than other 64-bit RISCs and double the size of Pentium binaries.
Poor code density means lots of disk space, but that's not a big deal for high-end systems. It also means less effective cache size, which in turn reduces cache-hit rates. Again, no big deal, because caches can always be made bigger. But cache bandwidth is hard to improve, and that may be the real bottleneck for IA-64 processors. That's why Itanium's first two levels of cache are on the processor die itself and the L3 cache is very nearby on the same module.
Outside the Box
The 128-bit bus between the Itanium die and its L3 caches is contained entirely within the cartridge; it's never exposed to the outside. Itanium's external bus is 64 bits wide, and this is its only connection with the outside world, main memory, or other processors. Up to four processors can share this bus. After that, Intel has a bridge chip that allows four-processor clusters to talk to each other.
It's a pretty pedestrian bus as these things go. It has none of the exotic interprocessor communications that Hammer has (as we'll study in our next segment), nor is it even very fast at 2.1 GB/second of maximum bandwidth, compared with 3.2 GB/second for Pentium 4 or 3.6 GB/second for MIPS. It's also a doomed, dead-end bus: McKinley will have a completely different interface.
McKinley, Madison, and Deerfield: The Next Generation
The second IA-64 processor after Itanium is code-named McKinley, and it's likely to be faster, smaller, and all-around better than its predecessor. McKinley's L1 caches will be the same size as Itanium's, but the L2 cache will grow from 96K to 256K. The L3 cache will get smaller (3M instead of 4M) but move onto the actual chip, not just the same cartridge. All three cache interfaces will get faster. McKinley shaves one cycle off the L1 cache access time (from two cycles to one), shortens L2 access time by seven cycles (to five), and takes eight cycles off the L3 latency (to 12 cycles). Adding the L3 cache to the chip will boost McKinley's die size significantly, probably to around 450 mm², and ups the transistor count to 221 million. But manufacturing cost should be significantly reduced without the external L3 SRAMs and larger package required for the dual-chip (core and L3) Itanium.
McKinley will use a completely different socket design from Itanium and a revised bus interface, dooming the first IA-64 systems almost before they get out the door. Just like Pentium Pro, Itanium's mechanical footprint will be an orphan from Day One. McKinley's system bus will widen to 128 bits (up from Itanium's 64) and its clock frequency will improve from 133 MHz to 200 MHz. The bus will still be double-pumped (i.e., transferring data on both rising and falling edges of every clock), yielding 6.4 GB/sec of front-side bus bandwidth.
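The bus bandwidth figures quoted in this article fall straight out of the width and clock. A quick sanity check, using the article's own numbers:

```python
# Peak bandwidth of a double-pumped bus: bytes per transfer x clock x 2.

def ddr_bandwidth_gb_s(bus_bits, clock_mhz):
    return (bus_bits / 8) * clock_mhz * 2 / 1000   # GB/s, decimal units

itanium = ddr_bandwidth_gb_s(64, 133)    # 64-bit bus at 133 MHz
mckinley = ddr_bandwidth_gb_s(128, 200)  # 128-bit bus at 200 MHz
print(round(itanium, 1), mckinley)       # 2.1 6.4
```

Doubling the width and raising the clock by half again is how McKinley triples Itanium's front-side bandwidth.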
Next up comes Madison, expected to be a 0.13-micron shrink of McKinley, all other things being equal. Deerfield, the fourth member of IA-64s growing family, will also be a 0.13-micron shrink of McKinley, but this time with a smaller 1M L3 cache and yet another new bus interface intended for cheaper systems. Deerfield will be the “value” version of IA-64, à la Celeron or Duron.
IA-64 is an interesting architecture that borrows from and/or extends many existing microarchitectural techniques and adds some new and interesting twists, but the first instantiation of the architecture, Itanium, has not been a major success to date. After waiting a few years longer than originally anticipated for the first IA-64 chip to appear (Intel publicly disclosed initial IA-64 details at Microprocessor Forum in 1997, and stated that the first IA-64 chip, code-named Merced, was expected to ship in mid-1999), we saw a processor with a slower-than-expected clock rate and less-than-stellar integer performance that catered to a very limited market. Initial shipments were also stymied by delays from key vendors. Some commentators called Intel's first IA-64 chip “Unobtanium,” and not surprisingly, the catch-phrase for quite some time has been “wait for McKinley; Itanium is simply a development platform.”
Very recently another setback occurred, with Dell dropping Itanium workstations from its product lineup (see “Dell Discontinues Itanium Workstation”), possibly encouraging even more people to “wait for McKinley.” But clearly Itanium is not all that bad. Floating-point performance as seen in some benchmarks is impressive today, and its large address space can certainly be useful in various high-end applications, but Intel faces a steep uphill battle trying to convince many server and workstation customers, with long histories using established 64-bit architectures, to convert to IA-64 at this juncture. Then again, Intel has swayed many customers to convert portions of their application processing to Itanium-based solutions, as seen at this link. Software developers are a key target as well, and many have been on the IA-64 bandwagon for a while.
Things could improve substantially when McKinley arrives later this year in development systems and early next year in volume. We expect Intel to start seriously ramping IA-64 processor shipments in selected markets within two to three years. But let's not forget about AMD, who clearly appears to be up for the challenge, as we'll see in our next segment. We'll also provide our thoughts on the rumored Yamhill 64-bit x86 “hedge your bet” technology under deep wraps within Intel's development labs.
- Itanium manuals. Be sure to explore the menu to the left of the page – it has links to lots of other Itanium reference material including some PowerPoint slides.
- A nice quick overview of basic Itanium features can be found here.
- A set of performance tests and a summary of SPEC test results are at this link from last summer.
- Intels own benchmarketing results are at this link.