64-Bit CPUs: What You Need to Know - Page 14

Predication is cool: it avoids short branches that inject bubbles into the pipeline. Rather than skip over short sections of code, predicated processors can plow straight ahead, either committing or discarding the results based on the predicate test. It effectively permits execution of both branch code paths at the same time. Predicated instruction sets have a mixed effect on code density. They improve code density slightly by eliminating branch instructions, but then hand back much of that improvement by usurping several bits from every instruction for the predicate field (in IA-64's case, six bits per instruction, specifying one of 64 predicate registers).
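To make the idea concrete, here is a minimal sketch in C of the same clamp operation written two ways: once with a short branch, and once in a predicated style where the test is computed first and then decides which result is kept. The function names are made up for illustration, and the mask trick only imitates predication in portable C; it is not how an IA-64 compiler actually encodes predicated instructions.

```c
#include <stdint.h>

/* Branching form: a compiler may emit a compare followed by a short
   forward branch that skips the assignment. */
int32_t clamp_branch(int32_t x, int32_t limit) {
    if (x > limit) {
        x = limit;
    }
    return x;
}

/* Predicated-style form (illustrative only): the predicate is computed
   once, both candidate results are available, and the predicate decides
   which one is committed. No branch appears in the source. */
int32_t clamp_predicated(int32_t x, int32_t limit) {
    uint32_t p = (uint32_t)(x > limit);   /* predicate: 1 if true, 0 if false */
    uint32_t mask = 0u - p;               /* all ones if p == 1, all zeros if p == 0 */
    uint32_t kept = ((uint32_t)limit & mask) | ((uint32_t)x & ~mask);
    return (int32_t)kept;
}
```

Both functions return the same value for every input; for example, clamping 10 to a limit of 5 yields 5 either way, and -3 passes through unchanged. The difference is that the second version always does the work for both outcomes and throws one away, which is the trade predication makes.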

Predicated execution sacrifices execution units on the altar of branch latency. In other words, predicated instructions make it most of the way through the pipeline whether they're supposed to execute or not. In Itanium's case, all conditional instructions are predicated, so everything executes nearly to completion. It's only in the next-to-last DET (exception detection) stage of the pipeline that their effects are canceled if the predicate turns out to be false. By that time, the instruction has already commandeered one of Itanium's nine execution units for nothing, possibly preventing some other instruction from using it. Well, not entirely for nothing; it has served the greater good by avoiding a potential bubble in the pipeline. Better to waste a little work than to spin your wheels waiting for a branch to resolve.
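The trade-off can be put into rough numbers. The following is a toy cost model, not a description of Itanium's real timing: the function names, the idea of counting "slots," and every parameter value are assumptions chosen purely for illustration.

```c
/* Toy model (all parameters hypothetical).
   Branch version: the short region costs its instructions only when it
   actually runs, plus an expected penalty whenever the branch is
   mispredicted. */
double branch_cost(double n_instr, double p_exec,
                   double mispredict_rate, double penalty) {
    return p_exec * n_instr + mispredict_rate * penalty;
}

/* Predicated version: every instruction in the region occupies an
   execution slot regardless of the predicate, but there is no branch
   to mispredict. */
double predicated_cost(double n_instr) {
    return n_instr;
}
```

With a two-instruction region that runs half the time, a 25 percent mispredict rate, and an assumed eight-slot penalty, the branch version averages 3.0 slots in this model against a flat 2.0 for the predicated version. If the branch were perfectly predictable, the branch version would drop to 1.0 and win, which is why predication pays off mainly for short, hard-to-predict branches.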

It's small comfort, but predicated instructions that would stall waiting for an operand are killed early, because Itanium resolves the predicate (true/false) at about the same time that it detects the dependency. It won't stall instructions waiting for data that's irrelevant. That's the beauty of predicate bits set well ahead of time instead of flags that are updated every cycle.