Intel Labs is teaming up with researchers at the University of Texas, Carnegie Mellon and other universities to develop processors that configure themselves depending on the workload, an idea that could result in significant performance and energy efficiency gains in CPUs.
In a Jan. 30 post on the Intel Labs blog, Chris Wilkerson, a senior staff research scientist at Intel Labs in Oregon, wrote that a collaboration between Intel and University of Texas researchers in particular has resulted in the “development of MorphCore, a CPU that ‘morphs’ between two configurations. One configuration tailored for high performance single-threaded workloads, and a second for higher throughput multi-threaded workloads.”
Such transformable CPUs could result in designs that increase performance by as much as 10 percent and energy efficiency by 22 percent over traditional chips, according to simulations of MorphCore run by researchers, he wrote.
Most software runs in single-thread environments, where the CPU takes care of one task before moving on to the next. However, a growing number of applications, particularly in such environments as supercomputing and high-performance computing (HPC), can be run in parallel, where the workload is broken up into tasks, and the CPU processes the tasks simultaneously.
Over the past few years, organizations have been leveraging GPU accelerators from the likes of Nvidia and Advanced Micro Devices in their high-end systems to help run compute-intensive and highly parallel workloads in an effort to increase the power of their supercomputers while keeping power consumption in check. Intel officials in November 2012 introduced their Xeon Phi coprocessors, x86-based chips that run with traditional Xeon processors to give systems a performance and energy efficiency boost while offering users the benefits of working with familiar Intel Architecture tools.
Intel Labs’ efforts around transformable CPUs would create single chips that could handle both single- and multi-threaded workloads, according to Wilkerson. In the blog post, he noted that single-threaded applications require high-performance CPUs that can throw all available resources at the instruction thread being run, including large register files and buffers that let instructions be reordered so that each one can run as soon as it is ready, rather than waiting until the preceding instruction is complete.
Meanwhile, when processing multiple instruction threads simultaneously, CPUs can often delay work on a thread that has stalled, turning their attention and resources instead to the other threads being processed. Buffers that enable instruction re-ordering aren’t needed, and not using them not only keeps CPU performance high, but also enhances energy efficiency by ensuring that power doesn’t continue to be consumed while dealing with a stalled thread, he wrote.
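The idea of setting a stalled thread aside and issuing work from another can be illustrated with a toy model. The sketch below is not MorphCore's design or anything from Wilkerson's post; it assumes a hypothetical in-order core that issues one instruction per cycle, and the function names and stall values are made up for illustration.

```python
def run_serially(threads):
    """Cycles needed when each thread runs to completion before the next starts.

    Each thread is a list of stall lengths: entry s means the instruction
    takes 1 cycle to issue and then blocks its thread for s more cycles.
    """
    return sum(1 + stall for thread in threads for stall in thread)


def run_interleaved(threads):
    """Cycles needed when the core switches to another ready thread on a stall.

    Assumption: one instruction issues per cycle, taken from the
    lowest-numbered thread that is ready and still has work.
    """
    next_instr = [0] * len(threads)   # index of each thread's next instruction
    ready_at = [0] * len(threads)     # first cycle each thread may issue again
    cycle = 0
    while any(next_instr[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_instr[t]]
                next_instr[t] += 1
                ready_at[t] = cycle + 1 + stall
                break  # only one instruction issues per cycle
        cycle += 1
    # The run ends when the last stall has drained, not at the last issue.
    return max(ready_at, default=0)


# Two hypothetical threads, each with one 3-cycle stall in the middle.
workload = [[0, 3, 0], [0, 3, 0]]
print(run_serially(workload))     # 12
print(run_interleaved(workload))  # 8
```

In this toy case the interleaved run finishes sooner because one thread's stall cycles are spent issuing the other thread's instructions, without any reordering within a thread.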
“UT’s MorphCore addresses this problem by modifying the design of a high performance CPU to permit shutting down some buffers and repurposing others,” Wilkerson wrote.
The CPU’s throughput mode splits the largest of the buffers, the physical register file, into equal partitions, with each partition storing the architectural state of one executing thread, he wrote. Doing so simplifies renaming—which assigns buffer space for fetched instructions—and the throughput mode also cuts power consumption by turning off the load buffer and much of the store buffer, essentially “sacrificing memory reordering in favor of reduced power.”
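The partitioning described above can be pictured with a small model. This is a minimal sketch under stated assumptions, not MorphCore's actual implementation: the class and function names, the 128-entry file and the eight-thread split are all hypothetical, and the fixed register mapping is one plausible reading of how equal partitions could simplify renaming.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PhysicalRegisterFile:
    """Hypothetical model of a physical register file with a fixed entry count."""

    num_entries: int

    def partition(self, num_threads: int) -> list[range]:
        """Split the file into equal, contiguous partitions, one per thread."""
        if num_threads <= 0 or self.num_entries % num_threads != 0:
            raise ValueError("entries must divide evenly among the threads")
        size = self.num_entries // num_threads
        return [range(t * size, (t + 1) * size) for t in range(num_threads)]


def physical_index(partitions: list[range], thread_id: int, arch_reg: int) -> int:
    """Map a thread's architectural register to a physical entry.

    Assumption: in throughput mode each architectural register sits at a
    fixed offset inside its thread's partition, so no rename lookup is needed.
    """
    part = partitions[thread_id]
    if not 0 <= arch_reg < len(part):
        raise ValueError("architectural register does not fit in the partition")
    return part.start + arch_reg


# Hypothetical sizes: a 128-entry file shared by 8 threads, 16 entries each.
parts = PhysicalRegisterFile(num_entries=128).partition(num_threads=8)
print(physical_index(parts, thread_id=3, arch_reg=5))  # 53
```

With a fixed offset per thread, finding a register is simple arithmetic rather than a table lookup, which is the kind of simplification the throughput mode is described as providing.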
Wilkerson said the results are CPUs that can handle whatever workloads are thrown at them. He compared this to a vehicle that can either be an SUV when the power and capacity to drive around family members or friends are needed, or be quickly transformed into a motorcycle to save on gas when only a single person is driving.
“With these [MorphCore] and other promising ideas developed in collaboration with Intel’s academic partners, processors in the next 5-10 years may offer the best of both worlds: high performance to minimize delay and deliver the best user experience; as well as throughput mode to maximize efficiency when single thread performance is less important,” Wilkerson wrote.