Graphics chip maker Nvidia released more details of its ambitious and delayed Graphics Fermi 100 (GF100) processor, which features a 384-bit memory interface.
The next-generation graphics processing unit (GPU) will incorporate more than 3 billion transistors and 512 CUDA cores, which Nvidia says will accelerate its parallel-computing abilities and performance for applications involving ray tracing, physics, finite element analysis, high-precision scientific computing, sparse linear algebra and search algorithms.
Nvidia did offer a look into the Fermi architecture, which is built around CUDA, the hardware and software architecture that enables Nvidia GPUs to execute programs written in C, C++, Fortran, OpenCL, DirectCompute and other languages.
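For illustration, a CUDA program expresses its parallel work as "kernels" written in an extended dialect of C. The sketch below is a generic example of such a kernel, not code from Nvidia's materials; the saxpy name and its parameters are assumptions for illustration only.

    // A minimal CUDA C kernel: each GPU thread computes one element of
    // y = a*x + y. The kernel name and parameters are illustrative.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n)
            y[i] = a * x[i] + y[i];
    }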
Nvidia said the first Fermi-based GPU features up to 512 CUDA cores, each capable of executing one floating-point or integer instruction per clock for a thread.
The 512 CUDA cores are organized into 16 streaming multiprocessors (SMs) of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface supporting up to a total of 6GB of GDDR5 DRAM. A host interface connects the GPU to the CPU via PCI Express, and the GigaThread global scheduler distributes thread blocks to the SM thread schedulers.
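To sketch how that organization is exercised from software, the hypothetical host code below launches the saxpy kernel shown earlier as a grid of thread blocks; on Fermi, the GigaThread scheduler would distribute those blocks across the SMs. The block size of 256 and the problem size are assumptions, not Nvidia figures.

    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y);  // defined earlier

    int main(void)
    {
        const int n = 1 << 20;
        float *d_x, *d_y;
        cudaMalloc(&d_x, n * sizeof(float));   // device buffers (initialization omitted)
        cudaMalloc(&d_y, n * sizeof(float));

        int threadsPerBlock = 256;             // a multiple of the 32-thread warp
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

        // The resulting grid of thread blocks is handed to the GigaThread
        // global scheduler, which distributes them to the SM thread schedulers.
        saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);
        cudaDeviceSynchronize();               // wait for the kernel to finish

        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }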
Each SM features 32 CUDA processors, a fourfold increase over prior SM designs.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic.
The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single- and double-precision arithmetic.
Nvidia said the Fermi architecture has been specifically designed to offer “unprecedented performance” in double-precision arithmetic, performing up to 16 double-precision fused multiply-add operations per SM, per clock.
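From a programmer's point of view, FMA computes a*b + c with a single rounding step instead of two, which improves accuracy as well as throughput. The device functions below are a minimal sketch; fma() and fmaf() are standard CUDA math functions, while the surrounding function names are illustrative.

    // Fused multiply-add from CUDA device code, per IEEE 754-2008:
    // a*b + c is computed with one rounding step rather than two.
    __device__ double dp_term(double a, double b, double c)
    {
        return fma(a, b, c);    // double-precision FMA
    }

    __device__ float sp_term(float a, float b, float c)
    {
        return fmaf(a, b, c);   // single-precision FMA
    }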
Fermi also supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. This allows programs that execute a number of small kernels to utilize the whole GPU, making maximum use of its resources. Kernels from different application contexts still run sequentially, but with high efficiency thanks to improved context-switching performance.
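In CUDA terms, this kind of concurrency is expressed with streams. The sketch below is an assumed example, not Nvidia sample code: two hypothetical small kernels are issued into separate streams so that hardware capable of concurrent kernel execution, such as Fermi, can overlap them.

    #include <cuda_runtime.h>

    __global__ void kernelA(void) { /* small amount of work */ }
    __global__ void kernelB(void) { /* small amount of work */ }

    int main(void)
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        kernelA<<<4, 128, 0, s1>>>();   // too small to fill the GPU alone
        kernelB<<<4, 128, 0, s2>>>();   // may run concurrently with kernelA

        cudaDeviceSynchronize();        // wait for both streams to drain
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }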
“Rather than taking the simple route of adding execution units, the Fermi team has tackled some of the toughest problems of GPU computing. The importance of data locality is recognized through Fermi’s two-level cache hierarchy and its combined load/store memory path,” an Nvidia whitepaper said. “Double precision performance is elevated to supercomputing levels, while atomic operations execute up to twenty times faster. Lastly, Fermi’s comprehensive ECC support strongly demonstrates our commitment to the high-performance computing market.”
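As a brief illustration of the atomic operations the whitepaper cites, the kernel below uses atomicAdd, a standard CUDA intrinsic, to let many threads safely update shared counters; the histogram scenario and names are assumptions for illustration only.

    // Hypothetical histogram kernel: threads increment shared bins atomically.
    // Assumes every value in data is a valid bin index.
    __global__ void histogram(const int *data, int n, int *bins)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1);   // safe concurrent update
    }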