Nvidia has released new details of its delayed graphics architecture, code-named Fermi, although it did not mention price, clock speeds or chip size.
Graphics chip maker Nvidia released more details of its ambitious and delayed
Graphics Fermi 100 (GF100) graphics processor, which features a 384-bit memory
interface.
The next-generation graphics processing unit (GPU) will incorporate more
than 3 billion transistors and 512 CUDA cores, which Nvidia
says will accelerate its parallel-computing abilities and performance for
applications involving ray tracing, physics, finite element analysis,
high-precision scientific computing, sparse linear algebra and search
algorithms.
Nvidia did offer a look at the Fermi architecture, which is built around CUDA,
the hardware and software architecture that enables Nvidia GPUs to execute
programs written with C, C++, Fortran, OpenCL, DirectCompute and other languages.
Nvidia said the first Fermi-based GPU features up to 512 CUDA cores, with each
executing a floating point or integer instruction per clock for a thread.
The 512 CUDA cores are organized in 16 SMs (streaming multiprocessors) of 32
cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory
interface, supporting up to 6GB of GDDR5 DRAM in total. A host interface
connects the GPU to the CPU via PCI-Express, and the GigaThread global
scheduler distributes thread blocks to SM thread schedulers.
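
The figures above are internally consistent, as a few lines of arithmetic can confirm. The sketch below is purely illustrative: the constant and variable names are ours, and the even split of memory across partitions is an assumption rather than something Nvidia stated.

```python
# Figures quoted above for the first Fermi-based GPU.
SM_COUNT = 16               # streaming multiprocessors
CORES_PER_SM = 32           # CUDA cores in each SM
MEMORY_PARTITIONS = 6       # memory partitions
PARTITION_WIDTH_BITS = 64   # width of each partition
MAX_MEMORY_GB = 6           # maximum supported GDDR5

total_cores = SM_COUNT * CORES_PER_SM
bus_width_bits = MEMORY_PARTITIONS * PARTITION_WIDTH_BITS
# Assumption: memory is divided evenly among the partitions.
memory_per_partition_gb = MAX_MEMORY_GB / MEMORY_PARTITIONS

print(total_cores)              # 512
print(bus_width_bits)           # 384
print(memory_per_partition_gb)  # 1.0
```
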
Each SM features 32 CUDA processors, a fourfold increase over prior SM
designs.
Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU)
and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point
arithmetic.
The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing
the fused multiply-add (FMA) instruction for both single- and double-precision
arithmetic.
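
What makes a fused multiply-add different from a separate multiply and add is that the product is not rounded before the addition, so the result is rounded only once. The sketch below emulates that behavior in software with exact rational arithmetic. It is not GPU code, the helper name `fma_exact` is hypothetical, and the input values were chosen by us to make the difference visible in double precision.

```python
from fractions import Fraction


def fma_exact(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single rounding to the nearest double


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c         # the product is rounded to 1.0 before the add
fused = fma_exact(a, b, c)  # the exact product 1 - 2**-60 survives

print(unfused)               # 0.0
print(fused == -(2.0**-60))  # True
```

The unfused version loses the small term entirely, while the fused version keeps it. That is the kind of accuracy gain the FMA instruction is meant to provide.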
Nvidia said the Fermi architecture has been specifically designed to offer
"unprecedented performance" in double-precision arithmetic, with up to 16
double-precision fused multiply-add operations that can be performed per SM,
per clock.
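
The per-SM figure can be turned into a rough theoretical peak for the whole chip. The sketch below assumes the common convention of counting one FMA as two floating-point operations. Because Nvidia has not disclosed clock speeds, the clock is left as a parameter and the 1.0 GHz value is a placeholder, not a specification.

```python
SM_COUNT = 16                 # SMs on the first Fermi-based GPU (from the article)
DP_FMA_PER_SM_PER_CLOCK = 16  # double-precision FMAs per SM per clock (from the article)
FLOPS_PER_FMA = 2             # convention: one multiply plus one add


def peak_dp_gflops(clock_ghz: float) -> float:
    """Theoretical peak double-precision GFLOPS at a given clock."""
    fma_per_clock = SM_COUNT * DP_FMA_PER_SM_PER_CLOCK
    return fma_per_clock * FLOPS_PER_FMA * clock_ghz


# Hypothetical clock, chosen only to make the arithmetic easy to follow.
print(peak_dp_gflops(1.0))  # 512.0
```
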
Fermi also supports concurrent kernel execution, where different kernels of
the same application context can execute on the GPU at the same time.
Concurrent kernel execution allows programs that execute a number of small
kernels to utilize the whole GPU, maximizing the use of GPU resources.
Kernels from different application contexts can still run sequentially with
high efficiency due to the improved context switching performance.
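
A toy model suggests why running small kernels side by side matters. The sketch below is not Nvidia's scheduler. It assumes that every thread block occupies one SM for one time step, that all blocks take equally long, and that the workload consists of four small kernels invented for illustration.

```python
import math

SM_COUNT = 16  # SMs on the first Fermi-based GPU (from the article)


def utilization(steps, total_blocks):
    """Fraction of SM time slots that did useful work."""
    return total_blocks / (steps * SM_COUNT)


def sequential_steps(kernels):
    """Each kernel has the GPU to itself until all of its blocks finish."""
    return sum(math.ceil(blocks / SM_COUNT) for blocks in kernels)


def concurrent_steps(kernels):
    """Blocks from every kernel share the pool of SMs."""
    return math.ceil(sum(kernels) / SM_COUNT)


# Hypothetical workload: four small kernels, sized in thread blocks.
kernels = [4, 6, 2, 4]
total = sum(kernels)

print(utilization(sequential_steps(kernels), total))  # 0.25
print(utilization(concurrent_steps(kernels), total))  # 1.0
```

In this simplified picture, running the kernels one after another leaves three-quarters of the SM time slots idle, while running them together fills the chip.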
"Rather than taking the simple route of adding execution units, the Fermi
team has tackled some of the toughest problems of GPU computing. The importance
of data locality is recognized through Fermi's two-level cache hierarchy and
its combined load/store memory path," an Nvidia whitepaper said. "Double
precision performance is elevated to supercomputing levels, while atomic
operations execute up to twenty times faster. Lastly, Fermi's comprehensive ECC
support strongly demonstrates our commitment to the high-performance computing
market."
Nathan Eddy is Associate Editor, Midmarket, at eWEEK.com. Before joining eWEEK.com, Nate was a writer with ChannelWeb and he served as an editor at FierceMarkets. He is a graduate of the Medill School of Journalism at Northwestern University.