GF100 high-level architecture overview
We’re going to start with the high-level, 36,000-foot overview of GF100. From this altitude, GF100 looks somewhat similar to previous GeForce GPUs, but trust us, the differences become much more apparent once we drop down to the low level. Here’s the block diagram:

If you recall previous NVIDIA architectures, you’ll note that some of the terminology from the G80/GT200 days has changed. What NVIDIA once called Streaming Processors (SPs) are now called CUDA Cores. The basic functionality is the same; someone in marketing simply decided that mixing “CUDA” and “Core” sounded better. There is one interesting change, however: the Texture Processing Clusters (TPCs) from previous GPUs have been replaced by more capable Streaming Multiprocessors (SMs) in GF100.
The CUDA Cores (we referred to them as shaders on the previous page) are the small green squares in the block diagram above. Again, GF100 has 512 of them.
Each SM has 32 CUDA Cores, four texture units, NVIDIA’s PolyMorph engine, dedicated caches, and more. Previous architectures had eight CUDA Cores per SM. Texture filtering units and their L1 cache were then grouped with these SMs to make up a TPC. We’ll be taking a closer look at GF100’s new SMs a little later in this article.
As you can see, the SMs are organized into groups called graphics processing clusters (GPCs). A GPC consists of four SMs and one raster engine. Here’s where you’ll see another key difference between GF100 and GT200. Whereas GT200 was limited to just one raster engine for the whole GPU, each GPC in GF100 has its own raster engine. With the exception of ROP functions, each GPC is essentially a self-contained GPU, hence the name graphics processing cluster.
GF100 has four GPCs, with each GPC containing 128 CUDA Cores. Add it all up, and you’ve got 512 cores.
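If you want to check the math yourself, the hierarchy above can be tallied in a few lines of Python. This is just a sketch built from the unit counts quoted in this article; the constant and function names are our own and don’t come from NVIDIA.

```python
# Unit counts as quoted in the article; all names here are illustrative only.
GPCS = 4                  # graphics processing clusters on a full GF100
SMS_PER_GPC = 4           # streaming multiprocessors per GPC
CORES_PER_SM = 32         # CUDA Cores per SM
TEXTURE_UNITS_PER_SM = 4  # texture units per SM


def gf100_totals(gpcs=GPCS):
    """Tally chip-wide unit counts for a given number of GPCs."""
    sms = gpcs * SMS_PER_GPC
    return {
        "sms": sms,
        "cuda_cores": sms * CORES_PER_SM,
        "texture_units": sms * TEXTURE_UNITS_PER_SM,
        "polymorph_engines": sms,  # one PolyMorph engine per SM
        "raster_engines": gpcs,    # one raster engine per GPC
    }


totals = gf100_totals()
print(totals)
```

For the full chip this works out to 16 SMs, 512 CUDA Cores, 64 texture units, 16 PolyMorph engines, and four raster engines, which lines up with the block diagram.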
Tied to all that is the memory subsystem, which consists of six 64-bit memory controllers (384-bit total), L2 cache, and 48 ROPs. The ROPs are organized into six groups of eight and are the dark blue squares in the block diagram above, with each ROP group paired up with its own memory controller.
NVIDIA can scale the number of GPCs and memory controllers down to address different markets. To compete with the Radeon 5700 series in the mainstream market, for example, you could see a cut-down GF100 derivative with just two GPCs (256 shaders total) and a 128-bit memory interface.
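To get a feel for how that scaling plays out, here’s a rough Python sketch that derives the headline numbers from a GPC count and a memory controller count. The `ChipConfig` class is our own invention, and the cut-down part is purely hypothetical. We’re also assuming that each memory controller brings its group of eight ROPs along with it, which follows from the pairing described above but isn’t something NVIDIA has confirmed for any derivative.

```python
from dataclasses import dataclass

# Per-block figures quoted in the article for GF100.
CORES_PER_GPC = 128          # 4 SMs x 32 CUDA Cores
CONTROLLER_WIDTH_BITS = 64   # each memory controller is 64 bits wide
ROPS_PER_CONTROLLER = 8      # assumption: one group of 8 ROPs per controller


@dataclass(frozen=True)
class ChipConfig:
    """Hypothetical description of a GF100-style chip (illustrative only)."""
    gpcs: int
    memory_controllers: int

    @property
    def cuda_cores(self) -> int:
        return self.gpcs * CORES_PER_GPC

    @property
    def bus_width_bits(self) -> int:
        return self.memory_controllers * CONTROLLER_WIDTH_BITS

    @property
    def rops(self) -> int:
        return self.memory_controllers * ROPS_PER_CONTROLLER


full = ChipConfig(gpcs=4, memory_controllers=6)      # full GF100
cut_down = ChipConfig(gpcs=2, memory_controllers=2)  # speculative mainstream part

for name, chip in (("full", full), ("cut-down", cut_down)):
    print(name, chip.cuda_cores, chip.bus_width_bits, chip.rops)
```

Under those assumptions, the two-GPC part with two memory controllers lands at 256 cores and a 128-bit bus, matching the example above, and would carry 16 ROPs.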
The following chart highlights some of the key differences between GF100 and its predecessor, GT200:
GeForce GPU Features Comparison

| Feature | GT200 | GF100 |
| --- | --- | --- |
| # of Transistors | 1.4 billion | 3.0 billion |
| CUDA Cores | 240 | 512 |
| Raster Engines | 1 | 4 |
| PolyMorph Engines | - | 16 |
| Special Function Units (per SM) | 2 | 4 |
| Texture Units | 80 | 64* |
| ROPs | 32 | 48* |
| Warp Schedulers (per SM) | 1 | 2 |
| Total Shared Memory | 16KB | Configurable 48KB or 16KB |
| L1 Texture Cache (per quad) | 12KB | 12KB |
| Dedicated L1 Load/Store Cache | None | 16KB or 48KB |
| L2 Cache | 256KB (for texture reads only) | 768KB (all clients read/write) |
| Concurrent Kernels | No | Up to 16 |

*Improved clock speed

And that’s the quick-and-dirty high-level overview. We’re now going to go low level, taking a closer look at the functional blocks that make up GF100.