2025-07-11 23:54:55
Today, we’re used to desktop processors with 16 or more cores, and several times that in server CPUs. But even prior to 2010, Intel and AMD were working hard to reach higher core counts. AMD’s “Magny Cours”, or Opteron 6000 series, was one such effort. Looking into how AMD approached core count scaling with pre-2010s technology should make for a fun retrospective. Special thanks goes to cha0shacker for providing access to a dual socket Opteron 6180 SE system.
A Magny Cours chip is basically two Phenom II X6 CPU dies side by side. Reusing the prior generation’s 6-core die reduces validation requirements and improves yields compared to taping out a new higher core count die. The two dies are connected via HyperTransport (HT) links, which previously bridged multiple sockets starting from the K8 generation. Magny Cours takes the same concept but runs HyperTransport signals through on-package PCB traces. Much like a dual socket setup, the two dies on a Magny Cours chip each have their own memory controller, and cores on a die enjoy faster access to locally attached memory. That creates a NUMA (Non-Uniform Memory Access) setup, though Magny Cours can also be configured to interleave memory accesses across nodes to scale performance with non-NUMA aware code.
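NUMA-aware software deals with this placement explicitly, by pinning threads to a node and allocating memory from that node’s local pool. As a rough illustration of the idea (a minimal libnuma sketch, not the test code used for this article):

```c
// Minimal libnuma sketch: pin the calling thread to node 0 and allocate memory there,
// so its accesses stay local instead of crossing HyperTransport links.
// Build with: gcc numa_local.c -lnuma
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }
    int node = 0;
    numa_run_on_node(node);                     // restrict this thread to CPUs on node 0
    size_t sz = 64 * 1024 * 1024;
    char *buf = numa_alloc_onnode(sz, node);    // back the buffer with node 0's DRAM
    memset(buf, 1, sz);                         // touch pages so they're actually placed locally
    /* ... run latency or bandwidth measurements against buf here ... */
    numa_free(buf, sz);
    return 0;
}
```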
Magny Cours’s in-package links have an awkward setup. Each die has four HT ports, each 16 bits wide and capable of operating in an “unganged” mode to provide two 8-bit sub links. The two dies are connected via a 16-bit “ganged” link along with an 8-bit sublink from another port. AMD never supported using the 8-bit cross-die link though, perhaps because its additional bandwidth would be difficult to utilize and interleaving traffic across uneven links sounds complicated. Magny Cours uses Gen 3 HT links that run at up to 6.4 GT/s, so the two dies have 12.8 GB/s of bandwidth between them. Including the disabled 8-bit sublink would increase intra-package bandwidth to 19.2 GB/s.
With one and a half ports connected within a package, each die has 2.5 HT ports available for external connectivity. AMD decided to use those to give a G34 package four external HT ports. One is typically used for IO in single or dual socket systems, while the other three connect to the other socket.
Quad socket systems get far more complicated, and implementations can allocate links to prioritize either IO bandwidth or cross-socket bandwidth. AMD’s slides show a basic example where four 16-bit HT links are allocated to IO, but it’s also possible to have only two sockets connect to IO.
In the dual socket setup we’re testing with here, two ports operate in “ganged” mode and connect corresponding dies on the two sockets. The third port is “unganged” to provide a pair of 8-bit links, which connect die 0 on the first socket to die 1 on the second socket, and vice versa. That creates a fully connected mesh. The resulting topology resembles a square, with more link bandwidth along its sides and less across its diagonals.
Cross-node memory latency is 120-130 ns, or approximately 50-60 ns more than a local memory access. Magny Cours lands in the same latency ballpark as a newer Intel Westmere dual socket setup. Both dual socket systems from around 2010 offer significantly lower latencies for both local and remote accesses compared to modern systems. The penalty for a remote memory access over a local one is also lower, suggesting both the memory controllers and cross-socket links have lower latency.
Like prior AMD generations, Magny Cours’s memory controllers (MCTs) are responsible for ensuring coherency. They can operate in a broadcast mode, where the MCTs probe everyone with each memory request. While simple, this scheme creates a lot of probe traffic and increases DRAM latency because the MCTs have to wait for probe responses before returning data from DRAM. Most memory requests don’t need data from another cache, so AMD implemented an “HT assist” option that reserves 1 MB of L3 cache per die for use as a probe filter. Each MCT uses the probe filter to remember which lines in its local address space are cached elsewhere in the system and, if so, what state they’re cached in.
Regardless of whether HT assist is enabled, Magny Cours’s MCTs are solely responsible for ensuring cache coherency. Therefore, core to core transfers must be orchestrated by the MCT that owns the cache line in question. Cores on the same die may have to exchange data through another die, if the cache line is homed to that other die. Transfers within the same die have about 180 ns of latency, and crossing to the other die within the same socket adds roughly another 50 ns. In the worst case, latency can pass 300 ns when bouncing a cache line across three dies (two cores on separate dies, orchestrated by a memory controller on a third die).
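Core to core latency figures like these are typically gathered by bouncing a cache line between two threads pinned to specific cores and timing the handoffs. The sketch below measures a full round trip with C11 atomics; it illustrates the concept rather than the exact methodology behind the numbers above.

```c
// Two threads pinned to different cores take turns flipping a shared flag. Each handoff
// moves the cache line between the cores' caches, orchestrated by the MCT that owns it.
// Build with: gcc -O2 c2c.c -lpthread
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
static _Atomic int flag = 0;

static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *partner(void *arg) {
    pin_to_cpu(*(int *)arg);
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load(&flag) != 1) ;   // wait for the main thread's write
        atomic_store(&flag, 0);             // hand the line back
    }
    return NULL;
}

int main(void) {
    int cpu_a = 0, cpu_b = 6;               // pick cores on different dies to see cross-node latency
    pthread_t t;
    pthread_create(&t, NULL, partner, &cpu_b);
    pin_to_cpu(cpu_a);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERS; i++) {
        atomic_store(&flag, 1);             // signal the partner thread
        while (atomic_load(&flag) != 0) ;   // wait for its response
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("round trip: %.1f ns\n", ns / ITERS);
    return 0;
}
```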
For comparison, Intel’s slightly newer Westmere uses core valid bits at the L3 cache to act as a probe filter, and can complete core to core transfers within the same die even if the address is homed to another die. Core to core latencies are also lower across the board.
The bandwidth situation with AMD’s setup is quite complicated because it’s a quad-node system, as opposed to the dual-node Westmere setup with half as many cores. Node pairs connected via 16-bit HT links get about 5 GB/s of bandwidth between them, with slightly better performance over an intra-package link than over a cross-socket one. Cross-node bandwidth is lowest over the 8-bit “diagonal” cross-socket links, at about 4.4 GB/s.
From a simple perspective of how fast one socket can read data from another, the Opteron 6180 SE lands in the same ballpark as a Xeon X5650 (Westmere) system. Modern setups of course enjoy massively higher bandwidth, thanks to newer DDR versions, wider memory buses, and improved cross-socket links.
Having cores on both sockets read from the other’s memory pool brings cross-socket bandwidth to just over 17 GB/s, though getting that figure requires making sure the 16-bit links are used rather than the 8-bit ones. Repeating the same experiment but going over the 8-bit diagonal links only achieves 12.33 GB/s. I was able to push total cross-node bandwidth to 19.3 GB/s with a convoluted test where cores on each die read from memory attached to another die over a 16-bit link. To summarize, refer to the following simple picture:
NUMA-aware applications will of course try to keep memory accesses local, and minimize the more expensive accesses over HyperTransport links. I was able to achieve just over 48 GB/s of DRAM bandwidth across the system with all cores reading from their directly attached memory pools. That gives the old Opteron system similar DRAM bandwidth to a relatively modern Ryzen 3950X setup. Of course, the newer 16-core chip has massively higher cache bandwidth and doesn’t have NUMA characteristics.
Magny Cours’s on-die network, or Northbridge, bridges its six cores to the local memory controller and HyperTransport links. AMD’s Northbridge design internally consists of two crossbars, dubbed the System Request Interface (SRI) and XBAR. Cores connect to the SRI, while the XBAR connects the SRI with the memory controller and HyperTransport links. The two-level split likely reduces port count on each crossbar. 10h CPUs have a 32 entry System Request Queue between the SRI and XBAR, up from 24 entries in earlier K8-based Opterons. At the XBAR, AMD has a 56 entry XBAR Scheduler (XCS) that tracks commands from the SRI, memory controller, and HyperTransport links.
Crossbar: This topology is simple to build, and naturally provides an ordered network with low latency. It is suitable where the wire counts are still relatively small. This topology is suitable for an interconnect with a small number of nodes
Arm’s AMBA 5 CHI Architecture Specification (https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf), on the tradeoffs between crossbar, ring, and mesh interconnects
The crossbar setup in AMD’s early on-die network does an excellent job of delivering low memory latency. Baseline memory latency with just a pointer chasing pattern is 72.2 ns. Modern server chips with more complicated interconnects often see memory latency exceed 100 ns.
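A pointer chasing test walks a chain of dependent loads, so each access has to complete before the next address is known and the full load-to-use latency is exposed. A bare-bones sketch of the pattern, assuming a randomized chain to defeat the prefetchers:

```c
// Pointer chasing: arr[] holds a random cyclic permutation, so each load's address depends
// on the previous load's result and hardware prefetchers can't follow the chain.
// Build with: gcc -O2 chase.c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t elems = (64 * 1024 * 1024) / sizeof(uint32_t);  // 64 MB region, well past L3
    uint32_t *arr = malloc(elems * sizeof(uint32_t));

    // Sattolo's algorithm: build one random cycle that visits every element
    for (size_t i = 0; i < elems; i++) arr[i] = (uint32_t)i;
    srand(1);
    for (size_t i = elems - 1; i > 0; i--) {
        size_t j = rand() % i;
        uint32_t tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }

    uint64_t iters = 20 * 1000 * 1000;
    uint32_t current = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iters; i++)
        current = arr[current];                            // each load depends on the previous one
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per load (result %u)\n", ns / iters, current);
    free(arr);
    return 0;
}
```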
As bandwidth demands increase, the interconnect does a mediocre job of ensuring a latency sensitive thread doesn’t get starved by bandwidth hungry ones. Latency increases to 177 ns with the five other cores on the same die generating bandwidth load, which is more than a 2x increase over unloaded latency. Other nodes connected via HyperTransport can generate even more contention on the local memory controller. With cores on another die reading from the same memory controller, bandwidth drops to 8.3 GB/s, while latency from a local core skyrockets to nearly 400 ns. Magny Cours likely suffers contention at multiple points in the interconnect. The most notable issue though is poor memory bandwidth compared to what the setup should be able to achieve.
Three cores are enough to reach a single node’s bandwidth limit on Magny Cours, which is approximately 10.4 GB/s. With dual channel DDR3-1333 on each node (two channels × 8 bytes × 1333 MT/s), the test system should have 21.3 GB/s of DRAM bandwidth per node, or 85.3 GB/s across all four nodes. However, bandwidth testing falls well short: even when using all 6 cores to read from a large array, a single die on the Opteron 6180 SE is barely better than a Phenom II X4 945 with DDR2-800, and slightly worse than a Phenom X4 9950 with fast DDR2. This could be down to the low 1.8 GHz northbridge clock and a narrow link to the memory controller (perhaps 64-bit), or insufficient queue entries at the memory controller to absorb DDR latency. Whatever the case, Magny Cours leaves much of DDR3’s potential bandwidth advantage on the table. This issue seems to be solved in Bulldozer, which can achieve well over 20 GB/s with a 2.2 GHz northbridge clock.
Magny Cours is designed to deliver high core count, not maximize single threaded performance. The Opteron 6180 SE runs its cores at 2.5 GHz and northbridge at 1.8 GHz, which is slower than even first-generation Phenoms. With two desktop dies in the same package, AMD needs to use lower clock speeds to keep power under control. Single-threaded SPEC CPU2017 scores are therefore underwhelming. The Opteron 6180 SE comes in just behind Intel’s later Goldmont Plus core, which can clock up to 2.7 GHz.
AMD’s client designs from a few years later achieved much better SPEC CPU2017 scores. Even the older Phenom X4 9950 takes a tiny lead over the Opteron across both test suites. Its smaller 2 MB L3 cache is balanced out by higher clocks, with an overclock to 2.8 GHz really helping it out. Despite losing 1 MB of L3 for use as a probe filter and thus only having 5 MB of L3, the Opteron 6180 maintains a healthy L3 hit rate advantage over its predecessor.
The exact advantage of the larger L3 can vary of course. 510.parest is a great showcase of what a bigger cache can achieve, with the huge hit rate difference giving the Opteron 6180 SE a 9.4% lead over the Phenom X4 9950. Conversely, 548.exchange2 is a high IPC test with a very tiny data footprint. In that subtest, the Phenom X4 9950 uses its 12% clock speed advantage to gain an 11% lead.
Technologies available in the late 2000s made core count scaling a difficult task. Magny Cours employed a long list of techniques to push core counts higher while keeping cost under control. Carving a snoop filter out of L3 capacity helps reduce die area requirements. At a higher level, AMD reuses a smaller die across different market segments, and uses multiple instances of the same die to scale core counts. As a result, AMD can keep fewer, smaller dies in production. In some ways, AMD’s strategy back then has parallels to their current one, which also seeks to reuse a smaller die across different purposes.
While low cost, AMD’s approach has downsides. Because each hexacore die is a self-contained unit with its own cores and memory controller, Magny Cours relies on NUMA-aware software for optimal performance. Intel also has to deal with NUMA characteristics when scaling up core counts, but their octa-core Nehalem-EX and 10-core Westmere EX provide more cores and more memory bandwidth in each NUMA node. AMD’s HyperTransport and low latency northbridge do get credit for keeping NUMA cross-node costs low, as cross-node memory accesses have far lower latency than in modern designs. But AMD still relies more heavily on software written with NUMA in mind than Intel.
Digging deeper reveals quirks with Magny Cours. Memory bandwidth is underwhelming for a DDR3 system, and the Northbridge struggles to maintain fairness under high bandwidth load. A four-node system might be fully connected, but “diagonal” links have lower bandwidth. All of that makes the system potentially more difficult to tune for, especially compared to a modern 24-core chip with uniform memory access. Creating a high core count system was quite a challenge in the years leading up to 2010, and quirks are expected.
But Magny Cours evidently worked well enough to chart AMD’s scalability course for the next few generations. AMD would continue to reuse a small die across server and client products, and scaled core counts by scaling die count. Even for their Bulldozer Opterons, AMD kept this general setup of four sockets with two dies each. The formula only changed with Zen 1, which is perhaps the ultimate evolution of the Magny Cours strategy: it hit four dies per socket and replaced the Northbridge/HyperTransport combination with Infinity Fabric components. Today, AMD continues to use a multi-die strategy, though with more die types (IO dies and CCDs) to provide more uniform memory access.
2025-07-07 04:56:17
Lion Cove is Intel’s latest high performance CPU architecture. Compared to its predecessor, Raptor Cove, Intel’s newest core can sustain more instructions per cycle, reorganizes the execution engine, and adds an extra level to the data cache hierarchy. The list of changes goes on, with tweaks to just about every part of the core pipeline. Lion Cove does well in the standard SPEC CPU2017 benchmark suite, where it posts sizeable gains especially in higher IPC subtests. In the Arrow Lake desktop platform, Lion Cove can often go head-to-head against AMD’s Zen 5, and posts an overall lead over Intel’s prior Raptor Cove while pulling less power. But a lot of enthusiasts are interested in gaming performance, and games have different demands from productivity workloads.
Here, I’ll be running a few games while collecting performance monitoring data. I’m using the Core Ultra 9 285K with DDR5-6000 28-36-36-96, which is the fastest memory I have available. E-Cores are turned off in the BIOS, because setting affinity to P-Cores caused massive stuttering in Call of Duty. In Cyberpunk 2077, I’m using the built-in benchmark at 1080P and medium settings, with upscaling turned off. In Palworld, I’m hanging out near a base, because CPU load tends to be higher with more entities around.
Gaming workloads generally fall at the low end of the IPC range. Lion Cove can sustain eight micro-ops per cycle, which roughly corresponds to eight instructions per cycle because most instructions map to a single micro-op. It posts very high IPC figures in several SPEC CPU2017 tests, with some pushing well past 4 IPC. Games however get nowhere near that, and find company with lower IPC tests that see their performance limited by frontend and backend latency.
Top-down analysis characterizes how well an application is utilizing a CPU core’s width, and accounts for why pipeline slots go under-utilized. This is usually done at the rename/allocate stage, because it’s often the narrowest stage in the core’s pipeline, which means throughput lost at that stage can’t be recovered later. To briefly break down the reasons:
Bad Speculation: Slot was utilized, but the core was going down the wrong path. That’s usually due to a branch mispredict.
Frontend Latency: Frontend didn’t deliver any micro-ops to the renamer that cycle.
Frontend Bandwidth: The frontend delivered some micro-ops, but not enough to fill all renamer slots (eight on Lion Cove).
Core Bound: The backend couldn’t accept more micro-ops from the frontend, and the instruction blocking retirement isn’t a memory load.
Backend Memory Bound: As above, but the instruction blocking retirement is a memory load. Intel only describes the event as “TOPDOWN.MEMORY_BOUND_SLOTS” (event 0xA4, unit mask 0x10), but AMD and others explicitly use the criteria of a memory load blocking retirement for their corresponding metrics. Intel likely does the same.
Retiring: The renamer slot was utilized and the corresponding micro-op was eventually retired (useful work).
Core width is poorly utilized, as implied by the IPC figures above. Backend memory latency accounts for a plurality of lost pipeline slots, though there’s room for improvement in instruction execution latency (core bound) and frontend latency as well. Bad speculation and frontend bandwidth are not major issues.
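The percentages behind a breakdown like this come from simple bookkeeping: total slots are cycles multiplied by rename width, and each category’s share is its slot count divided by that total. A small sketch with made-up counter values (placeholders, not Intel’s exact event names or formulas):

```c
// Top-down bookkeeping sketch: total slots = cycles * rename width (8 on Lion Cove).
// The per-category counts below are placeholders standing in for values read from
// performance counters; together they account for every slot.
#include <stdio.h>

int main(void) {
    const double WIDTH = 8.0;           // renamer slots per cycle on Lion Cove
    double cycles       = 1.0e9;        // placeholder counter values follow
    double retiring     = 1.6e9;
    double bad_spec     = 0.3e9;
    double fe_latency   = 1.4e9;
    double fe_bandwidth = 0.3e9;
    double core_bound   = 1.2e9;
    double backend_mem  = 3.2e9;

    double total = cycles * WIDTH;
    printf("%-16s %5.1f%%\n", "retiring",        100.0 * retiring     / total);
    printf("%-16s %5.1f%%\n", "bad speculation", 100.0 * bad_spec     / total);
    printf("%-16s %5.1f%%\n", "fe latency",      100.0 * fe_latency   / total);
    printf("%-16s %5.1f%%\n", "fe bandwidth",    100.0 * fe_bandwidth / total);
    printf("%-16s %5.1f%%\n", "core bound",      100.0 * core_bound   / total);
    printf("%-16s %5.1f%%\n", "backend memory",  100.0 * backend_mem  / total);
    return 0;
}
```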
Lion Cove has a 4-level data caching setup, with the L1 data cache split into two levels. I’ll be calling those L1 and L1.5 for simplicity, because the second level of the L1 lands between the first level and the 3 MB L2 cache in capacity and performance.
Lion Cove’s L1.5 catches a substantial portion of L1 misses, though its hitrate isn’t great in absolute terms. It gives off some RDNA 128 KB L1 vibes, in that it takes some load off the L2 but often has mediocre hitrates. L2 hitrate is 49.88%, 71.87%, and 50.98% in COD, Palworld, and Cyberpunk 2077 respectively. Cumulative hitrate for the L1.5 and L2 comes in at 75.54%, 85.05%, and 85.83% across the three games. Intel’s strategy of using a larger L2 to keep traffic off L3 works to a certain extent, because most L1 misses are serviced without leaving the core.
However, memory accesses that do go to L3 and DRAM are very expensive. Lion Cove can provide an idea of how often each level in the memory hierarchy limits performance. Specifically, performance monitoring events count cycles where no micro-ops were ready to execute, a load was pending from a specified cache level, and no loads missed that level of cache. For example, a cycle would be L3 bound if the core was waiting for data from L3, wasn’t also waiting for data from DRAM, and all pending instructions queued up in the core were blocked waiting for data. An execute stage stall doesn’t imply performance impact, because the core has more execution ports than renamer slots. The execute stage can race ahead after stalling for a few cycles without losing average throughput. So, this is a measurement of how hard the core has to cope, rather than whether it was able to cope.
Intel’s performance events don’t distinguish between L1 and L1.5, so both are counted as “L1 Bound” in the graph above. The L1.5 seems to move enough accesses off L2 to minimize the effect of L2 latency. Past L2 though, L3 and DRAM performance have a significant impact. L2 misses may be rare in an absolute sense, but they’re not quite rare enough considering the high cost of a L3 or DRAM access.
Lion Cove and the Arrow Lake platform can monitor queue occupancy at various points in the memory hierarchy. Dividing occupancy by request count provides average latency in cycles, giving an idea of how much latency the core has to cope with in practice.
Count occurrences (rising-edge) of DCACHE_PENDING sub-event0. Impl. sends per-port binary inc-bit the occupancy increases* (at FB alloc or promotion).
Intel’s description for the L1D_MISS.LOAD event, which unhelpfully doesn’t indicate which level of the L1 it counts for.
These performance monitoring events can be confusing. The L1D_MISS.LOAD event (event 0x49, unit mask 1) increments when loads miss the 48 KB L1D. However the corresponding L1D_PENDING.LOAD event (event 0x48, unit mask 1) only accounts for loads that miss the 192 KB L1.5. Using both events in combination treats L1.5 hits as zero latency. It does accurately account for latency to L2 and beyond, though only from the perspective of a queue between the L1.5 and L2.
Measuring latency at the arbitration queue (ARB) can be confusing in a different way. The ARB runs at the CPU tile’s uncore clock, or 3.8 GHz. That’s well below the 5.7 GHz maximum CPU core clock, so the ARB will see fewer cycles of latency than the CPU core does. Therefore, I’m adding another set of bars with post-ARB latency multiplied by 5.7/3.8, to approximate latency in CPU core cycles.
Another way to get a handle on latency is to multiply by cycle time to approximate actual latency. Clocks aren’t static on Arrow Lake, so there’s additional margin of error. But doing so does show latency past the ARB remains well controlled, so DRAM bandwidth isn’t a concern. If games were approaching DRAM bandwidth limits, latency would go much higher as requests start piling up at the ARB queue and subsequent points in the chip’s interconnect.
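The conversions themselves are straightforward. A small sketch with placeholder counts, using the 3.8 GHz uncore and 5.7 GHz maximum core clocks of this particular setup:

```c
// Queue occupancy counts "entries * cycles", so dividing by the number of requests gives
// the average number of cycles each request spent in (or past) that queue. The occupancy
// and request values below are placeholders for counts read from performance counters.
#include <stdio.h>

int main(void) {
    double occupancy = 9.0e9;      // placeholder: summed per-cycle ARB queue occupancy
    double requests  = 3.0e7;      // placeholder: requests that passed through the ARB
    double arb_ghz   = 3.8;        // uncore clock the ARB runs at
    double core_ghz  = 5.7;        // maximum CPU core clock

    double arb_cycles  = occupancy / requests;              // average latency in ARB (uncore) cycles
    double core_cycles = arb_cycles * (core_ghz / arb_ghz); // rescaled to CPU core cycles
    double nanoseconds = arb_cycles / arb_ghz;              // or multiply by cycle time to get ns

    printf("%.0f ARB cycles ~= %.0f core cycles ~= %.1f ns\n",
           arb_cycles, core_cycles, nanoseconds);
    return 0;
}
```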
Much of the action happens at the backend, but Lion Cove loses some throughput at the frontend too. Instruction-side accesses tend to be more predictable than data-side ones, because instructions are executed sequentially until the core reaches a branch. That means accurate branch prediction can let the core hide frontend latency.
Lion Cove’s branch predictor enjoys excellent accuracy across all three games. Mispredicts however can still be an issue. Just as the occasional L3 or DRAM access can be problematic because they’re so expensive, recovering from a branch mispredict can hurt too. Because a mispredict breaks the branch predictor’s ability to run ahead, it can expose the core to instruction-side cache latency. Fetching the correct branch target from L2 or beyond could add dozens of cycles to mispredict recovery time. Ideally, the core would contain much of an application’s code footprint within the fastest instruction cache levels to minimize that penalty.
Lion Cove’s frontend can source micro-ops from four sources. The loop buffer, or Loop Stream Detector (LSD), and the microcode sequencer play minor roles. Most micro-ops come from the micro-op cache, or Decoded Stream Buffer (DSB). Even though the op cache delivers a majority of micro-ops, it’s not large enough to serve as the core’s primary instruction cache. Lion Cove gets a 64 KB instruction cache, carried over from Redwood Cove. Intel no longer documents events that would allow a direct L1i hitrate calculation. However, older events from before Alder Lake still appear to work. Testing with microbenchmarks shows micro-op cache hits are counted as instruction cache hits. Therefore, the figures below indicate how often instruction fetches were satisfied without going to L2.
The 64 KB instruction cache does its job, keeping the vast majority of instruction fetches from reaching L2. Code hitrate from L2 is lower, likely because accesses that miss L1i have worse locality in the first place. Instructions also have to contend with data for L2 capacity. L2 code misses don’t happen too often, but can be problematic just as on the data side because of the dramatic latency jump.
Among the three games here, Cyberpunk 2077’s built-in benchmark has better code locality, while Palworld suffers the most. That’s reflected in average instruction-side latency seen by the core. When running Palworld, Lion Cove takes longer to recover from pipeline resteers, which largely come from branch mispredicts. Recovery time here refers to cycles elapsed until the renamer issues the first micro-op from the correct path.
Offcore code read latency can be tracked in the same way as demand data reads. Latency is lower than on the data side, suggesting higher code hitrate in L3. However, hiding hundreds of cycles of latency is still a tall order for the frontend, just as it is for the backend. Again, Lion Cove’s large L2 does a lot of heavy lifting.
Performance counters provide insight into other delays as well. A pipeline resteer starts with the renamer (allocator) restoring a checkpoint with known-good state[1], which takes 3-4 cycles and, as expected, doesn’t change across the three games. Lion Cove can also indicate how often the instruction fetch stage stalls. Setting the edge/cmask bits can indicate how long each stall lasts. However, it’s hard to determine the performance impact from L1i misses because the frontend has deep queues that can hide L1i miss latency. Furthermore, an instruction fetch stall can overlap with a backend resource stall.
While pipeline resteers seem to account for the bulk of frontend-related throughput losses, other reasons can contribute too. Structures within the branch predictor can override each other, for example when a slower BTB level overrides a faster one (BPClear). Large branch footprints can exceed the branch predictor’s capacity to track them, and cause a BAClear in Intel terminology. That’s when the frontend discovers a branch not tracked by the predictor, and must redirect instruction fetch from a later stage. Pipeline bubbles from both sources have a minor impact, so Lion Cove’s giant 12K entry BTB does a good job.
In a latency bound workload like gaming, the retirement stage operates in a feast-or-famine fashion. Most of the time it can’t do anything. That’s probably because a long latency instruction is blocking retirement, or the ROB is empty from a very costly mispredict. When the retirement stage is unblocked, throughput resembles a bathtub curve. Often it crawls forward with most retire slots idle. The retirement stage spends very few cycles retiring at medium-high throughput.
Likely, retirement is either crawling forward in core-bound scenarios when a short latency operation completes and unblocks a few other micro-ops that complete soon after, or is bursting ahead after a long latency instruction completes and unblocks retirement for a lot of already completed instructions.
Lion Cove can retire up to 12 micro-ops per cycle. Once it starts using its full retire width, the core on average blasts through 28 micro-ops before getting blocked again.
Compared to Zen 4, Lion Cove suffers harder with backend memory latency, but far less from frontend latency. Part of this can be explained by Zen 4’s stronger data-side memory subsystem. The AMD Ryzen 9 7950X3D I previously tested on has 96 MB of L3 cache on the first die, and has lower L3 latency than Lion Cove in Intel’s Arrow Lake platform. Beyond L3, AMD achieves better load-to-use latency even with slower DDR5-5600 36-36-36-89 memory. Intel’s interconnect became more complex when they shifted to a chiplet setup, and there’s clearly some work to be done.
Lion Cove gets a lot of stuff right as well, because the core’s frontend is quite strong. The larger BTB and larger instruction cache compared to Zen 4 seem to do a good job of keeping code fetches off slower caches. Lion Cove’s large L2 gets credit too. It’s not perfect, because the occasional instruction-side L2 miss has an average latency in the hundreds of cycles range. But Intel’s frontend improvements do pay off.
Even though Intel and AMD have different relative strengths, a constant factor is that games are difficult, low IPC workloads. They have large data-side footprints with poor access locality. Instruction-side accesses are difficult too, though not to the same extent because modern branch predictors can mostly keep up. Both factors together mean many pipeline slots go unused. Building a wider core brings little benefit because getting through instructions isn’t the problem. Rather, the challenge is in dealing with long stalls as the core waits for data or instructions to arrive from lower level caches or DRAM. Intel’s new L1.5 likely has limited impact as well. It does convert some already fast L2 hits into even faster accesses, but it doesn’t help with long stalls as the core waits for data from L3 or DRAM.
Comparing games to SPEC CPU2017 also emphasizes that games aren’t the only workloads out there. Wider cores with faster upper level caches can pay off in a great many SPEC CPU2017 tests, especially those with very high IPC. Conversely, a focus on improving DRAM performance or increasing last level cache capacity would provide minimal gains for workloads that already fit in cache. Optimization strategies for different workloads are often in conflict, because engineers must decide where to allocate a limited power and area budget. They have limited time to determine the best tradeoff too. Intel, AMD, and others will continue to tune their CPU designs to meet expected workloads, and it’ll be fun to see where they go.
Henry Wong suggests the INT_MISC.RECOVERY_CYCLES event, which is present on Lion Cove as well as Haswell, accounts for time taken for a mapping table recovery. The renamer maintains a register alias table (mapping) that maps architectural registers to renamed physical registers. Going back to a known good state would mean restoring a previous version of the table prior to the mispredicted branch. https://www.stuffedcow.net/files/henry-thesis-phd.pdf
2025-06-29 08:26:53
Nvidia has a long tradition of building giant GPUs. Blackwell, their latest graphics architecture, continues that tradition. GB202 is the largest Blackwell die. It occupies a massive 750mm2 of area, and has 92.2 billion transistors. GB202 has 192 Streaming Multiprocessors (SMs), the closest equivalent to a CPU core on a GPU, and feeds them with a massive memory subsystem. Nvidia’s RTX PRO 6000 Blackwell features the largest GB202 configuration to date. It sits alongside the RTX 5090 in Nvidia’s lineup, which also uses GB202 but disables a few more SMs.
A high level comparison shows the scale of Nvidia’s largest Blackwell products. AMD’s RDNA4 line tops out with the RX 9070 and RX 9070XT. The RX 9070 is slightly cut down, with four WGPs disabled out of 32. I’ll be using the RX 9070 to provide comparison data.
A massive thanks goes out to Will Killian for giving us access to his RTX PRO 6000 Blackwell system to test for this article!
GPUs use specialized hardware to launch threads across their cores, unlike CPUs that rely on software scheduling in the operating system. Hardware thread launch is well suited to the short and simple tasks that often characterize GPU workloads. Streaming Multiprocessors (SMs) are the basic building block of Nvidia GPUs, and are roughly analogous to a CPU core. SMs are grouped into Graphics Processing Clusters (GPCs), which contain a rasterizer and associated work distribution hardware.
GB202 has a 1:16 GPC to SM ratio, compared to the 1:12 ratio found in Ada Lovelace’s largest AD102 die. That lets Nvidia cheaply increase SM count and thus compute throughput without needing more copies of GPC-level hardware. However, dispatches with short-duration waves may struggle to take advantage of Blackwell’s scale, as throughput becomes limited by how fast the GPC can allocate work to the SMs rather than how fast the SMs can finish them.
AMD’s RDNA4 uses a 1:8 SE:WGP ratio, so one rasterizer feeds a set of eight WGPs in a Shader Engine. WGPs on AMD are the closest equivalent to SMs on Nvidia, and have the same nominal vector lane count. RDNA4 will be easier to utilize with small dispatches and short duration waves, but it’s worth noting that Blackwell’s design is not out of the ordinary. Scaling up GPU “cores” independently of work distribution hardware is a common technique for building larger GPUs. AMD’s RX 6900XT (RDNA2) had a 1:10 SE:WGP ratio. Before that, AMD’s largest GCN implementations like Fury X and Vega 64 had a 1:16 SE:CU ratio (CUs, or Compute Units, formed the basic building block of GCN GPUs). While Blackwell does have the same ratio as those large GCN parts, it enjoys higher clock speeds and likely has a higher wave launch rate to match per-GPU-core throughput. It won’t suffer as much as the Fury X from 10 years ago with short duration waves, but GB202 will still be harder to feed than smaller GPUs.
Although Nvidia didn’t scale up work distribution hardware, they did make improvements on Blackwell. Prior Nvidia generations could not overlap workloads of different types on the same queue. Going between graphics and compute tasks would require a “subchannel switch” and a “wait-for-idle”. That requires one task on the queue to completely finish before the next can start, even if a game doesn’t ask for synchronization. Likely, higher level scheduling hardware that manages queues exposed to host-side applications can only track state for one workload type at a time. Blackwell does away with subchannel switches, letting it more efficiently fill its shader array if applications frequently mix different work types on the same queue.
Once assigned work, the SM’s frontend fetches shader program instructions and delivers them to the execution units. Blackwell uses fixed length 128-bit (16 byte) instructions, and the SM uses a two-level instruction caching setup. Both characteristics are carried forward from Nvidia’s post-Turing/Volta designs. Each of the SM’s four partitions has a private L0 instruction cache, while a L1 instruction cache is shared across the SM.
Nvidia’s long 16 byte instructions translate to high instruction-side bandwidth demands. The L0+L1 instruction caching setup is likely intended to handle those bandwidth demands while also maintaining performance with larger code footprints. Each L0 only needs to provide one instruction per cycle, and its smaller size should make it easier to optimize for high bandwidth and low power. The SM-level L1 can then be optimized for capacity.
Blackwell’s L1i is likely 128 KB, from testing with unrolled loops of varying sizes and checking generated assembly (SASS) to verify the loop’s footprint. The L1i is good for approximately 8K instructions, providing a good boost over prior Nvidia generations.
Blackwell and Ada Lovelace both appear to have 32 KB L0i caches, which is an increase over 16 KB in Turing. The L1i can fully feed a single partition, and can filter out redundant instruction fetches to feed all four partitions. However, L1i bandwidth can be a visible limitation if two waves on different partitions spill out of L1i and run different code sections. In that case, per-wave throughput drops to one instruction per two cycles.
AMD uses variable length instructions ranging from 4 to 12 bytes, which lowers capacity and bandwidth pressure on the instruction cache compared to Nvidia. RDNA4 has a 32 KB instruction cache shared across a Workgroup Processor (WGP), much like prior RDNA generations. Like Nvidia’s SM, a WGP is divided into four partitions (SIMDs). RDNA1’s whitepaper suggests the L1i can supply 32 bytes per cycle to each SIMD. It’s an enormous amount of bandwidth considering AMD’s more compact instructions. Perhaps AMD wanted to be sure each SIMD could co-issue to its vector and scalar units in a sustained fashion. In this basic test, RDNA4’s L1i has no trouble maintaining full throughput when two waves traverse different code paths. RDNA4 also enjoys better code read bandwidth from L2, though all GPUs I’ve tested show poor L2 code read bandwidth.
Each Blackwell SM partition can track up to 12 waves to hide latency, which is a bit lower than the 16 waves per SIMD on RDNA4. Actual occupancy, or the number of active wave slots, can be limited by a number of factors including register file capacity. Nvidia has not changed theoretical occupancy or register file capacity since Ampere, and the latter remains at 64 KB per partition. A kernel therefore can’t use more than 40 registers while using all 12 wave slots, assuming allocation granularity hasn’t changed since Ada and is still 8 registers. For comparison, AMD’s high end RDNA3/4 SIMDs have 192 KB vector register files, letting a kernel use up to 96 registers while maintaining maximum occupancy.
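Those register budgets fall out of simple arithmetic: divide the per-partition register file across wave slots and lanes, then round down to the allocation granularity. A quick sketch, assuming 32-wide waves, 4-byte registers, and the 8-register granularity mentioned above:

```c
// Registers available per thread at full occupancy: register file capacity divided by
// (wave slots * lanes per wave * bytes per register), rounded down to the allocation granularity.
#include <stdio.h>

int main(void) {
    int regfile_bytes = 64 * 1024;  // per-partition register file on Blackwell
    int waves         = 12;         // wave slots per partition
    int lanes         = 32;         // threads per wave
    int reg_bytes     = 4;          // 32-bit registers
    int granularity   = 8;          // assumed allocation granularity, carried over from Ada

    int regs = regfile_bytes / (waves * lanes * reg_bytes); // 65536 / 1536 = 42
    regs -= regs % granularity;                             // round down to a multiple of 8 -> 40
    printf("Blackwell: %d registers per thread at full occupancy\n", regs);

    // Same math for a high-end RDNA3/4 SIMD: 192 KB register file, 16 wave slots -> 96 registers
    regs = (192 * 1024) / (16 * lanes * reg_bytes);
    printf("RDNA3/4:   %d registers per thread at full occupancy\n", regs);
    return 0;
}
```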
Blackwell’s primary FP32 and INT32 execution pipelines have been reorganized compared to prior generations, and are internally arranged as one 32-wide execution pipe. That creates similarities to AMD’s RDNA GPUs, as well as Nvidia’s older Pascal. Having one 32-wide pipe handle both INT32 and FP32 means Blackwell won’t have to stall if it encounters a long stream of operations of the same type. Blackwell inherits Turing’s strength of being able to do 16 INT32 multiplies per cycle in each partition. Pascal and RDNA GPUs can only do INT32 multiplies at approximately quarter rate (8 per partition, per cycle).
Compared to Blackwell, AMD’s RDNA4 packs a lot of vector compute into each SIMD. Like RDNA3, RDNA4 can use VOPD dual issue instructions or wave64 mode to complete 64 FP32 operations per cycle in each partition. An AMD SIMD can also co-issue instructions of different types from different waves, while Nvidia’s dispatch unit is limited to one instruction per cycle. RDNA4’s SIMD also packs eight special function units (SFUs) compared to four on Nvidia. These units handle more complex operations like inverse square roots and trigonometric functions.
Differences in per-partition execution unit layout or count quickly get pushed aside by Blackwell’s massive SM count. Even when the RX 9070 can take advantage of dual issue, 28 WGPs cannot take on 188 SMs. Nvidia holds a large lead in every category.
Nvidia added floating point instructions to Blackwell’s uniform datapath, which dates back to Turing and serves a similar role to AMD’s scalar unit. Both offload instructions that are constant across a wave. Blackwell’s uniform FP instructions include adds, multiplies, FMAs, min/max, and conversions between integer and floating point. Nvidia’s move mirrors AMD’s addition of FP scalar instructions with RDNA 3.5 and RDNA4.
Still, Nvidia’s uniform datapath feels limited compared to AMD’s scalar unit. Uniform registers can only be loaded from constant memory, though curiously a uniform register can be written out to global memory. I wasn’t able to get Nvidia’s compiler to emit uniform instructions for the critical part of any instruction or cache latency tests, even when loading values from constant memory.
Raytracing has long been a focus of Nvidia’s GPUs. Blackwell doubles the per-SM ray triangle intersection test rate, though Nvidia does not specify what the box or triangle test rate is. Like Ada Lovelace, Blackwell’s raytracing hardware supports “Opacity Micromaps”, providing functionality similar to the sub-triangle opacity culling referenced by Intel’s upcoming Xe3 architecture.
Like Ada Lovelace and Ampere, Blackwell has a SM-wide 128 KB block of storage that’s partitioned for use as L1 cache and Shared Memory. Shared Memory is Nvidia’s term for a software managed scratchpad, which backs the local memory space in OpenCL. AMD’s equivalent is the Local Data Share (LDS), and Intel’s is Shared Local Memory (SLM). Unlike with their datacenter GPUs, Nvidia has chosen not to increase L1/Shared Memory capacity. As in prior generations, different L1/Shared Memory splits do not affect L1 latency.
AMD’s WGPs use a more complex memory subsystem, with a high level design that debuted in the first RDNA generation. The WGP has a 128 KB LDS that’s internally built from a pair of 64 KB, 32-bank structures connected by a crossbar. First level vector data caches, called L0 caches, are private to pairs of SIMDs. A WGP-wide 16 KB scalar cache services scalar and constant reads. In total, a RDNA4 WGP has 208 KB of data-side storage divided across different purposes.
A RDNA4 WGP enjoys substantially higher bandwidth from its private memories for global and local memory accesses. Each L0 vector cache can deliver 128 bytes per cycle, and the LDS can deliver 256 bytes per cycle total to the WGP. Mixing local and global memory traffic can further increase achievable bandwidth, suggesting the LDS and L0 vector caches have separate data buses.
Doing the same on Nvidia does not bring per-SM throughput past 128B/cycle, suggesting the 128 KB L1/Shared Memory block has a single 128B path to the execution units.
Yet any advantage AMD may enjoy from this characteristic is blunted as the RX 9070 drops clocks to 2.6 GHz, in order to stay within its 220W power target. Nvidia in contrast has a higher 600W power limit, and can maintain close to maximum clock speeds while delivering 128B/cycle from SM-private L1/Shared Memory storage.
Just as with compute, Nvidia’s massive scale pushes aside any core-for-core differences. The 188 SMs across the RTX PRO 6000 Blackwell together have more than 60 TB/s of bandwidth. High SM count gives Nvidia more total L1/local memory too. Nvidia has 24 MB of L1/Shared Memory across the RTX PRO 6000. AMD’s RX 9070 has just under 6 MB of first level data storage in its WGPs.
SM-private storage typically offers low latency, at least in GPU terms, and that continues to be the case in Blackwell. Blackwell compares favorably to AMD in several areas, and part of that advantage comes down to address generation. I’m testing with dependent array accesses, and Nvidia can convert an array index to an address with a single IMAD.WIDE instruction.
AMD has fast 64-bit address generation through its scalar unit, but of course can only use that if the compiler determines the address calculation will be constant across a wave. If each lane needs to independently generate its own address, AMD’s vector integer units only natively operate with 32-bit data types and must do an add-with-carry to generate a 64-bit address.
Because address generation can’t be separated from cache latency in this kind of test, Nvidia’s faster address generation shows up as better L1 vector access latency. AMD can be slightly faster if the compiler can carry out scalar optimizations.
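For reference, the dependent array access pattern looks something like the following OpenCL C sketch (a simplified illustration, not the exact kernel used for these tests). Index-to-address generation, whether a single IMAD.WIDE or an add-with-carry pair, sits on the critical path of every iteration:

```c
// OpenCL C latency kernel sketch: arr[] holds a randomized chain of indices, so every load's
// address depends on the previous load's result and latency can't be hidden.
__kernel void latency_chase(__global const uint *arr, __global uint *result, uint iters) {
    uint current = 0;
    for (uint i = 0; i < iters; i++)
        current = arr[current];          // serialized dependent loads
    if (get_global_id(0) == 0)
        result[0] = current;             // keep the chain live so it isn't optimized out
}
```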
GPUs can also offload address generation to the texture unit, which can handle array address calculations. The texture unit of course can also do texture filtering, though I’m only asking it to return raw data when testing with OpenCL’s image1d_buffer_t type. AMD enjoys lower latency if the texture unit handles address calculations, but Nvidia sees no such benefit.
GPUs often handle atomic operations with dedicated ALUs placed close to points of coherency, like the LDS or Shared Memory for local memory, or the L2 cache for global memory. That contrasts with CPUs, which rely on locking cachelines to handle atomics and ensure ordering. Nvidia appears to have 16 INT32 atomic ALUs at each SM, compared to 32 for each AMD WGP.
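In OpenCL terms, the first case maps to atomics on __local memory, which the ALUs next to the LDS or Shared Memory service. A rough sketch of such a kernel:

```c
// OpenCL C sketch: every work-item in a workgroup increments a counter in local memory.
// These atomics are handled near the LDS / Shared Memory rather than at the L2.
__kernel void local_atomic_add(__global int *out) {
    __local int counter;
    if (get_local_id(0) == 0)
        counter = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    atomic_add(&counter, 1);             // local-memory atomic add

    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 0)
        out[get_group_id(0)] = counter;  // one result per workgroup
}
```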
In a familiar trend, Nvidia can crush AMD by virtue of having a much bigger GPU, at least with local memory atomics. Both the RTX PRO 6000 and RX 9070 have surprisingly similar atomic add throughput in global memory, suggesting Nvidia either has fewer L2 banks or fewer atomic units per bank.
RDNA4 and Blackwell have similar latency when threads exchange data through atomic compare and exchange operations, though AMD is slightly faster. The RX 9070 is a much smaller and higher clocked GPU, and both can help lower latency when moving data across the GPU.
Blackwell uses a conventional two-level data caching setup, but continues Ada Lovelace’s strategy of increasing L2 capacity to achieve the same goals as AMD’s Infinity Cache. L2 latency on Blackwell regresses to just over 130 ns, compared to 107 ns on Ada Lovelace. Nvidia’s L2 latency continues to sit between AMD’s L2 and Infinity Cache latencies, though now it’s noticeably closer to the latter.
Test results using Vulkan suggest the smaller RTX 5070 also has higher L2 latency (122 ns) than the RTX 4090, even though the 5070 has fewer SMs and a smaller L2. Cache latency results from Nemes’s Vulkan test suite should be broadly comparable to my OpenCL ones, because we both use a current = arr[current] access pattern. A deeper look showed minor code generation differences that seem to add ~3 ns of latency to the Vulkan results. That doesn’t change the big picture with L2 latencies. Furthermore, the difference between L1 and L2 latency should approximate the time taken to traverse the on-chip network and access the L2. Differences between OpenCL and Vulkan results are insignificant in that regard. Part of GB202’s L2 latency regression may come from its massive scale, but results from the 5070 suggest there’s more to the picture.
The RTX PRO 6000 Blackwell’s VRAM latency is manageable at 329 ns, or ~200 ns over L2 hit latency. AMD’s RDNA4 manages better VRAM latency at 254 ns for a vector access, or 229 ns through the scalar path. Curiously, Nvidia’s prior Ada Lovelace and Ampere architectures enjoyed better VRAM latency than Blackwell, and are in the same ballpark as RDNA4 and RDNA2.
Blackwell’s L2 bandwidth is approximately 8.7 TB/s, slightly more than the RX 9070’s 8.4 TB/s. Nvidia retains a huge advantage at larger test sizes, where AMD’s Infinity Cache provides less than half the bandwidth. In VRAM, Blackwell’s GDDR7 and 512-bit memory bus continue to keep it well ahead of AMD.
Nvidia’s L2 performance deserves closer attention, because it’s one area where the RX 9070 gets surprisingly close to the giant RTX PRO 6000 Blackwell. A look at GB202’s die photo shows 64 cache blocks, suggesting the L2 is split into 64 banks. If so, each bank likely delivers 64 bytes per cycle (of which the test was able to achieve 48B/cycle). It’s an increase over the 48 L2 blocks in Ada Lovelace’s largest AD102 chip. However, Nvidia's L2 continues to have a tough job serving as both the first stop for L1 misses and as a large last level cache. In other words, it’s doing the job of AMD’s L2 and Infinity Cache levels. There’s definitely merit to cutting down cache levels, because checking a level of cache can add latency and power costs. However, caches also have to make a tradeoff between capacity, performance, and power/area cost.
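Putting rough numbers to that bank arrangement: 64 banks delivering 48 bytes per cycle at a clock around 2.8 GHz works out to roughly 8.6 TB/s, in line with the measured figure; the exact L2 clock during the test is an assumption on my part.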
Nvidia likely relies on their flexible L1/Shared Memory arrangement to keep L2 bandwidth demands under control, and insulate SMs from L2 latency. A Blackwell SM can use its entire 128 KB L1/Shared Memory block as L1 cache if a kernel doesn’t need local memory, while an AMD WGP is stuck with two 32 KB vector caches and a 16 KB scalar cache. However a kernel bound by local memory capacity with a data footprint in the range of several megabytes would put Nvidia at a disadvantage. Watching AMD and Nvidia juggle these tradeoffs is quite fun, though it’s impossible to draw any conclusions with the two products competing in such different market segments.
FluidX3D simulates fluid behavior and can demand plenty of memory bandwidth. It carries out computations with FP32 values, but can convert them to FP16 formats for storage. Doing so reduces VRAM bandwidth and capacity requirements. Nvidia’s RTX PRO 6000 takes a hefty lead over AMD’s RX 9070, as the headline compute and memory bandwidth specifications would suggest.
Nvidia’s lead remains relatively constant regardless of what mode FluidX3D is compiled with.
We technically have more GPU competition than ever in 2025, as Intel’s GPU effort makes steady progress and establishes them as a third contender. On the datacenter side, AMD’s MI300 has proven to be very competitive with supercomputing wins. But competition is conspicuously absent at the top of the consumer segment. Intel’s Battlemage and AMD’s RDNA4 stop at the midrange segment. The RX 9070 does target higher performance levels than Intel’s Arc B580, but neither comes anywhere close to Nvidia’s largest GB202 GPUs.
As for GB202, it’s yet another example of Nvidia building as big as they can to conquer the top end. The 750mm2 die pushes the limits of what can be done with a monolithic design. Its 575W or 600W power target tests the limits of what a consumer PC can support. By pushing these limits, Nvidia has created the largest consumer GPU available today. The RTX PRO 6000 incredibly comes close to AMD’s MI300X in terms of vector FP32 throughput, and is far ahead of Nvidia’s own B200 datacenter GPU. The memory subsystem is a monster as well. Perhaps Nvidia’s engineers asked whether they should emphasize caching like AMD’s RDNA2, or lean on VRAM bandwidth like they did with Ampere. Apparently, the answer is both. The same approach applies to compute, where the answer was apparently “all the SMs”.
Building such a big GPU isn’t easy, and Nvidia evidently faced their share of challenges. L2 performance is mediocre considering the massive compute throughput it may have to feed. Beyond GPU size, comparing with RDNA4 shows continued trends like AMD using a smaller number of individually stronger cores. RDNA4’s basic Workgroup Processor building block has more compute throughput and cache bandwidth than a Blackwell SM.
But none of that matters at the top end, because Nvidia shows up with over 6 times as many “cores”, twice as much last level cache capacity, and a huge VRAM bandwidth lead. Some aspects of Blackwell may not have scaled as nicely. But Nvidia’s engineers deserve praise because everyone else apparently looked at those challenges and decided they weren’t going to tackle them at all. Blackwell therefore wins the top end by default. Products like the RTX PRO 6000 are fascinating, and I expect Nvidia to keep pushing the limits of how big they can build a consumer GPU. But I also hope competition at the top end will eventually reappear in the decades and centuries to come.
SASS instruction listings: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#blackwell-instruction-set
GB202 whitepaper: https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
Blackwell PRO GPU whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1.0.pdf
Techpowerup observing that the RTX 5090 could reach 2.91 GHz: https://www.techpowerup.com/review/nvidia-geforce-rtx-5090-founders-edition/44.html
2025-06-21 05:13:03
Hello you fine Internet folks,
At AMD's Advancing AI 2025, I had the pleasure of interviewing Alan Smith, AMD Senior Fellow and Chief Instinct Architect, about CDNA4 found in the MI350 series of accelerators.
Hope y'all enjoy!
Transcript below has been edited for conciseness and readability.
George: Hello you fine internet folks! We're here today at AMD's Advancing AI 2025 event, where the MI350 series has just been announced. And I have the pleasure to introduce, Alan Smith from AMD.
Alan: Hello, everyone.
George: What do you do at AMD?
Alan: So I'm the chief architect of Instinct GPUs.
George: Awesome, what does that job entail?
Alan: So I'm responsible for the GPU product line definition, working with our data center GPU business partners to define the-, work with them on the definition of the requirements of the GPUs, and then work with the design teams to implement those requirements.
George: Awesome. So, moving into MI350: MI350's still GFX9 based, and- for you guys in the audience, GFX9 is also known as Vega, or at least a derivative of GFX9. Why is it that MI350 is still on GFX9, whereas client architectures such as RDNA 3 and 4 are GFX11 and GFX12 respectively?
Alan: Sure, yeah, it's a great question. So as you know, the CDNA architecture off of, you know, previous generations of Instinct GPUs, starting with MI100 and before, like you said, in the Vega generations, were all GCN architecture, which was Graphics Core Next. And over several generations, CDNA's been highly optimized for the types of distributed computing algorithms for high-performance computing and AI. And so we felt like starting with that base for MI350 would give us the right pieces that we needed to deliver the performance goals that we had for the MI350 series.
George: And with GCN, as you know, there's a separate L1 cache and LDS, or Local Data Share. Why is that still in MI350? And why haven't they been merged?
Alan: Yeah, so like you said, you know, it is a legacy piece of the GCN architecture. It's sort of fundamental to the way the compute unit is built. So we felt like in this generation, it wasn't the right opportunity to make a microarchitectural change of that scale. So what we did instead, was we increased the capacity of the LDS. So previously in MI300 series, we had a 64 kilobyte LDS, and we've increased that capacity to 160 kilobytes in MI350 series. And in addition to that, we increased the bandwidth as well. So we doubled the bandwidth of the LDS into the register file, in order to be able to feed the Tensor Core rates that we have in the MI350 series.
George: And speaking of Tensor Cores, you've now introduced microscaling formats to MI350x for FP8, FP6, and FP4 data types. Interestingly enough, a major differentiator for MI350 is that FP6 is the same rate as FP4. Can you talk a little bit about how that was accomplished and why that is?
Alan: Sure, yep, so one of the things that we felt like on MI350 in this timeframe, that it's going into the market and the current state of AI... we felt like that FP6 is a format that has potential to not only be used for inferencing, but potentially for training. And so we wanted to make sure that the capabilities for FP6 were class-leading relative to... what others maybe would have been implementing, or have implemented. And so, as you know, it's a long lead time to design hardware, so we were thinking about this years ago and wanted to make sure that MI350 had leadership in FP6 performance. So we made a decision to implement the FP6 data path at the same throughput as the FP4 data path. Of course, we had to take on a little bit more hardware in order to do that. FP6 has a few more bits, obviously, that's why it's called FP6. But we were able to do that within the area of constraints that we had in the matrix engine, and do that in a very power- and area-efficient way.
George: And speaking of data types, I've noticed that TF32 is not on your ops list for hardware level acceleration. Why remove that feature from... or why was that not a major consideration for MI350?
Alan: Yeah, well, it was a consideration, right? Because we did remove it. We felt like that in this timeframe, that brain float 16, or BF16, would be a format that would be leverageable for most models to replace TF32. And we can deliver a much higher throughput on BF16 than TF32, so we felt like it was the right trade off for this implementation.
George: And if I was to use TF32, what would the speed be? Would it still be FP32, the speed of FP32?
Alan: You have a choice. We offer some emulation, and I don't have all the details on the exact throughputs off the top of my head; but we do offer emulation, software-based emulation using BF16 to emulate TF32, or you can just cast it into FP32 and use it at FP32 rate.
George: And moving from the CU up to the XCD, which is the compute die; the new compute die's now on N3P, and yet there's been a reduction from 40 CUs to 36 CUs physically on the die, with four, one per shader engine, fused off. Why 32 CUs now, and why that reduction?
Alan: Yeah, so on MI300, we had co-designed for both MI300X and MI300A, one for HPC and one for AI. In the MI300A, we have just six XCDs. And so, we wanted to make sure when we only had six of the accelerator chiplets that we had enough compute units to power the level of HPC or high-performance computing - which is traditional simulation in FP64 - to reach the performance levels that we wanted to hit for the leadership-class supercomputers that we were targeting with that market.
And so we did that and delivered the fastest supercomputer in the world along with Lawrence Livermore, with El Capitan. But so as a consideration there, we wanted to have more compute units for XCD so that we could get 224 total within MI300A. On 350, where it's designed specifically as an accelerator only, a discrete accelerator, we had more flexibility there. And so we decided that having a power of two number of active compute units per die - so 36 physical, like you said, but we enable 32. Four of them, one per shader engine, are used for harvesting and we yield those out in order to give us good high-volume manufacturing through TSMC-N3, which is a leading edge technology. So we have some of the spare ones that allow us to end up with 32 actually enabled.
And that's a nice power of two, and it's easy to tile tensors if you have a power of two. So most of the tensors that you're working with, or many of them, would be matrices that are based on a power of two. And so it allows you to tile them into the number of compute units easily, and reduces the total tail effect that you may have. Because if you have a non-power of two number of compute units, then some amount of the tensor may not map directly nicely, and so you may have some amount of work that you have to do at the end on just a subset of the compute unit. So we find that there's some optimization there by having a power of two.
George: While the new compute unit is on N3P, the I/O die is on N6; why stick with N6?
Alan: Yeah, great question. What we see in our chiplet technology, first of all, we have the choice, right? So being on chiplets gives you the flexibility to choose different technologies if appropriate. And the things that we have in the I/O die tend not to scale as well with advanced technologies. So things like the HBM PHYs, the high-speed SERDES, the caches that we have with the Infinity Cache, the SRAMs, those things don't scale as well. And so sticking with an older technology with a mature yield on a big die allows us to deliver a product cost and a TCO (Total Cost of Ownership) value proposition for our clients. And then we're able to leverage the most advanced technologies like N3P for the compute where we get a significant benefit in the power- and area-scaling to implement the compute units.
George: And speaking of, other than the LDS, what's interesting to me is that there have not been any cache hierarchy changes. Why is that?
Alan: Yeah, great question. So if you remember what I just said about MI300 being built to deliver the highest performance in HPC. And in order to do that, we needed to deliver significant global bandwidth into the compute units for double precision floating point. So we had already designed the Infinity Fabric and the fabric within the XCC or the Accelerated Compute Core to deliver sufficient bandwidth to feed the really high double precision matrix operations in MI300 and all the cache hierarchy associated with that. So we were able to leverage that amount of interconnecting capabilities that we had already built into MI300 and therefore didn't need to make any modifications to those.
George: And with MI350, you've now moved from four base dies to two base dies. What has that enabled in terms of the layout of your top dies?
Alan: Yeah, so what we did, as you mentioned: in MI350 there are only two I/O dies, and each of them hosts four of the accelerator chiplets, versus MI300, where we had four I/O dies with each of them hosting two of the accelerator chiplets. So that's what you're talking about.
So what we did was, we wanted to increase the bandwidth from global memory, from HBM; MI300 was designed for HBM3 and MI350 was specifically designed for HBM3E. So we wanted to go from 5.2 or 5.6 gigabits per second up to a full 8 gigabits per second. But we also wanted to do that at the lowest possible power, because delivering the bytes from HBM into the compute cores at the lowest energy per bit leaves more power for compute at a fixed GPU power level. So on bandwidth-bound kernels that have a compute element, by reducing the amount of power that we spend on data transport, we can put more power into the compute and deliver higher performance for those kernels.
So what we did by combining those two chips into one was widen the buses within those chips; we deliver more bytes per clock, and therefore we can run them at a lower frequency and also a lower voltage, which gives us the V-squared scaling of power with voltage for delivering those bits. So that's why we did that.
George: And speaking of power, MI350X is 1000 watts, and MI355X is 1400 watts. What are the different thermal considerations for that 40% uptick in power, not just in terms of cooling the system, but also keeping the individual chiplets within tolerances?
Alan: Great question, and obviously we have some things to consider for our 3D architectures as well.
So when we do our total power and thermal architecture of these chips, we consider from the motherboard all the way up to the daughterboards, which are the UBB (Universal Baseboard), the OAM (OCP Accelerator Module) modules in this case, and then up through the stack of CoWoS (Chip on Wafer on Substrate), the I/O dies, which are in this intermediate layer, and then the compute that's above those. So we look at the total thermal density of that whole stack, and the amount of thermal transport or thermal resistance that we have within that stack, and the thermal interface materials that we need in order to build on top of that for heat removal, right?
And so we offer two different classes of thermal solutions for the MI350 series. One of them is air-cooled, like you mentioned. The other one is direct-attach liquid cooling. So in the liquid-cooled case, the cold plate directly attaches to the thermal interface material on top of the chips. So we do thermal modeling of that entire stack, and work directly with all of our technology partners to make sure that the power densities that we build into the chips can be handled by that entire thermal stack-up.
George: Awesome, and since we're running short on time, the most important question of this interview is, what's your favorite type of cheese?
Alan: Oh, cheddar.
George: Massively agree with you. What's your favorite brand of cheddar?
Alan: I like the Vermont one. What is that, oh... Calbert's? I can't think of it. [Editor's note: It's Cabot Cheddar that is Alan's favorite]
George: I know my personal favorite's probably Tillamook, which is, yeah, from Oregon. But anyway, thank you so much, Alan, for this interview.
If you would like to support the channel, hit like, hit subscribe. And if you like interviews like this, tell us in the comments below. Also, there will be a transcript on the Chips and Cheese website. If you want to directly monetarily support Chips and Cheese, there's Patreon, as well as Stripe through Substack, and PayPal. So, thank you so much for that interview, Alan.
Alan: Thank you, my pleasure.
George: Have a good one, folks!
2025-06-18 01:00:05
CDNA 4 is AMD’s latest compute-oriented GPU architecture, and represents a modest update over CDNA 3. CDNA 4’s focus is primarily on boosting AMD’s matrix multiplication performance with lower precision data types. Those operations are important for machine learning workloads, which can often maintain acceptable accuracy with very low precision types. At the same time, CDNA 4 seeks to maintain AMD’s lead in more widely applicable vector operations.
To do so, CDNA 4 largely uses the same system level architecture as CDNA 3. It’s a massive chiplet setup, with parallels to AMD’s successful use of chiplets for CPU products. Accelerator Compute Dies, or XCDs, contain CDNA Compute Units and serve a role analogous to Core Complex Dies (CCDs) on AMD’s CPU products. Eight XCDs sit atop two base dies (down from four on MI300X), which implement 256 MB of memory-side cache. AMD’s Infinity Fabric provides coherent memory access across the system, which can span multiple chips.
Compared to the CDNA 3 based MI300X, the CDNA 4 equipped MI355X slightly cuts down CU count per XCD, and disables more CUs to maintain yields. The resulting GPU is somewhat less wide, but makes up much of the gap with higher clock speeds. Compared to Nvidia’s B200, both MI355X and MI300X are larger GPUs with a far larger count of basic building blocks. Nvidia’s B200 does adopt a multi-die strategy, breaking from a long tradition of monolithic designs. However, AMD’s chiplet setup is far more aggressive, and seeks to replicate the scaling success of AMD’s CPU designs in large compute GPUs.
CDNA 3 provided a huge vector throughput advantage over Nvidia’s H100, but faced a more complicated situation with machine learning workloads. Thanks to a mature software ecosystem and a heavy focus on matrix multiplication throughput (tensor cores), Nvidia could often get close (https://chipsandcheese.com/p/testing-amds-giant-mi300x) to the nominally far larger MI300X. AMD of course maintained massive wins if the H100 ran out of VRAM, but there was definitely room for improvement.
CDNA 4 rebalances its execution units to more closely target matrix multiplication with lower precision data types, which is precisely what machine learning workloads use. Per-CU matrix throughput doubles in many cases, with CDNA 4 CUs matching Nvidia’s B200 SMs in FP6. Elsewhere though, Nvidia continues to show a stronger emphasis on low precision matrix throughput. B200 SMs have twice as much per-clock throughput as a CDNA 4 CU across a range of 16-bit and 8-bit data types. AMD continues to rely on having a bigger, higher clocked GPU to maintain an overall throughput lead.
With vector operations and higher precision data types, AMD continues MI300X’s massive advantage. Each CDNA 4 CU continues to have 128 FP32 lanes, which deliver 256 FLOPS per cycle when counting FMA operations. MI355X’s lower CU count does lead to a slight reduction in vector performance compared to MI300X. But compared to Nvidia’s Blackwell, AMD’s higher core count and higher clock speeds let it maintain a huge vector throughput lead. Thus AMD’s CDNA line continues to look very good for high performance compute workloads.
Nvidia’s focus on machine learning and matrix operations keeps them very competitive in that category, despite having fewer SMs running at lower clocks. AMD’s giant MI355X holds a lead across many data types, but the gap between AMD and Nvidia’s largest GPUs isn’t nearly as big as with vector compute.
GPUs provide a software managed scratchpad local to a group of threads, typically ones running on the same core. AMD GPUs use a Local Data Share, or LDS, for that purpose. Nvidia calls their analogous structure Shared Memory. CDNA 3 had a 64 KB LDS, carrying forward a similar design from AMD GCN GPUs going back to 2012. That LDS had 32 banks of 2 KB each, with each bank 32 bits wide, providing up to 128 bytes per cycle in the absence of bank conflicts.
CDNA 4 increases the LDS capacity to 160 KB and doubles read bandwidth to 256 bytes per clock. GPUs natively operate on 32 bit elements, and it would be reasonable to assume AMD doubled bandwidth by doubling bank count. If so, each bank may now have 2.5 KB of capacity. Another possibility would be increasing bank count to 80 while keeping bank size at 2 KB, but that’s less likely because it would complicate bank selection. A 64-banked LDS could naturally serve a 64-wide wavefront access with each bank serving a lane. Furthermore, a power-of-two bank count would allow simple bank selection via a subset of address bits.
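To make the bank-selection point concrete, here's a small sketch assuming the speculated 64-bank arrangement described above; this is not a documented CDNA 4 detail, just the arithmetic that makes a power-of-two bank count attractive.

```cpp
// With 4-byte-wide banks and a power-of-two bank count, the bank index is just
// a subset of address bits. Two lanes conflict when they hit the same bank at
// different addresses, which serializes the access.
#include <cstdint>
#include <cstdio>

constexpr uint32_t kNumBanks  = 64;  // speculated for CDNA 4; CDNA 3 used 32
constexpr uint32_t kBankWidth = 4;   // bytes each bank can supply per cycle

uint32_t bank_of(uint32_t lds_byte_addr) {
    // Bank index = bits [7:2] of the byte address for 64 banks of 4 bytes.
    return (lds_byte_addr / kBankWidth) % kNumBanks;
}

int main() {
    // A 64-wide wavefront reading consecutive 32-bit elements touches each bank
    // exactly once: no conflicts, the full 256 bytes per clock.
    std::printf("lane 0 -> bank %u, lane 63 -> bank %u\n", bank_of(0 * 4), bank_of(63 * 4));
    // A stride of 64 elements (256 bytes) maps every lane to bank 0: a 64-way conflict.
    std::printf("strided lane 0 -> bank %u, lane 1 -> bank %u\n", bank_of(0 * 256), bank_of(1 * 256));
    return 0;
}
```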
The larger LDS lets software keep more data close to the execution units. Kernels can allocate more LDS capacity without worrying about lower occupancy due to LDS capacity constraints. For example, a kernel that allocates 16 KB of LDS could run four workgroups on a CDNA 3 CU. On CDNA 4, that would increase to ten workgroups.
Software has to explicitly move data into the LDS to take advantage of it, which can introduce overhead compared to using a hardware-managed cache. CDNA 3 had GLOBAL_LOAD_LDS instructions that let kernels copy data into the LDS without going through the vector register file. CDNA 4 augments GLOBAL_LOAD_LDS to support moving up to 128 bits per lane, versus 32 bits per lane on CDNA 3. That is, the GLOBAL_LOAD_LDS instruction can accept sizes of 1, 2, 4, 12, or 16 bytes, versus just 1, 2, or 4 bytes on CDNA 3.¹
CDNA 4 also introduces read-with-transpose LDS instructions. Matrix multiplication involves multiplying elements of a row in one matrix with corresponding elements in a second matrix’s column. Often that creates inefficient memory access patterns, for at least one matrix, depending on whether data is laid out in row-major or column-major order. Transposing a matrix turns the awkward row-to-column operation into a more natural row-to-row one. Handling transposition at the LDS is also natural for AMD’s architecture, because the LDS already has a crossbar that can map bank outputs to lanes (swizzle).
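As a generic illustration of why the transpose matters (this is plain scalar C++, not the CDNA 4 LDS instruction itself): computing C = A × B with row-major storage forces column-strided reads of B, while multiplying against a pre-transposed B lets both inner loops walk memory row-to-row.

```cpp
// Multiply A (N x N, row-major) by B, where bt holds B already transposed.
// Both operands are now read with unit stride in the inner loop; without the
// transpose, B would be read as b[k * n + j], i.e. with a stride of n elements.
#include <vector>

void matmul_bt(const std::vector<float>& a,   // A, row-major
               const std::vector<float>& bt,  // B transposed, row-major
               std::vector<float>& c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * bt[j * n + k];  // row-to-row accesses
            }
            c[i * n + j] = sum;
        }
    }
}
```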
Even with its LDS capacity increase, AMD continues to have less data storage within its GPU cores compared to Nvidia. Blackwell’s SMs have a 256 KB block of storage partitioned for use as both L1 cache and Shared Memory. Up to 228 KB can be allocated for use as Shared Memory. With a 164 KB Shared Memory allocation, which is close to matching AMD’s 160 KB LDS, Nvidia would still have 92 KB available for L1 caching. CDNA 4, like CDNA 3, has a 32 KB L1 vector cache per CU. Thus a Blackwell SM can have more software managed storage while still having a larger L1 cache than a CDNA 4 CU. Of course, AMD’s higher CU count means there’s 40 MB of LDS capacity across the GPU, while Nvidia only has ~33 MB of Shared Memory across B200 with the largest 228 KB Shared Memory allocation.
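A quick check of those aggregate figures, assuming 256 active CUs on MI355X and the 148 B200 SMs reported via OpenCL (see the footnotes at the end of the article):

```cpp
// 256 CUs x 160 KB of LDS vs. 148 SMs x 228 KB of Shared Memory (max carve-out).
#include <cstdio>

int main() {
    const double mi355x_lds_mb = 256 * 160.0 / 1024.0;  // = 40.0 MB
    const double b200_smem_mb  = 148 * 228.0 / 1024.0;  // ~= 33 MB
    std::printf("MI355X total LDS: %.1f MB, B200 max Shared Memory: %.1f MB\n",
                mi355x_lds_mb, b200_smem_mb);
    return 0;
}
```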
To feed the massive arrays of Compute Units, MI355X largely uses the same system level architecture as MI300X. MI355X does see a few enhancements though. The L2 caches can “writeback dirty data and retain a copy of the line”. “Dirty” refers to data that has been modified in a write-back cache, but hasn’t been propagated to lower levels in the memory subsystem. When a dirty line is evicted to make room for newer data, its contents are written back to the next level of cache, or DRAM if it’s the last level cache.
AMD may be seeking to opportunistically use write bandwidth when the memory subsystem is under low load, smoothing out spikes in bandwidth demand caused by cache fill requests accompanied by writebacks. Or, AMD could be doing something special to let the L2 transition a line to clean state if written data is likely to be read by other threads across the system, but isn’t expected to be modified again anytime soon.
MI355X’s DRAM subsystem has been upgraded to HBM3E, providing a substantial bandwidth and capacity uplift over its predecessor and maintaining AMD’s lead over its Nvidia competition. Nvidia also uses HBM3E with the B200, which appears to have eight stacks as well. However, the B200 tops out at 180 GB of capacity and 7.7 TB/s of bandwidth, compared to 288 GB at 8 TB/s on the MI355X. The MI300X could hold a substantial advantage over Nvidia’s older H100 when the H100 ran out of DRAM capacity, and AMD is likely looking to retain that advantage.
Higher bandwidth from HBM3E also brings up MI355X’s bandwidth-to-compute ratio. MI300X had ~0.03 bytes of DRAM bandwidth per FP32 FLOP, which increases to 0.05 on MI355X. Blackwell for comparison has ~0.10 bytes of DRAM bandwidth per FP32 FLOP. Nvidia has increased last level cache capacity on Blackwell, but AMD continues to lean more heavily on big caches, while Nvidia relies more on DRAM bandwidth.
CDNA 2 and CDNA 3 made sweeping changes compared to their predecessors. CDNA 4’s changes are more muted. Much like going from Zen 3 to Zen 4, MI355X retains a similar chiplet arrangement with compute and IO chiplets swapped out for improved versions. Rather than changing up their grand strategy, AMD spent their time tuning CDNA 3. Fewer, higher clocked CUs are easier to utilize, and increased memory bandwidth can help utilization too. Higher matrix multiplication throughput also helps AMD take on Nvidia for machine learning workloads.
In some ways, AMD’s approach with this generation has parallels to Nvidia’s. Blackwell SMs are basically identical to Hopper’s from a vector execution perspective, with improvements focused on the matrix side. Nvidia likely felt they had a winning formula, as their past few GPU generations have undoubtedly been successful. AMD may have found a winning formula with CDNA 3 as well. MI300A, MI300X’s iGPU cousin, powers the highest ranking supercomputer on TOP500’s June list.⁴ Building on success can be a safe and rewarding strategy, and CDNA 4 may be doing just that.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
https://github.com/llvm/llvm-project/blob/main/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl - b96 and b128 (96-bit and 128-bit) global_load_lds sizes
https://github.com/llvm/llvm-project/blob/84ff1bda2977e580265997ad2d4c47b18cd3bf9f/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td#L426C1-L426C50 - LDS transpose intrinsics
https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html
https://www.reddit.com/r/hardware/comments/1kj38r1/battle_of_the_giants_8x_nvidia_blackwell_b200/ - reports 148 Compute Units via OpenCL for B200. Nvidia usually reports SMs for the Compute Unit count
2025-06-14 23:47:00
Today, AMD’s Infinity Fabric interconnect is ubiquitous across the company’s lineup. Infinity Fabric provides well-defined interfaces to a transport layer, and lets different functional blocks treat the interconnect as a black box. The system worked well enough to let AMD create integrated GPU products all the way from the Steam Deck’s tiny van Gogh APU, to giant systems packing four MI300A chips. Across all those offerings, Infinity Fabric enables coherent memory access as CPU and GPU requests converge to Coherent Slaves (CS-es), which have a probe filter and can request data from the appropriate source.
AMD was also trying to build powerful iGPUs over a decade ago, but their interconnect looked very different at the time. Their Northbridge architecture owes its name to the Athlon 64 era, when AMD brought the chipset northbridge’s functionality onto the CPU die. AMD’s engineers likely didn’t envision needing to tightly integrate a GPU during Athlon 64’s development, so the interconnect was designed to tie together a few CPU cores and service memory requests with low latency. But then AMD acquired ATI, and set about trying to take advantage of their new graphics talent by building powerful iGPUs.
“Trinity” is an AMD Accelerated Processing Unit (APU) from 2012. It combines two dual-threaded Piledriver CPU modules with a 6-SIMD Terascale 3 iGPU. Here I’ll be looking into AMD’s second generation iGPU interconnect with the A8-5600K, a slightly cut-down Trinity implementation with four Terascale 3 SIMDs enabled and CPU boost clocks dropped from 4.2 to 3.9 GHz. I’m testing the chip on an MSI FM2-A75MA-E35 board with 16 GB of Kingston DDR3-1866 10-10-9-26 memory.
Trinity’s on-die network resembles that of AMD’s first APU, Llano, and has clear similarities to AMD’s Athlon 64 Northbridge. The Northbridge sits on a separate voltage and frequency domain, and runs at 1.8 GHz on the A8-5600K. It uses a two-level crossbar setup. CPU cores connect to a System Request Interface (SRI), which routes requests onto a set of queues. Most memory requests head to a System Request Queue (SRQ). The SRI also accepts incoming probes on a separate queue, and routes them to the CPU cores. A second level crossbar, simply called the XBAR, connects with the SRI’s various queues and routes requests to IO and memory. The XBAR can handle IO-to-IO communication too, though such traffic is rare on consumer systems.
On Trinity, the XBAR’s scheduler (XCS) has 40 entries, making it slightly smaller than the 64 entry XCS on desktop and server Piledriver chips. AMD defaults to a 22+10+8 entry split between the SRI, Memory Controller (MCT), and upstream channel, though the BIOS can opt for a different XCS entry allocation. The XBAR sends memory requests to the MCT, which prioritizes requests based on type and age, and includes a strided-access prefetcher. Like Infinity Fabric’s CS, the MCT is responsible for ensuring cache coherency and can send probes back to the XBAR. The MCT also translates physical addresses to “normalized” addresses that only cover the memory space backed by DRAM.
All AMD CPUs from Trinity’s era use a similar SRI+XBAR setup, but Trinity’s iGPU sets it apart from CPU-only products. The iGPU gets its own Graphics Memory Controller (GMC), which arbitrates between different request types and schedules requests to maximize DRAM bandwidth utilization. Thus the GMC performs an analogous role to the MCT on the CPU side. A “Radeon Memory Bus” connects the GMC to the DRAM controllers, bypassing the MCT. AMD documentation occasionally refers to the Radeon Memory Bus as “Garlic”, likely a holdover from internal names used during Llano’s development.
A second control link hooks the GPU into the XBAR, much like any other IO device. Previously, the control link was called the “Fusion Control Link”, or “Onion”. I’ll use “Garlic” and “Onion” to refer to the two links because those names are shorter.
Trinity’s “Garlic” link lets the GPU saturate DRAM bandwidth. It bypasses the MCT and therefore bypasses the Northbridge’s cache coherency mechanisms. That’s a key design feature, because CPUs and GPUs usually don’t share data. Snooping the CPU caches for GPU memory requests would waste power and flood the interconnect with probes, most of which would miss.
Bypassing the MCT also skips the Northbridge’s regular memory prioritization mechanism, so the MCT and GMC have various mechanisms to keep the CPU and GPU from starving each other of memory bandwidth. First, the MCT and GMC can limit how many outstanding requests they have on the DRAM controller queues (DCQs). Then, the DCQs can alternate between accepting requests from the MCT and GMC, or can be set to prioritize the requester with fewer outstanding requests. Trinity defaults to the former. Finally, because the CPU tends to be more latency sensitive, Trinity can prioritize MCT-side reads ahead of GMC requests. Some of this arbitration seems to happen at queues in front of the DCQ. AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries.
From software testing, Trinity does a good job controlling CPU-side latency under increasing GPU bandwidth demand. Even with the GPU hitting the DRAM controllers with over 24 GB/s of reads and writes, CPU latency remains under 120 ns. Loaded latency can actually get worse under CPU-only bandwidth contention, even though Trinity’s CPU side isn’t capable of fully saturating the memory controller. When testing loaded latency, I have to reserve one thread for running the latency test, so bandwidth in the graph above tops out at just over 17 GB/s. Running bandwidth test threads across all four CPU logical threads would give ~20.1 GB/s.
Mixing CPU and GPU bandwidth load shows that latency goes up when there’s more CPU-side bandwidth. Again, the GPU can pull a lot of bandwidth without impacting CPU memory latency too heavily. Likely, there’s more contention at the SRI and XBAR than at the DRAM controllers.
Integrated GPUs raise the prospect of doing zero-copy data sharing between the CPU and GPU, without being constrained by a PCIe bus. Data sharing is conceptually easy, but cache coherency is not, and caching is critical to avoiding DRAM performance bottlenecks. Because requests going through the “Garlic” link can’t snoop caches, “Garlic” can’t guarantee it retrieves the latest data written by the CPU cores.
That’s where the “Onion” link comes in. An OpenCL programmer can tell the runtime to allocate cacheable host memory, then hand it over for the GPU to use without an explicit copy. “Onion” requests pass through the XBAR and get switched to the MCT, which can issue probes to the CPU caches. That lets the GPU access the CPU’s cacheable memory space and retrieve up-to-date data. Unfortunately, “Onion” can’t saturate DRAM bandwidth and tops out at just under 10 GB/s. Trinity’s interconnect also doesn’t come with probe filters, so probe response counts exceed 45 million per second. Curiously, that figure is lower than if a probe had been issued for every 64B cacheline.
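For reference, here's a minimal sketch (error handling omitted) of the zero-copy pattern described above, using standard OpenCL calls: ask the runtime for cacheable host memory, touch it from the CPU via a map, then hand the same buffer to a kernel. On Trinity, the GPU side of such an access would go over the coherent “Onion” path; the helper name below is mine, not AMD's.

```cpp
// Zero-copy host buffer: CL_MEM_ALLOC_HOST_PTR asks for host-accessible memory,
// clEnqueueMapBuffer exposes it to the CPU, and the kernel reads it in place.
#include <CL/cl.h>
#include <cstring>

void zero_copy_host_buffer(cl_context ctx, cl_command_queue q, cl_kernel kernel, size_t bytes) {
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, nullptr, nullptr);

    // CPU writes through a mapped (cacheable) pointer, with no explicit copy.
    void* ptr = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                   0, nullptr, nullptr, nullptr);
    std::memset(ptr, 0, bytes);
    clEnqueueUnmapMemObject(q, buf, ptr, 0, nullptr, nullptr);

    // The GPU kernel then consumes the same allocation directly.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = bytes / sizeof(float);
    clEnqueueNDRangeKernel(q, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clFinish(q);
    clReleaseMemObject(buf);
}
```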
Northbridge performance counters count “Onion” requests as IO to memory traffic at the XBAR. “Garlic” traffic is not counted at the XBAR, and is only observed at the DRAM controllers. Curiously, the iGPU also creates IO to memory traffic when accessing a very large buffer nominally allocated out of the GPU’s memory space (no ALLOC_HOST_PTR flag), implying it’s using the “Onion” link when it runs out of memory space backed by “Garlic”. If so, iGPU memory accesses over “Onion” incur roughly 320 ns of extra latency over a regular “Garlic” access.
Therefore Trinity’s iGPU takes a significant bandwidth penalty when accessing CPU-side memory, even though both memory spaces are backed by the same physical medium. AMD’s later Infinity Fabric handles this in a more seamless manner, as CS-es observe all CPU and GPU memory accesses. Infinity Fabric CS-es have a probe filter, letting them maintain coherency without creating massive probe traffic.
Telling the GPU to access cacheable CPU memory is one way to enable zero-copy behavior. The other way is to map GPU-side memory into a CPU program’s address space. Doing so lets Trinity’s iGPU use its high bandwidth “Garlic” link, and AMD allows this using a CL_MEM_USE_PERSISTENT_MEM_AMD flag. But because the Northbridge has no clue what the GPU is doing through its “Garlic” link, it can’t guarantee cache coherency. AMD solves this by making the address space uncacheable (technically write-combining) from the CPU side, with heavy performance consequences for the CPU.
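A sketch of that other zero-copy direction follows, using the CL_MEM_USE_PERSISTENT_MEM_AMD flag from AMD's cl_amd_device_memory_flags extension (declared in CL/cl_ext.h). The surrounding structure and function name are illustrative; the key point is that the CPU sees a write-combining mapping, so streaming writes are fine but CPU reads from it will be painfully slow, as described above.

```cpp
// Device-local buffer made CPU-visible: the CPU streams data in through a
// write-combining mapping while the GPU later reads it at full "Garlic" speed.
#include <CL/cl.h>
#include <CL/cl_ext.h>

cl_mem persistent_gpu_buffer(cl_context ctx, cl_command_queue q, size_t bytes) {
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                                bytes, nullptr, nullptr);

    // Mapping hands the CPU a pointer into the GPU's memory space.
    float* p = static_cast<float*>(clEnqueueMapBuffer(
        q, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, nullptr, nullptr, nullptr));
    for (size_t i = 0; i < bytes / sizeof(float); i++)
        p[i] = 1.0f;                     // sequential stores, write-combining friendly
    clEnqueueUnmapMemObject(q, buf, p, 0, nullptr, nullptr);
    return buf;                          // ready for kernels to consume
}
```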
Besides not being able to enjoy higher bandwidth and lower latency from cache hits, the CPU can’t combine reads from the same cacheline into a single fill request. From the Northbridge perspective, the System Request Interface sends “sized reads” to the XBAR instead of regular “cache block” commands. From Llano’s documentation, the CPU to iGPU memory request path only supports a single pending read, and that seems to apply to Trinity too.
Bandwidth sinks like a stone with the CPU unable to exploit memory level parallelism. Worse, the CPU to iGPU memory path appears less optimized in latency terms. Pointer chasing latency is approximately 93.11 ns across a small 8 KB array. For comparison, requests to cacheable memory space with 2 MB pages achieve lower than 70 ns of latency with a 1 GB array. I’m not sure what page size AMD’s drivers use when mapping iGPU memory into CPU address space, but using an 8 KB array should avoid TLB misses and make the two figures comparable. On AMD’s modern Ryzen 8840HS, which uses Infinity Fabric, there’s no significant latency difference when accessing GPU memory. Furthermore, GPU memory remains cacheable when mapped to the CPU side. However, memory latency in absolute terms is worse on the Ryzen 8840HS at over 100 ns.
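The latency figures above come from a pointer-chasing pattern. Here's a minimal sketch of the technique (not the exact harness used for these numbers): each load's address depends on the previous result, so the CPU can't overlap misses and the time per iteration approximates round-trip latency to wherever the array lives.

```cpp
// Pointer chase through a shuffled index permutation. A proper single-cycle
// permutation (Sattolo's algorithm) would be stricter, but a shuffle is enough
// to illustrate the dependent-load chain.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

double chase_ns_per_hop(size_t elements, size_t iterations) {
    std::vector<size_t> next(elements);
    std::iota(next.begin(), next.end(), 0);
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});

    size_t idx = 0;
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iterations; i++)
        idx = next[idx];                          // serialized dependent loads
    auto stop = std::chrono::steady_clock::now();
    std::printf("sink: %zu\n", idx);              // keep the chain from being optimized out
    return std::chrono::duration<double, std::nano>(stop - start).count() / iterations;
}
```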
While AMD had GPU compute in its sights when designing Trinity, its biggest selling point was as a budget gaming solution. Many modern games struggle or won’t launch at all on Trinity, so I’ve selected a few older workloads to get an idea of how much traffic Trinity’s interconnect has to work with.
Unigine Valley is a 2013-era DirectX 11 benchmark. Performance counter data collected over a benchmark pass shows plenty of DRAM traffic not observed at the XBAR, which can be attributed to the iGPU’s “Garlic” link. CPU-side memory bandwidth is light, and generally stays below 3 GB/s. Total DRAM traffic peaked at 17.7 GB/s, measured over a 1 second interval.
Final Fantasy 14’s Heavensward benchmark came out a few years after Trinity. With the “Standard (Laptop)” preset running at 1280x720, the A8-5600K averaged 25.6 FPS. Apparently that counts as “Fairly High”, which I only somewhat agree with.
The Heavensward benchmark is more bandwidth hungry, with DRAM bandwidth reaching 22.7 GB/s over a 1 second sampling interval. The CPU side uses more bandwidth too, often staying in the 3-4 GB/s range and sometimes spiking to 5 GB/s. Still, the bulk of DRAM traffic can be attributed to the iGPU, and uses the “Garlic” link.
The Elder Scrolls Online (ESO) is an MMO that launched a few years after Trinity, and can still run on the A8-5600K (if barely) thanks to FSR. I’m running ESO at 1920x1080 with the low quality preset, and FSR set to Quality. Framerate isn’t impressive and often stays below 20 FPS.
CPU bandwidth can often reach into the high 4 GB/s range, though that often occurs when loading in or accessing hub areas with many players around. GPU-side bandwidth demands are lower compared to the two prior benchmarks, and total bandwidth stays under 16 GB/s.
RawTherapee converts camera raw files into standard image formats (e.g., JPEG). Here, I’m processing 45 megapixel raw files from a Nikon D850. RawTherapee only uses the CPU for image processing, so most DRAM traffic is observed through the XBAR, coming from the CPU. The iGPU still uses a bit of bandwidth for display refreshes and perhaps the occasional GPU-accelerated UI feature. In this case, I have the iGPU driving a pair of 1920x1080 60 Hz displays.
Darktable is another raw conversion program. Unlike RawTherapee, Darktable can offload some parts of its image processing pipeline to the GPU. It’s likely the kind of application AMD hoped would become popular and showcase its iGPU prowess. With Darktable, Trinity’s Northbridge handles a mix of traffic from the CPU and iGPU, depending on what the image processing pipeline demands. IO to memory traffic also becomes more than a rounding error, which is notable because that’s how “Onion” traffic is counted. Other IO traffic like SSD reads can also be counted at the XBAR, but that should be insignificant in this workload.
In Trinity’s era, powerful iGPUs were an obvious way for AMD to make use of their ATI acquisition and carve out a niche, despite falling behind Intel on the CPU front. However, AMD’s interconnect wasn’t ideal for integrating a GPU. The “Onion” and “Garlic” links were a compromise solution that worked around the Northbridge’s limitations. With these links, Trinity doesn’t get the full advantages that an iGPU should intuitively enjoy. Zero-copy behavior is possible, and should work better than doing so with a contemporary discrete GPU setup. But both the CPU and GPU face lower memory performance when accessing the other’s memory space.
Awkwardly for AMD, Intel had a more integrated iGPU setup with Sandy Bridge and Ivy Bridge. Intel’s iGPU sat on a ring bus along with CPU cores and L3 cache. That naturally made it a client of the L3 cache, and unified the CPU and GPU memory access paths at that point. Several years would pass before AMD created their own flexible interconnect where the CPU and GPU could both use a unified, coherent memory path without compromise. Fortunately for AMD, Intel didn’t have a big iGPU to pair with Ivy Bridge.
Trinity’s interconnect may be awkward, but its weaknesses don’t show for graphics workloads. The iGPU can happily gobble up bandwidth over the “Garlic” link, and can do so without destroying CPU-side memory latency. Trinity does tend to generate a lot of DRAM traffic, thanks to a relatively small 512 KB read-only L2 texture cache on the GPU side, and no L3 on the CPU side. Bandwidth hungry workloads like FF14’s Heavensward benchmark can lean heavily on the chip’s DDR3-1866 setup. But even in that workload, there’s a bit of bandwidth to spare. Trinity’s crossbars and separate coherent/non-coherent iGPU links may lack elegance and sophistication compared to modern designs, but they were sufficient to let AMD get started with powerful iGPUs in the early 2010s.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.