2025-06-29 08:26:53
Nvidia has a long tradition of building giant GPUs. Blackwell, their latest graphics architecture, continues that tradition. GB202 is the largest Blackwell die. It occupies a massive 750 mm² of area, and has 92.2 billion transistors. GB202 has 192 Streaming Multiprocessors (SMs), the closest equivalent to a CPU core on a GPU, and feeds them with a massive memory subsystem. Nvidia’s RTX PRO 6000 Blackwell features the largest GB202 configuration to date. It sits alongside the RTX 5090 in Nvidia’s lineup, which also uses GB202 but disables a few more SMs.
A high level comparison shows the scale of Nvidia’s largest Blackwell products. AMD’s RDNA4 line tops out with the RX 9070 and RX 9070XT. The RX 9070 is slightly cut down, with four WGPs disabled out of 32. I’ll be using the RX 9070 to provide comparison data.
A massive thanks to Will Killian for giving us access to his RTX PRO 6000 Blackwell system to test; this article wouldn't have been possible without him!
GPUs use specialized hardware to launch threads across their cores, unlike CPUs that rely on software scheduling in the operating system. Hardware thread launch is well suited to the short and simple tasks that often characterize GPU workloads. Streaming Multiprocessors (SMs) are the basic building block of Nvidia GPUs, and are roughly analogous to a CPU core. SMs are grouped into Graphics Processing Clusters (GPCs), which contain a rasterizer and associated work distribution hardware.
GB202 has a 1:16 GPC to SM ratio, compared to the 1:12 ratio found in Ada Lovelace’s largest AD102 die. That lets Nvidia cheaply increase SM count and thus compute throughput without needing more copies of GPC-level hardware. However, dispatches with short-duration waves may struggle to take advantage of Blackwell’s scale, as throughput becomes limited by how fast a GPC can allocate work to its SMs rather than how fast the SMs can finish it.
AMD’s RDNA4 uses a 1:8 SE:WGP ratio, so one rasterizer feeds a set of eight WGPs in a Shader Engine. WGPs on AMD are the closest equivalent to SMs on Nvidia, and have the same nominal vector lane count. RDNA4 will be easier to utilize with small dispatches and short duration waves, but it’s worth noting that Blackwell’s design is not out of the ordinary. Scaling up GPU “cores” independently of work distribution hardware is a common technique for building larger GPUs. AMD’s RX 6900XT (RDNA2) had a 1:10 SE:WGP ratio. Before that, AMD’s largest GCN implementations like Fury X and Vega 64 had a 1:16 SE:CU ratio (CUs, or Compute Units, formed the basic building block of GCN GPUs). While Blackwell does have the same ratio as those large GCN parts, it enjoys higher clock speeds and likely has a higher wave launch rate to match per-GPU-core throughput. It won’t suffer as much as the Fury X from 10 years ago with short duration waves, but GB202 will still be harder to feed than smaller GPUs.
Although Nvidia didn’t scale up work distribution hardware, they did make improvements on Blackwell. Prior Nvidia generations could not overlap workloads of different types on the same queue. Going between graphics and compute tasks would require a “subchannel switch” and a “wait-for-idle”. That requires one task on the queue to completely finish before the next can start, even if a game doesn’t ask for synchronization. Likely, higher level scheduling hardware that manages queues exposed to host-side applications can only track state for one workload type at a time. Blackwell does away with subchannel switches, letting it more efficiently fill its shader array if applications frequently mix different work types on the same queue.
Once assigned work, the SM’s frontend fetches shader program instructions and delivers them to the execution units. Blackwell uses fixed length 128-bit (16 byte) instructions, and the SM uses a two-level instruction caching setup. Both characteristics are carried forward from Nvidia’s post-Turing/Volta designs. Each of the SM’s four partitions has a private L0 instruction cache, while a L1 instruction cache is shared across the SM.
Nvidia’s long 16 byte instructions translate to high instruction-side bandwidth demands. The L0+L1 instruction caching setup is likely intended to handle those bandwidth demands while also maintaining performance with larger code footprints. Each L0 only needs to provide one instruction per cycle, and its smaller size should make it easier to optimize for high bandwidth and low power. The SM-level L1 can then be optimized for capacity.
Blackwell’s L1i is likely 128 KB, from testing with unrolled loops of varying sizes and checking generated assembly (SASS) to verify the loop’s footprint. The L1i is good for approximately 8K instructions, providing a good boost over prior Nvidia generations.
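For readers who want to reproduce that kind of measurement, the usual approach is to generate kernels whose unrolled loop bodies grow until they spill out of each instruction cache level. The OpenCL sketch below is a hypothetical illustration of the technique rather than the exact benchmark used here; the macro names and sizes are mine.

```c
// Hypothetical instruction-footprint kernel. Repeating BODY grows the
// unrolled loop; on Blackwell each SASS instruction is 16 bytes, so roughly
// 8K instructions fill a 128 KB L1i. Time iterations as the footprint grows.
#define OP    a0 += a1; a1 += a2; a2 += a3; a3 += a0;
#define OP16  OP OP OP OP OP OP OP OP OP OP OP OP OP OP OP OP
#define BODY  OP16 OP16 OP16 OP16   /* 256 adds per expansion */

__kernel void icache_footprint(__global float *out, int iters) {
    float a0 = get_global_id(0), a1 = 1.0f, a2 = 2.0f, a3 = 3.0f;
    for (int i = 0; i < iters; i++) {
        BODY   /* paste BODY more times to test larger footprints */
    }
    out[get_global_id(0)] = a0 + a1 + a2 + a3;  /* keep the result live */
}
```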
Blackwell and Ada Lovelace both appear to have 32 KB L0i caches, an increase over Turing’s 16 KB. The L1i can fully feed a single partition, and can filter out redundant instruction fetches to feed all four partitions. However, L1i bandwidth can become a visible limitation if two waves on different partitions spill out of their L0i caches and run different code sections. In that case, per-wave throughput drops to one instruction per two cycles.
AMD uses variable length instructions ranging from 4 to 12 bytes, which lowers capacity and bandwidth pressure on the instruction cache compared to Nvidia. RDNA4 has a 32 KB instruction cache shared across a Workgroup Processor (WGP), much like prior RDNA generations. Like Nvidia’s SM, a WGP is divided into four partitions (SIMDs). RDNA1’s whitepaper suggests the L1i can supply 32 bytes per cycle to each SIMD. It’s an enormous amount of bandwidth considering AMD’s more compact instructions. Perhaps AMD wanted to be sure each SIMD could co-issue to its vector and scalar units in a sustained fashion. In this basic test, RDNA4’s L1i has no trouble maintaining full throughput when two waves traverse different code paths. RDNA4 also enjoys better code read bandwidth from L2, though all GPUs I’ve tested show poor L2 code read bandwidth.
Each Blackwell SM partition can track up to 12 waves to hide latency, which is a bit lower than the 16 waves per SIMD on RDNA4. Actual occupancy, or the number of active wave slots, can be limited by a number of factors including register file capacity. Nvidia has not changed theoretical occupancy or register file capacity since Ampere, and the latter remains at 64 KB per partition. A kernel therefore can’t use more than 40 registers while using all 12 wave slots, assuming allocation granularity hasn’t changed since Ada and is still 8 registers. For comparison, AMD’s high end RDNA3/4 SIMDs have 192 KB vector register files, letting a kernel use up to 96 registers while maintaining maximum occupancy.
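That register math is simple enough to sanity check. Here’s a quick back-of-the-envelope calculation under the stated assumptions about register file size and allocation granularity (my arithmetic, not a vendor formula):

```c
#include <stdio.h>

/* Max architectural registers per thread at full occupancy:
 * floor(regfile bytes / (wave slots * lanes * 4 bytes)), rounded down to the
 * allocation granularity. Granularity of 8 is assumed for both vendors here;
 * it doesn't change the AMD result since 96 is already a multiple of 8. */
static int max_regs_full_occupancy(int regfile_kb, int wave_slots,
                                   int lanes, int granularity) {
    int regs = (regfile_kb * 1024) / (wave_slots * lanes * 4);
    return (regs / granularity) * granularity;
}

int main(void) {
    printf("Blackwell partition: %d regs/thread\n",
           max_regs_full_occupancy(64, 12, 32, 8));   /* -> 40 */
    printf("RDNA4 SIMD:          %d regs/thread\n",
           max_regs_full_occupancy(192, 16, 32, 8));  /* -> 96 */
    return 0;
}
```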
Blackwell’s primary FP32 and INT32 execution pipelines have been reorganized compared to prior generations, and are internally arranged as one 32-wide execution pipe. That creates similarities to AMD’s RDNA GPUs, as well as Nvidia’s older Pascal. Having one 32-wide pipe handle both INT32 and FP32 means Blackwell won’t have to stall if it encounters a long stream of operations of the same type. Blackwell inherits Turing’s strength of being able to do 16 INT32 multiplies per cycle in each partition. Pascal and RDNA GPUs can only do INT32 multiplies at approximately quarter rate (8 per partition, per cycle).
Compared to Blackwell, AMD’s RDNA4 packs a lot of vector compute into each SIMD. Like RDNA3, RDNA4 can use VOPD dual issue instructions or wave64 mode to complete 64 FP32 operations per cycle in each partition. An AMD SIMD can also co-issue instructions of different types from different waves, while Nvidia’s dispatch unit is limited to one instruction per cycle. RDNA4’s SIMD also packs eight special function units (SFUs) compared to four on Nvidia. These units handle more complex operations like inverse square roots and trigonometric functions.
Differences in per-partition execution unit layout or count quickly get pushed aside by Blackwell’s massive SM count. Even when the RX 9070 can take advantage of dual issue, 28 WGPs cannot take on 188 SMs. Nvidia holds a large lead in every category.
Nvidia added floating point instructions to Blackwell’s uniform datapath, which dates back to Turing and serves a similar role to AMD’s scalar unit. Both offload instructions that are constant across a wave. Blackwell’s uniform FP instructions include adds, multiplies, FMAs, min/max, and conversions between integer and floating point. Nvidia’s move mirrors AMD’s addition of FP scalar instructions with RDNA 3.5 and RDNA4.
Still, Nvidia’s uniform datapath feels limited compared to AMD’s scalar unit. Uniform registers can only be loaded from constant memory, though curiously a uniform register can be written out to global memory. I wasn’t able to get Nvidia’s compiler to emit uniform instructions for the critical part of any instruction or cache latency tests, even when loading values from constant memory.
Raytracing has long been a focus of Nvidia’s GPUs. Blackwell doubles the per-SM ray triangle intersection test rate, though Nvidia does not specify what the box or triangle test rate is. Like Ada Lovelace, Blackwell’s raytracing hardware supports “Opacity Micromaps”, providing functionality similar to the sub-triangle opacity culling referenced by Intel’s upcoming Xe3 architecture.
Like Ada Lovelace and Ampere, Blackwell has a SM-wide 128 KB block of storage that’s partitioned for use as L1 cache and Shared Memory. Shared Memory is Nvidia’s term for a software managed scratchpad, which backs the local memory space in OpenCL. AMD’s equivalent is the Local Data Share (LDS), and Intel’s is Shared Local Memory (SLM). Unlike with their datacenter GPUs, Nvidia has chosen not to increase L1/Shared Memory capacity. As in prior generations, different L1/Shared Memory splits do not affect L1 latency.
AMD’s WGPs use a more complex memory subsystem, with a high level design that debuted in the first RDNA generation. The WGP has a 128 KB LDS that’s internally built from a pair of 64 KB, 32-bank structures connected by a crossbar. First level vector data caches, called L0 caches, are private to pairs of SIMDs. A WGP-wide 16 KB scalar cache services scalar and constant reads. In total, a RDNA4 WGP has 208 KB of data-side storage divided across different purposes.
A RDNA4 WGP enjoys substantially higher bandwidth from its private memories for global and local memory accesses. Each L0 vector cache can deliver 128 bytes per cycle, and the LDS can deliver 256 bytes per cycle total to the WGP. Mixing local and global memory traffic can further increase achievable bandwidth, suggesting the LDS and L0 vector caches have separate data buses.
Doing the same on Nvidia does not bring per-SM throughput past 128B/cycle, suggesting the 128 KB L1/Shared Memory block has a single 128B path to the execution units.
Yet any advantage AMD may enjoy from this characteristic is blunted as the RX 9070 drops clocks to 2.6 GHz, in order to stay within its 220W power target. Nvidia in contrast has a higher 600W power limit, and can maintain close to maximum clock speeds while delivering 128B/cycle from SM-private L1/Shared Memory storage.
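The bandwidth figures above come from kernels that hammer local memory, optionally alongside global memory reads. Something along these lines is a minimal sketch of the local-memory side; the buffer size and access pattern are illustrative, not the exact test behind these numbers.

```c
// Local-memory bandwidth sketch: stage a 16 KB tile in local memory, then
// stream float4 reads out of it. Issuing __global reads alongside these is
// what exposes whether the scratchpad and vector caches have separate buses.
__kernel void lds_bandwidth(__global const float4 *src, __global float4 *out,
                            int iters) {
    __local float4 buf[1024];
    int lid = get_local_id(0);
    int lsz = get_local_size(0);
    for (int i = lid; i < 1024; i += lsz)
        buf[i] = src[i];                       // fill the scratchpad once
    barrier(CLK_LOCAL_MEM_FENCE);

    float4 acc = (float4)(0.0f);
    for (int i = 0; i < iters; i++)
        acc += buf[(lid + i * lsz) & 1023];    // repeated local reads
    out[get_global_id(0)] = acc;               // keep the reads observable
}
```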
Just as with compute, Nvidia’s massive scale pushes aside any core-for-core differences. The 188 SMs across the RTX PRO 6000 Blackwell together have more than 60 TB/s of bandwidth. High SM count gives Nvidia more total L1/local memory too. Nvidia has 24 MB of L1/Shared Memory across the RTX PRO 6000. AMD’s RX 9070 has just under 6 MB of first level data storage in its WGPs.
SM-private storage typically offers low latency, at least in GPU terms, and that continues to be the case in Blackwell. Blackwell compares favorably to AMD in several areas, and part of that advantage comes down to address generation. I’m testing with dependent array accesses, and Nvidia can convert an array index to an address with a single IMAD.WIDE instruction.
AMD has fast 64-bit address generation through its scalar unit, but of course can only use that if the compiler determines the address calculation will be constant across a wave. If each lane needs to independently generate its own address, AMD’s vector integer units only natively operate with 32-bit data types and must do an add-with-carry to generate a 64-bit address.
Because a latency test can’t separate address generation from the cache access itself, Nvidia’s faster address generation shows up as better L1 vector access latency. AMD can be slightly faster if the compiler can carry out scalar optimizations.
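The latency tests referenced throughout this article boil down to a dependent pointer chase, where each load’s address comes from the previous load. Here’s a minimal OpenCL version of that pattern; the host fills the array with a pseudo-random cyclic permutation sized to the cache level being targeted.

```c
// Dependent-access (pointer chase) latency kernel: current = arr[current].
// A single wave runs this; divide elapsed time by iters to get per-load latency.
__kernel void pointer_chase(__global const int *arr, __global int *out,
                            int iters) {
    int current = 0;
    for (int i = 0; i < iters; i++)
        current = arr[current];   // each load depends on the previous one
    out[0] = current;             // prevent the chain from being optimized away
}
```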
GPUs can also offload address generation to the texture unit, which can handle array address calculations. The texture unit of course can also do texture filtering, though I’m only asking it to return raw data when testing with OpenCL’s image1d_buffer_t type. AMD enjoys lower latency if the texture unit does address calculations, but Nvidia does not.
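Routing the same chase through the texture path just means binding the array as an image1d_buffer_t, so the texture unit does the index-to-address work and returns raw, unfiltered data. A sketch, assuming the host created the image from a buffer with a signed 32-bit integer channel format:

```c
// Pointer chase through the texture unit: read_imagei on an image1d_buffer_t
// returns raw texels, so the TMU handles address generation for each access.
__kernel void pointer_chase_tex(__read_only image1d_buffer_t img,
                                __global int *out, int iters) {
    int current = 0;
    for (int i = 0; i < iters; i++)
        current = read_imagei(img, current).x;
    out[0] = current;
}
```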
GPUs often handle atomic operations with dedicated ALUs placed close to points of coherency, like the LDS or Shared Memory for local memory, or the L2 cache for global memory. That contrasts with CPUs, which rely on locking cachelines to handle atomics and ensure ordering. Nvidia appears to have 16 INT32 atomic ALUs at each SM, compared to 32 for each AMD WGP.
In a familiar trend, Nvidia can crush AMD by virtue of having a much bigger GPU, at least with local memory atomics. Both the RTX PRO 6000 and RX 9070 have surprisingly similar atomic add throughput in global memory, suggesting Nvidia either has fewer L2 banks or fewer atomic units per bank.
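For reference, the atomic throughput tests are conceptually simple: lots of work-items incrementing a small set of counters, once in local memory and once in global memory. A rough sketch of both kernels (counter counts and strides are arbitrary choices on my part):

```c
// Local-memory atomics: stresses the atomic ALUs next to Shared Memory / LDS.
__kernel void atomic_local(__global int *out, int iters) {
    __local int counters[32];
    int lid = get_local_id(0);
    if (lid < 32) counters[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < iters; i++)
        atomic_add(&counters[lid & 31], 1);    // spread across banks/ALUs
    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid == 0) out[get_group_id(0)] = counters[0];
}

// Global-memory atomics: stresses the atomic units at the L2.
__kernel void atomic_global(__global int *counters, int iters) {
    int slot = get_global_id(0) & 255;         // strided across cache slices
    for (int i = 0; i < iters; i++)
        atomic_add(&counters[slot], 1);
}
```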
RDNA4 and Blackwell have similar latency when threads exchange data through atomic compare and exchange operations, though AMD is slightly faster. The RX 9070 is a much smaller and higher clocked GPU, and both factors help lower latency when moving data across the GPU.
Blackwell uses a conventional two-level data caching setup, but continues Ada Lovelace’s strategy of increasing L2 capacity to achieve the same goals as AMD’s Infinity Cache. L2 latency on Blackwell regresses to just over 130 ns, compared to 107 ns on Ada Lovelace. Nvidia’s L2 latency continues to sit between AMD’s L2 and Infinity Cache latencies, though now it’s noticeably closer to the latter.
Test results using Vulkan suggest the smaller RTX 5070 also has higher L2 latency (122 ns) than the RTX 4090, even though the 5070 has fewer SMs and a smaller L2. Cache latency results from Nemes’s Vulkan test suite should be broadly comparable to my OpenCL ones, because we both use a current = arr[current] access pattern. A deeper look showed minor code generation differences that seem to add ~3 ns of latency to the Vulkan results. That doesn’t change the big picture with L2 latencies. Furthermore, the difference between L1 and L2 latency should approximate the time taken to traverse the on-chip network and access the L2. Differences between OpenCL and Vulkan results are insignificant in that regard. Part of GB202’s L2 latency regression may come from its massive scale, but results from the 5070 suggest there’s more to the picture.
The RTX PRO 6000 Blackwell’s VRAM latency is manageable at 329 ns, or ~200 ns over L2 hit latency. AMD’s RDNA4 manages better VRAM latency at 254 ns for a vector access, or 229 ns through the scalar path. Curiously, Nvidia’s prior Ada Lovelace and Ampere architectures enjoyed better VRAM latency than Blackwell, and are in the same ballpark as RDNA4 and RDNA2.
Blackwell’s L2 bandwidth is approximately 8.7 TB/s, slightly more than the RX 9070’s 8.4 TB/s. Nvidia retains a huge advantage at larger test sizes, where AMD’s Infinity Cache provides less than half the bandwidth. In VRAM, Blackwell’s GDDR7 and 512-bit memory bus continue to keep it well ahead of AMD.
Nvidia’s L2 performance deserves closer attention, because it’s one area where the RX 9070 gets surprisingly close to the giant RTX PRO 6000 Blackwell. A look at GB202’s die photo shows 64 cache blocks, suggesting the L2 is split into 64 banks. If so, each bank likely delivers 64 bytes per cycle (of which the test was able to achieve 48B/cycle). It’s an increase over the 48 L2 blocks in Ada Lovelace’s largest AD102 chip. However, Nvidia's L2 continues to have a tough job serving as both the first stop for L1 misses and as a large last level cache. In other words, it’s doing the job of AMD’s L2 and Infinity Cache levels. There’s definitely merit to cutting down cache levels, because checking a level of cache can add latency and power costs. However, caches also have to make a tradeoff between capacity, performance, and power/area cost.
Nvidia likely relies on their flexible L1/Shared Memory arrangement to keep L2 bandwidth demands under control, and insulate SMs from L2 latency. A Blackwell SM can use its entire 128 KB L1/Shared Memory block as L1 cache if a kernel doesn’t need local memory, while an AMD WGP is stuck with two 32 KB vector caches and a 16 KB scalar cache. However a kernel bound by local memory capacity with a data footprint in the range of several megabytes would put Nvidia at a disadvantage. Watching AMD and Nvidia juggle these tradeoffs is quite fun, though it’s impossible to draw any conclusions with the two products competing in such different market segments.
FluidX3D simulates fluid behavior and can demand plenty of memory bandwidth. It carries out computations with FP32 values, but can convert them to FP16 formats for storage. Doing so reduces VRAM bandwidth and capacity requirements. Nvidia’s RTX PRO 6000 takes a hefty lead over AMD’s RX 9070, as the headline compute and memory bandwidth specifications would suggest.
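The FP16 storage modes are conceptually simple: compute in FP32, but pack values to 16 bits for memory traffic. In OpenCL that’s just the vload_half/vstore_half built-ins; the kernel below is a toy illustration of the idea, not FluidX3D’s actual code.

```c
// Compute in FP32, store in FP16: halves VRAM bandwidth and capacity needs
// at some cost in precision. vload_half/vstore_half work without cl_khr_fp16.
__kernel void stream_fp16(__global const half *f_in, __global half *f_out) {
    int i = get_global_id(0);
    float v = vload_half(i, f_in);     // unpack 16-bit storage to FP32
    v = v * 0.99f + 0.01f;             // placeholder FP32 arithmetic
    vstore_half(v, i, f_out);          // pack back down to 16 bits
}
```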
Nvidia’s lead remains relatively constant regardless of what mode FluidX3D is compiled with.
We technically have more GPU competition than ever in 2025, as Intel’s GPU effort makes steady progress, establishing the company as a third contender. On the datacenter side, AMD’s MI300 has proven to be very competitive with supercomputing wins. But competition is conspicuously absent at the top of the consumer segment. Intel’s Battlemage and AMD’s RDNA4 stop at the midrange. The RX 9070 does target higher performance levels than Intel’s Arc B580, but neither comes anywhere close to Nvidia’s largest GB202 GPUs.
As for GB202, it’s yet another example of Nvidia building as big as they can to conquer the top end. The 750 mm² die pushes the limits of what can be done with a monolithic design. Its 575W or 600W power target tests the limits of what a consumer PC can support. By pushing these limits, Nvidia has created the largest consumer GPU available today. The RTX PRO 6000 incredibly comes close to AMD’s MI300X in terms of vector FP32 throughput, and is far ahead of Nvidia’s own B200 datacenter GPU. The memory subsystem is a monster as well. Perhaps Nvidia’s engineers asked whether they should emphasize caching like AMD’s RDNA2, or lean on VRAM bandwidth like they did with Ampere. Apparently, the answer is both. The same approach applies to compute, where the answer was apparently “all the SMs”.
Building such a big GPU isn’t easy, and Nvidia evidently faced their share of challenges. L2 performance is mediocre considering the massive compute throughput it may have to feed. Beyond GPU size, comparing with RDNA4 shows continued trends like AMD using a smaller number of individually stronger cores. RDNA4’s basic Workgroup Processor building block has more compute throughput and cache bandwidth than a Blackwell SM.
But none of that matters at the top end, because Nvidia shows up with over 6 times as many “cores”, twice as much last level cache capacity, and a huge VRAM bandwidth lead. Some aspects of Blackwell may not have scaled as nicely. But Nvidia’s engineers deserve praise because everyone else apparently looked at those challenges and decided they weren’t going to tackle them at all. Blackwell therefore wins the top end by default. Products like the RTX PRO 6000 are fascinating, and I expect Nvidia to keep pushing the limits of how big they can build a consumer GPU. But I also hope competition at the top end will eventually reappear in the decades and centuries to come.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
SASS instruction listings: https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#blackwell-instruction-set
GB202 whitepaper: https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf
Blackwell PRO GPU whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/NVIDIA-RTX-Blackwell-PRO-GPU-Architecture-v1.0.pdf
Techpowerup observing that the RTX 5090 could reach 2.91 GHz: https://www.techpowerup.com/review/nvidia-geforce-rtx-5090-founders-edition/44.html
2025-06-21 05:13:03
Hello you fine Internet folks,
At AMD's Advancing AI 2025, I had the pleasure of interviewing Alan Smith, AMD Senior Fellow and Chief Instinct Architect, about CDNA4 found in the MI350 series of accelerators.
Hope y'all enjoy!
Transcript below has been edited for conciseness and readability.
George: Hello you fine internet folks! We're here today at AMD's Advancing AI 2025 event, where the MI350 series has just been announced. And I have the pleasure to introduce, Alan Smith from AMD.
Alan: Hello, everyone.
George: What do you do at AMD?
Alan: So I'm the chief architect of Instinct GPUs.
George: Awesome, what does that job entail?
Alan: So I'm responsible for the GPU product line definition, working with our data center GPU business partners to define the-, work with them on the definition of the requirements of the GPUs, and then work with the design teams to implement those requirements.
George: Awesome. So, moving into MI350: MI350 is still GFX9-based, and for you guys in the audience, GFX9 is also known as Vega, or at least MI350 is derivative of GFX9. Why is it that MI350 is still on GFX9, whereas client architectures such as RDNA 3 and 4 are GFX11 and GFX12 respectively?
Alan: Sure, yeah, it's a great question. So as you know, the CDNA architecture off of, you know, previous generations of Instinct GPUs, starting with MI100 and before, like you said, in the Vega generations, were all GCN architecture, which was Graphics Core Next. And over several generations, CDNA's been highly optimized for the types of distributed computing algorithms for high-performance computing and AI. And so we felt like starting with that base for MI350 would give us the right pieces that we needed to deliver the performance goals that we had for the MI350 series.
George: And with GCN, as you know, there's a separate L1 cache and LDS, or Local Data Share. Why is that still in MI350? And why haven't they been merged?
Alan: Yeah, so like you said, you know, it is a legacy piece of the GCN architecture. It's sort of fundamental to the way the compute unit is built. So we felt like in this generation, it wasn't the right opportunity to make a microarchitectural change of that scale. So what we did instead, was we increased the capacity of the LDS. So previously in MI300 series, we had a 64 kilobyte LDS, and we've increased that capacity to 160 kilobytes in MI350 series. And in addition to that, we increased the bandwidth as well. So we doubled the bandwidth of the LDS into the register file, in order to be able to feed the Tensor Core rates that we have in the MI350 series.
George: And speaking of Tensor Cores, you've now introduced microscaling formats to MI350x for FP8, FP6, and FP4 data types. Interestingly enough, a major differentiator for MI350 is that FP6 is the same rate as FP4. Can you talk a little bit about how that was accomplished and why that is?
Alan: Sure, yep, so one of the things that we felt like on MI350 in this timeframe, that it's going into the market and the current state of AI... we felt like that FP6 is a format that has potential to not only be used for inferencing, but potentially for training. And so we wanted to make sure that the capabilities for FP6 were class-leading relative to... what others maybe would have been implementing, or have implemented. And so, as you know, it's a long lead time to design hardware, so we were thinking about this years ago and wanted to make sure that MI350 had leadership in FP6 performance. So we made a decision to implement the FP6 data path at the same throughput as the FP4 data path. Of course, we had to take on a little bit more hardware in order to do that. FP6 has a few more bits, obviously, that's why it's called FP6. But we were able to do that within the area of constraints that we had in the matrix engine, and do that in a very power- and area-efficient way.
George: And speaking of data types, I've noticed that TF32 is not on your ops list for hardware level acceleration. Why remove that feature from... or why was that not a major consideration for MI350?
Alan: Yeah, well, it was a consideration, right? Because we did remove it. We felt like that in this timeframe, that brain float 16, or BF16, would be a format that would be leverageable for most models to replace TF32. And we can deliver a much higher throughput on BF16 than TF32, so we felt like it was the right trade off for this implementation.
George: And if I was to use TF32, what would the speed be? Would it still be FP32, the speed of FP32?
Alan: You have a choice. We offer some emulation, and I don't have all the details on the exact throughputs off the top of my head; but we do offer emulation, software-based emulation using BF16 to emulate TF32, or you can just cast it into FP32 and use it at FP32 rate.
George: And moving from the CU up to the XCD, which is the compute die; the new compute die's now on N3P, and yet there's been a reduction from 40 CUs to 36 CUs physically on the die, with four fused off, one per shader engine. Why 32 CUs now, and why that reduction?
Alan: Yeah, so on MI300, we had co-designed for both MI300X and MI300A, one for HPC and one for AI. In the MI300A, we have just six XCDs. And so, we wanted to make sure when we only had six of the accelerator chiplets that we had enough compute units to power the level of HPC or high-performance computing - which is traditional simulation in FP64 - to reach the performance levels that we wanted to hit for the leadership-class supercomputers that we were targeting with that market.
And so we did that and delivered the fastest supercomputer in the world along with Lawrence Livermore, with El Capitan. But so as a consideration there, we wanted to have more compute units per XCD so that we could get 224 total within MI300A. On 350, where it's designed specifically as an accelerator only, a discrete accelerator, we had more flexibility there. And so we decided to have a power of two number of active compute units per die - so 36 physical, like you said, but we enable 32. Four of them, one per shader engine, are used for harvesting and we yield those out in order to give us good high-volume manufacturing through TSMC N3, which is a leading edge technology. So we have some of the spare ones that allow us to end up with 32 actually enabled.
And that's a nice power of two, and it's easy to tile tensors if you have a power of two. So most of the tensors that you're working with, or many of them, would be matrices that are based on a power of two. And so it allows you to tile them into the number of compute units easily, and reduces the total tail effect that you may have. Because if you have a non-power of two number of compute units, then some amount of the tensor may not map directly nicely, and so you may have some amount of work that you have to do at the end on just a subset of the compute unit. So we find that there's some optimization there by having a power of two.
George: While the new compute unit is on N3P, the I/O die is on N6; why stick with N6?
Alan: Yeah, great question. What we see in our chiplet technology, first of all, we have the choice, right? So being on chiplets gives you the flexibility to choose different technologies if appropriate. And the things that we have in the I/O die tend not to scale as well with advanced technologies. So things like the HBM PHYs, the high-speed SERDES, the caches that we have with the Infinity Cache, the SRAMs, those things don't scale as well. And so sticking with an older technology with a mature yield on a big die allows us to deliver a product cost and a TCO (Total Cost of Ownership) value proposition for our clients. And then we're able to leverage the most advanced technologies like N3P for the compute where we get a significant benefit in the power- and area-scaling to implement the compute units.
George: And speaking of, other than the LDS, what's interesting to me is that there have not been any cache hierarchy changes. Why is that?
Alan: Yeah, great question. So if you remember what I just said about MI300 being built to deliver the highest performance in HPC. And in order to do that, we needed to deliver significant global bandwidth into the compute units for double precision floating point. So we had already designed the Infinity Fabric and the fabric within the XCC or the Accelerated Compute Core to deliver sufficient bandwidth to feed the really high double precision matrix operations in MI300 and all the cache hierarchy associated with that. So we were able to leverage that amount of interconnecting capabilities that we had already built into MI300 and therefore didn't need to make any modifications to those.
George: And with MI350, you've now moved from four base dies to two base dies. What has that enabled in terms of the layout of your top dies?
Alan: Yeah, so what we did, as you mentioned, so in MI350, the I/O dies, there's only two of them. And then each of them host four of the accelerator chiplets versus in MI300, we had four of the I/O dies, with each of them hosting two of the accelerator chiplets. So that's what you're talking about.
So what we did was, we wanted to increase the bandwidth from global, from HBM. MI300 was designed for HBM3 and MI350 was specifically designed for HBM3E. So we wanted to go from 5.2 or 5.6 gigabit per second up to a full 8 gigabit per second. But we also wanted to do that at the lowest possible power, because delivering the bytes from HBM into the compute cores at the lowest energy per bit gives us more power at a fixed GPU power level, which means more power for the compute at the same time. So on bandwidth-bound kernels that have a compute element, by reducing the amount of power that we spend in data transport, we can put more power into the compute and deliver a higher performance for those kernels.
So what we did by combining those two chips together into one was we were able to widen up the buses within those chips; so we deliver more bytes per clock, and therefore we can run them at a lower frequency and also a lower voltage, which gives us the V-squared scaling of voltage for the amount of power that it takes to deliver those bits. So that's why we did that.
George: And speaking of power, MI350X is 1000 watts, and MI355X is 1400 watts. What are the different thermal considerations with that 40% uptick in power, not just in terms of cooling the system, but also keeping the individual chiplets within tolerances?
Alan: Great question, and obviously we have some things to consider for our 3D architectures as well.
So when we do our total power and thermal architecture of these chips, we consider from the motherboard all the way up to the daughterboards, which are the UBB (Universal Baseboard), the OAM (OCP Accelerator Module) modules in this case, and then up through the stack of CoWoS (Chip on Wafer on Substrate), the I/O dies, which are in this intermediate layer, and then the compute that's above those. So we look at the total thermal density of that whole stack, and the amount of thermal transport or thermal resistance that we have within that stack, and the thermal interface materials that we need in order to build on top of that for heat removal, right?
And so we offer two different classes of thermal solutions for the MI350 series. One of them is air-cooled, like you mentioned. The other one is a direct-attach liquid cool, where the cold plate directly attaches to the thermal interface material on top of the chips. So we do thermal modeling of that entire stack, and work directly with all of our technology partners to make sure that the power densities that we build into the chips can be handled by that entire thermal stack up.
George: Awesome, and since we're running short on time, the most important question of this interview is, what's your favorite type of cheese?
Alan: Oh, cheddar.
George: Massively agree with you. What's your favorite brand of cheddar?
Alan: I like the Vermont one. What is that, oh... Calbert's? I can't think of it. [Editor's note: It's Cabot Cheddar that is Alan's favorite]
George: I know my personal favorite's probably Tillamook, which is, yeah, from Oregon. But anyway, thank you so much, Alan, for this interview.
If you would like to support the channel, hit like, hit subscribe. And if you like interviews like this, tell us in the comments below. Also, there will be a transcript on the Chips and Cheese website. If you want to directly monetarily support Chips and Cheese, there's Patreon, as well as Stripe through Substack, and PayPal. So, thank you so much for that interview, Alan.
Alan: Thank you, my pleasure.
George: Have a good one, folks!
2025-06-18 01:00:05
CDNA 4 is AMD’s latest compute oriented GPU architecture, and represents a modest update over CDNA 3. CDNA 4’s focus is primarily on boosting AMD’s matrix multiplication performance with lower precision data types. Those operations are important for machine learning workloads, which can often maintain acceptable accuracy with very low precision types. At the same time, CDNA 4 seeks to maintain AMD’s lead in more widely applicable vector operations.
To do so, CDNA 4 largely uses the same system level architecture as CDNA 3. It’s a massive chiplet setup, with parallels to AMD’s successful use of chiplets for CPU products. Accelerator Compute Dies, or XCDs, contain CDNA Compute Units and serve a role analogous to Core Complex Dies (CCDs) on AMD’s CPU products. Eight XCDs sit atop four base dies, which implement 256 MB of memory side cache. AMD’s Infinity Fabric provides coherent memory access across the system, which can span multiple chips.
Compared to the CDNA 3 based MI300X, the CDNA 4 equipped MI355X slightly cuts down CU count per XCD, and disables more CUs to maintain yields. The resulting GPU is somewhat less wide, but makes up much of the gap with higher clock speeds. Compared to Nvidia’s B200, both MI355X and MI300X are larger GPUs with far more basic building blocks. Nvidia’s B200 does adopt a multi-die strategy, breaking from a long tradition of monolithic designs. However, AMD’s chiplet setup is far more aggressive, and seeks to replicate the scaling success of AMD’s CPU designs on large compute GPUs.
CDNA 3 provided a huge vector throughput advantage over Nvidia’s H100, but faced a more complicated situation with machine learning workloads. Thanks to a mature software ecosystem and a heavy focus on matrix multiplication throughput (tensor cores), Nvidia could often get close (https://chipsandcheese.com/p/testing-amds-giant-mi300x) to the nominally far larger MI300X. AMD of course maintained massive wins if the H100 ran out of VRAM, but there was definitely room for improvement.
CDNA 4 rebalances its execution units to more closely target matrix multiplication with lower precision data types, which is precisely what machine learning workloads use. Per-CU matrix throughput doubles in many cases, with CDNA 4 CUs matching Nvidia’s B200 SMs in FP6. Elsewhere though, Nvidia continues to show a stronger emphasis on low precision matrix throughput. B200 SMs have twice as much per-clock throughput as a CDNA 4 CU across a range of 16-bit and 8-bit data types. AMD continues to rely on having a bigger, higher clocked GPU to maintain an overall throughput lead.
With vector operations and higher precision data types, AMD continues MI300X’s massive advantage. Each CDNA 4 CU continues to have 128 FP32 lanes, which deliver 256 FLOPS per cycle when counting FMA operations. MI355X’s lower CU count does lead to a slight reduction in vector performance compared to MI300X. But compared to Nvidia’s Blackwell, AMD’s higher core count and higher clock speeds let it maintain a huge vector throughput lead. Thus AMD’s CDNA line continues to look very good for high performance compute workloads.
Nvidia’s focus on machine learning and matrix operations keeps them very competitive in that category, despite having fewer SMs running at lower clocks. AMD’s giant MI355X holds a lead across many data types, but the gap between AMD and Nvidia’s largest GPUs isn’t nearly as big as with vector compute.
GPUs provide a software managed scratchpad local to a group of threads, typically ones running on the same core. AMD GPUs use a Local Data Share, or LDS, for that purpose. Nvidia calls their analogous structure Shared Memory. CDNA 3 had a 64 KB LDS, carrying forward a similar design from AMD GCN GPUs going back to 2012. That LDS had 32 banks of 2 KB each, with each bank 32 bits wide, providing up to 128 bytes per cycle in the absence of bank conflicts.
CDNA 4 increases the LDS capacity to 160 KB and doubles read bandwidth to 256 bytes per clock. GPUs natively operate on 32 bit elements, and it would be reasonable to assume AMD doubled bandwidth by doubling bank count. If so, each bank may now have 2.5 KB of capacity. Another possibility would be increasing bank count to 80 while keeping bank size at 2 KB, but that’s less likely because it would complicate bank selection. A 64-banked LDS could naturally serve a 64-wide wavefront access with each bank serving a lane. Furthermore, a power-of-two bank count would allow simple bank selection via a subset of address bits.
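The bank-selection point is easy to see in code. With a power-of-two bank count, picking a bank is just a slice of the address bits; with 80 banks it would require a modulo. A small illustration (my own, purely to show the difference):

```c
/* Bank index for a 32-bit-wide banked LDS. */
static unsigned bank_pow2(unsigned byte_addr)  {
    return (byte_addr >> 2) & 63;   /* 64 banks: just a bit-field of the address */
}
static unsigned bank_npow2(unsigned byte_addr) {
    return (byte_addr >> 2) % 80;   /* 80 banks: needs a real modulo/divider */
}
```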
The larger LDS lets software keep more data close to the execution units. Kernels can allocate more LDS capacity without worrying about lower occupancy due to LDS capacity constraints. For example, a kernel that allocates 16 KB of LDS could run four workgroups on a CDNA 3 CU. On CDNA 4, that would increase to ten workgroups.
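That occupancy arithmetic is just a capacity division, but it’s worth spelling out since LDS limits are a common tuning headache (my arithmetic, and LDS is only one of several occupancy limiters):

```c
/* Workgroups per CU as limited by LDS capacity alone.
 * 16 KB per workgroup: 64 / 16 = 4 on CDNA 3, 160 / 16 = 10 on CDNA 4. */
static int workgroups_by_lds(int lds_capacity_kb, int lds_per_workgroup_kb) {
    return lds_capacity_kb / lds_per_workgroup_kb;
}
```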
Software has to explicitly move data into the LDS to take advantage of it, which can introduce overhead compared to using a hardware-managed cache. CDNA 3 had GLOBAL_LOAD_LDS instructions that let kernels copy data into the LDS without going through the vector register file. CDNA 4 augments GLOBAL_LOAD_LDS to support moving up to 128 bits per lane, versus 32 bits per lane on CDNA 3. That is, the GLOBAL_LOAD_LDS instruction can accept sizes of 1, 2, 4, 12, or 16 bytes, i.e. up to four DWORDs (32-bit elements), versus just 1, 2, or 4 bytes on CDNA 3 [1].
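From the OpenCL side, the closest portable way to express a global-to-LDS copy is async_work_group_copy, which AMD’s compiler may lower to these direct-to-LDS loads (that lowering is my assumption, not something AMD documents for this instruction specifically). A minimal sketch:

```c
// Stage a tile of global memory into local memory without bouncing each
// element through per-lane registers in the kernel source.
// Assumes a 256-wide workgroup so tile[lid] stays in range.
__kernel void stage_tile(__global const float4 *src, __global float4 *dst) {
    __local float4 tile[256];
    event_t e = async_work_group_copy(tile, src + get_group_id(0) * 256, 256, 0);
    wait_group_events(1, &e);            // tile now resides in the LDS
    int lid = get_local_id(0);
    dst[get_global_id(0)] = tile[lid];   // consume the staged data
}
```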
CDNA 4 also introduces read-with-transpose LDS instructions. Matrix multiplication involves multiplying elements of a row in one matrix with corresponding elements in a second matrix’s column. Often that creates inefficient memory access patterns, for at least one matrix, depending on whether data is laid out in row-major or column-major order. Transposing a matrix turns the awkward row-to-column operation into a more natural row-to-row one. Handling transposition at the LDS is also natural for AMD’s architecture, because the LDS already has a crossbar that can map bank outputs to lanes (swizzle).
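The access pattern those new instructions target is the classic LDS-mediated transpose: write a tile row-major, read it back column-major, and let the LDS crossbar shuffle lanes. The sketch below shows that conventional pattern (assuming a 16x16 workgroup and matrix dimensions that are multiples of 16); CDNA 4’s read-with-transpose instructions aim to fold the transposed read into the LDS access itself.

```c
__kernel void transpose_tile(__global const float *in, __global float *out,
                             int width, int height) {
    __local float tile[16][17];                // pad a column to dodge bank conflicts
    int x = get_global_id(0), y = get_global_id(1);
    int lx = get_local_id(0), ly = get_local_id(1);
    tile[ly][lx] = in[y * width + x];          // row-major write into the LDS
    barrier(CLK_LOCAL_MEM_FENCE);
    int ox = get_group_id(1) * 16 + lx;        // swapped tile origin
    int oy = get_group_id(0) * 16 + ly;
    out[oy * height + ox] = tile[lx][ly];      // column-major (transposed) read
}
```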
Even with its LDS capacity increase, AMD continues to have less data storage within its GPU cores compared to Nvidia. Blackwell’s SMs have a 256 KB block of storage partitioned for use as both L1 cache and Shared Memory. Up to 228 KB can be allocated for use as Shared Memory. With a 164 KB Shared Memory allocation, which is close to matching AMD’s 160 KB LDS, Nvidia would still have 92 KB available for L1 caching. CDNA 4, like CDNA 3, has a 32 KB L1 vector cache per CU. Thus a Blackwell SM can have more software managed storage while still having a larger L1 cache than a CDNA 4 CU. Of course, AMD’s higher CU count means there’s 40 MB of LDS capacity across the GPU, while Nvidia only has ~33 MB of Shared Memory across B200 with the largest 228 KB Shared Memory allocation.
To feed the massive arrays of Compute Units, MI355X largely uses the same system level architecture as MI300X. MI355X does see a few enhancements though. The L2 caches can “writeback dirty data and retain a copy of the line”. “Dirty” refers to data that has been modified in a write-back cache, but hasn’t been propagated to lower levels in the memory subsystem. When a dirty line is evicted to make room for newer data, its contents are written back to the next level of cache, or DRAM if it’s the last level cache.
AMD may be seeking to opportunistically use write bandwidth when the memory subsystem is under low load, smoothing out spikes in bandwidth demand caused by cache fill requests accompanied by writebacks. Or, AMD could be doing something special to let the L2 transition a line to clean state if written data is likely to be read by other threads across the system, but isn’t expected to be modified again anytime soon.
MI355X’s DRAM subsystem has been upgraded to use HBM3E, providing a substantial bandwidth and capacity upgrade over its predecessor. It also maintains AMD’s lead over its Nvidia competition. Nvidia also uses HBM3E with the B200, which also appears to have eight HBM3E stacks. However, the B200 tops out at 180 GB of capacity and 7.7 TB/s of bandwidth, compared to 288 GB at 8 TB/s on the MI355X. The MI300X could have a substantial advantage over Nvidia’s older H100 when the H100 ran out of DRAM capacity, and AMD is likely looking to retain that advantage.
Higher bandwidth from HBM3E also helps bring up MI355X’s compute-to-bandwidth ratio. MI300X had ~0.03 bytes of DRAM bandwidth per FP32 FLOP, which increases to 0.05 on MI355X. Blackwell for comparison has ~0.10 bytes of DRAM bandwidth per FP32 FLOP. While Nvidia has increased last level cache capacity on Blackwell, AMD continues to lean more heavily on big caches, while Nvidia relies more on DRAM bandwidth.
CDNA 2 and CDNA 3 made sweeping changes compared to their predecessors. CDNA 4’s changes are more muted. Much like going from Zen 3 to Zen 4, MI355X retains a similar chiplet arrangement with compute and IO chiplets swapped out for improved versions. Rather than changing up their grand strategy, AMD spent their time tuning CDNA 3. Fewer, higher clocked CUs are easier to utilize, and increased memory bandwidth can help utilization too. Higher matrix multiplication throughput also helps AMD take on Nvidia for machine learning workloads.
In some ways, AMD’s approach with this generation has parallels to Nvidia’s. Blackwell SMs are basically identical to Hopper’s from a vector execution perspective, with improvements focused on the matrix side. Nvidia likely felt they had a winning formula, as their past few GPU generations have undoubtedly been successful. AMD may have found a winning formula with CDNA 3 as well. MI300A, MI300X’s iGPU cousin, powers the highest ranking supercomputer on TOP500’s June list [4]. Building on success can be a safe and rewarding strategy, and CDNA 4 may be doing just that.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
https://github.com/llvm/llvm-project/blob/main/clang/test/CodeGenOpenCL/builtins-amdgcn-gfx950.cl - b96 and b128 (96-bit and 128-bit) global_load_lds sizes
https://github.com/llvm/llvm-project/blob/84ff1bda2977e580265997ad2d4c47b18cd3bf9f/mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td#L426C1-L426C50 - LDS transpose intrinsics
https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html
https://www.reddit.com/r/hardware/comments/1kj38r1/battle_of_the_giants_8x_nvidia_blackwell_b200/ - reports 148 Compute Units via OpenCL for B200. Nvidia usually reports SMs for the Compute Unit count
2025-06-14 23:47:00
Today, AMD’s Infinity Fabric interconnect is ubiquitous across the company’s lineup. Infinity Fabric provides well-defined interfaces to a transport layer, and lets different functional blocks treat the interconnect as a black box. The system worked well enough to let AMD create integrated GPU products all the way from the Steam Deck’s tiny van Gogh APU, to giant systems packing four MI300A chips. Across all those offerings, Infinity Fabric enables coherent memory access as CPU and GPU requests converge to Coherent Slaves (CS-es), which have a probe filter and can request data from the appropriate source.
AMD was also trying to build powerful iGPUs over a decade ago, but their interconnect looked very different at the time. Their Northbridge architecture owes its name to the Athlon 64 era, when AMD brought the chipset northbridge’s functionality onto the CPU die. AMD’s engineers likely didn’t envision needing to tightly integrate a GPU during Athlon 64’s development, so the interconnect was designed to tie together a few CPU cores and service memory requests with low latency. But then AMD acquired ATI, and set about trying to take advantage of their new graphics talent by building powerful iGPUs.
“Trinity” is an AMD Accelerated Processing Unit (APU) from 2012. It combines two dual-threaded Piledriver CPU modules with a 6-SIMD Terascale 3 iGPU. Here I’ll be looking into AMD’s second generation iGPU interconnect with the A8-5600K, a slightly cut-down Trinity implementation with four Terascale 3 SIMDs enabled and CPU boost clocks dropped from 4.2 to 3.9 GHz. I’m testing the chip on an MSI FM2-A75MA-E35 board with 16 GB of Kingston DDR3-1866 10-10-9-26 memory.
Trinity’s on-die network resembles that of AMD’s first APU, Llano, and has clear similarities to AMD’s Athlon 64 Northbridge. The Northbridge sits on a separate voltage and frequency domain, and runs at 1.8 GHz on the A8-5600K. It uses a two-level crossbar setup. CPU cores connect to a System Request Interface (SRI), which routes requests onto a set of queues. Most memory requests head to a System Request Queue (SRQ). The SRI also accepts incoming probes on a separate queue, and routes them to the CPU cores. A second level crossbar, simply called the XBAR, connects with the SRI’s various queues and routes requests to IO and memory. The XBAR can handle IO-to-IO communication too, though such traffic is rare on consumer systems.
On Trinity, the XBAR’s scheduler (XCS) has 40 entries, making it slightly smaller than the 64 entry XCS on desktop and server Piledriver chips. AMD defaults to a 22+10+8 entry split between the SRI, Memory Controller (MCT), and upstream channel, though the BIOS can opt for a different XCS entry allocation. The XBAR sends memory requests to the MCT, which prioritizes requests based on type and age, and includes a strided-access prefetcher. Like Infinity Fabric’s CS, the MCT is responsible for ensuring cache coherency and can send probes back to the XBAR. The MCT also translates physical addresses to “normalized” addresses that only cover the memory space backed by DRAM.
All AMD CPUs from Trinity’s era use a similar SRI+XBAR setup, but Trinity’s iGPU sets it apart from CPU-only products. The iGPU gets its own Graphics Memory Controller (GMC), which arbitrates between different request types and schedules requests to maximize DRAM bandwidth utilization. Thus the GMC performs an analogous role to the MCT on the CPU side. A “Radeon Memory Bus” connects the GMC to the DRAM controllers, bypassing the MCT. AMD documentation occasionally refers to the Radeon Memory Bus as “Garlic”, likely a holdover from internal names used during Llano’s development.
A second control link hooks the GPU into the XBAR, much like any other IO device. Previously, the control link was called the “Fusion Control Link”, or “Onion”. I’ll use “Garlic” and “Onion” to refer to the two links because those names are shorter.
Trinity’s “Garlic” link lets the GPU saturate DRAM bandwidth. It bypasses the MCT and therefore bypasses the Northbridge’s cache coherency mechanisms. That’s a key design feature, because CPUs and GPUs usually don’t share data. Snooping the CPU caches for GPU memory requests would waste power and flood the interconnect with probes, most of which would miss.
Bypassing the MCT also skips the Northbridge’s regular memory prioritization mechanism, so the MCT and GMC have various mechanisms to keep the CPU and GPU from starving each other of memory bandwidth. First, the MCT and GMC can limit how many outstanding requests they have on the DRAM controller queues (DCQs). Then, the DCQs can alternate between accepting requests from the MCT and GMC, or can be set to prioritize the requester with fewer outstanding requests. Trinity defaults to the former. Finally, because the CPU tends to be more latency sensitive, Trinity can prioritize MCT-side reads ahead of GMC requests. Some of this arbitration seems to happen at queues in front of the DCQ. AMD’s BIOS and Kernel Developer’s Guide (BKDG) indicates there’s a 4-bit read pointer for a sideband signal FIFO between the GMC and DCT, so the “Garlic” link may have a queue with up to 16 entries.
From software testing, Trinity does a good job controlling CPU-side latency under increasing GPU bandwidth demand. Even with the GPU hitting the DRAM controllers with over 24 GB/s of reads and writes, CPU latency remains under 120 ns. Loaded latency can actually get worse under CPU-only bandwidth contention, even though Trinity’s CPU side isn’t capable of fully saturating the memory controller. When testing loaded latency, I have to reserve one thread for running the latency test, so bandwidth in the graph above tops out at just over 17 GB/s. Running bandwidth test threads across all four CPU logical threads would give ~20.1 GB/s.
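For the curious, the loaded latency methodology is nothing exotic: one thread runs a pointer chase while the others stream reads to generate bandwidth, and the load level is varied by changing the number of streaming threads. Here’s a stripped-down skeleton in C (timing, core pinning, and the random chase pattern are left out; the stride pattern below is a stand-in):

```c
#include <pthread.h>
#include <stdlib.h>

#define CHASE_LEN  (1 << 24)     /* 128 MB of size_t, well past any cache */
#define STREAM_LEN (1 << 26)

static size_t *chase;            /* pointer-chase array */
static float  *stream_buf;       /* streaming array for bandwidth threads */
static volatile int stop_flag;

static void *latency_thread(void *arg) {
    size_t cur = 0;
    for (long i = 0; i < 50000000L; i++)
        cur = chase[cur];                      /* dependent loads -> latency */
    return (void *)cur;
}

static void *bandwidth_thread(void *arg) {
    volatile float acc = 0.0f;
    while (!stop_flag)
        for (size_t i = 0; i < STREAM_LEN; i += 16)
            acc += stream_buf[i];              /* independent streaming reads */
    return NULL;
}

int main(void) {
    chase = malloc(CHASE_LEN * sizeof(size_t));
    stream_buf = calloc(STREAM_LEN, sizeof(float));
    for (size_t i = 0; i < CHASE_LEN; i++)     /* a real test uses a random   */
        chase[i] = (i + 4097) % CHASE_LEN;     /* permutation, not a stride   */

    pthread_t lat, bw[3];                      /* 3 bandwidth + 1 latency thread */
    for (int i = 0; i < 3; i++)
        pthread_create(&bw[i], NULL, bandwidth_thread, NULL);
    pthread_create(&lat, NULL, latency_thread, NULL);
    pthread_join(lat, NULL);                   /* time this join externally */
    stop_flag = 1;
    for (int i = 0; i < 3; i++)
        pthread_join(bw[i], NULL);
    free(chase); free(stream_buf);
    return 0;
}
```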
Mixing CPU and GPU bandwidth load shows that latency goes up when there’s more CPU-side bandwidth. Again, the GPU can pull a lot of bandwidth without impacting CPU memory latency too heavily. Likely, there’s more contention at the SRI and XBAR than at the DRAM controllers.
Integrated GPUs raise the prospect of doing zero-copy data sharing between the CPU and GPU, without being constrained by a PCIe bus. Data sharing is conceptually easy, but cache coherency is not, and caching is critical to avoiding DRAM performance bottlenecks. Because requests going through the “Garlic” link can’t snoop caches, “Garlic” can’t guarantee it retrieves the latest data written by the CPU cores.
That’s where the “Onion” link comes in. An OpenCL programmer can tell the runtime to allocate cacheable host memory, then hand it over for the GPU to use without an explicit copy. “Onion” requests pass through the XBAR and get switched to the MCT, which can issue probes to the CPU caches. That lets the GPU access the CPU’s cacheable memory space and retrieve up-to-date data. Unfortunately, “Onion” can’t saturate DRAM bandwidth and tops out at just under 10 GB/s. Trinity’s interconnect also doesn’t come with probe filters, so probe response counts exceed 45 million per second. Curiously, probe response counts are lower than if a probe had been issued for every 64B cacheline.
Northbridge performance counters count “Onion” requests as IO to memory traffic at the XBAR. “Garlic” traffic is not counted at the XBAR, and is only observed at the DRAM controllers. Curiously, the iGPU also creates IO to memory traffic when accessing a very large buffer nominally allocated out of the GPU’s memory space (no ALLOC_HOST_PTR flag), implying it’s using the “Onion” link when it runs out of memory space backed by “Garlic”. If so, iGPU memory accesses over “Onion” incur roughly 320 ns of extra latency over a regular “Garlic” access.
Therefore Trinity’s iGPU takes a significant bandwidth penalty when accessing CPU-side memory, even though both memory spaces are backed by the same physical medium. AMD’s later Infinity Fabric handles this in a more seamless manner, as CS-es observe all CPU and GPU memory accesses. Infinity Fabric CS-es have a probe filter, letting them maintain coherency without creating massive probe traffic.
Telling the GPU to access cacheable CPU memory is one way to enable zero-copy behavior. The other way is to map GPU-side memory into a CPU program’s address space. Doing so lets Trinity’s iGPU use its high bandwidth “Garlic” link, and AMD allows this using a CL_MEM_USE_PERSISTENT_MEM_AMD flag. But because the Northbridge has no clue what the GPU is doing through its “Garlic” link, it can’t guarantee cache coherency. AMD solves this by making the address space uncacheable (technically write-combining) from the CPU side, with heavy performance consequences for the CPU.
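In OpenCL terms, the two zero-copy paths come down to buffer allocation flags: CL_MEM_ALLOC_HOST_PTR for cacheable host memory the GPU reaches over “Onion”, and AMD’s CL_MEM_USE_PERSISTENT_MEM_AMD extension flag for GPU-local memory mapped into the CPU’s address space over “Garlic”. A minimal host-side sketch (error handling omitted; the flag’s value comes from AMD’s extension headers):

```c
#include <CL/cl.h>
#ifndef CL_MEM_USE_PERSISTENT_MEM_AMD
#define CL_MEM_USE_PERSISTENT_MEM_AMD (1 << 6)   /* from AMD's cl_ext headers */
#endif

/* Cacheable host memory: GPU accesses go through the coherent "Onion" link. */
cl_mem make_onion_buffer(cl_context ctx, size_t bytes) {
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes, NULL, NULL);
}

/* GPU-local memory mapped into the CPU's address space: fast for the iGPU
 * over "Garlic", but uncacheable (write-combining) from the CPU side. */
cl_mem make_garlic_buffer(cl_context ctx, size_t bytes) {
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
                          bytes, NULL, NULL);
}
```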
Besides not being able to enjoy higher bandwidth and lower latency from cache hits, the CPU can’t combine reads from the same cacheline into a single fill request. From the Northbridge perspective, the System Request Interface sends “sized reads” to the XBAR instead of regular “cache block” commands. From Llano’s documentation, the CPU to iGPU memory request path only supports a single pending read, and that seems to apply to Trinity too.
Bandwidth sinks like a stone with the CPU unable to exploit memory level parallelism. Worse, the CPU to iGPU memory path appears less optimized in latency terms. Pointer chasing latency is approximately 93.11 ns across a small 8 KB array. For comparison, requests to cacheable memory space with 2 MB pages achieve lower than 70 ns of latency with a 1 GB array. I’m not sure what page size AMD’s drivers use when mapping iGPU memory into CPU address space, but using an 8 KB array should avoid TLB misses and make the two figures comparable. On AMD’s modern Ryzen 8840HS, which uses Infinity Fabric, there’s no significant latency difference when accessing GPU memory. Furthermore, GPU memory remains cacheable when mapped to the CPU side. However, memory latency in absolute terms is worse on the Ryzen 8840HS at over 100 ns.
While AMD had GPU compute in its sights when designing Trinity, its biggest selling point was as a budget gaming solution. Many modern games struggle or won’t launch at all on Trinity, so I’ve selected a few older workloads to get an idea of how much traffic Trinity’s interconnect has to work with.
Unigine Valley is a 2013-era DirectX 11 benchmark. Performance counter data collected over a benchmark pass shows plenty of DRAM traffic not observed at the XBAR, which can be attributed to the iGPU’s “Garlic” link. CPU-side memory bandwidth is light, and generally stays below 3 GB/s. Total DRAM traffic peaked at 17.7 GB/s, measured over a 1 second interval.
Final Fantasy 14’s Heavensward benchmark came out a few years after Trinity. With the “Standard (Laptop)” preset running at 1280x720, the A8-5600K averaged 25.6 FPS. Apparently that counts as “Fairly High”, which I only somewhat agree with.
The Heavensward benchmark is more bandwidth hungry, with DRAM bandwidth reaching 22.7 GB/s over a 1 second sampling interval. The CPU side uses more bandwidth too, often staying in the 3-4 GB/s range and sometimes spiking to 5 GB/s. Still, the bulk of DRAM traffic can be attributed to the iGPU, and uses the “Garlic” link.
The Elder Scrolls Online (ESO) is an MMO that launched a few years after Trinity, and can still run on the A8-5600K (if barely) thanks to FSR. I’m running ESO at 1920x1080 with the low quality preset, and FSR set to Quality. Framerate isn’t impressive and often stays below 20 FPS.
CPU bandwidth can often reach into the high 4 GB/s range, though that often occurs when loading in or accessing hub areas with many players around. GPU-side bandwidth demands are lower compared to the two prior benchmarks, and total bandwidth stays under 16 GB/s.
RawTherapee converts camera raw files into standard image formats (e.g., JPEG). Here, I’m processing 45 megapixel raw files from a Nikon D850. RawTherapee only uses the CPU for image processing, so most DRAM traffic is observed through the XBAR, coming from the CPU. The iGPU still uses a bit of bandwidth for display refreshes and perhaps the occasional GPU-accelerated UI feature. In this case, I have the iGPU driving a pair of 1920x1080 60 Hz displays.
Darktable is another raw conversion program. Unlike RawTherapee, Darktable can offload some parts of its image processing pipeline to the GPU. It’s likely the kind of application AMD hoped would become popular and showcase its iGPU prowess. With Darktable, Trinity’s Northbridge handles a mix of traffic from the CPU and iGPU, depending on what the image processing pipeline demands. IO to memory traffic also becomes more than a rounding error, which is notable because that’s how “Onion” traffic is counted. Other IO traffic like SSD reads can also be counted at the XBAR, but that should be insignificant in this workload.
In Trinity’s era, powerful iGPUs were an obvious way for AMD to make use of their ATI acquisition and carve out a niche, despite falling behind Intel on the CPU front. However, AMD’s interconnect wasn’t ideal for integrating a GPU. The “Onion” and “Garlic” links were a compromise solution that worked around the Northbridge’s limitations. With these links, Trinity doesn’t get the full advantages that an iGPU should intuitively enjoy. Zero-copy behavior is possible, and should work better than it would with a contemporary discrete GPU setup. But both the CPU and GPU face lower memory performance when accessing the other’s memory space.
Awkwardly for AMD, Intel had a more integrated iGPU setup with Sandy Bridge and Ivy Bridge. Intel’s iGPU sat on a ring bus along with CPU cores and L3 cache. That naturally made it a client of the L3 cache, and unified the CPU and GPU memory access paths at that point. Several years would pass before AMD created their own flexible interconnect where the CPU and GPU could both use a unified, coherent memory path without compromise. Fortunately for AMD, Intel didn’t have a big iGPU to pair with Ivy Bridge.
Trinity’s interconnect may be awkward, but its weaknesses don’t show for graphics workloads. The iGPU can happily gobble up bandwidth over the “Garlic” link, and can do so without destroying CPU-side memory latency. Trinity does tend to generate a lot of DRAM traffic, thanks to a relatively small 512 KB read-only L2 texture cache on the GPU side, and no L3 on the CPU side. Bandwidth hungry workloads like FF14’s Heavensward benchmark can lean heavily on the chip’s DDR3-1866 setup. But even in that workload, there’s a bit of bandwidth to spare. Trinity’s crossbars and separate coherent/non-coherent iGPU links may lack elegance and sophistication compared to modern designs, but they were sufficient to let AMD get started with powerful iGPUs in the early 2010s.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-05-29 23:49:20
Cloud computing rose rapidly around 2010, powered by AMD’s Opteron and Intel’s Xeon processors. The large cloud market piqued the interest of other CPU makers, including Qualcomm. Qualcomm had grown into a formidable force in the mobile SoC market by the mid-2010s, with several in-house CPU designs under their belt. They had good reason to be confident about their cloud server prospects. High core counts in server chips translate to a low per-core power budget, which blunts AMD and Intel’s greatest strength: high single-threaded performance.
Meanwhile, Qualcomm’s mobile background gave them plenty of experience in low power CPU design. Their high mobile volume gave them access to Samsung’s 10nm FinFET process. That could at least level the playing field against Intel’s 14nm node, if not give Qualcomm an outright advantage in power consumption and density. Qualcomm hoped to use those factors to deliver a cheaper, lower power competitor on the cloud computing scene.
To break into the cloud computing segment, Qualcomm needed a CPU architecture with enough performance to meet key metrics in areas like tail-end latency. During a Hot Chips presentation, Qualcomm noted that throwing a pile of weak cores onto a chip and calling it a day wouldn’t cut it. An Arm cloud CPU may not have to match Intel and AMD core for core, but it does need to hit a baseline level of performance. Qualcomm hoped to do that while maintaining their traditional power and density advantages.
The Falkor CPU architecture aims to meet that performance threshold with low power and silicon area requirements. Falkor is a 4-wide aarch64 core with features inherited from Qualcomm’s prior mobile cores. It runs the 64-bit Arm instruction set (aarch64, armv8) with a few features pulled in from armv8.1. 32-bit Arm support is absent, as there’s no large existing install base of Arm server applications. Falkor is Qualcomm’s 5th in-house core design, and the company’s first designed specifically for cloud computing.
A Centriq 2400 series chip packs up to 48 Falkor cores on a 398mm2 die with a 120W TDP. That translates to less than 2.5W per core. Qualcomm notes that power consumption is usually well under 120W in typical all-core loads.
For this article, the folks at Corellium provided Neggles (Andi) with a Centriq 2452 system at no cost for us to test. So a massive shout-out to both Corellium for providing the system and to Neggles for getting it up and running.
The Centriq 2452 system is set up with 96 GB of DDR4 running at 2666 MT/s, and identifies itself as the “Qualcomm Centriq 2400 Reference Evaluation Platform CV90-LA115-P23”.
Falkor has both a L0 and L1 instruction cache like Qualcomm’s prior Krait architecture, and possibly Kryo too. The 24 KB, 3-way set associative L0 delivers instructions at lower power and with lower latency. The L0 is sized to contain the large majority of instruction fetches, while the 64 KB 8-way L1 instruction cache handles larger code footprints. Although the L0 fills a similar role to micro-op caches and loop buffers in other CPUs, it holds ISA instruction bytes just like a conventional instruction cache.
Both instruction cache levels have enough bandwidth to feed Falkor’s 4-wide decoder. The two instruction cache levels are exclusive of each other, so the core effectively has 88 KB of instruction cache capacity. Qualcomm might use a victim cache setup to maintain that exclusive relationship. If so, incoming probes would have to check both the L0 and L1, and L1 accesses would incur the additional cost of a copy-back from L0 on top of a fill into L0. An inclusive setup would let the L1 act as a snoop filter for the L0 and reduce the cost of L1 accesses, but would offer less total caching capacity.
The exclusive L0/L1 setup gives Falkor high instruction caching capacity compared to contemporary cores. Falkor wouldn’t be beaten in this respect until Apple’s M1 launched several years later. High instruction cache capacity makes L2 code fetch bandwidth less important. Like many 64-bit Arm cores of the time, or indeed AMD’s pre-Zen cores, Falkor’s instruction throughput drops sharply once code spills into L2. Still, Falkor performs better than A72 in that respect.
Falkor's instruction caches are parity protected, as is common for many CPUs. Hardware resolves parity errors by invalidating corrupted lines and reloading them from L2. The instruction caches also hold branch targets alongside instruction bytes, and therefore serve as branch target buffers (BTBs). A single cache access provides both instructions and branch targets, so Falkor doesn't have to make a separate BTB access like cores with a decoupled BTB do. However, that prevents the branch predictor from following the instruction stream past a L1i miss.
Taken branches incur one pipeline bubble (2 cycle latency) within the L0, and up to 6 cycle latency from the L1. For smaller branch footprints, Falkor can do zero-bubble taken branches using a 16 entry branch target instruction cache (BTIC). Unlike a BTB, the BTIC caches instructions at the branch destination rather than the target address. It therefore bypasses cache latency, and allows zero-bubble taken branches without needing the L0 to achieve single-cycle latency.
Direction prediction uses multiple history tables, each using a different history length. The branch predictor tracks which history length and corresponding table works best for a given branch. The scheme described by Qualcomm is conceptually similar to a TAGE predictor, which also uses multiple history tables and tags tables to indicate whether they’re useful for a given branch. Falkor doesn’t necessarily use a classic TAGE predictor. For example, the history lengths may not be in a geometric series. But the idea of economizing history storage by using the most appropriate history length for each branch still stands. Arm’s Cortex A72 uses a 2-level predictor, presumably with a single table and a fixed history length.
In an abstract test with varying numbers of branches, each taken or not-taken in random patterns of increasing length, Falkor does slightly better than Kryo. Falkor copes better with a lot of branches in play, though the longest repeating pattern either core can handle is similar when only a few branches are involved.
Falkor has a two-level indirect target array for indirect branches, which read their target from a register rather than encoding a jump distance. An indirect branch may go to different targets, adding another dimension of difficulty to branch prediction. Falkor’s first level indirect target array has 16 entries, while the second level has 512 entries.
Having multiple targets for an indirect branch carries little penalty as long as total target count doesn’t exceed 16. That can be one branch switching between 16 targets, or eight branches alternating between two targets each.
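One way such a test can be structured is sketched below, using function pointers to create a single indirect branch with a configurable number of targets. This is an assumed form of the methodology rather than the exact harness used; real tests typically generate the branch and target code to control instruction count and alignment.

```c
#include <stdint.h>

static uint64_t t0(uint64_t x) { return x + 0; }
static uint64_t t1(uint64_t x) { return x + 1; }
static uint64_t t2(uint64_t x) { return x + 2; }
static uint64_t t3(uint64_t x) { return x + 3; }
// ...extend with as many distinct targets as the sweep needs

typedef uint64_t (*target_fn)(uint64_t);
static target_fn targets[] = { t0, t1, t2, t3 };

// Cycles one indirect call through 'num_targets' destinations in a fixed,
// repeating order. Time per call rises once the target count overwhelms the
// indirect predictor.
uint64_t indirect_test(long iters, long num_targets) {
    uint64_t acc = 0;
    for (long i = 0; i < iters; i++) {
        acc = targets[i % num_targets](acc);   // the indirect branch under test
    }
    return acc;
}
```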
Returns are a special case of indirect branches, because they typically go back to the call site. Falkor has a 16 entry return stack like Kryo. Cortex A72 has a much larger return stack with 31 entries. A function call and return costs approximately four cycles on Falkor, Kryo, and A72, which would be an average of 2 cycles per branch-with-link instruction.
Falkor’s decoders translate up to four instructions per cycle into micro-ops. Like most other CPUs, Qualcomm aims to decode most common instructions into a single micro-op. 128-bit vector math instructions appear to be a notable exception.
Micro-ops from the decoders need resources allocated in the backend for bookkeeping during out-of-order execution. Falkor’s renamer can handle register renaming and resource allocation for up to four micro-ops per cycle. However, the fourth slot can only process direct branches and a few specialized cases like NOPs or recognized register zeroing idioms. A conditional branch that also includes an ALU operation, like cbz/cbnz, cannot go into the fourth slot.
Besides special handling for zeroing a register by moving an immediate value of zero into it, I didn’t see other common optimizations carried out. There’s no MOV elimination, and the renamer doesn’t recognize that XOR-ing or subtracting a register from itself results in zero.
Falkor doesn’t have a classic reorder buffer, or ROB. Rather, it uses a series of structures that together enable out-of-order execution while ensuring program results are consistent with in-order execution. Falkor has a 256 entry rename/completion buffer. Qualcomm further states Falkor can have 128 uncommitted instructions in flight, along with another 70+ uncommitted instructions for a total of 190 in-flight instructions. The core can retire 4 instructions per cycle.
From a microbenchmarking perspective, Falkor acts like Arm’s Cortex A73. It can free resources like registers and load/store queue entries past a long latency load, with no visible limit to reordering capacity even past 256 instructions. An unresolved branch similarly blocks out-of-order resource deallocation, after which Falkor’s reordering window can be measured. At that point, I may be measuring what Qualcomm considers uncommitted instructions.
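The usual way to measure this is with two cache-missing loads separated by a variable number of filler instructions: per-iteration time jumps once the filler no longer fits in the window and the two misses stop overlapping. The sketch below assumes aarch64 inline assembly (GCC/Clang) for the filler and leaves timing and pointer-chain setup to the caller; the branch on the loaded value stays unresolved until the miss returns, matching the case described above.

```c
#include <stdint.h>

// One filler integer ALU op that consumes a rename/scheduler slot.
#define FILL1 asm volatile("add %0, %0, #1" : "+r"(acc));
#define FILL8 FILL1 FILL1 FILL1 FILL1 FILL1 FILL1 FILL1 FILL1

// chase_a and chase_b each point into separate randomized pointer chains
// sized to miss cache. Vary the filler count between runs and watch for the
// point where time per iteration roughly doubles.
uint64_t window_probe(uint64_t **chase_a, uint64_t **chase_b, long iters) {
    uint64_t acc = 0;
    for (long i = 0; i < iters; i++) {
        chase_a = (uint64_t **)*chase_a;       // long-latency load (cache miss)
        if ((uintptr_t)chase_a == 1) break;    // branch dependent on that load stays unresolved
        FILL8 FILL8 FILL8 FILL8                // 32 filler ops in this build; sweep this count
        chase_b = (uint64_t **)*chase_b;       // second miss; overlaps only if it fits in the window
    }
    return acc + (uintptr_t)chase_a + (uintptr_t)chase_b;
}
```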
Kryo and Falkor have similar reordering capacity from that uncommitted instruction perspective. But otherwise Qualcomm has re-balanced the execution engine to favor consistent performance for non-vector code. Falkor has a few more register file entries than Kryo, and more crucially, far larger memory ordering queues.
Integer execution pipelines on Falkor are specialized to handle different categories of operations from each other. Three pipes have integer ALUs, and a fourth pipe is dedicated to direct branches. Indirect branches use one of the ALU ports. Another ALU port has an integer multiplier, which can complete one 64-bit multiply per cycle with 5 cycle latency. Each ALU pipe has a modestly sized scheduler with about 11 entries.
Falkor has two largely symmetric FP/vector pipelines, each also with an 11 entry scheduler. Both pipes can handle basic operations like FP adds, multiplies, and fused multiply-adds. Vector integer adds and multiplies can also execute down both pipes. More specialized operations like AES acceleration instructions are only supported by one pipe.
FP and vector execution latency is similar to Kryo, as is throughput for scalar FP operations. Both of Falkor’s FP/vector pipes have a throughput of 64 bits per cycle. 128-bit math instructions are broken into two micro-ops, as they take two entries in the schedulers, register file, and completion buffer. Both factors cut into potential gains from vectorized code.
Falkor’s load/store subsystem is designed to handle one load and one store per cycle. The memory pipeline starts with a pair of AGUs, one for loads and one for stores. Both AGUs are fed from a unified scheduler with approximately 13 entries. Load-to-use latency is 3 cycles for a L1D hit, and the load AGU can handle indexed addressing with no penalty.
Virtual addresses (VAs) from the load AGU proceed to access the 32 KB, 8-way L1 data cache, which can provide 16 bytes per cycle. From testing, Falkor can handle either a single 128-bit load or store per cycle, or a 64-bit load and a 64-bit store in the same cycle. Mixing 128-bit loads and stores does not bring throughput over 128 bits per cycle.
Every location in the cache has a virtual tag and a physical tag associated with it… If you don’t have to do a TLB lookup prior to your cache, you can get the data out faster, and you can return the data with better latency.
Qualcomm’s Hot Chips 29 presentation
The L1D is both virtually and physically tagged, which lets Falkor retrieve data from the L1D without waiting for address translation. A conventional VIPT (virtually indexed, physically tagged) cache could select a set of lines using the virtual address, but needs the physical address (PA) to be available before checking tags for hits. Qualcomm says some loads can skip address translation completely, in which case there’s no need for loads to check the physical tags at all. It’s quite an interesting setup, and I wonder how it handles multiple VAs aliasing to the same PA.
…a novel structure that is built off to the side of the L1 data cache, that acts almost like a write-back cache. It’s a combination of a store buffer, a load fill buffer, and a snoop filter buffer from the L2, and so this structure that sits off to the side gives us all the performance benefit and power savings of having a write-back cache without the need for the L1 data cache to be truly write-back
Qualcomm’s Hot Chips 29 presentation
Falkor’s store pipeline doesn’t check tags at all. The core has a write-through L1D, and uses an unnamed structure to provide the power and performance benefits of a write-back L1D. It functionally sounds similar to Bulldozer’s Write Coalescing Cache (WCC), so in absence of a better name from Qualcomm, I’ll call it that. Multiple writes to the same cacheline are combined at the WCC, reducing L2 accesses.
Stores on Falkor access the L1D physical tags to ensure coherency, and do so after they’ve reached the WCC. Thus the store combining mechanism also serves to reduce physical tag checks, saving power.
Qualcomm is certainly justified in saying they can deliver the performance of a write-back cache. A Falkor core can’t write more than 16B/cycle, and the L2 appears to have far more bandwidth than that. One way to see the WCC is to make one store per 128B cacheline, which reveals it’s a 3 KB per-core structure that can write a 128B cacheline back to L2 once every 2-3 cycles. But software shouldn’t run into this in practice.
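A hedged sketch of that probe: touch one word per 128-byte line over a region of varying size, and watch store throughput drop once the set of dirty lines no longer fits in the roughly 3 KB structure.

```c
#include <stdint.h>
#include <stddef.h>

// Writes one 64-bit value per 128B cacheline across 'region_bytes'. While the
// touched lines fit in the write-coalescing structure, stores are absorbed
// there; once they spill, each store forces a 128B writeback to L2.
void wcc_probe(volatile uint64_t *buf, size_t region_bytes, long iters) {
    const size_t words_per_line = 128 / sizeof(uint64_t);
    size_t lines = region_bytes / 128;
    for (long i = 0; i < iters; i++) {
        for (size_t l = 0; l < lines; l++) {
            buf[l * words_per_line] = (uint64_t)i;   // one store per line
        }
    }
}
```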
Other architectures that use a write-through L1D, notably Intel’s Pentium 4 and AMD’s Bulldozer, suffered from poor store forwarding performance. Falkor doesn’t do well in this area, but isn’t terrible either. Loads that are 32-bit aligned within a store they depend on can get their data with 8 cycle latency (so possibly 4 cycles for the store, and 4 cycles for the load). Slower cases, including partial overlaps, are handled with just one extra cycle. I suspect most cores handle partial overlaps by waiting for the store to commit, then having the load read data from cache. Qualcomm may have given Falkor a more advanced forwarding mechanism to avoid the penalty of reading from the WCC.
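Store forwarding latency can be approximated in plain C by chaining a value through a store and a dependent, differently sized load each iteration, sweeping the two offsets from the caller. This is a sketch of the idea rather than the exact test used, and it relies on the compiler turning the fixed-size memcpy calls into single store and load instructions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

// Per-iteration time approximates store-to-load forwarding latency for the
// given (store_off, load_off) pair, since the value is carried through both.
uint64_t forward_chain(uint8_t *buf, size_t store_off, size_t load_off, long iters) {
    uint64_t v = 1;
    for (long i = 0; i < iters; i++) {
        memcpy(buf + store_off, &v, sizeof(uint64_t));     // 64-bit store
        uint32_t part;
        memcpy(&part, buf + load_off, sizeof(uint32_t));   // dependent 32-bit load
        v = part + 1;                                      // feed the result into the next store
    }
    return v;
}
```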
Using a write-through L1D lets Qualcomm parity protect the L1D rather than needing ECC. As with the instruction caches, hardware resolves parity errors by reloading lines from lower level caches, which are ECC protected.
Unlike mobile cores, server cores may encounter large data footprints with workloads running inside virtual machines. Virtualization can dramatically increase address translation overhead, as program-visible VAs are translated to VM-visible PAs, which in turn are translated via hypervisor page tables to a host PA. A TLB miss could require walking two sets of paging structures, turning a single memory access into over a dozen accesses under the hood.
Kryo appears to have a single level 192 entry TLB, which is plainly unsuited to such server demands. Falkor ditches that all-or-nothing approach in favor of a more conventional two-level TLB setup. A 64 entry L1 DTLB is backed by a 512 entry L2 TLB. Getting a translation from the L2 TLB adds just two cycles of latency, making it reasonably fast. Both the L1 DTLB and L2 TLB store “final” translations, which map a program’s virtual address all the way to a physical address on the host.
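Translation costs can be isolated with a pointer chase that touches exactly one cacheline per page, as in the assumed sketch below (randomizing page order and timing are left to the caller). Latency steps up as the number of pages touched exceeds the 64 entry L1 DTLB, and again past the 512 entry L2 TLB.

```c
#include <stdint.h>
#include <stddef.h>

// 'region' spans num_pages * 4 KB; the caller pre-writes a randomized cycle of
// page-aligned offsets into the first 8 bytes of each page. Each iteration
// touches one line per page, so TLB reach dominates over data cache misses.
uint64_t tlb_chase(const uint8_t *region, size_t iters) {
    uint64_t offset = 0;
    for (size_t i = 0; i < iters; i++) {
        offset = *(const uint64_t *)(region + offset);   // one load per page visited
    }
    return offset;
}
```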
Falkor also has a 64 entry “non-final” TLB, which caches a pointer to the last level paging structure and can skip much of the page walk. Another “stage-2” TLB with 64 entries caches translations from VM PAs to host PAs.
Server chips must support high core counts and high IO bandwidth, which is another sharp difference between server and mobile SoCs. Qualcomm implements Falkor cores in dual core clusters called duplexes, and uses that as a basic building block for their Centriq server SoC. Kryo also used dual core clusters with a shared L2, so that concept isn’t entirely alien to Qualcomm.
Falkor’s L2 is 512 KB, 8-way set associative, and inclusive of L1 contents. It serves both as a mid-level cache between the L1 caches and the on-chip network, and as a snoop filter for the L1 caches. The L2 is ECC protected, because it can contain modified data that hasn’t been written back anywhere else.
Qualcomm says the L2 has 15 cycles of latency, though a pointer chasing pattern sees 16-17 cycles of latency. Either way, it’s a welcome improvement over Kryo’s 20+ cycle L2 latency. Kryo and Arm’s Cortex A72 used the L2 as a last-level cache, which gave them the difficult task of keeping latency low enough to handle L1 misses with decent performance, while also having enough capacity to insulate the cores from DRAM latency. A72 uses a 4 MB L2 cache with 21 cycle latency, while Kryo drops the ball with both high latency and low L2 capacity.
Multiple interleaves (i.e. banks) help increase L2 bandwidth. Qualcomm did not specify the number of interleaves, but did say each interleave can deliver 32 bytes per cycle. The L2 appears capable of handling a 128B writeback every cycle, so it likely has at least four interleaves. Two Falkor cores in a complex together have just 32B/cycle of load/store bandwidth, so the L2 has more than enough bandwidth to feed both cores. In contrast, the L2 caches on Kryo and A72 have noticeably less bandwidth than their L1 caches.
A Falkor duplex interfaces with the system using the Qualcomm System Bus (QSB) protocol. QSB is a proprietary protocol that fulfills the same function as the ACE protocol used by Arm. It can also be compared to Intel’s IDI or AMD’s Infinity Fabric protocols. The duplex’s system bus interface provides 32 bytes per cycle of bandwidth per direction, per 128B interleave.
Qualcomm uses a bidirectional, segmented ring bus to link cores, L3 cache, and IO controllers. Data transfer uses two sets of bidirectional rings, with traffic interleaved between them at 128B cacheline granularity. In total, Centriq has four rings covering even and odd interleaves in both clockwise and counterclockwise directions. Qualcomm’s slides suggest each ring can move 32B/cycle, so the ring bus effectively has 64B/cycle of bandwidth in each direction.
A dual core cluster can access just under 64 GB/s of L3 bandwidth from a simple bandwidth test, giving Qualcomm a significant cache bandwidth advantage over Cortex A72. L3 bandwidth from a dual core Falkor complex is similar to that of a Skylake core on the Core i5-6600K.
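The bandwidth figure comes from a simple read test. A minimal kernel in that spirit is sketched below (threading, core pinning, and timing assumed); sizing the array well above the 512 KB L2 but below the 57.5 MB L3 keeps the traffic in L3.

```c
#include <stdint.h>
#include <stddef.h>

// Each thread runs this over its own array slice. Bandwidth is
// (elems * sizeof(uint64_t) * iters) / elapsed time.
uint64_t read_bandwidth(const uint64_t *data, size_t elems, long iters) {
    uint64_t acc = 0;
    for (long i = 0; i < iters; i++) {
        for (size_t j = 0; j < elems; j++) {
            acc += data[j];   // sequential reads; compilers can unroll and vectorize this
        }
    }
    return acc;               // returned so the work isn't optimized away
}
```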
Ring bus clients include up to 24 dual core clusters, 12 L3 cache slices, six DDR4 controller channels, six PCIe controllers handling 32 Gen 3 lanes, and a variety of low speed IO controllers.
Centriq’s L3 slices have 5 MB of capacity and are 20-way set associative, giving the chip 60 MB of total L3 capacity across the 12 slices. The 46 core Centriq 2452 has 57.5 MB enabled. Cache ways can be reserved to divide L3 capacity across different applications and request types, which helps ensure quality of service.
Addresses are hashed across L3 slices to enable bandwidth scalability, as in many other designs with many cores sharing a large L3. Centriq doesn’t match L3 slice count to core count, unlike Intel and AMD designs. However, each Centriq L3 slice has two ring bus ports, so the L3 slices and the Falkor duplexes have the same aggregate bandwidth to the on-chip network.
L3 latency is high at over 40 ns, or north of 100 cycles. That’s heavy for cores with 512 KB of L2. Bandwidth can scale to over 500 GB/s, which is likely adequate for anything except very bandwidth heavy vector workloads. Falkor isn’t a great choice for vector workloads anyway, so Centriq has plenty of L3 bandwidth. Latency increases to about 50 ns under moderate bandwidth load, and reaches 70-80 ns when approaching L3 bandwidth limits. Contention from loading all duplexes can bring latency to just over 90 ns.
Centriq’s L3 also acts as a point of coherency across the chip. The L3 is not inclusive of the upper level caches, and maintains L2 snoop filters to ensure coherency. In that respect it works like the L3 on AMD’s Zen or Intel’s Skylake server. Each L3 slice can track up to 32 outstanding snoops. Cache coherency operations between cores in the same duplex don’t need to transit the ring bus.
A core to core latency test shows lower latency between core pairs in a duplex, though latency is still high in an absolute sense. It also indicates Qualcomm has disabled two cores on the Centriq 2452 by turning off one core in a pair of duplexes. Doing so is a slightly higher performance option because two cores don’t have to share L2 capacity and a system bus interface.
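Core to core latency tests typically bounce a value between two pinned threads through a shared variable and halve the round-trip time. The sketch below shows that pattern with C11 atomics; thread creation, core pinning (e.g. pthread_setaffinity_np), and timing are omitted, and this is an assumed form of the test rather than the exact code used.

```c
#include <stdatomic.h>

static _Atomic unsigned long mailbox = 0;

// Thread A, pinned to one core: posts odd values and waits for even replies.
void ping(long iters) {
    for (long i = 0; i < iters; i++) {
        atomic_store_explicit(&mailbox, 2 * i + 1, memory_order_release);
        while (atomic_load_explicit(&mailbox, memory_order_acquire) != (unsigned long)(2 * i + 2))
            ;   // spin until the other core responds
    }
}

// Thread B, pinned to another core: waits for each odd value, replies with the next even one.
void pong(long iters) {
    for (long i = 0; i < iters; i++) {
        while (atomic_load_explicit(&mailbox, memory_order_acquire) != (unsigned long)(2 * i + 1))
            ;
        atomic_store_explicit(&mailbox, 2 * i + 2, memory_order_release);
    }
}
```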
Centriq supports up to 768 GB of DDR4 across six channels. The memory controllers support speeds of up to 2666 MT/s for 128 GB/s of theoretical bandwidth. Memory latency is about 121.4 ns, and is poorly controlled under high bandwidth load. Latency can rise beyond 500 ns at over 100 GB/s of bandwidth usage. For comparison, Intel is able to keep latency below 200 ns at more than 90% bandwidth utilization. Still, Centriq has plenty of bandwidth in an absolute sense. Against contemporary Arm server competition like Amazon’s Graviton 1, Centriq has a huge bandwidth advantage. Furthermore, the large L3 should reduce DRAM bandwidth demands compared to Graviton 1.
Unlike Intel and AMD server processors, Centriq cannot scale to multi-socket configurations. That caps a Centriq server at 48 cores, while AMD’s Zen 1 and Intel’s Skylake can scale further using multiple sockets. Qualcomm’s decision not to pursue multi-socket support is reasonable, because cross-socket connections require both massive bandwidth and additional interconnect work. However, it does exclude more specialized cloud applications that benefit from VMs with over a hundred CPU cores and terabytes of memory. Having just 32 PCIe lanes also limits Centriq’s ability to host piles of accelerators. Even contemporary high end workstations had more PCIe lanes.
Thus Centriq’s system architecture is designed to tackle mainstream cloud applications, rather than trying to cover everything Intel does. By not tackling all those specialized applications, Qualcomm’s datacenter effort can avoid getting distracted and focus on doing the best job they can for common cloud scenarios. For those use cases, sticking with 32 PCIe lanes and integrating traditional southbridge functions like USB and SATA likely reduce platform cost. And while Centriq’s interconnect may not compare well to Intel’s, it’s worlds ahead of Graviton 1.
In SPEC CPU2017, a Falkor core comfortably outperforms Arm’s Cortex A72, with a 21.6% lead in the integer suite and a 53.4% lead in the floating point suite. It falls behind later Arm offerings on more advanced process nodes.
With SPEC CPU2017’s integer workloads, Falkor compares best in memory-bound workloads like 505.mcf and 502.gcc. Falkor pulls off a massive lead in several floating point subtests like 503.bwaves and 507.cactuBSSN, which inflates its overall lead in the floating point suite.
From an IPC perspective, Falkor is able to stretch its legs in cache-friendly workloads like 538.imagick. Yet not all high IPC workloads give Falkor a substantial lead. Cortex A72 is just barely behind in 548.exchange2 and 525.x264, two high IPC tests in SPEC CPU2017’s integer suite. It’s a reminder that Falkor is not quite 4-wide.
For comparison, I’ve included IPC figures from Skylake, a 4-wide core with no renamer slot restrictions. It’s able to push up to and past 3 IPC in easier workloads, unlike Falkor.
With 7-Zip set to use eight threads and pinned to four cores, Falkor achieves a comfortable lead over Cortex A72. Using one core per cluster provides a negligible performance increase over loading both cores across two clusters.
Unlike 7-Zip, libx264 is a well vectorized workload. Falkor has poor vector capabilities, but so does Cortex A72. Again, additional L2 capacity from using four duplexes provides a slight performance increase. And again, Falkor has no trouble beating A72.
Qualcomm’s Kryo mobile core combined high core throughput with a sub-par memory subsystem. Falkor takes a different approach in its attempt to break into the server market. Its core pipeline is a downgrade compared to Kryo in many respects. Falkor has fewer execution resources, less load/store bandwidth, and worse handling for 128-bit vectors. Its 3+1 renamer acts more as a replacement for branch fusion than as something that makes Falkor a truly 4-wide core, which is another step back from Kryo. Falkor improves in some respects, like being able to free resources out-of-order, but it lacks the raw throughput Kryo could bring to the table.
In exchange, Falkor gets a far stronger memory subsystem. It has more than twice as much instruction caching capacity. The load/store unit can track many more in-flight accesses and can perform faster store forwarding. Even difficult cases like partial load/store overlaps are handled well. Outside the core, Falkor’s L2 is much faster than Kryo’s, and L2 misses benefit from a 60 MB L3 behind a high bandwidth interconnect. Rather than spam execution units and core width, Qualcomm is trying to keep Falkor fed.
Likely, Falkor aims to deliver adequate performance across a wide variety of workloads, rather than exceptional performance on a few easy ones. Cutting back the core pipeline may also have been necessary to achieve Qualcomm’s density goals. 48 cores is a lot in 2017, and would have given Qualcomm a core count advantage over Intel and AMD in single socket servers. Doing so within a 120W envelope is even more impressive. Kryo was perhaps a bit too “fat” for that role. A wide pipeline and full 128-bit vector execution units take power. Data transfer can draw significant power too, and Kryo’s poor caching capacity did it no favors.
Falkor ends up being a strong contender in the 2017 Arm server market. Centriq walks all over Amazon’s Graviton 1, which was the first widely available Arm platform from a major cloud provider. Even with core cutbacks compared to Kryo, Falkor is still quite beefy compared to A72. Combined with a stronger memory subsystem, Falkor is able to beat A72 core for core, while having more cores on a chip.
But beating Graviton 1 isn’t enough. The Arm server scene wasn’t a great place to be around the late 2010s. Several attempts to make a density optimized Arm server CPU had come and gone. These included AMD’s “Seattle”, Ampere’s eMAG 8180, and Cavium’s ThunderX2. Likely, the strength of the x86-64 competition and nascent state of the Arm software ecosystem made it difficult for these early Arm server chips to break into the market. Against Skylake-X for example, Falkor is a much smaller core. Centriq’s memory subsystem is strong next to Kryo or A72’s, but against Skylake it has less L2 and higher L3 latency.
Qualcomm Datacenter Technologies no doubt accomplished a lot when developing the Centriq server SoC. Stitching together dozens of cores and shuffling hundreds of gigabytes per second across a chip is no small feat, and is a very different game from mobile SoC design. But taking on experienced players like Intel and AMD isn’t easy, even when targeting a specific segment like cloud computing. Arm would not truly gain a foothold in the server market until Ampere Altra came out after 2020. At that point, Arm’s stronger Neoverse N1 core and TSMC’s 7 nm FinFET process left Falkor behind. Qualcomm planned to follow up on Falkor with a “Saphira” core, but that never hit the market as far as I know.
However, Qualcomm is looking to make a comeback into the server market with their announcement of supplying HUMAIN, a Saudi state-backed AI company, with "datacenter CPUs and AI solutions". NVIDIA’s NVLink Fusion announcement also mentions Qualcomm as a provider of server CPUs that can be integrated with NVIDIA’s GPUs using NVLink. I look forward to seeing how that goes, and whether Qualcomm's next server CPU builds off experience gained with Centriq.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-05-23 11:46:57
Hello you fine Internet folks,
Today we are covering AMD’s announcements at Computex 2025: the RX 9060 XT, Threadripper 9000, the Radeon AI Pro R9700, and more ROCm commitments. Due to time constraints and the vastness of Computex, the transcript/article will be done after Computex.
Hope y'all enjoy!