Intel’s Battlemage Architecture

2025-02-11 15:22:52

Intel’s Alchemist architecture gave the company a foot in the door to the high performance graphics segment. The Arc A770 proved to be a competent first effort, able to run many games with credible performance. Now, Intel is passing the torch to a new graphics architecture, named Battlemage.

Like Alchemist, Battlemage targets the midrange segment. It doesn’t try to compete with AMD or Nvidia’s high end cards. While it’s not as flashy as Nvidia’s RTX 4090 or AMD’s RX 7900 XTX, midrange GPUs account for a much larger share of the discrete GPU market, thanks to their lower prices. Unfortunately, today’s midrange cards like the RTX 4060 and RX 7600 only come with 8 GB of VRAM, and are poor value. Intel takes advantage of this by launching the Arc B580 at $250, undercutting both competitors while offering 12 GB of VRAM.

For B580 to be successful, its new Battlemage architecture has to execute well across a variety of graphics workloads. Intel has made numerous improvements over Alchemist, aiming to achieve better performance with less compute power and less memory bandwidth. I’ll be looking at the Arc B580, with comparison data from the A770 and A750, as well as scattered data I have lying around.

System Architecture

Battlemage is organized much like its predecessor. Xe Cores continue to act as a basic building block. Four Xe Cores are grouped into a Render Slice, which also contains render backends, a rasterizer, and associated caches for those fixed function units. The entire GPU shares an 18 MB L2 cache.

Block diagram of Intel’s Arc B580. B570 disables two Xe Cores. Only FP32 units shown because I generated this diagram using Javascript and heavy abuse of the CSS box model

The Arc B580 overall is a smaller GPU than its outgoing Alchemist predecessors. B580 has five Render Slices to A770’s eight. In total, B580 has 2560 FP32 lanes to A770’s 4096.

Battlemage launches with a smaller memory subsystem too. The B580 has a 192-bit GDDR6 bus running at 19 GT/s, giving it 456 GB/s of theoretical bandwidth. A770 has 560 GB/s of GDDR6 bandwidth, thanks to a 256-bit bus running at 17.5 GT/s.

Block diagram of the A770. A750 disables four Xe Cores (a whole Render Slice)

Even the host interface has been cut down. B580 only has a PCIe 4.0 x8 link, while A770 gets a full size x16 one. Intel’s new architecture has a lot of heavy lifting to do if it wants to beat a much larger implementation of its predecessor.

Battlemage’s Xe Cores

Battlemage’s architectural changes start at its Xe Cores. The most substantial changes between the two generations actually debuted on Lunar Lake. Xe Cores are further split into XVEs, or Xe Vector engines. Intel merged pairs of Alchemist XVEs into ones that are twice as wide, completing a transition towards larger execution unit partitions. Xe Core throughput stays the same at 128 FP32 operations per cycle.

A shared instruction cache feeds all eight XVEs in a Xe Core. Alchemist had a 96 KB instruction cache, and Battlemage almost certainly has an instruction cache at least as large. Instructions on Intel GPUs are generally 16 bytes long, with an 8 byte compacted form in some cases. A 96 KB instruction cache therefore has a nominal capacity of 6-12K instructions.

Xe Vector Engines (XVEs)

XVEs form the smallest partition in Intel GPUs. Each XVE tracks up to eight threads, switching between them to hide latency and keep its execution units fed. A 64 KB register file stores thread state, giving each thread up to 8 KB of registers while maintaining maximum occupancy. Giving a register count for Intel GPUs doesn’t really work, because Intel GPU instructions can address the register file with far more flexibility than Nvidia or AMD architectures. Each instruction can specify a vector width, and access a register as small as a single scalar element.

For most math instructions, Battlemage sticks with 16-wide or 32-wide vectors, dropping the SIMD8 mode that could show up with Alchemist. Vector execution reduces instruction control overhead because a single operation gets applied across all lanes in the vector. However, that results in lost throughput if some lanes take a different branch direction. On paper, Battlemage’s longer native vector lengths would make it more prone to suffering such divergence penalties. But Alchemist awkwardly shared control logic between XVE pairs, making SIMD8 act like SIMD16, and SIMD16 act a lot like SIMD64 aside from a funny corner case (see the Meteor Lake article for more on that).

Battlemage’s divergence behavior by comparison is intuitive and straightforward. SIMD16 achieves full utilization if groups of 16 threads go the same way. The same applies for SIMD32 and groups of 32 coherent threads. Thus Battlemage is actually more agile than its predecessor when dealing with divergent branches, while enjoying the efficiency advantage of using larger vectors.
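
As a quick illustration (the kernel and values are mine, not from any test suite), a branch that stays uniform within each 16-lane group costs nothing at SIMD16, while branching per-lane would force both paths to execute:

```c
// Illustrative only: the branch direction is the same for lanes 0-15, 32-47, ...,
// so a SIMD16 dispatch stays fully utilized. Branching on (lid & 1) instead
// would diverge every pair of lanes and make both sides of the if/else issue.
__kernel void divergence_demo(__global float *out) {
    uint lid = (uint)get_local_id(0);
    float v;
    if ((lid / 16) % 2 == 0)
        v = (float)lid * 2.0f;
    else
        v = (float)lid * 0.5f;
    out[get_global_id(0)] = v;
}
```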

Maybe XMX is on a separate port. Maybe not. I’m not sure

Like Alchemist, Battlemage executes most math operations down two ports (ALU0, ALU1). ALU0 handles basic FP32 and FP16 operations, while ALU1 handles integer math and less common instructions. Intel’s port layout has parallels to Nvidia’s Turing, which also splits dispatch bandwidth between 16-wide FP32 and INT32 units. A key difference is that Turing uses fixed 32-wide vectors, and keeps both units occupied by feeding them on alternate cycles. Intel can issue instructions of the same type back-to-back, and can select multiple instructions to issue per cycle to different ports.

In another similarity to Turing, Battlemage carries forward Alchemist’s “XMX” matrix multiplication units. Intel claims 3-way co-issue, implying XMX is on a separate port. However, VTune only shows multiple pipe active metrics for ALU0+ALU1 and ALU0+XMX. I’ve drawn XMX as a separate port above, but the XMX units could be on ALU1.

Data collected from Intel’s VTune profiler, zoomed in to show what’s happening at the millisecond scale. VTune’s y-axis scaling is funny (relative to max observed utilization rather than 100%), so I’ve labeled some interesting points.

Gaming workloads tend to use more floating point operations. During compute heavy sections, ALU1 offloads other operations and keeps ALU0 free to deal with floating point math. XeSS exercises the XMX unit, with minimal co-issue alongside vector operations. A generative AI workload shows even less XMX+vector co-issue.

As expected for any specialized execution unit, XMX software support is far from guaranteed. Running AI image generation or language models using other frameworks heavily exercises B580’s regular vector units, while leaving the XMX units idle.

In microbenchmarks, Intel’s older A770 and A750 can often use their larger shader arrays to achieve higher compute throughput than B580. However, B580 behaves more consistently. Alchemist had trouble with FP32 FMA operations. Battlemage in contrast has no problem getting right up to its theoretical throughput. FP32+INT32 dual issue doesn’t happen perfectly on Battlemage, but it barely happened at all on A750.
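
A dual-issue probe can be structured roughly like the kernel below (a sketch, not necessarily how my microbenchmark is written): independent FP32 and INT32 chains give the scheduler the chance to send work to ALU0 and ALU1 in the same cycle.

```c
// Hedged sketch: interleaved, independent FP32 and INT32 dependency chains.
// A real throughput test unrolls this with more independent accumulators per
// type so execution latency doesn't become the bottleneck.
__kernel void fp_int_coissue(__global float *fout, __global int *iout, uint iters) {
    float f0 = 1.0f, f1 = 2.0f;
    int   i0 = 3,    i1 = 5;
    for (uint i = 0; i < iters; i++) {
        f0 = f0 * 1.0001f + 0.5f;   // FP32 FMA work -> ALU0
        f1 = f1 * 0.9999f + 0.25f;  // second independent FP chain
        i0 = i0 * 3 + 1;            // INT32 work -> ALU1
        i1 = i1 * 5 + 7;            // second independent INT chain
    }
    fout[get_global_id(0)] = f0 + f1;
    iout[get_global_id(0)] = i0 + i1;
}
```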

On the integer side, Battlemage is better at dealing with lower precision INT8 operations. Using Meteor Lake’s iGPU as a proxy, Intel’s last generation architecture used mov and add instruction pairs to handle char16 adds, while Battlemage gets it done with just an add.
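
Something as simple as the following kernel (illustrative, not the exact test) exposes that difference, since the char16 adds compile to different instruction sequences on the two generations:

```c
// char16 adds: per the observation above, Battlemage handles each vector add
// with a single add instruction, while the prior generation used mov+add pairs.
__kernel void char16_add(__global const char16 *a, __global const char16 *b,
                         __global char16 *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```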

Each XVE also has a branch port for control flow instructions, and a “send” port that lets the XVE talk with the outside world. Load on these ports is typically low, because GPU programs don’t branch as often as CPU ones, and shared functions accessed through the “send” port won’t have enough throughput to handle all XVEs hitting it at the same time.

Memory Access

Battlemage’s memory subsystem has a lot in common with Alchemist’s, and traces its origins to Intel’s integrated graphics architectures over the past decade. XVEs access the memory hierarchy by sending a message to the appropriate shared functional unit. At one point, the entire iGPU was basically the equivalent of a Xe Core, with XVE equivalents acting as basic building blocks. XVEs would access the iGPU’s texture units, caches, and work distribution hardware over a messaging fabric. Intel has since built larger subdivisions, but the terminology remains.

Texture Path

Each Xe Core has eight TMUs, or texture samplers in Intel terminology. The samplers have a 32 KB texture cache, and can return 128 bytes/cycle to the XVEs. Battlemage is no different from Alchemist in this respect. But the B580 has less texture bandwidth on tap than its predecessor. Its higher clock speed isn’t enough to compensate for having far fewer Xe Cores.

B580 runs at higher clock speeds, which brings down texture cache hit latency too. In clock cycle terms though, Battlemage has nearly identical texture cache hit latency to its predecessor. L2 latency has improved significantly, so missing the texture cache isn’t as bad on Battlemage.

Data Access (Global Memory)

Global memory accesses are first cached in a 256 KB block, which serves double duty as Shared Local Memory (SLM). It’s larger than Alchemist and Lunar Lake’s 192 KB L1/SLM block, so Intel has found the transistor budget to keep more data closer to the execution units. Like Lunar Lake, B580 favors SLM over L1 capacity even when a compute kernel doesn’t allocate local memory.

Intel may be able to split the L1/SLM block in another way, but a latency test shows exactly the same result regardless of whether I allocate local memory. Testing with Nemes’s Vulkan test suite also shows 96 KB of L1.

Global memory access on Battlemage offers lower latency than texture accesses, even though the XVEs have to handle array address generation. With texture accesses, the TMUs do all the address calculations. All the XVEs do is send them a message. L1 data cache latency is similar to Alchemist in clock cycle terms, though again higher clock speeds give B580 an actual latency advantage.
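
These latency figures come from pointer chasing; a stripped-down OpenCL version of that pattern looks something like this (simplified relative to the actual test):

```c
// Dependent loads: each iteration's address comes from the previous load, so
// average time per iteration approximates load-to-use latency at whatever
// level of the hierarchy the array fits into. Typically launched with a
// single work-item so only one chain runs.
__kernel void latency_chase(__global const uint *arr, __global uint *result, uint iters) {
    uint idx = 0;
    for (uint i = 0; i < iters; i++)
        idx = arr[idx];
    *result = idx;   // keep the chain live so the compiler can't remove it
}
```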

Scalar Optimizations?

Battlemage gets a clock cycle latency reduction too with scalar memory accesses. Intel does not have separate scalar instructions like AMD. But Intel’s GPU ISA lets each instruction specify its SIMD width, and SIMD1 instructions are possible. Intel’s compiler has been carrying out scalar optimizations and opportunistically generating SIMD1 instructions well before Battlemage, but there was no performance difference as far as I could tell. Now there is.

Forcing SIMD16 mode saves one cycle of latency over SIMD32, because address generation instructions don’t have to issue over two cycles

On B580, L1 latency for a SIMD1 (scalar) access is about 15 cycles faster than a SIMD16 access. SIMD32 accesses take one extra cycle when microbenchmarking, though that’s because the compiler generates two sets of SIMD16 instructions to calculate addresses across 32 lanes. I also got Intel’s compiler to emit scalar INT32 adds, but those didn’t see improved latency over vector ones. Therefore, the scalar latency improvements almost certainly come from an optimized memory pipeline.

Scalar load, with simple explanations

SIMD1 instructions also help within the XVEs. Intel doesn’t use a separate scalar register file, but can address its vector register file more flexibly than AMD or Nvidia can. Instructions can access individual elements (sub-registers) and read out whatever vector width they want. Intel’s compiler could pack many “scalar registers” into the equivalent of a vector register, economizing register file capacity.
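
For reference, one way to pin the compiler to a given SIMD width is Intel’s cl_intel_required_subgroup_size kernel attribute. I’m not claiming that’s exactly how the SIMD16/SIMD32 comparison above was produced, but it’s enough to reproduce the idea:

```c
// Same pointer-chase body as before, but compiled at a fixed SIMD16 width.
// Changing 16 to 32 forces SIMD32; leaving the attribute off lets the compiler
// choose, and a workgroup-uniform address like this one is a candidate for
// SIMD1 (scalar) code generation.
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void latency_chase_simd16(__global const uint *arr, __global uint *result, uint iters) {
    uint idx = 0;
    for (uint i = 0; i < iters; i++)
        idx = arr[idx];
    *result = idx;
}
```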

L1 Bandwidth

I was able to get better efficiency out of B580’s L1 than A750’s using float4 loads from a small array. Intel suggests Xe-HPG’s L1 can deliver 512 bytes per cycle, but I wasn’t able to get anywhere close on either Alchemist or Battlemage. Microbenchmarking puts per-Xe Core bandwidth at a bit under 256 bytes per cycle on both architectures.
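
A bandwidth kernel along these lines (simplified, with placeholder parameters) streams float4 loads from an array small enough to stay resident in L1:

```c
// Each lane streams float4 loads from a small, L1-resident array and
// accumulates them so the loads can't be optimized away.
__kernel void l1_bandwidth(__global const float4 *data, __global float4 *out,
                           uint len, uint iters) {
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    uint lid = (uint)get_local_id(0);
    uint stride = (uint)get_local_size(0);
    for (uint i = 0; i < iters; i++)
        acc += data[(lid + i * stride) % len];
    out[get_global_id(0)] = acc;
}
```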

Even if the L1 can only provide 256 bytes per cycle, that still gives Intel’s Xe Core as much L1 bandwidth as an AMD RDNA WGP, and twice as much L1 bandwidth as an Nvidia Ampere SM. 512 bytes per cycle would let each XVE complete a SIMD16 load every cycle, which is kind of overkill anyway.

Local Memory (SLM)

Battlemage uses the same 256 KB block for L1 cache and SLM. SLM provides an address space local to a group of threads, and acts as a fast software managed scratchpad. In OpenCL, that’s exposed via the local memory type. Everyone likes to call it something different, but for this article I’ll use OpenCL and Intel’s term.

Even though both local memory and L1 cache hits are backed by the same physical storage, SLM accesses enjoy better latency. Unlike cache hits, SLM accesses don’t need tag checks or address translation. Accessing Battlemage’s 256 KB block of memory in SLM mode brings latency down to just over 15 ns. It’s faster than doing the same on Alchemist, and is very competitive against recent GPUs from AMD and Nvidia.

Local memory/SLM also lets threads within a workgroup synchronize and exchange data. From testing with atomic_cmpxchg on local memory, B580 can bounce values between threads a bit faster than its predecessor. Nearly all of that improvement is down to higher clock speed, but it’s enough to bring B580 in line with AMD and Nvidia’s newer GPUs.

Backing structures for local memory often contain dedicated ALUs for handling atomic operations. For example, the LDS on AMD’s RDNA architecture is split into 32 banks, with one atomic ALU per bank. Intel almost certainly has something similar, and I’m testing that with atomic_add operations on local memory. Each thread targets a different address across an array, aiming to avoid contention.
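
The kernel for that test looks roughly like the following sketch (array size and iteration count are placeholders):

```c
// Per-lane atomic_add on local memory, with each lane targeting its own
// element to avoid contention. Throughput then reflects how many atomic ALUs
// back the SLM rather than serialization on a single address.
__kernel void slm_atomic_add(__global int *out, uint iters) {
    __local int buf[256];
    uint lid = (uint)get_local_id(0);
    buf[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint i = 0; i < iters; i++)
        atomic_add(&buf[lid], 1);
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = buf[lid];
}
```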

Alchemist and Battlemage both appear to have 32 atomic ALUs attached to each Xe Core’s SLM unit, much like AMD’s RDNA and Nvidia’s Pascal. Meteor Lake’s Xe-LPG architecture may have half as many atomic ALUs per Xe Core.

L2 Cache

Battlemage has a two level cache hierarchy like its predecessor and Nvidia’s current GPUs. B580’s 18 MB L2 is slightly larger than A770’s 16 MB L2. A770 divided its L2 into 32 banks, each capable of handling a 64 byte access every cycle. At 2.4 GHz, that’s good for nearly 5 TB/s of bandwidth.

Intel didn’t disclose B580’s L2 topology, but a reasonable assumption is that Intel increased bank size from 512 to 768 KB, keeping 4 L2 banks tied to each memory controller. If so, B580’s L2 would have 24 banks and 4.3 TB/s of theoretical bandwidth at 2.85 GHz. Microbenchmarking using Nemes’s Vulkan test gets a decent proportion of that bandwidth. Efficiency is much lower on the older A750, which gets approximately as much bandwidth as B580 despite probably having more theoretical L2 bandwidth on tap.

Besides insulating the execution units from slow VRAM, the L2 can act as a point of coherency across the GPU. B580 is pretty fast when bouncing data between threads using global memory, and is faster than its predecessor.

With atomic add operations on global memory, Battlemage does fine for a GPU of its size and massively outperforms its predecessor.

I’m using INT32 operations, so 86.74 GOPS on the A750 would correspond to 351 GB/s of L2 bandwidth. On the B580, 220.97 GOPS would require 883.9 GB/s. VTune however reports far higher L2 bandwidth on A750. Somehow, A750 sees 1.37 TB/s of L2 bandwidth during the test, or nearly 4x more than it should need.

VTune capture of the test running on A750

Meteor Lake’s iGPU is a close relative of Alchemist, but its ratio of global atomic add throughput to Xe Core count is similar to Battlemage’s. VTune reports Meteor Lake’s iGPU using more L2 bandwidth than required, but only by a factor of 2x. Curiously, it also shows the expected bandwidth coming off the XVEs. I wonder if something in Intel’s cross-GPU interconnect didn’t scale well with bigger GPUs.

With Battlemage, atomics are broken out into a separate category and aren’t reported as regular L2 bandwidth. VTune indicates atomics are passed through the load/store unit to L2 without any inflation. Furthermore, the L2 was only 79.6% busy, suggesting there’s a bit of headroom at that layer.

And the same test on B580

This could just be a performance monitoring improvement, but performance counters are typically closely tied to the underlying architecture. I suspect Intel made major changes to how they handle global memory atomics, letting performance scale better on larger GPUs. I’ve noticed that newer games sometimes use global atomic operations. Perhaps Intel noticed that too, and decided it was time to optimize them.

VRAM Access

B580 has a 192-bit GDDR6 VRAM subsystem, likely configured as six 2×16-bit memory controllers. Latency from OpenCL is higher than it was in the previous generation.

I suspect this only applies to OpenCL, because latency from Vulkan (with Nemes’s test) shows just over 300 ns of latency. Latency at large test sizes will likely run into TLB misses, and I suspect Intel is using different page sizes for different APIs.

Compared to its peers, the Arc B580 has more theoretical VRAM bandwidth at 456 GB/s, but also less L2 capacity. For example, Nvidia’s RTX 4060 has 272 GB/s VRAM bandwidth using a 128-bit GDDR6 bus running at 17 GT/s, with 24 MB of L2 in front of it. I profiled a few things with VTune and picked out spikes in VRAM bandwidth usage. I also checked reported L2 bandwidth over the same sampling interval.

Intel’s balance of cache capacity and memory bandwidth seems to work well, at least in the few examples I checked. Even when VRAM bandwidth demands are high, the 18 MB L2 is able to catch enough traffic to avoid pushing GDDR6 bandwidth limits. If Intel hypothetically used a smaller GDDR6 memory subsystem like Nvidia’s RTX 4060, B580 would need a larger cache to avoid reaching VRAM bandwidth limits.

PCIe Link

Probably as a cost cutting measure, B580 has a narrower PCIe link than its predecessor. Still, a x8 Gen 4 link provides as much theoretical bandwidth as a x16 Gen 3 one. Testing with OpenCL doesn’t get close to theoretical bandwidth on either card, but B580 ends up at a disadvantage compared to A750.

PCIe link bandwidth often has minimal impact on gaming performance, as long as you have enough VRAM. B580 has a comparatively large 12 GB VRAM pool compared to its immediate competitors, which also have PCIe 4.0 x8 links. That could give B580 an advantage within the midrange market, but that doesn’t mean it’s immune to problems.

DCS, for example, will use over 12 GB of VRAM with mods. Observing different aircraft in different areas often causes stutters on the B580. VTune shows high PCIe traffic as the GPU must frequently read from host memory.

Final Words

Battlemage retains Alchemist’s high level goals and foundation, but makes a laundry list of improvements. Compute is easier to utilize, cache latency improves, and weird scaling issues with global memory atomics have been resolved. Intel has made some surprising optimizations too, like reducing scalar memory access latency. The result is impressive, with Arc B580 easily outperforming the outgoing A770 despite lagging in nearly every on-paper specification.

Some of Intel’s GPU architecture changes nudge it a bit closer to AMD and Nvidia’s designs. Intel’s compiler often prefers SIMD32, a mode that AMD often chooses for compute code or vertex shaders, and one that Nvidia exclusively uses. SIMD1 optimizations create parallels to AMD’s scalar unit or Nvidia’s uniform datapath. Battlemage’s memory subsystem emphasizes caching more than its predecessor, while relying less on high VRAM bandwidth. AMD’s RDNA 2 and Nvidia’s Ada Lovelace made similar moves with their memory subsystems.

Of course Battlemage is still a very different animal from its discrete GPU competitors. Even with larger XVEs, Battlemage still uses smaller execution unit partitions than AMD or Nvidia. With SIMD16 support, Intel continues to support shorter vector widths than the competition. Generating SIMD1 instructions gives Intel some degree of scalar optimization, but stops short of having a full-out scalar/uniform datapath like AMD or post-Turing Nvidia. And 18 MB of cache is still less than the 24 or 32 MB in Nvidia and AMD’s midrange cards.

Differences from AMD and Nvidia aside, Battlemage is a worthy step on Intel’s journey to take on the midrange graphics market. A third competitor in the discrete GPU market is welcome news for any PC enthusiast. For sure, Intel still has some distance to go. Driver overhead and reliance on resizable BAR are examples of areas where Intel is still struggling to break from their iGPU-only background.

But I hope Intel goes after higher-end GPU segments once they’ve found firmer footing. A third player in the high end dGPU market would be very welcome, as many folks are still on Pascal or GCN because they don’t feel there’s been a reasonable upgrade yet. Intel’s Arc B580 addresses some of that pent-up demand, at least when it’s not out-of-stock. I look forward to seeing Intel’s future GPU efforts.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Alibaba/T-HEAD's Xuantie C910

2025-02-04 13:12:58

T-HEAD is a wholly owned subsidiary of Alibaba, one of China's largest tech companies. Over the past few years, T-HEAD has created a line of RISC-V cores. Alibaba seems to have two motivations for pushing RISC-V. On one hand, the company stands to benefit from creating cost effective chips optimized for areas it cares about, like IoT endpoints and edge computing. On the other, Alibaba almost certainly wants to reduce its dependence on foreign imports. RISC-V is an open instruction set, and isn't controlled by US or British corporations like x86-64 or ARM. T-HEAD's RISC-V push can thus be seen more broadly as a part of China's push to create viable domestic microchips.

Xuantie C910 slots into the "high performance" category within T-HEAD's lineup. Besides joining a small number of out-of-order RISC-V cores that have made it into hardware, C910 is an early adopter for RISC-V's vector extension. It supports RVV 0.7.1, which features masking and variable vector length support. T-HEAD has since released the C920 core, which brings RVV support up to version 1.0, but otherwise leaves C910 unchanged.

From Alibaba's paper, with descriptions added in red by Clam. PIU and PLIC appear in the dual core diagram below.

C910 targets "AI, Edge servers, Industrial control, [and] ADAS" as possible applications. It's also T-HEAD's first generation out-of-order design, so taking on all those applications is ambitious. C910 is implemented in clusters of up to four cores, each with a shared L2 cache. T-HEAD targets 2 to 2.5 GHz on TSMC's 12nm FinFET process, where a C910 core occupies 0.8 mm2. Core voltage is 0.8V at 2 GHz, and 1.0V at 2.5 GHz. On TSMC's 7nm process, T-HEAD managed to push core frequency to 2.8 GHz. T-HEAD's paper further claims dynamic power is around 100 microwatts/MHz, which works out to 0.2W at 2 GHz. Of course, this figure doesn't include static power or power draw outside the core. Yet all of these characteristics together make clear C910 is a low power, low area design.

This article will examine C910 in the T-HEAD TH1520, using the LicheePi single board computer. TH1520 is fabricated on TSMC’s 12nm FinFET process, and has a quad-core C910 cluster with 1 MB of L2 running at 1.85 GHz. It’s connected to 8 GB of LPDDR4X-3733. C910 has been open-sourced, so I’ll be attempting to dig deeper into core details by reading some of the source code – but with some disclaimers. I’m a software engineer, not a hardware engineer. Also, some of the code is likely auto-generated from another undisclosed source, so reading that code has been a time consuming and painful experience. Expect some mistakes along the way.

Core Overview

The Xuantie C910 is a 3-wide, out-of-order core with a 12 stage pipeline.

Like Arm’s Cortex A73, C910 can release out-of-order resources early. For microbenchmarking, I used both a dependent branch and incomplete load to block retire, just as I did on A73.

Frontend: Instruction Fetch and Branch Prediction

C910’s frontend is tailored to handle both 16-bit and 32-bit RISC-V instructions, along with the requirements of RISC-V’s vector extension. The core has a 64 KB, 2-way set associative instruction cache with a FIFO replacement policy. Besides caching instruction data, C910 stores four bits of predecode data for each possible 16-bit instruction slot. Two bits tentatively indicate whether an instruction starts at that position, while the other two provide branch info. In total, C910 uses 83.7 KB of raw bit storage for instruction caching.

An L1i access reads instruction bytes, predecode data, and tags from both ways. Thus, the instruction fetch (IF) stage brings 256 bits of instruction bytes into temporary registers alongside 64 bits of predecode data. Tags for both ways are checked to determine which way has an L1i hit, if any. Simultaneously, the IF stage checks a 16 entry, fully associative L0 BTB, which lets the core handle a small number of taken branches with effectively single cycle latency.

Rough, simplified sketch of C910’s frontend

Instruction bytes and predecode data from both ways are passed to the next Instruction Pack (IP) stage. All of that is fed into a pair of 8-wide early decode blocks, called IP decoders in the source code. Each of the 16 early decode slots handles a possible instruction start position at a 16-bit boundary, across both ways. These early decoders do simple checks to categorize instructions. For vector instructions, the IP decoders also figure out VLEN (vector length), VSEW (selected element width), and VLMAX (number of elements).

Although the IP stage consumes 256 bits of instruction data and 64 bits of predecode data, and processes all of that with 16 early decode slots, half of the work is always discarded because the L1i can only hit in one way. Output from the 8-wide decode block that processed the correct way is passed to the next stage, while output from the other 8-wide decoder is discarded.

C910’s main branch predictor mechanisms also sit at the IP stage. Conditional branches are handled with a bi-mode predictor, with a 1024 entry selection table, two 16384 entry history tables containing 2-bit counters, and a 22-bit global history register. The selection table is indexed by hashing the low bits of the branch address and global history register, while the history tables are indexed by hashing the high bits of the history register. Output from the selection table is used to pick between the two history tables, labeled “taken” and “ntaken”. Returns are handled using a 12 entry return stack, while a 256 entry indirect target array handles indirect branches. In all, the branch predictor uses approximately 17.3 KB of storage. It’s therefore a small branch predictor by today’s standards, well suited to C910’s low power and low area design goals. For perspective, a high performance core like Qualcomm’s Oryon uses 80 KB for its conditional (direction) predictor alone, and another 40 KB for the indirect predictor.
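
As a rough illustration of how a bi-mode lookup fits together (the index hashes below are placeholders, not C910’s actual functions), the selection counter picks which of the two history tables supplies the final 2-bit counter:

```c
#include <stdint.h>

// Illustrative bi-mode predictor using the table sizes described above.
// Real implementations pack 2-bit counters; byte arrays keep the sketch simple.
typedef struct {
    uint8_t  select[1024];   // 2-bit selection counters
    uint8_t  taken[16384];   // 2-bit counters, "taken" table
    uint8_t  ntaken[16384];  // 2-bit counters, "ntaken" table
    uint32_t ghr;            // 22-bit global history register
} bimode;

static int bimode_predict(const bimode *p, uint32_t pc) {
    uint32_t sel_idx  = (pc ^ p->ghr) & 0x3FF;           // low PC bits hashed with history
    uint32_t hist_idx = (pc ^ (p->ghr >> 6)) & 0x3FFF;   // higher history bits in the hash
    uint8_t  ctr = (p->select[sel_idx] >= 2) ? p->taken[hist_idx]
                                             : p->ntaken[hist_idx];
    return ctr >= 2;   // 1 = predict taken
}
```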

Testing with random patterns of various lengths shows C910 can deal with moderately long patterns. It’s in line with what I’ve seen this test do with other low power cores. Both C910 and A73 struggle when there are a lot of branches in play, though they can maintain reasonably good accuracy for a few branches without excessively long history.

C910’s main BTB has 1024 entries and is 4-way set associative. Redirecting the pipeline from the IP stage creates a single pipeline bubble, or effectively 2 cycle taken branch latency. Branches that spill out of the 1024 entry BTB have 4 cycle latency, as long as code stays within the instruction cache.

The Instruction Pack stage feeds up to eight 16-bit instructions along with decoder output into the next Instruction Buffer (IB) stage. This stage’s job is to smooth out instruction delivery, covering any hiccups in frontend bandwidth as best as it can. To do this, the IB stage has a 32 entry instruction queue and a separate 16 entry loop buffer. Both have 16-bit entries, so 32-bit instructions will take two slots. C910’s loop buffer serves the same purpose as Pentium 4’s trace cache, seeking to fill in lost frontend slots after a taken branch. Of course, a 16 entry loop buffer can only do this for the smallest of loops.

To feed the subsequent decode stage, the IB stage can pick instructions from the loop buffer, instruction queue, or a bypass path to reduce latency if queuing isn’t needed. Each instruction and its associated early decode metadata are packed into a 73-bit format, and sent to the decode stage.

Frontend: Decode and Rename

The Instruction Decode (ID) stage contains C910’s primary decoders. Three 73-bit inputs from the IB stage are fed into these decoders, which parse out register info and split instructions into multiple micro-ops if necessary.

Only the first decode slot can handle instructions that decode into four or more micro-ops. All decode slots can emit 1-2 micro-ops for simpler instructions, though the decode stage in total can’t emit more than four micro-ops per cycle. Output micro-ops are packed into a 178-bit format, and passed directly to the rename stage. C910 does not have a micro-op queue between the decoders and renamers like many other cores. Rename width and decoder output width therefore have to match, explaining why the renamer is 4-wide and why the decoders are restricted to 4 micro-ops per cycle. Any instruction that decodes into four or more micro-ops will block parallel decode.

Notes on micro-op format

C910’s instruction rename (IR) stage then checks for matches between architectural registers to find inter-instruction dependencies. It then assigns free registers from the respective pool (integer or FP), or picks up newly deallocated registers coming off the retire stage. The IR stage does further decoding too. Instructions are further labeled with whether they’re a multi-cycle ALU operation, which ports they can go to, and so on. After renaming, micro-ops are 271 bits.

From software, C910’s frontend can sustain 3 instructions per cycle as long as code fits within the 64 KB instruction cache. L2 code bandwidth is low at under 1 IPC. SiFive’s P550 offers more consistent frontend bandwidth for larger code footprints, and can maintain 1 IPC even when running code from L3.

Out-of-Order Execution Engine

C910’s backend uses a physical register file (PRF) based out-of-order execution scheme, where both pending and known-good instruction results are stored in register files separate from the ROB. C910’s source code (ct_rtu_rob.v) defines 64 ROB entries, but T-HEAD’s paper says the ROB can hold up to 192 instructions. Microbenchmarking generally agrees, likely because each ROB entry can hold up to three instructions, which would reconcile the two figures (64 × 3 = 192).

Therefore, C910 has reorder buffer capacity on par with Intel’s Haswell from 2013, theoretically letting it keep more instructions in flight than P550 or Goldmont Plus. However, other structures are not appropriately sized to make good use of that ROB capacity.

RISC-V has 32 integer and 32 floating point registers, so 32 entries in each register file generally have to be reserved for holding known-good results. That leaves only 64 integer and 32 floating point registers to hold results for in-flight instructions. Intel’s Haswell supports its 192 entry ROB with much larger register files on both the integer and floating point side.

Execution Units

C910 has eight execution ports. Two ports on the scalar integer side handle the most common ALU operations, while a third is dedicated to branches. C910’s integer register file has 10 read ports to feed five execution pipes, which includes three pipes for handling memory operations. A distributed scheduler setup feeds C910’s execution ports. Besides the opcode and register match info, each scheduler entry has a 7-bit age vector to enable age-based prioritization.

Scheduler capacity is low compared to Goldmont Plus and P550, with just 16 entries available for the most common ALU operations. P550 has 40 scheduler entries available across its three ALU ports, while Goldmont Plus has 30 entries.

C910’s FPU has a simple dual pipe design. Both ports can handle the most common floating point operations like adds, multiplies, and fused multiply-adds. Both pipes can handle 128-bit vector operations too. Feeding each port requires up to four inputs from the FP/vector register file. A fused multiply-add instruction (a*b+c) requires three inputs. A fourth input provides a mask register. Unlike AVX-512 and SVE, RISC-V doesn’t define separate mask registers, so all inputs have to come from the FP/vector register file. Therefore, C910’s FP register file has almost as many read ports as the integer one, despite feeding fewer ports.

Floating point execution latency is acceptable, and ranges from 3 to 5 cycles for the most common operations. Some recent cores like Arm’s Cortex X2, Intel’s Golden Cove, and AMD’s Zen 5 can do FP addition with 2 cycle latency. I don’t expect that from a low power core though.

Memory Subsystem

Two address generation units (AGUs) on C910 calculate effective addresses for memory accesses. One AGU handles loads, while the other handles stores. C910’s load/store unit is generally split into two pipelines, and aims to handle up to one load and one store per cycle. Like many other cores, store instructions are broken into a store address and a store data micro-op.

From Alibaba’s paper

39-bit virtual addresses are then translated into 40-bit physical addresses. C910’s L1 DTLB has 17 entries and is fully associative. A 1024 entry, 4-way L2 TLB handles L1 TLB misses for both data and instruction accesses, and adds 4 cycles of latency over an L1 TLB hit. Physically, the L2 TLB has two banks, both 256×84 SRAM instances. The tag array is a 256×196 bit SRAM instance, and a 196-bit access includes tags for all four ways along with four “FIFO” bits, possibly used to implement a FIFO replacement policy. Besides necessary info like the virtual page number and a valid bit, each tag includes an address space ID and a global bit. These can exempt an entry from certain TLB flushes, reducing TLB thrashing on context switches. In total, the L2 TLB’s tags and data require 8.96 KB of bit storage.

Physical addresses are written into the load and store queues, depending on whether the address is a load or store. I’m not sure how big the load queue is. C910’s source code suggests there are 12 entries, and microbenchmarking results are confusing.

In the source code, each load queue entry stores 36 bits of the load’s physical address along with 16 bits to indicate which bytes are valid, and a 7-bit instruction id to ensure proper ordering. A store queue entry stores the 40-bit physical address, pending store data, 16 byte-valid bits, a 7-bit instruction id, and a ton of other fields. To give some examples (a rough struct-style sketch follows the list):

  • wakeup_queue: 12 bits, possibly indicates which dependent loads should be woken up when data is ready

  • sdid: 4 bits, probably store data id

  • age_vec, age_vec_1: 12 bit age vectors, likely for tracking store order
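
Put together, the load and store queue entries look roughly like the C structs below. Field widths follow my reading of the source code; the packing is purely illustrative, not the actual RTL layout.

```c
#include <stdint.h>

// Rough, illustrative view of C910's load/store queue entries as described above.
struct c910_load_queue_entry {
    uint64_t paddr;        // 36 bits of the load's physical address
    uint16_t byte_valid;   // 16 bits marking which bytes are valid
    uint8_t  iid;          // 7-bit instruction id for ordering
};

struct c910_store_queue_entry {
    uint64_t paddr;         // full 40-bit physical address
    uint8_t  data[16];      // pending store data
    uint16_t byte_valid;    // 16 per-byte valid bits
    uint8_t  iid;           // 7-bit instruction id
    uint16_t wakeup_queue;  // 12 bits: dependent loads to wake when data is ready
    uint8_t  sdid;          // 4 bits: store data id
    uint16_t age_vec;       // 12-bit age vector for store ordering
    uint16_t age_vec_1;     // second 12-bit age vector
};
```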

To check for memory dependencies, the load/store unit compares bits 11:4 of the memory address. From software testing, C910 can do store forwarding for any load completely contained within the store, regardless of alignment within the store. However, forwarding fails if a load crosses a 16B aligned boundary, or a store crosses an 8B aligned boundary. Failed store forwarding results in a 20+ cycle penalty.

C910 handles unaligned accesses well, unlike P550. If a load doesn’t cross a 16B boundary or a store doesn’t cross an 8B boundary, it’s basically free. If you do cross those alignment boundaries, you don’t face a performance penalty beyond making an extra L1D access under the hood. Overall, C910’s load/store unit and forwarding behavior is a bit short of the most recent cores from Intel and AMD. But it’s at about the same level as AMD’s Piledriver, a very advanced and high performance core in its own right. That’s a good place to be.

Data Cache

The 64 KB, 2-way data cache has 3 cycle latency, and is divided into 4 byte wide banks. It can handle up to one load and one store per cycle, though 128-bit stores take two cycles. L1D tags are split into two separate arrays, one for loads and one for stores.

Data cache misses are tracked by one of eight line-fill buffer entries, which store the miss address. Refill data is held in two 512-bit wide fill buffer registers. Like the instruction cache, the data cache uses a simple FIFO replacement policy.

L2 Cache and Interconnect

Each C910 core interfaces with the outside world via a “PIU”, or processor interface unit. At the other end, a C910 cluster has a Consistency Interface Unit (CIU) that accepts requests from up to four PIUs and maintains cache coherency. The CIU is split into two “snb” instances, each of which has a 24 entry request queue. SNB arbitrates between requests based on age, and has a 512-bit interface to the L2 cache.

C910’s L2 cache acts as both the first stop for L1 misses and as a cluster-wide shared last level cache. On the TH1520, it has 1 MB of capacity and is 16-way set associative with a FIFO replacement policy. To service multiple accesses per cycle, the L2 is built from two banks, selected by bit 6 of the physical address. The L2 is inclusive of upper level caches, and uses ECC protection to ensure data integrity.

L2 latency is 60 cycles, which is problematic for a core with limited reordering capacity and no mid-level cache. Even P550’s 4 MB L3 cache has better latency than C910’s L2, from both a cycle count and true latency standpoint. Intel’s Goldmont Plus also uses a shared L2 as a last level cache, and has about 28 cycles of L2 latency (counting a uTLB miss).

C910’s L2 bandwidth also fails to impress. A single core gets just above 10 GB/s, or 5.5 bytes per cycle. All four cores together can read from L2 at 12.6 GB/s, or just 1.7 bytes per cycle per core. Write bandwidth is better at 23.81 GB/s from all four cores, but that’s still less than 16 bytes per cycle in total, and writes are usually less common than reads.

Again, C910’s L2 is outperformed by both P550’s L3 and Goldmont Plus’s L2. I suspect multi-threaded applications will easily push C910’s L2 bandwidth limits.

Off-cluster requests go through a 128-bit AXI4 bus. In the Lichee Pi 4A, the TH1520 has just under 30 GB/s of theoretical DRAM bandwidth from its 64-bit LPDDR4X-3733 interface. Achieved read bandwidth is much lower, at just 4.2 GB/s. Multithreaded applications might find that a bit tight, especially when there’s only 1 MB of last level cache shared across four cores.

DRAM latency is at least under control at 133.9 ns, tested using 2 MB pages and a 1 GB array. It’s not on the level of a desktop CPU, but it’s better than Eswin and Intel’s low power implementations.
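
The overall structure of such a latency test is a serial walk through a randomly permuted array, along the lines of the sketch below (hugepage mapping and other details omitted):

```c
// Minimal DRAM latency sketch: walk a randomly permuted 1 GB array so every
// load depends on the previous one. The real test also maps the array with
// 2 MB hugepages (e.g. mmap + MAP_HUGETLB) to keep TLB misses out of the way;
// plain malloc is used here to keep the sketch short.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t count = (1ull << 30) / sizeof(uint64_t);   // 1 GB of 64-bit indices
    uint64_t *arr = malloc(count * sizeof(uint64_t));
    if (!arr) return 1;
    for (size_t i = 0; i < count; i++) arr[i] = i;
    // Sattolo's algorithm: build one big random cycle covering every element.
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % i;                        // crude, but fine for a sketch
        uint64_t tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }
    size_t iters = 20000000, idx = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iters; i++) idx = arr[idx];  // dependent loads
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg load latency: %.1f ns (idx=%zu)\n", ns / iters, idx);
    free(arr);
    return 0;
}
```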

Core to Core Latency

Sometimes, the memory subsystem has to carry out a core to core transfer to maintain cache coherency. Sites like Anandtech have used a core to core latency test to probe this behavior, and I’ve written my own version. Results should be broadly comparable with those from Anandtech.
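
The core of my test is a pair of threads bouncing a value through a shared variable; a simplified sketch (without the core pinning and per-core-pair sweeps a real version needs) looks like this:

```c
// Minimal core-to-core "ping-pong" sketch: two threads bounce a value through
// a shared atomic, so each handoff forces a cache line transfer between cores.
// Real versions pin each thread to a specific core (e.g. with
// pthread_setaffinity_np) and sweep every core pair; that part is omitted.
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000L
static _Atomic long shared = -1;

static void *pong(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        while (atomic_load(&shared) != 2 * i) ;  // wait for ping
        atomic_store(&shared, 2 * i + 1);        // reply
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    struct timespec start, end;
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERS; i++) {
        atomic_store(&shared, 2 * i);                // ping
        while (atomic_load(&shared) != 2 * i + 1) ;  // wait for reply
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg round trip: %.1f ns (one-way latency is roughly half)\n", ns / ITERS);
    return 0;
}
```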

T-HEAD’s CIU can pass data between cores with reasonable speed. It’s much better than P550, which saw over 300 ns of latency within a quad core cluster.

Final Words

C910 is T-HEAD’s first out-of-order core. Right out of the gate, C910 is more polished than P550 in some respects. Core to core latency is better, unaligned accesses are properly handled, and there’s vector support. Like P550, C910 aims to scale across a broad range of low power applications. L2 capacity can be configured up to 8 MB, and multi-cluster setups allow scaling to high core counts. I feel like there’s ambition behind C910, since Alibaba wants to use in-house RISC-V cores instead of depending on external suppliers.

Alibaba has been promoting Xuantie core IP series to facilitate external customers for edge computing applications, such as AI, edge servers, industrial control and advanced driver assistance systems (ADAS)…by the end of 2022, a total volume of 15 million units is expected

Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension – T-Head Division, Alibaba Cloud

Yet I also feel the things C910 does well are overshadowed by executing poorly on the basics. The core’s out-of-order engine is poorly balanced, with inadequate capacity in critical structures like the schedulers and register files in relation to its ROB capacity. CPU performance is often limited by memory access performance, and C910’s cache subsystem is exceptionally weak. The cluster’s shared L2 is both slow and small, and the C910 cores have no mid-level cache to insulate L1 misses from that L2. DRAM bandwidth is also lackluster.

A TH1520 chip, seen at Hot Chips 2024 (not the one tested)

C910 is therefore caught in a position where it needs to keep a lot of instructions in flight to smooth out spikes in demand for memory bandwidth and mitigate high L2 latency, but can rarely do so in practice because its ROB capacity isn’t supported by other structures. C910’s unaligned access handling, vector support, and decent core-to-core latency are all nice to have. But tackling those edge cases is less important than building a well balanced core supported by a solid memory subsystem. Missing the subset of applications that use unaligned accesses or take advantage of vectorization is one thing. But messing up performance for everything else is another. And C910’s poor L2 and DRAM performance may even limit the usefulness of its vector capabilities, because vectorized applications tend to pull more memory bandwidth.

Hopefully T-HEAD will use experience gained from C910 to build better cores going forward. With Alibaba behind it, T-HEAD should have massive financial backing. I also hope to see more open source out-of-order cores going forward. Looking through C910 source code was very insightful. I appreciated being able to see how micro-op formats changed between pipeline stages, and how instruction decode is split across several stages that aren’t necessarily labeled “decode”.


A RISC-V Progress Check: Benchmarking P550 and C910

2025-01-31 05:22:37

RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.

Out-of-order execution involves substantial complexity, which makes SiFive’s Performance P550 and T-HEAD’s Xuantie C910 interesting. Both feature out-of-order execution, though a quick look at headline specifications shows neither core can take on the best from AMD, Arm, Intel, or Qualcomm.

To check on RISC-V’s progress as its cores move toward higher performance targets, I’m comparing with Arm’s Cortex A73 and Intel’s Goldmont Plus. Both have comparably sized out-of-order execution engines.

SPEC CPU2017

SPEC is an industry standard benchmark distributed in source code form. It deliberately attempts to test both hardware and the compilers that target it. As before, I’m building SPEC CPU2017 with GCC 14.2.0. For P550, I used -march=rv64imafdc_zicsr_zifencei_zba_zbb -mtune=sifive-p400-series. For C910, I used -march=rv64imafdc_xtheadvector -mtune=generic-ooo. GCC doesn’t have optimization models for either RISC-V core, though I suspect that doesn’t matter much.

The two RISC-V cores fall short of Arm’s Cortex A73 and well short of Intel’s Goldmont Plus. Clock speed differences play a large role, and the EIC7700X is especially terrible in that respect. Eswin chose to clock its P550 cluster at just 1.4 GHz, even though the chip’s datasheet notes the CPU cores can run at “up to 1.8 GHz”. C910 does better at 1.85 GHz, though that’s still low in absolute terms. Unfortunately for T-HEAD, C910’s higher clock speed does not let it gain a performance lead against the P550. I’m still working on dissecting C910, but at first glance I’m not impressed with how T-HEAD balanced C910’s out-of-order execution engine and memory subsystem.

Cortex A55 and A53 provide perspective on where in-order execution sits today. Neither core can get anywhere close to high performance client designs, but C910 and P550 have relatively small out-of-order engines. They also run at low clock speeds. Mediatek’s Genio 1200 has a particularly strong A55 implementation, with higher clock speeds and better DRAM latency than C910 and P550. Its Cortex A55 cores are able to catch C910 and P550 without full out-of-order execution.

AMD expects to exceed Pentium performance at the same clock rate by about 30%

Microprocessor Report

This isn’t the first time an in-order core has done surprisingly well against out-of-order ones. Back in 1996, AMD’s K5 featured 4-wide out-of-order execution and better per-clock performance than Intel’s 2-wide, in-order Pentium. Intel clocked the Pentium more than 30% faster, and came out on top. Today’s situation with C910 and P550 against A55 has some parallels. A55 doesn’t win everywhere though. It loses to both RISC-V cores in SPEC CPU2017’s floating point suite. And a less capable in-order core like A53 can’t keep up despite running at higher clocks.

Across SPEC CPU2017’s integer workloads, C910 fails to win any test against the lower clocked EIC7700X. T-HEAD does better in the floating point suite, where it wins in a number of tests, but fails to take an overall performance lead. Meanwhile, A73 and Goldmont Plus do an excellent job of translating their higher clock speeds into a real advantage.

IPC data from hardware performance counters can show how well cores are utilizing their pipeline widths. IPC behavior tends to vary throughout a workload, but generally core width becomes more of a likely limitation as average IPC approaches core width. Conversely, low IPC workloads are less likely to care about core width, and might benefit from better branch prediction or lower memory access latency.

In SPEC CPU2017’s integer workloads, 548.exchange2 and 525.x264 are high IPC workloads. Arm’s 2-wide A73 is at a disadvantage in both. 3-wide cores like P550 and Goldmont Plus can stretch their legs, pushing up to and beyond 2 IPC. C910 is also 3-wide, but struggles to take advantage of its core width.

SPEC’s floating point suite has a few high IPC tests too, like 538.imagick and 508.namd. Low power cores don’t seem to do so well in these tests, unlike high performance cores like AMD’s Zen 5 or Intel’s Redwood Cove. Goldmont Plus gets destroyed in 538.imagick. But Intel’s low power core does well enough across other tests to let its high clock speed show through, and translate to a large overall lead. C910 again fails to impress. P550 somewhat makes up for its low clock speed with good IPC, though it’s really hard to compete from 1.4 GHz.

7-Zip File Compression

7-Zip is a file compression utility. It almost exclusively uses scalar integer instructions, so floating point and vector execution isn’t important in this workload. I’m compressing a 2.67 GB file using four cores, with 7-Zip set to use four threads.

C910 and P550 turn in a similar performance. Both fall slightly behind the in-order Cortex A55, again showing how well fed, higher clocked in-order cores can still pack a punch. For perspective though, I’ve included A55 cores from two cell phone chips.

In Qualcomm’s Snapdragon 855 and 670, A55 suffers from much higher DRAM latency and runs at lower clocks. Both fall behind P550 and C910, showing how performance for the same core can vary wildly depending on the chip it’s implemented in.

Not sure I trust A55’s performance counters, because instruction counts are similar to A73 but it’s slower?

7-Zip is relatively challenging from an IPC perspective, with a lot of branches and cache misses. P550 gets reasonably good utilization out of its pipeline.

Calculate SHA256 Checksum

Hash functions are used to ensure data integrity. Most desktop CPUs have more than enough power to hash gigabytes upon gigabytes of data without taking too long. Low power CPUs are a different story. I also find this checksum calculation workload interesting because it often reaches very high IPC on CPUs that don’t have specific instructions to accelerate it. I’m simply using Linux’s sha256sum command on the same 2.67 GB file fed into 7-Zip above.

Cortex A55 takes a surprisingly large lead. sha256sum’s instruction stream seems to mostly consist of math and bitwise instructions, with few memory accesses or branches. That creates an ideal environment for in-order cores. Impressively, A55 manages higher IPC than A73.

3-wide cores also have a field day. P550 and Goldmont Plus sustain well over 2 IPC. C910 doesn’t enjoy the field day so much, but still gets close to 2 IPC.

Both RISC-V cores execute more instructions to get the same work done. x86-64 represents this workload more efficiently, and aarch64 is able to finish using even fewer instructions.

Collecting performance monitoring data comes with some margin of error, because tools like perf must interrupt the targeted workload to read out performance counter values. Hardware performance counters also aren’t validated to the same strict standard as other parts of the core, because results only have to be good enough to properly inform software tuning decisions. Still, the gap between P550 and C910 is larger than I’d expect. P550 executes more instructions to finish the same work, and I’m not sure why.

x264 Encode

Software video encoding provides better compression efficiency than hardware video encoders, but is computationally expensive. SPEC CPU2017 represents software video encoding with the “525.x264” subtest, but in practice libx264 uses handwritten assembly kernels to take advantage of CPU vector extensions. Assembly of course can’t make it into SPEC – which needs to be fair to different ISAs and can’t use ISA specific code.

Unfortunately real life is not fair. Both CPU vector capabilities and software support for them can affect performance. x264 prints out CPU capabilities it can take advantage of:

C910 supports RVV 0.7.1, but libx264 does not have any assembly code written for any RISC-V ISA extension. Performance is a disaster for the RISC-V contenders, with A73 and Goldmont Plus landing on a different performance planet. Even A55 is very comfortably ahead.

Both RISC-V cores top the IPC chart, executing more instructions per cycle on average than the x86-64 or aarch64 cores I tested. P550 is especially impressive, pushing close to its 3 IPC limit. C910 doesn’t do as well, but 1.38 IPC is still respectable.

But IPC in isolation is misleading. Clock speed is an obvious factor. Instruction counts are another. In x264, the two RISC-V cores have to execute so many more instructions to get the same work done that IPC becomes meaningless.

Building a strong ecosystem takes a long time. RISC-V will need software developers to take advantage of vector extensions. But before that happens, RISC-V hardware needs to show those developers that it’s powerful enough to be worth the effort.

Final Words

SiFive’s Performance P550 and T-HEAD’s Xuantie C910 are both notable for featuring out-of-order execution in the RISC-V scene. Both are plagued by low clock speeds, even against older aarch64 and x86-64 cores. Arm’s Cortex A73 and Intel’s Goldmont Plus are good demonstrations of how even small out-of-order execution engines can pull a large lead against in-order cores. P550 and C910 don’t always do that.

Between the two RISC-V cores, P550 has a well balanced out-of-order execution engine. It’s able to sustain higher IPC and often keep pace with C910. In some easier workloads, P550 is able to get very close to its core width limits. SiFive has competently balanced P550’s out-of-order execution engine. C910 in comparison is less well balanced, and often fails to translate its higher clock speed into a real performance lead. I wonder if P550 has a lot more potential behind it if an implementer runs it at higher clock speeds, and backs it up with low DRAM latency.

From a hardware perspective, RISC-V is some distance away from competing with Arm or x86-64. SiFive has announced higher performance RISC-V designs, so the RISC-V world is making progress on that front. Beyond hardware though, RISC-V has a long way to go from the software perspective. RISC-V is a young instruction set and some of its extensions are younger still. These extensions can be critical to performance in certain applications (like video encoding), and building up that software ecosystem will likely take years. While I’m excited by the idea of an instruction set free from patents and licensing restrictions, RISC-V is still in its infancy.


Inside SiFive’s P550 Microarchitecture

2025-01-27 06:06:04

RISC-V is a relatively young and open source instruction set. So far, it has gained traction in microcontrollers and academic applications. For example, Nvidia replaced the Falcon microcontrollers found in their GPUs with RISC-V based ones. Numerous university projects have used RISC-V as well, like Berkeley’s BOOM. However, moving RISC-V into more consumer-visible, higher performance applications will be an arduous process. SiFive plays a key role in pushing RISC-V CPUs toward higher performance targets, and occupies a position analogous to that of Arm (the company). Arm and SiFive both design and license out IP blocks. The task of creating a complete chip is left to implementers.

Rendering of the P550 dev board from the datasheet

By designing CPU blocks, both SiFive and Arm can lower the cost of entry to building higher performance designs in their respective ISA ecosystems. To make that happen within the RISC-V ecosystem though, SiFive needs to develop strong CPU cores. Here, I’ll take a look at SiFive’s P550. This core aims for “30% higher performance in less than half the area of a comparable Arm Cortex A75.”

Just as with Arm’s cores, P550’s performance will depend heavily on how it’s implemented. For this article, I’m testing the P550 as implemented in the Eswin EIC7700X SoC. This SoC has a 1.4 GHz, quad core P550 cluster with 4 MB of shared cache. The EIC7700X is manufactured on TSMC’s 12nm FFC process. The SiFive Premier P550 Dev Board that hosts the SoC has 16 GB of LPDDR5-6400 memory. For context, I’ve gathered some comparison data from the Qualcomm Snapdragon 670 in a Pixel 3a. The Snapdragon 670 has a dual core Arm Cortex A75 cluster running at 2 GHz.

Overview

The P550 is a 3-wide out-of-order core with a 13 stage pipeline. Out-of-order execution lets the core move past a stalled instruction to extract instruction level parallelism. It’s critical for achieving high performance because cache and memory latency can be significant limiters for modern CPUs. The P550 is far from SiFive’s first out-of-order design. That distinction belongs to SiFive’s U87, which is also a 3-wide out-of-order design. P550 comes several years after, and should be more mature.

Arm’s Cortex A75 is also a 3-wide out-of-order core. It’s an improved version of Arm’s Cortex A73, and carries forward distinguishing features like out-of-order retirement. Anandtech says A75 has an 11-13 stage pipeline, though their diagram suggests the minimum mispredict penalty is likely closer to 11 cycles.

Like SiFive’s P550, the Arm Cortex A75 has a modestly sized out-of-order execution engine. Both are far off the high performance designs we see from Intel and AMD today, and are optimized for low power and area.

Branch Prediction

Fast and accurate branch prediction is critical to both performance and power efficiency. SiFive has given the P550 a 9.1 KiB branch history table, which helps the core correlate past branch behavior with branch outcomes. From an abstract test with various numbers of branches that are taken/not-taken in increasingly long random patterns, the P550’s branch predictor looks to have reasonably good pattern recognition capabilities. It falls well short of high performance cores, but that’s to be expected.

Compared to Arm’s Cortex A75, the P550 can handle longer patterns for a small number of branches. The gap however narrows as more branches come into play.

Branch predictor speed can matter too, especially in high IPC code with a lot of branches. The P550 appears to have a 32 entry BTB capable of handling taken branches with no bubbles. Past that, the core can handle a taken branch every three cycles as long as the test fits within 32 KB. Likely, P550 doesn't have another BTB level. If a branch misses the 32 entry BTB, the core simply calculates the branch's destination address when it arrives at the frontend. If so, the P550's 32 KB L1 instruction cache has 3 cycles of latency.

Arm’s Cortex A75 also uses what appears to be a single small BTB level. Both cores lack the large decoupled BTBs that high performance cores tend to have.

P550 uses a 16 entry return stack to predict returns from function calls. A75 seems to have a return stack with 42 entries, because latency per call+return pair doesn’t hit an inflection point until I get past that. Even with the larger return stack, A75’s higher 2 GHz clock speed lets it achieve similar performance for the common case of a return stack hit.

When return stack capacity is exceeded, P550 sees a sharp spike in latency. That contrasts with A75’s more gentle increase in latency. Perhaps A75 only mispredicts for the return address that got pushed out of the stack. P550 possibly has less graceful handling for return stack overflows, making it mispredict many of the returns even when the test only exceeds return stack capacity by a few entries.
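Probing return stack depth comes down to timing a chain of non-inlined nested calls; once nesting depth exceeds the predictor's capacity, time per call+return pair jumps. Below is a minimal C sketch of that idea, not the exact test harness used here. The noinline attribute keeps the compiler from flattening the chain.

    // Returns after recursing to the requested depth. The +1 prevents a tail call,
    // so each level really does push and pop a return address.
    __attribute__((noinline)) long call_chain(int depth) {
        if (depth <= 0) return 1;
        return call_chain(depth - 1) + 1;
    }

    // Time many iterations of call_chain(depth) for depth = 1..64 and divide by the
    // number of call+return pairs. A sharp jump in time per pair marks return stack capacity.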

Fetch and Decode

The P550’s frontend comes with a parity protected 32 KB 4-way set associative instruction cache, capable of delivering enough bandwidth (12 bytes/cycle) to feed the 3-wide decoder downstream. The frontend can sustain 3 IPC as long as code fits within L1i. Instruction bandwidth takes a gradual drop past that. From L2, the core can still maintain a reasonable 2 IPC. Instruction bandwidth from L3 is good for 1 IPC, though you’d really hope L2 code misses are rare on any core.

Arm chose to give A75 a larger 64 KB instruction cache, giving it a better chance of satisfying instruction-side memory accesses from L1i. On a L1i miss, instruction bandwidth takes a sharp decline. Much of that is down to implementation decisions. Qualcomm gave the Snapdragon 670 a 1 MB system level cache. System level caches are placed closer to the memory controller than compute blocks. Therefore, they usually aren’t optimized to provide high performance for any one block. In contrast, the 4 MB L3 on the EIC7700X is tightly tied to the CPU cluster.

Fetched instructions are decoded into micro-ops, which pass through the renamer and head to the out-of-order backend.

Out-of-Order Execution

SiFive’s P550 has somewhat higher reordering capacity than Arm’s Cortex A75. However, Arm can make its out-of-order execution buffers go a bit further thanks to the out-of-order retirement trick carried forward from A73. On A75, I’m using an incomplete branch along with an incomplete load to block retirement.

Both cores have plenty of register file capacity compared to their ROB size, though other structures like memory ordering queues can be a bit thin. P550 and A75 have nowhere near as much reordering capacity as current Intel and AMD cores, or even more recent Arm cores like the Cortex A710. They're more comparable to Intel's Core 2 or Goldmont Plus. Still, a modest out-of-order execution window is far better than in-order execution.

Execution resources on both cores are allocated with low power and low area goals in mind. Between the two, P550 has a more flexible integer port setup and more scheduling capacity to feed those ports. Cortex A75 isn't far behind though, with two ALU ports and a separate branch port. Scalar integer workloads often have a lot of branches, and a branch port doesn't need a writeback path to the register file. Arm's setup is likely cheaper, while providing almost as much performance.

On the floating point side, both cores have two FP ports capable of handling the most common operations. P550 handles FP adds, multiplies, and fused multiply-adds (FMAs) with 4 cycle latency, suggesting the core uses FMA units to handle all of those operations. After all, an add is simply an FMA with a multiplier of 1, and a multiply is an FMA with the addend set to 0. A75 has 3 cycle latency for FP adds and multiplies, while FMAs execute over 5 cycles. Arm may use separate execution units for FMAs and adds/multiplies. Or, Arm might have an FMA unit with optimized paths for doing just multiplies or just adds. Cortex A75's FPU also supports vector execution, giving it a leg up over P550.
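Latency figures like these typically come from a dependent chain of operations, where each result feeds the next so the loop runs at pipeline latency rather than throughput. A rough C sketch follows; it assumes the compiler emits a hardware fused multiply-add for fma(), which is worth confirming in the disassembly.

    #include <math.h>

    // Each fma() depends on the previous result, so iterations complete at FMA latency.
    double fma_latency_chain(long iters) {
        double acc = 1.0;
        for (long i = 0; i < iters; i++)
            acc = fma(acc, 1.0000000001, 1e-12);  // serial dependency through acc
        return acc;  // time the loop, divide by iters, multiply by clock speed for cycles
    }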

Microbenchmarking suggests A75 has 31 scheduler entries available for floating point operations. Anandtech says A75 has two 8-entry floating point schedulers, but my measurements disagree. P550 has 28 total scheduler entries for FP operations. This could be a dual ported unified scheduler, or two 14 entry schedulers. I haven’t found an operation that only goes to one port.

Memory Subsystem

P550 is a small, low power design and doesn't need the high throughput memory subsystems found on Intel, AMD, or Arm's big cores. Memory operations first have their addresses generated by two address generation units (AGUs). One handles loads, and the other handles stores. Both appear to be backed by relatively large schedulers, letting the core handle workloads with an uneven balance of loads and stores. Cortex A75 also has two AGUs, but each of A75's AGU pipes can handle both loads and stores. That more flexible arrangement makes a lot of sense because loads usually greatly outnumber stores, so P550's load AGU may be very busy while its store AGU sits mostly idle.

…we observed that each of the two load/store ports were used about 20 percent of the time. We surmised that changing to one dedicated load port and one dedicated store port should not have a large effect on performance…This proved to be the case, with less than a 1 percent performance loss for this change.

David B. Papworth, Tuning the Pentium Pro Microarchitecture

Curiously, Intel evaluated the same tradeoff when they designed their Pentium Pro back in 1996. They found using a less flexible load/store setup only came with a minor performance impact. SiFive may have come to the same conclusion. P550 does have more reordering capacity than the Pentium Pro though, and thus would be better able to feed its execution pipes (including the AGUs) in the face of cache misses.

AGUs generate program-visible virtual addresses, which have to be translated to physical addresses. P550 uses a two-level TLB setup to speed up address translation. The first level TLBs are fully associative, meaning that any entry can cache a translation for any address. However, both the data and instruction side TLBs are relatively small with only 32 entries. A larger 512 entry L2 TLB serves both instruction and data L1 TLB misses. On the data side, getting a translation from the L2 TLB adds 9 cycles of latency. Arm’s A75 has larger TLBs, and a lower 5-6 cycle penalty for hitting the L2 TLB.

Before accessing cache, load addresses have to be checked against older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned. Any unaligned access, dependent or not, confuses P550 for hundreds of cycles. Worse, the unaligned loads and stores don’t proceed in parallel. An unaligned load takes 1062 cycles, an unaligned store takes 741 cycles, and the two together take over 1800 cycles.

This terrible unaligned access behavior is atypical even for low power cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of dependent accesses that are both misaligned.

Digging deeper with performance counters reveals that each unaligned load instruction results in ~505 executed instructions. P550 almost certainly doesn't have hardware support for unaligned accesses. Rather, it's likely raising a fault and letting an operating system handler emulate the access in software.
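Reproducing the penalty is straightforward: construct a pointer that is never 4-byte aligned and load through it in a loop. Here's a hedged C sketch; note that accessing a char buffer through a uint32_t pointer is technically undefined behavior, so a real test should check the generated assembly to confirm an ordinary word load is emitted.

    #include <stdint.h>

    static char buf[128] __attribute__((aligned(64)));

    // volatile keeps the compiler from optimizing the loads away or splitting them itself
    uint64_t unaligned_load_loop(long iters) {
        volatile uint32_t *p = (volatile uint32_t *)(buf + 1);  // offset 1: never 4B aligned
        uint64_t sum = 0;
        for (long i = 0; i < iters; i++)
            sum += *p;  // on the EIC7700X, each of these appears to trap and get emulated
        return sum;
    }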

Core-Private Caches

With a modest out-of-order execution engine and no vector capability, P550 doesn't demand a lot of bandwidth from its memory subsystem. Latency however is very important, because P550 doesn't have the massive reordering capacity that higher performance cores use to tolerate memory latency. Like many AMD, Intel, and Arm designs, each P550 core has its own L1 and L2 caches while L3 is shared. All levels of the data-side cache hierarchy are ECC protected.

P550’s 32 KB L1 data cache is 4-way set associative, and can service a load and a store every cycle, assuming no alignment issues. Maximum bandwidth is thus 16 bytes per cycle, achieved with an even mix of reads and writes. Latency is 3 cycles, matching many low-clocked cores.

Using the A73 for comparison because I don’t have a way to use hugepages on Android

The L2 cache is 256 KB and 8-way set associative. It’s built from two banks, and has 13 cycle latency. This combination of size and latency is a bit dated, as Arm and AMD have both implemented larger L2 caches with lower latency. However, P550’s L2 is still well positioned to catch L1 misses and insulate the core from L3 latency. L2 bandwidth is mediocre at 8 bytes per cycle, a limitation that applies to both reads and writes. While it’s not an impressive figure, 8 bytes per cycle should be adequate considering the core’s lack of vector capability and single load AGU.

Arm’s Cortex A75 enjoys higher cache bandwidth, thanks both to higher clock speeds and more per-cycle bandwidth at each level.

L3 Cache and Interconnect

P550’s interconnect has to be modular and scalable to address the largest possible market. A consumer router or set-top box may be fine with 2-4 cores, while a small edge server might benefit from higher core counts. P550 can be instantiated in clusters of up to four cores. Presumably, cores within a cluster share a single external interface. Multiple clusters sit on a “Coherent System Fabric”, which sends traffic coming off P550 clusters to the appropriate destination. From the EIC7700X datasheet, this “Coherent System Fabric” is likely a crossbar.

Cacheable memory accesses head to a L3 cache, which can be shared across multiple P550 clusters and is banked to meet the bandwidth demands of multiple cores. SiFive provides 1 MB, 2 MB, 4 MB, and 8 MB L3 capacity options. The largest 8 MB option has eight banks and is reserved for multi-cluster configurations. The EIC7700X we’re looking at has a 4 MB L3 with four banks. Bank count thus matches core count.

Microbenchmarking indicates the L3 can give 8 bytes per cycle of bandwidth to each core. In total, the EIC7700X’s quad core P550 cluster has about 43.88 GB/s of L3 bandwidth. L3 latency is respectable at about 38 cycles, which isn’t bad considering the cache’s flexibility. For comparison, Arm’s Cortex A73 uses a simpler two-level cache setup. A73’s L2 serves double duty as the first stop for L1D misses and as a large last level cache. That means compromise, so the 1 MB L2 has less capacity than EIC7700X’s L3 while having better latency at 25 cycles.

Very likely, each cluster has a single 32B/cycle (256-bit) interface to the crossbar.

L3 misses head out to a Memory Port. Depending on implementation goals, a P550 multi-cluster complex can have one or two memory ports, each of which can be 128 or 256 bits wide. Each Memory Port can track up to 128 pending requests to enable memory level parallelism. Less common requests to IO or non-cacheable addresses get routed to one of two 64-bit System Ports or a 64-bit Peripheral Port. Implementers can also use one or two Front Ports, which gives other agents coherent memory access through the multi-cluster complex.

Eswin has chosen to use a single Memory Port, likely 128 bits wide, and two System Ports. The first System Port’s address space includes a 256 MB PCIe BAR space, PCIe configuration space, and 4 MB of ROM. The second System Port accesses the DSP’s SRAM, among other things.

Those ports connect to the on-chip network, which uses the AXI protocol. At this point, everything is up to the implementer and is out of SiFive’s hands. For the EIC7700X, Eswin used two DDR controllers, each with two 16-bit sub-channels. On the SiFive Premier P550 Dev Board, they’re connected to 16 GB of LPDDR5-6400. The memory controller runs at 1/4 of the SDRAM clock, which would be 800 MHz. DRAM load-to-use latency isn’t great at 194 ns, which is about 165 ns beyond L3 latency. It’s impossible to tell how much of this latency comes from traversing the on-chip network versus the memory controller. But either way, memory latency on the EIC7700X is substantially worse than other LPDDR5 setups like Intel’s Meteor Lake or AMD’s Van Gogh (Steam Deck SoC).
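Latency numbers like these come from a dependent pointer chase, where each load's address comes from the previous load so nothing can be overlapped. A simplified C version of the idea is below; a real test would also want hugepages where available to keep TLB misses out of the picture.

    #include <stdint.h>
    #include <stdlib.h>

    // Build a random cyclic permutation (Sattolo's algorithm) so hardware prefetchers
    // can't follow it, then chase pointers through it. Wrap the chase loop with a timer;
    // time divided by hop count approximates load-to-use latency for that footprint.
    void *pointer_chase(size_t bytes, long hops) {
        size_t n = bytes / sizeof(void *);
        void **arr = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i, t = idx[i];
            idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++) arr[idx[i]] = &arr[idx[(i + 1) % n]];
        free(idx);
        void **p = arr;
        for (long i = 0; i < hops; i++) p = (void **)*p;  // each load depends on the last
        return (void *)p;  // returning p defeats dead code elimination; arr is leaked (sketch)
    }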

I measured 16.74 GB/s of DRAM bandwidth, which is well short of what LPDDR5-6400 should be capable of even with a 64-bit bus. The EIC7700X uses some bus width for inline ECC, but achieved bandwidth is well short of theoretical even with that in mind. Still, that sort of bandwidth should be adequate for a very low clocked quad core setup with no vector capability.

Core to Core Latency

Rarely, the interconnect may need to carry out cross-core transfers to maintain cache coherency. Eswin’s EIC7700X datasheet says the memory subsystem has a “directory based coherency manager”, meaning memory accesses check the directory to see whether they need to send a probe or can proceed as normal down the memory hierarchy. Compared to a broadcast strategy, using a directory keeps probe traffic under control as core counts go up.

Anandtech and other sites have used “core to core” latency tests to check how long it takes for a core to observe a write from another, and I’ve written my own version of the test. Although the exact methodology differs, results should be broadly comparable to those from Anandtech. Core to core latency on the EIC7700X is quite high.
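My version boils down to two threads ping-ponging a value through a shared cache line, with each thread pinned to its own core (pinning via pthread_setaffinity_np is omitted below for brevity). A stripped-down C sketch of the idea:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L
    static _Atomic long flag = 0;

    // Second thread: writes odd values after seeing the preceding even value
    static void *pong(void *arg) {
        (void)arg;
        for (long i = 1; i < 2 * ITERS; i += 2) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != i - 1) ;
            atomic_store_explicit(&flag, i, memory_order_release);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        struct timespec a, b;
        pthread_create(&t, NULL, pong, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        // Main thread: writes even values after seeing the preceding odd value
        for (long i = 2; i <= 2 * ITERS; i += 2) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != i - 1) ;
            atomic_store_explicit(&flag, i, memory_order_release);
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);
        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("~%.1f ns per one-way transfer\n", ns / ITERS / 2);  // each iteration is a round trip
        return 0;
    }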

Qualcomm’s Snapdragon 670 does much better, even when transfers happen between A75 and A55 cores.

While high core to core latency is unlikely to impact application performance, it does contribute to the feeling that SiFive's P550 isn't fully polished yet.

Final Words

RISC-V is a young instruction set and SiFive is a new player in the CPU design business. High performance CPU design is incredibly challenging, as shown by the small number of players in that space. P550 aims for “the highest performance within a tightly constrained power and area footprint” according to the company’s datasheet. It doesn’t go head-on against the likes of AMD’s Zen 5, Intel’s Lion Cove, or Qualcomm’s Oryon. P550’s out-of-order engine is closer in size to something like Intel’s Core 2 from over 15 years ago. Combine that with much lower clock speeds than even what Core 2 ran at, and P550 is really a low power core with modest performance. It’s best for light management tasks where an in-order core may be a tad sluggish.

More importantly, P550 represents a stepping stone in SiFive's journey to push RISC-V to higher performance targets. Not long ago, SiFive primarily focused on tiny in-order cores not suitable for much more than microcontrollers. With the P550, SiFive has built a reasonably well balanced out-of-order engine supported by a competent cache hierarchy. They got the basics down, and I can't emphasize enough how important that is. Out-of-order execution has proven essential for building high performance, general purpose CPUs, but is also difficult to pull off. In fact, both Intel and IBM tried to step away from out-of-order execution because it added so much complexity, only to find out Itanium and POWER6's strategies weren't great. With that in mind, SiFive's progress is promising.

Still, the P550 is just one step in SiFive's journey to create higher performance RISC-V cores. As a step along that journey, P550 feels more comparable to one of Arm's early out-of-order designs like Cortex A57. By the time A75 came out, Arm had already accumulated substantial experience in designing out-of-order CPUs. Therefore, A75 is a well polished and well rounded core, aside from the obvious sacrifices required for its low power and thermal budgets. P550 by comparison is rough around the edges. Its clock speed is low. Misaligned access penalties are ridiculous. Vector support is absent. Many programs won't hit the worst of P550's deficiencies, but SiFive has a long way to go.

In that respect, I can also see parallels between P550 and Intel’s first out-of-order CPU. The Pentium Pro back in the mid 1990s performed poorly when running 16-bit code. But despite its lack of polish in certain important areas, the core as a whole was well designed and gave Intel confidence in tackling more complex CPU designs. SiFive has since announced more sophisticated out-of-order designs like the P870. I’ll be excited to see those cores implemented in upcoming chips, because they look quite promising.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. SiFive Performance P550 Data Sheet

  2. EIC7700X Datasheet

Disabling Zen 5’s Op Cache and Exploring its Clustered Decoder

2025-01-24 07:07:39

Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. Each cluster serves one of the core’s two SMT threads. That creates parallels to AMD’s Steamroller architecture from the pre-Zen days. Zen 5 and Steamroller can both decode up to eight instructions per cycle with two threads active, or up to four per cycle for a single thread.

Despite these decoder layout similarities, Zen 5’s frontend operates nothing like Steamroller. That’s because Zen 5 mostly feeds itself off a 6K entry op cache, which is often large enough to cover the vast majority of the instruction stream. Steamroller used its decoders for everything, but Zen 5’s decoders are only occasionally activated when there’s an op cache miss. Normally that’d make it hard to evaluate the strength of Zen 5’s decoders, which is a pity because I’m curious about how a clustered decoder could feed a modern high performance core.

Thankfully, Zen 5’s op cache can be turned off by setting bit 5 in MSR 0xC0011021. Setting that bit forces the decoders to handle everything. Of course, testing with the op cache off is completely irrelevant to Zen 5’s real world performance. And if AMD wanted to completely serve the core using the decoders, there’s a good chance they would have gone with a conventional 8-wide setup like Intel’s Lion Cove or Qualcomm’s Oryon. Still, this is a cool chance to see just how Zen 5 can do with just a 2×4-wide frontend.
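On Linux, flipping that bit can be done with wrmsr from msr-tools, or programmatically through the msr character device. Below is a minimal C sketch of the read-modify-write, assuming the msr kernel module is loaded and the program runs as root. This is an undocumented MSR, so change it at your own risk, and repeat the write for every core.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    // Set or clear bit 5 of MSR 0xC0011021 on one core via /dev/cpu/<n>/msr
    int set_opcache_disable(int cpu, int disable) {
        char path[64];
        uint64_t val;
        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open msr"); return -1; }
        if (pread(fd, &val, sizeof(val), 0xC0011021) != sizeof(val)) { perror("rdmsr"); close(fd); return -1; }
        if (disable) val |= 1ULL << 5;
        else         val &= ~(1ULL << 5);
        if (pwrite(fd, &val, sizeof(val), 0xC0011021) != sizeof(val)) { perror("wrmsr"); close(fd); return -1; }
        close(fd);
        return 0;
    }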

Here, I’m testing Zen 5 using the AMD Ryzen 9 9900X, which implements 12 Zen 5 cores in two 6-core clusters. I did an in-socket swap from my Ryzen 9 7950X3D, which means the 9900X is fed off the same DDR5-5600 setup I had from 2023. Performance results won’t be directly comparable to Ryzen 9 9950X figures from a prior article, because the 9950X had faster DDR5-6000.

Microbenchmarking Instruction-Side Bandwidth

To get a handle on how the frontend behaves in pure decoder mode, I fill an array with NOPs (instructions that do nothing) and jump to it. AMD's fetch/decode path can handle 16 bytes per cycle per thread in this test. AMD's slides imply each fetch/decode pipe has a 32B/cycle path to the instruction cache, but I wasn't able to achieve that when testing with 8 byte NOPs. Maybe there's another pattern that achieves higher instruction bandwidth, but I'm mostly doing this as a sanity check to ensure the op cache isn't in play.
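In practice that means allocating an executable buffer, filling it with the canonical multi-byte NOP encodings, terminating it with a RET, and calling it in a loop while sweeping the buffer size. A simplified C sketch using 8-byte NOPs (my actual harness differs in the details):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef void (*nop_fn)(void);

    // Build a callable block of 8-byte NOPs ending in RET. Sweeping `bytes` past the
    // L1i and L2 sizes maps out instruction fetch bandwidth at each cache level.
    nop_fn make_nop_block(size_t bytes) {
        static const uint8_t nop8[8] = {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};
        uint8_t *buf = mmap(NULL, bytes + 1, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return NULL;
        size_t n = bytes / sizeof(nop8);
        for (size_t i = 0; i < n; i++)
            memcpy(buf + i * sizeof(nop8), nop8, sizeof(nop8));
        buf[n * sizeof(nop8)] = 0xC3;  // RET
        return (nop_fn)buf;
    }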

Shorter 4 byte NOPs are more representative of typical instruction length, and stress decoder throughput rather than instruction cache bandwidth. Turning off the op cache limits a single thread to 4 IPC, as expected. Running two threads in the core, and thus using both decode clusters, brings total throughput to 8 IPC.

Across both patterns, Zen 5’s dual fetch pipes provide a huge increase in L1i miss bandwidth. Likely, each fetch pipe maintains an independent queue of L1i miss requests, allowing increased memory level parallelism with both threads active.

Steamroller/Excavator Parallels?

AMD’s Excavator architecture is an iterative improvement over Steamroller, and carries forward Steamroller’s clustered decode scheme. Excavator behaves much like Zen 5 as long as code fits within the instruction cache. But if the test spills out of L1i, Zen 5 behaves far better. Where Excavator has a single L1i fetch path feeding two decode clusters, Zen 5 duplicates the fetch paths too. That’s another key difference between Zen 5 and Steamroller/Excavator, besides Zen 5 having an op cache.

With 8-byte NOPs though, Excavator can surprisingly give more L1i bandwidth to a single thread. That advantage goes away when both decoders pile onto the single fetch path, after which both Zen 5 and Excavator appear capped at 32 bytes per cycle.

Excavator does enjoy slightly better L2 code bandwidth when both threads are loaded, but the bandwidth increase is nowhere near the 2x that Zen 5 enjoys. Excavator really wants to avoid fetching code from L2, and has a very large 96 KB instruction cache to avoid that. I’m also amazed that AMD’s old architecture could sustain 8 NOPs/cycle through the module’s pipeline. Special thanks goes to cha0shacker for running tests on his Excavator system.

SPEC CPU2017

SPEC CPU2017 is an industry standard benchmark suite, and is a good way to get a general idea of performance. Disabling the op cache drops Zen 5’s score by 20.3% and 16.8% in the integer and floating point suites, respectively. That’s a substantially heavier penalty than what I saw with Zen 4, where turning off the op cache reduced performance by 11.4% and 6.6% in the integer and floating point suites, respectively. Zen 5 is capable of higher throughput than Zen 4, and feeding a bigger core through a 4-wide decoder is more difficult.

In the previous article, I also found Zen 4 suffered heavier losses from turning off the op cache when both SMT threads were active. Two threads expose more parallelism and enable higher throughput, making the 4-wide decoder even more of a bottleneck. Zen 5’s two decode clusters reverse the situation. Integer and floating point scores drop by 4.9% and 0.82% with the op cache off. For comparison, turning off Zen 4’s op cache leads to heavier 16% and 10.3% drops in the integer and floating point suites.

Zen 5 can reach very high IPC, especially when cache misses are rare and the core’s massive execution engine can be brought to bear. In those workloads, a single 4-wide decode cluster is plainly inadequate. Disabling the op cache in high IPC workloads like 548.exchange2 leads to downright devastating performance losses.

Lower IPC workloads are less affected, but overall penalties are generally heavier on Zen 5 than on Zen 4. For example, turning off Zen 4’s op cache dropped 502.gcc’s score by 6.35%. On Zen 5, doing the same drops the score by 13.7%.

Everything flips around once the second decoder comes into play with SMT. The op cache still often provides an advantage, thanks in part to its overkill bandwidth. Taken branches and alignment penalties can inefficiently use frontend bandwidth, and having extra bandwidth on tap is always handy.

Multiplying thread IPC by 2 because I’m running SPEC rate tests with two copies pinned to SMT siblings, and the core should spend negligible time in ST mode

But overall, the dual decode clusters do their job. Even high IPC workloads can be reasonably well fed off the decoders. Performance counters even suggest 525.x264 gains a bit of IPC in decoder-only mode, though that didn’t translate into a score advantage likely due to varying boost clocks.

In SPEC CPU2017’s floating point tests, the dual decoders pull off a surprising win in 507.cactuBSSN. IPC is higher, and SPEC CPU2017 gives it a score win too.

507.cactuBSSN is the only workload across SPEC CPU2017 where the op cache hitrate is below 90%. 75.94% isn’t a low op cache hitrate, but it’s still an outlier.

With both SMT threads active, op cache coverage drops to 61.79%. Two threads will have worse cache locality than a single thread, and thus put more pressure on any cache they have to share. That includes the op cache. Most other tests see minor impact because Zen 5’s op cache is so big that it has little trouble handling even two threads.

In 507.cactuBSSN, the op cache still delivers the majority of micro-ops. But 61.79% coverage likely means op cache misses aren’t a once-in-a-blue-moon event. Likely, the frontend is transitioning between decoder and op cache mode fairly often.

AMD’s Zen 5 optimization guide suggests such transitions come at a cost.

Excessive transitions between instruction cache and Op Cache mode may impact performance negatively. The size of hot code regions should be limited to the capacity of the Op Cache to minimize these transitions

Software Optimization Guide for the AMD Zen 5 Microarchitecture

I’m guessing more frequent op cache/decoder transitions, coupled with IPC not being high enough to benefit from the op cache’s higher bandwidth, combine to put pure-decoder mode ahead.

Besides cactuBSSN’s funny SMT behavior, the rest of SPEC CPU2017’s floating point suite behaves as expected. High IPC workloads like 538.imagick really want the op cache enabled. Lower IPC workloads don’t see a huge difference, though they often still benefit from the op cache. And differences are lower overall with SMT.

From a performance perspective, using dual 4-wide decoders wouldn’t be great as the primary source of instruction delivery for a modern core. It’s great for multithreaded performance, and can even provide advantages in corner cases with SMT. But overall, the two fetch/decode clusters are far better suited to act as a secondary source of instruction delivery. And that’s the role they play across SPEC CPU2017’s workloads on Zen 5.

Frontend Activity

Performance counters can provide additional insight into how hard the frontend is getting pushed. Here, I’m using event 0xAA, which counts micro-ops dispatched from the frontend and provides unit masks to filter by source. I’m also setting the count mask to 1 to count cycles where the frontend is sending ops to the backend from either source (op cache or decoders).
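For anyone wanting to reproduce this with perf_event_open, the raw config follows the AMD PERF_CTL layout: event select in bits 7:0, unit mask in bits 15:8, and counter mask in bits 31:24. The unit mask meanings assumed below (0x1 for decoder-sourced ops, 0x2 for op cache-sourced ops) are my reading of the description above, so check the Processor Programming Reference for your part before trusting the numbers. A minimal sketch:

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Open a raw counter: event 0xAA with a source-select unit mask and cmask=1,
    // so it counts cycles where at least one op was delivered from that source.
    static int open_dispatch_cycles_counter(uint64_t umask) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0xAA | (umask << 8) | (1ULL << 24);  // event | umask | cmask=1
        attr.exclude_kernel = 1;
        return syscall(__NR_perf_event_open, &attr, 0 /* this thread */, -1, -1, 0);
    }

    // Usage: read(fd, &count, sizeof(uint64_t)) before and after the region of interest,
    // then compare against a cycles counter to get "frontend busy" percentages.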

A single 4-wide decoder isn’t adequate for high IPC workloads, and performance monitoring data backs that up. The frontend has to work overtime in decoder-only mode, and it gets worse as IPC gets higher.

SPEC CPU2017's floating point tests make everything more extreme. The floating point suite has a surprising number of moderate IPC workloads that seem to give the decoders a really rough time. For example, 519.lbm averages below 3 IPC and normally doesn't stress Zen 5's frontend. But with the op cache off, the frontend is suddenly busy for over 90% of core active cycles.

SMT increases parallelism and thus potential core throughput, placing higher demand on frontend bandwidth. With the op cache on, frontend load goes up but everything is well within the frontend’s capabilities. With the op cache off, the decoders do an excellent job of picking up the slack. The frontend is a little busier, but the decoders aren’t getting pegged except in very high IPC outliers like 548.exchange2. And exchange2 is a bit of an unfair case because it even pushes the op cache hard.

The strange jump in decoder utilization across SPEC CPU2017’s floating point tests is gone with both SMT threads active. Likely, the two decode clusters together have enough headroom to hide whatever inefficiencies show up in single threaded mode.

Extremely high IPC workloads like 538.imagick do push the clustered decoder quite hard. But overall, the 2×4 decode scheme does well at handling SMT.

Cyberpunk 2077

Cyberpunk 2077 is a game that rewards holding down the tab key. It also features a built-in benchmark. Built-in benchmarks don't always provide the best representation of in-game performance, but do allow for more consistency without a massive time investment. To minimize variation, I ran the Ryzen 9 9900X with Core Performance Boost disabled. That caps clock speed at 4.4 GHz, providing consistent performance regardless of which core the code is running on, or how many cores are active. I've also capped my Radeon RX 6900XT to 2 GHz to minimize GPU-side boost clock variation. However, I'm testing at 1080P with medium settings, so the GPU side shouldn't be a limiting factor.

If you remember the previous article, the 7950X3D is actually 3.4% faster than this. That CPU swap certainly isn’t an upgrade from a gaming performance perspective

Turning the op cache on or off doesn’t make a big difference. That’s notable because games usually don’t run enough threads to benefit from SMT, especially on a high core count chip like the 9900X. However, games are also usually low IPC workloads and don’t benefit from the op cache’s high throughput. Cyberpunk 2077 certainly fits into that category, averaging just below 1 IPC. The Ryzen 9 9900X delivers just 0.17% better performance with the op cache enabled.

In its normal configuration, Zen 5 sources 83.5% of micro-ops from the op cache. Hitrate isn’t quite as high as most of SPEC CPU2017’s workloads, with the notable exception of 507.cactuBSSN. However, that’s still enough to position the op cache as the primary source of instruction delivery.

On average, the frontend uses the op cache for 16.3% of core active cycles, and the decoders for 5.3%. Zen 5’s frontend spends much of its time idle, as you’d expect for a low IPC workload. With the decoders carrying all the load, the frontend delivers ops over 27.4% of core active cycles.

The decoders have to work a bit harder to feed the core, but they still have plenty of time to go on a lunch break, get coffee, and take a nap before getting back to work.

Grand Theft Auto V

Grand Theft Auto V (GTA V) is an older game with plenty of red lights. Again, I’m running with Core Performance Boost disabled to favor consistency over maximum performance. Don’t take these results, even ones with the op cache enabled, to be representative of Zen 5’s stock performance.

Disabling the op cache basically makes no difference, except in the fourth pass, where the op cache gave a 1.3% performance boost. I don’t think that counts either, because no one will notice a 1.3% performance difference.

Zen 5’s op cache covers 77% of the instruction stream on average. Like Cyberpunk 2077, GTA V has a larger instruction footprint than many of SPEC CPU2017’s workloads. The op cache does well from an absolute perspective, but the decoders still handle a significant minority of the instruction stream.

Like Cyberpunk 2077, GTA V averages just under 1 IPC. That won’t stress frontend throughput. On average, the frontend delivered ops in op cache mode over 16.4% of active cycles, and did so in decoder mode over 7.9% of active cycles.

With everything forced onto the decoders, the frontend delivers ops over 29.5% of active cycles. Again, the frontend is busier, but those decoders still spend most of their time on break.

Cinebench 2024

Cinebench 2024 is a popular benchmark among enthusiasts. It’s simple to run, leading to a lot of comparison data floating around the internet. That by itself makes the benchmark worth paying attention to. I’m again running with Core Performance Boost disabled to provide consistent clock speeds, because I’m running on Windows and not setting core affinity like I did with SPEC CPU2017 runs.

Single threaded mode has the op cache giving Zen 5 a 13.5% performance advantage over decoder-only mode. That’s in line with many of SPEC CPU2017’s workloads. Cinebench 2024 averages 2.45 IPC, making it a much higher IPC workload than the two games tested above. Op cache hitrate is closer to Cyberpunk 2077, at 84.4%. Again, that’s lower than in most SPEC CPU2017 workloads.

Higher IPC demands more frontend throughput. Zen 5’s frontend was feeding the core from the op cache over 35.4% of cycles, and did so from the decoder over 11.5% of cycles. Frontend utilization is thus higher than in the two games tested above. Still, the frontend is spending most of its time on break, or waiting for data.

Turn off the op cache, and IPC drops to 2.15. The decoders see surprisingly heavy utilization. On average they’re busy over 69% of core active cycles. I don’t know what’s going on here, but the same pattern showed up over some of SPEC CPU2017’s floating point workloads. Cinebench 2024 does use a lot of floating point operations, because 42.8% of ops from the frontend were dispatched to Zen 5’s floating point side.

I ran Cinebench and game tests on Windows, where I use my own performance monitoring program to log results. I wrote the program to give myself stats updated every second, because I wanted a convenient way to see performance limiters in everyday tasks. I later added logging capability, which logs on the same 1-second intervals. That gives me per-second data, unlike perf on Linux where I collect stats over the entire run. I can also shake things up and plot those 1-second intervals with IPC on one axis, and frontend busy-ness on the other.

Cinebench 2024 exhibits plenty of IPC variation as the benchmark renders different tiles. IPC can go as high as 3.63 over a one second interval (with the op cache on), which can push the capabilities of a single 4-wide decoder. Indeed, a single 4-wide decoder cluster starts to run out of headroom during higher IPC portions of Cinebench 2024’s single threaded test.

As an aside, ideally I’d have even faster sampling rates. But writing a more sophisticated performance monitoring program isn’t practical for a free time project. And I still think the graph above is a cool illustration of how the 4-wide decoder starts running out of steam during higher IPC sections of the benchmark, while the op cache has plenty of throughput left on tap.

Of course no one runs a rendering program in single threaded mode. I haven’t used Maxon’s Cinema 4D before, but Blender will grab all the CPU cores it can get its hands on right out of the box. In Cinebench 2024’s multi-threaded mode, I see just a 2.2% score difference with op cache enabled or disabled. Again, the two decode clusters show their worth in a multithreaded workload.

Cinebench 2024 sees its op cache hitrate drop to 78.1% in multi-threaded mode, highlighting how multiple threads put extra pressure on cache capacity. During certain 1-second intervals, hitrate can drop below 70%. Even though the op cache continues to do most of the work, Cinebench 2024’s multithreaded mode taps into the decoders a little more than other workloads here.

Dividing counts for event 0xAA by total core cycles (4.4G * 12 cores) because active cycles are counted per-thread

Disabling the op cache dumps a lot more load onto the decoders, but having two decode clusters lets the frontend handle it well. The decoders still can’t do as well as the op cache, and the frontend is a bit busier in pure-decoder mode. But as the overall performance results show, the decoders are still mostly keeping up.

Final Words

Turning off Zen 5's op cache offers a glimpse of how a modern core may perform when fed from a Steamroller-style decoder layout. With two 4-wide decoders, single threaded performance isn't great, but SMT performance is very good. Single threaded performance is still of paramount importance in client workloads, many of which can't take advantage of high core counts, let alone SMT threads. Zen 5's op cache therefore plays an important role in letting Zen 5 perform well. No one would design a modern high performance core fed purely off a Steamroller-style frontend, and it's clear why.

But this kind of dual cluster decoder does have its place. Two threads present the core with a larger cache footprint, and that applies to the instruction side too. Zen 5’s op cache is very often large enough to cover the vast majority of the instruction stream, even with both SMT threads in use. However, there are cases like Cinebench 2024 where the decoders sometimes have work to do.

I think Zen 5's clustered decoder targets these workloads. It takes a page from Steamroller's over-engineered frontend, and uses it to narrowly address cases where core IPC is likely to be high and code locality is likely poor. Specifically, that's the SMT case. The clustered decoder is likely part of AMD's strategy to cover as many bases as possible with a single core design. Zen 5's improved op cache focuses on maximizing single threaded performance, while the decoders step in for certain multithreaded workloads where the op cache isn't big enough.

The test subject (victim?) for this article. It's ok, it's over now. I'll turn your op cache back on in a moment

In the moment, Zen 5’s frontend setup makes a lot of sense. Optimizing a CPU design is all about striking the right balance. Zen 5’s competitive performance speaks for itself. But if we step back a few months to Hot Chips 2024, AMD, Intel, and Qualcomm all gave presentations on high performance cores there. All three were eight-wide, meaning their pipelines could handle up to eight micro-ops per cycle in a sustained fashion.

Zen 5 is the only core out of the three that couldn’t give eight decode slots to a single thread. Intel’s Lion Cove might not take the gaming crown, but Intel’s ability to run a plain 8-wide decoder at 5.7 GHz should turn some heads. For now, that advantage doesn’t seem to show up. I haven’t seen a high IPC, low-threaded workload with a giant instruction-side cache footprint. But who knows what the future will bring.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Sizing up MI300A’s GPU

2025-01-21 04:42:52

AMD’s Instinct MI300A is a giant APU, created by swapping out two GPU chiplets (XCDs) for three CPU chiplets (CCDs). Even though MI300A integrates Threadripper-like CPU muscle, the chip’s main attraction is still its massive GPU compute power. Here, we’re going to size up MI300A’s GPU and see where it stands using comparison data from a few other GPUs, including MI300X.

Acknowledgments

Special thanks goes out to AMD and GIGABYTE along with their Launchpad service, who generously let Chips and Cheese play around with a massive quad-socket MI300A system in the form of the G383-R80-AAP1 for over 2 weeks. As always, our testing was our own.

We also have limited data from Nvidia's H100. H100 comes in both a PCIe and SXM5 version. I (Clam) rented an H100 PCIe cloud instance for the MI300X article from before. Cheese/Neggles rented an H100 SXM5 instance for benchmarking against the MI300A.

OpenCL Compute Throughput

MI300A may be down a couple XCDs compared to its pure GPU cousin, but it still has plenty of compute throughput. It’s well ahead of Nvidia’s H100 PCIe for just about every major category of 32-bit or 64-bit operations. H100’s SXM5 variant should slightly increase compute throughput, thanks to its higher SM count. But a 16% increase in SM count likely won’t do much to close the gap between H100 and either MI300 variant.

MI300A can achieve 113.2 TFLOPS of FP32 throughput, with each FMA counting as two floating point operations. For comparison, H100 PCIe achieved 49.3 TFLOPS in the same test.

Even though GPUs are vector throughput monsters, modern CPUs can pack pretty respectable vector throughput too. I’ve written a CPU-side multithreaded vector throughput test, mostly for fun. Using AVX-512, MI300A’s 24 Zen 4 cores can sustain 2.8 TFLOPS. That’s quite a bit compared to some older consumer GPUs. But MI300A’s GPU side is so massive that CPU-side throughput is a rounding error.
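The test is throughput bound by design: it keeps many independent FMA accumulators in flight so no iteration waits on the previous one. Here's a simplified single-threaded C sketch with AVX-512 intrinsics standing in for my actual test; compile with -mavx512f and run one instance per core, then sum the results.

    #include <immintrin.h>

    // Eight independent FP32 FMA chains; 8 vectors * 16 lanes * 2 FLOP = 256 FLOP per iteration
    float fma_throughput_loop(long iters) {
        __m512 a0 = _mm512_set1_ps(1.0f), a1 = a0, a2 = a0, a3 = a0,
               a4 = a0, a5 = a0, a6 = a0, a7 = a0;
        __m512 m = _mm512_set1_ps(1.0000001f), c = _mm512_set1_ps(1e-7f);
        for (long i = 0; i < iters; i++) {
            a0 = _mm512_fmadd_ps(a0, m, c); a1 = _mm512_fmadd_ps(a1, m, c);
            a2 = _mm512_fmadd_ps(a2, m, c); a3 = _mm512_fmadd_ps(a3, m, c);
            a4 = _mm512_fmadd_ps(a4, m, c); a5 = _mm512_fmadd_ps(a5, m, c);
            a6 = _mm512_fmadd_ps(a6, m, c); a7 = _mm512_fmadd_ps(a7, m, c);
        }
        __m512 s = _mm512_add_ps(_mm512_add_ps(a0, a1), _mm512_add_ps(a2, a3));
        s = _mm512_add_ps(s, _mm512_add_ps(_mm512_add_ps(a4, a5), _mm512_add_ps(a6, a7)));
        return _mm512_reduce_add_ps(s);  // GFLOPS = iters * 256 / elapsed nanoseconds
    }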

MI300A therefore strikes a very different balance between CPU and GPU size compared to a typical consumer integrated graphics solution like AMD’s Hawk Point. The Ryzen 7 PRO 8840HS’s Radeon 780M iGPU is no slouch with basic 32-bit operations, but CPU-side vector throughput is still significant.

Consumer GPU architectures de-emphasize FP64 throughput, so the 8840HS's eight Zen 4 cores can provide more throughput than the GPU side. Zen 4 also holds an advantage with integer multiplication, though it's not as big as with FP64. From a compute throughput perspective, MI300A is more like a GPU that happens to have a CPU integrated.

First Level Cache Bandwidth

Measuring cache bandwidth is another way to size up the MI300A. Here, I’m testing load throughput over a small 1 KB array. That should fit within L1 on any GPU. Normally, I use an image1d_buffer_t buffer because some older Nvidia GPUs only have a L1 texture cache, and plain global memory loads go through to L2. But AMD’s CDNA3 architecture doesn’t support texture operations. It’s meant to handle parallel compute after all, not rasterize graphics. Therefore, I’m using global memory accesses on both the MI300X and MI300A.
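The kernel itself is simple: every work-item streams loads from the same 1 KB window and accumulates them so the compiler can't throw the loads away. Here's a hedged OpenCL C sketch of the approach; the indexing details differ from my actual test.

    // 1 KB = 64 float4 elements; the & 63 wrap keeps all accesses inside that window
    __kernel void l1_bw(__global const float4 *src, __global float *out, int iters) {
        float4 acc = (float4)(0.0f);
        int offset = get_local_id(0) & 63;
        for (int i = 0; i < iters; i++)
            acc += src[(offset + i) & 63];
        out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;  // defeat dead code elimination
    }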

H100 PCIe tested here

L1 bandwidth paints a similar picture to compute throughput. MI300A's GPU is a slightly smaller MI300X, a bit like how AMD's HD 7950 is a slightly smaller HD 7970. Again, the massive scale of AMD's MI300 platform lets it leave Nvidia's H100 in the dust even after making room for 24 CPU cores. The consumer oriented RX 6900XT also turns in a good showing, because it has very high bandwidth L1 vector caches and runs at higher clock speeds.

Besides caches, GPUs offer a local memory space only visible within a workgroup of threads. AMD GPUs back this up with a Local Data Share within each Compute Unit, while recent Nvidia GPUs allocate local memory out of their L1 caches.

MI300 GPUs have tons of local memory bandwidth on tap, with both variants pushing past 60 TB/s. Everything else gets left in the dust.

Atomics Throughput

Atomic operations can be useful for multithreaded applications, because they let a thread guarantee that a series of low-level operations happens without interference from any other thread. For example, adding a value to a memory location involves reading the old value from memory, performing the add, and writing the result back. If any other thread accesses the value between those operations, it'll get stale data, or worse, overwrite the result.

GPUs handle global memory atomics using special execution units at the L2 slices. In contrast, CPU cores handle atomics by retaining ownership of a cacheline until the operation completes. That means GPU atomic throughput can be constrained by execution units at the L2, even without contention. MI300A has decent throughput for atomic adds to global memory, though it falls short of what I expected. AMD's GPUs have had 16 atomic ALUs per L2 slice since GCN. I get close to that on my RX 6900 XT, but not on MI300A.

Because atomic adds are a read-modify-write operation, 306.65 billion operations per second on INT32 values translates to 2.45 TB/s of L2 bandwidth. On the 6900XT, 397.67 GOPS would be 3.2 TB/s. Perhaps MI300A has to do something special to ensure atomics work properly across such a massive GPU. Compared to a consumer GPU like the 6900XT, MI300A’s L2 slices don’t have sole ownership of a chunk of the address space. One address could be cached within the L2 on different XCDs. A L2 slice carrying out an atomic operation would have to ensure no other L2 has the address cached.

OpenCL also allows atomic operations on local memory, though of course local memory atomics are only visible within a workgroup and have limited scope. But that also means the Compute Unit doesn’t have to worry about ensuring consistency across the GPU, so the atomic operation can be carried out within the Local Data Share. That makes scaling throughput much easier.

Now, atomic throughput is limited by the ALUs within MI300’s LDS instances. That means it’s extremely high, and leaves the RX 6900XT in the dust.
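The two cases boil down to where the atomic is resolved: atomic_add on a __global pointer goes to the memory-side atomic units, while atomic_add on a __local variable stays inside the Compute Unit's LDS. Hedged OpenCL C sketches of both tests follow; buffer sizing and iteration counts are purely illustrative.

    // Global atomics: resolved at the L2/memory-side atomic units
    __kernel void global_atomic_add(__global int *counters) {
        int slot = get_group_id(0) & 255;   // spread workgroups over a set of counters
        for (int i = 0; i < 4096; i++)
            atomic_add(&counters[slot], 1);
    }

    // Local atomics: resolved within the Compute Unit's Local Data Share
    __kernel void local_atomic_add(__global int *out) {
        __local int counter;
        if (get_local_id(0) == 0) counter = 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 0; i < 4096; i++)
            atomic_add(&counter, 1);
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == 0) out[get_group_id(0)] = counter;
    }

Dividing completed adds by elapsed time gives operations per second. For the global case, multiplying by 8 bytes (a 4 byte read plus a 4 byte write per INT32 atomic) gives the L2 bandwidth figures quoted above.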

FluidX3D

FluidX3D uses the lattice Boltzmann method (LBM) to simulate fluid behavior. It features special optimizations that let it use FP32 instead of FP64 while still delivering good accuracy in “all but extreme edge cases.” In FluidX3D’s built-in benchmark, MI300A’s GPU lands closer to its larger cousin than microbenchmarks and specification sheets would suggest. The MI300A server we tested uses air cooling, so its default power target is 550W per GPU. MI300A is just 1% slower than MI300X at that power setting. With a higher 760W power target, MI300A manages to lead its nominally larger GPU-only sibling by 4.5%.

Testing FluidX3D is complicated because the software is constantly updated. Part of MI300A’s lead could be down to optimizations done after we tested MI300X back in June. FluidX3D also tends to be very memory bound when using FP32, so similar results aren’t a surprise. Both MI300 variants continue to place comfortably ahead of Nvidia’s H100, though the SXM5 variant does narrow the gap somewhat. H100’s SXM5 variant uses HBM3, giving it 3.9 TB/s of DRAM bandwidth compared to 3.35 TB/s on the H100 PCIe, which uses HBM2e.

Because FluidX3D is so heavily limited by memory bandwidth, the program can use 16-bit formats for data storage. Its FP16S mode uses IEEE-754’s FP16 format for storage. To maximize accuracy, it converts FP16 values to FP32 before doing computation. GPUs often have hardware support for FP16-to-FP32 conversion, which minimizes computation overhead from using FP16S for storage.
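In OpenCL, that hardware conversion path is exposed through vload_half and vstore_half, which read IEEE FP16 from memory and hand the shader an FP32 value (and back). A tiny illustrative kernel of the FP16S-style pattern, not FluidX3D's actual code:

    // Storage stays FP16; all arithmetic happens in FP32
    __kernel void fp16s_scale(__global const half *src, __global half *dst, float scale) {
        size_t i = get_global_id(0);
        float v = vload_half(i, src);    // hardware-assisted half -> float conversion
        vstore_half(v * scale, i, dst);  // float -> half on the way back out
    }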

MI300A and H100 both see improved performance in FP16S mode, and MI300A continues to hold a large advantage.

FluidX3D also has a FP16C mode, which uses a custom FP16 format. The mantissa field gets an extra bit, which is taken away from the exponent field. That improves accuracy while using the same memory bandwidth as FP16S mode. However, GPUs won’t have hardware support for converting a custom FP16 format to FP32. Conversion therefore requires a lot more instructions, increasing compute demands compared to FP16S mode.

MI300A continues to stay ahead of H100 in FP16C mode. AMD has built a much larger GPU from both the compute and memory bandwidth perspective, so MI300A has an advantage regardless of whether FluidX3D wants more compute or memory bandwidth.

Power target also matters more in FP16C mode. Going from 550W to 760W improves performance by 12.4% in FP16C mode. In FP32 or FP16S mode, going up to 760W only gave 5.3% or 6.5% more performance, respectively. Likely, FP16C mode’s demand for more compute along with its still considerable appetite for memory bandwidth translates to higher power demand.

Calculate Gravitational Potential (FP64)

Large compute GPUs like MI300 and H100 differentiate themselves from consumer GPUs by offering substantial FP64 throughput. Client GPUs have de-prioritized FP64 because graphics rendering doesn’t need that level of accuracy.

This is a self-penned workload by Clam (Chester), based on a long running workload from a high school internship. The program takes a FITS image from the Herschel Space Telescope with column density values, and carries out a brute force gravitational potential calculation. The original program written back in 2010 took a full weekend to run on a quad core Nehalem system, and was written in Java. I remember I started it before leaving the office on a Friday, and it finished after lunch on Monday. Now, I’ve hacked together a GPU accelerated version using OpenCL.
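My OpenCL port isn't public, but the structure is a textbook brute-force O(N²) sum: one work-item per pixel, each accumulating contributions from every other pixel in FP64. Here's a rough sketch of what such a kernel looks like; variable names and the exact potential formula are illustrative, not lifted from my program.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    // Each work-item owns one pixel and accumulates -G * sum(m_j / r_ij) over all others
    __kernel void grav_potential(__global const double *mass,
                                 __global const double *xpos,
                                 __global const double *ypos,
                                 __global double *potential,
                                 int n, double g_const)
    {
        int i = get_global_id(0);
        if (i >= n) return;
        double xi = xpos[i], yi = ypos[i];
        double sum = 0.0;
        for (int j = 0; j < n; j++) {          // O(N^2) FP64 adds, multiplies, and divides
            if (j == i) continue;
            double dx = xpos[j] - xi;
            double dy = ypos[j] - yi;
            sum += mass[j] / sqrt(dx * dx + dy * dy);
        }
        potential[i] = -g_const * sum;
    }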

AMD’s MI300A completes the workload in 17.5 seconds, which is a bit shorter than a weekend. I wish I had this sort of compute power back in 2010. Nvidia’s H100 PCIe manages to finish in 54 seconds. That’s still very fast, if not as fast as MI300A. Both datacenter GPUs are faster than the integrated GPU on the Ryzen 9 7950X3D, which takes more than four hours to finish the task.

Test was run on GPU 3. MI300A was unable to sustain 2.1 GHz even at the 760W power target

Comparing the three MI300 results is interesting too. MI300A’s GPU is smaller than the MI300X, and that’s clear from FP64 synthetic tests. When measuring compute throughput, I keep all values in registers. But a workload like this exercises the cache subsystem too, and data movement costs power. I should barely hit DRAM because my input data from 2010 fits within Infinity Cache. But still, my code makes MI300A use all the power it can get its hands on. Going from a 550W to 760W power target increases performance by 12.4%, which curiously puts MI300A ahead of MI300X. I wasn’t expecting that.

On one hand, I’m impressed at how much a power target difference can make. On the other, a 38% higher power target only improves performance by 12.4%. I’m not sure if that’s a great tradeoff. That’s a lot of extra power to finish the program in 17.5 seconds instead of 19.7 seconds.

GROMACS

GROMACS simulates molecular dynamics. Here, Cheese/Neggles put MI300A and H100 SXM5 through their paces on the STMV benchmark system. Again, both MI300 variants land in different performance segments compared to Nvidia’s H100.

Again, bigger power targets translate to higher performance. In GROMACS, MI300A enjoys a 15% performance gain in 760W mode. It’s even larger than in my gravitational potential computation workload or FP16C mode on FluidX3D.

Final Words

AMD had to cut down MI300X's GPU to create the MI300A. 24 Zen 4 cores is a lot of CPU power, and occupies one quadrant on the MI300 chip. But MI300's main attraction is still the GPU. AMD's integrated graphics chips in the PC space are still CPU-first solutions with a big enough iGPU to handle graphics tasks or occasional parallel compute. MI300A by comparison is a giant GPU that happens to have a CPU integrated. Those 24 Zen 4 cores are meant to feed the GPU and handle code sections that aren't friendly to GPU architectures. It's funny to see a 24 core CPU in that role, but that's how big MI300A is.

On the GPU side, MI300A punches above its weight. Synthetics clearly show it’s a smaller GPU than MI300X, but MI300A can hold its own in real workloads. Part of that is because GPU workloads tend to be bandwidth hungry, and both MI300 variants have the same memory subsystem. Large GPUs are often power constrained too, and MI300A may be making up some of the difference by clocking higher.

At a higher level, AMD has built such a massive monster with the MI300 platform that it has no problem kicking H100 around, even when dropping some GPU power to integrate a CPU. It’s an impressive showing because H100 isn’t a small GPU by any means. Products like MI300A and MI300X show AMD now has the interconnect and packaging know-how to build giant integrated solutions. That makes the company a dangerous competitor.

And again, we'd like to give a massive shout out to GIGABYTE and their Launchpad service, without whom this testing would not have been possible!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.