2025-11-27 04:46:13
NUMA, or Non-Uniform Memory Access, lets hardware expose affinity between cores and memory controllers to software. NUMA nodes traditionally aligned with socket boundaries, but modern server chips can subdivide a socket into multiple NUMA nodes. It’s a reflection of how non-uniform interconnects get as core and memory controller counts keep going up. AMD designates their NUMA modes with the NPS (Nodes Per Socket) prefix.
NPS0 is a special NUMA mode that goes in the other direction. Rather than subdivide the system, NPS0 exposes a dual socket system as a single monolithic entity. It evenly distributes memory accesses across all memory controller channels, providing uniform memory access like in a desktop system. NPS0 and similar modes exist because optimizing for NUMA can be complicated and time intensive. Programmers have to specify a NUMA node for each memory allocation, and take care to minimize cross-node memory accesses. Each NUMA node only represents a fraction of system resources, so code pinned to a NUMA node will be constrained by that node’s CPU core count, memory bandwidth, and memory capacity. Effort spent getting an application to scale across NUMA nodes might be effort not spent on a software project’s other goals.
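On Linux, that kind of explicit placement usually goes through libnuma. Here’s a minimal sketch of what NUMA-aware allocation and pinning look like; the node number and allocation size are purely illustrative, and it assumes libnuma is installed:

```cpp
// Minimal sketch of NUMA-aware allocation on Linux with libnuma.
// Build with: g++ numa_example.cpp -lnuma
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    const size_t bytes = 1ull << 30;   // 1 GiB, illustrative
    const int node = 0;                // target NUMA node, illustrative

    numa_run_on_node(node);            // keep this thread on the node's cores
    void *buf = numa_alloc_onnode(bytes, node);  // back the buffer with the node's DRAM
    if (!buf) return 1;

    std::memset(buf, 0, bytes);        // touch pages so they are actually placed
    numa_free(buf, bytes);
    return 0;
}
```

Keeping allocation and execution on the same node avoids cross-node accesses, and NPS0 exists so software can skip exactly this kind of bookkeeping.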

A massive thank you goes to Verda (formerly DataCrunch) for providing an instance with 2 AMD EPYC 9575Fs and 8 Nvidia B200 GPUs. Verda gave us about 3 weeks with the instance to do with as we wished. While this article looks at the AMD EPYC 9575Fs, there will be upcoming coverage of the B200s found in the VM.
This system appears to be running in NPS0 mode, giving an opportunity to see how a modern server acts with 24 memory controllers providing uniform memory access.
A simple latency test immediately shows the cost of providing uniform memory access. DRAM latency rises to over 220 ns, giving a nearly 90 ns penalty over the EPYC 9355P running in NPS1 mode. It’s a high penalty compared to using the equivalent of NPS0 on older systems. For example, a dual socket Broadwell system has 75.8 ns of DRAM latency when each socket is treated as a NUMA node, and 104.6 ns with uniform memory access[1].
NPS0 mode does have a bandwidth advantage from bringing twice as many memory controllers into play. But the extra bandwidth doesn’t translate to a latency advantage until bandwidth demands reach nearly 400 GB/s. The EPYC 9355P seems to suffer when a latency test thread is mixed with bandwidth-heavy ones. A bandwidth test thread with just linear read patterns can achieve 479 GB/s in NPS1 mode. However, my bandwidth test produces low values on the EPYC 9575F because not all test threads finish at the same time. I avoid this problem in the loaded memory latency test by having the bandwidth load threads check a flag, which lets me stop all threads at approximately the same time.
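For the curious, the stop-flag mechanism is simple. The sketch below is illustrative rather than the actual test code: bandwidth threads spin through linear reads until a shared atomic flag flips, so they all stop at roughly the same instant.

```cpp
// Illustrative sketch of bandwidth load threads coordinated by a stop flag.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};

void bandwidth_thread(const uint64_t *arr, size_t len, uint64_t &sink) {
    uint64_t sum = 0;
    while (!stop.load(std::memory_order_relaxed)) {
        for (size_t i = 0; i < len; i += 8)   // linear read pattern
            sum += arr[i];
    }
    sink = sum;   // keep the compiler from optimizing the reads away
}

int main() {
    const size_t len = 64 * 1024 * 1024;      // elements; a real test gives each thread its own array
    std::vector<uint64_t> data(len, 1);
    std::vector<uint64_t> sinks(4);
    std::vector<std::thread> threads;
    for (int t = 0; t < 4; t++)
        threads.emplace_back(bandwidth_thread, data.data(), len, std::ref(sinks[t]));

    std::this_thread::sleep_for(std::chrono::seconds(2));   // measurement window
    stop.store(true, std::memory_order_relaxed);             // stop everyone together
    for (auto &t : threads) t.join();
    return 0;
}
```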
Per-CCD bandwidth is barely affected by the different NPS modes. Both the EPYC 9355P and 9575F use “GMI-Wide” links for their Core Complex Dies, or CCDs. GMI-Wide provides 64B/cycle of read and write bandwidth at the Infinity Fabric clock. On both chips, each CCD enjoys more bandwidth to the system compared to standard “GMI-Narrow” configurations. For reference, a GMI-Narrow setup running at a typical desktop 2 GHz FCLK would be limited to 64 GB/s of read and 32 GB/s of write bandwidth.
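As a worked example at the 2 GHz FCLK mentioned above (the GB/s figures imply GMI-Narrow widths of 32B/cycle for reads and 16B/cycle for writes):

```latex
\underbrace{64\,\tfrac{\mathrm{B}}{\mathrm{cycle}} \times 2\,\mathrm{GHz}}_{\text{GMI-Wide}} = 128\,\tfrac{\mathrm{GB}}{\mathrm{s}}\ \text{per direction}
\qquad \text{vs.} \qquad
\underbrace{32 \times 2}_{\text{Narrow read}} = 64\,\tfrac{\mathrm{GB}}{\mathrm{s}},\quad
\underbrace{16 \times 2}_{\text{Narrow write}} = 32\,\tfrac{\mathrm{GB}}{\mathrm{s}}
```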
Higher memory latency could lead to lower performance, especially in single threaded workloads. But the EPYC 9575F does surprisingly well in SPEC CPU2017. The EPYC 9575F runs at a higher 5 GHz clock speed, and DRAM latency is only one of many factors that affect CPU performance.
Individual workloads show a more complex picture. The EPYC 9575F does best when workloads don’t miss cache. Then, its high 5 GHz clock speed can shine. 548.exchange2 is an example. On the other hand, workloads that hit DRAM a lot suffer in NPS0 mode. 502.gcc, 505.mcf, and 520.omnetpp see the EPYC 9575F’s higher clock speed count for nothing, and the higher clocked chip underperforms compared to 4.4 GHz setups with lower DRAM latency.
SPEC CPU2017’s floating point suite also shows diverse behavior. 549.fotonik3d and 554.roms suffer in NPS0 mode as the EPYC 9575F struggles to keep itself fed. 538.imagick plays nicely to the EPYC 9575F’s advantages. In that test, high cache hitrates let the 9575F’s higher core throughput shine through.
NPS0 mode performs surprisingly well in a single threaded SPEC CPU2017 run. Some sub-tests suffer from higher memory latency, but enough other tests benefit from the higher 5 GHz clock speed to make up the difference. It’s a lesson about the importance of clock speeds and good caching in a modern server CPU. Those two factors go together, because faster cores only provide a performance advantage if the memory subsystem can feed them. The EPYC 9575F’s good overall performance despite having over 220 ns of memory latency shows how good its caching setup is.
As for running in NPS0 mode, I don’t think it’s worthwhile in a modern system. The latency penalty is very high, and bandwidth gains are minor for NUMA-unaware code. I expect those latency penalties to get worse as server core and memory controller counts continue to increase. For workloads that need to scale across socket boundaries, optimizing for NUMA looks to be an unfortunate necessity.
Again, a massive thank you goes to Verda (formerly DataCrunch), without whose support this article, and the upcoming B200 article, would not be possible!
2025-11-22 11:57:05
Hello you fine Internet folks,
Here at Supercomputing, the Gordon Bell Prize is announced every year. The prize recognizes outstanding achievement in high-performance computing applications.
One of this year’s finalists is the largest ever simulation of the Universe, run on the Frontier Supercomputer at the Department of Energy’s Oak Ridge National Laboratory (ORNL). The simulation was run using the Hardware/Hybrid Accelerated Cosmology Code, also known as HACC.

This simulation of the observable universe tracked over 4 trillion particles across 15 billion light years of space. The prior state-of-the-art observable universe simulations only went up to about 250 billion particles, roughly a sixteenth of the particle count of this new simulation.
But not only was this the largest universe simulation ever, the ORNL team also sustained over 500 Petaflops on nearly 9,000 of Frontier’s 9,402 nodes. As a reminder, Frontier achieves approximately 1,353 Petaflops on High Performance Linpack (HPL). This means that for this simulation the ORNL team extracted about 37% of Frontier’s Rmax HPL performance, which is very impressive for a non-synthetic workload.
It is awesome to see the Department of Energy’s (DOE) supercomputers being used for amazing science like this! With the announcement of the Discovery Supercomputer that is due in 2028/2029, I can’t wait to see the science that comes out of that system when it is turned over to the scientific community!
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-11-19 22:10:20
Hello you fine Internet folks,
Last week I was in San Diego at Qualcomm’s Headquarters where Qualcomm disclosed more information about their upcoming Snapdragon X2 Elite SOC.
Snapdragon X2 Elite is Qualcomm’s newest SOC for the Windows on ARM ecosystem, designed to bring a new level of performance, so let’s dive in.
The Snapdragon X2 Elite (SDX2E) comes equipped with a total of 18 CPU cores with 12 “Prime” cores and 6 “Performance” cores.
Starting with the Prime cores, these are the real heart of the SDX2E SoC with a total of up to 12 cores split across 2 clusters that can clock up to 5.0 GHz.
Each of these clusters has a 16MB, 16-way associative, shared L2 cache with 6 Prime cores attached, along with a Qualcomm Matrix Engine per cluster.
The L2 can serve up to 64B per cycle per core with a total fill bandwidth of up to 256B per cycle for the cluster. The L1 miss to L2 hit latency is now 21 cycles, up from the 17 cycles of the Snapdragon X Elite (SDXE); the increase is due to the larger structure. The L2 runs at the same clocks as the cores and supports over 220 in-flight transactions, with each core supporting over 50 requests to the L2 at a time.
Diving into the CPU cores, there is quite a bit that is familiar about this core at a high level.
Starting with the L1 instruction cache, it is 192KB in size, 6-way associative, and fully coherent. The Fetch Unit can fetch up to sixteen 4-byte instructions per cycle for a total fetch bandwidth of 64 bytes per cycle. The L1 iTLB is an 8-way associative, 256 entry structure that supports 4KB and larger page sizes.
Moving to the Decode, Rename, and Retirement stages, Oryon Gen 3 widens these stages to 9 wide from Oryon Gen 1’s 8 wide, meaning Oryon Gen 3 can retire up to 9 micro-ops per cycle. There are over 400 Vector and Integer registers in their respective physical register files, similar to the counts in Oryon Gen 1. Likewise, the Reorder Buffer remains 650+ entries for Oryon Gen 3.
Delving into the Integer side of the core, Oryon Gen 3 now has 4 Branch units, double the number found in Oryon Gen 1. Otherwise, the integer side of Oryon Gen 3 is very similar to Oryon Gen 1: 6 20-entry Reservation Stations for a total of 120 entries in the Integer scheduler, and 6 Integer ALUs, 2 of which are capable of Multiplies and one of which handles Crypto and Division instructions.
Swapping to the Vector unit, Oryon Gen 3 adds SVE and SVE2 support to the core, with a high-level layout similar to Oryon Gen 1: over 400 128b Vector registers, 4 128b Vector ALUs all capable of FMAs, and 4 48-entry Reservation Stations for a total of 192 entries in the Vector scheduler.
Moving to the Load and Store system, Oryon Gen 3 has the same 4 Memory AGUs as Oryon Gen 1, all of which are capable of both loads and stores. These feed a 192 entry Load Queue and a 56 entry Store Queue, the same sizes as the queues found on Oryon Gen 1. The L1 Data Cache is also the same fully coherent 96KB, 6-way structure with 64 Byte cache lines that Oryon Gen 1 had.
Landing at the Memory Management Unit, the TLBs of Oryon Gen 3 are again very similar to Oryon Gen 1, with one possible difference. Slide 11 says that the L1 dTLB is a 224 entry, 7-way structure whereas Slide 12 says that the L1 dTLB is a 256 entry, 8-way structure. If Slide 12 is correct then this is an increase from Oryon Gen 1’s 224 entry, 7-way L1 dTLB. Oryon Gen 3’s 256 entry, 8-way L1 iTLB and 8K entry, 8-way shared L2 TLB are unchanged from Oryon Gen 1. Note that the 2 cycle access for the L2 TLB is the SRAM access time, not the total latency for a TLB lookup; Qualcomm wouldn’t disclose the latter, but it is in a similar range to the ~7 cycles you see on modern x86 cores.
In each of SDX2E’s 3 clusters lies an SME compatible Matrix Engine.
This matrix unit uses a 64 bit x 64 bit MLA numeric element in an 8x8 or 4x8 grid. That makes the matrix unit 4096 bits wide, able to do up to 128 FP32/INT32, 256 FP16/BF16/INT16, or 512 INT8 operations per cycle. The matrix engine is on a separate clock domain from the Cores and L2 Cache for better power and thermal management of the SoC.
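Those per-cycle throughput figures follow directly from dividing the 4096-bit datapath by the element width:

```latex
\frac{4096}{32} = 128\ \text{(FP32/INT32)},\qquad
\frac{4096}{16} = 256\ \text{(FP16/BF16/INT16)},\qquad
\frac{4096}{8} = 512\ \text{(INT8) operations per cycle}
```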
New to the SDX2E, and something the SDXE didn’t have, is a 3rd cluster on board with what Qualcomm is calling their “Performance cores”.
This cluster has the same number of cores and the same Matrix Engine as the Prime clusters, but instead of 16 MB of shared L2, the Performance cluster has 12 MB of shared L2.
The Performance cores are also different from the Prime cores. They are a similar but distinct microarchitecture, targeted at a lower power point and optimized for operation below 2 watts. As such, the Performance core isn’t as wide as the Prime core and has fewer execution pipes, smaller caches, and shallower Out-of-Order structures.
SDX2E has a revamped GPU architecture that Qualcomm is calling the Adreno X2 microarchitecture.
This is the largest GPU Qualcomm has made to date, with 2048 FP32 ALUs clocking up to 1.85GHz.
The Adreno X2 is a “Slice-Based” architecture, with a slice being roughly equivalent to a Shader Engine from AMD or a GPC from Nvidia; the top-end X2-90 has 4 slices. Each slice has one Front-End which is capable of rasterizing up to 4 triangles per cycle.
After the Front-end are the Shader Processors, roughly equivalent to AMD’s WGPs or Nvidia’s SMs. Each Shader Processor has an instruction cache and 2 micro-Shader Processors (uSPs), similar to AMD’s SIMD units or Nvidia’s SMSPs, and each uSP has a 128KB register file feeding 128 ALUs which support FP32, FP16, and BF16. A change from the prior Adreno X1 architecture is the removal of Wave128; Adreno X2 only supports Wave64 and can dual issue Wave64 instructions in order to keep the 128 ALUs fed.
Each uSP has a Ray Tracing Unit which supports either 4 ray-triangles or 8 ray-box intersections per cycle.
From there is what Qualcomm refers to as Adreno High Performance Memory (AHPM). There is 21 MB of AHPM in an X2-90 GPU, 5.25 MB per slice, which acts as either a scratchpad or a cache depending on how the driver configures it. Up to 3 MB of each 5.25 MB slice can be configured as a cache, with the remaining 2.25 MB of SRAM being a scratchpad.
AHPM is designed to allow for the GPU to do tiled rendering all within the AHPM before rendering out the frame to the display buffer. This reduces the amount of data movement that the GPU has to do which consequently improves the performance per watt of the Adreno X2 compared to the Adreno X1.
Moving back to the Slice level, each slice has a 128 KB cluster cache which is then backed by a unified 2 MB L2 cache. This L2 can then spill into the 8 MB System Level Cache (SLC) which then is backed by the up to 228 GB/s memory subsystem.
As for API support, Adreno X2 supports DX12.2, Shader Model 6.8, Vulkan 1.4, OpenCL 3.0, as well as SYCL support coming in the first half of 2026.
Qualcomm has increased the performance of the Hexagon NPU from 45 TOPS of INT8 to 80 TOPS of INT8 with SDX2E.
Qualcomm has also added FP8 and BF16 support to the Hexagon NPU 6 vector unit.
In addition to the BF16 and FP8 support, the new matrix engine in NPU 6 has INT2 dequantization support.
However, the largest change in NPU 6 is the addition of 64 bit Virtual Addressing to the DMA unit which means that NPU 6 can now access more than 4GB of memory.
For testing the power of a system, Qualcomm has used what they call INPP, or Idle Normalized Platform Power. INPP takes the total platform power during load and subtracts out the platform power at idle.
What INPP gets you is the SOC power plus the DRAM power plus the Power Conversion Losses; while this isn’t quite solely SOC power, INPP is about as close as you can get to pure SOC power in a laptop form factor where discrete power sensors aren’t very common.
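Expressed as a formula:

```latex
\mathrm{INPP} = P_{\mathrm{platform,\ load}} - P_{\mathrm{platform,\ idle}} \approx P_{\mathrm{SoC}} + P_{\mathrm{DRAM}} + P_{\mathrm{conversion\ losses}}
```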
Different workloads have different power characteristics. For example, while GB 6 Multi-threaded doesn’t pull a ton of power overall, it is a very bursty workload that can spike to over 150 watts; whereas a memory bandwidth test pulls over 105 watts in a sustained fashion.
Looking at the performance versus power graph in Cinebench R24 MT, the SDX2E Extreme with 18 cores (12 Prime and 6 Performance cores) scores just over 1950 points in Cinebench R24 at about 105 watts INPP with the standard SDX2E with 12 cores (6 Prime and 6 Performance cores) scoring just over 1100 points at approximately 50 watts INPP.
Qualcomm has also implemented a clock boosting scheme similar to Intel’s Turbo Boost where depending on the number of cores active in a cluster, the cluster will clock up or down accordingly.
Qualcomm also highlighted the performance of the SDX2E when the laptop is on battery power compared to the laptop on wall power.
Qualcomm has made significant advances with the SDX2E with regards to the SOC, GPU, and NPU. SDX2E is planned to hit shelves in the first half of 2026 and we can’t wait to get a system to test.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-11-01 05:50:39
Editor’s Note (11/2/2025): Due to an error in moving the article over from Google Docs to Substack, the “Balancing CPU and GPU Bandwidth Demands” section was missing some Cyberpunk 2077 data. Apologies for the mistake!
AMD’s Strix Halo aspires to deliver high CPU and GPU performance within a mobile device. Doing so presents the memory subsystem with a complicated set of demands. CPU applications are often latency sensitive with low bandwidth demands. GPU workloads are often latency tolerant and bandwidth hungry. Then, multitasking requires high memory capacity. Mobile devices need low power draw. Finally, the whole package has to fit within a price tag acceptable to consumers. Investigating how AMD dealt with those challenges should make for a good time.
ASUS has kindly sampled the ROG Flow Z13, which implements Strix Halo in a tablet form factor with 32 GB of LPDDR5X. They’ve made deep dives like this possible, and we greatly appreciate their support.
RX 7600 results were provided by Azralee from the Chips and Cheese Discord.
Strix Halo’s GPU uses a similar cache setup to AMD’s older and smaller mobile chips. As on Strix Point and Hawk Point (Zen 4 mobile), Strix Halo’s GPU is split into two Shader Arrays. Each Shader Array has 256 KB of L1 mid-level cache, and a 2 MB L2 services the entire GPU. Latencies to those GPU-private caches are in line with other RDNA3 and RDNA3.5 implementations. AMD likely kept L2 capacity at 2 MB because a 32 MB memory side cache (Infinity Cache, or MALL) takes over as the GPU’s last level cache. The L2 only has to catch enough traffic to prevent the Infinity Cache from getting overwhelmed. The resulting cache setup is similar to the one in the RX 7600, a lower midrange RDNA3 discrete card.
The Infinity Cache on Strix Halo has slightly higher latency compared to implementations in AMD’s discrete cards. DRAM latency from the GPU is higher as well. Compared to AMD’s other mobile CPUs with iGPUs though, the 32 MB Infinity Cache offers a large cache capacity increase.
Nemes’s Vulkan bandwidth test achieves just under 1 TB/s from Infinity Cache. The figures align well with performance counter data. Taken together with the chip’s 2 GHz FCLK, bandwidth test results suggest the GPU has a 512B/cycle path to the interconnect. If so, each of the GPU’s eight Infinity Fabric endpoints has a 64B/cycle link.
As a memory side cache, Infinity Cache can theoretically handle any access to physical addresses backed by DRAM. In an earlier interview with Cheese (George), AMD indicated that Infinity Cache was focused on the GPU, and that its behavior could change with firmware releases. Some of that change has happened already. When I first started testing Strix Halo just after Hot Chips 2025, results from my OpenCL microbenchmarks reflected Infinity Cache’s presence. I used that OpenCL code to figure out Data Fabric performance events. But PMU data collected from games suggested Infinity Cache wasn’t used once a game went into the background. Hardware doesn’t know whether a process is running in the foreground or background. That’s something the operating system knows, and that info would have to be communicated to hardware via drivers. Therefore, Infinity Cache policy can evidently be changed on the fly under software control.
At that time, Nemes’s Vulkan-based code didn’t reflect Infinity Cache’s presence. PMU data showed a match between CS and UMC traffic, indicating the microbenchmark wasn’t taking advantage of Infinity Cache rather than the cache struggling with the access pattern. I was in the middle of investigating what Infinity Cache did or didn’t apply to when Windows updated. Then, foreground/background status no longer had any effect. Nemes’s Vulkan code was also able to observe the Infinity Cache.
Early observations on Infinity Cache behavior aren’t relevant today, but they do show Infinity Cache’s behavior is influenced by factors beyond a memory request’s origination point. Not all GPU requests install into the cache, and AMD can change cache policy on the fly. AMD could tune behavior with future updates too.
One early observation from OpenCL remained consistent though. Infinity Cache isn’t used for a buffer created with the CL_MEM_ALLOC_HOST_PTR flag and managed with zero-copy map/unmap APIs. CL_MEM_ALLOC_HOST_PTR requests an allocation from host-visible memory. On systems with discrete GPUs, AMD tends to handle that by allocating memory from DRAM attached to the CPU.
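For context, the allocation pattern in question looks roughly like the sketch below; the context, queue, and error handling are assumed to already exist, and the names are illustrative:

```cpp
// Sketch of a zero-copy buffer: created with CL_MEM_ALLOC_HOST_PTR and accessed
// through map/unmap rather than explicit clEnqueueRead/WriteBuffer copies.
#include <CL/cl.h>

void zero_copy_example(cl_context ctx, cl_command_queue queue, size_t bytes) {
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                bytes, nullptr, &err);

    // Map the buffer to get a host pointer; on an iGPU this should not copy data.
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE, 0, bytes,
                                   0, nullptr, nullptr, &err);
    // ... CPU fills *ptr with test data here ...
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, nullptr, nullptr);

    // The GPU can now use buf directly as a kernel argument, no explicit copy needed.
    clReleaseMemObject(buf);
}
```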
Intuitively, that flag shouldn’t make a difference on integrated GPUs. I’m not sure why it affects Infinity Cache behavior. Perhaps Strix Halo splits address ranges for the CPU and GPU under the hood, and the CPU’s address ranges aren’t cacheable from the Infinity Cache’s perspective.
AMD’s discrete Radeon RX 9070 shows similar behavior, with Infinity Cache not being used for host-side memory. Latency to host memory goes up to nearly a microsecond on RX 9070, while it remains unchanged on Strix Halo. Integrated GPUs have an advantage with zero-copy compute code, and it shows.
To further check zero-copy behavior, I have a test that allocates a 256 MB buffer using OpenCL’s Shared Virtual Memory APIs and only modifies a single 32-bit value. Strix Halo supports fine-grained buffer sharing like other recent AMD GPUs, meaning applications can use results generated from the GPU without calling map/unmap functions.
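A minimal sketch of that test, assuming an OpenCL 2.0 context and a trivial kernel that reads the buffer (names and sizes are illustrative):

```cpp
// Sketch of the fine-grained SVM check: allocate 256 MB, touch one 32-bit value
// from the CPU, and launch a GPU kernel without any map/unmap calls in between.
#include <CL/cl.h>
#include <cstdint>

void svm_fine_grained_example(cl_context ctx, cl_command_queue queue, cl_kernel kernel) {
    const size_t bytes = 256ull * 1024 * 1024;
    auto *buf = static_cast<uint32_t *>(
        clSVMAlloc(ctx, CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER, bytes, 0));

    buf[0] = 42;   // fine-grained sharing: the CPU writes directly, no map/unmap

    clSetKernelArgSVMPointer(kernel, 0, buf);
    size_t global = 1;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clFinish(queue);

    clSVMFree(ctx, buf);
}
```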
Strix Halo shows low latencies in line with zero-copy behavior. It’s worth noting that not all integrated GPUs can avoid a copy under the hood.
Copy APIs like clEnqueueReadBuffer and clEnqueueWriteBuffer are still relevant, because they’re the traditional way to work with discrete GPUs. Those APIs often use the copy queue and DMA engines, which handle data movement without involving general purpose compute units. Strix Halo can achieve high copy bandwidth in the CPU to GPU direction, but not the other way around.
Performance counter data suggests copies to the GPU don’t go through the Infinity Cache. During a copy, the shared memory controllers should observe both a read from CPU-side memory and a write to GPU-side memory. But there’s nowhere near 100% overhead compared to software measurements.
Bandwidth is lower in the other direction, but curiously CS-level bandwidth is similar. The memory controllers see less bandwidth, indicating some requests were handled on-chip, likely by Infinity Cache. Curiously, there’s way more than 100% overhead when comparing PMU data to software-visible copy bandwidth.
Strix Halo’s CPU side superficially resembles AMD’s flagship desktop parts, with 16 Zen 5 cores split across two Core Complex Dies (CCDs). However, these CCDs use TSMC’s InFO_oS for connectivity to the IO die rather than on-PCB traces. The CCD has 32B/cycle of bandwidth to the system in both the read and write directions.
Therefore, Strix Halo’s CCDs have more bandwidth at the die boundary than their desktop counterparts, but only in the write direction. It’s an advantage that’s likely to have minimal impact because reads often outnumber writes by a large margin.
Other CPU chiplet designs have more bandwidth at die boundaries, including the Compute Tile on Intel’s Meteor Lake and AMD’s own “GMI-Wide” configuration. GMI-Wide uses two links between the CCD and IO die to maximize cross-die bandwidth in lower core count server chips. Even though GMI-Wide doesn’t use advanced packaging, it has significantly more cross-die bandwidth than Strix Halo.

In a loaded latency test with reads, a Strix Halo CCD can reach high bandwidth levels at lower latency than standard GMI-Narrow CCDs. Part of that is likely down to its high bandwidth LPDDR5X setup, which a single CCD can’t come close to saturating. But that advantage doesn’t come through until bandwidth loads pass 45-55 GB/s. Before that, LPDDR5X’s high baseline latency puts Strix Halo at a disadvantage. At very high bandwidth load, Intel Meteor Lake’s higher cross-die bandwidth keeps it ahead. AMD’s GMI-Wide setup shows what a bandwidth-focused cross-die link can do, providing excellent bandwidth at low latency.
Bringing both CCDs into play gives Strix Halo a lead over Meteor Lake. I’m starting the test by placing bandwidth load on CCD1 while running the latency test on CCD0. That gives lower latency at bandwidth loads below 60 GB/s because contention at the CCD interface is taken out of the picture. Latency does increase as I spread bandwidth load across both dies, and rises beyond 200 ns as the test approaches die-to-die bandwidth limits. However, a read-only pattern is still limited by cross-die bandwidth and falls far short of the 256 GB/s that the LPDDR5X setup is theoretically capable of.
Advanced packaging may provide latency benefits too. Regular AMD CCDs use SerDes (serializer-deserializer) blocks, which convert signals for transport over lower quality PCB traces. Zen 2’s Infinity Fabric On-Package (IFOP) SerDes for example uses 32 transmit and 40 receive lanes running at a very high clock. A forwarded clock signal per lane data bundle helps tackle the clock skew that comes up with high speed parallel transmission over wires of unequal lengths. CRC helps ensure data integrity.
All of that adds power and latency overhead. Strix Halo’s InFO_oS packaging doesn’t require SerDes. But any latency advantage is difficult to observe in practice. DRAM requests are the most common type of off-CCD traffic. High LPDDR5X latency masks any latency advantage when looking at DRAM requests, as shown above. Cache coherency traffic is another form of off-CCD traffic, and doesn’t involve DRAM. However, testing that with a “core to core latency” test that bounces cachelines between core pairs also doesn’t provide favorable results for Strix Halo.
AMD handles cross-CCX cache coherency at Coherent Stations (CS-es) that sit right in front of the memory controllers. Memory traffic is interleaved across memory channels and thus CS instances based on their physical address. I try hitting different physical addresses by testing with various cacheline offsets into a 4 KB page, which gives me different combinations of L3 slices and memory controller + CS pairs. Values within a single run reflect variation based on the tested core pair, while different runs display variation from different memory subsystem blocks owning the tested address.
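The test itself is conceptually simple. Below is a rough sketch of the cacheline-bouncing idea; a real run pins the two threads to specific cores (omitted here) and sweeps the cacheline offset within the page:

```cpp
// Rough sketch of a core-to-core "ping pong" latency test: two threads hand a
// value in one cacheline back and forth. Thread pinning (e.g. via
// pthread_setaffinity_np) is omitted for brevity.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

alignas(4096) static std::atomic<uint64_t> page[512];   // one 4 KB page of 64-bit slots

int main() {
    const size_t cacheline = 0;                  // 0..63: which 64B line within the page
    std::atomic<uint64_t> &line = page[cacheline * 8];
    line.store(0);
    const uint64_t iters = 1'000'000;

    auto start = std::chrono::steady_clock::now();
    std::thread ping([&] {                       // pin to core A in a real test
        for (uint64_t i = 1; i <= 2 * iters; i += 2) {
            while (line.load(std::memory_order_acquire) != i - 1) {}
            line.store(i, std::memory_order_release);
        }
    });
    std::thread pong([&] {                       // pin to core B in a real test
        for (uint64_t i = 2; i <= 2 * iters; i += 2) {
            while (line.load(std::memory_order_acquire) != i - 1) {}
            line.store(i, std::memory_order_release);
        }
    });
    ping.join();
    pong.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("~%.1f ns per cacheline handoff\n", double(ns) / (2.0 * iters));
}
```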

Cross-CCX latencies on Strix Halo land in the 100-120 ns range depending on the location of the tested core pair, responsible L3 slice, and responsible CS. That’s significantly higher than on typical desktop systems or prior mobile chips from AMD. For example, the Ryzen 9 9900X tends to have cross-CCX latencies in the 80-90 ns range, which is in line with prior Zen generations. It’s about 20 ns faster than Strix Halo.
Therefore, I don’t have a satisfactory answer about Strix Halo’s cross-die latency. Latency may indeed be lower at die boundaries. But everything past that boundary has higher latency compared to other client systems, making any advantage invisible to software.
Sharing a memory controller across the CPU and GPU comes with advantages, like making zero-copy behavior more natural to pull off. But it comes with challenges too. CPU and GPU memory requests can contend with each other for DRAM access. Contention surfaces as higher latency. From Zen 4 onward, AMD’s L3 performance monitoring unit (PMU) can measure average latency in nanoseconds for requests external to the core cluster. PMU data isn’t directly comparable to software measurements, because it only accounts for latency after the point of a L3 miss. But it is consistent in slightly under-estimating software observed latency when running a simple latency microbenchmark. When gaming, I typically see low CPU bandwidth demands and correspondingly mild latency increases over the baseline.
The same doesn’t hold true when gaming on Strix Halo’s integrated GPU. Latency rises far above the baseline of around 140 ns. I logged average latency over 1 second intervals, and many of those intervals saw latency figures around 200 ns across several games.
I wrote a microbenchmark to investigate how CPU memory latency is impacted by GPU-side bandwidth load. As with the CPU loaded latency test, I run a latency test thread on a CPU core. But instead of using a read-only pattern, I do a standard C=A+B computation across large arrays on the GPU. To control GPU bandwidth load, I can have each OpenCL kernel invocation do more math with A and B before writing the result to C. Results show increased latency at higher GPU bandwidth demands. Other recent iGPUs show similar behavior.
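The kernel side of that microbenchmark might look something like this illustrative reconstruction, where extra_iters is the knob that trades memory traffic for dependent math:

```cpp
// Illustrative OpenCL kernel source (embedded as a C++ string): C = A + B with a
// tunable amount of extra math per element. Raising extra_iters adds dependent
// multiply-adds without adding memory traffic, lowering the kernel's bandwidth demand.
static const char *kAddWithExtraMath = R"CLC(
__kernel void add_with_extra_math(__global const float *A,
                                  __global const float *B,
                                  __global float *C,
                                  int extra_iters) {
    size_t i = get_global_id(0);
    float a = A[i];
    float b = B[i];
    float acc = a + b;
    for (int k = 0; k < extra_iters; k++)
        acc = acc * a + b;        // dependent math, no additional loads or stores
    C[i] = acc;
}
)CLC";
```

Sweeping extra_iters while the CPU-side latency thread runs maps out latency as a function of GPU bandwidth demand.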
In-game CPU bandwidth demands are low, but not as low as a simple latency test. I tried running a couple of read bandwidth threads on top of the test above. Strix Halo seems to let its GPU squeeze out the CPU when under extreme bandwidth demands. Latency suffers, passing 300 ns at one point.
Plotting L3 and memory controller PMU data with 1 second intervals helps capture the relationship between latency and bandwidth usage in more complex workloads. The points don’t track well with microbenchmark data collected with a single CPU-side latency test thread. Perhaps there’s enough CPU-side bandwidth demand to cause contention at both the die-to-die interface and the memory controllers. Or maybe, CPU and GPU bandwidth spikes tend to line up within those 1 second intervals. Whatever the case, PMU data highlights how Strix Halo’s CPU cores need high cache hitrates more than their desktop counterparts.
Cyberpunk 2077’s built-in benchmark is largely CPU bound when run at 1080P with medium settings and no upscaling. I used Intel’s Arc B580 on desktop systems, since it has vaguely similar compute power to Strix Halo’s iGPU. Results show a large gap between Strix Halo and AMD’s desktop platform, even though both use the same Zen 5 cores.
Memory latency under load is largely not a problem with CPU-only workloads, even when considering heavily multithreaded ones. Total bandwidth demands are much lower and actually well within the capabilities of a 128-bit DDR5 setup. That explains why AMD was able to take on quad channel HEDT parts using a desktop dual channel platform back in the Zen 2 days. Good caching likely played a role, and Strix Halo continues to have 4 MB of last level cache per core. PMU data from Cinebench, code compilation, and AV1 video encoding loosely align with microbenchmark results. Latency barely strays above the baseline. Y-Cruncher is an exception. It’s very bandwidth hungry and not cache friendly. Its bandwidth demands are several times higher, and often go beyond a dual channel DDR5-5600 setup’s capabilities. Strix Halo is a good choice for that type of workload. But in the client space, bandwidth hungry CPU applications tend to be exceptions.
Observations above suggest Strix Halo’s Infinity Fabric and DRAM setup focuses on feeding the GPU and as a result the CPU gets the short end of the stick. High Infinity Fabric endpoint count and a wide LPDDR5X bus provide high bandwidth at high latency. CPU workloads tend to be latency sensitive and contention can make that even worse.

Other aspects of the memory subsystem de-prioritize the CPU as well. CPU accesses don’t fill into the Infinity Cache, but still do a lookup, likely to maintain cache coherency with the GPU. That cache lookup at the very least costs power and might add latency, even though it’ll almost never result in a hit. Lack of GMI-Wide style bandwidth is another example.

AMD’s decisions are understandable. Most client workloads have light bandwidth requirements. Strix Halo’s memory system design lets it perform well in portable gaming devices like the ROG Flow Z13. But it does make tradeoffs. And extrapolating from those tradeoffs suggests iGPU designs will face steeper challenges at higher performance tiers.
For its part, Strix Halo strikes a good balance. It enjoys iGPU advantages without being large enough for the disadvantages to hurt. I hope AMD continues to target Strix Halo’s market segment with updated designs, and look forward to seeing where they go next.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-10-22 11:58:42
Strix Halo is the codename for AMD’s highest end mobile chip, which is used in the Ryzen AI MAX series. It combines a powerful CPU with 16 Zen 5 cores and a large GPU with 20 RDNA 3.5 Workgroup Processors (WGPs). The sizeable iGPU makes Strix Halo particularly interesting because GPUs have high bandwidth requirements. Strix Halo tackles that with a 256-bit LPDDR5X-8000 setup combined with 32 MB of memory side cache. The latter is often referred to as Infinity Cache, or MALL (Memory Attached Last Level). I’ll refer to it as Infinity Cache for brevity.
Infinity Cache has been around since RDNA2 in AMD’s discrete consumer GPU lineup, where it helped AMD hit high performance targets with lower DRAM bandwidth requirements. However, Infinity Cache’s efficacy has so far been difficult for me to evaluate. AMD’s discrete GPUs have performance monitoring facilities accessible through AMD’s developer tools. But those tools stop providing information past L2. Strix Halo stands out because it has an Infinity Cache implementation, and all the accessible performance monitoring features typical of a recent AMD GPU. That includes programmable performance counters at Infinity Fabric and memory controllers. It’s an opportunity to finally get insight into how well AMD’s Infinity Cache does its job in various graphics workloads.
Special thanks goes out to ASUS for sampling their ROG Flow Z13. This device implements AMD’s Ryzen AI MAX+ 395 with 32 GB of LPDDR5X in a thin and light form factor. It superficially resembles a convertible tablet from Microsoft’s Surface line, and is remarkably portable for a device with gaming credentials. Without ASUS’s help, this article wouldn’t have been possible.
AMD’s Infinity Fabric aims to abstract away details of how data travels across the chip. It does so by providing endpoints with well defined interfaces to let blocks make or handle memory requests. Infinity Fabric also provides a set of programmable performance counters. AMD documents a single DATA_BW performance event that counts data beats at endpoints. DATA_BW targets an endpoint via its 8-bit instance ID, and can count either reads or writes. AMD never documented Infinity Fabric instance IDs for Strix Halo. So, I did some guessing by generating traffic at various blocks and observing bandwidth counts at all possible instance IDs.
Instance IDs start from the Coherent Stations (CS-es), just like on server platforms. CS blocks sit in front of memory controllers and ensure cache coherency by probing another block if it might have a modified copy of the requested cacheline. If it doesn’t, which is most of the time, the CS will pass the request on to its attached Unified Memory Controller (UMC). Because CS blocks observe all requests to physical memory backed by DRAM, it’s a logical place to implement a memory side cache. That’s exactly what AMD does on chips with Infinity Cache. Cache hits let the CS avoid going to the UMC.
Strix Halo has 16 memory controllers and CS instances, each handling a 16-bit LPDDR5X channel. The GPU occupies the next eight instance IDs, suggesting it has a wide interface to Infinity Fabric. CPU core clusters come next. Each octa-core Zen 5 Core Complex (CCX) connects to Infinity Fabric via one endpoint. Miscellaneous blocks follow. These include the NPU, media engine, the display engine, and a mystery.
Cache misses usually cause a request to the next level in the memory hierarchy, while cache hits do not. Therefore, my idea is to compare traffic levels at the CS and UMC levels. Traffic that shows up at the CS but not at the UMCs can be used as a proxy for Infinity Cache hits. There are a few problems with that approach though.
First, Strix Halo only provides eight Infinity Fabric performance counters. I have to use two counters per endpoint to count both read and write data beats, so I can only monitor four CS-es simultaneously. Memory traffic is interleaved across channels, so taking bandwidth observed at four CS-es and multiplying by four should give a reasonably accurate estimate of overall bandwidth. But interleaving isn’t perfectly even in practice, so I expect a few percentage points of error. The UMC situation is easier. Each UMC has its own set of four performance counters, letting me monitor all of them at the same time. I’m using one counter per UMC to count Column Address Strobe (CAS) commands. Other available UMC events allow monitoring memory controller frequency, bus utilization, and ACTIVATE or PRECHARGE commands. But that stuff goes beyond what I want to focus on here, and collecting more data points would make for annoyingly large spreadsheets.
Cross-CCX traffic presents a second source of error as mentioned above. If a CS sends a probe that hits a modified line, it won’t send a request to the UMC. The request is still being satisfied out of cache, just not the Infinity Cache that I’m interested in. I expect this to be rare because cross-CCX traffic in general is usually low compared to requests satisfied from DRAM. It’ll be especially rare in the graphics workloads I’m targeting here because Strix Halo’s second CCD tends to be parked in Windows.
CPU-side traffic in general presents a more significant issue. Strix Halo’s Infinity Cache narrowly targets the GPU, and only GPU-side memory requests cause cache fills. CPU memory accesses will be counted as misses in a hitrate calculation. That may be technically correct, but misses the point of Infinity Cache. Most memory traffic comes from the GPU, but there’s enough CPU-side traffic to create a bit more of an error bar than I’d like. Therefore, I want to focus on whether Infinity Cache is handling enough traffic to avoid hitting DRAM bandwidth bottlenecks.
A final limitation is that I’m sampling performance counters with a tool I wrote years ago. I wrote it so I could look at hardware performance stats just like how I like looking at Task Manager or other monitoring utilities to see how my system is doing. Because I’m updating a graphical interface, I sample data every second and thus have lower resolution compared to some vendor tools. Those can often sample at millisecond granularity or better. That opens the window to underestimating bandwidth demands if a bandwidth spike only occurs over a small fraction of a second. In theory I could write another tool. But Chips and Cheese is an unpaid hobby project that has to be balanced with a separate full time job as well as other free time interests. Quickly rigging an existing project to do my bidding makes more efficient use of time, because there’s never enough time to do everything I want.
I logged data over some arbitrarily selected graphics workloads, then selected the 1-second interval with the highest DRAM bandwidth usage. That should give an idea of whether Strix Halo comes close to reaching DRAM bandwidth limits. Overall, the 32 MB cache seems to do its job. Strix Halo is able to stay well clear of the 256 GB/s theoretical bandwidth limit that its LPDDR5X-8000 setup can deliver.
Approximate CS-side bandwidth demands do indicate several workloads can push uncomfortably close to 256 GB/s. A workload doesn’t necessarily have to get right up to bandwidth limits for memory-related performance issues to start cropping up. Under high bandwidth demand, requests can start to pile up in various queues and end-to-end memory latency can increase. GHPC and Unigine Valley fall into that category. 3DMark Time Spy Extreme would definitely be DRAM bandwidth bound without the memory side cache.
Picking an interval with maximum bandwidth demands at the CS gives an idea of how much bandwidth Strix Halo’s GPU can demand. Strix Halo has the same 2 MB of graphics L2 cache that AMD’s older 7000 series “Phoenix” mobile chip had, despite more than doubling GPU compute throughput. Unsurprisingly, the GPU can draw a lot of bandwidth across Infinity Fabric. 3DMark Time Spy again stands out. If AMD wanted to simply scale up DRAM bandwidth without spending die area on cache, they would need well over 335 GB/s from DRAM.
Curiously, Digital Foundry notes that Strix Halo has very similar graphics performance to the PS5. The PS5 uses a 256-bit, 14 GT/s GDDR6 setup that’s good for 448 GB/s of theoretical bandwidth, and doesn’t have a memory side cache like Strix Halo. 448 GB/s looks just about adequate for satisfying Time Spy Extreme’s bandwidth needs, if just barely. If Strix Halo didn’t need to work in power constrained mobile devices, and didn’t need high memory capacity for multitasking, perhaps AMD could have considered a GDDR6 setup. To bring things back around to Infinity Cache, it seems to do very well at that interval above. It captures approximately 73% of memory traffic going through Infinity Fabric, which is good even from a hitrate point of view.
The two bar graphs above already hint at how bandwidth demands and cache hitrates vary across workloads. Those figures vary within a workload as well, though plotting all of the logged data would be excessive. Variation both across and within workloads makes it extremely difficult to summarize cache efficacy with a single hitrate figure. Plotting the percentage of traffic at the CS not observed at the UMCs as a proxy for hitrate further emphasizes that point.
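Concretely, the quantity plotted as a hitrate proxy is the share of CS-level traffic that never reaches the memory controllers:

```latex
\text{hitrate proxy} \approx \frac{\mathrm{BW}_{\mathrm{CS}} - \mathrm{BW}_{\mathrm{UMC}}}{\mathrm{BW}_{\mathrm{CS}}}
```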
Resolution settings can impact cache hitrate as well. While I don’t have a direct look at hitrate figures, plotting with the data I do have suggests increasing resolution in the Unigine Valley benchmark tends to depress hitrate.
AMD presented a slide at Hot Chips 2021 with three lines for different resolutions across a set of different cache capacities. I obviously can’t test different cache capacities. But AMD’s slide does beg the question of how well a 32 MB cache can do at a wider range of resolutions, and whether bandwidth demands remain under control at resolutions higher than what AMD may have optimized the platform for.
Unigine Superposition and 3DMark workloads can both target a variety of resolutions without regard to monitor resolution. Logging data throughout benchmark runs at different resolutions shows the 32 MB cache providing reasonable “bandwidth amplification” at reasonable resolutions. At unreasonable resolutions, the cache is still able to do something. But it’s not as effective as it was at lower resolutions.
Plotting data over time shows spikes and dips in Infinity Cache efficacy occur at remarkably consistent times, even at different resolutions. 8K is an exception with the Superposition benchmark. The 8K line looks stretched out, possibly because Strix Halo’s GPU only averaged a bit over 10 FPS, effectively stretching the timeline.
If Strix Halo’s iGPU were capable of delivering over 30 FPS in Superposition, AMD would definitely need a larger cache, more DRAM bandwidth, or a combination of both. Taking a simplistic view and tripling maximum observed 8K bandwidth demands would give a figure just north of 525 GB/s. But Strix Halo’s iGPU clearly wasn’t built with such a scenario in mind, and AMD’s selected combination of cache capacity and DRAM bandwidth works well enough at all tested resolutions. While high resolutions create the most DRAM traffic, 1080P shows the heaviest bandwidth demands at the Infinity Fabric level.
3DMark Wild Life Extreme is a better example because it’s a lightweight benchmark designed to run on mobile devices. Strix Halo’s iGPU can average above 30 FPS even when rendering at 8K. Again, DRAM bandwidth demands increase and Infinity Cache becomes less effective as resolution goes up. But Infinity Cache still does enough to keep the chip well clear of DRAM bandwidth limits. Thus the cache does its job, and the bandwidth situation remains under control across a wide range of resolutions.
Bandwidth demands are more important than hitrates. Curiously though, Wild Life Extreme experiences increasingly severe hitrate dips at higher resolutions around the 16-19 and 45-52 second intervals. Those dips barely show at 1440P or below. Perhaps a 32 MB cache’s efficacy also shows more variation as resolution increases.
A few errors here - I said Core Coherent Master read/write beats were 32B and 64B/cycle. It’s not per cycle, it’s per data beat. And I meant to say reads outnumber writes at the end.
Chip designers have to tune their designs to perform well across a wide variety of workloads. Cache sizes are one parameter on the list. AMD chose to combine a 32 MB cache with 256 GB/s of DRAM bandwidth, and it seems to do well enough across the workloads I tested. Monitoring at the CS-es and UMCs also supports AMD’s data showing that higher resolutions tend to depress hitrate. Those results explain why larger GPUs tend to have larger caches and higher DRAM bandwidth.
At a higher level, GPU bandwidth demands have been a persistent challenge for large iGPUs and have driven a diverse set of solutions. Over a decade ago, Intel created “halo” iGPU variants and paired them with a 128 MB eDRAM cache while sticking with a typical 128-bit client memory bus. AMD’s console chips use large and fast DRAM buses. The PS5 is one example. Strix Halo does a bit of both. It combines modest cache capacity with more DRAM bandwidth than typical client chips, but doesn’t rely as much on DRAM bandwidth as console chips.
Those varied approaches to the bandwidth problem are fascinating. Watching how Infinity Cache behaves in various graphics workloads has been fascinating as well. But everything would be so much more fun if AMD’s tools provided direct data on Infinity Cache hitrates. That applies to both integrated and discrete GPUs. Infinity Cache is a well established part of AMD’s strategy. It has been around for several generations, and now has a presence in AMD’s mobile lineup too. I suspect developers would love to see Infinity Cache hitrates in addition to the data on GPU-private caches that AMD’s current tools show.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-10-18 11:52:18
Hello you fine Internet folks!
Today we are looking at AMD’s largest client APU to date, Strix Halo. This is an APU designed to be a true all-in-one mobile processor, able to handle high end CPU and GPU workloads without compromise. Offering a TDP range of 55W to 120W, the chip targets a far higher power envelope compared to standard Strix Point, but eschews the need for dedicated graphics.
To get y’all all caught up on the history and specifications of this APU, AMD first announced Strix Halo at CES 2025 earlier this year to much fanfare. Strix Halo is AMD’s first chiplet APU in the consumer market with AMD using Strix Halo as a bit of a show piece for what both CPU and GPU performance can look like with a sufficiently large APU.
AMD’s Strix Halo can be equipped with dual 8 core Zen 5 CCDs for a total of 16 cores that feature the same 512b FPU as the desktop parts. This is a change from the more mainstream and monolithic Strix Point APU, which has “double-pumped” 256b FPUs similar to Zen 4 for use with AVX512 code. What Strix Halo does share with the more mainstream Strix Point is the 5.1GHz max boost clock, a 600MHz deficit compared to the desktop flagship Zen 5 CPU, the Ryzen 9 9950X.
Moving to the 3rd die on a Strix Halo package, an RDNA 3.5 iGPU takes up the majority of the SoC die with 40 compute units, 32MB of Infinity Cache, and a boost clock of up to 2.9GHz, placing raw compute capability somewhere between the RX 7600 XT and RX 7700.
To feed this chip, AMD has equipped Strix Halo with a 256b LPDDR5X-8000 memory bus, which provides up to 256GB/s shared between all of the components. This is slightly lower than the 288GB/s available to the RX 7600 XT but is much higher than any other APU we have tested.
A massive thank you to both Asus and HP for sending over a ROG Flow Z13 (2025) and a ZBook Ultra G1a 14” for testing which were both equipped with an AMD Ryzen AI Max+ 395. All of the gaming tests were done on the Flow Z13 due to that being a more gaming focused device and all of the microbenchmarking was done on the ZBook Ultra G1a.

Starting with the memory latency from Zen 5’s perspective, we see that the latency difference between Strix Point and Strix Halo is negligible with Strix Point at ~128ns of memory latency and Strix Halo at ~123ns of memory latency. However, as you can see the CPU does not have access to the 32MB of Infinity Cache on the IO die. This behavior was confirmed by Mahesh Subramony during our interview about Strix Halo at CES 2025.
While the 123ns DRAM latency seen here is quite good for a mobile part, desktop processors like our 9950X fare much better at 75-80ns.
Moving on to memory bandwidth, we see Strix Halo fall into a category of its own of the SoCs we have tested.
When doing read-modify-add operations across both CCDs, the 16 Zen 5 cores can pull over 175GB/s of bandwidth from the memory with reads being no slouch at 124GB/s across both CCDs.
However, looking at the bandwidth of a single CCD, just like the desktop CPUs, a single Strix Halo CCD only has a 32 byte per cycle read link to the IO die. And just like the desktop chips, the chip-to-chip link runs at ~2000MHz, which caps single CCD reads at 64GB/s. Unlike the desktop chips, the write link is also 32 bytes per cycle, and we are seeing about 43GB/s of write bandwidth. That brings the total theoretical single CCD bandwidth to 128GB/s, with observed bandwidth just over 103GB/s.
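Those ceilings fall straight out of link width times the ~2 GHz fabric clock:

```latex
32\,\tfrac{\mathrm{B}}{\mathrm{cycle}} \times 2\,\mathrm{GHz} = 64\,\tfrac{\mathrm{GB}}{\mathrm{s}}\ \text{per direction},\qquad
64 + 64 = 128\,\tfrac{\mathrm{GB}}{\mathrm{s}}\ \text{combined theoretical}
```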
Strix Halo’s CPU packs quite a bit more of a punch than Strix Point’s CPU.
Strix Halo’s CPU can match a last generation desktop flagship CPU, the 7950X, in Integer performance despite an 11.7% clock speed deficit. And it nearly matches AMD’s current desktop flagship CPU, the 9950X, in Floating Point performance, again with an 11.7% clock speed deficit.
Looking at the SPEC CPU 2017 Integer subtests, while Strix Halo can’t quite match the desktop 9950X, likely due to the higher memory latency of Strix Halo’s LPDDR5X bus, it does get close in a number of subtests.
Moving to the FP subtests, the story is similar to the Integer subtests, but Strix Halo gets even closer to the 9950X and even beats it in the fotonik3d subtest.
Moving to the GPU side of things, this is where Strix Halo really shines. The laptop we used as a comparison to Strix Halo was the HP Omen Transcend 14 2025 equipped with an RTX 5070 Mobile, which maxed out at about 75 Watts for the GPU.
Strix Halo has over double the memory bandwidth of any of the other mobile SoCs that we have tested. However, the RTX 5070 Mobile does have about 50% more memory bandwidth compared to Strix Halo.
Looking at the caches of Strix Halo, the Infinity Cache, AKA MALL, is able to deliver over 40% higher bandwidth compared to the 5070M’s L2 while having 33% more capacity. Plus Strix Halo has a 2MB GPU L2 which is capable of providing 2.5TB/s of bandwidth to the GPU.
Moving to latency, the more complex cache layout of Strix Halo does give it a latency advantage after the 128KB point, with Strix Halo’s L2 being significantly lower latency than the 5070M’s L2, and the larger 32MB MALL showing similar latency to the 5070M’s L2. And Strix Halo’s memory latency is about 35% lower than the 5070M’s memory latency.
Looking at the floating point throughput, we see that Strix Halo unsurprisingly has about 2.5x the throughput of Strix Point considering it has about two and a half times the number of Compute Units. Strix Halo oftentimes can match or even pull ahead of the 5070 Mobile in terms of throughput. I will note that the FP16 results for the 5070 Mobile are half of what I would expect; the FP16:FP32 ratio for the 5070 Mobile should be 1:1 so I am not positive about what is going on there.
Moving to the integer throughput, we see the 5070 Mobile soundly pulling ahead of the Radeon 8060S.
Looking at the GPU performance, we see Strix Halo once again shine, with a staggering level of performance available for an iGPU, courtesy of the large CU count paired with relatively high memory bandwidth. Our comparison suite includes several recent iGPUs from Intel and AMD, along with the newest generation RTX 5070 Mobile @ 75W to act as a reference for mid to high-range laptop dedicated graphics, and the antiquated GTX 1050 as a reference for budget dedicated graphics.
Looking at Fluid X3D for a compute-heavy workload, we can see the Radeon 8060S absolutely obliterates the other iGPUs from Intel and AMD, putting itself firmly in a class above. The 5070M is no slouch though, and still holds a substantial 64.1% lead, largely due to its higher memory bandwidth.
Switching to gaming workloads with Cyberpunk 2077, we start with a benchmark conducted while on battery power. The gap with other iGPUs is still wide, but now the 5070M is limited to 55W and exhibits 7.5% worse performance at 1080p low settings when compared to the Radeon 8060S.
Finally, moving to wall power and allowing both the Radeon 8060S and 5070M to access their full power limits in CP2077, we can see that the 8060S still pulls ahead at 1080p low by 2.5%, while at 1440p medium we see a reversal, with the 5070M commanding an 8.3% lead. Overall the two provide a comparable experience in Cyberpunk 2077, with changes in settings or power limits adjusting the lead between the two. This is a seriously impressive turnaround for an iGPU going up against dedicated graphics, and demonstrates the versatility of the chip in workloads like gaming, where iGPUs have traditionally struggled.
Strix Halo follows in the footsteps of many other companies in the goal of designing an SoC for desktop and laptop usage that is truly all encompassing. The CPU and GPU performance is truly a class above standard low power laptop chips, and is even able to compete with larger systems boasting dedicated graphics. CPU performance is especially impressive with a comparable showing to the desktop Zen 5 CPUs. GPU performance is comparable to mid range dedicated graphics, while still offering the efficiency and integration of an iGPU. High end dedicated graphics still have a place above Strix Halo, but the versatility of this design for smaller form factor devices is class leading.

However, this is not to say that Strix Halo is perfect. I was hoping to have a section dedicated to the ML performance of Strix Halo in this article; however, AMD only just released preview support for Strix Halo in the ROCm 7.0.2 release, which came out about a week before publication. As a result of the long delay between the launch of Strix Halo and the release of ROCm 7.0.2, the ML performance will have to wait until a future article.
However, putting aside ROCm, Strix Halo is a very, very cool piece of technology and I would love to see successors to Strix Halo with newer CPU and GPU IP and possibly even larger memory buses, similar to Apple’s Max and Ultra series of SoCs with 512b and 1024b memory buses respectively. AMD has a formula for building bigger APUs with Strix Halo, which opens the door to a lot of interesting hardware possibilities in the future.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.