
SC25: The Present and Future of HPC Networking with Cornelis Networks CEO Lisa Spelman

2025-12-20 06:21:52

Hello you fine Internet folks,

Today we have an interview with the CEO of Cornelis Networks, Lisa Spelman, where we talk about what makes OmniPath different from other solutions on the market, along with what steps Cornelis has taken in support of Ultra Ethernet.

Hope y’all enjoy!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

This transcript has been lightly edited for readability and conciseness.

George: Hello, you fine internet folks. We’re here at Supercomputing 2025 at the Cornelis Networks booth. So, with me I have Lisa Spelman, CEO of Cornelis. Would you like to tell us about what Cornelis is and what you guys do?

Lisa: I would love to! So, thanks for having me, it’s always good, the only thing we’re missing is some cheese. We got lots of cheese.

George: There was cheese last night!

Lisa: Oh, man! Okay, well, yeah, we were here at the kickoff last night. It was a fun opening. So, Cornelis Networks is a company that is laser-focused on delivering the highest-performance networking solutions for the highest-performance applications in your data center. So that’s your HPC workloads, your AI workloads, and everything that just has intense data demands and benefits a lot from a parallel processing type of use case.

So that’s where all of our architecture, all of our differentiation, all of our work goes into.

George: Awesome. So, Cornelis Networks has their own networking, called OmniPath.

Lisa: Yes.

George: Now, some of you may know OmniPath used to be an Intel technology. But Cornelis, I believe, bought the IP from Intel. So could you go into a little bit about what OmniPath is and the difference between OmniPath, Ethernet, and InfiniBand?

Lisa: Yes, we can do that. So you’re right, Cornelis spun out of Intel with an OmniPath architecture.

George: Okay.

Lisa: And so this OmniPath architecture, I should maybe share, too, that we’re a full-stack, full-solution company. So we design both a NIC, a SuperNIC ASIC, we design a Switch ASIC... Look at you- He’s so good! He’s ready to go!

George: I have the showcases!

Lisa: Okay, so we have... we design our SuperNIC ASIC, we design our Switch ASIC, we design the card for the add-in card for the SuperNIC and the switchboard, all the way up, you know, top of rack, as well as all the way up to a big old director-class system that we have sitting here.

All of that is based on our OmniPath architecture, which was incubated and built at Intel then, like you said, spun out and acquired by Cornelis. So the foundational element of the OmniPath architecture is this lossless and congestion-free design.

So it was... it was built, you know, in the last decade focused on, how do you take all of the data movement happening in highly-parallel workloads and bring together a congestion-free environment that does not lose packets? So it was specifically built to address these modern workloads, the growth of AI, the massive scale of data being put together, learning from the past but letting go of legacy: other networks that may be designed more for storage systems or for... you know, just other use cases. That’s not what they were inherently designed for.

George: The Internet.

Lisa: Yeah.

George: Such as Ethernet.

Lisa: A more modern development. I mean, Ethernet, I mean, amazing, right? But it’s 50 years old now. So what we did was, in this architecture, built in some really advanced capabilities and features like your credit-based flow controls and your dynamic lane scaling. So it’s the performance, as well as adding reliability to the system. And so the network plays a huge role, not only increasing your compute utilization of your GPU or your CPU, but it also can play a really big role in increasing the uptime of your overall system, which has huge economic value.

George: Yeah.

Lisa: So that’s the OmniPath architecture, and the way that it comes to life, and the way that people experience it, is lowest latency in the industry on all of these workloads. You know, we made sure on micro benchmarks, like ping pong latency, all the good micros. And then the highest message rates in the industry. We have two and a half X higher than the closest competitor. So that works really great for those, you know, message-dependent, really fast-rate workloads.

And then on top of that, we’re all going to operate at the same bandwidth. I mean, bandwidth is not really a differentiator anymore. And so we measure ourselves on how fast we get to half-bandwidth, and how quickly we can launch and start the data movement, the packet movement. So fastest to half-bandwidth is one of our points of pride for the architecture as well.

George: So, speaking of sort of Ethernet. I know there’s been the new UltraEthernet consortium in order to update Ethernet to a more... to the standard of today.

Lisa: Yeah.

George: What has Cornelis done to support that, especially with some of your technologies?

Lisa: So we think this move to UltraEthernet is really exciting for the industry. And it was obviously time, I mean, you know, it takes a big need and requirement to get this many companies to work together, and kind of put aside some differences, and come together to come up with a consortium and a capability, a definition that actually does serve the workloads of today.

So we’re very excited and motivated towards it. And the reason that we are so is because we see so much of what we’ve already built in OmniPath being reflected through in the requirements of UltraEthernet. We also have a little bit of a point of pride in that the software standard for UltraEthernet is built on top of LibFabric, which is an open source library, you know, that we developed actually, and we’re maintainers of. So we’re all in on the UltraEthernet, and in fact, we’ve just announced our next generation of products.

George: Speaking of; the CN6000, the successor to the CN5000: what exactly does it support in terms of networking protocols, and what do you sort of see in terms of like industry uptick?

Lisa: Yes. So this is really cool. We think it’s going to be super valuable for our customers. So with our next generation, our CN6000, that’s our 800 gig product, that one is going to be a multi-protocol NIC. So our SuperNIC there, it will have OmniPath native in it, and we have customers that absolutely want that highest performance that they can get through OmniPath and it works great for them. But we’re also adding into it RoCE v2, so Ethernet performance, as well as UltraEthernet, the 1.0, you know, the spec, the UltraEthernet compliance as well.

So you’re going to get this multi-modal NIC, and what we’re doing, what our differentiation is, is that you’re moving to that Ethernet transport layer, but you’re still, behind the scenes, getting the benefits of the OmniPath architecture.

George: OK.

Lisa: So it’s not like it’s two totally separate things. We’re actually going to take your packet, run it through the OmniPath pipes and that architecture benefit, but spit it out as Ethernet, as a protocol or the transport layer.

George: Cool. And for UltraEthernet, I know there’s sort of two sort of specs. There’s what’s sort of colloquially known as the “AI spec” and the “HPC spec”, which have different requirements. For the CN6000, will it be sort of the AI spec or the HPC spec?

Lisa: Yeah. So we’re focusing on making sure the UltraEthernet transport layer absolutely works. But we are absolutely intending to deliver to both HPC performance and AI workload performance. And one of the things I like to kind of point out is, it’s not that they’re so different; AI workloads and HPC workloads have a lot of similar demands on the network. They just pull on them in different ways. So it’s like, message rate, for example: message rate is hugely important in things like computational fluid dynamics.

George: Absolutely.

Lisa: But it also plays a role in inference. Now, it might be the top determiner of performance in a CFD application, and it might be the third or fourth in an AI application. So you need that same, you know, the latency, the message rates, the bandwidth, the overlap, the communications, you know, all that type of stuff. Just workloads pull on them a little differently.

So we’ve built a well-rounded solution that addresses all. And then, by customer use case, you can pull on what you need.

George: Awesome. And as you can see here [points to network switch on table], you guys make your own switches.

Lisa: We do.

George: And you make your own NICs. But one of the questions I have is, can you use the CN6000 NIC with any switch?

Lisa: OK, so that’s a great point. And yes, you can. So one of our big focuses as we expand the company, the customer base, and serve more customers and workloads, is becoming much more broadly industry interoperable.

George: OK.

Lisa: So we think this is important for larger-scale customers that want to maybe run multi-vendor environments. So we’re already doing work on the 800 gig to ensure that it works across a variety of, you know, standard industry-available switches. And that gives customers a lot of flexibility.

Of course, they can still choose to use both the super NIC and the switch from us. And that’s great, we love that, but we know that there’s going to be times when there’s like, a partner or a use case, where having our NIC paired with someone else’s switch is the right move. And we fully support it.

George: So then I guess sort of the flip-side of that is if I have, say, another NIC, could I attach that to an OmniPath switch?

Lisa: You will be able to, not in the CN6000, but stay tuned, I’ll have more breaking news! That’s just a little sneak peek of the future.

George: Well, and sort of to round this off with the most important question. What’s your favorite type of cheese, Lisa?

Lisa: OK, I am from Portland, Oregon. So I have to go with our local, to the state of Oregon, our Rogue Creamery Blues.

George: Oh, OK.

Lisa: I had a chance this summer to go down to Grants Pass, where they’re from, and headquartered, and we did the whole cheese factory tour- I thought of you. We literally got to meet the cows! So it was very nice, it was very cool. And so that’s what I have to go with.

George: One of my favorite cheeses is Tillamook.

Lisa: OK, yes! Yes, another local favorite.

George: Thank you so much, Lisa!

Lisa: Thank you for having me!

George: Yep, have a good one, folks.

Nvidia’s B200: Keeping the CUDA Juggernaut Rolling ft. Verda (formerly DataCrunch)

2025-12-16 04:56:51

Nvidia has dominated the GPU compute scene ever since GPU compute became mainstream. The company’s Blackwell B200 GPU is the next to take up the mantle of being the go-to compute GPU. Unlike prior generations, Blackwell can’t lean heavily on process node improvements. TSMC’s 4NP process likely provides some improvement over the 4N process used in the older Hopper generation, but it’s unlikely to offer the same degree of improvement as prior full node shrinks. Blackwell therefore moves away from Nvidia’s tried-and-tested monolithic die approach, and uses two reticle-sized dies. Both dies appear to software as a single GPU, making the B200 Nvidia’s first chiplet GPU. Each B200 die physically contains 80 Streaming Multiprocessors (SMs), which are analogous to cores on a CPU. B200 enables 74 SMs per die, giving 148 SMs across the GPU. Clock speeds are similar to the H100’s high-power SXM5 variant.

From Nvidia’s presentation at Hot Chips 2024

I’ve listed H100 SXM5 specifications in the table above, but data in the article below will be from the H100 PCIe version unless otherwise specified.

Acknowledgements

A massive thank you goes to Verda (formerly DataCrunch) for providing an instance with 8 B200s which are all connected to each other via NVLink. Verda gave us about 3 weeks with the instance to do with as we wished. We previously covered the CPU part of this VM if you want to check that part out as well.

Cache and Memory Access

B200’s cache hierarchy feels immediately familiar coming from the H100 and A100. L1 and Shared Memory are allocated out of the same SM-private pool. L1/Shared Memory capacity is unchanged from H100 and stays at 256 KB. Possible L1/Shared Memory splits have not changed either. Shared Memory is analogous to AMD’s Local Data Share (LDS) or Intel’s Shared Local Memory (SLM), and provides software managed on-chip storage local to a group of threads. Through Nvidia’s CUDA API, developers can advise on whether to prefer a larger L1 allocation, prefer equal, or prefer more Shared Memory. Those options appear to give 216, 112, and 16 KB of L1 cache capacity, respectively.
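
To make that concrete, here is a minimal CUDA sketch of how a kernel can express that preference. The kernel itself is a placeholder; cudaFuncSetCacheConfig and the shared memory carveout attribute are standard CUDA runtime calls, but the capacities the driver actually picks (such as the L1 sizes measured below) are up to the hardware and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only to attach a cache-config hint to.
__global__ void dummyKernel(float* out) {
    out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

int main() {
    // Advise the driver on the L1 / Shared Memory split for this kernel.
    cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferL1);      // prefer a larger L1
    // cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferEqual);  // prefer an equal split
    // cudaFuncSetCacheConfig(dummyKernel, cudaFuncCachePreferShared); // prefer more Shared Memory
    // Alternatively, request a carveout as a percentage of max Shared Memory:
    // cudaFuncSetAttribute(dummyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    float* out;
    cudaMalloc(&out, 32 * sizeof(float));
    dummyKernel<<<1, 32>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```

The hint is advisory; the driver is free to pick whatever split it considers best for the kernel.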

In other APIs, Shared Memory and L1 splits are completely up to Nvidia’s driver. OpenCL gets the largest 216 KB data cache allocation with a kernel that doesn’t use Shared Memory, which is quite sensible. Vulkan gets a slightly smaller 180 KB L1D allocation. L1D latency as tested with array indexing in OpenCL comes in at a brisk 19.6 ns, or 39 cycles.
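
Those two figures together imply the clock the GPU was running at during the test:

```latex
\frac{39\ \text{cycles}}{19.6\ \text{ns}} \approx 1.99\ \text{GHz}
```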

As with A100 and H100, B200 uses a partitioned L2 cache. However, it’s now much larger with 126 MB of total capacity. For perspective, H100 had 50 MB of L2, and A100 had 40 MB. L2 latency to the directly attached L2 partition is similar to prior generations at about 150 ns. Latency dramatically increases as test sizes spill into the other L2 partition. B200’s cross-partition penalty is higher than its predecessors, but only slightly so. L2 partitions on B200 almost certainly correspond to its two dies. If so, the cross-die latency penalty is small, and outweighed by a single L2 partition having more capacity than H100’s entire L2.

B200 acts like it has a triple level cache setup from a single thread’s perspective. The L2’s partitioned nature can be shown by segmenting the pointer chasing array and having different threads traverse each segment. Curiously, I need a large number of threads before I can access most of the 126 MB capacity without incurring cross-partition penalties. Perhaps Nvidia’s scheduler tries to fill one partition’s SMs before going to the other.
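
The pointer chasing technique itself boils down to a dependent load chain. The tests above were run through OpenCL and Vulkan, so the CUDA sketch below is just an illustration of the idea rather than the exact test code; the footprint, iteration count, and single-thread launch are illustrative.

```cuda
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

// Dependent-load chain: each access depends on the previous one, so the GPU
// cannot overlap the loads and the measured time approximates access latency
// for whatever level of the hierarchy the array fits in.
__global__ void pointerChase(const unsigned* __restrict__ chain,
                             unsigned start, int iters, unsigned* out)
{
    unsigned idx = start;
    for (int i = 0; i < iters; i++)
        idx = chain[idx];   // serialized, latency-bound load
    *out = idx;             // keep the result live so the loop isn't removed
}

int main()
{
    const int elems = 4 * 1024 * 1024;   // 16 MB footprint; vary this to sweep cache levels
    const int iters = 1 << 22;

    // Build a single-cycle random permutation (Sattolo's algorithm) so the
    // chase visits every element before repeating.
    std::vector<unsigned> h(elems);
    std::iota(h.begin(), h.end(), 0u);
    std::mt19937 rng{42};
    for (int i = elems - 1; i > 0; i--) {
        std::uniform_int_distribution<int> dist(0, i - 1);
        std::swap(h[i], h[dist(rng)]);
    }

    unsigned *d_chain, *d_out;
    cudaMalloc(&d_chain, elems * sizeof(unsigned));
    cudaMalloc(&d_out, sizeof(unsigned));
    cudaMemcpy(d_chain, h.data(), elems * sizeof(unsigned), cudaMemcpyHostToDevice);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    pointerChase<<<1, 1>>>(d_chain, 0, iters, d_out);   // one thread: pure latency
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.2f ns per access\n", ms * 1e6f / iters);

    cudaFree(d_chain); cudaFree(d_out);
    return 0;
}
```

Splitting the array into per-thread segments, as described above, just means building several independent chains and launching one chasing thread per segment.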

AMD’s Instinct MI300X has a true triple-level cache setup, which trades blows with B200’s. Nvidia has larger and faster L1 caches. AMD’s L2 trades capacity for a latency advantage compared to Nvidia. Finally, AMD’s 256 MB last level cache offers an impressive combination of latency and high capacity. Latency is lower than Nvidia’s “far” L2 partition.

One curiosity is that both the MI300X and B200 show more uniform latency across the last level cache when I run multiple threads hitting a segmented pointer chasing array. However, the reasons behind that latency increase are different. The latency increase past 64 MB on AMD appears to be caused by TLB misses, because testing with a 4 KB stride shows a latency increase at the same point. Launching more threads brings more TLB instances into play, mitigating address translation penalties. Cutting out TLB miss penalties also lowers measured VRAM latency on the MI300X. On B200, splitting the array didn’t lower measured VRAM latency, suggesting TLB misses either weren’t a significant factor with a single thread, or bringing on more threads didn’t reduce TLB misses. B200 thus appears to have higher VRAM latency than the MI300X, as well as the older H100 and A100. Just as with the L2 cross-partition penalty, the modest nature of the latency regression versus H100/A100 suggests Nvidia’s multi-die design is working well.

OpenCL’s local memory space is backed by Nvidia’s Shared Memory, AMD’s LDS, or Intel’s SLM. Testing local memory latency with array accesses shows B200 continuing the tradition of offering excellent Shared Memory latency. Accesses are faster than on any AMD GPU I’ve tested so far, including high-clocked members of the RDNA line. AMD’s CDNA-based GPUs have much higher local memory latency.

Atomic operations on local memory can be used to exchange data between threads in the same workgroup. On Nvidia, that means threads running on the same SM. Bouncing data between threads using atomic_cmpxchg shows latency on par with AMD’s MI300X. As with pointer chasing latency, B200 gives small incremental improvements over prior generations. AMD’s RDNA line does very well in this test compared to big compute GPUs.

Modern GPUs use dedicated atomic ALUs to handle operations like atomic adds and increments. Testing with atomic_add gives a throughput of 32 operations per cycle, per SM on the B200. I wrote this test after MI300X testing was concluded, so I only have data from the MI300A. Like GCN, AMD’s CDNA3 Compute Units can sustain 16 atomic adds per cycle. That lets the B200 pull ahead despite having a lower core count.
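
A throughput test along those lines can be structured like the CUDA sketch below. This is a sketch of the approach, not the exact test code: the block count assumes B200’s 148 SMs, and each thread hammers its own Shared Memory slot so the test stresses the atomic ALUs rather than contention on a single address.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread issues a long stream of atomicAdd operations to its own
// Shared Memory word. Total operations divided by elapsed time gives
// throughput, which can be normalized to clock speed for ops/cycle.
__global__ void sharedAtomicAdd(int iters, unsigned* out)
{
    __shared__ unsigned counters[1024];
    const unsigned slot = threadIdx.x;   // one private slot per thread, no contention
    counters[slot] = 0;
    __syncthreads();

    for (int i = 0; i < iters; i++)
        atomicAdd(&counters[slot], 1u);

    __syncthreads();
    if (threadIdx.x == 0)
        out[blockIdx.x] = counters[0];   // keep the work observable
}

int main()
{
    const int blocks = 148;              // assumed: one block per B200 SM; adjust per GPU
    const int threads = 1024, iters = 1 << 16;
    unsigned* d_out;
    cudaMalloc(&d_out, blocks * sizeof(unsigned));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    sharedAtomicAdd<<<blocks, threads>>>(iters, d_out);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    double ops = double(blocks) * threads * iters;
    printf("%.2f billion shared-memory atomic adds per second\n", ops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```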

Bandwidth Measurements

A higher SM count gives B200 a large L1 cache bandwidth advantage over its predecessors. It also catches up to AMD’s MI300X from OpenCL testing. Older, smaller consumer GPUs like the RX 6900XT are left in the dust.

Local memory offers the same bandwidth as L1 cache hits on the B200, because both are backed by the same block of storage. That leaves AMD’s MI300X with a large bandwidth lead. Local memory is more difficult to take advantage of because developers must explicitly manage data movement, while a cache automatically takes advantage of locality. But it is an area where the huge MI300X continues to hold a lead.

Nemez’s Vulkan-based benchmark provides an idea of what L2 bandwidth is like on the B200. Smaller data footprints contained within the local L2 partition achieve 21 TB/s of bandwidth. That drops to 16.8 TB/s when data starts crossing between the two partitions. AMD’s MI300X has no graphics API support and can’t run Vulkan compute. However, AMD has specified that the 256 MB Infinity Cache provides 14.7 TB/s of bandwidth. The MI300X doesn’t need the same degree of bandwidth from Infinity Cache because the 4 MB L2 instances in front of it should absorb a good chunk of L1 miss traffic.

The B200 offers a large bandwidth advantage over the outgoing H100 at all levels of the cache hierarchy. Thanks to HBM3E, B200 also gains a VRAM bandwidth lead over the MI300X. While the MI300X also has eight HBM stacks, it uses older HBM3 and tops out at 5.3 TB/s.

Global Memory Atomics

AMD’s MI300X showed varying latency when using atomic_cmpxchg to bounce values between threads. Its complex multi-die setup likely contributed to this. The same applies to B200. Here, I’m launching as many single-thread workgroups as GPU cores (SMs or CUs), and selecting different thread pairs to test. I’m using an access pattern similar to that of a CPU-side core to core latency test, but there’s no way to influence where each thread gets placed. Therefore, this isn’t a proper core to core latency test and results are not consistent between runs. But it is enough to display latency variation, and show that the B200 has a bimodal latency distribution.
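
The core of that ping-pong pattern looks roughly like the CUDA sketch below. This is assumed structure, not the exact test code: two single-thread blocks pass a token through a global mailbox with atomicCAS, both blocks are assumed to be resident at the same time (true on any GPU with more than a couple of SMs), and there is no way to control which SMs they land on.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two single-thread blocks alternate bumping a token in a global mailbox.
// Kernel time divided by the total number of hand-offs approximates the
// latency of exchanging data between whichever SMs the blocks land on.
__global__ void bounce(unsigned* mailbox, int iters)
{
    const int role = blockIdx.x;                    // 0 or 1
    for (int i = 0; i < iters; i++) {
        const unsigned expect = 2u * i + role;      // the value that signals our turn
        // Spin until the other thread has written our expected value,
        // then hand the token back by writing expect + 1.
        while (atomicCAS(mailbox, expect, expect + 1u) != expect) { }
    }
}

int main()
{
    const int iters = 100000;
    unsigned* mailbox;
    cudaMalloc(&mailbox, sizeof(unsigned));
    cudaMemset(mailbox, 0, sizeof(unsigned));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    bounce<<<2, 1>>>(mailbox, iters);               // two blocks, one thread each
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.1f ns per hop\n", ms * 1e6f / (2.0f * iters));

    cudaFree(mailbox);
    return 0;
}
```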

Latency for bouncing a value between thread pairs. Because there’s no good way to pin threads to specific SMs, results vary across runs. However, the bimodal distribution remains.

Latency is 90-100 ns in good cases, likely when threads are on the same L2 partition. Bad cases land in the 190-220 ns range, and likely represent communication crossing L2 partition boundaries. Results on AMD’s MI300X range from ~116 to ~202 ns. B200’s good case is slightly better than AMD’s, while the bad case is slightly worse.

I only ran a simple two-thread test on H100 and A100, because testing all thread pairs takes a long time and cloud instances are expensive.

Thread to thread latency is generally higher on datacenter GPUs compared to high clocked consumer GPUs like the RX 6900XT. Exchanging data across a GPU with hundreds of SMs or CUs is challenging, even in the best cases.

Atomic adds on global memory are usually handled by dedicated ALUs at a GPU-wide shared cache level. Nvidia’s B200 can sustain just short of 512 such operations per cycle across the GPU. AMD’s MI300A does poorly in this test, achieving lower throughput than the consumer oriented RX 6900XT.

Compute Throughput

Increased SM count gives the B200 higher compute throughput than the H100 across most vector operations. However, FP16 is an exception. Nvidia’s older GPUs could do FP16 operations at twice the FP32 rate. B200 cannot.

AMD’s MI300X can also do double rate FP16 compute. Likely, Nvidia decided to focus on handling FP16 with the Tensor Cores, or matrix multiplication units. Stepping back, the MI300X’s massive scale blows out both the H100 and B200 for most vector operations. Despite using older process nodes, AMD’s aggressive chiplet setup still holds advantages.

Tensor Memory

B200 targets AI applications, and no discussion would be complete without covering its machine learning optimizations. Nvidia has used Tensor Cores, or dedicated matrix multiplication units, since the Turing/Volta generation years ago. GPUs expose a SIMT programming model where developers can treat each lane as an independent thread, at least from a correctness perspective. Tensor Cores break the SIMT abstraction by requiring a certain matrix layout across a wave, or vector. Blackwell’s 5th generation Tensor Cores go one step further, and have a matrix span multiple waves in a workgroup (CTA).

Diagram of a B200 Streaming Multiprocessor from Nvidia’s blog post. Note the TMEM blocks

Blackwell also introduces Tensor Memory, or TMEM. TMEM acts like a register file dedicated to the Tensor Cores. Developers can store matrix data in TMEM, and Blackwell’s workgroup level matrix multiplication instructions use TMEM rather than the register file. TMEM is organized as 512 columns by 128 rows with 32-bit cells. Each wave can only access 32 TMEM rows, determined by its wave index. That implies each SM sub-partition has a 512 column by 32 row TMEM partition. A “TensorCore Collector Buffer” can take advantage of matrix data reuse, taking the role of a register reuse cache for TMEM.

TMEM therefore works like the accumulator register file (Acc VGPRs) on AMD’s CDNA architecture. CDNA’s MFMA matrix instructions similarly operate on data in the Acc VGPRs, though MFMA can also take source matrices from regular VGPRs. On Blackwell, only the older wave-level matrix multiplication instructions take regular register inputs. Both TMEM and CDNA’s Acc VGPRs have 64 KB of capacity, giving both architectures a 64+64 KB split register file per execution unit partition. The regular vector execution units cannot take inputs from TMEM on Nvidia, or the Acc VGPRs on AMD.

AMD CDNA’s split register file layout. Accumulation VGPRs take a similar role to TMEM, while the miSIMD block is analogous to a Tensor Core

While Blackwell’s TMEM and CDNA’s Acc VGPRs have similar high level goals, TMEM is a more capable and mature implementation of the split register file idea. CDNA had to allocate the same number of Acc and regular VGPRs for each wave. Doing so likely simplified bookkeeping, but creates an inflexible arrangement where mixing matrix and non-matrix waves would make inefficient use of register file capacity. TMEM in contrast uses a dynamic allocation scheme similar in principle to dynamic VGPR allocation on AMD’s RDNA4. Each wave starts with no TMEM allocated, and can allocate 32 to 512 columns in powers of two. All rows are allocated at the same time, and waves must explicitly release allocated TMEM before exiting. TMEM can also be loaded from Shared Memory or the regular register file, while CDNA’s Acc VGPRs can only be loaded through regular VGPRs. Finally, TMEM can optionally “decompress” 4 or 6-bit data types to 8 bits as data is loaded in.

How Shared Memory can provide enough bandwidth to feed all four execution unit partitions is beyond me

Compared to prior Nvidia generations, adding TMEM helps reduce capacity and bandwidth pressure on the regular register file. Introducing TMEM is likely easier than expanding the regular register file. Blackwell’s CTA-level matrix instructions can sustain 1024 16-bit MAC operations per cycle, per partition. Because one matrix input always comes from Shared Memory, TMEM only has to read one row and accumulate into another row every cycle. The regular vector registers would have to sustain three reads and one write per cycle for FMA instructions. Finally, TMEM doesn’t have to be wired to the vector units. All of that lets Blackwell act like it has a larger register file for AI applications, while allowing hardware simplifications. Nvidia has used 64 KB register files since the Kepler architecture from 2012, so a register file capacity increase feels overdue. TMEM delivers that in a way.

On AMD’s side, CDNA2 abandoned dedicated Acc VGPRs and merged all of the VGPRs into a unified 128 KB pool. Going for a larger unified pool of registers can benefit a wider range of applications, at the cost of not allowing certain hardware simplifications.

Some Light Benchmarking

Datacenter GPUs traditionally have strong FP64 performance, and that continues to be the case for B200. Basic FP64 operations execute at half the FP32 rate, making it much faster than consumer GPUs. In a self-penned benchmark, B200 continues to do well compared to both consumer GPUs and H100. However, the MI300X’s massive size shows through even though it’s an outgoing GPU.

In the workload above, I take a 2360x2250 FITS file with column density values and output gravitational potential values with the same dimensions. The data footprint is therefore 85 MB. Even without performance counter data, it’s safe to assume it fits in the last level cache on both MI300X and B200.

FluidX3D is a different matter. Its benchmark uses a 256x256x256 cell configuration and 93 bytes per cell in FP32 mode, for a 1.5 GB memory footprint. Its access patterns aren’t cache friendly, based on testing with performance counters on Strix Halo. FluidX3D plays well into B200’s VRAM bandwidth advantage, and the B200 now pulls ahead of the MI300X.

FluidX3D can also use 16-bit floating point formats for storage, reducing memory capacity and bandwidth requirements. Computations still use FP32 and format conversion costs extra compute, so the FP16 formats result in a higher compute to bandwidth ratio. That typically improves performance because FluidX3D is so bandwidth bound. When using IEEE FP16 for storage, AMD’s MI300A catches up slightly but leaves the B200 ahead by a large margin.

Another FP16C format reduces the accuracy penalty associated with using 16 bit storage formats. It’s a custom floating point format without hardware support, which further drives up the compute to bandwidth ratio.

With compute front and center again, AMD’s MI300A pulls ahead. The B200 doesn’t do badly, but it can’t compete with the massive compute throughput that AMD’s huge chiplet GPU provides.

Teething Issues

We encountered three GPU hangs over several weeks of testing. The issue manifested with a GPU process getting stuck. Then, any process trying to use any of the system’s eight GPUs would also hang. None of the hung processes could be terminated, even with SIGKILL. Attaching GDB to one of them would cause GDB to freeze as well. The system remained responsive for CPU-only applications, but only a reboot restored GPU functionality. nvidia-smi would also get stuck. Kernel messages indicated the Nvidia unified memory kernel module, or nvidia_uvm, took a lock with preemption disabled.

The stack trace suggests Nvidia might be trying to free allocated virtual memory, possibly on the GPU. Taking a lock makes sense because Nvidia probably doesn’t want other threads accessing a page freelist while it’s being modified. Why it never leaves the critical section is anyone’s guess. Perhaps it makes a request to the GPU and never gets a response. Or perhaps it’s purely a software deadlock bug on the host side.

# nvidia-smi -r

The following GPUs could not be reset:

GPU 00000000:03:00.0: In use by another client

GPU 00000000:04:00.0: In use by another client

GPU 00000000:05:00.0: In use by another client

GPU 00000000:06:00.0: In use by another client

GPU 00000000:07:00.0: In use by another client

GPU 00000000:08:00.0: In use by another client

GPU 00000000:09:00.0: In use by another client

GPU 00000000:0A:00.0: In use by another client

Hangs like this aren’t surprising. Hardware acceleration adds complexity, which translates to more failure points. But modern hardware stacks have evolved to handle GPU issues without a reboot. Windows’s Timeout Detection and Recovery (TDR) mechanism for example can ask the driver to reset a hung GPU. nvidia-smi does offer a reset option. But frustratingly, it doesn’t work if the GPUs are in use. That defeats the purpose of offering a reset option. I expect Nvidia to iron out these issues over time, especially if the root cause lies purely with software or firmware. But hitting this issue several times within such a short timespan isn’t a good sign, and it would be good if Nvidia could offer ways to recover from such issues without a system reboot.

Final Words

Nvidia has made the chiplet jump without any major performance concessions. B200 is a straightforward successor to H100 and A100, and software doesn’t have to care about the multi-die setup. Nvidia’s multi-die strategy is conservative next to the 12-die monster that is AMD’s MI300X. MI300X retains some surprising advantages over Nvidia’s latest GPU, despite being an outgoing product. AMD’s incoming datacenter GPUs will likely retain those advantages, while catching up in areas that the B200 has pulled ahead in. The MI350X for example will bring VRAM bandwidth to 8 TB/s.

From Nvidia’s presentation at Hot Chips 2024

But Nvidia’s conservative approach is understandable. Their strength doesn’t lie in being able to build the biggest, baddest GPU around the block. Rather, Nvidia benefits from their CUDA software ecosystem. GPU compute code is typically written for Nvidia GPUs first. Non-Nvidia GPUs are an afterthought, if they’re thought about at all. Hardware isn’t useful without software to run on it, and quick ports won’t benefit from the same degree of optimization. Nvidia doesn’t need to match the MI300X or its successors in every area. They just have to be good enough to prevent people from filling in the metaphorical CUDA moat. Trying to build a monster to match the MI300X is risky, and Nvidia has every reason to avoid risk when they have a dominant market position.

From Nvidia’s blog post

Still, Nvidia’s strategy leaves AMD with an opportunity. AMD has everything to gain from being ambitious and taking risks. GPUs like the MI300X are impressive showcases of hardware engineering, and demonstrate AMD’s ability to take on ambitious design goals. It’ll be interesting to see whether Nvidia’s conservative hardware strategy and software strength will result in its continued dominance.

Again, a massive thank you goes to Verda (formerly DataCrunch), without which this article would not be possible! If you want to try out the B200 or test other NVIDIA GPUs yourself, Verda is offering free trial credits specifically for Chips & Cheese readers. Simply enter the code “CHEESE-B200” to redeem $50 worth of credit, and follow these instructions on how to redeem coupon credits.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. Blackwell Tuning Guide, indicating 126 MB of L2 capacity

  2. Inside Blackwell (Nvidia site)

  3. Tensor Memory Documentation

  4. Hopper Whitepaper

  5. AMD CDNA (MI100) ISA Manual

SC25: Estimating AMD’s Upcoming MI430X’s FP64 and the Discovery Supercomputer

2025-12-11 00:08:58

Hello, you fine Internet folks,

At Supercomputing 2025, EuroHPC, AMD, and Eviden announced that the second Exascale system in Europe was going to be called Alice Recoque. It is powered by Eviden’s BullSequana XH3500 platform, with AMD’s upcoming Instinct MI430X providing most of the compute for the supercomputer. Alice Recoque is going to "exceed a sustained performance of 1 Exaflop/s HPL performance… with less than 15 Megawatts of electrical power."

Now, AMD hasn’t said what the FLOPs of MI430X is, but there is likely enough information to speculate about the potential FP64 FLOPs of MI430X as a thought experiment.

In terms of what we know:

- Alice Recoque is going to be made of 94 XH3500 racks

- Alice Recoque uses less than 15 Megawatts of power in actual usage, but the facility can provide 24 Megawatts of power and 20 Megawatts of cooling

- BullSequana XH3500 has a cap of 264 Kilowatts per rack

- BullSequana XH3500 is 38U per rack

- BullSequana XH3500’s sleds are 2U for switch sleds and 8-GPU compute sleds, and 1U for 4-GPU compute sleds

What we also know is that HPE rates a single MI430X between 2000 watts and 2500 watts of power draw.

Alice Recoque has 3 different possible energy consumption numbers:

  1. Less than 15 Megawatts

  2. The facility thermal limit of 20 Megawatts of cooling

  3. Facility power limit of 24 Megawatts

This allows for 3 possible configurations of the XH3500 racks:

  1. For the sub 15 Megawatt system, 16 Compute nodes each with 1 Venice CPU and 4 MI430X along with 8 switch blades

  2. For the 20 Megawatt system, 18 Compute nodes each with 1 Venice CPU and 4 MI430X along with 9 switch blades

  3. For the 24 Megawatt system, 20 Compute nodes each with 1 Venice CPU and 4 MI430X along with 8 switch blades

For this calculation, I am going to run with the middle configuration of 18 Compute nodes and 9 switch nodes per rack along with a maximum sustained energy consumption of approximately 20 Megawatts for the full Alice Recoque Supercomputer. This puts the maximum power consumption of a single XH3500 rack at about 200 Kilowatts. As such I am going to run with the assumption that each of the Compute nodes can pull approximately 10.5 Kilowatts and as a result each of the MI430Xs will be estimated to have a TDP of about 2300 watts. This gives 1300 watts for the remainder of the Compute blade including the Venice CPU which has a TDP of up to 600 watts.

With 94 racks, 18 Compute nodes per rack, and 4 MI430Xs per node, this gives a total of 6768 GPUs for the whole Alice Recoque Supercomputer. Assuming the "exceed a sustained performance of 1 Exaflop/s HPL performance" figure is the HPL Rmax value of Alice Recoque, and that the Rmax to Rpeak ratio is approximately 70% (similar to Frontier’s ratio), this puts Alice Recoque’s HPL Rpeak at a minimum of 1.43 Exaflops. Dividing the HPL Rpeak number by the number of MI430Xs gets you an FP64 Vector FLOPs number for MI430X of approximately 211 Teraflops.
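
Written out, the arithmetic behind that estimate (with the assumed 70% Rmax-to-Rpeak ratio) is:

```latex
\begin{aligned}
N_{\text{GPU}} &= 94\ \text{racks} \times 18\ \text{nodes/rack} \times 4\ \text{GPUs/node} = 6768 \\
R_{\text{peak}} &\gtrsim \frac{R_{\text{max}}}{0.70} = \frac{1\ \text{EF/s}}{0.70} \approx 1.43\ \text{EF/s} \\
\text{FP64}_{\text{MI430X}} &\approx \frac{1.43 \times 10^{18}\ \text{FLOPS}}{6768} \approx 211\ \text{TFLOPS}
\end{aligned}
```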

MI430X pairs this approximate 211 Teraflops of FP64 Vector with 432 Gigabytes of HBM4 providing 19.6 TB/s of memory bandwidth; this is unsurprisingly the same memory subsystem as MI450X. An important metric in HPC is the amount of compute for a given amount of memory bandwidth, usually denoted as FLOPs per Byte or F/B. The arithmetic intensity of many HPC problems is low, so a lower FLOPs per Byte number is preferred because the majority of HPC code is memory bandwidth bound.

So assuming 211TF of FP64 Vector for MI430X, this gives MI430X a lower (better) FP64 FLOPs per Byte ratio than both of AMD’s prior HPC-focused accelerators. However, MI430X does still have a higher FP64 FLOPs per Byte ratio compared to Nvidia’s offerings. But MI430X does have two aces up its sleeve compared to Nvidia’s latest and upcoming accelerators.

The first ace is that MI430X has significantly more memory bandwidth than AMD’s prior offerings, and even more memory bandwidth than Nvidia’s upcoming Rubin accelerator, which matters given the number of memory bandwidth bound tasks in HPC.

The second ace is that MI430X has nearly 3.5 times the HBM capacity of AMD’s prior accelerators and 50% more HBM capacity than Nvidia’s upcoming Rubin, which means that a larger dataset can fit on a single MI430X.

The Upcoming Discovery Supercomputer at ORNL

Just prior to Supercomputing 2025, AMD, HPE, and the Department of Energy announced the replacement of Frontier, codenamed Discovery, which is due to be delivered in 2028 and turned on in 2029 at the Oak Ridge National Laboratory in Oak Ridge, Tennessee.

Again, we don’t know much about Discovery other than it is going to use HPE’s new GX5000 platform and that it will use AMD’s Venice CPUs and MI430X accelerators.

Speaking of HPE’s GX5000 platform, it has 3 initial compute blade configurations:

  1. GX250: The GX250 blade has 8 Venice CPUs with up to 40 blades per rack for a total of up to 320 Venice CPUs per GX5000 rack

  2. GX350a: The GX350a blade has 1 Venice CPU and 4 MI430X Accelerators with up to 28 blades per rack for a total of 28 Venice CPUs and 112 MI430X Accelerators per GX5000 rack

  3. GX440n: The GX440n blade has 4 Nvidia Vera CPUs and 8 Rubin Accelerators per blade with up to 24 blades per rack for a total of 96 Vera CPUs and 192 Rubin Accelerators per GX5000 rack

The current GX5000 platform can supply up to 400 kilowatts per rack, which is likely for the full GX440n configuration, where the 192 Rubins, each rated for 1800 watts, pull about 350 kilowatts on their own before counting the CPUs, memory, etc. The GX5000 also has about half the floor area of the prior generation EX4000 (1.08 m^2 vs 2.055 m^2). This means that you can fit 2 GX5000 racks into the area of a single EX4000 rack.

For Discovery, the configuration that we are interested in is the GX350a configuration of the GX5000. Now, what hasn’t been announced yet is the HPL speedup goal but it is expected that Discovery will deliver “three to five times more computational throughput for benchmark and scientific applications than Frontier.”

Due to the vagueness of what exactly the three to five times performance increase refers to, whether this is three to five times faster in actual HPC workloads or if this is three to five times faster in LINPACK, I am going to propose 2 different configurations for Discovery:

  1. A configuration that can fit into the current power and floor footprint of Frontier’s building

  2. A configuration that is approximately five times the Rpeak of Frontier at the time the RFP closed on August 30th, 2024, which was approximately 1.714 Exaflops

Now for the first configuration: Frontier has 74 EX4000 racks for its compute system, which means that approximately 140 GX5000 racks can fit into that floor space. This means that Discovery would have a total of 3,920 Venice CPUs and 15,680 MI430X Accelerators, for a rough total of 3.3 Exaflops of HPL Rpeak.
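
Roughly, for this 140-rack configuration:

```latex
\begin{aligned}
140\ \text{racks} \times 28\ \text{CPUs/rack} &= 3920\ \text{Venice CPUs} \\
140\ \text{racks} \times 112\ \text{GPUs/rack} &= 15680\ \text{MI430X} \\
15680 \times 211\ \text{TFLOPS} &\approx 3.3\ \text{EFLOPS of HPL}\ R_{\text{peak}}
\end{aligned}
```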

This Rpeak of roughly 3.3 Exaflops would have a rough power draw of about 35 Megawatts, assuming each of the GX5000 racks uses approximately 250 Kilowatts. While this is 10 Megawatts more than what Frontier draws in HPL, the building for Frontier is designed to distribute up to 40 Megawatts, so this 140 rack configuration does just fit into the power and floor space of Frontier’s building. However, if you decrease the per-rack power to 160 Kilowatts, then Discovery could easily fit into Frontier’s power footprint.

For the second configuration, I am taking the top end estimate of Discovery as five times faster than Frontier and running with a configuration that is approximately five times the Rpeak of Frontier ca. August 2024 which would make Discovery a ~8.5 Exaflop system. This would need approximately 360 GX5000 racks for a total of 10,080 Venice CPUs and 40,320 MI430X Accelerators.
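
Working through the numbers for this larger configuration:

```latex
\begin{aligned}
5 \times 1.714\ \text{EFLOPS} &\approx 8.57\ \text{EFLOPS of target}\ R_{\text{peak}} \\
\frac{8.57\ \text{EFLOPS}}{112 \times 211\ \text{TFLOPS per rack}} &\approx 363\ \text{racks} \approx 360\ \text{racks} \\
360 \times 28 &= 10080\ \text{Venice CPUs}, \quad 360 \times 112 = 40320\ \text{MI430X}
\end{aligned}
```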

This configuration would likely need a facilities upgrade for both power and floor space to accommodate the system. For the floor space, this configuration may require over 1,600 m^2. And for the power, assuming 250 Kilowatts per GX5000 rack, this configuration would consume over 90 Megawatts; however, toning down the per-rack power to around 160 Kilowatts would put Discovery into the 55 to 60 Megawatt range.

The most likely configuration of Discovery is probably somewhere in between these two configurations. However, it is rumored that China may have a supercomputer that is able to get over 2 Exaflops of HPL Rmax at 300 megawatts, with a potential companion that uses roughly 2.5 times the power at 800 megawatts and could get over 5 Exaflops of HPL Rmax. This may have driven the Department of Energy to choose the largest configuration, or possibly a configuration that is even larger than five times Frontier.

Regardless of what Discovery ends up being in terms of the final configuration, it is undeniably an exciting time in the world of HPC!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Evaluating Uniform Memory Access Mode on AMD's Turin ft. Verda (formerly DataCrunch.io)

2025-11-27 04:46:13

NUMA, or Non-Uniform Memory Access, lets hardware expose affinity between cores and memory controllers to software. NUMA nodes traditionally aligned with socket boundaries, but modern server chips can subdivide a socket into multiple NUMA nodes. It’s a reflection of how non-uniform interconnects get as core and memory controller counts keep going up. AMD designates their NUMA modes with the NPS (Nodes Per Socket) prefix.

NPS0 is a special NUMA mode that goes in the other direction. Rather than subdivide the system, NPS0 exposes a dual socket system as a single monolithic entity. It evenly distributes memory accesses across all memory controller channels, providing uniform memory access like in a desktop system. NPS0 and similar modes exist because optimizing for NUMA can be complicated and time intensive. Programmers have to specify a NUMA node for each memory allocation, and take care to minimize cross-node memory accesses. Each NUMA node only represents a fraction of system resources, so code pinned to a NUMA node will be constrained by that node’s CPU core count, memory bandwidth, and memory capacity. Effort spent getting an application to scale across NUMA nodes might be effort not spent on a software project’s other goals.

From AMD’s EPYC 9005 Series Architecture Overview, showing a dual socket Zen 5 (Turin) setup in NPS1 mode

Acknowledgements

A massive thank you goes to Verda (formerly DataCrunch) for providing an instance with 2 AMD EPYC 9575Fs and 8 Nvidia B200 GPUs. Verda gave us about 3 weeks with the instance to do with as we wished. While this article looks at the AMD EPYC 9575Fs, there will be upcoming coverage of the B200s found in the VM.

This system appears to be running in NPS0 mode, giving an opportunity to see how a modern server acts with 24 memory controllers providing uniform memory access.

A simple latency test immediately shows the cost of providing uniform memory access. DRAM latency rises to over 220 ns, giving a nearly 90 ns penalty over the EPYC 9355P running in NPS1 mode. It’s a high penalty compared to using the equivalent of NPS0 on older systems. For example, a dual socket Broadwell system has 75.8 ns of DRAM latency when each socket is treated as a NUMA node, and 104.6 ns with uniform memory access[1].

NPS0 mode does have a bandwidth advantage from bringing twice as many memory controllers into play. But the extra bandwidth doesn’t translate to a latency advantage until bandwidth demands reach nearly 400 GB/s. The EPYC 9355P seems to suffer when a latency test thread is mixed with bandwidth heavy ones. A bandwidth test thread with just linear read patterns can achieve 479 GB/s in NPS1 mode. However, my bandwidth test produces low values on the EPYC 9575F because not all test threads finish at the same time. I avoid this problem in the loaded memory latency test, because I have bandwidth load threads check a flag. That lets me stop all threads at approximately the same time.
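
As a host-side sketch of that flag mechanism (assumed structure, not the actual test code; the thread count, buffer size, and the sleep standing in for the latency measurement are all placeholders):

```cuda
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Load-generating threads stream reads until the latency thread finishes and
// raises stop_flag, so every thread stops at roughly the same time and the
// bandwidth load covers the same interval as the latency measurement.
static std::atomic<bool> stop_flag{false};
static std::atomic<uint64_t> bytes_read{0};

void bandwidth_thread(const uint64_t* buf, size_t len)
{
    uint64_t sum = 0, local_bytes = 0;
    while (!stop_flag.load(std::memory_order_relaxed)) {
        for (size_t i = 0; i < len; i += 8)          // touch one element per 64B line
            sum += buf[i];
        local_bytes += len * sizeof(uint64_t);       // one line pulled per 8 elements
    }
    bytes_read += local_bytes;
    if (sum == 0xDEADBEEF) printf("unlikely\n");     // keep the reads live
}

int main()
{
    const size_t len = 1ull << 27;                   // 1 GB working set (assumed size)
    std::vector<uint64_t> buf(len, 1);

    std::vector<std::thread> loaders;
    for (int t = 0; t < 8; t++)                      // 8 load threads, for example
        loaders.emplace_back(bandwidth_thread, buf.data(), len);

    // The latency measurement (pointer chase) would run here on its own thread;
    // when it finishes, it flips the flag so every loader winds down together.
    std::this_thread::sleep_for(std::chrono::seconds(2));
    stop_flag.store(true);

    for (auto& th : loaders) th.join();
    printf("%.1f GB read by load threads\n", bytes_read.load() / 1e9);
    return 0;
}
```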

Per-CCD bandwidth is barely affected by the different NPS modes. Both the EPYC 9355P and 9575F use “GMI-Wide” links for their Core Complex Dies, or CCDs. GMI-Wide provides 64B/cycle of read and write bandwidth at the Infinity Fabric clock. On both chips, each CCD enjoys more bandwidth to the system compared to standard “GMI-Narrow” configurations. For reference, a GMI-Narrow setup running at a typical desktop 2 GHz FCLK would be limited to 64 GB/s of read and 32 GB/s of write bandwidth.

Performance: SPEC CPU2017

Higher memory latency could lead to lower performance, especially in single threaded workloads. But the EPYC 9575F does surprisingly well in SPEC CPU2017. The EPYC 9575F runs at a higher 5 GHz clock speed, and DRAM latency is only one of many factors that affect CPU performance.

Individual workloads show a more complex picture. The EPYC 9575F does best when workloads don’t miss cache. Then, its high 5 GHz clock speed can shine. 548.exchange2 is an example. On the other hand, workloads that hit DRAM a lot suffer in NPS0 mode. 502.gcc, 505.mcf, and 520.omnetpp see the EPYC 9575F’s higher clock speed count for nothing, and the higher clocked chip underperforms compared to 4.4 GHz setups with lower DRAM latency.

SPEC CPU2017’s floating point suite also shows diverse behavior. 549.fotonik3d and 554.roms suffer in NPS0 mode as the EPYC 9575F struggles to keep itself fed. 538.imagick plays nicely to the EPYC 9575F’s advantages. In that test, high cache hitrates let the 9575F’s higher core throughput shine through.

Final Words

NPS0 mode performs surprisingly well in a single threaded SPEC CPU2017 run. Some sub-tests suffer from higher memory latency, but enough other tests benefit from the higher 5 GHz clock speed to make up the difference. It’s a lesson about the importance of clock speeds and good caching in a modern server CPU. Those two factors go together, because faster cores only provide a performance advantage if the memory subsystem can feed them. The EPYC 9575F’s good overall performance despite having over 220 ns of memory latency shows how good its caching setup is.

As for running in NPS0 mode, I don’t think it’s worthwhile in a modern system. The latency penalty is very high, and bandwidth gains are minor for NUMA-unaware code. I expect those latency penalties to get worse as server core and memory controller counts continue to increase. For workloads that need to scale across socket boundaries, optimizing for NUMA looks to be an unfortunate necessity.

Again, a massive thank you goes to Verda (formerly DataCrunch) without which this article, and the upcoming B200 article, would not be possible!

SC25: HACCing over 500 Petaflops on Frontier

2025-11-22 11:57:05

Hello you fine Internet folks,

Here at Supercomputing, the Gordon Bell Prize is announced every year. The prize recognizes outstanding achievement in high-performance computing applications.

One of this year’s finalists is the largest-ever simulation of the Universe, run on the Frontier Supercomputer at the Department of Energy’s Oak Ridge National Laboratory (ORNL). The simulation was run using the Hardware/Hybrid Accelerated Cosmology Code, also known as HACC.

This simulation of the observable universe tracked over 4 trillion particles across 15 billion light years of space. The prior state-of-the-art observable universe simulations only went up to about 250 billion particles, roughly a fifteenth of the particle count of this new simulation.

This HACC simulation shows the universe about 10 billion years after the Big Bang.

But not only was this the largest universe simulation ever, the ORNL team managed to get over 500 Petaflops on nearly 9,000 of Frontier’s 9,402 nodes. As a reminder, Frontier manages approximately 1,353 Petaflops on High Performance Linpack (HPL). This means that for this simulation the ORNL team got about 37% of Frontier’s Rmax HPL performance, which is very impressive for a non-synthetic workload.
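
The arithmetic behind that figure:

```latex
\frac{500\ \text{PFLOPS}}{1353\ \text{PFLOPS (HPL}\ R_{\text{max}})} \approx 0.37 = 37\%
```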

It is awesome to see the Department of Energy’s (DOE) supercomputers being used for amazing science like this! With the announcement of the Discovery Supercomputer that is due in 2028/2029, I can’t wait to see the science that comes out of that system when it is turned over to the scientific community!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Qualcomm’s Snapdragon X2 Elite

2025-11-19 22:10:20

Hello you fine Internet folks,

Last week I was in San Diego at Qualcomm’s Headquarters where Qualcomm disclosed more information about their upcoming Snapdragon X2 Elite SOC.

Snapdragon X2 Elite is Qualcomm’s newest SoC for the Windows on ARM ecosystem, designed to bring a new level of performance, so let’s dive in.

Oryon Gen 3 Prime CPU Cores and Cluster

The Snapdragon X2 Elite (SDX2E) comes equipped with a total of 18 CPU cores: 12 “Prime” cores and 6 “Performance” cores.

Starting with the Prime cores, these are the real heart of the SDX2E SoC with a total of up to 12 cores split across 2 clusters that can clock up to 5.0 GHz.

Each of these clusters has a 16MB, 16-way associative, shared L2 cache with 6 Prime cores attached, along with a Qualcomm Matrix Engine per cluster.

The L2 can serve up to 64B per cycle per core with a total fill bandwidth of up to 256B per cycle for the cluster. The L1 miss to L2 hit latency is now 21 cycles, up from the 17 cycles of the Snapdragon X Elite (SDXE); the increase comes from the larger size of the structure. The L2 runs at the same clocks as the cores and supports over 220 in-flight transactions, with each core supporting over 50 outstanding requests to the L2 at a time.

Diving into the CPU cores, quite a bit is familiar with this core at a high level.

Starting with the L1 instruction cache, it is 192KB in size with 6-way associativity and is fully coherent. The Fetch Unit can fetch up to 16 4-byte instructions per cycle for a total fetch bandwidth of 64 bytes per cycle. The L1 iTLB is an 8-way associative, 256 entry structure that supports 4KB and larger page sizes.

Moving to the Decode, Rename, and Retirement stages, Oryon Gen 3 has widened these stages to 9 wide, up from the 8 wide of Oryon Gen 1, meaning that Oryon Gen 3 can retire up to 9 micro-ops per cycle. There are over 400 Vector and Integer registers in their respective physical register files, which is similar to the number of entries in Oryon Gen 1. The Reorder Buffer is similarly 650+ entries for Oryon Gen 3.

Delving into the Integer side of the core, Oryon Gen 3 now has 4 Branch units, double the number found inside Oryon Gen 1. Otherwise, the integer side of Oryon Gen 3 is very similar to Oryon Gen 1, with 6 20-entry Reservation Stations for a total of 120 entries in the Integer scheduler, 6 Integer ALUs with 2 capable of multiplies, and one unit capable of handling Crypto and Division instructions.

Swapping to the Vector unit, Oryon Gen 3 adds SVE and SVE2 support to the core, with a high-level layout similar to Oryon Gen 1: over 400 128b Vector registers, 4 128b Vector ALUs all capable of FMAs, and 4 48-entry Reservation Stations for a total of 192 entries in the Vector scheduler.

Moving to the Load and Store system, Oryon Gen 3 has the same 4 Memory AGUs as Oryon Gen 1, all of which are capable of both loads and stores. These feed a 192 entry Load Queue and 56 entry Store Queue, which are the same size as the queues found on Oryon Gen 1. The L1 Data Cache is also the same fully coherent 96KB, 6-way, 64 byte cache line structure that Oryon Gen 1 had.

Landing at the Memory Management Unit, the TLBs of Oryon Gen 3 are again very similar to Oryon Gen 1, with one possible difference. Slide 11 says that the L1 dTLB is a 224 entry, 7-way structure, whereas Slide 12 says that the L1 dTLB is a 256 entry, 8-way structure. If Slide 12 is correct, then this is an increase from Oryon Gen 1’s 224 entry, 7-way L1 dTLB. Oryon Gen 3’s 256 entry, 8-way L1 iTLB and 8K entry, 8-way shared L2 TLB are unchanged from Oryon Gen 1. Note that the 2 cycle access for the L2 TLB is the SRAM access time, not the total latency for a TLB lookup, which Qualcomm wouldn’t disclose but is in a similar range to the ~7 cycles you see on modern x86 cores.

Qualcomm’s Matrix Engine

In each of SDX2E’s 3 clusters lies an SME-compatible Matrix Engine.

This matrix unit uses a 64 bit x 64 bit MLA numeric element in an 8x8 or 4x8 grid. This means that the matrix unit is 4096 bits wide and can do up to 128 FP32/INT32, 256 FP16/BF16/INT16, or 512 INT8 operations per cycle. The Matrix Engine is on a separate clock domain from the cores and L2 cache for better power and thermal management of the SoC.
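
Those per-cycle figures follow directly from dividing the 4096-bit datapath by the element width:

```latex
\frac{4096\ \text{bits}}{32\ \text{bits}} = 128, \qquad
\frac{4096\ \text{bits}}{16\ \text{bits}} = 256, \qquad
\frac{4096\ \text{bits}}{8\ \text{bits}} = 512
```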

Qualcomm Oryon Gen 3 Performance Core and Cluster

Something new to the SDX2E that the SDXE didn’t have is a 3rd cluster on board, with what Qualcomm is calling their "Performance cores".

This cluster has the same number of cores, along with a Matrix Engine, as the Prime clusters, but instead of 16 MB of shared L2, the Performance cluster has 12 MB of shared L2.

The Performance core is also different from the Prime cores. These cores are of a similar but distinct microarchitecture to the Prime cores. They are targeted at a lower power point and have been optimized for operation below 2 watts. As such, the Performance core isn’t as wide as the Prime core and has fewer execution pipes, smaller caches, and shallower Out-of-Order structures compared to the Prime core.

Adreno X2 GPU

SDX2E has a revamped GPU architecture that Qualcomm is calling the Adreno X2 microarchitecture.

This is the largest GPU Qualcomm has made to date, with 2048 FP32 ALUs clocking up to 1.85GHz.

The Adreno X2 is a “Slice-Based” architecture, with a slice being roughly equivalent to a Shader Engine from AMD or a GPC from Nvidia, and 4 slices in the top-end X2-90. Each slice has one Front-End which is capable of rasterizing up to 4 triangles per cycle.

After the Front-End come the Shader Processors, which are roughly equivalent to AMD’s WGPs or Nvidia’s SMs. Each Shader Processor has an instruction cache and 2 micro-Shader Processors (uSPs), similar to AMD’s SIMD units or Nvidia’s SMSPs, and each uSP has a 128KB register file feeding 128 ALUs which support FP32, FP16, and BF16. A change from the prior Adreno X1 architecture is the removal of Wave128; Adreno X2 only supports Wave64 and can dual issue Wave64 instructions in order to keep the 128 ALUs fed.

Each uSP has a Ray Tracing Unit which supports either 4 ray-triangle or 8 ray-box intersections per cycle.

From there is what is referred to as Adreno High Performance Memory (AHPM). There is 21 MB of AHPM in an X2-90 GPU, 5.25 MB per slice, which acts as both a scratchpad and a cache depending on how the driver configures it. Up to 3 MB of each 5.25 MB slice can be configured as a cache, with the remaining 2.25 MB of SRAM being a scratchpad.

AHPM is designed to allow the GPU to do tiled rendering entirely within the AHPM before writing the frame out to the display buffer. This reduces the amount of data movement that the GPU has to do, which consequently improves the performance per watt of the Adreno X2 compared to the Adreno X1.

Moving back to the Slice level, each slice has a 128 KB cluster cache which is then backed by a unified 2 MB L2 cache. This L2 can then spill into the 8 MB System Level Cache (SLC) which then is backed by the up to 228 GB/s memory subsystem.

As for API support, Adreno X2 supports DX12.2, Shader Model 6.8, Vulkan 1.4, OpenCL 3.0, as well as SYCL support coming in the first half of 2026.

Hexagon NPU

Qualcomm has increased the performance of the Hexagon NPU from 45 TOPS of INT8 to 80 TOPS of INT8 with SDX2E.

Qualcomm has also added FP8 and BF16 support to the Hexagon NPU 6 vector unit.

In addition to the BF16 and FP8 support, the new matrix engine in NPU 6 has INT2 dequantization support.

However, the largest change in NPU 6 is the addition of 64 bit Virtual Addressing to the DMA unit which means that NPU 6 can now access more than 4GB of memory.

Power and Performance

For testing the power of a system, Qualcomm used what they call INPP, or Idle Normalized Platform Power. INPP is the total platform power during load minus the platform power at idle.

What INPP gets you is the SoC power plus the DRAM power plus the power conversion losses; while this isn’t quite solely SoC power, INPP is about as close as you can get to pure SoC power in a laptop form factor where discrete power sensors aren’t very common.

Different workloads have different power characteristics. For example, while GB 6 Multi-threaded doesn’t pull a ton of power overall, it is a very bursty workload that can spike to over 150 watts; whereas a memory bandwidth test pulls over 105 watts in a sustained fashion.

Looking at the performance versus power graph in Cinebench R24 MT, the SDX2E Extreme with 18 cores (12 Prime and 6 Performance cores) scores just over 1950 points in Cinebench R24 at about 105 watts INPP with the standard SDX2E with 12 cores (6 Prime and 6 Performance cores) scoring just over 1100 points at approximately 50 watts INPP.

Qualcomm has also implemented a clock boosting scheme similar to Intel’s Turbo Boost where depending on the number of cores active in a cluster, the cluster will clock up or down accordingly.

Editor’s Note (11/30/2025): Qualcomm got in touch with Chips and Cheese to clarify that their boosting algorithm is a cluster-level algorithm where the boosting behavior is independent for the 3 clusters, whereas Intel’s Turbo Boost mechanism is an SoC-level algorithm that affects all cores on the package.

Qualcomm also highlighted the performance of the SDX2E when the laptop is on battery compared to when it is on wall power.

Conclusion

Qualcomm has made significant advances with the SDX2E with regard to the CPU, GPU, and NPU. SDX2E is planned to hit shelves in the first half of 2026, and we can’t wait to get a system to test.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.