
Everactive’s Self-Powered SoC at Hot Chips 2025

2025-09-18 01:25:23

Hot Chips 2025 was filled with presentations on high power devices. AI can demand hundreds of watts per chip, requiring sophisticated power delivery and equally complex cooling. It’s a far cry from consumer devices that must make do with standard wall outlets, or batteries for mobile devices. But if working within a wall outlet or battery’s constraints is hard, it’s nothing compared to working with, well, nothing. That’s Everactive’s focus, and the subject of their presentation at Hot Chips 2025.

Everactive makes SoCs that power themselves off energy harvested from the environment, letting them operate without a steady power source. Self-powered SoCs are well suited for large IoT deployments. Wiring power to devices or replacing batteries complicates scaling device count, especially when those IoT devices are sensors in difficult-to-reach places. Harvesting energy from the environment is also attractive from a sustainability standpoint. Reliability can benefit too, because the device won’t suffer from electrical grid failures.

But relying on energy harvesting brings its own challenges. Harvested energy is often in the milliwatt or microwatt range, depending on energy source. Power levels fluctuate depending on environmental conditions too. All that means the SoC has to run at very low power while taking measures to maximize harvested power and survive through suboptimal conditions.

SoC Overview

Everactive’s SoC is designated PKS3000 and occupies 6.7 mm² on a 55 nm ULP process node. It’s designed around collecting and transmitting data within the tight constraints of self-powered operation. The SoC can run at 5 MHz with 12 microwatt power draw, and has a 2.19 microwatt power floor. It interfaces with a variety of sensors to gather data, and has a power-optimized Wi-Fi/Bluetooth/5G radio for data transmission. An expansion port can connect to an external microcontroller and storage, which can also be powered off harvested energy.

On the processing side, the SoC includes an Arm Cortex M0+ microcontroller. Cortex M0+ is one of Arm's lowest power cores, and comes with a simple two-stage pipeline. Its memory subsystem is similarly bare-bones, with no built-in caches and a memory port shared between instruction and data accesses. The chip includes 128 KB of SRAM and 256 KB of flash. The SoC thus has less storage and less system memory than the original IBM PC, but clocks higher, which goes to show just how far silicon manufacturing has come in nearly 50 years.

EH-PMU: Orchestrating Power Delivery

Energy harvesting is central to the chip’s goals, and is controlled by the Energy Harvesting Power Management Unit (EH-PMU). The EH-PMU uses a multiple input, single inductor, multiple output (MISIMO) topology, letting it simultaneously harvest energy from two energy sources and output power on four rails. For energy harvesting sources, Everactive gives light and temperature differences as examples because they’re often available. Light can be harvested from a photovoltaic cell (PV), and a thermoelectric generator (TEG) can do so from temperature differences. Depending on expected environmental conditions, the SoC can be set up with other energy sources like radio emissions, mechanical vibrations, or airflow.

Maximum power point tracking (MPPT) helps the EH-PMU improve energy harvesting efficiency by getting energy from each harvesting source at its optimal voltage. Because harvested energy can often be unstable despite using two sources and optimization techniques, the PKS3000 can store energy in a pair of capacitors. A supercapacitor provides deep energy storage, and aims to let the chip keep going under adverse conditions. A smaller capacitor charges faster, and can quickly collect power to speed up cold starts after brownouts. Everactive’s SoC can cold-start using the PV/TEG combo at 60 lux and 8 °C, which is about equivalent to indoor lighting in a chilly room.
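
Everactive didn’t disclose the MPPT algorithm the EH-PMU uses, but perturb-and-observe is a common approach and gives a feel for the control loop involved. The sketch below is a minimal illustration against a made-up PV power curve; the harvestedPower function, voltages, and step size are all hypothetical.

```cpp
#include <algorithm>
#include <cstdio>

// Toy stand-in for a measured PV power curve: power peaks at some maximum power
// point (MPP) voltage. A real EH-PMU would measure this, not model it.
static double harvestedPower(double v) {
    const double vMpp = 0.45;  // hypothetical MPP voltage
    return std::max(0.0, 1.0 - 40.0 * (v - vMpp) * (v - vMpp));  // arbitrary units
}

int main() {
    double v = 0.30, step = 0.01;
    double lastPower = harvestedPower(v);
    // Perturb-and-observe: nudge the operating voltage, keep going if power rose,
    // reverse direction if it fell. The loop settles near the maximum power point.
    for (int i = 0; i < 50; i++) {
        v += step;
        double p = harvestedPower(v);
        if (p < lastPower) step = -step;  // overshot the peak, turn around
        lastPower = p;
    }
    std::printf("settled near %.2f V, power %.3f (arbitrary units)\n", v, lastPower);
}
```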

The EH-PMU powers four output rails, named 1p8, 1p2, 0p9, and adj, which can be fed off a harvesting source under good conditions. Under bad conditions, the EH-PMU can feed them off stored energy in the capacitors. A set of pulse counters monitors energy flow throughout the chip. Collected energy statistics are fed to the Energy Aware Subsystem.

Energy Aware Subsystem

The Energy Aware Subsystem (EAS) monitors energy harvesting, storage, and consumption to make power management decisions. It plays a role analogous to Intel’s Power Control Unit (PCU) or AMD’s System Management Unit (SMU). Like Intel’s PCU or AMD’s SMU, the EAS manages frequency and voltage scaling with a set of different policies. Firmware can hook into the EAS to set maximum frequencies and power management policies, just as an OS would do on higher power platforms. Energy statistics can be used to decide when to enable components or perform OTA updates.
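
Everactive didn’t detail the EAS policies themselves, but a minimal sketch helps show the kind of decision logic involved: pick an operating point from stored energy, harvest rate, and a firmware-set cap. All names, thresholds, and fields below are invented for illustration; this is not Everactive’s actual policy.

```cpp
#include <cstdio>

// Invented operating points; Full might correspond to the 5 MHz, 12 microwatt mode.
enum class Clock { Sleep, Low, Full };

// Invented energy statistics, loosely modeled on what the pulse counters report.
struct EnergyStats {
    double storedEnergyFraction;  // 0.0 = storage empty, 1.0 = full
    double harvestedMicrowatts;   // current harvest rate
    double loadMicrowatts;        // current consumption
};

// Hypothetical policy: sleep through a near-brownout, run fast only when storage is
// comfortable, harvesting covers the load, and firmware allows the top clock.
Clock pickOperatingPoint(const EnergyStats& s, bool firmwareAllowsFull) {
    bool netPositive = s.harvestedMicrowatts >= s.loadMicrowatts;
    if (s.storedEnergyFraction < 0.10) return Clock::Sleep;
    if (s.storedEnergyFraction > 0.75 && netPositive && firmwareAllowsFull)
        return Clock::Full;
    return Clock::Low;
}

int main() {
    EnergyStats s{0.80, 40.0, 12.0};  // hypothetical readings
    std::printf("chosen operating point: %d\n", int(pickOperatingPoint(s, true)));
}
```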

The EAS also controls a load switch that can be used to cut off external components. Everactive found that some external components have a high power floor. Shutting them off prevents them from monopolizing the chip’s very limited power budget. Components can be notified and allowed to perform a graceful shutdown. But the EAS also has the power to shut them off in a non-cooperative fashion. That gives the EAS more power than Intel’s PCU or AMD’s SMU, which can’t independently decide that some components should be shut off. But the EAS needs those powers because it has heavier responsibilities and constraints. For Everactive’s SoC, poor power management isn’t a question of slightly higher power bills or shorter battery life. Instead, it could cause frequent brownouts that prevent the SoC from providing timely sensor data updates. That’s highly non-ideal for industrial monitoring applications, where missed data could prevent operators from noticing a problem.

The EAS supports a Wait-for-Wakeup mode, which lets the device enter a low power mode where it can still respond very quickly to activity from sensors or the environment. Generally, Everactive’s SoC is geared towards trying to idle as often as possible.

Wake-Up Radio

Idle optimizations extend to the radio. Communication is a challenge for self-powered SoCs because radios can have a high power floor even if all they’re doing is staying connected to a network. Disconnecting is an option, but it’s a poor one because reconnecting introduces delays, and some access points don’t reliably handle frequent reconnects. Everactive attacks this problem with a Wake-Up Radio (WRX), which uses an always-on receiver capable of receiving a subset of messages at very low power. Compared to a conventional radio that uses duty cycling, wakeup receivers seek to reduce power wasted on idle monitoring while achieving lower latency.

Everactive’s WRX shares an antenna with an external communication transceiver, which the customer picks with their network design in mind. An RF switch lets the WRX share the antenna, and on-board matching networks pick the frequency. The passive path can operate from 300 MHz to 3 GHz, letting it support a wide range of standards with the appropriate on-board matching networks. A broadband signal is input to the wakeup receiver on the passive path, which does energy detection. Then, the WRX can do baseband gain or intermediate frequency (IF) gain depending on the protocol.

WRX power varies with configuration. Using the passive path without RF gain provides a very low baseline power of under a microwatt while providing respectable -63 dBm sensitivity for a sub-GHz application. That mode provides about 200 m of range, making it useful for industrial monitoring environments. Longer range applications (over 1000 m) require more power, because RF boost has to get involved to increase sensitivity. To prevent high active power from becoming a problem, Everactive uses multi-stage wakeup and very fine-grained duty cycling, letting the radio sample only at times when a bit of a wakeup message might arrive. With those techniques, the chip is able to achieve -92 dBm sensitivity while keeping power below 6 microwatts on average.
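
The power savings from duty cycling come down to simple arithmetic: average power is the duty-cycled blend of active and sleep power. The numbers below are hypothetical stand-ins, not Everactive’s figures, but they show why sampling briefly and rarely keeps the average in the single-digit microwatt range.

```cpp
#include <cstdio>

int main() {
    // Hypothetical numbers: average power = duty * activePower + (1 - duty) * sleepPower.
    double activeUw = 500.0;  // receiver power while actively sampling (made up)
    double sleepUw  = 0.5;    // power between samples (made up)
    double duty     = 0.01;   // listening 1% of the time
    double avgUw    = duty * activeUw + (1.0 - duty) * sleepUw;
    std::printf("average power: %.2f microwatts\n", avgUw);  // ~5.5 uW
}
```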

For comparison, Intel’s Wi-Fi 6 AX201 can achieve similar sensitivities, but idles at 1.6 mW in Core Power Down. Power goes up to 3.4 mW when associated with a 2.4 GHz access point. That sort of power draw is still very low in an absolute sense. But Everactive’s WRX setup pushes even lower, which is both impressive and reiterates the challenges associated with self-powered operation. Everactive, for their part, sees standards moving in the direction of using wake-up radios. Wake-up receivers have been researched for decades, and continue to improve over time. There’s undeniably a benefit to idle power, and it’ll be interesting to see if wake-up receivers get more widely adopted.

Final Words

Everactive’s PKS3000 is a showcase of extreme power saving measures. The goal of improving power efficiency is thrown around quite a lot. Datacenter GPUs strive for more FLOPS/watt. Laptop chips get power optimizations with each generation. But self-powered SoCs really take things to the next level because their power budget is so limited. Many of Everactive’s optimizations shave off power on a microwatt scale, an order of magnitude below the milliwatts that mobile devices care about. PKS3000 can idle at 2.19 microwatts with some limitations, or under 4 microwatts with the wakeup receiver always on. Even under load, Everactive’s SoC draws orders of magnitude less power than the chips PC enthusiasts are familiar with, let alone the AI-oriented monster chips capable of drawing a kilowatt or more.

The PKS3000 improves over Everactive’s own PKS2001 SoC as well, reducing power while running at higher clocks and achieving better radio sensitivity. Pushing idle power down from 30 to 2.19 microwatts is impressive. Decreasing active power from 89.1 to 12 microwatts is commendable too. PKS3000 does move to a more advanced 55nm ULP node, compared to the 65nm node used by PKS2001. But a lot of improvements no doubt come from architectural techniques too.

And it’s important to not lose sight of the big picture. Neither SoC uses a cutting edge FinFET node, yet they’re able to accomplish their task with minuscule power requirements. Self-powered SoCs have limitations of course, and it’s easy to see why Everactive focuses on industrial monitoring applications. But I do wonder if low power SoCs could cover a broader range of use cases as the technology develops, or if more funding lets them use modern process nodes. Everactive’s presentation was juxtaposed alongside talks on high power AI setups. Google talked about managing power swings on the megawatt scale with massive AI training deployments. Meta discussed how they increased per-rack power to 93.5 kW. Power consumption raises sustainability concerns with the current AI boom, and battery-less self-powered SoCs are very attractive from a sustainability perspective. I would love to see energy harvesting SoCs take on more tasks.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. Cortex M0+ from Arm: https://developer.arm.com/Processors/Cortex-M0-Plus

  2. Jesús Argote-Aguilar's thesis, "Powering Low-Power Wake-up Radios with RF Energy Harvesting": https://theses.hal.science/tel-05100987v1/file/ARGOTE_Jesus.pdf - gives an example of a wakeup receiver with 0.2 microwatt power consumption and -53 dBm sensitivity, and other examples with power going into the nanowatt range when no data is received.

  3. R. van Langevelde et al., "An Ultra-Low-Power 868/915 MHz RF Transceiver for Wireless Sensor Network Applications" at https://picture.iczhiku.com/resource/ieee/shkGpPyIRjUikxcX.pdf - ~1.2 microwatt sleep, 2.4 milliwatts receive, 2.7 milliwatts transmit, -89 dBm sensitivity

  4. Nathan M. Pletcher et al., "A 52 μW Wake-Up Receiver With -72 dBm Sensitivity Using an Uncertain-IF Architecture" - https://people.eecs.berkeley.edu/~pister/290Q/Papers/Radios/Pletcher%20WakeUp%20jssc09.pdf - old WRX from 2009

  5. Jonathan K. Brown et al., "A 65nm Energy-Harvesting ULP SoC with 256kB Cortex-M0 Enabling an 89.1µW Continuous Machine Health Monitoring Wireless Self-Powered System" - paper on Everactive's older SoC

  6. Intel® Wi-Fi 6 AX201 (Harrison Peak 2) and Wi-Fi 6 AX101 (Harrison Peak 1) External Product Specifications (EPS)

AMD’s RDNA4 GPU Architecture at Hot Chips 2025

2025-09-14 05:01:39

RDNA4 is AMD’s latest graphics-focused architecture, and fills out their RX 9000 line of discrete GPUs. AMD noted that creating a good gaming GPU requires understanding current workloads as well as anticipating what workloads might look like five years in the future. Thus AMD has been trying to improve efficiency across rasterization, compute, and raytracing. Machine learning has gained importance, including in games, so AMD’s new GPU architecture caters to ML workloads as well.

From AMD’s perspective, RDNA4 represents a large efficiency leap in raytracing and machine learning, while also improving on the rasterization front. Improved compression helps keep the graphics architecture fed. Outside of the GPU’s core graphics acceleration responsibility, RDNA4 brings improved media and display capabilities to round out the package.

Media Engine

The Media Engine provides hardware accelerated video encode and decode for a wide range of codecs. High end RDNA4 parts like the RX 9070XT have two media engines. RDNA4’s media engines feature faster decoding speed, helping save power during video playback by racing to idle. For video encoding, AMD targeted better quality in H.264, H.265, and AV1, especially in low latency encoding.

Low latency encoder modes are mostly beneficial for streaming, where delays caused by the media engine ultimately translate to a delayed stream. Reducing latency can make quality optimizations more challenging. Video codecs strive to encode differences between frames to economize storage. Buffering up more frames gives the encoder more opportunities to look for similar content across frames, and lets it allocate more bitrate budget for difficult sequences. But buffering up frames introduces latency. Another challenge is that some popular streaming platforms mainly use H.264, an older codec that’s less efficient than AV1. Newer codecs are being tested, so the situation may start to change as the next few decades fly by. But for now, H.264 remains important due to its wide support.

Testing with an old gameplay clip from Elder Scrolls Online shows a clear advantage for RDNA4’s media engine when using the latency-constrained VBR mode with the encoder tuned for low latency encoding (-usage lowlatency -rc vbr_latency). Netflix’s VMAF video quality metric gives higher scores for RDNA4 throughout the bitrate range. Closer inspection generally agrees with the VMAF metric.

RDNA4 does a better job preserving high contrast outlines. Differences are especially visible around text, which RDNA4 handles better than its predecessor while using a lower bitrate. Neither result looks great under such close inspection, with blurred text in both examples and fine detail crushed by video encoding artifacts. But it’s worth remembering that the latency-constrained VBR mode uses a VBV buffer of up to three frames, while higher latency modes can use VBV buffer sizes covering multiple seconds of video. Encoding speed has improved slightly as well, rising from ~190 FPS on RDNA3.5 to ~200 FPS on RDNA4.
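
To put the buffering difference in perspective, a VBV buffer sized for N average frames adds very roughly N divided by the frame rate of worst-case buffering delay at the target bitrate. The frame rate and buffer sizes below are illustrative only.

```cpp
#include <cstdio>

int main() {
    // Worst-case buffering delay is roughly VBV size divided by bitrate. A VBV sized
    // for N average frames therefore adds about N / fps seconds. Illustrative only.
    double fps = 60.0;
    std::printf("3-frame VBV at %.0f FPS: ~%.0f ms of buffering\n", fps, 3.0 / fps * 1000.0);
    std::printf("2-second VBV:           ~%.0f ms of buffering\n", 2000.0);
}
```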

Display Engine

The display engine fetches on-screen frame data from memory, composites it into a final image, and drives it to the display outputs. It’s a basic task that most people take for granted, but the display engine is also a good place to perform various image enhancements. A traditional example is using a lookup table to apply color correction. Enhancements at the display engine are invisible to user software, and are typically carried out in hardware with minimal power cost. On RDNA4, AMD added a “Radeon Image Sharpening” filter, letting the display engine sharpen the final image. Using dedicated hardware at the display engine instead of the GPU’s programmable shaders means that the sharpening filter won’t impact performance and can be carried out with better power efficiency. And, AMD doesn’t need to rely on game developers to implement the effect. Sharpening can even apply to the desktop, though I’m not sure why anyone would want that.

Power consumption is another important optimization area for display engines. Traditionally that’s been more of a concern for mobile products, where maximizing battery life under low load is a top priority. But RDNA4 has taken aim at multi-monitor idle power with its newer display engine. AMD’s presentation stated that they took advantage of variable refresh rates on FreeSync displays. They didn’t go into more detail, but it’s easy to imagine what AMD might be doing. High resolution and high refresh rate displays translate to high pixel rates. That in turn drives higher memory bandwidth demands. Dynamically lowering refresh rates could let RDNA4’s memory subsystem enter a low power state while still meeting refresh deadlines.

Power and GDDR6 data rates for various refresh rate combinations. AMD’s monitoring software (and others) read out extremely low memory clocks when the memory bus is able to idle, so those readings aren’t listed.

I have an RX 9070 hooked up to a Viotek GN24CW 1080P display via HDMI, and an MSI MAG271QX 1440P display capable of refresh rates up to 360 Hz. The latter is connected via DisplayPort. The RX 9070 manages to keep memory at idle clocks even at high refresh rate settings. Moving the mouse causes the card to ramp up memory clocks and consume more power, hinting that RDNA4 is lowering refresh rates when screen contents don’t change. Additionally, RDNA4 gets an intermediate GDDR6 power state that lets it handle the 1080P 60 Hz + 1440P 240 Hz combination without going to maximum memory clocks. On RDNA2, it’s more of an all or nothing situation. The older card is more prone to ramping up memory clocks to handle high pixel rates, and power consumption remains high even when screen contents don’t change.

Compute Changes

RDNA4’s Workgroup Processor retains the same high level layout as prior RDNA generations. However, it gets major improvements targeted towards raytracing, like improved raytracing units and wider BVH nodes, a dynamic register allocation mode, and a scheduler that no longer suffers false memory dependencies between waves. I covered those in previous articles. Besides those improvements, AMD’s presentation went over a couple other details worth discussing.

Scalar Floating Point Instructions

AMD has a long history of using a scalar unit to offload operations that are constant across a wave. Scalar offload saves power by avoiding redundant computation, and frees up the vector unit to increase performance in compute-bound sequences. RDNA4’s scalar unit gains a few floating point instructions, expanding scalar offload opportunities. This capability debuted on RDNA3.5, but RDNA4 brings it to discrete GPUs.

While not discussed in AMD’s presentation, scalar offload can bring additional performance benefits because scalar instructions sometimes have lower latency than their vector counterparts. Most basic vector instructions on RDNA4 have 5 cycle latency. FP32 adds and multiplies on the scalar unit have 4 cycle latency. The biggest latency benefits still come from offloading integer operations though.

Split Barriers

GPUs use barriers to synchronize threads and enforce memory ordering. For example, an s_barrier instruction on older AMD GPUs would cause a thread to wait until all of its peers in the workgroup also reached the s_barrier instruction. Barriers degrade performance because any thread that happens to reach the barrier faster has to stall until its peers catch up.

RDNA4 splits the barrier into separate “signal” and “wait” actions. Instead of s_barrier, RDNA4 has s_barrier_signal and s_barrier_wait. A thread can “signal” the barrier once it produces data that other threads might need. It can then do independent work, and only wait on the barrier once it needs to use data produced by other threads. The s_barrier_wait will then stall the thread until all other threads in the workgroup have signalled the barrier.
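
AMD’s ISA isn’t needed to see why splitting the barrier helps. C++20’s std::barrier exposes the same split on the CPU side, with arrive() returning a token and wait() blocking on it, so the sketch below uses it purely as an analogy (it is not AMD’s mechanism): publish data, signal, overlap independent work, then wait only when peers’ results are actually needed.

```cpp
#include <barrier>
#include <cstdio>
#include <thread>
#include <utility>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    std::vector<int> shared(kThreads, 0);
    std::barrier sync(kThreads);

    auto worker = [&](int id) {
        shared[id] = id * 10;         // produce data other threads may need
        auto token = sync.arrive();   // "signal": arrival does not block
        int independent = id * id;    // independent work overlaps peers' progress
        sync.wait(std::move(token));  // "wait": block until every thread has arrived
        std::printf("thread %d sees peer value %d, independent result %d\n",
                    id, shared[(id + 1) % kThreads], independent);
    };

    std::vector<std::jthread> pool;
    for (int i = 0; i < kThreads; i++) pool.emplace_back(worker, i);
}
```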

Memory Subsystem

The largest RDNA4 variants have an 8 MB L2 cache, representing a substantial L2 capacity increase compared to prior RDNA generations. RDNA3 and RDNA2 maxed out at 6 MB and 4 MB L2 capacities, respectively. AMD found that difficult workloads like raytracing benefit from the larger L2. Raytracing involves pointer chasing during BVH traversal, and it’s not surprising that it’s more sensitive to accesses getting serviced from the slower Infinity Cache as opposed to L2. In the initial scene in 3DMark’s DXR feature test, run in Explorer Mode, RDNA4 dramatically cuts down the amount of data that has to be fetched from beyond L2.

RDNA2 still does a good job of keeping data in L2 in absolute terms. But it’s worth noting that hitting Infinity Cache on both platforms adds more than 50 ns of extra latency over a L2 hit. That’s well north of 100 cycles because both RDNA2 and RDNA4 run above 2 GHz. While AMD’s graphics strategy has shifted towards making the faster caches bigger, it still contrasts with Nvidia’s strategy of putting way more eggs in the L2 basket. Blackwell’s L2 cache serves the functions of both AMD’s L2 and Infinity Cache, and has latency between those two cache levels. Nvidia also has a flexible L1/shared memory allocation scheme that can give them more low latency caching capacity in front of L2, depending on a workload’s requested local storage (shared memory) capacity.

A mid-level L1 cache was a familiar fixture on prior RDNA generations. It’s conspicuously missing from RDNA4, as well as AMD’s presentation. One possibility is that L1 cache hitrate wasn’t high enough to justify the complexity of an extra cache level. Perhaps AMD felt its area and transistor budget was better allocated towards increasing L2 capacity. In support of this theory, L1 hitrate on RDNA1 was often below 50%. At the same time, the RDNA series always enjoyed a high bandwidth and low latency L2. Putting more pressure on L2 in exchange for reducing L2 misses may have been an enticing tradeoff. Another possibility is that AMD ran into validation issues with the L1 cache and decided to skip it for this generation. There’s no way to verify either possibility of course, but I think the former reasons make more sense.

Beyond tweaking the cache hierarchy, RDNA4 brings improvements to transparent compression. AMD emphasized that they’re using compression throughout the SoC, including at points like the display engine and media engine. Compressed data can be stored in caches, and decompressed before being written back to memory. Compression cuts down on data transfer, which reduces bandwidth requirements and improves power efficiency.

Transparent compression is not a new feature. It has a long history of being one tool in the GPU toolbox for reducing memory bandwidth usage, and it would be difficult to find any modern GPU without compression features of some sort. Even compression in other blocks like the display engine has precedent. Intel’s display engines for example use Framebuffer Compression (FBC), which can write a compressed copy of frame data and keep fetching the compressed copy to reduce data transfer power usage as long as the data doesn’t change. Prior RDNA generations had compression features too, and AMD’s documentation summarizes some compression targets. While AMD didn’t talk about compression efficiency, I tried to take similar frame captures using RGP on both RDNA1 and RDNA4 to see if there’s a large difference in memory access per frame. It didn’t quite work out the way I expected, but I’ll put them here anyway and discuss why evaluating compression efficacy is challenging.

The first challenge is that both architectures satisfy most memory requests from L0 or L1. AMD slides on RDNA1 suggest the L0 and L1 only hold decompressed data, at least for delta color compression. Compression does apply to L2. For RDNA4, AMD’s slides indicate it applies to the Infinity Cache too. However, focusing on data transfer to and from the L2 wouldn’t work due to the large cache hierarchy differences between those RDNA generations.

DCC, or delta color compression, is not the only form of compression. But this slide shows one example of compression/decompression happening in front of L2

Another issue is that it’s easy to imagine a compression scheme that doesn’t change the number of cache requests involved. For example, data might be compressed to only take up part of a cacheline. A request only causes a subset of the cacheline to be read out, which a decompressor module expands to the full 128B. Older RDNA1 slides are ambiguous about this, indicating that DCC operates on 256B granularity (two cachelines) without providing further details.
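
To make the idea concrete, here’s a toy delta-compression sketch: store one base value plus small per-element deltas, so a block of 32-bit pixels can shrink to a fraction of a cacheline that a decompressor could expand on read. This is purely illustrative and not AMD’s actual DCC format.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy delta compression: keep one base pixel, then store each pixel as a small
// signed delta from that base. If every delta fits in a byte, the block shrinks.
int main() {
    std::vector<uint32_t> block = {0x11223344, 0x11223345, 0x11223340, 0x11223348};
    uint32_t base = block[0];
    std::vector<int8_t> deltas;
    bool compressible = true;
    for (uint32_t px : block) {
        int64_t d = int64_t(px) - int64_t(base);
        if (d < -128 || d > 127) { compressible = false; break; }  // delta too large
        deltas.push_back(int8_t(d));
    }
    size_t raw    = block.size() * sizeof(uint32_t);
    size_t packed = compressible ? sizeof(base) + deltas.size() : raw;
    std::printf("raw %zu bytes -> packed %zu bytes\n", raw, packed);  // 16 -> 8
}
```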

In any case, compression may be a contributing factor in RDNA4 being able to achieve better performance while using a smaller Infinity Cache than prior generations, despite only having a 256-bit GDDR6 DRAM setup.

SoC Features

AMD went over RAS, or reliability, availability, and serviceability features in RDNA4. Modern chips use parity and ECC to detect errors and correct them, and evidently RDNA4 does the same. Unrecoverable errors are handled with driver intervention, by “re-initializing the relevant portion of the SoC, thus preventing the platform from shutting down”. There are two ways to interpret that statement. One is that the GPU can be re-initialized to recover from hardware errors, obviously affecting any software relying on GPU acceleration. Another is that some parts of the GPU can be re-initialized while the GPU continues handling work. I think the former is more likely, though I can imagine the latter being possible in limited forms too. For example, an unrecoverable error reading from GDDR6 can hypothetically be fixed if that data is backed by a duplicate in system memory. The driver could transfer known-good data from the host to replace the corrupted copy. But errors with modified data would be difficult to recover from, because there might not be an up-to-date copy elsewhere in the system.

On the security front, microprocessors get private buses to “critical blocks” and protected register access mechanisms. Security here targets HDCP and other DRM features, which I don’t find particularly amusing. But terminology shown on the slide is interesting, because MP0 and MP1 are also covered in AMD’s CPU-side documentation. On the CPU side, MP0 (microprocessor 0) handles some Secure Encrypted Virtualization (SEV) features. It’s sometimes called the Platform Security Processor (PSP) too. MP1 on CPUs is called the System Management Unit (SMU), which covers power control functions. Curiously AMD’s slide labels MP1 and the SMU separately on RDNA4. MP0/MP1 could have completely different functions on GPUs of course. But the common terminology raises the possibility that there’s a lot of shared work between CPU and GPU SoC design. RAS is also a very traditional CPU feature, though GPUs have picked up RAS features over time as GPU compute picked up steam.

Infinity Fabric

One of the most obvious examples of shared effort between the CPU and GPU sides is Infinity Fabric making its way to graphics designs. This started years ago with Vega, though back then using Infinity Fabric was more of an implementation detail. But years later, Infinity Fabric components provided an elegant way to implement a large last level cache, or multi-socket coherent systems with gigantic iGPUs (like MI300A).

Slide from Hot Chips 29, covering Infinity Fabric used in AMD’s older Vega GPU

The Infinity Fabric memory-side subsystem on RDNA4 consists of 16 CS (Coherent Station) blocks, each paired with a Unified Memory Controller (UMC). Coherent Stations receive requests coming off the graphics L2 and other clients. They ensure coherent memory access by either getting data from a UMC, or by sending a probe if another block has a more up-to-date copy of the requested cacheline. The CS is a logical place to implement a memory side cache, and each CS instance has 4 MB of cache in RDNA4.

Infinity Fabric supports DVFS (dynamic voltage and frequency scaling) to save power, and clocks between 1.5 and 2.5 GHz. Infinity Fabric bandwidth is 1024 bytes per clock, which at the 2.5 GHz top clock suggests the Infinity Cache can provide roughly 2.5 TB/s of theoretical bandwidth. That roughly lines up with results from Nemes’s Vulkan-based GPU cache and memory bandwidth microbenchmark.

AMD also went over their ability to disable various SoC components to harvest dies and create different SKUs. Shader Engines, WGPs, and memory controller channels can be disabled. AMD and other manufacturers have used similar harvesting capabilities in the past. I’m not sure what’s new here. Likely, AMD wants to re-emphasize their harvesting options.

Finally, AMD mentioned that they chose a monolithic design for RDNA4 because it made sense for a graphics engine of its size. They looked at performance goals, package assembly and turnaround time, and cost. After evaluating those factors, they decided a monolithic design was the right option. It’s not a surprise. After all, AMD used monolithic designs for lower end RDNA3 products with smaller graphics engines, and only used chiplets for the largest SKUs. Rather, it’s a reminder that there’s no one size fits all solution. Whether a monolithic or chiplet-based design makes more sense depends heavily on design goals.

Final Words

RDNA4 brings a lot of exciting improvements to the table, while breaking away from any attempt to tackle the top end performance segment. Rather than going for maximum performance, RDNA4 looks optimized to improve efficiency over prior generations. The RX 9070 offers similar performance to the RX 7900XT in rasterization workloads despite having a lower power budget, less memory bandwidth, and a smaller last level cache. Techspot also shows the RX 9070 leading with raytracing workloads, which aligns with AMD's goal of enhancing raytracing performance.

Slide from RDNA4’s launch presentation, not Hot Chips 2025

AMD achieves this efficiency using compression, better raytracing structures, and a larger L2 cache. As a result, RDNA4 can pack its performance into a relatively small 356.5 mm² die and use a modest 256-bit GDDR6 memory setup. Display and media engine improvements are welcome too. Multi-monitor idle power feels like a neglected area for discrete GPUs, even though I know many people use multiple monitors for productivity. Lowering idle power in those setups is much appreciated. On the media engine side, AMD’s video encoding capabilities have often lagged behind the competition. RDNA4’s progress at least prevents AMD from falling as far behind as they have before.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Intel’s E2200 “Mount Morgan” IPU at Hot Chips 2025

2025-09-11 05:35:12

Intel’s IPUs, or Infrastructure Processing Units, evolved as network adapters developed increasingly sophisticated offload capabilities. IPUs take things a step further, aiming to take on a wide variety of infrastructure services in a cloud environment in addition to traditional software defined networking functions. Infrastructure services are run by the cloud operator and orchestrate tasks like provisioning VMs or collecting metrics. They won’t stress a modern server CPU, but every CPU core set aside for those tasks is one that can’t be rented out to customers. Offloading infrastructure workloads also provides an extra layer of isolation between a cloud provider’s code and customer workloads. If a cloud provider rents out bare metal servers, running infrastructure services within the server may not even be an option.

Intel’s incoming “Mount Morgan” IPU packs a variety of highly configurable accelerators alongside general purpose CPU cores, and aims to capture as many infrastructure tasks as possible. It shares those characteristics with its predecessor, “Mount Evans”. Flexibility is the name of the game with these IPUs, which can appear as a particularly capable network card to up to four host servers, or run standalone to act as a small server. Compared to Mount Evans, Mount Morgan packs more general purpose compute power, improved accelerators, and more off-chip bandwidth to support the whole package.

Compute Complex

Intel includes a set of Arm cores in their IPU, because CPUs are the ultimate word in programmability. They run Linux and let the IPU handle a wide range of infrastructure services, and ensure the IPU stays relevant as infrastructure requirements change. Mount Morgan’s compute complex gets an upgrade to 24 Arm Neoverse N2 cores, up from 16 Neoverse N1 cores in Mount Evans. Intel didn’t disclose the exact core configuration, but Mount Evans set its Neoverse N1 cores up with 512 KB L2 caches and ran them at 2.5 GHz. It’s not the fastest Neoverse N1 configuration around, but it’s still nothing to sneeze at. Mount Morgan of course takes things further. Neoverse N2 is a 5-wide out-of-order core with a 160 entry ROB, ample execution resources, and a very capable branch predictor. Each core is already a substantial upgrade over Neoverse N1. 24 Neoverse N2 cores would be enough to handle some production server workloads, let alone a collection of infrastructure services.

Mount Morgan gets a memory subsystem upgrade to quad channel LPDDR5-6400 to feed the more powerful compute complex. Mount Evans had a triple channel LPDDR4X-4267 setup, connected to 48 GB of onboard memory capacity. If Intel keeps the same memory capacity per channel, Mount Morgan would have 64 GB of onboard memory. Assuming Intel’s presentation refers to 16-bit LPDDR4/5(X) channels, Mount Morgan would have 51.2 GB/s of DRAM bandwidth compared to 25.6 GB/s in Mount Evans. Those figures would be doubled if Intel refers to 32-bit data buses to LPDDR chips, rather than channels. A 32 MB System Level Cache helps reduce pressure on the memory controllers. Intel didn’t increase the cache’s capacity compared to the last generation, so 32 MB likely strikes a good balance between hitrate and die area requirements. The System Level Cache is truly system level, meaning it services the IPU’s various hardware acceleration blocks in addition to the CPU cores.
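
The bandwidth figures above follow from channel count, channel width, and data rate; the short check below runs both interpretations. The 16-bit vs 32-bit channel width remains an assumption either way.

```cpp
#include <cstdio>

int main() {
    // bandwidth (GB/s) = channels * channel width in bytes * data rate (GT/s)
    auto gbps = [](int channels, int widthBits, double gtps) {
        return channels * (widthBits / 8.0) * gtps;
    };
    std::printf("Mount Evans,  3x LPDDR4X-4267, 16-bit channels: %.1f GB/s\n", gbps(3, 16, 4.267));
    std::printf("Mount Morgan, 4x LPDDR5-6400,  16-bit channels: %.1f GB/s\n", gbps(4, 16, 6.4));
    std::printf("Mount Morgan, 4x LPDDR5-6400,  32-bit channels: %.1f GB/s\n", gbps(4, 32, 6.4));
}
```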

A Lookaside Crypto and Compression Engine (LCE) sits within the compute complex, and shares lineage with Intel’s Quickassist (QAT) accelerator line. Intel says the LCE features a number of upgrades over QAT targeted towards IPU use cases. But perhaps the most notable upgrade is getting asymmetric crypto support, which was conspicuously missing from Mount Evans’s LCE block. Asymmetric cryptography algorithms like RSA and ECDHE are used in TLS handshakes, and aren’t accelerated by special instructions on many server CPUs. Therefore, asymmetric crypto can consume significant CPU power when a server handles many connections per second. It was a compelling use case for QAT, and it’s great to see Mount Morgan get that as well. The LCE block also supports symmetric crypto and compression algorithms, capabilities inherited from QAT.

A programmable DMA engine in the LCE lets cloud providers move data as part of hardware accelerated workflows. Intel gives an example workflow for accessing remote storage, where the LCE helps move, compress, and encrypt data. Other accelerator blocks located in the IPU’s network subsystem help complete the process.

Network Subsystem

Networking bandwidth and offloads are a core function of the IPU, and their importance can’t be overstated. Cloud servers need high network and storage bandwidth. The two are often two sides of the same coin, because cloud providers might use separate storage servers accessed over datacenter networking. Mount Morgan has 400 Gbps of Ethernet throughput, double Mount Evans’s 200 Gbps.

True to its smart NIC lineage, Mount Morgan uses a large number of inline accelerators to handle cloud networking tasks. A programmable P4-based packet processing pipeline, called the FXP, sits at the heart of the network subsystem. P4 is a packet processing language that lets developers express how they want packets handled. Hardware blocks within the FXP pipeline closely match P4 demands. A parser decodes packet headers and translates the packet into a representation understood by downstream stages. Downstream stages can check for exact or wildcard matches. Longest prefix matches can be carried out in hardware too, which is useful for routing.
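
As a software illustration of what one of those match stages does, here’s a minimal longest-prefix-match lookup; the routes, address, and next-hop names are made up, and the FXP of course does this in hardware rather than with a loop.

```cpp
#include <cstdint>
#include <cstdio>

struct Route { uint32_t prefix; int length; const char* nextHop; };

// Longest prefix match: among all routes whose prefix covers the address, pick the
// one with the longest (most specific) prefix length.
const char* lookup(uint32_t addr, const Route* table, int n) {
    const char* best = "default";
    int bestLen = -1;
    for (int i = 0; i < n; i++) {
        uint32_t mask = table[i].length ? ~uint32_t(0) << (32 - table[i].length) : 0;
        if ((addr & mask) == table[i].prefix && table[i].length > bestLen) {
            best = table[i].nextHop;
            bestLen = table[i].length;  // keep the most specific match
        }
    }
    return best;
}

int main() {
    Route table[] = {
        {0x0A000000, 8,  "core-router"},  // 10.0.0.0/8
        {0x0A010000, 16, "rack-switch"},  // 10.1.0.0/16
    };
    uint32_t addr = 0x0A010203;           // 10.1.2.3 matches both, the /16 wins
    std::printf("next hop: %s\n", lookup(addr, table, 2));
}
```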

The FXP can handle a packet every cycle, and can be configured to perform multiple passes per packet. Intel gives an example where one pass processes outer packet layers to perform decapsulation and checks against access control lists. A second pass can look at the inner packet, and carry out connection tracking or implement firewall rules.

An inline crypto block sits within the network subsystem as well. Unlike the LCE in the compute complex, this crypto block is dedicated to packet processing and focuses on symmetric cryptography. It includes its own packet parsers, letting it terminate IPSec and PSP connections and carry out IPSec/PSP functions like anti-replay window protection, sequence number generation, and error checking in hardware. IPSec is used for VPN connections, which are vital for letting customers connect to cloud services. PSP is Google’s protocol for encrypting data transfers internal to Google’s cloud. Compared to Mount Evans, the crypto block’s throughput has been doubled to support 400 Gbps, and supports 64 million flows.

Cloud providers have to handle customer network traffic while ensuring fairness. Customers only pay for a provisioned amount of network bandwidth. Furthermore, customer traffic can’t be allowed to monopolize the network and cause problems with infrastructure services. The IPU has a traffic shaper block, letting it carry out quality of service measures completely in hardware. One mode uses a multi-level hierarchical scheduler to arbitrate between packets based on source port, destination port, and traffic class. Another “timing wheel” mode does per-flow packet pacing, which can be controlled by classification rules set up at the FXP. Intel says the timing wheel mode gives a pacing resolution of 512 nanoseconds per slot.
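
Intel didn’t describe the timing wheel’s internals beyond the 512 ns slot resolution, but the classic data structure is easy to sketch: slots represent fixed time quanta, and each flow’s next packet is parked in the slot matching its earliest allowed send time. Everything below besides the 512 ns figure is invented for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

struct Packet { int flowId; };

int main() {
    constexpr int kSlots = 8;
    constexpr uint64_t kSlotNs = 512;  // pacing resolution per Intel's presentation
    std::vector<std::vector<Packet>> wheel(kSlots);

    // Park a packet in the slot that covers its earliest allowed send time.
    auto schedule = [&](Packet p, uint64_t sendTimeNs) {
        wheel[(sendTimeNs / kSlotNs) % kSlots].push_back(p);
    };
    schedule({1}, 600);   // flow 1 paced to ~600 ns
    schedule({2}, 1500);  // flow 2 paced further out
    schedule({1}, 1600);  // flow 1's next packet

    // Walk the wheel one slot per tick, draining whatever is due.
    for (uint64_t now = 0; now < kSlots * kSlotNs; now += kSlotNs) {
        auto& slot = wheel[(now / kSlotNs) % kSlots];
        for (const Packet& p : slot)
            std::printf("t=%llu ns: transmit packet from flow %d\n",
                        (unsigned long long)now, p.flowId);
        slot.clear();
    }
}
```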

RDMA traffic accounts for a significant portion of datacenter traffic. For example, Azure says RDMA accounts for 70% of intra-cloud network traffic, and is used for disk IO. Mount Morgan has a RDMA transport option to provide hardware offload for that traffic. It can support two million queue pairs across multiple hosts, and can expose 1K virtual functions per host. The latter should let a cloud provider directly expose RDMA acceleration capabilities to VMs. To ensure reliable transport, the RDMA transport engine supports the Falcon and Swift transport protocols. Both protocols offer improvements over TCP, and Intel implements congestion control for those protocols completely in hardware. To reduce latency, the RDMA block can bypass the packet processing pipeline and handle RDMA connections on its own.

All of the accelerator blocks above are clients of the system level cache. Some hardware acceleration use cases, like connection tracking with millions of flows, can have significant memory footprints. The system level cache should let the IPU keep frequently accessed portions of accelerator memory structures on-chip, reducing DRAM bandwidth needs.

Host Fabric and PCIe Switch

Mount Morgan’s PCIe capabilities have grown far beyond what a normal network card may offer. It has 32 PCIe Gen 5 lanes, providing more IO bandwidth than some recent desktop CPUs. It’s also a huge upgrade over the 16 PCIe Gen 4 lanes in Mount Evans.

Traditionally, a network card sits downstream of a host, and thus appears as a device attached to a server. The host fabric and PCIe subsystem is flexible enough to let the IPU wear many hats. It can appear as a downstream device to up to four server hosts, each of which sees the IPU as a separate, independent device. Mount Evans supported this “multi-host” mode as well, but Mount Morgan’s higher PCIe bandwidth is necessary to utilize its 400 Gigabit networking.

Mount Morgan can run in a “headless” mode, where it acts as a standalone server and a lightweight alternative to dedicating a traditional server to infrastructure tasks. In this mode, Mount Morgan’s 32 PCIe lanes can let it connect to many SSDs and other devices. The IPU’s accelerators as well as the PCIe lanes appear downstream of the IPU’s CPU cores, which act as a host CPU.

A “converged” mode can use some PCIe lanes to connect to upstream server hosts, while other lanes connect to downstream devices. In this mode, the IPU shows up as a PCIe switch to connected hosts, with downstream devices visible behind it. A server could connect to SSDs and GPUs through the IPU. The IPU’s CPU cores can sit on top of the PCIe switch and access downstream devices, or can be exposed as a downstream device behind the PCIe switch.

The IPU’s multiple modes are a showcase of IO flexibility. It’s a bit like how AMD uses the same die as an IO die within the CPU and a part of the motherboard chipset on AM4 platforms. The IO die’s PCIe lanes can connect to downstream devices when it’s serving within the CPU, or be split between an upstream host and downstream devices when used in the chipset. Intel is also no stranger to PCIe configurability. Their early QAT PCIe cards reused their Lewisburg chipset, exposing it as a downstream device with three QAT devices appearing behind a PCIe switch.

Final Words

Cloud computing plays a huge role in the tech world today. It originally started with commodity hardware, with similar server configurations to what customers might deploy in on-premise environments. But as cloud computing expanded, cloud providers started to see use cases for cloud-specific hardware accelerators. Examples include "Nitro" cards in Amazon Web Services, or smart NICs with FPGAs in Microsoft Azure. Intel has no doubt seen this trend, and IPUs are the company's answer.

Mount Morgan tries to service all kinds of cloud acceleration needs by packing an incredible number of highly configurable accelerators, in recognition of cloud providers’ diverse and changing needs. Hardware acceleration always runs the danger of becoming obsolete as protocols change. Intel tries to avoid this by having very generalized accelerators, like the FXP, as well as packing in CPU cores that can run just about anything under the sun. The latter may feel like overkill for infrastructure tasks, but it lets the IPU remain relevant even if some acceleration capabilities become obsolete.

At a higher level, IPUs like Mount Morgan show that Intel still has ambitions to stretch beyond its core CPU market. Developing Mount Morgan must have been a complex endeavor. It’s a showcase of Intel’s engineering capability even when their CPU side goes through a bit of a rough spot. It’ll be interesting to see whether Intel’s IPUs can gain ground in the cloud market, especially with providers that have already developed in-house hardware offload capabilities tailored to their requirements.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Hot Chips 2025: Session 1 - CPUs

2025-09-08 05:55:48

Hello you fine Internet folks,

Today we are talking about Hot Chips 2025, specifically the CPU session, where we had presentations from Condor Computing about their new Cuzco core, PEZY about their upcoming SC4s chip, IBM about their Power11 chip, which is already shipping to customers, and Intel about their upcoming E-Core based Xeon CPU codenamed Clearwater Forest.

Hope y'all enjoy!

If you’d like more detailed breakdowns of most of the chips and the presentations, here are links to our articles on them:

Condor Computing’s Cuzco Core: https://chipsandcheese.com/p/condors-cuzco-risc-v-core-at-hot

PEZY’s SC4s Chip: https://chipsandcheese.com/p/pezy-sc4s-at-hot-chips-2025

IBM Power - What is Next?: https://chipsandcheese.com/p/ibm-power-whats-next

Intel’s Clearwater Forest Xeon: https://chipsandcheese.com/p/intels-clearwater-forest-e-core-server


PEZY-SC4s at Hot Chips 2025

2025-09-05 05:30:35

Japan has a long history of building domestic supercomputer architectures dating back to the 1980s. PEZY Computing is one player in Japan’s supercomputing scene alongside Fujitsu and NEC, and has taken several spots in the Green500 list. RIKEN’s Exascaler-1.4 used PEZY-SC chips to take first place in Green500’s November 2015 rankings. More recently, PEZY-SC3 placed 12th on Green500’s November 2021 list. PEZY presented their newest architecture, PEZY-SC4s, at Hot Chips 2025. While physical silicon is not yet available, PEZY presented simulation results and discussed the SC4s architecture.

PEZY targets highly efficient FP64 compute by running a massively parallel array of execution units at lower clocks and voltages than contemporary GPUs. At the same time, it tries to avoid glass jaw performance behavior with low branching penalties and a sophisticated cache hierarchy. Their PEZY-SC products connect to a host system via PCIe, much like a GPU. The ‘s’ in SC4s denotes a scaled down model that uses a smaller die and draws less power. For example, PEZY-SC3 used a 786 mm² die on TSMC’s 7nm process and drew up to 470W. PEZY-SC3s uses a smaller 109 mm² die with a milder 80W power draw, and has 512 Processing Elements (PEs) compared to 4096 on the larger PEZY-SC3.

PEZY-SC4s is large for an ‘s’ part, with the same per-clock throughput as SC3. A bump from 1.2 to 1.5 GHz gives it a slight lead in overall throughput compared to SC3, and places it well ahead of SC3s.

SC4s’s Processing Element

From an organization perspective, a PEZY PE is somewhat analogous to an execution unit partition on a GPU, like AMD’s SIMD or Nvidia’s SM sub-partitions. They’re very small cores that hide latency using thread level parallelism. On PEZY-SC4s, a PE has eight hardware threads, a bit like SMT8 on a CPU. These eight threads are arranged in pairs of “front” and “back” threads, but it’s probably more intuitive to see this as two groups of four threads. One four-thread group is active at a time. Hardware carries out fine-grained multithreading within a group, selecting a different thread every cycle to hide short duration stalls within individual threads.

PEZY handles longer latency events by swapping active thread groups. This coarse-grained multithreading can be carried out with a thread switching instruction or a flag on a potentially long latency instruction, such as a memory load. Programmers can also opt for an automatic thread switching mode, inherited from PEZY-SC2. Depending on how well this “automatic chgthread” mode works, a PEZY PE could be treated purely as a fine-grained multithreading design. That is, thread switching and latency hiding happens automatically without help from the programmer or compiler.

GPUs issue a single instruction across a wide “wave” or “warp” of data elements, which means they lose throughput if control flow diverges within a wave. PEZY emphasizes that they’re targeting a MIMD design, with minimal branching penalties compared to a GPU. A PEZY PE feeds its four-wide FP64 unit in a SIMD fashion, and uses wider vectors for lower precision data types. The comparatively small 256-bit SIMD width makes PEZY less susceptible to branch divergence penalties than a typical GPU, which may have 1024-bit (wave32) or 2048-bit (wave64) vector lengths.

For comparison, PEZY-SC3’s PEs had a 2-wide FP64 unit. PEZY-SC4s’s wider execution units reduce instruction control costs. But the wider SIMD width could increase the chance of control flow divergence within a vector. For lower precision data types, PEZY-SC4s introduces BF16 support, in a nod to the current AI boom. However, PEZY did not spend die area or transistors on dedicated matrix multiplication units, unlike its GPU peers.

Memory Subsystem

PEZY’s memory subsystem starts with small PE-private L1 caches, with lower level caches shared between various numbers of PEs at different organizational levels. PEZY names organizational levels after administrative divisions. Groups of four PEs form a Village, four Villages form a City, 16 Cities make a Prefecture, and eight Prefectures form a chip (a State). PEZY-SC4s actually has 18 Cities in each Prefecture, or 2304 PEs in total, but two Cities in each Prefecture are disabled to provide redundancy.

A PE’s instruction cache is just 4 KB, and likely fills a role similar to the L0 instruction caches on Nvidia’s Turing. PEZY-SC3 could fetch 8B/cycle from the instruction cache and issue two instructions per cycle, implying each instruction is 4 bytes. If that’s the same in PEZY-SC4s, the 4 KB L1 instruction cache can hold 1024 instructions. That’s small compared even to a micro-op cache. A 32 KB L2 instruction cache is shared across 16 PEs, and should help handle larger instruction footprints.

The L1 data cache is similarly small at 4 KB, though it has doubled in size compared to the 2 KB L1D in PEZY-SC3. L1D bandwidth remains unchanged at 16 bytes/cycle, which leaves PEZY-SC4s with a lower ratio of load bandwidth to compute throughput, when considering each PE’s execution units are now twice as wide. However, the massive load bandwidth available on PEZY-SC3 was likely overkill, and could only be fully utilized if a load instruction were issued every cycle. Increasing L1D capacity instead of bandwidth is likely a good tradeoff. 2 KB is really small, and 4 KB is still small. L1D load-to-use latency is 12 cycles, or three instructions because each thread only executes once every four cycles.

Each PE also implements local memory, much like GPUs. A 24 KB block of local storage fills a role analogous to AMD’s Local Data Share or Nvidia’s Shared Memory. It’s a directly addressed, software managed scratchpad and not a hardware managed cache. The compiler can also use a “stack region” in local storage, likely to handle register spills and function calls. Four PEs within a “village” can share local memory, possibly providing 96 KB pools of directly addressable storage. Local storage has 4 cycle load-to-use latency, so the next instruction within a thread can immediately use a previously loaded value from local storage without incurring extra delay.

A 64 KB L2 data cache is shared across 16 PEs in a “City” (four Villages), and has 20 cycle latency. There is no bandwidth reduction in going to L2D, which can also provide 16B/cycle of load bandwidth per PE. That’s 256B/cycle total per L2 instance. Matching L1D and L2D bandwidth suggests the L1D is meant to serve almost as a L0 cache, opportunistically providing lower latency while many memory loads fall through to L2D. The L2D’s low ~13-14 ns latency would match many GPU first level data caches. With a relatively low thread count per PE and small SIMD widths, PEZY likely needs low latency memory access to avoid stalls. That seems to be reflected in its L1/L2 cache setup.

System Level Organization

“Cities” are connected to each other and last level cache slices via crossbars, and share a 64 MB last level cache (L3). The L3 is split into slices, and can provide 12 TB/s of read bandwidth (1024 bytes/cycle) and 6 TB/s of write bandwidth (512 bytes/cycle). L3 latency is 100-160 cycles, likely depending on the distance between a PE and the L3 slice. Even the higher figure would give PEZY’s L3 better latency than AMD RDNA4’s similarly sized Infinity Cache (measured to just over 130 ns using scalar accesses). PEZY has not changed last level cache capacity compared to PEZY-SC3, keeping it at 64 MB.

Besides providing caching capacity, the last level cache handles atomic operations much like the shared L2 cache on GPUs. Similarly, PEZY uses explicit sync/flush instructions to synchronize threads and make writes visible at different levels. That frees PEZY from implementing cache coherency like a GPU, simplifying hardware.

For system memory, PEZY uses four HBM3 stacks to provide 3.2 TB/s of bandwidth and 96 GB of capacity. If each HBM3 stack has a 1024-bit bus, that works out to roughly a 6.4 GT/s data rate. For comparison, PEZY-SC3 had a 1.23 TB/s HBM2 setup with 32 GB of capacity, supplemented by a dual channel DDR4-3200 setup (51.2 GB/s). Likely, PEZY-SC3 used DDR4 to make up for HBM2’s capacity deficiencies. With HBM3 providing 96 GB of DRAM capacity, PEZY likely decided the DDR4 controllers were no longer needed.

Management Processor

PEZY-SC4s also includes a quad core RISC-V management processor running at 1.5 GHz. PEZY chose to use the open source Rocket Core, which is an in-order, scalar core. PEZY’s shift to RISC-V has parallels elsewhere, as many vendors seem to find the open ISA attractive. Examples include Nvidia moving towards RISC-V for their GPU management processors, replacing their prior Falcon architecture.

Management processors across PEZY generations

In PEZY-SC3, the management processor “controls the PEs and PCIe interfaces”. Likely, they help distribute work to the PEs and orchestrate transfers between the PEZY accelerator and host system memory.

Host

PEZY’s accelerators connect to the host via a standard PCIe interface. PEZY-SC4s uses a 16 lane PCIe Gen 5 interface for host communication, which is an upgrade over the PCIe Gen 4 lanes on PEZY-SC3. The host system is a standard x86-64 server, which will use an EPYC 9555P CPU (Zen 5) and Infiniband networking. One system will host four PEZY-SC4s accelerators, in a similar configuration to PEZY-SC3.

For comparison, PEZY-SC3 uses an AMD EPYC 7702P host processor, which has Zen 2 cores.

Final Words

PEZY aims for efficient FP64 compute while also making it easy to utilize. PEZY-SC4s has a multi-level cache setup to balance caching capacity and speed. It uses small vectors, reducing throughput losses from branch divergence. The programming model (PZCL) is very similar to OpenCL, which should make it intuitive for anyone used to programming GPUs.

Compared to the prior PEZY-SC3s, PEZY-SC4s is more of a refinement that focuses on increased efficiency. Power draw in an earlier PEZY presentation was given at 270W, while PE-only power was estimated at 212W when running DGEMM. PEZY didn’t give any final power figures because they don’t have silicon yet. But initial figures suggest PEZY-SC4s will come in comfortably below 300W per chip.

Slide from an earlier presentation showing a PEZY-SC series roadmap

If PEZY-SC4s can hit full throughput at 270W, it’ll achieve ~91 Gigaflops per Watt (GF/W) of FP64 performance. This is quite a bit better than Nvidia's H200, at around 49 FP64 GF/W, and somewhat less than AMD's HPC-focused MI300A, at around 110 FP64 GF/W. However, there is no such thing as a free lunch. MI300A's 3D-stacked, chiplet-based design was significantly more time-consuming and costly to develop, and more expensive to manufacture, than PEZY's more traditional monolithic design.
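
That ~91 GF/W figure can be reconstructed from numbers given earlier in the article, assuming each FP64 lane retires one fused multiply-add (two FLOPs) per cycle; the short check below walks through the math.

```cpp
#include <cstdio>

int main() {
    const double activePEs  = 2048;   // 2304 physical PEs minus the redundant Cities
    const double lanes      = 4;      // 4-wide FP64 per PE
    const double flopsPerOp = 2;      // FMA counted as two FLOPs (assumption)
    const double clockHz    = 1.5e9;  // 1.5 GHz
    const double watts      = 270;    // projected power from the earlier presentation

    double tflops = activePEs * lanes * flopsPerOp * clockHz / 1e12;  // ~24.6 TFLOPS
    std::printf("%.1f TFLOPS FP64 -> ~%.0f GF/W at %.0f W\n",
                tflops, tflops * 1000.0 / watts, watts);
}
```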

Compared to the latest generation of AI-focused accelerators from AMD and Nvidia, CDNA4 and Blackwell Ultra respectively, SC4s leads in FP64 efficiency by a considerable margin; though it is worth noting that Nvidia has sacrificed the overwhelming majority of FP64 performance on B300, to the point where some consumer GPUs will outclass it at FP64 tasks.

The AI boom has left a bit of a blind spot for applications where high precision and result accuracy are paramount. In simulations for example, floating point error can compound over multiple iterations. Higher precision data types like FP64 can help reduce that error, and PEZY’s SC4s targets those applications.

At a higher level, efforts like PEZY-SC4s and Fujitsu’s A64FX show a curious pattern where Japan maintains domestic hardware architecture development capabilities. It contrasts with many other countries that still build their own supercomputers, but rely on chips designed in the US by companies like AMD, Intel, and Nvidia. From the perspective of those countries, it’s undoubtedly cheaper and less risky to rely on the US’s technological base to create the chips they need. But Japan’s approach has merits too. They can design chips tightly targeted to their needs, like energy efficient FP64 compute. It also leads to more unique designs, which I’m fascinated by. I look forward to seeing how PEZY-SC4s does once it’s deployed.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. Nvidia B200 power limit appears to be 1 kW: https://docs.nvidia.com/dgx/dgxb200-user-guide/power-capping.html

  2. メニーコアプロセッサPEZY-SC3 によるヒトゲノム解析の高速化とPEZY-SCシリーズの展望について (Accelerating human genome analysis with the PEZY-SC3 many-core processor, and the outlook for the PEZY-SC series), presented at Supercomputing Japan 2025. Projects power consumption for PEZY-SC4s to be 270W

  3. PEZY-SC3, 高い電力効率を実現するMIMDメニーコアプロセッサ (PEZY-SC3, a MIMD many-core processor achieving high power efficiency)

  4. DGX B200 specifications: https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet

  5. Nvidia Blackwell tuning guide, increased L2 capacity: https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html#increased-l2-capacity

Liquid Cooling Exhibits at Hot Chips 2025

2025-09-02 02:36:17

Hot Chips doesn’t just consist of presentations on hardware architecture, although those are the core of what Hot Chips is about. The conference also features stands where various companies show off their developments, and that’s not restricted to chips. Some of those stands showed off interesting liquid cooling components, particularly in cold plate design.

Water Jets

Many of the waterblocks on display use microjets, rather than microfin arrays. Water flows through a manifold at the top of the block, and reaches the cold plate surface through small channels. These channels can be distributed across the cold plate, ensuring cold water supply across the chip.

Alloy Enterprises showed a cutaway of an early prototype of an MI350X waterblock. The manifold in the top layer (bottom in the photo) would take incoming cold water. Holes in an intermediate layer would let water pass through, forming microjets. A bottom layer, not shown, would interface with the chip. Finally, hot water from the bottom layer would be drawn out through tubing and eventually make its way to a heat exchanger.

Microjets have another advantage: the jets can be positioned with chip hotspots in mind, rather than uniformly across the chip. Jetcool showed off such a design. Their “SmartLid” waterblock has a non-uniform water jet distribution, with holes positioned at anticipated chip hotspots, as seen in the hole placement below. Larger holes on the side let water exit.

“SmartLid” goes further too, sending water directly to the die without a cold plate and thermal interface material in between. Removing layers improves heat transfer efficiency, a concept PC enthusiasts will recognize from delidding. Hitting the die directly with water jets is a natural next step, though one that I find at least slightly scary. I hope that rubber gasket is a really good one. I also hope the chip being cooled doesn’t have any water-sensitive surface mount components too close to the die.

It’s also worth noting how small the water jets are. They’re difficult to see with the naked eye, so Jetcool mounted LEDs inside to show the water jet hole placement.

Needless to say, a solution like this won’t be broadly applicable to PC hardware. Coolers for PC enthusiasts must cater to a wide range of chips across different platforms. Specializing by positioning water jets over anticipated hot spots would require different designs for each chip.

But large scale server deployments, such as those meant for AI, won’t have a diverse collection of hardware. For logistics reasons, companies will want to standardize on one server configuration with memory, CPUs, and accelerators set in stone, and deploy that across a wide fleet. With increased thermal requirements from AI, designing waterblocks tightly optimized for certain chips may be worthwhile.

More Waterblock Designs

Alloy had a range of waterblocks on display, including the one above. Some used copper, much like consumer PC cooling solutions, but aluminum was present as well. Copper allows for excellent heat transfer efficiency, which is why it’s so popular, but its weight can be a huge disadvantage. Sometimes servers need to be transported by aircraft, where weight restrictions can be a concern. Aluminum is much better when weight matters. However, corrosion can reduce an aluminum block’s lifespan. As with everything, there’s a tradeoff.

For example, here’s the back of an aluminum waterblock for the MI300 series. While not obvious from a photograph, this design is noticeably lighter than copper waterblocks. Interestingly, the cold plate features cutouts for temperature sensors.

Jetcool’s stand also featured a copper waterblock for the MI300 series, with contact pads for various components located around the GPU die.

The top of the waterblock has some beautiful metal machining patterns.

Jetcool also showed off system-level hardware. This setup is meant for cooling Nvidia’s GB200, which features two B200 GPUs and a Grace CPU.

The three chips are connected in series. It looks like coolant would flow across one GPU, then to the other GPU, and finally to the CPU. That ordering makes a lot of sense, because the GPUs will do the heavy lifting in AI workloads and generate more heat too.
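To put rough numbers on that, here’s a hedged back-of-the-envelope sketch using Q = ṁ·c_p·ΔT. The flow rate, inlet temperature, and Grace power figure are my own assumptions for illustration, not Jetcool’s or Nvidia’s specs; the ~1 kW per B200 figure comes from Nvidia’s power capping documentation referenced earlier.

```python
# Back-of-the-envelope coolant temperature rise across a series loop,
# using Q = m_dot * c_p * delta_T. All figures below are assumptions for
# illustration, not measured values for GB200 or Jetcool's hardware.
flow_lpm = 6.0                    # assumed loop flow rate, liters per minute
m_dot = flow_lpm / 60 * 0.997     # mass flow in kg/s (water ~0.997 kg/L)
c_p = 4181.0                      # specific heat of water, J/(kg*K)

# Chips in loop order, with assumed power draw in watts
loop = [("B200 GPU 0", 1000.0), ("B200 GPU 1", 1000.0), ("Grace CPU", 300.0)]

coolant_temp = 30.0               # assumed coolant inlet temperature, deg C
for name, watts in loop:
    delta_t = watts / (m_dot * c_p)
    print(f"{name:11s} sees {coolant_temp:4.1f} C coolant, warms it by {delta_t:.1f} C")
    coolant_temp += delta_t
print(f"Loop outlet: ~{coolant_temp:.1f} C")
```

Under these assumptions each GPU warms the coolant by only a couple of degrees, so the CPU at the end of the chain sees water just slightly above the loop inlet temperature. Putting the lower-power chip last costs very little.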

The cold plates for the GB200 setup are copper and flat, and remind me of PC cooling waterblocks.

Finally, Jetcool had a server with a self-contained water cooling loop. It’s a reminder that not all datacenters are set up to do water cooling at the building level. A solution that keeps radiators, pumps, and waterblocks within the same system can seamlessly slot into existing datacenters. This particular setup puts two CPUs in series.

Nvidia’s GB300 Server

While not specifically set up to showcase liquid cooling, Nvidia had a GB300 server and NVLink switch on display. The GB300 server has external connections for liquid cooling. Liquid goes through flexible rubber pipes, and then moves to hard copper pipes as it gets closer to the copper waterblocks. Heated water goes back to rubber pipes and exits the system.

A closer look at the waterblocks shows a layer over them that almost looks like the pattern on a resistive touch sensor. I wonder if it’s there to detect leaks. Perhaps water will close a circuit and trip a sensor.

The NVLink switch is also water cooled with a similar setup. Again, there are hard copper pipes, copper waterblocks, and funny patterns hooked up to what could be a sensor.

Water cooling only extends to the hottest components like the GPUs or NVSwitch chips. Other components make do with passive air cooling, provided by fans near the front of the case. What looks like a Samsung SSD on the side doesn’t even need a passive heatsink.

Final Words

The current AI boom makes cooling a hot topic. Chips built to accelerate machine learning are rapidly trending towards higher power draw, which translates to higher waste heat output. For example, Meta’s Catalina uses racks of 18 compute trays, which each host two B200 GPUs and two Grace CPUs.

A single rack can draw 93.5 kW. For perspective, the average US household draws less than 2 kW, averaged out over a year (16000 kWh / 8760 hours per year). An AMD MI350 rack goes for even higher compute density, packing 128 liquid cooled MI355X GPUs into a rack. The MI350 OAM liquid cooled module is designed for 1.4 kW, so those 128 GPUs could draw nearly 180 kW. For a larger scale example, a Google “superpod” of 9216 networked Ironwood TPU chips draws about 10 MW. The trend is for power draw, and thus waste heat, to increase generation over generation as chipmakers build higher throughput hardware to handle AI demands, pushing datacenter cooling technologies to their limits.
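For those who want to check the arithmetic, here’s a quick sketch of the rack-level figures quoted above. The per-module and per-rack numbers are the ones cited in this paragraph; everything else is simple division.

```python
# Quick sanity check of the power figures quoted above.
household_kwh_per_year = 16_000
household_avg_kw = household_kwh_per_year / 8760   # ~1.8 kW average draw

catalina_rack_kw = 93.5             # Meta Catalina rack
mi350_rack_kw = 128 * 1.4           # 128 MI355X modules at 1.4 kW each
ironwood_pod_kw = 10_000            # ~10 MW for a 9216-chip "superpod"

print(f"Average US household: ~{household_avg_kw:.1f} kW")
print(f"Catalina rack: {catalina_rack_kw} kW "
      f"(~{catalina_rack_kw / household_avg_kw:.0f} households)")
print(f"MI350 rack: ~{mi350_rack_kw:.0f} kW")
print(f"Ironwood superpod: {ironwood_pod_kw:,} kW "
      f"(~{ironwood_pod_kw / household_avg_kw:,.0f} households)")
```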

All of that waste heat has to go somewhere, which drives innovation in liquid cooling technologies. While the liquid cooling hardware displayed at Hot Chips 2025 was very focused on the enterprise side and cooling AI-related chips, I hope some techniques will trickle down to consumer hardware in the years to come. Hotspotting is definitely an issue that spans consumer and enterprise segments. And zooming out, I would love a solution that lets me pre-heat water with my computer, and use less energy at the water heater.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.