2025-04-15 03:12:10
Raytraced effects have gained increasing adoption in AAA titles, adding an extra graphics quality tier beyond traditional "ultra" settings. AMD has continuously advanced its raytracing implementation in response, often by fitting the GPU's general architecture to the characteristics of raytracing workloads. On RDNA 4, examples include "out-of-order" memory access and dynamic register allocation. Both are targeted at raytracing, but other applications can benefit too, though perhaps not to the same degree.
Compared to RDNA 3, RDNA 4's RT IP 3.1 gets doubled intersection test throughput, oriented bounding boxes, primitive compression, and other goodies. Fixed function hardware is tightly tied to predefined data structures, and AMD has unsurprisingly updated those data structures to exploit the new features.
RDNA 4’s doubled intersection test throughput internally comes from putting two Intersection Engines in each Ray Accelerator. RDNA 2 and RDNA 3 Ray Accelerators presumably had a single Intersection Engine, capable of four box tests or one triangle test per cycle. RDNA 4’s two intersection engines together can do eight box tests or two triangle tests per cycle. A wider BVH is critical to utilizing that extra throughput.
Raytracing uses a Bounding Volume Hierarchy (BVH) that recursively subdivides scene geometry. Each node represents a 3D box-shaped space, and links to sub-boxes. Intersection tests determine which link (child) to follow, until traversal reaches the bottom of the tree where the node contains triangle data instead of links to sub-boxes. Each traversal step therefore scopes down the intersection search to a smaller box. More intersection test throughput can speed up this traversal process.
But speeding up traversal isn’t as simple as doubling up intersection test throughput. Each traversal step is a pointer chasing operation, which incurs memory latency. GPUs have high cache and DRAM latency compared to a CPU, but excel at parallel compute. RDNA 4 moves to 8-wide box nodes, up from 4-wide ones in RDNA 2 and 3. A wider box node presents more parallel work at each step. More importantly, it allows a “fatter” tree that requires fewer traversal steps to reach the bottom. Thus the 8-wide BVH shifts emphasis from latency to throughput, avoiding a key GPU weakness.
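To make the latency-versus-throughput tradeoff concrete, here's a minimal sketch of how node width affects traversal depth. The node layout, field names, and numbers are purely illustrative assumptions, not AMD's BVH encoding.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical N-wide box node, for illustration only. Each traversal step
// loads one node (a latency-bound pointer chase), tests the ray against its
// N child boxes, and follows the children that were hit.
template <int N>
struct BoxNode {
    float    childMin[N][3];   // child box minimum corners
    float    childMax[N][3];   // child box maximum corners
    uint32_t childPtr[N];      // links to child nodes or triangle data
};

// A balanced tree over `numLeaves` leaves needs roughly log_width(numLeaves)
// dependent node fetches from root to leaf.
int traversalDepth(int numLeaves, int width) {
    return (int)std::ceil(std::log((double)numLeaves) / std::log((double)width));
}
// Example: ~1M leaves takes ~10 dependent steps with 4-wide nodes but only
// ~7 with 8-wide nodes, while each step exposes twice as much parallel
// intersection work to the two Intersection Engines.
```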
Since RDNA 2, AMD has used a conservative raytracing strategy where a shader program controls the raytracing process from ray generation to result shading. During BVH traversal, a raytracing shader accesses RDNA 4's Ray Accelerator with a new IMAGE_BVH8_INTERSECT_RAY instruction. This instruction takes a ray and a pointer to an 8-wide BVH node, and uses both Intersection Engines together. Output from both Intersection Engines heads to a sorting unit, which can either sort the two 4-wide halves separately or sort across all eight results with a "wide sort" option. AMD has had hardware traversal stack management in the LDS since RDNA 3, and that stack management gets updated in RDNA 4 with a new DS_BVH_STACK_PUSH8_POP1_RTN_B32 instruction[1].
An 8-wide BVH isn't the only way to use RDNA 4's doubled intersection test throughput. RDNA 4 adds an IMAGE_BVH_DUAL_INTERSECT_RAY instruction, which takes a pair of 4-wide nodes and also uses both Intersection Engines. Like the BVH8 instruction, IMAGE_BVH_DUAL_INTERSECT_RAY produces two sets of four intersection test results and can intermix the eight results with a "wide sort" option. The traversal side likewise gets a DS_BVH_STACK_PUSH8_POP2_RTN_B64 instruction. AMD doesn't describe the "BVH4x2" traversal algorithm, but it's not hard to imagine what it is, considering what the two instructions above do. A ray can intersect multiple bounding boxes, creating multiple traversal paths. BVH4x2 almost certainly takes two of those paths in parallel, with two paths popped from the LDS stack and tested in the Ray Accelerator on each traversal step.
So far I've only seen AMD generate 8-wide BVHs for RDNA 4. That includes DirectX 12's procedural geometry example, 3DMark tests, Cyberpunk 2077, Elden Ring, GTA V Enhanced Edition, and Quake 2 RTX. BVH4x2 traversal is less efficient than using an 8-wide BVH, because it requires more memory accesses and generates more LDS traffic. Furthermore, BVH4x2 relies on having at least two valid traversal paths to fully feed the Ray Accelerator, and how often that's true may vary wildly depending on the ray in question. I'm not sure why AMD added a way to utilize both Intersection Engines with a 4-wide BVH.
BVHs traditionally use axis aligned bounding boxes (AABBs), meaning a box's boundaries are aligned with the 3D world's x, y, and z axes. Axis aligned boxes simplify intersection tests. However, game geometry is often not axis aligned. In those cases, an axis aligned box may end up much larger than the geometry it's trying to contain, creating a lot of empty space. Rays that intersect the empty space end up taking useless traversal steps into the box before further intersection tests realize the path was a dead end.
RDNA 4 addresses this with Oriented Bounding Boxes (OBBs), where the bounding box is rotated to better approximate geometry within it. Storing a 3×3 rotation matrix for each box would dramatically increase memory usage. RDNA 4 therefore strikes a compromise. Each box node only specifies one OBB rotation for all of its children. For further storage savings, the OBB matrix is not stored in the node at all. Instead, the node stores an OBB matrix index, which is looked up in a table of 104 predefined matrices. In code, the matrix is stored as 9 packed 6-bit indices, which refer to entries in a second level lookup table with 26 unique FP32 values[3].
AMD’s OBB strategy therefore takes a best-effort approach to reduce useless box intersections with minimal storage cost. It’s not trying to generate perfectly optimal bounding box rotations every time. Thanks to this strategy, 8-wide box nodes remain reasonably sized at 128 bytes, and continue to match RDNA’s cacheline length. While the lookup table specified in code doesn’t necessarily show the exact hardware implementation, a simplistic calculation indicates the lookup table would use about 800 bytes of storage. That would make it small enough to fit in a small ROM within the Ray Accelerator.
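Here's a minimal sketch of how such a two-level lookup could work, using the sizes described above (104 matrices, 9 packed 6-bit indices each, 26 unique FP32 values). Names and exact bit packing are assumptions for illustration, not AMD's code.

```cpp
#include <array>
#include <cstdint>

// Decode one OBB rotation: `matrixId` selects one of 104 predefined matrices,
// each stored as 9 packed 6-bit indices into a 26-entry table of unique FP32
// values. Table contents and packing order are assumptions.
std::array<float, 9> decodeObbMatrix(uint32_t matrixId,
                                     const uint64_t packedMatrices[104], // 9 x 6 = 54 bits used
                                     const float valueLut[26]) {
    uint64_t packed = packedMatrices[matrixId];
    std::array<float, 9> m{};                               // row-major 3x3 rotation
    for (int i = 0; i < 9; i++) {
        uint32_t idx = uint32_t(packed >> (6 * i)) & 0x3Fu; // extract one 6-bit index
        m[i] = valueLut[idx];                               // second-level lookup
    }
    return m;
}
// Storage check: 104 entries x 54 bits is roughly 702 bytes of packed indices,
// plus 26 x 4 bytes = 104 bytes of FP32 values -- about 800 bytes in total.
```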
Compromises of course leave room for improvement. Chains hanging from the ceiling in Elden Ring's Debate Parlor present an optimal case for RDNA 4's OBBs, because one rotation fits well across all of the chains on each side. The chandeliers are another story, because their four support chains each hang at a different rotation. AMD selects an OBB that matches chain rotation in the Y (up) axis, but not X/Z. As a result, the OBBs for each chain leave significant empty space that could result in false intersections. That's still better than RDNA 2's axis-aligned bounding boxes, but there's clearly room for improvement.
Using one OBB rotation across the box is likely a necessary compromise. A simple way to handle OBB intersection testing is to rotate the incoming ray so that the box and ray become axis aligned (just with a different set of axes). RDNA 4 adds a ray transform block to the Ray Accelerator. AMD doesn’t explicitly say the transform block helps with OBBs. Instead, it’s used for transitions between top and bottom level acceleration structures (TLAS/BLAS). DirectX raytracing splits the BVH into two levels because a BLAS can be reused many times with different positions and rotations. That’s convenient for handling several copies of the same object, like the same chair type placed several times around a room.
…the addition of a dedicated ray transform block, and this is used to offload the transformation that occurs as you transition from the top level of the ray acceleration structure into the bottom level…
AMD Radeon Press Briefing Video
A TLAS to BLAS transition would involve rotating the ray to allow axis-aligned intersection tests within the BLAS, a similar operation to rotating a ray for OBB tests. A key difference is that traversal steps involving OBBs may happen more often than TLAS to BLAS transitions. AMD's Ray Accelerator aims to handle a box node every cycle. Transforming a ray involves multiplying both the origin and direction vector by a 3×3 rotation matrix, which naively requires 36 FLOPs (floating point operations) per transform. At 2.5 GHz, that works out to 5.04 TFLOP/s of ray transformation work across the RX 9070's 56 Ray Accelerators. Providing different ray rotations for all eight boxes in a box node would multiply that figure by eight.
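As a rough illustration, a ray transform is just two small matrix-vector products. The sketch below is purely conceptual and says nothing about how the ray transform block is actually built; the FLOP counting in the comments follows the naive estimate above.

```cpp
struct Ray {
    float origin[3];
    float dir[3];
};

// Rotate a ray into an OBB's (or a BLAS's) local frame so intersection tests
// can stay axis-aligned. Each of the 9 lanes per vector is a multiply plus an
// accumulate; counting both as FLOPs for origin and direction gives
// 2 x 9 x 2 = 36 operations, the naive per-transform figure quoted above.
Ray transformRay(const Ray& r, const float rot[3][3]) {
    Ray out = {};
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            out.origin[i] += rot[i][j] * r.origin[j];
            out.dir[i]    += rot[i][j] * r.dir[j];
        }
    }
    return out;
}
```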
AMD could also implement OBB-related ray transforms within the Intersection Engines. RDNA 4 can use OBBs with RDNA 2/3 legacy 4-wide box nodes, slipping the OBB matrix ID into previously unused padding bits near the end of the 128B structure. BVH4x2 traversal may run into two nodes that specify different OBB rotations. That would require applying two ray transformations to fully feed both Intersection Engines. Even in that case, two ray transforms are far less expensive than eight, so AMD's compromise makes sense.
Minimizing BVH footprint is critical. A smaller BVH reduces VRAM consumption, cuts bandwidth requirements, and makes more economical use of cache capacity. RDNA 2 and RDNA 3 took a basic step with compressed triangle pairs, where a 64 byte triangle node could store a pair of triangles that share a side. Intel also does this. RDNA 4 goes further by packing multiple triangle pairs into a new 128 byte compressed primitive node.
RDNA 4's compressed primitive node only stores unique vertices across its triangle pairs. Further gains come from finding the minimum number of trailing zeroes across the vertex coordinates' bitwise FP32 representations, and dropping those trailing zeroes from storage[5]. Code suggests an RDNA 4 primitive node can describe up to eight triangle pairs or up to 16 unique vertices.
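The sketch below illustrates the trailing-zero idea: inspect the FP32 bit patterns of every vertex component in the node, find the smallest trailing-zero count, and drop that many low bits from each stored value. It's a conceptual illustration of the technique, not AMD's encoder from PrimitiveStructureEncoder3_1.hlsl.

```cpp
#include <algorithm>
#include <bit>
#include <cstdint>
#include <cstring>
#include <vector>

// Returns how many low bits can be dropped from every component's FP32 bit
// pattern: the minimum trailing-zero count across all values in the node.
uint32_t droppableLowBits(const std::vector<float>& components) {
    uint32_t minTz = 32;
    for (float f : components) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));       // raw FP32 bit pattern
        if (bits != 0)
            minTz = std::min<uint32_t>(minTz, std::countr_zero(bits));
    }
    return minTz;
}
// Each component is then stored in (32 - droppableLowBits) bits; the decoder
// shifts the packed value left by that amount to reconstruct the exact FP32.
```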
In practice, compression efficiency varies wildly depending on the nature of game geometry, but RDNA 4 often represents more than two triangle pairs in a 128 byte primitive node. While not mentioned in AMD's presentation, RDNA 4 represents box extents with quantized 12-bit integers instead of FP32 values[6]. That lets RDNA 4 keep its 8-wide box node at 128 bytes, just like RDNA 2/3's 4-wide box nodes.
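Box extent quantization can be sketched in a similar spirit: each child coordinate is expressed relative to the parent node's bounds and rounded to a 12-bit integer. Rounding and edge handling below are guesses; a real builder would round conservatively so child boxes never shrink.

```cpp
#include <algorithm>
#include <cstdint>

// Quantize one coordinate of a child box to a 12-bit integer relative to the
// parent node's bounds, in the spirit of ComputeQuantizedBounds with
// numQuantBits = 12. (1 << 12) - 1 = 4095 is the maximum quantized value.
uint32_t quantize12(float v, float parentMin, float parentExtent) {
    const float maxQ = float((1u << 12) - 1);
    float norm = std::clamp((v - parentMin) / parentExtent, 0.0f, 1.0f);
    return uint32_t(norm * maxQ + 0.5f);   // nearest; a real builder would round
                                           // mins down and maxes up instead
}
// 12 bits per extent component instead of a 32-bit float is a big part of how
// an 8-wide box node still fits in a single 128-byte cacheline.
```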
Like OBBs, primitive compression increases hardware complexity. Compressed primitive nodes don’t require additional compute in the Ray Accelerator. However, they do force it to handle non-aligned, variable length data fields. An Intersection Engine would have to parse the 52-bit header before it knows the data section’s format. Then, leading zero compression would require shifting the packed values to reconstruct the original FP32 values. Reducing memory footprint often comes with extra hardware complexity. In a latency critical application like raytracing, placing a higher burden on hardware is probably worth it.
In Elden Ring and 3DMark’s Port Royal benchmark, AMD’s Radeon Raytracing Analyzer indicates that RDNA 4 achieves a decent BVH size reduction. The same curiously doesn’t apply to Cyberpunk 2077. However, Cyberpunk 2077 has a more dynamic environment with unpredictable NPC counts and movement paths, so the margin of error is definitely higher.
RDNA 4's biggest wins come from reducing traversal step count. Lining up corresponding DispatchRays calls shows RDNA 4 goes through fewer traversal steps per ray. Cyberpunk 2077 is a particularly good case. RDNA 4 still does more intersection tests overall, because each traversal step involves eight intersection tests instead of four on RDNA 2, and traversal step count doesn't fall by half or more to compensate. The additional work is well worth it though. GPUs aren't latency optimized, so trading latency-bound pointer chasing steps for more parallel compute is a good strategy. Gains in Elden Ring are minor by comparison, but any reduction is welcome considering high GPU cache latency.
In a frame captured from 3DMark’s DXR feature test, which raytraces an entire scene with minimal rasterization, the Radeon RX 9070 sustained 111.76G and 19.61G box and triangle tests per second, respectively. For comparison the RDNA 2 based Radeon RX 6900XT did 38.8G and 10.76G box and triangle tests per second. Ballparking Ray Accelerator utilization is difficult due to variable clock speeds on both cards. But assuming 2.5 GHz gives 24% and 10.23% utilization figures for RDNA 4 and RDNA 2’s Ray Accelerators. RDNA 4 is therefore able to feed its bigger Ray Accelerator better than RDNA 2 could. AMD has done a lot since their first generation raytracing implementation, and the cumulative progress is impressive.
RDNA 2 introduced AMD’s first hardware raytracing implementation in the PC scene. It took a conservative approach to raytracing by accelerating intersection testing but little else. AMD has made steady progress since then, shaping GPU hardware and the raytracing workload to match each other. RDNA 4 continues to use RDNA 2’s high level raytracing strategy, with a compute thread managing the raytracing process all the way from ray generation to traversal to result handling. But over several generations, AMD’s engineers have piled on improvements that put RDNA 4 leagues ahead.
AMD's tools like the Radeon Raytracing Analyzer and Radeon GPU Profiler provide a fascinating look into how those improvements work together under the hood. Open source code further paints a picture of AMD working nonstop on their raytracing hardware. Unused (possibly PlayStation-related) RT IP 1.0 and 3.0 levels provide more snapshots of AMD's hardware raytracing evolution.
Still, RDNA 4 has room for improvement. OBBs could be more flexible, and first level caches could be larger. Intel and Nvidia are obvious competitors too. Intel has revealed a lot about their raytracing implementation, and no raytracing discussion would be complete without keeping them in context. Intel’s Raytracing Accelerator (RTA) takes ownership of the traversal process and is tightly optimized for it, with a dedicated BVH cache and short stack kept in internal registers. It’s a larger hardware investment that doesn’t benefit general workloads, but does let Intel even more closely fit fixed function hardware to raytracing demands. Besides the obvious advantage from using dedicated caches/registers instead of RDNA 4’s general purpose caches and local data share, Intel can keep traversal off Xe Core thread slots, leaving them free for ray generation or result handling.
AMD’s approach has advantages of its own. Avoiding thread launches between raytracing pipeline steps can reduce latency. And raytracing code running on the programmable shader pipelines naturally takes advantage of their ability to track massive thread-level parallelism. As RDNA 4 and Intel’s Battlemage have shown, there’s plenty of room to improve within both strategies. I’m excited to see how everything plays out as AMD, Intel, and Nvidia evolve their raytracing implementations.
primitiveNode.hlsli in AMD’s GPU Ray Tracing Library
OrientedBoundingBoxes.hlsl and ObbCommon.hlsl in AMD’s GPU Ray Tracing Library
EncodeHwBVH3_1.hlsl in AMD’s GPU Ray Tracing Library
PrimitiveStructureEncoder3_1.hlsl in AMD’s GPU Ray Tracing Library. Describes trailing zero compression, and ComputeCompressedRanges has each lane find its vertex in the LDS during BVH build, set corresponding bits in a bitmask, then count set bits to find the unique vertex count
Multiple RT IP 3.1 BVH building functions call ComputeQuantizedBounds with numQuantBits=12, which then calls ComputeQuantizedMin/ComputeQuantizedMax, which appears to quantize to a 12-bit integer because the max value is ((1 << numQuantBits) - 1) * 1.0f. It's the maximum integer value that'll fit in the specified quantization bits, multiplied by an FP value of 1 to provide the max quantized value as a float.
BoxNode1_0.hlsli, defines a 4-wide box node for RT IP 1.0
2025-04-06 01:38:25
Modern GPUs often make a difficult tradeoff between occupancy (active thread count) and register count available to each thread. Higher occupancy provides more thread level parallelism to hide latency with, just as more SMT threads help hide latency on a CPU. But while a CPU can use all of its SMT threads regardless of what code it's running, the same doesn't apply to GPUs. GPU ISAs offer a large number of very wide vector registers. Storing all registers for all thread slots would be impractical because register files must balance capacity with speed and die area usage.
For example, RDNA 4's ISA lets instructions address up to 256 vector general purpose registers (VGPRs). Each register is 1024 bits wide in wave32 mode, and each RDNA 4 SIMD has 16 thread slots. The SIMD would need a 512 KB register file to hold 256 registers for all 16 threads. In practice, register requirements vary across different GPU workloads. RDNA 4, like many other GPUs, uses a smaller register file and allocates registers based on what threads require. Code that needs a lot of registers can have them at the cost of less thread-level parallelism, while code that uses fewer registers can run with more active threads and be less sensitive to latency. RDNA 4 desktop GPUs have a 192 KB register file per SIMD, so a GPU kernel can use all thread slots (achieve maximum occupancy) if it uses 96 or fewer vector registers.
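The 96-register figure falls out of simple arithmetic, sketched below. The helper ignores real allocation granularity (hardware hands out VGPRs in blocks) and just divides register file capacity by each wave's register footprint.

```cpp
#include <cstdint>

// Wave32 occupancy math for one RDNA 4 SIMD: a VGPR is 32 lanes x 4 bytes =
// 128 bytes per wave, the register file is 192 KB, and there are 16 thread
// slots. Allocation granularity is ignored for simplicity.
constexpr uint32_t kBytesPerVgpr = 32 * 4;
constexpr uint32_t kRegFileBytes = 192 * 1024;
constexpr uint32_t kThreadSlots  = 16;

constexpr uint32_t maxWaves(uint32_t vgprsPerWave) {
    uint32_t fit = kRegFileBytes / (vgprsPerWave * kBytesPerVgpr);
    return fit < kThreadSlots ? fit : kThreadSlots;
}

static_assert(maxWaves(96)  == 16);  // 96 VGPRs or fewer: full occupancy
static_assert(maxWaves(128) == 12);  // heavier shaders give up thread slots
static_assert(maxWaves(256) == 6);   // worst case: 6 waves per SIMD
```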
A bigger register file obviously improves the occupancy vs. register usage tradeoff. RDNA increased SIMD register file capacity to 128 KB, up from 64 KB on GCN. RDNA 3 introduced a 192 KB register file configuration for high end GPUs, where die area is likely less of a concern. But simply growing the register file isn't an efficient answer for raytracing.
AMD notes that ray traversal and hit/miss handling have different VGPR requirements. AMD uses an inline raytracing model where all raytracing stages run within the same thread. A raytracing shader’s VGPR allocation has to be set to the maximum that any stage requires, because a thread’s register allocation remains static throughout its lifetime. Even if code that needs a lot of registers only accounts for a small part of execution time, that high VGPR allocation will limit active thread count for the duration of the workload. Raytracing is particularly latency sensitive, and AMD would like to run as many threads (rays) in parallel as possible to help absorb latency.
Therefore RDNA 4 introduces a new dynamic VGPR allocation mode. In this mode, a thread starts with a minimum VGPR allocation and changes it throughout its lifetime. Rather than specify how many VGPRs a shader will use, the driver tells the GPU to launch it in dynamic VGPR mode. A chip-wide SQ_DYN_VGPR register directly sets active thread count per SIMD, or occupancy, rather than having that inferred from shader VGPR usage. SQ_DYN_VGPR also controls other dynamic VGPR mode parameters, like VGPR allocation block size and deadlock avoidance mode.
Each enabled thread slot gets a single reserved VGPR block, and a newly launched thread starts with just that VGPR block active. When the thread needs more registers, it requests a new VGPR count using an s_alloc_vgpr instruction. s_alloc_vgpr attempts to allocate more registers if called with a value higher than the current allocation, or frees registers if called with a lower value. Changing VGPR allocation affects the upper end of the usable VGPR range, just like with non-dynamic VGPR allocation. Hardware hands out VGPRs in blocks of 16 or 32, depending on how the driver sets up SQ_DYN_VGPR. A thread can allocate up to eight blocks, so the driver must select the larger block size and give up some allocation granularity if a thread needs to use more than 128 VGPRs.
Allocation requests don't always succeed. s_alloc_vgpr sets the Scalar Condition Code (SCC) to indicate success, or clears it on failure. SCC is analogous to a flag register on CPUs, and is used for branching and add-with-carry. Shader code has to check SCC to determine if an allocation request succeeded. If an allocation request fails, a shader could in theory try to find other useful work to do while periodically retrying the allocation. But doing so would be quite complex, so in practice a shader will busy-wait until allocation succeeds.
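As a rough illustration, the pattern looks something like the toy model below: try_alloc_blocks stands in for s_alloc_vgpr (which reports success through SCC), and an atomic counter stands in for the SIMD's pool of free VGPR blocks. This is a conceptual sketch, not AMD's interface or actual shader code.

```cpp
#include <atomic>

std::atomic<int> g_freeBlocks{32};             // free VGPR blocks in the SIMD (illustrative)

// Stand-in for s_alloc_vgpr: returns true if the allocation succeeded
// (hardware would set SCC), false if the caller must retry (SCC cleared).
bool try_alloc_blocks(int currentBlocks, int wantBlocks) {
    int delta = wantBlocks - currentBlocks;
    if (delta <= 0) {                          // shrinking always succeeds
        g_freeBlocks -= delta;                 // subtracting a negative frees blocks
        return true;
    }
    int avail = g_freeBlocks.load();
    while (avail >= delta)                     // grow only if enough blocks are free
        if (g_freeBlocks.compare_exchange_weak(avail, avail - delta))
            return true;
    return false;
}

void registerHeavySection() {
    int myBlocks = 1;                          // each live wave owns one reserved block
    while (!try_alloc_blocks(myBlocks, 5)) {}  // busy-wait until the request succeeds
    myBlocks = 5;
    // ... code needing up to 5 blocks (e.g. 160 VGPRs at 32-register granularity) ...
    try_alloc_blocks(myBlocks, 1);             // hand the extra blocks back
    myBlocks = 1;
}
```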
Therefore dynamic VGPR mode turns the occupancy question on its head. A SIMD can have as many active threads as the driver feels like, regardless of register allocation. But theoretical occupancy doesn’t tell the whole story. Threads can still get blocked waiting on VGPR allocation. A SIMD could have all thread slots filled, but some of those threads could be busy-waiting on VGPR allocation rather than making useful progress.
Busy-waiting can become more than a performance inconvenience. Dynamic VGPR allocation can lead to deadlock. AMD knows this, and describes how that can happen in RDNA 4’s ISA manual.
I think the deadlock case can be more general than what AMD describes. If every thread in a SIMD needs to allocate more registers, but the hardware doesn't have enough free registers to satisfy any single request, every thread will get stuck forever. That's a deadlock, even though some registers are technically still free.
AMD mitigates some deadlock scenarios with a deadlock avoidance mode. The ISA manual is light on details, only saying it reserves just enough VGPRs for one thread to reach maximum VGPR allocation at all times. Each thread can allocate up to eight VGPR blocks, and one block comes reserved with the thread slot, so deadlock avoidance mode would reserve 7 VGPR blocks. I believe deadlock avoidance mode works by only allowing one thread to allocate registers from the reserved pool at a time. In short:
1. Base case: No reserved registers allocated. Any request can proceed
2. From (1), any combination of allocation requests from all threads will allow at least one thread (say thread A) to succeed
3. From (2), no other thread can allocate from the reserved pool, allowing thread A to increase its register allocation to the maximum should it need to.
4. Eventually A will leave its high register usage code section, or terminate completely, and thus free up registers for other threads to do the same.
Needless to say, that situation isn’t great for performance because it can serialize useful work across threads. But getting to the finish line slowly is better than not getting there at all.
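As a sketch of that reasoning, the reservation policy could be modeled roughly like this. It reflects the interpretation above, not a documented implementation: seven blocks sit in a reserve that only one wave at a time may drain, guaranteeing that wave can always reach the eight-block maximum.

```cpp
// Toy model of the deadlock-avoidance reservation described above. Seven
// blocks are held back (enough for a wave that already owns one block to
// reach the eight-block maximum), and only one wave at a time may allocate
// out of that reserve.
struct DynVgprPool {
    int freeBlocks     = 0;       // normal pool, set when waves launch
    int reservedBlocks = 7;       // deadlock-avoidance reserve
    int reserveOwner   = -1;      // wave currently allowed to drain the reserve

    bool grow(int waveId, int extraBlocks) {
        if (extraBlocks <= freeBlocks) {       // common case: normal pool is enough
            freeBlocks -= extraBlocks;
            return true;
        }
        if (reserveOwner != -1 && reserveOwner != waveId)
            return false;                      // reserve is spoken for: caller busy-waits
        if (extraBlocks <= freeBlocks + reservedBlocks) {
            reserveOwner    = waveId;          // this wave may now use the reserve
            reservedBlocks -= extraBlocks - freeBlocks;
            freeBlocks      = 0;
            return true;                       // it can always reach 8 blocks total
        }
        return false;
    }
    // Frees would refill the reserve back to 7 first, then clear reserveOwner
    // so the next blocked wave can make progress.
};
```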
Deadlock avoidance mode isn’t foolproof. If the programmer manages to meet three conditions:
1. Two threads need to allocate registers
2. The high register usage sections of both threads depend on each other, for example in a producer consumer model
3. No other thread can give up their registers until the two threads above make progress
Then they can run into a deadlock even with deadlock avoidance mode enabled. Programmers should probably avoid cross-thread dependencies in dynamic VGPR mode, unless they’re confident threads only wait on each other in low VGPR usage sections.
As with many new features, dynamic VGPR mode isn’t a one-size-fits-all solution. It’s narrowly targeted to start, and can only be used with wave32 compute shaders. Graphics shaders like pixel and vertex shaders can only use the regular non-dynamic launch mode. The same goes for wave64 shaders of any type.
A workgroup of threads launched in dynamic VGPR mode will “take over” the equivalent of a GPU core. That would be a Workgroup Processor (WGP) in WGP mode, or a Compute Unit (CU) in CU mode. Thus dynamic and non-dynamic threads can’t coexist on the same GPU core.
Dynamic VGPR mode may be less efficient at using register file capacity. Each enabled thread slot gets a reserved VGPR block, regardless of whether a thread is actually running in that slot. A workload that doesn't have enough parallelism to fill all enabled thread slots would waste those reserved registers. Deadlock avoidance mode would set aside more registers that could have been easily allocated in non-dynamic mode. Drivers can reduce reserved register count by disabling deadlock avoidance mode or reducing thread slot count. Both of those options come with obvious downsides. In wave32 mode, non-dynamic register mode can allocate up to 256 registers in 24-entry blocks[a] on current RDNA 4 GPUs. That offers finer granularity than the 32-entry blocks needed to give a thread 256 registers in dynamic VGPR mode.
AMD isn't the only GPU maker that lets a thread adjust register allocation mid-execution. Nvidia introduced a setmaxnreg PTX instruction in Hopper, and that's carried forward to Blackwell consumer GPUs. setmaxnreg superficially acts like AMD's s_alloc_vgpr, letting the calling thread request a different register allocation. However, Nvidia's dynamic register allocation works very differently from AMD's, and is probably better called register reassignment. Nvidia, for their part, never gave this mechanism a name.
Nvidia doesn't use a separate launch mode. Kernels always launch the regular way, with a specified register allocation that also determines how many threads they can run concurrently. For example, a compute shader that uses 96 registers on Blackwell will only be able to run 5 concurrent threads in each SM sub-partition. After threads launch, they can call setmaxnreg to shift registers between threads in the same workgroup. Unlike s_alloc_vgpr, setmaxnreg's register pool is whatever the workgroup started out with. If every thread calls setmaxnreg and the requested register count across threads is greater than what the workgroup started with, they will deadlock regardless of how much free space the register file may have.
As an aside, setmaxnreg is a PTX instruction. PTX is an intermediate-level programming language for Nvidia GPUs with an assembly-like syntax. It isn't assembly, which Nvidia calls SASS. However, PTX is meant to give more control over emitted instructions than a C-like high level language. Therefore PTX instructions often have similarities with SASS instructions, and can offer hints about the underlying ISA.
The semantics around setmaxnreg suggest Nvidia's mechanism is geared towards tightly orchestrated register swapping between threads. It's not like AMD's free-flowing dynamic allocation behavior where different threads can be out-of-phase with each other, so to speak. Nvidia's "warpgroup" likely refers to threads sharing the same SM sub-partition, and thus the same register file.
The same setmaxnreg instruction must be executed by all warps in a warpgroup. After executing a setmaxnreg instruction, all warps in the warpgroup must synchronize explicitly before executing subsequent setmaxnreg instructions. If a setmaxnreg instruction is not executed by all warps in the warpgroup, then the behavior is undefined.
Miscellaneous Instructions, Parallel Thread Execution ISA Version 8.7
A determined developer could emulate AMD's initial dynamic VGPR state on Nvidia with a workgroup that allocates all register file capacity in an SM, then immediately has every thread trim its allocation down to the minimum. But after that, synchronization requirements on Nvidia would make it difficult to emulate AMD's independent allocation behavior. setmaxnreg's scalar-only input also makes it harder to look up a desired allocation value from memory. Of course, difficult doesn't mean impossible. A register input can be emulated with a sufficient application of conditional branches, but let's not think about that too much.
In exchange for less flexibility, Nvidia should have no problem mixing “dynamic” and regular threads on the same SM. Nvidia can also adjust register allocation with finer granularity than AMD. The latter can be especially important because Nvidia has smaller 64 KB register files, and waste from “slack” register file usage can be even more painful.
Nvidia’s register reassignment mechanism isn’t well suited for AMD’s raytracing use case. However, Nvidia’s raytracing design likely doesn’t need it. Nvidia hardware uses a DXR 1.0 raytracing model. If it works like Intel, raytracing stages execute as separate thread launches on the SMs. Regular vector register allocation that happens at each thread launch would already solve the problem AMD faces with all-in-one raytracing shaders.
Intel’s documentation explicitly states that raytracing stages execute as separate thread launches. But even if they didn’t, Intel would benefit less from dynamic register allocation than AMD. Intel GPUs used fixed register allocation until very recently. Each thread gets 128 registers whether it needs them or not. More recent GPUs like Battlemage add a “large GRF” mode that cuts occupancy in half to give each thread 256 registers. There’s no option in between.
Therefore Intel can maintain full occupancy with a higher per-thread register count than either AMD or Nvidia. Dynamic VGPR allocation is only useful if it helps increase occupancy in the first place – that is, if the GPU can't achieve full occupancy with non-dynamic VGPR allocation. If Intel were to dynamically allocate registers, the very coarse allocation granularity may result in more threads getting blocked than on AMD.
AMD’s dynamic VGPR allocation mode is an exciting new feature. It addresses a drawback with AMD’s inline raytracing technique, letting AMD keep more threads in flight without increasing register file capacity. That in turn makes RDNA 4 less latency sensitive in raytracing workloads, likely with minimal power and area cost. Raytracing shaders that use more than 96 VGPRs are attractive targets for the dynamic VGPR feature.
Raytracing shaders on AMD can either inline all raytracing stages, or use an “indirect” mode where different stages are executed in separate function calls. So far, I’ve only seen AMD use dynamic VGPR allocation in indirect mode. Raytracing stages all take place within the same thread in both modes, but perhaps function call sites provide a convenient place to adjust VGPR allocation. After all, a function has clearly defined entry and exit points. AMD often prefers to inline raytracing stages to avoid function call overhead. I have not seen dynamic VGPR mode used when raytracing stages are inlined, even when raytracing shader occupancy is VGPR limited.
Certainly s_alloc_vgpr isn't limited to function call sites, so I wonder if future AMD drivers will be more trigger-happy with dynamic VGPR mode. Conversely, AMD uses dynamic VGPR allocation in indirect mode even when non-dynamic allocation could have achieved full occupancy. Doing so shouldn't hurt performance, but it does suggest driver decisions aren't so fine grained at the moment.
Generic compute workloads could benefit from dynamic VGPR mode too, assuming AMD does the work to expose the feature through various toolchains. Some of Nvidia's GPGPU libraries take advantage of setmaxnreg, so there are probably compute applications for AMD's dynamic VGPR feature too.
At a higher level, features like dynamic VGPR allocation paint a picture where AMD’s GPU efforts are progressing at a brisk pace. It doesn’t feel like an easy feature to implement. Thread register allocation could be non-contiguous in the physical register file, complicating register addressing under the hood. Features like deadlock avoidance would demand additional work. With regards to raytracing, dynamic VGPR allocation shows there’s plenty of progress to be made within AMD’s single-shader raytracing model. Along with breaking false cross-wave memory dependencies, AMD seems determined to keep stamping out performance limiters with each generation.
a. RDNA 4’s ISA manual indicates the 24 register allocation granularity only applies to devices with 1536 VGPRs per SIMD, or 192 KB register files. Other RDNA 4 devices allocate VGPRs in blocks of 16 registers, and likely have a 128 KB register file. RDNA 3 used smaller 128 KB register files in lower end devices, reserving 192 KB register files for the highest end SKUs. As RDNA 4 SKUs with non-192 KB register files do not exist at the time of writing, there is no need to discuss them in the article proper. However, such devices may launch in the future and it’s something to be aware of.
2025-04-01 18:48:14
2025 has kicked off with a flurry of GPU activity. Intel's Arc B580 revealed that it's still possible to make a mid-range GPU with more than 8 GB of VRAM. AMD's RDNA 4 marked the continuation of a longstanding AMD practice where they reach for the top-end, before deciding it wasn't worth it after all. Nvidia too has a new generation to celebrate 2025, and their 5000 series launch has come and gone without cards on shelves. But bigger numbers are better, so it's time to talk about the GeForce 6 series.
Gamers demand higher quality graphics with each generation. No one knows this better than Nvidia, so GeForce 6000 series cards are built to deliver near-cinematic quality at framerates high enough to support interactive gameplay. GeForce 6000 GPUs, or the GeForce 6 series for short, are built with the highly parallel nature of graphics rendering in mind. At the same time, they take a great leap forward in terms of programmability, opening up exciting new possibilities for complex in-game effects.
Graphics rendering involves transforming vertex coordinates from 3D space to 2D screen space before calculating the final pixel colors, a process known as rasterization. Both stages are inherently parallel tasks, and map well to hardware with large arrays of execution units. Accordingly, a GeForce 6 series GPU is a massively parallel machine. It has a strong complement of fixed-function graphics hardware, but the real power of the GPU lies in a collection of vertex and pixel shader cores. These programmable components execute shader programs provided by the game instead of carrying out preset functions. They also serve as basic building blocks, letting Nvidia scale to different power, price, and performance targets. The highest end GeForce 6000 series chip, NV40, implements 6 vertex shader and 16 pixel shader cores.
A highly parallel machine needs a high bandwidth memory subsystem to keep it fed. GeForce 6 series products can feature up to a 256-bit GDDR3 DRAM setup, giving it significantly more memory bus width than typical desktop CPUs. The GPU features a L2 texture cache shared across all pixel and vertex shader cores, which enables short-term reuse of fetched texture data. Nvidia did not disclose cache sizes at this time, but they aim for a 90% hitrate with many misses in flight, rather than the 99% hitrate one often sees with CPU caches. The GPU communicates with the host system via the popular AGP interface, but can also support the upcoming PCI Express standard.
Vertex shader programs transform coordinates from 3D to 2D screen space. It may sound like a trivial task that involves little more than a camera matrix multiplication and perspective division. But programmable vertex shaders open up new tricks. For example, a vertex shader can sample a texture and use it as a displacement map. Besides supporting texture accesses, GeForce 6000's vertex shader cores support branches, loops, and function calls. Much of this capability was previously unthinkable on anything outside a real CPU, demonstrating the exciting nature of GPU evolution.
Vertex shader execution starts with instruction fetch from a 512-entry instruction RAM. Nvidia's driver-facing vertex instructions are 128 bits wide, and are translated into a 123-bit internal format. Thus the instruction RAM has approximately 8 KB of capacity. DirectX 9's vertex shader 3.0 standard mandates a minimum of 512 instruction slots, and Nvidia's vertex shader core ISA is closely aligned to DirectX 9's HLSL instructions. Thanks to the instruction limit, shader programs are immune to performance losses from instruction cache misses, which CPU programs can often suffer. Furthermore, accessing the instruction RAM doesn't involve tag comparisons like a cache would, saving power.
DirectX 9 vertex shader HLSL instructions broadly fall into scalar and vector categories. Scalar instructions include special operations like inverse square roots. Vector instructions generally involve basic operations like multiply-add, and operate on 128-bit vectors of four 32-bit values. GeForce 6000's vertex shader pipeline is highly optimized for this arrangement, and features separate vector and scalar pipelines. Each ISA instruction specifies both a scalar and vector operation, letting the vertex shader core exploit parallelism in two dimensions within one instruction stream. Vectors specified by the DirectX 9 shader program provide vector-level parallelism. Any scalar+vector dual issue opportunities found by Nvidia's compiler provide additional parallelism.
A third source of parallelism comes from multithreading, and serves to hide latency. The vector operation slot can accept texture sampling instructions. Memory accesses from a vertex shader should still be relatively uncommon, so vertex shader cores don't have a L1 texture cache tied to their texture fetch unit. Nvidia expects a shader program will need 20-30 instructions to hide texture fetch latency, which can be hard to achieve from a single thread. Therefore, vertex shader cores can each track up to three threads and switch between them to hide latency.
Instruction inputs can come from registers or constant RAM. Both consist of 128-bit vector entries to match vector execution width. Register files are split into input, output, and temporary registers. The input and output registers each have 16 entries, and are read-only or write-only respectively from the shader program's point of view. The temporary register file supports both reads and writes, and has 32 entries. DirectX 9's vertex shader 3.0 specification lets a shader program address up to 32 registers, but Nvidia might share the register file between multiple threads. If so, a vertex shader program should use no more than 10 temporary registers to achieve maximum occupancy.
Pixel shaders, or fragment shaders, do much of the heavy lifting because rendering a scene typically involves processing far more pixels than vertices. Accordingly, a GeForce 6000 GPU can have up to 16 pixel shader cores. The pixel shader cores themselves are highly programmable just like the vertex shader cores, with branching support among other goodies. However, pixel shader cores are built very differently to exploit increased parallelism typically present at the pixel level.
GeForce 6000's pixel shaders use 128-bit instructions, though the encoding is substantially different from the one used in vertex shaders thanks to hardware differences. Nvidia has chosen to support up to 65536 pixel shader instructions, exceeding the DirectX 9 minimum specification of 512 instruction slots by a wide margin. Using all instruction slots would consume 1 MB of storage, so pixel shader cores might use an instruction cache.
The fragment processor has two fp32 shader units per pipeline, and fragments are routed through both shader units and the branch processor before recirculating through the entire pipeline to execute the next series of instructions
From The GeForce 6 Series GPU Architecture, Emmett Kilgariff and Randima Fernando
Where Nvidia's vertex shader core operates much like a CPU with 3-way SMT to hide latency, the pixel shader core uses a SIMD execution model across threads. That parallelism, often referred to as SIMT (Single Instruction Multiple Thread), applies on top of the SIMD you get within a thread from using multi-component vectors. Rather than tracking three separate threads, Nvidia groups many pixel shader invocations into a vector and effectively loops through the "threads" in hardware. This approach lets Nvidia keep thousands of "threads" in flight at low cost, because threads in the same vector must execute the same instruction and cannot take an independent execution path from other threads. Only the data being processed is different.
Programmers must pay attention to divergence penalties with this SIMT threading model. If different threads within a vector take different directions on a conditional branch, the pixel shader core will execute both sides of the branch with non-active threads masked off. That contrasts with the vertex shader core's MIMD execution model, which allows penalty free branching even if branch directions diverge across threads running in the same core. Nvidia suggests keeping branches coherent across regions of over 1000 pixels, or approximately 256 2x2 pixel quads, hinting at very long vector lengths.
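That masked execution model can be sketched as follows. The batch size is a guess based on the 1000+ pixel coherence hint, and the code only illustrates the SIMT concept rather than NV40's actual control logic.

```cpp
#include <array>

// SIMT divergence in miniature: the condition is evaluated for every pixel
// in the batch, then both sides of the branch run with inactive lanes masked
// off. Batch size is a guess based on the ~1000-pixel coherence suggestion.
constexpr int kBatchSize = 1024;

void shadeBatch(std::array<float, kBatchSize>& color,
                const std::array<bool, kBatchSize>& inShadow) {
    // Source shader: if (inShadow) color *= 0.5f; else color *= 1.2f;
    for (int lane = 0; lane < kBatchSize; lane++)   // "taken" pass, others masked
        if (inShadow[lane]) color[lane] *= 0.5f;
    for (int lane = 0; lane < kBatchSize; lane++)   // "not taken" pass
        if (!inShadow[lane]) color[lane] *= 1.2f;
    // If all lanes agree, one of the two passes can be skipped; if they
    // diverge, the whole batch pays for both sides.
}
```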
Keeping that much work in flight is critical to hiding latency, but places pressure on internal chip storage. DirectX 9 lets pixel shaders address 32 temporary registers, which continue to be 128 bits wide. Keeping 256 threads in flight would require 128 KB of register file capacity per pixel shader core, which GPUs will not reach for several years. GeForce 6000 uses smaller register files of unknown size. Nvidia says pixel shader programs can keep the maximum number of threads in flight if they use four or fewer 128-bit registers. As a ballpark estimate, 256 threads with four registers per thread would require 16 KB of register file capacity.
The pixel shader core's two 128-bit vector units are placed one after another in different pipeline stages. Both can execute four FP32 operations per cycle, though only the lower one can do multiply-add. The upper one can handle special functions and texture address calculation. Texture operations are issued between the two execution unit stages. Peak FP32 throughput is 12 operations per cycle. That can be achieved for example by issuing a vector FP32 multiply in the upper stage and a FP32 multiply-add in the lower one.
From a shader program's perspective, the upper and lower vector units together can complete two vector operations per cycle. Compared to the vertex shader cores, the pixel shader's sequential "dual issue" arrangement lets the upper unit forward its results to the lower one. Thus two dependent instructions can "dual issue". Besides interleaving instructions for the two vector units, Nvidia's compiler can pack operations that work on different subsets of vector elements into a single instruction, which improves vector unit utilization within a thread. FP16 execution can improve throughput even further. Full 32-bit precision is often not necessary for graphics rendering, especially when it comes to pixel colors. Both vector execution units in the pixel shader core can execute FP16 operations at double rate. Using FP16 also halves register file usage for those values, which in turn can also improve occupancy and therefore latency hiding.
Texture sampling is an important part of pixel shading, so pixel shader cores get an optimized texture sampling path compared to the vertex shader cores: each core has a L1 texture cache, backed by a chip-wide L2 texture cache.
Pixel shader programs normally output pixel colors, but colors are really just numbers. The bulk of GeForce 6000's massive parallel compute power is concentrated in its array of pixel shaders, and having lots of GFLOPs is great for non-pixel things too. Furthermore, the flexibility of the pixel processing pipelines can let creative programmers do just about anything.
For example, ray tracing is a fundamentally different approach to graphics rendering (compared to rasterization) that involves tracing light rays through a scene. Ray tracing has largely been confined to offline applications because of its compute power requirements. However, GeForce 6's programmable pixel shaders are up to the task of real-time ray tracing, at least for simple scenes.
The possibilities go beyond different graphics rendering techniques. The power of programmable shaders has spurred the development of new GPU programming APIs not directly aimed at graphics. Stanford's Brook API targets general purpose compute on GPUs. Its programming model is tightly tied to the way GPUs are optimized for parallel work. Getting up to speed on such a model can take some getting used to, especially as most programmers have been taught using a serial execution model. But researchers and other developers doing highly parallel and highly regular data processing should take note of these APIs.
Significant barriers still stand in the way of running any parallel task on a GPU: shader programs access memory through textures bound to it; textures have limited size compared to CPU-side memory allocations; floating-point precision is often lacking compared to a full-spec IEEE 754 implementation; shaders can only execute for a short length of time without stalling the display; textures can't be modified during shader execution, etc.
Developers also have to move data between CPU and GPU memory spaces to provide the GPU with data and get the results. The latter can be problematic because GPUs are optimized for presenting pixel shader output as a frame on-screen, before quickly overwriting it with a subsequent frame. Copying data back from the GPU can run into host interface limitations.
Nvidia is no doubt aware of these limitations, and is working to address them. GeForce 6 will support the upcoming PCI Express standard alongside AGP. PCI Express's increased bandwidth moves the GPU one step closer to being an accessible parallel accelerator.
GeForce 6's pixel and vertex shader pipelines are more flexible than ever, and show that Nvidia is taking programmable shaders seriously. Many of the capabilities introduced in GeForce 6 may seem excessive for current gaming workloads. It's hard to imagine anyone writing a shader hundreds of instructions long with loops, calls and branches mixed in. What GeForce 6's capabilities show is that Nvidia is competing on features beyond basic graphics rendering. It's part of a larger trend arising from the move away from fixed-function hardware, and has exciting implications for GPUs. Perhaps soon, we won't be calling these cards GPUs anymore, considering they can do much more than render graphics.
Despite its programmability, GeForce 6000 GPUs continue to focus strongly on graphics. Nvidia's shader ISA remains closely tied to DirectX 9 specifications, ensuring shader programs in games run well on the hardware. And the hardware is quite powerful; a high-end GeForce 6000 chip has over 200 million transistors. That's made possible by IBM's advanced 130nm process. Providing all that processing power demands serious power delivery too, so high end cards use a pair of molex connectors. Molex connectors are a time-tested standard, with thick pins and wires that can reliably supply power to a variety of peripherals without melting.
In conclusion, GPUs are advancing at an incredible pace. 2005 is an exciting time to be alive. Graphics rendering technologies are moving in lock-step with the nation's economy towards 2008, and undoubtedly everyone is looking forward to that bright future.
Wait, what year is it again? Oh, and happy April Fools!
Nvidia GeForce 6800 Hot Chips presentation
Emmett Kilgariff and Randima Fernando, The GeForce 6 Series GPU Architecture
John Montrym and Henry Moreton, The GeForce 6800
Ashu Rege, Shader Model 3.0
Matthias Wloka, GeForce 6 Series Performance
Ian Buck et al, Brook for GPUs: Stream Computing on Graphics Hardware
2025-03-28 08:15:21
Hello you fine Internet folks,
Today we have an interview with Bryan Cantrill from Oxide Computer Company.
Cloud computing has been a dominant force in the computing industry, with many businesses and even governments moving their computing infrastructure over to cloud services. Thanks to the sheer size of many cloud providers, customers are almost completely insulated from any hardware issues that crop up, because providers can automatically migrate VMs to other systems while the hardware issue is being resolved. But not everything can or even should be moved to the cloud, for reasons such as compliance, cost, and so on. That means some folks have to stick with on-premises compute, but the cloud model of automatic migration in the face of hardware failure is still relevant, and this is where Oxide fits in.
What Oxide is building is effectively an on-premises cloud. Oxide's philosophy is to treat a rack of servers the same way a hyperscaler (AWS, GCP, Azure, etc.) does. So, instead of directly accessing a single node in a server and running your workload, you start up a VM in the Oxide control panel, just like in AWS or Azure. This allows for automatic fail-over of VMs in the event of a hardware failure. To facilitate this cloud-like behavior, Oxide builds their own racks, which resemble the racks you might find in a hyperscaler's datacenter. I had the chance to interview their CTO, Bryan Cantrill, while I was in the Bay Area this past week to ask about their hardware and what they are doing.
Hope y'all enjoy!
Transcript below has been edited for conciseness and readability.
GEORGE: Hello you fine internet folks! Today, we're here at Oxide Computer Company, and I have with me, Bryan.
BRYAN: Heya!
GEORGE: Who are you, and what do you do at Oxide?
BRYAN: Hi, I'm Bryan Cantrill, I'm the CTO at Oxide Computer Company, we're here at our office in Emeryville- our office and lab, kind of our playhouse here in Emeryville.
We're a computer company! We're a modern computer company, we are a rack-scale computer company.
So, this is the Oxide rack behind you, and what we have done- our observation was, if you look at those folks deploying compute at scale- Amazon, Google... hyperscalers right? They've all built their own computers. And we were (I along with my cofounder) were at a public cloud company, Joyent, that was purchased by Samsung-
GEORGE: Interesting!
BRYAN: It was interesting! And, we were... after Samsung bought the company, they were really trying to deploy at Samsung-scale. And we were deployed on commodity gear; we were deployed on Dell, Supermicro, some HP, some Arista... and uh...
GEORGE: Trying to match all that stuff can be very difficult.
BRYAN: When we hit scale, everything broke.
GEORGE: I can imagine.
BRYAN: And, to be fair, everything broke in hardware and software, but the difference is, with the software, we could actually go fix it. And we fixed a bunch of our software systems, but then the problems you're left with are those problems, that are at the hardware/software boundary. And... it was pretty frustrating, and you look at like, "how did these other folks do it?" And you realize, they've done their own machines.
GEORGE: So, what makes... (if you wanna show to the audience) one of these?
BRYAN: Right! So this is an Oxide sled, and this doesn't look like a normal server computer, right?
GEORGE: No, no it doesn't. It looks like a blade.
BRYAN: It looks like a blade, right. And actually, if you look at the back... it even looks more like a blade. Lemme take off the Kapton tape there... It blind mates in the power first of all, we - like everybody running at scale - we run a DC bus bar, up and down the rack. So you've got an actual power shelf, that contains [bridge] rectifiers, those rectifiers then take you from AC to DC, you run DC on the bus bar,
GEORGE: DC 48v?
BRYAN: Uh, yeah, 54v. That is the way that everybody at scale runs, with a DC bus bar.
... you can't buy a DC bus bar based machine, DC bus bar-based rack, from Dell, HP, Supermicro! And they'll tell you nobody wants it.
GEORGE: [Sounds of incredulity] Hmmm...
BRYAN: Right, exactly!
GEORGE: The fact that this exists tells me otherwise!
BRYAN: Exactly, it definitely does, and one of the things I've appreciated... we've kind of accreted the server architectures that we have.
This traditional [server] architecture has accreted over time, and until we took a clean sheet of paper, you really don't appreciate just how many things are broken with it! One of the things you commented on, is the noise; it's so much quieter.
GEORGE: Yeah, and, it's off-camera right here, but there's one rack running right now over to my right side... you can hear it, but it's not a tinny noise. It's a very... almost sort of wind-blowing noise. Which is exactly what it is.
BRYAN: Right, and you know what's funny? We didn't design this thing to be acoustically pleasant, ...
GEORGE: It just turned out like that?
BRYAN: It just turned out like that. And one of the things you'll appreciate is, when you look at the acoustic unpleasantness in a traditional server... yes, you've got a bunch that's coming from those small fans at the back; a bunch of it is also coming from those fans on the power supplies. Because you've got those AC power supplies...
GEORGE: And it's all like, 40mm fans.
BRYAN: Those are *tiny* fans, and, those AC power supplies, you take them apart... they're crammed. So there's a high static pressure that [the fans] have to overcome; those fans are workin' hard! And it's hot. And of course, that fan is a thing that blows on AC power supplies, so you have two of them. So now we've got two AC power supplies, in every one of these servers, all these power cords... and it's just like, the whole thing is a mess.
And... that's just the beginning; the DC bus bar is to me just the beginning.
GEORGE: So, speaking of... well, you say basic- the way a computer is booted, is you usually start with what's known as a Basic Input Output System, the BIOS,
BRYAN: The BIOS... yeah.
GEORGE: Now, in the early 2000s, this was replaced by UEFI,
BRYAN: UEFI, yes, Itanium's gift to the world!
GEORGE: Yeah, and while that works perfectly well for your average laptop or desktop, when you get to this scale,
BRYAN: It doesn't make sense.
GEORGE: Why is that?
BRYAN: Because it's giving you this kind of optionality you actually don't want. When you have something of this scale - and we have co-designed our host operating system with our hardware - you don't need that optionality, of booting kind of... I don't need to boot DOS on this thing!
GEORGE: [Laughing] You don't want DOS on these machines?!
BRYAN: Okay, that would be kind of entertaining, but... we actually don't need any of that. But we have preserved all of this kind of ancient optionality in the BIOS. A big problem that we have with the BIOS, is that the BIOS has to boot the system in order to boot the system.
So one of the things the BIOS has to do... it needs to find like, how do I boot this thing? I need to actually do I/O, to pull a boot image off of somewhere. I/O, as we know, everything's complicated... you can't just like "do I/O", like, we actually have to bring up PCIE engines, you have to bring up all the CPUs... so you're doing all this work to boot the system, and then you find the image you want to boot, and now you have to be like, "okay, now we have to pretend like we were never here".
So it then tries to- we call it "setting the machine backward", where it makes the machine *look like* it has not been booted, when it executes that first operating system instruction. But in reality an *entire city* has been constructed, and ploughed under; and the operating system can actually see the artifacts of that over time. There's something called System Management Mode, SMM…
GEORGE: [Laughing] Ahhh yes, what some people refer to as "ring -2" if I remember correctly?
BRYAN: That's right, ring -2, and that kind of platform initialization layer, can stuff whatever it wants in SMM.
GEORGE: I still find it hilarious - I think it was HP? - tried putting like a day-calendar in the SMM, which is like... why are you doing this?! [Laughing]
BRYAN: They're doing it, because they wanted to add value to their hardware, without controlling the system software.
So the way to do that, is to jam that software into the system software they do control, which is SMM, but from the perspective of actually running this thing as a server, that's just a problem for me. I don't want to have ring -2. So for us, SMM is empty. Because the other thing is... why do you end up in SMM? For any reason! If you look at the architecture manual, it can go into SMM for any reason, can stay there for any length of time... it's unspecified, that you have to wait.
GEORGE: So, how do you solve this?
BRYAN: So for us, we do have something in SMM mode: if you ever hit SMM mode, we panic the system. Because, under no condition should we enter SMM. So if we entered SMM mode, we bring the system down and take a crash dump.
That would be pretty wild, if we ever saw that happen, right? Because it would mean that... but we have not seen that happen, and we wanted to do that to make sure, if something were to errantly enter System Management Mode.
But beyond not using System Management Mode at all, we also didn't want to have a BIOS at all.
GEORGE: Yep, so how are you getting around that?
BRYAN: Yeah, so that was tough, and in fact, this is something that we didn't really appreciate at the time... AMD didn't think we could pull this off, apparently Google tried this and failed. And if Google has tried something and failed it must be impossible for humanity?!
GEORGE: [Laughing] Well, oftentimes what they do is succeed and then claim that they failed, and then just, cancel the product.
BRYAN: [Laughing] That's right, and it was tough, and it required us to work very closely with AMD. I think that AMD didn't really believe that we could pull it off,
GEORGE: It's... I wouldn't even say it's not a trivial- it's a very complicated problem.
BRYAN: It is, because you are doing that *lowest* layer of platform initialization.
GEORGE: And that platform initialization, people forget, is like, [AMD] memory training, bringing up the PCIE,
BRYAN: That's right.
GEORGE: And, remember, what's bringing up the system? Well, oftentimes, like if you try and access a BMC, that BMC is on a PCIE bus, it has to be brought up and initialized, so there's a lot of complex problems with the BMC,
BRYAN: Speaking of the BMC, we also threw that into the sea!
So the BMC - Baseboard Management Controller - the computer-within-the-computer... we felt that the BMC had grown far too large, far too complicated. BMC should not be on PCIE, from our perspective. What you actually want is environmentals; you want power control, it needs to be on its own network, ... and that's basically it. Its job is really to hand the host CPU its cup of coffee.
GEORGE: I wish I had someone that hands me my cup of coffee!
BRYAN: So we eliminated the BMC, and we replaced it with what we call the Service Processor, the SP, kind of going back to an older model... so if you look at this compute sled here, and it may be hard to see in there, that's our Service Processor.
So this is an ST Microelectronics part, and this is a part that is kinda funny because it doesn't need a heatsink, right? This is a 400 MHz part! Which is faster than machines were when I was coming up; like faster than the first workstation that I had at Sun Microsystems, by a long shot.
GEORGE: It's what, 80x faster than the original x86? [Ed: Original 8086 was 5 MHz, 400 MHz is exactly 80x faster - good memory and quick arithmetic!]
BRYAN: That's right. So it's like, why are we taking this kind of BMC and running this kind of multi-user operating system on it, when we actually have plenty of compute power there. We did our own operating system, we took a clean sheet of paper there as well. I think we were looking around for kind of, best-of-breed, but we weren't finding anything that we liked exactly.
One of the things that we were not finding is operating systems have this kind of multi-user heritage, where, they know how to load programs... which makes sense, absolutely. The idea that an operating system can load a program that it has never seen before makes it valuable, makes it usable!
GEORGE: I mean if you think about it, every time you power a system off, and you reboot the OS; essentially the OS goes, I'm brand new, and then you go to let's say Steam for example... it doesn't know what Steam is.
BRYAN: Right, yes, exactly.
GEORGE: So the OS has to figure out the program, and boot it.
BRYAN: So even in microcontroller-based operating systems, they still had this idea of program loading. We wanted programs, but we don't want to load foreign programs on this; all of the things that are in this, we want it to be aware of, when it actually boots.
So Hubris is our operating system-
GEORGE: I love the names Hubris, and then-
BRYAN: Humility is for the debugger. So Cliff Biffle, the engineer who pioneered Hubris - this is of course one of the Deadly Sins of Programmers - the idea being a nod to, oh my god you're doing your own operating system, the hubris of doing your own operating system! And then of course the debugger for that is Humility.
What's been interesting, is that that kind of model - and Cliff has a great talk at OSFC, my colleague Matt Keeter also did a terrific talk on some of the debugging infrastructure we've built on this thing - that model has allowed us to keep Hubris as a very tight image. So Hubris knows about all the tasks that it's going to run, when it actually boots; that image has every task in it. It does not load foreign programs, which is what you want in this kind of firmware.
GEORGE: Yeah, you don't want someone to be able to- even if you had physical access, ... could I, change that?
BRYAN: Great question, so if you had physical access, you could load a new image on here. But, then there's a Root of Trust on here, that Root of Trust would know that that image - unless you were Oxide doing it - it would know that image has actually not been signed by Oxide. So we actually test the image.
GEORGE: Now, can you... so let's say I get access somehow to a just single node, I only have time to mess with a single node. You have a single node in a big rack, could you, re-... essentially, download a new system image for that microcontroller?
BRYAN: You could create your own image, but it would know this is not an Oxide image.
GEORGE: Nono I mean, can it then pull an image, a known good image, off a different sled?
BRYAN: Ohhh, yeah, well no, you need to have enough to actually get over our Service Processor network, so you'd have to qualify just how crippled this image is. If you put a brick on here, it's going to be a problem...
GEORGE: [Laughs]
BRYAN: As a result, we're, as a practical matter, very careful about that; there are A and B sides to the microcontroller, so if you do put a bad image on, it can roll back to the other one, and so on. This thing is really designed- it is fit to purpose, for booting and operating a computer. Unlike a BMC, which is really designed to make a server look like a desktop.
GEORGE: Yep. I think, well, we're running quite long here, but... one last question- always my last question, what's your favorite type of cheese?
BRYAN: Ooooh... that's a good question you know. I... um, God I mean I love a good sharp cheddar.
GEORGE: I agree with you on that one.
BRYAN: Actually, I also really like Swiss, not just Swiss cheese but Swiss cheeses, so a good Emmentaler, something like that, but my kids are less into that. I guess I'm pretty traditional in that regard.
GEORGE: I 100% agree. Well thank you so much, Bryan, for this, thank you for watching! Like, hit subscribe, do all that, comment... it really does help the algorithm, and we must appease the algo gods.
So, thank you so much!
BRYAN: Thank you!
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-03-24 05:58:01
AMD's RDNA 4 brings a variety of memory subsystem enhancements. Among those, one slide stood out because it dealt with out-of-order memory accesses. According to the slide, RDNA 4 allows requests from different shaders to be satisfied out-of-order, and adds new out-of-order queues for memory requests.
AMD apparently had a false dependency case in the memory subsystem prior to RDNA 4. One wave could end up waiting on memory loads made by another wave. A "wavefront", "wave", or "warp" on a GPU is the rough equivalent of a CPU thread. It has its own register state, and can run out of sync with other waves. Each wave's instructions are independent from those in other waves, with very few exceptions (like atomic operations).
In RDNA 3, there was a strict ordering on the return of data, such that effectively a request that was made later in time was not permitted to pass a request made earlier in time, even if the data for it was ready much sooner.
Navi 4 Architecture Deep Dive, Andrew Pomianowski, CVP, Silicon Design Engineering (AMD)
A fundamental tenet of multithreaded programming is that you get no ordering guarantees between threads unless you make it happen via locks or other mechanisms. That's what makes multithreaded performance scaling work. AMD's slide took me by surprise because there's no reason memory reads should be an exception. I re-watched the video several times and stared at the slide for a while to see if that's really what they meant. They clearly meant it, but I still didn't believe my eyes and ears. So I took time to craft a test for it.
AMD's slide describes a scenario where one wave's cache misses prevent another wave from quickly consuming data from cache hits. Causing cache misses is easy. I can pointer chase through a large array with a random pattern ("wave Y"). Similarly, I can keep accesses within a small memory footprint to get cache hits ("wave X"). But doing both at the same time is problematic. Wave Y may evict data used by wave X, causing cache misses on wave X.
Instead of going for cache hits and misses, I tested whether waiting on memory accesses in one wave would falsely wait on memory accesses made by another. My "wave Y" is basically a memory latency test, and makes a fixed number of accesses. Each access depends on the previous one's result, and I have the wave pointer chase through a 1 GB array to ensure cache misses. My "wave X" makes four independent memory accesses per loop iteration. It then consumes the load data, which means waiting for data to arrive from memory.
Once wave Y completes all of its accesses, it sets a flag in local memory. Wave X makes as many memory accesses as it can until it sees the flag set, after which it writes out its “score” and terminates. I run both waves in the same workgroup to ensure they share a WGP, and therefore share as much of the memory subsystem as possible. Keeping both waves in the same workgroup also lets me place the “finished” flag in local memory. Wave X has to check that flag every iteration, and it’s best to have flag checks not go through the same caches that wave Y is busy contaminating.
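As a rough sketch of the test's structure (not the actual GPU shader, and a CPU analogue will not reproduce the false dependency), the two "waves" could be modeled with C++ threads like this; the array size, chase count, and unroll factor here are placeholders:

```cpp
// CPU-thread analogue of the test, for illustration only. The real test runs
// two waves in one workgroup on the GPU; a CPU will not reproduce the false
// dependency. "waveY" makes a fixed number of dependent (pointer chasing)
// loads through a large array, while "waveX" makes independent loads from a
// small footprint and counts how many it completes before waveY sets the
// "finished" flag.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

int main() {
    constexpr size_t kChainLen   = 1u << 25;  // 128 MB chain (the GPU test uses 1 GB)
    constexpr int    kChaseCount = 1 << 22;   // fixed number of dependent loads

    // Build one big random cycle so every load depends on the previous one.
    std::vector<uint32_t> order(kChainLen), chain(kChainLen);
    std::iota(order.begin(), order.end(), 0u);
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (size_t i = 0; i < kChainLen; i++)
        chain[order[i]] = order[(i + 1) % kChainLen];

    std::atomic<bool> finished{false};   // stands in for the flag in local memory
    std::atomic<uint64_t> waveXScore{0}; // wave X's "score"

    std::thread waveY([&] {  // dependent, latency-bound accesses
        uint32_t idx = 0;
        for (int i = 0; i < kChaseCount; i++) idx = chain[idx];
        volatile uint32_t sink = idx; (void)sink;  // keep the chase live
        finished.store(true, std::memory_order_release);
    });

    std::thread waveX([&] {  // independent accesses until the flag is set
        uint64_t count = 0, sum = 0;
        for (size_t i = 0; !finished.load(std::memory_order_acquire); i++) {
            // Four independent loads per iteration, mirroring the unroll factor.
            sum += chain[(i + 0) & 1023] + chain[(i + 256) & 1023] +
                   chain[(i + 512) & 1023] + chain[(i + 768) & 1023];
            count += 4;
        }
        volatile uint64_t sink = sum; (void)sink;
        waveXScore.store(count);
    });

    waveY.join();
    waveX.join();
    printf("waveX made %llu accesses while waveY made %d\n",
           (unsigned long long)waveXScore.load(), kChaseCount);
}
```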
If each wave X access gets delayed by a wave Y one, I should see approximately the same number of accesses from both. Instead, on RDNA 3, I see wave X make more accesses than wave Y by exactly the loop unroll factor on wave X. AMD's compiler statically schedules instructions and sends out all four accesses before waiting on data. It then waits on load completion with s_waitcnt vmcnt(...) instructions.
Accesses tracked by vmcnt always return in-order, letting the compiler wait on specific accesses by waiting until vmcnt decrements to a certain value or lower. In wave Y, I make all accesses dependent, so the compiler only waits for vmcnt to reach 0.
On RDNA 3, s_waitcnt vmcnt(...) seems to wait not only for requests from its own wave to complete, but for requests from other waves too. That explains why wave X makes exactly four accesses for each access that wave Y makes. If I unroll the loop more, letting the compiler schedule more independent accesses before waiting, the ratio goes up to match the unroll factor.
On RDNA 4, the two waves don't care what the other is doing. That's the way it should be. RDNA 4 also displays more run-to-run variation, which is expected because cache behavior is highly unpredictable in this test. I'm surprised by the results, but it's convincing evidence that AMD indeed had false cross-wave memory delays on RDNA 3 and older GPU architectures. I also tested on Renoir's Vega iGPU, and saw the same behavior as RDNA 3.
At a simplistic level, you can imagine that requests from the shaders go into a queue to be serviced, and many of those requests can be in flight
Navi 4 Architecture Deep Dive, Andrew Pomianowski, CVP, Silicon Design Engineering (AMD)
AMD's presentation hints that RDNA 3 and older GPUs had multiple waves sharing a memory access queue. As mentioned above, AMD GPUs since GCN handle memory dependencies with hardware counters that software waits on. Because accesses tracked by vmcnt return in-order, the compiler can wait on the specific load that produces data needed by the next instruction, without also waiting on every other load the wave has pending. RDNA 3 and prior AMD GPUs possibly had a shared memory access queue, with each entry tagged with its wave's ID. As each memory access leaves the queue in-order, hardware decrements the counter for its wave.
Perhaps RDNA 4 divides the shared queue into per-wave queues. That would align with the point on AMD's slide saying RDNA 4 introduces "additional out-of-order queues" for memory requests. Or perhaps RDNA 4 retains a shared queue, but can drain entries out-of-order. That would require tracking extra info, like whether a memory access is the oldest one for its wave.
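As a toy model of that shared-queue hypothesis (purely speculative, since AMD has not documented the arrangement), the in-order drain and per-wave counters could look like this:

```cpp
// Toy model of the hypothesized RDNA 3 arrangement: one return queue shared by
// all waves, drained strictly in allocation order, with a per-wave counter
// (vmcnt) decremented as entries leave. A younger cache hit from wave X cannot
// retire while an older cache miss from wave Y sits at the head, so wave X's
// vmcnt stays elevated. Purely illustrative; the real arrangement isn't public.
#include <cstdio>
#include <deque>

struct Request { int wave; int dataReadyCycle; };

int main() {
    // Wave Y (id 1) issues a cache miss first, then wave X (id 0) issues a hit.
    std::deque<Request> sharedQueue = { {1, 300}, {0, 20} };
    int vmcnt[2] = {1, 1};  // one outstanding request per wave

    for (int cycle = 0; !sharedQueue.empty(); cycle++) {
        if (cycle == 20)
            printf("cycle 20: wave X data is back, but vmcnt[X] is still %d\n", vmcnt[0]);

        // In-order drain: only the head entry may retire, even if a younger
        // entry's data has already returned from the cache hierarchy.
        while (!sharedQueue.empty() && sharedQueue.front().dataReadyCycle <= cycle) {
            Request r = sharedQueue.front();
            sharedQueue.pop_front();
            vmcnt[r.wave]--;
            printf("cycle %d: wave %d entry retires, its vmcnt drops to %d\n",
                   cycle, r.wave, vmcnt[r.wave]);
        }
    }
    // Per-wave queues (one reading of AMD's slide) would let wave X's entry
    // retire at cycle 20, regardless of wave Y's pending miss.
}
```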
Sharing a memory access queue and returning data in-order seems like a natural hardware simplification. That raises the question of whether GPU architectures from Intel and Nvidia had similar limitations.
Intel's Xe-LPG does not have false cross-wave memory dependencies. Running the same test on Meteor Lake's iGPU shows variation depending on where the two waves end up. If wave X and wave Y run on XVEs with shared instruction control logic, wave X's performance is lower than in other cases. Regardless, it's clear Xe-LPG doesn't force a wave to wait on another's accesses. Intel's subsequent Battlemage (Xe2) architecture shows similar behavior, and the same applies to Intel's Gen 9 (Skylake) graphics from a while ago.
I also checked generated assembly to ensure Intel's compiler wasn't unrolling the loop further.
Nvidia's Pascal has varying behavior depending on where waves are located within a SM. Each Pascal SM has four partitions, which are arranged in pairs that share a texture unit and a 24 KB texture cache. Waves are assigned to partitions within a pair first. It's as if the partitions are numbered [0,1]-> tex, [2,3]-> tex. Waves in the same sub-partition pair have the false dependency issue. Evidently they share some general load/store logic besides the texture unit, because I don't touch textures in this test.
If a wave is not offset from another one by a multiple of four, or a multiple of four plus one, it doesn't have the false dependency problem. Turing, as tested on the GTX 1660 Ti, doesn't have the problem either.
Besides removing false cross-wave delays, AMD also improved memory request handling within a wave. Much like in-order CPU cores such as Arm's Cortex A510, GPUs can execute independent instructions while waiting on a memory access. A thread only stalls when it tries to use the memory access's result. GPUs have done this for decades, though the implementation details differ. Intel and Nvidia's GPUs use a software managed scoreboard. AMD has used pending request counters from GCN onward.
RDNA 4 uses the same scheme, but splits out the vmcnt category into several counters. A thread can interleave global memory, texture sampling, and raytracing intersection test requests, and wait on them separately. That gives the compiler more flexibility to move work ahead of a wait for memory access completion. Another interpretation of AMD's slide is that each counter corresponds to a separate queue, each of which has out-of-order behavior across waves (but may have in-order behavior within a wave).
Similarly, lgkmcnt gets separated into kmcnt for scalar memory loads and dscnt for LDS accesses. Scalar memory loads return out-of-order, which means the compiler must wait for all scalar memory loads to complete (kmcnt=0 on RDNA 4, or lgkmcnt=0 on older generations) before using results from any pending scalar memory load. On RDNA 4, the compiler can interleave scalar memory and LDS accesses without having to wait for lgkmcnt=0.
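As a toy illustration of why the split matters (a deliberate simplification, not a description of the actual hardware):

```cpp
// Toy illustration of why splitting lgkmcnt helps. Scalar loads can return
// out-of-order, so the only safe wait is "counter == 0". With one shared
// counter, that wait also drains every pending LDS access; with split
// counters it does not.
#include <cstdio>

struct SharedCounter { int lgkmcnt = 0; };            // RDNA 3: scalar + LDS together
struct SplitCounters { int kmcnt = 0, dscnt = 0; };   // RDNA 4: tracked separately

int main() {
    SharedCounter rdna3;
    SplitCounters rdna4;

    // Issue 2 scalar memory loads and 4 LDS accesses on each model.
    rdna3.lgkmcnt += 2 + 4;
    rdna4.kmcnt   += 2;
    rdna4.dscnt   += 4;

    // To safely consume a scalar load result, wait for the relevant counter
    // to hit zero.
    printf("shared counter: waiting for lgkmcnt=0 drains %d requests\n", rdna3.lgkmcnt);
    printf("split counters: waiting for kmcnt=0 drains %d requests, "
           "leaving %d LDS accesses in flight\n", rdna4.kmcnt, rdna4.dscnt);
}
```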
Intel and Nvidia's GPUs use software managed scoreboards. A scoreboard entry can be set and waited on by any instruction, regardless of memory access type. Therefore RDNA 4's optimization isn't applicable to those other GPU architectures. A cost to Intel/Nvidia's approach is that utilizing a big memory request queue would require a correspondingly large scoreboard. AMD can extend a counter by one bit and double the number of queue entries a wave can use.
RDNA 4's memory subsystem enhancements are exciting and improve performance across a variety of workloads compared to RDNA 3. AMD specifically calls out benefits in raytracing workloads, where traversal and result handling may occur simultaneously on the same WGP. Traversal involves pointer chasing, while result handling might involve more cache friendly data lookups and texture sampling. Breaking cross-wave memory dependencies would prevent different memory access patterns in those tasks from delaying each other.
Likely this wasn't an issue with rasterization because waves assigned to a WGP probably work on pixels in close proximity. Those waves may sample the same textures, and even take samples in close proximity to each other within the same texture. If one wave misses in cache, the others likely do too.
Breaking up vmcnt and lgkmcnt probably helps raytracing too. Raytracing shaders make BVH intersection and LDS stack management requests during traversal. Then they might sample textures or access global memory buffers during result handling. Giving the compiler flexibility to interleave those request types and still wait on a specific request is a good thing.
But RDNA 4's scheme for handling memory dependencies isn't fundamentally different from that of GCN many years ago. While the implementation details differ, RDNA 4, GCN, and Intel and Nvidia's GPUs can all absorb cache misses without immediately stalling a thread. Each GPU maker has improved their ability to do so, whether it's with more scoreboard tokens or more counters. RDNA 4 indeed can do Cortex A510 style nonblocking loads, but it's far from a new feature in the world of GPUs.
Resolving false cross-wave dependencies isn't new either. Nvidia had "out-of-order" cross-wave memory access handling in Turing, and presumably their newer architectures too. Intel had the same at least as far back as Gen 9 (Skylake) graphics. Therefore RDNA 4's "out-of-order" memory subsystem enhancements are best seen as generational tweaks, rather than new game changing techniques.
Still, AMD's engineers deserve credit for making them happen. RDNA 4 arguably makes the most significant change to AMD's GPU memory subsystem since RDNA launched in 2019. I'm glad to see the company continue to improve their GPU architecture and make it better suited to emerging workloads like raytracing.
2025-03-20 04:29:49
Intel's foray into high performance graphics has enjoyed impressive progress over the past few years, and the company is not letting up on the gas. Tom Petersen from Intel has indicated that Xe3 hardware design is complete, and software work is underway. Some of that software work is visible across several different open source repositories, offering a preview of what's to come.
Modern GPUs are built from a hierarchy of subdivision levels, letting them scale to hit different performance, power and price targets. A shader program running on an Intel GPU can check where it's running by reading the low bits of the sr0 (state register 0) architectural register.
sr0 topology bits on Xe3 have a different layout1. Xe Cores within a Render Slice are enumerated with four bits, up from two in prior generations. Thus Xe3's topology bits would be able to handle a Render Slice with up to 16 Xe Cores. Prior Xe generations could only have four Xe Cores per Render Slice, and often went right up to that. The B580 and A770 both placed four Xe Cores in each Render Slice.
Having enough bits to describe a certain configuration doesn't mean Intel will ship something that big. Xe did use its maximum 32 core, 4096 lane setup in the Arc A770. However, Xe2 maxed out at 20 cores and 2560 lanes with the Arc B580. Xe2's sr0 format could theoretically enumerate 16 slices. Giving each slice the maximum of 4 Xe Cores would make a 64 Xe Core GPU with 8192 FP32 lanes. Obviously the B580 doesn't get anywhere near that.
Xe3 goes even further. Maxing out all the topology enumeration bits would result in a ludicrously large 256 Xe Core configuration with 32768 FP32 lanes. That’s even larger than Nvidia’s RTX 5090, which “only” has 21760 FP32 lanes. Intel has been focusing on the midrange segment for a while, and I doubt we’ll see anything that big.
Instead, I think Intel wants more flexibility to scale compute power independently of fixed function hardware like ROPs and rasterizers. AMD's Shader Engines and Nvidia's GPCs both pack a lot more than four cores. For example, the RX 6900 XT's Shader Engines each have 10 WGPs. Nvidia's RTX 4090 puts eight SMs in each GPC. GPUs have become more compute-heavy over time, as games use more complex shader programs. Intel seems to be following the same trend.
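To put the field widths in perspective, here is a small sketch that decodes hypothetical sr0 topology fields and computes the ceilings they imply. The bit positions and field widths are assumptions for illustration, not Intel's documented layout; the 128 lanes per Xe Core figure follows from the core and lane counts above.

```cpp
// Illustrative decode of sr0 topology fields. The packing below is an
// assumption for the sake of example; the point is that widening the Xe Core
// field from 2 to 4 bits raises the architectural ceiling per Render Slice
// from 4 to 16 Xe Cores.
#include <cstdint>
#include <cstdio>

struct Topology { uint32_t slice, xeCore, xve; };

// Assumed packing: 4-bit XVE ID, then the Xe Core ID (2 bits on Xe/Xe2,
// 4 bits on Xe3), then a 4-bit Render Slice ID.
Topology decode(uint32_t sr0_low, unsigned xeCoreBits) {
    Topology t;
    t.xve    =  sr0_low        & 0xF;
    t.xeCore = (sr0_low >> 4)  & ((1u << xeCoreBits) - 1);
    t.slice  = (sr0_low >> (4 + xeCoreBits)) & 0xF;
    return t;
}

int main() {
    // Ceilings implied by the field widths (not shipping configurations).
    // 128 FP32 lanes per Xe Core follows from the A770/B580 figures above.
    unsigned xe2Max = 16u * (1u << 2);  // 16 slices x 4 Xe Cores  = 64
    unsigned xe3Max = 16u * (1u << 4);  // 16 slices x 16 Xe Cores = 256
    printf("Xe2 field-width ceiling: %u Xe Cores, %u FP32 lanes\n", xe2Max, xe2Max * 128);
    printf("Xe3 field-width ceiling: %u Xe Cores, %u FP32 lanes\n", xe3Max, xe3Max * 128);

    Topology t = decode(0x123, 4);  // arbitrary example value
    printf("slice %u, Xe Core %u, XVE %u\n", t.slice, t.xeCore, t.xve);
}
```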
Xe Vector Engines (XVEs) execute shader programs on Intel GPUs. They use a combination of vector-level and thread-level parallelism to hide latency.
Xe3 XVEs can run 10 threads concurrently, up from eight in prior generations. Like SMT on a CPU, tracking multiple threads helps an XVE hide latency using thread level parallelism. If one thread stalls, the XVE can hopefully find an un-stalled thread to issue instructions from. Active thread count is also referred to as thread occupancy. 100% occupancy on a GPU would be analogous to 100% utilization in Windows Task Manager. Unlike CPU SMT implementations, GPU occupancy can be limited by register file capacity.
Prior Intel GPUs had two register allocation modes. Normally each thread gets 128 512-bit registers, for 8 KB of registers per thread. A “large GRF” mode gives each thread 256 registers, but drops occupancy to 4 threads because of register file capacity limits. Xe3 continues to use 64 KB register files per XVE, but flexibly allocates registers in 32 entry blocks2. That lets Xe3’s XVEs get 10 threads in flight as long as each thread uses 96 or fewer registers. If a shader program needs a lot of registers, occupancy degrades more gracefully than in prior generations.
Nvidia and AMD GPUs allocate registers at even finer granularity. AMD’s RDNA 2 for example allocates registers in blocks of 16. But Xe3 is still more flexible than prior Intel generations. With this change, simple shaders that only need a few registers will enjoy better latency tolerance from more thread-level parallelism. And more complex shaders can avoid dropping to the “large GRF” mode.
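A quick way to see what 32-register blocks buy is to compute occupancy as a function of register use, using the figures above (64 KB register file per XVE, 512-bit registers, two fixed modes on Xe2, up to 10 threads on Xe3). The rounding below is an assumption, but the trend is the point:

```cpp
// Rough model of thread occupancy vs. register use, based on the figures in
// the text: 64 KB register file per XVE, 512-bit (64 B) registers, 8 or 4
// threads on Xe2 depending on GRF mode, and 32-register allocation blocks
// with up to 10 threads on Xe3. A sketch, not Intel documentation.
#include <algorithm>
#include <cstdio>

constexpr int kRegFileRegs = 64 * 1024 / 64;  // 1024 registers of 64 B each

int xe2Threads(int regsPerThread) {
    // Two fixed modes: 128 regs/thread at 8 threads, or 256 regs/thread at 4.
    return (regsPerThread <= 128) ? 8 : 4;
}

int xe3Threads(int regsPerThread) {
    // Registers granted in blocks of 32; occupancy capped at 10 threads.
    int blocks = (regsPerThread + 31) / 32;
    return std::min(10, kRegFileRegs / (blocks * 32));
}

int main() {
    for (int regs : {32, 64, 96, 128, 160, 256})
        printf("%3d regs/thread: Xe2 -> %d threads, Xe3 -> %2d threads\n",
               regs, xe2Threads(regs), xe3Threads(regs));
}
```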
Xe3's XVEs have more scoreboard tokens too. Like AMD and Nvidia, Intel uses compiler assisted scheduling for long latency instructions like memory accesses. A long latency instruction can set a scoreboard entry, and a dependent instruction can wait until that entry is cleared. Each Xe3 thread gets 32 scoreboard tokens regardless of occupancy, so an XVE has 320 scoreboard tokens in total. On Xe2, a thread gets 16 tokens if the XVE is running eight threads, or 32 in "large GRF" mode with four threads. Thus Xe2's XVEs only have 128 scoreboard tokens in total. More tokens let a thread have more outstanding long latency instructions. That very likely translates to more memory level parallelism per thread.
Intel's GPU ISA has a vector register file (GRF, or General Register File) that stores much of a shader program's data and feeds the vector execution units. It also has an "Architecture Register File" (ARF) with special registers. Some of those can store data, like the accumulator registers. But others serve special purposes. For example, sr0 as mentioned above provides GPU topology info, along with floating point exception state and thread priority. A 32-bit instruction pointer points to the current instruction address, relative to the instruction base address.
Xe3 adds a "Scalar Register" (s0) to the ARF6. s0 is laid out much like the address register (a0), and is used for gather-send instructions. XVEs access memory and communicate with other shared units by sending messages over the Xe Core's message fabric, using send instructions. Gather-send appears to let Xe3 gather non-contiguous values from the register file, and send them with a single send instruction.
Besides adding the Scalar Register, Xe3 extends the thread dependency register (TDR) to handle 10 threads. sr0 gains an extra 32-bit doubleword for unknown reasons.
Xe3 supports a saturation modifier for FCVT, an instruction that converts between different floating point types (not between integer and floating point). FCVT was introduced with Ponte Vecchio, but the saturation modifier could ease conversion from higher to lower precision floating point formats. Xe3 also gains HF8 (half float 8-bit) format support, providing another 8-bit floating point format option next to the BF8 type already supported in Xe2.
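To illustrate what a saturating convert does, here is a minimal sketch assuming an E4M3-style range for HF8 (whether HF8 matches that exactly is an assumption): out-of-range inputs clamp to the largest finite magnitude instead of overflowing to infinity or NaN encodings.

```cpp
// Minimal sketch of saturation during down-conversion, assuming an E4M3-style
// range for HF8 with a maximum finite value of 448 (that HF8 matches E4M3 is
// an assumption). Without saturation, out-of-range values would overflow to
// infinity/NaN encodings; with it, they clamp to the largest finite magnitude
// before rounding to 8 bits.
#include <algorithm>
#include <cstdio>

float saturateForHf8(float x) {
    constexpr float kMaxFinite = 448.0f;  // assumed largest finite HF8 value
    return std::clamp(x, -kMaxFinite, kMaxFinite);
}

int main() {
    for (float v : {1.5f, 500.0f, -10000.0f})
        printf("%9.1f clamps to %7.1f before rounding to 8 bits\n", v, saturateForHf8(v));
}
```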
For the XMX unit, Xe3 gains an xdpas instruction4. sdpas stands for sparse systolic dot product with accumulate5. Matrices with a lot of zero elements are known as sparse matrices. Operations on sparse matrices can be optimized because anything multiplied by zero is obviously zero. Nvidia and AMD GPUs have both implemented sparsity optimizations, and Intel is apparently looking to do the same.
Sub-Triangle Opacity Culling (STOC) subdivides triangles in BVH leaf nodes, and marks sub-triangles as transparent, opaque, or partially transparent. The primary motivation is to reduce wasted any-hit shader work when games use texture alpha channels to handle complex geometry. Intel’s paper calls out foliage as an example, noting that programmers may use low vertex counts to reduce “rendering, animation, and even simulation run times.”7 BVH geometry from the API perspective can only be completely transparent or opaque, so games mark all partially transparent primitives as transparent. Each ray intersection will fire an any-hit shader, which carries out alpha testing. If alpha testing indicates the ray intersected a transparent part of the primitive, the shader program doesn’t contribute a sample and the any-hit shader launch is basically wasted. STOC bits let the any-hit shader skip alpha testing if the ray intersects a completely transparent or completely opaque sub-triangle.
Storing each sub-triangle’s opacity information takes two bits, so STOC does require more storage compared to using a single opacity bit for the entire triangle. Still, it’s far more practical than packing entire textures into the BVH. Intel’s paper found that a software-only STOC implementation improved performance by 5.9-42.2% compared to standard alpha tests when handling translucent ray-traced shadows.
STOC-aware raytracing hardware can provide further gains, especially with Intel's raytracing implementation. Intel's raytracing acceleration method closely aligns with the DXR 1.0 standard. A raytracing accelerator (RTA) autonomously handles traversal and launches hit/miss shaders by sending messages to the Xe Core's thread dispatcher. STOC bits could let the RTA skip shader launches if the ray intersects a completely transparent sub-triangle. For an opaque sub-triangle, the RTA can tell the shader program to skip alpha testing, and terminate the ray early.
Xe3 brings STOC bits into hardware raytracing data structures with two levels of sophistication. A basic implementation retains 64B leaf nodes, but creatively finds space to fit 18 extra bits. Intel's QuadLeaf structure represents a merged pair of triangles. Each triangle gets 8 STOC bits, implying four sub-triangles. Another two bits indicate whether the any-hit shader should do STOC emulation in software, potentially letting programmers turn off hardware STOC for debugging. This mode is named "STOC1" in code.
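As a sketch of what those 18 bits might encode, going purely off the description above (field names, packing, and encodings are guesses for illustration, not Intel's actual QuadLeaf definition):

```cpp
// Hypothetical sketch of the "STOC1" bits described above: two triangles per
// leaf, four sub-triangles per triangle with 2 opacity-state bits each, plus
// two bits selecting software STOC emulation (8 + 8 + 2 = 18 bits).
#include <cstdint>
#include <cstdio>

enum SubTriState : uint32_t {
    kOpaque      = 0,  // hit counts; alpha testing in the any-hit shader can be skipped
    kTransparent = 1,  // hit can be ignored; no any-hit shader launch needed
    kPartial     = 2,  // must launch the any-hit shader and alpha test
};

struct StocBits {
    uint32_t triangle0 : 8;  // 4 sub-triangles x 2 bits
    uint32_t triangle1 : 8;
    uint32_t swEmulate : 2;  // fall back to software STOC, e.g. for debugging
};

// What a STOC-aware traversal unit could decide for a hit on one sub-triangle.
const char* hitAction(StocBits stoc, int tri, int subTri) {
    uint32_t bits = (tri == 0) ? stoc.triangle0 : stoc.triangle1;
    switch ((bits >> (2 * subTri)) & 0x3) {
        case kTransparent: return "skip: no any-hit shader launch";
        case kOpaque:      return "accept: skip alpha test, can end traversal early";
        default:           return "uncertain: launch any-hit shader for alpha testing";
    }
}

int main() {
    StocBits leaf = {};
    leaf.triangle0 = (kOpaque << 0) | (kTransparent << 2) |
                     (kPartial << 4) | (kTransparent << 6);
    for (int s = 0; s < 4; s++)
        printf("triangle 0, sub-triangle %d: %s\n", s, hitAction(leaf, 0, s));
}
```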
A “STOC3” structure takes things further by storing pointers to STOC bits rather than embedding them into the BVH. That allows more flexibility in how much storage the STOC bits can use. STOC3 also specifies recursion levels for STOC bits, possibly for recursively partitioning triangles. Subdividing further would reduce the number of partially transparent sub-triangles, which require alpha testing from the any-hit shader. Storing pointers for STOC3 brings leaf node size to 128 bytes, increasing BVH memory footprint.
Possible performance gains are exciting, but using STOC requires work from game developers or game engines. Intel suggests that STOC bits can be generated offline as part of game asset compilation. Artists will have to determine whether using STOC will provide a performance uplift for a particular scene. A scene with a lot of foliage might benefit massively from STOC. A chain link fence may be another story. STOC isn’t a part of the DirectX or Vulkan standards, which can be another obstacle to adoption. However, software-only STOC can still provide benefits. That could encourage developers to try it out. If they do implement it, STOC-aware Xe3 hardware stands to gain more than a software-only solution.
We’re still some time away from real Xe3 products. But software changes suggest Xe3 is another significant step forward for Intel’s graphics architecture. Xe2 was a solid step in Intel’s foray into discrete graphics, providing better performance than Xe with a nominally smaller GPU. Xe3 tweaks the architecture again and likely has similar goals. Higher occupancy and dynamic register allocation would make Xe Cores more latency tolerant, improving utilization. Those changes also bring Intel’s graphics architecture closer to AMD and Nvidia’s.
XVE changes show Intel is still busy evolving their core compute architecture. In contrast, Nvidia’s Streaming Multiprocessors haven’t seen significant changes from Ampere to Blackwell. Nvidia may have felt Ampere’s SM architecture was good enough, and turned their efforts to tuning features while scaling up the GPU to keep providing generational gains. Intel meanwhile seeks to get more out of each Xe Core (and Xe2 achieved higher performance than Xe with fewer Xe Cores).
In a similarity with Nvidia, Intel is pushing hard on the features front and evidently has invested in research. GPUs often try to avoid doing wasted work. Just as rasterization pipelines use early depth testing to avoid useless pixel shader invocations, STOC avoids spawning useless any-hit shaders. It's too early to tell what kind of difference STOC or other Xe3 features will make. But anyone doubting Intel's commitment to moving their GPU architecture forward should take a serious look at Mesa and Intel Graphics Compiler changes. There's a lot going on, and I look forward to seeing Xe3 whenever it's ready.