
Raytracing on Intel’s Arc B580

2025-03-15 07:42:44

Edit: The article originally said Intel’s BVH nodes were 4-wide, based on a misreading of QuadLeaf. After inspecting Intel compiler code, I found that QuadLeaf actually refers to two merged triangles with a shared side (a quadrilateral, not a quad of four triangles).

Intel’s discrete GPU strategy has emphasized add-on features ever since Alchemist launched. Right from the start, Intel invested heavily in dedicated matrix multiplication units, raytracing accelerators, and hardware video codecs. Battlemage continues that trend. Raytracing deserves attention because raytraced effects are gaining prominence in an increasing number of titles.

Here, I’ll be looking at a Cyberpunk 2077 frame rendered on Intel’s Arc B580, with path tracing enabled. As always, I’m focusing on how the architecture handles the workload, rather than absolute performance. Mainstream tech outlets already do an excellent job discussing final performance.

Definitely check out the prior article on Meteor Lake’s raytracing implementation, because Intel uses the same raytracing strategy and data structures on Battlemage. Also be sure to check the Battlemage article, which covers the B580’s architecture and some of Intel’s terminology.

Cyberpunk 2077 Path Tracing, Lighting Shader?

I captured a Cyberpunk 2077 frame at 1080P with no upscaling. Framerate was a bit low at 12 FPS. An occupancy timeline from Intel’s Graphics Performance Analyzer (GPA) shows raytracing calls dominating frame time, though CP2077 surprisingly still spends some time on small rasterization calls.

I’m going to focus on the longest duration RT call, which appears to handle some lighting effects. Output from that DispatchRays call shows a very noisy representation of the scene. It’s similar to what you’d get if you stopped a Blender render at an extremely low sample count. Large objects are recognizable, but smaller ones are barely visible.

Output from the DispatchRays call, as shown by GPA

Raytracing Accelerator

Battlemage’s raytracing accelerator (RTA) plays a central role in Intel’s efforts to improve raytracing performance. The RTA receives messages from XVEs to start ray traversal. It then handles traversal without further intervention from the ray generation shader, which terminates shortly after talking to the RTA. BVH data formats are closely tied to the hardware implementation. Intel continues to use the same box and triangle node formats as the previous generation. Both box and triangle nodes continue to be 64B in size, and thus neatly fit into a cacheline.

Roughly how the raytracing process works. Any L1 request can miss of course

Compared to Alchemist and Meteor Lake, Battlemage’s RTA increases traversal pipeline count from 2 to 3. That brings box test rate up to three nodes per cycle, or 18 box tests per cycle. Triangle intersection test rate doubles as well. More RTA throughput could put more pressure on the memory subsystem, so the RTA’s BVH cache doubles in capacity from 8 KB to 16 KB.

During the path tracing DispatchRays call, the B580 processed 467.9M rays per second, or 23.4M rays/sec per Xe Core. Each ray required an average of 39.5 traversal steps. RTAs across the GPU handled just over 16 billion BVH nodes per second, which mostly lines up with traversal step count. Intel uses a short stack traversal algorithm with a restart trail. That reduces stack size compared to a simple depth first search, letting Intel keep the stack in low latency registers. However, it can require restarting traversal from the top using a restart trail. Doing so would mean some upper level BVH nodes get visited more than once by the same ray. That means more pointer chasing accesses, though it looks like the RTA can avoid repeating intersection tests on previously accessed nodes.

My impression of how the raytracing unit works. Each traversal pipeline holds ray state, possibly for multiple rays. The frontend allocates rays into the traversal pipelines

GPA’s RT_QUAD_TEST_RAY_COUNT and RT_QUAD_LEAF_RAY_COUNT metrics suggest 1.55% and 1.04% utilization figures for the ray-box and ray-triangle units, respectively. Intel isn’t bound by ray-triangle or ray-box throughput. Even if every node required intersection testing, utilization on the ray-box or ray-triangle units would be below 10%. Battlemage would likely be fine with the two ray-box and single triangle unit from before. I suspect Intel found that duplicating the traversal pipeline was an easy way to let the RTA keep more work in flight, improving latency hiding.

“Percentage of time in which Ray Tracing Frontend is stalled by Traversal”

Description of the RT_TRAVERSAL_STALL metric in GPA

Intel never documented what they mean by the Ray Tracing Frontend. Perhaps the RTA consists of a frontend that accepts messages from the XVEs, and a traversal backend that goes through the BVH. A stall at the frontend may mean it has received messages from the XVEs, but none of the traversal pipelines in the backend can accept more work. Adding an extra traversal pipeline could be an easy way to process more rays in parallel, and that extra pipeline of course comes with its own ray-box units. There could also be other workloads that benefit from higher intersection test throughput. Intel added an extra triangle test unit, and those aren’t part of the traversal pipelines.

BVH Caching

BVH traversal is latency sensitive. Intel’s short stack algorithm requires more pointer chasing steps than a simple depth first search, making it even more sensitive to memory latency. But it also creates room for optimization via caching. Using the restart trail involves re-visiting nodes that have been accessed not long ago. A cache can exploit that kind of temporal locality, which is likely why Intel gave the RTA a BVH cache. The Xe Core already has a L1 data cache, but that has to be accessed over the Xe Core’s message fabric. A small cache tightly coupled to the RTA is easier to optimize for latency.

Battlemage’s 16 KB BVH cache performs much better than the 8 KB one on prior generations. Besides reducing latency, the BVH cache also reduces pressure on L1 cache. Accessing 16.03G BVH nodes per second requires ~1.03 TB/s of bandwidth. Battlemage’s L1 can handle that easily. But minimizing data movement can reduce power draw. BVH traversal should also run concurrently with miss/hit shaders on the XVEs, and reducing contention between those L1 cache clients is a good thing.

Dispatching Shaders

Hit/miss shader programs provided by the game handle traversal hit/miss results. The RTA launches these shader programs by sending messages back to the Xe Core’s thread dispatcher, which allocates them to XVEs as thread slots become available. The thread dispatcher has two queues for non-pixel shader work, along with a pixel shader work queue. Raytracing work only uses one of the non-pixel shader queues (queue0).

81.95% of the time, queue0 had threads queued up. It spent 79.6% of the time stalled waiting for free thread slots on the XVEs. That suggests the RTAs are generating traversal results faster than the shader array can handle them.

Most of the raytracing related thread launches are any hit or closest hit shaders. Miss shaders are called less often. In total, RTAs across the Arc B580 launched just over a billion threads per second. Even though Intel’s raytracing method launches a lot of shader programs, much of that is contained within the Xe Cores and doesn’t bother higher level scheduling hardware. Just as with Meteor Lake, Intel’s hierarchical scheduling setup is key to making its RTAs work well.

Vector Execution

A GPU’s regular shader units run the hit/miss shader programs that handle raytracing results. During the DispatchRays call, the B580’s XVEs have almost all of their thread slots active. If you could see XVE thread slots as logical cores in Task Manager (1280 of them), you’d see 93.8% utilization. Judging from ALU0 utilization breakdowns, most of the work comes from any hit and closest hit shaders. Miss shader invocations aren’t rare, but perhaps the miss shaders don’t do a lot of work.

Just as high utilization in Task Manager doesn’t show how fast your CPU cores are doing work under the hood, high occupancy doesn’t imply high execution unit utilization. More threads simply give the GPU more thread-level parallelism to hide latency with, just like loading more SMT threads on CPU cores. In this workload, even high occupancy isn’t enough to achieve good hardware utilization. Execution unit usage is low across the board, with the ALU0 and ALU1 math pipelines busy less than 20% of the time.

Intel can break down thread stall reasons for cycles when the XVE wasn’t able to execute any instructions. Multiple threads can be stalled for different reasons during the same cycle, so counts will add up to over 100%. A brief look shows memory latency is a significant factor, as scoreboard ID stalls top the chart even without adding in SendWr stalls. Modern CPUs and GPUs usually spend a lot of time with execution units idle, waiting for data from memory. But Cyberpunk 2077’s path tracing shaders appear a bit more difficult than usual.

Execution latency hurts too, suggesting Cyberpunk 2077’s raytracing shaders don’t have a lot of instruction level parallelism. If the compiler can’t place enough independent instructions between dependent ones, threads will stall. GPUs can often hide execution latency by switching between threads, and the XVEs in this workload do have plenty of thread-level parallelism to work with. But it’s not enough, so there are probably a lot of long latency instructions and long dependency chains.

Finally, threads often stall on instruction fetch. The instruction cache only has a 92.7% hitrate, so some shader programs are taking L1i misses. Instruction cache bandwidth may be a problem too. Each Xe Core’s instruction cache handled 1.11 hits per cycle if I did my math right, so the instruction cache sometimes has to handle more than one access per cycle. If each access is for a 64B cacheline, each Xe Core consumes over 200 GB/s of instruction bandwidth. Intel’s Xe Core design does seem to demand a lot of instruction bandwidth. Each Xe Core has eight XVEs, each of which can issue multiple instructions per cycle. Feeding both ALU0 and ALU1 would require 2 IPC, or 16 IPC across the Xe Core. For comparison, AMD’s RDNA 2 only needs 4 IPC from the instruction cache to feed its vector execution units.
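To sanity check that figure, here’s the arithmetic spelled out as a tiny C snippet. The 1.11 hits/cycle number comes from GPA; the ~2.85 GHz shader clock is an assumption borrowed from the Battlemage article’s figures, since I don’t have the exact clock during this capture.

  // Quick sanity check on the instruction bandwidth estimate above.
  // hits/cycle comes from GPA; the shader clock is an assumed ~2.85 GHz.
  #include <stdio.h>

  int main(void) {
      double hits_per_cycle = 1.11; // L1i hits per cycle, per Xe Core
      double line_bytes = 64.0;     // one instruction cacheline per hit
      double clock_ghz = 2.85;      // assumed B580 shader clock
      // GHz * bytes/cycle gives GB/s directly (1e9 cycles/s * bytes/cycle)
      printf("~%.0f GB/s of instruction fetch per Xe Core\n",
             hits_per_cycle * line_bytes * clock_ghz);
      return 0;
  }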

Instruction Mix

Executed shader code usually uses 32-bit datatypes, with some 16-bit types playing a minor role. INT64 instructions also make an appearance, perhaps for address calculation. Special function units (math) see heavy usage, and may contribute to AluWr stalls above.

The close mix of INT32 and FP32 instructions plays well into the XVE’s pipeline layout, because those instruction types are executed on different ports. However, performance is bound by factors other than execution unit throughput and pipe layout.

Cache and Memory Access

Caches within Battlemage’s Xe Core struggle to contain this path tracing workload’s memory accesses. Despite increasing L1 cache capacity from 192 to 256 KB, Battlemage’s L1 hitrate still sits below 60%. Intel services texture accesses with a separate cache that appears to have 32 KB of capacity from a latency test. Texture cache hitrate is lackluster at under 30%.

Traffic across the GPU

A lot of accesses fall through to L2, and the Arc B580’s 18 MB L2 ends up handling over 1 TB/s of traffic. L2 hitrate is good at over 90%, so 18 MB of L2 capacity is adequate for this workload. The Arc B580’s 192-bit GDDR6 setup can provide 456 GB/s of bandwidth, and this workload used 334.27 GB/s on average. GPA indicates the memory request queue was full less than 1% of the time, so the B580’s GDDR6 subsystem did well on the bandwidth front. Curiously, L2 miss counts only suggest 122.91 GB/s of L2 miss bandwidth, well short of the observed VRAM traffic. Something is consuming VRAM bandwidth without going through L2.

A Brief Look at Port Royal

3DMark’s Port Royal benchmark uses raytraced reflections and shadows. It still renders most of the scene using rasterization, instead of the other way around like Cyberpunk 2077’s path tracing mode. That makes Port Royal a better representation of a raytracing workload that’s practical to run on midrange cards. I’m looking at a DispatchRays call that appears to handle reflections.

Output from the DispatchRays call

Rays in Port Royal take more traversal steps. Higher BVH cache hitrate helps keep traversal fast, so the RTAs are able to sustain a similar rays per second figure compared to Cyberpunk 2077. Still, Port Royal places more relative pressure on the RTAs. RT traversal stalls happen more often, suggesting the RTA is getting traversal work handed to it faster than it can generate results.

At the same time, the RTA generates less work for the shader array when traversal finishes. Port Royal only has miss and closest hit shaders, so a ray won’t launch several any-hit shaders as it passes through transparent objects. Cyberpunk 2077’s path tracing mode also launches any-hit shaders, allowing more complex effects but also creating more work. In Port Royal, the Xe Core thread dispatchers rarely have work queued up waiting for free XVE thread slots. From the XVE side, occupancy is lower too. Together, those metrics suggest the B580’s shader array is also consuming traversal results faster than the RTAs generate them.

Just as a very cache friendly workload can achieve higher IPC with a single thread than two SMT threads together on a workload with a lot of cache misses, Port Royal enjoys better execution unit utilization despite having fewer active threads on average. Instruction fetch stalls are mostly gone. Memory latency is always an issue, but it’s not quite as severe. That shifts some stalls to execution latency, but that’s a good thing because math operations usually have lower latency than memory accesses.

Much of this comes down to Port Royal being more cache friendly. The B580’s L1 caches are able to contain more memory accesses, resulting in lower L2 and VRAM traffic.

Final Words

Intel’s Battlemage architecture is stronger than its predecessor at raytracing, thanks to beefed up RTAs with more throughput and better caching. Raytracing involves much more than BVH traversal, so Intel’s improved shader array also provides raytracing benefits. That especially applies to Cyberpunk 2077’s path tracing mode, which seeks to do more than simple reflections and shadows, creating a lot of pressure on the shader array. Port Royal’s limited raytracing effects present a different challenge. Simple effects mean less work on the XVEs, shifting focus to the RTAs.

Raytracing workloads are diverse, and engineers have to allocate their transistor budget between fixed function BVH traversal hardware and regular vector execution units. It reminds me of DirectX 9 GPUs striking a balance between vertex and pixel shader core counts. More vertex shaders help with complex geometry. More pixel shaders help with higher resolutions. Similarly, BVH traversal hardware deals with geometry. Hit/miss shaders affect on-screen pixel colors, though they operate on a sample basis rather than directly calculating colors for specified pixel coordinates.

Rasterization and raytracing both place heavy demands on the memory subsystem, which continues to limit performance. Intel has therefore improved their caches to keep the improved RTAs and XVEs fed. The 16 KB BVH cache and bigger general purpose L1 cache have a field day in Port Royal. They have less fun with Cyberpunk 2077 path tracing. Perhaps Intel could make the Xe Core’s caches even bigger. But as with everything that goes on a chip, engineers have to make compromises with their limited transistor budget.

Output across multiple DispatchRays calls, manually ADD-ed together and exposure adjusted in GIMP. No noise reduction applied.

Cyberpunk 2077’s path tracing mode is a cool showcase, but it’s not usable on a midrange card like the B580 anyway. Well, at least not without a heavy dose of upscaling and perhaps frame generation. The B580’s caches do better on a simpler workload like Port Royal. Maybe Intel tuned Battlemage’s caches with such workloads in mind. It’s a good tradeoff considering Cyberpunk 2077’s path tracing mode challenges even high end GPUs.

Much like Intel’s strategy of targeting the GPU market’s sweet spot, perhaps Battlemage’s raytracing implementation targets the sweet spot of raytracing-enabled games, which use a few raytraced effects to help enhance a mostly rasterized scene. Going forward, Intel plans to keep advancing their raytracing implementation. Xe3 adds sub-triangle opacity culling with associated new data structures. While Intel compiler code only references Panther Lake (Xe3 LPG) for now, I look forward to seeing what Intel does next with raytracing on discrete GPUs.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

AMD's RDNA4 Architecture (Video)

2025-03-05 22:01:21

Hello you fine Internet folks,

Today AMD has released their new RDNA4 architecture found in the RX 9000 series, with the RX 9070 and RX 9070 XT GPUs launching first. These are cards targeting the high mid-range of the GPU market, not going after the highest end like AMD did with RDNA2 and RDNA3.

RDNA4 is the fourth iteration of the RDNA architecture from AMD and it has some very interesting changes and additions compared to prior RDNA architectures.

Acknowledgments

AMD sent over a Powercolor Hellhound RX 9070 and a Powercolor Hellhound RX 9070 XT for review and analysis.

Architectural Improvements

Let’s start with the high-level die diagram changes first.

The die found in the RX 9070 series is fabricated on TSMC’s N4P node and has 53.9 billion transistors on a 356.5mm² die, which gives a density of roughly 151.2 million transistors per mm².

The RX 9070 differs from the RX 9070 XT mainly in having 8 fewer Compute Units, an approximately 450MHz lower boost clock, and an 84-watt lower power limit.

Cache Bandwidth and Hierarchy Changes

AMD has doubled the number of banks for the L2 cache, which also doubles the amount of L2 per Shader Engine. In prior RDNA architectures there was 1MB of L2 cache per Shader Engine, so the RX 6900 XT had 4MB of L2 cache with its 4 Shader Engines and the RX 7900 XTX had 6MB with its 6 Shader Engines.

With RDNA4, doubling the bank count per L2 slice doubles each slice’s bandwidth as well as the amount of L2 cache per Shader Engine, so the RX 9070 XT’s 4 Shader Engines give it 8MB of L2 cache. AMD also appears to have increased the bandwidth of the Infinity Cache (MALL).

What is interesting is that the L1 cache no longer exists on RDNA4. It appears as if the L1 is now a read/write-coalescing buffer and no longer a dedicated cache level. This is a major change in the RDNA cache hierarchy; the L1 was added in RDNA1 to reduce both the number of requests to the L2 and the number of clients on the L2.

Compute Unit and Matrix Unit Updates

There is an error in this slide; Named Barriers are not a part of RDNA4

AMD added FP operations to the Scalar Unit in RDNA3.5, and those additions carry over to RDNA4. AMD has also improved the scheduler with Split Barriers, which allow instructions to be issued between finishing one block of work (whether consuming or producing it) and waiting for other waves to complete theirs.

AMD has also improved register spill and fill behavior with Register Block Spill/Fill operations, which operate on up to 32 registers at a time. A major reason for adding this capability was reducing unnecessary spill memory traffic in cases such as separately compiled functions, by letting code that knows which registers contain live state pass that information to code that may need to save and restore registers to make use of them.

RDNA4 has also significantly beefed up the matrix units compared to RDNA3. FP16/BF16 throughput has been doubled, along with a quadrupling of INT8 and INT4 throughput. RDNA4 adds FP8 (E4M3) and BF8 (E5M2) at the same throughput as INT8, and also adds 4:2 sparsity support to its matrix units. However, one part of the Compute Unit that hasn’t seen improvement is the dual-issuing capability.

Ray Accelerator

Moving on to AMD’s Ray Accelerator (RA), which has been a significant focus for AMD with RDNA4.

AMD has added a second intersection engine, doubling ray-box and ray-triangle intersection rates from 4 and 1 to 8 and 2, respectively. AMD has also moved from a BVH4 to a BVH8 structure.

AMD has also added the ability for the ray accelerator to store results out of order, “Pack and Sort”. Packing allows the RA to skip ahead and process faster rays that don’t need to access memory or a higher latency cache level. Sorting allows the RA to preserve the correct ordering so that it appears to the program as if the instructions were executed in-order.

Out of Order Memory Access

One of the most exciting additions in RDNA4 is out of order memory access. This is not like Nvidia’s Shader Execution Reordering; what SER allows is reordering threads that hit or miss, as well as threads that go to the same cache or memory level, so they can be bundled into the same wave.

This out of order memory access seems to be very similar to the capabilities that the Cortex-A510 has. While the Cortex-A510 is an in-order core for integer and floating-point operations, it can absorb up to 2 cache misses on memory operations without stalling the rest of the pipeline. The number of misses that an RDNA4 Compute Unit can handle is unknown, but the fact that it can deal with memory accesses out of order is a new feature for a GPU to have.

Dynamic Registers

And last but not least, AMD has added dynamic allocation of RDNA4’s registers.

This allows shaders to request more registers than they could usually get, giving the GPU more opportunity to keep more waves in flight at any given time. Dynamic register allocation along with out of order memory access gives RDNA4 many tricks to hide the latency that some workloads, in particular ray tracing, have.

Performance of RDNA4 at Different Wattages

I’ll leave the majority of the benchmarking to the rest of the tech media; however, I do want to look at RDNA4’s behavior at various wattages.

The RX 9070 does quite well at 154 watts. Despite running at 44% of the power of the RX 9070 XT at 348 watts, the RX 9070 at 154 watts gets on average 70% of the RX 9070 XT’s performance, which works out to a 59% performance per watt advantage for the RX 9070 at 154 watts.

Conclusion

AMD’s brand new RDNA4 architecture makes many improvements to the core RDNA Compute Unit, with much improved machine learning and ray tracing accelerators along with major improvements to the handling of latency sensitive workloads.

However, the Compute Unit is not the only part of RDNA4 that AMD has improved. AMD has also improved cache bandwidth at the L2 and MALL levels, along with making the L1 no longer a cache level but a read/write-coalescing buffer.

However, with the launch of the RX 9070 series, I do wonder how a 500-600mm² RDNA4 die with a 384-bit or 512-bit memory bus would have performed. I wonder if it could have competed with the 4090, or maybe even the 5090.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Zen 5's AVX-512 Frequency Behavior

2025-03-01 12:02:39

Zen 5 is AMD's first core to use full-width AVX-512 datapaths. Its vector execution units are 512 bits wide, and its L1 data cache can service two 512-bit loads per cycle. Intel went straight to 512-bit datapaths with Skylake-X back in 2017, and used fixed frequency offsets and transition periods to handle AVX-512 power demands. Later Intel CPUs did away with fixed frequency offsets, but Skylake-X's AVX-512 teething troubles demonstrate the difficulties that come with running wide datapaths at high clock speeds. Zen 5 benefits from a much better process node and also has no fixed clock speed offsets when running AVX-512 code.

Through the use of improved on-die sensors, AC capacitance (Cac) monitors, and di/dt-based adaptive clocking, "Zen 5" can achieve full AVX512 performance at peak core frequency.

"Zen 5": The AMD High-Performance 4nm x86-64 Microprocessor Core

But if a Zen 5 core is running at 5.7 GHz and suddenly gets faced with an AVX-512 workload, what exactly happens? Here, I'm probing AVX-512 behavior using a modified version of my boost clock testing code. I wrote that code a couple years ago and used dependent integer adds to infer clock speed. Integer addition typically has single cycle latency, making it a good platform-independent proxy for clock speed. Instead of checking clock ramp time, I'm switching to a different test function with AVX-512 instructions mixed in after each dependent add. I also make sure the instructions I place between the dependent adds are well within what Zen 5's FPU can handle every cycle, which prevents the FPU from becoming a bottleneck.
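As a rough illustration of that inference trick, here’s a minimal C sketch. It’s my reconstruction, not the actual harness, which is hand-written x86-64 assembly with the AVX-512 instructions interleaved; this version assumes GCC/Clang extended asm on x86-64 and ignores clock ramp warm-up for brevity.

  #include <stdint.h>
  #include <stdio.h>
  #include <time.h>

  // Each add depends on the previous one, so the chain runs at ~1 add per cycle.
  // The inline asm keeps the compiler from collapsing the dependency chain.
  static uint64_t dependent_adds(uint64_t x, uint64_t iters) {
      for (uint64_t i = 0; i < iters; i++)
          __asm__ volatile("add %1, %0" : "+r"(x) : "r"(i));
      return x;
  }

  int main(void) {
      const uint64_t iters = 200000000ULL; // long enough to hide timer overhead
      struct timespec t0, t1;
      clock_gettime(CLOCK_MONOTONIC, &t0);
      volatile uint64_t sink = dependent_adds(1, iters);
      clock_gettime(CLOCK_MONOTONIC, &t1);
      double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
      // One add per cycle means adds per second approximates core frequency.
      printf("inferred clock: %.2f GHz (sink=%llu)\n", iters / sec / 1e9,
             (unsigned long long)sink);
      return 0;
  }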

512-bit FMA, Register Inputs

I started by placing two fused multiply-add (FMA) instructions in each iteration, each of which operates on a 512-bit vector of packed FP32 values. After the Ryzen 9 reaches 5.7 GHz running the scalar integer function, I switch over to AVX-512 instructions.

Impressively, nothing changes. The dependent integer adds continue to execute at ~5.7 GHz. Zooming in doesn’t show a transition period either. I see a single data point covering 1.3 microseconds where the core executed those dependent integer adds at “only” a 5.3 GHz average. The very next data point shows the core running at full speed again.

Evidently, Zen 5’s FPU is not troubled by getting hit with 512-bit vector operations, even when running at 5.7 GHz. If there is a transition period for the increased power draw, it’s so tiny that it can be ignored. That matches Alex Yee’s observations, and shows just how strong Zen 5’s FPU is. For comparison, the same experiment on Skylake-X shows both a transition period and lower clock speeds after the transition completes. Intel’s Core i9-10900X reaches 4.65 GHz after a rather long clock ramp using scalar integer code. Switching to the AVX-512 test function drops clock speeds to 4 GHz, a significant decrease from 4.65 GHz.

Zooming in on Skylake-X data reveals a transition period, which Travis Downs and Alex Yee noted a while ago. My test eats a longer 55 microsecond transition period though. Travis Downs saw a 20 microsecond transition, while Alex Yee mentions a 50k cycle delay (~12 microseconds at 4 GHz).

Note the dip after switching to the AVX-512 test function

I’m not sure why I see a longer transition, but I don’t want to dwell on it because of methodology differences. Travis Downs used vector integer instructions, and I used floating point ones. And I want to focus on Zen 5.

After the transition finishes, the i9-10900X levels out at 3.7 GHz, then briefly settles at 3.8 GHz for less than 0.1 ms before reaching its steady state 4 GHz speed.

Adding a Memory Operand

Zen 5 also doubles L1D load bandwidth, and I’m exercising that by having each FMA instruction source an input from the data cache. The test above used the following pattern:

  add rbx, r8
  vfmadd132ps zmm16, zmm1, zmm0
  vfmadd132ps zmm17, zmm2, zmm0
  
  add rbx, r8
  vfmadd132ps zmm18, zmm3, zmm0
  vfmadd132ps zmm19, zmm4, zmm0

  etc

I’m changing those FMA instructions to use a memory operand. Because Zen 5 can handle two 512-bit loads per cycle, the core should have no problems maintaining 3 IPC. That’s two 512-bit FMAs, or 64 FP32 FLOPs per cycle, alongside a scalar integer add.

  add rbx, r8
  vfmadd132ps zmm16, zmm1, [r15]
  vfmadd132ps zmm17, zmm2, [r15 + 64]

  add rbx, r8
  vfmadd132ps zmm18, zmm3, [r15 + 128]
  vfmadd132ps zmm19, zmm4, [r15 + 192]

  etc

With the load/store unit’s 512-bit paths in play, the Ryzen 9 9900X has to undergo a transition period of some sort before recovering to 5.5 GHz. From the instruction throughput perspective, Zen 5 apparently dips to 4.7 GHz and stays there for a dozen milliseconds. Then it slowly ramps back up until it reaches steady state speeds.

The Ryzen 9 9900X splits its cores across two Core Complex Dies, or CCDs. The first CCD can reach 5.7 GHz, while the second tops out at 5.4 GHz. Cores on the second CCD show no transition period on this test.

Cores within the fast CCD all need a transition period, but the exact nature of that transition varies. Not all cores reach steady state AVX-512 frequencies at the same time, and some cores take a sharper hit when this heavy AVX-512 sequence shows up.

Not showing the switch point because it’s pretty obvious where it is, and happens at about the same time for all cores within CCD0

Per-core variation suggests each Zen 5 core has its own sensors and adjusts its performance depending on something. Perhaps it’s measuring voltage. From this observation, I wouldn’t be surprised if another 9900X example shows slightly different behavior.

Am I Really Looking at Clock Speed?

Zen 5’s frequency appears to dip during the transition period, based on how fast it’s executing instructions compared to its expected capability. But while my approach of using dependent scalar integer adds is portable, it can’t differentiate between a core running at lower frequency and a core refusing to execute instructions at full rate. The second case may sound weird. But Travis Downs concluded Skylake-X did exactly that based on performance counter data.

[Skylake-X] continues executing instructions during a voltage transition, but at a greatly reduced speed: 1/4th of the usual instruction dispatch rate

Gathering Intel on Intel AVX-512 Transitions, Travis Downs

Executing instructions at 1/4 rate would make me infer a 4 GHz core is running at 1 GHz, which is exactly what I see with the Skylake-X transition graph above. Something similar actually happens on Zen 5. It’s not executing instructions at the usual rate during the transition period, making it look like it’s clocking slower from the software perspective.

But performance counter data shows Zen 5 does not sharply reduce its clock speed when hit with AVX-512 code. Instead, it gradually backs off from maximum frequency until it reaches a sustainable clock speed for the workload it’s hit with. During that period, instructions per cycle (IPC) decreases. IPC gradually recovers as clock speeds get closer to the final steady-state frequency. Once that happens, instruction execution rate recovers to the expected 3 IPC (1x integer add + 2x FMA).

I do have some extra margin of error when reading performance counter data, because I’m calling Linux’s perf API before and after each function call, but measuring time within those function calls. That error would become negligible if the test function runs for longer, with a higher iteration count. But I’m keeping iteration counts low to look for short transitions, resulting in a 1-3% margin of error. That’s fine for seeing whether the dip in execution speed is truly caused by lower clock speed.
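For reference, wrapping a test function with Linux’s perf API looks roughly like the sketch below. This is a hedged reconstruction, not my actual harness: it uses the generic cycles and instructions events rather than AMD-specific ones, skips error handling, and run_avx512_test() is a hypothetical stand-in for the test function.

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdint.h>
  #include <stdio.h>

  // Open a counting event on the calling process, any CPU, user mode only.
  static int perf_open(uint32_t type, uint64_t config) {
      struct perf_event_attr attr;
      memset(&attr, 0, sizeof(attr));
      attr.size = sizeof(attr);
      attr.type = type;
      attr.config = config;
      attr.exclude_kernel = 1;
      return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
  }

  int main(void) {
      int cyc = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
      int ins = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
      uint64_t c0, c1, i0, i1;
      read(cyc, &c0, sizeof(c0));
      read(ins, &i0, sizeof(i0));
      // run_avx512_test();  // hypothetical test function goes here
      read(cyc, &c1, sizeof(c1));
      read(ins, &i1, sizeof(i1));
      printf("IPC over the interval: %.2f\n",
             (double)(i1 - i0) / (double)(c1 - c0));
      return 0;
  }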

As you might expect, different cores within CCD0 vary in how much they do “IPC throttling”, for lack of a better term. Some cores cut IPC more than others when hit with sudden AVX-512 load. Curiously, a core that cuts IPC harder (core 0) reaches steady state a tiny bit faster than a core that cut IPC less to start with (core 3).

Load Another FP Pipe?

Now I wonder if Zen 5’s IPC throttling is triggered by the load/store unit, or overall load. Creating heavier FPU load should make that clear. Zen 5’s FPU has four pipes for math operations. FMA operations can go down two of those pipes. Dropping a vaddps into the mix will load a third pipe. On the fast CCD, that increases the transition period from ~22 to ~32 ms. Steady state clock speed decreases by about 150 MHz compared to the prior test.

Therefore, overall stress on the core (for lack of a better term) determines whether Zen 5 needs a transition period. It just so happens that 512-bit accesses to the memory subsystem are heavy. 512-bit register-to-register operations are no big deal, but adding more of them on top of data cache accesses increases stress on the core and causes more IPC throttling.

To be clear, IPC during the transition period with the 512-bit FP add thrown in is higher than before. But 2.75 IPC is 68.75% of the expected 4 IPC, while 2.5 IPC in the prior test is 83.3% of the expected 3 IPC.

Performance counter data suggests the core decreases actual clock speed at a similar rate on both tests. The heavier load causes a longer transition period because the core has to keep reducing clocks for longer before things stabilize.

Really zoomed in, especially on the y-axis

Even with this heavier load, a core on the 9900X’s fast CCD still clocks higher than one on the slower CCD. Testing the add + 2x FMA + FADD combination on the slower CCD did not show any transition period. That’s a hint Zen 5 only needs IPC throttling if clock speed is too high when a heavy AVX-512 sequence comes up.

Transition and Recovery Periods

Transitions and the associated IPC throttling could be problematic if software rapidly alternates between heavy AVX-512 sequences and light scalar integer code. But AMD does not let Zen 5 get stuck repeatedly taking transition penalties. If I switch between the AVX-512 and scalar integer test functions, the transition doesn’t repeat.

AMD does not immediately let Zen 5 return to maximum frequency after AVX-512 load lets up. From the start of the graph we can see Zen 5 can ramp clocks from idle to 5.7 GHz in just a few milliseconds, so this isn’t a limitation of how fast the core can pull off frequency changes. Remember how the slower CCD never suffered transitions? I suspect that’s because it never found itself clocked too high to begin with. Apparently the same also applies to the fast CCD. If it happens to be running at a lower clock when a heavy AVX-512 sequence shows up, it doesn’t need to transition.

Longer function switching intervals bring the transitions back, but now they’re softer. The first transition interval starts at 5.61 GHz and ends at 5.39 GHz. It takes nearly 21 ms.

The next transition starts when Zen 5 has only recovered to 5.55 GHz, and only lasts ~10 ms. IPC throttling is less severe too. IPC only drops to ~3.2, not ~2.7 like before.

With an even longer switching interval, I can finally trigger full-out transitions in repeated fashion. Thus transition periods and IPC throttling behavior vary depending on how much excessive clock speed the core has to shed.

The longer switch interval also shows that Zen 5 takes over 100 ms to fully regain maximum clocks after the heavy AVX-512 load disappears. Therefore, the simple scalar integer code with just dependent adds is running slower for some time. It’s worth noting that “slower” here is still 5.3 GHz, and very fast in an absolute sense. AMD’s Ryzen 9 9900X is not shedding 600 MHz like Intel’s old Core i9-10900X did with a lighter AVX-512 test sequence.

Zen 5’s clock ramp during the recovery period isn’t linear. It gains 200 MHz over the first 50 ms, but takes longer than 50 ms to recover the last 100 MHz. I think this behavior is deliberate and aimed at avoiding frequent transitions. If heavy AVX-512 code might show up again, keeping the core clock just a few percent lower is better than throttling IPC by over 30%. As clock speed goes higher and potential transitions become more expensive, the core becomes less brave and raises clocks more slowly.

The way IPC drops more while clock speed (from perf counters) remains constant makes it feel like voltage droop. No way to tell though

Quickly switching between scalar integer and heavy AVX-512 code basically spreads out the transition, so that the core eventually converges at a sustainable clock speed for the AVX-512 sequence in question. Scalar integer code between AVX-512 sequences continues to run at full speed. And the degree of IPC throttling is very fine grained. There is no fixed 1:4 rate as on Skylake-X.

Investigating IPC Throttling

Zen 5 also has performance counters that describe why the renamer could not send micro-ops downstream. One reason is that the floating point non-scheduling queue (FP NSQ) is full. The FP NSQ on Zen 5 basically acts as the entry point for the FPU. See the prior Zen 5 article for more details.

If the NSQ fills often, the frontend is supplying the FPU with micro-ops faster than the FPU can handle them. As mentioned earlier, Zen 5’s FPU should have no problems doing two 512-bit FMA operations together with a 512-bit FP add every cycle.

FP NSQ dispatch stall counts are still nonzero after the transition ends, so the test might have some margin of error from sub-optimal pipe assignment within the FPU. Life can’t be perfect right

But during the transition period, the FP NSQ fills up quite often. At its worst, it’s full for nearly 10% of (actual) cycles. Therefore Zen 5’s frontend and renamer are running at full speed. The IPC throttling is happening somewhere further down the pipeline. Likely, Zen 5’s FP scheduler is holding back and not issuing micro-ops every cycle even when they’re ready. AMD doesn’t have performance counters at the schedule/execute stage, so that theory is impossible to verify.

Final Words

Running 512-bit execution units at 5.7 GHz is no small feat, and it’s amazing Zen 5 can do that. The core’s FPU by itself is very efficient. But hit more 512-bit datapaths around the core, and you’ll eventually run into cases where the core can’t do what you’re asking of it at 5.7 GHz. Zen 5 handles such sudden, heavy AVX-512 load with a very fine grained IPC throttling mechanism. It likely uses feedback from per-core sensors, and has varying behavior even between cores on the same CCD. Load-related clock frequency changes are slow in both directions, likely to avoid repeated IPC throttling and preserve high performance for scalar integer code in close proximity to heavy AVX-512 sequences.

The tested chip

Transient IPC throttling raises deep philosophical questions about the meaning of clock frequency. If you maintain a 5.7 GHz clock signal and increment performance counters every cycle, but execute instructions as if you were running at 3.6 GHz, how fast are you really running? Certainly it’s 5.7 GHz from a hardware monitoring perspective. But from a software performance perspective, the core behaves more like it’s running at 3.6 GHz. Which perspective is correct? If a tree falls and no one’s around to hear it, did it make a sound? What if some parts of the core are running at full speed, but others aren’t? If a tree is split by lightning and only half of it falls, is it still standing?

Pads on the back of the 9900X. Pretty eh?

Stepping back though, Zen 5’s AVX-512 clock behavior is much better than Skylake-X’s. Zen 5 has no fixed frequency offsets for AVX-512, and can handle heavier AVX-512 sequences while losing less clock speed than Skylake-X. Transitions triggered by heavier AVX-512 sequences are interesting, but Zen 5’s clocking strategy seems built to minimize those transitions. Maybe corner cases exist where Zen 5’s IPC throttling can significantly impact performance. I suspect such corner cases are rare, because Zen 5 has been out for a while and I haven’t seen anyone complain. And even if it does show up, you can likely avoid the worst of it by running on a slower clocked CCD (or part). Still, it was interesting to trigger it with a synthetic test and see just how AMD deals with 512-bit datapaths at high speed.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Intel’s Battlemage Architecture

2025-02-11 15:22:52

Intel’s Alchemist architecture gave the company a foot in the door to the high performance graphics segment. The Arc A770 proved to be a competent first effort, able to run many games with credible performance. Now, Intel is passing the torch to a new graphics architecture, named Battlemage.

Like Alchemist, Battlemage targets the midrange segment. It doesn’t try to compete with AMD or Nvidia’s high end cards. While it’s not as flashy as Nvidia’s RTX 4090 or AMD’s RX 7900 XTX, midrange GPUs account for a much larger share of the discrete GPU market, thanks to their lower prices. Unfortunately, today’s midrange cards like the RTX 4060 and RX 7600 only come with 8 GB of VRAM, and are poor value. Intel takes advantage of this by launching the Arc B580 at $250, undercutting both competitors while offering 12 GB of VRAM.

For B580 to be successful, its new Battlemage architecture has to execute well across a variety of graphics workloads. Intel has made numerous improvements over Alchemist, aiming to achieve better performance with less compute power and less memory bandwidth. I’ll be looking at the Arc B580, with comparison data from the A770 and A750, as well as scattered data I have lying around.

System Architecture

Battlemage is organized much like its predecessor. Xe Cores continue to act as a basic building block. Four Xe Cores are grouped into a Render Slice, which also contains render backends, a rasterizer, and associated caches for those fixed function units. The entire GPU shares an 18 MB L2 cache.

Block diagram of Intel’s Arc B580. B570 disables two Xe Cores. Only FP32 units shown because I generated this diagram using Javascript and heavy abuse of the CSS box model

The Arc B580 overall is a smaller GPU than its outgoing Alchemist predecessors. B580 has five Render Slices to A770’s eight. In total, B580 has 2560 FP32 lanes to A770’s 4096.

Battlemage launches with a smaller memory subsystem too. The B580 has a 192-bit GDDR6 bus running at 19 GT/s, giving it 456 GB/s of theoretical bandwidth. A770 has 560 GB/s of GDDR6 bandwidth, thanks to a 256-bit bus running at 17.5 GT/s.

Block diagram of the A770. A750 disables four Xe Cores (a whole Render Slice)

Even the host interface has been cut down. B580 only has a PCIe 4.0 x8 link, while A770 gets a full size x16 one. Intel’s new architecture has a lot of heavy lifting to do if it wants to beat a much larger implementation of its predecessor.

Battlemage’s Xe Cores

Battlemage’s architectural changes start at its Xe Cores. The most substantial changes between the two generations actually debuted on Lunar Lake. Xe Cores are further split into XVEs, or Xe Vector engines. Intel merged pairs of Alchemist XVEs into ones that are twice as wide, completing a transition towards larger execution unit partitions. Xe Core throughput stays the same at 128 FP32 operations per cycle.

A shared instruction cache feeds all eight XVEs in a Xe Core. Alchemist had a 96 KB instruction cache, and Battlemage almost certainly has an instruction cache at least as large. Instructions on Intel GPUs are generally 16 bytes long, with an 8 byte compacted form in some cases. A 96 KB instruction cache therefore has a nominal capacity of 6-12K instructions.

Xe Vector Engines (XVEs)

XVEs form the smallest partition in Intel GPUs. Each XVE tracks up to eight threads, switching between them to hide latency and keep its execution units fed. A 64 KB register file stores thread state, giving each thread up to 8 KB of registers while maintaining maximum occupancy. Giving a register count for Intel GPUs doesn’t really work, because Intel GPU instructions can address the register file with far more flexibility than Nvidia or AMD architectures. Each instruction can specify a vector width, and access a register as small as a single scalar element.

For most math instructions, Battlemage sticks with 16-wide or 32-wide vectors, dropping the SIMD8 mode that could show up with Alchemist. Vector execution reduces instruction control overhead because a single operation gets applied across all lanes in the vector. However, that results in lost throughput if some lanes take a different branch direction. On paper, Battlemage’s longer native vector lengths would make it more prone to suffering such divergence penalties. But Alchemist awkwardly shared control logic between XVE pairs, making SIMD8 act like SIMD16, and SIMD16 act a lot like SIMD64 aside from a funny corner case (see the Meteor Lake article for more on that).

Battlemage’s divergence behavior by comparison is intuitive and straightforward. SIMD16 achieves full utilization if groups of 16 threads go the same way. The same applies for SIMD32 and groups of 32 coherent threads. Thus Battlemage is actually more agile than its predecessor when dealing with divergent branches, while enjoying the efficiency advantage of using larger vectors.

Maybe XMX is on a separate port. Maybe not. I’m not sure

Like Alchemist, Battlemage executes most math operations down two ports (ALU0, ALU1). ALU0 handles basic FP32 and FP16 operations, while ALU1 handles integer math and less common instructions. Intel’s port layout has parallels to Nvidia’s Turing, which also splits dispatch bandwidth between 16-wide FP32 and INT32 units. A key difference is that Turing uses fixed 32-wide vectors, and keeps both units occupied by feeding them on alternate cycles. Intel can issue instructions of the same type back-to-back, and can select multiple instructions to issue per cycle to different ports.

In another similarity to Turing, Battlemage carries forward Alchemist’s “XMX” matrix multiplication units. Intel claims 3-way co-issue, implying XMX is on a separate port. However, VTune only shows multiple pipe active metrics for ALU0+ALU1 and ALU0+XMX. I’ve drawn XMX as a separate port above, but the XMX units could be on ALU1.

Data collected from Intel’s VTune profiler, zoomed in to show what’s happening at the millisecond scale. VTune’s y-axis scaling is funny (relative to max observed utilization rather than 100%), so I’ve labeled some interesting points.

Gaming workloads tend to use more floating point operations. During compute heavy sections, ALU1 offloads other operations and keeps ALU0 free to deal with floating point math. XeSS exercises the XMX unit, with minimal co-issue alongside vector operations. A generative AI workload shows even less XMX+vector co-issue.

As expected for any specialized execution unit, XMX software support is far from guaranteed. Running AI image generation or language models using other frameworks heavily exercises B580’s regular vector units, while leaving the XMX units idle.

In microbenchmarks, Intel’s older A770 and A750 can often use their larger shader arrays to achieve higher compute throughput than B580. However, B580 behaves more consistently. Alchemist had trouble with FP32 FMA operations. Battlemage in contrast has no problem getting right up to its theoretical throughput. FP32+INT32 dual issue doesn’t happen perfectly on Battlemage, but it barely happened at all on A750.

On the integer side, Battlemage is better at dealing with lower precision INT8 operations. Using Meteor Lake’s iGPU as a proxy, Intel’s last generation architecture used mov and add instruction pairs to handle char16 adds, while Battlemage gets it done with just an add.

Each XVE also has a branch port for control flow instructions, and a “send” port that lets the XVE talk with the outside world. Load on these ports is typically low, because GPU programs don’t branch as often as CPU ones, and shared functions accessed through the “send” port won’t have enough throughput to handle all XVEs hitting it at the same time.

Memory Access

Battlemage’s memory subsystem has a lot in common with Alchemist’s, and traces its origins to Intel’s integrated graphics architectures over the past decade. XVEs access the memory hierarchy by sending a message to the appropriate shared functional unit. At one point, the entire iGPU was basically the equivalent of a Xe Core, with XVE equivalents acting as basic building blocks. XVEs would access the iGPU’s texture units, caches, and work distribution hardware over a messaging fabric. Intel has since built larger subdivisions, but the terminology remains.

Texture Path

Each Xe Core has eight TMUs, or texture samplers in Intel terminology. The samplers have a 32 KB texture cache, and can return 128 bytes/cycle to the XVEs. Battlemage is no different from Alchemist in this respect. But the B580 has less texture bandwidth on tap than its predecessor. Its higher clock speed isn’t enough to compensate for having far fewer Xe Cores.

B580 runs at higher clock speeds, which brings down texture cache hit latency too. In clock cycle terms though, Battlemage has nearly identical texture cache hit latency to its predecessor. L2 latency has improved significantly, so missing the texture cache isn’t as bad on Battlemage.

Data Access (Global Memory)

Global memory accesses are first cached in a 256 KB block, which serves double duty as Shared Local Memory (SLM). It’s larger than Alchemist and Lunar Lake’s 192 KB L1/SLM block, so Intel has found the transistor budget to keep more data closer to the execution units. Like Lunar Lake, B580 favors SLM over L1 capacity even when a compute kernel doesn’t allocate local memory.

Intel may be able to split the L1/SLM block in another way, but a latency test shows exactly the same result regardless of whether I allocate local memory. Testing with Nemes’s Vulkan test suite also shows 96 KB of L1.

Global memory access on Battlemage offers lower latency than texture accesses, even though the XVEs have to handle array address generation. With texture accesses, the TMUs do all the address calculations. All the XVEs do is send them a message. L1 data cache latency is similar to Alchemist in clock cycle terms, though again higher clock speeds give B580 an actual latency advantage.

Scalar Optimizations?

Battlemage gets a clock cycle latency reduction with scalar memory accesses too. Intel does not have separate scalar instructions like AMD. But Intel’s GPU ISA lets each instruction specify its SIMD width, and SIMD1 instructions are possible. Intel’s compiler has been carrying out scalar optimizations and opportunistically generating SIMD1 instructions since well before Battlemage, but there was no performance difference as far as I could tell. Now there is.

Forcing SIMD16 mode saves one cycle of latency over SIMD32, because address generation instructions don’t have to issue over two cycles

On B580, L1 latency for a SIMD1 (scalar) access is about 15 cycles faster than a SIMD16 access. SIMD32 accesses take one extra cycle when microbenchmarking, though that’s because the compiler generates two sets of SIMD16 instructions to calculate addresses across 32 lanes. I also got Intel’s compiler to emit scalar INT32 adds, but those didn’t see improved latency over vector ones. Therefore, the scalar latency improvements almost certainly come from an optimized memory pipeline.

Scalar load, with simple explanations

SIMD1 instructions also help within the XVEs. Intel doesn’t use a separate scalar register file, but can more flexibly address their vector register file than AMD or Nvidia. Instructions can access individual elements (sub-registers) and read out whatever vector width they want. Intel’s compiler could pack many “scalar registers” into the equivalent of a vector register, economizing register file capacity.
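To illustrate the kind of access that lends itself to SIMD1, here’s a hypothetical OpenCL kernel (my own example, not from Intel’s documentation). The load address depends only on the workgroup ID, so it’s uniform across lanes and a candidate for a single scalar load plus a broadcast. Whether a given compiler version actually emits SIMD1 code for it isn’t guaranteed.

  // The table index is the same for every lane in the workgroup, so the load
  // can be done once (SIMD1) and broadcast instead of issued per lane.
  __kernel void uniform_load(__global const int *table, __global int *out) {
      int gid = get_global_id(0);
      int grp = get_group_id(0);
      int uniform_value = table[grp]; // wave-uniform address
      out[gid] = uniform_value + gid; // per-lane work keeps the load live
  }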

L1 Bandwidth

I was able to get better efficiency out of B580’s L1 than A750’s using float4 loads from a small array. Intel suggests Xe-HPG’s L1 can deliver 512 bytes per cycle, but I wasn’t able to get anywhere close on either Alchemist or Battlemage. Microbenchmarking puts per-Xe Core bandwidth at a bit under 256 bytes per cycle on both architectures.
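The test looks roughly like the sketch below, reconstructed under my own assumptions about array size and access pattern rather than copied from the actual benchmark. Each work-item streams float4 loads from a buffer small enough to stay resident in L1, and the accumulated sum keeps the loads from being optimized away.

  // float4 L1 bandwidth sketch. len * 16 bytes should fit well within the
  // Xe Core's L1 data cache so the loads hit L1 rather than L2.
  __kernel void l1_bandwidth(__global const float4 *src, __global float4 *dst,
                             int len, int iters) {
      int gid = get_global_id(0);
      int idx = gid % len;
      float4 acc = (float4)(0.0f);
      for (int n = 0; n < iters; n++) {
          acc += src[idx];
          idx = (idx + 61) % len; // odd stride spreads accesses across the array
      }
      dst[gid] = acc; // consume the sum so the loads aren't eliminated
  }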

Even if the L1 can only provide 256 bytes per cycle, that still gives Intel’s Xe Core as much L1 bandwidth as an AMD RDNA WGP, and twice as much L1 bandwidth as an Nvidia Ampere SM. 512 bytes per cycle would let each XVE complete a SIMD16 load every cycle, which is kind of overkill anyway.

Local Memory (SLM)

Battlemage uses the same 256 KB block for L1 cache and SLM. SLM provides an address space local to a group of threads, and acts as a fast software managed scratchpad. In OpenCL, that’s exposed via the local memory type. Everyone likes to call it something different, but for this article I’ll use OpenCL and Intel’s term.

Even though both local memory and L1 cache hits are backed by the same physical storage, SLM accesses enjoy better latency. Unlike cache hits, SLM accesses don’t need tag checks or address translation. Accessing Battlemage’s 256 KB block of memory in SLM mode brings latency down to just over 15 ns. It’s faster than doing the same on Alchemist, and is very competitive against recent GPUs from AMD and Nvidia.
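A local memory latency test can look roughly like the following sketch. It’s my reconstruction under stated assumptions (a 16 KB chain, scrambled on the host so the indices form one cycle), not the test used here: one work-item chases a dependent pointer chain through SLM, so each load’s latency adds directly to the elapsed time.

  // SLM pointer chase. chain[] holds a scrambled permutation (built on the
  // host) so every load depends on the previous one. len must be <= 4096.
  __kernel void slm_latency(__global const int *chain, __global int *out,
                            int len, int iters) {
      __local int lchain[4096]; // 16 KB of the local memory block
      int lid = get_local_id(0);
      for (int i = lid; i < len; i += (int)get_local_size(0))
          lchain[i] = chain[i]; // stage the chain into local memory
      barrier(CLK_LOCAL_MEM_FENCE);
      if (lid == 0) { // a single lane does the dependent chase
          int p = 0;
          for (int n = 0; n < iters; n++)
              p = lchain[p]; // each load waits on the previous result
          out[get_group_id(0)] = p; // keep the chase from being optimized out
      }
  }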

Local memory/SLM also lets threads within a workgroup synchronize and exchange data. From testing with atomic_cmpxchg on local memory, B580 can bounce values between threads a bit faster than its predecessor. Nearly all of that improvement is down to higher clock speed, but it’s enough to bring B580 in line with AMD and Nvidia’s newer GPUs.

Backing structures for local memory often contain dedicated ALUs for handling atomic operations. For example, the LDS on AMD’s RDNA architecture is split into 32 banks, with one atomic ALU per bank. Intel almost certainly has something similar, and I’m testing that with atomic_add operations on local memory. Each thread targets a different address across an array, aiming to avoid contention.
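A sketch of that test follows; it’s my reconstruction rather than the exact kernel, and the 256-wide workgroup is an assumption. Each work-item repeatedly increments its own local memory slot, so no two lanes contend on the same address and throughput is limited by the SLM unit’s atomic ALUs.

  // Local memory atomic add throughput sketch. One private slot per work-item
  // avoids contention on any single address.
  __kernel __attribute__((reqd_work_group_size(256, 1, 1)))
  void slm_atomic_add(__global int *out, int iters) {
      __local int slots[256];
      int lid = get_local_id(0);
      slots[lid] = 0;
      barrier(CLK_LOCAL_MEM_FENCE);
      for (int i = 0; i < iters; i++)
          atomic_add(&slots[lid], 1); // hammer this lane's own slot
      barrier(CLK_LOCAL_MEM_FENCE);
      out[get_global_id(0)] = slots[lid]; // write back so the work isn't dropped
  }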

Alchemist and Battlemage both appear to have 32 atomic ALUs attached to each Xe Core’s SLM unit, much like AMD’s RDNA and Nvidia’s Pascal. Meteor Lake’s Xe-LPG architecture may have half as many atomic ALUs per Xe Core.

L2 Cache

Battlemage has a two level cache hierarchy like its predecessor and Nvidia’s current GPUs. B580’s 18 MB L2 is slightly larger than A770’s 16 MB L2. A770 divided its L2 into 32 banks, each capable of handling a 64 byte access every cycle. At 2.4 GHz, that’s good for nearly 5 TB/s of bandwidth.

Intel didn’t disclose B580’s L2 topology, but a reasonable assumption is that Intel increased bank size from 512 to 768 KB, keeping 4 L2 banks tied to each memory controller. If so, B580’s L2 would have 24 banks and 4.3 TB/s of theoretical bandwidth at 2.85 GHz. Microbenchmarking using Nemes’s Vulkan test gets a decent proportion of that bandwidth. Efficiency is much lower on the older A750, which gets approximately as much bandwidth as B580 despite probably having more theoretical L2 bandwidth on tap.

Besides insulating the execution units from slow VRAM, the L2 can act as a point of coherency across the GPU. B580 is pretty fast when bouncing data between threads using global memory, and is faster than its predecessor.

With atomic add operations on global memory, Battlemage does fine for a GPU of its size and massively outperforms its predecessor.

I’m using INT32 operations, so 86.74 GOPS on the A750 would correspond to 351 GB/s of L2 bandwidth. On the B580, 220.97 GOPS would require 883.9 GB/s. VTune however reports far higher L2 bandwidth on A750. Somehow, A750 sees 1.37 TB/s of L2 bandwidth during the test, or nearly 4x more than it should need.

VTune capture of the test running on A750

Meteor Lake’s iGPU is a close relative of Alchemist, but its ratio of global atomic add throughput to Xe Core count is similar to Battlemage’s. VTune reports Meteor Lake’s iGPU using more L2 bandwidth than required, but only by a factor of 2x. Curiously, it also shows the expected bandwidth coming off the XVEs. I wonder if something in Intel’s cross-GPU interconnect didn’t scale well with bigger GPUs.

With Battlemage, atomics are broken out into a separate category and aren’t reported as regular L2 bandwidth. VTune indicates atomics are passed through the load/store unit to L2 without any inflation. Furthermore, the L2 was only 79.6% busy, suggesting there’s a bit of headroom at that layer.

And the same test on B580

This could just be a performance monitoring improvement, but performance counters are typically closely tied to the underlying architecture. I suspect Intel made major changes to how they handle global memory atomics, letting performance scale better on larger GPUs. I’ve noticed that newer games sometimes use global atomic operations. Perhaps Intel noticed that too, and decided it was time to optimize them.

VRAM Access

B580 has a 192-bit GDDR6 VRAM subsystem, likely configured as six 2×16-bit memory controllers. Latency from OpenCL is higher than it was in the previous generation.

I suspect this only applies to OpenCL, because latency from Vulkan (with Nemes’s test) shows just over 300 ns of latency. Latency at large test sizes will likely run into TLB misses, and I suspect Intel is using different page sizes for different APIs.
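
For context, latency tests like this one usually rely on pointer chasing, where each load's address comes from the previous load's result. A minimal OpenCL C sketch of the idea might look like the following, with the host assumed to fill the array with a random cyclic permutation sized to the region under test:

```c
// Minimal pointer chasing latency sketch. Because every load depends on the
// previous one, the loop exposes full load-to-use latency for whatever level
// of the memory hierarchy the array footprint lands in.
__kernel void pointer_chase(__global const uint *chain, __global uint *result, uint iterations) {
    uint idx = 0;
    for (uint i = 0; i < iterations; i++)
        idx = chain[idx];
    *result = idx;                         // keep the chain live so it isn't optimized out
}
```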

Compared to its peers, the Arc B580 has more theoretical VRAM bandwidth at 456 GB/s, but also less L2 capacity. For example, Nvidia’s RTX 4060 has 272 GB/s VRAM bandwidth using a 128-bit GDDR6 bus running at 17 GT/s, with 24 MB of L2 in front of it. I profiled a few things with VTune and picked out spikes in VRAM bandwidth usage. I also checked reported L2 bandwidth over the same sampling interval.

Intel’s balance of cache capacity and memory bandwidth seems to work well, at least in the few examples I checked. Even when VRAM bandwidth demands are high, the 18 MB L2 is able to catch enough traffic to avoid pushing GDDR6 bandwidth limits. If Intel hypothetically used a smaller GDDR6 memory subsystem like Nvidia’s RTX 4060, B580 would need a larger cache to avoid reaching VRAM bandwidth limits.

PCIe Link

Probably as a cost cutting measure, B580 has a narrower PCIe link than its predecessor. Still, a x8 Gen 4 link provides as much theoretical bandwidth as a x16 Gen 3 one. Testing with OpenCL doesn’t get close to theoretical bandwidth, but B580 is at a disadvantage compared to A750.
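
Bandwidth testing over the PCIe link boils down to timing large host-to-device copies. A trimmed-down sketch of that idea using standard OpenCL host API calls follows; the context, queue, and buffer size are placeholders, and a real test would average several runs and ideally use pinned memory.

```c
#include <CL/cl.h>
#include <stdlib.h>
#include <time.h>

// Times one large blocking host-to-device copy to estimate PCIe link bandwidth.
// `ctx` and `queue` are assumed to have been created elsewhere against the GPU.
double measure_h2d_bandwidth(cl_context ctx, cl_command_queue queue, size_t bytes) {
    void *src = malloc(bytes);
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, src, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
    clReleaseMemObject(buf);
    free(src);
    return (double)bytes / seconds / 1e9;  // GB/s
}
```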

PCIe link bandwidth often has minimal impact on gaming performance, as long as you have enough VRAM. B580 has a comparatively large 12 GB VRAM pool compared to its immediate competitors, which also have PCIe 4.0 x8 links. That could give B580 an advantage within the midrange market, but that doesn’t mean it’s immune to problems.

DCS, for example, can use over 12 GB of VRAM with mods. Observing different aircraft in different areas often causes stutters on the B580. VTune shows high PCIe traffic as the GPU must frequently read from host memory.

Final Words

Battlemage retains Alchemist’s high level goals and foundation, but makes a laundry list of improvements. Compute is easier to utilize, cache latency improves, and weird scaling issues with global memory atomics have been resolved. Intel has made some surprising optimizations too, like reducing scalar memory access latency. The result is impressive, with Arc B580 easily outperforming the outgoing A770 despite lagging in nearly every on-paper specification.

Some of Intel’s GPU architecture changes nudge it a bit closer to AMD and Nvidia’s designs. Intel’s compiler often prefers SIMD32, a mode that AMD often chooses for compute code or vertex shaders, and one that Nvidia exclusively uses. SIMD1 optimizations create parallels to AMD’s scalar unit or Nvidia’s uniform datapath. Battlemage’s memory subsystem emphasizes caching more than its predecessor, while relying less on high VRAM bandwidth. AMD’s RDNA 2 and Nvidia’s Ada Lovelace made similar moves with their memory subsystems.

Of course Battlemage is still a very different animal from its discrete GPU competitors. Even with larger XVEs, Battlemage still uses smaller execution unit partitions than AMD or Nvidia. With SIMD16 support, Intel continues to support shorter vector widths than the competition. Generating SIMD1 instructions gives Intel some degree of scalar optimization, but stops short of having a full-out scalar/uniform datapath like AMD or post-Turing Nvidia. And 18 MB of cache is still less than the 24 or 32 MB in Nvidia and AMD’s midrange cards.

Differences from AMD and Nvidia aside, Battlemage is a worthy step on Intel’s journey to take on the midrange graphics market. A third competitor in the discrete GPU market is welcome news for any PC enthusiast. For sure, Intel still has some distance to go. Driver overhead and reliance on resizable BAR are examples of areas where Intel is still struggling to break from their iGPU-only background.

But I hope Intel goes after higher-end GPU segments once they’ve found firmer footing. A third player in the high end dGPU market would be very welcome, as many folks are still on Pascal or GCN because they feel there isn’t a reasonable upgrade yet. Intel’s Arc B580 addresses some of that pent-up demand, at least when it’s not out-of-stock. I look forward to seeing Intel’s future GPU efforts.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Alibaba/T-HEAD's Xuantie C910

2025-02-04 13:12:58

T-HEAD is a wholly owned subsidiary of Alibaba, one of China's largest tech companies. Over the past few years, T-HEAD has created a line of RISC-V cores. Alibaba seems to have two motivations for pushing RISC-V. On one hand, the company stands to benefit from creating cost effective chips optimized for areas it cares about, like IoT endpoints and edge computing. On the other, Alibaba almost certainly wants to reduce its dependence on foreign imports. RISC-V is an open instruction set, and isn't controlled by US or British corporations like x86-64 or ARM. T-HEAD's RISC-V push can thus be seen more broadly as a part of China's push to create viable domestic microchips.

Xuantie C910 slots into the "high performance" category within T-HEAD's lineup. Besides joining a small number of out-of-order RISC-V cores that have made it into hardware, C910 is an early adopter for RISC-V's vector extension. It supports RVV 0.7.1, which features masking and variable vector length support. T-HEAD has since released the C920 core, which brings RVV support up to version 1.0, but otherwise leaves C910 unchanged.

From Alibaba's paper, with descriptions added in red by Clam. PIU and PLIC appear in the dual core diagram below.

C910 targets "AI, Edge servers, Industrial control, [and] ADAS" as possible applications. It's also T-HEAD's first generation out-of-order design, so taking on all those applications is ambitious. C910 is implemented in clusters of up to four cores, each with a shared L2 cache. T-HEAD targets 2 to 2.5 GHz on TSMC's 12nm FinFET process, where a C910 core occupies 0.8 mm². Core voltage is 0.8V at 2 GHz, and 1.0V at 2.5 GHz. On TSMC's 7nm process, T-HEAD managed to push core frequency to 2.8 GHz. T-HEAD's paper further claims dynamic power is around 100 microwatts/MHz, which works out to 0.2W at 2 GHz. Of course, this figure doesn't include static power or power draw outside the core. Yet all of these characteristics together make clear C910 is a low power, low area design.

This article will examine C910 in the T-HEAD TH1520, using the LicheePi single board computer. TH1520 is fabricated on TSMC’s 12nm FinFET process, and has a quad-core C910 cluster with 1 MB of L2 running at 1.85 GHz. It’s connected to 8 GB of LPDDR4X-3733. C910 has been open-sourced, so I’ll be attempting to dig deeper into core details by reading some of the source code – but with some disclaimers. I’m a software engineer, not a hardware engineer. Also, some of the code is likely auto-generated from another undisclosed source, so reading that code has been a time consuming and painful experience. Expect some mistakes along the way.

Core Overview

The Xuantie C910 is a 3-wide, out-of-order core with a 12 stage pipeline.

Like Arm’s Cortex A73, C910 can release out-of-order resources early. For microbenchmarking, I used both a dependent branch and incomplete load to block retire, just as I did on A73.

Frontend: Instruction Fetch and Branch Prediction

C910’s frontend is tailored to handle both 16-bit and 32-bit RISC-V instructions, along with the requirements of RISC-V’s vector extension. The core has a 64 KB, 2-way set associative instruction cache with a FIFO replacement policy. Besides caching instruction data, C910 stores four bits of predecode data for each possible 16-bit instruction slot. Two bits tentatively indicate whether an instruction starts at that position, while the other two provide branch info. In total, C910 uses 83.7 KB of raw bit storage for instruction caching.

An L1i access reads instruction bytes, predecode data, and tags from both ways. Thus, the instruction fetch (IF) stage brings 256 bits of instruction bytes into temporary registers alongside 64 bits of predecode data. Tags for both ways are checked to determine which way has an L1i hit, if any. Simultaneously, the IF stage checks a 16 entry, fully associative L0 BTB, which lets the core handle a small number of taken branches with effectively single cycle latency.

Rough, simplified sketch of C910’s frontend

Instruction bytes and predecode data from both ways are passed to the next Instruction Pack (IP) stage. All of that is fed into a pair of 8-wide early decode blocks, called IP decoders in the source code. Each of the 16 early decode slots handles a possible instruction start position at a 16-bit boundary, across both ways. These early decoders do simple checks to categorize instructions. For vector instructions, the IP decoders also figure out VLEN (vector length), VSEW (selected element width), and VLMAX (number of elements).

Although the IP stage consumes 256 bits of instruction data and 64 bits of predecode data, and processes all of that with 16 early decode slots, half of that work is always discarded because the L1i can only hit in one way. Output from the 8-wide decode block that processed the correct way is passed to the next stage, while output from the other 8-wide decoder is discarded.

C910’s main branch predictor mechanisms also sit at the IP stage. Conditional branches are handled with a bi-mode predictor, with a 1024 entry selection table, two 16384 entry history tables containing 2-bit counters, and a 22-bit global history register. The selection table is indexed by hashing the low bits of the branch address and global history register, while the history tables are indexed by hashing the high bits of the history register. Output from the selection table is used to pick between the two history tables, labeled “taken” and “ntaken”. Returns are handled using a 12 entry return stack, while a 256 entry indirect target array handles indirect branches. In all, the branch predictor uses approximately 17.3 KB of storage. It’s therefore a small branch predictor by today’s standards, well suited to C910’s low power and low area design goals. For perspective, a high performance core like Qualcomm’s Oryon uses 80 KB for its conditional (direction) predictor alone, and another 40 KB for the indirect predictor.
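
To make the bi-mode scheme more concrete, here's a conceptual C sketch of a prediction lookup using the table sizes described above. The hash functions and the elided update policy are my own illustrative assumptions, not T-HEAD's RTL.

```c
#include <stdbool.h>
#include <stdint.h>

#define SEL_ENTRIES  1024                    // selection (choice) table
#define HIST_ENTRIES 16384                   // each history table

static uint8_t  sel_table[SEL_ENTRIES];      // 2-bit choice counters
static uint8_t  taken_tbl[HIST_ENTRIES];     // 2-bit counters, taken-biased
static uint8_t  ntaken_tbl[HIST_ENTRIES];    // 2-bit counters, not-taken-biased
static uint32_t ghr;                         // 22-bit global history register

static bool predict(uint64_t pc) {
    // Selection table: hash of low branch address bits and global history
    uint32_t sel_idx  = ((uint32_t)pc ^ ghr) % SEL_ENTRIES;
    // History tables: hash incorporating the upper bits of the history register
    uint32_t hist_idx = ((uint32_t)(pc >> 2) ^ (ghr >> 6)) % HIST_ENTRIES;

    bool use_taken_tbl = sel_table[sel_idx] >= 2;                    // MSB of the choice counter
    uint8_t ctr = use_taken_tbl ? taken_tbl[hist_idx] : ntaken_tbl[hist_idx];
    return ctr >= 2;                                                 // predict taken if MSB set
}

static void update(bool taken) {
    // Real update logic would train the counter that made the prediction, and
    // only touch the choice table when the two history tables disagree.
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & 0x3FFFFFu;              // shift outcome into 22-bit GHR
}
```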

Testing with random patterns of various lengths shows C910 can deal with moderately long patterns. It’s in line with what I’ve seen this test do with other low power cores. Both C910 and A73 struggle when there are a lot of branches in play, though they can maintain reasonably good accuracy for a few branches without excessively long history.

C910’s main BTB has 1024 entries and is 4-way set associative. Redirecting the pipeline from the IP stage creates a single pipeline bubble, or effectively 2 cycle taken branch latency. Branches that spill out of the 1024 entry BTB have 4 cycle latency, as long as code stays within the instruction cache.

The Instruction Pack stage feeds up to eight 16-bit instructions along with decoder output into the next Instruction Buffer (IB) stage. This stage’s job is to smooth out instruction delivery, covering any hiccups in frontend bandwidth as best as it can. To do this, the IB stage has a 32 entry instruction queue and a separate 16 entry loop buffer. Both have 16-bit entries, so 32-bit instructions will take two slots. C910’s loop buffer serves the same purpose as Pentium 4’s trace cache, seeking to fill in lost frontend slots after a taken branch. Of course, a 16 entry loop buffer can only do this for the smallest of loops.

To feed the subsequent decode stage, the IB stage can pick instructions from the loop buffer, instruction queue, or a bypass path to reduce latency if queuing isn’t needed. Each instruction and its associated early decode metadata are packed into a 73-bit format, and sent to the decode stage.

Frontend: Decode and Rename

The Instruction Decode (ID) stage contains C910’s primary decoders. Three 73-bit inputs from the IB stage are fed into these decoders, which parse out register info and split instructions into multiple micro-ops if necessary.

Only the first decode slot can handle instructions that decode into four or more micro-ops. All decode slots can emit 1-2 micro-ops for simpler instructions, though the decode stage in total can’t emit more than four micro-ops per cycle. Output micro-ops are packed into a 178-bit format, and passed directly to the rename stage. C910 does not have a micro-op queue between the decoders and renamers like many other cores. Rename width and decoder output width therefore have to match, explaining why the renamer is 4-wide and why the decoders are restricted to 4 micro-ops per cycle. Any instruction that decodes into four or more micro-ops will block parallel decode.

Notes on micro-op format

C910’s instruction rename (IR) stage then checks for matches between architectural registers to find inter-instruction dependencies. It assigns destinations free registers from the respective pool (integer or FP), or picks up newly deallocated registers coming off the retire stage. The IR stage does further decoding too. Instructions are further labeled with whether they’re a multi-cycle ALU operation, which ports they can go to, and so on. After renaming, micro-ops are 271 bits.
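
As a conceptual illustration of that renaming step, the sketch below maps architectural registers to physical ones through a map table and a free list. The structure sizes follow the register file figures discussed below, and this is a simplified model rather than C910's implementation.

```c
#include <stdint.h>

#define ARCH_REGS 32
#define PHYS_REGS 96                          // 32 architectural + 64 in-flight (integer side)

static uint8_t map_table[ARCH_REGS];          // architectural -> physical mapping
static uint8_t free_list[PHYS_REGS];
static int     free_count;

// Rename one micro-op: resolve source operands through the map table, then
// allocate a fresh physical register for the destination.
static void rename_op(uint8_t src1, uint8_t src2, uint8_t dst,
                      uint8_t *psrc1, uint8_t *psrc2, uint8_t *pdst) {
    *psrc1 = map_table[src1];                 // dependencies show up as matching physical registers
    *psrc2 = map_table[src2];
    *pdst  = free_list[--free_count];         // grab a free (or newly deallocated) physical register
    map_table[dst] = *pdst;                   // later readers of dst now reference the new register
}
```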

From software, C910’s frontend can sustain 3 instructions per cycle as long as code fits within the 64 KB instruction cache. L2 code bandwidth is low at under 1 IPC. SiFive’s P550 offers more consistent frontend bandwidth for larger code footprints, and can maintain 1 IPC even when running code from L3.

Out-of-Order Execution Engine

C910’s backend uses a physical register file (PRF) based out-of-order execution scheme, where both pending and known-good instruction results are stored in register files separate from the ROB. C910’s source code (ct_rtu_rob.v) defines 64 ROB entries, but T-HEAD’s paper says the ROB can hold up to 192 instructions, which suggests each ROB entry can track up to three instructions. Microbenchmarking generally agrees with the 192-instruction figure.

Therefore, C910 has reorder buffer capacity on par with Intel’s Haswell from 2013, theoretically letting it keep more instructions in flight than P550 or Goldmont Plus. However, other structures are not appropriately sized to make good use of that ROB capacity.

RISC-V has 32 integer and 32 floating point registers, so 32 entries in each register file generally have to be reserved for holding known-good results. That leaves only 64 integer and 32 floating point registers to hold results for in-flight instructions. Intel’s Haswell supports its 192 entry ROB with much larger register files on both the integer and floating point side.

Execution Units

C910 has eight execution ports. Two ports on the scalar integer side handle the most common ALU operations, while a third is dedicated to branches. C910’s integer register file has 10 read ports to feed five execution pipes, which includes three pipes for handling memory operations. A distributed scheduler setup feeds C910’s execution ports. Besides the opcode and register match info, each scheduler entry has a 7-bit age vector to enable age-based prioritization.

Scheduler capacity is low compared to Goldmont Plus and P550, with just 16 entries available for the most common ALU operations. P550 has 40 scheduler entries available across its three ALU ports, while Goldmont Plus has 30 entries.

C910’s FPU has a simple dual pipe design. Both ports can handle the most common floating point operations like adds, multiplies, and fused multiply-adds. Both pipes can handle 128-bit vector operations too. Feeding each port requires up to four inputs from the FP/vector register file. A fused multiply-add instruction (a*b+c) requires three inputs. A fourth input provides a mask register. Unlike AVX-512 and SVE, RISC-V doesn’t define separate mask registers, so all inputs have to come from the FP/vector register file. Therefore, C910’s FP register file has almost as many read ports as the integer one, despite feeding fewer ports.

Floating point execution latency is acceptable, and ranges from 3 to 5 cycles for the most common operations. Some recent cores like Arm’s Cortex X2, Intel’s Golden Cove, and AMD’s Zen 5 can do FP addition with 2 cycle latency. I don’t expect that from a low power core though.

Memory Subsystem

Two address generation units (AGUs) on C910 calculate effective addresses for memory accesses. One AGU handles loads, while the other handles stores. C910’s load/store unit is generally split into two pipelines, and aims to handle up to one load and one store per cycle. Like many other cores, store instructions are broken into a store address and a store data micro-op.

From Alibaba’s paper

39-bit virtual addresses are then translated into 40-bit physical addresses. C910’s L1 DTLB has 17 entries and is fully associative. A 1024 entry, 4-way L2 TLB handles L1 TLB misses for both data and instruction accesses, and adds 4 cycles of latency over an L1 hit. Physically, the L2 TLB has two banks, both 256×84 SRAM instances. The tag array is a 256×196 bit SRAM instance, and a 196-bit access includes tags for all four ways along with four “FIFO” bits, possibly used to implement a FIFO replacement policy. Besides necessary info like the virtual page number and a valid bit, each tag includes an address space ID and a global bit. These can exempt an entry from certain TLB flushes, reducing TLB thrashing on context switches. In total, the L2 TLB’s tags and data require 8.96 KB of bit storage.

Physical addresses are written into the load and store queues, depending on whether the address is a load or store. I’m not sure how big the load queue is. C910’s source code suggests there are 12 entries, and microbenchmarking results are confusing.

In the source code, each load queue entry stores 36 bits of the load’s physical address along with 16 bits to indicate which bits are valid, and a 7-bit instruction id to ensure proper ordering. A store queue entry stores the 40-bit physical address, pending store data, 16 byte valid bits, a 7-bit instruction id, and a ton of other fields. To give some examples:

  • wakeup_queue: 12 bits, possibly indicates which dependent loads should be woken up when data is ready

  • sdid: 4 bits, probably store data id

  • age_vec, age_vec_1: 12 bit age vectors, likely for tracking store order

To check for memory dependencies, the load/store unit compares bits 11:4 of the memory address. From software testing, C910 can do store forwarding for any load completely contained within the store, regardless of alignment within the store. However, forwarding fails if a load crosses a 16B-aligned boundary, or a store crosses an 8B-aligned boundary. Failed store forwarding results in a 20+ cycle penalty.

C910 handles unaligned accesses well, unlike P550. If a load doesn’t cross a 16B boundary or a store doesn’t cross an 8B boundary, it’s basically free. Crossing those alignment boundaries carries no performance penalty beyond the extra L1D access made under the hood. Overall, C910’s load/store unit and forwarding behavior is a bit short of the most recent cores from Intel and AMD. But it’s at about the same level as AMD’s Piledriver, a very advanced and high performance core in its own right. That’s a good place to be.

Data Cache

The 64 KB, 2-way data cache has 3 cycle latency, and is divided into 4 byte wide banks. It can handle up to one load and one store per cycle, though 128-bit stores take two cycles. L1D tags are split into two separate arrays, one for loads and one for stores.

Data cache misses are tracked by one of eight line-fill buffer entries, which store the miss address. Refill data is held in two 512-bit wide fill buffer registers. Like the instruction cache, the data cache uses a simple FIFO replacement policy.

L2 Cache and Interconnect

Each C910 core interfaces with the outside world via a “PIU”, or processor interface unit. At the other end, a C910 cluster has a Consistency Interface Unit (CIU) that accepts requests from up to four PIUs and maintains cache coherency. The CIU is split into two “snb” instances, each of which has a 24 entry request queue. SNB arbitrates between requests based on age, and has a 512-bit interface to the L2 cache.

C910’s L2 cache acts as both the first stop for L1 misses and as a cluster-wide shared last level cache. On the TH1520, it has 1 MB of capacity and is 16-way set associative with a FIFO replacement policy. To service multiple accesses per cycle, the L2 is built from two banks, selected by bit 6 of the physical address. The L2 is inclusive of upper level caches, and uses ECC protection to ensure data integrity.

L2 latency is 60 cycles, which is problematic for a core with limited reordering capacity and no mid-level cache. Even P550’s 4 MB L3 cache has better latency than C910’s L2, from both a cycle count and true latency standpoint. Intel’s Goldmont Plus also uses a shared L2 as a last level cache, and has about 28 cycles of L2 latency (counting a uTLB miss).

C910’s L2 bandwidth also fails to impress. A single core gets just above 10 GB/s, or 5.5 bytes per cycle. All four cores together can read from L2 at 12.6 GB/s, or just 1.7 bytes per cycle per core. Write bandwidth is better at 23.81 GB/s from all four cores, but that’s still less than 16 bytes per cycle in total, and writes are usually less common than reads.
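
For reference, the per-core figure is just aggregate bandwidth divided by core count and clock speed: 12.6 GB/s ÷ (4 cores × 1.85 GHz) ≈ 1.7 bytes per cycle per core.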

Again, C910’s L2 is outperformed by both P550’s L3 and Goldmont Plus’s L2. I suspect multi-threaded applications will easily push C910’s L2 bandwidth limits.

Off-cluster requests go through a 128-bit AXI4 bus. In the Lichee Pi 4A, the TH1520 has just under 30 GB/s of theoretical DRAM bandwidth from its 64-bit LPDDR4X-3733 interface. Achieved read bandwidth is much lower though, at around 4.2 GB/s, which multithreaded applications might find a bit tight, especially when there’s only 1 MB of last level cache shared across four cores.

DRAM latency is at least under control at 133.9 ns, tested using 2 MB pages and 1 GB array. It’s not on the level of a desktop CPU, but it’s better than Eswin and Intel’s low power implementations.

Core to Core Latency

Sometimes, the memory subsystem has to carry out a core to core transfer to maintain cache coherency. Sites like Anandtech have used a core to core latency test to probe this behavior, and I’ve written my own version. Results should be broadly comparable with those from Anandtech.
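
For anyone curious how such a test works, the C sketch below captures the core idea: two threads bounce a shared value back and forth with compare-exchange, and total elapsed time divided by the number of handoffs approximates core to core latency. Thread pinning and timing are omitted for brevity, and my actual test differs in details.

```c
#include <pthread.h>
#include <stdatomic.h>

#define ITERATIONS 1000000

static _Atomic int ball = 0;                  // bounces between the two threads

static void *player(void *arg) {
    int my_turn = (int)(long)arg;             // 0 or 1
    for (int i = 0; i < ITERATIONS; i++) {
        int expected = my_turn;
        // Spin until the other core hands the value over, then flip it back
        while (!atomic_compare_exchange_weak(&ball, &expected, 1 - my_turn))
            expected = my_turn;
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;                         // pin these to different cores in a real test
    pthread_create(&t0, NULL, player, (void *)0L);
    pthread_create(&t1, NULL, player, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```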

T-HEAD’s CIU can pass data between cores with reasonable speed. It’s much better than P550, which saw over 300 ns of latency within a quad core cluster.

Final Words

C910 is T-HEAD’s first out-of-order core. Right out of the gate, C910 is more polished than P550 in some respects. Core to core latency is better, unaligned accesses are properly handled, and there’s vector support. Like P550, C910 aims to scale across a broad range of low power applications. L2 capacity can be configured up to 8 MB, and multi-cluster setups allow scaling to high core counts. I feel like there’s ambition behind C910, since Alibaba wants to use in-house RISC-V cores instead of depending on external suppliers.

Alibaba has been promoting Xuantie core IP series to facilitate external customers for edge computing applications, such as AI, edge servers, industrial control and advanced driver assistance systems (ADAS)…by the end of 2022, a total volume of 15 million units is expected

Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension – T-Head Division, Alibaba Cloud

Yet I also feel the things C910 does well are overshadowed by executing poorly on the basics. The core’s out-of-order engine is poorly balanced, with inadequate capacity in critical structures like the schedulers and register files in relation to its ROB capacity. CPU performance is often limited by memory access performance, and C910’s cache subsystem is exceptionally weak. The cluster’s shared L2 is both slow and small, and the C910 cores have no mid-level cache to insulate L1 misses from that L2. DRAM bandwidth is also lackluster.

A TH1520 chip, seen at Hot Chips 2024 (not the one tested)

C910 is therefore caught in a position where it needs to keep a lot of instructions in flight to smooth out spikes in demand for memory bandwidth and mitigate high L2 latency, but can rarely do so in practice because its ROB capacity isn’t supported by other structures. C910’s unaligned access handling, vector support, and decent core-to-core latency are all nice to have. But tackling those edge cases is less important than building a well balanced core supported by a solid memory subsystem. Missing the subset of applications that use unaligned accesses or take advantage of vectorization is one thing. But messing up performance for everything else is another. And C910’s poor L2 and DRAM performance may even limit the usefulness of its vector capabilities, because vectorized applications tend to pull more memory bandwidth.

Hopefully T-HEAD will use experience gained from C910 to build better cores going forward. With Alibaba behind it, T-HEAD should have massive financial backing. I also hope to see more open source out-of-order cores going forward. Looking through C910 source code was very insightful. I appreciated being able to see how micro-op formats changed between pipeline stages, and how instruction decode is split across several stages that aren’t necessarily labeled “decode”.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

A RISC-V Progress Check: Benchmarking P550 and C910

2025-01-31 05:22:37

RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.

Out-of-order execution involves substantial complexity, which makes SiFive’s Performance P550 and T-HEAD’s Xuantie C910 interesting. Both feature out-of-order execution, though a quick look at headline specifications shows neither core can take on the best from AMD, Arm, Intel, or Qualcomm.

To check on RISC-V’s progress as its cores move toward higher performance targets, I’m comparing with Arm’s Cortex A73 and Intel’s Goldmont Plus. Both have comparably sized out-of-order execution engines.

SPEC CPU2017

SPEC is an industry standard benchmark distributed in source code form. It deliberately attempts to test both hardware and the compilers that target it. As before, I’m building SPEC CPU2017 with GCC 14.2.0. For P550, I used -march=rv64imafdc_zicsr_zifencei_zba_zbb -mtune=sifive-p400-series. For C910, I used -march=rv64imafdc_xtheadvector -mtune=generic-ooo. GCC doesn’t have optimization models for either RISC-V core, though I suspect that doesn’t matter much.

The two RISC-V cores fall short of Arm’s Cortex A73 and well short of Intel’s Goldmont Plus. Clock speed differences play a large role, and the EIC7700X is especially terrible in that respect. Eswin chose to clock its P550 cluster at just 1.4 GHz, even though the chip’s datasheet notes the CPU cores can run at “up to 1.8 GHz”. C910 does better at 1.85 GHz, though that’s still low in absolute terms. Unfortunately for T-HEAD, C910’s higher clock speed does not let it gain a performance lead against the P550. I’m still working on dissecting C910, but at first glance I’m not impressed with how T-HEAD balanced C910’s out-of-order execution engine and memory subsystem.

Cortex A55 and A53 provide perspective on where in-order execution sits today. Neither core can get anywhere close to high performance client designs, but C910 and P550 have relatively small out-of-order engines. They also run at low clock speeds. Mediatek’s Genio 1200 has a particularly strong A55 implementation, with higher clock speeds and better DRAM latency than C910 and P550. Its Cortex A55 cores are able to catch C910 and P550 without full out-of-order execution.

AMD expects to exceed Pentium performance at the same clock rate by about 30%

Microprocessor Report

This isn’t the first time an in-order core has done surprisingly well against out-of-order ones. Back in 1996, AMD’s K5 featured 4-wide out-of-order execution and better per-clock performance than Intel’s 2-wide, in-order Pentium. Intel clocked the Pentium more than 30% faster, and came out on top. Today’s situation with C910 and P550 against A55 has some parallels. A55 doesn’t win everywhere though. It loses to both RISC-V cores in SPEC CPU2017’s floating point suite. And a less capable in-order core like A53 can’t keep up despite running at higher clocks.

Across SPEC CPU2017’s integer workloads, C910 fails to win any test against the lower clocked EIC7700X. T-HEAD does better in the floating point suite, where it wins in a number of tests, but fails to take an overall performance lead. Meanwhile, A73 and Goldmont Plus do an excellent job of translating their higher clock speeds into a real advantage.

IPC data from hardware performance counters can show how well cores are utilizing their pipeline widths. IPC behavior tends to vary throughout a workload, but generally core width becomes more of a likely limitation as average IPC approaches core width. Conversely, low IPC workloads are less likely to care about core width, and might benefit from better branch prediction or lower memory access latency.

In SPEC CPU2017’s integer workloads, 548.exchange2 and 525.x264 are high IPC workloads. Arm’s 2-wide A73 is at a disadvantage in both. 3-wide cores like P550 and Goldmont Plus can stretch their legs, pushing up to and beyond 2 IPC. C910 is also 3-wide, but struggles to take advantage of its core width.

SPEC’s floating point suite has a few high IPC tests too, like 538.imagick and 508.namd. Low power cores don’t seem to do so well in these tests, unlike high performance cores like AMD’s Zen 5 or Intel’s Redwood Cove. Goldmont Plus gets destroyed in 538.imagick. But Intel’s low power core does well enough across other tests to let its high clock speed show through, and translate to a large overall lead. C910 again fails to impress. P550 somewhat makes up for its low clock speed with good IPC, though it’s really hard to compete from 1.4 GHz.

7-Zip File Compression

7-Zip is a file compression utility. It almost exclusively uses scalar integer instructions, so floating point and vector execution isn’t important in this workload. I’m compressing a 2.67 GB file using four cores, with 7-Zip set to use four threads.

C910 and P550 turn in similar performance. Both fall slightly behind the in-order Cortex A55, again showing how a well fed, higher clocked in-order core can still pack a punch. For perspective though, I’ve included A55 cores from two cell phone chips.

In Qualcomm’s Snapdragon 855 and 670, A55 suffers from much higher DRAM latency and runs at lower clocks. Both fall behind P550 and C910, showing how performance for the same core can vary wildly depending on the chip it’s implemented in.

Not sure I trust A55’s performance counters, because instruction counts are similar to A73 but it’s slower?

7-Zip is relatively challenging from an IPC perspective, with a lot of branches and cache misses. P550 gets reasonably good utilization out of its pipeline.

Calculate SHA256 Checksum

Hash functions are used to ensure data integrity. Most desktop CPUs have more than enough power to hash gigabytes upon gigabytes of data without taking too long. Low power CPUs are a different story. I also find this checksum calculation workload interesting because it often reaches very high IPC on CPUs that don’t have specific instructions to accelerate it. I’m simply using Linux’s sha256sum command on the same 2.67 GB file fed into 7-Zip above.

Cortex A55 takes a surprisingly large lead. sha256sum's instruction stream seems to mostly consist of math and bitwise instructions, with few memory accesses or branches. That creates an ideal environment for in-order cores. Impressively, A55 manages higher IPC than A73.

3-wide cores also have a field day. P550 and Goldmont Plus sustain well over 2 IPC. C910 doesn’t enjoy the field day so much, but still gets close to 2 IPC.

Both RISC-V cores execute more instructions to get the same work done. x86-64 represents this workload more efficiently, and aarch64 is able to finish using even fewer instructions.

Collecting performance monitoring data comes with some margin of error, because tools like perf must interrupt the targeted workload to read out performance counter values. Hardware performance counters also aren’t validated to the same strict standard as other parts of the core, because results only have to be good enough to properly inform software tuning decisions. Still, the gap between P550 and C910 is larger than I’d expect. P550 executes more instructions to finish the same work, and I’m not sure why.

x264 Encode

Software video encoding provides better compression efficiency than hardware video encoders, but is computationally expensive. SPEC CPU2017 represents software video encoding with the “525.x264” subtest, but in practice libx264 uses handwritten assembly kernels to take advantage of CPU vector extensions. Assembly of course can’t make it into SPEC – which needs to be fair to different ISAs and can’t use ISA specific code.

Unfortunately real life is not fair. Both CPU vector capabilities and software support for them can affect performance. x264 prints out CPU capabilities it can take advantage of:

C910 supports RVV 0.7.1, but libx264 does not have any assembly code written for any RISC-V ISA extension. Performance is a disaster for the RISC-V contenders, with A73 and Goldmont Plus landing on a different performance planet. Even A55 is very comfortably ahead.

Both RISC-V cores top the IPC chart, executing more instructions per cycle on average than the x86-64 or aarch64 cores I tested. P550 is especially impressive, pushing close to its 3 IPC limit. C910 doesn’t do as well, but 1.38 IPC is still respectable.

But IPC in isolation is misleading. Clock speed is an obvious factor. Instruction counts are another. In x264, the two RISC-V cores have to execute so many more instructions to get the same work done that IPC becomes meaningless.
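
The usual way to frame this is the iron law of CPU performance: execution time = instruction count ÷ (IPC × clock frequency). A core can post high IPC and still lose badly if it has to execute far more instructions or runs at a much lower clock.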

Building a strong ecosystem takes a long time. RISC-V will need software developers to take advantage of vector extensions. But before that happens, RISC-V hardware needs to show those developers that it’s powerful enough to be worth the effort.

Final Words

SiFive’s Performance P550 and T-HEAD’s Xuantie C910 are both notable for featuring out-of-order execution in the RISC-V scene. Both are plagued by low clock speeds, even against older aarch64 and x86-64 cores. Arm’s Cortex A73 and Intel’s Goldmont Plus are good demonstrations of how even small out-of-order execution engines can pull a large lead against in-order cores. P550 and C910 don’t always do that.

Between the two RISC-V cores, P550 has the better balanced out-of-order execution engine. It’s able to sustain higher IPC and often keep pace with C910, and in some easier workloads it gets very close to its core width limits. C910 in comparison is less well balanced, and often fails to translate its higher clock speed into a real performance lead. I wonder if P550 has a lot more potential waiting to be unlocked, if an implementer runs it at higher clock speeds and backs it up with low DRAM latency.

From a hardware perspective, RISC-V is some distance away from competing with Arm or x86-64. SiFive has announced higher performance RISC-V designs, so the RISC-V world is making progress on that front. Beyond hardware though, RISC-V has a long way to go from the software perspective. RISC-V is a young instruction set and some of its extensions are younger still. These extensions can be critical to performance in certain applications (like video encoding), and building up that software ecosystem will likely take years. While I’m excited by the idea of an instruction set free from patents and licensing restrictions, RISC-V is still in its infancy.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.