Intel’s Battlemage Architecture

2025-02-11 15:22:52

Intel’s Alchemist architecture gave the company a foot in the door to the high performance graphics segment. The Arc A770 proved to be a competent first effort, able to run many games with credible performance. Now, Intel is passing the torch to a new graphics architecture, named Battlemage.

Like Alchemist, Battlemage targets the midrange segment. It doesn’t try to compete with AMD or Nvidia’s high end cards. While it’s not as flashy as Nvidia’s RTX 4090 or AMD’s RX 7900 XTX, midrange GPUs account for a much larger share of the discrete GPU market, thanks to their lower prices. Unfortunately, today’s midrange cards like the RTX 4060 and RX 7600 only come with 8 GB of VRAM, and are poor value. Intel takes advantage of this by launching the Arc B580 at $250, undercutting both competitors while offering 12 GB of VRAM.

For B580 to be successful, its new Battlemage architecture has to execute well across a variety of graphics workloads. Intel has made numerous improvements over Alchemist, aiming to achieve better performance with less compute power and less memory bandwidth. I’ll be looking at the Arc B580, with comparison data from the A770 and A750, as well as scattered data I have lying around.

System Architecture

Battlemage is organized much like its predecessor. Xe Cores continue to act as a basic building block. Four Xe Cores are grouped into a Render Slice, which also contains render backends, a rasterizer, and associated caches for those fixed function units. The entire GPU shares an 18 MB L2 cache.

Block diagram of Intel’s Arc B580. B570 disables two Xe Cores. Only FP32 units shown because I generated this diagram using Javascript and heavy abuse of the CSS box model

The Arc B580 overall is a smaller GPU than its outgoing Alchemist predecessors. B580 has five Render Slices to A770’s eight. In total, B580 has 2560 FP32 lanes to A770’s 4096.

Battlemage launches with a smaller memory subsystem too. The B580 has a 192-bit GDDR6 bus running at 19 GT/s, giving it 456 GB/s of theoretical bandwidth. A770 has 560 GB/s of GDDR6 bandwidth, thanks to a 256-bit bus running at 17.5 GT/s.

Block diagram of the A770. A750 disables four Xe Cores (a whole Render Slice)

Even the host interface has been cut down. B580 only has a PCIe 4.0 x8 link, while A770 gets a full size x16 one. Intel’s new architecture has a lot of heavy lifting to do if it wants to beat a much larger implementation of its predecessor.

Battlemage’s Xe Cores

Battlemage’s architectural changes start at its Xe Cores. The most substantial changes between the two generations actually debuted on Lunar Lake. Xe Cores are further split into XVEs, or Xe Vector engines. Intel merged pairs of Alchemist XVEs into ones that are twice as wide, completing a transition towards larger execution unit partitions. Xe Core throughput stays the same at 128 FP32 operations per cycle.

A shared instruction cache feeds all eight XVEs in a Xe Core. Alchemist had a 96 KB instruction cache, and Battlemage almost certainly has an instruction cache at least as large. Instructions on Intel GPUs are generally 16 bytes long, with an 8 byte compacted form in some cases. A 96 KB instruction cache therefore has a nominal capacity of 6-12K instructions.

Xe Vector Engines (XVEs)

XVEs form the smallest partition in Intel GPUs. Each XVE tracks up to eight threads, switching between them to hide latency and keep its execution units fed. A 64 KB register file stores thread state, giving each thread up to 8 KB of registers while maintaining maximum occupancy. Giving a register count for Intel GPUs doesn’t really work, because Intel GPU instructions can address the register file with far more flexibility than Nvidia or AMD architectures. Each instruction can specify a vector width, and access a register as small as a single scalar element.

For most math instructions, Battlemage sticks with 16-wide or 32-wide vectors, dropping the SIMD8 mode that could show up with Alchemist. Vector execution reduces instruction control overhead because a single operation gets applied across all lanes in the vector. However, that results in lost throughput if some lanes take a different branch direction. On paper, Battlemage’s longer native vector lengths would make it more prone to suffering such divergence penalties. But Alchemist awkwardly shared control logic between XVE pairs, making SIMD8 act like SIMD16, and SIMD16 act a lot like SIMD64 aside from a funny corner case (see the Meteor Lake article for more on that).

Battlemage’s divergence behavior by comparison is intuitive and straightforward. SIMD16 achieves full utilization if groups of 16 threads go the same way. The same applies for SIMD32 and groups of 32 coherent threads. Thus Battlemage is actually more agile than its predecessor when dealing with divergent branches, while enjoying the efficiency advantage of using larger vectors.
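
As a quick illustration (the kernel and values are mine, not from any test suite), a branch that stays uniform within each 16-lane group costs nothing at SIMD16, while branching per-lane would force both paths to execute:

```c
// Illustrative only: the branch direction is the same for lanes 0-15, 32-47, ...,
// so a SIMD16 dispatch stays fully utilized. Branching on (lid & 1) instead
// would diverge every pair of lanes and make both sides of the if/else issue.
__kernel void divergence_demo(__global float *out) {
    uint lid = (uint)get_local_id(0);
    float v;
    if ((lid / 16) % 2 == 0)
        v = (float)lid * 2.0f;
    else
        v = (float)lid * 0.5f;
    out[get_global_id(0)] = v;
}
```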

Maybe XMX is on a separate port. Maybe not. I’m not sure

Like Alchemist, Battlemage executes most math operations down two ports (ALU0, ALU1). ALU0 handles basic FP32 and FP16 operations, while ALU1 handles integer math and less common instructions. Intel’s port layout has parallels to Nvidia’s Turing, which also splits dispatch bandwidth between 16-wide FP32 and INT32 units. A key difference is that Turing uses fixed 32-wide vectors, and keeps both units occupied by feeding them on alternate cycles. Intel can issue instructions of the same type back-to-back, and can select multiple instructions to issue per cycle to different ports.

In another similarity to Turing, Battlemage carries forward Alchemist’s “XMX” matrix multiplication units. Intel claims 3-way co-issue, implying XMX is on a separate port. However, VTune only shows multiple pipe active metrics for ALU0+ALU1 and ALU0+XMX. I’ve drawn XMX as a separate port above, but the XMX units could be on ALU1.

Data collected from Intel’s VTune profiler, zoomed in to show what’s happening at the millisecond scale. VTune’s y-axis scaling is funny (relative to max observed utilization rather than 100%), so I’ve labeled some interesting points.

Gaming workloads tend to use more floating point operations. During compute heavy sections, ALU1 offloads other operations and keeps ALU0 free to deal with floating point math. XeSS exercises the XMX unit, with minimal co-issue alongside vector operations. A generative AI workload shows even less XMX+vector co-issue.

As expected for any specialized execution unit, XMX software support is far from guaranteed. Running AI image generation or language models using other frameworks heavily exercises B580’s regular vector units, while leaving the XMX units idle.

In microbenchmarks, Intel’s older A770 and A750 can often use their larger shader arrays to achieve higher compute throughput than B580. However, B580 behaves more consistently. Alchemist had trouble with FP32 FMA operations. Battlemage in contrast has no problem getting right up to its theoretical throughput. FP32+INT32 dual issue doesn’t happen perfectly on Battlemage, but it barely happened at all on A750.
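
A dual-issue probe can be structured roughly like the kernel below (a sketch, not necessarily how my microbenchmark is written): independent FP32 and INT32 chains give the scheduler the chance to send work to ALU0 and ALU1 in the same cycle.

```c
// Hedged sketch: interleaved, independent FP32 and INT32 dependency chains.
// A real throughput test unrolls this with more independent accumulators per
// type so execution latency doesn't become the bottleneck.
__kernel void fp_int_coissue(__global float *fout, __global int *iout, uint iters) {
    float f0 = 1.0f, f1 = 2.0f;
    int   i0 = 3,    i1 = 5;
    for (uint i = 0; i < iters; i++) {
        f0 = f0 * 1.0001f + 0.5f;   // FP32 FMA work -> ALU0
        f1 = f1 * 0.9999f + 0.25f;  // second independent FP chain
        i0 = i0 * 3 + 1;            // INT32 work -> ALU1
        i1 = i1 * 5 + 7;            // second independent INT chain
    }
    fout[get_global_id(0)] = f0 + f1;
    iout[get_global_id(0)] = i0 + i1;
}
```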

On the integer side, Battlemage is better at dealing with lower precision INT8 operations. Using Meteor Lake’s iGPU as a proxy, Intel’s last generation architecture used mov and add instruction pairs to handle char16 adds, while Battlemage gets it done with just an add.
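
Something as simple as the following kernel (illustrative, not the exact test) exposes that difference, since the char16 adds compile to different instruction sequences on the two generations:

```c
// char16 adds: per the observation above, Battlemage handles each vector add
// with a single add instruction, while the prior generation used mov+add pairs.
__kernel void char16_add(__global const char16 *a, __global const char16 *b,
                         __global char16 *out) {
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
```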

Each XVE also has a branch port for control flow instructions, and a “send” port that lets the XVE talk with the outside world. Load on these ports is typically low, because GPU programs don’t branch as often as CPU ones, and shared functions accessed through the “send” port won’t have enough throughput to handle all XVEs hitting it at the same time.

Memory Access

Battlemage’s memory subsystem has a lot in common with Alchemist’s, and traces its origins to Intel’s integrated graphics architectures over the past decade. XVEs access the memory hierarchy by sending a message to the appropriate shared functional unit. At one point, the entire iGPU was basically the equivalent of a Xe Core, with XVE equivalents acting as basic building blocks. XVEs would access the iGPU’s texture units, caches, and work distribution hardware over a messaging fabric. Intel has since built larger subdivisions, but the terminology remains.

Texture Path

Each Xe Core has eight TMUs, or texture samplers in Intel terminology. The samplers have a 32 KB texture cache, and can return 128 bytes/cycle to the XVEs. Battlemage is no different from Alchemist in this respect. But the B580 has less texture bandwidth on tap than its predecessor. Its higher clock speed isn’t enough to compensate for having far fewer Xe Cores.

B580 runs at higher clock speeds, which brings down texture cache hit latency too. In clock cycle terms though, Battlemage has nearly identical texture cache hit latency to its predecessor. L2 latency has improved significantly, so missing the texture cache isn’t as bad on Battlemage.

Data Access (Global Memory)

Global memory accesses are first cached in a 256 KB block, which serves double duty as Shared Local Memory (SLM). It’s larger than Alchemist and Lunar Lake’s 192 KB L1/SLM block, so Intel has found the transistor budget to keep more data closer to the execution units. Like Lunar Lake, B580 favors SLM over L1 capacity even when a compute kernel doesn’t allocate local memory.

Intel may be able to split the L1/SLM block in another way, but a latency test shows exactly the same result regardless of whether I allocate local memory. Testing with Nemes’s Vulkan test suite also shows 96 KB of L1.

Global memory access on Battlemage offers lower latency than texture accesses, even though the XVEs have to handle array address generation. With texture accesses, the TMUs do all the address calculations. All the XVEs do is send them a message. L1 data cache latency is similar to Alchemist in clock cycle terms, though again higher clock speeds give B580 an actual latency advantage.
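
These latency figures come from pointer chasing; a stripped-down OpenCL version of that pattern looks something like this (simplified relative to the actual test):

```c
// Dependent loads: each iteration's address comes from the previous load, so
// average time per iteration approximates load-to-use latency at whatever
// level of the hierarchy the array fits into. Typically launched with a
// single work-item so only one chain runs.
__kernel void latency_chase(__global const uint *arr, __global uint *result, uint iters) {
    uint idx = 0;
    for (uint i = 0; i < iters; i++)
        idx = arr[idx];
    *result = idx;   // keep the chain live so the compiler can't remove it
}
```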

Scalar Optimizations?

Battlemage gets a clock cycle latency reduction too with scalar memory accesses. Intel does not have separate scalar instructions like AMD. But Intel’s GPU ISA lets each instruction specify its SIMD width, and SIMD1 instructions are possible. Intel’s compiler has been carrying out scalar optimizations and opportunistically generating SIMD1 instructions well before Battlemage, but there was no performance difference as far as I could tell. Now there is.

Forcing SIMD16 mode saves one cycle of latency over SIMD32, because address generation instructions don’t have to issue over two cycles

On B580, L1 latency for a SIMD1 (scalar) access is about 15 cycles faster than a SIMD16 access. SIMD32 accesses take one extra cycle when microbenchmarking, though that’s because the compiler generates two sets of SIMD16 instructions to calculate addresses across 32 lanes. I also got Intel’s compiler to emit scalar INT32 adds, but those didn’t see improved latency over vector ones. Therefore, the scalar latency improvements almost certainly come from an optimized memory pipeline.

Scalar load, with simple explanations

SIMD1 instructions also help within the XVEs. Intel doesn’t use a separate scalar register file, but can address its vector register file more flexibly than AMD or Nvidia can. Instructions can access individual elements (sub-registers) and read out whatever vector width they want. Intel’s compiler could pack many “scalar registers” into the equivalent of a vector register, economizing register file capacity.
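
For reference, one way to pin the compiler to a given SIMD width is Intel’s cl_intel_required_subgroup_size kernel attribute. I’m not claiming that’s exactly how the SIMD16/SIMD32 comparison above was produced, but it’s enough to reproduce the idea:

```c
// Same pointer-chase body as before, but compiled at a fixed SIMD16 width.
// Changing 16 to 32 forces SIMD32; leaving the attribute off lets the compiler
// choose, and a workgroup-uniform address like this one is a candidate for
// SIMD1 (scalar) code generation.
__attribute__((intel_reqd_sub_group_size(16)))
__kernel void latency_chase_simd16(__global const uint *arr, __global uint *result, uint iters) {
    uint idx = 0;
    for (uint i = 0; i < iters; i++)
        idx = arr[idx];
    *result = idx;
}
```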

L1 Bandwidth

I was able to get better efficiency out of B580’s L1 than A750’s using float4 loads from a small array. Intel suggests Xe-HPG’s L1 can deliver 512 bytes per cycle, but I wasn’t able to get anywhere close on either Alchemist or Battlemage. Microbenchmarking puts per-Xe Core bandwidth at a bit under 256 bytes per cycle on both architectures.
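
A bandwidth kernel along these lines (simplified, with placeholder parameters) streams float4 loads from an array small enough to stay resident in L1:

```c
// Each lane streams float4 loads from a small, L1-resident array and
// accumulates them so the loads can't be optimized away.
__kernel void l1_bandwidth(__global const float4 *data, __global float4 *out,
                           uint len, uint iters) {
    float4 acc = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    uint lid = (uint)get_local_id(0);
    uint stride = (uint)get_local_size(0);
    for (uint i = 0; i < iters; i++)
        acc += data[(lid + i * stride) % len];
    out[get_global_id(0)] = acc;
}
```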

Even if the L1 can only provide 256 bytes per cycle, that still gives Intel’s Xe Core as much L1 bandwidth as an AMD RDNA WGP, and twice as much L1 bandwidth as an Nvidia Ampere SM. 512 bytes per cycle would let each XVE complete a SIMD16 load every cycle, which is kind of overkill anyway.

Local Memory (SLM)

Battlemage uses the same 256 KB block for L1 cache and SLM. SLM provides an address space local to a group of threads, and acts as a fast software managed scratchpad. In OpenCL, that’s exposed via the local memory type. Everyone likes to call it something different, but for this article I’ll use OpenCL and Intel’s term.

Even though both local memory and L1 cache hits are backed by the same physical storage, SLM accesses enjoy better latency. Unlike cache hits, SLM accesses don’t need tag checks or address translation. Accessing Battlemage’s 256 KB block of memory in SLM mode brings latency down to just over 15 ns. It’s faster than doing the same on Alchemist, and is very competitive against recent GPUs from AMD and Nvidia.

Local memory/SLM also lets threads within a workgroup synchronize and exchange data. From testing with atomic_cmpxchg on local memory, B580 can bounce values between threads a bit faster than its predecessor. Nearly all of that improvement is down to higher clock speed, but it’s enough to bring B580 in line with AMD and Nvidia’s newer GPUs.

Backing structures for local memory often contain dedicated ALUs for handling atomic operations. For example, the LDS on AMD’s RDNA architecture is split into 32 banks, with one atomic ALU per bank. Intel almost certainly has something similar, and I’m testing that with atomic_add operations on local memory. Each thread targets a different address across an array, aiming to avoid contention.
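
The kernel for that test looks roughly like the following sketch (array size and iteration count are placeholders):

```c
// Per-lane atomic_add on local memory, with each lane targeting its own
// element to avoid contention. Throughput then reflects how many atomic ALUs
// back the SLM rather than serialization on a single address.
__kernel void slm_atomic_add(__global int *out, uint iters) {
    __local int buf[256];
    uint lid = (uint)get_local_id(0);
    buf[lid] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (uint i = 0; i < iters; i++)
        atomic_add(&buf[lid], 1);
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = buf[lid];
}
```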

Alchemist and Battlemage both appear to have 32 atomic ALUs attached to each Xe Core’s SLM unit, much like AMD’s RDNA and Nvidia’s Pascal. Meteor Lake’s Xe-LPG architecture may have half as many atomic ALUs per Xe Core.

L2 Cache

Battlemage has a two level cache hierarchy like its predecessor and Nvidia’s current GPUs. B580’s 18 MB L2 is slightly larger than A770’s 16 MB L2. A770 divided its L2 into 32 banks, each capable of handling a 64 byte access every cycle. At 2.4 GHz, that’s good for nearly 5 TB/s of bandwidth.

Intel didn’t disclose B580’s L2 topology, but a reasonable assumption is that Intel increased bank size from 512 to 768 KB, keeping 4 L2 banks tied to each memory controller. If so, B580’s L2 would have 24 banks and 4.3 TB/s of theoretical bandwidth at 2.85 GHz. Microbenchmarking using Nemes’s Vulkan test gets a decent proportion of that bandwidth. Efficiency is much lower on the older A750, which gets approximately as much bandwidth as B580 despite probably having more theoretical L2 bandwidth on tap.

Besides insulating the execution units from slow VRAM, the L2 can act as a point of coherency across the GPU. B580 is pretty fast when bouncing data between threads using global memory, and is faster than its predecessor.

With atomic add operations on global memory, Battlemage does fine for a GPU of its size and massively outperforms its predecessor.

I’m using INT32 operations, so 86.74 GOPS on the A750 would correspond to 351 GB/s of L2 bandwidth. On the B580, 220.97 GOPS would require 883.9 GB/s. VTune however reports far higher L2 bandwidth on A750. Somehow, A750 sees 1.37 TB/s of L2 bandwidth during the test, or nearly 4x more than it should need.

VTune capture of the test running on A750

Meteor Lake’s iGPU is a close relative of Alchemist, but its ratio of global atomic add throughput to Xe Core count is similar to Battlemage’s. VTune reports Meteor Lake’s iGPU using more L2 bandwidth than required, but only by a factor of 2x. Curiously, it also shows the expected bandwidth coming off the XVEs. I wonder if something in Intel’s cross-GPU interconnect didn’t scale well with bigger GPUs.

With Battlemage, atomics are broken out into a separate category and aren’t reported as regular L2 bandwidth. VTune indicates atomics are passed through the load/store unit to L2 without any inflation. Furthermore, the L2 was only 79.6% busy, suggesting there’s a bit of headroom at that layer.

And the same test on B580

This could just be a performance monitoring improvement, but performance counters are typically closely tied to the underlying architecture. I suspect Intel made major changes to how they handle global memory atomics, letting performance scale better on larger GPUs. I’ve noticed that newer games sometimes use global atomic operations. Perhaps Intel noticed that too, and decided it was time to optimize them.

VRAM Access

B580 has a 192-bit GDDR6 VRAM subsystem, likely configured as six 2×16-bit memory controllers. Latency from OpenCL is higher than it was in the previous generation.

I suspect this only applies to OpenCL, because latency from Vulkan (with Nemes’s test) shows just over 300 ns of latency. Latency at large test sizes will likely run into TLB misses, and I suspect Intel is using different page sizes for different APIs.

Compared to its peers, the Arc B580 has more theoretical VRAM bandwidth at 456 GB/s, but also less L2 capacity. For example, Nvidia’s RTX 4060 has 272 GB/s VRAM bandwidth using a 128-bit GDDR6 bus running at 17 GT/s, with 24 MB of L2 in front of it. I profiled a few things with VTune and picked out spikes in VRAM bandwidth usage. I also checked reported L2 bandwidth over the same sampling interval.

Intel’s balance of cache capacity and memory bandwidth seems to work well, at least in the few examples I checked. Even when VRAM bandwidth demands are high, the 18 MB L2 is able to catch enough traffic to avoid pushing GDDR6 bandwidth limits. If Intel hypothetically used a smaller GDDR6 memory subsystem like Nvidia’s RTX 4060, B580 would need a larger cache to avoid reaching VRAM bandwidth limits.

PCIe Link

Probably as a cost cutting measure, B580 has a narrower PCIe link than its predecessor. Still, a x8 Gen 4 link provides as much theoretical bandwidth as a x16 Gen 3 one. Testing with OpenCL doesn’t get close to theoretical bandwidth on either card, but B580 ends up at a disadvantage compared to A750.

PCIe link bandwidth often has minimal impact on gaming performance, as long as you have enough VRAM. B580 has a comparatively large 12 GB VRAM pool compared to its immediate competitors, which also have PCIe 4.0 x8 links. That could give B580 an advantage within the midrange market, but that doesn’t mean it’s immune to problems.

DCS, for example, will use over 12 GB of VRAM with mods. Observing different aircraft in different areas often causes stutters on the B580. VTune shows high PCIe traffic as the GPU must frequently read from host memory.

Final Words

Battlemage retains Alchemist’s high level goals and foundation, but makes a laundry list of improvements. Compute is easier to utilize, cache latency improves, and weird scaling issues with global memory atomics have been resolved. Intel has made some surprising optimizations too, like reducing scalar memory access latency. The result is impressive, with Arc B580 easily outperforming the outgoing A770 despite lagging in nearly every on-paper specification.

Some of Intel’s GPU architecture changes nudge it a bit closer to AMD and Nvidia’s designs. Intel’s compiler often prefers SIMD32, a mode that AMD often chooses for compute code or vertex shaders, and one that Nvidia exclusively uses. SIMD1 optimizations create parallels to AMD’s scalar unit or Nvidia’s uniform datapath. Battlemage’s memory subsystem emphasizes caching more than its predecessor, while relying less on high VRAM bandwidth. AMD’s RDNA 2 and Nvidia’s Ada Lovelace made similar moves with their memory subsystems.

Of course Battlemage is still a very different animal from its discrete GPU competitors. Even with larger XVEs, Battlemage still uses smaller execution unit partitions than AMD or Nvidia. With SIMD16 support, Intel continues to support shorter vector widths than the competition. Generating SIMD1 instructions gives Intel some degree of scalar optimization, but stops short of having a full-out scalar/uniform datapath like AMD or post-Turing Nvidia. And 18 MB of cache is still less than the 24 or 32 MB in Nvidia and AMD’s midrange cards.

Differences from AMD and Nvidia aside, Battlemage is a worthy step on Intel’s journey to take on the midrange graphics market. A third competitor in the discrete GPU market is welcome news for any PC enthusiast. For sure, Intel still has some distance to go. Driver overhead and reliance on resizable BAR are examples of areas where Intel is still struggling to break from their iGPU-only background.

But I hope Intel goes after higher-end GPU segments once they’ve found firmer footing. A third player in the high end dGPU market would be very welcome, as many folks are still on Pascal or GCN because they don’t feel there’s been a reasonable upgrade yet. Intel’s Arc B580 addresses some of that pent-up demand, at least when it’s not out-of-stock. I look forward to seeing Intel’s future GPU efforts.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Alibaba/T-HEAD's Xuantie C910

2025-02-04 13:12:58

T-HEAD is a wholly owned subsidiary of Alibaba, one of China's largest tech companies. Over the past few years, T-HEAD has created a line of RISC-V cores. Alibaba seems to have two motivations for pushing RISC-V. On one hand, the company stands to benefit from creating cost effective chips optimized for areas it cares about, like IoT endpoints and edge computing. On the other, Alibaba almost certainly wants to reduce its dependence on foreign imports. RISC-V is an open instruction set, and isn't controlled by US or British corporations like x86-64 or ARM. T-HEAD's RISC-V push can thus be seen more broadly as a part of China's push to create viable domestic microchips.

Xuantie C910 slots into the "high performance" category within T-HEAD's lineup. Besides joining a small number of out-of-order RISC-V cores that have made it into hardware, C910 is an early adopter for RISC-V's vector extension. It supports RVV 0.7.1, which features masking and variable vector length support. T-HEAD has since released the C920 core, which brings RVV support up to version 1.0, but otherwise leaves C910 unchanged.

From Alibaba's paper, with descriptions added in red by Clam. PIU and PLIC appear in the dual core diagram below.

C910 targets "AI, Edge servers, Industrial control, [and] ADAS" as possible applications. It's also T-HEAD's first generation out-of-order design, so taking on all those applications is ambitious. C910 is implemented in clusters of up to four cores, each with a shared L2 cache. T-HEAD targets 2 to 2.5 GHz on TSMC's 12nm FinFET process, where a C910 core occupies 0.8 mm2. Core voltage is 0.8V at 2 GHz, and 1.0V at 2.5 GHz. On TSMC's 7nm process, T-HEAD managed to push core frequency to 2.8 GHz. T-HEAD's paper further claims dynamic power is around 100 microwatts/MHz, which works out to 0.2W at 2 GHz. Of course, this figure doesn't include static power or power draw outside the core. Yet all of these characteristics together make clear C910 is a low power, low area design.

This article will examine C910 in the T-HEAD TH1520, using the LicheePi single board computer. TH1520 is fabricated on TSMC’s 12nm FinFET process, and has a quad-core C910 cluster with 1 MB of L2 running at 1.85 GHz. It’s connected to 8 GB of LPDDR4X-3733. C910 has been open-sourced, so I’ll be attempting to dig deeper into core details by reading some of the source code – but with some disclaimers. I’m a software engineer, not a hardware engineer. Also, some of the code is likely auto-generated from another undisclosed source, so reading that code has been a time consuming and painful experience. Expect some mistakes along the way.

Core Overview

The Xuantie C910 is a 3-wide, out-of-order core with a 12 stage pipeline.

Like Arm’s Cortex A73, C910 can release out-of-order resources early. For microbenchmarking, I used both a dependent branch and incomplete load to block retire, just as I did on A73.

Frontend: Instruction Fetch and Branch Prediction

C910’s frontend is tailored to handle both 16-bit and 32-bit RISC-V instructions, along with the requirements of RISC-V’s vector extension. The core has a 64 KB, 2-way set associative instruction cache with a FIFO replacement policy. Besides caching instruction data, C910 stores four bits of predecode data for each possible 16-bit instruction slot. Two bits tentatively indicate whether an instruction starts at that position, while the other two provide branch info. In total, C910 uses 83.7 KB of raw bit storage for instruction caching.

An L1i access reads instruction bytes, predecode data, and tags from both ways. Thus, the instruction fetch (IF) stage brings 256 bits of instruction bytes into temporary registers alongside 64 bits of predecode data. Tags for both ways are checked to determine which way has an L1i hit, if any. Simultaneously, the IF stage checks a 16 entry, fully associative L0 BTB, which lets the core handle a small number of taken branches with effectively single cycle latency.

Rough, simplified sketch of C910’s frontend

Instruction bytes and predecode data from both ways are passed to the next Instruction Pack (IP) stage. All of that is fed into a pair of 8-wide early decode blocks, called IP decoders in the source code. Each of the 16 early decode slots handles a possible instruction start position at a 16-bit boundary, across both ways. These early decoders do simple checks to categorize instructions. For vector instructions, the IP decoders also figure out VLEN (vector length), VSEW (selected element width), and VLMAX (number of elements).

Although the IP stage consumes 256 bits of instruction data and 64 bits of predecode data, and processes all of that with 16 early decode slots, half of the work is always discarded because the L1i can only hit in one way. Output from the 8-wide decode block that processed the correct way is passed to the next stage, while output from the other 8-wide decoder is discarded.

C910’s main branch predictor mechanisms also sit at the IP stage. Conditional branches are handled with a bi-mode predictor, with a 1024 entry selection table, two 16384 entry history tables containing 2-bit counters, and a 22-bit global history register. The selection table is indexed by hashing the low bits of the branch address and global history register, while the history tables are indexed by hashing the high bits of the history register. Output from the selection table is used to pick between the two history tables, labeled “taken” and “ntaken”. Returns are handled using a 12 entry return stack, while a 256 entry indirect target array handles indirect branches. In all, the branch predictor uses approximately 17.3 KB of storage. It’s therefore a small branch predictor by today’s standards, well suited to C910’s low power and low area design goals. For perspective, a high performance core like Qualcomm’s Oryon uses 80 KB for its conditional (direction) predictor alone, and another 40 KB for the indirect predictor.
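
As a rough illustration of how a bi-mode lookup fits together (the index hashes below are placeholders, not C910’s actual functions), the selection counter picks which of the two history tables supplies the final 2-bit counter:

```c
#include <stdint.h>

// Illustrative bi-mode predictor using the table sizes described above.
// Real implementations pack 2-bit counters; byte arrays keep the sketch simple.
typedef struct {
    uint8_t  select[1024];   // 2-bit selection counters
    uint8_t  taken[16384];   // 2-bit counters, "taken" table
    uint8_t  ntaken[16384];  // 2-bit counters, "ntaken" table
    uint32_t ghr;            // 22-bit global history register
} bimode;

static int bimode_predict(const bimode *p, uint32_t pc) {
    uint32_t sel_idx  = (pc ^ p->ghr) & 0x3FF;           // low PC bits hashed with history
    uint32_t hist_idx = (pc ^ (p->ghr >> 6)) & 0x3FFF;   // higher history bits in the hash
    uint8_t  ctr = (p->select[sel_idx] >= 2) ? p->taken[hist_idx]
                                             : p->ntaken[hist_idx];
    return ctr >= 2;   // 1 = predict taken
}
```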

Testing with random patterns of various lengths shows C910 can deal with moderately long patterns. It’s in line with what I’ve seen this test do with other low power cores. Both C910 and A73 struggle when there are a lot of branches in play, though they can maintain reasonably good accuracy for a few branches without excessively long history.

C910’s main BTB has 1024 entries and is 4-way set associative. Redirecting the pipeline from the IP stage creates a single pipeline bubble, or effectively 2 cycle taken branch latency. Branches that spill out of the 1024 entry BTB have 4 cycle latency, as long as code stays within the instruction cache.

The Instruction Pack stage feeds up to eight 16-bit instructions along with decoder output into the next Instruction Buffer (IB) stage. This stage’s job is to smooth out instruction delivery, covering any hiccups in frontend bandwidth as best as it can. To do this, the IB stage has a 32 entry instruction queue and a separate 16 entry loop buffer. Both have 16-bit entries, so 32-bit instructions will take two slots. C910’s loop buffer serves the same purpose as Pentium 4’s trace cache, seeking to fill in lost frontend slots after a taken branch. Of course, a 16 entry loop buffer can only do this for the smallest of loops.

To feed the subsequent decode stage, the IB stage can pick instructions from the loop buffer, instruction queue, or a bypass path to reduce latency if queuing isn’t needed. Each instruction and its associated early decode metadata are packed into a 73-bit format, and sent to the decode stage.

Frontend: Decode and Rename

The Instruction Decode (ID) stage contains C910’s primary decoders. Three 73-bit inputs from the IB stage are fed into these decoders, which parse out register info and split instructions into multiple micro-ops if necessary.

Only the first decode slot can handle instructions that decode into four or more micro-ops. All decode slots can emit 1-2 micro-ops for simpler instructions, though the decode stage in total can’t emit more than four micro-ops per cycle. Output micro-ops are packed into a 178-bit format, and passed directly to the rename stage. C910 does not have a micro-op queue between the decoders and renamers like many other cores. Rename width and decoder output width therefore have to match, explaining why the renamer is 4-wide and why the decoders are restricted to 4 micro-ops per cycle. Any instruction that decodes into four or more micro-ops will block parallel decode.

Notes on micro-op format

C910’s instruction rename (IR) stage then checks for matches between architectural registers to find inter-instruction dependencies. It then assigns free registers from the respective pool (integer or FP), or picks up newly deallocated registers coming off the retire stage. The IR stage does further decoding too. Instructions are further labeled with whether they’re a multi-cycle ALU operation, which ports they can go to, and so on. After renaming, micro-ops are 271 bits.

From software, C910’s frontend can sustain 3 instructions per cycle as long as code fits within the 64 KB instruction cache. L2 code bandwidth is low at under 1 IPC. SiFive’s P550 offers more consistent frontend bandwidth for larger code footprints, and can maintain 1 IPC even when running code from L3.

Out-of-Order Execution Engine

C910’s backend uses a physical register file (PRF) based out-of-order execution scheme, where both pending and known-good instruction results are stored in register files separate from the ROB. C910’s source code (ct_rtu_rob.v) defines 64 ROB entries, but T-HEAD’s paper says the ROB can hold up to 192 instructions. Microbenchmarking generally agrees, likely because each ROB entry can hold up to three instructions, which would reconcile the two figures (64 × 3 = 192).

Therefore, C910 has reorder buffer capacity on par with Intel’s Haswell from 2013, theoretically letting it keep more instructions in flight than P550 or Goldmont Plus. However, other structures are not appropriately sized to make good use of that ROB capacity.

RISC-V has 32 integer and 32 floating point registers, so 32 entries in each register file generally have to be reserved for holding known-good results. That leaves only 64 integer and 32 floating point registers to hold results for in-flight instructions. Intel’s Haswell supports its 192 entry ROB with much larger register files on both the integer and floating point side.

Execution Units

C910 has eight execution ports. Two ports on the scalar integer side handle the most common ALU operations, while a third is dedicated to branches. C910’s integer register file has 10 read ports to feed five execution pipes, which includes three pipes for handling memory operations. A distributed scheduler setup feeds C910’s execution ports. Besides the opcode and register match info, each scheduler entry has a 7-bit age vector to enable age-based prioritization.

Scheduler capacity is low compared to Goldmont Plus and P550, with just 16 entries available for the most common ALU operations. P550 has 40 scheduler entries available across its three ALU ports, while Goldmont Plus has 30 entries.

C910’s FPU has a simple dual pipe design. Both ports can handle the most common floating point operations like adds, multiplies, and fused multiply-adds. Both pipes can handle 128-bit vector operations too. Feeding each port requires up to four inputs from the FP/vector register file. A fused multiply-add instruction (a*b+c) requires three inputs. A fourth input provides a mask register. Unlike AVX-512 and SVE, RISC-V doesn’t define separate mask registers, so all inputs have to come from the FP/vector register file. Therefore, C910’s FP register file has almost as many read ports as the integer one, despite feeding fewer ports.

Floating point execution latency is acceptable, and ranges from 3 to 5 cycles for the most common operations. Some recent cores like Arm’s Cortex X2, Intel’s Golden Cove, and AMD’s Zen 5 can do FP addition with 2 cycle latency. I don’t expect that from a low power core though.

Memory Subsystem

Two address generation units (AGUs) on C910 calculate effective addresses for memory accesses. One AGU handles loads, while the other handles stores. C910’s load/store unit is generally split into two pipelines, and aims to handle up to one load and one store per cycle. Like many other cores, store instructions are broken into a store address and a store data micro-op.

From Alibaba’s paper

39-bit virtual addresses are then translated into 40-bit physical addresses. C910’s L1 DTLB has 17 entries and is fully associative. A 1024 entry, 4-way L2 TLB handles L1 TLB misses for both data and instruction accesses, and adds 4 cycles of latency over an L1 TLB hit. Physically, the L2 TLB has two banks, both 256×84 SRAM instances. The tag array is a 256×196 bit SRAM instance, and a 196-bit access includes tags for all four ways along with four “FIFO” bits, possibly used to implement a FIFO replacement policy. Besides necessary info like the virtual page number and a valid bit, each tag includes an address space ID and a global bit. These can exempt an entry from certain TLB flushes, reducing TLB thrashing on context switches. In total, the L2 TLB’s tags and data require 8.96 KB of bit storage.

Physical addresses are written into the load and store queues, depending on whether the address is a load or store. I’m not sure how big the load queue is. C910’s source code suggests there are 12 entries, and microbenchmarking results are confusing.

In the source code, each load queue entry stores 36 bits of the load’s physical address along with 16 bits to indicate which bytes are valid, and a 7-bit instruction id to ensure proper ordering. A store queue entry stores the 40-bit physical address, pending store data, 16 byte-valid bits, a 7-bit instruction id, and a ton of other fields. To give some examples (a rough struct-style sketch follows the list):

  • wakeup_queue: 12 bits, possibly indicates which dependent loads should be woken up when data is ready

  • sdid: 4 bits, probably store data id

  • age_vec, age_vec_1: 12 bit age vectors, likely for tracking store order
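
Put together, the load and store queue entries look roughly like the C structs below. Field widths follow my reading of the source code; the packing is purely illustrative, not the actual RTL layout.

```c
#include <stdint.h>

// Rough, illustrative view of C910's load/store queue entries as described above.
struct c910_load_queue_entry {
    uint64_t paddr;        // 36 bits of the load's physical address
    uint16_t byte_valid;   // 16 bits marking which bytes are valid
    uint8_t  iid;          // 7-bit instruction id for ordering
};

struct c910_store_queue_entry {
    uint64_t paddr;         // full 40-bit physical address
    uint8_t  data[16];      // pending store data
    uint16_t byte_valid;    // 16 per-byte valid bits
    uint8_t  iid;           // 7-bit instruction id
    uint16_t wakeup_queue;  // 12 bits: dependent loads to wake when data is ready
    uint8_t  sdid;          // 4 bits: store data id
    uint16_t age_vec;       // 12-bit age vector for store ordering
    uint16_t age_vec_1;     // second 12-bit age vector
};
```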

To check for memory dependencies, the load/store unit compares bits 11:4 of the memory address. From software testing, C910 can do store forwarding for any load completely contained within the store, regardless of alignment within the store. However, forwarding fails if a load crosses a 16B aligned boundary, or a store crosses an 8B aligned boundary. Failed store forwarding results in a 20+ cycle penalty.

C910 handles unaligned accesses well, unlike P550. If a load doesn’t cross a 16B boundary or a store doesn’t cross an 8B boundary, it’s basically free. If you do cross those alignment boundaries, you don’t face a performance penalty beyond making an extra L1D access under the hood. Overall, C910’s load/store unit and forwarding behavior is a bit short of the most recent cores from Intel and AMD. But it’s at about the same level as AMD’s Piledriver, a very advanced and high performance core in its own right. That’s a good place to be.

Data Cache

The 64 KB, 2-way data cache has 3 cycle latency, and is divided into 4 byte wide banks. It can handle up to one load and one store per cycle, though 128-bit stores take two cycles. L1D tags are split into two separate arrays, one for loads and one for stores.

Data cache misses are tracked by one of eight line-fill buffer entries, which store the miss address. Refill data is held in two 512-bit wide fill buffer registers. Like the instruction cache, the data cache uses a simple FIFO replacement policy.

L2 Cache and Interconnect

Each C910 core interfaces with the outside world via a “PIU”, or processor interface unit. At the other end, a C910 cluster has a Consistency Interface Unit (CIU) that accepts requests from up to four PIUs and maintains cache coherency. The CIU is split into two “snb” instances, each of which has a 24 entry request queue. SNB arbitrates between requests based on age, and has a 512-bit interface to the L2 cache.

C910’s L2 cache acts as both the first stop for L1 misses and as a cluster-wide shared last level cache. On the TH1520, it has 1 MB of capacity and is 16-way set associative with a FIFO replacement policy. To service multiple accesses per cycle, the L2 is built from two banks, selected by bit 6 of the physical address. The L2 is inclusive of upper level caches, and uses ECC protection to ensure data integrity.

L2 latency is 60 cycles, which is problematic for a core with limited reordering capacity and no mid-level cache. Even P550’s 4 MB L3 cache has better latency than C910’s L2, from both a cycle count and true latency standpoint. Intel’s Goldmont Plus also uses a shared L2 as a last level cache, and has about 28 cycles of L2 latency (counting a uTLB miss).

C910’s L2 bandwidth also fails to impress. A single core gets just above 10 GB/s, or 5.5 bytes per cycle. All four cores together can read from L2 at 12.6 GB/s, or just 1.7 bytes per cycle per core. Write bandwidth is better at 23.81 GB/s from all four cores, but that’s still less than 16 bytes per cycle in total, and writes are usually less common than reads.

Again, C910’s L2 is outperformed by both P550’s L3 and Goldmont Plus’s L2. I suspect multi-threaded applications will easily push C910’s L2 bandwidth limits.

Off-cluster requests go through a 128-bit AXI4 bus. In the Lichee Pi 4A, the TH1520 has just under 30 GB/s of theoretical DRAM bandwidth from its 64-bit LPDDR4X-3733 interface. Achieved read bandwidth is much lower, at just 4.2 GB/s. Multithreaded applications might find that a bit tight, especially when there’s only 1 MB of last level cache shared across four cores.

DRAM latency is at least under control at 133.9 ns, tested using 2 MB pages and a 1 GB array. It’s not on the level of a desktop CPU, but it’s better than Eswin and Intel’s low power implementations.
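
The overall structure of such a latency test is a serial walk through a randomly permuted array, along the lines of the sketch below (hugepage mapping and other details omitted):

```c
// Minimal DRAM latency sketch: walk a randomly permuted 1 GB array so every
// load depends on the previous one. The real test also maps the array with
// 2 MB hugepages (e.g. mmap + MAP_HUGETLB) to keep TLB misses out of the way;
// plain malloc is used here to keep the sketch short.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t count = (1ull << 30) / sizeof(uint64_t);   // 1 GB of 64-bit indices
    uint64_t *arr = malloc(count * sizeof(uint64_t));
    if (!arr) return 1;
    for (size_t i = 0; i < count; i++) arr[i] = i;
    // Sattolo's algorithm: build one big random cycle covering every element.
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = rand() % i;                        // crude, but fine for a sketch
        uint64_t tmp = arr[i]; arr[i] = arr[j]; arr[j] = tmp;
    }
    size_t iters = 20000000, idx = 0;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (size_t i = 0; i < iters; i++) idx = arr[idx];  // dependent loads
    clock_gettime(CLOCK_MONOTONIC, &end);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg load latency: %.1f ns (idx=%zu)\n", ns / iters, idx);
    free(arr);
    return 0;
}
```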

Core to Core Latency

Sometimes, the memory subsystem has to carry out a core to core transfer to maintain cache coherency. Sites like Anandtech have used a core to core latency test to probe this behavior, and I’ve written my own version. Results should be broadly comparable with those from Anandtech.
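
The core of my test is a pair of threads bouncing a value through a shared variable; a simplified sketch (without the core pinning and per-core-pair sweeps a real version needs) looks like this:

```c
// Minimal core-to-core "ping-pong" sketch: two threads bounce a value through
// a shared atomic, so each handoff forces a cache line transfer between cores.
// Real versions pin each thread to a specific core (e.g. with
// pthread_setaffinity_np) and sweep every core pair; that part is omitted.
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000L
static _Atomic long shared = -1;

static void *pong(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        while (atomic_load(&shared) != 2 * i) ;  // wait for ping
        atomic_store(&shared, 2 * i + 1);        // reply
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    struct timespec start, end;
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERS; i++) {
        atomic_store(&shared, 2 * i);                // ping
        while (atomic_load(&shared) != 2 * i + 1) ;  // wait for reply
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    pthread_join(t, NULL);
    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("avg round trip: %.1f ns (one-way latency is roughly half)\n", ns / ITERS);
    return 0;
}
```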

T-HEAD’s CIU can pass data between cores with reasonable speed. It’s much better than P550, which saw over 300 ns of latency within a quad core cluster.

Final Words

C910 is T-HEAD’s first out-of-order core. Right out of the gate, C910 is more polished than P550 in some respects. Core to core latency is better, unaligned accesses are properly handled, and there’s vector support. Like P550, C910 aims to scale across a broad range of low power applications. L2 capacity can be configured up to 8 MB, and multi-cluster setups allow scaling to high core counts. I feel like there’s ambition behind C910, since Alibaba wants to use in-house RISC-V cores instead of depending on external suppliers.

Alibaba has been promoting Xuantie core IP series to facilitate external customers for edge computing applications, such as AI, edge servers, industrial control and advanced driver assistance systems (ADAS)…by the end of 2022, a total volume of 15 million units is expected

Xuantie-910: A Commercial Multi-Core 12-Stage Pipeline Out-of-Order 64-bit High Performance RISC-V Processor with Vector Extension – T-Head Division, Alibaba Cloud

Yet I also feel the things C910 does well are overshadowed by executing poorly on the basics. The core’s out-of-order engine is poorly balanced, with inadequate capacity in critical structures like the schedulers and register files in relation to its ROB capacity. CPU performance is often limited by memory access performance, and C910’s cache subsystem is exceptionally weak. The cluster’s shared L2 is both slow and small, and the C910 cores have no mid-level cache to insulate L1 misses from that L2. DRAM bandwidth is also lackluster.

A TH1520 chip, seen at Hot Chips 2024 (not the one tested)

C910 is therefore caught in a position where it needs to keep a lot of instructions in flight to smooth out spikes in demand for memory bandwidth and mitigate high L2 latency, but can rarely do so in practice because its ROB capacity isn’t supported by other structures. C910’s unaligned access handling, vector support, and decent core-to-core latency are all nice to have. But tackling those edge cases is less important than building a well balanced core supported by a solid memory subsystem. Missing the subset of applications that use unaligned accesses or take advantage of vectorization is one thing. But messing up performance for everything else is another. And C910’s poor L2 and DRAM performance may even limit the usefulness of its vector capabilities, because vectorized applications tend to pull more memory bandwidth.

Hopefully T-HEAD will use experience gained from C910 to build better cores going forward. With Alibaba behind it, T-HEAD should have massive financial backing. I also hope to see more open source out-of-order cores going forward. Looking through C910 source code was very insightful. I appreciated being able to see how micro-op formats changed between pipeline stages, and how instruction decode is split across several stages that aren’t necessarily labeled “decode”.


A RISC-V Progress Check: Benchmarking P550 and C910

2025-01-31 05:22:37

RISC-V has seen a flurry of activity over the past few years. Most RISC-V implementations have been small in-order cores. Western Digital’s SweRV and Nvidia’s NV-RISCV are good examples. But cores like those are meant for small microcontrollers, and the average consumer won’t care which core a company selects for a GPU or SSD’s microcontrollers. Flagship cores from AMD, Arm, Intel, and Qualcomm are more visible in our daily lives, and use large out-of-order execution engines to deliver high performance.

Out-of-order execution involves substantial complexity, which makes SiFive’s Performance P550 and T-HEAD’s Xuantie C910 interesting. Both feature out-of-order execution, though a quick look at headline specifications shows neither core can take on the best from AMD, Arm, Intel, or Qualcomm.

To check on RISC-V’s progress as its cores move toward higher performance targets, I’m comparing with Arm’s Cortex A73 and Intel’s Goldmont Plus. Both have comparably sized out-of-order execution engines.

SPEC CPU2017

SPEC is an industry standard benchmark distributed in source code form. It deliberately attempts to test both hardware and the compilers that target it. As before, I’m building SPEC CPU2017 with GCC 14.2.0. For P550, I used -march=rv64imafdc_zicsr_zifencei_zba_zbb -mtune=sifive-p400-series. For C910, I used -march=rv64imafdc_xtheadvector -mtune=generic-ooo. GCC doesn’t have optimization models for either RISC-V core, though I suspect that doesn’t matter much.

The two RISC-V cores fall short of Arm’s Cortex A73 and well short of Intel’s Goldmont Plus. Clock speed differences play a large role, and the EIC7700X is especially terrible in that respect. Eswin chose to clock its P550 cluster at just 1.4 GHz, even though the chip’s datasheet notes the CPU cores can run at “up to 1.8 GHz”. C910 does better at 1.85 GHz, though that’s still low in absolute terms. Unfortunately for T-HEAD, C910’s higher clock speed does not let it gain a performance lead against the P550. I’m still working on dissecting C910, but at first glance I’m not impressed with how T-HEAD balanced C910’s out-of-order execution engine and memory subsystem.

Cortex A55 and A53 provide perspective on where in-order execution sits today. Neither core can get anywhere close to high performance client designs, but C910 and P550 have relatively small out-of-order engines. They also run at low clock speeds. Mediatek’s Genio 1200 has a particularly strong A55 implementation, with higher clock speeds and better DRAM latency than C910 and P550. Its Cortex A55 cores are able to catch C910 and P550 without full out-of-order execution.

AMD expects to exceed Pentium performance at the same clock rate by about 30%

Microprocessor Report

This isn’t the first time an in-order core has done surprisingly well against out-of-order ones. Back in 1996, AMD’s K5 featured 4-wide out-of-order execution and better per-clock performance than Intel’s 2-wide, in-order Pentium. Intel clocked the Pentium more than 30% faster, and came out on top. Today’s situation with C910 and P550 against A55 has some parallels. A55 doesn’t win everywhere though. It loses to both RISC-V cores in SPEC CPU2017’s floating point suite. And a less capable in-order core like A53 can’t keep up despite running at higher clocks.

Across SPEC CPU2017’s integer workloads, C910 fails to win any test against the lower clocked EIC7700X. T-HEAD does better in the floating point suite, where it wins in a number of tests, but fails to take an overall performance lead. Meanwhile, A73 and Goldmont Plus do an excellent job of translating their higher clock speeds into a real advantage.

IPC data from hardware performance counters can show how well cores are utilizing their pipeline widths. IPC behavior tends to vary throughout a workload, but generally core width becomes more of a likely limitation as average IPC approaches core width. Conversely, low IPC workloads are less likely to care about core width, and might benefit from better branch prediction or lower memory access latency.

In SPEC CPU2017’s integer workloads, 548.exchange2 and 525.x264 are high IPC workloads. Arm’s 2-wide A73 is at a disadvantage in both. 3-wide cores like P550 and Goldmont Plus can stretch their legs, pushing up to and beyond 2 IPC. C910 is also 3-wide, but struggles to take advantage of its core width.

SPEC’s floating point suite has a few high IPC tests too, like 538.imagick and 508.namd. Low power cores don’t seem to do so well in these tests, unlike high performance cores like AMD’s Zen 5 or Intel’s Redwood Cove. Goldmont Plus gets destroyed in 538.imagick. But Intel’s low power core does well enough across other tests to let its high clock speed show through, and translate to a large overall lead. C910 again fails to impress. P550 somewhat makes up for its low clock speed with good IPC, though it’s really hard to compete from 1.4 GHz.

7-Zip File Compression

7-Zip is a file compression utility. It almost exclusively uses scalar integer instructions, so floating point and vector execution isn’t important in this workload. I’m compressing a 2.67 GB file using four cores, with 7-Zip set to use four threads.

C910 and P550 turn in a similar performance. Both fall slightly behind the in-order Cortex A55, again showing how well fed, higher clocked in-order cores can still pack a punch. For perspective though, I’ve included A55 cores from two cell phone chips.

In Qualcomm’s Snapdragon 855 and 670, A55 suffers from much higher DRAM latency and runs at lower clocks. Both fall behind P550 and C910, showing how performance for the same core can vary wildly depending on the chip it’s implemented in.

Not sure I trust A55’s performance counters, because instruction counts are similar to A73 but it’s slower?

7-Zip is relatively challenging from an IPC perspective, with a lot of branches and cache misses. P550 gets reasonably good utilization out of its pipeline.

Calculate SHA256 Checksum

Hash functions are used to ensure data integrity. Most desktop CPUs have more than enough power to hash gigabytes upon gigabytes of data without taking too long. Low power CPUs are a different story. I also find this checksum calculation workload interesting because it often reaches very high IPC on CPUs that don’t have specific instructions to accelerate it. I’m simply using Linux’s sha256sum command on the same 2.67 GB file fed into 7-Zip above.

Cortex A55 takes a surprisingly large lead. sha256sum’s instruction stream seems to mostly consist of math and bitwise instructions, with few memory accesses or branches. That creates an ideal environment for in-order cores. Impressively, A55 manages higher IPC than A73.

3-wide cores also have a field day. P550 and Goldmont Plus sustain well over 2 IPC. C910 doesn’t enjoy the field day so much, but still gets close to 2 IPC.

Both RISC-V cores execute more instructions to get the same work done. x86-64 represents this workload more efficiently, and aarch64 is able to finish using even fewer instructions.

Collecting performance monitoring data comes with some margin of error, because tools like perf must interrupt the targeted workload to read out performance counter values. Hardware performance counters also aren’t validated to the same strict standard as other parts of the core, because results only have to be good enough to properly inform software tuning decisions. Still, the gap between P550 and C910 is larger than I’d expect. P550 executes more instructions to finish the same work, and I’m not sure why.

x264 Encode

Software video encoding provides better compression efficiency than hardware video encoders, but is computationally expensive. SPEC CPU2017 represents software video encoding with the “525.x264” subtest, but in practice libx264 uses handwritten assembly kernels to take advantage of CPU vector extensions. Assembly of course can’t make it into SPEC – which needs to be fair to different ISAs and can’t use ISA specific code.

Unfortunately real life is not fair. Both CPU vector capabilities and software support for them can affect performance. x264 prints out CPU capabilities it can take advantage of:

C910 supports RVV 0.7.1, but libx264 does not have any assembly code written for any RISC-V ISA extension. Performance is a disaster for the RISC-V contenders, with A73 and Goldmont Plus landing on a different performance planet. Even A55 is very comfortably ahead.

Both RISC-V cores top the IPC chart, executing more instructions per cycle on average than the x86-64 or aarch64 cores I tested. P550 is especially impressive, pushing close to its 3 IPC limit. C910 doesn’t do as well, but 1.38 IPC is still respectable.

But IPC in isolation is misleading. Clock speed is an obvious factor. Instruction counts are another. In x264, the two RISC-V cores have to execute so many more instructions to get the same work done that IPC becomes meaningless.

Building a strong ecosystem takes a long time. RISC-V will need software developers to take advantage of vector extensions. But before that happens, RISC-V hardware needs to show those developers that it’s powerful enough to be worth the effort.

Final Words

SiFive’s Performance P550 and T-HEAD’s Xuantie C910 are both notable for featuring out-of-order execution in the RISC-V scene. Both are plagued by low clock speeds, even against older aarch64 and x86-64 cores. Arm’s Cortex A73 and Intel’s Goldmont Plus are good demonstrations of how even small out-of-order execution engines can pull a large lead against in-order cores. P550 and C910 don’t always do that.

Between the two RISC-V cores, P550 has a well balanced out-of-order execution engine. It’s able to sustain higher IPC and often keep pace with C910. In some easier workloads, P550 is able to get very close to its core width limits. SiFive has competently balanced P550’s out-of-order execution engine. C910 in comparison is less well balanced, and often fails to translate its higher clock speed into a real performance lead. I wonder if P550 has a lot more potential behind it if an implementer runs it at higher clock speeds, and backs it up with low DRAM latency.

From a hardware perspective, RISC-V is some distance away from competing with Arm or x86-64. SiFive has announced higher performance RISC-V designs, so the RISC-V world is making progress on that front. Beyond hardware though, RISC-V has a long way to go from the software perspective. RISC-V is a young instruction set and some of its extensions are younger still. These extensions can be critical to performance in certain applications (like video encoding), and building up that software ecosystem will likely take years. While I’m excited by the idea of an instruction set free from patents and licensing restrictions, RISC-V is still in its infancy.


Inside SiFive’s P550 Microarchitecture

2025-01-27 06:06:04

RISC-V is a relatively young and open source instruction set. So far, it has gained traction in microcontrollers and academic applications. For example, Nvidia replaced the Falcon microcontrollers found in their GPUs with RISC-V based ones. Numerous university projects have used RISC-V as well, like Berkeley’s BOOM. However, moving RISC-V into more consumer-visible, higher performance applications will be an arduous process. SiFive plays a key role in pushing RISC-V CPUs toward higher performance targets, and occupies a position analogous to that of Arm (the company). Arm and SiFive both design and license out IP blocks. The task of creating a complete chip is left to implementers.

Rendering of the P550 dev board from the datasheet

By designing CPU blocks, both SiFive and Arm can lower the cost of entry to building higher performance designs in their respective ISA ecosystems. To make that happen within the RISC-V ecosystem though, SiFive needs to develop strong CPU cores. Here, I’ll take a look at SiFive’s P550. This core aims for “30% higher performance in less than half the area of a comparable Arm Cortex A75.”

Just as with Arm’s cores, P550’s performance will depend heavily on how it’s implemented. For this article, I’m testing the P550 as implemented in the Eswin EIC7700X SoC. This SoC has a 1.4 GHz, quad core P550 cluster with 4 MB of shared cache. The EIC7700X is manufactured on TSMC’s 12nm FFC process. The SiFive Premier P550 Dev Board that hosts the SoC has 16 GB of LPDDR5-6400 memory. For context, I’ve gathered some comparison data from the Qualcomm Snapdragon 670 in a Pixel 3a. The Snapdragon 670 has a dual core Arm Cortex A75 cluster running at 2 GHz.

Overview

The P550 is a 3-wide out-of-order core with a 13 stage pipeline. Out-of-order execution lets the core move past a stalled instruction to extract instruction level parallelism. It’s critical for achieving high performance because cache and memory latency can be significant limiters for modern CPUs. The P550 is far from SiFive’s first out-of-order design. That distinction belongs to SiFive’s U87, which is also a 3-wide out-of-order design. P550 comes several years after, and should be more mature.

Arm’s Cortex A75 is also a 3-wide out-of-order core. It’s an improved version of Arm’s Cortex A73, and carries forward distinguishing features like out-of-order retirement. Anandtech says A75 has an 11-13 stage pipeline, though their diagram suggests the minimum mispredict penalty is likely closer to 11 cycles.

Like SiFive’s P550, the Arm Cortex A75 has a modestly sized out-of-order execution engine. Both are far off the high performance designs we see from Intel and AMD today, and are optimized for low power and area.

Branch Prediction

Fast and accurate branch prediction is critical to both performance and power efficiency. SiFive has given the P550 a 9.1 KiB branch history table, which helps the core correlate past branch behavior with branch outcomes. From an abstract test with various numbers of branches that are taken/not-taken in increasingly long random patterns, the P550’s branch predictor looks to have reasonably good pattern recognition capabilities. It falls well short of high performance cores, but that’s to be expected.

Compared to Arm’s Cortex A75, the P550 can handle longer patterns for a small number of branches. The gap however narrows as more branches come into play.

Branch predictor speed can matter too, especially in high IPC code with a lot of branches. The P550 appears to have a 32 entry BTB capable of handling taken branches with no bubbles. Past that, the core can handle a taken branch every three cycles as long as the test fits within 32 KB. Likely, P550 doesn't have another BTB level. If a branch misses the 32 entry BTB, the core simply calculates the branch's destination address when it arrives at the frontend. If so, the P550's 32 KB L1 instruction cache has 3 cycles of latency.

Arm’s Cortex A75 also uses what appears to be a single small BTB level. Both cores lack the large decoupled BTBs that high performance cores tend to have.

P550 uses a 16 entry return stack to predict returns from function calls. A75 seems to have a return stack with 42 entries, because latency per call+return pair doesn’t hit an inflection point until I get past that. Even with the larger return stack, A75’s higher 2 GHz clock speed lets it achieve similar performance for the common case of a return stack hit.

When return stack capacity is exceeded, P550 sees a sharp spike in latency. That contrasts with A75’s more gentle increase in latency. Perhaps A75 only mispredicts for the return address that got pushed out of the stack. P550 possibly has less graceful handling for return stack overflows, making it mispredict many of the returns even when the test only exceeds return stack capacity by a few entries.
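Probing return stack depth comes down to timing a chain of non-inlined nested calls; once nesting depth exceeds the predictor's capacity, time per call+return pair jumps. Below is a minimal C sketch of that idea, not the exact test harness used here. The noinline attribute keeps the compiler from flattening the chain.

    // Returns after recursing to the requested depth. The +1 prevents a tail call,
    // so each level really does push and pop a return address.
    __attribute__((noinline)) long call_chain(int depth) {
        if (depth <= 0) return 1;
        return call_chain(depth - 1) + 1;
    }

    // Time many iterations of call_chain(depth) for depth = 1..64 and divide by the
    // number of call+return pairs. A sharp jump in time per pair marks return stack capacity.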

Fetch and Decode

The P550’s frontend comes with a parity protected 32 KB 4-way set associative instruction cache, capable of delivering enough bandwidth (12 bytes/cycle) to feed the 3-wide decoder downstream. The frontend can sustain 3 IPC as long as code fits within L1i. Instruction bandwidth takes a gradual drop past that. From L2, the core can still maintain a reasonable 2 IPC. Instruction bandwidth from L3 is good for 1 IPC, though you’d really hope L2 code misses are rare on any core.

Arm chose to give A75 a larger 64 KB instruction cache, giving it a better chance of satisfying instruction-side memory accesses from L1i. On a L1i miss, instruction bandwidth takes a sharp decline. Much of that is down to implementation decisions. Qualcomm gave the Snapdragon 670 a 1 MB system level cache. System level caches are placed closer to the memory controller than compute blocks. Therefore, they usually aren’t optimized to provide high performance for any one block. In contrast, the 4 MB L3 on the EIC7700X is tightly tied to the CPU cluster.

Fetched instructions are decoded into micro-ops, which pass through the renamer and head to the out-of-order backend.

Out-of-Order Execution

SiFive’s P550 has somewhat higher reordering capacity than Arm’s Cortex A75. However, Arm can make its out-of-order execution buffers go a bit further thanks to the out-of-order retirement trick carried forward from A73. On A75, I’m using an incomplete branch along with an incomplete load to block retirement.

Both cores have plenty of register file capacity compared to their ROB size, though other structures like memory ordering queues can be a bit thin. P550 and A75 have nowhere near as much reordering capacity as current Intel and AMD cores, or even more recent Arm cores like the Cortex A710. They're more comparable to Intel's Core 2 or Goldmont Plus. Still, a modest out-of-order execution window is far better than in-order execution.

Execution resources on both cores are allocated with low power and low area goals in mind. Between the two, P550 has a more flexible integer port setup and more scheduling capacity to feed those ports. Cortex A75 isn't far behind though, with two ALU ports and a separate branch port. Scalar integer workloads often have a lot of branches, and a branch port doesn't need a writeback path to the register file. Arm's setup is likely cheaper, while providing almost as much performance.

On the floating point side, both cores have two FP ports capable of handling the most common operations. P550 handles FP adds, multiplies, and fused multiply-adds (FMAs) with 4 cycle latency, suggesting the core uses FMA units to handle all of those operations. After all, an add is simply an FMA with a multiplier of 1, and a multiply is an FMA with the addend set to 0. A75 has 3 cycle latency for FP adds and multiplies, while FMAs execute over 5 cycles. Arm may use separate execution units for FMAs and adds/multiplies. Or, Arm might have an FMA unit with optimized paths for doing just multiplies or just adds. Cortex A75's FPU also supports vector execution, giving it a leg up over P550.
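Latency figures like these typically come from a dependent chain of operations, where each result feeds the next so the loop runs at pipeline latency rather than throughput. A rough C sketch follows; it assumes the compiler emits a hardware fused multiply-add for fma(), which is worth confirming in the disassembly.

    #include <math.h>

    // Each fma() depends on the previous result, so iterations complete at FMA latency.
    double fma_latency_chain(long iters) {
        double acc = 1.0;
        for (long i = 0; i < iters; i++)
            acc = fma(acc, 1.0000000001, 1e-12);  // serial dependency through acc
        return acc;  // time the loop, divide by iters, multiply by clock speed for cycles
    }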

Microbenchmarking suggests A75 has 31 scheduler entries available for floating point operations. Anandtech says A75 has two 8-entry floating point schedulers, but my measurements disagree. P550 has 28 total scheduler entries for FP operations. This could be a dual ported unified scheduler, or two 14 entry schedulers. I haven’t found an operation that only goes to one port.

Memory Subsystem

P550 is a small, low power design and doesn't need the high throughput memory subsystems found on Intel, AMD, or Arm's big cores. Memory operations first have their addresses generated by two address generation units (AGUs). One handles loads, and the other handles stores. Both appear to be backed by relatively large schedulers, letting the core handle workloads with an uneven balance of loads and stores. Cortex A75 also has two AGUs, but each of A75's AGU pipes can handle both loads and stores. That more flexible arrangement makes a lot of sense because loads usually greatly outnumber stores, so P550's load AGU may be very busy while its store AGU sits mostly idle.

…we observed that each of the two load/store ports were used about 20 percent of the time. We surmised that changing to one dedicated load port and one dedicated store port should not have a large effect on performance…This proved to be the case, with less than a 1 percent performance loss for this change.

David B. Papworth, Tuning the Pentium Pro Microarchitecture

Curiously, Intel evaluated the same tradeoff when they designed their Pentium Pro back in 1996. They found using a less flexible load/store setup only came with a minor performance impact. SiFive may have come to the same conclusion. P550 does have more reordering capacity than the Pentium Pro though, and thus would be better able to feed its execution pipes (including the AGUs) in the face of cache misses.

AGUs generate program-visible virtual addresses, which have to be translated to physical addresses. P550 uses a two-level TLB setup to speed up address translation. The first level TLBs are fully associative, meaning that any entry can cache a translation for any address. However, both the data and instruction side TLBs are relatively small with only 32 entries. A larger 512 entry L2 TLB serves both instruction and data L1 TLB misses. On the data side, getting a translation from the L2 TLB adds 9 cycles of latency. Arm’s A75 has larger TLBs, and a lower 5-6 cycle penalty for hitting the L2 TLB.

Before accessing cache, load addresses have to be checked against older stores (and vice versa) to ensure proper ordering. If there is a dependency, P550 can only do fast store forwarding if the load and store addresses match exactly and both accesses are naturally aligned. Any unaligned access, dependent or not, confuses P550 for hundreds of cycles. Worse, the unaligned loads and stores don’t proceed in parallel. An unaligned load takes 1062 cycles, an unaligned store takes 741 cycles, and the two together take over 1800 cycles.

This terrible unaligned access behavior is atypical even for low power cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of dependent accesses that are both misaligned.

Digging deeper with performance counters reveals that each unaligned load instruction results in ~505 executed instructions. P550 almost certainly doesn't have hardware support for unaligned accesses. Rather, it's likely raising a fault and letting an operating system handler emulate the access in software.
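Reproducing the penalty is straightforward: construct a pointer that is never 4-byte aligned and load through it in a loop. Here's a hedged C sketch; note that accessing a char buffer through a uint32_t pointer is technically undefined behavior, so a real test should check the generated assembly to confirm an ordinary word load is emitted.

    #include <stdint.h>

    static char buf[128] __attribute__((aligned(64)));

    // volatile keeps the compiler from optimizing the loads away or splitting them itself
    uint64_t unaligned_load_loop(long iters) {
        volatile uint32_t *p = (volatile uint32_t *)(buf + 1);  // offset 1: never 4B aligned
        uint64_t sum = 0;
        for (long i = 0; i < iters; i++)
            sum += *p;  // on the EIC7700X, each of these appears to trap and get emulated
        return sum;
    }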

Core-Private Caches

With a modest out-of-order execution engine and no vector capability, P550 doesn't demand a lot of bandwidth from its memory subsystem. Latency however is very important, because P550 doesn't have the massive reordering capacity that higher performance cores use to tolerate memory latency. Like many AMD, Intel, and Arm designs, each P550 core has its own L1 and L2 caches while L3 is shared. All levels of the data-side cache hierarchy are ECC protected.

P550’s 32 KB L1 data cache is 4-way set associative, and can service a load and a store every cycle, assuming no alignment issues. Maximum bandwidth is thus 16 bytes per cycle, achieved with an even mix of reads and writes. Latency is 3 cycles, matching many low-clocked cores.

Using the A73 for comparison because I don’t have a way to use hugepages on Android

The L2 cache is 256 KB and 8-way set associative. It’s built from two banks, and has 13 cycle latency. This combination of size and latency is a bit dated, as Arm and AMD have both implemented larger L2 caches with lower latency. However, P550’s L2 is still well positioned to catch L1 misses and insulate the core from L3 latency. L2 bandwidth is mediocre at 8 bytes per cycle, a limitation that applies to both reads and writes. While it’s not an impressive figure, 8 bytes per cycle should be adequate considering the core’s lack of vector capability and single load AGU.

Arm’s Cortex A75 enjoys higher cache bandwidth, thanks both to higher clock speeds and more per-cycle bandwidth at each level.

L3 Cache and Interconnect

P550’s interconnect has to be modular and scalable to address the largest possible market. A consumer router or set-top box may be fine with 2-4 cores, while a small edge server might benefit from higher core counts. P550 can be instantiated in clusters of up to four cores. Presumably, cores within a cluster share a single external interface. Multiple clusters sit on a “Coherent System Fabric”, which sends traffic coming off P550 clusters to the appropriate destination. From the EIC7700X datasheet, this “Coherent System Fabric” is likely a crossbar.

Cacheable memory accesses head to a L3 cache, which can be shared across multiple P550 clusters and is banked to meet the bandwidth demands of multiple cores. SiFive provides 1 MB, 2 MB, 4 MB, and 8 MB L3 capacity options. The largest 8 MB option has eight banks and is reserved for multi-cluster configurations. The EIC7700X we’re looking at has a 4 MB L3 with four banks. Bank count thus matches core count.

Microbenchmarking indicates the L3 can give 8 bytes per cycle of bandwidth to each core. In total, the EIC7700X’s quad core P550 cluster has about 43.88 GB/s of L3 bandwidth. L3 latency is respectable at about 38 cycles, which isn’t bad considering the cache’s flexibility. For comparison, Arm’s Cortex A73 uses a simpler two-level cache setup. A73’s L2 serves double duty as the first stop for L1D misses and as a large last level cache. That means compromise, so the 1 MB L2 has less capacity than EIC7700X’s L3 while having better latency at 25 cycles.

Very likely, each cluster has a single 32B/cycle (256-bit) interface to the crossbar.

L3 misses head out to a Memory Port. Depending on implementation goals, a P550 multi-cluster complex can have one or two memory ports, each of which can be 128 or 256 bits wide. Each Memory Port can track up to 128 pending requests to enable memory level parallelism. Less common requests to IO or non-cacheable addresses get routed to one of two 64-bit System Ports or a 64-bit Peripheral Port. Implementers can also use one or two Front Ports, which gives other agents coherent memory access through the multi-cluster complex.

Eswin has chosen to use a single Memory Port, likely 128 bits wide, and two System Ports. The first System Port’s address space includes a 256 MB PCIe BAR space, PCIe configuration space, and 4 MB of ROM. The second System Port accesses the DSP’s SRAM, among other things.

Those ports connect to the on-chip network, which uses the AXI protocol. At this point, everything is up to the implementer and is out of SiFive’s hands. For the EIC7700X, Eswin used two DDR controllers, each with two 16-bit sub-channels. On the SiFive Premier P550 Dev Board, they’re connected to 16 GB of LPDDR5-6400. The memory controller runs at 1/4 of the SDRAM clock, which would be 800 MHz. DRAM load-to-use latency isn’t great at 194 ns, which is about 165 ns beyond L3 latency. It’s impossible to tell how much of this latency comes from traversing the on-chip network versus the memory controller. But either way, memory latency on the EIC7700X is substantially worse than other LPDDR5 setups like Intel’s Meteor Lake or AMD’s Van Gogh (Steam Deck SoC).
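Latency numbers like these come from a dependent pointer chase, where each load's address comes from the previous load so nothing can be overlapped. A simplified C version of the idea is below; a real test would also want hugepages where available to keep TLB misses out of the picture.

    #include <stdint.h>
    #include <stdlib.h>

    // Build a random cyclic permutation (Sattolo's algorithm) so hardware prefetchers
    // can't follow it, then chase pointers through it. Wrap the chase loop with a timer;
    // time divided by hop count approximates load-to-use latency for that footprint.
    void *pointer_chase(size_t bytes, long hops) {
        size_t n = bytes / sizeof(void *);
        void **arr = malloc(n * sizeof(void *));
        size_t *idx = malloc(n * sizeof(size_t));
        for (size_t i = 0; i < n; i++) idx[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = rand() % i, t = idx[i];
            idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < n; i++) arr[idx[i]] = &arr[idx[(i + 1) % n]];
        free(idx);
        void **p = arr;
        for (long i = 0; i < hops; i++) p = (void **)*p;  // each load depends on the last
        return (void *)p;  // returning p defeats dead code elimination; arr is leaked (sketch)
    }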

I measured 16.74 GB/s of DRAM bandwidth, which is well short of what LPDDR5-6400 should be capable of even with a 64-bit bus. The EIC7700X uses some bus width for inline ECC, but achieved bandwidth is well short of theoretical even with that in mind. Still, that sort of bandwidth should be adequate for a very low clocked quad core setup with no vector capability.

Core to Core Latency

Rarely, the interconnect may need to carry out cross-core transfers to maintain cache coherency. Eswin’s EIC7700X datasheet says the memory subsystem has a “directory based coherency manager”, meaning memory accesses check the directory to see whether they need to send a probe or can proceed as normal down the memory hierarchy. Compared to a broadcast strategy, using a directory keeps probe traffic under control as core counts go up.

Anandtech and other sites have used “core to core” latency tests to check how long it takes for a core to observe a write from another, and I’ve written my own version of the test. Although the exact methodology differs, results should be broadly comparable to those from Anandtech. Core to core latency on the EIC7700X is quite high.
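My version boils down to two threads ping-ponging a value through a shared cache line, with each thread pinned to its own core (pinning via pthread_setaffinity_np is omitted below for brevity). A stripped-down C sketch of the idea:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L
    static _Atomic long flag = 0;

    // Second thread: writes odd values after seeing the preceding even value
    static void *pong(void *arg) {
        (void)arg;
        for (long i = 1; i < 2 * ITERS; i += 2) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != i - 1) ;
            atomic_store_explicit(&flag, i, memory_order_release);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t;
        struct timespec a, b;
        pthread_create(&t, NULL, pong, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        // Main thread: writes even values after seeing the preceding odd value
        for (long i = 2; i <= 2 * ITERS; i += 2) {
            while (atomic_load_explicit(&flag, memory_order_acquire) != i - 1) ;
            atomic_store_explicit(&flag, i, memory_order_release);
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);
        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("~%.1f ns per one-way transfer\n", ns / ITERS / 2);  // each iteration is a round trip
        return 0;
    }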

Qualcomm’s Snapdragon 670 does much better, even when transfers happen between A75 and A55 cores.

While high core to core latency is unlikely to impact application performance, it does contribute to the feeling that SiFive's P550 isn't fully polished yet.

Final Words

RISC-V is a young instruction set and SiFive is a new player in the CPU design business. High performance CPU design is incredibly challenging, as shown by the small number of players in that space. P550 aims for “the highest performance within a tightly constrained power and area footprint” according to the company’s datasheet. It doesn’t go head-on against the likes of AMD’s Zen 5, Intel’s Lion Cove, or Qualcomm’s Oryon. P550’s out-of-order engine is closer in size to something like Intel’s Core 2 from over 15 years ago. Combine that with much lower clock speeds than even what Core 2 ran at, and P550 is really a low power core with modest performance. It’s best for light management tasks where an in-order core may be a tad sluggish.

More importantly, P550 represents a stepping stone in SiFive's journey to push RISC-V to higher performance targets. Not long ago, SiFive primarily focused on tiny in-order cores not suitable for much more than microcontrollers. With the P550, SiFive has built a reasonably well balanced out-of-order engine supported by a competent cache hierarchy. They got the basics down, and I can't emphasize enough how important that is. Out-of-order execution has proven essential for building high performance, general purpose CPUs, but is also difficult to pull off. In fact, both Intel and IBM tried to step away from out-of-order execution because it added so much complexity, only to find out Itanium and POWER6's strategies weren't great. With that in mind, SiFive's progress is promising.

Still, the P550 is just one step in SiFive's journey to create higher performance RISC-V cores. As a step along that journey, P550 feels more comparable to one of Arm's early out-of-order designs like Cortex A57. By the time A75 came out, Arm had already accumulated substantial experience in designing out-of-order CPUs. Therefore, A75 is a well polished and well rounded core, aside from the obvious sacrifices required for its low power and thermal budgets. P550 by comparison is rough around the edges. Its clock speed is low. Misaligned access penalties are ridiculous. Vector support is absent. Many programs won't hit the worst of P550's deficiencies, but SiFive has a long way to go.

In that respect, I can also see parallels between P550 and Intel’s first out-of-order CPU. The Pentium Pro back in the mid 1990s performed poorly when running 16-bit code. But despite its lack of polish in certain important areas, the core as a whole was well designed and gave Intel confidence in tackling more complex CPU designs. SiFive has since announced more sophisticated out-of-order designs like the P870. I’ll be excited to see those cores implemented in upcoming chips, because they look quite promising.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

References

  1. SiFive Performance P550 Data Sheet

  2. EIC7700X Datasheet

Disabling Zen 5’s Op Cache and Exploring its Clustered Decoder

2025-01-24 07:07:39

Zen 5 has an interesting frontend setup with a pair of fetch and decode clusters. Each cluster serves one of the core’s two SMT threads. That creates parallels to AMD’s Steamroller architecture from the pre-Zen days. Zen 5 and Steamroller can both decode up to eight instructions per cycle with two threads active, or up to four per cycle for a single thread.

Despite these decoder layout similarities, Zen 5’s frontend operates nothing like Steamroller. That’s because Zen 5 mostly feeds itself off a 6K entry op cache, which is often large enough to cover the vast majority of the instruction stream. Steamroller used its decoders for everything, but Zen 5’s decoders are only occasionally activated when there’s an op cache miss. Normally that’d make it hard to evaluate the strength of Zen 5’s decoders, which is a pity because I’m curious about how a clustered decoder could feed a modern high performance core.

Thankfully, Zen 5’s op cache can be turned off by setting bit 5 in MSR 0xC0011021. Setting that bit forces the decoders to handle everything. Of course, testing with the op cache off is completely irrelevant to Zen 5’s real world performance. And if AMD wanted to completely serve the core using the decoders, there’s a good chance they would have gone with a conventional 8-wide setup like Intel’s Lion Cove or Qualcomm’s Oryon. Still, this is a cool chance to see just how Zen 5 can do with just a 2×4-wide frontend.
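On Linux, flipping that bit can be done with wrmsr from msr-tools, or programmatically through the msr character device. Below is a minimal C sketch of the read-modify-write, assuming the msr kernel module is loaded and the program runs as root. This is an undocumented MSR, so change it at your own risk, and repeat the write for every core.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    // Set or clear bit 5 of MSR 0xC0011021 on one core via /dev/cpu/<n>/msr
    int set_opcache_disable(int cpu, int disable) {
        char path[64];
        uint64_t val;
        snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
        int fd = open(path, O_RDWR);
        if (fd < 0) { perror("open msr"); return -1; }
        if (pread(fd, &val, sizeof(val), 0xC0011021) != sizeof(val)) { perror("rdmsr"); close(fd); return -1; }
        if (disable) val |= 1ULL << 5;
        else         val &= ~(1ULL << 5);
        if (pwrite(fd, &val, sizeof(val), 0xC0011021) != sizeof(val)) { perror("wrmsr"); close(fd); return -1; }
        close(fd);
        return 0;
    }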

Here, I’m testing Zen 5 using the AMD Ryzen 9 9900X, which implements 12 Zen 5 cores in two 6-core clusters. I did an in-socket swap from my Ryzen 9 7950X3D, which means the 9900X is fed off the same DDR5-5600 setup I had from 2023. Performance results won’t be directly comparable to Ryzen 9 9950X figures from a prior article, because the 9950X had faster DDR5-6000.

Microbenchmarking Instruction-Side Bandwidth

To get a handle on how the frontend behaves in pure decoder mode, I fill an array with NOPs (instructions that do nothing) and jump to it. AMD's fetch/decode path can handle 16 bytes per cycle per thread in this test. AMD's slides imply each fetch/decode pipe has a 32B/cycle path to the instruction cache, but I wasn't able to achieve that when testing with 8 byte NOPs. Maybe there's another pattern that achieves higher instruction bandwidth, but I'm mostly doing this as a sanity check to ensure the op cache isn't in play.
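In practice that means allocating an executable buffer, filling it with the canonical multi-byte NOP encodings, terminating it with a RET, and calling it in a loop while sweeping the buffer size. A simplified C sketch using 8-byte NOPs (my actual harness differs in the details):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    typedef void (*nop_fn)(void);

    // Build a callable block of 8-byte NOPs ending in RET. Sweeping `bytes` past the
    // L1i and L2 sizes maps out instruction fetch bandwidth at each cache level.
    nop_fn make_nop_block(size_t bytes) {
        static const uint8_t nop8[8] = {0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00};
        uint8_t *buf = mmap(NULL, bytes + 1, PROT_READ | PROT_WRITE | PROT_EXEC,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return NULL;
        size_t n = bytes / sizeof(nop8);
        for (size_t i = 0; i < n; i++)
            memcpy(buf + i * sizeof(nop8), nop8, sizeof(nop8));
        buf[n * sizeof(nop8)] = 0xC3;  // RET
        return (nop_fn)buf;
    }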

Shorter 4 byte NOPs are more representative of typical instruction length, and stress decoder throughput rather than instruction cache bandwidth. Turning off the op cache limits a single thread to 4 IPC, as expected. Running two threads in the core, and thus using both decode clusters, brings total throughput to 8 IPC.

Across both patterns, Zen 5’s dual fetch pipes provide a huge increase in L1i miss bandwidth. Likely, each fetch pipe maintains an independent queue of L1i miss requests, allowing increased memory level parallelism with both threads active.

Steamroller/Excavator Parallels?

AMD’s Excavator architecture is an iterative improvement over Steamroller, and carries forward Steamroller’s clustered decode scheme. Excavator behaves much like Zen 5 as long as code fits within the instruction cache. But if the test spills out of L1i, Zen 5 behaves far better. Where Excavator has a single L1i fetch path feeding two decode clusters, Zen 5 duplicates the fetch paths too. That’s another key difference between Zen 5 and Steamroller/Excavator, besides Zen 5 having an op cache.

With 8-byte NOPs though, Excavator can surprisingly give more L1i bandwidth to a single thread. That advantage goes away when both decoders pile onto the single fetch path, after which both Zen 5 and Excavator appear capped at 32 bytes per cycle.

Excavator does enjoy slightly better L2 code bandwidth when both threads are loaded, but the bandwidth increase is nowhere near the 2x that Zen 5 enjoys. Excavator really wants to avoid fetching code from L2, and has a very large 96 KB instruction cache to avoid that. I’m also amazed that AMD’s old architecture could sustain 8 NOPs/cycle through the module’s pipeline. Special thanks goes to cha0shacker for running tests on his Excavator system.

SPEC CPU2017

SPEC CPU2017 is an industry standard benchmark suite, and is a good way to get a general idea of performance. Disabling the op cache drops Zen 5’s score by 20.3% and 16.8% in the integer and floating point suites, respectively. That’s a substantially heavier penalty than what I saw with Zen 4, where turning off the op cache reduced performance by 11.4% and 6.6% in the integer and floating point suites, respectively. Zen 5 is capable of higher throughput than Zen 4, and feeding a bigger core through a 4-wide decoder is more difficult.

In the previous article, I also found Zen 4 suffered heavier losses from turning off the op cache when both SMT threads were active. Two threads expose more parallelism and enable higher throughput, making the 4-wide decoder even more of a bottleneck. Zen 5’s two decode clusters reverse the situation. Integer and floating point scores drop by 4.9% and 0.82% with the op cache off. For comparison, turning off Zen 4’s op cache leads to heavier 16% and 10.3% drops in the integer and floating point suites.

Zen 5 can reach very high IPC, especially when cache misses are rare and the core’s massive execution engine can be brought to bear. In those workloads, a single 4-wide decode cluster is plainly inadequate. Disabling the op cache in high IPC workloads like 548.exchange2 leads to downright devastating performance losses.

Lower IPC workloads are less affected, but overall penalties are generally heavier on Zen 5 than on Zen 4. For example, turning off Zen 4’s op cache dropped 502.gcc’s score by 6.35%. On Zen 5, doing the same drops the score by 13.7%.

Everything flips around once the second decoder comes into play with SMT. The op cache still often provides an advantage, thanks in part to its overkill bandwidth. Taken branches and alignment penalties can inefficiently use frontend bandwidth, and having extra bandwidth on tap is always handy.

Multiplying thread IPC by 2 because I’m running SPEC rate tests with two copies pinned to SMT siblings, and the core should spend negligible time in ST mode

But overall, the dual decode clusters do their job. Even high IPC workloads can be reasonably well fed off the decoders. Performance counters even suggest 525.x264 gains a bit of IPC in decoder-only mode, though that didn’t translate into a score advantage likely due to varying boost clocks.

In SPEC CPU2017’s floating point tests, the dual decoders pull off a surprising win in 507.cactuBSSN. IPC is higher, and SPEC CPU2017 gives it a score win too.

507.cactuBSSN is the only workload across SPEC CPU2017 where the op cache hitrate is below 90%. 75.94% isn’t a low op cache hitrate, but it’s still an outlier.

With both SMT threads active, op cache coverage drops to 61.79%. Two threads will have worse cache locality than a single thread, and thus put more pressure on any cache they have to share. That includes the op cache. Most other tests see minor impact because Zen 5’s op cache is so big that it has little trouble handling even two threads.

In 507.cactuBSSN, the op cache still delivers the majority of micro-ops. But 61.79% coverage likely means op cache misses aren’t a once-in-a-blue-moon event. Likely, the frontend is transitioning between decoder and op cache mode fairly often.

AMD’s Zen 5 optimization guide suggests such transitions come at a cost.

Excessive transitions between instruction cache and Op Cache mode may impact performance negatively. The size of hot code regions should be limited to the capacity of the Op Cache to minimize these transitions

Software Optimization Guide for the AMD Zen 5 Microarchitecture

I’m guessing more frequent op cache/decoder transitions, coupled with IPC not being high enough to benefit from the op cache’s higher bandwidth, combine to put pure-decoder mode ahead.

Besides cactuBSSN’s funny SMT behavior, the rest of SPEC CPU2017’s floating point suite behaves as expected. High IPC workloads like 538.imagick really want the op cache enabled. Lower IPC workloads don’t see a huge difference, though they often still benefit from the op cache. And differences are lower overall with SMT.

From a performance perspective, using dual 4-wide decoders wouldn’t be great as the primary source of instruction delivery for a modern core. It’s great for multithreaded performance, and can even provide advantages in corner cases with SMT. But overall, the two fetch/decode clusters are far better suited to act as a secondary source of instruction delivery. And that’s the role they play across SPEC CPU2017’s workloads on Zen 5.

Frontend Activity

Performance counters can provide additional insight into how hard the frontend is getting pushed. Here, I’m using event 0xAA, which counts micro-ops dispatched from the frontend and provides unit masks to filter by source. I’m also setting the count mask to 1 to count cycles where the frontend is sending ops to the backend from either source (op cache or decoders).
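For anyone wanting to reproduce this with perf_event_open, the raw config follows the AMD PERF_CTL layout: event select in bits 7:0, unit mask in bits 15:8, and counter mask in bits 31:24. The unit mask meanings assumed below (0x1 for decoder-sourced ops, 0x2 for op cache-sourced ops) are my reading of the description above, so check the Processor Programming Reference for your part before trusting the numbers. A minimal sketch:

    #include <linux/perf_event.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // Open a raw counter: event 0xAA with a source-select unit mask and cmask=1,
    // so it counts cycles where at least one op was delivered from that source.
    static int open_dispatch_cycles_counter(uint64_t umask) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;
        attr.config = 0xAA | (umask << 8) | (1ULL << 24);  // event | umask | cmask=1
        attr.exclude_kernel = 1;
        return syscall(__NR_perf_event_open, &attr, 0 /* this thread */, -1, -1, 0);
    }

    // Usage: read(fd, &count, sizeof(uint64_t)) before and after the region of interest,
    // then compare against a cycles counter to get "frontend busy" percentages.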

A single 4-wide decoder isn’t adequate for high IPC workloads, and performance monitoring data backs that up. The frontend has to work overtime in decoder-only mode, and it gets worse as IPC gets higher.

SPEC CPU2017's floating point tests make everything more extreme. The floating point suite has a surprising number of moderate IPC workloads that seem to give the decoders a really rough time. For example, 519.lbm averages below 3 IPC and normally doesn't stress Zen 5's frontend. But with the op cache off, the frontend is suddenly busy for over 90% of core active cycles.

SMT increases parallelism and thus potential core throughput, placing higher demand on frontend bandwidth. With the op cache on, frontend load goes up but everything is well within the frontend’s capabilities. With the op cache off, the decoders do an excellent job of picking up the slack. The frontend is a little busier, but the decoders aren’t getting pegged except in very high IPC outliers like 548.exchange2. And exchange2 is a bit of an unfair case because it even pushes the op cache hard.

The strange jump in decoder utilization across SPEC CPU2017’s floating point tests is gone with both SMT threads active. Likely, the two decode clusters together have enough headroom to hide whatever inefficiencies show up in single threaded mode.

Extremely high IPC workloads like 538.imagick do push the clustered decoder quite hard. But overall, the 2×4 decode scheme does well at handling SMT.

Cyberpunk 2077

Cyberpunk 2077 is a game that rewards holding down the tab key. It also features a built-in benchmark. Built-in benchmarks don't always provide the best representation of in-game performance, but do allow for more consistency without a massive time investment. To minimize variation, I ran the Ryzen 9 9900X with Core Performance Boost disabled. That caps clock speed at 4.4 GHz, providing consistent performance regardless of which core the code is running on, or how many cores are active. I've also capped my Radeon RX 6900XT to 2 GHz to minimize GPU-side boost clock variation. However, I'm testing at 1080P with medium settings, so the GPU side shouldn't be a limiting factor.

If you remember the previous article, the 7950X3D is actually 3.4% faster than this. That CPU swap certainly isn’t an upgrade from a gaming performance perspective

Turning the op cache on or off doesn’t make a big difference. That’s notable because games usually don’t run enough threads to benefit from SMT, especially on a high core count chip like the 9900X. However, games are also usually low IPC workloads and don’t benefit from the op cache’s high throughput. Cyberpunk 2077 certainly fits into that category, averaging just below 1 IPC. The Ryzen 9 9900X delivers just 0.17% better performance with the op cache enabled.

In its normal configuration, Zen 5 sources 83.5% of micro-ops from the op cache. Hitrate isn’t quite as high as most of SPEC CPU2017’s workloads, with the notable exception of 507.cactuBSSN. However, that’s still enough to position the op cache as the primary source of instruction delivery.

On average, the frontend uses the op cache for 16.3% of core active cycles, and the decoders for 5.3%. Zen 5’s frontend spends much of its time idle, as you’d expect for a low IPC workload. With the decoders carrying all the load, the frontend delivers ops over 27.4% of core active cycles.

The decoders have to work a bit harder to feed the core, but they still have plenty of time to go on a lunch break, get coffee, and take a nap before getting back to work.

Grand Theft Auto V

Grand Theft Auto V (GTA V) is an older game with plenty of red lights. Again, I’m running with Core Performance Boost disabled to favor consistency over maximum performance. Don’t take these results, even ones with the op cache enabled, to be representative of Zen 5’s stock performance.

Disabling the op cache basically makes no difference, except in the fourth pass, where the op cache gave a 1.3% performance boost. I don’t think that counts either, because no one will notice a 1.3% performance difference.

Zen 5’s op cache covers 77% of the instruction stream on average. Like Cyberpunk 2077, GTA V has a larger instruction footprint than many of SPEC CPU2017’s workloads. The op cache does well from an absolute perspective, but the decoders still handle a significant minority of the instruction stream.

Like Cyberpunk 2077, GTA V averages just under 1 IPC. That won’t stress frontend throughput. On average, the frontend delivered ops in op cache mode over 16.4% of active cycles, and did so in decoder mode over 7.9% of active cycles.

With everything forced onto the decoders, the frontend delivers ops over 29.5% of active cycles. Again, the frontend is busier, but those decoders still spend most of their time on break.

Cinebench 2024

Cinebench 2024 is a popular benchmark among enthusiasts. It’s simple to run, leading to a lot of comparison data floating around the internet. That by itself makes the benchmark worth paying attention to. I’m again running with Core Performance Boost disabled to provide consistent clock speeds, because I’m running on Windows and not setting core affinity like I did with SPEC CPU2017 runs.

Single threaded mode has the op cache giving Zen 5 a 13.5% performance advantage over decoder-only mode. That’s in line with many of SPEC CPU2017’s workloads. Cinebench 2024 averages 2.45 IPC, making it a much higher IPC workload than the two games tested above. Op cache hitrate is closer to Cyberpunk 2077, at 84.4%. Again, that’s lower than in most SPEC CPU2017 workloads.

Higher IPC demands more frontend throughput. Zen 5’s frontend was feeding the core from the op cache over 35.4% of cycles, and did so from the decoder over 11.5% of cycles. Frontend utilization is thus higher than in the two games tested above. Still, the frontend is spending most of its time on break, or waiting for data.

Turn off the op cache, and IPC drops to 2.15. The decoders see surprisingly heavy utilization. On average they’re busy over 69% of core active cycles. I don’t know what’s going on here, but the same pattern showed up over some of SPEC CPU2017’s floating point workloads. Cinebench 2024 does use a lot of floating point operations, because 42.8% of ops from the frontend were dispatched to Zen 5’s floating point side.

I ran Cinebench and game tests on Windows, where I use my own performance monitoring program to log results. I wrote the program to give myself stats updated every second, because I wanted a convenient way to see performance limiters in everyday tasks. I later added logging capability, which logs on the same 1-second intervals. That gives me per-second data, unlike perf on Linux where I collect stats over the entire run. I can also shake things up and plot those 1-second intervals with IPC on one axis, and frontend busy-ness on the other.

Cinebench 2024 exhibits plenty of IPC variation as the benchmark renders different tiles. IPC can go as high as 3.63 over a one second interval (with the op cache on), which can push the capabilities of a single 4-wide decoder. Indeed, a single 4-wide decoder cluster starts to run out of headroom during higher IPC portions of Cinebench 2024’s single threaded test.

As an aside, ideally I’d have even faster sampling rates. But writing a more sophisticated performance monitoring program isn’t practical for a free time project. And I still think the graph above is a cool illustration of how the 4-wide decoder starts running out of steam during higher IPC sections of the benchmark, while the op cache has plenty of throughput left on tap.

Of course no one runs a rendering program in single threaded mode. I haven’t used Maxon’s Cinema 4D before, but Blender will grab all the CPU cores it can get its hands on right out of the box. In Cinebench 2024’s multi-threaded mode, I see just a 2.2% score difference with op cache enabled or disabled. Again, the two decode clusters show their worth in a multithreaded workload.

Cinebench 2024 sees its op cache hitrate drop to 78.1% in multi-threaded mode, highlighting how multiple threads put extra pressure on cache capacity. During certain 1-second intervals, hitrate can drop below 70%. Even though the op cache continues to do most of the work, Cinebench 2024’s multithreaded mode taps into the decoders a little more than other workloads here.

Dividing counts for event 0xAA by total core cycles (4.4G * 12 cores) because active cycles are counted per-thread

Disabling the op cache dumps a lot more load onto the decoders, but having two decode clusters lets the frontend handle it well. The decoders still can’t do as well as the op cache, and the frontend is a bit busier in pure-decoder mode. But as the overall performance results show, the decoders are still mostly keeping up.

Final Words

Turning off Zen 5's op cache offers a glimpse of how a modern core may perform when fed from a Steamroller-style decoder layout. With two 4-wide decoders, single threaded performance isn't great, but SMT performance is very good. Single threaded performance is still of paramount importance in client workloads, many of which can't take advantage of high core counts, let alone SMT threads. Zen 5's op cache therefore plays an important role in letting Zen 5 perform well. No one would design a modern high performance core fed purely off a Steamroller-style frontend, and it's clear why.

But this kind of dual cluster decoder does have its place. Two threads present the core with a larger cache footprint, and that applies to the instruction side too. Zen 5’s op cache is very often large enough to cover the vast majority of the instruction stream, even with both SMT threads in use. However, there are cases like Cinebench 2024 where the decoders sometimes have work to do.

I think Zen 5's clustered decoder targets these workloads. It takes a page from Steamroller's over-engineered frontend, and uses it to narrowly address cases where core IPC is likely to be high and code locality is likely poor. Specifically, that's the SMT case. The clustered decoder is likely part of AMD's strategy to cover as many bases as possible with a single core design. Zen 5's improved op cache focuses on maximizing single threaded performance, while the decoders step in for certain multithreaded workloads where the op cache isn't big enough.

The test subject (victim?) for this article. It's ok, it's over now. I'll turn your op cache back on in a moment

In the moment, Zen 5’s frontend setup makes a lot of sense. Optimizing a CPU design is all about striking the right balance. Zen 5’s competitive performance speaks for itself. But if we step back a few months to Hot Chips 2024, AMD, Intel, and Qualcomm all gave presentations on high performance cores there. All three were eight-wide, meaning their pipelines could handle up to eight micro-ops per cycle in a sustained fashion.

Zen 5 is the only core out of the three that couldn’t give eight decode slots to a single thread. Intel’s Lion Cove might not take the gaming crown, but Intel’s ability to run a plain 8-wide decoder at 5.7 GHz should turn some heads. For now, that advantage doesn’t seem to show up. I haven’t seen a high IPC, low-threaded workload with a giant instruction-side cache footprint. But who knows what the future will bring.

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.

Sizing up MI300A’s GPU

2025-01-21 04:42:52

AMD’s Instinct MI300A is a giant APU, created by swapping out two GPU chiplets (XCDs) for three CPU chiplets (CCDs). Even though MI300A integrates Threadripper-like CPU muscle, the chip’s main attraction is still its massive GPU compute power. Here, we’re going to size up MI300A’s GPU and see where it stands using comparison data from a few other GPUs, including MI300X.

Acknowledgments

Special thanks goes out to AMD and GIGABYTE along with their Launchpad service, who generously let Chips and Cheese play around with a massive quad-socket MI300A system in the form of the G383-R80-AAP1 for over 2 weeks. As always, our testing was our own.

We also have limited data from Nvidia's H100. H100 comes in both a PCIe and SXM5 version. I (Clam) rented an H100 PCIe cloud instance for the MI300X article from before. Cheese/Neggles rented an H100 SXM5 instance for benchmarking against the MI300A.

OpenCL Compute Throughput

MI300A may be down a couple XCDs compared to its pure GPU cousin, but it still has plenty of compute throughput. It’s well ahead of Nvidia’s H100 PCIe for just about every major category of 32-bit or 64-bit operations. H100’s SXM5 variant should slightly increase compute throughput, thanks to its higher SM count. But a 16% increase in SM count likely won’t do much to close the gap between H100 and either MI300 variant.

MI300A can achieve 113.2 TFLOPS of FP32 throughput, with each FMA counting as two floating point operations. For comparison, H100 PCIe achieved 49.3 TFLOPS in the same test.

Even though GPUs are vector throughput monsters, modern CPUs can pack pretty respectable vector throughput too. I’ve written a CPU-side multithreaded vector throughput test, mostly for fun. Using AVX-512, MI300A’s 24 Zen 4 cores can sustain 2.8 TFLOPS. That’s quite a bit compared to some older consumer GPUs. But MI300A’s GPU side is so massive that CPU-side throughput is a rounding error.
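The test is throughput bound by design: it keeps many independent FMA accumulators in flight so no iteration waits on the previous one. Here's a simplified single-threaded C sketch with AVX-512 intrinsics standing in for my actual test; compile with -mavx512f and run one instance per core, then sum the results.

    #include <immintrin.h>

    // Eight independent FP32 FMA chains; 8 vectors * 16 lanes * 2 FLOP = 256 FLOP per iteration
    float fma_throughput_loop(long iters) {
        __m512 a0 = _mm512_set1_ps(1.0f), a1 = a0, a2 = a0, a3 = a0,
               a4 = a0, a5 = a0, a6 = a0, a7 = a0;
        __m512 m = _mm512_set1_ps(1.0000001f), c = _mm512_set1_ps(1e-7f);
        for (long i = 0; i < iters; i++) {
            a0 = _mm512_fmadd_ps(a0, m, c); a1 = _mm512_fmadd_ps(a1, m, c);
            a2 = _mm512_fmadd_ps(a2, m, c); a3 = _mm512_fmadd_ps(a3, m, c);
            a4 = _mm512_fmadd_ps(a4, m, c); a5 = _mm512_fmadd_ps(a5, m, c);
            a6 = _mm512_fmadd_ps(a6, m, c); a7 = _mm512_fmadd_ps(a7, m, c);
        }
        __m512 s = _mm512_add_ps(_mm512_add_ps(a0, a1), _mm512_add_ps(a2, a3));
        s = _mm512_add_ps(s, _mm512_add_ps(_mm512_add_ps(a4, a5), _mm512_add_ps(a6, a7)));
        return _mm512_reduce_add_ps(s);  // GFLOPS = iters * 256 / elapsed nanoseconds
    }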

MI300A therefore strikes a very different balance between CPU and GPU size compared to a typical consumer integrated graphics solution like AMD’s Hawk Point. The Ryzen 7 PRO 8840HS’s Radeon 780M iGPU is no slouch with basic 32-bit operations, but CPU-side vector throughput is still significant.

Consumer GPU architectures de-emphasize FP64 throughput, so the 8840HS's eight Zen 4 cores can provide more throughput than the GPU side. Zen 4 also holds an advantage with integer multiplication, though it's not as big as with FP64. From a compute throughput perspective, MI300A is more like a GPU that happens to have a CPU integrated.

First Level Cache Bandwidth

Measuring cache bandwidth is another way to size up the MI300A. Here, I’m testing load throughput over a small 1 KB array. That should fit within L1 on any GPU. Normally, I use an image1d_buffer_t buffer because some older Nvidia GPUs only have a L1 texture cache, and plain global memory loads go through to L2. But AMD’s CDNA3 architecture doesn’t support texture operations. It’s meant to handle parallel compute after all, not rasterize graphics. Therefore, I’m using global memory accesses on both the MI300X and MI300A.
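The kernel itself is simple: every work-item streams loads from the same 1 KB window and accumulates them so the compiler can't throw the loads away. Here's a hedged OpenCL C sketch of the approach; the indexing details differ from my actual test.

    // 1 KB = 64 float4 elements; the & 63 wrap keeps all accesses inside that window
    __kernel void l1_bw(__global const float4 *src, __global float *out, int iters) {
        float4 acc = (float4)(0.0f);
        int offset = get_local_id(0) & 63;
        for (int i = 0; i < iters; i++)
            acc += src[(offset + i) & 63];
        out[get_global_id(0)] = acc.x + acc.y + acc.z + acc.w;  // defeat dead code elimination
    }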

H100 PCIe tested here

L1 bandwidth paints a similar picture to compute throughput. MI300A's GPU is a slightly smaller MI300X, a bit like how AMD's HD 7950 is a slightly smaller HD 7970. Again, the massive scale of AMD's MI300 platform lets it leave Nvidia's H100 in the dust even after making room for 24 CPU cores. The consumer oriented RX 6900XT also turns in a good showing, because it has very high bandwidth L1 vector caches and runs at higher clock speeds.

Besides caches, GPUs offer a local memory space only visible within a workgroup of threads. AMD GPUs back this up with a Local Data Share within each Compute Unit, while recent Nvidia GPUs allocate local memory out of their L1 caches.

MI300 GPUs have tons of local memory bandwidth on tap, with both variants pushing past 60 TB/s. Everything else gets left in the dust.

Atomics Throughput

Atomic operations can be useful for multithreaded applications, because they let a thread guarantee that a series of low-level operations happens without interference from any other thread. For example, adding a value to a memory location involves reading the old value from memory, performing the add, and writing the result back. If any other thread accesses the value between those operations, it'll get stale data, or worse, overwrite the result.

GPUs handle global memory atomics using special execution units at the L2 slices. In contrast, CPU cores handle atomics by retaining ownership of a cacheline until the operation completes. That means GPU atomic throughput can be constrained by execution units at the L2, even without contention. MI300A has decent throughput for atomic adds to global memory, though it falls short of what I expected. AMD's GPUs have had 16 atomic ALUs per L2 slice since GCN. I get close to that on my RX 6900 XT, but not on MI300A.

Because atomic adds are a read-modify-write operation, 306.65 billion operations per second on INT32 values translates to 2.45 TB/s of L2 bandwidth. On the 6900XT, 397.67 GOPS would be 3.2 TB/s. Perhaps MI300A has to do something special to ensure atomics work properly across such a massive GPU. Compared to a consumer GPU like the 6900XT, MI300A’s L2 slices don’t have sole ownership of a chunk of the address space. One address could be cached within the L2 on different XCDs. A L2 slice carrying out an atomic operation would have to ensure no other L2 has the address cached.

OpenCL also allows atomic operations on local memory, though of course local memory atomics are only visible within a workgroup and have limited scope. But that also means the Compute Unit doesn’t have to worry about ensuring consistency across the GPU, so the atomic operation can be carried out within the Local Data Share. That makes scaling throughput much easier.

Now, atomic throughput is limited by the ALUs within MI300’s LDS instances. That means it’s extremely high, and leaves the RX 6900XT in the dust.
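The two cases boil down to where the atomic is resolved: atomic_add on a __global pointer goes to the memory-side atomic units, while atomic_add on a __local variable stays inside the Compute Unit's LDS. Hedged OpenCL C sketches of both tests follow; buffer sizing and iteration counts are purely illustrative.

    // Global atomics: resolved at the L2/memory-side atomic units
    __kernel void global_atomic_add(__global int *counters) {
        int slot = get_group_id(0) & 255;   // spread workgroups over a set of counters
        for (int i = 0; i < 4096; i++)
            atomic_add(&counters[slot], 1);
    }

    // Local atomics: resolved within the Compute Unit's Local Data Share
    __kernel void local_atomic_add(__global int *out) {
        __local int counter;
        if (get_local_id(0) == 0) counter = 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int i = 0; i < 4096; i++)
            atomic_add(&counter, 1);
        barrier(CLK_LOCAL_MEM_FENCE);
        if (get_local_id(0) == 0) out[get_group_id(0)] = counter;
    }

Dividing completed adds by elapsed time gives operations per second. For the global case, multiplying by 8 bytes (a 4 byte read plus a 4 byte write per INT32 atomic) gives the L2 bandwidth figures quoted above.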

FluidX3D

FluidX3D uses the lattice Boltzmann method (LBM) to simulate fluid behavior. It features special optimizations that let it use FP32 instead of FP64 while still delivering good accuracy in “all but extreme edge cases.” In FluidX3D’s built-in benchmark, MI300A’s GPU lands closer to its larger cousin than microbenchmarks and specification sheets would suggest. The MI300A server we tested uses air cooling, so its default power target is 550W per GPU. MI300A is just 1% slower than MI300X at that power setting. With a higher 760W power target, MI300A manages to lead its nominally larger GPU-only sibling by 4.5%.

Testing FluidX3D is complicated because the software is constantly updated. Part of MI300A’s lead could be down to optimizations done after we tested MI300X back in June. FluidX3D also tends to be very memory bound when using FP32, so similar results aren’t a surprise. Both MI300 variants continue to place comfortably ahead of Nvidia’s H100, though the SXM5 variant does narrow the gap somewhat. H100’s SXM5 variant uses HBM3, giving it 3.9 TB/s of DRAM bandwidth compared to 3.35 TB/s on the H100 PCIe, which uses HBM2e.

Because FluidX3D is so heavily limited by memory bandwidth, the program can use 16-bit formats for data storage. Its FP16S mode uses IEEE-754’s FP16 format for storage. To maximize accuracy, it converts FP16 values to FP32 before doing computation. GPUs often have hardware support for FP16-to-FP32 conversion, which minimizes computation overhead from using FP16S for storage.
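In OpenCL, that hardware conversion path is exposed through vload_half and vstore_half, which read IEEE FP16 from memory and hand the shader an FP32 value (and back). A tiny illustrative kernel of the FP16S-style pattern, not FluidX3D's actual code:

    // Storage stays FP16; all arithmetic happens in FP32
    __kernel void fp16s_scale(__global const half *src, __global half *dst, float scale) {
        size_t i = get_global_id(0);
        float v = vload_half(i, src);    // hardware-assisted half -> float conversion
        vstore_half(v * scale, i, dst);  // float -> half on the way back out
    }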

MI300A and H100 both see improved performance in FP16S mode, and MI300A continues to hold a large advantage.

FluidX3D also has a FP16C mode, which uses a custom FP16 format. The mantissa field gets an extra bit, which is taken away from the exponent field. That improves accuracy while using the same memory bandwidth as FP16S mode. However, GPUs won’t have hardware support for converting a custom FP16 format to FP32. Conversion therefore requires a lot more instructions, increasing compute demands compared to FP16S mode.

MI300A continues to stay ahead of H100 in FP16C mode. AMD has built a much larger GPU from both the compute and memory bandwidth perspective, so MI300A has an advantage regardless of whether FluidX3D wants more compute or memory bandwidth.

Power target also matters more in FP16C mode. Going from 550W to 760W improves performance by 12.4% in FP16C mode. In FP32 or FP16S mode, going up to 760W only gave 5.3% or 6.5% more performance, respectively. Likely, FP16C mode’s demand for more compute along with its still considerable appetite for memory bandwidth translates to higher power demand.

Calculate Gravitational Potential (FP64)

Large compute GPUs like MI300 and H100 differentiate themselves from consumer GPUs by offering substantial FP64 throughput. Client GPUs have de-prioritized FP64 because graphics rendering doesn’t need that level of accuracy.

This is a self-penned workload by Clam (Chester), based on a long running workload from a high school internship. The program takes a FITS image from the Herschel Space Telescope with column density values, and carries out a brute force gravitational potential calculation. The original program written back in 2010 took a full weekend to run on a quad core Nehalem system, and was written in Java. I remember I started it before leaving the office on a Friday, and it finished after lunch on Monday. Now, I’ve hacked together a GPU accelerated version using OpenCL.
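My OpenCL port isn't public, but the structure is a textbook brute-force O(N²) sum: one work-item per pixel, each accumulating contributions from every other pixel in FP64. Here's a rough sketch of what such a kernel looks like; variable names and the exact potential formula are illustrative, not lifted from my program.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    // Each work-item owns one pixel and accumulates -G * sum(m_j / r_ij) over all others
    __kernel void grav_potential(__global const double *mass,
                                 __global const double *xpos,
                                 __global const double *ypos,
                                 __global double *potential,
                                 int n, double g_const)
    {
        int i = get_global_id(0);
        if (i >= n) return;
        double xi = xpos[i], yi = ypos[i];
        double sum = 0.0;
        for (int j = 0; j < n; j++) {          // O(N^2) FP64 adds, multiplies, and divides
            if (j == i) continue;
            double dx = xpos[j] - xi;
            double dy = ypos[j] - yi;
            sum += mass[j] / sqrt(dx * dx + dy * dy);
        }
        potential[i] = -g_const * sum;
    }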

AMD’s MI300A completes the workload in 17.5 seconds, which is a bit shorter than a weekend. I wish I had this sort of compute power back in 2010. Nvidia’s H100 PCIe manages to finish in 54 seconds. That’s still very fast, if not as fast as MI300A. Both datacenter GPUs are faster than the integrated GPU on the Ryzen 9 7950X3D, which takes more than four hours to finish the task.

Test was run on GPU 3. MI300A was unable to sustain 2.1 GHz even at the 760W power target

Comparing the three MI300 results is interesting too. MI300A’s GPU is smaller than the MI300X, and that’s clear from FP64 synthetic tests. When measuring compute throughput, I keep all values in registers. But a workload like this exercises the cache subsystem too, and data movement costs power. I should barely hit DRAM because my input data from 2010 fits within Infinity Cache. But still, my code makes MI300A use all the power it can get its hands on. Going from a 550W to 760W power target increases performance by 12.4%, which curiously puts MI300A ahead of MI300X. I wasn’t expecting that.

On one hand, I’m impressed at how much a power target difference can make. On the other, a 38% higher power target only improves performance by 12.4%. I’m not sure if that’s a great tradeoff. That’s a lot of extra power to finish the program in 17.5 seconds instead of 19.7 seconds.

GROMACS

GROMACS simulates molecular dynamics. Here, Cheese/Neggles put MI300A and H100 SXM5 through their paces on the STMV benchmark system. Again, both MI300 variants land in different performance segments compared to Nvidia’s H100.

Again, bigger power targets translate to higher performance. In GROMACS, MI300A enjoys a 15% performance gain in 760W mode. It’s even larger than in my gravitational potential computation workload or FP16C mode on FluidX3D.

Final Words

AMD had to cut down MI300X's GPU to create the MI300A. 24 Zen 4 cores is a lot of CPU power, and occupies one quadrant on the MI300 chip. But MI300's main attraction is still the GPU. AMD's integrated graphics chips in the PC space are still CPU-first solutions with a big enough iGPU to handle graphics tasks or occasional parallel compute. MI300A by comparison is a giant GPU that happens to have a CPU integrated. Those 24 Zen 4 cores are meant to feed the GPU and handle code sections that aren't friendly to GPU architectures. It's funny to see a 24 core CPU in that role, but that's how big MI300A is.

On the GPU side, MI300A punches above its weight. Synthetics clearly show it’s a smaller GPU than MI300X, but MI300A can hold its own in real workloads. Part of that is because GPU workloads tend to be bandwidth hungry, and both MI300 variants have the same memory subsystem. Large GPUs are often power constrained too, and MI300A may be making up some of the difference by clocking higher.

At a higher level, AMD has built such a massive monster with the MI300 platform that it has no problem kicking H100 around, even when dropping some GPU power to integrate a CPU. It’s an impressive showing because H100 isn’t a small GPU by any means. Products like MI300A and MI300X show AMD now has the interconnect and packaging know-how to build giant integrated solutions. That makes the company a dangerous competitor.

And again, we'd like to give a massive shout out to GIGABYTE and their Launchpad service, without whom this testing would not have been possible!

If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.