2025-05-29 23:49:20
Cloud computing rose rapidly around 2010, powered by AMD’s Opteron and Intel’s Xeon processors. The large cloud market piqued the interest of other CPU makers, including Qualcomm. Qualcomm had grown into a formidable force in the mobile SoC market by the mid-2010s, with several in-house CPU designs under their belt. They had good reason to be confident about their cloud server prospects. High core counts in server chips leave little power budget per core, blunting AMD and Intel’s greatest strength: high single threaded performance.
Meanwhile, Qualcomm’s mobile background gave them plenty of experience in low power CPU design. Their high mobile volume gave them access to Samsung’s 10nm FinFET process. That could at least level the playing field against Intel’s 14nm node, if not give Qualcomm an outright advantage in power consumption and density. Qualcomm hoped to use those factors to deliver a cheaper, lower power competitor on the cloud computing scene.
To break into the cloud computing segment, Qualcomm needed a CPU architecture with enough performance to meet key metrics in areas like tail latency. During a Hot Chips presentation, Qualcomm noted that throwing a pile of weak cores onto a chip and calling it a day wouldn’t cut it. An Arm cloud CPU may not have to match Intel and AMD core for core, but it does need to hit a baseline level of performance. Qualcomm hoped to do that while maintaining their traditional power and density advantages.
The Falkor CPU architecture aims to meet that performance threshold with low power and silicon area requirements. Falkor is a 4-wide aarch64 core with features inherited from Qualcomm’s prior mobile cores. It runs the 64-bit Arm instruction set (aarch64, armv8) with a few features pulled in from armv8.1. 32-bit Arm support is absent, as there’s no large existing install base of Arm server applications. Falkor is Qualcomm’s 5th in-house core design, and the company’s first designed specifically for cloud computing.
A Centriq 2400 series chip packs up to 48 Falkor cores on a 398 mm² die with a 120W TDP. That translates to less than 2.5W per core. Qualcomm notes that power consumption is usually well under 120W in typical all-core loads.
For this article, Neggles (Andi) was given a Centriq 2452 system from the folks at Corellium at no cost in order for us to test it. So a massive shout out to both Corellium for providing the system as well as to Neggles for getting the system up and running.
The Centriq 2452 system is set up with 96 GB of DDR4 running at 2666 MT/s, and identifies itself as the “Qualcomm Centriq 2400 Reference Evaluation Platform CV90-LA115-P23”.
Falkor has both a L0 and L1 instruction cache like Qualcomm’s prior Krait architecture, and possibly Kryo too. The 24 KB, 3-way set associative L0 delivers instructions at lower power and with lower latency. The L0 is sized to contain the large majority of instruction fetches, while the 64 KB 8-way L1 instruction cache handles larger code footprints. Although the L0 fills a similar role to micro-op caches and loop buffers in other CPUs, it holds ISA instruction bytes just like a conventional instruction cache.
Both instruction cache levels have enough bandwidth to feed Falkor’s 4-wide decoder. The two instruction cache levels are exclusive of each other, so the core effectively has 88 KB of instruction cache capacity. Qualcomm might use a victim cache setup to maintain that exclusive relationship. If so, incoming probes would have to check both the L0 and L1, and L1 accesses would incur the additional cost of a copy-back from L0 on top of a fill into L0. An inclusive setup would let the L1 act as a snoop filter for the L0 and reduce the cost of L1 accesses, but would have less total caching capacity.
The exclusive L0/L1 setup gives Falkor high instruction caching capacity compared to contemporary cores. Falkor wouldn't be beaten in this respect until Apple's M1 launched several years later. High instruction cache capacity makes L2 code fetch bandwidth less important. Like many 64-bit Arm cores of the time, or indeed AMD's pre-Zen cores, Falkor's instruction throughput drops sharply once code spills into L2. Still, Falkor performs better than A72 in that respect.
Falkor's instruction caches are parity protected, as is common for many CPUs. Hardware resolves parity errors by invalidating corrupted lines and reloading them from L2. The instruction caches also hold branch targets alongside instruction bytes, and therefore serve as branch target buffers (BTBs). A single cache access provides both instructions and branch targets, so Falkor doesn't have to make a separate BTB access like cores with a decoupled BTB do. However, that prevents the branch predictor from following the instruction stream past a L1i miss.
Taken branches incur one pipeline bubble (2 cycle latency) within L0, and up to 6 cycle latency from L1. For smaller branch footprints, Falkor can do zero-bubble taken branches using a 16 entry branch target instruction cache (BTIC). Unlike a BTB, the BTIC caches instructions at the branch destination rather than the target address. It therefore bypasses cache latency, allowing zero-bubble taken branches without needing the L0 to achieve single-cycle latency.
Direction prediction uses multiple history tables, each using a different history length. The branch predictor tracks which history length and corresponding table works best for a given branch. The scheme described by Qualcomm is conceptually similar to a TAGE predictor, which also uses multiple history tables and tags tables to indicate whether they’re useful for a given branch. Falkor doesn’t necessarily use a classic TAGE predictor. For example, the history lengths may not be in a geometric series. But the idea of economizing history storage by using the most appropriate history length for each branch still stands. Arm’s Cortex A72 uses a 2-level predictor, presumably with a single table and a fixed history length.
In an abstract test with varying numbers of branches, each taken or not-taken in random patterns of increasing length, Falkor does slightly better than Kryo. Falkor copes better when many branches are in play, though the longest repeating pattern either core can handle is similar for a small number of branches.
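To make that kind of test concrete, here's a minimal C sketch of the idea, where a set of conditional branches each follows its own repeating random pattern. The branch count, pattern length, and iteration count are arbitrary illustrative values, not the exact parameters behind the results above; the interesting behavior shows up as the pattern length is swept upward and per-branch time jumps once the predictor can no longer track it.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

// Branch predictor pattern test sketch: BRANCH_COUNT conditional branches each
// follow their own repeating random taken/not-taken pattern of length
// pattern_len. While the predictor can learn the patterns, per-branch time
// stays near the predicted-branch baseline; beyond that, mispredicts dominate.
#define BRANCH_COUNT 16
#define MAX_PATTERN 4096

static uint8_t patterns[BRANCH_COUNT][MAX_PATTERN];

int main(void) {
    int pattern_len = 512;              // sweep this to find the predictor's limit
    long iterations = 10 * 1000 * 1000;
    srand(42);
    for (int b = 0; b < BRANCH_COUNT; b++)
        for (int i = 0; i < pattern_len; i++)
            patterns[b][i] = rand() & 1;

    volatile long sink = 0;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < iterations; i++) {
        int idx = i % pattern_len;
        for (int b = 0; b < BRANCH_COUNT; b++) {
            if (patterns[b][idx])       // the branches under test
                sink++;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &e);
    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("%.2f ns per branch\n", ns / ((double)iterations * BRANCH_COUNT));
    return (int)sink;
}
```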
Falkor has a two-level indirect target array for indirect branches, which read their target from a register rather than specifying a jump distance. An indirect branch may go to different targets, adding another dimension of difficulty to branch prediction. Falkor’s first level indirect target array has 16 entries, while the second level has 512 entries.
Having multiple targets for an indirect branch carries little penalty as long as total target count doesn’t exceed 16. That can be one branch switching between 16 targets, or eight branches alternating between two targets each.
Returns are a special case of indirect branches, because they typically go back to the call site. Falkor has a 16 entry return stack like Kryo. Cortex A72 has a much larger return stack with 31 entries. A function call and return costs approximately four cycles on Falkor, Kryo, and A72, which would be an average of 2 cycles per branch-with-link instruction.
Falkor’s decoders translate up to four instructions per cycle into micro-ops. Like most other CPUs, Qualcomm aims to decode most common instructions into a single micro-op. 128-bit vector math instructions appear to be a notable exception.
Micro-ops from the decoders need resources allocated in the backend for bookkeeping during out-of-order execution. Falkor’s renamer can handle register renaming and resource allocation for up to four micro-ops per cycle. However, the fourth slot can only process direct branches and a few specialized cases like NOPs or recognized register zeroing idioms. A conditional branch that also includes an ALU operation, like cbz/cbnz, cannot go into the fourth slot.
Besides special handling for zeroing a register by moving an immediate value of zero into it, I didn’t see other common optimizations carried out. There’s no MOV elimination, and the renamer doesn’t recognize that XOR-ing or subtracting a register from itself results in zero.
Falkor doesn’t have a classic reorder buffer, or ROB. Rather, it uses a series of structures that together enable out-of-order execution while ensuring program results are consistent with in-order execution. Falkor has a 256 entry rename/completion buffer. Qualcomm further states Falkor can have 128 uncommitted instructions in flight, along with another 70+ uncommitted instructions for a total of 190 in-flight instructions. The core can retire 4 instructions per cycle.
From a microbenchmarking perspective, Falkor acts like Arm’s Cortex A73. It can free resources like registers and load/store queue entries past a long latency load, with no visible limit to reordering capacity even past 256 instructions. An unresolved branch similarly blocks out-of-order resource deallocation, after which Falkor’s reordering window can be measured. At that point, I may be measuring what Qualcomm considers uncommitted instructions.
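To make the methodology concrete, the usual way to probe this is to put two dependent, cache-missing loads on either side of a block of filler instructions: while the out-of-order window spans the filler, the two misses overlap and time per iteration stays flat, and it roughly doubles once the window is exceeded. Below is a rough C sketch of that idea under those assumptions; the array sizes, filler count, and iteration count are placeholders to sweep, and a real test would generate unrolled filler rather than a loop of NOPs.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define ELEMS (32 * 1024 * 1024)   // 256 MB per array: far larger than any cache

// Build a single-cycle random permutation (Sattolo shuffle) so every load
// misses and depends on the previous load in the same chain.
static void make_chain(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++) arr[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i;
        uint64_t t = arr[i]; arr[i] = arr[j]; arr[j] = t;
    }
}

int main(void) {
    uint64_t *a = malloc(ELEMS * sizeof(uint64_t));
    uint64_t *b = malloc(ELEMS * sizeof(uint64_t));
    srand(1);
    make_chain(a, ELEMS);
    make_chain(b, ELEMS);

    uint64_t pa = 0, pb = 0;
    long iters = 2 * 1000 * 1000;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < iters; i++) {
        pa = a[pa];                        // long latency miss #1
        for (int f = 0; f < 64; f++)       // filler between the misses; sweep this
            __asm__ volatile("nop");       // (a real test unrolls the filler)
        pb = b[pb];                        // long latency miss #2
    }
    clock_gettime(CLOCK_MONOTONIC, &e);
    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("%.1f ns per iteration with 64 filler instructions\n", ns / iters);
    return (int)(pa + pb);
}
```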
Kryo and Falkor have similar reordering capacity from that uncommitted instruction perspective. But otherwise Qualcomm has re-balanced the execution engine to favor consistent performance for non-vector code. Falkor has a few more register file entries than Kryo, and more crucially, far larger memory ordering queues.
Integer execution pipelines on Falkor are specialized to handle different categories of operations from each other. Three pipes have integer ALUs, and a fourth pipe is dedicated to direct branches. Indirect branches use one of the ALU ports. Another ALU port has an integer multiplier, which can complete one 64-bit multiply per cycle with 5 cycle latency. Each ALU pipe has a modestly sized scheduler with about 11 entries.
Falkor has two largely symmetric FP/vector pipelines, each also with an 11 entry scheduler. Both pipes can handle basic operations like FP adds, multiplies, and fused multiply-adds. Vector integer adds and multiplies can also execute down both pipes. More specialized operations like AES acceleration instructions are only supported by one pipe.
FP and vector execution latency is similar to Kryo, as is throughput for scalar FP operations. Both of Falkor’s FP/vector pipes have a throughput of 64 bits per cycle. 128-bit math instructions are broken into two micro-ops, as they take two entries in the schedulers, register file, and completion buffer. Both factors cut into potential gains from vectorized code.
Falkor’s load/store subsystem is designed to handle one load and one store per cycle. The memory pipeline starts with a pair of AGUs, one for loads and one for stores. Both AGUs are fed from a unified scheduler with approximately 13 entries. Load-to-use latency is 3 cycles for a L1D hit, and the load AGU can handle indexed addressing with no penalty.
Virtual addresses (VAs) from the load AGU proceed to the 32 KB 8-way L1 data cache, which can provide 16 bytes per cycle. From testing, Falkor can handle either a single 128-bit load or store per cycle, or a 64-bit load and a 64-bit store in the same cycle. Mixing 128-bit loads and stores does not bring throughput over 128 bits per cycle.
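For a rough idea of how that gets measured from software, the sketch below uses NEON intrinsics to issue one 128-bit load and one 128-bit store per step against a buffer small enough to stay in L1D. It's only an illustration under assumed parameters; a real test would also cover pure-load, pure-store, and 64-bit mixes, and would unroll more aggressively.

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    // 16 KB working set, comfortably inside a 32 KB L1D
    static uint64_t buf[2048] __attribute__((aligned(64)));
    long iters = 1000 * 1000;

    uint64x2_t acc = vdupq_n_u64(0);
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < iters; i++) {
        for (int j = 0; j < 2048; j += 4) {
            uint64x2_t v = vld1q_u64(&buf[j]);   // 128-bit load
            acc = vaddq_u64(acc, v);
            vst1q_u64(&buf[j + 2], acc);         // 128-bit store
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &e);
    double sec = (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) * 1e-9;
    double bytes = (double)iters * (2048 / 4) * 32.0; // 16B loaded + 16B stored per step
    printf("%.2f GB/s\n", bytes / sec / 1e9);
    return (int)vgetq_lane_u64(acc, 0);
}
```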
Every location in the cache has a virtual tag and a physical tag associated with it… If you don’t have to do a TLB lookup prior to your cache, you can get the data out faster, and you can return the data with better latency.
Qualcomm’s Hot Chips 29 presentation
The L1D is both virtually and physically tagged, which lets Falkor retrieve data from the L1D without waiting for address translation. A conventional VIPT (virtually indexed, physically tagged) cache could select a set of lines using the virtual address, but needs the physical address (PA) to be available before checking tags for hits. Qualcomm says some loads can skip address translation completely, in which case there’s no need for loads to check the physical tags at all. It’s quite an interesting setup, and I wonder how it handles multiple VAs aliasing to the same PA.
…a novel structure that is built off to the side of the L1 data cache, that acts almost like a write-back cache. It’s a combination of a store buffer, a load fill buffer, and a snoop filter buffer from the L2, and so this structure that sits off to the side gives us all the performance benefit and power savings of having a write-back cache without the need for the L1 data cache to be truly write-back
Qualcomm’s Hot Chips 29 presentation
Falkor’s store pipeline doesn’t check tags at all. The core has a write-through L1D, and uses an unnamed structure to provide the power and performance benefits of a write-back L1D. It functionally sounds similar to Bulldozer’s Write Coalescing Cache (WCC), so in absence of a better name from Qualcomm, I’ll call it that. Multiple writes to the same cacheline are combined at the WCC, reducing L2 accesses.
Stores on Falkor access the L1D physical tags to ensure coherency, and do so after they’ve reached the WCC. Thus the store combining mechanism also serves to reduce physical tag checks, saving power.
Qualcomm is certainly justified in saying they can deliver the performance of a write-back cache. A Falkor core can’t write more than 16B/cycle, and the L2 appears to have far more bandwidth than that. One way to expose the WCC is to issue one store per 128B cacheline, which reveals a 3 KB per-core structure that can write a 128B cacheline back to L2 once every 2-3 cycles. But software shouldn’t run into this in practice.
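As an illustration, the access pattern that exposes this is one store per 128-byte line over a growing footprint, so a store-combining structure never gets a second write to merge. A minimal C sketch of that pattern, with placeholder footprint and iteration counts meant to be swept:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

// One store per 128B cacheline: with nothing to combine, every store eventually
// forces a line's worth of traffic, so time per line jumps once the footprint
// exceeds whatever structure is absorbing the writes.
int main(void) {
    size_t footprint = 16 * 1024;     // sweep from ~1 KB to a few MB
    size_t stride = 128;              // Centriq's cacheline size
    size_t lines = footprint / stride;
    char *buf = aligned_alloc(128, footprint);
    long iters = 5 * 1000 * 1000;

    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < iters; i++)
        for (size_t l = 0; l < lines; l++)
            *(volatile uint64_t *)(buf + l * stride) = (uint64_t)i;  // one 8B store per line
    clock_gettime(CLOCK_MONOTONIC, &e);
    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("%zu byte footprint: %.2f ns per line written\n",
           footprint, ns / ((double)iters * lines));
    free(buf);
    return 0;
}
```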
Other architectures that use a write-through L1D, notably Intel’s Pentium 4 and AMD’s Bulldozer, suffered from poor store forwarding performance. Falkor doesn’t do well in this area, but isn’t terrible either. Loads that are 32-bit aligned within a store they depend on can get their data with 8 cycle latency (possibly 4 cycles for the store and 4 cycles for the load). Slower cases, including partial overlaps, are handled with just one extra cycle. I suspect most cores handle partial overlaps by waiting for the store to commit, then having the load read data from cache. Qualcomm may have given Falkor a more advanced forwarding mechanism to avoid the penalty of reading from the WCC.
Using a write-through L1D lets Qualcomm parity protect the L1D rather than needing ECC. As with the instruction caches, hardware resolves parity errors by reloading lines from lower level caches, which are ECC protected.
Unlike mobile cores, server cores may encounter large data footprints with workloads running inside virtual machines. Virtualization can dramatically increase address translation overhead, as program-visible VAs are translated to VM-visible PAs, which in turn are translated via hypervisor page tables to a host PA. A TLB miss could require walking two sets of paging structures, turning a single memory access into over a dozen accesses under the hood.
Kryo appears to have a single level 192 entry TLB, which is plainly unsuited to such server demands. Falkor ditches that all-or-nothing approach in favor of a more conventional two-level TLB setup. A 64 entry L1 DTLB is backed by a 512 entry L2 TLB. Getting a translation from the L2 TLB adds just two cycles of latency, making it reasonably fast. Both the L1 DTLB and L2 TLB store “final” translations, which map a program’s virtual address all the way to a physical address on the host.
Falkor also has a 64 entry “non-final” TLB, which caches a pointer to the last level paging structure and can skip much of the page walk. Another “stage-2” TLB with 64 entries caches translations from VM PAs to host PAs.
Server chips must support high core counts and high IO bandwidth, which is another sharp difference between server and mobile SoCs. Qualcomm implements Falkor cores in dual core clusters called duplexes, and uses that as a basic building block for their Centriq server SoC. Kryo also used dual core clusters with a shared L2, so that concept isn’t entirely alien to Qualcomm.
Falkor’s L2 is 512 KB, 8-way set associative, and inclusive of L1 contents. It serves both as a mid-level cache between L1 and the on-chip network, and as a snoop filter for the L1 caches. The L2 is ECC protected, because it can contain modified data that hasn’t been written back anywhere else.
Qualcomm says the L2 has 15 cycles of latency, though a pointer chasing pattern sees 16-17 cycles of latency. Either way, it’s a welcome improvement over Kryo’s 20+ cycle L2 latency. Kryo and Arm’s Cortex A72 used the L2 as a last-level cache, which gave them the difficult task of keeping latency low enough to handle L1 misses with decent performance, while also having enough capacity to insulate the cores from DRAM latency. A72 uses a 4 MB L2 cache with 21 cycle latency, while Kryo drops the ball with both high latency and low L2 capacity.
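For reference, latency figures like these come from a dependent-load (pointer chasing) loop over a randomly permuted array sized to sit in the cache level of interest. A minimal C sketch of such a test, with illustrative sizing rather than the exact parameters used here:

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

// Pointer chasing: each load's address comes from the previous load, so time
// per load is the full load-to-use latency of whatever level the array fits in.
int main(void) {
    size_t bytes = 256 * 1024;        // past a 32 KB L1D, inside a 512 KB L2
    size_t n = bytes / sizeof(uint64_t);
    uint64_t *arr = malloc(bytes);
    for (size_t i = 0; i < n; i++) arr[i] = i;
    srand(7);
    for (size_t i = n - 1; i > 0; i--) {  // Sattolo shuffle: one big cycle
        size_t j = rand() % i;
        uint64_t t = arr[i]; arr[i] = arr[j]; arr[j] = t;
    }

    long iters = 100 * 1000 * 1000;
    uint64_t p = 0;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < iters; i++)
        p = arr[p];                       // dependent load chain
    clock_gettime(CLOCK_MONOTONIC, &e);
    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("%.2f ns per load\n", ns / iters);
    free(arr);
    return (int)p;
}
```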
Multiple interleaves (i.e. banks) help increase L2 bandwidth. Qualcomm did not specify the number of interleaves, but did say each interleave can deliver 32 bytes per cycle. The L2 appears capable of handling a 128B writeback every cycle, so it likely has at least four interleaves. Two Falkor cores in a complex together have just 32B/cycle of load/store bandwidth, so the L2 has more than enough bandwidth to feed both cores. In contrast, the L2 caches on Kryo and A72 have noticeably less bandwidth than their L1 caches.
A Falkor duplex interfaces with the system using the Qualcomm System Bus (QSB) protocol. QSB is a proprietary protocol that fulfills the same function as the ACE protocol used by Arm. It can also be compared to Intel’s IDI or AMD’s Infinity Fabric protocols. The duplex’s system bus interface provides 32 bytes per cycle of bandwidth per direction, per 128B interleave.
Qualcomm uses a bidirectional, segmented ring bus to link cores, L3 cache, and IO controllers. Data transfer uses two sets of bidirectional rings, and traffic is interleaved between the two bidirectional rings at 128B cacheline granularity. In total, Centriq has four rings covering even and odd interleaves in both clockwise and counterclockwise directions. Qualcomm’s slides suggest each ring can move 32B/cycle, so the ring bus effectively has 64B/cycle of bandwidth in each direction.
A dual core cluster can access just under 64 GB/s of L3 bandwidth from a simple bandwidth test, giving Qualcomm a significant cache bandwidth advantage over Cortex A72. L3 bandwidth from a dual core Falkor complex is similar to that of a Skylake core on the Core i5-6600K.
Ring bus clients include up to 24 dual core clusters, 12 L3 cache slices, six DDR4 controller channels, six PCIe controllers handling 32 Gen 3 lanes, and a variety of low speed IO controllers.
Centriq’s L3 slices have 5 MB of capacity and are 20-way set associative, giving the chip 60 MB of total L3 capacity across the 12 slices. The 46 core Centriq 2452 has 57.5 MB enabled. Cache ways can be reserved to divide L3 capacity across different applications and request types, which helps ensure quality of service.
Addresses are hashed across L3 slices to enable bandwidth scalability, like many other designs with many cores sharing a large L3. Centriq doesn’t match L3 slice count to core count, unlike Intel and AMD designs. However, each Centriq L3 slice has two ring bus ports, so the L3 slices and Falkor duplexes have the same aggregate bandwidth to the on-chip network.
L3 latency is high at over 40 ns, or north of 100 cycles. That’s heavy for cores with 512 KB of L2. Bandwidth can scale to over 500 GB/s, which is likely adequate for anything except very bandwidth heavy vector workloads. Falkor isn’t a great choice for vector workloads anyway, so Centriq has plenty of L3 bandwidth. Latency increases to about 50 ns under moderate bandwidth load, and reaches 70-80 ns when approaching L3 bandwidth limits. Contention from loading all duplexes can bring latency to just over 90 ns.
Centriq’s L3 also acts as a point of coherency across the chip. The L3 is not inclusive of the upper level caches, and maintains L2 snoop filters to ensure coherency. In that respect it works like the L3 on AMD’s Zen or Intel’s Skylake server. Each L3 slice can track up to 32 outstanding snoops. Cache coherency operations between cores in the same duplex don’t need to transit the ring bus.
A core to core latency test shows lower latency between core pairs in a duplex, though latency is still high in an absolute sense. It also indicates Qualcomm has disabled two cores on the Centriq 2452 by turning off one core in a pair of duplexes. Doing so is a slightly higher performance option because two cores don’t have to share L2 capacity and a system bus interface.
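A test like that bounces a value between two threads through a shared cache line, with each side waiting to observe the other's write before responding. The C11-atomics sketch below shows the basic loop; pinning the two threads to specific cores (for example with pthread_setaffinity_np) is assumed to happen elsewhere, and the handoff count is arbitrary.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

// Two threads increment a shared atomic in strict alternation. Each increment
// only happens after seeing the other thread's update, so elapsed time divided
// by the number of handoffs approximates core-to-core transfer latency.
#define HANDOFFS 1000000

static _Atomic long shared_val = 0;

static void *partner(void *arg) {
    (void)arg;
    for (long i = 1; i < 2 * HANDOFFS; i += 2) {
        while (atomic_load(&shared_val) != i) { }  // wait for the main thread
        atomic_store(&shared_val, i + 1);
    }
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, partner, NULL);

    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (long i = 0; i < 2 * HANDOFFS; i += 2) {
        while (atomic_load(&shared_val) != i) { }  // wait for the partner thread
        atomic_store(&shared_val, i + 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &e);
    pthread_join(t, NULL);

    double ns = (e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec);
    printf("%.1f ns per handoff\n", ns / (2.0 * HANDOFFS));
    return 0;
}
```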
Centriq supports up to 768 GB of DDR4 across six channels. The memory controllers support speeds of up to 2666 MT/s for 128 GB/s of theoretical bandwidth. Memory latency is about 121.4 ns, and is poorly controlled under high bandwidth load. Latency can rise beyond 500 ns at over 100 GB/s of bandwidth usage. For comparison, Intel is able to keep latency below 200 ns at more than 90% bandwidth utilization. Still, Centriq has plenty of bandwidth in an absolute sense. Against contemporary Arm server competition like Amazon’s Graviton 1, Centriq has a huge bandwidth advantage. Furthermore, the large L3 should reduce DRAM bandwidth demands compared to Graviton 1.
Unlike Intel and AMD server processors, Centriq cannot scale to multi-socket configurations. That caps a Centriq server at 48 cores, while AMD’s Zen 1 and Intel’s Skylake can scale further using multiple sockets. Qualcomm’s decision not to pursue multi-socket support is reasonable, because cross-socket connections require both massive bandwidth and additional interconnect work. However, it does exclude more specialized cloud applications that benefit from VMs with over a hundred CPU cores and terabytes of memory. Having just 32 PCIe lanes also limits Centriq’s ability to host piles of accelerators. Even contemporary high end workstations had more PCIe lanes.
Thus Centriq’s system architecture is designed to tackle mainstream cloud applications, rather than trying to cover everything Intel does. By not tackling all those specialized applications, Qualcomm’s datacenter effort can avoid getting distracted and focus on doing the best job they can for common cloud scenarios. For those use cases, sticking with 32 PCIe lanes and integrating traditional southbridge functions like USB and SATA likely reduce platform cost. And while Centriq’s interconnect may not compare well to Intel’s, it’s worlds ahead of Graviton 1.
In SPEC CPU2017, a Falkor core comfortably outperforms Arm’s Cortex A72, with a 21.6% lead in the integer suite and a 53.4% lead in the floating point suite. It falls behind later Arm offerings on more advanced process nodes.
With SPEC CPU2017’s integer workloads, Falkor compares best in memory-bound workloads like 505.mcf and 502.gcc. Falkor pulls off a massive lead in several floating point subtests like 503.bwaves and 507.cactuBSSN, which inflates its overall lead in the floating point suite.
From an IPC perspective, Falkor is able to stretch its legs in cache-friendly workloads like 538.imagick. Yet not all high IPC workloads give Falkor a substantial lead. Cortex A72 is just barely behind in 548.exchange2 and 525.x264, two high IPC tests in SPEC CPU2017’s integer suite. It’s a reminder that Falkor is not quite 4-wide.
For comparison, I’ve included IPC figures from Skylake, a 4-wide core with no renamer slot restrictions. It’s able to push up to and past 3 IPC in easier workloads, unlike Falkor.
With 7-Zip set to use eight threads and pinned to four cores, Falkor achieves a comfortable lead over Cortex A72. Using one core per cluster provides a negligible performance increase over loading both cores across two clusters.
Unlike 7-Zip, libx264 is a well vectorized workload. Falkor has poor vector capabilities, but so does Cortex A72. Again, additional L2 capacity from using four duplexes provides a slight performance increase. And again, Falkor has no trouble beating A72.
Qualcomm’s Kryo mobile core combined high core throughput with a sub-par memory subsystem. Falkor takes a different approach in its attempt to break into the server market. Its core pipeline is a downgrade compared to Kryo in many respects. Falkor has fewer execution resources, less load/store bandwidth, and worse handling for 128-bit vectors. Its 3+1 renamer acts more as a replacement for branch fusion than making Falkor a truly 4-wide core, which is another step back from Kryo. Falkor improves in some respects, like being able to free resources out-of-order, but it lacks the raw throughput Kryo could bring to the table.
In exchange, Falkor gets a far stronger memory subsystem. It has more than twice as much instruction caching capacity. The load/store unit can track many more in-flight accesses and can perform faster store forwarding. Even difficult cases like partial load/store overlaps are handled well. Outside the core, Falkor’s L2 is much faster than Kryo’s, and L2 misses benefit from a 60 MB L3 behind a high bandwidth interconnect. Rather than spam execution units and core width, Qualcomm is trying to keep Falkor fed.
Likely, Falkor aims to deliver adequate performance across a wide variety of workloads, rather than exceptional performance on a few easy ones. Cutting back the core pipeline may also have been necessary to achieve Qualcomm’s density goals. 48 cores is a lot in 2017, and would have given Qualcomm a core count advantage over Intel and AMD in single socket servers. Doing so within a 120W envelope is even more impressive. Kryo was perhaps a bit too “fat” for that role. A wide pipeline and full 128-bit vector execution units take power. Data transfer can draw significant power too, and Kryo’s poor caching capacity did it no favors.
Falkor ends up being a strong contender in the 2017 Arm server market. Centriq walks all over Amazon’s Graviton 1, which was the first widely available Arm platform from a major cloud provider. Even with core cutbacks compared to Kryo, Falkor is still quite beefy compared to A72. Combined with a stronger memory subsystem, Falkor is able to beat A72 core for core, while having more cores on a chip.
But beating Graviton 1 isn’t enough. The Arm server scene wasn’t a great place to be around the late 2010s. Several attempts to make a density optimized Arm server CPU had come and gone. These included AMD’s “Seattle”, Ampere’s eMAG 8180, and Cavium’s ThunderX2. Likely, the strength of the x86-64 competition and nascent state of the Arm software ecosystem made it difficult for these early Arm server chips to break into the market. Against Skylake-X for example, Falkor is a much smaller core. Centriq’s memory subsystem is strong next to Kryo or A72’s, but against Skylake it has less L2 and higher L3 latency.
Qualcomm Datacenter Technologies no doubt accomplished a lot when developing the Centriq server SoC. Stitching together dozens of cores and shuffling hundreds of gigabytes per second across a chip is no small feat, and is a very different game from mobile SoC design. But taking on experienced players like Intel and AMD isn’t easy, even when targeting a specific segment like cloud computing. Arm would not truly gain a foothold in the server market until Ampere Altra came out after 2020. At that point, Arm’s stronger Neoverse N1 core and TSMC’s 7 nm FinFET process left Falkor behind. Qualcomm planned to follow up on Falkor with a “Saphira” core, but that never hit the market as far as I know.
However, Qualcomm is looking to make a comeback into the server market with their announcement of supplying HUMAIN, a Saudi state-backed AI company, with "datacenter CPUs and AI solutions". NVIDIA’s NVLink Fusion announcement also mentions Qualcomm as a provider of server CPUs that can be integrated with NVIDIA’s GPUs using NVLink. I look forward to seeing how that goes, and whether Qualcomm's next server CPU builds off experience gained with Centriq.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-05-23 11:46:57
Hello you fine Internet folks,
Today we are covering the announcements that AMD had at Computex 2025 which were the RX 9060 XT, Threadripper 9000, the Radeon AI Pro R9700, and more ROCm commitments. Due to time constraints and the vastness of Computex, the transcript/article will be done after Computex.
Hope y'all enjoy!
2025-05-22 09:41:30
Hello you fine Internet folks,
Today we are covering the announcements that Nvidia had at Computex 2025 which were the RTX PRO servers, NVLink Fusion, a new supercomputer being built in Taiwan, and the announcement of a new building in Taiwan. Due to time constraints and the vastness of Computex, the transcript/article will be done after Computex.
Hope y'all enjoy!
2025-05-19 18:37:26
Hello you fine Internet folks,
Today Intel is announcing their successor to the Intel ARC Pro A-Series GPUs in the form of the Intel ARC Pro B-Series GPUs. The B-Series is launching in the form of two SKUs, the B50 and the B60. Due to time constraints and the vastness of Computex, the transcript/article will be done after Computex.
Hope y'all enjoy!
2025-05-10 11:50:53
Arm (the company) is best known for its Cortex CPU line. But Arm today has expanded to offer a variety of licensable IP blocks, ranging from interconnects to IOMMUs to GPUs. GPUs make for an interesting discussion topic because they’ve evolved to become highly programmable and complex components, much like CPUs. Like CPU performance, GPU performance is highly visible to users, and forms an important part of a device’s specifications.
Arm Mali GPUs target low power and embedded devices, a characteristic shared with Arm’s Cortex CPUs. As a GPU, Mali tackles the same fundamental problems as high performance discrete GPUs that gamers and PC enthusiasts are familiar with. Graphics processing has plenty of inherent parallelism; it maps well to hardware that can track a lot of parallel work, and map it to a wide array of execution units. However, power and area constraints force a delicate approach to exploiting parallelism. A low-end laptop GPU may have at most a dozen watts to work with; anything above 6W will likely be unsustainable in a cell phone or tablet. Die area constraints are similarly tight, because an iGPU has to share a small die alongside a CPU and numerous accelerator blocks.
In another difference from AMD, Intel, and Nvidia GPUs, Mali is licensed out as a standalone IP block; Arm does not control the chip design process. Instead, implementers buy Arm IP and bring together a wide variety of other IP blocks to meet their chip-level goals. This business model makes Mali peculiar in the GPU space. Mali only handles 3D rendering and parallel compute, doesn’t provide hardware acceleration for video codecs, and can’t even drive a display by itself. A PC enthusiast expects this functionality to come with any GPU. However, excluding it from the Mali package lets implementers pick and choose video and display engines to meet their needs. Hypothetically, an implementer could even go without video and display engines, and use Mali purely as compute accelerator. Lack of control over the chip design process creates further challenges: Mali has to perform well across the widest possible range of use cases to increase Arm’s customer base, yet Arm has to do so with no control over the all-important chip-level memory subsystem.
Bifrost is Arm’s second generation unified shader architecture from around 2016. It comes after Midgard, which brought unified shaders to Mali well after AMD, Intel, and Nvidia did so for their GPU lines. For this article I’ll use data from the Mali-G52, as implemented in the Amlogic S922X. The Mali-G52 is a very small GPU, so I’ll also use comparison data from Qualcomm’s Adreno 615, as implemented in the Snapdragon 670.
GPU programming interfaces saw enormous change in the late 2000s and early 2010s. Graphics APIs moved to a unified shader model, where different shader stages run on the same execution pipelines. GPU compute rose quickly as those execution pipelines became increasingly flexible. Arm’s Midgard got onboard the programmability train with OpenGL ES 3.0, Vulkan, and OpenCL 1.1 support. While Midgard could handle modern APIs, its VLIW4 setup had brittle performance characteristics. Arm’s compiler could be hard pressed to extract enough instruction level parallelism to fill a VLIW4 bundle, especially with compute code. Even in graphics code, Arm noted that 3-wide vectors were very common and could leave one VLIW4 component unused.
Bifrost switches to a scalar, dual issue execution model to address Midgard’s shortcomings. From a single thread’s point of view, registers are now 32-bits wide rather than 4×32-bit vectors. Instead of having one thread issue operations to fill four FP32 lanes, Bifrost relies on multiple API threads to fill an Execution Engine’s four or eight lanes. A lane in Bifrost feeds both a FMA and FADD execution pipeline, so Bifrost still benefits from instruction level parallelism. However, packing two operations into an instruction should be easier than four. As a result, Arm hopes to achieve more consistent performance with a simpler compiler.
Arm’s move from SIMD to scalar execution within each API thread with Bifrost parallels AMD’s Terascale to GCN transition, aiming for more consistent performance across compute workloads.
Midgard and Bifrost may be vastly different at the execution pipeline level, but the two share a similar high level organization. Bifrost Execution Engines (EEs) contain execution pipelines and register files, and act as a replacement for Midgard’s arithmetic pipelines. Looking outside Arm, EEs are the rough equivalent of Intel Execution Units (EUs) or AMD SIMDs.
Multiple EEs live in a Shader Core (SC). A messaging fabric internal to the Shader Core links EEs to memory pipelines and other shared fixed function hardware. A SC’s texture and load/store units include first level caches, making them a close equivalent to Intel’s subslices, AMD’s CUs or WGPs, or Nvidia’s SMs. One difference is that Bifrost places pixel backends (ROPs) at the Shader Core level, while desktop GPU architectures place them at a higher level subdivision. Bifrost doesn’t have another subdivision level beyond the Shader Core.
A defining Mali characteristic is an extraordinary number of levers for tuning GPU size. Besides adjusting Shader Core count, Arm can adjust a Shader Core’s EE count, cache sizes, and ROP/TMU throughput.
Flexibility extends to the EEs, which can operate on 4- or 8-wide warps, with correspondingly wide execution pipelines. Arm can therefore finely adjust GPU size in multiple dimensions to precisely target performance, power, and area goals. By comparison, AMD and Nvidia generally use the same WGP/SM structure, from integrated GPUs all the way to 300W+ monsters.
Bifrost can theoretically scale to 32 Shader Cores. Doing so with triple-EE SCs would provide 1.23 TFLOPS of FP32 FMA performance at 800 MHz, which is in the same ballpark as Intel’s largest Skylake GT4e configuration. It’s not high power or high performance by discrete GPU standards, but well above what would fit in an average cell phone or tablet. The Mali-G52 in the Amlogic S922X is a small Bifrost configuration, with two triple-EE SCs running at 800 MHz, each EE 8-wide. Qualcomm’s Adreno can scale by varying Shader Processor count and uSPTP size. Adreno 6xx’s execution unit partitions are much larger at either 64- or 128-wide.
Shader Cores across a Bifrost GPU share a L2 cache. A standard ACE memory bus connects Bifrost to the rest of the system, and Arm’s influence ends at that point.
Bifrost’s instruction cache capacity hasn’t been disclosed, but instruction throughput is highest in a loop with 512 or fewer FP adds; as the loop body exceeds 1280 FP adds, instruction throughput takes another dip. Bifrost uses 78-bit instructions, which specify two operations corresponding to the FMA and FADD pipes. Arm’s compiler can issue FP adds down both pipes. Compiled binary size increases by 6-7 bytes for each FP add statement, with FMA+FADD packing decreasing size and clause/quadword header adding overhead. Based on binary size increase over a baseline, instruction cache capacity is possibly around 8 KB, which would put it in line with Qualcomm’s Adreno 6xx.
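A test along those lines can be driven by generating kernel source with a growing number of FP add statements in the loop body, then timing each variant. The helper below is a hypothetical sketch of that generation step; the kernel name and accumulator scheme are made up for illustration, and the emitted source would go to clCreateProgramWithSource in the usual way.

```c
#include <stdio.h>
#include <stdlib.h>

// Generate OpenCL C source whose loop body holds n_adds independent FP adds
// spread over four accumulators. Sweeping n_adds and timing the compiled
// kernel shows where the loop body stops fitting in the instruction cache.
static char *make_icache_kernel(int n_adds) {
    size_t cap = 512 + (size_t)n_adds * 32;
    char *src = malloc(cap);
    size_t len = 0;
    len += snprintf(src + len, cap - len,
        "__kernel void icache_test(__global float *out, int iters) {\n"
        "  float a0 = out[get_global_id(0)];\n"
        "  float a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;\n"
        "  float b = 1.001f;\n"
        "  for (int i = 0; i < iters; i++) {\n");
    for (int i = 0; i < n_adds; i++)
        len += snprintf(src + len, cap - len, "    a%d += b;\n", i % 4);
    snprintf(src + len, cap - len,
        "  }\n"
        "  out[get_global_id(0)] = a0 + a1 + a2 + a3;\n"
        "}\n");
    return src;
}

int main(void) {
    char *src = make_icache_kernel(512);  // near the first throughput knee seen above
    puts(src);                            // hand this to clCreateProgramWithSource
    free(src);
    return 0;
}
```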
Each Execution Engine tracks state for up to 16 warps, each of which corresponds to a vector of eight API threads executing in lockstep. Hardware switches between warps to hide latency, much like SMT on a CPU. Bifrost uses a clause-based ISA to simplify scheduling. Instructions packed into a clause execute atomically, and architectural state is only well-defined between clauses. Only one instruction in a clause can access long- and variable-latency units outside the EU, such as memory pipelines.
Memory dependencies are managed between clauses, so an instruction that needs data from a memory access must go into a separate clause. A 6-entry software managed scoreboard specifies cross-clause dependencies. Clauses reduce pressure on scheduling hardware, which only has to consult the scoreboard at clause boundaries rather than with each instruction. Bifrost has parallels to AMD’s Terascale, which also uses clauses, though their implementation details differ. Terascale groups instructions into clauses based on type; for example, math instructions go into an ALU clause, and memory accesses go into separate texture- or vertex-fetch clauses.
From a programmer’s perspective, Mali-G52 can theoretically have 768 active workitems across the GPU; that’s eight lanes per warp * 16 warps per EE * 6 EEs across the GPU. In practice, actual active thread count, or occupancy, can vary depending on available parallelism and register usage; Bifrost’s ISA provides up to 64 registers, but using more than 32 will halve theoretical occupancy (implying 16 KB of register file capacity). There are no intermediate allocation steps. For comparison, Qualcomm’s Adreno 6xx can only achieve maximum occupancy with 12 registers per thread.
Instruction execution is split into register access, FMA, and FADD stages. During the register access stage, a Bifrost instruction takes direct control of the operand collector to read instruction inputs and write results from the prior instruction. Each EE’s register file has four ports; two can handle reads, one can handle writes, and one can handle either. Feeding the EE’s FMA and FADD pipes would nominally require five inputs, so register read bandwidth is very limited. If a prior instruction wants to write results from both the FMA and FADD pipes, register bandwidth constraints only become more severe.
To alleviate register file bandwidth demands, Bifrost can source inputs from a uniform/constant port, which provides 1×64 bits or 2×32 bits of data from immediate values embedded into a clause. Additionally, Bifrost provides “temporary registers”, which are really software-controlled forwarding paths that hold results from the prior instruction. Finally, because the FADD unit is placed at a later pipeline stage than the FMA unit, the FADD unit can use the FMA unit’s result as an input.
Bifrost’s temporary registers, software-managed operand collector, and register bandwidth constraints will be immediately familiar to enjoyers of AMD’s Terascale 2 architecture. Terascale 2 uses 12 register file inputs to feed five VLIW lanes, which could need up to 15 inputs. Just like Bifrost, AMD’s compiler uses a combination of register reuse, temporary registers (PV/PS), and constant reads to keep the execution units fed. Like Bifrost, PV/PS are only valid between consecutive instructions in the same clause, and reduce both register bandwidth and allocation requirements. One difference is that result writeback on Terascale 2 doesn’t share register file bandwidth with reads, so using temporary registers isn’t as critical.
Bifrost’s execution pipelines have impressive flexibility when handling different data types and essentially maintain 256-bit vector execution (or 128-bit on 4-wide EE variants) with 32/16/8-bit data types. Machine learning research was already well underway in the years leading up to Bifrost, and rather than going all-in as Nvidia did with Volta’s dedicated matrix multiplication units, Arm made sure the vector execution units could scale throughput with lower-precision types.
Qualcomm’s Adreno 615 takes a different execution strategy, with 64-wide warps and correspondingly wide execution units. That makes Adreno 615 more prone to divergence penalties, but lets Qualcomm control more parallel execution units with one instruction. Adreno 615 has 128 FP32 lanes across the GPU, all capable of multiply-adds, and runs them at a very low 430 MHz. Mali-G52 can only do 48 FP32 FMA operations per clock, but can complete 96 FP32 operations per clock with FMA+FADD dual issue. Combined with a higher 800 MHz clock speed, Mali-G52 can provide similar FP add throughput to Adreno 615. However, Adreno 615 fares better with multiply-adds, and can reach just over 100 GFLOPS.
I’m being specific by saying multiply-adds, not fused multiply-adds; the latter rounds only once after the multiply and add (both computed with higher intermediate precision), which improves accuracy. Adreno apparently has no fast-path FMA hardware, and demanding FMA accuracy (via OpenCL’s fma function) requires over 600 cycles per FMA per warp. Bifrost handles FMA with no issues. Both mobile GPUs shine with FP16, which executes at double rate compared to FP32.
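The contrast is easy to reproduce with a pair of OpenCL kernels that run the same chain of operations through fma() and through a plain a*b+c, which the compiler is free to map to a mad instruction. The sketch below only shows the kernel side and uses a single dependent chain per work-item; a throughput-bound test would want several independent accumulators.

```c
// Dependent multiply-add chain using OpenCL's fma() built-in, which requires
// single-rounding fused behavior. On a GPU without fast-path FMA hardware,
// this collapses to a slow emulation path.
__kernel void fma_rate(__global float *out, int iters) {
    float a = out[get_global_id(0)];
    float b = 1.0001f, c = 0.5f;
    for (int i = 0; i < iters; i++)
        a = fma(a, b, c);
    out[get_global_id(0)] = a;
}

// Same chain as a plain multiply-add, which the compiler may lower to mad and
// run at full rate.
__kernel void mad_rate(__global float *out, int iters) {
    float a = out[get_global_id(0)];
    float b = 1.0001f, c = 0.5f;
    for (int i = 0; i < iters; i++)
        a = a * b + c;
    out[get_global_id(0)] = a;
}
```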
Special functions like inverse square roots execute down Bifrost’s FADD pipeline, at half rate compared to basic operations (or quarter rate if considering FMA+FADD dual issue). Arm has optimized handling for such complex operations on Bifrost compared to Midgard. Only the most common built-in functions exposed in APIs like OpenCL get handled with a single instruction. More complex special operations take multiple instructions. Adreno executes special functions at a lower 1/8 rate.
Integer operations see a similar split on Bifrost as FP ones. Integer adds can execute down the FADD pipe, while multiplies use the FMA pipe. Adreno’s more uniform setup gives it an advantage for adds; both of Bifrost’s pipes can handle integer operations at full rate, giving Bifrost an advantage for integer multiplies.
Lower precision INT8 operations see excellent throughput on Bifrost, but suffer glass jaw behavior on Adreno. Clearly Qualcomm didn’t implement fast-path INT8 hardware, but an INT8 operation can be carried out on INT32 units with the result masked to 8 bits. Terascale 2 also lacks INT8 hardware, but can emulate them at just under half rate. FP64 support is absent on both mobile GPUs.
Bifrost’s execution engines are built for maximum flexibility, and handle a wider range of operations with decent performance than Qualcomm’s Adreno does. Adreno, by comparison, appears tightly optimized for graphics. Graphics rasterization doesn’t need the higher precision of fused multiply-adds, nor does it need lower precision INT8 operations. Qualcomm packs a variety of other accelerators onto Snapdragon chips, which could explain why they don’t feel the need for consistently high GPU performance across such a wide range of use cases. Arm’s licensing business model means they can’t rely on the chip including other accelerators, and Bifrost’s design reflects that.
Bifrost’s memory subsystem includes separate texture and load/store paths, each with their own caches. EEs access these memory pipelines through the Shader Core’s messaging network. Intel uses similar terminology, with EUs accessing memory by sending messages across the subslice’s internal messaging fabric. AMD’s CUs/WGPs and Nvidia’s SMs have some sort of interconnect to link execution unit partitions to shared memory pipelines; Arm and Intel’s intra-core networks may be more flexible still, since they allow for variable numbers of execution unit partitions.
Arm’s documentation states Mali-G52’s load/store cache and texture caches are both 16 KB. However, latency testing suggests the texture cache is 8 KB. These parameters may be configurable on implementer request, and indeed vary across different Bifrost SKUs. For example, Mali-G71, a first-generation Bifrost variant, has 16 KB and 8 KB load/store and texture caches respectively on the spec sheet.
Pointer chasing in the texture cache carries slightly higher latency than doing so in the load/store cache, which isn’t surprising. However it’s worth noting some GPUs, like AMD’s Terascale, have TMUs that can carry out indexed addressing faster than doing the same calculation in the programmable shader execution units.
Texture cache bandwidth is low on Bifrost compared to Adreno and Intel’s Gen 9. OpenCL’s read_imageui function returns a vector of four 32-bit integers, which can be counted as a sample of sorts. Mali-G52 can deliver 26.05 bytes per Shader Core cycle via read_imageui, consistent with Arm’s documentation which states a large Shader Core variant can do two samples per clock. Adreno 615 achieves 61.3 bytes per uSPTP cycle, or four samples. That’s enough to put Adreno 615 ahead despite its low clock speed. I suppose Qualcomm decided to optimize Adreno’s texture pipeline for throughput rather than caching capacity, because its 1 KB texture cache is small by any standard.
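A measurement like that boils down to a kernel where each work-item issues read_imageui samples against a small image; sampled bytes divided by elapsed time gives texture bandwidth, since each uint4 sample is 16 bytes. Below is a hedged sketch of such a kernel, with the sampler settings and the 64×64 access window chosen purely for illustration (a small window keeps the footprint inside the texture cache).

```c
// Texture read bandwidth sketch: each work-item reads texels via read_imageui.
// Keep the accessed region small so reads hit the texture cache rather than L2.
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE |
                           CLK_FILTER_NEAREST;

__kernel void tex_bw(read_only image2d_t img, __global uint4 *out, int iters) {
    int gid = get_global_id(0);
    uint4 acc = (uint4)(0);
    for (int i = 0; i < iters; i++) {
        int2 coord = (int2)((gid + i) & 63, i & 63);  // stay within a 64x64 region
        acc += read_imageui(img, smp, coord);         // one 16-byte sample
    }
    out[gid] = acc;  // keep the sampling loop from being optimized away
}
```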
It’s amazing how little cache bandwidth these mobile GPUs have. Even on the same chip, the Amlogic S922X’s four A73 cores can hit 120 GB/s of L1D read bandwidth. Global memory bandwidth for compute applications is similarly limited. Adreno 615 and Mali-G52 are about evenly matched. A Bifrost SC delivered 16 bytes per cycle in a global memory bandwidth test, while an Adreno 615 uSPTP can load 32 bytes per cycle from L2.
Bifrost’s L1 can likely deliver 32 bytes per cycle, matching the texture cache, because I can get that from local memory using float4 loads. I don’t have a float4 version of my global memory bandwidth test written yet, but results from testing local memory should suffice:
Bifrost metaphorically drops local memory on the ground. GPU programming APIs provide a workgroup-local memory space, called local memory in OpenCL or Shared Memory in Vulkan. GPU hardware usually backs this with dedicated on-chip storage. Examples include AMD’s Local Data Share, and Nvidia/Intel reserving a portion of cache to use as local memory.
Mali GPUs do not implement dedicated on-chip shared memory for compute shaders; shared memory is simply system RAM backed by the load-store cache just like any other memory type
Bifrost doesn’t give local memory any special treatment. An OpenCL kernel can allocate up to 32 KB of local memory, but accesses to local memory aren’t guaranteed to remain on-chip. Worse, each Shader Core can only have one workgroup with local memory allocated, even if that workgroup doesn’t need all 32 KB.
GPUs that back local memory with on-chip storage can achieve better latency than Bifrost; that includes Adreno. Qualcomm disclosed that Adreno X1 allocates local memory out of GMEM, and their prior Adreno architectures likely did the same. However, Qualcomm doesn’t necessarily enjoy a bandwidth advantage, because GMEM access similarly appears limited to 32 bytes per cycle.
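For reference, a local memory test is just a kernel that stages data into a __local buffer and then streams reads from it: on GPUs with dedicated shared memory those reads stay on-chip, while on Bifrost they go through the load/store cache like any other access. A minimal sketch, with buffer size and indexing chosen arbitrarily:

```c
// Local (workgroup shared) memory bandwidth sketch: fill an 8 KB __local
// buffer, then stream float4 reads from it. Each read moves 16 bytes.
__kernel void local_bw(__global float4 *out, int iters) {
    __local float4 lbuf[512];
    int lid = get_local_id(0);
    int lsize = get_local_size(0);
    for (int i = lid; i < 512; i += lsize)
        lbuf[i] = (float4)((float)i, (float)(i + 1), (float)(i + 2), (float)(i + 3));
    barrier(CLK_LOCAL_MEM_FENCE);

    float4 acc = (float4)(0.0f);
    for (int i = 0; i < iters; i++)
        acc += lbuf[(lid + i) & 511];
    out[get_global_id(0)] = acc;  // prevent the reads from being optimized away
}
```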
Bifrost’s L2 functionally works like the L2 cache in modern AMD and Nvidia GPUs. It’s a write-back cache built from multiple slices for scalability. Arm expects Bifrost implementations to have 64-128 KB of L2 per Shader Core, up from Midgard’s corresponding 32-64 KB figure. Amlogic has chosen the lower end option, so the Mali-G52 has 128 KB of L2. A hypothetical 32 Shader Core Bifrost GPU may have 2-4 MB of L2.
On the Amlogic S922X, L2 latency from the texture side is slightly better than on Qualcomm’s Adreno 615. However, Adreno 615 enjoys better L2 latency for global memory accesses, because it doesn’t need to check L1 on the way. L2 bandwidth appears somewhat lower than Adreno 615, though Mali-G52 has twice as much L2 capacity at 128 KB. However the Snapdragon 670 has a 1 MB system level cache, which likely mitigates the downsides of a smaller GPU-side L2.
Bifrost shows reasonably good latency when using atomic compare and exchange operations to pass data between threads. It’s faster than Adreno 615 when using global memory, though Qualcomm offers lower latency if you use atomics on local memory.
GPUs often handle atomic operations using dedicated ALUs close to L2 or backing storage for local memory. Throughput for INT32 atomic adds isn’t great compared to Intel, and is very low next to contemporary discrete GPUs from AMD and Nvidia.
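Tests of that sort simply have every work-item hammer atomic adds, spread over enough separate counters that the result reflects atomic ALU throughput rather than serialization on a single address. A small OpenCL sketch of the idea, with one counter per workgroup as an arbitrary choice:

```c
// INT32 atomic add throughput sketch: each work-item issues atomic adds to its
// workgroup's counter. Total atomics divided by elapsed time gives throughput.
__kernel void atomic_add_rate(__global int *counters, int iters) {
    __global int *target = &counters[get_group_id(0)];
    for (int i = 0; i < iters; i++)
        atomic_add(target, 1);
}
```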
L2 misses head out to the on-chip network, and make their way to the DRAM controller. Amlogic has chosen a 32-bit DDR4-2640 interface for the S922X, which provides 10.56 GB/s of theoretical bandwidth. A chip’s system level architecture is ultimately up to the implementer, not Arm, which can affect Bifrost’s system level feature support.
In alignment with Bifrost’s compute aspirations, Arm designed a L2 that can accept incoming snoops. With a compatible on-chip interconnect and CPU complex, Bifrost can support OpenCL Shared Virtual Memory with fine-grained buffer sharing. That lets the CPU and GPU share data without explicit copying or map/unmap operations. Evidently Amlogic’s setup isn’t compatible, and Mali-G52 only supports coarse-grained buffer sharing. Worse, it appears to copy entire buffers under the hood with map/unmap operations.
Qualcomm on the other hand controls chip design from start to finish; Adreno 615 supports zero-copy behavior, and Qualcomm’s on-chip network has the features needed to make that happen.
While many modern GPUs can support zero-copy data sharing with the CPU, copy performance can still matter. Besides acting as a baseline way of getting data to the GPU and retrieving results, buffers can be de-allocated or reused once their data is copied to the GPU. Copy bandwidth between the host and GPU is low on the Amlogic S922X at just above 2 GB/s. A copy requires both a read and a write, so that would be 4 GB/s of bandwidth. That’s in line with measured global memory bandwidth above, and suggests the DMA units aren’t any better than the shader array when it comes to accessing DRAM bandwidth.
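The copy bandwidth numbers here are essentially a timed clEnqueueWriteBuffer, with clEnqueueReadBuffer covering the GPU-to-host direction. The helper below sketches that measurement, assuming a context and in-order command queue have already been created; buffer size, repeat count, and the lack of error handling are all simplifications.

```c
#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// Host-to-GPU copy bandwidth sketch: time repeated blocking writes of a large
// buffer. Swap clEnqueueWriteBuffer for clEnqueueReadBuffer to measure the
// GPU-to-host path.
static double copy_bandwidth_gbps(cl_context ctx, cl_command_queue q, size_t bytes) {
    void *host = malloc(bytes);
    memset(host, 1, bytes);
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS) { free(host); return 0.0; }

    // Warm-up copy so allocation and first-touch costs aren't counted.
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);

    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i = 0; i < 10; i++)
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, bytes, host, 0, NULL, NULL);
    clock_gettime(CLOCK_MONOTONIC, &e);

    double sec = (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) * 1e-9;
    clReleaseMemObject(buf);
    free(host);
    return (10.0 * bytes / sec) / 1e9;
}
```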
Adreno 615 has better copy performance, possibly helped by its faster LPDDR4X DRAM interface. However, copying data back from the GPU is agonizingly slow. Games largely transfer data to the GPU, not the other way around, so that’s another sign that Adreno is tightly optimized for gaming.
Adreno and Mali share another common problem: mobile devices can’t afford high bandwidth DRAM interfaces compared to desktop CPUs, let alone GPUs with wide GDDR setups. However, graphics rasterization can be bandwidth-hungry, and ROPs can be a significant source of bandwidth pressure. When pixel/fragment shaders output pixel colors to the ROPs, the ROPs have to ensure those results are written in the correct order. That can involve depth testing or alpha blending, as specified by the application.
Both Adreno 6xx and Bifrost reduce ROP-side DRAM traffic using tiled rendering, which splits the screen into rectangular tiles, and renders them one at a time. Tiles are sized to fit within on-chip buffers, which contain intermediate accesses from alpha blending and depth testing. The tile is only written out to DRAM after it’s finished rendering. Tiled rendering requires building per-tile visible triangle lists as vertex shaders finish calculating vertex coordinates. Then, those triangle lists are read back as the GPU rasterizes tiles one by one. Handling triangle lists generates DRAM traffic, which could obviate the benefit of tiled rendering if not handled carefully.
Bifrost uses a hierarchical tiling strategy like Midgard. Tiles are nominally 16×16 pixels, but Arm can use larger power-of-two tile sizes to try containing triangles. Doing so reduces how often a triangle overlaps different tiles and thus is referenced in multiple triangle lists. Compared to Midgard, Arm also redesigned tiler memory structures with finer-grained allocations and no minimum buffer allocations. Finally, Bifrost can eliminate triangles too small to affect pixel output at the tiler stage, which also reduces wasted pixel/fragment shader work. These optimizations can reduce both bandwidth usage and memory footprint between the vertex and pixel shader stages. Arm also optimizes bandwidth usage at the tile writeback stage, where “Transaction Elimination” compares a tile’s CRC with the corresponding tile in the prior frame, and skips tile writeback if they match, an often efficient trade of logic for memory bus usage.
Because Bifrost uses 256 bits of tile storage per pixel, tile memory likely has at least 8 KB of capacity. Arm further implies tile memory is attached to each Shader Core, so Mali-G52 may have 16 KB of tile memory across its two Shader Cores. Adreno 615 also uses tiled rendering, and uses 512 KB of tile memory (called GMEM) to hold intermediate tile state.
FluidX3D is a GPGPU compute application that simulates fluid behavior. FP16S/FP16C modes help reduce DRAM bandwidth requirements by using 16-bit FP formats for storage. Calculations are still performed on FP32 values to maintain accuracy, with extra instructions used to convert between 16 and 32-bit FP formats.
Adreno 615 and Mali-G52 both appear more compute-bound than bandwidth-bound, so FP16 formats don’t help. FluidX3D uses FMA operations by default, which destroys Adreno 615’s performance because it doesn’t have fast-path FMA hardware. Qualcomm does better if FMA operations are replaced by multiply-adds. However, Adreno 615 still turns in an unimpressive result. Despite having more FP32 throughput and more memory bandwidth on paper, it falls behind Mali-G52.
Mali-G52 is organized into four power domains. A “GL” domain is always on, and likely lets the GPU listen for power-on commands. Next, a “CG” (Common Graphics?) domain is powered on when the GPU needs to handle 3D work or parallel compute. Next, the Shader Cores (SC0, SC1) are powered on as necessary. Each Shader Core sits on a separate power domain, letting the driver partially power up Mali’s shader array for very light tasks.
Power savings can also come from adjusting clock speed. The Amlogic S922X appears to generate Mali-G52’s clocks from a 2 GHz “FCLK”, using various divider settings.
Arm’s business model relies on making its IP blocks attractive to the widest possible variety of implementers. Bifrost fits into that business model thanks to its highly parameterized design and very small building blocks, which makes it ideal for hitting very specific power, area, and performance levels within the low power iGPU segment.
Qualcomm’s Adreno targets similar devices, but uses much larger building blocks. Adreno’s strategy is better suited to scaling up the architecture, which aligns with Qualcomm’s ambition to break into the laptop segment. However, larger building blocks make it more difficult to create tiny GPUs. I feel like Qualcomm scaled down Adreno 6xx with all the knobs they had, then had to cut clock speed to an incredibly low 430 MHz to hit the Snapdragon 670’s design goals.
Beyond scaling flexibility, Bifrost widens Arm’s market by aiming for consistent performance across a wide range of workloads. That applies both in comparison to Qualcomm’s Adreno, as well as Arm’s prior Midgard architecture. Bifrost’s execution units are pleasantly free of “glass jaw” performance characteristics, and GPU to CPU copy bandwidth is better than on Adreno. Adreno isn’t marketed as a standalone block, and is more sharply focused on graphics rasterization. Qualcomm might expect common mobile compute tasks to be offloaded to other blocks, such as their Hexagon DSP.
Overall, Bifrost is an interesting look into optimizing a low power GPU design to cover a wide range of applications. It has strong Terascale to GCN energy, though Bifrost really lands at a midpoint between those two extremes. It’s fascinating to see Terascale features like clause-based execution, a software-controlled operand collector, and temporary registers show up on a 2016 GPU architecture. Apparently features that Terascale used to pack teraflops of FP32 compute into 40nm class nodes continue to be useful for hitting very tight power targets on newer nodes. Since Bifrost, Arm has continued to modernize their GPU architecture with a focus on both graphics rasterization and general purpose compute. They’re a fascinating and unique GPU designer, and I look forward to seeing where they go in the future.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.
2025-05-01 04:20:16
Zhaoxin is a Chinese x86 CPU designer. The KaiXian KX-7000 is Zhaoxin’s latest CPU, and features a new architecture dubbed “世纪大道” (Century Avenue). Century Avenue is a road in Shanghai, and the name follows Zhaoxin’s practice of naming architectures after Shanghai landmarks. Zhaoxin is notable because it’s a joint venture between VIA Technologies and the Shanghai municipal government. It inherits VIA’s x86-64 license, and also enjoys powerful government backing. That’s a potent combination, because Zhaoxin’s cores are positioned to take advantage of the strong x86-64 software ecosystem.
x86-64 compatibility is just one part of the picture, because performance matters too. Zhaoxin’s previous LuJiaZui, implemented in the KX-6640MA, was clearly inadequate for handling modern applications. LuJiaZui was a 2-wide core with sub-3 GHz clock speeds and barely more reordering capacity than Intel’s Pentium II from 1997. Century Avenue takes aim at that performance problem.
Century Avenue is a 4-wide, AVX2 capable core with an out-of-order execution window on par with Intel CPUs from the early 2010s. Besides making the core wider and more latency-tolerant, Zhaoxin targets higher clock speeds. The KX-7000 runs at 3.2 GHz, significantly faster than the KX-6640MA’s 2.6 GHz. Zhaoxin’s site claims the KX-7000 can reach 3.5-3.7 GHz, but I never saw the chip clock above 3.2 GHz.
The KX-7000 has eight Century Avenue cores, and uses a chiplet setup reminiscent of single-CCD AMD Ryzen desktop parts. All eight cores sit on one die and share 32 MB of L3 cache. A second IO die connects to DRAM and other IO. Zhaoxin did not specify what process node they’re using. Techpowerup and Wccftech suggest it uses an unspecified 16nm node.
At the frontend, instructions are fetched from a 64 KB 16-way instruction cache. The instruction cache can deliver 16 bytes per cycle, and feeds a 4-wide decoder. Century Avenue uses a thoroughly conventional frontend setup, without a loop buffer or op cache. Instruction cache bandwidth can therefore constrain frontend throughput if average instruction length exceeds 4 bytes.
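As a back-of-the-envelope illustration of that constraint, the snippet below computes the IPC ceiling implied by a 16 byte/cycle fetch path for a few assumed average instruction lengths; the lengths themselves are just assumptions for illustration.

```cpp
// Back-of-the-envelope fetch ceiling: with 16 bytes/cycle from L1i, sustained
// IPC can't exceed fetch bandwidth divided by average instruction length, no
// matter how wide the decoder is. Instruction lengths here are assumptions.
#include <algorithm>
#include <cstdio>

int main() {
    const double fetch_bytes_per_cycle = 16.0;
    const double decode_width = 4.0;
    for (double avg_len : {3.0, 4.0, 5.0, 6.0}) {   // bytes; AVX2 code often averages above 4
        double ipc_ceiling = std::min(decode_width, fetch_bytes_per_cycle / avg_len);
        printf("avg %.0f B/instruction -> at most %.2f IPC\n", avg_len, ipc_ceiling);
    }
    return 0;
}
```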
Frontend bandwidth drops sharply as code spills out of L1i, creating another contrast with 2010s era western designs. Skylake for example can run code from L2 at over 12 bytes per cycle, adequate for >3 IPC with 4 byte instructions. Century Avenue suffers further if code spills into L3, where frontend bandwidth drops to under 4 bytes per cycle.
A 4096 entry branch target buffer (BTB) provides branch targets, and creates two pipeline bubbles after a taken branch. Taken branch latency jumps as the test spills out of L1i, even with far fewer than 4K branches. The BTB is likely tied to the L1i, and thus can’t be used to do long-distance prefetch past an L1i miss.
Century Avenue’s branching performance is reminiscent of older cores like VIA’s Nano. Dropping zero-bubble branching is also a regression compared to LuJiaZui, which could handle small branch footprints with no bubbles thanks to a 16 entry L0 BTB. Perhaps Zhaoxin felt they couldn’t do zero-bubble branching at Century Avenue’s 3 GHz+ clock speed targets. However, Intel and AMD CPUs from over a decade ago manage faster branch target caching at higher clock speeds.
In Century Avenue’s favor, the direction predictor has vastly improved pattern recognition capabilities compared to its predecessor. When given repeating patterns of taken and not-taken branches, the KX-7000 behaves a bit like Intel’s Sunny Cove.
Returns behave much like on LuJiaZui. Call+return pairs enjoy reasonable latency until they go more than four-deep. An inflection point further on suggests a second level return stack with approximately 32 entries. If there is a second level return stack, it’s rather slow with a cost of 14 cycles per call+return pair. Bulldozer shows more typical behavior. Call+return pairs are fast until they overflow a 24 entry return stack.
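A return stack probe can be approximated with a simple recursive function timed at different depths, as in the rough sketch below. This is an illustration rather than our actual test harness; it assumes GCC/Clang (for the noinline attribute), and converting nanoseconds to cycles requires knowing the measured clock speed.

```cpp
// Rough sketch of a return stack probe: time a recursive call chain at various
// depths and report cost per call+return pair. Shallow depths are noisy because
// loop overhead isn't subtracted out.
#include <chrono>
#include <cstdio>

__attribute__((noinline)) void chain(int depth) {
    if (depth > 0) chain(depth - 1);
    asm volatile("" ::: "memory");   // keep the recursive call out of tail position
}

int main() {
    constexpr int iterations = 5'000'000;
    for (int depth : {1, 2, 4, 8, 16, 32, 64}) {
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < iterations; i++) chain(depth);
        auto end = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(end - start).count();
        printf("depth %3d: %.2f ns per call+return pair\n", depth, ns / iterations / depth);
    }
    return 0;
}
```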
Century Avenue’s frontend aims to deliver up to four instructions per cycle with minimal sophistication. A conventional fetch and decode setup can be good if tuned properly, but Century Avenue’s frontend has a few clear weaknesses. Average instruction length can exceed 4 bytes in AVX2 code, thanks to VEX prefixes. AMD tackled this by increasing L1i bandwidth to 32B/cycle in Family 10h CPUs. Intel used loop buffers in Core 2 before introducing an op cache in Sandy Bridge (while keeping 16B/cycle L1i bandwidth). Either approach is fine, but Century Avenue does neither. Century Avenue also does not implement branch fusion, a technique that AMD and Intel have used for over a decade. An [add, add, cmp, jz] sequence executes at under 3 IPC.
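For illustration, here’s a hedged sketch of that kind of fusion probe: a long unrolled run of [add, add, cmp, jz] groups where the branch is never taken. It assumes x86-64 with GCC or Clang, and the 3.2 GHz clock used to estimate IPC is an assumption; loop overhead also dilutes the result slightly.

```cpp
// Fusion probe sketch: with cmp+jz fused, each 4-instruction group occupies
// three macro-op slots, so a 4-wide core can exceed 4 IPC on this sequence.
// Without fusion it caps at 4 IPC; Century Avenue measures under 3.
#include <chrono>
#include <cstdint>
#include <cstdio>

#define GROUP            \
    "add $1, %[a]\n\t"   \
    "add $1, %[b]\n\t"   \
    "cmp $0, %[a]\n\t"   \
    "jz 1f\n\t"

int main() {
    constexpr uint64_t iterations = 50'000'000;
    uint64_t a = 1, b = 0;   // a stays nonzero, so the jz falls through every time
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iterations; i++) {
        asm volatile(
            GROUP GROUP GROUP GROUP GROUP GROUP GROUP GROUP   // 8 groups = 32 instructions
            "1:\n\t"
            : [a] "+r"(a), [b] "+r"(b) :: "cc");
    }
    auto end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    double ips = iterations * 32 / seconds;
    printf("%.2f billion probe instructions/s, ~%.2f IPC assuming 3.2 GHz (a=%llu b=%llu)\n",
           ips / 1e9, ips / 3.2e9, (unsigned long long)a, (unsigned long long)b);
    return 0;
}
```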
Lack of sophistication extends to branch target caching. A single level BTB with effectively 3 cycle latency feels primitive today, especially when it’s tied to the instruction cache. As before, a decoupled BTB isn’t the only way to go. Apple’s M1 also appears to have a BTB coupled to the L1i, but it compensates with a massive 192 KB L1i. Century Avenue’s 64 KB L1i is larger than the 32 KB instruction caches found on many x86-64 cores, but it stops short of brute-forcing its way around large code footprints the way Apple does. To be fair to Zhaoxin, Bulldozer also combines a 64 KB L1i with poor L2 code bandwidth. However, I don’t think there’s a good excuse for 3 cycle taken branch latency on any post-2024 core, especially one running below 4 GHz.
Micro-ops from the frontend are allocated into backend tracking structures, which carry out bookkeeping necessary for out-of-order execution. Register allocation goes hand-in-hand with register renaming, which breaks false dependencies by allocating a new physical register whenever an instruction writes to one. The rename/allocate stage is also a convenient place to carry out other optimizations and expose more parallelism to the backend.
Century Avenue recognizes zeroing idioms like XOR-ing a register with itself, and can tell the backend that such instructions are independent. However, such XORs are still limited to three per cycle, suggesting they use an ALU port. The renamer also allocates a physical register to hold the result, even though it will always be zero. Move elimination works as well, though it’s also limited to three per cycle.
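The snippet below is a small illustration (not a benchmark) of the idioms the renamer can exploit; it assumes x86-64 with GCC/Clang inline assembly.

```cpp
// Zeroing idiom and move elimination illustration: xor-ing a register with
// itself always yields zero, so it carries no true dependency on the old value,
// and a register-to-register mov can be "executed" at rename by mapping both
// architectural names to the same physical register.
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t a = 0xdeadbeef, b = 1;
    asm volatile(
        "xor %[a], %[a]\n\t"   // zeroing idiom: independent of a's previous value
        "mov %[a], %[b]\n\t"   // move elimination candidate: b can alias a's physical register
        : [a] "+r"(a), [b] "+r"(b)
        :: "cc");
    printf("a=%llu b=%llu\n", (unsigned long long)a, (unsigned long long)b);
    return 0;
}
```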
Zhaoxin switches to a physical register file (PRF) based execution scheme, moving away from LuJiaZui’s ROB-based setup. Separate register files reduce data transfer within the core, and let designers scale ROB size independently of register file capacity. Both are significant advantages over LuJiaZui, and contribute to Century Avenue having several times as much reordering capacity. With a 192 entry ROB, Century Avenue has a theoretical out-of-order window on par with Intel’s Haswell, AMD’s Zen, and Centaur’s CNS. LuJiaZui’s 48 entry ROB is nowhere close.
Reorder buffer size only puts a cap on how far the backend can search ahead of a stalled instruction. Reordering capacity in practice is limited by whatever resource the core runs out of first, whether that be register files, memory ordering queues, or other structures. Century Avenue’s register files are smaller than Haswell or Zen’s, but the core can keep a reasonable number of branches and memory operations in flight.
Century Avenue has a semi-unified scheduler setup, shifting away from LuJiaZui’s distributed scheme. ALU, memory, and FP/vector operations each have a large scheduler with more than 40 entries. Branches appear to have their own scheduler, though maybe not a dedicated port. I wasn’t able to execute a not-taken jump alongside three integer adds in the same cycle. In any case, Century Avenue has fewer scheduling queues than its predecessor, despite having more execution ports. That makes tuning scheduler size easier, because there are fewer degrees of freedom.
Typically a unified scheduler can achieve similar performance to a distributed one with fewer total entries. An entry in a unified scheduling queue can hold a pending micro-op for any of the scheduler’s ports. That reduces the chance of an individual queue filling up and blocking further incoming instructions even though scheduler entries are available in other queues. With several large multi-ported schedulers, Century Avenue has more scheduler capacity than Haswell, Centaur CNS, or even Skylake.
Three ALU pipes generate results for scalar integer operations. Thus Century Avenue joins Arm’s Neoverse N1 and Intel’s Sandy Bridge in having three ALU ports in an overall four-wide core. Two of Century Avenue’s ALU pipes have integer multipliers. 64-bit integer multiplies have just two-cycle latency, giving the core excellent integer multiply performance.
Century Avenue’s FP/vector side is surprisingly powerful. The FP/vector unit appears to have four pipes, all of which can execute 128-bit vector integer adds. Floating point operations execute at two per cycle. Amazingly, that rate applies even for 256-bit vector FMA instructions. Century Avenue therefore matches Haswell’s per-cycle FLOP count. Floating point latency is normal at 3 cycles for FP adds and multiplies or 5 cycles for a fused multiply-add. Vector integer adds have single-cycle latency.
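A minimal throughput sketch along these lines uses AVX2/FMA intrinsics with several independent accumulator chains; it’s an illustration under stated assumptions (compile with -O3 -mavx2 -mfma), not our actual test harness, and turning GFLOPS into FMAs per cycle requires the measured clock speed.

```cpp
// AVX2/FMA throughput sketch: eight independent accumulator chains expose
// enough parallelism to keep two 256-bit FMA pipes busy.
#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main() {
    __m256 acc[8];
    for (auto& a : acc) a = _mm256_set1_ps(1.0f);
    const __m256 x = _mm256_set1_ps(1.0000001f);
    const __m256 y = _mm256_set1_ps(1e-7f);

    constexpr long iterations = 100'000'000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++)
        for (auto& a : acc)
            a = _mm256_fmadd_ps(a, x, y);    // 8 independent 256-bit FMAs per iteration
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    double gflops = iterations * 8.0 * 8 * 2 / seconds / 1e9;   // 8 FMAs x 8 lanes x 2 FLOPs
    float sink = _mm256_cvtss_f32(_mm256_add_ps(acc[0], acc[7]));   // keep results live
    printf("%.1f GFLOPS (sink=%f)\n", gflops, sink);
    return 0;
}
```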
However, the rest of Century Avenue’s execution engine isn’t so enthusiastic about AVX2. Instructions that operate on 256-bit vectors are broken into two 128-bit micro-ops for all the common cases I tested. A 256-bit FP add takes two ROB entries, two scheduler slots, and the result consumes two register file entries. On the memory side, 256-bit loads and stores take two load queue or two store queue entries, respectively. Zhaoxin’s AVX2 approach is the opposite of Zen 4’s AVX-512 strategy: AMD left execution throughput largely unchanged from the prior generation, but its 512-bit register file entries let it keep more work in flight and better feed those execution units. Century Avenue brings the execution throughput first, and worries about how to feed it later.
Memory accesses start with a pair of address generation units (AGUs), which calculate virtual addresses. The AGUs are fed by 48 scheduler entries, which could be a 48 entry unified scheduler or two 24 entry queues.
48-bit virtual addresses from the AGUs are then translated into 46-bit physical addresses. Data-side address translations are cached in a 96 entry, 6-way set associative data TLB. 2 MB pages use a separate 32 entry, 4-way DTLB. Century Avenue doesn’t report L2 TLB capacity through CPUID, and DTLB misses add ~20 cycles of latency. That’s higher than usual for cores with a second level TLB, except for Bulldozer.
Besides address translation, the load/store unit has to handle memory dependencies. Century Avenue appears to do an initial dependency check using the virtual address, because a load has a false dependency on a store offset by 4 KB. For real dependencies, Century Avenue can do store forwarding with 5 cycle latency. Like many other cores, partial overlaps cause fast forwarding to fail. Century Avenue takes a 22 cycle penalty in that case, which isn’t out of the ordinary. For independent accesses, Century Avenue can do Core 2 style memory disambiguation. That lets a load execute ahead of a store with an unknown address, improving memory pipeline utilization.
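Store forwarding latency can be approximated with a simple dependent store/load chain; the sketch below is a rough illustration rather than our actual test, and converting its nanoseconds-per-iteration figure to cycles requires the measured clock speed.

```cpp
// Store-to-load forwarding sketch: each iteration stores a value and immediately
// loads it back through a volatile variable, so the loop's critical path is
// roughly store-forward latency plus one add.
#include <chrono>
#include <cstdio>

int main() {
    volatile long slot = 0;
    long v = 0;
    constexpr long iterations = 200'000'000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; i++) {
        slot = v;         // store
        v = slot + 1;     // dependent load, satisfied by forwarding from the store buffer
    }
    auto end = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    printf("%.2f ns per store->load->add iteration (v=%ld)\n", ns / iterations, v);
    return 0;
}
```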
“Misaligned” loads and stores that straddle a cacheline boundary take 12-13 cycles, a heavy penalty compared to modern cores. Skylake for example barely takes any penalty for misaligned loads, and handles misaligned stores with just a single cycle penalty. Century Avenue faces the heaviest penalties (>42 cycles) if a load depends on a misaligned store.
Century Avenue has a 32 KB, 8-way associative data cache with a pair of 128-bit ports and 4 cycle load-to-use latency. Only one port handles stores, so 256-bit stores execute over two cycles. Century Avenue’s L1D bandwidth is therefore similar to Sandy Bridge, even though its FMA capability can demand higher bandwidth. When Intel first rolled out 2×256-bit FMA execution with Haswell, their engineers increased L1D bandwidth to 2×256-bit loads and a 256-bit store per cycle.
L2 latency is unimpressive at 15 cycles. Skylake-X, for example, has a larger 1 MB L2 and runs it with 14 cycle latency at higher clock speeds.
Century Avenue’s system architecture has been overhauled to improve core count scalability. The KX-7000 adopts a triple-level cache setup, aligning with high performance designs from AMD, Arm, and Intel. Core-private L2 caches help insulate L1 misses from high L3 latency. Thus L3 latency becomes less critical, which enables a larger L3 shared across more cores. Compared to LuJiaZui, Century Avenue increases L3 capacity by a factor of eight, going from 4 MB to 32 MB. Eight Century Avenue cores share the L3, while four LuJiaZui cores shared a 4 MB L2. Combined with the chiplet setup, the KX-7000 is built much like a single-CCD Zen 3 desktop part.
L3 latency is poor compared to AMD’s recent designs, at over 27 ns, or over 80 core cycles. Bandwidth isn’t great either at just over 8 bytes per cycle from a single core. A read-modify-write pattern increases bandwidth to 11.5 bytes per cycle. Neither figure is impressive. Skylake could average 15 bytes per cycle from L3 using a read-only pattern, and recent AMD designs can achieve twice that.
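For context on where a bytes-per-cycle figure comes from, here’s a minimal single-core read bandwidth sketch; the 3.2 GHz clock used for the conversion is an assumption, and a real measurement would pin the thread and use wider reads.

```cpp
// Single-core L3 read bandwidth sketch: sum an array sized to miss L2 but fit
// in the 32 MB L3, then convert GB/s into bytes per cycle at an assumed clock.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    constexpr size_t bytes = 16ull << 20;    // 16 MB: larger than L2, well within L3
    std::vector<uint64_t> data(bytes / sizeof(uint64_t), 1);

    constexpr int passes = 200;
    uint64_t sum = 0;
    auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; p++)
        sum += std::accumulate(data.begin(), data.end(), uint64_t{0});
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    double gbps = double(bytes) * passes / seconds / 1e9;
    printf("%.1f GB/s, ~%.1f bytes/cycle at an assumed 3.2 GHz (sum=%llu)\n",
           gbps, gbps / 3.2, (unsigned long long)sum);
    return 0;
}
```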
The KX-7000 does enjoy good bandwidth scaling, but low clock speeds combined with low per-core bandwidth to start with mean final figures aren’t too impressive. A read-only pattern gets to 215 GB/s, while a read-modify-write pattern can exceed 300 GB/s. For comparison, a Zen 2 CCD enjoys more than twice as much L3 bandwidth.
The KX-7000 does have more L3 bandwidth than Intel’s Skylake-X, at least when testing with matched thread counts. However, Skylake-X has a larger 1 MB L2 cache to insulate the cores from poor L3 performance. Skylake-X is also a server-oriented part, where single-threaded performance is less important. On the client side, Bulldozer has similar L3 latency, but uses an even larger 2 MB L2 to avoid hitting it.
DRAM performance is poor, with over 200 ns latency even when using 2 MB pages to minimize address translation latency. Latency goes over 240 ns using 4 KB pages, using a 1 GB array in both cases. The KX-7000’s DRAM bandwidth situation is tricky. To start, the memory controller was only able to train to 1600 MT/s, despite using DIMMs with 2666 MT/s JEDEC and 4000 MT/s XMP profiles. Theoretical bandwidth is therefore limited to 25.6 GB/s. However measured read bandwidth gets nowhere close, struggling to get past even 12 GB/s.
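Latency numbers like these typically come from a pointer chase, where every load’s address depends on the previous load’s result. The sketch below illustrates the idea; hugepage allocation (for the 2 MB page numbers) is platform-specific and omitted, so it uses whatever pages the allocator provides.

```cpp
// Pointer-chasing latency sketch: the array holds a single random cycle, so each
// load depends on the previous one, defeating prefetch and exposing latency.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    constexpr size_t elements = (1ull << 30) / sizeof(uint64_t);   // 1 GB of 8-byte slots
    std::vector<uint64_t> next(elements);

    // Build one random cycle: visit order is a shuffled index list, and each
    // slot points at the next slot in that order.
    std::vector<uint64_t> order(elements);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i < elements; i++)
        next[order[i]] = order[(i + 1) % elements];

    constexpr uint64_t hops = 20'000'000;
    uint64_t p = order[0];
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < hops; i++)
        p = next[p];                         // each load depends on the previous one
    auto end = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(end - start).count();
    printf("%.1f ns per dependent load (p=%llu)\n", ns / hops, (unsigned long long)p);
    return 0;
}
```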
Mixing in writes increases achievable bandwidth. A read-modify-write pattern gets over 20 GB/s, while non-temporal writes reach 23.35 GB/s. The latter figure is close to theoretical, and indicates Zhaoxin’s cross-die link has enough bandwidth to saturate the memory controller. Read bandwidth is likely limited by latency. Unlike writes, where data to be written gets handed off, reads can only complete when data returns. Maintaining high read bandwidth requires keeping enough memory requests in-flight to hide latency.
Often loading more cores lets the memory subsystem keep more requests in flight, because each core has its own L1 and L2 miss queues. However the KX-7000’s read bandwidth abruptly stops scaling once a bandwidth test loads more than two cores. That suggests a queue shared by all the cores doesn’t have enough entries to hide latency, resulting in low read bandwidth.
To make things worse, the KX-7000’s memory subsystem doesn’t do well at ensuring fairness between requests coming from different cores. A pointer chasing thread sees latency skyrocket when other cores generate high bandwidth load. In a worst case with one latency test thread and seven bandwidth threads, latency pushes past 1 microsecond. I suspect the bandwidth-hungry threads monopolize entries in whatever shared queue limits read bandwidth.
AMD’s Bulldozer maintains better control over latency under high bandwidth load. The FX-8150’s Northbridge has a complicated setup with two crossbar levels, but does an excellent job. Latency increases as the test pushes up to the memory controller’s bandwidth limits, but doesn’t rise to more than double its un-loaded latency. In absolute terms, even Bulldozer’s worst case latency is better than the KX-7000’s best case.
Sometimes, the memory subsystem has to satisfy a request by retrieving data from a peer core’s cache. These cases are rare in practice, but can give insight into system topology. The KX-7000 posts relatively high but even latency in a core-to-core latency test. Some core pairs see lower latency than others, likely depending on which L3 slice the tested address belongs to.
Compared to LuJiaZui, Century Avenue posts a huge 48.8% gain in SPEC CPU2017’s integer suite, and provides more than a 2x speedup in the floating point suite. Zhaoxin has been busy over the past few years, and that work has paid off. Against high performance western x86-64 chips, the KX-7000 falls just short of AMD’s Bulldozer in the integer suite. The FX-8150 leads by 13.6% there. Zhaoxin flips things around in the floating point suite, drawing 10.4% ahead of Bulldozer.
Newer cores like Broadwell or Skylake land on a different performance planet compared to Century Avenue, so Bulldozer is the best relative comparison. Against Bulldozer, Century Avenue tends to do best in higher-IPC tests like 500.perlbench, 548.exchange2, and 525.x264. I suspect Century Avenue’s additional execution resources give it an advantage in those tests. Meanwhile Bulldozer bulldozes the KX-7000 in low IPC tests like 505.mcf and 520.omnetpp. Those tests present a nasty cocktail of difficult-to-predict branches and large memory footprints. Bulldozer’s comparatively strong memory subsystem and faster branch predictor likely give it a win there.
SPEC CPU2017’s floating point suite generally consists of higher IPC workloads, which hands the advantage to the KX-7000. However, the FX-8150 snatches occasional victories. 549.fotonik3d is a challenging low IPC workload that sees even recent cores heavily limited by cache misses. Bulldozer walks away with an impressive 46.2% lead in that workload. At the other end, 538.imagick basically doesn’t see L2 misses.
Overall the SPEC CPU2017 results suggest the KX-7000 can deliver single-threaded performance roughly on par with AMD’s Bulldozer.
Having eight cores is one of the KX-7000’s relative strengths against the FX-8150 and Core i5-6600K. However, multithreaded results are a mixed bag. libx264 software video encoding can take advantage of AVX2, and uses more than four threads. However, the KX-7000 is soundly beaten even by Bulldozer. 7-Zip compression uses scalar integer instructions. With AVX2 not playing a role, Bulldozer and the Core i5-6600K score even larger wins.
The KX-7000 turns in a better performance in Y-Cruncher, possibly with AVX2 giving it a large advantage over Bulldozer. However, eight Century Avenue cores still fail to match four Skylake ones. For a final test, OpenSSL RSA2048 signs are a purely integer operation that focuses on core compute power rather than memory access. They’re particularly important for web servers, which have to validate their identity when clients establish SSL/TLS connections. Zhaoxin again beats Bulldozer in that workload, but falls behind Skylake.
Zhaoxin inherits VIA’s x86 license, but plays a different ball game. VIA focused on low-power, low-cost applications. While Centaur CNS did branch into somewhat higher performance targets with a 4-wide design, the company never sought to tap into the wider general purpose compute market like AMD and Intel. Creating a high-clocking, high-IPC core that excels in everything from web browsing to gaming to video encoding is a massive engineering challenge. VIA reasonably decided to find a niche, rather than take AMD and Intel head-on without the engineering resources to match.
However, Zhaoxin is part of China’s effort to build domestic chips in case western ones become unavailable. Doing so is a matter of national importance, so companies like Zhaoxin can expect massive government support, and can survive even without being profitable. Zhaoxin’s chips don’t need to directly compete with AMD and Intel. But AMD and Intel’s chips have driven performance expectations from application developers. China needs chips with enough performance to substitute for western ones without being disruptively slow.
Century Avenue is an obvious attempt to get into that position, stepping away from LuJiaZui’s low power and low performance design. At a high level, Century Avenue represents good progress. A 4-wide >3 GHz core with Bulldozer-level performance is a huge step up. At a lower level, it feels like Zhaoxin tried to make everything bigger without slowing down and making sure the whole picture makes sense. Century Avenue has 2×256-bit FMA units, which suggest Zhaoxin is trying to get the most out of AVX2. However Century Avenue has low cache bandwidth and internally tracks 256-bit instructions as a pair of micro-ops. Doing so suits a minimum-cost AVX2 implementation geared towards compatibility rather than high performance. Besides AVX2, Century Avenue has small register files relative to its ROB capacity, which hinders its ability to make use of its theoretical out-of-order window.
Zooming out to the system level shows the same pattern. Century Avenue’s L2 is too small considering it has to shield cores from 80+ cycle L3 latency. The KX-7000’s DRAM read bandwidth is inadequate for an octa-core setup, and the memory subsystem does a poor job of ensuring fairness under high bandwidth load. Besides unbalanced characteristics, Century Avenue’s high frontend latency and lack of branch fusion make it feel like a 2005-era core, not a 2025 one.
Ultimately, performance is what matters to an end user. In that respect, the KX-7000 sometimes falls behind Bulldozer in multithreaded workloads. That’s disappointing considering Bulldozer is a 2011-era design with pairs of hardware threads sharing a frontend and floating point unit. Single-threaded performance is similarly unimpressive. The KX-7000 roughly matches Bulldozer there, but the FX-8150’s single-threaded performance was one of its greatest weaknesses even back in 2011. Of course, the KX-7000 isn’t trying to impress western consumers. It’s trying to provide a usable experience without relying on foreign companies. In that respect, Bulldozer-level single-threaded performance is plenty. And while Century Avenue lacks the balance and sophistication that a modern AMD, Arm, or Intel core is likely to display, it’s a good step in Zhaoxin’s effort to break into higher performance targets.
If you like the content then consider heading over to the Patreon or PayPal if you want to toss a few bucks to Chips and Cheese. Also consider joining the Discord.