Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Streamlined iteration: exploring keys and values in C++20

2025-04-21 04:40:40

In software, we often use key-value data structures, where each key is unique and maps to a specific value. Common examples include dictionaries in Python, hash maps in Java, and objects in JavaScript. If you combine arrays with key-value data structures, you can represent most data.

In C++, we have two standard key-value data structures: the std::map and the std::unordered_map. Typically, the std::map is implemented as a tree (e.g., a red-black tree) with sorted keys, providing logarithmic time complexity O(log n) for lookups, insertions, and deletions, and maintaining keys in sorted order. The std::unordered_map is typically implemented as a hash table with unsorted keys, offering average-case constant time complexity O(1) for lookups, insertions, and deletions. In the std::unordered_map, the hash table uses a hashing function to map keys to indices in an array of buckets. Each bucket is essentially a container that can hold multiple key-value pairs, typically implemented as a linked list. When a key is hashed, the hash function computes an index corresponding to a bucket. If multiple keys hash to the same bucket (a collision), they are stored in that bucket’s linked list.
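
For concreteness, here is a minimal illustration of the two containers; this sketch is mine, not from the original post:

#include <map>
#include <string>
#include <unordered_map>

int main() {
    // Tree-based: keys are kept in sorted order.
    std::map<std::string, int> sorted_counts;
    // Hash table: keys are stored in an unspecified order.
    std::unordered_map<std::string, int> hashed_counts;
    sorted_counts["apple"] = 3;
    hashed_counts["apple"] = 3;
    // Iterating over sorted_counts visits keys in lexicographic order;
    // iterating over hashed_counts does not guarantee any order.
    return 0;
}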

Quite often, we only need to look at the keys, or look at the values. The C++20 standard makes this convenient through the introduction of ranges (std::ranges::views::keys and std::ranges::views::values). Let us consider two functions using the ‘modern’ functional style. The first function sums the values and the next function counts how many keys (assuming that they are strings) start with a given prefix.

#include <algorithm>   // std::ranges::count_if
#include <numeric>     // std::accumulate
#include <ranges>      // std::ranges::views::keys, std::ranges::views::values
#include <string_view> // std::string_view

template<typename map_type>
auto sum_values(const map_type& map) {
    auto values = map | std::ranges::views::values;
    return std::accumulate(values.begin(), values.end(),
                         typename map_type::mapped_type{});
}

template<typename map_type>
size_t count_keys_with_prefix(const map_type& map, 
                                     std::string_view prefix) {
    auto keys = map | std::ranges::views::keys;
    return std::ranges::count_if(keys, [prefix](std::string_view key) {
        return key.starts_with(prefix);
    });
}
The code defines two template functions for processing a map-like container. The sum_values function takes a map of any type and computes the sum of its values by extracting them using std::ranges::views::values, then applying std::accumulate to add them up, starting with a default-constructed value of the map’s mapped type. The count_keys_with_prefix function counts how many keys in the map start with a given prefix by extracting the keys with std::ranges::views::keys and using std::ranges::count_if with a lambda that checks if each key begins with the specified prefix via starts_with. Both functions leverage C++20’s ranges library for concise and expressive manipulation of the map’s keys and values.
If you do not crave the functional style, you can use loops… like so…
template <typename map_type> auto sum_values_daniel(const map_type &map) {
  typename map_type::mapped_type sum{};
  for (const auto &value : map | std::ranges::views::values) {
    sum += value;
  }
  return sum;
}

template <typename map_type>
size_t count_keys_with_prefix_daniel(const map_type &map,
                                     std::string_view prefix) {
  size_t count = 0;
  for (const auto &key : map | std::ranges::views::keys) {
    if (key.starts_with(prefix)) {
      ++count;
    }
  }
  return count;
}
I personally find loops more readable than lambda functions, but it might be worth testing both to see which one provides the best performance.

When you do not have C++20, you can write the same functions using a more conventional style:

template<typename map_type>
typename map_type::mapped_type sum_values_cpp11(const map_type& map) {
    typename map_type::mapped_type sum = 
                  typename map_type::mapped_type();
    for (typename map_type::const_iterator it = map.begin(); 
                                    it != map.end(); ++it) {
        sum += it->second;
    }
    return sum;
}

template<typename map_type>
size_t count_keys_with_prefix_cpp11(const map_type& map, 
           const std::string& prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin(); 
                                           it != map.end(); ++it) {
        const std::string& key = it->first;
        if (key.size() >= prefix.size() 
                    && key.compare(0, prefix.size(), prefix) == 0) {
            ++count;
        }
    }
    return count;
}
Prior to C++20, there was no dedicated starts_with function to directly check whether one string is a prefix of another, so more verbose methods such as compare were required. Of course, you can combine the old-style code with the ‘starts_with’ method. I think that the addition of functions like ‘starts_with’ is an underrated benefit of the recent C++ standards. Small details can be impactful.
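
For completeness, the ‘C++11 with starts_with’ variant that appears in the benchmarks below might look like the following sketch; the function name count_keys_with_prefix_mixed is mine and the actual benchmark code may differ:

// Sketch: classic iterator loop, but the prefix test uses the C++20
// std::string::starts_with method instead of compare.
template<typename map_type>
size_t count_keys_with_prefix_mixed(const map_type& map,
                                    std::string_view prefix) {
    size_t count = 0;
    for (typename map_type::const_iterator it = map.begin();
                                           it != map.end(); ++it) {
        if (it->first.starts_with(prefix)) {
            ++count;
        }
    }
    return count;
}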

Let us compare the performance. I have generated a large key-value map with a thousand keys (as strings) with corresponding integer values. All my keys begin with ‘key_’ and I ask my functions to count how many keys begin with ‘key_’ (that should be all of them). I use both an std::map and std::unordered_map data source. I record how many nanoseconds I need to spend per key checked. I use two machines, one with an Intel Ice Lake processor running at 3.2 GHz with GCC 12 and LLVM 16 as the compilers and one with an Apple M2 processor running at 3.5 GHz with LLVM 15 as the compiler. They do not use the same instruction set, so we cannot directly compare the number of instructions retired: we expect an ARM-based system like the Apple M2 to use a few more instructions.
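
The test data might be generated along the following lines; this is a sketch under my own assumptions (the helper name make_test_map and the exact key format beyond the ‘key_’ prefix are mine):

#include <cstddef>
#include <map>
#include <string>

// Sketch: a thousand string keys sharing the 'key_' prefix,
// each mapped to an integer value.
std::map<std::string, int> make_test_map(size_t count) {
    std::map<std::string, int> map;
    for (size_t i = 0; i < count; i++) {
        map["key_" + std::to_string(i)] = static_cast<int>(i);
    }
    return map;
}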

Using an std::map I get the following timings and instruction counts:

Time per key checked:

function                 Apple M2/LLVM15   Intel Ice Lake/GCC12   Intel Ice Lake/LLVM16
C++20 functional         2.2 ns            5.5 ns                 5.8 ns
C++20 loop               2.0 ns            5.5 ns                 5.8 ns
C++11                    4.0 ns            5.5 ns                 5.5 ns
C++11 with starts_with   2.0 ns            5.5 ns                 5.8 ns

Instructions per key checked:

function                 Apple M2/LLVM15   Intel Ice Lake/GCC12   Intel Ice Lake/LLVM16
C++20 functional         46                29                     49
C++20 loop               41                29                     49
C++11                    58                30                     30
C++11 with starts_with   43                30                     49

Using an std::unordered_map, I get the following timings and instruction counts:

Time per key checked:

function                 Apple M2/LLVM15   Intel Ice Lake/GCC12   Intel Ice Lake/LLVM16
C++20 functional         1.1 ns            3.8 ns                 3.8 ns
C++20 loop               0.9 ns            5.4 ns                 3.8 ns
C++11                    0.9 ns            5.4 ns                 3.8 ns
C++11 with starts_with   0.9 ns            5.4 ns                 3.8 ns

Instructions per key checked:

function                 Apple M2/LLVM15   Intel Ice Lake/GCC12   Intel Ice Lake/LLVM16
C++20 functional         35                10                     30
C++20 loop               30                13                     30
C++11                    26                14                     12
C++11 with starts_with   32                14                     30

We find that iterating over an std::map instance is slower than iterating over an std::unordered_map instance.

Performance comparisons between Apple M2/LLVM15 and Intel Ice Lake/GCC12 systems reveal striking differences in C++ code execution. While functional approaches (e.g., C++20 ranges) can outperform traditional loops by up to 40% under GCC 12, using starts_with for string comparisons dramatically halves running time for std::map on Apple/LLVM15.

Despite the small input fitting in cache, Intel’s iteration performance is not compute-bound: it retires fewer than 2 instructions per cycle (IPC), while Apple exceeds 4 IPC, highlighting its architectural advantage in handling such data access patterns. For example, in the std::map case with the C++20 loop, the Intel/GCC 12 system spends 5.5 ns per key, or roughly 18 cycles at 3.2 GHz, to retire 29 instructions (about 1.6 IPC), whereas the Apple M2 spends 2.0 ns, or roughly 7 cycles at 3.5 GHz, to retire 41 instructions (about 6 IPC).

My source code is available.

Detect control characters, quotes and backslashes efficiently using ‘SWAR’

2025-04-14 06:36:41

When trying to write fast functions operating over many bytes, we sometimes use ‘SWAR’. SWAR stands for SIMD Within A Register, and it is a technique to perform parallel computations on multiple data elements packed into a single word. It treats a register (e.g., 64-bit) as a vector of smaller units (e.g., 8-bit bytes) and processes them simultaneously without explicit SIMD instructions.

For example, if you have 8 bytes in a 64-bit word x, computing (x - 0x2020202020202020) subtracts 32 (0x20 in hex) from each byte. It works well if each byte value is greater than or equal to 32. Otherwise, the operation ‘overflows’: if the least significant byte is smaller than 32, then its most significant bit is set, and you cannot rely on the other more significant byte results.

It works well if all bytes are ‘ASCII’ (they have their most significant bit unset). When that is the case, you can check if any byte value is less than 32 by computing (x - 0x2020202020202020) and checking if any most significant bit is set, by masking the result: (x - 0x2020202020202020) & 0x8080808080808080.

Similarly, you can check whether the byte value 34 (0x22) appears by first doing an XOR that sets any such byte value to zero, (x ^ 0x2222222222222222), and then subtracting 1 from each byte value: (x ^ 0x2222222222222222) - 0x0101010101010101. It then suffices to mask with 0x80…80 again to see whether any byte value was equal to 34, as long as all input byte values were ASCII.

Thankfully, it is easy to select only the byte values that are ASCII: (0x8080808080808080 & ~x) will do.

In JSON and other languages, you need to escape, within strings, all character values smaller than 32 (ASCII control characters), equal to 34 (the double quote '"') or equal to 92 (the backslash '\'). So you want to quickly check whether a byte is smaller than 32, equal to 34 or equal to 92. With SWAR, you can do it with about ten operations:
bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t lt32 = (x - 0x2020202020202020ULL);
    uint64_t sub34 = x ^ 0x2222222222222222ULL;
    uint64_t eq34 = (sub34 - 0x0101010101010101ULL);
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32 | eq34 | eq92) & is_ascii) != 0;
}

That’s slightly more than one operation per input byte.

The following routine is even better (credit to Falk Hüffner):

bool has_json_escapable_byte(uint64_t x) {
    uint64_t is_ascii = 0x8080808080808080ULL & ~x;
    uint64_t xor2 = x ^ 0x0202020202020202ULL;
    uint64_t lt32_or_eq34 = xor2 - 0x2121212121212121ULL;
    uint64_t sub92 = x ^ 0x5C5C5C5C5C5C5C5CULL;
    uint64_t eq92 = (sub92 - 0x0101010101010101ULL);
    return ((lt32_or_eq34 | eq92) & is_ascii) != 0;
}
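
To apply either routine to a whole string, one can load eight bytes at a time and fall back to a byte-by-byte check for the tail. The following sketch is mine (the helper name needs_json_escaping and the memcpy-based loading are assumptions, not part of the original post):

#include <cstdint>
#include <cstring>

bool needs_json_escaping(const char *data, size_t length) {
  size_t i = 0;
  // Process full 64-bit words; memcpy avoids unaligned-load issues.
  for (; i + 8 <= length; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));
    if (has_json_escapable_byte(word)) {
      return true;
    }
  }
  // Handle the remaining (fewer than 8) bytes one at a time.
  for (; i < length; i++) {
    uint8_t c = uint8_t(data[i]);
    if (c < 32 || c == 34 || c == 92) {
      return true;
    }
  }
  return false;
}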

You might be able to do even better.

How can really smart people appear totally incompetent?

2025-04-12 02:56:26

It is often puzzling to see organizations run by highly capable and ambitious people… appear dysfunctional.

An example that I like is colleges that claim to provide a great learning experience… while providing their students with recycled lectures given by contract lecturers…

I like to tell the tale of this math class I took at the University of Toronto. The esteemed professor would show up with the textbook and a ruler. He would sit down, open the book to the page where the ruler was last left, and read each line of the textbook, slowly moving the ruler down. He would not look at us. He would just read, very slowly, page by page, the content of the textbook. When we showed up to the math department to complain, we were told that the esteemed professor had to teach something, and that they thought best to assign it to our class, since we were in the honours program and could learn on our own.

Recently, the Canadian state broadcaster ran an article about how an otherwise reasonable Canadian university accepted 4000 foreign students, of whom only 38 showed up at all, and only 5 stuck with their classes during the first term. It appears that they contributed to a scam meant to facilitate immigration.

Apple announced triumphantly its ground-breaking ‘Apple Intelligence’ feature… only to restrict it so much that it becomes irrelevant.

For decades, the leading chip maker in the world was Intel. Intel had such a lead over its competitors that it was believed that attempts at competing against it were useless. Today, few would say that Intel makes the best processors. Its competitors quickly caught up.

Such dysfunctions can be explained by many effects. But I believe that a critical one is a culture of lies. It might be fine to lie to your enemy but once you start lying to yourself, you are inviting trouble. Lies about your objectives are the worst kind.

So you run a college. And you tell prospective students that the quality of their education is your primary concern. But it is a lie. In fact, you want to keep your enrolment at healthy levels and not have too much trouble with your faculty. You do not care very much for what the students learn. In truth, it does not matter because the employers also do not care. You selected bright people and your students managed to fulfil bureaucratic obligations for four years, and that’s good enough.

So what do you do with the esteemed professor who lectures by reading line-by-line in the textbook? Does he lower your enrolment? No. He does not. He is a prestigious researcher who won many awards, so he might contribute to the prestige of your university and even attract students. Trying to get him to learn to teach would mean going against the faculty, which is not in your interest as an administrator. Of course, the students are having a terrible experience, but what does it matter? They will get their grades and they will graduate. And if they tell their story, nobody will care.

Or, as in the example of the Canadian university, you are happy to have enrolled thousands of students. That these students were not actual students does not matter. You look really good on paper, at least for a time. And if you get caught… well, you did your best, didn’t you?

Apple probably did not really want to get into consumer AI. They had a really long time to do something with Siri, and they did nothing at all. Apple probably wants to sell iPhones, watches and laptops. They want to look like they are doing something about AI, but they don’t really care.

As for Intel, I don’t have a good theory but I suspect that what happened is the following classic scenario. Intel’s leadership likely touted “we’re still number one” while ignoring warning signs. Nobody cared about long-term technological relevance. They were crushing the competition on the market. It seemed like they had many new things upcoming. It was good enough.

It is not that people running colleges, Intel or Apple are incapable of succeeding. They all succeed on their own terms. They are just not sharing openly what these terms are. And if pushed, they will lie about their goals.

So, how can really smart people appear totally incompetent? Because they are lying to you. What they are openly setting out to achieve is not their real goal.

Given a choice, prefer people who are brutally honest and highly capable. You are much more likely to see them exceed your expectations.

How helpful is AI?

2025-04-08 03:41:27

Do large language models (AI) make you 3x faster or only 3% faster? The answer depends on the quality of the work you are producing.

If you need something like a stock photo but not much beyond that, AI can make you 10x faster.

If you need a picture taken at the right time of the right person, AI doesn’t help you much.

If you need a piece of software that an intern could have written, AI can do it 10x faster than the intern.

If you need a piece of software that only 10 engineers in the country can understand, AI doesn’t help you much.

The effect is predictable: finding work becomes more difficult if your skill level is low. However, if you are a highly skilled individual, you can eliminate much of the boilerplate work and focus on what matters. Thus, elite people are going to become even more productive.

Faster shuffling in Go with batching

2025-04-07 04:26:56

Random integer generation is a fundamental operation in programming, often used in tasks like shuffling arrays. Go’s standard library provides convenient tools like rand.Shuffle for such purposes. You may be able to beat the standard library by a generous margin. Let us see why.
Go’s rand.Shuffle implements the Fisher-Yates (or Knuth) shuffle algorithm, a standard method for generating a uniform random permutation. In Go, you may call it like so:
import "math/rand/v2"

func shuffleStandard(data []uint64) {
  rand.Shuffle(len(data), func(i, j int) {
    data[i], data[j] = data[j], data[i]
  })
}
The rand.Shuffle function takes a length and a swap function, internally generating random indices in the range [0, i) for each position i and swapping elements accordingly.
Under the hood, rand.Shuffle needs random integers within a shrinking range. This can be a surprisingly expensive step that requires a modulo/division operation per function call. We can optimize such a process so that divisions are avoided, with the following routine (Lemire, 2019):
package main

import (
    "math/bits"
    "math/rand/v2"
)

func randomBounded(rangeVal uint64) uint64 {
    random64bit := rand.Uint64()
    hi, lo := bits.Mul64(random64bit, rangeVal)
    leftover := lo
    if leftover < rangeVal {
        threshold := -rangeVal % rangeVal
        for leftover < threshold {
            random64bit = rand.Uint64()
            hi, lo = bits.Mul64(random64bit, rangeVal)
            leftover = lo
        }
    }
    return hi // [0, range)
}
This function returns an integer in the interval [0, rangeVal). It is fair: as long as the 64-bit generator is uniform, it will generate all values in the interval with the same probability.
The original shuffle function in the Go standard library used a slower division-based approach while the math/rand/v2 approach uses this fast algorithm. So for better performance, always use math/rand/v2 and not math/rand.
While efficient, this improved function needs to generate many random numbers. For rand.Shuffle, this means one rand.Uint64() call per element.
We can drastically optimize the function with batching (Brackett-Rozinsky and Lemire, 2024). That is, we can use a single 64-bit random integer to produce multiple random numbers, reducing the number of random-number generations.
I have implemented shuffling by batching in Go and my results are promising.
The partialShuffle64b function is the heart of my approach:
func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, 
                           bound uint64) uint64 {
    r := rand.Uint64()

    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }

    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound

        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], 
                     storage[pos1]
    }
    return bound
}
The partialShuffle64b function performs a partial shuffle of a slice by generating k random indices from a range of size n, using a single 64-bit random number r obtained from rand.Uint64(), and then swapping elements at those indices. I reuse an array where the indices are stored. Though it looks much different, it is essentially a generalization of the randomBounded routine already used by the Go standard library.
Using the partialShuffle64b function, we can code a shuffle function where we generate indices in batches of two:
func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

We can also generalize and generate the indices in batches of six when possible:

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}

    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

Using the fact that Go comes with full support for benchmarking, we can try shuffling 10,000 elements as a Go benchmark:

func BenchmarkShuffleStandard10K(b *testing.B) {
    size := 10000
    data := make([]uint64, size)
    for i := 0; i < size; i++ {
        data[i] = uint64(i)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        shuffleStandard(data)
    }
}

func BenchmarkShuffleBatch2_10K(b *testing.B) { ... }
func BenchmarkShuffleBatch23456_10K(b *testing.B) { ... }
On an Intel Gold 6338 CPU @ 2.00GHz with Go 1.24, I get the following results:
  • Standard (rand.Shuffle): 13 ns/element.
  • Batch 2: 9.3 ns/element (~1.4x faster).
  • Batch 2-3-4-5-6: 5.0 ns/element (~2.6x faster).
You will get different results on your systems. I make my code available on GitHub.
The batched approaches reduce the number of random numbers generated, amortizing the cost of rand.Uint64() across multiple outputs. The 6-way batch maximizes this effect, yielding the best performance.

Go’s rand.Shuffle is a solid baseline, but batching random integer generation can make it much faster. By generating multiple random numbers from a single 64-bit value, we can boost efficiency—by over 2x in our benchmarks.

References

  • Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random Integer Generation, Software: Practice and Experience 55 (1), 2024.
  • Daniel Lemire, Fast Random Integer Generation in an Interval, ACM Transactions on Modeling and Computer Simulation 29 (1), 2019.

Appendix: Code

package main

// Nevin Brackett-Rozinsky, Daniel Lemire, Batched Ranged Random 
// Integer Generation, Software: Practice and Experience 55 (1), 2024.
// Daniel Lemire, Fast Random Integer Generation in an Interval, 
// ACM Transactions on Modeling and Computer Simulation, 29 (1), 2019.
import (
    "math/bits"
    "math/rand/v2"
)

func shuffleBatch2(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}
    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
}

func shuffleBatch23456(storage []uint64) {
    i := uint64(len(storage))
    indexes := [7]uint64{}

    for ; i > 1<<30; i-- {
        partialShuffle64b(storage, indexes, i, 1, i)
    }
    bound := uint64(1) << 60
    for ; i > 1<<19; i -= 2 {
        bound = partialShuffle64b(storage, indexes, i, 2, bound)
    }
    bound = uint64(1) << 57
    for ; i > 1<<14; i -= 3 {
        bound = partialShuffle64b(storage, indexes, i, 3, bound)
    }
    bound = uint64(1) << 56
    for ; i > 1<<11; i -= 4 {
        bound = partialShuffle64b(storage, indexes, i, 4, bound)
    }
    bound = uint64(1) << 55
    for ; i > 1<<9; i -= 5 {
        bound = partialShuffle64b(storage, indexes, i, 5, bound)
    }
    bound = uint64(1) << 54
    for ; i > 6; i -= 6 {
        bound = partialShuffle64b(storage, indexes, i, 6, bound)
    }
    if i > 1 {
        partialShuffle64b(storage, indexes, i, i-1, 720)
    }
}

func partialShuffle64b(storage []uint64, indexes [7]uint64, n, k, 
  bound uint64) uint64 {
    r := rand.Uint64()

    for i := uint64(0); i < k; i++ {
        hi, lo := bits.Mul64(n-i, r)
        r = lo
        indexes[i] = hi
    }

    if r < bound {
        bound = n
        for i := uint64(1); i < k; i++ {
            bound *= (n - i)
        }
        t := (-bound) % bound

        for r < t {
            r = rand.Uint64()
            for i := uint64(0); i < k; i++ {
                hi, lo := bits.Mul64(n-i, r)
                r = lo
                indexes[i] = hi
            }
        }
    }
    for i := uint64(0); i < k; i++ {
        pos1 := n - i - 1
        pos2 := indexes[i]
        storage[pos1], storage[pos2] = storage[pos2], 
                    storage[pos1]
    }
    return bound
}

Mixing ARM NEON with SVE code for fun and profit

2025-03-29 09:44:47

Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.

These processors have special instructions, called ARM NEON, that provide single-instruction-multiple-data (SIMD) parallelism. For example, you can compare sixteen values with sixteen other values using one instruction.

Some of the most recent ARM processors also support even more advanced instructions called SVE or Scalable Vector Extension. They have added more and more extensions over time: SVE 2 and SVE 2.1.

While ARM NEON’s registers are set at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 128 bits (16 bytes), like NEON.
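
On a machine that supports SVE, you can query the register width at run time with the svcntb intrinsic, which returns the vector length in bytes. Here is a minimal sketch of mine (compile for a target that enables SVE, e.g., -march=armv8-a+sve):

#include <arm_sve.h>
#include <cstdio>

int main() {
  // svcntb() returns the number of bytes in an SVE vector register:
  // e.g., 32 on a Graviton 3 (256-bit SVE), 16 on a Graviton 4 (128-bit SVE2).
  std::printf("SVE vector length: %u bytes\n", unsigned(svcntb()));
  return 0;
}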

So if you are a low-level engineer, you are supposed to choose: either you use ARM NEON or SVE. As a comment to a recent article I wrote, Ashton Six observed that you can mix and match these instructions (NEON and SVE) because it is guaranteed that the first 128 bits of SVE registers are the NEON registers. Ashton provided a demonstration using assembly code.
If you have a recent C/C++ compiler (e.g., GCC 14), it turns out that you can fairly easily switch back and forth between NEON and SVE. The arm_neon_sve_bridge.h header provides two functions:
  • svset_neonq: sets the contents of a NEON 128-bit vector (uint8x16_t, int32x4_t, etc.) into an SVE scalable vector (svuint8_t, svint32_t, etc.).
  • svget_neonq: Extracts the first 128 bits of an SVE scalable vector and returns them as a NEON 128-bit vector.

These functions are ‘free’: they are likely compiled to no instruction whatsoever.
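
As a small illustration, here is a sketch of mine (not from the original post) that moves a NEON vector into an SVE register, operates on it with SVE intrinsics, and moves it back:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

// Sketch: add 1 to each byte of a NEON vector using SVE arithmetic.
// svundef_u8() provides a vector to write into; svset_neonq_u8 and
// svget_neonq_u8 should compile to no instructions at all.
uint8x16_t increment_bytes(uint8x16_t nvec) {
  svuint8_t vec = svset_neonq_u8(svundef_u8(), nvec);
  svbool_t mask = svwhilelt_b8(0, 16); // only the first 16 lanes
  vec = svadd_n_u8_m(mask, vec, 1);    // SVE add on those lanes
  return svget_neonq_u8(vec);          // back to a NEON vector
}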

Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}

Effectively, we compute the maximum 32-bit integer in the register, considering it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).

Let us consider the following SVE alternative. It converts the input to an SVE vector, creates a mask for all 16 positions, compares each element to zero to generate a predicate of non-zero positions, and finally tests if any elements are non-zero, returning 1 if so or 0 if all are zero—essentially performing an efficient vectorized “any non-zero” check.
int sve_non_zero_max(uint8x16_t nvec) {
  svuint8_t vec;
  vec = svset_neonq_u8(vec, nvec);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
}

Except for the initialization of mask, the function is made of two instructions cmpne and cset. These two instructions may be fused to one instruction in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.

If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.