Blog of Daniel Lemire

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.

Mixing ARM NEON with SVE code for fun and profit

2025-03-29 09:44:47

Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.

These processors have special instructions, called ARM NEON, that provide a form of parallelism known as Single Instruction, Multiple Data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.

Some of the most recent ARM processors also support a more advanced instruction set called SVE (Scalable Vector Extension). ARM has added further extensions over time: SVE 2 and SVE 2.1.

While ARM NEON’s registers are set at 128 bits, SVE registers are a multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 128 bits (16 bytes), like NEON.

So if you are a low-level engineer, you might think you have to choose: either you use ARM NEON or you use SVE. In a comment on a recent article I wrote, Ashton Six observed that you can mix and match these instructions (NEON and SVE) because the first 128 bits of the SVE registers are guaranteed to be the NEON registers. Ashton provided a demonstration using assembly code.
If you have a recent C/C++ compiler (e.g., GCC 14), it turns out that you can fairly easily switch back and forth between NEON and SVE. The arm_neon_sve_bridge.h header provides two functions:
  • svset_neonq: sets the contents of a NEON 128-bit vector (uint8x16_t, int32x4_t, etc.) into an SVE scalable vector (svuint8_t, svint32_t, etc.).
  • svget_neonq: extracts the first 128 bits of an SVE scalable vector and returns them as a NEON 128-bit vector.

These functions are ‘free’: they are likely compiled to no instruction whatsoever.

Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:

int veq_non_zero_max(uint8x16_t v) {
  return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0;
}

Effectively, we compute the maximum 32-bit integer in the register, treating it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).

Let us consider the following SVE alternative. It converts the input to an SVE vector, creates a mask for all 16 positions, compares each element to zero to generate a predicate of non-zero positions, and finally tests if any elements are non-zero, returning 1 if so or 0 if all are zero—essentially performing an efficient vectorized “any non-zero” check.
int sve_non_zero_max(uint8x16_t nvec) {
  // Place the NEON vector in the low 128 bits of an SVE vector.
  svuint8_t vec = svset_neonq_u8(svundef_u8(), nvec);
  // Build a predicate covering the first 16 bytes.
  svbool_t mask = svwhilelt_b8(0, 16);
  // Compare each byte to zero under the predicate.
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  // Return 1 if any byte is non-zero, 0 otherwise.
  return svptest_any(mask, cmp);
}

Except for the initialization of mask, the function compiles down to two instructions, cmpne and cset. These two instructions may be fused into one instruction in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.
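
The bridge works in the other direction too. As a small sketch (my own example, assuming the same bridge header and SVE support), we can do part of a computation with SVE instructions and hand the low 128 bits back to NEON with svget_neonq:

#include <arm_neon.h>
#include <arm_sve.h>
#include <arm_neon_sve_bridge.h>

// Add two NEON vectors using an SVE instruction, then return the result
// as a NEON vector: svget_neonq extracts the low 128 bits.
uint8x16_t add_via_sve(uint8x16_t a, uint8x16_t b) {
  svuint8_t va = svset_neonq_u8(svundef_u8(), a);
  svuint8_t vb = svset_neonq_u8(svundef_u8(), b);
  svuint8_t sum = svadd_u8_x(svptrue_b8(), va, vb);
  return svget_neonq_u8(sum);
}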

If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.

 

Unsigned comparisons using signed types

2025-03-25 07:24:50

There are two main types of fixed-precision integers in modern software: unsigned and signed. In C++20 and above, the signed integers must use the two’s complement convention. Other programming languages typically specify two’s complement as well.

Two’s complement is a method for representing signed integers in binary, where the leftmost bit serves as the sign bit—0 for positive numbers and 1 for negative ones—allowing computers to handle both positive and negative values efficiently with simple arithmetic. Positive numbers are written as standard binary (e.g., 5 in 8-bit is 00000101), while negative numbers are formed by taking the positive version, flipping all bits (one’s complement), and adding 1 (e.g., -5 becomes 11111011 from 00000101).

There are cases where you have signed types but you wish to do unsigned comparisons. It happens in Java (which lacks unsigned types) and with some of the x64 SIMD instruction sets: extensions to the x64 architecture that allow a single instruction to perform the same operation on multiple data elements simultaneously.

Thus, you would want any negative value to be larger than any positive value. The trick is to use the fact that arithmetic operations typically wrap around in hardware: if x + y is too large and exceeds the range, then the processor replaces it with x + y - K where K is the total number of values (e.g., 2 to the power 64). Let M be the smallest possible value (the most negative value), and take two signed integers x and y; then (x + M) < (y + M) is equivalent to comparing x and y as if they had been cast to unsigned integer values.

To see why this is so, consider that the transformation x to x + M maps 0 to M (so 0 becomes the smallest value). It maps the positive values, which range from 1 to -M-1, to the range from M+1 to -1 (so positives become negatives). The negative values all overflow in such a way that the most significant bit (the sign bit) becomes a zero while the other bits remain unchanged: the negative values, from M to -1, are mapped to the range from 0 to -M-1 (so negatives become positives).

Unfortunately, this will not generally work in C or C++, where signed overflow is undefined behavior. However, it will work if you code for SIMD instruction sets using intrinsics (special functions used by SIMD enthusiasts). It will also work in other programming languages such as Java or Go.
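
For instance, SSE2 on x64 has a signed byte comparison but no unsigned one. A sketch of the trick with intrinsics (the function name is mine): flip the sign bit of both inputs, which amounts to adding M for 8-bit values, and then use the signed comparison.

#include <emmintrin.h> // SSE2

// Unsigned 8-bit 'greater than' built from the signed comparison
// _mm_cmpgt_epi8: XOR both inputs with 0x80 (the value M for bytes).
__m128i cmpgt_u8(__m128i a, __m128i b) {
  const __m128i bias = _mm_set1_epi8((char)0x80);
  return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}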

To avoid undefined behavior in C and C++, we can use the following identity:

x + M = (x^M) + ((x&M)<<1)

where ^ is the XOR operation and & is the AND operation. This identity holds when you assume that the addition wraps around and that the left shift acts as if the values were unsigned. Because M has only the most significant bit set in two’s complement notation, the second term is zero: the single set bit of x&M is shifted out. Thus we have

x + M = x^M

Hence, instead of using the addition with M, we can use the XOR with M: (x ^ M) < (y ^ M). That’s well defined in C and C++.
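
Here is a small self-contained illustration in C (the helper name is mine):

#include <limits.h>
#include <stdio.h>

// Compares x and y as if they were unsigned, using only a signed comparison.
int unsigned_less(int x, int y) {
  return (x ^ INT_MIN) < (y ^ INT_MIN);
}

int main(void) {
  // As an unsigned value, -1 has all bits set and is the largest integer.
  printf("%d\n", unsigned_less(5, -1)); // prints 1
  printf("%d\n", unsigned_less(-1, 5)); // prints 0
  return 0;
}
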
Credit: Thanks to Marc Reynolds for his input.

Speeding up C++ code with template lambdas

2025-03-16 01:29:50

Let us consider a simple C++ function which divides all values in a range of integers:

void divide(std::span<int> i, int d) {
    for (auto& value : i) {
        value /= d;
    }
}

A division between two integers is one of the most expensive operations you can perform on integers: it is much slower than a multiplication, which is, in turn, more expensive than an addition. If the divisor d is known at compile time, this function can be much faster. E.g., if d is 2, the compiler might optimize away the division and use a shift and a few cheap instructions instead. The same is true for any compile-time constant: the compiler can often do better knowing the constant. (See Lemire et al., Integer Division by Constants: Optimal Bounds, 2021.)
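
For instance, for d equal to 2, a compiler typically generates something equivalent to the following sketch (an illustration, assuming the usual arithmetic right shift): an adjustment so that negative values round toward zero, followed by a shift.

#include <cstdint>

// Roughly what a compiler produces for value / 2 with a constant divisor:
// add 1 to negative values (so the division truncates toward zero),
// then shift right by one.
int32_t divide_by_two(int32_t value) {
  int32_t adjustment = static_cast<int32_t>(static_cast<uint32_t>(value) >> 31);
  return (value + adjustment) >> 1;
}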

If the ‘divide’ function is inlined and the divisor is known at compile time, then an optimizing compiler will do fine. But we cannot expect all our functions to be inlined in practice.

In C++, a template function is defined using the template keyword followed by a parameter (usually a type parameter) enclosed in angle brackets < >. The template parameter acts as a placeholder that gets replaced with an actual type or value when the function is instantiated.

In C++, you can turn the division parameter into a template parameter:

template <int d>
void divide(std::span<int> i) {
    for (auto& value : i) {
        value /= d;
    }
}

The template function is not itself a function, but rather a recipe to generate functions: we provide the integer d and a function is created. This allows the compiler to work with a compile-time constant, producing faster code.

If you expect the divisor to be between 2 and 6, you can call the template function from a general-purpose function like so:

void divide_fast(std::span<int> i, int d) {
    if(d == 2) {
        return divide<2>(i);
    }
    if(d == 3) {
        return divide<3>(i);
    }
    if(d == 4) {
        return divide<4>(i);
    }
    if(d == 5) {
        return divide<5>(i);
    }
    if(d == 6) {
        return divide<6>(i);
    }

    for (auto& value : i) {
        value /= d;
    }
}

You could do it with a switch/case if you prefer, but it does not simplify the code significantly (see the sketch below). The compiler can produce efficient code with a series of if clauses or a switch (e.g., a jump table).
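
For reference, the switch form might look like this (a sketch reusing the divide template above; the name divide_fast_switch is mine):

void divide_fast_switch(std::span<int> i, int d) {
    switch (d) {
    case 2: return divide<2>(i);
    case 3: return divide<3>(i);
    case 4: return divide<4>(i);
    case 5: return divide<5>(i);
    case 6: return divide<6>(i);
    default:
        for (auto& value : i) {
            value /= d;
        }
    }
}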

Unfortunately we have to expose a template function, which creates noise in our code base. We would prefer to keep all the logic inside one function. We can do so with lambda functions.

In C++, a lambda function (or lambda expression) is an anonymous, inline function that you can define on-the-fly, typically for short-term use. Starting with C++20, you have template lambda expressions.
We can almost do it like so:
void divide_fast(std::span<int> i, int d) {
    auto f = [&i]<int divisor>() {
      for (auto& value : i) {
        value /= divisor;
      }
    };
    if(d == 2) {
        return f<2>();
    }
    if(d == 3) {
        return f<3>();
    }
    if(d == 4) {
        return f<4>();
    }
    if(d == 5) {
        return f<5>();
    }
    if(d == 6) {
        return f<6>();
    }

    for (auto& value : i) {
        value /= d;
    }
}
Unfortunately, it does not quite work: with a template lambda, you cannot pass the template arguments directly using the f<2>() syntax. In C++, lambdas are syntactic sugar for objects of an unnamed class (a closure type). This class has an overloaded function call operator, operator(), which is what gets invoked when you “call” the lambda like a function. For a generic lambda, the operator() is a template, and its signature depends on the template parameters provided. Thus you need to specialize the template with an ugly expression (‘operator()<params>’):
void divide_fast(std::span<int> i, int d) {
    auto f = [&i]<int divisor>() {
      for (auto& value : i) {
        value /= divisor;
      }
    };
    if(d == 2) {
        return f.operator()<2>();
    }
    if(d == 3) {
        return f.operator()<3>();
    }
    if(d == 4) {
        return f.operator()<4>();
    }
    if(d == 5) {
        return f.operator()<5>();
    }
    if(d == 6) {
        return f.operator()<6>();
    }

    for (auto& value : i) {
        value /= d;
    }
}

In practice, it might still be a good choice. It keeps all the messy optimization hidden inside your function.

In the specific case of this function, I can probably get the same compiled output without templates as remarked by Martin Leitner-Ankerl:

void divide_fast_simple(std::span<int> i, int d) {
    auto f = [&i](int divisor) {
        for (auto& value : i) {
            value /= divisor;
        }
    };
    if(d == 2) {
        return f(2);
    }
    if(d == 3) {
        return f(3);
    }
    if(d == 4) {
        return f(4);
    }
    if(d == 5) {
        return f(5);
    }
    if(d == 6) {
        return f(6);
    }

    for (auto& value : i) {
        value /= d;
    }
}

And if it works in your particular case, you should avoid templates.

At the other end of the spectrum, Paul Dreik suggests doing it with template metaprogramming and a fold expression, like so:

void divide_fast(std::span<int> i, int d) {
  auto implementation =
      [&i]<int... ints>(std::integer_sequence<int, 0, ints...>, 
                                                  int d) {
    auto specialized =
        [&i]<int static_divisor>(
              std::integral_constant<int, static_divisor>,
                                 int dynamic_divisor) {
          if (static_divisor == dynamic_divisor) {
            for (auto &value : i) {
              value /= static_divisor;
            }
            return true;
          }
          return false;
        };
    const bool handled =
        (specialized(std::integral_constant<int, ints>{}, d) || ...);

    if (!handled) {
      // resort to dynamic calculation
      for (auto &value : i) {
        value /= d;
      }
    }
  };
  implementation(std::make_integer_sequence<int, 5>(), d);
}

It is likely overkill in this example, but template metaprogramming becomes handy with more challenging problems.

An overview of parallel programming (Go edition)

2025-03-10 05:35:41

In practice, the software we write runs on several processors. Unfortunately, much of what we take for granted on a single processor becomes false when there are more than one processor. For example, if two processors modify the same piece of memory, what is the state of the memory after the modifications? It is difficult to tell in general. It is possible that the modification of one processor could overwrite any modification done by the other processor. The reverse could be true: the modification done by the other processor could win out. Or, maybe both processors will attempt the modification and the result will be a confused state that corresponds to nothing we would like to see. We call such accesses a ‘data race’: a situation where two or more processors in a program access the same memory location simultaneously, and at least one of those accesses is a write operation, without proper synchronization.
It gets more difficult when you want two or more processors to meaningfully modify the same memory. For example, suppose that you have a variable that counts the number of products sold. Maybe you have different processors incrementing this one variable.

Threads and goroutines

A thread is the smallest unit of execution within a process that can be scheduled and run independently by a computer’s operating system. It represents a single sequence of instructions that the CPU executes, allowing a program to perform multiple tasks concurrently within the same process. A thread exists within a larger entity called a process, which is essentially a running program with its own memory space, resources, and state. A process can contain one or more threads, all sharing the same memory and resources (like open files or global variables) allocated to that process.
There is a limit to how many threads a program can manage efficiently. To enable even more parallelism, the Go programming language has its own concept of a thread called a goroutine. While a goroutine is not a thread in the traditional sense, it maps to conventional threads under the hood. The Go runtime uses a scheduler to map many goroutines onto a smaller number of threads. These threads are the actual threads recognized by the operating system—kernel-level entities with their own stack and execution context, as described in general computing.
A single thread in Go can execute multiple goroutines by switching between them efficiently. This makes goroutines much cheaper than OS threads—creating thousands or even millions of goroutines is practical, whereas spawning that many threads would exhaust system resources due to their larger memory footprint.
In some sense, Go blurs the distinction between concurrency and parallelism. Concurrency is about managing multiple tasks so they can make progress independently. Parallelism, however, involves executing multiple tasks simultaneously across multiple resources. While concurrency focuses on software design for task coordination and can work with or without multiple cores, parallelism relies on hardware to achieve true simultaneous execution, and the two can combine when a concurrent system leverages parallel resources for efficiency.
To start a goroutine, you only need to type the keyword ‘go’ followed by a function:

go func() {
    fmt.Println("Canada")
}()

This spawns a goroutine, but the Go runtime decides which thread it runs on, potentially sharing that thread with other goroutines.
Unfortunately, a program made of only this goroutine could be disappointing:

package main

import (
    "fmt"
)

func main() {
    go func() {
        fmt.Println("Canada")
    }()
}

The problem is that the main function might end before the goroutine can terminate. In Go, goroutines run concurrently (at the same time), and the main function (which is the main goroutine) does not automatically wait for other goroutines to complete. If the main goroutine exits, the program terminates, potentially before other goroutines finish. To ensure a goroutine terminates before the program ends, the simplest approach is to synchronize the main goroutine with the spawned goroutine using a mechanism like a channel or a WaitGroup.
In Go, a channel is a built-in construct that provides a way for goroutines (concurrent functions) to communicate with each other and synchronize their execution. A channel has a type and it is created with the make function:

ch := make(chan int) // A channel that carries integers

The keyword chan declares a channel. The type after chan (e.g., int) defines what kind of data the channel can transport.

We use the <- operator to send a value into a channel.

ch <- 42 // Send the value 42 into the channel

We use the <- operator to receive a value from a channel.

value := <-ch // Receive a value from the channel and store it in 'value'

We use the close function to indicate no more data will be sent: close(ch). Sending to a closed channel causes a panic.
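
As a small sketch of close in action, a receiver can drain a closed channel with range; the loop stops once the channel is closed and empty:

package main

import "fmt"

func main() {
    ch := make(chan int, 3)
    ch <- 1
    ch <- 2
    ch <- 3
    close(ch) // no more values will be sent

    // Ranging over a channel stops once it is closed and drained.
    for v := range ch {
        fmt.Println(v)
    }
}
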
The following program would print ‘Canada’:

package main

import "fmt"

func main() {
    ch := make(chan string) // Create a channel for strings

    go func() {
        ch <- "Canada" // Send a message to the channel
    }()

    msg := <-ch // Receive the message in the main goroutine
    fmt.Println(msg)
}

The following program illustrates how we might use a channel to wait for a goroutine to terminate:

package main

import (
    "fmt"
)

func main() {
    channel := make(chan bool) // Create a channel to signal completion

    go func() {
        fmt.Println("Canada")
        channel <- true // Signal that the goroutine is done
    }()

    <-channel // Wait for the goroutine to signal completion
}

The goroutine sends a value (true) to the channel when it finishes. The main function blocks at <-channel, waiting to receive from the channel, ensuring it does not exit until the goroutine completes.
By default, a channel is unbuffered: it has no capacity, and a send blocks until another goroutine is ready to receive the value. You can also create a buffered channel by passing a capacity to make; in that case, a send only blocks when the buffer is full:

ch := make(chan int, 2)
ch <- 1 // Does not block (buffer has space)
ch <- 2 // Does not block (buffer is now full)
ch <- 3 // Blocks until a value is received

In Go, you can pass multiple channels to a function just like any other arguments. Channels are first-class values in Go, meaning they can be passed as parameters, returned from functions, or stored in variables. When passing several channels to a function, you simply include them in the function’s parameter list, specifying their types.
Let us consider an example where we access two URLs:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// Response struct to hold URL and its fetch result
type Response struct {
    url    string
    status string
    err    error
}

func fetchURL(url string, ch chan Response) {
    // Create HTTP client with timeout
    client := &http.Client{
        Timeout: 10 * time.Second,
    }

    // Make HTTP GET request
    resp, err := client.Get(url)
    if err != nil {
        ch <- Response{url, "", err}
        return
    }
    defer resp.Body.Close()

    ch <- Response{url, resp.Status, nil}
}

func main() {
    // Record start time
    startTime := time.Now()
    // Create channel for responses
    ch := make(chan Response)

    // URLs to fetch
    urls := []string{
        "https://www.google.com",
        "https://www.github.com",
    }

    // Start goroutines for each URL
    for _, url := range urls {
        go fetchURL(url, ch)
    }

    // Collect responses
    for i := 0; i < len(urls); i++ {
        resp := <-ch
        if resp.err != nil {
            fmt.Printf("Error fetching %s: %v\n", resp.url, resp.err)
        } else {
            fmt.Printf("Successfully fetched %s: %s\n", resp.url, resp.status)
        }
    }

    // Close the channel (optional since program ends here)
    close(ch)

    // Calculate and print elapsed time
    elapsed := time.Since(startTime)

    fmt.Printf("\nTotal time taken: %s\n", elapsed)
}

This program defines a Response struct to hold the URL, its status, and any error that occurred. It implements a fetchURL function that takes a URL and a channel as parameters, uses an HTTP client with a 10-second timeout, makes a GET request to the URL, and sends the result through the channel. It uses defer to ensure the response body is closed. In this instance, the channel can be written to or read from in the function: to ensure that it can only be written to, we could declare the parameter as ch chan<- Response instead of ch chan Response. In the main function, we create a channel to receive responses, we define two URLs to fetch, we launch a goroutine for each URL, we collect responses from the channel and we print the results.
When we run this program, it will fetch both URLs simultaneously using separate goroutines, it will use channels to communicate results back to the main goroutine, and it will print the status (like “200 OK”) or any errors for each URL.
We can rewrite this program so that it is simpler, without goroutines, like so:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// Response struct to hold URL and its fetch result
type Response struct {
    url    string
    status string
    err    error
}

func fetchURLSynchro(url string) Response {
    // Create HTTP client with timeout
    client := &http.Client{
        Timeout: 10 * time.Second,
    }

    // Make HTTP GET request
    resp, err := client.Get(url)
    if err != nil {
        return Response{url, "", err}
    }
    defer resp.Body.Close()

    return Response{url, resp.Status, nil}
}

func main() {
    // URLs to fetch
    urls := []string{
        "https://www.google.com",
        "https://www.github.com",
    }
    startTime := time.Now()

    for i := 0; i < len(urls); i++ {
        resp := fetchURLSynchro(urls[i])
        if resp.err != nil {
            fmt.Printf("Error fetching %s: %v\n", resp.url, resp.err)
        } else {
            fmt.Printf("Successfully fetched %s: %s\n", resp.url, resp.status)
        }
    }
    elapsed := time.Since(startTime)
    fmt.Printf("\nTotal time taken: %s\n", elapsed)
}

The two programs do the same work, but one uses two goroutines (in addition to the main goroutine) while the other uses only the main goroutine. Testing these programs, you may find that the one using two goroutines completes faster: network accesses are typically expensive and easily parallelizable. That is, the two tasks can be done almost independently on your computer when executed simultaneously. Hence, you may find that you can query two URLs using concurrent HTTP requests in 250 ms whereas 400 ms is needed if the requests are made consecutively, using a single goroutine.
However, you should not assume that using more goroutines always makes software run faster. It often does not. Furthermore, additional goroutines might trigger the use of additional processors which increases the cost or power usage of your software. Adding more goroutines makes your software more complicated, more difficult to maintain and debug.
Formally speaking, you do not need parallelism (i.e., many physical processors) to execute two network requests concurrently. Executing such requests does not require much processing time and has much to do with waiting for the network response. Therefore, it is a case where using goroutines is likely appropriate. When splitting up more computational tasks into goroutines, you are less certain to get a performance boost.
To illustrate the point, let us consider the case where we are summing all values in an array. We consider two cases, first a small array (100k elements) and then a large array with millions of elements. For both cases, we can either use a simple function (with one goroutine) or a function that uses multiple goroutines. To maximize parallelism, we set the number of goroutines to the number of processors detected on the system by Go (runtime.NumCPU()).

package main

import (
    "fmt"
    "runtime"
    "testing"
)

// sequentialSum calculates the sum of an array sequentially
func sequentialSum(numbers []int) int {
    sum := 0
    for _, n := range numbers {
        sum += n
    }
    return sum
}

// goroutineSumWithChannels calculates the sum using goroutines and channels
func goroutineSumWithChannels(numbers []int) int {
    numGoroutines := runtime.NumCPU() // Use number of CPU cores
    chunkSize := (len(numbers) + numGoroutines - 1) / numGoroutines
    resultChan := make(chan int, numGoroutines) // Buffered channel for partial sums
    activeGoroutines := 0
    // Split the array into chunks and process with goroutines
    for i := 0; i < numGoroutines; i++ {
        start := i * chunkSize
        end := start + chunkSize
        if end > len(numbers) {
            end = len(numbers)
        }
        if start >= end {
            break
        }

        go func(slice []int) {
            partialSum := 0
            for _, n := range slice {
                partialSum += n
            }
            resultChan <- partialSum
        }(numbers[start:end])
        activeGoroutines++
    }

    // Collect partial sums from the channel
    total := 0
    for i := 0; i < activeGoroutines; i++ {
        total += <-resultChan
    }
    close(resultChan)

    return total
}

// Benchmark functions
func BenchmarkSequentialSum(b *testing.B) {
    numbers := make([]int, 100000)
    for i := range numbers {
        numbers[i] = i
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sequentialSum(numbers)
    }
}

func BenchmarkGoroutineSumWithChannels(b *testing.B) {
    numbers := make([]int, 100000)
    for i := range numbers {
        numbers[i] = i
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        goroutineSumWithChannels(numbers)
    }
}

// Benchmark functions
func BenchmarkSequentialSumLarge(b *testing.B) {
    numbers := make([]int, 10000000)
    for i := range numbers {
        numbers[i] = i
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sequentialSum(numbers)
    }
}

func BenchmarkGoroutineSumWithChannelsLarge(b *testing.B) {
    numbers := make([]int, 10000000)
    for i := range numbers {
        numbers[i] = i
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        goroutineSumWithChannels(numbers)
    }
}

func main() {
    fmt.Printf("Number of CPU cores: %d\n", runtime.NumCPU())

    res := testing.Benchmark(BenchmarkGoroutineSumWithChannels)
    fmt.Println("BenchmarkGoroutineSumWithChannels", res)
    ress := testing.Benchmark(BenchmarkSequentialSum)
    fmt.Println("BenchmarkSequentialSum", ress)

    resl := testing.Benchmark(BenchmarkGoroutineSumWithChannelsLarge)
    fmt.Println("BenchmarkGoroutineSumWithChannelsLarge", resl)
    ressl := testing.Benchmark(BenchmarkSequentialSumLarge)
    fmt.Println("BenchmarkSequentialSumLarge", ressl)
}

On a system with a large number of processors, we might get the following result:

Number of CPU cores: 128
BenchmarkGoroutineSumWithChannels     4048      258798 ns/op
BenchmarkSequentialSum    23756      50516 ns/op
BenchmarkGoroutineSumWithChannelsLarge      744    1414114 ns/op
BenchmarkSequentialSumLarge      237       5030224 ns/op

We see that when summing up the modest array, the approach using 128 goroutines takes five times longer. If it does end up using 128 processors, then it might be 128 * 5 = 640 times less efficient! The lesson is that if the task is sufficiently inexpensive, such as summing up thousands of integers, you should not use more than one goroutine.
In the instance where we are summing 10 million integers, the parallelized task is more interesting: it goes 3.6 times faster. Even so, the single-goroutine approach is likely much more efficient overall: a single processor takes only 3.6 times longer than over one hundred goroutines.
The problem with a simple sum is that it is driven by memory accesses and is not especially computational. What if we consider a more expensive task? Let us sum the sines of the values of an array using various numbers of goroutines (1, 2, …). We use one million values in the array.

package main

import (
    "fmt"
    "math"
    "runtime"
    "testing"
)

func computeSineSum(numbers []int) float64 {
    sum := 0.0
    for _, n := range numbers {
        sum += math.Sin(float64(n))
    }
    return sum
}

// computeSineSumWithGoroutines computes the sum of sines with a specified number of goroutines
func computeSineSumWithGoroutines(numbers []int, numGoroutines int) float64 {
    chunkSize := (len(numbers) + numGoroutines - 1) / numGoroutines
    resultChan := make(chan float64, numGoroutines)

    for i := 0; i < numGoroutines; i++ {
        start := i * chunkSize
        end := start + chunkSize
        if end > len(numbers) {
            end = len(numbers)
        }
        if start >= end {
            break
        }

        go func(slice []int) {
            partialSum := 0.0
            for _, n := range slice {
                partialSum += math.Sin(float64(n))
            }
            resultChan <- partialSum
        }(numbers[start:end])
    }

    // Collect results
    total := 0.0
    activeGoroutines := (len(numbers) + chunkSize - 1) / chunkSize
    for i := 0; i < activeGoroutines; i++ {
        total += <-resultChan
    }
    close(resultChan)
    return total
}

// Benchmarks
func BenchmarkSequential(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000 // Keep numbers manageable
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSum(numbers)
    }
}

func Benchmark1Goroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, 1)
    }
}

func Benchmark2Goroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, 2)
    }
}

func Benchmark4Goroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, 4)
    }
}

func Benchmark8Goroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, 8)
    }
}

func Benchmark16Goroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, 16)
    }
}

func BenchmarkMaxGoroutines(b *testing.B) {
    numbers := make([]int, 1000000)
    for i := range numbers {
        numbers[i] = i % 1000
    }

    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        computeSineSumWithGoroutines(numbers, runtime.NumCPU())
    }
}

func main() {

    fmt.Printf("CPU cores: %d\n", runtime.NumCPU())
    res1 := testing.Benchmark(BenchmarkSequential)
    fmt.Println("BenchmarkSequential", res1)
    res11 := testing.Benchmark(Benchmark1Goroutines)
    fmt.Println("Benchmark1Goroutines", res11)
    res2 := testing.Benchmark(Benchmark2Goroutines)
    fmt.Println("Benchmark2Goroutines", res2)
    res4 := testing.Benchmark(Benchmark4Goroutines)
    fmt.Println("Benchmark4Goroutines", res4)
    res8 := testing.Benchmark(Benchmark8Goroutines)
    fmt.Println("Benchmark8Goroutines", res8)
    res16 := testing.Benchmark(Benchmark16Goroutines)
    fmt.Println("Benchmark16Goroutines", res16)
    resmax := testing.Benchmark(BenchmarkMaxGoroutines)
    fmt.Println("BenchmarkMaxGoroutines", resmax)
}

On a powerful machine with many cores, we might get the following results:

CPU cores: 128
Benchmark1Goroutines 114 13701908 ns/op
Benchmark2Goroutines 134 8913817 ns/op
Benchmark4Goroutines 253 4648170 ns/op
Benchmark8Goroutines 472 2272842 ns/op
Benchmark16Goroutines 835 1227975 ns/op
BenchmarkMaxGoroutines 916 1189217 ns/op

Going from one goroutine to two improves the speed by a factor of 1.5. Going from one goroutine to 16 goroutines improves the speed by a factor of 11. Increasing the number of goroutines beyond 16 brings no further gain. This pattern of sublinear gains up to a plateau is rather typical.

Yet goroutines and channels can be remarkably efficient in their own right. Let us create a chain of channels. Each goroutine has an input channel and an output channel. As soon as a value is received from the input channel, it is written to the output channel. We link thousands of goroutines in a chain of input and output channels:

package main

import (
    "fmt"
    "time"
)

// relay function represents each goroutine in the chain
func relay(input <-chan int, output chan<- int) {
    // Wait for value from input channel
    value := <-input
    // Send value to output channel
    output <- value
}

func main() {
    // Number of goroutines in the chain
    const chainLength = 10000

    // Create slice to hold all channels
    channels := make([]chan int, chainLength+1)

    // Initialize all channels
    for i := range channels {
        channels[i] = make(chan int)
    }

    // Start timing
    startTime := time.Now()

    // Create the chain of goroutines
    for i := 0; i < chainLength; i++ {
        go relay(channels[i], channels[i+1])
    }

    // Send initial value into the first channel
    go func() {
        channels[0] <- 42
    }()

    // Wait for and receive the value from the last channel
    result := <-channels[chainLength]

    // Calculate elapsed time
    elapsed := time.Since(startTime)

    // Print results
    fmt.Printf("Value %d successfully passed through %d goroutines\n", result, chainLength)
    fmt.Printf("Time taken: %v\n", elapsed)
    fmt.Printf("Average time per hop: %v\n", elapsed/time.Duration(chainLength))
}

Running this program, you may get the following result:

Value 42 successfully passed through 10000 goroutines
Time taken: 13.987416ms
Average time per hop: 1.398µs

Effectively, you can traverse nearly a million goroutines per second in this manner.

Wait groups

Another common approach for managing multiple goroutines is to use sync.WaitGroup. Before we give an example, we need to review the keyword defer. In Go, the defer keyword is used to schedule a function call to be executed just before the surrounding function (the one containing the defer statement) returns. For example, the following function would print Canada followed by Mexico:

func() {
    defer fmt.Println("Mexico")
    fmt.Println("Canada")
}()

Let us consider a goroutine synchronized with a wait group, using defer:

package main

import (
    "fmt"
    "sync"
)

func main() {
    var wg sync.WaitGroup
    wg.Add(1) // Increment the WaitGroup counter by 1

    go func() {
        defer wg.Done() // Decrement the counter when the goroutine completes
        fmt.Println("Canada")
    }()

    wg.Wait() // Wait for the counter to reach 0
}

For a single goroutine like in our example, the channel approach is simpler because it requires fewer lines and no additional imports. However, if you have multiple goroutines, the wait group might be simpler.

Atomics

If you need to read data from different goroutines, that is not a problem as long as the data remains constant. If nobody writes to the data, there is no problem.
Unfortunately, we often need to change the data, while reading it from different goroutines. Sometimes you can use channels to communicate. But that is sometimes not enough.
Let us consider an example. We take an array of 10 integers, and goroutines randomly decrement one array element and then increment another array element. Initially, the sum of all elements should be 1000 and it should remain 1000 unless there is a bug. We can implement our code like so:

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "time"
)

func main() {

    // Initialize array with 10 elements, each set to 100
    arr := [10]int{100, 100, 100, 100, 100, 100, 100, 100, 100, 100}
    var wg sync.WaitGroup

    // Function for goroutine behavior
    worker := func() {
        defer wg.Done()
        r := rand.New(rand.NewSource(time.Now().UnixNano()))

        // Run for 200000000 iterations as an example
        for i := 0; i < 200000000; i++ {
            // Pick first random index
            idx1 := r.Intn(10)
            // Only proceed if value > 0
            if arr[idx1] > 0 {
                // Decrement first element
                arr[idx1]--

                // Pick second random index
                idx2 := r.Intn(10)
                // Increment second element
                arr[idx2]++

            }
        }
    }

    // Launch two goroutines
    wg.Add(2)
    go worker()
    go worker()
    fmt.Println("waiting...")
    wg.Wait()
    fmt.Println("waiting...ok")

    fmt.Println("\nFinal array state:", arr)
    // Verify total sum should still be 1000 (10 * 100)
    sum := 0
    for _, val := range arr {
        sum += val
    }
    fmt.Println("Total sum:", sum)
}

This program is wrong: it contains data races because we are writing and reading data from different goroutines without synchronization. A possible output of this program is the following:

Final array state: [3001 644 880 324 2319 2845 3664 160 232 1741]
Total sum: 15810

Observe how the sum is higher than expected.

In Go you can avoid such a bug with the guarantee of atomicity provided by the sync/atomic package, which ensures that operations like increments are executed as indivisible steps, preventing race conditions. Functions like atomic.AddInt32(&x, 1) or atomic.AddInt64(&x, 1) ensure that the increment operation (read-modify-write) is performed atomically. This means that even if two threads execute the increment concurrently, the operations are serialized at the hardware level, and no updates are lost.

package main

import (
    "fmt"
    "math/rand"
    "sync"
    "sync/atomic"
    "time"
)

func main() {

    // Initialize array with 10 elements, each set to 100
    arr := [10]int32{100, 100, 100, 100, 100, 100, 100, 100, 100, 100}
    var wg sync.WaitGroup

    // Function for goroutine behavior
    worker := func() {
        defer wg.Done()
        r := rand.New(rand.NewSource(time.Now().UnixNano()))

        // Run for 200000000 iterations as an example
        for i := 0; i < 200000000; i++ {
            // Pick first random index
            idx1 := r.Intn(10)
            // Only proceed if value > 0
            val := atomic.LoadInt32(&arr[idx1])
            if val > 0 {
                if atomic.CompareAndSwapInt32(&arr[idx1], val, val-1) {
                    // Pick second random index
                    idx2 := r.Intn(10)
                    // Increment second element
                    atomic.AddInt32(&arr[idx2], 1)
                }

            }
        }
    }

    // Launch two goroutines
    wg.Add(2)
    go worker()
    go worker()
    fmt.Println("waiting...")
    wg.Wait()
    fmt.Println("waiting...ok")

    fmt.Println("\nFinal array state:", arr)
    // Verify total sum should still be 1000 (10 * 100)
    sum := 0
    for _, val := range arr {
        sum += int(val)
    }
    fmt.Println("Total sum:", sum)
}

The expression atomic.LoadInt32(&arr[idx1]) atomically reads the value at array position idx1. The value is stored in a local variable (val): data races are not possible with a local variable. We then use a Compare-And-Swap (CAS) operation: atomic.CompareAndSwapInt32(&arr[idx1], val, val-1). It checks if arr[idx1] still equals val (the previously loaded value) and if true, it sets arr[idx1] to val-1. It returns true if successful, false if the value changed since the load. Importantly, it executes as a single atomic operation. Finally, we use atomic.AddInt32(&arr[idx2], 1) to atomically add 1 to arr[idx2].

If you run the new program, the sum of the values in the array is maintained. The program is safe.
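
For the simpler scenario mentioned at the start of this post, a single shared counter of products sold, an atomic add is all you need. A sketch:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var sold int64 // shared counter
    var wg sync.WaitGroup

    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                atomic.AddInt64(&sold, 1) // atomic increment: no update is lost
            }
        }()
    }
    wg.Wait()
    fmt.Println("products sold:", atomic.LoadInt64(&sold)) // always 4000
}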

Mutex

Atomic operations (like atomic.AddInt32 or atomic.CompareAndSwapInt32) are designed for single, indivisible operations on a single variable. They become insufficient when we have more complex data structures.

In these more complex cases, we use a mutex. A mutex (short for “mutual exclusion”) is a synchronization primitive used in concurrent programming to prevent multiple threads or processes from simultaneously accessing or modifying a shared resource. It ensures that only one thread (or goroutine) can enter a critical section of code at a time, thus avoiding race conditions and maintaining data consistency. Essentially, only one ‘lock’ can be held at any given time.

To illustrate, let us create a program where money is transferred between two accounts, and we need to ensure that the withdrawal from one account and deposit to another happen together without interference from other goroutines. This requires protecting a multi-step operation, which goes beyond what atomic operations can do.

package main

import (
    "fmt"
    "sync"
)

type Bank struct {
    accounts map[string]int // Map of account IDs to balances
    mutex    sync.Mutex    // Mutex to protect the entire transfer operation
}

func NewBank() *Bank {
    return &Bank{
        accounts: map[string]int{
            "Alice": 1000,
            "Bob":   500,
        },
    }
}

func (b *Bank) Transfer(from, to string, amount int, wg *sync.WaitGroup) {
    defer wg.Done()

    // Lock the mutex to protect the entire transfer operation
    b.mutex.Lock()
    defer b.mutex.Unlock() // Ensure unlock happens even if there's a panic

    // Check if source account has sufficient funds
    if b.accounts[from] >= amount {
        // Perform the transfer: two related operations
        b.accounts[from] -= amount
        b.accounts[to] += amount
        fmt.Printf("Transferred %d from %s to %s. New balances: %s=%d, %s=%d\n",
            amount, from, to, from, b.accounts[from], to, b.accounts[to])
    } else {
        fmt.Printf("Failed transfer of %d from %s to %s: insufficient funds\n",
            amount, from, to)
    }
}

func (b *Bank) GetBalance(account string) int {
    b.mutex.Lock()
    defer b.mutex.Unlock()
    return b.accounts[account]
}

func main() {
    bank := NewBank()
    var wg sync.WaitGroup

    // Launch multiple concurrent transfers
    wg.Add(4)
    go bank.Transfer("Alice", "Bob", 200, &wg)
    go bank.Transfer("Bob", "Alice", 100, &wg)
    go bank.Transfer("Alice", "Bob", 300, &wg)
    go bank.Transfer("Bob", "Alice", 50, &wg)

    wg.Wait()

    fmt.Printf("Final balances: Alice=%d, Bob=%d\n",
        bank.GetBalance("Alice"), bank.GetBalance("Bob"))
}

In general, acquiring and releasing a mutex involves system-level operations, which introduces overhead even when there is no contention. This can slow down a program compared to lock-free alternatives like atomic operations.
In complex cases, it is also possible to trigger a deadlock. A deadlock is a concurrency failure where threads are trapped in a circular wait for resources, unable to proceed due to mutual dependencies. We can modify our example to include a deadlock. Instead of a global mutex, we create a mutex per account. If the goroutine acquires the source account and then the destination account, a deadlock becomes possible.

package main

import (
    "fmt"
    "sync"
    "time"
)

type Account struct {
    balance int
    mutex   sync.Mutex
}

type Bank struct {
    accounts map[string]*Account // Map of account IDs to account objects with individual mutexes
}

func NewBank() *Bank {
    return &Bank{
        accounts: map[string]*Account{
            "Alice": {balance: 1000},
            "Bob":   {balance: 500},
        },
    }
}

func (b *Bank) Transfer(from, to string, amount int, wg *sync.WaitGroup) {
    defer wg.Done()

    // Get the accounts
    fromAccount := b.accounts[from]
    toAccount := b.accounts[to]

    // Lock the "from" account first
    fromAccount.mutex.Lock()
    fmt.Printf("Locked %s for transfer of %d to %s\n", from, amount, to)

    // Simulate some work to increase chance of deadlock (optional, but helps demonstrate)
    time.Sleep(100 * time.Millisecond)

    // Then try to lock the "to" account
    toAccount.mutex.Lock()
    fmt.Printf("Locked %s for transfer of %d from %s\n", to, amount, from)

    // Perform the transfer
    if fromAccount.balance >= amount {
        fromAccount.balance -= amount
        toAccount.balance += amount
        fmt.Printf("Transferred %d from %s to %s. New balances: %s=%d, %s=%d\n",
            amount, from, to, from, fromAccount.balance, to, toAccount.balance)
    } else {
        fmt.Printf("Failed transfer of %d from %s to %s: insufficient funds\n",
            amount, from, to)
    }

    // Unlock both accounts
    toAccount.mutex.Unlock()
    fromAccount.mutex.Unlock()
}

func (b *Bank) GetBalance(account string) int {
    acc := b.accounts[account]
    acc.mutex.Lock()
    defer acc.mutex.Unlock()
    return acc.balance
}

func main() {
    bank := NewBank()
    var wg sync.WaitGroup

    // Launch two transfers in opposite directions to create deadlock
    wg.Add(2)
    go bank.Transfer("Alice", "Bob", 200, &wg) // Alice -> Bob
    go bank.Transfer("Bob", "Alice", 100, &wg) // Bob -> Alice

    wg.Wait() // This will never complete due to deadlock

    fmt.Printf("Final balances: Alice=%d, Bob=%d\n",
        bank.GetBalance("Alice"), bank.GetBalance("Bob"))
}

The deadlock in this code occurs because two goroutines acquire the mutexes in different orders, leading to a circular wait. One strategy to avoid such a deadlock is to use ordered locking: e.g., if accounts are numbered, we always lock the account with the lesser number first (see the sketch below).
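
Here is a sketch of what ordered locking might look like in this example (it reuses the Account and Bank types defined above, and assumes from and to are distinct; the method name is mine):

// TransferOrdered always locks the account whose name sorts first, so that
// every goroutine acquires the two mutexes in the same global order and no
// circular wait can form.
func (b *Bank) TransferOrdered(from, to string, amount int, wg *sync.WaitGroup) {
    defer wg.Done()

    first, second := b.accounts[from], b.accounts[to]
    if from > to {
        first, second = second, first
    }
    first.mutex.Lock()
    defer first.mutex.Unlock()
    second.mutex.Lock()
    defer second.mutex.Unlock()

    if b.accounts[from].balance >= amount {
        b.accounts[from].balance -= amount
        b.accounts[to].balance += amount
    }
}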

Conclusion

Concurrency is a powerful tool in modern software development, enabling programs to leverage multiple processors for improved performance. However, it introduces significant complexities that must be carefully managed. Data races, where unsynchronized access to shared memory leads to unpredictable outcomes, underscore the need for robust synchronization mechanisms. Go’s goroutines and channels offer an elegant, lightweight approach to concurrency, allowing developers to efficiently parallelize tasks like network requests or data processing while avoiding the overhead of traditional threads. Yet, the performance benefits of parallelism are not guaranteed—simple tasks may suffer from excessive goroutine overhead, while computationally intensive operations can see substantial gains, albeit with diminishing returns as the number of goroutines increases.

Synchronization tools like sync.WaitGroup, atomic operations from sync/atomic, and mutexes (sync.Mutex) provide essential safeguards against concurrency pitfalls. Atomics excel for single-variable updates, ensuring thread safety with minimal overhead, while mutexes protect multi-step operations on complex data structures. However, mutexes come with risks, such as deadlocks, which arise from circular dependencies and require careful design—like consistent lock ordering—to avoid. Choosing the right concurrency strategy depends on the task’s nature, scale, and performance requirements. Ultimately, effective concurrent programming in Go demands a balance between leveraging parallelism for speed and maintaining simplicity, correctness, and efficiency in the face of shared resource contention.

How fast can you open 1000 files?

2025-03-02 06:41:04

Jarred Sumner, the main author of the Bun JavaScript engine, commented a few days ago on X that opening many files on macOS could be slow due to thread contention: “your $5,000 computer is only capable of opening 1 file at a time”.

I was curious and I decided to test it out. I wrote a small C++ program which opens 10,000 files. The files are opened and closed immediately. However, there are 10,000 distinct (empty) files. It is a simple test: there is no writing and the files do not remain open long.

I use either 1 thread, 2 threads, 4 threads, 8 threads or 16 threads. The threads split the work equally: e.g., with two threads, each thread opens 5,000 files.
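
The non-pool variant might look roughly like the following sketch (my own illustration, not the actual benchmark from the GitHub repository): split the file list evenly across threads, and have each thread open and immediately close its share.

#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Open and immediately close every file in 'paths', splitting the work
// evenly across 'nthreads' threads.
void open_all(const std::vector<std::string>& paths, size_t nthreads) {
    std::vector<std::thread> workers;
    size_t chunk = (paths.size() + nthreads - 1) / nthreads;
    for (size_t t = 0; t < nthreads; t++) {
        size_t begin = t * chunk;
        size_t end = std::min(paths.size(), begin + chunk);
        if (begin >= end) { break; }
        workers.emplace_back([&paths, begin, end]() {
            for (size_t i = begin; i < end; i++) {
                int fd = open(paths[i].c_str(), O_RDONLY);
                if (fd >= 0) { close(fd); }
            }
        });
    }
    for (auto& w : workers) { w.join(); }
}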

To make things interesting, let us add a “thread pool”: I create and launch all the threads before the benchmark starts. Then I add each task (opening one file) to the thread pool. This should be slower, but it is perhaps more representative of how engineers might solve such problems.

Indeed, with the thread pool, for each file, you need to grab a thread, assign the work, and wait for it to complete before more work can be assigned. You must grab a thread for each file to be opened. When the unit of work is small (i.e., opening a single file), thread pools are not very efficient. Assigning several files to be opened by a thread at once (in batches) is faster.

I run each test 10 times and I keep the median. There is some instability in the measurements, so I try to report representative numbers. My code is located on GitHub: I invite you to run it on your system.

Let us look first at the results on my Apple M2 laptop (it has 8 cores, 4 performance + 4 efficiency). Unfortunately, I do not have a beefier Apple computer at my disposal right now.

I report the total time.

threads       regular   with thread pool
1 thread      100 ms    140 ms
2 threads      75 ms    100 ms
4 threads      90 ms     95 ms
8 threads     240 ms    250 ms
16 threads    250 ms    270 ms

Let us test it out under Linux on a big x64 server with 64 cores. Although it is a bigger machine, it has worse per-core performance than the Apple MacBook in general (slower memory, slower clock, fewer instructions per cycle).

threads       regular   with thread pool
1 thread       34 ms    55 ms
2 threads      25 ms    36 ms
4 threads      31 ms    27 ms
8 threads      36 ms    16 ms
16 threads     42 ms    27 ms

The macOS system has a faster disk, faster memory and faster cores. Yet opening files is clearly much slower under macOS according to this test.

I find it interesting that in both cases, using two threads minimizes the running time in the regular case: additional threads appear to slow things down.

With the big machine, a thread pool can go faster than the regular approach if I use four threads or more.

On my MacBook, I cannot open much more than 120,000 files per second. My Linux server scales up to 400,000 files per second. In some cases, opening thousands of files could become a hard bottleneck. Throwing more threads at the problem might not work.

 

AVX-512 gotcha: avoid compressing words to memory with AMD Zen 4 processors

2025-02-15 05:27:29

Conventional computer instructions operate on a single piece of data at a time (e.g., they can negate an integer or add two integers). For better performance, CPU vendors add support for SIMD instructions. SIMD stands for Single Instruction, Multiple Data: a type of parallel processing where a single operation is executed simultaneously on multiple data points. E.g., you can add sixteen integers to sixteen other integers using one instruction.

The most powerful SIMD instruction sets available on commodity processors are part of the AVX-512 family. AVX-512 (Advanced Vector Extensions 512) is an extension to the x86 instruction set architecture (ISA) introduced by Intel. You can process registers made of 64 bytes! And, more importantly, it has a wide range of very powerful instructions. Recent AMD processors (Zen 4) provide extensive support for the powerful AVX-512 instructions. I was surprised at first when AMD decided to provide this extensive support, given how much work it must have been.

One of the neat tricks with AVX-512 is that, given a mask, you can ‘compress’ words. Suppose that you have a vector made of thirty-two 16-bit words and you want to keep only the second and third ones: you can use the vpcompressw instruction with the mask 0b110. It will produce a register where the second and third words are placed in the first and second positions, or it can write just these two words out to memory. You can invoke this latter functionality with the _mm_mask_compressstoreu_epi16 function intrinsic in C/C++.

This works well on recent Intel processors, but there is a gotcha on AMD Zen 4 processors.

One way to judge how expensive an instruction is, is to count how many micro-instructions it requires. A micro-instruction is the most basic unit of work that a CPU can execute. It is not a direct measure of the cost of an instruction, but it gives a ballpark indication.

When working over registers, both Intel Ice Lake and AMD Zen 4 processors require 2 micro-instructions to execute vpcompressw. We cannot conclude that both Intel and AMD have the same performance when using this instruction, but it is likely close.

However, when writing the compressed data to memory, Intel processors require about six micro-instructions while the AMD Zen 4 processors require 256 micro-instructions. By this measure, the AMD Zen 4 processors are about 40 times slower.

Importantly, you do not really need the variant of the vpcompressw instruction which writes to memory. You can first write the compressed result to a register, and then, separately, write the register to memory. The following two lines might be sufficient:

 __m512i compressed = _mm512_maskz_compress_epi8(mask, input);
 _mm512_storeu_si512(output, compressed);
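
For illustration, here is a sketch (my own example, not simdjson’s implementation) of removing the space characters from one 64-byte block with this two-step approach; it assumes a compiler targeting AVX-512BW and AVX-512VBMI2:

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Remove space characters from a 64-byte block. The store always writes
// 64 bytes, so 'out' must have room for 64 bytes even if fewer are useful.
size_t remove_spaces_64(const uint8_t *in, uint8_t *out) {
  __m512i input = _mm512_loadu_si512(in);
  // Mask of the bytes we keep: everything that is not a space.
  __mmask64 keep = _mm512_cmpneq_epi8_mask(input, _mm512_set1_epi8(' '));
  __m512i compressed = _mm512_maskz_compress_epi8(keep, input);
  _mm512_storeu_si512(out, compressed);
  return (size_t)_mm_popcnt_u64(keep); // number of useful bytes written
}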

We have written what might be the fastest JSON library in the world: simdjson. One feature of the simdjson library is its ability to compress JSON inputs very fast (simdjson::minify). That is, we remove unnecessary spaces turning the following…

{
  "name" : "Daniel"
}

into the compressed equivalent…

{"name":"Daniel"}

It should be intuitive that the vpcompressw instruction is useful for this purpose. And, indeed, the simdjson library uses vpcompressw. However, I wrote it before AMD Zen 4 came along and I was using the variant that compresses to memory.

I investigated the issue on a Zen 4 system from Amazon, using the twitter.json test file. I replaced the _mm_mask_compressstoreu_epi16 function intrinsic with two lines of code. The results are quite drastic, speeding up the function by over 15 times.

AVX-512 simdjson::minify with vpcompressw to memory 1.25 GB/s
AVX-512 simdjson::minify without vpcompressw to memory 22 GB/s

With a proper implementation, we get a minifying speed of 22 GB/s. The fix is available in version 3.12.2 of simdjson.

This is an interesting edge case where a particular CPU microarchitecture can massively impact the performance of otherwise fine code.

Note that AMD Zen 5 processors are not affected.

Remark: Under Linux, you can measure how many micro-ops are issued by a program with the command perf stat -e uops_issued.any.