2025-03-29 09:44:47
Most mobile devices use 64-bit ARM processors. A growing number of servers (Amazon, Microsoft) also use 64-bit ARM processors.
These processors have special instructions called ARM NEON providing parallelism called Single instruction, multiple data (SIMD). For example, you can compare sixteen values with sixteen other values using one instruction.
Some of the most recent ARM processors also support even more advanced instructions known as the Scalable Vector Extension (SVE). ARM has added further extensions over time: SVE2 and SVE2.1.
While ARM NEON’s registers are fixed at 128 bits, SVE registers can be any multiple of 128 bits. In practice, SVE registers are most often 128 bits, although there are exceptions. The Amazon Graviton 3 is based on the ARM Neoverse V1 core, which supports SVE with a vector length of 256 bits (32 bytes). The Graviton 4 is based on the ARM Neoverse V2 core, which supports SVE2 with a vector length of 128 bits (16 bytes), like NEON.
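If you are curious about the vector length on a given machine, the ACLE intrinsic svcntb() reports it at runtime. A minimal sketch of my own (it assumes a compiler and target with SVE support, e.g., gcc -march=armv8-a+sve):

#include <arm_sve.h>
#include <stdio.h>

int main(void) {
  // svcntb() returns the number of 8-bit elements in an SVE vector,
  // i.e., the vector length in bytes (32 on Graviton 3, 16 on Graviton 4).
  printf("SVE vector length: %u bytes\n", (unsigned)svcntb());
  return 0;
}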
ARM provides intrinsic functions, such as svset_neonq_u8, to convert between NEON and SVE registers. These functions are ‘free’: they are likely compiled to no instruction whatsoever.
Let us illustrate with an example. In a recent post, I discussed how it is a bit complicated to check whether there is a non-zero byte in a NEON register. A competitive solution is as follows:
int veq_non_zero_max(uint8x16_t v) { return vmaxvq_u32(vreinterpretq_u32_u8(v)) != 0; }
Effectively, we compute the maximum 32-bit integer in the register, treating it as four 32-bit integers. The function compiles down to three essential instructions: umaxv, fmov and a comparison (cmp).
int sve_non_zero_max(uint8x16_t nvec) {
  svuint8_t vec;
  vec = svset_neonq_u8(vec, nvec);
  svbool_t mask = svwhilelt_b8(0, 16);
  svbool_t cmp = svcmpne_n_u8(mask, vec, 0);
  return svptest_any(mask, cmp);
}
Except for the initialization of the mask, the function is made of two instructions: cmpne and cset. These two instructions may be fused into a single instruction in some ARM cores. Even though the code mixing NEON and SVE looks more complicated, it should be more efficient.
If you know that your target processor supports SVE (or SVE 2 or SVE 2.1), and you already have ARM NEON code, you could try adding bits of SVE to it.
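For instance, one approach is to guard the SVE path with the standard ACLE feature macro __ARM_FEATURE_SVE and fall back to NEON otherwise. A minimal sketch reusing the two functions above (the wrapper name non_zero is mine):

#include <arm_neon.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Assumes veq_non_zero_max and sve_non_zero_max (defined above) are visible.
int non_zero(uint8x16_t v) {
#if defined(__ARM_FEATURE_SVE)
  return sve_non_zero_max(v); // SVE path (e.g., compiled with -march=armv8-a+sve)
#else
  return veq_non_zero_max(v); // portable NEON fallback
#endif
}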
2025-03-25 07:24:50
There are two main types of fixed-precision integers in modern software: unsigned and signed. In C++20 and above, the signed integers must use the two’s complement convention. Other programming languages typically specify two’s complement as well.
Two’s complement is a method for representing signed integers in binary, where the leftmost bit serves as the sign bit—0 for positive numbers and 1 for negative ones—allowing computers to handle both positive and negative values efficiently with simple arithmetic. Positive numbers are written as standard binary (e.g., 5 in 8-bit is 00000101), while negative numbers are formed by taking the positive version, flipping all bits (one’s complement), and adding 1 (e.g., -5 becomes 11111011 from 00000101).
There are cases where you have signed types and you wish to use unsigned comparisons. It happens in Java (which lacks unsigned types) and with some of the x64 SIMD instruction sets: extensions to the x64 architecture that allow a single instruction to perform the same operation on multiple data elements simultaneously.
Thus, you would want any negative value to compare as larger than any positive value. The trick is to use the fact that arithmetic operations typically wrap around in hardware: if x + y is too large and exceeds the range, then the processor replaces it with x + y – K where K is the total number of values (e.g., 2 to the power 64 for 64-bit integers). Let M be the smallest possible value (the most negative value), and take two signed integers x and y; then (x + M) < (y + M), computed with wraparound, is equivalent to comparing x and y as if they had been cast to unsigned integer values.
To see why it is so, consider that the transformation from x to x + M maps 0 to M (so 0 becomes the smallest value). It maps the positive values, from 1 to –M-1, to the range from M+1 to –1 (so positives become negatives). The negative values all overflow in such a way that the most significant bit becomes a zero while the other bits remain unchanged: the transformation maps the negative values, from M to -1, to the range from 0 to -M-1 (so negatives become positives).
Unfortunately, it will not generally work in C or C++ where signed overflow leads to an undefined behavior. However, it will work if you code for SIMD instruction sets using intrinsics (special functions used by SIMD enthusiasts). It will also work in other programming languages like Java or Go.
To avoid undefined behavior in C and C++, we can use the following identity:
x + M = (x^M) + ((x&M)<<1)
where ^ is the XOR operation. This identity holds when you assume that the addition wraps around and that the left shift acts as if the values were unsigned. Because M has only the most significant bit set in two’s complement notation, the second term is zero: the only bit that x&M could have set is shifted out. Thus we have
x + M = x^M
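To make this concrete, here is a sketch of my own (not from the original text) comparing two int64_t values as if they were unsigned, using only operations that are well defined in C++:

#include <cstdint>

// Compare x and y as if they were uint64_t values: XOR with the minimum
// signed value flips the sign bit, after which a signed comparison works.
bool less_as_unsigned(int64_t x, int64_t y) {
  const int64_t M = INT64_MIN; // only the most significant bit set
  return (x ^ M) < (y ^ M);
}

The result agrees with static_cast<uint64_t>(x) < static_cast<uint64_t>(y), which is the unsigned comparison we were after.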
2025-03-16 01:29:50
Let us consider a simple C++ function which divides all values in a range of integers:
void divide(std::span<int> i, int d) { for (auto& value : i) { value /= d; } }
A division between two integers is one of the most expensive operations you can do over integers: it is much slower than a multiplication, which is, in turn, more expensive than an addition. If the divisor d is known at compile time, this function can be much faster. E.g., if d is 2, the compiler might optimize away the division and use a shift and a few cheap instructions instead. The same is true for any compile-time constant: the compiler can often do better knowing the constant. (See Lemire et al., Integer Division by Constants: Optimal Bounds, 2021.)
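For instance, for a signed division by 2, which must round toward zero in C++, the compiler typically adds the sign bit before shifting. A rough sketch of my own (assuming a 32-bit int and arithmetic right shifts on negative values, which C++20 guarantees):

int divide_by_two(int x) {
  int bias = static_cast<int>(static_cast<unsigned>(x) >> 31); // 1 if x is negative, else 0
  return (x + bias) >> 1; // the bias makes the shift round toward zero, like x / 2
}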
If the ‘divide’ function is inlined and the divisor is known at compile time, then an optimizing compiler will do fine. But we cannot expect all our functions to be inlined in practice.
In C++, a template function is defined using the template keyword followed by one or more parameters enclosed in angle brackets < >. A template parameter acts as a placeholder that is replaced with an actual type or value when the function is instantiated.
In C++, you can turn the division parameter into a template parameter:
template <int d> void divide(std::span<int> i) { for (auto& value : i) { value /= d; } }
The template function is not itself a function, but rather a recipe to generate functions: we provide the integer d and a function is created. This allows the compiler to work with a compile-time constant, producing faster code.
If you expect the divisor to be between 2 and 6, you can call the template function from a general-purpose function like so:
void divide_fast(std::span<int> i, int d) {
  if(d == 2) { return divide<2>(i); }
  if(d == 3) { return divide<3>(i); }
  if(d == 4) { return divide<4>(i); }
  if(d == 5) { return divide<5>(i); }
  if(d == 6) { return divide<6>(i); }
  for (auto& value : i) { value /= d; }
}
You could do it with a switch/case if you prefer but it does not simplify the code significantly. The compiler can produce efficient code with a series of if clauses (e.g., a jump table).
Unfortunately we have to expose a template function, which creates noise in our code base. We would prefer to keep all the logic inside one function. We can do so with lambda functions.
void divide_fast(std::span<int> i, int d) {
  auto f = [&i]<int divisor>() {
    for (auto& value : i) { value /= divisor; }
  };
  if(d == 2) { return f<2>(); }
  if(d == 3) { return f<3>(); }
  if(d == 4) { return f<4>(); }
  if(d == 5) { return f<5>(); }
  if(d == 6) { return f<6>(); }
  for (auto& value : i) { value /= d; }
}
Unfortunately, this does not compile: a lambda with an explicit template parameter list cannot be called with the f<2>() syntax. You must invoke its call operator explicitly:

void divide_fast(std::span<int> i, int d) {
  auto f = [&i]<int divisor>() {
    for (auto& value : i) { value /= divisor; }
  };
  if(d == 2) { return f.operator()<2>(); }
  if(d == 3) { return f.operator()<3>(); }
  if(d == 4) { return f.operator()<4>(); }
  if(d == 5) { return f.operator()<5>(); }
  if(d == 6) { return f.operator()<6>(); }
  for (auto& value : i) { value /= d; }
}
In practice, it might still be a good choice. It keeps all the messy optimization hidden inside your function.
In the specific case of this function, I can probably get the same compiled output without templates, as remarked by Martin Leitner-Ankerl:
void divide_fast_simple(std::span<int> i, int d) {
  auto f = [&i](int divisor) {
    for (auto& value : i) { value /= divisor; }
  };
  if(d == 2) { return f(2); }
  if(d == 3) { return f(3); }
  if(d == 4) { return f(4); }
  if(d == 5) { return f(5); }
  if(d == 6) { return f(6); }
  for (auto& value : i) { value /= d; }
}
And if it works in your particular case, you should avoid templates.
At the other end of the spectrum, Paul Dreik suggests doing it with template metaprogramming and a fold expression, like so:
void divide_fast(std::span<int> i, int d) {
  auto implementation = [&i]<int... ints>(std::integer_sequence<int, 0, ints...>, int d) {
    auto specialized = [&i]<int static_divisor>(std::integral_constant<int, static_divisor>, int dynamic_divisor) {
      if (static_divisor == dynamic_divisor) {
        for (auto &value : i) { value /= static_divisor; }
        return true;
      }
      return false;
    };
    const bool handled = (specialized(std::integral_constant<int, ints>{}, d) || ...);
    if (!handled) {
      // resort to dynamic calculation
      for (auto &value : i) { value /= d; }
    }
  };
  implementation(std::make_integer_sequence<int, 5>(), d);
}
It is likely overkill in this example, but template metaprogramming becomes handy with more challenging problems.
2025-03-10 05:35:41
In practice, the software we write runs on several processors. Unfortunately, much of what we take for granted on a single processor becomes false when there is more than one processor. For example, if two processors modify the same piece of memory, what is the state of the memory after the modifications? It is difficult to tell in general. It is possible that the modification of one processor could overwrite any modification done by the other processor. The reverse could be true: the modification done by the other processor could win out. Or, maybe both processors will attempt the modification and the result will be a confused state that corresponds to nothing we would like to see. We call such accesses a ‘data race’: a situation where two or more processors in a program access the same memory location simultaneously, and at least one of those accesses is a write operation, without proper synchronization.
It gets more difficult when you want two or more processors to meaningfully modify the same memory. For example, suppose that you have a variable that counts the number of products sold. Maybe you have different processors incrementing this one variable.
A thread is the smallest unit of execution within a process that can be scheduled and run independently by a computer’s operating system. It represents a single sequence of instructions that the CPU executes, allowing a program to perform multiple tasks concurrently within the same process. A thread exists within a larger entity called a process, which is essentially a running program with its own memory space, resources, and state. A process can contain one or more threads, all sharing the same memory and resources (like open files or global variables) allocated to that process.
There is a limit to how many threads a program can manage efficiently. To enable even more parallelism, the Go programming language has its own concept of a thread called a goroutine. While a goroutine is not a thread in the traditional sense, it maps to conventional threads under the hood. The Go runtime uses a scheduler to map many goroutines onto a smaller number of threads. These threads are the actual threads recognized by the operating system: kernel-level entities with their own stack and execution context, as described above.
A single thread in Go can execute multiple goroutines by switching between them efficiently. This makes goroutines much cheaper than OS threads—creating thousands or even millions of goroutines is practical, whereas spawning that many threads would exhaust system resources due to their larger memory footprint.
In some sense, Go blurs the distinction between concurrency and parallelism. Concurrency is about managing multiple tasks so they can make progress independently. Parallelism, however, involves executing multiple tasks simultaneously across multiple resources. While concurrency focuses on software design for task coordination and can work with or without multiple cores, parallelism relies on hardware to achieve true simultaneous execution, and the two can combine when a concurrent system leverages parallel resources for efficiency.
To start a goroutine, you only need to type the keyword ‘go’ followed by a function:
go func() { fmt.Println("Canada") }()
This spawns a goroutine, but the Go runtime decides which thread it runs on, potentially sharing that thread with other goroutines.
Unfortunately, a program made of only this goroutine could be disappointing:
package main import ( "fmt" ) func main() { go func() { fmt.Println("Canada") }() }
The problem is that the main function might end before the goroutine can terminate. In Go, goroutines run concurrently, and the main function (which runs in the main goroutine) does not automatically wait for other goroutines to complete. If the main goroutine exits, the program terminates, potentially before other goroutines finish. To ensure a goroutine terminates before the program ends, the simplest approach is to synchronize the main goroutine with the spawned goroutine using a mechanism like a channel or a WaitGroup.
In Go, a channel is a built-in construct that provides a way for goroutines (concurrent functions) to communicate with each other and synchronize their execution. A channel has a type and it is created with the make function:

ch := make(chan int) // A channel that carries integers

The keyword chan declares a channel, and the type after chan (e.g., int) defines what kind of data the channel can transport.
We use the <- operator to send a value into a channel:
ch <- 42 // Send the value 42 into the channel
We use the <- operator to receive a value from a channel:
value := <-ch // Receive a value from the channel and store it in 'value'
We use the close function to indicate that no more data will be sent: close(ch). Sending to a closed channel causes a panic.
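As a small aside of my own (not from the original text), closing a channel is also what allows a receiver to drain it with range and stop cleanly:

ch := make(chan int, 3)
ch <- 1
ch <- 2
close(ch)
for v := range ch { // the loop ends once the channel is closed and drained
    fmt.Println(v)
}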
The following program would print ‘Canada’:
package main import "fmt" func main() { ch := make(chan string) // Create a channel for strings go func() { ch <- "Canada" // Send a message to the channel }() msg := <-ch // Receive the message in the main goroutine fmt.Println(msg) }
The following program illustrates how we might use a channel to wait for a goroutine to terminate:
package main import ( "fmt" ) func main() { channel := make(chan bool) // Create a channel to signal completion go func() { fmt.Println("Canada") channel <- true // Signal that the goroutine is done }() <-channel // Wait for the goroutine to signal completion }
The goroutine sends a value (true) to the channel when it finishes. The main function blocks at <-channel, waiting to receive from the channel, ensuring it does not exit until the goroutine completes.
By default a channel is unbuffered: it has no capacity to store values, so a send blocks until another goroutine is ready to receive. You can also give make a capacity to create a buffered channel; a send then blocks only when the buffer is full:
ch := make(chan int, 2)
ch <- 1 // Does not block (buffer has space)
ch <- 2 // Does not block (buffer is now full)
ch <- 3 // Blocks until a value is received
In Go, you can pass multiple channels to a function just like any other arguments. Channels are first-class values in Go, meaning they can be passed as parameters, returned from functions, or stored in variables. When passing several channels to a function, you simply include them in the function’s parameter list, specifying their types.
Let us consider an example where we access two URLs:
package main import ( "fmt" "net/http" "time" ) // Response struct to hold URL and its fetch result type Response struct { url string status string err error } func fetchURL(url string, ch chan Response) { // Create HTTP client with timeout client := &http.Client{ Timeout: 10 * time.Second, } // Make HTTP GET request resp, err := client.Get(url) if err != nil { ch <- Response{url, "", err} return } defer resp.Body.Close() ch <- Response{url, resp.Status, nil} } func main() { // Record start time startTime := time.Now() // Create channel for responses ch := make(chan Response) // URLs to fetch urls := []string{ "https://www.google.com", "https://www.github.com", } // Start goroutines for each URL for _, url := range urls { go fetchURL(url, ch) } // Collect responses for i := 0; i < len(urls); i++ { resp := <-ch if resp.err != nil { fmt.Printf("Error fetching %s: %v\n", resp.url, resp.err) } else { fmt.Printf("Successfully fetched %s: %s\n", resp.url, resp.status) } } // Close the channel (optional since program ends here) close(ch) // Calculate and print elapsed time elapsed := time.Since(startTime) fmt.Printf("\nTotal time taken: %s\n", elapsed) }
This program defines a Response struct to hold the URL, its status, and any error that occurred. It implements a fetchURL function that takes a URL and a channel as parameters, uses an HTTP client with a 10-second timeout, makes a GET request to the URL, and sends the result through the channel. It uses defer to ensure the response body is closed. In this instance, the channel can be written to or read from in the function: to ensure that it can only be written to, we could declare the parameter as ch chan<- Response instead of ch chan Response. In the main function, we create a channel to receive responses, we define two URLs to fetch, we launch a goroutine for each URL, we collect the responses from the channel and we print the results.
When we run this program, it will fetch both URLs simultaneously using separate goroutines, it will use channels to communicate results back to the main goroutine, and it will print the status (like “200 OK”) or any errors for each URL.
We can rewrite this program so that it is simpler, without goroutines, like so:
package main import ( "fmt" "net/http" "time" ) // Response struct to hold URL and its fetch result type Response struct { url string status string err error } func fetchURLSynchro(url string) Response { // Create HTTP client with timeout client := &http.Client{ Timeout: 10 * time.Second, } // Make HTTP GET request resp, err := client.Get(url) if err != nil { return Response{url, "", err} } defer resp.Body.Close() return Response{url, resp.Status, nil} } func main() { // URLs to fetch urls := []string{ "https://www.google.com", "https://www.github.com", } startTime := time.Now() for i := 0; i < len(urls); i++ { resp := fetchURLSynchro(urls[i]) if resp.err != nil { fmt.Printf("Error fetching %s: %v\n", resp.url, resp.err) } else { fmt.Printf("Successfully fetched %s: %s\n", resp.url, resp.status) } } elapsed := time.Since(startTime) fmt.Printf("\nTotal time taken: %s\n", elapsed) }
The two programs do the same work, but one uses two goroutines (in addition to the main goroutine) while the other uses only the main goroutine. Testing these programs, you may find that the one using two goroutines completes faster: network accesses are typically expensive and easily parallelizable. That is, the two tasks can be done almost independently on your computer, even if executed simultaneously. Hence, you may find that we can query two URLs using HTTP requests in 250 ms whereas 400 ms is needed if the requests are consecutive, using a single goroutine.
However, you should not assume that using more goroutines always makes software run faster. It often does not. Furthermore, additional goroutines might trigger the use of additional processors which increases the cost or power usage of your software. Adding more goroutines makes your software more complicated, more difficult to maintain and debug.
Formally speaking, you do not need parallelism (i.e., many physical processors) to execute two network requests concurrently. Executing such requests does not require much processing time and has much to do with waiting for the network response. Therefore, it is a case where using goroutines is likely appropriate. When splitting up more computational tasks into goroutines, you are less certain to get a performance boost.
To illustrate the point, let us consider the case where we are summing all values in an array. We consider two cases, first a small array (100k elements) and then a large array with millions of elements. For both cases, we can either use a simple function (with one goroutine) or a function that uses multiple goroutines. To maximize parallelism, we set the number of goroutines to the number of processors detected on the system by Go (runtime.NumCPU()).
package main import ( "fmt" "runtime" "testing" ) // sequentialSum calculates the sum of an array sequentially func sequentialSum(numbers []int) int { sum := 0 for _, n := range numbers { sum += n } return sum } // goroutineSumWithChannels calculates the sum using goroutines and channels func goroutineSumWithChannels(numbers []int) int { numGoroutines := runtime.NumCPU() // Use number of CPU cores chunkSize := (len(numbers) + numGoroutines - 1) / numGoroutines resultChan := make(chan int, numGoroutines) // Buffered channel for partial sums activeGoroutines := 0 // Split the array into chunks and process with goroutines for i := 0; i < numGoroutines; i++ { start := i * chunkSize end := start + chunkSize if end > len(numbers) { end = len(numbers) } if start >= end { break } go func(slice []int) { partialSum := 0 for _, n := range slice { partialSum += n } resultChan <- partialSum }(numbers[start:end]) activeGoroutines++ } // Collect partial sums from the channel total := 0 for i := 0; i < activeGoroutines; i++ { total += <-resultChan } close(resultChan) return total } // Benchmark functions func BenchmarkSequentialSum(b *testing.B) { numbers := make([]int, 100000) for i := range numbers { numbers[i] = i } b.ResetTimer() for i := 0; i < b.N; i++ { sequentialSum(numbers) } } func BenchmarkGoroutineSumWithChannels(b *testing.B) { numbers := make([]int, 100000) for i := range numbers { numbers[i] = i } b.ResetTimer() for i := 0; i < b.N; i++ { goroutineSumWithChannels(numbers) } } // Benchmark functions func BenchmarkSequentialSumLarge(b *testing.B) { numbers := make([]int, 10000000) for i := range numbers { numbers[i] = i } b.ResetTimer() for i := 0; i < b.N; i++ { sequentialSum(numbers) } } func BenchmarkGoroutineSumWithChannelsLarge(b *testing.B) { numbers := make([]int, 10000000) for i := range numbers { numbers[i] = i } b.ResetTimer() for i := 0; i < b.N; i++ { goroutineSumWithChannels(numbers) } } func main() { fmt.Printf("Number of CPU cores: %d\n", runtime.NumCPU()) res := testing.Benchmark(BenchmarkGoroutineSumWithChannels) fmt.Println("BenchmarkGoroutineSumWithChannels", res) ress := testing.Benchmark(BenchmarkSequentialSum) fmt.Println("BenchmarkSequentialSum", ress) resl := testing.Benchmark(BenchmarkGoroutineSumWithChannelsLarge) fmt.Println("BenchmarkGoroutineSumWithChannelsLarge", resl) ressl := testing.Benchmark(BenchmarkSequentialSumLarge) fmt.Println("BenchmarkSequentialSumLarge", ressl) }
On a system with a large number of processors, we might get the following result:
Number of CPU cores: 128
BenchmarkGoroutineSumWithChannels 4048 258798 ns/op
BenchmarkSequentialSum 23756 50516 ns/op
BenchmarkGoroutineSumWithChannelsLarge 744 1414114 ns/op
BenchmarkSequentialSumLarge 237 5030224 ns/op
We see that when summing up the modest array, the approach using 128 goroutines takes five times longer. If it does end up using 128 processors, then it might be 128 * 5 = 640 times less efficient! The lesson is that if the task is sufficiently inexpensive, such as summing up thousands of integers, you should not use more than one goroutine.
In the instance where we are summing 10 million integers, the parallelized task is more interesting: it goes 3.6 times faster. Even then, the single-goroutine approach is likely much more efficient in terms of resources: a single processor takes only 3.6 times longer than more than one hundred goroutines spread over many processors.
The problem with a simple sum is that it is driven by memory accesses and not especially computational. What if we consider a more expensive task? Let us sum the sine of the values of an array using various numbers of goroutines (1, 2, …). We use one million values in the array.
package main import ( "fmt" "math" "runtime" "testing" ) func computeSineSum(numbers []int) float64 { sum := 0.0 for _, n := range numbers { sum += math.Sin(float64(n)) } return sum } // computeSineSumWithGoroutines computes the sum of squares with a specified number of goroutines func computeSineSumWithGoroutines(numbers []int, numGoroutines int) float64 { chunkSize := (len(numbers) + numGoroutines - 1) / numGoroutines resultChan := make(chan float64, numGoroutines) for i := 0; i < numGoroutines; i++ { start := i * chunkSize end := start + chunkSize if end > len(numbers) { end = len(numbers) } if start >= end { break } go func(slice []int) { partialSum := 0.0 for _, n := range slice { partialSum += math.Sin(float64(n)) } resultChan <- partialSum }(numbers[start:end]) } // Collect results total := 0.0 activeGoroutines := (len(numbers) + chunkSize - 1) / chunkSize for i := 0; i < activeGoroutines; i++ { total += <-resultChan } close(resultChan) return total } // Benchmarks func BenchmarkSequential(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 // Keep numbers manageable } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSum(numbers) } } func Benchmark1Goroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, 1) } } func Benchmark2Goroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, 2) } } func Benchmark4Goroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, 4) } } func Benchmark8Goroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, 8) } } func Benchmark16Goroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, 16) } } func BenchmarkMaxGoroutines(b *testing.B) { numbers := make([]int, 1000000) for i := range numbers { numbers[i] = i % 1000 } b.ResetTimer() for i := 0; i < b.N; i++ { computeSineSumWithGoroutines(numbers, runtime.NumCPU()) } } func main() { fmt.Printf("CPU cores: %d\n", runtime.NumCPU()) res1 := testing.Benchmark(BenchmarkSequential) fmt.Println("BenchmarkSequential", res1) res11 := testing.Benchmark(Benchmark1Goroutines) fmt.Println("Benchmark1Goroutines", res11) res2 := testing.Benchmark(Benchmark2Goroutines) fmt.Println("Benchmark2Goroutines", res2) res4 := testing.Benchmark(Benchmark4Goroutines) fmt.Println("Benchmark4Goroutines", res4) res8 := testing.Benchmark(Benchmark8Goroutines) fmt.Println("Benchmark8Goroutines", res8) res16 := testing.Benchmark(Benchmark16Goroutines) fmt.Println("Benchmark16Goroutines", res16) resmax := testing.Benchmark(BenchmarkMaxGoroutines) fmt.Println("BenchmarkMaxGoroutines", resmax) }
On a powerful machine with many cores, we might get the following results:
CPU cores: 128
Benchmark1Goroutines 114 13701908 ns/op
Benchmark2Goroutines 134 8913817 ns/op
Benchmark4Goroutines 253 4648170 ns/op
Benchmark8Goroutines 472 2272842 ns/op
Benchmark16Goroutines 835 1227975 ns/op
BenchmarkMaxGoroutines 916 1189217 ns/op
Going from one goroutine to two improves the speed by a factor of 1.5. Going from one goroutine to 16 goroutines improves the speed by a factor of 11. Increasing the number of goroutines beyond 16 brings no further gain. This pattern of sublinear gains up to a limit is rather typical.
Yet goroutines and channels can be remarkably efficient in their own right. Let us create a chain of channels. Each goroutine has an input channel and an output channel: as soon as a value is received from the input channel, it is written to the output channel. We link thousands of goroutines in a chain of input and output channels:
package main import ( "fmt" "time" ) // relay function represents each goroutine in the chain func relay(input <-chan int, output chan<- int) { // Wait for value from input channel value := <-input // Send value to output channel output <- value } func main() { // Number of goroutines in the chain const chainLength = 10000 // Create slice to hold all channels channels := make([]chan int, chainLength+1) // Initialize all channels for i := range channels { channels[i] = make(chan int) } // Start timing startTime := time.Now() // Create the chain of goroutines for i := 0; i < chainLength; i++ { go relay(channels[i], channels[i+1]) } // Send initial value into the first channel go func() { channels[0] <- 42 }() // Wait for and receive the value from the last channel result := <-channels[chainLength] // Calculate elapsed time elapsed := time.Since(startTime) // Print results fmt.Printf("Value %d successfully passed through %d goroutines\n", result, chainLength) fmt.Printf("Time taken: %v\n", elapsed) fmt.Printf("Average time per hop: %v\n", elapsed/time.Duration(chainLength)) }
Running this program, you may get the following result:
Value 42 successfully passed through 10000 goroutines Time taken: 13.987416ms Average time per hop: 1.398µs
Effectively, you can traverse nearly a million goroutines per second in this manner.
Another common approach for managing multiple goroutines is to use sync.WaitGroup. Before we give an example, we need to review the keyword defer. In Go, the defer keyword is used to schedule a function call to be executed just before the surrounding function (the one containing the defer statement) returns. For example, the following function would print Canada followed by Mexico:
func() { defer fmt.Println("Mexico") fmt.Println("Canada") }
Let us consider a goroutine that uses defer together with a wait group:
package main import ( "fmt" "sync" ) func main() { var wg sync.WaitGroup wg.Add(1) // Increment the WaitGroup counter by 1 go func() { defer wg.Done() // Decrement the counter when the goroutine completes fmt.Println("Canada") }() wg.Wait() // Wait for the counter to reach 0 }
For a single goroutine like in our example, the channel approach is simpler because it requires fewer lines and no additional imports. However, if you have multiple goroutines, the wait group might be simpler.
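For instance, here is a small sketch of my own where one WaitGroup tracks several goroutines:

var wg sync.WaitGroup
for i := 0; i < 5; i++ {
    wg.Add(1) // register each goroutine before launching it
    go func(id int) {
        defer wg.Done()
        fmt.Println("worker", id)
    }(i)
}
wg.Wait() // blocks until all five goroutines have called Done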
If you need to read data from different goroutines, that is not a problem as long as the data remains constant. If nobody writes to the data, there is no problem.
Unfortunately, we often need to change the data while reading it from different goroutines. Often you can use channels to communicate, but that is not always enough.
Let us consider an example. We take an array of 10 integers, and goroutines randomly decrement one array element and then increment another array element. Initially, the sum of all elements should be 1000 and it should remain 1000 unless there is a bug. We can implement our code like so:
package main import ( "fmt" "math/rand" "sync" "time" ) func main() { // Initialize array with 10 elements, each set to 100 arr := [10]int{100, 100, 100, 100, 100, 100, 100, 100, 100, 100} var wg sync.WaitGroup // Function for goroutine behavior worker := func() { defer wg.Done() wg.Add(1) r := rand.New(rand.NewSource(time.Now().UnixNano())) // Run for 200000000 iterations as an example for i := 0; i < 200000000; i++ { // Pick first random index idx1 := r.Intn(10) // Only proceed if value > 0 if arr[idx1] > 0 { // Decrement first element arr[idx1]-- // Pick second random index idx2 := r.Intn(10) // Increment second element arr[idx2]++ } } } // Launch two goroutines go worker() go worker() fmt.Println("waiting...") wg.Wait() fmt.Println("waiting...ok") fmt.Println("\nFinal array state:", arr) // Verify total sum should still be 1000 (10 * 100) sum := 0 for _, val := range arr { sum += val } fmt.Println("Total sum:", sum) }
This program is wrong: it contains data races because we are writing and reading data from different goroutines without synchronization. A possible output of this program is the following:
Final array state: [3001 644 880 324 2319 2845 3664 160 232 1741] Total sum: 15810
Observe how the sum is higher than expected.
In Go you can avoid such a bug with the guarantee of atomicity provided by the sync/atomic package, which ensures that operations like increments are executed as indivisible steps, preventing race conditions. Functions like atomic.AddInt32(&x, 1) or atomic.AddInt64(&x, 1) ensure that the increment operation (read-modify-write) is performed atomically. This means that even if two threads execute the increment concurrently, the operations are serialized at the hardware level, and no updates are lost.
package main import ( "fmt" "math/rand" "sync" "sync/atomic" "time" ) func main() { // Initialize array with 10 elements, each set to 100 arr := [10]int32{100, 100, 100, 100, 100, 100, 100, 100, 100, 100} var wg sync.WaitGroup // Function for goroutine behavior worker := func() { defer wg.Done() wg.Add(1) r := rand.New(rand.NewSource(time.Now().UnixNano())) // Run for 200000000 iterations as an example for i := 0; i < 200000000; i++ { // Pick first random index idx1 := r.Intn(10) // Only proceed if value > 0 val := atomic.LoadInt32(&arr[idx1]) if val > 0 { if atomic.CompareAndSwapInt32(&arr[idx1], val, val-1) { // Pick second random index idx2 := r.Intn(10) // Increment second element atomic.AddInt32(&arr[idx2], 1) } } } } // Launch two goroutines go worker() go worker() fmt.Println("waiting...") wg.Wait() fmt.Println("waiting...ok") fmt.Println("\nFinal array state:", arr) // Verify total sum should still be 1000 (10 * 100) sum := 0 for _, val := range arr { sum += int(val) } fmt.Println("Total sum:", sum) }
The expression atomic.LoadInt32(&arr[idx1]) atomically reads the value at array position idx1. The value is stored in a local variable (val): data races are not possible with a local variable. We then use a Compare-And-Swap (CAS) operation: atomic.CompareAndSwapInt32(&arr[idx1], val, val-1). It checks whether arr[idx1] still equals val (the previously loaded value) and, if so, sets arr[idx1] to val-1. It returns true if successful, false if the value changed since the load. Importantly, it executes as a single atomic operation. Finally, we use atomic.AddInt32(&arr[idx2], 1) to atomically add 1 to arr[idx2].
If you run the new program, the sum of the values in the array is maintained. The program is safe.
Atomic operations (like atomic.AddInt32 or atomic.CompareAndSwapInt32) are designed for single, indivisible operations on a single variable. They become insufficient when we have more complex data structures.
In these more complex cases, we use a mutex. A mutex (short for “mutual exclusion”) is a synchronization primitive used in concurrent programming to prevent multiple threads or processes from simultaneously accessing or modifying a shared resource. It ensures that only one thread (or goroutine) can enter a critical section of code at a time, thus avoiding race conditions and maintaining data consistency. Essentially, at most one goroutine can hold a given lock at any time.
To illustrate, let us create a program where money is transferred between two accounts, and we need to ensure that the withdrawal from one account and deposit to another happen together without interference from other goroutines. This requires protecting a multi-step operation, which goes beyond what atomic operations can do.
package main import ( "fmt" "sync" "time" ) type Bank struct { accounts map[string]int // Map of account IDs to balances mutex sync.Mutex // Mutex to protect the entire transfer operation } func NewBank() *Bank { return &Bank{ accounts: map[string]int{ "Alice": 1000, "Bob": 500, }, } } func (b *Bank) Transfer(from, to string, amount int, wg *sync.WaitGroup) { defer wg.Done() // Lock the mutex to protect the entire transfer operation b.mutex.Lock() defer b.mutex.Unlock() // Ensure unlock happens even if there's a panic // Check if source account has sufficient funds if b.accounts[from] >= amount { // Perform the transfer: two related operations b.accounts[from] -= amount b.accounts[to] += amount fmt.Printf("Transferred %d from %s to %s. New balances: %s=%d, %s=%d\n", amount, from, to, from, b.accounts[from], to, b.accounts[to]) } else { fmt.Printf("Failed transfer of %d from %s to %s: insufficient funds\n", amount, from, to) } } func (b *Bank) GetBalance(account string) int { b.mutex.Lock() defer b.mutex.Unlock() return b.accounts[account] } func main() { bank := NewBank() var wg sync.WaitGroup // Launch multiple concurrent transfers wg.Add(4) go bank.Transfer("Alice", "Bob", 200, &wg) go bank.Transfer("Bob", "Alice", 100, &wg) go bank.Transfer("Alice", "Bob", 300, &wg) go bank.Transfer("Bob", "Alice", 50, &wg) wg.Wait() fmt.Printf("Final balances: Alice=%d, Bob=%d\n", bank.GetBalance("Alice"), bank.GetBalance("Bob")) }
In general, acquiring and releasing a mutex involves some overhead: atomic operations in the uncontended case, and possibly system calls under contention. This can slow down a program compared to lock-free alternatives like atomic operations.
In complex cases, it is also possible to trigger a deadlock. A deadlock is a concurrency failure where threads are trapped in a circular wait for resources, unable to proceed due to mutual dependencies. We can modify our example to include a deadlock. Instead of a global mutex, we create a mutex per account. If each goroutine locks the source account first and then the destination account, two transfers going in opposite directions can deadlock.
package main import ( "fmt" "sync" "time" ) type Account struct { balance int mutex sync.Mutex } type Bank struct { accounts map[string]*Account // Map of account IDs to account objects with individual mutexes } func NewBank() *Bank { return &Bank{ accounts: map[string]*Account{ "Alice": {balance: 1000}, "Bob": {balance: 500}, }, } } func (b *Bank) Transfer(from, to string, amount int, wg *sync.WaitGroup) { defer wg.Done() // Get the accounts fromAccount := b.accounts[from] toAccount := b.accounts[to] // Lock the "from" account first fromAccount.mutex.Lock() fmt.Printf("Locked %s for transfer of %d to %s\n", from, amount, to) // Simulate some work to increase chance of deadlock (optional, but helps demonstrate) time.Sleep(100 * time.Millisecond) // Then try to lock the "to" account toAccount.mutex.Lock() fmt.Printf("Locked %s for transfer of %d from %s\n", to, amount, from) // Perform the transfer if fromAccount.balance >= amount { fromAccount.balance -= amount toAccount.balance += amount fmt.Printf("Transferred %d from %s to %s. New balances: %s=%d, %s=%d\n", amount, from, to, from, fromAccount.balance, to, toAccount.balance) } else { fmt.Printf("Failed transfer of %d from %s to %s: insufficient funds\n", amount, from, to) } // Unlock both accounts toAccount.mutex.Unlock() fromAccount.mutex.Unlock() } func (b *Bank) GetBalance(account string) int { acc := b.accounts[account] acc.mutex.Lock() defer acc.mutex.Unlock() return acc.balance } func main() { bank := NewBank() var wg sync.WaitGroup // Launch two transfers in opposite directions to create deadlock wg.Add(2) go bank.Transfer("Alice", "Bob", 200, &wg) // Alice -> Bob go bank.Transfer("Bob", "Alice", 100, &wg) // Bob -> Alice wg.Wait() // This will never complete due to deadlock fmt.Printf("Final balances: Alice=%d, Bob=%d\n", bank.GetBalance("Alice"), bank.GetBalance("Bob")) }
The deadlock in this code occurs because two goroutines acquire mutexes in different orders, leading to a circular wait. One strategy to avoid such a deadlock is to use ordered mutexes. E.g., if accounts are numbered, we always lock the account with the lesser number first.
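Here is a minimal sketch of that strategy, reusing the Account and Bank types above and assuming the two account names differ (the method name TransferOrdered is mine):

func (b *Bank) TransferOrdered(from, to string, amount int, wg *sync.WaitGroup) {
    defer wg.Done()
    first, second := b.accounts[from], b.accounts[to]
    if from > to { // always lock the account with the smaller name first
        first, second = second, first
    }
    first.mutex.Lock()
    defer first.mutex.Unlock()
    second.mutex.Lock()
    defer second.mutex.Unlock()
    // Both locks are held: perform the two-step transfer safely.
    fromAccount, toAccount := b.accounts[from], b.accounts[to]
    if fromAccount.balance >= amount {
        fromAccount.balance -= amount
        toAccount.balance += amount
    }
}

Because every goroutine acquires the two mutexes in the same global order, no circular wait can form.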
Concurrency is a powerful tool in modern software development, enabling programs to leverage multiple processors for improved performance. However, it introduces significant complexities that must be carefully managed. Data races, where unsynchronized access to shared memory leads to unpredictable outcomes, underscore the need for robust synchronization mechanisms. Go’s goroutines and channels offer an elegant, lightweight approach to concurrency, allowing developers to efficiently parallelize tasks like network requests or data processing while avoiding the overhead of traditional threads. Yet, the performance benefits of parallelism are not guaranteed—simple tasks may suffer from excessive goroutine overhead, while computationally intensive operations can see substantial gains, albeit with diminishing returns as the number of goroutines increases.
Synchronization tools like sync.WaitGroup, atomic operations from sync/atomic, and mutexes (sync.Mutex) provide essential safeguards against concurrency pitfalls. Atomics excel for single-variable updates, ensuring thread safety with minimal overhead, while mutexes protect multi-step operations on complex data structures. However, mutexes come with risks, such as deadlocks, which arise from circular dependencies and require careful design (like consistent lock ordering) to avoid. Choosing the right concurrency strategy depends on the task’s nature, scale, and performance requirements. Ultimately, effective concurrent programming in Go demands a balance between leveraging parallelism for speed and maintaining simplicity, correctness, and efficiency in the face of shared resource contention.
2025-03-02 06:41:04
Jarred Sumner, the main author of the Bun JavaScript engine, commented a few days ago on X that opening many files on macOS could be slow due to thread contention: “your $5,000 computer is only capable of opening 1 file at a time”.
I was curious and I decided to test it out. I wrote a small C++ program which opens 10,000 files. The files are opened and closed immediately. However, there are 10,000 distinct (empty) files. It is a simple test: there is no writing and the files do not remain open long.
I use either 1, 2, 4, 8 or 16 threads. The threads split the work equally: e.g., with two threads, you have two threads opening 5,000 files each.
To make things interesting, let us add a “thread pool”: I create and launch all threads before the benchmark starts. Then I add all of the tasks (opening file X) to the thread pool. This should be slower, but perhaps more representative of how engineers might solve such problems.
Indeed, with the thread pool, for each file, you need to grab a thread, assign the work, and wait for it to complete before more work can be assigned. You must grab a thread for each file to be opened. When the unit of work is small (i.e., opening a single file), thread pools are not very efficient. Assigning several files to be opened by a thread at once (in batches) is faster.
I run each test 10 times and I keep the median. There is some instability in the measurements; I try to report representative numbers. My code is located on GitHub: I invite you to run it on your system.
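For reference, the core of such a benchmark might look like the following simplified sketch of my own (it is not the author’s actual code; the function names open_range and open_all are mine):

#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Each thread opens and immediately closes its share of the files.
void open_range(const std::vector<std::string>& names, size_t begin, size_t end) {
  for (size_t i = begin; i < end; i++) {
    int fd = open(names[i].c_str(), O_RDONLY);
    if (fd >= 0) { close(fd); }
  }
}

// Split the file names evenly across thread_count threads (thread_count >= 1).
void open_all(const std::vector<std::string>& names, size_t thread_count) {
  std::vector<std::thread> workers;
  size_t chunk = (names.size() + thread_count - 1) / thread_count;
  for (size_t t = 0; t < thread_count; t++) {
    size_t begin = t * chunk;
    size_t end = std::min(names.size(), begin + chunk);
    if (begin >= end) { break; }
    workers.emplace_back(open_range, std::cref(names), begin, end);
  }
  for (auto& w : workers) { w.join(); }
}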
Let us look first at the results on my Apple M2 laptop (it has 8 cores, 4 performance + 4 efficiency). Unfortunately, I do not have a beefier Apple computer at my disposal right now.
I report the total time.
threads | regular | with thread pool |
---|---|---|
1 thread | 100 ms | 140 ms |
2 threads | 75 ms | 100 ms |
4 threads | 90 ms | 95 ms |
8 threads | 240 ms | 250 ms |
16 threads | 250 ms | 270 ms |
Let us test it out under Linux on a big x64 server with 64 cores. Although it is a bigger machine, it has worse per-core performance than the Apple MacBook in general (slower memory, slower clock, fewer instructions per cycle).
threads | regular | with thread pool |
---|---|---|
1 thread | 34 ms | 55 ms |
2 threads | 25 ms | 36 ms |
4 threads | 31 ms | 27 ms |
8 threads | 36 ms | 16 ms |
16 threads | 42 ms | 27 ms |
The macOS system has a faster disk, faster memory and faster cores. Yet opening files is clearly much slower under macOS according to this test.
I find it interesting that in both cases, using two threads minimizes the running time in the regular case: additional threads appear to slow things down.
With the big machine, a thread pool can go faster than the regular approach if I use four threads or more.
On my MacBook, I cannot open much more than 120,000 files per second. My Linux server scales up to 400,000 files per second. In some cases, opening thousands of files could become a hard bottleneck. Throwing more threads at the problem might not work.
2025-02-15 05:27:29
Conventional computer instructions operate on a single piece of data at once (e.g., they can negate an integer or add two integers). For better performance, CPU vendors add support for SIMD instructions. SIMD stands for Single Instruction, Multiple Data: it is a type of parallel processing where a single operation is executed simultaneously on multiple data points. E.g., you can add sixteen integers to sixteen other integers using one instruction.
The most powerful SIMD instruction sets available on commodity processors are part of the AVX-512 family. AVX-512 (Advanced Vector Extensions 512) is an extension to the x86 instruction set architecture (ISA) introduced by Intel. You can process registers made of 64 bytes! And, more importantly, it has a wide range of very powerful instructions. Recent AMD processors (Zen 4) provide extensive support for the powerful AVX-512 instructions. I was surprised at first when AMD decided to provide this extensive support, given how much work it must have been.
One of the neat tricks with AVX-512 is that, given a mask, you can ‘compress’ words. Suppose that you have a vector made of thirty-two 16-bit words and you want to keep only the second and third ones: you can use the vpcompressw instruction with the mask 0b110. It will produce a register where the second and third words are placed in the first and second positions, or it can write just these two words out to memory. You can invoke this latter functionality with the _mm512_mask_compressstoreu_epi16 function intrinsic in C/C++.
This works well on recent Intel processors, but there is a gotcha on AMD Zen 4 processors.
One way to judge how expensive an instruction is, is to count how many micro-instructions it requires. A micro-instruction is the most basic unit of work that a CPU can execute. It is not a direct measure of the cost of an instruction, but it gives a ballpark indication.
When working over registers, both Intel Ice Lake and AMD Zen 4 processors require 2 micro-instructions to execute vpcompressw. We cannot conclude that both Intel and AMD have the same performance when using this instruction, but it is likely close.
However, when writing the compressed data to memory, Intel processors require about six micro-instructions while the AMD Zen 4 processors require 256 micro-instructions. By this measure, the AMD Zen 4 processors are roughly 40 times slower.
Importantly, you do not really need the variant of the vpcompressw instruction which writes to memory. You can first write the compressed result to a register, and then, separately, write the register to memory. The following two lines might be sufficient:
__m512i compressed = _mm512_maskz_compress_epi8(mask, input);
_mm512_storeu_si512(output, compressed);
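Wrapped as a function, the replacement might look like the following sketch of my own (not the exact simdjson code). Note that, unlike the compress-store instruction, it always writes a full 64 bytes, so the output buffer needs 64 bytes of headroom even when the mask selects fewer bytes:

#include <immintrin.h>

// Requires AVX-512F and AVX-512 VBMI2 (e.g., compile with -mavx512f -mavx512vbmi2).
void compress_store_bytes(void *output, __mmask64 mask, __m512i input) {
  __m512i compressed = _mm512_maskz_compress_epi8(mask, input);
  _mm512_storeu_si512(output, compressed); // always stores 64 bytes
}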
We have written what might be the fastest JSON library in the world: simdjson. One feature of the simdjson library is its ability to minify JSON inputs very fast (simdjson::minify). That is, we remove unnecessary spaces, turning the following…
{ "name" : "Daniel" }
into the compressed equivalent…
{"name":"Daniel"}
It should be intuitive that the vpcompressw instruction is useful for this purpose. And, indeed, the simdjson library uses vpcompressw. However, I wrote it before AMD Zen 4 came along and I was using the variant that compresses to memory.
I investigated the issue on a Zen 4 system from Amazon, using the twitter.json test file. I replaced the _mm512_mask_compressstoreu_epi16 function intrinsic with two lines of code. The results are quite drastic, speeding up the function by over 15 times.
AVX-512 simdjson::minify with vpcompressw to memory | 1.25 GB/s |
AVX-512 simdjson::minify without vpcompressw to memory | 22 GB/s |
With a proper implementation, we get a minifying speed of 22 GB/s. The fix is available in version 3.12.2 of simdjson.
This is an interesting edge case where a particular CPU microarchitecture can massively impact the performance of otherwise fine code.
Note that AMD Zen 5 processors are not affected.
Remark: Under Linux, you can measure how many micro-ops are issued by a program with the command perf stat -e uops_issued.any.