2024-11-22 07:42:22
A few years ago, we wrote csFastFloat, a C# library to parse floating-point numbers faster. Given the string “3.1416”, it computes the nearest binary (double-precision) value.
The functionality of the library is equivalent to the C# function Double.Parse, except that it runs faster.
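In code, the substitution is a one-liner. Here is a minimal sketch based on the library’s README; treat the exact FastDoubleParser signature as an assumption on my part:

```csharp
using System.Globalization;
using csFastFloat; // NuGet package: csFastFloat (entry point per its README)

class Demo {
    static void Main() {
        // Standard library:
        double slow = double.Parse("3.1416", CultureInfo.InvariantCulture);
        // csFastFloat: same value, computed faster.
        double fast = FastDoubleParser.ParseDouble("3.1416");
        System.Console.WriteLine(slow == fast); // True
    }
}
```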
We contributed much of the library to .NET, and as of .NET 7, the csFastFloat library is effectively part of .NET. However, the standard library has additional overhead.
There is a new version of the Microsoft .NET runtime: .NET 9. I ran our benchmarks with .NET 9 on an Apple M2 processor:
dataset | FastFloat.TryParseDouble | Double.Parse |
---|---|---|
canada.txt | 1.1 GB/s | 0.3 GB/s |
mesh.txt | 0.9 GB/s | 0.2 GB/s |
synthetic.txt | 1.0 GB/s | 0.3 GB/s |
I find that there can still be a significant benefit to using csFastFloat over the .NET library: it can be about 3 times faster. Prior to .NET 7, the gap was closer to 9 times, so the .NET library did get much faster.
The benefits of csFastFloat depend on your data and may be smaller than what I report. Yet if you are parsing many floating-point numbers, you may want to give csFastFloat a try.
You can find csFastFloat on NuGet. The csFastFloat library is used by Sep, which might be the World’s Fastest .NET CSV Parser.
2024-11-13 01:39:24
Though I have many brilliant graduate students, I love working with undergraduate students. And I am not at all sure that you should favor people with graduate degrees, given a choice. Many graduate students tend to favor abstraction over practical skills. They often have an idealized view of the world. Moreover, these students are often consumed by research projects, theses, or dissertations, and the publication of scientific articles, which limits their time for concrete actions. On the other hand, undergraduate students, in my experience, show more enthusiasm for concrete and useful projects, even if they are not prestigious at first glance.
This selection effect is also evident in career choices: those aspiring for direct action are eager to engage in it and are not keen on spending years writing abstract documents. Conversely, those who seek to avoid direct engagement with reality tend to prolong their studies. This effect is noticeable during competitions to recruit professors. It is often surprisingly difficult to find candidates with significant real-world experience.
Irrespective of how smart and educated you are, it is difficult to beat a “can do” attitude. The world is built by people who are eager to get things done as opposed to people who merely want to talk or write. Or, at least, that should be what happens in a healthy society.
2024-11-02 08:24:09
Recent versions of the C++ language (C++20 and C++23) may drastically change how you program in C++. I want to provide some fun examples.
Thanks to the integration of features from the popular fmt library, it is much easier to format strings elegantly in C++. In turn, the fmt library was inspired by work done in languages like Python.
Suppose that you have a vector of integers and you want to print its content:
```cpp
#include <print>
#include <vector>

std::vector<int> v = {1, 2, 3, 4, 5};
std::println("{}", v); // [1, 2, 3, 4, 5]
```
Suppose you want it to be centered in a line of 40 characters, with underscore characters around it:
```cpp
std::vector<int> v = {1, 2, 3, 4, 5};
std::println("{:_^40}", v);
// ____________[1, 2, 3, 4, 5]_____________
```
Maybe you want to print it in reverse order?
std::println("{}", v | std::views::reverse);
Want to print its maximum?
std::println("{}", std::ranges::max(v));
What about printing my name and the current time, and my name again? We do not want to repeat the name twice in the parameters, so let us use indexed printing parameters:
```cpp
using namespace std::string_view_literals;

std::string_view name = "Daniel"sv;
std::println("Hello, {0} today is {1:%Y-%m-%d %X}, good day {0}!",
             name, std::chrono::system_clock::now());
// Hello, Daniel today is 2024-11-02 00:02:17, good day Daniel!
```
Let us say you want to print different values, each on their own line, justified to the right, with hyphens padding the left. For this purpose, we want a template function which can take as many parameters as you want, and we want to do the same operation on each parameter. We can achieve this result with a fold expression, like so:
```cpp
void print(auto... args) {
  (std::println("{:->50}", args), ...);
}

void prints() { print("a", 1, "b", 2, "c", 3); }
//-------------------------------------------------a
//-------------------------------------------------1
//-------------------------------------------------b
//-------------------------------------------------2
//-------------------------------------------------c
//-------------------------------------------------3
```
It looks a bit mysterious, but folds are great if you want to have functions that elegantly take several parameters.
Suppose you want a function which takes two integer-like values and returns their quotient… with the added trick that if the divisor is zero, you return a string as an error. We want it to work with any integer type (but only integer types). We would rather avoid exceptions. Then std::expected is perfect for the task. It is effectively a value/error pair conveniently packaged. The following code illustrates the idea… and I have added a ‘test’ function to show you how it might be used.
```cpp
template <std::integral number>
std::expected<number, std::string> divide(number a, number b) {
  if (b == 0) {
    return std::unexpected("Division by zero");
  }
  return a / b;
}

void test(auto x, auto y) {
  if (auto result = divide(x, y); result) {
    std::println("Result: {}", result.value());
  } else {
    std::println(stderr, "Error: {}", result.error());
  }
}
```
We wrote a fast URL parser called ada, and it returns the parsed URL in an std::expected structure. This allows us to avoid the complexity of exceptions. The fast JSON library simdjson has a custom return type that follows the same idea.
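For illustration, here is roughly what that looks like with ada, following the usage shown in its README; treat the exact function names as assumptions on my part:

```cpp
#include "ada.h"
#include <print>

int main() {
  // ada::parse returns a result in the spirit of std::expected:
  // either a parsed URL or an error state, with no exceptions thrown.
  auto url = ada::parse<ada::url_aggregator>("https://www.google.com");
  if (url) {
    std::println("host: {}", url->get_hostname());
  } else {
    std::println(stderr, "invalid URL");
  }
}
```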
Let us say that you want to print the integer 4 in binary (100) and know how many trailing 0 bits there are (the answer is 2):
std::println("{:b} {}", 4u, std::countr_zero(4u)); // 100 2
Suppose you want to rotate left the bits in an integer by 4:
std::println("{:x}", std::rotl(0xf0f0f0f0u, 4)); // f0f0f0f
Suppose you want to know what the floating point number 1.0 looks like as a 64-bit word:
std::println("{:b}", std::bit_cast<uint64_t>(1.0)); // 11111111110000000000000000000000000000000000000000000000000000
Suppose you want to print the size of a file in bytes?
std::println("{} bytes", std::filesystem::file_size("demo.cpp"));
Suppose you want to write to a log while including the exact position in the source code where the log was called?
```cpp
auto log = [](std::string_view message, std::source_location loc) {
  std::println("{} {}:{} function name: {}", message,
               loc.file_name(), loc.line(), loc.function_name());
};
log("problem", std::source_location::current());
```
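In practice, you would often capture the location as a defaulted function parameter, so that callers do not need to pass it explicitly. A minimal sketch:

```cpp
#include <print>
#include <source_location>
#include <string_view>

// The default argument is evaluated at each call site,
// so every call records where it was made.
void log(std::string_view message,
         std::source_location loc = std::source_location::current()) {
  std::println("{} {}:{} function name: {}", message,
               loc.file_name(), loc.line(), loc.function_name());
}

int main() { log("problem"); } // location filled in automatically
```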
There are many more cool features in recent versions of C++. If you want to learn more, I recommend the recent books by Marius Bancila. He is a great technical writer.
2024-10-18 07:03:05
We often store large datasets using comma-separated-value (CSV) files. The format is simple enough: each line of a text file is made of several values separated by commas, like so:
"Willett, Walter C.",Harvard T.H. Chan School of Public Health,usa
Given an Excel spreadsheet, you can easily get a CSV file, and vice versa. CSV files can easily be consumed or produced by your own software.
My interest in software performance reached a turning point when I had to parse large CSV files for a project.
So how fast can we go in C#? I am going to use .NET 9 on an Apple M2 system. I am using an 11 MB CSV file containing the name and affiliations of highly cited researchers. There are 217096 entries and I am going to look for researchers who have the string ‘Harvard’ in their affiliation. There are 1960 of those: Harvard University has many highly cited researchers. The file is encoded using UTF-8. You can find all my code on GitHub, look for the lemire/CsharpCSVBench repository.
Before we start actually parsing, let us load all lines one by one and count them:
```csharp
public int ScanFile() {
    int count = 0;
    using (var reader = new StreamReader(filename, System.Text.Encoding.UTF8)) {
        while (reader.ReadLine() != null) { count++; }
    }
    return count;
}
```
On my system, this takes 10 ms. Thus, we can scan the file at about 1.1 GB/s. I am loading the file from disk. However, my disk has a bandwidth higher than that, and the file is small enough to end up in cache. Thus we are already limited by the processor. And we are not yet doing any parsing!
As far as I know, the most popular library to parse CSV files in C# is the CsvHelper library. It is quite a good library as far as I can tell.
We can try locating the matching records like so:
```csharp
var matchingLines = new List<string>();
using (var reader = new StreamReader(filename, System.Text.Encoding.UTF8))
using (var csv = new CsvHelper.CsvReader(reader, new CsvConfiguration(CultureInfo.InvariantCulture))) {
    while (csv.Read()) {
        if (csv[1].IndexOf("Harvard", StringComparison.OrdinalIgnoreCase) >= 0) {
            matchingLines.Add(string.Join(",", csv.Parser.RawRecord));
        }
    }
}
```
You can write nicer code, but I am only interested in measuring the performance.
This time it takes 42 ms. That is, I can parse the CSV at 0.28 GB/s. We are clearly bounded by the processor in this instance.
Can we go faster? Let us consider the NReco.Csv library, which claims to be faster. I am skipping the benchmark code as it is more or less the same; a sketch follows. It is indeed faster: I get the job done in 36 ms, or 0.33 GB/s.
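For reference, here is a minimal sketch of the NReco.Csv version, following the usage shown in its README (the exact API details are an assumption on my part):

```csharp
var matchingLines = new List<string>();
using (var reader = new StreamReader(filename, System.Text.Encoding.UTF8)) {
    // NReco.Csv exposes a forward-only reader with an indexer per field.
    var csv = new NReco.Csv.CsvReader(reader, ",");
    while (csv.Read()) {
        if (csv[1].IndexOf("Harvard", StringComparison.OrdinalIgnoreCase) >= 0) {
            var fields = new List<string>();
            for (int i = 0; i < csv.FieldsCount; i++) { fields.Add(csv[i]); }
            matchingLines.Add(string.Join(",", fields));
        }
    }
}
```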
These are not great speeds.
There is a faster library called Sep:
```csharp
var matchingLines = new List<string>();
using var reader = Sep.Reader().FromFile(filename);
var colIndex = reader.Header.IndexOf("inst_name");
foreach (var row in reader) {
    // 'university' holds the string "Harvard" in this benchmark
    if (row[colIndex].Span.Contains(university, StringComparison.OrdinalIgnoreCase)) {
        matchingLines.Add(row.Span.ToString());
    }
}
```
Sep is much faster: it requires only 18 ms, or 0.64 GB/s.
What else might we do? We could go lower level. Let us load blocks of bytes and look for the ‘H’ of Harvard; when we find such a byte, we check that we have ‘Harvard’. Since we prefer not to load the whole file in memory, we need to check at the end of each block that we are not in the middle of the word Harvard.
It gets a bit hairy, but it is only for benchmarking purposes:
```csharp
var matchingLines = new List<string>();
using (FileStream fileStream = new FileStream(filename, FileMode.Open, FileAccess.Read)) {
    byte[] buffer = new byte[4 * 1024];
    Span<byte> harvardBytes = Encoding.UTF8.GetBytes("Harvard");
    var tailbytes = harvardBytes.Slice(1);
    int bytesRead;
    int offset = 0;
    // Read the file in blocks.
    while ((bytesRead = fileStream.Read(buffer, offset, buffer.Length - offset)) > 0) {
        bytesRead += offset;
        offset = 0;
        for (int i = 0; i <= bytesRead - harvardBytes.Length;) {
            // Look for the first byte ('H').
            i = Array.IndexOf(buffer, (byte)harvardBytes[0], i, bytesRead - harvardBytes.Length - i);
            if (i < 0) { break; }
            // Check that the rest of the word follows.
            Span<byte> region = buffer.AsSpan(i + 1, harvardBytes.Length - 1);
            if (region.SequenceEqual(tailbytes)) {
                // Extend the match to the enclosing line.
                var start = i;
                var end = i + harvardBytes.Length;
                while (start > 0 && buffer[start - 1] != '\n') { start--; }
                while (end + 1 < bytesRead && buffer[end + 1] != '\n') { end++; }
                string line = Encoding.UTF8.GetString(buffer, start, end - start + 1);
                matchingLines.Add(line);
                i += harvardBytes.Length;
            } else {
                i++;
            }
        }
        // A match may straddle a block boundary: carry the partial tail over.
        for (int i = bytesRead - harvardBytes.Length + 1; i <= bytesRead; i++) {
            Span<byte> region = buffer.AsSpan(i, bytesRead - i);
            if (harvardBytes.StartsWith(region)) {
                Array.Copy(buffer, i, buffer, 0, region.Length);
                offset = region.Length;
                break;
            }
        }
    }
}
```
I could simplify it slightly. Further, it is possibly slightly wrong: if a researcher is named ‘Harvard’, they will be matched by my code even if their affiliation does not contain the string, so a few cheap checks should be added.
Further, you want to make sure that you do not have quoted fields with unescaped line endings. I am implicitly assuming that records are separated by line endings (i.e., the newline character \n). This assumption can be violated in some cases, although not in the cases I care about. In the general case, you should use a well-tested CSV library.
The string search could be further optimized: I use a naive approach.
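For instance, recent .NET versions vectorize substring search over spans, so one could replace the manual first-byte scan. A small sketch, reusing the buffer and bytesRead variables from the code above:

```csharp
// MemoryExtensions.IndexOf over byte spans is SIMD-accelerated in recent .NET.
ReadOnlySpan<byte> needle = "Harvard"u8; // UTF-8 string literal (C# 11)
int pos = buffer.AsSpan(0, bytesRead).IndexOf(needle);
// pos is the index of the first occurrence, or -1 if there is none.
```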
This time, I can scan the whole file in 3.3 ms. If I need to, I can do it hundreds of times a second. It might be fast enough that you may not need an actual database engine and its indexes. C# can be fast.
method | speed |
---|---|
counting lines | 1.1 GB/s |
CsvHelper | 0.28 GB/s |
NReco.Csv | 0.33 GB/s |
Sep | 0.64 GB/s |
low-level | 3.5 GB/s |
Further reading: The fastest CSV parser in .NET.
2024-10-15 06:15:42
When optimizing small functions, I often rely on a table lookup: I replace the actual computation with a table of precomputed values. It is often surprisingly efficient.
Let us consider an example. Suppose that you are given an array of characters and you want to replace all instances of the character ‘\’ with the two-character string “\\”, all instances of the character ‘\n’ with the two-character string “\n”, and all instances of the character ‘\r’ with the two-character string “\r”. You might do it with a sequence of branches, like so:
```java
int index = 0;
for (char c : original) {
    if (c == '\\') {
        newArray[index++] = '\\';
        newArray[index++] = '\\';
    } else if (c == '\n') {
        newArray[index++] = '\\';
        newArray[index++] = 'n';
    } else if (c == '\r') {
        newArray[index++] = '\\';
        newArray[index++] = 'r';
    } else {
        newArray[index++] = c;
    }
}
```
Instead, we could store the escape sequences in a table and load them from the table. In Java, it might look as follows…
```java
private static final byte[] silly_table3;
static {
    silly_table3 = new byte[256];
    silly_table3['\\'] = '\\';
    silly_table3['\n'] = 'n';
    silly_table3['\r'] = 'r';
}
// ...
int index = 0;
for (char c : original) {
    byte b = silly_table3[c % 256];
    if (c < 256 && b > 0) {
        newArray[index++] = '\\';
        newArray[index++] = (char) b;
    } else {
        newArray[index++] = c;
    }
}
```
My table is unnecessarily large: I leave it as an exercise to the reader to optimize the code so that a much smaller table could be used.
When processing inputs sequentially, we often have some data loading. In this case, we must load each character from the input array. The table lookups also require some load operations, but modern processors can typically execute two or more loads per cycle. Thus we are often not severely limited by the number of loads. The loads have some latency, but a table used in a hot function is often cached close to the CPU. Thus we may have a relatively short latency (e.g., 5 cycles). If there is not too much dependency between how we process successive characters, this latency is not too harmful. On the plus side, table lookups reduce the number of instructions issued, and they often keep branch mispredictions to a minimum. Though processors keep getting better at predicting branches, mispredicted branches remain quite expensive.
You might be concerned that tables can make your binary code larger. However, that may not be true. One of the most important optimizations that compilers make is ‘inlining’: your small functions are often replicated several times to avoid the overhead of a function call and to allow for more advanced optimizations. A hot function implemented with tables is made of few instructions. These instructions can be reproduced at little cost all over your code base. In contrast, an implementation made of many instructions can be costly to inline.
I wrote a little Java benchmark to test out the idea. I provide random ASCII inputs. In my tests, the table approach is over 20% faster.
system | conventional | table |
---|---|---|
Apple M2 + Java 21 | 1 GB/s | 1.2 GB/s |
Intel Ice Lake + Java 20 | 0.8 GB/s | 1.1 GB/s |
AMD Zen 4 (7940HS) + Java 21 | 0.7 GB/s | 1.0 GB/s |
The results will vary depending on your system and processor.
The table approach is not very sensitive to the complexity of the problem. In my case, I needed to replace 3 distinct characters. I could extend it to 5 distinct characters or more by updating the table. The conventional approach becomes increasingly more complex and expensive as I add characters. Thus the benefits of the table approach grow as the problem gets more complex. Conversely, there is a limit to the value of the table approach: it is inefficient to use a table when you need to identify just one character.
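For instance, supporting two more characters is simply a matter of adding table entries; the loop over the input is untouched. A minimal sketch (the tab and form-feed characters are merely illustrative choices):

```java
// Extending the table: the lookup loop stays exactly the same.
silly_table3['\t'] = 't';  // also escape tabs as \t
silly_table3['\f'] = 'f';  // also escape form feeds as \f
```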
Optimizing compilers may turn your conventional code into a table lookup. For example, a switch-case is sometimes compiled to a small table. Further, what appears like a branch (if-else) might be compiled to conditional move instructions or even to a table. When in doubt, you should examine the assembly output. Unfortunately, Java makes it a bit difficult to do.
These speeds (around 1 GB/s) are not impressive, but Java makes it difficult to use more advanced low-level optimizations. The Java Vector API, once it becomes more mature, could help.
Credit: The blog post was motivated by an email by Lucas Sloan.