MoreRSS

site iconDaniel LemireModify

Computer science professor at the University of Quebec (TELUQ), open-source hacker, and long-time blogger.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of Daniel Lemire

使用 std::views::split 更简单、更快速地解析代码

2024-12-22 05:35:13

Parsing text files is often confusing irrespective of your programming language. It can also be surprising slow.

As an example, let us consider the following problem. You have a comma-separated-value file. It is a text file made of several lines and each line is separated into fields by a comma. Excel spreadsheets are often exported as comma-separated-value files. You want to extract just one column and iterate over it.

I asked ChatGPT for a solution in C++, and it gave me the following code:

std::vector<std::string> get_column(const std::string& input, 
       size_t column_number, char delimiter = ',') {
    std::vector<std::string> rows;
    std::istringstream iss(input);
    std::string row;

    // Split the string into rows
    while (std::getline(iss, row)) {
        rows.push_back(row);
    }

    std::vector<std::string> result;
    for (const auto& r : rows) {
        std::istringstream row_stream(r);
        std::string field;
        size_t index = 0;

        while (std::getline(row_stream, field, delimiter)) {
            if (index == column_number) {
                result.push_back(field);
                break;
            }
            ++index;
        }
    }

    return result;
}

The function takes a string input representing CSV-like data, a column index column_number, and an optional delimiter (defaulting to comma). It splits the input string into rows, then for each row, it further splits the row into fields based on the delimiter. It collects the field at the specified column index from each row into a vector of strings, which it then returns.

It is pretty bad code. I do not think that any professional C++ programmer would ever write such bad code. But then, again, I could be surprised.

If you have a modern system with C++20, you can write much less code and get high performance with std::views::split. Let me demonstrate:

auto get_column_cxx20(std::string_view data, 
    size_t column_number, char delimiter = ',') {
  auto rows = data | std::views::split('\n');
  auto column =
      rows |
      std::views::transform(
          [delimiter, column_number](auto &&row) {
            auto fields = row | std::views::split(delimiter);
            auto it = std::ranges::begin(fields);
            std::advance(it, column_number);
            return *it;
          }) |
      std::views::transform([](auto &&rng) -> std::string_view {
        return std::string_view(&*rng.begin(), 
             std::ranges::distance(rng));
      });
  return column;
}

The function uses C++20’s ranges and views to extract a specific column from CSV-like data represented as a std::string_view. It first splits the input data into rows using newline characters, then applies a series of transformations: it splits each row into fields based on a delimiter, selects the field at the specified column index (column_number), and finally converts these selected fields back into std::string_view objects. This approach is memory-efficient as it avoids copying data, using lazy evaluation to only process data when necessary. A std::string_view instance is a lightweight data structure that simply points at the string data: though it has been formalized as of C++17, it goes back to at least Boost 1.53 (released in 2013).

I wrote a small benchmark where I take an existing CSV file, I ask for the second column and I just add up the width of that column. Using LLVM 16 and an Apple M2 processor, I get the following results:

old school 0.11 GB/s
std::views::split 2.6 GB/s

Importantly, there is no disk access. Everything is in memory. So the old-school approach is purely computationally bounded at a puny 0.11 GB/s. My source code is available on GitHub.

I did not attempt to tune the std::views::split approach. I suspect that we could get more performance out of it by further tuning. But it is already 20 times faster than the naive approach.

The main annoyance with the application std::views::split on strings is that it works naturally with ranges and subranges whereas I much prefer working with std::string_view instances. Thus we see an ugly conversion routine strapped into the code. With C++23, you will be able to simplify the code as follows:

auto get_column_cxx23(std::string_view data, 
    size_t column_number, char delimiter = ',') {
  auto rows = data | std::views::split('\n');
  auto column =
      rows |
      std::views::transform(
          [delimiter, column_number](auto &&row) {
            auto fields = row | std::views::split(delimiter);
            auto it = std::ranges::begin(fields);
            std::advance(it, column_number);
            return *it;
          }) | std::ranges::to<std::string_view>();

  return column;
}

在 C++ 中将结构体的属性作为数组元素访问?

2024-12-16 08:55:17

In C++, it might be reasonable to represent a URL using a class or a struct made of several strings, like so:

struct basic {
    std::string protocol;
    std::string username;
    std::string password;
    std::string hostname;
    std::string port;
    std::string pathname;
    std::string search;
    std::string hash;
};

To associate each component with an index, we can use an enum class:

enum class component {
     PROTOCOL = 0,
     USERNAME = 1,
     PASSWORD = 2,
     HOSTNAME = 3,
     PORT = 4,
     PATHNAME  = 5,
     SEARCH = 6,
     HASH = 7,
};

Initially, you might use a switch statement to access components:

std::string& get_component(basic& url, component comp) {
    switch (comp) {
        case component::PROTOCOL: return url.protocol;
        case component::USERNAME: return url.username;
        case component::PASSWORD: return url.password;
        case component::HOSTNAME: return url.hostname;
        case component::PORT:     return url.port;
        case component::PATHNAME: return url.pathname;
        case component::SEARCH:   return url.search;
        case component::HASH:     return url.hash;
    }
}

The switch case might introduce performance overhead if accessed frequently. It is likely to compile to a series of branches.

For better performance, you could store the components in an array:

struct fat {
    std::array<std::string, 8> data;
    std::string &protocol = data[0];
    std::string &username = data[1];
    std::string &password = data[2];
    std::string &hostname  = data[3];
    std::string &port  = data[4];
    std::string &pathname = data[5];
    std::string &search = data[6];
    std::string &hash = data[7];
};

Alternatively, you could avoid hardcoding the indexes:

struct fat {
    std::array<std::string, 8> data;
    std::string& protocol  = data[int(Component::PROTOCOL)];
    std::string& username  = data[int(Component::USERNAME)];
    std::string& password  = data[int(Component::PASSWORD)];
    std::string& hostname  = data[int(Component::HOSTNAME)];
    std::string& port      = data[int(Component::PORT)];
    std::string& pathname  = data[int(Component::PATHNAME)];
    std::string& search    = data[int(Component::SEARCH)];
    std::string& hash      = data[int(Component::HASH)];
};

With this new data structure, getting a component by its index becomes simpler:

std::string& get_component(fat& url, component comp) {
    return url.data[int(comp)];
};

Unfortunately, each reference in the new fat data structure might use 8 bytes on a 64-bit system. That is not a concern if you expect to have few instances of the data structures. However, if you do repeated create instances, you might want to avoid the references. You might try to replace the references by simple methods:

struct advanced {
    std::array<std::string, 8> data;
    std::string &protocol() { return data[0]; }
    std::string &username() { return data[1]; }
    std::string &password() { return data[2]; }
    std::string &hostname() { return data[3]; }
    std::string &port() { return data[4]; }
    std::string &pathname() { return data[5]; }
    std::string &search() { return data[6]; }
    std::string &hash() { return data[7]; }
};

It is not entirely satisfactory as it requires calling methods instead of accessing attributes. Eugene Ostroukhov has a better approach using pointer-to-member operator. We create a compile-time array with pointers to the attribute of the ‘basic’ struct:

std::string& get_component_fast(basic& url, component comp) {
    constexpr static std::array accessors = {&basic::protocol, 
      &basic::username, &basic::password, &basic::hostname, 
      &basic::port, &basic::port, &basic::pathname, 
      &basic::search, &basic::hash};
    return url.*accessors[int(comp)];
}

In this manner, you do not need to modify the original class, and you are likely to get excellent performance. This latter function is likely to compile to a couple of inexpensive operations. The array it contains is likely constructed at compile time is effectively a constant tied to your function.

Note: In C, you could use the standard macro offsetof.

作为程序员夹具的数据结构(围棋版)

2024-12-09 05:04:01

A data structure in programming is a specific way of organizing and storing data in a computer so that it can be accessed and used efficiently.

In woodworking or metalworking, a jig holds a piece of work and guides the tools operating on it. It helps to produce consistent results. The simplest jig is probably a straight edge. For example, a straight metal bar can help you cut wood following a straight line. Another simple jig is the mitre box which helps us cut wood at fixed angles.

Data structures can be thought of as jigs for programmers. Much like how a jig guides the tool to make precise cuts or shapes, data structures in programming provide a framework for organizing and accessing data. Just as a jig guides the tool, data structures guide how data should be stored, accessed, and manipulated. Data structures help us ensure that operations on the data are performed in a predictable and efficient manner.

Conventional computer science courses often introduce a wide range of data structures: linked lists, red-black trees, and so forth. Yet, in practice, the overwhelming majority of programming problems can be solved using little more than arrays, the standard composite type (struct), and the occasional hash table. The analogy with the woodworking jig still holds: the most popular woodworking jigs are so simple that we often even forget that they are an actual tool.

Arrays

Some data structures are particularly useful due to their simplicity and efficiency. The simplest data structure is the array. Arrays are fixed-size collections of elements. E.g., in Go you can declare an array of 5 integers like so:

var myArray [5]int

Typically arrays are stored in a contiguous manner in memory. Thus arrays are ideal if you need to traverse all of the elements as quickly as possible: you access the memory in a predictable manner with little overhead. Accessing any given element in an array typically just involves taking the index and converting it to a memory address. It may require only a few inexpensive operations.

Generally speaking, arrays generate fast code in part because compilers find it easy to optimize the functions. For example, in Go, array accesses are typically subject to a bound checker: Go would prevent access to an array at index 10 if the array size is 5. However, the compiler can often optimize away such checks. For example, consider the following function where we sum the five elements in an array. It should compile to efficient code with hardly any overhead (e.g., bound checking).

func Sum(x [5]int) int {
    sum := 0
    for i := 0; i < 5; i++ {
        sum += x[i]
    }
    return sum
}

An array can contain different values (including arrays!). It is even possible to store Boolean values (true/false) in an array although in such cases, at least a byte per element is required. You may replace an array with a bitset in such cases. You may implement a bitset over an array as follows (using 128 bits as an example).

// BitSet represents a fixed-size bitset
type BitSet struct {
    bytes [16]byte
}


// Set sets the bit at 'index' to 1
func (bs *BitSet) Set(index int) {
    bs.bytes[index/8] |= 1 << uint(7-(index%8))
}

// Clear sets the bit at 'index' to 0
func (bs *BitSet) Clear(index int) {
    bs.bytes[index/8] &^= 1 << uint(7-(index%8))
}

// Get returns the value of the bit at 'index'
func (bs *BitSet) Get(index int) bool {
    return bs.bytes[index/8]&(1<<uint(7-(index%8))) != 0
}

A natural extension of the array is the multidimensional array. Even though the Go language supports natively multidimensional arrays, we can still implement them from scratch using a standard array. Typically, we use a row-major implementation. For example, we can implement a simple 3×3 matrix using a 9-element array. The first three elements of the array represent the first row, the next three elements represent the second row, and so forth.

// Matrix represents a 3x3 matrix
type Matrix struct {
    data [9]float64
}


// Set sets the value
func (m *Matrix) Set(i, j int, val float64) {
    if i < 0 || i > 2 || j < 0 || j > 2 {
        panic("Index out of bounds for a 3x3 matrix")
    }
    m.data[i*3+j] = val
}

// Get returns the value
func (m *Matrix) Get(i, j int) float64 {
    if i < 0 || i > 2 || j < 0 || j > 2 {
        panic("Index out of bounds for a 3x3 matrix")
    }
    return m.data[i*3+j]
}

// String provides a string representation of the Matrix
func (m *Matrix) String() string {
    var str string
    for i := 0; i < 3; i++ {
        for j := 0; j < 3; j++ {
            str += fmt.Sprintf("%6.2f ", m.Get(i, j))
        }
        str += "\n"
    }
    return str
}

Go, like many programming languages, allows you to create efficient composite data structures (struct). In particular, you can put arrays inside these data structures. Suppose, for example, that we want to represent points in space, we might do it as follows:

type Point struct {
    x float64
    y float64
}

We can put these points inside an array of ten elements:

var list [10]Point 

This pattern is often called an array of structs. We can also represent the same data using a struct of arrays:

type Points struct {
    x [10]float64
    y [10]float64
}

While both the array of structs and the struct of arrays can contain the same information, they have different performance characteristics. For example, if you needed to sum the x values, the struct of arrays might lead to faster code.

Dynamic arrays and slices

A slightly more sophisticated data structure is the dynamic array. A dynamic array is an array that can grow or shrink in size. In Go, they are implemented as slices. They are one of the most fundamental data structures in Go.

var mySlice []int
mySlice = append(mySlice, 1, 2, 3)

You access arrays by integer indexes. E.g., you access an array of size 3 with the indexes 0, 1, 2. Each index gives you access to a value. Thus the array constitutes a key-value data structure where the keys are the indexes.

Because the size of the slice is dynamic, Go provides a function to query it: len (length). E.g., len(mySlice) might be 3.

In Go, you can have several slices share the same underlying array. We use the operator [begin:end] to create a new slice from an existing slice. Importantly, this operation does not copy the underlying data.

In the following example, the second slice will be of length 1 and contain the second element of the first slice.

var mySlice []int
mySlice = append(mySlice, 1, 2, 3)
secondSlice := mySlice[1:2]

Go provides some shorthand notations. Instead of writing mySlice[2:len(mySlice)], you may write mySlice[2:] and instead of writing mySlice[0:1], you may write mySlice[:1]. To create a copy of the slice, you may simply write mySlice[:].

Slices share many of the same characteristics as arrays. For example, we can write functions that take slices as parameters and compile into similarly efficient code, such as this example where we sum the first 5 elements of the slice:

func Sum(x []int) int {
    sum := 0
  if len(x) < 5 {
    return -1
  }
    for i := 0; i < 5; i++ {
        sum += x[i]
    }
    return sum
}

In fact, dynamic arrays, that is, slices, are typically implemented as a thin wrapper over an array. The slice points at a region inside an array. When we shrink or extend the region pointed at by the slice, the underlying array may remain the same.

Typically, the underlying array has more storage capacity than the length of the slice. You may query the size of the underlying array with the cap (capacity) function. It may seem wasteful to have more capacity than needed. However, consider the case where we regularly add elements to a given slice, and the size of the underlying array is set to match exactly the length of the slice. In such a case, each addition (append) to the slice might require allocating a new array and copying the elements. Counting only the copies of elements, we get that we need 1 + 2 + 3 + 4 + … + n − 1 = (n−1)n/2 copies to populate a slice with n values. Instead, we typically grow the size of the underlying array using exponential steps. For example, whenever the capacity becomes insufficient, we could allocate memory to the next power of two: we allocate 16 elements of capacity if we need 7 elements, we allocate 1024 elements if we need 600 elements, and so forth. We can verify that with such a strategy, if we need to repeatedly add an element to a slice, we will never need more than 2n copies to create a slice with n values: 2n is much smaller than (n−1)n/2 when n is large. By default, Go does not recover excess capacity when making the slice smaller (e.g., s = s[:newlength]). However, you can use the extended syntax s[low:high:max] to tell Go to reduce the size of the underlying array to max-low. E.g., s = s[::len(s)] would create a copy of the slice while adjusting the underlying capacity to len(s).

The following program will add elements to a slice and prints the capacity (size of the underlying array) and length of the slice at each increment. At the end of the program, we show that we can shrink the slice with or without adjusting the capacity.

package main

import "fmt"

func main() {
    var mySlice []int
    for i := 0; i < 100; i++ {
        mySlice = append(mySlice, 1)
        fmt.Println(cap(mySlice), " : ", len(mySlice))
    }
    mySlice = mySlice[:10]
    fmt.Println(cap(mySlice), " : ", len(mySlice))
    mySlice = mySlice[:10:10]
    fmt.Println(cap(mySlice), " : ", len(mySlice))
}

The following is a possible output:

1  :  1
2  :  2
4  :  3
4  :  4
8  :  5
8  :  6
8  :  7
8  :  8
16  :  9
...
128  :  99
128  :  100
128  :  10
10  :  10
9  :  2

Sometimes it is convenient to store the data in sorted order within an array. We can then use the array as an efficient set data structure even if the array is large. When searching for a value, we may use a binary search. The algorithm is relatively straight-forward: given the value that we are looking for, we first compare it against the value in the middle, that is, the median value. If the value we are searching for is greater than the median value, we search in the later half of the array, otherwise we search in the early half of the array. We repeat the division recursively until we either find the value that we are looking for, or we determine that the value cannot be found. We need about log2(N) iterations to complete the search over an array of size N, and the logarithmic function grows very slowly (e.g., log2(106) ≈ 20).

// Search searches for a number in a sorted array
// If the number is found, it returns the index of the number and true
// If the number is not found, it returns -1 and false
func Search(n int, array []int) (int, bool) {
    low, high := 0, len(array)-1
    for low <= high {
        mid := (low + high) / 2
        if array[mid] == n {
            return mid, true
        } else if array[mid] < n {
            low = mid + 1
        } else {
            high = mid - 1
        }
    }
    return -1, false
}

Sometimes we only want to be able to access the smallest or largest value. Suppose for example that you have a stream of requests that you need to process, and you always want to pick next the request that has been waiting the longest. You could repeatedly scan the array, or repeatedly sort it. However, if you constantly append and remove data from the array, it could become expensive. Instead, we can order the values in the array according to a binary heap. The root of the heap is stored at index 0, and it is either the smallest or the largest value depending on the type of heap you need. So we either have a max-heap or a min-heap. Within the array, the values are then organized as a tree. For any node at index i, its left child is at 2 * i + 1 and its right child is at 2 * i + 2. At the end of the array, some node may only have a left child, while others may have no child. If we have a max-heap, each node must be no smaller than its children. If we have a min-heap, each node must be no greater than its children. Operations like insertion and deletion are performed by manipulating elements in the array while maintaining the heap property. To insert a new value, you place the new element at the array’s end. It then becomes the child of an existing node if the array was not empty: if you put it at index i, then you can compute the index of the parent as the integer (i−1)/2. You then permute it with its parent to preserve the heap property (e.g., if we have a max-heap, each node must be no smaller than its children). You may then need to check against the parent once more and so forth until you get to the top of the heap. The following function modifies a slice according to this algorithm using the max-heap property.

func Insert(heap *[]int, value int) {
    // Append the new value to the end of the heap
    *heap = append(*heap, value)
    heapSize := len(*heap)

    // Start with the last element's index
    i := heapSize - 1

    // Heapify-up (sift up) to maintain the max-heap property
    for i > 0 {
        parent := (i - 1) / 2

        // If the value inserted is greater than its parent, swap them
        if (*heap)[i] > (*heap)[parent] {
            // Swap the elements
            (*heap)[i], (*heap)[parent] = (*heap)[parent], (*heap)[i]

            // Move up to the parent's index
            i = parent
        } else {
            // If we didn't swap, we've reached the correct position
            break
        }
    }
}

When removing the top element, you replace it with the last element and then propagate this element downward by swapping it with the larger (or smaller) child if it violates the heap property. We may implement it as follows for a max-heap:

// RemoveTop removes the top element of the max-heap
func RemoveTop(heap *[]int) int {
    if len(*heap) == 0 {
        return 0 // or an appropriate zero value/error handling
    }

    // Store the root value to return later
    top := (*heap)[0]

    // Replace the root with the last element
    heapSize := len(*heap)
    (*heap)[0] = (*heap)[heapSize-1]
    *heap = (*heap)[:heapSize-1]

    // Start heapifying down from the root
    i := 0
    for {
        // Calculate indices of children
        leftChild := 2*i + 1
        rightChild := 2*i + 2
        largest := i

        // Check left child
        if leftChild < len(*heap) && (*heap)[leftChild] >
            (*heap)[largest] {
            largest = leftChild
        }

        // Check right child
        if rightChild < len(*heap) && (*heap)[rightChild] >
            (*heap)[largest] {
            largest = rightChild
        }

        if largest != i {
            (*heap)[i],
                (*heap)[largest] = (*heap)[largest],
                (*heap)[i]
            i = largest
        } else {
            // If largest is the root, we are done
            break
        }
    }

    return top
}

Thus with only two relatively simple functions, we can maintain a binary heap. These functions require at most ⌈log2N comparisons where N is the number of elements in the array. Interestingly, a binary heap provides a sensible algorithm to sort an array sometimes called a heapsort: insert all elements in a binary heap, and then repeatedly remove the top value. The result is a O(NlogN) algorithm.

Just like we can implement a bitset over an array, we can implement a dynamic bitset over a dynamic array. The approach is much the same, but the array is replaced by a slice. Importantly, we need to check for the need to extend the bitset. When the need arises, we grow the bitset by allocating a larger array and copying our existing data. Observe that we choose to grow the slice by twice the amount needed: this prevents degenerate cases where the user might access the bits one by one (0, 1, …), leading to repeated copies and allocations.

// BitSet represents a dynamically growing bitset
type BitSet struct {
    bytes []byte
}

// ensureCapacity makes sure we have enough bytes to access the specified index
func (bs *BitSet) ensureCapacity(index int) {
    requiredBytes := (index + 1 + 7) / 8
    if len(bs.bytes) < requiredBytes {
        // Grow the slice if necessary
        newBytes := make([]byte, requiredBytes*2)
        copy(newBytes, bs.bytes)
        bs.bytes = newBytes
    }
}

// Set sets the bit at 'index' to 1
func (bs *BitSet) Set(index int) {
    bs.ensureCapacity(index)
    bs.bytes[index/8] |= 1 << uint(7-(index%8))
}

// Clear sets the bit at 'index' to 0
func (bs *BitSet) Clear(index int) {
    bs.ensureCapacity(index)
    bs.bytes[index/8] &^= 1 << uint(7-(index%8))
}

// Get returns the value of the bit at 'index'
func (bs *BitSet) Get(index int) bool {
    bs.ensureCapacity(index)
    return bs.bytes[index/8]&(1<<uint(7-(index%8))) != 0
}

Hash tables and maps

It is common that you need a data structure where you can access values using either non sequential integers (e.g., 10, 1000, 100000), or other types of values such as a string. E.g., maybe you want a map from names to phone numbers.

One of the most useful data structures is the hash table. Hash tables build on top of arrays, but instead of using consecutive integer values as indexes, they can take nearly any type of values (e.g., string) as keys.

The essential trick of the hash table is to use a hash function which maps arbitrary keys to integer values. The integer values are used as indexes inside an array. The array is typically picked to be of a size comparable to the number of distinct keys. Thus a hash table is a generalization of the array, but one where you replace the simple accesses with a hash function. A hash function takes a key and converts it into an index of the hash table array.

We typically hope that the hash function distributes keys evenly across the array to minimize collisions. When two keys are mapped to the same index inside the array, we must resolve the issue. One possibility is to create a larger array. But this would cause the arrays to grow too quickly in practice. Thus we must handle collisions efficiently. There are two broad strategies. One possibility is to have the ability to store several key-value pairs in each element of the array. It is often called a bucket approach. Another possibility is to store colliding key-value pairs elsewhere: e.g., you can store it to the next available slot in the array.

Many hash table implementations exist with varying properties. Go provides a built-in map type that implements a hash table. E.g., to create a map from strings to integers, we might proceed as follows:

myMap := make(map[string]int)
myMap["key"] = 10

Go uses a bucket approach: each element in the array can store several values. Each bucket in Go’s map implementation contains multiple key-value pairs. The exact number of pairs per bucket can vary, but it’s designed to handle a small number of entries efficiently. E.g., for example, we could use this type of data structure with slices:

type bucket struct {
    keys    []string // Keys
    values  []int    // Values
}

When two keys hash to the same index, they are placed in the same bucket.

When looking up a key, we compute the hash of the key, use the hash to find the correct bucket, check each entry in the bucket for a matching key. As long as the buckets are small enough, the performance is going to be acceptable.

We often compute the number of keys stored in the hash table and divide it by the size of the array. The result is called the load factor. When there are too many keys compared to the size of the underlying array, the array is grown, and the hash table is reconstructed. This will typically reduce the average size of the buckets. Similarly, we can reduce the size of the array if there are too few keys left.

Let us consider a complete example:

package main

import (
    "errors"
    "fmt"
)

type Bucket struct {
    keys   []string
    values []int
}

func (b *Bucket) Add(key string, value int) {
    b.keys = append(b.keys, key)
    b.values = append(b.values, value)
}

func (b *Bucket) Find(key string) (int, error) {
    for i := 0; i < len(b.keys); i++ {
        if key == b.keys[i] {
            return b.values[i], nil
        }
    }
    return 0, errors.New("Not found")
}

type HashTable struct {
    array []Bucket
}

func NewHashTable(size int) *HashTable {
    return &HashTable{
        make([]Bucket, size),
    }
}

func (ht *HashTable) hash(key string) int {
    // A very simple hash function
    hash := 0
    for i := 0; i < len(key); i++ {
        hash += 31 * int(key[i])
    }
    return hash % len(ht.array)
}

func (ht *HashTable) Get(key string) (int, error) {
    return ht.array[ht.hash(key)].Find(key)
}

func (ht *HashTable) Set(key string, value int) error {
    _, e := ht.Get(key)
    if e == nil {
        return errors.New("Key already present")
    }
    ht.array[ht.hash(key)].Add(key, value)
    return nil
}

func main() {
    ht := NewHashTable(10)
    ht.Set("apple", 1)
    ht.Set("banana", 3)
    fmt.Println(ht.Get("apple")) // Should print 1
}

Our example is simplified. However, it illustrates the basic principles of how hash tables work.

The approach we described, with buckets, is relatively simple to implement, but it may not always offer the best performance because of the overhead of maintaining buckets as separate dynamic arrays. Instead, we may consider a variation where we have one key-value per slot in the array. This alternative model is sometimes called open addressing. When two keys hash to the same slot, we can move one of the key-value pairs to another available slot, or increase the size of the underlying array. At query time, we begin by searching for the key-value according to the hash value of the key. If another key-value is found, we visit the next slot, until either we find the value we are looking for, or an empty slot is found. As long as we keep the underlying array large enough so that a sizeable fraction of slots are empty, the performance will be acceptable.

The following code illustrates the main idea behind open addressing, with the exception that the underlying array does not grow as more values are added. To add this functionality, we should track the number of keys added, and grow the array as needed. As a sentinel to indicate an available slot, we use the empty key. In practice, we would need to add additional checks to make sure that the user does not try to add a key with an empty string, and to handle the scenario appropriately.

type Item struct {
    key   string
    value int
}

type HashTable struct {
    items      []Item
    emptyValue Item
}

func NewHashTable(size int) *HashTable {
    ht := &HashTable{
        items:      make([]Item, size),
        emptyValue: Item{key: "", value: -1},
    }
    // Initialize all slots with emptyValue
    for i := 0; i < size; i++ {
        ht.items[i] = ht.emptyValue
    }
    return ht
}


func hash(key string) int {
    // A very simple hash function
    hash := 0
    for i := 0; i < len(key); i++ {
        hash += 31 * int(key[i])
    }
    return hash
}

// Put adds a key-value pair to the hash table
func (ht *HashTable) Put(key string, value int) {
    hashValue := hash(key) % len(ht.items)
    for {
        if ht.items[hashValue] == ht.emptyValue {
            ht.items[hashValue] = Item{key: key, value: value}
            return
        }
        if ht.items[hashValue].key == key {
            ht.items[hashValue].value = value
            return
        }
        // Linear probing to find next available slot
        hashValue = (hashValue + 1) % len(ht.items)
    }
}

// Get retrieves a value from the hash table by key
func (ht *HashTable) Get(key string) (int, bool) {
    hashValue := hash(key) % len(ht.items)
    for i := 0; i < len(ht.items); i++ {
        if ht.items[hashValue].key == key {
            return ht.items[hashValue].value, true
        }
        if ht.items[hashValue] == ht.emptyValue {
            return -1, false // key not found
        }
        hashValue = (hashValue + 1) % len(ht.items)
    }
    return -1, false // key not found after full cycle
}

There are many alternatives that could be even faster than open addressing in some instances, such as Cuckoo hashing. In practice, programmers rarely implement their own hash tables, but they should be aware that there are different implementations with various advantages.

A hash table that only has keys can be used to implement a set data structure. That is, we do not always need to have values. A straight-forward variation worth considering is to allow keys to be mapped to several different values. This variation is sometimes called a multimap.

Conclusion

Though there are countless data structures in computer science, much can be achieved with arrays, and a few simple functions. When organizing your data, you should first try to use arrays. In some cases, a hash table might be needed if you have a set or a map. It should cover the vast majority of your data structures.

To illustrate the idea, consider JSON. JSON (JavaScript Object Notation) is a standard text-based data interchange format that is commonly used to exchange data online. It has primitive values (strings, numbers, Booleans, the null value) and composite types (arrays and objects). Objects are maps from strings to primitive values or to other composite types (arrays and objects). We represent objects as comma-separated key-value pairs within curly braces {}, using the colon as a separator between the key and the value. We use square brackets [] for arrays, separating values with commas. Consider the following example representing a list of employees.

{
   "employees":[
      {
         "name":"Richard",
         "salary":56000,
         "function":"CEO"
      },
      {
         "name":"Lisa",
         "salary":55000,
         "function":"accountant"
      }
   ]
}

Empirically, JSON is found to be sufficient to represent most data, despite its simplicity. A key insight is that by combining arrays and maps, with a few standard types for numbers and strings, you can solve most problems.

In some cases, specialized data structures can provide superior performance. Yet you should be mindful that it is often easier to optimize code when the underlying data structure is simpler.

用 C# 以每秒超过千兆字节的速度解析浮点运算

2024-11-22 07:42:22

A few years ago, we wrote csFastFloat, a C# library to parse floating-point numbers faster. Given the string “3.1416”, it computes the binary value 3.1416.

The functionality of the library is equivalent to the C# function Double.Parse. except that it runs faster.

We contributed much of the library to .NET, and as of .NET 7: the csFastFloat library is effectively part of .NET. However, the standard library (.NET) has additional overhead.

There is a new version of Microsoft C# runtime: .NET 9.  I ran our benchmarks with .NET9 on an Apple M2 processor:

dataset FastFloat.TryParseDouble Double.Parse
canada.txt 1.1 GB/s 0.3 GB/s
mesh.txt 0.9 GB/s 0.2 GB/s
synthetic.txt 1.0 GB/s 0.3 GB/s

I find that there can still be a significant benefit to using csFastFloat over the .NET library: it can be about 3 times faster. Prior to .NET, the gap was closer to 9 times, so the .NET library did get much faster.

The benefits of the csFastFloat depend on your data and may be less than what I report. Yet if you are parsing many floating-point numbers, you may want to give csFastFloat a try.

You can find csFastFloat on nuget. The csFastFloat library is used by Sep which might be the World’s Fastest .NET CSV Parser.

研究生学位被高估了

2024-11-13 01:39:24

Though I have many brilliant graduate students, I love working with undergraduate students. And I am not at all sure that you should favor people with graduate degrees, given a choice. Many graduate students tend to favor abstraction over practical skills. They often have an idealized view of the world. Moreover, these students are often consumed by research projects, theses, or dissertations, and the publication of scientific articles, which limits their time for concrete actions. On the other hand, undergraduate students, in my experience, show more enthusiasm for concrete and useful projects, even if they are not prestigious at first glance.

This selection effect is also evident in career choices: those aspiring for direct action are eager to engage in it and are not keen on spending years writing abstract documents. Conversely, those who seek to avoid direct engagement with reality tend to prolong their studies. This effect is noticeable during competitions to recruit professors. It is often surprisingly difficult to find candidates with significant real-world experience.

Irrespective of how smart and educated you are, it is difficult to beat a “can do” attitude. The world is built by people who are eager to get things done as opposed to people who merely want to talk or write. Or, at least, that should be what happens in a healthy society.

玩转现代 C++

2024-11-02 08:24:09

Recent versions of the C++ language (C++20 and C++23) may allow you to change drastically how you program in C++. I want to provide some fun examples.

Thanks to the integration of the features from the popular fmt library, it is much easier to format strings elegantly in C++. In turn the fmt library was inspired by the work done in languages like Python.

Suppose that you have a vector of integers and you want to print its content:

    std::vector<int> v = {1, 2, 3, 4, 5};
    std::println("{}", v);

Suppose you want it to be centered in a line of 40 characters, with underscore characters around it:

    std::vector<int> v = {1, 2, 3, 4, 5};
    std::println("{:_^40}", v);
    // ____________[1, 2, 3, 4, 5]_____________

Maybe you want to print it in reverse order?

    std::println("{}", v | std::views::reverse);

Want to print its maximum?

    std::println("{}", std::ranges::max(v));

What about printing my name and the current time, and my name again? We do not want to repeat the name twice in the parameters, so let us use indexed printing parameters:

  std::string_view name = "Daniel"sv;
  std::println("Hello, {0} today is {1:%Y-%m-%d %X}, good day {0}!", name,
               std::chrono::system_clock::now());
   // Hello, Daniel today is 2024-11-02 00:02:17, good day Daniel!

Let say you want to print different values, each on their own line, justified to the right, with hyphens padding the left. For this purpose, we want to have template function which can take as many parameters as you want, and we want to do the same operation on each parameter. We can achieve this result with a fold, like so:

void print(auto ...args) {
  (std::println("{:->50}", args), ...);
}

void prints() {
  print("a", 1, "b", 2, "c", 3);
}

//-------------------------------------------------a
//-------------------------------------------------1
//-------------------------------------------------b
//-------------------------------------------------2
//-------------------------------------------------c
//-------------------------------------------------3

It looks a bit mysterious, but folds are great if you want to have functions that elegantly take several parameters.

Suppose you want a function which takes two integer-like values, and returns their quotient… with the added trick that if the divisor is zero, you return a string as an error. We want it to work with any integer type (but only integer types). We would rather avoid exceptions. Then `std::expected is perfect for the task. It is effectively a value/error pair conveniently packaged. The following code illustrates the idea… and I have added a ‘test’ function to show you how it might used.

template <std::integral number>
std::expected<number, std::string> divide(number a, number b) {
  if (b == 0) {
    return std::unexpected("Division by zero");
  }
  return a / b;
}

void test(auto x, auto y) {
  if (auto result = divide(x, y); result) {
    std::println("Result: {}", result.value());
  } else {
    std::println(stderr, "Error: {}", result.error());
  }
}

We wrote a fast URL parser called ada, and it returns parsed URL in an std::expected structure. This allows us to avoid the complexity of exceptions. The fast JSON library simdjson has a custom return type that follows the same idea.

Let us say that you want to print the integer 4 in binary (100) and know how many trailing 0 bits there are (the answer is 2):

  std::println("{:b} {}", 4u, std::countr_zero(4u));
  // 100 2

Suppose you want to rotate left the bits in an integer by 4:

  std::println("{:x}", std::rotl(0xf0f0f0f0u, 4));
  // f0f0f0f

Suppose you want to know what the floating point number 1.0 looks like as a 64-bit word:

  std::println("{:b}", std::bit_cast<uint64_t>(1.0));
  // 11111111110000000000000000000000000000000000000000000000000000

Suppose you want to print the size of a file in bytes?

  std::println("{} bytes", std::filesystem::file_size("demo.cpp"));

Suppose you want to write to a log while including the exact position in the source code where the log was called?

  auto log = [](std::string_view message, std::source_location loc) {
    std::println("{} {}:{} function name: {}", message, loc.file_name(),
                 loc.line(), loc.function_name());
  };

  log("problem", std::source_location::current());

There are many more cool features in recent version of C++. If you want to learn more, I recommend recent books by Marius Bancila. He is a great technical writer.

I make a complete example available on GitHub.