About Phil Eaton

For the last 10 years I've chased my way down the software stack starting from humble beginnings with the venerable jQuery and PHP.

The RSS feed is available at: https://notes.eatonphil.com/rss.xml

Obsession

2024-08-24 08:00:00

In your professional and personal life, I don't believe there is a stronger motivation than having something in mind and the desire to do it. Yet the natural way to deal with a desire to do something is to justify why it's not possible.

"I want to read more books but nobody reads books these days so how could I."

"I want to write for a magazine but I have no experience writing professionally."

"I want to build a company someday but how could someone of my background."

Our official mentors, our managers, are often unable to process big goals or guide you toward them, through some combination of well-intentioned defeatism and, among other things, their own well-intentioned lack of accomplishment.

I've been one of these managers myself. In fact I have, to my immense regret, tried too often to convince people to do what is practical rather than what they want to do. Or to do what I judged they were capable of doing rather than what they wanted to do.

In the best cases, my listener had the self-confidence to ignore me. They did what they wanted to do anyway. In the worst case, again to my deep regret, I've been a well-intentioned part of derailing someone's career for years.

So I don't want to convince anyone of anything anymore. If I start trying to convince someone by accident, I try to catch myself. I try to avoid sentences like "I think you should …". Instead "Here is something that's worked for me: …" or "Here is what I've heard works well for other people: …".

Nobody wants to be convinced. But intelligent people will change their mind when exposed to new facts or different ideas. Being convinced is a battle of will. Changing one's mind is a purely personal decision.

There are certainly people with discipline who can grind on things they hate doing and eventually become experts at it. But more often I see people grind on things they hate only to become depressed and give up.

For most of us, our best hope is (healthy) obsession. And obsession, in the sense I'm talking about, does not come from something you are ambivalent about or hate. Obsession can only come when you're doing something you actually want to do.

For big goals or big changes, you need regular commitment weekly, monthly, yearly. Over the course of years. And only obsession makes that work not actually feel like work. Obsession is the only thing that makes discipline not feel like discipline.

That big goals take years to accomplish need not be scary. Obsession doesn't mean you can't pivot. There is quite a lot to gain by committing to something regularly over the course of years even if you decide to stop and commit from then on to something else. You will learn a good deal.

And healthy obsession to me is more specifically measurable on the order of weeks, not hours or days. Healthy obsession means you're still building healthy personal and professional relationships. You're still taking care of yourself, emotionally and physically.

I do not have high expectations for people in general. This seems healthy and reasonable. But as I meet more people and observe them over the years, I am only more convinced of the vast potential of individuals. Individuals are almost universally underestimated.

I think you can do almost anything you want to do. If you commit to doing it.

I'll end this with a personal story.

Until 11th grade, I hated school. I hated the rigidity. Being forced to be somewhere for hours and to follow so many rules. I skipped so many days of school I'm embarrassed by it. I'd never do homework at home. I never studied for tests. I got Bs and Cs in the second-tier classes. I was in the orchestra for 6 years and never practiced at home. I was not cool enough to be a "bad kid" but I did not understand the system and had no discipline whatsoever.

I found out at the end of 10th grade that I could actually afford college if I got into a good enough school that paid full needs-based tuition. It sounded significantly better than the only other option that seemed obvious, joining the military as a recruit. I realized and decided that if I wanted to get into a good school I needed to not half-ass things.

Somehow, I decided to only do things I could become obsessed with. And I decided to be obsessed in the way that I wanted, not to do what everyone else did (which I basically could not do since I had no discipline). If we covered a topic in class, I'd read news about it or watch movies about it. I'd get myself excited about the topic in every way I could.

It basically worked out. I ended high school in the top 10% of the class (up from top 40% or something). I got into a good liberal arts college that paid the entirety of my tuition. But I remained a basically lazy and undisciplined person. I never stayed up late studying for a test. I dropped out after a year and a half for family reasons.

But I've now spent the last 10 years in my spare time working on compiler projects, interpreter projects, parser projects, database projects, distributed systems projects. I've spent the last 6 years consistently publishing at least one blog post per month.

I didn't want to work the way everyone else worked. I wanted to be obsessed about what I worked on.

Obsession has made all of this into something I now barely register as doing. It's allowed me to continue adding activities like organizing book clubs and meetups to the list of things I'm up to. Up until basically this year I could have in good faith said I am a very lazy and undisciplined person. But obsession turned me into someone with discipline.

Obsession became about more than just the tech. It meant trying to fully understand the product, the users, the market. It meant thinking more carefully about product documentation, user interfaces, company messaging. Obsession meant reflecting on how I treat my coworkers, and how my coworkers feel treated by others in general. Obsession meant wanting an equitable and encouraging work environment for everyone.

And, as I said, it's about healthy obsession. I didn't really understand the "healthy" part until a few years ago. But I'm now convinced that the "healthy" part is as important as the "obsession" part. To go to the gym regularly. To play pickup volleyball. To cook excellent food. To read fiction and poetry and play music. To serve the community. To be friendly and encouraging to all people. To meet new people and build better genuine friendships.

And in the context of work, "healthy obsession" means understanding you can't do everything, even while you care about everything. It means accepting that you make mistakes and that you do your best; that you try to do better and learn from mistakes the next time.

It's got to be sustainable. And we can develop a healthy obsession while we have quite a bit of fun too. :)

What's the big deal about Deterministic Simulation Testing?

2024-08-20 08:00:00

Bugs in distributed systems are hard to find, largely because systems interact in chaotic ways. And even once you've found a bug, it can be anywhere from simple to impossible to reproduce. It's about as far away as you can get from the ideal test environment: property testing a pure function.

But what if we could write our code in a way that we can isolate the chaotic aspects of our distributed system during testing: run multiple systems communicating with each other on a single thread and control all randomness in each system? And property test this single-threaded version of the distributed system with controlled randomness, all the while injecting faults (fancy term for unhappy path behavior like errors and latency) we might see in the real-world?

Crazy as it sounds, people actually do this. It's called Deterministic Simulation Testing (DST). And it's become more and more popular with startups like FoundationDB, Antithesis, TigerBeetle, Polar Signals, and WarpStream, as well as folks like Tyler Neely and Pekka Enberg, who talk about and make use of this technique.

It has become so popular to talk about DST in my corner of the world that I worry it risks coming off as too magical and maybe a little hyped. It's worth getting a better understanding of both the benefits and the limitations.

Thank you to Alex Miller and Will Wilson for reviewing a version of this post.

Randomness and time

A big source of non-determinism in business logic is the use of random numbers—in your code or your transitive dependencies or your language runtime or your operating system.

Crucially, DST does not imply you can't have randomness! DST merely assumes that you have a global seed for all randomness in your program and that the simulator controls the seed. The seed may change across runs of the simulator.

Once you observe a bad state as a result of running the simulation on a random seed, you allow the user to enter the same seed again. This allows the user to recreate the entire program run that led to that observed bad state, which makes debugging the program trivial.

Another big source of non-determinism is being dependent on time. As with randomness, DST does not mean you can't depend on time. DST means you must be able to control the clock during the simulation.

To "control" randomness or time basically means you support dependency injection, or the old-school alternative to dependency injection called passing the dependency as an explicit parameter. Rather than referring to a global clock or a global seed, you need to be able to receive a clock or a seed from someone.

For example, we might separate the operation of an application into the language's main() entrypoint and an actual application start() entrypoint.

# app.pseudocode

def start(clock, seed):
  # lots of business logic that might depend on time or do random things

def main:
  clock = time.clock()
  seed = time.now()
  start(clock, seed)

The application entrypoint is where we must be able to swap out a real clock or real random seed for one controlled by our simulator:

# sim.pseudocode

import "app.pseudocode"

def main:
  sim_clock = make_sim_clock()
  sim_seed = os.env.DST_SEED or time.now()
  try:
    app.start(sim_clock, sim_seed)
  catch(e):
    print("Bad execution at seed: %s", sim_seed)
    throw e

Let's look at another example.

Converting an existing function

Let's say that we had a helper method that kept calling a function until it succeeded, with backoff.

# retry.pseudocode
class Backoff:
  def init:
    this.rnd = rnd.new(seed = time.now())
    this.tries = 0

  async def retry_backoff(f):
    while this.tries < 3:
      if f():
        return

      await time.sleep(this.rnd.gen())
      this.tries++

There is a single source of nondeterminism here and it's where we generate a seed. We could parameterize the seed, but since we want to call time.sleep() and since in DST we control the time, we can just parameterize time.

# retry.pseudocode
class Backoff:
  def init(this, time):
    this.time = time
    this.rnd = rnd.new(seed = this.time.now())
    this.tries = 0

  async def retry_backoff(this, f):
    while this.tries < 3:
      if f():
        return

      await this.time.sleep(this.rnd.gen())
      this.tries++

Now we can write a little simulator to test this:

# sim.pseudocode
import "retry.pseudocode"

seed = os.env.DST_SEED ? int(os.env.DST_SEED) : time.now()
rnd = rnd.new(seed)

while true:
  # Fresh simulated time and backoff state for each property test iteration.
  sim_time = {
    now: 0
    sleep: (ms) => {
      await future.wait(ms)
    }
    tick: (ms) => now += ms
  }
  backoff = Backoff(sim_time)

  failures = 0
  f = () => {
    if rnd.rand() > 0.5:
      failures++
      return false

    return true
  }
  try:
    promise = backoff.retry_backoff(f)
    while sim_time.now < 60min:
      sim_time.tick(1ms)
      if promise.read():
        break

    assert_expect_failure_and_expected_time_elapse(sim_time, failures)
  catch(e):
    print("Found logical error with seed: %d", seed)
    throw e

This demonstrates a few critical aspects of DST. First, the simulator itself depends on randomness. But it allows the user to provide a seed so they can replay a simulation that discovered a bug. The controlled randomness in the simulator is what lets us do property testing.

Second, the simulation workload must be written by the user. Even when you've got a platform like Antithesis that gives you an environment for DST, it's up to you to exercise the application.

Now let's get a little more complex.

A single thread and asynchronous IO

The determinism of multiple threads can only be controlled at the operating system or emulator or hypervisor layer. Realistically, that would require third-party systems like Antithesis or Hermit (which, don't get excited, is not actively developed and hasn't worked on any interesting program of mine) or rr.

These systems transparently transform multi-threaded code into single-threaded code. But also note that Hermit and rr have only limited ability to do fault injection which, in addition to deterministic execution, is a goal of ours. And you can't run them on a Mac, nor on ARM.

But we can, and would like to, write a simulator without writing a new operating system or emulator or hypervisor, and without a third-party system. So we must limit ourselves to writing code that can be collapsed into a single thread. Significantly, since using blocking IO would mean an entire class of concurrency bugs could not be discovered while running the simulator in a single thread, we must limit ourselves to asynchronous IO.

Single threaded and asynchronous IO. These are already two big limitations.
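
To make the single-threaded idea concrete, here is a minimal sketch in real Python (my illustration, not modeled on any particular library) of deterministically scheduling concurrent tasks: each task is a generator, each yield stands in for an await point, and the interleaving comes entirely from a seeded RNG.

import random

def run_simulation(tasks, seed):
  rnd = random.Random(seed)  # all scheduling randomness flows from one seed
  running = list(tasks)
  while running:
    task = rnd.choice(running)  # deterministic choice given the seed
    try:
      next(task)  # run the chosen task up to its next yield (await) point
    except StopIteration:
      running.remove(task)

def worker(name, steps):
  for i in range(steps):
    print(name, "step", i)
    yield  # stands in for an await point

# The same seed always reproduces the same interleaving.
run_simulation([worker("a", 3), worker("b", 3)], seed=42)

A real simulator also schedules IO completions and timers through the same seeded choices, but the core trick is the same: one thread, one seed, one reproducible ordering.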

Some languages like Go are entirely built around transparent multi-threading and blocking IO. Polar Signals solved this for DST by compiling their application to WASM where it would run on a single thread. But that wasn't enough. Even on a single thread, the Go runtime intentionally schedules goroutines randomly. So Polar Signals forked the Go runtime to control this randomness with an environment variable. That's kind of crazy. Resonate took another approach that also looks cumbersome. I'm not going to attempt to describe it. Go seems like a difficult choice of a language if you want to do DST.

Like Go, Rust has no builtin async IO. The most mature async IO library is tokio. The tokio folks attempted to provide a tokio-compatible simulator implementation with all sources of nondeterminism removed. From what I can tell, they did not at any point fully succeed. That repo has now been replaced with a "this is very experimental" tokio-rs project called turmoil that provides deterministic execution plus network fault injection. (But not disk fault injection. More on that later.) It isn't surprising that it is difficult to provide deterministic execution for an IO library that was not designed for it. tokio is a large project with many transitive dependencies. They must all be combed for non-determinism.

On the other hand, Pekka has already demonstrated for us how we might build a simpler Rust async IO library that is designed to be simulation tested. He modeled this on the TigerBeetle design King and I wrote about two years ago.

So let's sketch out a program that does buggy IO and let's look at how we can apply DST to it.

# readfile.pseudocode
def read_file(io, name, into_buffer):
  f = await io.open(name)
  read_buffer = [4096]u8{}
  while true:
    err, n_read = await f.read(&read_buffer)
    if err == io.EOF:
      into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
      return

    if err:
      throw err

    into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])

In our simulator, we will provide a mocked out IO system and we will randomly inject various errors while asserting pre- and post-conditions.

# sim.pseudocode
import "readfile.pseudocode"

seed = os.env.DST_SEED ? int(os.env.DST_SEED) : time.now()
rnd = rnd.new(seed)

while true:
  sim_disk_data = rnd.rand_bytes(10MB)
  sim_fd = {
    pos: 0
    EOF: Error("eof")
    read: (fd, buf) => {
      partial_read = rnd.rand_in_range_inclusive(0, sizeof(buf))
      memcpy(sim_disk_data, buf, fd.pos, partial_read)
      fd.pos += partial_read
      if fd.pos == sizeof(sim_disk_data):
        return io.EOF, partial_read
      return null, partial_read
    }
  }
  sim_io = {
    open: (filename) => sim_fd
  }

  out_buf = Vector<u8>.new()
  try:
    await read_file(sim_io, "somefile", out_buf)
    assert_bytes_equal(out_buf.data, sim_disk_data)
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e

And with this simulator we would have eventually caught our partial read bug! In our original program when we wrote:

      into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])

We should have written:

      into_buffer.copy_maybe_allocate(read_buffer[0:n_read])

Great! Let's get a little more complex.

A distributed system

I already mentioned in the beginning that the gist of deterministic simulation testing a distributed system is that you get all of the nodes in the system to run in the same process. This would be basically impossible if you wanted to test a system that involved your application plus Kafka plus Postgres plus Redis. But if your system is a self-contained distributed system, such as one that embeds a Raft library for high availability of your application, you can actually run multiple nodes in the same process!

For a system like this, our simulator might look like:

# sim.pseudocode
import "distsys-node.pseudocode"

seed = os.env.DST_SEED ? int(os.env.DST_SEED) : time.now()
rnd = rnd.new(seed)

while true:
  sim_fd = {
    send(fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
         throw Error('bad write')

      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())

      n_written = assert_ok(os.fd.write(buf))
      return n_written
    },
    recv(fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
         throw Error('bad read')

      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())

      return os.fd.read(buf)
    }
  }
  sim_io = {
    open: (filename) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad open')

      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())

      return sim_fd
    }
  }

  all_ports = [6000, 6001, 6002]
  nodes = [
    await distsys-node.start(sim_io, all_ports[0], all_ports),
    await distsys-node.start(sim_io, all_ports[1], all_ports),
    await distsys-node.start(sim_io, all_ports[2], all_ports),
  ]
  history = []
  try:
    key = rnd.rand_bytes(10)
    value = rnd.rand_bytes(10)
    nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)].insert(key, value)
    history.add((key, value))
    assert_valid_history(nodes, history)

    # Crash a process every so often
    if rnd.rand() > 0.75:
      node = nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)]
      node.restart()
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e

I'm completely hand-waving here to demonstrate the broader point and not any specific testing strategy for a specific distributed system. The important point is that these three nodes run in the same process, on different ports.

We control disk IO. We control network IO. We control how time elapses. We run a deterministic simulated workload against the three node system while injecting disk, network, and process faults.

And we are constantly checking for an invalid state. When we get the invalid state, we can be sure the user can easily recreate this invalid state.

Other sources of non-determinism

Within some error margin, most CPU instructions and most CPU behavior are considered to be deterministic. There are, however, certain CPU instructions that are definitely not. Unfortunately that might include system calls. It might also include malloc. There is very little to trust.

If we ignore Antithesis, people doing DST seem not to worry about these smaller bits of nondeterminism. Yet it's generally agreed that DST is still worthwhile anyway. The intuition here is that every bit of non-determinism you can eliminate makes it that much easier to reproduce bugs when you find them.

Put another way: determinism, even among DST practitioners, remains a spectrum.

Considerations

As you may have noticed already from some of the pseudocode, DST is not a panacea.

Consideration 1: Edges

First, because you must swap out non-deterministic parts of your code, you are not actually testing the entirety of your code. You are certainly encouraged to keep the deterministic kernel large. But there will always be the non-deterministic edges.

Without a system like Antithesis which gives you an entire deterministic machine, you can't test your whole program.

But even with Antithesis you cannot test the integration between your system and external systems. You must mock out the external systems.

It's also worth noting that there are many areas where you could inject simulation. You could do it at a high-level RPC and storage layer. This would be simpler and easier to understand. But then you'd be omitting testing and error-handling of lower-level errors.

Consideration 2: Your workload(s)

DST depends on the creativity and thoroughness of your workload as much as any other type of test or benchmark does.

Just as you wouldn't depend on one single benchmark to qualify your application, you may not want to depend on a single simulated workload.

Or as Will Wilson put it for me:

The biggest challenge of DST in my experience is that tuning all the random distributions, the parameters of your system, the workload, the fault injection, etc. so that it produces interesting behavior is very challenging and very labor intensive. As with fuzzing or PBT, it's terrifyingly easy to build a DST system that appears to be doing a ton of testing, but actually never explores very much of the state space of your system. At FoundationDB, the vast majority of the work we put into the simulator was an iterative process of hunting for what wasn't being covered by our tests and then figuring out how to make the tests better. This process often resembles science more than it does engineering.

Unfortunately, unlike with fuzzing, mere branch coverage in your code is usually a pretty poor signal for the kinds of systems you want to test with DST. At Antithesis we handle this with Sometimes assertions, at FDB we did something pretty similar, and I assume TigerBeetle and others have their own version of this. But of course the ultimate figure of merit is whether your DST system is finding 100% of your bugs. It's quite difficult to get to the point that it does. The truly ambitious part of Antithesis isn't the hypervisor, but the fact that we also aim to solve the much harder "is my DST working?" problem with minimal human guidance or supervision.

Consideration 3: Your knowledge of what you mocked

When you mock out the behavior of disk or network IO, the benefits of DST are tied to your understanding of the spectrum of behavior that may happen in the real world.

What are all possible error conditions? What are the extreme latency bounds of the original method? What about corruption or misdirected IO?

The flipside here is that only in deterministic simulation testing can you make these crazy scenarios happen with configurable regularity. You can kick off a set of runs that have especially high IO latency or especially high rates of corrupt reads/writes. Joran and I wrote a year ago about how the TigerBeetle simulator does exactly this.

Consideration 4: Non-reproducible seeds as code changes

Critically, the reproducibility of DST only helps so long as your code doesn't change. As soon as your code changes, the seed may no longer even get you to the state where the bug was exhibited. So the reproducibility of DST is most useful for helping you convert the seed's simulation run into an integration test that describes the precise scenario, one that remains meaningful even as the code changes.
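
As a minimal sketch of that conversion (the test name, the seed, and the run_simulation entrypoint are all hypothetical):

# Hypothetical regression test distilled from a failing simulator seed.
def test_scenario_found_by_seed_1234567():
  # This seed reproduced the bug at the time it was found. Once the code
  # changes enough that the seed no longer reaches the bad state, rewrite
  # the test to construct the scenario explicitly.
  run_simulation(seed=1234567)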

Consideration 5: Time and compute

Because of Consideration 4, you need to keep rerunning the simulator not just to keep finding new seeds and new histories but because the new seeds and new histories may change every time you make changes to code.

What about Jepsen?

Jepsen does limited process and network fault injection while testing for linearizability. It's a fantastic project.

However, it represents only a subset of what is possible with Deterministic Simulation Testing (if you actually put in the effort described above to get there).

But even more importantly, Jepsen has nothing to do with deterministic execution. If Jepsen finds a bug and your system can't do deterministic execution, you may or may not be able to reproduce that Jepsen bug.

Here's another Will Wilson quote for you on Jepsen and FoundationDB:

Anyway, we did [Deterministic Simulation Testing] for a while and found all of the bugs in the database. I know, I know, that’s an insane thing to say. It’s kind of true though. In the entire history of the company, I think we only ever had one or two bugs reported by a customer. Ever. Kyle Kingsbury aka “aphyr” didn’t even bother testing it with Jepsen, because he didn’t think he’d find anything.

Conclusion

The degree to which you can place faith in DST alone, and not time spent in production, has limits. However, it certainly does no harm to employ DST. And, barring the considerations described above, it will likely make the kernel of your product significantly more stable. Furthermore, everyone who uses DST knows about these considerations. But I think it's worthwhile to list them out to help folks who do not know DST to build an intuition for what it's excellent at.

Further reading:

Delightful, production-grade replication for Postgres

2024-07-30 08:00:00

This is an external post of mine.

A reawakening of systems programming meetups

2024-07-07 08:00:00

This year has seen a resurgence in really high quality systems programming meetups. Munich Database Meetup, Berlin Systems Group, SF Distributed Systems Meetup, NYC Systems, Bengaluru Systems, to name a few.

This post summarizes a bit of disappointing recent tech meetup history, the new trend of excellent systems programming meetups, and ends with some encouragement and guidance for running your own systems programming events.

I will be a little critical in this post but I want to preface by saying: organizing meetups is really tough! It takes a lot of work and I have a huge amount of respect for meetup organizers even when their meetup style did not resonate with me.

Although much of this post talks about NYC Systems, the reason I think this post is worth writing is because so many other meetups in a similar vein popped up. I hope to encourage these other meetups and to encourage folks in other major metros (London, for example) to start similar meetups.

Meetups

I used to attend a bunch of meetups before the pandemic. But I quickly got disillusioned. Almost every meetup was varying degrees of startups pitching their product. The last straw for me was sitting through a talk at a JavaScript meetup that was by a devrel employee of a startup who literally gave a tutorial for their product.

There were also some pretty intelligent meetups like the New York Haskell Users Group and the New York Emacs Meetup. But since I wasn't an expert in either domain, and the attendees almost all appeared to be experts, I didn't particularly enjoy going.

There were a couple of meetups that felt inclusive for various skill-levels of attendees yet still went into interesting depth. Specifically, New York Linux User Group and Papers We Love NYC.

These meetups were exceptional because they were language- and framework-agnostic: they would start broad to give you background, then go deep into a topic. Maybe you only understood 50% of what was covered. But you got exposed to something new from an expert in that domain.

Unfortunately, the pandemic happened and these two excellent meetups basically have not come back.

A couple of students in Munich

The pandemic ended and I tried a couple of meetups I thought might be better quality. Rust and Go. But they weren't much better than I remembered. People would give a high level talk and brush over all the interesting concepts.

I had been thinking of doing an in-person talk series since 2022.

But I was busy with TigerBeetle until December of 2023 when I was messaged on LinkedIn by Georg Kreuzmayr, a graduate student at Technical University of Munich (TUM).

Georg and his friends, fellow graduate students at TUM, started a database club: TUMuchData. We got to talking about opportunities for collaboration and I started feeling a bit embarrassed that a graduate student had more guts than I had to get back onto the meetup organizer wagon.

A week later, with assurance from Justin Jaffray that at least he would show up with me if no one else did, I started the NYC Systems Coffee Club to bring together folks in NYC interested in any topic of systems programming (e.g. compilers, databases, web browser internals, distributed systems, formal methods, etc.). To bring them together in a completely informal setting for coffee at 9am in a public space in midtown Manhattan.

I set up that linked web page and started collecting subscribers to the club via Google Form. Once a month I'd send an email out to the list asking for RSVPs to this month's coffee club. The first 20 to respond would get a calendar invite.

[Image: /assets/coffee-club-invite.png]

And about the same time I started asking around on Twitter/LinkedIn if someone would be interested in co-organizing a new systems programming meetup in NYC. Angelo Saraceno immediately took me up on the idea and we met up.

NYC Systems

We agreed on the premise: this would be a language- and framework-agnostic meetup that was focused on engineering challenges, not product pitches. It would be 100% for the sake of corporate marketing, but corporate marketing of the engineering team, not the product.

NYC Systems was born!

We'd find speakers who could start broad and dive deep into some interesting aspect of databases, programming languages, distributed systems, and so on. Product pitches were necessary to establish a context, but the focus of the talk would be about some interesting recent technical challenge and how they dealt with it.

We'd schedule talks only every other month to ease our own burden in organizing and finding great speakers.

Once Angelo and I had decided to go forward, the next two challenges were finding speakers and finding a venue. Thanks to Twitter and LinkedIn, finding speakers turned out to be the easy part.

It was harder to find a venue. It was surprisingly challenging to find a company in NYC with a shared vision that the important thing about being associated with a meetup like this is to be associated with the quality of speakers and audience we can bring in by not allowing transparent product pitches.

Almost every company in Manhattan with space we spoke with had a requirement that they have their own speaker each night. That seemed like a bad idea.

I think it was especially challenging to find a company willing to relax about branding requirements like this because we were a new meetup.

It was pretty frustrating not to find a sympathetic company with space in Manhattan. And the only reason we didn't give up was because Angelo was so adamant that this kind of meetup actually happen. It's always best to start something new with someone else for this exact reason. You can keep each other going.

In the end we went with a company that did not insist on their own speaker or their own branding: Trail of Bits, a Brooklyn-based company whose CEO immediately got in touch with me to say they wanted to host us.

How it works

To keep things easy, I set up a web page on my personal site with information about the meetup. (Eventually we moved this to nycsystems.xyz.) I set up a Google Form to collect emails for a mailing list. And we started posting about the group on Twitter and LinkedIn.

We published the event calendar in advance (an HTML table on the website) and announced each event's speakers a week in advance of the event. I'd send another Google Form to the mailing list taking RSVPs for the night. The first 60 people to respond got a Google Calendar invite.

[Image: /assets/nyc-systems.png]

It's a bit of work, sure, but I'd do anything to avoid Meetup.com.

It is interesting to see every new systems programming meetup also not pick Meetup.com. The only one that went with it, Munich Database Meetup, is a revival of an existing group, the Munich NoSQL Meetup, and presumably they didn't want to give up their subscribers. Most of the others use lu.ma.

The mailing list is now about 400+ people. And for each event RSVP we have a wait list of 20-30 people. Of course, although 60 people say yes initially, by the time of the event we typically get about 50 people in attendance.

At each event, Trail of Bits provided screens, chairs, food, and drink. Angelo had recording equipment so he took over audio/video capturing (and later editing and publishing).

After each event we'd publish talk videos to our @NYCSystems YouTube.

Network effects

In March 2024, the TUMuchData folks joined Alex Petrov's Munich NoSQL Meetup to form the Munich Database Meetup. In May, Kaivalya Apte and Manish Gill started the Berlin Systems Group, inspired by Alex and the Munich Database Meetup.

In May 2024, two PhD students in the San Francisco Bay Area, Shadaj Laddad and Conor Power, started the SF Distributed Systems meetup.

And in July 2024, Shraddha Agrawal, Anirudh Rowjee and friends kicked off the first Bengaluru Systems Meetup.

Suggestions

First off, don't pay for anything yourself. Find a company who will host. At the same time, don't feel the need to give in too much to the demands of the company. I'd be happy to help you think through how to talk about the event with companies. It is mutually beneficial for them to get to give a 5-minute hiring/product pitch and not need to do extensive branding nor to give a 30-minute product tutorial.

Second, keep a bit of pressure on speakers to not do an overview talk and not to do a product pitch. Suggest that they tell the story of some interesting recent bug or interesting recent feature. What happened? Why was it hard? What did you learn?

Focusing on these types of talks will help you get a really interesting audience.

I have been continuously surprised and impressed at the folks who show up for NYC Systems. It's a mix of technical founders in the systems space, pretty experienced developers in the systems space, graduate students, and developers of all sorts.

I am certain we can only get these kinds of folks to show up because we avoid product pitch-type talks.

Third, finding speakers is still hard! The best approach so far has been to individually message folks in industry and academia who hang out on Twitter. Sending out a public call is easy but doesn't often pan out. So keep an eye on interesting companies in the area.

Another avenue I've been thinking about is messaging VC connections to ask them if they know any engineers/technical founders/CTOs in the area who could give an interesting technical talk.

Fourth, speak with other organizers! I finally met Alex Petrov in person last month and we had a great time talking about the challenges and joys of organizing really high quality meetups.

I'm always happy to chat, DMs are open.

A write-ahead log is not a universal part of durability

2024-07-01 08:00:00

A database does not need a write-ahead log (WAL) to achieve durability. A database can write its long-term data structure durably to disk before returning to a client. Granted, this is a bad idea! And granted, a WAL is critical for durability by design in most databases. But I think it's helpful to understand WALs by understanding what you could do without them.

So let's look at what terrible design we can make for a durable database that has no write-ahead log. To motivate the idea of, and build an intuition for, a write-ahead log.

Thank you to Alex Miller for reviewing a version of this post.

But first, what is durability?

Durability

Durability happens in the context of a request a client makes to a data system (either an embedded system like SQLite or RocksDB or a standalone system like Postgres). Durability is a spectrum of guarantees the server provides when a client requests to write some data: that either the request succeeds and the data is safely written to disk, or the request fails and the client must retry or decide to do something else.

It can be difficult to set an absolute definition for durability since different databases have different concepts of what can go wrong with disks (also called a "storage fault model"), or they have no concept at all.

Let's start from the beginning.

An in-memory database

An in-memory database has no durability at all. Here is pseudo-code for an in-memory database service.

db = btree()

def handle_write(req):
  db.update(req.key, req.value)
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

Throughout this post, for the sake of code brevity, imagine that the environment is concurrent and that data races around shared mutable values like db are protected somehow.

Writing to disk

If we want to achieve the most basic level of durability, we can write this database to a file.

f = open("kv.db")
db = btree.init_from_disk(f)

def handle_write(req):
  db.update(req.key, req.value)
  db.write_to_disk(f)
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

btree.write_to_disk will call pwrite(2) under the hood. And we'll assume it does copy-on-write for only changed pages. So imagine we have a large database represented by a btree that takes up 10GiB on disk. With the btree algorithm, if we write a single entry to the btree, often only a single (often 4KiB) page will get written rather than all pages (holding all values) in the tree. At the same time, in the worst case, the entire tree (all 10GiB of data) may need to get rewritten.
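
As a rough sketch of what flushing only changed pages might look like (a hypothetical page cache in Python, not the btree implementation the pseudocode assumes):

import os

PAGE_SIZE = 4096

def write_dirty_pages(fd, pages, dirty_page_numbers):
  # fd is a raw file descriptor (e.g. from os.open). pages maps a page
  # number to its PAGE_SIZE bytes; only pages marked dirty get written.
  for page_no in sorted(dirty_page_numbers):
    os.pwrite(fd, pages[page_no], page_no * PAGE_SIZE)  # write at the page's offset
  dirty_page_numbers.clear()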

But this code isn't crash-safe. If the virtual or physical machine this code is running on reboots, the data we wrote to the file may not actually be on disk.

fsync

File data is buffered by the operating system by default. By general consensus, writing data without flushing the operating system buffer is not considered durable. Every so often a new database will show up on Hacker News claiming to beat all other databases on insert speed until a commenter points out the new database doesn't actually flush data to disk.

In other words, the commonly accepted requirement for durability is that not only do you write data to a file on disk but you fsync(2) the file you wrote. This forces the operating system to flush to disk any data it has buffered.

f = open("kv.db")
db = btree.init_from_disk(f)

def handle_write(req):
  db.update(req.key, req.value)
  db.write_to_disk(f)
  f.fsync() # Force a flush
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

Furthermore you must not ignore fsync failure. How you deal with fsync failure is up to you, but exiting immediately with a message that the user should restore from a backup is sometimes considered acceptable.
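
A minimal sketch of that policy in Python: after a failed fsync the state of the operating system's page cache is unknown, so we don't retry, we bail.

import os
import sys

def fsync_or_die(fd):
  try:
    os.fsync(fd)
  except OSError as e:
    # The data may or may not be on disk. Don't retry; don't keep serving.
    sys.exit(f"fsync failed ({e}); restore from backup")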

Databases don't like to fsync because it's slow. Many major databases offer modes where they do not fsync data files before returning a success to a client. Postgres offers this unsafe mode, though it does not default to it and warns against it. MongoDB offers this unsafe mode but does not default to it.

An earlier version of this post said that MongoDB would unsafely flush on an interval. Daniel Gomez Ferro from MongoDB messaged me that while the docs are confusing, the default write concern "majority" does actually imply "j: true" which means data is synchronized (i.e. fsync-ed) before returning a success to a client.

Almost every database trades safety for performance in some regard. For example, few databases other than SQLite and Cockroach default to Serializable Isolation, even though it is commonly agreed that basically no isolation level below Serializable (which all other databases default to) can be reasoned about. Other databases offer Serializable Isolation; they just don't default to it, because it can be slow.

Group commit

But let's get back to fsync. One way to amortize the cost of fsync is to delay requests so that you write data from each of them and then fsync the data from all requests. This is sometimes called group commit.

For example, we could update the database in-memory but have a background thread serialize to disk and call fsync only every 5ms.

f = open("kv.db")
db = btree.init_from_disk(f)

group_commit_sems = []

@background_worker()
def group_commit():
  for:
    if clock() % 5ms == 0:
      db.write_to_disk(f)
      f.fsync() # Durably flush for the group
      for sem in group_commit_sems:
        sem.signal()
      group_commit_sems = [] # Reset so requests are not signaled twice

def handle_write(req):
  db.update(req.key, req.value)
  sem = semaphore()
  group_commit_sems.push(sem)
  sem.wait()
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

It is critical that handle_write waits to return a success until the write is durable via fsync.

So to reiterate, the key idea for durability of a client request is that you have some version of the client message stored on disk durably with fsync before returning a success to a client.

From now on in this post, when you see "durable" or "durability", it means that the data has been written and fsync-ed to disk.

Optimizing durable writes

A key insight is that it's silly to serialize the entire permanent structure of the database to disk every time a user writes.

We could just write the user's message itself to an append-only log. And then only periodically write the entire btree to disk. So long as we have fsync-ed the append-only log file, we can safely return to the user even if the btree itself has not yet been written to disk.

The additional logic this requires is that on startup we must read the btree from disk and then replay the log on top of the btree.

f = open("kv.db", "rw")
db = btree.init_from_disk(f)

log_f = open("kv.log", "rw")
l = log.init_from_disk(log_f)
for log in l.read_logs_from(db.last_log_index):
  db.update(log.key, log.value)

group_commit_sems = []

@background_worker()
def group_commit():
  for:
    if clock() % 5ms == 0:
      log_accumulator = log_page()
      for (log, _) in group_commit_sems:
        log_accumulator.add(log)

      log_f.write(log_accumulator.page()) # Write out all log entries at once
      log_f.fsync() # Durably flush wal data
      for (_, sem) in group_commit_sems:
        sem.signal()
      group_commit_sems = [] # Reset so log entries are not rewritten

    if clock() % 1m == 0:
      db.write_to_disk(f)
      f.fsync() # Durably flush db data

def handle_write(req):
  db.update(req.key, req.value)
  sem = semaphore()
  log = req
  group_commit_sems.push((log, sem))
  sem.wait() # This time waiting for only the log to be written and flushed, not the btree.
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}

This is a write-ahead log!

Consider a few scenarios. One request writes the smallest key ever seen. And one request within the same millisecond writes the largest key ever seen. Writing these to disk on the btree means modifying at least two pages spread out in space on disk.

But if we only have to durably write these two messages to a log, they can likely both be included in the same log page. ("Likely" so long as keys and values are small enough that multiple can fit into the same page.)

That is, it's cheaper to write only these small messages representing the client request to disk. And we save the structured btree persistence for a less frequent durable write.

Filesystem and disk bugs

Sometimes filesystems will write data to the wrong place. Sometimes disks corrupt data. A solution to both of these is to checksum the data on write, store the checksum on disk, and confirm the checksum on read. This, combined with a background process called scrubbing to validate unread data, can help you learn quickly when your data has been corrupted and you must recover from backup.
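
Here is a minimal sketch of per-page checksums in Python, assuming 4KiB pages with the last 4 bytes reserved for a CRC32 of the page's contents:

import os
import zlib

PAGE_SIZE = 4096
CHECKSUM_SIZE = 4

def write_page(fd, page_no, payload):
  assert len(payload) == PAGE_SIZE - CHECKSUM_SIZE
  checksum = zlib.crc32(payload).to_bytes(CHECKSUM_SIZE, "little")
  os.pwrite(fd, payload + checksum, page_no * PAGE_SIZE)

def read_page(fd, page_no):
  page = os.pread(fd, PAGE_SIZE, page_no * PAGE_SIZE)
  payload, stored = page[:-CHECKSUM_SIZE], page[-CHECKSUM_SIZE:]
  if zlib.crc32(payload).to_bytes(CHECKSUM_SIZE, "little") != stored:
    raise IOError(f"page {page_no} failed checksum; restore from backup")
  return payload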

MongoDB's default storage engine WiredTiger does checksum data by default.

But some databases famous for integrity do not. Postgres does no data checksumming by default:

By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster. When enabled, each data page includes a checksum that is updated when the page is written and verified each time the page is read. Only data pages are protected by checksums; internal data structures and temporary files are not.

SQLite likewise does no checksumming by default. Checksumming is an optional extension:

The checksum VFS extension is a VFS shim that adds an 8-byte checksum to the end of every page in an SQLite database. The checksum is added as each page is written and verified as each page is read. The checksum is intended to help detect database corruption caused by random bit-flips in the mass storage device.

But even this isn't perfect. Disks and nodes can fail completely. At that point you can only improve durability by introducing redundancy across disks (and/or nodes), for example, via distributed consensus.

Other reasons you need a WAL?

Some databases (like SQLite) require a write-ahead log to implement aspects of ACID transactions. But this need not be a requirement for ACID transactions if you do MVCC (SQLite does not). See my previous post on implementing MVCC for details.

Logical replication (also called change data capture, or CDC) is another interesting feature that requires a write-ahead log. The idea is that the log already preserves the exact order of the changes that affect the database's "state machine". So we could copy these changes out of the system by tailing the write-ahead log, preserving change order, and applying these changes to a foreign system.
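
A minimal sketch of that idea, reusing the hypothetical read_logs_from helper from the pseudocode above: a replicator tails the log in order, applies each change to the foreign system, and tracks its position so it can resume.

def replicate(log, foreign_system, from_index):
  # Log entries come back in write order, preserving the state machine's history.
  for entry in log.read_logs_from(from_index):
    foreign_system.apply(entry.key, entry.value)
    from_index = entry.index  # remember progress so replication can resume here
  return from_index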

But again, just CDC is not about durability. It's an ancillary feature that write-ahead logs make simple.

Conclusion

A few key points. First, durability primarily matters if it is established before returning a success to the client. Second, a write-ahead log is a cheap way to get durability.

And finally, durability is a spectrum. You need to read the docs for your database to understand what it does and does not do.

The limitations of LLMs, or why are we doing RAG?

2024-06-17 08:00:00

This is an external post of mine.

Confusion is a muse

2024-06-14 08:00:00

Some of the most interesting technical blog posts I read come from confusion, and confusion is a common reason for the posts I write. You're at work and you start asking questions that are difficult to answer. You spend a few hours or a day trying to get to the bottom of things.

If you ask a question to very experienced and successful developers at work, they have a tendency not to give context and to simplify things down to a single answer. This may be a good way to make business decisions. (One can't afford to waste an eternity considering everything indefinitely.) But accepting an answer you don't understand is actively harmful for building intuition.

Certainly, sometimes not accepting an answer can be irritating. You'll have to figure that out.

But beyond "go along to get along", another reason we don't pursue what we're confused about is because we're embarrassed that we're confused in the first place. What's worse, the embarrassment we feel naturally grows the more experienced we get. "I've got this job title, I don't want to seem like I don't know what you mean."

But if you fight the embarrassment and pursue your confusion regardless, you'll likely figure something very interesting out. Moreover, you will probably not have been the only person who was confused. At least personally it is quite rare that I am confused about something and no one else is.

So pay attention when you get confused, and consider why it happened. What did you expect to be the case, and how did reality differ? Explore the angles and the options. When you finally understand, think about what led you to that understanding.

Write it down. Put it into an internal Markdown doc, an internal Atlassian doc, an internal Google Slides page, whatever. The medium doesn't matter.

This entire process doesn't come easily. We feel embarrassed. We aren't used to lingering on something we're confused by. We aren't used to writing things down.

But if you can make yourself pause every once in a while and think about what you (or someone around you) got confused by, and if you can force yourself to stop getting embarrassed by what you got confused by, and if you can write down the background and the reasoning that led to your ultimate understanding, you're going to have something pretty interesting to talk about.

You'll contribute to the growth and intuition of your colleagues. And you'll never run out of things to write about.

How I run a software book club

2024-05-30 08:00:00

I've been running software book clubs almost continuously since last summer, about 12 months ago. We read through Designing Data-Intensive Applications, Database Internals, Systems Performance, and we just started Understanding Software Dynamics.

The DDIA discussions were in-person in NYC with about 5-8 consistent attendees. The rest have been over email with 300, 500, and 600 attendees.

This post is for folks who are interested in running their own book club. None of these ideas are novel. I co-opted the best parts I saw from other people running similar things. And hopefully you'll improve on my experience too, should you try.

Despite the length of this post, running a book club takes almost no noticeable effort, other than when I need to select and confirm discussion leaders. That limited effort is what I have to thank for keeping up the book clubs so consistently.

Google Groups

I run the virtual book clubs over email. I create a Google Group and tell people to send me their email for an invite. I use a Google Form to collect emails since I get many. If you're doing a small group book club you can just collect member emails directly.

In the Google Form I ask people to volunteer to lead discussion for a chapter (or chapters). And I ask for a Twitter/GitHub/LinkedIn account.

When I've gotten enough responses I go through the list and check Twitter/GitHub/LinkedIn info to find people who might have a particularly interesting perspective to lead a discussion.

"Lead a discussion" sounds formal but I mean anything but. All I am looking for is someone to start a new Google Group thread each week and for them to share their thoughts.

For example a discussion leader might share:

- Their favorite parts of the chapter
- Parts they disagreed with or found confusing
- Related posts, papers, or projects worth a look

The "discussion leader" has no responsibility for remaining in the discussion after posting the thread. There just isn't an easy way to say "person who kicks off discussion" than to call them a "discussion leader".

By the way, I didn't do discussion leaders for the first book club, reading DDIA. And that book club took noticeably more effort. Because I organized it, I was effectively the discussion leader every time. Having discussion leaders disperses the effort of the book club. And I think it makes the club much more interesting.

SparkNotes-ification

One thing I noticed happening often was that the discussion leader might do a large summary of the chapter. I greatly appreciate and respect that effort, but I think this is not the ideal thing to happen. Of course you can't control what people do and maybe they really wanted to write a summary. But since noticing this happen I now try to discourage the discussion leader from summarizing since 1) it must be quite time-consuming and 2) it isn't as interesting as some of the above bullet points.

Confirming with leaders

When I have picked out folks who seem like they'd be fun discussion leaders, I bcc email them all asking them to confirm. At the same time I explain what being a discussion leader means, as I just explained it above.

Each week's discussion gets a new Google Group thread. Discussion happens in responses to the thread.

I ask the discussion leaders to create the new discussion thread between Friday and Saturday their local time.

For folks who don't confirm, I email them one last time and then if they still haven't confirmed I find someone new.

I always lead the first week's discussion so that the discussion leaders can see what I do and so that I can establish the pattern.

Managing leaders

It takes a while to read a book. Sometimes the leaders forget to do their part. If it gets to be Sunday and the discussion leader for the week hasn't started discussion, I email them to gently ask if they are still available to kick off discussion. And if they are not, no worries, I can step in.

I have had to step in a few times to start discussion and it's no problem.

Managing non-leaders

Just as you need to clarify and set expectations for discussion leaders, you need to clarify and set expectations for everyone else.

When I invite people to the Google Group I typically also create an Intro thread where I explain the discussion format.

An annoying aspect of Google Groups is that I cannot limit who can create a thread without limiting who can respond to a thread.

It would simplify things for me if I could limit thread creation to discussion leaders. But since I cannot, I try to repeatedly and explicitly mention in the Intro thread that no one should start a new discussion thread unless they are a discussion leader. And that new threads will come out each weekend to discuss the previous chapter.

Setting the tone

One of the most important things to do in the Intro email is to set the tone. I try to clarify this is a friendly and encouraging group focused on learning and improving ourselves. We have experts in the group and we have noobs in the group and they are all welcome and will all come away with different things.

Why email?

Email seems to be the most time-friendly and demographic-friendly medium. Doing live discussion sounds stressful and difficult to schedule, although I believe Alex Petrov runs live discussions. Email forces you to slow down and think things through. And email is built around an inbox. If you didn't get to read some discussion, you can mark it unread. You can't do that in Discord or Slack.

Avoiding long-term commitments

When I pick a book, aside from picking books I think are likely to be exceptionally well-written, I try to avoid books that we could not finish within 3 months. It concerns me to try to get people to commit to something longer than that.

This has led to some distortion though. Systems Performance has only 16 chapters. One chapter a week means about 3 months in total. But each chapter is 100 pages long.

I was hesitant to do a reading of Understanding Software Dynamics because it has 28 chapters. But each chapter is only 10-15 pages long. So when I decided to go with it, I decided we'd read 2 chapters a week. Each discussion leader is responsible for 2 chapters at a time. That means we can finish within 3 months. And each week we read only 20-30 pages, which is still much more doable than 100 pages of Systems Performance.

On the other hand, we did make it through Systems Performance! Which gives me confidence to pick other books that are physically daunting, should they otherwise seem like a good idea.

A book ends

Many public book clubs go through a book a month and have no ending. That is totally fair. But what I love about the way I organize book clubs is that each reading is unrelated to the next. It's an entirely new signup for each book. You need only "commit" (I mean, you can drop off whenever and definitely people do) to a 3-month reading and then you can justly feel good about yourself and join again in the future or not.

In contrast a paper reading club has no obvious ending, unless you pick all the papers in advance and organize them around a school year or something. This has made running a paper reading club feel more concerning to me. Though I greatly appreciate folks like Aleksey Charapko and Murat Demirbas who do.

Most people don't actively contribute, but they still value it

In a group of 500 people, maybe 1-2% of those people actively contribute to discussion. 5-10 people. But I often hear from people who didn't participate that they still highly valued the group. And this high percentage of non-active-participants is part of why I keep allowing the group size to grow. There's little work I have to do and a bunch of people benefit.

Doing it at your company likely won't go well

I wrote about this before. For some reason it's hard to get people who would otherwise join an external reading club to join a company-internal reading club.

Though perhaps I'm just doing it wrong because I hear of others like Elizabeth Garrett Christensen who run an internal software book club successfully.

Good luck, have fun!

That's all I've got. Send me questions if you've got any. But mostly, just give it a shot if you want to and you'll learn!

And if you still don't get it, you can of course just join one of my book clubs. :)

Implementing MVCC and major SQL transaction isolation levels

2024-05-16 08:00:00

In this post we'll build a database in 400 lines of code with basic support for five standard SQL transaction isolation levels: Read Uncommitted, Read Committed, Repeatable Read, Snapshot Isolation and Serializable. We'll use multi-version concurrency control (MVCC) and optimistic concurrency control (OCC) to accomplish this. The goal isn't to be perfect but to explain the basics in a minimal way.

You don't need to know what these terms mean in advance. I did not understand them before doing this project. But if you've ever dealt with SQL databases, transaction isolation levels are likely one of the dark corners you either 1) weren't aware of or 2) wanted not to think about. At least, this is how I felt.

While there are many blog posts that list out isolation levels, I haven't been able to internalize their lessons. So I built this little database to demonstrate the common isolation levels for myself. It turned out to be simpler than I expected, and made the isolation levels much easier to reason about.

Thank you to Justin Jaffray, Alex Miller, Sujay Jayakar, Peter Veentjer, and Michael Gasch for providing feedback and suggestions.

All code is available on GitHub.

Why do we need transaction isolation?

If you already know the answer, feel free to skip this section.

When I first started working with databases in CRUD applications, I did not understand the point of transactions. I was fairly certain that transactions are locks. I was wrong about that, but more on that later.

I can't remember the exact code I wrote, but here's something I could have written:

with database.transaction() as t:
  users = t.query("SELECT * FROM users WHERE group = 'admin';")
  ids = []
  for user in users:
    if some_complex_logic(user):
      ids.append(user.id)

  t.query("UPDATE users SET metadata = 'some value' WHERE id IN ($1)';", ids)

I would have thought that all users seen by the initial SELECT who matched the some_complex_logic filter would be exactly the same users updated by my second SQL statement.

And if I were using SQLite, my guess would have been correct. But if I were using MySQL or Postgres or Oracle or SQL Server, and hadn't made any changes to defaults, that wouldn't necessarily be true! We'll discover exactly why throughout this post.

For example, some other connection and transaction could have set a user's group to admin after the initial SELECT was executed. It would then be missed from the some_complex_logic check and from the subsequent UPDATE.

Or, again after our initial SELECT, some other connection could have modified the group for some user that previously was admin. It would then be incorrectly part of the second UPDATE statement.

These are just a few examples of what could go wrong.

This is the realm of transaction isolation. How do multiple transactions running at the same time, interacting with the same data, interact with each other?

The answer is: it depends. The SQL standard itself loosely prescribes four isolation levels. But every database implements these four levels slightly differently. Sometimes using entirely different algorithms. And even among the standard levels, the default isolation level for each database differs.

The funky bugs that can show up across databases and isolation levels, often dependent on particular details of how isolation levels are commonly implemented, are called "anomalies". Examples include intimidating terms like "dirty reads", "write cycles", and G2-Item.

The topic is so complex that we've got decades of research papers critiquing SQL isolation levels, categorization of common isolation anomalies, walkthroughs of anomalies by Martin Kleppmann in Designing Data-Intensive Applications, Martin Kleppman's Hermitage project documenting common anomalies across isolation levels in major databases, and the ACIDRain paper showing isolation-related bugs in major open-source ecommerce projects.

These aren't just random links. They're each quite interesting. And particularly for practitioners who don't know why they should care, check out Designing Data-Intensive Applications and the last link on ACIDRain.

And this is only a small list of some of the most interesting research and writing on the topic.

So there's a wide variety of things to consider:

Transaction isolation levels are basically vibes. The only truth for real projects is Martin Kleppmann's Hermitage project that catalogs behavior across databases. And a truth some people align with is Generalized Isolation Level Definitions.

So while all these linked works above are authoritative, and even though we can see that there might be some anomalies we have to worry about, the research can still be difficult to internalize. And many developers, my recent self included, do not have a great understanding of isolation levels.

Throughout this post we'll stick to informal definitions of isolation levels to keep things simple.

Let's dig in.

Locks? MVCC?

Historically, databases implemented isolation with locking algorithms such as Two-Phase Locking (not the same thing as Two-Phase Commit). Multi-version concurrency control (MVCC) is an approach that lets us completely avoid locks.

It's worth noting that while we will get away without locks (implementing what is called optimistic concurrency control, or OCC), most MVCC databases do still use locks for certain operations (implementing what is called pessimistic concurrency control).

But this is the story of databases in general. There are numerous ways to implement things.

We will take the simpler lockless route.

Consider a key-value database. With MVCC, rather than storing only the value for a key, we would store versions of the value. The version includes the transaction id (a monotonic incrementing integer) wherein the version was created, and the transaction id wherein the version was deleted.

With MVCC, it is possible to express transaction isolation levels almost solely as a set of different visibility rules for a version of a value; rules that vary by isolation level.

So we will build up a general framework first, then discuss and implement each isolation level.

Scaffolding

We'll build an in-memory key-value system that acts on transactions. I usually try to stick with only the standard library for projects like this but I really wanted a sorted data structure and Go doesn't implement one.

In main.go, let's set up basic helpers for assertions and debugging:

package main

import (
        "fmt"
        "os"
        "slices"

        "github.com/tidwall/btree"
)

func assert(b bool, msg string) {
        if !b {
                panic(msg)
        }
}

func assertEq[C comparable](a C, b C, prefix string) {
        if a != b {
                panic(fmt.Sprintf("%s '%v' != '%v'", prefix, a, b))
        }
}

var DEBUG = slices.Contains(os.Args, "--debug")

func debug(a ...any) {
        if !DEBUG {
                return
        }

        args := append([]any{"[DEBUG]"}, a...)
        fmt.Println(args...)
}

As mentioned previously, a value in the database will be defined with start and end transaction ids.

type Value struct {
        txStartId uint64
        txEndId   uint64
        value     string
}
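
For intuition, here is a hypothetical version list for one key, assuming transaction 1 set it and transaction 3 later overwrote it. This literal is just an illustration, not part of the code:

versions := []Value{
        {txStartId: 1, txEndId: 3, value: "hey"},  // superseded by transaction 3
        {txStartId: 3, txEndId: 0, value: "yall"}, // current version; txEndId 0 means not deleted
}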

Every transaction will be in an in-progress, aborted, or committed state.

type TransactionState uint8
const (
        InProgressTransaction TransactionState = iota
        AbortedTransaction
        CommittedTransaction
)

And we'll support a few major isolation levels.

// Loosest isolation at the top, strictest isolation at the bottom.
type IsolationLevel uint8
const (
        ReadUncommittedIsolation IsolationLevel = iota
        ReadCommittedIsolation
        RepeatableReadIsolation
        SnapshotIsolation
        SerializableIsolation
)

We'll get into detail about the meaning of the levels later.

A transaction has an isolation level, an id (monotonic increasing integer), and a current state. And although we won't make use of this data yet, transactions at stricter isolation levels will need some extra info. Specifically, stricter isolation levels need to know about other transactions that were in-progress when this one started. And stricter isolation levels need to know about all keys read and written by a transaction.

type Transaction struct {
        isolation  IsolationLevel
        id         uint64
        state      TransactionState

        // Used only by Repeatable Read and stricter.
        inprogress btree.Set[uint64]

        // Used only by Snapshot Isolation and stricter.
        writeset   btree.Set[string]
        readset    btree.Set[string]
}

We'll discuss why later.

Finally, the database itself will have a default isolation level that each transaction will inherit (for our own convenience in tests).

The database will have a mapping of keys to an array of value versions. Later elements in the array will represent newer versions of a value.

The database will also store the next free transaction id it will use to assign ids to new transactions.

type Database struct {
        defaultIsolation  IsolationLevel
        store             map[string][]Value
        transactions      btree.Map[uint64, Transaction]
        nextTransactionId uint64
}

func newDatabase() Database {
        return Database{
                defaultIsolation:  ReadCommittedIsolation,
                store:             map[string][]Value{},
                // The `0` transaction id will be used to mean that
                // the id was not set. So all valid transaction ids
                // must start at 1.
                nextTransactionId: 1,
        }
}

To be thread-safe: store, transactions, and nextTransactionId should be guarded by a mutex. But to keep the code small, this post will not use goroutines and thus does not need mutexes.
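
If you did want goroutines, a minimal sketch of that guarding might look like the following. The mu field is my own addition for illustration:

type Database struct {
        mu                sync.Mutex // guards the fields below
        defaultIsolation  IsolationLevel
        store             map[string][]Value
        transactions      btree.Map[uint64, Transaction]
        nextTransactionId uint64
}

// Every method that touches this state would then begin with:
//
//     d.mu.Lock()
//     defer d.mu.Unlock()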

There's a bit of book-keeping when creating a transaction, so we'll make a dedicated method for this. We must give the new transaction an id, store all in-progress transactions, and add it to the database transaction history.

func (d *Database) inprogress() btree.Set[uint64] {
        var ids btree.Set[uint64]
        iter := d.transactions.Iter()
        for ok := iter.First(); ok; ok = iter.Next() {
                if iter.Value().state == InProgressTransaction {
                        ids.Insert(iter.Key())
                }
        }
        return ids
}

func (d *Database) newTransaction() *Transaction {
        t := Transaction{}
        t.isolation = d.defaultIsolation
        t.state = InProgressTransaction

        // Assign and increment transaction id.
        t.id = d.nextTransactionId
        d.nextTransactionId++

        // Store all inprogress transaction ids.
        t.inprogress = d.inprogress()

        // Add this transaction to history.
        d.transactions.Set(t.id, t)

        debug("starting transaction", t.id)

        return &t
}

And we'll add a few more helpers for completing a transaction, for fetching a transaction by id, and for validating a transaction.

func (d *Database) completeTransaction(t *Transaction, state TransactionState) error {
        debug("completing transaction ", t.id)

        // Update transactions.
        t.state = state
        d.transactions.Set(t.id, *t)

        return nil
}

func (d *Database) transactionState(txId uint64) Transaction {
        t, ok := d.transactions.Get(txId)
        assert(ok, "valid transaction")
        return t
}

func (d *Database) assertValidTransaction(t *Transaction) {
        assert(t.id > 0, "valid id")
        assert(d.transactionState(t.id).state == InProgressTransaction, "in progress")
}

The final bit of scaffolding we'll set up is an abstraction for database connections. A connection will have at most one associated transaction. Users must ask the database for a new connection. Then within the connection they can manage a transaction.

type Connection struct {
        tx *Transaction
        db *Database
}

func (c *Connection) execCommand(command string, args []string) (string, error) {
        debug(command, args)

        // TODO
        return "", fmt.Errorf("unimplemented")
}

func (c *Connection) mustExecCommand(cmd string, args []string) string {
        res, err := c.execCommand(cmd, args)
        assertEq(err, nil, "unexpected error")
        return res
}

func (d *Database) newConnection() *Connection {
        return &Connection{
                db: d,
                tx: nil,
        }
}

func main() {
        panic("unimplemented")
}

And that's it for scaffolding. Now set up the go module and make sure this builds.

$ go mod init gomvcc
go: creating new go.mod: module gomvcc
go: to add module requirements and sums:
        go mod tidy
$ go mod tidy
go: finding module for package github.com/tidwall/btree
go: found github.com/tidwall/btree in github.com/tidwall/btree v1.7.0
$ go build
$ ./gomvcc
panic: unimplemented

goroutine 1 [running]:
main.main()
        /Users/phil/tmp/main.go:166 +0x2c

Great!

Transaction management

When the user asks to begin a transaction, we ask the database for a new transaction and assign it to the current connection.

 func (c *Connection) execCommand(command string, args []string) (string, error) {
         debug(command, args)

+       if command == "begin" {
+               assertEq(c.tx, nil, "no running transactions")
+               c.tx = c.db.newTransaction()
+               c.db.assertValidTransaction(c.tx)
+               return fmt.Sprintf("%d", c.tx.id), nil
+       }
+
         // TODO
         return "", fmt.Errorf("unimplemented")
 }

To abort a transaction, we call the completeTransaction method (which makes sure the database transaction history gets updated) with the AbortedTransaction state.

                return fmt.Sprintf("%d", c.tx.id), nil
        }

+       if command == "abort" {
+               c.db.assertValidTransaction(c.tx)
+               err := c.db.completeTransaction(c.tx, AbortedTransaction)
+               c.tx = nil
+               return "", err
+       }
+
         // TODO
         return "", fmt.Errorf("unimplemented")
 }

Committing a transaction is similar.

                return "", err
        }

+       if command == "commit" {
+               c.db.assertValidTransaction(c.tx)
+               err := c.db.completeTransaction(c.tx, CommittedTransaction)
+               c.tx = nil
+               return "", err
+       }
+
         // TODO
         return "", fmt.Errorf("unimplemented")
 }

The neat thing about MVCC is that beginning, committing, and aborting a transaction is metadata work. Committing a transaction will get a bit more complex when we add support for Snapshot Isolation and Serializable Isolation, but we'll get to that later. Even then, it will not involve modifying any values we get, set, or delete.

Get, set, delete

Here is where things get fun. As mentioned earlier, the key-value store is actually map[string][]Value, with the more recent versions of a value at the end of the list of values for the key.

For get support, we'll iterate the list of value versions backwards for the key. And we'll call a special new isvisible method to determine if this transaction can see this value. The first value that passes the isvisible test is the correct value for the transaction.

                return "", err
        }

+       if command == "get" {
+               c.db.assertValidTransaction(c.tx)
+
+               key := args[0]
+
+               c.tx.readset.Insert(key)
+
+               for i := len(c.db.store[key]) - 1; i >= 0; i-- {
+                       value := c.db.store[key][i]
+                       debug(value, c.tx, c.db.isvisible(c.tx, value))
+                       if c.db.isvisible(c.tx, value) {
+                               return value.value, nil
+                       }
+               }
+
+               return "", fmt.Errorf("cannot get key that does not exist")
+       }
+
         // TODO
         return "", fmt.Errorf("unimplemented")
 }

I snuck in tracking which keys are read, and we'll also soon sneak in tracking which keys are written. This is necessary in stricter isolation levels. More on that later.

set and delete are similar to get. But this time when we walk the list of value versions, we will set the txEndId for the value to the current transaction id if the value version is visible to this transaction.

Then, for set, we'll append to the value version list with the new version of the value that starts at this current transaction.

                return "", err
        }

+       if command == "set" || command == "delete" {
+               c.db.assertValidTransaction(c.tx)
+
+               key := args[0]
+
+               // Mark all visible versions as now invalid.
+               found := false
+               for i := len(c.db.store[key]) - 1; i >= 0; i-- {
+                       value := &c.db.store[key][i]
+                       debug(value, c.tx, c.db.isvisible(c.tx, *value))
+                       if c.db.isvisible(c.tx, *value) {
+                               value.txEndId = c.tx.id
+                               found = true
+                       }
+               }
+               if command == "delete" && !found {
+                       return "", fmt.Errorf("cannot delete key that does not exist")
+               }
+
+               c.tx.writeset.Insert(key)
+
+               // And add a new version if it's a set command.
+               if command == "set" {
+                       value := args[1]
+                       c.db.store[key] = append(c.db.store[key], Value{
+                               txStartId: c.tx.id,
+                               txEndId:   0,
+                               value:     value,
+                       })
+
+                       return value, nil
+               }
+
+               // Delete ok.
+               return "", nil
+       }
+
        if command == "get" {
                c.db.assertValidTransaction(c.tx)

This time rather than modifying the readset we modify the writeset for the transaction.

And that is how commands get executed!

Let's zoom in to the core of the problem we have mentioned but not implemented: MVCC visibility rules and how they differ by isolation levels.

Isolation levels and MVCC visibility rules

To varying degrees, transaction isolation levels prevent concurrent transactions from messing with each other. The loosest isolation levels barely prevent this at all.

Here is what the 1999 ANSI SQL standard (page 84) has to say.

[Figure: the isolation levels table from page 84 of the SQL-99 standard]

But as I mentioned at the beginning of the post, we're going to be a bit informal. And we'll mostly refer to Jepsen summaries of each isolation level.

Read Uncommitted

According to Jepsen, the loosest isolation level, Read Uncommitted, has almost no restrictions. We can merely read the most recent (non-deleted) version of a value, regardless of whether the transaction that set it has committed or aborted.

func (d *Database) isvisible(t *Transaction, value Value) bool {
        // Read Uncommitted means we simply read the last value
        // written. Even if the transaction that wrote this value has
        // not committed, and even if it has aborted.
        if t.isolation == ReadUncommittedIsolation {
                // We must merely make sure the value has not been
                // deleted.
                return value.txEndId == 0
        }

        assert(false, "unsupported isolation level")
        return false
}

Let's write a test that demonstrates this. We create two connections, c1 and c2, each with its own transaction, and set a key in c1. The value set for the key in c1 should be immediately visible if c2 asks for that key. In main_test.go:

package main

import (
        "testing"
)

func TestReadUncommitted(t *testing.T) {
        database := newDatabase()
        database.defaultIsolation = ReadUncommittedIsolation

        c1 := database.newConnection()
        c1.mustExecCommand("begin", nil)

        c2 := database.newConnection()
        c2.mustExecCommand("begin", nil)

        c1.mustExecCommand("set", []string{"x", "hey"})

        // Update is visible to self.
        res := c1.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c1 get x")

        // But since read uncommitted, also available to everyone else.
        res = c2.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c2 get x")

        // And if we delete, that should be respected.
        res = c1.mustExecCommand("delete", []string{"x"})
        assertEq(res, "", "c1 delete x")

        res, err := c1.execCommand("get", []string{"x"})
        assertEq(res, "", "c1 sees no x")
        assertEq(err.Error(), "cannot get key that does not exist", "c1 sees no x")

        res, err = c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 sees no x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 sees no x")
}

Thank you to @glaebhoerl for pointing out that in an earlier version of this post, Read Uncommitted incorrectly made deleted values visible.

That's pretty simple! But also pretty useless if your workload has conflicts. If you can arrange your workload in a way where you know no concurrent transactions will ever read or write conflicting keys though, this could be pretty efficient! The rules will only get more complex (and thus potentially more of a bottleneck) from here on.

But for the most part, people don't use this isolation level. SQLite, Yugabyte, Cockroach, and Postgres don't even implement it. It is also not the default for any major database that does implement it.

Let's get a little stricter.

Read Committed

We'll pull again from Jepsen:

Read committed is a consistency model which strengthens read uncommitted by preventing dirty reads: transactions are not allowed to observe writes from transactions which do not commit.

This sounds pretty simple. In isvisible we'll make sure that the value has a txStartId that is either this transaction or a transaction that has committed. Moreover we will now begin checking against txEndId to make sure the value wasn't deleted by any relevant transaction.

                return value.txEndId == 0
        }

+       // Read Committed means we are allowed to read any values that
+       // are committed at the point in time where we read.
+       if t.isolation == ReadCommittedIsolation {
+               // If the value was created by a transaction that is
+               // not committed, and not this current transaction,
+               // it's no good.
+               if value.txStartId != t.id &&
+                       d.transactionState(value.txStartId).state != CommittedTransaction {
+                       return false
+               }
+
+               // If the value was deleted in this transaction, it's no good.
+               if value.txEndId == t.id {
+                       return false
+               }
+
+               // Or if the value was deleted in some other committed
+               // transaction, it's no good.
+               if value.txEndId > 0 &&
+                       d.transactionState(value.txEndId).state == CommittedTransaction {
+                       return false
+               }
+
+               // Otherwise the value is good.
+               return true
+       }
+
        assert(false, "unsupported isolation level")
        return false
 }

This begins to look useful! We will never read a value that isn't part of a committed transaction (or isn't part of our own transaction). Indeed this is the default isolation level for many databases including Postgres, Yugabyte, Oracle, and SQL Server.

Let's add a test to main_test.go. This is a bit long, but give it a slow read. It is thoroughly commented.

func TestReadCommitted(t *testing.T) {
        database := newDatabase()
        database.defaultIsolation = ReadCommittedIsolation

        c1 := database.newConnection()
        c1.mustExecCommand("begin", nil)

        c2 := database.newConnection()
        c2.mustExecCommand("begin", nil)

        // Local change is visible locally.
        c1.mustExecCommand("set", []string{"x", "hey"})

        res := c1.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c1 get x")

        // Update not available to this transaction since this is not
        // committed.
        res, err := c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        c1.mustExecCommand("commit", nil)

        // Now that it's been committed, it's visible in c2.
        res = c2.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c2 get x")

        c3 := database.newConnection()
        c3.mustExecCommand("begin", nil)

        // Local change is visible locally.
        c3.mustExecCommand("set", []string{"x", "yall"})

        res = c3.mustExecCommand("get", []string{"x"})
        assertEq(res, "yall", "c3 get x")

        // But not on the other commit, again.
        res = c2.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c2 get x")

        c3.mustExecCommand("abort", nil)

        // And still not, if the other transaction aborted.
        res = c2.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c2 get x")

        // And if we delete it, it should show up deleted locally.
        c2.mustExecCommand("delete", []string{"x"})

        res, err = c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        c2.mustExecCommand("commit", nil)

        // It should also show up as deleted in new transactions now
        // that it has been committed.
        c4 := database.newConnection()
        c4.mustExecCommand("begin", nil)

        res, err = c4.execCommand("get", []string{"x"})
        assertEq(res, "", "c4 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c4 get x")
}

Again this seems great. However! You can easily get inconsistent data within a transaction at this isolation level. If transaction A has multiple statements, each statement can see different results, even if transaction A did not modify any data itself. Another transaction B may have committed changes between two statements in transaction A.
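
To see the anomaly concretely with the API we've built, here is a sketch of a non-repeatable read, where the same read in one transaction returns different results:

database := newDatabase() // defaults to ReadCommittedIsolation

c1 := database.newConnection()
c1.mustExecCommand("begin", nil)

c2 := database.newConnection()
c2.mustExecCommand("begin", nil)

// c1's first read: x does not exist yet.
_, err := c1.execCommand("get", []string{"x"})
assertEq(err.Error(), "cannot get key that does not exist", "c1 get x")

// c2 commits a write between two of c1's statements.
c2.mustExecCommand("set", []string{"x", "hey"})
c2.mustExecCommand("commit", nil)

// c1's second read of the same key now succeeds, even though c1
// itself changed nothing.
res := c1.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c1 get x")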

Let's get a little stricter.

Repeatable Read

Again as Jepsen says, Repeatable Read is the same as Read Committed but with the following anomaly not allowed (quoting from the ANSI SQL 1999 standard):

P2 (“Non-repeatable read”): SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.

To support this, we will add additional checks for the Read Committed logic that make sure the value was not created and not deleted within a transaction that started before this transaction started.

As it happens, this is the same logic that will be necessary for Snapshot Isolation and Serializable Isolation. The additional logic (that makes Snapshot Isolation and Serializable Isolation different) happens at commit time.

                return true
        }

-       assert(false, "unsupported isolation level")
-       return false
+       // Repeatable Read, Snapshot Isolation, and Serializable
+       // further restricts Read Committed so only versions from
+       // transactions that completed before this one started are
+       // visible.
+
+       // Snapshot Isolation and Serializable will do additional
+       // checks at commit time.
+       assert(t.isolation == RepeatableReadIsolation ||
+               t.isolation == SnapshotIsolation ||
+               t.isolation == SerializableIsolation, "invalid isolation level")
+       // Ignore values from transactions started after this one.
+       if value.txStartId > t.id {
+               return false
+       }
+
+       // Ignore values created from transactions in progress when
+       // this one started.
+       if t.inprogress.Contains(value.txStartId) {
+               return false
+       }
+
+       // If the value was created by a transaction that is not
+       // committed, and not this current transaction, it's no good.
+       if d.transactionState(value.txStartId).state != CommittedTransaction &&
+               value.txStartId != t.id {
+               return false
+       }
+
+       // If the value was deleted in this transaction, it's no good.
+       if value.txEndId == t.id {
+               return false
+       }
+
+       // Or if the value was deleted in some other committed
+       // transaction that started before this one, it's no good.
+       if value.txEndId < t.id &&
+               value.txEndId > 0 &&
+               d.transactionState(value.txEndId).state == CommittedTransaction &&
+               !t.inprogress.Contains(value.txEndId) {
+               return false
+       }
+
+       return true
 }

 type Connection struct {

How do I derive these rules? Mostly by writing tests that should pass or fail and seeing what doesn't make sense. I tried to steal from existing projects but these rules were not so simple to discover. Which is part of what I hope makes this project particularly useful to look at.

Let's write a test for Repeatable Read. Again, the test is long but well commented.

func TestRepeatableRead(t *testing.T) {
        database := newDatabase()
        database.defaultIsolation = RepeatableReadIsolation

        c1 := database.newConnection()
        c1.mustExecCommand("begin", nil)

        c2 := database.newConnection()
        c2.mustExecCommand("begin", nil)

        // Local change is visible locally.
        c1.mustExecCommand("set", []string{"x", "hey"})
        res := c1.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c1 get x")

        // Update not available to this transaction since this is not
        // committed.
        res, err := c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        c1.mustExecCommand("commit", nil)

        // Even after committing, it's not visible in an existing
        // transaction.
        res, err = c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        // But is available in a new transaction.
        c3 := database.newConnection()
        c3.mustExecCommand("begin", nil)

        res = c3.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c3 get x")

        // Local change is visible locally.
        c3.mustExecCommand("set", []string{"x", "yall"})
        res = c3.mustExecCommand("get", []string{"x"})
        assertEq(res, "yall", "c3 get x")

        // But not on the other commit, again.
        res, err = c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        c3.mustExecCommand("abort", nil)

        // And still not, regardless of abort, because it's an older
        // transaction.
        res, err = c2.execCommand("get", []string{"x"})
        assertEq(res, "", "c2 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        // And again still the aborted set is still not on a new
        // transaction.
        c4 := database.newConnection()
        res = c4.mustExecCommand("begin", nil)

        res = c4.mustExecCommand("get", []string{"x"})
        assertEq(res, "hey", "c4 get x")

        c4.mustExecCommand("delete", []string{"x"})
        c4.mustExecCommand("commit", nil)

        // But the delete is visible to new transactions now that this
        // has been committed.
        c5 := database.newConnection()
        res = c5.mustExecCommand("begin", nil)

        res, err = c5.execCommand("get", []string{"x"})
        assertEq(res, "", "c5 get x")
        assertEq(err.Error(), "cannot get key that does not exist", "c5 get x")
}

Let's get stricter!

Snapshot Isolation

Back to [Jepsen](https://jepsen.io/consistency/models/snapshot-isolation) for a definition:

In a snapshot isolated system, each transaction appears to operate on an independent, consistent snapshot of the database. Its changes are visible only to that transaction until commit time, when all changes become visible atomically to any transaction which begins at a later time. If transaction T1 has modified an object x, and another transaction T2 committed a write to x after T1’s snapshot began, and before T1’s commit, then T1 must abort.

So Snapshot Isolation is the same as Repeatable Read but with one additional rule: the keys written by any two concurrent committed transactions must not overlap.

This is why we tracked writeset. Every time a transaction modified or deleted a key, we added it to the transaction's writeset. To make sure we abort correctly, we'll add a conflict check to the commit step. (This idea is also well documented in A critique of snapshot isolation. This paper can be hard to find. Email me if you want a copy.)

When a transaction A goes to commit, it will run a conflict test for any transaction B that has committed since this transaction A started.

Serializable Isolation is going to have a similar check. So we'll add a helper for iterating through all relevant transactions, running a check function for any transaction that has committed.

func (d *Database) hasConflict(t1 *Transaction, conflictFn func(*Transaction, *Transaction) bool) bool {
        iter := d.transactions.Iter()

        // First see if there is any conflict with transactions that
        // were in progress when this one started.
        inprogressIter := t1.inprogress.Iter()
        for ok := inprogressIter.First(); ok; ok = inprogressIter.Next() {
                id := inprogressIter.Key()
                found := iter.Seek(id)
                if !found {
                        continue
                }
                t2 := iter.Value()
                if t2.state == CommittedTransaction {
                        if conflictFn(t1, &t2) {
                                return true
                        }
                }
        }

        // Then see if there is any conflict with transactions that
        // started and committed after this one started.
        for id := t1.id; id < d.nextTransactionId; id++ {
                found := iter.Seek(id)
                if !found {
                        continue
                }

                t2 := iter.Value()
                if t2.state == CommittedTransaction {
                        if conflictFn(t1, &t2) {
                                return true
                        }
                }
        }

        return false
}

It was around this point that I decided I did really need a B-Tree implementation and could not just stick to vanilla Go data structures.

Now we can modify completeTransaction to do this check if the transaction intends to commit. If the current transaction A's write set intersects with any other transaction B committed since transaction A started, we must abort.

 func (d *Database) completeTransaction(t *Transaction, state TransactionState) error {
         debug("completing transaction ", t.id)

+
+       if state == CommittedTransaction {
+               // Snapshot Isolation imposes the additional constraint that
+               // no transaction A may commit after writing any of the same
+               // keys as transaction B has written and committed during
+               // transaction A's life.
+               if t.isolation == SnapshotIsolation && d.hasConflict(t, func(t1 *Transaction, t2 *Transaction) bool {
+                       return setsShareItem(t1.writeset, t2.writeset)
+               }) {
+                       d.completeTransaction(t, AbortedTransaction)
+                       return fmt.Errorf("write-write conflict")
+               }
+       }
+
         // Update transactions.
         t.state = state
         d.transactions.Set(t.id, *t)

Lastly, the definition of setsShareItem.

func setsShareItem(s1 btree.Set[string], s2 btree.Set[string]) bool {
        s1Iter := s1.Iter()
        s2Iter := s2.Iter()
        for ok := s1Iter.First(); ok; ok = s1Iter.Next() {
                s1Key := s1Iter.Key()
                found := s2Iter.Seek(s1Key)
                if found {
                        return true
                }
        }

        return false
}

Since Snapshot Isolation shares all the same visibility rules as Repeatable Read, the tests get to be a little simpler! We'll simply test that two transactions attempting to commit a write to the same key fail. Or specifically: that the second transaction cannot commit.

func TestSnapshotIsolation_writewrite_conflict(t *testing.T) {
        database := newDatabase()
        database.defaultIsolation = SnapshotIsolation

        c1 := database.newConnection()
        c1.mustExecCommand("begin", nil)

        c2 := database.newConnection()
        c2.mustExecCommand("begin", nil)

        c3 := database.newConnection()
        c3.mustExecCommand("begin", nil)

        c1.mustExecCommand("set", []string{"x", "hey"})
        c1.mustExecCommand("commit", nil)

        c2.mustExecCommand("set", []string{"x", "hey"})

        res, err := c2.execCommand("commit", nil)
        assertEq(res, "", "c2 commit")
        assertEq(err.Error(), "write-write conflict", "c2 commit")

        // But unrelated keys cause no conflict.
        c3.mustExecCommand("set", []string{"y", "no conflict"})
        c3.mustExecCommand("commit", nil)
}

Not bad! But let's get stricter.

Upon further discussion with Alex Miller, and after reviewing A Critique of ANSI SQL Isolation Levels, the difference I am trying to suggest (between Repeatable Read and Snapshot Isolation) likely does not exist. A Critique of ANSI SQL Isolation Levels mentions Repeatable Read must not exhibit P4 (Lost Update) anomalies. And it mentions that you must check for read-write conflicts to avoid these. Therefore it seems likely that you can't easily separate Repeatable Read from Snapshot Isolation when implemented using MVCC. The differences between Repeatable Read and Snapshot Isolation may more readily show up when implementing transactions the classical way with Two-Phase Locking.

To reiterate, with MVCC and optimistic concurrency control, correct implementations of Repeatable Read and Snapshot Isolation do not seem to be distinguishable. Both require write-write conflict detection.

Serializable Isolation

In terms of end result, this is the simplest isolation level to reason about. Serializable Isolation must appear as if only a single transaction were executing at a time. Some systems, like SQLite and TigerBeetle, do Actually Serial Execution where only one transaction runs at a time. But few databases implement Serializable like this because it rules out a number of perfectly safe concurrent execution histories. For example, two concurrent read-only transactions.

Postgres implements serializability via Serializable Snapshot Isolation. MySQL implements serializability via Two-Phase Locking. FoundationDB implements serializability via sequential timestamp assignment and conflict detection.

But the paper, A critique of snapshot isolation, provides a simple (though not necessarily efficient; I have no clue) approach via what they call Write Snapshot Isolation. In their algorithm, a transaction must abort if its read set intersects the write set of any concurrent committed transaction, or vice versa (write-write intersection alone need not be checked). And this (plus the Repeatable Read rules) is sufficient for Serializability.

I leave it to that paper for the proof of correctness. In terms of implementation, though, it's quite simple and very similar to the Snapshot Isolation check we already mentioned.

Inside of completeTransaction add:

                }) {
                        d.completeTransaction(t, AbortedTransaction)
                        return fmt.Errorf("write-write conflict")
+               }
+
+               // Serializable Isolation imposes the additional constraint that
+               // no transaction A may commit after reading any of the same
+               // keys as transaction B has written and committed during
+               // transaction A's life, or vice-versa.
+               if t.isolation == SerializableIsolation && d.hasConflict(t, func(t1 *Transaction, t2 *Transaction) bool {
+                       return setsShareItem(t1.readset, t2.writeset) ||
+                               setsShareItem(t1.writeset, t2.readset)
+               }) {
+                       d.completeTransaction(t, AbortedTransaction)
+                       return fmt.Errorf("read-write conflict")
                }
        }

And if we add a test for read-write conflicts:

func TestSerializableIsolation_readwrite_conflict(t *testing.T) {
        database := newDatabase()
        database.defaultIsolation = SerializableIsolation

        c1 := database.newConnection()
        c1.mustExecCommand("begin", nil)

        c2 := database.newConnection()
        c2.mustExecCommand("begin", nil)

        c3 := database.newConnection()
        c3.mustExecCommand("begin", nil)

        c1.mustExecCommand("set", []string{"x", "hey"})
        c1.mustExecCommand("commit", nil)

        _, err := c2.execCommand("get", []string{"x"})
        assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")

        res, err := c2.execCommand("commit", nil)
        assertEq(res, "", "c2 commit")
        assertEq(err.Error(), "read-write conflict", "c2 commit")

        // But unrelated keys cause no conflict.
        c3.mustExecCommand("set", []string{"y", "no conflict"})
        c3.mustExecCommand("commit", nil)
}

We see it work! And that's it for a basic implementation of MVCC and major transaction isolation levels.

Production-quality testing

There are two major projects I'm aware of that help you test transaction implementations: Elle and Hermitage. These are probably where I'd go looking if I were implementing this for real.

This project took me long enough on its own, and I felt comfortable enough from my tests that the gist of my logic was right, so I did not test further. For that reason it surely has bugs.

Vacuuming and cleanup

One of the major things this implementation does not do is clean up old data. Eventually, older versions of values will no longer be required by any transaction. They should be removed from the value version array. Similarly, older transactions will eventually no longer be required by any transaction. They should be removed from the database transaction history list.
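
Here is a minimal sketch of what a vacuum pass over values might look like in this codebase. The vacuum method and its watermark logic are my own invention, not part of the code above:

func (d *Database) vacuum() {
        // Watermark: every active transaction can see the effects (and
        // deletes) of all committed transactions below this id.
        minActiveId := d.nextTransactionId
        iter := d.transactions.Iter()
        for ok := iter.First(); ok; ok = iter.Next() {
                t := iter.Value()
                if t.state != InProgressTransaction {
                        continue
                }
                if t.id < minActiveId {
                        minActiveId = t.id
                }
                // An active transaction also cannot see deletes from
                // transactions that were in progress when it started.
                inprogressIter := t.inprogress.Iter()
                if ok := inprogressIter.First(); ok {
                        if id := inprogressIter.Key(); id < minActiveId {
                                minActiveId = id
                        }
                }
        }

        for key, versions := range d.store {
                kept := versions[:0]
                for _, value := range versions {
                        // Versions from aborted transactions are invisible to everyone.
                        aborted := d.transactionState(value.txStartId).state == AbortedTransaction
                        // Versions deleted by a committed transaction below the
                        // watermark are invisible to all current and future transactions.
                        deletedForAll := value.txEndId > 0 &&
                                value.txEndId < minActiveId &&
                                d.transactionState(value.txEndId).state == CommittedTransaction
                        if !aborted && !deletedForAll {
                                kept = append(kept, value)
                        }
                }
                d.store[key] = kept
        }
}

A real implementation would also prune the transaction history itself and decide when to run; Postgres, for example, does this kind of cleanup in a background autovacuum process.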

Even if we had the vacuuming process in place though, what about extreme usage patterns? What if a key's value is always going to be 1GB long? And what if multiple transactions make only small changes to that 1GB of data? We'd be duplicating a lot of the value across versions.

It sounds less extreme when thinking about storing rows of data rather than key-value data. If a row has 100 columns and a transaction updates only one column a number of times, in our scheme we'd end up storing a ton of duplicate cell data for that row.

This is a real-world issue in Postgres that was called out by Andy Pavlo and the Ottertune folks. It turns out that Postgres alone among major databases stores the entire value for every version. In contrast other major databases like MySQL store a diff.

Conclusion

This post only begins to demonstrate that database behavior differs quite a bit both in terms of results and in terms of optimizations. Everyone implements the ideas differently and to varying degrees.

Moreover, we have only begun to implement the behavior a real SQL database supports. For example, how do visibility rules and conflict detection work with range queries? What about sub-transactions and savepoints? These will have to be covered another time.

Hopefully seeing this simple implementation of MVCC and visibility rules helps to clarify at least some of the basics.

What makes a great technical blog

2024-04-10 08:00:00

I want to explain why the blogs in My favorite technical blogs are my favorite. That page is solely about non-corporate tech blogs. So this post is too. I'll have to make another list for favorite corporate tech blogs.

In short, they:

Tackle hard and confusing topics

There are a number of problems in programming and computer science where otherwise knowledgeable programmers have to start mumbling, or revert to cliches or group-think, because they aren't sure.

These are the best topics you can possibly dive deep into. And my favorite writers do exactly this.

They write about durability guarantees of disks and filesystems. They write about common pitfalls in benchmarking. They write about database consistency anomalies. They write about threading and IO models.

And they write about it by showing concrete examples and concrete logic so you can learn how to stop handwaving on the topic.

Their writing helps you come out with a useful mental model you can apply to your own problems.

And you know, sometimes it's not about the topic being obscure. Good writers have the ability to tackle a boring topic in an interesting light. Maybe by digging deeper into a root cause. Or showing you the history behind the scenes.

Moreover, my favorite writers don't know everything. But they also don't pretend to know everything. They're quick to admit they don't understand something and ask for help from their readers.

Show working code

I love to see complete working code in a post. In contrast there are many projects that start out simple and people write an article that covers the project at a high level. But they keep working on the project and it becomes more complex.

It's not always easy to follow commits over time.

Eli Bendersky and Serge Zaitsev are particularly great at developing small but meaningful projects in a single post or short series.

On the other hand, if people only did this, we wouldn't hear about the development of long-running projects like V8 or Postgres. So I guess this style has limits. And I don't penalize people talking about long-running projects for not showing working code.

Make things simpler

One of the marks of a good writer is that they can make complex topics simple. And not just by being reductive. Though sometimes even being reductive is useful for education.

In contrast I sometimes see articles by less experienced writers and I marvel how they make a simple topic so complex. I recognize this because I was absolutely like that 10 years ago, if not 5 years ago.

Write regularly

My favorite blogs typically get a new post at least once a month. Some people, like Murat, write once a week.

I think the practice probably does improve your writing but mostly it's that they keep my attention by publishing regularly!

Talk about tradeoffs and downsides

Nothing builds trust like talking about the issues with something you built. No project is perfect. And to ignore the downsides risks seeming like you don't know or understand them.

So the writers I like the most talk about decisions in context. They talk about the good and the bad.

Avoid internet slang, memes, swearing, sarcasm, and ranting

There's no way I can think of talking about this without sounding super lame.

One thing I've noticed, particularly among younger colleagues, is the use of memes, swearing, 4chan slang, or sarcasm. I used to write like this 10 years ago too.

There is a chunk of your audience who won't care. The problem is that there's also a chunk of your (potential) audience who definitely does care. There's even a chunk of your audience who may not care but just won't understand (i.e. non-native English speakers).

I have friends and folks I respect who write very well, but who are also overly, unnecessarily edgy when they write. I don't like sharing these posts because I don't want to unnecessarily offend or turn off people either.

Closing thoughts

It would be boring if everyone wrote the same way. I'm glad the internet is fun and weird. But I wanted to share a few things that go into my favorite technical blogs that I'm always happy to refer people to.

A paper reading club at work; databases and distributed systems research

2024-04-05 08:00:00

I started a paper reading club this week at work, focused on databases and distributed systems research. I posted in a general channel about the premise and asked if anyone was interested. I clarified the intended topic and that discussions would be asynchronous over email, run fortnightly.

Weekly seemed way too often. A month seemed way too infrequent. Fortnightly seemed decent.

I was nervous to do this because I've been here about 2 months. In the past I would have waited 6 months or a year to do this. But I don't know. If you see something you think should exist, why wait?

The only other consideration was past experiences I've written about having difficulty getting engagement with clubs at work. But EDB has nearly 1,000 employees. I figured there might at least be a couple interested.

Furthermore I figured if I only got a few people, this entire idea would at least benefit myself, since I have been wanting to force myself to build a paper reading habit. And if no one responded, it would be only mildly embarrassing and I'd not pursue it further.

But after a day, about 6 people showed interest. Which was better than I hoped! Folks from product management, support, development, and beyond.

So I opened a dedicated channel and asked people to start submitting papers and voting on them. One of my teammates started submitting some great papers on caches and reference counting.

I picked a first one, the Redshift paper, to get us started, demonstrating the process to avoid confusion. And I made a calendar invite for everyone in the channel, with the paper linked in the invite. I clarified in the invite that it was just a reminder and that the real discussion would still be async over email. (I've found it's best to repeatedly clarify process stuff.)

Once I had these first few folks interested I was able to post again in a broader company channel that a couple of us were starting this paper club. By the end of the day the dedicated channel was 29 folks. All in about 2 days.

Mailing lists are nicer than Slack or Discord in my opinion because they sort of force you to slow down, they are harder to miss (if someone starts a thread after you've seen a message in Slack or Discord, you tend to miss it), and easier to manage (read/unread).

Engineers often seem to get overwhelmed by a mass of Slack messages. Whereas they seem to be a bit more comfortable with email threads.

All of this is all the more important when you're running a global group. EDB has people everywhere.

Why do this?

Before I dropped out of college I did a research internship with a VLSI group at Harvard SEAS. And my favorite part was that they had a weekly (or biweekly?) Wednesday paper reading session where 15 people from the lab and adjacent labs would eat pizza after hours and discuss a paper.

I've been dying to recreate this at a company ever since. Since EDB is so distributed, we won't be discussing over pizza. But I'm still excited.

And I hope my experience serves as a blueprint for others.

Finding memory leaks in Postgres C code

2024-03-27 08:00:00

This is an external post of mine. Click here if you are not redirected.

Zig, Rust, and other languages

2024-03-15 08:00:00

Having worked a bit in Zig, Rust, Go and now C, I think there are a few common topics worth having a fresh conversation on: automatic memory management, the standard library, and explicit allocation.

Zig is not a mature language. But it has made enough useful choices for a number of companies to invest in it and run it in production. The useful choices make Zig worth talking about.

Go and Rust are mature languages. But they have both made questionable choices that seem worth talking about.

All of these languages are developed by highly intelligent folks I personally look up to. And your choice to use any one of these is certainly fine, whichever it is.

The positive and negative choices particular languages made, though, are worth talking about as we consider what a systems programming language 10 years from now would look like. Or how these languages themselves might evolve in the next 10 years.

My perspective is mostly building distributed databases. So the points that I bring up may have no relevance to the kind of work you do, and that's alright. Moreover, I'm already aware most of these opinions are not shared by the language maintainers, and that's ok too. I am not writing to convince anyone.

Automatic memory management

One of my bigger issues with Zig is that it doesn't support RAII. You can defer cleanup to the end of a block, but that only covers half of the problem. Only RAII will allow for smart pointers and automatic (not manual) reference counting. RAII is an excellent option to default to, but in Zig you aren't allowed to. In contrast, even C "supports" automatic cleanup (via compiler extensions).

But most of the time, arenas are fine. Postgres is written in C and memory is almost entirely managed through nested arenas (called "memory contexts") that get cleaned up when some subset of a task finishes, recursively. Zig has builtin support for arenas, which is great.

Standard library

It seems regrettable that some languages have been shipping smaller standard libraries. Smaller standard libraries seem to encourage users of the language to install more transitively-unvetted third-party libraries, which increases build time and build flakiness, and which increases bitrot over time as unnecessary breaking changes occur.

People have been making jokes about node_modules for a decade now, but this problem is just as bad in Rust codebases I've seen. And to a degree it happens in Java and Go as well, though their larger standard libraries allow you to get further without dependencies.

Zig has a good standard library, which may be Go and Java tier in a few years. But one goal of their package manager seemed to be to allow the standard library to be broken up; made smaller. For example, JSON support moving out of the standard library into a package. I don't know if that is actually the planned direction. I hope not.

Having a large standard library doesn't mean that the programmer shouldn't be able to swap out implementations easily as needed. But all that is required is for the standard library to define an interface along with the standard library implementation.
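
Go's standard library already works this way in places. For example, hash.Hash32 is an interface defined right next to its implementations, so callers can swap implementations without reaching for third-party dependencies:

package main

import (
        "fmt"
        "hash"
        "hash/adler32"
        "hash/crc32"
)

// checksum works with any implementation of the standard library's
// hash.Hash32 interface.
func checksum(h hash.Hash32, data []byte) uint32 {
        h.Write(data)
        return h.Sum32()
}

func main() {
        data := []byte("some bytes")
        fmt.Println(checksum(crc32.NewIEEE(), data))
        fmt.Println(checksum(adler32.New(), data))
}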

The small size of the standard library doesn't just affect developers using the language, it even encourages developers of the language itself to depend on libraries owned by individuals.

Take a look at the transitive dependencies of an official Node.js package like node-gyp. Is it really the ideal outcome of a small standard library to encourage dependence in official libraries on libraries owned by individuals, like env-paths, that haven't been modified in 3 years? 68 lines of code. Is it not safer at this point to vendor that code? i.e. copy the env-paths code into node-gyp.

Similarly, if you go looking for compression support in Rust, there's none in the standard library. But you may notice the flate2-rs repo under the official rust-lang GitHub namespace. If you look at its transitive dependencies: flate2-rs depends on (an individual's) miniz_oxide which depends on (an individual's) adler that hasn't been updated in 4 years. 300 lines of code including tests. Why not vendor this code? It's the habits a small standard library builds that seem to encourage everyone not to.

I don't mean these necessarily constitute a supply-chain risk. I'm not talking about left-pad. But the pattern is sort of clear. Even official packages may end up depending on external party packages, because the commitment to a small standard library meant omitting stuff like compression, checksums, and common OS paths.

It's a tradeoff and maybe makes the job of the standard library maintainer easier. But I don't think this is the ideal situation. Dependencies are useful but should be kept to a reasonable minimum.

Hopefully languages end up more like Go than like Rust in this regard.

Explicit allocation

When folk discuss the Zig standard library's pattern of requiring an allocator argument for every method that allocates, they often talk about the benefit of swapping out allocators or the benefit of being able to handle OOM failures.

Both of these seem pretty niche to me. For example, in Zig tests you are encouraged to pass around a debug allocator that tells you about memory leaks. But this doesn't seem too different from compiling a C project with a debug allocator or compiling with different sanitizers on and running tests against the binary produced. In both cases you mostly deal with allocators at a global level depending on the environment you're running the code in (production or tests).

The real benefit of explicit allocation to me is much more mundane. You basically can't write a method in Zig without acknowledging allocations.

This is particularly useful for hotpath code. Take an iterator for example. It has a new() method, a next() method, and a done() method. In most languages, it's basically impossible at the syntax or compiler-level to know if you are allocating in the next() method. You may know because you know the behavior of all the code in next() by heart. But that won't happen all the time.

Zig is practically alone in that if you write the next() method and don't pass an allocator to any function called in the next() body, nothing in that next() method will allocate.

In any other language it might not be until you run a profiler that you notice an allocation that should have been done once in new() accidentally ended up in next() instead.
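
Here is that convention approximated in C++ with std::pmr, as a sketch (Zig doesn't enforce this at the language level either; it's a standard library style): allocation shows up in the signature, so a next() that takes no allocator visibly cannot allocate.

#include <cstddef>
#include <memory_resource>
#include <vector>

struct Iterator
{
  std::pmr::vector<int> items;
  std::size_t pos = 0;

  // By convention, only functions that take an allocator may allocate.
  static Iterator init(std::pmr::memory_resource *alloc, std::size_t n)
  {
    return Iterator{std::pmr::vector<int>(n, 0, alloc), 0};
  }

  // No allocator parameter: under the convention, next() cannot
  // allocate, and a reader can tell from the signature alone.
  int *next()
  {
    if (pos == items.size())
      return nullptr;
    return &items[pos++];
  }
};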

On the other hand, for all the same reasons, writing Zig is kind of a pain because everything takes an allocator!

Explicit allocation is not intrinsic to Zig, the language. It is a convention that is prevalent in the standard library. There is still a global allocator and any user of Zig could decide to use the global allocator. At which point you've got implicit allocation. So explicit allocation as a convention isn't a perfect solution.

But it, by default, gives you a level of awareness of allocations you just can't get from typical Go or Rust or C code, depending on the project's practices. Perhaps it's possible to swap out the Go, Rust, or C standard library for one where all functions that allocate require an allocator.

But explicitly passing allocators is still sort of a visual hack.

I think the ideal situation in the future will be that every language supports annotating blocks of code as must-not-allocate or something along those lines. Either the compiler will enforce this and fail if you seem to allocate in a block marked must-not-allocate, or it will panic during runtime so you can catch this in tests.

This would be useful beyond static programming languages. It would be as interesting to annotate blocks in JavaScript or Python as must-not-allocate too.

Otherwise the current state of things is that you'd normally configure this sort of thing at the global level. Saying "there must not be any allocations in this entire program" just doesn't seem as useful in general as being able to say "there must not be any allocations in this one block".
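
As a sketch of the runtime-checked flavor (C++ here, and only a sketch: it catches operator new but not raw malloc), you can trip a thread-local flag inside guarded blocks:

#include <cstdio>
#include <cstdlib>
#include <new>

thread_local bool must_not_allocate = false;

void *operator new(std::size_t size)
{
  if (must_not_allocate)
  {
    std::fputs("allocation inside a must-not-allocate block\n", stderr);
    std::abort(); // Fail loudly so tests catch the regression.
  }
  if (void *p = std::malloc(size))
    return p;
  throw std::bad_alloc();
}

void operator delete(void *p) noexcept { std::free(p); }
void operator delete(void *p, std::size_t) noexcept { std::free(p); }

struct MustNotAllocate
{
  bool prev;
  MustNotAllocate() : prev(must_not_allocate) { must_not_allocate = true; }
  ~MustNotAllocate() { must_not_allocate = prev; }
};

void hotpath()
{
  MustNotAllocate guard; // Any `new` in this scope now aborts.
  // ... the block you wanted to mark must-not-allocate ...
}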

Optional, not required, allocator arguments

Rust has nascent support for passing an allocator to methods that allocate. But it's optional. From what I understand, C++ STL is like this too.
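
For instance, here's what optional looks like with C++'s std::pmr: containers fall back to a global default resource unless the caller hands one in.

#include <memory_resource>
#include <vector>

int main()
{
  // Implicit: uses the global default resource (new/delete).
  std::vector<int> a{1, 2, 3};

  // Explicit, but opt-in: the same container with a caller-chosen arena.
  std::pmr::monotonic_buffer_resource arena;
  std::pmr::vector<int> b{&arena};
  b.push_back(1);

  return 0;
}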

These are both super useful for programming extensions. And it's one of the reasons I think Zig makes a ton of sense for Postgres extensions specifically: it was always built for running in an environment with someone else's allocator.

Praise for Zig, Rust, and Go tooling

All three of these have really great first-party tooling including build system, package management, test runners and formatters. The idea that the language should provide a great environment to code in (end-to-end) makes things simpler and nicer for programmers.

Meandering non-conclusion

Use the language you want to use. Zig and Rust are both nice alternatives to writing vanilla C.

On the other hand, I've been pleasantly surprised writing Postgres C. How high level it is. It's almost a separate language since you're often dealing with user-facing constructs, like Postgres's Datum objects which represent what you might think of as a cell in a Postgres database. And you can use all the same functions provided for Postgres SQL for working with Datums, but from C.

I've also been able to work a bit on Postgres extensions in Rust with pgrx lately, which I hope to write about soon. And when I saw pgzx for writing Postgres extensions in Zig I was excited to spend some time with that too.

First month on a database team

2024-03-11 08:00:00

A little over a month ago, I joined EnterpriseDB on a distributed Postgres product (PGD). The process of onboarding myself has been pretty similar at each company in the last decade, though I think I've gotten better at it. The process is of course influenced by the team, and my coworkers have been excellent. Still, I wanted to share my thought process and personal strategies.

Avoid, at first, what is always challenging

The trickier things at companies are the people, the organization, and the processes. What code exists? How does it work together? Who owns what? How can I find easy code issues to tackle? How do I know what's important (so I can avoid picking it up and becoming a bottleneck)?

But also, in the first few days or weeks you aren't exactly expected to contribute meaningfully to features or bugs. Your sprint contributions are not tracked too closely.

The combination of 1) what to avoid and 2) the sprint freedom you have leads to a few interesting and valuable areas to work on by yourself: the build process, tests, running the software, and docs.

But code need not be ignored either. Some frequent areas to get your first code contributions in include user configuration code, error messages, and stale code comments.

What follows are some little 1st day, 1st week, 1st month projects I went through to bootstrap my understanding of the system.

Build process

First off, where is the code and how do you build it? This requires you to have all the relevant dependencies. Much of my work is on a Postgres extension. This meant having a local Postgres development environment, having gcc, gmake (on mac), Perl, and so on. And furthermore, PGD is a pretty mature product so it supports building against multiple Postgres distributions. Can I build against all of them?

The easiest situation is when there are instructions for all of this, linked directly from your main repo. When I started, the instructions did exist but in a variety of places. So over the first week I started collecting all of what I had learned about building the system, with dependencies, across distributions, and with various important flags (debug mode on, asserts enabled, etc.). I finished the first week by writing a little internal blog post called "Hacking on PGD".

I hadn't yet figured out the team processes so I didn't want to bother anyone by trying to get this "blog post" committed anywhere yet as official internal documentation. Maybe there already was a good doc, I just hadn't noticed it yet. So I just published it in a private Confluence page and shared it in the private team slack. If anyone else benefited from it, great! Otherwise, I knew I'd want to refer back to it.

This is an important attitude, I think. It can be hard to tell what others will benefit from. If you get into the habit of writing things down for your own sake, while making them available internally, it is likely others will benefit from them too. This is something I've learned from years of blogging publicly outside of work.

Moreover, the simple act of writing a good post establishes you as something of an authority. This is useful for yourself if no one else.

Writing a good post

Let's get distracted here for a second. One of the most important things in documentation, I think, is documenting not just what does exist but what doesn't. If you had to take a path to get something to work, did you try other paths that didn't work? It can be extremely useful to figure out exactly what is required for something.

Was there a flag that you tried to build with but you didn't try building without it? Well try again without it and make sure it was necessary. Was there some process you executed where the build succeeded but you can't remember if it was actually necessary for the build to succeed?

It's difficult to explain why I think this sort of precision is useful but I'm pretty sure it is. Maybe because it builds the habit of not treating things as magic when you can avoid it. It builds the habit of asking questions (if only to yourself) to understand and not just to get by.

Static analysis? Dynamic analysis?

Going back to builds, another aspect to consider is static and dynamic analysis. Are there special steps to using gdb or valgrind or other analyzers? Are you using them already? Can you get them running locally? Has any of this been documented?

Maybe the answer to all of those is yes, or maybe none of those are relevant but there are likely similar tools for your ecosystem. If analysis tools are relevant and no one has yet explored them, that's another very useful area to explore as a newcomer.

Testing

After I got the builds working, I felt the obvious next step was to run tests. But what tests exist? Are there unit tests? Integration tests? Anything else? Moreover, is there test coverage? I was certain I'd be able to find some low-hanging contributions to make if I could find some files with low test coverage.

Alas, my certainty hit a wall: there were in fact many types of integration tests, and together they already provide good coverage. They just don't all report coverage.

The easiest ways to report coverage (with gcov) were only reporting coverage for certain integration tests that we run locally. There are more integration tests run in cloud environments and getting coverage reports there to merge with my local coverage files would have required more knowledge of people and processes, areas that I wanted not to be forced to think about too quickly.

So coverage wasn't a good route to go. But around this time, I noticed a ticket that asked for a simple change to user configuration code. I was able to make the change pretty quickly and wanted to add tests. We have our own test framework built on top of Postgres's powerful Perl test framework. But it was a little difficult to figure out how to use either of them.

So I copied code from other tests and pared it down until I got the smallest version of test code I could get. This took maybe a day or two of tweaking lines and rerunning tests since I didn't understand everything that was/wasn't required. Also it's Perl and I've never written Perl before so that took a bit of time and ChatGPT. (Arrays, man.)

In the end though I was able to collect my learnings into another internal Confluence post just about how to write tests, how to debug tests, how to do common things within tests (for example, ensuring a Postgres log line was outputted), etc. I published this post as well and shared it in the team Slack.

Running

I had PGD built locally and was able to run integration tests locally, but I still hadn't gotten a cluster running. Nor played with the eventual consistency demos I knew we supported. We had a great quickstart that ran through all the manual steps of getting a two-node cluster up. This was a distillation, for devs, of a more elaborate process we give to customers in a production-quality script.

But I was looking for something in between a production-quality script and manually initializing a local cluster. And I also wanted to practice my understanding of our test process. So I ported our quickstart to our integration test framework and made a PR with this new test, eventually merging this into the repo. And I wrote a minimal Python script for bringing up a local cluster. I've got an open PR to add this script to the repo. Maybe I'll learn though that a simple script such as this does already exist, and that's fine!

Docs

The entire time, as I'd been trying to build and test and run PGD, I was trying to understand our terminology and architecture by going through our public docs. I had a lot of questions coming out of this I'd ask in the team channel.

Not to toot my horn but I think it's somewhat of a superpower to be able/willing to ask "dumb questions" in a group setting. That's how I frame it anyway. "Dumb question: what does X mean in this paragraph?" Or, "dumb question: when we say performance improvement because of Y, what is the intuition here?" Because of the time spent here, I was able to make a few more docs contributions as I read through the docs as well.

You have to balance where you ask your dumb questions though. Asking dumb questions to one person doesn't benefit the team. But asking dumb questions in too wide a group is sometimes bad politics. Asking "dumb questions" in front of your team seems to have the best bang for buck.

But maybe the more important contributions were, when I got more comfortable with the team, proposing to merge my personal, internal Confluence blog posts into the repo as docs. I think in a number of cases, what I wrote about indeed hadn't been concisely collected before and thus was useful to have as team documentation.

Even more challenging was trying to distill (a chunk of) the internal architecture. Only after following many varied internal docs and videos, and following through numerous code paths, was I able to propose an architecture diagram outlining major components and communication between them, with their differing formats (WAL records, internal enums, etc.) and means of communication (RPC, shared memory, etc.). This architecture diagram is still in review and may be totally off. But it's already helped at least me think about the system.

In most cases this was all information that the team had already written or explained, but bringing it together and summarizing it provided a different, useful perspective I think. Even if none of the docs got merged it still helped to build my own understanding.

Beyond the repo

Learning the project is just one aspect of onboarding. Beyond that I joined the #cats channel, the #dogs channel, found some fellow New Yorkers and opened a NYC channel, and tried to find Zoom-time with the various people I'd see hanging around common team Slack channels. Trying to meet not just devs but support folk, product managers, marketing folk, sales folk, and anyone else!

Walking the line between scouring our docs and GitHub and Confluence and Jira on my own, and bugging people with my incessant questions.

I've enjoyed my time at startups. I've been a dev, a manager, a founder, a cofounder. But I'm incredibly excited to be back, at a bigger company, full-time as a developer hacking on a database!

And what about you? What do you do to onboard yourself at a new company or new project?

An intuition for distributed consensus in OLTP systems

2024-02-08 08:00:00

Distributed consensus in transactional databases (e.g. etcd or Cockroach) is a big deal these days. Most often under the hood are variations of log-based Paxos-like algorithms such as MultiPaxos, Viewstamped Replication, or Raft. While there are new variations that come out each year, optimizing for various workloads, these algorithms are fairly standard and well-understood.

In fact they are used in so many places, Kubernetes for example, that even if you don't decide to implement Raft (which is fun and I encourage it), it seems worth building an intuition for distributed consensus.

What happens as you tweak a configuration. What happens as the production environment changes. Or what to reach for as product requirements change.

I've been thinking about the basics of distributed consensus recently. There has been a lot to digest and characterize. And I'm only beginning to get an understanding.

This post is an attempt to share some of the intuition built up reading about and working in this space. Originally this post was also going to end with a walkthrough of my most recent Raft implementation in Rust. But I'm going to hold off on that for another time.

I was fortunate to have a few excellent reviewers looking at versions of this post: Paul Nowoczynski, Alex Miller, Jack Vanlightly, Daniel Chia, and Alex Petrov. Thank you!

Let's start with Raft.

Raft

Raft is a distributed consensus algorithm that allows you to build a replicated state machine on top of a replicated log.

A Raft library handles replicating and durably persisting a sequence (or log) of commands to at least a majority of nodes in a cluster. You provide the library a state machine that interprets the replicated commands. From the perspective of the Raft library, commands are just opaque byte strings.

For example, you could build a replicated key-value store out of SET and GET commands that are passed in by a client. You provide the Raft library state machine code that interprets the Raft log of SET and GET commands to modify or read from an in-memory hashtable. You can find concrete examples of exactly this replicated key-value store modeling in previous Raft posts I've written.
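
A sketch of what that user-provided state machine might look like (in C++; the shape is illustrative, not any particular library's API):

#include <map>
#include <optional>
#include <sstream>
#include <string>

// The Raft library replicates opaque byte strings; only this
// user-provided state machine knows they encode SET and GET.
struct KvStateMachine
{
  std::map<std::string, std::string> table;

  // Called with each committed command, in log order, on every node.
  std::optional<std::string> apply(const std::string &command)
  {
    std::istringstream in(command);
    std::string op, key, value;
    in >> op >> key;
    if (op == "SET")
    {
      in >> value;
      table[key] = value;
      return std::nullopt;
    }
    if (op == "GET")
    {
      auto it = table.find(key);
      if (it == table.end())
        return std::nullopt;
      return it->second;
    }
    return std::nullopt; // Unknown command; real code would reject it.
  }
};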

All nodes in the cluster run the same Raft code (including the state machine code you provide), communicating among themselves. Nodes elect a semi-permanent leader that accepts all reads and writes from clients. (Again, reads and writes are modeled as commands.)

To commit a new command to the cluster, clients send the command to all nodes in the cluster. Only the leader accepts this command, if there is currently a leader. Clients retry until there is a leader that accepts the command.

The leader appends the command to its log and makes sure to replicate all commands in its log to followers in the same order. The leader sends periodic heartbeat messages to all followers to prolong its term as leader. If a follower hasn't heard from the leader within a period of time, it becomes a candidate and requests votes from the cluster.

When a follower is asked to accept a new command from a leader, it checks if its history is up-to-date with the leader. If it is not, the follower rejects the request and asks the leader to send previous commands to bring it up-to-date. It does this ultimately, in the worst case of a follower that has lost all history, by going all the way back to the very first command ever sent.

When a quorum (typically a majority) of nodes has accepted a command, the leader marks the command as committed and applies the command to its own state machine. When followers learn about newly committed commands, they also apply committed commands to their own state machine.

For the most part, these details are graphically summarized in Figure 2 of the Raft paper.

Availability and linearizability

Taking a step back, distributed consensus helps a group of nodes, a cluster, agree on a value. A client of the cluster can treat a value from the cluster as if the value was atomically written to and read from a single thread. This property is called linearizability.

However, with distributed consensus, the client of the cluster gets better availability guarantees than if it atomically wrote to or read from a single thread. A single thread that crashes becomes unavailable. But some number of nodes, f, can crash in a cluster of 2f + 1 nodes implementing distributed consensus and the cluster will still 1) be available and 2) provide linearizable reads and writes.

That is: distributed consensus solves the problem of high availability for a system while remaining linearizable.

Without distributed consensus you can still achieve high availability. For example, a database might have two read replicas. But a client reading from a read replica might get stale data. Thus, this system (a database with two read replicas) is not linearizable.

Without distributed consensus you can also try synchronous replication. It would be very simple to do: have a fixed leader and require all nodes to acknowledge before committing. But the value here is extremely limited. If a single node in the cluster goes down, the entire cluster is down.

You might think I'm proposing a strawman. We could simply designate a permanent leader that handles all reads and writes; and require a majority of nodes to commit a command before the leader responds to a client. But in that case, what's the process for getting a lagging follower up-to-date? And what happens if it is the leader who goes down?

Well, these are not trivial problems! And, beyond linearizability that we already mentioned, these problems are exactly what distributed consensus solves.

Why does linearizability matter?

It's very nice, and often even critical, to have a highly available system that will never give you stale data. And regardless, it's convenient to have a term for what we might naively think of as the "correct" way you'd always want to set and get a value.

So linearizability is a convenient way of thinking about complex systems, if you can use or build a system that supports it. But it's not the only consistency approach you'll see in the wild.

As you increase the guarantees of your consistency model, you tend to sacrifice performance. Going the opposite direction, some production systems sacrifice consistency to improve performance. For example, you might allow stale reads from any node, reading only from local state and avoiding consensus, so that you can reduce load on a leader and avoid the overhead of consensus.

There are formal definitions for lower consistency models, including sequential and read-your-writes. You can read the Jepsen page for more detail.

Best and worst case scenarios

A distributed system relies on communicating over the network. The worse the network, whether in terms of latency or reliability, the longer it will take for communication to happen.

Aside from the network, disks can misdirect writes or corrupt data. Or your disk could actually be network-attached storage such as EBS.

And processes themselves can crash due to low disk space or the OOM killer.

It will take longer to achieve consensus to commit messages in these scenarios. If messages take longer to reach nodes, or if nodes are constantly crashing, followers will time out more often, triggering leader elections. And leader election (which also requires consensus) will itself take longer.

The best case scenario for distributed consensus is where the network is reliable and low-latency. Where disks are reliable and fast. And where processes don't often crash.

TigerBeetle has an incredible visual simulator that demonstrates what happens across ever-worsening environments. While TigerBeetle and this simulator use Viewstamped Replication, the demonstrated principles apply to Raft as well.

What happens when you add nodes?

Distributed consensus algorithms make sure that some minimum number of nodes in a cluster agree before continuing. The minimum number is proportional to the total number of nodes in the cluster.

A typical implementation of Raft for example will require 3 nodes in a 5-node cluster to agree before continuing. 4 nodes in a 7-node cluster. And so on.

Recall that the p99 latency for a service is at least as bad as the slowest external request the service must make. As you increase the number of nodes you must talk to in a consensus cluster, you increase the chance of a slow request.

Consider the extreme case of a 101-node cluster requiring 51 nodes to respond before returning to the client. That's 51 chances for a slower request, compared to 4 chances in a 7-node cluster. The 101-node cluster is certainly more highly available though! It can tolerate 50 nodes going down. The 7-node cluster can only tolerate 3 nodes going down. The scenario where 50 nodes go down (assuming they're in different availability zones) seems pretty unlikely!
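
The arithmetic is simple enough to spell out. A sketch, assuming the typical majority quorum:

// Majority quorum: the smallest group any two quorums must share.
constexpr int quorum(int cluster_size) { return cluster_size / 2 + 1; }

static_assert(quorum(5) == 3, "3 of 5 nodes must agree");
static_assert(quorum(7) == 4, "4 of 7 nodes must agree");
static_assert(quorum(101) == 51, "51 of 101 nodes must agree");

// Fault tolerance is whatever is left over.
static_assert(7 - quorum(7) == 3, "a 7-node cluster tolerates 3 failures");
static_assert(101 - quorum(101) == 50, "a 101-node cluster tolerates 50");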

Horizontal scaling with distributed consensus? Not exactly

All of this is to say that the most popular algorithms for distributed consensus, on their own, have nothing to do with horizontal scaling.

The way that horizontally scaling databases like Cockroach or Yugabyte or Spanner work is by sharding the data, transparent to the client. Within each shard data is replicated with a dedicated distributed consensus cluster.

So, yes, distributed consensus can be a part of horizontal scaling. But again what distributed consensus primarily solves is high availability via replication while remaining linearizable.

This is not a trivial point to make. etcd, consul, and rqlite are examples of databases that do not do sharding, only replication, via a single Raft cluster that replicates all data for the entire system.

For these databases there is no horizontal scaling. If they support "horizontal scaling", they support this by doing non-linearizable (stale) reads. Writes remain a challenge.

This doesn't mean these databases are bad. They are not. One obvious advantage they have over Cockroach or Spanner is that they are conceptually simpler. Conceptually simpler often equates to easier to operate. That's a big deal.

Optimizations

We've covered the basics of operation, but real-world implementations get more complex.

Snapshots

Rather than letting the log grow indefinitely, most libraries implement snapshotting. The user of the library provides a state machine and also provides a method for serializing the state machine to disk. The Raft library periodically serializes the state machine to disk and truncates the log.

When a follower is so far behind that the leader no longer has a log entry (because it has been truncated), the leader transfers an entire snapshot to the follower. Then once the follower is caught up on snapshots, the leader can transfer normal log entries again.

This technique is described in the Raft paper. While it isn't necessary for Raft to work, it's so important that it is hardly an optimization and more a required part of a production Raft system.
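
A sketch of the hooks a Raft library might ask you to implement for this (the names are illustrative, not from any particular library):

#include <string>

// The user provides both command interpretation and serialization.
struct StateMachine
{
  virtual void apply(const std::string &command) = 0;

  // Serialize the full state. Once a snapshot is durably written, the
  // log up through the snapshot's last included entry can be truncated.
  virtual std::string snapshot() = 0;

  // Rebuild state from a snapshot, e.g. on a follower so far behind
  // that the leader no longer has the log entries it needs.
  virtual void restore(const std::string &snapshot) = 0;

  virtual ~StateMachine() = default;
};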

Batching

Rather than limiting clients of the cluster to submitting only one command at a time, it's common for the cluster to accept many commands at a time. Similarly, many commands at a time are submitted to followers. When any node needs to write commands to disk, it can batch commands to disk as well.

But you can go a step beyond this in a way that is completely opaque to the Raft library. Each opaque command the client submits can also contain a batch of messages. In this scenario, only the user-provided state machine needs to be aware that each command it receives is actually a batch of messages that it should pull apart and interpret separately.

This latter technique is a fairly trivial way to increase throughput by an order of magnitude or two.
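
A sketch of that framing in C++ (length-prefixed messages inside one opaque command; the Raft library never needs to know):

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pack many messages into one opaque command: [u32 length][bytes]...
std::string pack(const std::vector<std::string> &messages)
{
  std::string command;
  for (const auto &m : messages)
  {
    uint32_t len = static_cast<uint32_t>(m.size());
    command.append(reinterpret_cast<const char *>(&len), sizeof(len));
    command.append(m);
  }
  return command;
}

// Only the user-provided state machine needs to undo the framing.
std::vector<std::string> unpack(const std::string &command)
{
  std::vector<std::string> messages;
  std::size_t i = 0;
  while (i + sizeof(uint32_t) <= command.size())
  {
    uint32_t len;
    std::memcpy(&len, command.data() + i, sizeof(len));
    i += sizeof(len);
    messages.push_back(command.substr(i, len));
    i += len;
  }
  return messages;
}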

Disk and network

In terms of how data is stored on disk and how data is sent over the network there is obvious room for optimization.

A naive implementation might store JSON on disk and send JSON over the network. A slightly more optimized implementation might store binary data on disk and send binary data over the network.

Similarly, you can swap out your RPC layer for gRPC or introduce zlib compression for the network or disk.

You can swap out synchronous IO for libaio or io_uring or SPDK/DPDK.

A little tweak I made in my latest Raft implementation was to index log entries so searching the log was not a linear operation. Another little tweak was to introduce a page cache to eliminate unnecessary disk reads. This increased throughput by an order of magnitude.

Flexible quorums

This brilliant optimization by Heidi Howard and co. shows you can relax the quorum required for committing new commands so long as you increase the quorum required for electing a leader.

In an environment where leader election doesn't happen often, flexible quorums can increase throughput and decrease latency. And it's a pretty easy change to make!
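
The core constraint is just that any commit quorum must intersect any election quorum. A sketch of the invariant:

// Flexible quorums: the two quorums may differ in size so long as
// they always overlap: commit_quorum + election_quorum > cluster_size.
constexpr bool quorums_intersect(int cluster, int commit_q, int election_q)
{
  return commit_q + election_q > cluster;
}

// Classic majorities in a 7-node cluster intersect...
static_assert(quorums_intersect(7, 4, 4), "majorities always intersect");
// ...but you can also commit with just 2 nodes if elections need 6 of 7.
static_assert(quorums_intersect(7, 2, 6), "smaller commit quorum is safe");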

More

These are just a couple common optimizations. You can also read about parallel state machine apply, parallel append to disk, witnesses, compartmentalization, and leader leases. TiKV, Scylla, RedPanda, and Cockroach tend to have public material talking about this stuff.

There are also a few people I follow who often review relevant papers, when they are not producing their own. If this topic interests you, I encourage you to seek out and follow such folks.

Safety and testing

The other aspect to consider is safety. For example, checksums for everything written to disk and passed over the network; or being able to recover from corruption in the log.

Testing is also a big deal. There are prominent tools like Jepsen that check for consistency in the face of fault injection (process failure, network failure, etc.). But even Jepsen has its limits. For example, it doesn't test disk failure.

FoundationDB made popular a number of testing techniques. And the people behind this testing went on to build a product, Antithesis, around deterministic testing of non-deterministic code while injecting faults.

And on that topic there's Facebook Experimental's Hermit deterministic Linux hypervisor that may help to test complex distributed systems. However, my experience with it has not been great and the maintainers do not seem very engaged with other people who have reported bugs. I'm hopeful for it but we'll see.

Antithesis and Hermit seem like a boon when half the trouble of working on distributed consensus implementations is avoiding flaky tests.

Another promising avenue is emitting logs during the Raft lifecycle and validating the logs against a TLA+ spec. Microsoft has such a project that has begun to see adoption among open-source Raft implementations.

Conclusion

Everything aside, consensus is expensive. There is overhead to the entire consensus process. So if you do not need this level of availability and can settle for something like restoring from backups, that system will have lower latency and higher throughput than one that runs everything through distributed consensus.

If you do need high availability, distributed consensus can be a great choice. But consider the environment and what you want from your consensus algorithm.

Also, while MultiPaxos, Raft, and Viewstamped Replication are some of the most popular algorithms for distributed consensus, there is a world beyond. Two-phase commit, ZooKeeper Atomic Broadcast, PigPaxos, EPaxos, Accord by Cassandra. The world of distributed consensus also gets especially weird and interesting outside of OLTP systems.

But that's enough for one post.


Writing a minimal in-memory storage engine for MySQL/MariaDB

2024-01-09 08:00:00

I spent a week looking at MySQL/MariaDB internals along with ~80 other devs. Although MySQL and MariaDB are mostly the same (more on that later), I focused on MariaDB specifically this week.

Before last week I had never built MySQL/MariaDB before. The first day of this hack week, I got MariaDB building locally and made a code tweak so that SELECT 23 returned 213, and another tweak so that SELECT 80 + 20 returned 60. The second day I got a basic UDF in C working so that SELECT mysum(20, 30) returned 50.

The rest of the week I spent figuring out how to build a minimal in-memory storage engine, which I'll walk through in this post. 218 lines of C++.

It supports CREATE, DROP, INSERT, and SELECT for tables that only have INTEGER fields. It is explicitly not thread-safe because I didn't have time to understand MariaDB's lock primitives.

In this post I'll also talk about how the MariaDB custom storage API compares to the Postgres one, based on a previous hack week project I did.

All code for this post can be found in my fork on GitHub.

MySQL and MariaDB

Before we go further though, why do I keep saying MySQL/MariaDB?

MySQL is GPL licensed (let's completely ignore the commercial variations of MySQL that Oracle offers). The code is open-source. However, the development is done behind closed doors. There is a code dump every month or so.

MariaDB is a fork of MySQL by the creator of MySQL (who is no longer involved, as it happens). It is also GPL licensed (let's completely ignore the commercial variations of MariaDB that MariaDB Corporation offers). The code is open-source. The development is also open-source.

When you install "MySQL" in your Linux distro you are often actually installing MariaDB.

The two are mostly compatible. During this week, I stumbled onto the fact that they evolved support for SELECT .. FROM VALUES .. differently. Some differences are documented on the MariaDB KB. But this KB is painful to browse. Which leads me to my next point.

The MySQL docs are excellent: easy to read, easy to browse, and thorough. The MariaDB docs are a work in progress. I'm sorry I can't be stoic: in just a week I've come to really hate using this KB. Thankfully, in some twisted way, it also doesn't seem to be very thorough. The KB isn't completely avoidable though, since there is no guarantee MySQL and MariaDB do the same thing.

Ultimately, I spent the week using MariaDB because I'm biased toward fully open projects. But I kept having to look at MySQL docs, hoping they were relevant.

Now that you understand the state of things, let's move on to fun stuff!

Storage engines

Mature databases often support swapping out the storage layer. Maybe you want an in-memory storage layer so that you can quickly run integration tests. Maybe you want to switch between B-Trees (read-optimized) and LSM Trees (write-optimized) and unordered heaps (write-optimized) depending on your workload. Or maybe you just want to try a third-party storage library (e.g. RocksDB or Sled or TiKV).

The benefit of swapping out only the storage engine is that, from a user's perspective, the semantics and features of the database stay mostly the same. But the database is magically faster for a workload.

You keep powerful user management, extension support, SQL support, and a well-known wire protocol. You modify only the method of storing the actual data.

Existing storage engines

MySQL/MariaDB is particularly well known for its custom storage engine support. The MySQL docs for alternate storage engines are great.

While the docs do warn that you should probably stick with the default storage engine, that warning didn't quite feel strong enough because nothing else seemed to indicate the state of other engines.

Specifically, in the past I was always interested in the CSV storage engine. But when you look at the actual code for the CSV engine there is a pretty strong warning:

First off, this is a play thing for me, there are a number of things
wrong with it:
  *) It was designed for csv and therefore its performance is highly
     questionable.
  *) Indexes have not been implemented. This is because the files can
     be traded in and out of the table directory without having to worry
     about rebuilding anything.
  *) NULLs and "" are treated equally (like a spreadsheet).
  *) There was in the beginning no point to anyone seeing this other
     then me, so there is a good chance that I haven't quite documented
     it well.
  *) Less design, more "make it work"

Now there are a few cool things with it:
  *) Errors can result in corrupted data files.
  *) Data files can be read by spreadsheets directly.

TODO:
 *) Move to a block system for larger files
 *) Error recovery, its all there, just need to finish it
 *) Document how the chains work.

-Brian

The difference between the seeming confidence of the docs and the decidedly less confident tone of the contributor made me chuckle.

The benefit of these diverse storage engines for me was that they give examples of how to implement the storage engine API. The csv, blackhole, example, and heap storage engines were particularly helpful to read.

The heap engine is a complete in-memory storage engine. Complete means complex though. So there seemed to be room for a stripped down version of an in-memory engine.

And that's what we'll cover in this post! First though I want to talk a little bit about the limitations of custom storage engines.

Limitations

While being able to tailor a storage engine to a workload is powerful, there are limits to the benefits based on the design of the storage API.

Both Postgres and MySQL/MariaDB currently have a custom storage API built around individual rows.

Column-wise execution

I have previously written that custom storage engines allow you to switch between column- and row-oriented data storage. Two big reasons to do column-wise storage are 1) opportunity for compression, and 2) fast operations on a single column.

The opportunity for 1) compression on disk would still exist even if you needed to deal with individual rows at the storage API layer, since compression could still happen at the disk layer. However, any benefits of passing around compressed columns in memory disappear if you must convert to rows for the storage API.

You'd also lose the advantage for 2) fast operations on a single column if the column must be converted into a row at the storage API whereupon it's passed to higher levels that perform execution. The execution would happen row-wise, not column-wise.

All of this is to say that while column-wise storage is possible, the benefit of doing so is not obvious with the current API design for both MySQL/MariaDB and Postgres.

Vectorization

An API built around individual rows also sets limits on the amount of vectorization you can do. A custom storage engine could still do some vectorization under the hood: always filling a buffer with N rows and returning a row from the buffer when the storage API requests a single row. But there is likely some degree of performance left on the table with an API that deals with individual rows.

Remember though: if you did batched reads and writes of rows in the custom storage layer, there isn't necessarily any vectorization happening at the execution layer. From a previous study I did, neither MySQL/MariaDB nor Postgres do vectorized query execution. This paragraph isn't a critique of the storage API, it's just something to keep in mind.

Storage versus execution

The general point I'm making here is that unless both the execution and storage APIs are designed in a certain way, you may attempt optimizations in the storage layer that are ineffective or even harmful because the execution layer doesn't or can't take advantage of them.

Nothing permanent

The current limitations of the storage API are not intrinsic aspects of MySQL/MariaDB or Postgres's design. For both projects there used to be no pluggable storage at all. We can imagine a future patch to either project that allows support for batched row reads and writes that together could make column-wise storage and vectorized execution more feasible.

Even today there have been invasive attempts to fully support column-wise storage and execution in Postgres. And there have also been projects to bring vectorized execution to Postgres.

I'm not familiar enough with the MySQL landscape to comment on efforts there at the moment.

Debug build of MariaDB running locally

Now that you've got some background, let's get a debug build of MariaDB!

$ git clone https://github.com/MariaDB/server mariadb
$ cd mariadb
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug ..
$ make -j8

This takes a while. When I was hacking on Postgres (a C project), it took 1 minute on my beefy Linux server to build. It took 20-30 minutes to build MySQL/MariaDB from scratch. That's C++ for you!

Thankfully incremental builds of MySQL/MariaDB for a tweak after the initial build take roughly the same time as incremental builds of Postgres after a tweak.

Once the build is done, create a database.

$ ./build/scripts/mariadb-install-db --srcdir=$(pwd) --datadir=$(pwd)/db

And create a config for the database.

$ echo "[client]
socket=$(pwd)/mariadb.sock

[mariadb]
socket=$(pwd)/mariadb.sock

basedir=$(pwd)
datadir=$(pwd)/db
pid-file=$(pwd)/db.pid" > my.cnf

Start up the server.

$ ./build/sql/mariadbd --defaults-extra-file=$(pwd)/my.cnf --debug:d:o,$(pwd)/db.debug
./build/sql/mariadbd: Can't create file '/var/log/mariadb/mariadb.log' (errno: 13 "Permission denied")
2024-01-03 17:10:15 0 [Note] Starting MariaDB 11.4.0-MariaDB-debug source revision 3fad2b115569864d8c1b7ea90ce92aa895cfef08 as process 185550
2024-01-03 17:10:15 0 [Note] InnoDB: !!!!!!!! UNIV_DEBUG switched on !!!!!!!!!
2024-01-03 17:10:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2024-01-03 17:10:15 0 [Note] InnoDB: Number of transaction pools: 1
2024-01-03 17:10:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-01-03 17:10:15 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-01-03 17:10:15 0 [Note] InnoDB: Completed initialization of buffer pool
2024-01-03 17:10:15 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2024-01-03 17:10:15 0 [Note] InnoDB: End of log at LSN=57155
2024-01-03 17:10:15 0 [Note] InnoDB: Opened 3 undo tablespaces
2024-01-03 17:10:15 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2024-01-03 17:10:15 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-01-03 17:10:15 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-01-03 17:10:15 0 [Note] InnoDB: log sequence number 57155; transaction id 16
2024-01-03 17:10:15 0 [Note] InnoDB: Loading buffer pool(s) from ./db/ib_buffer_pool
2024-01-03 17:10:15 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-01-03 17:10:15 0 [Note] Plugin 'wsrep-provider' is disabled.
2024-01-03 17:10:15 0 [Note] InnoDB: Buffer pool(s) load completed at 240103 17:10:15
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '::'.
2024-01-03 17:10:15 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
2024-01-03 17:10:15 0 [Note] ./build/sql/mariadbd: ready for connections.
Version: '11.4.0-MariaDB-debug'  socket: './mariadb.sock'  port: 3306  Source distribution

With that --debug flag, debug logs will show up in $(pwd)/db.debug. It's unclear why debug logs are treated separately from the console logs shown here. I'd rather they all be in one place.

In another terminal, run a client and make a request!

$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 11.4.0-MariaDB-debug Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> SELECT 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.001 sec)

Huzzah! Let's write a custom storage engine!

Where does the code go?

When writing an extension for some project, I usually expect to have the extension exist in its own repo. I was able to do this with the Postgres in-memory storage engine I wrote. And in general, Postgres extensions exist as their own repos.

I was able to create and build a UDF plugin outside the MariaDB source tree. But when it came to getting a storage engine to build and load successfully, I wasted almost an entire day (a large amount of time in a single hack week) getting nowhere.

Extensions for MySQL/MariaDB are most easily built via the CMake infrastructure within the repo. Surely there's some way to replicate that infrastructure from outside the repo but I wasn't able to figure it out within a day and didn't want to spend more time on it.

Apparently the normal thing to do in MySQL/MariaDB is to keep extensions within a fork of MySQL/MariaDB.

When I switched to this method I was able to very quickly get the storage engine building and loaded. So that's what we'll do.

Boilerplate

Within the MariaDB source tree, create a new folder in the storage subdirectory.

$ mkdir storage/memem

Within storage/memem/CMakeLists.txt add the following.

# Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
# 
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1335 USA

SET(MEMEM_SOURCES  ha_memem.cc ha_memem.h)
MYSQL_ADD_PLUGIN(memem ${MEMEM_SOURCES} STORAGE_ENGINE)

This hooks into MySQL/MariaDB build infrastructure. So next time you run make within the build directory we created above, it will also build this project.

The storage engine class

It would be nice to see a way to extend MySQL in C (for one, because it would then be easier to port to other languages). But all of the builtin storage methods use classes. So we'll do that too.

The class we must implement is an instance of handler. There is a single handler instance per thread, corresponding to a single running query. (Postgres gives each query its own process, MySQL gives each query its own thread.) However, handler instances are reused across different queries.

There are a number of virtual methods on handler we must implement in our subclass. For most of them we'll do nothing: simply returning immediately. These simple methods will be implemented in ha_memem.h. The methods with more complex logic will be implemented in ha_memem.cc.

Let's set up includes in ha_memem.h.

/* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; version 2 of the License.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335  USA */

#ifdef USE_PRAGMA_INTERFACE
#pragma interface /* gcc class implementation */
#endif

#include "thr_lock.h"
#include "handler.h"
#include "table.h"
#include "sql_const.h"

#include <vector>
#include <memory>

Next we'll define structs for our in-memory storage.

typedef std::vector<uchar> MememRow;

struct MememTable
{
  std::vector<std::shared_ptr<MememRow>> rows;
  std::shared_ptr<std::string> name;
};

struct MememDatabase
{
  std::vector<std::shared_ptr<MememTable>> tables;
};

Within ha_memem.cc we'll implement a global (not thread-safe) static MememDatabase* that all handler instances will query when requested. We need the definitions in the header file though because we'll store the table currently being queried in the handler subclass.

This is so that every call to write_row to write a single row or call to rnd_next to read a single row does not need to look up the in-memory table object N times within the same query.

And finally we'll define the subclass of handler and implementations of trivial methods.

class ha_memem final : public handler
{
  uint current_position= 0;
  std::shared_ptr<MememTable> memem_table= 0;

public:
  ha_memem(handlerton *hton, TABLE_SHARE *table_arg) : handler(hton, table_arg)
  {
  }
  ~ha_memem()= default;
  const char *index_type(uint key_number) { return ""; }
  ulonglong table_flags() const { return 0; }
  ulong index_flags(uint inx, uint part, bool all_parts) const { return 0; }
  /* The following defines can be increased if necessary */
#define MEMEM_MAX_KEY MAX_KEY     /* Max allowed keys */
#define MEMEM_MAX_KEY_SEG 16      /* Max segments for key */
#define MEMEM_MAX_KEY_LENGTH 3500 /* Like in InnoDB */
  uint max_supported_keys() const { return MEMEM_MAX_KEY; }
  uint max_supported_key_length() const { return MEMEM_MAX_KEY_LENGTH; }
  uint max_supported_key_part_length() const { return MEMEM_MAX_KEY_LENGTH; }
  int open(const char *name, int mode, uint test_if_locked) { return 0; }
  int close(void) { return 0; }
  int truncate() { return 0; }
  int rnd_init(bool scan);
  int rnd_next(uchar *buf);
  int rnd_pos(uchar *buf, uchar *pos) { return 0; }
  int index_read_map(uchar *buf, const uchar *key, key_part_map keypart_map,
                     enum ha_rkey_function find_flag)
  {
    return HA_ERR_END_OF_FILE;
  }
  int index_read_idx_map(uchar *buf, uint idx, const uchar *key,
                         key_part_map keypart_map,
                         enum ha_rkey_function find_flag)
  {
    return HA_ERR_END_OF_FILE;
  }
  int index_read_last_map(uchar *buf, const uchar *key,
                          key_part_map keypart_map)
  {
    return HA_ERR_END_OF_FILE;
  }
  int index_next(uchar *buf) { return HA_ERR_END_OF_FILE; }
  int index_prev(uchar *buf) { return HA_ERR_END_OF_FILE; }
  int index_first(uchar *buf) { return HA_ERR_END_OF_FILE; }
  int index_last(uchar *buf) { return HA_ERR_END_OF_FILE; }
  void position(const uchar *record) { return; }
  int info(uint flag) { return 0; }
  int external_lock(THD *thd, int lock_type) { return 0; }
  int create(const char *name, TABLE *table_arg, HA_CREATE_INFO *create_info);
  THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
                             enum thr_lock_type lock_type)
  {
    return to;
  }
  int delete_table(const char *name) { return 0; }

private:
  void reset_memem_table();
  virtual int write_row(const uchar *buf);
  int update_row(const uchar *old_data, const uchar *new_data)
  {
    return HA_ERR_WRONG_COMMAND;
  };
  int delete_row(const uchar *buf) { return HA_ERR_WRONG_COMMAND; }
};

A complete storage engine might seriously implement all of these methods. But we'll only seriously implement 7 of them.

To finish up the boilerplate, we'll switch over to ha_memem.cc and set up the includes.

/* Copyright (c) 2005, 2012, Oracle and/or its affiliates. All rights reserved.

  This program is free software; you can redistribute it and/or modify
  it under the terms of the GNU General Public License as published by
  the Free Software Foundation; version 2 of the License.

  This program is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  GNU General Public License for more details.

  You should have received a copy of the GNU General Public License
  along with this program; if not, write to the Free Software
  Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335  USA */

#ifdef USE_PRAGMA_IMPLEMENTATION
#pragma implementation // gcc: Class implementation
#endif

#define MYSQL_SERVER 1
#include <my_global.h>
#include "sql_priv.h"
#include "unireg.h"
#include "sql_class.h"

#include "ha_memem.h"

Ok! Let's dig into the implementation.

Implementation

The global database

First up, we need to declare a global MememDatabase* instance. We'll also implement a helper function for finding the index of a table by name within the database.

// WARNING! All accesses of `database` in this code are thread
// unsafe. Since this was written during a hack week, I didn't have
// time to figure out MySQL/MariaDB's runtime well enough to do the
// thread-safe version of this.
static MememDatabase *database;

static int memem_table_index(const char *name)
{
  int i;
  assert(database->tables.size() < INT_MAX);
  for (i= 0; i < (int) database->tables.size(); i++)
  {
    if (strcmp(database->tables[i]->name->c_str(), name) == 0)
    {
      return i;
    }
  }

  return -1;
}

As I wrote this post I noticed that this code also assumes there's only a single database. That isn't how MySQL works. Every time you call USE ... in MySQL you are switching between databases. You can query tables across databases. A real in-memory backend would need to be aware of the different databases, not just different tables. But to keep the code succinct we won't implement that in this post.

Next we'll implement plugin initialization and cleanup.

Plugin lifecycle

Before we register the plugin with MariaDB, we need to set up initialization and cleanup methods for it.

The initialization method will take care of initializing the global MememDatabase* database object. It will set up a handler for creating new instances of our handler subclass. And it will set up a handler for deleting tables.

static handler *memem_create_handler(handlerton *hton, TABLE_SHARE *table,
                                     MEM_ROOT *mem_root)
{
  return new (mem_root) ha_memem(hton, table);
}

static int memem_init(void *p)
{
  handlerton *memem_hton;

  memem_hton= (handlerton *) p;
  memem_hton->db_type= DB_TYPE_AUTOASSIGN;
  memem_hton->create= memem_create_handler;
  memem_hton->drop_table= [](handlerton *, const char *name) {
    int index= memem_table_index(name);
    if (index == -1)
    {
      return HA_ERR_NO_SUCH_TABLE;
    }

    database->tables.erase(database->tables.begin() + index);
    DBUG_PRINT("info", ("[MEMEM] Deleted table '%s'.", name));

    return 0;
  };
  memem_hton->flags= HTON_CAN_RECREATE;

  // Initialize global in-memory database.
  database= new MememDatabase;

  return 0;
}

The DBUG_PRINT macro is a debug helper MySQL/MariaDB gives us. As noted above, the output is directed to a file specified by the --debug flag. Unfortunately I couldn't figure out how to flush the stream this macro writes to. It seemed like, when there was a segfault, logs I expected to see occasionally weren't there. And the file would often contain what looked like partially written logs. Anyway, as long as there isn't a segfault, the debug file will eventually contain the DBUG_PRINT logs.

The only thing the plugin cleanup function must do is delete the global database.

static int memem_fini(void *p)
{
  delete database;
  return 0;
}

Now we can register the plugin!

Plugin registration

The maria_declare_plugin and maria_declare_plugin_end macros register the plugin's metadata (name, version, etc.) and callbacks.

struct st_mysql_storage_engine memem_storage_engine= {
    MYSQL_HANDLERTON_INTERFACE_VERSION};

maria_declare_plugin(memem){
    MYSQL_STORAGE_ENGINE_PLUGIN,
    &memem_storage_engine,
    "MEMEM",
    "MySQL AB",
    "In-memory database.",
    PLUGIN_LICENSE_GPL,
    memem_init, /* Plugin Init */
    memem_fini, /* Plugin Deinit */
    0x0100 /* 1.0 */,
    NULL,                          /* status variables                */
    NULL,                          /* system variables                */
    "1.0",                         /* string version */
    MariaDB_PLUGIN_MATURITY_STABLE /* maturity */
} maria_declare_plugin_end;

That's it! Now we need to implement methods for writing rows, reading rows, and creating a new table.

Create table

To create a table, we make sure one by this name doesn't already exist, make sure it only has INTEGER fields, allocate memory for the table, and append it to the global database.

int ha_memem::create(const char *name, TABLE *table_arg,
                     HA_CREATE_INFO *create_info)
{
  assert(memem_table_index(name) == -1);

  // We only support INTEGER fields for now.
  uint i= 0;
  while (table_arg->field[i])
  {
    if (table_arg->field[i]->type() != MYSQL_TYPE_LONG)
    {
      DBUG_PRINT("info", ("Unsupported field type."));
      return 1;
    }

    i++;
  }

  auto t= std::make_shared<MememTable>();
  t->name= std::make_shared<std::string>(name);
  database->tables.push_back(t);
  DBUG_PRINT("info", ("[MEMEM] Created table '%s'.", name));

  return 0;
}

Not very complicated. Let's handle INSERT-ing rows next.

Insert row

There is no method called when an INSERT starts. But there is a table field on the handler parent class that is kept up to date while a SELECT or INSERT is in progress. So we can fetch the current table from that field.

Since we have a slot for a std::shared_ptr<MememTable> memem_table on the ha_memem class, we can check if it is NULL when we insert a row. If it is, we look up the current table and set this->memem_table to its MememTable.

But there's a bit more to it than just the table name. The const char* name passed to the create() method above seems to be a sort of fully qualified name for the table. By observation, when creating a table y in a database test, the const char* name value is ./test/y. The . prefix probably means that the database is local, but I'm not sure.

So we'll write a helper method that will reconstruct the fully qualified table name before looking up that fully qualified table name in the global database.

void ha_memem::reset_memem_table()
{
  // Reset table cursor.
  current_position= 0;

  std::string full_name= "./" + std::string(table->s->db.str) + "/" +
                         std::string(table->s->table_name.str);
  DBUG_PRINT("info", ("[MEMEM] Resetting to '%s'.", full_name.c_str()));
  assert(database->tables.size() > 0);
  int index= memem_table_index(full_name.c_str());
  assert(index >= 0);
  assert(index < (int) database->tables.size());

  memem_table= database->tables[index];
}

Then we can use this within write_row to figure out the current MememTable being queried.

But first, let's digress into how MySQL stores rows.

The MySQL row API

When you write a Postgres custom storage API, you are expected to basically read from or write to an array of Datum.

Totally sensible.

In MySQL, you read from and write to an array of bytes. That's pretty weird to me. Of course you can build your own higher-level serialization/deserialization on top of it. But it's just strange to me that everyone has to know this basically opaque API.

Certainly it's documented.

The handler class is the interface for dynamically loadable
storage engines. Do not add ifdefs and take care when adding or
changing virtual functions to avoid vtable confusion

Functions in this class accept and return table columns data. Two data
representation formats are used:
1. TableRecordFormat - Used to pass [partial] table records to/from
   storage engine

2. KeyTupleFormat - used to pass index search tuples (aka "keys") to
   storage engine. See opt_range.cc for description of this format.

TableRecordFormat
=================
[Warning: this description is work in progress and may be incomplete]
The table record is stored in a fixed-size buffer:

  record: null_bytes, column1_data, column2_data, ...

The offsets of the parts of the buffer are also fixed: every column has 
an offset to its column{i}_data, and if it is nullable it also has its own
bit in null_bytes.

In our implementation, we'll skip support for NULL values. We'll only support INTEGER fields. But we still need to be aware that the first byte will be taken up by the NULL bitmap. We'll also assume the NULL bitmap won't be more than one byte.

It is this opaque byte array that we'll read from in write_row(const uchar* buf) and write to in rnd_next(uchar* buf).
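To make that concrete, here's a sketch of the buffer for one row of the y(i INT, j INT) table we'll create later, assuming a little-endian host, 4-byte INTEGER columns, and a one-byte NULL bitmap, plus a hypothetical helper that decodes one column:

#include <stdint.h>
#include <string.h>

// Buffer for the row (2, 1029) of CREATE TABLE y (i INT, j INT):
//
//   byte 0     : NULL bitmap (0x00 here: no NULL columns)
//   bytes 1..4 : i = 2    -> 02 00 00 00
//   bytes 5..8 : j = 1029 -> 05 04 00 00
//
// Hypothetical helper: decode 0-based column n as an int32.
static int32_t read_int_column(const unsigned char *buf, int n)
{
  int32_t value;
  memcpy(&value, buf + 1 + n * sizeof(int32_t), sizeof(value));
  return value;
}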

Insert row (take two)

To keep things simple we're going to store the row in MememTable the same way MySQL passes it around.

int ha_memem::write_row(const uchar *buf)
{
  if (memem_table == NULL)
  {
    reset_memem_table();
  }

  // Assume there are no NULLs.
  buf++;

  uint field_count = 0;
  while (table->field[field_count]) field_count++;

  // Store the row in the same format MariaDB gives us.
  auto row= std::make_shared<std::vector<uchar>>(
      buf, buf + sizeof(int) * field_count);
  memem_table->rows.push_back(row);

  return 0;
}

Which makes reading the row quite simple too!

Read row

The only slight difference between reading and writing a row is that MySQL/MariaDB will tell us when the SELECT scan for a table starts.

We'll use that opportunity to reset the current_position cursor and reset the memem_table field. Since, again, a handler instance serves only one query at a time but is reused for queries that run later, we have to clean up this state ourselves.

int ha_memem::rnd_init(bool scan)
{
  reset_memem_table();
  return 0;
}

int ha_memem::rnd_next(uchar *buf)
{
  if (current_position == memem_table->rows.size())
  {
    // Reset the in-memory table to make logic errors more obvious.
    memem_table= NULL;
    return HA_ERR_END_OF_FILE;
  }
  assert(current_position < memem_table->rows.size());

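  // Write a zeroed one-byte NULL bitmap, then copy column data after it.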
  uchar *ptr= buf;
  *ptr= 0;
  ptr++;

  // Rows internally are stored in the same format that MariaDB
  // wants. So we can just copy them over.
  std::shared_ptr<std::vector<uchar>> row= memem_table->rows[current_position];
  std::copy(row->begin(), row->end(), ptr);

  current_position++;
  return 0;
}

And we're done!

Build and test

Go back into the build directory we created within the source tree root and rerun make -j8.

Kill the server (you'll need to do something like killall mariadbd, since the server doesn't respond to Ctrl-C) and restart it.

For some reason this plugin doesn't need to be loaded explicitly. We can run SHOW PLUGINS; in the MariaDB CLI and we'll see it.

$ ./build/client/mariadb --defaults-extra-file=/home/phil/vendor/mariadb/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 5
Server version: 11.4.0-MariaDB-debug Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> SHOW PLUGINS;
+-------------------------------+----------+--------------------+-----------------+---------+
| Name                          | Status   | Type               | Library         | License |
+-------------------------------+----------+--------------------+-----------------+---------+
| binlog                        | ACTIVE   | STORAGE ENGINE     | NULL            | GPL     |
...
| MEMEM                         | ACTIVE   | STORAGE ENGINE     | NULL            | GPL     |
...
| BLACKHOLE                     | ACTIVE   | STORAGE ENGINE     | ha_blackhole.so | GPL     |
+-------------------------------+----------+--------------------+-----------------+---------+
73 rows in set (0.012 sec)

There we go! To create a table with it we need to set ENGINE = MEMEM. For example, CREATE TABLE x (i INT) ENGINE = MEMEM.

Let's create a script to try out the memem engine, in storage/memem/test.sql.

drop table if exists y;
drop table if exists z;

create table y(i int, j int) engine = MEMEM;
insert into y values (2, 1029);
insert into y values (92, 8);
select * from y where i + 8 = 10;

create table z(a int) engine = MEMEM;
insert into z values (322);
insert into z values (8);
select * from z where a > 20;

And run it.

$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test --table --verbose < storage/memem/test.sql
--------------
drop table if exists y
--------------

--------------
drop table if exists z
--------------

--------------
create table y(i int, j int) engine = MEMEM
--------------

--------------
insert into y values (2, 1029)
--------------

--------------
insert into y values (92, 8)
--------------

--------------
select * from y where i + 8 = 10
--------------

+------+------+
| i    | j    |
+------+------+
|    2 | 1029 |
+------+------+
--------------
create table z(a int) engine = MEMEM
--------------

--------------
insert into z values (322)
--------------

--------------
insert into z values (8)
--------------

--------------
select * from z where a > 20
--------------

+------+
| a    |
+------+
|  322 |
+------+

What you see there is the power of storage engines! We get the full SQL language even though we implemented storage somewhere completely different from the default.

In-memory is boring

Certainly, I'm getting bored doing the same project over and over again on different databases. However, it's minimal projects like this that make it super easy to then go and port the storage engine to something else.

The goal here is to be minimal but meaningful. And I've accomplished that for myself at least!

On ChatGPT

As I've written before, this sort of exploration wouldn't be possible within the time frame I gave myself if it weren't for ChatGPT. Specifically, the paid-tier GPT-4.

Neither the MySQL nor the MariaDB docs were so helpful that I could immediately figure out things like how to get the current table name within a scan (the table member of the handler class).

With ChatGPT you can ask questions like: "In a MySQL C++ plugin, how do I get the name of the table from a handler class as a C string?". Sometimes it's right and sometimes it's not. But you can try out the code, and if it builds it is at least somewhat correct!

Make your own way

2023-12-27 08:00:00

Over the years, I have repeatedly felt like I missed the timing for a meetup or an IRC group or social media in general. I'd go to a meetup every so often but I'd never make a meaningful connection with people, whereas everyone else knew each other. I'd join an IRC group and have difficulty catching up with what seemed to be the flow of conversation.

I hadn't thought much about this until the pandemic, when I started a Discord group for software internals and a virtual tech talk series called Hacker Nights. Since 2021 the Discord has grown to around 1,500 members, with ~20 fairly active. And the Meetup peaked at about 300 members, with 10-20 showing up to each event.

After the pandemic receded I started an NYC-based book club that ran over 2 months with about 5-8 active attendees. I ran a virtual hack week on Discord where I got ~100 devs into a temporary Discord server and we talked about Postgres internals and shared resources. Ultimately around 5 of us wrote blog posts and built new projects to explore Postgres.

I started a virtual, async email book club (still ongoing) with 300 devs, running from November 2023 to February 2024. There have been around 20 active members of the club. And each week the discussion is kicked off by one of the members, not myself.

And I felt like there wasn't enough community opportunity for folks in systems programming in NYC, so I started a Manhattan-based Systems Coffee Club. Around 15 people showed up to the first meeting and seemed even more excited about it than I was. (And I was excited!) We'll see where it goes from here.

Organizing people to do this stuff doesn't come easy to me. I enjoy doing it to a degree, but every night before an event I have trouble sleeping, worried about embarrassing myself. When the event happens though, and people are happy to be there to chat with everyone else, as they invariably have been, it makes it worthwhile.

Everyone wants community

Something I realized along the way is that people (maybe devs especially, I don't know) are looking for community. And whenever I've noticed a missing flashpoint (a topic, a career focus, a book, etc.) for community, it's been pretty easy to get people together around it.

The lifecycle of groups

Groups, meetups, naturally live and die. Organizers get burnt out. I don't see this as a problem. It's just the way it is.

At some point I'll get burnt out too. Or I'll get pickier. For example, I've been avoiding starting a systems programming meetup in NYC because I know it will be a big effort. So I've done lower effort groups like book clubs and coffee clubs.

Don't worry about signing yourself up for indefinite work. Just do whatever you'd like to and don't feel bad if you have to stop. Someone else will eventually start the next great group, even if it comes in a different medium or flavor.

Community is contagious

There are great communities out there that have inspired me.

And this year I've been hearing about more.

And I've heard rumors of yet a few more systems programming groups being started on the US West Coast and in Stockholm.

Do whatever you want!

If you feel like you can't find the right group, or that you don't fit in with existing groups, or that you're missing a moment, there are surely other folks in the same boat, waiting for a new group to join. You may be the catalyst.

There's enormous potential for getting people together and doing something interesting, and there isn't necessarily anyone telling you you should. Things you try may work and they may not. The more you try, the more you'll learn what works and what doesn't. I've had a few years of organizing mistakes to hone that sense.

The only boring thing to do is to limit yourself to the sort of thing others have done before! Run a browser meetup instead of a React meetup. Interview hardware developers to teach software developers something. Get software developers with 20 years of experience in niche fields to teach the rest of us something. Read books beyond SICP or Clean Code. Try difficult programming projects.

Whatever you want to do though, don't let me deter you. If you think something should exist, give it a shot!

Exploring a Postgres query plan

2023-11-19 08:00:00

I learned this week that you can intercept and redirect Postgres query execution. You can hook into the execution layer so you're given a query plan and you get to decide what to do with it. What rows to return, if any, and where they come from.

That's very interesting. So I started writing code to explore execution hooks. However, I got stuck interpreting the query plan. Either there's no query plan walking infrastructure or I just didn't find it.

So this post is a digression into walking a Postgres query plan. By the end we'll be able to run psql -c 'SELECT a FROM x WHERE a > 1' and reconstruct the entire SQL string from a Postgres QueryDesc object, the query plan object Postgres builds.

With that query plan walking infrastructure in place, we'll be in a good state to not just print out the query plan while walking it but instead to translate the query plan or evaluate it in our own way (e.g. over column-wise data, or vectorized execution over row-wise data).

Code for this project is available on Github.

What is a query plan?

If you're familiar with parsers and compilers, a query plan is like an intermediate representation (IR) of a program. It is not as raw as an abstract syntax tree (AST); it has already been optimized.

If that doesn't mean anything to you, think of a query plan as a structured and optimized version of the SQL query you submit to your database. It isn't text anymore. It is probably a tree.

Check out another Justin Jaffray article on the subject for more detail.

Development environment

Before we get to walking the query plan, let's set up the infrastructure to intercept query execution where we can eventually add in our print debugging of the query plan reconstructed as a SQL string.

Once you've got Postgres build dependencies, build and install a debug version of Postgres:

$ git clone https://github.com/postgres/postgres && cd postgres
$ # Make sure you're on the same commit I'm on, just to be safe.
$ git checkout b218fbb7a35fcf31539bfad12732038fe082a2eb
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
$ make -j8
$ # Installs to /usr/local/pgsql/bin.
$ sudo make install

I'm not going to cover Postgres extension infrastructure in detail. I wrote a bit about it in my last post. You need only read the first half, if at all; not the actual Table Access Method implementation.

It will be even simpler in this post because Postgres hooks are extensions but not extensions you install with CREATE EXTENSION. If you want to read about the different kinds of Postgres extensions, check out this article by Steven Miller.

The minimum we need, aside from the hook code itself, is a Makefile that uses PGXS:

MODULES = pgexec

PG_CONFIG = /usr/local/pgsql/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

The MODULES value there corresponds to the C file we'll create shortly, pgexec.c.

This pg_config binary path matters because you might have different versions of Postgres installed, for example by your package manager. The extension must be built against the same version of Postgres that will load it.

Now we're ready for some hook code.

Intercepting query execution

You can find the basic structure of a hook (and which hooks are available) in Tamika Nomara's unofficial Postgres hooks docs.

I couldn't find an official central place in the Postgres docs describing all hooks, though some hooks are described in various places throughout.

Based on that page, we can write a bare minimum hook that will intercept queries, log when we've done so, and pass control back to the standard execution path for the actual query. In pgexec.c:

#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"

PG_MODULE_MAGIC;

static ExecutorRun_hook_type prev_executor_run_hook = NULL;

static void print_plan(QueryDesc* queryDesc) {
  elog(LOG, "[pgexec] HOOKED SUCCESSFULLY!");
}

static void pgexec_run_hook(
  QueryDesc* queryDesc,
  ScanDirection direction,
  uint64 count,
  bool execute_once
) {
  print_plan(queryDesc);
  return prev_executor_run_hook(queryDesc, direction, count, execute_once);
}

void _PG_init(void) {
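  // Save any hook another extension installed so we can chain to it,
  // falling back to the standard executor if there is none.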
  prev_executor_run_hook = ExecutorRun_hook;
  if (prev_executor_run_hook == NULL) {
    prev_executor_run_hook = standard_ExecutorRun;
  }
  ExecutorRun_hook = pgexec_run_hook;
}

void _PG_fini(void) {
  ExecutorRun_hook = prev_executor_run_hook;
}

You can discover the standard_ExecutorRun function from a quick git grep ExecutorRun_hook in the Postgres source, which leads to src/backend/executor/execMain.c#L306:

void
ExecutorRun(QueryDesc *queryDesc,
            ScanDirection direction, uint64 count,
            bool execute_once)
{
    if (ExecutorRun_hook)
        (*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
    else
        standard_ExecutorRun(queryDesc, direction, count, execute_once);
}

So our hook will just log and pass back execution to the existing execution hook. Let's build and install the extension.

$ make
$ sudo make install

Now we'll create a new database and tell it to load the extension.

$ /usr/local/pgsql/bin/initdb test-db
$ echo "shared_preload_libraries = 'pgexec'" > test-db/postgresql.conf

Remember, hooks are not CREATE EXTENSION extensions. As far as I can tell they can't be dynamically loaded (without some additional dynamic loading infrastructure one could potentially write). So every time you make a change you need to rebuild the extension, reinstall it, and restart the Postgres server.

And start the server in the foreground:

$ /usr/local/pgsql/bin/postgres \
  --config-file=$(pwd)/test-db/postgresql.conf \
  -D $(pwd)/test-db \
  -k $(pwd)/test-db
2023-11-18 19:35:16.680 GMT [3215547] LOG:  starting PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), 64-bit
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on IPv6 address "::1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-18 19:35:16.682 GMT [3215550] LOG:  database system was shut down at 2023-11-18 19:20:16 GMT
2023-11-18 19:35:16.684 GMT [3215547] LOG:  database system is ready to accept connections

Keep an eye on this foreground process since this is where elog(LOG, ...) calls will show up.

Now in a new window, create a test.sql script that we can use to exercise the hook:

DROP TABLE IF EXISTS x;
CREATE TABLE x (a INT);
INSERT INTO x VALUES (309);
SELECT a FROM x WHERE a > 1;

Run psql so we can trigger the hook:

$ /usr/local/pgsql/bin/psql -h localhost postgres -f test.sql
DROP TABLE
CREATE TABLE
INSERT 0 1
  a
-----
 309
(1 row)

And in the postgres foreground process you should see a log:

2023-11-19 17:42:03.045 GMT [3242321] LOG:  [pgexec] HOOKED SUCCESSFULLY!
2023-11-19 17:42:03.045 GMT [3242321] STATEMENT:  INSERT INTO x VALUES (309);
2023-11-19 17:42:03.045 GMT [3242321] LOG:  [pgexec] HOOKED SUCCESSFULLY!
2023-11-19 17:42:03.045 GMT [3242321] STATEMENT:  SELECT a FROM x WHERE a > 1;

That's our hook! Interestingly, only the INSERT and SELECT statements show up, not the DROP and CREATE.

Now let's see if we can reconstruct the query text from that first argument, the QueryDesc* that pgexec_run_hook receives. And let's simplify things for ourselves and only worry about reconstructing a SELECT query.

Nodes and Datums

But first, let's talk about two fundamental ways data in Postgres (code) is organized.

Postgres code is extremely dynamic and, maybe relatedly, fairly object-oriented. Almost every entity in Postgres code is a Node, while values exposed to users of Postgres are Datums.

Each node has a type, NodeTag, that we can switch on to decide what to do. In contrast, Datum has no type. The type of the Datum must be known by context before using one of the transform functions like DatumGetBool to retrieve a C value from a Datum.

A table is a Node. A query plan is a Node. A sequential scan is a Node. A join is a Node. A literal in a query is a Node. The value for the literal in a query is a Datum.
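As a small illustrative sketch (not code from pgexec.c; it just assumes postgres.h and the node headers are in scope), dispatching on a Node and decoding a Datum look roughly like this:

static void describe(Node* node, Datum datum_known_to_be_int4) {
  // Every Node carries a tag we can switch on.
  switch (nodeTag(node)) {
  case T_SeqScan:
    elog(LOG, "node is a sequential scan");
    break;

  default:
    elog(LOG, "some other node: %d", (int)nodeTag(node));
  }

  // A Datum carries no type of its own; the caller must know it from
  // context before picking a DatumGetXXX conversion.
  elog(LOG, "int4 value: %d", DatumGetInt32(datum_known_to_be_int4));
}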

Here is how The Internals of PostgreSQL book visualizes a query plan for example:

https://www.interdb.jp/pg/img/fig-3-04.png

Every box in that image is a Node.

And all Nodes in code I've seen share a common definition prefix like this:

typedef struct SomeThing {
  pg_node_attr(abstract) // If the node is indeed abstract in the OOP sense.
  NodeTag type;
}

Many Nodes you'll see are abstract, like Plan. But by printing out the NodeTag and looking the value up in src/include/nodes/nodetags.h, you can find the concrete type of the Node.

src/include/nodes/nodetags.h is generated during a preprocessing step. (Don't look if regex in Perl worries you).

We'll get back to Nodes later.

What's in a QueryDesc?

Let's take a look at the QueryDesc struct:

typedef struct QueryDesc
{
    /* These fields are provided by CreateQueryDesc */
    CmdType     operation;      /* CMD_SELECT, CMD_UPDATE, etc. */
    PlannedStmt *plannedstmt;   /* planner's output (could be utility, too) */
    const char *sourceText;     /* source text of the query */
    Snapshot    snapshot;       /* snapshot to use for query */
    Snapshot    crosscheck_snapshot;    /* crosscheck for RI update/delete */
    DestReceiver *dest;         /* the destination for tuple output */
    ParamListInfo params;       /* param values being passed in */
    QueryEnvironment *queryEnv; /* query environment passed in */
    int         instrument_options; /* OR of InstrumentOption flags */

    /* These fields are set by ExecutorStart */
    TupleDesc   tupDesc;        /* descriptor for result tuples */
    EState     *estate;         /* executor's query-wide state */
    PlanState  *planstate;      /* tree of per-plan-node state */

    /* This field is set by ExecutorRun */
    bool        already_executed;   /* true if previously executed */

    /* This is always set NULL by the core system, but plugins can change it */
    struct Instrumentation *totaltime;  /* total time spent in ExecutorRun */
} QueryDesc;

The PlannedStmt field looks interesting. Let's take a look:

typedef struct PlannedStmt
{
    pg_node_attr(no_equal, no_query_jumble)

    NodeTag     type;

    CmdType     commandType;    /* select|insert|update|delete|merge|utility */

    uint64      queryId;        /* query identifier (copied from Query) */

    bool        hasReturning;   /* is it insert|update|delete RETURNING? */

    bool        hasModifyingCTE;    /* has insert|update|delete in WITH? */

    bool        canSetTag;      /* do I set the command result tag? */

    bool        transientPlan;  /* redo plan when TransactionXmin changes? */

    bool        dependsOnRole;  /* is plan specific to current role? */

    bool        parallelModeNeeded; /* parallel mode required to execute? */

    int         jitFlags;       /* which forms of JIT should be performed */

    struct Plan *planTree;      /* tree of Plan nodes */

    List       *rtable;         /* list of RangeTblEntry nodes */

    List       *permInfos;      /* list of RTEPermissionInfo nodes for rtable
                                 * entries needing one */

    /* rtable indexes of target relations for INSERT/UPDATE/DELETE/MERGE */
    List       *resultRelations;    /* integer list of RT indexes, or NIL */

    List       *appendRelations;    /* list of AppendRelInfo nodes */

    List       *subplans;       /* Plan trees for SubPlan expressions; note
                                 * that some could be NULL */

    Bitmapset  *rewindPlanIDs;  /* indices of subplans that require REWIND */

    List       *rowMarks;       /* a list of PlanRowMark's */

    List       *relationOids;   /* OIDs of relations the plan depends on */

    List       *invalItems;     /* other dependencies, as PlanInvalItems */

    List       *paramExecTypes; /* type OIDs for PARAM_EXEC Params */

    Node       *utilityStmt;    /* non-null if this is utility stmt */

    /* statement location in source string (copied from Query) */
    int         stmt_location;  /* start location, or -1 if unknown */
    int         stmt_len;       /* length in bytes; 0 means "rest of string" */
} PlannedStmt;

The struct Plan* planTree field in there looks like what we'd want. But Plan is abstract:

typedef struct Plan
{
    pg_node_attr(abstract, no_equal, no_query_jumble)

    NodeTag     type;

So let's try printing out the planTree->type field to find which Node it concretely is. In pgexec.c, change the definition of print_plan:

static void print_plan(QueryDesc* queryDesc) {
  elog(LOG, "[pgexec] HOOKED SUCCESSFULLY! %d", queryDesc->plannedstmt->planTree->type);
}

Rebuild and reinstall the extension, and restart Postgres:

$ make
$ sudo make install
$ /usr/local/pgsql/bin/postgres \
  --config-file=$(pwd)/test-db/postgresql.conf \
  -D $(pwd)/test-db \
  -k $(pwd)/test-db
2023-11-18 19:35:16.680 GMT [3215547] LOG:  starting PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), 64-bit
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on IPv6 address "::1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-18 19:35:16.682 GMT [3215550] LOG:  database system was shut down at 2023-11-18 19:20:16 GMT
2023-11-18 19:35:16.684 GMT [3215547] LOG:  database system is ready to accept connections

And in another window run psql:

$ /usr/local/pgsql/bin/psql -h localhost postgres -f test.sql

Then check the logs from the postgres process we just started and you should notice:

2023-11-19 17:46:18.834 GMT [3242495] LOG:  [pgexec] HOOKED SUCCESSFULLY! 322
2023-11-19 17:46:18.834 GMT [3242495] STATEMENT:  SELECT a FROM x WHERE a > 1;

So 322 is the NodeTag for the Plan. If we look that up in Postgres's src/include/nodes/nodetags.h (remember, this is generated after ./configure && make so I can't link you to it):

$ grep ' = 322' src/include/nodes/nodetags.h
        T_SeqScan = 322,

Hey, that makes sense! A SELECT without any indexes definitely sounds like a sequential scan!

Walking a sequential scan

Let's take a look at the SeqScan struct:

typedef struct SeqScan
{
    Scan        scan;
} SeqScan;

Ok, that's not very interesting. Let's look at Scan then:

typedef struct Scan
{
    pg_node_attr(abstract)

    Plan        plan;
    Index       scanrelid;      /* relid is index into the range table */
} Scan;

That's interesting! scanrelid represents the table we're scanning. I don't know what "range table" means exactly. But there was a field on the PlannedStmt called rtable that seems relevant.

rtable was described as a List of RangeTblEntry nodes. And browsing around the file where List is defined we can see some nice methods for working with Lists, like list_length().

Let's print out the scanrelid and check the length of the rtable to see if it's filled out. Let's also restrict our print_plan code to only look at SeqScan nodes. In pgexec.c:

static void print_plan(QueryDesc* queryDesc) {
  SeqScan* scan = NULL;
  Plan* plan = queryDesc->plannedstmt->planTree;
  if (plan->type != T_SeqScan) {
    elog(LOG, "[pgexec] Unsupported plan type.");
    return;
  }

  scan = (SeqScan*)plan;

  elog(LOG, "[pgexec] relid: %d, rtable length: %d", scan->scan.scanrelid, list_length(queryDesc->plannedstmt->rtable));
}

Rebuild and reinstall the extension, and restart Postgres. (You can find the instructions for this above if you've forgotten.) Re-run the test.sql script. And check the Postgres server logs. You should see:

2023-11-19 18:00:34.184 GMT [3244438] LOG:  [pgexec] relid: 1, rtable length: 1
2023-11-19 18:00:34.184 GMT [3244438] STATEMENT:  SELECT a FROM x WHERE a > 1;

Awesome! So rtable does have data in it. There's only one table in this query, so a length of 1 makes sense. The scanrelid being 1 rather than 0 is a little weird though; it suggests the index is 1-based. Let's fetch the nth value from the rtable list using scanrelid-1 as the index.

For the RangeTblEntry itself, let's take a look:

typedef enum RTEKind
{
    RTE_RELATION,               /* ordinary relation reference */
    RTE_SUBQUERY,               /* subquery in FROM */
    RTE_JOIN,                   /* join */
    RTE_FUNCTION,               /* function in FROM */
    RTE_TABLEFUNC,              /* TableFunc(.., column list) */
    RTE_VALUES,                 /* VALUES (<exprlist>), (<exprlist>), ... */
    RTE_CTE,                    /* common table expr (WITH list element) */
    RTE_NAMEDTUPLESTORE,        /* tuplestore, e.g. for AFTER triggers */
    RTE_RESULT,                 /* RTE represents an empty FROM clause; such
                                 * RTEs are added by the planner, they're not
                                 * present during parsing or rewriting */
} RTEKind;

typedef struct RangeTblEntry
{
    pg_node_attr(custom_read_write, custom_query_jumble)

    NodeTag     type;

    RTEKind     rtekind;        /* see above */

    /*
     * XXX the fields applicable to only some rte kinds should be merged into
     * a union.  I didn't do this yet because the diffs would impact a lot of
     * code that is being actively worked on.  FIXME someday.
     */

    /*
     * Fields valid for a plain relation RTE (else zero):
     *
     * rellockmode is really LOCKMODE, but it's declared int to avoid having
     * to include lock-related headers here.  It must be RowExclusiveLock if
     * the RTE is an INSERT/UPDATE/DELETE/MERGE target, else RowShareLock if
     * the RTE is a SELECT FOR UPDATE/FOR SHARE target, else AccessShareLock.
     *
     * Note: in some cases, rule expansion may result in RTEs that are marked
     * with RowExclusiveLock even though they are not the target of the
     * current query; this happens if a DO ALSO rule simply scans the original
     * target table.  We leave such RTEs with their original lockmode so as to
     * avoid getting an additional, lesser lock.
     *
     * perminfoindex is 1-based index of the RTEPermissionInfo belonging to
     * this RTE in the containing struct's list of same; 0 if permissions need
     * not be checked for this RTE.
     *
     * As a special case, relid, relkind, rellockmode, and perminfoindex can
     * also be set (nonzero) in an RTE_SUBQUERY RTE.  This occurs when we
     * convert an RTE_RELATION RTE naming a view into an RTE_SUBQUERY
     * containing the view's query.  We still need to perform run-time locking
     * and permission checks on the view, even though it's not directly used
     * in the query anymore, and the most expedient way to do that is to
     * retain these fields from the old state of the RTE.
     *
     * As a special case, RTE_NAMEDTUPLESTORE can also set relid to indicate
     * that the tuple format of the tuplestore is the same as the referenced
     * relation.  This allows plans referencing AFTER trigger transition
     * tables to be invalidated if the underlying table is altered.
     */
    Oid         relid;          /* OID of the relation */
    char        relkind;        /* relation kind (see pg_class.relkind) */
    int         rellockmode;    /* lock level that query requires on the rel */
    struct TableSampleClause *tablesample;  /* sampling info, or NULL */
    Index       perminfoindex;

In SELECT a FROM x, x should be a plain relation RTE (to use the terminology there). So we can add a guard that validates that. But we don't get a Relation. (You might remember from my previous post that Relation is where we can finally see the table name.)

We get an Oid for the Relation. So we need to find a way to look up a Relation from an Oid. And by grepping around in Postgres (or via judicious use of ChatGPT, I confess), we can find RelationIdGetRelation, which takes an Oid and returns a Relation. Notice also that the comment says we should close the relation when we're done, with RelationClose.

So putting it all together (and again, reusing some code from that previous post), we can print out the table name.

static void print_plan(QueryDesc* queryDesc) {
  SeqScan* scan = NULL;
  RangeTblEntry* rte = NULL;
  Relation relation = {};
  char* tablename = NULL;
  Plan* plan = queryDesc->plannedstmt->planTree;
  if (plan->type != T_SeqScan) {
    elog(LOG, "[pgexec] Unsupported plan type.");
    return;
  }

  scan = (SeqScan*)plan;
  rte = list_nth(queryDesc->plannedstmt->rtable, scan->scan.scanrelid-1);
  if (rte->rtekind != RTE_RELATION) {
    elog(LOG, "[pgexec] Unsupported FROM type: %d.", rte->rtekind);
    return;
  }

  relation = RelationIdGetRelation(rte->relid);
  tablename = NameStr(relation->rd_rel->relname);

  elog(LOG, "[pgexec] SELECT [todo] FROM %s", tablename);

  RelationClose(relation);
}

You'll also need to add a new #include for utils/rel.h.

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 18:36:03.986 GMT [3246777] LOG:  [pgexec] SELECT [todo] FROM x
2023-11-19 18:36:03.986 GMT [3246777] STATEMENT:  SELECT a FROM x WHERE a > 1;

Fantastic! Before we get into walking the SELECT columns and the (optional) WHERE clause, let's do some quick refactoring.

A string builder

Let's add a little string builder library so we can build up a single string and emit it with a single elog() call.

I wrote this ahead of time and won't explain it here since the details aren't relevant.

Just copy this and paste near the top of pgexec.c:

typedef struct {
  char* mem;
  size_t len;
  size_t offset;
} PGExec_Buffer;

static void buffer_init(PGExec_Buffer* buf) {
  buf->offset = 0;
  buf->len = 8;
  buf->mem = (char*)malloc(sizeof(char) * buf->len);
}

static void buffer_resize_to_fit_additional(PGExec_Buffer* buf, size_t additional) {
  char* new = {};
  size_t newsize = 0;

  Assert(additional >= 0);

  if (buf->offset + additional < buf->len) {
    return;
  }

  newsize = (buf->offset + additional) * 2;
  new = (char*)malloc(sizeof(char) * newsize);
  Assert(new != NULL);
  memcpy(new, buf->mem, buf->len * sizeof(char));
  free(buf->mem);
  buf->len = newsize;
  buf->mem = new;
}

static void buffer_append(PGExec_Buffer*, char*, size_t);

static void buffer_appendz(PGExec_Buffer* buf, char* c) {
  buffer_append(buf, c, strlen(c));
}

static void buffer_append(PGExec_Buffer* buf, char* c, size_t chars) {
  buffer_resize_to_fit_additional(buf, chars);
  memcpy(buf->mem + buf->offset, c, chars);
  buf->offset += chars;
}

static void buffer_appendf(
  PGExec_Buffer *,
  const char* restrict,
  ...
) __attribute__ ((format (gnu_printf, 2, 3)));

static void buffer_appendf(PGExec_Buffer *buf, const char* restrict fmt, ...) {
  // First figure out how long the result will be.
  size_t chars = 0;
  va_list arglist;
  va_start(arglist, fmt);
  chars = vsnprintf(0, 0, fmt, arglist);
  Assert(chars >= 0); // TODO: error handling.

  // Resize to fit result.
  buffer_resize_to_fit_additional(buf, chars);

  // Actually do the printf into buf.
  va_end(arglist);
  va_start(arglist, fmt);
  chars = vsprintf(buf->mem + buf->offset, fmt, arglist);
  Assert(chars >= 0); // TODO: error handling.
  buf->offset += chars;
  va_end(arglist);
}

static char* buffer_cstring(PGExec_Buffer* buf) {
  char zero = 0;
  const size_t prev_offset = buf->offset;

  if (buf->offset == buf->len) {
    buffer_append(buf, &zero, 1);
    buf->offset--;
  } else {
    buf->mem[buf->offset] = 0;
  }

  // Offset should stay the same. This is a fake NULL.
  Assert(buf->offset == prev_offset);

  return buf->mem;
}

static void buffer_free(PGExec_Buffer* buf) {
  free(buf->mem);
}
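A quick, hypothetical usage sketch, so the API is clear even without explanation:

static void buffer_demo(void) {
  PGExec_Buffer buf = {};
  buffer_init(&buf);

  buffer_appendz(&buf, "SELECT ");        // Append a NUL-terminated string.
  buffer_appendf(&buf, "%d + %d", 1, 2);  // Append printf-style.

  // buffer_cstring() NUL-terminates without moving the offset, so more
  // appends can follow later.
  elog(LOG, "[pgexec] %s", buffer_cstring(&buf)); // Logs: SELECT 1 + 2

  buffer_free(&buf);
}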

Next we'll modify print_plan() in pgexec.c to use it, and add stubs for printing the SELECT columns and WHERE clauses.

static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  buffer_appendz(buf, " [where todo]");
}

static void buffer_print_select_columns(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  buffer_appendz(buf, "[columns todo]");
}

static void print_plan(QueryDesc* queryDesc) {
  SeqScan* scan = NULL;
  RangeTblEntry* rte = NULL;
  Relation relation = {};
  char* tablename = NULL;
  Plan* plan = queryDesc->plannedstmt->planTree;
  PGExec_Buffer buf = {};

  if (plan->type != T_SeqScan) {
    elog(LOG, "[pgexec] Unsupported plan type.");
    return;
  }

  scan = (SeqScan*)plan;
  rte = list_nth(queryDesc->plannedstmt->rtable, scan->scan.scanrelid-1);
  if (rte->rtekind != RTE_RELATION) {
    elog(LOG, "[pgexec] Unsupported FROM type: %d.", rte->rtekind);
    return;
  }

  buffer_init(&buf);

  relation = RelationIdGetRelation(rte->relid);
  tablename = NameStr(relation->rd_rel->relname);

  buffer_appendz(&buf, "SELECT ");
  buffer_print_select_columns(&buf, queryDesc, plan);
  buffer_appendf(&buf, " FROM %s", tablename);
  buffer_print_where(&buf, queryDesc, plan);

  elog(LOG, "[pgexec] %s", buffer_cstring(&buf));

  RelationClose(relation);
  buffer_free(&buf);
}

Now we just need to implement the buffer_print_where and buffer_print_select_columns functions and our walking infrastructure will be done! For now. :)

Walking the WHERE clause

If you remember back to the SeqScan and Scan nodes, they were both basically empty. They had a Plan and a scanrelid. So the rest of the SELECT info must be in the Plan since it wasn't in the Scan.

Let's look at Plan again. One part that stands out is:

    /*
     * Common structural data for all Plan types.
     */
    int         plan_node_id;   /* unique across entire final plan tree */
    List       *targetlist;     /* target list to be computed at this node */
    List       *qual;           /* implicitly-ANDed qual conditions */
    struct Plan *lefttree;      /* input plan tree(s) */
    struct Plan *righttree;
    List       *initPlan;       /* Init Plan nodes (un-correlated expr
                                 * subselects) */

qual kinda looks like a WHERE clause. (And targetlist kinda looks like the columns the SELECT pulls).

Lists just contain void pointers, so we can't tell what the type of qual or targetlist children are. But I'm going to make a wild guess they are Nodes.

There's even a nice helper that casts void pointers to Node* and pulls out the type, nodeTag().

And reading around pg_list.h shows some interesting helper utilities like foreach that we can use to iterate the list.

Let's try printing out the type of qual's members.

static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  ListCell* cell = NULL;
  bool first = true;
  if (plan->qual == NULL) {
    return;
  }

  buffer_appendz(buf, " WHERE ");
  foreach(cell, plan->qual) {
    if (!first) {
      buffer_appendz(buf, " AND ");
    }

    first = false;
    buffer_appendf(buf, "[node: %d]", nodeTag(lfirst(cell)));
  }
}

Notice any vestiges of LISP?

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 19:17:00.879 GMT [3250850] LOG:  [pgexec] SELECT [columns todo] FROM x WHERE [node: 15]
2023-11-19 19:17:00.879 GMT [3250850] STATEMENT:  SELECT a FROM x WHERE a > 1;

Well, our code didn't crash! So the guess about qual List entries being Nodes seems right. Let's look up that node type in the Postgres repo:

$ grep ' = 15,' src/include/nodes/nodetags.h
        T_OpExpr = 15,

Woot! That is exactly what I'd expect the WHERE clause here to be.

Now that we know qual is a List of Nodes, let's do a bit of refactoring since targetlist will probably also be a List of Nodes. Back in pgexec.c:

static void buffer_print_expr(PGExec_Buffer*, Node*);
static void buffer_print_list(PGExec_Buffer*, List*, char*);

static void buffer_print_opexpr(PGExec_Buffer* buf, OpExpr* op) {
  buffer_appendf(buf, "[opexpr: todo]");
}

static void buffer_print_expr(PGExec_Buffer* buf, Node* expr) {
  switch (nodeTag(expr)) {
  case T_OpExpr:
    buffer_print_opexpr(buf, (OpExpr*)expr);
    break;

  default:
    buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
  }
}

static void buffer_print_list(PGExec_Buffer* buf, List* list, char* sep) {
  ListCell* cell = NULL;
  bool first = true;

  foreach(cell, list) {
    if (!first) {
      buffer_appendz(buf, sep);
    }

    first = false;
    buffer_print_expr(buf, (Node*)lfirst(cell));
  }
}

static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  if (plan->qual == NULL) {
    return;
  }

  buffer_appendz(buf, " WHERE ");
  buffer_print_list(buf, plan->qual, " AND ");
}

And let's check out OpExpr!

Walking OpExpr

Take a look at the definition of OpExpr:

typedef struct OpExpr
{
    Expr        xpr;

    /* PG_OPERATOR OID of the operator */
    Oid         opno;

    /* PG_PROC OID of underlying function */
    Oid         opfuncid pg_node_attr(equal_ignore_if_zero, query_jumble_ignore);

    /* PG_TYPE OID of result value */
    Oid         opresulttype pg_node_attr(query_jumble_ignore);

    /* true if operator returns set */
    bool        opretset pg_node_attr(query_jumble_ignore);

    /* OID of collation of result */
    Oid         opcollid pg_node_attr(query_jumble_ignore);

    /* OID of collation that operator should use */
    Oid         inputcollid pg_node_attr(query_jumble_ignore);

    /* arguments to the operator (1 or 2) */
    List       *args;

    /* token location, or -1 if unknown */
    int         location;
} OpExpr;

The important fields are opno, the Oid of the operator, and args. args looks like another List of Nodes so we already know how to handle that.

But how do we find the string name of the operator? Presumably there's infrastructure like RelationIdGetRelation that takes an Oid and gets us an operator object.

Well I got stuck here as well. Again, thankfully, ChatGPT gave me some suggestions. There's no great story for how I got it working. So here's buffer_print_opexpr.

static void buffer_print_opexpr(PGExec_Buffer* buf, OpExpr* op) {
  HeapTuple opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(op->opno));

  buffer_print_expr(buf, lfirst(list_nth_cell(op->args, 0)));

  if (HeapTupleIsValid(opertup)) {
    Form_pg_operator operator = (Form_pg_operator)GETSTRUCT(opertup);
    buffer_appendf(buf, " %s ", NameStr(operator->oprname));
    ReleaseSysCache(opertup);
  } else {
    buffer_appendf(buf, "[Unknown operation: %d]", op->opno);
  }

  // TODO: Support single operand operations.
  buffer_print_expr(buf, lfirst(list_nth_cell(op->args, 1)));
}

And add the following two includes to the top of pgexec.c:

#include "catalog/pg_operator.h"
#include "utils/syscache.h"

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 19:42:52.916 GMT [3252974] LOG:  [pgexec] SELECT [columns todo] FROM x WHERE [Unknown: 6] > [Unknown: 7]
2023-11-19 19:42:52.916 GMT [3252974] STATEMENT:  SELECT a FROM x WHERE a > 1;

And we continue to make progress! Let's look up the type of these two unknown nodes.

$ grep ' = 6,' src/include/nodes/nodetags.h
        T_Var = 6,
$ grep ' = 7,' src/include/nodes/nodetags.h
        T_Const = 7,

Let's deal with Const first.

Walking Const

If we take a look at the Const definition:

typedef struct Const
{
    pg_node_attr(custom_copy_equal, custom_read_write)

    Expr        xpr;
    /* pg_type OID of the constant's datatype */
    Oid         consttype;
    /* typmod value, if any */
    int32       consttypmod pg_node_attr(query_jumble_ignore);
    /* OID of collation, or InvalidOid if none */
    Oid         constcollid pg_node_attr(query_jumble_ignore);
    /* typlen of the constant's datatype */
    int         constlen pg_node_attr(query_jumble_ignore);
    /* the constant's value */
    Datum       constvalue pg_node_attr(query_jumble_ignore);
    /* whether the constant is null (if true, constvalue is undefined) */
    bool        constisnull pg_node_attr(query_jumble_ignore);

    /*
     * Whether this datatype is passed by value.  If true, then all the
     * information is stored in the Datum.  If false, then the Datum contains
     * a pointer to the information.
     */
    bool        constbyval pg_node_attr(query_jumble_ignore);

    /*
     * token location, or -1 if unknown.  All constants are tracked as
     * locations in query jumbling, to be marked as parameters.
     */
    int         location pg_node_attr(query_jumble_location);
} Const;

It looks like we need to switch on the consttype (an Oid) to figure out how to interpret the constvalue (a Datum). Remember I mentioned earlier that how to interpret a Datum is dependent on context. consttype is the context here.

In this case, although consttype is an Oid, and so far we've had to use Postgres infrastructure to look up the object corresponding to an Oid, some types are builtin with well-known Oids, and the literals we've queried with are among them.

We can simply check if consttype == INT4OID and, if so, interpret the Datum as an int32. DatumGetInt32 will get us that int32.

To support the new Const type, we'll add a case in buffer_print_expr to look for a T_Const.

static void buffer_print_expr(PGExec_Buffer* buf, Node* expr) {
  switch (nodeTag(expr)) {
  case T_Const:
    buffer_print_const(buf, (Const*)expr);
    break;

  case T_OpExpr:
    buffer_print_opexpr(buf, (OpExpr*)expr);
    break;

  default:
    buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
  }
}

And add a new function, buffer_print_const:

static void buffer_print_const(PGExec_Buffer* buf, Const* cnst) {
  switch (cnst->consttype) {
  case INT4OID: {
    int32 val = DatumGetInt32(cnst->constvalue);
    buffer_appendf(buf, "%d", val);
    break;
  }

  default:
    buffer_appendf(buf, "[Unknown consttype oid: %d]", cnst->consttype);
  }
}

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 19:53:47.922 GMT [3253746] LOG:  [pgexec] SELECT [columns todo] FROM x WHERE [Unknown: 6] > 1
2023-11-19 19:53:47.922 GMT [3253746] STATEMENT:  SELECT a FROM x WHERE a > 1;

Great! Now we just have to tackle T_Var.

Walking Var

Let's take a look at the definition of Var:

typedef struct Var
{
    Expr        xpr;

    /*
     * index of this var's relation in the range table, or
     * INNER_VAR/OUTER_VAR/etc
     */
    int         varno;

    /*
     * attribute number of this var, or zero for all attrs ("whole-row Var")
     */
    AttrNumber  varattno;

    /* pg_type OID for the type of this var */
    Oid         vartype pg_node_attr(query_jumble_ignore);
    /* pg_attribute typmod value */
    int32       vartypmod pg_node_attr(query_jumble_ignore);
    /* OID of collation, or InvalidOid if none */
    Oid         varcollid pg_node_attr(query_jumble_ignore);

    /*
     * RT indexes of outer joins that can replace the Var's value with null.
     * We can omit varnullingrels in the query jumble, because it's fully
     * determined by varno/varlevelsup plus the Var's query location.
     */
    Bitmapset  *varnullingrels pg_node_attr(query_jumble_ignore);

    /*
     * for subquery variables referencing outer relations; 0 in a normal var,
     * >0 means N levels up
     */
    Index       varlevelsup;

    /*
     * varnosyn/varattnosyn are ignored for equality, because Vars with
     * different syntactic identifiers are semantically the same as long as
     * their varno/varattno match.
     */
    /* syntactic relation index (0 if unknown) */
    Index       varnosyn pg_node_attr(equal_ignore, query_jumble_ignore);
    /* syntactic attribute number */
    AttrNumber  varattnosyn pg_node_attr(equal_ignore, query_jumble_ignore);

    /* token location, or -1 if unknown */
    int         location;
} Var;

It looks like this refers to a relation in the range table list again. So this means we need to have access to the full PlannedStmt so we can read its rtable field again to find the table. Then we need to look up the Relation for the table and then we can use the Var's varattno field to pick the nth attribute from the relation and get its string representation.

However, ChatGPT found a slightly higher-level function: get_attname() that takes a relation Oid and an attribute index and returns the string name of the column.

So here's what buffer_print_var looks like:

static void buffer_print_var(PGExec_Buffer* buf, PlannedStmt* stmt, Var* var) {
  char* name = NULL;
  RangeTblEntry* rte = list_nth(stmt->rtable, var->varno-1);
  if (rte->rtekind != RTE_RELATION) {
    elog(LOG, "[Unsupported relation type for var: %d].", rte->rtekind);
    return;
  }

  name = get_attname(rte->relid, var->varattno, false);
  buffer_appendz(buf, name);
  pfree(name);
}

You'll also need to add another #include for utils/lsyscache.h.

Let's add the case T_Var: check in buffer_print_expr, and also feed the PlannedStmt* through all the necessary buffer_print_X functions:

static void buffer_print_expr(PGExec_Buffer*, PlannedStmt*, Node*);
static void buffer_print_list(PGExec_Buffer*, PlannedStmt*, List*, char*);

static void buffer_print_opexpr(PGExec_Buffer* buf, PlannedStmt* stmt, OpExpr* op) {
  HeapTuple opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(op->opno));

  buffer_print_expr(buf, stmt, lfirst(list_nth_cell(op->args, 0)));

  if (HeapTupleIsValid(opertup)) {
    Form_pg_operator operator = (Form_pg_operator)GETSTRUCT(opertup);
    buffer_appendf(buf, " %s ", NameStr(operator->oprname));
    ReleaseSysCache(opertup);
  } else {
    buffer_appendf(buf, "[Unknown operation: %d]", op->opno);
  }

  // TODO: Support single operand operations.
  buffer_print_expr(buf, stmt, lfirst(list_nth_cell(op->args, 1)));
}

static void buffer_print_const(PGExec_Buffer* buf, Const* cnst) {
  switch (cnst->consttype) {
  case INT4OID: {
    int32 val = DatumGetInt32(cnst->constvalue);
    buffer_appendf(buf, "%d", val);
    break;
  }

  default:
    buffer_appendf(buf, "[Unknown consttype oid: %d]", cnst->consttype);
  }
}

static void buffer_print_var(PGExec_Buffer* buf, PlannedStmt* stmt, Var* var) {
  char* name = NULL;
  RangeTblEntry* rte = list_nth(stmt->rtable, var->varno-1);
  if (rte->rtekind != RTE_RELATION) {
    elog(LOG, "[Unsupported relation type for var: %d].", rte->rtekind);
    return;
  }

  name = get_attname(rte->relid, var->varattno, false);
  buffer_appendz(buf, name);
  pfree(name);
}

static void buffer_print_expr(PGExec_Buffer* buf, PlannedStmt* stmt, Node* expr) {
  switch (nodeTag(expr)) {
  case T_Const:
    buffer_print_const(buf, (Const*)expr);
    break;

  case T_Var:
    buffer_print_var(buf, stmt, (Var*)expr);
    break;

  case T_OpExpr:
    buffer_print_opexpr(buf, stmt, (OpExpr*)expr);
    break;

  default:
    buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
  }
}

static void buffer_print_list(PGExec_Buffer* buf, PlannedStmt* stmt, List* list, char* sep) {
  ListCell* cell = NULL;
  bool first = true;

  foreach(cell, list) {
    if (!first) {
      buffer_appendz(buf, sep);
    }

    first = false;
    buffer_print_expr(buf, stmt, (Node*)lfirst(cell));
  }
}

static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  if (plan->qual == NULL) {
    return;
  }

  buffer_appendz(buf, " WHERE ");
  buffer_print_list(buf, queryDesc->plannedstmt, plan->qual, " AND ");
}

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 20:03:14.351 GMT [3254458] LOG:  [pgexec] SELECT [columns todo] FROM x WHERE a > 1
2023-11-19 20:03:14.351 GMT [3254458] STATEMENT:  SELECT a FROM x WHERE a > 1;

Huzzah!

Walking the column list

Let's get rid of [columns todo]. We already had the idea that List* targetlist on the Plan struct was a list of expression Nodes. Let's try it.

static void buffer_print_select_columns(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
  if (plan->targetlist == NULL) {
    return;
  }

  buffer_print_list(buf, queryDesc->plannedstmt, plan->targetlist, ", ");
}

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and you should see:

2023-11-19 20:12:48.091 GMT [3255398] LOG:  [pgexec] SELECT [Unknown: 53] FROM x WHERE a > 1
2023-11-19 20:12:48.091 GMT [3255398] STATEMENT:  SELECT a FROM x WHERE a > 1;

Hmm. Let's look up Node 53 in Postgres:

$ grep ' = 53,' src/include/nodes/nodetags.h
        T_TargetEntry = 53,

Based on the definition of TargetEntry, it looks like we can ignore most of the fields (because we don't need to handle SELECT a AS b yet) and just proxy the child expr field.
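For reference, here is roughly the relevant start of that definition (abridged from src/include/nodes/primnodes.h; the trailing fields are omitted):

typedef struct TargetEntry
{
    Expr        xpr;
    Expr       *expr;           /* expression to evaluate */
    AttrNumber  resno;          /* attribute number */
    char       *resname;        /* name of the column, or NULL */
    ...
} TargetEntry;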

Let's add a case T_TargetEntry to buffer_print_expr:

static void buffer_print_expr(PGExec_Buffer* buf, PlannedStmt* stmt, Node* expr) {
  switch (nodeTag(expr)) {
  case T_Const:
    buffer_print_const(buf, (Const*)expr);
    break;

  case T_Var:
    buffer_print_var(buf, stmt, (Var*)expr);
    break;

  case T_TargetEntry:
    buffer_print_expr(buf, stmt, (Node*)((TargetEntry*)expr)->expr);
    break;

  case T_OpExpr:
    buffer_print_opexpr(buf, stmt, (OpExpr*)expr);
    break;

  default:
    buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
  }
}

Rebuild and reinstall the extension, and restart Postgres. Re-run the test.sql script. Check the Postgres server logs and:

2023-11-19 20:17:51.114 GMT [3257827] LOG:  [pgexec] SELECT a FROM x WHERE a > 1
2023-11-19 20:17:51.114 GMT [3257827] STATEMENT:  SELECT a FROM x WHERE a > 1;

We did it!

Variations

Let's try out some other queries to make sure this wasn't just luck.

$ /usr/local/pgsql/bin/psql -h localhost postgres -c 'SELECT a + 1 FROM x'
 ?column?
----------
      310
(1 row)

$ /usr/local/pgsql/bin/psql -h localhost postgres -c 'SELECT a + 1 FROM x WHERE 2 > a'
 ?column?
----------
(0 rows)

And back in the Postgres server logs:

2023-11-19 20:19:28.057 GMT [3257874] LOG:  [pgexec] SELECT a + 1 FROM x
2023-11-19 20:19:28.057 GMT [3257874] STATEMENT:  SELECT a + 1 FROM x
2023-11-19 20:19:30.474 GMT [3257878] LOG:  [pgexec] SELECT a + 1 FROM x WHERE 2 > a
2023-11-19 20:19:30.474 GMT [3257878] STATEMENT:  SELECT a + 1 FROM x WHERE 2 > a

Not bad!

Next steps

Printing out the statement here isn't incredibly useful. But it establishes a basis for future work that might avoid Postgres's query execution engine and do the execution ourselves, or proxy execution to another system.

Postscript: On ChatGPT

My recent Postgres explorations would have been basically impossible if it weren't for being able to ask ChatGPT simple, stupid questions like "How do I get from a Postgres Var to a column name".

It isn't always right. It doesn't always give great code. Actually, it normally gives pretty weird code. But it's been extremely useful for quick iteration when I get stuck.

The only other places the information exists are small blog posts around the internet, the Postgres mailing lists (which so far for me haven't been super responsive), and the code itself.

Writing a storage engine for Postgres: an in-memory Table Access Method

2023-11-01 08:00:00

With Postgres 12, released in 2019, it became possible to swap out Postgres's storage engine.

This is a feature MySQL has supported for a long time. There are at least 8 different built-in engines you can pick from. MyRocks, MySQL on RocksDB, is another popular third-party distribution.

I assume there will be a renaissance of Postgres storage engines. To date, the efforts are nascent. OrioleDB and Citus Columnar are two promising third-party table access methods being actively developed.

Why alternative storage engines?

The ability to swap storage engines is useful because different workloads sometimes benefit from different storage approaches. Analytics workloads and columnar storage layouts go well together. Write-heavy workloads and LSM trees go well together. And some people like in-memory storage for running integration tests.

By swapping out only the storage engine, you get the benefit of the rest of the Postgres or MySQL infrastructure. The query language, the wire protocol, the ecosystem, etc.

Why not foreign data wrappers?

Very little has been written about the difference between foreign data wrappers (FDWs) and table access methods. Table access methods seem to be the lower-level layer, where presumably you get better performance and cleaner integration. But there is clearly overlap between these two extension options.

For example, there is an FDW for ClickHouse, so when you create tables, insert rows, and query the tables, you are really creating and querying rows in a ClickHouse server. Similarly, there's an FDW for RocksDB. And Citus's columnar engine works either as a foreign data wrapper or as a table access method.

The Citus page draws the clearest distinction between FDWs and table access methods, but even that page is vague. Performance doesn't seem to be the main difference. Closer integration, and thus the ability to look more like vanilla Postgres from the outside, seems to be the gist.
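
To make the structural difference concrete: an FDW's entry point returns an FdwRoutine full of planner and executor callbacks, where a table access method's entry point (as we'll see below) returns a TableAmRoutine full of storage callbacks. A minimal sketch of the FDW side, with a hypothetical handler name and the callback definitions omitted:

#include "postgres.h"
#include "fmgr.h"
#include "foreign/fdwapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_fdw_handler);
Datum my_fdw_handler(PG_FUNCTION_ARGS) {
  FdwRoutine* routine = makeNode(FdwRoutine);

  /* A real FDW would fill in planner callbacks like GetForeignRelSize,
   * GetForeignPaths, and GetForeignPlan, plus executor callbacks like
   * BeginForeignScan, IterateForeignScan, and EndForeignScan. */

  PG_RETURN_POINTER(routine);
}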

In any case, I wanted to explore the table access method API.

Digging in

I haven't written Postgres extensions before and I've never written C professionally. If you're familiar with Postgres internals or C and notice something funky, please let me know!

It turns out that almost no one has written about how to implement the minimal table access methods for basic storage engine operations. So after quite a bit of stumbling to get the basics of an in-memory storage engine working, I'm going to walk you through my approach.

This is prototype-quality code which hopefully will be a useful base for further exploration.

All code for this post is available on GitHub.

A debug Postgres build

First off, let's make a debug build of Postgres.

$ git clone https://github.com/postgres/postgres
$ cd postgres
$ # An arbitrary commit from `master` (shortly after Postgres 16) that I happened to be on
$ git checkout 849172ff4883d44168f96f39d3fde96d0aa34c99
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
$ make -j8
$ sudo make install

This will install Postgres binaries (e.g. psql, pg_ctl, initdb, pg_config) into /usr/local/pgsql/bin.

I'm going to reference those absolute paths throughout this post because you might have a system (package manager) install of Postgres already.

Let's create a database and start up this debug build:

$ /usr/local/pgsql/bin/initdb test-db
$ /usr/local/pgsql/bin/pg_ctl -D test-db -l logfile start

Extension infrastructure

Since we installed Postgres from scratch, /usr/local/pgsql/bin/pg_config will supply all of the infrastructure we need.

The "infrastructure" is basically just PGXS: Postgres Makefile utilities.

It's convention-heavy. So in a new Makefile for this project we'll specify:

  1. MODULES: Any C sources to build, without the .c file extension
  2. EXTENSION: Extension metadata file, without the .control file extension
  3. DATA: A SQL file that is executed when the extension is loaded, this time with the .sql extension

MODULES = pgtam
EXTENSION = pgtam
DATA = pgtam--0.0.1.sql

PG_CONFIG = /usr/local/pgsql/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

The final three lines set up the PGXS Makefile library based on the particular installed Postgres build we want to build the extension against and install the extension to.

PGXS gives us a few important targets, like make distclean, make, and make install, that we'll use later on.

pgtam.c

A minimal C file that registers a function capable of serving as a table access method is:

#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"

PG_MODULE_MAGIC;

const TableAmRoutine memam_methods = {
  .type = T_TableAmRoutine,
};

PG_FUNCTION_INFO_V1(mem_tableam_handler);
Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
  PG_RETURN_POINTER(&memam_methods);
}

If you want to read about extension basics without the complexity of table access methods, you can find a complete, minimal Postgres extension I wrote to validate the infrastructure here. Or you can follow a larger tutorial.

The workflow for registering a table access method is to first run CREATE EXTENSION pgtam. This assumes pgtam is an extension that has a function that returns a TableAmRoutine struct instance, a table of table access methods.

Then you must run CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler. And finally you can use the access method when creating a table with USING mem: CREATE TABLE x(a INT) USING mem.

pgtam.control

This file contains extension metadata: at a minimum, the version of the extension and the path where the extension library will be installed.

default_version = '0.0.1'
module_pathname = '$libdir/pgtam'

pgtam--0.0.1.sql

Finally, in pgtam--0.0.1.sql (which is executed when we call CREATE EXTENSION pgtam), we register the handler function as a foreign function, and then we register the function as an access method.

CREATE OR REPLACE FUNCTION mem_tableam_handler(internal)
RETURNS table_am_handler AS 'pgtam', 'mem_tableam_handler'
LANGUAGE C STRICT;

CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler;

Build

Now that we've got all the pieces in place, we can build and install the extension.

$ make
$ sudo make install

Let's add a test.sql script to exercise the extension:

DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;

And run it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
DROP EXTENSION
CREATE EXTENSION
psql:test.sql:3: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:3: error: connection to server was lost

Ok, so the backend crashed and psql lost its connection! Let's look at the server logs. When we started Postgres with pg_ctl, we specified the log file as logfile in the directory where we ran pg_ctl.

If we look through it we'll spot an assertion failure:

$ grep Assert logfile
TRAP: failed Assert("routine->scan_begin != NULL"), File: "tableamapi.c", Line: 52, PID: 2906922

That's a great sign! This is Postgres's debug infrastructure helping to make sure the table access method is correctly implemented.
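
Specifically, when Postgres loads the handler it validates the returned struct with a long series of assertions in GetTableAmRoutine() in src/backend/access/table/tableamapi.c, roughly like:

Assert(routine->scan_begin != NULL);
Assert(routine->scan_end != NULL);
Assert(routine->scan_rescan != NULL);
/* ...and so on for every required callback. */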

Table access method stubs

The next step is to add function stubs for all the non-optional methods of the TableAmRoutine struct.

I've done all the work for you already so you can just copy this over the existing pgtam.c. It's a big file, but don't worry. There's nothing to explain. Just a bunch of blank functions returning default values when required.

#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"
#include "access/heapam.h"
#include "nodes/execnodes.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
#include "utils/builtins.h"
#include "executor/tuptable.h"

PG_MODULE_MAGIC;

const TableAmRoutine memam_methods;

static const TupleTableSlotOps* memam_slot_callbacks(
  Relation relation
) {
  return NULL;
}

static TableScanDesc memam_beginscan(
  Relation relation,
  Snapshot snapshot,
  int nkeys,
  struct ScanKeyData *key,
  ParallelTableScanDesc parallel_scan,
  uint32 flags
) {
  return NULL;
}

static void memam_rescan(
  TableScanDesc sscan,
  struct ScanKeyData *key,
  bool set_params,
  bool allow_strat,
  bool allow_sync,
  bool allow_pagemode
) {
}

static void memam_endscan(TableScanDesc sscan) {
}

static bool memam_getnextslot(
  TableScanDesc sscan,
  ScanDirection direction,
  TupleTableSlot *slot
) {
  return false;
}

static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
  return NULL;
}

static void memam_index_fetch_reset(IndexFetchTableData *scan) {
}

static void memam_index_fetch_end(IndexFetchTableData *scan) {
}

static bool memam_index_fetch_tuple(
  struct IndexFetchTableData *scan,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot,
  bool *call_again,
  bool *all_dead
) {
  return false;
}

static void memam_tuple_insert(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate
) {
}

static void memam_tuple_insert_speculative(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate,
  uint32 specToken) {
}

static void memam_tuple_complete_speculative(
  Relation relation,
  TupleTableSlot *slot,
  uint32 specToken,
  bool succeeded) {
}

static void memam_multi_insert(
  Relation relation,
  TupleTableSlot **slots,
  int ntuples,
  CommandId cid,
  int options,
  BulkInsertState bistate
) {
}

static TM_Result memam_tuple_delete(
  Relation relation,
  ItemPointer tid,
  CommandId cid,
  Snapshot snapshot,
  Snapshot crosscheck,
  bool wait,
  TM_FailureData *tmfd,
  bool changingPart
) {
  TM_Result result = {};
  return result;
}

static TM_Result memam_tuple_update(
  Relation relation,
  ItemPointer otid,
  TupleTableSlot *slot,
  CommandId cid,
  Snapshot snapshot,
  Snapshot crosscheck,
  bool wait,
  TM_FailureData *tmfd,
  LockTupleMode *lockmode,
  TU_UpdateIndexes *update_indexes
) {
  TM_Result result = {};
  return result;
}

static TM_Result memam_tuple_lock(
  Relation relation,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot,
  CommandId cid,
  LockTupleMode mode,
  LockWaitPolicy wait_policy,
  uint8 flags,
  TM_FailureData *tmfd)
{
  TM_Result result = {};
  return result;
}

static bool memam_fetch_row_version(
  Relation relation,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot
) {
  return false;
}

static void memam_get_latest_tid(
  TableScanDesc sscan,
  ItemPointer tid
) {
}

static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
  return false;
}

static bool memam_tuple_satisfies_snapshot(
  Relation rel,
  TupleTableSlot *slot,
  Snapshot snapshot
) {
  return false;
}

static TransactionId memam_index_delete_tuples(
  Relation rel,
  TM_IndexDeleteOp *delstate
) {
  TransactionId id = {};
  return id;
}

static void memam_relation_set_new_filelocator(
  Relation rel,
  const RelFileLocator *newrlocator,
  char persistence,
  TransactionId *freezeXid,
  MultiXactId *minmulti
) {
}

static void memam_relation_nontransactional_truncate(
  Relation rel
) {
}

static void memam_relation_copy_data(
  Relation rel,
  const RelFileLocator *newrlocator
) {
}

static void memam_relation_copy_for_cluster(
  Relation OldHeap,
  Relation NewHeap,
  Relation OldIndex,
  bool use_sort,
  TransactionId OldestXmin,
  TransactionId *xid_cutoff,
  MultiXactId *multi_cutoff,
  double *num_tuples,
  double *tups_vacuumed,
  double *tups_recently_dead
) {
}

static void memam_vacuum_rel(
  Relation rel,
  VacuumParams *params,
  BufferAccessStrategy bstrategy
) {
}

static bool memam_scan_analyze_next_block(
  TableScanDesc scan,
  BlockNumber blockno,
  BufferAccessStrategy bstrategy
) {
  return false;
}

static bool memam_scan_analyze_next_tuple(
  TableScanDesc scan,
  TransactionId OldestXmin,
  double *liverows,
  double *deadrows,
  TupleTableSlot *slot
) {
  return false;
}

static double memam_index_build_range_scan(
  Relation heapRelation,
  Relation indexRelation,
  IndexInfo *indexInfo,
  bool allow_sync,
  bool anyvisible,
  bool progress,
  BlockNumber start_blockno,
  BlockNumber numblocks,
  IndexBuildCallback callback,
  void *callback_state,
  TableScanDesc scan
) {
  return 0;
}

static void memam_index_validate_scan(
  Relation heapRelation,
  Relation indexRelation,
  IndexInfo *indexInfo,
  Snapshot snapshot,
  ValidateIndexState *state
) {
}

static bool memam_relation_needs_toast_table(Relation rel) {
  return false;
}

static Oid memam_relation_toast_am(Relation rel) {
  Oid oid = {};
  return oid;
}

static void memam_fetch_toast_slice(
  Relation toastrel,
  Oid valueid,
  int32 attrsize,
  int32 sliceoffset,
  int32 slicelength,
  struct varlena *result
) {
}

static void memam_estimate_rel_size(
  Relation rel,
  int32 *attr_widths,
  BlockNumber *pages,
  double *tuples,
  double *allvisfrac
) {
}

static bool memam_scan_sample_next_block(
  TableScanDesc scan, SampleScanState *scanstate
) {
  return false;
}

static bool memam_scan_sample_next_tuple(
  TableScanDesc scan,
  SampleScanState *scanstate,
  TupleTableSlot *slot
) {
  return false;
}

const TableAmRoutine memam_methods = {
  .type = T_TableAmRoutine,

  .slot_callbacks = memam_slot_callbacks,

  .scan_begin = memam_beginscan,
  .scan_end = memam_endscan,
  .scan_rescan = memam_rescan,
  .scan_getnextslot = memam_getnextslot,

  .parallelscan_estimate = table_block_parallelscan_estimate,
  .parallelscan_initialize = table_block_parallelscan_initialize,
  .parallelscan_reinitialize = table_block_parallelscan_reinitialize,

  .index_fetch_begin = memam_index_fetch_begin,
  .index_fetch_reset = memam_index_fetch_reset,
  .index_fetch_end = memam_index_fetch_end,
  .index_fetch_tuple = memam_index_fetch_tuple,

  .tuple_insert = memam_tuple_insert,
  .tuple_insert_speculative = memam_tuple_insert_speculative,
  .tuple_complete_speculative = memam_tuple_complete_speculative,
  .multi_insert = memam_multi_insert,
  .tuple_delete = memam_tuple_delete,
  .tuple_update = memam_tuple_update,
  .tuple_lock = memam_tuple_lock,

  .tuple_fetch_row_version = memam_fetch_row_version,
  .tuple_get_latest_tid = memam_get_latest_tid,
  .tuple_tid_valid = memam_tuple_tid_valid,
  .tuple_satisfies_snapshot = memam_tuple_satisfies_snapshot,
  .index_delete_tuples = memam_index_delete_tuples,

  .relation_set_new_filelocator = memam_relation_set_new_filelocator,
  .relation_nontransactional_truncate = memam_relation_nontransactional_truncate,
  .relation_copy_data = memam_relation_copy_data,
  .relation_copy_for_cluster = memam_relation_copy_for_cluster,
  .relation_vacuum = memam_vacuum_rel,
  .scan_analyze_next_block = memam_scan_analyze_next_block,
  .scan_analyze_next_tuple = memam_scan_analyze_next_tuple,
  .index_build_range_scan = memam_index_build_range_scan,
  .index_validate_scan = memam_index_validate_scan,

  .relation_size = table_block_relation_size,
  .relation_needs_toast_table = memam_relation_needs_toast_table,
  .relation_toast_am = memam_relation_toast_am,
  .relation_fetch_toast_slice = memam_fetch_toast_slice,

  .relation_estimate_size = memam_estimate_rel_size,

  .scan_sample_next_block = memam_scan_sample_next_block,
  .scan_sample_next_tuple = memam_scan_sample_next_tuple
};

PG_FUNCTION_INFO_V1(mem_tableam_handler);

Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
  PG_RETURN_POINTER(&memam_methods);
}

Let's build and test it!

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE

Hey we're getting somewhere! It successfully created the table with our custom table access method.

Querying rows

Next, let's try querying the table by adding a SELECT a FROM x to test.sql and running it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

This time there's nothing in logfile that helps:

$ tail -n15 logfile
2023-11-01 18:43:32.449 UTC [2906199] LOG:  database system is ready to accept connections
2023-11-01 18:58:32.572 UTC [2907997] LOG:  checkpoint starting: time
2023-11-01 18:58:35.305 UTC [2907997] LOG:  checkpoint complete: wrote 28 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.712 s, sync=0.015 s, total=2.733 s; sync files=23, longest=0.004 s, average=0.001 s; distance=128 kB, estimate=150 kB; lsn=0/15F88E0, redo lsn=0/15F8888
2023-11-01 19:08:14.485 UTC [2906199] LOG:  server process (PID 2908242) was terminated by signal 11: Segmentation fault
2023-11-01 19:08:14.485 UTC [2906199] DETAIL:  Failed process was running: SELECT a FROM x;
2023-11-01 19:08:14.485 UTC [2906199] LOG:  terminating any other active server processes
2023-11-01 19:08:14.486 UTC [2906199] LOG:  all server processes terminated; reinitializing
2023-11-01 19:08:14.508 UTC [2908253] LOG:  database system was interrupted; last known up at 2023-11-01 18:58:35 UTC
2023-11-01 19:08:14.518 UTC [2908253] LOG:  database system was not properly shut down; automatic recovery in progress
2023-11-01 19:08:14.519 UTC [2908253] LOG:  redo starts at 0/15F8888
2023-11-01 19:08:14.520 UTC [2908253] LOG:  invalid record length at 0/161DE70: expected at least 24, got 0
2023-11-01 19:08:14.520 UTC [2908253] LOG:  redo done at 0/161DE38 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-11-01 19:08:14.521 UTC [2908254] LOG:  checkpoint starting: end-of-recovery immediate wait
2023-11-01 19:08:14.532 UTC [2908254] LOG:  checkpoint complete: wrote 35 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.010 s, total=0.012 s; sync files=27, longest=0.003 s, average=0.001 s; distance=149 kB, estimate=149 kB; lsn=0/161DE70, redo lsn=0/161DE70
2023-11-01 19:08:14.533 UTC [2906199] LOG:  database system is ready to accept connections

This was the first place I got stuck. How on earth do I figure out what methods to implement? I mean, it's clearly one or more of these methods from the struct. But there are so many methods.

I tried setting a breakpoint in gdb on the process returned by SELECT pg_backend_pid() for a psql session, but the breakpoint never seemed to be hit for any of my methods.

So I did the low-tech solution and opened a file, /tmp/pgtam.log, turned off buffering on it, and added a log to every method on the TableAmRoutine struct:

@@ -12,9 +12,13 @@

 const TableAmRoutine memam_methods;

+FILE* fd;
+#define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);
+
 static const TupleTableSlotOps* memam_slot_callbacks(
   Relation relation
 ) {
+  DEBUG_FUNC();
   return NULL;
 }

@@ -26,6 +30,7 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
+  DEBUG_FUNC();
   return NULL;
 }

@@ -37,9 +42,11 @@
   bool allow_sync,
   bool allow_pagemode
 ) {
+  DEBUG_FUNC();
 }

 static void memam_endscan(TableScanDesc sscan) {
+  DEBUG_FUNC();
 }

 static bool memam_getnextslot(
@@ -47,17 +54,21 @@
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

 static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
+  DEBUG_FUNC();
   return NULL;
 }

 static void memam_index_fetch_reset(IndexFetchTableData *scan) {
+  DEBUG_FUNC();
 }

 static void memam_index_fetch_end(IndexFetchTableData *scan) {
+  DEBUG_FUNC();
 }

 static bool memam_index_fetch_tuple(
@@ -68,6 +79,7 @@
   bool *call_again,
   bool *all_dead
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -78,6 +90,7 @@
   int options,
   BulkInsertState bistate
 ) {
+  DEBUG_FUNC();
 }

 static void memam_tuple_insert_speculative(
@@ -87,6 +100,7 @@
   int options,
   BulkInsertState bistate,
   uint32 specToken) {
+  DEBUG_FUNC();
 }

 static void memam_tuple_complete_speculative(
@@ -94,6 +108,7 @@
   TupleTableSlot *slot,
   uint32 specToken,
   bool succeeded) {
+  DEBUG_FUNC();
 }

 static void memam_multi_insert(
@@ -104,6 +119,7 @@
   int options,
   BulkInsertState bistate
 ) {
+  DEBUG_FUNC();
 }

 static TM_Result memam_tuple_delete(
@@ -117,6 +133,7 @@
   bool changingPart
 ) {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -133,6 +150,7 @@
   TU_UpdateIndexes *update_indexes
 ) {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -148,6 +166,7 @@
   TM_FailureData *tmfd)
 {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -157,6 +176,7 @@
   Snapshot snapshot,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -164,9 +184,11 @@
   TableScanDesc sscan,
   ItemPointer tid
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
+  DEBUG_FUNC();
   return false;
 }

@@ -175,6 +197,7 @@
   TupleTableSlot *slot,
   Snapshot snapshot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -183,6 +206,7 @@
   TM_IndexDeleteOp *delstate
 ) {
   TransactionId id = {};
+  DEBUG_FUNC();
   return id;
 }

@@ -193,17 +217,20 @@
   TransactionId *freezeXid,
   MultiXactId *minmulti
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_nontransactional_truncate(
   Relation rel
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_copy_data(
   Relation rel,
   const RelFileLocator *newrlocator
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_copy_for_cluster(
@@ -218,6 +245,7 @@
   double *tups_vacuumed,
   double *tups_recently_dead
 ) {
+  DEBUG_FUNC();
 }

 static void memam_vacuum_rel(
@@ -225,6 +253,7 @@
   VacuumParams *params,
   BufferAccessStrategy bstrategy
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_scan_analyze_next_block(
@@ -232,6 +261,7 @@
   BlockNumber blockno,
   BufferAccessStrategy bstrategy
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -242,6 +272,7 @@
   double *deadrows,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -258,6 +289,7 @@
   void *callback_state,
   TableScanDesc scan
 ) {
+  DEBUG_FUNC();
   return 0;
 }

@@ -268,14 +300,17 @@
   Snapshot snapshot,
   ValidateIndexState *state
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_relation_needs_toast_table(Relation rel) {
+  DEBUG_FUNC();
   return false;
 }

 static Oid memam_relation_toast_am(Relation rel) {
   Oid oid = {};
+  DEBUG_FUNC();
   return oid;
 }

@@ -287,6 +322,7 @@
   int32 slicelength,
   struct varlena *result
 ) {
+  DEBUG_FUNC();
 }

 static void memam_estimate_rel_size(
@@ -296,11 +332,13 @@
   double *tuples,
   double *allvisfrac
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_scan_sample_next_block(
   TableScanDesc scan, SampleScanState *scanstate
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -309,6 +347,7 @@
   SampleScanState *scanstate,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

And then in the entrypoint, initialize the file for logging.

@@ -369,5 +408,9 @@
 PG_FUNCTION_INFO_V1(mem_tableam_handler);

 Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
+  fd = fopen("/tmp/pgtam.log", "a");
+  setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
+  fprintf(fd, "\n\nmem_tableam handler loaded\n");
+
   PG_RETURN_POINTER(&memam_methods);
 }

Let's give it a shot!

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

And let's check our log file:

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks

Now we're getting somewhere!

I later realized that elog() is how most people log within Postgres and within extensions. I didn't know that when I was getting started, though. This separate log file was a simple way to get the info out.
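
For what it's worth, if you'd rather follow along using elog(), a drop-in alternative to the fprintf-based macro above might look like this (messages then land in the server logfile instead of /tmp/pgtam.log):

/* Hypothetical elog()-based alternative to DEBUG_FUNC(). */
#define DEBUG_FUNC() elog(LOG, "[pgtam] in %s", __func__)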

slot_callbacks

Since the request crashes and the last logged function is memam_slot_callbacks, it seems like that is where we should concentrate. The table access method docs suggest looking at the default heap access method for inspiration.

Its version of slot_callbacks returns &TTSOpsBufferHeapTuple:

static const TupleTableSlotOps *
heapam_slot_callbacks(Relation relation)
{
    return &TTSOpsBufferHeapTuple;
}

I have no idea what that means, but since it is defined in src/backend/executor/execTuples.c it doesn't seem to be tied to the heap access method implementation. Let's try it.

While it works initially, I noticed later on that TTSOpsBufferHeapTuple turns out not to be the right choice here. TTSOpsVirtual seems to be the right implementation.

@@ -19,7 +19,7 @@
   Relation relation
 ) {
   DEBUG_FUNC();
-  return NULL;
+  return &TTSOpsVirtual;
 }

 static TableScanDesc memam_beginscan(

Build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

It still crashes. But this time in /tmp/pgtam.log we made it into a new method!

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan

scan_begin

The function signature is:

TableScanDesc heap_beginscan(
  Relation relation,
  Snapshot snapshot,
  int nkeys,
  ScanKey key,
  ParallelTableScanDesc parallel_scan,
  uint32 flags
);

We just implemented stub versions of all the methods, so we've been returning NULL. Since we're failing in this function, maybe we should try returning something that isn't NULL.

By looking at the definition of TableScanDesc, we can see it is a pointer to the TableScanDescData struct defined in src/include/access/relscan.h.

Let's malloc a TableScanDescData, free it in endscan, and return the TableScanDescData instance in beginscan:

@@ -30,8 +30,12 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
+  TableScanDescData* scan = {};
   DEBUG_FUNC();
-  return NULL;
+
+  scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+
+  return (TableScanDesc)scan;
 }

 static void memam_rescan(
@@ -87,6 +87,7 @@

 static void memam_endscan(TableScanDesc sscan) {
   DEBUG_FUNC();
+  free(sscan);
 }

Build and run (you can do it on your own). No difference.

I got stuck for a while here too. Clearly something must be filled out in this struct but it could be anything. Through trial and error I realized the one field that must be filled out is scan->rs_rd.

@@ -34,6 +34,7 @@
   DEBUG_FUNC();

   scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+  scan->rs_rd = relation;

   return (TableScanDesc)scan;
 }

We build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
 a
---
(0 rows)

And it works! It doesn't return anything but that's correct. There's nothing to return.

So what if we actually want to return something? Let's check our logs in /tmp/pgtam.log.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_endscan

Ok, I'm getting the gist of the API. A full table scan (which this is, because there are no indexes at play) starts with an initialization for a slot, then the scan begins, then getnextslot is called for each row, and then endscan is called to allow for cleanup.
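
In code terms, the lifecycle we observed looks something like this sketch (pseudocode reconstructing the call order from our logs, not Postgres's actual executor code):

static void sketch_seqscan(const TableAmRoutine* am, Relation rel,
                           Snapshot snapshot, TupleTableSlot* slot) {
  /* SO_TYPE_SEQSCAN comes from access/tableam.h. */
  TableScanDesc scan = am->scan_begin(rel, snapshot, 0, NULL, NULL, SO_TYPE_SEQSCAN);
  while (am->scan_getnextslot(scan, ForwardScanDirection, slot)) {
    /* Each iteration fills `slot` with one row for the rest of the plan. */
  }
  am->scan_end(scan);
}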

So let's try returning a row in getnextslot.

getnextslot

The getnextslot signature is:

bool memam_getnextslot(
  TableScanDesc sscan,
  ScanDirection direction,
  TupleTableSlot *slot
);

So the sscan should be what we returned from beginscan and the interface docs say the current row gets stored in slot.

The return value seems to indicate whether or not we've reached the end of the scan. However, even if you return true, the scan will still end if the slot is not filled out correctly. And if the slot is filled out correctly but you unconditionally return true, you will crash the process.

Let's take a look at the definition of TupleTableSlot:

typedef struct TupleTableSlot
{
    NodeTag     type;
#define FIELDNO_TUPLETABLESLOT_FLAGS 1
    uint16      tts_flags;      /* Boolean states */
#define FIELDNO_TUPLETABLESLOT_NVALID 2
    AttrNumber  tts_nvalid;     /* # of valid values in tts_values */
    const TupleTableSlotOps *const tts_ops; /* implementation of slot */
#define FIELDNO_TUPLETABLESLOT_TUPLEDESCRIPTOR 4
    TupleDesc   tts_tupleDescriptor;    /* slot's tuple descriptor */
#define FIELDNO_TUPLETABLESLOT_VALUES 5
    Datum      *tts_values;     /* current per-attribute values */
#define FIELDNO_TUPLETABLESLOT_ISNULL 6
    bool       *tts_isnull;     /* current per-attribute isnull flags */
    MemoryContext tts_mcxt;     /* slot itself is in this context */
    ItemPointerData tts_tid;    /* stored tuple's tid */
    Oid         tts_tableOid;   /* table oid of tuple */
} TupleTableSlot;

tts_values is an array of Datum (which is a Postgres value). So that sounds like the actual values of the row. The tts_isnull field also looks important since that seems to be whether each value in the row is null or not. And tts_nvalid sounds important too since presumably it's the length of the tts_isnull and tts_values arrays.

The rest of it may or may not be important. Let's try filling out these three fields though and see what happens.

Datum

Back in the Postgres C extension documentation, we can see some simple examples of converting between C types and Postgres's Datum type.

For example:

Datum
add_one(PG_FUNCTION_ARGS)
{
    int32   arg = PG_GETARG_INT32(0);

    PG_RETURN_INT32(arg + 1);
}

If we look at the definition of PG_RETURN_INT32 in src/include/fmgr.h, we see:

#define PG_RETURN_INT32(x)   return Int32GetDatum(x)

So Int32GetDatum() is the function we'll use to set a Datum for a cell in a row.
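
As a tiny sketch of how these helpers pair up (we'll use the reverse direction, DatumGetInt32(), later when reading inserted values back out):

/* Round-tripping a C int32 through Datum. Note that a Datum carries no
 * null flag; nullness is tracked separately (e.g. in tts_isnull). */
Datum d = Int32GetDatum(314);
int32 v = DatumGetInt32(d); /* v == 314 */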

@@ -54,13 +54,26 @@
   DEBUG_FUNC();
 }

+static bool done = false;
 static bool memam_getnextslot(
   TableScanDesc sscan,
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
   DEBUG_FUNC();
-  return false;
+
+  if (done) {
+    return false;
+  }
+
+  slot->tts_nvalid = 1;
+  slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
+  slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+  slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
+  slot->tts_isnull[0] = false;
+  done = true;
+
+  return true;
 }

 static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {

The goal is to return a single row and then end the scan. The row will have one 32-bit integer cell (remember we created the table with CREATE TABLE x(a INT); INT is shorthand for INT4, a 32-bit integer) containing the value 314.

But if we build and run this, we get no rows.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
 a
---
(0 rows)

I got stuck for a while here. Plugging my getnextslot code into ChatGPT helped. One thing it gave me to try was calling ExecStoreVirtualTuple on the slot. I noticed that the built-in heap access method also called a function like this in getnextslot.

And I realized that tts_nvalid is already set up and the memory for tts_values and tts_isnull is already allocated. So the code became a little simpler.

@@ -66,11 +66,9 @@
     return false;
   }

-  slot->tts_nvalid = 1;
-  slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
   slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
-  slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
   slot->tts_isnull[0] = false;
+  ExecStoreVirtualTuple(slot);
   done = true;

   return true;

Build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
  a
-----
 314
(1 row)

Fantastic!

Creating a table

Now that we've proven we can return random data, let's set up infrastructure for storing tables in memory.

@@ -15,6 +15,41 @@
 FILE* fd;
 #define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);

+
+struct Column {
+  int value;
+};
+
+struct Row {
+  struct Column* columns;
+  size_t ncolumns;
+};
+
+#define MAX_ROWS 100
+struct Table {
+  char* name;
+  struct Row* rows;
+  size_t nrows;
+};
+
+#define MAX_TABLES 100
+struct Database {
+  struct Table* tables;
+  size_t ntables;
+};
+
+struct Database* database;
+
+static void get_table(struct Table** table, Relation relation) {
+  char* this_name = NameStr(relation->rd_rel->relname);
+  for (size_t i = 0; i < database->ntables; i++) {
+    if (strcmp(database->tables[i].name, this_name) == 0) {
+      *table = &database->tables[i];
+      return;
+    }
+  }
+}
+
 static const TupleTableSlotOps* memam_slot_callbacks(
   Relation relation
 ) {

Based on what we logged in /tmp/pgtam.log it seems like memam_relation_set_new_filelocator is called when a new table is created. So let's handle adding a new table there.

@@ -233,7 +268,16 @@
   TransactionId *freezeXid,
   MultiXactId *minmulti
 ) {
+  struct Table table = {};
   DEBUG_FUNC();
+
+  table.name = strdup(NameStr(rel->rd_rel->relname));
+  fprintf(fd, "Created table: [%s]\n", table.name);
+  table.rows = (struct Row*)malloc(sizeof(struct Row) * MAX_ROWS);
+  table.nrows = 0;
+
+  database->tables[database->ntables] = table;
+  database->ntables++;
 }

 static void memam_relation_nontransactional_truncate(

Finally, we'll initialize the in-memory Database* when the handler is loaded.

@@ -428,5 +472,11 @@
   setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
   fprintf(fd, "\n\nmem_tableam handler loaded\n");

+  if (database == NULL) {
+    database = (struct Database*)malloc(sizeof(struct Database));
+    database->ntables = 0;
+    database->tables = (struct Table*)malloc(sizeof(struct Table) * MAX_TABLES);
+  }
+
   PG_RETURN_POINTER(&memam_methods);
 }

If we build and run, we won't notice anything new.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
  a
-----
 314
(1 row)

But we should see a message in /tmp/pgtam.log about the new table being created.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

And there it is! Creation looks good.

Inserting rows

Let's add INSERT INTO x VALUES (23), (101); to test.sql and run the SQL script.

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
 314
(1 row)

And let's check the log to see what method is called when we try to INSERT.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
in memam_tuple_insert
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

tuple_insert seems to be the method! Looks like it gets called once for each row to insert. Perfect.

The signature for tuple_insert is:

void memam_tuple_insert(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate
);

We can get the table name from relation, and instead of writing to slot we can read from slot->tts_values.

@@ -141,7 +141,38 @@
   int options,
   BulkInsertState bistate
 ) {
+  TupleDesc desc = RelationGetDescr(relation);
+  struct Table* table = NULL;
+  struct Column column = {};
+  struct Row row = {};
+
   DEBUG_FUNC();
+
+  get_table(&table, relation);
+  if (table == NULL) {
+    elog(ERROR, "table not found");
+    return;
+  }
+
+  if (table->nrows == MAX_ROWS) {
+    elog(ERROR, "cannot insert more rows");
+    return;
+  }
+
+  row.ncolumns = desc->natts;
+  Assert(slot->tts_nvalid == row.ncolumns);
+  Assert(row.ncolumns > 0);
+
+  row.columns = (struct Column*)malloc(sizeof(struct Column) * row.ncolumns);
+  for (size_t i = 0; i < row.ncolumns; i++) {
+    Assert(desc->attrs[i].atttypid == INT4OID);
+    column.value = DatumGetInt32(slot->tts_values[i]);
+    row.columns[i] = column;
+    fprintf(fd, "Got value: %d\n", column.value);
+  }
+
+  table->rows[table->nrows] = row;
+  table->nrows++;
 }

 static void memam_tuple_insert_speculative(

Build and run and again we won't notice anything new.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
 314
(1 row)

But if we check the logs, we should see the two column values we inserted, one for each row.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
Got value: 23
in memam_tuple_insert
Got value: 101
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

Woohoo!

Un-hardcoding the scan

The final thing we need to do is drop the hardcoded 314 we returned from getnextslot and instead we need to look up the current table and return rows from it. This also means we need to keep track of which row we're on. So beginscan will also need to change slightly.

@@ -57,6 +56,14 @@
   return &TTSOpsVirtual;
 }

+
+struct MemScanDesc {
+  TableScanDescData rs_base; // Base class from access/relscan.h.
+
+  // Custom data.
+  uint32 cursor;
+};
+
 static TableScanDesc memam_beginscan(
   Relation relation,
   Snapshot snapshot,
@@ -65,11 +72,13 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
-  TableScanDescData* scan = {};
-  DEBUG_FUNC();
+  struct MemScanDesc* scan;

-  scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
-  scan->rs_rd = relation;
+  DEBUG_FUNC();
+
+  scan = (struct MemScanDesc*)malloc(sizeof(struct MemScanDesc));
+  scan->rs_base.rs_rd = relation;
+  scan->cursor = 0;

   return (TableScanDesc)scan;
 }
@@ -89,23 +97,26 @@
   DEBUG_FUNC();
 }

-static bool done = false;
 static bool memam_getnextslot(
   TableScanDesc sscan,
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
+  struct MemScanDesc* mscan = (struct MemScanDesc*)sscan;
+  struct Table* table = NULL;
   DEBUG_FUNC();

-  if (done) {
+  ExecClearTuple(slot);
+
+  get_table(&table, mscan->rs_base.rs_rd);
+  if (table == NULL || mscan->cursor == table->nrows) {
     return false;
   }

-  slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+  slot->tts_values[0] = Int32GetDatum(table->rows[mscan->cursor].columns[0].value);
   slot->tts_isnull[0] = false;
   ExecStoreVirtualTuple(slot);
-  done = true;
-
+  mscan->cursor++;
   return true;
 }

Let's try it out.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
  23
 101
(2 rows)

And there we have it. :)

Awesome SQL power

So we tried one table and we tried a SELECT without anything else.

What happens if we use more of SQL? Let's create another table and try some more complex queries. Edit test.sql:

DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;
CREATE TABLE y(b INT) USING mem;
INSERT INTO x VALUES (23), (101);
SELECT a FROM x;
SELECT a + 100 FROM x WHERE a = 23;
SELECT a, COUNT(1) FROM x GROUP BY a ORDER BY COUNT(1) DESC;
SELECT b FROM y;

Run it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to 2 other objects
DETAIL:  drop cascades to table x
drop cascades to table y
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
CREATE TABLE
INSERT 0 2
  a
-----
  23
 101
(2 rows)

 ?column?
----------
      123
(1 row)

  a  | count
-----+-------
  23 |     1
 101 |     1
(2 rows)

 b
---
(0 rows)

Pretty sweet!

Next steps

It would be neat to build a storage engine that reads from and writes to a CSV a la MySQL's CSV storage engine. Or a storage engine that uses RocksDB.

It would also be good to figure out how indexes work, how deletions work, and how updates and DDL beyond CREATE work.

And I should probably contribute some of this to the table access method docs which are pretty sparse at the moment.

io_uring basics: Writing a file to disk

2023-10-19 08:00:00

King and I wrote a blog post about building an event-driven cross-platform IO library that used io_uring on Linux. We sketched out how it works at a high level but I hadn't yet internalized how you actually code with io_uring. So I strapped myself down this week and wrote some benchmarks to build my intuition about io_uring and other IO models.

I started with implementations in Go and ported them to Zig to make sure I had done the Go versions decently. And I got some help from King and other internetters to find some inefficiencies in my code.

This post will walk through my process, with increasingly efficient (and increasingly complex) ways to write an entire file to disk with io_uring, from Go and Zig.

Notably, we're not going to fsync() and we're not going to use O_DIRECT. So we won't be testing the entire IO pipeline from userland to disk hardware but just how fast IO gets to the kernel. The focus of this post is more on IO methods and using io_uring, not absolute numbers.

All code for this post is available on GitHub.

This code is going to indirectly show some differences in timing between Go and Zig. I couldn't care less about benchmarketing. And I hope Zig vs Go is not what you take away from this post either.

The goal is to build an intuition and be generally correct. Observing the same relative behavior between implementations across two languages helps me gain confidence that what I'm doing is correct.

io_uring

With normal blocking syscalls you just call read() or write() and wait for the results. io_uring is one of Linux's more powerful asynchronous IO offerings. Unlike epoll, you can use io_uring with both files and network connections. And unlike epoll, the syscall work can even be performed inside the kernel on your behalf.

To interact with io_uring, you register a submission queue for syscalls and their arguments. And you register a completion queue for syscall results.

You can batch many syscalls in one single call to io_uring, effectively turning up to N (4096 at most) syscalls into just one syscall. The kernel still does all the work of the N syscalls but you avoid some overhead.

As you check the completion queue and handle completed submissions, space frees up in the submission queue (in whole or in part), and you can add more submissions.

For a more complete understanding, check out the kernel document Efficient IO with io_uring.

io_uring vs liburing

io_uring is a complex, low-level interface. Shuveb Hussain has an excellent series on programming io_uring. But that was too low-level for me as I was trying to figure out how to just get something working.

Instead, most people use liburing or a ported version of it like the Zig standard library's io_uring.zig or Iceber's iouring-go.
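
To get a feel for the shape of the API before we touch the Go and Zig wrappers, here is a minimal C sketch using liburing directly: one write submission, one completion. Error handling is omitted, and you'd link with -luring.

#include <liburing.h>
#include <fcntl.h>

int main(void) {
  struct io_uring ring;
  io_uring_queue_init(8, &ring, 0); /* submission/completion queues with 8 entries */

  int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
  char buf[] = "hello";

  struct io_uring_sqe* sqe = io_uring_get_sqe(&ring); /* grab a submission slot */
  io_uring_prep_write(sqe, fd, buf, sizeof(buf) - 1, 0); /* write at offset 0 */
  io_uring_submit(&ring); /* one syscall submits the whole batch */

  struct io_uring_cqe* cqe;
  io_uring_wait_cqe(&ring, &cqe); /* block until a completion arrives */
  int written = cqe->res; /* bytes written, or -errno */
  io_uring_cqe_seen(&ring, cqe); /* mark the completion as consumed */

  io_uring_queue_exit(&ring);
  return written < 0;
}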

io_uring started clicking for me when I tried out the iouring-go library. So we'll start there.

Boilerplate

First off, let's set up some boilerplate for the Go and Zig code.

In main.go add:

package main

import (
  "bytes"
  "fmt"
  "os"
  "time"
)

func assert(b bool) {
  if !b {
    panic("assert")
  }
}

const BUFFER_SIZE = 4096

func readNBytes(fn string, n int) []byte {
  f, err := os.Open(fn)
  if err != nil {
    panic(err)
  }
  defer f.Close()

  data := make([]byte, 0, n)

  var buffer = make([]byte, BUFFER_SIZE)
  for len(data) < n {
    read, err := f.Read(buffer)
    if err != nil {
      panic(err)
    }

    data = append(data, buffer[:read]...)
  }

  assert(len(data) == n)

  return data
}

func benchmark(name string, data []byte, fn func(*os.File)) {
  fmt.Printf("%s", name)
  f, err := os.OpenFile("out.bin", os.O_RDWR | os.O_CREATE | os.O_TRUNC, 0755)
  if err != nil {
    panic(err)
  }

  t1 := time.Now()

  fn(f)

  s := time.Now().Sub(t1).Seconds()
  fmt.Printf(",%f,%f\n", s, float64(len(data))/s)

  if err := f.Close(); err != nil {
    panic(err)
  }

  assert(bytes.Equal(readNBytes("out.bin", len(data)), data))
}

And in main.zig add:

const std = @import("std");

const OUT_FILE = "out.bin";
const BUFFER_SIZE: u64 = 4096;

fn readNBytes(
  allocator: *const std.mem.Allocator,
  filename: []const u8,
  n: usize,
) ![]const u8 {
  const file = try std.fs.cwd().openFile(filename, .{});
  defer file.close();

  var data = try allocator.alloc(u8, n);
  var buf = try allocator.alloc(u8, BUFFER_SIZE);

  var written: usize = 0;
  while (written < n) {
    var nwritten = try file.read(buf);
    @memcpy(data[written .. written + nwritten], buf[0..nwritten]);
    written += nwritten;
  }

  std.debug.assert(data.len == n);
  return data;
}

const Benchmark = struct {
  t: std.time.Timer,
  file: std.fs.File,
  data: []const u8,
  allocator: *const std.mem.Allocator,

  fn init(
    allocator: *const std.mem.Allocator,
    name: []const u8,
    data: []const u8,
  ) !Benchmark {
    try std.io.getStdOut().writer().print("{s}", .{name});

    var file = try std.fs.cwd().createFile(OUT_FILE, .{
      .truncate = true,
    });

    return Benchmark{
      .t = try std.time.Timer.start(),
      .file = file,
      .data = data,
      .allocator = allocator,
    };
  }

  fn stop(b: *Benchmark) void {
    const s = @as(f64, @floatFromInt(b.t.read())) / std.time.ns_per_s;
    std.io.getStdOut().writer().print(
      ",{d},{d}\n",
      .{ s, @as(f64, @floatFromInt(b.data.len)) / s },
    ) catch unreachable;

    b.file.close();

    var in = readNBytes(b.allocator, OUT_FILE, b.data.len) catch unreachable;
    std.debug.assert(std.mem.eql(u8, in, b.data));
    b.allocator.free(in);
  }
};

Keep it simple: write()

Now let's add the naive version of writing bytes to disk: calling write() repeatedly until all data has been written to disk.

In main.go:

func main() {
  size := 104857600 // 100MiB
  data := readNBytes("/dev/random", size)

  const RUNS = 10
  for i := 0; i < RUNS; i++ {
    benchmark("blocking", data, func(f *os.File) {
      for i := 0; i < len(data); i += BUFFER_SIZE {
        size := min(BUFFER_SIZE, len(data)-i)
        n, err := f.Write(data[i : i+size])
        if err != nil {
          panic(err)
        }

        assert(n == size)
      }
    })
  }
}

And in main.zig:

pub fn main() !void {
  var allocator = &std.heap.page_allocator;

  const SIZE = 104857600; // 100MiB
  var data = try readNBytes(allocator, "/dev/random", SIZE);
  defer allocator.free(data);

  const RUNS = 10;
  var run: usize = 0;
  while (run < RUNS) : (run += 1) {
    {
      var b = try Benchmark.init(allocator, "blocking", data);
      defer b.stop();

      var i: usize = 0;
      while (i < data.len) : (i += BUFFER_SIZE) {
        const size = @min(BUFFER_SIZE, data.len - i);
        const n = try b.file.write(data[i .. i + size]);
        std.debug.assert(n == size);
      }
    }
  }
}

Let's build and run these programs and store the results as CSV we can analyze with DuckDB.

Go first:

$ go build -o gomain main.go
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.07251540000000001s 1.4GB/s

And Zig:

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.0656907669s 1.5GB/s

Alright, we've got a baseline now and both language implementations are in the same ballpark.

Let's add a simple io_uring version!

io_uring, 1 entry, Go

The iouring-go library has really excellent documentation for getting started.

To keep it simple, we'll use io_uring with only 1 entry. Add the following to func main() after the existing benchmark() call in main.go:

benchmark("io_uring", data, func(f * os.File) {
  iour, err := iouring.New(1)
  if err != nil {
    panic(err)
  }
  defer iour.Close()

  for i := 0; i < len(data); i += BUFFER_SIZE {
    size := min(BUFFER_SIZE, len(data)-i)
    prepRequest := iouring.Pwrite(int(f.Fd()), data[i : i+size], uint64(i))
    res, err := iour.SubmitRequest(prepRequest, nil)
    if err != nil {
      panic(err)
    }

    <-res.Done()
    i, err := res.ReturnInt()
    if err != nil {
      panic(err)
    }
    assert(size == i)
  }
})

Note that benchmark takes care of f.Seek(0) before each run. And it also validates that the file contents are equivalent to the input data. So it validates the benchmark for correctness.

Alright, let's run this new Go implementation with io_uring!

$ go mod init gomain
$ go mod tidy
$ go build -o gomain main.go
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.0811486s 1.3GB/s
io_uring 0.5083049999999999s 213.2MB/s

Well that looks terrible.

Let's port it to Zig to see if we notice the same behavior there.

io_uring, 1 entry, Zig

There isn't an official Zig tutorial on io_uring that I'm aware of. But io_uring.zig is easy enough to browse through. And there are tests in that file that also show how to use it.

And now that we've explored a bit in Go the basic gist should be similar:

Add the following to fn main() after the existing benchmark block in main.zig:

{
  var b = try Benchmark.init(allocator, "iouring", data);
  defer b.stop();

  const entries = 1;
  var ring = try std.os.linux.IO_Uring.init(entries, 0);
  defer ring.deinit();

  var i: usize = 0;
  while (i < data.len) : (i += BUFFER_SIZE) {
    const size = @min(BUFFER_SIZE, data.len - i);
    _ = try ring.write(0, b.file.handle, data[i .. i + size], i);

    const submitted = try ring.submit_and_wait(1);
    std.debug.assert(submitted == 1);

    const cqe = try ring.copy_cqe();
    std.debug.assert(cqe.err() == .SUCCESS);
    std.debug.assert(cqe.res >= 0);
    const n = @as(usize, @intCast(cqe.res));
    std.debug.assert(n <= BUFFER_SIZE);
  }
}

Now build and run:

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.06650093630000001s 1.5GB/s
io_uring 0.17542890139999998s 597.7MB/s

Well it's similarly pretty bad. But our implementation ignores one major aspect of io_uring: batching requests.

Let's do some refactoring!

io_uring, N entries, Go

To support submitting N entries, we're going to have an inner loop running up to N that fills up N entries to io_uring.

Then we'll wait for the N submissions to complete and check their results.

We'll keep going until we write the entire file.

All of this can stay inside the loop in main; I'm just dropping preceding whitespace for nicer formatting here:

benchmarkIOUringNEntries := func (nEntries int) {
  benchmark(fmt.Sprintf("io_uring_%d_entries", nEntries), data, func(f * os.File) {
    iour, err := iouring.New(uint(nEntries))
    if err != nil {
      panic(err)
    }
    defer iour.Close()

    requests := make([]iouring.PrepRequest, nEntries)

    for i := 0; i < len(data); i += BUFFER_SIZE * nEntries {
      submittedEntries := 0
      for j := 0; j < nEntries; j++ {
        base := i + j * BUFFER_SIZE
        if base >= len(data) {
          break
        }
        submittedEntries++
        size := min(BUFFER_SIZE, len(data)-base)
        requests[j] = iouring.Pwrite(int(f.Fd()), data[base : base+size], uint64(base))
      }

      if submittedEntries == 0 {
        break
      }

      res, err := iour.SubmitRequests(requests[:submittedEntries], nil)
      if err != nil {
        panic(err)
      }

      <-res.Done()

      for _, result := range res.ErrResults() {
        _, err := result.ReturnInt()
        if err != nil {
          panic(err)
        }
      }
    }
  })
}
benchmarkIOUringNEntries(1)
benchmarkIOUringNEntries(128)

There are some specific things in there to notice.

First, toward the end of the file we may not have N entries to submit. We may have 1 or even 0.

If we have 0 to submit, we must not submit at all, otherwise the Go library hangs. Similarly, if we don't slice requests down to requests[:submittedEntries], the Go library will segfault when submittedEntries < N.

Other than that, let's build and run this!

$ go build -o gomain
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.0740368s 1.4GB/s
io_uring_128_entries 0.127519s 836.6MB/s
io_uring_1_entries 0.46831579999999995s 226.9MB/s

Now we're getting somewhere! Still about half the throughput of blocking writes, but a roughly 4x improvement over using only a single entry.

Let's port the N entry code to Zig.

io_uring, N entries, Zig

Unlike Go we can't do closures, so we'll have to make benchmarkIOUringNEntries a top-level function and keep the calls to it in the loop in main:

pub fn main() !void {
    var allocator = &std.heap.page_allocator;

    const SIZE = 104857600; // 100MiB
    var data = try readNBytes(allocator, "/dev/random", SIZE);
    defer allocator.free(data);

    const RUNS = 10;
    var run: usize = 0;
    while (run < RUNS) : (run += 1) {
        {
            var b = try Benchmark.init(allocator, "blocking", data);
            defer b.stop();

            var i: usize = 0;
            while (i < data.len) : (i += BUFFER_SIZE) {
                const size = @min(BUFFER_SIZE, data.len - i);
                const n = try b.file.write(data[i .. i + size]);
                std.debug.assert(n == size);
            }
        }

        try benchmarkIOUringNEntries(allocator, data, 1);
        try benchmarkIOUringNEntries(allocator, data, 128);
    }
}

And for the implementation itself, the only two big differences from the first version are that we'll bulk-read completion events (cqes) and that we'll create and wait for many submissions at once.

fn benchmarkIOUringNEntries(
  allocator: *const std.mem.Allocator,
  data: []const u8,
  nEntries: u13,
) !void {
  const name = try std.fmt.allocPrint(allocator.*, "iouring_{}_entries", .{nEntries});
  defer allocator.free(name);

  var b = try Benchmark.init(allocator, name, data);
  defer b.stop();

  var ring = try std.os.linux.IO_Uring.init(nEntries, 0);
  defer ring.deinit();

  var cqes = try allocator.alloc(std.os.linux.io_uring_cqe, nEntries);
  defer allocator.free(cqes);

  var i: usize = 0;
  while (i < data.len) : (i += BUFFER_SIZE * nEntries) {
    var submittedEntries: u32 = 0;
    var j: usize = 0;
    while (j < nEntries) : (j += 1) {
      const base = i + j * BUFFER_SIZE;
      if (base >= data.len) {
        break;
      }
      submittedEntries += 1;
      const size = @min(BUFFER_SIZE, data.len - base);
      _ = try ring.write(0, b.file.handle, data[base .. base + size], base);
    }

    const submitted = try ring.submit_and_wait(submittedEntries);
    std.debug.assert(submitted == submittedEntries);

    const waited = try ring.copy_cqes(cqes[0..submitted], submitted);
    std.debug.assert(waited == submitted);

    for (cqes[0..submitted]) |*cqe| {
      std.debug.assert(cqe.err() == .SUCCESS);
      std.debug.assert(cqe.res >= 0);
      const n = @as(usize, @intCast(cqe.res));
      std.debug.assert(n <= BUFFER_SIZE);
    }
  }
}

Let's build and run:

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
blocking 0.0674331114s 1.5GB/s
iouring_128_entries 0.06773539590000001s 1.5GB/s
iouring_1_entries 0.1855542556s 569.9MB/s

Huh, that's surprising! We caught up to blocking writes with io_uring in Zig, but not in Go, even though we made good progress in Go.

Ring buffers

But we can do a bit better. We're doing batching, but the API is called "io_uring" not "io_batch". We're not even making use of the ring buffer behavior io_uring gives us!

We are waiting for all submitted writes to complete. But there's no reason to do that. Instead we should submit as much as we can without blocking on completions, handle completions as they happen, and keep retrying submissions (for example, when the submission queue is momentarily full) until all the data has been written.

Unfortunately the Go library doesn't seem to expose this ring behavior of io_uring. Or I've missed it.

But we can do it in Zig. Let's go.

io_uring, ring buffer, Zig

We need to change how we track which offsets still need to be submitted. We also need to keep the loop going until we're sure all the data has been written. And instead of blocking on the number of entries we submitted, we should never block at all.

fn benchmarkIOUringNEntries(
  allocator: *const std.mem.Allocator,
  data: []const u8,
  nEntries: u13,
) !void {
  const name = try std.fmt.allocPrint(allocator.*, "iouring_{}_entries", .{nEntries});
  defer allocator.free(name);

  var b = try Benchmark.init(allocator, name, data);
  defer b.stop();

  var ring = try std.os.linux.IO_Uring.init(nEntries, 0);
  defer ring.deinit();

  var cqes = try allocator.alloc(std.os.linux.io_uring_cqe, nEntries);
  defer allocator.free(cqes);

  var written: usize = 0;
  var i: usize = 0;
  while (i < data.len or written < data.len) {
    var submittedEntries: u32 = 0;
    var j: usize = 0;
    while (true) {
      const base = i + j * BUFFER_SIZE;
      if (base >= data.len) {
        break;
      }
      const size = @min(BUFFER_SIZE, data.len - base);
      _ = ring.write(0, b.file.handle, data[base .. base + size], base) catch |e| switch (e) {
        error.SubmissionQueueFull => break,
        else => unreachable,
      };
      submittedEntries += 1;
      i += size;
    }

    _ = try ring.submit_and_wait(0);
    const cqesDone = try ring.copy_cqes(cqes, 0);

    for (cqes[0..cqesDone]) |*cqe| {
      std.debug.assert(cqe.err() == .SUCCESS);
      std.debug.assert(cqe.res >= 0);
      const n = @as(usize, @intCast(cqe.res));
      std.debug.assert(n <= BUFFER_SIZE);
      written += n;
    }
  }
}

The code got a bit simpler! Granted, we're omitting error handling.

Build and run:

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
iouring_128_entries 0.06035423609999999s 1.7GB/s
iouring_1_entries 0.0610197624s 1.7GB/s
blocking 0.0671628515s 1.5GB/s

Not bad!

Crank it up

We've been writing 100MiB of data. Let's go up to 1GiB to see how that affects things. Ideally, the more data we write, the more closely the results reflect realistic long-term behavior.

In main.zig just change SIZE to 1073741824. Rebuild and run:

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'out.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
iouring_128_entries 0.6063814535s 1.7GB/s
iouring_1_entries 0.6167537295000001s 1.7GB/s
blocking 0.6831747749s 1.5GB/s

No real difference, perfect!

Let's make one more change though. Let's up the BUFFER_SIZE from 4KiB to 1MiB.

$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'out.csv' group by column0 order by avg(cast(column1 as double)) asc"
method avg_time avg_throughput
iouring_128_entries 0.2756831357s 3.8GB/s
iouring_1_entries 0.27575404880000004s 3.8GB/s
blocking 0.2833337046s 3.7GB/s

Hey that's an improvement!

Control

All these numbers are machine-specific obviously. So what does an existing tool like fio say? (Assuming I'm using it correctly. I await your corrections!)

With a 4KiB buffer size:

$ fio --name=fiotest --rw=write --size=1G --bs=4k --group_reporting --ioengine=sync
fiotest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1)
fiotest: (groupid=0, jobs=1): err= 0: pid=2437359: Thu Oct 19 23:33:42 2023
  write: IOPS=282k, BW=1102MiB/s (1156MB/s)(1024MiB/929msec); 0 zone resets
    clat (nsec): min=2349, max=54099, avg=2709.48, stdev=1325.83
     lat (nsec): min=2390, max=54139, avg=2752.89, stdev=1334.62
    clat percentiles (nsec):
     |  1.00th=[ 2416],  5.00th=[ 2416], 10.00th=[ 2416], 20.00th=[ 2448],
     | 30.00th=[ 2448], 40.00th=[ 2448], 50.00th=[ 2448], 60.00th=[ 2480],
     | 70.00th=[ 2512], 80.00th=[ 2544], 90.00th=[ 2832], 95.00th=[ 3504],
     | 99.00th=[ 5792], 99.50th=[15296], 99.90th=[19584], 99.95th=[20096],
     | 99.99th=[22656]
   bw (  KiB/s): min=940856, max=940856, per=83.36%, avg=940856.00, stdev= 0.00, samples=1
   iops        : min=235214, max=235214, avg=235214.00, stdev= 0.00, samples=1
  lat (usec)   : 4=97.22%, 10=2.03%, 20=0.71%, 50=0.04%, 100=0.01%
  cpu          : usr=17.35%, sys=82.11%, ctx=26, majf=0, minf=11
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1102MiB/s (1156MB/s), 1102MiB/s-1102MiB/s (1156MB/s-1156MB/s), io=1024MiB (1074MB), run=929-929msec

1.2GB/s is in the ballpark of what we got.

And with a 1MiB buffer size?

$ fio --name=fiotest --rw=write --size=1G --bs=1M --group_reporting --ioengine=sync
fiotest: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=1
fio-3.33
Starting 1 process
fiotest: Laying out IO file (1 file / 1024MiB)

fiotest: (groupid=0, jobs=1): err= 0: pid=2437239: Thu Oct 19 23:32:09 2023
  write: IOPS=3953, BW=3954MiB/s (4146MB/s)(1024MiB/259msec); 0 zone resets
    clat (usec): min=221, max=1205, avg=241.83, stdev=43.93
     lat (usec): min=228, max=1250, avg=251.68, stdev=45.80
    clat percentiles (usec):
     |  1.00th=[  225],  5.00th=[  225], 10.00th=[  227], 20.00th=[  227],
     | 30.00th=[  231], 40.00th=[  233], 50.00th=[  235], 60.00th=[  239],
     | 70.00th=[  243], 80.00th=[  249], 90.00th=[  262], 95.00th=[  269],
     | 99.00th=[  302], 99.50th=[  318], 99.90th=[ 1074], 99.95th=[ 1205],
     | 99.99th=[ 1205]
  lat (usec)   : 250=80.96%, 500=18.85%
  lat (msec)   : 2=0.20%
  cpu          : usr=4.26%, sys=94.96%, ctx=3, majf=0, minf=10
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=3954MiB/s (4146MB/s), 3954MiB/s-3954MiB/s (4146MB/s-4146MB/s), io=1024MiB (1074MB), run=259-259msec

3.9GB/s is also roughly in the ballpark of what we got.

Our code seems reasonable!

What's next?

None of this is original. fio is a similar tool, written in C, with many different IO engines including libaio and writev support. And it has many different IO workloads.

But it's been enjoyable to learn more about these APIs: how to program them and how they compare to each other.

So next steps could include adding additional IO engines or IO workloads.

Also, either I need to understand Iceber's Go library better or its API needs to be loosened up a little bit so we can get that awesome ring buffer behavior we got from Zig.

Keep an eye out here and on my io-playground repo!

Go database driver overhead on insert-heavy workloads

2023-10-05 08:00:00

The most popular SQLite and PostgreSQL database drivers in Go are (roughly) 20-76% slower than alternative Go drivers on insert-heavy benchmarks of mine. So if you are bulk-inserting data with Go (and potentially also bulk-retrieving data with Go), you may want to consider the driver carefully. And you may want to consider avoiding database/sql.

Some driver authors have noted and benchmarked issues with database/sql.

So database/sql may be responsible for some of this overhead. Indeed, the variations between drivers in this post will be demonstrated both by using database/sql and by avoiding it. This post won't specifically prove that the variation is due to the database/sql interface, but that doesn't change the premise.

Not covered in this post but something to consider: JetBrains has suggested that other frontends like sqlc, sqlx, and GORM do worse than database/sql.

This post is built on the workload, environment, libraries, and methodology in my databases-intuition repo on GitHub. See the repo for details that will help you reproduce or correct me.

INSERT workload

In this workload, the data is random and there are no indexes. Neither of these aspects matters for this post though, because we're comparing behavior within the same database among different drivers. This was just a workload I already had.

Two different data sizes are tested:

  1. 10M rows with 16 columns, each column is 32 bytes
  2. 10M rows with 3 columns, each column is 8 bytes

Each test is run 10 times and we record median, standard deviation, min, max and throughput.
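
As a rough sketch of that bookkeeping (a hypothetical helper, not the actual code from the repo), computing those stats in Go might look like this:

package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// summarize reports median, standard deviation, min, max and throughput
// for a set of benchmark run durations over a fixed number of rows.
func summarize(runs []time.Duration, rows float64) {
	sort.Slice(runs, func(i, j int) bool { return runs[i] < runs[j] })

	var mean float64
	for _, r := range runs {
		mean += r.Seconds()
	}
	mean /= float64(len(runs))

	var variance float64
	for _, r := range runs {
		d := r.Seconds() - mean
		variance += d * d
	}
	stddev := math.Sqrt(variance / float64(len(runs)))

	median := runs[len(runs)/2].Seconds()
	fmt.Printf("Timing: %.2fs ± %.2fs, Min: %.2fs, Max: %.2fs\n",
		median, stddev, runs[0].Seconds(), runs[len(runs)-1].Seconds())
	fmt.Printf("Throughput: %.2f rows/s\n", rows/median)
}

func main() {
	runs := []time.Duration{16 * time.Second, 15 * time.Second, 17 * time.Second}
	summarize(runs, 10_000_000)
}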

SQLite

Both variations presented here load 10M rows using a single prepared statement called for each row within a single transaction.
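For illustration, this is roughly what that workload looks like through database/sql (a sketch; the table name, schema, and row values here are made up, not the repo's actual code):

package main

import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "bench.db")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// One transaction for the entire load.
	tx, err := db.Begin()
	if err != nil {
		panic(err)
	}

	// One prepared statement, executed once per row.
	stmt, err := tx.Prepare("INSERT INTO t (a, b, c) VALUES (?, ?, ?)")
	if err != nil {
		panic(err)
	}
	defer stmt.Close()

	for i := 0; i < 10_000_000; i++ {
		if _, err := stmt.Exec("x", "y", "z"); err != nil {
			panic(err)
		}
	}

	if err := tx.Commit(); err != nil {
		panic(err)
	}
}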

The most popular driver is mattn/go-sqlite3.

It is roughly 20-40% slower than another driver that avoids database/sql.

10M Rows, 16 columns, each column 32 bytes:

Timing: 56.53 ± 1.26s, Min: 55.05s, Max: 59.62s
Throughput: 176,893.65 ± 3,853.90 rows/s, Min: 167,719.97 rows/s, Max: 181,646.02 rows/s

10M Rows, 3 columns, each column 8 bytes:

Timing: 15.92 ± 0.25s, Min: 15.69s, Max: 16.67s
Throughput: 628,044.37 ± 9,703.92 rows/s, Min: 599,852.91 rows/s, Max: 637,435.60 rows/s

The other driver I tested is my own fork of bvinc/go-sqlite-lite called eatonphil/gosqlite. I forked it because it is unmaintained and I wanted to bring it up to date for tests like this.

10M Rows, 16 columns, each column 32 bytes:

Timing: 45.51 ± 0.70s, Min: 43.72s, Max: 45.93s
Throughput: 219,729.65 ± 3,447.56 rows/s, Min: 217,742.98 rows/s, Max: 228,711.51 rows/s

10M Rows, 3 columns, each column 8 bytes:

Timing: 10.44 ± 0.20s, Min: 10.02s, Max: 10.68s
Throughput: 957,939.60 ± 18,879.43 rows/s, Min: 936,114.60 rows/s, Max: 998,426.62 rows/s

PostgreSQL

Both variations presented use PostgreSQL's COPY FROM support. This is significantly faster for PostgreSQL than the prepared-statement approach we used for SQLite. (Here are my results for doing prepared statement INSERTs in PostgreSQL if you are curious.)

The most popular PostgreSQL driver is lib/pq. The performance issues with lib/pq are well-known, and the repo itself is marked as no longer developed.

It is roughly 44-76% slower than an alternative driver that avoids database/sql.

10M Rows, 16 columns, each column 32 bytes:

Timing: 104.53 ± 2.40s, Min: 102.57s, Max: 110.08s
Throughput: 95,665.37 ± 2,129.25 rows/s, Min: 90,847.08 rows/s, Max: 97,490.96 rows/s

10M Rows, 3 columns, each column 8 bytes:

Timing: 8.16 ± 0.43s, Min: 7.44s, Max: 8.80s
Throughput: 1,225,986.47 ± 66,631.53 rows/s, Min: 1,136,581.82 rows/s, Max: 1,343,441.37 rows/s

The other driver I tested is jackc/pgx, without database/sql.
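
For reference, a COPY FROM bulk load through pgx looks roughly like this (a sketch assuming pgx v5's API; the connection string, table, and columns are made up):

package main

import (
	"context"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/bench")
	if err != nil {
		panic(err)
	}
	defer conn.Close(ctx)

	// In the real benchmark these would be 10M rows of random data.
	rows := [][]any{
		{"x", "y", "z"},
	}

	// COPY FROM streams every row in one command rather than
	// executing one INSERT per row.
	_, err = conn.CopyFrom(
		ctx,
		pgx.Identifier{"t"},
		[]string{"a", "b", "c"},
		pgx.CopyFromRows(rows),
	)
	if err != nil {
		panic(err)
	}
}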

10M Rows, 16 columns, each column 32 bytes:

Timing: 46.54 ± 1.60s, Min: 44.09s, Max: 49.51s
Throughput: 214,869.42 ± 7,265.10 rows/s, Min: 201,991.37 rows/s, Max: 226,801.07 rows/s

10M Rows, 3 columns, each column 8 bytes:

Timing: 5.20 ± 0.44s, Min: 4.71s, Max: 5.96s
Throughput: 1,923,722.79 ± 156,820.46 rows/s, Min: 1,676,894.32 rows/s, Max: 2,124,966.60 rows/s

The discrepancies here are even greater than with the different SQLite drivers.

Workloads with small resultset

I won't go into as much detail but if you're doing queries that don't return many rows, the difference between drivers is negligible.

See here for details.

Conclusion

If you are doing INSERT-heavy workloads, or you are processing large numbers of rows returned from your SQL database, you might want to try benchmarking the same workload with different drivers.

And specifically, there is likely no good reason to use lib/pq anymore for accessing PostgreSQL from Go. Just use jackc/pgx.

Intercepting and modifying Linux system calls with ptrace

2023-10-01 08:00:00

How software fails is interesting. But real-world errors can be infrequent and slow to manifest. Fault injection is a formal-sounding term that just means: trying to explicitly trigger errors in the hopes of discovering bad logic, typically during automated tests.

Jepsen and ChaosMonkey are two famous examples that help to trigger process and network failure. But what about disk and filesystem errors?

A few avenues seem worth investigating: FUSE, LD_PRELOAD, ptrace, SECCOMP_RET_TRAP, and symbolic analysis.

I would like to try out FUSE sometime. But the LD_PRELOAD approach only works if IO goes through libc, which won't be the case for all programs. ptrace is something I've wanted to dig into for years since learning about gvisor.

SECCOMP_RET_TRAP doesn't have the same high-level guides that ptrace does so maybe I'll dig into it later. And symbolic analysis might be able to detect bad workloads but it also isn't fault injection. Maybe it's the better idea but fault injection just sounds more fun.

So this particular post will cover intercepting system calls (syscalls) using ptrace with code written in Zig. Not because readers will likely write their own code in Zig but because hopefully the Zig code will be easier for you to read and adapt to your language compared to if we had to deal with the verbosity and inconvenience of C.

In the end, we'll be able to intercept and force short (incomplete) writes in a Go, Python, and C program, emulating a disk that is having trouble completing the write. This case isn't common, but it should probably be handled with retries in production code.

This post corresponds roughly to this commit on GitHub.

A bad program

First off, let's write some code for a program that would exhibit a short write. Basically, we write to a file and don't check how many bytes we wrote. This is extremely common code; or at least I've written it often.

$ cat test/main.go
package main

import (
        "os"
)

func main() {
        f, err := os.OpenFile("test.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
        if err != nil {
                panic(err)
        }

        text := "some great stuff"
        _, _ = f.Write([]byte(text))

        _ = f.Close()
}

With this code, if the Write() call doesn't actually succeed in writing everything, we won't know that. And the file will contain less than all of some great stuff.

This logical mistake will happen rarely, if ever, on a normal disk. But it is possible.
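
For contrast, a version that notices the short write might look like this (a sketch; whether to retry or fail here is up to the program):

package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.OpenFile("test.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
	if err != nil {
		panic(err)
	}

	text := "some great stuff"
	n, err := f.Write([]byte(text))
	if err != nil {
		panic(err)
	}
	if n < len(text) {
		// A short write: only n of len(text) bytes made it to the file.
		panic(fmt.Sprintf("short write: %d of %d bytes", n, len(text)))
	}

	if err := f.Close(); err != nil {
		panic(err)
	}
}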

Now that we've got an example program in mind, let's see if we can trigger the logic error.

ptrace

ptrace is a somewhat cross-platform layer that allows you to intercept syscalls in a process. You can read and modify memory and registers in the process, when the syscall starts and before it finishes.

gdb and strace both use ptrace for their magic.

Google's gvisor, which powers various serverless runtimes in Google Cloud, was also historically based on ptrace (PTRACE_SYSEMU specifically, which we won't explore much in this post).

Interestingly though, gvisor switched only this year (2023) to a different default backend for trapping system calls, based on SECCOMP_RET_TRAP.

You can get similar vibes from this Brendan Gregg post on the dangers of using strace (that is based on ptrace) in production.

Although ptrace is cross-platform, actually writing cross-platform-aware code with ptrace can be complex. So this post assumes amd64/linux.

Protocol

The ptrace protocol is described in the ptrace manpage, but Chris Wellons and a University of Alberta group also wrote nice introductions. I referenced these three pages heavily.

Here's what the UAlberta page has to say:

ptrace's syscall tracing protocol

We fork and have the child call PTRACE_TRACEME. Then we handle each syscall entrance by calling PTRACE_SYSCALL and waiting with wait until the child has entered the syscall. It is at this moment that we can mess with things.

Implementation

Let's turn that graphic into Zig code.

const std = @import("std");
const c = @cImport({
    @cInclude("sys/ptrace.h");
    @cInclude("sys/user.h");
    @cInclude("sys/wait.h");
    @cInclude("errno.h");
});

const cNullPtr: ?*anyopaque = null;

// TODO //

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    var args = try std.process.argsAlloc(arena.allocator());
    std.debug.assert(args.len >= 2);

    const pid = try std.os.fork();

    if (pid < 0) {
        std.debug.print("Fork failed!\n", .{});
        return;
    } else if (pid == 0) {
        // Child process
        _ = c.ptrace(c.PTRACE_TRACEME, pid, cNullPtr, cNullPtr);
        return std.process.execv(
            arena.allocator(),
            args[1..],
        );
    } else {
        // Parent process
        const childPid = pid;
        _ = c.waitpid(childPid, 0, 0);
        var cm = ChildManager{ .arena = &arena, .childPid = childPid };
        try cm.childInterceptSyscalls();
    }
}

So like the graphic suggested, we fork and start a child process. That means this Zig program should be called like:

$ zig build-exe --library c main.zig
$ ./main /actual/program/to/intercept --and --its args

Presumably, as with strace or gdb, we could instead attach to an already running process with PTRACE_ATTACH or PTRACE_SEIZE (based on the ptrace manpage) rather than going the PTRACE_TRACEME route. But I haven't tried that out yet.

With the child ready to be intercepted, we can implement the ChildManager that actually does the interception.

ChildManager

The core of the ChildManager is an infinite loop (at least as long as the child process lives) that waits for the next syscall and then calls a hook for the system call if one exists.

const ChildManager = struct {
    arena: *std.heap.ArenaAllocator,
    childPid: std.os.pid_t,

    // TODO //

    fn childInterceptSyscalls(
        cm: *ChildManager,
    ) !void {
        while (true) {
            // Handle syscall entrance
            const status = cm.childWaitForSyscall();
            if (std.os.W.IFEXITED(status)) {
                break;
            }

            var args: ABIArguments = cm.getABIArguments();
            const syscall = args.syscall();

            for (hooks) |hook| {
                if (syscall == hook.syscall) {
                    try hook.hook(cm.*, &args);
                }
            }
        }
    }
};

Later we'll write a hook for the sys_write syscall that will force an incomplete write.

Back to the protocol, childWaitForSyscall will call PTRACE_SYSCALL to allow the child process to start up again and continue until the next syscall. We'll follow that by wait-ing for the child process to be stopped again so we can handle the syscall entrance.

    fn childWaitForSyscall(cm: ChildManager) u32 {
        var status: i32 = 0;
        _ = c.ptrace(c.PTRACE_SYSCALL, cm.childPid, cNullPtr, cNullPtr);
        _ = c.waitpid(cm.childPid, &status, 0);
        return @bitCast(status);
    }

Now that we've intercepted a syscall (after waitpid finishes blocking), we need to figure out which syscall it was. We do this by calling PTRACE_GETREGS and reading the rax register, which according to the amd64/linux calling convention holds the syscall number.

Registers

PTRACE_GETREGS fills out the following struct:

struct user_regs_struct
{
  unsigned long r15;
  unsigned long r14;
  unsigned long r13;
  unsigned long r12;
  unsigned long rbp;
  unsigned long rbx;
  unsigned long r11;
  unsigned long r10;
  unsigned long r9;
  unsigned long r8;
  unsigned long rax;
  unsigned long rcx;
  unsigned long rdx;
  unsigned long rsi;
  unsigned long rdi;
  unsigned long orig_rax;
  unsigned long rip;
  unsigned long cs;
  unsigned long eflags;
  unsigned long rsp;
  unsigned long ss;
  unsigned long fs_base;
  unsigned long gs_base;
  unsigned long ds;
  unsigned long es;
  unsigned long fs;
  unsigned long gs;
};

Let's write a little amd64/linux-specific wrapper for accessing meaningful fields.

const ABIArguments = struct {
    regs: c.user_regs_struct,

    fn nth(aa: ABIArguments, i: u8) c_ulong {
        std.debug.assert(i < 3);

        return switch (i) {
            0 => aa.regs.rdi,
            1 => aa.regs.rsi,
            2 => aa.regs.rdx,
            else => unreachable,
        };
    }

    fn setNth(aa: *ABIArguments, i: u8, value: c_ulong) void {
        std.debug.assert(i < 3);

        switch (i) {
            0 => { aa.regs.rdi = value; },
            1 => { aa.regs.rsi = value; },
            2 => { aa.regs.rdx = value; },
            else => unreachable,
        }
    }

    fn result(aa: ABIArguments) c_ulong { return aa.regs.rax; }

    fn setResult(aa: *ABIArguments, value: c_ulong) void {
        aa.regs.rax = value;
    }

    fn syscall(aa: ABIArguments) c_ulong { return aa.regs.orig_rax; }
};

One thing to note is that the field we read to get the syscall number is not aa.regs.rax but aa.regs.orig_rax. This is because rax also holds the return value, and the tracee stops twice for some syscalls: once on entrance and once on exit. The orig_rax field preserves the original rax value from syscall entrance. You can read more about this here.

Getting and setting registers

Now let's write the ChildManager code that actually calls PTRACE_GETREGS to fill out one of these structs.

    fn getABIArguments(cm: ChildManager) ABIArguments {
        var args = ABIArguments{ .regs = undefined };
        _ = c.ptrace(c.PTRACE_GETREGS, cm.childPid, cNullPtr, &args.regs);
        return args;
    }

Setting registers is similar; we just pass the struct back and call PTRACE_SETREGS instead:

    fn setABIArguments(cm: ChildManager, args: *ABIArguments) void {
        _ = c.ptrace(c.PTRACE_SETREGS, cm.childPid, cNullPtr, &args.regs);
    }

A hook

Now we finally have enough code to write a hook that can get and set registers; i.e. manipulate a system call!

We'll start by registering a sys_write hook in the hooks field we check in childInterceptSyscalls above.

    const hooks = &[_]struct {
        syscall: c_ulong,
        hook: *const fn (ChildManager, *ABIArguments) anyerror!void,
    }{.{
        .syscall = @intFromEnum(std.os.linux.syscalls.X64.write),
        .hook = writeHandler,
    }};

If we look at the manpage for write, we see it takes three arguments:

  1. The file descriptor (fd) to write to
  2. The address to start writing data from
  3. And the number of bytes to write

Going back to the calling convention, that means the fd will be in rdi, the data address in rsi, and the data length in rdx.

So if we shorten the data length, we should be creating a short (incomplete) write.

    fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
        const fd = entryArgs.nth(0);
        const dataAddress = entryArgs.nth(1);
        var dataLength = entryArgs.nth(2);

        // Truncate some bytes
        if (dataLength > 2) {
            dataLength -= 2;
            entryArgs.setNth(2, dataLength);
            cm.setABIArguments(entryArgs);
        }
    }

In a more sophisticated version of this program, we could randomly decide when to truncate data and randomly decide how much data to truncate. However, for our purposes this is sufficient.

But there are some real problems with this code. When I ran this program against a basic Go program, I saw duplicate requests.

So the deal with PTRACE_SYSCALL is that for (most?) syscalls, you get to modify data before the data actually is handled by the kernel. And you get to modify data after the kernel has finished the syscall too.

This makes sense because PTRACE_SYSCALL (unlike PTRACE_SYSEMU) allows the syscall to actually happen. And if we wanted to, for example, modify the syscall exit code, we'd have to do that after the syscall was done not before it started. We are modifying registers directly after all.

All this means for our Zig code is that when we handle sys_write, we need to call PTRACE_SYSCALL again to process the syscall exit. Otherwise we'd reach this writeHandler for both entries and exits, which would require some additional way of disambiguating entrances from exits.

    fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
        const fd = entryArgs.nth(0);
        const dataAddress = entryArgs.nth(1);
        var dataLength = entryArgs.nth(2);

        // Truncate some bytes
        if (dataLength > 2) {
            dataLength -= 2;
            entryArgs.setNth(2, dataLength);
            cm.setABIArguments(entryArgs);
        }

        const data = try cm.childReadData(dataAddress, dataLength);
        defer data.deinit();
        std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });

        // Handle syscall exit
        _ = cm.childWaitForSyscall();
    }

We could put the cm.childWaitForSyscall() that waits for the syscall exit in the main loop, and I did try that at first. However, not all syscalls seemed to have both an entry and an exit stop, and this resulted in the hooks sometimes starting with a syscall exit rather than a syscall entry. So rather than making the code more complicated, I decided to only wait for the exit on syscalls I knew (by observation, at least) had an exit, like sys_write.

Multiple writes? No bad logic?

So I had this code as is, correctly handling syscall entrances and exits, but I was seeing multiple write calls. And the text file I was writing to had the complete text I wanted to write. There was no short write even though I truncated the data length.

This took some digging into Go source code to understand. If you trace what os.File.Write() does on Linux you eventually get to src/internal/poll/fd_unix.go:

// Write implements io.Writer.
func (fd *FD) Write(p []byte) (int, error) {
        if err := fd.writeLock(); err != nil {
                return 0, err
        }
        defer fd.writeUnlock()
        if err := fd.pd.prepareWrite(fd.isFile); err != nil {
                return 0, err
        }
        var nn int
        for {
                max := len(p)
                if fd.IsStream && max-nn > maxRW {
                        max = nn + maxRW
                }
                n, err := ignoringEINTRIO(syscall.Write, fd.Sysfd, p[nn:max])
                if n > 0 {
                        nn += n
                }
                if nn == len(p) {
                        return nn, err
                }
                if err == syscall.EAGAIN && fd.pd.pollable() {
                        if err = fd.pd.waitWrite(fd.isFile); err == nil {
                                continue
                        }
                }
                if err != nil {
                        return nn, err
                }
                if n == 0 {
                        return nn, io.ErrUnexpectedEOF
                }
        }
}

This might be common knowledge but I didn't realize Go did this. And when I tried out the same basic program in Python and even C, the behavior was the same. The builtin write() behavior on a file (in many languages, apparently) is to retry until all data is written, with some exceptions.

This makes sense since files on disk, unlike file descriptors backed by network sockets, are generally always available. Compared to a network connection, disks are physically close and almost always stay connected. (With some obvious exceptions like network-attached storage and thumb drives.)

So to trigger the short write, the easiest way seems to be to have the sys_write call return an error that is NOT EAGAIN, since the code will retry if that is the error.

After looking through the list of errors that sys_write can return, EIO seems like a nice one.

So let's do our final version of writeHandler and on the syscall exit, we'll modify the return value (rax in amd64/linux) to be EIO.

    fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
        const fd = entryArgs.nth(0);
        const dataAddress = entryArgs.nth(1);
        var dataLength = entryArgs.nth(2);

        // Truncate some bytes
        if (dataLength > 2) {
            dataLength -= 2;
            entryArgs.setNth(2, dataLength);
            cm.setABIArguments(entryArgs);
        }

        // Handle syscall exit
        _ = cm.childWaitForSyscall();

        var exitArgs = cm.getABIArguments();
        dataLength = exitArgs.nth(2);
        if (dataLength > 2) {
            // Force the writes to stop after the first one by returning EIO.
            var result: c_ulong = 0;
            result = result -% c.EIO;
            exitArgs.setResult(result);
            cm.setABIArguments(&exitArgs);
        }
    }

Let's give it a whirl!

All together

Build the Zig fault injector and the Go test code:

$ zig build-exe --library c main.zig
$ ( cd test && go build main.go )

And run:

$ ./main test/main

And check test.txt:

$ cat test.txt
some great stu

Hey, that's a short write! :)

Sidenote: Reading data from the child

We accomplished everything we set out to, but there's one other useful thing we can do: reading the actual data passed to the write syscall.

Just like how we can get the child process registers with PTRACE_GETREGS, we can read child memory with PTRACE_PEEKDATA. PTRACE_PEEKDATA takes the child process id and the memory address in the child to read from. It returns a word of data (which on amd64/linux is 8 bytes).

We can use the syscall arguments (data address and length) to keep calling PTRACE_PEEKDATA on the child until we've read all bytes of the data the child process wanted to write:

    fn childReadData(
        cm: ChildManager,
        address: c_ulong,
        length: c_ulong,
    ) !std.ArrayList(u8) {
        var data = std.ArrayList(u8).init(cm.arena.allocator());
        while (data.items.len < length) {
            var word = c.ptrace(
                c.PTRACE_PEEKDATA,
                cm.childPid,
                address + data.items.len,
                cNullPtr,
            );

            for (std.mem.asBytes(&word)) |byte| {
                if (data.items.len == length) {
                    break;
                }
                try data.append(byte);
            }
        }
        return data;
    }

And we could modify writeHandler to print out the entirety of the write message each time (for debugging):

    fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
        const fd = entryArgs.nth(0);
        const dataAddress = entryArgs.nth(1);
        var dataLength = entryArgs.nth(2);

        // Truncate some bytes
        if (dataLength > 2) {
            dataLength -= 2;
            entryArgs.setNth(2, dataLength);
            cm.setABIArguments(entryArgs);
        }

        const data = try cm.childReadData(dataAddress, dataLength);
        defer data.deinit();
        std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });

        // Handle syscall exit
        _ = cm.childWaitForSyscall();

        var exitArgs = cm.getABIArguments();
        dataLength = exitArgs.nth(2);
        if (dataLength > 2) {
            // Force the writes to stop after the first one by returning EIO.
            var result: c_ulong = 0;
            result = result -% c.EIO;
            exitArgs.setResult(result);
            cm.setABIArguments(&exitArgs);
        }
    }

That's pretty neat!

Next steps

Short writes are just one of many bad IO interactions. Another fun one would be to completely buffer all writes on a file descriptor (not allowing anything to be written to disk at all) until fsync is called on the file descriptor. Or forcing fsyncs to fail.

An interesting optimization would be to apply seccomp filters so that rather than paying a penalty for watching every syscall, I only get notified about the ones I have hooks for like sys_write. Here's another post that explores ptrace with seccomp filters.

Credits: Thank you Charlie Cummings and Paul Khuong for reviewing a draft of this post!

How do databases execute expressions?

2023-09-21 08:00:00

Databases are fun. They sit at the confluence of Computer Science topics that might otherwise not seem practical in life as a developer. For example, every database with a query language is also a programming language implementation of some caliber. Not every database has a query language though, of course; see RocksDB, FoundationDB, TigerBeetle, etc.

This post looks at how various databases execute expressions in their query language.

tldr; Most surveyed databases use a tree-walking interpreter. A few use stack- or register-based virtual machines. A couple have just-in-time compilers. And, tangentially, a few do vectorized interpretation.

Throughout this post I'll use "virtual machine" as a shorthand for stack- or register-based loops that process a linearized set of instructions. I say this since it is sometimes fair to call a tree-walking interpreter a virtual machine. But that is not what I mean when I say virtual machine in this post.

Stepping back

Programming languages are typically implemented by turning an Abstract Syntax Tree (AST) into a linear set of instructions for a virtual machine (e.g. CPython, Java, C#) or native code (e.g. GCC's C compiler, Go, Rust). Some of the former implementations also generate and run Just-In-Time (JIT) compiled native code (e.g. Java and C#).

It is less common these days for a programming language implementation to interpret directly off the AST or some other tree-like intermediate representation. This style is often called tree-walking.

Shell languages sometimes do tree-walking. Otherwise, implementations that interpret directly off of a tree normally do so as a short-term measure before switching to compiled virtual machine code or JIT-ed native code (e.g. some JavaScript implementations, GraalVM, RPython, etc.)

That is, while some major programming language implementations started out with tree-walking interpreters, they mostly moved away from solely tree-walking over a decade ago. See JSC in 2008, Ruby in 2007, etc.

My intuition is that tree-walking takes up more memory and is less cache-friendly than the linear instructions you give to a virtual machine or to your CPU. There are some folks who disagree, but they mostly talk about tree-walking when you've also got a JIT compiler hooked up, which isn't quite the same thing. There has also been some early exploration and improvements reported when tree-walking with a tree organized as an array.

And databases?

Databases often interpret directly off a tree. (It isn't, generally speaking, fair to say they are AST-walking interpreters because databases typically transform and optimize beyond just an AST as parsed from user code.)

But not all databases interpret a tree. Some have a virtual machine. And some generate and run JIT-ed native code.

Methodology

If a core function (in the query execution path that does something like arithmetic or comparison) returns a value, that's a sign it's a tree-walking interpreter. Or, if you see code that is evaluating its arguments during execution, that's also a sign of a tree-walking interpreter.

On the other hand, if the function mutates internal state such as by assigning a value to a context or pushing to a stack, that's a sign it's a stack- or register-based virtual machine. If a function pulls its arguments from memory and doesn't evaluate the arguments, that's also an indication it's a stack- or register-based virtual machine.
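
To make the contrast concrete, here is a minimal sketch of both styles for arithmetic expressions (in Go, with made-up node and opcode names):

package main

import "fmt"

// Tree-walking style: each node evaluates its children and returns a value.
type Expr interface{ Eval() int64 }

type Lit struct{ v int64 }

func (l Lit) Eval() int64 { return l.v }

type Add struct{ l, r Expr }

func (a Add) Eval() int64 { return a.l.Eval() + a.r.Eval() }

// Virtual machine style: a loop over linear instructions mutating a stack;
// opAdd pulls already-evaluated arguments off the stack.
const (
	opPush = iota
	opAdd
)

type instr struct {
	op  int
	arg int64
}

func run(prog []instr) int64 {
	var stack []int64
	for _, in := range prog {
		switch in.op {
		case opPush:
			stack = append(stack, in.arg)
		case opAdd:
			l, r := stack[len(stack)-2], stack[len(stack)-1]
			stack = append(stack[:len(stack)-2], l+r)
		}
	}
	return stack[len(stack)-1]
}

func main() {
	// 1 + (2 + 3) in both styles.
	fmt.Println(Add{Lit{1}, Add{Lit{2}, Lit{3}}}.Eval())
	fmt.Println(run([]instr{{opPush, 1}, {opPush, 2}, {opPush, 3}, {opAdd, 0}, {opAdd, 0}}))
}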

This approach can result in false-positives depending on the architecture of the interpreter. User-defined functions (UDFs) would probably accept evaluated arguments and return a value regardless of how the interpreter is implemented. So it's important to find not just functions that could be implemented like UDFs, but core builtin behavior. Control flow implementations of functions like if or case can be great places to look.

And tactically, I clone the source code and run stuff like git grep -i eval | grep -v test | grep \\.java | grep -i eval or git grep -i expr | grep -v test | grep \\.go | grep -i expr until I convince myself I'm somewhere interesting.

Note: In talking about a broad swath of projects, maybe I've misunderstood one or some. If you've got a correction, let me know! If there's a proprietary database you work on where you can link to the (publicly described) execution strategy, feel free to pass it along! Or if I'm missing your public-source database in this list, send me a message!

Survey

Cockroach (Ruling: Tree Walker)

Judging by functions like func (e *evaluator) EvalBinaryExpr, which evaluates the left-hand side, then the right-hand side, and returns a value, Cockroach looks like a tree-walking interpreter.

It gets a little more interesting though, since Cockroach also supports vectorized expression execution. Vectorizing is a fancy term for acting on many pieces of data at once rather than one at a time. It doesn't necessarily imply SIMD. Here is an example of a vectorized addition of two int64 columns.
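
In spirit, vectorized execution of that addition looks something like this (a hypothetical sketch, not Cockroach's actual code):

package main

import "fmt"

// addInt64Cols adds two int64 columns element-wise, operating on a whole
// batch of rows per call instead of evaluating row by row.
func addInt64Cols(out, a, b []int64) {
	for i := range out {
		out[i] = a[i] + b[i]
	}
}

func main() {
	a := []int64{1, 2, 3}
	b := []int64{10, 20, 30}
	out := make([]int64, len(a))
	addInt64Cols(out, a, b)
	fmt.Println(out) // [11 22 33]
}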

ClickHouse (Ruling: Tree Walker + JIT)

The ClickHouse architecture is a little unique and difficult for me to read through – likely due to it being fairly mature, with serious optimization. But they tend to document their header files well. So files like src/Functions/IFunction.h and src/Interpreters/ExpressionActions.h were helpful.

They have also spoken publicly about their pipeline execution model; e.g. this presentation and this roadmap issue. But it isn't completely clear how much pipeline execution (which is broader than just expression evaluation) connects to expression evaluation.

Moreover, they have publicly spoken about their support for JIT compilation for query execution. But let's look at how execution works when the JIT is not enabled. For example, if we take a look at how if is implemented, we know that the then and else rows must be conditionally evaluated.

In the runtime entrypoint, executeImpl, we see the function call executeShortCircuitArguments, which in turn calls ColumnFunction::reduce(), which evaluates each column vector that is an argument to the function and then calls execute on the function.

So from this we can tell the non-JIT execution is a tree walker and that it is over chunks of columns, i.e. vectorized data, similar to Cockroach. However in ClickHouse execution is always over column vectors.

In the original version of this post, I had some confusion about the ClickHouse execution strategy. Robert Schulze from ClickHouse helped clarify things for me. Thanks Robert!

DuckDB (Ruling: Tree Walker)

If we take a look at how function expressions are executed, we can see each argument in the function being evaluated before being passed to the actual function. So that looks like a tree walking interpreter.

Like ClickHouse, DuckDB expression execution is always over column vectors. You can read more about this architecture here and here.

Influx (Ruling: Tree Walker)

Influx originally had a SQL-like query language called InfluxQL. If we look at how it evaluates a binary expression, it first evaluates the left-hand side and then the right-hand side before operating on the sides and returning a value. That's a tree-walking interpreter.

Flux was the new query language for Influx. While the Flux docs suggest they transform to an intermediate representation that is executed on a virtual machine, there's nothing I'm seeing that looks like a stack- or register-based virtual machine. All the evaluation functions evaluate their arguments and return a value. That looks like a tree-walking interpreter to me.

Today Influx announced that Flux is in maintenance mode and they are focusing on InfluxQL again.

MariaDB / MySQL (Ruling: Tree Walker)

Control flow methods are normally a good way to see how an interpreter is implemented. The implementation of COALESCE looks pretty simple. We see it call val_str() for each argument to COALESCE. But I can only seem to find implementations of val_str() on raw values and not expressions. Item_func_coalesce itself does not implement val_str() for example, which would be a strong indication of a tree walker. Maybe it does implement val_str() through inheritance.

It becomes a little clearer if we look at non-control flow methods like acos. In this method we see Item_func_acos itself implement val_real() and also call val_real() on all its arguments. In this case it's obvious how the control flow of acos(acos(.5)) would work. So that seems to indicate expressions are executed with a tree walking interpreter.

I also noticed sql/sp_instr.cc. That is scary (in terms of invalidating my analysis) since it looks like a virtual machine. But after looking through it, I think this virtual machine only corresponds to how stored procedures are executed, hence the sp_ prefix for Stored Programs. MySQL docs also explain that stored procedures are executed with a bytecode virtual machine.

I'm curious why they don't use that virtual machine for query execution.

As far as I can tell MySQL and MariaDB do not differ in this regard.

MongoDB (Ruling: Virtual Machine)

Mongo recently introduced a virtual machine for executing queries, called Slot Based Execution (SBE). We can find the SBE code in src/mongo/db/exec/sbe/vm/vm.cpp and the main virtual machine entrypoint here. Looks like a classic stack-based virtual machine!

It isn't completely clear to me if the SBE path is always used or if there are still cases where it falls back to their old execution model. You can read more about Mongo execution here and here.

PostgreSQL (Ruling: Virtual Machine + JIT)

The top of PostgreSQL's src/backend/executor/execExprInterp.c clearly explains that expression execution uses a virtual machine. You see all the hallmarks: opcodes, a loop over a giant switch, etc. And if we look at how function expressions are executed, we see another hallmark which is that the function expression code doesn't evaluate its arguments. They've already been evaluated. And function expression code just acts on the results of its arguments.

PostgreSQL also supports JIT-ing expression execution. And we can find the switch between interpreting and JIT-compiling an expression here.

QuestDB (Ruling: Tree Walker + JIT)

QuestDB wrote about their execution engine recently. When the conditions are right, they'll switch over to a JIT compiler and run native code.

But let's look at the default path. For example, how AND is implemented. AndBooleanFunction implements BooleanFunction, which implements Function. An expression can be evaluated by calling a getX() method on the expression type that implements Function. AndBooleanFunction calls getBool() on its left- and right-hand sides. And if we look at the partial implementation of BooleanFunction, we'll also see it doing getX()-specific conversions during the call of getX(). So that's a tree-walking interpreter.

Scylla (Ruling: Tree Walker)

If we take a look at how functions are evaluated in Scylla, we see function evaluation first evaluating all of its arguments. And the function evaluation function itself returns a cql3::raw_value. So that's a tree-walking interpreter.

SQLite (Ruling: Virtual Machine)

SQLite's virtual machine is comprehensive and well-documented. It encompasses more than just expression evaluation but the entirety of query execution.

We can find the massive virtual machine switch in src/vdbe.c.

And if we look, for example, at how AND is implemented, we see it pulling its arguments out of memory (already evaluated) and assigning the result back to a designated point in memory.

SingleStore (Ruling: Virtual Machine + JIT)

While there's no source code to link to, SingleStore gave a talk at CMU that broke down their query execution pipeline. Their docs also cover the topic.

SingleStore compiler pipeline

TiDB (Ruling: Tree Walker)

Similar to DuckDB and ClickHouse, TiDB implements vectorized interpretation. They've written publicly about their switch to this method.

Let's take a look at how if is implemented in TiDB. There is a vectorized and non-vectorized version of if (in expression/control_builtin.go and expression/control_builtin_generated.go respectively). So maybe they haven't completely switched over to vectorized execution or maybe it can only be used in some conditions.

If we look at the non-vectorized version of if, we see the condition evaluated. And then the then or else is evaluated depending on the result of the condition. That's a tree-walking interpreter.

Conclusion

As the DuckDB team points out, vectorized interpretation or JIT compilation seem like the future for database expression execution. These strategies seem particularly important for analytics or time-series workloads. But vectorized interpretation seems to make the most sense for column-wise storage engines. And column-wise storage normally only makes sense for analytics workloads. Still, TiDB and Cockroach are transactional databases that also vectorize execution.

And while SQLite and PostgreSQL use the virtual machine model, it's possible databases with tree-walking interpreters like Scylla and MySQL/MariaDB have decided there are not significant enough gains to be had (for transactional workloads) to justify the complexity of moving to a compiler + virtual machine architecture.

Tree-walking interpreters and virtual machines are also independent from whether or not execution is vectorized. So that will be another interesting dimension to watch: whether more databases move toward vectorized execution even if they don't adopt JIT compilation.

Yet another alternative is that maybe as databases mature we'll see compilation tiers similar to what browsers do with JavaScript.

Credits: Thanks Max Bernstein, Alex Miller, and Justin Jaffray for reviewing a draft version of this! And thanks to the #dbs channel on Discord for instigating this post!

Eight years of organizing tech meetups

2023-09-04 08:00:00

This is a collection of random personal experiences. So if you don't want to read everything, feel free to skip to the end for takeaways.

I write because I'd like to see more high-quality meetups. And maybe my little bit of experience will help someone out.

2015: Philadelphia

I first tried to organize a meetup in Philly in 2015. I was contracting at the time and I figured a meetup might be a good way to source contracts or just meet interesting people. I created the "Philadelphia Software in Business" (or some other similarly vaguely named) group on Meetup.com.

I didn't have any network; the first companies I worked for were not in Philly. But Meetup.com got me a few tens of people joining the group.

My first challenge was finding a place to meet. I didn't know what I was doing so I looked at restaurants, bars, and cafes for dedicated event space. Needless to say, renting space was expensive on its own. And there was always an additional required minimum spend per attendee.

I ultimately found a place near the Schuylkill River. Maybe it was a community event space. Maybe I paid for it. I can't remember.

The first and only time I hosted an event for the group, I got a surprising number of people for such a vague topic. There were maybe 6 of us. I was the youngest by far (I was 20); the rest were middle-aged. Excel users and one visionary type.

There was no real point to the meetup and I didn't continue doing it.

2016 - 2017: Linode

While I was at Linode, I organized "hack nights". I didn't ask for anyone's approval before starting it. I just said I'd be ordering pizza for anyone interested in staying after work to hack on Linode-related projects. I was willing to pay for the pizza, in part because I didn't want to risk being shut down by asking. But caker paid for it each time.

I was nervous that people would show up for the pizza and not want to hack. It was company-provided under the aspiration of doing Linode-related work. Maybe I mentioned this or not. I can't remember. I'm pretty sure they got their pizza.

Aside from myself, developers at Linode didn't really attend. The folks who attended were support staff or folks from the technical writing team who wanted more experience coding.

I ran this for maybe 3 to 5 Wednesdays before not continuing. It was pretty fun! But staying after work for a few hours each Wednesday lost its charm.

Book Club

Another time at Linode I started a book club. I was very torn about attempting to make the book club open to anyone in the area or just to Linode employees.

I knew I'd probably get more people to attend if I made it public. But I wasn't sure if Linode would be cool with having external folks in the office. Before they moved to the Old City office, visitors weren't really a thing.

So I made it private to Linode. And I started with the most obvious book for your average developer: Practical Common Lisp.

I am pretty sure I learned one big trick by this time though. When I announced I'd be starting the book club I said something like this:

Hey folks! I'm thinking of starting a book club. A book I have in mind to start with is Practical Common Lisp. If I get at least one other person to join in then I'll move forward!

I ended up getting two folks: one developer and one support staff member. We held the book club for 30 minutes once a week, covering one chapter each week. I think I was the only one who actually read anything, but the other two guys faithfully showed up for discussion.

I didn't ask for permission to do this either. And this time we met during company time. I think it was 2-2:30PM.

It was fun. We finished the book. But Practical Common Lisp probably wasn't a good choice. And I don't think I started a second book.

2017 - 2020: False starts

I moved to NYC and joined a small startup (~20 employees). Linode was 100+ employees.

We were in a WeWork so I considered starting a book club that was public to the WeWork. I had learned by then the law of numbers: I probably wouldn't get anyone from my company to join.

I considered putting up posters around the WeWork to advertise. But in the end, I didn't end up going through with anything.

I did present at a few meetups in NYC during this time. But I didn't organize anything.

And then the pandemic hit and everything disappeared.

2021 - 2022: Virtual

In 2021 I started contracting again, thinking about starting a company. I wanted a community to be at the center.

So I started a Discord focused on software internals.

I had a bit more of a network at this point so I posted about the Discord on Twitter and got 100 likes or something and slowly started gaining folks in the Discord.

I knew it was going to do better if I was pretty active in it so I made sure to post interesting blog posts at some regular interval. About compilers or databases or something.

The Discord didn't turn out to help me out much in the starting-a-company front. Or I didn't use it effectively for that.

I wanted more of an independent Discord of cool people who like to learn about systems internals. And that's what I got.

This turned out to be ok though because I stopped working on that company and the Discord is still around and I still get to hang out with cool people.

This Discord is still around and hit 1,700 members recently. Among other things, it has developers from many different database companies in it these days. They hang out and help out the noobs like me learn about database internals.

I culled inactive members recently, so today the total is around 1,100.

Hacker Nights

During the pandemic I became frustrated that all the good meetups disappeared so I decided to start an online one that would be somewhat tied to the Discord and be about software internals.

I would find 2 or 3 people to present for 10-20 minutes each on anything to do with software internals. We'd meet once a month at 8PM NY time I think.

To get speakers I'd mostly DM people who I saw do interesting things on Twitter or Hacker News. I was lucky to have Philip O'Toole (author of rqlite), Simon Eskildsen (author of the Napkin Math blog), Rasmus Andersson, and many other excellent folks speak.

You can find videos of these talks on YouTube.

The events were organized on Meetup.com. The group grew quickly and I'd have about 100 people RSVP to each event. 10-20 normally showed up.

I'd post a Zoom link on Meetup.com. Sometimes Meetup.com crashed right as the meetup started, so no one could get a link. That was fun.

On two different nights I had Zoom bombers show up and play crazy music or impersonate other members of the call and act weirdly (Zoom lets you change your name after you've joined the call).

I learned a little bit about how to administrate a Zoom meeting.

I ran Hacker Nights for 5 months. It was tiring to find speakers, tiring to deal with Zoom bombers. It was thankless and I wasn't really enjoying it.

I was proud though that I was offering a channel for developers to learn about software internals of compilers, databases, etc. And it was great to meet many interesting speakers and attendees.

2023: Designing Data Intensive Applications

A month ago I put out a call on Twitter for folks in NYC interested in reading through the book Designing Data Intensive Applications.

I'd read the book before and while it was challenging, I knew it was immensely useful to any developer who works with data or an API.

By this time I'd learned my second trick: not asking for public responses.

I said something like:

Hey folks! I'm thinking of starting a book club meeting in Midtown NYC reading through Designing Data Intensive Applications. DM me if you'd be interested! If I get 2 other interested folks this will be on!

I got maybe 40 DMs and 20 of them were based in NYC. Attendance thus would have been higher if I had made the book club virtual. But virtual events take about as much effort as in-person events and somehow feel less rewarding. So I went through with the NYC group.

I'm sure I could have gotten some company to provide us space, but this would just mean more negotiation for me and tedium for everyone involved (bring your ID to be checked in, make sure you're registered, etc.).

The group would meet every 2 weeks and cover 2 chapters at a time. We'd meet for 30 minutes. To avoid needing to find a place to meet, we'd meet in public at Bryant Park. (There turns out to be plenty of available seating on Fridays at 9AM in Bryant Park. When it rains we meet online.)

I wanted to keep the overhead minimal and the timeline slightly aggressive. We'd be through the book in only 3 months. No crazy commitment.

We've met twice now and are 25% of the way through the book. Attendance has been around 7 to 9 people each time so far, or a little less than 50%.

They're almost all software developers (with one manager, I think) who work for a variety of large and small tech companies.

I'm loving it so far. And if it continues to go well, I'll probably continue running in-person book clubs.

But it would only meet a few months a year, giving me breaks of a few months from running it.

Takeaways: The meh

Organizing any event takes effort. Meetups are especially hard because you need to find a place to run the meetup, you probably want to provide food, and you need to find speakers.

Often you can find a single place to host the meetup, but you have to constantly search for new speakers. Even one of the greatest meetups in NYC, Papers We Love, seems to be struggling to find speakers.

The CMU Database Group and the Distributed Systems Reading Group seem to have the right idea though. They only run sessions part of the year, and they plan out all sessions in advance (including speakers).

However, they are both virtual. And I'm not so interested in running virtual events anymore.

Takeaways: The good

For one, meetups are an awesome way to meet random people and expand your network.

Two, they're educational. Even beyond the content you are meeting about, there's the discussion alongside it you wouldn't get by yourself. And you, as organizer, get to pick the topic.

These work out great for me. I love to meet people, and I love to learn.

Tricks

Starting something new is embarrassing because you're putting yourself out there. Maybe no one in your network shares your interests (to the degree or in the direction you do).

My tricks, as covered above, are: set a low bar for committing (I'll run the thing if even two or three people are interested), and ask for private responses (DMs) rather than public ones.

These ideas apply to corporate planning too. I think about them when I'm sharing some new idea in company Slack as much as when I share on Twitter.

A note on attendance rates: 10-20% actual attendance versus RSVPs seems normal. If a higher percentage of your RSVPs actually attend, you're doing pretty well!

Finding sponsors

One final idea is about paying for space or paying for food. Companies with space and money for food are often willing to partner with folks willing to do the work to run an event.

Running your own event in a company's space is advertising for them. They get to be associated with cool tech. It's a chance for them to pitch their open positions.

Obviously this happens often when you start a meetup hosted by your own company. But you can also find other companies to host space.

The people to find to make this happen are senior developers or engineering managers; they're often reachable on Twitter and sometimes on LinkedIn.

I haven't done this myself yet because I'm not ready to commit to running a meetup. But I see it happen. And it's the approach I'd take if I were to run a real meetup again.

Though now that I've got some time off there are a few talks I'd like to do myself.

Thinking about functional programming

2023-08-15 08:00:00

Someone on Discord asked about how to learn functional programming.

The question and my initial tweet on the subject prompted an interesting discussion with Shriram Krishnamurthi and other folks.

So here's a slightly more thought out exploration.

And just for backstory's sake: I spent a few years a while ago programming in Standard ML, and I wrote a chunk of a Scheme implementation. I'm not an expert, but I have a bit of background.

Hey, this is a free opinion.

Concepts from functional programming

When people talk about functional programming, I think of a few key choices you can make while programming:

  1. Immutability by default
  2. Recursion rather than loops
  3. First-class functions (e.g. map and reduce)

And if you have experience as a programmer, you either get the basic gist of these tenets or you can easily read about the basics.

But while most programmers I've met understood the basics, most were not particularly comfortable or fluent expressing programs with these ideas.

For myself, the only way I got comfortable expressing code with these ideas was lots of practice (as I mentioned above). And yet, even after I did a bunch of programming in Standard ML and Scheme, I really didn't see a particular benefit to practicing in a language other than one with which I was already generally comfortable.

You have to learn a lot of other random things when you pick up Scheme or Standard ML that aren't just: practice immutability by default, recursion, and first-class functions.

So I think it's kind of misguided when person A asks how to learn functional programming and person B responds that they should learn Haskell or OCaml or whatever. I see this happen pretty often online.

Beyond any "language for functional programming" as a recommendation in general, Haskell is a particularly egregious suggestion in my opinion, because you're not only trying to practice functional programming tenets but also dealing with a complex type system and lazy evaluation.

Instead, practice immutability, recursion, map/reduce in whatever language you like.
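
For instance, here's what that practice can look like even in Zig, the language I happen to be using these days. This is a minimal sketch (the helper names sum, map, and double are mine, and it assumes Zig 0.11-era syntax): recursion instead of a loop, no mutation of the input, and a function passed as a parameter.

const std = @import("std");

// Recursion instead of a loop; the input slice is never mutated.
fn sum(xs: []const i64) i64 {
    if (xs.len == 0) return 0;
    return xs[0] + sum(xs[1..]);
}

// A tiny map. Passing the function as a comptime parameter is
// Zig's closest analogue to a first-class function here.
fn map(comptime f: fn (i64) i64, in: []const i64, out: []i64) void {
    for (in, 0..) |x, i| {
        out[i] = f(x);
    }
}

fn double(x: i64) i64 {
    return x * 2;
}

pub fn main() void {
    const xs = [_]i64{ 1, 2, 3 };
    var out: [3]i64 = undefined;
    map(double, &xs, &out);
    std.debug.print("{d}\n", .{sum(&out)}); // 12
}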

Programming languages

If you want to study programming languages, that's awesome. However, functional programming doesn't really have any direct connection to studying programming languages.

Languages are all over the place. Scheme, Standard ML, and Haskell are worlds apart, even within the functional programming family.

And modern languages have mostly adopted the aspects of functional programming that used to be unique 20 years ago.

Moreover, there are many other worthwhile families of languages to learn about:

The list isn't exhaustive, and the variations within families can be massive. But the point is that functional programming doesn't mean crazy programming languages or crazy programming ideas. Functional programming is a subset of crazy programming languages and crazy programming ideas.

If you want to learn about crazy programming languages and crazy programming ideas, you should! Go for it!

Introduction to Computer Science

SICP is famous as the (former) introductory textbook for computer science at MIT, and for its use of Scheme and the Metacircular Evaluator.

I don't have any experience teaching beginners how to program, so I don't have thoughts on whether this made sense. That's for folks like Shriram to think about.

However, I'm a half-decent programmer and I can't make it through this book. If you liked the book or want to read it, that's great! But I don't recommend it to anyone.

And many introductory Computer Science textbooks just don't make much sense to give to experienced programmers. For an experienced programmer, they can be quite slow!

Most of the folks I see asking about how to learn functional programming are experienced programmers.

Do whatever you feel like doing

I don't mean to overanalyze things, or get you overanalyzing things. If you want to learn functional programming by writing Haskell, that's awesome, you should go for it.

Wanting to do something is basically the best motivation there is.

The only reason I write this sort of post is so that folks who think that using Haskell or Standard ML or Scheme or reading SICP is the only way to learn functional programming can see that those ideas aren't necessarily true.

Write a Scheme!

Finally, for folks with time and motivation wanting to seriously work out their functional programming muscles, writing a Scheme implementation with a decent chunk of the standard library can be an immensely enjoyable project.

You'll learn a lot about languages and compilers and algorithms and data structures. It's leetcode with meaning.
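
And the starting point is smaller than it sounds. Here's a hypothetical sketch (mine, not from any real implementation) of the core data type you'd begin with, again in Zig:

const std = @import("std");

// The heart of a Scheme: a tagged union over a few value kinds,
// with lists built out of cons cells.
const Value = union(enum) {
    number: f64,
    symbol: []const u8,
    cons: struct {
        car: *const Value,
        cdr: *const Value,
    },
    nil: void,
};

pub fn main() void {
    // The one-element list (2): car = 2, cdr = nil.
    const two = Value{ .number = 2.0 };
    const empty = Value{ .nil = {} };
    const list = Value{ .cons = .{ .car = &two, .cdr = &empty } };
    std.debug.print("{d}\n", .{list.cons.car.number});
}

From there, it's "just" a reader from text to Values, an eval over them, and the standard library.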

We put a distributed database in the browser – and made a game of it

2023-07-11 08:00:00

This is an external post of mine. Click here if you are not redirected.

Metaprogramming in Zig and parsing CSS

2023-06-19 08:00:00

I knew Zig supported some sort of reflection on types. But I had been confused about how to use it. What's the difference between @typeInfo and @TypeOf? I ignored this aspect of Zig until a problem came up at work where reflection made sense.

The situation was parsing text and storing the parsed fields in a struct: each parsed field name should match up to a struct field.

This is a fairly common problem. So this post walks through how to use Zig's metaprogramming features in a simpler but related domain: parsing CSS into typed objects, and pretty-printing these typed CSS objects.

I live-streamed the implementation of this project yesterday on Twitch. The video is available on YouTube. And the source is available on GitHub.

If you want to skip the parsing steps and just see the metaprogramming, jump to the implementation of match_property.

Parsing CSS

Let's imagine a CSS that only has alphabetical selectors, property names and values.

The following would be valid:

div {
  background: black;
  color: white;
}

a {
  color: blue;
}

Thinking about the structure of this stripped-down CSS, we've got:

  1. CSS properties that consist of property names and values (in our case the property names are limited to background and color)
  2. CSS rules that have a selector and a list of properties
  3. CSS sheets that have a list of rules

Turning that into Zig in main.zig:

const std = @import("std");

const CSSProperty = union(enum) {
    unknown: void,
    color: []const u8,
    background: []const u8,
};

const CSSRule = struct {
    selector: []const u8,
    properties: []CSSProperty,
};

const CSSSheet = struct {
    rules: []CSSRule,
};

The parser is going to look for CSS rules, each of which contains a selector and a list of CSS properties. The entrypoint is that simple:

fn parse(
    arena: *std.heap.ArenaAllocator,
    css: []const u8,
) !CSSSheet {
    var index: usize = 0;
    var rules = std.ArrayList(CSSRule).init(arena.allocator());

    // Parse rules until EOF.
    while (index < css.len) {
        var res = try parse_rule(arena, css, index);
        index = res.index;
        try rules.append(res.rule);

        // In case there is trailing whitespace before the EOF,
        // eating whitespace here makes sure we exit the loop
        // immediately before trying to parse more rules.
        index = eat_whitespace(css, index);
    }

    return CSSSheet{
        .rules = rules.items,
    };
}
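
As a quick sketch of how this entrypoint might be driven (a hypothetical main of mine; see the GitHub source mentioned above for the real wiring):

pub fn main() !void {
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();

    // A minimal sheet with one rule.
    const css = "div { color: white; }";
    const sheet = try parse(&arena, css);
    std.debug.print("parsed {d} rule(s)\n", .{sheet.rules.len});
}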

Let's implement the eat_whitespace helper we've referenced. It increments a cursor into the css file while it sees whitespace.

fn eat_whitespace(
    css: []const u8,
    initial_index: usize,
) usize {
    var index = initial_index;
    while (index < css.len and std.ascii.isWhitespace(css[index])) {
        index += 1;
    }

    return index;
}

In our stripped-down version of CSS all we have to think about is ASCII. So the builtin std.ascii.isWhitespace() function is perfect.

Next, parsing CSS rules.

parse_rule()

A rule consists of a selector, an opening curly brace, any number of properties, and a closing curly brace. We need to remember to eat whitespace between each piece of syntax.

We'll also reference a few more parsing helpers, covered next, for the selector, braces, and properties.

const ParseRuleResult = struct {
    rule: CSSRule,
    index: usize,
};
fn parse_rule(
    arena: *std.heap.ArenaAllocator,
    css: []const u8,
    initial_index: usize,
) !ParseRuleResult {
    var index = eat_whitespace(css, initial_index);

    // First parse selector(s).
    var selector_res = try parse_identifier(css, index);
    index = selector_res.index;

    index = eat_whitespace(css, index);

    // Then parse opening curly brace: {.
    index = try parse_syntax(css, index, '{');

    index = eat_whitespace(css, index);

    var properties = std.ArrayList(CSSProperty).init(arena.allocator());
    // Then parse any number of properties.
    while (index < css.len) {
        index = eat_whitespace(css, index);
        if (index < css.len and css[index] == '}') {
            break;
        }

        var attr_res = try parse_property(css, index);
        index = attr_res.index;

        try properties.append(attr_res.property);
    }

    index = eat_whitespace(css, index);

    // Then parse closing curly brace: }.
    index = try parse_syntax(css, index, '}');

    return ParseRuleResult{
        .rule = CSSRule{
            .selector = selector_res.identifier,
            .properties = properties.items,
        },
        .index = index,
    };
}

The parse_syntax helper is pretty simple: it does a bounds check and increments the cursor if the current character matches the one you pass in.

fn parse_syntax(
    css: []const u8,
    initial_index: usize,
    syntax: u8,
) !usize {
    if (initial_index < css.len and css[initial_index] == syntax) {
        return initial_index + 1;
    }

    debug_at(css, initial_index, "Expected syntax: '{c}'.", .{syntax});
    return error.NoSuchSyntax;
}

This brings us to debugging messages on failure. When we fail to parse a piece of syntax, we want to give a useful error message and point at the exact line and column of code where the error happened.

So let's implement debug_at.

debug_at

First, we iterate over the CSS source code until we find the full line that contains the index where the parser failed. Along the way we track the exact line and column corresponding to that index.

fn debug_at(
    css: []const u8,
    index: usize,
    comptime msg: []const u8,
    args: anytype,
) void {
    var line_no: usize = 1;
    var col_no: usize = 0;

    var i: usize = 0;
    var line_beginning: usize = 0;
    var found_line = false;
    while (i < css.len) : (i += 1) {
        if (css[i] == '\n') {
            if (!found_line) {
                col_no = 0;
                line_beginning = i;
                line_no += 1;
                continue;
            } else {
                break;
            }
        }

        if (i == index) {
            found_line = true;
        }

        if (!found_line) {
            col_no += 1;
        }
    }

Then we print it all out in a nice format for users (which will likely just be ourselves).

    std.debug.print("Error at line {}, column {}. ", .{ line_no, col_no });
    std.debug.print(msg ++ "\n\n", args);
    std.debug.print("{s}\n", .{css[line_beginning..i]});
    while (col_no > 0) : (col_no -= 1) {
        std.debug.print(" ", .{});
    }
    std.debug.print("^ Near here.\n", .{});
}
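
To get a taste of what this produces: if a property is missing its colon, as in

div {
  color white;
}

the parser fails in parse_syntax, and (if I've traced the cursor math right) the output looks roughly like:

Error at line 2, column 8. Expected syntax: ':'.

  color white;
        ^ Near here.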

Ok, popping our mental stack, if we look back at parse_rule we still need to implement parse_identifier and parse_property.

parse_identifier

An "identifier" for us here is just going to be an ASCII alphabetical string (i.e. [a-zA-Z]+). We're really simplifying CSS here, since we'll use this method to parse not just selectors but property names and even property values.

Zig again has a nice builtin, std.ascii.isAlphabetic, that we can use.

const ParseIdentifierResult = struct {
    identifier: []const u8,
    index: usize,
};
fn parse_identifier(
    css: []const u8,
    initial_index: usize,
) !ParseIdentifierResult {
    var index = initial_index;
    while (index < css.len and std.ascii.isAlphabetic(css[index])) {
        index += 1;
    }

    if (index == initial_index) {
        debug_at(css, initial_index, "Expected valid identifier.", .{});
        return error.InvalidIdentifier;
    }

    return ParseIdentifierResult{
        .identifier = css[initial_index..index],
        .index = index,
    };
}

In reality, CSS properties are highly complex. Parsing CSS correctly isn't the main aim of this post though. :)

parse_property

The final piece of CSS we need to parse is properties. These consist of a property name, then a colon, then a property value, and finally a semicolon. And between the pieces we eat whitespace.

const ParsePropertyResult = struct {
    property: CSSProperty,
    index: usize,
};
fn parse_property(
    css: []const u8,
    initial_index: usize,
) !ParsePropertyResult {
    var index = eat_whitespace(css, initial_index);

    // First parse property name.
    var name_res = parse_identifier(css, index) catch |e| {
        std.debug.print("Could not parse property name.\n", .{});
        return e;
    };
    index = name_res.index;

    index = eat_whitespace(css, index);

    // Then parse colon: :.
    index = try parse_syntax(css, index, ':');

    index = eat_whitespace(css, index);

    // Then parse property value.
    var value_res = parse_identifier(css, index) catch |e| {
        std.debug.print("Could not parse property value.\n", .{});
        return e;
    };
    index = value_res.index;

    // Finally parse semi-colon: ;.
    index = try parse_syntax(css, index, ';');

    var property = match_property(name_res.identifier, value_res.identifier) catch |e| {
        debug_at(css, initial_index, "Unknown property: '{s}'.", .{name_res.identifier});
        return e;
    };

    return ParsePropertyResult{
        .property = property,
        .index = index,
    };
}

Finally we get to the first bit of metaprogramming. Once we have a property name and value, we need to turn that into a Zig union.

That's what match_property() is going to be responsible for doing.

match_property

This function needs to take a property name and value and return a CSSProperty with the correct field (matching the property name passed in) set to the value passed in.

If we didn't have metaprogramming or reflection, the implementation might look like this:

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    if (std.mem.eql(u8, name, "color")) {
        return CSSProperty{.color = value};
    } else if (std.mem.eql(u8, name, "background")) {
        return CSSProperty{.background = value};
    }

    return error.UnknownProperty;
}

And that is not necessarily bad. In fact it may be how a lot of production code looks over time as product needs evolve. You can keep the internal field name unrelated to the external field name.

However for the sake of learning, we'll try to implement the same thing with Zig metaprogramming.

Specifically, we can take a look at lib/std/json/static.zig to understand the reflection APIs.

If we look at lines 210-226 of that file, we can see it iterating over the fields of a union:

        .Union => |unionInfo| {
            if (comptime std.meta.trait.hasFn("jsonParse")(T)) {
                return T.jsonParse(allocator, source, options);
            }

            if (unionInfo.tag_type == null) @compileError("Unable to parse into untagged union '" ++ @typeName(T) ++ "'");

            if (.object_begin != try source.next()) return error.UnexpectedToken;

            var result: ?T = null;
            var name_token: ?Token = try source.nextAllocMax(allocator, .alloc_if_needed, options.max_value_len.?);
            const field_name = switch (name_token.?) {
                inline .string, .allocated_string => |slice| slice,
                else => return error.UnexpectedToken,
            };

            inline for (unionInfo.fields) |u_field| {

Then right after that (lines 226-243), we see it conditionally initializing the result object:

            inline for (unionInfo.fields) |u_field| {
                if (std.mem.eql(u8, u_field.name, field_name)) {
                    // Free the name token now in case we're using an allocator that optimizes freeing the last allocated object.
                    // (Recursing into parseInternal() might trigger more allocations.)
                    freeAllocated(allocator, name_token.?);
                    name_token = null;

                    if (u_field.type == void) {
                        // void isn't really a json type, but we can support void payload union tags with {} as a value.
                        if (.object_begin != try source.next()) return error.UnexpectedToken;
                        if (.object_end != try source.next()) return error.UnexpectedToken;
                        result = @unionInit(T, u_field.name, {});
                    } else {
                        // Recurse.
                        result = @unionInit(T, u_field.name, try parseInternal(u_field.type, allocator, source, options));
                    }
                    break;
                }

We can see that the .Union => |unionInfo| condition is entered by switching on @typeInfo(T) (line 149) and that T is a type (line 144).

We don't have a generic type though. We know we are working with a CSSProperty. And we know CSSProperty is a union so we don't need the switch either.

So let's apply that to our match_property implementation.

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    const cssPropertyInfo = @typeInfo(CSSProperty);

    for (cssPropertyInfo.Union.fields) |u_field| {
        if (std.mem.eql(u8, u_field.name, name)) {
            return @unionInit(CSSProperty, u_field.name, value);
        }
    }

    return error.UnknownProperty;
}

And if we try to build that we'll get an error like this:

main.zig:15:31: error: values of type '[]const builtin.Type.UnionField' must be comptime-known, but index value is runtime-known
    for (cssPropertyInfo.Union.fields) |u_field| {

Zig's "reflection" abilities here are comptime only. So we can't use a runtime for loop; we must use a comptime inline for loop.

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    const cssPropertyInfo = @typeInfo(CSSProperty);

    inline for (cssPropertyInfo.Union.fields) |u_field| {
        if (std.mem.eql(u8, u_field.name, name)) {
            return @unionInit(CSSProperty, u_field.name, value);
        }
    }

    return error.UnknownProperty;
}

As far as I understand it, this loop is basically unrolled and the generated code would look a lot like our hard-coded initial version.

i.e. it would probably look like this:

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    const cssPropertyInfo = @typeInfo(CSSProperty);

    if (std.mem.eql(u8, "background", name)) {
        return @unionInit(CSSProperty, "background", value);
    }

    if (std.mem.eql(u8, "color", name)) {
        return @unionInit(CSSProperty, "color", value);
    }

    if (std.mem.eql(u8, "unknown", name)) {
        return @unionInit(CSSProperty, "unknown", value);
    }

    return error.UnknownProperty;
}

Again, that's just how I imagine the compiler generates code from the union field reflection and the inline for over the fields.

Try compiling the inline for version of match_property. I get this:

main.zig:17:58: error: expected type 'void', found '[]const u8'
            return @unionInit(CSSProperty, u_field.name, value);

Thinking about the generated code makes it especially clear what's happening. We have an unknown field in there that has a void type. You can't assign a string to void.

We know at runtime that the condition where that happens should be impossible because the user shouldn't enter unknown as a property name. (Though now that I write this, I see they actually could. But let's pretend they wouldn't.)

So the problem isn't a runtime failure but a comptime type-checking failure.

Thankfully we can work around this with comptime conditionals.

If we wrap our current condition in an additional conditional that is evaluated at comptime and filters out the unknown iteration of the inline for loop, the compiler shouldn't generate any code trying to assign to the unknown field.

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    const cssPropertyInfo = @typeInfo(CSSProperty);

    inline for (cssPropertyInfo.Union.fields) |u_field| {
        if (comptime !std.mem.eql(u8, u_field.name, "unknown")) {
            if (std.mem.eql(u8, u_field.name, name)) {
                return @unionInit(CSSProperty, u_field.name, value);
            }
        }
    }

    return error.UnknownProperty;
}

And indeed, if you try to compile it, this works. Since the conditional is evaluated at compile time, we can imagine the code the compiler generates is this:

fn match_property(
    name: []const u8,
    value: []const u8,
) !CSSProperty {
    const cssPropertyInfo = @typeInfo(CSSProperty);

    if (std.mem.eql(u8, "background", name)) {
        return @unionInit(CSSProperty, "background", value);
    }

    if (std.mem.eql(u8, "color", name)) {
        return @unionInit(CSSProperty, "color", value);
    }

    return error.UnknownProperty;
}

The unknown field has been skipped.
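
A quick way to sanity-check this behavior is a test (a hypothetical one of mine, runnable with zig test):

test "match_property picks the union field by name" {
    const prop = try match_property("color", "blue");
    try std.testing.expect(prop == .color);
    try std.testing.expectEqualStrings("blue", prop.color);

    // Unknown names, including the comptime-skipped "unknown" field
    // itself, are rejected at runtime.
    try std.testing.expectError(error.UnknownProperty, match_property("bogus", "red"));
    try std.testing.expectError(error.UnknownProperty, match_property("unknown", "red"));
}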

In retrospect, I realize that the unknown field probably isn't even needed. We could eliminate it from the CSSProperty union and get rid of that comptime conditional. However, sometimes there are in fact private fields you want to skip. And I wanted to show how to deal with that case.

For the last bit of metaprogramming, let's talk about displaying the resulting CSSSheet we'd get after parsing.

sheet.display()

If we didn't have metaprogramming and wanted to display the sheet, we'd have to switch on every possible union field.

Like so (I've modified the CSSSheet struct definition so it includes this method):

    fn display(sheet: *CSSSheet) void {
        for (sheet.rules) |rule| {
            std.debug.print("selector: {s}\n", .{rule.selector});
            for (rule.properties) |property| {
                switch (property) {
                    .unknown => unreachable,
                    .color => |color_value| std.debug.print("  color: {s}\n", .{color_value}),
                    .background => |background_value| std.debug.print("  background: {s}\n", .{background_value}),
                };
            }
            std.debug.print("\n", .{});
        }
    }

This is already a little annoying and could get unwieldy as we add fields to the CSSProperty union.

Instead we can use the inline for (@typeInfo(CSSProperty).Union.fields) |u_field| technique to iterate over all fields, skip the unknown field at comptime, and print each field's name and value, using the @tagName builtin to match on the property's active tag.

    fn display(sheet: *CSSSheet) void {
        for (sheet.rules) |rule| {
            std.debug.print("selector: {s}\n", .{rule.selector});