For the last 10 years I've chased my way down the software stack starting from humble beginnings with the venerable jQuery and PHP.
The RSS URL is: https://notes.eatonphil.com/rss.xml
2024-08-24 08:00:00
In your professional and personal life, I don't believe there is a stronger motivation than having something in mind and the desire to do it. Yet the natural way to deal with a desire to do something is to justify why it's not possible.
"I want to read more books but nobody reads books these days so how could I."
"I want to write for a magazine but I have no experience writing professionally."
"I want to build a company someday but how could someone of my background."
Our official mentors, our managers, through a combination of well-intentioned defeatism and well-intentioned lack of accomplishment themselves, among other things, are often unable to process big goals or guide you toward them.
I've been one of these managers myself. In fact I have, to my immense regret, tried too often to convince people to do what is practical rather than what they want to do. Or to do what I judged they were capable of doing rather than what they wanted to do.
In the best cases, my listener had the self-confidence to ignore me. They did what they wanted to do anyway. In the worst case, again to my deep regret, I've been a well-intentioned part of derailing someone's career for years.
So I don't want to convince anyone of anything anymore. If I start trying to convince someone by accident, I try to catch myself. I try to avoid sentences like "I think you should …". Instead "Here is something that's worked for me: …" or "Here is what I've heard works well for other people: …".
Nobody wants to be convinced. But intelligent people will change their mind when exposed to new facts or different ideas. Being convinced is a battle of will. Changing one's mind is a purely personal decision.
There are certainly people with discipline who can grind on things they hate doing and eventually become experts at it. But more often I see people grind on things they hate only to become depressed and give up.
For most of us, our best hope is (healthy) obsession. And obsession, in the sense I'm talking about, does not come from something you are ambivalent about or hate. Obsession can only come when you're doing something you actually want to do.
For big goals or big changes, you need regular commitment: weekly, monthly, yearly, over the course of years. And only obsession makes that work not actually feel like work. Obsession is the only thing that makes discipline not feel like discipline.
That big goals take years to accomplish need not be scary. Obsession doesn't mean you can't pivot. There is quite a lot to gain by committing to something regularly over the course of years even if you decide to stop and commit from then on to something else. You will learn a good deal.
And healthy obsession to me is more specifically measurable on the order of weeks, not hours or days. Healthy obsession means you're still building healthy personal and professional relationships. You're still taking care of yourself, emotionally and physically.
I do not have high expectations for people in general. This seems healthy and reasonable. But as I meet more people and observe them over the years, I am only more convinced of the vast potential of individuals. Individuals are almost universally underestimated.
I think you can do almost anything you want to do. If you commit to doing it.
I'll end this with a personal story.
Until 11th grade, I hated school. I hated the rigidity. Being forced to be somewhere for hours and to follow so many rules. I skipped so many days of school I'm embarrassed by it. I'd never do homework at home. I never studied for tests. I got Bs and Cs in the second-tier classes. I was in the orchestra for 6 years and never practiced at home. I was not cool enough to be a "bad kid" but I did not understand the system and had no discipline whatsoever.
I found out at the end of 10th grade that I could actually afford college if I got into a good enough school that paid full needs-based tuition. It sounded significantly better than the only other option that seemed obvious, joining the military as a recruit. I realized and decided that if I wanted to get into a good school I needed to not half-ass things.
Somehow, I decided to only do things I could become obsessed with. And I decided to be obsessed in the way that I wanted, not to do what everyone else did (which I basically could not do since I had no discipline). If we covered a topic in class, I'd read news about it or watch movies about it. I'd get myself excited about the topic in every way I could.
It basically worked out. I ended high school in the top 10% of the class (up from top 40% or something). I got into a good liberal arts college that paid the entirety of my tuition. But I remained a basically lazy and undisciplined person. I never stayed up late studying for a test. I dropped out after a year and a half for family reasons.
But I've now spent the last 10 years in my spare time working on compiler projects, interpreter projects, parser projects, database projects, distributed systems projects. I've spent the last 6 years consistently publishing at least one blog post per month.
I didn't want to work the way everyone else worked. I wanted to be obsessed about what I worked on.
Obsession has made all of this into something I now barely register as doing. It's allowed me to continue adding activities like organizing book clubs and meetups to the list of things I'm up to. Up until basically this year I could have in good faith said I am a very lazy and undisciplined person. But obsession turned me into someone with discipline.
Obsession became about more than just the tech. It meant trying to fully understand the product, the users, the market. It meant thinking more carefully about product documentation, user interfaces, company messaging. Obsession meant reflecting on how I treat my coworkers, and how my coworkers feel treated by others in general. Obsession meant wanting an equitable and encouraging work environment for everyone.
And, as I said, it's about healthy obsession. I didn't really understand the "healthy" part until a few years ago. But I'm now convinced that the "healthy" part is as important as the "obsession" part. To go to the gym regularly. To play pickup volleyball. To cook excellent food. To read fiction and poetry and play music. To serve the community. To be friendly and encouraging to all people. To meet new people and build better genuine friendships.
And in the context of work, "healthy obsession" means understanding you can't do everything, even while you care about everything. It means accepting that you make mistakes and that you do your best; that you try to do better and learn from mistakes the next time.
It's got to be sustainable. And we can develop a healthy obsession while we have quite a bit of fun too. :)
I wrote an essay on my mistakes trying to convince people to do something, on doing what you want to do, and on obsession. Ended with a personal note on developing healthy discipline, and having fun. :) https://t.co/4WWdtU6AhL pic.twitter.com/lBw7zlqWeq
— Phil Eaton (@eatonphil) August 24, 2024
2024-08-20 08:00:00
Bugs in distributed systems are hard to find, largely because systems interact in chaotic ways. And even once you've found a bug, it can be anywhere from simple to impossible to reproduce it. It's about as far away as you can get from the ideal test environment: property testing a pure function.
But what if we could write our code in a way that we can isolate the chaotic aspects of our distributed system during testing: run multiple systems communicating with each other on a single thread and control all randomness in each system? And property test this single-threaded version of the distributed system with controlled randomness, all the while injecting faults (fancy term for unhappy path behavior like errors and latency) we might see in the real-world?
Crazy as it sounds, people actually do this. It's called Deterministic Simulation Testing (DST). And it's become more and more popular with startups like FoundationDB, Antithesis, TigerBeetle, Polar Signals, and WarpStream; as well as folks like Tyler Neely and Pekka Enberg, talking about and making use of this technique.
It has become so popular to talk about DST in my corner of the world that I worry it risks coming off sounding too magical and maybe a little hyped. It's worth getting a better understanding of both the benefits and the limitations.
Thank you to Alex Miller and Will Wilson for reviewing a version of this post.
A big source of non-determinism in business logic is the use of random numbers—in your code or your transitive dependencies or your language runtime or your operating system.
Crucially, DST does not imply you can't have randomness! DST merely assumes that you have a global seed for all randomness in your program and that the simulator controls the seed. The seed may change across runs of the simulator.
Once you observe a bad state as a result of running the simulation on a random seed, you allow the user to enter that same seed again. This lets the user recreate the entire program run that led to the observed bad state, which makes the program trivial to debug.
Another big source of non-determinism is being dependent on time. As with randomness, DST does not mean you can't depend on time. DST means you must be able to control the clock during the simulation.
To "control" randomness or time basically means you support dependency injection, or the old-school alternative to dependency injection called passing the dependency as an explicit parameter. Rather than referring to a global clock or a global seed, you need to be able to receive a clock or a seed from someone.
For example, we might separate the operation of an application into the language's main() entrypoint and an actual application start() entrypoint.
# app.pseudocode
def start(clock, seed):
  # lots of business logic that might depend on time or do random things

def main:
  clock = time.clock()
  seed = time.now()
  app.start(clock, seed)
The application entrypoint is where we must be able to swap out a real clock or real random seed for one controlled by our simulator:
# sim.pseudocode
import "app.pseudocode"

def main:
  sim_clock = make_sim_clock()
  sim_seed = os.env.DST_SEED or time.now()
  try:
    app.start(sim_clock, sim_seed)
  catch(e):
    print("Bad execution at seed: %s", sim_seed)
    throw e
Let's look at another example.
Let's say that we had a helper method that kept calling a function until it succeeded, with backoff.
# retry.pseudocode
class Backoff:
  def init:
    this.rnd = rnd.new(seed = time.now())
    this.tries = 0

  async def retry_backoff(f):
    while this.tries < 3:
      if f():
        return
      await time.sleep(this.rnd.gen())
      this.tries++
There is a single source of nondeterminism here and it's where we generate a seed. We could parameterize the seed, but since we want to call time.sleep() and since in DST we control the time, we can just parameterize time.
# retry.pseudocode
class Backoff:
  def init(this, time):
    this.time = time
    this.rnd = rnd.new(seed = this.time.now())
    this.tries = 0

  async def retry_backoff(this, f):
    while this.tries < 3:
      if f():
        return
      await this.time.sleep(this.rnd.gen())
      this.tries++
Now we can write a little simulator to test this:
# sim.pseudocode
import "retry.pseudocode"

seed = os.env.DST_SEED or time.now()
rnd = rnd.new(seed)

sim_time = {
  now: 0
  sleep: (ms) => {
    await future.wait(ms)
  }
  tick: (ms) => now += ms
}
backoff = Backoff(sim_time)
while true:
  failures = 0
  f = () => {
    if rnd.rand() > 0.5:
      failures++
      return false
    return true
  }
  try:
    while sim_time.now < 60min:
      promise = backoff.retry_backoff(f)
      sim_time.tick(1ms)
      if promise.read():
        break
    assert_expect_failure_and_expected_time_elapse(sim_time, failures)
  catch(e):
    print("Found logical error with seed: %d", seed)
    throw e
This demonstrates a few critical aspects of DST. First, the simulator itself depends on randomness, but it allows the user to provide a seed so they can replay a simulation that discovers a bug. The controlled randomness in the simulator is what lets us do property testing.
Second, the simulation workload must be written by the user. Even when you've got a platform like Antithesis that gives you an environment for DST, it's up to you to exercise the application.
Now let's get a little more complex.
The determinism of multiple threads can only be controlled at the operating system or emulator or hypervisor layer. Realistically, that would require third-party systems like Antithesis or Hermit (which, don't get excited, is not actively developed and hasn't worked on any interesting program of mine) or rr.
These systems transparently transform multi-threaded code into single-threaded code. But also note that Hermit and rr have only limited ability to do fault injection which, in addition to deterministic execution, is a goal of ours. And you can't run them on a Mac, nor on ARM.
But we can, and would like to, write a simulator without writing a new operating system or emulator or hypervisor, and without a third-party system. So we must limit ourselves to writing code that can be collapsed into a single thread. Significantly, since using blocking IO would mean an entire class of concurrency bugs could not be discovered while running the simulator in a single thread, we must also limit ourselves to asynchronous IO.
Single threaded and asynchronous IO. These are already two big limitations.
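To make those limitations concrete, here is a minimal sketch in Python (illustrative only; the names are made up and this is not from the post) of what single-threaded, non-blocking code buys you: multiple "nodes" run as coroutines on one thread, and the only scheduling decision comes from a seeded RNG, so the entire interleaving is reproducible.

# scheduler_sketch.py (illustrative, not from the post)
import random

def node(name, inbox, outbox):
    # Toy "server": process one message per scheduling turn, then yield
    # control back to the scheduler instead of blocking on IO.
    while True:
        if inbox:
            msg = inbox.pop(0)
            outbox.append(f"{name} saw {msg}")
        yield

def run(seed, steps=10):
    rnd = random.Random(seed)          # every scheduling decision comes from this one seed
    a_in, b_in = ["ping"], []
    tasks = [node("A", a_in, b_in), node("B", b_in, a_in)]
    order = []
    for _ in range(steps):
        i = rnd.randrange(len(tasks))  # deterministic choice of which "node" runs next
        next(tasks[i])
        order.append(i)
    return order

# Same seed, same interleaving: a bug found at seed 42 can be replayed at seed 42.
assert run(42) == run(42)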
Some languages like Go are entirely built around transparent multi-threading and blocking IO. Polar Signals solved this for DST by compiling their application to WASM where it would run on a single thread. But that wasn't enough. Even on a single thread, the Go runtime intentionally schedules goroutines randomly. So Polar Signals forked the Go runtime to control this randomness with an environment variable. That's kind of crazy. Resonate took another approach that also looks cumbersome. I'm not going to attempt to describe it. Go seems like a difficult choice of a language if you want to do DST.
Like Go, Rust has no builtin async IO. The most mature async IO library is tokio. The tokio folks attempted to provide a tokio-compatible simulator implementation with all sources of nondeterminism removed. From what I can tell, they did not at any point fully succeed. That repo has now been replaced with a "this is very experimental" tokio-rs project called turmoil that provides deterministic execution plus network fault injection. (But not disk fault injection. More on that later.) It isn't surprising that it is difficult to provide deterministic execution for an IO library that was not designed for it. tokio is a large project with many transitive dependencies. They must all be combed for non-determinism.
On the other hand, Pekka has already demonstrated for us how we might build a simpler Rust async IO library that is designed to be simulation tested. He modeled this on the TigerBeetle design King and I wrote about two years ago.
So let's sketch out a program that does buggy IO and let's look at how we can apply DST to it.
# readfile.pseudocode
def read_file(io, name, into_buffer):
  f = await io.open(name)
  read_buffer = [4096]u8{}
  while true:
    err, n_read = await f.read(&read_buffer)
    if err == io.EOF:
      into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
      return
    if err:
      throw err
    into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
In our simulator, we will provide a mocked out IO system and we will randomly inject various errors while asserting pre- and post-conditions.
# sim.pseudocode
import "readfile.pseudocode"

seed = os.env.DST_SEED ? int(os.env.DST_SEED) : time.now()
rnd = rnd.new(seed)

while true:
  sim_disk_data = rnd.rand_bytes(10MB)
  sim_fd = {
    pos: 0
    EOF: Error("eof")
    read: (fd, buf) => {
      partial_read = rnd.rand_in_range_inclusive(0, sizeof(buf))
      memcpy(sim_disk_data, buf, fd.pos, partial_read)
      fd.pos += partial_read
      if fd.pos == sizeof(sim_disk_data):
        return io.EOF, partial_read
      return nil, partial_read
    }
  }
  sim_io = {
    open: (filename) => sim_fd
  }
  out_buf = Vector<u8>.new()
  try:
    await read_file(sim_io, "somefile", out_buf)
    assert_bytes_equal(out_buf.data, sim_disk_data)
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e
And with this simulator we would have eventually caught our partial read bug! In our original program when we wrote:
into_buffer.copy_maybe_allocate(read_buffer[0:sizeof(read_buffer)])
We should have written:
into_buffer.copy_maybe_allocate(read_buffer[0:n_read])
Great! Let's get a little more complex.
I already mentioned in the beginning that the gist of deterministic simulation testing a distributed system is that you get all of the nodes in the system to run in the same process. This would be basically impossible if you wanted to test a system that involved your application plus Kafka plus Postgres plus Redis. But if your system is a self-contained distributed system, such as one that embeds a Raft library for high availability of your application, you can actually run multiple nodes in the same process!
For a system like this, our simulator might look like:
# sim.pseudocode
import "distsys-node.pseudocode"

seed = os.env.DST_SEED ? int(os.env.DST_SEED) : time.now()
rnd = rnd.new(seed)

while true:
  sim_fd = {
    send(fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad write')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      n_written = assert_ok(os.fd.write(buf))
      return n_written
    },
    recv(fd, buf) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad read')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      return os.fd.read(buf)
    }
  }
  sim_io = {
    open: (filename) => {
      # Inject random failure.
      if rnd.rand() > .5:
        throw Error('bad open')
      # Inject random latency.
      if rnd.rand() > .5:
        await time.sleep(rnd.rand())
      return sim_fd
    }
  }
  all_ports = [6000, 6001, 6002]
  nodes = [
    await distsys-node.start(sim_io, all_ports[0], all_ports),
    await distsys-node.start(sim_io, all_ports[1], all_ports),
    await distsys-node.start(sim_io, all_ports[2], all_ports),
  ]
  history = []
  try:
    key = rnd.rand_bytes(10)
    value = rnd.rand_bytes(10)
    nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)].insert(key, value)
    history.add((key, value))
    assert_valid_history(nodes, history)
    # Crash a process every so often.
    if rnd.rand() > 0.75:
      node = nodes[rnd.rand_in_range_inclusive(0, len(nodes) - 1)]
      node.restart()
  catch (e):
    print("Found logical error with seed: %d", seed)
    throw e
I'm completely hand waving here to demonstrate the broader point and not any specific testing strategy for a specific distributed system. The important points are that these three nodes run in the same process, on different ports.
We control disk IO. We control network IO. We control how time elapses. We run a deterministic simulated workload against the three node system while injecting disk, network, and process faults.
And we are constantly checking for an invalid state. When we get the invalid state, we can be sure the user can easily recreate this invalid state.
Within some error margin, most CPU instructions and most CPU behavior are considered to be deterministic. There are, however, certain CPU instructions that are definitely not. Unfortunately that might include system calls. It might also include malloc. There is very little to trust.
If we ignore Antithesis, people doing DST seem not to worry about these smaller bits of nondeterminism. Yet it's generally agreed that DST is still worthwhile anyway. The intuition here is that every bit of non-determinism you can eliminate makes it that much easier to reproduce bugs when you find them.
Put another way: determinism, even among DST practitioners, remains a spectrum.
As you may have noticed already from some of the pseudocode, DST is not a panacea.
First, because you must swap out non-deterministic parts of your code, you are not actually testing the entirety of your code. You are certainly encouraged to keep the deterministic kernel large. But there will always be the non-deterministic edges.
Without a system like Antithesis which gives you an entire deterministic machine, you can't test your whole program.
But even with Antithesis you cannot test the integration between your system and external systems. You must mock out the external systems.
It's also worth noting that there are many areas where you could inject simulation. You could do it at a high-level RPC and storage layer. This would be simpler and easier to understand. But then you'd be omitting testing and error-handling of lower-level errors.
DST is dependent on your creativity and thoroughness of your workload as much as any other type of test or benchmark.
Just as you wouldn't depend on one single benchmark to qualify your application, you may not want to depend on a single simulated workload.
Or as Will Wilson put it for me:
The biggest challenge of DST in my experience is that tuning all the random distributions, the parameters of your system, the workload, the fault injection, etc. so that it produces interesting behavior is very challenging and very labor intensive. As with fuzzing or PBT, it's terrifyingly easy to build a DST system that appears to be doing a ton of testing, but actually never explores very much of the state space of your system. At FoundationDB, the vast majority of the work we put into the simulator was an iterative process of hunting for what wasn't being covered by our tests and then figuring out how to make the tests better. This process often resembles science more than it does engineering.
Unfortunately, unlike with fuzzing, mere branch coverage in your code is usually a pretty poor signal for the kinds of systems you want to test with DST. At Antithesis we handle this with Sometimes assertions, at FDB we did something pretty similar, and I assume TigerBeetle and others have their own version of this. But of course the ultimate figure of merit is whether your DST system is finding 100% of your bugs. It's quite difficult to get to the point that it does. The truly ambitious part of Antithesis isn't the hypervisor, but the fact that we also aim to solve the much harder "is my DST working?" problem with minimal human guidance or supervision.
When you mock out the behavior of disk or network IO, the benefits of DST are tied to your understanding of the spectrum of behavior that may happen in the real world.
What are all possible error conditions? What are the extreme latency bounds of the original method? What about corruption or misdirected IO?
The flipside here is that only in deterministic simulation testing can you configure these crazy scenarios to happen at a configurable regularity. You can kick off a set of runs that have especially high IO latency or especially high corrupt reads/writes. Joran and I wrote a year ago about how the TigerBeetle simulator does exactly this.
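As a sketch of what "configurable regularity" could look like (hypothetical Python; these knobs are made up and are not the TigerBeetle simulator's actual options), the simulator takes a fault profile and samples every injected fault from the seeded RNG against it:

# fault_profile_sketch.py (hypothetical names, not a real simulator's API)
from dataclasses import dataclass
import random

@dataclass
class FaultProfile:
    read_corruption_p: float = 0.01   # chance a read returns flipped bytes
    write_error_p: float = 0.05       # chance a write fails outright
    max_extra_latency_ms: int = 10    # upper bound on injected latency

def sim_read(rnd: random.Random, profile: FaultProfile, data: bytes) -> bytes:
    # Corrupt this read with the configured probability, using only the seeded RNG.
    if data and rnd.random() < profile.read_corruption_p:
        i = rnd.randrange(len(data))
        data = data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]
    return data

# One batch of runs with gentle faults, another that hammers the IO path.
gentle = FaultProfile()
hostile = FaultProfile(read_corruption_p=0.2, write_error_p=0.3, max_extra_latency_ms=5000)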
Critically, the reproducibility of DST only helps so long as your code doesn't change. As soon as your code changes, the seed may no longer even get you to the state where the bug was exhibited. So the real value of DST's reproducibility is that it helps you convert the failing seed's simulation run into an integration test that describes the precise scenario, and that test remains meaningful even as the code changes.
Because of Consideration 4, you need to keep rerunning the simulator not just to keep finding new seeds and new histories but because the new seeds and new histories may change every time you make changes to code.
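To make the earlier point concrete, here is a sketch (hypothetical Python; read_file here is a stand-in re-expressed from the earlier pseudocode) of converting the scenario a seed uncovered, partial reads followed by EOF, into a seed-independent regression test:

# regression_test_sketch.py (hypothetical; read_file stands in for the earlier pseudocode)
def read_file(fd) -> bytes:
    out = b""
    while True:
        err, chunk = fd.read(4096)
        out += chunk
        if err == "EOF":
            return out

class ScriptedFd:
    """Replays the exact partial reads a failing seed once produced."""
    def __init__(self, chunks):
        self.chunks = list(chunks)
    def read(self, n):
        chunk = self.chunks.pop(0)
        err = "EOF" if not self.chunks else None
        return err, chunk

def test_partial_reads_are_reassembled():
    # The scenario the seed uncovered: two partial reads, then EOF.
    fd = ScriptedFd([b"hel", b"lo", b""])
    assert read_file(fd) == b"hello"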
Jepsen does limited process and network fault injection while testing for linearizability. It's a fantastic project.
However, it represents only a subset of what is possible with Deterministic Simulation Testing (if you actually put in the effort described above to get there).
But even more importantly, Jepsen has nothing to do with deterministic execution. If Jepsen finds a bug and your system can't do deterministic execution, you may or may not be able to reproduce that Jepsen bug.
Here's another Will Wilson quote for you on Jepsen and FoundationDB:
Anyway, we did [Deterministic Simulation Testing] for a while and found all of the bugs in the database. I know, I know, that’s an insane thing to say. It’s kind of true though. In the entire history of the company, I think we only ever had one or two bugs reported by a customer. Ever. Kyle Kingsbury aka “aphyr” didn’t even bother testing it with Jepsen, because he didn’t think he’d find anything.
The degree to which you can place faith in DST alone, and not time spent in production, has limits. However, it certainly does no harm to employ DST. And, barring the considerations described above, it will likely make the kernel of your product significantly more stable. Furthermore, everyone who uses DST knows about these considerations. But I think it's worthwhile to list them out to help folks who do not know DST build an intuition for what it's excellent at.
Further reading:
I wrote a new post talking through the basics, considerations, and limitations of Deterministic Simulation Testing. https://t.co/9Fp5ytL7Wz pic.twitter.com/xRE6FOwc0P
— Phil Eaton (@eatonphil) August 20, 2024
2024-07-30 08:00:00
This is an external post of mine. Click here if you are not redirected.
2024-07-07 08:00:00
This year has seen a resurgence in really high quality systems programming meetups. Munich Database Meetup, Berlin Systems Group, SF Distributed Systems Meetup, NYC Systems, Bengaluru Systems, to name a few.
This post summarizes a bit of disappointing recent tech meetup history, the new trend of excellent systems programming meetups, and ends with some encouragement and guidance for running your own systems programming events.
I will be a little critical in this post but I want to preface by saying: organizing meetups is really tough! It takes a lot of work and I have a huge amount of respect for meetup organizers even when their meetup style did not resonate with me.
Although much of this post talks about NYC Systems, the reason I think this post is worth writing is because so many other meetups in a similar vein popped up. I hope to encourage these other meetups and to encourage folks in other major metros (London, for example) to start similar meetups.
I used to attend a bunch of meetups before the pandemic. But I quickly got disillusioned. Almost every meetup was varying degrees of startups pitching their product. The last straw for me was sitting through a talk at a JavaScript meetup that was by a devrel employee of a startup who literally gave a tutorial for their product.
There were also some pretty intelligent meetups like the New York Haskell Users Group and the New York Emacs Meetup. But not being an expert in either domain, and the attendees almost solely appearing to be experts, I didn't particularly enjoy going.
There were a couple of meetups that felt inclusive for various skill-levels of attendees yet still went into interesting depth. Specifically, New York Linux User Group and Papers We Love NYC.
These meetups were exceptional because they were language- and framework-agnostic, they would start broad to give you background, but then go deep into a topic. Maybe you only understood 50% of what was covered. But you get exposed to something new from an expert in that domain.
Unfortunately, the pandemic happened and these two excellent meetups basically have not come back.
The pandemic ended and I tried a couple of meetups I thought might be better quality. Rust and Go. But they weren't much better than I remembered. People would give a high level talk and brush over all the interesting concepts.
I had been thinking of doing an in-person talk series since 2022.
If I put together a systems/databases/distributed systems meetup in NYC (a physical meetup, not Zoom), who'd be interested (in attending, or presenting, or helping me organize, or donating space)? No promises!
— Phil Eaton (@eatonphil) September 27, 2022
But I was busy with TigerBeetle until December of 2023 when I was messaged on LinkedIn by Georg Kreuzmayr, a graduate student at Technical University of Munich (TUM).
Georg and his friends, fellow graduate students at TUM, started a database club: TUMuchData. We got to talking about opportunities for collaboration and I started feeling a bit embarrassed that a graduate student had more guts than I had to get back onto the meetup organizer wagon.
A week later, with assurance from Justin Jaffray that at least he would show up with me if no one else did, I started the NYC Systems Coffee Club to bring together folks in NYC interested in any topic of systems programming (e.g. compilers, databases, web browser internals, distributed systems, formal methods, etc.). To bring them together in a completely informal setting for coffee at 9am in the morning in a public space in midtown Manhattan.
Trying something new! If you're a dev in NYC working on (or interested in) systems programming, grab a coffee and come hang out at 1 Bryant Park (indoor space) this Thursday 9AM - 9:30AM.
See post for details and fill out the Google Form or DM me! https://t.co/A4bzcPGy6x pic.twitter.com/n1ECMd59ev
— Phil Eaton (@eatonphil) December 11, 2023
I set up that linked web page and started collecting subscribers to the club via Google Form. Once a month I'd send an email out to the list asking for RSVPs to this month's coffee club. The first 20 to respond would get a calendar invite.
And about the same time I started asking around on Twitter/LinkedIn if someone would be interested in co-organizing a new systems programming meetup in NYC. Angelo Saraceno immediately took me up on the idea and we met up.
We agreed on the premise: this would be a language- and framework-agnostic meetup focused on engineering challenges, not product pitches. The talks could absolutely serve as corporate marketing, but as marketing of the engineering team, not the product.
NYC Systems was born!
We'd find speakers who could start broad and dive deep into some interesting aspect of databases, programming languages, distributed systems, and so on. Product pitches were necessary to establish a context, but the focus of the talk would be about some interesting recent technical challenge and how they dealt with it.
We'd schedule talks only every other month to ease our own burden in organizing and finding great speakers.
Once Angelo and I had decided to go forward, the next two challenges were finding speakers and finding a venue. Thanks to Twitter and LinkedIn, finding speakers turned out to be the easy part.
It was harder to find a venue. It was surprisingly challenging to find a company in NYC with a shared vision that the important thing about being associated with a meetup like this is to be associated with the quality of speakers and audience we can bring in by not allowing transparent product pitches.
Almost every company in Manhattan with space we spoke with had a requirement that they have their own speaker each night. That seemed like a bad idea.
I think it was especially challenging to find a company willing to relax about branding requirements like this because we were a new meetup.
It was pretty frustrating not to find a sympathetic company with space in Manhattan. And the only reason we didn't give up was because Angelo was so adamant that this kind of meetup actually happen. It's always best to start something new with someone else for this exact reason. You can keep each other going.
In the end we went with the company that did not insist on their own speaker or their own branding. A Brooklyn-based company whose CEO immediately got in touch with me that they wanted to host us, Trail of Bits.
To keep things easy, I set up a web page on my personal site with information about the meetup. (Eventually we moved this to nycsystems.xyz.) I set up a Google Form to collect emails for a mailing list. And we started posting about the group on Twitter and LinkedIn.
Very pleased to share the first NYC Systems Talks are taking place next Thursday Feb 22nd 6PM. Hosted by @trailofbits, with @paulgb and @StefanKarpinski speaking.
Space is not infinite, fill out the Google Form if you can attend and would like an invite! https://t.co/jNssr5v1kJ
— Phil Eaton (@eatonphil) February 15, 2024
We published the event calendar in advance (an HTML table on the website) and announced each event's speakers a week in advance of the event. I'd send another Google Form to the mailing list taking RSVPs for the night. The first 60 people to respond got a Google Calendar invite.
It's a bit of work, sure, but I'd do anything to avoid Meetup.com.
It is interesting to see that every new systems programming meetup has also avoided Meetup.com. The only one that went with it, Munich Database Meetup, is a revival of an existing group, the Munich NoSQL Meetup, and presumably they didn't want to give up their subscribers. Most of the others use lu.ma.
The mailing list is now about 400+ people. And for each event RSVP we have a wait list of 20-30 people. Although 60 people initially say yes, by the time of the event we typically get about 50 people in attendance.
At each event, Trail of Bits provided screens, chairs, food, and drink. Angelo had recording equipment so he took over audio/video capturing (and later editing and publishing).
After each event we'd publish talk videos to our @NYCSystems Youtube.
In March 2024, the TUMuchData folks joined Alex Petrov's Munich NoSQL Meetup to form the Munich Database Meetup. In May, Kaivalya Apte and Manish Gill started the Berlin Systems Group, inspired by Alex and the Munich Database Meetup.
I want to start a Berlin Database/Storage systems group, where we have regular meetups, discussions and talks.
WDYT? @mgill25 @mehd_io @ClickHouseDB @SnowflakeDB @awscloud @GoogleDE @TUBerlin
Can I get some support? Who else would be interested? #Databases
Thanks…
— Kaivalya Apte - The Geek Narrator (@thegeeknarrator) May 15, 2024
In May 2024, two PhD students in the San Francisco Bay Area, Shadaj Laddad and Conor Power, started the SF Distributed Systems meetup.
We’re super excited to be organizing a new SF Distributed Systems meetup NEXT WEEK! Our first meetup features @julianhyde and @conor_power23 presenting work on extending SQL and applying algebraic properties, sign up at https://t.co/d2lLDaQ5iJ
— Shadaj Laddad (@ShadajL) May 15, 2024
And in July 2024, Shraddha Agrawal, Anirudh Rowjee and friends kicked off the first Bengaluru Systems Meetup.
Are you ready, Systems Enthusiasts of Bengaluru?
Speaking at our first-ever meetup on 6th July, we have: @simsimsandy with "Learn about the systems that power GenAI applications" and @vivekgalatage with "The Browser Backstage: Performance vs Security"
(talks linked below!)
— Bengaluru Systems Meetup (@BengaluruSys) July 4, 2024
First off, don't pay for anything yourself. Find a company who will host. At the same time, don't feel the need to give in too much to the demands of the company. I'd be happy to help you think through how to talk about the event with companies. It is mutually beneficial for them to get to give a 5-minute hiring/product pitch and not need to do extensive branding nor to give a 30-minute product tutorial.
Second, keep a bit of pressure on speakers to not do an overview talk and not to do a product pitch. Suggest that they tell the story of some interesting recent bug or interesting recent feature. What happened? Why was it hard? What did you learn?
Focusing on these types of talks will help you get a really interesting audience.
I have been continuously surprised and impressed at the folks who show up for NYC Systems. It's a mix of technical founders in the systems space, pretty experienced developers in the systems space, graduate students, and developers of all sorts.
I am certain we can only get these kinds of folks to show up because we avoid product pitch-type talks.
Third, finding speakers is still hard! The best approach so far has been to individually message folks in industry and academia who hang out on Twitter. Sending out a public call is easy but doesn't often pan out. So keep an eye on interesting companies in the area.
Another avenue I've been thinking about is messaging VC connections to ask them if they know any engineers/technical founders/CTOs in the area who could give an interesting technical talk.
Fourth, speak with other organizers! I finally met Alex Petrov in person last month and we had a great time talking about the challenges and joys of organizing really high quality meetups.
I'm always happy to chat, DMs are open.
New post telling a bit of the history behind https://t.co/NEh1tm8v3Q; why it only exists due to folks like @georg_kreuzmayr and @ngeloxyz; the explosion of systems meetups around the world; and encouragement and suggestions for future organizers! https://t.co/dwe4TtmXKK pic.twitter.com/ZMLkVYdZDJ
— Phil Eaton (@eatonphil) July 7, 2024
2024-07-01 08:00:00
A database does not need a write-ahead log (WAL) to achieve durability. A database can write its long-term data structure durably to disk before returning to a client. Granted, this is a bad idea! And granted, a WAL is critical for durability by design in most databases. But I think it's helpful to understand WALs by understanding what you could do without them.
So let's look at what terrible design we can make for a durable database that has no write-ahead log. To motivate the idea of, and build an intuition for, a write-ahead log.
Thank you to Alex Miller for reviewing a version of this post.
But first, what is durability?
Durability happens in the context of a request a client makes to a data system (either an embedded system like SQLite or RocksDB or a standalone system like Postgres). Durability is a spectrum of guarantees the server provides when a client requests to write some data: that either the request succeeds and the data is safely written to disk, or the request fails and the client must retry or decide to do something else.
It can be difficult to set an absolute definition for durability since different databases have different concepts of what can go wrong with disks (also called a "storage fault model"), or they have no concept at all.
Let's start from the beginning.
An in-memory database has no durability at all. Here is pseudo-code for an in-memory database service.
db = btree()

def handle_write(req):
  db.update(req.key, req.value)
  return 200, {}

def handle_read(req):
  value = db.read(req.key)
  return 200, {"value": value}
Throughout this post, for the sake of code brevity, imagine that the environment is concurrent and that data races around shared mutable values like db are protected somehow.
If we want to achieve the most basic level of durability, we can write this database to a file.
f = open("kv.db")
db = btree.init_from_disk(f)
def handle_write(req):
db.update(req.key, req.value)
db.write_to_disk(f)
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
btree.write_to_disk will call pwrite(2) under the hood. And we'll assume it does copy-on-write for only changed pages. So imagine we have a large database represented by a btree that takes up 10GiB on disk. With the btree algorithm, if we write a single entry to the btree, often only a single (often 4KiB) page will get written rather than all pages (holding all values) in the tree. At the same time, in the worst case, the entire tree (all 10GiB of data) may need to get rewritten.
But this code isn't crash-safe. If the virtual or physical machine this code is running on reboots, the data we wrote to the file may not actually be on disk.
File data is buffered by the operating system by default. By general consensus, writing data without flushing the operating system buffer is not considered durable. Every so often a new database will show up on Hacker News claiming to beat all other databases on insert speed until a commenter points out the new database doesn't actually flush data to disk.
In other words, the commonly accepted requirement for durability is that not only do you write data to a file on disk but you fsync(2) the file you wrote. This forces the operating system to flush to disk any data it has buffered.
f = open("kv.db")
db = btree.init_from_disk(f)
def handle_write(req):
db.update(req.key, req.value)
db.write_to_disk(f)
f.fsync() # Force a flush
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
Furthermore you must not ignore fsync failure. How you deal with fsync failure is up to you, but exiting immediately with a message that the user should restore from a backup is sometimes considered acceptable.
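Here is a minimal sketch of that policy in Python (illustrative; durable_write is a made-up helper, and a real engine would coordinate this with its page cache): treat a failed fsync as unrecoverable rather than retrying.

# fsync_failure_sketch.py (illustrative only)
import os
import sys

def durable_write(fd: int, data: bytes) -> None:
    os.write(fd, data)
    try:
        os.fsync(fd)
    except OSError as e:
        # After a failed fsync the kernel may have dropped the dirty pages,
        # so retrying is not safe. Crash loudly instead.
        print(f"fsync failed ({e}); data may be lost, restore from backup", file=sys.stderr)
        sys.exit(1)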
Databases don't like to fsync because it's slow. Many major databases offer modes where they do not fsync data files before returning a success to a client. Postgres offers this unsafe mode, though does not default to it and warns against it. MongoDB offers this unsafe mode but does not default to it.
An earlier version of this post said that MongoDB would unsafely flush on an interval. Daniel Gomez Ferro from MongoDB messaged me that while the docs are confusing, the default write concern "majority" does actually imply "j: true" which means data is synchronized (i.e. fsync-ed) before returning a success to a client.
Almost every database trades safety for performance in some regard. For example, few databases other than SQLite and Cockroach default to Serializable Isolation, even though it is commonly agreed that the weaker isolation levels (which all the other databases default to) are basically impossible to reason about. Other databases offer Serializable Isolation, they just don't default to it. Because it can be slow.
But let's get back to fsync. One way to amortize the cost of fsync is to delay requests so that you write data from each of them and then fsync the data from all requests. This is sometimes called group commit.
For example, we could update the database in-memory but have a background thread serialize to disk and call fsync only every 5ms.
f = open("kv.db")
db = btree.init_from_disk(f)
group_commit_sems = []
@background_worker()
def group_commit():
for:
if clock() % 5ms == 0:
db.write_to_disk(f)
f.fsync() # Durably flush for the group
for sem in group_commit_sems:
sem.signal()
def handle_write(req):
db.update(req.key, req.value)
sem = semaphore()
group_commit_sems.push(sem)
sem.wait()
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
It is critical that handle_write waits to return a success until the write is durable via fsync.
So to reiterate, the key idea for durability of a client request is that you have some version of the client message stored on disk durably with fsync before returning a success to a client.
From now on in this post, when you see "durable" or "durability", it means that the data has been written and fsync-ed to disk.
A key insight is that it's silly to serialize the entire permanent structure of the database to disk every time a user writes.
We could just write the user's message itself to an append-only log. And then only periodically write the entire btree to disk. So long as we have fsync-ed the append-only log file, we can safely return to the user even if the btree itself has not yet been written to disk.
The additional logic this requires is that on startup we must read the btree from disk and then replay the log on top of the btree.
f = open("kv.db", "rw")
db = btree.init_from_disk(f)
log_f = open("kv.log", "rw")
l = log.init_from_disk()
for log in l.read_logs_from(db.last_log_index):
db.update(log.key, log.value)
group_commit_sems = []
@background_worker()
def group_commit():
for:
log_accumulator = log_page()
if clock() % 5ms == 0:
for (log, _) in group_commit_sems:
log_accumulator.add(log)
log_f.write(log_accumulator.page()) # Write out all log entries at once
log_f.fsync() # Durably flush wal data
for (_, sem) in group_commit_sems:
sem.signal()
if clock() % 1m == 0:
db.write_to_disk(f)
f.fsync() # Durably flush db data
def handle_write(req):
db.update(req.key, req.value)
sem = semaphore()
log = req
group_commit_sems.push((log, sem))
sem.wait() # This time waiting for only the log to be written and flushed, not the btree.
return 200, {}
def handle_read(req):
value = db.read(req.key)
return 200, {"value": value}
This is a write-ahead log!
Consider a few scenarios. One request writes the smallest key ever seen. And one request within the same millisecond writes the largest key ever seen. Writing these to disk on the btree means modifying at least two pages spread out in space on disk.
But if we only have to durably write these two messages to a log, they can likely both be included in the same log page. ("Likely" so long as keys and values are small enough that multiple can fit into the same page.)
That is, it's cheaper to write only these small messages representing the client request to disk. And we save the structured btree persistence for a less frequent durable write.
Sometimes filesystems will write data to the wrong place. Sometimes disks corrupt data. A solution to both of these is to checksum the data on write, store the checksum on disk, and confirm the checksum on read. This combined with a background process called scrubbing to validate unread data can help you learn quickly when your data has been corrupted and you must recover from backup.
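Here is a minimal sketch of the idea in Python (illustrative only; real engines checksum at the page level and also have to handle torn writes): store a CRC32 next to each page on write and verify it on read.

# checksum_sketch.py (illustrative only)
import struct
import zlib

def encode_page(data: bytes) -> bytes:
    # Prefix the page with a CRC32 of its contents.
    return struct.pack("<I", zlib.crc32(data)) + data

def decode_page(raw: bytes) -> bytes:
    (stored_crc,) = struct.unpack("<I", raw[:4])
    data = raw[4:]
    if zlib.crc32(data) != stored_crc:
        # Corruption (or a misdirected write): surface it instead of
        # silently returning garbage.
        raise IOError("page checksum mismatch; restore from backup")
    return data

assert decode_page(encode_page(b"hello")) == b"hello"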
MongoDB's default storage engine WiredTiger does checksum data by default.
But some databases famous for integrity do not. Postgres does no data checksumming by default:
By default, data pages are not protected by checksums, but this can optionally be enabled for a cluster. When enabled, each data page includes a checksum that is updated when the page is written and verified each time the page is read. Only data pages are protected by checksums; internal data structures and temporary files are not.
SQLite likewise does no checksumming by default. Checksumming is an optional extension:
The checksum VFS extension is a VFS shim that adds an 8-byte checksum to the end of every page in an SQLite database. The checksum is added as each page is written and verified as each page is read. The checksum is intended to help detect database corruption caused by random bit-flips in the mass storage device.
But even this isn't perfect. Disks and nodes can fail completely. At that point you can only improve durability by introducing redundancy across disks (and/or nodes), for example, via distributed consensus.
Some databases (like SQLite) require a write-ahead log to implement aspects of ACID transactions. But this need not be a requirement for ACID transactions if you do MVCC (SQLite does not). See my previous post on implementing MVCC for details.
Logical replication (also called change data capture (CDC)) is another interesting feature that requires a write-ahead log. The idea is that the log already preserves the exact order and changes that affect the database's "state machine". So we could copy these changes out of the system by tracking the write-ahead log, preserving change order, and apply these changes to a foreign system.
But again, just CDC is not about durability. It's an ancillary feature that write-ahead logs make simple.
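For a rough sense of the shape of CDC (hypothetical Python; wal_entries_after and the dict standing in for the foreign system are made up, not a real API), a tailer reads WAL entries past the last offset it applied and replays them, in order, against the foreign system:

# cdc_sketch.py (hypothetical names)
def wal_entries_after(offset):
    # Pretend WAL: (offset, key, value) tuples in commit order.
    wal = [(1, "a", "1"), (2, "b", "2"), (3, "a", "3")]
    return [e for e in wal if e[0] > offset]

def replicate(last_applied: int, foreign: dict) -> int:
    # Apply changes in WAL order so the foreign system converges to the
    # same state the primary reached.
    for offset, key, value in wal_entries_after(last_applied):
        foreign[key] = value
        last_applied = offset
    return last_applied

sink = {}
pos = replicate(0, sink)
assert sink == {"a": "3", "b": "2"} and pos == 3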
A few key points. First, durability primarily matters if it is established before returning a success to the client. Second, a write-ahead log is a cheap way to get durability.
And finally, durability is a spectrum. You need to read the docs for your database to understand what it does and does not do.
Here's a new post about durability and write-ahead logs. Write-ahead logs are used almost everywhere. But to build an intuition for why, it is helpful to imagine what you would do without a WAL. And to explore the meaning of durability. https://t.co/nzS2pMz22z pic.twitter.com/m1n9x8CNcp
— Phil Eaton (@eatonphil) July 1, 2024
2024-06-17 08:00:00
This is an external post of mine. Click here if you are not redirected.
2024-06-14 08:00:00
Some of the most interesting technical blog posts I read come from, and a common reason for posts I write is, confusion. You're at work and you start asking questions that are difficult to answer. You spend a few hours or a day trying to get to the bottom of things.
If you ask a question to very experienced and successful developers at work, they have a tendency not to give context and to simplify things down to a single answer. This may be a good way to make business decisions. (One can't afford to waste an eternity considering everything indefinitely.) But accepting an answer you don't understand is actively harmful for building intuition.
Certainly, sometimes not accepting an answer can be irritating. You'll have to figure that out.
But beyond "go along to get along", another reason we don't pursue what we're confused about is because we're embarrassed that we're confused in the first place. What's worse, the embarrassment we feel naturally grows the more experienced we get. "I've got this job title, I don't want to seem like I don't know what you mean."
But if you fight the embarrassment and pursue your confusion regardless, you'll likely figure something very interesting out. Moreover, you will probably not have been the only person who was confused. At least personally it is quite rare that I am confused about something and no one else is.
So pay attention when you get confused, and consider why it happened. What did you expect to be the case, and how did reality differ? Explore the angles and the options. When you finally understand, think about what led you to that understanding.
Write it down. Put it into an internal Markdown doc, an internal Atlassian doc, an internal Google Slides page, whatever. The medium doesn't matter.
This entire process doesn't come easily. We feel embarrassed. We aren't used to lingering on something we're confused by. We aren't used to writing things down.
But if you can make yourself pause every once in a while and think about what you (or someone around you) got confused by, and if you can force yourself to stop getting embarrassed by what you got confused by, and if you can write down the background and the reasoning that led to your ultimate understanding, you're going to have something pretty interesting to talk about.
You'll contribute to the growth and intuition of your colleagues. And you'll never run out of things to write about.
Confusion is embarrassing. But fight that feeling, and dig into why you're confused. And write it down.
You won't be the only one who was confused. And you'll tend to have something pretty interesting to talk about. https://t.co/IdX1nGBheR pic.twitter.com/KzTjqMxw6u
— Phil Eaton (@eatonphil) June 14, 2024
2024-05-30 08:00:00
I've been running software book clubs almost continuously since last summer, about 12 months ago. We read through Designing Data-Intensive Applications, Database Internals, Systems Performance, and we just started Understanding Software Dynamics.
The DDIA discussions were in-person in NYC with about 5-8 consistent attendees. The rest have been over email with 300, 500, and 600 attendees.
This post is for folks who are interested in running their own book club. None of these ideas are novel. I co-opted the best parts I saw from other people running similar things. And hopefully you'll improve on my experience too, should you try.
Despite the length of this post, running a book club takes almost no noticeable effort, other than when I need to select and confirm discussion leaders. It is this limited effort required that I have to thank for keeping up the book clubs so consistently.
I run the virtual book clubs over email. I create a Google Group and tell people to send me their email for an invite. I use a Google Form to collect emails since I get many. If you're doing a small group book club you can just collect member emails directly.
In the Google Form I ask people to volunteer to lead discussion for a chapter (or chapters). And I ask for a Twitter/GitHub/LinkedIn account.
When I've gotten enough responses I go through the list and check Twitter/GitHub/LinkedIn info to find people who might have a particularly interesting perspective to lead a discussion.
"Lead a discussion" sounds formal but I mean anything but. All I am looking for is someone to start a new Google Group thread each week and for them to share their thoughts.
For example a discussion leader might share:
The "discussion leader" has no responsibility for remaining in the discussion after posting the thread. There just isn't an easy way to say "person who kicks off discussion" than to call them a "discussion leader".
By the way, I didn't do discussion leaders for the first book club, reading DDIA. And that book club took noticeably more effort. Because I organized it, I was effectively the discussion leader every time. Having discussion leaders disperses the effort of the book club. And I think it makes the club much more interesting.
One thing I noticed happening often was that the discussion leader might do a large summary of the chapter. I greatly appreciate and respect that effort, but I think this is not the ideal thing to happen. Of course you can't control what people do and maybe they really wanted to write a summary. But since noticing this happen I now try to discourage the discussion leader from summarizing since 1) it must be quite time-consuming and 2) it isn't as interesting as some of the above bullet points.
When I have picked out folks who seem like they'd be fun discussion leaders, I bcc email them all asking them to confirm. At the same time I explain what being a discussion leader means. As I just explained it here above.
Each week's discussion gets a new Google Group thread. Discussion happens in responses to the thread.
I ask the discussion leaders to create the new discussion thread between Friday and Saturday their local time.
For folks who don't confirm, I email them one last time and then if they still haven't confirmed I find someone new.
I always lead the first week's discussion so that the discussion leaders can see what I do and so that I can establish the pattern.
It takes a while to read a book. Sometimes the leaders forget to do their part. If it gets to be Sunday and the discussion leader for the week hasn't started discussion, I email them to gently ask if they are still available to kick off discussion. And if they are not, no worries, I can step in.
I have had to step in a few times to start discussion and it's no problem.
Just as you need to clarify and set expectations for discussion leaders, you need to clarify and set expectations for everyone else.
When I invite people to the Google Group I typically also create an Intro thread where I explain the discussion format.
An annoying aspect of Google Groups is that I cannot limit who can create a thread without limiting who can respond to a thread.
It would simplify things for me if I could limit thread creation to discussion leaders. But since I cannot, I try to repeatedly and explicitly mention in the Intro thread that no one should start a new discussion thread unless they are a discussion leader. And that new threads will come out each weekend to discuss the previous chapter.
One of the most important things to do in the Intro email is to set the tone. I try to clarify this is a friendly and encouraging group focused on learning and improving ourselves. We have experts in the group and we have noobs in the group and they are all welcome and will all come away with different things.
Email seems to be the most time-friendly and demographic-friendly medium. Doing live discussion sounds stressful and difficult to schedule, although I believe Alex Petrov runs live discussions. Email forces you to slow down and think things through. And email is built around an inbox. If you didn't get to read some discussion, you can mark it unread. You can't do that in Discord or Slack.
When I pick a book, aside from picking books I think are likely to be exceptionally well-written, I try to avoid books that we could not finish within 3 months. It concerns me to try to get people to commit to something longer than that.
This has led to some distortion though. Systems Performance has only 16 chapters. One chapter a week means about 3 months in total. But each chapter is 100 pages long.
I was hesitant to do a reading of Understanding Software Dynamics because it has 28 chapters. But each chapter is only 10-15 pages long. So when I decided to go with it, I decided we'd read 2 chapters a week. Each discussion leader is responsible for 2 chapters at a time. That means we can finish within 3 months. And each week we read only 20-30 pages, which is still much more doable than 100 pages of Systems Performance.
On the other hand, we did make it through Systems Performance! Which gives me confidence to pick other books that are physically daunting, should they otherwise seem like a good idea.
Many public book clubs go through a book a month and have no ending. That is totally fair. But what I love about the way I organize book clubs is that each reading is unrelated to the next. It's an entirely new signup for each book. You need only "commit" (I mean, you can drop off whenever and definitely people do) to a 3-month reading and then you can justly feel good about yourself and join again in the future or not.
In contrast a paper reading club has no obvious ending, unless you pick all the papers in advance and organize them around a school year or something. This has made running a paper reading club feel more daunting to me. Though I greatly appreciate folks like Aleksey Charapko and Murat Demirbas who do.
In a group of 500 people, maybe 1-2% of those people actively contribute to discussion. 5-10 people. But I often hear from people who didn't participate that they still highly valued the group. And this high percentage of non-active-participants is part of why I keep allowing the group size to grow. There's little work I have to do and a bunch of people benefit.
I wrote about this before. For some reason it's hard to get people who would otherwise join an external reading club to join a company-internal reading club.
Though perhaps I'm just doing it wrong because I hear of others like Elizabeth Garrett Christensen who run an internal software book club successfully.
That's all I've got. Send me questions if you've got any. But mostly, just give it a shot if you want to and you'll learn!
And if you still don't get it, you can of course just join one of my book clubs. :)
2024-05-16 08:00:00
In this post we'll build a database in 400 lines of code with basic support for five standard SQL transaction levels: Read Uncommitted, Read Committed, Repeatable Read, Snapshot Isolation and Serializable. We'll use multi-version concurrency control (MVCC) and optimistic concurrency control (OCC) to accomplish this. The goal isn't to be perfect but to explain the basics in a minimal way.
You don't need to know what these terms mean in advance. I did not understand them before doing this project. But if you've ever dealt with SQL databases, transaction isolation levels are likely one of the dark corners you either 1) weren't aware of or 2) wanted not to think about. At least, this is how I felt.
While there are many blog posts that list out isolation levels, I haven't been able to internalize their lessons. So I built this little database to demonstrate the common isolation levels for myself. It turned out to be simpler than I expected, and made the isolation levels much easier to reason about.
Thank you to Justin Jaffray, Alex Miller, Sujay Jayakar, Peter Veentjer, and Michael Gasch for providing feedback and suggestions.
All code is available on GitHub.
If you already know the answer, feel free to skip this section.
When I first started working with databases in CRUD applications, I did not understand the point of transactions. I was fairly certain that transactions are locks. I was wrong about that, but more on that later.
I can't remember exact code I wrote, but here's something I could have written:
with database.transaction() as t:
    users = t.query("SELECT * FROM users WHERE group = 'admin';")
    ids = []
    for user in users:
        if some_complex_logic(user):
            ids.append(user.id)
    t.query("UPDATE users SET metadata = 'some value' WHERE id IN ($1);", ids)
I would have thought that all users that were seen from the initial SELECT who matched the some_complex_logic filter would be exactly the same users that are updated in my second SQL statement.
And if I were using SQLite, my guess would have been correct. But if I were using MySQL or Postgres or Oracle or SQL Server, and hadn't made any changes to defaults, that wouldn't necessarily be true! We'll discover exactly why throughout this post.
For example, some other connection and transaction could have set a user's group to admin after the initial SELECT was executed. It would then be missed from the some_complex_logic check and from the subsequent UPDATE.
Or, again after our initial SELECT, some other connection could have modified the group for some user that previously was admin. It would then be incorrectly part of the second UPDATE statement.
These are just a few examples of what could go wrong.
This is the realm of transaction isolation. How do multiple transactions running at the same time, interacting with the same data, interact with each other?
The answer is: it depends. The SQL standard itself loosely prescribes four isolation levels. But every database implements these four levels slightly differently. Sometimes using entirely different algorithms. And even among the standard levels, the default isolation level for each database differs.
Funky bugs can show up across databases and across isolation levels, often dependent on particular details of common ways of implementing isolation levels. These behaviors are called "anomalies". Examples include intimidating terms like "dirty reads", "write cycles", and G2-Item.
The topic is so complex that we've got decades of research papers critiquing SQL isolation levels, categorization of common isolation anomalies, walkthroughs of anomalies by Martin Kleppmann in Designing Data-Intensive Applications, Martin Kleppmann's Hermitage project documenting common anomalies across isolation levels in major databases, and the ACIDRain paper showing isolation-related bugs in major open-source ecommerce projects.
These aren't just random links. They're each quite interesting. And particularly for practitioners who don't know why they should care, check out Designing Data-Intensive Applications and the last link on ACIDRain.
And this is only a small list of some of the most interesting research and writing on the topic.
So there's a wide variety of things to consider:
Transaction isolation levels are basically vibes. The only truth for real projects is Martin Kleppmann's Hermitage project that catalogs behavior across databases. And a truth some people align with is Generalized Isolation Level Definitions.
So while all these linked works above are authoritative, and even though we can see that there might be some anomalies we have to worry about, the research can still be difficult to internalize. And many developers, my recent self included, do not have a great understanding of isolation levels.
Throughout this post we'll stick to informal definitions of isolation levels to keep things simple.
Let's dig in.
Historically, databases implemented isolation with locking algorithms such as Two-Phase Locking (not the same thing as Two-Phase Commit). Multi-version concurrency control (MVCC) is an approach that lets us completely avoid locks.
It's worth noting that while we can validly avoid locks (implementing what is called optimistic concurrency control, or OCC), most MVCC databases do still use locks for certain things (implementing what is called pessimistic concurrency control).
But this is the story of databases in general. There are numerous ways to implement things.
We will take the simpler lockless route.
Consider a key-value database. With MVCC, rather than storing only the value for a key, we would store versions of the value. The version includes the transaction id (a monotonic incrementing integer) wherein the version was created, and the transaction id wherein the version was deleted.
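To make that concrete, here is an illustrative sketch (the transaction ids and values are made up, and it uses the Value struct we'll define shortly) of what the version list for a single key could contain:
// Illustrative only: what the version list for key "x" might look like
// after transaction 1 set x = "a" and transaction 2 later set x = "b".
var exampleVersions = []Value{
	{txStartId: 1, txEndId: 2, value: "a"}, // superseded when transaction 2 wrote x
	{txStartId: 2, txEndId: 0, value: "b"}, // current version, not deleted
}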
With MVCC, it is possible to express transaction isolation levels almost solely as a set of different visibility rules for a version of a value; rules that vary by isolation level.
So we will build up a general framework first and discuss and implement each isolation level last.
We'll build an in-memory key-value system that acts on transactions. I usually try to stick with only the standard library for projects like this but I really wanted a sorted data structure and Go doesn't implement one.
In main.go, let's set up basic helpers for assertions and debugging:
package main
import (
"fmt"
"os"
"slices"
"github.com/tidwall/btree"
)
func assert(b bool, msg string) {
if !b {
panic(msg)
}
}
func assertEq[C comparable](a C, b C, prefix string) {
if a != b {
panic(fmt.Sprintf("%s '%v' != '%v'", prefix, a, b))
}
}
var DEBUG = slices.Contains(os.Args, "--debug")
func debug(a ...any) {
if !DEBUG {
return
}
args := append([]any{"[DEBUG]"}, a...)
fmt.Println(args...)
}
As mentioned previously, a value in the database will be defined with start and end transaction ids.
type Value struct {
txStartId uint64
txEndId uint64
value string
}
Every transaction will be in an in-progress, aborted, or committed state.
type TransactionState uint8
const (
InProgressTransaction TransactionState = iota
AbortedTransaction
CommittedTransaction
)
And we'll support a few major isolation levels.
// Loosest isolation at the top, strictest isolation at the bottom.
type IsolationLevel uint8
const (
ReadUncommittedIsolation IsolationLevel = iota
ReadCommittedIsolation
RepeatableReadIsolation
SnapshotIsolation
SerializableIsolation
)
We'll get into detail about the meaning of the levels later.
A transaction has an isolation level, an id (monotonic increasing integer), and a current state. And although we won't make use of this data yet, transactions at stricter isolation levels will need some extra info. Specifically, stricter isolation levels need to know about other transactions that were in-progress when this one started. And stricter isolation levels need to know about all keys read and written by a transaction.
type Transaction struct {
isolation IsolationLevel
id uint64
state TransactionState
// Used only by Repeatable Read and stricter.
inprogress btree.Set[uint64]
// Used only by Snapshot Isolation and stricter.
writeset btree.Set[string]
readset btree.Set[string]
}
We'll discuss why later.
Finally, the database itself will have a default isolation level that each transaction will inherit (for our own convenience in tests).
The database will have a mapping of keys to an array of value versions. Later elements in the array will represent newer versions of a value.
The database will also store the next free transaction id it will use to assign ids to new transactions.
type Database struct {
defaultIsolation IsolationLevel
store map[string][]Value
transactions btree.Map[uint64, Transaction]
nextTransactionId uint64
}
func newDatabase() Database {
return Database{
defaultIsolation: ReadCommittedIsolation,
store: map[string][]Value{},
// The `0` transaction id will be used to mean that
// the id was not set. So all valid transaction ids
// must start at 1.
nextTransactionId: 1,
}
}
To be thread-safe: store, transactions, and nextTransactionId should be guarded by a mutex. But to keep the code small, this post will not use goroutines and thus does not need mutexes.
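If you did want goroutines, here is a hypothetical sketch (not part of this post's code; it assumes a sync import) of where that mutex would live:
type Database struct {
	mu sync.Mutex // Would guard store, transactions, and nextTransactionId.

	defaultIsolation  IsolationLevel
	store             map[string][]Value
	transactions      btree.Map[uint64, Transaction]
	nextTransactionId uint64
}

// Every method that touches the guarded fields would then start with:
//   d.mu.Lock()
//   defer d.mu.Unlock()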
There's a bit of book-keeping when creating a transaction, so we'll make a dedicated method for this. We must give the new transaction an id, store all in-progress transactions, and add it to the database transaction history.
func (d *Database) inprogress() btree.Set[uint64] {
var ids btree.Set[uint64]
iter := d.transactions.Iter()
for ok := iter.First(); ok; ok = iter.Next() {
if iter.Value().state == InProgressTransaction {
ids.Insert(iter.Key())
}
}
return ids
}
func (d *Database) newTransaction() *Transaction {
t := Transaction{}
t.isolation = d.defaultIsolation
t.state = InProgressTransaction
// Assign and increment transaction id.
t.id = d.nextTransactionId
d.nextTransactionId++
// Store all inprogress transaction ids.
t.inprogress = d.inprogress()
// Add this transaction to history.
d.transactions.Set(t.id, t)
debug("starting transaction", t.id)
return &t
}
And we'll add a few more helpers for completing a transaction, for fetching a transaction by id, and for validating a transaction.
func (d *Database) completeTransaction(t *Transaction, state TransactionState) error {
debug("completing transaction ", t.id)
// Update transactions.
t.state = state
d.transactions.Set(t.id, *t)
return nil
}
func (d *Database) transactionState(txId uint64) Transaction {
t, ok := d.transactions.Get(txId)
assert(ok, "valid transaction")
return t
}
func (d *Database) assertValidTransaction(t *Transaction) {
assert(t.id > 0, "valid id")
assert(d.transactionState(t.id).state == InProgressTransaction, "in progress")
}
The final bit of scaffolding we'll set up is an abstraction for database connections. A connection will have at most one associated transaction. Users must ask the database for a new connection. Then within the connection they can manage a transaction.
type Connection struct {
tx *Transaction
db *Database
}
func (c *Connection) execCommand(command string, args []string) (string, error) {
debug(command, args)
// TODO
return "", fmt.Errorf("unimplemented")
}
func (c *Connection) mustExecCommand(cmd string, args []string) string {
res, err := c.execCommand(cmd, args)
assertEq(err, nil, "unexpected error")
return res
}
func (d *Database) newConnection() *Connection {
return &Connection{
db: d,
tx: nil,
}
}
func main() {
panic("unimplemented")
}
And that's it for scaffolding. Now set up the go module and make sure this builds.
$ go mod init gomvcc
go: creating new go.mod: module gomvcc
go: to add module requirements and sums:
go mod tidy
$ go mod tidy
go: finding module for package github.com/tidwall/btree
go: found github.com/tidwall/btree in github.com/tidwall/btree v1.7.0
$ go build
$ ./gomvcc
panic: unimplemented
goroutine 1 [running]:
main.main()
/Users/phil/tmp/main.go:166 +0x2c
Great!
When the user asks to begin a transaction, we ask the database for a new transaction and assign it to the current connection.
func (c *Connection) execCommand(command string, args []string) (string, error) {
debug(command, args)
+ if command == "begin" {
+ assertEq(c.tx, nil, "no running transactions")
+ c.tx = c.db.newTransaction()
+ c.db.assertValidTransaction(c.tx)
+ return fmt.Sprintf("%d", c.tx.id), nil
+ }
+
// TODO
return "", fmt.Errorf("unimplemented")
}
To abort a transaction, we call the completeTransaction method (which makes sure the database transaction history gets updated) with the AbortedTransaction state.
return fmt.Sprintf("%d", c.tx.id), nil
}
+ if command == "abort" {
+ c.db.assertValidTransaction(c.tx)
+ err := c.db.completeTransaction(c.tx, AbortedTransaction)
+ c.tx = nil
+ return "", err
+ }
+
// TODO
return "", fmt.Errorf("unimplemented")
}
And committing a transaction is similar.
return "", err
}
+ if command == "commit" {
+ c.db.assertValidTransaction(c.tx)
+ err := c.db.completeTransaction(c.tx, CommittedTransaction)
+ c.tx = nil
+ return "", err
+ }
+
// TODO
return "", fmt.Errorf("unimplemented")
}
The neat thing about MVCC is that beginning, committing, and aborting a transaction is metadata work. Committing a transaction will get a bit more complex when we add support for Snapshot Isolation and Serializable Isolation, but we'll get to that later. Even then, it will not involve modifying any values we get, set, or delete.
Here is where things get fun. As mentioned earlier, the key-value store is actually map[string][]Value, with the more recent versions of a value at the end of the list of values for the key.
For get support, we'll iterate the list of value versions backwards for the key. And we'll call a special new isvisible method to determine if this transaction can see this value. The first value that passes the isvisible test is the correct value for the transaction.
return "", err
}
+ if command == "get" {
+ c.db.assertValidTransaction(c.tx)
+
+ key := args[0]
+
+ c.tx.readset.Insert(key)
+
+ for i := len(c.db.store[key]) - 1; i >= 0; i-- {
+ value := c.db.store[key][i]
+ debug(value, c.tx, c.db.isvisible(c.tx, value))
+ if c.db.isvisible(c.tx, value) {
+ return value.value, nil
+ }
+ }
+
+ return "", fmt.Errorf("cannot get key that does not exist")
+ }
+
// TODO
return "", fmt.Errorf("unimplemented")
}
I snuck in tracking which keys are read, and we'll also soon sneak in tracking which keys are written. This is necessary in stricter isolation levels. More on that later.
set and delete are similar to get. But this time when we walk the list of value versions, we will set the txEndId for the value to the current transaction id if the value version is visible to this transaction.
Then, for set, we'll append to the value version list the new version of the value that starts at this current transaction.
return "", err
}
+ if command == "set" || command == "delete" {
+ c.db.assertValidTransaction(c.tx)
+
+ key := args[0]
+
+ // Mark all visible versions as now invalid.
+ found := false
+ for i := len(c.db.store[key]) - 1; i >= 0; i-- {
+ value := &c.db.store[key][i]
+ debug(value, c.tx, c.db.isvisible(c.tx, *value))
+ if c.db.isvisible(c.tx, *value) {
+ value.txEndId = c.tx.id
+ found = true
+ }
+ }
+ if command == "delete" && !found {
+ return "", fmt.Errorf("cannot delete key that does not exist")
+ }
+
+ c.tx.writeset.Insert(key)
+
+ // And add a new version if it's a set command.
+ if command == "set" {
+ value := args[1]
+ c.db.store[key] = append(c.db.store[key], Value{
+ txStartId: c.tx.id,
+ txEndId: 0,
+ value: value,
+ })
+
+ return value, nil
+ }
+
+ // Delete ok.
+ return "", nil
+ }
+
if command == "get" {
c.db.assertValidTransaction(c.tx)
This time rather than modifying the readset we modify the writeset for the transaction.
And that is how commands get executed!
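As a sanity check, once the isvisible method below exists, you could replace the placeholder main with a quick manual exercise of the connection API. This is just a sketch for poking at the database by hand, not part of the post's test suite:
func main() {
	database := newDatabase()

	c := database.newConnection()
	c.mustExecCommand("begin", nil)
	c.mustExecCommand("set", []string{"x", "hey"})
	// The transaction can see its own write.
	fmt.Println(c.mustExecCommand("get", []string{"x"})) // prints: hey
	c.mustExecCommand("commit", nil)
}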
Let's zoom in to the core of the problem we have mentioned but not implemented: MVCC visibility rules and how they differ by isolation levels.
To varying degrees, transaction isolation levels prevent concurrent transactions from messing with each other. The looser isolation levels prevent this almost not at all.
Here is what the 1999 ANSI SQL standard (page 84) has to say.
But as I mentioned in the beginning of the post, we're going to be a bit informal. And we'll mostly refer to Jepsen summaries of each isolation level.
According to Jepsen, the loosest isolation level, Read Uncommitted, has almost no restrictions. We merely read the most recent (non-deleted) version of a value, regardless of whether the transaction that set it has committed or aborted.
func (d *Database) isvisible(t *Transaction, value Value) bool {
// Read Uncommitted means we simply read the last value
// written. Even if the transaction that wrote this value has
// not committed, and even if it has aborted.
if t.isolation == ReadUncommittedIsolation {
// We must merely make sure the value has not been
// deleted.
return value.txEndId == 0
}
assert(false, "unsupported isolation level")
return false
}
Let's write a test that demonstrates this. We create two transactions, c1 and c2, and set a key in c1. The value set for the key in c1 should be immediately visible if c2 asks for that key. In main_test.go:
package main
import (
"testing"
)
func TestReadUncommitted(t *testing.T) {
database := newDatabase()
database.defaultIsolation = ReadUncommittedIsolation
c1 := database.newConnection()
c1.mustExecCommand("begin", nil)
c2 := database.newConnection()
c2.mustExecCommand("begin", nil)
c1.mustExecCommand("set", []string{"x", "hey"})
// Update is visible to self.
res := c1.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c1 get x")
// But since read uncommitted, also available to everyone else.
res = c2.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c2 get x")
// And if we delete, that should be respected.
res = c1.mustExecCommand("delete", []string{"x"})
assertEq(res, "", "c1 delete x")
res, err := c1.execCommand("get", []string{"x"})
assertEq(res, "", "c1 sees no x")
assertEq(err.Error(), "cannot get key that does not exist", "c1 sees no x")
res, err = c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 sees no x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 sees no x")
}
Thank you to @glaebhoerl for pointing out that in an earlier version of this post, Read Uncommitted incorrectly made deleted values visible.
That's pretty simple! But also pretty useless if your workload has conflicts. If you can arrange your workload in a way where you know no concurrent transactions will ever read or write conflicting keys though, this could be pretty efficient! The rules will only get more complex (and thus potentially more of a bottleneck) from here on.
But for the most part, people don't use this isolation level. SQLite, Yugabyte, Cockroach, and Postgres don't even implement it. It is also not the default for any major database that does implement it.
Let's get a little stricter.
We'll pull again from Jepsen:
Read committed is a consistency model which strengthens read uncommitted by preventing dirty reads: transactions are not allowed to observe writes from transactions which do not commit.
This sounds pretty simple. In isvisible we'll make sure that the value has a txStartId that is either this transaction's id or the id of a transaction that has committed. Moreover we will now begin checking against txEndId to make sure the value wasn't deleted by any relevant transaction.
return value.txEndId == 0
}
+ // Read Committed means we are allowed to read any values that
+ // are committed at the point in time where we read.
+ if t.isolation == ReadCommittedIsolation {
+ // If the value was created by a transaction that is
+ // not committed, and not this current transaction,
+ // it's no good.
+ if value.txStartId != t.id &&
+ d.transactionState(value.txStartId).state != CommittedTransaction {
+ return false
+ }
+
+ // If the value was deleted in this transaction, it's no good.
+ if value.txEndId == t.id {
+ return false
+ }
+
+ // Or if the value was deleted in some other committed
+ // transaction, it's no good.
+ if value.txEndId > 0 &&
+ d.transactionState(value.txEndId).state == CommittedTransaction {
+ return false
+ }
+
+ // Otherwise the value is good.
+ return true
+ }
+
assert(false, "unsupported isolation level")
return false
}
This begins to look useful! We will never read a value that isn't part of a committed transaction (or isn't part of our own transaction). Indeed this is the default isolation level for many databases including Postgres, Yugabyte, Oracle, and SQL Server.
Let's add a test to main_test.go. This is a bit long, but give it a slow read. It is thoroughly commented.
func TestReadCommitted(t *testing.T) {
database := newDatabase()
database.defaultIsolation = ReadCommittedIsolation
c1 := database.newConnection()
c1.mustExecCommand("begin", nil)
c2 := database.newConnection()
c2.mustExecCommand("begin", nil)
// Local change is visible locally.
c1.mustExecCommand("set", []string{"x", "hey"})
res := c1.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c1 get x")
// Update not available to this transaction since this is not
// committed.
res, err := c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
c1.mustExecCommand("commit", nil)
// Now that it's been committed, it's visible in c2.
res = c2.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c2 get x")
c3 := database.newConnection()
c3.mustExecCommand("begin", nil)
// Local change is visible locally.
c3.mustExecCommand("set", []string{"x", "yall"})
res = c3.mustExecCommand("get", []string{"x"})
assertEq(res, "yall", "c3 get x")
// But not on the other commit, again.
res = c2.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c2 get x")
c3.mustExecCommand("abort", nil)
// And still not, if the other transaction aborted.
res = c2.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c2 get x")
// And if we delete it, it should show up deleted locally.
c2.mustExecCommand("delete", []string{"x"})
res, err = c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
c2.mustExecCommand("commit", nil)
// It should also show up as deleted in new transactions now
// that it has been committed.
c4 := database.newConnection()
c4.mustExecCommand("begin", nil)
res, err = c4.execCommand("get", []string{"x"})
assertEq(res, "", "c4 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c4 get x")
}
Again this seems great. However! You can easily get inconsistent data within a transaction at this isolation level. If transaction A has multiple statements, it can see different results per statement even if transaction A did not modify data, because another transaction B may have committed changes between two of transaction A's statements.
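Here's a small sketch of a test showing that anomaly, a non-repeatable read, at this level. The test name and values are mine, not from the code above:
func TestReadCommitted_nonRepeatableRead(t *testing.T) {
	database := newDatabase()
	database.defaultIsolation = ReadCommittedIsolation

	// Transaction A stays open the whole time and never writes.
	cA := database.newConnection()
	cA.mustExecCommand("begin", nil)

	// Another transaction commits a first version of x.
	c1 := database.newConnection()
	c1.mustExecCommand("begin", nil)
	c1.mustExecCommand("set", []string{"x", "first"})
	c1.mustExecCommand("commit", nil)

	// Transaction A sees the committed value.
	res := cA.mustExecCommand("get", []string{"x"})
	assertEq(res, "first", "cA get x")

	// A third transaction overwrites x and commits.
	c2 := database.newConnection()
	c2.mustExecCommand("begin", nil)
	c2.mustExecCommand("set", []string{"x", "second"})
	c2.mustExecCommand("commit", nil)

	// The same read in transaction A now returns a different result,
	// even though A itself changed nothing.
	res = cA.mustExecCommand("get", []string{"x"})
	assertEq(res, "second", "cA get x")
}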
Let's get a little stricter.
Again as Jepsen says, Repeatable Read is the same as Read Committed but with the following anomaly not allowed (quoting from the ANSI SQL 1999 standard):
P2 (“Non-repeatable read”): SQL-transaction T1 reads a row. SQL-transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.
To support this, we will add additional checks on top of the Read Committed logic so that a version is only visible if it was created (and not deleted) by a transaction that completed before this transaction started, or by this transaction itself.
As it happens, this is the same logic that will be necessary for Snapshot Isolation and Serializable Isolation. The additional logic (that makes Snapshot Isolation and Serializable Isolation different) happens at commit time.
return true
}
- assert(false, "unsupported isolation level")
- return false
+ // Repeatable Read, Snapshot Isolation, and Serializable
+ // further restricts Read Committed so only versions from
+ // transactions that completed before this one started are
+ // visible.
+
+ // Snapshot Isolation and Serializable will do additional
+ // checks at commit time.
+ assert(t.isolation == RepeatableReadIsolation ||
+ t.isolation == SnapshotIsolation ||
+ t.isolation == SerializableIsolation, "invalid isolation level")
+ // Ignore values from transactions started after this one.
+ if value.txStartId > t.id {
+ return false
+ }
+
+ // Ignore values created from transactions in progress when
+ // this one started.
+ if t.inprogress.Contains(value.txStartId) {
+ return false
+ }
+
+ // If the value was created by a transaction that is not
+ // committed, and not this current transaction, it's no good.
+ if d.transactionState(value.txStartId).state != CommittedTransaction &&
+ value.txStartId != t.id {
+ return false
+ }
+
+ // If the value was deleted in this transaction, it's no good.
+ if value.txEndId == t.id {
+ return false
+ }
+
+ // Or if the value was deleted in some other committed
+ // transaction that started before this one, it's no good.
+ if value.txEndId < t.id &&
+ value.txEndId > 0 &&
+ d.transactionState(value.txEndId).state == CommittedTransaction &&
+ !t.inprogress.Contains(value.txEndId) {
+ return false
+ }
+
+ return true
}
type Connection struct {
How do I derive these rules? Mostly by writing tests that should pass or fail and seeing what doesn't make sense. I tried to steal from existing projects but these rules were not so simple to discover. Which is part of what I hope makes this project particularly useful to look at.
Let's write a test for Repeatable Read. Again, the test is long but well commented.
func TestRepeatableRead(t *testing.T) {
database := newDatabase()
database.defaultIsolation = RepeatableReadIsolation
c1 := database.newConnection()
c1.mustExecCommand("begin", nil)
c2 := database.newConnection()
c2.mustExecCommand("begin", nil)
// Local change is visible locally.
c1.mustExecCommand("set", []string{"x", "hey"})
res := c1.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c1 get x")
// Update not available to this transaction since this is not
// committed.
res, err := c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
c1.mustExecCommand("commit", nil)
// Even after committing, it's not visible in an existing
// transaction.
res, err = c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
// But is available in a new transaction.
c3 := database.newConnection()
c3.mustExecCommand("begin", nil)
res = c3.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c3 get x")
// Local change is visible locally.
c3.mustExecCommand("set", []string{"x", "yall"})
res = c3.mustExecCommand("get", []string{"x"})
assertEq(res, "yall", "c3 get x")
// But not on the other commit, again.
res, err = c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
c3.mustExecCommand("abort", nil)
// And still not, regardless of abort, because it's an older
// transaction.
res, err = c2.execCommand("get", []string{"x"})
assertEq(res, "", "c2 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
// And again still the aborted set is still not on a new
// transaction.
c4 := database.newConnection()
res = c4.mustExecCommand("begin", nil)
res = c4.mustExecCommand("get", []string{"x"})
assertEq(res, "hey", "c4 get x")
c4.mustExecCommand("delete", []string{"x"})
c4.mustExecCommand("commit", nil)
// But the delete is visible to new transactions now that this
// has been committed.
c5 := database.newConnection()
res = c5.mustExecCommand("begin", nil)
res, err = c5.execCommand("get", []string{"x"})
assertEq(res, "", "c5 get x")
assertEq(err.Error(), "cannot get key that does not exist", "c5 get x")
}
Let's get stricter!
Back to Jepsen (https://jepsen.io/consistency/models/snapshot-isolation) for a definition:
In a snapshot isolated system, each transaction appears to operate on an independent, consistent snapshot of the database. Its changes are visible only to that transaction until commit time, when all changes become visible atomically to any transaction which begins at a later time. If transaction T1 has modified an object x, and another transaction T2 committed a write to x after T1’s snapshot began, and before T1’s commit, then T1 must abort.
So Snapshot Isolation is the same as Repeatable Read but with one additional rule: the keys written by any two concurrent committed transactions must not overlap.
This is why we tracked writeset. Every time a transaction modified or deleted a key, we added it to the transaction's writeset. To make sure we abort correctly, we'll add a conflict check to the commit step. (This idea is also well documented in A critique of snapshot isolation. This paper can be hard to find. Email me if you want a copy.)
When a transaction A goes to commit, it will run a conflict test for any transaction B that has committed since this transaction A started.
Serializable Isolation is going to have a similar check. So we'll add a helper for iterating through all relevant transactions, running a check function for any transaction that has committed.
func (d *Database) hasConflict(t1 *Transaction, conflictFn func(*Transaction, *Transaction) bool) bool {
iter := d.transactions.Iter()
// First see if there is any conflict with transactions that
// were in progress when this one started.
inprogressIter := t1.inprogress.Iter()
for ok := inprogressIter.First(); ok; ok = inprogressIter.Next() {
id := inprogressIter.Key()
found := iter.Seek(id)
if !found {
continue
}
t2 := iter.Value()
if t2.state == CommittedTransaction {
if conflictFn(t1, &t2) {
return true
}
}
}
// Then see if there is any conflict with transactions that
// started and committed after this one started.
for id := t1.id; id < d.nextTransactionId; id++ {
found := iter.Seek(id)
if !found {
continue
}
t2 := iter.Value()
if t2.state == CommittedTransaction {
if conflictFn(t1, &t2) {
return true
}
}
}
return false
}
It was around this point that I decided I did really need a B-Tree implementation and could not just stick to vanilla Go data structures.
Now we can modify completeTransaction to do this check if the transaction intends to commit. If the current transaction A's write set intersects with any other transaction B committed since transaction A started, we must abort.
func (d *Database) completeTransaction(t *Transaction, state TransactionState) error {
debug("completing transaction ", t.id)
+
+ if state == CommittedTransaction {
+ // Snapshot Isolation imposes the additional constraint that
+ // no transaction A may commit after writing any of the same
+ // keys as transaction B has written and committed during
+ // transaction A's life.
+ if t.isolation == SnapshotIsolation && d.hasConflict(t, func(t1 *Transaction, t2 *Transaction) bool {
+ return setsShareItem(t1.writeset, t2.writeset)
+ }) {
+ d.completeTransaction(t, AbortedTransaction)
+ return fmt.Errorf("write-write conflict")
+ }
+ }
+
// Update transactions.
t.state = state
d.transactions.Set(t.id, *t)
Lastly, the definition of setsShareItem:
func setsShareItem(s1 btree.Set[string], s2 btree.Set[string]) bool {
s1Iter := s1.Iter()
s2Iter := s2.Iter()
for ok := s1Iter.First(); ok; ok = s1Iter.Next() {
s1Key := s1Iter.Key()
found := s2Iter.Seek(s1Key)
if found {
return true
}
}
return false
}
Since Snapshot Isolation shares all the same visibility rules as Repeatable Read, the tests get to be a little simpler! We'll simply test that two transactions attempting to commit a write to the same key fail. Or specifically: that the second transaction cannot commit.
func TestSnapshotIsolation_writewrite_conflict(t *testing.T) {
database := newDatabase()
database.defaultIsolation = SnapshotIsolation
c1 := database.newConnection()
c1.mustExecCommand("begin", nil)
c2 := database.newConnection()
c2.mustExecCommand("begin", nil)
c3 := database.newConnection()
c3.mustExecCommand("begin", nil)
c1.mustExecCommand("set", []string{"x", "hey"})
c1.mustExecCommand("commit", nil)
c2.mustExecCommand("set", []string{"x", "hey"})
res, err := c2.execCommand("commit", nil)
assertEq(res, "", "c2 commit")
assertEq(err.Error(), "write-write conflict", "c2 commit")
// But unrelated keys cause no conflict.
c3.mustExecCommand("set", []string{"y", "no conflict"})
c3.mustExecCommand("commit", nil)
}
Not bad! But let's get stricter.
Upon further discussion with Alex Miller, and after reviewing A Critique of ANSI SQL Isolation Levels, the difference I am trying to suggest (between Repeatable Read and Snapshot Isolation) likely does not exist. A Critique of ANSI SQL Isolation Levels mentions Repeatable Read must not exhibit P4 (Lost Update) anomalies. And it mentions that you must check for read-write conflicts to avoid these. Therefore it seems likely that you can't easily separate Repeatable Read from Snapshot Isolation when implemented using MVCC. The differences between Repeatable Read and Snapshot Isolation may more readily show up when implementing transactions the classical way with Two-Phase Locking.
To reiterate, with MVCC and optimistic concurrency control, correct implementations of Repeatable Read and Snapshot Isolation do not seem to be distinguishable. Both require write-write conflict detection.
In terms of end result, this is the simplest isolation level to reason about. Serializable Isolation must appear as if only a single transaction were executing at a time. Some systems, like SQLite and TigerBeetle, do Actually Serial Execution where only one transaction runs at a time. But few databases implement Serializable like this because it rules out a number of perfectly fine concurrent execution histories. For example, two concurrent read-only transactions.
Postgres implements serializability via Serializable Snapshot Isolation. MySQL implements serializability via Two-Phase Locking. FoundationDB implements serializability via sequential timestamp assignment and conflict detection.
But the paper, A critique of snapshot isolation, provides a simple (though not necessarily efficient; I have no clue) approach via what they call Write Snapshot Isolation. In their algorithm, a transaction must abort if its read set intersects another concurrent committed transaction's write set (rather than checking for write-write intersection). And this (plus the Repeatable Read rules) is sufficient for Serializability.
I leave it to that paper for the proof of correctness. In terms of implementing it though it's quite simple and very similar to the Snapshot Isolation we already mentioned.
Inside of completeTransaction, add:
}) {
d.completeTransaction(t, AbortedTransaction)
return fmt.Errorf("write-write conflict")
+ }
+
+ // Serializable Isolation imposes the additional constraint that
+ // no transaction A may commit after reading any of the same
+ // keys as transaction B has written and committed during
+ // transaction A's life, or vice-versa.
+ if t.isolation == SerializableIsolation && d.hasConflict(t, func(t1 *Transaction, t2 *Transaction) bool {
+ return setsShareItem(t1.readset, t2.writeset) ||
+ setsShareItem(t1.writeset, t2.readset)
+ }) {
+ d.completeTransaction(t, AbortedTransaction)
+ return fmt.Errorf("read-write conflict")
}
}
And if we add a test for read-write conflicts:
func TestSerializableIsolation_readwrite_conflict(t *testing.T) {
database := newDatabase()
database.defaultIsolation = SerializableIsolation
c1 := database.newConnection()
c1.mustExecCommand("begin", nil)
c2 := database.newConnection()
c2.mustExecCommand("begin", nil)
c3 := database.newConnection()
c3.mustExecCommand("begin", nil)
c1.mustExecCommand("set", []string{"x", "hey"})
c1.mustExecCommand("commit", nil)
_, err := c2.execCommand("get", []string{"x"})
assertEq(err.Error(), "cannot get key that does not exist", "c2 get x")
res, err := c2.execCommand("commit", nil)
assertEq(res, "", "c2 commit")
assertEq(err.Error(), "read-write conflict", "c2 commit")
// But unrelated keys cause no conflict.
c3.mustExecCommand("set", []string{"y", "no conflict"})
c3.mustExecCommand("commit", nil)
}
We see it work! And that's it for a basic implementation of MVCC and major transaction isolation levels.
There are two major projects I'm aware of that help you test transaction implementations: Elle and Hermitage. These are probably where I'd go looking if I were implementing this for real.
This project took me long enough on its own, and I felt comfortable enough from my tests that the gist of my logic was right, so I did not test further. For that reason it surely has bugs.
One of the major things this implementation does not do is clean up old data. Eventually, older versions of values will no longer be required by any transaction. They should be removed from the value version array. Similarly, eventually older transactions will no longer be required by any transaction. They should be removed from the database transaction history list.
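Here's a rough sketch (my own, not part of the 400 lines) of what the simplest possible vacuum could look like. It assumes it only runs when no transactions are in progress at all, which keeps the visibility reasoning trivial:
// Sketch only. Assumes no transactions are currently in progress, so
// any version deleted by a committed transaction, or created by an
// aborted one, can never be visible to a future transaction.
func (d *Database) vacuum() {
	for key, versions := range d.store {
		var kept []Value
		for _, v := range versions {
			deletedAndCommitted := v.txEndId > 0 &&
				d.transactionState(v.txEndId).state == CommittedTransaction
			createdByAborted := d.transactionState(v.txStartId).state == AbortedTransaction
			if deletedAndCommitted || createdByAborted {
				continue // Reclaim this version.
			}
			kept = append(kept, v)
		}
		if len(kept) == 0 {
			delete(d.store, key)
		} else {
			d.store[key] = kept
		}
	}
}
A real implementation would also need to decide when old entries in the transaction history can be dropped, and would need to reason about in-progress transactions rather than assuming there are none.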
Even if we had the vacuuming process in place though, what about some extreme use patterns? What if a key's value was always going to be 1GB long? And what if multiple transactions made only small changes to the 1GB data? We'd be duplicating a lot of the value across versions.
It sounds less extreme when thinking about storing rows of data rather than key-value data. If a user has 100 columns and only updates one column a number of times, in our scheme we'd end up storing a ton of duplicate cell data for a row.
This is a real-world issue in Postgres that was called out by Andy Pavlo and the Ottertune folks. It turns out that Postgres alone among major databases stores the entire value for every version. In contrast other major databases like MySQL store a diff.
This post only begins to demonstrate that database behavior differs quite a bit both in terms of results and in terms of optimizations. Everyone implements the ideas differently and to varying degrees.
Moreover, we have only begun to implement the behavior a real SQL database supports. For example, how do visibility rules and conflict detection work with range queries? What about sub-transactions, and save points? These will have to be covered another time.
Hopefully seeing this simple implementation of MVCC and visibility rules helps to clarify at least some of the basics.
2024-04-10 08:00:00
I want to explain why the blogs in My favorite technical blogs are my favorite. That page is solely about non-corporate tech blogs. So this post is too. I'll have to make another list for favorite corporate tech blogs.
In short, they:
There are a number of problems in programming and computer science where otherwise knowledgeable programmers start mumbling, or revert to cliches or group-think, because they aren't sure.
These are the best topics you can possibly dive deep into. And my favorite writers do exactly this.
They write about durability guarantees of disks and filesystems. They write about common pitfalls in benchmarking. They write about database consistency anomalies. They write about threading and IO models.
And they write about it by showing concrete examples and concrete logic so you can learn how to stop handwaving on the topic.
Their writing helps you come out with a useful mental model you can apply to your own problems.
And you know, sometimes it's not about the topic being obscure. Good writers have the ability to tackle a boring topic in an interesting light. Maybe by digging deeper into a root cause. Or showing you the history behind the scenes.
Moreover, my favorite writers don't know everything. But they also don't pretend to know everything. They're quick to admit they don't understand something and ask for help from their readers.
I love to see complete working code in a post. In contrast there are many projects that start out simple and people write an article that covers the project at a high level. But they keep working on the project and it becomes more complex.
It's not always easy to follow commits over time.
Eli Bendersky and Serge Zaitsev are particularly great at developing small but meaningful projects in a single post or short series.
On the other hand, if people only did this, we wouldn't hear about the development of long-running projects like V8 or Postgres. So I guess this style has limits. And I don't penalize people talking about long-running projects for not showing working code.
One of the marks of a good writer is that they can make complex topics simple. And not just by being reductive. Though sometimes even being reductive is useful for education.
In contrast I sometimes see articles by less experienced writers and I marvel how they make a simple topic so complex. I recognize this because I was absolutely like that 10 years ago, if not 5 years ago.
My favorite blogs typically get a new post at least once a month. Some people, like Murat, write once a week.
I think the practice probably does improve your writing but mostly it's that they keep my attention by publishing regularly!
Nothing builds trust like talking about the issues with something you built. No project is perfect. And to ignore the downsides risks seeming like you don't know or understand them.
So the writers I like the most talk about decisions in context. They talk about the good and the bad.
There's no way I can think of talking about this without sounding super lame.
One thing I've noticed, particularly among younger colleagues, is the use of memes, swearing, 4chan slang, or sarcasm. I used to write like this 10 years ago too.
There is a chunk of your audience who won't care. The problem is that there's also a chunk of your (potential) audience who definitely does care. There's even a chunk of your audience who may not care but just won't understand (i.e. non-native English speakers).
I have friends and folks I respect who write very well, but who are also overly, unnecessarily edgy when they write. I don't like sharing these posts because I don't want to unnecessarily offend or turn off people either.
It would be boring if everyone wrote the same way. I'm glad the internet is fun and weird. But I wanted to share a few things that go into my favorite technical blogs that I'm always happy to refer people to.
2024-04-05 08:00:00
I started a paper reading club this week at work, focused on databases and distributed systems research. I posted in a general channel about the premise and asked if anyone was interested. I clarified the intended topic and that discussions would be asynchronous over email, run fortnightly.
Weekly seemed way too often. A month seemed way too infrequent. Fortnightly seemed decent.
I was nervous to do this because I've been here about 2 months. In the past I would have waited 6 months or a year to do this. But I don't know. If you see something you think should exist, why wait?
The only other consideration was past experiences I've written about having difficulty getting engagement with clubs at work. But EDB has near 1,000 employees. I figured there might at least be a couple interested.
Furthermore I figured if I only got a few people, this entire idea would at least benefit myself, since I have been wanting to force myself to build a paper reading habit. And if no one responded, it would be only mildly embarrassing and I'd not pursue it further.
But after a day, about 6 people showed interest. Which was better than I hoped! Folks from product management, support, development, and beyond.
So I opened a dedicated channel and asked people to start submitting papers and voting on them. One of my teammates started submitting some great papers on caches and reference counting.
I picked a first one, the Redshift paper, to get us started, demonstrating the process to avoid confusion. And I made a calendar invite for everyone in the channel, with the paper linked in the invite. I clarified in the invite that it was just a reminder and that the real discussion would still be async over email. (I've found it's best to repeatedly clarify process stuff.)
Once I had these first few folks interested I was able to post again in a broader company channel that a couple of us were starting this paper club. By the end of the day the dedicated channel was 29 folks. All in about 2 days.
Mailing lists are nicer than Slack or Discord in my opinion because they sort of force you to slow down, they are harder to miss (if someone starts a thread after you've seen a message in Slack or Discord, you tend to miss it), and easier to manage (read/unread).
Engineers often seem to get overwhelmed by a mass of Slack messages. Whereas they seem to be a bit more comfortable with email threads.
All of this is all the more important when you're running a global group. EDB has people everywhere.
Why do this?
Before I dropped out of college I did a research internship with a VLSI group at Harvard SEAS. And my favorite part was that they had a weekly (or biweekly?) Wednesday paper reading session where 15 people from the lab and adjacent labs would eat pizza after hours and discuss a paper.
I've been dying to recreate this at a company ever since. Since EDB is so distributed, we won't be discussing over pizza. But I'm still excited.
And I hope my experience serves as a blueprint for others.
2024-03-27 08:00:00
This is an external post of mine.
2024-03-15 08:00:00
Having worked a bit in Zig, Rust, Go and now C, I think there are a few common topics worth having a fresh conversation on: automatic memory management, the standard library, and explicit allocation.
Zig is not a mature language. But it has made enough useful choices for a number of companies to invest in it and run it in production. The useful choices make Zig worth talking about.
Go and Rust are mature languages. But they have both made questionable choices that seem worth talking about.
All of these languages are developed by highly intelligent folks I personally look up to. And your choice to use any one of these is certainly fine, whichever it is.
The positive and negative choices particular languages made, though, are worth talking about as we consider what a systems programming language 10 years from now would look like. Or how these languages themselves might evolve in the next 10 years.
My perspective is mostly building distributed databases. So the points that I bring up may have no relevance to the kind of work you do, and that's alright. Moreover, I'm already aware most of these opinions are not shared by the language maintainers, and that's ok too. I am not writing to convince anyone.
One of my bigger issues with Zig is that it doesn't support RAII. You can defer cleanup to the end of a block, which covers only half of the problem. But only RAII will allow for smart pointers and automatic (not manual) reference counting. RAII is an excellent option to default to, but in Zig you aren't allowed to. In contrast, even C "supports" automatic cleanup (via compiler extensions).
But most of the time, arenas are fine. Postgres is written in C and memory is almost entirely managed through nested arenas (called "memory contexts") that get cleaned up when some subset of a task finishes, recursively. Zig has builtin support for arenas, which is great.
It seems regrettable that some languages have been shipping smaller standard libraries. Smaller standard libraries seem to encourage users of the language to install more transitively-unvetted third-party libraries, which increases build time and build flakiness, and which increases bitrot over time as unnecessary breaking changes occur.
People have been making jokes about node_modules for a decade now, but this problem is just as bad in Rust codebases I've seen. And to a degree it happens in Java and Go as well, though their larger standard libraries allow you to get further without dependencies.
Zig has a good standard library, which may be Go and Java tier in a few years. But one goal of their package manager seemed to be to allow the standard library to be broken up; made smaller. For example, JSON support moving out of the standard library into a package. I don't know if that is actually the planned direction. I hope not.
Having a large standard library doesn't mean that the programmer shouldn't be able to swap out implementations easily as needed. But all that is required is for the standard library to define an interface along with the standard library implementation.
The small size of the standard library doesn't just affect developers using the language, it even encourages developers of the language itself to depend on libraries owned by individuals.
Take a look at the transitive dependencies of an official Node.js package like node-gyp. Is it really the ideal outcome of a small standard library to encourage dependence in official libraries on libraries owned by individuals, like env-paths, that haven't been modified in 3 years? 68 lines of code. Is it not safer at this point to vendor that code? i.e. copy the env-paths code into node-gyp.
Similarly, if you go looking for compression support in Rust, there's none in the standard library. But you may notice the flate2-rs repo under the official rust-lang GitHub namespace. If you look at its transitive dependencies: flate2-rs depends on (an individual's) miniz_oxide which depends on (an individual's) adler that hasn't been updated in 4 years. 300 lines of code including tests. Why not vendor this code? It's the habits a small standard library builds that seem to encourage everyone not to.
I don't mean these necessarily constitute a supply-chain risk. I'm not talking about left-pad. But the pattern is sort of clear. Even official packages may end up depending on external party packages, because the commitment to a small standard library meant omitting stuff like compression, checksums, and common OS paths.
It's a tradeoff and maybe makes the job of the standard library maintainer easier. But I don't think this is the ideal situation. Dependencies are useful but should be kept to a reasonable minimum.
Hopefully languages end up more like Go than like Rust in this regard.
When folks discuss the Zig standard library's pattern of requiring an allocator argument for every method that allocates, they often talk about the benefit of swapping out allocators or the benefit of being able to handle OOM failures.
Both of these seem pretty niche to me. For example, in Zig tests you are encouraged to pass around a debug allocator that tells you about memory leaks. But this doesn't seem too different from compiling a C project with a debug allocator or compiling with different sanitizers on and running tests against the binary produced. In both cases you mostly deal with allocators at a global level depending on the environment you're running the code in (production or tests).
The real benefit of explicit allocations to me is much more trivial. You basically can't code a method in Zig without acknowledging allocations.
This is particularly useful for hotpath code. Take an iterator for example. It has a new() method, a next() method, and a done() method. In most languages, it's basically impossible at the syntax or compiler level to know if you are allocating in the next() method. You may know because you know the behavior of all the code in next() by heart. But that won't happen all the time.
Zig is practically alone in that if you write the next() method and don't pass an allocator to any method in the next() body, nothing in that next() method will allocate.
In any other language it might not be until you run a profiler that you notice an allocation that should have been done once in new() accidentally ended up in next() instead.
On the other hand, for all the same reasons, writing Zig is kind of a pain because everything takes an allocator!
Explicit allocation is not intrinsic to Zig, the language. It is a convention that is prevalent in the standard library. There is still a global allocator and any user of Zig could decide to use the global allocator. At which point you've got implicit allocation. So explicit allocation as a convention isn't a perfect solution.
But it, by default, gives you a level of awareness of allocations you just can't get from typical Go or Rust or C code, depending on the project's practices. Perhaps it's possible to switch off the Go, Rust and C standard library and use one where all functions that allocate do require an allocator.
But explicitly passing allocators is still sort of a visual hack.
I think the ideal situation in the future will be that every language supports annotating blocks of code as must-not-allocate or something along those lines. Either the compiler will enforce this and fail if you seem to allocate in a block marked must-not-allocate, or it will panic during runtime so you can catch this in tests.
This would be useful beyond static programming languages. It would be as interesting to annotate blocks in JavaScript or Python as must-not-allocate too.
Otherwise the current state of things is that you'd normally configure this sort of thing at the global level. Saying "there must not be any allocations in this entire program" just doesn't seem as useful in general as being able to say "there must not be any allocations in this one block".
Rust has nascent support for passing an allocator to methods that allocate. But it's optional. From what I understand, C++ STL is like this too.
These are both super useful for programming extensions. And it's one of the reasons I think Zig makes a ton of sense for Postgres extensions specifically. Because it was only and always ever built for running in an environment with someone else's allocator.
All three of these have really great first-party tooling including build system, package management, test runners and formatters. The idea that the language should provide a great environment to code in (end-to-end) makes things simpler and nicer for programmers.
Use the language you want to use. Zig and Rust are both nice alternatives to writing vanilla C.
On the other hand, I've been pleasantly surprised writing Postgres C. How high level it is. It's almost a separate language since you're often dealing with user-facing constructs, like Postgres's Datum objects which represent what you might think of as a cell in a Postgres database. And you can use all the same functions provided for Postgres SQL for working with Datums, but from C.
I've also been able work a bit on Postgres extensions in Rust with pgrx lately, which I hope to write about soon. And when I saw pgzx for writing Postgres extensions in Zig I was excited to spend some time with that too.
Wrote a post on my wishlist for Zig and Rust. Focused on automatic memory management, the standard library, and explicit allocation.https://t.co/dvynizU9V2 pic.twitter.com/iTXp5QVxj0
— Phil Eaton (@eatonphil) March 15, 2024
2024-03-11 08:00:00
A little over a month ago, I joined EnterpriseDB on a distributed Postgres product (PGD). The process of onboarding myself has been pretty similar at each company in the last decade, though I think I've gotten better at it. The process is of course influenced by the team, and my coworkers have been excellent. Still, I wanted to share my thought process and personal strategies.
Trickier things at companies are the people, organization, and processes. What code exists? How does it work together? Who owns what? How can I find easy code issues to tackle? How do I know what's important (so I can avoid picking it up and becoming a bottleneck)?
But also, in the first few days or weeks you aren't exactly expected to contribute meaningfully to features or bugs. Your sprint contributions are not tracked too closely.
The combination of 1) what to avoid and 2) the sprint-freedom-you-have leads to a few interesting and valuable areas to work on on your own: the build process, tests, running the software, and docs.
But code need not be ignored either. Some frequent areas to get your first code contributions in include user configuration code, error messages, and stale code comments.
What follows are some little 1st day, 1st week, 1st month projects I went through to bootstrap my understanding of the system.
First off, where is the code and how do you build it? This requires you to have all the relevant dependencies. Much of my work is on a Postgres extension. This meant having a local Postgres development environment, having gcc, gmake (on mac), Perl, and so on. And furthermore, PGD is a pretty mature product so it supports building against multiple Postgres distributions. Can I build against all of them?
The easiest situation is when there are instructions for all of this, linked directly from your main repo. When I started, the instructions did exist but in a variety of places. So over the first week I started collecting all of what I had learned about building the system, with dependencies, across distributions, and with various important flags (debug mode on, asserts enabled, etc.). I finished the first week by writing a little internal blog post called "Hacking on PGD".
I hadn't yet figured out the team processes so I didn't want to bother anyone by trying to get this "blog post" committed anywhere yet as official internal documentation. Maybe there already was a good doc, I just hadn't noticed it yet. So I just published it in a private Confluence page and shared it in the private team slack. If anyone else benefited from it, great! Otherwise, I knew I'd want to refer back to it.
This is an important attitude I think. It can be hard to tell what others will benefit from. If you get into the habit of writing things down for your own sake, while making them available internally, it is likely others will benefit from them too. This is something I've learned from years of blogging publicly outside of work.
Moreover, the simple act of writing a good post establishes you as something of an authority. This is useful for yourself if no one else.
Let's get distracted here for a second. One of the most important things I think in documentation is documenting not just what does exist but what doesn't. If you had to take a path to get something to work, did you try other paths that didn't work? It can be extremely useful to figure out what exactly is required for something.
Was there a flag that you tried to build with but you didn't try building without it? Well try again without it and make sure it was necessary. Was there some process you executed where the build succeeded but you can't remember if it was actually necessary for the build to succeed?
It's difficult to explain why I think this sort of precision is useful but I'm pretty sure it is. Maybe because it builds the habit of not treating things as magic when you can avoid it. It builds the habit of asking questions (if only to yourself) to understand and not just to get by.
Going back to builds, another aspect to consider is static and dynamic analysis. Are there special steps to using gdb or valgrind or other analyzers? Are you using them already? Can you get them running locally? Has any of this been documented?
Maybe the answer to all of those is yes, or maybe none of those are relevant but there are likely similar tools for your ecosystem. If analysis tools are relevant and no one has yet explored them, that's another very useful area to explore as a newcomer.
After I got the builds working, I felt the obvious next step was to run tests. But what tests exist? Are there unit tests? Integration tests? Anything else? Moreover, is there test coverage? I was certain I'd be able to find some low-hanging contributions to make if I could find some files with low test coverage.
Alas, my certainty hit the wall in that there were in fact too many types of integration tests that all do provide coverage already. They just don't all report coverage.
The easiest ways to report coverage (with gcov) were only reporting coverage for certain integration tests that we run locally. There are more integration tests run in cloud environments and getting coverage reports there to merge with my local coverage files would have required more knowledge of people and processes, areas that I wanted not to be forced to think about too quickly.
So coverage wasn't a good route to go. But around this time, I noticed a ticket that asked for a simple change to user configuration code. I was able to make the change pretty quickly and wanted to add tests. We have our own test framework built on top of Postgres's powerful Perl test framework. But it was a little difficult to figure out how to use either of them.
So I copied code from other tests and pared it down until I got the smallest version of test code I could get. This took maybe a day or two of tweaking lines and rerunning tests since I didn't understand everything that was/wasn't required. Also it's Perl and I've never written Perl before so that took a bit of time and ChatGPT. (Arrays, man.)
In the end though I was able to collect my learnings into another internal confluence post just about how to write tests, how to debug tests, how to do common things within tests (for example, ensuring a Postgres log line was outputted), etc. I published this post as well and shared it in the team Slack.
I had PGD built locally and was able to run integration tests locally, but I still hadn't gotten a cluster running. Nor played with the eventual consistency demos I knew we supported. We had a great quickstart that ran through all the manual steps of getting a two-node cluster up. This was a distillation, for devs, of a more elaborate process we give to customers in a production-quality script.
But I was looking for something in between a production-quality script and manually initializing a local cluster. And I also wanted to practice my understanding of our test process. So I ported our quickstart to our integration test framework and made a PR with this new test, eventually merging this into the repo. And I wrote a minimal Python script for bringing up a local cluster. I've got an open PR to add this script to the repo. Maybe I'll learn though that a simple script such as this does already exist, and that's fine!
The entire time, as I'd been trying to build and test and run PGD, I was trying to understand our terminology and architecture by going through our public docs. I had a lot of questions coming out of this I'd ask in the team channel.
Not to toot my horn but I think it's somewhat of a superpower to be able/willing to ask "dumb questions" in a group setting. That's how I frame it anyway. "Dumb question: what does X mean in this paragraph?" Or, "dumb question: when we say performance improvement because of Y, what is the intuition here?" Because of the time spent here, I was able to make a few more docs contributions as I read through the docs as well.
You have to balance where you ask your dumb questions though. Asking dumb questions to one person doesn't benefit the team. But asking dumb questions in too wide a group is sometimes bad politics. Asking "dumb questions" in front of your team seems to have the best bang for buck.
But maybe the more important contributions were, when I got more comfortable with the team, proposing to merge my personal, internal Confluence blog posts into the repo as docs. I think in a number of cases, what I wrote about indeed hadn't been concisely collected before and thus was useful to have as team documentation.
Even more challenging was trying to distill (a chunk of) the internal architecture. Only after following many varied internal docs and videos, and following through numerous code paths, was I able to propose an architecture diagram outlining major components and communication between them, with their differing formats (WAL records, internal enums, etc.) and means of communication (RPC, shared memory, etc.). This architecture diagram is still in review and may be totally off. But it's already helped at least me think about the system.
In most cases this was all information that the team had already written or explained but just bringing it together and summarizing provided a different useful perspective I think. Even if none of the docs got merged it still helped to build my own understanding.
Learning the project is just one aspect of onboarding. Beyond that I joined the #cats channel, the #dogs channel, found some fellow New Yorkers and opened a NYC channel, and tried to find Zoom-time with the various people I'd see hanging around common team Slack channels. Trying to meet not just devs but support folk, product managers, marketing folk, sales folk, and anyone else!
Walking the line between scouring our docs and GitHub and Confluence and Jira on my own, and bugging people with my incessant questions.
I've enjoyed my time at startups. I've been a dev, a manager, a founder, a cofounder. But I'm incredibly excited to be back, at a bigger company, full-time as a developer hacking on a database!
And what about you? What do you do to onboard yourself at a new company or new project?
I've been having an absolute blast in my first month at EDB and I wanted to share a few of my strategies for onboarding myself on a database team. Strategies broadly applicable for devs on a new team/project.https://t.co/TS5qRLysuA pic.twitter.com/lvuxDBQJwx
— Phil Eaton (@eatonphil) March 12, 2024
2024-02-08 08:00:00
Distributed consensus in transactional databases (e.g. etcd or Cockroach) is a big deal these days. Most often under the hood are variations of log-based Paxos-like algorithms such as MultiPaxos, Viewstamped Replication, or Raft. While there are new variations that come out each year, optimizing for various workloads, these algorithms are fairly standard and well-understood.
In fact they are used in so many places, Kubernetes for example, that even if you don't decide to implement Raft (which is fun and I encourage it), it seems worth building an intuition for distributed consensus.
What happens as you tweak a configuration. What happens as the production environment changes. Or what to reach for as product requirements change.
I've been thinking about the basics of distributed consensus recently. There has been a lot to digest and characterize. And I'm only beginning to get an understanding.
This post is an attempt to share some of the intuition built up reading about and working in this space. Originally this post was also going to end with a walkthrough of my most recent Raft implementation in Rust. But I'm going to hold off on that for another time.
I was fortunate to have a few excellent reviewers looking at versions of this post: Paul Nowoczynski, Alex Miller, Jack Vanlightly, Daniel Chia, and Alex Petrov. Thank you!
Let's start with Raft.
Raft is a distributed consensus algorithm that allows you to build a replicated state machine on top of a replicated log.
A Raft library handles replicating and durably persisting a sequence (or log) of commands to at least a majority of nodes in a cluster. You provide the library a state machine that interprets the replicated commands. From the perspective of the Raft library, commands are just opaque byte strings.
For example, you could build a replicated key-value store out of SET and GET commands that are passed in by a client. You provide the Raft library state machine code that interprets the Raft log of SET and GET commands to modify or read from an in-memory hashtable. You can find concrete examples of exactly this replicated key-value store modeling in previous Raft posts I've written.
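To make that concrete, here is a minimal sketch of what such a user-provided state machine could look like. The interface is my own illustration, not any particular Raft library's API: the library hands you committed commands as opaque strings and you interpret them against an in-memory map.

#include <map>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical user-provided state machine: interpret opaque "SET k v"
// and "GET k" commands against an in-memory map. The Raft library only
// sees byte strings; all of the meaning lives here.
struct KvStateMachine {
  std::map<std::string, std::string> data;

  // Apply one committed command in log order, returning a result for
  // the client (if any).
  std::optional<std::string> apply(const std::string& command) {
    std::istringstream in(command);
    std::string op, key, value;
    in >> op >> key;
    if (op == "SET") {
      in >> value;
      data[key] = value;
      return std::nullopt;
    }
    if (op == "GET") {
      auto it = data.find(key);
      if (it == data.end()) return std::nullopt;
      return it->second;
    }
    return std::nullopt;  // ignore unknown commands in this sketch
  }
};

Because every node applies the same committed commands in the same order, the maps converge to the same contents across the cluster.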
All nodes in the cluster run the same Raft code (including the state machine code you provide); communicating among themselves. Nodes elect a semi-permanent leader that accepts all reads and writes from clients. (Again, reads and writes are modeled as commands).
To commit a new command to the cluster, clients send the command to all nodes in the cluster. Only the leader accepts this command, if there is currently a leader. Clients retry until there is a leader that accepts the command.
The leader appends the command to its log and makes sure to replicate all commands in its log to followers in the same order. The leader sends periodic heartbeat messages to all followers to prolong its term as leader. If a follower hasn't heard from the leader within a period of time, it becomes a candidate and requests votes from the cluster.
When a follower is asked to accept a new command from a leader, it checks if its history is up-to-date with the leader. If it is not, the follower rejects the request and asks the leader to send previous commands to bring it up-to-date. It does this ultimately, in the worst case of a follower that has lost all history, by going all the way back to the very first command ever sent.
When a quorum (typically a majority) of nodes has accepted a command, the leader marks the command as committed and applies the command to its own state machine. When followers learn about newly committed commands, they also apply committed commands to their own state machine.
For the most part, these details are graphically summarized in Figure 2 of the Raft paper.
Taking a step back, distributed consensus helps a group of nodes, a cluster, agree on a value. A client of the cluster can treat a value from the cluster as if the value was atomically written to and read from a single thread. This property is called linearizability.
However, with distributed consensus, the client of the cluster has better availability guarantees from the cluster than if the client atomically wrote to or read from a single thread. A single thread that crashes becomes unavailable. But some number f of nodes can crash in a cluster implementing distributed consensus and the cluster will still 1) be available and 2) provide linearizable reads and writes.
That is: distributed consensus solves the problem of high availability for a system while remaining linearizable.
Without distributed consensus you can still achieve high availability. For example, a database might have two read replicas. But a client reading from a read replica might get stale data. Thus, this system (a database with two read replicas) is not linearizable.
Without distributed consensus you can also try synchronous replication. It would be very simple to do: have a fixed leader and require all nodes to acknowledge before committing. But the value here is extremely limited. If a single node in the cluster goes down, the entire cluster is down.
You might think I'm proposing a strawman. We could simply designate a permanent leader that handles all reads and writes; and require a majority of nodes to commit a command before the leader responds to a client. But in that case, what's the process for getting a lagging follower up-to-date? And what happens if it is the leader who goes down?
Well, these are not trivial problems! And, beyond linearizability that we already mentioned, these problems are exactly what distributed consensus solves.
It's very nice, and often even critical, to have a highly available system that will never give you stale data. And regardless, it's convenient to have a term for what we might naively think of as the "correct" way you'd always want to set and get a value.
So linearizability is a convenient way of thinking about complex systems, if you can use or build a system that supports it. But it's not the only consistency approach you'll see in the wild.
As you increase the guarantees of your consistency model, you tend to sacrifice performance. Going the opposite direction, some production systems sacrifice consistency to improve performance. For example, you might allow stale reads from any node, reading only from local state and avoiding consensus, so that you can reduce load on a leader and avoid the overhead of consensus.
There are formal definitions for lower consistency models, including sequential and read-your-writes. You can read the Jepsen page for more detail.
A distributed system relies on communicating over the network. The worse the network, whether in terms of latency or reliability, the longer it will take for communication to happen.
Aside from the network, disks can misdirect writes or corrupt data. Or you could be mounted on a network filesystem such as EBS.
And processes themselves can crash due to low disk space or the OOM killer.
It will take longer to achieve consensus to commit messages in these scenarios. If messages take longer to reach nodes, or if nodes are constantly crashing, followers will time out more often, triggering leader election. And the leader election itself (which also requires consensus) will also take longer.
The best case scenario for distributed consensus is where the network is reliable and low-latency. Where disks are reliable and fast. And where processes don't often crash.
TigerBeetle has an incredible visual simulator that demonstrates what happens across ever-worsening environments. While TigerBeetle and this simulator use Viewstamped Replication, the demonstrated principles apply to Raft as well.
Distributed consensus algorithms make sure that some minimum number of nodes in a cluster agree before continuing. The minimum number is proportional to the total number of nodes in the cluster.
A typical implementation of Raft for example will require 3 nodes in a 5-node cluster to agree before continuing. 4 nodes in a 7-node cluster. And so on.
Recall that the p99 latency for a service is at least as bad as the slowest external request the service must make. As you increase the number of nodes you must talk to in a consensus cluster, you increase the chance of a slow request.
Consider the extreme case of a 101-node cluster requiring 51 nodes to respond before returning to the client. That's 51 chances for a slower request. Compared to 4 chances in a 7-node cluster. The 101-node cluster is certainly more highly available though! It can tolerate 50 nodes going down. The 7-node cluster can only tolerate 3 nodes going down. The scenario where 50 nodes go down (assuming they're in different availability zones) seems pretty unlikely!
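For intuition, the quorum arithmetic itself is tiny (my own illustration):

#include <cstdio>

// Majority quorum math for an n-node cluster: how many nodes must
// respond before a command commits, and how many crashes the cluster
// can survive while staying available.
int main() {
  for (int n : {3, 5, 7, 101}) {
    int quorum = n / 2 + 1;
    int tolerated_failures = n - quorum;
    std::printf("n=%3d quorum=%3d tolerates %2d failures\n", n, quorum,
                tolerated_failures);
  }
  return 0;
}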
All of this is to say that the most popular algorithms for distributed consensus, on their own, have nothing to do with horizontal scaling.
The way that horizontally scaling databases like Cockroach or Yugabyte or Spanner work is by sharding the data, transparent to the client. Within each shard data is replicated with a dedicated distributed consensus cluster.
So, yes, distributed consensus can be a part of horizontal scaling. But again what distributed consensus primarily solves is high availability via replication while remaining linearizable.
This is not a trivial point to make. etcd, consul, and rqlite are examples of databases that do not do sharding, only replication, via a single Raft cluster that replicates all data for the entire system.
For these databases there is no horizontal scaling. If they support "horizontal scaling", they support this by doing non-linearizable (stale) reads. Writes remain a challenge.
This doesn't mean these databases are bad. They are not. One obvious advantage they have over Cockroach or Spanner is that they are conceptually simpler. Conceptually simpler often equates to easier to operate. That's a big deal.
We've covered the basics of operation, but real-world implementations get more complex.
Rather than letting the log grow indefinitely, most libraries implement snapshotting. The user of the library provides a state machine and also provides a method for serializing the state machine to disk. The Raft library periodically serializes the state machine to disk and truncates the log.
When a follower is so far behind that the leader no longer has a log entry (because it has been truncated), the leader transfers an entire snapshot to the follower. Then once the follower is caught up on snapshots, the leader can transfer normal log entries again.
This technique is described in the Raft paper. While it isn't necessary for Raft to work, it's so important that it is hardly an optimization and more a required part of a production Raft system.
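As a sketch of what the snapshot hooks might look like (a hypothetical interface, not any particular library's API), the user-provided state machine grows a snapshot and a restore method:

#include <map>
#include <sstream>
#include <string>

// Hypothetical snapshot hooks for a key-value state machine: the Raft
// library can periodically call snapshot(), persist the bytes, truncate
// the log, and later call restore() on a follower that fell too far behind.
// Assumes keys and values contain no whitespace; a real format would be
// length-prefixed.
struct SnapshottableKv {
  std::map<std::string, std::string> data;

  std::string snapshot() const {
    std::ostringstream out;
    for (const auto& [key, value] : data) out << key << ' ' << value << '\n';
    return out.str();
  }

  void restore(const std::string& bytes) {
    data.clear();
    std::istringstream in(bytes);
    std::string key, value;
    while (in >> key >> value) data[key] = value;
  }
};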
Rather than limiting clients of the cluster to submitting only one command at a time, it's common for the cluster to accept many commands at a time. Similarly, many commands at a time are submitted to followers. When any node needs to write commands to disk, it can batch commands to disk as well.
But you can go a step beyond this in a way that is completely opaque to the Raft library. Each opaque command the client submits can also contain a batch of messages. In this scenario, only the user-provided state machine needs to be aware that each command it receives is actually a batch of messages that it should pull apart and interpret separately.
This latter technique is a fairly trivial way to increase throughput by an order of magnitude or two.
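A sketch of that framing (my own illustration; any delimiter or length-prefix scheme works): the client packs several messages into one command, and only the state machine unpacks them.

#include <sstream>
#include <string>
#include <vector>

// Pack several client messages into one opaque command. The Raft
// library still replicates a single byte string; it never knows there
// is a batch inside. Assumes messages contain no newlines.
std::string pack(const std::vector<std::string>& messages) {
  std::string command;
  for (const auto& message : messages) {
    command += message;
    command += '\n';
  }
  return command;
}

// In the state machine, split the batch back apart and apply each
// message in order.
std::vector<std::string> unpack(const std::string& command) {
  std::vector<std::string> messages;
  std::istringstream in(command);
  for (std::string line; std::getline(in, line);) messages.push_back(line);
  return messages;
}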
In terms of how data is stored on disk and how data is sent over the network there is obvious room for optimization.
A naive implementation might store JSON on disk and send JSON over the network. A slightly more optimized implementation might store binary data on disk and send binary data over the network.
Similarly you can swap out your RPC for gRPC or introduce zlib for compression to network or disk.
You can swap out synchronous IO for libaio or io_uring or SPDK/DPDK.
A little tweak I made in my latest Raft implementation was to index log entries so searching the log was not a linear operation. Another little tweak was to introduce a page cache to eliminate unnecessary disk reads. This increased throughput by an order of magnitude.
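The idea behind the log index is roughly this (an illustrative sketch, not the actual implementation): keep a map from log index to file offset so finding an entry is a lookup plus a seek rather than a scan of the whole log.

#include <cstdint>
#include <map>

// Illustrative log index: map each entry's log index to its byte offset
// in the on-disk log, so reads can seek directly to the entry.
struct LogIndex {
  std::map<uint64_t, uint64_t> offsets;  // log index -> file offset

  void record_append(uint64_t log_index, uint64_t file_offset) {
    offsets[log_index] = file_offset;
  }

  // Returns true and sets file_offset if the entry is indexed.
  bool lookup(uint64_t log_index, uint64_t& file_offset) const {
    auto it = offsets.find(log_index);
    if (it == offsets.end()) return false;
    file_offset = it->second;
    return true;
  }

  // Drop stale entries after snapshotting truncates the log.
  void truncate_before(uint64_t first_kept_index) {
    offsets.erase(offsets.begin(), offsets.lower_bound(first_kept_index));
  }
};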
This brilliant optimization by Heidi Howard and co. shows you can relax the quorum required for committing new commands so long as you increase the quorum required for electing a leader.
In an environment where leader election doesn't happen often, flexible quorums can increase throughput and decrease latency. And it's a pretty easy change to make!
These are just a couple common optimizations. You can also read about parallel state machine apply, parallel append to disk, witnesses, compartmentalization, and leader leases. TiKV, Scylla, RedPanda, and Cockroach tend to have public material talking about this stuff.
There are also a few people I follow who are often reviewing relevant papers, if they are not producing their own. I encourage you to follow them too if this is interesting to you:
The other aspect to consider is safety. For example, checksums for everything written to disk and passed over the network; or being able to recover from corruption in the log.
Testing is also a big deal. There are prominent tools like Jepsen that check for consistency in the face of fault injection (process failure, network failure, etc.). But even Jepsen has its limits. For example, it doesn't test disk failure.
FoundationDB made popular a number of testing techniques. And the people behind this testing went on to build a product, Antithesis, around deterministic testing of non-deterministic code while injecting faults.
And on that topic there's Facebook Experimental's Hermit deterministic Linux hypervisor that may help to test complex distributed systems. However, my experience with it has not been great and the maintainers do not seem very engaged with other people who have reported bugs. I'm hopeful for it but we'll see.
Antithesis and Hermit seem like a boon when half the trouble of working on distributed consensus implementations is avoiding flakey tests.
Another promising avenue is emitting logs during the Raft lifecycle and validating the logs against a TLA+ spec. Microsoft has such a project that has begun to see adoption among open-source Raft implementations.
Everything aside, consensus is expensive. There is overhead to the entire consensus process. So if you do not need this level of availability and can settle for something like restoring from backups, a system without consensus is going to have lower latency and higher throughput than one that has to go through distributed consensus.
If you do need high availability, distributed consensus can be a great choice. But consider the environment and what you want from your consensus algorithm.
Also, while MultiPaxos, Raft, and Viewstamped Replication are some of the most popular algorithms for distributed consensus, there is a world beyond. Two-phase commit, ZooKeeper Atomic Broadcast, PigPaxos, EPaxos, Accord by Cassandra. The world of distributed consensus also gets especially weird and interesting outside of OLTP systems.
But that's enough for one post.
I wrote a post about building an intuition for distributed consensus in OLTP systems!
— Phil Eaton (@eatonphil) February 8, 2024
Very grateful to all the folks who reviewed.https://t.co/wMxUuokKeg pic.twitter.com/cfY2kdfqak
2024-01-09 08:00:00
I spent a week looking at MySQL/MariaDB internals along with ~80 other devs. Although MySQL and MariaDB are mostly the same (more on that later), I focused on MariaDB specifically this week.
Before last week I had never built MySQL/MariaDB. The first day of this hack week, I got MariaDB building locally and made a code tweak so that SELECT 23 returned 213, and another tweak so that SELECT 80 + 20 returned 60. The second day I got a basic UDF in C working so that SELECT mysum(20, 30) returned 50.
The rest of the week I spent figuring out how to build a minimal in-memory storage engine, which I'll walk through in this post. 218 lines of C++.
It supports CREATE, DROP, INSERT, and SELECT for tables that only have INTEGER fields. It is explicitly not thread-safe because I didn't have time to understand MariaDB's lock primitives.
In this post I'll also talk about how the MariaDB custom storage API compares to the Postgres one, based on a previous hack week project I did.
All code for this post can be found in my fork on GitHub.
Before we go further though, why do I keep saying MySQL/MariaDB?
MySQL is GPL licensed (let's completely ignore the commercial variations of MySQL that Oracle offers). The code is open-source. However, the development is done behind closed doors. There is a code dump every month or so.
MariaDB is a fork of MySQL by the creator of MySQL (who is no longer involved, as it happens). It is also GPL licensed (let's completely ignore the commercial variations of MariaDB that MariaDB Corporation offers). The code is open-source. The development is also open-source.
When you install "MySQL" in your Linux distro you are often actually installing MariaDB.
The two are mostly compatible. During this week, I stumbled onto the fact that they have evolved support for SELECT .. FROM VALUES .. differently. Some differences are documented on the MariaDB KB. But this KB is painful to browse. Which leads me to my next point.
The MySQL docs are excellent. Easy to read, browse; and they are thorough. The MariaDB docs are a work in progress. I'm sorry I can't be stoic: in just a week I've come to really hate using this KB. Thankfully, in some twisted way, it also doesn't seem to be very thorough either. It isn't completely avoidable though since there is no guarantee MySQL and MariaDB do the same thing.
Ultimately, I spent the week using MariaDB because I'm biased toward fully open projects. But I kept having to look at MySQL docs, hoping they were relevant.
Now that you understand the state of things, let's move on to fun stuff!
Mature databases often support swapping out the storage layer. Maybe you want an in-memory storage layer so that you can quickly run integration tests. Maybe you want to switch between B-Trees (read-optimized) and LSM Trees (write-optimized) and unordered heaps (write-optimized) depending on your workload. Or maybe you just want to try a third-party storage library (e.g. RocksDB or Sled or TiKV).
The benefit of swapping out only the storage engine is that, from a user's perspective, the semantics and features of the database stay mostly the same. But the database is magically faster for a workload.
You keep powerful user management, extension support, SQL support, and a well-known wire protocol. You modify only the method of storing the actual data.
MySQL/MariaDB is particularly well known for its custom storage engine support. The MySQL docs for alternate storage engines are great.
While the docs do warn that you should probably stick with the default storage engine, that warning didn't quite feel strong enough because nothing else seemed to indicate the state of other engines.
Specifically, in the past I was always interested in the CSV storage engine. But when you look at the actual code for the CSV engine there is a pretty strong warning:
First off, this is a play thing for me, there are a number of things
wrong with it:
*) It was designed for csv and therefore its performance is highly
questionable.
*) Indexes have not been implemented. This is because the files can
be traded in and out of the table directory without having to worry
about rebuilding anything.
*) NULLs and "" are treated equally (like a spreadsheet).
*) There was in the beginning no point to anyone seeing this other
then me, so there is a good chance that I haven't quite documented
it well.
*) Less design, more "make it work"
Now there are a few cool things with it:
*) Errors can result in corrupted data files.
*) Data files can be read by spreadsheets directly.
TODO:
*) Move to a block system for larger files
*) Error recovery, its all there, just need to finish it
*) Document how the chains work.
-Brian
The difference between the seeming confidence of the docs and seeming confidence of the contributor made me chuckle.
The benefit of these diverse storage engines for me was that they give examples of how to implement the storage engine API. The csv, blackhole, example, and heap storage engines were particularly helpful to read.
The heap engine is a complete in-memory storage engine. Complete means complex though. So there seemed to be room for a stripped down version of an in-memory engine.
And that's what we'll cover in this post! First though I want to talk a little bit about the limitations of custom storage engines.
While being able to tailor a storage engine to a workload is powerful, there are limits to the benefits based on the design of the storage API.
Both Postgres and MySQL/MariaDB currently have a custom storage API built around individual rows.
I have previously written that custom storage engines allow you to switch between column- and row-oriented data storage. Two big reasons to do column-wise storage are 1) opportunity for compression, and 2) fast operations on a single column.
The opportunity for 1) compression on disk would still exist even if you needed to deal with individual rows at the storage API layer since the compression could happen on disk. However any benefits of passing around compressed columns in memory disappear if you must convert to rows for the storage API.
You'd also lose the advantage for 2) fast operations on a single column if the column must be converted into a row at the storage API whereupon it's passed to higher levels that perform execution. The execution would happen row-wise, not column-wise.
All of this is to say that while column-wise storage is possible, the benefit of doing so is not obvious with the current API design for both MySQL/MariaDB and Postgres.
An API built around individual rows also sets limits on the amount of vectorization you can do. A custom storage engine could still do some vectorization under the hood: always filling a buffer with N rows and returning a row from the buffer when the storage API requests a single row. But there is likely some degree of performance left on the table with an API that deals with individual rows.
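To illustrate what buffering "under the hood" looks like (the types here are placeholders, not the real MySQL/MariaDB or Postgres interfaces):

#include <cstddef>
#include <vector>

// Sketch of vectorization hidden under a row-at-a-time API: the engine
// reads rows in batches internally but still hands the caller one row
// per call.
struct Row {
  int value = 0;
};

class BufferedScan {
  std::vector<Row> buffer;
  std::size_t next_index = 0;
  bool exhausted = false;

  // Stub standing in for a batched read from the underlying storage.
  std::vector<Row> fetch_batch() {
    if (exhausted) return {};
    exhausted = true;
    return std::vector<Row>(4);  // pretend storage returned 4 rows
  }

public:
  // The row-at-a-time entry point a storage API would call.
  bool next(Row& out) {
    if (next_index == buffer.size()) {
      buffer = fetch_batch();
      next_index = 0;
      if (buffer.empty()) return false;
    }
    out = buffer[next_index++];
    return true;
  }
};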
Remember though: if you did batched reads and writes of rows in the custom storage layer, there isn't necessarily any vectorization happening at the execution layer. From a previous study I did, neither MySQL/MariaDB nor Postgres do vectorized query execution. This paragraph isn't a critique of the storage API, it's just something to keep in mind.
The general point I'm making here is that unless both the execution and storage APIs are designed in a certain way, you may attempt optimizations in the storage layer that are ineffective or even harmful because the execution layer doesn't or can't take advantage of them.
The current limitations of the storage API are not intrinsic aspects of MySQL/MariaDB or Postgres's design. For both projects there used to be no pluggable storage at all. We can imagine a future patch to either project that allows support for batched row reads and writes that together could make column-wise storage and vectorized execution more feasible.
Even today there have been invasive attempts to fully support column-wise storage and execution in Postgres. And there have also been projects to bring vectorized execution to Postgres.
I'm not familiar enough with the MySQL landscape to comment on current efforts there.
Now that you've got some background, let's get a debug build of MariaDB!
$ git clone https://github.com/MariaDB/server mariadb
$ cd mariadb
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug ..
$ make -j8
This takes a while. When I was hacking on Postgres (a C project), it took 1 minute on my beefy Linux server to build. It took 20-30 minutes to build MySQL/MariaDB from scratch. That's C++ for you!
Thankfully incremental builds of MySQL/MariaDB for a tweak after the initial build take roughly the same time as incremental builds of Postgres after a tweak.
Once the build is done, create a database.
$ ./build/scripts/mariadb-install-db --srcdir=$(pwd) --datadir=$(pwd)/db
And create a config for the database.
$ echo "[client]
socket=$(pwd)/mariadb.sock
[mariadb]
socket=$(pwd)/mariadb.sock
basedir=$(pwd)
datadir=$(pwd)/db
pid-file=$(pwd)/db.pid" > my.cnf
Start up the server.
$ ./build/sql/mariadbd --defaults-extra-file=$(pwd)/my.cnf --debug:d:o,$(pwd)/db.debug
./build/sql/mariadbd: Can't create file '/var/log/mariadb/mariadb.log' (errno: 13 "Permission denied")
2024-01-03 17:10:15 0 [Note] Starting MariaDB 11.4.0-MariaDB-debug source revision 3fad2b115569864d8c1b7ea90ce92aa895cfef08 as process 185550
2024-01-03 17:10:15 0 [Note] InnoDB: !!!!!!!! UNIV_DEBUG switched on !!!!!!!!!
2024-01-03 17:10:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2024-01-03 17:10:15 0 [Note] InnoDB: Number of transaction pools: 1
2024-01-03 17:10:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-01-03 17:10:15 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-01-03 17:10:15 0 [Note] InnoDB: Completed initialization of buffer pool
2024-01-03 17:10:15 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2024-01-03 17:10:15 0 [Note] InnoDB: End of log at LSN=57155
2024-01-03 17:10:15 0 [Note] InnoDB: Opened 3 undo tablespaces
2024-01-03 17:10:15 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2024-01-03 17:10:15 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-01-03 17:10:15 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-01-03 17:10:15 0 [Note] InnoDB: log sequence number 57155; transaction id 16
2024-01-03 17:10:15 0 [Note] InnoDB: Loading buffer pool(s) from ./db/ib_buffer_pool
2024-01-03 17:10:15 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-01-03 17:10:15 0 [Note] Plugin 'wsrep-provider' is disabled.
2024-01-03 17:10:15 0 [Note] InnoDB: Buffer pool(s) load completed at 240103 17:10:15
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '::'.
2024-01-03 17:10:15 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
2024-01-03 17:10:15 0 [Note] ./build/sql/mariadbd: ready for connections.
Version: '11.4.0-MariaDB-debug' socket: './mariadb.sock' port: 3306 Source distribution
With that --debug flag, debug logs will show up in $(pwd)/db.debug. It's unclear why debug logs are treated separately from the console logs shown here. I'd rather they all be in one place.
In another terminal, run a client and make a request!
$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 11.4.0-MariaDB-debug Source distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [test]> SELECT 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.001 sec)
Huzzah! Let's write a custom storage engine!
When writing an extension for some project, I usually expect to have the extension exist in its own repo. I was able to do this with the Postgres in-memory storage engine I wrote. And in general, Postgres extensions exist as their own repos.
I was able to create and build a UDF plugin outside the MariaDB source tree. But when it came to getting a storage engine to build and load successfully, I wasted almost an entire day (a large amount of time in a single hack week) getting nowhere.
Extensions for MySQL/MariaDB are most easily built via the CMake infrastructure within the repo. Surely there's some way to replicate that infrastructure from outside the repo but I wasn't able to figure it out within a day and didn't want to spend more time on it.
Apparently the normal thing to do in MySQL/MariaDB is to keep extensions within a fork of MySQL/MariaDB.
When I switched to this method I was able to very quickly get the storage engine building and loaded. So that's what we'll do.
Within the MariaDB source tree, create a new folder in the storage subdirectory.
$ mkdir storage/memem
Within storage/memem/CMakeLists.txt add the following.
# Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA
SET(MEMEM_SOURCES ha_memem.cc ha_memem.h)
MYSQL_ADD_PLUGIN(memem ${MEMEM_SOURCES} STORAGE_ENGINE)
This hooks into MySQL/MariaDB build infrastructure. So next time you run make within the build directory we created above, it will also build this project.
It would be nice to see a way to extend MySQL in C (for one, because it would then be easier to port to other languages). But all of the builtin storage methods use classes. So we'll do that too.
The class we must implement is an instance of handler. There is a single handler instance per thread, corresponding to a single running query. (Postgres gives each query its own process, MySQL gives each query its own thread.) However, handler instances are reused across different queries.
There are a number of virtual methods on handler we must implement in our subclass. For most of them we'll do nothing: simply returning immediately. These simple methods will be implemented in ha_memem.h. The methods with more complex logic will be implemented in ha_memem.cc.
Let's set up includes in ha_memem.h.
/* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */
#ifdef USE_PRAGMA_INTERFACE
#pragma interface /* gcc class implementation */
#endif
#include "thr_lock.h"
#include "handler.h"
#include "table.h"
#include "sql_const.h"
#include <vector>
#include <memory>
Next we'll define structs for our in-memory storage.
typedef std::vector<uchar> MememRow;
struct MememTable
{
std::vector<std::shared_ptr<MememRow>> rows;
std::shared_ptr<std::string> name;
};
struct MememDatabase
{
std::vector<std::shared_ptr<MememTable>> tables;
};
Within ha_memem.cc we'll implement a global (not thread-safe) static MememDatabase* that all handler instances will query when requested. We need the definitions in the header file though because we'll store the table currently being queried in the handler subclass.
This is so that every call to write_row to write a single row or call to rnd_next to read a single row does not need to look up the in-memory table object N times within the same query.
And finally we'll define the subclass of handler and implementations of trivial methods.
class ha_memem final : public handler
{
uint current_position= 0;
std::shared_ptr<MememTable> memem_table= 0;
public:
ha_memem(handlerton *hton, TABLE_SHARE *table_arg) : handler(hton, table_arg)
{
}
~ha_memem()= default;
const char *index_type(uint key_number) { return ""; }
ulonglong table_flags() const { return 0; }
ulong index_flags(uint inx, uint part, bool all_parts) const { return 0; }
/* The following defines can be increased if necessary */
#define MEMEM_MAX_KEY MAX_KEY /* Max allowed keys */
#define MEMEM_MAX_KEY_SEG 16 /* Max segments for key */
#define MEMEM_MAX_KEY_LENGTH 3500 /* Like in InnoDB */
uint max_supported_keys() const { return MEMEM_MAX_KEY; }
uint max_supported_key_length() const { return MEMEM_MAX_KEY_LENGTH; }
uint max_supported_key_part_length() const { return MEMEM_MAX_KEY_LENGTH; }
int open(const char *name, int mode, uint test_if_locked) { return 0; }
int close(void) { return 0; }
int truncate() { return 0; }
int rnd_init(bool scan);
int rnd_next(uchar *buf);
int rnd_pos(uchar *buf, uchar *pos) { return 0; }
int index_read_map(uchar *buf, const uchar *key, key_part_map keypart_map,
enum ha_rkey_function find_flag)
{
return HA_ERR_END_OF_FILE;
}
int index_read_idx_map(uchar *buf, uint idx, const uchar *key,
key_part_map keypart_map,
enum ha_rkey_function find_flag)
{
return HA_ERR_END_OF_FILE;
}
int index_read_last_map(uchar *buf, const uchar *key,
key_part_map keypart_map)
{
return HA_ERR_END_OF_FILE;
}
int index_next(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_prev(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_first(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_last(uchar *buf) { return HA_ERR_END_OF_FILE; }
void position(const uchar *record) { return; }
int info(uint flag) { return 0; }
int external_lock(THD *thd, int lock_type) { return 0; }
int create(const char *name, TABLE *table_arg, HA_CREATE_INFO *create_info);
THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
enum thr_lock_type lock_type)
{
return to;
}
int delete_table(const char *name) { return 0; }
private:
void reset_memem_table();
virtual int write_row(const uchar *buf);
int update_row(const uchar *old_data, const uchar *new_data)
{
return HA_ERR_WRONG_COMMAND;
};
int delete_row(const uchar *buf) { return HA_ERR_WRONG_COMMAND; }
};
A complete storage engine might seriously implement all of these methods. But we'll only seriously implement 7 of them.
To finish up the boilerplate, we'll switch over to ha_memem.cc and set up the includes.
/* Copyright (c) 2005, 2012, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */
#ifdef USE_PRAGMA_IMPLEMENTATION
#pragma implementation // gcc: Class implementation
#endif
#define MYSQL_SERVER 1
#include <my_global.h>
#include "sql_priv.h"
#include "unireg.h"
#include "sql_class.h"
#include "ha_memem.h"
Ok! Let's dig into the implementation.
First up, we need to declare a global MememDatabase* instance. We'll also implement a helper function for finding the index of a table by name within the database.
// WARNING! All accesses of `database` in this code are thread
// unsafe. Since this was written during a hack week, I didn't have
// time to figure out MySQL/MariaDB's runtime well enough to do the
// thread-safe version of this.
static MememDatabase *database;
static int memem_table_index(const char *name)
{
int i;
assert(database->tables.size() < INT_MAX);
for (i= 0; i < (int) database->tables.size(); i++)
{
if (strcmp(database->tables[i]->name->c_str(), name) == 0)
{
return i;
}
}
return -1;
}
As I wrote this post I noticed that this code also assumes there's only a single database. That isn't how MySQL works. Every time you call USE ... in MySQL you are switching between databases. You can query tables across databases. A real in-memory backend would need to be aware of the different databases, not just different tables. But to keep the code succinct we won't implement that in this post.
Next we'll implement plugin initialization and cleanup.
Before we register the plugin with MariaDB, we need to set up initialization and cleanup methods for it.
The initialization method will take care of initializing the global MememDatabase* database object. It will set up a handler for creating new instances of our handler subclass. And it will set up a handler for deleting tables.
static handler *memem_create_handler(handlerton *hton, TABLE_SHARE *table,
MEM_ROOT *mem_root)
{
return new (mem_root) ha_memem(hton, table);
}
static int memem_init(void *p)
{
handlerton *memem_hton;
memem_hton= (handlerton *) p;
memem_hton->db_type= DB_TYPE_AUTOASSIGN;
memem_hton->create= memem_create_handler;
memem_hton->drop_table= [](handlerton *, const char *name) {
int index= memem_table_index(name);
if (index == -1)
{
return HA_ERR_NO_SUCH_TABLE;
}
database->tables.erase(database->tables.begin() + index);
DBUG_PRINT("info", ("[MEMEM] Deleted table '%s'.", name));
return 0;
};
memem_hton->flags= HTON_CAN_RECREATE;
// Initialize global in-memory database.
database= new MememDatabase;
return 0;
}
The DBUG_PRINT macro is a debug helper MySQL/MariaDB gives us. As noted above, the output is directed to a file specified by the --debug flag. Unfortunately I couldn't figure out how to flush the stream this macro writes to. It seemed like occasionally, when there was a segfault, logs I expected to be there weren't. And the file would often contain what looked like partially written logs. Anyway, as long as there wasn't a segfault, the debug file would eventually contain the DBUG_PRINT logs.
The only thing the plugin cleanup function must do is delete the global database.
static int memem_fini(void *p)
{
delete database;
return 0;
}
Now we can register the plugin!
The maria_declare_plugin and maria_declare_plugin_end macros register the plugin's metadata (name, version, etc.) and callbacks.
struct st_mysql_storage_engine memem_storage_engine= {
MYSQL_HANDLERTON_INTERFACE_VERSION};
maria_declare_plugin(memem){
MYSQL_STORAGE_ENGINE_PLUGIN,
&memem_storage_engine,
"MEMEM",
"MySQL AB",
"In-memory database.",
PLUGIN_LICENSE_GPL,
memem_init, /* Plugin Init */
memem_fini, /* Plugin Deinit */
0x0100 /* 1.0 */,
NULL, /* status variables */
NULL, /* system variables */
"1.0", /* string version */
MariaDB_PLUGIN_MATURITY_STABLE /* maturity */
} maria_declare_plugin_end;
That's it! Now we need to implement methods for writing rows, reading rows, and creating a new table.
To create a table, we make sure one by this name doesn't already exist, make sure it only has INTEGER fields, allocate memory for the table, and append it to the global database.
int ha_memem::create(const char *name, TABLE *table_arg,
HA_CREATE_INFO *create_info)
{
assert(memem_table_index(name) == -1);
// We only support INTEGER fields for now.
uint i = 0;
while (table_arg->field[i]) {
if (table_arg->field[i]->type() != MYSQL_TYPE_LONG)
{
DBUG_PRINT("info", ("Unsupported field type."));
return 1;
}
i++;
}
auto t= std::make_shared<MememTable>();
t->name= std::make_shared<std::string>(name);
database->tables.push_back(t);
DBUG_PRINT("info", ("[MEMEM] Created table '%s'.", name));
return 0;
}
Not very complicated. Let's handle INSERT-ing rows next.
There is no method called when an INSERT starts. There is a table field on the handler parent class, though, that is updated when a SELECT or INSERT is running. So we can fetch the current table from that field.
Since we have a slot for a std::shared_ptr<MememTable> memem_table on the ha_memem class, we can check if it is NULL when we insert a row. If it is, we look up the current table and set this->memem_table to its MememTable.
But there's a bit more to it than just the table name. The const char* name passed to the create() method above seems to be a sort of fully qualified name for the table. By observation, when creating a table y in a database test, the const char* name value is ./test/y. The . prefix probably means that the database is local, but I'm not sure.
So we'll write a helper method that will reconstruct the fully qualified table name before looking up that fully qualified table name in the global database.
void ha_memem::reset_memem_table()
{
// Reset table cursor.
current_position= 0;
std::string full_name= "./" + std::string(table->s->db.str) + "/" +
std::string(table->s->table_name.str);
DBUG_PRINT("info", ("[MEMEM] Resetting to '%s'.", full_name.c_str()));
assert(database->tables.size() > 0);
int index= memem_table_index(full_name.c_str());
assert(index >= 0);
assert(index < (int) database->tables.size());
memem_table= database->tables[index];
}
Then we can use this within write_row to figure out the current MememTable being queried.
But first, let's digress into how MySQL stores rows.
When you write a Postgres custom storage API, you are expected to basically read from or write to an array of Datum. Totally sensible.
In MySQL, you read from and write to an array of bytes. That's pretty weird to me. Of course you can build your own higher level serialization/deserialization on top of it. But it's just strange to me everyone has to know this basically opaque API.
Certainly it's documented.
The handler class is the interface for dynamically loadable
storage engines. Do not add ifdefs and take care when adding or
changing virtual functions to avoid vtable confusion
Functions in this class accept and return table columns data. Two data
representation formats are used:
1. TableRecordFormat - Used to pass [partial] table records to/from
storage engine
2. KeyTupleFormat - used to pass index search tuples (aka "keys") to
storage engine. See opt_range.cc for description of this format.
TableRecordFormat
=================
[Warning: this description is work in progress and may be incomplete]
The table record is stored in a fixed-size buffer:
record: null_bytes, column1_data, column2_data, ...
The offsets of the parts of the buffer are also fixed: every column has
an offset to its column{i}_data, and if it is nullable it also has its own
bit in null_bytes.
In our implementation, we'll skip the support for NULL values. We'll only support INTEGER fields. But we still need to be aware that the first byte will be taken up. We'll also assume there won't be more than one byte of a NULL bitmap.
It is this opaque byte array that we'll read from in write_row(const uchar* buf) and write to in rnd_next(uchar* buf).
To keep things simple we're going to store the row in MememTable the same way MySQL passes it around.
int ha_memem::write_row(const uchar *buf)
{
if (memem_table == NULL)
{
reset_memem_table();
}
// Assume there are no NULLs.
buf++;
uint field_count = 0;
while (table->field[field_count]) field_count++;
// Store the row in the same format MariaDB gives us.
auto row= std::make_shared<std::vector<uchar>>(
buf, buf + sizeof(int) * field_count);
memem_table->rows.push_back(row);
return 0;
}
Which makes reading the row quite simple too!
The only slight difference between reading and writing a row is that MySQL/MariaDB will tell us when the SELECT scan for a table starts. We'll use that opportunity to reset the current_position cursor and reset the memem_table field. Since, again, handler instances are only used by one query at a time, but they are reused for queries that run later.
int ha_memem::rnd_init(bool scan)
{
reset_memem_table();
return 0;
}
int ha_memem::rnd_next(uchar *buf)
{
if (current_position == memem_table->rows.size())
{
// Reset the in-memory table to make logic errors more obvious.
memem_table= NULL;
return HA_ERR_END_OF_FILE;
}
assert(current_position < memem_table->rows.size());
uchar *ptr= buf;
*ptr= 0;
ptr++;
// Rows internally are stored in the same format that MariaDB
// wants. So we can just copy them over.
std::shared_ptr<std::vector<uchar>> row= memem_table->rows[current_position];
std::copy(row->begin(), row->end(), ptr);
current_position++;
return 0;
}
And we're done!
Go back into the build directory we created within the source tree root and rerun make -j8.
Kill the server (you'll need to do something like killall mariadbd since the server doesn't respond to Ctrl-c). And restart it.
For some reason this plugin doesn't need to be loaded. We can run SHOW PLUGINS; in the MariaDB CLI and we'll see it.
$ ./build/client/mariadb --defaults-extra-file=/home/phil/vendor/mariadb/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 5
Server version: 11.4.0-MariaDB-debug Source distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [test]> SHOW PLUGINS;
+-------------------------------+----------+--------------------+-----------------+---------+
| Name | Status | Type | Library | License |
+-------------------------------+----------+--------------------+-----------------+---------+
| binlog | ACTIVE | STORAGE ENGINE | NULL | GPL |
...
| MEMEM | ACTIVE | STORAGE ENGINE | NULL | GPL |
...
| BLACKHOLE | ACTIVE | STORAGE ENGINE | ha_blackhole.so | GPL |
+-------------------------------+----------+--------------------+-----------------+---------+
73 rows in set (0.012 sec)
There we go! To create a table with it we need to set ENGINE = MEMEM. For example, CREATE TABLE x (i INT) ENGINE = MEMEM.
Let's create a script to try out the memem engine, in storage/memem/test.sql.
drop table if exists y;
drop table if exists z;
create table y(i int, j int) engine = MEMEM;
insert into y values (2, 1029);
insert into y values (92, 8);
select * from y where i + 8 = 10;
create table z(a int) engine = MEMEM;
insert into z values (322);
insert into z values (8);
select * from z where a > 20;
And run it.
$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test --table --verbose < storage/memem/test.sql
--------------
drop table if exists y
--------------
--------------
drop table if exists z
--------------
--------------
create table y(i int, j int) engine = MEMEM
--------------
--------------
insert into y values (2, 1029)
--------------
--------------
insert into y values (92, 8)
--------------
--------------
select * from y where i + 8 = 10
--------------
+------+------+
| i | j |
+------+------+
| 2 | 1029 |
+------+------+
--------------
create table z(a int) engine = MEMEM
--------------
--------------
insert into z values (322)
--------------
--------------
insert into z values (8)
--------------
--------------
select * from z where a > 20
--------------
+------+
| a |
+------+
| 322 |
+------+
What you see there is the power of storage engines! It supports the full SQL language even though we implemented storage somewhere completely different from the default.
Certainly, I'm getting bored doing the same project over and over again on different databases. However, it's minimal projects like this that make it super easy to then go and port the storage engine to something else.
The goal here is to be minimal but meaningful. And I've accomplished that for myself at least!
As I've written before, this sort of exploration wouldn't be possible within the time frame I gave myself if it weren't for ChatGPT. Specifically, the paid-tier GPT-4.
Neither the MySQL nor the MariaDB docs were so helpful that I could
immediately figure out things like how to get the current table name
within a scan (the table
member of the handler
class).
With ChatGPT you can ask questions like: "In a MySQL C++ plugin, how
do I get the name of the table from a handler
class as a C
string?". Sometimes it's right and sometime's it's not. But you can
try out the code and if it builds it is at least somewhat correct!
Wrote a post walking you through building a super minimal in-memory storage engine for MySQL/MariaDB in 218 lines of C++.
— Phil Eaton (@eatonphil) January 9, 2024
And took time again to reflect on the limitations of custom storage engines and how MySQL compares to Postgres internally here.https://t.co/nImUC36DPs pic.twitter.com/1Oj2Lcua8O
2023-12-27 08:00:00
Over the years, I have repeatedly felt like I missed the timing for a meetup or an IRC group or social media in general. I'd go to a meetup every so often but I'd never make a meaningful connection with people, whereas everyone else knew each other. I'd join an IRC group and have difficulty catching up with what seemed to be the flow of conversation.
I hadn't thought much about this until the pandemic when I started a Discord group for software internals and a virtual tech talk series called Hacker Nights. Since 2021 the Discord has grown to around 1,500 members, with ~20 fairly active members. And the Meetup peaked at about 300 members, with about 10-20 showing up to each event.
After the pandemic receded I started an NYC-based book club over 2 months with about 5-8 active attendees. I ran a virtual hack week on Discord where I got ~100 devs into a temporary Discord server and we talked about Postgres internals and shared resources. Ultimately around 5 of us wrote blog posts and built new projects to explore Postgres.
I started a virtual, async email book club (that is still ongoing) with 300 devs from November 2023 to Feb 2024. There have been around 20 active members of the club. And each week the discussion is kicked off by one of the members, not myself.
And I felt like there wasn't enough community opportunity for folks in systems programming in NYC so I started a Manhattan-based Systems Coffee Club. Around 15 people showed up to the first meeting and seemed even more excited about it than I was. (And I was excited!) We'll see where it goes from here.
Organizing people to do this stuff doesn't come easy to me. I enjoy doing it to a degree, but every night before an event I have trouble sleeping. Worried about embarrassing myself. When the event happens though, and people are happy to be there to chat with everyone else, as they invariably have been, it makes it worthwhile.
Something I realized along the way is that people (maybe devs especially, I don't know) are looking for community. And when I have noticed there seems to be a missing flashpoint (a topic, a career focus, a book, etc.) for community, it's been pretty easy to get people together around it.
Groups and meetups naturally live and die. Organizers get burnt out. I don't see this as a problem. It's just the way it is.
At some point I'll get burnt out too. Or I'll get pickier. For example, I've been avoiding starting a systems programming meetup in NYC because I know it will be a big effort. So I've done lower effort groups like book clubs and coffee clubs.
Don't worry about signing yourself up for indefinite work. Just do whatever you'd like to and don't feel bad if you have to stop. Someone else will eventually start the next great group, even if it comes in a different medium or flavor.
There are great communities out there that have inspired me.
And this year I've been hearing about more.
There are a few more systems programming groups I've heard rumors of being started on the US West Coast and in Stockholm.
If you feel like you can't find the right group or that you don't fit in with existing groups or that you're missing a moment, there are surely other folks in the same boat. Waiting for a new group to join. You may be the catalyst.
There's enormous potential for getting people together and doing something interesting and there isn't necessarily anyone telling you you should. Things you try may work and they may not. The more you try the more you'll learn what works and what doesn't. I've had a few years of making mistakes while organizing to hone that sense.
The only boring thing to do is to limit yourself to the sort of thing others have done before! Run a browser meetup instead of a React meetup. Interview hardware developers to teach software developers something. Get software developers with 20 years of experience in niche fields to teach the rest of us something. Read books beyond SICP or Clean Code. Try difficult programming projects.
Whatever you want though, don't let me deter you. If you think something should exist, give it a shot!
I used to struggle to get much out of meetups, couldn't pick up the flow of IRC. Some point I stopped trying solely to fit in. Instead to do what I thought was interesting. And to my surprise, folks were interested in coming along too!
— Phil Eaton (@eatonphil) December 27, 2023
Make your own wayhttps://t.co/tVEa2ndiZm pic.twitter.com/piWSsv14lj
2023-11-19 08:00:00
I learned this week that you can intercept and redirect Postgres query execution. You can hook into the execution layer so you're given a query plan and you get to decide what to do with it. What rows to return, if any, and where they come from.
That's very interesting. So I started writing code to explore execution hooks. However, I got stuck interpreting the query plan. Either there's no query plan walking infrastructure or I just didn't find it.
So this post is a digression into walking a Postgres query plan. By
the end we'll be able to run psql -c 'SELECT a FROM x WHERE a > 1'
and reconstruct the entire SQL string from a Postgres
QueryDesc
object, the query plan object Postgres builds.
With that query plan walking infrastructure in place, we'll be in a good state to not just print out the query plan while walking it but instead to translate the query plan or evaluate it in our own way (e.g. over column-wise data, or vectorized execution over row-wise data).
Code for this project is available on Github.
If you're familiar with parsers and compilers, a query plan is like an intermediate representation (IR) of a program. It is not as raw as an abstract syntax tree (AST); it has already been optimized.
If that doesn't mean anything to you, think of a query plan as a structured and optimized version of the SQL query you submit to your database. It isn't text anymore. It is probably a tree.
Check out another Justin Jaffray article on the subject for more detail.
Before we get to walking the query plan, let's set up the infrastructure to intercept query execution where we can eventually add in our print debugging of the query plan reconstructed as a SQL string.
Once you've got Postgres build dependencies, build and install a debug version of Postgres:
$ git clone https://github.com/postgres/postgres && cd postgres
$ # Make sure you're on the same commit I'm on, just to be safe.
$ git checkout b218fbb7a35fcf31539bfad12732038fe082a2eb
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
$ make -j8
$ # Installs to /usr/local/pgsql/bin.
$ sudo make install
I'm not going to cover Postgres extension infrastructure in detail. I wrote a bit about it in my last post. You need only read the first half, if at all; not the actual Table Access Method implementation.
It will be even simpler in this post because Postgres hooks are
extensions but not extensions you install with CREATE EXTENSION
. If
you want to read about the different kinds of Postgres extensions,
check out this
article by Steven
Miller.
The minimum we need, aside from the hook code itself, is a Makefile that uses PGXS:
MODULES = pgexec
PG_CONFIG = /usr/local/pgsql/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
The MODULES
value there corresponds to the C file we'll create
shortly, pgexec.c
.
This pg_config binary path is important because you might have different versions of Postgres installed, for example by your package manager. The extension must be built against the same version of Postgres that will load it.
Now we're ready for some hook code.
You can find the basic structure of a hook (and which hooks are available) in Tamika Nomara's unofficial Postgres hooks docs.
I couldn't find an official central place in the Postgres docs describing all available hooks, though some hooks are described in various places throughout the docs.
Based on that page, we can write a bare minimum hook that will
intercept queries, log when we've done so, and pass control back
to the standard execution path for the actual query. In pgexec.c
:
#include "postgres.h"
#include "fmgr.h"
#include "executor/executor.h"
PG_MODULE_MAGIC;
static ExecutorRun_hook_type prev_executor_run_hook = NULL;
static void print_plan(QueryDesc* queryDesc) {
elog(LOG, "[pgexec] HOOKED SUCCESSFULLY!");
}
static void pgexec_run_hook(
QueryDesc* queryDesc,
ScanDirection direction,
uint64 count,
bool execute_once
) {
print_plan(queryDesc);
return prev_executor_run_hook(queryDesc, direction, count, execute_once);
}
void _PG_init(void) {
prev_executor_run_hook = ExecutorRun_hook;
if (prev_executor_run_hook == NULL) {
prev_executor_run_hook = standard_ExecutorRun;
}
ExecutorRun_hook = pgexec_run_hook;
}
void _PG_fini(void) {
ExecutorRun_hook = prev_executor_run_hook;
}
You can discover the standard_ExecutorRun
function from a quick
git grep ExecutorRun_hook
in the Postgres source which leads to
src/backend/executor/execMain.c#L306:
void
ExecutorRun(QueryDesc *queryDesc,
ScanDirection direction, uint64 count,
bool execute_once)
{
if (ExecutorRun_hook)
(*ExecutorRun_hook) (queryDesc, direction, count, execute_once);
else
standard_ExecutorRun(queryDesc, direction, count, execute_once);
}
So our hook will just log and pass back execution to the existing execution hook. Let's build and install the extension.
$ make
$ sudo make install
Now we'll create a new database and tell it to load the extension.
$ /usr/local/pgsql/bin/initdb test-db
$ echo "shared_preload_libraries = 'pgexec'" > test-db/postgresql.conf
Remember, hooks are not CREATE EXTENSION
extensions. As
far as I can tell they can't be dynamically loaded (without some
additional dynamic loading infrastructure one could potentially
write). So every time you make a change you need to rebuild the
extension, reinstall it, and restart the Postgres server.
And start the server in the foreground:
$ /usr/local/pgsql/bin/postgres \
--config-file=$(pwd)/test-db/postgresql.conf \
-D $(pwd)/test-db \
-k $(pwd)/test-db
2023-11-18 19:35:16.680 GMT [3215547] LOG: starting PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), 64-bit
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on IPv6 address "::1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-18 19:35:16.682 GMT [3215550] LOG: database system was shut down at 2023-11-18 19:20:16 GMT
2023-11-18 19:35:16.684 GMT [3215547] LOG: database system is ready to accept connections
Keep an eye on this foreground process since this is where elog(LOG,
...)
calls will show up.
Now in a new window, create a test.sql
script that we can use to
exercise the hook:
DROP TABLE IF EXISTS x;
CREATE TABLE x (a INT);
INSERT INTO x VALUES (309);
SELECT a FROM x WHERE a > 1;
Run psql
so we can trigger the hook:
$ /usr/local/pgsql/bin/psql -h localhost postgres -f test.sql
DROP TABLE
CREATE TABLE
INSERT 0 1
a
-----
309
(1 row)
And in the postgres
foreground process you should see a log:
2023-11-19 17:42:03.045 GMT [3242321] LOG: [pgexec] HOOKED SUCCESSFULLY!
2023-11-19 17:42:03.045 GMT [3242321] STATEMENT: INSERT INTO x VALUES (309);
2023-11-19 17:42:03.045 GMT [3242321] LOG: [pgexec] HOOKED SUCCESSFULLY!
2023-11-19 17:42:03.045 GMT [3242321] STATEMENT: SELECT a FROM x WHERE a > 1;
That's our hook! Interestingly, only the INSERT and SELECT statements show up, not the DROP and CREATE.
Now let's see if we can reconstruct the query text from that first
argument, the QueryDesc*
that pgexec_run_hook
receives. And let's
simplify things for ourselves and only worry about reconstructing a
SELECT
query.
Nodes and Datums
But first, let's talk about two fundamental ways data in Postgres (code) is organized.
Postgres code is extremely dynamic and, maybe relatedly, fairly object-oriented. Almost every entity in Postgres is a Node, while values in Postgres that are exposed to users of Postgres are Datums.
Each node has a type,
NodeTag
,
that we can switch on to decide what to do. In contrast, Datum
has
no type. The type of the Datum
must be known by context before using
one of the transform functions like
DatumGetBool
to retrieve a C value from a Datum
.
A table is a Node
. A query plan is a Node
. A sequential scan is a
Node
. A join is a Node
. A literal in a query is a Node
. The
value for the literal in a query is a Datum
.
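To make that concrete, here's a minimal sketch (a hypothetical helper, not code we'll need below; it assumes the usual includes, e.g. catalog/pg_type.h for INT4OID) of the two patterns: switch on a Node's tag with nodeTag(), and only convert a Datum once its type is known from context.
// Hypothetical helper, just to illustrate Node vs Datum handling.
static void describe_node_and_datum(Node *node, Datum value, Oid valuetype) {
  // Every Node carries a tag we can switch on.
  switch (nodeTag(node)) {
  case T_SeqScan:
    elog(LOG, "this node is a sequential scan");
    break;
  default:
    elog(LOG, "some other node: %d", (int) nodeTag(node));
  }
  // A Datum has no type of its own; we must already know it from context.
  if (valuetype == INT4OID)
    elog(LOG, "datum as int32: %d", DatumGetInt32(value));
}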
Here is how The Internals of PostgreSQL book visualizes a query plan for example:
Every box in that image is a Node
.
And all Node
s in code I've seen share a common definition prefix
like this:
typedef struct SomeThing {
pg_node_attr(abstract) // If the node is indeed abstract in the OOP sense.
NodeTag type;
}
Many Node
s you'll see are abstract, like Plan
. But by printing out
NodeTag
and checking the value printed in
src/include/nodes/nodetags.h
, you can find the concrete type of the
Node
.
src/include/nodes/nodetags.h
is generated during a preprocessing
step. (Don't look if regex in
Perl
worries you).
We'll get back to Nodes later.
QueryDesc
Let's take a look at the QueryDesc struct:
typedef struct QueryDesc
{
/* These fields are provided by CreateQueryDesc */
CmdType operation; /* CMD_SELECT, CMD_UPDATE, etc. */
PlannedStmt *plannedstmt; /* planner's output (could be utility, too) */
const char *sourceText; /* source text of the query */
Snapshot snapshot; /* snapshot to use for query */
Snapshot crosscheck_snapshot; /* crosscheck for RI update/delete */
DestReceiver *dest; /* the destination for tuple output */
ParamListInfo params; /* param values being passed in */
QueryEnvironment *queryEnv; /* query environment passed in */
int instrument_options; /* OR of InstrumentOption flags */
/* These fields are set by ExecutorStart */
TupleDesc tupDesc; /* descriptor for result tuples */
EState *estate; /* executor's query-wide state */
PlanState *planstate; /* tree of per-plan-node state */
/* This field is set by ExecutorRun */
bool already_executed; /* true if previously executed */
/* This is always set NULL by the core system, but plugins can change it */
struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
The
PlannedStmt
field looks interesting. Let's take a look:
typedef struct PlannedStmt
{
pg_node_attr(no_equal, no_query_jumble)
NodeTag type;
CmdType commandType; /* select|insert|update|delete|merge|utility */
uint64 queryId; /* query identifier (copied from Query) */
bool hasReturning; /* is it insert|update|delete RETURNING? */
bool hasModifyingCTE; /* has insert|update|delete in WITH? */
bool canSetTag; /* do I set the command result tag? */
bool transientPlan; /* redo plan when TransactionXmin changes? */
bool dependsOnRole; /* is plan specific to current role? */
bool parallelModeNeeded; /* parallel mode required to execute? */
int jitFlags; /* which forms of JIT should be performed */
struct Plan *planTree; /* tree of Plan nodes */
List *rtable; /* list of RangeTblEntry nodes */
List *permInfos; /* list of RTEPermissionInfo nodes for rtable
* entries needing one */
/* rtable indexes of target relations for INSERT/UPDATE/DELETE/MERGE */
List *resultRelations; /* integer list of RT indexes, or NIL */
List *appendRelations; /* list of AppendRelInfo nodes */
List *subplans; /* Plan trees for SubPlan expressions; note
* that some could be NULL */
Bitmapset *rewindPlanIDs; /* indices of subplans that require REWIND */
List *rowMarks; /* a list of PlanRowMark's */
List *relationOids; /* OIDs of relations the plan depends on */
List *invalItems; /* other dependencies, as PlanInvalItems */
List *paramExecTypes; /* type OIDs for PARAM_EXEC Params */
Node *utilityStmt; /* non-null if this is utility stmt */
/* statement location in source string (copied from Query) */
int stmt_location; /* start location, or -1 if unknown */
int stmt_len; /* length in bytes; 0 means "rest of string" */
} PlannedStmt;
The struct Plan* planTree
field in there looks like what we'd want. But
Plan
is abstract:
typedef struct Plan
{
pg_node_attr(abstract, no_equal, no_query_jumble)
NodeTag type;
So let's try printing out the planTree->type
field and find the
Node
it is concretely. In pgexec.c
change the definition of
print_plan
:
static void print_plan(QueryDesc* queryDesc) {
elog(LOG, "[pgexec] HOOKED SUCCESSFULLY! %d", queryDesc->plannedstmt->planTree->type);
}
Rebuild and reinstall the extension, and restart Postgres:
$ make
$ sudo make install
$ /usr/local/pgsql/bin/postgres \
--config-file=$(pwd)/test-db/postgresql.conf \
-D $(pwd)/test-db \
-k $(pwd)/test-db
2023-11-18 19:35:16.680 GMT [3215547] LOG: starting PostgreSQL 17devel on x86_64-pc-linux-gnu, compiled by gcc (GCC) 13.2.1 20230728 (Red Hat 13.2.1-1), 64-bit
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on IPv6 address "::1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-11-18 19:35:16.681 GMT [3215547] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-11-18 19:35:16.682 GMT [3215550] LOG: database system was shut down at 2023-11-18 19:20:16 GMT
2023-11-18 19:35:16.684 GMT [3215547] LOG: database system is ready to accept connections
And in another window run psql
:
$ /usr/local/pgsql/bin/psql -h localhost postgres -f test.sql
And check the logs from the postgres
process we just started and you
should notice:
2023-11-19 17:46:18.834 GMT [3242495] LOG: [pgexec] HOOKED SUCCESSFULLY! 322
2023-11-19 17:46:18.834 GMT [3242495] STATEMENT: SELECT a FROM x WHERE a > 1;
So 322
is the NodeTag
for the Plan
. If we look that up in
Postgres's src/include/nodes/nodetags.h
(remember, this is generated
after ./configure && make
so I can't link you to it):
$ grep ' = 322' src/include/nodes/nodetags.h
T_SeqScan = 322,
Hey, that makes sense! A SELECT
without any indexes definitely
sounds like a sequential scan!
Let's take a look at the
SeqScan
struct:
typedef struct SeqScan
{
Scan scan;
} SeqScan;
Ok, that's not very interesting. Let's look at
Scan
then:
typedef struct Scan
{
pg_node_attr(abstract)
Plan plan;
Index scanrelid; /* relid is index into the range table */
} Scan;
That's interesting! scanrelid
represents the table we're scanning. I
don't know what "range table" means exactly. But there was a field on
the PlannedStmt
called rtable
that seems relevant.
rtable
was described as a
List
of
RangeTblEntry
nodes. And browsing around the file where List
is defined we can see
some nice methods for working with List
s, like list_length()
.
Let's print out the scanrelid
and let's check out the length of the
rtable
and see if it's filled out. Let's also restrict our
print_plan
code to only look at SeqScan
nodes. In pgexec.c
:
static void print_plan(QueryDesc* queryDesc) {
SeqScan* scan = NULL;
Plan* plan = queryDesc->plannedstmt->planTree;
if (plan->type != T_SeqScan) {
elog(LOG, "[pgexec] Unsupported plan type.");
return;
}
scan = (SeqScan*)plan;
elog(LOG, "[pgexec] relid: %d, rtable length: %d", scan->scan.scanrelid, list_length(queryDesc->plannedstmt->rtable));
}
Rebuild and reinstall the extension, and restart Postgres. (You can
find the instructions for this above if you've forgotten.) Re-run the
test.sql
script. And check the Postgres server logs. You should see:
2023-11-19 18:00:34.184 GMT [3244438] LOG: [pgexec] relid: 1, rtable length: 1
2023-11-19 18:00:34.184 GMT [3244438] STATEMENT: SELECT a FROM x WHERE a > 1;
Awesome! So rtable does have data in it. There's only one table in this query, so a length of 1 makes sense. The scanrelid being 1 as well seems odd at first, but it suggests the range table index is 1-based. Let's fetch the nth value from the rtable list using scanrelid-1 as the index.
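Concretely, once we have the SeqScan in hand, that lookup should look something like this (a sketch of the call we'll actually make further down):
RangeTblEntry *rte = list_nth(queryDesc->plannedstmt->rtable, scan->scan.scanrelid - 1);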
For the
RangeTblEntry
itself, let's take a look:
typedef enum RTEKind
{
RTE_RELATION, /* ordinary relation reference */
RTE_SUBQUERY, /* subquery in FROM */
RTE_JOIN, /* join */
RTE_FUNCTION, /* function in FROM */
RTE_TABLEFUNC, /* TableFunc(.., column list) */
RTE_VALUES, /* VALUES (<exprlist>), (<exprlist>), ... */
RTE_CTE, /* common table expr (WITH list element) */
RTE_NAMEDTUPLESTORE, /* tuplestore, e.g. for AFTER triggers */
RTE_RESULT, /* RTE represents an empty FROM clause; such
* RTEs are added by the planner, they're not
* present during parsing or rewriting */
} RTEKind;
typedef struct RangeTblEntry
{
pg_node_attr(custom_read_write, custom_query_jumble)
NodeTag type;
RTEKind rtekind; /* see above */
/*
* XXX the fields applicable to only some rte kinds should be merged into
* a union. I didn't do this yet because the diffs would impact a lot of
* code that is being actively worked on. FIXME someday.
*/
/*
* Fields valid for a plain relation RTE (else zero):
*
* rellockmode is really LOCKMODE, but it's declared int to avoid having
* to include lock-related headers here. It must be RowExclusiveLock if
* the RTE is an INSERT/UPDATE/DELETE/MERGE target, else RowShareLock if
* the RTE is a SELECT FOR UPDATE/FOR SHARE target, else AccessShareLock.
*
* Note: in some cases, rule expansion may result in RTEs that are marked
* with RowExclusiveLock even though they are not the target of the
* current query; this happens if a DO ALSO rule simply scans the original
* target table. We leave such RTEs with their original lockmode so as to
* avoid getting an additional, lesser lock.
*
* perminfoindex is 1-based index of the RTEPermissionInfo belonging to
* this RTE in the containing struct's list of same; 0 if permissions need
* not be checked for this RTE.
*
* As a special case, relid, relkind, rellockmode, and perminfoindex can
* also be set (nonzero) in an RTE_SUBQUERY RTE. This occurs when we
* convert an RTE_RELATION RTE naming a view into an RTE_SUBQUERY
* containing the view's query. We still need to perform run-time locking
* and permission checks on the view, even though it's not directly used
* in the query anymore, and the most expedient way to do that is to
* retain these fields from the old state of the RTE.
*
* As a special case, RTE_NAMEDTUPLESTORE can also set relid to indicate
* that the tuple format of the tuplestore is the same as the referenced
* relation. This allows plans referencing AFTER trigger transition
* tables to be invalidated if the underlying table is altered.
*/
Oid relid; /* OID of the relation */
char relkind; /* relation kind (see pg_class.relkind) */
int rellockmode; /* lock level that query requires on the rel */
struct TableSampleClause *tablesample; /* sampling info, or NULL */
Index perminfoindex;
In SELECT a FROM x
, x
should be a plain relation RTE (to use the
terminology there). So we can add a guard that validates that. But we
don't get a Relation
. (You might remember from my previous
post
that Relation
is where we can finally see the table name.)
We get an Oid
for the Relation
. So we need to find a way to look up
a Relation
from an Oid
. And by grepping around in Postgres (or via
judicious use of ChatGPT, I confess), we can notice
RelationIdGetRelation
that takes an Oid
and returns a Relation
. Notice also that the
comment says we should close the relation when we're done with
RelationClose
.
So putting it all together (and again, reusing some code from that previous post), we can print out the table name.
static void print_plan(QueryDesc* queryDesc) {
SeqScan* scan = NULL;
RangeTblEntry* rte = NULL;
Relation relation = {};
char* tablename = NULL;
Plan* plan = queryDesc->plannedstmt->planTree;
if (plan->type != T_SeqScan) {
elog(LOG, "[pgexec] Unsupported plan type.");
return;
}
scan = (SeqScan*)plan;
rte = list_nth(queryDesc->plannedstmt->rtable, scan->scan.scanrelid-1);
if (rte->rtekind != RTE_RELATION) {
elog(LOG, "[pgexec] Unsupported FROM type: %d.", rte->rtekind);
return;
}
relation = RelationIdGetRelation(rte->relid);
tablename = NameStr(relation->rd_rel->relname);
elog(LOG, "[pgexec] SELECT [todo] FROM %s", tablename);
RelationClose(relation);
}
You'll also need to add a new #include
for
utils/rel.h
.
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 18:36:03.986 GMT [3246777] LOG: [pgexec] SELECT [todo] FROM x
2023-11-19 18:36:03.986 GMT [3246777] STATEMENT: SELECT a FROM x WHERE a > 1;
Fantastic! Before we get into walking the SELECT
columns and the
(optional) WHERE
clause, let's do some quick refactoring.
Let's add a little string builder library so we can build up a single string and emit it with a single elog() call.
I wrote this ahead of time and won't explain it here since the details aren't relevant.
Just copy this and paste near the top of pgexec.c
:
typedef struct {
char* mem;
size_t len;
size_t offset;
} PGExec_Buffer;
static void buffer_init(PGExec_Buffer* buf) {
buf->offset = 0;
buf->len = 8;
buf->mem = (char*)malloc(sizeof(char) * buf->len);
}
static void buffer_resize_to_fit_additional(PGExec_Buffer* buf, size_t additional) {
char* new = {};
size_t newsize = 0;
Assert(additional >= 0);
if (buf->offset + additional < buf->len) {
return;
}
newsize = (buf->offset + additional) * 2;
new = (char*)malloc(sizeof(char) * newsize);
Assert(new != NULL);
memcpy(new, buf->mem, buf->len * sizeof(char));
free(buf->mem);
buf->len = newsize;
buf->mem = new;
}
static void buffer_append(PGExec_Buffer*, char*, size_t);
static void buffer_appendz(PGExec_Buffer* buf, char* c) {
buffer_append(buf, c, strlen(c));
}
static void buffer_append(PGExec_Buffer* buf, char* c, size_t chars) {
buffer_resize_to_fit_additional(buf, chars);
memcpy(buf->mem + buf->offset, c, chars);
buf->offset += chars;
}
static void buffer_appendf(
PGExec_Buffer *,
const char* restrict,
...
) __attribute__ ((format (gnu_printf, 2, 3)));
static void buffer_appendf(PGExec_Buffer *buf, const char* restrict fmt, ...) {
// First figure out how long the result will be.
size_t chars = 0;
va_list arglist;
va_start(arglist, fmt);
chars = vsnprintf(0, 0, fmt, arglist);
Assert(chars >= 0); // TODO: error handling.
// Resize to fit result.
buffer_resize_to_fit_additional(buf, chars);
// Actually do the printf into buf.
va_end(arglist);
va_start(arglist, fmt);
chars = vsprintf(buf->mem + buf->offset, fmt, arglist);
Assert(chars >= 0); // TODO: error handling.
buf->offset += chars;
va_end(arglist);
}
static char* buffer_cstring(PGExec_Buffer* buf) {
char zero = 0;
const size_t prev_offset = buf->offset;
if (buf->offset == buf->len) {
buffer_append(buf, &zero, 1);
buf->offset--;
} else {
buf->mem[buf->offset] = 0;
}
// Offset should stay the same. This is a fake NULL.
Assert(buf->offset == prev_offset);
return buf->mem;
}
static void buffer_free(PGExec_Buffer* buf) {
free(buf->mem);
}
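Here's a quick usage sketch of the buffer API (not part of the extension itself, just to show how the pieces fit together):
PGExec_Buffer buf = {};
buffer_init(&buf);
buffer_appendz(&buf, "SELECT ");
buffer_appendf(&buf, "%d + %d", 1, 2);
// Logs: [pgexec] SELECT 1 + 2
elog(LOG, "[pgexec] %s", buffer_cstring(&buf));
buffer_free(&buf);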
Next we'll modify print_plan()
in pgexec.c
to use it, and add stubs
for printing the SELECT
columns and WHERE
clauses.
static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
buffer_appendz(buf, " [where todo]");
}
static void buffer_print_select_columns(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
buffer_appendz(buf, "[columns todo]");
}
static void print_plan(QueryDesc* queryDesc) {
SeqScan* scan = NULL;
RangeTblEntry* rte = NULL;
Relation relation = {};
char* tablename = NULL;
Plan* plan = queryDesc->plannedstmt->planTree;
PGExec_Buffer buf = {};
if (plan->type != T_SeqScan) {
elog(LOG, "[pgexec] Unsupported plan type.");
return;
}
scan = (SeqScan*)plan;
rte = list_nth(queryDesc->plannedstmt->rtable, scan->scan.scanrelid-1);
if (rte->rtekind != RTE_RELATION) {
elog(LOG, "[pgexec] Unsupported FROM type: %d.", rte->rtekind);
return;
}
buffer_init(&buf);
relation = RelationIdGetRelation(rte->relid);
tablename = NameStr(relation->rd_rel->relname);
buffer_appendz(&buf, "SELECT ");
buffer_print_select_columns(&buf, queryDesc, plan);
buffer_appendf(&buf, " FROM %s", tablename);
buffer_print_where(&buf, queryDesc, plan);
elog(LOG, "[pgexec] %s", buffer_cstring(&buf));
RelationClose(relation);
buffer_free(&buf);
}
Now we just need to implement the buffer_print_where
and
buffer_print_select_columns
functions and our walking infrastructure
will be done! For now. :)
WHERE clause
If you remember back to the SeqScan and Scan nodes, they were both basically empty. They had a Plan and a scanrelid. So the rest of the SELECT info must be in the Plan since it wasn't in the Scan.
Let's look at
Plan
again. One part that stands out is:
/*
* Common structural data for all Plan types.
*/
int plan_node_id; /* unique across entire final plan tree */
List *targetlist; /* target list to be computed at this node */
List *qual; /* implicitly-ANDed qual conditions */
struct Plan *lefttree; /* input plan tree(s) */
struct Plan *righttree;
List *initPlan; /* Init Plan nodes (un-correlated expr
* subselects) */
qual
kinda looks like a WHERE
clause. (And targetlist
kinda
looks like the columns the SELECT
pulls).
List
s
just contain void pointers, so we can't tell what the type of qual
or targetlist
children are. But I'm going to make a wild guess they
are Node
s.
There's even a nice helper that casts void pointers to Node*
and
pulls out the type,
nodeTag()
.
And reading around pg_list.h
shows some interesting helper utilities
like
foreach
that we can use to iterate the list.
Let's try printing out the type of qual
's members.
static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
ListCell* cell = NULL;
bool first = true;
if (plan->qual == NULL) {
return;
}
buffer_appendz(buf, " WHERE ");
foreach(cell, plan->qual) {
if (!first) {
buffer_appendz(buf, " AND ");
}
first = false;
buffer_appendf(buf, "[node: %d]", nodeTag(lfirst(cell)));
}
}
Notice any vestiges of LISP?
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 19:17:00.879 GMT [3250850] LOG: [pgexec] SELECT [columns todo] FROM x WHERE [node: 15]
2023-11-19 19:17:00.879 GMT [3250850] STATEMENT: SELECT a FROM x WHERE a > 1;
Well, our code didn't crash! So the guess about qual
List
entries
being Node
s seems right. Let's look up that node type in the
Postgres repo:
$ grep ' = 15,' src/include/nodes/nodetags.h
T_OpExpr = 15,
Woot! That is exactly what I'd expect the WHERE
clause here to be.
Now that we know qual
is a List
of Node
s, let's do a bit of
refactoring since targetlist
will probably also be a List
of
Node
s. Back in pgexec.c
:
static void buffer_print_expr(PGExec_Buffer*, Node*);
static void buffer_print_list(PGExec_Buffer*, List*, char*);
static void buffer_print_opexpr(PGExec_Buffer* buf, OpExpr* op) {
buffer_appendf(buf, "[opexpr: todo]");
}
static void buffer_print_expr(PGExec_Buffer* buf, Node* expr) {
switch (nodeTag(expr)) {
case T_OpExpr:
buffer_print_opexpr(buf, (OpExpr*)expr);
break;
default:
buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
}
}
static void buffer_print_list(PGExec_Buffer* buf, List* list, char* sep) {
ListCell* cell = NULL;
bool first = true;
foreach(cell, list) {
if (!first) {
buffer_appendz(buf, sep);
}
first = false;
buffer_print_expr(buf, (Node*)lfirst(cell));
}
}
static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
if (plan->qual == NULL) {
return;
}
buffer_appendz(buf, " WHERE ");
buffer_print_list(buf, plan->qual, " AND ");
}
And let's check out OpExpr!
OpExpr
Take a look at the definition of OpExpr:
typedef struct OpExpr
{
Expr xpr;
/* PG_OPERATOR OID of the operator */
Oid opno;
/* PG_PROC OID of underlying function */
Oid opfuncid pg_node_attr(equal_ignore_if_zero, query_jumble_ignore);
/* PG_TYPE OID of result value */
Oid opresulttype pg_node_attr(query_jumble_ignore);
/* true if operator returns set */
bool opretset pg_node_attr(query_jumble_ignore);
/* OID of collation of result */
Oid opcollid pg_node_attr(query_jumble_ignore);
/* OID of collation that operator should use */
Oid inputcollid pg_node_attr(query_jumble_ignore);
/* arguments to the operator (1 or 2) */
List *args;
/* token location, or -1 if unknown */
int location;
} OpExpr;
The important fields are opno
, the Oid
of the operator, and
args
. args
looks like another List
of Node
s so we already know
how to handle that.
But how do we find the string name of the operator? Presumably there's
infrastructure like RelationIdGetRelation
that takes an Oid
and
gets us an operator object.
Well I got stuck here as well. Again, thankfully, ChatGPT gave me some
suggestions. There's no great story for how I got it working. So here's
buffer_print_opexpr
.
static void buffer_print_opexpr(PGExec_Buffer* buf, OpExpr* op) {
HeapTuple opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(op->opno));
buffer_print_expr(buf, lfirst(list_nth_cell(op->args, 0)));
if (HeapTupleIsValid(opertup)) {
Form_pg_operator operator = (Form_pg_operator)GETSTRUCT(opertup);
buffer_appendf(buf, " %s ", NameStr(operator->oprname));
ReleaseSysCache(opertup);
} else {
buffer_appendf(buf, "[Unknown operation: %d]", op->opno);
}
// TODO: Support single operand operations.
buffer_print_expr(buf, lfirst(list_nth_cell(op->args, 1)));
}
And add the following two includes to the top of pgexec.c
:
#include "catalog/pg_operator.h"
#include "utils/syscache.h"
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 19:42:52.916 GMT [3252974] LOG: [pgexec] SELECT [columns todo] FROM x WHERE [Unknown: 6] > [Unknown: 7]
2023-11-19 19:42:52.916 GMT [3252974] STATEMENT: SELECT a FROM x WHERE a > 1;
And we continue to make progress! Let's look up the type of these two unknown nodes.
$ grep ' = 6,' src/include/nodes/nodetags.h
T_Var = 6,
$ grep ' = 7,' src/include/nodes/nodetags.h
T_Const = 7,
Let's deal with Const first.
Const
If we take a look at the Const definition:
typedef struct Const
{
pg_node_attr(custom_copy_equal, custom_read_write)
Expr xpr;
/* pg_type OID of the constant's datatype */
Oid consttype;
/* typmod value, if any */
int32 consttypmod pg_node_attr(query_jumble_ignore);
/* OID of collation, or InvalidOid if none */
Oid constcollid pg_node_attr(query_jumble_ignore);
/* typlen of the constant's datatype */
int constlen pg_node_attr(query_jumble_ignore);
/* the constant's value */
Datum constvalue pg_node_attr(query_jumble_ignore);
/* whether the constant is null (if true, constvalue is undefined) */
bool constisnull pg_node_attr(query_jumble_ignore);
/*
* Whether this datatype is passed by value. If true, then all the
* information is stored in the Datum. If false, then the Datum contains
* a pointer to the information.
*/
bool constbyval pg_node_attr(query_jumble_ignore);
/*
* token location, or -1 if unknown. All constants are tracked as
* locations in query jumbling, to be marked as parameters.
*/
int location pg_node_attr(query_jumble_location);
} Const;
It looks like we need to switch on the consttype
(an Oid
) to
figure out how to interpret the constvalue
(a Datum
). Remember I
mentioned earlier that how to interpret a Datum
is dependent on
context. consttype
is the context here.
In this case, although consttype is an Oid, and elsewhere we've had to use Postgres infrastructure to look up an Oid's corresponding object, there are some builtin types and the literals we've queried with are among them.
We can simply check if consttype == INT4OID and then interpret the Datum as an int32 if so. DatumGetInt32 will get us that int32 in that case.
To support the new Const
type, we'll add a case in
buffer_print_expr
to look for a T_Const
.
static void buffer_print_expr(PGExec_Buffer* buf, Node* expr) {
switch (nodeTag(expr)) {
case T_Const:
buffer_print_const(buf, (Const*)expr);
break;
case T_OpExpr:
buffer_print_opexpr(buf, (OpExpr*)expr);
break;
default:
buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
}
}
And add a new function, buffer_print_const
:
static void buffer_print_const(PGExec_Buffer* buf, Const* cnst) {
switch (cnst->consttype) {
case INT4OID: {
int32 val = DatumGetInt32(cnst->constvalue);
buffer_appendf(buf, "%d", val);
break;
}
default:
buffer_appendf(buf, "[Unknown consttype oid: %d]", cnst->consttype);
}
}
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 19:53:47.922 GMT [3253746] LOG: [pgexec] SELECT [columns todo] FROM x WHERE [Unknown: 6] > 1
2023-11-19 19:53:47.922 GMT [3253746] STATEMENT: SELECT a FROM x WHERE a > 1;
Great! Now we just have to tackle T_Var.
Var
Let's take a look at the definition of Var:
typedef struct Var
{
Expr xpr;
/*
* index of this var's relation in the range table, or
* INNER_VAR/OUTER_VAR/etc
*/
int varno;
/*
* attribute number of this var, or zero for all attrs ("whole-row Var")
*/
AttrNumber varattno;
/* pg_type OID for the type of this var */
Oid vartype pg_node_attr(query_jumble_ignore);
/* pg_attribute typmod value */
int32 vartypmod pg_node_attr(query_jumble_ignore);
/* OID of collation, or InvalidOid if none */
Oid varcollid pg_node_attr(query_jumble_ignore);
/*
* RT indexes of outer joins that can replace the Var's value with null.
* We can omit varnullingrels in the query jumble, because it's fully
* determined by varno/varlevelsup plus the Var's query location.
*/
Bitmapset *varnullingrels pg_node_attr(query_jumble_ignore);
/*
* for subquery variables referencing outer relations; 0 in a normal var,
* >0 means N levels up
*/
Index varlevelsup;
/*
* varnosyn/varattnosyn are ignored for equality, because Vars with
* different syntactic identifiers are semantically the same as long as
* their varno/varattno match.
*/
/* syntactic relation index (0 if unknown) */
Index varnosyn pg_node_attr(equal_ignore, query_jumble_ignore);
/* syntactic attribute number */
AttrNumber varattnosyn pg_node_attr(equal_ignore, query_jumble_ignore);
/* token location, or -1 if unknown */
int location;
} Var;
It looks like this refers to a relation in the range table list
again. So this means we need to have access to the full PlannedStmt
so we can read its rtable
field again to find the table. Then we
need to look up the Relation
for the table and then we can use the
Var
's varattno
field to pick the nth attribute from the relation
and get its string representation.
However, ChatGPT found a slightly higher-level function:
get_attname()
that takes a relation Oid
and an attribute index and returns the
string name of the column.
So here's what buffer_print_var
looks like:
static void buffer_print_var(PGExec_Buffer* buf, PlannedStmt* stmt, Var* var) {
char* name = NULL;
RangeTblEntry* rte = list_nth(stmt->rtable, var->varno-1);
if (rte->rtekind != RTE_RELATION) {
elog(LOG, "[Unsupported relation type for var: %d].", rte->rtekind);
return;
}
name = get_attname(rte->relid, var->varattno, false);
buffer_appendz(buf, name);
pfree(name);
}
You'll also need to add another #include
for utils/lsyscache.h
.
Let's add the case T_Var:
check in buffer_print_expr
, and also
feed the PlannedStmt*
through all the necessary buffer_print_X
functions:
static void buffer_print_expr(PGExec_Buffer*, PlannedStmt*, Node*);
static void buffer_print_list(PGExec_Buffer*, PlannedStmt*, List*, char*);
static void buffer_print_opexpr(PGExec_Buffer* buf, PlannedStmt* stmt, OpExpr* op) {
HeapTuple opertup = SearchSysCache1(OPEROID, ObjectIdGetDatum(op->opno));
buffer_print_expr(buf, stmt, lfirst(list_nth_cell(op->args, 0)));
if (HeapTupleIsValid(opertup)) {
Form_pg_operator operator = (Form_pg_operator)GETSTRUCT(opertup);
buffer_appendf(buf, " %s ", NameStr(operator->oprname));
ReleaseSysCache(opertup);
} else {
buffer_appendf(buf, "[Unknown operation: %d]", op->opno);
}
// TODO: Support single operand operations.
buffer_print_expr(buf, stmt, lfirst(list_nth_cell(op->args, 1)));
}
static void buffer_print_const(PGExec_Buffer* buf, Const* cnst) {
switch (cnst->consttype) {
case INT4OID: {
int32 val = DatumGetInt32(cnst->constvalue);
buffer_appendf(buf, "%d", val);
break;
}
default:
buffer_appendf(buf, "[Unknown consttype oid: %d]", cnst->consttype);
}
}
static void buffer_print_var(PGExec_Buffer* buf, PlannedStmt* stmt, Var* var) {
char* name = NULL;
RangeTblEntry* rte = list_nth(stmt->rtable, var->varno-1);
if (rte->rtekind != RTE_RELATION) {
elog(LOG, "[Unsupported relation type for var: %d].", rte->rtekind);
return;
}
name = get_attname(rte->relid, var->varattno, false);
buffer_appendz(buf, name);
pfree(name);
}
static void buffer_print_expr(PGExec_Buffer* buf, PlannedStmt* stmt, Node* expr) {
switch (nodeTag(expr)) {
case T_Const:
buffer_print_const(buf, (Const*)expr);
break;
case T_Var:
buffer_print_var(buf, stmt, (Var*)expr);
break;
case T_OpExpr:
buffer_print_opexpr(buf, stmt, (OpExpr*)expr);
break;
default:
buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
}
}
static void buffer_print_list(PGExec_Buffer* buf, PlannedStmt* stmt, List* list, char* sep) {
ListCell* cell = NULL;
bool first = true;
foreach(cell, list) {
if (!first) {
buffer_appendz(buf, sep);
}
first = false;
buffer_print_expr(buf, stmt, (Node*)lfirst(cell));
}
}
static void buffer_print_where(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
if (plan->qual == NULL) {
return;
}
buffer_appendz(buf, " WHERE ");
buffer_print_list(buf, queryDesc->plannedstmt, plan->qual, " AND ");
}
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 20:03:14.351 GMT [3254458] LOG: [pgexec] SELECT [columns todo] FROM x WHERE a > 1
2023-11-19 20:03:14.351 GMT [3254458] STATEMENT: SELECT a FROM x WHERE a > 1;
Huzzah!
Let's get rid of [columns todo]
. We already had the idea that List*
targetlist
on the Plan
struct was a list of expression
Node
s. Let's try it.
static void buffer_print_select_columns(PGExec_Buffer* buf, QueryDesc* queryDesc, Plan* plan) {
if (plan->targetlist == NULL) {
return;
}
buffer_print_list(buf, queryDesc->plannedstmt, plan->targetlist, ", ");
}
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and you should see:
2023-11-19 20:12:48.091 GMT [3255398] LOG: [pgexec] SELECT [Unknown: 53] FROM x WHERE a > 1
2023-11-19 20:12:48.091 GMT [3255398] STATEMENT: SELECT a FROM x WHERE a > 1;
Hmm. Let's look up Node
53
in Postgres:
$ grep ' = 53,' src/include/nodes/nodetags.h
T_TargetEntry = 53,
Based on the definition of
TargetEntry
,
it looks like we can ignore most of the fields (because we don't need
to handle SELECT a AS b
yet) and just proxy the child expr
field.
Let's add a case T_TargetEntry
to buffer_print_expr
:
static void buffer_print_expr(PGExec_Buffer* buf, PlannedStmt* stmt, Node* expr) {
switch (nodeTag(expr)) {
case T_Const:
buffer_print_const(buf, (Const*)expr);
break;
case T_Var:
buffer_print_var(buf, stmt, (Var*)expr);
break;
case T_TargetEntry:
buffer_print_expr(buf, stmt, (Node*)((TargetEntry*)expr)->expr);
break;
case T_OpExpr:
buffer_print_opexpr(buf, stmt, (OpExpr*)expr);
break;
default:
buffer_appendf(buf, "[Unknown: %d]", nodeTag(expr));
}
}
Rebuild and reinstall the extension, and restart Postgres. Re-run the
test.sql
script. Check the Postgres server logs and:
2023-11-19 20:17:51.114 GMT [3257827] LOG: [pgexec] SELECT a FROM x WHERE a > 1
2023-11-19 20:17:51.114 GMT [3257827] STATEMENT: SELECT a FROM x WHERE a > 1;
We did it!
Let's try out some other queries to make sure this wasn't just luck.
$ /usr/local/pgsql/bin/psql -h localhost postgres -c 'SELECT a + 1 FROM x'
?column?
----------
310
(1 row)
$ /usr/local/pgsql/bin/psql -h localhost postgres -c 'SELECT a + 1 FROM x WHERE 2 > a'
?column?
----------
(0 rows)
And back in the Postgres server logs:
2023-11-19 20:19:28.057 GMT [3257874] LOG: [pgexec] SELECT a + 1 FROM x
2023-11-19 20:19:28.057 GMT [3257874] STATEMENT: SELECT a + 1 FROM x
2023-11-19 20:19:30.474 GMT [3257878] LOG: [pgexec] SELECT a + 1 FROM x WHERE 2 > a
2023-11-19 20:19:30.474 GMT [3257878] STATEMENT: SELECT a + 1 FROM x WHERE 2 > a
Not bad!
Printing out the statement here isn't incredibly useful. But it establishes a basis for future work that might avoid Postgres's query execution engine and do the execution ourselves, or to proxy execution to another system.
My recent Postgres explorations would have been basically impossible
if it weren't for being able to ask ChatGPT simple, stupid questions
like "How do I get from a Postgres Var
to a column name".
It isn't always right. It doesn't always give great code. Actually, it normally gives pretty weird code. But it's been extremely useful for quick iteration when I get stuck.
The only other place the information exists is in small blog posts around the internet, the Postgres mailing lists (which so far for me haven't been super responsive), and the code itself.
I've been on a Postgres roll. Let's dig into interpreting a Postgres query plan in preparation for future work on completely diverting the flow of Postgres query execution using execution hooks!https://t.co/EZrgoIiTuX pic.twitter.com/7S6d6olPX8
— Phil Eaton (@eatonphil) November 19, 2023
2023-11-01 08:00:00
With Postgres 12, released in 2019, it became possible to swap out Postgres's storage engine.
This is a feature MySQL has supported for a long time. There are at least 8 different built-in engines you can pick from. MyRocks, MySQL on RocksDB, is another popular third-party distribution.
I assume there will be a renaissance of Postgres storage engines. To date, the efforts are nascent. OrioleDB and Citus Columnar are two promising third-party table access methods being actively developed.
The ability to swap storage engines is useful because different workloads sometimes benefit from different storage approaches. Analytics workloads and columnar storage layouts go well together. Write-heavy workloads and LSM trees go well together. And some people like in-memory storage for running integration tests.
By swapping out only the storage engine, you get the benefit of the rest of the Postgres or MySQL infrastructure. The query language, the wire protocol, the ecosystem, etc.
Very little has been written about the difference between foreign data wrappers (FDWs) and table access methods. Table access methods seem to be the lower-level layer where presumably you get better performance and cleaner integration. But there is clearly overlap between these two extension options.
For example there is a FDW for ClickHouse so when you create tables and rows and query the tables you are really creating and querying rows in a ClickHouse server. Similarly there's a FDW for RocksDB. And Citus's columnar engine works either as a foreign data wrapper or a table access method.
The Citus page draws the clearest distinction between FDWs and table access methods, but even that page is vague. Performance doesn't seem to be the main difference. Closer integration, and thus the ability to look more like vanilla Postgres from the outside, seems to be the gist.
In any case, I wanted to explore the table access method API.
I haven't written Postgres extensions before and I've never written C professionally. If you're familiar with Postgres internals or C and notice something funky, please let me know!
It turns out that almost no one has written about how to implement minimal table access methods for various storage engine operations. So after quite a bit of stumbling to get the basics of an in-memory storage engine working, I'm going to walk you through my approach.
This is prototype-quality code which hopefully will be a useful base for further exploration.
All code for this post is available on GitHub.
First off, let's make a debug build of Postgres.
$ git clone https://github.com/postgres/postgres
$ cd postgres
$ # An arbitrary commit from `master` after Postgres 16 that I am on
$ git checkout 849172ff4883d44168f96f39d3fde96d0aa34c99
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
$ make -j8
$ sudo make install
This will install Postgres binaries (e.g. psql
, pg_ctl
, initdb
,
pg_config
) into /usr/local/pgsql/bin
.
I'm going to reference those absolute paths throughout this post because you might have a system (package manager) install of Postgres already.
Let's create a database and start up this debug build:
$ /usr/local/pgsql/bin/initdb test-db
$ /usr/local/pgsql/bin/pg_ctl -D test-db -l logfile start
Since we installed Postgres from scratch,
/usr/local/pgsql/bin/pg_config
will supply all of the infrastructure
we need.
The "infrastructure" is basically just PGXS: Postgres Makefile utilities.
It's convention-heavy. So in a new Makefile
for this project we'll
specify:
MODULES: Any C sources to build, without the .c file extension
EXTENSION: Extension metadata file, without the .control file extension
DATA: A SQL file that is executed when the extension is loaded, this time with the .sql extension
MODULES = pgtam
EXTENSION = pgtam
DATA = pgtam--0.0.1.sql
PG_CONFIG = /usr/local/pgsql/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
The final three lines set up the PGXS Makefile library based on the particular installed Postgres build we want to build the extension against and install the extension to.
PGXS gives us a few important targets like make distclean
, make
,
and make install
we'll use later on.
pgtam.c
A minimal C file that registers a function capable of serving as a table access method is:
#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"
PG_MODULE_MAGIC;
const TableAmRoutine memam_methods = {
.type = T_TableAmRoutine,
};
PG_FUNCTION_INFO_V1(mem_tableam_handler);
Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
PG_RETURN_POINTER(&memam_methods);
}
If you want to read about extension basics without the complexity of table access methods, you can find a complete, minimal Postgres extension I wrote to validate the infrastructure here. Or you can follow a larger tutorial.
The workflow for registering a table access method is to first run
CREATE EXTENSION pgtam
. This assumes pgtam
is an extension that
has a function that returns a TableAmRoutine
struct instance, a
table of table access methods.
Then you must run CREATE ACCESS METHOD mem TYPE TABLE HANDLER
mem_tableam_handler
. And finally you can use the access method when
creating a table with USING mem
: CREATE TABLE x(a INT) USING mem
.
pgtam.control
This file contains extension metadata. At a minimum, the version of the extension and the path where the extension's shared library will be installed.
default_version = '0.0.1'
module_pathname = '$libdir/pgtam'
pgtam--0.0.1.sql
Finally, in pgtam--0.0.1.sql
(which is executed when we call CREATE
EXTENSION pgtam
), we register the handler function as a foreign
function, and then we register the function as an access method.
CREATE OR REPLACE FUNCTION mem_tableam_handler(internal)
RETURNS table_am_handler AS 'pgtam', 'mem_tableam_handler'
LANGUAGE C STRICT;
CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler;
Now that we've got all the pieces in place, we can build and install the extension.
$ make
$ sudo make install
Let's add a test.sql
script to exercise the extension:
DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;
And run it:
$ /usr/local/pgsql/bin/psql postgres -f test.sql
DROP EXTENSION
CREATE EXTENSION
psql:test.sql:3: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:test.sql:3: error: connection to server was lost
Ok, so the server crashed and psql lost its connection! Let's look at the server logs. When we started
Postgres with pg_ctl
we specified the log file as logfile
in the
directory where we ran pg_ctl
.
If we look through it we'll spot an assertion failure:
$ grep Assert logfile
TRAP: failed Assert("routine->scan_begin != NULL"), File: "tableamapi.c", Line: 52, PID: 2906922
That's a great sign! This is Postgres's debug infrastructure helping to make sure the table access method is correctly implemented.
The next step is to add function stubs for all the non-optional
methods of the TableAmRoutine
struct.
I've done all the work for you already so you can just copy this over
the existing pgtam.c
. It's a big file, but don't worry. There's
nothing to explain. Just a bunch of blank functions returning default
values when required.
#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"
#include "access/heapam.h"
#include "nodes/execnodes.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
#include "utils/builtins.h"
#include "executor/tuptable.h"
PG_MODULE_MAGIC;
const TableAmRoutine memam_methods;
static const TupleTableSlotOps* memam_slot_callbacks(
Relation relation
) {
return NULL;
}
static TableScanDesc memam_beginscan(
Relation relation,
Snapshot snapshot,
int nkeys,
struct ScanKeyData *key,
ParallelTableScanDesc parallel_scan,
uint32 flags
) {
return NULL;
}
static void memam_rescan(
TableScanDesc sscan,
struct ScanKeyData *key,
bool set_params,
bool allow_strat,
bool allow_sync,
bool allow_pagemode
) {
}
static void memam_endscan(TableScanDesc sscan) {
}
static bool memam_getnextslot(
TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot
) {
return false;
}
static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
return NULL;
}
static void memam_index_fetch_reset(IndexFetchTableData *scan) {
}
static void memam_index_fetch_end(IndexFetchTableData *scan) {
}
static bool memam_index_fetch_tuple(
struct IndexFetchTableData *scan,
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot,
bool *call_again,
bool *all_dead
) {
return false;
}
static void memam_tuple_insert(
Relation relation,
TupleTableSlot *slot,
CommandId cid,
int options,
BulkInsertState bistate
) {
}
static void memam_tuple_insert_speculative(
Relation relation,
TupleTableSlot *slot,
CommandId cid,
int options,
BulkInsertState bistate,
uint32 specToken) {
}
static void memam_tuple_complete_speculative(
Relation relation,
TupleTableSlot *slot,
uint32 specToken,
bool succeeded) {
}
static void memam_multi_insert(
Relation relation,
TupleTableSlot **slots,
int ntuples,
CommandId cid,
int options,
BulkInsertState bistate
) {
}
static TM_Result memam_tuple_delete(
Relation relation,
ItemPointer tid,
CommandId cid,
Snapshot snapshot,
Snapshot crosscheck,
bool wait,
TM_FailureData *tmfd,
bool changingPart
) {
TM_Result result = {};
return result;
}
static TM_Result memam_tuple_update(
Relation relation,
ItemPointer otid,
TupleTableSlot *slot,
CommandId cid,
Snapshot snapshot,
Snapshot crosscheck,
bool wait,
TM_FailureData *tmfd,
LockTupleMode *lockmode,
TU_UpdateIndexes *update_indexes
) {
TM_Result result = {};
return result;
}
static TM_Result memam_tuple_lock(
Relation relation,
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot,
CommandId cid,
LockTupleMode mode,
LockWaitPolicy wait_policy,
uint8 flags,
TM_FailureData *tmfd)
{
TM_Result result = {};
return result;
}
static bool memam_fetch_row_version(
Relation relation,
ItemPointer tid,
Snapshot snapshot,
TupleTableSlot *slot
) {
return false;
}
static void memam_get_latest_tid(
TableScanDesc sscan,
ItemPointer tid
) {
}
static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
return false;
}
static bool memam_tuple_satisfies_snapshot(
Relation rel,
TupleTableSlot *slot,
Snapshot snapshot
) {
return false;
}
static TransactionId memam_index_delete_tuples(
Relation rel,
TM_IndexDeleteOp *delstate
) {
TransactionId id = {};
return id;
}
static void memam_relation_set_new_filelocator(
Relation rel,
const RelFileLocator *newrlocator,
char persistence,
TransactionId *freezeXid,
MultiXactId *minmulti
) {
}
static void memam_relation_nontransactional_truncate(
Relation rel
) {
}
static void memam_relation_copy_data(
Relation rel,
const RelFileLocator *newrlocator
) {
}
static void memam_relation_copy_for_cluster(
Relation OldHeap,
Relation NewHeap,
Relation OldIndex,
bool use_sort,
TransactionId OldestXmin,
TransactionId *xid_cutoff,
MultiXactId *multi_cutoff,
double *num_tuples,
double *tups_vacuumed,
double *tups_recently_dead
) {
}
static void memam_vacuum_rel(
Relation rel,
VacuumParams *params,
BufferAccessStrategy bstrategy
) {
}
static bool memam_scan_analyze_next_block(
TableScanDesc scan,
BlockNumber blockno,
BufferAccessStrategy bstrategy
) {
return false;
}
static bool memam_scan_analyze_next_tuple(
TableScanDesc scan,
TransactionId OldestXmin,
double *liverows,
double *deadrows,
TupleTableSlot *slot
) {
return false;
}
static double memam_index_build_range_scan(
Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool allow_sync,
bool anyvisible,
bool progress,
BlockNumber start_blockno,
BlockNumber numblocks,
IndexBuildCallback callback,
void *callback_state,
TableScanDesc scan
) {
return 0;
}
static void memam_index_validate_scan(
Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
Snapshot snapshot,
ValidateIndexState *state
) {
}
static bool memam_relation_needs_toast_table(Relation rel) {
return false;
}
static Oid memam_relation_toast_am(Relation rel) {
Oid oid = {};
return oid;
}
static void memam_fetch_toast_slice(
Relation toastrel,
Oid valueid,
int32 attrsize,
int32 sliceoffset,
int32 slicelength,
struct varlena *result
) {
}
static void memam_estimate_rel_size(
Relation rel,
int32 *attr_widths,
BlockNumber *pages,
double *tuples,
double *allvisfrac
) {
}
static bool memam_scan_sample_next_block(
TableScanDesc scan, SampleScanState *scanstate
) {
return false;
}
static bool memam_scan_sample_next_tuple(
TableScanDesc scan,
SampleScanState *scanstate,
TupleTableSlot *slot
) {
return false;
}
const TableAmRoutine memam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = memam_slot_callbacks,
.scan_begin = memam_beginscan,
.scan_end = memam_endscan,
.scan_rescan = memam_rescan,
.scan_getnextslot = memam_getnextslot,
.parallelscan_estimate = table_block_parallelscan_estimate,
.parallelscan_initialize = table_block_parallelscan_initialize,
.parallelscan_reinitialize = table_block_parallelscan_reinitialize,
.index_fetch_begin = memam_index_fetch_begin,
.index_fetch_reset = memam_index_fetch_reset,
.index_fetch_end = memam_index_fetch_end,
.index_fetch_tuple = memam_index_fetch_tuple,
.tuple_insert = memam_tuple_insert,
.tuple_insert_speculative = memam_tuple_insert_speculative,
.tuple_complete_speculative = memam_tuple_complete_speculative,
.multi_insert = memam_multi_insert,
.tuple_delete = memam_tuple_delete,
.tuple_update = memam_tuple_update,
.tuple_lock = memam_tuple_lock,
.tuple_fetch_row_version = memam_fetch_row_version,
.tuple_get_latest_tid = memam_get_latest_tid,
.tuple_tid_valid = memam_tuple_tid_valid,
.tuple_satisfies_snapshot = memam_tuple_satisfies_snapshot,
.index_delete_tuples = memam_index_delete_tuples,
.relation_set_new_filelocator = memam_relation_set_new_filelocator,
.relation_nontransactional_truncate = memam_relation_nontransactional_truncate,
.relation_copy_data = memam_relation_copy_data,
.relation_copy_for_cluster = memam_relation_copy_for_cluster,
.relation_vacuum = memam_vacuum_rel,
.scan_analyze_next_block = memam_scan_analyze_next_block,
.scan_analyze_next_tuple = memam_scan_analyze_next_tuple,
.index_build_range_scan = memam_index_build_range_scan,
.index_validate_scan = memam_index_validate_scan,
.relation_size = table_block_relation_size,
.relation_needs_toast_table = memam_relation_needs_toast_table,
.relation_toast_am = memam_relation_toast_am,
.relation_fetch_toast_slice = memam_fetch_toast_slice,
.relation_estimate_size = memam_estimate_rel_size,
.scan_sample_next_block = memam_scan_sample_next_block,
.scan_sample_next_tuple = memam_scan_sample_next_tuple
};
PG_FUNCTION_INFO_V1(mem_tableam_handler);
Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
PG_RETURN_POINTER(&memam_methods);
}
Let's build and test it!
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
Hey we're getting somewhere! It successfully created the table with our custom table access method.
Next, let's try querying the table by adding a SELECT a FROM x
to
test.sql
and running it:
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:test.sql:6: error: connection to server was lost
This time there's nothing in logfile
that helps:
$ tail -n15 logfile
2023-11-01 18:43:32.449 UTC [2906199] LOG: database system is ready to accept connections
2023-11-01 18:58:32.572 UTC [2907997] LOG: checkpoint starting: time
2023-11-01 18:58:35.305 UTC [2907997] LOG: checkpoint complete: wrote 28 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.712 s, sync=0.015 s, total=2.733 s; sync files=23, longest=0.004 s, average=0.001 s; distance=128 kB, estimate=150 kB; lsn=0/15F88E0, redo lsn=0/15F8888
2023-11-01 19:08:14.485 UTC [2906199] LOG: server process (PID 2908242) was terminated by signal 11: Segmentation fault
2023-11-01 19:08:14.485 UTC [2906199] DETAIL: Failed process was running: SELECT a FROM x;
2023-11-01 19:08:14.485 UTC [2906199] LOG: terminating any other active server processes
2023-11-01 19:08:14.486 UTC [2906199] LOG: all server processes terminated; reinitializing
2023-11-01 19:08:14.508 UTC [2908253] LOG: database system was interrupted; last known up at 2023-11-01 18:58:35 UTC
2023-11-01 19:08:14.518 UTC [2908253] LOG: database system was not properly shut down; automatic recovery in progress
2023-11-01 19:08:14.519 UTC [2908253] LOG: redo starts at 0/15F8888
2023-11-01 19:08:14.520 UTC [2908253] LOG: invalid record length at 0/161DE70: expected at least 24, got 0
2023-11-01 19:08:14.520 UTC [2908253] LOG: redo done at 0/161DE38 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-11-01 19:08:14.521 UTC [2908254] LOG: checkpoint starting: end-of-recovery immediate wait
2023-11-01 19:08:14.532 UTC [2908254] LOG: checkpoint complete: wrote 35 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.010 s, total=0.012 s; sync files=27, longest=0.003 s, average=0.001 s; distance=149 kB, estimate=149 kB; lsn=0/161DE70, redo lsn=0/161DE70
2023-11-01 19:08:14.533 UTC [2906199] LOG: database system is ready to accept connections
This was the first place I got stuck. How on earth do I figure out what methods to implement? I mean, it's clearly one or more of these methods from the struct. But there are so many methods.
I tried setting a breakpoint in gdb
on the process returned by
SELECT pg_backend_pid()
for a psql
session, but the breakpoint
never seemed to be hit for any of my methods.
So I did the low-tech solution and opened a file, /tmp/pgtam.log
,
turned off buffering on it, and added a log to every method on the
TableAmRoutine
struct:
@@ -12,9 +12,13 @@
const TableAmRoutine memam_methods;
+FILE* fd;
+#define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);
+
static const TupleTableSlotOps* memam_slot_callbacks(
Relation relation
) {
+ DEBUG_FUNC();
return NULL;
}
@@ -26,6 +30,7 @@
ParallelTableScanDesc parallel_scan,
uint32 flags
) {
+ DEBUG_FUNC();
return NULL;
}
@@ -37,9 +42,11 @@
bool allow_sync,
bool allow_pagemode
) {
+ DEBUG_FUNC();
}
static void memam_endscan(TableScanDesc sscan) {
+ DEBUG_FUNC();
}
static bool memam_getnextslot(
@@ -47,17 +54,21 @@
ScanDirection direction,
TupleTableSlot *slot
) {
+ DEBUG_FUNC();
return false;
}
static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
+ DEBUG_FUNC();
return NULL;
}
static void memam_index_fetch_reset(IndexFetchTableData *scan) {
+ DEBUG_FUNC();
}
static void memam_index_fetch_end(IndexFetchTableData *scan) {
+ DEBUG_FUNC();
}
static bool memam_index_fetch_tuple(
@@ -68,6 +79,7 @@
bool *call_again,
bool *all_dead
) {
+ DEBUG_FUNC();
return false;
}
@@ -78,6 +90,7 @@
int options,
BulkInsertState bistate
) {
+ DEBUG_FUNC();
}
static void memam_tuple_insert_speculative(
@@ -87,6 +100,7 @@
int options,
BulkInsertState bistate,
uint32 specToken) {
+ DEBUG_FUNC();
}
static void memam_tuple_complete_speculative(
@@ -94,6 +108,7 @@
TupleTableSlot *slot,
uint32 specToken,
bool succeeded) {
+ DEBUG_FUNC();
}
static void memam_multi_insert(
@@ -104,6 +119,7 @@
int options,
BulkInsertState bistate
) {
+ DEBUG_FUNC();
}
static TM_Result memam_tuple_delete(
@@ -117,6 +133,7 @@
bool changingPart
) {
TM_Result result = {};
+ DEBUG_FUNC();
return result;
}
@@ -133,6 +150,7 @@
TU_UpdateIndexes *update_indexes
) {
TM_Result result = {};
+ DEBUG_FUNC();
return result;
}
@@ -148,6 +166,7 @@
TM_FailureData *tmfd)
{
TM_Result result = {};
+ DEBUG_FUNC();
return result;
}
@@ -157,6 +176,7 @@
Snapshot snapshot,
TupleTableSlot *slot
) {
+ DEBUG_FUNC();
return false;
}
@@ -164,9 +184,11 @@
TableScanDesc sscan,
ItemPointer tid
) {
+ DEBUG_FUNC();
}
static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
+ DEBUG_FUNC();
return false;
}
@@ -175,6 +197,7 @@
TupleTableSlot *slot,
Snapshot snapshot
) {
+ DEBUG_FUNC();
return false;
}
@@ -183,6 +206,7 @@
TM_IndexDeleteOp *delstate
) {
TransactionId id = {};
+ DEBUG_FUNC();
return id;
}
@@ -193,17 +217,20 @@
TransactionId *freezeXid,
MultiXactId *minmulti
) {
+ DEBUG_FUNC();
}
static void memam_relation_nontransactional_truncate(
Relation rel
) {
+ DEBUG_FUNC();
}
static void memam_relation_copy_data(
Relation rel,
const RelFileLocator *newrlocator
) {
+ DEBUG_FUNC();
}
static void memam_relation_copy_for_cluster(
@@ -218,6 +245,7 @@
double *tups_vacuumed,
double *tups_recently_dead
) {
+ DEBUG_FUNC();
}
static void memam_vacuum_rel(
@@ -225,6 +253,7 @@
VacuumParams *params,
BufferAccessStrategy bstrategy
) {
+ DEBUG_FUNC();
}
static bool memam_scan_analyze_next_block(
@@ -232,6 +261,7 @@
BlockNumber blockno,
BufferAccessStrategy bstrategy
) {
+ DEBUG_FUNC();
return false;
}
@@ -242,6 +272,7 @@
double *deadrows,
TupleTableSlot *slot
) {
+ DEBUG_FUNC();
return false;
}
@@ -258,6 +289,7 @@
void *callback_state,
TableScanDesc scan
) {
+ DEBUG_FUNC();
return 0;
}
@@ -268,14 +300,17 @@
Snapshot snapshot,
ValidateIndexState *state
) {
+ DEBUG_FUNC();
}
static bool memam_relation_needs_toast_table(Relation rel) {
+ DEBUG_FUNC();
return false;
}
static Oid memam_relation_toast_am(Relation rel) {
Oid oid = {};
+ DEBUG_FUNC();
return oid;
}
@@ -287,6 +322,7 @@
int32 slicelength,
struct varlena *result
) {
+ DEBUG_FUNC();
}
static void memam_estimate_rel_size(
@@ -296,11 +332,13 @@
double *tuples,
double *allvisfrac
) {
+ DEBUG_FUNC();
}
static bool memam_scan_sample_next_block(
TableScanDesc scan, SampleScanState *scanstate
) {
+ DEBUG_FUNC();
return false;
}
@@ -309,6 +347,7 @@
SampleScanState *scanstate,
TupleTableSlot *slot
) {
+ DEBUG_FUNC();
return false;
}
And then in the entrypoint, initialize the file for logging.
@@ -369,5 +408,9 @@
PG_FUNCTION_INFO_V1(mem_tableam_handler);
Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
+ fd = fopen("/tmp/pgtam.log", "a");
+ setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
+ fprintf(fd, "\n\nmem_tableam handler loaded\n");
+
PG_RETURN_POINTER(&memam_methods);
}
Let's give it a shot!
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:test.sql:6: error: connection to server was lost
And let's check our log file:
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
Now we're getting somewhere!
I later realized elog()
is the way most people log
within Postgres/within extensions. I didn't know that when I was
getting started though. This separate logging was a simple way to
get the info out.
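For example, a rough equivalent of the DEBUG_FUNC macro above using elog would be something like this (LOG-level messages go to the server log, i.e. our logfile):
elog(LOG, "in %s", __func__);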
slot_callbacks
Since the request crashes and the last logged function is
memam_slot_callbacks
, it seems like that is where we should
concentrate. The table access method
docs suggest
looking at the default heap
access method for inspiration.
Its
version
of slot_callbacks
returns &TTSOpsBufferHeapTuple
:
static const TupleTableSlotOps *
heapam_slot_callbacks(Relation relation)
{
return &TTSOpsBufferHeapTuple;
}
I have no idea what that means, but since it is defined in
src/backend/executor/execTuples.c
it doesn't seem to be tied to the heap
access method
implementation. Let's try it.
While it works initially, I noticed later on that
TTSOpsBufferHeapTuple
turns out not to be the right
choice here. TTSOpsVirtual
seems to be the right
implementation.
@@ -19,7 +19,7 @@
Relation relation
) {
DEBUG_FUNC();
- return NULL;
+ return &TTSOpsVirtual;
}
static TableScanDesc memam_beginscan(
Build and run:
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
psql:test.sql:6: error: connection to server was lost
It still crashes. But this time in /tmp/pgtam.log
we made it into a
new method!
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
scan_begin
The function signature is:
TableScanDesc heap_beginscan(
Relation relation,
Snapshot snapshot,
int nkeys,
ScanKey key,
ParallelTableScanDesc parallel_scan,
uint32 flags
);
Since we just implemented stub versions of all the methods, we've been returning NULL. Given that we're crashing in this function, maybe we should try returning something that isn't NULL.
By looking at the definition of TableScanDesc
, we can see it is a
pointer to the TableScanDescData
struct defined in
src/include/access/relscan.h
.
Let's malloc
a TableScanDescData
, free it in endscan
, and return
the TableScanDescData
instance in beginscan
:
@@ -30,8 +30,12 @@
ParallelTableScanDesc parallel_scan,
uint32 flags
) {
+ TableScanDescData* scan = {};
DEBUG_FUNC();
- return NULL;
+
+ scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+
+ return (TableScanDesc)scan;
}
static void memam_rescan(
@@ -87,6 +87,7 @@
static void memam_endscan(TableScanDesc sscan) {
DEBUG_FUNC();
+ free(sscan);
}
Build and run (you can do it on your own). No difference.
I got stuck for a while here too. Clearly something must be filled out
in this struct but it could be anything. Through trial and error I
realized the one field that must be filled out is scan->rs_rd
.
@@ -34,6 +34,7 @@
DEBUG_FUNC();
scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+ scan->rs_rd = relation;
return (TableScanDesc)scan;
}
We build and run:
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
a
---
(0 rows)
And it works! It doesn't return anything but that's correct. There's nothing to return.
So what if we actually want to return something? Let's check our logs
in /tmp/pgtam.log
.
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_endscan
Ok, I'm getting the gist of the API. A full table scan (which this is,
because there are no indexes at play) starts with an initialization
for a slot, then the scan begins, then getnextslot
is called for
each row, and then endscan
is called to allow for cleanup.
So let's try returning a row in getnextslot
.
getnextslot
The getnextslot
signature is:
bool memam_getnextslot(
TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot
);
So the sscan
should be what we returned from beginscan
and the
interface
docs
say the current row gets stored in slot
.
The return value indicates whether a row was produced: return true when you've stored a row in slot, and false once the scan is over. However, the scan will still end early if you return true but the slot is not filled out correctly. And if the slot is filled out correctly but you unconditionally return true, you will crash the process.
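In other words, the shape of a working getnextslot body is roughly this (a sketch of the contract only; the real implementation comes below):
if (/* no rows left in this scan */) {
    return false;
}
/* Fill in slot->tts_values[i] and slot->tts_isnull[i] for each column of one row. */
return true;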
Let's take a look at the
definition
of TupleTableSlot
:
typedef struct TupleTableSlot
{
NodeTag type;
#define FIELDNO_TUPLETABLESLOT_FLAGS 1
uint16 tts_flags; /* Boolean states */
#define FIELDNO_TUPLETABLESLOT_NVALID 2
AttrNumber tts_nvalid; /* # of valid values in tts_values */
const TupleTableSlotOps *const tts_ops; /* implementation of slot */
#define FIELDNO_TUPLETABLESLOT_TUPLEDESCRIPTOR 4
TupleDesc tts_tupleDescriptor; /* slot's tuple descriptor */
#define FIELDNO_TUPLETABLESLOT_VALUES 5
Datum *tts_values; /* current per-attribute values */
#define FIELDNO_TUPLETABLESLOT_ISNULL 6
bool *tts_isnull; /* current per-attribute isnull flags */
MemoryContext tts_mcxt; /* slot itself is in this context */
ItemPointerData tts_tid; /* stored tuple's tid */
Oid tts_tableOid; /* table oid of tuple */
} TupleTableSlot;
tts_values
is an array of Datum
(which is a Postgres value). So
that sounds like the actual values of the row. The tts_isnull
field
also looks important since that seems to be whether each value in the
row is null or not. And tts_nvalid
sounds important too since
presumably it's the length of the tts_isnull
and tts_values
arrays.
The rest of it may or may not be important. Let's try filling out these three fields though and see what happens.
Back in the Postgres C extension documentation, we can see some simple examples of converting between C types and Postgres's Datum type.
For example:
Datum
add_one(PG_FUNCTION_ARGS)
{
int32 arg = PG_GETARG_INT32(0);
PG_RETURN_INT32(arg + 1);
}
If we look at the definition of PG_RETURN_INT32
in
src/include/fmgr.h
,
we see:
#define PG_RETURN_INT32(x) return Int32GetDatum(x)
So Int32GetDatum()
is the function we'll use to set a Datum
for a
cell in a row.
@@ -54,13 +54,26 @@
DEBUG_FUNC();
}
+static bool done = false;
static bool memam_getnextslot(
TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot
) {
DEBUG_FUNC();
- return false;
+
+ if (done) {
+ return false;
+ }
+
+ slot->tts_nvalid = 1;
+ slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
+ slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+ slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
+ slot->tts_isnull[0] = false;
+ done = true;
+
+ return true;
}
static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
The goal is that we return a single row and then exit the scan. It
will have one 32-bit integer cell (remember we created the table
CREATE TABLE x (a INT)
; INT
is shorthand for INT4
which is a
32-bit integer) that will have the value 314
.
But if we build and run this, we get no rows.
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
a
---
(0 rows)
I got stuck for a while here. Plugging my getnextslot
code into
ChatGPT helped. One thing it gave me to try was calling
ExecStoreVirtualTuple
on the slot
. I noticed that the built-in
heap
access method also called a function like
this
in getnextslot
.
And I realized that tts_nvalid
is already set up and the memory for
tts_values
and tts_isnull
is already allocated. So the code became
a little simpler.
@@ -66,11 +66,9 @@
return false;
}
- slot->tts_nvalid = 1;
- slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
- slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
slot->tts_isnull[0] = false;
+ ExecStoreVirtualTuple(slot);
done = true;
return true;
Build and run:
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
a
-----
314
(1 row)
Fantastic!
Now that we've proven we can return random data, let's set up infrastructure for storing tables in memory.
@@ -15,6 +15,41 @@
FILE* fd;
#define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);
+
+struct Column {
+ int value;
+};
+
+struct Row {
+ struct Column* columns;
+ size_t ncolumns;
+};
+
+#define MAX_ROWS 100
+struct Table {
+ char* name;
+ struct Row* rows;
+ size_t nrows;
+};
+
+#define MAX_TABLES 100
+struct Database {
+ struct Table* tables;
+ size_t ntables;
+};
+
+struct Database* database;
+
+static void get_table(struct Table** table, Relation relation) {
+ char* this_name = NameStr(relation->rd_rel->relname);
+ for (size_t i = 0; i < database->ntables; i++) {
+ if (strcmp(database->tables[i].name, this_name) == 0) {
+ *table = &database->tables[i];
+ return;
+ }
+ }
+}
+
static const TupleTableSlotOps* memam_slot_callbacks(
Relation relation
) {
Based on what we logged in /tmp/pgtam.log
it seems like
memam_relation_set_new_filelocator
is called when a new table is
created. So let's handle adding a new table there.
@@ -233,7 +268,16 @@
TransactionId *freezeXid,
MultiXactId *minmulti
) {
+ struct Table table = {};
DEBUG_FUNC();
+
+ table.name = strdup(NameStr(rel->rd_rel->relname));
+ fprintf(fd, "Created table: [%s]\n", table.name);
+ table.rows = (struct Row*)malloc(sizeof(struct Row) * MAX_ROWS);
+ table.nrows = 0;
+
+ database->tables[database->ntables] = table;
+ database->ntables++;
}
static void memam_relation_nontransactional_truncate(
Finally, we'll initialize the in-memory Database*
when the handler is
loaded.
@@ -428,5 +472,11 @@
setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
fprintf(fd, "\n\nmem_tableam handler loaded\n");
+ if (database == NULL) {
+ database = (struct Database*)malloc(sizeof(struct Database));
+ database->ntables = 0;
+ database->tables = (struct Table*)malloc(sizeof(struct Table) * MAX_TABLES);
+ }
+
PG_RETURN_POINTER(&memam_methods);
}
If we build and run, we won't notice anything new.
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
a
-----
314
(1 row)
But we should see a message in /tmp/pgtam.log
about the new table
being created.
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan
And there it is! Creation looks good.
Let's add INSERT INTO x VALUES (23), (101);
to test.sql
and run
the SQL script.
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
a
-----
314
(1 row)
And let's check the log to see what method is called when we try to
INSERT
.
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
in memam_tuple_insert
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan
tuple_insert
seems to be the method! Looks like it gets called once
for each row to insert. Perfect.
The signature for tuple_insert
is:
void memam_tuple_insert(
Relation relation,
TupleTableSlot *slot,
CommandId cid,
int options,
BulkInsertState bistate
);
We can get the table name from relation. And this time, instead of writing to slot, we read from slot->tts_values.
@@ -141,7 +141,38 @@
int options,
BulkInsertState bistate
) {
+ TupleDesc desc = RelationGetDescr(relation);
+ struct Table* table = NULL;
+ struct Column column = {};
+ struct Row row = {};
+
DEBUG_FUNC();
+
+ get_table(&table, relation);
+ if (table == NULL) {
+ elog(ERROR, "table not found");
+ return;
+ }
+
+ if (table->nrows == MAX_ROWS) {
+ elog(ERROR, "cannot insert more rows");
+ return;
+ }
+
+ row.ncolumns = desc->natts;
+ Assert(slot->tts_nvalid == row.ncolumns);
+ Assert(row.ncolumns > 0);
+
+ row.columns = (struct Column*)malloc(sizeof(struct Column) * row.ncolumns);
+ for (size_t i = 0; i < row.ncolumns; i++) {
+ Assert(desc->attrs[i].atttypid == INT4OID);
+ column.value = DatumGetInt32(slot->tts_values[i]);
+ row.columns[i] = column;
+ fprintf(fd, "Got value: %d\n", column.value);
+ }
+
+ table->rows[table->nrows] = row;
+ table->nrows++;
}
static void memam_tuple_insert_speculative(
Build and run and again we won't notice anything new.
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
a
-----
314
(1 row)
But if we check the logs, we should see the two column values we inserted, one for each row.
$ cat /tmp/pgtam.log
mem_tableam handler loaded
mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]
mem_tableam handler loaded
in memam_relation_needs_toast_table
mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
Got value: 23
in memam_tuple_insert
Got value: 101
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan
Woohoo!
The final thing we need to do is drop the hardcoded 314
we returned
from getnextslot
and instead we need to look up the current table
and return rows from it. This also means we need to keep track of
which row we're on. So beginscan
will also need to change slightly.
@@ -57,6 +56,14 @@
return &TTSOpsVirtual;
}
+
+struct MemScanDesc {
+ TableScanDescData rs_base; // Base class from access/relscan.h.
+
+ // Custom data.
+ uint32 cursor;
+};
+
static TableScanDesc memam_beginscan(
Relation relation,
Snapshot snapshot,
@@ -65,11 +72,13 @@
ParallelTableScanDesc parallel_scan,
uint32 flags
) {
- TableScanDescData* scan = {};
- DEBUG_FUNC();
+ struct MemScanDesc* scan;
- scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
- scan->rs_rd = relation;
+ DEBUG_FUNC();
+
+ scan = (struct MemScanDesc*)malloc(sizeof(struct MemScanDesc));
+ scan->rs_base.rs_rd = relation;
+ scan->cursor = 0;
return (TableScanDesc)scan;
}
@@ -89,23 +97,26 @@
DEBUG_FUNC();
}
-static bool done = false;
static bool memam_getnextslot(
TableScanDesc sscan,
ScanDirection direction,
TupleTableSlot *slot
) {
+ struct MemScanDesc* mscan = (struct MemScanDesc*)sscan;
+ struct Table* table = NULL;
DEBUG_FUNC();
- if (done) {
+ ExecClearTuple(slot);
+
+ get_table(&table, mscan->rs_base.rs_rd);
+ if (table == NULL || mscan->cursor == table->nrows) {
return false;
}
- slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+ slot->tts_values[0] = Int32GetDatum(table->rows[mscan->cursor].columns[0].value);
slot->tts_isnull[0] = false;
ExecStoreVirtualTuple(slot);
- done = true;
-
+ mscan->cursor++;
return true;
}
Let's try it out.
$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
a
-----
23
101
(2 rows)
And there we have it. :)
So we tried one table and we tried a SELECT
without anything else.
What happens if we use more of SQL? Let's create another table
and try some more complex queries. Edit test.sql
:
DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;
CREATE TABLE y(b INT) USING mem;
INSERT INTO x VALUES (23), (101);
SELECT a FROM x;
SELECT a + 100 FROM x WHERE a = 23;
SELECT a, COUNT(1) FROM x GROUP BY a ORDER BY COUNT(1) DESC;
SELECT b FROM y;
Run it:
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE: drop cascades to 2 other objects
DETAIL: drop cascades to table x
drop cascades to table y
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
CREATE TABLE
INSERT 0 2
a
-----
23
101
(2 rows)
?column?
----------
123
(1 row)
a | count
-----+-------
23 | 1
101 | 1
(2 rows)
b
---
(0 rows)
Pretty sweet!
It would be neat to build a storage engine that reads from and writes to a CSV a la MySQL's CSV storage engine. Or a storage engine that uses RocksDB.
It would also be good to figure out how indexes work, how deletions
work, how updates and DDL beyond CREATE
works.
And I should probably contribute some of this to the table access method docs which are pretty sparse at the moment.
2023-10-19 08:00:00
King and I wrote a blog post about building an event-driven cross-platform IO library that used io_uring on Linux. We sketched out how it works at a high level but I hadn't yet internalized how you actually code with io_uring. So I strapped myself down this week and wrote some benchmarks to build my intuition about io_uring and other IO models.
I started with implementations in Go and ported them to Zig to make sure I had done the Go versions decently. And I got some help from King and other internetters to find some inefficiencies in my code.
This post will walk through my process, working up to increasingly efficient (and somewhat more complex) ways to write an entire file to disk with io_uring, from Go and Zig.
Notably, we're not going to fsync()
and we're not going to use
O_DIRECT
. So we won't be testing the entire IO pipeline from
userland to disk hardware but just how fast IO gets to the kernel. The
focus of this post is more on IO methods and using io_uring, not
absolute numbers.
All code for this post is available on GitHub.
This code is going to indirectly show some differences in timing
between Go and Zig. I couldn't care less about benchmarketing. And I
hope something about Zig vs Go is not what you take away from this
post either.
The goal is to build an intuition and be generally
correct. Observing the same relative behavior between
implementations across two languages helps me gain confidence that what
I'm doing is correct.
With normal blocking syscalls you just call read()
or write()
and
wait for the results. io_uring is one of Linux's more powerful
asynchronous IO offerings. Unlike epoll, you can use io_uring with
both files and network connections. And unlike epoll you can even have
the syscall run in the kernel.
To interact with io_uring, you register a submission queue for syscalls and their arguments. And you register a completion queue for syscall results.
You can batch many syscalls in one single call to io_uring, effectively turning up to N (4096 at most) syscalls into just one syscall. The kernel still does all the work of the N syscalls but you avoid some overhead.
As you check the completion queue and handle completed submissions, space frees up in the submission queue (all of it or just some), and you can add more submissions.
For a more complete understanding, check out the kernel document Efficient IO with io_uring.
io_uring is a complex, low-level interface. Shuveb Hussain has an excellent series on programming io_uring. But that was too low-level for me as I was trying to figure out how to just get something working.
Instead, most people use liburing or a ported version of it like the Zig standard library's io_uring.zig or Iceber's iouring-go.
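To make the submission/completion flow concrete, here's a minimal sketch of writing one buffer with the C liburing API directly (illustrative only: error handling is omitted, the file name is made up, and you'd link with -luring; the benchmarks below use the Go and Zig ports instead):
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <liburing.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0); // submission + completion queues with 8 entries
    int fd = open("out.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    char buf[] = "hello";

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // grab a free submission entry
    io_uring_prep_write(sqe, fd, buf, strlen(buf), 0);  // describe a write() at offset 0
    io_uring_submit(&ring); // one syscall submits everything queued so far

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe); // block until a completion arrives
    // cqe->res is what write() would have returned: bytes written, or -errno
    io_uring_cqe_seen(&ring, cqe); // mark the completion as consumed

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}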
io_uring started clicking for me when I tried out the iouring-go library. So we'll start there.
First off, let's set up some boilerplate for the Go and Zig code.
In main.go add:
package main
import (
"bytes"
"fmt"
"os"
"time"
)
func assert(b bool) {
if !b {
panic("assert")
}
}
const BUFFER_SIZE = 4096
func readNBytes(fn string, n int) []byte {
f, err := os.Open(fn)
if err != nil {
panic(err)
}
defer f.Close()
data := make([]byte, 0, n)
var buffer = make([]byte, BUFFER_SIZE)
for len(data) < n {
read, err := f.Read(buffer)
if err != nil {
panic(err)
}
data = append(data, buffer[:read]...)
}
assert(len(data) == n)
return data
}
func benchmark(name string, data []byte, fn func(*os.File)) {
fmt.Printf("%s", name)
f, err := os.OpenFile("out.bin", os.O_RDWR | os.O_CREATE | os.O_TRUNC, 0755)
if err != nil {
panic(err)
}
t1 := time.Now()
fn(f)
s := time.Now().Sub(t1).Seconds()
fmt.Printf(",%f,%f\n", s, float64(len(data))/s)
if err := f.Close(); err != nil {
panic(err)
}
assert(bytes.Equal(readNBytes("out.bin", len(data)), data))
}
And in main.zig add:
const std = @import("std");
const OUT_FILE = "out.bin";
const BUFFER_SIZE: u64 = 4096;
fn readNBytes(
allocator: *const std.mem.Allocator,
filename: []const u8,
n: usize,
) ![]const u8 {
const file = try std.fs.cwd().openFile(filename, .{});
defer file.close();
var data = try allocator.alloc(u8, n);
var buf = try allocator.alloc(u8, BUFFER_SIZE);
var written: usize = 0;
while (written < n) {
var nwritten = try file.read(buf);
@memcpy(data[written .. written + nwritten], buf[0..nwritten]);
written += nwritten;
}
std.debug.assert(data.len == n);
return data;
}
const Benchmark = struct {
t: std.time.Timer,
file: std.fs.File,
data: []const u8,
allocator: *const std.mem.Allocator,
fn init(
allocator: *const std.mem.Allocator,
name: []const u8,
data: []const u8,
) !Benchmark {
try std.io.getStdOut().writer().print("{s}", .{name});
var file = try std.fs.cwd().createFile(OUT_FILE, .{
.truncate = true,
});
return Benchmark{
.t = try std.time.Timer.start(),
.file = file,
.data = data,
.allocator = allocator,
};
}
fn stop(b: *Benchmark) void {
const s = @as(f64, @floatFromInt(b.t.read())) / std.time.ns_per_s;
std.io.getStdOut().writer().print(
",{d},{d}\n",
.{ s, @as(f64, @floatFromInt(b.data.len)) / s },
) catch unreachable;
b.file.close();
var in = readNBytes(b.allocator, OUT_FILE, b.data.len) catch unreachable;
std.debug.assert(std.mem.eql(u8, in, b.data));
b.allocator.free(in);
}
};
Now let's add the naive version of writing bytes to disk: calling
write()
repeatedly until all data has been written to disk.
In main.go
:
func main() {
size := 104857600 // 100MiB
data := readNBytes("/dev/random", size)
const RUNS = 10
for i := 0; i < RUNS; i++ {
benchmark("blocking", data, func(f *os.File) {
for i := 0; i < len(data); i += BUFFER_SIZE {
size := min(BUFFER_SIZE, len(data)-i)
n, err := f.Write(data[i : i+size])
if err != nil {
panic(err)
}
assert(n == size)
}
})
}
}
And in main.zig
:
pub fn main() !void {
var allocator = &std.heap.page_allocator;
const SIZE = 104857600; // 100MiB
var data = try readNBytes(allocator, "/dev/random", SIZE);
defer allocator.free(data);
const RUNS = 10;
var run: usize = 0;
while (run < RUNS) : (run += 1) {
{
var b = try Benchmark.init(allocator, "blocking", data);
defer b.stop();
var i: usize = 0;
while (i < data.len) : (i += BUFFER_SIZE) {
const size = @min(BUFFER_SIZE, data.len - i);
const n = try b.file.write(data[i .. i + size]);
std.debug.assert(n == size);
}
}
}
}
Let's build and run these programs and store the results to a CSV we can analyze with DuckDB.
Go first:
$ go build -o gomain main.go
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.07251540000000001s | 1.4GB/s |
And Zig:
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.0656907669s | 1.5GB/s |
Alright, we've got a baseline now and both language implementations are in the same ballpark.
Let's add a simple io_uring version!
The iouring-go library has really excellent documentation for getting started.
To keep it simple, we'll use io_uring with only 1 entry. Add the
following to func main()
after the existing benchmark()
call in
main.go
:
benchmark("io_uring", data, func(f * os.File) {
iour, err := iouring.New(1)
if err != nil {
panic(err)
}
defer iour.Close()
for i := 0; i < len(data); i += BUFFER_SIZE {
size := min(BUFFER_SIZE, len(data)-i)
prepRequest := iouring.Pwrite(int(f.Fd()), data[i : i+size], uint64(i))
res, err := iour.SubmitRequest(prepRequest, nil)
if err != nil {
panic(err)
}
<-res.Done()
i, err := res.ReturnInt()
if err != nil {
panic(err)
}
assert(size == i)
}
})
Note that benchmark opens out.bin with O_TRUNC, so each run starts with a fresh, empty file. And it also validates that the file contents are equivalent to the input data. So it validates the benchmark for correctness.
Alright, let's run this new Go implementation with io_uring!
$ go mod init gomain
$ go mod tidy
$ go build -o gomain main.go
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.0811486s | 1.3GB/s |
io_uring | 0.5083049999999999s | 213.2MB/s |
Well that looks terrible.
Let's port it to Zig to see if we notice the same behavior there.
There isn't an official Zig tutorial on io_uring I'm aware of. But io_uring.zig is easy enough to browse through. And there are tests in that file that also show how to use it.
And now that we've explored a bit in Go the basic gist should be similar:
Add the following to fn main()
after the existing benchmark block in
main.zig
:
{
var b = try Benchmark.init(allocator, "iouring", data);
defer b.stop();
const entries = 1;
var ring = try std.os.linux.IO_Uring.init(entries, 0);
defer ring.deinit();
var i: usize = 0;
while (i < data.len) : (i += BUFFER_SIZE) {
const size = @min(BUFFER_SIZE, data.len - i);
_ = try ring.write(0, b.file.handle, data[i .. i + size], i);
const submitted = try ring.submit_and_wait(1);
std.debug.assert(submitted == 1);
const cqe = try ring.copy_cqe();
std.debug.assert(cqe.err() == .SUCCESS);
std.debug.assert(cqe.res >= 0);
const n = @as(usize, @intCast(cqe.res));
std.debug.assert(n <= BUFFER_SIZE);
}
}
Now build and run:
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.06650093630000001s | 1.5GB/s |
io_uring | 0.17542890139999998s | 597.7MB/s |
Well it's similarly pretty bad. But our implementation ignores one major aspect of io_uring: batching requests.
Let's do some refactoring!
To support submitting N entries, we're going to have an inner loop running up to N that fills up N entries to io_uring.
Then we'll wait for the N submissions to complete and check their results.
We'll keep going until we write the entire file.
All of this can stay inside the loop in main
, I'm just dropping
preceding whitespace for nicer formatting here:
benchmarkIOUringNEntries := func (nEntries int) {
benchmark(fmt.Sprintf("io_uring_%d_entries", nEntries), data, func(f * os.File) {
iour, err := iouring.New(uint(nEntries))
if err != nil {
panic(err)
}
defer iour.Close()
requests := make([]iouring.PrepRequest, nEntries)
for i := 0; i < len(data); i += BUFFER_SIZE * nEntries {
submittedEntries := 0
for j := 0; j < nEntries; j++ {
base := i + j * BUFFER_SIZE
if base >= len(data) {
break
}
submittedEntries++
size := min(BUFFER_SIZE, len(data)-base)
requests[j] = iouring.Pwrite(int(f.Fd()), data[base : base+size], uint64(base))
}
if submittedEntries == 0 {
break
}
res, err := iour.SubmitRequests(requests[:submittedEntries], nil)
if err != nil {
panic(err)
}
<-res.Done()
for _, result := range res.ErrResults() {
_, err := result.ReturnInt()
if err != nil {
panic(err)
}
}
}
})
}
benchmarkIOUringNEntries(1)
benchmarkIOUringNEntries(128)
There are some specific things in there to notice.
First, toward the end of the file we may not have N
entries to
submit. We may have 1
or even 0
.
If we have 0
to submit, we need to not even submit anything
otherwise the Go library hangs. Similarly, if we don't slice
requests
to requests[:submittedEntries]
, the Go library will
segfault if submittedEntries < N
.
Other than that, let's build and run this!
$ go build -o gomain
$ ./gomain > go.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'go.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.0740368s | 1.4GB/s |
io_uring_128_entries | 0.127519s | 836.6MB/s |
io_uring_1_entries | 0.46831579999999995s | 226.9MB/s |
Now we're getting somewhere! Still half the throughput but a 4x improvement from using only a single entry.
Let's port the N entry code to Zig.
Unlike Go we can't do closures, so we'll have to make
benchmarkIOUringNEntries
a top-level function and keep the calls to
it in the loop in main
:
pub fn main() !void {
var allocator = &std.heap.page_allocator;
const SIZE = 104857600; // 100MiB
var data = try readNBytes(allocator, "/dev/random", SIZE);
defer allocator.free(data);
const RUNS = 10;
var run: usize = 0;
while (run < RUNS) : (run += 1) {
{
var b = try Benchmark.init(allocator, "blocking", data);
defer b.stop();
var i: usize = 0;
while (i < data.len) : (i += BUFFER_SIZE) {
const size = @min(BUFFER_SIZE, data.len - i);
const n = try b.file.write(data[i .. i + size]);
std.debug.assert(n == size);
}
}
try benchmarkIOUringNEntries(allocator, data, 1);
try benchmarkIOUringNEntries(allocator, data, 128);
}
}
And for the implementation itself, the only two big differences from
the first version are that we'll bulk-read completion events (cqe
s)
and that we'll create and wait for many submissions at once.
fn benchmarkIOUringNEntries(
allocator: *const std.mem.Allocator,
data: []const u8,
nEntries: u13,
) !void {
const name = try std.fmt.allocPrint(allocator.*, "iouring_{}_entries", .{nEntries});
defer allocator.free(name);
var b = try Benchmark.init(allocator, name, data);
defer b.stop();
var ring = try std.os.linux.IO_Uring.init(nEntries, 0);
defer ring.deinit();
var cqes = try allocator.alloc(std.os.linux.io_uring_cqe, nEntries);
defer allocator.free(cqes);
var i: usize = 0;
while (i < data.len) : (i += BUFFER_SIZE * nEntries) {
var submittedEntries: u32 = 0;
var j: usize = 0;
while (j < nEntries) : (j += 1) {
const base = i + j * BUFFER_SIZE;
if (base >= data.len) {
break;
}
submittedEntries += 1;
const size = @min(BUFFER_SIZE, data.len - base);
_ = try ring.write(0, b.file.handle, data[base .. base + size], base);
}
const submitted = try ring.submit_and_wait(submittedEntries);
std.debug.assert(submitted == submittedEntries);
const waited = try ring.copy_cqes(cqes[0..submitted], submitted);
std.debug.assert(waited == submitted);
for (cqes[0..submitted]) |*cqe| {
std.debug.assert(cqe.err() == .SUCCESS);
std.debug.assert(cqe.res >= 0);
const n = @as(usize, @intCast(cqe.res));
std.debug.assert(n <= BUFFER_SIZE);
}
}
}
Let's build and run:
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
blocking | 0.0674331114s | 1.5GB/s |
iouring_128_entries | 0.06773539590000001s | 1.5GB/s |
iouring_1_entries | 0.1855542556s | 569.9MB/s |
Huh, that's surprising! We caught up to blocking writes with io_uring in Zig, but not in Go, even though we made good progress in Go.
But we can do a bit better. We're doing batching, but the API is called "io_uring" not "io_batch". We're not even making use of the ring buffer behavior io_uring gives us!
We are waiting for all submitted results to complete. But there's no reason to do that. Instead we should submit as much as we can, without blocking on completions. We should handle completions as they happen, and keep submitting until all the data has been written, retrying when there's no space in the submission queue for the moment.
Unfortunately the Go library doesn't seem to expose this ring behavior of io_uring. Or I've missed it.
But we can do it in Zig. Let's go.
We need to change how we track which offsets we have submitted so far. We also need to keep the loop going until we are sure we have written all the data. And we need to stop blocking on the number we submitted; in fact, we never block at all.
fn benchmarkIOUringNEntries(
allocator: *const std.mem.Allocator,
data: []const u8,
nEntries: u13,
) !void {
const name = try std.fmt.allocPrint(allocator.*, "iouring_{}_entries", .{nEntries});
defer allocator.free(name);
var b = try Benchmark.init(allocator, name, data);
defer b.stop();
var ring = try std.os.linux.IO_Uring.init(nEntries, 0);
defer ring.deinit();
var cqes = try allocator.alloc(std.os.linux.io_uring_cqe, nEntries);
defer allocator.free(cqes);
var written: usize = 0;
var i: usize = 0;
while (i < data.len or written < data.len) {
var submittedEntries: u32 = 0;
var j: usize = 0;
while (true) {
const base = i + j * BUFFER_SIZE;
if (base >= data.len) {
break;
}
const size = @min(BUFFER_SIZE, data.len - base);
_ = ring.write(0, b.file.handle, data[base .. base + size], base) catch |e| switch (e) {
error.SubmissionQueueFull => break,
else => unreachable,
};
submittedEntries += 1;
i += size;
}
_ = try ring.submit_and_wait(0);
const cqesDone = try ring.copy_cqes(cqes, 0);
for (cqes[0..cqesDone]) |*cqe| {
std.debug.assert(cqe.err() == .SUCCESS);
std.debug.assert(cqe.res >= 0);
const n = @as(usize, @intCast(cqe.res));
std.debug.assert(n <= BUFFER_SIZE);
written += n;
}
}
}
The code got a bit simpler! Granted, we're omitting error handling.
Build and run:
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
iouring_128_entries | 0.06035423609999999s | 1.7GB/s |
iouring_1_entries | 0.0610197624s | 1.7GB/s |
blocking | 0.0671628515s | 1.5GB/s |
Not bad!
We've been writing 100MiB of data. Let's go up to 1GiB to see how that affects things. Ideally, the more data we write, the more we reflect realistic long-term results.
In main.zig
just change SIZE
to 1073741824
. Rebuild and run:
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
iouring_128_entries | 0.6063814535s | 1.7GB/s |
iouring_1_entries | 0.6167537295000001s | 1.7GB/s |
blocking | 0.6831747749s | 1.5GB/s |
No real difference, perfect!
Let's make one more change though. Let's up the BUFFER_SIZE
from
4KiB to 1MiB.
$ zig build-exe main.zig
$ ./main > zig.csv
$ duckdb -c "select column0 as method, avg(cast(column1 as double)) || 's' avg_time, format_bytes(avg(column2::double)::bigint) || '/s' as avg_throughput from 'zig.csv' group by column0 order by avg(cast(column1 as double)) asc"
method | avg_time | avg_throughput |
---|---|---|
iouring_128_entries | 0.2756831357s | 3.8GB/s |
iouring_1_entries | 0.27575404880000004s | 3.8GB/s |
blocking | 0.2833337046s | 3.7GB/s |
Hey that's an improvement!
All these numbers are machine-specific obviously. So what does an existing tool like fio say? (Assuming I'm using it correctly. I await your corrections!)
With a 4KiB buffer size:
$ fio --name=fiotest --rw=write --size=1G --bs=4k --group_reporting --ioengine=sync
fiotest: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sync, iodepth=1
fio-3.33
Starting 1 process
Jobs: 1 (f=1)
fiotest: (groupid=0, jobs=1): err= 0: pid=2437359: Thu Oct 19 23:33:42 2023
write: IOPS=282k, BW=1102MiB/s (1156MB/s)(1024MiB/929msec); 0 zone resets
clat (nsec): min=2349, max=54099, avg=2709.48, stdev=1325.83
lat (nsec): min=2390, max=54139, avg=2752.89, stdev=1334.62
clat percentiles (nsec):
| 1.00th=[ 2416], 5.00th=[ 2416], 10.00th=[ 2416], 20.00th=[ 2448],
| 30.00th=[ 2448], 40.00th=[ 2448], 50.00th=[ 2448], 60.00th=[ 2480],
| 70.00th=[ 2512], 80.00th=[ 2544], 90.00th=[ 2832], 95.00th=[ 3504],
| 99.00th=[ 5792], 99.50th=[15296], 99.90th=[19584], 99.95th=[20096],
| 99.99th=[22656]
bw ( KiB/s): min=940856, max=940856, per=83.36%, avg=940856.00, stdev= 0.00, samples=1
iops : min=235214, max=235214, avg=235214.00, stdev= 0.00, samples=1
lat (usec) : 4=97.22%, 10=2.03%, 20=0.71%, 50=0.04%, 100=0.01%
cpu : usr=17.35%, sys=82.11%, ctx=26, majf=0, minf=11
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=1102MiB/s (1156MB/s), 1102MiB/s-1102MiB/s (1156MB/s-1156MB/s), io=1024MiB (1074MB), run=929-929msec
1.2GB/s is about in the ballpark of what we got.
And with a 1MiB buffer size?
$ fio --name=fiotest --rw=write --size=1G --bs=1M --group_reporting --ioengine=sync
fiotest: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=sync, iodepth=1
fio-3.33
Starting 1 process
fiotest: Laying out IO file (1 file / 1024MiB)
fiotest: (groupid=0, jobs=1): err= 0: pid=2437239: Thu Oct 19 23:32:09 2023
write: IOPS=3953, BW=3954MiB/s (4146MB/s)(1024MiB/259msec); 0 zone resets
clat (usec): min=221, max=1205, avg=241.83, stdev=43.93
lat (usec): min=228, max=1250, avg=251.68, stdev=45.80
clat percentiles (usec):
| 1.00th=[ 225], 5.00th=[ 225], 10.00th=[ 227], 20.00th=[ 227],
| 30.00th=[ 231], 40.00th=[ 233], 50.00th=[ 235], 60.00th=[ 239],
| 70.00th=[ 243], 80.00th=[ 249], 90.00th=[ 262], 95.00th=[ 269],
| 99.00th=[ 302], 99.50th=[ 318], 99.90th=[ 1074], 99.95th=[ 1205],
| 99.99th=[ 1205]
lat (usec) : 250=80.96%, 500=18.85%
lat (msec) : 2=0.20%
cpu : usr=4.26%, sys=94.96%, ctx=3, majf=0, minf=10
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,1024,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1
Run status group 0 (all jobs):
WRITE: bw=3954MiB/s (4146MB/s), 3954MiB/s-3954MiB/s (4146MB/s-4146MB/s), io=1024MiB (1074MB), run=259-259msec
3.9GB/s is also roughly in the same ballpark we got.
Our code seems reasonable!
None of this is original. fio
is a similar tool, written in C, with
many different IO engines including libaio
and writev
support. And
it has many different IO workloads.
But it's been enjoyable to learn more about these APIs: how to program them and how they compare to each other.
So next steps could include adding additional IO engines or IO workloads.
Also, either I need to understand Iceber's Go library better or its API needs to be loosened up a little bit so we can get that awesome ring buffer behavior we could use from Zig.
Keep an eye out here and on my io-playground repo!
One more next step: fsync()-ing, if you want to test the entire IO subsystem and not just hitting the kernel cache.
2023-10-05 08:00:00
The most popular SQLite and PostgreSQL database drivers in Go are
(roughly) 20-76% slower than alternative Go drivers on insert-heavy
benchmarks of mine. So if you are bulk-inserting data with Go (and
potentially also bulk-retrieving data with Go), you may want to
consider the driver carefully. And you may want to consider avoiding
database/sql
.
Some driver authors have noted and benchmarked issues with database/sql.
So it may be the case that database/sql
is responsible for some of
this overhead. And indeed the variations between drivers in this post
will be demonstrated by using database/sql
and avoiding it. This post
won't specifically prove that the variation is due to the
database/sql
interface. But that doesn't change the premise.
Not covered in this post but something to consider:
JetBrains has
suggested that other frontends like sqlc, sqlx, and GORM do
worse than database/sql
.
This post is built on the workload, environment, libraries, and methodology in my databases-intuition repo on GitHub. See the repo for details that will help you reproduce or correct me.
In this workload, the data is random and there are no indexes. Neither of these aspects matters for this post though, because we're comparing behavior within the same database among different drivers. This was just a workload I already had.
Two different data sizes are tested: 10 million rows with 16 columns of 32 bytes each, and 10 million rows with 3 columns of 8 bytes each.
Each test is run 10 times and we record median, standard deviation, min, max and throughput.
Both variations presented here load 10M rows using a single prepared statement called for each row within a single transaction.
The most popular driver is mattn/go-sqlite3.
It is roughly 20-40% slower than another driver that avoids
database/sql
.
10M Rows, 16 columns, each column 32 bytes:
Timing: 56.53 ± 1.26s, Min: 55.05s, Max: 59.62s
Throughput: 176,893.65 ± 3,853.90 rows/s, Min: 167,719.97 rows/s, Max: 181,646.02 rows/s
10M Rows, 3 columns, each column 8 bytes:
Timing: 15.92 ± 0.25s, Min: 15.69s, Max: 16.67s
Throughput: 628,044.37 ± 9,703.92 rows/s, Min: 599,852.91 rows/s, Max: 637,435.60 rows/s
The other driver I tested is my own fork of bvinc/go-sqlite-lite called eatonphil/gosqlite. I forked it because it is unmaintained and I wanted to bring it up-to-date for tests like this.
10M Rows, 16 columns, each column 32 bytes:
Timing: 45.51 ± 0.70s, Min: 43.72s, Max: 45.93s
Throughput: 219,729.65 ± 3,447.56 rows/s, Min: 217,742.98 rows/s, Max: 228,711.51 rows/s
10M Rows, 3 columns, each column 8 bytes:
Timing: 10.44 ± 0.20s, Min: 10.02s, Max: 10.68s
Throughput: 957,939.60 ± 18,879.43 rows/s, Min: 936,114.60 rows/s, Max: 998,426.62 rows/s
Both variations presented use PostgreSQL's COPY FROM support. This is significantly faster for PostgreSQL than doing the prepared statement we do in SQLite. (Here are my results for doing prepared statement INSERTs in PostgreSQL if you are curious.)
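For reference, here is a minimal sketch of COPY FROM via jackc/pgx without database/sql. The connection string, table, and column names are placeholders, not the benchmark's actual setup:
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	// Placeholder connection string; point it at your own database.
	conn, err := pgx.Connect(ctx, "postgres://localhost:5432/bench")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// A tiny stand-in for the benchmark's 10M random rows.
	rows := [][]any{
		{"a", "b", "c"},
		{"d", "e", "f"},
	}

	// COPY FROM through pgx directly; no database/sql in the path.
	copied, err := conn.CopyFrom(
		ctx,
		pgx.Identifier{"t"},
		[]string{"c1", "c2", "c3"},
		pgx.CopyFromRows(rows),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("copied %d rows", copied)
}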
The most popular PostgreSQL driver is lib/pq. The performance issues with lib/pq are well-known, and the repo itself is marked as no longer developed.
It is roughly 44-76% slower than an alternative driver that avoids database/sql.
10M Rows, 16 columns, each column 32 bytes:
Timing: 104.53 ± 2.40s, Min: 102.57s, Max: 110.08s
Throughput: 95,665.37 ± 2,129.25 rows/s, Min: 90,847.08 rows/s, Max: 97,490.96 rows/s
10M Rows, 3 columns, each column 8 bytes:
Timing: 8.16 ± 0.43s, Min: 7.44s, Max: 8.80s
Throughput: 1,225,986.47 ± 66,631.53 rows/s, Min: 1,136,581.82 rows/s, Max: 1,343,441.37 rows/s
The other driver I tested is jackc/pgx, without database/sql.
10M Rows, 16 columns, each column 32 bytes:
Timing: 46.54 ± 1.60s, Min: 44.09s, Max: 49.51s
Throughput: 214,869.42 ± 7,265.10 rows/s, Min: 201,991.37 rows/s, Max: 226,801.07 rows/s
10M Rows, 3 columns, each column 8 bytes:
Timing: 5.20 ± 0.44s, Min: 4.71s, Max: 5.96s
Throughput: 1,923,722.79 ± 156,820.46 rows/s, Min: 1,676,894.32 rows/s, Max: 2,124,966.60 rows/s
The discrepancies here are even greater than with the different SQLite drivers.
I won't go into as much detail but if you're doing queries that don't return many rows, the difference between drivers is negligible.
See here for details.
If you are doing INSERT-heavy workloads, or you are processing large numbers of rows returned from your SQL database, you might want to try benchmarking the same workload with different drivers.
And specifically, there is likely no good reason to use lib/pq anymore for accessing PostgreSQL from Go. Just use jackc/pgx.
For INSERT-heavy workloads in Go, you may want to switch database drivers. For PostgreSQL and SQLite, the popular drivers are 20-76% slower for this workload in my tests.
— Phil Eaton (@eatonphil) October 6, 2023
Some driver developers have reported issues with database/sql as an interface.https://t.co/NLVp0P2uiV pic.twitter.com/RxTbgMZ1MG
2023-10-01 08:00:00
How software fails is interesting. But real-world errors can be infrequent to manifest. Fault injection is a formal-sounding term that just means: trying to explicitly trigger errors in the hopes of discovering bad logic, typically during automated tests.
Jepsen and ChaosMonkey are two famous examples that help to trigger process and network failure. But what about disk and filesystem errors?
A few avenues seem worth investigating: an LD_PRELOAD layer, FUSE, ptrace, a SECCOMP_RET_TRAP interception layer, and symbolic analysis.
I would like to try out FUSE sometime. But an LD_PRELOAD layer only works if IO goes through libc, which won't be the case for all programs. ptrace is something I've wanted to dig into for years since learning about gvisor.
SECCOMP_RET_TRAP
doesn't have the same high-level guides that ptrace
does so maybe I'll dig into it later. And symbolic analysis might be
able to detect bad workloads but it also isn't fault injection. Maybe
it's the better idea but fault injection just sounds more fun.
So this particular post will cover intercepting system calls (syscalls) using ptrace with code written in Zig. Not because readers will likely write their own code in Zig but because hopefully the Zig code will be easier for you to read and adapt to your language compared to if we had to deal with the verbosity and inconvenience of C.
In the end, we'll be able to intercept and force short (incomplete) writes in a Go, Python, and C program. Emulating a disk that is having an issue completing the write. This is a case that isn't common, but should probably be handled with retries in production code.
This post corresponds roughly to this commit on GitHub.
First off, let's write some code for a program that would exhibit a short write. Basically, we write to a file and don't check how many bytes we wrote. This is extremely common code; or at least I've written it often.
$ cat test/main.go
package main
import (
"os"
)
func main() {
f, err := os.OpenFile("test.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
if err != nil {
panic(err)
}
text := "some great stuff"
_, _ = f.Write([]byte(text))
_ = f.Close()
}
With this code, if the Write()
call doesn't actually succeed in
writing everything, we won't know that. And the file will contain less
than all of some great stuff
.
This logical mistake will happen rarely, if ever, on a normal disk. But it is possible.
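For contrast, here is a sketch (not from the original program) of checking the count that Write() returns; production code would probably also retry the remaining bytes:
package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.OpenFile("test.txt", os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0755)
	if err != nil {
		panic(err)
	}
	text := "some great stuff"
	n, err := f.Write([]byte(text))
	if err != nil {
		panic(err)
	}
	if n != len(text) {
		// A real program might retry writing text[n:] here.
		fmt.Printf("short write: wrote %d of %d bytes\n", n, len(text))
	}
	if err := f.Close(); err != nil {
		panic(err)
	}
}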
Now that we've got an example program in mind, let's see if we can trigger the logic error.
ptrace is a somewhat cross-platform layer that allows you to intercept syscalls in a process. You can read and modify memory and registers in the process, both when the syscall starts and before it finishes.
gdb and strace both use ptrace for their magic.
Google's gvisor that powers various serverless runtimes in Google
Cloud was also
historically based on ptrace (PTRACE_SYSEMU
specifically, which we
won't explore much in this post).
Interestingly though,
gvisor switched
only this year (2023) to a different default backend for
trapping system calls. Based
on SECCOMP_RET_TRAP
.
You can get similar vibes
from this
Brendan Gregg post on the dangers of using strace (that is based
on ptrace) in production.
Although ptrace is cross-platform, actually writing cross-platform-aware code with ptrace can be complex. So this post assumes amd64/linux.
The ptrace protocol is described in the ptrace manpage, but Chris Wellons and a University of Alberta group also wrote nice introductions. I referenced these three pages heavily.
Here's what the UAlberta page has to say:
We fork and have the child call PTRACE_TRACEME
. Then we handle each
syscall entrance by calling PTRACE_SYSCALL
and waiting with wait
until the child has entered the syscall. It is in this moment we can
mess with things.
Let's turn that flow into Zig code.
const std = @import("std");
const c = @cImport({
@cInclude("sys/ptrace.h");
@cInclude("sys/user.h");
@cInclude("sys/wait.h");
@cInclude("errno.h");
});
const cNullPtr: ?*anyopaque = null;
// TODO //
pub fn main() !void {
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
defer arena.deinit();
var args = try std.process.argsAlloc(arena.allocator());
std.debug.assert(args.len >= 2);
const pid = try std.os.fork();
if (pid < 0) {
std.debug.print("Fork failed!\n", .{});
return;
} else if (pid == 0) {
// Child process
_ = c.ptrace(c.PTRACE_TRACEME, pid, cNullPtr, cNullPtr);
return std.process.execv(
arena.allocator(),
args[1..],
);
} else {
// Parent process
const childPid = pid;
_ = c.waitpid(childPid, 0, 0);
var cm = ChildManager{ .arena = &arena, .childPid = childPid };
try cm.childInterceptSyscalls();
}
}
So like the graphic suggested, we fork and start a child process. That means this Zig program should be called like:
$ zig build-exe --library c main.zig
$ ./main /actual/program/to/intercept --and --its args
Presumably, as with strace or gdb, we could instead attach to an
already running process with PTRACE_ATTACH
or PTRACE_SEIZE
(based
on the ptrace
manpage) rather
than going the PTRACE_TRACEME
route. But I haven't tried that out
yet.
With the child ready to be intercepted, we can implement the
ChildManager
that actually does the interception.
The core of the ChildManager
is an infinite loop (at least as long
as the child process lives) that waits for the next syscall and then
calls a hook for the system call if it exists.
const ChildManager = struct {
arena: *std.heap.ArenaAllocator,
childPid: std.os.pid_t,
// TODO //
fn childInterceptSyscalls(
cm: *ChildManager,
) !void {
while (true) {
// Handle syscall entrance
const status = cm.childWaitForSyscall();
if (std.os.W.IFEXITED(status)) {
break;
}
var args: ABIArguments = cm.getABIArguments();
const syscall = args.syscall();
for (hooks) |hook| {
if (syscall == hook.syscall) {
try hook.hook(cm.*, &args);
}
}
}
}
};
Later we'll write a hook for the sys_write
syscall that
will force an incomplete write.
Back to the protocol, childWaitForSyscall
will call PTRACE_SYSCALL
to allow the child process to start up again and continue until the
next syscall. We'll follow that by wait
-ing for the child
process to be stopped again so we can handle the syscall entrance.
fn childWaitForSyscall(cm: ChildManager) u32 {
var status: i32 = 0;
_ = c.ptrace(c.PTRACE_SYSCALL, cm.childPid, cNullPtr, cNullPtr);
_ = c.waitpid(cm.childPid, &status, 0);
return @bitCast(status);
}
Now that we've intercepted a syscall (after waitpid finishes blocking), we need to figure out what syscall it was. We do this by calling PTRACE_GETREGS and reading the rax register, which according to the amd64/linux calling convention holds the syscall number.
PTRACE_GETREGS
fills out the following
struct:
struct user_regs_struct
{
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long rbp;
unsigned long rbx;
unsigned long r11;
unsigned long r10;
unsigned long r9;
unsigned long r8;
unsigned long rax;
unsigned long rcx;
unsigned long rdx;
unsigned long rsi;
unsigned long rdi;
unsigned long orig_rax;
unsigned long rip;
unsigned long cs;
unsigned long eflags;
unsigned long rsp;
unsigned long ss;
unsigned long fs_base;
unsigned long gs_base;
unsigned long ds;
unsigned long es;
unsigned long fs;
unsigned long gs;
};
Let's write a little amd64/linux-specific wrapper for accessing meaningful fields.
const ABIArguments = struct {
regs: c.user_regs_struct,
fn nth(aa: ABIArguments, i: u8) c_ulong {
std.debug.assert(i < 4);
return switch (i) {
0 => aa.regs.rdi,
1 => aa.regs.rsi,
2 => aa.regs.rdx,
else => unreachable,
};
}
fn setNth(aa: *ABIArguments, i: u8, value: c_ulong) void {
std.debug.assert(i < 4);
switch (i) {
0 => { aa.regs.rdi = value; },
1 => { aa.regs.rsi = value; },
2 => { aa.regs.rdx = value; },
else => unreachable,
}
}
fn result(aa: ABIArguments) c_ulong { return aa.regs.rax; }
fn setResult(aa: *ABIArguments, value: c_ulong) void {
aa.regs.rax = value;
}
fn syscall(aa: ABIArguments) c_ulong { return aa.regs.orig_rax; }
};
One thing to note is that the field we read to get rax
is not
aa.regs.rax
but aa.regs.orig_rax
. This is because rax
is also
the return value and PTRACE_SYSCALL
gets called twice for some
syscalls on entrance and exit. The orig_rax
field preserves the
original rax
value on syscall entrance. You can read more about this
here.
Now let's write the ChildManager
code that actually calls
PTRACE_GETREGS
to fill out one of these structs.
fn getABIArguments(cm: ChildManager) ABIArguments {
var args = ABIArguments{ .regs = undefined };
_ = c.ptrace(c.PTRACE_GETREGS, cm.childPid, cNullPtr, &args.regs);
return args;
}
Setting registers is similar; we just pass the struct back and call
PTRACE_SETREGS
instead:
fn setABIArguments(cm: ChildManager, args: *ABIArguments) void {
_ = c.ptrace(c.PTRACE_SETREGS, cm.childPid, cNullPtr, &args.regs);
}
Now we finally have enough code to write a hook that can get and set registers; i.e. manipulate a system call!
We'll start by registering a sys_write
hook in the hooks
field we
check in childInterceptSyscalls
above.
const hooks = &[_]struct {
syscall: c_ulong,
hook: *const fn (ChildManager, *ABIArguments) anyerror!void,
}{.{
.syscall = @intFromEnum(std.os.linux.syscalls.X64.write),
.hook = writeHandler,
}};
If we look at the manpage for write we see it takes three arguments. Going back to the calling convention, that means the fd will be in rdi, the data address in rsi, and the data length in rdx.
So if we shorten the data length, we should be creating a short (incomplete) write.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
const fd = entryArgs.nth(0);
const dataAddress = entryArgs.nth(1);
var dataLength = entryArgs.nth(2);
// Truncate some bytes
if (dataLength > 2) {
dataLength -= 2;
entryArgs.setNth(2, dataLength);
cm.setABIArguments(entryArgs);
}
}
In a more sophisticated version of this program, we could randomly decide when to truncate data and randomly decide how much data to truncate. However, for our purposes this is sufficient.
But there are some real problems with this code. When I ran this program against a basic Go program, I saw duplicate requests.
Ah ok, PTRACE_SYSCALL gets hit when you both enter and exit a syscall.
— Phil Eaton (@eatonphil) September 29, 2023
So each time you call PTRACE_SYSCALL and you do stuff, you just call it again afterwards to handle/wait for the exit. pic.twitter.com/PjmNwcMepG
So the deal with PTRACE_SYSCALL
is that for (most?) syscalls, you
get to modify data before the data actually is handled by the
kernel. And you get to modify data after the kernel has finished the
syscall too.
This makes sense because PTRACE_SYSCALL
(unlike PTRACE_SYSEMU
)
allows the syscall to actually happen. And if we wanted to, for
example, modify the syscall exit code, we'd have to do that after the
syscall was done not before it started. We are modifying registers
directly after all.
All this means for our Zig code is that when we handle sys_write
, we
need to call PTRACE_SYSCALL
again to process the syscall
exit. Otherwise we'd reach this writeHandler
for both entries and
exits, which would require some additional way of disambiguating
entrances from exits.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
const fd = entryArgs.nth(0);
const dataAddress = entryArgs.nth(1);
var dataLength = entryArgs.nth(2);
// Truncate some bytes
if (dataLength > 2) {
dataLength -= 2;
entryArgs.setNth(2, dataLength);
cm.setABIArguments(entryArgs);
}
const data = try cm.childReadData(dataAddress, dataLength);
defer data.deinit();
std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });
// Handle syscall exit
_ = cm.childWaitForSyscall();
}
We could put the cm.childWaitForSyscall()
waiting for
the syscall exit in the main loop and I did try that at
first. However, not all syscalls seemed to have the same entry and
exit hook and this resulted in the hooks sometimes starting with a
syscall exit rather than a syscall entry. So rather than making the
code more complicated, I decided to only wait for the exit on
syscalls I knew had an exit (by observation at least), like
sys_write
.
So I had this code as is, correctly handling syscall entrances and exits, but I was seeing multiple write calls. And the text file I was writing to had the complete text I wanted to write. There was no short write even though I truncated the data length.
Ok so what happens in this Go program if I truncate the amount of data?
— Phil Eaton (@eatonphil) September 29, 2023
I assumed Go would do nothing since all I did was call `f.Write()` once and `f.Write()` returns a number of bytes written.
But actually, it still writes everything! pic.twitter.com/OSalKEbERM
This took some digging into Go source code to understand. If you trace
what os.File.Write()
does on Linux you eventually get to
src/internal/poll/fd_unix.go:
// Write implements io.Writer.
func (fd *FD) Write(p []byte) (int, error) {
if err := fd.writeLock(); err != nil {
return 0, err
}
defer fd.writeUnlock()
if err := fd.pd.prepareWrite(fd.isFile); err != nil {
return 0, err
}
var nn int
for {
max := len(p)
if fd.IsStream && max-nn > maxRW {
max = nn + maxRW
}
n, err := ignoringEINTRIO(syscall.Write, fd.Sysfd, p[nn:max])
if n > 0 {
nn += n
}
if nn == len(p) {
return nn, err
}
if err == syscall.EAGAIN && fd.pd.pollable() {
if err = fd.pd.waitWrite(fd.isFile); err == nil {
continue
}
}
if err != nil {
return nn, err
}
if n == 0 {
return nn, io.ErrUnexpectedEOF
}
}
}
This might be common knowledge but I didn't realize Go did this. And
when I tried out the same basic program in Python and even C, the
behavior was the same. The builtin write()
behavior on a file (in
many languages apparently) is to retry until all data is written, with
some exceptions.
This makes sense since files on disk, unlike file descriptors backed by network sockets, are generally always available. Compared to a network connection, disks are physically close and almost always stay connected. (With some obvious exceptions like network-attached storage and thumb drives.)
So to trigger the short write, the easiest way seems to have the
sys_write
call return an error that is NOT EAGAIN
since the code
will retry if that is the error.
After looking through the list of errors that sys_write can
return,
EIO
seems like a nice one.
So let's do our final version of writeHandler
and on the syscall
exit, we'll modify the return value (rax
in amd64/linux) to be
EIO
.
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
const fd = entryArgs.nth(0);
const dataAddress = entryArgs.nth(1);
var dataLength = entryArgs.nth(2);
// Truncate some bytes
if (dataLength > 2) {
dataLength -= 2;
entryArgs.setNth(2, dataLength);
cm.setABIArguments(entryArgs);
}
// Handle syscall exit
_ = cm.childWaitForSyscall();
var exitArgs = cm.getABIArguments();
dataLength = exitArgs.nth(2);
if (dataLength > 2) {
// Force the writes to stop after the first one by returning EIO.
var result: c_ulong = 0;
result = result -% c.EIO;
exitArgs.setResult(result);
cm.setABIArguments(&exitArgs);
}
}
Let's give it a whirl!
Build the Zig fault injector and the Go test code:
$ zig build-exe --library c main.zig
$ ( cd test && go build main.go )
And run:
$ ./main test/main
And check test.txt
:
$ cat test.txt
some great stu
Hey, that's a short write! :)
We accomplished everything we set out to, but there's one other useful thing we can do: reading the actual data passed to the write syscall.
Just like how we can get the child process registers with
PTRACE_GETREGS
, we can read child memory with
PTRACE_PEEKDATA
. PTRACE_PEEKDATA
takes the child process id and
the memory address in the child to read from. It returns a word of
data (which on amd64/linux is 8 bytes).
We can use the syscall arguments (data address and length) to keep
calling PTRACE_PEEKDATA
on the child until we've read all bytes of
the data the child process wanted to write:
fn childReadData(
cm: ChildManager,
address: c_ulong,
length: c_ulong,
) !std.ArrayList(u8) {
var data = std.ArrayList(u8).init(cm.arena.allocator());
while (data.items.len < length) {
var word = c.ptrace(
c.PTRACE_PEEKDATA,
cm.childPid,
address + data.items.len,
cNullPtr,
);
for (std.mem.asBytes(&word)) |byte| {
if (data.items.len == length) {
break;
}
try data.append(byte);
}
}
return data;
}
And we could modify writeHandler
to print out the entirety of the write message each time (for debugging):
fn writeHandler(cm: ChildManager, entryArgs: *ABIArguments) anyerror!void {
const fd = entryArgs.nth(0);
const dataAddress = entryArgs.nth(1);
var dataLength = entryArgs.nth(2);
// Truncate some bytes
if (dataLength > 2) {
dataLength -= 2;
entryArgs.setNth(2, dataLength);
cm.setABIArguments(entryArgs);
}
const data = try cm.childReadData(dataAddress, dataLength);
defer data.deinit();
std.debug.print("Got a write on {}: {s}\n", .{ fd, data.items });
// Handle syscall exit
_ = cm.childWaitForSyscall();
var exitArgs = cm.getABIArguments();
dataLength = exitArgs.nth(2);
if (dataLength > 2) {
// Force the writes to stop after the first one by returning EIO.
var result: c_ulong = 0;
result = result -% c.EIO;
exitArgs.setResult(result);
cm.setABIArguments(&exitArgs);
}
}
That's pretty neat!
Short writes are just one of many bad IO interactions. Another fun one would be to completely buffer all writes on a file descriptor (not allowing anything to be written to disk at all) until fsync is called on the file descriptor. Or forcing fsyncs to fail.
An interesting optimization would be to apply seccomp
filters
so that rather than paying a penalty for watching every syscall, I
only get notified about the ones I have hooks for like
sys_write
. Here's another
post
that explores ptrace with seccomp filters.
Credits: Thank you Charlie Cummings and Paul Khuong for reviewing a draft of this post!
You could also use process_vm_readv instead of PTRACE_PEEKDATA to read memory from the tracee process.
Fault injection is a scary-sounding term. Intercepting and modifying Linux system calls sounds scary too.
— Phil Eaton (@eatonphil) October 1, 2023
But it's a neat way to trigger logical errors in programs, to build confidence we wrote code correctly.
Let's trigger short writes to disk in Zig!https://t.co/0C3tWt3vtT pic.twitter.com/OS7auDe8jR
2023-09-21 08:00:00
Databases are fun. They sit at the confluence of Computer Science topics that might otherwise not seem practical in life as a developer. For example, every database with a query language is also a programming language implementation of some caliber. That doesn't include all databases though of course; see: RocksDB, FoundationDB, TigerBeetle, etc.
This post looks at how various databases execute expressions in their query language.
tldr; Most surveyed databases use a tree-walking interpreter. A few use stack- or register-based virtual machines. A couple have just-in-time compilers. And, tangentially, a few do vectorized interpretation.
Throughout this post I'll use "virtual machine" as a shorthand for stack- or register-based loops that process a linearized set of instructions. I say this since it is sometimes fair to call a tree-walking interpreter a virtual machine. But that is not what I mean when I say virtual machine in this post.
Programming languages are typically implemented by turning an Abstract Syntax Tree (AST) into a linear set of instructions for a virtual machine (e.g. CPython, Java, C#) or native code (e.g. GCC's C compiler, Go, Rust). Some of the former implementations also generate and run Just-In-Time (JIT) compiled native code (e.g. Java and C#).
Less commonly these days in programming languages does the implementation interpret off the AST or some other tree-like intermediate representation. This style is often called tree-walking.
Shell languages sometimes do tree-walking. Otherwise, implementations that interpret directly off of a tree normally do so as a short-term measure before switching to compiled virtual machine code or JIT-ed native code (e.g. some JavaScript implementations, GraalVM, RPython, etc.)
That is, while some major programming language implementations started out with tree-walking interpreters, they mostly moved away from solely tree-walking over a decade ago. See JSC in 2008, Ruby in 2007, etc.
My intuition is that tree-walking takes up more memory and is less cache-friendly than the linear instructions you give to a virtual machine or to your CPU. There are some folks who disagree, but they mostly talk about tree-walking when you've also got a JIT compiler hooked up. Which isn't quite the same thing. There has also been some early exploration and improvements reported when tree-walking with a tree organized as an array.
Databases often interpret directly off a tree. (It isn't, generally speaking, fair to say they are AST-walking interpreters because databases typically transform and optimize beyond just an AST as parsed from user code.)
But not all databases interpret a tree. Some have a virtual machine. And some generate and run JIT-ed native code.
If a core function (in the query execution path that does something like arithmetic or comparison) returns a value, that's a sign it's a tree-walking interpreter. Or, if you see code that is evaluating its arguments during execution, that's also a sign of a tree-walking interpreter.
On the other hand, if the function mutates internal state such as by assigning a value to a context or pushing to a stack, that's a sign it's a stack- or register-based virtual machine. If a function pulls its arguments from memory and doesn't evaluate the arguments, that's also an indication it's a stack- or register-based virtual machine.
This approach can result in false-positives depending on the
architecture of the interpreter. User-defined functions (UDFs) would
probably accept evaluated arguments and return a value regardless of
how the interpreter is implemented. So it's important to find not just
functions that could be implemented like UDFs, but core builtin
behavior. Control flow implementations of functions like if
or
case
can be great places to look.
And tactically, I clone the source code and run stuff like git grep -i eval | grep -v test | grep \\.java | grep -i eval or git grep -i expr | grep -v test | grep \\.go | grep -i expr until I convince myself I'm somewhere interesting.
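To make those two shapes concrete, here is a toy sketch in Go (not taken from any of the surveyed databases) of the same expression evaluated both ways: a tree-walker whose nodes evaluate their own children and return values, and a stack-based virtual machine whose instructions pull already-evaluated operands off a stack:
package main

import "fmt"

// Tree-walking: each node evaluates its children and returns a value.
type Node interface{ Eval() int64 }

type Lit struct{ V int64 }
type Add struct{ L, R Node }

func (l Lit) Eval() int64 { return l.V }

// Add evaluates its own arguments during execution: a tree-walker hallmark.
func (a Add) Eval() int64 { return a.L.Eval() + a.R.Eval() }

// Stack-based VM: a loop over linear instructions mutating a stack.
type Op struct {
	Code string // "push" or "add"
	Arg  int64
}

func run(prog []Op) int64 {
	var stack []int64
	for _, op := range prog {
		switch op.Code {
		case "push":
			stack = append(stack, op.Arg)
		case "add":
			// Operands were evaluated by earlier instructions; just pop them.
			r, l := stack[len(stack)-1], stack[len(stack)-2]
			stack = append(stack[:len(stack)-2], l+r)
		}
	}
	return stack[0]
}

func main() {
	// 1 + (2 + 3) in both styles.
	tree := Add{Lit{1}, Add{Lit{2}, Lit{3}}}
	fmt.Println(tree.Eval()) // 6

	prog := []Op{{"push", 1}, {"push", 2}, {"push", 3}, {"add", 0}, {"add", 0}}
	fmt.Println(run(prog)) // 6
}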
Note: In talking about a broad swath of projects, maybe I've misunderstood one or some. If you've got a correction, let me know! If there's a proprietary database you work on where you can link to the (publicly described) execution strategy, feel free to pass it along! Or if I'm missing your public-source database in this list, send me a message!
Judging by functions like func (e *evaluator) EvalBinaryExpr, which evaluates the left-hand side and then the right-hand side and returns a value, Cockroach looks like a tree-walking interpreter.
It gets a little more interesting though, since Cockroach also supports vectorized expression execution. Vectorizing is a fancy term for acting on many pieces of data at once rather than one at a time. It doesn't necessarily imply SIMD. Here is an example of a vectorized addition of two int64 columns.
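In the same spirit (and not Cockroach's actual code), here is a toy sketch of what batched, column-at-a-time addition looks like compared to per-row evaluation:
package main

import "fmt"

// addInt64Columns acts on whole column batches (slices) at once. The
// per-row interpretation overhead is paid once per batch rather than
// once per row, and the tight loop is friendly to the CPU even without SIMD.
func addInt64Columns(dst, a, b []int64) {
	for i := range dst {
		dst[i] = a[i] + b[i]
	}
}

func main() {
	a := []int64{1, 2, 3, 4}
	b := []int64{10, 20, 30, 40}
	dst := make([]int64, len(a))
	addInt64Columns(dst, a, b)
	fmt.Println(dst) // [11 22 33 44]
}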
The ClickHouse architecture is a little unique and difficult for me to read through – likely due to it being fairly mature, with serious optimization. But they tend to document their header files well. So files like src/Functions/IFunction.h and src/Interpreters/ExpressionActions.h were helpful.
They have also spoken publicly about their pipeline execution model; e.g. this presentation and this roadmap issue. But it isn't completely clear how much pipeline execution (which is broader than just expression evaluation) connects to expression evaluation.
Moreover, they have publicly
spoken
about their support for JIT compilation for query execution. But let's
look at how execution works when the JIT is not enabled. For example, if we take a look at how if
is
implemented,
we know that the then
and else
rows must be conditionally
evaluated.
In the runtime entrypoint,
executeImpl
,
we see the function call
executeShortCircuitArguments
which in turn calls
ColumnFunction::reduce()
which evaluates each column vector that is an
argument
to the function and then calls execute on the function.
So from this we can tell the non-JIT execution is a tree walker and that it is over chunks of columns, i.e. vectorized data, similar to Cockroach. However in ClickHouse execution is always over column vectors.
In the original version of this post, I had some confusion about the ClickHouse execution strategy. Robert Schulze from ClickHouse helped clarify things for me. Thanks Robert!
If we take a look at how function expressions are executed, we can see each argument in the function being evaluated before being passed to the actual function. So that looks like a tree walking interpreter.
Like ClickHouse, DuckDB expression execution is always over column vectors. You can read more about this architecture here and here.
Influx originally had a SQL-like query language called InfluxQL. If we look at how it evaluates a binary expression, it first evaluates the left-hand side and then the right-hand side before operating on the sides and returning a value. That's a tree-walking interpreter.
Flux was the new query language for Influx. While the Flux docs suggest they transform to an intermediate representation that is executed on a virtual machine, there's nothing I'm seeing that looks like a stack- or register-based virtual machine. All the evaluation functions evaluate their arguments and return a value. That looks like a tree-walking interpreter to me.
Today Influx announced that Flux is in maintenance mode and they are focusing on InfluxQL again.
Control flow methods are normally a good way to see how an interpreter
is implemented. The implementation of COALESCE looks pretty
simple. We
see it call
val_str()
for each argument to COALESCE. But I can only seem to find
implementations of val_str()
on raw values and not
expressions. Item_func_coalesce
itself does not implement
val_str()
for example, which would be a strong indication of a tree
walker. Maybe it does implement val_str()
through inheritance.
It becomes a little clearer if we look at non-control flow methods
like
acos
. In
this method we see Item_func_acos
itself implement val_real()
and
also call val_real()
on all its arguments. In this case it's obvious
how the control flow of acos(acos(.5))
would work. So that seems to
indicate expressions are executed with a tree walking interpreter.
I also noticed
sql/sp_instr.cc. That
is scary (in terms of invalidating my analysis) since it looks like a
virtual machine. But after looking through it, I think this virtual
machine only corresponds to how stored procedures are executed, hence
the sp_
prefix for Stored Programs. MySQL
docs
also explain that stored procedures are executed with a bytecode
virtual machine.
I'm curious why they don't use that virtual machine for query execution.
As far as I can tell MySQL and MariaDB do not differ in this regard.
Mongo recently introduced a virtual machine for executing queries, called Slot Based Execution (SBE). We can find the SBE code in src/mongo/db/exec/sbe/vm/vm.cpp and the main virtual machine entrypoint here. Looks like a classic stack-based virtual machine!
It isn't completely clear to me if the SBE path is always used or if there are still cases where it falls back to their old execution model. You can read more about Mongo execution here and here.
The top of PostgreSQL's src/backend/executor/execExprInterp.c clearly explains that expression execution uses a virtual machine. You see all the hallmarks: opcodes, a loop over a giant switch, etc. And if we look at how function expressions are executed, we see another hallmark which is that the function expression code doesn't evaluate its arguments. They've already been evaluated. And function expression code just acts on the results of its arguments.
PostgreSQL also supports JIT-ing expression execution. And we can find the switch between interpreting and JIT-compiling an expression here.
QuestDB wrote about their execution engine recently. When the conditions are right, they'll switch over to a JIT compiler and run native code.
But let's look at the default path. For example, how AND
is
implemented. AndBooleanFunction
implements BooleanFunction
which implements Function
. An
expression can be evaluated by calling a getX()
method on the
expression type that implements Function
. AndBooleanFunction
calls
getBool()
on its left and right hand sides. And if we look at the
partial
implementation
of BooleanFunction
we'll also see it doing getX()
specific
conversions during the call of getX()
. So that's a tree-walking
interpreter.
If we take a look at how functions are
evaluated
in Scylla, we see function evaluation first evaluating all of its
arguments. And
the function evaluation function itself returns a
cql3::raw_value
. So that's a tree-walking interpreter.
SQLite's virtual machine is comprehensive and well-documented. It encompasses more than just expression evaluation but the entirety of query execution.
We can find the massive virtual machine switch in src/vdbe.c.
And if we look, for example, at how AND
is implemented, we see it
pulling its arguments out of
memory
(already evaluated) and assigning the result back to a designated
point in
memory.
While there's no source code to link to, SingleStore gave a talk at CMU that broke down their query execution pipeline. Their docs also cover the topic.
Similar to DuckDB and ClickHouse, TiDB implements vectorized interpretation. They've written publicly about their switch to this method.
Let's take a look at how if
is implemented in TiDB. There is a
vectorized and non-vectorized version of if
(in
expression/control_builtin.go
and
expression/control_builtin_generated.go
respectively). So maybe they haven't completely switched over to
vectorized execution or maybe it can only be used in some conditions.
If we look at the non-vectorized version of
if
,
we see the condition
evaluated. And
then the then
or else
is evaluated depending on the result of the
condition. That's
a tree-walking interpreter.
As the DuckDB team points out, vectorized interpretation or JIT compilation seem like the future for database expression execution. These strategies seem particularly important for analytics or time-series workloads. But vectorized interpretation seems to make the most sense for column-wise storage engines. And column-wise storage normally only makes sense for analytics workloads. Still, TiDB and Cockroach are transactional databases that also vectorize execution.
And while SQLite and PostgreSQL use the virtual machine model, it's possible databases with tree-walking interpreters like Scylla and MySQL/MariaDB have decided there are not significant enough gains to be had (for transactional workloads) to justify the complexity of moving to a compiler + virtual machine architecture.
Tree-walking interpreters and virtual machines are also independent from whether or not execution is vectorized. So that will be another interesting dimension to watch: whether more databases move toward vectorized execution even if they don't adopt JIT compilation.
Yet another alternative is that maybe as databases mature we'll see compilation tiers similar to what browsers do with JavaScript.
Credits: Thanks Max Bernstein, Alex Miller, and Justin Jaffray for reviewing a draft version of this! And thanks to the #dbs channel on Discord for instigating this post!
I spent some time looking into how various databases execute expressions in their query language.
— Phil Eaton (@eatonphil) September 21, 2023
Most of them have a tree-walking interpreter, some have a virtual machine, and some do just-in-time compilation.
Let's dig into some database code to see!https://t.co/BIGtHKh1X4 pic.twitter.com/nmhe9HmYw7
2023-09-04 08:00:00
This is a collection of random personal experiences. So if you don't want to read everything, feel free to skip to the end for takeaways.
I write because I'd like to see more high-quality meetups. And maybe my little bit of experience will help someone out.
I first tried to organize a meetup in Philly in 2015. I was contracting at the time and I figured a meetup might be a good way to source contracts or just meet interesting people. I created the "Philadelphia Software in Business" (or some other similarly vaguely named) group on Meetup.com.
I didn't have any network; the first companies I worked for were not in Philly. But Meetup.com got me a few tens of people joining the group.
My first challenge was finding a place to meet. I didn't know what I was doing so I looked at restaurants, bars, and cafes for dedicated event space. Needless to say, renting space was expensive on its own. And there was always an additional required minimum dollar spent per attendee.
I ultimately found a place near the Schuylkill River. Maybe it was a community event space. Maybe I paid for it. I can't remember.
The first and only time I hosted an event for the group, I got a surprising number of people for such a vague topic. There were maybe 6 of us. I was the youngest by far (I was 20), they were middle age. Excel users and one visionary type.
There was no real point to the meetup and I didn't continue doing it.
While I was at Linode, I organized "hack nights". I didn't ask for anyone's approval before starting it. I just said I'd be ordering pizza for anyone interested in staying after work to hack on Linode-related projects. I was willing to pay for the pizza, in part because I didn't want to risk being shut down by asking. But caker paid for it each time.
I was nervous because people would show up and ask for pizza and not want to hack. It was company-provided under the aspiration of doing Linode-related work. Maybe I mentioned this or not. I can't remember. I'm pretty sure they got their pizza.
Aside from myself, developers at Linode didn't really attend. The folks who attended were support staff or folks from the technical writing team who wanted more experience coding.
I ran this for maybe 3 to 5 Wednesdays before not continuing. It was pretty fun! But staying after work for a few hours each Wednesday lost its charm.
Another time at Linode I started a book club. I was very torn about attempting to make the book club open to anyone in the area or just to Linode employees.
I knew I'd probably get more people to attend if I made it public. But I wasn't sure if Linode would be cool with having external folks in the office. Before they moved to the Old City office, visitors weren't really a thing.
So I made it private to Linode. And I started with the most obvious book for your average developer: Practical Common Lisp.
I am pretty sure I learned one big trick by this time though. When I announced I'd be starting the book club I said something like this:
Hey folks! I'm thinking of starting a book club. A book I have in mind to start with is Practical Common Lisp. If I get at least one other person to join in then I'll move forward!
I ended up getting two folks: one developer and one support staff member. We held the book club for 30 minutes once a week, covering one chapter each week. I was the only one who read anything I think, but the other two guys faithfully showed up for discussion.
I didn't ask for permission to do this either. And this time we met during company time. I think it was 2-2:30PM.
It was fun. We finished the book. But Practical Common Lisp probably wasn't a good choice. And I don't think I started a second book.
I moved to NYC and joined a small startup (~20 employees). Linode was 100+ employees.
We were in a WeWork so I considered starting a book club that was public to the WeWork. I had learned by then the law of numbers: I probably wouldn't get anyone from my company to join.
I considered putting up posters around the WeWork to advertise. But in the end, I didn't end up going through with anything.
I did present at a few meetups in NYC during this time. But I didn't organize anything.
And then the pandemic hit and everything disappeared.
In 2021 I started contracting again, thinking about starting a company. I wanted a community to be at the center.
So I started a Discord focused on software internals.
I had a bit more of a network at this point so I posted about the Discord on Twitter and got 100 likes or something and slowly started gaining folks in the Discord.
I knew it was going to do better if I was pretty active in it so I made sure to post interesting blog posts at some regular interval. About compilers or databases or something.
The Discord didn't turn out to help me out much in the starting-a-company front. Or I didn't use it effectively for that.
I wanted more of an independent Discord of cool people who like to learn about systems internals. And that's what I got.
This turned out to be ok though because I stopped working on that company and the Discord is still around and I still get to hang out with cool people.
This Discord is still around and hit 1,700 members recently. Among other things, it has developers from many different database companies in it these days. They hang out and help out the noobs like me learn about database internals.
I culled inactive members recently, so today the total is around 1,100.
During the pandemic I became frustrated that all the good meetups disappeared so I decided to start an online one that would be somewhat tied to the Discord and be about software internals.
I would find 2 or 3 people to present for 10-20 minutes each on anything to do with software internals. We'd meet once a month at 8PM NY time I think.
To get speakers I'd mostly DM people who I saw do interesting things on Twitter or Hacker News. I was lucky to have Philip O'Toole (author of rqlite), Simon Eskildsen (author of the Napkin Math blog), Rasmus Andersson, and many other excellent folks speak.
You can find videos of these talks on YouTube.
The events were organized on Meetup.com. The group grew quickly and I'd have about 100 people RSVP to each event. 10-20 normally showed up.
I'd post a Zoom link on Meetup.com. Sometimes Meetup.com crashed right as the meetup started, so no one could get a link. That was fun.
On two different nights I had Zoom bombers show up and play crazy music or impersonate other members of the call and act weirdly (Zoom lets you change your name after you've joined the call).
I learned a little bit about how to administrate a Zoom meeting.
I ran Hacker Nights for 5 months. It was tiring to find speakers, tiring to deal with Zoom bombers. It was thankless and I wasn't really enjoying it.
I was proud though that I was offering a channel for developers to learn about software internals of compilers, databases, etc. And it was great to meet many interesting speakers and attendees.
A month ago I put out a call on Twitter for folks in NYC interested in reading through the book Designing Data Intensive Applications.
I'd read the book before and while it was challenging, I knew it was immensely useful to any developer who works with data or an API.
By this time I'd learned my second trick: not asking for public responses.
I said something like:
Hey folks! I'm thinking of starting a book club meeting in Midtown NYC reading through Designing Data Intensive Applications. DM me if you'd be interested! If I get 2 other interested folks this will be on!
I got maybe 40 DMs and 20 of them were based in NYC. Attendance would thus have been higher if I had made the book club virtual. But virtual events take about as much effort as in-person events and somehow feel less rewarding. So I went through with the NYC group.
I'm sure I could have gotten some company to provide us space, but this would just mean more negotiation for me and tedium for everyone involved (bring your ID to be checked in, make sure you're registered, etc.).
The group would meet every 2 weeks and cover 2 chapters at a time. We'd meet for 30 minutes. To avoid needing to find a place to meet, we'd meet in public at Bryant Park. (There turns out to be plenty of available seating on Fridays at 9AM in Bryant Park. When it rains we meet online.)
I wanted to keep the overhead minimal and the timeline slightly aggressive. We'd be through the book in only 3 months. No crazy commitment.
We've met twice now and are 25% done with the book. Attendance has been around 7 to 9 people each time so far, or a little less than 50%.
They're almost all software developers, with one manager I think, who work for a variety of large and small tech companies.
I'm loving it so far. And if it continues to go well, I'll probably continue running in-person book clubs.
But it would only meet a few months a year, giving me a few month breaks from running it.
Organizing any event takes effort. Meetups are especially hard because you need to find a place to run the meetup, you probably want to provide food, and you need to find speakers.
Often you can find a single place to host the meetup, but you have to constantly search for new speakers. Even one of the greatest meetups in NYC, Papers We Love, seems to be struggling to find speakers.
The CMU Database Group and the Distributed Systems Reading Group seem to have the right idea though. They only run sessions part of the year, and they plan out all sessions in advance (including speakers).
However, they are both virtual. And I'm not so interested in running virtual events anymore.
For one, meetups are an awesome way to meet random people and expand your network.
Two, they're educational. Even beyond the content you are meeting about, there's the discussion alongside it you wouldn't get by yourself. And you, as organizer, get to pick the topic.
These work out great for me. I love to meet people, and I love to learn.
Starting something new is embarrassing because you're putting yourself out there. Maybe no one in your network shares your interests (to the degree or in the direction you do).
My tricks are: make the commitment conditional on a small number of other people expressing interest, and ask for that interest privately (in DMs) rather than publicly.
These ideas apply to corporate planning too. I think about them when I'm sharing some new idea in company Slack as much as when I share on Twitter.
A note on attendance rates: 10-20% actual attendance versus RSVP seems normal. If you get a higher percentage of people actually attending versus RSVP-ing you're doing pretty well!
One final idea is about paying for space or paying for food. Companies with space and money for food are often willing to partner with folks willing to do the work to run an event.
Running your own event in a company's space is advertising for them. They get to be associated with cool tech. It's a chance for them to pitch their open positions.
Obviously this happens often when you start a meetup hosted by your own company. But you can also find other companies to host space.
The kind of people to find to make this happen are senior developers or engineering managers, often on Twitter and sometimes on LinkedIn.
I haven't done this myself yet because I'm not ready to commit to running a meetup. But I see it happen. And it's the approach I'd take if I were to run a real meetup again.
Though now that I've got some time off there are a few talks I'd like to do myself.
Wrote a post on my experience organizing tech meetups of various stripes over the years. And a few things I've learned.
— Phil Eaton (@eatonphil) September 4, 2023
"meetups" taken pretty broadly to include online communities, book clubs, and actual speaker events.https://t.co/xnd0LTneup pic.twitter.com/w1oEaSNDHb
2023-08-15 08:00:00
Someone on Discord asked about how to learn functional programming.
The question and my initial tweet on the subject prompted an interesting discussion with Shriram Krishnamurthi and other folks.
So here's a slightly more thought out exploration.
And just for backstory sake: I spent a few years a while ago programming in Standard ML and I wrote a chunk of a Scheme implementation. I'm not an expert, but I have a bit of background.
Hey, this is a free opinion.
When people talk about functional programming, I think of a few key choices you can make while programming: immutability by default, first-class functions (and map/filter/reduce), and recursion instead of loops.
And if you have experience as a programmer, you either get the basic gist of these tenets or you can easily read about the basics.
That is, while most programmers I've met understood the basics, most programmers I've met were not particularly comfortable or fluent expressing programs with these ideas.
For myself, the only way I got comfortable expressing code with these ideas was lots of practice (as I mentioned above). And yet, even after I did a bunch of programming in Standard ML and Scheme, I really didn't see a particular benefit to practicing in a language other than one with which I was already generally comfortable.
You have to learn a lot of other random things when you pick up Scheme or Standard ML that aren't just: practice immutability by default, recursion, and first-class functions.
So I think it's kind of misguided when person A asks how to learn functional programming and person B responds that they should learn Haskell or OCaml or whatever. I see this happen pretty often online.
Beyond any "language for functional programming" as a recommendation in general, Haskell is a particularly egregious suggestion to make in my opinion because not only are you trying to practice functional programming tenets but you're also dealing with a complex type system and lazy evaluation.
Instead, practice immutability, recursion, map/reduce in whatever language you like.
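For example, here is a tiny sketch of practicing those things in Go; nothing about it requires a functional language:
package main

import "fmt"

// mapInts takes the transformation as a first-class function and returns a
// new slice instead of mutating its input (immutability by default).
func mapInts(xs []int, f func(int) int) []int {
	out := make([]int, 0, len(xs))
	for _, x := range xs {
		out = append(out, f(x))
	}
	return out
}

// sum is written with recursion rather than a mutable accumulator.
func sum(xs []int) int {
	if len(xs) == 0 {
		return 0
	}
	return xs[0] + sum(xs[1:])
}

func main() {
	xs := []int{1, 2, 3, 4}
	doubled := mapInts(xs, func(x int) int { return x * 2 })
	fmt.Println(doubled, sum(doubled)) // [2 4 6 8] 20
}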
If you want to study programming languages, that's awesome. However, functional programming doesn't really have any direct connection to studying programming languages.
Languages are all over the place. Scheme, Standard ML, and Haskell are worlds apart, even within the functional programming family.
And modern languages have mostly adopted the aspects of functional programming that used to be unique 20 years ago.
Moreover, there are many other worthwhile families of languages to learn about:
The list isn't exhaustive, and the variations within families can be massive. But the point is that functional programming doesn't mean crazy programming languages or crazy programming ideas. Functional programming is a subset of crazy programming languages and crazy programming ideas.
If you want to learn about crazy programming languages and crazy programming ideas, you should! Go for it!
SICP is famous as the (former) introductory textbook for computer science at MIT, and for its use of Scheme and the Metacircular Evaluator.
I don't have any experience teaching beginners how to program so I don't have thoughts on if this made sense. That's for folks like Shriram to think about.
However, I'm a half-decent programmer and I can't make it through this book. If you liked the book or want to read it, that's great! But I don't recommend it to anyone.
And many introductory Computer Science textbooks just don't make much sense to give to experienced programmers. For an experienced programmer, they can be quite slow!
Most of the folks I see asking about how to learn functional programming are experienced programmers.
I don't mean to overanalyze things, or get you overanalyzing things. If you want to learn functional programming by writing Haskell, that's awesome, you should go for it.
Wanting to do something is basically the best motivation there is.
The only reason I write this sort of post is so that folks who think that using Haskell or Standard ML or Scheme or reading SICP is the only way to learn functional programming see those ideas aren't necessarily true.
Finally, for folks with time and motivation wanting to seriously work out their functional programming muscles, writing a Scheme implementation with a decent chunk of the standard library can be an immensely enjoyable project.
You'll learn a lot about languages and compilers and algorithms and data structures. It's leetcode with meaning.
Wrote a short post on this idea about different things to think about when talking about learning functional programming
— Phil Eaton (@eatonphil) August 16, 2023
1. Core concepts (immutability, first-class functions, recursion)
2. Exploring programming languages
3. Teaching CS to studentshttps://t.co/k4LzvnHbNs
2023-07-11 08:00:00
This is an external post of mine. Click here if you are not redirected.
2023-06-19 08:00:00
I knew Zig supported some sort of reflection on types. But I had been
confused about how to use it. What's the difference between
@typeInfo
and @TypeOf
? I ignored this aspect of Zig until a
problem came up at work where reflection
made sense.
The situation was parsing and storing parsed fields in a struct. Each field name that is parsed should match up to a struct field.
This is a fairly common problem. So this post walks through how to use Zig's metaprogramming features in a simpler but related domain: parsing CSS into typed objects, and pretty-printing these typed CSS objects.
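As a point of comparison (not part of the original post), the same "match a parsed name to a struct field" idea looks roughly like this with Go's runtime reflection; Zig's comptime reflection lets you do the equivalent with no runtime cost:
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// CSSProperty loosely mirrors the Zig union defined later in the post:
// one field per known property.
type CSSProperty struct {
	Color      string
	Background string
}

// setProperty assigns value to the struct field whose name matches the
// parsed property name (case-insensitively), or reports an unknown property.
func setProperty(p *CSSProperty, name, value string) error {
	v := reflect.ValueOf(p).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		if strings.EqualFold(t.Field(i).Name, name) {
			v.Field(i).SetString(value)
			return nil
		}
	}
	return fmt.Errorf("unknown property: %q", name)
}

func main() {
	var p CSSProperty
	_ = setProperty(&p, "color", "white")
	_ = setProperty(&p, "background", "black")
	fmt.Printf("%+v\n", p)
}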
I live-streamed the implementation of this project yesterday on Twitch. The video is available on YouTube. And the source is available on GitHub.
If you want to skip the parsing steps and just see the metaprogramming, jump to the implementation of match_property.
Let's imagine a CSS that only has alphabetical selectors, property names and values.
The following would be valid:
div {
background: black;
color: white;
}
a {
color: blue;
}
Thinking about the structure of this stripped-down CSS we've got: a sheet made up of rules, where each rule has a selector and a list of properties, and each property has a name (e.g. background and color) and a value.
Turning that into Zig in main.zig:
const std = @import("std");
const CSSProperty = union(enum) {
unknown: void,
color: []const u8,
background: []const u8,
};
const CSSRule = struct {
selector: []const u8,
properties: []CSSProperty,
};
const CSSSheet = struct {
rules: []CSSRule,
};
The parser is going to look for CSS rules which contain a selector and a list of CSS properties. The entrypoint is that simple:
fn parse(
arena: *std.heap.ArenaAllocator,
css: []const u8,
) !CSSSheet {
var index: usize = 0;
var rules = std.ArrayList(CSSRule).init(arena.allocator());
// Parse rules until EOF.
while (index < css.len) {
var res = try parse_rule(arena, css, index);
index = res.index;
try rules.append(res.rule);
// In case there is trailing whitespace before the EOF,
// eating whitespace here makes sure we exit the loop
// immediately before trying to parse more rules.
index = eat_whitespace(css, index);
}
return CSSSheet{
.rules = rules.items,
};
}
Let's implement the eat_whitespace
helper we've referenced. It
increments a cursor into the css file while it sees whitespace.
fn eat_whitespace(
css: []const u8,
initial_index: usize,
) usize {
var index = initial_index;
while (index < css.len and std.ascii.isWhitespace(css[index])) {
index += 1;
}
return index;
}
In our stripped-down version of CSS all we have to think about is
ASCII. So the builtin std.ascii.isWhitespace()
function is perfect.
Next, parsing CSS rules.
parse_rule()
A rule consists of a selector, opening curly braces, any number of properties, and closing curly braces. We need to remember to eat whitespace between each piece of syntax.
And we'll reference a few more parsing helpers we'll talk about next for the selector, braces, and properties.
const ParseRuleResult = struct {
rule: CSSRule,
index: usize,
};
fn parse_rule(
arena: *std.heap.ArenaAllocator,
css: []const u8,
initial_index: usize,
) !ParseRuleResult {
var index = eat_whitespace(css, initial_index);
// First parse selector(s).
var selector_res = try parse_identifier(css, index);
index = selector_res.index;
index = eat_whitespace(css, index);
// Then parse opening curly brace: {.
index = try parse_syntax(css, index, '{');
index = eat_whitespace(css, index);
var properties = std.ArrayList(CSSProperty).init(arena.allocator());
// Then parse any number of properties.
while (index < css.len) {
index = eat_whitespace(css, index);
if (index < css.len and css[index] == '}') {
break;
}
var attr_res = try parse_property(css, index);
index = attr_res.index;
try properties.append(attr_res.property);
}
index = eat_whitespace(css, index);
// Then parse closing curly brace: }.
index = try parse_syntax(css, index, '}');
return ParseRuleResult{
.rule = CSSRule{
.selector = selector_res.identifier,
.properties = properties.items,
},
.index = index,
};
}
The parse_syntax helper is pretty simple: it does a bounds check and increments the cursor if the current character matches the one you pass in.
fn parse_syntax(
css: []const u8,
initial_index: usize,
syntax: u8,
) !usize {
if (initial_index < css.len and css[initial_index] == syntax) {
return initial_index + 1;
}
debug_at(css, initial_index, "Expected syntax: '{c}'.", .{syntax});
return error.NoSuchSyntax;
}
This calls attention to debugging messages on failure. When we fail to parse a syntax, we want to give a useful error message and point at the exact line and column of code where the error happens.
So let's implement debug_at
.
debug_at
First, we iterate over the entire CSS source code until we find the entire line that contains the index where the parser failed. We also want to identify the exact line and column corresponding to that index.
fn debug_at(
css: []const u8,
index: usize,
comptime msg: []const u8,
args: anytype,
) void {
var line_no: usize = 1;
var col_no: usize = 0;
var i: usize = 0;
var line_beginning: usize = 0;
var found_line = false;
while (i < css.len) : (i += 1) {
if (css[i] == '\n') {
if (!found_line) {
col_no = 0;
line_beginning = i;
line_no += 1;
continue;
} else {
break;
}
}
if (i == index) {
found_line = true;
}
if (!found_line) {
col_no += 1;
}
}
Then we print it all out in a nice format for users (which will likely just be ourselves).
std.debug.print("Error at line {}, column {}. ", .{ line_no, col_no });
std.debug.print(msg ++ "\n\n", args);
std.debug.print("{s}\n", .{css[line_beginning..i]});
while (col_no > 0) : (col_no -= 1) {
std.debug.print(" ", .{});
}
std.debug.print("^ Near here.\n", .{});
}
Ok, popping our mental stack, if we look back at parse_rule
we still
need to implement parse_identifier
and parse_property
.
parse_identifier
An "identifier" for us here is just going to be an ASCII alphabetical
string (i.e. [a-zA-Z]+
). We're going to really simplify CSS
because we're going to use this method for parsing not just selectors
but property names and even property values.
Zig again has a nice builtin std.ascii.isAlphabetic we can use.
const ParseIdentifierResult = struct {
identifier: []const u8,
index: usize,
};
fn parse_identifier(
css: []const u8,
initial_index: usize,
) !ParseIdentifierResult {
var index = initial_index;
while (index < css.len and std.ascii.isAlphabetic(css[index])) {
index += 1;
}
if (index == initial_index) {
debug_at(css, initial_index, "Expected valid identifier.", .{});
return error.InvalidIdentifier;
}
return ParseIdentifierResult{
.identifier = css[initial_index..index],
.index = index,
};
}
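A quick sketch of my own (not from the post) of how parse_identifier behaves on a small input:
test "parse_identifier grabs a run of letters" {
    // Stops at the first non-alphabetical character and reports the new index.
    const res = try parse_identifier("div{color:black;}", 0);
    try std.testing.expectEqualStrings("div", res.identifier);
    try std.testing.expectEqual(@as(usize, 3), res.index);
}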
In reality, CSS properties are highly complex. Parsing CSS correctly isn't the main aim of this post though. :)
parse_property
The final piece of CSS we need to parse is properties. These consist of a property name, then a colon, then a property value, and finally a semicolon. And between each piece we eat whitespace.
const ParsePropertyResult = struct {
property: CSSProperty,
index: usize,
};
fn parse_property(
css: []const u8,
initial_index: usize,
) !ParsePropertyResult {
var index = eat_whitespace(css, initial_index);
// First parse property name.
var name_res = parse_identifier(css, index) catch |e| {
std.debug.print("Could not parse property name.\n", .{});
return e;
};
index = name_res.index;
index = eat_whitespace(css, index);
// Then parse colon: :.
index = try parse_syntax(css, index, ':');
index = eat_whitespace(css, index);
// Then parse property value.
var value_res = parse_identifier(css, index) catch |e| {
std.debug.print("Could not parse property value.\n", .{});
return e;
};
index = value_res.index;
// Finally parse semi-colon: ;.
index = try parse_syntax(css, index, ';');
var property = match_property(name_res.identifier, value_res.identifier) catch |e| {
debug_at(css, initial_index, "Unknown property: '{s}'.", .{name_res.identifier});
return e;
};
return ParsePropertyResult{
.property = property,
.index = index,
};
}
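And a similar sketch (again my own, not from the post) of what we'd expect back from parse_property for a single declaration, assuming the color field on the CSSProperty union from earlier:
test "parse_property parses a single declaration" {
    // "color: black;" is 13 characters, so the cursor should end up past the semicolon.
    const res = try parse_property("color: black;", 0);
    try std.testing.expectEqualStrings("black", res.property.color);
    try std.testing.expectEqual(@as(usize, 13), res.index);
}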
Finally we get to the first bit of metaprogramming. Once we have a property name and value, we need to turn that into a Zig union.
That's what match_property()
is going to be responsible for doing.
match_property
This function needs to take a property name and value and return a
CSSProperty
with the correct field (matching up to the property name
passed in) and assigned to the value passed in.
If we didn't have metaprogramming or reflection, the implementation might look like this:
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
if (std.mem.eql(u8, name, "color")) {
return CSSProperty{.color = value};
} else if (std.mem.eql(u8, name, "background")) {
return CSSProperty{.background = value};
}
return error.UnknownProperty;
}
And that is not necessarily bad. In fact it may be how a lot of production code looks over time as product needs evolve. You can keep the internal field name unrelated to the external field name.
However for the sake of learning, we'll try to implement the same thing with Zig metaprogramming.
Specifically, we can take a look at lib/std/json/static.zig in the Zig standard library to understand the reflection APIs.
If we look at lines 210-226 of that file, we can see it iterating over the fields of a Union:
.Union => |unionInfo| {
if (comptime std.meta.trait.hasFn("jsonParse")(T)) {
return T.jsonParse(allocator, source, options);
}
if (unionInfo.tag_type == null) @compileError("Unable to parse into untagged union '" ++ @typeName(T) ++ "'");
if (.object_begin != try source.next()) return error.UnexpectedToken;
var result: ?T = null;
var name_token: ?Token = try source.nextAllocMax(allocator, .alloc_if_needed, options.max_value_len.?);
const field_name = switch (name_token.?) {
inline .string, .allocated_string => |slice| slice,
else => return error.UnexpectedToken,
};
inline for (unionInfo.fields) |u_field| {
Then right after that (lines 226-243) we see them conditionally modifying the result object:
inline for (unionInfo.fields) |u_field| {
if (std.mem.eql(u8, u_field.name, field_name)) {
// Free the name token now in case we're using an allocator that optimizes freeing the last allocated object.
// (Recursing into parseInternal() might trigger more allocations.)
freeAllocated(allocator, name_token.?);
name_token = null;
if (u_field.type == void) {
// void isn't really a json type, but we can support void payload union tags with {} as a value.
if (.object_begin != try source.next()) return error.UnexpectedToken;
if (.object_end != try source.next()) return error.UnexpectedToken;
result = @unionInit(T, u_field.name, {});
} else {
// Recurse.
result = @unionInit(T, u_field.name, try parseInternal(u_field.type, allocator, source, options));
}
break;
}
We can see that the .Union => |unionInfo| condition is entered by switching on @typeInfo(T) (line 149) and that T is a type (line 144).
We don't have a generic type though. We know we are working with a CSSProperty. And we know CSSProperty is a union, so we don't need the switch either.
So let's apply that to our match_property
implementation.
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
const cssPropertyInfo = @typeInfo(CSSProperty);
for (cssPropertyInfo.Union.fields) |u_field| {
if (std.mem.eql(u8, u_field.name, name)) {
return @unionInit(CSSProperty, u_field.name, value);
}
}
return error.UnknownProperty;
}
And if we try to build that we'll get an error like this:
main.zig:15:31: error: values of type '[]const builtin.Type.UnionField' must be comptime-known, but index value is runtime-known
for (cssPropertyInfo.Union.fields) |u_field| {
Zig's "reflection" abilities here are comptime only. So we can't use a runtime for loop; we must use a comptime inline for loop.
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
const cssPropertyInfo = @typeInfo(CSSProperty);
inline for (cssPropertyInfo.Union.fields) |u_field| {
if (std.mem.eql(u8, u_field.name, name)) {
return @unionInit(CSSProperty, u_field.name, value);
}
}
return error.UnknownProperty;
}
As far as I understand it, this loop is basically unrolled and the generated code would look a lot like our hard-coded initial version. That is, it would probably look something like this:
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
const cssPropertyInfo = @typeInfo(CSSProperty);
if (std.mem.eql(u8, "background", name)) {
return @unionInit(CSSProperty, "background", value);
}
if (std.mem.eql(u8, "color", name)) {
return @unionInit(CSSProperty, "color", value);
}
if (std.mem.eql(u8, "unknown", name)) {
return @unionInit(CSSProperty, "unknown", value);
}
return error.UnknownProperty;
}
Again, that's just how I imagine the compiler generating code from the Union field reflection and the inline for over the fields.
Try compiling that code. I get this:
main.zig:17:58: error: expected type 'void', found '[]const u8'
return @unionInit(CSSProperty, u_field.name, value);
Thinking about the generated code makes it especially clear what's
happening. We have an unknown
field in there that has a void
type. You can't assign a string to void.
We know that at runtime the condition where that happens should be impossible, because the user shouldn't enter unknown as a property name. (Though now that I write this, I see they actually could. But let's pretend they wouldn't.)
So the problem isn't a runtime failure but a comptime type-checking failure.
Thankfully we can work around this with comptime conditionals.
If we wrap our current condition in an additional conditional that is
evaluated at comptime and filters out the unknown
pass of the
inline for
loop, the compiler shouldn't generate any code trying to
assign to the unknown
field.
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
const cssPropertyInfo = @typeInfo(CSSProperty);
inline for (cssPropertyInfo.Union.fields) |u_field| {
if (comptime !std.mem.eql(u8, u_field.name, "unknown")) {
if (std.mem.eql(u8, u_field.name, name)) {
return @unionInit(CSSProperty, u_field.name, value);
}
}
}
return error.UnknownProperty;
}
And indeed, if you try to compile it, this works. Since the conditional is evaluated at compile time, we can imagine the code the compiler generates is this:
fn match_property(
name: []const u8,
value: []const u8,
) !CSSProperty {
const cssPropertyInfo = @typeInfo(CSSProperty);
if (std.mem.eql(u8, "background", name)) {
return @unionInit(CSSProperty, "background", value);
}
if (std.mem.eql(u8, "color", name)) {
return @unionInit(CSSProperty, "color", value);
}
return error.UnknownProperty;
}
The unknown field has been skipped.
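As one more sanity check (my own sketch, not from the post), the comptime version now behaves just like the hand-written one would:
test "match_property builds the right union variant" {
    // "color" is a real field, so we get a CSSProperty with that field set.
    const p = try match_property("color", "black");
    try std.testing.expectEqualStrings("black", p.color);
    // "font" is not a field of CSSProperty as defined earlier, so we get an error.
    try std.testing.expectError(error.UnknownProperty, match_property("font", "serif"));
}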
In retrospect, I realize that the unknown
field probably isn't
even needed. We could eliminate it from the CSSProperty
union and
get rid of that comptime conditional. However, sometimes there are in
fact private fields you want to skip. And I wanted to show how to
deal with that case.
For the last bit of metaprogramming, let's talk about displaying
the resulting CSSSheet
we'd get after parsing.
sheet.display()
If we didn't have metaprogramming and wanted to display the sheet, we'd have to switch on every possible union field.
Like so (I've modified the CSSSheet
struct definition so it includes this method):
fn display(sheet: *CSSSheet) void {
for (sheet.rules) |rule| {
std.debug.print("selector: {s}\n", .{rule.selector});
for (rule.properties) |property| {
switch (property) {
.unknown => unreachable,
.color => |color_value| std.debug.print(" color: {s}\n", .{color_value}),
.background => |background_value| std.debug.print(" background: {s}\n", .{background_value}),
}
}
std.debug.print("\n", .{});
}
}
This is already a little annoying and could get unwieldy as we add
fields to the CSSProperty
union.
Instead we can use inline for (@typeInfo(CSSProperty).Union.fields) |u_field| to iterate over all fields, skip the unknown field at comptime, and print out each field name and value by comparing the field name against the property's current tag via the @tagName builtin.
fn display(sheet: *CSSSheet) void {
for (sheet.rules) |rule| {
std.debug.print("selector: {s}\n", .{rule.selector});