2025-03-04 08:00:00
Two related threads, both views on limits to the useful size of data.
I have recently been working with WebAssembly and Win32, which both use 32-bit pointers and are limited to 4gb of memory. Meanwhile, modern computers usually use 64-bit pointers. 64-bit pointers let you address more than 4gb of memory. (They also have other important uses, including pointer tricks around memory mapping like ASLR.)
But these larger pointers also cost memory. Fancy VMs like v8 employ "pointer compression" to try to reduce the memory cost. As always, you end up trading off CPU and memory. Is Memory64 actually worth using? talks about this tradeoff in the context of WebAssembly's relatively recent extension to support 64-bit pointers. The costs of the i386 to x86-64 upgrade dives into this in the context of x86, where it's difficult to disentangle the separate instruction set changes that accompanied x86-64 from the increased pointer size.
I provide these links as background for an interesting observation I heard about why a 4gb limit turns out to be "big enough" much of the time: to the extent a program needs to work with data sets larger than 4gb, the data is large enough that you end up working with it differently anyway.
In other words, the simple small data that programs typically work with, like command-line flags or configuration files, comfortably fits within a 4gb limit. In contrast, the kind of bigger data that crosses the 4gb limit will fundamentally have different access patterns, typically smaller views onto the larger space, because it is too large to traverse quickly.
Imagine a program that generates 4gb of data or loads it from disk or network. Even if the program grinds through hundreds of megabytes per second, it still takes tens of seconds to work through 4gb. That is probably enough time to require a different user interface, such as a progress bar. Such a program will likely work with the data in a streaming fashion, paging smaller blocks in and out via some file API that manages the full >4gb of data.
A standard example application that will make use of a ton of memory is a database. But a database is expressly designed around managing the larger size of the data, using indexing data structures so that a given query knows exactly which subsets of the larger data to access. Otherwise, large queries like table scans work with the data in a streaming fashion as in the previous paragraph.
Another canonical "needs a lot of memory" application is an image editor, which operates on a lot of pixels. I worked on one of those! To make the software grind through pixels fast, you take pains to avoid individually traversing all of them. To get the pixels onto the screen quickly, you instead load data piecewise into the GPU and let the GPU handle the rest. Or you write specialized "shader" GPU code. Both are again specialized APIs that work with the larger data indirectly rather than traversing it all directly.
How about audio and video? These involve large amounts of data, but also have a time dimension, where most operations only work with a portion of the data near a particular timestamp.
In all, a 32-bit address space for the program's normal data structures coupled with some alternative mechanism for poking at the larger data indirectly has surprisingly ended up working out almost as well as a larger flat address space.
It's interesting to contrast this observation with a joke from the DOS era: "640kb ought to be enough for anybody". 640kb was a memory limit back then, and the phrase was thrown around ironically to remark that reasonable programs actually needed more. According to Gates (to whom it was apocryphally, falsely, attributed), 640kb was already understood to be a painful limit at the time.
In contrast, we've been easily fitting most programs in 4gb for the last 30 years — from the 1990s through today, where browsers still limit web pages to 4gb.
Why has this arbitrary threshold held up? You could argue it's a lack of imagination: maybe we have sized our programs to the limits we had? But 64-bit has been around for quite a while, long enough that there ought to be better examples of different kinds of programs that truly make use of it.
One answer is to observe that many of the limits above are related to human limits. Humans won't wait 30s for a program to load its data; humans can't listen to 4 minutes of audio simultaneously; humans don't write >4gb of configuration files... or to put it together, humans don't consume 4gb of data in one gulp.
And that observation links to an interesting related phenomenon.
Data visualization is the problem of presenting data in a form where your brain can ingest it. What if you have a lot of data? We use the tools of statistics to summarize collections of data into smaller representations that capture some of their essence.
Imagine building a stock chart, a simple line chart of a value over time. Even though you have data for every second of the day, when charting the data over a larger timeline it's not useful to draw all of that data. Instead we summarize, with either an average price per time window, or something that attempts to reduce each time window's data to a few numbers, like a violin plot or OHLC bars.
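To make the summarizing step concrete, here is a minimal sketch that reduces per-second prices to per-minute averages; the ticks.csv file of unix-timestamp,price rows is hypothetical, purely for illustration:
$ awk -F, '{ w = int($1 / 60) * 60; sum[w] += $2; n[w]++ }
      END { for (w in sum) print w "," sum[w] / n[w] }' ticks.csv | sort -n
Each output row is one time window's start and its average price: a few hundred rows for a trading day instead of tens of thousands of raw ticks.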
Why do we do this? Visually, there's only so much you can usefully see. Even with a higher resolution display, shoving more detail into smaller pixels does not convey usefully more information. In fact, a chart is often better when it has fewer visual marks that still tell the same story. Much like the 4gb limit, the bandwidth of information into your brain sets an upper bound on the amount of useful data that can go into a chart.
From this you can derive an interesting sort of principle about data visualization software: it does not need to be especially fast at supporting a lot of marks, because you won't ever need to display that many of them. For example, libraries like d3 are fine doing relatively low-performance DOM traversal for rendering.
This is kind of a subtle point, so let me be clear: it's of course useful and important to be able to work with large data sets, and data visualization software often provides critical APIs for processing them. The point is that once you are at the level of putting actual objects on the screen, you should have already reduced the data mathematically to a relatively small set, so you don't really need speedy handling of screen objects.
But wait, you say, this is a dumb argument — it really is more convenient to just have 64-bit pointers everywhere and not worry about any limits. And wouldn't a charting API where you can hand a million points to a scatter plot be more useful than one where you can't?
I think both of those are preferable, in the absence of performance constraints. I prefer them in the same way I'd prefer, in a hypothetical future where performance is completely free, a database with no need for indexes, or big AI matmuls without specialized GPU code or hardware. But at least in today's software, even 64-bit pointers are costly enough that implementers will go to lengthy efforts to reduce their cost.
So don't take this post as advocating that these lower limits are somehow just or correct. Rather, I aimed only to observe why it is that the memory limits from the 90s, or the visualization limits from the pen and paper days, have not been as constraining as you might first expect.
2025-01-24 08:00:00
"Blazing"
What you meant: Very fast.
What it comes across as: This term in particular is overused to describe software that isn't particularly fast in any useful sense, but which has random micro-optimizations that don't matter.
What to do instead: If it's actually fast, just say fast, and describe what it's fast in relation to.
"Modern"
What you meant: New, which implies good under the assumption that newer is implicitly better? Or maybe it means based on newer design principles? I actually struggled on this one for a while and I think it's most often used to just mean good in general, nonspecific ways...?
What it comes across as: New for the sake of being new, possibly with new bugs. Ignorant of the past. An empty filler word.
What to do instead: Describe the actual difference. When new things are good, it's because of specific benefits such as active maintenance, a new take on an old problem, a reduced and focused scope by discarding previous compatibility concerns, etc.
"Isomorphic"
What you meant: The same as, or matching.
What it comes across as: Abusing a math term because it looks cool. Sesquipedalianism.
What to do instead: Isomorphism is a technical term that describes a specific scenario distinct from equality. Similar terms like "equivalent" or "one-to-one" can often better express the intended idea.
"Magic"
What you meant: Something with surprisingly helpful behavior.
What it comes across as: Something with surprisingly unpredictable behavior.
What to do instead: Perhaps "intuitive", or something like "handles common configuration out of the box". For most things I work with, the descriptive words "simple" and "predictable" have more sparkle to them than "magical".
2024-12-12 08:00:00
Jujutsu is a new version control system that seems pretty nice!
The first few times I tried it I bounced off the docs, which to my taste have too much detail up front, before giving the big picture. Someone else maybe had a similar experience and wrote an alternative tutorial, but it's in a rambly bloggy style that is also too focused on the commands for me.
I suspect, much like writing a monad tutorial, the path of understanding is actually writing it down. So here's an attempt from me at an introduction / tutorial.
Perhaps unlike the others, my goal is that this is high-level enough to read and think about, without providing so much detail that it washes over you. Don't try to memorize the commands here or anything; they're just here to communicate the ideas. At the end, if you're curious to try it, I recommend the docs found on the project's website.
Omitting details, you can think of Jujutsu (hereafter "jj") as a new Git frontend. The underlying data is still stored in Git. The difference is how you interact with your files locally, with a different conceptual model and a different set of commands.
Git quiz: are commits snapshots of file state or diffs? The technical answer is subtle — as a user you usually interact with them as diffs, while conceptually they are snapshots, but concretely they are stored as deltas. The more useful answer is that thinking about the details obfuscates the conceptual model. Similarly, to describe jj in terms of what happens in Git is tempting but I think ultimately clouds the explanation.
In practice what this means is: try to put your knowledge of Git on hold, but also be aware that you can use jj and continue to interoperate with the larger Git ecosystem, including e.g. pushing to GitHub.
The purpose of a version control system is to keep track of the history of your code. But interestingly, in most of them, as soon as you edit a file locally in your working copy, that new history ("I have edited file X starting on version Y") is in a kind of limbo state outside of the system and managed separately.
This is so pervasive it's almost difficult to see. But consider how a command like git diff has one mode that takes two commits to diff, and then a bunch of other modes and flags to operate on the other kinds of things it tracks. You can get a diff against your working copy but there's no way to name "the working copy" in the diff command. (Git in particular adds the additional not-quite-a-commit state of the index, with even more flavors of attendant commands. The ultimate Git quiz: what are the different soft/hard/mixed behaviors of git reset?)
Another example: consider how if you have a working copy change and want to check out other code, you either have to put it in a new place (git stash, a fourth place separate from the others) or make a temporary commit. Or how if you have a working copy change that you want to transplant elsewhere you git checkout -m, but to move around committed changes it's git rebase.
In jj, in contrast, your working copy state is always a commit. When making a new change this is a new (descriptionless) commit. Any edit you make on disk is immediately reflected in the current commit.
So many things fall out of this simple decision!
jj diff shows a commit's diff. With no argument it's the current commit's diff, i.e. your current working copy's diff; otherwise you can specify which historical one you want. Many other jj commands similarly have a pleasing symmetry about their behavior like this.
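For example (the change id here is a placeholder for one from your own log):
$ jj diff              # diff of the current working-copy commit
$ jj diff -r pmnzyyru  # the same command, aimed at a historical change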
You can draft the commit message for a work in progress commit before you're done, using the same command you'd use to edit any other commit message. There is no final jj commit command; the commit is implicit. (Instead you jj new to start a new empty commit when done.)
You never need to "stash" your current work to go do something else, it is already stored in the current commit, and easy to jump back to.
In Git, to fix a typo in an old commit, you might make a new commit then git rebase -i to move the patch around. In jj you directly check out the old commit (because working copy == commit) and edit the files with no further commands. (This blog post walks through a real-world operation like this with Git and jj side by side.)
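A sketch of that flow, with hypothetical change ids:
$ jj edit onvmkwpr   # check out the old commit directly
$ $EDITOR notes.txt  # fix the typo; the commit updates in place, descendants rebase
$ jj edit tqwlnsxy   # jump back to the change you were working on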
From a Git perspective, jj is very "rebasey". Editing a file is like a git commit --amend, and in the "fix a typo" move above the edit implicitly rebases any downstream commits. To make that work out there are some other conceptual leaps around conflict handling and branches that will come below after the basics.
In a Git repo:
$ jj git init --colocate
This creates a .jj dir that works with the Git repo. Git commands will still work but can be confusing.
The plain jj command runs jj log, showing recent commits. Here it is from the repo for this blog:
$ jj
@ zyqszntn [email protected] 2024-12-12 11:58:52 21b06db8
│ (no description set)
○ pmnzyyru [email protected] 2024-12-12 11:58:48 86355427
│ unfinished drafts
◆ szzpmvlz [email protected] 2024-09-18 09:08:15 fcb1507d
│ syscalls
~
The leftmost letter string is the "change id", which is the identifier you use to refer to the change in a command like diff. They are stable across edits, unlike the Git hashes on the right. In the terminal the change ids are colored to show the necessary prefix to uniquely refer to them (a single letter) in commands.
The topmost commit zyqszntn is the current one, containing this blog post as I write it. As you would expect, if I run jj status it shows me the list of edited files, and if I run jj diff it shows me a diff.
I can give it a description now or when I'm done:
$ jj desc -m 'post about jujutsu'
And then create a new commit for the next change:
$ jj new
That's enough for trivial changes, but often I work on more significant changes where I might lose context across days. There are two ways you might do this depending on how you work.
The first is to just describe your change as above and keep on editing it, without running jj new. Each subsequent edit will update the change as you go. This is simple to operate but it means jj diff will always show the whole diff. In Git this is similar to just keeping a lot of edits in your working copy.
The other option is called "the squash workflow" in the tutorial book. In this, when you do new work you jj new to create a new distinct commit from your existing work, and when you are happy with it (by e.g. examining jj diff, which shows you just the working copy's new changes) you run jj squash to flush these new changes into the previous commit. To me this feels pretty analogous to using the Git index as a staging area for a complex change, or perhaps repeatedly using git commit --amend.
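Concretely, the loop looks roughly like this sketch:
$ jj new     # start an empty commit on top of the work-in-progress change
             # (edit files as usual)
$ jj diff    # shows only the edits made since jj new
$ jj squash  # fold them into the parent, the work-in-progress change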
These commands like jj diff and jj desc work on the current commit (or any explicitly requested via the -r flag).
To switch the working copy to an existing change, it's jj edit <changeid>. Again, any changes you make here, to the files or descriptions, or by making new changes and squashing them, work directly on the historical commit you are editing. I repeat this because it is both weird and obvious in retrospect.
Any operations on history cause implicit rebases that happen silently. Rebases can conflict. jj has interesting handling of how this works.
In Git, rebase resolution happens through your working copy, so there is again extra state around "rebase in progress" and git rebase --continue. In jj instead, conflicting commits are just recorded as conflicting and marked as such in the history, so rebases always "succeed" even if they produce a string of conflicting commits.
If you go to fix a conflicting commit (via jj edit as above), you edit the files as usual, and once the conflict markers are removed it's no longer considered conflicting.
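In practice, resolving looks roughly like this sketch (the change id is hypothetical):
$ jj edit ptkwnqrs  # check out the conflicted commit; files contain conflict markers
$ $EDITOR main.c    # resolve and delete the markers
$ jj status         # the next jj command snapshots the fix and clears the conflict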
As usual, once you make a history edit, downstream changes are again rebased, possibly resolving their conflicted state after your edit. Again, the jj pattern of "all of the relevant information is modeled in the commits" without having a separate rebase mode with state etc. is a recurring powerful theme.
I don't have a lot of experience with this yet so I can't comment on how well it works, except that the times I've run into it I was pleasantly surprised. The jj docs seem proud of the modeling and behavior here, which makes me think it's plausibly sophisticated.
jj doesn't have named branches, but rather only keeps track of commits. Because of the way jj juggles commits, where it's trivial to start adding commits at random points in history, branch names are not as useful. In my experience so far having useful commit descriptions is enough to keep track of what I'm working on. Coming from Git the lack of named branches is surprising, but I believe this is comfortable for Mercurial users and Monotone worked similarly (I think?).
It's worth highlighting the absence of branches because, in particular, when interoperating with Git you still do need branches, if only to push. There is support in jj for this (where "bookmarks" are pointers to specific commits) but it feels a little clunky. On the other hand, I probably have Stockholm syndrome about the git push syntax.
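For example, pushing looks roughly like this sketch, assuming a recent jj where these are the bookmark commands and main is whatever branch name the remote expects:
$ jj bookmark set main -r @-   # point the bookmark at the working copy's parent
$ jj git push --bookmark main  # push it to the Git remote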
Working with jj made me realize how much I rely on VSCode's Git support, for both viewing diffs and for merges.
When editing a given commit in jj, Git thinks all the files in the commit are in the working tree and not the index. In other words, in the VSCode UI the current diff shows up as pending changes just as they would in Git. This works pretty well and is about all I would expect. I haven't yet touched the buttons that interact with Git's index, for fear of what jj will do with it.
For technical reasons I do not quite understand — possibly VSCode only does three-way file merges and jj needs three-way directory merges? — the two do not quite cooperate for resolving conflicts. The jj docs recommend meld and I have used meld in the past but I hadn't quite realized how VSCode had hooked me until I missed using it for a merge.
The author of jj works at Google and is possibly making it for the Google internal version control system. (Above I wrote that jj is a Git frontend, but officially it has pluggable backends; I'm just unlikely to ever see a non-Git one.)
When I left Google three years ago I recall they were trying to figure out what to do about either making Git scale, or adopting Mercurial, or what. I remember talking to someone involved in this area and thinking "realistically your users have to use Git to work with the larger world, so anything else you do is pure cost". I found this post from a Mercurial fan about jj an interesting read for how it talks about the Mercurial shortcomings jj fixes. From that perspective jj is pretty interesting: it can replace the places you currently use Git, while also providing a superior UI.
In all, jj seems pretty polished, has been around for years, and has a pretty simple exit strategy if things go wrong — just bail out to the Git repo. I aim to continue using it.
PS: every time I read the name "jujutsu" I keep thinking it is a misspelling of "jiu-jitsu", the martial art. But the Japanese word is じゅうじゅつ; it's actually "jiu-jitsu" that is misspelled! Read a longer article about it.