2024-12-19 04:00:00
Back on December 13th, I posted a challenge on Mastodon: In a simple UTF-8 byte-driven finite automaton, how many states does it take to match the regular-expression construct “.”, i.e. “any character”?
Commenter Anthony Williams responded, getting it almost right I think, but I found his description a little hard to understand. In this piece I’m going to dig into what “.” actually means, and then how many states you need to match it.
The answer surprised me. Obviously this is of interest only to the faction of people who are interested in automaton wrangling, problematic characters, and the finer points of UTF-8. I expect close attention from all 17 of you!
The answer: four. Or five, depending.
Characters are represented by “code points”, which are numbers in the range 0 … 17×2¹⁶, which is to say 1,114,112 possible values. It turns out you don’t actually want to match all of them; more on that later.
Quamina is a “byte-level automaton”, which means it’s in a state, it reads a byte, looks up the value of that byte in a map yielding either the next state or nil, which means no match. Repeat until you match or fail.
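Here’s a minimal sketch of that loop, with invented type and field names; Quamina’s real data structures are fancier, so treat this as illustration only.

```go
// A toy byte-level automaton. The names (state, next, matched) are
// illustrative, not Quamina's actual types.
type state struct {
	next    map[byte]*state // nil result from the lookup means "no match"
	matched bool            // reaching a state with matched set means success
}

// run reads bytes one at a time, following the map out of each state.
func run(start *state, input []byte) bool {
	s := start
	for _, b := range input {
		s = s.next[b]
		if s == nil {
			return false // no transition for this byte: fail
		}
	}
	return s.matched
}
```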
What bytes are we talking about here? We’re talking about UTF-8 bytes. If you don’t understand UTF-8 the rest of this is going to be difficult. I wrote a short explainer called Characters vs. Bytes twenty-one years ago. From here on I assume you understand UTF-8 and know that code points are encoded as sequences of 1 to 4 bytes.
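If you want a quick refresher on those lengths, Go’s standard unicode/utf8 package will show you; the sample code points below are arbitrary picks, one for each encoded length.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// One-, two-, three-, and four-byte examples respectively.
	for _, r := range []rune{'A', 'é', '☃', '😀'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		fmt.Printf("U+%04X encodes as % X (%d byte(s))\n", r, buf[:n], n)
	}
}
```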
Let’s count!
When you match a code point successfully you move to the part of the automaton that’s trying to match the next one; let’s call this condition MATCHED.
(From here on, all the numbers are hex, I’ll skip the leading 0x. And all the ranges are inclusive.)
In multi-byte characters, all the UTF-8 bytes but the first have bitmasks like 10XXXXXX, so there are six significant bits, thus 2⁶ or 64 distinct possible values ranging from 80-BF.
There’s a Start state. It maps byte values 00-7F (as in ASCII) to MATCHED. That’s our first state, and we’ve handled all the one-byte code points.
In the Start state, the 32 byte values C0-DF, all of which begin 110
signaling a two-byte
code point, are mapped to the Last state. In the Last state,
the 64 values 80-BF are mapped to MATCHED. This takes care of all the two-byte code points and we’re up to two
states.
In the Start state, the 16 byte values E0-EF, all of which begin 1110
signaling a three-byte code
point, are mapped to the LastInter state. In that
state, the 64 values 80-BF are mapped to the Last state. Now we’re up to three states and we’ve handled the
three-byte code points.
In the Start state, the 8 byte values F0-F7, all of which begin 11110 signaling a four-byte code point, are mapped to the FirstInter state. In that state, the 64 values 80-BF are mapped to the LastInter state. Now we’ve handled all the code points with four states.
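Here’s how those four states might be wired up, reusing the toy state type from the sketch above; the helper and state names are mine, not Quamina’s.

```go
// addRange maps every byte in [lo, hi] to the same destination state.
func addRange(s *state, lo, hi byte, to *state) {
	for b := int(lo); b <= int(hi); b++ {
		s.next[byte(b)] = to
	}
}

// buildDot wires up the four states that match any UTF-8-encoded code
// point, ignoring for the moment the surrogate problem discussed below.
func buildDot(matched *state) *state {
	newState := func() *state { return &state{next: map[byte]*state{}} }
	start, firstInter, lastInter, last := newState(), newState(), newState(), newState()

	addRange(start, 0x00, 0x7F, matched)    // one-byte code points
	addRange(start, 0xC0, 0xDF, last)       // lead bytes of two-byte code points
	addRange(start, 0xE0, 0xEF, lastInter)  // lead bytes of three-byte code points
	addRange(start, 0xF0, 0xF7, firstInter) // lead bytes of four-byte code points
	addRange(firstInter, 0x80, 0xBF, lastInter)
	addRange(lastInter, 0x80, 0xBF, last)
	addRange(last, 0x80, 0xBF, matched)
	return start
}
```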
I mentioned above about not wanting to match all the code points. “Wait,” you say, “why wouldn’t you want to be maximally inclusive?!” Once again, I’ll link to Unicode Character Repertoire Subsets, a document I co-wrote that is making its way through the IETF and may become an RFC some year. I’m not going to try to summarize a draft that bends over backwards to be short and clear; suffice it to say that there are good reasons for leaving out several different flavors of code point.
Probably the most pernicious code points are the “Surrogates”, U+D800-U+DFFF. If you want an explanation of what they are and why they’re bad, go read that Repertoire Subsets draft or just take my word for it. If you were to encode them per UTF-8 rules (which the UTF-8 spec says you’re not allowed to), the low and high bounds would be ED,A0,80 and ED,BF,BF.
Go’s UTF-8 implementation agrees that Surrogates Are Bad and The UTF-8 Spec Is Good and flatly refuses to convert those UTF-8 sequences into code points or vice versa. The resulting subset of code points even has a catchy name: Unicode Scalars. Case closed, right?
Wrong. Because JSON was designed before we’d thought through these problems, explicitly saying it’s OK to include any code point whatsoever, including surrogates. And Quamina is used for matching JSON data. So, standards fight!
I’m being a little unfair here. I’m sure that if Doug Crockford were inventing JSON now instead of in 2001, he’d exclude surrogates and probably some of the other problematic code points discussed in that Subsets doc.
Anyhow, Quamina will go with Go and exclude surrogates. Any RFC8259 purists out there, feel free to accuse me of standards apostasy and I will grant your point but won’t change Quamina. Actually, not true; at some point I’ll probably add an option to be more restrictive and exclude more than just surrogates.
Which means that now we have to go back to the start of this essay and figure out how many states it takes to match “.” Let’s see…
The Start state changes a bit. Recall the three-byte step above: instead of mapping all of E0-EF to the LastInter state, it maps one byte in that range, ED, to a new state we’ll call, let’s see, how about ED.
In ED, just as in LastInter, 80-9F are mapped to Last. But A0-BF aren’t mapped to anything, because on that path lie the surrogates.
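In code, the tweak to the earlier sketch could look something like this, reusing the buildDot and addRange helpers from above; names are still invented.

```go
// buildScalarDot is buildDot with the surrogates carved out: the Start
// state routes ED to its own state, which accepts only 80-9F.
func buildScalarDot(matched *state) *state {
	start := buildDot(matched)
	lastInter := start.next[0xE0] // E0-EF all lead to LastInter...
	last := lastInter.next[0x80]  // ...which leads to Last

	ed := &state{next: map[byte]*state{}}
	addRange(ed, 0x80, 0x9F, last) // A0-BF deliberately unmapped: surrogates
	start.next[0xED] = ed
	return start
}
```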
So, going with the Unicode Scalar path of virtue means I need five states, not four.
2024-12-15 04:00:00
This story is about Hong Kong and mountains and ferries and food and beer. What happened was, there’s a thirty-year-old picture I wanted to share and it brought the story to mind. I was sure I’d written it up but can’t find it here on the blog, hard as I try, so here we go. Happy ending promised!
The picture I wanted to share is from a business trip to Hong Kong in 1994 and hey, it turns out I have lots more pictures from that trip.
Kai Tak, what an airport that was. If you could open the plane’s windows, you’d have been able to grab laundry hung to dry on people’s balconies. My fast-talking HK friend said “Safest airport in the world! You know pilot paying 100% attention!”
My trip extended over a weekend and I wanted to get out of town so I read up on interesting walks; on paper of course, the Web only just barely existed. Lantau Island was recommended; there was a good hike up over the local mountains that reached a Trappist monastery with a well-reviewed milk bar. So I took the ferry from Central to Mui Wo.
The view from the ferry was great!
I revisited Mui Wo in 2019, on a visit to the Big Buddha.
It was easy to find the hiking trail up the mountains, well-maintained but steep. I stopped to take pictures maybe more often than strictly necessary because it was in the high Celsius thirties with 99% humidity and my North-Euro metabolism wasn’t dealing well. Visions of Trappist ice-cream danced in my head as the sweat dripped off my chin.
Having said that, I’m glad I stopped because the pictures please my eye. These are all Ektachrome; can’t remember whether I took them with the Pentax SLR or the little Nikon pocket camera.
Lantau has the new international airport on it now; I wonder if those green hills are still unspoiled.
Eventually, sweat-soaked and my body screaming for mercy, I reached a small mountaintop. I could see the monastery, but it was a couple of little mountains over, so I arrived in poor condition. Sadly for me, it was a Sunday so, commerce deferring to the sacred, the joint was closed. Poor Tim. Especially since I hadn’t brought anything to eat.
Fortunately I didn’t have to hike all the way back to Mui Wo; almost straight downhill there was a “Monastery Pier” with an occasional ferry to the nearby islet of Peng Chau and a connection back to Central. Looks like there still is.
It was midafternoon, the heat approaching its peak, and walking downhill has its own stresses and strains. By the time I got to the pier I was a sad excuse for a human. Here’s a picture of the ferry.
As you can see, it was pretty crowded, but unsurprisingly, nobody wanted to share the bench the big sweaty panting hungry-looking pink person was on.
Peng Chau itself was visually charming but the ferry connection was tight so I couldn’t explore.
Trudging onto the medium-sized ferry back home, I encountered a food-service option: A counter with one guy and a big steaming pot of soup behind it. My spirit lifted. The guy’s outfit might have once been white; he was unshaven and sweaty but then so was I, and my clothes were nothing to write home about either.
I stopped and pointed at the bowls. He filled one, then wanted to upsell me on a leathery, greasy-looking fried egg to go on top but there are limits. Disappointed, he stepped aside to put it back, revealing a small glass-fronted fridge, icicles hanging off it, full of big cans of San Miguel beer. My spirit lifted again.
The soup was salty and delicious. I’m not sure I’ve enjoyed a beer more in the thirty years since that day. The ferry was fast enough to generate a refreshing breeze all the way, and there were charming boats to photograph.
The tourist who walked off the boat at Central was a dry, well-hydrated, and cheerful specimen of humanity. The next day, my fast-talking HK friend said “You climb over Lantau in that weather yesterday? White guys so weird!” “It was great!” I told him, smirking obnoxiously.
I’ve been back to HK a few times over the years, but it’s not really a happy place any more.
2024-12-13 04:00:00
Implementing regular expressions is hard. Hard in interesting ways that make me want to share the lessons. Thus this series, QRS for short.
People who keep an eye on my Quamina open-source pattern-matching project will have noticed a recent absence of updates and conversation. That’s because, pursuant to Issue #66, I’m working on adding fairly-full regular-expression matching.
This is turning out to be hard. Either it’s the hardest nut I’ve had to crack in many years, or maybe my advanced age is dulling my skills. It’s going to be some time before I can do the first incremental release. Whatever; the learning experiences coming out of this work still feel fresh and fascinating and give me the urge to share.
I hope I can retain that urge as long as I’m still mentally present. In fact, I hope I can retain the ability to work on software. For various reasons, I’ve been under a lot of personal stress in recent years. Stealing time from my adult responsibilities to wrestle with executable abstractions has been a pillar of my sanity.
Anyhow, normally when I code I blog about it, but so far I haven’t because the work is unfinished. Then I realized that it’s too big, and addresses too many distinct problems, to be just one piece, thus this mini-series.
[Readers who don’t know what regular expressions are should probably close this tab now. Don’t feel guilty; nobody who’s not a full-time computing professional should have to know, much less care.]
[Notation: I’m gonna say “Regexp” or maybe just “RE” in this series.]
There are at least three pieces I know I’m going to write:
Parsing RE syntax. (Already written!)
Representing parsed REs.
Implementing UTF-8 based automata for REs.
At the moment, I think the hardest part of the work is #1, Parsing. (Maybe that’s because I haven’t really dug very deep into #3 yet.) I’d be amazed if the final series had only three parts.
Now, introductory material.
Regular expressions come in lots of flavors. The one I’m implementing is I-Regexp, RFC 9485. The observant reader will notice that I co-edited that RFC, and I cheerfully confess to bias.
I-Regexp is basically a subset of XSD Regular Expressions (chosen to subset because they have a nice clean immutable spec), which are a lot like good ol’ PCRE (Perl-compatible regular expressions). Except for:
They are designed assuming they will only ever be used to match against a string and return a “yes” or “no” answer.
They are anchored, which is to say that (unlike PCREs) they’re all assumed to start with ^ and end with $.
They omit popular single-character escapes like \w and \S because those are sketchy in the Unicode context.
They don’t have capture groups or back-references.
They don’t support character class subtraction, e.g. [a-z-[m-p]].
I’m going to claim that they hit a very useful 80/20 point if what you’re interested in is asking “Did the field value match?”, which of course is all Quamina is interested in doing.
I’m totally not going to try to do all this as a big bang. I’ve got a reliable RE parser now (it was hard!) that recognizes ten different RE features, ranging from “.” to everything in (a+b*c?).|d[ef]{3,9}\?\P{Lu}.
Go check out Unbackslash. Tl;dr: It’s terribly painful to deal with the standard RE escaping character \ in Go software that is processing JSON. Because both Go and JSON use \ for escaping, your unit tests eventually fill up with \\ and \\\\\\\\ and become brutally hard to read. So after publishing that blog piece and running polls on Mastodon, ~ is the new \. So that RE above becomes (a+b*c?).|d[ef]{3,9}~?~P{Lu}.
You’re allowed to not like it. But I request that you hold off pushing the big button that sends me to Hell until you’ve tried writing a few unit tests for REs that you want Quamina to process.
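To see what I mean, here’s roughly what one of those pattern strings looks like inside a Go test, first with \ and then with ~ as the escape character. The surrounding JSON field names are made up for illustration, but the doubling-of-doubling is real.

```go
// The RE fragment \?\P{Lu}, escaped once for JSON and again for Go:
var withBackslash = "{\"regexp\": \"d[ef]{3,9}\\\\?\\\\P{Lu}\"}"

// The same pattern with ~ as the escape character:
var withTilde = "{\"regexp\": \"d[ef]{3,9}~?~P{Lu}\"}"
```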
Back to strategy: The first feature is going to be that lovely little dot operator. And thus…
Just for fun, here’s an intellectual challenge. Suppose you’re building a byte-at-a-time state machine to process UTF-8 text. How many states, roughly, would it take to match “.”, i.e. any single Unicode code point? By “match” I mean reject any byte sequence that doesn’t, and when it does match, consume just enough bytes to leave you positioned after the “.” and ready to start matching whatever’s next.
I think I’ve found the correct answer. It surprised me, so I’m still sanity-checking, but I think I’m right. I am convinced the problem isn’t as simple as it looks.
2024-12-13 04:00:00
Parsing regular expression syntax is hard. I’ve written a lot of parsers and, for this one, adopted a couple of new techniques that I haven’t used before. I learned things that might be of general interest.
I was initially surprised that the problem was harder than it looked, but quickly realized that I shouldn’t have been, because my brain has also always had a hard time parsing them.
They’re definitely a write-only syntax and just because I’m gleefully writing this series doesn’t mean I’m recommending you reach for REs as a tool very often.
But I bet most people in my profession find themselves using them pretty regularly, in the common case where they’re the quickest path from A to B. And I know for sure that, on a certain number of occasions, they’ve ended up regretting that choice.
Anyhow, I console myself with the thought that the I-Regexp RE dialect has less syntax and fewer footguns than PCREs generally. Plus, I’ve been having fun implementing them. So knock yourselves out. (Not legal nor investing advice.)
When I started thinking seriously about the parser, the very first thought in my mind was “How in the freaking hell am I going to test this?” I couldn’t stand the thought of writing a single line of code without having a plausible answer. Then it occurred to me that since I-Regexp subsets XSD Regular Expressions, and since XSD (which I mostly dislike) is widely deployed and used, maybe someone already wrote a test suite? So I stuck my head into an XML community space (still pretty vigorous after all these years) and asked “Anyone have an XSD regexp test suite?”
And it worked! (I love this profession sometimes.)
Michael Kay pointed me at a few things, notably including this GitHub repo. The _regex-syntax-test-set.xml file there, too big to display, contains just under a thousand regular expressions, some valid, some not, many equipped with strings that should and should not match.
The process by which I turned it into a *_test.go
file, Dear Reader, was not pretty. I will not share the
ugliness, which involved awk and emacs, plus hideous and largely untested one-off Go code.
But I gotta say, if you have to write a parser for anything, having 992 sample cases makes the job a whole lot less scary.
Lesson: When you’re writing code to process a data format that’s new to you, invest time, before you start, in looking for samples.
The I-Regexp specification contains a complete ABNF grammar for the syntax. For writing parsers I tend to like finite-automaton based approaches, but for a freakishly complicated mini-language like this, I bowed in the direction of Olympus for that grammar and started recursively descending.
I think at some point I understood the theory of Regular Languages and LL(1) and so on, but not any more. Having said that, the recursive-descent technique is conceptually simple, so I plowed ahead. And it worked eventually. But there seemed a lot of sloppy corners where I had to peek one byte ahead or backtrack one. Maybe if I understood LL(1) better it’d have been smoother.
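For the curious, the skeleton of that kind of parser looks something like the following. The names and the exact division into functions are mine, not Quamina’s; the point is the one-byte peek and one-byte backup mentioned above.

```go
// A bare-bones recursive-descent skeleton over an RE string.
type parser struct {
	re  []byte
	pos int
}

func (p *parser) peek() (byte, bool) {
	if p.pos >= len(p.re) {
		return 0, false
	}
	return p.re[p.pos], true
}

func (p *parser) next() (byte, bool) {
	b, ok := p.peek()
	if ok {
		p.pos++
	}
	return b, ok
}

func (p *parser) backup() { p.pos-- } // the "backtrack one" escape hatch

// parseBranch consumes one |-separated branch: a sequence of pieces.
func (p *parser) parseBranch() error {
	for {
		b, ok := p.peek()
		if !ok || b == '|' || b == ')' {
			return nil
		}
		if err := p.parsePiece(); err != nil {
			return err
		}
	}
}

// parsePiece would parse an atom plus an optional quantifier; stubbed here.
func (p *parser) parsePiece() error {
	_, _ = p.next()
	return nil
}
```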
The “character-class” syntax [abc0-9] is particularly awful. The possible leading - or ^ makes it worse, and it has the usual \-prefixed stanzas. Once again, I salute the original specifiers who managed to express this in a usable grammar.
I was tempted, but ended up making no use of Go’s regexp
library to help me parse REs.
I have to say that I don’t like the code I ended up with as much as any of my previous (automaton-based) parsers, nor as much as the rest of the Quamina code. But it seems to work OK. Speaking of that…
When I eventually got the code to do the right thing for each of Michael Kay’s 992 test cases, I was feeling a warm glow. So then I ran the test-coverage tool, and got a disappointingly-low number. I’m not a 100%-coverage militant generally, but I am for ultra-low-level stuff like this with a big blast radius.
And here’s the lesson: Code coverage tools are your friend. I went in and looked at the green and red bars; they revealed that while my tests had passed, I was really wrong in my assumptions about the paths they would make the code take. Substantial refactoring ensued.
Second, and somewhat disappointingly, there were a lot of coverage misses on Go’s notorious little if err != nil
stanza. Which revealed that my sample set didn’t cover the RE-syntax space quite as thoroughly as I’d hoped. In particular,
there was really no coverage of the code’s reaction to malformed UTF-8.
The reason I’m writing this is to emphasize that, even if you’re in a shop where the use of code-coverage tools is (regrettably) not required, you should use one anyhow, on basically every important piece of code. I have absolutely never failed to get surprises, and consequently improved code, by doing this.
I don’t know if I-Regexp is going to be getting any uptake, but it wouldn’t surprise me if it did; it’s a nice tractable subset that hits a lot of use cases. Anyhow, now I have reasonably robust and well-tested I-Regexp parsing code. I’d like to share it, but there’s a problem.
To do that, I’d have to put it in a separate repo; nobody would want to import all of Quamina, which is a fair-sized library, just to parse REs. But then that other repo would become a Quamina dependency. And one of my favorite things about Quamina is that it has 0 dependencies!
It’s not obvious what the right thing to do is; any ideas?
2024-12-03 04:00:00
The murderer I emailed with is still in prison. And the software that got him pissed off at me still runs, so I ran it. Now here I am to pass on the history and then go all geeky. Here’s the tell: If you don’t know what a “filesystem” is (that’s perfectly OK, few reasonable adults need to) you might want to stay for the murderer story then step off the train.
Filesystems are one of the pieces of software that computers need to run, where “computers” includes your phone and laptop and each of the millions of servers that drive the Internet and populate the cloud. There are many flavors of filesystem and people who care about them care a lot.
One of the differences between filesystems is how fast they are. This matters because how fast the apps you use run depends (partly) on how fast the underlying filesystems are.
Writing filesystem software is very, very difficult and people who have done this earn immense respect from their peers. So, a lot of people try. One of the people who succeeded was named Hans Reiser and for a while his “ReiserFS” filesystem was heavily used on many of those “Linux” servers out there on the Internet that do things for you.
Reiser at one point worked in Russia and used a “mail-order bride” operation to look for a spouse. He ended up marrying Nina Sharanova, one of the bride-brokerage translators, and bringing her back to the US with him. They had two kids, got divorced, and then, on September 3, 2006, he strangled her and buried her in a hidden location.
To make a long story short, he eventually pleaded guilty to a reduced charge in exchange for revealing the grave location, and remains in prison. I haven’t provided any links because it’s a sad, tawdry story, but if you want to know the details the Internet has them.
I had interacted with Reiser a few times as a consequence of having written a piece of filesystem-related software called “Bonnie” (more on Bonnie below). I can’t say he was obviously murderous but I found him unpleasant to deal with.
As you might imagine, people generally did not want to keep using the murderer’s filesystem software, but it takes a long time to make this kind of infrastructure change and just last month, ReiserFS was removed as a Linux option. Which led to this Mastodon exchange:
(People who don’t care about filesystems can stop reading now.)
After that conversation, on a whim I tracked down the Bonnie source and ran it on my current laptop, a 2023 M2 MacBook Pro with 32G of RAM and 3T of disk. I think the numbers are interesting in and of themselves even before I start discoursing about benchmarking and filesystems and disks and so on.
              -------Sequential Output--------- ---Sequential Input--- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    GB M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU M/sec %CPU  /sec %CPU
MBP-M2-32G 64  56.9 99.3  3719 89.0  2772 83.4  59.7 99.7  6132 88.0 33613 33.6
Bonnie says:
This puppy can write 3.7 GB/second to a file, and read it back at 6.1GB/sec.
It can update a file in place at 2.8 GB/sec.
It can seek around randomly in a 64GB file at 33K seeks/second.
Single-threaded sequential file I/O is almost but not quite CPU-limited.
I wonder: Are those good numbers for a personal computer in 2024? I genuinely have no idea.
I will shorten the story, because it’s long. In 1988 I was an employee of the University of Waterloo, working on the New Oxford English Dictionary Project. The computers we were using typically had 16MB or so of memory (so the computer I’m typing this on has two thousand times as much) and the full text of the OED occupied 572MB. Thus, we cared really a lot about I/O performance. Since the project was shopping for disks and computers I bashed out Bonnie in a couple of afternoons.
I revised it lots over the years, and Russell Coker made an excellent fork called Bonnie++ that (for a while at least) was more popular than Bonnie. Then I made my own major revision at some point called Bonnie-64.
In 1996, Linus Torvalds recommended Bonnie, calling it a “reasonable disk performance benchmark”.
That’s all I’m going to say here. If for some weird reason you want to know more, Bonnie’s quaint Nineties-flavor home and description pages are still there, plus this blog has documented Bonnie’s twisty history quite thoroughly. And explored, I claim, filesystem-performance issues in a useful way.
I will address a couple of questions here, though.
Many performance-sensitive applications go to a lot of work to avoid reading and/or writing filesystem data on their critical path. There are lots of ways to accomplish this, the most common being to stuff everything into memory using Redis or Memcached or, well, those two dominate the market, near as I can tell. Another approach is to have the data in a file but access it with mmap rather than filesystem logic. Finally, since real disk hardware reads and writes data in fixed-size blocks, you could arrange for your code to talk straight to the disk, entirely bypassing filesystems. I’ve never seen this done myself, but have heard tales of major commercial databases doing so.
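For the mmap flavor, here’s a bare-bones Unix-only sketch in Go; the file name is made up, and real code would need more care with errors and portability.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	f, err := os.Open("data.bin") // hypothetical data file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}

	// Map the file into memory; after this, reads are plain slice
	// indexing with no read() system call on the critical path.
	data, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	fmt.Println("first byte:", data[0])
}
```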
I wonder if anyone has ever done a serious survey study of how the most popular high-performance data repositories, including Relational, NoSQL, object stores, and messaging systems, actually persist the bytes on disk when they have to?
I have an opinion, based on intuition and having seen the non-public inside of several huge high-performance systems at previous employers that, yes, filesystem performance still matters. I’ve no way to prove or even publicly support that intuition. But my bet is that benchmarks like Bonnie are still relevant.
I bet a few of the kind of people who read this blog similarly have intuitions which, however, might be entirely different than mine. I’d like to hear them.
There is a wide range of hardware and software constructs which are accessed through filesystem semantics. They have wildly different performance envelopes. If I didn’t have so many other hobbies and projects, it’d be fun to run Bonnie on a sample of EC2 instance types with files on various EBS and EFS and so on configurations.
For the vast majority of CPU/storage operations in the cloud, there’s at least one network hop involved. Out there in the real world, there is still really a lot of NFS in production. None of these things are much like that little SSD slab in my laptop. Hmmm.
I researched whether some great-great-grandchild of Bonnie was the new hotness in filesystem benchmarking, adopting the methodology of typing “filesystem benchmark” into Web search. The results were disappointing; it doesn’t seem like this is a thing people do a lot. Which would suggest that people don’t care about filesystem performance that much? Which I don’t believe. Puzzling.
Whenever there was a list of benchmarks you might look at, Bonnie and Bonnie++ were on that list. Looks to me like IOZone gets the most ink and is thus probably the “industry-leading” benchmark. But I didn’t really turn up any examples of quality research comparing benchmarks in terms of how useful the results are.
The biggest problem in benchmarking filesystem I/O is that Linux tries really hard to avoid doing it, aggressively using any spare memory as a filesystem cache. This is why serving static Web traffic out of the filesystem often remains a good idea in 2024; your server will take care of caching the most heavily fetched data in RAM without you having to do cache management, which everyone knows is hard.
I have read of various cache-busting strategies and have never really been convinced that they’ll outsmart this aspect of Linux, which was written by people who are way smarter and know way more than I think I do. So Bonnie has always used a brute-force approach: Work on a test file which is much bigger than main memory, so Linux has to do at least some real I/O. Ideally you’d like it to be several times the memory size.
But this has a nasty downside. The computer I’m typing on has 32GB of memory, so I ran Bonnie with a 64G filesize (128G would have been better) and it took 35 minutes to finish. I really don’t see any way around this annoyance but I guess it’s not a fatal problem.
Oh, and those numbers: Some of them look remarkably big to me. But I’m an old guy with memories of how we had to move the bits back and forth individually back in the day, with electrically-grounded tweezers.
I can’t remember when this was, but some important organization was doing an evaluation of filesystems for inclusion in a big contract or standard or something, and so they benchmarked a bunch, including ReiserFS. Bonnie was one of the benchmarks.
Bonnie investigates the rate at which programs can seek around in a file by forking off three child processes that do a bunch of random seeks, read blocks, and occasionally dirty them and write them back. You can see how this could be stressful for filesystem code, and indeed, it occasionally made ReiserFS misbehave, which was noted by the organization doing the benchmarking.
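In rough outline, each of those children does something like the following; this is a Go paraphrase rather than Bonnie’s actual C, and the block size and dirty-write ratio here are arbitrary (it assumes "os" and "math/rand" are imported).

```go
// seeker does random-access I/O on an already-opened test file: seek
// to a random block, read it, and occasionally write it back dirty.
func seeker(f *os.File, fileSize int64, seeks int) error {
	const blockSize = 8192
	buf := make([]byte, blockSize)
	blocks := fileSize / blockSize
	for i := 0; i < seeks; i++ {
		off := rand.Int63n(blocks) * blockSize
		if _, err := f.ReadAt(buf, off); err != nil {
			return err
		}
		if i%10 == 0 { // every so often, dirty the block and write it back
			buf[0] ^= 0xFF
			if _, err := f.WriteAt(buf, off); err != nil {
				return err
			}
		}
	}
	return nil
}
```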
Pretty soon I had email from Reiser claiming that what Bonnie was doing was actually violating the contract specified for the filesystem API in terms of concurrent write access. Maybe he was right? I can’t remember how the conversation went, but he annoyed me and in the end I don’t think I changed any code.
At one time Bonnie was on SourceForge, then Google Code, but I decided that if I were going to invest effort in writing this blog, it should be on GitHub too, so here it is. I even filed a couple of bugs against it.
I make no apologies for the rustic style of the code; it was another millennium and I was just a kid.
I cheerfully admit that I felt a warm glow checking in code originally authored 36 years ago.
2024-11-16 04:00:00
As a dangerous and evil man drives people away from Xitter, many stories are talking up Bluesky as the destination for the diaspora. This piece explains why I kind of like Bluesky but, for the moment, have no intention of moving my online social life away from the Fediverse.
(By “Fediverse” I mean the social network built around the ActivityPub protocol, which for most people means Mastodon.)
If we’re gonna judge social-network alternatives, here are three criteria that, for me, really matter: Technology, culture, and money.
I don’t think that’s controversial. But this is: Those are in increasing order of importance. At this point in time, I don’t think the technology matters at all, and money matters more than all the others put together. Here’s why.
Mastodon and the rest of the fediverse rely on ActivityPub implementations. Bluesky relies on the AT Protocol, of which so far there’s only one serious implementation.
Both of these protocols are good enough. We know this is true because both are actually working at scale, providing good and reliable experiences to large numbers of people. It’s reasonable to worry what happens when you get to billions of users and also about which is more expensive to operate. But speaking as someone who spent decades in software and saw it from the inside at Google and AWS, I say: meh. My profession knows how to make this shit work and work at scale. Neither alternative is going to fail, or to trounce its competition, because of technology.
I could write many paragraphs about the competing nice features and problems of the competing platforms, and many people have. But it doesn’t matter that much because they’re both OK.
At the moment, Bluesky seems, generally speaking, to be more fun. The Fediverse is kind of lefty and geeky and queer. The unfortunate Mastodon culture of two years ago (“Ewww, you want us to have better tools and be more popular? Go away!”) seems to have mostly faded out. But the Fediverse doesn’t have much in the way of celebrities shitposting about the meme-du-jour. In fact it’s definitely celebrity-lite.
I enjoy both cultural flavors, but find Fedi quite a lot more conversational. There are others who find the opposite.
More important, I don’t think either culture is set in stone, or has lost the potential to grow in multiple new, interesting directions.
Here’s the thing. Whatever you think of capitalism, the evidence is overwhelming: Social networks with a single proprietor have trouble with long-term survival, and those that do survive have trouble with user-experience quality: see Enshittification.
The evidence is also perfectly clear that it doesn’t have to be this way. The original social network, email, is now into its sixth decade of vigorous life. It ain’t perfect but it is essential, and not in any serious danger.
The single crucial difference between email and all those other networks — maybe the only significant difference — is that nobody owns or controls it. If you have a deployment that can speak the languages of IMAP and SMTP and the many anti-spam tools, you are de facto part of the global email social network.
The definitive essay on this question is Mike Masnick’s Protocols, Not Platforms: A Technological Approach to Free Speech. (Mike is now on Bluesky’s Board of Directors.)
My bet for the future (and I think it’s the only one with a chance) is a global protocol-based conversation with many thousands of individual service providers, many of which aren’t profit-oriented businesses. One of them could be your local Buddhist temple, and another could be Facebook. The possibilities are endless: Universities, government departments, political parties, advocacy organizations, sports teams, and, yes, tech companies.
It’s obvious to me that the Fediverse has the potential to become just this. Because it’s most of the way there already.
Could Bluesky? Well, maybe. As far as I can tell, the underlying AT Protocol is non-proprietary and free for anyone to build on. Which means that it’s not impossible. But at the moment, the service and the app are developed and operated by “Bluesky Social, PBC”. In practice, if that company fails, the app and the network go away. Here’s a bit of Bluesky dialogue:
In practice, “Bsky corp” is not in immediate danger of hard times. Their team is much larger than Mastodon’s and on October 24th they announced they’d received $15M in funding, which should buy them at least a year.
But that isn’t entirely good news. The firm that led the investment is seriously sketchy, with strong MAGA and cryptocurrency connections.
The real problem, in my mind, isn’t in the nature of this particular Venture-Capital operation. Because the whole raison d’être of Venture Capital is to make money for the “Limited Partners” who provide the capital. Since VC investments are high-risk, most are expected to fail, and the ones that succeed have to exhibit exceptional revenue growth and profitability. Which is a direct path to the problems of survival and product quality that I mentioned above.
Having said that, the investment announcement is full of soothing words about focus on serving the user and denials that they’ll go down the corrupt and broken crypto road. I would like to believe that, but it’s really difficult.
To be clear, I’m a fan of the Bluesky leadership and engineering team. With the VC money as fuel, I expect their next 12 months or so to be golden, with lots of groovy features and mind-blowing growth. But that’s not what I’ll be watching.
I’ll be looking for ecosystem growth in directions that enable survival independent of the company. In the way that email is independent of any technology provider or network operator.
Just like Mastodon and the Fediverse already are.
Yes, in comparison to Bluesky, Mastodon has a smaller development team and slower growth and fewer celebrities and less buzz. It’s supported by Patreon donations and volunteer labor. And in the case of my own registered co-operative instance CoSocial.ca, membership dues of $50/year.
Think of the Fediverse not as just one organism, but a population of mammals, scurrying around the ankles of the bigger and richer alternatives. And when those alternatives enshittify or fall to earth, the Fediversians will still be there. That’s why it’s where my social-media energy is still going.
On the Fediverse you can follow a hashtag and I’m subscribed to #Bluesky, which means a whole lot of smart, passionate writing on the subject has been coming across my radar. If you’re interested enough to have read to the bottom of this piece, I bet one or more of these will reward an investment of your time:
Maybe Bluesky has “won”, by Gavin Anderegg, goes deep on the trade-offs around Bluesky’s AT Protocol and shares my concern about money.
Blue Sky Mine, by Rob Horning, ignores technology and wonders about the future of text-centric social media and is optimistic about Bluesky.
Does Bluesky have the juice?, by Max Read, is kind of cynical but says smart things about the wave of people currently landing on Bluesky.
The Great Migration to Bluesky Gives Me Hope for the Future of the Internet, by Jason Koebler over at 404 Media, is super-optimistic: “Bluesky feels more vibrant and more filled with real humans than any other social media network on the internet has felt in a very long time.” He also wonders out loud if Threads’ flirtation with Mastodon has been damaging. Hmm.
And finally there’s Cory Doctorow, probably the leading thinker about the existential conflict between capitalism and life online, with Bluesky and enshittification. This is the one to read if you’re thinking that I’m overthinking and over-worrying about a product that is actually pretty nice and currently doing pretty well. If you don’t know what a “Ulysses Pact” is, you should read up and learn about it. Strong stuff.