2025-01-23 04:00:00
That’s us. I assume you’re among those horrified at the direction of politics and culture in recent years and especially recent weeks, in the world at large and especially in America. We are a minority. We shouldn’t try to deny it, we should be adults and figure out how to deal with it.
I’m out of patience with people who put the blame on the pollsters or the media or Big Tech, or really any third party. People generally heard what Mr Trump was offering — portrayed pretty accurately I thought — and enough of them liked it to elect him. Those who didn’t are in a minority. Quit dodging and deal.
Clearly, we the minority have failed in explaining our views. Many years ago I wrote an essay called Two Laws of Explanation. One law says that if you’re explaining something and the person you’re explaining to doesn’t get it, that’s not their problem, it’s your problem. I still believe this, absolutely.
So let’s try to figure out better explanations.
But first, a side trip into economic perception and reality.
A strong faction of progressives and macroeconomists are baffled by people being disaffected when the economy, they say, is great. Paul Krugman beats this drum all the time. Unemployment and inflation are low! Everything’s peachy! Subtext: If the population disagrees, they are fools.
I call bullshit. The evidence of homelessness is in my face wherever I go, even if there are Lamborghinis cruising past the sidewalk tents. Food banks are growing. I give a chunk of money every year to Adopt a School, which puts free cafeterias in Vancouver schools where kids are coming to school hungry. Kingston, a mid-sized mid-Canadian city, just declared an emergency because one household in three is suffering from food insecurity.
Even among those who are making it, for many it’s just barely:
… half of Canadians (50%, +8) are now $200 or less away each month from not being able to pay their bills and debt payments. This is a result of significantly more Canadians saying they are already insolvent (35%, +9) compared to last quarter. Canadians who disproportionately report being $200 or less away from insolvency continue to be women (55%, +4) but the proportion of men at risk has increased to 44%, up 13 points from last quarter.
Source: MNP Consumer Debt Index. The numbers like “+8” give the change since last quarter.
(Yes, this data is Canadian, because I am. But I can’t imagine that America is statistically any better.)
Minorities need to study majorities closely. So let me sort them, the ones who gave Trump the election I mean, into baskets:
Stone racists who hate immigrants, especially brown ones.
Culture warriors who hate gays and trans people and so on.
Class warriors: the conventional billionaire-led Republican faction, rich people voting for anyone they think offers lower taxes and less regulation.
People who don’t pay much attention to the news but remember that gas was cheaper when Trump was in office.
Oh wait, I forgot one: People who heard Trump say what boiled down to “The people who are running things don’t care about you and are corrupt!” This worked pretty well because far too many don’t and are. A whole lot of the people who heard this are financially stressed (see above).
Frankly, I wouldn’t bother trying to reach out to either of the first two groups. Empirically, some people are garbage. You can argue that it’s not their fault; maybe they had a shitty upbringing or just fell into the wrong fellowships. Maybe. But you can be sure that that’s not your fault. The best practice is some combination of ignoring them and defending against their attacks, politics versus politics and force versus force.
I think talking to the 1% is worthwhile. The fascist leaders are rich, but not all of the rich are fascist. Some retain much of their humanity. And presumably some are smart enough to hear an argument that on this economic path lie tumbrils and guillotines.
That leaves the people who mostly ignore the news and the ones who have just had it with the deal they’re getting from late-Capitalist society. I’m pretty sure that’s who we should be talking to, mostly.
I’m not going to claim I know. I hear lots of suggestions…
In the New Yorker, Elizabeth Kolbert’s Does One Emotion Rule All Our Ethical Judgments? makes two points. First, fear generally trumps all other emotions. So, try phrasing your arguments in terms of the threats that fascism poses directly to the listener, rather than abstract benefits to be enjoyed by everyone in a progressive world.
Second, she points out the awesome power of anecdote: MAGA made this terrible thing happen to this actual person, identified by name and neighborhood.
On Bluesky, Mark Cuban says we need offensive hardass progressive political podcasts, and offers a sort of horrifying example that might work.
On Bloomberg (paywalled) they say that the ruling class should be terrified of a K-shaped recovery; by inference, progressives should be using that as an attack vector.
Josh Marshall has been arguing for weeks that since the enemies won the election, they have the power and have to own the results. Progressives don’t need to sweat alternative policies, they just have to highlight the downsides of encroaching fascism (there are plenty) and say “What we are for is NOT THAT!” and just keep saying it. Here’s an example.
Maybe one of these lines of attack is right. I think they’re all worth trying. And I’m pretty sure I know one ingredient that’s going to have to be part of any successful line of attack…
Looking back at last year’s Presidential campaign, there’s a thing that strikes me as a huge example of What Not To Do. I’m talking about the Harris campaign’s slogan: “Opportunity Economy”. This is marketing-speak. If there’s one thing we should have learned it’s that the population as a whole — rich, poor, Black, white, queer, straight, any old gender — has learned to see through this kind of happy talk.
Basically, in Modern Capitalism, whenever, and I mean whenever without exception, whenever someone offers you an “opportunity”, they’re trying to take advantage of you. Which makes that slogan appallingly tone-deaf, and apparently nobody inside that campaign asked themselves the simple question “Would I actually use this language in talking to someone I care about?” Because they wouldn’t.
Be blunt. Call theft theft. Call lies lies. Call violence violence. Call ignorance ignorance. Call stupidity stupidity.
Also, talk about money a lot. Because billionaires are unpopular.
Don’t say anything you wouldn’t say straight-up in straight-up conversation with a real person. Don’t let any marketing or PR professionals edit the messaging. This is the kind of messaging that social media is made for.
Maybe I’m oversimplifying, but I don’t think so.
2025-01-15 04:00:00
Bluesky and the Fediverse are our best online hopes for humane human conversation. Things happened on 2025/01/13; I’ll hand the microphone to Anil Dash, whose post starts “This is a monumental day for the future of the social web.”
What happened? Follow Anil’s links: Mastodon and Bluesky (under the “Free Our Feeds” banner). Not in his sound-bite: Both groups are seeking donations, raising funds to meet those goals.
I’m sympathetic to both these efforts, but not equally. I’m also cynical, mostly about the numbers: They’ve each announced a substantial fundraising target, and I’m not going to share either, because they’re just numbers pulled out of the air, written on whiteboards, designed to sound impressive.
These initiatives, just by existing, are evidence in letters of fire 500 miles high that people have noticed something important: Corporately-owned town squares are irreversibly discredited. They haven’t worked in the past, they don’t work now, and they’ll never work.
Something decentralized is the only way forward. Something not owned by anyone, defined by freely-available protocols. Something like email. Or like the Fediverse, which runs on the ActivityPub protocol. Or, maybe Bluesky, where by “Bluesky” I mean independent service providers federated via the AT Protocol, “ATProto” for short.
I’ll tell you what’s hard: Raising money for a good cause, when that good cause is full of abstractions about openness and the town square and so on. Which implies you’re not intending that the people providing the money will make money. So let’s wish both these efforts good luck. They’ll need it.
Previously in Why Not Bluesky I argued that, when thinking about the future of conversational media, what matters isn’t the technology, or even so much the culture, but the money: Who pays for the service? On that basis, I’m happy about both these initiatives.
But now I’m going to change course and talk about technology a bit. At the moment, the ATProto implementation that drives Bluesky is the only one in the world. If the company operating it failed in execution or ran out of money, the service would shut down.
So, in practice, Bluesky’s not really decentralized at all. Thus, I’m glad that the “Free Our Feeds” effort is going to focus on funding an alternative ATProto implementation. In particular, they’re talking about offering an alternative ATProto “Relay”.
Before I go on, you’re going to need a basic understanding of what ATProto is and how its parts work. Fortunately, as usual, Wikipedia has a terse, accurate introduction. If you haven’t looked into ATProto yet, please hop over there and remedy that. I’ll wait.
Now that you know the basics, you can understand why Free Our Feeds is focusing on the Relay. Because, assuming that Bluesky keeps growing, this is going to be a big, challenging piece of software to build, maintain, and operate, and the performance of the whole service depends on it.
The Fediverse in general and Mastodon in particular don’t rely on a global firehose feed that knows everything that happens, like an eye in the sky. In fact, the ActivityPub protocol assumes a large number of full-stack peer implementations that chatter with each other, in stark contrast to ATProto’s menagerie of Repos and PDSes and Relays and App Views and Lexicons.
The ATProto approach has advantages; since the Relay knows everything, you can be confident of seeing everything relevant. The Fediverse makes no such promise, and it’s well-known that in certain circumstances you can miss replies to your posts. And perhaps more important, miss replies to others’ posts, which opens the door to invisible attackers.
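To keep those moving parts straight, here’s a toy Go sketch of the two shapes. It’s a mental model only, nothing like the real ActivityPub or ATProto schemas, and all the type and field names are mine, not Mastodon’s or Bluesky’s:

```go
// A mental model only: nothing like the real ActivityPub or ATProto
// schemas. The type and field names are simplifications for this sketch.
package mentalmodel

// In the Fediverse, every server is a full-stack peer: it stores its own
// users' posts and exchanges copies directly with the other servers that
// follow them. There is no global view anywhere.
type FediverseServer struct {
	LocalPosts []string
	Peers      []*FediverseServer // servers it delivers to and receives from
}

// In ATProto, each user's data lives in a repo hosted on a Personal Data
// Server (PDS). A Relay crawls the PDSes and emits one global firehose,
// which App Views consume to build feeds, search, and so on.
type PDS struct {
	Repos map[string][]string // per-user repositories of posts
}

type Relay struct {
	Sources  []*PDS      // every PDS it knows about
	Firehose chan string // the global event stream App Views subscribe to
}

type AppView struct {
	Events <-chan string // reads the Relay's firehose
}
```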
And this makes me nervous. Because why would anyone make the large engineering and financial investments that’d be required to build and operate an ATProto Relay?
ActivityPub servers may have their flaws, but in practice they are pretty cheap to operate. And it’s easy to think of lots of reasons why lots of organizations might want to run them:
A university, to provide a conversational platform for its students…
… or its faculty.
A Developer Relations team, to talk to geeks.
Organized religion, for evangelism, scholarship, and ministry.
Marketing and PR teams, to get the message out.
Government departments that provide services to the public.
Or consider my own instance, CoSocial, the creation of Canadians who (a) are fans of the co-operative movement, (b) are concerned about Canadians’ data staying in Canada, and (c) want to explore modes of funding conversational media that aren’t advertising or Patreon.
Maybe, having built and run a Relay, the Free Our Feeds people will discover a rationale for why anyone else should do this.
I hope both efforts hit their fundraising targets. I hope both succeed at what they say they’re going to try.
But for my own conversation with the world, I’m sticking with the Fediverse.
Most of all, I’m happy that so many people, whatever they think of capitalism, have realized that it’s an unsuitable foundation for online human conversation. I hope that number keeps growing.
2025-01-11 04:00:00
What happened was, there was a pretty moon in the sky, so I got out a tripod and the big honkin’ Tamron 150-500 and fired away. Here’s the shot I wanted to keep.
I thought the camera and lens did OK given that I was shooting from sea level through soggy Pacific-Northwest winter air. But when I zoomed in there was what looked like pretty heavy static. So I applied Lightroom to the problem, twice.
I’ll be surprised if many of you can see a significant difference. (Go ahead and enlarge.) But you would if it were printed on a big piece of paper and hung on a wall. So we’ll look at the zoomed-in version. But first…
Lightroom has had a Luminance-noise reduction tool for years. Once you wake it up, you can further refine with “Detail” and “Contrast” sliders, whose effects are subtle at best. For the moon shot, I cranked the Luminance slider pretty much all the way over and turned up Detail quite a bit too.
In recent Lightroom versions there’s a “Denoise…” button. Yes, with an ellipsis and a note that says “Reduce noise with AI.” It’s slow; took 30 seconds or more to get where it was going.
Anyhow, here are the close-up shots.
I have a not-terribly-strong preference for the by-hand version. I think both noise reductions add value to the photo. I wonder why the AI decided to enhance the very-slight violet cast? You can look at the rim of one crater or another and obsess about things that nobody just admiring the moon will ever see.
It’s probably worth noting that the static in the original version isn’t “Luminance noise”, which is what you get when you’re pushing your sensor too hard to capture an image in low light. When you take pictures of the moon you quickly learn that it’s not a low-light scenario at all; the moon is a light-colored object in direct sunlight. These pix were taken at f/7.1 with a 1/4000-second shutter. I think the static is just the Earth’s atmosphere getting in the way. So I’m probably abusing Lightroom’s Luminance slider. Oh well.
You could take this as an opportunity to sneer at AI, but that would be dumb. First, Lightroom’s AI-driven “select sky” and “select subject” tools work astonishingly well, most times. Second, Adobe’s been refining that noise-reduction code for decades and the AI isn’t even a year old yet.
We’ll see how it goes.
2025-01-05 04:00:00
Here we are, it’s 2025 and Bitcoin is surging. Around $100K last time I looked. While its creation spews megatons of carbon into our atmosphere, investors line up to buy it in respectable ETFs, and long-term players like retirement pools and university endowments are looking to get in. Many of us are finding this extremely annoying. But I look at Bitcoin and I think what I’m seeing is Modern Capitalism itself, writ large and in brutally sharp focus.
[Disclosure: In 2017 I made a lot of money selling Bitcoins at around $20K, ones I’d bought in 2013. Then in 2021 I lost money shorting Bitcoin (but I’m still ahead on this regrettable game).]
A Bitcoin is verifiable proof that a large amount of computing has been done. Let’s measure that computing in carbon; it’s complicated and I’ve seen a range of estimates, but they’re all over 100 tonnes of CO2 per Bitcoin. That proof is all that a Bitcoin is.
Bitcoin is also a store of value. It doesn’t matter whether you think it should be, empirically it is, because lots of people are exchanging lots of money for Bitcoins on the assumption that they will store the value of that money. Is it a good store of value? Many of us think not, but who cares what we think?
Is Bitcoin a currency? I mean, sure, there are currency applications in gun-running, ransoms, narcotics, and sanctions-dodging. But nope: the blockchain is so expensive and slow that all most people can really do with Bitcoin is refresh their wallets hoping to see number go up.
The success of Bitcoin teaches the following about capitalism in the 2020s:
Capitalism doesn’t care about aesthetics. Bitcoins in and of themselves in no way offer any pleasure to any human.
Capitalism doesn’t care about negative externalities generally, nor about the future of the planet in particular. As long as the number goes up, the CO2 tonnage is simply invisible. Even as LA burns.
Capitalism can be oblivious to the sunk-cost fallacy as long as people are making money right now.
Capitalism doesn’t care about utility; the fact that you can’t actually use Bitcoins for anything is apparently irrelevant.
Capitalism is oblivious to crime, too. The fact that most actual use of Bitcoins as a currency carries the stench of international crime doesn’t seem to bother anyone.
Capitalism doesn’t care about resiliency or sustainability. Bitcoins are fragile; very easy to lose forever by forgetting a password or failing to back up data just right. Also, on the evidence, easy to steal.
Capitalism can get along with obviously crazy behavior, for example what MicroStrategy is doing: Turning a third-rate software company into a bag of Bitcoins and having an equity valuation that is higher than the value of the bag; see Matt Levine (you have to scroll down a bit, look for “MicroStrategy”).
Capitalism says: “Only money is real. Those other considerations are for amateurs. Also, fuck the future.”
Is capitalism hopeless, then? Not entirely. As Paul Krugman points out, a market-based economy can in practice deliver reasonably good results for a reasonably high proportion of the population, as America’s did in the decades following 1945. Was that a one-time historical aberration? Maybe.
But as for what capitalism has become in the 21st century? Everything got financialized and Bitcoin isn’t the disease, it’s just a highly visible symptom. Other symptoms: The explosion of homelessness, the destruction of my children’s ecosystem, the gig economy, and the pervasiveness of wage theft. It’s really hard to find a single kind word to say.
Is Bitcoin dangerous? Not existentially. I mean, smart people are worried, for example Rostin Behnam, chair of the Commodity Futures Trading Commission: “You still have a large swath of the digital asset space unregulated in the US regulatory system and it’s important — given the adoption we’ve seen by some traditional financial institutions, the huge demand for these products by both the retail and institutional investors — that we fill this gap.”
All that granted, the market cap of Bitcoin is around two trillion US dollars as I write this. Yes, that’s a lot of money. But most Bitcoins are held by market insiders, so even in the (plausible) case that the price plunges close to zero, the damage to the mainstream economy shouldn’t be excessive.
It’s just immensely annoying.
One of the things Bitcoin teaches us is that there is too much money in the world, more than can be put to work in sensible investments. So the people who have it do things like buy Bitcoins.
Gold is also a store of value, also mostly just because people believe it is. But it has the virtues of beauty and of applications in jewellery and electronics. I dunno, I’m seriously thinking about buying some on the grounds that the people who have too much money are going to keep investing in it. In particular if Bitcoin implodes.
I’ve been snarling at cryptocurrencies since 2018 or so. But, number go up. So I’ll close by linking to HODLers apology.
Is this the best socio-economic system we as a species can build?
2024-12-30 04:00:00
Recently I posted Matching “.” in UTF-8, in which I claimed that you could match the regular-expression “.” in a UTF-8 stream with either four or five states in a byte-driven finite automaton, depending how you define the problem. That statement was arguably wrong, and you might need three more states, for a total of eight. But you can make a case that really, only four should be needed, and another case calling for quite a few more. Because that phrase “depending how you define the problem” is doing a lot of work.
But first, thanks: Ed Davies, whose blog contributions (1, 2, 3) were getting insufficient attention from me until Daphne Preston-Kendal insisted I look more closely.
To summarize Ed’s argument: There are a bunch of byte combinations that look (and work) like regular UTF-8 but are explicitly ruled out by the Unicode spec, in particular Section 3.9.3 and its Table 3-7.
Ed posted a nice picture of a corrected 8-state automaton that will fail to match any of these forbidden sequences.
I looked closely at Ed’s proposal and it made sense, so I implemented it and (more important) wrote a bunch of unit tests exploring the code space, and it indeed seems to accept/reject everything correctly per Unicode 3.9.3.
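I’m not going to reproduce Ed’s diagram here, but if you want a feel for which byte sequences Table 3-7 rules out, Go’s unicode/utf8 package makes a handy oracle. This isn’t Quamina’s test suite, just a quick illustrative probe:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// A quick probe of which byte sequences the Unicode well-formedness rules
// (as implemented by Go's unicode/utf8) accept and reject. Illustration
// only; Quamina's real unit tests are much more thorough.
func main() {
	cases := []struct {
		name  string
		bytes []byte
	}{
		{"plain ASCII 'a'", []byte{0x61}},
		{"U+D7FF, last code point before the surrogates", []byte{0xED, 0x9F, 0xBF}},
		{"U+10FFFF, top of the code space", []byte{0xF4, 0x8F, 0xBF, 0xBF}},
		{"overlong encoding of NUL (C0 80)", []byte{0xC0, 0x80}},
		{"surrogate U+D800 encoded UTF-8-style (ED A0 80)", []byte{0xED, 0xA0, 0x80}},
		{"beyond U+10FFFF (F4 90 80 80)", []byte{0xF4, 0x90, 0x80, 0x80}},
	}
	for _, c := range cases {
		fmt.Printf("%-48s well-formed: %v\n", c.name, utf8.Valid(c.bytes))
	}
}
```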
So, argument over, and I should go forward with the 8-state Davies automaton, right? Why am I feeling nervous and grumpy, then?
I’ve already mentioned in this series that your protocols and data structures just gotta support Unicode in the 21st century, but you almost certainly don’t want to support all the Unicode characters, where by “character” I mean, well… if you care at all about this stuff, please go read Unicode Character Repertoire Subsets (“Unichars” for short), a draft inching its way through the IETF, with luck an RFC some day. And if you really care, dig into RFC 8264: PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols. Get a coffee first; PRECIS has multiple walls of text and isn’t simple at all. But it goes to tremendous lengths to address security issues and other best practices.
If you don’t have the strength, take my word for it that the following things are true:
We don’t talk much about abstract characters; instead we focus on the numeric “code points” that represent them.
JSON, for historical reasons, accepts all the code points.
There are several types of code points that don’t represent characters: “Surrogates”, “controls”, and “noncharacters”.
There are plenty of code points that are problematic because they look like other characters, so phishers and other attackers can use them to fool their victims.
There are characters that you shouldn’t use because they represent one or another of the temporary historical hacks used in the process of migrating from previous encoding schemes to Unicode.
The consequence of all this is that there are many subsets of Unicode that you might want to restrict users of your protocols or data structures to:
JSON characters: That is to say, all of them, including all the bad stuff.
Unichars “Scalars”: Everything except the surrogates.
Unichars “XML characters”: Lots but not all of the problematic code points excluded.
Unichars “Unicode Assignables”: “All code points that are currently assigned, excluding legacy control codes, or that might in future be assigned.”
PRECIS “IdentifierClass”: “Strings that can be used to refer to, include, or communicate protocol strings like usernames, filenames, data feed identifiers, and chatroom names.”
PRECIS “FreeformClass”: “Strings that can be used in a free-form way, e.g., as a password in an authentication exchange or a nickname in a chatroom.”
Some variation where you don’t accept any unassigned code points; risky, because that changes with every Unicode release.
(I acknowledge that I am unreasonably fond of numbered lists, which is probably an admission that I should try harder to compose smoothly-flowing linear arguments that don’t need numbers.)
You’ll notice that I didn’t provide links for any of those entries. That’s because you really shouldn’t pick one without reading the underlying document describing why it exists.
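Still, to give a feel for what these subsets are excluding, here’s a rough Go sketch of the relevant checks. The code-point ranges are standard Unicode facts (surrogates, noncharacters, legacy controls), but the subsets’ exact membership rules live in the documents above; treat this as illustration, not a normative definition of any of them.

```go
package main

import "fmt"

// Rough, non-normative checks for a few kinds of problematic code points.
// The real subset definitions are in the Unichars draft and RFC 8264.

// isSurrogate reports whether cp is in the surrogate block U+D800-U+DFFF.
func isSurrogate(cp rune) bool { return cp >= 0xD800 && cp <= 0xDFFF }

// isNoncharacter reports whether cp is one of the 66 designated
// noncharacters: U+FDD0-U+FDEF, plus any code point ending in FFFE or FFFF.
func isNoncharacter(cp rune) bool {
	if cp >= 0xFDD0 && cp <= 0xFDEF {
		return true
	}
	low := cp & 0xFFFF
	return low == 0xFFFE || low == 0xFFFF
}

// isLegacyControl reports whether cp is a C0 control, DEL, or C1 control,
// treating tab, newline, and carriage return as useful rather than legacy.
// (That carve-out is this sketch's choice, not any spec's exact rule.)
func isLegacyControl(cp rune) bool {
	switch cp {
	case '\t', '\n', '\r':
		return false
	}
	return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F)
}

func main() {
	for _, cp := range []rune{'A', 0x0007, 0xD800, 0xFDD0, 0x7FFFF, 0x1F600} {
		fmt.Printf("U+%04X surrogate=%-5v noncharacter=%-5v legacyControl=%v\n",
			cp, isSurrogate(cp), isNoncharacter(cp), isLegacyControl(cp))
	}
}
```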
I dunno. None of the above are crazy. I’m kind of fond of Unicode Assignables, which I co-invented. The only thing I’m sure of is that you should not go with JSON Characters, because its rules make the following chthonic horror perfectly legal:
{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}
Unichars describes it:
The value of the “example” field contains the C0 control NUL, the C1 control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points. It is unlikely to be useful as the value of a text field. That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.
No, really.
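And if you want to see where your own JSON stack lands, a check like this takes a minute. Go’s encoding/json is documented to replace invalid surrogate escapes with U+FFFD rather than reject them, which puts it on the “silently parse” side of the fence; other libraries make other choices:

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// Feed the Unichars horror-show example to a JSON parser and inspect what
// comes out. Go's encoding/json documents that invalid surrogate escapes
// are replaced with U+FFFD rather than treated as errors; other libraries
// behave differently, which is exactly the draft's point.
func main() {
	input := `{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}`
	var parsed map[string]string
	if err := json.Unmarshal([]byte(input), &parsed); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	value := parsed["example"]
	fmt.Printf("parsed silently: %d bytes, well-formed UTF-8: %v, contents: %+q\n",
		len(value), utf8.ValidString(value), value)
}
```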
If you’re wondering what a “Quamina” is, you probably stumbled into this post through some link and, well, there’s a lot of history. Tl;dr: Quamina is a pattern-matching library in Go with an unusual (and fast) performance envelope; it can match thousands of Patterns to millions of JSON blobs per second. For much, much more, peruse the Quamina Diary series on this blog.
Anyhow, all this work in being correctly restrictive as to the shape of the incoming UTF-8 was making me uncomfortable. Quamina is about telling you what byte patterns are in your incoming data, not enforcing rules about what should be there.
And it dawned on me that it might be useful to ask Quamina to look at a few hundred thousand inputs per second and tell you which had ill-formed-data problems. Quamina’s dumb-but-fast byte-driven finite automaton would be happy to do that, and very efficiently too.
So, having literally lain awake at night fretting over this, here’s what I think I’m going to do:
I’ll implement a new Quamina pattern called ill-formed or some such that will match any field that has busted UTF-8 of the kind we’ve been talking about here. It’d rely on an automaton that is basically the inverse of Davies’ state machine. (A hypothetical sketch of what such a pattern might look like follows this list.)
By default, the meaning of “.” will be “matches the Davies automaton”; it’ll match well-formed UTF-8 representing any code point except surrogates.
I’ll figure out how to parameterize regular-expression matches so you can change the definition of “.” to match one or more of the smaller subsets like those in the list above from Unichars and PRECIS.
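For concreteness, a pattern using it might look something like this. To be clear, ill-formed doesn’t exist yet; this is a hypothetical sketch with a made-up field name, modeled on the shape of Quamina’s existing extended patterns such as exists:

```json
{"user_name": [{"ill-formed": true}]}
```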
But who knows, maybe I’ll end up changing my mind again. I already have, multiple times. Granted that implementing regular expressions is hard, you’d think that matching “.” would be the easy part. Ha ha ha.
2024-12-19 04:00:00
Back on December 13th, I posted a challenge on Mastodon: In a simple UTF-8 byte-driven finite automaton, how many states does it take to match the regular-expression construct “.”, i.e. “any character”?
Commenter Anthony Williams responded, getting it almost right I think, but I found his description a little hard to understand. In this piece I’m going to dig into what “.” actually means, and then how many states you need to match it. [Update: Lots more on this subject, and some of the material below is arguably wrong, but just “arguably”; see Dot-matching Redux.] The answer surprised me. Obviously this is of interest only to the faction of people who are interested in automaton wrangling, problematic characters, and the finer points of UTF-8. I expect close attention from all 17 of you!
The answer: Four. Or five, depending.
Characters are represented by “code points”, which are numbers in the range 0 … 17×2¹⁶, which is to say 1,114,112 possible values. It turns out you don’t actually want to match all of them; more on that later.
Quamina is a “byte-level automaton”, which means it’s in a state, it reads a byte, and it looks up the value of that byte in a map yielding either the next state or nil, which means no match. Repeat until you match or fail. What bytes are we talking about here? We’re talking about UTF-8 bytes. If you don’t understand UTF-8 the rest of this is going to be difficult. I wrote a short explainer called Characters vs. Bytes twenty-one years ago. I now assume you understand UTF-8 and know that code points are encoded as sequences of one to four bytes.
Let’s count!
When you match a code point successfully you move to the part of the automaton that’s trying to match the next one; let’s call this condition MATCHED.
(From here on, all the numbers are hex, I’ll skip the leading 0x. And all the ranges are inclusive.)
In multi-byte characters, all the UTF-8 bytes but the first have bitmasks like 10XX XXXX, so there are six significant bits, thus 2⁶ or 64 distinct possible values, ranging from 80-BF.
There’s a Start state. It maps byte values 00-7F (as in ASCII) to MATCHED. That’s our first state, and we’ve handled all the one-byte code points.
In the Start state, the 32 byte values C0-DF, all of which begin 110, signaling a two-byte code point, are mapped to the Last state. In the Last state, the 64 values 80-BF are mapped to MATCHED. This takes care of all the two-byte code points and we’re up to two states.
In the Start state, the 16 byte values E0-EF, all of which begin 1110, signaling a three-byte code point, are mapped to the LastInter state. In that state, the 64 values 80-BF are mapped to the Last state. Now we’re up to three states and we’ve handled the three-byte code points.
In the Start state, the 8 byte values F0-F7, all of which begin 11110 signaling a four-byte code point, are mapped to the FirstInter state. In that state, the 64 values 80-BF are mapped to the LastInter state. Now we’ve handled all the code points with four states.
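Here’s that four-state machine as a toy Go function, mirroring the description above. Note that it’s exactly as permissive as the prose: it happily accepts lead bytes like C0, C1, and F5-F7 that strict UTF-8 forbids, which is the very issue the Dot-matching Redux post linked in the update deals with. It’s a sketch, not Quamina’s real implementation:

```go
package main

import "fmt"

// A toy rendering of the four-state "." automaton described above. Like the
// prose, it's permissive: it doesn't reject overlong encodings, surrogates,
// or sequences above U+10FFFF. Not Quamina's real implementation.

type state int

const (
	Start      state = iota // expecting the first byte of a code point
	FirstInter              // saw a four-byte lead; two continuation bytes still to come
	LastInter               // expecting the second-to-last continuation byte
	Last                    // expecting the final continuation byte
	Matched                 // consumed exactly one whole code point
	Fail                    // dead end; no match
)

func step(s state, b byte) state {
	switch s {
	case Start:
		switch {
		case b <= 0x7F: // one-byte (ASCII) code point
			return Matched
		case b >= 0xC0 && b <= 0xDF: // two-byte lead
			return Last
		case b >= 0xE0 && b <= 0xEF: // three-byte lead
			return LastInter
		case b >= 0xF0 && b <= 0xF7: // four-byte lead
			return FirstInter
		}
	case FirstInter:
		if b >= 0x80 && b <= 0xBF { // continuation byte
			return LastInter
		}
	case LastInter:
		if b >= 0x80 && b <= 0xBF {
			return Last
		}
	case Last:
		if b >= 0x80 && b <= 0xBF {
			return Matched
		}
	}
	return Fail
}

// matchesOneCodePoint reports whether the bytes are exactly one code point
// according to this (permissive) automaton.
func matchesOneCodePoint(bytes []byte) bool {
	s := Start
	for _, b := range bytes {
		if s = step(s, b); s == Fail {
			return false
		}
	}
	return s == Matched
}

func main() {
	fmt.Println(matchesOneCodePoint([]byte("a")))  // true: one byte
	fmt.Println(matchesOneCodePoint([]byte("é")))  // true: two bytes
	fmt.Println(matchesOneCodePoint([]byte("😀"))) // true: four bytes
	fmt.Println(matchesOneCodePoint([]byte{0x80})) // false: stray continuation byte
	fmt.Println(matchesOneCodePoint([]byte("ab"))) // false: two code points
}
```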
I mentioned above about not wanting to match all the code points. “Wait,” you say, “why wouldn’t you want to be maximally inclusive?!” Once again, I’ll link to Unicode Character Repertoire Subsets, a document I co-wrote that is making its way through the IETF and may become an RFC some year. I’m not going to try to summarize a draft that bends over backwards to be short and clear; suffice it to say that there are good reasons for leaving out several different flavors of code point.
Probably the most pernicious code points are the “Surrogates”, U+D800-U+DFFF. If you want an explanation of what they are and why they’re bad, go read that Repertoire Subsets draft or just take my word for it. If you were to encode them per UTF-8 rules (which the UTF-8 spec says you’re not allowed to), the low and high bounds would be ED,A0,80 and ED,BF,BF.
Go’s UTF-8 implementation agrees that Surrogates Are Bad and The UTF-8 Spec Is Good and flatly refuses to convert those UTF-8 sequences into code points or vice versa. The resulting subset of code points even has a catchy name: Unicode Scalars. Case closed, right?
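You can watch that refusal happen; the standard unicode/utf8 package treats a surrogate like any other invalid input. Just an illustration of the library behavior described above:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Go's unicode/utf8 refuses to round-trip surrogates in either direction.
func main() {
	// Encoding: U+D800 isn't a valid rune, so EncodeRune writes the
	// replacement character U+FFFD instead of a surrogate encoding.
	buf := make([]byte, 4)
	n := utf8.EncodeRune(buf, 0xD800)
	fmt.Printf("ValidRune(U+D800) = %v; EncodeRune wrote % X\n",
		utf8.ValidRune(0xD800), buf[:n])

	// Decoding: the would-be encoding ED A0 80 is rejected; DecodeRune
	// returns RuneError (U+FFFD) and a length of 1.
	r, size := utf8.DecodeRune([]byte{0xED, 0xA0, 0x80})
	fmt.Printf("DecodeRune(ED A0 80) = U+%04X, size %d\n", r, size)
}
```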
Wrong. JSON was designed before we’d thought through these problems, and it explicitly says it’s OK to include any code point whatsoever, including surrogates. And Quamina is used for matching JSON data. So, standards fight!
I’m being a little unfair here. I’m sure that if Doug Crockford were inventing JSON now instead of in 2001, he’d exclude surrogates and probably some of the other problematic code points discussed in that Subsets doc.
Anyhow, Quamina will go with Go and exclude surrogates. Any RFC 8259 purists out there, feel free to accuse me of standards apostasy; I will grant your point but won’t change Quamina. Actually, not true; at some point I’ll probably add an option to be more restrictive and exclude more than just surrogates.
Which means that now we have to go back to the start of this essay and figure out how many states it takes to match “.” Let’s see…
The Start state changes a bit. See #5 in the list above. Instead of mapping all of E0-EF to the LastInter state, it maps one byte in that range, ED, to a new state we’ll call, let’s see, how about ED.
In ED, just as in LastInter, 80-9F are mapped to Last. But A0-BF aren’t mapped to anything, because on that path lie the surrogates.
So, going with the Unicode Scalar path of virtue means I need five states, not four.
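In code, the change to the toy automaton above is small: the Start state sends ED to its own state, and in that state only 80-9F are allowed to continue. Again a sketch, reusing the state type and constants from the earlier toy function:

```go
// ED is the extra fifth state: we've seen the lead byte ED, and A0-BF as
// the next byte would produce a surrogate.
const ED state = Fail + 1

// stepScalar is the surrogate-excluding variant of step above.
func stepScalar(s state, b byte) state {
	switch s {
	case Start:
		switch {
		case b <= 0x7F:
			return Matched
		case b >= 0xC0 && b <= 0xDF:
			return Last
		case b == 0xED: // the one three-byte lead that could start a surrogate
			return ED
		case b >= 0xE0 && b <= 0xEF:
			return LastInter
		case b >= 0xF0 && b <= 0xF7:
			return FirstInter
		}
	case ED:
		// ED 80 80 through ED 9F BF decode to U+D000-U+D7FF, which are fine;
		// A0-BF would yield surrogates, so they map to nothing.
		if b >= 0x80 && b <= 0x9F {
			return Last
		}
	case FirstInter:
		if b >= 0x80 && b <= 0xBF {
			return LastInter
		}
	case LastInter:
		if b >= 0x80 && b <= 0xBF {
			return Last
		}
	case Last:
		if b >= 0x80 && b <= 0xBF {
			return Matched
		}
	}
	return Fail
}
```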