
Bye, Google Search

2025-11-02 03:00:00

For this blog, I mean. Which used to have a little search window in the corner of the banner just above that looked up blog fragments using Google. No longer; I wired in Pagefind; click on the magnifying glass up there on the right to try it out. Go ahead, do it now, I’ll wait before getting into the details.

The problem

Well, I mean, Google is definitely Part Of The Problem in advertising, surveillance, and Internet search. But the problem I’m talking about is that it just couldn’t find my pages, even when I knew they were here and knew words that should find them.

Either it dropped the entries from the index or dropped a bunch of search terms. Don’t know and don’t care now. ongoing is my outboard memory and I search it all the freaking time. This failure mode was making me crazy.

Pagefind

Tl;dr: I downloaded it and installed it and it Just Worked out of the box. I’d describe the look and feel but that’d be a waste of time since you just tried it out. It’s fast enough and doesn’t seem to miss anything and has a decent user interface.

How it works

They advertise “fully static search library”, which I assumed meant it’s designed to work against sites like this one composed of static files. And it is, but there’s more to it than that; read on.

First, you point a Node program at the root of your static-files tree and stand back. My tree has a bit over 5,000 files containing about 2½ million words, adding up to a bit over 20M of text. By default, it assumes you’re indexing HTML and includes all the text inside each page’s <body> element.

You have to provide a glob argument to match the files you want to index; in most cases, something like root/**/*.html would do the trick. Working this out was, for me, the hardest part because, among other things, my articles don’t end with .html; maybe it’ll be helpful for some to note that what worked for ongoing was:
When/???x/????/??/??/[a-zA-Z0-9]<[\-a-zA-Z0-9]:>

This produced an index organized into about 11K files adding up to about 78M. It includes a directory with one file per HTML page being searched.
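If you’d rather drive the indexer from a little Node script than the CLI, it’s only a few lines. Here’s a minimal sketch using Pagefind’s Node API, assuming I’m reading its docs right; the paths are placeholders and the glob is the boring **/*.html one rather than my weird pattern above.

    // build-index.mjs: a sketch of running the Pagefind indexer from Node.
    // Paths and glob are placeholders; adjust for your own tree.
    import * as pagefind from "pagefind";

    const { index } = await pagefind.createIndex();

    // Point it at the static-files root; the glob picks which files get indexed.
    await index.addDirectory({ path: "/var/www/ongoing", glob: "**/*.html" });

    // Write the index fragments alongside the site, under .../pagefind/
    await index.writeFiles({ outputPath: "/var/www/ongoing/pagefind" });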

I’d assumed I’d have to wire this up to my Web server somehow, but no: It’s all done in the client by fetching little bits and pieces of the index using ordinary HTTP GETs. For example, I ran a search for the word “minimal”, which resulted in my browser fetching a total of seven files totaling about 140K. That’s what they mean by “static”; not just the data, but the index too.
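If you’re curious, here’s roughly what the client side looks like; a sketch of Pagefind’s low-level browser API, which is what’s doing those little GETs under the hood. The /pagefind/ path is just wherever the indexer wrote its output.

    // In the browser: the low-level client API behind those little GETs.
    const pagefind = await import("/pagefind/pagefind.js");
    const search = await pagefind.search("minimal");

    // Results are lazy; calling .data() triggers another small fetch.
    const first = await search.results[0].data();
    console.log(first.url, first.excerpt);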

Finally, I noticed a couple of WASM files, so I had to check out the source code and, sure enough, this is basically a Rust app. Again I’m impressed. I hope that slick modern Rust/WASM code isn’t offended by me rubbing it up against this blog’s messy old Perl/Ruby/raw-JS/XML tangle.

Scalable?

Interesting question. For the purposes of this blog, Pagefind is ideal. But, indexing my 2½ million words burned a solid minute of CPU on the feeble VPS that hosts ongoing. I wonder if the elapsed time is linear in the data size, but it wouldn’t surprise me if it were worse. Furthermore, the index occupies more storage than the underlying data, which might be a problem for some.

Also, what happens when I do a search while the indexing is in progress? Just to be sure, I think I’ll wire it up to build the index in a new directory and switch indices as atomically as possible.
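Here’s a sketch of what I have in mind, with made-up directory names: build into a scratch directory, then rename it into place. A same-filesystem rename is about as close to atomic as this gets; a symlink flip would be closer still.

    // reindex.mjs: rebuild into a scratch directory, then swap it into place.
    import { rename, rm } from "node:fs/promises";
    import * as pagefind from "pagefind";

    const LIVE = "/var/www/ongoing/pagefind";
    const FRESH = LIVE + ".new";
    const OLD = LIVE + ".old";

    const { index } = await pagefind.createIndex();
    await index.addDirectory({ path: "/var/www/ongoing", glob: "**/*.html" });
    await index.writeFiles({ outputPath: FRESH });

    await rm(OLD, { recursive: true, force: true });
    await rename(LIVE, OLD).catch(() => {});  // first run: nothing to move aside
    await rename(FRESH, LIVE);                // searchers now see the new index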

Finally, I think that if you wanted to sustain a lot of searches per second, you’d really want to get behind a CDN, which would make all that static index fetching really fly.

Configuring

The default look-and-feel was mostly OK by me, but the changes I had to make did involve quality time with the inspector, figuring out the class and ID complexities and then iterating the CSS.

The one thing that in the rear-view seems unnecessary is that I had to add a data-pagefind-meta attribute to the element at the very bottom of the page where the date is, to include it in the result list. There should be a way to do this without custom markup. John Siracusa filed a related bug.
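For the record, the wiring is tiny. A sketch of the stock hookup, assuming the index lives under /pagefind/, the pagefind-ui.js script and stylesheet are already on the page, and the magnifying-glass widget is an element with id “search” (all placeholders here):

    // Stock UI hookup, run on a page that already includes pagefind-ui.js and
    // its stylesheet; "#search" stands in for the magnifying-glass element.
    window.addEventListener("DOMContentLoaded", () => {
      new PagefindUI({ element: "#search", showSubResults: true });
    });

    // The date shown with each result comes from markup like
    //   <p data-pagefind-meta="date">2025-11-02</p>
    // on the pages being indexed.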

Deployment

There’s hardly any work. I’ll re-run the indexer every day with a crontab entry and it looks like it should just take care of itself.

To do?

Well, I could beautify the output some more but I’m pretty happy with it after just a little work. I can customize the sort order, which I gather is in descending order of how significant Pagefind thinks the match is. There’s a temptation to sort it in reverse date order. Actually, apparently I can also influence the significance algorithm. Anyhow I’ll run with mostly-defaults for now.
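If I ever do flip to reverse-date order, my understanding is it’d look something like this sketch; the catch is that “date” has to be exposed at index time with a data-pagefind-sort attribute, it’s not built in.

    // Reverse-date ordering with the low-level API; "date" is whatever name
    // you exposed at index time via data-pagefind-sort="date".
    const pagefind = await import("/pagefind/pagefind.js");
    const newestFirst = await pagefind.search("minimal", { sort: { date: "desc" } });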

Search options

I notice that the software is pretty good at, and aggressive about, matching across verb forms and singular/plural and prefixes. Which I guess is what you want? You can apparently defeat that by enclosing a word in quotes if you want it matched exactly. Works for phrases too. I wonder what other goodies are in there; couldn’t find any docs on that subject.

Finally, there’s an excellent feature set I’ll never use; it’s smart about lots of languages. But alas, I write monolingually.

Shameful cleanup

Like I said, getting Pagefind installed and working was easy. Getting the CSS tuned up was a bit more effort. But I have to confess that I put hours and hours into hiding my dirty secrets.

You see, ongoing contains way more writing than you or Google can see. It’s set up so I can “semi-publish” pieces; there but unlinked. There was a whole lot of this kind of stuff: Photo albums from social events, pitches to employers about why they should hire various people including me, rants at employers for example about why Solaris should adopt the Linux userland (I was right) and why Android should include a Python SDK (I was right), and pieces that employer PR groups convinced me to bury. One of my minor regrets about no longer being employed is I no longer get to exercise my mad PR-group-wrangling skillz.

But when your search software is just walking the file tree, it doesn’t know what’s “published” and what’s not. I ended up using my rusty shell muscles with xargs and sed and awk and even an ed(1) script. I think I got it all, but who knows, search hard enough and you might find something embarrassing. If you do, I’d sure appreciate an email.

Thanks!

To the folks who built this. Seems like a good thing.

Grokipedia

2025-10-29 03:00:00

Last night I had a very strange experience: About two thirds of the way through reading a Web page about myself, Tim Bray, I succumbed to boredom and killed the tab. Thus my introduction to Grokipedia. Here are early impressions.

On Bray

My Grokipedia entry has over seven thousand words, compared to a mere 1,300 in my Wikipedia article. It’s pretty clear how it was generated; an LLM, trained on who-knows-what but definitely including that Wikipedia article and this blog, was told to go nuts.

Speaking as a leading but highly biased expert on the subject of T. Bray, here are the key take-aways:

(Overly) complete

It covers all the territory; there is no phase of my life’s activity that could possibly be encountered in combing the Web that is not exhaustively covered. In theory this should be good but in fact, who cares about the details of what I worked on at Sun Microsystems between 2004 and 2010? I suppose I should but, like I said, I couldn’t force myself to plod all the way through it.

Wrong

Every paragraph contains significant errors. Sometimes the text is explicitly self-contradictory on the face of it, sometimes the mistakes are subtle enough that only I would spot them.

Style

The writing has that LLM view-from-nowhere flat-affect semi-academic flavor. I don’t like it but the evidence suggests that some people do?

References

All the references are just URLs and at least some of them entirely fail to support the text. Here’s an example. In discussion of my expert-witness work for the FTC in their litigation against Meta concerning its acquisitions of Instagram and WhatsApp, Grokipedia says:

[Bray] opined that users' perceptions of response times in online services critically influence market dynamics.

It cites Federal Trade Commission’s Reply to Meta Platforms, Inc.’s Response to Federal Trade Commission’s Counterstatement of Material Facts (warning: 2,857-page PDF). Okay, that was one of the things I argued, but the 425 pages of court documents that I filed, and the references to my reporting in the monster document, make it clear that it was one tiny subset of the main argument.

Anyhow, I (so that you won’t have to) spent a solid fifteen minutes spelunking back and forth through that FTC doc, looking for strings like “response time” and “latency” and so on. Maybe somewhere in those pages there’s support for the claim quoted above, but I couldn’t find it.

Useful?

Wikipedia, in my mind, has two main purposes: A quick visit to find out the basics about some city or person or plant or whatever, or a deep-dive to find out what we really know about genetic linkages to autism or Bach’s relationship with Frederick the Great or whatever.

At the moment, Grokipedia doesn’t really serve either purpose very well. But, after all, this is release 0.1, maybe we should give it a chance.

Or, maybe not.

Woke/Anti-Woke

The whole point, one gathers, is to provide an antidote to Wikipedia’s alleged woke bias. So I dug into that. Let’s consider three examples of what I found. First, from that same paragraph about the FTC opinion quoted above:

While Bray and aligned progressives contend that such dominance stifles innovation by enabling predatory acquisitions and reduced rivalry—evidenced by fewer startup exits in concentrated sectors—counterarguments highlight that Big Tech's scale has fueled empirical gains, with these firms investing over $240 billion in U.S. R&D in 2024 (more than a quarter of national totals) and driving AI, cloud, and patent surges.[128] [131] Six tech industries alone accounted for over one-third of U.S. GDP growth from 2012–2021, comprising about 9% of the economy and sustaining 9.3 million jobs amid falling consumer prices and rapid technological diffusion. [132] [133] Right-leaning economists often defend consumer welfare metrics and market self-correction, warning that forced divestitures risk eroding the efficiencies and investment incentives that have propelled sector productivity above 6% annual growth in key areas like durable manufacturing tech. [134] [135]

I’ve linked the numbered citations to the indicated URLs. Maybe visit one or two of them and see what you think? Four are to articles arguing, basically, that monopolies must be OK because the companies accused of it are growing really fast and driving the economy. They seem mostly to be from right-wing think-tanks but I guess that’s what those think-tanks are for. One of them, #131, Big Tech and the US Digital-Military-Industrial Complex, I think isn’t helpful to the argument at all. But still, it’s broadly doing what they advertise: Pushing back against “woke” positions, in this case the position that monopolization is bad.

I looked at a couple of other examples. For example, this is from the header of the Greta Thunberg article:

While credited with elevating youth engagement on environmental issues, Thunberg's promotion of urgent, existential climate threats has drawn scrutiny for diverging from nuanced empirical assessments of climate risks and adaptation capacities, as well as for extending her activism into broader political arenas such as anti-capitalist and geopolitical protests.[5][6]

Somehow I feel no urge to click on those citation links.

If Ms Thunberg is out there on the “woke” end of the spectrum, let’s flit over to the other end, namely the entry for J.D. Vance, on the subject of his book Hillbilly Elegy.

Critics from progressive outlets, including Sarah Smarsh in her 2018 book Heartland, faulted the memoir for overemphasizing personal and cultural failings at the expense of structural economic policies, arguing it perpetuated stereotypes of rural whites as self-sabotaging.[71] These objections, often rooted in institutional analyses from academia and media, overlooked data on behavioral patterns like opioid dependency rates—peaking at 21.5 deaths per 100,000 in Appalachia around 2016—that aligned with Vance's observations of "deaths of despair" precursors.[72]

I read and enjoyed Heartland but the citation is to a New Yorker article that doesn’t mention Smarsh. As for the second sentence… my first reaction, as I trudged through its many clauses, was “life’s too short”. But seriously, opioid-death statistics weaken the hypothesis about structural economic issues? Don’t get it.

Take-away

Wikipedia is, to quote myself, the encyclopedia that “anyone who’s willing to provide citations can edit”. Grokipedia is “the encyclopedia that Elon Musk’s LLM can edit, with sketchy citations and no progressive argument left un-attacked.”

So I guess it’s Working As Intended?

Recent Music

2025-10-16 03:00:00

There are musical seasons where I re-listen to the old faves, the kind of stuff you can read about in my half-year of “Song of the Day” essays from 2018. This autumn I find myself listening to new music by living people. Here’s some of it.

The musical influx is directly related to my adoption of Qobuz, whose weekly editors’-picks are always worth a look and have led me to more than half of the tunes in this post. Qobuz, like me, still believes in the album as a useful unit of music and thus I’ll cover a few of those. And live-performance YouTubes of course. You’ll spot a pattern: A lot of this stuff is African or African-adjacent with Euro angles and jazz flavors.

Ghana Downtown

The Kwashibu Area Band, founded in Accra, have been around for a few years and played in a few styles, sometimes as Pat Thomas’ band.

Cover of Love Warrior’s Anthem by Kwashibu Area Band

What happened was, Qobuz offered up their recent Love Warrior’s Anthem and there isn’t a weak spot on it. Their record label says something about mixing Highlife and jazz; OK I guess. Here’s their YouTube channel but it doesn’t seem to have anything live from the Love-Warrior material. It isn’t often that I listen to an entire album end-to-end more than once.

The New Eves

Posted to Flickr by p_a_h, licensed under the Creative Commons Attribution 4.0 International license.

Loud Rude Brits

The New Eves are from Brighton and Wikipedia calls them “folk punk” which is weird because yeah, they’re loud and rude, but a lot of the instrumental sound is cello/violin/flute. Anyhow, check out Mother. I listened to most of their recent LP The New Eve Is Rising while driving around town and that’s really a lot of good and very intense music.

Rwanda Sings With Strings

That’s the title of the latest from “The Good Ones”, here it is on BandCamp. Adrien Kazigira and Janvier Havugimana are described as a “folk duo”; the songs are two-voice harmonies backed with swinging acoustic guitars. This record is just like the title says: They set up in a hotel room with a couple of string players and recorded these songs in a single take with no written music and no overdubs.

Cover of Rwanda Sings With Strings by The Good Ones

It’s awfully sweet stuff and while none of the lyrics are in English, they offer translations of the song titles, which include One Red Sunday, You Lied & Tried to Steal My Land, In the Hills of Nyarusange They Talk Too Much, and You Were Given a Dowry, But Abandoned Me. This music does so much with so little.

Rapper Piano

Alfa Mist was a rapper who went to music school and learned to play keyboards as an adult. The music’s straight-ahead Jazz but he still raps a bit here and there, it blends in nicely. If you get a chance to listen to an interview with him you should, if only for his voice; he’s from South London and of Ugandan heritage, which results in an accent like nothing I’ve ever heard before but makes me smile.

Alfa Mist

By Dirk Neven - Alfa Mist, Maassilo Rotterdam 20 November 2022 - Alfa Mist, CC0, (Wikimedia).

The problem with AM’s music is that it’s extremely smoooooooth, to the point that I thought of it as sort of pleasant-background stuff. Then I took in a YouTube of a live-in-studio session (maybe this one?) and realized that I was listening to extremely sophisticated soloing and ensemble playing that deserves to be foreground. But still sweet.

Sona Jobarteh

By World Trade Organization from Switzerland, cropped by User:HLHJ - Aid for Trade Global Review 2017 – Day 1, CC BY-SA 2.0, (Wikimedia).

Kora Magic

The Kora is that Gambian instrument with a gourd at the bottom and dozens of strings. Sona Jobarteh, British-Gambian, plays Kora and guitar and sings beautifully and has a great band. Here she is at Jazz à Porquerolles.

Vanessa Wagner Glass Etudes album cover

And now for something completely different

Vanessa Wagner is a French classical pianist of whom I’d not heard. But Qobuz offered a new recording of Phil Glass’s Piano Etudes which, despite being a big fan, I’d never listened to. Here’s Etude No. 2, which is pretty nice, as is the whole recording; dreamy, shimmering stuff. I found myself leaning back with eyes closed.

It makes me happy

That there’s plenty of music out there that’s new and good.

Social Media Provenance Challenge

2025-10-02 03:00:00

At a recent online conference, I said that we can “change the global Internet conversation for the better, by making it harder for liars to lie and easier for truth-tellers to be believed.” I was talking about media — images, video, audio. We can make it much easier to tell when media is faked and when it’s real. There’s work to do, but it’s straightforward stuff and we could get there soon. Here’s how.

The Nadia story

This is a vision of what success looks like.

Nadia lives in LA. She has a popular social-media account with a reputation for stylish pictures of urban life. She’s not terribly political, just a talented street photog. Her handle is “CoolTonesLA@hotpix.example”.

She’s in Venice Beach the afternoon of Sunday August 9, 2026, when federal agents take down a vendor selling cheap Asian ladies’ wear. She gets a great shot of an enforcer carrying away an armful of pretty dresses while two more bend the merchant over his countertop. None of the agents in the picture are in uniform, all are masked.

She signs into her “CoolTonesLA” account on hotpix.example and drafts a post saying “Feds raid Venice Beach”. When she uploads the picture, there’s a pop-up asking “Sign this image?” Nadia knows what this means, and selects “Yes”. By midnight her post has gone viral.

Content Credentials glyph

As a result of Nadia agreeing to “sign” the image, anyone who sees her post, whether in a browser or mobile app, also sees that little “Cr” badge in the photo’s top right corner. When they mouse over it, a little pop-up identifies who posted the picture and when, with links.

The links point to Nadia’s feed and her instance’s home page. Following them can give the reader a feeling for what kind of person she is, the nature of her server, and the quality of her work. Most people are inclined to believe the photo is real.

Marco is a troublemaker. He grabs Nadia’s photo and posts it to his social-media account with the caption “Criminal illegals terrorize local business. Lock ’em up!” He’s not technical and doesn’t strip the metadata. Since the picture is already signed, he doesn’t get the “Sign this picture?” prompt. Anyone who sees his post will see the “Cr” badge and mousing over it makes it pretty clear that it isn’t what he says it is. Commenters gleefully point this out. By the time Marco takes the post down, his credibility is damaged.

Maggie is a more technical troublemaker. She sees Marco’s post and likes it, strips the picture’s metadata, and reposts it. When she gets the “Sign this picture?” prompt, she says “No”, so it doesn’t get a “Cr” badge. Hostile commenters accuse her of posting a fake, saying “LOL badge-free zone”. It is less likely that her post will go viral.

Miko isn’t political but thinks the photo would be more dramatic if she Photoshopped it to add a harsh dystopian lighting effect. When she reposts her version, the “Cr” badge won’t be there because the pixels have changed.

Morris follows Maggie. He grabs the stripped picture and, when he posts it, says “Yes” to signing. In his post the image will show up with the “Cr” badge, credited to him, with a “posted” timestamp later than Nadia’s initial post. Now, the picture’s believability will depend on Morris’s. Does he have a credible track record? Also, there’s a chance that someone will notice what Morris did and point out that he stole Nadia’s picture.

(In fact, I wouldn’t be surprised if people ran programs against the social-network firehose looking for media signed by more than one account, which would be easy to detect.)

That’s the Nadia story.

How it’s done

The rest of this piece explains in some detail how the Nadia story can be supported by technology that already exists, with a few adjustments. If jargon like “PKIX” and “TLS” and “Nginx” is foreign to you, you’re unlikely to enjoy the following. Before you go, please consider: Do you think making the Nadia story come true would be a good investment?

I’m not a really deep expert on all the bits and pieces, so it’s possible that I’ve got something wrong. Therefore, this blog piece will be a living document in that I’ll correct any convincingly-reported errors, with the goal that it accurately describes a realistic technical roadmap to the Nadia story.

By this time I’ve posted enough times about C2PA that I’m going to assume people know what it is and how it works. For my long, thorough explainer, see On C2PA. Or, check out the Content Credentials Web site.

Tl;dr: C2PA is a list of assertions about a media object, stored in its metadata, with a digital signature that includes the assertions and the bits of the picture or video.

This discussion assumes the use of C2PA and also an in-progress specification from the Creator Assertions Working Group (CAWG) called Identity Assertion.

Not all the pieces are quite ready to support the Nadia story. But there’s a clear path forward to closing each gap.

“Sign this picture?”

C2PA and CAWG specify many assertions that you can make about a piece of media. For now let’s focus just on what we need for provenance. When the media is uploaded to a social-network service, there are two facts that the server knows, absolutely and unambiguously: Who uploaded it (because they’ve had to sign in) and when it happened.

In the current state of the specification drafts, “Who” is the cawg.social_media property from the draft Identity Assertion spec, section 8.1.2.5.1, and “When” is the c2pa.time-stamp property from the C2PA specification, section 18.17.3.

I think these two are all you need for a big improvement in social network media provenance, so let’s stick with them.
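To make that concrete, here’s an illustrative sketch of those two facts written as a plain object, not the literal C2PA/CAWG wire format; the property shapes and values are made up, only the two labels come from the specs.

    // Illustrative only, not the literal CAWG/C2PA wire format: just the two
    // facts hotpix.example can vouch for at upload time.
    const assertions = {
      "cawg.social_media": {                  // Who: the signed-in account
        username: "CoolTonesLA",
        uri: "https://hotpix.example/@CoolTonesLA",
      },
      "c2pa.time-stamp": {                    // When: the moment of upload
        timestamp: "2026-08-09T23:41:07Z",
      },
    };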

What key?

Let’s go back to the Nadia story. It needs the Who/When assertions to be digitally signed in a way that will convince a tech-savvy human or a PKIX validation library that the signature could only have been applied by the server at hotpix.example.

The C2PA people have been thinking about this. They are working on a Verified News Publishers List, to be maintained and managed by, uh, that’s not clear to me. The idea is that C2PA software would, when validating a digital signature, require that the PKIX cert is one of those on the Publishers List.

This isn’t going to work for a decentralized social network, which has tens of thousands of independent servers run by co-ops, academic departments, municipal governments, or just a gaggle of friends who kick in on Patreon. And anyhow, Fediverse instances don’t claim to be “News Publishers”, verified or not.

So what key can hotpix.example sign with? Fortunately, there’s already a keypair and PKIX certificate in place on every social-media server, the one it uses to support TLS connections. The one at tbray.org, that’s being used right now to protect your interaction with this blog, is in /etc/letsencrypt/live/ and the private key is obviously not generally readable.

That cert will contain the public key corresponding to the host’s private key, the cert's ancestry, and the host name. It’s all that any PKIX library needs to verify that yes, this could only have been signed by hotpix.example. However, there will be objections.

Objection: “hotpix.example is not a Verified News Publisher!” True enough, the C2PA validation libraries would have to accept X.509 certs. Maybe they do already? Maybe this requires an extension of the current specs? In any case, the software’s all open-source, could be forked if necessary.

Objection: “That cert was issued for the purpose of encrypting TLS connections, not for some weird photo provenance application. Look at the OID!” OK, but seriously, who cares? The math does what the math does, and it works.

Objection: “I have to be super-careful about protecting my private key and I don’t want to give a copy to the hippies running the social-media server.” I sympathize but, in most cases, social media is all that server’s doing.

Having said that, it would be great if there were extensions to Nginx and Apache httpd where you could request that they sign the assertions for you. Neither would be rocket science.
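To show how little machinery is involved, here’s a sketch of the signing step in Node, using the host’s existing Let’s Encrypt key. The real thing would be a COSE signature embedded in a C2PA manifest by a proper library; the assertionBytes and imageHash parameters are hand-waved placeholders.

    // Sketch of the signing step; assertionBytes and imageHash are whatever a
    // C2PA library hands us, passed as parameters so this stands alone.
    import { readFileSync } from "node:fs";
    import { createSign } from "node:crypto";

    function signForHost(assertionBytes, imageHash) {
      // The standard Let's Encrypt layout; readable by root only, of course.
      const privateKey = readFileSync(
        "/etc/letsencrypt/live/hotpix.example/privkey.pem"
      );
      const signer = createSign("SHA256");
      signer.update(Buffer.concat([assertionBytes, imageHash]));
      return signer.sign(privateKey);
    }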

OK, so we sign Nadia’s Who/When assertions and her photo’s pixels with our host’s TLS key, and ship it off into the world. What’s next?

How to validate?

Verifying these assertions, in a Web or mobile app, is going to require a C2PA library to pick apart the assertions and a PKIX library for the signature check.

We already have c2pa-rs, Rust code with MIT and Apache licenses. Rust libraries can be called from some other programming languages but in the normal course of affairs I’d expect there soon to be native implementations. Once again, all these technologies are old as dirt, absolutely no rocket science required.

How about validating the signatures? I was initially puzzled about this one because, as a programmer, certs only come into the picture when I do something like http.Get() and the library takes care of all that stuff. So I can’t speak from experience.

But I think the infrastructure is there. Here’s a Curl blogger praising Apple SecTrust. Over on Android, there’s X509ExtendedTrustManager. I assume Windows has something. And if all else fails, you could just download a trusted-roots file from the Curl or Android projects and refresh it every week or two.
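Here’s a sketch of the core check in Node, assuming the cert, payload, and signature have already been pulled out of the manifest (those parameters are placeholders). A real validator would also walk the chain back to a trusted root, which is the part SecTrust and friends handle.

    // Sketch of the core verification: does the signature check out against the
    // cert's public key, and does the cert actually name hotpix.example?
    import { createVerify, X509Certificate } from "node:crypto";

    function signedByHost(certPem, payload, signature, host) {
      const cert = new X509Certificate(certPem);

      const verifier = createVerify("SHA256");
      verifier.update(payload);
      const okSignature = verifier.verify(cert.publicKey, signature);

      // checkHost() returns the matched name, or undefined if there's no match.
      const okHost = cert.checkHost(host) !== undefined;

      return okSignature && okHost;
    }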

What am I missing?

This feels a little too easy, something that could be done in months not years. Perhaps I’m oversimplifying. Having said that, I think the most important thing to get right is the scenarios, so we know what effect we want to achieve.

What do you think of the Nadia story?

GenAI Predictions

2025-09-27 03:00:00

I’m going to take a big chance here and make predictions about GenAI’s future. Yeah, I know, you’re feeling overloaded on this stuff and me too, but it seems to have sucked the air out of all the other conversations. I would so like to return to arguing about Functional Programming or Free Trade. This is risky and there’s a pretty good chance that I’m completely wrong. But I’ll try to entertain while prognosticating.

Reverse Centaurs

That’s the title of a Cory Doctorow essay, which I think is spot on. I’m pretty sure anyone who’s read even this far would enjoy it; it’s not long, and it’d help you understand this one. Go have a look, I’ll wait.

Hallucinations won’t get fixed

I have one good and one excellent argument to support this prediction. Good first: While my understanding of LLMs is not that deep, it doesn’t have to be to understand that it’s really difficult (as in, we don’t know how) to connect the model’s machinations to our underlying reality, so as to fact-check.

The above is my non-expert intuition at work. But then there’s Why Language Models Hallucinate, three authors from OpenAI and one from Georgia Tech, which seems to show that hallucinations are an inevitable result of current training practices.

And here’s the excellent argument: If there were a way to eliminate the hallucinations, somebody already would have. An army of smart, experienced people, backed by effectively infinite funds, have been hunting this white whale for years now without much success. My conclusion is, don’t hold your breath waiting.

Maybe there’ll be a surprise breakthrough next Tuesday. Could happen, but I’d be really surprised.

(When it comes to LLMs and code, the picture is different; see below.)

The mass layoffs won’t happen

The central goal of GenAI is the elimination of tens of millions of knowledge workers. That’s the only path to the profits that can cover the costs of training and running those models.

To support this scenario the AI has to run in Cory’s “reverse centaur” mode, where the models do the work and the humans tend them. This allows the production of several times more work per human, generally of lower quality, with inevitable hallucinations. There are two problems here: First, that at least some of the output is workslop, whose cleanup costs eat away at the productivity wins. Second, that the lower quality hurts your customers and your business goes downhill.

I just don’t see it. Yeah, I know, every CEO is being told that this will work and they’ll be heroes to their shareholders. But the data we have so far keeps refusing to support those productivity claims.

OK then, remove the “reverse” and run in centaur mode, where smart humans use AI tools judiciously to improve productivity and quality. Which might be a good idea for some people in some jobs. But in that scenario neither the output boost nor the quality gain get you to where you can dismiss enough millions of knowledge workers to afford the AI bills.

The financial damage will be huge

Back to Cory, with The real (economic) AI apocalypse is nigh. It’s good, well worth reading, but at this point pretty well conventional wisdom as seen by everyone who isn’t either peddling a GenAI product or (especially) fundraising to build one.

To pile on a bit, I’m seeing things every week like for example this: The AI boom is unsustainable unless tech spending goes ‘parabolic,’ Deutsche Bank warns: ‘This is highly unlikely’.

The aggregate investment is ludicrous. The only people who are actually making money are the ones selling the gold-mining equipment to the peddlers. Like they say, “If something cannot go on forever, it will stop.” Where by “forever”, in the case of GenAI, I mean “sometime in 2026, probably”.

… But the economy won’t collapse

Cory forecasts existential disaster, but I’m less worried. Those most hurt when the bubble collapses will be the investing classes who, generally speaking, can afford it. Yeah, if the S&P 500 drops by a third, the screaming will shake the heavens, but I honestly don’t see it hitting as hard as 2008 and don’t see how the big-picture economy falls apart. That work that the genAI shills say would be automated away is still gonna have to be done, right?

The software profession will change, but not that much

Here’s where I get in trouble, because a big chunk of my professional peers, including people I admire, see GenAI-boosted coding as pure poison: “In a kind of nihilistic symmetry, their dream of the perfect slave machine drains the life of those who use it as well as those who turn the gears.” (The title of that essay is “I Am An AI Hater.”)

I’m not a hater. I argued above that LLMs generating human discourse have no way to check their output for consistency with reality. But if it’s code, “reality” is approximated by what will compile and build and pass the tests. The agent-based systems iteratively generate code, reality-check it, and don’t show it to you until it passes. One consequence is that the quality of help you get from the model should depend on the quality of your test framework. Which warms my testing-fanatic heart.

So, my first specific prediction: Generated code will be a routine thing in the toolkit, going forward from here. It’s pretty obvious that LLMs are better at predicting code sequences than human language.

In Revenge of the junior developer, Steve Yegge says, more or less, “Resistance is useless. You will be assimilated.” But he’s wrong; there are going to be places where we put the models to work, and others where we won’t. We don’t know which places those are and aren’t, but I have (weaker) predictions; let’s be honest and just say “guesses”.

Where I suspect generated code will likely appear:

  • Application logic: “Depreciate the values in the AMOUNT field of the INSTALLED table forward ten years and write the NAME field and the depreciated value into a CSV.” Or “Look at JIRA ticket 248975 and create a fix.”

    (By the way, this is a high proportion of what actual real-world programmers do every day.)

  • Glorified StackOverflow-style lookups like I did in My First GenAI Code.

  • Drafting code that needs to run against interfaces too big and complex to hold in your head, like for example the Android and AWS APIs (“When I shake the phone, grab the location from GPS and drop it in the INCOMING S3 bucket”). Or CSS (“Render that against a faded indigo background flush right, and hold it steady while scrolling so the text slides around it”).

  • SQL. This feels like a no-brainer. So much klunky syntax and so many moving pieces.

Where I suspect LLM output won’t help much:

  • Interaction design. I mean, c’mon, it requires predicting how humans understand and behave.

  • Low level infrastructure code, the kind I’ve spent my whole life on, where you care a whole lot about conserving memory and finding sublinear algorithms and shrinking code paths and having good benchmarks.

Here are areas where I don’t have a prediction but would like to know whether and how LLMs fit in.

  • Help with testing: Writing unit and integration tests, keeping an eye on coverage, creating a bunch of BDD tests from a verbal description of what a function is going to do.

  • Infrastructure as code: CI/CD, Terraform and peers, all that stuff. There are so many ways to get it wrong.

  • Bad old-school concurrency that uses explicit mutexes and java.lang.Thread where you have to understand language memory models and suchlike.

The real reason not to use GenAI

Because it’s being sold by a panoply of grifters and chancers and financial engineers who know that the world where their dreams come true would be generally shitty, and they don’t care.

(Not to mention the environmental costs and the poor folk in the poor countries where the QA and safety work is outsourced.)

Final prediction: After the air goes out of the assholes’ bubble, we won’t have to live in the world they imagine. Thank goodness.

C2PA Investigations

2025-09-19 03:00:00

This is the blog version of my talk at the IPTC’s online Photo Metadata Conference. Its title is the one the conference organizers slapped on my session without asking; I was initially going to object but then I thought of the big guitar riff in Dire Straits’ Private Investigations and snickered. If you want, instead of reading, to watch me present, that’s on YouTube. Here we go.

Hi all, thanks for having me. Today I represent… nobody, officially. I’m not on any of the committees nor am I an employee of any of the providers. But I’m a photographer and software developer and social-media activist and have written a lot about C2PA. So under all those hats this is a subject I care about.

Also, I posted this on Twitter back in 2017.

Twitter post from 2017 presaging C2PA

I’m not claiming that I was the first with this idea, but I’ve been thinking about the issues for quite a while.

Enough self-introduction. Today I’m going to look at C2PA in practice right now in 2025. Then I’m going to talk about what I think it’s for. Let’s start with a picture.

Picture of a shopping mall storefront

This smaller version doesn’t have C2PA,
but if you click on it, the larger version you get does.
Photo credit: Rob Pike

I should start by saying that a few of the things that I’m going to show you are, umm, broken. But I’m still a C2PA fan. Bear in mind that at this point everything is beta or preview or whatever, at best v1.0. I think we’re in glass-half-full mode.

This photo is entirely created and processed by off-the-shelf commercial products and has content credentials, and let me say that I had a freaking hard time finding such a photo. There are very few Content Credentials out there on the Internet. That’s because nearly every online photo is delivered either via social media or by professional publishing software. In both cases, the metadata is routinely stripped, bye-bye C2PA. So one of the big jobs facing us in putting Content Credentials to work is to stop publishers from deleting them.

Of course, that’s complicated. Professional publishers probably want the Content Credentials in place, but on social media privacy is a key issue and stripping the metadata is arguably a good default choice. So there are a lot of policy discussions to be had up front of the software work.

Anyhow, let’s look at the C2PA.

Picture with two Content Credentials glyphs and one drop-down

I open up that picture in Chrome and there are little “Cr” glyphs at both the top left and top right corners; that’s because I’ve installed multiple C2PA Chrome plug-ins. Turns out these seem to only be available for Chrome, which is irritating. Anyhow, I’ve clicked on the one in the top left.

That’s a little disappointing. It says the credentials were recorded by Lightroom, and gives my name, but I think it’s hiding way more than it’s revealing. Maybe the one on the top right will be more informative?

Picture with two Content Credentials glyphs and one drop-down

More or less the same info, in a slightly richer presentation. But both displays have an “inspect” button and both do the same thing. Let’s click it!

Content Credentials inspector page, failing to retrieve a page for review

This is the Adobe Content Credentials inspector and it’s broken. That’s disappointing. Having said that, I was in a Discord chat with a senior Adobe person this morning and they’re aware of the problem.

But anyhow, I can drag and drop the picture like they say.

Content credentials as displayed by the Inspector

Much much better. It turns out that this picture was originally taken with a Leica M11-P. The photographer is a famous software guy named Rob Pike, who follows me on Mastodon and wanted to help out.

So, thanks Rob, and thanks also to the Leica store in Sydney, Australia, who loaned him the M11. He hasn’t told me how he arranged that, but I’m curious.

I edited it in Lightroom, and if you look close, you can see that I cropped it down and brightened it up. Let’s zoom in on the content credentials for the Leica image.

Leica-generated C2PA display

There’s the camera model, the capture date (which is wrong because Rob didn’t get around to setting the camera’s date before he took the picture), the additional hardware (R-Adapter-M), the dimensions, ISO, focal length, and shutter speed.

Speaking as a photographer, this is kind of cool. There’s a problem in that it’s partly wrong. The focal length isn’t zero, and Rob is pretty sure he didn’t have an adapter on. But Leica is trying to do the right thing and they’ll get there.

Now let’s look at the assertions that were added by Lightroom.

Lightroom-generated C2PA display

There’s a lot of interesting stuff in here, particularly the provenance. Lightroom lets you manage your identities, using what we call “OAuth flows”, so it can ask Instagram (with my permission) what my Instagram ID is. It goes even further with LinkedIn; it turns out that LinkedIn has an integration with the Clear ID people, the ones who fast-track you at the airport. So I set up a Clear ID, which required photos of my passport, and went through the dance with LinkedIn to link it up, and then with Lightroom so it knew my LinkedIn ID. So to expand, what it’s really saying is: “Adobe says that LinkedIn says that Clear says that the government ID of the person who posted this says that he’s named Timothy Bray”.

I don’t know about you, but this feels like pretty strong provenance medicine to me. I understand that the C2PA committee and the CAWG people are re-working the provenance assertions. To them: Please don’t screw this particular style of provenance up.

Now let’s look at what Lightroom says it did. It may be helpful to know what I in fact did.

  1. Cropped the picture down.

  2. Used Lightroom’s “Dehaze” tool because it looked a little cloudy.

  3. Adjusted the exposure and contrast, and boosted the blacks a bit.

  4. Sharpened it up.

Lightroom knows what I did, and you might wonder how it got from those facts to that relatively content-free description that reads like it was written by lawyers. Anyhow, I’d like to know. Since I’m a computer geek, I used the open-source “c2patool” to dump what the assertions actually are. I apologize if this hurts your eyes.

It turns out that there is way more data in those files than the inspector shows. For example, the Leica claims included 29 EXIF values, here are three I selected more or less at random:

          "exif:ApertureValue": "2.79917",
          "exif:BitsPerSample": "16",
          "exif:BodySerialNumber": "6006238",

Some of these are interesting: In the Leica claims, the serial number. I could see that as a useful provenance claim. Or as a potentially lethal privacy risk. Hmmm.

            {
              "action": "c2pa.color_adjustments",
              "parameters": {
                "com.adobe.acr.value": "60",
                "com.adobe.acr": "Exposure2012"
              }
            },
            {
              "action": "c2pa.color_adjustments",
              "parameters": {
                "com.adobe.acr": "Sharpness",
                "com.adobe.acr.value": "52"
              }
            },
            {
              "action": "c2pa.cropped",
              "parameters": {
                "com.adobe.acr.value": "Rotated Crop",
                "com.adobe.acr": "Crop"
              }
            }

And in the Lightroom section, it actually shows exactly what I did, see the sharpness and exposure values.

My feeling is that the inspector is doing either too much or too little. At the minimal end you could just say “hand processed? Yes/No” and “genAI? Yes/No”. For a photo professional, they might like to drill down and see what I actually did. I don’t see who would find the existing presentation useful. There’s clearly work to do in this space.

Oh wait, did I just say “AI”? Yes, yes I did. Let’s look at another picture, in this case a lousy picture.

Picture of an under-construction high-rise behind leaves

I was out for a walk and thought the building behind the tree was interesting. I was disappointed when I pulled it up on the screen, but I still liked the shape and decided to try and save it.

Picture of an under-construction high-rise behind leaves, improved

So I used Lightroom’s “Select Sky” to recover its color, and “Select Subject” to pull the building details out of the shadows. Both of these Lightroom features, which are pretty magic and I use all the time, are billed as being AI-based. I believe it.

Let’s look at what the C2PA discloses.

Lightroom C2PA assertions with automation AI

Having said all that, if you look at the C2PA (or at the data behind it) Lightroom discloses only “Color or Exposure”, “Cropping”, and “Drawing” edits. Nothing about AI.

Hmm. Is that OK? I personally think it is, and that it highlights the distinction between what I’d call “automation” AI and Generative AI. I mean, selecting the sky and subject is something that a skilled Photoshop user could accomplish with a lot of tinkering; the software is just speeding things up. But I don’t know, others might disagree.

Well, how about that generative AI?

Turtle in shallow water, generated by ChatGPT

Fails c2patool validation, “DigitalSourceType” is trainedAlgorithmicMedia

Desktop with decorations, Magic Erase has been applied

“DigitalSourceType” is compositeWithTrainedAlgorithmicMedia

The turtle is 100% synthetic, from ChatGPT, and on the right is a Pixel 10 shot on which I did a few edits including “Magic Eraser”. Both of these came with Content Credentials; ChatGPT’s is actually invalid, but on the glass-half-full front, the Pixel 10’s were also invalid up until a few days ago, then they fixed it. So this stuff does get fixed.

I’m happy about the consistent use of C2PA terminology, they are clearly marking the images as genAI-involved.

I’m about done talking about the state of the Content Credentials art generally but I should probably talk about this device.

Blue Pixel 10

Because it marks the arrival of Content Credentials on the mass consumer market. Nobody knows how many Pixels Google actually sells but I guarantee it’s a lot more than Leica sells M11’s. And since Samsung tends to follow Google pretty closely, we’re heading for tens then hundreds of millions of C2PA-generating mobile devices. I wonder when Apple will climb on board?

Let’s have a look at that C2PA.

C2PA associated with Magic Eraser image

This view of the C2PA is from the Google Photos app. It’s very limited. In particular, there is nothing in there to support provenance. In fact, it’s the opposite, Google is bending over backward to avoid anything that could be interpreted as breaking the privacy contract by sharing information about the user.

Let’s pull back the covers and dig a little deeper. Here are a few notes:

  • The device is identified just as “Pixel camera”. There are lots of different kinds of those!

  • The C2PA inclusion is Not optional!

  • DigitalSourceType: computationalCapture (if no genAI)

  • Timestamp is “untrusted”

The C2PA not being optional removes a lot of UI issues but still, well, I’m not smart enough to have fully thought through the implications. That Digital Source Type looks good and appropriate, and the untrusted-ness of the timestamp is interesting.

You notice that my full-workflow examples featured a Leica rather than the Pixel, and that’s because the toolchain is currently broken for me: Neither Lightroom nor Photoshop can handle the P10 C2PA. I’ll skip the details, except to say that Adobe is aware of the bug, a version mismatch, and they say they’re working on it.

Before we leave the Pixel 10, I should say that there are plenty of alternate camera apps in Android and iOS, some quite good, and it’d be perfectly possible for them to ship much richer C2PA, notably including provenance, location, and so on.

I guess that concludes my look at the current state of the Content Credentials art. Now I’d like to talk about what Content Credentials are for. To start with, I think it’d be helpful to sort the assertions into three baskets.

C2PA assertions in Capture, Processing, and Provenance baskets

Capture, that’s like that Leica EXIF stuff we showed earlier. What kind of camera and lens, what the shooting parameters were. Processing, that’s like the Lightroom report: How was the image manipulated, and by what software. Provenance: Which person or organization produced this?

But I think this picture has an important oversimplification, let me fix that.

C2PA assertion baskets with the addition of GenAI

Processing is logically where you’d disclose the presence of GenAI. And in terms of what people practically care about, that’s super important and deserves special consideration.

Now I’m going to leave the realm of facts and give you opinions. As for the Capture data there on the left… who cares? Really, I’m trying to imagine a scenario in which anyone cares about the camera or lens or F stop. I guess there’s an exception if you want to prove that the photo was taken by one of Annie Leibovitz’s cameras, but that’s really provenance.

Let’s think about a professional publication scenario. They get photos from photographers, employees or agencies or whatever. They might want to be really sure that the photo was from the photographer and not an imposter. So having C2PA provenance would be nice. Then when the publisher gets photos, they do a routine check of the provenance and if it doesn’t check out, they don’t run the picture without a close look first.

They also probably want to check for the “is there genAI?” indicator in the C2PA, and, well, I don’t know what they might do, but I’m pretty sure they’d want to know.

That same publisher might want to equip the photos they publish with C2PA, to demonstrate that they are really the ones who chose and provided the media. That assertion should be applied routinely by their content management system. Which should be easy, on the technology side anyhow.

So from the point of view of a professional publisher, provenance matters, and being careful about GenAI matters, and in the C2PA domain, I think that’s all that really matters.

Now let’s turn to Social Media, which is the source of most of the images that most people see most days. Today, all the networks strip all the photo metadata, and that decision involves a lot of complicated privacy and intellectual-property thinking. But there is one important FACT that they know: For any new piece of media, they know which account uploaded the damn thing, because that account owner had to log in to do it. So I think it’s a no-brainer that IF THE USER WISHES, they can have a Content Credentials assertion in the photo saying “Initially uploaded by Tim Bray at LinkedIn” or whoever at wherever.

What we’d like to achieve is that if you see some shocking or controversial media, you’d really want to know who originally posted it before you decided whether you believed it, and if Content Credentials are absent, that’s a big red flag. And if the picture is of the current situation in Gaza, your reaction might be different depending on whether it was originally from an Israeli military social-media account, or the Popular Front for the Liberation of Palestine, or by the BBC, or by [email protected].

Anyhow, here’s how I see it:

C2PA assertion baskets inflated according to their relative importance

So for me, it’s the P and A in C2PA that matter – provenance and authenticity. I think the technology has the potential to change the global Internet conversation for the better, by making it harder for liars to lie and easier for truth-tellers to be believed. I think the first steps that have been taken so far are broadly correct and the path forward is reasonably clear. All the little things that are broken, we can fix ’em.

And there aren’t that many things that matter more than promoting truth and discouraging lies.

And that’s all, folks.