Understanding AI

By Timothy B. Lee, a tech reporter with a master’s in computer science who covers AI progress and policy.

17 predictions for AI in 2026

2026-01-01 01:41:20

Two quick notes before we get to today’s article:

  • There’s one week left to apply for a Tarbell Fellowship and potentially become the next Kai Williams! It’s a fellowship program for people who want to become journalists covering AI. Understanding AI is participating again in 2026, along with media outlets like NBC News, The Guardian, Bloomberg, and The Verge. Click here to apply—the deadline is January 7.

  • Thanks to everyone who contributed to GiveDirectly! Because my readers gave more than $20,000, my wife and I donated an additional $10,000.


2025 has been a huge year for AI: a flurry of new models, broad adoption of coding agents, and exploding corporate investment were all major themes. It’s also been a big year for self-driving cars. Waymo tripled weekly rides, began driverless operations in several new cities, and started offering freeway service. Tesla launched robotaxi services in Austin and San Francisco.

What will 2026 bring? We asked eight friends of Understanding AI to contribute predictions, and threw another nine in ourselves. We give a confidence score for each prediction; a prediction with 90% confidence should be right nine times out of ten.
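
To make these confidence scores concrete, here is a minimal sketch (with made-up outcomes) of how the predictions could be scored at the end of the year: a Brier score plus a per-confidence-level hit rate.

```python
from collections import defaultdict

# Hypothetical (confidence, came_true) pairs for illustration only.
predictions = [(0.75, True), (0.80, True), (0.80, False), (0.90, True),
               (0.55, False), (0.70, True), (0.90, True), (0.65, False)]

# Brier score: mean squared gap between confidence and outcome (lower is better;
# 0.25 is what you'd get by always guessing 50%).
brier = sum((conf - outcome) ** 2 for conf, outcome in predictions) / len(predictions)
print(f"Brier score: {brier:.3f}")

# Hit rate by confidence level: well-calibrated 90% predictions should land
# about nine times out of ten.
buckets = defaultdict(list)
for conf, outcome in predictions:
    buckets[conf].append(outcome)
for conf, outcomes in sorted(buckets.items()):
    print(f"{conf:.0%} predictions: {sum(outcomes)}/{len(outcomes)} correct")
```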

We don’t believe AI is a bubble on the verge of popping, but neither do we think we’re close to a “fast takeoff” driven by the invention of artificial general intelligence. Rather, we expect models to continue improving their capabilities — but we think it will take a while for the full impact to be felt across the economy.

1. Big Tech capital expenditures will exceed $500 billion (75%)

Timothy B. Lee

Wax sculptures of Mark Zuckerberg, Jeff Bezos, and other tech industry leaders were mounted to robot dogs at a recent exhibit by artist Mike Winkelmann in Miami. (Photo by CHANDAN KHANNA / AFP via Getty Images)

In 2024, the five main hyperscalers — Google, Microsoft, Amazon, Meta, and Oracle — had $241 billion in capital expenditures. This year, those same companies are on track to spend more than $400 billion.

This rapidly escalating spending is a big reason many people believe that there’s a bubble in the AI industry. As we’ve reported, tech companies are now investing more, as a percentage of the economy, than was spent in the peak years of the Apollo Project or the Interstate Highway System. Many people believe that this level of spending is simply unsustainable.

But I don’t buy it. Industry leaders like Mark Zuckerberg and Satya Nadella have said they aren’t building these data centers to prepare for speculative future demand — they’re just racing to keep up with orders their customers are placing right now. Corporate America is excited about AI and spending unprecedented sums on new AI services.

I don’t expect Big Tech’s capital spending to grow as much in 2026 as it did in 2025, but I do expect it to grow, ultimately exceeding $500 billion for the year.

2. OpenAI and Anthropic will both hit their 2026 revenue goals (80%)

Timothy B. Lee

Anthropic and OpenAI have both enjoyed impressive revenue growth in 2025.

  • OpenAI expects to generate more than $13 billion for the calendar year, and to end the year with annual recurring revenue around $20 billion. A leaked internal document indicated OpenAI is aiming for $30 billion in revenue in 2026 — slightly more than double the 2025 figure.

  • Anthropic expects to generate around $4.7 billion in revenue in 2025. In October, the company said its annual recurring revenue had risen to “almost $7 billion.” The company is aiming for 2026 revenue of $15 billion.

I predict that both companies will hit these targets — and perhaps exceed them. The capabilities of AI models have improved a lot over the last year, and I expect there is a ton of room for businesses to automate parts of their operations even without new model capabilities.

3. The context windows of frontier models will stay around one million tokens (80%)

Kai Williams

LLMs have a “context window,” the maximum number of tokens they can process. A larger context window lets an LLM tackle more complex tasks, but it is more expensive to run.

When ChatGPT came out in November 2022, it could only process 8,192 tokens at once. Over the following year and a half, context windows from the major providers increased dramatically. OpenAI started offering a 128,000 token window with GPT-4 Turbo in November 2023. The same month, Anthropic released Claude 2.1, which offered 200,000 token windows. And Google started offering one million tokens of context with Gemini 1.5 Pro in February 2024 — which it later expanded to two million tokens.

Since then, progress has slowed. Anthropic has not changed its default context size since Claude 2.1.¹ GPT-5.2 has a 400,000 token context window, but that’s less than GPT-4.1, which was released last April. And Google’s largest context window has shrunk back to one million.

I expect context windows to stay fairly constant in 2026. As Tim explained in November, larger context window sizes brush up against limitations in the transformer architecture. For most tasks with current capabilities, smaller context windows are cheaper and just as effective. In 2026, there might be some coding-related LLMs — where it’s useful for the LLM to be able to read an entire codebase — that have larger context windows. But I predict the context lengths of general-purpose frontier models will stay about the same over the next year.
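
For readers who haven’t worked with context windows directly, here is a minimal sketch of checking whether a prompt fits before sending it to a model. It uses OpenAI’s open-source tiktoken tokenizer; the 400,000-token limit is the GPT-5-era figure cited above, and the 8,000-token output reserve is an arbitrary assumption.

```python
import tiktoken

CONTEXT_WINDOW = 400_000                      # token limit cited above

enc = tiktoken.get_encoding("cl100k_base")    # a general-purpose BPE encoding

def fits_in_context(prompt: str, reserved_for_output: int = 8_000) -> bool:
    """Count prompt tokens and check they fit alongside the expected output."""
    n_tokens = len(enc.encode(prompt))
    print(f"prompt uses {n_tokens:,} tokens")
    return n_tokens + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_context("Summarize the following codebase: ..."))
```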

4. Real GDP will grow by less than 3.5% in the US (90%)

Timothy B. Lee

The year 2027 has acquired a totemic status in some corners of the AI world. In 2024, former OpenAI researcher Leopold Aschenbrenner penned a widely-read series of essays predicting a “fast takeoff” in 2027. Then in April 2025, an all-star team of researchers published AI 2027, a detailed forecast for rapid AI progress. They forecast that by the 2027 holiday season, GDP will be “ballooning.” One AI 2027 author suggested that this could eventually lead to annual GDP growth rates as high as 50%.

They don’t make a specific prediction about 2026, but if these predictions are close to right, we should start seeing signs of it by the end of 2026. If we’re on the cusp of an AI-powered takeoff, that should translate to above-average GDP growth, right?

So here’s my prediction: inflation-adjusted GDP in the third quarter of 2026 will not be more than 3.5% higher than in the third quarter of 2025.² Over the last decade, year-over-year GDP growth has only been faster than 3.5% in late 2021 and early 2022, a period when the economy was bouncing back from Covid. Outside of that period, year-over-year growth of real GDP has ranged from 1.4% to 3.4%.

I expect the AI industry to continue growing at a healthy pace, and this should provide a modest boost to the US economy. Indeed, data center construction has been supporting the economy over the last year. But I expect the boost from data center construction to be a fraction of one percent — not enough to push overall economic growth outside its normal range.
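
For clarity, here is the arithmetic behind the prediction: year-over-year growth computed from two quarterly GDP levels. The levels below are hypothetical placeholders; the real figures come from the BEA’s quarterly releases.

```python
# Hypothetical real GDP levels, in trillions of chained dollars.
gdp_q3_2025 = 23.0
gdp_q3_2026 = 23.6

yoy_growth = (gdp_q3_2026 / gdp_q3_2025 - 1) * 100
print(f"year-over-year real GDP growth: {yoy_growth:.1f}%")   # 2.6%
# The prediction is simply that this number comes in below 3.5%.
```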

5. AI models will be able to complete 20-hour software engineering tasks (55%)

Kai Williams

The AI evaluation organization METR released the original version of this chart in March. They found that every seven months, the length of software engineering tasks that leading AI models were capable of completing (with a 50% success rate) was doubling. Note that the y-axis of this chart is on a log scale, so the straight line represents an exponential increase.

By mid-2025, LLM releases seemed to be improving more quickly, doubling successful task lengths in just five months. METR estimates that Claude Opus 4.5, released in November, could complete software tasks (with at least a 50% success rate) that took humans nearly five hours.

I predict that this faster trend will continue in 2026. AI companies will have access to significantly more computational resources in 2026 as the first gigawatt-scale clusters start operating early in the year, and LLM coding agents are starting to speed up AI development. Still, there are reasons to be skeptical. Both pre-training (with imitation learning) and post-training (with reinforcement learning) have shown diminishing returns.

Whatever happens, whether METR’s trend line continues to hold will be a crucial question. If the faster trend line holds, the strongest AI models will be at 50% reliability for 20-hour software tasks — half of a software engineer’s work week.
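
Here is the back-of-the-envelope math behind this prediction, using the numbers above: roughly five hours at 50% reliability as of November 2025, doubling every five months if the faster trend holds.

```python
import math

current_hours = 5.0       # ~5-hour tasks at 50% reliability, November 2025
doubling_months = 5.0     # the faster 2025 doubling time
target_hours = 20.0

doublings_needed = math.log2(target_hours / current_hours)   # 2 doublings
months_needed = doublings_needed * doubling_months           # 10 months
print(f"{doublings_needed:.0f} doublings, about {months_needed:.0f} months")
# Roughly 10 months after November 2025, i.e. around September 2026, which is
# why the prediction is plausible but far from a lock.
```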

6. The legal free-for-all that characterized the first few years of the AI boom will be definitively over (70%)

James Grimmelmann, professor at Cornell Tech and Cornell Law School

So far, AI companies are winning against the lawsuits that pose truly existential threats — most notably, courts in the US, EU, and UK have all held that it’s not copyright infringement to train a model. But for everything else, the courts have been putting real operational limits on them. Anthropic is paying $1.5 billion to settle claims that it trained on downloads from shadow libraries, and multiple courts have held or suggested that they need real guardrails against infringing outputs.

I expect the same thing to happen beyond copyright, too: courts won’t enjoin AI companies out of existence, but they will impose serious high-dollar consequences if the companies don’t take reasonable steps to prevent easily predictable harms. It may still take a head on a pike — my money is on Perplexity’s — but I expect AI companies to get the message in 2026.

7. AI will not cause any catastrophes in 2026 (90%)

Steve Newman, author of Second Thoughts

There are credible concerns that AI could eventually enable various disaster scenarios. For instance, an advanced AI might help create a chemical or biological weapon, or carry out a devastating cyberattack. This isn’t entirely hypothetical; Anthropic recently uncovered a group using its agentic coding tools to carry out cyberattacks with minimal human supervision. And AIs are starting to exhibit advanced capabilities in these domains.

However, I do not believe there will be any major “AI catastrophe” in 2026. More precisely: there will be no unusual physical or economic catastrophe (dramatically larger than past incidents of a similar nature) in which AI plays a crucial enabling role. For instance, no unusually impactful bio, cyber, or chemical attack.

Why? It always takes longer than expected for technology to find practical applications — even bad applications. And AI model providers are taking steps to make it harder to misuse their models.

Of course, people may jump to blame AI for things that might have happened anyway, just as some tech CEOs blamed AI for layoffs that were triggered by over-hiring during Covid.

8. Major AI companies like OpenAI and Anthropic will stop investing in MCP (90%)

Andrew Lee, CEO of Tasklet (and Tim’s brother)

The Model Context Protocol was designed to give AI assistants a standardized way to interact with external tools and data sources. Since its introduction in late 2024, it has exploded in popularity.

But here’s the thing: modern LLMs are already smart enough to reason about how to use conventional APIs directly, given just a description of that API. And those descriptions that MCP servers provide? They’re already baked into the training data or accessible on public websites.

Agents built to access APIs directly can be simpler and more flexible, and they can connect to any service — not just the ones that support MCP.
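
To illustrate the pattern Andrew is describing, here is a minimal sketch of an agent step that skips MCP entirely: the model is handed a plain-text API description and asked to produce the request, which the harness then executes. The endpoint, API description, and model name are placeholders, not a real service.

```python
import json
import requests
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

# Hypothetical API documentation pasted straight into the prompt.
API_DOCS = """GET https://api.example.com/v1/weather?city=<name>
Returns JSON: {"city": str, "temp_c": float}"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reply with only a URL to call, nothing else."},
        {"role": "user", "content": f"API docs:\n{API_DOCS}\n\nTask: get the weather in Austin."},
    ],
)
url = resp.choices[0].message.content.strip()
result = requests.get(url, timeout=10).json()   # the harness, not the model, makes the call
print(json.dumps(result, indent=2))
```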

By the end of 2026, I predict MCP will be seen as an unnecessary abstraction that adds complexity without meaningful benefit. Major vendors will stop investing in it.

9. A Chinese company will surpass Waymo in total global robotaxi fleet size (55%)

Daniel Abreu Marques, author of The AV Market Strategist

Waymo has world-class autonomy, broad regulatory acceptance, and a maturing multi-city playbook. But vehicle availability remains a major bottleneck. Waymo is scheduled to begin using vehicles from the Chinese automaker Zeekr in the coming months, but tariff barriers and geopolitical pressures will limit the size of its Zeekr-based fleet. Waymo has also signed a deal with Hyundai, but volume production likely won’t begin until after 2026. So for the next year, fleet growth will remain incremental.

Chinese AV players operate under a different set of constraints. Companies like Pony.ai, Baidu Apollo Go, and WeRide have already demonstrated mass-production capability. For example, when Pony rolled out its Gen-7 platform, it reduced its bill of materials cost by 70%. Chinese companies are scaling fleets across China, the Middle East, and Europe simultaneously.

At the moment, Waymo has about 2,500 vehicles in its commercial fleet. The biggest Chinese company is probably Pony.ai, with around 1,000 vehicles. Pony.ai is aiming for 3,000 vehicles by the end of 2026, while Waymo will need 4,000 to 6,000 vehicles to meet its year-end goal of one million weekly rides.

But if Waymo’s supply chain ramps slower than expected due to unforeseen problems or delays — and Chinese players continue to ramp up production volume — then at least one of them could surpass Waymo in total global robotaxi fleet size by the end of 2026.

10. The first fully autonomous vehicle will be sold to consumers — but it won’t be from Tesla (75%)

Sophia Tung, content editor of the Ride AI newsletter

Currently many customer-owned vehicles have advanced driverless systems (known as “level two” in industry jargon), but none are capable of fully driverless operations (“level four”). I predict that will change in 2026: you’ll be able to buy a car that’s capable of operating with no one behind the wheel — at least in some limited areas.

One company that might offer such a vehicle is Tensor, formerly AutoX. Tensor is working with younger, more eager automakers that already ship vehicles in the US, like VinFast, to manufacture and integrate their vehicles. The manufacturing hurdles, while significant, are not insurmountable.

Many people expect Tesla to ship the first fully driverless customer-owned vehicle, but I think that’s unlikely. Tesla is in a fairly comfortable position. Its driver-assistance system performs well enough most of the time. Users believe it is “pretty much” a fully driverless system. Being years behind Waymo in the robotaxi market hasn’t hurt Tesla’s credibility with its fans. So Tesla can probably retain the loyalty of its customers even if a little-known startup like Tensor introduces a customer-owned driverless vehicle before Tesla enables driverless operation for its customers.

Tensor has a vested interest in being first and flashiest in the market. It could launch a vehicle that can operate with no driver within a very limited area and credibly claim a first-to-market win. Tensor runs driverless robotaxi testing programs and therefore understands the risks involved. Tesla, in contrast, probably does not want to assume liability or responsibility for accidents caused by its system. So I expect Tesla to wait, observe how Tensor performs, and then adjust its own strategy accordingly.

11. Tesla will begin offering a truly driverless taxi service to the general public in at least one city (70%)

Timothy B. Lee

In June, Tesla delivered on Elon Musk’s promise to launch a driverless taxi service in Austin. But it did so in a sneaky way. There was no one in the driver’s seat, but every Robotaxi had a safety monitor in the passenger seat. When Tesla began offering Robotaxi rides in the San Francisco Bay Area, those vehicles had safety drivers.

It was the latest example of Elon Musk overpromising and underdelivering on self-driving technology. This has led many Tesla skeptics to dismiss Tesla’s self-driving program entirely, arguing that Tesla’s current approach simply isn’t capable of full autonomy.

I don’t buy it. Elon Musk tends to achieve ambitious technical goals eventually. And Tesla has been making genuine progress on its self-driving technology. Indeed, in mid-December, videos started to circulate showing Teslas on public roads with no one inside. I think that suggests that Tesla is nearly ready to debut genuinely driverless vehicles, with no Tesla employees anywhere in the vehicle.

Before Tesla fans get too excited, it’s worth noting that Waymo began its first fully driverless service in 2020. Despite that, Waymo didn’t expand commercial service to a second city — San Francisco — until 2023. Waymo’s earliest driverless vehicles were extremely cautious and relied heavily on remote assistance, making rapid expansion impractical. I expect the same will be true for Tesla — the first truly driverless Robotaxis will arrive in 2026, but technical and logistical challenges will limit how rapidly they expand.

12. Text diffusion models will hit the mainstream (75%)

Kai Williams

Current LLMs are autoregressive, which means they generate tokens one at a time. But this isn’t the only way that AI models can produce outputs. Another type of generation is diffusion. The basic idea is to train the model to progressively remove noise from an input. When paired with a prompt, a diffusion model can turn random noise into solid outputs.
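
Here is a toy sketch of that idea applied to text: start from a fully masked ("noisy") sequence and repeatedly keep the most confident guesses until nothing is masked. The denoiser below is a random placeholder; a real system like Mercury or Gemini Diffusion would predict every position in parallel with a trained network, but the loop structure is the point.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "a", "mat"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Placeholder for a trained model: guess every position, with a confidence."""
    return [(random.choice(VOCAB), random.random()) for _ in tokens]

def generate(length=6, steps=3):
    tokens = [MASK] * length                          # pure "noise": all masked
    for step in range(1, steps + 1):
        guesses = toy_denoiser(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Keep only the most confident new guesses, so that after this step
        # roughly step/steps of the sequence is filled in.
        target_filled = round(length * step / steps)
        n_new = target_filled - (length - len(masked))
        for i in sorted(masked, key=lambda i: -guesses[i][1])[:n_new]:
            tokens[i] = guesses[i][0]
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

generate()
```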

For a while, diffusion models were the standard way to make image models, but it wasn’t as clear how to adapt that to text models. In 2025, this changed. In February, the startup Inception Labs released Mercury, a text diffusion model aimed at coding. In May, Google announced Gemini Diffusion as a beta release.

Diffusion models have several key advantages over standard models. For one, they’re much faster because they generate many tokens at once. They also might learn from data more efficiently, at least according to a July study by Carnegie Mellon researchers.

While I don’t expect diffusion models to supplant autoregressive models, I think there will be more interest in this space, with at least one established lab (Chinese or American) releasing a diffusion-based LLM for mainstream use.

13. There will be an anti-AI super PAC that raises at least $20 million (70%)

Charlie Guo, author of Artificial Ignorance

AI has become a vessel for a number of different anxieties: misinformation, surveillance, psychosis, water usage, and “Big Tech” power in general. As a result, opposition to AI is quickly becoming a bipartisan issue. One example: back in June, Ted Cruz attempted to add an AI regulation moratorium to the budget reconciliation bill (not unlike President Trump’s recent executive order), but it failed 99-1.

Interestingly, there are at least two well-funded pro-AI super PACs:

  • Leading The Future, with over $100 million from prominent Silicon Valley investors, and

  • Meta California, with tens of millions from Facebook’s parent company.

Meanwhile, there’s no equally organized counterweight on the anti-AI side. This feels like an unstable equilibrium, and I expect to see a group solely dedicated to lobbying against AI-friendly policies by the end of 2026.

14. News coverage linking AI to suicide will triple — but actual suicides will not (85%)

Abi Olvera, author of Positive Sum

We’ve already seen extensive media coverage of cases like the Character.AI lawsuit, where a teen’s death became national news. I expect suicides involving LLMs to generate even more media attention in 2026. Specifically, I predict that news mentions of “AI” and “suicide” in media databases will be at least three times higher in 2026 than in 2025.

But increased coverage doesn’t mean increased deaths. The US suicide rate will likely continue on its baseline trends.

The US suicide rate is currently near a historic peak after a mostly steady rise since 2000. While the rate remained high through 2023, recent data shows a meaningful decrease in 2024. I expect suicide rates to stay stable or decline further, reverting toward the long-run average and away from the 2018 and 2022 peaks.

15. The American open frontier will catch up to Chinese models (60%)

Florian Brand, editor at the Interconnects newsletter

In late 2024, Qwen 2.5, made by the Chinese firm Alibaba, surpassed the best American open model Llama 3. In 2025, we got a lot of insanely good Chinese models — DeepSeek R1, Qwen3, Kimi K2 — and American open models fell behind. Meta’s Llama 4, Google’s Gemma 3, and other releases were good models for their size, but didn’t reach the frontier. American investment in open weights started to flag; there have been rumors since the summer that Meta is switching to closed models.

But things could change next year. Through advocacy like the ATOM Project (led by Nathan Lambert, the founder of Interconnects), more Western companies have indicated interest in building open-weight models. In late 2025, there has been an uptick in solid American/Western open model releases like Mistral 3, Olmo 3, Rnj, and Trinity. Right now, those models are behind in raw performance, but I predict that this will change in 2026 as Western labs keep up their current momentum. American companies still have substantial resources, and organizations like Nvidia — which announced in December it would release a 500 billion parameter model — seem ready to invest.

16. Vibes will have more active users than Sora in a year (70%)

Kai Williams

This fall, OpenAI and Meta both released platforms for short-form AI-generated video. Initially, Sora caught all of the positive attention: the app came with a new video generation model and a clever mechanic around making deepfakes of your friends. Meta’s Vibes initially fell flat. Sora quickly became the number one app in Apple’s App Store, while the Meta AI app, which includes Vibes, languished around position 75.

Today, however, the momentum seems to have shifted. Sora’s initial excitement has worn off as the novelty of AI videos has faded. Meanwhile, Vibes has been growing, albeit slowly, hitting two million daily active users in mid-November, according to Business Insider. The Meta AI app now ranks higher on the App Store than Sora.

I think this reversal will continue. From personal experience, Sora’s recommendation algorithm seems very clunky, and Meta is very skilled at building compelling products that grow its user base. I wouldn’t count out Mark Zuckerberg when it comes to growing a social media app.

17. Counterpoint: Sora will have more active users than Vibes in a year (65%)

Timothy B. Lee

This is one of the few places where Kai and I disagreed, so I thought it would be fun to air both sides of the argument.

I was initially impressed by Sora’s clever product design, but the app hasn’t held my attention since my October writeup. However, toward the end of that writeup I said this:

I expect the jokes to get funnier as the Sora audience grows. Another obvious direction is licensing content from Hollywood. I expect many users would love to put themselves into scenes involving Harry Potter, Star Wars, or other famous fictional worlds. Right now, Sora tersely declines such requests due to copyright concerns. But that could change if OpenAI writes big enough checks to the owners of these franchises.

This is exactly what happened. OpenAI just signed a licensing agreement with Disney to let users make videos of themselves with Disney-owned characters. It’s exclusive for the first year. I expect this to greatly increase interest in Sora, because while making fake videos of yourself is lame, making videos of yourself interacting with Luke Skywalker or Iron Man is going to be more appealing.

I doubt users will react well if they’re just given a blank prompt field to fill out, so fully exploiting this opportunity will require clever product design. But Sam Altman has shown a lot of skill at turning promising AI models into compelling products. There’s no guarantee he’ll be able to do this with Sora, but I’m guessing he’ll figure it out.

1. Anthropic does offer a million token context window in beta testing for Sonnet 4 and Sonnet 4.5.

2. I’m focusing on Q3 numbers because we don’t typically get GDP data for the fourth quarter until late January, which is too late for a year-end article like this.

Waymo and Tesla’s self-driving systems are more similar than people think

2025-12-18 06:01:17

The transformer architecture underlying large language models is remarkably versatile. Researchers have found many use cases beyond language, from understanding images to predicting the structure of proteins to controlling robot arms.

The self-driving industry has jumped on the bandwagon too. Last year, for example, the autonomous vehicle startup Wayve raised $1 billion. In a press release announcing the round, Wayve said it was “building foundation models for autonomy.”

“When we started the company in 2017, the opening pitch in our seed deck was all about the classical robotics approach,” Wayve CEO Alex Kendall said in a November interview. That approach was to “break down the autonomy problem into a bunch of different components and largely hand-engineer them.”

Wayve took a different approach, training a single transformer-based foundation model to handle the entire driving task. Wayve argues that its network can more easily adapt to new cities and driving conditions.

Tesla has been moving in the same direction.

“We used to work on an explicit, modular approach because it was so much easier to debug,” said Tesla AI chief Ashok Elluswamy at a recent conference. “But what we found out was that codifying human values was really difficult.”

So a couple of years ago, Tesla scrapped its old code in favor of an end-to-end architecture. Here’s a slide from Elluswamy’s October presentation:

Conventional wisdom holds that Waymo has a dramatically different approach. Many people — especially Tesla fans — believe that Tesla’s self-driving technology is based on cutting-edge, end-to-end AI models, while Waymo still relies on a clunky collection of handwritten rules.

But that’s not true — or at least it greatly exaggerates the differences.

Last year, Waymo published a paper on EMMA, a self-driving foundation model built on top of Google’s Gemini.

“EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements,” the researchers wrote.

Although the EMMA model was impressive in some ways, the Waymo team noted that it “faces challenges for real-world deployment,” including poor spatial reasoning ability and high computational costs. In other words, the EMMA paper described a research prototype — not an architecture that was ready for commercial use.

But Waymo kept refining this approach. In a blog post last week, Waymo pulled back the curtain on the self-driving technology in its commercial fleet. It revealed that Waymo vehicles today are controlled by a foundation model that’s trained in an end-to-end fashion — just like Tesla and Wayve vehicles.

For this story, I read several Waymo research papers and watched presentations by (and interviews with) executives at Waymo, Wayve, and Tesla. I also had a chance to talk to Waymo co-CEO Dmitri Dolgov. Read on for an in-depth explanation of how Waymo’s technology works, and why it’s more similar to rivals’ technology than many people think.

Thinking fast and slow

Some driving scenarios require complex, holistic reasoning. For example, suppose a police officer is directing traffic around a crashed vehicle. Navigating this scene not only requires interpreting the officer’s hand signals, it also requires reasoning about the goals and likely actions of other vehicles as they navigate a chaotic situation. The EMMA paper showed that LLM-based models can handle these complex situations much better than a traditional modular approach.

But foundation models like EMMA also have real downsides. One is latency. In some driving scenarios, a fraction of a second can make the difference between life and death. The token-by-token reasoning style of models like Gemini can mean long and unpredictable response times.

Traditional foundation models are also not very good at geometric reasoning. They can’t always judge the exact locations of objects in an image. They might also overlook objects or hallucinate ones that aren’t there.

So rather than relying entirely on an EMMA-style vision-language model (VLM), Waymo placed two neural networks side by side. Here’s a diagram from Waymo’s blog post:

Let’s start by zooming in on the lower-left of the diagram:

VLM here stands for vision-language model — specifically Gemini, the Google AI model that can handle images as well as text. Waymo says this portion of its system was “trained using Gemini” and “leverages Gemini’s extensive world knowledge to better understand rare, novel, and complex semantic scenarios on the road.”

Compare that to EMMA, which Waymo described as maximizing the “utility of world knowledge” from “pre-trained large language models” like Gemini. The two approaches are very similar — and both are similar to the way Tesla and Wayve describe their self-driving systems.

“Milliseconds really matter”

But the model in today’s Waymo vehicles isn’t just an EMMA-like vision-language model — it’s a hybrid system that also includes a module called a sensor fusion encoder that is depicted in the upper-left corner of Waymo’s diagram:

This module is tuned for speed and accuracy.

“Imagine a latency-critical safety scenario where maybe an object appears from behind a parked car,” Waymo co-CEO Dmitri Dolgov told me. “Milliseconds really matter. Accuracy matters.”

Whereas the VLM (the blue box) considers the scene as a whole, the sensor fusion module (the yellow box) breaks the scene into dozens of individual objects: other vehicles, pedestrians, fire hydrants, traffic cones, the road surface, and so forth.

It helps that every Waymo vehicle has lidar sensors that measure the distance to nearby objects by bouncing lasers off of them. Waymo’s software matches these lidar measurements to the corresponding pixels in camera images — a process called sensor fusion. This allows the system to precisely locate each object in three-dimensional space.
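
To make "matching lidar measurements to pixels" concrete, here is a minimal sketch of the core geometric step: projecting a 3D lidar point into a camera image with a pinhole model. The intrinsics and the lidar-to-camera transform are invented for illustration; real values come from the vehicle’s sensor calibration.

```python
import numpy as np

# Hypothetical camera intrinsics (focal lengths, principal point) and a
# lidar-to-camera extrinsic transform; actual calibration values are not public.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                      # rotation from lidar frame to camera frame
t = np.array([0.0, -0.3, 0.1])     # translation between the two sensors (meters)

def project_lidar_point(p_lidar):
    """Project a 3D lidar point into pixel coordinates of the camera image."""
    p_cam = R @ p_lidar + t        # express the point in the camera frame
    if p_cam[2] <= 0:
        return None                # point is behind the camera
    u, v, w = K @ p_cam            # pinhole projection
    return u / w, v / w

# A point 20 meters ahead of the car, slightly to the left.
print(project_lidar_point(np.array([-1.5, 0.0, 20.0])))
```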

In early self-driving systems, a human programmer would decide how to represent each object. For example, the data structure for a vehicle might record the type of vehicle, how fast it’s moving, and whether it has a turn signal on.

But a hand-coded system like this is unlikely to be optimal. It will save some information that isn’t very useful while discarding other information that might be crucial.

“The task of driving is not one where you can just enumerate a set of variables that are sufficient to be a good driver,” Dolgov told me. “There’s a lot of richness that is very hard to engineer.”

Waymo co-CEO Dmitri Dolgov. (Image courtesy of Waymo)

So instead, Waymo’s model learns the best way to represent each object through a data-driven training process. Waymo didn’t give me a ton of information about how this works, but I suspect it’s similar to the technique described in the 2024 Waymo paper called “MoST: Multi-modality Scene Tokenization for Motion Prediction.”

The system described in the MoST paper still splits a driving scene up into distinct objects as in older self-driving systems. But it doesn’t capture a set of attributes chosen by a human programmer. Rather, it computes an “object vector” that captures information that’s most relevant for driving — and the format of this vector is learned during the training process.

“Some dimensions of the vector will likely indicate whether it’s a fire truck, a stop sign, a tree trunk, or something else,” I wrote in an article last year. “Other dimensions will represent subtler attributes of objects. If the object is a pedestrian, for example, the vector might encode information about the position of the pedestrian’s head, arms, and legs.”

There’s an analogy here to LLMs. An LLM represents each token with a “token vector” that captures the information that’s most relevant to predicting the next token. In a similar way, the MoST system learns to capture the information about objects that are most relevant for driving.

I suspect that when Waymo says its sensor fusion module outputs “objects, sensor embeddings” in the diagram above, this is a reference to a MoST-like system.

How does the system know which information to include in these object vectors? Through end-to-end training of course!

That brings us to the third and final module of Waymo’s self-driving system: the world decoder.

It takes inputs from both the sensor fusion encoder (the fast-thinking module that breaks the scene into individual objects) and the driving VLM (the slow-thinking module that tries to understand the scene as a whole). Based on information supplied by these modules, the world decoder tries to decide the best action for a vehicle to take.

During training, information flows in the opposite direction. The system is trained on data from real-world situations. If the decoder correctly predicts the actions taken in the training example, the network gets positive reinforcement. If it guesses wrong, then it gets negative reinforcement.

These signals are then propagated backward to the other two modules. If the decoder makes a good choice, signals are sent back to the yellow and blue boxes encouraging them to continue doing what they’re doing. If the decoder makes a bad choice, signals are sent back to change what they’re doing.

Based on these signals, the sensor fusion module learns which information is most helpful to include in object vectors — and which information can be safely left out. Again, this is closely analogous to LLMs, which learn the most useful information to include in the vectors that represent each token.
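
Here is a schematic PyTorch sketch of what training modules end-to-end means in practice: a fast per-object encoder and a scene-level encoder (standing in for the VLM) both feed a decoder that picks an action, and a single loss sends gradients back through all three. The shapes, modules, and imitation-style loss are invented for illustration; Waymo has not published its architecture at this level of detail.

```python
import torch
import torch.nn as nn

class SensorFusionEncoder(nn.Module):          # "yellow box": learned per-object vectors
    def __init__(self, obj_dim=16, embed_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(obj_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))
    def forward(self, objects):                # (batch, n_objects, obj_dim)
        return self.mlp(objects).mean(dim=1)   # pool object vectors per scene

class SceneEncoder(nn.Module):                 # "blue box": whole-scene embedding
    def __init__(self, img_dim=128, embed_dim=64):
        super().__init__()
        self.proj = nn.Linear(img_dim, embed_dim)
    def forward(self, scene):
        return self.proj(scene)

class WorldDecoder(nn.Module):                 # "purple box": choose an action
    def __init__(self, embed_dim=64, n_actions=5):
        super().__init__()
        self.head = nn.Linear(2 * embed_dim, n_actions)
    def forward(self, obj_emb, scene_emb):
        return self.head(torch.cat([obj_emb, scene_emb], dim=-1))

fusion, scene_enc, decoder = SensorFusionEncoder(), SceneEncoder(), WorldDecoder()
params = list(fusion.parameters()) + list(scene_enc.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

objects = torch.randn(8, 12, 16)               # fake batch: 12 objects per scene
scene = torch.randn(8, 128)
expert_action = torch.randint(0, 5, (8,))      # action a human driver actually took

logits = decoder(fusion(objects), scene_enc(scene))
loss = nn.functional.cross_entropy(logits, expert_action)
loss.backward()                                # gradients flow into all three modules
opt.step()
```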

Modular networks can be trained end-to-end

Leaders at all three self-driving companies portray this as a key architectural difference between their self-driving systems. Waymo argues that its hybrid system delivers faster and more accurate results. Wayve and Tesla, in contrast, emphasize the simplicity of their monolithic end-to-end architectures. They believe that their models will ultimately prevail thanks to the Bitter Lesson — the insight that the best results often come from scaling up simple architectures.

In a March interview, podcaster Sam Charrington asked Waymo’s Dragomir Anguelov about the choice to build a hybrid system.

“We’re on the practical side,” Anguelov said. “We will take the thing that works best.”

Anguelov pointed out that the phrase “end-to-end” describes a training strategy, not a model architecture. End-to-end training just means that gradients are propagated all the way through the network. As we’ve seen, Waymo’s network is end-to-end in this sense: during training, error signals propagate backward from the purple box to the yellow and blue boxes.

“You can still have modules and train things end-to-end,” Anguelov said in March. “What we’ve learned over time is that you want a few large components, if possible. It simplifies development.” However, he added, “there is no consensus yet if it should be one component.”

So far, Waymo has found that its modular approach — with three modules rather than just one — is better for commercial deployment.

Waymo co-CEO Dmitri Dolgov told me that a monolithic architecture like EMMA “makes it very easy to get started, but it’s wildly inadequate to go to full autonomy safely and at scale.”

I’ve already mentioned latency and accuracy as two major concerns. Another issue is validation. A self-driving system doesn’t just need to be safe; the company making it needs to be able to prove it’s safe with a high level of confidence. This is hard to do when the system is a black box.

Under Waymo’s hybrid architecture, the company’s engineers know what function each module is supposed to perform, which allows them to be tested and validated independently. For example, if engineers know what objects are in a scene, they can look at the output of the sensor fusion module to make sure it identifies all the objects it’s supposed to.

These architectural differences seem overrated

My suspicion is that the actual differences are smaller than either side wants to admit. It’s not true that Waymo is stuck with an outdated system based on hand-coded rules. The company makes extensive use of modern AI techniques, and its system seems perfectly capable of generalizing to new cities.

Indeed, if Waymo deleted the yellow box from its diagram, the resulting model would be very similar to those at Tesla and Wayve. Waymo supplements this transformer-based model with a sensor fusion module that’s tuned for speed and geometric precision. But if Waymo finds the sensor fusion module isn’t adding much value, it can always remove it. So it’s hard to imagine the module puts Waymo at a major disadvantage.

At the same time, I wonder if Wayve and Tesla are downplaying the modularity of their own systems for marketing purposes. Their pitch to investors is that they’re pioneering a radically different approach than incumbents like Waymo — one that’s inspired by frontier labs like OpenAI and Anthropic. Investors were so impressed by this pitch that they gave Wayve $1 billion last year, and optimism about Tesla’s self-driving project has pushed up the company’s stock price in recent years.

For example, here’s how Wayve depicts its own architecture:

At first glance, this looks like a “pure” end-to-end architecture. But look closer and you’ll notice that Wayve’s model includes a “safety expert sub-system.” What’s that? I haven’t been able to find any details on how this works or what it does. But in a 2024 blog post, Wayve wrote about its effort to train its models to have an “innate safety reflex.”

According to Wayve, the company uses simulation to “optimally enrich our Emergency Reflex subsystem’s latent representations.” Wayve added that “to supercharge our Emergency Reflex, we can incorporate additional sources of information, such as other sensor modalities.”

This sounds at least a little bit like Waymo’s sensor fusion module. I’m not going to claim that the systems are identical or even all that similar. But any self-driving company has to address the same basic problem as Waymo: that large, monolithic language models are slow, error-prone, and difficult to debug. I expect that as it gets ready to commercialize its technology, Wayve will need to supplement the core end-to-end model with additional information sources that are easier to test and validate — if it isn’t doing so already.

The best Chinese open-weight models — and the strongest US rivals

2025-12-15 23:58:12

DeepSeek’s release of R1 in January shocked the world. It came just four months after OpenAI announced its first reasoning model, o1. The model parameters were released openly. And DeepSeek R1 powered the first consumer-facing chatbot to show the full chain of thought before answering.

The effect was electric.

The DeepSeek app briefly surpassed ChatGPT as the top app in the iOS App Store. Nvidia’s stock dropped almost 20% a few days later. Chinese companies rushed to use the model in their products.

DeepSeek’s success with R1 sparked a renaissance of open-weight efforts from Chinese companies. Before R1, Meta’s Llama models were the most prominent open-weight models. Today, Qwen, from the e-commerce firm Alibaba, is the leading open model family. But it faces stiff competition from DeepSeek, Moonshot AI, Z.AI, and other (primarily Chinese) companies.

American companies have also released a number of notable open-weight models. OpenAI released open-weight models in August. IBM officially released its well-regarded Granite 4 models in October. Google, Microsoft, Nvidia, and the Allen Institute for AI have all released new open models this year — and so has the French startup Mistral. But none of these models have been as good as the top Chinese models.

With so many releases, which ones are worth paying attention to?

In this piece, I’ll cover 13 of the most significant open-weight model developers, starting with the models that deliver the most bang for the buck. For each company, I’ll list a few models worth paying attention to, using Intelligence Index scores from Artificial Analysis as a rough approximation of model quality.

A key inspiration for this article has been Nathan Lambert, a researcher at the Allen Institute for Artificial Intelligence and the author of the excellent Interconnects newsletter. Lambert is concerned about the slow progress of American open-weight models and has been trying to rally support for building a new generation of open-weight models in the US. You can read about that effort here.

1. Qwen from Alibaba

Takeaway: There’s a very good Qwen model at basically every size through 235 billion parameters. The fact that Qwen is Chinese might be the biggest barrier for the average US firm.

Models:

  • Qwen3 4B Thinking:

    • Released April 26, 2025

    • Intelligence Index: 43

  • Qwen3 VL 32B:

    • Released October 19, 2025

    • Intelligence Index: 52

  • Qwen3 Next 80B:

    • Released September 9, 2025

    • Intelligence Index: 54

  • Qwen3 235B A22B 2507:1

    • Released July 25, 2025

    • Intelligence Index: 57

The Qwen family of open-weight models is made by Alibaba, an e-commerce and cloud services tech company.

Qwen models come in many sizes. As Nathan Lambert noted in a talk at the PyTorch conference, “Qwen alone is roughly matching the entire American open model ecosystem today.”

Enterprises often need to execute a series of simple tasks as part of a larger data pipeline. Open models — especially Qwen models — tend to work well here. The company has excelled at producing small models that run on cheap hardware.
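
A sketch of what that pipeline pattern often looks like in practice: a small open-weight model served locally behind an OpenAI-compatible endpoint (for example via vLLM or Ollama) handling one cheap, repetitive step. The base URL and model name below are placeholders for whatever you actually deploy.

```python
from openai import OpenAI

# Hypothetical local deployment serving an OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

def classify_ticket(text: str) -> str:
    """One cheap pipeline step: route a support ticket to a category."""
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-4B",   # placeholder model id for a small Qwen model
        messages=[
            {"role": "system", "content": "Answer with one word: billing, bug, or other."},
            {"role": "user", "content": text},
        ],
        max_tokens=5,
    )
    return resp.choices[0].message.content.strip().lower()

print(classify_ticket("I was charged twice for my subscription this month."))
```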

The Qwen series faces stiffer open-weight competition at the large end of the spectrum. And the largest Qwen model — Qwen3-Max — is not open-weight.

There’s a robust community around Qwen models. According to an analysis of Hugging Face data by the ATOM Project, Qwen is now the most downloaded model family in the world.

There have also been whispers of American companies adopting Qwen models. In October, Airbnb CEO Brian Chesky caused a stir by telling Bloomberg that the company is “relying a lot on Alibaba’s Qwen model” because it is fast, cheap, and performant enough.

But I spoke with several people whose organizations could not use Qwen (and other Chinese open models) for branding or compliance reasons.

This is one of the biggest barriers to Qwen’s adoption, as Lambert wrote in May:

People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions built by people who know the fact that model weights alone cannot reveal anything to their creators.

Lambert argues that many companies worry about the output of Chinese models being compromised. With current techniques, it’s impossible to rule this out without access to the training data — though Lambert believes the models are probably safe.

2. Kimi K2 from Moonshot

Takeaway: Kimi K2 Thinking is arguably the best open model in the world, but it’s difficult to run locally.

Models:

  • Kimi K2 0905 (1 trillion parameters):

    • Released September 2, 2025

    • Intelligence Index: 50

  • Kimi K2 Thinking (1T):

    • Released November 4, 2025

    • Intelligence Index: 67

Moonshot AI is a Chinese startup founded in March 2023. Kimi K2 is their flagship large language model.

Kimi K2 Thinking is arguably the best open model in the world by benchmark score. Artificial Analysis currently ranks it as the strongest model not made by OpenAI, Google, or Anthropic. Epoch’s Capabilities Index ranks it as the second best open model, and 14th overall.

Beyond the benchmarks, most reactions to Kimi have been positive.

  • Many people have praised K2’s writing abilities, both as a reasoning model and not. Rohit Krishnan, an entrepreneur and Substacker, tweeted that “Kimi K2 is remarkably good at writing, and unlike all others thinking mode hasn’t degraded its writing ability more.”

  • Nathan Lambert noted that Kimi K2 Thinking is one of the first open-weight models to be able to make long strings of tool calls, several hundred at a time. This makes agentic workflows possible.

  • On a podcast, Chamath Palihapitiya, a venture capitalist and former Facebook executive, said that his company 8090 has “directed a ton of our workloads to Kimi K2 on Groq because it was really way more performant and frankly just a ton cheaper than OpenAI and Anthropic.”

Some Reddit commenters have noted that K2 Thinking is not quite as strong at agentic coding as other open-weight models like Qwen coding models or Z.AI’s GLM, but it’s still a solid option and a stronger all-around model.

Kimi K2 also uses a lot of tokens; of all the models listed in this piece, K2 Thinking uses the second most tokens on Artificial Analysis’s benchmark suite.

And good luck trying to run this on your own computer. K2 Thinking has more than one trillion parameters, and the Hugging Face download is over 600 GB. One Redditor managed to get a quantized version running on a personal computer (with a GPU attached) at the speed of … half a token per second.
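
The arithmetic explains why. Here is the back-of-the-envelope weight-memory calculation for a one-trillion-parameter model at a few common precisions (ignoring activations and the KV cache):

```python
params = 1.0e12   # one trillion parameters

for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")
# FP16: ~2,000 GB   INT8: ~1,000 GB   INT4: ~500 GB
# Even aggressively quantized, the weights alone need hundreds of gigabytes,
# which is why consumer hardware struggles.
```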

So in practice, using Kimi requires Moonshot’s API service, a third-party inference provider, or your own cluster.

3. gpt-oss from OpenAI

Takeaway: The gpt-oss models are excellent at reasoning tasks, and very fast. But they are weaker outside of pure reasoning tasks.

Models:

  • gpt-oss-20b:

    • Released August 4, 2025

    • Intelligence Index: 52 on high reasoning, 44 on low reasoning

  • gpt-oss-120b:

    • Released August 4, 2025

    • Intelligence Index: 61 on high reasoning, 48 on low reasoning

In August, OpenAI released two open-weight models, gpt-oss-120b and gpt-oss-20b. The 120b version is almost certainly the most capable American open model.

Both models are optimized for reasoning and agentic tasks — OpenAI claimed they were at “near-parity with OpenAI o4-mini on core reasoning benchmarks.” This includes math; the fourth-best-performing entry in the Kaggle competition to solve IMO-level problems — which currently has a $2.5 million prize pot — is based on gpt-oss-120b. (It’s unclear what models are used by entries one through three.)

The gpt-oss models have a generally solid reputation. “When I ask people who work in these spaces, the impression has been very positive,” Nathan Lambert noted in a recent talk on the state of open models.

They are also very fast. A Reddit commenter benchmarking various models was able to run gpt-oss-20b locally at 224 tokens per second, faster than the GPT-5.1, Gemini, or Claude APIs. And according to Artificial Analysis, some inference providers can run the 120B variant at over 3,000 tokens per second.

However, the gpt-oss models aren’t as good outside of coding, math, and reasoning. In particular, they have little factual knowledge. On SimpleQA, gpt-oss-120b only gets 16.8% right and gpt-oss-20b gets a mere 6.7% right. (Gemini 3 Pro gets 70% right, while GPT-5.1 gets 50% right). And when they get stumped by a SimpleQA question, the gpt-oss models almost always hallucinate an answer.

It’s also unclear whether OpenAI will release a follow-up to these two models. In the meantime, gpt-oss is a solid choice for reasoning and coding tasks if you need an American model you can run locally.

4. The DeepSeek models

Takeaway: DeepSeek releases strong models, particularly in math. Its most recent release, V3.2, is solid but not exceptional. Future releases might be a big deal.

Models:

  • DeepSeek R1 0528 (685 billion parameters):

    • Released May 28, 2025

    • Intelligence Index: 52

  • DeepSeek V3.2 (685B):

    • Released December 1, 2025

    • Intelligence Index: 66

  • DeepSeek V3.2 Speciale (685B):

    • Released December 1, 2025

    • Intelligence Index: 59

DeepSeek is an AI company owned by the Chinese hedge fund HighFlyer. DeepSeek explicitly aims to develop artificial general intelligence (AGI) through open models. As I mentioned in the introduction, the success of DeepSeek R1 in January inspired many open-weight efforts in China.

At the beginning of December, DeepSeek released V3.2 and V3.2 Speciale. These models have impressive benchmark numbers: Artificial Analysis rates V3.2 as the second best open model on their index, while V3.2 Speciale tops all models — open or closed — in the MathArena benchmark for final answer competitions.

Still, DeepSeek’s recent releases haven’t seemed to catch the public’s attention. Substack writer Zvi Mowshowitz summed up V3.2 as “okay and cheap but slow.” Mowshowitz noted there had not been much public adoption of the model.

It’s also probably a good idea to use DeepSeek’s products through an American provider or on your own hardware. In February, a security firm found that DeepSeek’s website was passing information to a Chinese state-owned company. (It’s unclear whether this is still happening.)

Regardless of how V3.2 fares, DeepSeek will remain a lab to watch. Their next major model release (rumored to be in February 2026) might be a big deal.

5. Olmo 3 from the Allen Institute for AI

Takeaway: Olmo 3 models are open-source, not just open-weight. Their performance isn’t too far behind Qwen models.

Models:

  • Olmo 3 7B:

    • Released November 20, 2025

    • Intelligence Index: 32 for thinking, 22 for instruct

  • Olmo 3.1 32B Think:

    • Released December 12, 2025

    • Intelligence Index: Not yet benchmarked, but Olmo 3 32B Think was 36

The Allen Institute for Artificial Intelligence (Ai2) is a nonprofit research institute founded in 2014. The Olmo series is one of Ai2’s main research products.

Olmo 3, released in November, is probably the best open-source model in the world. Every other developer I discuss here (except Nvidia) releases only open-weight models, where the final set of model parameters is available but not the code and data used for training. Ai2 not only released training code and data, but also several model checkpoints from midway through the training process.

This lets researchers learn from Olmo’s development process, take advantage of the open datasets, and use the Olmo models in their experiments.

Olmo’s openness can also be helpful for enterprise. One of the project’s co-leaders, Hanna Hajishirzi, told me that having several Olmo checkpoints gives companies more flexibility.

Companies can train an earlier Olmo 3 checkpoint to ensure that the model ends up effectively learning from the data for their use case. Hajishirzi said that she hears a lot of people say fine-tuning doesn’t work, but that’s because they’re only training on the “final snapshot of the model.”

For instance, if a model has already gone through reinforcement learning to improve its coding skills, it may have weaker capabilities elsewhere. So if a company wants to fine-tune a model to be good at a non-coding skill — like giving feedback on writing — they are better off choosing an earlier checkpoint.
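
In code, starting from an earlier checkpoint is mostly a matter of which revision you load before fine-tuning. Here is a hedged sketch with Hugging Face transformers; the model id and revision name are placeholders, so check Ai2’s model cards for the checkpoints they actually publish.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-7B",              # placeholder model id
    revision="stage1-step500000",     # placeholder intermediate checkpoint
)
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-7B")

# From here, run an ordinary supervised fine-tuning loop (e.g. the transformers
# Trainer or TRL's SFTTrainer) on domain data, starting from a checkpoint that
# hasn't yet been specialized by later RL stages.
```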

Still, out of the box, the Olmo 3 models perform a little worse than the best open-weight models of their size, which are the Qwen models. And they’re certainly much weaker than large open-weight models like Kimi K2 Thinking.

In any event, the Allen Institute is an organization to watch. It recently received a $150 million grant from the National Science Foundation and Nvidia to create open models for scientists.

6. GLM 4.6 from Z.AI

Takeaway: The GLM 4.6 models are solid, particularly for coding.

Models:

  • GLM 4.6V-Flash (10B):

    • Released December 8, 2025

    • Intelligence Index score has not been released

  • GLM 4.6 (357B):

    • Released September 29, 2025

    • Intelligence Index: 56

  • GLM 4.6V (108B):

    • Released December 8, 2025

    • Intelligence Index score has not been released

Z.AI (formerly Zhipu AI) is a Chinese AI startup founded in 2019. In addition to its flagship GLM series of LLMs, it also releases some non-text models.

Unlike many of the other startups that form the Six Chinese Tigers, Z.AI was popular in China even before DeepSeek came to prominence. Z.AI released the first version of GLM all the way back in 2021, albeit not as an open model. A market survey in mid-2024 found that Z.AI was the third most popular enterprise LLM provider in China, after Alibaba and SenseTime.

But until recently, Z.AI struggled to attract attention outside of China. Now that is starting to change after two strong releases: GLM 4.5 in July and GLM 4.6 in late September. In November, the South China Morning Post reported that Z.AI had 100,000 users of its API, a “tenfold increase over two months,” and over 3 million chatbot users.

GLM 4.6 is probably not as strong as the best Qwen models, nor Kimi, but it is a very solid option, particularly for coding.

Z.AI may not continue releasing open-weight models if its market position changes.

The company’s product director, Zixuan Li, recently told ChinaTalk that “as a Chinese company, we need to really be open to get accepted by some companies because people will not use your API to try your models.” Z.AI is a startup without a strong pre-existing brand or capital in reserve. Gaining adoption is crucial to the company’s survival. Releasing a model’s weights allows enterprises to try it without having to worry about the data security challenges of using a Chinese API.

Z.AI “only gets maybe 5% or 10% of all the services related to GLM,” according to Li. For now, that’s enough revenue for the company. But if its economic incentives change, Z.AI might go back to releasing closed models.

7. Nemotron from Nvidia

Takeaway: Nvidia is an underrated open-weight developer. The Nemotron models are solid and look to be expanded soon.

Read more

Google and Anthropic approach LLMs differently

2025-12-05 02:49:41

On Monday, OpenAI CEO Sam Altman declared a “code red” in the face of rising competition.

The biggest threat was Google; monthly active users for Google’s Gemini chatbot grew from 450 million in July to 650 million in November (ChatGPT had 800 million weekly active users in October). Meanwhile, the Wall Street Journal reports, “OpenAI is also facing pressure from Anthropic, which is becoming popular among business customers.”

Google ratcheted up the pressure on OpenAI two weeks ago with the release of Gemini 3 models, which set new records on a number of benchmarks. The next week, Anthropic released Claude Opus 4.5, which achieved even higher scores on some of the same benchmarks.

Over the last two weeks, I’ve been trying to figure out the best way to cover these new releases. I used to subject each new model to a battery of bespoke benchmarks and write about the results. But recent models have gotten good enough to easily solve most of these problems. They do still fail on a few simple tasks (like telling time on an analog clock) but I fear those examples are increasingly unrepresentative of real-world usage.

In the future, I hope to write more about the performance of these new Google and Anthropic models. But for now, I want to offer a more qualitative analysis of these models. Or rather, I want to highlight two pieces that illustrate the very different cultures at Google and Anthropic — cultures that have led them to take dramatically different approaches to model building.

Engineering excellence at Google

Jeff Dean, a legendary engineer who has worked at Google since 1999, has led a number of AI projects inside the company. (Photo by THOMAS SAMSON/AFP via Getty Images)

Last week, the newsletter Semianalysis published a deep dive on the success of tensor processing units (TPUs), Google’s alternative to Nvidia GPUs. “Gemini 3 is one of the best models in the world and was trained entirely on TPUs,” the Semianalysis authors wrote. Notably, Claude Opus 4.5 was also trained on TPUs.

Google has employed TPUs for its own AI needs for a decade. But recently Google has made a serious effort to sell TPUs to other companies. The Semianalysis team argues that Google is “the newest and most threatening merchant silicon challenger to Nvidia.”

In October, Anthropic signed a deal to use up to one million TPUs. In addition to purchasing cloud services from Google, Semianalysis reported, “Anthropic will deploy TPUs in its own facilities, positioning Google to compete directly with Nvidia.”

Recent generations of the TPU were respectable chips, but Semianalysis argues Google’s real strength is the overall system architecture. Modern AI training runs require thousands of chips wired together for rapid communication. Google has designed racks and networking systems that squeeze maximum performance out of every chip.

This is one example of a broader principle: Google is fundamentally an engineering-oriented company, and it has approached large language models as an engineering problem. Engineers have worked hard to train the largest possible models at the lowest possible cost.

For example, Gemini 2.5 Flash-Lite costs 10 cents for a million input tokens. Anthropic’s cheapest model, Claude Haiku 4.5, costs 10 times as much. Google was also the first company to release an LLM with a million-token context window.
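To put that price gap in concrete terms, here is a minimal back-of-the-envelope sketch in Python. It uses only the two prices cited above (10 cents per million input tokens for Flash-Lite, and ten times that for Haiku 4.5); the 500-million-token monthly workload is a hypothetical number chosen purely for illustration.

```python
# Rough input-token cost comparison using only the prices cited above.
# The monthly workload figure is hypothetical.

price_per_million_input_tokens = {
    "Gemini 2.5 Flash-Lite": 0.10,  # $0.10 per million input tokens
    "Claude Haiku 4.5": 1.00,       # ten times as much, per the comparison above
}

monthly_input_tokens = 500_000_000  # hypothetical workload

for model, price in price_per_million_input_tokens.items():
    cost = (monthly_input_tokens / 1_000_000) * price
    print(f"{model}: ${cost:,.2f} per month in input-token costs")
```

At that hypothetical volume, the same workload costs about $50 a month on Flash-Lite and about $500 on Haiku, which is why per-token pricing matters so much for high-volume applications.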

Another place Google’s engineering prowess has paid off is in pretraining. Google released this chart showing Gemini 3 crushing other models at SimpleQA, a benchmark that measures a model’s ability to recall obscure facts.

As a perceptive Reddit commenter points out, this likely reflects Google’s ability to deploy computing hardware on a large scale.

“My read is that Gemini 3 Pro’s gains in SimpleQA show that it’s a massive model, absolutely huge, with tons of parametric knowledge,” wrote jakegh. “Google uses its own TPU hardware to not only infer but also train so they can afford to do it.”

So Gemini 3 continues the Google tradition of building solid, affordable models. Public reaction to the new model has been broadly positive; the model seems to perform as well in real-world applications as it does on benchmarks.

The new model doesn’t seem to have much personality, but this may not matter. Billions of people already use Google products, so Google may be able to win the AI race simply by adding a good-but-not-amazing model like Gemini 3 to products like search, Gmail, and the Google Workspace suite.

Anthropic: thinking deeply about models

Philosopher Amanda Askell described her work at Anthropic in a recent 60 Minutes interview.

Last week’s release of Claude Opus 4.5 also got a positive reception, but the vibes were different.

Read more

Help some of the poorest people in Rwanda

2025-12-02 20:06:13

It’s Giving Tuesday, and Matt Yglesias—my former colleague at Vox and now author of the excellent newsletter Slow Boring—has organized a consortium of Substack writers to raise money for GiveDirectly. This non-profit organization does exactly what it sounds like: give cash directly to poor people in low-income countries.

This year the group is aiming to raise at least $1 million to help people in rural Rwanda. I’m hoping that Understanding AI readers will contribute at least $20,000 to that total—please use this special link if you’d like to be counted as an Understanding AI reader.

If donations from readers total at least $20,000, my wife and I will donate an additional $10,000.

There are lots of charities out there that try to help poor people in various ways, such as delivering food, building infrastructure, or providing education and health care. Such efforts are praiseworthy, but it can sometimes be difficult to tell how much good they are doing — or whether it would be better to spend the money on something else.

The insight of GiveDirectly is that we can just give cash directly to people in need and let them decide how to spend it. Here’s how Matt describes it:

The organization works in low-income countries, including Kenya, Malawi, Mozambique, Rwanda, and Uganda, to identify villages where a large majority of the population is very poor by global standards. They then enroll the entire population of the village in the program, using mobile banking to transfer approximately $1,100 to each household in town.

This transfer boosts recipients’ short-term living standards, minimizes logistical complications and perverse incentives, and, optimistically, is a kind of shot in the arm to the local economy. After all, one problem with being desperately poor and also surrounded by other desperately poor people is that even when you have useful goods or services to sell, no one can afford to buy them.

Some recipients spend the money on immediate needs like food or medicine. Others use the money in ways that have longer-term benefits, such as buying equipment, starting a business, or sending children to school. In either case, the money will go a lot farther in Rwanda than it would here in a rich country like the United States.

Understanding AI has more than 100,000 readers, and 98 percent of you are on the “free” list. If you’ve found my newsletter useful, a donation to GiveDirectly would be a great way to say thanks.

Six reasons to think there’s an AI bubble — and six reasons not to

2025-11-25 21:03:31

I’m excited to publish this post co-authored with one of my favorite writers, Derek Thompson. Derek recently left the Atlantic to launch his own Substack covering business, technology, science, and politics. It’s one of the few newsletters I read as soon as it hits my inbox, and I bet a lot of Understanding AI readers would enjoy it.


In the last few weeks, something’s troubled and fascinated us about the national debate over whether artificial intelligence is a bubble. Everywhere we look and listen, experts are citing the same small number of statistics, factoids, and studies. The debate is like a board game with a tiny number of usable pieces. For example:

  • Talk to AI bears, and they’ll tell you how much Big Tech is spending.

  • Talk to AI bulls, and they’ll tell you how much Big Tech is making.

  • Talk to AGI believers, and they’ll quote a study on “task length” by an organization called METR.

  • Talk to AGI skeptics, and they’ll quote another study on productivity, also by METR.

Last week, we were discussing how one could capture the entire AI-bubble debate in about 12 statistics that people just keep citing and reciting — on CNBC, on tech podcasts, in Goldman Sachs Research documents, and at San Francisco AI parties. Since everybody seems to be reading and quoting from the same skinny playbook, we thought: What the hell, let’s just publish the whole playbook!

If you read this article, we think you’ll be prepared for just about every conversation about AI, whether you find yourself at a Bay Area gathering with accelerationists or a Thanksgiving debate with Luddite cousins. We think some of these arguments are compelling. We think others are less persuasive. So, throughout the article, we’ll explain both why each argument belongs in the discussion and why some arguments don’t prove as much as they claim. Read to the end, and you’ll see where each of us comes down on the debate.

Let’s start with the six strongest arguments that there is an AI bubble.

All about the Benjamins

When they say: Prove to me that AI is a bubble

You say: For starters, this level of spending is insane

When America builds big infrastructure projects, we often over-build. Nineteenth-century railroads? Overbuilt, bubble. Twentieth-century Internet? Overbuilt, bubble. It’s really nothing against AI specifically to suggest that every time US companies get this excited about a big new thing, they get too excited, and their exuberance creates a bubble.

Five of the largest technology giants — Amazon, Meta, Microsoft, Alphabet, and Oracle — had $106 billion in capital expenditures in the most recent quarter. That works out to almost 1.4% of gross domestic product, putting it on par with some of the largest infrastructure investments in American history.

This chart was originally created by Understanding AI’s Kai Williams, who noted, “not all tech capex is spent on data centers, and not all data centers are dedicated to AI. The spending shown in this chart includes all the equipment and infrastructure a company buys. For instance, Amazon also needs to pay for new warehouses to ship packages.”

Still, AI accounts for a very large share of this spending. Amazon’s CEO, for example, said last year that AI accounted for “the vast majority” of Amazon’s recent capex. And notice that the last big boom on the chart — the broadband investment boom of the late 1990s — ended with a crash. AI investments are now large enough that a sudden slowdown would have serious macroeconomic consequences.
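As a sanity check on that 1.4%-of-GDP figure, here is the arithmetic. The $106 billion quarterly capex number comes from above; the roughly $30 trillion annual US GDP is our approximation, used only for illustration.

```python
# Rough check of the capex-to-GDP comparison. The $106 billion quarterly capex
# figure comes from the text above; the ~$30 trillion annual US GDP is an
# approximate assumption used only for illustration.

quarterly_capex = 106e9   # combined hyperscaler capex, most recent quarter
annual_us_gdp = 30e12     # approximate assumption

annualized_capex = quarterly_capex * 4
share_of_gdp = annualized_capex / annual_us_gdp

print(f"Annualized capex: ${annualized_capex / 1e9:.0f} billion")
print(f"Share of GDP: {share_of_gdp:.1%}")  # roughly 1.4%
```

Run with those assumptions, the calculation lands at roughly 1.4%, consistent with the figure above.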

Money for nothing

When they say: But this isn’t like the dot-com bubble, because these companies are for real

You say: I’m not so sure about that…

“It feels like there’s obviously a bubble in the private markets,” said Demis Hassabis, the CEO of Google DeepMind. “You look at seed rounds with just nothing being [worth] tens of billions of dollars. That seems a little unsustainable. It’s not quite logical to me.”

The canonical example of zillions of dollars for zilch in product has been Thinking Machines, the AI startup led by former OpenAI executive Mira Murati. This summer, Thinking Machines raised $2 billion, the largest seed round in corporate history, before releasing a product. According to a September report in The Information, the firm declined to tell investors or the public what it was even working on.

“It was the most absurd pitch meeting,” one investor who met with Murati said. “She was like, ‘So we’re doing an AI company with the best AI people, but we can’t answer any questions.’”

In October, the company launched a programming interface called Tinker. I guess that’s something. Or, at least, it better be something quite spectacular, because just days later, the firm announced that Murati was in talks with investors to raise another $5 billion. This would raise the value of the company to $50 billion—more than the market caps of Target or Ford.

When enterprises that barely have products are raising money at valuations rivaling 100-year-old multinational firms, it makes us wonder if something weird is going on.

Reality check

When they say: Well, AI is making me more productive

You say: You might be deluding yourself

One of the hottest applications of AI right now is programming. Over the last 18 months, millions of programmers have started using agentic AI coding tools such as Cursor, Anthropic’s Claude Code, and OpenAI’s Codex, which are capable of performing routine programming tasks. Many programmers have found that these tools make them dramatically more productive at their jobs.

But a July study from the research organization METR called that into question. They asked 16 programmers to tackle 246 distinct tasks. Programmers estimated how long it would take to complete each task. Then they were randomly assigned to use AI, or not, on a task-by-task basis.

On average, the developers expected AI to let them complete their tasks 24% faster. Even after the fact, developers who used AI believed it had sped them up by 20%. But on tasks where they used AI, programmers actually took 19% longer, on average, than on tasks where they didn’t.
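To see what those percentages mean in practice, here is a small worked example in Python. The one-hour task is hypothetical, and we read “24% faster” as taking 24% less time and “19% longer” as taking 19% more time, which is a simplifying assumption about how the study’s figures translate into minutes.

```python
# A worked example of the METR numbers above, applied to a hypothetical task
# that takes 60 minutes without AI. "X% faster" is treated here as taking X%
# less time; "19% longer" as taking 19% more time.

baseline_minutes = 60  # hypothetical task length without AI

predicted = baseline_minutes * (1 - 0.24)   # what developers expected (~46 min)
perceived = baseline_minutes * (1 - 0.20)   # what they believed afterward (~48 min)
measured = baseline_minutes * (1 + 0.19)    # what the study actually found (~71 min)

print(f"Predicted time with AI: {predicted:.0f} minutes")
print(f"Perceived time with AI: {perceived:.0f} minutes")
print(f"Measured time with AI:  {measured:.0f} minutes")
```

In other words, developers thought a one-hour task would shrink to about 46 minutes, and remembered it as taking about 48, but on average it actually stretched to roughly 71 minutes.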

We were both surprised by this result when it first came out, and we consider it one of the strongest data points in favor of AI skepticism. While many people believe that AI has made them more productive at their jobs — including both of us — it’s possible that we’re all deluding ourselves. Maybe that will become more obvious over the next year or two and the hype around AI will dissipate.

But it’s also possible that programmers are just in the early stages of the learning process for AI coding tools. AI tools probably speed up programmers on some tasks and slow them down on others. Over time, programmers may get better at predicting which tasks fall into which category. Or perhaps the tools themselves will get better over time — AI coding tools have improved dramatically over the last year.

It’s also possible that the METR results simply aren’t representative of the software industry as a whole. For example, a November study examined 32 organizations that started to use Cursor’s coding agent in the fall of 2024. It found that programmer productivity increased by 26% to 39% as a result.

Infinite money glitch

When they say: But AI is clearly growing the overall economy

You say: Maybe the whole thing is a trillion-dollar ouroboros

Imagine Tim makes some lemonade. He loans Derek $10 to buy a drink. Derek buys Tim’s lemonade for $10. Can we really say that Tim has “earned $10” in this scenario? Maybe no: If Derek goes away, all Tim has done is move money from his left pocket to his right pocket. But maybe yes: If Derek loves the lemonade and keeps buying more every day, then Tim’s bet has paid off handsomely.

Artificial intelligence is more complicated than lemonade. But some analysts are worried that the circular financing scheme we described above is also happening in AI. In September, Nvidia announced it would invest “up to” $100 billion in OpenAI to support the construction of up to 10 gigawatts of data center capacity. In exchange, OpenAI agreed to use Nvidia’s chips for the buildout. The next day, OpenAI announced five new locations to be built by Oracle in a new partnership whose value reportedly exceeds $300 billion. The industry analyst Dylan Patel called this financial circuitry an “infinite money glitch.”

Bloomberg made this chart depicting the complex web of transactions among leading AI companies. This kind of thing sets off alarm bells for people who remember how financial shenanigans contributed to the 2008 financial crisis.

The fear is two-fold: first, that tech companies are shifting money around in a way that creates the appearance of new revenue that hasn’t actually materialized; and second, that if any part of this financial ouroboros breaks, everybody is going down.

In the last few months, OpenAI has announced four deals: with Nvidia, Oracle, and the chipmakers AMD and Broadcom. All four companies saw their market values jump by tens of billions of dollars the day their deals were announced. But, by that same logic, any wobble for OpenAI or Nvidia could reverberate throughout the AI ecosystem.

Something similar happened during the original dot-com bubble. The investor Paul Graham sold a company to Yahoo in 1998, so he had a front-row seat to the mania:

By 1998, Yahoo was the beneficiary of a de facto Ponzi scheme. Investors were excited about the Internet. One reason they were excited was Yahoo’s revenue growth. So they invested in new Internet startups. The startups then used the money to buy ads on Yahoo to get traffic. Which caused yet more revenue growth for Yahoo, and further convinced investors the Internet was worth investing in. When I realized this one day, sitting in my cubicle, I jumped up like Archimedes in his bathtub, except instead of “Eureka!” I was shouting “Sell!”

Are we seeing a similar dynamic with the data center boom? It doesn’t seem like a crazy theory.

Pay no attention to the man behind the curtain

When they say: The hyperscalers are smart companies and don’t need bubbles to grow

You say: So why are they resorting to financial trickery?

Some skeptics argue that big tech companies are concealing the actual cost of the AI buildout.

First, they’re shifting AI spending off their corporate balance sheets. Instead of paying for data centers themselves, they’re teaming up with private capital firms to create joint ventures known as special purpose vehicles (or SPVs). These entities build the facilities and buy the chips, while the spending sits somewhere other than the tech company’s books. This summer, Meta reportedly sought to raise about $29 billion from private credit firms for new AI data centers structured through such SPVs.

Meta isn’t alone. CoreWeave, the fast-growing AI cloud company, has also turned to private credit to fund its expansion through SPVs. These entities transfer risk off the balance sheets of Silicon Valley companies and onto the balance sheets of private-capital limited partners, including pension funds and insurance companies. If the AI bubble bursts, it won’t be just tech shareholders who feel the pain. It will be retirees and insurance policyholders.

To be fair, it’s not clear that anything shady is happening here. Tech companies have plenty of AI infrastructure on their own balance sheets, and they’ve been bragging about that spending in earnings calls, not downplaying it. So it’s not obvious that they are using SPVs in an effort to mislead people.

Second, skeptics argue that tech companies are underplaying the depreciation risk of the hardware that powers AI. Earlier waves of American investment left us with infrastructure that held its value for decades: power lines from the 1940s, freeways from the 1960s, fiber optic cables from the 1990s. By contrast, the best GPUs are overtaken by superior models every few years. The hyperscalers spread the cost of this hardware over five or six years through an accounting process called depreciation. But if they have to buy a new set of top-end chips every two years, they’ll eventually blow a hole in their profitability.
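To illustrate why the replacement cycle matters, here is a simple straight-line depreciation sketch. The $10 billion purchase is a hypothetical figure, not a number from any company’s filings.

```python
# Straight-line depreciation on a hypothetical $10 billion GPU purchase.
# If the chips stay useful for six years, the annual expense is modest;
# if they have to be replaced every two years, the same spending hits the
# income statement three times as fast.

purchase_cost = 10e9  # hypothetical GPU purchase

for useful_life_years in (6, 2):
    annual_expense = purchase_cost / useful_life_years
    print(f"{useful_life_years}-year useful life: "
          f"${annual_expense / 1e9:.2f} billion in depreciation per year")
```

The accounting doesn’t change how much cash goes out the door, but a shorter useful life means the cost shows up in reported profits much sooner.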

We don’t dismiss this fear. But the danger is easily exaggerated. Consider the A100 chip, which helped train GPT-4 in 2022. The first A100s were sold in 2020, which makes the oldest units about five years old. Yet they’re still widely used. “In a compute-constrained world, there is still ample demand for running A100s,” Bernstein analyst Stacy Rasgon recently wrote. Major cloud vendors continue to offer A100 capacity, and customers continue to buy it.

Of course, there’s no guarantee that today’s chips will be as durable. If AI demand cools, we could see a glut of hardware and early retirement of older chips. But based on what we know today, it’s reasonable to assume that a GPU purchased now will still be useful five years from now.

A changing debt picture

When they say: The hyperscalers are well-run companies that won’t use irresponsible leverage

You say: That might be changing

A common way for a bubble to end is with too much debt and too little revenue. Most of the Big Tech companies building AI infrastructure — including Google, Microsoft, and Meta — haven’t needed to take on much debt because they can fund the investments with profit. Oracle has been a notable exception to this trend, and some people consider it the canary in the coal mine.

Oracle recently borrowed $18 billion for data center construction, pushing the company’s total debt above $100 billion. The Wall Street Journal reports that “the company’s adjusted debt, a measure that includes what it owes on leases in addition to what it owes creditors, is forecast to more than double to roughly $300 billion by 2028, according to credit analysts at Morgan Stanley.”

At the same time, it’s not obvious that Oracle is going to make a lot of money from this aggressive expansion. There’s plenty of demand: in its most recent earnings call, Oracle said that it had $455 billion in contracted future revenue — a more than four-fold increase over the previous year. But The Information reports that in the most recent quarter, Oracle earned $125 million on $900 million worth of revenue from renting out data centers powered by Nvidia GPUs. That works out to a 14% profit margin, which would be modest in a normal business and is especially modest in an industry as volatile as this one. It’s also much smaller than the roughly 70% gross margin Oracle gets on more established services.

The worry for AI skeptics is that customer demand for GPUs could cool off as quickly as it heated up. In theory, that $455 billion figure represents firm customer commitments to purchase future computing services. But if there’s an industry-wide downturn, some customers might try to renegotiate the terms of these contracts. Others might simply go out of business. And that could leave Oracle with a lot of debt, a lot of idle GPUs, and not enough revenue to pay for it all.

And now, the very best arguments against an AI bubble

Read more