Understanding AI

By Timothy B. Lee, a tech reporter with a master’s in computer science who covers AI progress and policy.

The feds are probing Waymo's behavior around school children

2026-01-30 04:48:29

Last week, a Waymo driverless vehicle struck a child near Grant Elementary School in Santa Monica, California. In a statement today, Waymo said that the child “suddenly entered the roadway from behind a tall SUV.” Waymo says its vehicle immediately slammed on the brakes, but wasn’t able to stop in time. The child sustained minor injuries but was able to…

Read more

An unlikely ally for open-source protein-folding models: Big Pharma

2026-01-28 23:39:43

Protein-folding models are the success story in AI for science.

In the late 2010s, researchers from Google DeepMind used machine learning to predict the three-dimensional shape of proteins. AlphaFold 2, announced in 2020, was so good that its creators shared the 2024 Nobel Prize in chemistry with an outside academic.

Yet many academics have had mixed feelings about DeepMind’s advances. In 2018, Mohammed AlQuraishi, then a research fellow at Harvard, wrote a widely read blog post reporting on a “broad sense of existential angst” among protein-folding researchers.

The first version of AlphaFold had just won CASP13, a prominent protein-folding competition. AlQuraishi wrote that he and his fellow academics worried about “whether protein structure prediction as an academic field has a future, or whether like many parts of machine learning, the best research will from here on out get done in industrial labs, with mere breadcrumbs left for academic groups.”

Industrial labs are less likely to share their findings fully or investigate questions without immediate commercial applications. Without academic work, the next generation of insights might end up siloed in a handful of companies, which could slow down progress for the entire field.

These concerns were borne out in the 2024 release of AlphaFold 3, which initially kept the model weights confidential. Today, scientists can download the weights for certain non-commercial uses “at Google DeepMind’s sole discretion.” Pushmeet Kohli, DeepMind’s head of AI science, told Nature that DeepMind had to balance making the model “accessible” and impactful for scientists against Alphabet’s desire to “pursue commercial drug discovery” via an Alphabet subsidiary, Isomorphic Labs.


AlQuraishi went on to become a professor at Columbia, and he has fought to keep academic researchers in the game. In 2021, he co-founded a project called OpenFold, which sought to replicate AlphaFold’s innovations openly. This not only required difficult technical work, it also required innovations in organization and fundraising.

To get the millions of dollars’ worth of computing power they would need, AlQuraishi and his colleagues turned to an unlikely ally: the pharmaceutical industry. Drug companies are not generally known for their commitment to open science, but they really did not want to be dependent on Google.

Supporting OpenFold gives these drug companies input into the project’s research priorities. Pharmaceutical companies also get early access to OpenFold’s models for internal use. But crucially, OpenFold releases its models to the general public, along with full training data, source code, and other materials that have not been included in recent AlphaFold releases.

“I’d like to see the work have an impact,” AlQuraishi told me in a Monday interview. He wanted to contribute to new discoveries and the creation of new therapies. Today, he said, “most of that is happening in industry.” But projects like OpenFold could help carve out a larger role for academic researchers, accelerating the pace of scientific discovery in the process.

Protein folding: from sequence to structure

Proteins are large molecules essential to life. They perform many biological functions, from regulating blood sugar (like insulin) to acting as antibodies.

The shape of a protein is essential to its function. Take the example of myoglobin (pictured), which stores oxygen in muscle tissue. Myoglobin’s shape creates a little pocket that holds an iron-containing molecule (the grey shape circled). The pocket’s shape lets the iron bind with oxygen reversibly, so the protein can capture and release it in the muscle as necessary.

A 3D representation of the protein myoglobin. The circled area shows a heme group (gray) whose central iron atom binds to an oxygen molecule (in red).

It’s expensive to determine a protein’s shape experimentally, however. The conventional approach involves crystallizing the protein and then analyzing how X-rays scatter off the crystal structure. This process, called X-ray crystallography, can take months or even years for difficult proteins. Newer methods can be faster, but they’re still expensive.

So scientists often try to predict a protein’s structure computationally. Every protein is a chain of amino acids — just 20 types — that fold into a 3D shape. Determining a protein’s amino acid chain is “very easy” compared to figuring out the structure directly, said Claus Wilke, a professor of biology at The University of Texas at Austin.

But the process of predicting a 3D structure from the amino acids — figuring out how the protein folds — isn’t straightforward. There are so many possibilities that a brute-force search would take longer than the age of the universe.
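To get a feel for why brute force fails, here is a rough, Levinthal-style version of that estimate. The specific numbers (three plausible conformations per amino acid, a 100-residue protein, a trillion conformations checked per second) are illustrative assumptions, not figures from the article:

```python
# Rough Levinthal-style estimate of a brute-force conformation search.
# All inputs below are illustrative assumptions.
conformations = 3 ** 100          # ~5e47 shapes for a 100-residue protein,
                                  # assuming 3 plausible conformations per residue
checks_per_second = 1e12          # a very generous search speed
seconds_per_year = 3.15e7
age_of_universe_years = 1.4e10

years_needed = conformations / (checks_per_second * seconds_per_year)
print(f"~{years_needed:.1e} years of search")                        # ~1.6e28 years
print(f"~{years_needed / age_of_universe_years:.1e}x the age of the universe")
```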

Scientists have long used tricks to make the problem easier. For instance, they can compare a sequence with the 200,000 or so structures in the Protein Data Bank (PDB). Similar sequences are likely to have similar shapes. But finding an accurate, convenient prediction method remained an open question for over 50 years.

This changed with AlphaFold 2, which made it dramatically easier to predict protein structures. It didn’t “solve” protein folding per se — the predictions aren’t always accurate, for one — but it was a substantial advance. A 2022 Nature article reported that 80% of 214 million protein structure predictions were accurate enough to be useful for at least some applications, according to the European Bioinformatics Institute (EMBL-EBI).

AlphaFold 2 combined excellent engineering with several clever scientific ideas. One important technique DeepMind used is called coevolution. The basic idea is to compare the target protein with proteins that have closely related sequences. A key step is to compute a multiple sequence alignment (MSA) — a grid of protein sequences organized so that equivalent amino acids are in the same column. Including an MSA in AlphaFold’s input helped it to infer details about the protein’s structure.

An example of a multiple sequence alignment. The top row is the amino acid sequence of the target protein; each row below is a related protein. Dashes indicate gaps. (From OpenProteinSet: Training data for structural biology at scale by Ahdritz et al. CC BY 4.0)
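To make the coevolution idea concrete, here is a toy sketch of how correlated columns in an MSA hint at structural contacts. The four-sequence alignment and the simple co-variation score are invented for illustration; real pipelines use much larger alignments and far more sophisticated statistics:

```python
from itertools import combinations

# A tiny, made-up MSA: each row is a related protein, each column an aligned
# position. Dashes mark gaps, as in the figure above.
msa = [
    "MKV-LSA",   # target protein
    "MKI-LSA",
    "MRV-LTA",
    "MRI-LTA",
]

def covariation(msa, i, j):
    """Fraction of sequence pairs in which columns i and j differ together.

    Positions that consistently mutate in tandem are a hint that they sit
    close together in the folded structure.
    """
    together = total = 0
    for a, b in combinations(range(len(msa)), 2):
        total += 1
        if msa[a][i] != msa[b][i] and msa[a][j] != msa[b][j]:
            together += 1
    return together / total

n_cols = len(msa[0])
scores = {(i, j): covariation(msa, i, j) for i, j in combinations(range(n_cols), 2)}
best = max(scores, key=scores.get)
print("Most co-varying column pair:", best, "score:", round(scores[best], 2))
# Columns 1 and 5 (K/R and S/T) always change together in this toy alignment.
```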


The original OpenFold

DeepMind released AlphaFold 2’s model weights and a high-level description of the architecture but did not include the training code or all the training data used. OpenFold, founded in 2021, sought to make this kind of information freely available.

AlQuraishi’s background prepared him well to co-found the project. He grew up in Baghdad as a computer kid — starting with a Commodore 64 at the age of five. When he was 12, his family moved to the Bay Area. He founded an Internet start-up in his junior year of high school and went to Santa Clara University for computer engineering.

In college, AlQuraishi’s interests shifted from tech entrepreneurship to science. After a year and a half of working to add computational biology capabilities to the software Wolfram Mathematica, he went to Stanford to get his doctorate in biology. After his PhD, he went on to study the application of machine learning to the protein-folding problem.

After the first AlphaFold won the CASP13 competition in 2018, AlQuraishi wrote that DeepMind’s success “presents a serious indictment of academic science.” Despite academics outnumbering DeepMind’s team by an order of magnitude, they had been scooped by a tech company new to the field.

AlQuraishi believed that tackling big problems like protein folding would require an organizational rethink. Academic labs traditionally consist of a senior scientist supervising a handful of graduate students. AlQuraishi worried that small organizations like this wouldn’t have the manpower or financial resources to tackle a big problem like protein folding.

Mohammed AlQuraishi (Photo courtesy of Mohammed AlQuraishi)

“I haven’t been too shy about trying new ways of organizing academic research,” AlQuraishi told me on Monday.

AlQuraishi thought that academic labs needed more frequent communication and better software engineering. They would also need substantial access to compute: when Geoff Hinton joined Google in 2013, AlQuraishi predicted that “without access to significant computing power, academic machine learning research will find it increasingly difficult to stay relevant.”

So in 2021, AlQuraishi teamed up with Nazim Bouatta and Gustaf Ahdritz to co-found the OpenFold project. The project didn’t just have an ambitious technical mission, it would also come to have an innovative structure.

OpenFold’s first objective was to reverse-engineer parts of AlphaFold 2 that DeepMind had not made public — including code and data used for training the model. While DeepMind had only drawn from public datasets in its training process, it did not release the multiple sequence alignment (MSA) data it had computed for use in training. MSAs are expensive to compute, so many other research groups settled for fine-tuning AlphaFold 2 rather than retraining it from scratch. OpenFold released both a public dataset of MSAs — using four million hours of donated compute — and training code.

The second goal was refactoring AlphaFold 2’s code to be more performant, modular, and easy to use. AlphaFold 2 was written in JAX — Google’s machine learning framework — rather than the more popular PyTorch. OpenFold wrote its code in PyTorch, which boosted performance and made it easier to adopt into other projects. Meta used parts of OpenFold’s architecture in its ESM-Fold project, for instance.

A third goal — true to AlQuraishi’s computer science background — was to study the models themselves. In their preprint, the OpenFold team analyzed the training dynamics of AlphaFold’s architecture. They found, for instance, that the model reached 90% of its final accuracy in the first 3% of training time.

Finally, AlQuraishi and his collaborators wanted to make sure there was a protein-folding model that pharmaceutical companies could use. They saw this as necessary because AlphaFold 2 was initially released under a non-commercial license. But this goal became irrelevant after AlphaFold 2’s license was changed to be more open.

The OpenFold team had made substantial progress on all of these goals by June 22, 2022, when it announced the release of OpenFold and the first 400,000 proteins in its MSA dataset. There was more refinement to be done — the preprint wouldn’t come out for another five months; the model code would continue to be iterated on — but OpenFold also had other scientific goals. AlphaFold 2 initially only predicted the structure of a single amino acid chain; could OpenFold replicate later efforts to predict more complex structures?

So the same day, OpenFold also announced that pharmaceutical companies — which are interested in the same protein-folding questions — would help fund OpenFold’s further research in exchange for input into its research direction.

The race to replicate AlphaFold 3

The peer-review process is so slow that the official OpenFold paper was published by Nature Methods in May 2024 — a year and a half after the initial release. A week before the paper came out, Google DeepMind inadvertently demonstrated the value of open research.

DeepMind announced AlphaFold 3, which was able to predict how interactions with other types of molecules would impact the 3D shapes of proteins. But there was a caveat: the model would not be released openly. DeepMind had partnered with Isomorphic Labs — the Alphabet-owned AI drug discovery start-up that DeepMind CEO Demis Hassabis founded in 2021 — to develop AlphaFold 3. Isomorphic would get full access and the right to commercial use; everyone else would have to use the model through a web interface.

Scientists were furious. Over 1,000 signed an open letter attacking the journal Nature for letting DeepMind publish a paper on AlphaFold 3 without providing more details about the model. The letter remarked that “the amount of disclosure in the AlphaFold 3 publication is appropriate for an announcement on a company website (which, indeed, the authors used to preview these developments), but it fails to meet the scientific community’s standards of being usable, scalable, and transparent.”

DeepMind responded by increasing the daily quota to 20 generations and promising that it would release the model weights within six months “for academic use.” When it did release the weights, it added significant restrictions. Access is strictly non-commercial and at “Google DeepMind’s sole discretion.” Moreover, scientists would not be able to fine-tune or distill the model.

This prompted an immediate demand for open replications of AlphaFold 3. Within months, companies like ByteDance and Chai Discovery had released models following the training details in the AlphaFold 3 paper. An MIT lab released the Boltz-1 model under an open license in November 2024.

In June 2024, AlQuraishi told the publication GEN Biotechnology that his research group was already working on replicating AlphaFold 3. But replicating AlphaFold 3 posed new challenges compared to AlphaFold 2.

Reverse engineering AlphaFold 3 requires succeeding on a larger variety of tasks than AlphaFold 2. “These different modalities are often in contention,” AlQuraishi told me. Even if a model matched AlphaFold 3’s performance in one domain, it might falter in another. “Optimizing the right trade-offs between all these modalities is quite challenging.”

This makes the resulting model more “finicky” to train, AlQuraishi said. AlphaFold 2 was such a “marvel of engineering” that OpenFold was largely able to replicate it with its first training run. Training OpenFold 3¹ has required a bit more “nursing,” AlQuraishi told me.

There’s 100 times more data to generate too. Google DeepMind used tens of millions of the highest-confidence predictions from AlphaFold 2 to augment the training set for AlphaFold 3, as well as many more MSAs than it used for AlphaFold 2. OpenFold has had to replicate both. One PhD student currently working on OpenFold 3, Lukas Jarosch, told me that the synthetic database in progress for OpenFold 3 might be the biggest ever computed by an academic lab.

All of this ends up requiring a lot of compute. Mallory Tollefson, OpenFold’s business development manager, told me in December that the project has probably used “approximately $17 million of compute” donated from a wide variety of sources. A lot of that is for dataset creation: AlQuraishi estimated that it has cost around $15 million to make.


OpenFold has an unusual structure

Coordinating all of this computation takes a lot of work. “There’s definitely a lot of strings that Mohammed [AlQuraishi] needs to pull to keep such a big project running in practice,” Jarosch said.

This is where OpenFold’s structure — and its membership in the Open Molecular Software Foundation — becomes essential. I think it also shows a clever alignment of incentives.

Other groups have been quicker to release partial replications of AlphaFold 3: for instance, the company Chai Discovery released Chai-1 in September 2024, while OpenFold 3-preview was only released in October 2025. And scientists needing an open version currently use other models: several people I spoke to praised Boltz-2, released in June 2025. But those replications are either made or managed by companies: Boltz recently incorporated as a public benefit corporation.

Companies can move quickly and marshal resources, but also have incentives to close down access to their models, so that they can license the product to pharmaceutical companies.²

While individual academics have less access to resources, they still have incentives not to share commercially lucrative results. For some areas like measuring how proteins bind with potential drugs, “people have never really made the code available because they’ve always had this idea that they can make money with it,” according to Wilke, the UT Austin professor. He said it’s held back that area “for decades.”

Yet OpenFold, in Jarosch’s estimation, “is very committed to long-term open source and not branching out into anything commercial.” How have they set this up? Partly by relying on pharmaceutical companies for funding.

At first glance, pharmaceutical companies might seem like an odd catalyst for open source. They are famously protective of intellectual property such as the hundreds of thousands of additional protein structures their scientists have experimentally determined. But pharmaceutical companies need AI tools they can’t easily build themselves.

$17 million is a lot of money to spend on compute. But when split 37 ways, it’s cheaper than licensing a model from a commercial supplier like Alphabet’s Isomorphic. Add in early access to models and the ability to vote on research priorities and OpenFold becomes an attractive project to fund.
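As a rough sense of scale, here is the back-of-the-envelope arithmetic, assuming the $17 million were shared evenly among 37 contributors (the even split is my assumption):

```python
total_compute = 17_000_000   # dollars of donated compute (figure from the article)
members = 37

print(f"~${total_compute / members:,.0f} per member")   # ~$459,459 each
```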

If the pharmaceutical companies could get away with it, they’d probably want exclusive access to OpenFold’s model. (An OpenFold member, Apheris, is working on building a federated fine-tune of OpenFold 3 exclusive to the pharmaceutical companies who provide the proprietary data for training). But having a completely open model is a good compromise with the academics actually building the model.

From an academic perspective, this partnership is attractive too. Resources from pharmaceutical companies make it easier to run large projects like OpenFold. The computational resources they donate are more convenient for large training runs because jobs aren’t limited to a day or a week as with national labs, according to Jennifer Wei, a full-time software engineer at OpenFold. And the monetary contributions, combined with the open-source mission, help attract engineering talent like Wei — an ex-Googler — to produce high-quality code.

Pharmaceutical input makes the work more likely to be practically relevant, too. Lukas Jarosch, the PhD student, said he appreciated the input from industry. “I’m interested in making co-folding models have a big impact on actual drug discovery,” he told me.

The companies also give helpful feedback. “It’s hard to create benchmarks that really mimic real-world settings,” Jarosch said. Pharmaceutical companies have proprietary datasets which let them measure model performance in practice, but they rarely share these results publicly. OpenFold’s connections with pharmaceutical companies give a natural channel for high-quality feedback.

When I asked AlQuraishi why he had stayed in academia rather than getting funding for a start-up, he told me two things. First, he wanted to “actually be able to go after basic questions,” even if they didn’t make money right away. He’s interested in eventually being able to simulate an entire working cell completely on a computer. How would he be able to get venture funding for that if it might take decades to pan out?

But second, the experience of watching LLMs become increasingly restricted underlined the importance of open source. “It’s not something that I thought I cared about all that much,” he told me. “I’ve become a bit more of a true open source advocate.”

1

There was no OpenFold 2. OpenFold named its second model OpenFold 3 to align with the version of AlphaFold it sought to replicate. It turns out that confusing model naming is not unique to LLMs.

2

Boltz claims it will keep its models open source and focus on end-to-end services around its model, like fine-tuning on a company’s custom data. This may remain the case, but Boltz’s incentives ultimately point towards getting as much money from companies as possible.

How shifting risk to users makes Claude Code more powerful

2026-01-21 03:36:47

Anthropic’s Claude Code has been gaining popularity among programmers since its launch last February. When I first wrote about the tool back in May, it was little known among non-programmers.

That started to change over the holidays. Word began to spread that — despite its name — Claude Code wasn’t just for code. It’s a general-purpose agent that can help users with a wide range of tasks.

Claude Code is “marketed as a tool for computer programmers, so I wasn’t using it because I’m not a computer programmer,” wrote the liberal Substack author Matt Yglesias on December 26. “But some friends urged me to fire up the command line and use it.”

“In a sense, everything you can do on a computer is a question of writing code,” Yglesias added. “So I downloaded the entire General Social Survey file, and put it in a directory with a Claude Code project. Then if I ask Claude a question about the GSS data, Claude writes up the R scripts it needs to interrogate the data set and answer the question.”

Last week, Anthropic itself capitalized on this trend with the release of Anthropic Cowork, a variant of Claude Code designed for use by non-programmers.

Claude Code is a text-based tool that runs in a command-line environment (for example, the Terminal app on a Mac). The command line is a familiar environment for programmers, but many normal users find it confusing and even intimidating.

Cowork is a Mac app that superficially looks like a normal chatbot. Indeed, it looks so much like a normal chatbot that you might be wondering why it’s a separate product at all. If Anthropic wanted to bring Claude Code’s powerful capabilities to a general audience, why not just add those features to the regular Claude chatbot?

What ultimately differentiates Claude Code from conventional web-based chatbots isn’t any specific feature or capability. It’s a different philosophy about risk and responsibility.

Read more

AI is just starting to change the legal profession

2026-01-16 04:06:17

I’m pleased to publish this guest post by Justin Curl, a third-year student at Harvard Law School. Previously, Justin researched LLM jailbreaks at Microsoft, was a Schwarzman Scholar at Tsinghua University, and earned a degree in Computer Science from Princeton.


How much are lawyers using AI? Official reports vary widely: a Thomson Reuters report found that only 28% of law firms are actively using AI, while Clio’s Legal Trends 2025 reported that 79% of legal professionals use AI in their firms.

To learn more, I spoke with 10 lawyers, ranging from junior associates to senior partners at seven of the top 20 Vault law firms. Many told me that firms were adopting AI cautiously and that the industry was still in the early days of AI adoption.

The lawyers I interviewed weren’t AI skeptics. They’d tested AI tools, could identify tasks where the technology worked, and often had sharp observations about why their co-workers were slow to adopt. But when I asked about their own habits, a more complicated picture emerged. Even lawyers who understood AI’s value seemed to be leaving gains on the table, sometimes for reasons they’d readily critique in colleagues.

One junior associate described the situation well: “The head of my firm said we want to be a fast follower on AI because we can’t afford to be reckless. But I think equating AI adoption with recklessness is a huge mistake. Elite firms cannot afford to view themselves as followers in anything core to their business.”


How AI can accelerate lawyers’ work

Let’s start with a whirlwind tour of the work of a typical lawyer — and how AI tools could make them more productive at each step.

Lawyers spend a lot of time communicating with clients and other third parties. They can use general-purpose AI tools like Claude, ChatGPT, or Microsoft Copilot to revise an email, take meeting notes, or summarize a document. One corporate lawyer said their favorite application was using an internal AI tool to schedule due diligence calls, which was usually such a pain because it required coordinating with twenty people.

AI can also help with more distinctly legal tasks. Transactional lawyers and litigators work on different subject matter (writing contracts and winning lawsuits, respectively), but there is a fair amount of overlap in the kind of work they do.

Both types of lawyers typically need to do research before they begin writing. For transactional lawyers, this might be finding previous contracts to use as a template. For litigators, it could mean finding legal rulings that can be cited as precedent in a legal brief.

Thomson Reuters and LexisNexis, the two incumbent firms that together dominate the market for searchable databases of legal information, offer AI tools for finding public legal documents like judicial opinions or SEC filings. Legaltech startups like Harvey and DeepJudge also offer AI-powered search tools that let lawyers sift through large amounts of public and private documents to find the most relevant ones quickly.

Once lawyers have the right documents, they need to analyze and understand them. This is a great use case for general-purpose LLMs, though Harvey offers customized workflows for analyzing documents like court filings, deposition transcripts, and contracts. I also heard positive things about Kira (acquired by Litera in 2021), an AI product that’s designed specifically for reviewing contracts.

Once a lawyer is ready to begin writing, general-purpose AI models can help write an initial draft, revise tone and structure, or proofread. Harvey offers drafting help through a dialog-based tool that walks lawyers through the process of revising a document.

Finally, some legal work will require performing similar operations for many files — like updating party names or dates. Office & Dragons (also acquired by Litera) offers a bulk processing tool that can update document names, change document contents, and run redlines (comparing different document versions) for hundreds of files at once.

You’ll notice many legal tasks involve research and writing, which are areas where AI has recently shown great progress. Yet if AI has so much potential for improving lawyers’ productivity in theory, why haven’t we seen it used more widely in practice? The next sections outline the common reasons (some more convincing than others) that lawyers gave for why they don’t use AI more.

AI doesn’t save much time when the stakes are high

Losing a major lawsuit or drafting a contract in a way that advantages the other party can cost clients millions or even billions of dollars. So lawyers often need to carefully verify an AI’s output before using it. But that verification process can erode the productivity gains AI offered in the first place.

A senior associate told me about a junior colleague who did some analysis using Microsoft Copilot. “Since it was vital to the case, I asked him to double-check the outputs,” he said. “But that ended up taking more time than he saved from using AI.”

Another lawyer explicitly varied his approach based on a task’s importance. For a “change-of-control” provision, which is “super super important” because it allows one party to alter or terminate a contract if the ownership of the other party changes, “you want to make sure you’re checking everything carefully.”

But not all tasks have such high stakes: “if you’re just sending an email, it’s not the end of the world if there are small mistakes.”

Indeed, the first four lawyers I talked to all brought up the same example of when AI is helpful: writing and revising emails. One senior associate said: “I love using Copilot to revise my emails. Since I already know what I want to say, it’s much easier for me to tweak the output until I’m satisfied.”

A junior associate added that this functionality is “especially helpful when I’m annoyed with the client and need to make the tone more polite.” Because it was easy to review AI-generated emails for tone, style, and accuracy, she could use AI without fear of unintentional errors.

These dynamics also help explain differences in adoption across practice areas. One partner observed: “I’ve noticed adoption is stronger in our corporate than litigation groups.”

His hypothesis was that “corporate legal work is more of a good-enough practice than a perfection practice because no one is trying to ruin your life.” In litigation, every time you send your work to the other side, they think about how they can make your life harder. Because errors in litigation are at greater risk of being exploited for the other side’s gain, litigators verify more carefully, making it harder for AI to deliver net productivity gains.


AI adds more value when verifying outputs is easier

The verification constraint points toward a pattern one associate described well: “AI is great for the first and last pass at things.”

For the first pass, lawyers are familiarizing themselves with an area of law or generating a very rough draft. These outputs won’t be shown directly to a client or judge, and there are subsequent rounds of edits to catch errors. Because the costs of mistakes at this stage are low, there’s less need for exhaustive verification and lawyers retain the productivity gains.

For the last pass, quality control is easier because lawyers already know the case law well and the document is in pretty good shape. The AI is mostly suggesting stylistic changes and catching typos, so lawyers can easily identify and veto bad suggestions.

But AI is less useful in the middle of the drafting process, when lawyers are making crucial decisions about what arguments to make and how to make them. AI models aren’t yet good enough to do this reliably, and human lawyers can’t do effective quality control over outputs if they haven’t mastered the underlying subject matter.

So a key skill when using AI for legal work is to develop strategies and workflows that make it easier to verify the accuracy and quality of AI outputs.

One patent litigator told me that “every time you use AI, you need to do quality control. You should ask it to show its work and use quotes, so you can make sure its summaries match the content of the patent.” A corporate associate reached the same conclusion, using direct quotes to quickly “Ctrl-F” for specific propositions he wanted to check.
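Here is a minimal sketch of that quote-checking workflow: ask the model to back each claim with a direct quote, then verify that every quote actually appears in the source. The contract snippet and claims below are invented for illustration; a real workflow would load the filing or contract from disk:

```python
# Check whether each AI-supplied quote appears verbatim in the source document.

def verify_quotes(source_text: str, claims: dict[str, str]) -> dict[str, bool]:
    """Return, for each claim, whether its supporting quote appears verbatim."""
    normalized = " ".join(source_text.split())  # collapse whitespace differences
    return {claim: " ".join(quote.split()) in normalized
            for claim, quote in claims.items()}

source = """The Agreement may be terminated by either party upon a change of
control of the other party, provided written notice is given within 30 days."""

ai_output = {
    "Either party can terminate on a change of control":
        "terminated by either party upon a change of control",
    "Notice must be given within 60 days":          # wrong: the contract says 30
        "written notice is given within 60 days",
}

for claim, ok in verify_quotes(source, ai_output).items():
    print(("FOUND  " if ok else "MISSING") + " | " + claim)
```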

Companies building AI tools for lawyers should look for ways to reduce the costs of verification. Google’s Gemini, for example, has a feature that adds a reference link for claims from uploaded documents. This opens the source document with the relevant text highlighted on the side, making it easier for users to quickly check whether a claim matches the underlying material.

Features like these don’t make AI tools any more capable. But by making verification faster, they let users capture more of the productivity gains.

AI might not help experienced lawyers as much

Two lawyers from different firms disagreed about the value of DeepJudge’s AI-powered natural-language search.

One associate found it helpful because she often didn’t know which keywords would appear in the documents she was looking for.

A partner, however, preferred the existing Boolean search tool because it gave her more control over the output list. Since she had greater familiarity with documents in her practice area, the efficiency gain of a natural-language search was smaller.

Another partner told me he worried that if junior lawyers don’t do the work manually, they won’t learn to distinguish good lawyering from bad. “If you haven’t made the closing checklist or mapped out the triggering conditions for a merger, will you know enough to catch mistakes when they arise?”

Even senior attorneys can face this tradeoff.

A senior litigation associate praised AI’s ability to “get me up to speed quickly on a topic. It’s great for summarizing a court docket and deposition transcripts.” But he also cautioned that “it’s sometimes harder to remember all the details of a case when I use AI than when I read everything myself.”

He found himself hesitating because he was unsure of the scope of his knowledge. He didn’t know what he didn’t know, which made it harder to check whether AI-generated summaries were correct. His solution was to revert to reading things in full, only using AI to refresh his memory or supplement his understanding.

Many lawyers are unaware of AI use cases and capabilities

A prerequisite for adopting AI is knowing what it can be used for. One associate mentioned he was “so busy” he didn’t “have time to come up with potential use cases.” He said, “I don’t use AI more because I’m not sure what to use it for.”

A different associate praised Harvey for overcoming this exact problem.

“Harvey is nice because it lists use cases and custom workflows, so you don’t need to think too much about how to use it,” the associate told me. As she spoke, she opened Harvey and gave examples: “translate documents, transcribe audio to text, proofread documents, analyze court transcripts, extract data from court filings.” She appreciated that Harvey showed her exactly how it could make her more productive.

But there’s a tradeoff: the performance of lawyer-specific AI products often lags state-of-the-art models.

“Claude is a better model, so I still prefer it when all the information is public,” one lawyer told me.

Meanwhile, many lawyers take a dim view of AI capabilities. An associate decided not to try her firm’s internal LLM because she had “heard such bad things.”

Earlier I mentioned that incumbents Thomson Reuters and LexisNexis have added AI tools to their platforms in recent years. When I asked two lawyers about this, they said they hadn’t tried them because their colleagues’ impressions weren’t positive. One even described them as “garbage.”

But it’s a mistake to write AI tools off due to early bad experiences. AI capabilities are improving rapidly. Researchers at METR found that the length of tasks AI agents can reliably complete has been doubling roughly every seven months since 2019. A tool that disappointed a colleague last year might be substantially more capable today.

Individual lawyers should periodically revisit tools they’ve written off to see if they have grown more capable. And firms should institutionalize that process, reevaluating AI tools after major updates to see if they better meet the firm’s needs.


Pricing models can discourage (or encourage) AI use

The right level of AI use varies by client.

Billing by the hour creates tension between lawyer and client interests. More hours means more revenue for the firm, even if the client would prefer a faster result. AI that makes lawyers more efficient could reduce billable hours, which is good for clients but potentially bad for firm revenue.

Other pricing models align incentives differently. For fixed-fee work, clients don’t see cost savings when lawyers work faster. Lawyers, of course, benefit from efficiency since they keep the same fee while doing less work. A contingency pricing model is somewhere in the middle. Lawyers are paid when their clients achieve their desired legal outcome, so clients likely want lawyers to use their best judgment about how to balance productivity and quality.

One senior associate told me he used AI differently depending on client goals: “Some clients tell me to work cheap and focus on the 80/20 stuff. They don’t care if it’s perfect, so I use more AI and verify the important stuff.”

But another client wanted a “scorched earth” approach. In this case, the associate did all the work manually and only used AI to explore creative legal theories, which ensured he left no stone unturned.

Some clients have explicit instructions on AI use, though two associates said these clients are in the minority. “Most don’t have a preference and want us to use our best judgment.”

Clients who want the benefits of AI-driven productivity should communicate their preferences clearly and push firms for pricing arrangements that reward efficiency. For their part, lawyers should ask clients what they want rather than making assumptions.

17 predictions for AI in 2026

2026-01-01 01:41:20

2025 has been a huge year for AI: a flurry of new models, broad adoption of coding agents, and exploding corporate investment were all major themes. It’s also been a big year for self-driving cars. Waymo tripled weekly rides, began driverless operations in several new cities, and started offering freeway service. Tesla launched robotaxi services in Austin and San Francisco.

What will 2026 bring? We asked eight friends of Understanding AI to contribute predictions, and threw another nine in ourselves. We give a confidence score for each prediction; a prediction with 90% confidence should be right nine times out of ten.

We don’t believe AI is a bubble on the verge of popping, but neither do we think we’re close to a “fast takeoff” driven by the invention of artificial general intelligence. Rather, we expect models to continue improving their capabilities — but we think it will take a while for the full impact to be felt across the economy.
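Those confidence scores lend themselves to a year-end calibration check: group the predictions by stated confidence and see whether, say, the 90% bucket came true roughly nine times out of ten. A quick sketch, with placeholder outcomes standing in for the real ones:

```python
from collections import defaultdict

# (stated confidence, came_true) pairs -- the outcomes here are placeholders,
# to be filled in at the end of 2026.
predictions = [
    (0.75, True), (0.80, True), (0.80, False), (0.90, True),
    (0.55, False), (0.70, True), (0.90, True), (0.90, True),
]

buckets = defaultdict(list)
for confidence, came_true in predictions:
    buckets[confidence].append(came_true)

for confidence in sorted(buckets):
    outcomes = buckets[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"{confidence:.0%} predictions: {hit_rate:.0%} came true ({len(outcomes)} total)")
```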

1. Big Tech capital expenditures will exceed $500 billion (75%)

Timothy B. Lee

Wax sculptures of Mark Zuckerberg, Jeff Bezos, and other tech industry leaders were mounted to robot dogs at a recent exhibit by artist Mike Winkelmann in Miami. (Photo by CHANDAN KHANNA / AFP via Getty Images)

In 2024, the five main hyperscalers — Google, Microsoft, Amazon, Meta, and Oracle — had $241 billion in capital expenditures. This year, those same companies are on track to spend more than $400 billion.

This rapidly escalating spending is a big reason many people believe that there’s a bubble in the AI industry. As we’ve reported, tech companies are now investing more, as a percentage of the economy, than the peak year of spending on the Apollo Project or the Interstate Highway System. Many people believe that this level of spending is simply unsustainable.

But I don’t buy it. Industry leaders like Mark Zuckerberg and Satya Nadella have said they aren’t building these data centers to prepare for speculative future demand — they’re just racing to keep up with orders their customers are placing right now. Corporate America is excited about AI and spending unprecedented sums on new AI services.

I don’t expect Big Tech’s capital spending to grow as much in 2026 as it did in 2025, but I do expect it to grow, ultimately exceeding $500 billion for the year.
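For context, here is the implied growth arithmetic behind this prediction, using the spending figures above (the 2025 and 2026 values are the round numbers from the text, not exact totals):

```python
capex = {"2024": 241e9, "2025": 400e9, "2026": 500e9}   # dollars; 2025-26 approximate

years = sorted(capex)
for prev, nxt in zip(years, years[1:]):
    growth = capex[nxt] / capex[prev] - 1
    print(f"{prev} -> {nxt}: {growth:+.0%}")
# 2024 -> 2025: +66%
# 2025 -> 2026: +25%  (the prediction requires much slower growth than 2025 saw)
```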


2. OpenAI and Anthropic will both hit their 2026 revenue goals (80%)

Timothy B. Lee

Anthropic and OpenAI have both enjoyed impressive revenue growth in 2025.

  • OpenAI expects to generate more than $13 billion for the calendar year, and to end the year with annual recurring revenue around $20 billion. A leaked internal document indicated OpenAI is aiming for $30 billion in revenue in 2026 — slightly more than double the 2025 figure.

  • Anthropic expects to generate around $4.7 billion in revenue in 2025. In October, the company said its annual recurring revenue had risen to “almost $7 billion.” The company is aiming for 2026 revenue of $15 billion.

I predict that both companies will hit these targets — and perhaps exceed them. The capabilities of AI models have improved a lot over the last year, and I expect there is a ton of room for businesses to automate parts of their operations even without new model capabilities.

3. The context windows of frontier models will stay around one million tokens (80%)

Kai Williams

LLMs have a “context window,” the maximum number of tokens they can process. A larger context window lets an LLM tackle more complex tasks, but it is more expensive to run.

When ChatGPT came out in November 2022, it could only process 8,192 tokens at once. Over the following year and a half, context windows from the major providers increased dramatically. OpenAI started offering a 128,000 token window with GPT-4 Turbo in November 2023. The same month, Anthropic released Claude 2.1, which offered 200,000 token windows. And Google started offering one million tokens of context with Gemini 1.5 Pro in February 2024 — which it later expanded to two million tokens.

Since then, progress has slowed. Anthropic has not changed its default context size since Claude 2.1.¹ GPT-5.2 has a 400,000 token context window, but that’s smaller than the window of GPT-4.1, released last April. And Google’s largest context window has shrunk to one million.

I expect context windows to stay fairly constant in 2026. As Tim explained in November, larger context window sizes brush up against limitations in the transformer architecture. For most tasks with current capabilities, smaller context windows are cheaper and just as effective. In 2026, there might be some coding-related LLMs — where it’s useful for the LLM to be able to read an entire codebase — that have larger context windows. But I predict the context lengths of general-purpose frontier models will stay about the same over the next year.

4. Real GDP will grow by less than 3.5% in the US (90%)

Timothy B. Lee

The year 2027 has acquired a totemic status in some corners of the AI world. In 2024, former OpenAI researcher Leopold Aschenbrenner penned a widely read series of essays predicting a “fast takeoff” in 2027. Then in April 2025, an all-star team of researchers published AI 2027, a detailed forecast for rapid AI progress. They forecast that by the 2027 holiday season, GDP will be “ballooning.” One AI 2027 author suggested that this could eventually lead to annual GDP growth rates as high as 50%.

They don’t make a specific prediction about 2026, but if these predictions are close to right, we should start seeing signs of it by the end of 2026. If we’re on the cusp of an AI-powered takeoff, that should translate to above-average GDP growth, right?

So here’s my prediction: inflation-adjusted GDP in the third quarter of 2026 will not be more than 3.5% higher than the third quarter of 2025.² Over the last decade, year-over-year GDP growth has only been faster than 3.5% in late 2021 and early 2022, a period when the economy was bouncing back from Covid. Outside of that period, year-over-year growth of real GDP has ranged from 1.4% to 3.4%.

I expect the AI industry to continue growing at a healthy pace, and this should provide a modest boost to the US economy. Indeed, data center construction has been supporting the economy over the last year. But I expect the boost from data center construction to be a fraction of one percent — not enough to push overall economic growth outside its normal range.

5. AI models will be able to complete 20-hour software engineering tasks (55%)

Kai Williams

The AI evaluation organization METR released the original version of this chart in March. They found that every seven months, the length of software engineering tasks that leading AI models were capable of completing (with a 50% success rate) was doubling. Note that the y-axis of this chart is on a log scale, so the straight line represents an exponential increase.

By mid-2025, LLM releases seemed to be improving more quickly, doubling successful task lengths in just five months. METR estimates that Claude Opus 4.5, released in November, could complete software tasks (with at least a 50% success rate) that took humans nearly five hours.

I predict that this faster trend will continue in 2026. AI companies will have access to significantly more computational resources in 2026 as the first gigawatt-scale clusters start operating early in the year, and LLM coding agents are starting to speed up AI development. Still, there are reasons to be skeptical. Both pre-training (with imitation learning) and post-training (with reinforcement learning) have shown diminishing returns.

Whatever happens, whether METR’s trend line continues to hold is a crucial question. If the faster trend holds, the strongest AI models will be able to complete 20-hour software tasks — half of a software engineer’s work week — with 50% reliability.
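Here is the compound-doubling arithmetic behind that claim, using the numbers in the text: a roughly five-hour task horizon as of November 2025 and doubling times of five months (the faster 2025 trend) or seven months (the original trend):

```python
def task_horizon(start_hours, months_elapsed, doubling_months):
    """Project the 50%-success task length forward under steady doubling."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

start_hours = 5          # METR's estimate for Claude Opus 4.5 (November 2025)
months = 13              # November 2025 -> December 2026

for doubling in (5, 7):
    hours = task_horizon(start_hours, months, doubling)
    print(f"doubling every {doubling} months -> ~{hours:.0f}-hour tasks by end of 2026")
# doubling every 5 months -> ~30-hour tasks
# doubling every 7 months -> ~18-hour tasks
```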


6. The legal free-for-all that characterized the first few years of the AI boom will be definitively over (70%)

James Grimmelmann, professor at Cornell Tech and Cornell Law School

So far, AI companies are winning against the lawsuits that pose truly existential threats — most notably, courts in the US, EU, and UK have all held that it’s not copyright infringement to train a model. But for everything else, the courts have been putting real operational limits on them. Anthropic is paying $1.5 billion to settle claims that it trained on downloads from shadow libraries, and multiple courts have held or suggested that they need real guardrails against infringing outputs.

I expect the same thing to happen beyond copyright, too: courts won’t enjoin AI companies out of existence, but they will impose serious high-dollar consequences if the companies don’t take reasonable steps to prevent easily predictable harms. It may still take a head on a pike — my money is on Perplexity’s — but I expect AI companies to get the message in 2026.

7. AI will not cause any catastrophes in 2026 (90%)

Steve Newman, author of Second Thoughts

There are credible concerns that AI could eventually enable various disaster scenarios. For instance, an advanced AI might help create a chemical or biological weapon, or carry out a devastating cyberattack. This isn’t entirely hypothetical; Anthropic recently uncovered a group using its agentic coding tools to carry out cyberattacks with minimal human supervision. And AIs are starting to exhibit advanced capabilities in these domains.

However, I do not believe there will be any major “AI catastrophe” in 2026. More precisely: there will be no unusual physical or economic catastrophe (dramatically larger than past incidents of a similar nature) in which AI plays a crucial enabling role. For instance, no unusually impactful bio, cyber, or chemical attack.

Why? It always takes longer than expected for technology to find practical applications — even bad applications. And AI model providers are taking steps to make it harder to misuse their models.

Of course, people may jump to blame AI for things that might have happened anyway, just as some tech CEOs blamed AI for layoffs that were triggered by over-hiring during Covid.

8. Major AI companies like OpenAI and Anthropic will stop investing in MCP (90%)

Andrew Lee, CEO of Tasklet (and Tim’s brother)

The Model Context Protocol was designed to give AI assistants a standardized way to interact with external tools and data sources. Since its introduction in late 2024, it has exploded in popularity.

But here’s the thing: modern LLMs are already smart enough to reason about how to use conventional APIs directly, given just a description of that API. And those descriptions that MCP servers provide? They’re already baked into the training data or accessible on public websites.

Agents built to access APIs directly can be simpler and more flexible, and they can connect to any service — not just the ones that support MCP.

By the end of 2026, I predict MCP will be seen as an unnecessary abstraction that adds complexity without meaningful benefit. Major vendors will stop investing in it.

9. A Chinese company will surpass Waymo in total global robotaxi fleet size (55%)

Daniel Abreu Marques, author of The AV Market Strategist

Waymo has world-class autonomy, broad regulatory acceptance, and a maturing multi-city playbook. But vehicle availability remains a major bottleneck. Waymo is scheduled to begin using vehicles from the Chinese automaker Zeekr in the coming months, but tariff barriers and geopolitical pressures will limit the size of its Zeekr-based fleet. Waymo has also signed a deal with Hyundai, but volume production likely won’t begin until after 2026. So for the next year, fleet growth will remain incremental.

Chinese AV players operate under a different set of constraints. Companies like Pony.ai, Baidu Apollo Go, and WeRide have already demonstrated mass-production capability. For example, when Pony rolled out its Gen-7 platform, it reduced its bill of materials cost by 70%. Chinese companies are scaling fleets across China, the Middle East, and Europe simultaneously.

At the moment, Waymo has about 2,500 vehicles in its commercial fleet. The biggest Chinese company is probably Pony.ai, with around 1,000 vehicles. Pony.ai is aiming for 3,000 vehicles by the end of 2026, while Waymo will need 4,000 to 6,000 vehicles to meet its year-end goal of one million weekly rides.
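As a sanity check on those fleet numbers, here is the utilization implied by one million weekly rides spread over 4,000 to 6,000 vehicles (a simple average that ignores downtime and maintenance):

```python
weekly_rides = 1_000_000

for fleet_size in (4_000, 6_000):
    rides_per_vehicle_per_day = weekly_rides / fleet_size / 7
    print(f"{fleet_size} vehicles -> ~{rides_per_vehicle_per_day:.0f} rides per vehicle per day")
# 4,000 vehicles -> ~36 rides per vehicle per day
# 6,000 vehicles -> ~24 rides per vehicle per day
```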

But if Waymo’s supply chain ramps slower than expected due to unforeseen problems or delays — and Chinese players continue to ramp up production volume — then at least one of them could surpass Waymo in total global robotaxi fleet size by the end of 2026.


10. The first fully autonomous vehicle will be sold to consumers — but it won’t be from Tesla (75%)

Sophia Tung, content editor of the Ride AI newsletter

Currently many customer-owned vehicles have advanced driver-assistance systems (known as “level two” in industry jargon), but none are capable of fully driverless operation (“level four”). I predict that will change in 2026: you’ll be able to buy a car that’s capable of operating with no one behind the wheel — at least in some limited areas.

One company that might offer such a vehicle is Tensor, formerly AutoX. Tensor is working with younger, more eager automakers that already ship vehicles in the US, like VinFast, to manufacture and integrate their vehicles. The manufacturing hurdles, while significant, are not insurmountable.

Many people expect Tesla to ship the first fully driverless customer-owned vehicle, but I think that’s unlikely. Tesla is in a fairly comfortable position. Its driver-assistance system performs well enough most of the time. Users believe it is “pretty much” a fully driverless system. Being years behind Waymo in the robotaxi market hasn’t hurt Tesla’s credibility with its fans. So Tesla can probably retain the loyalty of its customers even if a little-known startup like Tensor introduces a customer-owned driverless vehicle before Tesla enables driverless operation for its customers.

Tensor has a vested interest in being first and flashiest in the market. It could launch a vehicle that can operate with no driver within a very limited area and credibly claim a first-to-market win. Tensor runs driverless robotaxi testing programs and therefore understands the risks involved. Tesla, in contrast, probably does not want to assume liability or responsibility for accidents caused by its system. So I expect Tesla to wait, observe how Tensor performs, and then adjust its own strategy accordingly.

11. Tesla will begin offering a truly driverless taxi service to the general public in at least one city (70%)

Timothy B. Lee

In June, Tesla delivered on Elon Musk’s promise to launch a driverless taxi service in Austin. But it did so in a sneaky way. There was no one in the driver’s seat, but every Robotaxi had a safety monitor in the passenger seat. When Tesla began offering Robotaxi rides in the San Francisco Bay Area, those vehicles had safety drivers.

It was the latest example of Elon Musk overpromising and underdelivering on self-driving technology. This has led many Tesla skeptics to dismiss Tesla’s self-driving program entirely, arguing that Tesla’s current approach simply isn’t capable of full autonomy.

I don’t buy it. Elon Musk tends to achieve ambitious technical goals eventually. And Tesla has been making genuine progress on its self-driving technology. Indeed, in mid-December, videos started to circulate showing Teslas on public roads with no one inside. I think that suggests that Tesla is nearly ready to debut genuinely driverless vehicles, with no Tesla employees anywhere in the vehicle.

Before Tesla fans get too excited, it’s worth noting that Waymo began its first fully driverless service in 2020. Despite that, Waymo didn’t expand commercial service to a second city — San Francisco — until 2023. Waymo’s earliest driverless vehicles were extremely cautious and relied heavily on remote assistance, making rapid expansion impractical. I expect the same will be true for Tesla — the first truly driverless Robotaxis will arrive in 2026, but technical and logistical challenges will limit how rapidly they expand.

12. Text diffusion models will hit the mainstream (75%)

Kai Williams

Current LLMs are autoregressive, which means they generate tokens one at a time. But this isn’t the only way that AI models can produce outputs. Another type of generation is diffusion. The basic idea is to train the model to progressively remove noise from an input. When paired with a prompt, a diffusion model can turn random noise into solid outputs.
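Here is a toy sketch of the iterative-denoising idea (not how Mercury or Gemini Diffusion actually work): it just illustrates refining every position in parallel instead of committing to one token at a time. The target sentence stands in for whatever a trained model would predict:

```python
import random

random.seed(0)
vocab = list("abcdefghijklmnopqrstuvwxyz ")
# In a real model, predictions come from a trained network; here the target
# sentence simply stands in for the model's prediction.
target = "diffusion models denoise in parallel"

# Step 0: pure noise, a random character at every position.
state = [random.choice(vocab) for _ in target]

steps = 8
for t in range(1, steps + 1):
    # Each step "denoises" a growing fraction of positions toward the
    # prediction. Every position is refined in parallel, unlike an
    # autoregressive model that generates strictly left to right.
    for i in range(len(state)):
        if random.random() < t / steps:
            state[i] = target[i]
    print(f"step {t}: {''.join(state)}")
```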

For a while, diffusion models were the standard way to make image models, but it wasn’t as clear how to adapt that to text models. In 2025, this changed. In February, the startup Inception Labs released Mercury, a text diffusion model aimed at coding. In May, Google announced Gemini Diffusion as a beta release.

Diffusion models have several key advantages over standard models. For one, they’re much faster because they generate many tokens at once. They also might learn from data more efficiently, at least according to a July study by Carnegie Mellon researchers.

While I don’t expect diffusion models to supplant autoregressive models, I think there will be more interest in this space, with at least one established lab (Chinese or American) releasing a diffusion-based LLM for mainstream use.

13. There will be an anti-AI super PAC that raises at least $20 million (70%)

Charlie Guo, author of Artificial Ignorance

AI has become a vessel for a number of different anxieties: misinformation, surveillance, psychosis, water usage, and “Big Tech” power in general. As a result, opposition to AI is quickly becoming a bipartisan issue. One example: back in June, Ted Cruz attempted to add an AI regulation moratorium to the budget reconciliation bill (not unlike President Trump’s recent executive order), but it failed 99-1.

Interestingly, there are at least two well-funded pro-AI super PACs:

  • Leading The Future, with over $100 million from prominent Silicon Valley investors, and

  • Meta California, with tens of millions from Facebook’s parent company.

Meanwhile, there’s no equally organized counterweight on the anti-AI side. This feels like an unstable equilibrium, and I expect to see a group solely dedicated to lobbying against AI-friendly policies by the end of 2026.


14. News coverage linking AI to suicide will triple — but actual suicides will not (85%)

Abi Olvera, author of Positive Sum

We’ve already seen extensive media coverage of cases like the Character.AI lawsuit, where a teen’s death became national news. I expect suicides involving LLMs to generate even more media attention in 2026. Specifically, I predict that news mentions of “AI” and “suicide” in media databases will be at least three times higher in 2026 than in 2025.

But increased coverage doesn’t mean increased deaths. The US suicide rate will likely continue on its baseline trends.

The US suicide rate is currently near a historic peak after a mostly steady rise since 2000. While the rate remained high through 2023, recent data shows a meaningful decrease in 2024. I expect suicide rates to stay stable or decline, reverting toward the long-run average and away from the 2018 and 2022 peaks.

15. The American open frontier will catch up to Chinese models (60%)

Florian Brand, editor at the Interconnects newsletter

In late 2024, Qwen 2.5, made by the Chinese firm Alibaba, surpassed the best American open model Llama 3. In 2025, we got a lot of insanely good Chinese models — DeepSeek R1, Qwen3, Kimi K2 — and American open models fell behind. Meta’s Llama 4, Google’s Gemma 3, and other releases were good models for their size, but didn’t reach the frontier. American investment in open weights started to flag; there have been rumors since the summer that Meta is switching to closed models.

But things could change next year. Through advocacy like the ATOM Project (led by Nathan Lambert, the founder of Interconnects), more Western companies have indicated interest in building open-weight models. In late 2025, there has been an uptick in solid American/Western open model releases like Mistral 3, Olmo 3, Rnj, and Trinity. Right now, those models are behind in raw performance, but I predict that this will change in 2026 as Western labs keep up their current momentum. American companies still have substantial resources, and organizations like Nvidia — which announced in December it would release a 500 billion parameter model — seem ready to invest.

16. Vibes will have more active users than Sora in a year (70%)

Kai Williams

This fall, OpenAI and Meta both released platforms for short-form AI-generated video. Sora caught all of the early positive attention: the app came with a new video generation model and a clever mechanic around making deepfakes of your friends. Meta’s Vibes, by contrast, fell flat at first. Sora quickly became the number one app in Apple’s App Store, while the Meta AI app, which includes Vibes, languished around position 75.

Today, however, the momentum seems to have shifted. Sora’s initial excitement has worn off as the novelty of AI videos has faded. Meanwhile, Vibes has been growing, albeit slowly, hitting two million daily active users in mid-November, according to Business Insider. The Meta AI app now ranks higher in the App Store than Sora.

I think this reversal will continue. In my experience, Sora’s recommendation algorithm is clunky, while Meta is very skilled at building compelling products that grow its user base. I wouldn’t count out Mark Zuckerberg when it comes to growing a social media app.

17. Counterpoint: Sora will have more active users than Vibes in a year (65%)

Timothy B. Lee

This is one of the few places where Kai and I disagreed, so I thought it would be fun to air both sides of the argument.

I was initially impressed by Sora’s clever product design, but the app hasn’t held my attention since my October writeup. However, toward the end of that writeup I said this:

I expect the jokes to get funnier as the Sora audience grows. Another obvious direction is licensing content from Hollywood. I expect many users would love to put themselves into scenes involving Harry Potter, Star Wars, or other famous fictional worlds. Right now, Sora tersely declines such requests due to copyright concerns. But that could change if OpenAI writes big enough checks to the owners of these franchises.

This is exactly what happened. OpenAI just signed a licensing agreement with Disney to let users make videos of themselves with Disney-owned characters. It’s exclusive for the first year. I expect this to greatly increase interest in Sora, because while making fake videos of yourself is lame, making videos of yourself interacting with Luke Skywalker or Iron Man is going to be more appealing.

I doubt users will react well if they’re just given a blank prompt field to fill out, so fully exploiting this opportunity will require clever product design. But Sam Altman has shown a lot of skill at turning promising AI models into compelling products. There’s no guarantee he’ll be able to do this with Sora, but I’m guessing he’ll figure it out.

1. Anthropic does offer a million-token context window in beta testing for Sonnet 4 and Sonnet 4.5.

2. I’m focusing on Q3 numbers because we don’t typically get GDP data for the fourth quarter until late January, which is too late for a year-end article like this.

Waymo and Tesla’s self-driving systems are more similar than people think

2025-12-18 06:01:17

The transformer architecture underlying large language models is remarkably versatile. Researchers have found many use cases beyond language, from understanding images to predicting the structure of proteins to controlling robot arms.

The self-driving industry has jumped on the bandwagon too. Last year, for example, the autonomous vehicle startup Wayve raised $1 billion. In a press release announcing the round, Wayve said it was “building foundation models for autonomy.”

“When we started the company in 2017, the opening pitch in our seed deck was all about the classical robotics approach,” Wayve CEO Alex Kendall said in a November interview. That approach was to “break down the autonomy problem into a bunch of different components and largely hand-engineer them.”

Wayve took a different approach, training a single transformer-based foundation model to handle the entire driving task. Wayve argues that its network can more easily adapt to new cities and driving conditions.

Tesla has been moving in the same direction.

Subscribe now

“We used to work on an explicit, modular approach because it was so much easier to debug,” said Tesla AI chief Ashok Elluswamy at a recent conference. “But what we found out was that codifying human values was really difficult.”

So a couple of years ago, Tesla scrapped its old code in favor of an end-to-end architecture. Here’s a slide from Elluswamy’s October presentation:

Conventional wisdom holds that Waymo has a dramatically different approach. Many people — especially Tesla fans — believe that Tesla’s self-driving technology is based on cutting-edge, end-to-end AI models, while Waymo still relies on a clunky collection of handwritten rules.

But that’s not true — or at least it greatly exaggerates the differences.

Last year, Waymo published a paper on EMMA, a self-driving foundation model built on top of Google’s Gemini.

“EMMA directly maps raw camera sensor data into various driving-specific outputs, including planner trajectories, perception objects, and road graph elements,” the researchers wrote.

Although the EMMA model was impressive in some ways, the Waymo team noted that it “faces challenges for real-world deployment,” including poor spatial reasoning ability and high computational costs. In other words, the EMMA paper described a research prototype — not an architecture that was ready for commercial use.

But Waymo kept refining this approach. In a blog post last week, Waymo pulled back the curtain on the self-driving technology in its commercial fleet. It revealed that Waymo vehicles today are controlled by a foundation model that’s trained in an end-to-end fashion — just like Tesla and Wayve vehicles.

For this story, I read several Waymo research papers and watched presentations by (and interviews with) executives at Waymo, Wayve, and Tesla. I also had a chance to talk to Waymo co-CEO Dmitri Dolgov. Read on for an in-depth explanation of how Waymo’s technology works, and why it’s more similar to rivals’ technology than many people think.

Thinking fast and slow

Some driving scenarios require complex, holistic reasoning. For example, suppose a police officer is directing traffic around a crashed vehicle. Navigating this scene not only requires interpreting the officer’s hand signals, it also requires reasoning about the goals and likely actions of other vehicles as they navigate a chaotic situation. The EMMA paper showed that LLM-based models can handle these complex situations much better than a traditional modular approach.

But foundation models like EMMA also have real downsides. One is latency. In some driving scenarios, a fraction of a second can make the difference between life and death. The token-by-token reasoning style of models like Gemini can mean long and unpredictable response times.

General-purpose foundation models are also not very good at geometric reasoning. They can’t always judge the exact locations of objects in an image. They might also overlook objects or hallucinate ones that aren’t there.

So rather than relying entirely on an EMMA-style vision-language model (VLM), Waymo placed two neural networks side by side. Here’s a diagram from Waymo’s blog post:

Let’s start by zooming in on the lower-left of the diagram:

VLM here stands for vision-language model — specifically Gemini, the Google AI model that can handle images as well as text. Waymo says this portion of its system was “trained using Gemini” and “leverages Gemini’s extensive world knowledge to better understand rare, novel, and complex semantic scenarios on the road.”

Compare that to EMMA, which Waymo described as maximizing the “utility of world knowledge” from “pre-trained large language models” like Gemini. The two approaches are very similar — and both are similar to the way Tesla and Wayve describe their self-driving systems.

“Milliseconds really matter”

But the model in today’s Waymo vehicles isn’t just an EMMA-like vision-language model — it’s a hybrid system that also includes a module called a sensor fusion encoder, which is depicted in the upper-left corner of Waymo’s diagram:

This module is tuned for speed and accuracy.

“Imagine a latency-critical safety scenario where maybe an object appears from behind a parked car,” Waymo co-CEO Dmitri Dolgov told me. “Milliseconds really matter. Accuracy matters.”

Whereas the VLM (the blue box) considers the scene as a whole, the sensor fusion module (the yellow box) breaks the scene into dozens of individual objects: other vehicles, pedestrians, fire hydrants, traffic cones, the road surface, and so forth.

It helps that every Waymo vehicle has lidar sensors that measure the distance to nearby objects by bouncing lasers off of them. Waymo’s software matches these lidar measurements to the corresponding pixels in camera images — a process called sensor fusion. This allows the system to precisely locate each object in three-dimensional space.
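Waymo hasn’t published the details of its fusion pipeline, but the underlying geometric step is standard in robotics: project each lidar return into the camera image using the calibration between the two sensors, so that pixels near a projected point inherit a measured depth. Here is a minimal sketch of that projection; the transform T_cam_from_lidar and the intrinsic matrix K are hypothetical calibration inputs, not values from Waymo’s system.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_from_lidar, K):
    """Project lidar points (N, 3) into pixel coordinates.

    T_cam_from_lidar: 4x4 rigid transform from the lidar frame to the camera frame.
    K: 3x3 camera intrinsic matrix.
    Both are per-vehicle calibration values; this is an illustrative sketch only.
    """
    ones = np.ones((points_lidar.shape[0], 1))
    pts_h = np.hstack([points_lidar, ones])            # homogeneous coordinates, (N, 4)

    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]    # move points into the camera frame
    in_front = pts_cam[:, 2] > 0                       # discard points behind the camera
    pts_cam = pts_cam[in_front]

    # Perspective projection: divide by depth, then apply the intrinsics.
    pixels = (K @ (pts_cam / pts_cam[:, 2:3]).T).T[:, :2]
    return pixels, pts_cam[:, 2]                       # pixel coordinates plus depth per point
```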

In early self-driving systems, a human programmer would decide how to represent each object. For example, the data structure for a vehicle might record the type of vehicle, how fast it’s moving, and whether it has a turn signal on.

But a hand-coded system like this is unlikely to be optimal. It will save some information that isn’t very useful while discarding other information that might be crucial.
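To make the contrast concrete, here is an illustrative sketch of the two approaches. The field names in the hand-coded version are my own invention, not Waymo’s, and the 256-dimensional vector is just a placeholder size.

```python
from dataclasses import dataclass
import numpy as np

# Classical approach: an engineer decides up front which attributes matter.
# Anything not listed here is thrown away, even if it turns out to matter for driving.
@dataclass
class TrackedVehicle:
    vehicle_type: str      # e.g. "sedan", "truck", "bus"
    position_m: tuple      # (x, y) in meters, relative to the robot
    speed_mps: float       # current speed
    turn_signal_on: bool   # blinker state

# Learned approach: each object becomes a fixed-size vector produced by a neural
# encoder, and training decides what each dimension ends up encoding.
object_vector = np.zeros(256, dtype=np.float32)
```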

“The task of driving is not one where you can just enumerate a set of variables that are sufficient to be a good driver,” Dolgov told me. “There’s a lot of richness that is very hard to engineer.”

Waymo co-CEO Dmitri Dolgov. (Image courtesy of Waymo)

So instead, Waymo’s model learns the best way to represent each object through a data-driven training process. Waymo didn’t give me a ton of information about how this works, but I suspect it’s similar to the technique described in the 2024 Waymo paper called “MoST: Multi-modality Scene Tokenization for Motion Prediction.”

The system described in the MoST paper still splits a driving scene up into distinct objects as in older self-driving systems. But it doesn’t capture a set of attributes chosen by a human programmer. Rather, it computes an “object vector” that captures information that’s most relevant for driving — and the format of this vector is learned during the training process.

“Some dimensions of the vector will likely indicate whether it’s a fire truck, a stop sign, a tree trunk, or something else,” I wrote in an article last year. “Other dimensions will represent subtler attributes of objects. If the object is a pedestrian, for example, the vector might encode information about the position of the pedestrian’s head, arms, and legs.”

There’s an analogy here to LLMs. An LLM represents each token with a “token vector” that captures the information that’s most relevant to predicting the next token. In a similar way, the MoST system learns to capture the information about each object that is most relevant for driving.
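I don’t know the exact shape of Waymo’s encoder, but the idea can be sketched in a few lines: a small network maps each object’s fused sensor features to a learned object vector, and a scene becomes a sequence of those vectors, much as a sentence becomes a sequence of token vectors. The feature and embedding sizes below are placeholders.

```python
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Map per-object sensor features to learned object vectors (a rough sketch)."""

    def __init__(self, feature_dim=512, token_dim=256):
        super().__init__()
        # A tiny MLP stands in for whatever architecture Waymo actually uses.
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Linear(512, token_dim),
        )

    def forward(self, object_features):        # (num_objects, feature_dim)
        return self.mlp(object_features)        # (num_objects, token_dim)

# A driving scene with 30 detected objects becomes a sequence of 30 object vectors,
# ready to be consumed by a transformer, just as an LLM consumes token vectors.
encoder = ObjectEncoder()
scene_tokens = encoder(torch.randn(30, 512))
```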

I suspect that when Waymo says its sensor fusion module outputs “objects, sensor embeddings” in the diagram above, this is a reference to a MoST-like system.

How does the system know which information to include in these object vectors? Through end-to-end training, of course!

That brings us to the third and final module of Waymo’s self-driving system: the world decoder, the purple box in Waymo’s diagram.

It takes inputs from both the sensor fusion encoder (the fast-thinking module that breaks the scene into individual objects) and the driving VLM (the slow-thinking module that tries to understand the scene as a whole). Based on information supplied by these modules, the world decoder tries to decide the best action for a vehicle to take.

During training, information flows in the opposite direction. The system is trained on data from real-world situations. If the decoder correctly predicts the actions taken in the training example, the network gets positive reinforcement. If it guesses wrong, then it gets negative reinforcement.

These signals are then propagated backward to the other two modules. If the decoder makes a good choice, signals are sent back to the yellow and blue boxes encouraging them to continue doing what they’re doing. If the decoder makes a bad choice, signals are sent back to change what they’re doing.

Based on these signals, the sensor fusion module learns which information is most helpful to include in object vectors — and which information can be safely left out. Again, this is closely analogous to LLMs, which learn the most useful information to include in the vectors that represent each token.
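None of this training code is public, so the following is only a schematic illustration of the general idea: two separately defined modules, a loss computed on the final output, and a single backward pass that updates both. The shapes, layer sizes, and the mean-pooling step are all invented for the sketch.

```python
import torch
import torch.nn as nn

# Two distinct modules, loosely analogous to the sensor fusion encoder and the
# world decoder. They are defined separately but optimized with one shared loss.
encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
decoder = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 20))  # ten (x, y) waypoints

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

def training_step(object_features, logged_trajectory):
    """One end-to-end update.

    object_features: (num_objects, 512) fused sensor features for one scene.
    logged_trajectory: (20,) flattened waypoints that a good driver actually took.
    """
    object_vectors = encoder(object_features)      # learned object representations
    scene_summary = object_vectors.mean(dim=0)     # crude pooling, purely illustrative
    predicted = decoder(scene_summary)             # proposed trajectory
    loss = nn.functional.mse_loss(predicted, logged_trajectory)

    optimizer.zero_grad()
    loss.backward()   # gradients flow from the decoder's error back into the encoder
    optimizer.step()
    return loss.item()
```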

Subscribe now

Modular networks can be trained end-to-end

Leaders at all three self-driving companies portray the choice between a hybrid and a monolithic design as a key architectural difference between their self-driving systems. Waymo argues that its hybrid system delivers faster and more accurate results. Wayve and Tesla, in contrast, emphasize the simplicity of their monolithic end-to-end architectures. They believe that their models will ultimately prevail thanks to the Bitter Lesson — the insight that the best results often come from scaling up simple architectures.

In a March interview, podcaster Sam Charrington asked Waymo’s Dragomir Anguelov about the choice to build a hybrid system.

“We’re on the practical side,” Anguelov said. “We will take the thing that works best.”

Anguelov pointed out that the phrase “end-to-end” describes a training strategy, not a model architecture. End-to-end training just means that gradients are propagated all the way through the network. As we’ve seen, Waymo’s network is end-to-end in this sense: during training, error signals propagate backward from the purple box to the yellow and blue boxes.

“You can still have modules and train things end-to-end,” Anguelov said in March. “What we’ve learned over time is that you want a few large components, if possible. It simplifies development.” However, he added, “there is no consensus yet if it should be one component.”

So far, Waymo has found that its modular approach — with three modules rather than just one — is better for commercial deployment.

Waymo co-CEO Dmitri Dolgov told me that a monolithic architecture like EMMA “makes it very easy to get started, but it’s wildly inadequate to go to full autonomy safely and at scale.”

I’ve already mentioned latency and accuracy as two major concerns. Another issue is validation. A self-driving system doesn’t just need to be safe; the company making it needs to be able to prove it’s safe with a high level of confidence. This is hard to do when the system is a black box.

Under Waymo’s hybrid architecture, the company’s engineers know what function each module is supposed to perform, which allows the modules to be tested and validated independently. For example, if engineers know what objects are in a scene, they can look at the output of the sensor fusion module to make sure it identifies all the objects it’s supposed to.
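Waymo hasn’t said what its validation tooling looks like, but module-level checks of this kind are easy to picture: because the sensor fusion module has a defined job (find every object in the scene), its output can be scored against human-labeled ground truth without running the rest of the system. Here’s a hypothetical sketch of one such check:

```python
def recall_against_ground_truth(detected_boxes, ground_truth_boxes, iou_threshold=0.5):
    """Fraction of labeled objects that the perception module actually found.

    Boxes are (x_min, y_min, x_max, y_max). The ground truth comes from a
    human-labeled test scene; everything here is illustrative.
    """
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    found = sum(
        any(iou(gt, det) >= iou_threshold for det in detected_boxes)
        for gt in ground_truth_boxes
    )
    return found / len(ground_truth_boxes) if ground_truth_boxes else 1.0
```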

These architectural differences seem overrated

My suspicion is that the actual differences are smaller than either side wants to admit. It’s not true that Waymo is stuck with an outdated system based on hand-coded rules. The company makes extensive use of modern AI techniques, and its system seems perfectly capable of generalizing to new cities.

Indeed, if Waymo deleted the yellow box from its diagram, the resulting model would be very similar to those at Tesla and Wayve. Waymo supplements this transformer-based model with a sensor fusion module that’s tuned for speed and geometric precision. But if Waymo finds the sensor fusion module isn’t adding much value, it can always remove it. So it’s hard to imagine the module puts Waymo at a major disadvantage.

At the same time, I wonder if Wayve and Tesla are downplaying the modularity of their own systems for marketing purposes. Their pitch to investors is that they’re pioneering a radically different approach than incumbents like Waymo — one that’s inspired by frontier labs like OpenAI and Anthropic. Investors were so impressed by this pitch that they gave Wayve $1 billion last year, and optimism about Tesla’s self-driving project has pushed up the company’s stock price in recent years.

For example, here’s how Wayve depicts its own architecture:

At first glance, this looks like a “pure” end-to-end architecture. But look closer and you’ll notice that Wayve’s model includes a “safety expert sub-system.” What’s that? I haven’t been able to find any details on how this works or what it does. But in a 2024 blog post, Wayve wrote about its effort to train its models to have an “innate safety reflex.”

According to Wayve, the company uses simulation to “optimally enrich our Emergency Reflex subsystem’s latent representations.” Wayve added that “to supercharge our Emergency Reflex, we can incorporate additional sources of information, such as other sensor modalities.”

This sounds at least a little bit like Waymo’s sensor fusion module. I’m not going to claim that the systems are identical or even all that similar. But any self-driving company has to address the same basic problem as Waymo: that large, monolithic language models are slow, error-prone, and difficult to debug. I expect that as it gets ready to commercialize its technology, Wayve will need to supplement the core end-to-end model with additional information sources that are easier to test and validate — if it isn’t doing so already.