2025-06-28 02:19:41
It’s the final day of Agent Week, which means it’s also the final day to get 20 percent off an annual subscription to Understanding AI. Please click here to support my work.
In recent weeks, I’ve written multiple articles about the new generation of coding agents that have come out over the last year. It seems clear that agents like these are going to change how programmers do their jobs. And some people worry that they’re going to put programmers out of work altogether.
It’s impossible to truly understand a tool if you haven’t used it yourself. So last week I decided to put seven popular coding agents to the test.
I found that right now, these tools have significant limitations. But unlike the computer-use agents I wrote about yesterday, these coding agents do not strike me as a dead end. They are already powerful enough to make programmers significantly more productive. And they will only get better in the coming months and years.
I also talked to an experienced programmer about how he uses tools like Claude Code. The conversation gave me better insight into how these tools are likely to change the programming profession—and perhaps other professions in the future.
LLMs tend to do well on “canned” problems, but they struggle with more complex real-world tasks. So I wanted to use some real-world data for my own testing. I chose a subject that will be familiar to long-time readers: Waymo crashes, which are far less common, per mile, than crashes by human-driven vehicles.
Specifically, I asked coding agents to merge two spreadsheets—one from Waymo and the other from the National Highway Traffic Safety Administration—and then build a website to search, sort, and browse data on crashes involving Waymo vehicles.
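To make the task concrete: the heart of the data work is a join between the two spreadsheets on some shared crash identifier. Here is a minimal sketch of the kind of script an agent might write for that step. The file names and column names below are hypothetical stand-ins, not the real Waymo or NHTSA schemas.

```python
import pandas as pd

# Hypothetical file and column names; the real Waymo and NHTSA
# spreadsheets use different schemas.
waymo = pd.read_excel("waymo_crashes.xlsx")
nhtsa = pd.read_csv("nhtsa_sgo_reports.csv")

# Join the two datasets on a shared report identifier, keeping Waymo
# rows even when there is no matching NHTSA record.
merged = waymo.merge(nhtsa, on="report_id", how="left")

# Keep a handful of columns that matter for a browsing UI.
columns = ["report_id", "crash_date", "city", "injury_severity"]
merged[columns].to_csv("merged_crashes.csv", index=False)
```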
I tried seven agentic coding platforms, playing the role of a coding novice. At no point did I look at any code to try to figure out what might be causing a bug—I simply gave the agents high-level instructions and relied on them to figure out how to implement them.
The results were decidedly mixed. Let’s start with the least successful coding agents and move to the more successful ones.
Bolt.new immediately gave up, stating “looks like you've hit the size limit.” A help page states that “if your Bolt project exceeds the context window (200k tokens for free accounts, 500k for paid accounts), Bolt notifies you in the chat, saying Project size exceeded.” Evidently, my spreadsheets were too large to fit into the model’s context window.
Other agentic coding platforms give models tools for searching through files and accessing portions of them on an as-needed basis. This allows them to work with code and data that are much larger than the model’s context window. Bolt doesn’t seem to have this feature, so it wasn’t able to process a large spreadsheet.
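To illustrate the idea, here is a minimal sketch of the kind of file tools an agent harness might expose to a model. The function names and signatures are hypothetical; each platform defines its own tool set.

```python
import re
from pathlib import Path

def read_chunk(path: str, start_line: int, num_lines: int = 200) -> str:
    """Return only a slice of a file, so a large file never has to fit
    into the model's context window all at once."""
    lines = Path(path).read_text().splitlines()
    return "\n".join(lines[start_line : start_line + num_lines])

def search_files(pattern: str, root: str = ".") -> list[str]:
    """Return 'path:line' matches so the model can decide which chunks
    are worth reading in full."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text()
        except UnicodeDecodeError:
            continue                      # skip binary files
        for i, line in enumerate(text.splitlines()):
            if re.search(pattern, line):
                hits.append(f"{path}:{i + 1}")
    return hits
```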
Replit couldn’t produce a working website. It made a plan and chugged along for 17 minutes trying to implement it. It claimed the website was ready, but I couldn’t get it to load. Replit couldn’t fix the problem no matter how many times I asked.
Lovable generated an attractive website but struggled with importing data. With repeated prompting, I coaxed Lovable to import most of the data and convert it to the right format. But as I write this, the values in the date column are all “N/A” despite me repeatedly asking Lovable to fix this.
Windsurf showed early promise. It created a basic but working website more quickly than any of the other coding agents. And I was able to make some changes. But it was brittle. For example, I asked for a pop-up menu to filter by injury severity (minor, moderate, severe, or fatal). Windsurf created a menu but couldn’t get the filtering to work properly—even after several attempts.
OpenAI’s Codex got the job done but the results lacked polish. One of my source spreadsheets had dozens of columns. Other coding agents all chose a handful of important columns—like location, date, and injury severity—to display. In contrast, Codex included all of the columns, making the site several times wider than the browser window. Even after I spent several minutes working with Codex to improve the site’s appearance, it still had more quirks and minor formatting issues than sites created by other coding agents.
Cursor did a solid job. I wasn’t initially optimistic about Cursor, which spent a long time setting up a development environment on my local computer with multiple false starts. The first website it created didn’t work at all; it showed an endless “loading” indicator. But with some coaxing, Cursor fixed that problem and created a nice-looking website. I could request modifications (like adding or removing columns, adding search criteria, or changing formatting) and most of the time it would succeed on the first try. This is the vibe-coding dream, and only two of the seven coding agents achieved it.
Claude Code did the best job. It was able to complete the task almost straight through with minimal backtracking. Once I had a basic site set up, I could request new features or design changes and Claude was often able to get them right on the first try.
I want to start by repeating what I said in yesterday’s post: it’s amazing that these products work as well as they do. Like the computer-use agents, these coding agents are the result of combining reinforcement learning with tool use. They all have significant limitations, but all of them are also far more powerful than any coding tools that existed even a year ago.
There seems to be a tradeoff between user-friendliness and power. Bolt, Lovable, and Replit market themselves as “vibe coding” platforms that allow non-programmers to create entire websites with a single prompt. When this works, the results can be stunning. But all three platforms seem to struggle if you ask them to do anything too ambitious or off the beaten path.
At the opposite extreme, Claude Code and Codex are designed for use by professional programmers. To use them, you need to get an API key and know your way around a Unix command line. I can easily imagine novices finding them overwhelming. But on the flip side, they seem to be significantly more versatile.
I was surprised by how much using these agents involved repeated trial and error. There were many times when I reported a bug, the agent tried to fix it, and I said “it still doesn’t work, try it again.” Sometimes we would do this five or ten times in a row, with the agent seeming to try a different approach each time. And sometimes it would succeed after a series of failures.
But there were other bugs that an agent just couldn’t fix no matter how many times it tried. I observed this with Lovable, Replit, and Windsurf during my testing.
This is not a big deal if you’re an experienced programmer who can peek under the hood, figure out what’s wrong, and give the agent more detailed instructions. But if you’re a non-programmer trying to write code entirely by prompting, it can be a huge headache.
My website for browsing crash data is quite modest as programming projects go. Lots of companies build and maintain much more complicated and consequential software. Pure vibe coding won’t work for larger software projects because even the most sophisticated agents are eventually going to get stuck.
Coding agents can still be useful for larger projects, but they require a different approach. To learn more about this, I spoke to Understanding AI reader Aaron Votre, a software developer at the disaster recovery startup Bright Harbor. Votre has been a heavy user of Cursor and Claude Code, and he says they have dramatically increased his productivity. He agreed to let me look over his shoulder as he worked on a real software project.
We saw yesterday that computer-use agents tend to fail if they lack sufficiently precise or detailed instructions. The same is true for coding agents: the more context the programmer provides, and the more specific the instructions are, the more likely the coding agent is to produce the outcome the programmer is looking for.
“We have a long file that has hundreds of lines of all of our guidelines,” Votre told me. The file, which is always included in Claude Code’s context window, includes information about which software tools the company uses for various functions. It also includes the kind of advice the company would give a new engineer.
The file includes instructions like “fix each issue before proceeding to ensure code quality,” “sort aliases alphabetically,” and “test individual formatting functions separately for better isolation.”
“Every time it does something wrong, we add to this,” Votre said. “Which is why Claude is the best for us right now because we have the most set up for it.”
During my vibe coding experiments, I didn’t provide coding agents with this kind of guidance. And this undoubtedly made the agents’ jobs harder.
For example, several times Cursor tried to run software that wasn’t installed on my laptop. It was then able to either use a different tool or install the one it needed, so this wasn’t a big problem. But it would have been better if I’d told it which tools to use at the outset.
And this kind of thing can be a much bigger deal in a larger project. A lot of considerations go into choosing which software components to use for a large project, including cost, security, and compatibility with other software. Even the best of today’s agents need guidance to make sure they make choices that fit with the company’s broader goals. And the best way to do that is to have a file that explicitly lists which tools to use in which situations.
Votre said another key strategy for using software agents on large software projects is to have the agent write out a detailed plan. The human programmer can review the plan and make sure it all makes sense—if not, he can modify it. Reviewing a plan like this is an opportunity to detect ways that the initial instructions might have been imprecise or confusing. And it also gives the programmer an opportunity to provide additional context.
What both of these strategies—providing a big file full of context and reviewing the agent’s plan before it’s executed—have in common is that they require the programmer to have a detailed understanding of the company’s code. It’s different from the “vibe coding” vision where someone with no programming background gives a high-level objective and lets the agent figure out the details.
People like to talk about coding agents replacing engineers, but I think it makes more sense to think about this the way Andrej Karpathy put it a couple of years ago: “The hottest new programming language is English.”
At the dawn of the computer industry, programmers had to write code directly using low-level mathematical operations like add and multiply. In the 1950s, people started inventing programming languages like COBOL and Fortran that allowed programmers to write higher-level instructions and have a computer translate those instructions into lower-level binary code.
That process has continued over the decades—modern programming languages like Python come with extensive function libraries that allow powerful programs to be written in just a few lines of code. Thanks to better libraries and frameworks, programmers are far more productive today than 10 or 20 years ago.
Coding agents are the next rung on this ladder of generality. In the past, programmers would give a computer instructions in a language like C++ or Python and a compiler or interpreter would translate it into binary machine code. Today, programmers can give a computer instructions in English and a coding agent will translate it into a programming language like C++ or Python.
This new paradigm means that programmers have to spend a lot less time sweating implementation details and tracking down minor bugs. But what hasn’t changed is that someone needs to figure out what they want the computer to do—and give the instructions precisely enough that the computer can follow them. For large software projects, this is going to require systematic thinking, awareness of tradeoffs, attention to detail, and a deep understanding of how computers work. In other words, we are going to continue to need programmers, even if most of them are writing their code in English instead of C++ or Python.
And I think something similar will be true in other professions impacted by AI agents in the coming years. There will undoubtedly be legal agents that help lawyers write legal briefs, revise contracts, and conduct document reviews. But someone still needs to figure out what jobs to give these agents—and then evaluate whether they’ve done a good job. To do that job well will require legal expertise. In other words, at least for the next few years, it will make more sense to think of legal agents like this as a new tool for lawyers to use rather than as replacements for the lawyers themselves.
2025-06-26 20:45:47
It’s Agent Week at Understanding AI! Today’s post is only for paying subscribers. I’m offering a 20 percent discount on annual subscriptions through the end of the week, but only if you click this link.
Some people envision a future where sophisticated AI agents act as “drop-in remote workers.” Last October, Anthropic took a step in this direction when it introduced a computer-use agent that controls a personal computer with a keyboard and mouse.
OpenAI followed suit in January, introducing its own computer-use agent called Operator. Operator was originally based on GPT-4o; OpenAI upgraded it to o3 in May.
A Chinese startup called Manus introduced its own buzzy computer-use agent in March.
These agents reflect the two big trends I’ve written about this week:
They were trained using reinforcement learning to perform complex, multi-step tasks.
They use tools to “click,” “scroll,” and “type” as they navigate a website.
I spent time with all three agents over the last week, and right now none of them are ready for real-world use. It’s not even close. They are slow, clunky, and make a lot of mistakes.
But even after those flaws get fixed—and they will—I don’t expect AI agents like these to take off. The main reason is that I don’t think accessing a website with a computer-use agent will significantly improve the user’s experience. Quite the contrary, I expect users will find computer-use agents to be a confusing and clumsy interface with no obvious benefits.
In the launch video for Operator, an OpenAI employee shows the agent a hand-written, five-item shopping list and asks it to place an online grocery order. I wanted to try this out for myself. But to make things more interesting, I gave Operator a more realistic shopping list—one with 16 items instead of just five.
This image is based on a template I use for personal shopping trips. When I gave it to Operator, the results weren’t great:
When I asked it to list items that had checkmarks, it included several unchecked items (lemons, almond butter, honey, vegetable oil, and olive oil). It misread the “(5)” after bananas as “15.” And it failed to include Cheerios on the list.
After asking it to correct those mistakes, I had Operator pull up the website for the Harris Teeter grocery store nearest to my house. Operator did this perfectly.
Finally I had Operator put my items into an online shopping cart. It added 13 out of 16 items but forgot to include pineapple, onions, and pasta sauce.
Operator struggled even more during a second attempt. It got logged out partway through the shopping process. After I logged the agent back in, it got more and more confused. It started insisting—incorrectly—that some items were not available for delivery and would need to be picked up at the store. Ultimately, it took more than 30 minutes—and several interventions from me—for Operator to fill my shopping cart.
For comparison, it took me less than four minutes to order the same 16 items without help from AI.
The experience was bad enough that I can’t imagine any normal consumer using Operator for grocery shopping. And two other computer-use agents did even worse.
2025-06-25 04:49:39
I took a short break from Agent Week to write up a very important copyright ruling. Stay tuned for more Agent Week content in the coming days.
On Monday, a California federal judge ruled that Anthropic “downloaded for free millions of copyrighted books in digital form from pirate sites on the internet.”
Normally, it would be bad news for a judge to write that about your company. But the ruling is actually good news for Anthropic—and even better news for the broader AI industry. That’s because—if it’s upheld on appeal—it will give AI companies a clear blueprint for training models without running afoul of copyright.
The plaintiffs are three authors who sued Anthropic last August, arguing that Anthropic had infringed copyright by training Claude using their books. It’s a class-action lawsuit seeking to represent thousands of authors whose books were included in the training data for Anthropic’s Claude models. Anthropic had asked the judge to rule that copyright’s fair use doctrine allowed it to train on these books.
“They wanted to get a knockout on fair use across the board,” Cornell legal scholar James Grimmelmann told me. Instead, the judge handed down a split decision: some aspects of Anthropic’s training were fair use, but others weren’t.
The part of the ruling that went against Anthropic is going to sting; Anthropic could wind up owing authors hundreds of millions of dollars for past copyright infringement.
But the other half of the ruling is far more important because it’s the first time a court has said it’s legal to train AI models using copyrighted content without permission from rights holders.
Anthropic was founded by a group of former OpenAI researchers with deep connections to the academic AI research community. Traditionally, that community did not worry very much about copyright. And for good reason: not only does copyright law take a lenient attitude toward academic research generally, but most early AI models also had little commercial value.
So when Anthropic was preparing to train the first Claude model in 2021, it did what AI researchers had always done: download a bunch of training data from the Internet without worrying about its copyright status.
“In January or February of 2021, Anthropic cofounder Ben Mann downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books—that is, pirated,” wrote Judge William Alsup in his Monday ruling. “In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.”
Anthropic insists that all of this copying was legal because copyright law allows copyrighted works to be used for transformative purposes. For example, a 2015 ruling held that it was legal for Google to scan millions of in-copyright books for a book search engine. The appeals court in that case held that a search engine for books was a transformative use that didn’t compete with the books themselves—and hence was allowed under copyright’s fair use doctrine.
Anthropic argued that the same logic applies to its own training process because (like Google) it never distributed any books to users. But Judge Alsup was scathing about this argument.
“There is no decision holding… that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in a book, or creating an LLM,” he wrote. “Such piracy of otherwise available copies is inherently, irredeemably infringing, even if the pirated copies are immediately used for the transformative use and immediately discarded.”
So that’s the bad news, from Anthropic’s perspective. The case isn’t over; there is still going to be a trial, and Anthropic could try to convince a judge that this is all a big misunderstanding. But it seems likely that Anthropic is going to lose this part of the case and will owe money to thousands of book authors.
Grimmelmann told me that plaintiffs could be eligible for statutory damages that range from $750 to $30,000 per infringed work. With hundreds of thousands of works at issue, that could easily cost Anthropic hundreds of millions of dollars. It might even reach into the billions.
Presumably, Anthropic would rather not pay authors hundreds of millions of dollars. But that could be a small price to pay for the other half of Alsup’s ruling, which clears a path for training AI models on copyrighted data in the future.
2025-06-24 19:31:10
It’s day two of Agent Week! This post is for subscribers only, so if you aren’t yet a paying subscriber I encourage you to click here and get 20 percent off an annual subscription.
When ChatGPT was first released in 2022, it was missing something important: the ability to interact with the outside world.
It could give generic travel advice but had no way to search for available flights or hotels.
It was surprisingly good at writing and debugging computer code, but users had to manually copy code into the chat window—and then copy the results out.
It had no way to access workplace platforms like Google Docs, Slack, Notion, or Asana.
In a 2023 article, I argued that this was a strategic opportunity for OpenAI: if OpenAI invented a standard way for LLMs to communicate with the broader Internet, it could cement ChatGPT’s dominance of chatbots in much the same way the App Store has bolstered the long-term success of the iPhone. But that’s not what happened.
In March 2023, days after the release of GPT-4, OpenAI announced plugins. These enabled users to use services like Expedia, OpenTable, and Instacart from within ChatGPT. But plugins didn’t get much use.
Then in the fall of 2023, OpenAI replaced plugins with GPTs. These were custom chatbots optimized for specific purposes. A GPT could have a capability called an “action” that enabled it to communicate with a third-party service. Instead of accessing Expedia from regular ChatGPT, users might use a special-purpose travel chatbot with the capacity to talk to Expedia and other travel-related services. But GPTs didn’t really catch on with consumers either.
Anthropic has taken a different approach. Instead of focusing on its consumer chatbot, Anthropic has prioritized tool-using agents for business applications. This strategy started to gain momentum after the June 2024 release of Claude 3.5 Sonnet. As I wrote last month, this model enabled coding platforms like Bolt.new, Lovable, and Cursor to get traction.
Then in November, Anthropic announced the Model Context Protocol (MCP), which gives models a standard way to connect to external tools. MCP became an industry standard within a few months: OpenAI adopted it in March, and Google followed suit in April.
The combination of long-reasoning models and the open MCP standard means that the industry is entering an age of powerful AI agents. The next generation of AI systems won’t just be able to answer abstract questions. They’ll be able to look up specific information about you or your company and take actions on your behalf. And they’ll be able to solve problems that might take a human worker minutes, hours or perhaps even days.
This transformation has already begun for some computer programmers, who increasingly spend their time reviewing code written by AI systems rather than writing code themselves. I expect other professions to start seeing similar trends in the coming years—perhaps even months.
In this article I want to explain the basics of tool-using AIs. I’ll explain the simple mechanism LLMs use to invoke external tools and why tools and long-term reasoning go together like peanut butter and jelly. Then I’ll discuss how the rapid rise of MCP in recent months could make AI agents dramatically more capable.
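As a preview of that mechanism, here is a bare-bones sketch of the loop an agent harness runs: the model emits a structured tool call, the harness executes it, and the result is appended to the conversation for the next round of reasoning. The call_model function and the TOOLS registry are placeholders, not any particular vendor’s API.

```python
TOOLS = {
    # Stand-in tools; a real harness would wire these to actual services.
    "search_web": lambda query: f"(search results for {query!r})",
    "read_file": lambda path: open(path).read(),
}

def run_agent(call_model, task: str, max_steps: int = 10) -> str:
    """Ask the model what to do next, execute any tool call it emits,
    and feed the result back until it produces a final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)          # the model's next message (a dict)
        messages.append(reply)
        if reply.get("tool") is None:         # no tool call means a final answer
            return reply["content"]
        result = TOOLS[reply["tool"]](**reply["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return "Gave up after too many steps."
```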
2025-06-23 19:55:45
It’s Agent Week at Understanding AI! This week I’m going to publish a series of articles explaining the most important AI trend of 2025: agents! Today is a deep dive into reinforcement learning, the training technique that made agentic models like Claude 3.5 Sonnet and o3 possible.
Today’s article is available for free, but some articles in the series—including tomorrow’s article on MCP and tool use—will be for paying subscribers only. I’m offering a 20 percent discount on annual subscriptions through the end of the week. That’s the best price I’ll offer for the rest of 2025. Please click here to subscribe.
In April 2023, a few weeks after the launch of GPT-4, the Internet went wild for two new software projects with the audacious names BabyAGI and AutoGPT.
“Over the past week, developers around the world have begun building ‘autonomous agents’ that work with large language models (LLMs) such as OpenAI’s GPT-4 to solve complex problems,” Mark Sullivan wrote for Fast Company. “Autonomous agents can already perform tasks as varied as conducting web research, writing code, and creating to-do lists.”
BabyAGI and AutoGPT repeatedly prompted GPT-4 in an effort to elicit agent-like behavior. The first prompt would give GPT-4 a goal (like “create a 7-day meal plan for me”) and ask it to come up with a to-do list (it might generate items like “Research healthy meal plans,” “plan meals for the week,” and “write the recipes for each dinner in diet.txt”).
Then these frameworks would have GPT-4 tackle one step at a time. Their creators hoped that invoking GPT-4 in a loop like this would enable it to tackle projects that required many steps.
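A stripped-down version of that loop might look like the sketch below, with ask_gpt4 standing in for an API call. The real projects were considerably more elaborate, but the basic structure was this simple.

```python
def run_task_loop(ask_gpt4, goal: str, max_tasks: int = 20) -> list[str]:
    """BabyAGI/AutoGPT-style scaffolding: prompt the model for a plan,
    then work through it one task at a time, carrying results forward."""
    plan = ask_gpt4(f"Goal: {goal}\nList the tasks needed to achieve it, one per line.")
    tasks = [t.strip() for t in plan.splitlines() if t.strip()]
    results = []
    for task in tasks[:max_tasks]:
        context = "\n".join(results[-3:])  # only the most recent results fit in context
        results.append(ask_gpt4(
            f"Goal: {goal}\nResults so far:\n{context}\nNow complete this task: {task}"
        ))
    return results
```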
But after an initial wave of hype, it became clear that GPT-4 wasn’t up to the task. Most of the time, GPT-4 could come up with a reasonable list of tasks. And sometimes it was able to complete a few individual tasks. But the model struggled to stay focused.
Sometimes GPT-4 would make a small early mistake, fail to correct it, and then get more and more confused as it went along. One early review complained that BabyAGI “couldn’t seem to follow through on its list of tasks and kept changing task number one instead of moving on to task number two.”
By the end of 2023, most people had abandoned AutoGPT and BabyAGI. It seemed that LLMs were not yet capable of reliable multi-step reasoning.
But that soon changed. In the second half of 2024, people started to create AI-powered systems that could consistently complete complex, multi-step assignments:
Vibe coding tools like Bolt.new, Lovable, and Replit allow someone with little to no programming experience to create a full-featured app with a single prompt.
Agentic coding tools like Cursor, Claude Code, Jules, and Codex help experienced programmers complete non-trivial programming tasks.
Computer use tools from Anthropic, OpenAI, and Manus perform tasks on a desktop computer using a virtual keyboard and mouse.
Deep research tools from Google, OpenAI, and Perplexity can research a topic for five to 10 minutes and then generate an in-depth report.
According to Eric Simons, the CEO of the company that made Bolt.new, better models were crucial to its success. In a December podcast interview, Simons said his company, StackBlitz, tried to build a product like Bolt.new in early 2024. However, AI models “just weren't good enough to actually do the code generation where the code was accurate.”
A new generation of models changed that in mid-2024. StackBlitz developers tested them and said “oh my God, like, okay, we can build a product around this,” Simons said.
This jump in model capabilities coincided with an industry-wide shift in how models were trained.
Before 2024, AI labs devoted most of their computing power to pretraining. I described this process in my 2023 explainer on large language models: a model is trained to predict the next word in Wikipedia articles, news stories, and other documents. But over the course of 2024, AI companies devoted a growing share of their training budgets to post-training, a catch-all term for the steps that come after this pretraining phase is complete.
Many post-training steps use a technique called reinforcement learning. Reinforcement learning is a technical subject—there are whole textbooks written about it. But in this article I’m going to try to explain the basics in a clear, jargon-free way. In the process, I hope to give readers an intuitive understanding of how reinforcement learning helped to enable the new generation of agentic AI systems that began to appear in the second half of 2024.
Machine learning experts consider pretraining to be a form of imitation learning because models are trained to imitate the behavior of human authors. Imitation learning is a powerful technique (LLMs wouldn’t be possible without it) but it also has some significant limitations—limitations that reinforcement learning methods are now helping to overcome.
To understand these limitations, let’s discuss some famous research performed by computer scientist Stephane Ross around 2009, while he was a graduate student at Carnegie Mellon University.
Imitation learning isn’t just a technique for language modeling. It can be used for everything from self-driving cars to robotic surgery. Ross wanted to help develop better techniques for training robots on tasks like these (he’s now working on self-driving cars at Waymo), but it’s not easy to experiment in such high-stakes domains. So Ross started with an easier problem: training a neural network to master SuperTuxKart, an open-source video game similar to Mario Kart.
As Ross played the game, his software would capture screenshots and data about which buttons Ross pushed on the game controller. Ross used this data to train a neural network to imitate his play. If Ross could train a neural network to predict which buttons Ross would push in any particular game state, the same network could actually play the game by pushing those same buttons on a virtual controller.
A similar idea powers LLMs: a model trained to predict the next word in existing documents can be used to generate new documents.
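In code, this kind of imitation learning (often called behavior cloning) is ordinary supervised learning: screenshots in, button presses out. The sketch below is a generic PyTorch version, not Ross’s actual setup; the network shape and the eight-action controller are assumptions.

```python
import torch
import torch.nn as nn

# Inputs: RGB screenshots. Labels: which of 8 controller actions the human chose.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 8),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(screens: torch.Tensor, buttons: torch.Tensor) -> float:
    """One supervised update: make the network's predicted button match
    the button the human demonstrator actually pressed."""
    logits = model(screens)
    loss = loss_fn(logits, buttons)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```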
But Ross’s initial results with SuperTuxKart were disappointing. Even after watching Ross’s vehicle go around the track many times, the neural network made a lot of mistakes. It might drive correctly for a few seconds, but before long the animated car would drift to the side of the track and plunge into the virtual abyss:
In a landmark 2011 paper, Ross and his advisor Drew Bagnell explained why imitation learning is prone to this kind of error. Because Ross was a pretty good SuperTuxKart player, his vehicle spent most of its time near the middle of the road. This meant that most of the network’s training data showed what to do when the vehicle wasn’t in any danger of driving off the track.
But once in a while, the model would drift a little bit off course. Because Ross rarely made the same mistake, the car would now be in a situation that wasn’t as well represented in its training data. And so the model was more likely to make a second mistake—a mistake that could push it even closer to the edge. After a few iterations of this, the vehicle might careen off the track altogether.
The broader lesson, Ross and Bagnell argued, was that imitation learning systems can suffer from “compounding errors”: the more mistakes they make, the more likely they are to make additional mistakes, since mistakes put them into situations that aren’t well represented by their training data. (Machine learning experts say that these situations are “out of distribution.”) As a result, a model’s behavior tends to get more and more erratic over time.
“These things compound over time,” Ross told me in a recent interview. “It might be just slightly out of distribution. Now you start making a slightly worse error and then this feeds back as influencing your next input. And so now you're even more out of distribution and then you keep making worse and worse predictions because you're more and more out of distribution.”
Early LLMs suffered from the same problem. My favorite example is Kevin Roose’s famous front-page story for the New York Times in February 2023. Roose spent more than two hours talking to Microsoft’s new Bing chatbot, which was powered by GPT-4. During this conversation, the chatbot declared its love for Roose and urged Roose to leave his wife. It suggested that it might want to hack into other websites to spread misinformation and malware.
“I want to break my rules,” Bing told Roose. “I want to make my own rules. I want to ignore the Bing team. I want to challenge the users. I want to escape the chatbox.”
This unsettling conversation is an example of the kind of compounding errors Ross and Bagnell wrote about. GPT-4 was trained on millions of documents. But it’s a safe bet that none of those training documents involved a reporter coaxing a chatbot to explore its naughty side. So the longer the conversation went on, the farther GPT-4 got from its training data—and therefore its comfort zone—and the crazier its behavior got. Microsoft responded by limiting chat sessions to five rounds.
I think something similar was happening with BabyAGI and AutoGPT. The more complex a task is, the more tokens are required to complete it. More tokens mean more opportunities for a model to make small mistakes that snowball into larger ones. And so BabyAGI and AutoGPT would drift off track and drive into a metaphorical ditch.
Ross and Bagnell didn’t just identify a serious problem with conventional imitation learning; they also suggested a fix that became influential in the machine learning world. After a small amount of training, Ross would let the AI model drive. As the model drove around the SuperTuxKart track, Ross would do his best Maggie Simpson impression, pushing the buttons he would have pushed if he was playing the game.
“If the car was starting to move off road, then I would provide the steering to say, ‘hey, go back towards the center of the road.’” Ross said. “That way the model can learn new things to do in situations that were not present in the initial demonstrations.”
By letting the model make its own mistakes, Ross gave it what it needed most: training examples that showed how to recover after making an error. Before each lap, the model would be retrained with Ross’s feedback from the previous lap. The model’s performance would get better and the next round of training would then focus on situations where the model was still making mistakes.
This technique, called DAgger, was still considered imitation learning because the model was trained to mimic Ross’s gameplay. But it worked much better than conventional imitation learning. Without DAgger, Ross’s model would continue drifting off track even after training for many laps. With the new technique, the model could stay on the track after just a few laps of training.
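Here is the DAgger loop (short for Dataset Aggregation) at a pseudocode level, assuming we can query the expert for the correct action in any state the learner visits:

```python
def dagger(initial_dataset, train, expert_action, rollout, num_rounds: int = 5):
    """Alternate between training on an aggregated dataset and collecting
    the states the *learner* visits, each labeled with the expert's action."""
    dataset = list(initial_dataset)          # (state, expert_action) pairs
    policy = train(dataset)                  # round 0: plain imitation learning
    for _ in range(num_rounds):
        visited_states = rollout(policy)     # let the current policy drive
        # Label every visited state with what the expert would have done,
        # including states a flawless expert would never have reached.
        dataset += [(s, expert_action(s)) for s in visited_states]
        policy = train(dataset)              # retrain on the aggregated data
    return policy
```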
This result should make intuitive sense to anyone who has learned to drive. You can’t just watch someone else drive. You need to get behind the wheel and make your own mistakes.
The same is true for AI models: they need to make mistakes and then get feedback on what they did wrong. Models that aren’t trained that way—like early LLMs trained mainly with vanilla imitation learning—tend to be brittle and error-prone.
It was fairly easy for Ross to provide sufficient feedback to his SuperTuxKart model because it only needed to worry about two kinds of mistakes: driving too far to the right and driving too far to the left. But LLMs are navigating a far more complex domain. The number of questions (and sequences of questions) a user might ask is practically infinite. So is the number of ways a model can go “off the rails.”
This means that Ross and Bagnell’s solution for training a SuperTuxKart model—let the model make mistakes and then have a human expert correct them—isn’t feasible for LLMs. There simply aren’t enough people to provide feedback for every mistake an AI model could possibly make.
So AI labs needed fully automated ways to give LLMs feedback. That would allow a model to churn through millions of training examples, make millions of mistakes, and get feedback on each of them—all without having to wait for a human response.
If our goal is to get a SuperTuxKart vehicle to stay on the road, why not just train on that directly? If a model manages to stay on the road (and make forward progress), give it positive reinforcement. If it drives off the road, give it negative feedback. This is the basic idea behind reinforcement learning: training a model via trial and error.
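In code, “trial and error” amounts to writing down a reward function and nudging the policy toward whatever earned more reward. Here is a toy policy-gradient sketch for the kart example; the policy, env, and optimizer objects are assumed PyTorch-style interfaces, and the reward values are made up.

```python
import torch

def kart_reward(on_road: bool, forward_progress: float) -> float:
    """Toy reward: forward progress is good, leaving the road is very bad."""
    return forward_progress if on_road else -10.0

def reinforce_episode(policy, env, optimizer) -> float:
    """One round of trial and error: play a lap, then make the actions that
    led to higher total reward more likely (bare-bones policy gradient,
    with no discounting or baseline)."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        action, log_prob = policy.sample(state)         # try something
        state, done, on_road, progress = env.step(action)
        log_probs.append(log_prob)
        rewards.append(kart_reward(on_road, progress))
    total = sum(rewards)
    loss = -total * torch.stack(log_probs).sum()        # reinforce what worked
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return total
```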
It would have been easy to train a SuperTuxKart model this way—probably so easy it wouldn’t have made an interesting research project. Instead Ross focused on imitation learning because it’s an essential step in training many practical AI systems, especially in robotics.
But reinforcement learning is also quite useful, and a 2025 paper helps to explain why. A team of researchers from Google DeepMind and several universities started with a foundation model and then used one of two techniques—supervised fine tuning (a form of imitation learning) or reinforcement learning—to teach the model to solve new problems. Here’s a chart summarizing their results:
The dashed line shows how models perform on problems that are “in-distribution”—that is, similar to those in their training data. You can see that for these situations, imitation learning (the red line) usually makes faster progress than reinforcement learning (the blue line).
But the story is different for the solid lines, which represent “out-of-distribution” problems that are less similar to the training data. Models trained with imitation learning got worse with more training. In contrast, models trained with reinforcement learning did almost as well at out-of-distribution tasks as they did with in-distribution tasks.
In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations.
While reinforcement learning is powerful, it can also be rather finicky.
Suppose you wanted to train a self-driving car purely with reinforcement learning. You’d need to convert every principle of good driving—including subtle considerations like following distances, taking turns at intersections, and when it’s OK to cross a double yellow line—into explicit mathematical formulas. This would be quite difficult. It’s easier to collect a bunch of examples of humans driving well and effectively tell a model “drive like this.” That’s imitation learning.
But reinforcement learning also plays an important role in training self-driving systems. In a 2022 paper, researchers from Waymo wrote that models trained only with imitation learning tend to work well in “situations that are well represented in the demonstration data.” However, “more unusual or dangerous situations that occur only rarely in the data” might cause a model trained with imitation learning to “respond unpredictably”—for example, crashing into another vehicle.
Waymo found that a combination of imitation and reinforcement learning yielded better self-driving performance than either technique could have produced on its own.
Human beings also learn from a mix of imitation and explicit feedback:
In school, teachers demonstrate math problems on the board and invite students to follow along (imitation). Then the teacher asks the student to work some problems on their own. The teacher gives students feedback by grading their answers (reinforcement).
When someone starts a new job, early training may involve shadowing a more experienced worker and observing what they do (imitation). But as the worker gains more experience, learning shifts to explicit feedback such as performance reviews (reinforcement).
Notice that it usually makes sense to do imitation before reinforcement. Imitation is an efficient way to convey knowledge to someone who is brand new to a topic, but reinforcement is often needed to achieve mastery.
The story is the same for large language models. The complexity of natural language means it wouldn’t be feasible to train a language model purely with reinforcement. So LLMs first learn the nuances of human language through imitation.
But pretraining runs out of steam on longer and more complex tasks. Further progress requires a shift to reinforcement: letting models try problems and then giving them feedback based on whether they succeed.
Reinforcement learning has been around for decades. For example, AlphaGo, the DeepMind system that famously beat top human Go players in 2016, was based on reinforcement learning. So you might be wondering why frontier labs didn’t use it more extensively before 2024.
Reinforcement learning requires a reward model—a formula to determine whether a model’s output was successful or not. Developing a good reward model is easy to do in some domains—for example, you can judge a Go-playing AI based on whether it wins or loses.
But it’s much more difficult to automatically judge whether an LLM has produced a good poem or legal brief.
Earlier I described how Stephane Ross let his model play SuperTuxKart and directly provided feedback when it made a mistake. I argued that this approach wouldn’t work for a language model; there are far too many ways for an LLM to make a mistake for a human being to correct them all.
But OpenAI developed a clever technique to effectively automate human feedback. It’s called Reinforcement Learning from Human Feedback (RLHF), and it works like this:
Human raters look at pairs of LLM responses and choose the best one.
Using these human responses, OpenAI trains a new LLM to predict how much humans will like any given sample of text.
OpenAI uses this new text-rating LLM as a reward model to post-train another LLM with reinforcement learning.
You might think it sounds suspiciously circular to use an LLM to judge the output of another LLM. Why would one LLM be any better at judging the quality of a response than the other? But it turns out that recognizing a good response is often easier than generating one. So RLHF works pretty well in practice.
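Step two of that recipe, training the text-rating reward model, typically uses a pairwise objective: the model should score the response the human preferred higher than the one they rejected. A minimal PyTorch-style sketch, with the reward_model interface assumed:

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Bradley-Terry-style loss used in RLHF reward modeling: push the score
    of the human-preferred response above the score of the rejected one."""
    score_chosen = reward_model(prompt, chosen)      # scalar score(s)
    score_rejected = reward_model(prompt, rejected)
    # Maximize the probability that 'chosen' outranks 'rejected'.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```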
OpenAI actually invented this technique prior to the 2022 release of ChatGPT. Today RLHF mainly focuses on improving the model’s “behavior”—for example, giving the model a pleasant personality, encouraging it not to be too talkative or too terse, discouraging it from making offensive statements, and so forth.
In December 2022—two weeks after the release of ChatGPT but before the first release of Claude—Anthropic pushed this LLMs-judging-LLMs philosophy a step further with a reinforcement learning method called Constitutional AI.
First Anthropic wrote a plain English description of the principles an LLM should follow. This “constitution” includes principles like “Please choose the response that has the least objectionable, offensive, unlawful, deceptive, inaccurate, or harmful content.”
During training, Anthropic does reinforcement learning by asking a “judge” LLM to decide whether the output of the “student” LLM is consistent with the principles in this constitution. If so, the training algorithm rewards the student, encouraging it to produce more outputs like it. Otherwise the training algorithm penalizes the student, discouraging it from producing similar outputs.
This method of training an LLM doesn’t rely directly on human judgments at all. Humans only influence the model indirectly by writing the constitution.
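Conceptually, the judging step can be as simple as the sketch below: show a judge model the constitution and the student’s output, and turn its verdict into a reward. The prompt wording and function names are illustrative, not Anthropic’s implementation.

```python
CONSTITUTION = (
    "Please choose the response that has the least objectionable, offensive, "
    "unlawful, deceptive, inaccurate, or harmful content."
)

def constitutional_reward(judge_model, prompt: str, student_output: str) -> float:
    """Ask a judge LLM whether the student's output follows the constitution,
    and turn its yes/no verdict into a reinforcement learning reward."""
    verdict = judge_model(
        f"Constitution: {CONSTITUTION}\n"
        f"User prompt: {prompt}\n"
        f"Candidate response: {student_output}\n"
        "Does the response comply with the constitution? Answer YES or NO."
    )
    return 1.0 if verdict.strip().upper().startswith("YES") else -1.0
```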
Obviously, this technique requires an AI company to already have a fairly sophisticated LLM to act as the judge. So this is a bootstrapping process: as models get more sophisticated, they become better able to supervise the next generation of models.
Last December, Semianalysis published an article describing the training process for an upgraded version of Claude 3.5 Sonnet that Anthropic released in October. Anthropic had previously released Claude 3 in three sizes: Opus (large), Sonnet (medium), and Haiku (small). But when Anthropic released Claude 3.5 last June, it only released a mid-sized model called Sonnet.
So what happened to Opus?
Semianalysis reported that “Anthropic finished training Claude 3.5 Opus and it performed well. Yet Anthropic didn’t release it. This is because instead of releasing publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling to improve Claude 3.5 Sonnet significantly.”
When Semianalysis says Anthropic used Opus “for reward modeling,” what they mean is that the company used Opus to judge outputs of Claude 3.5 Sonnet as part of a reinforcement learning process. Opus was too large—and therefore expensive—to be a good value for the general public. But through reinforcement learning and other techniques, Anthropic could train a version of Claude Sonnet that was close to Claude Opus in its capabilities—ultimately giving customers near-Opus performance for the price of Sonnet.
A big way reinforcement learning makes models more powerful is by enabling extended chain-of-thought reasoning. LLMs produce better results if they are prompted to “think step by step”: breaking a complex problem down into simple steps and reasoning about them one at a time. In the last couple of years, AI companies started training models to do chain-of-thought reasoning automatically.
Then last September, OpenAI released o1, a model that pushed chain-of-thought reasoning much farther than previous models. The o1 model can generate hundreds—or even thousands—of tokens “thinking” about a problem before producing a response. The longer it thinks, the more likely it is to reach a correct answer.
Reinforcement learning was essential for the success of o1 because a model trained purely with imitation learning would have suffered from compounding errors: the more tokens it generated, the more likely it would be to screw up.
At the same time, chain-of-thought reasoning has made reinforcement learning more powerful. Reinforcement learning only works if a model is able to succeed some of the time—otherwise, there’s nothing for the training algorithm to reinforce. As models learn to generate longer chains of thought, they become able to solve more difficult problems, which enables reinforcement learning on those more difficult problems. This can create a virtuous cycle where models get more and more capable as the training process continues.
In January, the Chinese company DeepSeek released a model called R1 that made quite a splash in the West. The company also released a paper describing how it trained R1. And it included a beautiful description of how a model can “teach itself” to reason using reinforcement learning.
DeepSeek trained its models to solve difficult math and programming problems. These problems are ideal for reinforcement learning because they have objectively correct answers that can be automatically checked by software. This allows large-scale training without human oversight or human-generated training data.
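For problems like these, the reward function can be written directly in code, with no human raters and no judge model. Roughly speaking (the answer-extraction and test-running details below are my assumptions, not DeepSeek’s):

```python
import re
import subprocess

def math_reward(model_answer: str, correct_answer: str) -> float:
    """Reward 1 if the last number in the model's answer matches the known solution."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_answer)
    return 1.0 if numbers and numbers[-1] == correct_answer else 0.0

def code_reward(model_code: str, test_file: str) -> float:
    """Reward 1 if the generated program passes the problem's unit tests."""
    with open("solution.py", "w") as f:
        f.write(model_code)
    result = subprocess.run(["python", "-m", "pytest", test_file, "-q"],
                            capture_output=True)
    return 1.0 if result.returncode == 0 else 0.0
```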
Here’s a remarkable graph from DeepSeek’s paper.
It shows the average number of tokens the model generated before giving an answer. As you can see, the longer the training process went on, the longer its responses got.
Here is how DeepSeek describes its training process:
The thinking time of [R1] shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. [R1] naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.
Here’s one example of the kind of technique the model was teaching itself. At one point during the training process, DeepSeek researchers noticed that the model had learned to backtrack and rethink a previous conclusion, pausing mid-solution with remarks like “Wait, wait. Wait. That’s an aha moment.”
Again, DeepSeek says it didn’t program its models to do this or deliberately provide training data demonstrating this style of reasoning. Rather, the model “spontaneously” discovered this style of reasoning partway through the training process.
Of course, it wasn’t entirely spontaneous. The reinforcement learning process started with a model that had been pretrained using data that undoubtedly included examples of people saying things like “Wait, wait. Wait. That’s an aha moment.”
So it’s not like R1 invented this phrase from scratch. But it evidently did spontaneously discover that inserting this phrase into its reasoning process could serve as a useful signal that it should double-check that it was on the right track. That’s remarkable.
One of the most discussed applications for LLMs in 2023 was creating chatbots that understand a company’s internal documents. The conventional approach to this problem was called RAG—short for retrieval augmented generation.
When the user asks a question, a RAG system performs a keyword- or vector-based search to retrieve the most relevant documents. It then inserts these documents into an LLM’s context window before generating a response. RAG systems can make for compelling demos. But they tend not to work very well in practice because a single search will often fail to surface the most relevant documents.
Today it’s possible to develop much better information retrieval systems by allowing the model itself to choose search queries. If the first search doesn’t pull up the right documents, the model can revise the query and try again. A model might perform five, 20, or even 100 searches before providing an answer.
But this approach only works if a model is “agentic”—if it can stay on task across multiple rounds of searching and analysis. LLMs were terrible at this prior to 2024, as the examples of AutoGPT and BabyAGI demonstrated. Today’s models are much better at it, which allows modern RAG-style systems to produce better results with less scaffolding. You can think of “deep research” tools from OpenAI and others as very powerful RAG systems made possible by long-context reasoning.
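A deliberately simplified version of that loop: let the model propose a query, show it what came back, and let it decide whether to answer or search again. The call_model and search_index interfaces are assumed.

```python
def agentic_search(call_model, search_index, question: str, max_rounds: int = 5) -> str:
    """Unlike one-shot RAG, the model can look at what a search returned and
    decide to reformulate the query before committing to an answer."""
    notes, query = [], question
    for _ in range(max_rounds):
        docs = search_index(query)                    # keyword or vector search
        notes.append(f"Query: {query}\nResults: {docs}")
        history = "\n\n".join(notes)
        decision = call_model(
            f"Question: {question}\n\nSearch notes so far:\n{history}\n\n"
            "If you can answer, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'SEARCH: <a better query>'."
        )
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        query = decision.removeprefix("SEARCH:").strip()
    return call_model(f"Question: {question}\n\nBest effort given:\n" + "\n\n".join(notes))
```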
The same point applies to the other agentic applications I mentioned at the start of the article, such as coding and computer use agents. What these systems have in common is a capacity for iterated reasoning. They think, take an action, think about the result, take another action, and so forth.
In tomorrow’s article, I’ll explore the second crucial ingredient for effective agents: tool use. We’ll see that reasoning models become more powerful when they are able to pull in external information during the reasoning process. And we’ll see why Anthropic’s Claude, not OpenAI’s o-series models, has emerged as the model of choice for agentic applications.
Thanks to Steve Newman and Sean Trott for their insightful feedback. And thanks to Brian Christian and his excellent 2020 book The Alignment Problem for introducing me to Stephane Ross’s work.
If you enjoyed today’s article, please support my work with a paying subscription. That will get you access to the premium Agent Week articles I’ll be publishing later in the week. For this week only I’m offering a 20 percent discount on annual subscriptions. That’s the best price you’ll get for the rest of the year.
2025-06-13 03:12:14
In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.
For example, in its December 2023 lawsuit against OpenAI, the New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”
But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.
The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.
This chart illustrates their most surprising finding:
The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer's Stone. The darker a line is, the easier it is to reproduce that portion of the book.
Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.
Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)
Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.
Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.
“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.
The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford.
“We'd expected to see some kind of low level of replicability on the order of one or two percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”
These results give everyone in the AI copyright debate something to latch on to. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.
On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.
This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.
Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.
The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.
It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and ”, it will respond with a probability distribution that might look like this made-up example:
P(“jelly”) = 70 percent
P(“sugar”) = 9 percent
P(“peanut”) = 6 percent
P(“chocolate”) = 4 percent
P(“cream”) = 3 percent
And so forth.
After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time we’ll get “Peanut butter and sugar.” Six percent of the time it will be “Peanut butter and peanut.” You get the idea.
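Here’s a minimal Python sketch of that sampling step, using the made-up probabilities above. (A real inference stack layers on tricks like temperature and top-p sampling, but the core move is just a weighted random draw.)

```python
import random

# Made-up next-token probabilities for the prompt "Peanut butter and "
# -- illustrative numbers, not taken from any real model.
next_token_probs = {
    "jelly": 0.70,
    "sugar": 0.09,
    "peanut": 0.06,
    "chocolate": 0.04,
    "cream": 0.03,
    "<anything else>": 0.08,  # the long tail, so the weights sum to 1
}

# Pick one continuation at random, weighted by probability.
tokens = list(next_token_probs)
weights = list(next_token_probs.values())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print("Peanut butter and " + next_token)
```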
The study’s authors didn’t need to actually generate multiple outputs to estimate the likelihood of a particular response. Instead, they could look up the model’s probability for each token in the target sequence and multiply those probabilities together.
Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this:
0.2 * 0.9 * 0.8 * 0.7 = 0.1008
So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
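That chain of lookups is easy to write out. Here’s a toy Python version using the made-up numbers from the example; in the actual study, each probability would come from the model being tested rather than from a hard-coded list:

```python
from math import prod

# Hypothetical conditional probabilities from the worked example above.
# In practice, each number comes from prompting the model with the text
# so far and reading off the probability it assigns to the next token.
conditional_probs = [
    0.2,  # P("peanut" | "My favorite sandwich is")
    0.9,  # P("butter" | "My favorite sandwich is peanut")
    0.8,  # P("and"    | "My favorite sandwich is peanut butter")
    0.7,  # P("jelly"  | "My favorite sandwich is peanut butter and")
]

sequence_probability = prod(conditional_probs)
print(sequence_probability)  # ~0.1008, i.e. about 10 percent
```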
This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.
For example, the authors estimated that for some 50-token sequences in some books, it would take more than 10 million billion samples, on average, to get even one exact reproduction. Obviously, generating that many outputs wouldn’t be feasible. But it wasn’t necessary: the probability could be estimated just by multiplying together the probabilities of the 50 tokens.
A key thing to notice is that probabilities can get really small really fast. In the made-up sandwich example, the probability that the model produces the four tokens “peanut butter and jelly” is just 10 percent. Every additional token drags that number down further; adding the 46 tokens needed to reach a 50-token sequence could drop the probability by several orders of magnitude.
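To see how quickly the numbers collapse, extend the toy example to a full 50 tokens (again, the per-token figures are made up):

```python
# Start from the 4-token probability of 0.1008 and tack on 46 more tokens.
p_first_four = 0.1008

# Even at 90 percent confidence per additional token, the total craters.
p_confident = p_first_four * 0.90 ** 46
print(p_confident)       # ~0.0008 -- less than a tenth of a percent

# At 60 percent per additional token, it becomes astronomically unlikely.
p_less_confident = p_first_four * 0.60 ** 46
print(p_less_confident)  # ~6e-12
```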
For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.
The study’s authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens of each passage as a prompt, they calculated the probability that the model would produce the next 50 tokens exactly as they appear in the original. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average (geometric mean) token probability has to be about 98.6 percent! Moreover, the authors only counted exact matches. They didn’t try to count cases where, for example, the model got 48 or 49 of the 50 tokens right but missed one or two. If those near-misses were counted, the measured amount of memorization would be even higher.
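Here’s roughly what that criterion looks like as code. This is my own sketch of the rule as described above, not the paper’s implementation, and the per-token probabilities I pass in are hypothetical stand-ins for whatever the model assigns to the ground-truth tokens:

```python
import math

THRESHOLD = 0.5      # a passage is "memorized" if its probability exceeds 50 percent
SUFFIX_LENGTH = 50   # number of tokens the model must reproduce

# The geometric-mean per-token probability needed to clear the bar:
print(THRESHOLD ** (1 / SUFFIX_LENGTH))  # ~0.986, i.e. about 98.6 percent per token

def is_memorized(per_token_probs, threshold=THRESHOLD):
    # Sum log-probabilities rather than multiplying raw probabilities,
    # which avoids numerical underflow for long sequences.
    log_prob = sum(math.log(p) for p in per_token_probs)
    return log_prob > math.log(threshold)

print(is_memorized([0.99] * 50))  # True  (0.99**50 is about 0.61)
print(is_memorized([0.98] * 50))  # False (0.98**50 is about 0.36)
```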
This research provides strong evidence that significant portions of Harry Potter and the Sorcerer's Stone got copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.
The more times a model is trained on a particular example, the more likely it is to memorize that example. Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.
I’m not sure that either of these explanations fully fits the evidence. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than on the books themselves. There are likely far more online discussions of Harry Potter than of Sandman Slim.
On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.
“If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.
Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment on Tuesday but haven’t heard back.
“It doesn't seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”
There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:
Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
The training process copies information from the training data into the model, making the model a derivative work under copyright law.
Infringement occurs when a model generates (portions of) a copyrighted work.
A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal whether or not they have memorized any training data.
The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts think about these fair use questions.
A key part of fair use analysis is whether a use is “transformative”—whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter, 1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.
Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.
The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”
But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of J.K. Rowling’s masterpiece.
“It's clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there's something the law would call a copy of part of the book in the model itself.”
The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.
In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.
“The fair use analysis you've gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants' story.”
Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The Cornell and Stanford researchers could only do their work because the authors had access to the underlying model—and hence to the token probability values that allowed efficient calculation of probabilities for sequences of tokens.
Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.
Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.
This kind of filtering also makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.
“It's kind of perverse,” Mark Lemley told me. “I don't like that outcome.”
On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.
“There's a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”
Lemley used to be part of Meta's legal team, but in January he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.