2025-06-13 03:12:14
In recent years, numerous plaintiffs—including publishers of books, newspapers, computer code, and photographs—have sued AI companies for training models using copyrighted material. A key question in all of these lawsuits has been how easily AI models produce verbatim excerpts from the plaintiffs’ copyrighted content.
For example, in its December 2023 lawsuit against OpenAI, the New York Times Company produced dozens of examples where GPT-4 exactly reproduced significant passages from Times stories. In its response, OpenAI described this as a “fringe behavior” and a “problem that researchers at OpenAI and elsewhere work hard to address.”
But is it actually a fringe behavior? And have leading AI companies addressed it? New research—focusing on books rather than newspaper articles and on different companies—provides surprising insights into this question. Some of the findings should bolster plaintiffs’ arguments, while others may be more helpful to defendants.
The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models—three from Meta and one each from Microsoft and EleutherAI—were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright.
This chart illustrates their most surprising finding:
The chart shows how easy it is to get a model to generate 50-token excerpts from various parts of Harry Potter and the Sorcerer's Stone. The darker a line is, the easier it is to reproduce that portion of the book.
Each row represents a different model. The three bottom rows are Llama models from Meta. And as you can see, Llama 3.1 70B—a mid-sized model Meta released in July 2024—is far more likely to reproduce Harry Potter text than any of the other four models.
Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)
Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.
Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.
“There are really striking differences among models in terms of how much verbatim text they have memorized,” said James Grimmelmann, a Cornell law professor who has collaborated with several of the paper’s authors.
The results surprised the study’s authors, including Mark Lemley, a law professor at Stanford.1
“We'd expected to see some kind of low level of replicability on the order of one or two percent,” Lemley told me. “The first thing that surprised me is how much variation there is.”
These results give everyone in the AI copyright debate something to latch on to. For AI industry critics, the big takeaway is that—at least for some models and some books—memorization is not a fringe phenomenon.
On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That’s a tiny fraction of the 42 percent figure for Harry Potter.
This could be a headache for law firms that have filed class-action lawsuits against AI companies. Kadrey is the lead plaintiff in a class-action lawsuit against Meta. To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations.
Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta’s favor, since most authors lack the resources to file individual lawsuits.
The broader lesson of this study is that the details will matter in these copyright cases. Too often, online discussions have treated “do generative models copy their training data or merely learn from it?” as a theoretical or even philosophical question. But it’s a question that can be tested empirically—and the answer might differ across models and across copyrighted works.
It’s common to talk about LLMs predicting the next token. But under the hood, what the model actually does is generate a probability distribution over all possibilities for the next token. For example, if you prompt an LLM with the phrase “Peanut butter and ”, it will respond with a probability distribution that might look like this made-up example:
P(“jelly”) = 70 percent
P(“sugar”) = 9 percent
P(“peanut”) = 6 percent
P(“chocolate”) = 4 percent
P(“cream”) = 3 percent
And so forth.
After the model generates a list of probabilities like this, the system will select one of these options at random, weighted by their probabilities. So 70 percent of the time the system will generate “Peanut butter and jelly.” Nine percent of the time we’ll get “Peanut butter and sugar.” Six percent of the time it will be “Peanut butter and peanut.” You get the idea.
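To make that sampling step concrete, here is a minimal Python sketch using the made-up numbers above. Real systems work over a vocabulary of tens of thousands of tokens and often apply extra tricks like temperature or top-p truncation, but the core idea is just weighted random selection:

```python
import random

# Made-up next-token distribution from the example above.
next_token_probs = {
    "jelly": 0.70,
    "sugar": 0.09,
    "peanut": 0.06,
    "chocolate": 0.04,
    "cream": 0.03,
    # ...the remaining 8 percent is spread over thousands of other tokens.
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens = list(probs.keys())
    weights = list(probs.values())
    return random.choices(tokens, weights=weights, k=1)[0]

# Roughly 70 percent of calls return "jelly", 9 percent "sugar", and so on.
print(sample_next_token(next_token_probs))
```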
The study’s authors didn’t have to actually generate multiple outputs to estimate the likelihood of a particular response. Instead, they could calculate probabilities for each token and then multiply them together.
Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this:
0.2 * 0.9 * 0.8 * 0.7 = 0.1008
So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
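Here is that arithmetic as a small Python sketch, using the hypothetical probabilities above. (In practice, code usually sums log-probabilities rather than multiplying raw probabilities, which avoids numerical underflow on long sequences; the result is mathematically the same.)

```python
import math

# Hypothetical conditional probabilities from the worked example above:
# P("peanut"), P("butter" | ...), P("and" | ...), P("jelly" | ...)
token_probs = [0.2, 0.9, 0.8, 0.7]

# Multiplying the conditional probabilities gives the probability of the
# whole continuation "peanut butter and jelly".
print(math.prod(token_probs))  # 0.1008, i.e. about 10 percent

# Equivalent log-space version, which stays numerically stable even for
# 50-token (or longer) sequences.
print(math.exp(sum(math.log(p) for p in token_probs)))  # also ~0.1008
```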
This technique greatly reduced the cost of the research, allowed the authors to analyze more books, and made it feasible to precisely estimate very low probabilities.
For example, the authors estimated that it would take more than 10 million billion samples to exactly reproduce some 50-token sequences from some books. Obviously, it wouldn’t be feasible to actually generate that many outputs. But it wasn’t necessary: the probability could be estimated just by multiplying together probabilities for the 50 tokens.
A key thing to notice is that probabilities can get really small really fast. In my made-up example, the probability that the model will produce the four tokens “peanut butter and jelly” is just 10 percent. If we added more tokens, the probability would get even lower. If we added 46 more tokens, the probability could fall by several orders of magnitude.
For any language model, the probability of generating any given 50-token sequence “by accident” is vanishingly small. If a model generates 50 tokens from a copyrighted work, that is strong evidence the tokens “came from” the training data. This is true even if it only generates those tokens 10 percent, 1 percent, or 0.01 percent of the time.
The study authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens would be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
This definition is quite strict. For a 50-token sequence to have a probability greater than 50 percent, the average token in the passage needs a probability of at least 98.5 percent! Moreover, the authors only counted exact matches. They didn’t try to count cases where—for example—the model generates 48 or 49 tokens from the original passage but got one or two tokens wrong. If these cases were counted, the amount of memorization would be even higher.
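Here is a sketch of that memorization criterion written out in code. The `next_token_logprob` helper is hypothetical; it stands in for looking up the model’s probability of one token given everything before it. This illustrates the criterion described above rather than reproducing the paper’s actual code:

```python
import math

def passage_logprob(passage: list[str], next_token_logprob) -> float:
    """Log-probability of reproducing tokens 50-99 of a 100-token passage,
    given tokens 0-49 as the prompt. next_token_logprob(prefix, token) is a
    hypothetical helper returning log P(token | prefix) from the model."""
    return sum(next_token_logprob(passage[:i], passage[i]) for i in range(50, 100))

def is_memorized(passage: list[str], next_token_logprob) -> bool:
    """A passage counts as 'memorized' if the model reproduces its second
    half verbatim with probability greater than 50 percent."""
    return passage_logprob(passage, next_token_logprob) > math.log(0.5)

# Why this is such a strict bar: for 50 probabilities to multiply out to
# more than 0.5, their geometric mean has to exceed
print(0.5 ** (1 / 50))  # ~0.986, i.e. north of 98.5 percent per token
```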
This research provides strong evidence that significant portions of Harry Potter and the Sorcerer's Stone got copied into the weights of Llama 3.1 70B. But this finding doesn’t tell us why or how this happened. I suspect that part of the answer is that Llama 3 70B was trained on 15 trillion tokens—more than 10 times the 1.4 trillion tokens used to train Llama 1 65B.
The more times a model is trained on a particular example, the more likely it is to memorize that example. Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources—such as online Harry Potter fan forums, consumer book reviews, or student book reports—that included quotes from Harry Potter and other popular books.
I’m not sure that either of these explanations fully fits the facts. The fact that memorization was a much bigger problem for the most popular books does suggest that Llama may have been trained on secondary sources that quote these books rather than on the books themselves. There are likely vastly more online discussions of Harry Potter than of Sandman Slim.
On the other hand, it’s surprising that Llama memorized so much of Harry Potter and the Sorcerer's Stone.
“If it were citations and quotations, you'd expect it to concentrate around a few popular things that everyone quotes or talks about,” Lemley said. The fact that Llama 3 memorized almost half the book suggests that the entire text was well represented in the training data.
Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem. I emailed Meta for comment on Tuesday but haven’t heard back.
“It doesn't seem to be all popular books,” Mark Lemley told me. “Some popular books have this result and not others. It’s hard to come up with a clear story that says why that happened.”
There are actually three distinct theories of how training a model on copyrighted works could infringe copyright:
Training on a copyrighted work is inherently infringing because the training process involves making a digital copy of the work.
The training process copies information from the training data into the model, making the model a derivative work under copyright law.
Infringement occurs when a model generates (portions of) a copyrighted work.
A lot of discussion so far has focused on the first theory because it is the most threatening to AI companies. If the courts uphold this theory, most current LLMs would be illegal whether or not they have memorized any training data.
The AI industry has some pretty strong arguments that using copyrighted works during the training process is fair use under the 2015 Google Books ruling. But the fact that Llama 3.1 70B memorized large portions of Harry Potter could color how the courts think about these fair use questions.
A key part of fair use analysis is whether a use is “transformative”—whether a company has made something new or is merely profiting from the work of others. The fact that language models are capable of regurgitating substantial portions of popular works like Harry Potter, 1984, and The Hobbit could cause judges to look at these fair use arguments more skeptically.
Moreover, one of Google’s key arguments in the books case was that its system was designed to never return more than a short excerpt from any book. If the judge in the Meta lawsuit wanted to distinguish Meta’s arguments from the ones Google made in the books case, he could point to the fact that Llama can generate far more than a few lines of Harry Potter.
The new study “complicates the story that the defendants have been telling in these cases,” co-author Mark Lemley told me. “Which is ‘we just learn word patterns. None of that shows up in the model.’”
But the Harry Potter result creates even more danger for Meta under that second theory—that Llama itself is a derivative copy of J.K. Rowling’s masterpiece.
“It's clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley said. “That suggests to me that probably for some of those books there's something the law would call a copy of part of the book in the model itself.”
The Google Books precedent probably can’t protect Meta against this second legal theory because Google never made its books database available for users to download—Google almost certainly would have lost the case if it had done that.
In principle, Meta could still convince a judge that copying 42 percent of Harry Potter was allowed under the flexible, judge-made doctrine of fair use. But it would be an uphill battle.
“The fair use analysis you've gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’” Lemley said. “That complicates the defendants' story.”
Grimmelmann also said there’s a danger that this research could put open-weight models in greater legal jeopardy than closed-weight ones. The researchers could only do this work because they had access to the underlying models—and hence to the token probability values that allowed them to efficiently calculate probabilities for sequences of tokens.
Most leading labs, including OpenAI, Anthropic, and Google, have increasingly restricted access to these so-called logits, making it more difficult to study these models.
Moreover, if a company keeps model weights on its own servers, it can use filters to try to prevent infringing output from reaching the outside world. So even if the underlying OpenAI, Anthropic, and Google models have memorized copyrighted works in the same way as Llama 3.1 70B, it might be difficult for anyone outside the company to prove it.
This kind of filtering also makes it easier for companies with closed-weight models to invoke the Google Books precedent. In short, copyright law might create a strong disincentive for companies to release open-weight models.
“It's kind of perverse,” Mark Lemley told me. “I don't like that outcome.”
On the other hand, judges might conclude that it would be bad to effectively punish companies for publishing open-weight models.
“There's a degree to which being open and sharing weights is a kind of public service,” Grimmelmann told me. “I could honestly see judges being less skeptical of Meta and others who provide open-weight models.”
Lemley used to be part of Meta's legal team, but in January he dropped them as a client after Facebook adopted more Trump-friendly moderation policies.
2025-05-30 03:19:29
An underrated AI story over the last year has been Anthropic’s success in the market for coding tools.
“We believe coding is extremely important,” said Anthropic engineer Sholto Douglas in an interview last week. “We care a lot about coding. We care a lot about measuring progress on coding. We think it’s the most important leading indicator of model capabilities.”
This focus has paid off. The company’s models have excelled at software engineering since last June’s release of Claude 3.5 Sonnet. Over the last year, a number of Claude-powered coding tools—including Cursor, Windsurf, Bolt.new, and Lovable—have enjoyed explosive growth. In February, Anthropic released a coding assistant called Claude Code that has become popular among programmers.
In media interviews, Anthropic employees have touted the extreme efficiency gains Claude has enabled for Anthropic’s own programmers.
“For me, it’s probably 2x my productivity,” said Anthropic engineer Boris Cherny in a recent podcast episode. “I think there’s some engineers at Anthropic where it’s probably 10x their productivity. And then there are some people that haven’t really figured out how to use it yet.”
Cat Wu, an Anthropic product manager, chimed in with an example: “Sometimes we're in meetings together and sales or compliance or someone is like, hey, like, we really need X feature. And then Boris will ask a few questions to understand the specs. And then like 10 minutes later, he's like, all right, it's built. I'm going to merge it later. Anything else?”
Anthropic’s success in the coding market has gotten the attention of both OpenAI and Google:
In early May, OpenAI announced it was acquiring Windsurf, an AI-powered code editing tool that had been powered by Anthropic models.
The next week, OpenAI announced Codex, a coding agent designed to compete with Anthropic’s Claude Code.
Last week Google announced its own coding agent called Jules.
I suspect one of Anthropic’s major goals for Claude 4, which was released last week, was to maintain its lead in this market. It seems to be helping. Days after the release of Claude 4, the CEO of vibe coding tool Lovable wrote that “Claude 4 just erased most of Lovable's errors.” He posted a chart showing a dramatic drop in syntax errors after Lovable upgraded to Claude 4.
In recent weeks, I’ve talked to a number of software developers and product managers about how AI-powered tools have changed their work. Based on these conversations, I think we’re on the verge of dramatic changes in the way people create software.
In this piece, I’ll survey the new software development tools that have gotten traction in the last year. I’ll start with “vibe coding” tools designed to enable programming novices to build full-featured apps. Then I’ll discuss tools designed for experienced programmers. As we’ll see, the leading tools in both categories owe their success to Claude.
Last week I talked to Anthony Jantzi, a product manager at Gloo, a startup that creates software for churches and other Christian organizations.
“In the olden days we would use Figma to build interactive prototypes where you could click around and it looked like a web app,” Jantzi told me. But he recently started using a vibe coding platform called Bolt.new for prototyping and it dramatically changed how he did his job.
Those old Figma mockups looked like real websites, but a lot of features didn’t actually work. For example, Gloo’s website includes a chatbot, which is beyond the abilities of a Figma mockup. So although Jantzi could solicit user input about the layout of a new feature, he told me it was “impossible to get any kind of good feedback” about its functionality.
Using Bolt, Jantzi created a fully functional clone of Gloo’s website in just a few weeks.
“I basically have built a prototype version of our app in Bolt that I can basically make whatever changes I want to, and put it in front of a potential user and see how they use it,” he told me.
Bolt lets Jantzi add new features with plain English prompts. It takes a fraction of the time it would take with conventional programming tools. But Jantzi said he wouldn’t use Bolt (or other vibe coding tools) to build a shipping software product.
“It's not getting to the level of robustness of an actual app,” he told me. “If I put it out for someone to use with any kind of volume of users it would fall over.”
So Jantzi still needs help from traditional engineers to put new features into production. But testing a feature first in his Bolt sandbox lets Jantzi use their time more efficiently.
“I'm not having my engineers waste time on things users won't want,” he told me.
Jantzi’s story isn’t unusual, according to Eric Simons, CEO of the company behind Bolt.new.
2025-05-27 23:26:29
A lot of models have come out in recent months:
In February, xAI released Grok 3 and OpenAI released GPT-4.5.
In March, Google released Gemini 2.5 Pro.
In April, OpenAI released o3 and GPT-4.1, Meta released Llama 4, and Google released Gemini 2.5 Flash.
Last week, Anthropic released Claude Opus 4 and Claude Sonnet 4.
Last year I made it a practice to do a w…
2025-05-19 18:35:19
I’m excited to publish this guest post by Nick McGreivy, a physicist who last year earned a PhD from Princeton. Nick used to be optimistic that AI could accelerate physics research. But when he tried to apply AI techniques to real physics problems the results were disappointing.
I’ve written before about the Princeton School of AI Safety, which holds that the impact of AI is likely to be similar to that of past general-purpose technologies such as electricity, integrated circuits, and the Internet. I think of this piece from Nick as being in that same intellectual tradition.
—Timothy B. Lee
In 2018, as a second-year PhD student at Princeton studying plasma physics, I decided to switch my research focus to machine learning. I didn’t yet have a specific research project in mind, but I thought I could make a bigger impact by using AI to accelerate physics research. (I was also, quite frankly, motivated by the high salaries in AI.)
I eventually chose to study what AI pioneer Yann LeCun later described as a “pretty hot topic, indeed”: using AI to solve partial differential equations (PDEs). But as I tried to build on what I thought were impressive results, I found that AI methods performed much worse than advertised.
At first, I tried applying a widely-cited AI method called PINN to some fairly simple PDEs, but found it to be unexpectedly brittle. Later, though dozens of papers had claimed that AI methods could solve PDEs faster than standard numerical methods—in some cases as much as a million times faster—I discovered that a large majority of these comparisons were unfair. When I compared these AI methods on equal footing to state-of-the-art numerical methods, whatever narrowly defined advantage AI had usually disappeared.
This experience has led me to question the idea that AI is poised to “accelerate” or even “revolutionize” science. Are we really about to enter what DeepMind calls “a new golden age of AI-enabled scientific discovery,” or has the overall potential of AI in science been exaggerated—much like it was in my subfield?
Many others have identified similar issues. For example, in 2023 DeepMind claimed to have discovered 2.2 million crystal structures, representing “an order-of-magnitude expansion in stable materials known to humanity.” But when materials scientists analyzed these compounds, they found it was “mostly junk” and “respectfully” suggested that the paper “does not report any new materials.”
Separately, Princeton computer scientists Arvind Narayanan and Sayash Kapoor have compiled a list of 648 papers across 30 fields that all make a methodological error called data leakage, in which information from the data used to evaluate a model improperly leaks into the training process. In each case, data leakage leads to overoptimistic results. They argue that AI-based science is facing a “reproducibility crisis.”
Yet AI adoption in scientific research has been rising sharply over the last decade. Computer science has seen the biggest impacts, of course, but other disciplines—physics, chemistry, biology, medicine, and the social sciences—have also seen rapidly increasing AI adoption. Across all scientific publications, rates of AI usage grew from 2 percent in 2015 to almost 8 percent in 2022. It’s harder to find data about the last few years, but there’s every reason to think that hockey stick growth has continued.
To be clear, AI can drive scientific breakthroughs. My concern is about their magnitude and frequency. Has AI really shown enough potential to justify such a massive shift in talent, training, time, and money away from existing research directions and towards a single paradigm?
Every field of science is experiencing AI differently, so we should be cautious about making generalizations. I’m convinced, however, that some of the lessons from my experience are broadly applicable across science:
AI adoption is exploding among scientists less because it benefits science and more because it benefits the scientists themselves.
Because AI researchers almost never publish negative results, AI-for-science is experiencing survivorship bias.
The positive results that get published tend to be overly optimistic about AI’s potential.
As a result, I’ve come to believe that AI has generally been less successful and revolutionary in science than it appears to be.
Ultimately, I don’t know whether AI will reverse the decades-long trend of declining scientific productivity and stagnating (or even decelerating) rates of scientific progress. I don’t think anyone does. But barring major (and in my opinion unlikely) breakthroughs in advanced AI, I expect AI to be much more a normal tool of incremental, uneven scientific progress than a revolutionary one.
In the summer of 2019, I got a first taste of what would become my dissertation topic: solving PDEs with AI. PDEs are mathematical equations used to model a wide range of physical systems, and solving (i.e., simulating) them is an extremely important task in computational physics and engineering. My lab uses PDEs to model the behavior of plasmas, such as inside fusion reactors and in the interstellar medium of outer space.
The AI models being used to solve PDEs are custom deep learning models, much more analogous to AlphaFold than ChatGPT.
The first approach I tried was something called the physics-informed neural network, or PINN. PINNs had recently been introduced in an influential paper that had already racked up hundreds of citations.
PINNs were a radically different way of solving PDEs compared to standard numerical methods. Standard methods represent a PDE solution as a set of pixels (like in an image or video) and derive equations for each pixel value. In contrast, PINNs represent the PDE solution as a neural network and put the equations into the loss function.
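As a rough illustration of what that looks like in practice, here is a minimal PINN sketch in PyTorch for the 1D Burgers’ equation, one of the examples in the original paper. This is a toy illustration of the idea—the network itself is the solution u(x, t), and the PDE residual goes into the loss—not the original authors’ implementation, and boundary-condition terms are omitted for brevity:

```python
import torch
import torch.nn as nn

# The network represents the solution u(x, t) of the 1D Burgers' equation
# u_t + u * u_x = nu * u_xx.
net = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
nu = 0.01 / torch.pi  # a common viscosity choice for this benchmark

def pde_residual(x, t):
    """How badly the network violates the PDE at the points (x, t)."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    ones = torch.ones_like(u)
    u_x = torch.autograd.grad(u, x, ones, create_graph=True)[0]
    u_t = torch.autograd.grad(u, t, ones, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(5000):
    # Random collocation points in x in [-1, 1], t in [0, 1].
    x = torch.rand(256, 1) * 2 - 1
    t = torch.rand(256, 1)
    # Initial condition u(x, 0) = -sin(pi * x).
    x0 = torch.rand(128, 1) * 2 - 1
    u0_pred = net(torch.cat([x0, torch.zeros_like(x0)], dim=1))
    # (Boundary-condition terms omitted to keep the sketch short.)
    loss = (pde_residual(x, t) ** 2).mean() + \
           ((u0_pred + torch.sin(torch.pi * x0)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```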
As a naive grad student who didn’t even have an advisor yet, there was something incredibly appealing to me about PINNs. They just seemed so simple, elegant, and general.
They also seemed to have good results. The paper introducing PINNs found that their “effectiveness” had been “demonstrated through a collection of classical problems in fluids, quantum mechanics, reaction-diffusion systems, and the propagation of nonlinear shallow-water waves.” If PINNs had solved all these PDEs, I figured, then surely they could solve some of the plasma physics PDEs that my lab cared about.
But when I replaced one of the examples from that influential first paper (1D Burgers’) with a different, but still extremely simple, PDE (1D Vlasov), the results didn’t look anything like the exact solution. Eventually, after extensive tuning, I was able to get something that looked correct. However, when I tried slightly more complex PDEs (such as 1D Vlasov-Poisson), no amount of tuning could give me a decent solution.
After a few weeks of failure, I messaged a friend at a different university, who told me that he too had tried using PINNs, but hadn’t been able to get good results.
Eventually, I realized what had gone wrong. The authors of the original PINN paper had, like me, “observed that specific settings that yielded impressive results for one equation could fail for another.” But because they wanted to convince readers of how exciting PINNs were, they hadn’t shown any examples of PINNs failing.
This experience taught me a few things. First, to be cautious about taking AI research at face value. Most scientists aren’t trying to mislead anyone, but because they face strong incentives to present favorable results, there’s still a risk that you’ll be misled. Moving forward, I would have to be more skeptical, even (or perhaps especially) of high-impact papers with impressive results.
Second, people rarely publish papers about when AI methods fail, only when they succeed. The authors of the original PINN paper didn’t publish about the PDEs their method hadn’t been able to solve. I didn’t publish my unsuccessful experiments, presenting only a poster at an obscure conference. So very few researchers heard about them. In fact, despite the huge popularity of PINNs, it took two years for anyone to publish a paper about their failure modes. That paper now has over a thousand citations, suggesting that many other scientists tried PINNs and found similar issues.
Third, I concluded that PINNs weren’t the approach I wanted to use. They were simple and elegant, sure, but they were also far too unreliable, too finicky, and too slow.
As of today, six years later, the original PINN paper has a whopping 14,000 citations, making it the most cited numerical methods paper of the 21st century (and, by my count, a year or two away from becoming the second most-cited numerical methods paper of all time).
Though it’s now widely accepted that PINNs generally aren’t competitive with standard numerical methods for solving PDEs, there remains debate over how well PINNs perform for a different class of problems known as inverse problems. Advocates claim that PINNs are “particularly effective” for inverse problems, but some researchers have vigorously contested that idea.
I don’t know which side of the debate is right. I’d like to think that something useful has come from all this PINN research, but I also wouldn’t be surprised if one day we look back on PINNs as simply a massive citation bubble.
For my dissertation, I focused on solving PDEs using deep learning models that, like traditional solvers, treated the PDE solution as a set of pixels on a grid or a graph.
Unlike PINNs, this approach had shown a lot of promise on the complex, time-dependent PDEs that my lab cared about. Most impressively, paper after paper had demonstrated the ability to solve PDEs faster—often orders of magnitude faster—than standard numerical methods.
The examples that excited my advisor and me the most were PDEs from fluid mechanics, such as the Navier-Stokes equations. We thought we might see similar speedups because the PDEs we cared about—equations describing plasmas in fusion reactors, for example—have a similar mathematical structure. In theory, this could allow scientists and engineers like us to simulate larger systems, more rapidly optimize existing designs, and ultimately accelerate the pace of research.
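For intuition, here is a hypothetical sketch of what such a grid-based surrogate can look like: a small convolutional network that advances a discretized 1D solution by one time step, trained (training loop not shown) against trajectories produced by a conventional solver. This is a generic illustration of the approach, not the specific models from my dissertation:

```python
import torch
import torch.nn as nn

# A small CNN "time stepper": it takes the discretized solution at time t
# (a 1D grid of values, treated like a one-channel image) and predicts the
# solution at time t + dt.
stepper = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=5, padding=2),
)

def rollout(u0: torch.Tensor, n_steps: int) -> torch.Tensor:
    """Autoregressively advance an initial state of shape (batch, 1, nx)."""
    states = [u0]
    for _ in range(n_steps):
        states.append(states[-1] + stepper(states[-1]))  # residual update
    return torch.stack(states, dim=1)

# The hoped-for payoff is that each learned step can be much larger than the
# time step a stable numerical scheme would allow, which is where the
# claimed speedups come from.
```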
By this point, I was seasoned enough to know that in AI research, things aren’t always as rosy as they seem. I knew that reliability and robustness might be serious issues. If AI models give faster simulations, but those simulations are less reliable, would that be worth the trade-off? I didn’t know the answer and set out to find out.
But as I tried—and mostly failed—to make these models more reliable, I began to question how much promise AI models had really shown for accelerating PDEs.
According to a number of high-profile papers, AI had solved the Navier-Stokes equations orders of magnitude faster than standard numerical methods. I eventually discovered, however, that the baseline methods used in these papers were not the fastest numerical methods available. When I compared AI to more advanced numerical methods, I found that AI was no faster (or at most, only slightly faster) than the stronger baselines.
My advisor and I eventually published a systematic review of research using AI to solve PDEs from fluid mechanics. We found that 60 out of the 76 papers (79 percent) that claimed to outperform a standard numerical method had used a weak baseline, either because they hadn’t compared to more advanced numerical methods, or because they weren’t comparing them on an equal footing. Papers with large speedups all compared to weak baselines, suggesting that the more impressive the result, the more likely the paper had made an unfair comparison.
We also found evidence, once again, that researchers tend not to report negative results, an effect known as reporting bias. We ultimately concluded that AI-for-PDE-solving research is overoptimistic: “weak baselines lead to overly positive results, while reporting biases lead to under-reporting of negative results.”
These findings sparked a debate about AI in computational science and engineering:
Lorena Barba, a professor at GWU who has previously discussed poor research practices in what she has called “Scientific Machine Learning to Fool the Masses,” saw our results as “solid evidence supporting our concerns in the computational science community over the hype and unscientific optimism” of AI.
Stephan Hoyer, the lead of a team at Google Research that independently reached similar conclusions, described our paper as “a nice summary of why I moved on from [AI] for PDEs” to weather prediction and climate modeling, applications of AI that seem more promising.
Johannes Brandstetter, a professor at JKU Linz and co-founder of a startup that provides “AI-driven physics simulations”, argued that AI might achieve better results for more complex industrial applications and that “the future of the field remains undeniably promising and brimming with potential impact.”
In my opinion, AI might eventually prove useful for certain applications related to solving PDEs, but I currently don’t see much reason for optimism. I’d like to see a lot more focus on trying to match the reliability of numerical methods and on red-teaming AI methods; right now, they have neither the theoretical guarantees nor the empirically validated robustness of standard numerical methods.
I’d also like to see funding agencies incentivize scientists to create challenge problems for PDEs. A good model could be CASP, a biennial protein folding competition that helped to motivate and focus research in this area over the last 30 years.
Besides protein folding, the canonical example of a scientific breakthrough from AI, a few examples of scientific progress from AI include:1
Weather forecasting, where AI forecasts have had up to 20% higher accuracy (though still lower resolution) compared to traditional physics-based forecasts.
Drug discovery, where preliminary data suggests that AI-discovered drugs have been more successful in Phase I (but not Phase II) clinical trials. If the trend holds, this would imply a nearly twofold increase in end-to-end drug approval rates.
But AI companies, academic and governmental organizations, and media outlets increasingly present AI not only as a useful scientific tool, but one that “will have a transformational impact” on science.
I don’t think we should necessarily dismiss these statements. While current LLMs, according to DeepMind, “still struggle with the deeper creativity and reasoning that human scientists rely on”, hypothetical advanced AI systems might one day be capable of fully automating the scientific process. I don’t expect that to happen anytime soon—if ever. But if such systems are created, there’s no doubt they would transform and accelerate science.
However, based on some of the lessons from my research experience, I think we should be pretty skeptical of the idea that more conventional AI techniques are on pace to significantly accelerate scientific progress.
Most narratives about AI accelerating science come from AI companies or scientists working on AI who benefit, directly or indirectly, from those narratives. For example, NVIDIA CEO Jensen Huang talks about how “AI will drive scientific breakthroughs” and “accelerate science by a million-X.” NVIDIA, whose financial conflicts of interest make them a particularly unreliable narrator, regularly makes hyperbolic statements about AI in science.
You might think that the rising adoption of AI by scientists is evidence of AI’s usefulness in science. After all, if AI usage in scientific research is growing exponentially, it must be because scientists find it useful, right?
I’m not so sure. In fact, I suspect that scientists are switching to AI less because it benefits science, and more because it benefits them.2
Consider my motives for switching to AI in 2018. While I sincerely thought that AI might be useful in plasma physics, I was mainly motivated by higher salaries, better job prospects, and academic prestige. I also noticed that higher-ups at my lab usually seemed more interested in the fundraising potential of AI than technical considerations.
Later research found that scientists who use AI are more likely to publish top-cited papers and receive on average three times as many citations. With such strong incentives to use AI, it isn’t surprising that so many scientists are doing so.
So even when AI achieves genuinely impressive results in science, that doesn’t mean that AI has done something useful for science. More often, it reflects only the potential of AI to be useful down the road.
This is because scientists working on AI (myself included) often work backwards. Instead of identifying a problem and then trying to find a solution, we start by assuming that AI will be the solution and then looking for problems to solve. But because it’s difficult to identify open scientific challenges that can be solved using AI, this “hammer in search of a nail” style of science means that researchers will often tackle problems which are suitable for using AI but which either have already been solved or don't create new scientific knowledge.
To accurately evaluate the impacts of AI in science, we need to actually look at the science. But unfortunately, the scientific literature is not a reliable source for evaluating the success of AI in science.
One issue is survivorship bias. Because AI research, in the words of one researcher, has “nearly complete non-publication of negative results,” we usually only see the successes of AI in science and not the failures. But without negative results, our attempts to evaluate the impacts of AI in science typically get distorted.
As anyone who’s studied the replication crisis knows, survivorship bias is a major issue in science. Usually, the culprit is a selection process in which results that are not statistically significant are filtered from the scientific literature.
For example, consider the distribution of z-values reported in medical research. A z-value between -1.96 and 1.96 indicates that a result is not statistically significant. Published z-values show a sharp discontinuity around these cutoffs, which suggests that many scientists either didn’t publish results that fell between them or massaged their data until they cleared the threshold of statistical significance.
The problem is that if researchers fail to publish negative results, it can cause medical practitioners and the general public to overestimate the effectiveness of medical treatments.
Something similar has been happening in AI-for-science, though the selection process is based not on statistical significance but on whether the proposed method outperforms other approaches or successfully performs some novel task. This means that AI-for-science researchers almost always report successes of AI, and rarely publish results when AI isn’t successful.
A second issue is that pitfalls often cause the successful results that do get published to reach overly optimistic conclusions about AI in science. The details and severity seem to differ between fields, but pitfalls mostly have fallen into one of four categories: data leakage, weak baselines, cherry-picking, and misreporting.
While the causes of this tendency towards overoptimism are complex, the core issue appears to be a conflict of interest in which the same people who evaluate AI models also benefit from those evaluations.
These issues seem to be bad enough that I encourage people to treat impressive results in AI-for-science the same way we treat surprising results in nutrition science: with instinctive skepticism.
Correction: This article originally stated that it took four years for anyone to publish a paper about the failure mode of PINNs, but I had overlooked an earlier paper. The story has been updated.
Early drafts of this article gave three examples here, including a paper by MIT graduate student Aidan Toner-Rodgers about the use of AI to discover new materials. That paper had been described as “the best paper written so far about the impact of AI on scientific discovery”. But then MIT announced that it was seeking the retraction of the paper due to concerns “about the integrity of the research.” Of course, allegations of outright fraud are a different issue than the subtler methodological problems I focus on in my article. But the fact that this paper got so much traction in the media underscores my broader point that researchers have a variety of incentives to exaggerate the effectiveness of AI techniques.
When I talk about scientists using AI, I mean training or using special-purpose AI models such as PINNs or AlphaFold. I’m not talking about using an LLM to help write grant proposals or do basic background research.
2025-05-02 02:25:10
Last month, a team of prominent AI researchers and Internet writers published AI 2027, a website predicting that AI will dramatically change the world within the next three years. As long-time readers might expect, I did not find these predictions convincing. But rather than writing a full analysis of the AI 2027 scenario, I thought I’d highlight three …
2025-04-30 19:07:07
In September 2022, a few months before the release of ChatGPT, podcaster Guy Raz asked Sam Altman why OpenAI had been founded as a non-profit organization. Altman said OpenAI was trying to develop artificial general intelligence, which Altman thought “really does deserve to belong to the world as a whole.”
“It’s gonna have such a profound impact on all of us that I think we deserve, like we globally, all the people, all of humanity, deserve a say over how it’s used, what happens, what the rules are,” Altman said. He described himself as “very pro-capitalism” but said that “AGI is sort of an exception to that.”
Altman and other OpenAI leaders have been saying stuff like this since the organization was founded. And it was more than just talk: OpenAI is one of the few prominent tech companies to be organized as a non-profit organization rather than a for-profit company. Or more precisely, OpenAI today is a non-profit organization that controls a for-profit subsidiary that’s also named OpenAI.
In recent months, the for-profit subsidiary has been raising billions of dollars to fund its next generation of AI models. And investors have gotten increasingly nervous that OpenAI’s unconventional structure could prevent them from getting a financial return.
To address those fears, OpenAI is trying to convert itself into a more conventional for-profit company. Under a proposal announced last December, the non-profit parent would give up control over OpenAI’s technology in exchange for tens of billions of dollars it could use for charitable purposes. In a recent blog post, OpenAI boasted that such a transaction would create “the world’s best-equipped nonprofit.”
But opponents believe that this would betray the commitments OpenAI made when it was founded in 2015.
“Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return,” the founders wrote in the 2015 blog post announcing the creation of OpenAI. “Since our research is free from financial obligations, we can better focus on a positive human impact.”
Transforming OpenAI into a for-profit company would run directly counter to this founding vision. The question is whether statements like this are legally enforceable.
Last year Elon Musk, who co-founded OpenAI and provided much of its early funding, sued OpenAI, arguing that a for-profit conversion would violate commitments Altman made to Musk at the time OpenAI was founded.
That lawsuit is currently being heard by a federal judge who seems sympathetic to Musk’s concerns. However, the judge could rule that Musk lacks standing to bring the lawsuit because he would not personally be harmed by OpenAI transforming itself into a for-profit company.
But two government officials almost certainly do have standing: California attorney general Rob Bonta and Delaware attorney general Kathy Jennings. So far, both officials have been noncommittal about the issue, though we know that Jennings is looking into it. If one of them decided to sue OpenAI, they’d have a much better chance than Musk of blocking a for-profit conversion.
In 2018, OpenAI published a charter promising to try to “ensure that artificial general intelligence benefits all of humanity.” The document warned that AI development could become “a competitive race without time for adequate safety precautions.” To avoid that outcome, OpenAI pledged that “if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project.”
These and other commitments made in OpenAI’s charter “were considered to be binding and were taken extremely seriously” inside the company, according to a group of former OpenAI employees. The charter “was consistently emphasized by senior leadership at company meetings and in informal conversations.”
By 2019, it had become clear that the most promising path to AGI involved scaling up large language models, a project that would cost billions of dollars. So OpenAI created a for-profit subsidiary that received a $1 billion investment from Microsoft. To avoid undermining its founding principles, OpenAI inserted some unconventional terms in its investment agreements.
“It would be wise to view any investment in OpenAI Global LLC in the spirit of a donation,” OpenAI warned investors like Microsoft. “The Company may never make a profit, and the Company is under no obligation to do so.”
The non-profit parent company retained full control of the for-profit subsidiary. OpenAI capped the amount of profit Microsoft could earn from its investment. Microsoft got a license to OpenAI’s current and future technology, but the license didn’t include AGI—and OpenAI’s board got to decide what counted as AGI.
In June 2023, Bloomberg’s Emily Chang asked Sam Altman why the public should trust him.
“No one person should be trusted here,” Altman said. “I don’t have super-voting shares. I don’t want them. The board can fire me. I think that’s important.”
“The reason for our structure, and the reason it’s so weird, is we think this technology, the benefits, the access to it, the governance of it, belongs to humanity as a whole. If this really works, it’s quite a powerful technology. You should not trust one company and certainly not one person with it.”
A few months later, the board did fire Altman—or at least it tried to fire him. On the Friday before Thanksgiving, the board announced that Altman had been terminated because he “was not consistently candid in his communications with the board.”
Altman fought back and quickly got Microsoft CEO Satya Nadella on his side. The two threatened that if Altman wasn’t reinstated, he would take a job at Microsoft and bring most of OpenAI’s staff with him. Faced with the potential disintegration of OpenAI, the board surrendered. Altman not only got his job back, he got a new, more deferential board.
OpenAI’s culture seems to have shifted dramatically since Altman’s return. In 2024, OpenAI suffered a series of departures by safety-minded employees, including the leaders of its Superalignment team. OpenAI had pledged to give this team 20 percent of the company’s computing resources to work on AI safety, but insiders say Altman never made good on that promise.
The OpenAI charter warns about the dangers of a “competitive race without time for adequate safety precautions.” That seems to be exactly the situation the industry is in right now.
“According to eight people familiar with OpenAI’s testing processes, the start-up’s tests have become less thorough, with insufficient time and resources dedicated to identifying and mitigating risks,” the Financial Times reported earlier this month. According to the FT, “staff and third-party groups have recently been given just days” to evaluate new models for safety. For comparison, OpenAI spent six months testing GPT-4 before releasing it to the public in 2023.
The OpenAI charter says the company should try to tap the brakes in this kind of situation. But OpenAI seems to be doing the opposite.
In a December 2024 blog post titled “Why OpenAI’s structure must evolve to advance our mission,” OpenAI announced that it intended to “transform our existing for-profit into a Delaware Public Benefit Corporation with ordinary shares of stock and the OpenAI mission as its public benefit interest.” The change, OpenAI said, would “enable us to raise the necessary capital” to stay on the cutting edge of AI development.
A public benefit corporation is a special kind of for-profit company that pledges to pursue goals beyond profit. Two of OpenAI’s leading competitors, Anthropic and xAI, are organized as PBCs. Anthropic is particularly known for its commitment to the safe development of AI technology, so there’s no inherent conflict between a PBC structure and a pro-safety culture.
But a PBC is still a for-profit company. In theory, a PBC might be obligated to serve the public, but there’s no mechanism to force a PBC to live up to such an obligation. So if OpenAI strips its non-profit parent of control, the newly independent for-profit will face the same commercial pressures as other companies.
Such a shift seems hard to reconcile with the pledges OpenAI has made over the last decade. The company would no longer be accountable to “humanity as a whole.” It would be accountable to a specific group of profit-minded investors.
In recent months, investors have poured billions of dollars into OpenAI. In the process, they’ve ratcheted up the pressure for OpenAI to complete a for-profit transformation.
OpenAI raised $6.6 billion last fall. The New York Times has reported that “if OpenAI did not change its corporate structure within two years, the investment would convert into debt.” That change would “put the company in a much riskier situation.”
Then last month, OpenAI raised $40 billion in a deal led by SoftBank. SoftBank pledged $30 billion, but there was a catch: if OpenAI failed to convert to a for-profit company by the end of 2025, SoftBank’s investment would be reduced by $10 billion.
Over the last decade, a lot of people supported OpenAI because they believed in its idealistic mission. Some early employees turned down higher salaries at big technology companies because they believed they could do more good at OpenAI. Some of them now feel burned as OpenAI prepares to renege on those earlier promises and convert itself into a conventional for-profit company.
The question is whether anyone can force OpenAI to honor its original commitments.
Elon Musk co-founded OpenAI and was its biggest funder during the first few years. Indeed, Musk was so prominent that many early news stories described OpenAI as an Elon Musk project. But Musk left the organization in 2018 after a feud with Sam Altman. In the years that followed, he became increasingly critical of Altman’s leadership.
Last year, Musk sued OpenAI, arguing that Altman had duped him into donating more than $40 million to OpenAI over five years.
“Altman feigned altruism to convince Musk into giving him free start-up capital and recruiting top AI scientists to develop technological assets from which defendants would stand to make billions,” Musk’s lawyers wrote. Now, Musk argues, Altman is trying to renege on commitments he made to Musk during OpenAI’s early years.
To win a lawsuit, a plaintiff doesn’t just need to show that a defendant did something illegal. The plaintiff must also show that he has standing. And that may not be easy.
For example, Dana Brakman Reiser of Brooklyn Law School argues that Musk is “surely the wrong person” to bring a lawsuit against OpenAI.
“Once gifts have been made, donated assets are no longer donors’ property, and they lose the authority to sue to protect them,” Reiser wrote last year. Reiser argued that strict enforcement of standing rules is necessary to shield nonprofit organizations from frivolous lawsuits.
A donor can’t sue simply because a non-profit used money differently than he expected. To gain standing, a donor needs to have a legally binding commitment from the non-profit promising to use money in a specific way.
Musk argues that his early email conversations with Altman created such a commitment. But OpenAI disputes that, arguing that those early discussions were too abstract and speculative to count as a binding contract.
In March, Judge Yvonne Gonzalez Rogers denied Musk’s request for a preliminary injunction that would have blocked OpenAI from converting to a for-profit entity. But her order signaled that Rogers had some sympathy for Musk’s point of view. She described it as a “toss up” whether OpenAI had made a legally binding commitment to Musk.
“Whether Musk’s emails and social media posts constitute a writing sufficient to constitute an actual contract or charitable trust between the parties is debatable,” she wrote. “The email exchanges convey early communications regarding altruistic motives of OpenAI’s early days and even include reassurances about those motives from Altman and [OpenAI co-founder and president Greg] Brockman when they perceived Musk as upset.”
At the same time, Judge Rogers wrote, “the emails do not by themselves necessarily demonstrate a likelihood of success.”
Later in her opinion, she wrote that “significant and irreparable harm is incurred when the public’s money is used to fund a non-profit’s conversion into a for-profit.” This suggests she is sympathetic to at least some of Musk’s legal arguments. But she might still conclude that Musk doesn’t have the legal standing required to win the lawsuit.
Charities are supposed to serve the public interest, but it would create too much chaos to allow any member of the public to sue non-profits on behalf of the public. This is why most states give the attorney general the power to file lawsuits on behalf of the public.
For OpenAI, the relevant states are Delaware (where OpenAI was incorporated) and California (where OpenAI has its headquarters). If the attorney general of either state wanted to sue OpenAI, they would very likely have standing to do so.
In December, Delaware attorney general Kathy Jennings notified Judge Rogers that she was looking into the legality of OpenAI converting itself into a for-profit company.
“The Delaware Attorney General has authority to review the Proposed Transaction for compliance with Delaware law by ensuring, among other things, that the Proposed Transaction accords with OpenAI’s charitable purpose and the fiduciary duties of OpenAI’s board of directors,” Jennings wrote. She added that she “has not yet concluded her review or reached any conclusions.”
Experts have told me that if Jennings did intervene, she wouldn’t have any trouble establishing standing. Rather, such a lawsuit would focus on the merits: would converting OpenAI to a for-profit company be in the public interest? And more specifically, would it be consistent with the charitable purpose described in OpenAI’s founding documents?
It’s likely that California attorney general Rob Bonta could also intervene if he wanted to. But Bonta seems less interested. In November, Musk tried to draft Bonta into the case by naming him as an “involuntary plaintiff” in the lawsuit. Bonta responded with a motion to be dismissed from the case. That doesn’t necessarily mean Bonta won’t file his own lawsuit in the future, but it doesn’t seem like a promising sign for opponents of OpenAI’s for-profit pivot.
If neither attorney general decides to intervene, the courts might decide that Musk and other private parties lack standing to bring a lawsuit. In that case, OpenAI would be free to convert itself into a for-profit company even if doing so would be contrary to every promise its leaders made in the early years of the company.