2025-10-09 04:56:04
On September 25, Meta announced Vibes, a “new way to discover, create, and share AI videos.” The next week, OpenAI announced a new app called Sora for creating and sharing AI-generated videos.
The public reaction to these launches could not have been more different. As I write this, the iOS App Store ranks Sora as its number one free app. The Meta AI app…
2025-10-01 21:03:30
We’re pleased to publish this guest post from Deena Mousa, a researcher at Open Philanthropy. If you’d like to read more of her work, please subscribe to her Substack or follow her on Twitter.
The piece was originally published by Works in Progress, a truly excellent magazine, which is part of Stripe. You can subscribe to their email newsletter or (if you live in the US or UK) their new print edition.
CheXNet can detect pneumonia with greater accuracy than a panel of board-certified radiologists. It is an AI model released in 2017, trained on more than 100,000 chest X-rays. It is fast, free, and can run on a single consumer-grade GPU. A hospital can use it to classify a new scan in under a second.
Since then, companies like Annalise.ai, Lunit, Aidoc, and Qure.ai have released models that can detect hundreds of diseases across multiple types of scans with greater accuracy and speed than human radiologists in benchmark tests. Some products can reorder radiologist worklists to prioritize critical cases, suggest next steps for care teams, or generate structured draft reports that fit into hospital record systems. A few, like IDx-DR, are even cleared to operate without a physician reading the image at all. In total, there are over 700 FDA-cleared radiology models, which account for roughly three-quarters of all medical AI devices.
Radiology is a field optimized for human replacement, where digital inputs, pattern recognition tasks, and clear benchmarks predominate. In 2016, Geoffrey Hinton – computer scientist and Turing Award winner – declared that “people should stop training radiologists now.” If the most extreme predictions about the effect of AI on employment and wages were true, then radiology should be the canary in the coal mine.
But demand for human labor is higher than ever. In 2025, American diagnostic radiology residency programs offered a record 1,208 positions across all radiology specialties, a four percent increase from 2024, and the field’s vacancy rates are at all-time highs. In 2025, radiology was the second-highest-paid medical specialty in the country, with an average income of $520,000, over 48 percent higher than the average salary in 2015.
Three things explain this. First, while models beat humans on benchmarks, the standardized tests designed to measure AI performance, they struggle to replicate this performance in hospital conditions. Most tools can only diagnose abnormalities that are common in training data, and models often don’t work as well outside of their test conditions. Second, attempts to give models more tasks have run into legal hurdles: regulators and medical insurers so far are reluctant to approve or cover fully autonomous radiology models. Third, even when they do diagnose accurately, models replace only a small share of a radiologist’s job. Human radiologists spend a minority of their time on diagnostics and the majority on other activities, like talking to patients and fellow clinicians.
Artificial intelligence is rapidly spreading across the economy and society. But radiology shows us that it will not necessarily dominate every field in its first years of diffusion — at least until these common hurdles are overcome. Exploiting all of its benefits will involve adapting it to society, and society’s rules to it.
All AIs are functions or algorithms, called models, that take in inputs and spit out outputs. Radiology models are trained to detect a finding, which is a measurable piece of evidence that helps identify or rule out a disease or condition. Most radiology models detect a single finding or condition in one type of image. For example, a model might look at a chest CT and answer whether there are lung nodules, whether there are rib fractures, or what the coronary artery calcium score is.
For every individual question, a new model is required. In order to cover even a modest slice of what they see in a day, a radiologist would need to switch between dozens of models and ask the right questions of each one. Several platforms manage, run, and interpret outputs from dozens or even hundreds of separate AI models across vendors, but each model operates independently, analyzing for one finding or disease at a time. The final output is a list of separate answers to specific questions, rather than a single description of an image.
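To make the one-model-per-question structure concrete, here is a minimal sketch of how a platform-style workflow could look in code. The stub “models,” their names, and the output format are hypothetical illustrations, not any vendor’s actual product or API.

```python
from typing import Callable, Dict, List

# Hypothetical single-finding "models": each answers exactly one narrow
# question about a scan. Real products are separate, independently cleared
# tools; these stubs only illustrate the structure described above.
def lung_nodule_model(scan: dict) -> str:
    return "no nodules detected"

def rib_fracture_model(scan: dict) -> str:
    return "no acute fracture"

def calcium_score_model(scan: dict) -> str:
    return "coronary artery calcium score: 12"

FINDING_MODELS: Dict[str, Callable[[dict], str]] = {
    "lung_nodules": lung_nodule_model,
    "rib_fractures": rib_fracture_model,
    "coronary_calcium": calcium_score_model,
}

def read_scan(scan: dict) -> List[dict]:
    """Run every single-finding model and collect its separate answer."""
    return [{"finding": name, "result": model(scan)}
            for name, model in FINDING_MODELS.items()]

# The result is a list of independent answers to narrow questions,
# not a single narrative description of the image.
print(read_scan({"modality": "chest_ct"}))
```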
Even with hundreds of imaging algorithms approved by the Food and Drug Administration (FDA) on the market, the combined footprint of today’s radiology AI models still covers only a small fraction of real-world imaging tasks. Many cluster around a few use cases: stroke, breast cancer, and lung cancer together account for about 60 percent of models, but only a minority of the actual radiology imaging volume that is carried out in the US. Other subspecialties, such as vascular, head and neck, spine, and thyroid imaging, currently have relatively few AI products. This is in part due to data availability: the scan needs to be common enough for there to be many annotated examples that can be used to train models. Some scans are also inherently more complicated than others. For example, ultrasounds are taken from multiple angles and do not have standard imaging planes, unlike X-rays.
Once deployed outside of the hospital where they were initially trained, models can struggle. In a standard clinical trial, samples are taken from multiple hospitals to ensure exposure to a broad range of patients and to avoid site-specific effects, such as a single doctor’s technique or how a hospital chooses to calibrate its diagnostic equipment.1 But when an algorithm is undergoing regulatory approval in the US, its developers will normally test it on a relatively narrow dataset. Out of the models in 2024 that reported the number of sites where they were tested, 38 percent were tested on data from a single hospital. Public benchmarks tend to rely on multiple datasets from the same hospital.
The performance of a tool can drop as much as 20 percentage points when it is tested out of sample, on data from other hospitals. In one study, a pneumonia detection model trained on chest X-rays from a single hospital performed substantially worse when tested at a different hospital. Some of these challenges stemmed from avoidable experimental issues like overfitting, but others are indicative of deeper problems like differences in how hospitals record and generate data, such as using slightly different imaging equipment. This means that individual hospitals or departments would need to retrain or revalidate today’s crop of tools before adopting them, even if they have been proven elsewhere.
The limitations of radiology models stem from deeper problems with building medical AI. Training datasets come with strict inclusion criteria, where the diagnosis must be unambiguous (typically confirmed by a consensus of two to three experts or a pathology result) and without images that are shot at an odd angle, look too dark, or are blurry. This skews performance towards the easiest cases, which doctors are already best at diagnosing, and away from real-world images. In one 2022 study, an algorithm that was meant to spot pneumonia on chest X-rays faltered when the disease presented in subtle or mild forms, or when other lung conditions resembled pneumonia, such as pleural effusions, where fluid builds up around the lungs, or atelectasis (collapsed lung). Humans also benefit from context: one radiologist told me about a model they use that labels surgical staples as hemorrhages, because of the bright streaks they create in the image.
Medical imaging datasets used for training also tend to have fewer cases from children, women, and ethnic minorities, making their performance generally worse for these demographics. Many lack information about the gender or race of cases at all, making it difficult to adjust for these issues and address the problem of bias. The result is that radiology models often predict only a narrow slice of the world,2 though there are scenarios where AI models do perform well, including identifying common diseases like pneumonia or certain tumors.
The problems don’t stop there. Even a model built for the precise question you need, used in the hospital where it was trained, is unlikely to perform as well in clinical practice as it did in the benchmark. In benchmark studies, researchers isolate a cohort of scans, define goals in quantitative metrics, such as sensitivity (the percentage of people with the condition who are correctly identified by the test) and specificity (the percentage of people without the condition who are correctly identified as such), and compare the performance of a model to the score of another reviewer, typically a human doctor. Clinical studies, on the other hand, show how well the model performs in a real healthcare setting without controls. Since the earliest days of computer-aided diagnosis, there has been a gulf between benchmark and clinical performance.
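For reference, the two metrics above can be written out from the counts of true positives (TP), false negatives (FN), true negatives (TN), and false positives (FP):

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN},
\qquad
\text{specificity} = \frac{TN}{TN + FP}
\]
% Example: if a model flags 90 of 100 patients who have the condition,
% sensitivity is 0.90; if it correctly clears 950 of 1,000 patients who
% do not, specificity is 0.95.
```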
In the 1990s, computer-aided diagnosis systems, effectively rudimentary AI, were developed to screen mammograms, X-rays of the breast performed to look for breast cancer. In trials, the combination of humans and computer-aided diagnosis systems outperformed humans alone in accuracy when evaluating mammograms. More controlled experiments followed, which pointed to computer-aided diagnosis helping radiologists pick up more cancer with minimal costs.
The FDA approved mammography computer-aided diagnosis in 1998, and Medicare started to reimburse the use of computer-aided diagnosis in 2001. The US government paid radiologists $7 more to report a screening mammogram if they used the technology; by 2010, approximately 74 percent of mammograms in the country were read by computer-aided diagnosis alongside a clinician.
But computer-aided diagnosis turned out to be a disappointment. Between 1998 and 2002, researchers analyzed 430,000 screening mammograms from 200,000 women at 43 community clinics in Colorado, New Hampshire, and Washington. At the seven clinics that adopted computer-aided detection software, the machines flagged more images, leading clinicians to conduct 20 percent more biopsies without uncovering any more cancer than before. Several other large clinical studies had similar findings.
Another way to measure performance is to compare having computerized help to a second clinician reading every film, called ‘double reading’. Across ten trials and seventeen studies of double reading, researchers found that computer aids did not raise the cancer detection rate but led to patients being called back ten percent more often. In contrast, having two readers caught more cancers while slightly lowering callbacks. Computer-aided detection was worse than standard care, and much worse than another pair of eyes. In 2018, Medicare stopped reimbursing doctors more for mammograms read with computer-aided diagnosis than for those read by a radiologist alone.
One explanation for this gap is that people behave differently if they are treating patients day to day than when they are part of laboratory studies or other controlled experiments.3 In particular, doctors appear to defer excessively to assistive AI tools in clinical settings in a way that they do not in lab settings. They did this even with much more primitive tools than we have today: one clinical trial all the way back in 2004 asked 20 breast screening specialists to read mammogram cases with the computer prompts switched on, then brought in a new group to read the identical films without the software. When guided by computer aids, doctors identified barely half of the malignancies, while those reviewing without the model caught 68 percent. The gap was largest when computer aids failed to recognize the malignancy itself; many doctors seemed to treat an absence of prompts as reassurance that a film was clean. Another review, this time from 2011, found that when a system gave incorrect guidance, clinicians were 26 percent more likely to make a wrong decision than unaided peers.
It would seem as if better models and more automation could together fix the problems of current-day AI for radiology. Without a doctor involved whose behavior might change, we might expect real-world results to match benchmark scores. But regulatory requirements and insurance policies are slowing the adoption of fully autonomous radiology AI.
The FDA splits imaging software into two regulatory lanes: assistive or triage tools, which require a licensed physician to read the scan and sign the chart, and autonomous tools, which do not. Makers of assistive tools simply have to show that their software can match the performance of tools that are already on the market. Autonomous tools have to clear a much higher bar: they must demonstrate that the AI tool will refuse to read any scan that is blurry, uses an unusual scanner, or is outside its competence. The bar is higher because, once the human disappears, a latent software defect could harm thousands before anyone notices.
Meeting those criteria is difficult. Even state-of-the-art vision networks falter on images that lack contrast, are taken from unexpected angles, or contain lots of artifacts. IDx-DR, a diabetic retinopathy screener and one of the few cleared to operate autonomously, comes with guardrails: the patient must be an adult with no prior retinopathy diagnosis; there must be two macula-centred photographs of the fundus (the rear of the eye) with a resolution of at least 1,000 by 1,000 pixels; if glare, small pupils or poor focus degrade quality, the software must self-abort and refer the patient to an eye care professional.
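As a rough illustration of what that guardrail amounts to in practice, here is a sketch of a pre-read eligibility check based on the constraints listed above. The field names, the exact checks, and the stub model call are assumptions made for illustration, not Digital Diagnostics’ actual implementation.

```python
# Illustrative sketch of an autonomous tool's "refuse to read" guardrail,
# loosely modeled on the IDx-DR constraints described above. Field names
# and checks are assumptions, not the real product's logic.

def can_read_autonomously(case: dict) -> bool:
    images = case["fundus_images"]
    image_ok = all(
        img["width"] >= 1000 and img["height"] >= 1000
        and img["macula_centered"]
        and not img["glare"] and not img["out_of_focus"]
        for img in images
    )
    return (
        case["patient_is_adult"]
        and not case["prior_retinopathy_diagnosis"]
        and len(images) == 2
        and image_ok
    )

def run_retinopathy_model(case: dict) -> str:
    return "negative for more than mild diabetic retinopathy"  # stub result

def screen(case: dict) -> str:
    if not can_read_autonomously(case):
        # Self-abort: hand the case to an eye care professional.
        return "refer to an eye care professional (eligibility or image quality check failed)"
    return run_retinopathy_model(case)

example = {
    "patient_is_adult": True,
    "prior_retinopathy_diagnosis": False,
    "fundus_images": [
        {"width": 1200, "height": 1200, "macula_centered": True,
         "glare": False, "out_of_focus": True},   # one blurry photo
        {"width": 1200, "height": 1200, "macula_centered": True,
         "glare": False, "out_of_focus": False},
    ],
}
print(screen(example))  # refers out, because one photo fails the quality check
```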
Stronger evidence and improved performance could eventually clear both hurdles, but other requirements would still delay widespread use. For example, if you retrain a model, you are required to receive new approval even if the previous model was approved. This contributes to the market generally lagging behind frontier capabilities.
And when autonomous models are approved, malpractice insurers are not eager to cover them. Diagnostic error is the costliest mistake in American medicine, resulting in roughly a third of all malpractice payouts, and radiologists are perennial defendants. Insurers believe that software makes catastrophic payments more likely than a human clinician, as a broken algorithm can harm many patients at once. Standard contract language now often includes phrases such as: ‘Coverage applies solely to interpretations reviewed and authenticated by a licensed physician; no indemnity is afforded for diagnoses generated autonomously by software’. One insurer, Berkley, even carries the blunter label ‘Absolute AI Exclusion’.
Without malpractice coverage, hospitals cannot afford to let algorithms sign reports. In the case of IDx-DR, the vendor, Digital Diagnostics, includes a product liability policy and an indemnity clause. This means that if the clinic used the device exactly as the FDA label prescribes, with adult patients, good-quality images, and no prior retinopathy, then the company will reimburse the clinic for damages traceable to algorithmic misclassification.
Today, if American hospitals wanted to adopt AI for fully independent diagnostic reads, they would need to believe that autonomous models deliver enough cost savings or throughput gains to justify pushing for exceptions to credentialing and billing norms. For now, usage is too sparse to make a difference. One 2024 investigation estimated that just 48 percent of radiologists use AI at all in their practice. A 2025 survey reported that only 19 percent of respondents who have started piloting or deploying AI use cases in radiology reported a ‘high’ degree of success.
Even if AI models become accurate enough to read scans on their own and are cleared to do so, radiologists may still find themselves busier, rather than out of a career.
Radiologists are useful for more than reading scans; a study that followed staff radiologists in three different hospitals in 2012 found that only 36 percent of their time was dedicated to direct image interpretation. More time is spent on overseeing imaging examinations, communicating results and recommendations to the treating clinicians and occasionally directly to patients, teaching radiology residents and technologists who conduct the scans, and reviewing imaging orders and changing scanning protocols.4 This means that, if AI were to get better at interpreting scans, radiologists may simply shift their time toward other tasks. This would reduce the substitution effect of AI.
As tasks get faster or cheaper to perform, we may also do more of them. In some cases, especially if lower costs or faster turnaround times open the door to new uses, the increase in demand can outweigh the increase in efficiency, a phenomenon known as Jevons Paradox. This has historical precedent in the field: in the early 2000s hospitals swapped film jackets for digital systems. Hospitals that digitized improved radiologist productivity, and time to read an individual scan went down. A study at Vancouver General found that the switch boosted radiologist productivity 27 percent for plain radiography and 98 percent for CT within a year of going filmless. This occurred alongside other advancements in imaging technology that made scans faster to execute. Yet, no radiologists were laid off.
Instead, the overall American utilization rate per 1,000 insured patients for all imaging increased by 60 percent from 2000 to 2008. This is not explained by a commensurate increase in physician visits; rather, each visit was associated with more imaging on average. Before digitization, the nonmonetary price of imaging was high: the median reporting turnaround time for X-rays was 76 hours for patients discharged from emergency departments, and 84 hours for admitted patients. After departments digitized, these times dropped to 38 hours and 35 hours, respectively.
Faster scans give doctors more options. Until the early 2000s, only exceptional trauma cases would receive whole-body CT scans; the increased speed of CT turnaround times means that they are now a common choice. This is a reflection of elastic demand, a concept in economics that describes when demand for a product or service is very sensitive to changes in price. In this case, when these scans got cheaper in terms of waiting time, demand for them increased.
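Elasticity has a standard definition; in this setting the relevant ‘price’ is mostly waiting time rather than money. A back-of-the-envelope version, with purely illustrative numbers rather than an estimate from the studies cited here:

```latex
% Price elasticity of demand: percentage change in quantity demanded
% per percentage change in price. Demand is "elastic" when |E| > 1.
\[
E = \frac{\%\,\Delta Q}{\%\,\Delta P}
\]
% Illustration only: if the effective price of a scan (waiting time)
% falls by 50% and the number of scans ordered rises by 60%, then
% E = 60% / (-50%) = -1.2, i.e., demand is elastic.
```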
Over the past decade, improvements in image interpretation have run far ahead of their diffusion. Hundreds of models can spot bleeds, nodules, and clots, yet AI is often limited to assistive use on a small subset of scans in any given practice. And despite predictions to the contrary, head counts and salaries have continued to rise. Benchmarks alone overstate the promise of AI in radiology.
Multi-task foundation models may widen coverage, and different training sets could blunt data gaps. But many hurdles cannot be removed with better models alone: the need to counsel the patient, shoulder malpractice risk, and receive accreditation from regulators. Each hurdle makes full substitution the expensive, risky option and human plus machine the default. Sharp increases in AI capabilities could certainly alter this dynamic, but it remains a useful model for the first years of AI systems that benchmark well at tasks associated with a particular career.
There are industries where conditions are different. Large platforms rely heavily on AI systems to triage or remove harmful or policy-violating content. At Facebook and Instagram, 94 percent and 98 percent of moderation decisions respectively are made by machines. But many of the more sophisticated knowledge jobs look more like radiology.
In many jobs, tasks are diverse, stakes are high, and demand is elastic. When this is the case, we should expect software to initially lead to more human work, not less. The lesson from a decade of radiology models is neither optimism about increased output nor dread about replacement. Models can lift productivity, but their implementation depends on behavior, institutions and incentives. For now, the paradox has held: the better the machines, the busier radiologists have become.
A few groups have started doing this, like the 2025 ‘OpenMIBOOD’ suite, which explicitly scores chest X-ray models on 14 out-of-distribution collections, but that hasn’t yet become standard.
A few companies and research groups are working to mitigate this, such as by training on multi-site datasets, building synthetic cases, or using self-supervised learning to reduce labeling needs, but these approaches are still early and expensive. This limitation is an important reason why AI models do not yet perform as expected.
One study tracked 27 mammographers and compared how well each interpreted real screening films versus a standardised ‘test-set’ of the same images. The researchers found no meaningful link between a radiologist’s accuracy in the lab and accuracy on live patients; the statistical correlation in sensitivity-specificity scores was essentially zero.
2025-09-29 21:30:53
On January 1, 2008, at 1:59 AM in Calipatria, California, an earthquake happened. You haven’t heard of this earthquake; even if you had been living in Calipatria, you wouldn’t have felt anything. It was magnitude -0.53, about the same amount of shaking as a truck passing by. Still, this earthquake is notable: not because it was large, but because it was small and yet we know about it.
Over the past seven years, AI tools based on computer vision have almost completely automated one of the fundamental tasks of seismology: detecting earthquakes. What used to be the task of human analysts and, later, simpler computer programs can now be done automatically and quickly by machine learning tools.
These machine learning tools can detect smaller earthquakes than human analysts, especially in noisy environments like cities. Earthquakes give valuable information about the composition of the Earth and what hazards might occur in the future.
“In the best-case scenario, when you adopt these new techniques, even on the same old data, it’s kind of like putting on glasses for the first time, and you can see the leaves on the trees,” said Kyle Bradley, co-author of the Earthquake Insights newsletter.
I talked with several earthquake scientists, and they all agreed that machine learning methods have replaced humans for the better in these specific tasks.
“It’s really remarkable,” Judith Hubbard, a Cornell professor and Bradley’s co-author, told me.
Less certain is what comes next. Earthquake detection is a fundamental part of seismology, but there are many other data processing tasks that have yet to be disrupted. The biggest potential impacts, all the way to earthquake forecasting, haven’t materialized yet.
“It really was a revolution,” said Joe Byrnes, a professor at the University of Texas at Dallas. “But the revolution is ongoing.”
When an earthquake happens in one place, the shaking passes through the ground similar to how sound waves pass through the air. In both cases, it’s possible to draw inferences about the materials the waves pass through.
Imagine tapping a wall to figure out if it’s hollow. Because a solid wall vibrates differently than a hollow wall, you can figure out the structure by sound.
With earthquakes, this same principle holds. Seismic waves pass through different materials (rock, oil, magma, etc.) differently, and scientists use these vibrations to image the Earth’s interior.
The main tool that scientists traditionally use is the seismometer. Seismometers record the movement of the Earth in three directions: up–down, north–south, and east–west. If an earthquake happens, seismometers can measure the shaking in that particular location.
Scientists then process raw seismometer information to identify earthquakes.
Earthquakes produce multiple types of shaking, which travel at different speeds. Two types, Primary (P) waves and Secondary (S) waves, are particularly important, and scientists like to identify the start of each of these phases.
Before good algorithms, earthquake cataloging had to happen by hand. Byrnes said that “traditionally, something like the lab at the United States Geological Survey would have an army of mostly undergraduate students or interns looking at seismograms.”
However, there are only so many earthquakes you can find and classify manually. Creating algorithms to effectively find and process earthquakes has long been a priority in the field — especially since the arrival of computers in the early 1950s.
“The field of seismology historically has always advanced as computing has advanced,” Bradley told me.
There’s a big challenge with traditional algorithms though: they can’t easily find smaller quakes, especially in noisy environments.
As we see in this seismogram above, many different events can cause seismic signals. If a method is too sensitive, it risks falsely detecting events as earthquakes. The problem is especially bad in cities, where the constant hum of traffic and buildings can drown out small earthquakes.
However, earthquakes have a characteristic “shape.” The magnitude 7.7 earthquake above looks quite different from the helicopter landing, for instance.
So one idea scientists had was to make templates from human-labeled datasets. If a new waveform correlates closely with an existing template, it’s almost certainly an earthquake.
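Here is a minimal sketch of that idea, using normalized cross-correlation on synthetic data; the waveforms and the 0.8 threshold are made up for illustration, not a production detection pipeline.

```python
import numpy as np

# Template matching, in miniature: slide a known earthquake waveform
# (the template) along a continuous recording and flag windows whose
# normalized correlation with the template is high.

rng = np.random.default_rng(0)

def normalized_correlation(window: np.ndarray, template: np.ndarray) -> float:
    w = (window - window.mean()) / (window.std() + 1e-12)
    t = (template - template.mean()) / (template.std() + 1e-12)
    return float(np.dot(w, t) / len(t))

def scan_for_matches(record: np.ndarray, template: np.ndarray,
                     threshold: float = 0.8):
    n = len(template)
    hits = []
    for start in range(len(record) - n):
        score = normalized_correlation(record[start:start + n], template)
        if score >= threshold:
            hits.append((start, round(score, 2)))
    return hits

# A synthetic "earthquake" template, and a noisy record with one copy of it.
template = np.sin(np.linspace(0, 6 * np.pi, 200)) * np.exp(-np.linspace(0, 4, 200))
record = rng.normal(0, 0.1, 5000)
record[3000:3200] += template

print(scan_for_matches(record, template))  # detections cluster near sample 3000
```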
Template matching works very well if you have enough human-labeled examples. In 2019, Zach Ross’s lab at Caltech used template matching to find ten times as many earthquakes in Southern California as had previously been known, including the earthquake at the start of this story. Almost all of the 1.6 million new quakes they found were very small, magnitude 1 and below.
If you don’t have an extensive pre-existing dataset of templates, however, you can’t easily apply template matching. That isn’t a problem in Southern California — which already had a basically complete record of earthquakes down to magnitude 1.7 — but it’s a challenge elsewhere.
Also, template matching is computationally expensive. Creating a Southern California quake dataset using template matching took 200 Nvidia P100 GPUs running for days on end.
There had to be a better way.
AI detection models solve both these problems:
They are faster than template matching. Because AI detection models are very small (around 350,000 parameters compared to billions in LLMs), they can be run on consumer CPUs.
AI models generalize well to regions not represented in the original dataset.
As an added bonus, AI models can give better information about when the different types of earthquake shaking arrive. Timing the arrivals of the two most important waves — P and S waves — is called phase picking. It allows scientists to draw inferences about the structure of the quake. AI models can do this alongside earthquake detection.
The basic task of earthquake detection (and phase picking) looks like this:
The first three rows represent different directions of vibration (east–west, north–south, and up–down respectively). Given these three dimensions of vibration, can we determine if an earthquake occurred, and if so, when it started?
Ideally, our model outputs three things at every time step in the sample:
The probability that an earthquake is occurring at that moment.
The probability that a P wave arrives at that moment.
The probability that an S wave arrives at that moment.
We see all three outputs in the fourth row: the detection in green, the P wave arrival in blue, and the S wave arrival in red. (There are two earthquakes in this sample).
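In code, the shape of the problem looks roughly like this; the 100 Hz sampling rate and 60-second window below are typical choices for these models, assumed here for illustration.

```python
import numpy as np

# Assumed setup: a 60-second window sampled at 100 Hz (6,000 time steps).
n_samples = 60 * 100

# Input: three channels of ground motion (east-west, north-south, up-down).
waveform = np.zeros((3, n_samples))

# A trained detector maps that input to three probability traces, one value
# per time step:
#   detection[t] - probability an earthquake is occurring at time step t
#   p_arrival[t] - probability the P wave arrives at time step t
#   s_arrival[t] - probability the S wave arrives at time step t
detection = np.zeros(n_samples)
p_arrival = np.zeros(n_samples)
s_arrival = np.zeros(n_samples)
```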
To train an AI model, scientists take large amounts of labeled data like what’s above and do supervised training. I’ll describe one of the most used models: Earthquake Transformer, which was developed around 2020 by a Stanford team led by S. Mostafa Mousavi — who later became a Harvard professor.
Like many earthquake detection models, Earthquake Transformer adapts ideas from image classification. Readers may be familiar with AlexNet, a famous image recognition model that kicked off the deep learning boom in 2012.
AlexNet used convolutions, a neural network architecture that’s based on the idea that pixels that are physically close together are more likely to be related. The first convolutional layer of AlexNet broke an image down into small chunks — 11 pixels on a side — and classified each chunk based on the presence of simple features like edges or gradients.
The next layer took the first layer’s classifications as input and checked for higher-level concepts such as textures or simple shapes.
Each convolutional layer analyzed a larger portion of the image and operated at a higher level of abstraction. By the final layers, the network was looking at the entire image and identifying objects like “mushroom” and “container ship.”
Images are two-dimensional, so AlexNet is based on two-dimensional convolutions. In contrast, seismograph data is one-dimensional, so Earthquake Transformer uses one-dimensional convolutions over the time dimension. The first layer analyzes vibration data in 0.1-second chunks, while later layers identify patterns over progressively longer time periods.
It’s difficult to say what exact patterns the earthquake model is picking out, but we can analogize this to a hypothetical audio transcription model using one-dimensional convolutions. That model might first identify consonants, then syllables, then words, then sentences over increasing time scales.
Earthquake Transformer converts raw waveform data into a collection of high-level representations that indicate the likelihood of earthquakes and other seismologically significant events. This is followed by a series of deconvolution layers that pinpoint exactly when an earthquake — and its all-important P and S waves — occurred.
The model also uses an attention layer in the middle of the network to mix information between different parts of the time series. The attention mechanism is most famous in large language models, where it helps pass information between words. It plays a similar role in seismographic detection. Earthquake seismograms have a general structure: P waves followed by S waves followed by other types of shaking. So if a segment looks like the start of a P wave, the attention mechanism helps the model check that the segment fits into a broader earthquake pattern.
All of the Earthquake Transformer’s components are standard designs from the neural network literature. Other successful detection models, like PhaseNet, are even simpler. PhaseNet uses only one-dimensional convolutions to pick the arrival times of earthquake waves. There are no attention layers.
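To make that concrete, here is a minimal, PhaseNet-flavored sketch in PyTorch: a stack of one-dimensional convolutions downsamples the three-channel waveform, and a matching stack of transposed convolutions upsamples back to per-time-step probabilities for detection, P arrival, and S arrival. It is a toy with arbitrary layer sizes and no attention block, not the published model.

```python
import torch
import torch.nn as nn

class TinyPhasePicker(nn.Module):
    """Toy 1-D conv encoder/decoder in the spirit of PhaseNet and
    Earthquake Transformer: a 3-channel waveform goes in, and three
    probability traces (detection, P arrival, S arrival) come out.
    Layer sizes are illustrative, not the published architectures."""

    def __init__(self):
        super().__init__()
        # Encoder: 1-D convolutions over time, downsampling as they go,
        # so later layers see progressively longer stretches of signal.
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 16, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=11, stride=2, padding=5), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=11, stride=2, padding=5), nn.ReLU(),
        )
        # Decoder: transposed convolutions upsample back to full length,
        # pinpointing when events and phase arrivals occur.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(64, 32, kernel_size=11, stride=2,
                               padding=5, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=11, stride=2,
                               padding=5, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 3, kernel_size=11, stride=2,
                               padding=5, output_padding=1),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 3 channels, time steps)
        return torch.sigmoid(self.decoder(self.encoder(waveform)))

model = TinyPhasePicker()
x = torch.randn(1, 3, 6000)   # one minute of 3-channel data at 100 Hz
probs = model(x)
print(probs.shape)            # torch.Size([1, 3, 6000])
```

Training such a network is then ordinary supervised learning against labeled probability traces, which is where the large datasets described below come in.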
Generally, there hasn’t been “much need to invent new architectures for seismology,” according to Byrnes. The techniques derived from image processing have been sufficient.
What made these generic architectures work so well then? Data. Lots of it.
We’ve previously reported on how the introduction of ImageNet, an image recognition benchmark, helped spark the deep learning boom. Large, publicly available earthquake datasets have played a similar role in seismology.
Earthquake Transformer was trained using the Stanford Earthquake Dataset (STEAD), which contains 1.2 million human-labeled segments of seismogram data from around the world. (The paper for STEAD explicitly mentions ImageNet as an inspiration). Other models, like PhaseNet, were also trained on hundreds of thousands or millions of labeled segments.
The combination of the data and the architecture just works. The current models are “comically good” at identifying and classifying earthquakes, according to Byrnes. Typically, machine learning methods find ten or more times the quakes that were previously identified in an area. You can see this directly in an Italian earthquake catalog:
AI tools won’t necessarily detect more earthquakes than template matching. But AI-based techniques are much less compute- and labor-intensive, making them more accessible to the average research project and easier to apply in regions around the world.
All in all, these machine learning models are so good that they’ve almost completely supplanted traditional methods for detecting and phase-picking earthquakes, especially for smaller magnitudes.
The holy grail of earthquake science is earthquake prediction. For instance, scientists know that a large quake will happen near Seattle, but have little ability to know whether it will happen tomorrow or in a hundred years. It would be helpful if we could predict earthquakes precisely enough to allow people in affected areas to evacuate.
You might think AI tools would help predict earthquakes, but that doesn’t seem to have happened yet.
The applications are more technical, and less flashy, said Cornell’s Judith Hubbard.
Better AI models have given seismologists much more comprehensive earthquake catalogs, which have unlocked “a lot of different techniques,” Bradley said.
One of the coolest applications is in understanding and imaging volcanoes. Volcanic activity produces a large number of small earthquakes, whose locations help scientists understand the structure of the magma system. In a 2022 paper, John Wilding and co-authors used a large AI-generated earthquake catalog to create this incredible image of the structure of the Hawaiian volcanic system.
They provided direct evidence of a previously hypothesized magma connection between the deep Pāhala sill complex and Mauna Loa’s shallow volcanic structure. You can see this in the image with the arrow labeled “Pāhala-Mauna Loa seismicity band.” The authors were also able to resolve the structure of the Pāhala sill complex into discrete sheets of magma. This level of detail could potentially facilitate better real-time monitoring of earthquakes and more accurate eruption forecasting.
Another promising area is lowering the cost of dealing with huge datasets. Distributed Acoustic Sensing (DAS) is a powerful technique that uses fiber-optic cables to measure seismic activity across the entire length of the cable. A single DAS array can produce “hundreds of gigabytes of data” a day, according to Jiaxuan Li, a professor at the University of Houston. That much data can produce extremely high resolution datasets — enough to pick out individual footsteps.
AI tools make it possible to very accurately time earthquakes in DAS data. Before the introduction of AI techniques for phase picking in DAS data, Li and some of his collaborators attempted to use traditional techniques. While these “work roughly,” they weren’t accurate enough for their downstream analysis. Without AI, much of his work would have been “much harder,” he told me.
Li is also optimistic that AI tools will be able to help him isolate “new types of signals” in the rich DAS data in the future.
As in many other scientific fields, seismologists face some pressure to adopt AI methods whether or not they’re relevant to their research.
“The schools want you to put the word AI in front of everything,” Byrnes said. “It’s a little out of control.”
This can lead to papers that are technically sound but practically useless. Hubbard and Bradley told me that they’ve seen a lot of papers based on AI techniques that “reveal a fundamental misunderstanding of how earthquakes work.”
They pointed out that graduate students can feel pressure to specialize in AI methods at the cost of learning less about the fundamentals of the scientific field. They fear that if this type of AI-driven research becomes entrenched, older methods will get “out-competed by a kind of meaninglessness.”
While these are real issues, and ones Understanding AI has reported on before, I don’t think they detract from the success of AI earthquake detection. In the last five years, an AI-based workflow has almost completely replaced one of the fundamental tasks in seismology for the better.
That’s pretty cool.
2025-09-26 04:16:40
A striking thing about the AI industry is how many insiders believe AI could pose an existential risk to humanity.
Just last week, Anthropic CEO Dario Amodei described himself as “relatively an optimist” about AI. But he said there was a “25 percent chance that things go really really badly.” Among the risks Amodei worries about: “the autonomous danger of the model.”
In a 2023 interview, OpenAI CEO Sam Altman was blunter, stating that the worst-case scenario was “lights out for all of us.”
No one has done more to raise these concerns than rationalist gadfly Eliezer Yudkowsky. In a new book with co-author Nate Soares, Yudkowsky doesn’t mince words: If Anyone Builds It, Everyone Dies. Soares and Yudkowsky believe that if anyone invents superintelligent AI, it will take over the world and kill everyone.
Normally, when someone predicts the literal end of the world, you can write them off as a kook. But Yudkowsky is hard to dismiss. He has been warning about these dangers since the early 2010s, when he (ironically) helped get some of the leading AI companies off the ground. Legendary AI researchers like Geoffrey Hinton and Yoshua Bengio take Yudkowsky’s concerns seriously.
So is Yudkowsky right? In my mind, there are three key steps to his argument:
Humans are on a path to develop AI systems with superhuman intelligence.
These systems will gain a lot of power over the physical world.
We don’t know how to ensure these systems use their power for good rather than evil.
Outside the AI industry, debate tends to focus on the first claim; many normie skeptics think superintelligent AI is simply too far away to worry about. Personally, I think these skeptics are too complacent. I don’t know how soon AI systems will surpass human intelligence, but I expect progress to be fast enough over the next decade that we should start taking these questions seriously.
Inside the AI industry, many people accept Yudkowsky’s first two premises—superintelligent AI will be created and become powerful—but they disagree about whether we can get it to pursue beneficial goals instead of harmful ones. There’s now a sprawling AI safety community exploring how to align AI systems with human values.
But I think the weakest link in Yudkowsky and Soares’s argument is actually the second claim: that an AI system with superhuman intelligence would become so powerful it could kill everyone. I have no doubt that AI will give people new capabilities and solve long-standing problems. But I think the authors wildly overestimate how transformational the technology will be—and dramatically underestimate how easy it will be for humans to maintain control.
Over the last two centuries, humans have used our intelligence to dramatically increase our control over the physical world. From airplanes to antibiotics to nuclear weapons, modern humans accomplish feats that would have astonished our ancestors.
Yudkowsky and Soares believe AI will unlock another, equally large, jump in our (or perhaps just the AI’s) ability to control the physical world. And the authors expect this transformation to happen over months rather than decades.
Biology is one area where the authors expect radical acceleration.
“The challenge of building custom-designed biological technology is not so much one of producing the tools to make it, as it is one of understanding the design language, the DNA and RNA,” Yudkowsky and Soares argue. According to these authors, “our best wild guess is that it wouldn’t take a week” for a superintelligent AI system to “crack the secrets of DNA” so that it could “design genomes that yielded custom life forms.”
For example, they describe trees as “self-replicating factories that spin air into wood” and conclude that “any intelligence capable of comprehending biochemistry at the deepest level is capable of building its own self-replicating factories to serve its own purposes.”
Ironically, I think the first four chapters of the book do a good job of explaining why it’s probably not that simple.
These chapters argue that AI alignment is a fool’s errand. Due to the complexity of AI models and the way they’re trained, the authors say, humans won’t be able to design AI models to predictably follow human instructions or prioritize human values. I think this argument is correct, but it has broader implications than the authors acknowledge.
Here’s a key passage from Chapter 2 of If Anyone Builds It:
The way humanity finally got to the level of ChatGPT was not by finally comprehending intelligence well enough to craft an intelligent mind. Instead, computers became powerful enough that AIs can be churned out by gradient descent, without any human needing to understand the cognitions that grow inside.
Which is to say: engineers failed at crafting AI, but eventually succeeded in growing it.
“You can’t grow an AI that does what you want just by training it to be nice and hoping,” they write. “You don’t get what you train for.”
The authors draw an analogy to evolution, another complex process with frequently surprising results. For example, the long, colorful tails of male peacocks make it harder for them to flee predators. So why do they have them? At some point, early female peacocks developed a preference for large-tailed males, and this led to a self-reinforcing dynamic where males grew ever larger tails to improve their chances of finding a mate.
“If you ran the process [of evolution] again in very similar circumstances you’d get a wildly different result” than large-tailed peacocks, the authors argue. “The result defies what you might think natural selection should do, and you can’t predict the specifics no matter how clever you are.”
I love this idea that some systems are so complex that “you can’t predict the specifics no matter how clever you are.” But there’s an obvious tension with the idea that after an AI system “cracks the secrets of DNA” it will be able to rapidly invent “custom life forms” and “self-replicating factories” that serve the purposes of the AI.
Yudkowsky and Soares believe that some systems are too complex for humans to fully understand or control, but superhuman AI won’t have the same limitations. They believe that AI systems will become so smart that they’ll be able to create and modify living organisms as easily as children rearrange Lego blocks. Once an AI system has this kind of predictive power, it could become trivial for it to defeat humanity in a conflict.
But I think the difference between grown and crafted systems is more fundamental. Some of the most important systems—including living organisms—are so complex that no one will ever be able to fully understand or control them. And this means that raw intelligence only gets you so far. At some point you need to perform real-world experiments to see if your predictions hold up. And that is a slow and error-prone process.
And not just in the domain of biology. Military conflicts, democratic elections, and cultural evolution are other domains that are beyond the predictive power—and hence the control—of even the smartest humans. Many doomers expect that superintelligent AIs won’t face such limitations—that they’ll be able to perfectly predict the outcome of battles or deftly manipulate the voting public to achieve their desired outcomes in elections.
But I’m skeptical. I suspect that large-scale social systems like this are so complex that it’s impossible to perfectly understand and control them no matter how clever you are. Which isn’t to say that future AI systems won’t be helpful for winning battles or influencing elections. But the idea that superintelligence will yield God-like capabilities in these areas seems far-fetched.
Yudkowsky and Soares repeatedly draw analogies to chess, where AI has outperformed the best human players for decades. But chess has some unique characteristics that make it a poor model for the real world. Chess is a game of perfect information; both players know the exact state of the board at all times. The rules of chess are also far simpler than the physical world, allowing chess engines to “look ahead” many moves.
The real world is a lot messier. There’s a military aphorism that “no plan survives contact with the enemy.” Generals try to anticipate the enemy’s strategy and game out potential counter-attacks. But the battlefield is so complicated—and there’s so much generals don’t know prior to the battle—that things almost always evolve in ways that planners don’t anticipate.
Many real-world problems have this character: smarter people can come up with better experiments to try, but even the smartest people are still regularly surprised by experimental results. And so the bottleneck to progress is often the time and resources required to gain real-world experience, not raw brainpower.
In chess, both players start the game with precisely equal resources, and this means that even a small difference in intelligence can be decisive. In the real world, in contrast, specific people and organizations start out with control over essential resources. A rogue AI that wanted to take over the world would start out with a massive material disadvantage relative to governments, large corporations, and other powerful institutions that won’t want to give up their power.
There have been historical examples where brilliant scientists made discoveries that helped their nations win wars. Two of the best known are from World War II: the physicists in the Manhattan Project who helped the US build the first nuclear weapons and the mathematicians at Bletchley Park who figured out how to decode encrypted Nazi communications.
But it’s notable that while Enrico Fermi, Leo Szilard, Alan Turing, and others helped the Allies win the war, none of them personally wound up with significant political power. Instead, they empowered existing Allied leaders such as Franklin Roosevelt, Winston Churchill, and Harry Truman.
That’s because intelligence alone wasn’t sufficient to build an atomic bomb or decode Nazi messages. To make the scientists’ insights actionable, the government needed to mobilize vast resources to enrich uranium, intercept Nazi messages, and so forth. And so despite being less intelligent than Fermi or Turing, Allied leaders had no trouble maintaining control of the overall war effort.
A similar pattern is evident in the modern United States. Currently the most powerful person in the United States is Donald Trump. He has charisma and a certain degree of political cunning, but I think even many of his supporters would concede that he is not an intellectual giant. Neither was Trump’s immediate predecessor, Joe Biden. But it turns out that other characteristics—such as Trump’s wealth and fame—are at least as important as raw intelligence for achieving political power.
I see one other glaring flaw with the chess analogy. There’s actually an easy way for a human to avoid being humiliated by an AI at chess: run your own copy of the AI and do what it recommends. If you do that, you’ve got about a 50/50 chance of winning the game.
And I think the same point applies to AI takeover scenarios, like the fictional story in the middle chapters of If Anyone Builds It. Yudkowsky and Soares envision a rogue AI outsmarting the collective intelligence of billions of human beings. That seems implausible to me in any case, but it seems especially unlikely when you remember that human beings can always ask other AI models for advice.
This is related to my earlier discussion of how much AI models can accelerate technological progress. If it were true that a superintelligent AI system could “crack the secrets of DNA” in a week, I might find it plausible that it could gain a large enough technological head start to outsmart all humans.
But it seems much more likely that the first superhuman AI will be only slightly more intelligent than the smartest humans, and that within a few months rival AI labs will release their own models with similar capabilities.
Moreover, it’s possible to modify the behavior of today’s AI models through either prompting or fine-tuning. There’s no guarantee that future AI models will work exactly the same way, but it seems pretty likely that we’ll continue to have techniques for making copies of leading AI models and giving them different goals and behaviors. So even if one instance of an AI “goes rogue,” we should be able to create other instances that are willing to help us defend ourselves.
So the question is not “will the best AI become dramatically smarter than humans?” It’s “will the best AI become dramatically smarter than humans advised by the second-best AI?” It’s hard to be sure about this, since no superintelligent AI systems exist yet. But I didn’t find Yudkowsky and Soares’s pessimistic case convincing.
2025-09-18 23:10:05
You might have noticed that yesterday’s piece had a new byline—Kai Williams!
After earning a bachelor’s degree in mathematics from Swarthmore College last year (and earning a top-500 score on the Putnam math competition), Kai spent a year honing his programming skills at the Recurse Center and studying AI safety at the prestigious ML Alignment and Theory Scholars (MATS) program.
In short, Kai is smart and knows quite a lot about AI. I expect great things from him.
Kai would like to get to know Understanding AI readers! He has opened up some slots on his calendar today and tomorrow for video calls.
If you’d like to talk to Kai, please click here and grab a time slot. [Update: readers have grabbed all time slots!] You could discuss your own use of AI, how AI is affecting your industry or profession, topics you’d like to see us cover, or anything else AI-related that’s on your mind.
He’d especially like to hear from people in K-12 and higher education, as this will likely be a focus of his reporting.
You can also follow Kai on Twitter.
For the next nine months, Kai will be writing for Understanding AI full time, supported by a fellowship from the Tarbell Center for AI Journalism. Tarbell is funded by groups like Open Philanthropy and the Future of Life Institute that believe AI poses an existential risk. However, Tarbell says that “our grantees and fellows maintain complete autonomy over their reporting.”
So I don’t plan to change anything about the way I’ve covered these topics. I don’t expect existential risk from AI to be a major focus of Kai’s reporting.
2025-09-18 05:28:49
Everything was fine until the wheel came off.
At 1:14 AM on May 31st, a Waymo was driving on South Lamar Boulevard in Austin, Texas, when the front left wheel detached. The bottom of the car scraped against the pavement as the car skidded to a stop, and the passenger suffered a minor injury, according to Waymo.
Among the 45 most serious crashes Waymo experienced in recent months, this was arguably the crash that was most clearly Waymo’s fault. And it was a mechanical failure, not an error by Waymo’s self-driving software.
On Tuesday, Waymo released new safety statistics for its fleet of self-driving cars. Waymo had completed 96 million miles of driving through June. The company estimates that its vehicles were involved in far fewer crashes than you’d expect of human drivers in the same locations and traffic conditions.
Waymo estimates that typical human drivers would have gotten into airbag-triggering crashes 159 times over 96 million miles in the cities where Waymo operates. Waymo, in contrast, got into only 34 airbag crashes—a 79 percent reduction.
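The headline figure is simple arithmetic on those two counts:

```latex
% Airbag-deployment crashes over 96 million miles:
% 159 expected for human drivers vs. 34 for Waymo.
\[
\frac{159 - 34}{159} \approx 0.79
\]
```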
In a similar manner, Waymo estimates that its vehicles get into injury-causing crashes (including that detached wheel crash) 80 percent less often than human drivers. Crashes that injure pedestrians were 92 percent less common, while crashes that injure cyclists were reduced by 78 percent relative to typical human drivers.
Can we trust these statistics? In the past, experts have told Understanding AI that Waymo’s methodology is credible. Still, it’s worth being skeptical any time a company publishes research about the performance of its own product.
So for this story I wanted to judge the safety of Waymo’s vehicles in another way: by looking at the crashes themselves. Waymo is required to disclose every significant crash to the National Highway Traffic Safety Administration (NHTSA). This week, the agency published a new batch of crash reports that cover crashes through August 15. For this story, I took a careful look at every Waymo crash between mid-February (the cutoff for our last story) and mid-August that was serious enough to cause an injury or trigger an airbag.
Waymo’s vehicles were involved in 45 crashes like that over those six months. A large majority of these crashes were clearly not Waymo’s fault, including 24 crashes where the Waymo wasn’t moving at all and another 7 where the Waymo was rear-ended by another vehicle.
For example, in April, a Waymo was stopped at a traffic light in Los Angeles behind a pickup truck that was being towed. The pickup truck came loose, rolled backwards, and hit the Waymo. It seems unlikely that Waymo’s self-driving software could have done anything to prevent that crash.
Over this six-month period, there were 41 crashes that involved some kind of driver error—either by Waymo’s software or a human driver. As we’ll see, Waymo’s software was clearly not at fault in a large majority of these crashes—in only a handful of cases did flaws in Waymo’s software even arguably play a role.
There were four other crashes that didn’t involve driving mistakes at all. I’ve already mentioned one of these—a Waymo crashed after one of its wheels literally fell off. The other three were cases where an exiting Waymo passenger opened a door and hit a passing bicycle or scooter. There may be steps Waymo could take to prevent such injuries in the future, but these clearly weren’t failures of Waymo’s self-driving software.
Of the 41 crashes that involved driving mistakes, 37 seemed to be mostly or completely the fault of other drivers.
24 crashes happened when another vehicle collided with a stationary Waymo, including 19 rear-end collisions. A representative example: a Waymo “came to a stop for a red light at the intersection” at 1:24 AM on March 16 in Los Angeles. The car behind the Waymo didn’t slow in time, and “made contact” with the Waymo. A passenger in the other car suffered a “minor injury,” according to Waymo. Non-rear-end crashes in this category include that pickup truck rolling backwards, and a case where the car ahead of a Waymo turned left, got hit by another vehicle, and got pushed back into the stationary Waymo.
7 crashes involved another vehicle rear-ending a Waymo in motion. In 5 of these the Waymo may as well have been stopped: the car was going less than 3 miles per hour ahead of an intersection. The Waymo was traveling faster in the other 2 crashes, but in both cases, the car from behind seems mostly or completely responsible.
In 4 crashes, another car entered Waymo’s right of way. There were 2 cases of turning vehicles cutting the Waymo off. In the other 2 cases, another car hit a different object (a parked car in one case, a streetlight in the other) and then careened into the Waymo’s path.
2 crashes happened when a bike or car hit a Waymo in an intersection. For instance, a San Francisco biker ran into the side of a Waymo passing through a four-way stop. The bike had been traveling on the sidewalk, which was blocked from the AV’s view by vegetation. The biker fell to the ground and suffered what Waymo described as a “minor injury.”
Three crashes involved another vehicle sideswiping a Waymo as it drove past:
In Atlanta in June, a Waymo came to a stop to yield to a heavy truck traveling in the opposite direction. The road had “vehicles parked at both curbs,” which was apparently too tight of a squeeze. The truck hit the Waymo as it passed by.
Similarly, in San Francisco in August, a Waymo approached a “narrow portion” of a roadway with vehicles “parked at the curb in both directions.” Again, the Waymo yielded to a heavy truck traveling in the opposite direction. As it passed the stopped Waymo, the truck started to turn left and clipped the Waymo.
A Waymo in San Francisco wanted to merge into the right lane, but a car ahead was stopped. So the Waymo stopped “partially between two lanes.” A car behind the Waymo tried to drive around, scraping the Waymo’s side in the process.
In all three of these cases, a passenger in the robotaxi informed Waymo of a “minor injury.” In the second case, the Waymo passenger was taken to the hospital by ambulance. In all three cases, it seems clear that the other driver bore some responsibility, but perhaps Waymo’s self-driving software could have handled the situation better.
The most ambiguous case occurred in Phoenix in May, when a cat crossed a street ahead of a Waymo. The Waymo braked, but couldn’t avoid hitting the cat, “which subsequently ran away.” The sudden braking caused several cars to rear-end the Waymo and each other. There were no (human) injuries, but an airbag was triggered.
Should Waymo have detected the cat earlier, allowing it to stop more gently? It’s hard to say given the information in Waymo’s report.
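As a sanity check, here is a rough tally of how the 45 serious crashes break down; the grouping follows the categories described in this piece, but the labels are mine.

```python
# Rough reconciliation of the crash tallies described above.
other_driver_clearly_at_fault = {
    "hit while stationary (incl. 19 rear-endings)": 24,
    "rear-ended while moving": 7,
    "another car entered the Waymo's right of way": 4,
    "bike or car hit a Waymo in an intersection": 2,
}
more_ambiguous = {
    "sideswipes while yielding or stopped between lanes": 3,
    "sudden braking for a cat, then rear-ended": 1,
}
no_driving_error = {
    "wheel detached": 1,
    "dooring incidents": 3,
}

driving_mistakes = sum(other_driver_clearly_at_fault.values()) + sum(more_ambiguous.values())
total = driving_mistakes + sum(no_driving_error.values())
print(sum(other_driver_clearly_at_fault.values()), driving_mistakes, total)  # 37 41 45
```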
It’s worth mentioning one other crash: Last Sunday, a motorcyclist in Tempe, Arizona died after the motorcycle rear-ended a Waymo and then another car hit the motorcycle. We’re excluding it from the above tallies because it happened after August 15—and as a result we don’t yet have Waymo’s report to NHTSA. But Brad Templeton, a Forbes columnist and former Waymo advisor, wrote that Waymo was “apparently not at fault.”
A dooring incident occurs when a passenger opens a door into another vehicle's path of travel, usually a bike. Of the 45 most serious crashes, three were dooring incidents, but they accounted for two of the seven crashes that caused significant injuries.1
As we noted in March, Waymo has a “safe exit” feature, where the car alerts an exiting passenger to any incoming vehicles or pedestrians. The chime the car plays (recording), however, may not always be loud enough. A Reddit commenter pointed out that the warning message sounds similar to other innocuous notifications Waymo gives passengers.
A bicyclist who sued Waymo in June for a February dooring incident claimed that “there was no alert issued in the illegally parked car as according to the passengers,” as reported by the San Francisco Chronicle. (Waymo denied the claim).
This isn’t a Waymo-specific problem, though: Uber settled a similar lawsuit in 2023 for an undisclosed amount, likely north of one million dollars. These types of dooring accidents are a huge issue for cyclists generally, too. Other carmakers have developed similar warning systems.
In any case, none of these crashes are directly related to Waymo’s autonomous driving software.
Correction: The introduction originally misstated the number of times a Waymo was rear-ended while moving.
By “significant” I mean crashes which Waymo classifies as resulting in moderate injuries, serious injuries, or hospitalization.