
The Download: AI benchmarks, and Spain’s grid blackout

2025-05-08 20:10:00

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How to build a better AI benchmark

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 as a way to evaluate an AI model’s coding skill. It has since quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” Entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement. Read the full story.

—Russell Brandom

Did solar power cause Spain’s blackout?

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insist that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. Here are a few takeaways that could help our future grid. 

—Casey Crownhart

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 The Trump administration will repeal some global chip curbs 
It’s drawing up new rules that prioritize direct negotiations with various nations. (Bloomberg $)
+ The curbs have always been leaky anyway. (Economist $)

2 India and Pakistan have accused each other of overnight drone attacks
The conflict between the two countries is rapidly escalating. (The Guardian)
+ Pakistan claims to have shot down 25 drones in its airspace. (Reuters)
+ Mass-market military drones have changed the way wars are fought. (MIT Technology Review)

3 The FDA is interested in using AI for drug evaluation
And has met with OpenAI to hear more about how to do it. (Wired $)
+ An AI-driven “factory of drugs” claims to have hit a big milestone. (MIT Technology Review)

4 The US is pushing nations facing its tariffs to adopt Starlink
Government officials in India and other countries have fast-tracked approvals. (WP $)
+ India recently announced new rules for satellite internet providers. (Rest of World)

5 Apple is overhauling its Safari browser to focus on AI search
Its search volume is down for the first time in 22 years. (The Verge)
+ Apple exec Eddy Cue thinks AI search will replace traditional search engines. (Bloomberg $)
+ AI means the end of internet search as we’ve known it. (MIT Technology Review)

6 Mark Zuckerberg is betting big on AI chatbots
He’s on a media charm offensive to convince us that AI friends are the future. (WSJ $)
+ The AI relationship revolution is already here. (MIT Technology Review)

7 Students can’t wean themselves off ChatGPT
And experts fear that they’ll emerge into the workforce essentially illiterate. (NY Mag $)
+ Some educators believe that AI highlights how the ways we teach need to change. (MIT Technology Review)

8 We don’t really know how memory works 🧠
But these researchers are doing their best to find out. (Quanta Magazine)

9 The vast majority of the sea depths are still unexplored
What lies beneath is a mystery. (New Scientist $)
+ Meet the divers trying to figure out how deep humans can go. (MIT Technology Review)

10 Pet psychics are taking over TikTok 🔮
But does your furry friend have anything to say? (NYT $)
+ Humans are still better than AI at futuregazing—for now. (Vox)
+ How DeepSeek became a fortune teller for China’s youth. (MIT Technology Review)

Quote of the day

“It’s like living in hell.”

—Elizabeth Martorana, a Virginia resident, describes what it’s like to live in a development zone for Amazon, Microsoft, and Google data centers, Semafor reports.

One more thing

How Antarctica’s history of isolation is ending—thanks to Starlink

“This is one of the least visited places on planet Earth and I got to open the door,” Matty Jordan, a construction specialist at New Zealand’s Scott Base in Antarctica, wrote in the caption to the video he posted to Instagram and TikTok in October 2023.

In the video, he guides viewers through the hut, pointing out where the men of Ernest Shackleton’s 1907 expedition lived and worked.

The video has racked up millions of views from all over the world. It’s also kind of a miracle: until very recently, those who lived and worked on Antarctic bases had no hope of communicating so readily with the outside world.

That’s starting to change, thanks to Starlink, the satellite constellation developed by Elon Musk’s company SpaceX to service the world with high-speed broadband internet. Read the full story.

—Allegra Rosenberg

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ Does Boston still drink? Not in the same way it used to.
+ Where in the US you should set up camp to stargaze right now.
+ Wow: this New Zealand snail lays eggs from its neck. 🐌
+ Jurassic World Rebirth is coming: and it looks suitably bonkers.

How to build a better AI benchmark

2025-05-08 17:00:00

It’s not easy being one of Silicon Valley’s favorite benchmarks. 

SWE-Bench (pronounced “swee bench”) launched in November 2024 to evaluate an AI model’s coding skill, using more than 2,000 real-world programming problems pulled from the public GitHub repositories of 12 different Python-based projects. 
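The arithmetic behind a benchmark like this is simple: give the model a real GitHub issue, apply its proposed patch, and count the task as resolved only if the repository’s designated tests pass. Here is a minimal, runnable sketch of that scoring loop; the Task fields, the resolution check, and the placeholder model and test runner are simplified assumptions, not the official SWE-Bench harness.

```python
# A minimal sketch of a SWE-Bench-style scoring loop (not the official harness).
# Field names and the "resolved" check are simplified assumptions: each task pairs a
# GitHub issue with a repo snapshot, and a task only counts if the model's patch
# makes the designated failing tests pass.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    problem_statement: str   # the GitHub issue text shown to the model
    repo: str                # e.g. "example-org/example-repo"
    base_commit: str         # commit the patch must apply to
    fail_to_pass: List[str]  # tests that must go from failing to passing

def score(generate_patch: Callable[[Task], str],
          tests_pass: Callable[[Task, str], bool],
          tasks: List[Task]) -> float:
    """Fraction of tasks whose generated patch makes the designated tests pass."""
    resolved = sum(tests_pass(task, generate_patch(task)) for task in tasks)
    return resolved / len(tasks)

# Trivial placeholders stand in for a real model and a sandboxed test runner,
# so the sketch runs as-is and prints a resolution rate of 0.0.
if __name__ == "__main__":
    demo = [Task("Fix crash on empty input", "example-org/example-repo",
                 "abc123", ["test_empty_input"])]
    print(score(lambda t: "--- a/x.py\n+++ b/x.py\n", lambda t, p: False, demo))
```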

In the months since then, it’s quickly become one of the most popular tests in AI. A SWE-Bench score has become a mainstay of major model releases from OpenAI, Anthropic, and Google—and outside of foundation models, the fine-tuners at AI firms are in constant competition to see who can rise above the pack. The top of the leaderboard is a pileup among three different fine-tunings of Anthropic’s Claude Sonnet model and Amazon’s Q developer agent. Auto Code Rover—one of the Claude modifications—nabbed the number two spot in November, and was acquired just three months later.

Despite all the fervor, this isn’t exactly a truthful assessment of which model is “better.” As the benchmark has gained prominence, “you start to see that people really want that top spot,” says John Yang, a researcher on the team that developed SWE-Bench at Princeton University. As a result, entrants have begun to game the system—which is pushing many others to wonder whether there’s a better way to actually measure AI achievement.

Developers of these coding agents aren’t necessarily doing anything as straightforward as cheating, but they’re crafting approaches that are too neatly tailored to the specifics of the benchmark. The initial SWE-Bench test set was limited to programs written in Python, which meant developers could gain an advantage by training their models exclusively on Python code. Soon, Yang noticed that high-scoring models would fail completely when tested on different programming languages—revealing an approach to the test that he describes as “gilded.”

“It looks nice and shiny at first glance, but then you try to run it on a different language and the whole thing just kind of falls apart,” Yang says. “At that point, you’re not designing a software engineering agent. You’re designing to make a SWE-Bench agent, which is much less interesting.”

The SWE-Bench issue is a symptom of a more sweeping—and complicated—problem in AI evaluation, and one that’s increasingly sparking heated debate: The benchmarks the industry uses to guide development are drifting further and further away from evaluating actual capabilities, calling their basic value into question. Making the situation worse, several benchmarks, most notably FrontierMath and Chatbot Arena, have recently come under fire for an alleged lack of transparency. Nevertheless, benchmarks still play a central role in model development, even if few experts are willing to take their results at face value. OpenAI cofounder Andrej Karpathy recently described the situation as “an evaluation crisis”: the industry has fewer trusted methods for measuring capabilities and no clear path to better ones. 

“Historically, benchmarks were the way we evaluated AI systems,” says Vanessa Parli, director of research at Stanford University’s Institute for Human-Centered AI. “Is that the way we want to evaluate systems going forward? And if it’s not, what is the way?”

A growing group of academics and AI researchers are making the case that the answer is to go smaller, trading sweeping ambition for an approach inspired by the social sciences. Specifically, they want to focus more on testing validity, which for quantitative social scientists refers to how well a given questionnaire measures what it’s claiming to measure—and, more fundamentally, whether what it is measuring has a coherent definition. That could cause trouble for benchmarks assessing hazily defined concepts like “reasoning” or “scientific knowledge”—and for developers aiming to reach the much-hyped goal of artificial general intelligence—but it would put the industry on firmer ground as it looks to prove the worth of individual models.

“Taking validity seriously means asking folks in academia, industry, or wherever to show that their system does what they say it does,” says Abigail Jacobs, a University of Michigan professor who is a central figure in the new push for validity. “I think it points to a weakness in the AI world if they want to back off from showing that they can support their claim.”

The limits of traditional testing

If AI companies have been slow to respond to the growing failure of benchmarks, it’s partially because the test-scoring approach has been so effective for so long. 

One of the biggest early successes of contemporary AI was the ImageNet challenge, a kind of antecedent to contemporary benchmarks. Released in 2010 as an open challenge to researchers, the database held more than 3 million images for AI systems to categorize into 1,000 different classes.

Crucially, the test was completely agnostic to methods, and any successful algorithm quickly gained credibility regardless of how it worked. When an algorithm called AlexNet broke through in 2012, with a then unconventional form of GPU training, it became one of the foundational results of modern AI. Few would have guessed in advance that AlexNet’s convolutional neural nets would be the secret to unlocking image recognition—but after it scored well, no one dared dispute it. (One of AlexNet’s developers, Ilya Sutskever, would go on to cofound OpenAI.)

A large part of what made this challenge so effective was that there was little practical difference between ImageNet’s object classification challenge and the actual process of asking a computer to recognize an image. Even if there were disputes about methods, no one doubted that the highest-scoring model would have an advantage when deployed in an actual image recognition system.

But in the 12 years since, AI researchers have applied that same method-agnostic approach to increasingly general tasks. SWE-Bench is commonly used as a proxy for broader coding ability, while other exam-style benchmarks often stand in for reasoning ability. That broad scope makes it difficult to be rigorous about what a specific benchmark measures—which, in turn, makes it hard to use the findings responsibly. 

Where things break down

Anka Reuel, a PhD student who has been focusing on the benchmark problem as part of her research at Stanford, has become convinced the evaluation problem is the result of this push toward generality. “We’ve moved from task-specific models to general-purpose models,” Reuel says. “It’s not about a single task anymore but a whole bunch of tasks, so evaluation becomes harder.”

Like the University of Michigan’s Jacobs, Reuel thinks “the main issue with benchmarks is validity, even more than the practical implementation,” noting: “That’s where a lot of things break down.” For a task as complicated as coding, for instance, it’s nearly impossible to incorporate every possible scenario into your problem set. As a result, it’s hard to gauge whether a model is scoring better because it’s more skilled at coding or because it has more effectively manipulated the problem set. And with so much pressure on developers to achieve record scores, shortcuts are hard to resist.

For developers, the hope is that success on lots of specific benchmarks will add up to a generally capable model. But the techniques of agentic AI mean a single AI system can encompass a complex array of different models, making it hard to evaluate whether improvement on a specific task will lead to generalization. “There’s just many more knobs you can turn,” says Sayash Kapoor, a computer scientist at Princeton and a prominent critic of sloppy practices in the AI industry. “When it comes to agents, they have sort of given up on the best practices for evaluation.”

In a paper from last July, Kapoor called out specific issues in how AI models were approaching the WebArena benchmark, designed by Carnegie Mellon University researchers in 2024 as a test of an AI agent’s ability to traverse the web. The benchmark consists of more than 800 tasks to be performed on a set of cloned websites mimicking Reddit, Wikipedia, and others. Kapoor and his team identified an apparent hack in the winning model, called STeP. STeP included specific instructions about how Reddit structures URLs, allowing STeP models to jump directly to a given user’s profile page (a frequent element of WebArena tasks).
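To make that kind of shortcut concrete, here is a toy Python illustration, with an invented Agent class and URL pattern rather than STeP’s actual code: a general agent has to discover the profile page by interacting with the site, while a benchmark-tailored policy simply hard-codes the clone’s URL scheme.

```python
# Toy illustration of a benchmark-specific shortcut (invented code, not STeP's).
class Agent:
    def goto(self, url: str): print(f"navigate to {url}")
    def click(self, target: str): print(f"click {target}")
    def type_text(self, text: str): print(f"type {text}")

def go_to_profile_general(agent: Agent, username: str):
    # A general web agent must find the profile the hard way, by using the page.
    agent.click("search box")
    agent.type_text(username)
    agent.click(f"result link for {username}")

def go_to_profile_shortcut(agent: Agent, username: str):
    # The tailored policy bakes in the clone's URL scheme and jumps straight there.
    # It scores well on WebArena's Reddit clone but teaches nothing general.
    agent.goto(f"/user/{username}")

go_to_profile_shortcut(Agent(), "example_user")
```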

This shortcut wasn’t exactly cheating, but Kapoor sees it as “a serious misrepresentation of how well the agent would work had it seen the tasks in WebArena for the first time.” Because the technique was successful, though, a similar policy has since been adopted by OpenAI’s web agent Operator. (“Our evaluation setting is designed to assess how well an agent can solve tasks given some instruction about website structures and task execution,” an OpenAI representative said when reached for comment. “This approach is consistent with how others have used and reported results with WebArena.” STeP did not respond to a request for comment.)

Further highlighting the problem with AI benchmarks, late last month Kapoor and a team of researchers wrote a paper that revealed significant problems in Chatbot Arena, the popular crowdsourced evaluation system. According to the paper, the leaderboard was being manipulated; many top foundation models were conducting undisclosed private testing and releasing their scores selectively.

Today, even ImageNet itself, the mother of all benchmarks, has started to fall victim to validity problems. A 2023 study from researchers at the University of Washington and Google Research found that when ImageNet-winning algorithms were pitted against six real-world data sets, the architecture improvement “resulted in little to no progress,” suggesting that the external validity of the test had reached its limit.

Going smaller

For those who believe the main problem is validity, the best fix is reconnecting benchmarks to specific tasks. As Reuel puts it, AI developers “have to resort to these high-level benchmarks that are almost meaningless for downstream consumers, because the benchmark developers can’t anticipate the downstream task anymore.” So what if there was a way to help the downstream consumers identify this gap?

In November 2024, Reuel launched a public ranking project called BetterBench, which rates benchmarks on dozens of different criteria, such as whether the code has been publicly documented. But validity is a central theme, with particular criteria challenging designers to spell out what capability their benchmark is testing and how it relates to the tasks that make up the benchmark.

“You need to have a structural breakdown of the capabilities,” Reuel says. “What are the actual skills you care about, and how do you operationalize them into something we can measure?”

The results are surprising. One of the highest-scoring benchmarks is also the oldest: the Arcade Learning Environment (ALE), established in 2013 as a way to test models’ ability to learn how to play a library of Atari 2600 games. One of the lowest-scoring is the Massive Multitask Language Understanding (MMLU) benchmark, a widely used test for general language skills; by the standards of BetterBench, the connection between the questions and the underlying skill was too poorly defined.

BetterBench hasn’t meant much for the reputations of specific benchmarks, at least not yet; MMLU is still widely used, and ALE is still marginal. But the project has succeeded in pushing validity into the broader conversation about how to fix benchmarks. In April, Reuel quietly joined a new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI, where she’ll develop her ideas on validity and AI model evaluation with other figures in the field. (An official announcement is expected later this month.) 

Irene Solaiman, Hugging Face’s head of global policy, says the group will focus on building valid benchmarks that go beyond measuring straightforward capabilities. “There’s just so much hunger for a good benchmark off the shelf that already works,” Solaiman says. “A lot of evaluations are trying to do too much.”

Increasingly, the rest of the industry seems to agree. In a paper in March, researchers from Google, Microsoft, Anthropic, and others laid out a new framework for improving evaluations—with validity as the first step. 

“AI evaluation science must,” the researchers argue, “move beyond coarse grained claims of ‘general intelligence’ towards more task-specific and real-world relevant measures of progress.” 

Measuring the “squishy” things

To help make this shift, some researchers are looking to the tools of social science. A February position paper argued that “evaluating GenAI systems is a social science measurement challenge,” specifically unpacking how the validity systems used in social measurements can be applied to AI benchmarking. 

The authors, largely employed by Microsoft’s research branch but joined by academics from Stanford and the University of Michigan, point to the standards that social scientists use to measure contested concepts like ideology, democracy, and media bias. Applied to AI benchmarks, those same procedures could offer a way to measure concepts like “reasoning” and “math proficiency” without slipping into hazy generalizations.

In the social science literature, it’s particularly important that metrics begin with a rigorous definition of the concept measured by the test. For instance, if the test is to measure how democratic a society is, it first needs to establish a definition for a “democratic society” and then establish questions that are relevant to that definition. 

To apply this to a benchmark like SWE-Bench, designers would need to set aside the classic machine learning approach, which is to collect programming problems from GitHub and create a scheme to validate answers as true or false. Instead, they’d first need to define what the benchmark aims to measure (“ability to resolve flagged issues in software,” for instance), break that into subskills (different types of problems or types of program that the AI model can successfully process), and then finally assemble questions that accurately cover the different subskills.
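As a rough sketch of what that hierarchy might look like in practice, the snippet below encodes a construct, its subskills, and their test items as plain data. The construct, subskills, and items are illustrative placeholders, not a real benchmark specification.

```python
# Illustrative "validity-first" benchmark spec: construct -> subskills -> test items.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TestItem:
    prompt: str      # the problem shown to the model
    checker: str     # how a response is judged, e.g. "repo's unit tests pass"

@dataclass
class Subskill:
    name: str
    items: List[TestItem] = field(default_factory=list)

@dataclass
class BenchmarkSpec:
    construct: str   # the capability the benchmark claims to measure
    subskills: List[Subskill] = field(default_factory=list)

spec = BenchmarkSpec(
    construct="ability to resolve flagged issues in software",
    subskills=[
        Subskill("single-file bug fixes",
                 [TestItem("Fix the off-by-one error in pagination",
                           "unit tests pass")]),
        Subskill("cross-module refactors",
                 [TestItem("Rename the config loader without breaking callers",
                           "unit tests pass")]),
    ],
)
print(spec.construct, "covers", len(spec.subskills), "subskills")
```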

It’s a profound change from how AI researchers typically approach benchmarking—but for researchers like Jacobs, a coauthor on the February paper, that’s the whole point. “There’s a mismatch between what’s happening in the tech industry and these tools from social science,” she says. “We have decades and decades of thinking about how we want to measure these squishy things about humans.”

Even though the idea has made a real impact in the research world, it’s been slow to influence the way AI companies are actually using benchmarks. 

The last two months have seen new model releases from OpenAI, Anthropic, Google, and Meta, and all of them lean heavily on multiple-choice knowledge benchmarks like MMLU—the exact approach that validity researchers are trying to move past. After all, model releases are, for the most part, still about showing increases in general intelligence, and broad benchmarks continue to be used to back up those claims. 

For some observers, that’s good enough. Benchmarks, Wharton professor Ethan Mollick says, are “bad measures of things, but also they’re what we’ve got.” He adds: “At the same time, the models are getting better. A lot of sins are forgiven by fast progress.”

For now, the industry’s long-standing focus on artificial general intelligence seems to be crowding out a more focused validity-based approach. As long as AI models can keep growing in general intelligence, then specific applications don’t seem as compelling—even if that leaves practitioners relying on tools they no longer fully trust. 

“This is the tightrope we’re walking,” says Hugging Face’s Solaiman. “It’s too easy to throw the system out, but evaluations are really helpful in understanding our models, even with these limitations.”

Russell Brandom is a freelance writer covering artificial intelligence. He lives in Brooklyn with his wife and two cats.

This story was supported by a grant from the Tarbell Center for AI Journalism.

Did solar power cause Spain’s blackout?

2025-05-08 17:00:00

At roughly midday on Monday, April 28, the lights went out in Spain. The grid blackout, which extended into parts of Portugal and France, affected tens of millions of people—flights were grounded, cell networks went down, and businesses closed for the day.

Over a week later, officials still aren’t entirely sure what happened, but some (including the US energy secretary, Chris Wright) have suggested that renewables may have played a role, because just before the outage happened, wind and solar accounted for about 70% of electricity generation. Others, including Spanish government officials, insist that it’s too early to assign blame.

It’ll take weeks to get the full report, but we do know a few things about what happened. And even as we wait for the bigger picture, there are a few takeaways that could help our future grid.

Let’s start with what we know so far about what happened, according to the Spanish grid operator Red Eléctrica:

  • A disruption in electricity generation took place a little after 12:30 p.m. This may have been a power plant flipping off or some transmission equipment going down.
  • A little over a second later, the grid lost another bit of generation.
  • A few seconds after that, the main interconnector between Spain and southwestern France got disconnected as a result of grid instability.
  • Immediately after, virtually all of Spain’s electricity generation tripped offline.

One of the theories floating around is that things went wrong because the grid diverged from its normal frequency. (All power grids have a set frequency: In Europe the standard is 50 hertz, meaning the alternating current completes 50 full cycles per second.) The frequency needs to be constant across the grid to keep things running smoothly.

There are signs that the outage could be frequency-related. Some experts pointed out that strange oscillations in the grid frequency occurred shortly before the blackout.

Normally, our grid can handle small problems like an oscillation in frequency or a drop that comes from a power plant going offline. But some of the grid’s ability to stabilize itself is tied up in old ways of generating electricity.

Power plants like those that run on coal and natural gas have massive rotating generators. If there are brief issues on the grid that upset the balance, those physical bits of equipment have inertia: They’ll keep moving at least for a few seconds, providing some time for other power sources to respond and pick up the slack. (I’m simplifying here—for more details I’d highly recommend this report from the National Renewable Energy Laboratory.)

Solar panels don’t have inertia—they rely on inverters to change electricity into a form that’s compatible with the grid and matches its frequency. Generally, these inverters are “grid-following,” meaning if frequency is dropping, they follow that drop.

In the case of the blackout in Spain, it’s possible that having a lot of power on the grid coming from sources without inertia made it more possible for a small problem to become a much bigger one.
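For a rough sense of why inertia buys time, consider a toy calculation with the simplified swing equation, df/dt = -f0 · ΔP / (2H), which says frequency falls faster when the generation deficit ΔP is larger or the system inertia H is smaller. The 10% deficit and the inertia constants below are made-up example values, not figures from the Spanish event.

```python
# Toy swing-equation calculation: how fast frequency falls after losing generation,
# for a high-inertia vs. a low-inertia grid. Numbers are illustrative only.
f0 = 50.0        # nominal European grid frequency, Hz
deficit = 0.10   # sudden loss of 10% of generation, per unit
dt = 0.05        # simulation time step, seconds

for H in (6.0, 2.0):              # system inertia constant, seconds
    f, t = f0, 0.0
    while f > 49.0 and t < 5.0:   # many grids start shedding load around 49 Hz
        f += -f0 * deficit / (2 * H) * dt
        t += dt
    print(f"H = {H} s: frequency hits 49 Hz after roughly {t:.1f} s")
```

With the higher inertia constant, the frequency takes about three times as long to fall by one hertz, which is exactly the breathing room that other power sources and protection systems need.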

Some key questions here are still unanswered. The order matters, for example. During that drop in generation, did wind and solar plants go offline first? Or did everything go down together?

Whether or not solar and wind contributed to the blackout as a root cause, we do know that wind and solar don’t contribute to grid stability in the same way that some other power sources do, says Seaver Wang, climate lead of the Breakthrough Institute, an environmental research organization. Regardless of whether renewables are to blame, more capability to stabilize the grid would only help, he adds.

It’s not that a renewable-heavy grid is doomed to fail. As Wang put it in an analysis he wrote last week: “This blackout is not the inevitable outcome of running an electricity system with substantial amounts of wind and solar power.”

One solution: We can make sure the grid includes enough equipment that does provide inertia, like nuclear power and hydropower. Reversing a plan to shut down Spain’s nuclear reactors beginning in 2027 would be helpful, Wang says. Other options include building massive machines that lend physical inertia and using inverters that are “grid-forming,” meaning they can actively help regulate frequency and provide a sort of synthetic inertia.

Inertia isn’t everything, though. Grid operators can also rely on installing a lot of batteries that can respond quickly when problems arise. (Spain has much less grid storage than other places with a high level of renewable penetration, like Texas and California.)

Ultimately, if there’s one takeaway here, it’s that as the grid evolves, our methods to keep it reliable and stable will need to evolve too.

If you’re curious to hear more on this story, I’d recommend this Q&A from Carbon Brief about the event and its aftermath and this piece from Heatmap about inertia, renewables, and the blackout.

This article is from The Spark, MIT Technology Review’s weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

The business of the future is adaptive

2025-05-07 22:14:41

Manufacturing is in a state of flux. From supply chain disruptions to rising costs, tougher environmental regulations, and a changing consumer market, the sector faces a series of competing challenges.

But a new way of operating offers a way to tackle complexities head-on: adaptive production hardwires flexibility and resilience into the enterprise, drawing on powerful tools like artificial intelligence, digital twins, and robotics. Taking automation a step further, adaptive production allows manufacturers to respond in real time to demand fluctuations, adapt to supply chain disruptions, and autonomously optimize operations. It also facilitates an unprecedented level of personalization and customization for regional markets.

Time to adapt

The journey to adaptive production is not just about addressing today’s pressures, like rising costs and supply chain disruptions—it’s about positioning businesses for long-term success in a world of constant change. “In the coming years,” says Jana Kirchheim, director of manufacturing for Microsoft Germany, “I expect that new key technologies like copilots, small language models, high-performance computing, or the adaptive cloud approach will revolutionize the shop floor and accelerate industrial automation by enabling faster adjustments and re-programming for specific tasks.” These capabilities make adaptive production a transformative force, enhancing responsiveness and opening doors to systems with increasing autonomy—designed to complement human ingenuity rather than replace it.

These advances enable more than technical upgrades—they drive fundamental shifts in how manufacturers operate. John Hart, professor of mechanical engineering and director of MIT’s Center for Advanced Production Technologies, explains that automation is “going from a rigid high-volume, low-mix focus”—where factories make large quantities of very few products—“to more flexible high-volume, high-mix, and low-volume, high-mix scenarios”—where many product types can be made in custom quantities. These new capabilities demand a fundamental shift in how value is created and captured.

Download the full report.

This content was produced by Insights, the custom content arm of MIT Technology Review. It was not written by MIT Technology Review’s editorial staff.

This content was researched, designed, and written entirely by human writers, editors, analysts, and illustrators. This includes the writing of surveys and collection of data for surveys. AI tools that may have been used were limited to secondary production processes that passed thorough human review.

The Download: Neuralink’s AI boost, and Trump’s tariffs

2025-05-07 20:10:00

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

This patient’s Neuralink brain implant gets a boost from generative AI

Last November, Bradford G. Smith got a brain implant from Elon Musk’s company Neuralink. The device, a set of thin wires attached to a minuscule computer that sits in his skull, lets him use his thoughts to move a computer pointer on a screen. And by last week he was ready to reveal it in a post on X.

Smith’s case is drawing interest because he’s not only communicating via a brain implant but also getting help from Grok, Musk’s AI chatbot. The generative AI is speeding up the rate at which he can communicate, but it also raises questions about who is really talking—him or Musk’s software. Read the full story.

—Antonio Regalado

MIT Technology Review Narrated: How Trump’s tariffs could drive up the cost of batteries, EVs, and more

The Trump administration’s hostile trade plans threaten to slow the shift to cleaner industries, boost inflation, and stall the economy.

This is our latest story to be turned into an MIT Technology Review Narrated podcast, which we’re publishing each week on Spotify and Apple Podcasts. Just navigate to MIT Technology Review Narrated on either platform, and follow us to get all our new content as it’s released.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 NSO Group has been ordered to pay Meta $167 million
After the Israeli firm’s spyware was used to hack journalists, activists, and politicians. (NYT $)
+ The firm has been implicated in abusive surveillance before. (Reuters)

2 OpenAI plans to reduce the fraction of its revenue Microsoft receives
It told investors it plans to slash the shared revenue from 20% to 10%. (The Information $)
+ Its for-profit U-turn is not yet a done deal. (Bloomberg $)
+ We still don’t know a lot about OpenAI’s structure. (Economist $)

3 The Trump administration is axing the Energy Star program
The project certifies the energy efficiency of home appliances in the US. (WP $)

4 The US Justice Department wants Google to sell its ad businesses
But Google claims it’s not technically feasible. (WSJ $)
+ The judge says he’ll rule on the remedies by August. (The Information $)

5 Grok AI is undressing women on X
That’s what happens when you create AI models without proper guardrails. (404 Media)
+ Text-to-image AI models can be tricked into generating nude images. (MIT Technology Review)

6 Private investors are prepared to plow billions into Europe’s defense industry
They’re stepping up to fill gaps that governments can’t fund. (FT $)
+ The US is likely to strike a weapons deal in Riyadh next week. (Semafor)
+ Phase two of military AI has arrived. (MIT Technology Review)

7 Can anyone stop Starlink?
The speed of its total dominance of the satellite sector is unprecedented. (The Atlantic $)
+ The world’s next big environmental problem could come from space. (MIT Technology Review)

8 Amazon’s new robot has a sense of touch
To help it grab items in the e-retail giant’s warehouses. (The Guardian)
+ The Vulcan robot could end up shouldering more manufacturing work in the future. (Wired $)
+ Will we ever trust robots? (MIT Technology Review)

9 Argentina is investing big in nuclear-powered AI data centers
In an effort to attract big tech firms from overseas. (Rest of World)
+ Meanwhile, the Zaporizhzhia nuclear power plant in Ukraine is still not operational. (IEEE Spectrum)
+ China built hundreds of AI data centers to catch the AI boom. Now many stand unused. (MIT Technology Review)

10 RIP prompt engineering
The hottest job of 2023 is quickly fizzling out. (Fast Company $)

Quote of the day

“I think upper class households will be able to have something that makes your Roomba look like a total joke.”

—Reddit cofounder Alexis Ohanian believes that advanced robot domestic helpers are imminent, Insider reports.

One more thing

Is this the end of animal testing?

Animal studies are notoriously bad at identifying human treatments. Around 95% of the drugs developed through animal research fail in people, but until recently there was no other option.

Now organs on chips, also known as microphysiological systems, may offer a truly viable alternative. They’re triumphs of bioengineering, intricate constructions furrowed with tiny channels that are lined with living human tissues that expand and contract with the flow of fluid and air, mimicking key organ functions like breathing, blood flow, and peristalsis, the muscular contractions of the digestive system.

It’s only early days, but if they work as hoped, organs on chips could solve one of the biggest problems in medicine today. Read the full story.

—Harriet Brown

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet ’em at me.)

+ If you enjoyed this year’s Met Gala theme, check out this fascinating history of the Black dandy in art.
+ 2025 is shaping up to be a great year for literature.
+ The good news is that GTA VI finally has a release date—but it’s over a year away.
+ Chocolate pie for breakfast? I could be convinced.

This patient’s Neuralink brain implant gets a boost from generative AI

2025-05-07 17:00:00

Last November, Bradford G. Smith got a brain implant from Elon Musk’s company Neuralink. The device, a set of thin wires attached to a computer about the thickness of a few quarters that sits in his skull, lets him use his thoughts to move a computer pointer on a screen. 

And by last week he was ready to reveal it in a post on X.

“I am the 3rd person in the world to receive the @Neuralink brain implant. 1st with ALS. 1st Nonverbal. I am typing this with my brain. It is my primary communication,” he wrote. “Ask me anything! I will answer at least all verified users!”

Smith’s case is drawing interest because he’s not only communicating via a brain implant but also getting help from Grok, Musk’s AI chatbot, which is suggesting how Smith can add to conversations and drafted some of the replies he posted to X. 

The generative AI is speeding up the rate at which he can communicate, but it also raises questions about who is really talking—him or Musk’s software. 

“There is a trade-off between speed and accuracy. The promise of brain-computer interface is that if you can combine it with AI, it can be much faster,” says Eran Klein, a neurologist at the University of Washington who studies the ethics of brain implants. 

Smith is a Mormon with three kids who learned he had ALS after a shoulder injury he sustained in a church dodgeball game wouldn’t heal. As the disease progressed, he lost the ability to move anything except his eyes, and he was no longer able to speak. When his lungs stopped pumping, he made the decision to stay alive with a breathing tube.

Starting in 2024, he began trying to get accepted into Neuralink’s implant study via “a campaign of shameless self-promotion,” he told his local paper in Arizona: “I really wanted this.”

The day before his surgery, Musk himself appeared on a mobile phone screen to wish Smith well. “I hope this is a game changer for you and your family,” Musk said, according to a video of the call.

“I am so excited to get this in my head,” Smith replied, typing out an answer using a device that tracks his eye movement. This was the technology he’d previously used to communicate, albeit slowly.

Smith was about to get brain surgery, but Musk’s virtual appearance foretold a greater transformation. Smith’s brain was about to be inducted into a much larger technology and media ecosystem—one of whose goals, the billionaire has said, is to achieve a “symbiosis” of humans and AI.

Consider what unfolded on April 27, the day Smith announced on X that he’d received the brain implant and wanted to take questions. One of the first came from “Adrian Dittmann,” an account often suspected of being Musk’s alter ego.

Dittmann: “Congrats! Can you describe how it feels to type and interact with technology overall using the Neuralink?”

Smith: “Hey Adrian, it’s Brad—typing this straight from my brain! It feels wild, like I’m a cyborg from a sci-fi movie, moving a cursor just by thinking about it. At first, it was a struggle—my cursor acted like a drunk mouse, barely hitting targets, but after weeks of training with imagined hand and jaw movements, it clicked, almost like riding a bike.”

Another user, noting the smooth wording and punctuation (a long dash is a special character, used frequently by AIs but not as often by human posters), asked whether the reply had been written by AI. 

Smith didn’t answer on X. But in a message to MIT Technology Review, he confirmed he’d used Grok to draft answers after he gave the chatbot notes he’d been taking on his progress. “I asked Grok to use that text to give full answers to the questions,” Smith emailed us. “I am responsible for the content, but I used AI to draft.”

The exchange on X in many ways seems like an almost surreal example of cross-marketing. After all, Smith was posting from a Musk implant, with the help of a Musk AI, on a Musk media platform and in reply to a famous Musk fanboy, if not actually the “alt” of the richest person in the world. So it’s fair to ask: Where does Smith end and Musk’s ecosystem begin? 

That’s a question drawing attention from neuro-ethicists, who say Smith’s case highlights key issues about the prospect that brain implants and AI will one day merge.

What’s amazing, of course, is that Smith can steer a pointer with his brain well enough to text with his wife at home and answer our emails. Since he’d only been semi-famous for a few days, he told us, he didn’t want to opine too much on philosophical questions about the authenticity of his AI-assisted posts. “I don’t want to wade in over my head,” he said. “I leave it for experts to argue about that!”

The eye tracker Smith previously used to type required low light and worked only indoors. “I was basically Batman stuck in a dark room,” he explained in a video he posted to X. The implant lets him type in brighter spaces—even outdoors—and quite a bit faster.

The thin wires implanted in his brain listen to neurons. Because their signals are faint, they need to be amplified, filtered, and sampled to extract the most important features—which are sent from his brain to a MacBook via radio and then processed further to let him move the computer pointer.
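As a very rough illustration of that chain, the sketch below band-pass filters a chunk of raw multichannel data, computes one spike-band power feature per channel, and maps the features to a two-dimensional cursor velocity with a linear decoder. All of the numbers, the filter settings, and the decoder weights are invented for illustration; this is not Neuralink’s actual pipeline.

```python
# Simplified brain-to-cursor sketch: filter, extract per-channel power, decode velocity.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 20_000                                               # assumed sampling rate, Hz
b, a = butter(4, [300, 3000], btype="bandpass", fs=FS)    # spike-band filter

def features(raw: np.ndarray) -> np.ndarray:
    """raw: (channels, samples) -> one log-power feature per channel."""
    filtered = filtfilt(b, a, raw, axis=1)
    return np.log(np.mean(filtered ** 2, axis=1) + 1e-12)

def decode_velocity(feats: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear decoder: a (2 x channels) weight matrix maps features to (vx, vy)."""
    return weights @ feats

rng = np.random.default_rng(0)
raw_chunk = rng.standard_normal((64, FS // 10))   # 100 ms of fake data, 64 channels
W = rng.standard_normal((2, 64)) * 0.01           # stand-in for a trained decoder
print(decode_velocity(features(raw_chunk), W))    # cursor velocity for this time bin
```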

With control over this pointer, Smith types using an app. But various AI technologies are helping him express himself more naturally and quickly. One is a service from a startup called ElevenLabs, which created a copy of his voice from some recordings he’d made when he was healthy. The “voice clone” can read his written words aloud in a way that sounds like him. (The service is already used by other ALS patients who don’t have implants.) 

Researchers have been studying how ALS patients feel about the idea of aids like language assistants. In 2022, Klein interviewed 51 people with ALS and found a range of different opinions. 

Some people are exacting, like a librarian who felt everything she communicated had to be her words. Others are easygoing—an entertainer felt it would be more important to keep up with a fast-moving conversation. 

In the video Smith posted online, he said Neuralink engineers had started using language models including ChatGPT and Grok to serve up a selection of relevant replies to questions, as well as options for things he could say in conversations going on around him. One example that he outlined: “My friend asked me for ideas for his girlfriend who loves horses. I chose the option that told him in my voice to get her a bouquet of carrots. What a creative and funny idea.” 

These aren’t really his thoughts, but they will do—since brain-clicking once in a menu of choices is much faster than typing out a complete answer, which can take minutes. 
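The interaction pattern is easy to picture: the language model proposes a handful of candidate replies, and the user picks one with a single cursor click rather than typing a full answer. The sketch below is a hedged guess at that pattern, with a dummy stand-in for the chatbot; it is not how Neuralink’s software is actually built.

```python
# Hedged sketch of menu-based replies: an LLM suggests options, the user clicks one.
def suggest_replies(llm, conversation: str, n: int = 4) -> list[str]:
    prompt = (f"Suggest {n} short replies, in the user's own voice, "
              f"to this conversation:\n{conversation}")
    return llm(prompt)[:n]

def pick(options: list[str], selection: int) -> str:
    # One brain-click in a menu replaces minutes of letter-by-letter typing.
    return options[selection]

# A dummy "LLM" stands in for a real chat model so the sketch runs as-is.
fake_llm = lambda prompt: ["Get her a bouquet of carrots.", "Maybe a riding lesson?",
                           "A horse-themed mug.", "Ask what she already has."]
ideas = suggest_replies(fake_llm, "Gift ideas for a girlfriend who loves horses?")
print(pick(ideas, 0))
```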

Smith told us he wants to take things a step further. He says he has an idea for a more “personal” large language model that “trains on my past writing and answers with my opinions and style.”  He told MIT Technology Review that he’s looking for someone willing to create it for him: “If you know of anyone who wants to help me, let me know.”