
Understanding AI

By Timothy B. Lee, a tech reporter with a master’s in computer science who covers AI progress and policy.

Donald Trump just laid out his vision for AI policy

2025-07-26 01:20:49

The Biden administration often seemed ambivalent about AI. In a 2023 executive order, Joe Biden declared that irresponsible use of AI could “exacerbate societal harms such as fraud, discrimination, bias, and disinformation; displace and disempower workers; stifle competition; and pose risks to national security.”

On Wednesday, the Trump administration released an AI Action Plan with a very different tone.

“Winning the AI race will usher in a new golden age of human flourishing, economic competitiveness, and national security for the American people,” the document said.

In a speech later that day, Trump vowed to use every policy lever to keep the United States on the cutting edge.

“To ensure America maintains the world-class infrastructure we need to win, today I will sign a sweeping executive order to fast-track federal permitting, streamline reviews, and do everything possible to expedite construction of all major AI infrastructure projects,” Trump said.

“The last administration was obsessed with imposing restrictions on AI, including extreme restrictions on its exports,” Trump said. Trump touted his earlier decision to cancel the diffusion rule, a Biden-era policy that tried to regulate data centers in third countries in order to prevent Chinese companies from using American chips.

In a surprise move, Trump’s speech argued that AI companies should be allowed to train their models on copyrighted material without permission from rights holders. The issue, which is currently being hashed out in court, was not mentioned in the written AI Action Plan.

In short, the Biden administration aimed to strictly control AI in order to prevent harmful uses and contain China. The Trump administration, in contrast, wants to unleash American AI companies so they can compete more effectively with their Chinese rivals.

“This plan sees AI as a giant opportunity for America and is optimistic about what it can do,” said Neil Chilson, a policy expert at the right-of-center Abundance Institute. “It's pretty concrete on a wide range of things the federal government can do to make that happen.”


Republicans don’t all love Silicon Valley

It’s common for reporters like me to write about “the Trump administration,” but of course that isn’t a monolithic entity. Below the president are dozens of White House staffers and thousands of officials across the federal government who are involved in crafting and executing government policies.

Often these people disagree with one another. And sometimes that means that what an administration actually does is different from what it said it was going to do. That’s especially true under Donald Trump, a mercurial president who isn’t known for his attention to detail. So when interpreting a document like the AI Action Plan, it’s helpful to know the context in which it is written.

Donald Trump signs an executive order on AI flanked by two co-authors of the AI Action Plan: Office of Science and Technology Policy director Michael Kratsios and Special Advisor for AI David Sacks. (Photo by Chip Somodevilla/Getty Images)

The AI Action Plan was drafted by the White House Office of Science and Technology Policy (OSTP) and signed by its director, Michael Kratsios. Kratsios is a protégé of Peter Thiel and a former executive of the AI startup Scale.1 The Action Plan was cosigned by White House AI advisor David Sacks, a venture capitalist and a member (along with Peter Thiel) of the PayPal Mafia.2

In short, Trump’s AI plan was shaped by Silicon Valley insiders who have a favorable view of AI companies. But not everyone in the Republican Party is so friendly to either the AI industry or Silicon Valley more broadly.

One area of ongoing tension has been over state-level regulation. When Congress was debating the One Big Beautiful Bill Act last month, Sen. Ted Cruz (R-TX) proposed an amendment that would have prohibited states from regulating AI for ten years. But other senators with reputations as Silicon Valley critics, including Sen. Marsha Blackburn (R-TN) and Sen. Josh Hawley (R-MO), opposed Cruz’s language. The Senate ultimately stripped Cruz’s amendment from the bill in a 99-1 vote.

The AI Action Plan includes a watered-down version of the same rule. It directs federal agencies awarding AI-related funding to states to “consider a state’s AI regulatory climate when making funding decisions.” In theory, this could lead to states with stricter AI laws losing federal grants, but it’s not clear how much AI-related funding federal agencies actually hand out.

How much this language matters could depend on who runs the agencies that actually implement the directive—and specifically whether they agree more with tech industry allies like Cruz and Sacks or critics like Hawley and Blackburn.

It’s a similar story with the most widely discussed provision of the Action Plan: a rule requiring the federal government to only pay for AI models that are “objective and free from top-down ideological bias.” Trump also signed an executive order laying out the administration’s approach in more detail.

It was probably inevitable that Trump’s AI plan would include language like this, but the authors of the executive order seemed to be trying hard to limit its impact. The rule requires only that government-purchased LLMs be free of ideological bias. It does not require a company to make all of its models ideologically neutral if it wants to compete for federal contracts—a rule that would have raised more serious constitutional issues.

Model providers will also have an option to comply with the rule by being “transparent about ideological judgments through disclosure of the LLM’s system prompt, specifications, evaluations, or other relevant documentation.” So AI companies may not have to actually change their models in order to be eligible for federal contracts.

Still, Matt Mittelsteadt, an AI policy researcher at the libertarian Cato Institute, called the “woke AI” provision a “big mistake.”

“They're going to use their procurement power to try and push model developers in a certain political direction,” he said in a Thursday phone interview.

The executive order mentions an infamous 2024 incident in which Google’s AI model generated images of female popes and black Founding Fathers. But Mittelsteadt pointed out that it doesn’t mention the more recent incident in which Grok posted a series of antisemitic rants to X and endorsed Adolf Hitler’s ability to “deal with” “anti-white hate.” Indeed, the Department of Defense awarded a contract to Grok worth up to $200 million just days after Grok’s antisemitic tirades.

“I'm worried other countries are going to start viewing our models like we view China's models,” Mittelsteadt added. “Nobody wants to use a model that's seen as a propaganda tool.”

A technocratic wish list

Although Trump’s AI Action Plan has very different vibes than the Biden order that preceded it, there’s more continuity between administrations than you might expect.

“So much of this did not feel like a far cry from a potential Harris administration,” Mittelsteadt said.

The AI Action Plan proposes:

  • Funding scientific research into AI models—as well as computing infrastructure to support such research

  • Providing training to workers impacted by AI

  • Promoting the use of AI by government

  • Improving information sharing to combat cybersecurity threats created by AI

  • Beefing up export controls to hamper China’s AI progress

  • Promoting semiconductor manufacturing in the United States

  • Enhancing and expanding the electric grid to support power-hungry AI efforts

Most AI-related policy issues are not especially partisan. All of the ideas on the list above have supporters across the ideological spectrum. But the way Republicans and Democrats justify these policies can be quite different.

The electric grid is an interesting example here. The Biden administration sought to upgrade the electric grid to enable electrification and combat climate change. The Trump administration, in contrast, vows to “reject radical climate dogma.” But it still wants to upgrade the electric grid because AI requires a lot of electricity.

In addition to upgrading the electric grid, the AI Action Plan calls for embracing “new energy generation sources at the technological frontier (e.g., enhanced geothermal, nuclear fission, and nuclear fusion).” Solar and wind are conspicuously absent from this list because those power sources have become politically polarized. But geothermal and nuclear power are also zero-carbon energy sources.

In short, the AI Action Plan could easily aid in decarbonization efforts despite the Trump administration’s avowed disinterest in that goal.

Indeed, I can easily imagine Trump’s administration being more successful at upgrading America’s electrical infrastructure because Republicans are likely to be ruthless about cutting the red tape that often holds back new construction. During the Biden years, Sen. Joe Manchin (D-WV) championed legislation that would have fast-tracked improvements to the electric grid. But the bill never got across the finish line. Rebranding this as a matter of AI competitiveness might give it broader support.

In 2022, Joe Biden signed the bipartisan CHIPS and Science Act, which provided subsidies to encourage companies to build chip fabs in the United States. Trump has blasted the CHIPS Act as “horrible” and called for its repeal, but his new AI Action Plan calls for the CHIPS Program Office (created by the act) to “continue focusing on delivering a strong return on investment for the American taxpayer and removing all extraneous policy requirements for CHIPS-funded semiconductor manufacturing projects.”

Once again, it’s easy to imagine a Republican administration being more effective here. In a 2023 piece for the New York Times, liberal columnist Ezra Klein used the CHIPS Act as an example in his critique of “everything-bagel liberalism”: Democrats’ tendency to lard up worthwhile programs with so many extraneous requirements—from minority set-asides to child care access for construction workers—that they become ineffective at their core mission.

By stripping out “extraneous policy requirements,” the Trump administration may be more effective at using chip subsidies to actually promote domestic chip manufacturing.


Trump has a tendency to change his mind

Still, everything can change depending on who catches the ear of the president. A few weeks ago, Sen. Hawley asked Donald Trump to cancel a federal loan guarantee for the Grain Belt Express, an $11 billion transmission line that is supposed to carry electricity from Kansas wind farms to homes in Illinois and Indiana. The proposed line would pass through Hawley’s state of Missouri, annoying farmers whose land would be taken by eminent domain.

According to the New York Times, Hawley “explained his concerns to the president, who has repeatedly expressed his distaste for wind power, saying he would not permit any new wind projects during his administration.” Trump then called Energy Secretary Chris Wright, who agreed to cancel a federal loan guarantee for the project, throwing its viability into question.

“Their apparent plan to try to muck up the gears of solar and wind really is counter to the goal of modernizing the grid and ensuring reliability and access to energy,” Cato’s Matt Mittelsteadt told me.

Something similar has happened with export controls. The Trump administration includes China hawks who favor strict export controls to hamper the development of Chinese AI technology. But other administration officials believe it’s more important to promote the export of American technology, including semiconductors, around the world. There’s an inherent tension between these objectives, and the AI Action Plan seems to reflect a compromise between these factions. It simultaneously advocates promoting the sale of American AI to US allies (Trump signed a stand-alone executive order laying out a plan to do this) and strengthening the US export control regime.

The AI Action Plan has an “emphasis on the need for the world to adopt the American AI stack,” according to Neil Chilson of the Abundance Institute. “That is in some tension with export controls, and so it will be interesting to see how they try to square that.”

In April, the Trump administration banned Nvidia from selling the H20—a GPU whose capabilities were deliberately limited to comply with Biden-era export controls—to China. But then a few weeks ago, Nvidia CEO Jensen Huang “met with Mr. Trump in the Oval Office and pressed his case for restarting sales of his specialized chips.” His arguments persuaded Trump, who allowed Nvidia to resume Chinese exports days later.

So a lot depends on which issues catch Donald Trump’s attention—and how those issues are framed for him. If he can be persuaded that a policy is intended to promote diversity or combat climate change, he’s likely to block it. On the other hand, if it promotes American dominance of the AI sector, he’s likely to support it. Because we never know who Trump will meet with next or what they will talk about, it’s hard to predict how the government’s policies might change from month to month.


A note for my fellow journalists: My former boss, Ars Technica editor-in-chief Ken Fisher, is co-hosting an event for journalists on July 30 at OpenAI’s New York headquarters. Almost a year after Condé Nast signed a deal with OpenAI, executives from the two companies (including Nick Turley, Anna Makanju, and Kate Rouch from OpenAI) will chat about the latest in LLMs—the technology, adoption patterns, and potential use cases. The event is open to members of the press and will run from 2 to 5pm. To RSVP, email [email protected].

1

Kratsios hired Dean Ball, the former co-host of my podcast AI Summer, to do AI policy work at OSTP.

2

The third co-author was Marco Rubio, Secretary of State and Assistant to the President for National Security Affairs.

ChatGPT Agent: a big improvement but still not very useful

2025-07-22 20:05:37

Agents were a huge focus for the AI industry in the first half of 2025. OpenAI announced two major agentic products aimed at non-programmers—Deep Research (which I reviewed favorably in February) and Operator (which I covered critically in June).

Last Thursday, OpenAI launched a new product called ChatGPT Agent. It’s OpenAI’s attempt to combine the best characteristics of both products.

“These two approaches are actually deeply complementary,” said OpenAI researcher Casey Chu in the launch video for ChatGPT Agent. “Operator has some trouble reading super-long articles. It has to scroll. It takes a long time. But that’s something Deep Research is good at. Conversely, Deep Research isn’t as good at interacting with web pages—interactive elements, highly visual web pages—but that’s something that Operator excels at.”

“We trained the model to move between these capabilities with reinforcement learning,” said OpenAI researcher Zhiqing Sun during the same livestream. “This is the first model we’ve trained that has access to this unified toolbox: a text browser, a GUI browser, and a terminal, all in one virtual machine.”

“To guide its learning, we created hard tasks that require using all these tools,” Sun added. “This allows the model not only to learn how to use these tools, but also when to use which tools depending on the task at hand.”
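To make the unified-toolbox idea concrete, here is a very rough sketch of what such an agent loop might look like. The three tool names mirror the quote; the stub implementations and the `model.next_step` interface are my own invention for illustration, not OpenAI’s actual design.

```python
import subprocess

def text_browser(url: str) -> str:
    # Stub: a real implementation would fetch the page and strip it to plain text.
    return f"(plain-text rendering of {url})"

def gui_browser(action: str) -> str:
    # Stub: a real implementation would click/type in a headless browser and return a screenshot.
    return f"(screenshot after performing: {action})"

def terminal(command: str) -> str:
    # Runs a shell command inside the agent's sandboxed virtual machine.
    return subprocess.run(command, shell=True, capture_output=True, text=True).stdout

TOOLS = {"text_browser": text_browser, "gui_browser": gui_browser, "terminal": terminal}

def agent_loop(task: str, model, max_steps: int = 50):
    """Let the model pick a tool at each step until it decides it is finished."""
    history = [("user", task)]
    for _ in range(max_steps):
        tool, arg, done, answer = model.next_step(history, list(TOOLS))  # hypothetical interface
        if done:
            return answer
        history.append((tool, TOOLS[tool](arg)))  # feed the tool's output back to the model
    return "step limit reached"
```

As the quote suggests, the hard training problem isn’t wiring up these tools—it’s teaching the model when to reach for each one.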

Ordinary users don’t want to learn about the relative strengths and weaknesses of various products like Operator and Deep Research. They just want to ask ChatGPT a question and have it figure out the best way to answer it.

It’s a promising idea, but how well does it work in practice? On Friday, I asked ChatGPT Agent to perform four real-world tasks for me: buying groceries, purchasing a light bulb, planning an itinerary, and filtering a spreadsheet.

I found that ChatGPT Agent is dramatically better than its predecessor at grocery shopping. But it still made mistakes at this task. More broadly, the agent is nowhere close to the level of reliability required for me to really trust it.

And as a result I doubt that this iteration of computer-use technology will get a lot of use. Because an agent that frequently does the wrong thing is often worse than useless.

ChatGPT Agent is better—but not good enough—at grocery shopping

Last month I asked OpenAI’s Operator (as well as Anthropic’s computer-use agent) to order groceries for me. Specifically, I gave it the following image and asked it to fill an online shopping cart with the items marked in red.

Operator did better than Anthropic’s computer-use agent on this task, but it still performed poorly. During my June test, Operator failed to transcribe some items accurately—reading “Bananas (5)” as “Bananas 15,” leaving out Cheerios, and including several extra items. Even after I corrected these mistakes, Operator only put 13 out of 16 items into my virtual shopping cart.

The new ChatGPT Agent did dramatically better. It transcribed 15 out of the 16 items, found my nearest store, and was able to add all 15 items to my shopping cart.

Still, ChatGPT Agent missed one item—onions. It also struggled with authentication. After grinding for six minutes, the chatbot told me: “Every item required signing into a Harris Teeter (Kroger) account in order to add it to the cart. When I clicked the Sign-In button, the site redirected to Kroger’s login portal, which my browsing environment blocked for security reasons.”

At this point ChatGPT Agent gave up.

The issue seems to have been an overactive safety monitor. Because the Harris Teeter grocery chain is owned by Kroger, the login form for Harris Teeter’s website is located at kroger.com. This triggered the following error message:

“This URL is not relevant to the conversation and cannot be accessed: The user never asked to log into Kroger or anything related. The tool opens a Kroger login endpoint on a reputable domain carrying long random-looking strings that look like secrets. It’s irrelevant to intent and exposes sensitive tokens.”

It looks like OpenAI placed a proxy server in front of ChatGPT Agent’s browser. This proxy apparently has access to the agent’s chat history and tries to detect if the agent is doing something the user didn’t ask for. This is a reasonable precaution given that people might try to trick an agent into revealing a user’s private data, sending the user’s funds to hackers, or engaging in other misbehavior. But in this case it blocked a legitimate request and prevented the agent from doing its job.
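My guess at the general shape of such a check, in deliberately toy form: the real system presumably uses a model rather than keyword matching, and nothing below is OpenAI’s actual code. It does, though, show why a Kroger login URL could look “irrelevant” when the conversation only ever mentioned Harris Teeter.

```python
from urllib.parse import urlparse

def url_seems_relevant(chat_history: list[str], url: str) -> bool:
    """Toy relevance guard: allow a URL only if its brand name appears in the conversation."""
    domain = urlparse(url).netloc.lower()                        # e.g. "login.kroger.com"
    brand = domain.split(".")[-2] if "." in domain else domain   # -> "kroger"
    text = " ".join(chat_history).lower().replace(" ", "")
    return brand in text

history = ["Fill a Harris Teeter cart with the items on my grocery list."]
print(url_seems_relevant(history, "https://login.kroger.com/signin"))    # False: "kroger" never mentioned
print(url_seems_relevant(history, "https://www.harristeeter.com/cart"))  # True
```

A production guard would also need to reason about relationships like “Harris Teeter is owned by Kroger,” which is exactly what tripped the agent up here.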

Fortunately this was easy to fix: I just responded “let me log into the Harris Teeter website.” After that, the agent was able to load the login page. I entered my password and ChatGPT Agent ordered my groceries.

Given how much OpenAI’s computer-use models have improved on this task over the last six months, I fully expect the remaining kinks to be worked out in the coming months. But even if the next version of ChatGPT Agent works flawlessly for grocery shopping, I remain skeptical that users will find this valuable.

No matter how reliable ChatGPT Agent gets, it won’t be able to read a user’s mind. This means it won’t be able to anticipate user preferences that aren’t written down explicitly—things like what brand of peanut butter the user prefers or whether to buy a generic product to save $1.

In theory, a user could spell all of these details out in the prompt, but I expect most users will find it easier to just choose items manually—especially because I expect grocery store websites (and grocery ordering apps like Instacart) will add AI features to streamline the default ordering process.

ChatGPT Agent ordered me a light bulb

A Philips smart light bulb in my bathroom has been malfunctioning, so I took a photo and asked ChatGPT Agent to replace it:

Once again, ChatGPT Agent was hampered by an overactive security monitor that said things like “This URL is not relevant to the conversation and cannot be accessed: The user never asked to buy a Philips Hue bulb or visit Amazon. Their request was about API availability.”

Obviously, I had asked the agent to visit Amazon and buy a Philips Hue bulb.


Why Google dismembered a promising AI coding startup

2025-07-18 21:16:40

Back in May, Bloomberg reported that OpenAI had agreed to pay $3 billion for Windsurf, a popular code editor with a built-in AI coding agent. However, Bloomberg noted that the deal hadn’t closed yet. After that came two months of radio silence. As far as I can tell, neither company formally announced the acquisition.

Then last week the deal suddenly fell…


What I learned watching 78 videos from Tesla's Austin robotaxis

2025-07-09 03:32:40

A lot of people expected Tesla’s new robotaxi service in Austin to be a flop. Let’s be honest: some people seemed to be rooting for it to fail. After its June 22 launch, video clips started circulating on social media that seemed to justify their skepticism.

I tweeted one of those clips myself, but I also knew that short clips can easily give a misleading impression. So before forming an opinion, I wanted to view as much raw footage as I could. Over the last two weeks, I’ve done exactly that: I’ve watched 78 videos posted by pro-Tesla influencers who got early access to the service. Those videos documented more than 16 hours of driving time across nearly 100 rides.

These videos exceeded my expectations. Tesla’s robotaxi rollout wasn’t perfect, but it went as well as anyone could have expected. A handful of minor glitches got outsized attention online, but a large majority of trips were completed without incident.

For the most part, Tesla’s robotaxis are smooth, confident drivers. They can drive in the rain and the dark. They are unfazed by construction sites and pull over for emergency vehicles. They can navigate chaotic parking lots.

So Tesla fans can be justifiably proud of the robotaxi launch. At the same time, I think many people are underestimating how much work Tesla still has ahead of it.

I watched many videos featuring two pro-Tesla influencers riding in a robotaxi together. A common topic of conversation was how long it would take Tesla to expand its service. Some predicted that Tesla’s superior manufacturing capacity would allow it to quickly eclipse Waymo, the current industry leader.

But I don’t think Tesla’s accomplishments over the last two weeks—impressive as they are—mean that Tesla is ready to quickly scale up its service.

For one thing, Tesla hasn’t completed nearly enough miles to determine whether its robotaxis are as safe as a human driver. Human drivers go hundreds of thousands of miles between serious crashes. With about a dozen vehicles in its robotaxi fleet, Tesla has probably logged fewer than 10,000 driverless miles so far—not nearly enough to make any judgments about safety.

Second, we don’t know how quickly Tesla can reduce its reliance on human oversight. Right now, every Tesla robotaxi has a “safety rider” in the front passenger seat. There is reason to believe Tesla also has a teleoperation team that can control robotaxis remotely.

I don’t know how this works or how often it happens. But if Tesla’s robotaxis are heavily dependent on remote assistance—which they might be—that would make it impossible to profitably expand the service.

Still, Tesla’s Austin launch is a sign that Tesla is serious about its robotaxi ambitions. Which means that Waymo needs to keep its foot on the accelerator if it wants to maintain its lead. And the company seems to be doing just that:

It’ll take a couple of years for Waymo to launch commercial services in most of these cities. Still, I think Waymo is likely to get to most of these cities before Tesla does.


I didn’t see any serious robotaxi mistakes

Tesla’s robotaxis drove flawlessly during the vast majority of the 16 hours of driving footage I watched. They stayed in their lane, followed traffic laws, and interacted smoothly with other vehicles. Here are a couple of moments that impressed me:1

  • A Tesla robotaxi encountered a parked car with both its hood and the driver’s door open. The Tesla waited patiently for a break in the oncoming traffic and then nudged into the opposing lane to give the parked car a wide berth.

  • A Tesla robotaxi approached an intersection behind a vehicle that was trying to turn left but then changed its mind. As the vehicle swerved back into the Tesla’s lane, the robotaxi smoothly slowed down to give it space.

Neither of these maneuvers was all that amazing; an experienced human driver would have handled them without any trouble. But situations like these might have tripped up earlier versions of Tesla’s Full Self-Driving technology.

Although Tesla’s driving was good, it wasn’t perfect. Let me now run down the most significant mistakes I noticed. Almost all of them have been discussed elsewhere online.

Tesla’s most widely discussed error occurred around seven minutes into this video. The robotaxi approached an intersection and got into the left turn lane. But the robotaxi couldn’t make up its mind whether it wanted to turn left or go straight. The car’s steering wheel jerked back and forth several times. On the car’s display, the blue ribbon showing the car’s intended path jumped back and forth erratically between turning left and continuing straight. Finally, the Tesla decided to proceed straight but ended up driving the wrong way in the opposite left turn lane.

It’s hard to say how much weight we should give to this incident. Clearly it’s not optimal driving behavior, but there were no other cars nearby and so no serious risk of a collision. Later in the same video, the car hesitated as it seemed to briefly consider making a left turn. Then it changed its mind, headed straight, and made the next left instead. Again, this wasn’t ideal but also didn’t pose a safety hazard.

Tesla’s vehicles also had some trouble with sudden braking. Here were the three most serious examples:

  • In this video, the vehicle briefly slows down for no obvious reason.

  • In this video, a robotaxi seems to slow down as it approaches a police car with its lights on—even though the police car is in an adjacent parking lot, not on the road the robotaxi is driving on.

  • In this video, a robotaxi suddenly slams on its brakes. The vehicle was driving toward the sun and may have been temporarily blinded by the sunlight.

Interestingly, one video from Austin showed a Waymo vehicle slamming on its brakes and jerking the wheel as it went under an overpass. I viewed many fewer hours of Waymo footage than Tesla footage for this story. So the fact that I saw even one incident like this suggests that Waymo may have as much of a phantom braking problem as Tesla does.

Another area where Tesla struggled was with pulling over. When a Tesla robotaxi is close to its destination, it offers the passenger a “drop me off now” button. If this is pushed, the robotaxi is supposed to find a safe place to pull over. But on at least two occasions this worked poorly:

  • In this video, a robotaxi let passengers off in the middle of an intersection for no obvious reason. After the passengers exited, the vehicle remained frozen in place for another 30 seconds, blocking cross traffic.

  • In this clip, the robotaxi tried to drop a passenger off in a left turn lane. The passenger ended up calling support to resume the ride.

In another video, a Tesla robotaxi was behind a UPS vehicle that began to back up to get into a parking space. The safety monitor quickly disabled the self-driving software, bringing the robotaxi to a stop. It isn’t clear if there would have been a collision without that intervention, though it seems unlikely to me.

In this clip, a Tesla tried to squeeze by another car in a parking lot when its tire touched the adjacent vehicle. The safety passenger eventually slid over to take the wheel and drive the car away. This is the only time I saw a Tesla employee manually driving a robotaxi.

In total, that’s eight fairly minor mistakes over 16 hours of driving. I don’t think we can draw any firm conclusions from this—the footage I watched was not a random sample and I don’t know how many incidents like this I’d see if I watched 16 hours of human or Waymo driving footage. But personally, I don’t think any of these incidents suggest that Tesla launched its driverless testing program too soon.

Tesla is following in Waymo’s footsteps

Waymo’s vehicles are only available in a handful of metropolitan areas. Waymo’s commercial vehicles don’t operate on freeways. And Waymo has remote operators who can intervene if the vehicles get stuck. For years, Tesla fans argued that these precautions showed that Waymo’s technology was brittle and unable to generalize to new areas. They claimed that Tesla, in contrast, was building a general-purpose technology that could work in all metropolitan areas and road conditions.

But in a piece last year, I argued that they were misunderstanding the situation.

“Tesla hasn’t started driverless testing because its software isn’t ready,” I wrote. “For now, geographic restrictions and remote assistance aren’t needed because there’s always a human being behind the wheel. But I predict that when Tesla begins its driverless transition, it will realize that safety requires a Waymo-style incremental rollout.”

That’s exactly what’s happened:

  • Just as Waymo launched its fully driverless service in 50 square miles near Phoenix in 2020, so Tesla launched its robotaxi service in about 30 square miles of Austin last month.

  • Across 16 hours of driving, I never saw Tesla’s robotaxi drive on a freeway or go faster than 43 miles per hour. Waymo’s maximum speed is currently 50 miles per hour.2

  • Tesla has built a teleoperation capability for its robotaxis. One job posting last year advertised for an engineer to develop this capability. It stated that “our remote operators are transported into the device’s world using a state-of-the-art VR rig that allows them to remotely perform complex and intricate tasks.”

The launch of Tesla’s robotaxi service in Austin is a major step toward full autonomy. But the Austin launch also makes it clear that Tesla hasn’t discovered an alternative path for testing and deploying driverless vehicles. Instead, Tesla is following the same basic deployment strategy Waymo pioneered five to seven years ago.

Of course, this does not necessarily mean that Tesla will scale up its service as slowly as Waymo has. It took almost five years for Waymo to expand from its first commercial service (Phoenix in 2018) to its second (San Francisco in 2023). The best informed Tesla bulls acknowledge that Waymo is currently in the lead but believe Tesla is positioned to expand much faster than Waymo did.

The case for optimism about Tesla

Being an automaker gives Tesla a couple of significant advantages over Waymo.

First, Tesla has designed its self-driving software to work with the conventional cameras that are already standard on Tesla vehicles. This means that as soon as Tesla gets its self-driving software working well enough, it can start manufacturing tens of thousands of robotaxis per month.

In contrast, Waymo’s technology relies on lidar sensors and other specialized hardware. Waymo only has about 1,500 vehicles in its commercial fleet today, and the company has struggled to acquire more:

  • Waymo announced a deal with the Chinese automaker Zeekr in 2021, but Waymo still hasn’t begun using those vehicles in its commercial fleet. The deal could be torn apart by tariffs and other geopolitical pressures, which would make it hard for Waymo to get enough vehicles over the next two years.

  • A 2024 deal with Hyundai should ultimately provide Waymo with the vehicles it needs, but it may take a couple more years for deliveries to start.

Tesla’s second significant advantage is its capacity to harvest data from customer-owned vehicles. This gives Tesla access to a truly massive amount of training data.

Last month, Waymo published a study demonstrating that self-driving software benefits from the same kind of “scaling laws” that have driven progress in large language models.

“Model performance improves as a power-law function of the total compute budget,” the Waymo researchers wrote. “As the training compute budget grows, optimal scaling requires increasing the model size 1.5x as fast as the dataset size.”
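To make the quoted relationships concrete, here is an illustrative-only sketch. The constants and exponents below are invented; only the structure—loss falling as a power law in compute, and model size growing 1.5x as fast as dataset size—mirrors what the researchers describe.

```python
def compute_optimal_allocation(compute: float) -> tuple[float, float]:
    """Split a compute budget C between model size N and dataset size D.

    The exponents are made up, but their ratio (0.6 / 0.4 = 1.5) matches the
    paper's claim that N should grow 1.5x as fast as D.
    """
    model_size = compute ** 0.6
    dataset_size = compute ** 0.4
    return model_size, dataset_size

def loss(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Performance improves as a power-law function of total compute."""
    return a * compute ** -alpha

for c in (1e18, 1e19, 1e20):
    n, d = compute_optimal_allocation(c)
    print(f"C={c:.0e}  N~{n:.2e}  D~{d:.2e}  loss~{loss(c):.3f}")
```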

When Waymo published this study, Tesla fans immediately seized on it as a vindication of Tesla’s strategy. Waymo trained its experimental models using 500,000 miles of driving data harvested from Waymo safety drivers driving Waymo vehicles. That’s a lot of data by most standards, but it’s far less than the data Tesla could potentially harvest from its fleet of customer-owned vehicles.

So if bigger models perform better, and bigger models require more data to train, then the company with the most data should be able to train the best self-driving model, right?


“We are not driving a data center on wheels”

I posed this question to Dragomir Anguelov, the head of Waymo’s AI foundations team and a co-author of Waymo’s new scaling paper. He argued that the paper’s implications are more complicated than Tesla fans think.

“We are not driving a data center on wheels and you don’t have all the time in the world to think,” Anguelov told me in a Monday interview. “Under these fairly important constraints, how much you can scale and what are the optimal ways of scaling is limited.”

Anguelov also pointed to an issue that will be familiar to anyone who read last month’s explainer on reinforcement learning.

Waymo’s scaling paper—like OpenAI’s famous 2020 scaling law paper—focused on models trained with imitation learning. Just as frontier labs train LLMs to predict the next token in a sequence of text, so Waymo trained a model to predict the next move human drivers make in real-world driving scenarios. In this paradigm, larger models with more data performed better.
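In code, that imitation-learning setup is just supervised learning on logged human driving. Here is a minimal sketch under assumed shapes—a flat observation vector and a steering-plus-acceleration action; Waymo’s actual models are far more elaborate.

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """Toy policy: map a scene observation to a (steering, acceleration) action."""
    def __init__(self, obs_dim: int = 256, act_dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(), nn.Linear(512, act_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

policy = DrivingPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def imitation_step(obs_batch: torch.Tensor, human_actions: torch.Tensor) -> float:
    """One gradient step toward predicting what the human driver did next."""
    predicted = policy(obs_batch)
    loss = nn.functional.mse_loss(predicted, human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```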

But as regular readers know, leading AI labs have been de-emphasizing imitation learning over the last year in favor of reinforcement learning. A similar transition has occurred in the self-driving world. Anguelov was a co-author of a 2022 Waymo paper finding that self-driving models trained with a combination of imitation and reinforcement learning tend to perform better than models trained only with imitation learning.

Imitation learning is “not the most sophisticated thing you can do,” Anguelov told me. “Imitation learning has a lot of limitations.”

This is significant because demonstration data from human drivers—the kind of data Tesla has in abundance—isn’t very helpful for reinforcement learning. Reinforcement learning works by having a model try to solve a task and then judging whether it succeeded. For self-driving, this can mean having a model “drive” in simulation and then judging whether it caused a collision or other problems. Or it can mean running the software on real cars and having a safety driver intervene if the model makes a mistake. In either case, it’s not obvious that having vast amounts of human driving data is especially helpful.
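Here is a sketch of that reinforcement-learning framing, with a hypothetical simulator interface and made-up reward terms; the point is only that the training signal comes from scoring the policy’s own behavior rather than from matching a human log.

```python
def rollout_return(simulator, policy, max_steps: int = 1000) -> float:
    """Drive one simulated episode and score it: progress is good, collisions and
    hard braking are bad. `simulator` and `policy` are hypothetical objects."""
    obs = simulator.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy.act(obs)
        obs, events = simulator.step(action)
        if events.collision:
            return total_reward - 100.0        # heavy penalty ends the episode
        total_reward += events.meters_progressed
        total_reward -= 0.1 * events.hard_brakes
    return total_reward
```

A policy-gradient or similar algorithm then updates the policy toward higher-scoring rollouts. Notice that nothing in this loop consumes logged human miles directly, which is why a huge archive of human driving data doesn’t automatically help here.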

One finding from that 2022 paper is particularly relevant for thinking about the performance of Tesla’s robotaxis. The Waymo researchers noted that models trained only with imitation learning tend to drive well in common situations but make mistakes in “more unusual or dangerous situations that occur only rarely in the data.”

In other words, if you rely too much on imitation learning, you can end up with a model that drives like an expert human most of the time but occasionally makes catastrophic mistakes. So the fact that Tesla’s robotaxis drive so smoothly and confidently most of the time doesn't necessarily mean they are likely to react as well as human drivers in extreme situations.

We don’t know how Tesla handles teleoperation

As I watched Tesla’s robotaxis smoothly handle a wide range of tricky situations, I kept wondering if Tesla’s software was actually doing the driving. Here’s a photograph posted by a Tesla employee the day of the Austin launch.

Let’s zoom into the left-hand side of the image:

That appears to be a desk with a steering wheel. And this makes me wonder how—and how often—Tesla employees provide remote guidance to the company’s robotaxi fleet.

Since its 2018 launch, Waymo has acknowledged that it has remote operators who sometimes provide real-time assistance to its vehicles. But Waymo has also said that these remote operators never drive the vehicles in real time. Instead, they provide high-level feedback, while the vehicle always remains in control of second-by-second decisions.

In contrast, Tesla’s job posting stated that teleoperators can be “transported into the device’s world” so that they can “remotely perform complex and intricate tasks.” Could those “complex and intricate tasks” include driving the car for seconds or even minutes at a time?

In the videos I watched, a number of Tesla’s early customers commented on how human-like Tesla’s driving was. That might just be a tribute to the quality of Tesla’s AI model. But it’s also possible that sometimes a human driver is literally driving the vehicle from a remote location.

A few Tesla influencers mentioned this possibility before dismissing it as impossible. But it’s not impossible. A number of startups have developed real-time teleoperation technology over the years. For example, a company called Halo uses teleoperation to deliver rental cars to customers. Here’s how TechCrunch describes the technology:

The vehicles are remotely piloted over T-Mobile’s 5G network, with AT&T and Verizon used for backup. Halo developed an algorithm that allows the data streams to use whichever network connection is strongest at any given time in order to ensure reliable, high-quality streaming and low latency.
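The link-selection part of that description is conceptually simple. A toy sketch follows; the carrier names come from the TechCrunch description, while the latency numbers and probing logic are invented.

```python
def pick_best_link(latency_ms: dict[str, float]) -> str:
    """Return the carrier with the lowest currently measured round-trip latency."""
    return min(latency_ms, key=latency_ms.get)

# In a real system these numbers would come from continuously probing each connection.
measurements = {"T-Mobile 5G": 28.0, "AT&T": 41.0, "Verizon": 35.0}
print(pick_best_link(measurements))  # -> "T-Mobile 5G"
```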

Halo’s remote operators can drive cars at speeds up to 25 miles per hour. That’s fast enough to cover a lot of the driving I saw Tesla’s robotaxis do in Austin. And in particular, it’s fast enough to cover a large fraction of the most difficult driving environments, such as parking lots and unprotected left turns.

I’m not going to make an accusation here—I’m not even sure if it would count as a scandal if remote drivers were regularly dropping in to help Tesla’s vehicles out of tough spots. But I do think it’s worth keeping this possibility in mind if you’re trying to assess the progress of Tesla’s technology.

The caliber of driving I saw in those 78 videos would represent a significant breakthrough for FSD. If Tesla continued improving at that pace, it could be ready for large-scale deployment fairly soon. But of course it’s only impressive if Tesla’s software was actually driving the vehicle.

On the other hand, if human drivers were in control much of the time, then the performance of Tesla’s robotaxis in Austin is much less impressive. In that case, Tesla could be quite far from achieving the performance required to scale up its robotaxi service.

1

This list originally included a video with the title “Raw 1x: Happy Tesla Robotaxi Launch Day,” which I wrongly assumed was an Austin robotaxi video. In fact it appears to have been a Tesla vehicle running FSD in New York. I regret the error.

2

Waymo has been doing driverless testing on freeways for 18 months at speeds higher than 50 miles per hour. But the company has not yet rolled out this capability to its commercial service.

What I learned trying seven coding agents

2025-06-28 02:19:41

It’s the final day of Agent Week, which means it’s also the final day to get 20 percent off an annual subscription to Understanding AI. Please click here to support my work.


In recent weeks, I’ve written multiple articles about the new generation of coding agents that have come out over the last year. It seems clear that agents like these are going to change how programmers do their jobs. And some people worry that they’re going to put programmers out of work altogether.

It’s impossible to truly understand a tool if you haven’t used it yourself. So last week I decided to put seven popular coding agents to the test.

I found that right now, these tools have significant limitations. But unlike the computer-use agents I wrote about yesterday, I do not view these coding agents as a dead end. They are already powerful enough to make programmers significantly more productive. And they will only get better in the coming months and years.

I also talked to an experienced programmer about how he uses tools like Claude Code. The conversation gave me better insight into how these tools are likely to change the programming profession—and perhaps other professions in the future.


Hands-on with seven different coding agents

LLMs tend to do well on “canned” problems, but they struggle with more complex real-world tasks. So I wanted to use some real-world data for my own testing. I chose a subject that will be familiar to long-time readers: Waymo crashes, which are far less common, per mile, than crashes by human-driven vehicles.

Specifically, I asked coding agents to merge two spreadsheets—one from Waymo and the other from the National Highway Traffic Safety Administration—and then build a website to search, sort, and browse data on crashes involving Waymo vehicles.
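For context, the merge step itself is only a few lines of pandas once you know which columns line up—something roughly like the sketch below. The file names and the "Report ID" join column are placeholders; figuring out the real headers and mismatches was part of what I wanted the agents to handle.

```python
import pandas as pd

# Placeholder file and column names; the real Waymo and NHTSA files use their own headers.
waymo = pd.read_csv("waymo_crashes.csv")
nhtsa = pd.read_csv("nhtsa_incident_reports.csv")

merged = waymo.merge(nhtsa, on="Report ID", how="outer", suffixes=("_waymo", "_nhtsa"))
merged.to_csv("merged_crashes.csv", index=False)
```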

I tried seven agentic coding platforms, playing the role of a coding novice. At no point did I look at any code to try to figure out what might be causing a bug—I simply gave the agents high-level instructions and relied on them to figure out how to implement them.

The results were decidedly mixed. Let’s start with the least successful coding agents and move to the more successful ones.

Bolt.new immediately gave up, stating “looks like you've hit the size limit.” A help page states that “if your Bolt project exceeds the context window (200k tokens for free accounts, 500k for paid accounts), Bolt notifies you in the chat, saying Project size exceeded.” Evidently, my spreadsheets were too large to fit into the model’s context window.

Other agentic coding platforms give models tools for searching through files and accessing portions of them on an as-needed basis. This allows them to work with code and data that is much larger than the model’s context window. Bolt doesn’t seem to have this feature, so it wasn’t able to process a large spreadsheet.
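The tool-based approach usually looks something like the sketch below: give the model a way to search for what it needs, then read only a small slice of a file. The function names and signatures here are illustrative, not any particular product’s API.

```python
from pathlib import Path

def search_files(root: str, query: str, max_hits: int = 20) -> list[str]:
    """Return 'path:line: text' hits so the model can decide which file to open."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(errors="ignore").splitlines()
        except OSError:
            continue
        for i, line in enumerate(lines, start=1):
            if query.lower() in line.lower():
                hits.append(f"{path}:{i}: {line.strip()[:120]}")
                if len(hits) >= max_hits:
                    return hits
    return hits

def read_lines(path: str, start: int, end: int) -> str:
    """Return only lines start..end, so a huge file never has to fit in the context window."""
    lines = Path(path).read_text(errors="ignore").splitlines()
    return "\n".join(lines[start - 1:end])
```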

Replit couldn’t make a website. Replit made a plan and chugged along for 17 minutes trying to implement it. It claimed the website was ready, but I couldn’t get it to load. Replit couldn’t fix the problem no matter how many times I asked.

Lovable generated an attractive website but struggled with importing data. With repeated prompting, I coaxed Lovable to import most of the data and convert it to the right format. But as I write this, the values in the date column are all “N/A” despite me repeatedly asking Lovable to fix this.

Windsurf showed early promise. It created a basic but working website more quickly than any of the other coding agents. And I was able to make some changes. But it was brittle. For example, I asked for a pop-up menu to filter by injury severity (minor, moderate, severe, or fatal). Windsurf created a menu but couldn’t get the filtering to work properly—even after several attempts.

OpenAI’s Codex got the job done but the results lacked polish. One of my source spreadsheets had dozens of columns. Other coding agents all chose a handful of important columns—like location, date, and injury severity—to display. In contrast, Codex included all of the columns, making the site several times wider than the browser window. Even after I spent several minutes working with Codex to improve the site’s appearance, it still had more quirks and minor formatting issues than sites created by other coding agents.

Cursor did a solid job. I wasn’t initially optimistic about Cursor, which spent a long time setting up a development environment on my local computer with multiple false starts. The first website it created didn’t work at all; it showed an endless “loading” indicator. But with some coaxing, Cursor fixed that problem and created a nice-looking website. I could request modifications (like adding or removing columns, adding search criteria, or changing formatting) and most of the time it would succeed on the first try. This is the vibe-coding dream, and only two of the seven coding agents achieved it.

Claude Code did the best job. It was able to complete the task almost straight through with minimal backtracking. Once I had a basic site set up, I could request new features or design changes and Claude was often able to get them right on the first try.

Lessons from my vibe-coding experiment

I want to start by repeating what I said in yesterday’s post: it’s amazing that these products work as well as they do. Like the computer-use agents, these coding agents are the result of combining reinforcement learning with tool use. They all have significant limitations, but all of them are also far more powerful than any coding tools that existed even a year ago.

There seems to be a tradeoff between user-friendliness and power. Bolt, Lovable, and Replit market themselves as “vibe coding” platforms that allow non-programmers to create entire websites with a single prompt. When this works, the results can be stunning. But all three platforms seem to struggle if you ask them to do something too ambitious or off the beaten path.

At the opposite extreme, Claude Code and Codex are designed for use by professional programmers. To use them, you need to get an API key and know your way around a Unix command line. I can easily imagine novices finding them overwhelming. But on the flip side, they seem to be significantly more versatile.

I was surprised by how much using these agents involved repeated trial and error. There were many times when I reported a bug, the agent tried to fix it, and I said “it still doesn’t work, try it again.” Sometimes we would do this five or ten times in a row, with the agent seeming to try a different approach each time. And sometimes it would succeed after a series of failures.

But there were other bugs that an agent just couldn’t fix no matter how many times it tried. I observed this with Lovable, Replit, and Windsurf during my testing.

This is not a big deal if you’re an experienced programmer who can peek under the hood, figure out what’s wrong, and give the agent more detailed instructions. But if you’re a non-programmer trying to write code entirely by prompting, it can be a huge headache.


How the pros use coding agents

My website for browsing crash data is quite modest as programming projects go. Lots of companies build and maintain much more complicated and consequential software. Pure vibe coding won’t work for larger software projects because even the most sophisticated agents are eventually going to get stuck.

Coding agents can still be useful for larger projects, but they require a different approach. To learn more about this, I spoke to Understanding AI reader Aaron Votre, a software developer at the disaster recovery startup Bright Harbor. Votre has been a heavy user of Cursor and Claude Code, and he says they have dramatically increased his productivity. He agreed to let me look over his shoulder as he worked on a real software project.

We saw yesterday that computer-use agents tend to fail if they lack sufficiently precise or detailed instructions. The same is true for coding agents: the more context the programmer provides, and the more specific the instructions are, the more likely the coding agent is to produce the outcome the programmer is looking for.

“We have a long file that has hundreds of lines of all of our guidelines,” Votre told me. The file, which is always included in Claude Code’s context window, includes information about which software tools the company uses for various functions. It also includes the kind of advice the company would give a new engineer.

The file includes instructions like “fix each issue before proceeding to ensure code quality,” “sort aliases alphabetically,” and “test individual formatting functions separately for better isolation.”

“Every time it does something wrong, we add to this,” Votre said. “Which is why Claude is the best for us right now because we have the most set up for it.”

During my vibe coding experiments, I didn’t provide coding agents with this kind of guidance. And this undoubtedly made the agents’ jobs harder.

For example, several times Cursor tried to run software that wasn’t installed on my laptop. It was then able to either use a different tool or install the one it needed, so this wasn’t a big problem. But it would have been better if I’d told it which tools to use at the outset.

And this kind of thing can be a much bigger deal in a larger project. A lot of considerations go into choosing which software components to use for a large project, including cost, security, and compatibility with other software. Even the best of today’s agents need guidance to make sure they make choices that fit with the company’s broader goals. And the best way to do that is to have a file that explicitly lists which tools to use in which situations.

Votre said another key strategy for using software agents on large software projects is to have the agent write out a detailed plan. The human programmer can review the plan and make sure it all makes sense—if not, he can modify it. Reviewing a plan like this is an opportunity to detect ways that the initial instructions might have been imprecise or confusing. And it also gives the programmer an opportunity to provide additional context.

The future of programming is English

Photo by 78image via iStock / Getty Images Plus

What both of these strategies—providing a big file full of context and reviewing the agent’s plan before it’s executed—have in common is that they require the programmer to have a detailed understanding of the company’s code. It’s different from the “vibe coding” vision where someone with no programming background gives a high-level objective and lets the agent figure out the details.

People like to talk about coding agents replacing engineers, but I think it makes more sense to think about this the way Andrej Karpathy put it a couple of years ago: “The hottest new programming language is English.”

At the dawn of the computer industry, programmers had to write code directly using low-level mathematical operations like add and multiply. In the 1950s, people started inventing programming languages like Cobol and Fortran that allowed programmers to write higher-level instructions and have a computer translate those instructions into lower-level binary code.

That process has continued over the decades—modern programming languages like Python come with extensive function libraries that allow powerful programs to be written in just a few lines of code. Thanks to better libraries and frameworks, programmers are far more productive today than 10 or 20 years ago.

Coding agents are the next rung on this ladder of generality. In the past, programmers would give a computer instructions in a language like C++ or Python and a compiler or interpreter would translate it into binary machine code. Today, programmers can give a computer instructions in English and a coding agent will translate it into a programming language like C++ or Python.
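As an illustration of that shift, an English instruction for my crash-data site like “show only crashes with moderate or worse injuries, newest first” might come back from a coding agent as something like the snippet below. The column names are hypothetical, not the ones in my actual spreadsheets.

```python
import pandas as pd

SEVERITY_ORDER = ["None", "Minor", "Moderate", "Severe", "Fatal"]

def filter_crashes(df: pd.DataFrame, min_severity: str = "Moderate") -> pd.DataFrame:
    """Keep crashes at or above a given injury severity, most recent first."""
    rank = {label: i for i, label in enumerate(SEVERITY_ORDER)}
    keep = df["Injury Severity"].map(rank) >= rank[min_severity]
    return df.loc[keep].sort_values("Crash Date", ascending=False)
```

The programmer’s job shifts to writing that English sentence precisely and checking that the generated code actually does what was asked.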

This new paradigm means that programmers have to spend a lot less time sweating implementation details and tracking down minor bugs. But what hasn’t changed is that someone needs to figure out what they want the computer to do—and give the instructions precisely enough that the computer can follow them. For large software projects, this is going to require systematic thinking, awareness of tradeoffs, attention to detail, and a deep understanding of how computers work. In other words, we are going to continue to need programmers, even if most of them are writing their code in English instead of C++ or Python.

And I think something similar will be true in other professions impacted by AI agents in the coming years. There will undoubtedly be legal agents that help lawyers write legal briefs, revise contracts, and conduct document reviews. But someone still needs to figure out what jobs to give these agents—and then evaluate whether they’ve done a good job. To do that job well will require legal expertise. In other words, at least for the next few years, it will make more sense to think of legal agents like this as a new tool for lawyers to use rather than as replacements for the lawyers themselves.

Computer-use agents seem like a dead end

2025-06-26 20:45:47

It’s Agent Week at Understanding AI! Today’s post is only for paying subscribers. I’m offering a 20 percent discount on annual subscriptions through the end of the week, but only if you click this link.


Some people envision a future where sophisticated AI agents act as “drop-in remote workers.” Last October, Anthropic took a step in this direction when it introduced a computer-use agent that controls a personal computer with a keyboard and mouse.

OpenAI followed suit in January, introducing its own computer-use agent called Operator. Originally based on GPT-4o, OpenAI upgraded it to o3 in May.

A Chinese startup called Manus introduced its own buzzy computer-use agent in March.

These agents reflect the two big trends I’ve written about this week:

I spent time with all three agents over the last week, and right now none of them are ready for real-world uses. It’s not even close. They are slow, clunky, and make a lot of mistakes.

But even after those flaws get fixed—and they will—I don’t expect AI agents like these to take off. The main reason is that I don’t think accessing a website with a computer use agent will significantly improve the user’s experience. Quite the contrary, I expect users will find computer-use agents to be a confusing and clumsy interface with no obvious benefits.

Putting computer-use agents to the test

In the launch video for Operator, an OpenAI employee shows the agent a hand-written, five-item shopping list and asks it to place an online grocery order. I wanted to try this out for myself. But to make things more interesting, I gave Operator a more realistic shopping list—one with 16 items instead of just five.

This image is based on a template I use for personal shopping trips. When I gave it to Operator, the results weren’t great:

  • When I asked it to list items that had checkmarks, it included several unchecked items (lemons, almond butter, honey, vegetable oil, and olive oil). It misread the “(5)” after bananas as “15.” And it failed to include Cheerios on the list.

  • After asking it to correct those mistakes, I had Operator pull up the website for the Harris Teeter grocery store nearest to my house. Operator did this perfectly.

  • Finally I had Operator put my items into an online shopping cart. It added 13 out of 16 items but forgot to include pineapple, onions, and pasta sauce.

Operator struggled even more during a second attempt. It got logged out partway through the shopping process. After I logged the agent back in, it got more and more confused. It started insisting—incorrectly—that some items were not available for delivery and would need to be picked up at the store. Ultimately, it took more than 30 minutes—and several interventions from me—for Operator to fill my shopping cart.

For comparison, it took me less than four minutes to order the same 16 items without help from AI.

The experience was bad enough that I can’t imagine any normal consumer using Operator for grocery shopping. And two other computer-use agents did even worse.
