2025-03-25 23:53:41
The scales are tipping. For years, we’ve heard open-source AI might someday catch up to the big, proprietary models—but that moment always seemed just out of reach. Until now.
DeepSeek dropped the best non-reasoning model.1
Let’s say that again — the best non-reasoning model is an open-source model. For the first time.
More astonishing still is that DeepSeek V3-0324 offers open-source accessibility at a fraction of the cost of closed platforms. You can run it on a top-spec Mac Studio desktop computer and it will produce 20 tokens per second, fast enough for personal use. Accessed via an API, V3-0324 is a quarter of the price of GPT-4o.
This is precisely the moment open-source proponents have been waiting for: proof that an open-source lab can keep pace with better-resourced corporate giants.
This new V3 also means that the highly anticipated R2, DeepSeek’s next advanced reasoning model, will likely be one of the best models in the world.
So what does this tipping point really mean for AI’s future? Let’s dig in.
In a recent discussion with a colleague, we were dissecting some of our AI workflows. I’d been relying on OpenAI’s o1 API for specific tasks. o1 is absurdly expensive and noticeably superior to alternatives like Gemini 2.0 Flash and Claude 3.7 Sonnet. Yet those models are 150 times and four times cheaper respectively, and much faster than o1.
While I need the quality, I can’t justify the price given the workload volume. That’s when DeepSeek’s API entered the conversation. Even when accessed via a US host like Together.ai at four times the Chinese price, it’s still about fifteen times cheaper than o1.
This isn’t only a personal workaround; it shows you what a seismic shift DeepSeek is driving. Why would I even bother using o1?
This should put the frighteners on OpenAI. You can sense it in their submission to the US AI Action Plan which Trump announced a few weeks ago:
The recent release of DeepSeek’s R1 model is noteworthy—not because of its capabilities (R1’s reasoning capabilities, albeit impressive, are at best on par with several US models), but as a gauge of the state of this competition. […] As with Huawei, there is significant risk in building on top of DeepSeek models in critical infrastructure and other high-risk use cases given the potential that DeepSeek could be compelled by the CCP to manipulate its models to cause harm.
Read between the lines: OpenAI isn’t just concerned about national security; they’re complaining about a competitor offering similar capabilities at a fraction of the price. Their geopolitical warnings are consistent with a highly-capitalised, loss-making company facing the risk that its pricing power might suddenly collapse.
And DeepSeek is disrupting the entire AI business model.
This is particularly clear in China, DeepSeek’s home market. As Eleanor Olcott reports in the FT, major AI companies in China are scrambling to adjust their strategies under DeepSeek’s influence.
[It] was quickly crowned the country’s AI champion by Beijing and has seen lightning adoption of its technology everywhere from hospitals to local governments.
As Eleanor writes, this has compelled some leading AI start-ups in the country to reconsider their strategies. For instance, Kai-Fu Lee’s 01.ai has shifted its business model into what he describes as the “DeepSeek age” and ceased pre-training in late 2024 due to increasing expenses. At the same time, Zhipu, previously viewed as China’s foremost large language model start-up, has been banking on an initial public offering to support its cash-heavy expansion. In 2024, the company reportedly achieved Rmb300 million (about $41 million) in sales while incurring Rmb2 billion in losses, with soaring costs raising concerns among investors after DeepSeek demonstrated that state-of-the-art models could be developed on a more modest budget.
If the only challenge for major AI companies was that DeepSeek had released a faster and cheaper model, they could simply say, “Let’s go back and make ours better.”
But that’s not the real issue.
DeepSeek isn’t just building a superior product, it’s making it open-source and giving it away for free. As a result, you can’t just refine your existing offering. You have to create something fundamentally different.
A gazillion years ago, in 2023, a leaked internal memo from Google stated that the company expected the next big winner to be open-source. The memo said:
[T]he uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch.
I’m talking, of course, about open source. Plainly put, they are lapping us. Things we consider “major open problems” are solved and in people’s hands today.
What’s happening with DeepSeek is about the balance of power between open-source and closed-source models. Historically, many disruptive new technologies have followed a familiar cycle: they begin in a collaborative, open environment—often supported by academia or innovators—and then slowly shift to a model dominated by large corporations that can marshal huge investments and massive infrastructure.
Will AI evolve into foundational infrastructure, akin to the Apache web server or DNS, freely accessible and endlessly customizable, or will it stay locked behind proprietary walls?
As I wrote in 2023:
[T]he internet, as we know it, is largely built on open-source software. This model has fostered a broad developer ecosystem, resulting in increased innovation… In the mainstream software world, there is a balance between closed-source and open-source options.
And I argued that “the internet, on the whole, prefers open-source software.”
For a while, it looked like AI might do the same. Early deep learning breakthroughs were shared in open papers and repos, but the trend toward ever larger, more resource-intensive models that cost millions of dollars to train sparked fears that only the biggest players could compete. This, in turn, led to concerns that AI innovation would be locked up behind proprietary walls, out of reach for smaller labs or independent researchers.
Yet open-source AI has proved remarkably resilient. Initiatives like Hugging Face Transformers, Open Assistant and major collaborative projects such as BLOOM have shown that talent, community and distributed computing can push the boundaries of AI just as much as centralized corporate labs. It still needed help from big players like Meta: thanks to the Llama foundation model, a vibrant ecosystem of fine-tuned variants (Vicuna, Alpaca, WizardLM, etc.) materialised almost overnight, illustrating the speed with which open-source can evolve. We don’t necessarily have to follow the old script.
Closed-source players might retain an edge in building frontier models, thanks to their resource muscle. But these models will be expensive. And any model, regardless of how good, will only form a small part of the society of AI. There will be trillions of agents running billions of different instances of models and those models will have different capabilities. Some will be more general than others, some will work on small memory footprints or power footprints; some will be highly fine-tuned for specific tasks.
DeepSeek is building on the heritage of open-source, and compared with those first open-ish AI models from Stability, MosaicML and, of course, Llama, it’s more open, more resource-efficient and more performant. I think it’s changing the shape of the industry, tilting it further towards open-source. It’s contributing to a richer AI landscape where independent labs, individual contributors and nonprofits continue to push the boundaries, each model serving or specializing in its unique niche.
In the next few weeks, DeepSeek’s new reasoning model R2 will likely be released. We can expect it to be a big leap forward: higher performance, more robust multi-step reasoning capabilities and, potentially, an even broader training corpus than its predecessor R1.
2025-03-24 23:17:51
Hi all,
Here’s your Monday round-up of data driving conversations this week — all in less than 250 words.
Artificial competitive edge. Teams using AI are around three times more likely to produce top-tier solutions than teams without AI, while traditional specialist advantages fade when AI assistance becomes universal.
The most expensive model in history. OpenAI’s o1-pro model costs up to 1,000 times more than their own previous models and up to 1,500 times more than competitors’ services.
Ant’s chip success. China’s Ant Group has cut AI computing costs for running some advanced AI models by 20% by including domestically-produced chips from Alibaba and Huawei alongside US chips.
America’s drone deficit. Prominent US manufacturer Skydio’s total lifetime production of 45,000 drones equals just ‘a few weeks’ of output from Chinese competitor DJI.
2025-03-23 11:44:34
Hi, it’s Azeem with our weekend email.
The DeepSeek moment is becoming more than a moment. The nimble startup is challenging established AI labs with competitive models at a fraction of the cost and sending ripples across the industry. Meanwhile, the idea that liberalism must build to survive is going mainstream. Both stories highlight the inflection point that we’re here to understand: how we respond to technological and societal challenges will define our future. Thanks for reading!
Ezra Klein and Derek Thompson’s new book is making the rounds. In Abundance, they issue a provocation to liberalism: if it can’t build, it can’t endure. Despite the immense wealth and technological prowess of the US, scarcity—of housing, energy, infrastructure and even opportunity—has become the defining constraint of its politics and future. This scarcity has fuelled zero-sum, illiberal populism but also eroded the liberal project from within.
In the words of one reviewer of the book, “liberalism has forgotten how to build the things that people want”.
I’m writing this in the same week that saw the busiest airport in Europe, Heathrow, shut down because it relied on a single source of power. And when that power was no longer available, everything had to stop. What happened at Heathrow is a symptom of the scarcity mindset baked into modern systems – scarcity of permission, initiative and ownership.
My view is that technology enables building. Technology is, after all, things getting cheaper; it is about reducing the resource cost of creating. But it’s only a catalyst. The necessary condition is the mindset.
See also: a guest joined me on Friday to discuss the renewed culture of building—across energy, infrastructure, AI and defense—and why we believe this can unlock an abundant future.
“I think Sam Altman is probably not sleeping well,” Kai-Fu Lee said a couple of days ago commenting on the threat DeepSeek presents to OpenAI. As a reminder, DeepSeek is delivering competitive models at a tiny portion of the spend compared to OpenAI and other major US labs (read our deep dive for everything you need to know). And it’s causing ripple effects. Tencent is slowing down the pace of GPU roll-out because of DeepSeek. Oil giant Saudi Aramco’s chief shared that DeepSeek is “really making a big difference” internally and companies across China are steadily integrating its foundation models.
DeepSeek seems to have managed to create a research environment where the lab can out-innovate more heavily financed competitors. It has only 160 employees, compared with over 2,000 at OpenAI, and it is unburdened by the strict commercial or investor pressures that typically demand rapid productization. DeepSeek’s billionaire founder, Liang Wenfeng, bought thousands of high-end chips ahead of export restrictions and deliberately limits commercial activities. He’s content with breaking even rather than chasing growth.
Meanwhile, OpenAI has pivoted towards product, as it hunts for a defensible moat and profitable market position. And the pressures are growing… Open-source may be the foundational kernel of building AI services. After all, the very successful Mac operating system is built on an open-source, BSD-derived Unix core. And if that wasn’t enough, there’s the reality of end-to-end reinforcement learning blurring the boundary between model and product – the model is the product.
See also, what you should know about new models and AI development this week:
Models can improve not just by growing bigger or thinking longer, but also through search—generating multiple potential answers and selecting the best one. In an impressive demonstration, Gemini 1.5 matched the specialized reasoning model o1’s performance by creating and evaluating 200 responses to a single query. Models excel at identifying their own flawed reasoning when comparing multiple outputs side by side.
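As a rough illustration of the idea (a minimal sketch, not any lab’s actual method), here is what best-of-N search looks like in Python; the generate and score helpers are hypothetical stand-ins for a model call and a judge:

```python
import random

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a call to an LLM API.
    return f"candidate answer {random.randint(0, 1_000_000)} to: {prompt}"

def score(prompt: str, answer: str) -> float:
    # Hypothetical judge: in practice this could be the same model comparing
    # answers side by side, a reward model, or an automated test.
    return random.random()

def best_of_n(prompt: str, n: int = 200) -> str:
    # Sample n candidate answers, then keep the highest-scoring one.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("Outline a proof that the sum of two even numbers is even."))
```

The leverage comes from the judging step: if the model or a cheap verifier can reliably rank its own outputs, extra samples buy extra quality without any change to the underlying weights.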
2025-03-22 13:02:36
Consider, for instance, the day Europe’s busiest airport1—Heathrow, naturally—plunged into silence; its operations were halted not by a storm or a strike but by a fire in an electrical substation.
Not just any substation, mind you, but the North Hyde substation, nestled in light industrial land in Hayes, an unremarkable exurb of London. The fictional character Ali G is a local. Hayes’ power network has been stressed for a while under the strain of population growth and its proximity to Britain’s data centre corridor. All these pressures were well known.2
The transformers, capacitors, circuit breakers and other gubbins that lay on this site, a bit bigger than a football pitch, formed the sole lifeline, the heart, it is fair to say, powering this paragon of modernity: London’s Heathrow Airport.
I can hear you all asking yourselves: such a critical facility, surely, didn’t have a single point of failure?
It’s 2025 after all, we’ve learnt not to put all our eggs in one basket.
Apparently, not the bosses who oversee the airport: “It’s never happened before, and that’s why I’m saying it has been a major incident.”
Of course, this isn’t true. Single points of failure are well known. That is why you don’t put all your eggs in one basket. That is why the Space Shuttle had four backup computers. That is why my home has two different Internet connections.
Heads should roll pdq, but that is an aside.
But ultimately, it is an oddity that beggars belief: how did such a vastly important piece of infrastructure, with well-known vulnerabilities, come to rely on the engineering equivalent of a twig? This is just weird.
A decade ago, I chanced upon an essay by the prodigious British software engineer Steve Coast, titled “The world will only get weirder.”
He posited that this phenomenon is the inevitable byproduct of the interaction of rule-making and a complex society.
Coast’s central argument hinges on the diminishing returns of rules.
We’ve adeptly mitigated commonplace calamities—think mechanical failures in aviation—through a labyrinth of regulations. Consequently, the residual risks are increasingly esoteric, manifesting as bizarre tail events, such as pilots deliberately downing their aircraft.
Yet, this regulatory zeal is not without its ironies. The very measures intended to safeguard can engender unforeseen perils. Consider the locked cockpit doors, a post-9/11 innovation that, while thwarting hijackers, inadvertently empowered a rogue pilot to barricade himself within.
Time spent managing rules is time people don't spend exchanging ideas or coming up with new stuff, or just spotting the blindingly obvious.
Domains emancipated from regulatory shackles—software and the internet spring to mind—often burgeon with innovation and economic vitality. These “rule-free zones” serve as crucibles for experimentation, yielding societal dividends when their ventures bear fruit. We probably need more of them.
I think there is another dimension to this. Living in a sanitised, largely unweird world, a dependable world, makes most of us blasé to risk, to weirdness. It’s perhaps why many people feel odd trying to describe their risk tolerance in sensible terms to their financial advisor. And it’s probably why the best investors in the world, Buffett, Marks, Singer or Dalio, are renowned for their risk management and their respect for the weird.
Our society is so good at excising edge cases that we can’t believe they exist. Engineers, safety analysts, and regulators, in their quest to cocoon us from catastrophe, have inadvertently obscured the lurking specters of systemic fragility.
2025-03-22 02:33:26
I hosted a live session earlier today and we had a blast1—covering a wide range of topics, including:
1. A sense-check on crypto and Web3:
Transition from idealistic vision to mainstream financial instrument.
Importance of regulation and guardrails.
Challenges of open systems and manipulation.
2. AI’s current state & impact:
“Super SaaS” vs existential threat.
ChatGPT and Claude competing for leadership in daily use.
Anthropic’s prediction of “thousands of geniuses in data centers” vs. skepticism of established AI scientists.
Focus on human differentiation and personal development in the AI era.
3. Energy infrastructure & innovation:
Innovation in battery storage and startups we’re excited about.
Nuclear power: Historical successes and future potential.
China’s electric vehicle and energy progress.
4. European strategic shift:
Major mood change from January 2024 to present.
Making sense of the €800 billion allocated to defense and resilience.
Infrastructure challenges in Europe and the UK highlighted by Heathrow Airport shutdown.
P.S. If you were wondering about the little blip at the start of the session—two squirrels in my garden decided to put on a show and completely stole my attention for a moment 🐿️
2025-03-21 01:59:29
How long will it be before artificial intelligence systems can undertake very long tasks without human intervention?
This is a question with massive implications for productivity, automation and the future of work. A recent analysis from METR Evaluations suggests that AI’s ability to sustain task execution is improving at a fast rate. The length of tasks AI can autonomously complete is doubling every seven months. If this trend holds, by 2027, off-the-shelf AI could handle eight-hour workdays with a 50% success rate.
Imagine if a factory worker could only work in 10-minute shifts today but, within three years, could complete a full 8-hour shift without breaks. That’s the trajectory AI might be on.
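To see how quickly a seven-month doubling compounds, here is a back-of-the-envelope sketch in Python. The roughly one-hour starting horizon and the March 2025 starting point are my illustrative assumptions, not METR’s exact figures:

```python
from datetime import date

# Illustrative assumptions (the ~1-hour baseline is an assumption, not METR's figure):
start = date(2025, 3, 1)     # starting month
horizon_minutes = 60.0       # task length completed at a 50% success rate
doubling_months = 7          # METR's reported doubling time
target_minutes = 8 * 60      # an eight-hour workday

months_elapsed = 0
while horizon_minutes < target_minutes:
    horizon_minutes *= 2
    months_elapsed += doubling_months

year = start.year + (start.month - 1 + months_elapsed) // 12
month = (start.month - 1 + months_elapsed) % 12 + 1
print(f"Eight-hour horizon reached around {year}-{month:02d}, "
      f"{months_elapsed} months after {start:%B %Y}.")
```

Under these assumptions the eight-hour threshold lands at the end of 2026, broadly in line with the 2027 figure above; a longer or shorter starting horizon shifts the date by only one doubling period either way.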
That’s a bold claim. And it’s not just coming from outside observers—insiders at major AI labs, including Anthropic co-founder Jared Kaplan, have signalled similar expectations. I take their claims very seriously. They have access to the models, the data, and the emerging scaling trends. And in both formal discussions and casual conversations, 2027 (or sometimes 2028) keeps coming up.
What does this mean? If AI can reliably execute long tasks, it reshapes industries, from knowledge work to automation-heavy fields. But is the trend as inevitable as it seems? And just how reliable do these systems need to be before they become truly transformative?
The idea of task length and success rate is an important one. GPT-3 was good at two- or three-second tasks: it was pretty reliable at pulling out the nouns or entities in a sentence. GPT-3.5 could get to a few tens of seconds, parsing a paragraph in mediocre fashion. Those of us using o1 or Sonnet know that we can throw much more complex tasks at them.
METR put clear bounds on the nature of this test:
The instructions for each task are designed to be unambiguous, with minimal additional context required to understand the task… most tasks performed by human machine learning or software engineers tend to require referencing prior context that may not be easy to compactly describe, and does not consist of unambiguously specified, modular tasks
What is more:
Relatedly, every task comes with a simple-to-describe, algorithmic scoring function. In many real-world settings, defining such a well-specified scoring function would be difficult if not impossible. All of our tasks involve interfacing with software tools or the internet via a containerized environment. They are also designed to be performed autonomously, and do not require interfacing with humans or other agents.
METR’s research focuses on modular, well-defined tasks with clear scoring functions—conditions that don’t always apply in real-world scenarios.
How generalisable do the tasks need to be? AI systems already excel at tasks—such as chess analysis or anomaly detection—that would require hours of human effort, completing them in minutes. Just think about Stockfish, the chess engine. Or any of the ML systems that detect anomalous patterns in financial data. METR has chosen rather more generalisable tasks than those narrow cases.
What level of performance do we need? A 50% success rate isn’t really a like-for-like comparison with human effort. While humans are far from perfect, in most work situations we are aiming for considerably better than even odds.
One X user visualized METR’s data, plotting accuracy rates of 80%, 95% and 99% on a log scale. The results show a clear trend: lower accuracy thresholds improve rapidly, while reaching near-perfect performance (99%) follows a much slower, more effort-intensive curve. This reinforces the challenge of achieving high reliability in AI outputs.
It’s a Pareto relationship. Reaching 80% accuracy is relatively rapid, potentially viable for four-hour tasks by 2028, while achieving 99% accuracy demands exponentially more effort, pushing timelines further out. This disparity shapes expectations for AI’s practical deployment.
Even a system that is fast, cheap and 50% accurate can be a game changer—provided we can quickly check its work. If it has made a mistake, we can then ask it to do it again or send it to a system (like a human) which might be slower but more reliable. Of course, if we can’t evaluate its work cheaply, then the cost of evaluations (in dollars or time) may make it uneconomic.
On the other hand, something that is 80% accurate would seem like a fair building block of a solid system.
I did some napkin math for a notional four-hour task.
Assume each task requires 1,000,000 tokens, at a cost of roughly $10. (Token costs are coming down, but let’s just use that for now.)
Each task has to be verified by a human, perhaps using some formal verification software. That verification takes 15 minutes.
If the task is not done correctly, the human needs to do it. This takes four hours.
The human’s fully loaded cost is $100 per hour.
To do 1,000 of these tasks manually costs 4,000 man-hours or $400,000.
What about the AI?
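To make the question concrete, here is a rough sketch of the arithmetic under the assumptions above. The success rates and the policy of having a human redo any failed task are my illustrative assumptions, not firm figures:

```python
# Napkin math for 1,000 notional four-hour tasks, using the assumptions above.
N_TASKS = 1_000
HUMAN_RATE = 100.0        # fully loaded human cost, $ per hour
TASK_HOURS = 4.0          # time for a human to do one task
VERIFY_HOURS = 0.25       # 15-minute human verification of each AI attempt
AI_COST_PER_TASK = 10.0   # ~1,000,000 tokens at roughly $10

def ai_first_cost(success_rate: float) -> float:
    # Every task gets one AI attempt plus human verification;
    # failed attempts are redone by a human at full task length (an assumption).
    ai = N_TASKS * AI_COST_PER_TASK
    verify = N_TASKS * VERIFY_HOURS * HUMAN_RATE
    redo = N_TASKS * (1 - success_rate) * TASK_HOURS * HUMAN_RATE
    return ai + verify + redo

manual = N_TASKS * TASK_HOURS * HUMAN_RATE
print(f"All-human baseline: ${manual:,.0f}")               # $400,000
for p in (0.5, 0.8, 0.99):
    print(f"AI-first at {p:.0%} success: ${ai_first_cost(p):,.0f}")
# 50%: $10,000 + $25,000 + $200,000 = $235,000
# 80%: $10,000 + $25,000 +  $80,000 = $115,000
# 99%: $10,000 + $25,000 +   $4,000 =  $39,000
```

In this toy model, even a 50%-accurate system undercuts the $400,000 all-human baseline, provided verification really is as cheap as 15 minutes per task; at 80% or better, the economics look decisive.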