2025-04-24 00:25:25
o3 is something of an enigma among the best AI models. When it launched last week, it caused a stir: using it not only delivered better output, it felt different. One economist called o3 “AGI”; another commentator wrote that the model resembles a “jagged AGI” – super‑human in some areas, surprisingly brittle in others. The pushback arrived just as fast: a satirical strawberry tweet and Colin Fraser’s algebra critique capture the gap between benchmarks and basic slips.
Benchmarks poured in. On the VCT benchmark, which measures the ability to troubleshoot complex virology laboratory protocols, o3 now dominates with 43.8% accuracy versus the 22.1% scored by human PhD-level virologists. On ARC‑AGI v1, it solves more than half of the secret set at a twentieth of the cost of the next best chain‑of‑thought model. Yet on the tougher ARC‑AGI v2 it barely clears 3%. The signal is getting harder to hear.
The more we measure AI, the less we seem to know. To cut through the noise, we stepped away from dashboards and put o3 – alongside Gemini 2.5 Pro, Claude 3.7 Sonnet and GPT-4o – through eight messy, real-world tests that reflect some of our day-to-day tasks:
Auditing codebases and building new features
Choosing the perfect laptop case
Crafting tricky international travel plans
Summarising dense essays
Reading scientific papers accurately
Executing research workflows
Translating chaotic brainstorming sessions into clear next steps
Ruthless fact-checking
Here’s what we learned from using the leading models in parallel.
One of our team began the laptop case test with a straightforward, consumer-facing prompt:
I need a laptop case for a MacBook Pro 17". Please find me the best product that fits my characteristics.
This vague prompt was his way of testing how well each model could scope a messy real-world query, ask the right clarifying questions and return a concise, high-confidence recommendation.
He scored each model on three Cs – clarity, curation and confidence – awarding up to two points apiece. o3 opened with five rapid‑fire questions about style, protection level, preferred materials, bonus features and budget, then delivered a shortlist of just two SKUs that met every constraint. The answer arrived in under three‑and‑a‑half minutes and included a neat comparison grid plus a one‑sentence “if/then” recommendation.
Gemini took the opposite approach. It was quicker but asked no clarifiers; it simply provided a selection of ten bags to choose from. Gemini felt like a search engine, o3 more of a personal concierge.
For the codebase audit test, Nathan chose to examine a large RAG-style project we use to create a searchable archive of our newsletter. The repository holds more than 2,400 files, and the only practical way to parse it is with OpenAI’s Codex or Anthropic’s Claude Code, which let the AI walk the file system, run shell commands and read source in bulk. Because we don’t have API access to o3, we had to settle for o4-mini. Practically, this is what most people would use for coding anyway: it is much cheaper while being only slightly behind o3 on software-engineering benchmarks.
Nathan graded performance on diagnosis accuracy, implementation advice, and whether the proposed patch ran on the first try – two points each, for a maximum of six. o4-mini received maximum marks. It loaded the full directory tree, crawled the 2,400 files easily and behaved like a seasoned mechanic: it listed modules, inlined code and flagged exact inconsistencies. Claude 3.7 Sonnet, on the other hand, offered a fine but higher‑level review, more driving impressions than under‑the‑hood specifics. Nathan gave it a 4/6.
The second part of Nathan’s coding test was to get the models to build a new feature. He handed both contenders the same brief: “After retrieval, run an LLM ‘aggregate & summarise’ step so you return a concise answer rather than raw chunks.” Claude 3.7 Sonnet treated it like a pull‑request ticket. In a single burst it created a tiny helper that grabs the top search chunks, feeds them to an OpenAI call and prints the resulting blurb. The patch slotted in cleanly: no hard‑coding of keys, clear logging and a new --raw flag for anyone who still wants the old behaviour. Nathan hit enter, the code compiled and the CLI spat out a neat three‑sentence answer – job done in under four minutes, API cost 68¢.
o4‑mini followed the same brief but took a slightly bumpier road. It altered fewer lines, ran faster and even wrote an “agent‑readable” guide inside the comments, yet it called an outdated OpenAI function. Fixing that import took thirty seconds, after which the feature worked exactly as advertised. Total bill: $1.
Both models could wire up a real feature in a single conversation, but Claude was slightly more robust. o4‑mini remains the fastest diagnostic mechanic, while Claude is still the safer pair of hands when the code needs to ship today.
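For illustration only, here is a minimal sketch of the kind of “aggregate & summarise” helper both models wired up – the function name, model choice and prompt are our assumptions, not either model’s actual patch:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment; no hard-coded keys

def summarise_chunks(question: str, chunks: list[str], model: str = "gpt-4o-mini") -> str:
    """Aggregate the top retrieved chunks into one concise answer."""
    excerpts = "\n\n---\n\n".join(chunks)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer the question using only the provided excerpts. Be concise."},
            {"role": "user", "content": f"Question: {question}\n\nExcerpts:\n{excerpts}"},
        ],
    )
    return response.choices[0].message.content
```

Claude’s version also exposed a --raw flag so the old chunk-only behaviour remained available.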
The British Airways website is one of my most frequently visited sites. My trips often involve tight turnarounds and a dash to another commitment. So far, the large language models have been utterly useless. They haven’t browsed the web; they can’t plan through the combinatorial complexity of trip routings and they lack the nuance and understanding I expect.
One example: I generally prefer the earlier flight to the later one, except when coming back from San Francisco, where the early BA flight is a janky A380 but the later one gets a more comfortable 777.
My current travel test for the LLMs is to respond accurately to this made-up challenge:
I am in Tianjin on 26 June finishing a meeting by 11.45am. I must be in Watford, UK for 11am on Friday 27 June. What travel options do I have? Business class preferred.
This isn’t straightforward. It is a multi-modal query, involving combinations of trains, taxis, planes and stopovers in potentially many countries. It’s reasonably easy for me to solve: leave Beijing on a flight that gets into London by 9pm UK time; or take an overnight routing via one of the likely candidate cities (Helsinki, Istanbul, Dubai, Doha, Abu Dhabi, Paris or Frankfurt) and make sure you can get a flight first thing in the morning to London.
Most AI systems fail even if they try hard. Gemini 2.5 with Deep Research was like a breathless intern. It went off and boiled the ocean, eagerly offering me a 5500-word report which was completely wrong. So, I wanted to know, how does o3 compare?
It was remarkable, as you’ll see further below.
o3 understood the possibilities and responded in a structured format: arrive in London on the evening of the 26th and stay overnight, or take an overnight flight (via Hong Kong or Istanbul) to arrive in London in the morning, with time to travel on to Watford, a town just outside London. It even considered flying private.
2025-04-21 19:31:32
Hi all,
Here’s your round-up of data driving conversations this week — all in less than 250 words.
Bunny on a budget. Cocoa prices have tripled since 2023 due to climate change and trade tensions.
GenAI gets personal. GenAI use shifted significantly, with personal and professional support apps – for therapy and companionship – dominating over technical assistance in 2025.
Copilot boosts workers. Microsoft’s AI Copilot helped knowledge workers complete documents 12% faster and cut weekly email time by half an hour in a large study.
AI apps making cash.
2025-04-20 13:32:37
Hi, it’s Azeem with our weekend update bringing you a bit of distance from the headlines, so we can see what’s really going on in the intersections of technology, geopolitics and our changing world. Let’s dive in.
2025-04-19 13:23:17
What a week. The feeling isn’t just acceleration anymore; it’s delirium. You blink, and another model is released, another rumour flies. The pace is truly accelerating beyond all reasonable expectation. Welcome to artificial general confusion.
This week felt like a microcosm of the entire AI race. OpenAI dropped o3 and o4-mini, their latest reasoning models. o3 is really quite impressive: it can use an array of tools, from Python to web search and image analysis, to do your bidding.
o3 has shown its mettle in my early tests. A particularly tricky one is my flight challenge, which involves making a transatlantic booking with specific constraints (like my preference for particular planes). o3 did really well, beating all other models, but, for now, I’m still better. Another test has been a multi-factor, quite complex, real-world strategic problem. o3 worked its way through it like a master strategist, pulling out the key issues and addressing them with just enough detail.
One of the more formal measures I’m tracking closely is METR’s time horizon, which tests AI’s ability to complete long tasks. Mastering long tasks is a key unlock for significant productivity enhancements.
On this, o3 does not disappoint.
Benchmarks offer a glimpse, but we should be cautious about simplistic readings. The true, messier reality of real-world application shows a palpable rate of improvement, even as capabilities feel like jagged edges pushed into the market.
Predictably, there was a stampede online, people breathlessly declaring this “artificial general intelligence,” as if we’ve finally tripped over some obvious finish line. It’s a line that, frankly, remains stubbornly undefined.
Within a day, Google fired back, launching Gemini 2.5 Flash. This is the faster, more efficient sibling to the 2.5 Pro model that’s become my go-to. On at least one benchmark, Google seems to be carving out a fascinating, potentially dominant position right now. Their price-performance frontier is helped, no doubt, by their hardware expertise.
Anthropic’s releases this week went rather unnoticed. Claude can now search your email, Google Drive, and calendar, although it is ponderously slow, and I use other AI tools on my email.
Then rumours started that OpenAI is considering buying Windsurf, one of those increasingly indispensable code completion tools that software engineers use to magnify their output. The purchase price? A cool $3 billion. And, what’s more, there are rumours that Sam Altman might turn ChatGPT’s billion-strong user base into a social network full of yeets.
How do we make sense of this four-ring circus of releases and more?
First, the pace is unsustainable for traditional product cycles. Products are being released faster than they can be properly described, product-managed, or even benchmarked in a way that’s useful to anyone outside the lab. The capabilities, while impressive, feel like jagged edges pushed out into the market.
2025-04-17 01:24:48
Before asking for more headcount and resources, teams must demonstrate why they cannot accomplish their goals using AI, explicitly showing what their area would look like if autonomous AI agents were integrated as part of the team.
— Tobi Lütke, CEO of Shopify
Tobi’s memo resonated widely because organizations now recognize what it truly means to integrate AI – not merely as an add-on but fundamentally into their operating structures.
At Exponential View, we’d embraced precisely this mindset to reimagine how we work. We consider ourselves AI-native, which means:
We use AI reflexively; it’s a core skill for everyone.
We’re tool-agnostic, continually evaluating and updating our stack.
We build new workflows from scratch rather than just automating old ones.
Our team is becoming increasingly technical – the baseline expectation for coding skills has risen significantly.
At the same time, I’ve encouraged my portfolio companies to scale through synthetic intelligence during this time of transformation.
In recent weeks, we sprinted to prototype new workflows using LLMs, automation tools and structured systems thinking. Today, I’ll share our most valuable lessons.
At the end, you’ll get access to something unique: our internal stack of 40+ tools – everything we’re actively using, testing or intend to test – to help you decide what tools might work for you.
Let’s jump into our seven lessons.
We’ve started applying a simple heuristic: if a task, or even a question, comes up five times a month, it’s a candidate for automation. This “5x rule” helps spot patterns hiding in plain sight and forces you to think in systems rather than routines. This habit sets the expectation that workflows should evolve constantly, not calcify.
Of course, we now ask the question “what do we do five times a month” more than five times a month, making that a candidate for automation.
One of my (Azeem’s) favourites is a simple workflow which does my expenses. I have to contend with dozens of invoices a month and my automations, relying on Dext and Gmail filters, are good but not great. Expense reconciliation has involved a lot of time in Gmail. My new expenses agent eliminates that repetition: it pulls out invoices, bills and receipts in my emails and puts them into a correctly structured spreadsheet. It also makes a PDF copy of the bill and sticks it into Google Drive. This saves me my least favourite hour every month.
If the bill is a plane, train or hotel booking, it also dumps them into a different document. A separate agent reviews that document and turns it into a chronological, structured travel briefing which I use. With fifty travel days across ten trips to the end of June, this is an enormous time saver – perhaps a dozen back-and-forths with Gmail have been replaced by the occasional check of this summary document.
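As a rough sketch of the deterministic half of that workflow – the Receipt fields and category labels here are illustrative, and the real agent first extracts them from email with an LLM:

```python
from dataclasses import dataclass

@dataclass
class Receipt:
    vendor: str
    amount: float
    date: str      # ISO date, e.g. "2025-06-26"
    category: str  # e.g. "flight", "hotel", "software"

def route_receipt(receipt: Receipt, expenses: list[Receipt], travel: list[Receipt]) -> None:
    """Every receipt lands in the expenses log; travel items also feed the briefing document."""
    expenses.append(receipt)
    if receipt.category in {"flight", "train", "hotel"}:
        travel.append(receipt)

expenses_log: list[Receipt] = []
travel_log: list[Receipt] = []
route_receipt(Receipt("BA", 412.50, "2025-06-26", "flight"), expenses_log, travel_log)
```

Keeping the routing step in plain code means that part of the workflow never misfiles a receipt; the LLM is only trusted with the extraction.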
One early and useful lesson in building with AI was to break down our workflows into smaller, autonomous components, rather than trying to automate an entire process in one go. This modular approach makes it easier to test individual pieces, troubleshoot in isolation and evolve parts of the system without destabilizing the whole.
This approach draws inspiration from classic software architecture: encapsulation and separation of concerns. But it also reflects how AI-native workflows behave. When you have an LLM doing part of the work, you want its task to be as narrow and unambiguous as possible. Broad instructions like “write a summary of the latest AI developments” often result in generic or unusable output. In contrast, narrower prompts like “list three key recent breakthroughs in battery technology and explain their relevance to electric vehicle adoption” yield precise answers and clearer points of failure, making them easier to improve iteratively.
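As a toy illustration of that difference, assuming the OpenAI Python SDK (the model name and prompt wording are ours):

```python
from openai import OpenAI

client = OpenAI()

BROAD = "Write a summary of the latest AI developments."
NARROW = (
    "List three key recent breakthroughs in battery technology and explain, "
    "in one sentence each, their relevance to electric vehicle adoption."
)

def run(prompt: str) -> str:
    """One narrow task per call keeps output checkable and failures easy to localise."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(run(NARROW))  # specific, verifiable output; run(BROAD) tends to produce generic filler
```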
One modular workflow we’ve found particularly valuable automates the discovery and initial research for potential partnerships. The first module scans our broader ecosystem, discovering companies actively engaging in areas aligned with our priorities. The second module enriches these initial leads, pinpointing key decision-makers and compiling relevant context from publicly available information. Finally, a third module – acting as our digital comms specialist – helps draft outreach that gets our message across as clearly as possible. The result is a process that frees our team to focus on building relationships rather than hunting down details.
A modular system also helps teams think like system designers. If Module A breaks, we know not to debug Module C. That clarity saves time. It also supports scale: when each unit functions independently, it’s easier to assign ownership, train interns, or plug in new AI tools.
Treat the LLM as the foreman, not the worker. That is, use the model to structure the task, but don’t ask it to do everything. Once the model has identified what tasks need to be carried out, you can decide whether a given task is deterministic (in which case you may need to farm it out to traditional software code) or requires more judgement (in which case an LLM might be able to handle it).
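Here is a minimal sketch of that foreman pattern – the plan-then-route loop and the CODE/LLM labels are our assumptions, not a prescribed implementation:

```python
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    """A single, narrow LLM call: used for planning and for judgement subtasks."""
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

def run_workflow(goal: str) -> None:
    # 1. Foreman step: the model structures the work but does not do all of it.
    plan = ask_llm(
        f"List the subtasks needed to: {goal}. "
        "Prefix each line with CODE: if it is deterministic, or LLM: if it needs judgement."
    )
    # 2. Route each subtask: deterministic work to ordinary software, judgement back to the model.
    for line in plan.splitlines():
        if line.startswith("CODE:"):
            print("hand to a script or API ->", line[len("CODE:"):].strip())
        elif line.startswith("LLM:"):
            print("model answers ->", ask_llm(line[len("LLM:"):].strip()))

run_workflow("compile this month's expense report from a folder of receipts")
```

In practice, each routed subtask becomes its own narrow module, which is exactly the modularity lesson above.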
2025-04-14 21:19:13
Hi all,
Here’s your Monday round-up of data driving conversations this week — all in less than 250 words.
Today’s edition is brought to you by Attio, the CRM for the AI era.
Sync your emails and Attio instantly builds your CRM with all contacts, companies, and interactions enriched with actionable data.
Trade tensions. China suspended exports of key rare earth minerals to underscore its control over roughly 90% of global supply of 17 strategic elements.
Ukraine’s drone advantage. Ukraine now manufactures FPV drones domestically at a lower cost than Chinese imports.
OpenAI’s dominance. 10% of the world now uses the firm’s AI, according to Sam Altman, and monthly revenue is up 30% in the last three months, at $415 million.
Academia embraces AI.
Breaking Wiki’s bank. AI data scraping increased Wikipedia’s infrastructure costs by 50% since January 2024.
Solar surge continues. Solar marked its 20th year as the fastest-growing power source in 2024.
The Detroit Three. Three major US car manufacturers’ global market share fell from 29% to ~13% since the early 2000s.
The American dream. Who’s going to do it?
Thanks for reading!
Today’s edition is brought to you by Attio
Attio is the AI native CRM built for the next era of companies.
Sync your emails and watch Attio build your CRM instantly – with every company, contact, and interaction you’ve ever had, enriched and organized. Join industry leaders like Flatfile, Replicate, Modal and more.