RSS preview of Blog of Tomasz Tunguz

The Bacon & the Skillet: When Does the AI Market Congeal?

2025-11-21 08:00:00

The AI market today is bacon in a hot skillet. Everything is sizzling, moving, & changing at an incredible pace. We’re all watching it closely.

Market share is fluid because no one yet knows what AI can do & the second we think we have grasped it, models improve. Nvidia’s chip performance & the launch of Gemini 3, the biggest gain ever in Google model performance, suggest no simmering ahead.

As long as the underlying models hurtle towards PhD level performance, people will continue to test. How much better is Gemini 3 at coding? tool calling? writing?

If the progress is material, then the benefit of switching is worth the activation energy.

Activation energy diagram showing the effort required to switch between AI tools

Today, startups, incumbent software companies, cloud providers & AI labs are all competing. First the model, then infrastructure (memory & retrieval), then tools, then applications. Will the foundational models play at the application layer? Or will the applications differentiate themselves enough to overcome model differences?

Who can take advantage of the next big leap in model performance fastest? Which sales team can reach the target customers first & write the RFP?

This is the Great Game of Risk in Category Creation & aggression wins.

But this era of fluidity won’t last forever. The rate of improvement in AI models will eventually attenuate. When the performance gap between the best model & the second-best model shrinks, the incentive to switch evaporates.

Switching costs will start to matter more than marginal performance gains. The custom tools I’ve built, the muscle memory I’ve developed, the integrations my company has deployed, the enterprise contracts signed, all inertia.

At that point, the fat begins to congeal.

The winners will be those who use the sizzling phase to build fat worth congealing around.

This fun analogy came up during my conversation with Harry, Jason, & Rory.

The Scaling Wall Was A Mirage

2025-11-20 08:00:00

Two revelations this week have shaken the narrative in AI : Nvidia’s earnings & this tweet about Gemini.

Oriol Vinyals tweet about Gemini 3 scaling

The AI industry spent 2025 convinced that pre-training scaling laws had hit a wall. Models weren’t improving just from adding more compute during training.

Then Gemini 3 launched. The model has the same parameter count as Gemini 2.5, one trillion parameters, yet achieved massive performance improvements. It’s the first model to break 1500 Elo on LMArena & beat GPT-5.1 on 19 of 20 benchmarks.

Oriol Vinyals, VP of Research at Google DeepMind, credited improvements in pre-training & post-training for the gains. He added that the delta between 2.5 & 3.0 is as big as Google has ever seen, with no walls in sight.

This is the strongest evidence since o1 that pre-training scaling still works when algorithmic improvements meet better compute.

Second, Nvidia’s earnings call reinforced the demand.

We currently have visibility to $0.5 trillion in Blackwell and Rubin revenue from the start of this year through the end of calendar year 2026. By executing our annual product cadence and extending our performance leadership through full stack design, we believe NVIDIA will be the superior choice for the $3 trillion to $4 trillion in annual AI infrastructure build we estimate by the end of the decade.

The clouds are sold out and our GPU installed base, both new and previous generations, including Blackwell, Hopper and Ampere is fully utilized. Record Q3 data center revenue of $51 billion increased 66% year-over-year, a significant feat at our scale.
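To put the quoted growth rate in dollar terms, the year-ago quarter can be backed out from the two figures in the quote. This is a rough sketch : the function name is illustrative & the inputs are the rounded numbers from the call, so the outputs are approximations rather than reported figures.

```python
def implied_prior_period(current: float, yoy_growth: float) -> float:
    """Back out the year-ago value from the current value and YoY growth."""
    return current / (1.0 + yoy_growth)


q3_data_center = 51.0  # $ billions, rounded figure from the quote
yoy_growth = 0.66      # 66% year-over-year, from the quote

year_ago = implied_prior_period(q3_data_center, yoy_growth)
added = q3_data_center - year_ago

print(f"Implied year-ago quarter: ${year_ago:.1f}B")
print(f"Revenue added in a year:  ${added:.1f}B")
```

On these rounded inputs, the data center business added roughly $20 billion of quarterly revenue in twelve months.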

The infrastructure is accelerating headlong into hundreds of billions next year & Nvidia predicts it will be in the trillions, citing “$3 trillion to $4 trillion in data center by 2030”.

As Gavin Baker points out, Nvidia confirmed Blackwell Ultra delivers 5x faster training times than Hopper.

Gemini 3 proves the scaling laws are intact, so Blackwell’s extra power will translate directly into better model capabilities, not just cost efficiency.

Together, these two data points dismantle the scaling wall thesis.

What 375 AI Builders Actually Ship

2025-11-17 01:00:00

70% of production AI teams use open source models. 72.5% connect agents to databases, not chat interfaces. This is what 375 technical builders actually ship - & it looks nothing like Twitter AI.

350 out of 413 teams use open source models

70% of teams use open source models in some capacity. 48% describe their strategy as mostly open. 22% commit to only open. Just 11% stay purely proprietary.

Agents access deep systems: databases, web search, memory, file systems

Agents in the field are systems operators, not chat interfaces. We thought agents would mostly call APIs. Instead, 72.5% connect to databases. 61% to web search. 56% to memory systems & file systems. 47% to code interpreters.

The center of gravity is data & execution, not conversation. Sophisticated teams build MCPs to access their own internal systems (58%) & external APIs (54%).

85% use synthetic data for generating evals vs fine-tuning

Synthetic data powers evaluation more than training. 65% use synthetic data for eval generation versus 24% for fine-tuning. This points to a near-term surge in eval-data marketplaces, scenario libraries, & failure-mode corpora before synthetic training data scales up.

The timing reveals where the stack is heading. Teams need to verify correctness before they can scale production.

Automated methods for improving context: prompt optimization, ablations, manual

88% use automated methods for improving context. Yet context remains the #1 pain point in deploying AI products. This gap between tooling adoption & problem resolution points to a fundamental challenge.

The tools exist. The problem is harder than better retrieval or smarter chunking can solve.

Context remains the true challenge & the biggest opportunity for the next generation of AI infrastructure.

Explore the full interactive dataset here or read Lauren’s complete analysis.

Teaching Local Models to Call Tools Like Claude

2025-11-14 01:00:00

Ten months ago, DeepSeek collapsed AI training costs by 90% using distillation - transferring knowledge from larger models to smaller ones at a fraction of the cost.

Distillation works like a tutor training a student : a large model teaches a smaller one.¹ As we’ve shifted from knowledge retrieval to agentic systems, we wondered if there was a parallel technique for tool calling.²

Could a large model teach a smaller one to call the right tools?

The answer is yes, or at least yes in our case. Here’s our current effort :

Every time we used Claude Code, we logged the session - our query, available tools, & which tools Claude chose. These logs became training examples showing the local model what good tool calling looks like.
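A minimal sketch of what one logged session might look like as a training example. The record shape, field names, & tool names here are assumptions for illustration ; the post only says that the query, the available tools, & the chosen tools were logged.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ToolCallExample:
    """One logged session (hypothetical schema, not the author's format)."""
    query: str
    available_tools: list[str]
    chosen_tools: list[str]  # tools in the order the teacher called them


def to_jsonl_line(example: ToolCallExample) -> str:
    """Serialize one example as a single JSON Lines record."""
    return json.dumps(asdict(example))


def from_jsonl_line(line: str) -> ToolCallExample:
    """Rebuild an example from one JSON Lines record."""
    return ToolCallExample(**json.loads(line))


# Hypothetical session: the tool names are made up for this sketch.
example = ToolCallExample(
    query="Which posts mention net dollar retention?",
    available_tools=["search_notes", "read_file", "run_query"],
    chosen_tools=["search_notes", "read_file"],
)

line = to_jsonl_line(example)
restored = from_jsonl_line(line)
print(line)
```

One record per line keeps the log append-only & easy to filter later, which matters once selection algorithms start pruning examples.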

We wanted the right training data, so we used two selection algorithms, SemDeDup³ & CaR⁴, to cherry-pick the examples that lead to better results.
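The idea behind semantic deduplication can be sketched in a few lines. This is a simplified stand-in, not the published SemDeDup procedure : the paper clusters embeddings first & prunes within clusters, while this version compares every pair directly, which is only practical for small logs. The toy vectors & the 0.95 threshold are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_dedup(embeddings: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Greedily keep an example only if it is not too similar to any kept one.

    Returns the indices of the examples that survive.
    """
    kept: list[int] = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings standing in for real sentence embeddings of logged queries.
toy = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of example 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],   # near-duplicate of example 2
    [0.0, 0.0, 1.0],
]

kept = semantic_dedup(toy)
print(kept)
```

In practice the vectors would come from an embedding model, & the threshold would be tuned against how much data the optimizer can afford to lose.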

Claude Code fired up our local model powered by GPT-OSS 20b⁵ & peppered it with the queries. Claude then graded GPT-OSS on which tools it called.

Claude’s assessments were fed into a prompt-optimization system built with DSPy⁶ & GEPA⁷, & all of that data was used to improve the prompt. DSPy searches for existing examples that could improve the prompt, while GEPA mutates the prompt & tests the resulting variants.
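The shape of that loop can be illustrated without either library. The sketch below is a generic mutate-&-score hill climb, not DSPy’s or GEPA’s actual API ; the scoring & mutation functions are toy stand-ins. In a real setup the score would be the graded match rate against the teacher, & the mutations would come from a model reflecting on feedback.

```python
from typing import Callable


def optimize_prompt(
    seed: str,
    mutate: Callable[[str], list[str]],
    score: Callable[[str], float],
    rounds: int = 3,
) -> tuple[str, float]:
    """Keep the best-scoring prompt seen across a few rounds of mutation."""
    best, best_score = seed, score(seed)
    for _ in range(rounds):
        for candidate in mutate(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: hypothetical instructions a grader might reward.
HINTS = [
    "List the available tools first.",
    "Prefer read-only tools.",
    "Return tool names in call order.",
]


def toy_mutate(prompt: str) -> list[str]:
    """Propose variants by appending each hint not already present."""
    return [prompt + " " + hint for hint in HINTS if hint not in prompt]


def toy_score(prompt: str) -> float:
    """Fraction of hints present; a placeholder for a graded match rate."""
    return sum(hint in prompt for hint in HINTS) / len(HINTS)


best, best_score = optimize_prompt("Choose tools for the query.", toy_mutate, toy_score)
print(best_score)
print(best)
```

The loop only ever accepts a candidate that scores strictly higher, so the quality of the grader determines the quality of the final prompt.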

Combined, we improved from a 12% Claude match rate to 93% in three iterations by increasing the data volume to cover different scenarios :

Optimizer	Training Examples	% of Claude
DSPy Phase 1	50	12%
GEPA Phase 2	50	84%
GEPA Phase 3	15 (curated)	93%

DSPy improved accuracy from 0% to 12%, and GEPA pushed it much higher, all the way to 93%, after three phases. The local model now matches Claude’s tool call chain in 93% of cases.
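One way to compute a match rate like this is to compare tool call chains query by query. The sketch assumes a match means the exact same tools in the exact same order, which the post does not specify ; the tool names are hypothetical.

```python
def match_rate(reference: list[list[str]], candidate: list[list[str]]) -> float:
    """Share of queries where the candidate chain equals the reference chain.

    Assumes exact, ordered equality counts as a match.
    """
    if len(reference) != len(candidate):
        raise ValueError("both lists must cover the same queries")
    if not reference:
        return 0.0
    matches = sum(ref == cand for ref, cand in zip(reference, candidate))
    return matches / len(reference)


# Hypothetical chains for four queries: teacher first, local model second.
teacher_chains = [
    ["search_notes", "read_file"],
    ["run_query"],
    ["search_notes"],
    ["read_file", "run_query"],
]
local_chains = [
    ["search_notes", "read_file"],
    ["run_query"],
    ["read_file"],  # differs from the teacher
    ["read_file", "run_query"],
]

rate = match_rate(teacher_chains, local_chains)
print(f"{rate:.0%}")
```

A looser definition, such as ignoring order or giving partial credit, would produce a higher number from the same logs, so the definition should travel with the metric.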

Make no mistake : matching Claude 93% of the time doesn’t mean 93% accuracy. When we benchmarked Claude itself, it produced consistent results only about 50% of the time. This is non-determinism at work.
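Consistency of this kind can be estimated by replaying one query several times & checking how often the most common tool chain appears. This is one plausible definition, not necessarily the one used for the 50% figure ; the runs below are made up.

```python
from collections import Counter


def consistency(runs: list[list[str]]) -> float:
    """Fraction of runs that produced the single most common tool chain."""
    if not runs:
        raise ValueError("need at least one run")
    counts = Counter(tuple(chain) for chain in runs)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(runs)


# Hypothetical repeated runs of the same query against the same model.
runs = [
    ["search_notes", "read_file"],
    ["search_notes", "read_file"],
    ["read_file", "search_notes"],
    ["search_notes"],
]

score = consistency(runs)
print(f"{score:.0%}")
```

A student graded against a single teacher run is therefore being compared to a moving target, which is why the match rate should not be read as accuracy.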

This proof of concept works for a small set of tools written in the code mode fashion. It suggests there is a potential for tool calling distillation.

If you’ve tried something similar, I’d love to hear from you.

¹ A Survey on Knowledge Distillation of Large Language Models - Xu et al. (2024) examine knowledge distillation as a methodology for transferring capabilities from proprietary LLMs like GPT-4 to open-source models like LLaMA & Mistral. The survey covers applications in model compression, efficient deployment, & resource-constrained environments, providing a comprehensive overview of distillation techniques for modern language models. ↩︎
² ODIA: Oriented Distillation for Inline Acceleration of LLM-based Function Calling - Recent research on distilling function calling capabilities from larger models to smaller ones. ODIA leverages online user interaction data to accelerate function calling, reducing response latency by 45% (expected) & 78% (median) while maintaining accuracy. The method successfully handled 60% of traffic with negligible accuracy loss in production deployment. ↩︎
³ SemDeDup: Data-efficient learning at web-scale through semantic deduplication - Abbas et al. (2023) present a method that uses embeddings from pre-trained models to identify & remove semantic duplicates from training data. Analyzing LAION, they showed that removing 50% of semantically similar data resulted in minimal performance loss while effectively halving training time, with additional out-of-distribution performance improvements. ↩︎
⁴ CaR (Cluster and Retrieve) - A data selection technique that clusters similar training examples & retrieves the most representative ones to improve model performance. This method reduces redundancy in training data while preserving diversity, leading to more efficient learning. ↩︎
⁵ This model is sandboxed. It reads production data but doesn’t write for safety. ↩︎
⁶ DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines - Khattab et al. (2024) introduce DSPy, a framework that programmatically creates & refines prompts through optimization strategies that systematically simulate instruction variations & generate few-shot examples. Research across multiple use cases showed DSPy can improve task accuracy substantially, with prompt evaluation tasks rising from 46.2% to 64.0% accuracy through bootstrap learning & teleprompter algorithms. ↩︎
⁷ GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning - Agrawal et al. (2025) present GEPA, a reflective prompt optimizer that merges textual reflection with multi-objective evolutionary search. GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It surpasses the previous state-of-the-art prompt optimizer MIPROv2 on every benchmark, obtaining aggregate optimization gains of +14% compared to MIPROv2’s +7%. The system iteratively mutates prompts based on natural language feedback from execution traces. ↩︎

Running Out of AI

2025-11-12 08:00:00

By Monday lunch, I had burned through my Claude Code credits. I’d been warned ; damn the budget, full prompting ahead.

I typed ultrathink to solve a particularly challenging coding problem, knowing the rainbow colors of the word meant I was playing with digital fire.

When that still couldn’t solve the issue, I summoned Opus, the biggest & most expensive model, to solve it.

Now two days on, I’ve needed to figure out alternatives. Do I :

Switch to API billing (how much will that cost?)
Try another vendor? Gemini’s model is great, but ageing ; at nearly 8 months old, it’s a capable jalopy. Cursor’s free coding model Composer 1 sprints at problems with aplomb but is a bit overwhelmed at times. Codex, the plodding giant, is brilliant at large-scale technical challenges.
Create another Max subscription & switch between them? Can I ask AI to write a script to save me the hassle of changing my identity?
Stand up GPT-OSS to run locally? A little bit more latency but potent & twice as fast on llama.cpp compared to Ollama.
Go back to writing code the old way? The hedonic treadmill moves quickly. I tried to return to the old ways, but it was painful. I’ve already forgotten where the blog server script is. Claude? Do you remember? Claude?

I’m working through the math of which option will cost more. How much is the Max plan subsidized? Will knowing the true API cost of my Claude Code usage increase my willingness to pay?
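The comparison reduces to a short calculation once token volumes are known. Every number below is a placeholder assumption, not a real price or my actual usage ; substitute the vendor’s published rates & the token counts from your own logs.

```python
def api_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Pay-as-you-go cost in dollars for a given token volume."""
    return (
        input_tokens / 1_000_000 * input_price_per_million
        + output_tokens / 1_000_000 * output_price_per_million
    )


# Placeholder monthly usage and prices, chosen only to make the math visible.
monthly_input_tokens = 200_000_000
monthly_output_tokens = 10_000_000
input_price = 3.0          # $ per million input tokens (assumed)
output_price = 15.0        # $ per million output tokens (assumed)
subscription_price = 200.0  # $ per month for a flat plan (assumed)

cost = api_cost(monthly_input_tokens, monthly_output_tokens, input_price, output_price)
implied_subsidy = cost - subscription_price

print(f"API billing:     ${cost:,.0f} per month")
print(f"Implied subsidy: ${implied_subsidy:,.0f} per month")
```

If the implied subsidy is large, the flat plan is the bargain & a second subscription may beat API billing ; if it is near zero, metered billing costs about the same & removes the cap.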

Switching between tools incurs costs. The tools, the workflow, the prompts that I’ve optimized for Claude Code must all be ported (at my expense!) to other tools.

As the capabilities of these models begin to plateau, the costs to shift increase. So does my willingness to pay for Claude to answer me.

Datadog: As Reliable as Your Golden Retriever

2025-11-10 08:00:00

Datadog is becoming a platform company, & its Q3 2025 results underscore how successful this transition is. If nothing else, the consistency around 25% growth for the last 12 quarters exemplifies this point.

Datadog revenue growth chart showing quarterly revenue & year-over-year growth rate

Net dollar retention underpins this growth, combined with accelerating new customer account acquisition. One of the biggest changes in the last five quarters is terrific cross-selling across an increasingly large product suite.

Datadog net dollar retention recovery from 2023 trough to 120% in Q3 2025

Platform Adoption Deepening

At the end of Q3, 84% of customers were using 2 or more products, up from 83% a year ago. 54% of customers were using 4 or more products, up from 49% a year ago. 31% of our customers were using 6 or more products, up from 26% a year ago & 16% of our customers were using 8 or more products, up from 12% a year ago.

Datadog’s platform spans six product categories:

Digital Experience Monitoring: RUM/Real User Monitoring, Synthetics, Product Analytics
Security: Cloud SIEM, Cloud Security
Infrastructure Observability: APM, Log Management, Flex Logs
Incident Response: Incident Management, On-Call
AI Capabilities: Bits AI, LLM Observability
Cost Management: Cloud Cost Management

The steady increase in multi-product adoption demonstrates customers consolidating their observability stack onto Datadog, with the highest-tier customers (8+ products) growing 33% year-over-year as a percentage of the base.
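The 33% figure is the relative change in the share of customers on 8 or more products, not a change in percentage points. A small sketch applying the same calculation to every tier quoted above (the dictionary layout is just for illustration) :

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


# Share of customers per tier: (year ago %, Q3 2025 %), from the quote above.
adoption = {
    "2+ products": (83, 84),
    "4+ products": (49, 54),
    "6+ products": (26, 31),
    "8+ products": (12, 16),
}

for tier, (year_ago, now) in adoption.items():
    points = now - year_ago
    change = relative_change(year_ago, now)
    print(f"{tier}: +{points} pts, {change:.0%} relative growth")
```

The relative growth rises with each tier, which is the pattern one would expect if existing customers are adding products faster than new, single-product customers are arriving.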

New Customer Momentum

New logo annualized bookings more than doubled year-over-year & set a new record driven by an increase in average new logo land size, particularly in enterprise.

The portion of our year-over-year revenue growth that related to new customers was about 25% in Q3, up from 20% in Q2.

New customer acquisition is also accelerating. This is in concert with a move upmarket into the enterprise.

AI Native Customer Expansion

We also experienced strong revenue growth for our AI native customers & a broadening contribution to growth among those customers. There, too, we saw an acceleration of growth in our AI cohort in Q3 when excluding our largest customer.

This group represented 12% of our revenue, up from 11% last quarter & about 6% in the year ago quarter.

The AI native cohort is both growing & maturing. Datadog now has 15 AI native customers spending more than $1 million annually, up from essentially zero a year ago, with over 100 spending more than $100,000.

Revenue Growth

Revenue was $886 million, an increase of 28% year-over-year & above the high end of our guidance range.
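The percentages quoted in the sections above can be turned into rough dollar figures. This is back-of-the-envelope arithmetic on rounded inputs ("about 25%", "12%"), so treat the outputs as approximations, not reported numbers ; the variable names are illustrative.

```python
revenue = 886.0                     # $ millions, Q3 2025
yoy_growth = 0.28                   # 28% year-over-year
new_customer_share_of_growth = 0.25  # "about 25%" of YoY growth
ai_native_share_of_revenue = 0.12    # "12% of our revenue"

year_ago_revenue = revenue / (1.0 + yoy_growth)
growth_dollars = revenue - year_ago_revenue
from_new_customers = growth_dollars * new_customer_share_of_growth
ai_native_revenue = revenue * ai_native_share_of_revenue

print(f"Implied year-ago revenue:  ${year_ago_revenue:.0f}M")
print(f"Year-over-year growth:     ${growth_dollars:.0f}M")
print(f"  of which new customers:  ${from_new_customers:.0f}M")
print(f"AI native cohort revenue:  ${ai_native_revenue:.0f}M")
```

On these inputs, roughly three quarters of the growth dollars still come from existing customers, which is consistent with net dollar retention underpinning the business.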

The combination of these three factors : a broader product suite that is effectively cross-sold, accelerating new customer momentum, & a very fast-growing AI business, has led to outperformance.

Security Suite Accelerating

Security ARR growth was in the mid-50s as a percentage year-over-year in Q3, up from the mid-40s we mentioned last quarter.

We’re starting to see success in including Cloud SIEM in larger deals, & we’ll get back to that in a bit in our customer examples. And we’re seeing positive trends beyond Cloud SIEM, including fast uptake of Code Security & an increasing number of wins in cloud security.

Security is becoming a meaningful growth driver for Datadog, accelerating from mid-40s to mid-50s percentage growth & expanding beyond Cloud SIEM into broader cloud security use cases.

Enterprise Deal Momentum

First, we landed a 7-figure annualized deal with a leading European telco, our largest ever land deal in Europe. […] They will adopt 11 Datadog products to start.

Next, we landed a 7-figure annualized deal with a Fortune 500 technology hardware company.

Both of these data points confirm a significant move upmarket. A million-dollar land deal with 11 products confirms that Datadog is truly selling a suite.

Datadog’s AI Products

In addition to the existing suite, Datadog is pushing heavily into AI with a broader range of AI deployment products.

Bits AI SRE Agent (Available in preview, announced June 2025) is an autonomous AI agent that investigates alerts & coordinates incident response 24/7, saving customers significant time on mean-time-to-resolution.
LLM Experiments & Playgrounds (Generally available, launched 2025) helps teams rapidly iterate on LLM applications by testing prompt changes, model swaps, & application changes against production traces.
Custom LLM-as-a-Judge Evaluations (Generally available) lets customers write natural language evaluation prompts to assess LLM application quality & safety across traces & spans.
Datadog MCP Server (Available in preview, announced 2025) bridges Datadog with AI agents like Codex, Claude, Cursor, & GitHub Copilot, providing structured access to metrics, logs, traces, & incidents directly from AI coding environments.
TOTO, Datadog’s open-source time series forecasting model (launched 2025), was trained on 2 trillion data points & became one of Hugging Face’s top downloads across all categories.

If SaaS companies were dog breeds, many would be temperamental. But Datadog demonstrates continued consistency across a broad range of different businesses.

Tomasz Tunguz