Blog of Tomasz Tunguz

What 375 AI Builders Actually Ship

2025-11-17 01:00:00

70% of production AI teams use open source models. 72.5% connect agents to databases, not chat interfaces. This is what 375 technical builders actually ship - & it looks nothing like Twitter AI.

350 out of 413 teams use open source models

70% of teams use open source models in some capacity. 48% describe their strategy as mostly open. 22% commit to only open. Just 11% stay purely proprietary.

Agents access deep systems: databases, web search, memory, file systems

Agents in the field are systems operators, not chat interfaces. We thought agents would mostly call APIs. Instead, 72.5% connect to databases. 61% to web search. 56% to memory systems & file systems. 47% to code interpreters.

The center of gravity is data & execution, not conversation. Sophisticated teams build MCPs to access their own internal systems (58%) & external APIs (54%).
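
As a concrete illustration of that center of gravity, here is a minimal sketch of the kind of tool manifest these agents run against. The tool names & parameters below are hypothetical, not from the survey; the point is that the tools target data & execution (databases, search, files) rather than a chat turn.

```python
# Hypothetical tool manifest in the JSON-Schema style most agent frameworks
# accept. Names & parameters are illustrative only.
TOOLS = [
    {
        "name": "query_database",
        "description": "Run a read-only SQL query against the warehouse.",
        "parameters": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    },
    {
        "name": "web_search",
        "description": "Search the web & return the top results.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "read_file",
        "description": "Read a file from the agent's workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]
```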

85% use synthetic data for generating evals vs fine-tuning

Synthetic data powers evaluation more than training. 65% use synthetic data for eval generation versus 24% for fine-tuning. This points to a near-term surge in eval-data marketplaces, scenario libraries, & failure-mode corpora before synthetic training data scales up.

The timing reveals where the stack is heading. Teams need to verify correctness before they can scale production.

Automated methods for improving context: prompt optimization, ablations, manual

88% use automated methods for improving context. Yet context remains the #1 pain point in deploying AI products. This gap between tooling adoption & problem resolution points to a fundamental challenge.

The tools exist. The problem is harder than better retrieval or smarter chunking can solve.

Context remains the true challenge & the biggest opportunity for the next generation of AI infrastructure.

Explore the full interactive dataset here or read Lauren’s complete analysis.

Teaching Local Models to Call Tools Like Claude

2025-11-14 01:00:00

Ten months ago, DeepSeek collapsed AI training costs by 90% using distillation - transferring knowledge from larger models to smaller ones at a fraction of the cost.

Distillation works like a tutor training a student: a large model teaches a smaller one.[1] As we’ve shifted from knowledge retrieval to agentic systems, we wondered if there was a parallel technique for tool calling.[2]

Could a large model teach a smaller one to call the right tools?

The answer is yes, or at least yes in our case. Here’s our current effort:


Every time we used Claude Code, we logged the session - our query, available tools, & which tools Claude chose. These logs became training examples showing the local model what good tool calling looks like.
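
A sketch of what one of those logged sessions might look like as a JSONL record. The field names & the sample query are assumptions for illustration, not our actual schema.

```python
import json

# Hypothetical record for one Claude Code session: the query, the tools that
# were available, & the tool chain Claude actually chose. These become the
# supervision signal for the local model.
example = {
    "query": "Find the slowest API endpoint this week & open its handler",
    "available_tools": ["query_database", "web_search", "read_file", "run_tests"],
    "claude_tool_chain": [
        {"tool": "query_database", "args": {"sql": "SELECT endpoint, p95 FROM latency ORDER BY p95 DESC LIMIT 1"}},
        {"tool": "read_file", "args": {"path": "api/handlers/reports.py"}},
    ],
}

with open("tool_call_traces.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```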

We wanted to choose the right data, so we used algorithms to cherry-pick: SemDeDup[3] & CaR[4], which identify the examples that lead to better results.
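
For flavor, here is a simplified, SemDeDup-style pass. The paper dedups inside k-means clusters for scale; this greedy version assumes you already have one embedding per logged query.

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Keep an example only if its cosine similarity to everything already
    kept is below `threshold`. A greedy simplification of SemDeDup."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# kept = semantic_dedup(np.load("trace_embeddings.npy"))  # indices of examples to keep
```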

Claude Code fired up our local model, powered by GPT-OSS 20B,[5] & peppered it with the queries. Claude then graded GPT-OSS on which tools it called.
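
A sketch of the grading step under the simplest possible rubric, exact match on the tool sequence. Our actual judge is a Claude prompt; a stricter version would also diff the arguments.

```python
def matches_reference(reference: list[dict], candidate: list[dict]) -> bool:
    """True when the local model called the same tools, in the same order,
    as the Claude reference trace (compared by tool name only)."""
    return [s["tool"] for s in reference] == [s["tool"] for s in candidate]

# match_rate = sum(matches_reference(ref, local) for ref, local in graded_pairs) / len(graded_pairs)
```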

Claude’s assessments were then fed into a prompt-optimization system built with DSPy[6] & GEPA[7]. DSPy searches for existing examples that could improve the prompt, while GEPA mutates the prompt & tests the different variants.
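
For a sense of the DSPy side, here is a minimal sketch using the BootstrapFewShot optimizer as a stand-in (GEPA isn't shown). The model endpoint, signature fields, & the inline example are all assumptions, not our production setup.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumption: the local GPT-OSS model sits behind an OpenAI-compatible server;
# the model name, URL & key are placeholders.
dspy.configure(lm=dspy.LM("openai/gpt-oss-20b",
                          api_base="http://localhost:8000/v1", api_key="local"))

class SelectTools(dspy.Signature):
    """Given a user query & the available tools, list the tools to call in order."""
    query: str = dspy.InputField()
    available_tools: str = dspy.InputField()
    tool_chain: str = dspy.OutputField(desc="comma-separated tool names, in call order")

def matches_claude(example, prediction, trace=None) -> bool:
    # The target is the tool chain Claude chose in the logged session.
    return prediction.tool_chain.strip() == example.tool_chain.strip()

logged_traces = [  # (query, available tools, Claude's chain), from the curated logs
    ("Find the slowest API endpoint this week & open its handler",
     "query_database, web_search, read_file, run_tests",
     "query_database, read_file"),
]
trainset = [
    dspy.Example(query=q, available_tools=t, tool_chain=c)
    .with_inputs("query", "available_tools")
    for q, t, c in logged_traces
]

optimized = BootstrapFewShot(metric=matches_claude).compile(
    dspy.Predict(SelectTools), trainset=trainset
)
```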

Combined, we improved from a 12% Claude match rate to 93% over three iterations by refining the training data to cover different scenarios:

Optimizer      | Training Examples | % of Claude
DSPy (Phase 1) | 50                | 12%
GEPA (Phase 2) | 50                | 84%
GEPA (Phase 3) | 15 (curated)      | 93%

DSPy lifted the match rate from 0% to 12% in the first phase, & GEPA pushed it to 93% over the next two. The local model now matches Claude’s tool-call chain in 93% of cases.

Make no mistake: matching Claude 93% of the time doesn’t mean 93% accuracy. When we benchmarked Claude itself, it only produced consistent results about 50% of the time. This is non-determinism at work.
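
One way to measure that baseline, assuming a hypothetical get_tool_chain helper that queries the model & returns the tool names it picked:

```python
from collections import Counter

def self_consistency(get_tool_chain, query: str, runs: int = 10) -> float:
    """Re-run the same query several times & report how often the model
    reproduces its own most common tool chain."""
    chains = [tuple(get_tool_chain(query)) for _ in range(runs)]
    return Counter(chains).most_common(1)[0][1] / runs
```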

This proof of concept works for a small set of tools written in the code-mode fashion. It suggests there is potential for tool-calling distillation.

If you’ve tried something similar, I’d love to hear from you.


  1. A Survey on Knowledge Distillation of Large Language Models - Xu et al. (2024) examine knowledge distillation as a methodology for transferring capabilities from proprietary LLMs like GPT-4 to open-source models like LLaMA & Mistral. The survey covers applications in model compression, efficient deployment, & resource-constrained environments, providing a comprehensive overview of distillation techniques for modern language models. ↩︎

  2. ODIA: Oriented Distillation for Inline Acceleration of LLM-based Function Calling - Recent research on distilling function calling capabilities from larger models to smaller ones. ODIA leverages online user interaction data to accelerate function calling, reducing response latency by 45% (expected) & 78% (median) while maintaining accuracy. The method successfully handled 60% of traffic with negligible accuracy loss in production deployment. ↩︎

  3. SemDeDup: Data-efficient learning at web-scale through semantic deduplication - Abbas et al. (2023) present a method that uses embeddings from pre-trained models to identify & remove semantic duplicates from training data. Analyzing LAION, they showed that removing 50% of semantically similar data resulted in minimal performance loss while effectively halving training time, with additional out-of-distribution performance improvements. ↩︎

  4. CaR (Cluster and Retrieve) - A data selection technique that clusters similar training examples & retrieves the most representative ones to improve model performance. This method reduces redundancy in training data while preserving diversity, leading to more efficient learning. ↩︎

  5. This model is sandboxed: it reads production data but, for safety, doesn’t write to it. ↩︎

  6. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines - Khattab et al. (2024) introduce DSPy, a framework that programmatically creates & refines prompts through optimization strategies that systematically simulate instruction variations & generate few-shot examples. Research across multiple use cases showed DSPy can improve task accuracy substantially, with prompt evaluation tasks rising from 46.2% to 64.0% accuracy through bootstrap learning & teleprompter algorithms. ↩︎

  7. GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning - Agrawal et al. (2025) present GEPA, a reflective prompt optimizer that merges textual reflection with multi-objective evolutionary search. GEPA outperforms GRPO by 10% on average (up to 20%) while using up to 35x fewer rollouts. It surpasses the previous state-of-the-art prompt optimizer MIPROv2 on every benchmark, obtaining aggregate optimization gains of +14% compared to MIPROv2’s +7%. The system iteratively mutates prompts based on natural language feedback from execution traces. ↩︎

Running Out of AI

2025-11-12 08:00:00

By Monday lunch, I had burned through my Claude Code credits. I’d been warned; damn the budget, full prompting ahead.

I typed ultrathink to solve a particularly challenging coding problem, knowing the rainbow colors of the word meant I was playing with digital fire.

When that still couldn’t solve the issue, I summoned Opus, the biggest & most expensive model, to solve it.

Now, two days on, I need to figure out alternatives. Do I:

  • Switch to API billing (how much will that cost?)
  • Try another vendor? Gemini’s model is great, but aging; at nearly 8 months old, it’s a capable jalopy. Cursor’s free coding model Composer 1 sprints at problems with aplomb but is a bit overwhelmed at times. Codex, the plodding giant, is brilliant at large-scale technical challenges.
  • Create another Max subscription & switch between them? Can I ask AI to write a script to save me the hassle of changing my identity?
  • Stand up GPT-OSS to run locally? A little more latency, but potent & twice as fast on llama.cpp compared to Ollama.
  • Go back to writing code the old way? The hedonic treadmill moves quickly. I tried to return to the old ways, but it was painful. I’ve already forgotten where the blog server script is. Claude? Do you remember? Claude?

I’m working through the math of which option will cost more. How much is the Max plan subsidized? Will knowing the true API cost of my Claude Code usage increase my willingness to pay?
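
Here’s the shape of that back-of-envelope. Every number below is a placeholder to plug your own plan price, token rates, & usage into, not a quoted rate.

```python
# Placeholder figures only; substitute your actual plan price, API rates & usage.
PLAN_PER_MONTH = 200.00                          # hypothetical flat subscription
INPUT_PER_MTOK, OUTPUT_PER_MTOK = 3.00, 15.00    # hypothetical $/million tokens

def api_cost(input_tokens: float, output_tokens: float) -> float:
    return input_tokens / 1e6 * INPUT_PER_MTOK + output_tokens / 1e6 * OUTPUT_PER_MTOK

# Suppose a heavy coding week burns ~40M input & 4M output tokens (a guess).
monthly_api = api_cost(40e6, 4e6) * 4.3
print(f"API: ${monthly_api:,.0f}/mo vs subscription: ${PLAN_PER_MONTH:,.0f}/mo")
```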

Switching between tools incurs costs. The tools, the workflow, & the prompts that I’ve optimized for Claude Code must all be ported (at my expense!) to other tools.

As the capabilities of these models begin to plateau, the costs to shift increase. So does my willingness to pay for Claude to answer me.

Datadog: As Reliable as Your Golden Retriever

2025-11-10 08:00:00

Datadog is becoming a platform company, & its Q3 2025 results underscore how successful this transition is. If nothing else, the consistency around 25% growth for the last 12 quarters exemplifies this point.

Datadog revenue growth chart showing quarterly revenue & year-over-year growth rate

Net dollar retention underpins this growth, combined with accelerating new customer account acquisition. One of the biggest changes in the last five quarters is terrific cross-selling across an increasingly large product suite.

Datadog net dollar retention recovery from 2023 trough to 120% in Q3 2025

Platform Adoption Deepening

At the end of Q3, 84% of customers were using 2 or more products, up from 83% a year ago. 54% of customers were using 4 or more products, up from 49% a year ago. 31% of our customers were using 6 or more products, up from 26% a year ago, & 16% of our customers were using 8 or more products, up from 12% a year ago.

Datadog’s platform spans six product categories:

  • Digital Experience Monitoring: RUM/Real User Monitoring, Synthetics, Product Analytics
  • Security: Cloud SIEM, Cloud Security
  • Infrastructure Observability: APM, Log Management, Flex Logs
  • Incident Response: Incident Management, On-Call
  • AI Capabilities: Bits AI, LLM Observability
  • Cost Management: Cloud Cost Management

The steady increase in multi-product adoption demonstrates customers consolidating their observability stack onto Datadog, with the highest-tier customers (8+ products) growing 33% year-over-year as a percentage of the base.

New Customer Momentum

New logo annualized bookings more than doubled year-over-year & set a new record driven by an increase in average new logo land size, particularly in enterprise.

The portion of our year-over-year revenue growth that related to new customers was about 25% in Q3, up from 20% in Q2.

New customer acquisition is also accelerating, in concert with a move up-market into the enterprise.

AI Native Customer Expansion

We also experienced strong revenue growth for our AI native customers & a broadening contribution to growth among those customers. There, too, we saw an acceleration of growth in our AI cohort in Q3 when excluding our largest customer.

This group represented 12% of our revenue, up from 11% last quarter & about 6% in the year ago quarter.

The AI native cohort is both growing & maturing. Datadog now has 15 AI native customers spending more than $1 million annually, up from essentially zero a year ago, with over 100 spending more than $100,000.

Revenue Growth

Revenue was $886 million, an increase of 28% year-over-year & above the high end of our guidance range.

The combination of three factors has led to outperformance: a broader product suite that is effectively cross-sold, accelerating new customer momentum, & a very fast-growing AI business.

Security Suite Accelerating

Security ARR growth was in the mid-50s as a percentage year-over-year in Q3, up from the mid-40s we mentioned last quarter.

We’re starting to see success in including Cloud SIEM in larger deals, & we’ll get back to that in a bit in our customer examples. And we’re seeing positive trends beyond Cloud SIEM, including fast uptake of Code Security & an increasing number of wins in cloud security.

Security is becoming a meaningful growth driver for Datadog, accelerating from mid-40s to mid-50s percentage growth & expanding beyond Cloud SIEM into broader cloud security use cases.

Enterprise Deal Momentum

First, we landed a 7-figure annualized deal with a leading European telco, our largest ever land deal in Europe. […] They will adopt 11 Datadog products to start.

Next, we landed a 7-figure annualized deal with a Fortune 500 technology hardware company.

Both of these data points confirm a significant move up-market. A million-dollar land deal with 11 products shows that Datadog is truly selling a suite.

Datadog’s AI Products

In addition to the existing suite, Datadog is pushing heavily into AI with a broader range of AI deployment products.

  • Bits AI SRE Agent (Available in preview, announced June 2025) is an autonomous AI agent that investigates alerts & coordinates incident response 24/7, saving customers significant time on mean-time-to-resolution.

  • LLM Experiments & Playgrounds (Generally available, launched 2025) helps teams rapidly iterate on LLM applications by testing prompt changes, model swaps, & application changes against production traces.

  • Custom LLM-as-a-Judge Evaluations (Generally available) lets customers write natural language evaluation prompts to assess LLM application quality & safety across traces & spans.

  • Datadog MCP Server (Available in preview, announced 2025) bridges Datadog with AI agents like Codex, Claude, Cursor, & GitHub Copilot, providing structured access to metrics, logs, traces, & incidents directly from AI coding environments.

  • TOTO, Datadog’s open-source time series forecasting model (launched 2025), was trained on 2 trillion data points & became one of Hugging Face’s top downloads across all categories.

If SaaS companies were dog breeds, many would be temperamental. But Datadog demonstrates continued consistency across a broad range of different businesses.

Are We Being Railroaded by AI?

2025-11-06 08:00:00

Just how much are we spending on AI?

Compared to other massive infrastructure projects, AI is the sixth largest in US history, so far.

Just How Much Are We Spending on AI - Infrastructure spending as % of GDP

World War II dwarfs everything else at 37.8% of GDP. World War I consumed 12.3%. The New Deal peaked at 7.7%. Railroads during the Gilded Age reached 6.0%.

AI infrastructure today sits at 1.6%, just above the telecom bubble’s 1.2% & well below the major historical mobilizations.

Project             | Year | Spending (2025$) | % of GDP
World War II        | 1944 | $1,152B          | 37.8%
World War I         | 1918 | $138B            | 12.3%
New Deal            | 1936 | $150B            | 7.7%
Railroads (peak)    | 1870 | $18B             | 6.0%
Interstate Highways | 1964 | $142B            | 2.0%
AI Infrastructure   | 2024 | $500B            | 1.6%
Telecom Bubble      | 2000 | $226B            | 1.2%
Manhattan Project   | 1945 | $36B             | 0.9%
Apollo Program      | 1966 | $59B             | 0.7%

Companies like Microsoft, Google, & Meta are investing $140B, $92B, & $71B respectively in data centers & GPUs. OpenAI plans to spend $295B in 2030 alone.

If we assume OpenAI represents 30% of the market, total AI infrastructure spending would reach $983B annually by 2030, or 2.8% of GDP.[1]

Scenario                  | 2024  | 2030  | % of GDP
Current AI Infrastructure | $500B | -     | 1.6%
OpenAI Projected Spending | -     | $295B | 0.8%
Total Market (projected)  | -     | $983B | 2.8%

To match the railroad era’s 6% of GDP, AI spending would need to reach $2.1T per year by 2030 (6% of projected $35.4T GDP), a 320% increase from today’s $500B. That would require Google, Meta, OpenAI, & Microsoft each investing $500-700B per year, a 5-7x increase from today’s levels.
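
The arithmetic behind those figures, using only the numbers in the post & its footnote:

```python
# Reproduce the extrapolations from the post's own figures.
openai_2030_spend = 295e9      # OpenAI's projected 2030 spend
openai_share = 0.30            # assumed share of the AI infrastructure market
gdp_2030 = 35.4e12             # 2.5% annual GDP growth (footnote 1)
spend_today = 500e9            # 2024 AI infrastructure spend

total_2030 = openai_2030_spend / openai_share    # ≈ $983B
share_of_gdp = total_2030 / gdp_2030             # ≈ 2.8%
railroad_target = 0.06 * gdp_2030                # ≈ $2.1T to match the railroad era
increase = railroad_target / spend_today - 1     # ≈ 320% increase (post rounds to $2.1T)

print(f"${total_2030/1e9:,.0f}B total in 2030, {share_of_gdp:.1%} of GDP, "
      f"${railroad_target/1e12:.1f}T needed to hit 6%")
```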

And that should give you a sense of how much we were spending on railroads 150 years ago!



Methodology

All historical spending figures are adjusted to 2025 dollars using Consumer Price Index (CPI) inflation data. Each figure represents peak single-year spending in the year indicated. Percentages show spending as a share of GDP in that specific year, not as a percentage of today’s GDP.

For example, WWII’s $1,152B represents actual 1944 defense spending ($63B nominal) adjusted for inflation, which consumed 37.8% of 1944’s GDP ($175B). This differs from asking “what would 37.8% of today’s $30.5T GDP cost?” which would yield $11.5T.



  1. Assuming 2.5% annual GDP growth to $35.4T in 2030 ↩︎

A 1 in 15,787 Chance Blog Post

2025-11-05 08:00:00

I wrote a post titled Congratulations, Robot. You’ve Been Promoted! about OpenAI declaring that their AI coders were no longer junior engineers but mid-level engineers.

The post triggered the largest unsubscription rate in this blog’s history. It was a 4-sigma event.

A Nassim Taleb black swan, this was something that should happen once every 700 years of a blog author’s career.
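
For the curious, the title’s number is the two-sided tail probability of a 4-sigma move under a normal distribution:

```python
import math

# P(|Z| > 4) for a standard normal: the two-sided 4-sigma tail.
p = math.erfc(4 / math.sqrt(2))   # ≈ 6.33e-5
print(f"1 in {1 / p:,.0f}")       # ≈ 1 in 15,787
```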

Clearly, the post struck a nerve.

In a job market for recent grads that is 13% smaller than in recent years, a subtle fear persists that positive developments in AI accuracy & performance will accelerate job losses. Stanford’s research found:

“young workers (ages 22–25) in the most AI-exposed occupations, such as software developers & customer service reps, have experienced a 13% relative decline in employment since generative AI became widely used.”

The whispered question beneath all this data: are AI advances a zero-sum game for jobs? Are we in the modern era hearing the same refrain as Springsteen lamenting the impact of globalization on his hometown:

They’re closing down the textile mill across the railroad tracks. Foreman says “These jobs are going, boys, and they ain’t coming back”

Let’s go to the data.

Software Engineering Employment Time Series showing ZIRP boom (2010-2022) and contraction (2023-2024) with Fed Funds Rate overlay

Software engineering employment grew steadily from 3.2m developers in 2010 to 4.7m in 2022 during the ZIRP (Zero Interest Rate Policy) era. The 2020-2022 period was the hottest tech jobs market of all time, with demand doubling since 2020.

Then the Fed raised rates aggressively, increasing the cost of capital & triggering a 4.3% contraction that hit younger workers.

But the layoff data suggests this trend isn’t accelerating.

Tech Industry Layoffs showing peak in 2023 with 264,220 employees laid off, with 2025 annualized projection

Tech companies laid off 264,220 employees in 2023. The 2025 data (annualized from 11 months through November) projects 122,890 layoffs for the full year. There’s no acceleration yet in the data.

The data doesn’t yet show what readers are clearly feeling: a trepidation that AI advances will accelerate job losses.