
Tomasz Tunguz

I’ve been a venture capitalist since 2008. Before that, I was a PM on the Ads team at Google and worked at Appian.

RSS preview of the blog of Tomasz Tunguz

Intelligence Per Dollar

2026-06-03 08:00:00

Screenshot 2026-06-02 at 9.22.43 PM

Yesterday Microsoft added a new metric to a model release card, one that will likely become a standard.[1]

Average token usage.

In the first row, the Microsoft model hits 71.6 on SWE-Bench Verified using about a third of the tokens Claude Haiku 4.5 burns.

Benchmarks are now measured on two dimensions : overall performance & the cost to achieve that intelligence.

This is yet another sign that, for many use cases, the era of subsidies[2], tokenmaxxing[3], & all-out performance is over.

Even the most valuable companies in the world cannot afford state-of-the-art intelligence for every conceivable use case.[4] Uber capped employee AI spending after blowing through its budget in four months.[5] Salesforce is spending $300M on Anthropic tokens & has frozen engineering hires.[6]

This new dual benchmark answers the buyer’s only question : what is my intelligence per dollar?

Screenshot 2026-06-03 at 5.49.00 AM

Artificial Analysis already benchmarks this.[7] GPT 5.5 & Claude Opus 4.8 land within a point of each other on the Intelligence Index, around 60. Running the index costs $3,357 on GPT 5.5 & $4,685 on Opus 4.8. Same answer, 40% more expensive.
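
The arithmetic behind that comparison can be sketched in a few lines. The costs are the ones quoted above; the score of 60 for both models is an approximation of "around 60", & the names `BenchmarkRun` & `points_per_thousand_dollars` are illustrative, not anything Artificial Analysis publishes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    model: str
    index_score: float   # Intelligence Index score (approximated as 60 here)
    run_cost_usd: float  # cost to run the full index, from the post


def points_per_thousand_dollars(run: BenchmarkRun) -> float:
    """Index points bought per $1,000 of benchmark spend."""
    return run.index_score / run.run_cost_usd * 1000


gpt = BenchmarkRun("GPT 5.5", 60.0, 3357.0)
opus = BenchmarkRun("Claude Opus 4.8", 60.0, 4685.0)

# Relative cost premium of the more expensive run for the same score.
premium = opus.run_cost_usd / gpt.run_cost_usd - 1

for run in (gpt, opus):
    print(f"{run.model}: {points_per_thousand_dollars(run):.1f} points per $1,000")
print(f"Opus 4.8 cost premium: {premium:.0%}")
```

With equal scores, the ranking on intelligence per dollar reduces to the ranking on cost, which is why the premium alone tells the story here.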

Model companies must now compete on both dimensions. The application layer will compete one level up, on dollars per outcome : what a closed ticket, a shipped PR, or a resolved support case actually costs.

Every layer in the stack now has to price the same way the customer thinks : per result, not per token.



  1. Introducing MAI-Code-1-Flash : Microsoft announces a new coding model with average token usage on the release card.

  2. The Unsustainable Subsidy : The era of AI subsidies is ending.

  3. Tokenmaxxing : Models that game benchmarks with extra tokens are losing their edge.

  4. Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI : Microsoft cancelled Claude Code licenses across its Experiences and Devices division (Windows, Microsoft 365, Outlook, Teams, Surface) after engineering usage outran budgets.

  5. Uber caps employee AI spending after blowing through budget in 4 months.

  6. Salesforce Spends $300M on AI, Freezes Engineering Hires.

  7. AI Model & API Providers Analysis : Independent analysis of AI model costs.

The Thriving Ecosystem of Open Models

2026-06-02 08:00:00

Competition is a discovery procedure. — Friedrich Hayek

And developers are discovering the value of open models.

OpenRouter offers a useful view into the model market.[1] It is not the whole AI economy. But it is close to the API frontier, where developers can switch models quickly, compare price-performance daily, & route each request to the best available option.

Stacked chart of open versus closed model token share on OpenRouter

Since 2025, open models have grown sharply on OpenRouter. In the latest model-level snapshot, open-weight models generated 69.1% of the token volume attributed to named open or closed models. Closed models produced 30.9%.

Open-model demand jumps with launches

New models attract developer attention & large-scale testing, after which token use surges. Each cluster of releases lifts token volume to a new, sustained plateau.

Open-model leadership keeps changing hands

Just as in the closed-model ecosystem, the competition among open models means rapid innovation & leaderboard changes.

DeepSeek’s early lead gave way to MiniMax & Kimi models in late 2025 & early 2026. Later, launches from MiMo, Qwen (Alibaba’s open-weight model family), Hy3 (Tencent’s open-weight model release), & DeepSeek reshuffled share again.

Arcee, a US lab, has recently made a strong appearance.

Open models still represent a fraction of overall inference, but the thriving competition, increasing usage, & surge of experimentation suggest developers are increasingly willing to route production traffic to them.


  1. Source data: OpenRouter rankings & usage data, analyzed from weekly token-volume snapshots in the OpenRouter analysis dataset.

The AI Skepticism Map

2026-06-01 08:00:00

With Michael Burry[1] & Leopold Aschenbrenner[2] placing heavy short trades on AI, & with questions mounting about GPU depreciation & the Saaspocalypse, how negative is the financial market on AI?

One gauge is the percentage of a company’s shares sold short. Each short sale is a bet the stock will decline.

AI shorts have edged higher

Across software, semiconductor, neocloud, data center, & hyperscaler stocks, the median short interest (shares sold short / total shares) has increased by about 24% in the last quarter.

AI cloud and neoclouds have the gloomiest sky

One segment stands out for gloomy skies in the cloud: the GPU data center businesses, whose shorted shares have grown 60% in the last year.[3] AI cloud and neocloud companies have the highest current median short interest at 16.8% of float.

The negative sentiment for SaaS & Dev Tools is a more abrupt & recent phenomenon. Developer tools and infrastructure software follow at 9.5%. Enterprise SaaS and AI apps sit at 8.9%.

Hyperscalers are at the other end of the spectrum. Their median short interest is 1.1%. NVIDIA, the defining AI infrastructure stock, is also lightly shorted: 1.2%.

Enterprise AI apps saw the sharpest rise

Semiconductor stocks saw a decrease in short-selling. With memory makers like Micron up 742% this year[4], & many ecosystem CEOs pointing to memory & storage as the limiting factor, the newest trillion-dollar companies are all memory.

The stocks with the most actively bearish bettors? Most of these are small or mid-cap companies. The updated chart below adds market capitalization to each company label. The largest AI winners are mostly absent.

SoundHound AI is 36.3% short. C3.ai is 32.2%. BigBear.ai is 29.4%. Applied Digital is 28.0%. UiPath is 22.0%. TeraWulf is 21.3%.
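
A small sketch of the underlying calculation, using only the six figures above & the hyperscaler median quoted earlier. The helper `short_interest_pct` applies the definition from the post (shares sold short / total shares); the function & variable names are illustrative.

```python
from statistics import median
from typing import Dict


def short_interest_pct(shares_short: float, total_shares: float) -> float:
    """Short interest as a percentage: shares sold short / total shares."""
    if total_shares <= 0:
        raise ValueError("total_shares must be positive")
    return shares_short * 100 / total_shares


# Short interest (%) for the six names quoted in the post.
MOST_SHORTED: Dict[str, float] = {
    "SoundHound AI": 36.3,
    "C3.ai": 32.2,
    "BigBear.ai": 29.4,
    "Applied Digital": 28.0,
    "UiPath": 22.0,
    "TeraWulf": 21.3,
}
HYPERSCALER_MEDIAN = 1.1  # median short interest for hyperscalers, from the post

ranked = sorted(MOST_SHORTED.items(), key=lambda item: item[1], reverse=True)
group_median = median(MOST_SHORTED.values())

for name, pct in ranked:
    print(f"{name:<16}{pct:>5.1f}%")
print(f"Median of the six: {group_median:.1f}%")
print(f"Multiple of the hyperscaler median: {group_median / HYPERSCALER_MEDIAN:.0f}x")
```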

Small AI names dominate the short book

This is the market’s current AI skepticism map.

The skepticism is concentrated in companies whose AI exposure still depends on future capital access, future demand, or future operating leverage.

That distinction matters. If short interest were rising uniformly across AI semiconductors, hyperscalers, and software, the message would be broad fatigue with the AI trade. Instead, the data suggest a more specific view: memory has become critical & in short supply; software & devtools businesses need to prove their worth post-AI; & businesses reselling GPUs have more than their fair share of doubters about current prices versus long-term value.

Skill Distillation

2026-05-29 08:00:00

I’ve been using state-of-the-art models to teach small models running on my computer how I work.

My personal agent, based on Pi, runs my inbox, my deal pipeline, my blog publishing, my calendar, & my research. It looks less like a chatbot & more like a small operating system.

The Pi Agent architecture : QMD procedural memory, SKILL.md playbooks, & the agent loop with tools & MCP

The first layer is QMD, a local markdown knowledge base of about eighty workflow files in ~/memories. Before answering any procedural question, the agent searches QMD for the right playbook.

The second layer is Skills, atomic SKILL.md files that describe one job each. The skills are written by a frontier model. So are the evaluations that grade them. The same system writes, tests, and rewrites each skill until accuracy converges. It also checks recall against QMD, so the right keywords always surface the right skill.

The third layer is the Agent Loop, a model running Plan → Tool Call → Observe → Refine, calling out to seventeen Rust APIs & a handful of MCP integrations.
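
The shape of that loop can be sketched without a model. In this minimal version, a hard-coded `plan` function stands in for the model, & the two tools are toy Python functions rather than Rust APIs or MCP integrations. Every name here is hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]


def word_count(text: str) -> str:
    return str(len(text.split()))


def shout(text: str) -> str:
    return text.upper()


TOOLS: Dict[str, Tool] = {"word_count": word_count, "shout": shout}


def plan(goal: str, observations: List[str]) -> Optional[Tuple[str, str]]:
    """Hypothetical planner standing in for the model.

    Returns the next (tool name, argument), or None when the work is done.
    """
    if not observations:
        return ("word_count", goal)
    if len(observations) == 1:
        return ("shout", goal)
    return None


def run_agent(goal: str, max_steps: int = 5) -> List[str]:
    observations: List[str] = []
    for _ in range(max_steps):                 # stop condition: step budget
        step = plan(goal, observations)        # Plan (refined by prior observations)
        if step is None:                       # stop condition: planner is done
            break
        tool_name, argument = step
        result = TOOLS[tool_name](argument)    # Tool call
        observations.append(f"{tool_name} -> {result}")  # Observe
    return observations


print(run_agent("ship the weekly report"))
```

The planner sees every prior observation on each pass, which is where the Refine step lives : the next call depends on what the last one returned.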

Skill distillation : a frontier model authors SKILL.md files that smaller local models execute

One of the techniques I’ve started to use is skill distillation. A frontier model (Opus 4.7, GPT-5.1, or Gemini 3 Pro) authors & refines the skill files. A smaller model (Qwen 35B or Gemma 26B, running locally) executes them. The teacher transfers procedural knowledge to the student through markdown. The skill is inspectable, versionable, & hot-swappable.
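
The write, test, & rewrite cycle might look something like the sketch below. It assumes two stand-ins : `rewrite` plays the frontier model & `evaluate` plays the grading evals. Both are toy functions invented for illustration, as is the `REQUIRED_STEPS` rubric.

```python
from typing import Callable, List, Tuple

# Hypothetical rubric: the steps a good skill file should contain.
REQUIRED_STEPS = ["check the calendar", "draft a reply", "log the outcome"]


def evaluate(skill: str) -> float:
    """Toy eval: fraction of required steps present in the skill text."""
    return sum(step in skill for step in REQUIRED_STEPS) / len(REQUIRED_STEPS)


def rewrite(skill: str) -> str:
    """Toy teacher: append the first missing step, if any."""
    for step in REQUIRED_STEPS:
        if step not in skill:
            return skill + f"\n- {step}"
    return skill


def refine_skill(
    draft: str,
    rewrite_fn: Callable[[str], str],
    evaluate_fn: Callable[[str], float],
    tolerance: float = 0.01,
    max_rounds: int = 10,
) -> Tuple[str, List[float]]:
    """Rewrite a skill until accuracy stops improving by more than tolerance."""
    best, best_score = draft, evaluate_fn(draft)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = rewrite_fn(best)
        score = evaluate_fn(candidate)
        history.append(score)
        improvement = score - best_score
        if improvement > 0:
            best, best_score = candidate, score  # keep any real gain
        if improvement <= tolerance:
            break                                # converged
    return best, history


draft = "# Skill: schedule a meeting\n- check the calendar"
final_skill, history = refine_skill(draft, rewrite, evaluate)
print(final_skill)
print([round(score, 2) for score in history])
```

Because the output is a markdown file rather than a set of weights, each round of the loop leaves behind a diff a person can read.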

This is fundamentally different from classical knowledge distillation, which compresses a big model’s soft probability outputs into a smaller model’s weights. It’s different from instruction tuning, which bakes behavior into weights through prompt-response pairs. It’s different from RAG, which retrieves facts.

Skill distillation retrieves procedures. The smaller model doesn’t have to know how to evaluate a company. It just has to know how to follow the steps.
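
Retrieving a procedure & then following it can be sketched as below. This is a plain keyword-overlap search over an in-memory library, not QMD’s actual search, & the skill files, their layout, & the `execute_step` callback are all assumptions made for illustration.

```python
import re
from typing import Callable, Dict, List

# Hypothetical skill library: name -> markdown playbook.
SKILL_LIBRARY: Dict[str, str] = {
    "publish_post": (
        "# Publish a blog post\n"
        "keywords: blog publish post\n"
        "1. Proofread the draft\n"
        "2. Generate the chart\n"
        "3. Push to the site\n"
    ),
    "triage_inbox": (
        "# Triage the inbox\n"
        "keywords: inbox email triage\n"
        "1. Archive newsletters\n"
        "2. Flag founder emails\n"
        "3. Draft replies\n"
    ),
}


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))


def find_skill(query: str, library: Dict[str, str]) -> str:
    """Return the name of the skill sharing the most words with the query."""
    query_words = tokenize(query)
    return max(library, key=lambda name: len(query_words & tokenize(library[name])))


def parse_steps(skill_text: str) -> List[str]:
    """Extract the numbered steps from a skill file."""
    return re.findall(r"^\d+\.\s+(.*)$", skill_text, flags=re.MULTILINE)


def follow(skill_text: str, execute_step: Callable[[str], str]) -> List[str]:
    """The student's job: run each step in order, nothing more."""
    return [execute_step(step) for step in parse_steps(skill_text)]


name = find_skill("triage my email inbox", SKILL_LIBRARY)
results = follow(SKILL_LIBRARY[name], lambda step: f"done: {step}")
print(name, results)
```

The student never decides what the procedure is. It only decides how to carry out the step in front of it, which is a far smaller problem.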

Every night a system runs through historical logs to understand what new skills should be generated, mirroring the loop that Pete Koomen described at Y Combinator earlier this week.

The frontier model becomes a teacher. The library becomes the company’s institutional knowledge. The student becomes whichever model happens to be cheapest this quarter.

Security in the Age of AI Agents: Office Hours with Jonathan Jaffe

2026-05-28 08:00:00

When security practitioners become engineers, the mission changes from managing people to architecting the automated policies that govern an agentic world.

Jonathan Jaffe, CISO at Lemonade, joined me on Office Hours to discuss what this means for how we build, secure, & operate AI systems when both sides are automated.

 

AI is just as powerful for defenders as it is for attackers. The fear narrative underestimates this fact. Defenders harden everywhere, simultaneously, because every vendor in the stack is also racing to ship.

“There are tens of thousands of attack targets out there. The chances that you’re going to be one of those is small. At the same time, all of the vendors that you use will also have access to this to improve their services.”

The window of exploitability is narrowing. Yes, AI will write more vulnerable code. But AI-written code also gets reviewed, pen-tested, & patched faster than any human pipeline. Plus, the total number of bugs within a particular piece of software is finite. As the velocity of finding & fixing bugs increases, software will become far more resilient.

Security teams are becoming engineering teams. At Lemonade, every security person is an engineer. They built their own AI platform with agents on top of it. One agent reads threat intel. Another checks whether the vulnerable method is actually called in production code.
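
The second kind of check, whether a vulnerable method is actually called, can be approximated with static analysis. The sketch below is not Lemonade’s implementation. It uses Python’s standard `ast` module to find call sites by name, & the sample code & the advisory naming `load` are hypothetical.

```python
import ast
from typing import List


def find_calls(source: str, vulnerable_name: str) -> List[int]:
    """Return line numbers where a function or method with this name is called."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                name = func.id                       # plain call: load(...)
            else:
                name = getattr(func, "attr", None)   # method call: yaml.load(...)
            if name == vulnerable_name:
                hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''import yaml

def load_config(text):
    return yaml.load(text)

def load_safely(text):
    return yaml.safe_load(text)
'''

# Hypothetical advisory: a method named "load" is flagged as vulnerable.
print(find_calls(SAMPLE, "load"))
```

Matching on the name alone will produce false positives when unrelated objects share a method name, so a production check would also need to resolve which module or class the call belongs to.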

“Automation is the only way you can deal with the scale of what’s coming at us now.”

Every agent needs an identity. On a single endpoint, we could be running 200 or 10,000 agents, but each one of them needs to be numbered and then governed by policy at the point of action.

“Every agent needs to have an identity, and more than that, you need a way to control policy for all of these agents in a much more complex way than current identity and access management systems do.”
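
A minimal sketch of the idea, assuming a simple role-based policy table : each agent gets a unique identifier, & every action is checked against policy at the moment it is attempted. The roles, action names, & functions here are hypothetical & far simpler than what a real identity system would need.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str  # unique per agent, so actions can be attributed
    role: str


# Hypothetical policy table: role -> actions that role may take.
POLICIES: Dict[str, FrozenSet[str]] = {
    "threat-intel-reader": frozenset({"read_feed"}),
    "patcher": frozenset({"read_repo", "open_pull_request"}),
}


def new_agent(role: str) -> AgentIdentity:
    return AgentIdentity(agent_id=str(uuid.uuid4()), role=role)


def authorize(agent: AgentIdentity, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions are refused."""
    return action in POLICIES.get(agent.role, frozenset())


def act(agent: AgentIdentity, action: str) -> str:
    # The policy check happens at the point of action, not at agent creation.
    if not authorize(agent, action):
        raise PermissionError(f"{agent.agent_id} ({agent.role}) may not {action}")
    return f"{agent.agent_id} performed {action}"


reader = new_agent("threat-intel-reader")
print(act(reader, "read_feed"))
print(authorize(reader, "open_pull_request"))
```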

Security engineering is transforming rapidly in the agentic era, and we should expect to see significantly hardened systems as a result. It’s a bright future for security and security professionals.

I’m grateful to Jonathan for sharing his insights at Office Hours!

Software After AI

2026-05-27 08:00:00

The end of the software era is the beginning of the harness era.

SaaS was managed databases with fixed workflows. AI outmoded it with intelligence. Like a mustang, AI is powerful but wild. Harnessing the power means domestication.

The seven components of an AI agent harness arranged radially around the LLM at the center : context & memory, tools & action, orchestration & loop, state & persistence, sandbox & compute, observability & governance, & cost & workflow optimization

There are seven parts to this domestication :

  1. Context & memory : General models need bespoke retrieval. The system that fetches the right context for a radiologist is not the system that fetches it for a paralegal.

    Sometimes it’s a lot of short-term memory. What was the agent working on 45 seconds ago? Other times it’s large-scale image retrieval, say for radiology or for video generation. Other times it’s a keyword search across a billion documents. Those systems will be bespoke to each individual use case to drive the best accuracy.

    Sitting alongside retrieval is the context database, the recipe book of how each business actually runs. The standard operating procedures we all carry in our heads & bring to work every day are those recipes. Capturing them initially & evolving them as both people & process change is the essence of the context database.

  2. Tools & action : Tools are how the agent affects the outside world. The recipes in the context database describe what to do. Tools are the ingredients & utensils that actually do it.

    A modern harness exposes tools through a registry, validates the arguments the model passes, dispatches the call, gates sensitive actions behind approvals, & parses the result back into the agent’s loop. MCP has emerged as the connective tissue. The quality of a harness depends on how many tools it can safely expose & how cleanly it handles their failures.

  3. Orchestration & loop : The agentic loop is think, act, observe, repeat. Planning, decomposition, sub-agents, retries, & stop conditions define how the work gets done.

    We also expect our software to improve as we use it. Closed loop patterns that learn from each run will separate different vendors.

  4. State & persistence : In a large-scale enterprise with lots of different people working on a system, the system needs to be resilient. When a harness crashes after completing step 7 of a 10-step task, it should resume at step 8, not restart from zero. File systems, checkpoints, session threads, & artifact storage are the mechanisms that prevent lost work.

  5. Sandbox & compute : Each agent needs a sandbox in which to play. Isolated Unix workspaces, controlled network egress, & credentials that live outside the model are what make sandboxes secure, confidential, & fast at scale.

  6. Observability & governance : You cannot trust what you cannot see. Tracing every step, logging every tool call, running evals as regression tests, & putting humans in the loop for the highest stakes decisions are how a demo becomes a production system. Guardrails enforce policy. Evals catch regressions before customers do.

  7. Cost & workflow optimization : The seventh discipline is architectural judgment. What should be deterministic versus non-deterministic? Which model is the right one for each step : state of the art, medium, small, or fine-tuned? What knowledge belongs in skills versus in memory?
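
Two of these parts, tools & action and state & persistence, are concrete enough to sketch. The code below is a toy harness, not a reference design : `ToolRegistry` validates arguments & gates sensitive tools behind an approval callback, & `run_with_checkpoint` records each completed step so a rerun resumes where it left off. All names are hypothetical, & MCP & result parsing are left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


class ToolRegistry:
    """Validates arguments and gates sensitive tools behind an approval callback."""

    def __init__(self, approve: Callable[[str, Dict[str, Any]], bool]) -> None:
        self._approve = approve
        self._tools: Dict[str, Tuple[Callable[..., Any], Dict[str, type], bool]] = {}

    def register(self, name: str, func: Callable[..., Any],
                 arg_types: Dict[str, type], sensitive: bool = False) -> None:
        self._tools[name] = (func, arg_types, sensitive)

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        func, arg_types, sensitive = self._tools[name]
        if set(kwargs) != set(arg_types):
            raise TypeError(f"{name} expects arguments {sorted(arg_types)}")
        for key, expected in arg_types.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{name}.{key} must be {expected.__name__}")
        if sensitive and not self._approve(name, kwargs):
            raise PermissionError(f"{name} was not approved")
        return func(**kwargs)


def run_with_checkpoint(steps: List[Callable[[], None]], checkpoint: Path) -> None:
    """Run steps in order, persisting progress so a rerun skips completed steps."""
    done = json.loads(checkpoint.read_text())["done"] if checkpoint.exists() else 0
    for index in range(done, len(steps)):
        steps[index]()
        checkpoint.write_text(json.dumps({"done": index + 1}))


# One ordinary tool, one sensitive tool, and an approver that denies everything.
registry = ToolRegistry(approve=lambda name, args: False)
registry.register("add", lambda a, b: a + b, {"a": int, "b": int})
registry.register("wire_money", lambda amount: f"sent {amount}",
                  {"amount": int}, sensitive=True)
print(registry.call("add", a=2, b=3))

# Simulate a crash once step 7 has completed: step 8 fails on its first attempt.
log: List[int] = []
crashed = {"yet": False}


def make_step(number: int) -> Callable[[], None]:
    def step() -> None:
        if number == 8 and not crashed["yet"]:
            crashed["yet"] = True
            raise RuntimeError("harness crashed")
        log.append(number)
    return step


steps = [make_step(n) for n in range(1, 11)]
checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
try:
    run_with_checkpoint(steps, checkpoint)
except RuntimeError:
    pass
run_with_checkpoint(steps, checkpoint)  # second run picks up at step 8
print(log)
```

Writing the checkpoint only after a step succeeds means a crash mid-step reruns that step, so steps in a real harness would need to be idempotent or transactional.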

The result is a new competitive dynamic in software.

This won’t work in every category. The markets the major labs prioritize will benefit from their ability to move quickly & their direct control of the models. But that leaves thousands of separate markets up for grabs for startups.

What happens when every company has access to the same model? The best riders win.