2026-06-07 08:00:00
Three forces are reshaping the AI cost structure :
The natural response from AI buyers is substitution.
Coinbase6 :
At Coinbase we’re working hot on routing prompts to cheaper models where appropriate, & in some cases have been able to keep costs roughly flat, while token usage continues to grow exponentially.
Lindy7 :
Pulled the trigger today & switched 100% of Lindy traffic to DeepSeek v4, churning from Anthropic models. Saves us millions of $ & we’re actually seeing an increase in performance on many core use cases. Transformative for the business.
Harvey8 :
On a 100-task slice of our Legal Agent Benchmark (LAB), SFT moved Kimi 2.6’s all-pass rate from 11% to 15%, beating Opus’ 14%. But the cost gap was even more striking : $84 vs $954 across the same 100 tasks, or ~11x cheaper.
Cursor went further. They post-trained Kimi K2.5 into their own production model, Composer.9
Composer 2.5 is exceptionally intelligent & up to 10x more efficient than similarly capable models.
Coinbase’s quote shows where the savings go : costs flat, tokens exponential. Buyers don’t pocket the discount — they spend it on more intelligence.
Closed models are getting more expensive at the frontier; open models are getting cheaper at parity. The choice is which slope you want under your unit economics.
2026-06-05 08:00:00
A laptop on my desk now handles 78% of my AI work, with the rest sent to the cloud. The shift came out of my skill distillation work.
Here’s how it works.
I create tasks in Asana. An agent sees each task (scheduling, email triage, research, a CRM update) & classifies it as easy or hard. If it’s straightforward, a local model on my Mac handles it in seconds. If it’s complex, the same model routes it to a cloud model.
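The routing step can be sketched roughly as follows. This is a minimal illustration rather than the actual setup : the `Task` type, the keyword heuristic inside `classify`, & the `run_local` / `run_cloud` handlers are all hypothetical stand-ins for the classifier, the model on the Mac, & the cloud model.

```python
from dataclasses import dataclass

# Hypothetical hints standing in for the local model's easy/hard judgment.
HARD_HINTS = ("research", "analyze", "draft memo")


@dataclass
class Task:
    title: str


def classify(task: Task) -> str:
    """Return 'hard' if the title contains any hard hint, else 'easy'."""
    title = task.title.lower()
    return "hard" if any(hint in title for hint in HARD_HINTS) else "easy"


def run_local(task: Task) -> str:
    # Placeholder for a call to a model running on the laptop.
    return f"local:{task.title}"


def run_cloud(task: Task) -> str:
    # Placeholder for a call to a hosted frontier model.
    return f"cloud:{task.title}"


def route(task: Task) -> str:
    """Send easy tasks to the local lane and hard tasks to the cloud lane."""
    return run_local(task) if classify(task) == "easy" else run_cloud(task)


def local_share(tasks: list) -> float:
    """Fraction of tasks that the classifier keeps on the local lane."""
    if not tasks:
        return 0.0
    return sum(1 for t in tasks if classify(t) == "easy") / len(tasks)
```

In practice the classifier would be a model call rather than a keyword match, but the shape is the same : one cheap decision up front, then two lanes.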
Across the last seven days, the daily local share peaked at 88%.
As the workload grew, the two-lane design paid off. Throughput jumped about 25%, average task duration fell from 47 seconds to 19, & queue age dropped from 73 seconds to four. Nothing about the work changed. Small, fast tasks simply stopped waiting behind big, slow ones.
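A toy simulation shows why separating lanes helps small tasks. It is a sketch under simplifying assumptions : every task arrives at time zero, the durations & the ten-second cutoff are made up, & the shared lane gets two workers so that both designs have the same total capacity.

```python
import heapq


def fifo_waits(durations, workers=1):
    """Wait time of each task in a FIFO queue where all tasks arrive at t=0."""
    free_at = [0.0] * workers           # time at which each worker is next free
    heapq.heapify(free_at)
    waits = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker takes the task
        waits.append(start)             # arrival is t=0, so wait equals start time
        heapq.heappush(free_at, start + duration)
    return waits


def mean(values):
    return sum(values) / len(values)


tasks = [60, 5, 5, 60, 5, 5]  # hypothetical durations in seconds
CUTOFF = 10                   # hypothetical boundary between small and big tasks

# Design 1: one shared lane served by two workers.
shared = fifo_waits(tasks, workers=2)
shared_small = mean([w for w, d in zip(shared, tasks) if d < CUTOFF])

# Design 2: a small-task lane and a big-task lane, one worker each.
small_lane = fifo_waits([d for d in tasks if d < CUTOFF])
two_lane_small = mean(small_lane)

print(shared_small, two_lane_small)  # small tasks wait far less with their own lane
```

The big tasks take just as long in both designs. Only the small tasks’ time spent in the queue changes.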
The task factory that uses distilled skills is now humming along with 25% more throughput, queue age down 94%, & a much more responsive system. For now, the cloud handles the hard fifth. The Mac handles the rest.
It’s the minimill of agentic work. Nucor’s minimills1 started small, capital-light, & close to demand; within a generation they outflanked the integrated steel giants.
Every laptop, phone, & edge device with enough memory to host a distilled model becomes its own minimill : routing locally, paying cloud rates only for the hard fifth. Tens of millions of these will proliferate inside companies in the next few years, each one quietly absorbing much of the work that today shows up on a hyperscaler invoice.
Nucor began in the 1960s by melting scrap steel in electric-arc furnaces rather than smelting iron ore in giant integrated blast-furnace mills. Each minimill was a fraction of the size & cost of an integrated plant, sited near regional demand, & ran on flexible, lower-cost labor. The integrated mills dismissed minimills as fit only for low-grade products like rebar. Over the next thirty years Nucor moved up-market into sheet steel & structural beams, & by 2014 had become the largest steel producer in the United States, while most of the integrated giants (Bethlehem, LTV, National) had gone bankrupt. Clayton Christensen used the story as the canonical example of disruptive innovation in The Innovator’s Dilemma. ↩︎
2026-06-03 08:00:00
Yesterday Microsoft added a new metric to a model release card, one that will likely become a standard.1
Average token usage.
In the first row, the Microsoft model hits 71.6 on SWE-Bench Verified using about a third of the tokens Claude Haiku 4.5 burns.
Benchmarks are now measured on two dimensions : overall performance & the cost to achieve it.
This is yet another sign that the era of subsidies2, tokenmaxxing3, & all-out performance for many use cases is over.
Even the most valuable companies in the world cannot afford state-of-the-art intelligence for every conceivable use case.4 Uber capped employee AI spending after blowing through its budget in four months.5 Salesforce is spending $300M on Anthropic tokens & has frozen engineering hires.6
This new dual benchmark answers the buyer’s only question : what is my intelligence per dollar?
Artificial Analysis already benchmarks this.7 GPT 5.5 & Claude Opus 4.8 land within a point of each other on the Intelligence Index, around 60. Running the index costs $3,357 on GPT 5.5 & $4,685 on Opus 4.8. Same answer, 40% more expensive.
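The arithmetic behind that comparison is simple enough to sketch. The run costs are the figures quoted above; the score of 60 for both models is an approximation (the text says “around 60”), & the `points_per_thousand_dollars` helper is a hypothetical name for illustration.

```python
INDEX_SCORE = 60  # approximate: both models land "around 60" on the index

run_cost = {
    "GPT 5.5": 3357,          # dollars to run the full index
    "Claude Opus 4.8": 4685,
}


def points_per_thousand_dollars(score: float, cost: float) -> float:
    """Index points bought per $1,000 of benchmark spend."""
    return score / cost * 1000


# Relative price premium of the more expensive model for the same score.
premium = run_cost["Claude Opus 4.8"] / run_cost["GPT 5.5"] - 1

for model, cost in run_cost.items():
    print(model, round(points_per_thousand_dollars(INDEX_SCORE, cost), 1))
print(f"premium: {premium:.0%}")  # about 40%
```

With equal scores, intelligence per dollar reduces to a ratio of costs, which is where the 40% figure comes from.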
Model companies must now compete on both dimensions. The application layer will compete one level up, on dollars per outcome : what a closed ticket, a shipped PR, or a resolved support case actually costs.
Every layer in the stack now has to price the same way the customer thinks : per result, not per token.
Introducing MAI-Code-1-Flash — Microsoft announces a new coding model with average token usage on the release card. ↩︎
The Unsustainable Subsidy — The era of AI subsidies is ending. ↩︎
Tokenmaxxing — Models that game benchmarks with extra tokens are losing their edge. ↩︎
Microsoft cancels Claude Code licenses, shifting developers to GitHub Copilot CLI — Microsoft cancelled Claude Code licenses across its Experiences and Devices division (Windows, Microsoft 365, Outlook, Teams, Surface) after engineering usage outran budgets. ↩︎
Uber caps employee AI spending after blowing through budget in 4 months. ↩︎
Salesforce Spends $300M on AI, Freezes Engineering Hires. ↩︎
AI Model & API Providers Analysis — Independent analysis of AI model costs. ↩︎
2026-06-02 08:00:00
Competition is a discovery procedure. — Friedrich Hayek
And developers are discovering the value of open models.
OpenRouter offers a useful view into the model market.1 It is not the whole AI economy. But it is close to the API frontier, where developers can switch models quickly, compare price-performance daily, & route each request to the best available option.
Since 2025, open models have grown sharply on OpenRouter. In the latest model-level snapshot, open-weight models generated 69.1% of named open-versus-closed token volume. Closed models produced 30.9%.
New models attract developer attention & large-scale testing, after which token use surges. Each cluster of new releases sustains a new plateau of token volume.
Just as in the closed-model ecosystem, the competition among open models means rapid innovation & leaderboard changes.
DeepSeek’s early lead gave way to MiniMax & Kimi models in late 2025 & early 2026. Later, launches from MiMo, Qwen (Alibaba’s open-weight model family), Hy3 (Tencent’s open-weight model release), & DeepSeek reshuffled share again.
Arcee, a US lab, has made a strong appearance recently.
Open models still represent a fraction of overall inference, but the thriving competition, increasing usage, & surge of experimentation suggest developers are increasingly willing to route production traffic to them.
Source data: OpenRouter rankings & usage data, analyzed from weekly token-volume snapshots in the OpenRouter analysis dataset. ↩︎
2026-06-01 08:00:00
With Michael Burry 1 & Leopold Aschenbrenner 2 placing heavy short trades on AI, questions about GPU depreciation, & the Saaspocalypse, how negative is the financial market on AI?
We can look at the percentage of shares sold short, a bet that the stock will decline.
Across software, semiconductor, neocloud, data center, & hyperscaler stocks, the median short interest (short shares / total shares) has increased by about 24% in the last quarter.
One segment stands out for gloomy skies in the cloud: the GPU data center businesses, whose shorted shares have grown 60% in the last year 3. AI cloud and neocloud companies have the highest current median short interest at 16.8% of float.
The negative sentiment for SaaS & Dev Tools is a more abrupt & recent phenomenon. Developer tools and infrastructure software follow at 9.5%. Enterprise SaaS and AI apps sit at 8.9%.
Hyperscalers are at the other end of the spectrum. Their median short interest is 1.1%. NVIDIA, the defining AI infrastructure stock, is also lightly shorted: 1.2%.
Semiconductor stocks saw a decrease in short-selling. With memory makers like Micron up 742% this year 4, & many ecosystem CEOs pointing to memory & storage as the limiting factor, the newest trillion-dollar companies are all memory.
The stocks with the most actively bearish bettors? Most of these are small or mid-cap companies. The updated chart below adds market capitalization to each company label. The largest AI winners are mostly absent.
SoundHound AI is 36.3% short. C3.ai is 32.2%. BigBear.ai is 29.4%. Applied Digital is 28.0%. UiPath is 22.0%. TeraWulf is 21.3%.
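For readers who want to reproduce this kind of ranking, here is a minimal sketch. The six percentages are the ones quoted above. The `short_interest` helper is a hypothetical name, & it assumes float as the denominator; using total shares outstanding instead would give slightly different numbers.

```python
from statistics import median


def short_interest(shares_short: int, shares_float: int) -> float:
    """Short interest as a percentage of float (assumed denominator)."""
    return 100 * shares_short / shares_float


# Short interest percentages quoted in the list above.
most_shorted = {
    "SoundHound AI": 36.3,
    "C3.ai": 32.2,
    "BigBear.ai": 29.4,
    "Applied Digital": 28.0,
    "UiPath": 22.0,
    "TeraWulf": 21.3,
}

# Median of the group and the names ordered from most to least shorted.
group_median = median(most_shorted.values())
ranked = sorted(most_shorted, key=most_shorted.get, reverse=True)

print(ranked[0], round(group_median, 1))
```

The median is used rather than the mean so that one heavily shorted name does not dominate a segment’s figure.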
This is the market’s current AI skepticism map.
The skepticism is concentrated in companies whose AI exposure still depends on future capital access, future demand, or future operating leverage.
That distinction matters. If short interest were rising uniformly across AI semiconductors, hyperscalers, and software, the message would be broad fatigue with the AI trade. Instead, the data suggest a more specific view: memory has become critical & in short supply; software & devtools businesses need to prove their worth post-AI; & businesses reselling GPUs have more than their fair share of doubters about current prices versus long-term value.
2026-05-29 08:00:00
I’ve been using state-of-the-art models to teach small models running on my computer how I work.
My personal agent, based on Pi, runs my inbox, my deal pipeline, my blog publishing, my calendar, & my research. It looks less like a chatbot & more like a small operating system.
The first layer is QMD, a local markdown knowledge base of about eighty workflow files in ~/memories. Before answering any procedural question, the agent searches QMD for the right playbook.
The second layer is Skills, atomic SKILL.md files that describe one job each. The skills are written by a frontier model. So are the evaluations that grade them. The same system writes, tests, and rewrites each skill until accuracy converges. It also checks recall against QMD, so the right keywords always surface the right skill.
The third layer is the Agent Loop, a model running Plan → Tool Call → Observe → Refine, calling out to seventeen Rust APIs & a handful of MCP integrations.
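The loop itself can be sketched in a few lines. This is an illustrative skeleton, not the agent’s real code : the `agent_loop` function, the `toy_plan` planner, & the two lambda tools are hypothetical, & a real planner would be a model call rather than a hard-coded function.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[str], str]
Step = Optional[Tuple[str, str]]  # (tool name, argument), or None when done


def agent_loop(
    goal: str,
    plan: Callable[[str, List[str]], Step],
    tools: Dict[str, Tool],
    max_steps: int = 5,
) -> List[str]:
    """Run plan -> tool call -> observe -> refine until the planner returns None."""
    observations: List[str] = []
    for _ in range(max_steps):
        step = plan(goal, observations)  # plan, or refine using what was observed
        if step is None:
            break
        tool_name, argument = step
        result = tools[tool_name](argument)  # tool call
        observations.append(result)          # observe
    return observations


def toy_plan(goal: str, observations: List[str]) -> Step:
    """Hypothetical planner: look up a slot, then email it, then stop."""
    if not observations:
        return ("search_calendar", goal)
    if len(observations) == 1:
        return ("send_email", observations[0])
    return None


toy_tools: Dict[str, Tool] = {
    "search_calendar": lambda query: f"free slot found for '{query}'",
    "send_email": lambda body: f"sent: {body}",
}

history = agent_loop("coffee with Ana", toy_plan, toy_tools)
```

The `max_steps` cap is a design choice worth keeping in any version : it bounds cost if the planner never decides it is done.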
One of the techniques I’ve started to use is skill distillation. A frontier model (Opus 4.7, GPT-5.1, or Gemini 3 Pro) authors & refines the skill files. A smaller model (Qwen 35B or Gemma 26B, running locally) executes them. The teacher transfers procedural knowledge to the student through markdown. The skill is inspectable, versionable, & hot-swappable.
This is fundamentally different from classical knowledge distillation, which compresses a big model’s soft probability outputs into a smaller model’s weights. It’s different from instruction tuning, which bakes behavior into weights through prompt-response pairs. It’s different from RAG, which retrieves facts.
Skill distillation retrieves procedures. The smaller model doesn’t have to know how to evaluate a company. It just has to know how to follow the steps.
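A rough sketch of what retrieving a procedure might look like. Everything here is an assumption for illustration : the two skill texts, the `keywords:` line format, & the `find_skill` / `build_prompt` helpers are hypothetical, & keyword overlap stands in for whatever search the real system uses.

```python
import re
from typing import Optional

# Hypothetical skill files, as a frontier model might have written them.
SKILLS = {
    "schedule-meeting": (
        "# Schedule a meeting\n"
        "keywords: calendar, meeting, schedule\n"
        "1. Check availability.\n2. Propose two slots.\n3. Send the invite."
    ),
    "crm-update": (
        "# Update the CRM\n"
        "keywords: crm, deal, pipeline\n"
        "1. Find the deal record.\n2. Update the stage.\n3. Log a note."
    ),
}


def keywords(skill_md: str) -> set:
    """Parse the 'keywords:' line of a skill file into a set of lowercase terms."""
    match = re.search(r"^keywords:\s*(.+)$", skill_md, flags=re.MULTILINE)
    if not match:
        return set()
    return {k.strip().lower() for k in match.group(1).split(",")}


def find_skill(request: str) -> Optional[str]:
    """Return the skill whose keywords overlap the request most, or None."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    best = max(SKILLS, key=lambda name: len(keywords(SKILLS[name]) & words))
    return best if keywords(SKILLS[best]) & words else None


def build_prompt(request: str) -> str:
    """Prepend the matching procedure so a small model only has to follow steps."""
    name = find_skill(request)
    if name is None:
        return request
    return f"Follow this procedure step by step.\n\n{SKILLS[name]}\n\nRequest: {request}"
```

Because the skill is plain markdown, swapping the student model means changing which model receives the prompt, not retraining anything.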
Every night a system runs through historical logs to understand what new skills should be generated, mirroring the loop that Pete Koomen described at Y Combinator earlier this week.
The frontier model becomes a teacher. The library becomes the company’s institutional knowledge. The student becomes whichever model happens to be cheapest this quarter.