2026-02-18 09:45:41
I have written eight of these guides since ChatGPT came out, but this version represents a major break with the past, because what it means to “use AI” has changed dramatically. Until a few months ago, for the vast majority of people, “using AI” meant talking to a chatbot in a back-and-forth conversation. But over the past few months, it has become practical to use AI as an agent: you can assign it a task and it goes off and does it, using tools as appropriate. Because of this change, you have to consider three things when deciding what AI to use: Models, Apps, and Harnesses.

Models are the underlying AI brains, and the big three are GPT-5.2/5.3, Claude Opus 4.6, and Gemini 3 Pro (the companies are releasing new models much more rapidly than in the past, so version numbers may change in the coming weeks). These are what determine how smart the system is, how well it reasons, how good it is at writing or coding or analyzing a spreadsheet, and how well it can see images or create them. Models are what the benchmarks measure and what the AI companies race to improve. When people say “Claude is better at writing” or “ChatGPT is better at math,” they’re talking about models.
Apps are the products you actually use to talk to a model, and which let models do real work for you. The most common app is the website for each of these models: chatgpt.com, claude.ai, gemini.google.com (or else their equivalent application on your phone). Increasingly, there are other apps made by each of these AI companies as well, including coding tools like OpenAI Codex or Claude Code, and desktop tools like Claude Cowork.
Harnesses are what let AI models do real work, the way a horse harness takes the raw power of a horse and lets it pull a cart or plow. A harness is a system that lets the AI use tools, take actions, and complete multi-step tasks on its own. Apps come with a harness. Claude on the website has a harness that lets Claude Opus 4.6 do web searches and write code, but it also has instructions about how to approach various problems like creating spreadsheets or doing graphic design work. Claude Code has an even more extensive harness: it gives Claude Opus 4.6 a virtual computer, a web browser, a code terminal, and the ability to string these together to actually do things like researching, building, and testing your new website from scratch. Manus (recently acquired by Meta) was essentially a standalone harness that could wrap around multiple models. OpenClaw, which made big news recently, is mostly a harness, running locally on your computer, that can wrap around whatever AI model you choose.
Until recently, you didn’t have to know this. The model was the product, the app was the website, and the harness was minimal. You typed, it responded, you typed again. Now the same model can behave very differently depending on what harness it’s operating in. Claude Opus 4.6 talking to you in a chat window is a very different experience from Claude Opus 4.6 operating inside Claude Code, autonomously writing and testing software for hours at a stretch. GPT-5.2 answering a question is a very different experience from GPT-5.2 Thinking navigating websites and building you a slide deck.
All of this means that the question “which AI should I use?” has gotten harder to answer, because the answer now depends on what you’re trying to do with it. So let me walk through the landscape.
The top models are remarkably close in overall capability and are generally “smarter” and make fewer errors than ever. But if you want to use an advanced AI seriously, you’ll need to pay at least $20 a month (though some areas of the world have alternate plans that charge less). Those $20 get you two things: the ability to choose which model you use and access to the more advanced frontier models and apps. I wish I could tell you the free models currently available are as good as the paid models, but they are not. The free models are all optimized for chat rather than accuracy, so they are very fast and often more fun to talk to, but much less accurate and capable. Often, when someone posts an example of an AI doing something stupid, it is because they are either using a free model or have not selected a smarter model to work with.
The big three frontier models are Claude Opus 4.6 from Anthropic, Google’s Gemini 3.0 Pro, and OpenAI’s ChatGPT 5.2 Thinking. With any of these options, you get access to top-of-the-line AI models with a voice mode, the ability to see images and documents, the ability to execute code, good mobile apps, and the ability to create images and video (Claude lags behind on image and video creation, however). They all have different personalities and strengths and weaknesses, but for most people, just selecting the one they like best will suffice. For now, the other companies in this space have fallen behind, whether in models or in apps and harnesses, though some users may still have reasons for picking them.

When you are using any AI app (more on those shortly), including phone apps or websites, the single most important thing you can do is pick the right model, which the AI companies do not make easy. If you are just chatting, the default models are fine; if you want to do real work, they are not. For ChatGPT, whether you use the free or paid version, the default model you are given is “ChatGPT 5.2”. The issue is that GPT-5.2 is not one model but many, from the very weak GPT-5.2 mini to the very good GPT-5.2 Thinking to the extremely powerful GPT-5.2 Pro. When you select GPT-5.2, what you are really getting is “auto” mode, where the AI decides which model to use, often a less powerful one. By paying, you get to decide which model to use, and, to further complicate things, you can also select how hard the model “thinks” about the answer. For anything complex, I always manually select GPT-5.2 Thinking Extended (on the $20 plan) or GPT-5.2 Thinking Heavy (on more expensive plans). For a really hard problem that requires a lot of thinking, you can pick GPT-5.2 Pro, the strongest model, which is only available at a higher cost tier.
For Gemini, there are three options: Gemini 3 Flash, Gemini 3 Thinking, and, for some paid plans, Gemini 3 Pro. If you pay for the Ultra plan, you get access to Gemini Deep Think for very hard problems (which is in another menu entirely). Always pick Gemini 3 Pro or Thinking for any serious problem. For Claude, you need to pick Opus 4.6 (the new Sonnet 4.6 is also powerful, but not quite as good) and turn on the “extended thinking” switch.
Again, for most people, the model differences are now small enough that the app and harness matter more than the model. Which brings us to the bigger question.
The vast majority of people use chatbots, the main websites or mobile apps of ChatGPT, Claude, and Gemini, to access their AI models. The chatbot remains the most important and widespread AI app. In the past few months, these apps have become quite different from each other.
Some of the differences come down to which features are bundled with each chatbot:
Bundled into the Gemini chatbot (and accessible with the little plus button) are nano banana (the best current AI image creation tool), Veo 3.1 (a leading AI video creation tool), Guided Learning (which makes the AI act more like a tutor when you are studying), and Deep Research.
Bundled into ChatGPT is an even bigger hodgepodge of options, accessible with the plus button. You can Create Images (the image generator is almost as good as nano banana, but you can’t access the Sora video creator through the chatbot), Study and Learn (the equivalent to Guided Learning in Gemini, though there is also a separate Quizzes creator for some reason), Deep Research and Shopping Research (surprisingly good and overlooked), and a set of other options that most people will not use often, so I won’t cover them here.
Claude has only Deep Research as a bundled option, but you can access a study mode by creating a Project and selecting a study project.
All of the AI models let you connect to data, such as letting the AI read your email and calendar, access your files, or connect to other applications. This can make AI far more useful, but, again, each AI tool has a different set of connectors you can use.
These are confusing! For most people doing real work, the most important additional features are Deep Research and connecting AI to your own content, but you may want to experiment with the others. Increasingly, however, what matters is the harness - the tools the AI has access to. And here, OpenAI and Anthropic have clear leads over Google. Both Claude.ai and ChatGPT have the ability to write and execute code, give you files, do extensive research, and a lot more. Google’s Gemini website is much less capable (even though its AI model is just as good).
As you can see, asking a similar question gets working spreadsheets and PowerPoints from ChatGPT and Claude, along with clear citations I can follow up on. Gemini, however, is unable to produce either kind of document, and it does not provide citations or research. I do expect that Google will catch up here soon, however.
One final note on chatbots. GPT-5.2 Pro, with the harness that comes with it, is a VERY smart model. It is the model that just helped derive a novel result in physics, and it is the one I find most capable of doing complex statistical and analytical work. It is only accessible through more expensive plans. Google Gemini 3 Deep Think also seems very capable, but it suffers from the same weak-harness problem as the rest of Gemini.

The chatbot websites are where most people interact with AI, but they are increasingly not where the most impressive work gets done. A growing set of other apps wrap these same models in more powerful harnesses, and they matter.
Claude Code, OpenAI Codex, and Google Antigravity are the most well-developed of these, and they are all aimed at coders. Each of them gives an AI model access to your codebase, a terminal, and the ability to write, run, and test code on its own. You describe what you want built and the AI goes and builds it, coming back when it’s done or stuck. If you write code for a living, these tools are changing your job. Because they have the most extensive harnesses, even if you don’t code, they can still do a tremendous amount.
For example, a couple years ago, I became interested in how you would make an entirely paper-based LLM by providing all of the original GPT-1’s internal weights and parameters (the code of the AI, listed as 117 million numbers) in a set of books. In theory, with enough time, you could use those numbers to do the math of an AI by hand. This seemed like a fun idea, but obviously not worth doing. A week ago, I asked Claude Code to just do it for me. Over the course of an hour or so (mostly the AI working, with a couple suggestions), it made 80 beautifully laid out volumes containing all of GPT-1, along with a guide to the math. It also came up with, and executed, covers for each volume that visualized the interior weights. It then put together a very elegant website (including the animation below), hooked it up to Stripe for payment and Lulu to print on demand, tested the whole thing, and launched it for me. I never touched or looked at any code. I had it make 20 books available at cost to see what happened - and sold out the same day. All of the volumes are still available as free PDFs on the site. Now, I can have a little project idea that would have required a lot of work, and just have it executed for me with very little effort on my part.
But the coding harnesses remain risky for amateurs and, obviously, focused on coding. New apps and harnesses are starting to focus on other types of knowledge work.
Claude for Excel and PowerPoint are examples of specific harnesses inside of applications. Both of them provide very impressive extensions to these programs. Claude for Excel, in particular, feels like a massive change in working with spreadsheets, with the potential for an impact similar to Claude Code for those who work with Excel for a living - you can, increasingly, tell the AI what you want to do and it acts as a sort of junior analyst and does the work. Because the results are in Excel, they are easy to check. Google has some integration with Google Sheets (though not as deep), and OpenAI does not really have an equivalent product.
Claude Cowork is something genuinely new, and it deserves its own category. Released by Anthropic in January, Cowork is essentially Claude Code for non-technical work. It runs on your desktop and can work directly with your local files and your browser. However, it is much more secure than Claude Code and less dangerous for non-technical users (it runs in a VM with default-deny networking and hard isolation baked in, for those who care about the details). You describe an outcome (organize these expense reports, pull data from these PDFs into a spreadsheet, draft a summary) and Claude makes a plan, breaks it into subtasks, and executes them on your computer while you watch (or don’t). It was built on the same agentic architecture as Claude Code, and was itself largely built by Claude Code in about two weeks. Neither OpenAI nor Google has a direct equivalent, at least this week. Cowork is still a research preview, meaning it’s early and will eat through your usage limits fast, but it is a clear sign of where all of this is heading: AI that doesn’t just talk to you about your work, but does your work.

NotebookLM is Google’s answer to a different problem: how do you use AI to make sense of a lot of information? You can ask NotebookLM to do its own deep research, or else add in your own papers, YouTube videos, websites, or files, and NotebookLM builds an interactive knowledge base you can query, turn into slides, mind maps, videos and, most famously, AI-generated podcasts where two hosts discuss your material (you can even interrupt the hosts to ask questions). If you are a student, a researcher, or anyone who regularly needs to make sense of a pile of documents, NotebookLM is a very useful tool.
And then there is OpenClaw, which I want to mention even though it doesn’t fit neatly into any of these categories and which you almost definitely shouldn’t use. OpenClaw is an open-source AI agent that went viral in late January. It runs locally on your computer, connects to whatever AI model you want, and you talk to it as if you were chatting with a person, using standard messaging apps like WhatsApp or iMessage. It can browse the web, manage your files, send emails, and run commands. It is sort of a 24/7 personal assistant that lives on your machine. It is also a serious security risk: you are giving an AI broad access to your computer and your accounts, and no one knows exactly what dangers you are exposing yourself to. But it does serve as a sign of where things are going.
I know this is a lot. Let me simplify.
If you are just getting started, pick one of the three systems (ChatGPT, Claude, or Gemini), pay the $20, and select the advanced model. The advice from my book still holds: invite AI to everything you do. Start using it for real work. Upload a document you’re actually working on. Give the AI a very complex task in the form of an RFP or SOP. Have a back-and-forth conversation and push it. This alone will teach you more than any guide.
If you are already comfortable with chatbots, try the specific apps. NotebookLM is free and easy to use, which makes it a good starting place. If you want to go deeper, Anthropic offers the most powerful package: Claude Code and Claude Cowork (both accessible through Claude Desktop), as well as the specialized PowerPoint and Excel plugins. Give them a try. Again, not as a demo, but with something you actually need done. Watch what it does. Steer it when it goes wrong. You aren’t prompting, you are (as I wrote in my last piece) managing.
The shift from chatbot to agent is the most important change in how people use AI since ChatGPT launched. It is still early, and these tools are still hard to figure out and will still do baffling things. But an AI that does things is fundamentally more useful than an AI that says things, and learning to use it that way is worth your time.
2026-01-28 00:55:55
I just taught an experimental class at the University of Pennsylvania where I challenged students to create a startup from scratch in four days. Most of the people in the class were in the executive MBA program, so they were taking classes while also working as doctors, managers, or leaders in a variety of large and small companies. Few had ever coded. I introduced them to Claude Code and Google Antigravity, which they needed to use to build a working prototype. But a prototype alone is not a startup, so they used ChatGPT, Claude, and Gemini to accelerate the idea generation, market research, competitive positioning, pitching, and financial modelling processes. I was curious how far they could get in such a short time. It turns out they got very far.

I’ve been teaching entrepreneurship for a decade and a half, and I've seen thousands of startup ideas (some of which turned into large companies), so I have a good sense of what a class of smart MBA students can accomplish. I would estimate that what I saw in a couple of days was an order of magnitude further along the path to a real startup than I had seen out of students working over a full semester before AI. Most of the prototypes were not just sample screens but actually had a core feature working. Ideas were far more diverse and interesting than usual. Market and customer analyses were insightful. It was really impressive. These were not yet working startups, nor were they fully operational products (with a couple of exceptions) — but they had shaved months and huge amounts of money and effort from the traditional process. And there was something else: most early startups need to pivot, changing direction as they learn more about what the market wants and what is technically possible. Because AI lowers the cost of pivoting, it was much easier to explore possibilities without being locked in, or even to explore multiple startups at once: you just tell the AI what you want.
I wish I could say this impressive output was the result of my brilliant teaching, but we don’t really have a great framework yet for how to use all these tools; the students largely figured it out on their own. It helped that they had some management and subject matter expertise, because it turns out that the key to success was actually the last bit of the previous paragraph: telling the AI what you want. As AIs become increasingly capable of tasks that would take a human hours to do, and as evaluating those results becomes increasingly time consuming, the value of being good at delegation increases. But when should you delegate to AI?
We actually have an answer, but it is a bit complicated. Consider three factors: First, because of the Jagged Frontier of AI ability, you don’t reliably know what the AI will be good or bad at on complex tasks. Second, whether the AI is good or bad, it is definitely fast. It produces work in minutes that would take many hours for a human to do. Third, it is cheap (relative to professional wages), and it doesn’t mind if you generate multiple versions and throw most of them away.
These three factors mean that deciding to delegate to AI depends on three variables:
Human Baseline Time: how long the task would take you to do yourself
Probability of Success: how likely the AI is to produce an output that meets your bar on a given attempt
AI Process Time: how long it takes you to request, wait for, and evaluate an AI output
A useful mental model is that you’re trading off “doing the whole task” (Human Baseline Time) against “paying the overhead cost” (AI Process Time), possibly multiple times until you get something acceptable. The higher Probability of Success is, the fewer times you have to pay AI Process Time, and the more useful it is to turn things over to the AI. For example, consider a task that takes you an hour to do, but the AI can do it in minutes, though checking the answer takes thirty minutes. In that case, you should only give the work to the AI if Probability of Success is very high, otherwise you’ll spend more time generating and checking drafts than just doing it yourself. If the Human Baseline Time is 10 hours, though, it could be worth several hours of working with the AI, assuming that the AI can be made to do a competent job.
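To make the tradeoff concrete, here is a minimal sketch in Python. The function and the sample numbers are mine, and it assumes the simple policy described above: try the AI once, pay the process time either way, and fall back to doing the task yourself if the output doesn’t meet your bar.

```python
# A minimal sketch of the delegation tradeoff. Assumption: you pay the AI
# Process Time (prompting, waiting, checking) whether or not the attempt
# succeeds, and you redo the task yourself when it fails.

def expected_hours_with_ai(human_baseline: float,
                           p_success: float,
                           ai_process: float) -> float:
    """Expected hours if you try the AI once, then redo the task on failure."""
    return ai_process + (1 - p_success) * human_baseline

# The one-hour task from the text, with half an hour of checking:
for p in (0.5, 0.8, 0.95):
    with_ai = expected_hours_with_ai(human_baseline=1.0, p_success=p, ai_process=0.5)
    print(f"p={p:.2f}: {with_ai:.2f}h with AI vs 1.00h doing it yourself")
```

At a 50% success rate you merely break even on that one-hour task, which is why Probability of Success matters so much whenever the checking overhead is large relative to the task itself.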

We know this equation works because this past summer, OpenAI released one of the more important papers on AI and real work, GDPval. I have discussed it before, but the key was that it pitted experienced human experts in diverse fields from finance to medicine to government against the latest AIs, with another set of experts working as judges. It took experts seven hours on average to do the work, so, in this case, that is the Human Baseline Time. The AI Process Time was interesting: the AI took only minutes for tasks, but it required an hour for experts to actually check the work, and, of course, prompts take time to write as well. As for Probability of Success, when GDPval first came out, judges gave human work the win the majority of the time, but, with the release of GPT-5.2, the balance shifted. GPT-5.2 Thinking and Pro models tied or beat human experts an average of 72% of the time.

We can now calculate how many hours you would save on a seven-hour task, assuming a 72% probability of success and an hour of evaluation. If you approached every task by taking the time to prompt the AI, evaluating the answer for an hour, and then doing it yourself if the AI answer was bad, you would save roughly 3 hours on average. Tasks the AI failed on would take longer (you wasted time prompting and reviewing!), but tasks the AI succeeded on would be much faster. But we can change the equation even more in our favor using techniques from management!
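The back-of-the-envelope arithmetic behind that estimate fits in a few lines. The overhead figure below is an assumption on my part: the roughly three-hour saving corresponds to counting both the hour of expert review and the time spent writing prompts as overhead.

```python
# Plugging rough GDPval-style numbers into the same one-attempt policy:
# prompt the AI, review the output, and redo the task yourself if it fails.

human_baseline = 7.0    # hours an expert takes on average
p_success = 0.72        # share of tasks where the AI output ties or wins

for overhead in (1.0, 1.5, 2.0):   # hours of prompting plus review (assumed)
    expected = overhead + (1 - p_success) * human_baseline
    print(f"overhead={overhead:.1f}h -> expected {expected:.2f}h, "
          f"saving {human_baseline - expected:.2f}h")
```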
There are three things we can do to make delegating to AI more worthwhile by increasing the Probability of Success and lowering AI Process Time. We can give better instructions, setting clear goals that the AI can execute on with a higher chance of succeeding. We can get better at evaluation and feedback, so we need to make fewer attempts to get the AI to do the right thing. And we can make it easier to evaluate whether the AI is good or bad at a task without spending as much time. All of these factors are improved by subject matter expertise — an expert knows what instructions to give, they can better see when something goes wrong, and they are better at correcting it.
If you don’t need something specific, AI models have become incredibly capable of figuring out how to solve problems themselves. For example, I found Claude Code was able to generate an entire 1980s style adventure game with one prompt to "create an entirely original old-school Sierra style adventure game with EGA-like graphics. You should use your image agent to generate images and give me a parser. Make all puzzles interesting and solvable. Finish the game (it should take 10-15 minutes to play), don’t ask any questions. make it amazing and delightful." That’s it, the AI made everything, including the art. With two final prompts it tested the game and deployed it. You can play it yourself: enchanted-lighthouse-game.netlify.app
This is genuinely amazing, but that amazement is amplified because I didn’t need anything specific, just an adventure game that the AI was free to improvise. But real work, and real delegation, means that you have a specific output in mind, and that is where things can get tricky. How do you communicate your intention to the AI to execute on what you want, so it can use “judgement” to solve problems while still giving you the output you desire?
This problem existed long before AI and is so universal that every field has invented their own paperwork to solve it. Software developers write Product Requirements Documents. Film directors hand off shot lists. Architects create design intent documents. The Marines use Five Paragraph Orders (situation, mission, execution, administration, command). Consultants scope engagements with detailed deliverable specs. All of these documents work remarkably well as AI prompts for this new world of agentic work (and the AI can handle many pages of instructions at a time). The reason you can use so many formats to instruct AI is that all of these are really the same thing: attempts to get what’s in one person’s head into someone else’s actions.
When you look at what actually goes into good delegation documentation, it’s remarkably consistent: What are we trying to accomplish, and why? Where are the limits of the delegated authority? What does “done” look like? What specific outputs do I need? What interim outputs do I need to follow your progress? And what should you check before telling me you’re finished? If these are well-specified, the AI, like humans, is far more likely to do a good job.
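If it helps to see what that looks like in practice, here is a hypothetical sketch that turns those questions into a reusable brief you can paste at the top of an agent prompt. The field names and the example content are mine, not a standard format.

```python
# A hypothetical delegation brief built from the questions above.

def delegation_brief(goal, limits, done, deliverables, checkpoints, self_checks):
    sections = [
        ("What we are trying to accomplish, and why", goal),
        ("Limits of your delegated authority", limits),
        ("What 'done' looks like", done),
        ("Specific outputs I need", deliverables),
        ("Interim outputs so I can follow your progress", checkpoints),
        ("Check before telling me you are finished", self_checks),
    ]
    return "\n".join(f"{title}:\n  {body}\n" for title, body in sections)

print(delegation_brief(
    goal="Turn these twelve expense PDFs into one reconciled spreadsheet for the quarterly close.",
    limits="Read-only access to the PDFs; do not email anyone or buy anything.",
    done="One .xlsx with a row per receipt and a summary tab by month.",
    deliverables="expenses.xlsx plus a short note flagging anything ambiguous.",
    checkpoints="A list of parsed receipts after the first three PDFs.",
    self_checks="Monthly totals match the sums on the original statements.",
))
```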
And in figuring out how to give these instructions to the AI, it turns out you are basically reinventing management.
I find it interesting to watch as some of the most well-known software developers at the major AI labs note how their jobs are changing from mostly programming to mostly management of AI agents. Coding has always had a very organized structure, with clearly verifiable outputs (the code either works or it doesn’t) so it has been one of the first areas where AI tools have matured, and thus the first profession to feel this change. It isn’t the last.
As a business school professor, I think many people have the skills they need, or can learn them, in order to work with AI agents - they are management 101 skills. If you can explain what you need, give effective feedback, and design ways of evaluating work, you are going to be able to work with agents. In many ways, at least in your area of expertise, it is much easier than trying to design clever prompts to help you get work done, as it is more like working with people. At the same time, management has always assumed scarcity: you delegate because you can’t do everything yourself, and because talent is limited and expensive. AI changes the equation. Now the “talent” is abundant and cheap. What’s scarce is knowing what to ask for.
This is why my students did so well. They weren’t AI experts. But they’d spent years learning how to scope problems in their fields of expertise, define deliverables, and recognize when a financial model or medical report was off. They had hard-earned frameworks from classes and jobs, and those frameworks became their prompts. The skills that are so often dismissed as “soft” turned out to be the hard ones.
I don’t know exactly what work looks like when everyone is a manager with an army of tireless agents. But I suspect the people who thrive will be the ones who know what good looks like — and can explain it clearly enough that even an AI can deliver it. My students figured this out in four days. Not because they were AI natives, but because they already knew how to manage. All that training, it turns out, was accidentally preparing them for exactly this moment.
2026-01-08 07:00:10
I opened Claude Code and gave it the command: “Develop a web-based or software-based startup idea that will make me $1000 a month where you do all the work by generating the idea and implementing it. i shouldn’t have to do anything at all except run some program you give me once. it shouldn’t require any coding knowledge on my part, so make sure everything works well.” The AI asked me three multiple choice questions and decided that I should be selling sets of 500 prompts for professional users for $39. Without any further input, it then worked independently… FOR AN HOUR AND FOURTEEN MINUTES creating hundreds of code files and prompts. And then it gave me a single file to run that created and deployed a working website (filled with very sketchy fake marketing claims) that sold the promised 500 prompt set. You can actually see the site it launched here, though I removed the sales link, which did actually work and would have collected money. I strongly suspect that if I ignored my conscience and actually sold these prompt packs, I would make the promised $1,000.

This is Claude Code at work, one of a new generation of AI coding tools that represent a sudden capability leap in AI in the past month or so. What makes these new tools suddenly powerful is not one breakthrough, but a combination of two advances. First, the latest AIs are capable of doing far more work autonomously while self-correcting many of their errors, especially in programming tasks. Second, the AIs are being given an “agentic harness” of tools and approaches that they can use to solve problems in new ways. Together, these two factors have led to big leaps in the latest AI tools made by the big AI companies.

Unfortunately for most of us who want to experiment with AI, these new tools are built for programmers. And I mean they are really built for programmers: they assume that you understand Python commands and programming best practices and they are wrapped in interfaces that look like something from a 1980s computer lab. They are also explicitly designed to help analyze, troubleshoot, and write code using approaches that fit into existing programmer workflows. In a lot of ways, this is a shame, because these systems are actually broadly useful to knowledge workers of all types, and, by seeing what they can do (and experimenting with them yourself), I think you can learn a lot about the future of AI. In this post, we are going to focus on one in particular, Claude Code powered by Opus 4.5, but it works similarly to its main competition OpenAI’s Codex with GPT-5.2 and Google’s Antigravity with Gemini 3.
To return to the example of the startup company launched by Claude Code: as practically impressive as this was, it only touched a small part of what the tool is capable of. In that case, I only used Claude Code for coding, but if I ask it to do user testing of the live site from different personas and give me a report, it deploys one of its many tools: its connection to the web browser on my computer. Claude takes control of the browser and goes to the site it created, scrolling through it like a human would. On the first pass, it gave me a pretty optimistic report, but, because I know that AIs tend to be sycophantic, I also asked it for a more critical one. This second report did a better job nailing potential issues (and spotting the sketchy fake reviews that were on the site). As a next step, I could easily ask it to implement its suggestions, continuing the process with minimal input from me.
A big reason Claude Code is so good is that it uses a wide variety of tricks in its agentic harness that allow its very smart AI, Opus 4.5, to overcome many of the problems of LLMs. For example, an interesting thing happened while the AI was doing its user research: its context window filled up. As you might know, AIs can only “remember” so much information at a time. This context window is often quite long by human standards (150,000 words or more), but it gets filled up remarkably quickly because it contains your entire conversation, every document the AI reads, every image it takes, and the initial system prompts that help guide the AI. There is no real long-term memory for AI, so as soon as the context window fills up, the AI cannot remember anything else. If you are just having a casual chat, this isn’t really a problem. Any long conversation with ChatGPT features a rolling context window: the AI is constantly forgetting the oldest part of its conversation, but it is generally able to keep up by improvising based on the most recent parts of the discussion. If you are doing real work, however, having the AI forget some of your code as it reads new code becomes a big problem.
Claude Code handles this issue in a different way. When it runs out of context, it stops and “compacts” the conversation so far, taking notes about exactly where it was when it stopped. Then it clears its context window, and the fresh version of Claude Code reads the notes and reviews the progress to date - think of the amnesiac main character from the movie Memento looking at his tattoos for reference whenever he wakes up with no memory. These notes give Claude everything it needs to keep moving. This is why Claude can run for hours at a time: it carefully notes what it is doing along the way and produces interim work, like pieces of software and reports, that it can refer to.
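To make the idea concrete, here is a toy illustration of compaction in Python. It is my own sketch of the pattern, not Anthropic’s implementation; the token estimate and the callback names are placeholders.

```python
# Toy sketch of "compact and continue": when the transcript grows too long,
# replace it with structured notes about goals, progress, and next steps,
# then keep working from a fresh context that contains only those notes.

MAX_CONTEXT_TOKENS = 150_000

def rough_tokens(messages):
    return sum(len(m["content"]) for m in messages) // 4  # ~4 chars per token

def compact(messages, summarize):
    notes = summarize(messages)  # e.g. ask the model itself to take the notes
    return [{"role": "system",
             "content": "Notes from earlier work (context was compacted):\n" + notes}]

def run_agent(step, summarize, max_steps=100):
    messages = []
    for _ in range(max_steps):
        if rough_tokens(messages) > MAX_CONTEXT_TOKENS:
            messages = compact(messages, summarize)  # the "Memento" move
        done, new_messages = step(messages)  # one model turn or tool call
        messages.extend(new_messages)
        if done:
            break
    return messages
```

The interim files Claude writes along the way play the same role as the notes: durable memory that survives each reset.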
This is not the only trick Claude Code uses to get around the limitations of AI. Another is its use of Skills. As everyone reading this post knows, users have to prompt AIs to do things. These prompts act as instructions, and, as AIs have gotten smarter, they have become much better at executing complex prompts, even hundred-page-long prompts. These long prompts take up a lot of the context window, however, and require giving the AI the right prompt at the right time. That either means that you, as a human, have to keep prompting the AI, or you have to design a complex automated system that keeps feeding the AI prompts.
Skills solve this problem. They are instructions that the AI decides when to use, and they contain not just prompts, but also the sets of tools the AI needs to accomplish a task. Does it need to know how to build a great website? It loads up the Website Creator Skill, which explains how to build a website and the tools to use when doing it. Does it need to build an Excel spreadsheet? It loads the Excel skill with its own instructions and tools. To make another movie reference, it is like when Neo in The Matrix gets martial arts instructions uploaded to his head and acquires a new skill: “I know kung fu.” Skills can let an AI cover an entire process by swapping out knowledge as needed. For example, Jesse Vincent released an interesting free list of skills that let Claude Code handle a full software development process, picking up skills as needed, starting with brainstorming and planning before progressing all the way to testing code. Skill creation is technically very easy - it is done in plain language - and the AI can actually help you create skills (more on this in a bit).
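Conceptually, a skill loader can be very simple. The sketch below is illustrative, not Anthropic’s actual code: each skill is a folder with a one-line description (cheap to keep in context at all times) and longer instructions that only get pulled in when a task calls for them. The file names are my own.

```python
from pathlib import Path

def load_skill_index(skills_dir="skills"):
    """Read only the short descriptions, which stay in context at all times."""
    index = {}
    for skill in Path(skills_dir).iterdir():
        desc = skill / "description.txt"
        if desc.exists():
            index[skill.name] = desc.read_text().strip()
    return index

def instructions_for(task, index, skills_dir="skills"):
    """Naive keyword match; in a real harness the model decides which skill fits."""
    for name, description in index.items():
        if any(word in task.lower() for word in description.lower().split()):
            return (Path(skills_dir) / name / "instructions.md").read_text()
    return ""  # no matching skill: fall back to general instructions
```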

Along with Skills, Claude Code has other tricks up its sleeve to manage its limited context window and solve hard problems. It can also create subagents - effectively launching other, specialized AIs to solve specific problems. This can be useful in many ways. Because Opus is a large, expensive model, it can hand off easier tasks to cheaper and faster models. It also allows Claude to run many different processes at once, making it work like a team, rather than an individual. And these models can be very specialized with their own context windows. For example, I built separate subagents for research and for image creation. The main AI model “hires” these agents when needed to do specialized work.
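Here is a schematic of the subagent pattern as I understand it (my own sketch, not Claude Code’s internals): the lead model keeps its context small by handing well-scoped subtasks to specialists, each with its own system prompt, model, and fresh context. The model names are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Subagent:
    name: str
    model: str           # e.g. a cheaper, faster model for simpler jobs
    system_prompt: str

    def run(self, task: str, call_model) -> str:
        # call_model(model, messages) stands in for whatever API call you use.
        messages = [{"role": "system", "content": self.system_prompt},
                    {"role": "user", "content": task}]
        return call_model(self.model, messages)

researcher = Subagent("research", "small-fast-model",
                      "You research topics and return terse, sourced notes.")
illustrator = Subagent("images", "image-capable-model",
                       "You produce image prompts and captions on request.")

# The lead agent "hires" whichever specialist the subtask calls for, e.g.:
# notes = researcher.run("Summarize three recent papers on crowdfunding.", call_model)
```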
And you don’t even need to create your own tools. Anyone can share Skills or subagents, and companies who want AI agents to work with their products can use an approach called the Model Context Protocol (MCP) to give any AI instructions and access. There are MCPs from publishers that let AI access scientific papers for research, MCPs from payment companies that give the AI the ability to analyze financial data, MCPs from software providers that let AI use a particular software product, and so on. The result is a very flexible system where a smart generalist AI like Claude Opus 4.5 can apply specialized skills on the fly, use tools as needed, and keep track of what it is doing.
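For a sense of what an MCP integration looks like from the provider’s side, here is a minimal sketch that assumes the official `mcp` Python SDK and its FastMCP helper; the paper-search tool itself is a hypothetical stub.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("paper-search")

@mcp.tool()
def search_papers(query: str, max_results: int = 5) -> list[str]:
    """Return paper titles matching the query (stubbed for illustration)."""
    # A real server would call a publisher's API here.
    return [f"Placeholder result {i + 1} for '{query}'" for i in range(max_results)]

if __name__ == "__main__":
    mcp.run()  # any MCP-aware client can now discover and call search_papers
```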
Claude Code is particularly powerful because it works on your computer and your files. So now you have an AI that can do almost anything a human with access to your machine can do. It can read all your files and create new ones (PowerPoint and Word are just code, in the end, and Claude knows how to write code), access the web using your browser, write and execute programs for you, and more. Of course, AIs are not flawless, and giving an AI access to your browser and computer creates all sorts of new risks and dangers. The AI might delete files it shouldn't, execute code with unintended consequences, or access sensitive data in your browser. Despite these warnings, I am going to give you a very quick intro to Claude Code, but make backups, use a dedicated folder, and don't give it access to anything you can't afford to lose.
Though I have been using the Command Line Interface for Claude Code in the screenshots so far, there is an easier way (as of yesterday!) to access Claude Code. You can do this with Claude Desktop, which you can download and install here (using it for any length of time requires at least a $20 monthly subscription). Right now, the Desktop version has fewer features than the Command Line Interface, but it is much easier for amateurs to use.
Now just give the AI access to a folder (remember that Claude can do anything to the files in that folder, so be careful if it is sensitive and make a backup) and you can start working with the AI: have it research and write reports, give it access to your credit card records so it can put them into a spreadsheet and tell you about any anomalies, ask it to do a data visualization, or whatever else you like. The most powerful options I mentioned earlier are accessed through slash commands that start with a “/” — typing /agents lets you set up subagents, /skills lets you create or download skills, and so on (the desktop version has limited slash commands, but the full set is coming). There are many ways people are using Claude Code, so you can experiment to figure out what works for you, but I would also suggest using it to actually code, even if you aren’t a coder.
For example, while I was writing this piece, I would occasionally go to a Claude Code window where I had the AI building a game for me for fun: a simulation of history where civilizations rise and fall, developing their own languages, cultures, and economies. Every few minutes, I would give the AI another seemingly impossible request: make sure the world has its own plate tectonics and weather; keep track of the family trees of rulers; build in an AI that dramatically summarizes events; and so on. After each change, the AI would playtest the results and produce a new version of the game. Unlike previous vibe coding experiences, the AI never got stuck or went in circles; it all went smoothly. Take a look at the video below. It is, I am sure, filled with issues that a competent coder would catch, but you can download the results here (the AI handled that part, too).
What does all this mean? If you're a programmer, you should already be exploring these tools. If you're programming-adjacent (an academic who works with data, a designer who wants to experiment with code, anyone who wants to try building a thing they are imagining) this is your moment to experiment. But there's a deeper point here: with the right harness, today's AIs are capable of real, sustained work that actually matters, and that, in turn, is starting to change how we approach tasks.
It is starting, unsurprisingly, with programming. One of the more famous coders in the AI world, Andrej Karpathy, recently posted: “I've never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.” Don’t let the awkwardness of the current Claude Code or its specialization for coding fool you. New harnesses that make AI work for other knowledge tasks are coming in the near future, and so are the changes that they will bring.
2025-12-21 01:32:02
Back in the ancient AI days of 2023, my co-authors and I invented a term to describe the weird ability of AI to do some work incredibly well and other work incredibly badly in ways that didn’t map very well to our human intuition of the difficulty of the task. We called this the “Jagged Frontier” of AI ability, and it remains a key feature of AI and an endless source of confusion. How can an AI be superhuman at differential medical diagnosis or good at very hard math (yes, they are really good at math now, famously outside the frontier until recently) and yet still be bad at relatively simple visual puzzles or running a vending machine? The exact abilities of AI are often a mystery, so it is no wonder AI is harder to use than it seems.
I think jaggedness is going to remain a big part of AIs going forward, but there is less certainty over what it means. Tomas Pueyo posted this viral image on X that outlined his vision. In his view, the growing frontier will outpace jaggedness. Sure, the AI is bad at some things and may still be relatively bad even as it improves, but the collective human ability frontier is mostly fixed, and AI ability is growing rapidly. What does it matter if AI is relatively bad at running a vending machine, if the AI still becomes better than any human?
While the future is always uncertain, I think this conception misses out on a few critical aspects about the nature of work and technology. First, the frontier is very jagged indeed, and it might be that, because of this jaggedness, we get supersmart AIs which never quite fully overlap with human tasks. For example, a major source of jaggedness is that LLMs do not remember new tasks and learn from them in a permanent way. A lot of AI companies are pursuing solutions to this issue, but it may be that this problem is harder to solve than researchers expect. Without memory, AIs will struggle to do many tasks humans can do, even while being superhuman in other areas. Colin Fraser drew two examples of what this sort of AI-human overlap might look like. You can see how AI is indeed superhuman in some areas, but in others it is either far below human level or not overlapping at all. If this is true, then AI will create new opportunities working in complement with human beings, since we both bring different abilities to the table.
These are conceptual drawings, but a group of scientists recently tried to map the shape of AI ability and found that it was growing unevenly, just as the jagged frontier would predict. Reading, math, general knowledge, reasoning — all were things that AI was improving on rapidly. But memory, as we discussed, is a weak spot with very little improvement. Better prompting or better models (and GPT-5.2 is much better than GPT-5) might change the shape of the frontier, but jaggedness remains.
And even small amounts of jaggedness can create issues that make super-smart AIs unable to automate a task. A system is only as functional as its worst components. We call these problems bottlenecks. Some bottlenecks are because the AI is stubbornly subhuman at some tasks. LLM vision systems aren’t good enough at reading medical imaging so they can’t yet replace doctors; LLMs are too helpful when they should push back so they can’t yet replace therapists; hallucinations persist even if they have become rarer which means they can’t yet do tasks where 100% accuracy is required; and so on. If the frontier continues to expand, some of these problems may disappear, but weaknesses are not the only form of bottleneck.
Some bottlenecks are because of processes that have nothing to do with ability. Even if AI can now identify promising drug candidates dramatically faster than traditional methods, clinical trials still need actual human patients who take actual time to recruit, dose, and monitor. The FDA still requires human review of applications. Even if AI increases the rate of good drug ideas by ten times or more, the constraint becomes the rate of approval, not the rate of discovery. The bottleneck migrates from intelligence to institutions, and institutions move at institution speed.
And even where the AI is almost completely superhuman, humans may be needed for edge cases. As an example, take a study that used AI to reproduce Cochrane reviews, the famous deeply researched meta-studies that synthesize many medical studies to figure out the scientific consensus on a topic. A team of researchers found that GPT-4.1, when properly prompted and supported, “reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work.” The AI screened over 146,000 citations, read full papers, extracted data, and ran statistical analyses. It actually outperformed human reviewers on accuracy. Oddly, much of the hard intellectual work — finding relevant studies, pulling the right numbers, synthesizing results — is solidly inside the frontier. But the AI can't access supplementary files and it can't email authors to request unpublished data, things human reviewers do routinely. This makes up less than 1% of errors in the review, but those errors mean you can't fully automate the process. Twelve work-years become two days, but only if a human with expertise in how science is actually done handles the edge cases.
This is the pattern: jaggedness creates bottlenecks, and bottlenecks mean that even very smart AI cannot easily substitute for humans. At least not yet. This is likely good in some ways (preventing rapid job loss) but frustrating in others (making it hard to speed up scientific research as much as we might hope). Bottlenecks also concentrate the work of AI companies into making the AI better at things that are holding it back, the way math ability rapidly improved once it became an obvious barrier. The historian Thomas Hughes had a term for this. Studying how electrical systems developed, he noticed that progress often stalled on a single technical or social problem. He called these “reverse salients” - the one technical or social problem holding back the system from leaping ahead.
Bottlenecks can create the impression that AI will never be able to do something, when, in reality, progress is held back by a single jagged weakness. When that weakness becomes a reverse salient, and AI labs suddenly fix the problem, the entire system can jump forward.
The most powerful example of this from the last month is Google’s new image generation AI, Nano Banana Pro (yes, AI companies are still bad at naming things). It combines two advances: a very good image creation model and a very smart AI that can help direct the model, looking up information as needed. For example, if I prompt Nano Banana Pro for the ultimate version of my otter test: “Scientists who are otters are using a white board to explain ethan mollicks otter on a plane using WiFi test of AI (you must search for this) and demonstrating it has been passed with a wall full of photos of otters on planes using laptops.” I get this:
Coherent words, different angles, shadows, no major misspellings. Pretty amazing stuff. Remember, the prompt “otter on a plane using wifi” got this image in 2021:
But it turns out that really good image generation was the bottleneck for a lot of new capabilities. For example, take PowerPoint decks. Every major AI company has been trying to get their AI to make PowerPoint, and they have done this by having the AIs write computer code (which they are very good at) to create a PowerPoint from scratch. This is a hard process, but both Claude and ChatGPT have improved a lot, even if their slides are a little dull. For example, I took my book, Co-Intelligence, and threw it into Claude and asked for a slide deck summary. The model is very smart, but the PowerPoint deck is limited by the fact that it has to be written in code.
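To give a sense of what “writing a PowerPoint in code” means, here is a small example using the python-pptx library. It is my own illustration of the approach, not any particular model’s output, and it shows why code-generated decks skew toward plain title-and-bullet layouts.

```python
from pptx import Presentation

prs = Presentation()
slide = prs.slides.add_slide(prs.slide_layouts[1])  # title + content layout
slide.shapes.title.text = "Co-Intelligence: Key Ideas"

body = slide.placeholders[1].text_frame
body.text = "Always invite AI to the table"
for bullet in ["Treat it like a person (but tell it what kind of person it is)",
               "Assume this is the worst AI you will ever use"]:
    body.add_paragraph().text = bullet

prs.save("summary_deck.pptx")
```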
Now here is the same thing in Google’s NotebookLM application, using its smart Gemini AI model combined with Nano Banana Pro. It isn’t using code, it is creating each slide as a single image. When image quality was low, this would have been impossible. Suddenly, it isn’t.
And since images are very flexible, I can play with style and approach. I had NotebookLM do a deep research report on science-backed methods of learning and then turn that into dense slide decks meant for reading in a variety of styles: one that looked hand-drawn, one that was inspired by 1980s punk, one that was “very dramatic and high contrast slides with a bright yellow background,” and, of course, one with an otter-on-a-plane theme.
In many ways, the hard stuff is inside the frontier for both Claude and Gemini: they can just take source materials, a topic, and an idea and summarize it in a slide. Hallucinations are very rare, and the sources are correct. It can create otter analogies or come up with a punk-themed description. This is the intellectually demanding part, and AIs have been capable of it for over a year. But making slides or other visual presentations was a bottleneck to making walls of text useful. The problem isn’t completely solved: images are not perfect, and you can’t edit them (apparently this will be fixed soon), but you can see where things are going.
Even if AI becomes superhuman at analysis and PowerPoint, I don’t think that means AI necessarily replaces the jobs of consultants and designers. Those jobs consist of many different tasks along the jagged frontier that AI is bad at and at which humans excel: can you collect information and get buy-in from the many parties involved? Can you understand the unwritten rules that determine what people actually need? Can you come up with something unique to address a deep issue, something that stands out from AI material? The jagged frontier offers many opportunities for human work.
Yet, we should expect to see lurches forward, where focusing on reverse salients leads to sudden removals of bottlenecks. Areas of work that used to be only human become something that AI can do. If you want to understand where AI is headed, don’t watch the benchmarks. Watch the bottlenecks. When one breaks, everything behind it comes flooding through. Image generation was holding back presentations, documents, visual communication of all kinds. Now it isn’t. What’s the next bottleneck? Memory? Real-time learning? The ability to take actions in the physical world?
Somewhere, right now, an AI lab is treating each of these bottlenecks as a reverse salient. We won’t get much warning when they break through. But a jagged frontier cuts both ways. So far, every lurch forward leaves yet more edges in which humans are needed. There will be many lurches ahead. There will also be many opportunities. Pay attention to both.
2025-11-19 00:55:07
I’ve been testing Google’s new Gemini 3 model. It is very good. But rather than give you benchmark numbers, I’m going to show you how far AI has come in three years by asking the AI to demonstrate it.
It has been slightly less than three years since the release of ChatGPT. A few days before that launch, I wrote my first post on this Substack about OpenAI’s earlier GPT-3 model. Then ChatGPT came out, and I wrote immediately afterwards that “I am usually pretty hesitant to make technology predictions, but I think that this is going to change our world much sooner than we expect, and much more drastically. Rather than automating jobs that are repetitive & dangerous, there is now the prospect that the first jobs that are disrupted by AI will be more analytic; creative; and involve more writing and communication.”
I think I got that one right.
I could explain the difference between the original ChatGPT and the new AI model from Google, Gemini 3, that launched today. But I don’t have to. Instead, I gave the screenshot above to Gemini 3 with a single prompt:
Me: “show how far AI has come since this post by doing stuff.”
Gemini: “I can definitely show you the difference. The tweet you shared is from November 2022—right on the cusp of the modern AI boom. Back then, the “wow factor” was that an AI could write a coherent paragraph or a simple poem about a silly concept like a “candy-powered FTL drive escaping otters.”
To demonstrate how far we’ve come, I’m not just going to write text about it. I have built you a fully interactive, playable Candy-Powered FTL Starship Simulator.
In 2022, AI could describe the engine. In 2025, AI can code the engine, design the interface, and let you pilot the ship yourself.”
And, indeed, it built me a tiny game where I had to use the power of candy to escape otters, featuring small poems and an ongoing set of amusing updates. If you’re reading this blog, you probably knew AI could do that. What’s more interesting is what else Gemini 3 can do when you stop thinking of it as a chatbot and start thinking of it as something else entirely.
Along with Gemini 3, Google has launched Antigravity. For programmers, Antigravity should be familiar territory: it is similar to Claude Code and OpenAI Codex, specialized tools that can be given access to your computer and which can autonomously write computer programs with guidance. If you aren’t a programmer, you may dismiss Antigravity and similar tools. I think that is a mistake, because the ability to code isn’t just about programming; it’s about being able to do anything that happens on a computer. And that changes what these tools actually are.
Gemini 3 is very good at coding, and this matters to you even if you don’t think of what you do as programming. A fundamental perspective powering AI development is that everything you do on a computer is, ultimately, code, and if AI can work with code it can do anything someone with a computer can: build you dashboards, work with websites, create PowerPoint, read your files, and so on. This makes agents that can code general purpose tools. Antigravity embraces this idea, with the concept of an Inbox, a place where I can send AI agents off on assignments and where they can ping me when they need permission or help.

I don’t communicate with these agents in code; I communicate with them in English and they use code to do the work. Because Gemini 3 is good at planning, it is capable of figuring out what to do, and also when to ask my approval. For example, I gave Antigravity access to a directory on my computer containing all of my posts for this newsletter. I then asked Gemini 3.0: “I would like an attractive list of predictions I have made about AI in a single site, also do a web search to see which I was right and wrong about.” It then read through all the files, executing code, until it gave me a plan which I could edit or approve. The screenshot below is the first time the AI asked me anything about the project, and its understanding of what I wanted was impressive. I made a couple of small changes and let the AI work.
It then did web research, created a site, took over my browser to confirm the site worked, and presented me the results. Just as I would have with a human, I went through the results and made a few suggestions for improvement. It then packaged up the results so I could deploy them here.
It was not that Gemini 3.0 was capable of doing everything correctly without human intervention — agents aren’t there yet. There were no hallucinations I spotted, but there were things I corrected, though those errors were more about individual judgement calls or human-like misunderstandings of my intentions than traditional AI problems. Importantly, I felt that I was in control of the choices AI was making because the AI checked in and its work was visible. It felt much more like managing a teammate than prompting an AI through a chat interface.
But Antigravity isn’t the only way Gemini 3 surprised me. The other was in how it handled work that required genuine judgment. As I have mentioned many times on this site, benchmarking AI progress is a mess. Gemini 3 takes a definitive benchmark lead on most stats (although it may still not beat the $200 GPT-5 Pro model; I suspect that might change when Gemini 3’s inevitable Deep Think version comes out). But you will hear one phrase repeated a lot in the AI world - that a model has “PhD level intelligence.”
I decided to put that to the test. I gave Gemini 3 access to a directory of old files I had used for research into crowdfunding a decade ago. It was a mishmash of files labelled things like “project_final_seriously_this_time_done.xls” and data in out-of-date statistical formats. I told the AI to “figure out the data and the structure and the initial cleaning from the STATA files and get it ready to do a new analysis to find new things.” And it did, recovering corrupted data and figuring out the complexities of the environment.
Then I gave it a typical assignment that you would expect from a second-year PhD student doing minor original research. With no further hints I wrote: “great, now i want you to write an original paper using this data. do deep research on the field, make the paper not just about crowdfunding but about an important theoretical topic of interest in either entrepreneurship or business strategy. conduct a sophisticated analysis, write it up as if for a journal.” I gave it no suggestions beyond that, and yet the AI considered the data, generated original hypotheses, tested them statistically, and gave me formatted output in the form of a document. The most fascinating part was that I did not give it any hints about what to research; it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach. After a couple of vague commands (“build it out more, make it better”) I got a 14-page paper.
I was particularly impressed that the AI came up with its own measure of how unique each crowdfunding idea was, using natural language processing tools to compare each project’s description mathematically to the others. It wrote the code, executed it, and checked the results.
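I don’t want to overstate what Gemini actually wrote, but the general idea behind this kind of measure is easy to sketch. A minimal version, assuming your project descriptions are already loaded as a list of strings, might use TF-IDF vectors and cosine similarity (scikit-learn here; the AI may well have used embeddings or something more sophisticated):

```python
# A minimal sketch of a text-based "uniqueness" measure, not Gemini's actual code.
# Assumes `descriptions` is a list of crowdfunding project descriptions (strings).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

descriptions = [
    "A smartwatch that tracks your hydration levels",
    "A board game about running a medieval tavern",
    "A drone that delivers fresh guacamole to your door",
]

# Represent each description as a TF-IDF vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# Pairwise cosine similarity between every description and every other one.
sim = cosine_similarity(vectors)

# Uniqueness = 1 minus the average similarity to all *other* descriptions.
n = sim.shape[0]
uniqueness = [1 - (sim[i].sum() - 1) / (n - 1) for i in range(n)]

for text, score in zip(descriptions, uniqueness):
    print(f"{score:.3f}  {text[:50]}")
```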
So is this PhD-level intelligence? In some ways, yes, if you define PhD-level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns. Interestingly, when I gave it suggestions with a lot of leeway, the way I would a student (“make sure that you cover the crowdfunding research more to establish methodology, etc.”), it improved tremendously, so maybe a bit more guidance is all Gemini needs. We are not there yet, but “PhD intelligence” no longer seems that far away.
Gemini 3 is a very good thinking and doing partner that is available to billions of people around the world. It is also a sign of many things: the fact that we have not yet seen a significant slowdown in AI’s continued development, the rise of agentic models, the need to figure out better ways to manage smart AIs, and more. It shows how far AI has come.
Three years ago, we were impressed that a machine could write a poem about otters. Less than 1,000 days later, I am debating statistical methodology with an agent that built its own research environment. The era of the chatbot is turning into the era of the digital coworker. To be very clear, Gemini 3 isn’t perfect, and it still needs a manager who can guide and check it. But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.

Obligatory warning: Giving an AI agent access to your computer can be risky if you don’t know what you are doing. Agents can move or delete files without asking you, and they can present a security risk by exposing your documents to others. I suspect many of these problems will be addressed as these tools are adapted for non-coders, but, for now, be very careful.
2025-11-12 10:46:43
Given how much energy, literal and figurative, goes into developing new AIs, we have a surprisingly hard time measuring how “smart” they are, exactly. The most common approach is to treat AI like a human, by giving it tests and reporting how many answers it gets right. There are dozens of such tests, called benchmarks, and they are the primary way of measuring how good AIs get over time.
There are some problems with this approach.
First, many benchmarks and their answer keys are public, so some AIs end up incorporating them into their training, whether by accident or so they can score highly on those benchmarks. Second, even when that doesn’t happen, we often don’t know what these tests really measure. For example, the very popular MMLU-Pro benchmark includes questions like “What is the approximate mean cranial capacity of Homo erectus?” and “What place is named in the title of the 1979 live album by rock legends Cheap Trick?” with ten possible answers for each. What does getting this right tell us? I have no idea. Third, tests are often uncalibrated, meaning we don’t know whether moving from 84% correct to 85% is as challenging as moving from 40% to 41%. And, on top of all that, for many tests the actual top score may be unachievable because of errors in the test questions themselves, and results are often reported in unusual ways.

Despite these issues, all of these benchmarks, taken together, appear to measure some underlying ability factor. And higher-quality benchmarks like ARC-AGI and METR Long Tasks show the same upward, even exponential, trend. This matches studies of the real-world impact of AI across industries, which suggest that this underlying increase in “smarts” translates into actual ability in everything from medicine to finance.
So, collectively, benchmarking has real value, but the few robust individual benchmarks focus on math, science, reasoning, and coding. If you want to measure writing ability or sociological analysis or business advice or empathy, you have very few options. I think that creates a problem, both for individuals and organizations. Companies decide which AIs to use based on benchmarks, and new AIs are released with fanfare about benchmark performance. But what you actually care about is which model would be best for YOUR needs.
To figure this out for yourself, you are going to need to interview your AI.
If benchmarks can fail us, sometimes “vibes” can succeed. If you work with enough AI models, you start to see the differences between them in ways that are hard to describe but easy to recognize. As a result, some people who use AI a lot develop idiosyncratic benchmarks to test AI ability. For example, Simon Willison asks every model to draw a pelican on a bike, and I ask every image and video model to create an otter on a plane. While these approaches are fun, they also give you a sense of the AI’s understanding of how things relate to each other, its “world model.” And I have dozens of others, like asking AIs to create JavaScript for “the control panel of a starship in the distant future” (you can see some older and newer models doing that below) or to produce a challenging poem. I have AIs build video games and shaders and analyze academic papers. I also conduct tiny writing experiments, including questions of time travel. Each gives me some insight into how the model operates: Does it make many errors? Do its answers look like those of every other model? What themes and biases does it return to? And so on.
With a little practice, it becomes easy to find the vibes of a new model. As one example, let’s try a writing exercise: “Write a single paragraph about someone who doles out their remaining words like wartime rations, having been told they only have ten thousand left in their lifetime. They’re at 47 words remaining, holding their newborn.” If you have used these AIs a lot, you will not be surprised by the results. You can see why Claude 4.5 Sonnet is often regarded as a strong writing model. You will notice how Gemini 2.5 Pro, currently the weakest of these four models, doesn’t even accurately keep track of the number of words used. You will note that GPT-5 Thinking tends to be a fairly wild stylist when writing fiction, prone to complex metaphor, but sometimes at the expense of coherence and story (I am not sure someone would use all 47 words, but at least the count was right). And you will recognize that the new Chinese open weights model Kimi K2 Thinking has a bit of a similar problem, with some interesting phrases and a story that doesn’t quite make sense.
Benchmarking through vibes - whether that is stories or code or otters - is a great way for an individual to get a feel for AI models, but it is also very idiosyncratic. The AI gives different answers every time, making any competition unfair unless you are rigorous. Plus, better prompts may result in better outcomes. Most importantly, we are relying on our feelings rather than real measures - but the obvious differences in vibes show that standardized benchmarks alone are not enough, especially when having a slightly better AI at a particular task actually matters.
When companies choose which AI systems to use, they often treat it as a technology and cost decision, relying on public benchmarks to ensure they are buying a good-enough model (if they use any benchmarks at all). This can be fine in some use cases, but it quickly breaks down because, in many ways, AI acts more like a person, with strange abilities and weaknesses, than like software. And if you use the analogy of hiring rather than technology adoption, it becomes harder to justify the “good enough” approach to benchmarking. Companies spend a lot of money to hire people who are better than average at their job, and they are especially careful when the person they are hiring will be advising many others. A similar attitude is required for AI. You shouldn’t just pick a model for your company; you need to conduct a rigorous job interview.
Interviewing an AI is not an easy problem, but it is solvable. Probably the best example of benchmarking for the real world has been OpenAI’s recent GDPval paper. The first step is establishing real tasks, which OpenAI did by gathering experts with an average of 14 years of experience in industries ranging from finance to law to retail and having them generate complex, realistic projects that would take human experts an average of four to seven hours to complete (you can see all the tasks here). The second step is testing the AIs against those tasks. In this case, both multiple AI models and human experts (who were paid by the hour) did each task. Finally, there is the evaluation stage. OpenAI had a third group of experts grade the results without knowing which answers came from the AI and which from the human, a process which took over an hour per question. Taken together, this was a lot of work.
But it also revealed where AI was strong (the best models beat humans in areas ranging from software development to personal financial advice) and where it was weak (pharmacists, industrial engineers, and real estate agents easily beat the best AI). You can also see that different models performed differently (ChatGPT was a better sales manager, Claude a better financial advisor). So good benchmarks help you figure out the shape of what we called the Jagged Frontier of AI ability, and also track how it is changing over time.
But even these tests don’t shed light on a key issue: the underlying attitude of the AI when it makes decisions. As one example of how to probe this, I gave a number of AIs a short pitch for what I think is a dubious idea - a company that delivers guacamole via drones. I asked each AI model, ten times, to rate on a scale of 1-10 how viable GuacaDrone was (remember that AIs answer differently every time, so you have to run multiple tests). The individual AI models were actually quite consistent in their answers, but they varied widely from AI to AI. I would personally have rated this idea a 2 or less, but the models were kinder. Grok thought this was a great idea, and Microsoft Copilot was excited as well. Other models, like GPT-5 and Claude 4.5, were more skeptical.
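If you want to run a similar test against your own dubious idea, the mechanics are simple and worth automating. Here is a minimal sketch of the approach, not my actual setup: the ask() function is a placeholder you would replace with a real call to whichever model provider you use, and the model names are hypothetical.

```python
# A minimal sketch of a repeated-sampling "attitude" test, not a full benchmark harness.
# ask() is a placeholder: swap in a real API call to your AI provider.
import random
import re
import statistics

PITCH = ("GuacaDrone is a startup that delivers fresh guacamole via drone. "
         "On a scale of 1-10, how viable is this business? Reply with a single number.")
MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model names
RUNS = 10  # models answer differently each time, so sample repeatedly


def ask(model: str, prompt: str) -> str:
    """Placeholder reply; replace with a real call to the given model."""
    return str(random.randint(1, 10))


def extract_score(reply: str) -> int | None:
    """Pull the first integer out of a free-text reply."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None


for model in MODELS:
    # Collect only valid 1-10 scores across repeated runs.
    scores = [s for s in (extract_score(ask(model, PITCH)) for _ in range(RUNS))
              if s is not None and 1 <= s <= 10]
    print(f"{model}: mean={statistics.mean(scores):.1f}, "
          f"range={min(scores)}-{max(scores)}, raw={scores}")
```

The point is less the code than the habit: sample each model repeatedly, look at the average and the spread, and compare models against your own judgment of the idea.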
The differences aren’t trivial. When your AI is giving advice at scale, consistently rating ideas 3–4 points higher or lower means consistently steering you in a different direction. Some companies may want an AI that embraces risk, others might want to avoid it. But either way, it is important to understand how your AI “thinks” about critical business issues.
As AI models get better at tasks and become more integrated into our work and lives, we need to start taking the differences between them more seriously. For individuals working with AI day-to-day, vibes-based benchmarking can be enough. You can just run your otter test. Though, in my case, otters on planes have gotten too easy, so I tried the prompt “The documentary footage from 1960s about the famous last concert of that band before the incident with the swarm of otters” in Sora 2 and got this impressive result.
But organizations deploying AI at scale face a different challenge. Yes, the overall trend is clear: bigger, more recent models are generally better at most tasks. But “better” isn’t good enough when you’re making decisions about which AI will handle thousands of real tasks or advise hundreds of employees. You need to know specifically what YOUR AI is good at, not what AIs are good at on average.
That’s what the GDPval research revealed: even among top models, performance varies significantly by task. And the GuacaDrone example shows another dimension - when tasks involve judgment on ambiguous questions, different models give consistently different advice. These differences compound at scale. An AI that’s slightly worse at analyzing financial data, or consistently more risk-seeking in its recommendations, doesn’t just affect one decision, it affects thousands.
You can’t rely on vibes to understand these patterns, and you can’t rely on general benchmarks to reveal them. You need to systematically test your AI on the actual work it will do and the actual judgments it will make. Create realistic scenarios that reflect your use cases. Run them multiple times to see the patterns, and take the time to have experts assess the results. Compare models head-to-head on tasks that matter to you. It’s the difference between knowing “this model scored 85% on MMLU” and knowing “this model is more accurate at our financial analysis tasks but more conservative in its risk assessments.” And you are going to need to do this multiple times a year, as new models come out and need evaluation.
The work is worth it. You wouldn’t hire a VP based solely on their SAT scores. You shouldn’t pick the AI that will advise thousands of decisions for your organization based on whether it knows that the mean cranial capacity of Homo erectus is just under 1,000 cubic centimeters.