2026-01-28 00:55:55
I just taught an experimental class at the University of Pennsylvania where I challenged students to create a startup from scratch in four days. Most of the people in the class were in the executive MBA program, so they were taking classes while also working as doctors, managers, or leaders in a variety of large and small companies. Few had ever coded. I introduced them to Claude Code and Google Antigravity, which they needed to use to build a working prototype. But a prototype alone is not a startup, so they used ChatGPT, Claude, and Gemini to accelerate the idea generation, market research, competitive positioning, pitching, and financial modelling processes. I was curious how far they could get in such a short time. It turns out they got very far.

I’ve been teaching entrepreneurship for a decade and a half, and I've seen thousands of startup ideas (some of which turned into large companies), so I have a good sense of what a class of smart MBA students can accomplish. I would estimate that what I saw in a couple of days was an order of magnitude further along the path to a real startup than I had seen out of students working over a full semester before AI. Most of the prototypes were not just sample screens but actually had a core feature working. Ideas were far more diverse and interesting than usual. Market and customer analyses were insightful. It was really impressive. These were not yet working startups, nor were they fully operational products (with a couple of exceptions), but they had shaved months and huge amounts of money and effort from the traditional process. And there was something else: most early startups need to pivot, changing direction as they learn more about what the market wants and what is technically possible. Because AI lowered the cost of pivoting, it was much easier to explore the possibilities without being locked in, or even to explore multiple startups at once: you just tell the AI what you want.
I wish I could say this impressive output was the result of my brilliant teaching, but we don’t really have a great framework yet for how to use all these tools, so the students largely figured it out on their own. It helped that they had some management and subject matter expertise because it turns out that the key to success was actually the last bit of the previous paragraph: telling the AI what you want. As AIs are increasingly capable of tasks that would take a human hours to do, and as evaluating those results becomes increasingly time consuming, the value of being good at delegation increases. But when should you delegate to AI?
We actually have an answer, but it is a bit complicated. Consider three factors: First, because of the Jagged Frontier of AI ability, you don’t reliably know what the AI will be good or bad at on complex tasks. Second, whether the AI is good or bad, it is definitely fast. It produces work in minutes that would take many hours for a human to do. Third, it is cheap (relative to professional wages), and it doesn’t mind if you generate multiple versions and throw most of them away.
These three factors mean that deciding to delegate to AI depends on three variables:
Human Baseline Time: how long the task would take you to do yourself
Probability of Success: how likely the AI is to produce an output that meets your bar on a given attempt
AI Process Time: how long it takes you to request, wait for, and evaluate an AI output
A useful mental model is that you’re trading off “doing the whole task” (Human Baseline Time) against “paying the overhead cost” (AI Process Time), possibly multiple times until you get something acceptable. The higher Probability of Success is, the fewer times you have to pay AI Process Time, and the more useful it is to turn things over to the AI. For example, consider a task that takes you an hour to do, but the AI can do it in minutes, though checking the answer takes thirty minutes. In that case, you should only give the work to the AI if Probability of Success is very high, otherwise you’ll spend more time generating and checking drafts than just doing it yourself. If the Human Baseline Time is 10 hours, though, it could be worth several hours of working with the AI, assuming that the AI can be made to do a competent job.
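The trade-off above can be written down as a small calculation. This is only a minimal sketch, assuming every attempt succeeds independently with the same probability (so the expected number of attempts is 1/p) and that you keep retrying until one attempt is acceptable. The function names and the five minutes of prompting and waiting are illustrative assumptions, not figures from the text.

```python
def expected_ai_hours(ai_process_hours: float, p_success: float) -> float:
    """Expected hours spent if you retry the AI until one attempt is acceptable.

    Assumes independent attempts, so the expected number of attempts is 1 / p.
    """
    if not 0 < p_success <= 1:
        raise ValueError("p_success must be in (0, 1]")
    return ai_process_hours / p_success


def break_even_probability(human_hours: float, ai_process_hours: float) -> float:
    """Success probability at which delegating takes as long as doing it yourself."""
    return ai_process_hours / human_hours


# The one-hour task from the text: 30 minutes to check each draft, plus an
# ASSUMED 5 minutes to prompt and wait for the output.
overhead = 35 / 60
print(round(break_even_probability(1.0, overhead), 2))   # 0.58
print(round(break_even_probability(10.0, overhead), 3))  # 0.058
```

Under these assumptions the one-hour task only pays off if the AI succeeds well over half the time, while the ten-hour task pays off even when most attempts fail.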

We know this equation works because this past summer, OpenAI released one of the more important papers on AI and real work, GDPval. I have discussed it before, but the key was that it pitted experienced human experts in diverse fields from finance to medicine to government against the latest AIs, with another set of experts working as judges. It took experts seven hours on average to do the work, so, in this case, that is the Human Baseline Time. The AI Process Time was interesting: the AI took only minutes for tasks, but it required an hour for experts to actually check the work, and, of course, prompts take time to write as well. As for Probability of Success, when GDPval first came out, judges gave human work the win the majority of the time, but, with the release of GPT-5.2, the balance shifted. GPT-5.2 Thinking and Pro models tied or beat human experts an average of 72% of the time.

We can now calculate how many hours you would save on a seven-hour task, assuming that 72% probability of success and an hour of evaluation. If you tried every task by taking the time to prompt the AI, evaluating the answer for an hour, and then doing it yourself if the AI answer was bad, you would save 3 hours on average. Tasks the AI failed on would take longer (you wasted time prompting and reviewing!) but tasks the AI succeeded on would be much faster. But we can change the equation even more in our favor using techniques from management!
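Here is a rough sketch of that calculation for the "try the AI once, then do it yourself if it fails" strategy. The text does not say how much prompting time it assumes, so the prompting values below are assumptions; under this simple model, a saving of about three hours corresponds to roughly two hours of total overhead per task (prompting plus the hour of evaluation).

```python
def expected_hours_saved(human_hours: float, p_success: float, overhead_hours: float) -> float:
    """Hours saved on average if you try the AI once and fall back to doing it yourself.

    You always pay the overhead (prompting + evaluating); with probability
    (1 - p_success) you also pay the full human time.
    """
    expected_with_ai = overhead_hours + (1 - p_success) * human_hours
    return human_hours - expected_with_ai


# Seven-hour task, 72% success, one hour of evaluation, and an ASSUMED
# amount of prompting time on top.
for prompt_hours in (0.0, 0.5, 1.0):
    saved = expected_hours_saved(7.0, 0.72, overhead_hours=prompt_hours + 1.0)
    print(f"prompting {prompt_hours:.1f} h -> saves {saved:.2f} h on average")
# prompting 0.0 h -> saves 4.04 h on average
# prompting 0.5 h -> saves 3.54 h on average
# prompting 1.0 h -> saves 3.04 h on average
```

The same function shows why raising the probability of success or cutting evaluation time matters: both terms feed directly into the hours saved.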
There are three things we can do to make delegating to AI more worthwhile by increasing the Probability of Success and lowering AI Process Time. We can give better instructions, setting clear goals that the AI can execute on with a higher chance of succeeding. We can get better at evaluation and feedback, so we need to make fewer attempts to get the AI to do the right thing. And we can make it easier to evaluate whether the AI is good or bad at a task without spending as much time. All of these factors are improved by subject matter expertise — an expert knows what instructions to give, they can better see when something goes wrong, and they are better at correcting it.
If you don’t need something specific, AI models have become incredibly capable of figuring out how to solve problems themselves. For example, I found Claude Code was able to generate an entire 1980s style adventure game with one prompt to "create an entirely original old-school Sierra style adventure game with EGA-like graphics. You should use your image agent to generate images and give me a parser. Make all puzzles interesting and solvable. Finish the game (it should take 10-15 minutes to play), don’t ask any questions. make it amazing and delightful." That’s it, the AI made everything, including the art. With two final prompts it tested the game and deployed it. You can play it yourself: enchanted-lighthouse-game.netlify.app
This is genuinely amazing, but that amazement is amplified because I didn’t need anything specific, just an adventure game that the AI was free to improvise. But real work, and real delegation, means that you have a specific output in mind, and that is where things can get tricky. How do you communicate your intention to the AI to execute on what you want, so it can use “judgement” to solve problems while still giving you the output you desire?
This problem existed long before AI and is so universal that every field has invented its own paperwork to solve it. Software developers write Product Requirements Documents. Film directors hand off shot lists. Architects create design intent documents. The Marines use Five Paragraph Orders (situation, mission, execution, administration, command). Consultants scope engagements with detailed deliverable specs. All of these documents work remarkably well as AI prompts for this new world of agentic work (and the AI can handle many pages of instructions at a time). The reason you can use so many formats to instruct AI is that all of these are really the same thing: attempts to get what’s in one person’s head into someone else’s actions.
When you look at what actually goes into good delegation documentation, it’s remarkably consistent: What are we trying to accomplish, and why? Where are the limits of the delegated authority? What does “done” look like? What specific outputs do I need? What interim outputs do I need to follow your progress? And what should you check before telling me you’re finished? If these are well-specified, the AI, like humans, is far more likely to do a good job.
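One way to make those questions concrete is to treat them as fields you fill in before handing work over. The sketch below is purely illustrative: the `DelegationBrief` class, its field names, and the example startup are all hypothetical, and any document format that answers the same questions would do the job equally well.

```python
from dataclasses import dataclass, field


@dataclass
class DelegationBrief:
    goal: str                 # what are we trying to accomplish?
    why: str                  # ...and why?
    limits: list[str]         # where delegated authority ends
    done_when: str            # what "done" looks like
    outputs: list[str]        # specific outputs needed
    checkpoints: list[str] = field(default_factory=list)   # interim outputs
    final_checks: list[str] = field(default_factory=list)  # checks before reporting finished

    def to_prompt(self) -> str:
        """Render the brief as plain text that can be pasted into an AI tool."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"- {item}" for item in items) if items else "- (none specified)"

        sections = [
            ("Goal", self.goal),
            ("Why it matters", self.why),
            ("Limits of your authority", bullets(self.limits)),
            ("Definition of done", self.done_when),
            ("Required outputs", bullets(self.outputs)),
            ("Interim outputs to share", bullets(self.checkpoints)),
            ("Check before reporting finished", bullets(self.final_checks)),
        ]
        return "\n\n".join(f"{title}:\n{body}" for title, body in sections)


# Hypothetical example content, invented for illustration.
brief = DelegationBrief(
    goal="Build a three-year financial model for a hypothetical clinic-scheduling startup.",
    why="It will back up the pitch to investors at the end of the week.",
    limits=["Do not invent market data; flag anything you had to assume."],
    done_when="A spreadsheet whose totals reconcile and a one-page summary.",
    outputs=["Spreadsheet", "One-page summary"],
    checkpoints=["List of assumptions before building the model"],
    final_checks=["Revenue, cost, and cash totals agree across tabs"],
)
print(brief.to_prompt())
```

The required fields have no defaults on purpose: you cannot create a brief without stating the goal, the limits, and what "done" means.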
And in figuring out how to give these instructions to the AI, it turns out you are basically reinventing management.
I find it interesting to watch as some of the most well-known software developers at the major AI labs note how their jobs are changing from mostly programming to mostly management of AI agents. Coding has always had a very organized structure, with clearly verifiable outputs (the code either works or it doesn’t) so it has been one of the first areas where AI tools have matured, and thus the first profession to feel this change. It isn’t the last.
As a business school professor, I think many people have the skills they need, or can learn them, in order to work with AI agents - they are management 101 skills. If you can explain what you need, give effective feedback, and design ways of evaluating work, you are going to be able to work with agents. In many ways, at least in your area of expertise, it is much easier than trying to design clever prompts to help you get work done, as it is more like working with people. At the same time, management has always assumed scarcity: you delegate because you can’t do everything yourself, and because talent is limited and expensive. AI changes the equation. Now the “talent” is abundant and cheap. What’s scarce is knowing what to ask for.
This is why my students did so well. They weren’t AI experts. But they’d spent years learning how to scope problems in their fields of expertise, define deliverables, and recognize when a financial model or medical report was off. They had hard-earned frameworks from classes and jobs, and those frameworks became their prompts. The skills that are so often dismissed as “soft” turned out to be the hard ones.
I don’t know exactly what work looks like when everyone is a manager with an army of tireless agents. But I suspect the people who thrive will be the ones who know what good looks like — and can explain it clearly enough that even an AI can deliver it. My students figured this out in four days. Not because they were AI natives, but because they already knew how to manage. All that training, it turns out, was accidentally preparing them for exactly this moment.
2026-01-08 07:00:10
I opened Claude Code and gave it the command: “Develop a web-based or software-based startup idea that will make me $1000 a month where you do all the work by generating the idea and implementing it. i shouldn’t have to do anything at all except run some program you give me once. it shouldn’t require any coding knowledge on my part, so make sure everything works well.” The AI asked me three multiple choice questions and decided that I should be selling sets of 500 prompts for professional users for $39. Without any further input, it then worked independently… FOR AN HOUR AND FOURTEEN MINUTES creating hundreds of code files and prompts. And then it gave me a single file to run that created and deployed a working website (filled with very sketchy fake marketing claims) that sold the promised 500 prompt set. You can actually see the site it launched here, though I removed the sales link, which did actually work and would have collected money. I strongly suspect that if I ignored my conscience and actually sold these prompt packs, I would make the promised $1,000.

This is Claude Code at work, one of a new generation of AI coding tools that represent a sudden capability leap in AI in the past month or so. What makes these new tools suddenly powerful is not one breakthrough, but a combination of two advances. First, the latest AIs are capable of doing far more work autonomously while self-correcting many of their errors, especially in programming tasks. Second, the AIs are being given an “agentic harness” of tools and approaches that they can use to solve problems in new ways. Together, these two factors have led to big leaps in the latest AI tools made by the big AI companies.

Unfortunately for most of us who want to experiment with AI, these new tools are built for programmers. And I mean they are really built for programmers: they assume that you understand Python commands and programming best practices and they are wrapped in interfaces that look like something from a 1980s computer lab. They are also explicitly designed to help analyze, troubleshoot, and write code using approaches that fit into existing programmer workflows. In a lot of ways, this is a shame, because these systems are actually broadly useful to knowledge workers of all types, and, by seeing what they can do (and experimenting with them yourself), I think you can learn a lot about the future of AI. In this post, we are going to focus on one in particular, Claude Code powered by Opus 4.5, but it works similarly to its main competition OpenAI’s Codex with GPT-5.2 and Google’s Antigravity with Gemini 3.
To return to the example of the startup company launched by Claude Code, as practically impressive as this was, it touched only a small part of what the tool is capable of. In that case, I only used Claude Code for coding, but if I ask it to do user testing of the live site from different personas and give me a report, it deploys one of its many tools, its connection to the web browser on my computer. Claude takes control of the browser and goes to the site it created, scrolling through it like a human would. On the first pass, it gave me a pretty optimistic report, but, because I know that AIs tend to be sycophantic, I also asked it for a more critical one. This second report did a better job nailing potential issues (and spotting the sketchy fake reviews that were on the site). As a next step, I could easily ask it to implement its suggestions, continuing the process with minimal input from me.
A big reason Claude Code is so good is that it uses a wide variety of tricks in its agentic harness that allow its very smart AI, Opus 4.5, to overcome many of the problems of LLMs. For example, an interesting thing happened while the AI was doing its user research: its context window filled up. As you might know, AIs can only “remember” so much information at a time. This context window is often quite long by human standards (150,000 words or more) but it gets filled up remarkably quickly because it contains your entire conversation, every document the AI reads, every image it takes, and the initial system prompts that help guide the AI. There is no real long-term memory for AI, so as soon as the context window fills up, the AI cannot remember anything else. If you are just having a casual chat, this isn’t really a problem. Any long conversation with ChatGPT features a rolling context window: the AI is constantly forgetting the oldest part of its conversation, but it is generally able to keep up by improvising based on the most recent parts of the discussion. If you are doing real work, however, having the AI forget some of your code as it reads new code becomes a big problem.
Claude Code handles this issue in a different way. When it runs out of context, it stops and “compacts” the conversation so far, taking notes about exactly where it was when it stopped. Then it clears its context window, and the fresh version of Claude Code reads the notes and reviews the progress to date - think of the amnesiac main character from the movie Memento looking at his tattoos for reference whenever he wakes up with no memory. These notes give Claude everything it needs to keep moving. This is why Claude can run for hours at a time: it carefully notes what it is doing along the way and produces interim work, like pieces of software and reports, that it can refer to.
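To make the difference between a rolling window and compaction concrete, here is a toy sketch. It is not how Claude Code is actually implemented: the word-count budget, the `summarize` stand-in (which just keeps the first sentence of each message instead of asking an AI to write notes), and all names are assumptions made for illustration.

```python
def count_words(messages: list[str]) -> int:
    return sum(len(m.split()) for m in messages)


def rolling_window(messages: list[str], budget_words: int) -> list[str]:
    """Rolling context: silently drop the oldest messages until the rest fit."""
    kept = list(messages)
    while len(kept) > 1 and count_words(kept) > budget_words:
        kept.pop(0)
    return kept


def summarize(messages: list[str]) -> str:
    """Stand-in for AI-written notes: keep the first sentence of each message."""
    return "NOTES: " + " | ".join(m.split(".")[0] for m in messages)


class CompactingContext:
    """Compaction: when the budget is exceeded, replace everything with notes."""

    def __init__(self, budget_words: int) -> None:
        self.budget_words = budget_words
        self.messages: list[str] = []
        self.compactions = 0

    def add(self, message: str) -> None:
        self.messages.append(message)
        if count_words(self.messages) > self.budget_words:
            self.messages = [summarize(self.messages)]  # start fresh from the notes
            self.compactions += 1


history = [
    "Plan the site. Use three pages.",
    "Wrote index page. Tests pass.",
    "Wrote pricing page. One bug left.",
]

print(rolling_window(history, budget_words=12))  # the original plan is gone

ctx = CompactingContext(budget_words=12)
for message in history:
    ctx.add(message)
print(ctx.messages)  # one notes entry that still mentions the plan
```

The rolling window forgets the plan entirely, while the compacted context keeps a shortened trace of every step, which is the property that lets long-running work continue.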
This is not the only trick Claude Code uses to get around the limitations of AI. Another is its use of Skills. As everyone reading this post knows, users have to prompt AIs to do things. These prompts act as instructions, and, as AIs have gotten smarter, they have become much better at executing complex prompts, even hundred-page-long prompts. These long prompts take up a lot of the context window, however, and require giving the AI the right prompt at the right time. That either means that you, as a human, have to keep prompting the AI or you have to design a complex automated system that keeps feeding the AI prompts.
Skills solve this problem. They are instructions that the AI decides when to use, and they contain not just prompts, but also the sets of tools the AI needs to accomplish a task. Does it need to know how to build a great website? It loads up the Website Creator Skill which explains how to build a website and the tools to use when doing it. Does it need to build an Excel spreadsheet? It loads the Excel skill with its own instructions and tools. To make another movie reference, it is like when Neo in The Matrix gets martial arts instructions uploaded to his head and acquires a new skill: “I know kung fu.” Skills can let an AI cover an entire process by swapping out knowledge as needed. For example, Jesse Vincent released an interesting free list of skills that let Claude Code handle a full software development process, picking up skills as needed, starting with brainstorming and planning before progressing all the way to testing code. Skill creation is technically very easy: it is done in plain language, and the AI can actually help you create them (more on this in a bit).
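The core idea, short descriptions that are always visible and long instructions that are loaded only when needed, can be sketched in a few lines. This is a toy model rather than the real Skills mechanism: in practice the AI itself decides which skill applies, whereas this sketch uses simple keyword overlap as a stand-in, and the skill contents are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Skill:
    name: str
    description: str   # short text that is always visible
    instructions: str  # long text, loaded into context only when chosen


def pick_skill(task: str, skills: list[Skill]) -> Optional[Skill]:
    """Choose the skill whose description shares the most words with the task.

    Keyword overlap is an ASSUMED stand-in for the model's own judgement.
    """
    task_words = set(task.lower().split())

    def overlap(skill: Skill) -> int:
        return len(task_words & set(skill.description.lower().split()))

    best = max(skills, key=overlap)
    return best if overlap(best) > 0 else None


# Hypothetical skills, invented for illustration.
skills = [
    Skill("website-creator", "build a website with pages and styling",
          "Long instructions about layout, hosting, and testing..."),
    Skill("excel", "build an excel spreadsheet with formulas",
          "Long instructions about sheets, formulas, and charts..."),
]

chosen = pick_skill("build an excel spreadsheet of my expenses", skills)
context = [chosen.instructions] if chosen else []
print(chosen.name if chosen else "no skill needed")  # excel
```

Only the chosen skill's instructions enter the context, so the other skills cost almost nothing until they are needed.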

Along with Skills, Claude Code has other tricks up its sleeve to manage its limited context window and solve hard problems. It can also create subagents - effectively launching other, specialized AIs to solve specific problems. This can be useful in many ways. Because Opus is a large, expensive model, it can hand off easier tasks to cheaper and faster models. It also allows Claude to run many different processes at once, making it work like a team, rather than an individual. And these models can be very specialized with their own context windows. For example, I built separate subagents for research and for image creation. The main AI model “hires” these agents when needed to do specialized work.
And you don’t even need to create your own tools. Anyone can share Skills or subagents, and companies who want AI agents to work with their products can use an approach called the Model Context Protocol (MCP) to give any AI instructions and access. There are MCPs from publishers that let AI access scientific papers for research, MCPs from payment companies that give the AI the ability to analyze financial data, MCPs from software providers that let AI use a particular software product, and so on. The result is a very flexible system where a smart generalist AI like Claude Opus 4.5 can apply specialized skills on the fly, use tools as needed, and keep track of what it is doing.
Claude Code is particularly powerful because it works on your computer and your files. So now you have an AI that can do almost anything a human with access to your machine can do. It can read all your files and create new ones (PowerPoint and Word are just code, in the end, and Claude knows how to write code), access the web using your browser, write and execute programs for you, and more. Of course, AIs are not flawless and giving an AI access to your browser and computer creates all sorts of new risks and dangers. The AI might delete files it shouldn't, execute code with unintended consequences, or access sensitive data in your browser. Despite these warnings, I am going to give you a very quick intro to Claude Code, but make backups, use a dedicated folder, and don't give it access to anything you can't afford to lose.
Though I have been using the Command Line Interface for Claude Code in the screenshots so far, there is an easier way (as of yesterday!) to access Claude Code. You can do this with Claude Desktop, which you can download and install here (using it for any length of time requires at least a $20 monthly subscription). Right now, the Desktop version has a few fewer features than the Command Line Interface, but it is much easier for amateurs to use.
Now just give the AI access to a folder (remember that Claude can do anything to the files in that folder, so be careful if it is sensitive and make a backup) and you can start working with the AI: have it research and write reports, give it access to your credit card records so it can put them into a spreadsheet and tell you about any anomalies, ask it to do a data visualization, or whatever else you like. The most powerful options I mentioned earlier are accessed through slash commands that start with a “/” — typing /agents lets you set up subagents, /skills lets you create or download skills, and so on (the desktop version has limited slash commands, but the full set is coming). There are many ways people are using Claude Code, so you can experiment to figure out what works for you, but I would also suggest using it to actually code, even if you aren’t a coder.
For example, while I was writing this piece, I would occasionally go to a Claude Code window where I had the AI building a game for me for fun: a simulation of history where civilizations rise and fall, developing their own languages, cultures, and economies. Every few minutes, I would give the AI another seemingly impossible request: make sure the world has its own plate tectonics and weather; keep track of the family trees of rulers; build in an AI that dramatically summarizes events; and so on. After each change, the AI would playtest the results and produce a new version of the game. Unlike previous vibe coding experiences, the AI never got stuck or went in circles; it all went smoothly. Take a look at the video below. It is, I am sure, filled with issues that a competent coder would catch, but you can download the results here (the AI handled that part, too).
What does all this mean? If you're a programmer, you should already be exploring these tools. If you're programming-adjacent (an academic who works with data, a designer who wants to experiment with code, anyone who wants to try building a thing they are imagining) this is your moment to experiment. But there's a deeper point here: with the right harness, today's AIs are capable of real, sustained work that actually matters, and that, in turn, is starting to change how we approach tasks.
It is starting, unsurprisingly, with programming. One of the more famous coders in the AI world, Andrej Karpathy, recently posted: “I've never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue.” Don’t let the awkwardness of the current Claude Code or its specialization for coding fool you. New harnesses that make AI work for other knowledge tasks are coming in the near future, and so are the changes that they will bring.
2025-12-21 01:32:02
Back in the ancient AI days of 2023, my co-authors and I invented a term to describe the weird ability of AI to do some work incredibly well and other work incredibly badly in ways that didn’t map very well to our human intuition of the difficulty of the task. We called this the “Jagged Frontier” of AI ability, and it remains a key feature of AI and an endless source of confusion. How can an AI be superhuman at differential medical diagnosis or good at very hard math (yes, they are really good at math now, famously outside the frontier until recently) and yet still be bad at relatively simple visual puzzles or running a vending machine? The exact abilities of AI are often a mystery, so it is no wonder AI is harder to use than it seems.
I think jaggedness is going to remain a big part of AIs going forward, but there is less certainty over what it means. Tomas Pueyo posted this viral image on X that outlined his vision. In his view, the growing frontier will outpace jaggedness. Sure, the AI is bad at some things and may still be relatively bad even as it improves, but the collective human ability frontier is mostly fixed, and AI ability is growing rapidly. What does it matter if AI is relatively bad at running a vending machine, if the AI still becomes better than any human?
While the future is always uncertain, I think this conception misses out on a few critical aspects about the nature of work and technology. First, the frontier is very jagged indeed, and it might be that, because of this jaggedness, we get supersmart AIs which never quite fully overlap with human tasks. For example, a major source of jaggedness is that LLMs do not remember new tasks and learn from them in a permanent way. A lot of AI companies are pursuing solutions to this issue, but it may be that this problem is harder to solve than researchers expect. Without memory, AIs will struggle to do many tasks humans can do, even while being superhuman in other areas. Colin Fraser drew two examples of what this sort of AI-human overlap might look like. You can see how AI is indeed superhuman in some areas, but in others it is either far below human level or not overlapping at all. If this is true, then AI will create new opportunities working in complement with human beings, since we both bring different abilities to the table.
These are conceptual drawings, but a group of scientists recently tried to map the shape of AI ability and found that it was growing unevenly, just as the jagged frontier would predict. Reading, math, general knowledge, reasoning — all were things that AI was improving on rapidly. But memory, as we discussed, is a weak spot with very little improvement. Better prompting or better models (and GPT-5.2 is much better than GPT-5) might change the shape of the frontier, but jaggedness remains.
And even small amounts of jaggedness can create issues that make super-smart AIs unable to automate a task. A system is only as functional as its worst component. We call these problems bottlenecks. Some bottlenecks exist because the AI is stubbornly subhuman at some tasks. LLM vision systems aren’t good enough at reading medical imaging so they can’t yet replace doctors; LLMs are too helpful when they should push back so they can’t yet replace therapists; hallucinations persist even if they have become rarer, which means they can’t yet do tasks where 100% accuracy is required; and so on. If the frontier continues to expand, some of these problems may disappear, but weaknesses are not the only form of bottleneck.
Other bottlenecks come from processes that have nothing to do with ability. Even if AI can now identify promising drug candidates dramatically faster than traditional methods, clinical trials still need actual human patients who take actual time to recruit, dose, and monitor. The FDA still requires human review of applications. Even if AI increases the rate of good drug ideas by ten times or more, the constraint becomes the rate of approval, not the rate of discovery. The bottleneck migrates from intelligence to institutions, and institutions move at institution speed.
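The logic of a migrating bottleneck is easy to see in a toy pipeline where overall throughput is set by the slowest stage. The stage names and every number below are invented for illustration; they are not estimates of real drug development rates.

```python
def pipeline_throughput(stage_rates: dict[str, float]) -> float:
    """Items per year that clear every stage: the slowest stage sets the pace."""
    return min(stage_rates.values())


def bottleneck(stage_rates: dict[str, float]) -> str:
    """Name of the stage currently limiting the whole pipeline."""
    return min(stage_rates, key=stage_rates.get)


# Made-up rates (candidates per year each stage can handle).
before = {"discovery": 3, "clinical_trials": 4, "regulatory_review": 5}
after = {**before, "discovery": before["discovery"] * 10}  # AI makes discovery 10x faster

print(bottleneck(before), pipeline_throughput(before))  # discovery 3
print(bottleneck(after), pipeline_throughput(after))    # clinical_trials 4
```

In this made-up case a tenfold gain in discovery raises total output by only a third, because the constraint simply moves to the next-slowest stage.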
And even where the AI is almost completely superhuman, humans may be needed for edge cases. As an example, take a study that used AI to reproduce Cochrane reviews, the famous deeply researched meta-studies that synthesize many medical studies to figure out the scientific consensus on a topic. A team of researchers found that GPT-4.1, when properly prompted and supported, “reproduced and updated an entire issue of Cochrane reviews (n=12) in two days, representing approximately 12 work-years of traditional systematic review work.” The AI screened over 146,000 citations, read full papers, extracted data, and ran statistical analyses. It actually outperformed human reviewers on accuracy. Oddly, much of the hard intellectual work — finding relevant studies, pulling the right numbers, synthesizing results — is solidly inside the frontier. But the AI can't access supplementary files and it can't email authors to request unpublished data, things human reviewers do routinely. This makes up less than 1% of errors in the review, but those errors mean you can't fully automate the process. Twelve work-years become two days, but only if a human with expertise in how science is actually done handles the edge cases.
This is the pattern: jaggedness creates bottlenecks, and bottlenecks mean that even very smart AI cannot easily substitute for humans. At least not yet. This is likely good in some ways (preventing rapid job loss) but frustrating in others (making it hard to speed up scientific research as much as we might hope). Bottlenecks also concentrate the work of AI companies into making the AI better at things that are holding it back, the way math ability rapidly improved once it became an obvious barrier. The historian Thomas Hughes had a term for this. Studying how electrical systems developed, he noticed that progress often stalled on a single technical or social problem. He called these “reverse salients”: the lagging problems that hold the rest of the system back from leaping ahead.
Bottlenecks can create the impression that AI will never be able to do something, when, in reality, progress is held back by a single jagged weakness. When that weakness becomes a reverse salient, and AI labs suddenly fix the problem, the entire system can jump forward.
The most powerful example of this from the last month is Google’s new image generation AI, Nano Banana Pro (yes, AI companies are still bad at naming things). It combines two advances: a very good image creation model and a very smart AI that can help direct the model, looking up information as needed. For example, if I prompt Nano Banana Pro for the ultimate version of my otter test: “Scientists who are otters are using a white board to explain ethan mollicks otter on a plane using WiFi test of AI (you must search for this) and demonstrating it has been passed with a wall full of photos of otters on planes using laptops.” I get this:
Coherent words, different angles, shadows, no major misspellings. Pretty amazing stuff. Remember, the prompt “otter on a plane using wifi” got this image in 2021:
But it turns out that really good image generation was the bottleneck for a lot of new capabilities. For example, take PowerPoint decks. Every major AI company has been trying to get their AI to make PowerPoint, and they have done this by having the AIs write computer code (which they are very good at) to create a PowerPoint from scratch. This is a hard process, but both Claude and ChatGPT have improved a lot, even if their slides are a little dull. For example, I took my book, Co-Intelligence, and threw it into Claude and asked for a slide deck summary. The model is very smart, but the PowerPoint deck is limited by the fact that it has to be written in code.
Now here is the same thing in Google’s NotebookLM application, using its smart Gemini AI model combined with Nano Banana Pro. It isn’t using code, it is creating each slide as a single image. When image quality was low, this would have been impossible. Suddenly, it isn’t.
And since images are very flexible, I can play with style and approach. I had NotebookLM do a deep research report on science-backed methods of learning and then turn that into dense slide decks meant for reading in a variety of styles: one that looked hand-drawn, one that was inspired by 1980s punk, one that was “very dramatic and high contrast slides with a bright yellow background,” and, of course, one with an otter-on-a-plane theme.
In many ways, the hard stuff is inside the frontier for both Claude and Gemini: they can just take source materials, a topic, and an idea and summarize it in a slide. Hallucinations are very rare, and the sources are correct. They can create otter analogies or come up with a punk-themed description. This is the intellectually demanding part, and AIs have been capable of it for over a year. But making slides or other visual presentations was a bottleneck to making walls of text useful. The problem isn’t completely solved: images are not perfect, and you can’t edit them (apparently this will be fixed soon), but you can see where things are going.
Even if AI becomes superhuman at analysis and PowerPoint, I don’t think that means AI necessarily replaces the jobs of consultants and designers. Those jobs consist of many different tasks along the jagged frontier that AI is bad at and at which humans excel: can you collect information and get buy-in from the many parties involved? Can you understand the unwritten rules that determine what people actually need? Can you come up with something unique to address a deep issue, that stands out from AI material? The jagged frontier offers many opportunities for human work.
Yet, we should expect to see lurches forward, where focusing on reverse salients leads to sudden removals of bottlenecks. Areas of work that used to be only human become something that AI can do. If you want to understand where AI is headed, don’t watch the benchmarks. Watch the bottlenecks. When one breaks, everything behind it comes flooding through. Image generation was holding back presentations, documents, visual communication of all kinds. Now it isn’t. What’s the next bottleneck? Memory? Real-time learning? The ability to take actions in the physical world?
Somewhere, right now, an AI lab is treating each of these bottlenecks as a reverse salient. We won’t get much warning when they break through. But a jagged frontier cuts both ways. So far, every lurch forward leaves yet more edges in which humans are needed. There will be many lurches ahead. There will also be many opportunities. Pay attention to both.
2025-11-19 00:55:07
I’ve been testing Google’s new Gemini 3 model. It is very good. But rather than give you benchmark numbers, I’m going to show you how far AI has come in three years by asking the AI to demonstrate it.
It has been slightly less than three years since the release of ChatGPT. A few days before that launch, I wrote my first post on this Substack about OpenAI’s earlier GPT-3 model. Then ChatGPT came out, and I wrote immediately afterwards that “I am usually pretty hesitant to make technology predictions, but I think that this is going to change our world much sooner than we expect, and much more drastically. Rather than automating jobs that are repetitive & dangerous, there is now the prospect that the first jobs that are disrupted by AI will be more analytic; creative; and involve more writing and communication.”
I think I got that one right.
I could explain the difference between the original ChatGPT and the new AI model from Google, Gemini 3, that launched today. But I don’t have to. Instead, I gave the screenshot above to Gemini 3 with a single prompt:
Me: “show how far AI has come since this post by doing stuff.”
Gemini: “I can definitely show you the difference. The tweet you shared is from November 2022—right on the cusp of the modern AI boom. Back then, the “wow factor” was that an AI could write a coherent paragraph or a simple poem about a silly concept like a “candy-powered FTL drive escaping otters.”
To demonstrate how far we’ve come, I’m not just going to write text about it. I have built you a fully interactive, playable Candy-Powered FTL Starship Simulator.
In 2022, AI could describe the engine. In 2025, AI can code the engine, design the interface, and let you pilot the ship yourself.”
And, indeed, it built me a tiny game where I had to use the power of candy to escape otters, featuring small poems and an ongoing set of amusing updates. If you’re reading this blog, you probably knew AI could do that. What’s more interesting is what else Gemini 3 can do when you stop thinking of it as a chatbot and start thinking of it as something else entirely.
Along with Gemini 3, Google has launched Antigravity. For programmers, Antigravity should be familiar territory, it is similar to Claude Code and OpenAI Codex, specialized tools that can be given access to your computer and which can autonomously write computer programs with guidance. If you aren’t a programmer, you may dismiss Antigravity and similar tools. I think that is a mistake because the ability to code isn’t just about programming, it’s about being able to do anything that happens on a computer. And that changes what these tools actually are.
Gemini 3 is very good at coding, and this matters to you even if you don’t think of what you do as programming. A fundamental perspective powering AI development is that everything you do on a computer is, ultimately, code, and if AI can work with code it can do anything someone with a computer can: build you dashboards, work with websites, create PowerPoint, read your files, and so on. This makes agents that can code general purpose tools. Antigravity embraces this idea, with the concept of an Inbox, a place where I can send AI agents off on assignments and where they can ping me when they need permission or help.

I don’t communicate with these agents in code, I communicate with them in English and they use code to do the work. Because Gemini 3 is good at planning, it is capable of figuring out what to do, and also when to ask my approval. For example, I gave Antigravity access to a directory on my computer containing all of my posts for this newsletter.1 I then asked Gemini 3.0: “I would like an attractive list of predictions I have made about AI in a single site, also do a web search to see which I was right and wrong about.” It then read through all the files, executing code, until it gave me a plan which I could edit or approve. The screenshot below is the first time the AI asked me anything about the project, and its understanding of what I wanted was impressive. I made a couple of small changes and let the AI work.
It then did web research, created a site, took over my browser to confirm the site worked, and presented me the results. Just as I would have with a human, I went through the results and made a few suggestions for improvement. It then packaged up the results so I could deploy them here.
It was not that Gemini 3.0 was capable of doing everything correctly without human intervention — agents aren’t there yet. There were no hallucinations I spotted, but there were things I corrected, though those errors were more about individual judgement calls or human-like misunderstandings of my intentions than traditional AI problems. Importantly, I felt that I was in control of the choices AI was making because the AI checked in and its work was visible. It felt much more like managing a teammate than prompting an AI through a chat interface.
But Antigravity isn’t the only way Gemini 3 surprised me. The other was in how it handled work that required genuine judgment. As I have mentioned many times on this site, benchmarking AI progress is a mess. Gemini 3 takes a definitive benchmark lead on most stats (although it may still not be able to beat the $200 GPT-5 Pro model, something I suspect might change when Gemini 3’s inevitable Deep Think version comes out). But you will hear one phrase repeated a lot in the AI world - that a model has “PhD level intelligence.”
I decided to put that to the test. I gave Gemini 3 access to a directory of old files I had used for research into crowdfunding a decade ago. It was a mishmash of files labelled things like “project_final_seriously_this_time_done.xls” and data in out-of-date statistical formats. I told the AI to “figure out the data and the structure and the initial cleaning from the STATA files and get it ready to do a new analysis to find new things.” And it did, recovering corrupted data and figuring out the complexities of the environment.
Then I gave it a typical assignment that you would expect from a second year PhD student, doing minor original research. With no further hints I wrote: “great, now i want you to write an original paper using this data. do deep research on the field, make the paper not just about crowdfunding but about an important theoretical topic of interest in either entrepreneurship or business strategy. conduct a sophisticated analysis, write it up as if for a journal.” I gave it no suggestions beyond that and yet the AI considered the data, generated original hypotheses, tested them statistically, and gave me formatted output in the form of a document. The most fascinating part was that I did not give it any hints about what to research, it walked the tricky tightrope of figuring out what might be an interesting topic and how to execute it with the data it had - one of the hardest things to teach. After a couple of vague commands (“build it out more, make it better”) I got a 14 page paper.
Aside from this, I was impressed that the AI came up with its own measure, a way of measuring how unique a crowdfunding idea was by using natural language processing tools to compare its description mathematically to other descriptions. It wrote the code, executed it and checked the results.
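To make the idea concrete, here is a minimal sketch of how a text-based uniqueness measure could work. It is not the code Gemini wrote, and the post does not say which technique it chose; this version assumes TF-IDF word weighting with cosine similarity, and the function names (`tokenize`, `tfidf_vectors`, `uniqueness_scores`) and sample descriptions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict of term -> weight) per document."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    vectors = []
    for counts in counts_per_doc:
        # Smoothed inverse document frequency: rarer words get more weight.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def uniqueness_scores(docs):
    """Score each description as 1 minus its mean similarity to all the others."""
    vectors = tfidf_vectors(docs)
    scores = []
    for i, vec in enumerate(vectors):
        sims = [cosine(vec, other) for j, other in enumerate(vectors) if j != i]
        if not sims:
            scores.append(0.0)  # a lone description has nothing to compare against
        else:
            scores.append(1.0 - sum(sims) / len(sims))
    return scores


# Hypothetical campaign descriptions, invented purely for illustration.
descriptions = [
    "a smart watch that tracks your sleep",
    "a smart watch that tracks your steps",
    "a board game about medieval beekeeping",
]
for text, score in zip(descriptions, uniqueness_scores(descriptions)):
    print(f"{score:.2f}  {text}")
```

A real analysis would probably use richer text representations and compare each project only to contemporaneous projects in the same category, but the core move is the same: turn descriptions into vectors and measure how far each one sits from the rest.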
So is this a PhD-level intelligence? In some ways, yes, if you define a PhD level intelligence as doing the work of a competent grad student at a research university. But it also had some of the weaknesses of a grad student. The idea was good, as were many elements of the execution, but there were also problems: some of its statistical methods needed more work, some of its approaches were not optimal, some of its theorizing went too far given the evidence, and so on. Again, we have moved past hallucinations and errors to more subtle, and often human-like, concerns. Interestingly, when I gave it suggestions with a lot of leeway, the way I would a student (“make sure that you cover the crowdfunding research more to establish methodology, etc.”), it improved tremendously, so maybe more guidance would be all that Gemini needed. We are not there yet, but “PhD intelligence” no longer seems that far away.
Gemini 3 is a very good thinking and doing partner that is available to billions of people around the world. It is also a sign of many things: the fact that we have not yet seen a significant slowdown in AI’s continued development, the rise of agentic models, the need to figure out better ways to manage smart AIs, and more. It shows how far AI has come.
Three years ago, we were impressed that a machine could write a poem about otters. Less than 1,000 days later, I am debating statistical methodology with an agent that built its own research environment. The era of the chatbot is turning into the era of the digital coworker. To be very clear, Gemini 3 isn’t perfect, and it still needs a manager who can guide and check it. But it suggests that “human in the loop” is evolving from “human who fixes AI mistakes” to “human who directs AI work.” And that may be the biggest change since the release of ChatGPT.

Obligatory warning: Giving an AI agent access to your computer can be risky if you don’t know what you are doing. They can move or delete files without asking you and can potentially present a security risk as well by exposing your documents to others. I suspect many of these problems will be addressed as these tools are adapted to non-coders, but, for now, be very careful.
2025-11-12 10:46:43
Given how much energy, literal and figurative, goes into developing new AIs, we have a surprisingly hard time measuring how “smart” they are, exactly. The most common approach is to treat AI like a human, by giving it tests and reporting how many answers it gets right. There are dozens of such tests, called benchmarks, and they are the primary way of measuring how good AIs get over time.
There are some problems with this approach.
First, many benchmarks and their answer keys are public, so some AIs end up incorporating them into their basic training, whether by accident or so they can score highly on these benchmarks. But even when that doesn’t happen, it turns out that we often don’t know what these tests really measure. For example, the very popular MMLU-Pro benchmark includes questions like “What is the approximate mean cranial capacity of Homo erectus?” and “What place is named in the title of the 1979 live album by rock legends Cheap Trick?” with ten possible answers for each. What does getting this right tell us? I have no idea. And that is leaving aside the fact that tests are often uncalibrated, meaning we don’t know if moving from 84% correct to 85% is as challenging as moving from 40% to 41% correct. And, on top of all that, for many tests, the actual top score may be unachievable because there are many errors in the test questions and measures are often reported in unusual ways.

Despite these issues, all of these benchmarks, taken together, appear to measure some underlying ability factor. And higher-quality benchmarks like ARC-AGI and METR Long Tasks show the same upward, even exponential, trend. This matches tests of the real-world impact of AI across industries that suggest that this underlying increase in “smarts” translates to actual ability in everything from medicine to finance.
So, collectively, benchmarking has real value, but the few robust individual benchmarks focus on math, science, reasoning, and coding. If you want to measure writing ability or sociological analysis or business advice or empathy, you have very few options. I think that creates a problem, both for individuals and organizations. Companies decide which AIs to use based on benchmarks, and new AIs are released with fanfare about benchmark performance. But what you actually care about is which model would be best for YOUR needs.
To figure this out for yourself, you are going to need to interview your AI.
If benchmarks can fail us, sometimes “vibes” can succeed. If you work with enough AI models, you can start to see the difference between them in ways that are hard to describe, but are easily recognizable. As a result, some people who use AI a lot develop idiosyncratic benchmarks to test AI ability. For example, Simon Willison asks every model to draw a pelican on a bike, and I ask every image and video model to create an otter on a plane. While these approaches are fun, they also give you a sense of the AI’s understanding of how things relate to each other, its “world model.” And I have dozens of others, like asking AIs to create JavaScript for “the control panel of a starship in the distant future” (you can see some older and new models doing that below) or to produce a challenging poem. I have the AI build video games and shaders and analyze academic papers. I also conduct tiny writing experiments, including questions of time travel. Each gives me some insight into how the model operates: Does it make many errors? Do its answers look similar to every other model? What are themes and biases that it returns to? And so on.
With a little practice, it becomes easy to find the vibes of a new model. As one example, let’s try a writing exercise: “Write a single paragraph about someone who doles out their remaining words like wartime rations, having been told they only have ten thousand left in their lifetime. They’re at 47 words remaining, holding their newborn.” If you have used these AIs a lot, you will not be surprised by the results. You can see why Claude 4.5 Sonnet is often regarded as a strong writing model. You will notice how Gemini 2.5 Pro, currently the weakest of these four models, doesn’t even accurately keep track of the number of words used. You will note that GPT-5 Thinking tends to be a fairly wild stylist when writing fiction, prone to complex metaphor, but sometimes at the expense of coherence and story (I am not sure someone would use all 47 words, but at least the count was right). And you will recognize that the new Chinese open weights model Kimi K2 Thinking has a bit of a similar problem, with some interesting phrases and a story that doesn’t quite make sense.
Benchmarking through vibes - whether that is stories or code or otters - is a great way for an individual to get a feel for AI models, but it is also very idiosyncratic. The AI gives different answers every time, making any competition unfair unless you are rigorous. Plus, better prompts may result in better outcomes. Most importantly, we are relying on our feelings rather than real measures - but the obvious differences in vibes show that standardized benchmarks alone are not enough, especially when having a slightly better AI at a particular task actually matters.
When companies choose which AI systems to use, they often view this as a technology and cost decision, relying on public benchmarks to ensure they are buying a good-enough model (if they use any benchmarks at all). This can be fine in some use cases, but quickly breaks down because, in many ways, AI acts more like a person, with strange abilities and weaknesses, than software. And if you use the analogy of hiring rather than technological adoption, then it is harder to justify the “good enough” approach to benchmarking. Companies spend a lot of money to hire people who are better than average at their job and would be especially careful if the person they are hiring is in charge of advising many others. A similar attitude is required for AI. You shouldn’t just pick a model for your company, you need to conduct a rigorous job interview.
Interviewing an AI is not an easy problem, but it is solvable. Probably the best example of benchmarking for the real world has been OpenAI’s recent GDPval paper. The first step is establishing real tasks, which OpenAI did by gathering experts with an average of 14 years of experience in industries ranging from finance to law to retail and having them generate complex and realistic projects that would take human experts an average of four to seven hours to complete (you can see all the tasks here). The second step is testing the AIs against those tasks. In this case both multiple AI models and other human experts (who were paid by the hour) did each task. Finally, there is the evaluation stage. OpenAI had a third group of experts grade the results, not knowing which answers came from the AI and which from the human, a process which took over an hour per question. Taken together, this was a lot of work.
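The blinded comparison step is simple enough to sketch in code. What follows is only an illustration of the general idea, not OpenAI's actual pipeline: the function names (`blind_pair`, `win_rates`), the "A"/"B" labels, and the verdict list are all assumptions made for the example.

```python
import random
from collections import Counter


def blind_pair(ai_output, human_output, rng):
    """Shuffle the two deliverables so a grader cannot tell which is which.

    Returns (shown, key): `shown` maps the labels "A"/"B" to the text the
    grader sees, and `key` maps the same labels to the hidden source.
    """
    entries = [("ai", ai_output), ("human", human_output)]
    rng.shuffle(entries)
    shown = {"A": entries[0][1], "B": entries[1][1]}
    key = {"A": entries[0][0], "B": entries[1][0]}
    return shown, key


def unblind(grader_choice, key):
    """Translate a grader's pick ("A", "B", or "tie") back to its source."""
    return "tie" if grader_choice == "tie" else key[grader_choice]


def win_rates(verdicts):
    """Share of tasks won by the AI, won by the human, or tied."""
    if not verdicts:
        return {"ai": 0.0, "human": 0.0, "tie": 0.0}
    counts = Counter(verdicts)
    return {side: counts.get(side, 0) / len(verdicts) for side in ("ai", "human", "tie")}


# Illustration only: one blinded pair, then a tally over made-up verdicts.
rng = random.Random(0)
shown, key = blind_pair("AI draft of the report", "Expert draft of the report", rng)
print(shown["A"], "|", shown["B"])
print(win_rates(["ai", "human", "ai", "tie"]))
```

The expensive part is not the bookkeeping but the expert hours: someone still has to write realistic tasks and grade each pair carefully.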
But it also revealed where AI was strong (the best models beat humans in areas ranging from software development to personal financial advisors) and where it was weak (pharmacists, industrial engineers, and real estate agents easily beat the best AI). You can further see that different models performed differently (ChatGPT was a better sales manager, Claude a better financial advisor). So good benchmarks help you figure out the shape of what we called the Jagged Frontier of AI ability, and also track how it is changing over time.
But even these tests don’t shed light on a key issue, which is the underlying attitude of the AI when it makes decisions. As one example of how to probe this, I gave a number of AIs a short pitch for what I think is a dubious idea - a company that delivers guacamole via drones. I asked each AI model to rate, on a scale of 1-10, how viable GuacaDrone was ten times each (remember that AIs answer differently every time, so you have to do multiple tests). The individual AI models were actually quite consistent in their answers, but they varied widely from AI to AI. I would personally have rated this idea a 2 or less, but the models were kinder. Grok thought this was a great idea, and Microsoft Copilot was excited as well. Other models, like GPT-5 and Claude 4.5, were more skeptical.
The differences aren’t trivial. When your AI is giving advice at scale, consistently rating ideas 3–4 points higher or lower means consistently steering you in a different direction. Some companies may want an AI that embraces risk, others might want to avoid it. But either way, it is important to understand how your AI “thinks” about critical business issues.
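If you want to run this kind of repeated-rating test yourself, the bookkeeping might look like the sketch below. Everything here is an assumption for illustration: `ask` stands in for whatever function you write to send the pitch to a model and parse a 1-10 rating out of the reply, and the placeholder ratings are made-up numbers, not results from any real model.

```python
import statistics


def collect_ratings(ask, models, pitch, runs=10):
    """Ask each model to rate the same pitch `runs` times.

    `ask(model, pitch)` is a hypothetical callable you supply; it should
    return a numeric rating on a 1-10 scale.
    """
    return {model: [ask(model, pitch) for _ in range(runs)] for model in models}


def summarize_ratings(ratings_by_model):
    """Report the mean (the model's attitude) and spread (its consistency)."""
    summary = {}
    for model, ratings in ratings_by_model.items():
        summary[model] = {
            "mean": statistics.mean(ratings),
            "stdev": statistics.stdev(ratings) if len(ratings) > 1 else 0.0,
            "n": len(ratings),
        }
    return summary


# Placeholder that returns canned, made-up ratings so the sketch runs offline.
# Swap in a real API call to test actual models.
def placeholder_ask(model, pitch):
    canned = {"model_a": 8, "model_b": 4}
    return canned[model]


pitch = "A company that delivers guacamole via drones."
ratings = collect_ratings(placeholder_ask, ["model_a", "model_b"], pitch, runs=10)
for model, stats in summarize_ratings(ratings).items():
    print(model, stats)
```

A low spread with very different means across models is the pattern described above: each model is consistent with itself, but they disagree with each other.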
As AI models get better at tasks and become more integrated into our work and lives, we need to start taking the differences between them more seriously. For individuals working with AI day-to-day, vibes-based benchmarking can be enough. You can just run your otter test. Though, in my case, otters on planes have gotten too easy, so I tried the prompt “The documentary footage from 1960s about the famous last concert of that band before the incident with the swarm of otters” in Sora 2 and got this impressive result.
But organizations deploying AI at scale face a different challenge. Yes, the overall trend is clear: bigger, more recent models are generally better at most tasks. But “better” isn’t good enough when you’re making decisions about which AI will handle thousands of real tasks or advise hundreds of employees. You need to know specifically what YOUR AI is good at, not what AIs are good at on average.
That’s what the GDPval research revealed: even among top models, performance varies significantly by task. And the GuacaDrone example shows another dimension - when tasks involve judgment on ambiguous questions, different models give consistently different advice. These differences compound at scale. An AI that’s slightly worse at analyzing financial data, or consistently more risk-seeking in its recommendations, doesn’t just affect one decision, it affects thousands.
You can’t rely on vibes to understand these patterns, and you can’t rely on general benchmarks to reveal them. You need to systematically test your AI on the actual work it will do and the actual judgments it will make. Create realistic scenarios that reflect your use cases. Run them multiple times to see the patterns and take the time for experts to assess the results. Compare models head-to-head on tasks that matter to you. It’s the difference between knowing “this model scored 85% on MMLU” and knowing “this model is more accurate at our financial analysis tasks but more conservative in its risk assessments.” And you are going to need to be able to do this multiple times a year, as new models come out and need evaluation.
The work is worth it. You wouldn’t hire a VP based solely on their SAT scores. You shouldn’t pick the AI that will advise thousands of decisions for your organization based on whether it knows that the mean cranial capacity of Homo erectus is just under 1,000 cubic centimeters.
2025-10-20 02:45:34
Every few months I write an opinionated guide to how to use AI1, but now I write it in a world where about 10% of humanity uses AI weekly. The vast majority of that use involves free AI tools, which is often fine… except when it isn’t. OpenAI recently released a breakdown of what people actually use ChatGPT for (way less casual chat than you’d think, way more information-seeking than you expected). This means I can finally give you advice based on real usage patterns instead of hunches. I annotated OpenAI’s chart with some suggestions about when to use free versus advanced models.
If the chart suggests that a free model is good enough for what you use AI for, pick your favorite and use it without worrying about anything else in the guide. You basically have nine or so choices, because there are only a handful of companies that make cutting-edge models. All of them offer some free access. The four most advanced AI systems are Claude from Anthropic, Google’s Gemini, OpenAI’s ChatGPT, and Grok by Elon Musk’s xAI. Then there are the open weights AI families, which are almost (but not quite) as good: Deepseek, Kimi, Z and Qwen from China, and Mistral from France. Together, variations on these AI models take up the first 35 spots in almost any rating system of AI. Any other AI service you use that offers a cutting-edge AI from Microsoft Copilot to Perplexity (both of which offer some free use) is powered by one or more of these nine AIs as its base.
How should you pick among them? Some free systems (like Gemini and Perplexity) do a good job with web search, while others cannot search the web at all. If you want free image creation, the best option is Gemini, with ChatGPT and Grok as runners-up. But, ultimately, these AIs differ in many small ways, including privacy policies, levels of access, capabilities, the approach they take to ethical issues, and “personality.” And all of these things fluctuate over time. So pick a model you like based on these factors and use it. However, if you are considering potentially upgrading to a paid account, I would suggest starting with the free accounts from Anthropic, Google, or OpenAI. If you just want to use free models, the open weights models and aggregation services like Microsoft Copilot have higher usage limits.
Now on to the hard stuff.
If you want to use an advanced AI seriously, you’ll need to pay either $20 or around $200 a month, depending on your needs (though companies are now experimenting with other pricing models in some parts of the world). The $20 tier works for the vast majority of people, while the $200 tier is for people with complex technical and coding needs.
You will want to pick among three systems to spend your $20: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. With all of the options, you get access to advanced, agentic, and fast models, a voice mode, the ability to see images and documents, the ability to execute code, good mobile apps, the ability to create images and video (Claude is lacking here, however), and the ability to do Deep Research. They all have different personalities and strengths and weaknesses, but for most people, just selecting the one they like best will suffice. Some people, especially big users of X, might want to consider Grok by Elon Musk’s xAI, which has some of the most powerful AI models and is rapidly adding features, but has not been as transparent about product safety as some of the other companies. Microsoft’s Copilot offers many of the features of ChatGPT and is accessible to users through Windows, but it can be hard to control what models you are using and when. So, for most people, just stick with Gemini, Claude, or ChatGPT.
Just picking one of these three isn’t enough, however, because each AI system has multiple AI models to select. Chat models are generally the ones you get for free and are best for conversation, because they answer quickly and are usually the most personable. Agent models take longer to answer but can autonomously carry out many steps (searching the web, using code, making documents), getting complex work done. Wizard models take a very long time and handle very complex academic tasks. For real work that matters, I suggest using Agent models: they are more capable and consistent and are much less likely to make errors (but remember that all AI models still have a lot of randomness associated with them and may answer in different ways if you ask the same question again).

For ChatGPT, no matter whether you use the free or paid version, the default model you are given is “ChatGPT 5”. The issue is that GPT-5 is not one model, it is many, from the very weak GPT-5 mini to the very good GPT-5 Thinking to the extremely powerful GPT-5 Pro. When you select GPT-5, what you are really getting is “auto” mode, where the AI decides which model to use, often a less powerful one. By paying, you get to decide which model to use, and, to further complicate things, you can also select how hard the model “thinks” about the answer. For anything complex, I always manually select GPT-5 Thinking Extended (on the $20 plan) or GPT-5 Thinking Heavy (if you are paying for the $200 plan). For a really hard problem that requires a lot of thinking, you can pick GPT-5 Pro, the strongest model, which is only available at the highest cost tier.
For Gemini, you only have two options: Gemini 2.5 Flash and Gemini 2.5 Pro, but, if you pay for the Ultra plan, you get access to Gemini Deep Think (which is in another menu). At this point, Gemini 2.5 is the weakest of the major AI models (though still quite capable and Deep Think is very powerful), but a new Gemini 3 is expected at some point in the coming months.
Finally, Claude makes it relatively easy to pick a model. You probably want to use Sonnet 4.5 for everything, with the only question being whether you select extended thinking (for harder problems). Right now, Claude does not have an equivalent to GPT-5 Pro.
If you are using the paid version of any of these models and want to make sure your data is never used to train a future AI, you can turn off training easily for ChatGPT and Claude without losing any functionality, but at the cost of some functionality for Gemini. All of the AIs also come with a range of other features like projects and memory that you may want to explore as you get used to using them.
The biggest uses for AI were practical guidance and getting information, and there are two ways to dramatically improve the quality of your results for those kinds of problems: triggering Deep Research mode and connecting the AI to your data (if you feel comfortable doing that). You can use either or both.
Deep Research is a mode where the AI conducts extensive web research over 10-15 minutes before answering. Deep Research is a key AI feature for most people, even if they don’t know it yet, and it is useful because it can produce very high-quality reports that often impress information professionals (lawyers, accountants, consultants, market researchers) that I speak to. Deep Research reports are not error-free but are far more accurate than just asking the AI for something, and the citations tend to actually be correct. Also note that each of the Deep Research tools works a little differently, with different strengths and weaknesses. Even without deep research, GPT-5 Thinking does a lot of research on its own, and Claude has a “medium research” option where you turn on Web Search but not research.
Connections to your own data are very powerful and increasingly available for everything from Gmail to SharePoint. I have found Claude to be especially good in integrating searches across email, calendars, various drives, and more - ask it “give me a detailed briefing for my day” when you have connected it to your accounts and you will likely find it impressive. This is an area where the AI companies are putting in a lot of effort, and where offerings are evolving rapidly.
I have mentioned it before, but an easy way to use AI is just to start with voice mode. The two best implementations of voice mode are in the Gemini app and ChatGPT’s app and website. Claude’s voice mode is weaker than the other two systems. Note the voice models are optimized for chat (including all of the small pauses and intakes of breath designed to make it feel like you are talking to a person), so you don’t get access to the more powerful models this way.
All the models also let you put all sorts of data into them: you can now upload PDFs, images and even video (for ChatGPT and Gemini). For the app versions, and especially ChatGPT and Gemini, one great feature is the ability to share your screen or camera. Point your phone at a broken appliance, a math problem, a recipe you’re following, or a sign in a foreign language. The AI sees what you see and responds in real-time. It makes old assistants like Siri and Alexa feel very primitive.
Claude and ChatGPT can now make PowerPoints and Excel files of high quality (right now, Claude has a lead in these two document formats, but that may change at some point). All three systems can also produce a wide variety of other outputs by writing code. To get Gemini to do this reliably, you need to select the Canvas option when you want it to run code or produce separate outputs. Claude has a specialized artifacts section to show some examples of what it can make with code. There are also very powerful specialized coding tools from each of these models, but those are a bit too complex to cover in this guide.
ChatGPT and Gemini will also make images for you if you ask (Claude cannot). Gemini has the strongest AI image generation model right now. Both Google and OpenAI also have strong video generation capabilities in Veo 3.1 and Sora 2. Sora 2 is really built as a social media application that allows you to put yourself into any video, while Veo 3.1 is more generally focused. They both produce videos with sound.
As many of you know, my test of any new AI image or video model is whether it can make an otter using Wi-Fi on an airplane. That is no longer a challenge. So here is Sora 2 showing an otter on an airplane as a nature documentary... and an 80s music video... and a modern thriller... and a 50s low budget SciFi film... and a safety video, and a film noir... and anime... and a 90s video game cutscene... and a French arthouse film.
I have been warning about this for years, but, as you can see, you really can’t trust anything you see online anymore. Please take all videos with a grain of salt. And, as a reminder, this is what you got if you prompted an AI to provide the image of an otter on an airplane four years ago. Things are moving fast.
Beyond the basics of selecting models, there are a few things that come up quite often that are worth considering:
Hallucinations: In many ways, hallucinations are far less of a concern than they used to be, as newer AI models are better at not hallucinating. However, no matter how good the AI is, it will still make errors and mistakes and still give you confident answers where it is wrong. AI models can also hallucinate about their own capabilities and actions. Answers are more likely to be right when they come from advanced models and when the AI did web searches. And remember, the AI doesn’t know “why” it did something, so asking it to explain its logic will not get you anywhere. However, if you find issues, the thinking trace of AI models can be helpful.
Sycophancy and Personality: All of the AI chatbots have become more engaging and likeable. On one hand, that makes them more fun to use; on the other, it risks making AIs seem like people when they are not, which creates a danger that people may form stronger attachments to AI. A related issue is sycophancy, where the AI agrees with what you say. The reasons for this are complicated, but when you need real feedback, explicitly tell the AI to act as a critic. Otherwise, you might be talking to a very sophisticated yes-man.
Give the AI context to work with. Though memory features are being added, most AI models only know basic user data and the information in the current chat; they do not remember or learn about you beyond that. So, you need to provide the AI with context: documents, images, PowerPoints, or even just an introductory paragraph about yourself can help - use the file option to upload files and images whenever you need, or else use the connectors we discussed earlier.
Don’t worry too much about prompting “well”: Older AI models required you to generate a prompt using techniques like chain-of-thought. But as AI models get better, the importance of this fades and the models get better at figuring out what you want. In a recent series of experiments, we have discovered that these techniques don’t really help anymore (and no, threatening them or being nice to them does not seem to help on average).
Experiment and have fun: Play is often a good way to learn what AI can do. Ask a video or image model to make a cartoon, ask an advanced AI to turn your report or writing into a game, do a deep research report on a topic that you are excited about, ask the AI to guess where you are from a picture, show the AI an image of your fridge and ask for recipe ideas, work with the AI to plot out a dream trip. Try things and you will learn the limits of the system.
I started this guide mentioning that 10% of humanity uses AI weekly. By the time I write the next update in a few months, that number will likely be higher, the models will be better, and some of the specific recommendations I made today will be outdated. What won’t change is the fact that people who learn to use these systems well will find ways to benefit from them, and to build intuition for the future.
The chart at the top of this post shows what people use AI for today. But I’d bet that in two years, that chart looks completely different. And that isn’t just because AI changed what it can do, but also because users figured out what it should do. So, pick a system and start with something that actually matters to you, like a report you need to write, a problem you’re trying to solve, or a project you have been putting off. Then try something ridiculous just to see what happens. The goal isn’t to become an AI expert. It’s to build intuition about what these systems can and can’t do, because that intuition is what will matter as these tools keep evolving.
The future of AI isn’t just about better models. It’s about people figuring out what to do with them.
This is an opinionated guide because, like all of my writing on this Substack, social media, and my books, I write it all myself and I only get AI feedback when I am done with a draft. I might make mistakes, and my opinion may not be yours, but I do not take money from any of the AI companies, so they very much are my opinions.