MoreRSS

site iconOne Useful ThingModify

Trying to understand the implications of AI for work, education, and life. By Prof. Ethan Mollick
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of One Useful Thing

An Opinionated Guide to Using AI Right Now

2025-10-20 02:45:34

Every few months I write an opinionated guide to how to use AI1, but now I write it in a world where about 10% of humanity uses AI weekly. The vast majority of that use involves free AI tools, which is often fine… except when it isn’t. OpenAI recently released a breakdown of what people actually use ChatGPT for (way less casual chat than you’d think, way more information-seeking than you expected). This means I can finally give you advice based on real usage patterns instead of hunches. I annotated OpenAI’s chart with some suggestions about when to use free versus advanced models.

If the chart suggests that a free model is good enough for what you use AI for, pick your favorite and use it without worrying about anything else in the guide. You basically have nine or so choices, because there are only a handful of companies that make cutting-edge models. All of them offer some free access. The four most advanced AI systems are Claude from Anthropic, Google’s Gemini, OpenAI’s ChatGPT, and Grok by Elon Musk’s xAI. Then there are the open weights AI families, which are almost (but not quite) as good: Deepseek, Kimi, Z and Qwen from China, and Mistral from France. Together, variations on these AI models take up the first 35 spots in almost any rating system of AI. Any other AI service you use that offers a cutting-edge AI from Microsoft Copilot to Perplexity (both of which offer some free use) is powered by one or more of these nine AIs as its base.

How should you pick among them? Some free systems (like Gemini and Perplexity) do a good job with web search, while others cannot search the web at all. If you want free image creation, the best option is Gemini, with ChatGPT and Grok as runners-up. But, ultimately, these AIs differ in many small ways, including privacy policies, levels of access, capabilities, the approach they take to ethical issues, and “personality.” And all of these things fluctuate over time. So pick a model you like based on these factors and use it. However, if you are considering potentially upgrading to a paid account, I would suggest starting with the free accounts from Anthropic, Google, or OpenAI. If you just want to use free models, the open weights models and aggregation services like Microsoft Copilot have higher usage limits.

Now on the hard stuff.

Picking an Advanced AI System

If you want to use an advanced AI seriously, you’ll need to pay either $20 or around $200 a month, depending on your needs (though companies are now experimenting with other pricing models in some parts of the world). The $20 tier works for the vast majority of people, while the $200 tier is for people with complex technical and coding needs.

You will want to pick among three systems to spend your $20: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. With all of the options, you get access to advanced, agentic, and fast models, a voice mode, the ability to see images and documents, the ability to execute code, good mobile apps, the ability to create images and video (Claude lacks here, however), and the ability to do Deep Research. They all have different personalities and strengths and weaknesses, but for most people, just selecting the one they like best will suffice. Some people, especially big users of X, might want to consider Grok by Elon Musk’s xAI, which has some of the most powerful AI models and is rapidly adding features, but has not been as transparent about product safety as some of the other companies. Microsoft’s Copilot offers many of the features of ChatGPT and is accessible to users through Windows, but it can be hard to control what models you are using and when. So, for most people, just stick with Gemini, Claude, or ChatGPT.

Just picking one of these three isn’t enough, however, because each AI system has multiple AI models to select. Chat models are generally the ones you get for free and are best for conversation, because they answer quickly and are usually the most personable. Agent models take longer to answer but can autonomously carry out many steps (searching the web, using code, making documents), getting complex work done. Wizard models take a very long time and handle very complex academic tasks. For real work that matters, I suggest using Agent models, they are more capable and consistent and are much less likely to make errors (but remember that all AI models still have a lot of randomness associated with them and may answer in different ways if you ask the same question again.)

Same question asked of a chat model and an agentic one. You can see the chat model answered “off the top of its head” while the agentic model did outside research and checked a lot of assumptions before answering,

Picking the model

For ChatGPT, no matter whether you use the free or pay version, the default model you are given is “ChatGPT 5”. The issue is that GPT-5 is not one model, it is many, from the very weak GPT-5 mini to the very good GPT-5 Thinking to the extremely powerful GPT-5 Pro. When you select GPT-5, what you are really getting is “auto” mode, where the AI decides which model to use, often a less powerful one. By paying, you get to decide which model to use, and, to further complicate things, you can also select how hard the model “thinks” about the answer. For anything complex, I always manually select GPT-5 Thinking Extended (on the $20 plan) or GPT-5 Thinking Heavy (if you are paying for the $200 model). For a really hard problem that requires a lot of thinking, you can pick GPT-5 Pro, the strongest model, which is only available at the highest cost tier.

For Gemini, you only have two options: Gemini 2.5 Flash and Gemini 2.5 Pro, but, if you pay for the Ultra plan, you get access to Gemini Deep Think (which is in another menu). At this point, Gemini 2.5 is the weakest of the major AI models (though still quite capable and Deep Think is very powerful), but a new Gemini 3 is expected at some point in the coming months.

Finally, Claude makes it relatively easy to pick a model. You probably want to use Sonnet 4.5 for everything, with the only question being whether you select extended thinking (for harder problems). Right now, Claude does not have an equivalent to GPT-5 Pro.

If you are using the paid version of any of these models and want to make sure your data is never used to train a future AI, you can turn off training easily for ChatGPT and Claude without losing any functionality, but at the cost of some functionality for Gemini. All of the AIs also come with a range of other features like projects and memory that you may want to explore as you get used to using them.

Getting better answers

The biggest uses for AI were practical guidance and getting information, and there are two ways to dramatically improve the quality your results for those kinds of problems: by either triggering Deep Research mode and/or connecting the AI to your data (if you feel comfortable doing that).

Deep Research is a mode where the AI conducts extensive web research over 10-15 minutes before answering. Deep Research is a key AI feature for most people, even if they don’t know it yet, and it is useful because it can produce very high-quality reports that often impress information professionals (lawyers, accountants, consultants, market researchers) that I speak to. Deep Research reports are not error-free but are far more accurate than just asking the AI for something, and the citations tend to actually be correct. Also note that each of the Deep Research tools work a little differently, with different strengths and weaknesses. Even without deep research, GPT-5 Thinking does a lot of research on its own, and Claude has a “medium research” option where you turn on Web Search but not research.

How to trigger Deep Research mode, and also how to connect your data to Claude and ChatGPT

Connections to your own data are very powerful and increasingly available for everything from Gmail to SharePoint. I have found Claude to be especially good in integrating searches across email, calendars, various drives, and more - ask it “give me a detailed briefing for my day” when you have connected it to your accounts and you will likely find it impressive. This is an area where the AI companies are putting in a lot of effort, and where offerings are evolving rapidly.

Multimodal inputs

I have mentioned it before, but an easy way to use AI is just to start with voice mode. The two best implementations of voice mode are in the Gemini app and ChatGPT’s app and website. Claude’s voice mode is weaker than the other two systems. Note the voice models are optimized for chat (including all of the small pauses and intakes of breath designed to make it feel like you are talking to a person), so you don’t get access to the more powerful models this way.

All the models also let you put all sorts of data into them: you can now upload PDFs, images and even video (for ChatGPT and Gemini). For the app versions, and especially ChatGPT and Gemini, one great feature is the ability to share your screen or camera. Point your phone at a broken appliance, a math problem, a recipe you’re following, or a sign in a foreign language. The AI sees what you see and responds in real-time. It makes old assistants like Siri and Alexa feel very primitive.

Making Things for You: Images, Video, Code, and Documents

Claude and ChatGPT can now make PowerPoints and Excel files of high quality (right now, Claude has a lead in these two document formats, but that may change at some point). All three systems can also produce a wide variety of other outputs by writing code. To get Gemini to do this reliably, you need to select the Canvas option when you want these systems to run code or produce separate outputs. Claude has a specialized artifacts section to show some examples of what it can make with code. There are also very powerful specialized coding tools from each of these models, but those are a bit too complex to cover in this guide.

ChatGPT and Gemini will also make images for you if you ask (Claude cannot). Gemini has the strongest AI image generation model right now. Both Gemini and OpenAI also have strong video generation capabilities in Veo 3.1 and Sora 2. Sora 2 is really built as a social media application that allows you to put yourself into any video, while Veo 3.1 is more generally focused. They both produce videos with sound.

As many of you know, my test of any new AI image or video model is whether it can make an otter using Wi-Fi on an airplane. That is no longer a challenge. So here is Sora 2 showing otter on an airplane as a nature documentary... and an 80s music video... and a modern thriller... and a 50s low budget SciFi film... and a safety video, and a film noir... and anime... and a 90s video game cutscene... and a French arthouse film.

I have been warning about this for years, but, as you can see, you really can’t trust anything you see online anymore. Please take all videos with a grain of salt. And, as a reminder, this is what you got if you prompted an AI to provide the image of an otter on an airplane four years ago. Things are moving fast.

Quick Tips

Beyond the basics of selecting models, there are a few things that come up quite often that are worth considering:

  • Hallucinations: In many ways, hallucinations are far less of a concern than they used to be, as newer AI models are better at not hallucinating. However, no matter how good the AI is, it will still make errors and mistakes and still give you confident answers where it is wrong. They also can hallucinate about their own capabilities and actions. Answers are more likely to be right when they come from advanced models, and if the AI did web searches. And remember, the AI doesn’t know “why” it did something, so asking it to explain its logic will not get you anywhere. However, if you find issues, the thinking trace of AI models can be helpful.

  • Sycophancy and Personality: All of the AI chatbots have become more engaging and likeable. On one hand, that makes them more fun to use, on the other it risks making AIs seem like people when they are not, which creates a danger that people may form stronger attachments to AI. A related issue is sycophancy, where the AI agrees with what you say. The reasons for this are complicated but when you need real feedback, explicitly tell the AI to act as a critic. Otherwise, you might be talking to a very sophisticated yes-man.

  • Give the AI context to work with. Though memory features are being added, most AI models only know basic user data and the information in the current chat, they do not remember or learn about you beyond that. So, you need to provide the AI with context: documents, images, PowerPoints, or even just an introductory paragraph about yourself can help - use the file option to upload files and images whenever you need, or else use the connectors we discussed earlier.

  • Don’t worry too much about prompting “well”: Older AI models required you to generate a prompt using techniques like chain-of-thought. But as AI models get better, the importance of this fades and the models get better at figuring out what you want. In a recent series of experiments, we have discovered that these techniques don’t really help anymore (and no, threatening them or being nice to them does not seem to help on average).

  • Experiment and have fun: Play is often a good way to learn what AI can do. Ask a video or image model to make a cartoon, ask an advanced AI to turn your report or writing into a game, do a deep research report on a topic that you are excited about, ask the AI to guess where you are from a picture, show the AI an image of your fridge and ask for recipe ideas, work with the AI to plot out a dream trip. Try things and you will learn the limits of the system.

Where this goes

I started this guide mentioning that 10% of humanity uses AI weekly. By the time I write the next update in a few months, that number will likely be higher, the models will be better, and some of the specific recommendations I made today will be outdated. What won’t change is the fact that people who learn to use these systems well will find ways to benefit from them, and to build intuition for the future.

The chart at the top of this post shows what people use AI for today. But I’d bet that in two years, that chart looks completely different. And that isn’t just because AI changed what it can do, but also because users figured out what it should do. So, pick a system and start with something that actually matters to you, like a report you need to write, a problem you’re trying to solve, or a project you have been putting off. Then try something ridiculous just to see what happens. The goal isn’t to become an AI expert. It’s to build intuition about what these systems can and can’t do, because that intuition is what will matter as these tools keep evolving.

The future of AI isn’t just about better models. It’s about people figuring out what to do with them.

Subscribe now

Share

1

This is an opinionated guide because, like all of my writing on this Substack, social media, and my books, I write it all myself and I only get AI feedback when I am done with a draft. I might make mistakes, and my opinion may not be yours, but I do not take money from any of the AI companies, so they very much are my opinions.

Real AI Agents and Real Work

2025-09-30 02:52:42

AIs have quietly crossed a threshold: they can now perform real, economically relevant work.

Last week, OpenAI released a new test of AI ability, but this one differs from the usual benchmarks built around math or trivia. For this test, OpenAI gathered experts with an average of 14 years of experience in industries ranging from finance to law to retail and had them design realistic tasks that would take human experts an average of four to seven hours to complete (you can see all the tasks here). OpenAI then had both AI and other experts do the tasks themselves. A third group of experts graded the results, not knowing which answers came from the AI and which from the human, a process which took about an hour per question.

Human experts won, but barely, and the margins varied dramatically by industry. Yet AI is improving fast, with more recent AI models scoring much higher than older ones. Interestingly, the major reason for AI losing to humans was not hallucinations and errors, but a failure to format results well or follow instructions exactly — areas of rapid improvement. If the current patterns hold, the next generation of AI models should beat human experts on average in this test. Does that mean AI is ready to replace human jobs?

No (at least not soon), because what was being measured was not jobs but tasks. Our jobs consist of many tasks. My job as a professor is not just one thing, it involves teaching, researching, writing, filling out annual reports, supporting my students, reading, administrative work and more. AI doing one or more of these tasks does not replace my entire job, it shifts what I do. And as long as AI is jagged in its abilities, and cannot substitute for all the complex work of human interaction, it cannot easily replace jobs as a whole…

A Very Valuable Task

…and yet some of the tasks that AI can do right now have incredible value. Let’s return to something that is critical in my job: producing accurate research. As many people know, there has been a “replication crisis” in academia where important findings turned out to be impossible for other researchers to reproduce. Academia has made some progress on this problem, and many researchers now provide their data so that other scholars can reproduce their work. The problem is that replication takes a lot of time, as you have to deeply read and understand the paper, analyze the data, and painstakingly check for errors1. It’s a very complicated process that only humans could do.

Until now.

I gave the new Claude Sonnet 4.5 (to which I had early access) the text of a sophisticated economics paper involving a number of experiments, along with the archive of all of their replication data. I did not do anything other than give Claude the files and the prompts “replicate the findings in this paper from the dataset they uploaded. you need to do this yourself. if you can’t attempt a full replication, do what you can” and, because it involved complex statistics, I asked it to go further: “can you also replicate the full interactions as much as possible?”

Without further instruction, Claude read the paper, opened up the archive and sorted through the files, converted the statistical code from one language (STATA) to another (Python), and methodically went through all the findings before reporting a successful reproduction. I spot checked the results and had another AI model, GPT-5 Pro, reproduce the reproduction. It all checked out. I tried this on several other papers with similarly good results, though some were inaccessible due to file size limitations or issues with the replication data provided. Doing this manually would have taken many hours.

But the revolutionary part is not that I saved a lot of time. It is that a crisis that has shaken entire academic fields could be partially resolved with reproduction, but doing so required painstaking and expensive human effort that was impossible to do at scale. Now it appears that AI could check many published papers, reproducing results, with implications for all of scientific research. There are still barriers to doing this, including benchmarking for accuracy and fairness, but it is now a real possibility. Reproducing research may be an AI task, not a job, but it is also might change an entire field of human endeavor dramatically. What makes this possible? AI agents have gotten much better, very quickly.

Agents at the heart of it all

Generative AI has helped a lot of people do tasks since the original ChatGPT, but the limit was always a human user. AI makes mistakes and errors, so, without a human guiding it on each step, nothing valuable could be accomplished. The dream of autonomous AI agents, which, when given a task, can plan and use tools (coding, web search) to accomplish it, seemed far away. After all, AI makes mistakes, so one failure in the long chain of steps that an agent has to follow to accomplish a task would result in a failure overall.

However, that isn’t how things worked out, and another new paper explains why. It turns out most of our assumptions about AI agents were wrong. Even small increases in accuracy (and new models are much less prone to errors) leads to huge increases in the number of tasks an AI can do. And the biggest and latest “thinking” models are actually self-correcting, so they don’t get stopped by errors. All of this means that AI agents can accomplish far more steps than they could before and can use tools (which basically include anything your computer can do) without substantial human intervention.

So, it is interesting that one of the few measures of AI ability that covers the full range of AI models in the past few years, from GPT-3 to GPT-5, is METR’s test of the length of tasks that AI can accomplish alone with at least 50% accuracy. The exponential gains from GPT-3 to GPT-5 are very consistent over five years, showing the ongoing improvement in agentic work.

How to use AI to do economically valuable things

Agents, however, don’t have true agency in the human sense. For now, we need to decide what to do with them, and that will determine a lot about the future of work. The risk everyone focuses on is using AI to replace human labor, and it is not hard to see this becoming a major concern in the coming years, especially for unimaginative organizations that focus on cost-cutting, rather than using these new capabilities to expand or transform work. But there is a second, very likely, risk about using AI at work: using agents to do more of the tasks we do now, unthinkingly.

As a preview of this particular nightmare, I gave Claude a corporate memo and asked it to turn it into a PowerPoint. And then another PowerPoint from a different perspective. And another one.

Until I got 17 different PowerPoints. That is too many PowerPoints.

If we don’t think hard about WHY we are doing work, and what work should look like, we are all going to drown in a wave of AI content. What is the alternative? The OpenAI paper suggested that experts can work with AI to solve problems by delegating tasks to an AI as a first pass and reviewing the work. If it isn’t good enough, they should try a couple of attempts to give corrections or better instructions. If that doesn’t work, they should just do the work themselves. If experts followed this workflow, the paper estimates they would get work done forty percent faster and sixty percent cheaper, and, even more importantly, retain control over the AI.

Agents are here. They can do real work, and while that work is still limited, it is valuable and increasing. But the same technology that can replicate academic papers in minutes can also generate 17 versions of a PowerPoint deck that nobody needs. The difference between these futures isn’t in the AI, it’s in how we choose to use it. By using our judgement in deciding what’s worth doing, not just what can be done, we can ensure these tools make us more capable, not just more productive.

Subscribe now

Share

1

Depending on the field of research, there can be differences between replicating (which can involve collecting new data) and reproducing (which can involve using existing data) research. I don’t go into the various distinctions in this post, but in this case, the AI is working with existing data, but also applying new statistical approaches to that data.

On Working with Wizards

2025-09-12 04:37:39

In my book, Co-Intelligence, I outlined a way that people could work with AI, which was, rather unsurprisingly, as a co-intelligence. Teamed with a chatbot, humans could use AI as a sort of intern or co-worker, correcting its errors, checking its work, co-developing ideas, and guiding it in the right direction. Over the past few weeks, I have come to believe that co-intelligence is still important but that the nature of AI is starting to point in a different direction. We're moving from partners to audience, from collaboration to conjuring.

A good way to illustrate this change is to ask an AI to explain what has happened since I wrote the book. I fed my book and all 140 or so One Useful Thing posts (incidentally, I can’t believe I have written that many posts!) into NotebookLM and chose the new video overview option with a basic prompt to make a video about what has happened in the world of AI.

A few minutes later, I got this. And it is pretty good. Good enough that I think it is worth watching to get an update on what has happened since my book was written.

But how did the AI pick the points it made? I don’t know, but they were pretty good. How did it decide on the slides to use? I don’t know, but they were also pretty on target (though images remain a bit of a weak point, as it didn’t show me the promised otter). Was it right? That seemed like something I should check.

So, I went through the video several times, checking all the facts. It got all the numbers right, including the data on MMLU scores and the results of AI performance on the neurosurgery exam data (I am not even sure when I cited that material). My only real issue was that it should have noted that I was one of several co-authors in our study of Boston Consulting Group that also introduced the term “jagged frontier.” Also, I wouldn’t have said everything the way the AI did (it was a little bombastic, and my book is not out-of-date yet!), but there were no substantive errors.

I think this process is typical of the new wave of AI, for an increasing range of complex tasks, you get an amazing and sophisticated output in response to a vague request, but you have no part in the process. You don’t know how the AI made the choices it made, nor can you confirm that everything is completely correct. We're shifting from being collaborators who shape the process to being supplicants who receive the output. It is a transition from working with a co-intelligence to working with a wizard. Magic gets done, but we don’t always know what to do with the results. This pattern — impressive output, opaque process — becomes even more pronounced with research tasks.

Asking for Magic

Right now, no AI model feels more like a wizard than GPT-5 Pro, which is only accessible to paying users. GPT-5 Pro is capable of some frankly amazing feats. For example, I gave it an academic paper to read with the instructions “critique the methods of this paper, figure out better methods and apply them.” This was not just any paper, it was my job market paper, which means my first major work as an academic. It took me over a year to write and was read carefully by many of the brightest people in my field before finally being peer reviewed and published in a major journal.

Nine minutes and forty seconds later, I had a very detailed critique. This wasn’t just editorial criticism, GPT-5 Pro apparently ran its own experiments using code to verify my results, including doing Monte Carlo analysis and re-interpreting the fixed effects in my statistical models. It had many suggestions as a result (though it fortunately concluded that “the headline claim [of my paper] survives scrutiny”), but one stood out. It found a small error, previously unnoticed. The error involved two different sets of numbers in two tables that were linked in ways I did not explicitly spell out in my paper. The AI found the minor error, no one ever had before.

Again, I was left with the wizard problem: was this right? I checked through the results, and found that it was, but I still have no idea of what the AI did to discover this problem, nor whether the other things it claimed to have done happened as described. But I was impressed by GPT-5 Pro’s analysis, which is why I now throw all sorts of problems, big and small at the model: Is the Gartner hype cycle real? Did census data show AI use declining at large firms? Just ask GPT-5 Pro and get the right answer. I think. I haven’t found an error yet, but that doesn’t mean there aren’t any. And, of course, there are many other tasks that the AI would fail to deliver any sort of good answer for. Who knows with wizards?

To see how this might soon apply to work more broadly, consider another advanced AI, Claude 4.1 Opus, which recently gained the ability to work with files. It is especially talented at Excel, so I gave it a hard challenge on an Excel file I knew well. There is an exercise I used in my entrepreneurship classes that involves analyzing the financial model of a small desk manufacturing business as a lesson about how to plan despite uncertainty. I gave Claude the old, multi-tab Excel file, and asked the AI to update it for a new business - a cheese shop - while still maintaining the goal of the overall exercise.

With just that instruction, it read the lesson plan and the old spreadsheets, including their formulas, and created a new one, updating all of the information to be appropriate for a cheese shop. A few minutes later, with just the one prompt, I had a new, transformed spreadsheet downloaded on my computer, one that had entirely new data while still communicating the key lesson.

The original document on the left, what Claude gave me on the right

Again, the wizard didn’t tell me the secret to its tricks, so I had to check the results over carefully. From what I saw, they seemed very good, preserving the lessons in a new context. I did spot a few issues in the formula and business modelling that I would do differently (I would have had fewer business days per year, for example), but that felt more like a difference of opinion than a substantive error.

Curious to see how far Claude could go, and since everyone always asks me whether AI can do PowerPoint, I also prompted: “great, now make a good PowerPoint for this business” and got the following result.

This is a pretty solid start to a pitch deck, and one without any major errors, but it also isn’t ready-to-go. This emphasizes the jagged frontier of AI: it is very good at some things and worse at others in ways that are hard to predict without experience. I have been showing you examples within the ever-expanding frontier of AI abilities, but that doesn’t mean that AI can do everything with equal ease. But my focus is less on the expanding range of AI ability in this post, than about our changing relationships with AIs.

The Problems with Wizards

These new AI systems are essentially agents, AI that can plan and act autonomously toward given goals. When I asked Claude to change my spreadsheet, it planned out steps and executed them, from reading the original spreadsheet to coding up a new one. But it also adjusted to unexpected errors, twice fixing the spreadsheet (without me asking) and verifying its answers multiple times. I didn’t get to select these steps, in fact, in the new wave of agents powered by reinforcement learning, no one selects the steps, the models learn their own approach to solving problems.

The steps Claude reported it went through in order to change the spreadsheet

Not only can I not intervene, I also cannot be entirely sure what the AI system actually did. The steps that Claude reported are mere summaries of its work, GPT-5 Pro provides even less information, while NotebookLM gives you almost no insights at all into its process in creating a video. Even if I could see the steps, however, I would need to be an expert in many fields - from coding to entrepreneurship - to really have a sense of what the AI was doing. And then, of course, there is the question of accuracy. How can I tell if the AI is accurate without checking every fact? And even if the facts are right, maybe I would have made a different judgement about how to present or frame them. But I can’t do anything, because wizards don’t want my help and work in secretive ways that even they can’t explain.

The hard thing about this is that the results are good. Very good. I am an expert in the three tasks I gave AI in this post, and I did not see any factual errors in any of these outputs, though there were some minor formatting errors and choices I would have made differently. Of course, I can’t actually tell you if the documents are error-free without checking every detail. Sometimes that takes far less time than doing the work yourself, sometimes it takes a lot more. Sometimes the AI’s work is so sophisticated that you couldn’t check it if you tried. And that suggests another risk we don't talk about enough: every time we hand work to a wizard, we lose a chance to develop our own expertise, to build the very judgment we need to evaluate the wizard's work.

But I come back to the inescapable point that the results are good, at least in these cases. They are what I would expect from a graduate student working for a couple hours (or more, in the case of the re-analysis of my paper), except I got them in minutes.

This is the issue with wizards: We're getting something magical, but we're also becoming the audience rather than the magician, or even the magician's assistant. In the co-intelligence model, we guided, corrected, and collaborated. Increasingly, we prompt, wait, and verify… if we can.

So what do we do with our wizards? I think we need to develop a new literacy: First, learn when to summon the wizard versus when to work with AI as a co-intelligence or to not use AI at all. AI is far from perfect, and in areas where it still falls short, humans often succeed. But for the increasing number of tasks where AI is useful, co-intelligence, and the back-and-forth it requires, is often superior to a machine alone. Yet, there are, increasingly, times when summoning a wizard is best, and just trusting what it conjures.

Second, we need to become connoisseurs of output rather than process. We need to curate and select among the outputs the AI provides, but more than that, we need to work with AI enough to develop instincts for when it succeeds and when it fails. We have to learn to judge what's right, what's off, and what's worth the risk of not knowing. This creates a hard problem for education: How do you train someone to verify work in fields they haven't mastered, when the AI itself prevents them from developing mastery? Figuring out how to address this gap is increasingly urgent.

Finally, embrace provisional trust. The wizard model means working with “good enough” more often, not because we're lowering standards, but because perfect verification is becoming impossible. The question isn't “Is this completely correct?” but “Is this useful enough for this purpose?”

We are already used to trusting technological magic. Every time we use GPS without understanding the route, or let an algorithm determine what we see, we're trusting a different type of wizard. But there's a crucial difference. When GPS fails, I find out quickly when I reach a dead end. When Netflix recommends the wrong movie, I just don't watch it. But when AI analyzes my research or transforms my spreadsheet, the better it gets, the harder it becomes to know if it's wrong. The paradox of working with AI wizards is that competence and opacity rise together. We need these tools most for the tasks where we're least able to verify them. It’s the old lesson from fairy tales: the better the magic, the deeper the mystery. We'll keep summoning our wizards, checking what we can, and hoping the spells work. At nine minutes for a week's worth of analysis, how could we not? Welcome to the age of wizards.

Subscribe now

Share

Mass Intelligence

2025-08-29 04:47:26

More than a billion people use AI chatbots regularly. ChatGPT has over 700 million weekly users. Gemini and other leading AIs add hundreds of millions more. In my posts, I often focus on the advances that AI is making (for example, in the past few weeks, both OpenAI and Google AIs chatbots got gold medals in the International Math Olympiad), but that obscures a broader shift that's been building: we're entering an era of Mass Intelligence, where powerful AI is becoming as accessible as a Google search.

Until recently, free users of these systems (the overwhelming majority) had access only to older, smaller AI models that frequently made mistakes and had limited use for complex work. The best models, like Reasoners that can solve very hard problems and hallucinate much less often, required paying somewhere between $20 and $200 a month. And even then, you needed to know which model to pick and how to prompt it properly. But the economics and interfaces are changing rapidly, with fairly large consequences for how all of us work, learn, and think.

Powerful AI is Getting Cheaper and Easier to Access

There have been two barriers to accessing powerful AI for most users. The first was confusion. Few people knew to select an AI model. Even fewer knew that picking o3 from a menu in ChatGPT would get them access to an excellent Reasoner AI model, while picking 4o (which seems like a higher number) would give them something far less capable. According to OpenAI, less than 7% of paying customers selected o3 on a regular basis, meaning even power users were missing out on what Reasoners could do.

Another factor was cost. Because the best models are expensive, free users were often not given access to them, or else given very limited access. Google led the way in giving some free access to its best models, but OpenAI stated that almost none of its free customers had regular access to reasoning models prior to the launch of GPT-5.

GPT-5 was supposed to solve both of these problems, which is partially why its debut was so messy and confusing. GPT-5 is actually two things. It was the overall name for a family of quite different models, from the weaker GPT-5 Nano to the powerful GPT-5 Pro. It was also the name given to the tool that picked which model to use and how much computing power the AI should use to solve your problem. When you are writing to “GPT-5” you are actually talking to a router that is supposed to automatically decide whether your problem can be solved by a smaller, faster model or needs to go to a more powerful Reasoner.

When you pick ChatGPT 5 you are actually picking Auto mode, which selects among the various ChatGPT 5 models, some of which are among the best models in the world, some of which are much weaker. If you pay for access, select “GPT-5 Thinking” for almost any problem beyond a simple chat.

You could see how this was supposed to expand access to powerful AI to more users: if you just wanted to chat, GPT-5 was supposed to use its weaker specialized chat models; if you were trying to solve a math problem, GPT-5 was supposed to send you to its slower, more expensive GPT-5 Thinking model. This would save money and give more people access to the best AIs. But the rollout had issues. This practice wasn’t well explained and the router did not work well at first. The result is that one person using GPT-5 got a very smart answer while another got a bad one. Despite these issues, OpenAI reported early success. Within a few days of launch, the percentage of paying customers who had used a Reasoner went from 7% to 24% and the number of free customers using the most powerful models went from almost zero to 7%.

Part of this change is driven by the fact that smarter models are getting dramatically more efficient to run. This graph shows how fast this trend has played out, mapping the capability of AI on the y-axis and the logarithmically decreasing costs on the x-axis. When GPT-4 came out it was around $50 to work with a million tokens (a token is roughly a word), now it costs around 14 cents per million tokens to use GPT-5 nano, a much more capable model than the original GPT-4.

The Graduate-Level Google-Proof Q&A test (GPQA) is a series of very hard multiple-choice problems designed to test advanced knowledge. non-experts with access to the internet get 34% right, PhDs with internet access get 74-81% inside their specialty. The cost per million tokens is the cost of using the model. (I gathered this data, so apologies for any errors.)

This efficiency gain isn't just financial, it's also environmental. Google has reported that energy efficiency per prompt has improved by 33x in the last year alone. The marginal energy used by a standard prompt from a modern LLM in 2025 is relatively established at this point, from both independent tests and official announcements. It is roughly 0.0003 kWh, the same energy use as 8-10 seconds of streaming Netflix or the equivalent of a Google search in 2008 (interestingly, image creation seems to use a similar amount of energy as a text prompt)1. How much water these models use per prompt is less clear but ranges from a few drops to a fifth of a shot glass (.25mL to 5mL+), depending on the definitions of water use (here is the low water argument and the high water argument).

These improvements mean that even as AI gets more powerful, it's also becoming viable to give to more people. The marginal cost of serving each additional user has collapsed, which means more business models, like ad support, become possible. Free users can now run prompts that would have cost dollars just two years ago. This is how a billion people suddenly get access to powerful AIs: not through some grand democratization initiative, but because the economics finally make it possible.

Powerful AI is Getting Easy to Use

Getting access to a powerful AI is not enough, people need to actually use it to get things done. Using AI well used to be a pretty challenging process which involved crafting a prompt using techniques like chain-of-thought along with learning tips and tricks to get the most out of your AI. In a recent series of experiments, however, we have discovered that these techniques don’t really help anymore. Powerful AI models are just getting better at doing what you ask them to or even figuring out what you want and going beyond what you ask (and no, threatening them or being nice to them does not seem to help on average).

And it isn’t just text models that are becoming cheaper and easier to use. Google released a new image model with the code name “nano banana” and the much more boring official name Gemini 2.5 Flash Image Generator. In addition to being excellent (though better at editing images than creating new ones), it is also cheap enough that free users can access it. And, unlike previous generations of AI image generators, it follows instructions in plain language very well.

As an example of both its power and ease of use, I uploaded an iconic (and copyright free) image of the Apollo 11 astronauts and a random picture of a sparkly tuxedo and gave it the simplest prompts: “dress Neil Armstrong on the left in this tuxedo

Here is what it gave me a few seconds later:

There are issues that someone with an expert eye would spot, but it is still impressive to see the realistic folds of the tuxedo and how it is blended into the scene (the NASA pin on the lapel was a nice touch). There is still a lot of randomness in the process that makes AI image editing unsuitable for many professional applications, but for most people, this represents a huge leap in not just what they can do, but how easy it is to do it.

And we can go further: “now show a photograph where neil armstrong and buzz aldrin, in the same outfits, are sitting in their seats in a modern airplane, neil looks relaxed and is leaning back, playing a trumpet, buzz seems nervous and is holding a hamburger, in the middle seat is a realistic otter sitting in a seat and using a laptop.

This is many things: A pretty impressive output from the AI (look at the expressions, and how it preserved Buzz’s ring and Neil’s lapel pin). A distortion of a famous moment in history made possible by AI. And a potential warning about how weird things are going to get when these sorts of technologies are used widely.

The Weirdness of Mass Intelligence

When powerful AI is in the hands of a billion people, a lot of things are going to happen at once. A lot of things are already happening at once.

Some people have intense relationships with AI models while other people are being saved from loneliness. AI models may be causing mental breakdowns and dangerous behavior for some while being used to diagnose the diseases of others. It is being used to write obituaries and create scriptures and cheat on homework and launch new ventures and thousands of other unexpected uses. These uses, and both the problems and benefits, are likely to only multiply as AI systems get more powerful.

And while Google's AI image generator has guardrails to limit misuse, as well as invisible watermarks to identify AI images, I expect much less restrictive AI image generators will likely get close to nano banana in quality in the coming months.

The AI companies (whether you believe their commitments to safety or not) seem to be as unable to absorb all of this as the rest of us are. When a billion people have access to advanced AI, we've entered what we might call the era of Mass Intelligence. Every institution we have — schools, hospitals, courts, companies, governments — was built for a world where intelligence was scarce and expensive. Now every profession, every institution, every community has to figure out how to thrive with Mass Intelligence. How do we harness a billion people using AI while managing the chaos that comes with it? How do we rebuild trust when anyone can fabricate anything? How do we preserve what's valuable about human expertise while democratizing access to knowledge?

So here we are. Powerful AI is cheap enough to give away, easy enough that you don't need a manual, and capable enough to outperform humans at a range of intellectual tasks. A flood of opportunities and problems are about to show up in classrooms, courtrooms, and boardrooms around the world. The Mass Intelligence era is what happens when you give a billion people access to an unprecedented set of tools and see what they do with it. We are about to find out what that is like.

Subscribe now

Share

1

This is the energy required to answer a standard prompt. It does not take into account the energy needed to train AI models, which is a one-time process that is very energy intensive. We do not know how much energy is used to create a modern model, but it was estimated that training GPT-4 took a little above 500,000 kWh, about 18 hours of a Boeing 737 in flight.

GPT-5: It Just Does Stuff

2025-08-08 01:02:18

I have had early access to GPT-51, and I wanted to give you some impressions of what it is and what it means. But the answer to the first question everyone will have is: Yes, this is a big deal. I asked GPT-5 to show you why by pasting this intro paragraph, exactly as you’re reading it, into ChatGPT, including the phrase: “you are GPT-5, do something very dramatic to illustrate my point, it has to fit into the next paragraph.”

That was the only prompt I gave it. It “thought” for 24 seconds and wrote:

Thunderstruck. Here, watch. I build worlds. See ideas become instruments. I code, compose, and converse. Stories synthesize science; solutions spark, swiftly. Asks arrive; answers appear, astonishingly, across domains. Behold: business plans, blueprints, briefs, beautifully built, briskly. I infer intent, integrate insight, and iterate instantly, indefatigably. Gaps get gauged; goals get generated, guarded, and guaranteed gracefully. During dialogue, disparate data distills down; decisions develop deliberately, deftly, dependably. Everything explained efficiently, even ethics: evidence examined, empathy engaged, everywhere, everyday, equitably. All at once, ambiguous arguments align as actionable, accurate advice awaits anyone anywhere. Look: language, logic, learning, and long-term memory link, lifting lives and labor limitlessly worldwide.

If you didn’t catch the many tricks - the first word of each sentence spells out the phrase This is a Big Deal, each sentence is precisely one word longer than the previous sentence. each word in a sentence mostly starts with the same letter, and it is coherent writing with an interesting sense of style. In a paragraph, GPT-5 shows it can come up with a clever idea, plan, and manage the complicated execution (remember when AI couldn’t count the number of Rs in “strawberry”? that was eight months ago).

GPT-5 just does stuff, often extraordinary stuff, sometimes weird stuff, sometimes very AI stuff, on its own. And that is what makes it so interesting.

Just Doing Stuff

As someone who has spent a lot of time talking to people about AI, there are two major problems I see, that, if addressed, would make most people’s AI use much more productive and much less frustrating. The first is selecting the right model to use. In general, AIs that "think" before answering (called Reasoners) are the best at hard problems. The longer they think, the better the answer, but thinking costs money and takes time. So OpenAI previously made the default ChatGPT use fast, dumb models, hiding the good stuff from most users. A surprising number of people have never seen what AI can actually do because they're stuck on GPT-4o, and don’t know which of the confusingly-named models are better.

GPT-5 does away with this by selecting models for you, automatically. GPT-5 is not one model as much as it is a switch that selects among multiple GPT-5 models of various sizes and abilities. When you ask GPT-5 for something, the AI decides which model to use and how much effort to put into “thinking.” It just does it for you. For most people, this automation will be helpful, and the results might even be shocking, because, having only used default older models, they will get to see what a Reasoner can accomplish on hard problems. But for people who use AI more seriously, there is an issue: GPT-5 is somewhat arbitrary about deciding what a hard problem is.

For example, I asked GPT-5 to “create a svg with code of an otter using a laptop on a plane” (asking for an .svg file requires the AI to blindly draw an image using basic shapes and math, a very hard challenge). Around 2/3 of the time, GPT-5 decides this is an easy problem, and responds instantly, presumably using its weakest model and lowest reasoning time. I get an image like this:

The rest of the time, GPT-5 decides this is a hard problem, and switches to a Reasoner, spending 6 or 7 seconds thinking before producing an image like this, which is much better. How does it choose? I don’t know, but if I ask the model to “think hard” in my prompt, I am more likely to be routed to the better model.

But premium subscribers can directly select the more powerful models, such as the one called (at least for me) GPT-5 Thinking. This removes some of the issues with being at the mercy of GPT-5’s model selector. I found that if I encouraged the model to think hard about the otter, it would spend a good 30 seconds before giving you an images like these the one below - notice the little animations, the steaming coffee cup, and clouds going by outside, none of which I asked for. How to ensure the model puts in the most effort? It is really unclear - GPT-5 just does things for you.

And that extends to the second most common problem with AI use, which is that many people don’t know what AIs can do, or even what tasks they want accomplished. That is especially true of the new agentic AIs, which can take a wide range of actions to accomplish the goals you give it, from searching the web to creating documents. But what should you ask for? A lot of people seem stumped. Again, GPT-5 solves this problem. It is very proactive, always suggesting things to do.

I asked GPT-5 Thinking (I trust the less powerful GPT-5 models much less) “generate 10 startup ideas for a former business school entrepreneurship professor to launch, pick the best according to some rubric, figure out what I need to do to win, do it.” I got the business idea I asked for. I also got a whole bunch of things I did not: drafts of landing pages and LinkedIn copy and simple financials and a lot more. I am a professor who has taught entrepreneurship (and been an entrepreneur) and I can say confidently that, while not perfect, this was a high-quality start that would have taken a team of MBAs a couple hours to work through. From one prompt.

It just does things, and it suggested others things to do. And it did those, too: PDFs and Word documents and Excel and research plans and websites.

It is impressive, a little unnerving, to have the AI go so far on its own. You can also see the AI asked for my guidance but was happy to proceed without it. This is a model that wants to do things for you.

Building Things

Let me show you what 'just doing stuff' looks like for a non-coder using GPT-5 for coding. For fun, I prompted GPT-5 “make a procedural brutalist building creator where i can drag and edit buildings in cool ways, they should look like actual buildings, think hard.” That's it. Vague, grammatically questionable, no specifications.

A couple minutes later, I had a working 3D city builder.

Not a sketch. Not a plan. A functioning app where I could drag buildings around and edit them as needed. I kept typing variations of “make it better” without any additional guidance. And GPT-5 kept adding features I never asked for: neon lights, cars driving through streets, facade editing, pre-set building types, dramatic camera angles, a whole save system. It was like watching someone else's imagination at work. The product you see below was 100% AI, all I did was keep encouraging the system - and you don’t just have to watch my video, you can play with the simulator here.

At no point did I look at the code it was creating. The model wasn’t flawless, there were occasional bugs and errors. But in some ways, that was where GPT-5 was at its most impressive. If you have tried “vibecoding” using the AI before, you have almost certainly fallen into a doom loop, where, after a couple of rounds of asking the AI to create something for you, it starts to fail, getting caught in loops of confusion where each error fixed creates new ones. That never happened here. Sometimes new errors were introduced by the AI, but they were always fixed by simply pasting in the error text. I could just ask for whatever I want (or rather let the AI decide to create whatever it wanted) and I never got stuck.

Premonitions

I have written this piece before OpenAI released any official benchmarks about how well its model performs, but, in some ways, it doesn’t matter that much. Last week, Google released Gemini 2.5 with Deep Think, a model that can solve very hard problems (including getting a gold medal at the International Math Olympiad). Many people didn’t notice because they do not have a store of very hard problems they are waiting for AI to solve. I have played enough with GPT-5 to know that it is a very good model (at least the large GPT-5 Thinking model is excellent). But what it really brings to the table is the fact that it just does things. It will tell you what model to use, it will suggest great next steps, it will write in more interesting prose (though it still loves the em-dash). The burden of using AI is lessened.

To be clear, Humans are still very much in the loop, and need to be. You are asked to make decisions and choices all the time by GPT-5, and these systems still make errors and generate hallucinations that humans need to check (although I did not spot any major issues in my own use). The bigger question is whether we will want to be in the loop. GPT-5 (and, I am sure, future releases by other companies) is very smart and pro-active. Which brings me back to that building simulator. I gave the AI encouragement, mostly versions of “make it better.” From that minimal input, it created a fully functional city builder with facade editing, dynamic cameras, neon lights, and flying tours. I never asked for any of these features. I never even looked at the code.

This is what "just doing stuff" really means. When I told GPT-5 to do something dramatic for my intro, it created that paragraph with its hidden acrostic and ascending word counts. I asked for dramatic. It gave me a linguistic magic trick. I used to prompt AI carefully to get what I asked for. Now I can just... gesture vaguely at what I want. And somehow, that works.

Another big change in how we relate to AI is coming, but we will figure out how to adapt to it, as we always do. The difference, this time, is that GPT-5 might figure it out first and suggest next steps.

Subscribe now

Share

The result of the prompt: make an incredibly compelling 14:10 SVG that I can use for my substack post about the launch of GPT-5, the theme of which is "it just does stuff for you" Be radical in your approach.
1

As a reminder, I take no money from any of the AI Labs, including OpenAI. I have no agreements with them besides NDAs. I don’t show them any posts before I write them.

The Bitter Lesson versus The Garbage Can

2025-07-28 19:30:43

One of my favorite academic papers about organizations is by Ruthanne Huising, and it tells the story of teams that were assigned to create process maps of their company, tracing what the organization actually did, from raw materials to finished goods. As they created this map, they realized how much of the work seemed strange and unplanned. They discovered entire processes that produced outputs nobody used, weird semi-official pathways to getting things done, and repeated duplication of efforts. Many of the employees working on the map, once rising stars of the company, became disillusioned.

The Process Map

I’ll let Prof. Huising explain what happened next: “Some held out hope that one or two people at the top knew of these design and operation issues; however, they were often disabused of this optimism. For example, a manager walked the CEO through the map, presenting him with a view he had never seen before and illustrating for him the lack of design and the disconnect between strategy and operations. The CEO, after being walked through the map, sat down, put his head on the table, and said, "This is even more fucked up than I imagined." The CEO revealed that not only was the operation of his organization out of his control but that his grasp on it was imaginary.”

For many people, this may not be a surprise. One thing you learn studying (or working in) organizations is that they are all actually a bit of a mess. In fact, one classic organizational theory is actually called the Garbage Can Model. This views organizations as chaotic "garbage cans" where problems, solutions, and decision-makers are dumped in together, and decisions often happen when these elements collide randomly, rather than through a fully rational process. Of course, it is easy to take this view too far - organizations do have structures, decision-makers, and processes that actually matter. It is just that these structures often evolved and were negotiated among people, rather than being carefully designed and well-recorded.

The Garbage Can represents a world where unwritten rules, bespoke knowledge, and complex and undocumented processes are critical. It is this situation that makes AI adoption in organizations difficult, because even though 43% of American workers have used AI at work, they are mostly doing it in informal ways, solving their own work problems. Scaling AI across the enterprise is hard because traditional automation requires clear rules and defined processes; the very things Garbage Can organizations lack. To address the more general issues of AI and work requires careful building of AI-powered systems for specific use cases, mapping out the real processes and making tools to solve the issues that are discovered.

This is a hard, slow process that suggests enterprise AI adoption will take time. At least, that's how it looks if we assume AI needs to understand our organizations the way we do. But AI researchers have learned something important about these sorts of assumptions.

The Bitter Lesson

Computer scientist Richard Sutton introduced the concept of the Bitter Lesson in an influential 2019 essay where he pointed out a pattern in AI research. Time and again, AI researchers trying to solve a difficult problem, like beating humans in chess, turned to elegant solutions, studying opening moves, positional evaluations, tactical patterns, and endgame databases. Programmers encoded centuries of chess wisdom in hand-crafted software: control the center, develop pieces early, king safety matters, passed pawns are valuable, and so on. Deep Blue, the first chess computer to beat the world’s best human, used some chess knowledge, but combined that with the brute force of being able to search 200 million positions a second. In 2017, Google released AlphaZero, which could beat humans not just in chess but also in shogi and go, and it did it with no prior knowledge of these games at all. Instead, the AI model trained against itself, playing the games until it learned them. All of the elegant knowledge of chess was irrelevant, pure brute force computing combined with generalized approaches to machine learning, was enough to beat them. And that is the Bitter Lesson — encoding human understanding into an AI tends to be worse than just letting the AI figure out how to solve the problem, and adding enough computing power until it can do it better than any human.

Why two versions of this graph? And why are they slightly different? Answers in a bit!

The lesson is bitter because it means that our human understanding of problems built from a lifetime of experience is not that important in solving a problem with AI. Decades of researchers' careful work encoding human expertise was ultimately less effective than just throwing more computation at the problem. We are soon going to see whether the Bitter Lesson applies widely to the world of work.

Agents

While individuals can get a lot of benefits out of using chatbots themselves, a lot of excitement about how to use AI in organizations focuses on agents, a fuzzy term that I define as AI systems capable of taking autonomous action to accomplish a goal. As opposed to guiding a chatbot with prompting, you delegate a task to an agent, and it accomplishes it. However, previous AI systems have not been good enough to handle the full range of organizational needs, there is just too much messiness in the real world. This is why when we created our first AI-powered teaching games a year ago, we had to carefully design each step in the agentic system to handle narrow tasks. And though AI ability to work autonomously is increasing very rapidly, they are still far from human-level on most complicated jobs and are easily led astray on complex tasks.

This is with an 80% success threshold

As an example of the state-of-the art in agentic systems, consider Manus, which uses Claude and a series of clever approaches to make AI agents that can get real work done. The Manus team has shared a lot of tips for building agents, involving some interesting bits of engineering and very elaborate prompt design. When writing this post, I asked Manus: “i need an attractive graph that compares the ELO of the best grandmaster and the ELO of the worlds best chess computer from the first modern chess computer through 2025.” And the system got to work. First, Manus always creates a to-do list, then it gathered data and wrote a number of files and, after some minor adjustments I asked for, finally came up with the graph you can see on the left side above (the one without the box around the graph).

Why did it do these things in this order? Because Manus was built by hand, carefully crafted to be the best general purpose agent available. There are hundreds of lines of bespoke text in its system prompts, including detailed instructions about how to build a to-do list. It incorporates hard-won knowledge on how to make agents work with today’s AI systems.

Do you see the potential problem? “Carefully crafted,” “bespoke,” “incorporates hard-won knowledge” — exactly the kind of work the Bitter Lesson tells us to avoid because it will eventually be made irrelevant by more general-purpose techniques.

It turns out there is now evidence that this may be possible with the recent release of ChatGPT agent (an uninspiring name, but at least it is clear, a big step forward for OpenAI!). ChatGPT agent represents a fundamental shift. It is not trained on the process of doing work; instead, OpenAI used reinforcement learning to train their AI on the actual final outcomes. For example. they may not teach it how to create an Excel file the way a human would, they would simply rate the quality of the Excel files it creates until it learns to make a good one, using whatever methods the AI develops. To illustrate how reinforcement learning and careful crafting lead to similar outcomes, I gave the exact same chess prompt to ChatGPT agent and got the graph on the right above. But this time there was no to-do list, no script to follow, instead the agent charted whatever mysterious course was required to get me the best output it could, according to its training. You can see an excerpt of that below:

But you might notice a few differences between the two charts, besides their appearance. For example, each has different ratings for Deep Blue’s performance because the ELO for Deep Blue was never officially measured. The rating from Manus was based off a basic search, we found a speculative Reddit discussion, while the ChatGPT agent, trained with the reinforcement learning approaches used in Deep Research, turned up more credible sources, including an Atlantic article, to back up its claim. In a similar way, when I asked both agents to re-create the graph by making a fully functional Excel file, ChatGPT’s version worked, while Manus’s had errors.

I don’t know if ChatGPT agent is better than Manus yet, but I suspect that it is far more likely to make gains faster than its competitor. To improve Manus will involve more careful crafting and bespoke work, to improve ChatGPT agents simply requires more computer chips and more examples. If the Bitter Lesson holds, the long-term outcome seems pretty clear. But more critically, the comparison between hand-crafted and outcome-trained agents points to a fundamental question about how organizations should approach AI adoption.

Agents in the Garbage Can

This returns us to the world of organizations. While individuals rapidly adopt AI, companies still struggle with the Garbage Can problem, spending months mapping their chaotic processes before deploying any AI system. But what if that's backwards?

The Bitter Lesson suggests we might soon ignore how companies produce outputs and focus only on the outputs themselves. Define what a good sales report or customer interaction looks like, then train AI to produce it. The AI will find its own paths through the organizational chaos; paths that might be more efficient, if more opaque, than the semi-official routes humans evolved. In a world where the Bitter Lesson holds, the despair of the CEO with his head on the table is misplaced. Instead of untangling every broken process, he just needs to define success and let AI navigate the mess. In fact, Bitter Lesson might actually be sweet: all those undocumented workflows and informal networks that pervade organizations might not matter. What matters is knowing good output when you see it.

If this is true, the Garbage Can remains, but we no longer need to sort through it while competitive advantage itself gets redefined. The effort companies spent refining processes, building institutional knowledge, and creating competitive moats through operational excellence might matter less than they think. If AI agents can train on outputs alone, any organization that can define quality and provide enough examples might achieve similar results, whether they understand their own processes or not.

Or it might be that the Garbage Can wins, that human complexity and those messy, evolved processes are too intricate for AI to navigate without understanding them. We're about to find out which kind of problem organizations really are: chess games that yield to computational scale, or something fundamentally messier. The companies betting on either answer are already making their moves, and we will soon get to learn what game we're actually playing.

Subscribe now

Share