One Useful Thing

Trying to understand the implications of AI for work, education, and life. By Prof. Ethan Mollick

The Cybernetic Teammate

2025-03-22 19:39:02

Over the past couple years, we have learned that AI can boost the productivity of individual knowledge workers ranging from consultants to lawyers to coders. But most knowledge work isn’t purely an individual activity; it happens in groups and teams. And teams aren't just collections of individuals – they provide critical benefits that individuals alone typically can't, including better performance, sharing of expertise, and social connections.

So, what happens when AI acts as a teammate? This past summer we conducted a pre-registered, randomized controlled trial of 776 professionals at Procter and Gamble, the consumer goods giant, to find out.

We are ready to share the results in a new working paper: The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise. Given the scale of this project, it shouldn’t be a surprise that this paper was a massive team effort coordinated by the Digital Data Design Institute at Harvard and led by Fabrizio Dell’Acqua, Charles Ayoubi, and Karim Lakhani, along with Hila Lifshitz, Raffaella Sadun, Lilach Mollick, me, and our partners at Procter and Gamble: Yi Han, Jeff Goldman, Hari Nair, and Stewart Taub.

We wanted this experiment to be a test of real-world AI use, so we were able to replicate the product development process at P&G, thanks to the cooperation and help of the company (which had no control over the results or data). To do that, we ran one-day workshops where professionals from Europe and the US had to actually develop product ideas, packaging, retail strategies and other tasks for the business units they really worked for, which included baby products, feminine care, grooming, and oral care. Teams with the best ideas had them submitted to management for approval, so there were some real stakes involved.

We also had two kinds of professionals in our experiment: commercial experts and technical R&D experts. They were generally very experienced, with over 10 years of work at P&G alone. We randomly created teams consisting of one person in each specialty. Half were given GPT-4 or GPT-4o to use, and half were not. We also picked a random set of both types of specialists to work alone, and gave half of them access to AI. Everyone assigned to the AI condition was given a training session and a set of prompts they could use or modify. This design allowed us to isolate the effects of AI and teamwork independently and in combination. We measured outcomes across multiple dimensions including solution quality (as determined by at least two expert judges per solution), time spent, and participants' emotional responses. What we found was interesting.

AI boosts performance

When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.

Teams with AI performed best overall with a 0.39 standard deviation improvement, though the difference between individuals with AI and teams with AI wasn't statistically significant. But we found an interesting pattern when looking at truly exceptional solutions, those ranking in the top 10% of quality. Teams using AI were significantly more likely to produce these top-tier solutions, suggesting that there is value in having human teams working on a problem that goes beyond the value of working with AI alone.
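
As a side note for readers less used to effect sizes: the "standard deviation improvements" above are standardized effects, the difference in average solution quality between conditions divided by the spread of scores. A minimal sketch of that calculation in Python, using made-up numbers rather than anything from the study:

```python
import statistics

# Hypothetical judge scores (NOT the study's data), just to show the calculation.
control = [5.1, 4.8, 5.5, 4.9, 5.2, 5.0]      # e.g., individuals without AI
treatment = [5.6, 5.4, 5.9, 5.3, 5.7, 5.5]    # e.g., individuals with AI

def standardized_effect(treated, baseline):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(baseline)
    v1, v2 = statistics.variance(treated), statistics.variance(baseline)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (statistics.mean(treated) - statistics.mean(baseline)) / pooled_sd

print(f"Effect size: {standardized_effect(treatment, control):.2f} standard deviations")
```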

Both AI-enabled groups also worked much faster, saving 12-16% of the time spent by non-AI groups while producing solutions that were substantially longer and more detailed than those from non-AI groups.

Expertise boundaries vanish

Without AI, we saw clear professional silos in how people approached problems. R&D specialists consistently proposed technically-oriented solutions while Commercial specialists suggested market-focused ideas. When these specialists worked together in teams without AI, they produced more balanced solutions through their cross-functional collaboration (teamwork wins again!).

But this was another place AI made a big difference. When paired with AI, both R&D and Commercial professionals, in teams or when working alone, produced balanced solutions that integrated both technical and commercial perspectives. The distinction between specialists virtually disappeared in AI-aided conditions, as you can see in the graph. We saw a similar effect on teams.

This effect was especially pronounced for employees less familiar with product development. Without AI, these less experienced employees performed relatively poorly even in teams. But with AI assistance, they suddenly performed at levels comparable to teams that included experienced members. AI effectively helped people bridge functional knowledge gaps, allowing them to think and create beyond their specialized training, and helped amateurs act more like experts.

Working with AI led to better emotional experiences

A particularly surprising finding was how AI affected the emotional experience of work. Technological change, and especially AI, has often been associated with reduced workplace satisfaction and increased stress. But our results showed the opposite, at least in this case.

Positive emotions increase and negative emotions decrease after working with AI compared to teams and individuals who did not have AI access.

People using AI reported significantly higher levels of positive emotions (excitement, energy, and enthusiasm) compared to those working without AI. They also reported lower levels of negative emotions like anxiety and frustration. Individuals working with AI had emotional experiences comparable to or better than those working in human teams.

While we conducted a thorough study that involved a pre-registered randomized controlled trial, there are always caveats to these sorts of studies. For example, it is possible that larger teams would show very different results when working with AI, or that working with AI for longer projects may impact its value. It is also possible that our results represent a lower bound: all of these experiments were conducted with GPT-4 or GPT-4o, less capable models than what are available today; the participants did not have a lot of prompting experience so they may not have gotten as much benefit; and chatbots are not really built for teamwork. There is a lot more detail on all of this in the paper, but limitations aside, the bigger question might be: why does this all matter?

Why This Matters

Organizations have primarily viewed AI as just another productivity tool, like a better calculator or spreadsheet. This made sense initially but has become increasingly limiting as models get better and as recent data finds users most often employ AI for critical thinking and complex problem solving, not just routine productivity tasks. Companies that focus solely on efficiency gains from AI will not only find workers unwilling to share their AI discoveries for fear of making themselves redundant but will also miss the opportunity to think bigger about the future of work.

To successfully use AI, organizations will need to change their analogies. Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences. This teammate perspective should make organizations think differently about AI. It suggests a need to reconsider team structures, training programs, and even traditional boundaries between specialties. At least with the current set of AI tools, AI augments human capabilities. It democratizes expertise as well, enabling more employees to contribute meaningfully to specialized tasks and potentially opening new career pathways.

The most exciting implication may be that AI doesn't just automate existing tasks, it changes how we can think about work itself. The future of work isn't just about individuals adapting to AI, it's about organizations reimagining the fundamental nature of teamwork and management structures themselves. And that's a challenge that will require not just technological solutions, but new organizational thinking.


Speaking things into existence

2025-03-12 02:21:28

Influential AI researcher Andrej Karpathy wrote two years ago that “the hottest new programming language is English,” a topic he expanded on last month with the idea of “vibecoding,” a practice where you just ask an AI to create something for you, giving it feedback as it goes. I think the implications of this approach are much wider than coding, but I wanted to start by doing some vibecoding myself.

I decided to give it a try using Anthropic’s new Claude Code agent, which gives the Claude Sonnet 3.7 LLM the ability to manipulate files on your computer and use the internet. Actually, I needed AI help before I could even use Claude Code. I can only code in a few very specific programming languages (mostly used in statistics) and have no experience at all with Linux machines. Yet Claude Code only runs in Linux. Fortunately, Claude told me how to handle my problems, so after some vibetroubleshooting (seriously, if you haven’t used AI for technical support, you should) I was able to set up Claude Code.

Time to vibecode. The very first thing I typed into Claude Code was: “make a 3D game where I can place buildings of various designs and then drive through the town i create.” That was it, grammar and spelling issues included. I got a working application (Claude helpfully launched it in my browser for me) about four minutes later, with no further input from me. You can see the results in the video below.

It was pretty neat, but a little boring, so I wrote: “hmmm its all a little boring (also sometimes the larger buildings don't place properly). Maybe I control a firetruck and I need to put out fires in buildings? We could add traffic and stuff.”

A couple minutes later, it made my car into a fire truck, added traffic, and made it so houses burst into flame. Now we were getting somewhere, but there were still things to fix. I gave Claude feedback: “looking better, but the firetruck changes appearance when moving (wheels suddenly appear) and there is no issue with traffic or any challenge, also fires don't spread and everything looks very 1980s, make it all so much better.”

After seeing the results, I gave it a fourth, and final, command as a series of three questions: “can i reset the board? can you make the buildings look more real? can you add in a rival helicopter that is trying to extinguish fires before me?” You can see the results of all four prompts in the video below. It is a working, if blocky, game, but one that includes day and night cycles, light reflections, missions, and a computer-controlled rival, all created using the hottest of all programming languages - English.

Actually, I am leaving one thing out. Between the third and fourth prompts, something went wrong and the game just wouldn't work. As someone with no programming skills in JavaScript or whatever the game was written in, I had no idea how to fix it. The result was a sequence of back-and-forth discussions with the AI where I would tell it errors and it would work to solve them. After twenty minutes, everything was working again, better than ever. In the end, the game cost around $5 in Claude API fees to make… and $8 more to get around the bug, which turned out to be a pretty simple problem. Prices will likely fall quickly but the lesson is useful: as amazing as it is (I made a working game by asking!), vibecoding is most useful when you actually have some knowledge and don't have to rely on the AI alone. A better programmer might have immediately recognized that the issue was related to asset loading or event handling. And this was a small project; I am less confident of my ability to work with AI to handle a large codebase or complex project, where even more human intervention would be required.

This underscores how vibecoding isn't about eliminating expertise but redistributing it - from writing every line of code to knowing enough about systems to guide, troubleshoot, and evaluate. The challenge becomes identifying what "minimum viable knowledge" is necessary to effectively collaborate with AI on various projects.

Vibeworking with expertise

Expertise clearly still matters in a world of creating things with words. After all, you have to know what you want to create; be able to judge whether the results are good or bad; and give appropriate feedback. As I wrote in my book, with current AIs, you can often achieve the best results by working as a co-intelligence with AI systems which continue to have a "jagged frontier" of abilities.

But applying expertise need not involve a lot of work. Take for example, my recent experience with Manus, a new AI agent out of China. It basically uses Claude (and possibly other LLMs as well) but gives the AI access to a wide range of tools, including the ability to do web research, code, create documents and websites and more. It is the most capable general-purpose agent I have seen so far, but like other general agents, it still makes errors and mistakes. Despite that, it can accomplish some pretty impressive things.

For example, here is a small portion of what it did when I asked it to “create an interactive course on elevator pitching using the best academic advice.” You can see the system set up a checklist of tasks and then go through them, doing web research before building the pages (this is sped up, the actual process unfolds autonomously, but over tens of minutes or even hours).

As someone who teaches entrepreneurship, I would say that the output it created was surface-level impressive - it was an entire course that covered much of the basics of pitching, and without obvious errors! Yet, I also could instantly see that it was too text heavy and did not include opportunities for knowledge checks or interactive exercises. I gave the AI a second prompt: “add interactive experiences directly into course material and links to high quality videos.” Even though this was the bare minimum feedback, it was enough to improve the course considerably, as you can see below.

On the left you can see the overall class structure it created; when I clicked on the first lesson, it took me to an overall module guide, and then each module was built out with videos, text, and interactive quizzes.

If I were going to deploy the course, I would push the AI further and curate the results much more, but it is impressive to see how far you can get with just a little guidance. But there are other modes of vibework as well. While course creation demonstrates AI's ability to handle casual structured creative work with minimal guidance, research represents a more complex challenge requiring deeper expertise integration.

Deep Vibeworking

It is at the cutting edge of expertise that AI gets most interesting to use. Unfortunately for anyone writing about this sort of work, these are also the use cases that are hardest to explain, but I can give you one example.

I have a large, anonymized set of data about crowdfunding efforts that I collected nearly a decade ago, but never got a chance to use for any research purposes. The data is very complex - a huge Excel file, a codebook (that explains what the various parts of the Excel file mean), and a data dictionary (that details each entry in the Excel file). Working on the data involved frequent cross-referencing through these files and is especially tedious if you haven’t been working with the data in a long time. I was curious how far I could get in writing a new research paper using this old data with the help of AI.
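
For a sense of what that cross-referencing looks like in practice, here is a minimal sketch in Python; the file and column names are hypothetical stand-ins, not the actual dataset:

```python
import pandas as pd  # requires pandas (and openpyxl for .xlsx files)

# Hypothetical file names, standing in for the real (anonymized) dataset.
data = pd.read_excel("crowdfunding_data.xlsx")     # one row per campaign, cryptic column codes
codebook = pd.read_csv("codebook.csv")             # maps column codes to human-readable names
dictionary = pd.read_csv("data_dictionary.csv")    # explains each coded value within a column

# Rename the cryptic columns using the codebook (e.g., "q17b" -> "org_backing").
rename_map = dict(zip(codebook["column_code"], codebook["column_name"]))
data = data.rename(columns=rename_map)

# Decode one categorical column using the data dictionary.
org_codes = dictionary[dictionary["column_name"] == "org_backing"]
decode_map = dict(zip(org_codes["value_code"], org_codes["value_label"]))
data["org_backing"] = data["org_backing"].map(decode_map)

print(data[["campaign_id", "org_backing", "amount_raised"]].head())
```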

I started by getting an OpenAI Deep Research report on the latest literature on how organizations could impact crowdfunding. I was able to check the report over based on my knowledge. I knew that it would not include all the latest articles (Deep Research cannot access paid academic content), but its conclusions were solid and would be useful to the AI when considering what topics might be worth exploring. So, I pasted in the report and the three files into the secure version of ChatGPT provided by my university and worked with multiple models to generate hypotheses. The AI suggested multiple potential directions, but I needed to filter them based on what would actually contribute meaningfully to the field—a judgment call requiring years of experience with the relevant research.

Then I worked back and forth with the models to test the hypothesis and confirm that our findings were correct. The AI handled the complexity of the data analysis and made a lot of suggestions, while I offered overall guidance and direction about what to do next. At several points, the AI proposed statistically valid approaches that I, with my knowledge of the data, knew would not be appropriate. Together, we worked through the hypothesis to generate fairly robust findings.

Then I gave all of the previous output to o1-pro and asked it to write a paper, offering a few suggestions along the way. It is far from a blockbuster, but it would make a solid contribution to the state of knowledge (after a bit more checking of the results, as AI still makes errors). More interestingly, it took less than an hour to create, as compared to weeks of thinking, planning, writing, coding and iteration. Even if I had to spend an hour checking the work, it would still result in massive time savings.

I never had to write a line of code, but only because I knew enough to check the results and confirm that everything made sense. I worked in plain English, shaving dozens of hours of work that I could not have done anywhere near as quickly without the AI… but there were many places where the AI did not yet have the “instincts” to solve problems properly. The AI is far from being able to work alone, humans still provide both vibe and work in the world of vibework.

Work is changing

Work is changing, and we're only beginning to understand how. What's clear from these experiments is that the relationship between human expertise and AI capabilities isn't fixed. Sometimes I found myself acting as a creative director, other times as a troubleshooter, and yet other times as a domain expert validating results. It was my complex expertise (or lack thereof) that determined the quality of the output.

The current moment feels transitional. These tools aren't yet reliable enough to work completely autonomously, but they're capable enough to dramatically amplify what we can accomplish. The $8 debugging session for my game reminds me that the gaps in AI capabilities still matter, and knowing where those gaps are becomes its own form of expertise. Perhaps most intriguing is how quickly this landscape is changing. The research paper that took me an hour with AI assistance would have been impossible at this speed just eighteen months ago.

Rather than reaching definitive conclusions about how AI will transform work, I find myself collecting observations about a moving target. What seems consistent is that, for now, the greatest value comes not from surrendering control entirely to AI or clinging to entirely human workflows, but from finding the right points of collaboration for each specific task—a skill we're all still learning.


A new generation of AIs: Claude 3.7 and Grok 3

2025-02-25 02:42:26

Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger. I updated the post with that information. The only significant change is that Claude 3 is now referred to as an advanced model but not a Gen3 model.

I have been experimenting with the first of a new generation of AI models, Claude 3.7 and Grok 3, for the last few days. Grok 3 is the first model we know of that was trained with an order of magnitude more computing power than GPT-4, and Claude includes new coding and reasoning capabilities, so they are not just interesting in their own right but also tell us something important about where AI is going.

Before we get there, a quick review: this new generation of AIs is smarter and the jump in capabilities is striking, particularly in how these models handle complex tasks, math and code. These models often give me the same feeling I had when using ChatGPT-4 for the first time, where I am equally impressed and a little unnerved by what it can do. Take Claude's native coding ability: I can now get working programs through natural conversation or documents, no programming skill needed.

For example, giving Claude a proposal for a new AI educational tool and engaging in conversation where it was asked to “display the proposed system architecture in 3D, make it interactive,” resulted in this interactive visualization of the core design in our paper, with no errors. You can try it yourself here, and edit or change it by asking the AI. The graphics, while neat, are not the impressive part. Instead, it was that Claude decided to turn this into a step-by-step demo to explain the concepts, which wasn’t something that it was asked to do. This anticipation of needs and consideration of new angles of approach is something new in AI.

Or, for a more playful example, I told Claude “make me an interactive time machine artifact, let me travel back in time and interesting things happen. pick unusual times I can go back to…” and “add more graphics.” What emerged after just those two prompts was a fully functional interactive experience, complete with crude but charming pixel graphics (which are actually surprisingly impressive - the AI has to 'draw' these using pure code, without being able to see what it's creating, like an artist painting blindfolded but still getting the picture right).

To be clear, these systems are far from perfect and make mistakes, but they are getting much better, and fast. To understand where things are and where they are going, we need to look at the two Scaling Laws.

The Two Scaling Laws

Though they may not look it, these may be the two most important graphs in AI. Published by OpenAI, they show the two “Scaling Laws,” which tell you how to increase the ability of the AI to answer hard questions, in this case to score more highly on the famously difficult American Invitational Mathematics Examination (AIME).

The left-hand graph is the training Scaling Law. It shows that larger models are more capable. Training these larger models requires increasing the amount of computing power, data, and energy used, and you need to do so on a grand scale. Typically, you need a 10x increase in computing power to get a linear increase in performance. Computing power is measured in FLOPs (Floating Point Operations) which are the number of basic mathematical operations, like addition or multiplication, that a computer performs, giving us a way to quantify the computational work done during AI training.

We are now seeing the first models of a new generation of AIs, trained with over 10x the computing power of GPT-4 and its many competitors. These models use over 10^26 FLOPs of computing power in training. This is a staggering amount of computing power, equivalent to running a modern smartphone for 634,000 years or the Apollo Guidance Computer that took humans to the moon for 79 trillion years. Naming 10^26 is awkward, though - it is one hundred septillion FLOPs, or, taking a little liberty with standard unit names, a HectoyottaFLOP. So, you can see why I just call them Gen3 models, the first set of AIs that were trained with an order of magnitude more computing power than GPT-4 (Gen2).
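
The smartphone comparison is easy to sanity-check. A quick sketch of the arithmetic, where the sustained smartphone throughput of roughly 5 trillion operations per second is my own rough assumption rather than a figure from the post:

```python
TRAINING_FLOPS = 1e26          # total floating point operations for a Gen3-scale training run
SECONDS_PER_YEAR = 3.156e7

# Rough assumption: a modern smartphone sustains ~5 trillion operations per second.
smartphone_flops_per_sec = 5e12

years = TRAINING_FLOPS / (smartphone_flops_per_sec * SECONDS_PER_YEAR)
print(f"~{years:,.0f} years of smartphone compute")   # on the order of 634,000 years
```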

xAI, Elon Musk's AI company, made the first public move into Gen3 territory with Grok 3, which is unsurprising given their strategy. xAI is betting big on the idea that bigger (way bigger) is better. xAI built the world's largest computer cluster in record time, and that meant Grok 3 was the first AI model to show us whether the Scaling Law would hold up for a new generation of AI. It seems that it did, as Grok 3 had the highest benchmark scores we've seen from any base model. Today, Claude 3.7 was released; though not yet a Gen3 model, it also shows substantial improvements in performance over previous AIs. While it is similar in benchmarks to Grok 3, I personally find it more clever for my use cases, but you may find otherwise. The still unreleased o3 from OpenAI also seems to be a Gen3 model, with excellent performance. It is likely this is just the beginning - more companies are gearing up to launch their own models at this scale, including Anthropic.

You might have noticed I haven’t yet mentioned the second graph, the one on the right. While the first Scaling Law is about throwing massive computing power at training (basically, building a smarter AI from the start), this second one revealed something surprising: you can make AI perform better simply by giving it more time to think. OpenAI discovered that if you let a model spend more computing power working through a problem (what they call test-time or inference-time compute), it gets better results - kind of like giving a smart person a few extra minutes to solve a puzzle. This second Scaling Law led to the creation of Reasoners, which I wrote about in my last post. The new generation of Gen3 models will all operate as Reasoners when needed, so they have two advantages: larger scale in training, and the ability to scale when actually solving a problem.

An example of three different models using reasoning

Together, these two trends are supercharging AI abilities, and also adding new ones. If you have a large, smart AI model, it can be used to create smaller, faster, cheaper models that are still quite smart, if not as smart as their parent. And if you add Reasoner capabilities to even small models, they get even smarter. What that means is that AI abilities are getting better even as costs are dropping. This graph shows how quickly this trend has advanced, mapping the capability of AI on the y axis and the logarithmically decreasing costs on the x axis. When GPT-4 came out, it was around $50 per million tokens (roughly a word); now it costs around 12 cents per million tokens to use Gemini 1.5 Flash, an even more capable model than the original GPT-4.

The Graduate-Level Google-Proof Q&A test (GPQA) is a series of very hard multiple-choice problems designed to test advanced knowledge. PhDs with access to the internet get 34% right on this test outside their specialty, 81% inside their specialty. The cost per million tokens is the cost of using the model (Gemini Flash Thinking Costs are estimated). Data based on my research, but Epoch and Artificial Analysis were good sources, and Latent Space offers its own more comprehensive graph of costs across many models.

You can see the intelligence of models is increasing, and their cost is decreasing over time. That has some pretty big implications for all of us.
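
A rough sketch of what that price decline means for a concrete workload, using the two prices quoted above (the workload itself is an arbitrary example):

```python
# Prices quoted in the post, in dollars per million tokens.
gpt4_launch_price = 50.00
gemini_flash_price = 0.12

# Arbitrary example workload: summarizing 10,000 documents of ~2,000 tokens each.
tokens = 10_000 * 2_000
print(f"GPT-4 at launch:  ${tokens / 1e6 * gpt4_launch_price:,.2f}")   # $1,000.00
print(f"Gemini 1.5 Flash: ${tokens / 1e6 * gemini_flash_price:,.2f}")  # $2.40
print(f"Cost ratio: ~{gpt4_launch_price / gemini_flash_price:.0f}x cheaper")
```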

Taking Scale Seriously

A lot of the focus on AI use, especially in the corporate world, has been stuck in what I call the “automation mindset” - viewing AI primarily as a tool for speeding up existing workflows like email management and meeting transcription. This perspective made sense for earlier AI models, but it's like evaluating a smartphone solely on its ability to make phone calls. The Gen3 generation give the opportunity for a fundamental rethinking of what's possible.

As models get better, and as they apply more tricks like reasoning and internet access, they hallucinate less (though they still make mistakes) and they are capable of higher order “thinking.” For example, in this case we gave Claude a 24 page academic paper outlining a new way of creating teaching games with AI, along with some unrelated instruction manuals for other games. We asked the AI to use those examples and write a customer-friendly guide for a game based on our academic paper. The results were extremely high-quality. To do this, the AI needed to both abstract out the ideas in the paper, and the patterns and approaches from other instruction manuals, and build something entirely new. This would have been a week of PhD-level work, done in a few seconds. And, on the right, you can also see an excerpt from another PhD-level task, reading a complex academic paper and checking the math and logic, as well as the implications for practice.

Managers and leaders will need to update their beliefs for what AI can do, and how well it can do it, given these new AI models. Rather than assuming they can only do low-level work, we will need to consider the ways in which AI can serve as a genuine intellectual partner. These models can now tackle complex analytical tasks, creative work, and even research-level problems with surprising sophistication. The examples I've shared - from creating interactive 3D visualizations of academic concepts to performing PhD-level analysis - demonstrate that we're moving beyond simple automation into the realm of AI-powered knowledge work. These systems are still far from flawless, nor do they beat human experts consistently across a wide range of tasks, but they are very impressive.

This shift has profound implications for how organizations should approach AI integration. First, the focus needs to move from task automation to capability augmentation. Instead of asking "what tasks can we automate?" leaders should ask "what new capabilities can we unlock?" And they will need to build the capacity in their own organizations to help explore, and develop these changes.

Second, the rapid improvement in both capabilities and cost efficiency means that any static strategy for AI implementation will quickly become outdated. Organizations need to develop dynamic approaches that can evolve as these models continue to advance. Going all-in on a particular model today is not a good plan in a world where both Scaling Laws are operating.

Finally, and perhaps most importantly, we need to rethink how we measure and value AI contributions. The traditional metrics of time saved or costs reduced may miss the more transformative impacts of these systems - their ability to generate novel insights, synthesize complex information, and enable new forms of problem-solving. Moving too quickly to concrete KPIs, and leaving behind exploration, will blind companies to what is possible. Worse, they encourage companies to think of AI as a replacement for human labor, rather than exploring ways in which human work can be boosted by AI.

Exploring for Yourself

With that serious warning out of the way, I want to leave you with a suggestion. These new models are clever, but they are also friendly and more engaging to use. They are likely to ask you questions or push your thinking in new directions, and tend to be good at two-way conversation. The best way to understand their capabilities, then, is to explore them yourself. Claude 3.7 is available for paying customers and has a neat feature where it can run the code it writes for you, as you have seen throughout this post. It does not train on your uploaded data. Grok 3 is free and has a wider range of features, including a good Deep Research option, but is harder for amateurs to use for coding. It is not as good as Claude 3.7 for the tasks I have tried, but the xAI commitment to scaling means it will improve rapidly. You should also note that Grok does train on your data, but that can be turned off for paying customers.

Regardless of what model you pick, you should experiment. Ask the model to code something for you by just asking for it (I asked Claude for a video game with unique mechanics based on the Herman Melville story “Bartleby, the Scrivener” - and it did so based on a single prompt), feed it a document and ask it for an infographic summary, or ask it to comment on an image you upload. If this is too playful, follow the advice in my book and just use it for work tasks, taking into account the privacy caveat above. Use it to brainstorm new ideas, ask it how a news article or analyst report might affect your business, or ask it to create a financial dashboard for a new product or startup concept. You will likely find cases that amaze you, and others where the new models are not yet good enough to be helpful.

The limitations of these models remain very real, but the fact that Gen3 AIs are better than Gen2, due to both the first and second Scaling Laws, shows us something essential. These laws aren't fundamental constants of the universe - they're observations about what happens when you throw massive resources at AI development. The computing power keeps growing, the capabilities keep improving, and this cycle accelerates with each generation. As long as they continue to hold, AIs will keep getting better. Now we know that the next generation of AIs will continue to offer rapid improvements, suggesting that there is a good chance that AI capabilities may continue to increase into the future.


The End of Search, The Beginning of Research

2025-02-03 20:38:53

A hint to the future arrived quietly over the weekend. For a long time, I've been discussing two parallel revolutions in AI: the rise of autonomous agents and the emergence of powerful Reasoners since OpenAI's o1 was launched. These two threads have finally converged into something really impressive - AI systems that can conduct research with the depth and nuance of human experts, but at machine speed. OpenAI's Deep Research demonstrates this convergence and gives us a sense of what the future might be. But to understand why this matters, we need to start with the building blocks: Reasoners and agents.

Reasoners

For the past couple years, whenever you used a chatbot, it worked in a simple way: you typed something in, and it immediately started responding word by word (or more technically, token by token). The AI could only "think" while producing these tokens, so researchers developed tricks to improve its reasoning - like telling it to "think step by step before answering." This approach, called chain-of-thought prompting, markedly improved AI performance.
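
For readers who have not tried it, chain-of-thought prompting is nothing more exotic than adding that instruction to the request. A minimal sketch using the OpenAI Python SDK, with the model name as an illustrative example (the Reasoners discussed next do this kind of thing automatically):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = ("A bat and a ball cost $1.10 together; the bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# Plain prompt: the model starts answering immediately, token by token.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought prompt: ask it to reason first, then answer.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": question + " Think step by step before giving your final answer."}],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```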

Reasoners essentially automate the process, producing “thinking tokens” before actually giving you an answer. This was a breakthrough in at least two important ways. First, because the AI companies could now get AIs to learn how to reason based on examples of really good problem-solvers, the AI can “think” more effectively. This training process can produce a higher quality chain-of-thought than we can by prompting. This means Reasoners are capable of solving much harder problems, especially in areas like math or logic where older chatbots failed.

The second way this was a breakthrough is that it turns out that the longer Reasoners “think,” the better their answers get (though the rate of improvement slows as they think longer). This is a big deal because previously the only way to make AIs perform better was to train bigger and bigger models, which is very expensive and requires a lot of data. Reasoning models show you can make AIs better by just letting them produce more and more thinking tokens, using computing power at the time of answering your question (called inference-time compute) rather than when the model was trained.

The Graduate-Level Google-Proof Q&A test (GPQA) is a series of multiple-choice problems designed so that internet access doesn't help: PhDs with access to the internet get 34% right on this test outside their specialty and 81% inside their specialty. It illustrates how reasoning models have sped up the capability gain of AI. Data source.

Because Reasoners are so new, their capabilities are expanding rapidly. In just months, we've seen dramatic improvements from OpenAI's o1 family to their new o3 models. Meanwhile, China's DeepSeek r1 has found innovative ways to boost performance while cutting costs, and Google has launched their first Reasoner. This is just the beginning - expect to see more of these powerful systems, and soon.

Agents

While experts debate the precise definition of an AI agent, we can think of it simply as “an AI that is given a goal and can pursue that goal autonomously.” Right now, there's an AI labs arms race to build general-purpose agents - systems that can handle any task you throw at them. I've written about some early examples like Devin and Claude with Computer Use, but OpenAI just released Operator, perhaps the most polished general-purpose agent yet.

The video below, sped up 16x, captures both the promise and pitfalls of general-purpose agents. I give Operator a task: read my latest substack post at OneUsefulThing and then go onto Google ImageFX and make an appropriate image, download it, and give it to me to post. What unfolds is enlightening. At first, Operator moves with impressive precision - finding my website, reading the post, navigating to ImageFX (pausing briefly for me to enter my login), and creating the image. Then the troubles begin, and they're twofold: not only is Operator blocked by OpenAI's security restrictions on file downloads, but it also starts to struggle with the task itself. The agent methodically tries every conceivable workaround: copying to clipboard, generating direct links, even diving into the site's source code. Each attempt fails - some due to OpenAI's browser restrictions, others due to the agent's own confusion about how to actually accomplish the task. Watching this determined but ultimately failed problem-solving loop reveals both the current limitations of these systems and raises questions about how agents will eventually behave when they encounter barriers in the real world.

Operator's issues highlight the current limits of general-purpose agents, but that doesn’t suggest that agents are useless. It appears that economically valuable narrow agents that focus on specific tasks are already possible. These specialists, powered by current LLM technology, can achieve remarkable results within their domains. Case in point: OpenAI's new Deep Research, which shows just how powerful a focused AI agent can be.

Deep Research

OpenAI’s Deep Research (not to be confused with Google’s Deep Research, more on that soon) is essentially a narrow research agent, built on OpenAI’s still unreleased o3 Reasoner, and with access to special tools and capabilities. It is one of the more impressive AI applications I have seen recently. To understand why, let’s give it a topic. I am specifically going to pick a highly technical and controversial issue within my field of research: “When should startups stop exploring and begin to scale? I want you to examine the academic research on this topic, focusing on high quality papers and RCTs, including dealing with problematic definitions and conflicts between common wisdom and the research. Present the results for a graduate-level discussion of this issue.”

The AI asks some smart questions, and I clarify what I want. Now o3 goes off and gets to work. You can see its progress and “thinking” as it goes. It is really worth taking a second to look at a couple samples of that process below. You can see that the AI is actually working as a researcher, exploring findings, digging deeper into things that “interest” it, and solving problems (like finding alternative ways of getting access to paywalled articles). This goes on for five minutes.

Seriously take a moment to look at these three slices of its “thought” process

At the end, I get a 13 page, 3,778 word draft with six citations and a few additional references. It is, honestly, very good, even if I would have liked a few more sources. It wove together difficult and contradictory concepts, found some novel connections I wouldn’t expect, cited only high-quality sources, and was full of accurate quotations. I cannot guarantee everything is correct (though I did not see any errors) but I would have been satisfied to see something like it from a beginning PhD student. You can see the full results here but the couple excerpts below would suffice to show you why I am so impressed.

The quality of citations also marks a genuine advance here. These aren't the usual AI hallucinations or misquoted papers - they're legitimate, high-quality academic sources, including seminal work by my colleagues Saerom (Ronnie) Lee and Daniel Kim. When I click the links, they don't just lead to the papers, they often take me directly to the relevant highlighted quotes. While there are still constraints - the AI can only access what it can find and read in a few minutes, and paywalled articles remain out of reach - this represents a fundamental shift in how AI can engage with academic literature. For the first time, an AI isn't just summarizing research, it's actively engaging with it at a level that actually approaches human scholarly work.

It is worth contrasting it with Google’s product, launched last month and also called Deep Research (sigh). Google surfaces far more citations, but they are often a mix of websites of varying quality (the lack of access to paywalled information and books hurts all of these agents). It appears to gather documents all at once, as opposed to the curiosity-driven discovery of OpenAI’s researcher agent. And, because (as of now) this is powered by the non-reasoning, older Gemini 1.5 model, the overall summary is much more surface-level, though still solid and apparently error-free. It is like a very good undergraduate product. I suspect that the difference will be clear if you read a little bit below.

To put this in perspective: both outputs represent work that would typically consume hours of human effort - near PhD-level analysis from OpenAI's system, solid undergraduate work from Google's. OpenAI makes some bold claims in their announcement, complete with graphs suggesting their agent can handle 15% of high economic value research projects and 9% of very high value ones. While these numbers deserve skepticism - their methodology isn't explained - my hands-on testing suggests they're not entirely off base. Deep Research can indeed produce valuable, sophisticated analysis in minutes rather than hours. And given the rapid pace of development, I expect Google won't let this capability gap persist for long. We are likely to see fast improvement in research agents in the coming months.

The pieces come together

You can start to see how the pieces that the AI labs are building aren't just fitting together - they're playing off each other. The Reasoners provide the intellectual horsepower, while the agentic systems provide the ability to act. Right now, we're in the era of narrow agents like Deep Research, because even our best Reasoners aren't ready for general-purpose autonomy. But narrow isn’t limiting - these systems are already capable of performing work that once required teams of highly-paid experts or specialized consultancies.

These experts and consultancies aren't going away - if anything, their judgment becomes more crucial as they evolve from doing the work to orchestrating and validating the work of AI systems. But the labs believe this is just the beginning. They're betting that better models will crack the code of general-purpose agents, expanding beyond narrow tasks to become autonomous digital workers that can navigate the web, process information across all modalities, and take meaningful action in the world. Operator shows we aren’t there yet, but Deep Research suggests that we may be on our way.


Which AI to Use Now: An Updated Opinionated Guide (Updated Again 2/15)

2025-01-26 20:45:48

Please note that I updated this guide on 2/15, less than a month after writing it - a lot has changed in a short time.

While my last post explored the race for Artificial General Intelligence – a topic recently thrust into headlines by Apollo Program-scale funding commitments to building new AIs – today I'm tackling the one question I get asked most: what AI should you actually use? Not five years from now. Not in some hypothetical future. Today.

Every six months or so, I have written an opinionated guide for individual users of AI, not specializing in any one type of use, but as a general overview. Writing this is getting more challenging. AI models are gaining capabilities at an increasingly rapid rate, new companies are releasing new models, and nothing is well documented or well understood. In fact, in the few days I have been working on this draft, I had to add an entirely new model and update the chart below multiple times due to new releases. As a result, I may get something wrong, or you may disagree with my answers, but that is why I consider it an opinionated guide (though as a reminder, I take no money from AI labs, so it is my opinion!).

A Tour of Capabilities

To pick an AI model for you, you need to know what they can do. I decided to focus here on the major AI companies that offer easy-to-use apps that you can run on your phone, and which allow you to access their most up-to-date AI models. Right now, to consistently access a frontier model with a good app, you are going to need to pay around $20/month (at least in the US), with a couple exceptions. Yes, there are free tiers, but you'll generally want paid access to get the most capable versions of these models.

We are going to go through things in detail, but, for most people, there are three good choices right now: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. There are also a trio of models that might make sense for specialized users: Grok by Elon Musk’s X.ai is an excellent model that is most useful if you are a big X user; Microsoft’s Copilot offers many of the features of ChatGPT and is accessible to users through Windows; and DeepSeek r1, a Chinese model that is remarkably capable (and free). I’ll talk about some caveats and other options at the end.

Service and Model

For most people starting to use AI, the most important goal is to ensure that you have access to a frontier model with its own app. Frontier models are the most advanced AIs, and, thanks to the 'scaling law' (where bigger models get disproportionately smarter), they’re far more capable than older versions. That means they make fewer mistakes, and they often can provide more useful features.

The problem is that most of the AI companies push you towards their smaller AI models if you don’t pay for access, and sometimes even if you do. Generally, smaller models are much faster to run, slightly less capable, and also much cheaper for the AI companies to operate. For example, GPT-4o-mini is the smaller version of GPT-4 and Gemini Flash is the smaller version of Gemini. Often, you want to use the full models where possible, but there are exceptions when the smaller model is actually more advanced. And everything has terrible names. Right now, for Claude you want to use Claude 3.5 Sonnet (which consistently outperforms its larger sibling Claude 3 Opus), for Gemini you want to use Gemini 2.0 Pro (though Gemini 2.0 Flash Thinking is also excellent), and for ChatGPT you want to use GPT-4o (except when tackling complex problems that benefit from o1 or o3's reasoning capabilities). While this can be confusing, it is also a side effect of how quickly these companies are updating their AIs, and their features.

Live Mode

Imagine an AI that can converse with you in real-time, seeing what you see, hearing what you say, and responding naturally – that's “Live Mode” (though it goes by various names). This interactive capability represents a powerful way to use AI. To demonstrate, I used ChatGPT's “Advanced Voice Mode” to discuss my game collection. This entire interaction, which you can hear with sound on, took place on my phone.

You are actually seeing three advances in AI working together: First, multimodal speech lets the AI handle voice natively, unlike most AI models that use separate systems to convert between text and speech. This means it can theoretically generate any sound, though OpenAI limits this for safety. Second, multimodal vision lets the AI see and analyze real-time video. Third, internet connectivity provides access to current information. The system isn't perfect - when pulling the board game ratings from the internet, it got one right but mixed up another with its expansion pack. Still, the seamless combination of these features creates a remarkably natural interaction, like chatting with a knowledgeable (if not always 100% accurate) friend who can see what you're seeing.

Right now, only ChatGPT offers a full multimodal Live Mode for all paying customers. It’s the little icon all the way to the right of the prompt bar (ChatGPT is full of little icons). But Google has already demonstrated a Live Mode for its Gemini model, and I expect we will see others soon.

Reasoning

For those who are watching the AI space, by far the most important recent advance in the last few months has been the development of reasoning models. As I explained in my post about o1, it turns out that if you let an AI “think” about a problem before answering, you get better results. The longer the model thinks, generally, the better the outcome. Behind the scenes, it's cranking through a whole thought process you never see, only showing you the final answer. Interestingly, when you peek behind that curtain, you find these AIs think in ways that feel eerily human:

Really worth reading the thinking process, it is kind of charming

That was the thinking process of DeepSeek-v3 r1, one of only a few reasoning models that have been released to the public. It is also an unusual model in many ways: it is an excellent model from China[1]; it is open source so anyone can download and modify it; and it is cheap to run (and is currently offered for free by its parent company, DeepSeek). Google also offers a reasoning version of its Gemini 2.0 Flash. However, the most capable reasoning models right now are the o1 family from OpenAI. These are confusingly named, but, in order of capability, there are o1-mini, o3-mini, o3-mini-high, o1, and o1-pro (OpenAI could not get the rights to the o2 name, making things even more baffling).

Reasoning models aren’t chatty assistants – they’re more like scholars. You’ll ask a question, wait while they ‘think’ (sometimes minutes!), and get an answer. You want to make sure that the question you give them is very clear and has all the context they need. For very hard questions, especially in academic research, math, or computer science, you will want to use a reasoning model. Otherwise, a standard chat model is fine.

Web Access and Research

Not all AIs can access the web and do searches to learn new information past their original training. Currently, Gemini, Grok, DeepSeek, Copilot and ChatGPT can search the web actively, while Claude cannot. This capability makes a huge difference when you need current information or fact-checking, but not all models use their internet connections fully, so you will still need to fact-check.

Two providers, Google and OpenAI, go far beyond simple internet access and offer the option for “Deep Research,” which I discuss in more detail in this post. OpenAI’s model is more like a PhD analyst who looks at relatively few sources yet assembles a strikingly sophisticated analyst report, while Gemini’s approach is more like a summary of the open web on a topic.

Generates Images

Most of the LLMs that generate images do so by actually using a separate image generation tool. They do not have direct control over what that tool does, they just send a prompt to it and then show you the picture that results. That is changing with multimodal image creation, which lets the AI directly control the images it makes. For right now, Gemini's Imagen 3 leads the pack, but honestly? They'll all handle your basic “otter holding a sign saying 'This is ____' as it sits on a pink unicorn float in the middle of a pool” just fine.

Executes Code and Does Data Analysis

All AIs are pretty good at writing code, but only a few models (mostly Claude and ChatGPT, but also Gemini to a lesser extent) have the ability to execute the code directly. Doing so lets you do a lot of exciting things. For example, this is the result of telling o1 using the Canvas feature (which you need to turn on by typing /canvas): “create an interactive tool that visually shows me how correlation works, and why correlation alone is not a great descriptor of the underlying data in many cases. make it accessible to non-math people and highly interactive and engaging”
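
The statistical point behind that prompt, that similar correlation coefficients can describe very different data, is easy to see with a few lines of synthetic data. This sketch is my own illustration, not the output of the tool Canvas built:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset A: a genuine, noisy linear relationship.
x_a = np.linspace(0, 10, 50)
y_a = x_a + rng.normal(0, 2, size=50)

# Dataset B: no relationship within the cluster, plus one extreme outlier.
x_b = np.concatenate([rng.normal(0, 1, size=49), [15.0]])
y_b = np.concatenate([rng.normal(0, 1, size=49), [15.0]])

print("A:", np.corrcoef(x_a, y_a)[0, 1])  # high correlation from a real trend
print("B:", np.corrcoef(x_b, y_b)[0, 1])  # also high, driven by a single outlier
```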

Further, when models can code and use external files, they are capable of doing data analysis. Want to analyze a dataset? ChatGPT's Code Interpreter will do the best job on statistical analyses, Claude does less statistics but often is best at interpretation, and Gemini tends to focus on graphing. None of them are great with Excel files full of formulas and tabs yet, but they do a good job with structured data.

Claude does not do as sophisticated data analysis as ChatGPT, but it is very good at an “intuitive” understanding of data and what it means

Reads documents, sees images, sees video

It is very useful for your AI to take in data from the outside world. Almost all of the major AIs include the ability to process images. The models can often infer a huge amount from a picture. Far fewer models do video (which is actually processed as images at 1 frame every second or two). Right now that can only be done by Google’s Gemini, though ChatGPT can see video in Live Mode.

Given the first photo Claude guesses where I am. Given the second it identifies the type of plane. These aren’t obvious.

And, while all the AI models can work with documents, they aren’t equally good at all formats. Gemini, GPT-4o (but not o3), and Claude can process PDFs with images and charts, while DeepSeek can only read the text. No model is particularly good at Excel or PowerPoint (though Microsoft Copilot does a bit better here, as you might expect), though that will change soon. The different models also have different amounts of memory ("context windows") with Gemini having by far the most, capable of holding up to 2 million words at once.
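
To make the context window figure concrete, here is a back-of-envelope sketch; the words-per-page and pages-per-book numbers are rough assumptions of mine, not vendor specifications:

```python
# Rough assumptions for scale, not vendor specifications.
context_words = 2_000_000     # the Gemini figure quoted above
words_per_page = 500          # a dense page of text
pages_per_book = 300

pages = context_words / words_per_page
print(f"~{pages:,.0f} pages, or ~{pages / pages_per_book:.0f} full-length books at once")

# Video is sampled as still frames, roughly one per second.
frames_per_hour = 60 * 60
print(f"~{frames_per_hour:,} frames for an hour of video at 1 frame per second")
```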

Privacy and other factors

A year ago, privacy was a major concern when choosing an AI model. The early versions of these systems would save your chats and use them to improve their models. That's changed dramatically. Every major provider (except DeepSeek) now offers some form of privacy-focused mode: ChatGPT lets you opt out of training, and both Claude and Gemini say they will not train on your data. The exception is if you're handling truly sensitive data like medical records – in those cases, you'll still want to look into enterprise versions of these tools that offer additional security guarantees and meet regulatory requirements.

Each platform offers different ways to customize the AI for your use cases. ChatGPT lets you create custom GPTs tailored to specific tasks and includes an optional feature to remember facts from previous conversations, Gemini integrates with your Google workspace, and Claude has custom styles and projects.

Which AI should you use?

As you can see, there are lots of features to pick from, and, on top of that, there is the issue of “vibes” - each model has its own personality and way of working, almost like a person. If you happen to like the personality of a particular AI, you may be willing to put up with fewer features or capabilities. You can try out the free versions of multiple AIs to get a sense for that. That said, for most people, you probably want to pick among the paid versions of ChatGPT, Claude or Gemini.

ChatGPT currently has the best Live Mode in its Advanced Voice Mode. The other big advantage of ChatGPT is that it does everything, often in somewhat confusing ways - OpenAI has AI models specialized in hard problems (o1/o3 series) and models for chat (GPT-4o); some models can write and run complex software programs (though it is hard to know which); there are researchers and agents; there are systems that remember past interactions and scheduling systems; movie-making tools and early software agents. It can be a lot, but it gives you opportunities to experiment with many different AI capabilities. It is also worth noting that ChatGPT offers a $200/month tier, whose main advantage is access to very powerful reasoning models.

Gemini does not yet have as good a Live Mode, but that is supposed to be coming soon. For now, Gemini’s advantage is a family of powerful models including reasoners, very good integration with search, and a pretty easy-to-use user interface, as you might expect from Google. It also has top-flight image and video generation. Also excellent is Deep Research, which I wrote about at length in my last post.

Claude has the smallest number of features of any of these three systems, and really only has one model you care about - Claude 3.5 Sonnet. But Sonnet is very, very good. It often seems to be clever and insightful in ways that the other models are not. A lot of people end up using Claude as their primary model as a result, even though it is not as feature rich.

While it is new, you might also consider DeepSeek if you want a very good all-around model with excellent reasoning. Because it is an open model, you can either use it hosted on the original Chinese DeepSeek site or from a number of other providers. If you subscribe to X, you get Grok for free, and the team at X.ai is scaling up capabilities quickly, with a soon-to-be-released new model, Grok 3, promising to be the largest model ever trained. And if you have Copilot, you can use that, as it includes a mix of Microsoft and OpenAI models, though I find the lack of transparency over which model it is using when to be somewhat confusing. There are also many services, like Poe, that offer access to multiple models at the same time, if you want to experiment.
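Because DeepSeek is an open model with an OpenAI-compatible hosted API, you can also try it programmatically. Here is a minimal sketch using the standard OpenAI Python client; the base URL and model name reflect my understanding of DeepSeek's documentation at the time of writing and may change, so treat them as assumptions:

```python
# A minimal sketch of calling DeepSeek's hosted, OpenAI-compatible API with the
# standard OpenAI Python client. The base URL and model name are assumptions
# based on DeepSeek's documentation at the time of writing; verify before use.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",       # placeholder key
    base_url="https://api.deepseek.com",   # assumed endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",                 # assumed model name
    messages=[
        {"role": "user", "content": "What are the trade-offs of open-weight AI models?"}
    ],
)
print(response.choices[0].message.content)
```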

In the time it took you to read this guide, a new AI capability probably launched and two others got major upgrades. But don't let that paralyze you. The secret isn't waiting for the perfect AI - it's diving in and discovering what these tools can actually accomplish. Jump in, get your hands dirty, and find what clicks. It will help you understand where AI can help you, where it can’t, and what is coming next.


1

The fact that it is a Chinese model is interesting in many ways, including that this is the first non-US model to reach near the top of the AI ranking leaderboards. The quality of the model when it was released last week came as a surprise to many people in the AI space, causing a tremendous amount of discussion. Its origin also means that it tends to echo the official Chinese position on a variety of political topics. (Since the model itself is open, it is very likely that modified versions of the original will be released soon and hosted by other providers.)

Prophecies of the Flood

2025-01-10 20:15:53

Recently, something shifted in the AI industry. Researchers began speaking urgently about the arrival of supersmart AI systems, a flood of intelligence. Not in some distant future, but imminently. They often refer to AGI - Artificial General Intelligence - defined, albeit imprecisely, as machines that can outperform expert humans across most intellectual tasks. This availability of intelligence on demand will, they argue, change society deeply, and soon.

A sample of the recent statements by prominent researchers within AI labs predicting near-term supersmart AIs.

There are plenty of reasons not to believe insiders, as they have clear incentives to make bold predictions: they're raising capital, boosting stock valuations, and perhaps convincing themselves of their own historical importance. They're technologists, not prophets, and the track record of technological predictions is littered with confident declarations that turned out to be decades premature. Even setting aside these human biases, the underlying technology itself gives us reason for doubt. Today's Large Language Models, despite their impressive capabilities, remain fundamentally inconsistent tools - brilliant at some tasks while stumbling over seemingly simpler ones. This “jagged frontier” is a core characteristic of current AI systems, one that won't be easily smoothed away.

Plus, even assuming researchers are right about reaching AGI in the next year or two, they are likely overestimating the speed at which humans can adopt and adjust to a technology. Changes to organizations take a long time. Changes to systems of work, life, and education are slower still. And technologies need to find specific uses that matter in the world, which is itself a slow process. We could have AGI right now and most people wouldn’t notice (indeed, some observers have suggested that has already happened, arguing that the latest AI models like Claude 3.5 are effectively AGI1).

Yet dismissing these predictions as mere hype may not be helpful. Whatever their incentives, the researchers and engineers inside AI labs appear genuinely convinced they're witnessing the emergence of something unprecedented. Their certainty alone wouldn't matter - except that increasingly public benchmarks and demonstrations are beginning to hint at why they might believe we're approaching a fundamental shift in AI capabilities. The water, as it were, seems to be rising faster than expected.

Where the water is rising

The event that kicked off the most speculation was the reveal of a new model by OpenAI called o3 in late December. No one outside of OpenAI has really used this system yet, but it is the successor to o1, which is already very impressive2. The o3 model is one of the new generation of “reasoners” - AI models that take extra time to “think” before answering questions, which greatly improves their ability to solve hard problems. OpenAI provided a number of startling benchmarks for o3 that suggest a large advance over o1, and, indeed, over where we thought the state-of-the-art in AI was. Three benchmarks, in particular, deserve a little attention.

The first is called the Graduate-Level Google-Proof Q&A test (GPQA), and it is supposed to test high-level knowledge with a series of multiple-choice problems that even Google can’t help you with. PhDs with access to the internet got 34% of the questions right on this test outside their specialty, and 81% right inside their specialty. When tested, o3 achieved 87%, beating human experts for the first time. The second is Frontier Math, a set of private math problems created by mathematicians to be incredibly hard to solve, and, indeed, no AI ever scored higher than 2%, until o3, which got 25% right. The final benchmark is ARC-AGI, a rather famous test of fluid intelligence that was designed to be relatively easy for humans but hard for AIs. Again, o3 beat all previous AIs as well as the baseline human level on the test, scoring 87.5%. All of these tests come with significant caveats3, but they suggest that what we previously considered unpassable barriers to AI performance may actually be beaten quite quickly.

Agents

As AIs get smarter, they become more effective agents, another ill-defined term (see a pattern?) that generally means an AI given the ability to act autonomously towards achieving a set of goals. I have demonstrated some of the early agentic systems in previous posts, but I think the past few weeks have also shown us that practical agents, at least for narrow but economically important areas, are now viable.

A nice example of that is Google’s Gemini with Deep Research (accessible to everyone who subscribes to Gemini), which is really a specialized research agent. I gave it a topic like “research a comparison of ways of funding startup companies, from the perspective of founders, for high-growth ventures,” and the agent came up with a plan, read through 173(!) websites, and compiled a report with the answer for me a few minutes later.

The result was a 17-page paper with 118 references! But is it any good? I have taught the introductory entrepreneurship class at Wharton for over a decade, published on the topic, started companies myself, and even wrote a book on entrepreneurship, and I think this is pretty solid. I didn’t spot any obvious errors, but you can read it yourself if you would like here. The biggest issue is not accuracy, but that the agent is limited to public, non-paywalled websites rather than scholarly or premium publications. It is also a bit shallow and does not make strong arguments in the face of conflicting evidence. So not as good as the best humans, but better than a lot of reports that I see.

Still, this is a genuinely disruptive example of an agent with real value. Researching and writing reports is a major task in many jobs. What Deep Research accomplished in three minutes would have taken a human many hours, though they might have added more nuanced analysis. Given that, anyone writing a research report should probably try Deep Research and see how it works as a starting place, even though a good final report will still require a human touch. I had a chance to speak with the leader of the Deep Research project, where I learned that it is just a pilot project from a small team. I thus suspect that other groups and companies that are highly incentivized to create narrow but effective agents will be able to do so. Narrow agents are now a real product, rather than a future possibility. There are already many coding agents, and you can use experimental open-source agents that do scientific and financial research.

Narrow agents are specialized for a particular task, which means they are somewhat limited. That raises the question of whether we will soon see generalist agents, where you can just ask the AI anything and it will use a computer and the internet to do it. Simon Willison thinks not, despite what Sam Altman has argued. We will learn more as the year progresses, but if general agentic systems work reliably and safely, that really will change things, as it would allow smart AIs to take action in the world.

Many smaller advances are happening

Agents and very smart models are the core elements needed for transformative AI, but there are many other pieces as well that seem to be making rapid progress. This includes advances in how much AIs can remember (context windows) and multimodal capabilities that allow them to see and speak. It can be helpful to look back a little to get a sense of progress. For example, I have been testing the prompt “otter on a plane using wifi” for image and video models since before ChatGPT came out. In October 2023, that prompt got you this terrifying monstrosity.

“Otter on a plane using wifi,” October 2023

Less than 18 months later, multiple image creation tools nail the prompt. The result is that I have had to figure out something more challenging (this is an example of benchmark saturation, where old benchmarks get beaten by the AI). I decided to take a few minutes and see how far I could get with Google’s Veo 2 video model in producing a movie of the otter’s journey. The video you see below took less than 15 minutes of active work, although I had to wait a bit for the videos to be created. Take a look at the quality of the shadows and light. I especially appreciate how the otter opens the computer at the end.

And, to up the ante even further, I decided to turn the saga of the otter into a 1980s style science fiction anime featuring otters in space and a period-appropriate theme song (thanks to Suno). Again, very little (human) work was involved.

What of the flood?

Given all of this, how seriously should we take the claims of the AI labs that a flood of intelligence is coming? Even if we only consider what we've already seen - the o3 benchmarks shattering previous barriers, narrow agents conducting complex research, and multimodal systems creating increasingly sophisticated content - we're looking at capabilities that could transform many knowledge-based tasks. And yet the labs insist this is merely the start, that far more capable systems and general agents are imminent.

What concerns me most isn't whether the labs are right about this timeline - it's that we're not adequately preparing for what even current levels of AI can do, let alone the chance that they might be correct. While AI researchers are focused on alignment, ensuring AI systems act ethically and responsibly, far fewer voices are trying to envision and articulate what a world awash in artificial intelligence might actually look like. This isn't just about the technology itself; it's about how we choose to shape and deploy it. These aren't questions that AI developers alone can or should answer. They're questions that demand attention from organizational leaders who will need to navigate this transition, from employees whose work lives may transform, and from stakeholders whose futures may depend on these decisions. The flood of intelligence that may be coming isn't inherently good or bad - but how we prepare for it, how we adapt to it, and most importantly, how we choose to use it, will determine whether it becomes a force for progress or disruption. The time to start having these conversations isn't after the water starts rising - it's now.


1

I asked Claude to read over the completed document and give me feedback and it wrote: “The parenthetical comment about Claude 3.5 could potentially benefit from an update or revision since it's mentioned as an example of potential AGI. As Claude 3.5 Sonnet, I should note that I can't verify specific claims about my capabilities in relation to AGI.”

2

They skipped the name o2 because it is the name of a telephone company in the UK; AI naming continues to be very bad.

3

The caveat for GPQA is that the data is publicly available, and it is possible that the model trained on that data, either by accident or on purpose, although there is no indication that it did so. The caveat for the Frontier Math test is that the problems differ in difficulty level: Tier 1 problems are hard Math Olympiad problems, Tier 2 are graduate-level problems, and Tier 3 are genuine research-level problems. In the words of the mathematician in charge, speaking of o3’s correct answers: “40% have been Tier 1, 50% Tier 2, and 10% Tier 3. However, most Tier 3 “solutions”—and many Tier 2 ones—stem from heuristic shortcuts rather than genuine mathematical understanding.” The caveat for ARC-AGI is that it required a lot of very expensive computer time for o3 to run long enough to achieve its high score.