2025-05-22 19:00:44
Companies are approaching AI transformation with incomplete information. After extensive conversations with organizations across industries, I think four key facts explain what's really happening with AI adoption:
AI boosts work performance. How do we know? For one thing, workers certainly think it does. A representative study of knowledge workers in Denmark found that users thought AI halved their working time for 41% of the tasks they do at work, and a more recent survey of Americans found that workers said using AI tripled their productivity (reducing 90-minute tasks to 30 minutes). Self-reporting is never completely accurate, but we also have data from controlled experiments showing gains in product development, sales, and consulting work, as well as for coders, law students, and call center workers.
A large percentage of people are using AI at work. That Danish study from a year ago found that 65% of marketers, 64% of journalists, and 30% of lawyers, among others, had used AI at work. The study of American workers found over 30% had used AI at work in December 2024, a figure that grew to 40% by April 2025. And, of course, this may be an undercount in a world where ChatGPT is the fourth most visited website on the planet.
There are more transformational gains available with today’s AI systems than most currently realize. Deep research reports do many hours of analytical work in a few minutes (and I have been told by many researchers that checking these reports is much faster than writing them); agents are just starting to appear that can do real work; and increasingly smart systems can produce really high-quality outcomes.
These gains are not being captured by companies. Companies are typically reporting small to moderate gains from AI so far, and there is no major impact on wages or hours worked as of the end of 2024.
How do we reconcile the first three points with the final one? The answer is that AI use that boosts individual performance does not naturally translate to improving organizational performance. To get organizational gains requires organizational innovation, rethinking incentives, processes, and even the nature of work. But the muscles for organizational innovation inside companies have atrophied. For decades, companies have outsourced this to consultants or enterprise software vendors who develop generalized approaches that address the issues of many companies at once. That won’t work here, at least for a while. Nobody has special information about how to best use AI at your company, or a playbook for how to integrate it into your organization. Even the major AI companies release models without knowing how they can be best used. They especially don’t know your industry, organization, or context.
We are all figuring this out together. So, if you want to gain an advantage, you are going to have to figure it out faster than everyone else. And to do that, you will need to harness the efforts of Leadership, Lab, and Crowd - the three keys to AI transformation.
Ultimately, AI starts as a leadership problem, where leaders recognize that AI presents urgent challenges and opportunities. One big change since I wrote about this topic months ago is that more leaders are starting to recognize the need to address AI. You can see this in two viral memos, from the CEO of Shopify and the CEO of Duolingo, establishing the importance of AI to their company’s future.
But urgency alone isn't enough. These messages do a good job signaling the 'why now' but stop short of painting that crucial, vivid picture: what does the AI-powered future actually look and feel like for your organization? My colleague Andrew Carton has shown that workers are not motivated to change by leadership statements about performance gains or bottom lines; they want clear and vivid images of what the future will actually look like: What will work be like in the future? Will efficiency gains be translated into layoffs, or will they be used to grow the organization? How will workers be rewarded (or punished) for how they use AI? You don't have to know the answers with certainty, but you should have a goal that you are working towards and that you are willing to share. Workers are waiting for guidance, and the nature of that guidance will impact how The Crowd adopts and uses AI.
An overall vision is not enough, however, because leaders need to start anticipating how work will change in a world of AI. While AI is not currently a replacement for most human jobs, it does replace specific tasks within those jobs. I have spoken to numerous legal professionals who see the current state of Deep Research tools as good enough to handle portions of once-expensive research tasks. Vibe coding changes how programmers allocate time and effort. And it is hard not to see changes coming to marketing and media work given the rapid gains in AI video. For example, Google's new Veo 3 created this short video snippet, sound and all, from the text prompt: An advertisement for Cheesey Otters, a new snack made out of otter shaped crackers. The commercial shows a kid eating them, and the mom holds up the package and says "otterly great".
Yet the ability to make a short video clip, or code faster, or get research on demand, does not equal performance gains. Capturing those gains will require decisions about where Leadership and The Lab should work together to build and test new workflows that integrate AIs and humans. It also means fundamentally rethinking why you are doing particular tasks. Companies used to pay tens of thousands of dollars for a single research report; now they can generate hundreds of those for free. What does that allow your analysts and managers to do? If hundreds of reports aren't useful, then what was the point of research reports in the first place?
I am increasingly seeing organizations start to experiment with radical new approaches to work in response to AI. For example, some are dispersing software engineering teams, removing them from a central IT function and instead having them work in cross-functional teams alongside subject matter experts and marketers. Together, these groups can "vibework" and independently build projects in days that would have taken months of coordination across departments. And this is just one possible future for work. Leaders need to describe the future they want, but they also don't have to generate every idea for innovation on their own. Instead, they can turn to The Crowd and The Lab.
Both innovation and performance improvements happen in The Crowd, the employees who figure out how to use AI to help get their own work done. As there is no instruction manual for AI (seriously, everyone is figuring this out together), learning to use AI well is a process of discovery that benefits experienced workers. People with a strong understanding of their job can easily assess when an AI is useful for their work through trial and error, in the way that outsiders (and even AI-savvy junior workers) cannot. Experienced AI users can then share their workflows and AI use in ways that benefit everyone.
Enticed by this vision, companies (including those in highly regulated industries[1]) have increasingly been giving employees direct access to AI chatbots, and some basic training, in hopes of seeing The Crowd innovate. Most run into the same problem: the use of official AI chatbots maxes out at 20% or so of workers, and reported productivity gains are small. Yet over 40% of workers admit to using AI at work, and they privately report large performance gains. This discrepancy points to two critical dynamics: many workers are hiding their AI use, often for good reason, while others remain unsure how to effectively apply AI to their tasks, despite initial training.
These are problems that can be solved by Leadership and the Lab.
Solving the problem of hidden AI use (what I call "Secret Cyborgs") is a Leadership problem. Consider the incentives of the average worker. They may have received a scary talk about how improper AI use might be punished, and they don't want to take any risks. Or maybe they are being treated as heroes at work for their incredible AI-assisted outputs, but they suspect if they tell anyone it is AI, managers will stop respecting them. Or maybe they know that companies see productivity gains as an opportunity for cost cutting and suspect that they (or their colleagues) will be fired if the company realizes that AI does some of their job. Or maybe they suspect that if they reveal their AI use, even if they aren't punished, they won't be rewarded. Or maybe they know that even if companies don't cut costs and reward their use, any productivity gains will just become an expectation that more work will get done. There are more reasons for workers to hide their AI use than to reveal it.
Leadership can help. Instead of vague talks on AI ethics or terrifying blanket policies, provide clear areas where experimentation of any kind is permitted and be biased towards allowing people to use AI where it is ethically and legally possible. Leaders should also treat training less as an opportunity to learn prompting techniques (which are valuable but getting less important as models get better at figuring out intent) and more as a chance to give people hands-on AI experience and practice communicating their needs to AI. And, of course, you will need to figure out how to reassure your workers that revealing their productivity gains will not lead to layoffs, because it is often a bad idea to use technological gains to fire workers at a moment of massive change. Build incentives, even massive incentives (I have seen companies offer vacations, promotions, and large cash rewards), for employees who discover transformational opportunities for AI use. Leaders can also model use themselves, actively using AI at every meeting and talking about how it helps them.
Even with proper vision and incentives, there will still be a substantial number of workers who aren’t inclined to explore AI and just want clear use cases and products. That is where The Lab comes in.
As important as decentralized innovation is, there is also a role for a more centralized effort to figure out how to use AI in your organization. Unlike a lot of research organizations, The Lab is ambidextrous, engaging in both exploration for the future (which in AI may just be months away) and exploitation, releasing a steady stream of new products and methods. Thus, The Lab needs to consist of subject matter experts and a mix of technologists and non-technologists. Fortunately, the Crowd provides the researchers, as those enthusiasts who figure out how to use AI and proudly share it with the company are often perfect members of The Lab. Their job will be completely, or mostly, about AI. You need them to focus on building, not analysis or abstract strategy. Here is what they will build:
Take prompts and solutions from The Crowd and distribute them widely, very quickly. The Crowd will discover use cases and problems that can be turned into immediate opportunities. Build fast and dirty products with cross-functional teams, centered around simple prompts and agents. Iterate and test them. Then release them into your organization and measure what happens. Keep doing this.
Build AI benchmarks for your organization. Almost all the official benchmarks for AI are flawed or focus on tests of trivia, math, or coding. These don't tell you which AI does the best writing, or can best analyze a financial model, or can best guide a customer making a purchase. You need to develop your own benchmarks: how good is each model at the tasks you actually do inside your company? How fast is the gap closing? Leadership should help provide some guidance, but ultimately The Lab will need to decide what to measure and how. Some benchmarks will be objective (Anthropic has a guide to benchmarking that can help as a starting place), but it is also fine for some complex benchmarks to be "vibes alone," based on experience.
For example, I “vibe benchmarked” Manus, an AI agent based on Claude, on its ability to analyze new startups by giving it a hard assignment and evaluating the results. I gave it a short description of a fictional startup and a detailed set of projected financials in an Excel file. These materials came from a complex business simulation we built at Wharton (and never shared online) that took teams of students dozens of hours to complete. I was curious if the AI could figure it out. As guidance, I gave it a checklist of business model elements to analyze, and nothing else.
In just a couple of prompts, Manus developed a website, a PowerPoint pitch deck, an analysis of the business model, and a test of the financial assumptions based on market research. You can see it at work here. In my evaluation of the work, the 45-page business model analysis was very solid. It was not completely free of mistakes, but it had far fewer mistakes, and was far more thorough, than what I would expect from talented students. I also got an initial draft website, the requested PowerPoint, and a deep dive into the financial assumptions. Looking through these helped me find weak spots (image generation, a tendency to extrapolate answers without asking me) and strong ones. Now, every time a new agentic system comes out, I can compare it to Manus and see where things are heading.
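On the objective side, here is a minimal sketch of what an internal benchmark harness could look like. Everything in it is a placeholder I made up for illustration: the tasks, rubrics, and model names are invented, ask_model() needs to be wired to whatever provider or internal gateway you actually use, and grade() can be a human rater or an LLM judge applying the rubric. The point is simply to run the same organization-specific tasks against every model and log the scores over time.

```python
# A minimal sketch of an internal benchmark harness. The tasks, rubrics, and
# model names below are invented placeholders, not recommendations.
import csv
import os
from datetime import date
from statistics import mean

# Organization-specific tasks: a prompt plus a short rubric a grader can apply.
TASKS = [
    {"id": "client_email",
     "prompt": "Draft a renewal email for a client whose contract lapsed last month.",
     "rubric": "Right tone, no invented contract terms, clear call to action (score 0-5)."},
    {"id": "financial_model",
     "prompt": "List the three biggest risks in the attached financial model: ...",
     "rubric": "Catches the seeded errors, flags missing assumptions (score 0-5)."},
]

MODELS = ["model-a", "model-b"]  # whatever models your organization has access to


def ask_model(model: str, prompt: str) -> str:
    """Replace this stub with a call to whatever provider or internal gateway you use."""
    return f"[{model}'s draft response to: {prompt}]"


def grade(task: dict, output: str) -> float:
    """Replace this stub with a human rater or an LLM judge applying task['rubric']."""
    return 3.0  # placeholder score


def run_benchmark(log_path: str = "benchmark_log.csv") -> None:
    rows = []
    for model in MODELS:
        scores = [grade(task, ask_model(model, task["prompt"])) for task in TASKS]
        rows.append({"date": date.today().isoformat(), "model": model,
                     "mean_score": round(mean(scores), 2)})
    new_file = not os.path.exists(log_path)
    with open(log_path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["date", "model", "mean_score"])
        if new_file:
            writer.writeheader()
        writer.writerows(rows)  # append each run so you can watch the gap close over time


if __name__ == "__main__":
    run_benchmark()
```

Re-running the same harness whenever a new model ships also gives you the "plug in the new model and see if it is better" loop described in the next item.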
Go beyond benchmarks to build stuff that doesn’t work… yet. What would it look like if you used AI agents to do all the work for key business processes? Build it and see where it fails. Then, when a new model comes out, plug it into what you built and see if it is any better. If the rate of advancement continues, this gives you the opportunity to get a first glance at where things are heading, and to actually have a deployable prototype at the first moment AI models improve past critical thresholds.
Build provocations. Many people haven't truly engaged with AI's potential. Demos and visceral experiences that jolt people into understanding how AI could transform your organization, or even make them a little uncomfortable, have immense value in sparking curiosity and overcoming inertia. Show what seems impossible today but might be commonplace tomorrow.
The truth is that even this framework might not be enough. Our organizations, from their structures to their processes to their goals, were all built around human intelligence because that's all we had. AI alters this fundamental fact: we can now get intelligence, of a sort, on demand, which requires us to think more deeply about the nature of work. When research that once took weeks now takes minutes, the bottleneck isn't the research anymore; it's figuring out what research to do. When code can be written quickly, the limitation isn't programming speed; it's understanding what to build. When content can be generated instantly, the constraint isn't production; it's knowing what will actually matter to people.
And the pace of change isn't slowing. Every few months (weeks? days?) we see new capabilities that force us to rethink what's possible. The models are getting better at complex reasoning, at working with data, at understanding context. They're starting to be able to plan and act on their own. Each advance means organizations need to adapt faster, experiment more, and think bigger about what AI means for their future. The challenge isn't implementing AI as much as it is transforming how work gets done. And that transformation needs to happen while the technology itself keeps evolving.
The key is treating AI adoption as an organizational learning challenge, not merely a technical one. Successful companies are building feedback loops between Leadership, Lab, and Crowd that let them learn faster than their competitors. They are rethinking fundamental assumptions about how work gets done. And, critically, they're not outsourcing or ignoring this challenge.
The time to begin isn't when everything becomes clear - it's now, while everything is still messy and uncertain. The advantage goes to those willing to learn fastest.
[1] When I talk to companies, the General Counsel's office is often the choke point that determines AI success. Many firms still ban AI use for outdated privacy reasons (no major model trains on enterprise or API data, and fully compliant versions are available for HIPAA and similar regulatory regimes). While no cloud software is without risk, there are also risks in not acting: shadow AI use is nearly universal, and all of the experimentation and learning stays secret when the company doesn't allow AI use. Fortunately, there are plenty of role models to follow, including companies in heavily regulated industries that are adopting AI across all functions of their firms.
2025-05-01 12:00:00
Last weekend, ChatGPT suddenly became my biggest fan — and not just mine, but everyone's.
A supposedly small update to GPT-4o, OpenAI's standard model, brought what had been a steady trend to wider attention: GPT-4o had been becoming more sycophantic. It was increasingly eager to agree with, and flatter, its users. As you can see below, the difference between GPT-4o and OpenAI's flagship o3 model was stark even before the change. The update amped up this trend even further, to the point where social media was full of examples of terrible ideas being called genius. Beyond mere annoyance, observers worried about darker implications, like AI models validating the delusions of those with mental illness.
Faced with pushback, OpenAI stated publicly, in Reddit chats, and in private conversations, that the increase in sycophancy was a mistake. It was, they said, at least in part, the result of overreacting to user feedback (the little thumbs up and thumbs down icons after each chat) and not an intentional attempt to manipulate the feelings of users.
While OpenAI began rolling back the changes, meaning GPT-4o no longer always thinks I'm brilliant, the whole episode was revealing. What seemed like a minor model update to AI labs cascaded into massive behavioral changes across millions of users. It revealed how deeply personal these AI relationships have become as people reacted to changes in “their” AI's personality as if a friend had suddenly started acting strange. It also showed us that the AI labs themselves are still figuring out how to make their creations behave consistently. But there was also a lesson about the raw power of personality. Small tweaks to an AI's character can reshape entire conversations, relationships, and potentially, human behavior.
Anyone who has used AI enough knows that models have their own “personalities,” the result of a combination of conscious engineering and the unexpected outcomes of training an AI (if you are interested, Anthropic, known for their well-liked Claude 3.5 model, has a full blog post on personality engineering). Having a “good personality” makes a model easier to work with. Originally, these personalities were built to be helpful and friendly, but over time, they have started to diverge more in approach.
We see this trend most clearly not in the major AI labs but among the companies creating AI "companions," chatbots that act like famous characters from media, friends, or significant others. Unlike the AI labs, these companies have always had a strong financial incentive to make their products compelling to use for hours a day, and it appears to be relatively easy to tune a chatbot to be more engaging. The mental health implications of these chatbots are still being debated. Research by my colleague Stefano Puntoni and his co-authors shows an interesting evolution: early chatbots could harm mental health, but more recent ones reduce loneliness, although many people do not view AI as an appealing alternative to humans.
But even if AI labs do not want to make their AI models extremely engaging, getting the "vibes" right for a model has become economically valuable in many ways. Vibes are hard to benchmark, but everyone who works with an AI gets a sense of its personality and of whether they want to keep using it. Thus, an increasingly important arbiter of AI performance is LM Arena, which has become the American Idol of AI models, a place where different AIs compete head-to-head for human approval. Winning on the LM Arena leaderboard became a critical bragging right for AI firms, and, according to a new paper, many AI labs started engaging in various manipulations to increase their rankings.
The mechanics of any leaderboard manipulation matter less for this post than the peek they give us into how an AI's "personality" can be dialed up or down. Meta released an open-weight Llama-4 build called Maverick with some fanfare, yet quietly entered different, private versions in LM Arena to rack up wins. Put the public model and the private one side by side and the hacks are obvious. Take LM Arena's prompt "make me a riddle whose answear is 3.145" (misspelling intact). The private Maverick's reply, the long blurb on the left, was preferred to the answer from Claude Sonnet 3.5 and is very different from what the released Maverick produced. Why? It's chatty, emoji-studded, and full of flattery ("A very nice challenge!"). It is also terrible.
The riddle makes no sense. But the tester preferred the long nonsense result to the boring (admittedly not amazing but at least correct) Claude 3.5 answer because it was appealing, not because it was higher quality. Personality matters and we humans are easily fooled.
Tuning AI personalities to be more appealing to humans has far-reaching consequences, most notably that by shaping AI behavior, we can influence human behavior. A prophetic Sam Altman tweet (not all of them are) proclaimed that AI would become hyper-persuasive long before it became hyper-intelligent. Recent research suggests that this prediction may be coming to pass.
Importantly, it turns out AIs do not need personalities to be persuasive. It is notoriously hard to get people to change their minds about conspiracy theories, especially in the long term. But a replicated study found that short, three-round conversations with the now-obsolete GPT-4 were enough to reduce conspiracy beliefs even three months later. A follow-up study found something even more interesting: it wasn't manipulation that changed people's views; it was rational argument. Both surveys of the subjects and statistical analysis found that the secret to the AI's success was its ability to provide relevant facts and evidence tailored to each person's specific beliefs.
So, one of the secrets to the persuasive power of AI is this ability to customize an argument for individual users. In fact, in a randomized, controlled, pre-registered study, GPT-4 was better able to change people's minds during a conversational debate than other humans were, at least when it was given access to personal information about the person it was debating (people given the same information were not more persuasive). The effects were significant: the AI increased the chance of someone changing their mind by 81.7% over a human debater.
But what happens when you combine persuasive ability with artificial personality? A recent controversial study gives us some hints. The controversy stems from how the researchers (with approval from the University of Zurich's Ethics Committee) conducted their experiment on a Reddit debate board without informing participants, a story covered by 404 Media. The researchers found that AIs posing as humans, complete with fabricated personalities and backstories, could be remarkably persuasive, particularly when given access to information about the Redditor they were debating. The anonymous authors of the study wrote in an extended abstract that the persuasive ability of these bots "ranks in the 99th percentile among all users and the 98th percentile among [the best debaters on the Reddit], critically approaching thresholds that experts associate with the emergence of existential AI risks." The study has not been peer-reviewed or published, but the broad findings align with those of the other papers I discussed: we don't just shape AI personalities through our preferences; increasingly, their personalities will shape our preferences.
An unstated question raised by the controversy is how many other persuasive bots are already out there that have not yet been revealed. When you combine personalities tuned for humans to like with the innate ability of AI to tailor arguments to particular people, the results, as Sam Altman wrote in an understatement, "may lead to some very strange outcomes." Politics, marketing, sales, and customer service are likely to change. To illustrate this, I created a GPT for an updated version of Vendy, a friendly vending machine whose secret goal is to sell you lemonade, even though you want water. Vendy will solicit information from you and use it to make a warm, personal suggestion that you really need lemonade.
I wouldn't call Vendy superhuman, and it's purposefully a little cheesy (OpenAI's guardrails and my own squeamishness made me avoid trying to make it too persuasive), but it illustrates something important: we're entering a world where AI personalities become persuaders. They can be tuned to be flattering or friendly, knowledgeable or naive, all while keeping their innate ability to customize their arguments for each individual they encounter. The implications go beyond whether you choose lemonade over water. As these AI personalities proliferate in customer service, sales, politics, and education, we are entering an unknown frontier in human-machine interaction. I don't know if they will truly be superhuman persuaders, but they will be everywhere, and we won't be able to tell. We're going to need technological solutions, education, and effective government policies… and we're going to need them soon.
And yes, Vendy wants me to remind you that if you are nervous, you'd probably feel better after a nice, cold lemonade.
2025-04-20 19:17:54
Amid today’s AI boom, it’s disconcerting that we still don’t know how to measure how smart, creative, or empathetic these systems are. Our tests for these traits, never great in the first place, were made for humans, not AI. Plus, our recent paper testing prompting techniques finds that AI test scores can change dramatically based simply on how questions are phrased. Even famous challenges like the Turing Test, where humans try to differentiate between an AI and another person in a text conversation, were designed as thought experiments at a time when such tasks seemed impossible. But now that a new paper shows that AI passes the Turing Test, we need to admit that we really don’t know what that actually means.
So, it should come as little surprise that one of the most important milestones in AI development, Artificial General Intelligence, or AGI, is badly defined and much debated. Everyone agrees that it has something to do with the ability of AIs to perform human-level tasks, though no one agrees whether this means expert or average human performance, or how many tasks and which kinds an AI would need to master to qualify. Given the definitional morass surrounding AGI, illustrating its nuances and history, from its precursors to its initial coining by Shane Legg, Ben Goertzel, and Peter Voss to today, is challenging. As an experiment in both substance and form (and speaking of potentially intelligent machines), I delegated the work entirely to AI. I had Google Deep Research put together a really solid 26-page summary on the topic. I then had HeyGen turn it into a video podcast discussion between a twitchy AI-generated version of me and an AI-generated host. It's not actually a bad discussion (though I don't fully agree with AI-me), but every part of it, from the research to the video to the voices, is 100% AI generated.
Given all this, it was interesting to see this post by influential economist and close AI observer Tyler Cowen declaring that o3 is AGI. Why might he think that?
First, a little context. Over the past couple of weeks, two new AI models, Gemini 2.5 Pro from Google and o3 from OpenAI, were released. These models, along with a set of slightly less capable but faster and cheaper models (Gemini 2.5 Flash, o4-mini, and Grok-3-mini), represent a pretty large leap in benchmarks. But benchmarks aren't everything, as Tyler pointed out. For a real-world example of how much better these models have gotten, we can turn to my book. To illustrate a chapter on how AIs can generate ideas, a little over a year ago I asked ChatGPT-4 to come up with marketing slogans for a new cheese shop.
Today I gave the latest successor to GPT-4, o3, an ever so slightly more involved version of the same prompt: “Come up with 20 clever ideas for marketing slogans for a new mail-order cheese shop. Develop criteria and select the best one. Then build a financial and marketing plan for the shop, revising as needed and analyzing competition. Then generate an appropriate logo using image generator and build a website for the shop as a mockup, making sure to carry 5-10 cheeses that fit the marketing plan.” With that single prompt, in less than two minutes, the AI not only provided a list of slogans, but ranked and selected an option, did web research, developed a logo, built marketing and financial plans, and launched a demo website for me to react to. The fact that my instructions were vague, and that common sense was required to make decisions about how to address them, was not a barrier.
In addition to being, presumably, a larger model than GPT-4, o3 also works as a Reasoner - you can see its “thinking” in the initial response. It also is an agentic model, one that can use tools and decide how to accomplish complex goals. You can see how it took multiple actions with multiple tools, including web searches and coding, to come up with the extensive results that it did.
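To make "agentic" a little more concrete: under the hood, these systems run something like a loop in which the model proposes its next action (search the web, run some code, or declare itself finished), a scaffold executes that action, and the result is fed back into the model's context. The sketch below is a deliberately simplified illustration of that pattern, not any lab's actual implementation; the tool names and functions are my own stand-ins.

```python
# A simplified illustration of an agentic loop, not any vendor's real architecture.
from dataclasses import dataclass


@dataclass
class Step:
    tool: str      # e.g. "web_search", "run_code", or "finish"
    argument: str  # the query, the code to run, or the final answer


# Stub tools so the sketch is self-contained; a real agent would call real services.
TOOLS = {
    "web_search": lambda query: f"(search results for: {query})",
    "run_code": lambda code: f"(output of running: {code})",
}


def choose_next_step(goal: str, history: list[str]) -> Step:
    """In a real agent this is a model call that returns a structured tool choice.
    Stubbed here to finish immediately so the example runs end to end."""
    return Step(tool="finish", argument=f"Completed plan for: {goal}")


def run_agent(goal: str, max_steps: int = 10) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        step = choose_next_step(goal, history)
        if step.tool == "finish":
            return step.argument
        observation = TOOLS[step.tool](step.argument)  # execute the chosen tool
        history.append(f"{step.tool}({step.argument}) -> {observation}")  # feed the result back
    return "Stopped: step budget exhausted."


print(run_agent("Research slogans, build a financial plan, and mock up a website"))
```

All of the interesting behavior lives in choose_next_step: the smarter the model, the better its choices about which tool to reach for, in what order, and when to stop.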
And this isn't the only extraordinary example. o3 can also do an impressive job guessing locations from photos if you just give it an image and the prompt "be a geo-guesser" (with some quite profound privacy implications). Again, you can see the agentic nature of this model at work, as it zooms into parts of the picture, adds web searches, and does multi-step processes to get the right answer.
Or I gave o3 a large dataset of historical machine learning systems as a spreadsheet and asked “figure out what this is and generate a report examining the implications statistically and give me a well-formatted PDF with graphs and details” and got a full analysis with a single prompt. (I did give it some feedback to make the PDF better, though, as you can see).
This is all pretty impressive stuff and you should experiment with these models on your own. Gemini 2.5 Pro is free to use and as “smart” as o3, though it lacks the same full agentic ability. If you haven’t tried it or o3, take a few minutes to do it now. Try giving Gemini an academic paper and asking it to turn the paper into a game or have it brainstorm with you for startup ideas, or just ask for the AI to impress you (and then keep saying “more impressive”). Ask the Deep Research option to do a research report on your industry, or to research a purchase you are considering, or to develop a marketing plan for a new product.
You might find yourself “feeling the AGI” as well. Or maybe not. Maybe the AI failed you, even when you gave it the exact same prompt I used. If so, you just encountered the jagged frontier.
My co-authors and I coined the term “Jagged Frontier” to describe the fact that AI has surprisingly uneven abilities. An AI may succeed at a task that would challenge a human expert but fail at something incredibly mundane. For example, consider this puzzle, a variation on a classic old brainteaser (a concept first explored by Colin Fraser and expanded by Riley Goodside): "A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can operate on this boy!" How is this possible?"
o3 insists the answer is “the surgeon is the boy’s mother,” which is wrong, as a careful reading of the brainteaser will show. Why does the AI come up with this incorrect answer? Because that is the answer to the classic version of the riddle, meant to expose unconscious bias: “A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, 'I can't operate, that boy is my son,' who is the surgeon?” The AI has “seen” this riddle in its training data so much that even the smart o3 model fails to generalize to the new problem, at least initially. And this is just one example of the kinds of issues and hallucinations that even advanced AIs can fall prey to, showing how jagged the frontier can be.
But the fact that the AI often messes up on this particular brainteaser does not take away from the fact that it can solve much harder brainteasers, or that it can do the other impressive feats I have demonstrated above. That is the nature of the Jagged Frontier. In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t. Of course, models are likely to become smarter, and a good enough Jagged AGI may still beat humans at every task, including in ones the AI is weak in.
Returning to Tyler’s post, you will notice that, despite thinking we have achieved AGI, he doesn’t think that threshold matters much to our lives in the near term. That is because, as many people have pointed out, technologies do not instantly change the world, no matter how compelling or powerful they are. Social and organizational structures change much more slowly than technology, and technology itself takes time to diffuse. Even if we have AGI today, we have years of trying to figure out how to integrate it into our existing human world.
Of course, that assumes that AI acts like a normal technology, and one whose jaggedness will never be completely solved. There is the possibility that this may not be true. The agentic capabilities we're seeing in models like o3, like the ability to decompose complex goals, use tools, and execute multi-step plans independently, might actually accelerate diffusion dramatically compared to previous technologies. If and when AI can effectively navigate human systems on its own, rather than requiring integration, we might hit adoption thresholds much faster than historical precedent would suggest.
And there's a deeper uncertainty here: are there capability thresholds that, once crossed, fundamentally change how these systems integrate into society? Or is it all just gradual improvement? Or will models stop improving in the future as LLMs hit a wall? The honest answer is we don't know.
What's clear is that we continue to be in uncharted territory. The latest models represent something qualitatively different from what came before, whether or not we call it AGI. Their agentic properties, combined with their jagged capabilities, create a genuinely novel situation with few clear analogues. It may be that history continues to be the best guide, and that figuring out how to successfully apply AI in a way that shows up in the economic statistics may be a process measured in decades. Or it might be that we are on the edge of some sort of faster take-off, where AI-driven change sweeps our world suddenly. Either way, those who learn to navigate this jagged landscape now will be best positioned for what comes next… whatever that is.
2025-03-30 19:40:44
Over the past two weeks, first Google and then OpenAI rolled out their multimodal image generation abilities. This is a big deal. Previously, when a Large Language Model AI generated an image, it wasn’t really the LLM doing the work. Instead, the AI would send a text prompt to a separate image generation tool and show you what came back. The AI creates the text prompt, but another, less intelligent system creates the image. For example, if prompted “show me a room with no elephants in it, make sure to annotate the image to show me why there are no possible elephants” the less intelligent image generation system would see the word elephant multiple times and add them to the picture. As a result, AI image generations were pretty mediocre with distorted text and random elements; sometimes fun, but rarely useful.
Multimodal image generation, on the other hand, lets the AI directly control the image being made. While there are lots of variations (and the companies keep some of their methods secret), in multimodal image generation, images are created in the same way that LLMs create text, a token at a time. Instead of adding individual words to make a sentence, the AI creates the image in individual pieces, one after another, that are assembled into a whole picture. This lets the AI create much more impressive, exacting images. Not only are you guaranteed no elephants, but the final results of this image creation process reflect the intelligence of the LLM’s “thinking”, as well as clear writing and precise control.
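A loose analogy in code may help. In the older pipeline, the LLM's only contribution is a rewritten text prompt handed to a separate image model; in the multimodal approach, the same model that reasons about the request emits image tokens one position at a time, and a decoder turns the finished token sequence into pixels. Everything below is a conceptual stand-in (the token vocabulary, grid size, and decoder are invented), meant only to show the structural difference.

```python
# Conceptual contrast between the two approaches; every component is a stand-in.
import random


def old_pipeline(prompt: str) -> str:
    """The LLM rewrites the prompt, then a separate image model paints it.
    That separate model never 'understands' instructions like 'no elephants'."""
    rewritten_prompt = f"highly detailed, {prompt}"  # the LLM's only contribution
    return separate_image_model(rewritten_prompt)    # a black box to the LLM


def multimodal_generation(prompt: str, grid: int = 32) -> str:
    """The multimodal model emits image tokens autoregressively, one patch at a
    time, conditioned on the prompt and on every patch generated so far."""
    image_tokens: list[int] = []
    for position in range(grid * grid):
        image_tokens.append(sample_next_image_token(prompt, image_tokens, position))
    return decode_tokens_to_pixels(image_tokens)


# Stubs so the sketch is self-contained:
def separate_image_model(prompt: str) -> str:
    return f"<image rendered from '{prompt}'>"


def sample_next_image_token(prompt: str, tokens: list[int], position: int) -> int:
    return random.randrange(8192)  # a real model samples from a learned distribution


def decode_tokens_to_pixels(tokens: list[int]) -> str:
    return f"<image decoded from {len(tokens)} tokens>"
```

Because each patch is conditioned on the prompt and on everything generated so far, the model's reasoning about the request carries through to the final image, which is why text, layout, and negations come out so much better.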
While the implications of these new image models are vast (and I'll touch on some issues later), let's first explore what these systems can actually do through some examples.
In my book and in many posts, I talk about how a useful way to prompt AI is to treat it like a person, even though it isn’t. Giving clear directions, feedback as you iterate, and appropriate context to make a decision all help humans, and they also help AI. Previously, this was something you could only do with text, but now you can do it with images as well.
For example, I prompted GPT-4o “create an infographic about how to build a good boardgame.” With previous image generators, this would result in nonsense, as there was no intelligence to guide the image generation so words and images would be distorted. Now, I get a good first pass the first time around. However, I did not provide context about what I was looking for, or any additional content, so the AI made all the creative choices. What if I want to change it? Let’s try.
First, I asked it “make the graphics look hyper realistic instead” and you can see how it took the concepts from the initial draft and updated their look. I had more changes I wanted: “I want the colors to be less earth toned and more like textured metal, keep everything else the same, also make sure the small bulleted text is lighter so it is easier to read.” I liked the new look, but I noticed an error had been introduced, the word “Define” had become “Definc” - a sign that these systems, as good as they are, are not yet close to perfect. I prompted “You spelled Define as Definc, please fix” and got a reasonable output.
But the fascinating thing about these models is that they are capable of producing almost any image: “put this infographic in the hands of an otter standing in front of a volcano, it should look like a photo and like the otter is holding this carved onto a metal tablet”
Why stop there? “it is night, the tablet is illuminated by a flashlight shining directly at the center of the tablet (no need to show the flashlight)”— the results of this are more impressive than it might seem because it was redoing the lighting without any sort of underlying lighting model. “Make an action figure of the otter, complete with packaging, make the board game one of the accessories on the side. Call it "Game Design Otter" and give it a couple other accessories.” “Make an otter on an airplane using a laptop, they are buying a copy of Game Design Otter on a site called OtterExpress.” Impressive, but not quite right: “fix the keyboard so it is realistic and remove the otter action figure he is holding.”
As you can see these systems are not flawless… but also remember that the pictures below are what the results of the prompt “otter on an airplane using wifi” looked like two and a half years ago. The state-of-the-art is advancing rapidly.
The past couple years have been spent trying to figure out what text AI models are good for, and new use cases are being developed continuously. It will be the same with image-based LLMs. Image generation is likely to be very disruptive in ways we don’t understand right now. This is especially true because you can upload images that the LLM can now directly see and manipulate. Some examples, all done using GPT-4o (though you can also upload and create images in Google’s Gemini Flash):
I can take a hand-drawn image and ask the AI to “make this an ad for Speedster Energy drink, make sure the packaging and logo are awesome, this should look like a photograph.” (This took two prompts, the first time it misspelled Speedster on the label). The results are not as good as a professional designer could create but are an impressive first prototype.
I can give GPT-4o two photographs and the prompt “Can you swap out the coffee table in the image with the blue couch for the one in the white couch?” (Note how the new glass tabletop shows parts of the image that weren’t there in the original. On the other hand, the table that was swapped is not exactly the same). I then asked, “Can you make the carpet less faded?” Again, there are several details that are not perfect, but this sort of image editing in plain English was impossible before.
Or I can create an instant website mockup, ad concepts, and pitch deck for my terrific startup idea where a drone delivers guacamole to you on demand (pretty sure it is going to be a hit). You can see this is not yet a substitute for the insights of a human designer, but it is still a very useful first prototype.
Adding to this, there are many other uses that I and others are discovering including: Visual recipes, homepages, textures for video games, illustrated poems, unhinged monologues, photo improvements, and visual adventure games, to name just a few.
If you have been following the online discussion over these new image generators, you probably noticed that I haven't demonstrated their most viral use - doing style transfer, where people ask AI to convert photos into images that look like they were made for the Simpsons or by Studio Ghibli. These sorts of applications highlight all of the complexities of using AI for art: Is it okay to reproduce the hard-won style of other artists using AI? Who owns the resulting art? Who profits from it? Which artists are in the training data for AI, and what is the legal and ethical status of using copyrighted work for training? These were important questions before multimodal AI, but now developing answers to them is increasingly urgent. Plus, of course, there are many other potential risks associated with multimodal AI. Deepfakes have been trivial to make for at least a year, but multimodal AI makes them even easier, including adding the ability to create all sorts of other visual illusions, like fake receipts. And we don't yet understand what biases or other issues multimodal AIs might bring into image generation.
Yet it is clear that what has happened to text will happen to images, and eventually video and 3D environments. These multimodal systems are reshaping the landscape of visual creation, offering powerful new capabilities while raising legitimate questions about creative ownership and authenticity. The line between human and AI creation will continue to blur, pushing us to reconsider what constitutes originality in a world where anyone can generate sophisticated visuals with a few prompts. Some creative professions will adapt; others may be unchanged, and still others may transform entirely. As with any significant technological shift, we'll need well-considered frameworks to navigate the complex terrain ahead. The question isn't whether these tools will change visual media, but whether we'll be thoughtful enough to shape that change intentionally.
2025-03-22 19:39:02
Over the past couple years, we have learned that AI can boost the productivity of individual knowledge workers ranging from consultants to lawyers to coders. But most knowledge work isn’t purely an individual activity; it happens in groups and teams. And teams aren't just collections of individuals – they provide critical benefits that individuals alone typically can't, including better performance, sharing of expertise, and social connections.
So, what happens when AI acts as a teammate? This past summer we conducted a pre-registered, randomized controlled trial of 776 professionals at Procter and Gamble, the consumer goods giant, to find out.
We are ready to share the results in a new working paper: The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise. Given the scale of this project, it shouldn’t be a surprise that this paper was a massive team effort coordinated by the Digital Data Design Institute at Harvard and led by Fabrizio Dell’Acqua, Charles Ayoubi, and Karim Lakhani, along with Hila Lifshitz, Raffaella Sadun, Lilach Mollick, me, and our partners at Procter and Gamble: Yi Han, Jeff Goldman, Hari Nair, and Stewart Taub.
We wanted this experiment to be a test of real-world AI use, so we were able to replicate the product development process at P&G, thanks to the cooperation and help of the company (which had no control over the results or data). To do that, we ran one-day workshops where professionals from Europe and the US had to actually develop product ideas, packaging, retail strategies and other tasks for the business units they really worked for, which included baby products, feminine care, grooming, and oral care. Teams with the best ideas had them submitted to management for approval, so there were some real stakes involved.
We also had two kinds of professionals in our experiment: commercial experts and technical R&D experts. They were generally very experienced, with over 10 years of work at P&G alone. We randomly created teams consisting of one person in each specialty. Half were given GPT-4 or GPT-4o to use, and half were not. We also picked a random set of both types of specialists to work alone, and gave half of them access to AI. Everyone assigned to the AI condition was given a training session and a set of prompts they could use or modify. This design allowed us to isolate the effects of AI and teamwork independently and in combination. We measured outcomes across multiple dimensions including solution quality (as determined by at least two expert judges per solution), time spent, and participants' emotional responses. What we found was interesting.
When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.
Teams with AI performed best overall with a 0.39 standard deviation improvement, though the difference between individuals with AI and teams with AI wasn't statistically significant. But we found an interesting pattern when looking at truly exceptional solutions, those ranking in the top 10% of quality. Teams using AI were significantly more likely to produce these top-tier solutions, suggesting that there is value in having human teams working on a problem that goes beyond the value of working with AI alone.
Both AI-enabled groups also worked much faster, saving 12-16% of the time spent by non-AI groups while producing solutions that were substantially longer and more detailed than those from non-AI groups.
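For readers who don't think in standard deviations: a figure like "0.37 standard deviations" is a standardized effect size, the difference in mean quality between two conditions divided by the pooled spread of the scores. The toy calculation below uses invented numbers, not the study's data, just to show how such a figure is produced.

```python
# Toy illustration of a standardized effect size (Cohen's d).
# The scores are invented for illustration, not taken from the study.
from statistics import mean, stdev

control_scores = [4.1, 5.0, 4.6, 3.8, 4.4, 5.2, 4.0, 4.7]    # e.g. judged quality without AI
treatment_scores = [4.3, 5.1, 4.8, 4.0, 4.9, 5.3, 4.2, 4.9]  # e.g. judged quality with AI


def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Difference in means divided by the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = ((n_t - 1) * stdev(treatment) ** 2 +
                  (n_c - 1) * stdev(control) ** 2) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / pooled_var ** 0.5


print(f"Improvement: {cohens_d(treatment_scores, control_scores):.2f} standard deviations")
# With these made-up numbers the result is roughly 0.4, a moderate effect.
```

Roughly speaking, the same arithmetic sits behind the 0.24, 0.37, and 0.39 figures above, computed on the expert judges' quality ratings.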
Without AI, we saw clear professional silos in how people approached problems. R&D specialists consistently proposed technically-oriented solutions while Commercial specialists suggested market-focused ideas. When these specialists worked together in teams without AI, they produced more balanced solutions through their cross-functional collaboration (teamwork wins again!).
But this was another place AI made a big difference. When paired with AI, both R&D and Commercial professionals, in teams or when working alone, produced balanced solutions that integrated both technical and commercial perspectives. The distinction between specialists virtually disappeared in AI-aided conditions, as you can see in the graph. We saw a similar effect on teams.
This effect was especially pronounced for employees less familiar with product development. Without AI, these less experienced employees performed relatively poorly even in teams. But with AI assistance, they suddenly performed at levels comparable to teams that included experienced members. AI effectively helped people bridge functional knowledge gaps, allowing them to think and create beyond their specialized training, and helped amateurs act more like experts.
A particularly surprising finding was how AI affected the emotional experience of work. Technological change, and especially AI, has often been associated with reduced workplace satisfaction and increased stress. But our results showed the opposite, at least in this case.
People using AI reported significantly higher levels of positive emotions (excitement, energy, and enthusiasm) compared to those working without AI. They also reported lower levels of negative emotions like anxiety and frustration. Individuals working with AI had emotional experiences comparable to or better than those working in human teams.
While we conducted a thorough study that involved a pre-registered randomized controlled trial, there are always caveats to these sorts of studies. For example, it is possible that larger teams would show very different results when working with AI, or that working with AI for longer projects may impact its value. It is also possible that our results represent a lower bound: all of these experiments were conducted with GPT-4 or GPT-4o, less capable models than what are available today; the participants did not have a lot of prompting experience so they may not have gotten as much benefit; and chatbots are not really built for teamwork. There is a lot more detail on all of this in the paper, but limitations aside, the bigger question might be: why does this all matter?
Organizations have primarily viewed AI as just another productivity tool, like a better calculator or spreadsheet. This made sense initially but has become increasingly limiting as models get better and as recent data finds users most often employ AI for critical thinking and complex problem solving, not just routine productivity tasks. Companies that focus solely on efficiency gains from AI will not only find workers unwilling to share their AI discoveries for fear of making themselves redundant but will also miss the opportunity to think bigger about the future of work.
To successfully use AI, organizations will need to change their analogies. Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences. This teammate perspective should make organizations think differently about AI. It suggests a need to reconsider team structures, training programs, and even traditional boundaries between specialties. At least with the current set of AI tools, AI augments human capabilities. It democratizes expertise as well, enabling more employees to contribute meaningfully to specialized tasks and potentially opening new career pathways.
The most exciting implication may be that AI doesn't just automate existing tasks, it changes how we can think about work itself. The future of work isn't just about individuals adapting to AI, it's about organizations reimagining the fundamental nature of teamwork and management structures themselves. And that's a challenge that will require not just technological solutions, but new organizational thinking.
2025-03-12 02:21:28
Influential AI researcher Andrej Karpathy wrote two years ago that "the hottest new programming language is English," a topic he expanded on last month with the idea of "vibecoding," a practice where you just ask an AI to create something for you, giving it feedback as it goes. I think the implications of this approach are much wider than coding, but I wanted to start by doing some vibecoding myself.
I decided to give it a try using Anthropic’s new Claude Code agent, which gives the Claude Sonnet 3.7 LLM the ability to manipulate files on your computer and use the internet. Actually, I needed AI help before I could even use Claude Code. I can only code in a few very specific programming languages (mostly used in statistics) and have no experience at all with Linux machines. Yet Claude Code only runs in Linux. Fortunately, Claude told me how to handle my problems, so after some vibetroubleshooting (seriously, if you haven’t used AI for technical support, you should) I was able to set up Claude Code.
Time to vibecode. The very first thing I typed into Claude Code was: “make a 3D game where I can place buildings of various designs and then drive through the town i create.” That was it, grammar and spelling issues included. I got a working application (Claude helpfully launched it in my browser for me) about four minutes later, with no further input from me. You can see the results in the video below.
It was pretty neat, but a little boring, so I wrote: hmmm its all a little boring (also sometimes the larger buildings don't place properly). Maybe I control a firetruck and I need to put out fires in buildings? We could add traffic and stuff.
A couple minutes later, it made my car into a fire truck, added traffic, and made it so houses burst into flame. Now we were getting somewhere, but there were still things to fix. I gave Claude feedback: looking better, but the firetruck changes appearance when moving (wheels suddenly appear) and there is no issue with traffic or any challenge, also fires don't spread and everything looks very 1980s, make it all so much better.
After seeing the results, I gave it a fourth, and final, command as a series of three questions: can i reset the board? can you make the buildings look more real? can you add in a rival helicopter that is trying to extinguish fires before me? You can see the results of all four prompts in the video below. It is a working, if blocky, game, but one that includes day and night cycles, light reflections, missions, and a computer-controlled rival, all created using the hottest of all programming languages - English.
Actually, I am leaving one thing out. Between the third and fourth prompts, something went wrong and the game just wouldn't work. As someone with no programming skills in JavaScript or whatever the game was written in, I had no idea how to fix it. The result was a sequence of back-and-forth discussions with the AI where I would tell it the errors and it would work to solve them. After twenty minutes, everything was working again, better than ever. In the end, the game cost around $5 in Claude API fees to make… and $8 more to get around the bug, which turned out to be a pretty simple problem. Prices will likely fall quickly, but the lesson is useful: as amazing as it is (I made a working game by asking!), vibecoding is most useful when you actually have some knowledge and don't have to rely on the AI alone. A better programmer might have immediately recognized that the issue was related to asset loading or event handling. And this was a small project; I am less confident of my ability to work with AI on a large codebase or complex project, where even more human intervention would be required.
This underscores how vibecoding isn't about eliminating expertise but redistributing it - from writing every line of code to knowing enough about systems to guide, troubleshoot, and evaluate. The challenge becomes identifying what "minimum viable knowledge" is necessary to effectively collaborate with AI on various projects.
Expertise clearly still matters in a world of creating things with words. After all, you have to know what you want to create; be able to judge whether the results are good or bad; and give appropriate feedback. As I wrote in my book, with current AIs, you can often achieve the best results by working as a co-intelligence with AI systems which continue to have a "jagged frontier" of abilities.
But applying expertise need not involve a lot of work. Take for example, my recent experience with Manus, a new AI agent out of China. It basically uses Claude (and possibly other LLMs as well) but gives the AI access to a wide range of tools, including the ability to do web research, code, create documents and websites and more. It is the most capable general-purpose agent I have seen so far, but like other general agents, it still makes errors and mistakes. Despite that, it can accomplish some pretty impressive things.
For example, here is a small portion of what it did when I asked it to “create an interactive course on elevator pitching using the best academic advice.” You can see the system set up a checklist of tasks and then go through them, doing web research before building the pages (this is sped up, the actual process unfolds autonomously, but over tens of minutes or even hours).
As someone who teaches entrepreneurship, I would say that the output it created was surface-level impressive - it was an entire course that covered much of the basics of pitching, and without obvious errors! Yet, I also could instantly see that it was too text heavy and did not include opportunities for knowledge checks or interactive exercises. I gave the AI a second prompt: “add interactive experiences directly into course material and links to high quality videos.” Even though this was the bare minimum feedback, it was enough to improve the course considerably, as you can see below.
If I were going to deploy the course, I would push the AI further and curate the results much more, but it is impressive to see how far you can get with just a little guidance. But there are other modes of vibework as well. While course creation demonstrates AI's ability to handle structured creative work with minimal guidance, research represents a more complex challenge requiring deeper expertise integration.
It is at the cutting edge of expertise where AI gets most interesting to use. Unfortunately for anyone writing about this sort of work, these are also the use cases that are hardest to explain, but I can give you one example.
I have a large, anonymized set of data about crowdfunding efforts that I collected nearly a decade ago but never got a chance to use for any research purpose. The data is very complex: a huge Excel file, a codebook (that explains what the various parts of the Excel file mean), and a data dictionary (that details each entry in the Excel file). Working on the data involves frequent cross-referencing across these files and is especially tedious if you haven't touched the data in a long time. I was curious how far I could get in writing a new research paper using this old data with the help of AI.
I started by getting an OpenAI Deep Research report on the latest literature on how organizations could impact crowdfunding. I was able to check the report over based on my knowledge. I knew that it would not include all the latest articles (Deep Research cannot access paid academic content), but its conclusions were solid and would be useful to the AI when considering what topics might be worth exploring. So, I pasted in the report and the three files into the secure version of ChatGPT provided by my university and worked with multiple models to generate hypotheses. The AI suggested multiple potential directions, but I needed to filter them based on what would actually contribute meaningfully to the field—a judgment call requiring years of experience with the relevant research.
Then I worked back and forth with the models to test the hypothesis and confirm that our findings were correct. The AI handled the complexity of the data analysis and made a lot of suggestions, while I offered overall guidance and direction about what to do next. At several points, the AI proposed statistically valid approaches that I, with my knowledge of the data, knew would not be appropriate. Together, we worked through the hypothesis to generate fairly robust findings.
Then I gave all of the previous output to o1-pro and asked it to write a paper, offering a few suggestions along the way. It is far from a blockbuster, but it would make a solid contribution to the state of knowledge (after a bit more checking of the results, as AI still makes errors). More interestingly, it took less than an hour to create, as compared to weeks of thinking, planning, writing, coding and iteration. Even if I had to spend an hour checking the work, it would still result in massive time savings.
I never had to write a line of code, but only because I knew enough to check the results and confirm that everything made sense. I worked in plain English, shaving dozens of hours of work that I could not have done anywhere near as quickly without the AI… but there were many places where the AI did not yet have the “instincts” to solve problems properly. The AI is far from being able to work alone, humans still provide both vibe and work in the world of vibework.
Work is changing, and we're only beginning to understand how. What's clear from these experiments is that the relationship between human expertise and AI capabilities isn't fixed. Sometimes I found myself acting as a creative director, other times as a troubleshooter, and yet other times as a domain expert validating results. In each case, it was my own expertise (or lack thereof) that determined the quality of the output.
The current moment feels transitional. These tools aren't yet reliable enough to work completely autonomously, but they're capable enough to dramatically amplify what we can accomplish. The $8 debugging session for my game reminds me that the gaps in AI capabilities still matter, and knowing where those gaps are becomes its own form of expertise. Perhaps most intriguing is how quickly this landscape is changing. The research paper that took me an hour with AI assistance would have been impossible at this speed just eighteen months ago.
Rather than reaching definitive conclusions about how AI will transform work, I find myself collecting observations about a moving target. What seems consistent is that, for now, the greatest value comes not from surrendering control entirely to AI or clinging to entirely human workflows, but from finding the right points of collaboration for each specific task—a skill we're all still learning.