
One Useful Thing

Trying to understand the implications of AI for work, education, and life. By Prof. Ethan Mollick

No elephants: Breakthroughs in image generation

2025-03-30 19:40:44

Over the past two weeks, first Google and then OpenAI rolled out their multimodal image generation abilities. This is a big deal. Previously, when a Large Language Model AI generated an image, it wasn’t really the LLM doing the work. Instead, the AI would send a text prompt to a separate image generation tool and show you what came back. The LLM created the text prompt, but another, less intelligent system created the image. For example, if prompted “show me a room with no elephants in it, make sure to annotate the image to show me why there are no possible elephants,” the less intelligent image generation system would see the word elephant multiple times and add elephants to the picture. As a result, AI image generations were pretty mediocre, with distorted text and random elements; sometimes fun, but rarely useful.

Multimodal image generation, on the other hand, lets the AI directly control the image being made. While there are lots of variations (and the companies keep some of their methods secret), in multimodal image generation, images are created in the same way that LLMs create text, a token at a time. Instead of adding individual words to make a sentence, the AI creates the image in individual pieces, one after another, that are assembled into a whole picture. This lets the AI create much more impressive, exacting images. Not only are you guaranteed no elephants, but the final results of this image creation process reflect the intelligence of the LLM’s “thinking”, as well as clear writing and precise control.

The results of the prompt “show me a room with no elephants in it, make sure to annotate the image to show me why there are no possible elephants” in Microsoft Copilot’s traditional image generator (left), and GPT-4o’s multimodal model (right). Note the traditional model not only shows multiple elephants but also features distorted text.
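To make “a token at a time” more concrete, here is a minimal, runnable sketch in Python. Everything in it is a stand-in: the stub predictor and decoder are hypothetical placeholders for a real neural network and a real image decoder, and no actual model or API is called. The only point is the shape of the loop, in which each piece of the image is predicted with the prompt and all previously generated pieces in context, and the pieces are then decoded into a picture.

```python
# A toy sketch of autoregressive image generation. The "model" is a stub:
# real systems predict tokens with a neural network and decode image tokens
# into pixels with a learned decoder. Nothing here calls a real API.

from typing import List

VOCAB = ["<start>", "floor", "window", "sofa", "caption: no elephants here", "<end>"]

def stub_predict_next(context: List[str]) -> str:
    """Stand-in for next-token prediction, conditioned on everything generated so far."""
    return VOCAB[min(len(context) - 1, len(VOCAB) - 1)]

def generate_image_tokens(prompt: str) -> List[str]:
    tokens = ["<start>"]
    while tokens[-1] != "<end>":
        # Each new piece of the image is chosen with the full prompt and all
        # previous pieces in view -- which is why instructions like "no elephants"
        # or "annotate the image" can actually be respected.
        tokens.append(stub_predict_next([prompt] + tokens))
    return tokens

def decode_to_image(tokens: List[str]) -> str:
    """Stand-in for the decoder that assembles image tokens into a final picture."""
    return " + ".join(t for t in tokens if not t.startswith("<"))

print(decode_to_image(generate_image_tokens("a room with no elephants, annotated")))
```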

While the implications of these new image models are vast (and I'll touch on some issues later), let's first explore what these systems can actually do through some examples.

Prompting, but for images

In my book and in many posts, I talk about how a useful way to prompt AI is to treat it like a person, even though it isn’t. Giving clear directions, feedback as you iterate, and appropriate context to make a decision all help humans, and they also help AI. Previously, this was something you could only do with text, but now you can do it with images as well.

For example, I prompted GPT-4o “create an infographic about how to build a good boardgame.” With previous image generators, this would result in nonsense, as there was no intelligence to guide the image generation, so words and images would be distorted. Now, I get a good first pass on the first try. However, I did not provide context about what I was looking for, or any additional content, so the AI made all the creative choices. What if I want to change it? Let’s try.

First, I asked it “make the graphics look hyper realistic instead” and you can see how it took the concepts from the initial draft and updated their look. I had more changes I wanted: “I want the colors to be less earth toned and more like textured metal, keep everything else the same, also make sure the small bulleted text is lighter so it is easier to read.” I liked the new look, but I noticed an error had been introduced: the word “Define” had become “Definc” - a sign that these systems, as good as they are, are not yet close to perfect. I prompted “You spelled Define as Definc, please fix” and got a reasonable output.

But the fascinating thing about these models is that they are capable of producing almost any image: “put this infographic in the hands of an otter standing in front of a volcano, it should look like a photo and like the otter is holding this carved onto a metal tablet.”

Why stop there? “it is night, the tablet is illuminated by a flashlight shining directly at the center of the tablet (no need to show the flashlight)”— the results of this are more impressive than it might seem because it was redoing the lighting without any sort of underlying lighting model. “Make an action figure of the otter, complete with packaging, make the board game one of the accessories on the side. Call it "Game Design Otter" and give it a couple other accessories.” “Make an otter on an airplane using a laptop, they are buying a copy of Game Design Otter on a site called OtterExpress.” Impressive, but not quite right: “fix the keyboard so it is realistic and remove the otter action figure he is holding.”

As you can see these systems are not flawless… but also remember that the pictures below are what the results of the prompt “otter on an airplane using wifi” looked like two and a half years ago. The state-of-the-art is advancing rapidly.

But what is it good for?

The past couple years have been spent trying to figure out what text AI models are good for, and new use cases are being developed continuously. It will be the same with image-based LLMs. Image generation is likely to be very disruptive in ways we don’t understand right now. This is especially true because you can upload images that the LLM can now directly see and manipulate. Some examples, all done using GPT-4o (though you can also upload and create images in Google’s Gemini Flash):

I can take a hand-drawn image and ask the AI to “make this an ad for Speedster Energy drink, make sure the packaging and logo are awesome, this should look like a photograph.” (This took two prompts; the first time, it misspelled Speedster on the label.) The results are not as good as a professional designer could create but are an impressive first prototype.

I can give GPT-4o two photographs and the prompt “Can you swap out the coffee table in the image with the blue couch for the one in the white couch?” (Note how the new glass tabletop shows parts of the image that weren’t there in the original. On the other hand, the table that was swapped is not exactly the same). I then asked, “Can you make the carpet less faded?” Again, there are several details that are not perfect, but this sort of image editing in plain English was impossible before.

Or I can create an instant website mockup, ad concepts, and pitch deck for my terrific startup idea where a drone delivers guacamole to you on demand (pretty sure it is going to be a hit). You can see this is not yet a substitute for the insights of a human designer, but it is still a very useful first prototype.

Adding to this, there are many other uses that I and others are discovering, including visual recipes, homepages, textures for video games, illustrated poems, unhinged monologues, photo improvements, and visual adventure games, to name just a few.

Complexities

If you have been following the online discussion over these new image generators, you probably noticed that I haven’t demonstrated their most viral use - doing style transfer, where people ask AI to convert photos into images that look like they were made for the Simpsons or by Studio Ghibli. These sorts of applications highlight all of the complexities of using AI for art: Is it okay to reproduce the hard-won style of other artists using AI? Who owns the resulting art? Who profits from it? Which artists are in the training data for AI, and what is the legal and ethical status of using copyrighted work for training? These were important questions before multimodal AI, but now developing answers to them is increasingly urgent. Plus, of course, there are many other potential risks associated with multimodal AI. Deepfakes have been trivial to make for at least a year, but multimodal AI makes it easier, including adding the ability to create all sorts of other visual illusions, like fake receipts. And we don’t yet understand what biases or other issues multimodal AIs might bring into image generation.

Yet it is clear that what has happened to text will happen to images, and eventually video and 3D environments. These multimodal systems are reshaping the landscape of visual creation, offering powerful new capabilities while raising legitimate questions about creative ownership and authenticity. The line between human and AI creation will continue to blur, pushing us to reconsider what constitutes originality in a world where anyone can generate sophisticated visuals with a few prompts. Some creative professions will adapt; others may be unchanged, and still others may transform entirely. As with any significant technological shift, we'll need well-considered frameworks to navigate the complex terrain ahead. The question isn't whether these tools will change visual media, but whether we'll be thoughtful enough to shape that change intentionally.


The Cybernetic Teammate

2025-03-22 19:39:02

Over the past couple years, we have learned that AI can boost the productivity of individual knowledge workers ranging from consultants to lawyers to coders. But most knowledge work isn’t purely an individual activity; it happens in groups and teams. And teams aren't just collections of individuals – they provide critical benefits that individuals alone typically can't, including better performance, sharing of expertise, and social connections.

So, what happens when AI acts as a teammate? This past summer we conducted a pre-registered, randomized controlled trial of 776 professionals at Procter and Gamble, the consumer goods giant, to find out.

We are ready to share the results in a new working paper: The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise. Given the scale of this project, it shouldn’t be a surprise that this paper was a massive team effort coordinated by the Digital Data Design Institute at Harvard and led by Fabrizio Dell’Acqua, Charles Ayoubi, and Karim Lakhani, along with Hila Lifshitz, Raffaella Sadun, Lilach Mollick, me, and our partners at Procter and Gamble: Yi Han, Jeff Goldman, Hari Nair, and Stewart Taub.

We wanted this experiment to be a test of real-world AI use, so we were able to replicate the product development process at P&G, thanks to the cooperation and help of the company (which had no control over the results or data). To do that, we ran one-day workshops where professionals from Europe and the US had to actually develop product ideas, packaging, retail strategies and other tasks for the business units they really worked for, which included baby products, feminine care, grooming, and oral care. Teams with the best ideas had them submitted to management for approval, so there were some real stakes involved.

We also had two kinds of professionals in our experiment: commercial experts and technical R&D experts. They were generally very experienced, with over 10 years of work at P&G alone. We randomly created teams consisting of one person in each specialty. Half were given GPT-4 or GPT-4o to use, and half were not. We also picked a random set of both types of specialists to work alone, and gave half of them access to AI. Everyone assigned to the AI condition was given a training session and a set of prompts they could use or modify. This design allowed us to isolate the effects of AI and teamwork independently and in combination. We measured outcomes across multiple dimensions including solution quality (as determined by at least two expert judges per solution), time spent, and participants' emotional responses. What we found was interesting.

AI boosts performance

When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.

Teams with AI performed best overall with a 0.39 standard deviation improvement, though the difference between individuals with AI and teams with AI wasn't statistically significant. But we found an interesting pattern when looking at truly exceptional solutions, those ranking in the top 10% of quality. Teams using AI were significantly more likely to produce these top-tier solutions, suggesting that there is value in having human teams working on a problem that goes beyond the value of working with AI alone.

Both AI-enabled groups also worked much faster, saving 12-16% of the time spent by non-AI groups while producing solutions that were substantially longer and more detailed than those from non-AI groups.
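For readers who want to see what “a 0.37 standard deviation improvement” means mechanically, here is a small illustrative sketch. The numbers are made up for illustration and are not the study’s data; the point is only that judged quality scores in two conditions are compared by dividing the difference in means by the pooled standard deviation.

```python
# Illustrative only: fabricated example scores, not data from the P&G experiment.
# Shows how a standardized effect (difference measured in standard deviations)
# is typically computed from judged solution-quality scores.

import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(loc=5.0, scale=1.0, size=200)   # e.g. individuals without AI
with_ai = rng.normal(loc=5.4, scale=1.0, size=200)    # e.g. individuals with AI

pooled_sd = np.sqrt((baseline.var(ddof=1) + with_ai.var(ddof=1)) / 2)
effect_size = (with_ai.mean() - baseline.mean()) / pooled_sd

print(f"standardized improvement: about {effect_size:.2f} SD")  # roughly 0.4 SD with these made-up numbers
```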

Expertise boundaries vanish

Without AI, we saw clear professional silos in how people approached problems. R&D specialists consistently proposed technically-oriented solutions while Commercial specialists suggested market-focused ideas. When these specialists worked together in teams without AI, they produced more balanced solutions through their cross-functional collaboration (teamwork wins again!).

But this was another place AI made a big difference. When paired with AI, both R&D and Commercial professionals, in teams or when working alone, produced balanced solutions that integrated both technical and commercial perspectives. The distinction between specialists virtually disappeared in AI-aided conditions, as you can see in the graph. We saw a similar effect on teams.

This effect was especially pronounced for employees less familiar with product development. Without AI, these less experienced employees performed relatively poorly even in teams. But with AI assistance, they suddenly performed at levels comparable to teams that included experienced members. AI effectively helped people bridge functional knowledge gaps, allowing them to think and create beyond their specialized training, and helped amateurs act more like experts.

Working with AI led to better emotional experiences

A particularly surprising finding was how AI affected the emotional experience of work. Technological change, and especially AI, has often been associated with reduced workplace satisfaction and increased stress. But our results showed the opposite, at least in this case.

Positive emotions increase and negative emotions decrease after working with AI compared to teams and individuals who did not have AI access.

People using AI reported significantly higher levels of positive emotions (excitement, energy, and enthusiasm) compared to those working without AI. They also reported lower levels of negative emotions like anxiety and frustration. Individuals working with AI had emotional experiences comparable to or better than those working in human teams.

While we conducted a thorough study that involved a pre-registered randomized controlled trial, there are always caveats to these sorts of studies. For example, it is possible that larger teams would show very different results when working with AI, or that working with AI for longer projects may impact its value. It is also possible that our results represent a lower bound: all of these experiments were conducted with GPT-4 or GPT-4o, less capable models than what are available today; the participants did not have a lot of prompting experience so they may not have gotten as much benefit; and chatbots are not really built for teamwork. There is a lot more detail on all of this in the paper, but limitations aside, the bigger question might be: why does this all matter?

Why This Matters

Organizations have primarily viewed AI as just another productivity tool, like a better calculator or spreadsheet. This made sense initially but has become increasingly limiting as models get better and as recent data finds users most often employ AI for critical thinking and complex problem solving, not just routine productivity tasks. Companies that focus solely on efficiency gains from AI will not only find workers unwilling to share their AI discoveries for fear of making themselves redundant but will also miss the opportunity to think bigger about the future of work.

To successfully use AI, organizations will need to change their analogies. Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences. This teammate perspective should make organizations think differently about AI. It suggests a need to reconsider team structures, training programs, and even traditional boundaries between specialties. At least with the current set of AI tools, AI augments human capabilities. It democratizes expertise as well, enabling more employees to contribute meaningfully to specialized tasks and potentially opening new career pathways.

The most exciting implication may be that AI doesn't just automate existing tasks, it changes how we can think about work itself. The future of work isn't just about individuals adapting to AI, it's about organizations reimagining the fundamental nature of teamwork and management structures themselves. And that's a challenge that will require not just technological solutions, but new organizational thinking.


Speaking things into existence

2025-03-12 02:21:28

Influential AI researcher Andrej Karpathy wrote two years ago that “the hottest new programming language is English,” a topic he expanded on last month with the idea of “vibecoding,” a practice where you just ask an AI to create something for you, giving it feedback as it goes. I think the implications of this approach are much wider than coding, but I wanted to start by doing some vibecoding myself.

I decided to give it a try using Anthropic’s new Claude Code agent, which gives the Claude Sonnet 3.7 LLM the ability to manipulate files on your computer and use the internet. Actually, I needed AI help before I could even use Claude Code. I can only code in a few very specific programming languages (mostly used in statistics) and have no experience at all with Linux machines. Yet Claude Code only runs in Linux. Fortunately, Claude told me how to handle my problems, so after some vibetroubleshooting (seriously, if you haven’t used AI for technical support, you should) I was able to set up Claude Code.

Time to vibecode. The very first thing I typed into Claude Code was: “make a 3D game where I can place buildings of various designs and then drive through the town i create.” That was it, grammar and spelling issues included. I got a working application (Claude helpfully launched it in my browser for me) about four minutes later, with no further input from me. You can see the results in the video below.

It was pretty neat, but a little boring, so I wrote: hmmm its all a little boring (also sometimes the larger buildings don't place properly). Maybe I control a firetruck and I need to put out fires in buildings? We could add traffic and stuff.

A couple minutes later, it made my car into a fire truck, added traffic, and made it so houses burst into flame. Now we were getting somewhere, but there were still things to fix. I gave Claude feedback: looking better, but the firetruck changes appearance when moving (wheels suddenly appear) and there is no issue with traffic or any challenge, also fires don't spread and everything looks very 1980s, make it all so much better.

After seeing the results, I gave it a fourth, and final, command as a series of three questions: can i reset the board? can you make the buildings look more real? can you add in a rival helicopter that is trying to extinguish fires before me? You can see the results of all four prompts in the video below. It is a working, if blocky, game, but one that includes day and night cycles, light reflections, missions, and a computer-controlled rival, all created using the hottest of all programming languages - English.

Actually, I am leaving one thing out. Between the third and fourth prompts, something went wrong and the game just wouldn't work. As someone with no programming skills in JavaScript or whatever the game was written in, I had no idea how to fix it. The result was a sequence of back-and-forth discussions with the AI where I would tell it errors and it would work to solve them. After twenty minutes, everything was working again, better than ever. In the end, the game cost around $5 in Claude API fees to make… and $8 more to get around the bug, which turned out to be a pretty simple problem. Prices will likely fall quickly, but the lesson is useful: as amazing as it is (I made a working game by asking!), vibecoding is most useful when you actually have some knowledge and don't have to rely on the AI alone. A better programmer might have immediately recognized that the issue was related to asset loading or event handling. And this was a small project; I am less confident of my ability to work with AI to handle a large codebase or complex project, where even more human intervention would be required.

This underscores how vibecoding isn't about eliminating expertise but redistributing it - from writing every line of code to knowing enough about systems to guide, troubleshoot, and evaluate. The challenge becomes identifying what "minimum viable knowledge" is necessary to effectively collaborate with AI on various projects.

Vibeworking with expertise

Expertise clearly still matters in a world of creating things with words. After all, you have to know what you want to create; be able to judge whether the results are good or bad; and give appropriate feedback. As I wrote in my book, with current AIs, you can often achieve the best results by working as a co-intelligence with AI systems which continue to have a "jagged frontier" of abilities.

But applying expertise need not involve a lot of work. Take, for example, my recent experience with Manus, a new AI agent out of China. It basically uses Claude (and possibly other LLMs as well) but gives the AI access to a wide range of tools, including the ability to do web research, write code, create documents and websites, and more. It is the most capable general-purpose agent I have seen so far, but like other general agents, it still makes errors and mistakes. Despite that, it can accomplish some pretty impressive things.

For example, here is a small portion of what it did when I asked it to “create an interactive course on elevator pitching using the best academic advice.” You can see the system set up a checklist of tasks and then go through them, doing web research before building the pages (this is sped up, the actual process unfolds autonomously, but over tens of minutes or even hours).

As someone who teaches entrepreneurship, I would say that the output it created was surface-level impressive - it was an entire course that covered much of the basics of pitching, and without obvious errors! Yet, I also could instantly see that it was too text heavy and did not include opportunities for knowledge checks or interactive exercises. I gave the AI a second prompt: “add interactive experiences directly into course material and links to high quality videos.” Even though this was the bare minimum feedback, it was enough to improve the course considerably, as you can see below.

On the left you can see the overall class structure it created; when I clicked on the first lesson, it took me to an overall module guide, and then each module was built out with videos, text, and interactive quizzes.

If I were going to deploy the course, I would push the AI further and curate the results much more, but it is impressive to see how far you can get with just a little guidance. But there are other modes of vibework as well. While course creation demonstrates AI's ability to handle casual structured creative work with minimal guidance, research represents a more complex challenge requiring deeper expertise integration.

Deep Vibeworking

It is at the cutting edge of expertise that AI is most interesting to use. Unfortunately for anyone writing about this sort of work, these are also the use cases that are hardest to explain, but I can give you one example.

I have a large, anonymized set of data about crowdfunding efforts that I collected nearly a decade ago, but never got a chance to use for any research purposes. The data is very complex - a huge Excel file, a codebook (that explains what the various parts of the Excel file mean), and a data dictionary (that details each entry in the Excel file). Working with the data involves frequent cross-referencing across these files and is especially tedious if you haven’t worked with the data in a long time. I was curious how far I could get in writing a new research paper using this old data with the help of AI.
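To give a sense of why that cross-referencing is tedious by hand, here is a hedged sketch of the lookup work in pandas. The file names, sheet layouts, and column names are invented for illustration; they are not the actual files described above.

```python
# Hypothetical sketch: every file name and column below is invented for illustration.

import pandas as pd

data = pd.read_excel("crowdfunding_data.xlsx")        # the raw observations
codebook = pd.read_excel("codebook.xlsx")             # maps cryptic column codes to meanings
dictionary = pd.read_excel("data_dictionary.xlsx")    # details the valid values for each entry

def describe_variable(code: str) -> str:
    """Answering 'what does this column mean?' requires a round-trip through both helper files."""
    meaning = codebook.loc[codebook["code"] == code, "description"].iloc[0]
    values = dictionary.loc[dictionary["code"] == code, "allowed_values"].iloc[0]
    return f"{code}: {meaning} (allowed values: {values})"

print(describe_variable("Q17_B"))  # e.g. look up an arbitrary column before using it in analysis
```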

I started by getting an OpenAI Deep Research report on the latest literature on how organizations could impact crowdfunding. I was able to check the report over based on my knowledge. I knew that it would not include all the latest articles (Deep Research cannot access paid academic content), but its conclusions were solid and would be useful to the AI when considering what topics might be worth exploring. So, I pasted in the report and the three files into the secure version of ChatGPT provided by my university and worked with multiple models to generate hypotheses. The AI suggested multiple potential directions, but I needed to filter them based on what would actually contribute meaningfully to the field—a judgment call requiring years of experience with the relevant research.

Then I worked back and forth with the models to test the hypothesis and confirm that our findings were correct. The AI handled the complexity of the data analysis and made a lot of suggestions, while I offered overall guidance and direction about what to do next. At several points, the AI proposed statistically valid approaches that I, with my knowledge of the data, knew would not be appropriate. Together, we worked through the hypothesis to generate fairly robust findings.

Then I gave all of the previous output to o1-pro and asked it to write a paper, offering a few suggestions along the way. It is far from a blockbuster, but it would make a solid contribution to the state of knowledge (after a bit more checking of the results, as AI still makes errors). More interestingly, it took less than an hour to create, as compared to weeks of thinking, planning, writing, coding and iteration. Even if I had to spend an hour checking the work, it would still result in massive time savings.

I never had to write a line of code, but only because I knew enough to check the results and confirm that everything made sense. I worked in plain English, shaving dozens of hours of work that I could not have done anywhere near as quickly without the AI… but there were many places where the AI did not yet have the “instincts” to solve problems properly. The AI is far from being able to work alone, humans still provide both vibe and work in the world of vibework.

Work is changing

Work is changing, and we're only beginning to understand how. What's clear from these experiments is that the relationship between human expertise and AI capabilities isn't fixed. Sometimes I found myself acting as a creative director, other times as a troubleshooter, and yet other times as a domain expert validating results. It was my complex expertise (or lack thereof) that determined the quality of the output.

The current moment feels transitional. These tools aren't yet reliable enough to work completely autonomously, but they're capable enough to dramatically amplify what we can accomplish. The $8 debugging session for my game reminds me that the gaps in AI capabilities still matter, and knowing where those gaps are becomes its own form of expertise. Perhaps most intriguing is how quickly this landscape is changing. The research paper that took me an hour with AI assistance would have been impossible at this speed just eighteen months ago.

Rather than reaching definitive conclusions about how AI will transform work, I find myself collecting observations about a moving target. What seems consistent is that, for now, the greatest value comes not from surrendering control entirely to AI or clinging to entirely human workflows, but from finding the right points of collaboration for each specific task—a skill we're all still learning.


A new generation of AIs: Claude 3.7 and Grok 3

2025-02-25 02:42:26

Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger. I updated the post with that information. The only significant change is that Claude 3.7 is now referred to as an advanced model but not a Gen3 model.

I have been experimenting with the first of a new generation of AI models, Claude 3.7 and Grok 3, for the last few days. Grok 3 is the first model that we know of to be trained with an order of magnitude more computing power than GPT-4, and Claude includes new coding and reasoning capabilities, so they are not just interesting in their own right but also tell us something important about where AI is going.

Before we get there, a quick review: this new generation of AIs is smarter and the jump in capabilities is striking, particularly in how these models handle complex tasks, math and code. These models often give me the same feeling I had when using ChatGPT-4 for the first time, where I am equally impressed and a little unnerved by what it can do. Take Claude's native coding ability: I can now get working programs through natural conversation or documents, no programming skill needed.

For example, giving Claude a proposal for a new AI educational tool and engaging in conversation where it was asked to “display the proposed system architecture in 3D, make it interactive,” resulted in this interactive visualization of the core design in our paper, with no errors. You can try it yourself here, and edit or change it by asking the AI. The graphics, while neat, are not the impressive part. Instead, it was that Claude decided to turn this into a step-by-step demo to explain the concepts, which wasn’t something that it was asked to do. This anticipation of needs and consideration of new angles of approach is something new in AI.

Or, for a more playful example, I told Claude “make me an interactive time machine artifact, let me travel back in time and interesting things happen. pick unusual times I can go back to…” and “add more graphics.” What emerged after just those two prompts was a fully functional interactive experience, complete with crude but charming pixel graphics (which are actually surprisingly impressive - the AI has to 'draw' these using pure code, without being able to see what it's creating, like an artist painting blindfolded but still getting the picture right).

To be clear, these systems are far from perfect and make mistakes, but they are getting much better, and fast. To understand where things are and where they are going, we need to look at the two Scaling Laws.

The Two Scaling Laws

Though they may not look it, these may be the two most important graphs in AI. Published by OpenAI, they show the two “Scaling Laws,” which tell you how to increase the ability of the AI to answer hard questions, in this case to score more highly on the famously difficult American Invitational Mathematics Examination (AIME).

The left-hand graph is the training Scaling Law. It shows that larger models are more capable. Training these larger models requires increasing the amount of computing power, data, and energy used, and you need to do so on a grand scale. Typically, you need a 10x increase in computing power to get a linear increase in performance. Computing power is measured in FLOPs (Floating Point Operations) which are the number of basic mathematical operations, like addition or multiplication, that a computer performs, giving us a way to quantify the computational work done during AI training.

We are now seeing the first models of a new generation of AIs, trained with over 10x the computing power of GPT-4 and its many competitors. These models use over 10^26 FLOPS of computing power in training. This is a staggering amount of computing power, equivalent to running a modern smartphone for 634,000 years or the Apollo Guidance Computer that took humans to the moon for 79 trillion years. Naming 10^26 is awkward, though - it is one hundred septillion FLOPS, or, taking a little liberty with standard unit names, a HectoyottaFLOP. So, you can see why I just call them Gen3 models, the first set of AIs that were trained with an order of magnitude more computing power than GPT-4 (Gen2).
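The smartphone comparison is easy to sanity-check. Assuming a modern phone sustains on the order of 5 teraFLOPS (an assumption for illustration; real figures vary by chip and workload), the back-of-the-envelope arithmetic looks like this:

```python
# Back-of-the-envelope check of the 10^26 FLOP figure, using an assumed phone speed.

training_flops = 1e26                   # rough training budget of a Gen3 model
phone_flops_per_second = 5e12           # assumed ~5 TFLOPS for a modern smartphone
seconds_per_year = 60 * 60 * 24 * 365   # about 3.15e7 seconds

years = training_flops / (phone_flops_per_second * seconds_per_year)
print(f"about {years:,.0f} years")      # roughly 634,000 years, matching the figure above
```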

xAI, Elon Musk's AI company, made the first public move into Gen3 territory with Grok 3, which is unsurprising given their strategy. xAI is betting big on the idea that bigger (way bigger) is better. xAI built the world’s largest computer cluster in record time, and that meant Grok 3 was the first AI model to show us whether the Scaling Law would hold up for a new generation of AI. It seems that it did, as Grok 3 had the highest benchmark scores we've seen from any base model. Today, Claude 3.7 was released; though not yet a Gen3 model, it also shows substantial improvements in performance over previous AIs. While it is similar in benchmarks to Grok 3, I personally find it more clever for my use cases, but you may find otherwise. The still unreleased o3 from OpenAI also seems to be a Gen3 model, with excellent performance. It is likely this is just the beginning - more companies are gearing up to launch their own models at this scale, including Anthropic.

You might have noticed I haven’t yet mentioned the second graph, the one on the right. While the first Scaling Law is about throwing massive computing power at training (basically, building a smarter AI from the start), this second one revealed something surprising: you can make AI perform better simply by giving it more time to think. OpenAI discovered that if you let a model spend more computing power working through a problem (what they call test-time or inference-time compute), it gets better results - kind of like giving a smart person a few extra minutes to solve a puzzle. This second Scaling Law led to the creation of Reasoners, which I wrote about in my last post. The new generation of Gen3 models will all operate as Reasoners when needed, so they have two advantages: larger scale in training, and the ability to scale when actually solving a problem.

An example of three different models using reasoning

Together, these two trends are supercharging AI abilities, and also adding others. If you have a large, smart AI model, it can be used to create smaller, faster, cheaper models that are still quite smart, if not as smart as their parent. And if you add Reasoner capabilities to even small models, they get even smarter. What that means is that AI abilities are getting better even as costs are dropping. This graph shows how quickly this trend has advanced, mapping the capability of AI on the y-axis and the logarithmically decreasing costs on the x-axis. When GPT-4 came out, it cost around $50 per million tokens (a token is roughly a word); now it costs around 12 cents per million tokens to use Gemini 1.5 Flash, an even more capable model than the original GPT-4.

The Graduate-Level Google-Proof Q&A test (GPQA) is a series of very hard multiple-choice problems designed to test advanced knowledge. PhDs with access to the internet get 34% right on this test outside their specialty, 81% inside their specialty. The cost per million tokens is the cost of using the model (Gemini Flash Thinking Costs are estimated). Data based on my research, but Epoch and Artificial Analysis were good sources, and Latent Space offers its own more comprehensive graph of costs across many models.

You can see the intelligence of models is increasing, and their cost is decreasing over time. That has some pretty big implications for all of us.
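The scale of that price drop is worth working out with the two prices quoted above (roughly $50 versus $0.12 per million tokens). The workload size below is an arbitrary example, just to make the difference tangible:

```python
# Quick arithmetic on the quoted prices; the workload is an arbitrary example.

gpt4_launch_price = 50.00     # dollars per million tokens, as quoted above
gemini_flash_price = 0.12     # dollars per million tokens, as quoted above

print(f"price ratio: about {gpt4_launch_price / gemini_flash_price:.0f}x cheaper")  # ~417x

# Cost to process 10,000 documents of 5,000 tokens each (50 million tokens total):
tokens = 10_000 * 5_000
print(f"at GPT-4 launch prices: ${gpt4_launch_price * tokens / 1e6:,.2f}")   # $2,500.00
print(f"at Gemini Flash prices: ${gemini_flash_price * tokens / 1e6:,.2f}")  # $6.00
```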

Taking Scale Seriously

A lot of the focus on AI use, especially in the corporate world, has been stuck in what I call the “automation mindset” - viewing AI primarily as a tool for speeding up existing workflows like email management and meeting transcription. This perspective made sense for earlier AI models, but it's like evaluating a smartphone solely on its ability to make phone calls. The Gen3 generation give the opportunity for a fundamental rethinking of what's possible.

As models get better, and as they apply more tricks like reasoning and internet access, they hallucinate less (though they still make mistakes) and they are capable of higher order “thinking.” For example, in this case we gave Claude a 24 page academic paper outlining a new way of creating teaching games with AI, along with some unrelated instruction manuals for other games. We asked the AI to use those examples and write a customer-friendly guide for a game based on our academic paper. The results were extremely high-quality. To do this, the AI needed to both abstract out the ideas in the paper, and the patterns and approaches from other instruction manuals, and build something entirely new. This would have been a week of PhD-level work, done in a few seconds. And, on the right, you can also see an excerpt from another PhD-level task, reading a complex academic paper and checking the math and logic, as well as the implications for practice.

Managers and leaders will need to update their beliefs for what AI can do, and how well it can do it, given these new AI models. Rather than assuming they can only do low-level work, we will need to consider the ways in which AI can serve as a genuine intellectual partner. These models can now tackle complex analytical tasks, creative work, and even research-level problems with surprising sophistication. The examples I've shared - from creating interactive 3D visualizations of academic concepts to performing PhD-level analysis - demonstrate that we're moving beyond simple automation into the realm of AI-powered knowledge work. These systems are still far from flawless, nor do they beat human experts consistently across a wide range of tasks, but they are very impressive.

This shift has profound implications for how organizations should approach AI integration. First, the focus needs to move from task automation to capability augmentation. Instead of asking "what tasks can we automate?" leaders should ask "what new capabilities can we unlock?" And they will need to build the capacity in their own organizations to help explore, and develop these changes.

Second, the rapid improvement in both capabilities and cost efficiency means that any static strategy for AI implementation will quickly become outdated. Organizations need to develop dynamic approaches that can evolve as these models continue to advance. Going all-in on a particular model today is not a good plan in a world where both Scaling Laws are operating.

Finally, and perhaps most importantly, we need to rethink how we measure and value AI contributions. The traditional metrics of time saved or costs reduced may miss the more transformative impacts of these systems - their ability to generate novel insights, synthesize complex information, and enable new forms of problem-solving. Moving too quickly to concrete KPIs, and leaving behind exploration, will blind companies to what is possible. Worse, they encourage companies to think of AI as a replacement for human labor, rather than exploring ways in which human work can be boosted by AI.

Exploring for Yourself

With that serious warning out of the way, I want to leave you with a suggestion. These new models are clever, but they are also friendly and more engaging to use. They are likely to ask you questions or push your thinking in new directions, and tend to be good at two-way conversation. The best way to understand their capabilities, then, is to explore them yourself. Claude 3.7 is available for paying customers and has a neat feature where it can run the code it writes for you, as you have seen throughout this post. It does not train on your uploaded data. Grok 3 is free and has a wider range of features, including a good Deep Research option, but is harder for amateurs to use for coding. It is not as good as Claude 3.7 for the tasks I have tried, but the xAI commitment to scaling means it will improve rapidly. You should also note that Grok does train on your data, but that can be turned off for paying customers.

Regardless of what model you pick, you should experiment. Ask the model to code something for you by just asking for it (I asked Claude for a video game with unique mechanics based on the Herman Melville story “Bartleby, the Scrivener” - and it did so based on a single prompt), feed it a document and ask it for an infographic summary, or ask it to comment on an image you upload. If this is too playful, follow the advice in my book and just use it for work tasks, taking into account the privacy caveat above. Use it to brainstorm new ideas, ask it how a news article or analyst report might affect your business, or ask it to create a financial dashboard for a new product or startup concept. You will likely find cases that amaze you, and others where the new models are not yet good enough to be helpful.

The limitations of these models remain very real, but the fact that Gen3 AIs are better than Gen2, due to both the first and second Scaling Laws, shows us something essential. These laws aren't fundamental constants of the universe - they're observations about what happens when you throw massive resources at AI development. The computing power keeps growing, the capabilities keep improving, and this cycle accelerates with each generation. As long as these laws continue to hold, AIs will keep getting better, and the next generation of models is likely to keep delivering rapid improvements into the future.


The End of Search, The Beginning of Research

2025-02-03 20:38:53

A hint to the future arrived quietly over the weekend. For a long time, I've been discussing two parallel revolutions in AI: the rise of autonomous agents and the emergence of powerful Reasoners since OpenAI's o1 was launched. These two threads have finally converged into something really impressive - AI systems that can conduct research with the depth and nuance of human experts, but at machine speed. OpenAI's Deep Research demonstrates this convergence and gives us a sense of what the future might be. But to understand why this matters, we need to start with the building blocks: Reasoners and agents.

Reasoners

For the past couple years, whenever you used a chatbot, it worked in a simple way: you typed something in, and it immediately started responding word by word (or more technically, token by token). The AI could only "think" while producing these tokens, so researchers developed tricks to improve its reasoning - like telling it to "think step by step before answering." This approach, called chain-of-thought prompting, markedly improved AI performance.

Reasoners essentially automate the process, producing “thinking tokens” before actually giving you an answer. This was a breakthrough in at least two important ways. First, because the AI companies could now get AIs to learn how to reason based on examples of really good problem-solvers, the AI can “think” more effectively. This training process can produce a higher quality chain-of-thought than we can by prompting. This means Reasoners are capable of solving much harder problems, especially in areas like math or logic where older chatbots failed.
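The practical difference is visible in a short sketch using the OpenAI Python SDK. The model names are examples only and change frequently, so treat the exact identifiers as assumptions; the contrast is between manually prompting a chat model to show its steps and calling a reasoning model that produces its own hidden thinking tokens before answering.

```python
# Sketch of chain-of-thought prompting versus a reasoning model.
# Model names are illustrative; requires an OPENAI_API_KEY in the environment.

from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 together, and the bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

# 1) Classic chain-of-thought prompting: ask an ordinary chat model to reason out loud.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Think step by step before answering, then give the answer:\n{question}"}],
)
print(chat.choices[0].message.content)

# 2) A Reasoner: no special prompt needed. It spends inference-time compute on
#    "thinking tokens" you never see, then returns only the final answer.
reasoned = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": question}],
)
print(reasoned.choices[0].message.content)
```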

The second way this was a breakthrough is that it turns out that the longer Reasoners “think,” the better their answers get (though the rate of improvement slows as they think longer). This is a big deal because previously the only way to make AIs perform better was to train bigger and bigger models, which is very expensive and requires a lot of data. Reasoning models show you can make AIs better by just letting them produce more and more thinking tokens, using computing power at the time of answering your question (called inference-time compute) rather than when the model was trained.

The Graduate-Level Google-Proof Q&A test (GPQA) is a series of multiple-choice problems designed so that internet access doesn't help. PhDs with access to the internet get 34% right on this test outside their specialty and 81% inside their specialty. It illustrates how reasoning models have sped up the capability gains of AI. Data source.

Because Reasoners are so new, their capabilities are expanding rapidly. In just months, we've seen dramatic improvements from OpenAI's o1 family to their new o3 models. Meanwhile, China's DeepSeek r1 has found innovative ways to boost performance while cutting costs, and Google has launched their first Reasoner. This is just the beginning - expect to see more of these powerful systems, and soon.

Agents

While experts debate the precise definition of an AI agent, we can think of it simply as “an AI that is given a goal and can pursue that goal autonomously.” Right now, there's an AI labs arms race to build general-purpose agents - systems that can handle any task you throw at them. I've written about some early examples like Devin and Claude with Computer Use, but OpenAI just released Operator, perhaps the most polished general-purpose agent yet.

The video below, sped up 16x, captures both the promise and pitfalls of general-purpose agents. I give Operator a task: read my latest substack post at OneUsefulThing and then go onto Google ImageFX and make an appropriate image, download it, and give it to me to post. What unfolds is enlightening. At first, Operator moves with impressive precision - finding my website, reading the post, navigating to ImageFX (pausing briefly for me to enter my login), and creating the image. Then the troubles begin, and they're twofold: not only is Operator blocked by OpenAI's security restrictions on file downloads, but it also starts to struggle with the task itself. The agent methodically tries every conceivable workaround: copying to clipboard, generating direct links, even diving into the site's source code. Each attempt fails - some due to OpenAI's browser restrictions, others due to the agent's own confusion about how to actually accomplish the task. Watching this determined but ultimately failed problem-solving loop reveals both the current limitations of these systems and raises questions about how agents will eventually behave when they encounter barriers in the real world.

Operator's issues highlight the current limits of general-purpose agents, but that doesn’t suggest that agents are useless. It appears that economically valuable narrow agents that focus on specific tasks are already possible. These specialists, powered by current LLM technology, can achieve remarkable results within their domains. Case in point: OpenAI's new Deep Research, which shows just how powerful a focused AI agent can be.

Deep Research

OpenAI’s Deep Research (not to be confused with Google’s Deep Research, more on that soon) is essentially a narrow research agent, built on OpenAI’s still unreleased o3 Reasoner, and with access to special tools and capabilities. It is one of the more impressive AI applications I have seen recently. To understand why, let’s give it a topic. I am specifically going to pick a highly technical and controversial issue within my field of research: “When should startups stop exploring and begin to scale? I want you to examine the academic research on this topic, focusing on high quality papers and RCTs, including dealing with problematic definitions and conflicts between common wisdom and the research. Present the results for a graduate-level discussion of this issue.”

The AI asks some smart questions, and I clarify what I want. Now o3 goes off and gets to work. You can see its progress and “thinking” as it goes. It is really worth taking a second to look at a couple samples of that process below. You can see that the AI is actually working as a researcher, exploring findings, digging deeper into things that “interest” it, and solving problems (like finding alternative ways of getting access to paywalled articles). This goes on for five minutes.

Seriously, take a moment to look at these three slices of its “thought” process.

At the end, I get a 13 page, 3,778 word draft with six citations and a few additional references. It is, honestly, very good, even if I would have liked a few more sources. It wove together difficult and contradictory concepts, found some novel connections I wouldn’t expect, cited only high-quality sources, and was full of accurate quotations. I cannot guarantee everything is correct (though I did not see any errors) but I would have been satisfied to see something like it from a beginning PhD student. You can see the full results here but the couple excerpts below would suffice to show you why I am so impressed.

The quality of citations also marks a genuine advance here. These aren't the usual AI hallucinations or misquoted papers - they're legitimate, high-quality academic sources, including seminal work by my colleagues Saerom (Ronnie) Lee and Daniel Kim. When I click the links, they don't just lead to the papers, they often take me directly to the relevant highlighted quotes. While there are still constraints - the AI can only access what it can find and read in a few minutes, and paywalled articles remain out of reach - this represents a fundamental shift in how AI can engage with academic literature. For the first time, an AI isn't just summarizing research, it's actively engaging with it at a level that actually approaches human scholarly work.

It is worth contrasting it with Google’s product launched last month also called Deep Research (sigh). Google surfaces far more citations, but they are often a mix of websites of varying quality (the lack of access to paywalled information and books hurts all of these agents). It appears to gather documents all at once, as opposed to the curiosity-driven discovery of OpenAI’s researcher agent. And, because (as of now) this is powered by the non-reasoning, older Gemini 1.5 model, the overall summary is much more surface-level, though still solid and apparently error-free. It is like a very good undergraduate product. I suspect that the difference will be clear if you read a little bit below.

To put this in perspective: both outputs represent work that would typically consume hours of human effort - near PhD-level analysis from OpenAI's system, solid undergraduate work from Google's. OpenAI makes some bold claims in their announcement, complete with graphs suggesting their agent can handle 15% of high economic value research projects and 9% of very high value ones. While these numbers deserve skepticism - their methodology isn't explained - my hands-on testing suggests they're not entirely off base. Deep Research can indeed produce valuable, sophisticated analysis in minutes rather than hours. And given the rapid pace of development, I expect Google won't let this capability gap persist for long. We are likely to see fast improvement in research agents in the coming months.

The pieces come together

You can start to see how the pieces that the AI labs are building aren't just fitting together - they're playing off each other. The Reasoners provide the intellectual horsepower, while the agentic systems provide the ability to act. Right now, we're in the era of narrow agents like Deep Research, because even our best Reasoners aren't ready for general-purpose autonomy. But narrow isn’t limiting - these systems are already capable of performing work that once required teams of highly-paid experts or specialized consultancies.

These experts and consultancies aren't going away - if anything, their judgment becomes more crucial as they evolve from doing the work to orchestrating and validating the work of AI systems. But the labs believe this is just the beginning. They're betting that better models will crack the code of general-purpose agents, expanding beyond narrow tasks to become autonomous digital workers that can navigate the web, process information across all modalities, and take meaningful action in the world. Operator shows we aren’t there yet, but Deep Research suggests that we may be on our way.


Which AI to Use Now: An Updated Opinionated Guide (Updated Again 2/15)

2025-01-26 20:45:48

Please note that I updated this guide on 2/15, less than a month after writing it - a lot has changed in a short time.

While my last post explored the race for Artificial General Intelligence – a topic recently thrust into headlines by Apollo Program-scale funding commitments to building new AIs – today I'm tackling the one question I get asked most: what AI should you actually use? Not five years from now. Not in some hypothetical future. Today.

Every six months or so, I have written an opinionated guide for individual users of AI, not specializing in any one type of use, but as a general overview. Writing this is getting more challenging. AI models are gaining capabilities at an increasingly rapid rate, new companies are releasing new models, and nothing is well documented or well understood. In fact, in the few days I have been working on this draft, I had to add an entirely new model and update the chart below multiple times due to new releases. As a result, I may get something wrong, or you may disagree with my answers, but that is why I consider it an opinionated guide (though as a reminder, I take no money from AI labs, so it is my opinion!)

A Tour of Capabilities

To pick an AI model for you, you need to know what they can do. I decided to focus here on the major AI companies that offer easy-to-use apps that you can run on your phone, and which allow you to access their most up-to-date AI models. Right now, to consistently access a frontier model with a good app, you are going to need to pay around $20/month (at least in the US), with a couple exceptions. Yes, there are free tiers, but you'll generally want paid access to get the most capable versions of these models.

We are going to go through things in detail, but, for most people, there are three good choices right now: Claude from Anthropic, Google’s Gemini, and OpenAI’s ChatGPT. There are also a trio of models that might make sense for specialized users: Grok by Elon Musk’s X.ai is an excellent model that is most useful if you are a big X user; Microsoft’s Copilot offers many of the features of ChatGPT and is accessible to users through Windows; and DeepSeek r1, a Chinese model that is remarkably capable (and free). I’ll talk about some caveats and other options at the end.

Service and Model

For most people starting to use AI, the most important goal is to ensure that you have access to a frontier model with its own app. Frontier models are the most advanced AIs, and, thanks to the 'scaling law' (where bigger models get disproportionately smarter), they’re far more capable than older versions. That means they make fewer mistakes, and they often can provide more useful features.

The problem is that most of the AI companies push you towards their smaller AI models if you don’t pay for access, and sometimes even if you do. Generally, smaller models are much faster to run, slightly less capable, and also much cheaper for the AI companies to operate. For example, GPT-4o-mini is the smaller version of GPT-4 and Gemini Flash is the smaller version of Gemini. Often, you want to use the full models where possible, but there are exceptions when the smaller model is actually more advanced. And everything has terrible names. Right now, for Claude you want to use Claude 3.5 Sonnet (which consistently outperforms its larger sibling Claude 3 Opus), for Gemini you want to use Gemini 2.0 Pro (though Gemini 2.0 Flash Thinking is also excellent), and for ChatGPT you want to use GPT-4o (except when tackling complex problems that benefit from o1 or o3's reasoning capabilities). While this can be confusing, it is also a side effect of how quickly these companies are updating their AIs, and their features.

Live Mode

Imagine an AI that can converse with you in real time, seeing what you see, hearing what you say, and responding naturally – that's “Live Mode” (though it goes by various names). This interactive capability represents a powerful way to use AI. To demonstrate, I used ChatGPT's “Advanced Voice Mode” to discuss my game collection. This entire interaction, which you can hear with sound on, took place on my phone.

You are actually seeing three advances in AI working together: First, multimodal speech lets the AI handle voice natively, unlike most AI models that use separate systems to convert between text and speech. This means it can theoretically generate any sound, though OpenAI limits this for safety. Second, multimodal vision lets the AI see and analyze real-time video. Third, internet connectivity provides access to current information. The system isn't perfect - when pulling the board game ratings from the internet, it got one right but mixed up another with its expansion pack. Still, the seamless combination of these features creates a remarkably natural interaction, like chatting with a knowledgeable (if not always 100% accurate) friend who can see what you're seeing.

Right now, only ChatGPT offers a full multimodal Live Mode for all paying customers. It’s the little icon all the way to the right of the prompt bar (ChatGPT is full of little icons). But Google has already demonstrated a Live Mode for its Gemini model, and I expect we will see others soon.

Reasoning

For those watching the AI space, by far the most important advance of the last few months has been the development of reasoning models. As I explained in my post about o1, it turns out that if you let an AI “think” about a problem before answering, you get better results. Generally, the longer the model thinks, the better the outcome. Behind the scenes, it is cranking through a whole thought process you never see; you only get the final answer. Interestingly, when you peek behind that curtain, you find these AIs think in ways that feel eerily human:

The thinking process is really worth reading; it is kind of charming.

That was the thinking process of DeepSeek r1, one of only a few reasoning models that have been released to the public. It is also an unusual model in many ways: it is an excellent model from China1; it is open source, so anyone can download and modify it; and it is cheap to run (and is currently offered for free by its parent company, DeepSeek). Google also offers a reasoning version of its Gemini 2.0 Flash. However, the most capable reasoning models right now are OpenAI’s o-series. These are confusingly named, but, in order of capability, they are o1-mini, o3-mini, o3-mini-high, o1, and o1-pro (OpenAI could not get the rights to the o2 name, making things even more baffling).

Reasoning models aren’t chatty assistants – they’re more like scholars. You’ll ask a question, wait while they ‘think’ (sometimes minutes!), and get an answer. You want to make sure that the question you give them is very clear and has all the context they need. For very hard questions, especially in academic research, math, or computer science, you will want to use a reasoning model. Otherwise, a standard chat model is fine.

Web Access and Research

Not all AIs can access the web and search for information beyond their original training data. Currently, Gemini, Grok, DeepSeek, Copilot, and ChatGPT can search the web actively, while Claude cannot. This capability makes a huge difference when you need current information, but not all models use their internet connections fully, so you will still need to fact-check what they tell you.

Two systems, Google’s Gemini and OpenAI’s ChatGPT, go far beyond simple internet access and offer a “Deep Research” option, which I discuss in more detail in this post. OpenAI’s version is more like a PhD analyst who looks at relatively few sources yet assembles a strikingly sophisticated report, while Gemini’s approach is more like a summary of the open web on a topic.

Generates Images

Most of the LLMs that generate images do so by actually using a separate image generation tool. They do not have direct control over what that tool does; they just send a prompt to it and then show you the picture that results. That is changing with multimodal image creation, which lets the AI directly control the images it makes. For right now, Gemini's Imagen 3 leads the pack, but honestly? They'll all handle your basic “otter holding a sign saying 'This is ____' as it sits on a pink unicorn float in the middle of a pool” just fine.

Executes Code and Does Data Analysis

All AIs are pretty good at writing code, but only a few models (mostly Claude and ChatGPT, but also Gemini to a lesser extent) can execute that code directly. That unlocks a lot of exciting possibilities. For example, this is the result of telling o1, using the Canvas feature (which you need to turn on by typing /canvas): “create an interactive tool that visually shows me how correlation works, and why correlation alone is not a great descriptor of the underlying data in many cases. make it accessible to non-math people and highly interactive and engaging”
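If you want the statistical point behind that prompt without the interactive tool, here is a minimal Python sketch (my own illustration, not the code the model produced) of why a correlation coefficient alone can mislead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Case 1: y is completely determined by x, but the relationship is
# nonlinear, so the Pearson correlation comes out near zero.
x = np.linspace(-3, 3, 201)
y = x ** 2
print(round(np.corrcoef(x, y)[0, 1], 3))   # ~0.0

# Case 2: the bulk of the points are unrelated noise, but one extreme
# outlier drags the correlation up toward 1.
a = np.append(rng.random(50), 100.0)
b = np.append(rng.random(50), 100.0)
print(round(np.corrcoef(a, b)[0, 1], 3))   # ~1.0
```

In the first case a perfect (but nonlinear) relationship gets a correlation of roughly zero; in the second, essentially no relationship gets a correlation near one. That is the lesson the interactive version tries to make visual.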

Further, when models can code and use external files, they are capable of doing data analysis. Want to analyze a dataset? ChatGPT's Code Interpreter will do the best job on statistical analyses, Claude does less statistics but often is best at interpretation, and Gemini tends to focus on graphing. None of them are great with Excel files full of formulas and tabs yet, but they do a good job with structured data.

Claude’s data analysis is not as sophisticated as ChatGPT’s, but it is very good at an “intuitive” understanding of data and what it means.

Reads documents, sees images, sees video

It is very useful for your AI to take in data from the outside world. Almost all of the major AIs can process images, and they can often infer a huge amount from a picture. Far fewer can handle video (which is actually processed as a series of still images, at roughly one frame every second or two). Right now only Google’s Gemini can analyze uploaded video, though ChatGPT can see live video in Live Mode.
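To make that frame-rate point concrete, here is a rough sketch of how you might downsample a video to one frame per second yourself, using OpenCV (this is only an illustration of the general idea; the labs do not document their exact pipelines):

```python
import cv2  # pip install opencv-python

def sample_frames(path: str, seconds_per_frame: float = 1.0):
    """Keep roughly one frame per second of video, discarding the rest."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0      # fall back if metadata is missing
    step = max(1, round(fps * seconds_per_frame))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            frames.append(frame)                 # this frame is what the model "sees"
        i += 1
    cap.release()
    return frames

# frames = sample_frames("my_bookshelf_tour.mp4")   # hypothetical file
```

Sampling this sparsely is also why fast motion or quick cuts can get lost when an AI watches a video.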

Given the first photo Claude guesses where I am. Given the second it identifies the type of plane. These aren’t obvious.

And, while all the AI models can work with documents, they aren’t equally good at all formats. Gemini, GPT-4o (but not o3), and Claude can process PDFs with images and charts, while DeepSeek can only read the text. No model is particularly good at Excel or PowerPoint (though Microsoft Copilot does a bit better here, as you might expect), though that will change soon. The different models also have different amounts of memory (“context windows”), with Gemini having by far the most: up to 2 million tokens, or very roughly 1.5 million words, at once.
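As an aside, context windows are counted in tokens rather than words, and every lab tokenizes a little differently. If you want a rough sense of how much of a window a document would take up, OpenAI’s open-source tiktoken library (used here purely as an illustration) gives a serviceable estimate:

```python
import tiktoken  # pip install tiktoken

text = open("long_report.txt").read()        # hypothetical document
enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by GPT-4-era models
n_tokens = len(enc.encode(text))
print(f"{len(text.split()):,} words -> {n_tokens:,} tokens")
# English prose tends to run around 0.75 words per token, so a
# 2-million-token window holds very roughly 1.5 million words.
```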

Privacy and other factors

A year ago, privacy was a major concern when choosing an AI model. The early versions of these systems would save your chats and use them to improve their models. That has changed dramatically. Every major provider (except DeepSeek) now offers some form of privacy-focused mode: ChatGPT lets you opt out of training, and both Claude and Gemini say they will not train on your data. The exception is if you're handling truly sensitive data like medical records – in those cases, you'll still want to look into the enterprise versions of these tools, which offer additional security guarantees and meet regulatory requirements.

Each platform offers different ways to customize the AI for your use cases. ChatGPT lets you create custom GPTs tailored to specific tasks and includes an optional feature to remember facts from previous conversations, Gemini integrates with your Google workspace, and Claude has custom styles and projects.

Which AI should you use?

As you can see, there are lots of features to pick from, and, on top of that, there is the issue of “vibes” - each model has its own personality and way of working, almost like a person. If you happen to like the personality of a particular AI, you may be willing to put up with fewer features or less capability. You can try out the free versions of multiple AIs to get a sense of that. That said, for most people, you probably want to pick among the paid versions of ChatGPT, Claude, or Gemini.

ChatGPT currently has the best Live Mode in its Advanced Voice Mode. The other big advantage of ChatGPT is that it does everything, often in somewhat confusing ways: OpenAI has AI models specialized in hard problems (the o1/o3 series) and models for chat (GPT-4o); some models can write and run complex software programs (though it is hard to know which); there are research agents; there are systems that remember past interactions, scheduling features, movie-making tools, and early software agents. It can be a lot, but it gives you the opportunity to experiment with many different AI capabilities. It is also worth noting that ChatGPT offers a $200/month tier, whose main advantage is access to its most powerful reasoning models.

Gemini does not yet have as good a Live Mode, but that is supposed to be coming soon. For now, Gemini’s advantages are a family of powerful models (including reasoners), very good integration with search, and a pretty easy-to-use interface, as you might expect from Google. It also has top-flight image and video generation, and its Deep Research feature, which I wrote about at length in my last post, is excellent.

Claude has the smallest number of features of any of these three systems, and really only has one model you care about - Claude 3.5 Sonnet. But Sonnet is very, very good. It often seems to be clever and insightful in ways that the other models are not. A lot of people end up using Claude as their primary model as a result, even though it is not as feature rich.

While it is new, you might also consider DeepSeek if you want a very good all-around model with excellent reasoning. Because it is an open model, you can use it either hosted on the original Chinese DeepSeek site or through a number of other providers. If you subscribe to X, you get Grok for free, and the team at X.ai is scaling up capabilities quickly, with a soon-to-be-released new model, Grok 3, promising to be the largest model ever trained. And if you have Copilot, you can use that; it includes a mix of Microsoft and OpenAI models, though I find the lack of transparency about which model it is using at any given time somewhat confusing. There are also many services, like Poe, that offer access to multiple models at the same time, if you want to experiment.

In the time it took you to read this guide, a new AI capability probably launched and two others got major upgrades. But don't let that paralyze you. The secret isn't waiting for the perfect AI - it's diving in and discovering what these tools can actually accomplish. Jump in, get your hands dirty, and find what clicks. It will help you understand where AI can help you, where it can’t, and what is coming next.


1. The fact that it is a Chinese model is interesting in many ways, including the fact that this is the first non-US model to reach near the top of the AI ranking leaderboards. The quality of the model when it was released last week came as a surprise to many people in the AI space, causing a tremendous amount of discussion. Its origin also means that it tends to echo the official Chinese position on a variety of political topics. (Since the model itself is open, it is very likely that modified versions of the original will be released soon and hosted by other providers.)