
Working with LLMs: A Few Lessons

2025-05-07 13:46:12

An interesting part of working with LLMs is that you get to see a lot of people trying to work with them, inside companies both small and large, and fall prey to entirely new sets of problems. Turns out using them well isn’t just a matter of knowhow or even interest, but requires unlearning some tough lessons. So I figured I’d jot down a few observations. Here we go, starting with the hardest one, which is:

Perfect verifiability doesn’t exist

LLMs are inherently probabilistic. No matter how much you might want it, there is no perfect verifiability of what they produce. Instead, what’s needed is to find ways to deal with the fact that occasionally they will get things wrong.

This is unlike the code we’re used to running. It’s also why using an LLM can be so cool: it can do things ordinary software can’t. But the cost of being able to read and understand badly phrased natural-language questions is that it’s also liable to go off the rails occasionally.

This is true whether you’re asking the LLM to answer questions from context, like RAG, or asking it to write Python, or asking it to use tools. It doesn’t matter, perfect verifiability doesn’t exist. This means you have to add evaluation frameworks and human-in-the-loop processes, design for graceful failure, use LLMs for probabilistic guidance rather than deterministic answers, or all of the above, and hope they catch most of what you care about, while knowing things will still slip through.
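To make that concrete, here is a minimal sketch of what “designing for graceful failure” can look like: validate the model’s output and fall back to a human queue when it doesn’t check out. The `call_llm` stub, the JSON schema, and the threshold are all hypothetical placeholders, not any particular vendor’s API.

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for whatever model client you actually use.
    return '{"category": "refund", "confidence": 0.62}'

def classify_ticket(ticket: str, threshold: float = 0.8) -> dict:
    """Classify a support ticket, but degrade gracefully instead of trusting blindly."""
    raw = call_llm(f"Classify this ticket as JSON with 'category' and 'confidence': {ticket}")
    try:
        parsed = json.loads(raw)
        category, confidence = parsed["category"], float(parsed["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Unparseable output: don't guess, route to a person.
        return {"route": "human_review", "reason": "unparseable output"}
    if confidence < threshold:
        # Low confidence: keep the draft, but let a human make the call.
        return {"route": "human_review", "reason": "low confidence", "draft": category}
    return {"route": "automated", "category": category}

print(classify_ticket("I was charged twice for my subscription."))
```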


There is a Pareto frontier

Now, you can mitigate problems by adding more LLMs into the equation. This, however, also introduces the problem of increased hallucination, or new forms of errors from LLMs Chinese-whispering to each other. This isn’t new. Shannon said “Information is the resolution of uncertainty,” and his feedback channels, like the fuzzy-logic controllers in 1980s washing machines, accepted uncertainty and wrapped control loops around it.

They look like software, but they act like people. And just like people, you can’t just hire someone and pop them on a seat, you have to train them. And create systems around them to make the outputs verifiable.

Which means there’s a pareto frontier of the number of LLM calls you’ll need ot make for verification and the error-rate each LLM introduces. Practically this has to be learnt, usually painfully, for the task at hand. This is because LLMs are not equally good at every task, or even equally good at tasks that seem arbitrarily similar to each other for us humans.

This creates an asymmetric trust problem, especially since you can’t verify everything. What it needs is a new way to think about “how should we accomplish [X] goal” rather than “how can we automate [X] process”.

[Figure: an example frontier, based on some real numbers]

Which means, annoyingly:

There is no substitute for trial and error

Unlike with traditional software, there is no way to get better at using AI than by using AI. There is no perfect software that will solve your problems without you engaging with it. The reason this feels a bit alien is that while this was also somewhat true for B2B SaaS, the people who had to “reconfigure” themselves were often technologists, and while they grumbled, that was seen as the price of doing business. This time it isn’t just technologists. It’s product managers, designers, even end-users who need to adapt their expectations and workflows.

My friend Matt Clifford says there are no AI-shaped holes in the world. What this means is that there are no problems for which simply “slot in AI” is the answer. You’d have to rejigger the way the entire organisation works. That’s hard. That’s the kind of thing that makes middle managers sweat and break out in hives.

This, by the way, is partly why, even though every SaaS company in the world has “added AI”, none of them has “won”. The success of this technology comes when people start using it and build solutions around its unique strengths and weaknesses.

Which also means:

There is limited predictability of development

It’s really hard, if not impossible, to have a clear prediction on what will work and when. Getting to 80% reliability is easy. 90% is hard but possible, and beyond that is a crapshoot, depending on what you’re doing, if you have data, systems around to check the work, technical and managerial setups to help error correct, and more.

Traditionally, with software you could more or less make plans, even though development was notorious for being unpredictable. Now add in the fact that training the LLMs themselves is an unreliable process. The data mix, the methods, the sequence of those methods, the scaffolding you build around the LLMs you trained, the way you prompt: they all directly affect whether you’ll be successful.

Note what this means for anyone working in management. Senior management, of course, will be more comfortable taking this leap. Junior folks would love the opportunity to play with the newest tech. For everyone else, this needs a leap of faith: to keep developing things until they work. If your job requires you to convince the people below you to use something, and the people above you that it will work perfectly, you’re in trouble. You can’t predict or plan, not easily.

Therefore:

You can’t build for the future

This also means that building future-proof tech is almost impossible. Yes, some or much of your code will become obsolete in a few months. Yes, new models might incorporate some of the functionality you created. Some of them will break existing functionality. It’s a constant Red Queen’s race. Interfaces ossify, models churn.

This means you also can’t plan multiple quarters out. That will go the way of agile or scrum or whatever you want to use. If you’re not ready to ship a version quickly, and by quickly I mean in weeks, nothing will happen for months. An extraordinary amount of work is going to be needed to manage context, make things more reliable, and add all manner of compliance and governance.

And even with all of that, whether your super-secret proprietary data is useful or not is really hard to tell. The best way to tell is actually just to try.

Mostly, the way to make sure you have the skills and the people to jump into a longer-term project is to build many things. Repeatedly. Until those who would build it have enough muscle memory to be able to do more complicated projects.

And:

If it works, your economics will change dramatically

And if you do all the above, the economics of your LLM deployment will look dramatically different from those of traditionally built software. The costs are backloaded.

Bill Gates said: “The wonderful thing about information-technology products is that they’re scale-economic: once you’ve done all that R & D, your ability to get them out to additional users is very, very inexpensive. Software is slightly better in this regard because it’s almost zero marginal cost.”

This means that a lot of what one might consider below-the-line cost becomes above-the-line. Unlike the business of software Bill Gates described, success here will strain profit margins, especially as Jevons paradox increases demand and increasing competition squeezes the margin on inference.

Pricing has to drop from seat-based to usage-based, since that’s also how the costs stack up. But, for instance, the reliability threshold becomes a death knell if it drives user churn. Overshoot the capacity plan and you eat idle silicon depreciation. Model performance gains therefore have real-options value: better weights let you defer capex or capture more traffic without rewriting the stack.
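A back-of-the-envelope sketch of why the margin maths changes under usage-based pricing with inference as COGS; every number here is invented purely to show the shape of the calculation.

```python
# Hypothetical per-user monthly economics for an LLM-backed product.
price_per_1k_requests = 20.0   # usage-based revenue, assumed
inference_cost_per_1k = 8.0    # metered model/compute cost (COGS), assumed
requests_per_user = 3_000

revenue = price_per_1k_requests * requests_per_user / 1_000
cogs = inference_cost_per_1k * requests_per_user / 1_000
gross_margin = (revenue - cogs) / revenue
print(f"Revenue ${revenue:.2f}, inference COGS ${cogs:.2f}, gross margin {gross_margin:.0%}")

# Unlike zero-marginal-cost software, every extra unit of usage drags COGS up with it;
# a model that gets cheaper or better directly widens this margin.
```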

Software eating the world was predicated on zero marginal cost. Cognition eating software brings back a metered bill. The firms that thrive will treat compute as COGS, UX as moat, and rapid iteration as life support. Everyone else will discover that “AI shaped holes” can also be money pits: expensive, probabilistic, and mercilessly competitive.


Deplatforming: AI edition

2025-04-26 04:30:48

So yesterday I got this email, a few seconds after my o3 queries got downgraded to gpt-4o-mini (I noticed when the answers got worse). Then it stopped entirely. Then my API calls died. Then my open chat windows hung. And the history was gone.

I have no idea why this happened. I’ve asked people I know inside OpenAI; they don’t know either. Might be an API key leak, but I’d deleted all my keys a couple of days ago, so it shouldn’t have been an issue. Could be multiple-device use (I have a couple of laptops and phones). Might be asking simultaneous queries, which again doesn’t seem like much of an issue?

Could it be the content? I doubt it, unless OpenAI hates writing PRDs or vibe coding. I can empathise. Most of my queries are hardly anything that goes anywhere near triggering things[1]. I mean, this is what ChatGPT seemed to find interesting amongst my questions.

Or this.

We had a lot of conversations about deplatforming from around 2020 to 2024, when the vibes changed. That was in the context of social media, where there were bans both legitimate (people knew why) and illegitimate (nobody knew why), and where bans were govt “encouraged”.

That was rightly seen as problematic. If you don’t know why something happened, then it’s the act of a capricious algorithmic god. And humans hate capriciousness. We created entire pantheons to try and make ourselves feel better about this. But you cannot fight an injustice you do not understand.

In the future, or even in the present, we’re at a different level of the same problem. We have extremely powerful AIs which are so smart the builders think they will soon disempower humanity as a whole, but where people are still getting locked out of their accounts for no explainable reason. If you are truly building God, you should at least not kick people out of the temples without reason.

And I can’t help wonder, if this were entirely done with an LLM, if OpenAI’s policies were enforced by o3, Dario might think of this as a version of the interpretability problem. I think so too, albeit without the LLMs. Our institutions and policies are set up opaquely enough that we do have an interpretability crisis.

This crisis is also what made many, if not most, of us angry at the world, throwing the baby out with the bathwater when we decried the FDA and the universities and the NIH and the IRS and NASA and so on. Because they seemed unaccountable. They seemed to have generated so much Kafkaesque drama internally that the workings are not exposed even to those who work within the system.

It’s only been a day. I have moved my shortcuts to Gemini and Claude and Grok to replace my work. And of course this is not a complex case and hopefully will get resolved. I know some people in the lab and maybe they can help me out. They did once before when an API key got leaked.

But it’s also a case where I still don’t know what actually happened, because it’s not said anywhere. Nobody will, or can, tell me anything. It’s “You Can Just Do Things” but the organisational Kafka edition. All I know is that my history of conversations, my Projects, are all gone. I have felt like this before. In 2006 when I wiped by hard drive by accident. In 2018 when I screwed up my OneDrive backup. But at least those cases were my fault.

In most walks of life we imagine that the systems that surround us are somewhat predictable. The breakdown of order in the macroeconomic sense we see today (April 2025) is partly because those norms and that predictability of rules have broken down. When systems are no longer predictable, or are seen as capricious, we move to a fragile world. A world where we cannot rely on the systems to protect us, but have to rely on ourselves or trusted third parties. We live in fear of falling into the interstitial gaps where the various organisations are happy to let us fester, unless we muster up the voice to speak up and the clout to get heard.

You could imagine a new world where this is rampant. That would be a world where you have to focus on decentralised ownership. You want open-source models run on your own hardware. You back up your data obsessively, both yourself and across multiple providers. You can keep going down that road and end up wanting entirely decentralised money. Many have taken that exact path.

I’m not saying this is right, by the way. I’m suggesting that when the world inevitably moves towards incorporating even more AI into more of its decision making functions, the problems like this are inevitable. And they are extremely important to solve, because otherwise the trust in the entire system disappears.

If we are moving towards a world where AI is extremely important, if OpenAI is truly AGI, then getting deplatformed from it is a death knell, as a commenter wrote.

One weird hypothesis I have is that the reason I got wrongly swept up in some weird check is that OpenAI does not use LLMs to do this. They, same as any other company, rely on various rules, some ad hoc and some machine-learnt, that are applied with a broad brush.

If they had LLMs doing this, as we surely will have in the near future, they would likely have been much smarter about figuring out when to apply which rules to whom. And if they don’t have LLMs doing this already, why not? Do you need help building it? I am, as they say, motivated. It would have to be a better system than what we have now, where people are left to fall into the chinks in the armour, just to see who can climb out.

Or so I hope.


[I would’ve type checked and edited this post more but, as I said, I don’t have my ChatGPT right now]

[1] Unless repeatedly asking why my back hurts counts as causing trauma?

Vibe Governing

2025-04-07 22:29:33

It looks like the US has started a trade war. This is obviously pretty bad, as the markets have proven: 15% down in three trading sessions is a historic crash, and it’s not slowing. There is plenty of Econ 101 commentary out there about why this is bad, from people across every aisle you can think of, but it’s actually pretty simple and not that hard to unravel anyway, so I was more interested in a different question: why, and by what process, this decision was taken in the first place.

And thinking about what caused this to happen, even after Trump’s rhetoric all through the campaign trail, which folks like Bill Ackman didn’t believe to be true, it seemed silly enough that there had to be some explanation.


Even before the formula used to calculate the tariffs was published, showing how they actually analysed this, I wondered if it was something gotten from an LLM. It had that aura of extreme confidence in a hypothesis for a government plan. It’s also something you can test. If you ask the question “What would be an easy way to calculate the tariffs that should be imposed on other countries so that the US is on even-playing fields when it comes to trade deficit? Set minimum at 10%.” of any of the major LLMs, they all give remarkably similar answers.

The answer is, of course, wrong. Very wrong. Perhaps pre-Adam Smith wrong.
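For concreteness, the calculation the LLMs converge on, and which matches the shape of the published methodology as widely reported, is roughly the bilateral trade deficit divided by imports, halved, with a 10% floor. A sketch, with made-up trade figures:

```python
def naive_reciprocal_tariff(imports: float, exports: float, floor: float = 0.10) -> float:
    """The deficit-ratio heuristic: tariff ~ (imports - exports) / imports, halved, floored.

    This is the kind of answer the LLMs return to the question quoted above;
    it is not an economically sound way to set tariffs.
    """
    deficit_ratio = max(0.0, (imports - exports) / imports)
    return max(floor, deficit_ratio / 2)

# Made-up bilateral trade figures, purely illustrative.
print(f"{naive_reciprocal_tariff(imports=100.0, exports=40.0):.0%}")   # big deficit -> 30%
print(f"{naive_reciprocal_tariff(imports=100.0, exports=95.0):.0%}")   # near balance -> 10% floor
```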

This is “Vibe Governing”.

The idea that you could just ask for an answer to any question and take the output and run with it. And, as a side effect, wipe out trillions from the world economy.

A while back, when I wrote about the potential negative scenarios of AI, ones that could actually happen, I said two things. One was that once AI is integrated into the daily lives of a large number of organisations, it could interact in complex ways and create various Black Monday type scenarios (when, due to automated trading, the market fell sharply). The other was that reliance on AI would make people take its answers at face value, sort of like taking Wikipedia as gospel.

But in the aggregate, the latter is good. It’s better that they asked the LLMs, because the LLMs gave pretty good answers even to really bad questions. They tried to steer the reader away from the recommended formula, noted the problems inherent in that application, and explained in exhaustive detail the mistaken assumptions inside it.

When I dug into why the LLMs give this answer even though it’s obviously wrong from an economics point of view, it seemed to come down to data. First of all, asking about tariff percentages based on “putting the US on an even footing when it comes to trade deficit” is a weird thing to ask. It seems to have come from Peter Navarro’s 2011 book Death by China.

The Wikipedia analogy is even more true here. You have bad inputs somewhere in the model, you will get bad outputs.

LLMs, to their credit, and unlike Wikipedia, do try immensely hard to not parrot this blindly as the answer and give all sorts of nuances on how this might be wrong.

Which means two things.

  1. A world where more people rely on asking such questions would be a better world because it would give a more informed baseline, especially if people read the entire response.

  2. Asking the right questions is exceedingly important to get better answers.

The question is once we learn to ask questions a bit better, like when we learnt to Google better, whether this reliance would mean we have a much better baseline to stand on top of before creating policies. The trouble is that LLMs are consensus machines, and sometimes the consensus is wrong. But quite often the consensus is true!

So maybe we have easier ways to create less flawed policies especially if the writing up of those policies is outsourced to a chatbot. And perhaps we won’t be so burdened by idiotic ideas when things are so easily LLM-checkable?

On the other hand Google’s existed for a quarter century and people still spread lies on the internet, so maybe it’s not a panacea after all.

However, if you did want to be concerned, there are at least two reasons:

  1. Data poisoning is real, and will affect the answers to questions if posed just so!

  2. People seem overwhelmingly ready to “trust the computer” even at this stage

The administration thus far has been remarkably ready to use AI.

  • They used it to write Executive Orders.

  • The “research” paper that seemed to underpin the three-word formula seems like a Deep Research output.

  • The tariffs on Nauru and so on show that they probably used LLMs to parse some list of domains or data to set them, which is why we’re setting tariffs on penguins (yes, really).

While they’ve so far been using it with limited skill, and playing fast and loose with both laws and norms, I think this perhaps is the largest spark of “good” I’ve seen so far. Because if the Federal Govt can embrace AI and actually harness it, perhaps the AI adoption curve won’t be so flat for so long after all, and maybe the future GDP growth that was promised can materialise, along with government efficiency.

Last week I had written that, with AI becoming good, vibe coding was the inevitable future. If that’s the case for technical work, which it is slowly starting to be and seems more and more likely to be in the future, then the other parts of the world can’t be far behind!

Whether we want to admit it or not, Vibe Governing is the future. We will end up relying on these oracles to decide what to do, and to help us do it. We will get better at interpreting the signs and asking questions the right way. We will end up being less wrong. But what this whole episode shows is that the base of knowledge we now have access to means that doing things the old-fashioned way is not going to last all that long.


If AGI is the future, vibe coding is what we should all be doing

2025-03-27 04:28:38

So, the consensus in many parts of the AI world is that we'll all be living in fully automated luxury space communism by the time my kids finish high school. Not only will certain tasks be automated, entire jobs will be, and not just the jobs we have but every job we could ever think of. “Entire data centers filled with Nobel prize winners,” as Dario Amodei, co-founder of Anthropic, put it.

Now look, you might not believe it; most people don't, after all it sounds kind of crazy. Even most investors who invest in this space don't believe it, as you can easily see from their investments. Nobody is really betting on employees becoming almost too cheap to meter, or thinking beyond “what if the future is like today, but with more automation”.


But you don't need to believe that the entire thing will necessarily happen, all or nothing, for you to think that the world of work will change quite a bit in the coming years.

If you believe this is coming, or even if it doesn't quite hit the metric of automating “every job that can be done with a computer”, then surely writing code by hand is as archaic as calculating ballistic missile trajectories by hand. “Computer” used to be a job; now it's just a machine.

And soon, we would be able to merely ask for something, and the entire apparatus would click into gear and things would just get made. In fact that’s what many have been doing since ChatGPT came out a couple years ago. Ask for pieces of code, copy paste it into your favourite IDE, and press run. Voila, you have yourself a running python script or a website or an app.

Andrej Karpathy recently named this phenomenon, asking the models to do things and accepting what they produce more or less as-is, “vibe coding”. As all fantasy novels have taught us, naming is incredibly powerful, so the term has completely taken off. Vibe is, of course, the word of the century at this point.

Most of the arguments against doing this were basically different versions of saying that whoever does this will get burned, because the models do not yet know how to do things very well. People leaked login credentials and got laughed at.

If you want to know what the world will look like once this is the norm, we recently had a glimpse. It got particularly clear when Pieter Levels created a simple, very simple, flying game, which became massively popular. How popular? It's barely been a month and he's already making $100,000 a month from it. Not just because the game is really fun, after all virtual dogfighting games have been a dime a dozen over the years, but because he created it almost entirely with vibe coding. Just him, Claude, and Cursor.

This made game developers, especially professional game developers, really angry. Understandably so, because they had to work really, really hard to make something a thousand times as good, and they don't make a fraction of the money he's making. He developed the game in public, each new feature added either the same day or the next, and the features are as simple as “hey, you can now pilot a triangle around instead of a plane” and “do you want your name on the side of a blimp”.

Right now Pieter needs to be smart enough to fix things quite a lot but presumably the idea is that very soon he would not need to be smart at all, or at least not conversant with code.

(I also think this should be an existing benchmark we all see regularly to see how well new models do arbitrarily complex things. The measure is “how good a game can a non-coder make with this”.)

And people who are very conversant with code get leverage. Some are writing thousands of lines of code a day by managing multiple Claude Code agents, each of which (whom?) is working on one particular feature and will submit a PR for the author to review, effectively automating large chunks of software development. This isn't tomorrow, it's happening now. Today.

Yeah yeah it's still slow and expensive and error prone and hallucinates and sometimes tries to change the test to pass it and can't do long code bases and … But still. You get to type in a thing you want or point to a JIRA ticket and bam! Almost working code!

It's got vast limitations still. It cannot work on super large code bases, it hallucinates or makes mistakes, and it's sometimes so eager to pass a test that it will try to hard-code the answer or find an end run around it. But still, a couple of years ago it could barely write a Python script that was right…

Whether you want to or not, you're gonna change from being an Individual Contributor to a Manager. The only question is what you manage and how tiresome they are to manage.

What this means for work is chaos. Almost everyone will have “part-time AI trainer” in their job description, that's for sure. Every company will have its AI employees outnumber its human employees. There will be a dislocation, arguably it’s started happening now, where hiring for PMs and sales and marketing and engineers has already subsided. And guess what: this mostly won't matter much, because humans are already outnumbered by any number of things and we just grow the organisations or make more work to compensate.

It also means individual productivity will be a function of how much inference you can “suck” from the models. There's a capex component if you want to host or train the models. But there's also an opex component. I racked up $6-15 per hour when I was messing with Claude Code. I'm not the most productive engineer, so the actual figure would have to be higher than that. And even higher if you have multiple agents running at a time, as you should, and as some already do.

Since I wrote this Steve Yegge wrote an excellent post that said something similar.

Running N agents at once multiplies your devs' daily agent spend of $10/day by N, not counting cloud costs, just token burn. If your developers are each on average running, say, five agents at once – a very conservative number, since agents will work mostly independently, leaving the dev free to do other stuff – then those devs are each now spending $50/hr, or roughly $100k/year.

It’s not really a steal anymore, so much as a heist. We’re talking about each developer gradually boosting their productivity by a multiplier of ~5x by Q4 2025 (allowing for ramp-up time), for an additional amortized cost of only maybe $50k/year the first year. Who wouldn’t go for that deal?

Unfortunately, you almost certainly didn't include $50k/year per developer of LLM spend in your 2026 operating budget.
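The arithmetic in that quote only hangs together if the $10 per agent is an hourly figure (5 agents at $10/hr is $50/hr, which over a roughly 2,000-hour working year is about $100k), so the sketch below treats it that way; the numbers are the quote's own assumptions.

```python
agent_cost_per_hour = 10   # assumed spend to keep one agent running, $/hr
agents_per_dev = 5         # the "very conservative" number from the quote
hours_per_year = 2_000     # roughly a full working year

hourly_spend = agent_cost_per_hour * agents_per_dev
annual_spend = hourly_spend * hours_per_year
print(f"${hourly_spend}/hr per developer, ~${annual_spend:,}/year")  # $50/hr, ~$100,000/year
```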

And this goes so much further once you have agents which can run the Claude Code coding agents: to do your PR review, check that they ran the unit tests properly, check whether the change makes sense, and do the other things you'd otherwise end up doing yourself. This too won't be perfect, but it will slowly get better. And when I say slowly, I mean week by week, not year by year.

This is not just true of coding; it's true of an extraordinarily large percentage of white-collar work. Coding just happens to be the obsession for the models right now, mostly because the people developing them are obsessed with coding. Sort of like how Silicon Valley cannot stop creating new devops startups.

Any job which has sufficient data to train on, which is almost every job, and has decent ways of telling if something is right or wrong, which is also a large enough number of those jobs, can't help but transform. Can you imagine the same principle being applied to finance? Literature review? Report writing? PRDs? Contract writing?

It's already started. Every forward-leaning tech company is already doing this. From Stripe to big tech to Linear to every startup, especially from the YC cohort, it's absolutely dominating every hiring decision. Not just coding, though that's the leading one, but also marketing and sales and operations and product management, and even lawyers and compliance.

Or even infographic creators.

This is the future that’s being built right now. The world of work has already transformed. We're much wealthier than three decades ago, programmers do vastly different tasks helped along by automation and the congealed remnants of decisions made firm from the eras gone past. Consultants today, for instance, do on their first day what would've taken an entire three month project in the 90s. And yet, consultants have grown as an industry. Most jobs are like this.

Even if the AI train comes to a crashing halt for some reason, this trend will only slow, not stop. Even if this is where it stops, the world of work will be transformed. And if you believe that AI will get better, which seems true whether that's 2x or 200x, the only way to do it well is to use it.


In defense of Gemini

2025-03-05 21:57:43

AI has completely entered its product arc. The models are getting better, but they’re not curios any longer; we’re using them for stuff. And all day, digitally and physically, I am surrounded by people who use AI for everything: from those who have it completely integrated into every facet of their daily workflow, to those who spend their time basically chatting with it or using it only from the command line, and everything in between. People who spend most of their time either playing around with or learning about whichever model just dropped.

Despite this I find myself almost constantly the only person who actually seems to have played with, or is interested in, or even likes Gemini. Apart from a few friends who actually work on Gemini, though even they look apologetic most of the time, and give an I-can’t-quite-believe-it slight smile the handful of other times, when they realise you’re not about to yell at them about it.

This seemed really weird, especially for a top-notch research organisation with infinite money, and it annoys me, so I thought I would write down why. I like big models, models that “smell” good, and seeing one hobbled is annoying. We need better AI, and the way we get it requires strong competition. This is also a case study, but mostly it’s a kvetch.


Right now we only have Hirschman’s Exit; it’s time for Voice. I’m writing this out of a sense of frustration because a) you should not hobble good models, and b) as AI is undeniably moving to a product-first era, I want more competition in the AI product world, to push the boundaries.

Gemini, ever since it was released, has felt consistently like a really good model smothered under all sorts of extra baggage. Whether it’s strong system prompts or a complete lack of proper tool calling (it still sometimes says it doesn’t know if it can read attachments!), it feels like a WIP product even when it isn’t. It seems to be just good enough that you can try to use it, but not good enough that you want to use it. People talk about a big-model smell; similarly, there is a big-bureaucracy smell, and Gemini has both.

Now, despite the annoying tone which feels like a know-it-all who has exclusively read Wikipedia, I constantly find myself using it whenever I want to analyze a whole code base, or analyse a bunch of pdfs, or if I need to have a memo read and rewritten, and especially if I need any kind of multimodal interaction with video or audio. (Gemini is the only other model that I have found offers good enough suggestions for memos, same as or exceeding Claude, even though the suggestions are almost always couched in the form of exceedingly boring looking bullet points and a tone I can only describe as corporate cheerleader.)

Gemini also has the deepest bench of interesting ideas that I can think of. It had the longest context, multimodal interactivity, the ability to keep looking at your desktop while you chat to it, NotebookLM, the ability to export documents directly into Google Drive, the first deep research, LearnLM for actually learning things, and probably ten more things that I'm forgetting but were equally breakthrough, that nobody uses.

Oh and Gemini live, the ability to talk live with an AI.

And the first reasoning model with visible thinking chain of thought traces in Gemini flash 2.0 thinking.

And even outside this, Google is chock full of extraordinary assets that would be useful for Gemini. Hey Google is one example that seems to have helped a little bit. But also Google itself to provide search grounding, the best search engine known to the planet. Google Scholar. Even News, a clear way to provide real-time updates and a proper competitor to X. They had Google Podcasts, which they shuttered, unnecessarily in my opinion, since they could easily have created a version based only on NotebookLM.

Also Colab, an existing way to write and execute Python code and Jupyter notebooks, including GPU and TPU support.

Colab even just launched a Data Science Agent, which seems interesting and similar to Advanced Data Analysis. But, true to form, it’s also in a separate user interface, on a separate website, as a separate offering. One that’s probably great for those who use it, which is overwhelmingly students and researchers (I think), one with 7 million users, and one that’s still unknown to the vast majority who interact with Gemini! But why would this sit separate? Shouldn’t it be integrated with the same place I can run code? Or write that code? Or read documents about that code?

Colab even links with your Github to pull your repos from there.

Gemini even has its Code Assist, a product I didn’t even know existed until a day ago, despite spending much (most?) of my life interacting with Google.

Gemini Code Assist completes your code as you write, and generates whole code blocks or functions on demand. Code assistance is available in many popular IDEs, such as Visual Studio Code, JetBrains IDEs (IntelliJ, PyCharm, GoLand, WebStorm, and more), Cloud Workstations, Cloud Shell Editor, and supports most programming languages, including Java, JavaScript, Python, C, C++, Go, PHP, and SQL.

It supposedly can even let you interact via natural language with BigQuery. But, at this point, if people don’t even know about it, and if Google can’t help me create an entire frontend and backend, then they’re missing the boat! (by the way, why can’t they?!)

Gemini even built the first co-scientist that actually seems to work! Somehow I forgot about this until I came across a stray tweet. It's a multi-agent system that generates scientific hypotheses through iterated debate, reasoning, and tool use.

Gemini has every single ingredient I would expect from a frontier lab shipping useful products. What it doesn’t have is smart product sense to actually try and combine it into an experience that a user or a developer would appreciate.

Just think about how much had to change, how much had to be pushed uphill, to get a better AI Studio in front of us, arguably Gemini’s crown jewel! And that was already a good product. Now think about anyone who has had to suffer through using Vertex, and these dozens of other products, each with its own KPIs and userbase and profiles.

I don’t know the internal politics or problems in making this come about, but honestly it doesn’t really matter. Most of the money comes from the same place at Google and this is an existential issue. There is no good reason why Perplexity or Grok should be able to eat even part of their lunch considering neither of them even have a half decent search engine to help!

Especially as we're moving from individual models to AI systems that work together, Google's advantages should come to the fore. I think the flashes of brilliance the models demonstrate are a good start, but man, they’ve got a lot to build.

Gemini needs a skunkworks team to bring this whole thing together. Right now it feels like disjointed geniuses putting LLMs inside anything they can see - inside Colab or Docs or Gmail or Pixel. And some of that’s fine! But unless you can have a flagship offering that shows the true power of it working together, this kind of doesn’t matter. Gemini can legitimately ship the first Agent which can go from a basic request to a production ready product with functional code, fully battle-tested, running on Google servers, with the payment and auth setup, ready for you to deploy.

Similarly, not just for coding, you should be able to go from iterative refinement of a question (a better UX would help, to navigate the multiple types of models and the mess), to writing and editing a document, to searching specific sources via Google and researching a question, to final proof reading and publishing. The advantage it has is that all the pieces already exist, whereas for OpenAI or Anthropic or Microsoft even, they still need to build most of this.

While this is Gemini specific, the broad pattern is much more general. If Agents are to become ubiquitous they have to meet users where they are. The whole purpose is to abstract the complexity away. Claude Code is my favourite example, where it thinks and performs actions and all within the same terminal window with the exact same interaction - you typing a message.

It’s the same thing that OpenAI will inevitably build, bringing their assets together. I fought this trend when they took away Code Interpreter and bundled it into the chat, but I was wrong and Sama was right. The user should not be burdened with the selection trouble.

I don’t mean to say there’s some ultimate UX that’s the be all and end all of software. But there is a better form factor to use the models we have to their fullest extent. Every single model is being used suboptimally and it has been the case for a year. Because to get the best from them is to link them across everything that we use, and to do that is hard! Humans context switch constantly, and the way models do is if you give them the context. This is so incredibly straightforward that I feel weird typing it out. Seriously, if you work on this and are confused, just ask us!

Google started with the iconic white box. Simplicity to counter the insane complexity of the web. Ironically now there is the potential to have another white box. Please go build it!


PS: A short missive. If any of y’all are interested in running your python workloads faster and want to crunch a lot of data, you should try out Bodo. 100% open source: just “pip install bodo” and try one of the examples. We’d also appreciate stars and comments!

Github: https://github.com/bodo-ai/Bodo

How would you interview an AI, to give it a job?

2025-02-28 04:05:01

Evaluating people has always been a challenge. We morph and change and grow and get new skills, which means that the evaluations have to grow with us. Which is also why most exams are bound to a supposed mean of what people in that age group or cohort should know. As you know from your schooling, this is not perfect, and it is really hard.

The same problem exists with AI. Each day you wake up, and there’s a new AI model. They are constantly getting better. And the way we know that they are getting better is that when we train them we analyze the performance across some evaluation or the other.


But real world examinations have a problem. A well understood one. They are not perfect representations of what they measure. We constantly argue whether they are under or over optimizing.

AI is no different. Even as it's gotten better generally, it's also become much harder to know what the models are particularly good at. Not a new problem, but it's still an interesting one. Enough that even OpenAI admitted that their dozen models with various names might be a tad confusing. What exactly does o3-mini-high do that o1 does not, we all wonder. For that, and for figuring out how to build better models in the future, the biggest gap remains evaluations.

Now, I’ve been trying to figure out what I can do with LLMs for a while. I talk about it often, but many times in bits and pieces. Including the challenges with evaluating them. So this time around I thought I’d write about what I learnt from doing this for a bit. The process is the reward. Bear in mind this is personal, and not meant to be an exhaustive review of all evaluations others have done. That was the other essay. It’s maybe more a journey of how the models got better over time.

The puzzle, in a sense, is simple: how do we know what a model can do, and what to use it for. Which, as the leaderboards keep changing and the models keep evolving, often without explicitly saying what changed, is very very hard!

I started like most others, thinking the point was to trick the models by giving them harder questions. Puzzles, tricks, and questions about common sense, to see what the models actually knew.

Then I understood that this doesn't scale, and that what the models do in evaluations of math tests or whatever doesn't correlate with how well they do tasks in the real world. So I collected questions across real-life work, from computational biology to manufacturing to law to economics.

This was useful in figuring out which models did better with which sorts of questions. This was extremely useful in a couple of real world settings, and a few more that were private to particular companies and industries.

To make it work better I had to figure out how to make the models work with real-life type questions. Which meant adding RAG capabilities to read documents and respond, search queries to get relevant information, and the ability to write database queries and analyse the responses.
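A minimal sketch of what that kind of harness looked like in spirit: pair each question with its context and a known-good answer, ask the model, and score mechanically (the real scoring was fuzzier, and `call_llm` here is a hypothetical stand-in for the model under test).

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the model under test.
    return "Gross margin fell from 41% in FY23 to 38% in FY24."

eval_set = [
    {
        "context": "FY23 gross margin was 41%. FY24 gross margin was 38%.",
        "question": "How did gross margin change from FY23 to FY24?",
        "expected_keywords": ["41%", "38%"],
    },
]

def run_eval(cases: list[dict]) -> float:
    passed = 0
    for case in cases:
        prompt = f"Answer using only this context:\n{case['context']}\n\nQ: {case['question']}"
        answer = call_llm(prompt)
        # Crude keyword check stands in for fuzzier real-world grading.
        passed += all(kw in answer for kw in case["expected_keywords"])
    return passed / len(cases)

print(f"pass rate: {run_eval(eval_set):.0%}")
```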

This was great, and useful. It’s also when I discovered for the first time that the Chinese models were getting pretty good - Yi was the one that did remarkably well!

This was good, but after a while the priors just took over. There was a brief open vs closed tussle, but otherwise it’s just use OpenAI.

But there was another problem. The evaluations were too… static. After all, in the real world, the types of questions you need answers to change often. Requirements shift, needs change. So the models have to be able to answer questions even when the types of questions being asked, or the “context” within which they are asked, change.

So I set up a “random perturbation” routine, to check whether a model could answer queries well even when you change the underlying context. And that worked pretty nicely to test the models’ ability to change their thinking as needed, to show some flexibility.
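A sketch of the perturbation idea: mutate the facts in the context so the correct answer moves, then check that the model tracks the change rather than its priors. The `call_llm` stub and the toy revenue question are illustrative, not the actual routine.

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in; imagine the model answering from the supplied context.
    return "400"

base_context = {"units_sold": 100, "price": 4}

def perturb(context: dict) -> dict:
    """Randomly tweak the underlying facts so the right answer changes with them."""
    new = dict(context)
    new["price"] = context["price"] + random.randint(0, 5)
    return new

def check_once() -> bool:
    ctx = perturb(base_context)
    expected = ctx["units_sold"] * ctx["price"]
    answer = call_llm(f"Revenue is units_sold * price. Context: {ctx}. What is revenue?")
    return answer.strip() == str(expected)

trials = 20
score = sum(check_once() for _ in range(trials)) / trials
print(f"robustness under perturbation: {score:.0%}")
```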

This too was useful, and it changed my view on the types of questions one could reasonably expect LLMs to tackle, at least without significant context being added each time. The problem, though, was that this was only interesting and useful insofar as you had enough “ground truth” answers to check whether the model was good enough. And while that was possible for some domains, as the number of those domains and the number of questions increased, the average “vibe check” and “just use the strongest model” rubrics easily outweighed using specific models like this.

So while they are by far the most useful way to check which model to use for what, they’re not as useful on a regular basis.

Which made me think a bit more deeply about what exactly we are testing these models for. They know a ton of things, and they can solve many puzzles, but the key issue is neither of those. The issue is that a lot of the work we do regularly isn’t of the “just solve this hard problem” variety. It’s of the “let us think through this problem, think again, notice something wrong, change an assumption, fix that, test again, ask someone, check the answer, change assumptions, and so on and on” variety, repeated an enormous number of times.

And if we want to test that, we should test it. With real, existing, problems of that nature. But to test those means you need to gather up an enormous number of real, existing problems which can be broken down into individual steps and then, more importantly, analysed!

I called it the need for iterative reasoning. Enter Loop Evals.

Much like Hofstadter’s concept of strange loops - this blog’s patron namesake, self-referential cycles that reveal deeper layers of thought - the iterative evaluation of LLMs uncovers recursive complexities that defy linear analysis. Or that was the theory.

So I started wondering what exists that has similar qualities, and is also super easy to analyse and doesn’t need a human in the loop. I ended up with word games. First I thought I wanted to test it on crosswords, which was a little hard, so I ended up with Wordle.

And then later, to Word Grids (literally grid of words which are correct horizontally and vertically). And later again, sudoku.
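Part of the appeal is that the scoring needs no human: Wordle feedback is mechanical. A minimal scorer, roughly what such an eval loops over turn by turn while the model supplies the guesses:

```python
def wordle_feedback(guess: str, target: str) -> str:
    """Per-letter feedback: G = right letter right spot, Y = right letter elsewhere, . = absent."""
    feedback = ["."] * len(target)
    remaining = list(target)
    for i, (g, t) in enumerate(zip(guess, target)):  # mark greens first
        if g == t:
            feedback[i] = "G"
            remaining.remove(g)
    for i, g in enumerate(guess):                    # then yellows, consuming remaining letters
        if feedback[i] == "." and g in remaining:
            feedback[i] = "Y"
            remaining.remove(g)
    return "".join(feedback)

print(wordle_feedback("crane", "canoe"))  # G.YYG
```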

This was so much fun! Remember, this was a year and change ago, and we didn't have these amazing models like we do today.

But OpenAI kicked ass. Anthropic and others struggled. Llama too. None, however, were good enough to solve things completely. People cry tokenisation with most of these, but the mistakes go far beyond tokenisation issues: they’re issues of common logic, catastrophic forgetting, or just insisting that things like FLLED are real words.

I still think this is a brilliant eval, though I wasn't sure what to make of its results, beyond giving an ordering for LLMs. And as you’d expect now, with the advent of better reasoning models, the eval stacks up, as the new models do much better!

Also, while I worked on it for a while, it was never quite clear how the “iterative reasoning” it analysed translated into which real-world tasks the models would be worst at.

But at this point I was kind of obsessed with why evals suck. So I started messing around with why models couldn’t figure out these iterative reasoning problems, and started looking at Conway’s Game of Life and evolutionary cellular automata.

It was fun, but not particularly fruitful. I also figured out that, if taught enough, a model could learn any pattern you threw at it, but following a hundred steps to figure something like this out is something LLMs were really bad at. It did surface a bunch of ways in which LLMs might struggle to follow longer, complex instructions that need backtracking, but it also showed that they can do it, albeit with difficulty.
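Game of Life has the same attractive property as the word games: ground truth is a few lines of code, so you can ask a model to advance a board N generations and grade it mechanically. A minimal reference simulator (the harness that prompts and parses the model is left out):

```python
from collections import Counter

def life_step(live_cells: set) -> set:
    """Advance Conway's Game of Life by one generation; cells are (row, col) tuples."""
    neighbour_counts = Counter(
        (r + dr, c + dc)
        for r, c in live_cells
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next step if it has 3 neighbours, or 2 neighbours and is already alive.
    return {cell for cell, n in neighbour_counts.items() if n == 3 or (n == 2 and cell in live_cells)}

blinker = {(1, 0), (1, 1), (1, 2)}               # horizontal bar of three cells
print(life_step(life_step(blinker)) == blinker)  # period-2 oscillator, so this prints True
```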

One might even draw an analogy to Kleene’s Fixed-Point Theorem in recursion theory, suggesting that the models’ iterative improvements are not arbitrary but converge toward a stable, optimal reasoning state. Alas.

Then I kept thinking it's only vibes based evals that matter at this point.

The problem with these evals, however, is that they end up being set against a fixed target, and the capabilities of LLMs grow much faster than the ways we can come up with to test them.

The only way to fix that, it seemed, was to see if LLMs can judge each other, and then, if the rankings they give each other can in turn be judged by each other, to create a PageRank equivalent. So I created “sloprank”.

This is interesting, because it’s probably the most scalable way I’ve found to use LLM-as-a-judge, and a way to expand the ways in which it can be used!

It is a really good way to test how LLMs think about the world and to refine that picture iteratively, but it stays within its own context. It’s more a way to judge LLMs better than an independent evaluation. Sloprank mirrors the principles of eigenvector centrality from network theory, a recursive metric where a node’s importance is defined by the significance of its connections, elegantly encapsulating the models’ self-evaluative process.
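The mechanics behind that analogy fit in a few lines: put each model’s ratings of the others into a matrix and take its principal eigenvector by power iteration, PageRank-style. The scores below are invented; the real pipeline derives them from the models grading each other’s answers.

```python
import numpy as np

models = ["model_a", "model_b", "model_c"]

# scores[i][j] = how highly model i rates model j's answers (made-up numbers).
scores = np.array([
    [0.0, 0.8, 0.3],
    [0.6, 0.0, 0.4],
    [0.9, 0.7, 0.0],
])

# Row-normalise so each judge hands out a total weight of 1, then power-iterate:
# a model ranks highly if highly-ranked judges rate it highly.
weights = scores / scores.sum(axis=1, keepdims=True)
rank = np.ones(len(models)) / len(models)
for _ in range(100):
    rank = weights.T @ rank
    rank /= rank.sum()

for name, value in sorted(zip(models, rank), key=lambda pair: -pair[1]):
    print(f"{name}: {value:.3f}")
```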

And to do that evaluation, the question became: how can you test the capabilities of LLMs by testing them against each other? Not single-player games like Wordle, but multiplayer adversarial games, like poker. That would force the LLMs to create better strategies to compete with each other.

Hence, LLM-poker.

It’s fascinating! The most interesting part is that all different model seem to have their own personality in terms of how they play the game. And Claude Haiku seems to be able to beat Claude Sonnet quite handily.

The coolest part is that if it’s a game, then we can even help the LLMs learn from their play using RL. It’s fun, but I think it’s likely the best way to teach the models how to get better more broadly, since the rewards are so easily measurable.

The lesson from this trajectory is that it mirrors what I wanted from LLMs, and how that’s changed. In the beginning it was getting accurate enough information. Then it became: can they deal with “real world”-like scenarios of moving data and information, can they summarise the info given to them well enough. Then it became their ability to solve particular forms of puzzles which mirror real-world difficulties. Then it became: can they actually measure each other and figure out who’s right about what topic. And lastly, now, it’s whether they can learn and improve from each other in an adversarial setting, which is all too common in our Darwinian information environment.

The key issue, as always, remains whether you can reliably ask the models certain questions and get answers. The answers have to be a) truthful to the best of its knowledge, which means the model has to be able to say “I don’t know”, and b) reliable, meaning it followed through on the actual task at hand without getting distracted.

The models have gotten better at both of these, especially the frontier models. But they haven’t gotten better at them nearly as much as they’ve gotten better at other things, like doing PhD-level mathematics or answering esoteric economics questions in perfect detail.

Now, while this is all idiosyncratic, interspersed with vibes-based evals and general usage guides discussed in DMs, the frontier labs are doing the same thing.

The best recent example is Anthropic showing how well Claude 3.7 Sonnet does playing Pokemon. To figure out if the model can strategise, follow directions over a long period of time, work in complex environments, and reach its objective. It is spectacular!

This is a bit of fun. It’s also particularly interesting because the model isn’t specifically trained on playing Pokemon, but rather this is an emergent capability to follow instructions and “see” the screen and play.

Evaluating new models is becoming far closer to evaluating a company or an employee. Evaluations need to be dynamic, assessed across a Pareto frontier of performance vs latency vs cost, continually evolving against a complex and often adversarial environment, and able to judge for themselves whether the answers are right.

In some ways our inability to measure how well these models do at various tasks is what’s holding us back from realising how much better they are at things than one might expect and how much worse they are at things than one might expect. It’s why LLM haters dismiss it by calling it fancy autocorrect and say how it’s useless and burns a forest, and LLM lovers embrace it by saying how it solved a PhD problem they had struggled with for years in an hour.

They’re both right in some ways, but we still don’t have an ability to test them well enough. And in the absence of a way to test them, a way to verify. And in the absence of testing and verification, to improve. Until we do we’re all just HR reps trying to figure out what these candidates are good at!
