2024-11-23 01:28:39
Table of Contents:
Red vs. blue digital town-squares.
2012 continues to be a fulcrum year.
Mainstream media incompetence made prediction markets look good.
Why did AI companies stop competing in e-sports?
No conscious oversight over who gets liver transplants is bad.
Selling out, artists, and writing on Substack.
God-shaped holes hungry for content.
From the archives.
Ask Me Anything.
1. Last year I warned in “The internet's "town square" is dead” that we would end up with a “blue” Twitter and a “red” Twitter—tailored echo-chambers reinforcing their own political views. At the time, I thought the blue-team competitor to Twitter (now X) would be Threads:
To put it very simply: there is a good chance (not an inevitable one, but a good chance) that Threads becomes the “blue” Twitter, while Twitter becomes the “red,” well, Twitter—following the same sort of politicization, red or blue, that has happened to the smaller Twitter clones, and that you can find almost everywhere in our culture.
I’d say this has pretty much come to pass, except it’s been BlueSky lately that’s overtaken Threads.
Why not Threads? To explain this, I think people need to understand that writing or producing content about politics is the short and easy path to engagement, but in doing so you strike a dark bargain: you have to be comfortable being hated. I think Zuckerberg, amid the ongoing rehabilitation of his image, and in the wake of how much he was blamed for Trump's 2016 victory, was honestly just tired of politics. And so anything heavy or political was openly banned or shadowbanned on Threads. This, along with some differences in structure (e.g., an Instagram-like emphasis on visuals and connecting users to brands), opened the space for BlueSky to become the “blue” digital town-square.
Personally, I think having a blue and a red town-square is a significant net negative as an outcome. For it's important to have a culture which is not entirely about politics—to have topics, writing, thinking, ideas, and discussion that span the aisle. But you can imagine how difficult that is if there's no one big shared platform that crosses political lines and instead all the discussion is split into pieces.
Science (outside of IQ graphs) is increasingly absent from X; same for literature (outside of scandals). All the non-political content of our civilization has been fundamentally nerfed by the splitting of the digital town square post-2022. And it sucks.
2024-11-20 00:06:04
Consider how dismally confused the science of nutrition and human health has been the last few decades. Is sugar bad for you? Is fat? There have been unprecedented levels of what can only be (kindly) referred to as trust-rending scientific failure when it comes to basic questions about health.
It occurred to me that my own field, neuroscience, has been similarly confused. You can do all sorts of neuroscientific experiments, yet ultimately, we just still don't know that much about brain function. But we have a really good excuse: the brain is a high-dimensional neural network that acts as a mathematical black box. Every brain is different, and the principles of each individual brain's operation are so strung out across millions of little synapses, even for simple functions, that trying to understand it the way you understand a car engine fails over and over.
Looking over at the scientists studying health, I have both sympathy and a striking feeling of recognition—I too have walked around conferences and thought “Oh, this is like 95% bullshit.”
Perhaps my feeling of recognition is because your own body is a mathematical black box as well? What if it had some of the same confusing properties as a large-scale neural network, like patterns of minute-scale changes in memory across individual cells? That hypothesis would certainly explain why nutritionists and health experts are themselves so often perplexed and contradictory. Or at a personal level, why you might be gluten intolerant and then go to Italy and eat all the bread that you want. Or why you used to be allergic to shellfish, and then one day weren’t. Or why red wine used to give you headaches, but then stopped. All of us have experienced personal mysteries like that, and yet the science of health struggles mightily to put people's disparate personal experiences into a rigorous car engine-like causal model.
And we have an increasingly strong inkling as to why. Of course, there's evidence that things like working out cause neuron growth—aspects of that have been known for a while—but more and more there is evidence of cellular memory across scales. Indeed, just yesterday a paper in Nature was published:
…our findings indicate the existence of an obesogenic memory, largely on the basis of stable epigenetic changes, in mouse adipocytes and probably other cell types. These changes seem to prime cells for pathological responses in an obesogenic environment, contributing to the problematic ‘yo-yo’ effect often seen with dieting.
In other words, your cells remember being fat. And, in a similar vein, research in Nature published just two weeks ago has shown that non-neural cells obey the same fundamental laws of memory as neurons do at the fine-grained level of patterns of inputs. They are sort of like proto-neurons, then, in this sense. Much like your personality is an output of a neural network, perhaps your body morphology itself is best thought of in this complex black-box way too—a thing not so much mechanical as it is thinking itself into form.
2024-11-15 00:07:28
According to inside reports, Orion (codename for the attempted GPT-5 release from OpenAI) is not significantly smarter than the existing GPT-4. Which likely means AI progress on baseline intelligence is plateauing. Here’s TechCrunch summarizing:
Employees who tested the new model, code-named Orion, reportedly found that even though its performance exceeds OpenAI’s existing models, there was less improvement than they’d seen in the jump from GPT-3 to GPT-4.
If this were just a few hedged anonymous reports about “less improvement,” I honestly wouldn't give it too much credence. But traditional funders and boosters like Marc Andreessen are also saying the models are reaching a “ceiling,” and now one of the great proponents of the scaling hypothesis (the idea that AI capabilities scale with model size and the amount of data the models are fed) is agreeing. Ilya Sutskever was always the quiet scientific brains behind OpenAI, not Sam Altman, so what he recently told Reuters should be given significant weight:
Ilya Sutskever, co-founder of AI labs Safe Superintelligence (SSI) and OpenAI, told Reuters recently that results from scaling up pre-training—the phase of training an AI model that uses a vast amount of unlabeled data to understand language patterns and structures—have plateaued.
I know that there is a huge amount of confusion, fear, and hope that creates a fog of war around this technology. It's a fog I've struggled to pierce myself. But I do think signs are increasingly pointing to the saturation of AI intelligence at below domain-expert human level. It’s false to say this is a failure, as some critics want to: if AI paused tomorrow, people would be figuring out applications for decades.
But a stalling makes sense because, honestly, we haven't yet gotten a model that feels significantly smarter than the original GPT-4 (released early 2023). Sure, there have been some advancements on the benchmarks, but there's been no massive leap in real-world performance other than (a) an increase in intelligence to at best an occasional “GPT-4.5ish” level, but not a GPT-5 one, (b) a useful decrease in hallucinations, and (c) a pleasing task-relevancy of responses (but we wait longer for them). How much this matters to users depends greatly on the task. E.g., metrics like the ranked persuasiveness of AI-generated political messages show sharply diminishing returns with model scale, such that leading frontier models are barely more persuasive than order-of-magnitude smaller ones.
Here are the successive increases in official performance on the flagship Massive Multitask Language Understanding benchmark (the MMLU is a bunch of hard tasks ranging from math to law). At first it looks impressive, but if you look closely, no model has made significant progress since early 2023.
And we now know that models are often over-fitted to benchmarks. As Dwarkesh Patel wrote while debating (with himself) if scaling up models will continue to work:
But have you actually tried looking at a random sample of MMLU and BigBench questions? They are almost all just Google Search first hit results. They are good tests of memorization, not of intelligence.
One issue with discerning the truth here is that researchers who release the models aren't unbiased scientists, so they don't test on every single benchmark ever. They are more likely to choose benchmarks that show improvement in the announcement. But I've noticed online that private benchmarks often diverge from public ones. E.g., independent researchers and users were less impressed by this summer's GPT-4o.
I think it's likely that, over the past year, private benchmarks have shown more variable performance (e.g., not always going up from model to model) compared to the public benchmarks everyone uses—and this often appears to be the case. In other words, improvements look ever more like “teaching to the test” than like anything about real fundamental capabilities.
In which case, it’s pretty meaningful news that models like Orion/GPT-5 are not improving universally across the tests like they used to. As a recent article in The Information revealed:
Some researchers at the company [OpenAI] believe Orion isn't reliably better than its predecessor in handling certain tasks, according to employees. Orion performs better at language tasks, but may not outperform previous models at tasks such as coding, according to an OpenAI employee.
Another sign of stalling is that, despite all the hype, companies like OpenAI are now regularly not releasing internal models to the public, almost certainly because they don’t work as well as the hype claims. For instance, the simple explanation for why OpenAI’s video-generating AI Sora isn’t publicly available—after nearly a year!—is because it sucks. If OpenAI did release Sora it would be met with mostly laughter and mocking. As I argued in “Recent AI failures are cracks in the magic,” the information density of video makes it very obvious when the AI gets things wrong (we humans have an entire lobe of our brain dedicated to picking out the slightest visual errors).
Beyond the models kept locked behind closed doors, recently deployed models like GPT-4 Turbo or GPT-4o are likely already-failed attempts at GPT-5, and have consistently been stuck at the same roughly “GPT-4.5ish” level (which is what Turbo and 4o and now o1 basically are). Many have speculated that the expectations are so high that OpenAI keeps renaming failed GPT-5 attempts as “GPT-4-whatever” so as not to expend the built-up hype (in fact, they may never release a GPT-5, just keeping Orion as “Orion” when it launches next year).
Why is this stalling out happening?
The first hypothesis was given during the initial wave of criticism against deep learning: that these systems lack symbolic reasoning and therefore couldn't ever “truly” reason. These objections have turned out to have far less bite than predicted—e.g., Gary Marcus used to lean heavily on the idea that AIs can't do causal reasoning because they're just “curve fitting,” but they obviously can reason about causation now. At the same time, the models never quite overcame their nature as autocompleters, still getting fooled by simple questions like “How many Rs are in strawberry?” or being unable to generate simple phonics lessons.
The second hypothesis behind the stall is one people discuss all the time: deep learning is limited by the amount of high-quality data (of which human civilization has been thoroughly mined by the big companies). This creates an “S-curve” that, at best, approximates human intelligence on tasks with lots of data (I've pointed out this entails a “supply paradox” wherein AI is good at things that aren't that economically valuable at a per-unit level, like writing text). That is, the models are essentially just a function of the data that made them (explaining why AIs converge to be so similar).
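To make the shape of that claim concrete, here's a toy way of writing it down (my own illustrative formula, with invented parameters—not a result from any particular paper): capability saturates as a logistic function of how much high-quality data exists to train on.

```latex
% Toy model only: an S-curve in (log) data, flattening at a ceiling set by the data itself
P(D) = \frac{P_{\max}}{1 + e^{-k\,(\log D - \log D_0)}}
% P(D):   capability after training on D tokens of high-quality data
% P_max:  the ceiling implied by what the data itself contains
% D_0, k: midpoint and steepness; once D nears the total usable supply, the gains flatten out
```

The toy version just makes the worry explicit: once D is close to everything the companies can scrape, moving further along the curve stops helping.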
Of course, there's still the question of whether or not synthetic data can get around this. But there are well-known downsides, like model collapse, and this seems the sort of obvious solution that I suspect the companies have been trying, with not much success.
Why AI is stalling out is probably some combination of the two standard reasons. There was always some sketchiness about the degree to which deep learning was mimicking real human reasoning (but critics called “the wall” early by several orders of magnitude). And this sketchiness could be papered over, but only via huge amounts of data, which these companies have run out of.
Perhaps we should not dwell too closely on the question of why it is that the sum of human civilization’s data fed into deep learning techniques plateaus at something smart—but not that smart—lest we see ourselves too clearly in the mirror.
What to make, then, of the claim that researchers have unlocked a new axis of scaling by giving the models time to “think”? For this is what the entire industry will now rely on, both in practice and to keep the sky-high hype alive.
I think this new axis of scaling will matter far less than people expect and will not keep the train of improvements chugging along anywhere near the previous rate.
First, this new type of scaling appears limited to certain domains: it makes the models no better on English-language tasks, for instance, and provides only minimal improvements for questions about biology. Second, there are good reasons to believe that this sort of scaling will be much more expensive: linear improvements for extreme sums of money. Here's OpenAI's chart on this new scaling law (shown only for a math benchmark, notably), released alongside their most recent o1 model from September.
As tech journalist Garrison Lovely wrote about it on Substack:
If you have a way of outcompeting human experts on STEM tasks, but it costs $1B to run on a days worth of tasks, you can't get to a capabilities explosion, which is the main thing that makes the idea of artificial general intelligence (AGI) so compelling to many people…. the y-axis is not on a log scale, while the x-axis is, meaning that cost increases exponentially for linear returns to performance (i.e. you get diminishing marginal returns to ‘thinking’ longer on a task). This reminds me of quantum computers or fusion reactors—we can build them, but the economics are far from working.
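To see what that shape implies, here's a minimal toy sketch (the constants are invented, not OpenAI's actual numbers) of accuracy growing linearly in the log of test-time compute:

```python
# Toy illustration only (made-up constants, not OpenAI's data): if accuracy is
# linear in log(compute), every extra point of accuracy costs exponentially more.
import math

def toy_accuracy(compute_multiplier, base=20.0, slope=8.0):
    """Hypothetical curve: accuracy = base + slope * log10(compute multiplier)."""
    return base + slope * math.log10(compute_multiplier)

for c in [1, 10, 100, 1_000, 10_000]:
    print(f"compute x{c:>6}: accuracy ~ {toy_accuracy(c):.1f}%")
# Every 10x jump in compute buys the same fixed number of accuracy points:
# linear returns for exponential spending.
```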
But I think people focusing on price or the domain-specificity of improvements are missing the even bigger picture about this new supposed scaling law. For what I’m noticing is that the field of AI research appears to be reverting to what the mostly-stuck AI of the 70s, 80s, and 90s relied on: search.
I don't mean searching the web; rather, I mean examples like this summer, when Google DeepMind released their state-of-the-art math AI, AlphaProof, able to handle mathematical Olympiad problems. It's a hybrid, very different from a leading generalist LLM. How does it work? It just considers a huge number of possibilities.
When presented with a problem, AlphaProof generates solution candidates and then proves or disproves them by searching over possible proof steps…. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.
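To be concrete about what “considers a huge number of possibilities” means, here's a bare-bones sketch of that generate-and-verify loop (my own illustration, not DeepMind's code; propose_candidate and verify are hypothetical stand-ins for a model that guesses proof attempts and a formal checker that accepts or rejects them):

```python
# Bare-bones generate-and-verify search, in the spirit of the quoted description.
# Not DeepMind's implementation: `propose_candidate` and `verify` are hypothetical
# stand-ins for a model proposing proof attempts and a formal proof checker.

def search_for_proof(problem, propose_candidate, verify, budget=100_000):
    """Churn through up to `budget` candidate proofs; return the first that checks out."""
    for _ in range(budget):
        candidate = propose_candidate(problem)   # the model guesses a proof attempt
        if verify(problem, candidate):           # the checker accepts or rejects it exactly
            return candidate
    return None  # budget exhausted with no verified proof
```

The intelligence in the loop lives mostly in the proposer; the rest is brute enumeration plus exact checking.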
This sort of move makes sense, because search was the main advantage for AIs against humans for a very long time (e.g., in Chess, whereas Go cannot be brute-forced in a similar manner). But AI relying too much on search has been criticized before. In the infamous Lighthill Report in 1973, James Lighthill pointed out that combinatorial explosions awaited any AI techniques that relied on search—predicting well the limits of the field for decades until the advent of deep learning (which actually made progress on real intelligence).
I think the return to search in AI is a bearish sign, at least for achieving AGI and superintelligence. This is for a very simple reason: search over a vast number of possibilities is not at all like human thought! Nor is it qualitatively the same as having one singularly smart big artificial neural network (like a big super brain). Rather, it is merely a coupling of the traditional superiority of computers—search over huge numbers of possibilities—with some sort of roughly-GPT-4.5ish-level intelligence that mainly just proposes and evaluates.
In fact, the entire new axis of scaling touted by the companies to replace the old failing scaling laws is arguably just search given a makeover (Noam Brown, a foundational contributor to o1, describes his research mostly in terms of advances in search). As Reuters reported a few days ago:
To overcome these challenges, researchers are exploring “test-time compute,” a technique that enhances existing AI models during the so-called “inference” phase, or when the model is being used. For example, instead of immediately choosing a single answer, a model could generate and evaluate multiple possibilities in real-time, ultimately choosing the best path forward.
People (both in the field and the writers at Reuters) are keen to describe this sort of thing in mentalese. Yet actually, both of the leading avenues—“test-time training” and “test-time compute”—look an awful lot to me like just fancy search!
In “test-time training” the model rapidly adjusts its own parameters (like fine-tuning) in order to find the parameters best suited for that particular test. In other words, it searches over parameter space. In “test-time compute” there are hidden iterative refinement stages before the official output, in which multiple draft outputs are evaluated, averaged, or voted on. Either way, these methods require more and more time and more and more resources to explore possibilities, and no matter how much search you do, you're still fundamentally bounded by the place you started, which is converging across models to some sort of roughly GPT-4.5ish-level intelligence.
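Stripped of the mentalese, here's a minimal sketch of the “many drafts, then vote” flavor of test-time compute (often called best-of-N or self-consistency; sample_answer is a hypothetical stand-in for one call to a fixed base model):

```python
# Minimal sketch of "generate many drafts, then vote" test-time compute.
# `sample_answer` is a hypothetical stand-in for one call to a fixed base model;
# the base model never gets smarter -- we just search over its outputs.
from collections import Counter

def answer_with_more_thinking(question, sample_answer, n_drafts=32):
    """Spend extra inference compute by sampling N drafts and majority-voting."""
    drafts = [sample_answer(question) for _ in range(n_drafts)]
    most_common_answer, _count = Counter(drafts).most_common(1)[0]
    return most_common_answer
```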
Some version of search is what researchers are leaning on everywhere now; e.g., recent gains on the ARC test (a test of general intelligence considered difficult for LLMs), which saw a jump in performance just a few days ago, came from a similar kind of “test-time training.”
Continued improvements are going to happen, but if the post-GPT-4 gains in AI came mainly from first adding better prompts (chain-of-thought prompting) and more recently adding search to the process (either over many potential outputs or over the model's parameters themselves), that is different from actually constructing baseline-smarter artificial neural networks. It indicates to me a return to the 70s, 80s, and 90s in AI.
A mini thought experiment. Who do you think can make breakthroughs in physics: one Einstein, or 1,000 physics undergraduates? Assuming none of the undergraduates just happens to be an Einstein-in-waiting, the answer is one Einstein. We already know that very large numbers of people are rarely as good as a dedicated small team, or even a lone individual, in terms of creativity, knowledge, and impact. These new scaling techniques are the creation of an algorithmic pipeline that gives better answers via “cloning” or “varying” an already-saturated-near-max intelligence.
Does this new reliance on search-based techniques obviate all concerns about AI safety and existential risk from these technologies? No, it doesn't. Pandora's box has been forever unlocked. Forevermore, we will live with the existential risk of AI. However, the news that baseline general intelligence is plateauing, and that AI researchers are now relying on old standbys like massive search spaces to progress, is an early indication that—at least barring further fundamental breakthroughs—timelines have just been pushed back in our lifetimes.
For the sake of human culture not being filled with unconscious AI slop, for the sake of our children being able to find meaning in their own expertise and abilities and intelligence, and for the sake of risk-blind primates who thought it a good idea to try to make entities more intelligent than themselves—I personally hope I'm right about a return to search being bearish. I can’t offer much more than that.
2024-11-08 23:22:11
You’ve heard me talk about consciousness a lot. After all, it’s been one of my main intellectual focuses for almost two decades now. In graduate school, I worked on developing aspects of Integrated Information Theory, the most popular mathematically formal approach to measuring consciousness that exists in neuroscience. My novel, The Revelations, is about the hunt for a scientific theory of consciousness (with a murder mixed in). And I just wrote an essay laying out the evidence for the (at-least-worth-considering) provocative thesis that science itself might be incomplete, with consciousness being a main culprit.
Personally, I still hold out hope that there will be a scientific answer to consciousness! It is very possible that somewhere in Borges' infinite library there sits a book, and in that book there is a paragraph with the right words in the right order, and that simple paragraph contains everything you need to deduce a scientific explanation of consciousness. One measly paragraph would be enough! I would collapse in joy, weeping, if I ever read such a thing.
But when it comes to consciousness, it’s still such a nascent field that you don’t have to be an expert to have opinions about it. Unlike a physics “Theory of Everything,” consciousness is far more immediate, both in that we’re all intimately familiar with it, and also in that the field itself still doesn’t require any deep technical expertise to have fun speculating about (or even, arguably, contributing to).
After being subjected to years of my rambling on the subject, it’s time I hear from you all.
What are your favorite theories of consciousness? Do you adhere to any? Do you have your own? Thoughts on places to start? How, exactly, do subjective experiences arise from a gelatinous structure of cells puffing little chemicals at one another?
Will the problem ever be solved, or will it one day just be dissolved? And if AI isn’t conscious, then what do you even need consciousness for, anyways?
All levels of analysis are welcome. Feel free to speculate wildly, and I’ll offer feedback on the more promising / interesting proposals.
2024-11-04 23:38:48
One only needs to check the texts we’re all getting spammed with here in America to know the election is tomorrow. So here are three thoughts as we finally finish off this strange, entertaining, horrifying, wonderfully insane cavalcade we all march in every four years.
I can't recall an election in my lifetime where public figures staked out endorsements to this degree. Obviously, this has been building for a while. On the Democratic side, this year doesn't look too different than, say, the celebrity push behind Hillary Clinton in 2016. As The Guardian reported:
During the final weeks of the campaign, Harris was joined on the trail by Beyoncé, Bruce Springsteen, Stevie Wonder, Maggie Rogers, Willie Nelson, Jennifer Lopez and Oprah, among others.
But Elon Musk, who in addition to everything else is inarguably one of the biggest social media stars on the planet (he has 203 million followers on X), jumped in to support Trump. On social media, in outlets, in blogs, people have squared off in 2024 like never before. Even Nature just jumped in, with their second-ever endorsement, after backing Biden (their first) in 2020.
This much politicization has triggered pushback. Recently, Jeff Bezos overruled the editors of The Washington Post, refusing to let the newspaper issue an endorsement this election cycle. As Bezos himself explained it:
No undecided voters in Pennsylvania are going to say, “I’m going with Newspaper A’s endorsement.” None. What presidential endorsements actually do is create a perception of bias. A perception of non-independence. Ending them is a principled decision, and it’s the right one.
Some, especially Post subscribers, were quite mad at this (interestingly, neither MSNBC, CNN, nor Fox News catches any flak for not issuing official endorsements—one of the problems with political endorsements is that it's a lot easier to start than stop).
Of course, to try not appearing too biased, one could take the “I am a centrist creature of pure nuance” approach (with the center just happening to be whatever politics the then-sensible, then-sane, adult world had when they were 15). Alternatively on offer is the “I am making this choice via Spock-like logic” approach, but I think this is a category error given the deep psychological resonances behind why we like a certain politics, or find certain figures trustworthy or untrustworthy. Neither are great options.
Certainly, I’m sympathetic to the criticism that principled non-endorsement is a form of cowardice for anyone with a platform. But a mere perception of bias is not the only reason to have concerns over (what’s felt like) nearly everyone and every outlet “officially” endorsing. In general, knowing too much about the politics of writers and thinkers usually makes me lose some degree of respect for them—their opinions are never exactly what you want, and exactly what you want is, well, what you want when it comes to politics.