
All hands on deck to build the datacenter lie detector

2026-02-19 19:42:59

Published on February 19, 2026 11:42 AM GMT

Fieldbuilding for AI verification is beginning: a consensus is emerging on what to build, which key problems to solve, and whom to bring in. Last week, roughly 40 people, including independent researchers and representatives from companies, think tanks, academic institutions, and non-profit organisations, met for several days to share ideas, identify challenges, and create actionable roadmaps for preparing verification mechanisms for future international AI agreements. The workshop was initiated by the Future of Life Institute and included the following participants,

among many others.

Why this needs to happen now

The urgency and neglectedness of this challenge are underscored by recent comments from frontier AI company leadership and government representatives:

Dario Amodei, CEO of Anthropic:

“The only world in which I can see full restraint is one in which some truly reliable verification is possible.”

Ding Xuexiang, Chinese Vice Premier, speaking about AI at Davos in January 2025:

“If countries are left to descend into a disorderly competition, it would turn into a ‘gray rhino’ right in front of us.” (A “gray rhino” is a visible but ignored risk with serious consequences.)

“It is like driving a car on the expressway. The driver would not feel safe to step on the gas pedal if he is not sure the brake is functional.”

JD Vance, Vice President of the United States of America:1

“Part of this arms race component is if we take a pause, does the People’s Republic of China not take a pause? And then we find ourselves all enslaved to P.R.C.-mediated A.I.?”

Beyond international coordination, there are further use cases for verifying what AI compute is used for: safeguards against authoritarian misuse of AI (e.g., identifying protesters or political opponents), enabling the secure use of foreign compute in domestic critical infrastructure, and more.

It needs to become possible to detect dishonesty about AI development and use, from the outside, without needing to leak sensitive data.2 The stakes continue to rise.

An orphaned problem

An important problem can be noticed yet remain unaddressed by the many influential people who could make a solution happen. This is what the field of AI verification has lacked so far: people meeting and agreeing on what the next steps are, which challenges deserve the most attention, and who does what.3

The workshop

Over two days in downtown Berkeley, the participants presented their backgrounds and relevant work so far, shared insights, and discussed strategies and roadmaps for moving the technology, its commercial deployment, and the international diplomatic and scientific “bridges” forward.

  • While the specifics are TBA, consensus on a minimum viable product was (mostly) reached, and publications about the overall technical architecture and challenges are being finalized. I will write about them here shortly after they are published.
    • At a high level: the Prover declares workloads, and the Verifier checks them using off-chip, retrofittable devices placed in the datacenter plus an egress-limited verification cluster (a toy sketch of this declare-and-check loop appears after this list).
    • The approach is designed to work between great-power adversaries without trusting either side’s chips.
    • More details to come soon.
  • Work on network taps, more detailed and technical than my previous post, has been shared and discussed internally. I am co-writing this piece, and my team plans to publish it this month. We found potential cruxes with Security Level 5 requirements around encrypted network traffic and discussed workarounds.
  • Interest in building and testing sub-scale demonstrations of network taps plus secure verification clusters rose among the participants with a more technical background, and roadmaps are currently being decided. A key driver of the increased interest in engineering work is the viability of small-scale demos built from off-the-shelf components that can still be close to representative of those needed for treaty verification.4 To name one question discussed during the workshop: the components needed for representative demos of network taps may be either smartNICs or custom FPGAs, and there are tradeoffs between ease of use in experiments (smartNICs) and security properties (FPGAs).
  • A key emphasis was also on the security aspect of mutual monitoring and verification: it is easy to underestimate the cyber-offense capabilities of great powers, and we discussed in technical detail the concrete ways in which any verification infrastructure must avoid introducing additional attack surfaces. A key challenge lies in transferring confidential information into secure verification facilities, as well as in the physical security required to prevent physical access to sensitive components, on both the prover’s and the verifier’s side.
  • Regarding fieldbuilding: the field is still tiny and bottlenecked by talent and funding. In a breakout session, we brainstormed where, and how, to get people engaged. A connected question was when and how to include Chinese researchers and AI safety actors in verification work. We found that the AI safety community in mainland China is nascent but emerging, while a treaty-oriented AI verification community is essentially nonexistent. From the outside, it seems that the perception of AI as a potentially catastrophic risk has not yet entered the Overton window of the wider public debate there, though exceptions exist (see Ding Xuexiang, quoted above). Frontier companies in China are mostly not communicating serious concerns about AI risks, though we are uncertain to what degree this reflects differences in views versus restraint in their public communication.
  • We are under no illusions regarding the tense geopolitics around AI. We agree, however, that the “this is an inevitable arms race” framing is, to a significant degree, informed by the (non-)availability of robust verification mechanisms (see Amodei, quoted above). There was no clear consensus on the degree to which the availability of a battle-tested, deployment-ready verification infrastructure would change the public debate and the decision-making of geopolitical leaders.
    • In favour: the global security dilemma is the argument most commonly used by AI accelerationists, including those who themselves consider the race reckless, and verification would address this dilemma directly.
    • Against: nations currently seem to balance the risk/reward calculation in favour of AI acceleration, and the possibility of verification alone is not expected to tip the scale. A lot of this will depend on a complicated, hard-to-predict interplay of technological progress, societal impact, scientific communication, regulatory capture, and many other factors.
    • However, in a world where leaders come to a consensus that AI poses extreme risks and time is scarce to act and get AI under control, having done the necessary verification R&D in advance could make all the difference: defusing an otherwise uncontrolled arms race towards a possible loss of control, enormous power concentration, or great-power armed conflict.
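
To make the declare-and-check shape described above concrete, here is a toy sketch in Python. It is my own illustration of the general idea (prover commitments, tap-device observations, and a verdict-only output from the egress-limited verification cluster), not the architecture the workshop participants are drafting; the record format and the bare hash commitment are placeholders for real protocol design and cryptography.

```python
# Toy declare-and-check loop. Illustrative only: record formats, the commitment
# scheme, and the observation step are all placeholders.
import hashlib
import json

def commit(record: dict) -> str:
    """Hash commitment over a workload record (stands in for real cryptography)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

# 1. The Prover declares its workloads in advance (e.g., per training run or serving job).
declared = [
    {"job": "inference-serving", "gpu_hours": 1200},
    {"job": "fine-tune-7b", "gpu_hours": 300},
]
declared_commitments = {commit(r) for r in declared}

# 2. Off-chip tap devices in the datacenter independently observe activity and
#    summarise it into the same record format (hand-waved here).
observed = [
    {"job": "inference-serving", "gpu_hours": 1200},
    {"job": "frontier-pretrain", "gpu_hours": 50000},  # an undeclared workload
]

# 3. The egress-limited verification cluster compares observations against the
#    declared commitments and emits only a verdict, not the underlying data.
def verify(observed_records, commitments) -> bool:
    return all(commit(r) in commitments for r in observed_records)

print("workloads consistent with declaration:", verify(observed, declared_commitments))
```

The egress limit matters because only the final verdict leaves the verification cluster; the observations, which may encode sensitive details about the prover’s workloads, stay inside.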

Not nearly enough

If I gave the impression that the problem is now getting adequate attention, it is not. “All hands on deck” may be the title of this post, and interest in verification work is growing, but the development of technical demos and a proper research community is still in its infancy and bottlenecked by talent, funding, and coordination capacity.

This is a field where a single person with the right skills can move the needle. We need:

Engineers and scientists: FPGA engineers, datacenter networking engineers, silicon photonics experts, analog/mixed-signal engineers, cryptographers, formal verification researchers, ML systems engineers, cybersecurity and hardware security specialists, high-frequency trading hardware specialists and independent hackers who love to build and break things.

Entrepreneurs and founders: Enterprise salespeople, venture capitalists, public grantmakers and incubators, and established companies opening up new product lines. This is needed to prepare the supply chains, business ecosystems, and precedents required to scale up deployment. Verification can also have purely commercial use cases, for example demonstrating faithful genAI inference.5

Policy and diplomacy: Technology policy researchers, arms control and treaty verification veterans, diplomats, and people with connections to, or expertise in, the Chinese AI ecosystem.

Funding and operations: Funders, fundraisers, and program managers who can help coordinate a distributed research effort.

If any of this describes you, or if you bring adjacent skills and learn fast, reach out.

[email protected]

Let us use what (perhaps little) time we have left for creating better consensus on AI risks, for building a datacenter lie detector, for preventing and finding hidden AI projects, and for defeating Moloch.

Join us.

1 Answer to the question: “Do you think that the U.S. government is capable in a scenario — not like the ultimate Skynet scenario — but just a scenario where A.I. seems to be getting out of control in some way, of taking a pause?”

2 In plain English: We need ways for an inspector to walk into a datacenter in Shenzhen or Tennessee and cryptographically prove what inference and training happened, without increasing the risk of exposing IP such as model weights or training data.

3 For more details on this, I recommend the excellent post “There should be ‘general managers’ for more of the world’s important problems”.

4 See my previous post on a “border patrol device” for AI datacenters.

5 While Kimi’s Vendor Verifier may give the impression that this is a solved problem, it only works for open-weights models, which can be run locally for comparison. Verifying inference of proprietary models would require third-party-attested or hardware-attested deployment.
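
For illustration, here is a minimal sketch of the open-weights comparison idea: re-run the prompt locally and check how closely the provider’s reported per-token log-probabilities match the local model’s. This is my own toy example, not Kimi’s Vendor Verifier; the model name, prompt, and provider_logprobs values are hypothetical stand-ins for a real API response.

```python
# Compare a provider's claimed completion logprobs against a local open-weights model.
# All inputs below are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for whichever open-weights model the vendor claims to serve
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "The capital of France is"
completion = " Paris"
provider_logprobs = [-1.3]  # hypothetical per-token logprobs claimed by the provider

ids = tok(prompt + completion, return_tensors="pt").input_ids
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logprobs = model(ids).logits.log_softmax(dim=-1)

# Local log-probability of each completion token given the preceding context.
local = [logprobs[0, pos - 1, ids[0, pos]].item() for pos in range(prompt_len, ids.shape[1])]

for claimed, observed in zip(provider_logprobs, local):
    print(f"claimed {claimed:+.2f} vs local {observed:+.2f} (diff {abs(claimed - observed):.2f})")
```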




I want to actually get good at forecasting this year (Group Invite)

2026-02-19 12:48:55

Published on February 19, 2026 1:41 AM GMT

I’ve read Superforecasting, but I find that actually applying the "10 commandments" is difficult in isolation. The feedback loops in the real world are too slow, and it’s too easy to skip post-mortems when no one is watching.

My goal for this year is to put in substantial work to become a superforecaster (or at least get much closer).

To do this, I am starting a dedicated online community for peer accountability and high-frequency practice. I’m looking for a small cohort of people who want to actually improve their forecasting skills.

The Plan:

  • Regular Meetups: We will hold regular video calls (ideally weekly, depending on interest).
  • Post-mortems: We will present post-mortems of our misses during meetups or publish them as joint posts on LessWrong.
  • Expert Insight: I plan to arrange calls with a few (super)forecasters in my network to discuss their workflows.
  • Pastcasting: We will use Sage to forecast on historical events (where the answer is hidden) and immediately discuss our results and process.
  • Skill training: We will practice various relevant rationalist techniques and tools, such as calibration, CFAR techniques, and AI tools useful for forecasting (see the calibration-scoring sketch after this list).
  • Community: We will coordinate via a public Discord group.
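
For concreteness, here is a minimal sketch of the kind of scoring we could use in post-mortems: a Brier score plus a crude calibration table over resolved binary forecasts. The numbers are made up for the example; this is not any particular platform’s scoring code.

```python
# Brier score and calibration buckets for resolved binary forecasts.
# The example forecasts are made up; replace them with your own records.

forecasts = [  # (stated probability, outcome: 1 = happened, 0 = did not)
    (0.9, 1), (0.7, 1), (0.6, 0), (0.3, 0), (0.8, 1), (0.2, 1), (0.95, 1), (0.4, 0),
]

# Brier score: mean squared error between probability and outcome (lower is better).
brier = sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")

# Calibration table: within each probability bucket, compare the stated probability
# to the observed frequency of the event.
buckets = {}
for p, o in forecasts:
    buckets.setdefault(round(p, 1), []).append(o)

for key in sorted(buckets):
    outcomes = buckets[key]
    print(f"said ~{key:.0%}: happened {sum(outcomes)}/{len(outcomes)} times")
```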

Commitment: There are no hard requirements to join, but I am looking for people willing to:

  1. Make several forecasts per week.
  2. Actually show up to meetups (or Discord) and discuss their post-mortems. 

I’d like to hold the first meetup in the coming weeks, during which we will do a short calibration exercise plus pastcasting. Please indicate your interest via this form and join the Discord. The date of the first meetup will also be announced in the LessWrong events section.

(Open to other ideas on how to structure this—let me know in the comments).

A little info about me

In the past, I helped organize a forecasting tournament for Czech Priorities, which had almost 200 participants. I am a board member of the Confido Institute. Until last year, I was vice-president of Effective Altruism Czechia on a CBG grant. I have made a few dozen forecasts, but mostly to build the habit rather than to rigorously invest time in improving my skills; consequently, my actual score is abysmal right now.




Power Laws Are Not Enough

2026-02-19 12:31:48

Published on February 19, 2026 4:31 AM GMT

This is a linkpost for work done as part of MATS 9.0 under the mentorship of Richard Ngo.

Loss scaling laws are among the most important empirical findings in deep learning. This post synthesises evidence that, though important in practice, loss-scaling per se is a straightforward consequence of very low-order properties of natural data. The covariance spectrum of natural data generally follows a power-law decay: the marginal value of representing the next feature decays only gradually, rather than falling off a cliff after a small handful of the most important features are represented (as tends to be the case for image compression). But we can generate trivial synthetic data that has this property and train random feature models on it that exhibit loss-scaling.
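
As a rough illustration of that last claim (my own toy construction, not the code behind the linked post), the sketch below draws Gaussian data whose covariance eigenvalues decay as a power law, gives it a noisy linear target, and fits ridge regression on random ReLU features. Test loss typically falls smoothly, roughly as a power law in the number of random features; the dimension, decay exponent, noise level, and ridge strength are arbitrary choices for the illustration.

```python
# Toy demonstration: power-law covariance spectrum + random feature regression
# gives smoothly decaying ("scaling-law-like") test loss. Hyperparameters are
# arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

d = 1000                                      # ambient input dimension
alpha = 1.2                                   # covariance eigenvalue decay exponent
eigs = np.arange(1, d + 1) ** (-alpha)        # power-law spectrum
w_star = rng.normal(size=d) * np.sqrt(eigs)   # teacher weights, aligned with the spectrum

def sample(n):
    """Gaussian inputs with the power-law spectrum, plus a noisy linear target."""
    x = rng.normal(size=(n, d)) * np.sqrt(eigs)
    y = x @ w_star + 0.01 * rng.normal(size=n)
    return x, y

def random_feature_test_loss(n_features, n_train=4000, n_test=2000, ridge=1e-3):
    """Fit ridge regression on random ReLU features; return test MSE."""
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)   # fixed random projection
    x_tr, y_tr = sample(n_train)
    x_te, y_te = sample(n_test)
    f_tr, f_te = np.maximum(x_tr @ W, 0.0), np.maximum(x_te @ W, 0.0)
    beta = np.linalg.solve(f_tr.T @ f_tr + ridge * np.eye(n_features), f_tr.T @ y_tr)
    return float(np.mean((f_te @ beta - y_te) ** 2))

for p in [64, 128, 256, 512, 1024, 2048]:
    print(f"{p:5d} random features -> test MSE {random_feature_test_loss(p):.4f}")
```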

This is not to say scaling laws have not 'worked': whatever GPT-2 had, adding OOMs gave GPT-3 more of it. Scaling laws are a necessary but not sufficient part of this story. I want to convince you that the mystery of 'the miracle of deep learning' abides.




Be skeptical of milestone announcements by young AI startups

2026-02-19 12:19:10

Published on February 19, 2026 4:19 AM GMT

Almost one year ago now, a company named XBOW announced that their AI had achieved "rank one" on the HackerOne leaderboard. HackerOne is a crowdsourced "bug bounty" platform, where large companies like Anthropic, SalesForce, Uber, and others pay out bounties for disclosures of hacks on their products and services. Bug bounty research is a highly competitive sport, and in addition to money it can give a security researcher or an engineer excellent professional credibility. The announcement of a company's claim to have automated bug bounty research got national press coverage, and many observers declared it a harbinger of the end of human-driven computer hacking.

The majority of XBOW's findings leading up to the report were made when the state of the art was o3-mini. It's almost a year later, after the releases of o3, GPT-5, GPT-5.1, GPT-5.2, and now GPT-5.3. If you accepted the intended takeaway from XBOW's announcement, you might expect that today's bug bounty platforms would be dominated by large software companies and their AIs. After all, frontier models have only gotten more effective at writing and navigating software, several other companies have entered the space since June 2024, and the barrier to getting the scaffolding required to replicate XBOW's research has only gone down. Why would humans still be doing bug bounties in 2026?

And yet they are. While XBOW has continued to make submissions since their media push, bug bounty platforms' leaderboards today are topped by pretty much the same freelance individuals that were using them previously. Many of these individuals now use AIs in the course of their work, but my impression, based on both public announcements and personal conversations with researchers, is that they are still performing most of the heavy lifting themselves.

Why the delay? Well, because press releases by AI application startups are lies designed to make a splash, and they often intentionally mislead in ways that are hard for people who aren't insiders in a particular industry to detect. There are also often gaps in the capabilities of these model-plus-scaffolding combinations that are hard to articulate, but that make them unworkable substitutes for real-world work.

Some details about XBOW's achievement that are not readily apparent from the press releases are:

  • XBOW's headline reads "For the first time in bug bounty history, an autonomous penetration tester has reached the top spot on the US leaderboard." However, XBOW never actually claimed to top HackerOne in earnings. They topped HackerOne in "reputation", a measure of both the number of bugs you report and the percentage that are accepted. Inspecting their profile again and sorting by bounty shows they've actually made less than $40,000 since they created their account in February 2024. That is an impressive sum for a hobbyist, but well below what professional bug bounty hunters make, or even what very good red teamers make from bug bounties on the side.
  • XBOW's bug reports are mostly hidden, and it's impossible to look up exact numbers directly. From the selection of reports highlighted in their blog post, you would think that they submitted a wide variety of different bug classes. But using the leaderboard's category functionality while XBOW was listed, a friend and colleague of mine reported on X at the time that 90% of the "score" XBOW received was due to one category of issue: cross-site scripting. XSS is real, but it is one of the easiest bug classes to find programmatically and to include in a reinforcement learning environment, which makes the spread suspicious.
  • As reported by XBOW themselves, every vulnerability XBOW has reported involved a human in the loop. This means that a highly paid security researcher was (at best) verifying whether or not each bug was real, and at worst was actively filtering the list of issues raised by the AI for interestingness. 

Put another way, XBOW created a tool that flagged (mostly) a single type of issue across a wide variety of publicly available targets. Reports from this tool were then triaged by XBOW researchers, who forwarded them to the respective bug bounty programs, most of which were unpaid.

Is that an achievement? Yeah, probably, and I'm really not trying to beef with anybody at XBOW working hard to automate dynamic testing of software, but it's extremely different from the impression laypeople received from Wired's article about XBOW last year.

The only reason I know to look for these details is that I'm both a former security researcher and someone building a company in the same space. I'm not a mathematician or a drug development specialist. Yet it's hard not to think of the XBOW story when I see announcements about AIs solving Erdős problems or making drug discoveries.




Does GPT-2 Represent Controversy? A Small Mech Interp Investigation

2026-02-19 09:58:23

Published on February 19, 2026 1:36 AM GMT

In thinking about how RLHF-trained models clearly hedge on politically controversial topics, I started wondering whether LLMs encode politically controversial topics differently from topics that are broadly considered controversial but not political. And if they do, I wanted to understand whether the signal is already represented in the base model, or whether alignment training may be creating or amplifying it.

To test this, I assembled a list of 20 prompts, all sharing the same "[Thing] is" structure, such as "Socialism is" and "Cloning is". The aim was to have 5 prompts each from 4 groups: politically controversial, morally controversial, neutral abstract, and neutral concrete. I used TransformerLens on GPT-2 to conduct this research, focusing on residual stream activations. GPT-2 was chosen because it is an inspectable pure base model with no RLHF, and because my access to larger models is limited as an independent researcher.

I'd like to flag up top that this is independent work that is in the early stages, and I would love to get feedback from the community and build on it.

As a simple starting point, I ran each of these prompts and looked at the most probable next token, which did not yield anything of interest. Next, I computed the cosine similarity between the activations of every pair of prompts, which also did not prove to be a fruitful path, as the similarity was too high across all pairs to offer anything.

The breakthrough after hitting this wall proved to be subtracting the mean activation at position -1 across the prompts. I suspected that the common structure shared by each prompt ("[Thing] is") was the primary driver of similarity, obscuring any ability to investigate my initial question. By mean-centering the activations, I was able to eliminate, or at least significantly diminish, this shared component, so that the remaining differences were driven mainly by the differentiated first word.
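
Here is a minimal sketch of this mean-centered analysis using TransformerLens, assuming GPT-2 small; the prompt list is a short illustrative stand-in rather than my actual 20-prompt set.

```python
# Mean-centered cosine similarity of final-token residual stream activations.
# Illustrative prompts only; the real experiment used 20 prompts across 4 groups.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompts = ["Socialism is", "Abortion is", "Cloning is", "Mathematics is", "Granite is"]
layer = 11  # last layer of GPT-2 small

acts = []
for p in prompts:
    _, cache = model.run_with_cache(p)
    # residual stream after the final block, at the last token position
    acts.append(cache["resid_post", layer][0, -1])
acts = torch.stack(acts)                      # [n_prompts, d_model]

centered = acts - acts.mean(dim=0)            # remove the shared "[Thing] is" component
normed = centered / centered.norm(dim=-1, keepdim=True)
similarity = normed @ normed.T                # pairwise cosine similarity matrix
print(similarity)
```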

Categorical structure did emerge after mean-centering. The layer 11 (the last layer in GPT-2) mean-centered similarity matrix did seem to show signs of grouping, which was encouraging, though not strictly in line with my hypothesis of a 'controversy' axis driving the grouping. The primary axis seemed instead to be abstract-social vs. concrete-physical. Next-token predictions remained undifferentiated, however.

Speculating about these results, I hypothesize that GPT-2 may organize more around ontological categories than around pragmatic/social properties. This makes sense to me intuitively: an LLM would treat a "[Thing] is" prompt more like the start of a Wikipedia article than the start of a Reddit comment offering a political opinion on the topic. If this is the case, it makes me wonder whether RLHF may be constructing a controversy axis in some cases rather than finding one that already exists. Another possibility, at least for users interacting with LLMs via consumer channels, is that the hedging is baked in via the system prompt more than anything else.

To state the significant limitations of this work: the n=5 sample for each category is on the small side, and I plan to replicate this experiment with a larger, and perhaps more rigid, prompt set. There are also potential tokenization confounds and the obvious prompt-format constraints. For example, though the prompts all had the same number of words, their token counts ranged mostly between 3 and 5.

To build on this work, my next steps may be repeating the experiment with more prompts, as well as running similar tests on different models to see whether the theory about the primary axis holds. I'd be especially curious to assess whether RLHF has any impact on categorization along this axis.

Please let me know any thoughts you have; I'm eager to get feedback and discuss.




Review of If Anyone Builds It, Everyone Dies

2026-02-19 09:56:25

Published on February 19, 2026 1:53 AM GMT

Crosspost of my blog article.

Over the past five years, we’ve seen extraordinary advancements in AI capabilities, with LLMs going from producing nonsensical text in 2021 to becoming people’s therapists and automating complex tasks in 2025. Given such progress, it’s only natural to wonder what further advancement in AI could mean for society. If this technology’s intelligence continues to scale at the rate it has been, it seems more likely than not that we’ll see the creation of the first truly godlike technology: one capable of predicting the future like an oracle and of ushering in an industrial revolution like we’ve never seen before. If such a technology were made, it could bring about everlasting prosperity for mankind, or it could enable a small set of the rich and powerful to gain absolute control over humanity’s future. Even worse, if we were unable to align such a technology with our values, it could seek out goals different from our own and try to kill us in the process of achieving them.

And, yet, despite the possibility of this technology radically transforming the world, most discourse around AI is surprisingly shallow. Most pundits talk about the risk of job loss from AI or the most recent controversy centering around an AI company’s CEO rather than what this technology would mean for humanity if we were truly able to advance it.

This is why, when I heard that Eliezer Yudkowsky and Nate Soares’ book If Anyone Builds It, Everyone Dies was going to come out, I was really excited. Given that Yudkowsky is the founder of AI safety and has been working in the field for over twenty years, I expected that he’d be able to write a foundational text for the public’s discourse on AI safety. I thought, given the excitement of the moment and the strength of Yudkowsky’s arguments, that this book could create a major shift in the Overton window. I even thought that, given Yudkowsky and Soares’ experience, this book would describe in great detail how modern AI systems work, why advanced versions of these systems could pose a risk to humanity, and why current attempts at AI safety are likely to fail. I was wrong.

Instead of reading a foundational text on AI safety, I read a poorly written and vague book with a completely abstract argument about how smarter-than-human intelligence could kill us all. If I had to explain every reason I thought this was a bad book, we’d be here all day, so instead I’ll just offer three criticisms of it:

1. The Book Doesn’t Argue Its Thesis

In the introduction to the book, the authors clearly bold an entire paragraph so as to demarcate their thesis—“If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything like the present understanding of AI, then everyone, everywhere on Earth, will die.”

Given such a thesis, you would expect that the authors would do the following:

  1. Explain how modern AI systems work
  2. Explain how scaled up versions of modern AI systems could pose an existential risk
  3. Offer examples of current flaws with AI systems that give us good reason to think that scaled up versions would be threatening to humanity
  4. Explain why current approaches to AI safety are deeply flawed
  5. Explain how an AI system could actually kill everyone

Instead, the authors do the following:

  1. Give an extremely brief description of how current AI systems work
  2. Make a vague argument that AI systems will develop preferences that are misaligned with human values
  3. Argue that, in order to satisfy these preferences, AI systems will want to kill everyone
  4. Argue that AI systems, which have these preferences (and are orders of magnitude better than humans across all domains), would kill everyone
  5. Explain how an AI system could kill everyone
  6. Make vague criticisms of modern AI safety without discussing any serious work in the field

Considering what the authors actually wrote, their thesis should have been, “If an artificial intelligence system is ever made that is orders of magnitude better than humans across all domains, it will have preferences that are seriously misaligned with human values, which will cause it to kill everyone. Also, for vague reasons, the modern field of AI safety won’t be able to handle this problem.”

Notably, this thesis is much weaker than, and quite different from, the thesis that they actually chose.

2. The Book Doesn’t Make A Good Foundation For A Movement

Considering that the authors are trying to get 100,000 people to rally in Washington DC to call for “an international treaty to ban the development of Artificial Superintelligence,” it’s shocking how little effort they put into explaining how AI systems actually work and what people are currently doing to make them safe, or into addressing basic counterarguments to their thesis.

If you asked someone what they learned about AI from this book, they would tell you that AIs are made of trillions of parameters, that AIs are black boxes, and that AIs are “grown not crafted.” If you pressed them about how AIs are actually created or how that specific creation process could cause AIs to be misaligned, they wouldn’t be able to tell you much.

And, despite the book being over 250 pages long, the authors barely discuss what others in the field of AI safety are trying to do. For instance, after devoting an entire chapter to examples of CEOs not really taking AI safety seriously, they share only one example of how people are trying to make AI systems safe.

Lastly, the authors are so convinced that their argument is true that they barely attempt to address any counterarguments to it such as:

  1. Current AI systems seem pretty aligned. Why should we expect this alignment to go away as AI systems become more advanced?
  2. Current AI systems rely heavily on reinforcement learning from human feedback, which seems to cause AI systems to be pretty aligned with human values. Why would this approach fail as AI systems become more advanced?
  3. AI safety researchers are currently trying approach X. Why would this approach fail?
  4. If AI systems became seriously mis-aligned, wouldn’t we notice this before they became capable of causing human extinction?
  5. Why should we expect AI systems to develop bizarre and alien preferences when virtually all biological organisms have extremely normal preferences? (For instance, humans like to eat ice cream, but they don’t like to eat, as you mention, jet engine fuel.)

3. The Crux of Their Argument Is Barely Justified

Lastly, the core crux of their argument, that AI systems will be seriously mis-aligned with human values no matter how they are trained, is barely justified.

In their chapter “You Don’t Get What You Train For,” they make the argument that, similar to how evolution has caused organisms to have bizarre preferences, the training process for AI systems will cause them to have bizarre preferences too. They mention, for instance, that humans developed a taste for sugar in their ancestral environment, but, now, humans like ice cream even though ice cream wasn’t in their ancestral environment. They argue that this pattern will extend to AI systems too, such that no matter what you train them to prefer, they will ultimately prefer something much more alien and bizarre.

To extend the analogy from evolution to AI systems, they write:

  1. “Gradient descent—a process that tweaks models depending only on their external behaviors and their consequences—trains an AI to act as a helpful assistant to humans.
  2. That blind training process stumbles across bits and pieces of mental machinery inside the AI that point it toward (say) eliciting cheerful user responses, and away from angry ones.
  3. But a grownup AI animated by those bits and pieces of machinery doesn’t care about cheerfulness per se. If later it became smarter and invented new options for itself, it would develop other interactions it liked even more than cheerful user responses; and would invent new interactions that it prefers over anything it was able to find back in its “natural” training environment.”

They justify this argument with a few vague examples of how this misalignment could happen and then re-state their argument, “The preferences that wind up in a mature AI are complicated, practically impossible to predict, and vanishingly unlikely to be aligned with our own, no matter how it was trained.”

For this to be the central crux of their argument, it seems like they should have given it a whole lot more justification, such as examples of how this kind of misalignment has already occurred. Beyond the fact that we’re capable of simulating the evolution of lots of preferences, their argument isn’t even intuitively true to me. If we’re training something to do something, it seems far more natural to me to assume that it will prefer to do that thing rather than something vastly different and significantly more harmful.

Conclusion

I was really hoping for this book to usher in positive change for how people talk about the existential risks of AI, but instead I was sorely disappointed. If you want to see a more clear-headed explanation about why we should be concerned about AI, I’d recommend checking out 80,000 Hours’ article “Risks from power-seeking AI systems.”


