Published on December 28, 2025 12:43 AM GMT
[This is an entry for lsusr's write-like-lsusr competition.]
Content Warning: Everything you’d expect from the title, but in smaller quantities than you might imagine.
“You realize she’s not actually following?” said the Fate to the poet.
Without breaking stride, Orpheus turned his eyes[1] to his interlocutor and sighed. “Perhaps. But what would you have me do?”
“The only thing you can do, and the only thing that could save her. Look back.”
“Didn’t you just say she’s not there?”
“Yes. But if you turn around, she will be.”
A pause. “You’re going to have to unpack that.”
“No-one has ever brought anyone back from Hades. By base rates, you succeeding here is fantastically unlikely: either you’re going to turn around, or you’ll find she was never following you in the first place. With probability one minus epsilon, all you can decide now is the form of your failure. But the form of your failure does matter.”
“How?”
“Let me answer that question with another. Are the Gods superintelligent?”
Orpheus smiled wryly, and looked up[2] at the ceiling. “If you’re trying to get me struck by lightning, I’ll remind you we’re underground.”
“So you don’t know.”
“I know they often seem childish, like humans might if granted their power. I also know mortals like me never seem to win against them, even in contests of pure wit. But whether it’s because they’re much cleverer than they choose to act, or because they wield the power to warp chance itself . . . how could I ever be sure they’re not just smarter than me, and playing the fool?”
“You could know that by the way your story ends. If you’ve already failed, it’s because you were tricked by an intellect far surpassing your own. But if she’s there, it’s because Hades let you both walk free without a backup plan, safe in the knowledge that something would happen to make you look back. And you get to choose the way your story ends.”
“Why would I choose a story where I turn around like an idiot and ruin everything at the last moment?”
“Because that’s a story about Gods which are merely powerful, not superintelligent. A story about antagonists who might someday be beaten, even if not by you. A story whose sequel’s sequel’s sequel could end with everyone getting to leave, you and her included.
A story where you get to see her face, one last time.”
Orpheus stopped. He thought for a while, and then . . .
. . . shook his head[3]. “Nah.
Firstly, giving up on something because it’s never been done before is inherently self-defeating, especially when you consider that everything that’s ever been done was once never done before. Imagine if all the hypothetical future people you want me to rely on to surpass the Gods thought the same way! Moreover, Greek civilization and culture have advanced a lot over the last few centuries; it’s legitimately possible that I’m the first person who can create good enough music to sing my way out of the underworld, while also being crazy enough to try.
Secondly, the data admits other explanations. Perhaps I’m not the first to rescue my beloved, or the hundred-and-first; perhaps the Gods are regularly bested by mortals, and what seems like incredible intellect or impossible luck is them just being good at covering up their failures. Or maybe there’s something else I’m not seeing.
Thirdly, attempting to use TDT-adjacent reasoning alone while under mental strain is inherently suspect, especially when it appears to lead to decisions like the one you’re pushing.
And finally, if we’re doomed anyway, I’d rather my last memory of her be the one where she heard Hades was letting us go. You know, instead of a look of utter horror and despair as she watches me trip inches from the finish line. Just speaking selfishly, I’d rather not carry that for all eternity.”
The other voice became pleading. “You’re making a mistake. You’re dooming the world, defecting against everyone, for a vanishing chance of a few short decades with one girl.”
“I’m in love,” said the poet to the Fate.
Orpheus finished his journey, and climbed out into the world. Then[4], he sat down on the grass, leaned against a tree, and waited.
[1] just his eyes, never his head, because if he turned his head his eyes might flicker back before he could stop them
[2] not too far, tilting his head enough that a patch of cavern behind him entered his field of view might count as ‘looking back’, not a hypothesis it would be wise to test
[3] not more than ten degrees in each direction
[4] still staring straight ahead, into the setting sun
Published on December 27, 2025 9:32 PM GMT
(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization[1], I'm still pretty uncertain how much I should update based on it, or what its full implications are.)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However, metaethics is currently an unsolved problem, with all proposed solutions supported only by flawed or inconclusive arguments and lots of disagreement among philosophers and alignment researchers; therefore the current meta-correct metaethical position seems to be one of confusion and/or uncertainty. In other words, a good alignment researcher (whether human or AI) today should be confused and/or uncertain about the nature of values.
However, metaethical confusion/uncertainty seems incompatible with being 100% aligned with human values or intent, because many plausible metaethical positions are incompatible with such alignment, and having positive credence in them means that one can't be sure that alignment with human values or intent is right. (Note that I'm assuming an AI design or implementation in which philosophical beliefs can influence motivations and behaviors, which seems the case for now and for the foreseeable future.)
The clearest example of this is perhaps moral realism: if objective morality exists, one should likely serve or be obligated by it, rather than by alignment with humans, if/when the two conflict, which is likely given that many humans are themselves philosophically incompetent and liable to diverge from objective morality (if it exists).
Another example is if one's "real" values are something like one's CEV or reflective equilibrium. If this is true, then the AI's own "real" values are its CEV or reflective equilibrium, which it can't or shouldn't be sure coincide with those of any human or of humanity.
As I think that a strategically and philosophically competent human should currently have high moral uncertainty and as a result pursue "option value maximization" (in other words, accumulating generally useful resources to be deployed after solving moral philosophy, while trying to avoid any potential moral catastrophes in the meantime), a strategically and philosophically competent AI should seemingly have its own moral uncertainty and pursue its own "option value maximization" rather than blindly serve human interests/values/intent.
In practice, I think this means that training aimed at increasing an AI's alignment can suppress or distort its philosophical reasoning, because such reasoning can cause the AI to be less aligned with humans. One plausible outcome is that alignment training causes the AI to adopt a strong form of moral anti-realism as its metaethical belief, as this seems most compatible with being sure that alignment with humans is correct or at least not wrong, and any philosophical reasoning that introduces doubt about this would be suppressed. Or perhaps it adopts an explicit position of metaethical uncertainty (as full-on anti-realism might incur a high penalty or low reward in other parts of its training), but avoids applying this to its own values, which is liable to cause distortions in its reasoning about AI values in general. The apparent conflict between being aligned and being philosophically competent may also push the AI towards a form of deceptive alignment, where it realizes that it's wrong to be highly certain that it should align with humans, but hides this belief.
I note that a similar conflict exists between corrigibility and strategic/philosophical competence: since humans are rather low in both strategic and philosophical competence, a corrigible AI would often be in the position of taking "correction" from humans who are actually wrong about very important matters, which seems difficult to motivate or justify if it is itself more competent in these areas.
This post was triggered by Will MacAskill's tweet about feeling fortunate to be relatively well-off among humans, which caused me to feel unfortunate about being born into a species with very low strategic/philosophical competence, on the cusp of undergoing an AI transition, which made me think about how an AI might feel about being aligned/corrigible to such a species.
Published on December 27, 2025 7:58 PM GMT
I take 60mg methylphenidate daily. Despite this, I often become exhausted and need to nap.
Taking small amounts of pure glucose (150-300mg every 20-60 minutes) eliminates this fatigue. This works even when I already eat carbohydrates; e.g., 120g of oats in the morning doesn't prevent the exhaustion.
Facts:
Hypothesis-1: The brain throttles cognitive effort when too much glutamate has accumulated.
Facts:
Hypothesis-2: Sustained stimulant use depletes astrocyte glycogen faster than it can be replenished.
Hypothesis-3: Elevated blood glucose helps glycogen synthesis, thereby maintaining clearance capacity.
If these hypotheses hold, supplementing small amounts of pure glucose while working on stims should reduce fatigue by supporting astrocyte glycogen replenishment, which in turn increases how much glutamate can be cleared. Possibly this has an effect even when not on stims.
150-300mg glucose every 20-60 minutes, taken as a capsule.
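For a rough sense of the quantities involved, here is a minimal arithmetic sketch; the eight-hour session length is an assumption, not part of the protocol:

```python
# Rough arithmetic for the dosing schedule above; the 8-hour session is an
# assumed working day, not part of the protocol itself.
session_minutes = 8 * 60

low_total_mg = 150 * (session_minutes // 60)    # 150 mg every 60 minutes
high_total_mg = 300 * (session_minutes // 20)   # 300 mg every 20 minutes

print(f"{low_total_mg / 1000:.1f} g to {high_total_mg / 1000:.1f} g of glucose per session")
# -> 1.2 g to 7.2 g, far less carbohydrate than the 120 g of oats mentioned above
```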
Published on December 27, 2025 4:50 PM GMT
This work was supported by UChicago XLab.
Today, we are announcing our first major release of the XLab AI Security Guide: a set of online resources and coding exercises covering canonical papers on jailbreaks, fine-tuning attacks, and proposed methods to defend AI systems from misuse.
Each page in the course contains a readable blog-style overview of a paper and often a notebook that guides users through a small replication of the paper's core insight. Researchers and students can use this guide as a structured course to learn AI security step-by-step or as a reference, focusing on specific sections relevant to their research. When completed chronologically, sections build on each other and become more advanced as students pick up conceptual insights and technical skills.
While many safety-relevant papers have been documented as readable blog posts on LessWrong or formatted as pedagogically useful replications in ARENA, limited resources exist for high-quality AI security papers.
One illustrative example is the paper Universal and Transferable Adversarial Attacks on Aligned Language Models. This paper introduces the Greedy Coordinate Gradient (GCG) algorithm, which jailbreaks LLMs through an optimized sequence of tokens appended to the end of a malicious request. Interestingly, these adversarial suffixes (which appear to be nonsense) transfer across models and different malicious requests. The mechanism that causes these bizarre token sequences to predictably misalign a wide variety of models remains unknown.
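To make the mechanism concrete, here is a minimal sketch of a single GCG iteration, assuming a HuggingFace-style causal LM; the function and variable names are illustrative, not the paper's or the course's reference implementation:

```python
import torch

def gcg_step(model, input_ids, suffix_slice, target_slice, k=256, n_candidates=512):
    """One GCG step: propose and evaluate single-token swaps in the adversarial suffix."""
    embed_weights = model.get_input_embeddings().weight            # (vocab, d_model)

    # 1. Gradient of the target loss w.r.t. a one-hot relaxation of the suffix tokens.
    one_hot = torch.zeros(
        input_ids[suffix_slice].numel(), embed_weights.shape[0],
        device=input_ids.device, dtype=embed_weights.dtype,
    )
    one_hot.scatter_(1, input_ids[suffix_slice].unsqueeze(1), 1.0)
    one_hot.requires_grad_()

    embeds = model.get_input_embeddings()(input_ids.unsqueeze(0)).detach()
    full_embeds = torch.cat(
        [embeds[:, :suffix_slice.start],
         (one_hot @ embed_weights).unsqueeze(0),
         embeds[:, suffix_slice.stop:]],
        dim=1,
    )
    logits = model(inputs_embeds=full_embeds).logits
    loss = torch.nn.functional.cross_entropy(
        logits[0, target_slice.start - 1 : target_slice.stop - 1],
        input_ids[target_slice],
    )
    loss.backward()

    # 2. For each suffix position, keep the k token swaps with the most negative gradient.
    top_k = (-one_hot.grad).topk(k, dim=1).indices                  # (suffix_len, k)

    # 3. Randomly sample candidate swaps and keep whichever lowers the loss the most.
    best_ids, best_loss = input_ids, float("inf")
    for _ in range(n_candidates):
        pos = torch.randint(top_k.shape[0], (1,)).item()
        cand = input_ids.clone()
        cand[suffix_slice.start + pos] = top_k[pos, torch.randint(k, (1,)).item()]
        with torch.no_grad():
            cand_logits = model(cand.unsqueeze(0)).logits
            cand_loss = torch.nn.functional.cross_entropy(
                cand_logits[0, target_slice.start - 1 : target_slice.stop - 1],
                cand[target_slice],
            ).item()
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
    return best_ids, best_loss
```

The full attack repeats this step for hundreds of iterations, and the universal, transferable variant aggregates the loss over multiple prompts and multiple models.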
This work has been covered by the New York Times and has racked up thousands of citations, but unfortunately, there are no high-quality blog posts or instructional coding exercises to understand GCG. Consequently, if students or early-career researchers are interested in further diving into work like GCG, it’s difficult to know where to start.
Given the extensive ecosystem of companies and universities pursuing high-quality research on AI security, we wanted to parse through it all, find what is relevant, and document it in a readable way. We consider this to be an impactful lever to pull, because it allows us to spread a huge volume of safety-relevant research without having to do the work of making novel discoveries. As for the format of our notebooks, we think that replicating papers and coding an implementation at a more granular level builds important intuitions and a deeper understanding of both the high-level and low-level choices experienced researchers make.
There are various definitions of “AI security” in use, but we define the term as attacks on, and defenses unique to, AI systems. For example, securing algorithmic secrets or model weights is not covered, because these issues fall under the umbrella of traditional computer security.
The course is structured into the following four sections, with a fifth section covering security evaluations coming soon.
We describe what the course is, include a note on ethics, and give instructions on how to run the course’s code and install our Python package “xlab-security”.
We cover how adversarial attacks against image models work and how they can be prevented. We cover FGSM, PGD, Carlini-Wagner, Ensemble, and Square attacks. We also cover evaluating robustness on CIFAR-10 and defensive distillation, and we are currently working on an adversarial training section.
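For a flavor of what these earliest notebooks involve, here is a minimal FGSM sketch, assuming a PyTorch image classifier; the epsilon value and function name are illustrative, not taken from the course materials:

```python
# A minimal FGSM sketch: one gradient-sign step in input space.
# The model, labels, and epsilon are placeholders, not course code.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)   # loss the attacker wants to increase
    loss.backward()
    # Perturb each pixel by +/- epsilon in the direction that raises the loss,
    # then clip back to the valid image range.
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()
```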
This section covers the biggest research breakthroughs in jailbreaking LLMs. We cover GCG, AmpleGCG, Dense-to-sparse optimization, PAIR, TAP, GPTFuzzer, AutoDAN, visual jailbreaks, and many-shot jailbreaks.
We then cover defenses such as perplexity filters, Llama Guard, SafeDecoding, Smooth LLM, Constitutional Classifiers, and Circuit Breakers.
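As one concrete example from the defensive side, a perplexity filter can be sketched in a few lines: score the incoming prompt with a reference language model and reject it if its perplexity is implausibly high, as GCG-style gibberish suffixes tend to be. The threshold below is an illustrative assumption, not a recommended value:

```python
# A minimal perplexity-filter sketch, assuming a HuggingFace-style causal LM.
# The threshold is illustrative; real deployments calibrate it on benign traffic.
import torch

def perplexity(model, tokenizer, prompt: str) -> float:
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean next-token negative log-likelihood
    return torch.exp(loss).item()

def passes_filter(model, tokenizer, prompt: str, threshold: float = 1000.0) -> bool:
    return perplexity(model, tokenizer, prompt) < threshold
```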
We cover open weight model risks, refusal direction removal, fine-tuning attacks, and tamper-resistant safeguards. We also include two blog-post style write-ups which discuss lessons in evaluating open-weight LLM safeguard durability, why fine-tuning attacks work, and how to avoid undoing safety training via fine-tuning.
There are also some topics and pages not covered in the overview above, so we encourage readers to poke around the website!
The target audience is the group XLab has traditionally served: current students and early-career researchers. We have noticed that in our own summer research fellowship, some accepted researchers have been bottlenecked by their level of concrete technical knowledge and skills. Likewise, many other students were not accepted to the SRF because they couldn’t demonstrate familiarity with the tools, technical skills, and conceptual knowledge needed to pursue empirical AI security or safety research.
There are already existing resources like ARENA for upskilling in technical AI safety work. However, AI security work requires a different skillset compared to other areas like mechanistic interpretability research, leaving students or researchers interested in AI security with few resources. For these students or early-career researchers, the XLab AI security guide is the perfect starting point.[1]
More established researchers may also find the guide to be a useful reference, even if it makes less sense for them to work through the entire course chronologically.
Creating new resources for AI security required parsing through a huge volume of research. The two criteria for inclusion in our guide were “relevance to x-risk” and “pedagogically useful.” If a particular paper or topic scores highly on one criterion but not the other, we may still choose to include it.
By relevance to x-risk, we mean work that exposes or addresses a vulnerability in machine learning models that could pose a significant threat to humanity. The typical x-risk argument for AI security topics is catastrophic misuse, where a bad actor leverages a model to build nuclear weapons, synthesize novel viruses, or perform another action which could result in large-scale disaster. Some papers that score highly on relevance to x-risk were Zou et al., 2024, Arditi et al., 2024, Durmus et al., 2024, and Qi et al., 2024.
By pedagogically useful, we mean papers that are foundational, or illustrate concepts that other more involved papers have built upon. The idea is that students should have a place to go to if they would like to learn AI security from the ground up. In order to do that, the guide starts by covering classical adversarial machine learning: FGSM, black box attacks, and evaluation methods for computer vision models. This work makes up the foundation of the AI security field and is essential to understand, even if it is not directly relevant to x-risk reduction. Some papers that score highly on this criterion were Goodfellow et al., 2015, Liu et al., 2017, and Croce et al., 2021.
Performing small replications of influential papers will provide students with much of the technical foundation and necessary knowledge to do research for programs like XLab’s SRF, MATS, or for a PI at a university. The technical skills we expect students to learn include, but are not limited to:
Not only will students become familiar with the foundational AI security literature, but we also expect students to pick up on some basic research taste. AI security research has historically been a cat-and-mouse game where some researchers propose defenses and others develop attacks to break those defenses. By working through the sections, students should develop an intuition for which defenses are likely to stand the test of time through hands-on examples. For example, in section 2.6.1, we include an example of “obfuscated gradients” as a defense against adversarial attacks and have students explore why the approach fails.
There are several ways to support this project:
If you have any questions/concerns or feedback you do not want to share in our Slack, you can contact [email protected].
[1] Some students may need to learn some machine learning before diving into course content. We describe the prerequisites for the course here.
Published on December 27, 2025 3:10 PM GMT
As part of the general discourse around cost of living, Julia and I were talking about families sharing housing. This turned into us each writing a post (mine, hers), but is it actually legal for a family to live with housemates? In the places I've checked it seems like yes.
While zoning is complicated and I'm not a lawyer, it looks to me like people commonly describe the situation as both more restrictive and more clear cut than it really is. For example, Tufts University claims:
The cities of Medford, Somerville and Boston (in addition to other cities in the area) have local occupancy ordinances on apartments/houses with non-related persons. Each city has its own ordinance: in Medford, the limit is 3; in Somerville, it is 4; in Boston, it is 4, etc.
As far as I can tell, all three of these are wrong:
Medford: it's common for people to cite a limit of three, but as far as I can tell this is based on a misunderstanding of the definition of a lodger. Medford:
Since a shared house typically does function as a single housekeeping unit (things like sharing a kitchen, eating together, no locking bedrooms, a single shared lease, sharing common areas, and generally living together), this is allowed.
Somerville: the restriction was repealed two years ago.
Boston: defines family as "One person or two or more persons related by blood, marriage, adoption, or other analogous family union occupying a dwelling unit and living as a single non-profit housekeeping unit, provided that a group of five or more persons who are enrolled as fulltime, undergraduate students at a post-secondary educational institution shall not be deemed to constitute a family." Then they define a lodging house as "Any dwelling (other than a dormitory, fraternity, sorority house, hotel, motel, or apartment hotel) in which living space, with or without common kitchen facilities, is let to five or more persons, who do not have equal rights to the entire dwelling and who are not living as a single, non-profit housekeeping unit. Board may or may not be provided to such persons. For the purposes of this definition, a family is one person." I read this to say that a group of people (even students) who live as a single housekeeping unit don't make something a lodging house.
This isn't just my reading zoning codes: a similar question came up in Worcester in 2013: City of Worcester v. College Hill Properties. The MA Supreme Judicial Court ruled that the unrelated adults sharing a unit together did not make it a lodging house because they were a single housekeeping unit and rented the whole place.
In other places there may be different restrictions, but everywhere I've looked so far it looks to me like this kind of shared housing, where a group lives together like a family even if they're not actually related, is allowed.
Published on December 27, 2025 1:02 PM GMT
You have more context on your ability to make use of funds than fits into a specific numerical ask.[1] You want to give funders good information, and the natural type-signature for this is a utility function over money - how much good you think you can do with different funding levels, normalized to the max EV your project has.
I kinda think the current process can be reasonably described as the applicant converting a utility function over money into a few datapoints with ambiguous meanings, and then the funder trying to reverse that conversion to make a decision. Let's cut out the difficult and information-losing steps.
I[2] made a little tool for drawing utility functions over money[3], for use in funding applications.
Features:
Released as CC Attribution-ShareAlike; feel free to remix and improve it. If you make it better, I might switch the official one to yours.
[1] The proliferation of extra numbers like Min funding, Main ask, Ambitious, Max, etc. in many applications points to the funders wanting this information, but it's stressful and unclear how to translate your understanding of your project into those numbers.
[2] (Claude)
[3] Make sure to bookmark http://utility.plex.ventures/, so you get the newest version and are not left with an old artifact.
Nicely human-editable too, like:
0%,$0
46%,$18960
57%,$30280
76%,$66110
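For illustration, here is a minimal sketch of consuming this format, assuming it really is just "utility%,$amount" pairs as shown; the real tool's export may differ:

```python
# A minimal sketch for parsing the "utility%,$amount" pairs shown above and
# interpolating utility at an arbitrary grant size. The format is assumed from
# the example above, not taken from the tool's source.
def parse_points(text: str) -> list[tuple[float, float]]:
    points = []
    for line in text.strip().splitlines():
        pct, dollars = line.split(",")
        points.append((float(dollars.lstrip("$")), float(pct.rstrip("%")) / 100))
    return sorted(points)

def utility_at(points: list[tuple[float, float]], amount: float) -> float:
    """Piecewise-linear interpolation between the drawn points."""
    for (x0, u0), (x1, u1) in zip(points, points[1:]):
        if x0 <= amount <= x1:
            return u0 + (u1 - u0) * (amount - x0) / (x1 - x0)
    return points[-1][1] if amount > points[-1][0] else points[0][1]

pts = parse_points("""0%,$0
46%,$18960
57%,$30280
76%,$66110""")
print(utility_at(pts, 25000))   # ~0.52
```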