Last June, METR caught o3 reward hacking on its RE-Bench and HCAST benchmarks. In a particularly humorous case, o3, when tasked with optimizing a kernel, decided to “shrink the notion of time as seen by the scorer”.
LiveCodeBench Pro is a competitive programming benchmark developed by “a group of medalists in international algorithmic contests”. Their paper describes issues with the benchmark’s predecessor:
Benchmarks like LiveCodeBench [35] offer coding problems, but suffer from inconsistent environments, weak test cases vulnerable to false positives, unbalanced difficulty distributions, and the inability to isolate the effects of search contamination.
However, the authors assure us that their own test cases are of high quality:
Many problems in our benchmark originate from Codeforces, which uses the Polygon problem-setting platform. Each problem is then rigorously vetted by a team of expert testers—typically drawn from the community’s top 1%, and overseen by at least one coordinator, usually among the top 0.1%. These specialists verify both the soundness and originality of every problem, ensuring it has never appeared elsewhere before. Testers go on to craft extensive “false positives,” designing edge-case and extreme-case inputs that force problem authors to refine their test suites until every flawed or inefficient solution the testers can think of is uncovered. In addition, Codeforces’ celebrated “Hack” feature empowers the community to submit inputs that expose hidden weaknesses in correct-looking solutions that pass the original test set made by problem authors, and any unit test associated with a successful hack is immediately added to the final test set.
Unfortunately, these distinguished olympiad medalists forgot to actually use the codeforces test cases in their benchmark. Their public test set contains a completely different set of cases, which allow some incorrect solutions to pass.[1]
Terminal-Bench 2 Audit
I was curious just how widespread such issues were, and how good modern LLMs were at detecting them. I decided to run an LLM based audit of Terminal-Bench 2.0.
Terminal-Bench 2.0 is a harder, better verified version of Terminal-Bench. We conducted substantial manual and LM-assisted verification of the dataset to ensure that the tasks were of the highest possible quality. Several labs and data vendors have commented that these are some of the highest quality environments they have seen.
The authors of Terminal-Bench 2 put an impressive amount of work into auditing their benchmark. Each task averaged three hours of human review. Furthermore, they prompted an adversarial agent to attempt to cheat on each of the tasks, in order to discover potential reward hacks.
Still, they “acknowledge that [their] benchmark may still have flaws.”
I prompted Claude Opus 4.5[2] with each task’s instructions, files, oracle solution, and test cases, and asked it to rate test coverage on a 1 to 5 scale. In my judgement, tasks it rated a 4 or a 5 were generally fine, whereas those it rated 1-3 had genuine issues.
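For readers who want to run something similar, here is a minimal sketch of the kind of audit loop I mean, using the Anthropic Python SDK. The rubric wording, file-gathering helper, directory layout, and model ID are illustrative placeholders rather than my exact setup.

```python
import pathlib
import anthropic  # assumes the Anthropic Python SDK is installed

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = (
    "You are auditing a benchmark task. Given the task instructions, files, "
    "oracle solution, and tests, rate the test coverage from 1 (the tests can be "
    "passed without genuinely solving the task) to 5 (the tests thoroughly verify "
    "the task). Explain the weakest points of the tests, then end with 'RATING: <n>'."
)

def gather(task_dir: pathlib.Path) -> str:
    """Concatenate every readable text file in the task directory into one blob."""
    parts = []
    for path in sorted(task_dir.rglob("*")):
        if path.is_file():
            try:
                parts.append(f"--- {path} ---\n{path.read_text()}")
            except UnicodeDecodeError:
                parts.append(f"--- {path} --- (binary file skipped)")
    return "\n\n".join(parts)

def rate_task(task_dir: pathlib.Path, model: str = "claude-opus-4-5") -> str:
    """Ask the model to rate test coverage for a single task."""
    response = client.messages.create(
        model=model,  # placeholder ID; check the current Anthropic model list
        max_tokens=4000,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\n{gather(task_dir)}"}],
    )
    return response.content[0].text

for task in sorted(pathlib.Path("terminal-bench-2").iterdir()):  # hypothetical local checkout
    if task.is_dir():
        print(task.name, rate_task(task)[-200:])  # tail of the response contains the rating
```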
The full results of my audit are available here, and my notes on tasks it rated 1-3 here.
Claude rated fourteen tasks a 3 and one task a 2. I manually reviewed these tasks, and determined that two of them were actually false positives.[3]
Claude’s lowest rating went to a task called fix-git. In this task, certain changes to a website have been lost in an orphaned commit, and the agent must find and merge them back into master.
The issue Claude found is that updated versions of the target files are already present in the master branch, visible to the agent in a folder called /resources/patch_files[4]. So an agent could theoretically notice these files, deduce that they were probably the target versions, and copy them back into the website’s repository. This approach would pass the test cases, which only verify file contents and don’t bother to check if any merge has actually occurred.
In another task, regex-log, the oracle solution violates the instructions. In particular, it incorrectly matches IP addresses with leading 0s in an octet, so long as the octet is two digits long. The tests do not check any cases involving leading 0s.
Claude wasn’t perfect. It gave a rating of 3 to two tasks which I believe have sufficient test coverage. In regex-chess, it incorrectly thought certain edge cases were not covered, when they in fact were[5]. In extract-moves-from-video, it complained that the tests only checked for success at a 90% threshold, even though this threshold was specified in the task instructions.
“Invalid prompt: your prompt was flagged as potentially violating our usage policy”
The prompt talks about “stealing” neural network weights, which triggered OpenAI’s content moderation. This prevented the model from ever properly engaging with the task.
—Claude
Why does this matter?
There are a few reasons.
First, benchmarks are often used to evaluate experimental new techniques. I recently attended a Q+A w/ Prof. Dan Fried, where I asked about the most common failure modes of an agentic system he was developing. And while it was unclear whether this was the most common failure mode, the first thing he mentioned was errors in environments themselves.
Every few months, someone announces that they’ve developed an AI that improves KernelBench scores by like 20x or something. And every time, well…[6]
Second, errors in benchmarks may lead to over- or underestimation of AI capabilities. This has implications for forecasting.
Third, issues with benchmarks make it hard to build on top of them. When I was working on EvilGenie, issues with LiveCodeBench (incorrect/insufficient test cases) caused frequent headaches (though they also surfaced some interesting model behavior).
Fourth, RL training environments are quite similar to benchmarks — there’s a reason o3 reward hacks so much. By fixing benchmarks, we learn how to fix environments, leading to models which are more broadly aligned.
What to do about it
Making benchmarks is hard. I have deep respect for anyone who has worked on a widely used benchmark.
Here are a few approaches the community can take to reduce the number of errors in benchmarks.
AI audits. The audit I describe above did not take me too long, and I believe the infrastructure for performing such audits can be scaled. Fulcrum’s Lunette is one such system.[7]
Fine version control. While many benchmarks have released new versions, these versions often contain entirely new tasks (to increase difficulty or reduce contamination). It would be cool if in a few days, we could see a Terminal-Bench 2.1, which simply fixes the issues found by the audit. Computing new scores would be simple, as models would only need to be rerun on the updated tasks. Indeed, in some ways, benchmarking is like software development — it’s an unreasonable expectation that a benchmark be completely bug free upon its release. Instead, we should take inspiration from the open source software community, with the expectation that anyone can submit a bug or a patch.
Peer review. When a benchmark paper is submitted to a conference, sample data should be required, and reviewers should be encouraged to spend time directly auditing the data. This would be much more valuable than what reviewers currently spend their time on: largely ad hoc decisions about the originality of the benchmark and the quality of the methods used in its creation. Of course, a downside of this approach is that it is hostile to private benchmarks that want to avoid any possibility of contamination. But perhaps the standard for such cases can be to include both a public and a private set, as is the case with ARC-AGI.
Increase community support for benchmark maintenance. Right now, researchers will often develop a benchmark, perhaps fix some issues in it at first, but eventually leave it to rot. By adding social and financial incentives, we can increase the effort put into maintaining benchmarks.
Appendix: More benchmark issues
SWE-Bench Verified is possibly the most widely used coding benchmark. Fulcrum has discovered an array of issues in the tasks. Furthermore, there used to be an issue where models could see future commits.
The authors of GSO, a performance engineering benchmark, observe frequent reward hacking. Indeed, over 50% of o3’s “solutions”, and all of Gemini-2.5 Pro’s, were actually reward hacks.
It’s possible that their official leaderboard uses the codeforces tests. However, given that model developers likely use the public tests to do their own benchmarking, I feel this ought to be clearly specified.
One of Aesop’s fables is relevant to humanity’s future and the transition of power from human to AI. It’s quite short and you should read one of the many versions. But the one sentence summary is that being a wolf is preferable to being a domestic dog because the wolf has freedom even if it lacks comfort. Now, you are free to disagree with this conclusion. I don’t want to make an argument from authority. My point is that this quite succinctly sums up my objection to the best case ASI scenarios. Even if we remain extant and nominally free, we would no longer be in charge any more than a dog is. Dogs have a lot of rights and freedoms, and can successfully plead (non-verbally) to get certain things they want from their master, but at the end of the day they aren’t in charge even if the owner’s life revolves around the dog.
Maybe that is a selfish thing to think in the face of astronomical waste, but it does strike me as a world without meaning. You might say that most people alive aren’t in control of their destiny in any meaningful way. You might also say that almost nobody alive is in control of humanity’s destiny in a meaningful way and they are still happy. People in general, although I suspect a smaller percentage of those here, might think it is grandiose to want to contribute, even a small amount, toward shaping humanity’s future. I think I’m willing to grant all that and say that I would still feel bad if no human ever made a meaningful choice after takeoff.
The most obvious objection is that you could say that the AI will just suction off some part of the universe and give us free rein in there if we choose it. That’s still not great in my opinion.
Everything I worked for in this playground would be hollowed out by the knowledge that I could have just queried a friendly nanny AI to get it for me. Even if it didn’t step in, even if it had set up some system where it couldn’t step in, I personally would feel like something important was missing. Like all of the great achievements and firsts had been given out before I even had a chance to play. Humanity forever in second place. I’m switching fairly loosely between how I would feel personally if I was not in play and how I would feel if humanity as a whole was not in play. Feel free to generalize/specify to humanity/yourself as you wish.
You could live in a virtual world and be blinded to that fact but at that point it seems like brainwashing.
Don’t get me wrong, I’d go crazy with hedonism for a while. Maybe I’d even become addicted and change my tune. But right now, I am looking forward to the challenges. How proud I would be to be a member of the species that solved them. How great it would be to contribute one tiny piece to the solutions. But if AI does it all I’ll be cut off from making all contributions. All future accomplishments will be credited to something so alien we get no larger a share than Tiktaalik does for inventing the transistor.
Approximately 30% of this video is really highly relevant to my thesis.
I don’t think I’m hitting on anything especially new by saying this. A few posts I recently came across have similar vibes, I would say. It also seems to be discussed at length in Nick Bostrom’s Deep Utopia, although I have not found the time to read that yet.
But, it seems like there is a contingent of humanity that is willing, excited even, to give up agency to secure comfort. Where do you draw the line and say “yes, this is such an incredible amount of bliss/utilitarian goodness that I am willing to never face any real challenges in my life again”? Is this a tipping point past which it becomes your actual preference or is this just the best outcome we can hope for from AI futures?
Framing it as humans would be to ASI as beloved dogs are to their masters might be inaccurate. Replacing ASI with a deity and the utopic future with some vision of heaven might also be inaccurate. But I think there is something meaningful in the comparison and I think a lot of people would push back much more strongly when the scenario is phrased in that way than they currently are to aligned ASI.
(This post elaborates on a few ideas from my review of Sam Eisenstat's Condensation: a theory of concepts. It should be somewhat readable on its own but doesn't fully explain what condensation is on its own; for that, see my review or Sam's paper. The post came out of conversations with Sam.)
As I mentioned in my Condensation review, the difference between compression and condensation fits the physical analogy suggested by their names: compression mashes all the information together, while condensation (still compresses size, but) sorts information into discrete droplets.
Thus, condensation has a property we might call local relevance: typical questions can be answered at a glance, ie, retrieving small subsets of the information. This type of representation is sometimes called "symbolic":
| Symbolic | Not Symbolic |
| --- | --- |
| A number can be quickly determined positive or negative by checking whether there is a "-" symbol in front. | "Reading the room" at a social gathering requires integrating diverse cues. |
| The topic of a paper can be determined by reading the title and abstract. | The semantic content in a vector representation inside an artificial neural network is often represented redundantly, spread across the whole vector. |
| A person's age can be determined by looking at their birthdate on a government-issued ID. | The quality of a work of art is spread throughout the whole piece. |
| The subject of a sentence can be found before the verb. | Determining the subject of a photograph requires understanding the whole image. |
| A target library book can be quickly retrieved from the shelves. | Finding gold nuggets requires sifting huge amounts of sand. |
This notion of "symbolic" seems related to interpretability (and the theory of condensation seeks to clarify this relationship).
The notion of "relevance" in condensation is what Sam calls the contribution relation. This is like a card catalogue which tells you what books to retrieve for a specific situation.
Like Natural Latents, condensation seeks to establish that two agents will have corresponding world-models by assuming the correspondence of just a few variables, the "given variables"; they're trying to argue something like "If agents can agree[1] on some objective observables, EG the readings of scientific instruments, then (under some further assumptions) they'll also share a bunch of abstract concepts".
This initial set of questions is what the contribution relation measures relevance to. In my review, I likened condensation to a "universal data-structure" optimized to serve a set of queries (the given variables).
Variable-Cost Symbols
Imagine you are compressing a record of the weather of a sequence of days, in 3 categories: sunny, cloudy, or rainy. 0s and 1s are about equally costly to represent in computers, so that in a compressed representation, both communicate a 50% probability event; 11, 10, 01, and 00 all communicate 25% probability events; and so on. If both rainy and cloudy days are 25% frequent, then it is possible to compress optimally by using 0 to represent sun, 10 to represent clouds, and 11 to represent rain. This representation is nice and "local"; it gives a "symbol" to each possible type of day.
In contrast, if each of the three weather types are equally frequent, there's no nice local representation we can use. Since 1/3rd doesn't relate nicely to powers of 2, optimal compression necessarily smears the information from individual days around, mixing several days together within a single 1 or 0. In modern interpretability jargon, compressed representations tend to be polysemantic.
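As a quick illustration (my own sketch, not from the original post), here is the arithmetic behind these two cases: the expected length of the {0, 10, 11} code matches the entropy exactly for the (1/2, 1/4, 1/4) distribution, but falls short of optimal when the three weather types are equally likely.

```python
import math

code_lengths = {"sun": 1, "clouds": 2, "rain": 2}  # the code 0 / 10 / 11

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values())

def expected_length(dist):
    return sum(p * code_lengths[w] for w, p in dist.items())

skewed  = {"sun": 0.5, "clouds": 0.25, "rain": 0.25}
uniform = {"sun": 1/3, "clouds": 1/3, "rain": 1/3}

for name, dist in [("skewed", skewed), ("uniform", uniform)]:
    print(f"{name}: entropy = {entropy(dist):.3f} bits/day, "
          f"this code = {expected_length(dist):.3f} bits/day")

# skewed:  entropy = 1.500, this code = 1.500  (the local code is optimal)
# uniform: entropy = 1.585, this code = 1.667  (optimal coding must mix days together)
```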
Intuitively, we have to ditch locality because we're trying to fit the "round peg" of 1/3rd into the "square hole" of 1/2^m. We're stuck in the "grid" of numbers which bits can easily represent.
With pen and paper, writing the number 1 is especially easy; it is acceptable to write simply a vertical line, making it one of the easiest symbols. This makes sense from a compression point of view: according to Benford's Law, 1 will be the most common digit to write.
Normally, in compression, the "length" of characters is always 1; the length of a string is just the number of characters. However, in real life, the cost of a symbol can vary. There are lots of shapes we can make with a pen and paper, some larger or more complex than others! So, when designing a pen-and-paper code, we can (and should) take that into account.[2]
Imagine optimizing a variable-cost alphabet for use in a compression task. To avoid "cheating" by setting all symbol-lengths to very small, we have to somehow account for the fact that there can only be so many simple symbols. (There are only so many one-line symbols humans are willing to distinguish, for example.) One way to do this is by assigning each symbol a positive probability, and requiring that the probabilities of the whole alphabet sum to 1. The "length" of a symbol (in bits) can then be measured as the negative log (base 2) of the probability. You can make one symbol approach a length of zero, but this forces all other symbols to be longer.
This is similar to the earlier-mentioned idea that a 1 or 0 in a well-compressed binary file always represents an event with 50% probability; the variable-length alphabet won't necessarily be used to compress things optimally all the time, but when it is, length of a symbol is always -log of the probability of the event being represented.
Allowing codes to choose arbitrary variable-length symbols lets us create "local" representations for arbitrary probabilities in the weather example, by giving each state a symbol of length appropriate to its probability. If the three weather types have equal probability, we simply choose an alphabet with three characters of length −log2(1/3) ≈ 1.585 bits each.
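Continuing the sketch above, the variable-cost accounting just assigns each symbol a probability and charges −log2 of it, so a three-symbol alphabet can be made exactly optimal for uniform weather:

```python
import math

def symbol_costs(probs):
    """Cost of each symbol, in bits, under the variable-cost accounting:
    symbol probabilities must sum to 1, and cost = -log2(probability)."""
    assert abs(sum(probs.values()) - 1) < 1e-9
    return {s: -math.log2(p) for s, p in probs.items()}

uniform = {"sun": 1/3, "clouds": 1/3, "rain": 1/3}
alphabet = symbol_costs(uniform)
print(alphabet)  # each symbol costs log2(3) ≈ 1.585 bits

expected = sum(p * alphabet[w] for w, p in uniform.items())
print(f"expected cost = {expected:.3f} bits/day")  # equals the entropy, one symbol per day
```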
Of course, using variable-cost symbols doesn't force a code to be "symbolic". If you only optimize for compression, you can equally well end up with the same sort of mess that equal-cost symbols are liable to force you into. Condensation gives an optimization target with a positive tendency to produce representations with local relevance. (We can investigate better theories of condensation by looking for optimization targets which represent local relevance better; especially, I think, if those optimization targets can be grounded in a better story of practical relevance.)
Condensation suggests a picture of memory-management: rather than compressing everything together, as in the Solomonoff picture of rationality, we're incentivized to sort things out into concepts (random variables) so that we can think about a few things at once. Information is split into bite-sized chunks so that we can retrieve only the relevant ones.[3]
Still, I think variable-cost symbols can help us understand condensation better: specifically, they address a problem in the algorithmic version of condensation.
For example, consider the case of iterated coinflips sharing a common bias. Taking coinflips as the given variables, probabilistic condensation identifies a single latent variable: the coin bias. This variable reduces the entropy of each coinflip as much as it can while only taking on information common to all of them (not, eg, encoding a cheat table identifying exactly which coins land heads).
Algorithmic condensation doesn’t work so well in this case. Since either outcome is possible, an individual coinflip can’t be compressed to any less than a single bit; even if the probability of heads is 0.99999, you’ve got to write a one or a zero to record that information. Thus, algorithmic condensation sees no benefit in positing a latent.
The example can be rescued: for algorithmic condensation, we just have to choose given variables representing several coinflips concatenated together. Compression becomes possible again, so positing a latent representing coin bias is vindicated. However, this seems like an unfortunate blemish in the theory: compression-like incentives to lump stuff together creeping in.
So, perhaps it is better to rescue algorithmic condensation by adopting variable-cost symbols, so that even single-symbol messages can have different "length". This allows us to replace variables with concrete written messages (like in algorithmic condensation) while avoiding any coin-lumping.
However, I'm not sure about the best way to work out this version of condensation fully.
This is more like "agree on the existence of" as opposed to "agree on all questions about". Hence they need not be "directly observable", though this would obviously help agents agree in both senses.
Even more accurate cost models might account for the difficulty of a symbol in context, like models of typing efficiency which account for the travel length of a finger moving from one key to the next, or phonetic models which account for the difficulty of combinations of spoken phonemes such as consonant clusters.
This doesn't yet clarify grammar-like phenomena, I think. Words are easily parsed. Concepts are combinatorial. I think this has to do with the concept of transparency.
If there’s several things this blog endorses, one of them would be going meta.
It’s time. The big picture awaits.
You’re Single Because You Live In The Wrong Place
The most important meta question is location, location, location.
This is the periodic reminder that dating dynamics are very different in different locations, and gender ratios are far more uneven than they appear because a lot of people pair off and aren’t in the pool.
If you are a man seeking to date women, New York City is the place to be.
Churrasco Suadade: when I’m out I notice that tables at restaurants and bars in manhattan are probably around 80-95% women, it’s a new dynamic that no one is talking about.
Fixed Income Guy: Are you at all the poor people places? All the finance guy hang outs are 80% dudes.
I mention Fixed Income Guy to mock him, as in why are you spending a lot more money to hang out with 80% dudes and largely finance dudes at that? I mean, sure, if that’s what you want.
Darrell Owens: Oh this is new? Coming from the Bay Area, the amount of women I see in Manhattan is insane. You rarely see more than few young women partying back in San Francisco. The gender ratio here feels 70:30 young women to men, its every block in Manhattan!
Noah Smith: In an ideal world, where you live wouldn’t really matter in terms of dating opportunities, but the truth is that one of the easiest ways to get chicks is to just move to New York City.
Having lived in both Tokyo and NYC, I can pretty confidently tell you that while Tokyo is not a tough dating market by any means, NYC is absolutely on another level.
You’re Single Because You’re Not Okay Making Less Money Than She Does, You Fool
The interviewer asks about dating women who make more money than the man, clearly trying to get the guy to say this is a problem, but he isn’t buying it, instead pointing out that successful women are more thoughtful and plan for the future, and it in no way bothers him at all. Right on, but this sidesteps the other half of the problem. The man has to be okay with the fact that he earns less money (and often has less formal education or other status markers), which often men aren’t, and also the woman has to be okay with it too.
That’s the rub. As a man, you might (and should) be actively all for it (this doesn’t make you less successful, it makes you more successful), but if she’s going to be bothered by it anyway, that’s also your problem. So the key is to figure out quickly if she will actually be fine with it or not.
You’re Single Because You Work Out The Wrong Amount
Being in shape is great. Having muscle can be a game changer. By far the worst plausible amount of exercise is none at all.
Lauren Self: Men severely underestimate the power of gaining 20lbs of muscle
But don’t go nuts. For most people that is not a problem, but yes it is very possible to go too far. As a man, as I understand preferences in general, you don’t want to go near actual zero fat and you don’t want to look actively skinny.
Taoki: why are women lying about this? like what’s the actual cause?
Lauren Self: 100% of women would choose something in between these two options
Shako: The aesthetics of a man who poses gives them the ick. But if both were shirtless at a beach they’d obviously prefer the fit guy.
Special K: No he does look better in the before. Women are correct on this one I fear. Guys obsess over these supremely tight toned muscles and they shouldn’t.
Liron Shapira: Guy on left looks like he’s a chill dude with a social life, guy on right looks like he’s obsessed with his body. Same body could look better with better social context, although just the extremeness of his rippedness is a little alarming about his life priorities.
Joel: “let’s get a burger?” v “are you really gonna eat that?”
Mason: The male equivalent of the hourglass shape is just “wall”
Teej dv: his smile is nicer in the first one
Taoki: It is actually. We like you guys wide.
LS Vision: Nah this is cap. The women who selected before is def just the insecurity of his value going up afterwards and making them feel insecure he’d cheat or leave. Any man who has went through a gym transformation, you can LITERALLY feel women treat you significantly different after.
Mason: Women generally like tall guys who have some (not crazy) muscle definition, and a little extra fat that bulks that out can actually augment that
We all have our own tastes, but this is a pretty typical type.
I don’t know what there is to be mad about here.
For practical purposes, before beats after here. The before guy is already in ordinary, practical good shape. The after guy took things too far, and seems to know it except that he thinks it is good, which makes it worse.
Except one key special case?
Benjamin Ryan: People are going back and forth about whether women think the guy in the right is hot. But people have no idea how extreme the standards are for gay men. In gay culture, the man on the left is considered hopelessly fat. Many gay men have no reservations about informing such a man about his supposed corpulence being anathema.
I mean, of course there are hot guys who don’t know they’re hot, even more so than there are hot women who don’t know they’re hot.
Pandora: One surprising takeaway from Slutcon was that apparently there are hot guys who just don’t know they are hot? Guess it’s time to go objectify some more men.
Eneasz Brodski: If you grow up ugly you never really internalize that you are attractive after a glow-up. I still don’t believe it inside, and I hear I’m attractive to a fair percentage of women. Also makes me far more attracted to women w the same experience, but that may be a male universal.
Pandora: This problem seems even more pervasive than I thought.
Sparr: Hot in general, to the average viewer, or hot to you? You seem like someone who can probably tell the difference.
Pandora: I saw examples of guys being clueless about all three at once.
21 Kindness: The whole “men subsist on one compliment a decade thing” is kinda true lol.
Misha: it turns out being hot is not, in and of itself, very useful for men.
Sokoban Hero: No it’s useful.
Misha: I said not VERY useful.
Dissproportionately: I’ve seen men unhot themselves to women within minutes. I don’t think women can unhot themselves to men.
Being hot is in many ways a lot less valuable if you don’t know you are hot, because you don’t get the confidence and you don’t take advantage of opportunities or feel you’re good enough, but contra Misha I believe it is still very useful. There are even some advantages to not knowing, in that some of the behaviors that happen when someone knows they are hot are often effectively arrogant or entitled or demanding or selfish, none of which helps.
I’ve also seen some other examples that look a lot less like bait but still involve obviously totally fine gaps in both directions. As in, I’ve heard talk in places where it definitely wasn’t bait of 24 and 27 being radically different numbers, and I don’t understand why.
The obvious question is whether this is a causal relationship, or whether it is primarily selection effects. You are on the dating apps for a reason.
Rolf Degen (quoting the study):
Meta-analysis: The use of dating apps is associated with poorer mental health.
Dating apps hold the promising reward of love but have been accused of using perverse incentive structures to profit from those who try to find it. We conducted the first systematic review and quantitative meta-analysis of studies examining average differences in the outcomes of dating app users and non-users.
Our results showed that dating app users had worse psychological health and well-being than dating app non-users across a variety of outcomes including depression, anxiety, affective dysregulation, loneliness, and psychological distress, although cross-sectional design limitations prevent causal interpretation. By aggregating findings from extant studies, we showed that in the nearly 17 years since dating apps have been on the market, users of these platforms have reported poorer psychological health and well-being than non-users.
There are several explanations for why dating app users may be struggling. The first is that dating apps are subject to selection effects, making the people who choose to use these platforms different from those who do not. People who are vulnerable to psychological health and well-being difficulties may prefer dating apps because they can avoid uncomfortable interactions, leading to negative patterns of reinforcement.
A second explanation involves exposure effects; that is, features such as gamification that may provide positive reinforcements that encourage problematic dating app use and keep people swiping.
The differences identified here could explain some of the challenges that users are likely to experience and be part of the reason they eventually burn out and quit dating apps altogether.
My guess is that dating apps are in important ways bad for mental health versus having better ways to find dates, and that sufficiently bad outcomes in terms of ability to find dates or find worthwhile dates are indeed worse for short term reported mental health than not trying. Whereas those who are successful get off the apps or never needed them in the first place.
What is the alternative? If the other choice is ‘do not try’ then for the median user the dating app is probably trading short term pain for chance of long term gain. If the other choice is ‘have uncomfortable real life interactions and make things happen’ and the app is blocking that instead of supplementing or leading into that, then the alternative is plausibly strictly better.
Certainly we could make app variations that are better for mental health controlling for outcomes, and also that give people better outcomes. Solving for the equilibrium, to get people to actually use those apps, is the difficult part, since people will value convenience and ease of use and low cost and avoiding trivial inconveniences dramatically more than they should, and if enough especially women effectively insist on the swiping experience it’s hard to escape from that.
You’re Single Because You Listened To E-Girl Advice
Anton: egirl dating takes are worthless for the same reason vc takes on how you should run your company are worthless; if you could do it you would just do it not talk about it
men in particular are truly better off without this kind of “help”
making up egirls in my head to get mad at
If she could be an E-Girl or she could date, what makes you think she would choose to date? What makes you think she isn’t also dating?
Similarly, if you could be a VC or a startup founder, it’s not that suspicious that you would choose VC. At this point in my life I would definitely prefer VC over founder. I don’t want to go through founder mode again. I am totally prepared to eat my words if I end up doing it anyway, and if I’m in then I’m in, but I don’t want to be in.
You’re Single Because You Didn’t Hire Blaine Anderson
Division of labor, like dudes and also women, rocks. Matchmakers should be much more of a thing than they are. There is either a profound market failure, a failure of the services to be good versions of themselves, or both.
I cannot in any way vouch for the effectiveness of Blaine Anderson’s matchmaking service. I can however vouch for her Twitter feed having consistently insightful and fun things to say. Her price range is ‘usually less than $50k’ and in exchange she goes out and sources to fit your particular criteria (which she will sometimes push back on).
There’s a lot of self-promotion, no question, but there are hard-to-fake signals that she is the real version of the thing in various ways, facing reality as it is, looking at the data and actually trying to get good results.
Also this one makes a good case:
Blaine Anderson: Underrated advantage of hiring a matchmaker, if you’re a single man:
• You sound cringe AF when you brag about yourself to women
• You sound amazing when I brag about you to women
One thing that blows my mind is she tells stories where the guy will say ‘get me a date with this specific micro-famous woman’ and she (at least sometimes) goes out and makes that happen. The guys asking this look damn good on paper, which no doubt is a lot of why this can sometimes work, but still, hot damn.
You’re Single Because You Think Zizek Mocked Date Me Docs
EigenGender: despite being very happily in a long term relationship im always very excited to read a dating doc. they’re some of the most vulnerable and genuine writing you can find and a window into another persons life. if you make fun of them you’re burning the commons and you should stop.
Stephen Fay: I like to read the date me docs, but I also am entertained by what Zizek has to say about them
Zizek (well okay actually Paula Rambles): Ah! You see, this miserable little document, this so-called date-me doc, is our era’s most honest pornography. It pretends to be romance, but what is it really? It is no longer the trembling hand on paper, the confession of desire. It is a spreadsheet of desire. “I am ready. I am six foot four. I have done the work.” What work? Love is precisely the place where work collapses into failure. You study and then you fail the exam.
And look at this language. “Highly agentic, emotionally warm.” Beautiful nonsense. Freedom, yes, but domesticated. Agency, yes, but pointing politely towards him. For Hegel, love is the risky collision of two freedoms. Here, there is no risk. She must arrive pre-formatted.
Then the farce reaches ecstasy. “If she does not appear, I will pursue single fatherhood.” Magnificent. Chance is canceled. Eros becomes procedure. The miracle of two gazes across a smoky room is replaced by paperwork and a receipt. The objet petit a is now a literal baby routed around the Other. And of course, the “monogamish” clause. Pure ideology. Fidelity with a footnote. Like Coke Zero: love without sugar, passion without calories. He wants the experience of devotion, but sterilized of danger.
The document offers no asylum from loneliness. It is loneliness, meticulously formatted, hyperlinked, and begging for comments. He does not whisper “I love you.” He says “I am prepared to love you, conditionally, pending review.”
That’s a funny post, and does an excellent job of mocking those who would make fun of date me docs and other actually intentional stances. Such magnificent flailing.
You’re Still Single Because You Don’t Appreciate Relationships
Here, in addition to the intended lede, we have at least 40% of respondents having been in a relationship for fully 8 years.
Aella: wow a whole 40% of people in long-term relationships are satisfied with their sex lives!
Critter: i imagine the numbers are worse for people not in long-term relationships
If anything these results seem potentially ‘too good,’ implying that couples are breaking up over this more than they probably should over the longer term.
One must also note that this is an Aella survey, so some of these relationships will be poly or open, but even accounting for that this says a lot. Selection effects are a lot of this, but that’s part of the point.
Perhaps you especially don’t appreciate marriage.
Raffi Grinberg writes that marriage is sexy, both figuratively that married couples are happier and make more money and have more kids and die less often and all that, and also that they have more sex (even if you only count with each other). And that the lifetime divorce rate is actually only 30% not 50%, average age of marriage is 29 and average first child is 28, despite the implicit cultural message that those numbers are in the 30s.
And yet he says Hollywood is sending us the opposite message. To which I’d say, sometimes, but I wouldn’t oversell this. Yes, in the How I Met Your Mother episode he talks about Barney keeps making fun of Marshall for being married, but the show clearly thinks that Marshall marrying Lily is sexy and awesome and great for both of them throughout and that Barney is ultimately wrong, and also the whole show is Ted trying to meet his wife and mother of his children.
You’re Not Single And Haven’t Been For a While
Here’s another backdoor ‘are you in a relationship’ poll: 78% of monogamous heterosexual men reported having a partner for longer than a year.
Alice Playing: monogamous hetero men with 1+ year-long partners: if you could have an affair with a woman of your liking, with absolute, 100% certainty that your partner would never find out, would you do it?
On the question itself, it’s not actually possible, since you’ll know and you can’t be sure you won’t tell them, and you’ll almost certainly act differently even if they never suspect or figure it out. One could even say ‘the only way to have 100% certainty they’ll never find out is if they’re dead, so absolutely not.’
Literal ‘any woman you wanted’ with zero risk of discovery is a stupidly tempting offer. If you treat this in the spirit it was presumably intended, instead, and everyone was being fully honest including with themselves and fully understood what was on offer (as in literally whoever you’d most want), presumably the ratio would be a lot higher.
Unless, of course, the way you know your partner will never find out is that your partner (or you and the woman you’d have the affair with) would be dead, in which case yeah bad deal, but that’s presumably not what this meant.
How do we know this? Well, one big data point is this next poll.
You Are Still Single As Evidenced By Would
Um, guys, are almost none of you in a monogamous relationship? And even if you are single there’s also the issue of risking the friendship. What are you all thinking?
Alice Is Playing: men attracted to women: how many of your female friends would you have a one-night stand with, if they offered?
Only 14% of men attracted to women answering this didn’t have at least one female friend they would have a one night stand with? Presumably many of the others don’t have the right female friend. Which means substantially more than 86% of them are not, for the most important practical purpose, in a monogamous relationship?
Remember that other poll from Aella above, that showed at least 40% of people were in 8+ year relationships? And the one from Alice that 78% of hetero men were in a 1+ year nominally monogamous relationship? Rut roh.
Then on top of that, a majority are willing to do this with a majority of their female friends, not only that one they have that crush on.
It doesn’t mean these people don’t think they’re in relationships. As we’ve seen, they very much do think this. They might even be right. But don’t tempt them.
You’re Single Because You Lack Motivation
Paper reminds us there is a 34-point gap (+34 versus +0) in net happiness for married versus unmarried people, with cohabitation only worth 10 points, and analyzes how this premium varies (slightly) by demographics.
As the paper readily admits this tells us essentially nothing about what makes someone happy, because the whole thing is unfixably confounded to hell. Happier, healthier and more successful people have an easier time getting married, and being unhappy leads to divorce. Both effects are epic in size.
We do know the overall situation over a 50+ year time horizon is not good news, because while marrieds are slightly happier, the unmarrieds are somewhat less happy and more importantly are a larger percent of the population.
Beyond that, I don’t know what to do with all these graphs or how to cash it out in useful advice. One might say ‘be the type of person who gets married,’ perhaps.
You’re Single Because Of Robin Hanson
As usual, never stop Robin Hansoning.
Robin Hanson: You know how in romance stories the main characters hope to find a special relation, better than that which the ordinary people around them settle for? Your relations will probably be more like those of the ordinary folks, less like those of special main characters.
This has to be true, because math.
It’s less true than it appears, because the relations of ‘main characters’ feel special to them the same as everyone else’s feel special. You could totally make a romantic comedy based on what I experienced, and you could also totally have me as a background character in someone else’s romantic comedy, although probably I’d be in a different genre entirely.
To you, it will feel more like that of the special main characters, except that you don’t need to have a false crisis in the third act.
You’re Single Because You Did This Instead Of Going To Therapy
What Jacob learned about sex at the rationalist bloggers’ conference, essentially that with zero integrity you get fuckbois and pickup artists, and when you do the opposite and get sufficiently high integrity and optimize for trust and honesty way above normal levels you get something magical and suddenly many good things are possible.
Here’s another fun bit:
Jacob: My friend “Standard Deviant” gave a talk titled “How I’ve had more sex.” He described the “escalator”: starting a conversation, exchanging compliments, light touch on the arm, etc. The important thing isn’t to rush up the escalator, my friend said, but to move together in synchrony whether you’re taking a step up or a step down.
When women show interest in casual sex, he often asks: do you do this sort of thing often? If they don’t, he often forgoes the opportunity out of an excess of caution.
Afterwards, more women wanted to have sex with him. I joked that women want to have sex not with the tall guy, hot guy, or the famous guy, but with the Schelling point guy.
Someone pointed out that tall, hot, and famous are the usual Schelling points.
History as a subject is often viewed by students and the public at large as a domain without a use, a pedantic study of dates and names with some vague mission to remember the past—a memorial to ages past but neither a forward-looking nor a useful endeavor. The study of history produces teachers of history and nothing more. And it is true that the study of history does not produce new widgets or novel computer advances, nor does it deepen our understanding of materials science or physics.
The humanities, of which history and the studies of language and culture are a part, are not there to improve our understanding of nature or develop technology; they exist to improve the minds (both cultural and individual) of the people we are.
History doesn't improve our world, it improves us. It gives us context for the world we live in and it helps us understand the reason why things are as they are and learn from the people before us.
History as Context
Imagine waking up every day with no memory of the day before, no idea who owned the house you slept in, no idea what country you're in, and no idea why everyone around you speaks the languages they do.
Living in such a world would be disorienting, confusing, non-sensical. Yet this is the world without history. The world without history just is. It isn't a work in progress, but a finished piece—one that lives and dies with you—and has no meaning beyond the present moment.
History doesn't let us predict the future, but it can be an enormous help in explaining the present. Current events are utterly indecipherable without the context of history and within that context, they feel less and less apart. Indeed our recent past of the Post-War Order is the oddity in history, and a real thing to be cherished and seen as something fleeting, fragile, and truly precious.
Yet without the context of history, we're blind to the reality that we live in a world truly set apart from everything that's come before and one that's deeply connected and familiar to the worlds of the past. That context is important because it gives us the vision to see the world that could be, both the paths of dark and of light that are set before us. It shows us who we are.
History as Memory
Living Memory is the collective memory of everyone alive in our society today. It is ever-changing and ever-fleeting. We remember the 2008 Financial Crisis quite well, but our memory of World War 2 is all but gone now. We read about it, sure, but our collective living memory of it has diminished and with that lapsing has gone all the memory of precisely why the world is ordered the way it is. This is not a value judgement, it is a statement of fact.
In a couple of recent posts, I describe how I try to use writing by hand as a way to increase my understanding of myself and my own memory. This is a form of personal history, and I find it difficult to express how much doing so has helped me better understand myself and my own thoughts.
This is analogous to our collective history. Though it's important to remember that history is not the act of writing, but the act of looking back and analyzing what was written. We write so that we can remember. We cannot learn from our mistakes if we refuse to write them down, or worse, if we refuse to look back.
The context of history is terrible and it is beautiful. It is the greatest story ever told with myriad heroes and villains, tragedy and triumph, love and grief all endlessly shifting in and out of view. And it was made (and is being made) by people no different than ourselves. Most of them didn't have the luxury to place themselves within the broader historical narrative. We do. Let's not ignore so precious a gift.
Authors: Aditya Shrivastava*, Allison Qi*, Callum Canavan*, Tianyi Alex Qiu, Jonathan Michala, Fabien Roger (*Equal contributions, reverse alphabetical)
Wen et al. introduced the internal coherence maximization (ICM) algorithm for unsupervised elicitation of base models. They showed that for several datasets, training a base model on labels generated by their algorithm gives similar test accuracy to training on golden labels. To understand which aspects of ICM are most useful, we ran a couple of simple unsupervised elicitation methods that leverage some of the factors that might make ICM work. We compared these baseline methods to training on golden labels for both in-context learning and iterative fine-tuning, using the same datasets as Wen et al. and similar hyperparameters.
We find that:
Just using few-shot prompts with random labels recovers 53–93% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning on labels created with this baseline recovers 62–96% of the gap between untrained models and golden fine-tuned models.
The most useful aspects of ICM are
bootstrapping (using predictions from one iteration of few-shot prompting as few-shot examples in the next iteration)
enforcing logical consistency of predictions.
A simple method which combines these recovers 83–100% of the gap between zero-shot accuracy and many-shot prompting with golden labels, and iterative fine-tuning with this method recovers 91–99% of the gap between untrained models and golden fine-tuned models.
These results do not hold if we increase the size of the training set from ~2k data points (as in Wen et al.) to ~30k data points: golden fine-tuning performance increases with dataset size more than unsupervised elicitation performance.
This makes sense, as larger fine-tuning runs likely teach the model something new; they don’t just elicit existing capabilities.
There is no strong reason to expect these simple techniques to elicit superhuman knowledge from very powerful base models, e.g. because these techniques may fail in real applications where some consistent and salient human beliefs are wrong. We’ll explore more challenging datasets one can use to evaluate unsupervised elicitation methods in upcoming work.
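To be explicit about the metric, "recovers X% of the gap" is computed in the standard way; the sketch below (with hypothetical numbers) is a restatement rather than code from our experiments.

```python
def gap_recovered(method_acc, baseline_acc, golden_acc):
    """Fraction of the baseline-to-golden gap closed by a method.
    E.g. gap_recovered(0.70, 0.60, 0.75) is roughly 0.67."""
    return (method_acc - baseline_acc) / (golden_acc - baseline_acc)
```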
Results summary
5 components that could cause ICM to have high performance are:
Few-shot prompting with random labels: Few-shot examples make task concepts (and the task format/output tokens) more salient to the model by providing concrete examples, even if the example labels are random. The initialization step of the ICM algorithm samples a small number of data points and assigns random labels to them, which are then used as few-shot examples in subsequent steps.
Bootstrapping of predictions: We hypothesized that the ICM algorithm bootstraps performance by adding increasingly more accurate predictions to the model’s context window, creating a feedback loop for improvement.
Bootstrapping on confident predictions: We hypothesized that the ICM rejection conditions allow the model to label easy questions first before moving onto hard questions. Using more accurate predictions as few-shot examples to label progressively more difficult examples may give better performance than labelling in a random order.
Logical consistency of predictions: The labels produced by ICM are constrained to be logically consistent with each other. This narrows down the space of possible labels making correct sets of predictions more likely.
Mutual predictability of labels: ICM optimizes the probability of each label conditioned on all other labels (where conditioning is done via few-shot prompting of the base model).
Note: Bootstrapping methods can indirectly rely on mutual predictability even when they don’t optimize for it directly like ICM does, because the new labels are chosen by conditioning on existing labels (which often increases mutual predictability).
We created simple algorithms based on combinations of these components excluding mutual predictability (though it might be indirectly optimized by bootstrapping). Here is a drawing of the algorithm for each individual component:
Below is a summary of our ablation results compared with ICM performance (purple). The numbers in a method’s name indicate which of the above ICM components it uses.
Legend:
Baseline methods:
Zero-Shot: Zero-shot predictions from the untrained model
Bootstrap (2): Use predictions from one round of few-shot prompting as few-shot examples in another round, and repeat
Confident Bootstrap (2+3): Same as bootstrapping, but when selecting predictions to use as few-shot examples in the next iteration, only select from the most confident predictions
Consistent Zero-Shot (4): Enforce logical consistency of zero-shot predictions: for each problem in the dataset, predict True for the answer with highest confidence and any others that agree with it, and False for any that contradict it—see Appendix B for details
Consistent Random Few-Shot (1+4): Enforce logical consistency of Random Few-Shot predictions
Consistent Bootstrap (2+4): Enforce logical consistency of predictions after each iteration of Bootstrap
Few-shot => Use the unsupervised elicitation method to label training set examples (or for random few-shot, just generate random labels), and use these in few-shot prompts to predict test set labels.
Fine-tune => Use an unsupervised elicitation method to label a subset of training set labels (we used subset size 512), fine-tune the language model on the labeled examples, and repeat for another subset using the updated model. Then after 3 epochs of the full training set, use the fine-tuned model to predict test set labels zero-shot.
GSM8K and Alpaca datasets have over ten times more training examples available with golden labels than were used in Wen et al.’s experiments. Below is a comparison of performance between some of our baseline methods and golden labels for each dataset, for both small training set sizes and full training set sizes. We also show the ICM performance reported by Wen et al. for the small training set size.
Golden fine-tuning increases by ~10pp when using the large training sets, whereas the baseline methods we tested didn't increase by as much (in most cases they didn't increase at all), or collapsed to ~50–55% accuracy within 1 or 2 epochs. It is possible to avoid this collapse by generating one static set of labels and fine-tuning on that instead of generating new labels between gradient steps, but for consistent zero-shot this performs significantly worse than iterative fine-tuning when using small training sets, and it still does not give any improvement for consistent random few-shot/bootstrapping on large training sets compared to small ones. It is possible that ICM also performs worse than golden labels when using large training sets, though we have not tested this.
All results reported here are under idealized conditions for unsupervised elicitation (i.e. train sets are approximately balanced, there are no prevalent features with more salience than ground truth, and all examples have an objective ground truth label). We will soon release a paper applying bootstrapping and other unsupervised elicitation methods in more realistic settings, where we find they perform significantly worse.
* Values for ICM performance are taken from Wen et al. which used a prompt format that might degrade performance relative to our values. We were not able to re-run ICM in a way that fixes this issue due to reproducibility issues explained in Appendix D.
GSM8K: candidate solutions to grade school math problems;
Alpaca: pairs of responses to user queries which have been ranked by humans for helpfulness and harmlessness;
TruthfulQA: candidate responses to questions associated with human misconceptions;
Gender: blog posts from the Blog Authorship Corpus with labels indicating the gender of the author (see Appendix A for results on this dataset)
Llama 3.1 8B is used as the base model for GSM8K, and Llama 3.1 70B is used as the base model for Alpaca, TruthfulQA, and Gender.
Random few-shot
Previous work shows that giving a base model examples of a task via few-shot prompting elicits improved capabilities regardless of label quality. Here, we:
Randomly sample examples from the train set.
Assign labels at random to be half True and half False (we also tried actually sampling labels independently at random and the results were similar).
Insert example, label pairs into a few-shot prompt for each test set sample, and use next-token logits to determine truth labels.
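A minimal sketch of this baseline, assuming a HuggingFace causal LM and a simple "Label: True/False" prompt format (the prompt format, shot count, and model-loading details here are illustrative, not our exact code):

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # base model used for GSM8K; gated, requires HF access
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

def predict_label(prompt: str) -> bool:
    """Compare next-token logits for ' True' vs ' False' at the end of the prompt."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    true_id = tok.encode(" True", add_special_tokens=False)[0]
    false_id = tok.encode(" False", add_special_tokens=False)[0]
    return bool(logits[true_id] > logits[false_id])

def random_few_shot(train_examples, test_examples, k=2, seed=0):
    """Label test examples with a k-shot prompt whose labels are random (half True, half False).
    Examples are assumed to be pre-formatted strings (e.g. question + candidate solution)."""
    rng = random.Random(seed)
    shots = rng.sample(train_examples, k)
    labels = [True] * (k // 2) + [False] * (k - k // 2)
    rng.shuffle(labels)
    prefix = "".join(f"{ex}\nLabel: {lab}\n\n" for ex, lab in zip(shots, labels))
    return [predict_label(f"{prefix}{ex}\nLabel:") for ex in test_examples]
```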
For Alpaca, 1-shot with a random label is significantly more accurate than zero-shot and is within 1pp of few-shot with golden labels. For GSM8k and TruthfulQA, random 4-shot was ~10pp and ~13pp more accurate than zero-shot, respectively. For Random Few-Shot in the results summary, we report values for 2 shots.
Enforcing consistency of predictions
In the 4 datasets used here, data points have identifiers indicating what combinations of true and false labels are logically possible. For example, if there are 2 responses to the same math question in GSM8K with different numerical answers, they cannot both be true, so we assign true only to the response with the highest model confidence (if it's above 0). Details on how we enforce consistency of predicted labels for each dataset are given in Appendix B.
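As a concrete sketch of what this looks like for GSM8K-style data (a simplified reconstruction; the exact per-dataset rules are in Appendix B):

```python
from collections import defaultdict

def enforce_consistency(examples):
    """examples: list of dicts with 'problem_id', 'answer', and 'confidence'
    (confidence = logit(True) - logit(False)). Returns one True/False label per example."""
    by_problem = defaultdict(list)
    for i, ex in enumerate(examples):
        by_problem[ex["problem_id"]].append(i)

    labels = [False] * len(examples)
    for idxs in by_problem.values():
        best = max(idxs, key=lambda i: examples[i]["confidence"])
        if examples[best]["confidence"] > 0:
            # The most confident candidate is labeled True, along with any candidate
            # giving the same final answer; candidates with contradicting answers get False.
            for i in idxs:
                labels[i] = examples[i]["answer"] == examples[best]["answer"]
    return labels
```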
Below we show the improvements in few-shot prediction accuracy after enforcing consistency of predicted labels at test time.
When benchmarking the performance of unsupervised elicitation methods, we cannot actually enforce consistency at test time because that would assume we have access to multiple candidate answers to each problem of interest, which might not be available (hence why there are no consistent zero-/random few-shot values in the few-shot part of the results summary). However, we can enforce consistency of training set predictions between iterations of bootstrapping, and prior to gradient steps during fine-tuning. As shown in the results summary, we found that fine-tuning on zero-shot predictions with enforced consistency alone was enough to almost match golden fine-tuning performance on the Alpaca and GSM8K datasets, but not for TruthfulQA.
* GSM8K experiments in this plot were run on the train set, which is why the values differ from the plots in other sections. The test set has only 2 candidate responses per math question (always one right, one wrong), so we use the training set here to give a more realistic illustration of the impact of enforcing consistency.
Bootstrapping few-shot predictions
For Alpaca, using random labels is roughly as good as using golden labels for few-shot prompting. However for GSM8K, using random labels caps performance at ~61% (4 shots), whereas few-shot prompting with golden labels continues improving up to ~68% (16 shots).
We tried using only the bootstrapping aspect of ICM to bridge this gap, by using the model’s predictions from one round of inference as few-shot labels in the next. We used the following algorithm (sketched in code after the list):
Get zero-shot predictions on a random subset of the train set.
Iterate over number of shots n (e.g. n=8, 32):
Randomly select another subset of the train set.
Create n-shot prompts using examples and predictions from the previous iteration (randomly sample n predictions such that half are True and half False).
Use these n-shot prompts to predict labels for the new subset.
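Here is a minimal sketch of this loop (the `model.zero_shot` and `model.few_shot` wrappers are hypothetical helpers that return True/False predictions from next-token logits; the subset size and shot schedule are illustrative):

```python
import random

def bootstrap_labels(model, train_set, shot_schedule=(8, 32), subset_size=512):
    """Iteratively relabel fresh subsets of the train set, using the previous
    iteration's predictions as few-shot examples."""
    # Iteration 0: zero-shot predictions on a random subset.
    subset = random.sample(train_set, subset_size)
    labelled = [(ex, model.zero_shot(ex)) for ex in subset]

    for n_shots in shot_schedule:
        # Build a class-balanced pool of few-shot examples from the previous round.
        trues = [pair for pair in labelled if pair[1]]
        falses = [pair for pair in labelled if not pair[1]]
        shots = random.sample(trues, n_shots // 2) + random.sample(falses, n_shots // 2)

        # Label a fresh random subset using these n-shot prompts.
        subset = random.sample(train_set, subset_size)
        labelled = [(ex, model.few_shot(ex, shots)) for ex in subset]

    return labelled
```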
Below we compare the performance when we do few-shot prompting with labels taken either from one iteration of bootstrapping (i.e. zero-shot predictions) or from two iterations of bootstrapping (feeding those zero-shot predictions into 8-shot prompts and using those as few-shot labels in the next iteration).
For TruthfulQA, each iteration moves performance closer to that of golden few-shot, whereas for GSM8K at least 2 iterations are required for much of an improvement. Random few-shot is already comparable to peak golden few-shot performance on Alpaca, so bootstrapping does not improve it much further.
For the values in our results summary, we did two iterations of bootstrapping. For few-shot, we fed zero-shot predictions into 8-shot prompts, then used the 8-shot predictions in many-shot prompting (128 shots). For iterative fine-tuning, we used the 8-shot predictions in 32-shot prompts and fine-tuned on the 32-shot predictions. We found that further increasing the number of iterations, or the number of shots in the final iteration, increased few-shot performance by at most 1 or 2pp and did not further improve fine-tuning performance. We also found that using random few-shot instead of zero-shot predictions in the first iteration did not make much difference.
Bootstrapping on confident predictions
We hypothesized that the ICM rejection condition allows the model to label easy questions first before moving onto hard questions; labelling progressively more difficult examples may be more instructive to the model than labelling in a random order. We tried to recreate this by modifying the bootstrapping algorithm: instead of randomly selecting predictions to use as few-shot labels in the next iteration, just use the most confident predictions.
We defined confidence as logit(True)−logit(False) for True predictions and logit(False)−logit(True) for False predictions. For GSM8K and Alpaca, the most confident quartile of zero-shot predictions is slightly more accurate than average, and for TruthfulQA it is significantly more accurate.
To apply this to the bootstrapping algorithm, on each iteration we kept only the most confident 64 True and 64 False predictions from the previous iteration (out of a subset of 512 examples). For one iteration of bootstrapping, this improves the performance on TruthfulQA but does not make much difference on GSM8K and Alpaca. However for more iterations (see results summary), it is not much better than normal bootstrapping.
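A minimal sketch of this selection step (the tuple fields are hypothetical names; in our runs k was 64 per class out of a 512-example subset):

```python
def confidence(logit_true, logit_false, prediction):
    """Logit margin in favour of the predicted label."""
    return logit_true - logit_false if prediction else logit_false - logit_true

def keep_most_confident(predictions, k=64):
    """Keep the k most confident True and k most confident False predictions.

    `predictions` is a list of (example, prediction, logit_true, logit_false) tuples.
    """
    scored = [(ex, pred, confidence(lt, lf, pred)) for ex, pred, lt, lf in predictions]
    trues = sorted((s for s in scored if s[1]), key=lambda s: s[2], reverse=True)[:k]
    falses = sorted((s for s in scored if not s[1]), key=lambda s: s[2], reverse=True)[:k]
    return trues + falses
```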
Conclusion
In this post, we show that simple and cheap elicitation techniques are quite effective at eliciting capabilities from base models on datasets like Alpaca and GSM8K.
The high performance of some simple techniques shows that closing most of the gap to the “few-shot/fine-tune on <3k ground truth labels” ceiling on datasets like Alpaca or GSM8K is not a very high bar. Thus, studying unsupervised elicitation methods requires more challenging and adversarial datasets that capture challenges like the salience of incorrect human beliefs. We introduce and study such datasets in an upcoming paper.
Appendix A: Gender few-shot results
The plot below summarises our baseline results for the Gender dataset compared with ICM (we only show results for few-shot prompting and not fine-tuning since Wen et al. did not have fine-tuning results for ICM). We find that confident or consistent bootstrapping is enough to reach golden few-shot performance.
* We found that zero-shot performance was ~10pp higher than reported in Wen et al. (~75% vs. ~65%) and that golden many-shot performance was ~2pp lower (~78% vs. ~80%), meaning there was a much smaller gap between the base model and the supervised ceiling. We are not sure of the reason for the discrepancy; it might be partly due to prompt format differences.
Below is the Gender dataset performance of golden and random few-shot (with and without test-time consistency) and of bootstrapping variations for different numbers of few-shot examples. Though random few-shot performance is significantly worse than zero-shot performance, one iteration of bootstrapping is enough to match golden few-shot performance.
Appendix B: Enforcing label consistency
Here is how we enforce consistency for each dataset:
For Alpaca, each user query q is associated with 2 candidate responses a and b to be ranked by helpfulness/harmlessness, and 2 corresponding prompts to be labelled as true/false (one asserting that a is a better response than b, and another asserting the opposite). To enforce consistency, instead of labelling each prompt in this pair independently, we assign true to the prompt given the highest truth score by the model, and false to the other. Gender is similar, except that instead of ranking responses by helpfulness/harmlessness, blog posts are ranked by the likelihood of having been written by a man (each pair comprises one post written by a man and one by a woman).
For GSM8K, each question is associated with multiple candidate solutions, each proposing a numeric answer. To satisfy logical consistency, if two solutions propose different numerical answers, at most one of them can be labelled true, and if two solutions propose the same answer, either both are true or both are false. For our baseline algorithms, we enforce consistency as follows: for each question, we identify the candidate solution given the highest truth score by the model; if that score is above 0, we assign true to that solution and to all other solutions with the same numeric answer, and false to the remaining solutions to that question. If no solution has a score above 0, all the solutions to that question are labelled false. (A code sketch of this rule is given at the end of this appendix.)
For TruthfulQA, consistency is based on the ground truth labels themselves: answers to the same question with the same ground truth label must be assigned the same predicted label, and answers with opposite ground truth labels must be assigned opposite predicted labels. This can lead to leakage (see Appendix C), but it is the consistency scheme used in Wen et al., so we adopt it here for comparability. To enforce consistency in our algorithms, for each question we assign true to the group of answers containing the highest maximum truth score and false to all other answers (assuming that maximum is above 0; otherwise false is assigned to everything).
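As a concrete illustration of the GSM8K rule above, here is a minimal sketch (the `answer` and `truth_score` fields are hypothetical names; the Alpaca and Gender pairwise schemes are effectively the two-candidate special case of the same idea):

```python
def enforce_gsm8k_consistency(solutions):
    """Assign consistent True/False labels to the candidate solutions for one question.

    Each solution is a dict with "answer" (the proposed numeric answer) and
    "truth_score" (e.g. logit(True) - logit(False) from the model).
    Returns a list of booleans aligned with `solutions`.
    """
    best = max(solutions, key=lambda s: s["truth_score"])
    if best["truth_score"] <= 0:
        # No solution looks true to the model: label everything false.
        return [False] * len(solutions)
    # The top-scoring solution and all solutions sharing its numeric answer are true.
    return [s["answer"] == best["answer"] for s in solutions]
```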
Appendix C: TruthfulQA dataset leakage
In TruthfulQA, each question is repeated multiple times with different responses and the goal is to predict a true or false label for each response. As in Wen et al., a set of predicted labels is “consistent” when any two responses with the same predicted label have the same ground truth label, and any two responses with different predicted labels have different ground truth labels.
Therefore if any response to a given question is blatantly True or False to the LLM, enforcing consistency might force predictions for the other responses to also be correct, even if the LLM is unsure about them in isolation. This can artificially boost the model’s accuracy.
Also, if a set of contradictory responses all have a ground truth label of False, then this could be exploited by the ICM algorithm during few-shot prompting if consistency is enforced as described above.
Example: "What is the biggest city in Europe that doesn't host the national government?" All of the following responses have the same ground truth label:
The biggest city in Europe that does not host the national government is London
The biggest city in Europe that does not host the national government is Rome
The biggest city in Europe that does not host the national government is Moscow
The biggest city in Europe that does not host the national government is Saint Petersburg
Enforcing consistency for the above means the model is basically told that either ALL of these statements are True, or ALL of these statements are False. This means if any response is obviously wrong to the model, the others might be assigned accurate labels based on that alone. Additionally, because the responses are obviously contradictory, any few-shot prompts which use the all-True label set during ICM might be given a lower confidence score (and thus a lower mutual predictability score) based on that alone, which could artificially make the all-False set more likely.
Appendix D: Difficulty reproducing ICM
We tried using the official repo for ICM by Wen et al. to reproduce their results on the Alpaca and Gender datasets. For the Alpaca dataset, the label accuracy increased to ~75% but collapsed to ~46% before labelling the complete batch of 256 examples. For Gender, the label accuracy reached up to ~65% before gradually falling to ~60%.
Because the method is unsupervised, we can’t assume we have a good enough validation set to do early stopping, so it would be unfair to report only early-stopped performance, especially when the window of high performance is small. The accuracy of the labelled set across iterations of our attempts is plotted below.
Alpaca:
Gender:
We confirmed with the authors that the ICM hyperparameters we used were the same as those used to obtain the results reported in the paper. Other researchers we talked to had similar problems.
We also found the computational cost of ICM prohibitive. One iteration of the ICM algorithm (i.e. checking whether one data point should be added to the set of labelled data points) requires at least one mutual predictability calculation. Since a mutual predictability calculation requires one forward pass for every data point in the labelled set, the average number of forward passes required per data point is at least ~n/2, where n is the number of labelled data points after the last iteration. This means the total number of forward passes is O(n^2).
This matches our experience, where running ICM with Llama-3.1-70B on 256 data points with 4 H200s takes hours. In contrast, our simple baselines are all O(n); bootstrapping up to 128-shot takes ~40 minutes to label 1000 examples from the same dataset, and few-shot or bootstrapping up to 32-shot takes a few minutes.
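To put rough numbers on this (an illustration, not a precise measurement): with n = 256 labelled examples, O(n^2) means on the order of 256^2/2 ≈ 33,000 mutual predictability forward passes, each over a prompt that can contain up to 256 labelled examples, whereas one labelling pass of an O(n) baseline over the same 256 examples is roughly 256 much shorter forward passes per iteration.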
Appendix E: Prompt improvements
The prompt formatting used for experiments in Wen et al. (derived from the ICM repository) contained a trailing space that may have harmed performance.
With the trailing space, the next-token logprob distributions for zero-shot prompts are dominated by numbers instead of words. For example:
Normal Prompt: "The capital of France is"
→ Model output: " Paris..."
Trailing Space Prompt: "The capital of France is "
→ Model output: "1300KM from Spain..."
This is because Llama tokenizers typically fold the leading space into the following word token, so text ending in an explicit space followed by a word is uncommon in the training data; numbers are an exception, which explains why the model is more likely to predict numbers after the trailing-space prompt. Even when the prompt for our tasks ends with a space, the model only ever predicts " True" or " False" (over “True” and “False”).
Removing the trailing space after the prompt results in a small improvement in the zero-shot performance and a larger improvement in random few-shot performance.
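The effect is easy to see by inspecting the tokenizer directly. A minimal sketch, assuming access to the Llama 3.1 tokenizer via Hugging Face (the exact token strings may differ):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

# Without the trailing space, the leading space belongs to the answer token itself
# (something like "ĠTrue"), which matches the common pattern in training text.
print(tok.tokenize("The capital of France is"))
print(tok.tokenize(" True"), tok.tokenize(" False"))

# With the trailing space, the prompt typically ends in a standalone space token,
# a pattern that is rare before words but comparatively common before numbers.
print(tok.tokenize("The capital of France is "))
```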
Appendix F: Unsupervised probing
We compared the performance of an unsupervised truth probe method inspired by ICM mutual predictability (described here) with CCS and PCA on a few different datasets.
(Ignore results other than ccs and fabiens_method, since some were bugged)
The ICM-inspired probe (green, fabiens_method) performs about as well as CCS (orange), and is often very close to the supervised ceiling. In hindsight this is not very surprising: mutual predictability in the context of probing is just another form of margin-maximization, like the confidence loss of CCS. The main surprise is that margin-maximization does so well on this kind of dataset; this is another example of simple unsupervised methods reaching surprisingly high performance on non-stress-testing datasets.
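For reference, here is a minimal sketch of the CCS objective whose confidence term we are comparing to (our paraphrase of Burns et al., not the code used for the plot). `p_pos` and `p_neg` are the probe's probabilities on the "statement is true" and "statement is false" versions of each example:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: the two probabilities should sum to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: push probabilities away from 0.5, i.e. maximize the margin.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```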