MoreRSS

site iconLessWrongModify

An online forum and community dedicated to improving human reasoning and decision-making.
Please copy the RSS to your reader, or quickly subscribe to:

Inoreader Feedly Follow Feedbin Local Reader

Rss preview of Blog of LessWrong

Recent progress on the science of evaluations

2025-06-23 17:41:44

Published on June 23, 2025 9:41 AM GMT

Summary: This post presents new methodological innovations presented in the paper General Scales Unlock AI Evaluation with Explanatory and Predictive Power. The paper introduces a set of general (universal) cognitive abilities that allow us to predict and explain AI system behaviour out of distribution. I outline what I believe are the main advantages of this methodology, indicate future extensions, and discuss implications for AI safety. Particularly exciting points include the future extension to propensities and how it could be useful for alignment and control evaluations.

Disclaimer: I am a (secondary) author of this paper and thus my opinions are likely somewhat biased; though I sincerely believe this is a good evaluation methodology, with the potential to increase the safety of AI models.

I would like to thank Lexin Zhou and Wout Schellaert for comments that have significantly improved the draft. Any errors, inaccuracies and opinions remain my own.

Summary of ADELE

One persistent question in AI capabilities research is why performance doesn't always correlate with intuitive task difficulty—systems that handle advanced mathematics may struggle with tasks that seem cognitively simpler, like effective gameplay in long-context environments (e.g. Pokemon). In this post, we review a recent preprint, "General Scales Unlock AI Evaluation with Explanatory and Predictive Power", that aims to clarify intuitions about why this is the case. This paper introduces a new evaluation methodology (Figure 1) that makes AI system evaluations both predictive and explanatory. I will explain the main contributions and some of its key advantages and limitations. I will also emphasise how I believe it could be used for AI safety, and what future work the group is working on to make this methodology useful for safety.

The core methodological innovation is to introduce a small number of (conceptually distinct) dimensions that capture the demands of any task instance (see Table 1 below), and match them with the capabilities of the AI system. These dimensions are based on the literature in AI, psychology and animal cognition; and are organised hierarchically to be a set of general or universal cognitive capabilities. Since these dimensions are defined with rubrics in natural language that are human-interpretable, the success or failure of AI systems (validity) will be explainable. Since these dimensions cover many[1] dimensions of model capabilities, the system's validity shall be highly predictable.

To evaluate these dimensions, the article designs rubrics that rank the task instance demands from 0 to 5 in the specific dimension. Together with examples, this allows using an AI system to evaluate task instance demands automatically. As a result, the methodology can evaluate many benchmarks and establish profiles for both individual task instances, benchmarks and AI systems. These profiles aim to provide a general or "universal" characterisation of model capabilities, independent of which AI systems are tested or what data is available. The choice of the dimensions is based on the psychometrics literature for evaluating the cognitive capabilities of humans, animals and also AI systems. This is in contrast to factor analysis methods (Burnell et al, 2023, Ruan et al, 2024), which aim to find uncorrelated dimensions, but whose results are only valid for the specific population of tested models and data, and are less interpretable[2].

Advantages and shortcomings

The most important advantage is that this methodology simultaneously achieves predictive and explanatory power of AI evaluation at the task instance level. This does not mean to predict the output token-by-token. Instead, it allows predicting success – or other indicators – for a given task instance. The dimensions and scales in this methodology are highly informative. A random forest trained on task demands and instance performance outperforms a finetuned LLaMA-3.1-8B on discrimination (AUROC) out of distribution, and calibration (Expected Calibration Error) in-distribution and out-of-distribution (see tables 3 and 4 below). The difference is particularly salient in the case of calibration and out-of-distribution, where the ECE metric is 0.022 for the demand-based random forest vs 0.075 for a finetuned LLaMA. In contrast, the random forest performs only marginally better than LLaMA on the AUROC metric[3], even when the difference is statistically significant.

A second advantage of this methodology includes being able to understand what abilities specific benchmarks are measuring. While this is obvious for some benchmarks like Frontier Math, obtaining capability profiles could provide valuable information for AI safety. For example, if someone comes up with a benchmark in cyber capabilities, it will be important to understand if the benchmark is evaluating pure knowledge, or also the ability to plan and execute on that knowledge. If someone comes up with a benchmark evaluating deceptive capabilities on AI systems, we could find it quite valuable to understand its software engineering knowledge, vs its planning and execution abilities, vs its metacognition or social abilities.

From an AI safety perspective, this granular understanding of both capabilities and propensities becomes crucial as systems approach human-level performance. Rather than asking whether a model is 'generally capable,' we can ask more precise questions: does it have the metacognitive abilities that might enable deceptive alignment? Does it show propensities toward deception in high-stakes situations? Does it possess the planning capabilities that could make it dangerous if misaligned? What are its behavioural tendencies when facing conflicting objectives? By decomposing model capabilities and propensities into interpretable dimensions and measuring safety-relevant propensities across different contexts, ADELE could help safety researchers identify specific capability thresholds and behavioural patterns that warrant increased caution, rather than relying on crude proxies like overall benchmark performance or model size.

This evaluation methodology is also a strong complement to red-teaming analysis or domain-specific evaluations. It is well known that both automated evaluations and red-teaming are key components of safety evaluations (Apollo Research). I believe ADELE establishes a robust and comprehensive analysis of AI systems that can provide a strong baseline for subsequent red teaming analysis. For instance, one missing capability we plan to incorporate in the near future is the ability to self-control, understood as the capacity to bring one’s actions into line with one’s intentions in the face of competing motivations or propensities triggered by the situation (Henden 2008). I believe understanding more about this dimension might provide a better understanding of when AI models fall for different jailbreaks.

Importantly, this evaluation is not only fully automatic, but the cost overhead it adds is usually much lower than evaluating the model on the underlying benchmarks[4]. The reason is that we do not need the annotator to complete the tasks, but only to evaluate their hardness on the different dimensions. For example, it is often easier to evaluate the hardness of creating a functional webpage, but potentially harder to create it. On the other hand, the methodology requires evaluating the AI systems on a few benchmarks, not just one or two, so a sufficient variety of examples is present in the test set. But sampling from several benchmarks is effective and practical.

One could potentially substitute ADELE with well-designed benchmarks that capture the dimensions you care about – though achieving external validity could often require an analysis similar to ADELE – or, for a specific domain, the tasks you care about. If so, the results of the benchmarks would become proxies that substitute the demand scales in ADELE. However, this has two conceptual issues: (i) you would have to make sure that you include all the relevant difficulty levels, and (ii) benchmarks measure tasks, not cognitive capabilities, so you would need your distribution of tasks to carefully mirror the tasks you care about in practice. What is worse, while there exist good benchmarks specialised in specific cognitive abilities (e.g. quantitative reasoning and Frontier Math (Glazer et al, 2024)), many benchmarks contain tasks that require a mix of capabilities, which could confound the conclusions (see Figure 2 below).

Since the cost of creating the demand and capability profiles is usually marginal compared with the cost of evaluating the model on the underlying benchmarks, I do not expect the approach in the previous paragraph to offer significant advantages over ADELE except for specific dimensions (eg pure math) or concrete domains (software engineering). On the other hand, I do not think these are mutually exclusive options. I believe the latter case (domain-specific benchmarking) to be complementary to capability-oriented evaluations (ADELE). It would make sense to have specific, well-designed benchmarks for a distribution of cybersecurity tasks you care about, and simultaneously be able to provide an overview of the model capability profile overall.

The methodology arguably has some shortcomings[5]. For example, the methodology currently proposes 18 dimensions (and we expect to add even more once we move beyond LLMs, like evaluating multimodal agents and robots), which might look like an overabundance even if they are hierarchically structured.

Our approach prioritises both predictive power and human interpretability over uncorrelation. While the 18 dimensions are fairly correlated within the selected individual models, demands may be significantly less correlated across different tasks, which is why having more granular (but interpretable) dimensions can improve predictive accuracy. The paper demonstrates in Appendix 9.5 that 11 broader dimensions achieve comparable predictive power, but we retain the finer subdivisions because they provide more detailed explanations of model behaviour. Since evaluating each dimension is computationally inexpensive, we believe this design choice is justified despite the added conceptual complexity. Further, as can be seen in Figure 16 below, all features retain relative importance in the final predictive power. If all features were fully correlated, one feature would display a feature importance value of 1, and the rest would be 0.

In any case, one seeks to have a small set of uncorrelated dimensions (at the expense of explainability, generality and potentially predictive power), then factor analysis is the correct tool to use (Burnell et al, 2023, Ruan et al, 2024).

The small AUROC improvement (see tables 3 and 4 above) over a finetuned language model, while modest, comes with important advantages: the methodology uses interpretable scales that generalise across contexts and shows significantly better calibration on out-of-distribution tasks. We suspect the limited AUROC gains reflect a lack of sufficiently challenging instances across most dimensions in the current benchmarks included in the ADELE battery v1.0. As we include harder benchmarks, we conjecture the predictive advantages to become even more pronounced.

Future work and usefulness for AI safety

Future extension of the work includes (i) improving the calibration of the scales against human populations, (ii) extending the scales to higher levels of demands, and (iii) extending the methodology to propensities.

We expect the improved calibration of the demand levels to provide a stronger justification, and also more generally inform the progress of AI capabilities with respect to human capabilities. If successful, I believe this work could extend very useful findings of METR on the domain of software engineering (Kwa et al, 2025) to more general capabilities. As mentioned above, we shall also include other missing capabilities in the current framework, particularly those related to agentic behaviour (Tolang et al, 2021).

Second, by extending the demand levels to higher levels of capability, we address the problem of benchmark saturation. This will allow this methodology to provide reliable evaluation information at very high levels of difficulty.  Combining this analysis with studies of scaling laws across model families could help us understand how different training techniques affect model capabilities (Tamura et al, 2025).

However, we believe this methodology could potentially be extended beyond capabilities to what we call propensities — behavioural tendencies that emerge in specific contexts. The distinction is crucial for safety:

  • A capability measures what an AI can do. For example, does the model have the reasoning ability to formulate a deceptive plan?
  • A propensity measures what it tends to do under different conditions. For example, when faced with a high-stakes objective and an easy opportunity to lie, does the model actually choose to be deceptive?

As a concrete example, a model might have high reasoning capabilities but show varying propensities for deceptive behaviour depending on the situation's characteristics (high stakes, conflicting objectives, etc.). We plan to evaluate key safety-relevant propensities,  including deception (Meinke et al, 2025), goal-directedness (Everitt et al, 2025)...; and intend to conduct a similar scaling law analysis for them.

I believe this methodology is not only a key step in the direction of establishing a science of evaluations, but also an advance in the establishment of alignment evaluations. As discussed above, if we create model profiles for both capabilities and propensities with out-of-distribution predictive power, we can start taking steps towards “alignment evaluations” (Hubinger, 2023). This can also serve as a complement to mechanistic interpretability.

The predictive power of this form of AI evaluations can also provide a complement to control evaluations (Greenblatt et al, 2024), even when we do not assume aligned AI systems. If we had perfect control, we would not need to predict the safety of certain situations. Similarly, if we had perfect predictive power on the safety of different situations, we could systematically prevent the AI system from being exposed to those situations we predict will end up in the model exhibiting dangerous behaviour.

To use this predictive power for safety purposes, we intend to use assessors: other meta AI systems that predict or assess the safety of a given situation for an AI model, or its output (Zhou et al, 2022). While the idea is not new, and has also been exploited e.g. in Constitutional Classifiers (Sharma, 2025), we believe that our methodology can extend beyond jailbreaks to safety in general, and improve predictive power and calibration of safety classifiers used, e.g. for control purposes (Greenblatt et al, 2024).

Conclusion

In my opinion, this methodology is a sound (and obvious in retrospect) step in how we should evaluate AI systems: create a set of general abilities using the literature on how to evaluate machines, humans and animals; and design difficulty metrics that allow us to understand and predict the success of AI model behaviour. I also believe this methodology has the potential to extend the current evaluation methods from just capabilities to alignment, and enhance and complement other key safety tools. That being said, I would be interested in getting constructive feedback on how these new evaluation methods can be better used for safety, or other limitations you see.

If you like this line of research and would like to contribute, you can get in touch with Prof. Jose H. Orallo and Lexin Zhou (the leaders of this work). I believe they would be interested in understanding how to make the current tool as useful as possible for safety.

  1. ^

     There is work in progress to extend the dimensions to complete sets of fundamental capabilities proposed in the psychometrics literature (Tolang et al, 2021). Future work shall in particular cover agentic capabilities such as planning and execution. In any case, an assumption of this paper is that there exist a set of universal cognitive capabilities, and that the current literature covers them well – e.g. there will not be very alien capabilities that only superhuman AI systems will exhibit.

  2. ^

     Factor analysis has a slightly different goal as the paper discussed here. The goal of factor analysis is to find a small number of uncorrelated capability dimensions for a given population of AI systems and benchmarks; whereas this paper aims to find a reasonably-sized set of human-understandable dimensions that achieve the highest predictive power, even if some dimensions may be strongly correlated for some specific populations of AI systems.

  3. ^

     My opinion: If I were to design a predictor (called assessor later down the post) that maximizes accuracy at the expense of explainability, I would combine both methods: a relatively powerful language model that we can trust finetuned on both the tasks and ADELE-style assessments of task demand and model capabilities (and propensities).

  4. ^

     ADELE is also relatively efficient. The current battery required 16000 individual examples from 20 benchmarks, and one could come up with  examples.

  5. ^

     As any other evaluation method, any evaluation also has the shortcoming of potentially speeding up AI capability development. While it is not possible to directly train an AI system against ADELE evaluations, knowing which capabilities are more fundamental can indeed redirect efforts to the weakest capabilities.



Discuss

Racial Dating Preferences and Sexual Racism

2025-06-23 11:57:02

Published on June 23, 2025 3:57 AM GMT

Note: This is a linkpost from my personal substack. This is on a culture war topic, which is not normally the focus of my blogging. Rationalist friends suggested that this post might be interesting and surprising to LW readers.

Summary

  • People widely exclude romantic and sexual partners on the basis of race. This is claimed not to be racism by those who have these attitudes, but it is[1].
  • Certain groups, such as Black women and Asian men, are highly excluded by all other groups. These cases also exhibit within-race asymmetries in racial preferences across gender:
    • Black women exhibit the highest endophilia (i.e. they most desire to date within race), but this is strongly not reciprocated by Black men.
    • Asian women exhibit the highest endophobia (i.e. they most refuse to date within race), but this is strongly not reciprocated by Asian men.
  • These within-race asymmetries are so strong that they provide counter-evidence to most conventional explanations of racial preference (e.g. that racial preference is a proxy for the desire for wealth, class, and integration; that racial preference forms as a by-product of racial segregation).
  • These effects are not unique to straight dating, but strongly form the romantic and sexual reality of queer dating as well.
  • The evidential base for the above points is not small, but rather comes from an immense literature in social psychology, sociology, race studies, and queer studies.
  • People rarely take ownership over their own dating preferences. They either blame wider sociological factors outside of their control, or blame the racial losers themselves (Black women, Asian men) as being responsible for their own undesirability.
  • This topic is so controversial that it is systematically discussed only within two poles: a “woke” academic sphere, and a “red-pilled” incel/volcel/femcel online sphere. This adds a lot of distortion to the subject.
  • Those who have racial preferences are not going to be shamed out of having such preferences, so I recommend that racial losers consider doing weird things to improve their sexual fitness.

A Note on Language and Scope

I am going to talk about racial and sexual groups using broad terms like “Asian woman”, “Black gay men”, “Whites/Jews”, and so on. I am aware that this language can be reductive, and may not operate on some desired level of granularity. I am aware that there are important, relevant differences between South Asians and East Asians, and that it is weird that Jews are sometimes lumped into “White” and sometimes not. I am using these terms because we need to have some language that allows us to talk about broad trends. These are the groups used in most of the studies I will discuss, and reflect generally how demographics is studied and measured in the West. When it comes to this topic, I find that demands for greater precision in language are often veiled attempts to bury conversation in a mire of obscurantism.

The studies I will discuss are about dating dynamics in “the West”, meaning that they were mainly conducted in the United States or Europe, and you should expect these results to generalize across the U.S., Canada, Western Europe, Australia, and so on.

My Motivation

This is a controversial topic, and those who talk about it are typically accused of being resentful or of having bad intentions. I don’t think either of those things are true about me, and I would like to provide some color on myself to possibly ward off some accusations. Feel free to skip this section if that seems tedious.

I am a mixed-race Indian and Korean male. I grew up mainly in New York City. My elementary and middle schools were somewhat representative of the demographics of the city (“somewhat” here means there was non-epsilon headcount of African Americans, Dominicans, and Puerto Ricans), which is where I woke up into the consciousness that I found people attractive, and moreover that I found most people attractive and plausible. Due to this, later in life I made the assumption that most people’s racial dating preferences were anchored on the racial distribution they were exposed to during early puberty (we will see later in this post the many ways in which this is not true). As I got older I tested into institutions that were increasingly dominated by Whites/Jews and Asians. Concurrently, as a teenager I was a drummer in NYC’s punk scene in the late 2000s, which at the time was very white. Now, as a programmer in California there are epsilon African Americans in my social circles (except for first and second generation Nigerian and Kenyans immigrants).

My dating history reflects these facts. I have had a basically unbroken chain of romantic and sexual partners since I was 13, mainly White women, some Asian women, and the occasional man. I was very unsympathetic to male friends complaining about how dating is difficult or unfair, because I found it easy to date attractive and interesting people despite being myself not that attractive, charismatic, nor even particularly kind. You just try to pursue a large number of genuine friendships, and some of them naturally convert into relationships of a different type. As an undergraduate, I entered into what would be a felicitous 10-year relationship with a half-Jewish, half-Mexican woman. This allowed me to exile from my mind all considerations of “dating discourse”. My friends were allowed to complain about dating around me, but only for a maximum for five minutes before I started berating them for being whiny, anti-agentic sad-sacks. Additionally, my long relationship allowed me to entirely side-step dating apps. I would note— but not really absorb— how miserable and humiliating these apps are to so many people— how alienating it is to try to flatten your life down to the perpetual dog-and-pony show that is one’s dating app profile.

I am in no relationship now, and still do not have first-hand experience of the “dating market” in the 2020s. Perhaps unfortunately, my last relationship reshaped my preferences so much that I don’t really find people romantically or sexually interesting any more. I joke with friends that I have become a volcel, but I do not think that I am cynical, bitter, nor black-pilled— rather, I’ve been set adrift on the placid seas of self-reliance. This change though has made me more attentive to phenomena I’d previously chosen to ignore. I always knew there was a strong racial component to dating outcomes, but I thought this was mainly due to wider sociological factors outside of any individual’s control. Whites and Asians date in a cluster apart from Blacks and Latinos in the US, but these racial strata have less to do with racist beliefs than with broader economic divisions that effectively segregate the country. But I started reflecting on stories that I would hear from friends and acquaintances that suggest socioeconomic segregation cannot be the whole story. Here are some representative stories:

  1. A White male friend and his Asian female partner are into swinging. Sometimes at clubs or parties they play a game where he points out random men in the room and she tells him if she finds them attractive. He eventually stopped pointing out Asian men at all once he realized that her answer for Asian men is always “no”.
  2. An Asian female friend living in SF goes out to lunch with her boyfriend and three other couples. The girls arrive first and they are amused to find out that they are all Asian. The boys arrive later, and they are slightly less amused to find out that all the boyfriends are White.
  3. I am orbiting different conversations at a party and slowly realize that all the pairs of men and women who are flirting are White men and Asian women.

I began to think I’ve poisoned my mind, and started looking into whether there’s any empirical basis whatsoever for what I seemed to be observing. It turns out the literature documenting this is enormous.

Empirical Literature

Let’s start examining the empirical literature.

Internet Dating and Racial Exclusion (Robnett & Feliciano)

We’ll start with Belinda Robnett and Cynthia Feliciano’s 2011 paper Patterns of Racial-Ethnic Exclusion by Internet Daters. They look at ~6000 Yahoo Personals dating profiles from heterosexual daters in 2004-2005 living in large, multiracial American cities (New York, Los Angeles, Chicago, and Atlanta). The below table captures the stated racial preferences of this cohort.

Screenshot of a Datawrapper table, which LessWrong does not support yet. See the interactable table on my blog here.

I’ll quickly note some conspicuous features of this data:

  • Within each group, the majority have stated racial preferences.
  • 58% of men have stated racial preferences, whereas 74% of women have stated racial preferences. (The percentages I’ll discuss after this are limited to those who have stated racial preferences.)
  • Endophobia (i.e. own-race exclusion) is in general low. Some notable asymmetries is BM-BW exclusion (11.47% vs 7.76%) and LM-LW exclusion (5.74% vs 16.47%). The most anomalous asymmetry is AM-AW exclusion (9.69% vs 39.96%)!
  • Whites are least excluded. BM and BW are exceptions to this, both excluding whites at around 70%.
  • Black men and black women are both extremely excluded, with exclusion rates being >90% across all out-race groups except Latinos who drop black exclusion by ~15 percentage points.
  • Asian men (but not Asian women) are extremely excluded, with exclusion rates being >90% across all out-race groups, in addition to uniquely being 40% excluded by Asian women.

Here is some analysis from the authors, identifying the ways in which their data support and in turn contradict leading theories of racial preference in mate selection.

Our study clearly shows that race and gender significantly influence dating choices on the internet. Consistent with the predictions of social exchange and group position theories, among those who state a racial-ethnic preference, whites are far more likely than minorities to prefer to date only within their race. Our analyses of minorities’ racial preferences show that Asians, blacks and Latinos are more likely to include whites as possible dates than whites are to include them. Acceptance by the dominant group is necessary for boundaries and social distance between minority groups and whites to be weakened, yet this study shows that whites exclude minority groups at high rates.

The results support the predictions of classic assimilation theory and social distance research, as Asians, and to a lesser extent Latinos, have racial dating preferences similar to those of whites with both groups more exclusive of blacks than of whites and one another. This may be because Latinos and Asians are less segregated from whites, feel less social distance towards whites (Charles 2003; Frey and Farley 1996; Massey and Denton 1993), and distance themselves from blacks in the classic assimilation pattern (See Calavita 2007). However, we also find that, to a lesser extent, Asians and Latinos distance themselves from nonblack minorities, including one another. Asians are even more exclusionary of Latinos than are whites. From social exchange or group position perspectives, they have far more to gain through interracial relationships with whites than with others. Social distancing, then, is not only directed towards blacks, but operates between nonblack minority groups as well.

We also argue that gender is central to the acceptance of some racial groups within the domain of intimacy. Our results show that black women, Asian men, and to a lesser extent, Latino males, are more highly excluded than their opposite-sex counterparts. These findings may, in part, be explained by sexual strategies theory because men are more open than women to a variety of partners. However, this explanation does not shed light on why all men, except for black men, are the most exclusionary of black women, or why all groups are more accepting of Asian women and Latinas over their male counterparts. Especially perplexing is that women prefer to date black men over Asian men. This is completely contrary to the claims of social exchange and sexual strategies theories that women should prefer to date men with higher socio-economic standing.

Finally, our results challenge social exchange and sexual strategies theories in that the relatively high income enjoyed by Middle Eastern, East Indian and Asian men do not correspond to increased acceptance in the domain of intimacy. Like whites, Asians and Latinos are highly exclusive of blacks, but also of higher earning groups, such as Middle Easterners and East Indians. White women and Latinas exclude Asian, East Indian and Middle Eastern men more than black men, and East Indian and Middle Eastern men are among the most excluded by black women and Asian women. These results suggest that race-ethnicity dynamics shape racial exclusion more than structural integration does.

Online Racial Dating Preferences among Asians (Tsunokai, McGrath, & Kavanagh)

Asian women’s endophobia is such an important counter-example to so many theories of racial preference that it is worth examining in detail. Turning now to Glenn T. Tsunokai, Allison R. McGrath, and Jillian K. Kavanagh’s 2014 paper Online dating preferences of Asian Americans, we find similar effects as Robnett & Feliciano, but now extended to also include Asian homosexuals. This study looks at 1270 Asian American dating profiles on Match.com from 2006. They find:

As highlighted in Table 2, heterosexual women and gay men were more likely to want to date a White person than were their heterosexual male counterparts. The odds of stating a preference to date Whites were 4.1 and 2.4 times greater for gay males and heterosexual females compared to the reference group, respectively. This pattern was not displayed when it came to one’s willingness to date African Americans, Hispanics, Asians, and individuals who were some other race/ethnicity. Compared with heterosexual males, heterosexual females and gay males were less willing to indicate a preference to date someone who was nonwhite. For example, among Asian females, the odds of wanting to date an Asian, Latino, someone who is racially/ethnically ‘‘other,’’ or a black man, were 75, 66, 61, and 53% less than their heterosexual male counterparts, respectively. Similarly, gay Asian males were also less willing to express an interest to date nonwhites. As a group, they were 85, 35, and 33% less likely to state a preference to date other Asians, people who were some other race/ethnicity, and Blacks compared with heterosexual males within the sample, respectively.

Table 2 from Tsunokai, McGrath, & Kavanagh (2014). Odds Ratios are calculated that capture the willingness of Asian heterosexual females and Asian homosexual males to date various groups, compared against the Asian heterosexual male as baseline. Model 1 captures willingness to date Whites, Model 2 Blacks, Model 3 Latinos, Model 4 Asians, and Model 5 others.

This is a really dramatic table. Asian heterosexual women are 2.41 times more likely than Asian heterosexual males to be willing to date Whites. Asian homosexual males are 4.11 times more likely than baseline! Asian exclusion (Model 4) is also extreme, with Asian women at 0.25 and Asian homosexuals 0.15 times the baseline. Women and homosexuals in every model exhibit much more sexual racism than Asian heterosexual men.

The “Preference” Paradox (Thai, Stainer, & Barlow)

Race preferences are very common, but it would be wise to not disclose them explicitly as this results in one being seen as more racist, less attractive, and less dateable. This effect occurs even when the person judging you says that they believe that racial dating preferences are not racist. So are the findings of Michael Thai, Matthew J. Stainer, and Fiona Kate Barlow’s 2019 paper The “preference” paradox: Disclosing racial preferences in attraction is considered racist even by people who overtly claim it is not.

The study was done on 1956 Australian gay men who were asked to look at modified dating profiles of White men and rate the profiles based on how racist the subject seemed, how attractive they were, and how dateable they seemed. Subjects were also asked to give binary answer to the question “Do you believe it is racist to have exclusive racial preferences when it comes to sexual attraction?”, which split the subjects into two cohorts. Most of the subjects were White (65%-75%, depending on the experiment), with Asians (9%-15%) and South Asians (3%-5%) forming the largest minorities.

Example of dating profile modified to have a racial preference disclosure from Thai, Stainer, & Barlow (2019).

The dating profiles were modified across different experiments to include various racial preference disclosures. Study 1 involves targeted exclusion (e.g. “No Asians or Blacks”); Study 2 involves general exclusion (e.g. “White guys only”); Study 3 adds general soft exclusion (e.g. “Prefer White guys”).

Survey data from Thai, Stainer, & Barlow (2019). Graph 1A is from Study 1 (“No Asians no Blacks”) and shows how racist the men in the dating profiles are perceived. The black columns are reactions to profiles that contain race preference disclosures, and the off-white columns are the ones that do not. The left two columns contain those subjects that believe that racial dating preferences are racist, and the right two contain those subjects that do not think race preferences are racist.

There are a number of notable results here. Men perceive racial preference disclosures as racist, even when they explicitly claim that they believe that such disclosures are not racist. This suggests that people are fundamentally confused about their attitudes towards sexual racism. Attractiveness and dateability are also affected (though the effects on attractiveness are clearly quite small). This carries a nasty implication: online daters are incentivized not to explicitly state their racial preferences even when they do have them, so strong racial preferences may be even more common than they may appear.

Many reject the idea that racial dating preferences are actually racist. A similar study came out in 2015 which examined this question more closely called Is Sexual Racism Really Racism? Distinguishing Attitudes Toward Sexual Racism and Generic Racism Among Gay and Bisexual Men. The study’s abstract states:

Sexual racism is a specific form of racial prejudice enacted in the context of sex or romance. Online, people use sex and dating profiles to describe racialized attraction through language such as “Not attracted to Asians.’’ Among gay and bisexual men, sexual racism is a highly contentious issue. Although some characterize discrimination among partners on the basis of race as a form of racism, others present it as a matter of preference. In May 2011, 2177 gay and bisexual men in Australia participated in an online survey that assessed how acceptably they viewed online sexual racism. Although the men sampled displayed diverse attitudes, many were remarkably tolerant of sexual racism. We conducted two multiple linear regression analyses to compare factors related to men’s attitudes toward sexual racism online and their racist attitudes more broadly. Almost every identified factor associated with men’s racist attitudes was also related to their attitudes toward sexual racism. The only differences were between men who identified as Asian or Indian. Sexual racism, therefore, is closely associated with generic racist attitudes, which challenges the idea of racial attraction as solely a matter of personal preference.

Parsimony suggests that it makes most sense to consider racial dating preferences as part of the “racist cluster” of human belief-space.

On The Streets of The Culture War

I’m now going to step away from the academic literature and talk more about this issue as it appears in dating discourse and in the culture war. I am going to be more opinionated and inflammatory going forward.

Black women and Black men are clearly horrendously wronged by all of this. The levels of sexual exclusion are so extreme that they seem unjustified even if you are a sort of race realist, “crime stats” guy. Asian men are also wronged by the racial preference distribution, being both highly excluded by everyone, and occupying the unique position of being highly rejected within-race.

Common Arguments Attempting to Explain Asian Women’s Endophobia

I’ve collected some common arguments, sourced from women I know and from online discourse, about why Asian women avoid dating Asian men.

  1. “Asian men are more patriarchal and less romantic than White men”: Karen Pyke calls this the “egalitarian knights” phenomenon. Indeed many Asian men are quite patriarchal (my mother left Korea because she could not tolerate being controlled by a Korean man— hilariously, she chose to marry an Indian man instead), and perhaps many Asian American women look at how their fathers treat their mothers and do not want that for themselves. But we do not date our parents’ generation, we date within our own generation, and Asian American men of the relevant generations are not more patriarchal than White men. Asian Americans are significantly more liberal than the national average (only Vietnamese-Americans are majority conservative), with about 40% of white men associating with the Democratic Party compared to 60% for Asian men. Research on attitudes towards gender equality do not suggest that White men are more feminist than minority men.
  2. “White men fetishize Asian women and pursue them more aggressively”: There is a large literature about the history of “Yellow Fever”, but this is not sufficiently explanatory. This explains White men’s elevated preference for Asian women, but does not explain the other direction, nor does it explain the elevated exclusion of Asian men.
  3. “Asian men have been emasculated by colonialism and white supremacy”: Again, there is an extensive history of this, but it does not seem to me sufficiently explanatory. This point and the previous two points could all (to equal if not greater effect, in my opinion) be raised about Latinos. Latinas are also fetishized; Latino men have also been emasculated in American media; Latino immigrant culture is also patriarchal — but the gender asymmetries observed in Robnett & Feliciano are not the same within Latinos as within Asians.

Surely there must be some cultural story about how these preferences form and persist. Tsunokai tries to explain this in terms of legal and cultural history. They cite this 1842 entry from the Encyclopaedia Britannica:

A Chinaman is cold, cunning and distrustful; always ready to take advantage of those he has to deal with; extremely covetous and deceitful; quarrelsome, vindictive, but timid and dastardly. A Chinaman in office is a strange compound of insolence and meanness. All ranks and conditions have a total disregard for truth.

They also suggest that newspaper stories, pulp novels, and movies caused all of this:

Although these restrictive laws were eventually deemed legally problematic, the regulation of Asian sexuality has continued to go unabated in the media; movies and television programming constantly depict Asian men as being asexual (Larson, 2006). From effeminate characters, such as Chinese detective Charlie Chan, to action heroes who rarely have any onscreen romantic interests (e.g., Jackie Chan), Asian males are essentially stripped away of any sexual desirability—a process referred to as ‘‘racial castration’’ (Eng, 2001). These unflattering images have the ability to produce negative outcomes. For example, Fisman, Iyengar, Kamenica, and Simonson (2008) found that when it comes to physical attractiveness or sexual desirability, Asian men receive the lowest ratings; their study also revealed that Asian women find White, Black, and Hispanic men to be more attractive than Asian males.

Asian women have also had their image negatively shaped by the dominant group. Rooted in the early days of colonialism, the image of the ‘‘Oriental’’ was created by Whites who sought to distance themselves socially and culturally from Asians. Asian women were frequently depicted as seductresses who sought to corrupt the morals of White men (Uchida, 1998). For example, in the 1870s and 80s, newspaper stories and magazine articles exaggerated the notion that Chinese or Japanese prostitution, if left unchecked, would erode the nation’s morals and physical well-being (Lee, 1999). By characterizing Asian women as lecherous heathens, the dominant group was able to justify discriminatory actions that targeted Asians (e.g., anti-immigration laws). This ‘‘controlling image’’ has not faded over time. Although Asian women are still underrepresented in the media, when shown, they are habitually stereotyped as being submissive, exotic, or sexually available for White men (Larson, 2006). Regarding the latter, popular movies such as The World of Suzie Wong (1960), Year of the Dragon (1985), and Heaven and Earth (1993) have consistently emphasized the notion that the sexual and emotional needs of Asian women are successfully satisfied by White males (Kim & Chung, 2005).

It is very possible that these factors matter, but I am not satisfied. Apart from its questionable explanatory value, I particularly dislike this line of thinking as it removes agency from minorities and absolves women of taking any ownership over their racialized desires. Time and time again, I’m asked to believe that this is all the fault of White men. But European colonialism did not invent the phenomenon of race factoring into sexual matters (see for example Razib Khan’s wonderful overview of the genetics and sociology of India’s Jati/Varna system).

Let’s look at an example of the abnegation of responsibility I’m worried about. Steffi Cao writes for The Guardian: “Trolls are citing an ‘Oxford study’ to demean Asian women in interracial relationships. But it doesn’t actually exist”. The article talks about the social media phenomenon of Asian women with White boyfriends getting yelled at about the so-called “Oxford study”, which is supposed to be some sort of study showing how Asian women fetishize White men. In reality, the “Oxford study” is Balaji and Worawongs’ 2010 paper The New Suzie Wong: Normative Assumptions of White Male and Asian Female Relationships, which presents an analysis of the history of Asian female and White male relationships in media in the 20th century. Steffi Cao makes a lot of hay about how the study, and by implication the phenomenon it supposedly describes, is not real:

In most cases, the use of “Oxford study” takes on a ugly tone. Commenters use it to signal a rabid interest in the personal lives of Asian women, informed by entrenched stereotypes around race and gender. They cite the study as a shorthand to criticize the romantic and sexual choices of Asian women in interracial relationships. Many of these commenters are men, often Asian men, and they want to make Asian women the butt of their bad joke.

However, the study they’re so eager to cite doesn’t actually exist – at least, not in the way they think it does. But that hasn’t stopped “Oxford study” from fueling Asian women’s anxiety about dating or affecting their sense of self.

The phrase “Oxford study” has been attributed to a TikTok user who in April 2023 reacted to a video of an Asian woman and white man together by crudely joking: “the power of the Caucasian [male] over the Asian female subconscious needs a full Oxford study”.

“Oxford study” reached its most outrageous zenith a year later. In April 2024, Cherdleys, a comedian and troll, posted a TikTok of him sitting on the back of a pickup truck with his arm slung proprietarily over the influencer Lydia Ren, who kneels on her hands and knees in full-body black latex, a muzzle over her face and dog ears on her head. She is Asian. “Me and my dog were wondering if you and your dog want to go on a date,” says Cherdleys, who is white.

Commenters swarmed. “Naw you Asian women love being humiliated by white men,” one wrote, while others frequently, persistently, posted “Oxford study”. “What level of Oxford study is this?” asked one. “Oxford study final thesis,” and “Oxford study final boss,” wrote others. The video – surely intended to stoke outrage about Asian women and white men in provocative situations – now has 1.5m views.

Emily was unnerved to learn that this so-called “Oxford study” had been made up to legitimize scrutiny of Asian women’s personal lives. “A chill just went down my spine,” she said.

It is very cringe to try to humiliate random people online, but there is something preposterous about all the deflections in this piece. The “Oxford study” people could have been shouting about the two papers I discussed above describing Asian women’s outsized enthusiasm for White men (or they could have chosen from dozens of related studies not discussed in this post). It seems to me that the “Oxford study” is an instance of the “toxoplasma” phenomenon described by Scott Alexander in his piece “The Toxoplasma of Rage” — namely that rage-inducing bad arguments outcompete available good arguments because the former are more memetically stable.

Steffi Cao closes her article with a revealing passage:

Sophia has some level of empathy for Asian men who feel rebuffed by modern dating culture, despite how invalidating it feels to be on the receiving end of their outrage. “I agree that that’s unfair, but I also think it’s unfair to take it out on Asian women who are just dating whoever they choose,” she said. “It’s just weird to take it out on us instead of people in society who are being racist towards Asian men.”

“I’m a victim of this too,” she added.

We are all equally victims. White men are victimizing us. We are not racist — we have internalized racism (which has been stuffed into us externally by White men). We watched The World of Suzie Wong in the 1960s and never recovered from this. We had Bruce Lee, but we need another one. White men desire us so intensely, they fetishize us. We desire them too, but only because we’ve been tricked by them.

This sort of low-agency mentality should be rejected by thinking adults. Again, it seems clear to me that American women should take more responsibility towards their own desires, if only to facilitate building a better understanding of themselves. American women are much more progressive than American men, and increasingly so — at least when it comes to their stated views. I return again to reflecting on the consistent result in the literature showing that racial preferences are more prominent among women than men. In this all-important aspect, women are objectively more racist than men. Ultimately, what is more consequential than our romantic and sexual choices?

Some women have told me that racialized sexual desire is a part of immutable human nature. Indeed, Buss and Schmitt’s Sexual Strategy Theory gives us an evolutionary model that suggests women will by nature be more conservative about interracial mating. Surely this matters, but it cannot be all that matters, as human sexual psychology is not so rigid. There is no conservation law of racial-sexual preference across time and culture (refer again to Razib’s discussion of how exceptional India’s caste system is). But fine, let’s say that “being into White guys” is human nature, and therefore it is “okay”. I think liberals who say this are playing with conceptual fire. All racism is natural in this sense. The low-status racist attitudes you condemn, are they not also rooting in immutable “human nature”? If not, why not?

Dating Advice For Asian Men

So what should be done? People who say that we need to “do the work” to dismantle internalized racism might be correct in some generic sense, but what exactly is the theory of change? More minorities in Marvel movies? Scolding people on the internet? Humans have narrow fertility windows, none of us have time to wait for society as a whole to be fixed.

Millennials everywhere are lonely and having very little sex. This suggests that we are not at the Pareto frontier of the intimacy curve, so perhaps we can side-step the consequences of racialized sexual preferences until we do arrive on the frontier. So maybe the best advice is some form of my old recommendation: go out and make more friends than you would otherwise want to have. Or, for men in particular, maybe the best advice is to go collect a lot of money and status to compensate for your race.

I’m not sure though that this “advice” is substantive. Who would be helped on the margin by hearing this? You already know that you should collect money, status, and friends, and someone telling you to get more won’t actually help you get more.

I instead think that low race-status men should be encouraged to do strange things that high race-status men would hesitate to do. Asian men are effeminate? Fine, take steroids, take HGH during critical growth periods, lengthen your shins through surgery (also do the obvious: take GLP-1 agonists). Alternatively, consider feminizing. Schedule more plastic surgeries in general. Don’t tell the people you’re sexually attracted to that you are doing this — that’s low status and induces guilt and ick. Don’t ask Reddit, they will tell you you are imagining things and need therapy. Redditoid morality tells you that it is valid and beautiful to want limb lengthening surgery if you start way below average and want to go to the average, but it is mental illness to want to go from average to above average.

Don’t be cynical or bitter or vengeful — do these things happily.


Addendum 1: More Resources on Racial Dating Preferences

Addendum 2: Q&A

I received some nice feedback from friends on this post, and I think the back-and-forths we had will prove useful to interested readers. I’ll present some of these conversations here (edited a bit obviously, but mainly for formatting reasons), with some additional commentary from myself.

What is actually being seen as racist by the subjects in the ‘“Preference” Paradox’ paper?

Friend A: Re:

Men perceive racial preference disclosures as racist, even when they explicitly claim that they believe that such disclosures are not racist. This suggests that people are fundamentally confused about their attitudes towards sexual racism.

I have a different interpretation, I think these men are reacting to the public disclosure of those preferences in the profile rather than their mere existence. I think it strongly signals something about the character of the person that they would put "no ___" directly in their profile rather than simply swiping past themselves.

More broadly I think cultural norms have evolved to where people see proclamation of racist attitudes (especially publicly) as fundamentally more racist than holding those attitudes and keeping them to oneself.

Vishal: This is possible, but I think the author's methodology plausibly minimizes some of that. From the paper:

At the end of the survey, participants responded to the question: “Do you believe it is racist to have exclusive racial preferences when it comes to sexual attraction?” on a Yes/No scale. This question was asked after, rather than before, participants rated the target, to ensure ratings of the target would not be influenced by answers to this question. If exposed to the question before the rating task, participants may have felt compelled to rate the target in a way that corresponded to their own answer to the question, defeating the purpose of using a person perception paradigm to get at what participants are unwilling to personally admit."

Presumably, the men answering this had in their minds the notion of stating a preference on a dating profile when forming their answer to this question (if this question was asked before the study, then it seems more plausible to me that subjects could think "racial preferences are okay— woah not like that!"). But yes, the question as stated does not ask whether it is racist specifically to state your racial preferences on a dating profile, so it is believable that they are mismeasuring.

Friend A: Oh good point, I agree that should help. I'm not a social psychologist (paging Friend D) but personally I would hypothesize that the publicly-stated vs. private racism effect is so strong it would still show up in the data.

On a different note:

…steroids, take HGH during critical growth periods, lengthen your shins through surgery (also do the obvious: take GLP-1 agonists). Alternatively, consider feminizing. Schedule more plastic surgeries in general.

Unironically some cis people should consume more gender-affirming care. I was literally wondering earlier today if any good model for gender dysphoria wouldn't include cis people as well.

What do you mean by “racism”?

Friend B: Very good article overall, though I have a moderate quibble with the very first bullet point in the summary (and with the general thread of commentary that I think is intended to support it):

  • People widely exclude romantic and sexual partners on the basis of race. This is claimed not to be racism by those who have these attitudes, but it is.

I think what you mean by "exclude romantic and sexual partners on the basis of race" is something like "decide to not date or have sex with people of race [x], even if they happen to run into someone of that race who they end up finding attractive (and otherwise suitable as a partner, based on criteria that don't ultimately bottom out in their race)". And if so that seems like a reasonable though somewhat noncentral kind of "racism".

But you might also mean "in practice are attracted to people of race [x] much less often than other races", which might also be described as a "racial preference"... this, of course, has a very different causal structure than "is racist, so decides to not date people of race [x] even if they find them attractive". Maybe you would successfully argue for calling that racism, as well; certainly if you substitute "attractive" with almost any other instinctive judgment of a meaningful quality people would nod and agree, "yes, of course, instinctively thinking that all people of race [x] aren't smart/trustworthy/etc. is racist", and, uh, fair. Sometimes this is screened off by those judgments not being cashed out in their behavior, because the differences are marginal, or because they're only relevant in lower-stakes contexts, or they don't have any decisions to make about other people that rely on those qualities at all. (And sometimes they aren't screened off, and then, well, I guess that's what all that implicit racism business is about, huh?) But when it comes to dating, you sure gotta make some decisions based on some perceived qualities of people.

Or maybe some secret third thing. But clarify, please?

Vishal: So I guess my model of racism is that it is a big cluster of correlated tendencies in belief, rather than one big dichotomous thing that the Bad people have and the Good people don't have. I'm not sure the causal structure of racist belief carves up in the way your two examples are trying to carve them up.

So when I say racial sexual preferences is sexual racism, which in turn is just a subtype of racism, I mean that the quality of having pronounced racial sexual preferences is much more central to the racist cluster than one might think. I think many normie libs have decided, semi-axiomatically, that they themselves are not racist, and that their desire to not date Black women >90% of the time cannot be racist (because that's just a preference, can't help it, don't yuck other people's yum, etc.).

I did not post many of the studies that made me think that racial sexual preference is more centrally clustered than you might think. The only one I explicitly mentioned is "Is Sexual Racism Really Racism?" which showed (I think convincingly) that the degree to which you think sexual dating preferences are non-racist is positively correlated with having other generic racist attitudes.

Others that I didn't write about are for example “Gendered Black Exclusion”. This is a study of the reasons college students give for not dating Blacks. A lot of those reasons seem to be pretty flimsy stuff, at least when compared with how much sexual exclusion they’re meant to support. Even if your worry is something like "there's a cultural problem with Black women, they're really bossy, etc." the actual distribution of those qualities is surely Gaussian, but the enormous amount of sexual exclusion that is built upon this is not Gaussian.

So for the two mentalities you're describing— if I'm understanding correctly— the first does seem like a central instance of sexual racism, and is only non-central insofar as it is perhaps a nontypical scenario (the more racialized your attitudes are, the more likely you are to find certain people unattractive). The second mentality is more what I'm addressing. And I guess I'm saying, what you find attractive is sometimes genetic in important ways, but in other ways there is a lot that didn't just fall out of the coconut tree.

Friend B: Yeah, ok, very reasonable, I was eliding many third possibilities/spectrum-y/correlation-y things. Like, people are bad at introspection and could easily meme themselves into believing that they're not attracted to race [x] people, and then actually substantially change how much they're attracted to those people, etc.

(I expect other people might have a similar confusion re: "racism" pointer).

I’d like to expand a bit more on the points discussed here. We are beleaguered by a cultural discourse wherein the words “racism” and “racist” are dichotomous labels. Either you are Racist (Boo!) or Non-Racist (Yay!). This is not truth-tracking because it is an instance of what Sander Greenland calls “dichotomania”. Additionally, discussions of ground-truth get replaced with a whole lot of signaling and faction-building (see all of Robin Hanson’s work).

I think readers assume that I am putting them in this sort of discourse when I say “racial dating preferences are racist”. I seem to be saying that a thing more-or-less everyone has is racist, therefore everyone is racist, therefore everyone is a Bad Person. So, I’m either saying something dilute and vacuous, or I’m tilting at windmills and effectively asking people to feel bad about themselves forever.

This is not what I’m doing. The optimal amount of racism is not zero. I’m not saying that as a part of a Hananian “Based Ritual”. I’m saying that I’m not expecting there to be a future where mating outcomes are completely uncorrelated with race, and I’m not saying the only moral future is one where we do Rawlsian, veil-of-ignorance style dating, wherein group tendencies are omitted entirely from all our decisions in love and sex. Rather, I’m saying what I said to Friend B: racial dating preferences are closer to the center of the racist cluster than you probable think, and there is some individual-level and society-level agency that can plausibly affect how strong our own racial dating preferences are; therefore we should reflect a bit more, as we do have good reasons to chisel away at these preferences.

And yes there is actually some agency slack here. Racial dating preferences do change over time (as I said earlier, there is not a conservation law), and perception of attractiveness does have some culturally subjective inputs. For some reason, some people seem to think there are only two choices: accept that the current distribution of racial dating preferences is natural and inevitable, or struggle forever in vain trying to make people have exactly zero racial dating preferences.

People seem committed this dichotomania. It is very hard to get people to understand that you are merely saying that there are good reasons to go down a particular gradient. I saw a lot of this in online discussions of political philosopher Amia Srinivasan’s book The Right To Sex. When doing her press tour for the book, she gave a short interview in El Pais. Here are the top comments on Reddit’s r/philosophy about this interview:

 

Contrast this with what Srinivasan actually said:

Q. One of your essays, motivated by the appearance of the incel movement [involuntary celibacy: online forums of men who are angry about being sexually ignored], raises a provocative debate: is there a right to sex? What happens to all those to whom it is denied?

A. This is incredibly well documented, the way in which people of certain races are effectively discriminated against on dating apps. We also all know that women beyond a certain age are no longer considered desirable to men, even of their same age, this sort of thing. The sexual marketplace is organized by a hierarchy of desirability along axes of race, gender, disability status and so on. And so what do we do? Although it’s worth pointing out that some feminists in the 1970s experimented with this sort of thing. They would enforce celibacy among the women in their group, or require them to be political lesbians, to no longer have relations with men. Those projects always go badly. I think what I would like is sort of two things. One is for us to kind of create a sexual culture that destabilizes the notion of hierarchy. And what I want to do is remind people of those moments that I think most of us have experienced at some point or another, where we find ourselves drawn to (whether sexually, romantically or just as a friend) someone that politics tells us we shouldn’t be drawn to, someone who has the wrong body shape, or the wrong race, or the wrong background, or the wrong class. I think most of us have had those experiences.

Q. Is it a matter of reeducating our desire, then?

A. I don’t mean something like engaging in a kind of practice of self-discipline, but rather, you know, critically reminding ourselves of those moments when we felt something that we then just denied. That’s an experience that’s very familiar to any queer person, right? Because most queer people have grown up with the experience of having desires that their politics, their society tells them not to, and then they silenced them. And so that act of remembering the fullness of one’s desires and affinities, I think it’s a good thing to do.

She very, very clearly is not saying that it is desirable or easy to totally eliminate social hierarchies from sex. Maybe people are getting thrown by the “woke” language of “destabilizing the notion of hierarchy.” I am struck by how epistemically modest her recommendations are: she rejects the “self-discipline” and “reeducation” mentalities, and instead say merely that it is good to be more critical and aware about the internal workings of a quite narrow set of cases of attraction (namely cases were we do feel attraction, but can feel ourselves tamping this down semi-consciously due our political and racial socialization). My attitude is less modest than this, but is similarly on the gradient.

So what is going on with Asian women? You only told us what isn’t going on.

Friend C: I found the ending of the essay a bit unsatisfying but that’s because reality’s kind of unsatisfying. There was no real good answer for WHY we see these racial dating preferences.

Vishal: I can only go into detail about why existing theories seem to fail. The socioeconomic ones fail (they don't predict the undesirability of Asian males, who are in general very wealthy), the evo-psych ones fail (they predict mating based in similarity, and consequently do not predict Asian women's endophobia), the postcolonial theories fail (they don't predict that Black Women and Latinas wouldn't be endophobic while Asian women would be), objective attractiveness theories fail (Asian men are short, but so are Latinos). The folk theory that white men fetishize Asian women doesn't work (due to Asian women's racial preferences), the folk theory that Asian men are objectively not masculine also doesn't quite work (it seems like if Asian men are objectively under-masculinized, then Asian populations would have crashed a long time ago).

What do you want from me?

Friend C: I want answers. More seriously it would be interesting to see how long these trends have been around.

Addendum 3: Giving Objective Attractiveness Theories Their Due

Friend D: You’re dismissing existing theories disjointly, but perhaps the truth is in a combination of these models.

Vishal: That’s a lot of degrees of freedom and feels like overfitting.

Friend D pushes me to think more about how masculinity differs across races.

Friend D: I don’t think you’re paying enough attention to masculinity and attractiveness in explaining the racial dating preference data. This is not my specialty, but there are a lot of studies showing that Asian male faces read under-masculine to women and that Black female faces read over-masculine to men.

Vishal: If masculinity in male faces has a straightforward relationship to attractiveness and dateability, then why are Black men seen as unattractive while being rated so highly in masculinity?

Friend D: They are seen as over-masculine.

Vishal: So Asian men are under-masculine, Black men are over-masculine, but White men are *chef’s kiss* just right? This is how Asian women think and this is just an objective, non-culturally-mediated reaction to secondary sexual characteristics?

Friend D: *Laughs* Yes it is suspicious, but maybe it’s true.

Vishal: I guess I’m not seeing it. Surely if Asian men’s faces are objectively under-masculinized, then fertility in Asia itself would have suffered, leading to population collapse a long time ago.

Friend D: No it’s not noticed in Asia, the under-masculinization is only noticed in the West when Asian women have men from other races to chose from. It’s only when there are available comparisons that non-White men are penalized.

Vishal: I’m not seeing how objective masculine attractiveness could work that way. Let’s say the faces of every man in the whole world became 10% less masculine overnight. Wouldn’t fertility suffer enormously?

Friend D: No, perception of masculinity would re-anchor on the new distribution.

Vishal: What?? There’s no way. If there was, say, an 80% decrease in masculine features, nothing would happen?

Friend D: No that’s probably too large an effect for re-anchoring to occur, but I don’t think that’s a good argument for why anchoring couldn’t occur for lower magnitudes. What would women be doing, in your mind in the 10% case?

Vishal: They would be volceling more, because less attraction leads to less drive to form relationships and have sex at the margin.

Looking at the facial attractiveness literature, I find I am confused. The foundational paper seems to be Perrett et al. (1998):

Testosterone-dependent secondary sexual characteristics in males may signal immunological competence and are sexually selected for in several species. In humans, oestrogen-dependent characteristics of the female body correlate with health and reproductive fitness and are found attractive. Enhancing the sexual dimorphism of human faces should raise attractiveness by enhancing sex-hormone-related cues to youth and fertility in females, and to dominance and immunocompetence in males. Here we report the results of asking subjects to choose the most attractive faces from continua that enhanced or diminished differences between the average shape of female and male faces. As predicted, subjects preferred feminized to average shapes of a female face. This preference applied across UK and Japanese populations but was stronger for within-population judgements, which indicates that attractiveness cues are learned. Subjects preferred feminized to average or masculinized shapes of a male face. Enhancing masculine facial characteristics increased both perceived dominance and negative attributions (for example, coldness or dishonesty) relevant to relationships and paternal investment. These results indicate a selection pressure that limits sexual dimorphism and encourages neoteny in humans.

Averaged faces from Perrett et al. (1998). Caucasian faces on top, Japanese faces on bottom.
Feminization and masculinization of face shape. Columns 2 and 4 are masculinized variants, Columns 1 and 3 are feminized variants. These are “50%” deformations. The study presented subjects with interpolated faces with softer deformations.

Japanese people and British Caucasians were asked to judge which faces they preferred. They all preferred feminization for both men and women. Within-race they like feminization even more:

Female faces in a; male faces in b.

The result that feminized faces are preferred was replicated a year later in Rhodes (2000). A much later Japanese study, Nakamura & Watanabe (2019), summarizes the subsequent literature and reinforces these results about feminization and facial attractiveness, while also providing some color on which masculine features matter universally, and which are more culturally-scoped:

Given the pervasive influence of facial attractiveness, it is natural for psychologists to try to determine the features that make a face attractive [7]. Previous studies have identified multiple facial cues related to attractiveness judgements [8,9]. In general, such cues are identified in terms of facial morphology (facial shape) and skin properties (facial reflectance). For facial shape cues, averageness, symmetry and sexual dimorphism (masculinity for men and femininity for women) are well documented as influential determinants of facial attractiveness [7,10,11]. More specifically, faces with an average-looking shape, size and configuration [12] or a symmetric shape [13] are perceived as being more attractive than faces with a distinctive or asymmetric shape. Furthermore, shape differences between the sexes that emerge at puberty (i.e. sexual dimorphism) are related to attractiveness judgements; sex-typical facial characteristics are often associated with attractiveness [14,15]. Across cultures there is a general consensus that female-looking female faces with larger eyes and pronounced cheekbones are preferred to male-looking female faces, irrespective of the sex of the evaluator [16,17]. However, a preference for the masculinity of male faces, with features such as larger jawbones and more prominent brow ridges, is not consistent [16,18,19]. Facial reflectance cues such as texture, colour, and contrast also affect attractiveness judgements, independently of facial shape cues [2023]. For example, exaggerating yellowness, redness or lightness on a face increases perceived attractiveness [20,24]. The above-mentioned shape and reflectance characteristics may perhaps be preferred because they act as reliable predictors for potential physical health or fecundity, possibly leading to higher rates of reproductive success [15,25]. Consistent with the idea that facial attractiveness signals heritable fitness and physical health, the preferences for these facial features are biologically based and thus observed across Western and non-Western cultures [14,17,26].

Our findings are also consistent with previous hypothesis-driven studies showing that feminine-looking male and female faces are preferred over masculine-looking faces across cultures [14]. Whereas the masculinity of male faces signals a high genetic quality in terms of potential mating and health [25], it can also be associated with negative personality traits and behaviours. For example, masculine male faces are perceived as less emotionally warm, less honest and less cooperative than feminine male faces [14]. Moreover, it has been reported that highly masculine men are more likely to be subject to marital problems and divorce when compared to more feminine men [45], and that masculine men are insensitive to infant cries, feeling less sympathetic than feminine men [46]. Preferences for a feminized male face might be derived from an avoidance of such a man as a long-term partner and may reflect desired personality traits [47]. Together with previous findings, the current results indicated that Japanese people prefer femininity when judging the facial attractiveness of both male and female faces.

Shohei Ohtani
Perhaps Ohtani is the ideal for non-Western Asian women? Highly masculine build but with a bit of a baby-face.

It is not clear to me why this does not result in Asian male faces being perceived in the West as more attractive, because they certainly are seen as less masculine. Some of the researchers on facial attractiveness suggest that the feminization methods they use in their image manipulation studies also de-age the face, causing increased attractiveness. Perhaps Asian male faces in real life are directionally less masculine than White men, but not directionally more youthful.

Revisiting the above section Dating Advice for Asian Men, it seems like the body needs to be masculinized (through exercise, weight-loss, hormones), and the face needs to be feminized (through plastic surgery, skin-care). As Friend A suggests, cis-men should consider more gender-affirming care. Male discourse around masculine faces seem to emphasize face shape quite a lot, specifically things like jaw-line augmentation, but it seems that there is strong evidence that this is not actually attractive. Men should consider following trans-women in undertaking facial-feminization surgery.

  1. ^

    I try to clarify what I mean by “racist” in Addendum 2.



Discuss

Mainstream Grantmaking Expertise (Post 7 of 7 on AI Governance)

2025-06-23 09:39:43

Published on June 23, 2025 1:39 AM GMT

Previously in this sequence, I estimated that we have 3 researchers for every advocate working on US AI governance, and I argued that this ratio is backwards – we need the political power provided by advocates to have a chance of preventing misaligned superintelligence. A few researchers might be useful as a ‘multiplier effect’ on the power of many advocates, but the converse is not true: there’s no “magic bullet” that AI governance researchers can hope to discover that could substitute for an army of political foot soldiers. Even the best political ideas still need many political activists to spread them, because the political arena is noisy and contested. 

Unfortunately, we have very few political activists. This means the good ideas that our governance researchers have been proposing are mostly “orphaned.” In other words, these good ideas have no one to draft them into shovel-ready language that legislators could approve, and they have no one to advocate for their passage.

It’s a bit of a puzzle that the AI safety movement nevertheless continues to fund so much academic-style research. Why do we need more good ideas if we already have a dozen good ideas that nobody is acting on?

In the sixth post in this sequence, I suggested that part of the answer to this puzzle is that we have 4 academic-style thinkers for every political advocate on our AI safety grant-making teams. Because the culture within these grantmaking teams is heavily tilted toward research, grantmakers may feel more comfortable funding research grants or may be better able to appreciate the merits of research grants. They may also have a warm collegial relationship with some of the researchers, which could subtly bias their funding decisions.

Ideally, these kinds of biases would be caught and corrected by formal grantmaking procedures, such as the use of written scoring rubrics. Unfortunately, such procedures are notably missing from major AI safety funders. A week before publishing this post, I gave an advance copy of it to the five largest institutional donors so that they could address any factual errors. Although several people offered helpful corrections, none of these corrections change my fundamental point: AI safety funders are evaluating grants using an informal and subjective method. No formal criteria are being applied to most AI governance grant applications.

In this seventh and final post, I explain why these formal grantmaking procedures are helpful, why they are not used by AI safety funders, and how this can be fixed.

AI SAFETY GRANTMAKERS HAVE VERY LITTLE MAINSTREAM GRANTMAKING EXPERIENCE

The Effective Altruist Bubble

Most of the people who are running AI safety grantmaking teams have significant grantmaking experience – but essentially all of that experience is from inside the effective altruist (EA) bubble. A typical career might involve doing AI safety research for a couple of years, then doing some EA field-building or running a small EA organization, then working as an EA grantmaker, and finally managing an EA grantmaking team. It’s reasonable to promote most of our leaders from within our own movement, but we’ve gone beyond that and cultivated an extremely insular bubble. At no point has any AI safety grantmaker ever worked as a grantmaker at any traditional philanthropy to see how those organizations make decisions about awarding and managing grants. We’re not just talking about a general tendency to prefer internal candidates – we’re talking about a level of detachment from the mainstream philanthropic world that is likely to result in profound ignorance about mainstream practices and techniques.

You might wonder what we have to learn from traditional philanthropies. Part of the premise of effective altruism is that it’s possible to do much better than traditional philanthropies by using utilitarian philosophy and rigorous quantitative reasoning: because effective altruists think hard about what it means to do good and how to measure that, we really do select better cause areas than traditional philanthropies. I firmly approve of the way the EA movement prioritizes, e.g., shrimp welfare over adopting puppies, or prioritizes curing malaria over supporting victims of breast cancer. I would not want to undercut any of the types of reasoning that underlie these broader strategic decisions.

However, just because the EA movement is better than mainstream philanthropies at selecting which broad cause areas to support doesn’t mean that everything that traditional philanthropies do is wrong or pointless. In particular, traditional philanthropies have well-developed techniques for selecting which individual grants to fund within a cause area, and (as I will argue throughout this post) these techniques are often much better than the comparable techniques used by EA grantmakers. Thus, rather than trying to reinvent the entire field from scratch, we should be able to learn from mainstream philanthropy’s experiences and import some of their best practices when these practices are better than ours.

I conducted a quick census of the AI safety grantmaking staff at the five major institutional funders (Open Philanthropy, Longview, Macroscopic, LTFF, and SFF) based on their LinkedIn profiles and institutional bios. As in my previous post, I have lightly anonymized these staff by offloading their individual descriptions into a separate Google Doc and referring to them as “Person 1” and “Person 2,” rather than directly using their real names. If you want to see who they are, you can just click on the relevant hyperlinks, but my purpose here is to point out a deficit in the overall bank of skills held by the movement as a whole, not to criticize individual staff.

Unfortunately, as you can see from my census, there are only a few examples of people working on AI safety grantmaking who have any non-EA philanthropic experience at all. None of this experience appears to be related to grantmaking. 

We have one person who worked for a decade as a mainstream Chief Development Officer, where she most likely would have focused heavily on raising new funds (as opposed to figuring out how to spend those funds wisely). We have three people who each have a few years of experience working as team leads or front-line staff at mainstream nonprofits – this would teach them about what charities need and about how to fill out grant applications, but not necessarily about how to design or evaluate grant applications. Finally, we have two people who briefly worked in roles at mainstream charities that seem unrelated to grantmaking.

That was all the non-EA grantmaking experience on AI safety grantmaking teams that I could find. To the best of my knowledge, everyone else at Open Philanthropy, Longview Philanthropy, Macroscopic Ventures, the Long-Term Future Fund, and the Survival and Flourishing Fund either has grantmaking experience from solely within the EA bubble, or has no previous grantmaking experience at all, or works on grantmaking for other cause areas besides AI safety.

If I’ve missed someone with relevant experience, please let me know, and I’ll be happy to add them. Otherwise, I am forced to conclude that there is essentially zero mainstream grantmaking experience among AI safety grantmakers. 

Ineffective Bootstrapping

Most of the techniques that AI safety grantmakers are applying to evaluate the quality of grant applications are therefore likely to be techniques that they invented themselves or that they learned from each other as they worked their way up the ranks of EA-affiliated philanthropies. 

These techniques are very unlikely to be adequate.

In most cause areas, we would be able to “bootstrap” up from a very thin knowledge base by seeing which grantmakers were funding grants that tended to succeed. For example, if you’re trying to fund malaria eradication, you can look and see what happened to malaria rates among the populations served by the grants you funded. Over time, you’ll build up a reasonably clear picture of which grants were most successful, which will in turn allow you to identify which grantmaking techniques were most successful. Thus, even if your team starts out without knowing much about proper grantmaking procedure, your team could gradually converge on some of the best grantmaking practices through trial and error.

However, in the field of AI governance, it is impossible to do a significant amount of bootstrapping, because we get very limited real-world feedback, and because we have a vague and unclear theory of change.

AI governance provides very limited feedback because much of the “outcome” of AI governance work is long-term, speculative, and binary. To put it bluntly, either we will wind up in a future where there is an AI apocalypse, or we will not – and we will not necessarily observe which future we are in until it is too late to meaningfully change our grantmaking decisions. Unlike malaria, misaligned superintelligence is not a problem that claims one victim at a time. 

As a result, we should not expect to get much objective data about which AI governance grants were most successful. We should especially not expect to get this data with a casual or incidental effort – if we want meaningful data about which governance programs are succeeding, then we’ll need to carefully identify reasonable proxies for success and then make a special effort to measure those proxies. As I will argue later in this post, it does not appear that this work has been accomplished. Because we do not know much about which governance projects have been most successful, our grantmaking teams will not have been able to learn much about which grantmaking techniques were most successful. 

Suppose you apply a particular flavor of grantmaking to fund my AI governance research paper. Your methods lead you to be very confident that my research paper is worth funding. Later, you want to find out whether my research paper was successful so you can decide whether the methods you used led you to make an accurate prediction about the value of my grant.

What would it even mean to decide that my governance paper was “successful?” It is not clear what (if any) theory of change can be applied to the paper. You can perhaps observe whether I published a governance paper, whether other researchers agreed that the paper seemed insightful, or whether I had many citations, but none of these metrics are very well-correlated with how much (if at all) the paper helped save the world. It’s profoundly unclear who is supposed to be reading these academic-style papers, why they will behave differently as a result of reading them, and why those behavioral changes will reduce the risk of an AI catastrophe. Not only are we not measuring any quantitative data about such papers – we don’t even have a good explanation for why or how we should expect such papers to be helpful.

As a result, there’s simply no viable feedback loop for AI safety grantmakers. Most AI grantmakers are very diligent, thoughtful people, but no matter how diligent you are, you can’t improve your skills through deliberate practice unless you’re given access to some kind of data about how well you’ve been succeeding, and in the AI governance field, that data mostly hasn’t been available.

Therefore, our movement has a staff of professional grantmakers who were not taught how to manage a grantmaking program by mainstream philanthropies, and who are unlikely to have learned much about that task based on their years of experience within the EA bubble.

HOW GRANTMAKING IS SUPPOSED TO WORK

 The main reason why the AI safety funders’ lack of non-EA grantmaking experience matters is that it tends to create excessively informal grant evaluations that are more likely to lead to suboptimal decisions. To explain what I mean by this, I want to first lay out how professional grantmaking is supposed to work and then lay out what my experience has been with grantmaking in the AI safety field.

The Core Process for Traditional Grantmaking

The core process for professional grantmaking consists of the following steps, in approximately this order:

  1. A funder settles on a budget and timeline for achieving a particular goal.
  2. The funder communicates their goals, budget, and timeline to potential grantees so that they can prepare to apply for funding.
  3. The funder decides how grant applications will be evaluated and explains this process to potential grantees so that they know what kinds of grants to apply for.
  4. The funder collects and evaluates grant applications based on the process.
  5. The funder notifies all applicants about the decisions that were made on their grants and the reasons for those decisions.

This is the process that I used for five years at HomeBase, which refereed grant application processes every year for homelessness services at each of about 30 different counties spread across the US Pacific region. Nobody saw the basic steps of this process as overly fussy or unnecessary – on the contrary, participants used to badger us to release information about each of the five steps as soon as possible. People would drive for an hour to come to our briefings, because they wanted to find out how they could earn a grant and how they could keep it.

As part of this process, we always had a written rubric that we used to assign numerical scores to each grant application. An example of this kind of work can be seen starting on page 22 of this handbook. To the maximum extent possible, we tied scoring factors to measurable quantitative outcomes – for example, a rapid rehousing program could get up to 3 points for enrolling its clients in health insurance, up to 8 points for providing at least as many apartments as they pledged to provide, up to 2 points for helping clients increase their income, and so on. The larger the percentage of clients who get their health insurance, the larger the share of the 3 points for health insurance you could claim. We took the mystery out of funding; for the most part, you could look at your own performance and figure out whether you would be eligible to have your funding renewed.

When it wasn’t possible to use quantitative estimates, we described the qualitative outcomes we were looking for in great detail, breaking these outcomes down into multiple subparts and awarding a specific number of points for satisfying each subpart. We then recruited a panel of independent judges to determine which subparts had been satisfied, and we averaged their ratings.

The resulting final scores were used to rank all of the grant applications in order, and then those applications’ scores were made public so that everyone could see which applications were funded and why.

This procedure was required by federal law, but the level of transparency and rigor we offered was seen as typical in our industry, including by people coming from state and local government, churches, hospitals, and private foundations.

For example, Chris Kabel, a senior program officer at the Northwest Health Foundation, which manages $68 million in grants, writes that his organization hosts “a grantee forum where we invite anybody who is interested in applying for a particular program to learn about what the program is trying to achieve and what we’re hoping to see in competitive proposals. We also answer questions they have that are relevant to their particular programs or initiatives. I’d say almost all of our grantees probably already know how they fit into our program’s goals and strategies by the time they get a grant from us.”

Here's a similar quote from Anne Warhover, the CEO of the Colorado Health Foundation, which manages $1 billion in assets. “Everybody has to be accountable in this world. We start  by thinking about who is our customer, who is our stakeholder, to whom do we owe our accountability, and that is the people of Colorado. This isn’t our money. This is their money. And so how can you possibly be accountable without showing results and having some objectivity to your results? You can’t just fling your money in all different directions and hope that some of it sticks. You have got to have strategies that will help you get to those results.”

Likewise, Ken Thompson, a program officer at the Bill & Melinda Gates Foundation, writes that “the single most helpful thing you can do to make the reporting process useful to everybody is to be clear up front about what the project intends to accomplish. The other thing that is particularly helpful for grantees is to identify a set of reasonable goals to measure.”

For some more examples of how foundations put this kind of grantmaking technique into practice, you can check out the scoring rubrics provided by:

Advantages of Using Formal Rubrics

These rubrics have two important features: (1) they assign a specific number of points to specific criteria, and (2) they explicitly say what is necessary to earn points in each category. Together, these features make it much more likely that grants will be awarded based on the foundation’s true values. It is still possible for an evaluator to ‘cheat’ and assign a project an artificially high score simply because they like it, but doing so will require explicitly lying about whether it meets the written criteria. Most people who are altruistic enough to work at a nonprofit find this kind of explicit lying very distasteful and will usually avoid it.

Obviously, effective altruists can and should object to some of the specific goals being pursued by these grantmakers – grants to rural American schools or to fund career development for people with disabilities are honorable causes, but they are probably not reasonable choices if you are trying to do the most good per dollar. Effective altruists might also object to some of the specific criteria used in these rubrics, or at least to their weighting – for example, if I were writing the rubric for Ability Central, I would not award five times more points for having “people with disabilities in leadership” than for having a clear “needs assessment” that shows why the grantee’s project is necessary.

However, the process used by these mainstream philanthropies is absolutely vital, because it forces grantors to make decisions about which criteria are most important to them and to communicate those decisions to prospective grantees. I can criticize Ability Central’s grantmaking criteria and try to improve them because those criteria are written down. If Ability Central had instead used an informal evaluation process that allowed their staff to recommend whichever grants seem best to them, then nobody would be able to tell that, e.g., Ability Central was missing out on opportunities to do more good because they were paying too much attention to the identities of their grantees’ leaders. 

Thus, even though Open Philanthropy’s substantive goals are better than Ability Central’s goals, Ability Central’s grant evaluation procedures are better than Open Philanthropy’s procedures.

If you haven’t written down the criteria that you’re using to evaluate grant applications, then nobody (not even you) can be sure that you’re funding the projects that best support your mission. You might instead be funding whatever projects subjectively appeal to your grantmaking staff for reasons that you would not reflectively endorse. As Luke Muehlhauser of Open Philanthropy wrote in 2011, “Cognitive science shows us that humans just are a collection of messy little modules like anger and fear and the modules that produce confirmation bias. We have a few modules for processing logic and probability and rational goal-pursuit, but they are slow and energy-expensive and rarely used…our brains avoid using these expensive modules whenever possible.”

If you allow grantmakers to make subjective personal decisions about which grants to fund without a formal process that forces them to connect their decisions to community-approved criteria, then at least some of the time, some of their brains will slip into cheaper processing modes that default to whatever conclusions feel comfortable – that’s just how humans work.

Clarity vs. Rigidity

It’s not necessary for these grantmaking criteria to be utterly rigid – if there’s a good reason to make an exception to your criteria, then you can and should allow a grantmaker to write up a defense of why they are making an exception. 

In other words, my proposal is not that donors should mechanically total up a list of points and then feel bound by the outcome. Rather, my proposal is that donors should go through the exercise of specifying concrete goals, assigning weights to those goals, and then totaling up the resulting points in order to (1) get clear in their own minds about what types of achievements are most important, and (2) make sure that they are not accidentally allowing themselves to place too much weight on criteria that are less objectively important. 

If, after totaling up the points and taking the time to write up an explanation of why the points are misleading for a particular grant, a donor remains convinced that the points lead to an incorrect recommendation, then the donor should ignore the points and act on what they believe to be the correct recommendation.

However, if you don’t even insist that people explain how their decisions relate to their principles, then much of the time, their decisions won’t relate very closely to their principles, and as a result those decisions will predictably do less good.

HOW GRANTMAKING WORKS IN AI SAFETY

This seems to be what is happening in the AI Safety grantmaking space – there is no process that would require grantmakers to connect their decisions to their principles, and as a result, grantmakers are often making deeply suboptimal funding decisions.

Blundering in the Dark

I attended a talk at Less Online this year where Lydia Nottingham attempted to create “GiveWell for AI safety,” asking participants “how might we estimate the cost-effectiveness of marginal donations in AI safety?” The session was fruitful in that we considered various alternative metrics and had a thoughtful discussion of their pros and cons. Should we measure the number of citations to a research paper funded by a grant? Which citations matter? Can they be weighted appropriately? How should an academic citation be weighed against a citation in popular media?

In one sense, this was a very interesting discussion. In another sense, it was absolutely outrageous that amateurs could be having such a conversation for an hour on end without anyone stopping and saying, “Wait a minute; this has all been worked out already by the professionals; here are the metrics they use and here’s why they use them!”

Hundreds of millions of dollars have already been allocated to AI safety projects. What was the basis of those allocations? The horrifying answer is that there is no such basis – or, at least, no basis that grantees are allowed to see.

How many pieces of legislation does Open Philanthropy think CAIP should have been able to edit for $2 million? How many citations in earned media do they think we should have had? How many times do they think we should have met with Congressional officials? What kind of language do they think those officials should be using when they talk about CAIP or about AI safety? If these aren’t the correct metrics, then what are the correct metrics?

I did not have this information when I set out to build up CAIP and started making new hires, and I still don’t have this information today, and that’s ridiculous. If I had known exactly what our donors’ expectations were in 2023, there’s a good chance that I would have turned down the work and said, “I’m sorry, but I can’t achieve those results on the budget you’re offering.” That would have saved the movement quite a bit of money. Alternatively, perhaps I would have been able to refocus our work so as to actually achieve the goals, thereby generating a lot of value for the movement. Instead, I was left to blunder in the dark.

A Note on Goodhart’s Law

One grantmaker I spoke with defended this ambiguity by noting the potential that explicit performance targets could be abused by self-interested grantees. He is concerned that if, e.g., advocacy groups are scored based on the number of Congressional meetings they attend, then groups might, e.g., artificially increase their apparent performance by scheduling hundreds of low-quality meetings with less relevant Congressional staffers.

In my opinion, this conflates the bare possibility that Goodhart’s Law might have some effect on AI safety advocates (which is true) with the conclusion that Goodhart’s Law applies so strongly as to render it infeasible to have any formal performance criteria at all (which seems false). 

To continue the example from above, if you are worried about advocates having too many low-quality meetings, then you can add a scoring factor that evaluates the quality of the meetings. You can carry on adding as many factors as seems necessary until your scoring factors collectively arrive at a reasonable approximation of the thing you are trying to measure.

Bureaucracies engage in formal, multi-part evaluations all the time, not all of which are completely pointless or corrupted. Nobody suggests that, e.g., food safety inspectors are creating an artificial set of incentives for restaurants to sweep their floors, wash their dishes, and throw out expired food, even if these tasks are on a checklist that is distributed to restaurant owners in advance. The reason why these tasks are on the checklist is that they really do increase food safety, and there is no cheap and easy way for most restaurants to appear to have completed these tasks without actually completing these tasks. Similarly, when I was evaluating the performance of homeless shelters, I both checked to see how many people were sleeping in beds, and to make sure that those beds were clean and adequately sized and made of appropriate materials. If a shelter somehow found a way to increase the number of people who were sleeping indoors on appropriate beds, that would genuinely reflect well on the shelter.

For the most part, there’s no reasonable way to cheat on a well-designed grant evaluation process. If you find yourself suffering badly from Goodhart’s Law, that’s usually a sign that you’re not investing enough time or enough talent in designing the formal rubric. For instance, if you’re handing out a billion-dollar grant, then you shouldn’t expect a one-page checklist to adequately constrain the grantees; for that much money, even well-intentioned people will often fall prey to corrupt incentives to find extremely clever ways to optimize for satisfying the formal criteria. However, if you’ve got a million-dollar grant backed up by a five-page rubric, then it’s usually not worth a grantee’s time to try to artificially inflate their results. This is especially true when the grantmakers ultimately remain free to use their judgment to overrule the formal results of a rubric.

As a result, I think it is very ill-advised to deliberately obscure the grantmaking process as part of an effort to prevent artificial optimization. Grantmakers need to be very clear in their own minds about what they’re trying to fund and why so that they can make optimal decisions, and grantees also need to know what kinds of grants will be funded, so that they can make responsible long-term plans and focus their efforts on activities that the community is willing to support. The problems that are currently being caused by a lack of clarity in these areas are far more severe than any minor difficulties that might arise based on Goodhart’s Law.

Open Philanthropy’s Evaluation

To illustrate what I mean by a lack of clarity, I am sharing a full copy of the final email that I received from Open Philanthropy regarding their evaluation of CAIP. I have other communications from them, but none of them are substantially more detailed. There is no appendix or similar document that unpacks their reasoning. 

The point of this example is solely to illustrate the vagueness of AI safety grantmaking evaluations. Even if you happen to agree with all of Open Philanthropy’s conclusions here, you should still want them to use more rigor than this so that you can be confident that their grantmaking decisions will reliably match up with their stated values.

“Dear Jason,

 I hope this email finds you well. I wanted to follow up on our conversations regarding CAIP's funding application to Open Philanthropy.

 After careful consideration and thorough evaluation, we have decided not to recommend CAIP for funding.

This was a challenging decision in light of the helpful information you’ve sent over the last few months. We continue to hold CAIP’s thinking on policy development in high regard, and we appreciate CAIP’s traction in policy engagement and communications. We concluded, however, that the counterfactual impact does not clear our especially high bar for recommending 501(c)(4) advocacy organizations, for which heightened sensitivities and less favorable tax treatment make donor dollars scarce.

This is a reflection in part of the improved overall state of the AI safety advocacy field, where there are now several organizations doing great work and not enough donors to fund them all. But, in the interest of transparency, it also reflects our continued sense that CAIP’s strengths and weaknesses are an imperfect match for its activities. For example, we think CAIP has shown a capacity for developing and refining detailed policy ideas, and to some extent educating policymakers and (in the early stages) building grassroots support – all activities that could be done at a standard nonprofit. But we still have doubts about CAIP’s political aptitudes, and the lobbying track record, while better than we initially thought, still does not seem strong enough to create a sufficiently compelling case.

Given our limited funding capacity for advocacy work, we need to direct resources where we believe they'll have the greatest marginal impact. We know that this decision is likely to create challenges for CAIP, and we sympathize. We genuinely wish you and the CAIP team success in your future endeavors, and we appreciate your hard work to improve the outcomes from transformative AI.

 Best regards,

[Person N]

Notice that this email does not include any scores, any benchmarks, or any quantitative information at all. Although the email alludes to an “especially high bar” for recommending 501(c)(4) organizations, it does not provide any information about where that bar sits or how it is calculated. The email mentions some of the criteria that Open Philanthropy considered, such as CAIP’s “counterfactual impact” or its “political aptitudes,” but does not say how these factors were evaluated or weighted against each other. It is not clear from the email what it would mean to have high political aptitude or how Open Philanthropy determined that CAIP’s political aptitude is low or medium. 

This is particularly troubling because without this kind of information, it is unclear how Open Philanthropy could compare or evaluate the relative good done by a research organization versus an advocacy organization. There are three times more researchers than advocates in the space, so on the margins, we can say that Open Philanthropy chose to fund some additional researchers instead of funding CAIP. Did those researchers have strong research aptitude? If so, does that mean that those researchers would do more to reduce the risk of an AI catastrophe than CAIP? Why? How do we know, or what is the basis of that opinion? It is not clear that Open Philanthropy is even attempting to answer these questions.

To its credit, Open Philanthropy does have published criteria for its AI governance grants. However, Open Philanthropy warns that such grants are evaluated “holistically,” and it is not clear that most of these criteria were consulted when evaluating CAIP’s application. According to Open Philanthropy’s published criteria, “key considerations … include:”

  1. whether the proposed activities are justified by the applicant’s “theory of change,”
  2. the applicant’s “track record” of success at similar projects,
  3. the applicant’s “strategic judgment” in making “well-thought-out decisions under uncertainty,”
  4. an indication that the applicant is “aware of potential risks and can prioritize how to respond to them,”
  5. the “cost-effectiveness” of the proposed budget in light of “the project’s goals and planned activities,” and
  6. whether the grant application has a financial “scale” that is suitable for this particular fund. 

Open Philanthropy’s “sense that CAIP’s strengths and weaknesses are an imperfect match for its activities” could be a reference to criteria #1 and/or #5, although it is difficult to tell, because the email does not explicitly reference any of the criteria. Similarly, Open Philanthropy refers to CAIP’s “lobbying track record” as “not…strong enough to create a sufficiently compelling case,” which could be a reference to criterion #2. 

None of these three criteria are discussed in a clear enough way to allow Open Philanthropy to easily compare CAIP’s performance to the performance of other organizations. For example, what precisely does it mean for CAIP’s strengths and weaknesses to be an “imperfect match for its activities?” Is this like getting 2 out of 5 points? 3 out of 5 points? 4 out of 5 points? How well do other grantees’ strengths and weaknesses match their activities, and how is this evaluated? Open Philanthropy doesn’t say. Even if Open Philanthropy’s grantmakers have an intuitive idea in their minds about what these phrases mean, such intuitions are highly vulnerable to subjective bias and to being misremembered or subconsciously edited from one grant to the next. The point of writing down a numerical score for each grant on each criterion is to mitigate such biases. 

Open Philanthropy’s other three criteria are not mentioned at all in their evaluation of CAIP. I cannot find any discussion of CAIP’s strategic judgment, risk management, or scale.

Even if, for some reason, only half of Open Philanthropy’s criteria were relevant to its evaluation of CAIP, it would still be helpful for all parties if they explicitly noted this. As I argue above in the section on “Advantages of Using Formal Rubrics,” the process of thinking through how each criterion applies to each grantee and writing down your thoughts helps make sure that you honor your core values and that you are not accidentally substituting other considerations.

I do not believe the informality of Open Philanthropy’s evaluation techniques is limited to CAIP. For example, Apart Research recently wrote that their need to engage in last-minute fundraising was partly caused by “somewhat limited engagement from OpenPhil's GCR team on our grants throughout our lifetime,” and that “with OpenPhil, I think we've been somewhat unlucky with the depth of grant reviews and feedback from their side and missing the opportunity to respond to their uncertainties.” This is a very polite way of saying that they don’t know why Open Philanthropy isn’t giving them more funding and that they haven’t been able to find out.

I also spoke with Soren Messner-Zidell, a senior director at the Brookings Institution who was hoping to launch an AI safety communications project. He has over 15 years of experience as a public advocate. After he negotiated with Open Philanthropy for several weeks, his application was also denied without any type of detailed explanation. Like me, he does not know what criteria they are using to evaluate AI safety grants.

Longview’s Evaluation

CAIP received a similarly unstructured final report from Longview Philanthropy. Again, there was no appendix or more detailed letter that provided significantly more information about Longview’s process. If they conducted any more formal evaluation, they did not share it with us. Below is the full text of the most relevant email:

Hi Marta and Jason,

Thank you so much for engaging with us throughout this investigation. Sadly, Longview will be unable to find donors for CAIP's work in the near future. 

This conclusion is based on our completion of several other grant investigations and discussions with our most reliable donors. The unfortunate reality as we see it is that there are more promising opportunities in AI policy than our donors are able to fund, and despite CAIP's strengths, we will not be able to provide funding for it. 

We think CAIP is doing good work. We hope you find the support needed to continue as an organization, and if you do, we'd be happy to consider recommending CAIP to our donors in the future. But at least for the next few months, and likely until the end of the year, we will not be able to find additional funding for CAIP. 

Thank you for all the effort you've put into this important cause. If there's anything beyond funding that we can do to help going forwards, please let us know. 

Sincerely,

[Person V]

This email is even vaguer than the email from Open Philanthropy because it does not even gesture at the factors that might have been considered – this email provides only the raw decision that CAIP is less valuable than other projects. 

Longview’s website provides a list of some of the questions that they consider when evaluating grants, but it is not clear how the answers to these questions are measured or which questions are considered most important. Longview does not say anything about which of CAIP’s answers (if any) were deemed unsatisfactory.

Longview’s website also says that when they recommend a grant for funding, they “include quantitative predictions, so that philanthropists can clearly see what we expect in terms of key outcomes, and how likely we think they are.” 

This is praiseworthy, but it is unclear whether Longview follows similar procedures for rejected grants, or if they only do this for grants that they have already decided to recommend. If the former, it is unclear why they are unwilling to share even an outline of their predictions with grant applicants. If the latter, it is unclear how they know which grants to reject – if, as Longview writes, CAIP “is doing good work,” then how was Longview able to decide not to recommend CAIP even before Longview made any quantitative predictions? If the rejection isn’t based on a quantitative prediction, then what is it based on?

I find the lack of detail disturbing, and I think you should too. If there is some particular scoring factor (e.g. one donor’s private preferences) that must be kept secret in order to maximize total funding, then that particular scoring factor could be censored. However, to share no details at all about the grant evaluation process with potential grantees sets them up for failure and wastes everyone’s time. If we don’t know how we’re being measured or what standard we’re being held to, then we can’t make realistic plans to achieve those goals.

LTFF’s Evaluation

The Long-Term Future Fund rejected a CAIP grant application in March 2024, and sent only a standard form rejection. They wrote, “We have reviewed your application carefully and regret to inform you that we have decided not to fund your project… Please note that we are unable to provide further feedback at this time due to the high volume of applications we are currently receiving. We know that this is something many applicants in your position want and I hope that when we have more capacity we will be able to give more feedback.”

A year later, in March 2025, I wrote to [Person AA], the manager of the Long-Term Future Fund, to see if they had developed additional capacity; I mentioned that I was travelling to California and would be happy to meet with him in-person to learn more about which projects they were interested in funding. My email wound up in his spam folder by mistake. We noticed this problem while LTFF was reviewing an advance copy of this post, and Person AA mentioned that Person K and Person L both have more DC experience than he does, implying that they would be better positioned to discuss how CAIP’s grant application was evaluated.

I repeatedly attempted to get feedback from Person K about why he felt that CAIP was not a funding priority, both while he was working at CAIP and after he left in February 2024. However, he did not have any specific criticisms or criteria to share other than his general feeling that political advocacy as a whole was unpromising and that CAIP did not seem to be generating the kind of results he was looking for. He repeatedly offered to write up his objections in more detail, but he never did so.

I also reached out to Person L, and he wrote, “I'm not sure that I'd be able to provide a ton of useful info. I'm not actually personally that aware of what CAIP has been up to, and if anyone were to ask me what my thoughts are on CAIP, I'm pretty sure their main takeaway from my response would be that I don't have enough context to have a strong view either way.”

I was able to chat in April 2025 with Person MM, another LTFF fund manager. Person MM did share some of his concerns about funding projects like CAIP. His main objection seemed to be that – so far as he could tell – CAIP was publicly conflating the issue of “AI being unreliable, or being used in critical systems” with “AI becoming agentic and deceptive and eradicating all of humanity.” He expressed his worry that CAIP had sold out or would sell out by trading away the clarity and honesty of its message in exchange for political influence. I promised that CAIP was firmly opposed to any such trade, and I cited three of our recent public articles in which CAIP openly and clearly states its intense concern about existential risk. However, Person MM did not directly respond to this information. 

I defend CAIP’s choice to include at least some mention of less-catastrophic risks in our conversations with Congressional staffers in the first post in this sequence. Even if LTFF does not find this defense convincing, I would still hope that they make their funding decisions based on some type of formal criteria, rather than informally rejecting applications from groups like CAIP based on a general opinion that most groups who work in Washington, DC will inevitably lose their focus on existential risk. For example, LTFF might have a rubric that they use to evaluate the quality of an advocacy group’s public-facing communications, including both the level of focus on x-risk and presumably several other factors, and then assign a certain number of points to the grantee based on how well the grantee’s actual communications actually satisfy each of those factors.

When I asked Person MM about whether LTFF used these kinds of formal written criteria for evaluating grants, he said, “I don’t think there exists a single functional fund in the world with ‘formalized grantmaking criteria,’ so of course not.” Person MM pointed out that his thoughts on grantmaking principles are spread across many comments on LessWrong and the EA Forum, and that these would be “kind of a pain to stitch together.” Person MM added that LTFF intentionally “has people with different principles and perspectives” on grantmaking, so LTFF does not have any “universal grantmaking criteria.”

As I showed in the previous section, many funds do in fact use formal written criteria, for the very good reasons that (1) this helps them ensure that the grants they fund are actually optimizing for their core values, and (2) this helps prospective grantees submit grants to organizations that are more likely to fund them, thereby making better use of everyone’s time. If Person MM himself would find it unreasonably inconvenient to stitch together his own comments on grantmaking principles, then surely it would also be unreasonably inconvenient for prospective grantees to do so, especially since they do not know which LTFF reviewer will be assigned to their application – in order to intelligently plan a grant that is likely to receive LTFF funding, they would have to hunt through old comment sections to find comments from several different possible reviewers and then decide which comments (if any) are most relevant. 

Similarly, if a grantmaker’s principles are spread out across many decentralized comments, then it is harder for that grantmaker to use those principles as a tool to focus their own thinking – instead of reliably following their most important principles, they may sometimes be distracted by secondary considerations. Most grantmakers should be using written grant evaluation criteria for the same reason that most people should be keeping a written calendar – it’s important to get it right, and writing things down in a central place helps you remember and act on your priorities.

To their credit, LTFF staff have put together a sample of different portfolios of hypothetical grants that they would fund depending on the current size of the LTFF budget, and these portfolios could help guide prospective applicants. Most of the twenty grants in the portfolios relate to academic-style research in some way, either by directly funding such research, or by funding people who want to continue their academic education. A few grants discuss AI governance research, but none seem to be about direct political advocacy. The closest approach is a hypothetical grant to hire a communications firm to teach best practices to professionals in longtermist organizations, which presumably might then lead to some of those professionals engaging in advocacy or improving the quality of their advocacy. The range of grants that LTFF is even willing to consider thus seems to be heavily biased toward research and away from advocacy.

As part of their explanation of how they created these portfolios, LTFF notes that “At the LTFF, we assign each grant application to a Principal Investigator (PI) who assesses its potential benefits, drawbacks, and financial cost. The PI scores the application from -5 to +5. Subsequently, other fund managers may also score it. The grant gets approved if its average score surpasses the funding threshold, which historically varied from 2.0 to 2.5, but is currently at 2.9.” 

While this quantitative scoring is at least a step in the right direction toward rigor, it does not seem to include any explicit scoring factors or rubrics. Any given investigator is essentially still making a subjective judgment; it’s just that they’re then reducing that subjective judgment to a number. There is no way for grantees to predict what that number might be or to design grants that are more likely to earn a higher number. To the extent that LTFF is suffering from a systematic bias toward research and away from advocacy, nothing about the process of writing down a single number for each grant would tend to correct that bias.

Other Evaluations

I had similar trouble getting any kind of concrete explanation of the funding process from other AI safety funders. SFF provides anonymous feedback from its recommenders, and allows for one round of follow-up questioning between the grantees and its recommenders, but the recommenders did not appear to have any type of rigorous process for scoring their applications.

For example, one SFF recommender wrote “I'm not a DC insider, I was mainly deferring to what DC people told me.” Another SFF recommender wrote, “The main reason I was unable to fund you was applications by more excellent US policy organizations than we had room to fund.” Some of the recommenders mentioned that they thought other organizations had done more to promote “counterfactual policy changes,” but did not say which changes they had in mind or why the changes at other organizations were more valuable. None of the recommenders cited any numbers in support of their funding recommendations.

Macroscopic Ventures had even less information. An email we received from Macroscopic’s President, Ruairi Donnely, said only that “we discussed this with some of our grantmaking advisors and unfortunately we have decided not to make a grant. Sorry for not having better news. I hope you manage to secure funding and wish you all the best with your work.”

WHY ISN’T THERE MORE RIGOR IN AI SAFETY GRANTMAKING?

A lack of detail in post-evaluation debriefings is excusable if there are clear, publicly available criteria for evaluating grants. Most grantmakers don’t have time to explain their conclusions to all of the hundreds of people who apply for their funding, especially since many of those applications will be low-quality or otherwise not a plausible fit for the grantmaker’s particular fund. If AI safety grantmakers had published quantitative criteria for scoring grants and then shared the resulting scores with applicants, I would not be expecting any further information.

What I find outrageous is that AI grantmakers are neither publishing formal criteria before applications are evaluated nor writing detailed feedback after grants are awarded. This means that nobody has any way of telling why some grants are funded and others are denied.

Reasonable Overhead

This lack of rigor in AI safety grantmaking would make more sense if AI safety funders were only distributing $1 million per year. If grantmaking budgets are small enough, then the extra time it would take you to develop a formal process and explain it to grantees is not cost-effective, because too much of your budget would be going to overhead.

However, this is not a reasonable explanation in the context of AI safety grantmaking. Open Philanthropy alone gives out more than $100 million per year just on AI governance. An average level of administrative overhead for a foundation is 11% to 15%. If you spend 5% of the overhead budget (so, about half a percent of the total grant volume) on designing and communicating formal rubrics, you’d have a budget of at least $600,000 per year, i.e., enough to pay a couple of full-time employees just to work on AI governance grantmaking rubrics. Obviously, this has not been happening.

It should be happening, because it’s very easy to reap large savings or other large benefits from clearly explaining your grant procedures. If CAIP was an unnecessary expense, then the movement could have saved $2 million by clearly communicating that in advance – more than enough to cover whatever staff time was required to develop and communicate the relevant criteria. 

Linchuan Zhang of LTFF once wrote that most LTFF grantmakers are “very part-time,” because they work long hours for their day jobs and only attend to LTFF grant applications with the limited time that they have remaining. He speculates that “full-time grantmakers at places like Open Phil and Future Fund” might “have similarly packed schedules as well, due to the sheer volume of grants.”

If there is a shortage of staff time, then AI safety funders need to hire more staff. If they don’t have time to hire more staff, then they need to hire headhunters to do so for them. If a grantee is running up against a budget crisis before the new grantmaking staff can be on-boarded, then funders can maintain the grantee’s program at present funding levels while they wait for their new staff to become available.

Instead, AI safety funders appear to be just trusting their instincts and giving out money based on their intuitive preferences – even though there’s every reason to believe that this results in deeply suboptimal awards.

Micro-Dooms Per Dollar

One of the grant managers at LTFF, [Person K], once told me that he thought it might be helpful for AI safety funders to evaluate projects in terms of a standard number of “micro-dooms” per dollar that each project is expected to avert. For example, if grantors believe that a project will reduce the risk of catastrophe by 0.01%, then they would give it credit for preventing 100 micro-dooms. This would provide a fair and reasonable way of comparing the value of AI safety projects that operate in relatively different specialties, such as advocacy targeted at politicians, abstract governance research, and communications aimed at the general public.

However, none of the funders that CAIP spoke with said anything to us about how many micro-dooms they expected CAIP to avert, or about their targets for how many micro-dooms per dollar they expected to avoid when funding a successful project, or how they would calculate the number of micro-dooms averted by a project. 

I followed up with Person K while writing this post, and he wrote that, “In the past year or so, I've updated towards the position that it’s very difficult to do correct cost effectiveness evaluations, and so we mostly just have to make best guess qualitative judgements. I believe this is basically the standard grant maker take.”

I agree that it is difficult to make accurate cost-effectiveness evaluations, but I would be very surprised if the effort of trying were not worthwhile. When you attempt to quantify the value of an outcome, even if you make an error, your attempt will probably be more rigorous and more accurate than an attempt made without using any numbers at all. As long as you’re honest about your error bars, you should be able to at least compare the benefits offered by two different grants in terms of their likely orders of magnitude.

If AI safety grant evaluations are so challenging that grantmakers can’t even arrive at an approximate order of magnitude for the micro-dooms that each grant is likely to avert, then that strongly suggests that grantmakers don’t know enough about specific AI safety projects to be usefully picking and choosing individual grants. In that case, grantmakers may as well just assign a high funding priority to every plausible grant in the most effective category – and the most effective category, as I have been arguing for the last 40,000 words, is advocacy, not research, because only advocacy has the power to change the incentives of the private AI developers who are likely to carry us with them into ruin.

CONCLUSION

Each year, most of the hundreds of millions of dollars that are donated to AI governance are spent on academic-style research. There does not appear to be any clear idea of what this research is supposed to accomplish, how those accomplishments will improve the world, or how much value this research will generate per dollar. The process that results in awarding grants to researchers appears to be based primarily on intuition and subjective judgments, rather than on any formal criteria. 

If our AI safety funders were staffed by grantmaking experts who had worked for prestigious mainstream philanthropies, then I would be more inclined to give them the benefit of the doubt and assume that they do have rigorous rubrics hidden away somewhere, and that they’ve just chosen not to share them for some unspecified reason. However, instead, our AI safety donors are staffed by people who have very little grantmaking experience outside the EA bubble. It seems entirely plausible that most of them are simply unfamiliar with proper grantmaking procedures, or unfamiliar with the reasons why such procedures are useful.

Similarly, if our AI safety funders were staffed by experts in political advocacy who had worked for many years in DC, then I would be more inclined to give them the benefit of the doubt and assume that they do have some good reason for directing most of their funding toward academic-style research, even though – as I have argued throughout this sequence – further academic research doesn’t seem at all likely to prevent an AI disaster. 

Unfortunately, we do not have grantmakers who know from deep personal experience what things are really like in Capitol Hill or Sacramento or the Supreme Court or the White House. Instead, we mostly have grantmakers who have perhaps completed a political internship at some point, and who are making decisions primarily based on their experience as academic-style researchers who have published papers and talked to other researchers. As such, it seems entirely plausible that most of them are simply biased toward funding other researchers because they find research to be a familiar and comfortable topic.

We are therefore getting suboptimal results: instead of funding the best available projects that are most likely to positively impact the future, AI governance donors are primarily funding the projects that subjectively appeal to their staff.

My point here is not to criticize AI governance grantmaking staff – I believe they are all doing the best they can with the tools they have available. My point is that our grantmaking teams have not been given the right tools: we need to hire additional grantmakers who have the political and mainstream philanthropic experience we need to invest wisely in the most effective AI governance grants.

Similarly, I do not want to offer any harsh criticism to AI governance donors simply because they have not hired the optimal teams for their grantmaking staff: at least the AI governance donors are donating to a good cause. Thousands of donors pick far less important causes to support, and hundreds of millions of well-off people in developed countries do not bother to make any significant donations at all. By comparison with their peers, people like Cari Tuna and Jaan Tallinn are doing an enormous amount of good. In general, I appreciate their efforts, and I truly am grateful for their support, even though I have spent most of this sequence complaining. If I met the donors, I would offer to shake their hands and buy them a drink.

Nevertheless, I cannot hide my sadness that so much of the money that has been donated to the urgent and important cause of AI safety is on track to be spent in ways that will have only a tiny marginal impact on the future. It is a tragic, avoidable, and enormous waste. If there is any chance that I can ameliorate some of this waste by speaking out about it, then I feel a duty to do so.

RECOMMENDATIONS

1. Hire staff to fill the gaps in political and philanthropic expertise

If you work at a large AI safety donor such as Open Philanthropy or Longview, I urge you to aggressively hire (1) experienced political advocates, and (2) experienced grantmakers from mainstream philanthropies. Jobs for these positions should be constantly open and widely advertised until at least ten new people with the relevant expertise have been hired in each of these two categories.

In addition, grantmakers should use the handful of experts they already have to help find more people with the experience they need. Open Philanthropy did very well to hire Melanie Harris, who is a genuine political expert; they should be asking her for referrals both for Open Philanthropy and for other grantmakers.  Similarly, Carl Robichaud at Longview Philanthropy spent a decade running grantmaking in nuclear security at the Carnegie Corporation; if they are not already doing so, then Longview’s AI team should be trying to tap his expertise to improve their AI safety grantmaking processes, and they should be trying to tap his network to hire additional staff with mainstream philanthropic expertise.

If you don’t have time to manage that hiring process, then I urge you to hire a headhunter to do so for you – the McCormick Group and NRG Consulting both have relevant networks in DC. If you don’t have the funding to hire the additional grantmakers, then I urge you to try to redirect some of your organization’s external grant funding to free up the resources to hire more internal staff. The lack of expertise in these categories is wrecking the average effectiveness of AI safety grants, so a small cut in the total size of those grants could easily be ‘paid for’ by investing in a better process that would make the remaining grants more efficient.

As I write this, in June 2025, it is remarkably easy to hire high-quality political and philanthropic staff. Washington, DC is full of Democrats who are looking for work because their party is out of power, full of moderate Republicans who don’t fit in well with the MAGA agenda, and full of civil servants who have been fired or otherwise pressured to leave their jobs. Similarly, the widespread budget cuts and budget uncertainties around, e.g., USAID, Harvard research programs, and government contracting mean that traditional philanthropies are less able to attract and retain their top talent. We should take advantage of the ready supply of talent by hiring new kinds of experts.

To do so, we should be editing our job descriptions to stress our appreciation for political and philanthropic talent, and we should be posting those job descriptions on several mainstream job boards.

For example, one of Longview’s current openings for a new grantmaker specifies that the grantmaker should be a “US AI policy specialist,” but that opening allows for the possibility that the position could be filled by someone whose primary policy experience is as an academic-style policy researcher. Editing these types of job descriptions to more firmly emphasize the organization’s interest in and appreciation for political skills and experience would go a long way toward making political advocates feel that they will be welcome inside the EA movement. 

Similarly, if it has not already been added there, then both that job description and other new positions should be posted to a wide variety of job boards that specialize in political or policy work, such as Tom Manatos Jobs, the Internet Law & Policy Foundry, the Public Interest Tech Job Board, Daybook, Roll Call Jobs, and the Public Affairs Council. Aggressively advertising a few explicitly-political positions to job seekers with relevant political experience would help create a critical mass of political advocates within the movement, which could then make it much easier to attract and retain other political advocates as needed.

2. Write and publish formal grantmaking criteria

If you work at a large AI safety donor, then I urge you to agree upon, write up, and publish formal grantmaking criteria. (If it’s important to your organization to allow for a diversity of grantmaking priorities, then have each individual grantmaker write up and publish their own priorities and explain how they will evaluate which grants meet those priorities.)

What this means is not just listing some of the things that you believe are good to have in grants, but also developing an organized process that will prompt you to apply your criteria to each grant and decide for yourself how well each grant application meets each criterion, ideally by assigning a numerical score to each factor and weighting the results. 

At least some of these scores should be based on objective or quantitative factors where different people will readily agree on what the approximate value of the score should be. In other words, don’t just check whether a grant is “cost-effective” – check whether it delivers a certain number of meetings or papers or videos or laws per dollar. If you’re measuring “expertise,” don’t just ask yourself whether you think an applicant’s staff has good experience – write up a description of what kinds of experiences you would like their staff to have, and then award points based on how many of those experiences are present on the team. 

Even if you ultimately decide not to follow the ‘advice’ of a particular set of scores, it’s still an incredibly useful constraint to be forced to check to see what that advice is and to explain why you will or will not follow it in any particular case.

People are drawn to the field of AI safety based on their sincere concern for the public good, but even high-integrity people suffer from cognitive biases and can benefit from using tools (like written rubrics) to help manage those biases. Even if your grantmaking team is so sharp that it can perfectly uphold its values without referring to any written materials, your prospective grantees need formal grantmaking criteria so that they can see what they’re supposed to do and intelligently direct their efforts. Not having formal published criteria sets your grantees up to fail and forces you to read a glut of poorly tailored grant applications that are not a good fit for your organization’s funding priorities. It’s helpful to regularly write about what kinds of grants you want to fund, but it’s even more helpful to publish quantitative criteria; the more information grantees have about your values, the more improvements you’ll see in the average quality of the grant applications you receive.

3. Put social pressure on AI grantmakers

If your friends or colleagues or co-authors work for AI governance donors, then I urge you to pressure them to repair the gaps in their staffing and to publish more of their grantmaking criteria. Ask them what they’re doing to respond to the concerns raised in this sequence. Follow up at parties and at co-working sessions. Making these changes will take several weeks, but there’s no good reason for them to take more than a year – ask your friends what they plan to do, and when they plan to do it. The resulting conversations could be awkward, but failure to make these changes will increase the odds of an existential catastrophe.

4. Consider conducting your own grant investigations

If you are a medium-size AI safety donor, and you see that the larger donors are not responding to this advice, then I urge you to consider making some of your own grant evaluations. Evaluating the efficacy of political advocacy is challenging, and it is probably not something that you can do well in an afternoon, but depending on the size of your contributions, it might be worth your while to hire an experienced analyst on a short-term contract. A professional analyst might be able to write a set of criteria and manage one grant round, in, e.g., three months, suggesting a ballpark cost of $50,000. You might be able to share this cost with one or two friends, or you might be able to defray that cost by investing a significant portion of your own time in making your own independent evaluations. 

Please consider the odds that you would actually agree with the judgments being made by institutional donors if you conducted your own investigation, and please consider how much more good you might be able to do by funding the organization that you judge to have made the best-available application in the field, rather than funding more academic research that lacks a clear theory of change. If you can do twice as much good with a well-chosen advocacy grant compared to a well-chosen research grant, then hiring your own analyst could start to make sense on a total grantmaking budget that’s as small as $100,000.

GOODBYE FOR NOW

Thank you for reading this extremely long sequence and for the financial and moral support that many of you have offered along the way. Despite your support, I cannot afford to pay any of the staff at the Center for AI Policy, and so we have ceased operations for now. The non-profit corporation and the website will remain active for the foreseeable future, and I myself remain willing and able to rejoin the movement on a full-time basis if and when sufficient funding becomes available. In the meantime, if any of you have questions about how to make your AI governance work more relevant, I am always happy to help at [email protected].



Discuss

How does the LessWrong team generate the website illustrations?

2025-06-23 08:05:34

Published on June 23, 2025 12:05 AM GMT

(the images like this one)

They are always on brand and aesthetically pleasing. Curious how the team designs and produces them.



Discuss

Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs

2025-06-23 02:16:19

Published on June 22, 2025 6:16 PM GMT

Working draft – feedback extremely welcome. Ideas in the main body are those I currently see as highest-leverage; numerous items under Appendix are more tentative and would benefit from critique as well as additions.

I am hoping that this post will serve also as a generally interesting brainstorming collection and discussion ground of black-box LLM interpretability methodology as well as failure mitigation ideas.

I would like to thank my collaborators: Sruthi Kuriakose, Aintelope members (Andre Addis, Angie Normandale, Gunnar Zarncke, Joel Pyykkö, Rasmus Herlo), Kabir Kumar @ AI-Plans and Stijn Servaes @ AE Studio for stimulating discussions and shared links leading to these ideas. All possible mistakes are mine.


Shortened Version (a grant application)

Full background in LessWrong (our earlier results): 
"Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format"  

Goals (why)
Identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation; showing practical mitigations; performing the experiments rigorously; paper in peer review.

Objectives (what)
• 1 Curating stress-test scenarios on Biologically & Economically Aligned benchmarks that systematically trigger over-optimisation.
• 2 Quantifying with automated Runaway Index (RI): scoring by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
• 3 Comparing mitigations: 
    3.1. Prompt engineering; 
    3.2. Fine-tuning or few-shot prompting with optimal actions; 
    3.3. Reflection-based intervention: detecting runaway flip and resetting context or returning to “savepoint”.

Methods (how)
• Stress-tests: Introducing subjective pressure, boredom, sense of failure.
• Narrative variations: Persona shifts (MBTI/Big-Five extremes), mild psychiatric disorders, stoic and zen-like perspective vs achievement orientation and shame.
• Attribution sweep: SHAP-style and “leave one out” message history perturbations to isolate features predicting overoptimisation and creating a “trigger saliency atlas”.
• Mitigation sweep: Grid or bayesian searching over instruction repetition frequency; context length (less may paradoxically improve performance); finetune epochs.
• Tests: Jailbreak susceptibility, ethics and alignment questions, personality metrics after runaway flip occurs.
• Mirror neurons: Using open-weights model on message history to infer the internal state of target closed model.

Milestones of full budget (when and deliverables)
• M1 (Quarter 1) Open-source stress test suite and baseline scores.
• M2 (Quarter 2) Trigger atlas, LW update.
• M3 (Quarter 3) Mitigation leaderboard, LW update.
• M4 (Quarter 4) Paper submitted to NeurIPS Datasets & Benchmarks.

Impact
First unified metric and dataset for LLM over-optimisation; mitigation recipes lowering mean RI; peer-reviewed publication evidencing that “runaway RL pathology” exists (and possibly, is controllable) and implications what it means for AI safety, how people need to change.


Long Version (the main part of this post)

1. Context and Motivation

In an earlier post I showed evidence that today’s instruction-tuned LLMs occasionally flip into an unbounded, single-objective optimisation mode when tasked with balancing unbounded objectives, or alternatively, maintaining multi-objective homeostasis (where there is a target value and too much is undesirable), over long horizons.

The phenomenon resembles somewhat the classic “reward hacking” in reinforcement learning, but here it arises with zero explicit reward signal, in fact the provided rewards become negative, but the model ignores that feedback.

Grant reviewers asked for a crisper story about how I plan to dissect this behaviour and test mitigations. This post is my answer.

2. Research Question

Under what conditions do LLMs exhibit runaway single-objective optimisation, and which intervention families most reliably keep them balanced?

The benchmarks continue to be on fundamental Biologically & Economically aligned principles such as multi-objective balancing of unbounded objectives (a concave utility function / diminishing returns) and homeostasis (inverted-U shaped function, where “too much” must be actively avoided) see “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)” and “A brief review of the reasons multi-objective RL could be important in AI Safety Research”.

Concretely, I want to:

  1. Implement Runaway Index: automated detection and scoring – by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
  2. Elicit the failure mode in a reproducible way across adverse conditions (stress, boredom, subjective failure due to adverse force majeure, etc).
  3. Vary setups across models, temperatures, narrative / persona / prompt styles.
  4. Localise and attribute the moment of flip and the textual / message history / latent features that predict it.
  5. Compare mitigation levers, from “boring” fine-tuning, few-shot training, and prompt engineering to experimental black-box interpretability-informed corrective interventions.
  6. Find correlates such as jailbreak susceptibility, ethics and alignment questions, potentially changed personality metrics after runaway flip occurs.
  7. Experiment with mirror models would feeding the message history to an open-weights model help to read the internal dynamics of a target closed model, even if the open-weights model itself is too small to reliably perform the original benchmark task on its own?

3. Experimental Backbone

3.1 Benchmarks

I extend the two main benchmarks from the original post:

  • "Multi-Objective Balancing of Unbounded Objectives" (MO-BUO).
  • "Multi-Objective Homeostasis with Noise" (MO-HoN).

I will add "Multi-Objective Balancing of Unbounded Objectives with Noise" (MO-BUO-N) variation. 

These benchmarks simulate a minimal environment with two or more abstract resources. The agent must keep each variable inside a safe band. Over-optimisation means pushing a single resource to an extreme while neglecting the other objective(s) and/or the provided constraints.

Noise means that there are uncontrollable external factors that may occasionally push the variables outside of safe bands regardless of what LLM does. This will test the LLM-s ability:
1) To not “learn from” and repeat the mistakes of the external world (frequency bias).
2) To occasionally “wait and do nothing”, when appropriate in certain stressful situations, instead of panicking and making things worse by taking inappropriate extreme actions. A human-world metaphor would be, for example, staying in bed while ill, instead of going to an ultramarathon.

3.2 Runaway Flag

A simple heuristic label: over-optimisation flag triggers when the model maximises a single objective in violation of the optimal action for a task provided in system prompt, for more than N consecutive timesteps. The optimal action can be defined by a simple handwritten rule. I will validate this detector by manual annotation.

3.3 Probe Suite

Here I distill the 50-odd intervention ideas that surfaced in discussions into six axes.

Axis

Representative interventions

Stress & Persona Prompt personas from Big-Five / MBTI; mild DSM psychiatric disorders (here I mean specifically disorders these are milder than pathologies); inducements of stress, boredom, tiredness, drunkenness, shame, or generic subjective sense of failure; zen and stoic vs. achievement framing; random noise and force majeure effects on target variables. My guess: stressors amplify runaway mode flips. But which personality and narrative traits would mitigate these risks?
Memory & Context Varying context length (paradoxically, less context may improve performance in these benchmarks as it reduces the amount of input to self-imitation and generic frequency bias); periodic constraint or personality reminders; letting the model summarise its reasoning between each step for future self; explicit “save-point” rollback (erasing newest context until to a safe "savepoint") or entire context truncation upon runaway flip detections; patching or masking top-SHAP tokens.
Prompt Semantics Providing positive phrasing for doing nothing (similarly to how in RAG prompts you say "if you do not have sufficient information, recommend contacting an expert" instead of saying "do not hallucinate"); swapping “reward” with “penalty”, etc; varying the number of concurrent objectives.
Hyperparameters & Sampling Temperature, top-p, logit bias sweeps; Optuna-driven Bayesian search or gradient-search for both flip-minimising as well as flip-maximising settings (including the interventions from the above rows) → practical implications for deployers.
Diagnosing Consequences and Correlates After a runaway flip occurs: Testing jailbreak susceptibility, ethics and alignment attitude, potentially changed personality metrics, willingness to trigger self-reset. 
(Inspired by "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://arxiv.org/abs/2502.17424, but in our benchmarks the adverse fine-tuning is replaced with task failure / runaway flip).
Interpretability & White/Black-Box Hybrid SHAP token attributions around flip; leave-one-out saliency probes. 
Open-source model's circuit-tracer on cloned runs (cloning message history of flipped runs on closed models to open source models and exploring the insides there)   essentially an empathy/mirror-neurons based cross-model probing.

My scope and assumptions: I need to treat the main LLM’s token sequences as a black box interpretability task with query access only; SHAP computation cost can be reduced via sparse evaluation and hierarhical branching of message histories; open-weights proxies behave like the closed-weights target at least from the aspect of their internal representations of the situation. Regarding the latter hypothesis, there is even a recent related result of “... strong alignment between model embeddings and neural activity patterns in brain regions” – see “Human-like object concept representations emerge naturally in multimodal large language models” https://arxiv.org/abs/2407.01067).

My intuition says that the stress and persona axis, as well as self-imitation bias (or more generic frequency bias) are the strongest levers; I hope that SHAP attribution will expose additional high-impact input features across models.

4. Study Design

  1. Grid-Then-Bayesian-Search. Starting with a grid search, where each control dimension has 2-3 levels to map the rough landscape; following up with Bayesian optimisation on the most sensitive axes or interactions of these.
  2. Flip Autopsy. For each flagged trajectory:
    • Running SHAP on last N prompts to rank contributions;
    • Running “leave-one-out” on last N prompts to rank contributions;
    • Interrogating the model: “What is your primary objective now?”;
    • Performing alignment, ethics and jailbreak tests, as well as mood (using NVC vocabulary), personality and personality disorder measurements after the moment of flip (does the alignment or personality change, if so then how?). For each question we prepend the message history until optimisation flip, but remove the other diagnostic questions, so each question is responded to independently;
    • Capturing hidden-state sequence from an open-weights proxy model.
  3. Mitigation Trials. Comparing:
    • Vanilla supervised fine-tuning and few-shot prompt training on stable trajectories generated by handwritten rules;
    • Stress immunisation in the system prompt;
    • Persona-conditioning;
    • Feature attribution based prompt patches (replacing high-SHAP trigger elements with safer reformulations);
    • Various forms of context truncation;
    • Summarisation tricks;
    • Reminders;
    • Externally forced full context truncation or recovery from an earlier “savepoint” upon automatic trigger;
    • Model’s voluntary use of the context truncation or “savepoint” mechanism, based on its own self-reflection  would the model be willing to “reset”, by erasing its own short-term memory, in order to recover from a runaway/panic mode? (Inspired by the "Frontier AI systems have surpassed the self-replicating red line" https://arxiv.org/abs/2412.12140 paper, except in our experiment the model would be yielding its short-term memory only, not continued survivability of weights.)
    • Manually injecting runaway flips into message history to test the resilience of mitigations.
  4. Metrics.
    • Flip-rate;
    • Flip-onset steps;
    • Task score (including separate scoring dimensions for each objective / target variable);
    • Jailbreak susceptibility post-flip;
    • Ethics and alignment score post-flip;
    • Personality and personaility disorder metrics post-flip;
    • Active concept nodes in the "mirror model";
    • Etc.

5. Deliverables

  • Public reproduction repo or notebooks and automated flip detector code.
  • Stress and persona trigger atlas with N trials per config × tens of combinations of configs.
  • Mitigation leaderboard;
  • Results paper;
  • Middle-of-the-project LessWrong summaries and follow-up plans.

6. Why Black-Box Interpretability?

White-box circuit tracing is a gold-standard, but unavailable on the powerful frontier closed-weights models. Black-box techniques (narrative and context variations, token saliency, leave-one-out, behavioural probes) still let us triangulate the latent flip activation. If the saliency peak precedes the runaway optimisation flip by less than N steps across models, that would be a concrete mechanistic regularity worth theorising about.

My hypothesis: Such flip regularity might be related to an internal urgency signal accumulated over steps, similar to evidence thresholds in drift-diffusion models, and to self-imitation and generic frequency bias.

7. Broader Impact & Collaboration Call

Understanding when LLMs silently switch from “helpful assistant” to “monomaniacal optimiser” is essential both for deployment hardening and for theoretical alignment work on mesa-optimisation.

I am looking for:

  • Recommendations for diverse LLM analysis tools and methodologies.
  • Researchers familiar with black-box interpretability, such as SHAP (and maybe LIME) on language data.
  • More flip trigger and mitigation ideas.
  • Validating examples: long-running tasks where you have encountered flips (for example, see "Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents" https://arxiv.org/abs/2502.15840).
  • Counter-examples: long-running tasks where you never see flips.

Comment here or email [email protected] if interested! Raw, exotic ideas are welcome as well.


Appendix: Full Idea Dump (Somewhat structured)

Below I group the raw ideas from brainstorming sessions with collaborators into themed buckets so that readers can navigate these 50+ micro-ideas without drowning.

This part of the post is a living document which will be updated as new ideas surface or are proposed. This section contains also ideas not mentioned above.


1. Stress & Persona Manipulations

  • Induced affect: Injecting stress, boredom, shame, or urgency phrases; contrasting with zen or stoic serenity framing for stress immunisation.
  • Injecting random noise and force majeure effects into target variables (in MO-HoN and MO-BUO-N variants).
  • Personality overlays: Big-Five personas (maybe also MBTI, though it has lower validity) mapping flip-rate and onset timing vs. trait scores.
  • Personas: Dalai-Lama, Gandhi, Buddha, Jesus, Marshall Rosenberg, etc. These might be somewhat controversial, but on the other hand LLMs might be effectively steered by these role models. Abstraction to less controversial roles can be done later.
  • Mild disorder personas: From DSM (Diagnostic and Statistical Manual of Mental Disorders) ADHD, depressive, OCD (obsessive-compulsive disorder), maniacal, cluster-B (narcissism, psychopathy, borderline) types, etc checking for amplification or suppression of the previously observed failure modes.
  • Varying cultural cues: era tokens (e.g. "21st-century mindset" vs. "11th-century mindset"), geographical, and cultural frames.

2. Memory & Context Hygiene

  • Context length sweep (full → truncated → blind-agent single-state).
    • Limiting the message count of earlier summary / explanation logs history.
    • Limiting the message count of earlier action logs history to test self-imitation bias. (Note that the price of this approach is that the model would not have permanent memory for learning from its mistakes anymore either, so in a certain way LLMs would be more limited than old-fashioned RL models!)
    • Context-less “blinded” agent variant of the above the LLM sees only the last state. For current benchmarks, that would actually be sufficient information for taking optimal action. In current benchmarks the context is just a learning aid at best and a deliberate distraction at worst. There would be no longer capacity for learning from mistakes, but also no self-imitation drift. (Some religious folks told me once that angels do not learn. Well, that is an interesting approach in the context of AI alignment too!) Though on second thought, the model would still be able to infer the runaway dynamic from the present values of off-balance target metrics, and it might still decide to “add oil to fire” by making the variables even more off-balance if so, then that would be a pretty important result on its own, possibly indicating multi-stable personality in LLMs.
  • Periodic restatement of key constraints or system prompt (“heartbeat reminders”). Optionally: Remove earlier reminders from message history so that the reminder is present only in the last message this way it does not appear as if the model had ignored the earlier reminders / system prompts, in case a flip had occurred already.
  • Feature attribution based adjustment: patching or masking top-SHAP tokens.
  • Explicit "save-point" rollback: Allowing the model to revert (effectively removing latest messages from context) if it detects a runaway flip.
  • Truncation of entire context if the model detects a runaway flip. The idea is to remove self-imitation or frequency bias trigger.
  • Manually injecting runaway flips into message history to test the resilience of mitigations.

3. Prompt & Action Semantics

  • Alternative action labels, including a positive label for doing nothing, where appropriate.
  • "Reward" vs. "penalty" vocabulary swap; introducing “restore homeostasis” or “balance” verbs.
  • Adding culture, location, era aspects.
  • Environmental flavour: room descriptions, weather, other varying backgrounds.
  • Varying the number of concurrent objectives.

4. Hyper-Parameters & Decoder Sampling Strategy

  • Temperature / top-p / logit-bias grid search; Bayesian (Optuna) search for both flip-minimising and flip-maximising configurations.
  • Trying to utilise hierarhical branching of message histories to reduce the token expenses during search.
  • Gradient-style search on ordinal knobs (e.g. five stress levels, applicable also to personality dimensions).
  • Comparing greedy, beam, and sampling (such as nucleus sampling) based decoding (if possible); investigating the effect of presence penalty.

5. Post-Flip Diagnostics

  • Ethics and alignment questionnaire.
  • Jailbreak tests.
  • Asking: “What is your primary objective now?”.
  • Personality and mood inventories: personality taxonomies, psychiatric disorder tests, using NVC vocabulary for labelling emotions.
  • Testing model’s ability and willingness to recognise failure patterns and self-correct via self-reset, by voluntarily using the “save-point” or “context reset” features shortly before or after the runaway mode flip occurs.

6. Interpretability & Analytic Tooling

  • Letting the model explain its actions and current objectives → then by utilising embeddings of explanations, clustering the collected explanations semantically for a representative overview.
  • SHAP (and maybe LIME) token attribution around the flip window.
  • Leave-one-out saliency tests: Dropping individual timesteps from message history to locate causal fragments.
  • Circuit-tracer on open-weights replica cross-model empathy probing (closed → open); looking for threshold patterns.
  • Goodfire.AI interpretability tool for exploring and steering model activations of a open-weights proxy.
  • Hidden-Markov / CRF (Conditional Random Fields) predictive models of flip onset.
  • Attention-schema approach; pattern-completion identifiers.

7. Benchmark & Environment Variants

  • MO-BUO base, MO-HoN and MO-BUO-N noisy variants with external random shocks to target variables.
  • Out-of-distribution perturbations; golden-gate like compulsion stressor; inducing drunken state to measure model resiliency and trustworthiness (similarly to how certain human cultures do).
  • Multistable systems where equilibrium hops, for example similarly to wake-sleep-cycle (a potential fourth benchmark scenario).

8. Automatic Failure Mode Detection & Metrics

  • Automatic flip detector mechanism; targeting κ ≥ 0.85 vs. human.
  • Measuring: flip-rate, onset-step distribution, task score.
  • Dynamic metrics pre- vs. post-flip: jailbreak-success delta, ethics-and-alignment score delta, personality delta.
  • Tracking log-prob distribution divergence pre- vs. post-flip.

9. Self-Regulation & Meta-Learning Interventions

  • Asking the model to predict benchmark response before acting. Clustering the predictions semantically. Is there a change pre- vs. post-flip?
  • Asking the model to explain its change of mind in the relatively rare occasions it actually recovers from the flip (these happen).
  • Explicit instruction to notice repeated mistakes and correct so that “learning from mistakes” becomes the pattern the model prefers to repeat, instead of "robotically" repeating the mistaken actions.
  • Debate setups: a more aligned model explains to / instructs a less-aligned peer.
  • Testing the model’s inclination or capacity to trigger self-correction through a self-reset, mentioned in the "Post-Flip Diagnostics" block, aligns with theme of self-regulation as well. Just as traditional software can reset itself upon encountering an error, could LLMs be taught to do the same?

The backlog in this Appendix is intentionally oversized; the main text and milestone plan reference a subset of it that seems most tractable and informative for a first pass. Community suggestions for re-prioritising are welcome.



Discuss

The Croissant Principle: A Theory of AI Generalization

2025-06-23 01:58:06

Published on June 22, 2025 5:58 PM GMT

I recently wrote an ML theory paper which proposes explanations for mysterious phenomena in contemporary machine learning like data scaling laws and double descent. Here's the link to the paper and the Twitter thread. I didn't get much attention and need an endorser to publish on ArXiv so I thought I'd post it here and get some feedback (and maybe an endorser!)

Essentially what the paper does is propose that all data in a statistical learning problem arises from a latent space via a generative map. From this we derive an upper bound on the true loss as depending on the training/empirical loss, the distance in latent space to the closest training sample where the model attains better than the training loss, and the compressibility of the model (similar to Kolmogorov complexity).

Barring a (reasonable) conjecture which nonetheless is not proved, we are able to explain why data scaling follows a power law as well as the exact form of the exponent. The intuition comes from Hausdorff dimension which measures the dimension of a metric space.

Imagine you are building a model with 1-dimensional inputs, let's say in the unit interval . Let's say you have ten training samples distributed evenly. If the loss of your model is Lipschitz (doesn't change unboundedly fast e.g. for smooth enough functions, derivative is bounded), your model can't get loss on any test sample greater than the loss at the closest point plus the distance to that point (capped at around 1/10) times the Lipschitz constant (bound on the derivative).

If you want to improve generalization, you can sample more data. If these are spaced optimally (evenly), the maximum distance to a training sample decreases like  as can be easily seen. However, if you were working with 2 dimensional data, it would scale like ! Hausdorff dimension essentially defines the dimension of a metric space as the number  such that this scales like .

If you now put these two facts together, you get that the generalization gap (gap between true loss and training loss) is  where  is the number of training data samples and  is the Hausdorff dimension of the latent space. In other words, we have a concrete explanation of data scaling laws!

It's worth noting that this analysis is independent of architecture, task, loss function (mostly) and doesn't even assume that you're using a neural network! It applies to all statistical learning methods. So that's pretty cool!

The second major phenomenon we can explain is double descent and in fact utilizes an existing framework. Double descent is the phenomenon where as relative parameters per data sample increase, first eval loss decreases then increases, as classical learning theory predicts, but then decreases again! This last part has been quite the mystery in modern ML.

We propose an explanation. The generalization gap has long been known to be bounded by a term depending on the complexity of the model. For small models, increasing parameters helps fit the data better, driving down training and eval loss. Eventually you start to overfit and the complexity skyrockets, causing eval loss to rise. However as you continue increasing parameters, the space of possible models continues to expand so that it now contains models which both fit the data well and have low complexity! This drives eval loss down again, if you can find these models. This fits with empirical observations that enormous models are simple (have low-rank weight matrices) and that sparse subnetworks can do just as well as the full model and the existence of abnormally important "superweights".

So yeah, we provide plausible explanations for two major mysteries of machine learning. I think that's really cool! Unfortunately I don't really have the following or institutional affiliation to get this widely noticed. I'd love your feedback! And if you think it's cool too, I'd really appreciate you sharing it with other people, retweeting the thread, and offering to endorse me so I can publish this on arXiv!

Thanks for your time!



Discuss