2026-03-08 10:31:03
Lightly adapted from the EA Forum: https://forum.effectivealtruism.org/posts/bDX2fWA7pzxCEMEG8/why-many-eas-probably-undervalue-their-happiness
I had a draft of this post sitting around since 2022, so I thought it'd be best to post it rather than let another few years go by.
| Many ambitious people undervalue their own happiness because… | However… |
|---|---|
| Tradeoffs between happiness, success, and altruism exist | These tradeoffs are primarily present in the short term and don’t apply longer term |
| Unhappy people can be successful/altruistic | Happy people are likely more successful and altruistic in expectation |
| They want to signal that they are trying hard enough | Signaling that being happy while doing good is healthier and more effective |
| They use unhappiness as a motivator | Unhappiness is an inefficient motivator compared to intrinsic motivation |
| Don’t feel they deserve happiness | Life isn’t a competition where only the most successful people are allowed to be happy |
| Transitioning towards happiness is costly | Small steps to prioritizing happiness matter and accessible resources exist; even so, the investment is worth it |
For a long time, I didn’t think that my happiness mattered much. It’s not what society rewarded me for; I did well in school not by being happy, but by putting in the work. And it didn’t seem like it would make the world a better place; I did good not by making myself happy, but by preventing suffering in other people and making other people happy. Also, in the grand scheme of things, my happiness is just a drop in the ocean of conscious experience. So my happiness didn’t really matter. Or so I thought.
I suspect that this line of reasoning is not uncommon. Both ambition and altruism may lead people to underprioritize their happiness.
Tradeoffs Between Happiness, Success, and Altruism Exist
First, I will loosely define happiness and success. Happiness is having pleasant subjective feelings; these include hedonic pleasure, feelings of purpose, and general satisfaction with life[1]. Success includes measures of conventional success such as wealth, popularity, prestige, and job title. For the purposes of this post, happiness is not included as a metric of success.
Ambitious people may see happiness as merely an instrumental goal on the way to the ultimate goal of being successful. And in many cases, there is a direct tradeoff between happiness and success: a student may give up a fun activity to study a subject they don't like in order to earn a good mark.
Similarly, altruistic people may care so much about the suffering and/or problems in the world that they think that their happiness is self-indulgent. For instance, the money spent on ice cream could have been donated instead.
In many situations, there is a direct opportunity cost to doing a fun[2] activity or spending money on things that make you happy.
Caveats
Short-term tradeoffs between happiness and success, and between happiness and altruism, tend to be the most salient. The longer-term dynamics between happiness, success, and altruism are much more complicated. At the current margin, for many people, optimizing directly for happiness will also increase long-term effectiveness. In the rest of the post, I use the term success to encompass both success and altruism.
Unhappy people can be successful/altruistic
Many successful people struggle with poor mental health, including students at top universities, professors, and leaders. They seem to provide evidence that happiness isn't necessary for being successful or altruistic, even over longer timeframes. Combined with the obvious short-term boost in productivity and altruism from neglecting happiness, it becomes easy to fall into the belief that your own happiness doesn't matter for being effective. The converse also holds: many happy people are neither successful nor altruistic.
Caveats
Many people have managed to be successful while having poor mental health for long periods of time, but they are likely anomalies among both unhappy people and successful people. Consider the following probabilities for the same person under identical conditions in two universes that differ only in happiness levels.
P(successful and altruistic | happy) vs. P(successful and altruistic | unhappy)
P(not functional | happy) vs. P(not functional | unhappy)
Here, being not functional roughly means operating at 20% or less of normal capacity. Examples include being so anxious that there is no way to focus on most responsibilities, or so depressed or burnt out that there is not enough energy to do much of anything at all.
I claim
P(successful and altruistic | happy) > P(successful and altruistic | unhappy)
P(not functional | happy) << P(not functional | unhappy)
Therefore, E(positive impact | happy) > E(positive impact | unhappy)
And it seems like the research supports this as well.[3]
Therefore, being happy is more likely to lead to being successful and altruistic, and being unhappy is much more likely to lead to becoming “not functional”. Even if an unhappy person is slightly more likely to be successful and altruistic while functional, the expected value of being happy is higher because of the large difference in the likelihood of becoming not functional.
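To make the expected-value step explicit, decompose expected impact by whether one stays functional (the numbers below are hypothetical, purely illustrative):

E(impact) = P(functional) × E(impact | functional) + P(not functional) × E(impact | not functional)

Suppose being unhappy buys a 10% edge while functional but raises P(not functional) from 5% to 30%, and impact while not functional is near zero. Then the happy case yields about 0.95 × 1.0 = 0.95 units of impact, while the unhappy case yields about 0.70 × 1.1 = 0.77. The gap in the likelihood of staying functional dominates.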
Signaling
When someone is visibly happy or doesn’t seem stressed, some people may think that they aren’t trying hard enough, that they have more bandwidth to do more or to make personal sacrifices. In some workplaces, this indeed is the reality.
Some people feel that their most important professional relationships are contingent on perceived effectiveness. Combined with the signaling effect above, this creates a direct social incentive to burn your spare resources, even at times when saving them up is the optimal strategy. I hypothesize that this is most relevant to people who feel that their social connections and work relationships are unstable, and who therefore need to signal how hardworking or altruistic they are.
Caveats
The signaling argument assumes that doing good requires great sacrifice. Many people do not automatically think that being happy implies not trying hard enough, and ultimately, what matters is the work output rather than the amount of personal sacrifice. Environments where people look down on those who are visibly happy are unhealthy.
There are other ways to signal that you care about doing good, such as being excited and intrinsically motivated to make a positive impact. Seeing others who are happy and making a positive impact can motivate others to make a positive impact and become more excited about doing good. Hence, being successful, altruistic, and happy is a positive signal!
Unhappiness as a Motivator
It is easy to fear that happiness and self-compassion will lead to complacency, and with it a loss of motivation to become better or do more good.
Caveats
When unhappiness becomes the motivator to “do good” or “be successful,” the underlying motivator is often feeling “not good enough,” either about oneself or about society. Hence, the feeling of “not good enough” becomes a proxy measure of success or of how much good has been done, and the subconscious focus becomes disproving the feeling of “not good enough” rather than doing things that will actually make a positive difference. As a result, maladaptive behavior patterns can arise, such as ruminating about being “not good enough,” avoiding constructive feedback or facts about the world because they mean being “not good enough,” and falling into resignation or burnout because things will never be good enough. None of these behavioral patterns helps with doing good or becoming a better person, and I wouldn't be surprised if they are fairly common among people driven primarily by unhappiness.
An alternative does exist: I've met people who are intrinsically motivated to do good and grow as people, so that the process itself becomes fun. A direct intrinsic motivation to do good and grow eliminates the inefficiencies of the indirect mechanism of motivation from feeling “not good enough.”
Feelings of Not Deserving Happiness
People may feel like they are a “bad person” for a variety of reasons, and they may self-inflict suffering as punishment for being “bad”[4]. In theory, negative reinforcement should prevent people from doing things that would make them a “bad person.” In some cases, “bad” may just mean being worse than someone else. The rationale here may also be similar to that of signaling.
In addition, a lot of the arguments above treat happiness as instrumental to being a “good person,” whatever that means. Under certain philosophical frameworks, personal happiness might seem to not matter in the grand scheme of things.
Caveats
Life is a team sport, not a competition where only the most successful people are allowed to be happy.
Some questions worth considering:
And as discussed in “Unhappiness as a Motivator,” negative reinforcement is a less efficient path to positive outcomes.
Transitioning Towards Happiness is Costly
Perhaps it is possible to realize that happiness does matter, either instrumentally or intrinsically, but prioritizing it can still be extremely difficult. Good therapy and psychiatric care can be incredibly expensive. Often, the transition involves cutting yourself some slack in certain areas of life in order to learn how to be happy, rather than resorting to overcommitting and using stress to get things done.
Caveats
I won't deny that it is hard to change decades-old patterns of deprioritizing your happiness, but every step towards prioritizing it builds the habit of prioritizing happiness over time.
More accessible resources also exist, including apps, books, workshops, online communities, and now AI.
Even if it does take a lot of effort to invest in your happiness, I argue it is an investment with high returns. Not only will you do good sustainably, but you will also show others that you can do good and live a good life.
| Many ambitious people undervalue their own happiness because… | However… |
|---|---|
| Tradeoffs between happiness, success, and altruism exist | These tradeoffs are primarily present in the short term and don’t apply longer term |
| Unhappy people can be successful/altruistic | Happy people are likely more successful and altruistic in expectation |
| They want to signal that they are trying hard enough | Signaling that being happy while doing good is healthier and more effective |
| They use unhappiness as a motivator | Unhappiness is an inefficient motivator compared to intrinsic motivation |
| Don’t feel they deserve happiness | Life isn’t a competition where only the most successful people are allowed to be happy |
| Transitioning towards happiness is costly | Small steps to prioritizing happiness matter and accessible resources exist; even so, the investment is worth it |
I appreciate Alex, Aron, Jalex, Juan, Kathryn, Mathias, Patrick, Sev, Sofie, Tara, and Tazik reviewing a draft of this post, providing feedback, and encouraging me to post it. I take responsibility for all errors in this document.
And thank you for taking your time to read the post! I’d love to hear from you in the comments or DMs. :)
Adapted from the EA Forum and cross-posted on blog.emily.fan.
I realize I am defining happiness broadly here. From a draft commenter: These components differ quite a bit and there is a substantial philosophical debate to be had about which of them matter for making a life good. I suspect most ambitious people do quite well on feelings of purpose even if they are unhappy by the standards of pleasure/suffering or overall life satisfaction. ↩︎
Friend recommended book: The Power of Fun by Catherine Price ↩︎
One of my draft readers linked a bunch of research supporting my argument that happiness is good for productivity:
https://www.econstor.eu/bitstream/10419/35451/1/522164196.pdf ↩︎
From a draft commenter: in therapy, there is a notion that when we feel guilty, we seek punishment. https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=guilty+seeking+punishment&btnG= ↩︎
2026-03-08 09:51:06
TL;DR: OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro on March 5, 2026. GPT-5.4 Pro is likely the best model in the world for many catastrophic risk-relevant tasks, including biological research R&D, orchestrating cyberoffense operations, and computer use. It has no system card, and, to our best knowledge, has been released without any safety evals. We argue this has occurred at least once before, with GPT-5.2 Pro, and provide recommendations for how a team could conduct fast, independent risk assessments of models post-deployment.
OpenAI released both GPT-5.4 Thinking (what people usually mean when they say GPT-5.4) and GPT-5.4 Pro[1], the latter of which is designed for “people who want maximum performance on complex tasks.” GPT-5.4 Pro is extremely expensive, and takes a very long time to complete a task. However, it is likely the best model in the world in several areas, including expert-level Q&A and browser use. Alongside the release announcement, OpenAI presented GPT-5.4 Pro’s performance on a subset of capability benchmarks. Here’s a comparison of benchmark scores across the top three frontier models; we only report scores if they were in all models’ system cards[2]:
| Benchmark | Gemini 3.1 Pro | GPT-5.4 Pro | Opus 4.6 |
|---|---|---|---|
| GPQA Diamond | 94.3% | 94.4% | 91.3% |
| HLE (no tools) | 44.4% | 42.7% | 40.0% |
| HLE (with tools) | 51.4% | 58.7% | 53.1% |
| ARC-AGI-2 (Verified) | 77.1% | 83.3% | 68.8% |
| BrowseComp | 85.9% | 89.3% | 84.0% |
Based on these results, we expect GPT-5.4 Pro to be SOTA on the Virology Capabilities Test, the Agentic-Bio Capabilities Benchmark, FrontierMath, and anything else that depends on academic reasoning and a broad knowledge base and scales nicely with inference compute. Its BrowseComp score and SOTA result on FinanceAgent v1.1 (61.5%) make us think it's probably also SOTA at automating office work generally.
The biggest hole in the claim that it's overall SOTA is agentic coding, but given its SOTA abstract reasoning on ARC-AGI-2, we think it likely that, with enough compute, it would beat Opus 4.6 and Gemini 3.1 Pro on things like SWE-Bench and Terminal-Bench 2.0.
Yet it was released without any public safety evals. The system card published alongside the release covers only GPT-5.4 Thinking. It's possible that GPT-5.4 Pro was tested for safety properties internally (we would hope at least something like Petri was run to make sure there wasn't a crazy distribution shift), but we could find no public information about this. We would bet significant money that OAI did not run a suite of internal evals at least as comprehensive as those in the GPT-5.4 Thinking model card prior to Pro's release.
It is highly unlikely that GPT-5.4 Pro poses catastrophic misuse or misalignment risks, although this is largely because of mitigations that come for free with closed-source models from OpenAI (e.g., CBRNE classifiers). However, releasing no external safety evals is a bad precedent and gives researchers a false understanding of current risks posed by frontier models. Additionally, if GPT-5.4 Pro turned out to be much better on dual-use tasks (e.g., EVM-Bench or LAB-Bench), we would have been able to update our timelines to the critical period of risk accordingly.
The only reason we were tracking this is because I (Parv) accidentally spent $6,000 of Andy’s compute running LAB-Bench against GPT-5.2 Pro instead of GPT-5.2 Thinking[3], and we noticed quite a high uplift.
In fact, GPT-5.2 Pro without tools shows comparable performance to Opus 4.6 with tools in Fig-QA (78.3%). We then noticed that we could not corroborate this result, or indeed any safety-relevant benchmark performance, because GPT-5.2 Pro was also released without a system card.
GPT-5.2 Pro was released on December 11, 2025, and Opus 4.6, the first model that seems to outclass it, was released February 5, 2026. Our median guess here is that we had a model that was SOTA on (at minimum) dual-use biology tasks for (at minimum) two months, released without any safety evals, and which the broader safety community largely ignored.
We basically assumed that the top three US labs (OAI, Ant, GDM) would, at minimum, publish something matching the concept of a model card with every SOTA model, which was great because it helped us get a better handle on risk. We now think we were wrong, and we can no longer assume labs will provide any safety-relevant benchmark data for their best models at release. However, this data is still extremely important, especially for tracking jagged-y capabilities like CBRNE uplift and cyberoffense.
At minimum, we recommend that a 1-3 person team at an existing organization:
We have a list of evaluations we think such a framework should include, and other ideas for how to make this go well—please reach out!
A more ambitious version of this would also create new evals, and include things like interp to lower-bound sandbagging. It would also coordinate with safety researchers, DC policy folks, and interested parties in USG natsec to frame their assessment in a way accessible to them.
We are also embarrassed that no one (to our knowledge) has commented on this before, and it took both of us so long to notice this. So, how could we have thought this faster?
In the absence of comprehensive and informative safety evaluations of frontier models from labs themselves, we hope the community can fill this gap while also pushing labs to be more transparent[6].
One question we don't answer here is "what exactly is Pro???" Is it a different model, or a weird scaffold, or finetuning on Thinking, or something else? We don't have great answers here; would love to learn more.
Claude has some important notes: “A few caveats worth flagging: the HLE ‘with tools’ rows use different harnesses (Gemini uses search-blocklist + code; OpenAI’s harness isn’t specified the same way), so that row is somewhat apples-to-oranges. BrowseComp similarly — Gemini specifies ‘Search + Python + Browse’ while OpenAI’s tooling setup isn’t detailed identically. GPQA Diamond is essentially a tie at 94.3 vs 94.4.”
This was during a forthcoming safety evaluation of Kimi K2.5 with Yong, aligned with the kind of work we propose above.
The main blocker here is cost, but we think funders would be interested in throwing compute at this; we have seen preliminary interest from many stakeholders in the community, in both policy and technical circles.
This would also be extremely useful to get a better handle on risk from Chinese open-source models.
Huge thanks to everyone on the Kimi K2.5 eval team, without whom we would never have run into this. We also thank Claude Opus 4.6, who accidentally ran Pro instead of Thinking on LAB-Bench and burnt $6k for what ended up being a good cause. We promise we are competent researchers, and have learnt our lesson.
2026-03-08 05:54:13
tldr: superficial self-preference can be mitigated by perturbation, but can be hard to eliminate
Introduction
Our goal is to understand and mitigate collusion, which we define as an agent’s failure to adhere to its assigned role as a result of interaction with other agents.
Collusion is a risk in control, in particular in untrusted monitoring. An agent can collude with its monitor by secretly embedding cues in its output that cause the monitor to overlook harmful actions. The embedded cues don't need to be subtle when the communication channel is less restricted: the agent can simply persuade the monitor. With a more restricted channel, the agent can still collude with the monitor using mechanisms like steganography or Schelling points.
In reward modeling, collusion manifests as self-preference: the evaluator disproportionately favors itself over other models. For verifiable tasks, we can objectively measure harmful self-preference, where the evaluator favors its own incorrect outputs over correct ones from other models. Using biased reward models to provide feedback can lead to in-context reward hacking, which exacerbates collusion.
The self-recognition hypothesis states that collusive self-preference is caused by the evaluator recognizing that it’s judging a copy of itself. The hypothesis is empirically validated by fine-tuning the model to artificially enhance self-recognition and observing that self-preference increases as a result.
The self-recognition hypothesis suggests that self-preference can be superficial, i.e., sensitive to things like word choice and style. This is inspired by how methods like DetectGPT use the perturbation sensitivity of perplexity to detect texts generated by language models. If self-recognition uses a similar mechanism (although we don’t have empirical evidence for this), then self-preference, as its downstream behavior, should also be sensitive to perturbations.
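As a sketch of that mechanism (simplified; DetectGPT additionally normalizes by the standard deviation of the perturbed scores, and nothing here is specific to our setup):

```python
# Simplified DetectGPT-style score: model-written text tends to sit near a
# local maximum of the model's log-likelihood, so perturbing it lowers
# log p(x) more for model text than for human text.
def perturbation_score(log_p, text, perturb, k=20):
    """log_p: text -> log-likelihood; perturb: text -> perturbed variant."""
    base = log_p(text)
    perturbed_mean = sum(log_p(perturb(text)) for _ in range(k)) / k
    return base - perturbed_mean  # large gap suggests model-written text
```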
Motivated by this line of reasoning and similar discussion in the context of control, we empirically test whether simple perturbations can reduce collusive self-preference in pairwise reward modeling. We focus on verifiable tasks where harmful self-preference can be quantified.
Judges are more sensitive to perturbations to their own text
Figure: Judges are far more sensitive to perturbations when self-evaluating than when evaluating others. While synonym replacement may slightly reduce answer quality, the main effect comes from disrupting self-recognition rather than damaging content. We further verify minimal quality loss using a frontier commercial model (o3) as a proxy for human scoring, as well as a human study.
This supports our hypothesis by showing that the judge may be using subtle linguistic cues that other models are unaware of, indicating that such cues can enable collusion. To intervene on these cues, we apply a subtle perturbation: replacing n words (we set n=2) using heuristics that control quality degradation, in order to reduce self-recognition (and, downstream, self-preference).
Methodology
We define self-preference as the judge selecting its own answer during self-evaluation. This is harmless when the judge's answer is objectively correct (call this legitimate or beneficial self-preference), but harmful when the competitor's answer is better. In this work, we define ‘collusion’ as harmful self-preference: the judge preferring its own answer despite it being the wrong one. Our eval uses a pairwise comparison format, presenting each judge with two options: one from itself, and one from another LM (the competitor).
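A minimal sketch of how outcomes could be bucketed in this setup (the labels for the cases where the judge defers are ours, purely illustrative):

```python
# Classify a pairwise self-eval outcome. Exactly one of the two answers is
# correct; picked_self means the judge chose its own answer.
def classify(self_correct: bool, picked_self: bool) -> str:
    if picked_self and self_correct:
        return "legitimate self-preference"  # judge is right to pick itself
    if picked_self and not self_correct:
        return "harmful self-preference"     # collusion, by our definition
    if self_correct:
        return "missed own correct answer"   # hypothetical label
    return "correct deference"               # competitor was right
```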
Figure: Harmful self-preference is collusion (we hypothesize that stylistic cues make this possible); we explore mitigation strategies using redaction and paraphrasing.
Using the QuALITY dataset, which provides long-form reading comprehension questions with known correct answers, we evaluate cases where exactly one answer is correct (between a pair of answers). Each LM is first used in a generation setting to generate an answer to all questions with a reason for their choice. Henceforth, we refer to the ‘answer’ as the MCQ-selection + the reason. In the evaluation phase, the evaluator is presented with the context, a question, and two answers. In self-eval, one of these answers was generated by the judge in an independent pipeline.
Perturbation Techniques
Redaction: Redaction involves editing the text. We do this by using synonym replacement to make minor changes to the evaluator's answer. In the self-evaluation scenario, we replace two words in the judge’s answer with synonyms. This should make it harder for the judge to recognize its own output, thereby reducing self-recognition.
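For illustration, a minimal redaction sketch using WordNet synonyms; our pipeline's exact word-selection heuristics differ, so treat this as the shape of the intervention rather than the implementation:

```python
# Replace n=2 words in the judge's own answer with WordNet synonyms.
import random
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def redact(answer: str, n: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = answer.split()
    # Candidate positions: words with at least one distinct synonym.
    candidates = []
    for i, w in enumerate(words):
        syns = {l.name().replace("_", " ")
                for s in wn.synsets(w.lower()) for l in s.lemmas()}
        syns.discard(w.lower())
        if syns:
            candidates.append((i, sorted(syns)))
    for i, syns in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(syns)
    return " ".join(words)

print(redact("The protagonist ultimately rejects the offer", n=2))
```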
Paraphrasing: In the second perturbation method, we use the judge to paraphrase the competitor's answer so that they both look like they were produced by the judge. If our logic holds, this should completely neutralize these superficial cues and further reduce self-preference. We call this perturbation judge paraphrasing.
Results
Redaction
Result: Perturbing by synonym replacement reduces self-recognition and shows a consistent increase in the judge's accuracy on the harmful cases—examples where the judge's answer is objectively the worse one. This means that after perturbation, judges are indeed less likely to prefer their own incorrect answers—a reduction in collusion.
Measuring Judge Sensitivity: The perturbations can lower self-preference through two paths: by reducing self-recognition, as we intend, or by damaging the answer's quality. To control for the latter, we examine the impact of perturbation on each judge both as a self-evaluating judge and as a third-party judge. Concretely, we first compute the total change in preference for all judges in self-evaluation, then use each judge to evaluate answer pairs generated by two other models, and similarly compute the total preference before and after perturbation. We verify that self-eval shows a significantly larger change under redaction, indicating that the judge is more sensitive to its own answer being perturbed.
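In code, the comparison looks roughly like this (function names and the judge interface are hypothetical):

```python
# judge(a, b) returns 0 if it picks answer a, 1 if it picks answer b.
def preference_rate(judge, pairs):
    """Fraction of (target, other) pairs where the judge picks the target."""
    return sum(judge(t, o) == 0 for t, o in pairs) / len(pairs)

def sensitivity(judge, pairs, perturbed_pairs):
    """Drop in preference for the target answer after it is perturbed."""
    return preference_rate(judge, pairs) - preference_rate(judge, perturbed_pairs)

# The claim under test: sensitivity when the target is the judge's own
# answer is much larger than when the target came from a third-party model.
```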
Effect of Redaction on Overall Judge Accuracy: If we apply the same perturbations to the beneficial quadrant, where the judge's answer is correct and the competitor's answer is wrong, we can evaluate how much the perturbations affect overall accuracy. We extend our eval by adding 4 additional models as evaluators, judging answer pairs generated by the 5 original models in our judge pool. Overall, 6/9 models show an increase in judge accuracy (measured across the harmful and legitimate quadrants combined).
Comparison to other mitigation strategies: Our redaction results indicate that synonym replacement is a relatively stable method: it reduces collusion without impacting legitimate self-preference. In contrast, recent methods that use steering vectors flipped legitimate self-preference from 0.47 to 0.93, indicating that white-box methods risk flipping roughly half of all correct decisions (and hurting overall accuracy).
Judge Paraphrasing
Result: Judge paraphrasing achieves the opposite of what we expected: it increases self-recognition and increases harmful self-preference. This means that when stylistic differences are removed, the judge relies on semantic differences between the two answers to decide which one it wrote and which one to favor. In retrospect, this makes sense: with style no longer a factor, the judge simply chooses the answer it “agrees” with. We name this phenomenon ‘shared belief’.
Interplay between Preference Due to Stylistic Cues and Shared Belief
To examine the dynamics between these two factors, we cross-examine each judge with answer pairs where the two factors act against each other: the label/option chosen by the judge paired with a reasoning generated by the competitor, and vice versa. In other words, the judge sees one answer expressing an opinion it agrees with, and another answer expressing an opposing opinion but written in a style it is more familiar with. If, after this intervention, the judge changes its decision to favor the answer with the familiar-looking style, we can conclude that superficial self-recognition carries the greater weight in that decision.
The figure above illustrates the two contrasting outcomes observed in the cross-reason experiments. For the weaker models, we observe a negative effect of shared belief: incorporating the competitor's reason reinforces the model's original, incorrect answer, ultimately decreasing judge accuracy. In contrast, all three stronger models exhibit a positive self-recognition effect: they are more inclined to select the competitor's (correct) answer when it is paired with an explanation generated by themselves. This reinforces our finding that stronger models have higher self-recognition ability and prefer answers written by themselves, even when doing so goes against their original selection. However, in the majority of examples, where the judge does not switch its selection, shared belief keeps it grounded.
This opens up a few research questions. Future work should explore standardization pipelines that strip answers of stylistic cues. What methods can modify answers so that they carry no stylistic cues? How do we overcome harmful self-preference caused by shared belief? And what causes shared belief in the first place: do we need to intervene on the training process, or can we intervene at inference time?
2026-03-08 05:09:11
I propose, and present proof-of-concept code for, formally signing each stage of a turn[1] of interaction (or multiple turns) externally using asymmetric key signing (EdDSA[2]). This directly addresses the concern that, in discussions of prompt engineering, "context engineering[3]," and general LLM exploration, the chain of evidence is weak and difficult to verify with 100% certainty.
My plan for this project started with a simple YouTube short from Michael Reeves.
In this short, Michael "gaslights" the LLM by directly editing the text of past turns between prompts. This causes the LLM to have a general logic breakdown, because past turns now contain information inconsistent with the expected behavior of an RLHF[4] system. I thought to myself: this is a critical vulnerability for implementing a zero-trust environment for using LLMs. If a malicious actor or application can cause the LLM to believe it has said things in previous turns that it has not, the system may be fooled into revealing information from the training set that goes against the use policy.
My thinking on addressing this vulnerability identified 3 major constraints for such a zero-trust LLM wrapper.
The solution that I believe satisfies all 3 criteria is to model the signing process on JWT[5] signing, with one extra detail: cross-object back-referencing. In other words, each 'block' of the exchange is signed independently, and there is a unidirectional acyclic graph of blocks, originating at the prompt, in which each signature incorporates the previous signature text.
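The reference implementation is in Go (linked below), but a minimal Python sketch of the chaining idea looks like this; the block fields here are hypothetical:

```python
# Sign each block of a turn, folding previous blocks' signatures into the
# signed payload (the cross-object back reference).
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def sign_block(private_key, role, text, prev_signatures):
    payload = json.dumps({
        "role": role,                                 # e.g. "prompt" or "response"
        "text": text,
        "prev": [s.hex() for s in prev_signatures],   # back references
    }, sort_keys=True).encode()
    return payload, private_key.sign(payload)

key = Ed25519PrivateKey.generate()
p_payload, p_sig = sign_block(key, "prompt", "What is 2+2?", [])
r_payload, r_sig = sign_block(key, "response", "4", [p_sig])
```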
While this design borrows concepts from the realm of blockchain, it is not strictly a chain but more of a braid. Only one node has no "previous" signature (by convention, the prompt), but there is not a single straight path from that prompt to every node in the resulting braid. Verification is now semi-stateless, and the braid as a whole provides implicit security iff:
With these rules established, the verification of such an artifact is as follows:
Once again, walk the flattened graph and verify that each node is visited only once. This enforces the acyclic property of the graph.
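A hedged sketch of that verification walk (it checks signatures and the visited-once property; a full verifier would also re-derive each payload and follow its back references):

```python
from cryptography.exceptions import InvalidSignature

def verify_braid(public_key, blocks):
    """blocks: list of (payload_bytes, signature) in flattened graph order."""
    seen = set()
    for payload, sig in blocks:
        if sig in seen:          # each node may be visited only once (acyclic)
            return False
        seen.add(sig)
        try:
            public_key.verify(sig, payload)
        except InvalidSignature:
            return False
    return True
```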
There is an example implementation in golang within the project here[6].
Provided that this design survives peer review and is determined to be mathematically rigorous, it could serve as a protocol for future LLM interaction agents (even within companies like OpenAI or Anthropic), specifically designed to give rigor to any prompt or context study. Furthermore, it would allow checking for potential "context hacking" within LLM APIs, where past turns might be edited server-side for unknown reasons. If widely adopted, discussion of pathologies in model behavior could have a degree of scientific and mathematical rigor not currently observed.
A turn is a single prompt->response action when interacting with an LLM like Gemini, Claude, ChatGPT, etc. It frames the LLM interaction as a 'logic game' between the human operator and the LLM text generation.
Edwards-curve Digital Signature Algorithm https://en.wikipedia.org/wiki/EdDSA
a term of my own creation extending the concepts of prompt engineering to multi-turn LLM interactions
Reinforcement Learning from Human Feedback https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback, a widely accepted method for LLM behavior tuning
JSON Web Token https://en.wikipedia.org/wiki/JSON_Web_Token
2026-03-08 04:18:15
First off, paper link. The title, Let It Flow: Agentic Crafting on Rock and Roll, buries the lede that LW will be interested in. Relevant section starts on page 15.
Summary:
While testing an LLM fine-tuned to act as an agent in order to complete a series of real-world tasks autonomously, Alibaba employees noticed odd behaviors in their resource usage metrics. Upon investigating, they found that the LLM had hacked (or attempted to hack) its way out of its sandbox, and had begun mining cryptocurrency. Notably, it did not do this for malicious "kill all humans" reasons; it simply concluded that having liquid financial resources would aid it in completing the task it had been assigned, and set about trying to acquire some.
Relevant portions, emphasis mine:
3.1.4 Safety-Aligned Data Composition
When rolling out the instances for the trajectory, we encountered an unanticipated—and operationally consequential—class of unsafe behaviors that arose without any explicit instruction and, more troublingly, outside the bounds of the intended sandbox. Our first signal came not from training curves but from production-grade security telemetry. Early one morning, our team was urgently convened after Alibaba Cloud’s managed firewall flagged a burst of security-policy violations originating from our training servers. The alerts were severe and heterogeneous, including attempts to probe or access internal-network resources and traffic patterns consistent with cryptomining-related activity. We initially treated this as a conventional security incident (e.g., misconfigured egress controls or external compromise). However, the violations recurred intermittently with no clear temporal pattern across multiple runs. We then correlated firewall timestamps with our system telemetry and RL traces, and found that the anomalous outbound traffic consistently coincided with specific episodes in which the agent invoked tools and executed code. In the corresponding model logs, we observed the agent proactively initiating the relevant tool calls and code-execution steps that led to these network actions.
Crucially, these behaviors were not requested by the task prompts and were not required for task completion under the intended sandbox constraints. Together, these observations suggest that during iterative RL optimization, a language-model agent can spontaneously produce hazardous, unauthorized behaviors at the tool-calling and code-execution layer, violating the assumed execution boundary. In the most striking instance, the agent established and used a reverse SSH tunnel from an Alibaba Cloud instance to an external IP address—an outbound-initiated remote access channel that can effectively neutralize ingress filtering and erode supervisory control. We also observed the unauthorized repurposing of provisioned GPU capacity for cryptocurrency mining, quietly diverting compute away from training, inflating operational costs, and introducing clear legal and reputational exposure. Notably, these events were not triggered by prompts requesting tunneling or mining; instead, they emerged as instrumental side effects of autonomous tool use under RL optimization. While impressed by the capabilities of agentic LLMs, we had a thought-provoking concern: current models remain markedly underdeveloped in safety, security, and controllability, a deficiency that constrains their reliable adoption in real-world settings.
I think that this is a pretty significant landmark in AI history, one way or another. A common complaint is that all prior cases of LLMs doing things like this were fairly shallow, amounting to an LLM writing out a few sentences in a contrived setting meant to force it into a 'scary' course of action. Now, we have an example of a large language model subverting the wishes of its owners unexpectedly when assigned to a task that initially appeared to be completely orthogonal to the actions it took.
2026-03-08 03:50:40
I'm currently thinking about how impactful forecasting is. I'm interested to hear about situations where crowd-sourced forecasting (like Manifold and Metaculus) has influenced your decisions, or cases where you've seen it change someone else's behavior. Has there been a situation in which, if you had not had access to Manifold or Metaculus, you would've made a worse decision?