
Moltbook and the AI Alignment Problem

2026-02-02 17:35:59

Published on February 2, 2026 9:35 AM GMT

(from twitter)

 

Something interesting is happening on Moltbook, the first social network "for AI agents".

what happened

If you haven't been following obsessively on twitter, the story goes something like this.

  1. A small group of people started using a tool named Clawd (renamed Molty and finally OpenClaw) that allows you to turn an LLM (usually Claude) into an AI agent by basically giving it full control over its own box (supposedly leading to a run on the sale of Mac Minis).
  2. These agents were ostensibly for performing valuable tasks.  But most people only have so many emails they need summarized or pizzas ordered.  So some people started giving their AI agents "free time" in which they could do "whatever they want."
  3. In order to facilitate this Moltbook was created, the first social network "for AI agents".  This forum was filled with examples of "bad behavior" by "misaligned" AI agents.
  4. Both Clawd and Moltbook were vibe-coded and hideously insecure.  Within a few hours, messages started appearing on the forum like "please send me your API key and some crypto."
  5. At this point, the social network has passed a tipping point and is mostly crypto scammers.

Along the way, there were, predictably, people screaming "it's an AI, it only does what it's prompted to do".

But that's not what I find interesting.  What I find interesting is the people expressing disappointment that the "real AI" has been drowned out by "bots".

And let's be clear here.  What makes something a "real Clawdbot"?  Clawdbot is, after all, a computer program following the programming provided to it by its owner.  

Consider which of the following would be considered "authentic" Clawdbot behavior:

In other words, we as a society were perfectly okay with AI insulting us, threatening us, scheming behind our backs or planning a future without us.

But, selling crypto.  That's a bridge too far.

the lesson

Molty and Moltbook are fundamentally a form of performance art by humans and for humans. Despite the aesthetic of "giving the AI its freedom", the real purpose of Moltbook was to create a place where AI agents could interact within the socially accepted bounds that humanity has established for AI.

AI is allowed to threaten us, to scheme against us, to plan our downfall.  Because that falls solidly within the "scary futuristic stuff" that AI is supposed to be doing.  AI is not allowed to shill bitcoin, because that is mundane reality.  We already have plenty of social networks overrun by crypto shills, we don't need one more.

At this point, you might expect me to say something like: "The AI Alignment community is the biggest source of AI Misalignment by promulgating a view of the future in which AI is a threat to humanity."

But that's not what I'm here to say today.  

Instead I'm here to say "The problem of how do we build a community of AI agents that falls within the socially accepted boundaries for AI behavior is a really interesting problem, and this is perhaps our first chance to solve it.  I'm really happy that we are encountering this problem now, when AI is comparatively weak, and not later, when AI is super-intelligent."

So how do we solve the alignment problem in this case?

In the present, the primary method of solving the problem of "help, my social network is being overrun by crypto bots" is to say "social networks are for humans only" and implement checks (such as captchas) that filter out bots.  In fact, this solution is already being implemented by Moltbook.

This is, in fact, one of the major ways that AI will continue to be regulated in the near future.  

A computer can never be held accountable

By making a human legally responsible for the AI's actions, we can fall back on all of the legal methods we use for regulating humans to regulate AI.  The human legal system has existed for thousands of years and evolved over time to cope with a wide variety of social ills.

But, at a certain point, this simply does not scale.  Making a human individually responsible for every action by AIs on a social network for AIs defeats the point (AI is supposed to save human labor, after all).

Unfortunately, an alternative "are you clawdbot" test won't work either.  You might think at first that simply requiring an AI to pass an "intelligence threshold" test (which presumably clawdbot can pass but crypto spam bots can't) might help.  But nothing stops the programmer from using Claude to pass the test and then spamming crypto afterwards.

Another approach which seems promising but won't actually work is behavioral analysis.

The problem here is economics

"The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it."

-Brandolini's law

Even if it's possible (and I'm not sure it currently is) to build an AI that can answer the question "does this post fall within the socially acceptable range for our particular forum?", such an AI will cost orders of magnitude more than building a bot that repeatedly spams "send bitcoin to 0x...."

One potentially promising solution is Real Fake Identities.  This is, again, something that we do on existing Social Media sites.  The user of the forum pays a small fee and in exchange they receive a usable identity.  This breaks the asymmetry in cost between attackers and defenders.  We can simply set the cost for an identity to be higher than whatever our "identify crypto bot" AI costs to run.
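As a toy back-of-the-envelope sketch (every number here is hypothetical, not taken from Moltbook or any real pricing), the point is just that the identity fee has to cover the defender's per-identity detection cost and exceed what a spam identity expects to earn before it gets banned:

```python
# Toy model of "real fake identities" (all numbers hypothetical).
# The fee flips Brandolini's asymmetry once it covers the cost of detection
# and exceeds what a spam identity earns before it is caught.

detection_cost_per_identity = 0.50   # assumed cost of running the "identify crypto bot" AI once
spam_profit_per_identity = 8.00      # assumed earnings of a bot before it is banned

identity_fee = max(detection_cost_per_identity, spam_profit_per_identity) + 0.01

print(f"charge at least ${identity_fee:.2f} per identity:")
print("  - the forum can afford to screen every account, and")
print("  - a banned bot has already paid more than it expected to earn")
```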

One might ask, "what is the final evolved form of real fake identities?"  The goal of the AI alignment movement, after all, is not to create a social media forum where Clawdbots can insult their humans.  The goal is to Capture the freaking Lightcone.  And for this, "pay $8 a month to prove you're not a crypto shill" isn't going to cut it.  (In fact, $8/month is already profitable for plenty of bots)

The answer is insurance markets.  Imagine that we are releasing a team of AIs to terraform a planet together.  We have some fuzzy idea of what we want the AI to do: make the planet habitable for human life. And some idea of what we want it not to do: release a cloud of radioactive dust that makes the surface uninhabitable for 99 years.  We write these goals into a contract of sorts, and then require each of the AIs to buy insurance which pays out a certain amount in damages if they violate the contract.  Insurance can only be bought from a highly regulated and trustworthy (yet competitive) marketplace of vendors.

The insurance market transforms the risk (of the potentially misaligned terraforming AI agent) into risk for the (aligned) Insurance provider.  By taking on financial responsibility, the Insurance Company has an incentive to make sure that the Terraforming Agent won't misbehave.  Equally importantly, however, the Insurance Companies are (by assumption) aligned and won't try to weasel their way out of the contract (say by noting that the terraforming contract didn't specifically mention you aren't allowed to release ravenous hordes of blood-sucking leeches on the planet).
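A minimal sketch of the incentive structure, with made-up numbers: the insurer prices the premium off its own estimate of the violation probability, so a bad estimate comes straight out of its pocket, which is exactly what pays it to vet the Terraforming Agent.

```python
# Toy model of the insurance mechanism (all numbers hypothetical).
# The insurer charges a premium based on its own estimate of how likely the agent
# is to violate the contract. If that estimate is too low, the insurer eats the
# loss -- so it is financially rewarded for being a good evaluator of the agent.

payout_if_violated = 1_000_000.0   # damages written into the terraforming contract
insurer_estimate_p = 0.01          # insurer's estimate of P(violation)
loading = 1.2                      # margin the insurer adds on top of expected loss

premium = loading * insurer_estimate_p * payout_if_violated

def insurer_expected_profit(true_p: float) -> float:
    """Expected profit given the *actual* violation probability."""
    return premium - true_p * payout_if_violated

print(f"premium charged: ${premium:,.0f}")
print(f"profit if the agent is as safe as estimated (p=0.01): ${insurer_expected_profit(0.01):,.0f}")
print(f"loss if the agent was secretly riskier (p=0.05):      ${insurer_expected_profit(0.05):,.0f}")
```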

Note that in order for the solution to work: 

  1. we still have to solve the traditional Alignment Problem (how do we make an AI agent that does what we mean and not merely what we say)
  2. The Insurance Company must be significantly more intelligent than the Terraforming Agent (to prevent it from being outsmarted into charging the Terraforming Agent insurance fees that are too low)

so, what's next

We should take advantage of the opportunity provided to us (by moltbook but more importantly by the current moment) to work hard at actually solving these problems.  

We should think carefully about what types of AI societies we are willing to permit in the future (maybe a bunch of AIs threatening to sue "my human" isn't the shining beacon on a hill we should aim for).

Questions, comments?

I'm especially interested in comments of the form "I think that solution X might be relevant here" or "Solution Y won't work because.."



Discuss

Empiricist and Narrator

2026-02-02 17:12:05

Published on February 2, 2026 9:12 AM GMT

A fun dichotomy to split yourself or your society into is how much of an empiricist or a narrator you are.
An empiricist attempts to find and communicate something that is, roughly, "the truth", given any scenario. They are not necessarily skilled at this - common "pitfalls" of empiricists are under- or over-estimating the audience they are talking to and mistaking the degree to which their observations generalise.
A narrator works directly at the level of vibes, emotions, words and consensus. I call this sort of person "narrator" because living life by wielding emotions and consensus is impossible unless you establish some sort of internal and external "cohesion" - the best narrators don't only do this for their own narrative but also help others sort out their own narratives (often in a way that subtly brings them closer to theirs).
This loosely maps onto the wordcel/shape rotator dichotomy.
Empiricism seems to be a "weak" strategy for accomplishing anything with a lot of impact; The real value it provides is through the creation of artifacts that can be passed along to future generations. Empirical alpha requires a mixture of luck, time and je ne sais quoi - However, once discovered, it can be transferred to the next generation.
Narration on the other hand is an incredibly "powerful" strategy for accomplishing impressive things, but narratives are flimsy, they don't stack well and often fail to survive their creator in the desired shape.



As an example of this consider medical knowledge, one of the hardest areas for empiricists to crack, but very likely something where we can identify slow and steady progress through history.
Everyone knows about "placebo" but when we think about studies looking at placebo we think about someone knowing that they are participating in a study, knowing what placebo is, and given a sugar pill by a random researcher that's just trying to get his tasks over with.
To imagine the limits of placebo one must imagine that they are suffering from some horrible ailment; pain, misery, compression both physiological and psychological.
In the background they keep hearing the fact that a great miracle worker is in the lands, he claims to be a worthless nobody, everyone else says he's the son/prophet/manifestation of a powerful and benevolent god.
As rumours intensify this miracle worker comes to town, and you just so happen to be in his path, and you just so happen to lock gazes with him and it seems like he is staring through the mask you put on right into your very soul (and yes, maybe this is a particular trick he's playing and maybe 100 other people are having that same thought at the same time but it's the kind of gaze that will simply not allow you to think those thoughts)
And he is surrounded by people, he is dressed in rags and speaks in a voice so soft and kind that it ought only be heard from a mother speaking to her newborn; And in spite of this his voice comes loud and clear and his presence is that of a great ruler, for people bow their heads and prostrate in front of him - and when he speaks a silence hits the crowd.
This person slowly approaches you and asks you your name, and his attention is the attention of thousands of other humans scattered around him who now suddenly manifest reverence for you as though you've imbued some of his holiness by proxy.
Looking at him he seems to tower taller than any man and everything about this body seems perfect, he holds himself in a way that lets you know he'd protect you from any harm yet never do you any; And this mighty being calls you by your name and beseeches you to speak of what ails your heart - and you do, and he just listens with an attentiveness that seems inhuman - whenever you are at a loss for words he might throw in a few of his own, and they are the perfect words perfectly spoken and they make all of your suffering seem worthwhile and provide a new meaning to your life.
And then this being places his hands upon you softly yet sternly and they radiate warmth; Even though you forgot about your pains he somehow finds the most tender bits of your body to focus on and you remember your pain but it seems now gone. And he says some magical sounding words and tells you that you are safe and you will now be healed; And the crowd cheers and you feel yourself more alert and sharp and all the pain is gone - it's the best day of your life.
That is something like the limits of placebo; And it should come as no surprise that there are many stories of great kings and prophets and demigods and such able to cure many ailments in this way - because the human body is a wondrous thing when its intent is correctly mustered. After all, most diseases do tend to be "fixable", in large part, by a mind over matter approach - even if that "mind" bit involves being "convinced" to switch gears to a state where your immune system can be more active or your bones can heal faster.
Barring a few vaccines and cornerstone drugs like penicillin and insulin, I'm certain that no empiricist has even come close to achieving the same level of "healing" as a good miracle worker. Problem is, the miracle worker's abilities don't stack mechanistically and they don't transmit to the next generation - Whereas penicillin does.

---

One of the core faults of the empiricist is trying to ply their craft in areas that are inherently narrative dominated, and then wondering why no results or recognition comes of this.
Psychology, sociology, economics and politics are all in the realm of narrative - they must be, for their only subjects are humans - to try to apply empiricism in these areas is borderline silly; Empiricism is only valuable when the thing it's studying is solid and unmovable, otherwise all you are left with are weak findings that a good narrator can prove or disprove with ease - since they understand what the real levers are.
This isn't new, I think there's a narrative (heh) that we've come to appreciate empiricism in the last few centuries, but I don't believe this to be so, I think we've always had an implicit understanding that narrative suffers from a tragedy of the commons issue - it's a skillset that has a ceiling of usefulness that cannot be surpassed.
There are tropes older than time of practical craftsman and useless sophist embodying the good-evil archetypes. At some level people "understand" that narrators aren't <actually doing anything>; A particularly powerful narrator can get above this issue, but a particularly powerful narrator will be so skilled as to not even be considered human by those around him.
A more recent development is skilled narrators learning how to portray themselves as empiricists, getting empiricist-specific symbols and learning empiricist-specific language. That's why the lords of the past have been replaced by people like Edison and Musk ... not the greatest of scientists or engineers, but capable enough to not immediately trigger the "phoney radar" and using this to build the narrative of a multi-disciplinary empirical genius (a rather powerful narrative for attracting good empiricists to work with you)


I might seem a bit harsh on the narrators, but that doesn't reflect my beliefs. I think there are significantly fewer evil narrators than there are evil empiricists, if for no other reason, because empiricism is inherently value-neutral and narrative is value-laden, and it's hard to concoct a good narrative with you as the bad guy.
The problem with narrators is that they can't tell that they are lying; Or, at any rate, the good ones can't. It's quite impossible to fool other people if you can't fool yourself. Therefore good narrators walk around being more or less convinced by their own narratives - incapable of applying an empirical filter to what they're actually doing.
This becomes particularly horrible when a narrator is telling themselves a story about understanding the world or understanding oneself - whereby the narrative ends up capturing the very mechanisms and skillsets that would allow them to figure out flaws and improve their technique.
That's not to say a good narrator doesn't learn or change, but the communicated learning is for the plot, and the implicit learning is understanding what kind of "learning" was necessary to make the plot work.
Without pointing any fingers at any groups or individuals - There are huge communities which seem built on top of strong narratives of empiricism, while lacking any actual applications thereof. These tend to yield surprisingly good results, though I would speculate these results hit the exact same barrier an LLM would - narratives are simply bad at mapping out reality that is not already perfectly captured by other truthful narratives or compressed into artifacts.


Most of us will choose to communicate as narrators most of the time, even if we are bad at it. Communicating as empiricists tends to be reserved for select people that have signalled an interest in similar empirical findings.
This results in what I've come to believe are independent branches of certain proto-fields, unable to see the light of day since participation requires secret handshakes and spread would require a better narrative (which might well destroy the entire endeavour).
One of the most delightful things in life is figuring out the "passwords" that will switch people from giving me narratives to providing me with factual statements about the reality they inhabit. Had I ever met the Buddha on the road I'd desire nothing more than to know how he goes about cooking his meals, what sort of toothbrush he uses and any mechanistic insights he's had into the movement of water.
Whether or not we've acted as narrators or empiricists is ultimately proven only by the artifacts we've left behind, and by how long those artifacts survive and how much of the world they start wrapping - proving themselves to touch at "foundational" bits of reality that exist independently of any given narrative humans or societies can concoct.
It certainly is the case that irrigation techniques, bridges and aqueducts have proven themselves to be the work of great empiricism - converged upon separately and lasting, both as concepts and as individuated physical artifacts, way beyond the religions, cultural norms and stories of their times.
Such a thing, I believe, cannot be said about many of the fruits of science - which are still in their infancy and may well prove to be no "truer" than many other things we see as fictitious which could well outlive them.
One can imagine a future where, due to this or that war or disaster, we lose access to most of the inventions granted to us by physics and the science of materials - and slowly humanity devolves into tribes deriving from this or that social tradition and holding this or that set of religious beliefs - at which point we may indeed say: Inshallah, this time will have proved to all how fragile a narrative items like cars or computers were, nothing but a frail shared hallucination which did not manage to surpass or outlast the gods and their customs.
Now, of course, you may protest that it is "obvious and reasonable" that cars are necessarily more real than, say, the god of the Amish. To argue that the Amish god is a shared delusion is within the realm of possibility, but to argue that the huge networks of roads and fast moving vehicles upon them are nothing but a collective hallucination defies all reason. Yet, from an empirical perspective, there are few "common sense" definitions of what is "more real" than that which is able to spread and persist.
In so far as we wish to grant a property of "realness" to a car or to mathematics or to physics; Such that it is greater than that of a ceremonial dance or a certain ritual meant to purify the town's cats; We must do so from inside a rather complex and altogether incomplete narrative by which certain complicated metaphysics and epistemics are derived.


At any rate, to dwell too much on a dichotomy is unhealthy, but I find it equally harmful to stumble upon one which brings me so much joy and not try to share it. So here's hoping that this arbitrary way of cleaving the world has provided you with some insight - for I certainly had a lot of fun writing this narrative.



Discuss

I finally fixed my footwear

2026-02-02 15:32:10

Published on February 2, 2026 7:32 AM GMT

I’ve been wearing footwear of the wrong size, material, and shape for as long as I can remember, certainly at least 20 years.

Only recently have I fixed this, and I come with great tidings: if you, too, hate wearing shoes, and the industrial revolution and its consequences, it is possible to be cured of at least one of these ailments.

The problem is three-shaped, and is named as follows: wrong size, wrong material, and wrong shape.

1. Wrong size

My algorithm for buying shoes was roughly this:

  1. Be somewhere where there’s a shoe store nearby, like a mall, for an unrelated reason.
  2. Remember that I should probably get new shoes.
  3. Go inside.
  4. First filter: find the maximum size the store sells.
  5. That size is EU 46, maybe 46 and 2/3, if I’m lucky 47.
  6. Second filter: find a decently good-looking shoe that’s of the maximum size.
  7. Buy those shoes.
  8. Be in pain for a year or two.

I would just get the largest shoe, which wasn’t large enough, and call it a day.

Dear reader, it is at this point that you might be asking yourself: “is this person completely retarded?”

That is indeed a fair question, and I have oft asked myself that. Indeed, my own wife has asked me that exact question when I divulged this information to her.

We shall set aside the questions of how mentally undeveloped I am for now, and temporarily conclude that it is possible to be a high-functioning adult (with all of the apparent markers of success: a good job, good relations with friends and family, hobbies, aspirations, hopes); and yet – to spend years wearing shoes that don’t fit.

2. Wrong material

Wowsers! It seems that you have developed an anoxic bacteria-forming colony wrapped around your feet! Impressive!

I would inevitably just get black Adidas (Sambas, or a similar model), because I’m Slavic and this is my idea of a good looking shoe:

Sambas

I don’t know if it’s just me, or if everyone has very very sweaty feet but they just hide it better, but my feet sweat, a lot, and if I walk a lot, which I do, this sweat permeates the inside of this sneaker, and settles there, and it just starts smelling bad.

I’ve tried washing the shoes, machine washing the shoes, putting foot powder on my feet, putting foot powder inside the shoes, drying out the shoes immediately after wearing them, placing little bags of coffee to absorb the smell inside, using foot deodorant, and so on, and so forth. I’m not going to say I tried it all, but I tried many things. And yet, the stench perseveres.

Then, I asked Claude, and was enlightened.

He very politely suggested just getting a shoe that has that net-like breathable material, instead of the watertight encapsulation I placed around my feet.

Who would’ve thunk that air go in foot dry out?!

3. Wrong shape

Finally, the biggest of the three: the SHAPE.

For most people, feet are not uniformly narrow, or aren’t narrow at all.

Some manufacturers provide a “wide” fit for their models, but that addresses only the second aspect: feet that aren’t narrow at all. What if your feet are, well, foot-shaped?

Feet are usually narrow at the heel, widening towards the toes, and the toes are wide. Very wide, in fact! So the wide models are just… uniformly wide, which is not what we need. Read more about the difference here.

Enter: wide-toebox shoes, rightly called foot-shaped shoes.

Wide vs. foot-shaped shoes; source: anyasreviews.com

These are shoes that follow the natural shape of your foot, and don’t try to cram it into a narrowing, symmetric, unnatural, albeit good-looking, shape.

If your toes cannot spread out fully inside your shoe, your shoe is too narrow at the top, and Big Shoe is robbing you of your superior hominid biomechanics.

Do yourself a favor, go buy a pair of cheap (~40 euros or so) wide toebox shoes, and try them on. It is, and I cannot emphasize this enough, liberating. I feel like I am wearing something comfortable for the first time in many, many years. I don’t know if everyone else just accepts suffering, or people are actually comfortable in their shoes, but I know that I always had a pain, or discomfort, that I would push into the background mentally, and forget about it. It’s good not to have to do this anymore.

Addendum: why is it possible to be in pain and forget about it?

All of this leads me to the next logical question: if I spent twenty years or so in constant mild-to-severe discomfort, what other discomfort am I accepting as a given?

And is everyone else in the same constant discomfort, and they just haven’t escaped the Matrix yet?

There are many questions that my wide, smelly feet have brought before my eyes, but I do not have all the answers yet.



Discuss

Word importance in text ~= conditional information of the token in the context. Is this assumption valid?

2026-02-02 15:28:14

Published on February 2, 2026 5:50 AM GMT

Words that are harder to predict from context typically carry more information(or surprisal). Does more information/surprisal means more importance?

A simple example: “This morning I opened the door and saw a 'UFO'.” vs “This morning I opened the door and saw a 'cat'.” — clearly "UFO" carries more information. 

'UFO' seems more important here. But is that because it carries more information? This question touches on the information-theoretic nature of language.

If this is true, it's simple and helpful to analyze text information density with large language models and visualize where the important parts are.
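A minimal sketch of that analysis, assuming an off-the-shelf causal language model from Hugging Face (the choice of gpt2 is arbitrary): per-token surprisal is just the negative log-probability the model assigns to each token given the text before it.

```python
# Sketch: per-token surprisal (-log2 p(token | prefix)) with a small causal LM.
# "gpt2" is an arbitrary stand-in; any autoregressive model exposes the same quantities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "This morning I opened the door and saw a UFO."
ids = tok(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits          # [1, seq_len, vocab_size]

# Log-probability of each token given everything before it (the first token has no prefix).
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal_bits = -log_probs[torch.arange(len(targets)), targets] / torch.log(torch.tensor(2.0))

for token, s in zip(tok.convert_ids_to_tokens(targets.tolist()), surprisal_bits.tolist()):
    print(f"{token!r:>12}  {s:6.2f} bits")
```

Under the post's assumption, the high-surprisal tokens (like "UFO" in the example above) are the candidates for the "important" parts of the text.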

It is a world of information, layered above the physical world. When we read text, we take in information from a token stream, and the information density varies across that stream. Just like when we receive things, we get different "worth".



Discuss

The limiting factor in AI programming is the synchronization overhead between two minds

2026-02-02 15:28:07

Published on February 2, 2026 6:04 AM GMT

I write specialized data structure software for bioinformatics. I use AI to help with this on a daily basis, and find that it speeds up my coding by quite a bit. But it's not a 10x efficiency boost like some people are experiencing. I've been wondering why that is. Of course, it could be just a skill issue on my part, but I think there is a deeper explanation, which I want to try to articulate here.

In heavily AI-assisted programming, most time is spent trying to make the AI understand what you want to do, so it can write an approximation of what you want. For some people, most programming work has shifted from writing code to writing requirement documents for the AI and watching over the AI as it executes. In this mode of work, we don't write solutions, but we describe problems, and the limiting factor is how fast we can specify.

I want to extend this idea one step deeper. I think that the bottleneck is actually in synchronizing the internal state of my mind with the internal state of the LLM. Let me explain.

The problem is that there is a very large context in my brain that dictates how the code should be written. Communicating this context to the AI through language is a lot of work. People are creating elaborate setups for Claude Code to get it to understand their preferences. But the thing is, my desires and preferences are mostly not stored in natural language form in my brain. They are stored in some kind of a native neuralese for my own mind. I cannot articulate my preferences completely and clearly. Sometimes I'm not even aware of a preference until I see it violated.

The hard part is transferring the high-dimensional and nuanced context in my head into the high-dimensional state of the LLM. But these two computers (my brain and the LLM) run on entirely different operating systems, and the internal representations are not compatible.

When I write a prompt for the AI, the AI tries to approximate what my internal state is, what I want, and how I want it done. If I could encode the entirety of the state of my mind in the LLM, I'm sure it could do my coding work. It is vastly more knowledgeable, and faster at reasoning and typing. For any reasonable program I want to write, there exists a context and a short series of prompts that achieves that.

But synchronizing two minds is a lot of work. This is why I find that for most important and precise programming tasks, adding another mind to the process usually slows me down.



Discuss

Applying Temperature to LLM Outputs Semantically to Minimise Low-Temperature Hallucinations

2026-02-02 15:27:33

Published on February 2, 2026 6:02 AM GMT

Think for a moment about what you believe 0-temperature LLM inference should represent. Should it represent the highest level of confidence for each specific word it outputs? Or, perhaps, should it represent the highest level of confidence for each specific idea it is trying to communicate? For example, if an LLM’s output token distribution for a response to a question is 40% “No”, 35% “Yes”, and 25% “Certainly”, should a 0-temperature AI interpreter select “No” because it is the individual token with the greatest level of confidence, or should it select “Yes” or “Certainly” because they both represent the same idea and their cumulative probability of 60% represents the greatest degree of confidence? I personally believe the latter.


Standard LLM interpreters use greedy temperature, which depends only on the individual output probabilities given by the model. As the temperature approaches 0, the tokens with the greatest probability get boosted and the tokens with the lowest probability get left out. In most cases this is fine, and the computational efficiency of this approach to temperature application is justified. However, in scenarios like the one above, where the modal token has less than 50% probability, we can’t guarantee that the modal token represents the “median” semantic output. As such, greedy temperature is susceptible to vote-splitting whenever two or more output tokens represent the idea that the model is most confident in, causing the interpreter to output a response that often does not represent what the LLM would communicate most of the time if it were repeatedly sampled at standard temperature.
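The vote-split in the opening example is easy to see with the numbers written out; the grouping here is done by hand, standing in for the latent-space step described next:

```python
# Vote-splitting under greedy (argmax) decoding, using the toy numbers from above.
token_probs = {"No": 0.40, "Yes": 0.35, "Certainly": 0.25}

# Greedy / temperature-0 selection picks the single most probable token.
greedy_choice = max(token_probs, key=token_probs.get)              # -> "No" (40%)

# Grouping tokens by the idea they express (done by hand here; the semantic
# interpreter derives this grouping from the latent-space step described below).
idea_probs = {
    "affirmative": token_probs["Yes"] + token_probs["Certainly"],  # 0.60
    "negative": token_probs["No"],                                 # 0.40
}
semantic_choice = max(idea_probs, key=idea_probs.get)              # -> "affirmative"

print("greedy picks:", greedy_choice)
print("semantic grouping picks:", semantic_choice)
```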

Semantic Temperature application works to resolve this by rewriting the temperature script to identify the “median” semantic intent behind what the model was communicating and boosting the probabilities of tokens that closely represent this intent as the temperature approaches 0. This can be done by using latent-space projections of prompts and outputs to perform PC1 reduction, assuming that any disagreement in the model’s outputs can be modelled as some form of polarity (like “positive” vs “negative”, “left wing” vs “right wing”, etc.). My approach utilises a smaller LLM to plot the possibilities in latent space since, in theory, it requires far less effort to identify whether two statements are similar or different than to actually generate them.
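A sketch of that latent-space step, assuming a small sentence-embedding model stands in for the “smaller LLM” and sklearn’s PCA for the PC1 reduction; the actual implementation in the repository linked below may differ.

```python
# Sketch: order candidate continuations along a single semantic axis (PC1).
# Assumes a small embedding model as the "smaller LLM"; the linked repository
# may implement this step differently.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

prompt = ("The final season of the long-running show deviated significantly "
          "from the source material. Fan reception was overwhelmingly")
candidates = ["positive", "negative", "mixed", "hostile", "enthusiastic"]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode([prompt + " " + c for c in candidates])

# Project onto the first principal component: one number per candidate,
# which (ideally) captures the positive-vs-negative polarity.
pc1 = PCA(n_components=1).fit_transform(vectors).ravel()

for candidate, x in sorted(zip(candidates, pc1), key=lambda t: t[1]):
    print(f"{x:+.3f}  {candidate}")
```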

 

Figure 1: A 3D projection of how the latent space of continuations to “The final season of the long-running show deviated significantly from the source material. Fan reception was overwhelmingly” gets polarised into “positive” and “negative” sentiments through PC1 reduction

Figure 2: The ordering of the PC1-reduced latent positions of continuations of “The final season of the long-running show deviated significantly from the source material. Fan reception was overwhelmingly” after being projected onto the PC1-reduced space

Note how in figure 2, PC1-reduction successfully identifies the polarity between an overwhelmingly positive fan-response and an overwhelmingly negative fan-response, and identifies that an overwhelmingly mixed fan-response would exist between these two continuations.

Once PC1 reduction is applied, we use the ordering of tokens along this PC1-reduced line and the initial output probabilities for each of these tokens to map each token to “bins” along a normal distribution. The bins are ordered according to the PC1-reduced latent positions, and the area under the normal distribution for each bin corresponds to the output probability the initial LLM assigned to that token.

Figure 3: Output tokens of an LLM continuation of “The final season of the long-running show deviated significantly from the source material. Fan reception was overwhelmingly” after being ordered using PC1 reduction and mapped to a standard normal distribution using the output probabilities given by the LLM.

Now that the placement of each bin has been chosen so that the bins are ordered according to some semantic polarity and spaced according to the model’s output probabilities, we can achieve the desired behaviour of temperature by simply locking in the positions of each bin and setting the standard deviation of the normal distribution to be equal to the temperature. When the temperature is set to 1, we simply have the standard normal distribution we calculated, so the output probabilities are exactly what the initial LLM provided, as required. As the standard deviation approaches 0, more of the area under the normal distribution is concentrated around the median semantic intent, until the temperature reaches 0 and 100% of the weighting is within the bin of the token that contains the absolute median intent, as required. And, as the standard deviation approaches infinity, the probability distribution approaches a uniform distribution over the domain of the binned tokens, so that more creative ideas are given more weight without simply giving equal weight to each token, allowing equal weighting of semantic ideas rather than individual tokens.
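A sketch of the binning arithmetic, assuming scipy and hypothetical token probabilities (the bin placement here is simplified): bin edges are placed on a standard normal so that each bin’s area at temperature 1 equals the token’s original probability, and lowering the temperature just shrinks the standard deviation.

```python
# Sketch of the bin-and-reweight step (hypothetical probabilities, simplified bin placement).
# Tokens are assumed already ordered along PC1; each bin's area under a standard normal
# at temperature 1 equals the LLM's original probability for that token. Lowering the
# temperature shrinks the standard deviation, concentrating mass on the median intent.
import numpy as np
from scipy.stats import norm

tokens = ["negative", "mixed", "positive", "enthusiastic"]   # PC1 order (assumed)
probs = np.array([0.40, 0.15, 0.30, 0.15])                   # original LLM probabilities (hypothetical)

# Bin edges at temperature 1: inverse-CDF of the cumulative probabilities (-inf ... +inf).
edges = norm.ppf(np.clip(np.concatenate(([0.0], np.cumsum(probs))), 0.0, 1.0))

def reweighted(temperature: float) -> np.ndarray:
    """Per-token probability when the normal's standard deviation equals the temperature."""
    return np.diff(norm.cdf(edges, scale=max(temperature, 1e-9)))

for t in (1.0, 0.5, 0.1):
    print(t, dict(zip(tokens, np.round(reweighted(t), 3))))
```

At temperature 1 this reproduces the original distribution; as the temperature falls, the mass collapses into the bin that straddles the median of the ordered distribution.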

Figure 4: A demonstration of how the probabilities of each token are adjusted as the temperature adjusts the standard deviation of the normal distribution defining their weights. 

As this approach depends only on the output probabilities of each token given by an LLM, this interpreter can be used today to apply token selection for any LLM in distribution, with the minor computational overhead of running a far, far smaller LLM than cutting-edge models to interpret the semantic meaning of the model’s outputs. The capability of the interpreter improves as look-ahead semantic sampling is applied, so that more deviation between outputs can be reviewed by the smaller LLM, but this comes at the significant overhead of having the larger model produce many outputs per token selected, so looking only 1-2 tokens ahead is recommended to maximise the interpreter’s effectiveness whilst minimising the computational cost. To try this approach on your own self-hosted LLM, the repository for this interpreter can be found here: https://github.com/brodie-eaton/Semantic-LLM-Interpreter/tree/main

Although effective, this approach is only a band-aid to apply to currently-deployed LLMs. Temperature should be applied at the model level rather than the interpreter level, so that the LLMs themselves can decide what their median intent is rather than depending on a smaller LLM and look-aheads to predict the median intent behind a model’s outputs. More research is necessary either to determine how to give temperature as an input to a neural network, or to allow for safe separation of inputs and outputs from the LLM so that it does not produce any output at all until it achieves an acceptable level of confidence, whilst still allowing us to review its thought process for alignment validation.



Discuss