An online forum and community dedicated to improving human reasoning and decision-making.

Rss preview of Blog of LessWrong

Time isn't real man

2025-05-21 00:25:10

Published on May 20, 2025 3:48 PM GMT

"There is nothing more deceptive than an obvious fact."
― Arthur Conan Doyle

Anyone who's taken LSD, pulled an all-nighter, been bored, or fallen into a YouTube rabbit hole knows that time is subjective. Brains aren't clocks. They cheat constantly: compressing sequences of events, deleting traumatic events, buffering sensory delays, and stretching or compressing time depending on how arousing the experience is.

Code Monks

Programmers enter multi-hour flow states without fatigue or hunger. Usually helped by autism headphones and chemical stimulants.

Bullet Time

When put in life-or-death situations, many report time slowing down. It's like the brain temporarily overclocks and hyper-saturates itself with sensory perception.

Teleportation

Notice how commuting to your job feels like a time skip? It's like mundane regular tasks simply don't exist to us.

Time Travel

When you travel, a few weeks turn into a whole-ass chapter of your life. A giant exotic memory palace. At the end you often feel like you're leaving an entire life behind.

Time Merchants

Video platforms optimize for smoothness and hyperstimulus. They make you forget where videos start and end.

Chronohacking

Time perception seems to me like fertile ground for exploits. It's not just fun, it feels necessary in a world where attention is farmed and sold. There's only so much we can do to have a longer life, so why not try to make our life denser?




AISN #55: Trump Administration Rescinds AI Diffusion Rule, Allows Chip Sales to Gulf States

2025-05-21 00:21:58

Published on May 20, 2025 4:21 PM GMT

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

In this edition: The Trump Administration rescinds the Biden-era AI diffusion rule and sells AI chips to the UAE and Saudi Arabia; Federal lawmakers propose legislation on AI whistleblowers, location verification for AI chips, and prohibiting states from regulating AI.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe to receive future versions.

The Center for AI Safety is also excited to announce the Summer session of our AI Safety, Ethics, and Society course, running from June 23 to September 14. The course, based on our recently published textbook, is open to participants from all disciplines and countries, and is designed to accommodate full-time work or study.

Applications for the Summer 2025 course are now open. The final application deadline is May 30th. Visit the course website to learn more and apply.


Trump Administration Rescinds AI Diffusion Rule, Allows Chip Sales to Gulf States

On May 12th, the Department of Commerce announced that it had rescinded the Framework for Artificial Intelligence Diffusion, which was set to take effect May 15th. The rule would have regulated the export of AI chips and models across three tiers of countries, each with its own set of restrictions. (Other AI chip export controls, including those prohibiting sales to China, remain on the books.)

The announcement states that the Bureau of Industry and Security (BIS) will issue a replacement rule in the future. In the meantime, the BIS will focus on working to prevent US chips from being used in Chinese AI development. Bloomberg reports that new restrictions will focus on countries that have diverted US chips to China, including Thailand and Malaysia.

The Trump Administration wants to capture the global AI chip market. Though China has yet to export its own AI chips, the BIS will also issue guidance stating that using Huawei Ascend chips violates US export controls. This preemptive restriction supports the Trump Administration’s intent for the US to dominate the global AI chip market.

The UAE and Saudi Arabia are set to receive hundreds of thousands of AI chips. Last week, Trump announced separate trade deals with the UAE and Saudi Arabia.

The UAE is set to receive up to 500,000 of Nvidia's most advanced chips per year, beginning in 2025. 100,000 of these would go to the Emirati firm G42, with the remainder going to U.S. companies building datacenters in the UAE. Following the deal’s announcement, G42 announced the construction of a five GW AI campus in Abu Dhabi—the largest AI infrastructure project anywhere in the world.

President Trump with the Emirati president, Sheikh Mohammed bin Zayed, at the AI campus’ unveiling. (Source.)

Nvidia announced a strategic partnership with Saudi Arabia’s new sovereign AI company, Humain. In the first phase of the partnership, Humain is set to receive 18,000 Blackwell chips. AMD also announced a partnership with Humain.

The chip sales affect several US priorities. The deals will direct large investments to US AI companies that might have otherwise gone to China (China is the leading source of revenue for both the UAE and Saudi Arabia). They will also allow US AI companies to circumvent compute capacity limitations imposed by the US’ energy grid.

Some US officials argue that the Trump Administration’s chip sales threaten to undermine the US’ lead in compute capacity, and consequently US national security, since compute capacity may soon become a key determinant of state power. However, it’s difficult to evaluate the sales’ overall effects on US interests, since the terms of the agreement are unclear.

Bills on Whistleblower Protections, Chip Location Verification, and State Preemption

A federal AI whistleblower protection act. Senate Judiciary Committee Chair Chuck Grassley introduced the Artificial Intelligence Whistleblower Protection Act, which would protect employees who come forward with information about harmful or illegal activities happening inside AI companies.

Current AI whistleblower protections aren’t effective. Currently, these sorts of laws only exist as a patchwork across jurisdictions, making it difficult for would-be AI whistleblowers to predict whether they would be protected. They also often only protect reporting violations of law. Because AI regulation is minimal, developer behavior that poses a threat to public safety may not violate any law.

AI companies can also require employees to sign NDAs preventing them from disparaging the company even after they leave. OpenAI had employees sign such an NDA, which it later discontinued after public pressure.

The AI Whistleblower Protection Act addresses these shortcomings. It covers disclosing any “substantial and specific” danger that AI developer behavior might pose to public safety, public health, or national security. It also prohibits AI companies from requiring employees to sign NDAs or other contracts that undermine their ability to make such disclosures.

A bill requiring location verification for AI chips. Senator Tom Cotton introduced the Chip Security Act, which would require location verification mechanisms for export-controlled AI chips.

The bill would strengthen US export controls by preventing AI chips from being smuggled into China. AI chip smuggling is a growing problem, with potentially 100,000 chips smuggled in 2024.

Currently, US officials struggle to determine what happens to AI chips once they’re shipped overseas. Location verification would allow export authorities to tell when a shipment of chips isn’t where it’s supposed to be, triggering further investigation.

A provision in a tax bill prohibiting states from regulating AI. The House Energy and Commerce Committee included a provision that would prohibit states from regulating AI in its markup of House Republicans’ tax bill.

Ever since California’s SB 1047 almost became law, AI companies have argued that states should be prohibited from regulating AI, and instead leave the problem to the federal government. SB 1047 would have made AI companies liable for harm caused by their models.

However, the provision seems to run afoul of the Senate’s “Byrd Rule,” which prohibits policy provisions from being included in budget reconciliation bills.

See also: CAIS’ X account, our paper on superintelligence strategy, our AI safety course, and AI Frontiers, a new platform for expert commentary and analysis.





US Govt Whistleblower guide (incomplete draft)

2025-05-20 23:34:37

Published on May 20, 2025 3:34 PM GMT

2025-05-20

WARNING

This guide is currently a research draft. DO NOT follow the guide as of today, if you are a whistleblower. I do not yet endorse following this guide as being sufficient to protect your safety. I am only sharing this to get research feedback. I will update you once I do think the guide is good enough.

Main

Why this guide?

  • My personal motivation here is especially focussed on whistleblowers working at an AI company in the US hoping to build superintelligent AI. The guide may also be useful for others.
  • Some of the whistleblower guides online are IMO blatantly against your true interests, and in the interests of journalists or lawyers. I wanted to write a guide that's in your true interests as a whistleblower.
  • A lot of cybersecurity guides online fail to acknowledge that the way to escape the NSA with a high success rate is not to improve your opsec, it's to flee the country. But the latter is less fun for them to geek out about. So I wanted to write about that.

Disclaimer

  • Geopolitical disclaimer
    • This guide is based entirely on publicly available information.
    • This guide is based on the geopolitical situation as of 2025-05. Some parts of this guide will no longer apply if the US govt were to enter a war (not via proxy) with the govts of Russia, China, Ecuador or any of the other countries mentioned in this document.
    • Case study: Manhattan project spies Julius and Ethel Rosenberg were executed. This guide assumes US citizen whistleblowers are unlikely to be executed, which is likely true during peacetime but may or may not be true during war.
  • Expertise disclaimer
    • This should go without saying, but I don't have credentials or expert-level knowledge on cybersecurity or international law or psychology or any other subject. I have above-average knowledge on cybersecurity and below-average knowledge on all the other subjects.
  • This is a "first pass" guide.
    • After reading this guide, you should also do your own independent research on any steps you're uncertain about.
    • Also you should eventually involve someone else to give you personalised advice for your specific situation. For legal topics, a good lawyer will almost certainly give better advice than what's in this document (assuming you trust them to not prioritise their self-interest above yours).

I broadly categorise whistleblowers into three categories:

  • corporate whistleblowers - possess incriminating information about a company
  • political whistleblowers - possess incriminating information about a political party or group
  • govt whistleblowers - possess incriminating information about intelligence agencies, military, judiciary, etc

Who?

  • This guide assumes you're a govt whistleblower living in the US with US citizenship.
  • This guide assumes you're a software developer (above-average IT skills) and have >$100k in funds.

Picking a strategy

As a whistleblower, you have a few available strategies:

  1. leak summary, stay anonymous in the US

  2. leak summary, become public in the US

  3. leak summary, become public outside of the US

  4. leak summary, stay anonymous outside of the US

  5. leak US classified information, stay anonymous in the US

  6. leak US classified information, become public in the US

  7. leak US classified information, become public outside of the US

  8. leak US classified information, stay anonymous outside of the US

A summary would include a broad overview of the situation as you see it, but not a lot of details or any classified documents or recordings.

As per my reading of all the past NSA whistleblower cases: 2 is slightly better than 1, 3 and 4; 7 is far better than 5 and 6; and 8 is unlikely. Therefore this document is actually a guide for 7. (Leak US classified information, become public outside of the US)

2 is slightly better than 1, 3 and 4.

  • In almost all cases of 2 and 3, people were fired, had their houses raided and their social circles investigated, and spent funds on a legal case. In almost none of these cases did people spend time in prison.
  • Being public is slightly better as you can then control the media narrative of your story.
  • Maybe if you lack funds for a legal case or finding another job, then staying anonymous is an option. Consider not becoming a whistleblower if you don't have at least 6 months of savings.
    • (I haven't researched this enough.)
  • Leaving the country doesn't benefit you much.

7 is better than 5 and 6

  • In almost all cases of 5 and 6, people spent at least 3 years in prison. How good their opsec was did not matter, although some had much better opsec than others. IMO do not trust all the opsec guides on the internet that suggest otherwise.
    • The sysadmins working for the NSA leadership internally track who downloads a document from a central database. If you make a document public, the NSA now has a small list of suspects who have downloaded that particular document from their databases in the last few months.
  • Case studies
    • The only recent case of someone avoiding prison while leaking US classified information is Edward Snowden, who got asylum from Russia. Leaving the US geopolitical sphere of influence is by far your best bet if you look at historical data.

8 seems very unlikely

  • If you've already obtained asylum in a foreign country, that govt is willing to protect you from the US govt. The main reasons I can think of to continue wanting to spend your life in anonymity from the US govt in spite of this are:
    • possible illegal assassination by US
      • this is a real threat, for instance Assange expressed concern that Snowden could be assassinated if he lived in asylum in a Latin American country
    • want to be deployed in future as a spy on behalf of this country
      • this seems unusual and also not something I will currently be providing a guide for
  • Case studies
    • There is no public case in the last 25 years of this situation happening.
  • For 8 to occur you need to figure out a way to steal classified documents anonymously, apply for asylum anonymously, live many years anonymously in that country, and escape US govt-controlled agents or militias in that country trying to assassinate you. Probability theory says the conjunction of this many unlikely events is very unlikely.
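
To make the conjunction argument concrete, here is a minimal sketch of the product rule. The four probabilities are hypothetical placeholders chosen only for illustration (they are not estimates from this guide), and the calculation assumes the events are independent, which may not hold in practice.

```python
from math import prod

# Hypothetical, illustrative probabilities only; none of these are
# estimates from the guide. Independence between events is assumed.
step_probabilities = {
    "obtain documents anonymously": 0.5,
    "apply for asylum anonymously": 0.2,
    "live anonymously for many years": 0.2,
    "avoid hostile agents": 0.5,
}


def joint_probability(probabilities):
    """Probability that all independent events occur (product rule)."""
    return prod(probabilities)


overall = joint_probability(step_probabilities.values())
print(f"Joint probability: {overall:.3f}")
```

Even when each individual event is given fairly generous odds, the joint probability ends up far smaller than any single factor, which is the point being made about 8.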

Case studies

  • List of people who may not have been imprisoned if they had attempted to fly out of their country on the same day they leaked the documents.
    • (This is complete speculation on my part. Someone with more knowledge please correct me.)
    • Chelsea Manning
      • Chelsea Manning was living on a US military base in Iraq. Between 2010-05-21 and 2010-05-25 she talked to Adrian Lamo, who tipped off US authorities. She was arrested on 2010-05-27.
      • Hypothetically, someone with more foresight could have waited till they had a valid flight ticket and a plausible excuse for it, smuggled an SD card on their person until then, and leaked the documents on the same day as the flight.
    • Reality Winner
      • Reality Winner was living in the US. She sent leaked documents to the Intercept on 2017-05-09. The Intercept contacted the NSA for document verification (blunder on the Intercept's part) on 2017-05-30. She was arrested on 2017-06-03.
      • Hypothetically, someone with more foresight could have gotten approved travel first and then leaked the documents on the same day they flew. Since they were not identified as the source, it is quite likely their travel outside the US would have been approved.

How to leak classified documents and leave the US

The more complex a plan you're following, the more planning is required. If you are willing to follow a more complex plan, you can ensure a slightly larger window of time for yourself, from when you steal the documents to when you are deanonymised.

Objective: leak US classified documents, leave the US, reveal identity to public outside the US, neither you nor anyone in your circle gets imprisoned

Case studies of worst case outcome

  • None of the recent NSA whistleblower cases faced the death penalty. Your worst case scenario that's not unrealistic is a prison sentence between 5 years and life imprisonment.
  • Julius and Ethel Rosenberg were executed as Soviet spies in the Manhattan Project during the Cold War. There is a possibility you may face the death penalty if caught during wartime. As mentioned already, this guide assumes a war is not ongoing.

Minimum not-anonymous plan

Here's the minimum plan you should execute. The order of steps in this plan is deliberate.

  • walk out of the building with classified documents or recordings
  • redact documents yourself (don't trust a journalist)
  • send snail mail to >100 journalists a few hours before the flight
    • (Minimum plan skips trying to encrypt or anonymise the messages, the good plan below does both.)
  • fly to country A
  • contact lawyers and retain at least one
  • walk into country B embassy in country A and apply for asylum

Consequences

  • Above plan is easier to execute so I'm guessing there's a >80% probability you'll escape the US without being arrested.
  • Depending on your skills my guess is there's a >50% probability your identity is made public within one month after you send the documents to journalists.

Good anonymous plan

Here's a more advanced plan that might get you a few more months of anonymity. The order of steps in this plan is deliberate.

  • walk out of the building with classified documents or recordings
  • on airgapped computer:
    • redact documents yourself (don't trust a journalist)
    • encrypt the documents twice, first with your symmetric key and second with various journalists' pgp public keys
    • copy the encrypted documents to printed paper (or else USB)
  • set up 10-100 anonymous dead drops in the US
  • wipe your house clean
  • fly to country A
  • send anonymous messages to 10-100 journalists revealing dead drop locations and decryption keys (use a secure channel with pgp+airgap on both sides to communicate)
  • contact lawyers and retain at least one
  • walk into country B embassy in country A and apply for asylum

Consequences

  • Above plan requires you execute more steps correctly before you leave the US. Your probability of being caught and imprisoned before you leave the US increases.
  • The benefit is you might remain anonymous for a few months maximum before the NSA identifies you, because they need more time to narrow down the pool of all suspects who downloaded that specific document. Depending on your skills my guess is there's still a >20% chance your identity is made public within one month of you sending the documents to journalists.

Case studies

  • Snowden did not execute either of these plans. He took the documents with him to Hong Kong on a (presumably encrypted) disk drive. He met journalists in-person.
    • US border and airport security has increased in subsequent years and they're a lot more likely to confiscate your electronics or demand decryption keys. (Snowden leak may or may not have influenced this.) I currently do not recommend carrying encrypted SD cards with you on your flight leaving the US. I recommend leaving encrypted information back in the US and either carrying decryption keys on paper or memorising the decryption keys.
    • I do not recommend trusting any journalists with your identity if you have the time and energy and skill to redact documents yourself. You can always involve journalists in the plan after your identity is public.

Social circle guide

Specific advice for you

  • Spend time to take a clear decision on who is in-the-loop on your plan to whistleblow. See below for my recommendations.
  • If you have sufficient funds, leave some funds behind for your family members in the US. They will need this for legal expenses, until you have revealed your identity and can direct more funding towards them.
  • Read mental health resources as required. Do not contact a mental health practitioner or friend or any other person for mental health reasons, as mentioned below.

Grand strategy - flow of information

  • You have broadly 3 options if you want to inform anyone in your social circle in the US about your plan, before you have gotten permanent asylum in another country. This includes immediate family members such as your spouse or children.
  1. Don't inform anyone living in the US geopolitical sphere. Let them stay behind and get investigated.
  2. Inform someone living in the US geopolitical sphere. Let them stay behind and get investigated.
  3. Inform someone living in the US geopolitical sphere. Ask them to fly with you and apply for asylum together.
  • Whether to execute 1

    • If you only want to maximise success probability of whistleblowing while ensuring you and your social circle avoid being imprisoned, I would strongly recommend executing 1.
    • Important: There is a possibility you will lose people including immediate family members who will not be ready to continue the relationship when they understand the decisions and risks you've taken. This is a risk you must be ready to accept if you go ahead with this plan.
  • Whether to execute 2

    • If you feel you have an inflexible moral obligation to inform someone in your social circle about your plan, I currently strongly recommend 3 over 2.
    • If you execute 2, there is a significant likelihood that they will be imprisoned as an accessory to your crime, for not reporting you immediately.
    • There is a low likelihood they will be able to successfully deny that they were informed, if they were in fact informed.
  • Whether to execute 3

    • Executing 3 also reduces your success probability significantly, compared to executing 1.
    • This person will also need to learn considerable opsec skills just like you.
    • They will need to psychologically adjust to complete social isolation and constant fear of imprisonment, just like you. A single word spoken by them to the wrong person can lead to both of you being imprisoned.
    • Since you are the person who took the initial decision to whistleblow, it is possible you are psychologically more ready for these adjustments than they are.
    • You likely have a good understanding of the psychological makeup of the person you are considering involving in your plan.
    • However, you are also an emotionally biased judge of how much your success probability drops by involving them in the plan. You might want to do your own research and attempt writing down an actual number for how much you estimate your success probability to drop if you involve them. My naive guess is success probability will drop by at least 40% for a majority of people, if they were to involve their spouse in their plan.
  • Case studies

    • There is no recent example of someone doing 2 or 3 successfully. Snowden is the only recent example of doing 1 successfully.
  • I would strongly recommend distancing from your social circle over a period of multiple months, and not informing them about your plan. It is important not to raise any suspicion, as doing so could lead to you and them being imprisoned.

  • Apart from your lawyer, I do not recommend keeping anyone in-the-loop by default until you have obtained asylum.

    • I do not recommend contacting a lawyer until you have left the US geopolitical sphere. This document and similar resources already contain most of the generic advice you require. It is unlikely they will have a lot of situation-specific advice you cannot get otherwise. Hence it may not be worth risking informing them.
    • Once you have left the US geopolitical sphere, you can contact your lawyer.
    • Your lawyer may be able to provide better advice on what information is appropriate to share with other people in your circle including your family members.
  • Psychiatrists, extended family, work colleagues, journalists and so on will be part of the people investigated. I do not recommend informing any of these people.

  • Until your identity is safe to publicly reveal (and likely even after that), you cannot support anyone in your social circle living in the US sphere in any way. It could take many months before you can support them.

    • You (and any orgs supporting you) will not be able to financially support their legal expenses.
    • You will not be able to inform them of any of the upcoming consequences in their life, or inform them of any strategy to protect themselves.
    • You will not be able to emotionally support them.

Consequences for social circle living in the US geopolitical sphere (who are not informed about your plan)

  • When I say social circle this includes but is not limited to immediate family, extended family, neighbours, work colleagues, etc. Immediate family members are likely to face more severe consequences.
  • Consequences
    • Your immediate family members living in the US geopolitical sphere are likely to be interrogated, wiretapped and have their houses raided.
    • They are likely to be intimidated and followed.
    • They are likely to be cut off by members of their social circles who would like to avoid being caught in the investigation.
    • They are likely to receive hate mail.
    • They will almost certainly not be imprisoned in the US. The investigation will almost certainly successfully prove their innocence at the end.
  • Case studies for consequences
    • Almost every single US govt whistleblower leaking classified documents has family members who faced every single one of the above listed consequences.
    • I have not yet found a case of a US govt whistleblower leaking classified documents whose family member (or other person in social circle) went to prison if they were not informed about the plan to whistleblow.

Family visits

  • Exact details here depend on the country you go to. Ask your lawyer for advice.
  • If you go to prison in US/Europe, you will likely be allowed family visits in prison.
  • If you get asylum in Russia, your family will likely be allowed to leave the US and enter Russia.
    • Once in Russia, their safe stay and potential return will depend on decisions of the Russian govt.
  • Case studies for family visits
    • Snowden's wife (then girlfriend) was allowed to leave the US 13 months after he left. As of 2025, Snowden's wife lives with him in Russia along with their son (born in Russia).
    • I could not find any documented case of family members or friends being indefinitely held in the US and prevented from travel out of the US.
    • Chelsea Manning, Reality Winner and Julian Assange were allowed in-person family visits in prison in the US and the UK respectively.
    • See the whistleblower database document for more detailed case studies.

Mental health

  • I do not recommend informing a mental health practitioner, spiritual/religious mentor or friend about your plan for mental health reasons.
    • They will likely be investigated using illegal tactics, and will not be able to protect your information.
    • Any such existing person in your life should be slowly distanced from, as mentioned above.
    • Case study: Daniel Ellsberg's psychiatrist's office was broken into, as part of a (failed) strategy to prove him mentally unstable in court.
  • I recommend reading mental health resources yourself using your secure tails-based setup, if you require it.

Mental health resources

  • Secret life of secrets by Michael Slepian
    • Consider whether you believe you are morally correct or incorrect in keeping this secret.
    • Consider whether your actions in the long term (not short term) benefit or harm people you care about.

Pre-planning security guide

Even before you actually start planning, you should consider setting up a relatively secure machine to do research and planning.

  • You may or may not yet have made up your mind on whether to whistleblow or not.
  • Setting up a secure machine running tails does not by itself incriminate you. You are allowed to change your mind afterwards and decide not to whistleblow.
    • As of 2025, I have not been able to find any case with public evidence of a person becoming a person of interest due to search results alone. Usually search results are used to investigate only after you are already a person of interest.
    • Increase in NSA surveillance activity and AI capabilities could change this in future however. It is, in theory, possible that you are already a person of interest for having read my guide on clearnet.
    • I'm unsure about giving concrete advice on this point.
  • If you do finally decide to whistleblow, your search results starting from today could reduce the time required by the NSA to doxx you.

How to set up a secure machine for planning and research

  • Purchase a new windows/linux machine.

    • Physically disconnect the wires to the mic, camera, wireless adapter. Open the case and use a plier.
    • Only connect ethernet cable to router. No wireless signal.
  • Purchase a USB drive and install tails on it.

  • Reserve a separate room in your house where this machine is kept.

  • No mobile phone or other device allowed in this room.

    • Even better, physically dismantle and destroy your phone, if you can manage to still keep your job and manage your life without a phone.
  • No other person allowed in this room.

  • Important: Resist the urge to search whistleblower related info on your other devices or on home or corporate network in clearnet.

"Walk out of the building" guide

  • Do not raise any complaints whatsoever via internal channels.

    • This is possibly the single biggest blunder made by all the historical NSA whistleblowers.
    • This guarantees you'll be on a suspect list even before leaving the US. Your networks may be monitored by a person. You may be followed. You may be prevented from travelling. You may be (unofficially) interrogated.
    • This almost guarantees you'll be doxxed within the first month of leaving the US
      • The intersection of people who downloaded a specific document and people who raised complaints via internal channels uniquely points at you.
    • Case studies:
      • Thomas Drake raised complaints via internal channels and this may have shortened the time interval to him being doxxed. Complaints via internal channels did not have major impact on NSA afaik.
      • Thomas Tamm raised complaints via internal channels and this may have shortened the time interval to him becoming a person of interest, with his house raided and communications tapped. Complaints did not have major impact on NSA afaik.
      • John Crane sued claiming internal channels of NSA are being used to hunt down whistleblowers. Fired as a result. Complaints and legal cases did not have major impact on NSA afaik.
      • Russ Tice claims NSA uses internal channels to hunt down whistleblowers.
      • William Binney raised complaints via internal channels and publicly. Complaints did not have major impact on NSA afaik.
      • James Robertson, a judge on the US FISA court post-9/11 made complaints via internal channels, and resigned. Complaints did not have major impact on NSA or the FISA court system afaik.
  • Select which documents to release.

    • More documents means more work.
      • More manual work to redact them, including higher probability of human error.
      • Might become less feasible to print on paper, mandating disk storage.
      • If there's a large amount of video data, any copying or file processing you wish to do could take multiple hours.
    • You can probably make the key political arguments you wish to make by leaking 100 documents rather than 1 million.
    • Maybe put the most important documents in a folder to save the journalist some time while analysing the archive.
    • The public could benefit from more documents leaked rather than less, in a way that you can't immediately foresee. It is hard for you to pre-emptively guess every possible usage of your leaked documents from now till a century from now.
      • If you are willing to undertake personal risks involved in redacting and transferring a larger dataset, prefer releasing a larger dataset over a smaller one.
  • Plausible deniability

    • Increasing size of initial pool of suspects could give you more lead time before your identity is found out by the NSA.
    • Make sure to access a large number of documents from the database, not just the ones you will be publishing.
    • Make sure no camera or person is watching you while you access the documents or copy them.
  • Hardware

    • A standard SD card will do, as it is easy to smuggle. A USB drive or hard disk drive will also do.
    • If you are releasing a lot of video content you may need to purchase multiple disks. As of 2025, it is not common to get disks larger than 12 TB. Prefer purchasing a disk with high throughput like 1 GB/s instead of 100 MB/s.
    • As always, make sure to have a plausible alternate reason for any hardware purchases you make under real name or credit card.

Video recording guide

You can consider hiding a camera and mic on you to record important interactions you see. This could include leadership admitting to any actions or values they would not admit publicly.

Should you make recordings yourself?

  • I strongly recommend you only make recordings yourself if there is no existing documented evidence that would be straightforward to obtain. In most situations, there will exist documented evidence that is available to at least a few people.
  • Cons
    • Remember that you do not have a second person to help you, and you do not have prior experience with undercover work.
    • Attempting to make recordings noticeably reduces the success probability of leaking information without being imprisoned.
  • Pros
    • Video recordings of the leadership making unpopular decisions might be more influential in subsequent politics than leaked documents of the leadership making the same decisions. The average citizen has a limited attention span and hence prefers video over text.

Video recording guide

  • Prepare to be doxxed
    • Audio and video are harder to redact. If you are making recordings you should be aware of this.
    • You may or may not be a participant in the conversations being recorded. If you are a participant in the conversation, you may be doxxed soon after the recordings are released publicly.
  • Getting access to people
    • It may take multiple months of effort before you obtain access to such people and discussions. Every additional month of time taken decreases the success probability of the plan, as there is more potential to make a mistake.
    • Until you have made the recordings, there is some probability you will avoid being imprisoned even if caught. (I don't have legal background to comment on the exact boundary here.) You can spend a fair amount of time preparing to make the recording, before actually making it.
    • You may find it difficult to mentally adjust to your new role, your motivation levels may reduce and your work performance may be affected. All this is detectable by people working with you.
    • You will likely need to take a significant amount of self-initiative to get access to people.
    • It is best to continue playing your old role at the company, and invent plausible reasons why a person in your role would need access to the leadership or to a specific meeting. For instance you may want to volunteer for additional responsibilities or attempt getting promoted.
    • Don't carry recording hardware with you while you're still in the process of increasing your access.
    • Don't possess any incriminating evidence on you while you're still in the process of increasing your access. Assume you can be caught and interrogated at any moment.
  • Hardware
    • Warning: Your building may contain scanners, including metal detectors, X-ray scanners, millimetre-wave scanners and infrared scanners.
      • Security may be able to detect equipment you carry even while switched off.
      • Assume the security team has read this guide and installed scanners in all buildings, unless you have evidence otherwise.
    • Choice of hardware
      • You should have a plausible excuse for why the hardware is on you, if interrogated by security. You should have a plausible excuse for why you purchased it, as all purchases are recorded and can be questioned.
      • It might therefore be better to use a regular phone or mic to do the recording, instead of specialised "spy" hardware.
        • Spy hardware typically is designed to be hidden (in your shirt, in the wall etc) with an almost invisible opening.
        • Spy hardware may also have pickup and sensitivity higher than regular use.
      • Use simple hardware without too many pieces or cables.
      • I'm not listing models of specific mics or cameras here. Anything I list here might be automatically suspicious. Do your own research. Purchase high-quality hardware that won't look suspicious. The exact choice is not very important, multiple brands sell similar enough hardware.
    • Using mics
      • Distance of mic from the speaker is by far the most important variable to ensure good pickup.
      • Mics are best carried on person to ensure better pickup. Place them at 30 degree angle or pointed at other person's mouth to maximise pickup.
      • Mics may have digital or analog filters to remove low-decibel noises. Make sure these are either absent or disabled.
      • Mics may accumulate lint or dust, clean as required.
      • Ensure less ambient noise in the area.
      • Parabolic mics are designed for surveillance and can pick up sound at >1 metre distance.
      • Laying blankets or padding behind the talker can help, for recordings taken at >1 metre distance.
    • Using cameras
      • Room lighting is the most important variable for good quality video.
      • Ensure sufficient ambient lighting in the recording. Ensure there's no window creating glare.
      • Assuming good lighting, most camera models perform well. Pick any camera with good resolution.
      • Cameras may also accumulate dust or lens scratches, clean as required.
      • Cameras can be carried on person or placed in the room.
      • Camera placement is important. Minimise blind spots in the room not captured by the camera.
  • Testing - very important
    • Attempt multiple test runs of your recording equipment by secretly recording lower-stakes conversations.

Resources

  • The Sting Book (1994) by Steven Frazier
    • (Don't purchase a hard copy using your real address or credit card. Get a soft copy using your tails setup)
    • Many sections of the book are irrelevant and some are outdated, but some sections are useful.
    • Informed by real empirical data but in a different context.

Redaction guide

Time investment

  • If you can do the redactions yourself within 1 month, I would strongly recommend doing it yourself instead of trusting a journalist to do it. (Exceptions exist.)

Generic information

  • You need to do redaction (removal of private information) and metadata removal. Always do metadata removal first, redaction second.
  • Metadata
    • Almost every file format leaks some metadata or other. Could be docx, jpeg, pdf, html, whatever.
    • Possible metadata in input files
      • Useful metadata - for users of the application, but may end up doxxing you
      • Junk metadata - that no one notices or uses, but may end up doxxing you
      • Steganographic - deliberately inserted by organisation to doxx whistleblowers like you
  • Redaction
    • There may be information in the documents that is incriminating either to you or to third-parties. You might prefer to remove this information for either moral or strategic reasons.

Text

  • Suggested programs: vim, nano, emacs
    • Also likely safe: Gedit (linux default), Notepad (windows default), TextEdit (mac default), Sublime Text
  • Step 0: File format conversion
    • Output format should be plaintext.
    • Use ASCII not UTF-8.
  • Step 1: Metadata removal
    • If you suspect your file contains steganographic metadata, consider not submitting the file, and rewriting its contents in your words. If not, continue.
    • Remove all non-printable characters.
  • Step 2: Redaction
    • Replace with [REDACTED]
    • Don't use different placeholders for different types of information. Use same placeholder across all content.
    • Use common sense when redacting documents. Redacted content is sometimes easy to guess based on surrounding content.
  • Step 3: Final checks
    • Generate hexdump and verify no extra characters present. All chars must be within accepted byte range.
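
The final check in Step 3 can be sketched in a few lines of Python. This is a minimal illustration, not a vetted tool: it assumes the "accepted byte range" means printable ASCII (0x20 to 0x7E) plus the newline character, and the function name `find_unexpected_bytes` is hypothetical.

```python
# Assumption: the accepted byte range is printable ASCII (0x20-0x7E) plus newline.
ALLOWED = set(range(0x20, 0x7F)) | {0x0A}

def find_unexpected_bytes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, byte value) for every byte outside the allowed set."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]

# Example input: a line that looks clean on screen but hides a zero-width space.
sample = "name: [REDACTED]\u200b\n".encode("utf-8")

for offset, value in find_unexpected_bytes(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

An empty result means every byte is in the allowed set. It says nothing about identifying patterns that use only allowed characters, such as unusual spacing or word choice.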

Images

  • Suggested programs: GIMP, hexdump, Tails metadata cleaner
  • Step 0: File format conversion
    • Output format should be bitmap (.BMP), with low resolution.
    • GIMP supports converting many image formats to BMP.
    • poppler-utils on linux can convert pdf to image, which can then be converted to BMP.
    • Linux screenshots are a sufficiently safe way to convert any file to image, which can then be converted to BMP.
  • Step 1: Metadata removal
    • If you suspect your file contains steganographic metadata, consider not submitting the file, and rewriting its contents in your words. If not, continue.
    • Remove any image metadata such as EXIF/XMP data. (EXIF can include location data, which is important to remove.)
    • Image resolution
      • Camera lenses may have scratches that uniquely identify the camera used. These are hard-to-detect with naked eye in an image, but can be seen using graphics processing tools. Lowering resolution helps prevent this analysis.
      • Lower the resolution down to the minimum at which documents are still readable.
      • Keep this minimum resolution constant across all your images.
  • Step 2: Redaction
    • Guide to making black boxes
      • Use GIMP to make black boxes.
      • Use black fill with opacity 100%.
      • Click Image -> Flatten Image. (This is optional, layers get flattened by default if you use .BMP as save format)
      • Save as .BMP. Uncheck saving EXIF/XMP data.
    • Use the same fill across all redacted content.
    • Use common sense when redacting documents. Redacted content is sometimes easy to guess based on surrounding content.
  • Step 3: Final checks
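    • In its simplest form, a .BMP file has a 54-byte header (a 14-byte file header plus a 40-byte info header) followed by uncompressed RGB values, with no encoding and no metadata. Each pixel row is padded to a multiple of 4 bytes, and some programs write larger header variants, so read the header size from the file rather than assuming 54.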
    • Generate hexdump and check that bytecount matches expected bytecount based on header size and number of pixels.
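
The byte-count comparison in Step 3 can be sketched as follows. This is a minimal illustration that assumes an uncompressed BMP with the standard little-endian header layout; the function names are hypothetical. Note that each pixel row is padded to a multiple of 4 bytes, so the expected size is not simply header plus width × height × 3.

```python
import struct

def padded_row_bytes(width: int, bits_per_pixel: int) -> int:
    """Bytes per stored pixel row, rounded up to a multiple of 4."""
    return (width * bits_per_pixel + 31) // 32 * 4

def bmp_size_report(data: bytes) -> dict:
    """Compare the actual length of an uncompressed BMP with the expected length."""
    if data[:2] != b"BM":
        raise ValueError("not a BMP file")
    # Offset 10: start of pixel data; 14: info header size; 18: width; 22: height.
    pixel_offset, dib_size, width, height = struct.unpack_from("<IIii", data, 10)
    (bits_per_pixel,) = struct.unpack_from("<H", data, 28)
    expected = pixel_offset + padded_row_bytes(width, bits_per_pixel) * abs(height)
    return {
        "header_bytes": pixel_offset,   # 54 for the simplest header variant
        "dib_header_bytes": dib_size,   # 40 for the simplest header variant
        "expected_bytes": expected,
        "actual_bytes": len(data),
        "extra_bytes": len(data) - expected,
    }
```

A non-zero `extra_bytes`, or a `header_bytes` value larger than 54, means the file contains more than a minimal header and pixel data and deserves a closer look in the hexdump.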

Audio

  • As of 2025-05, my recommendation is to not do audio redaction unless you have previous expertise. I will update the guide once I have more experience with audio redaction.
  • Your audio files may also require mixing and editing to improve quality. Trust that someone else will do it once the files are released publicly.
  • If you are submitting audio
    • Use tails metadata cleaner.
    • Accept that there is increased risk of the file containing information about both - the people recorded in the audio, and any devices and people that transmitted the file including you.
    • Methods can depend on encoding method used. You may want to delete entire frames or delete / voice-change voices within some frames.

Video

  • As of 2025-05, my recommendation is to not do video redaction unless you have previous expertise. I will update the guide once I have more experience with video redaction.
  • Your video files may require editing to improve quality. Trust that someone else will do it once the files are released publicly.
  • If you are submitting video
    • Use tails metadata cleaner.
    • Accept that there is increased risk of the file containing information about both - the people recorded in the video, and any devices and people that transmitted the file including you.
    • Methods can depend on encoding method used. For visual, you may want to delete entire frames or black fill polygonal regions within some frames. For audio, see above.
  • Otherwise, don't submit video. Consider extracting frames
    • Extract a few important frames as .BMP. Then follow the image guide above.
    • Programs: ffmpeg: ffmpeg -i video.VOB -vsync 0 -ss 01:30 -to 01:40 %06d.bmp

Airgap and encryption guide

To do for self

  • Write this

Anonymous dead drop guide

To do for self

  • Write this

House cleaning guide

To do for self

  • Write this

List of recommended countries

Travel paths

  • My naive guess is your best bet is going to the Ecuador embassy in Moscow, Russia and requesting asylum in-person, not anonymously. You are likely to be imprisoned only if both Ecuador govt and Russian govt reject your requests.

  • My naive guess is that contacting consuls requesting asylum before reaching foreign soil is a bad idea. You should first fly out, before contacting your lawyer and making asylum requests.

  • If you are flying from US to Russia, my naive guess is it makes more sense to fly to a third country that has flights from both Russia and US. This will raise less suspicion when you apply for visa or book tickets. This is especially important if you hold a security clearance. Once you are in the third country, you can book the second ticket.

  • Ask your lawyer for advice, after you have landed in Russia.

  • (It is possible to ask for more information from a lawyer before you leave the US. There is a low probability your lawyer will get you imprisoned. There's also a low probability they offer you useful personalised information which you cannot otherwise obtain from guides such as this one. Ideally someone who is not affiliated with you, such as me, should be getting this information and publishing it publicly in a guide.)

  • Case study: Snowden was approved for asylum in Russia, after being rejected by 20+ countries including countries in Latin America, countries in Europe, India and China. Later he was granted Russian citizenship.

  • Case study: Assange was approved for asylum in Ecuador embassy in UK. His room was bugged by cameras and eventually Ecuador govt revoked their approval, letting UK govt arrest him.

Along with the leaked documents, you can request govts to unilaterally promise asylum to the whistleblower without knowing the identity of the whistleblower.

  • You can also include a pgp public key in your leaked documents. This allows a government to start an encrypted conversation with your lawyer without knowing the identity of either you or your lawyer.
  • My guess is there's <20% probability this will work. My guess is all countries have a strong enough negotiating position to demand you reveal your identity and reach their soil, before they take a decision on your asylum application.
  • No govt in the world cares to keep you alive on humanitarian grounds but they may do it if it helps strengthen their reputation as a country that can defy the US.

Funding guide

You need funding for the following:

  • Secure computers, while in the US
  • Legal expenses
  • Living in foreign country while you apply for asylum
  • Living in foreign country after you have asylum, but not an alternate job
  • Emergency funds, such as funds for emergency flights, purchasing new computers

Legal expenses are by far your biggest expense, as per case studies.

Being cut off from your own funds is a potential risk.

  • Soon after the first meeting with your lawyer outside the US, they should help you contact organisations that can fund expenses on your behalf.
    • It is ideal if someone else is making the payments on your behalf, as your cash on hand can be seized and accounts can be frozen. This is high probability.
    • Case study: Assange paid for Snowden's stay in Hong Kong and onward flight tickets.
    • Case study: Assange received payments via cryptocurrency after all bank accounts were frozen.

Cryptocurrency

  • Having an organisation willing to send you funds is better than managing your own funds. Once you have left the US, there should ideally be at least one such organisation willing to support you and bring cash to you in-person as required.
  • Using cryptocurrency or cash by yourself is only a worst case option if you and your lawyer are unable to find an organisation willing to fund you.
    • If you have never purchased cryptocurrency before, I would strongly recommend not learning to do it now.
      • Having cryptocurrency with you instead of cash will only help you in a small number of scenarios. It is not worth significantly increasing the risk of being caught, and if you have not done this before, your risk of being caught before leaving the US goes up significantly.
    • If you already have BTC or XMR with you in a cold wallet, or if you have previously purchased BTC or XMR using cash via no-KYC methods, you can consider bringing some with you on your journey outside the US.
    • Carry it in a brain wallet (i.e. memorise it) or paper wallet. Don't leave behind any hardware wallets or hot wallets. (Any electronic devices left behind will go through data recovery, and could reveal your crypto addresses for example.)
    • Don't carry more than $1M as this makes you a target for theft.
    • Some countries are far more friendly than others, when finding counterparties as a stranger. Consider checking this for your destination country before leaving the US.
    • Case studies: I did not find any historical example of a whistleblower who left the US, failed to find an organisation that would fund them, and ended up relying on their own cryptocurrency assets. This entire section of the guide is dealing with a low-probability speculative scenario.

List of recommended lawyers

To do for self

  • write this section

List of funding sources

To do for self

  • write this section

List of high-attention media operators (such as journalists and youtubers)

To do for self

  • write this section

As mentioned multiple times in the document, prefer making a plan that does not involve trusting any journalists.

List of somewhat trustworthy high-attention media operators (with securedrop/signal/email IDs)

  • ?

Exhaustive list of high-attention media operators (with securedrop/signal/email IDs)

  • ?



If one surviving civilization can rescue others, shouldn't civilizations randomize?

2025-05-20 23:26:38

Published on May 20, 2025 3:26 PM GMT

In the comments section of You can, in fact, bamboozle an unaligned AI into sparing your life, both supporters and critics of the idea seemed to agree on two assumptions:

  • Surviving civilizations have some hope of rescuing civilizations killed by misaligned AI, but they disagree on the best method of rescuing.
  • The big worry is that there are almost 0 surviving civilizations, because if we're unlucky, all civilizations will die the same way.

What if to ensure at least some civilizations survive (and hopefully rescue others), each civilization should pick a random strategy?

Maybe if every civilization follows a random strategy, they increase the chance of surviving the singularity, and also increase the chance that the average sentient life in all of existence is happy rather than miserable. It reduces logical risk.

History already is random, but perhaps we could further randomize the strategy we pick.

For example, if the random number generated using Dawson et al's method (after July 1st 00:00UTC, using pulsar PSR J0953+0755 or the first publicly available pulsar data) is greater than the 95th percentile, we could all randomly choose MIRI's extremely pessimistic strategy, and do whatever Eliezer Yudkowsky and Nate Soares suggest with less arguing and more urgency. If they tell you that your AI lab, working on both capabilities and alignment, is a net negative, then you quit and work on something else. If you are more reluctant to do so, you might insist on the 99th percentile instead.

Does this make sense or am I going insane again?

Total utilitarianism objections

If you are a total utilitarian, and don't care about how happy the average life is, and only care about the total number of happy lives, then you might say this is a bad idea, since it increases the chance at least some civilizations survive, but reduces the total expected number of happy lives.

However, it also reduces the total expected number of miserable lives. Because if 0 civilizations survive, the number of miserable lives may be huge due to misaligned AI simulating all possible histories. If only a few civilizations survive, they may trade with these misaligned AIs (causally or acausally) to greatly reduce suffering, since the misaligned AIs only gain a tiny bit by causing astronomical suffering. They only lose a tiny bit of accuracy if they decrease the suffering by 2x.

This idea is only morally bad if you are both a total utilitarian, and only care about happiness (not worrying about suffering). But really, we should have moral uncertainty and value more than one philosophy (total utilitarianism, average utilitarianism, etc.).




President of European Commission expects human-level AI by 2026

2025-05-20 22:13:30

Published on May 20, 2025 2:13 PM GMT

On May 20, during her speech at the Annual EU Budget Conference 2025, Ursula von der Leyen, President of the European Commission, stated:

When the current budget was negotiated, we thought AI would only approach human reasoning around 2050. Now we expect this to happen already next year. It is simply impossible to determine today where innovation will lead us by the end of the next budgetary cycle. Our budget of tomorrow will need to respond fast.

This is remarkable coming from the highest-ranking EU official. It suggests the Overton window for AI policy has shifted significantly.




Selective regularization for alignment-focused representation engineering

2025-05-20 20:54:09

Published on May 20, 2025 12:54 PM GMT

We study how selective regularization during training can guide neural networks to develop predictable, interpretable latent spaces with alignment applications in mind. Using color as a test domain, we observe that anchoring even a single concept (red) influences the organization of other concepts, with related concepts clustering nearby — even with weak supervision. We then propose that concept-anchored representation engineering might enable more precise intervention in complex models without requiring extensive post-hoc interpretability work.

Still frame from a visualization of the evolution of the latent space of our experiment model. This image has been tightly cropped to show only a thin slice of the full frame. It shows three vibrant, colorful scatter plots, the left resembling a cube, and the other two suggesting a manifold in high-dimensional space.

Introduction

In our previous post, we proposed that anchoring key concepts to specific directions in latent space during training might make AI systems more interpretable and controllable. This post presents our exploratory findings as we work toward that goal, adapting and combining techniques from representation learning with a specific focus on alignment applications.

Rather than attempting to discover and modify latent directions after training (as in mechanistic interpretability), we're exploring whether it's possible to impose useful structure on latent spaces during training, creating a more interpretable representation from the start. Many of the techniques we use have precedents in machine learning literature, but our focus is on their application to alignment challenges and whether they might enable more controlled model behavior.

Using color as an experimental domain, we investigated whether simple autoencoders with targeted regularization could learn predictable latent space structures that organize concepts in ways we can understand and potentially control — with specific colors as a stand-in for "concepts"[1].

By intentionally structuring a portion of the model's internal representations during training, we aimed to know exactly where key concepts will be embedded without needing to search for them. Importantly, we don't constrain the entire latent space, but only the aspects relevant to concepts we care about. This selective approach allows the model to find optimal representations for other concepts while still creating a predictable structure for the concepts we want to control. We observed that other concepts naturally organized themselves in relation to our anchored concepts, while still maintaining flexibility in dimensions we didn't constrain.

Our experiments incorporate elements from established techniques in representation learning, prototype-based models, and supervised autoencoders, but with a specific adaptation: using stochastic, selective regularization with minimal concept anchoring (just a single concept) to influence broader representation organization without dictating the entire structure. We view this work as preparatory research for developing techniques to train more complex models in domains in which data labelling is hard, such as language.

In this post, "we" refers to "me and Claude 3.7 Sonnet," who helped draft this content. The underlying research was conducted with coding assistance from Gemini 2.5 Pro, GPT-4o, and GPT-4.1, and literature review by deep research in ChatGPT.

Related work

Our approach builds upon several established research directions in machine learning. Concept Activation Vectors (CAVs)[2] pioneered the identification of latent space directions corresponding to human-interpretable concepts, though primarily as an interpretability technique rather than a training mechanism. Work on attribute-based regularization[3] has shown that variational autoencoders can be structured to encode specific attributes along designated dimensions, while our work applies to simple autoencoders with no assumption of monosemantic representations. Recent research has also explored using concept vectors for modifying model behavior through gradient penalization in latent space[4]. Our work differs by proactively anchoring concepts during training rather than analyzing or modifying latent representations post-hoc.

Several approaches have explored similar ideas through different frameworks. Supervised autoencoder methods often incorporate geometric losses that fix class clusters to pre-determined coordinates in the latent space[5]. Similarly, center loss for face recognition trains a "center" vector per class and penalizes distances to it[6]. Prototypical VAE uses pre-set "prototype" anchors in latent space to distribute regularization[7]. These techniques all employ concept anchoring in some form, though our approach is distinctive in focusing specifically on the minimal supervision needed and examining how structure emerges around it rather than explicitly defining the entire latent organization.

In the contrastive and metric learning literature, techniques like supervised contrastive learning[8] cause samples of the same class to cluster together, but typically don't pin clusters to particular locations in advance. These methods achieve similar grouping of related concepts but don't provide the same level of predictability and control over where specific concepts will be represented. In unsupervised contrastive learning (e.g. SimCLR[9]/BYOL[10]), the clusters emerge solely from data similarities, with no predefined anchor for a concept.

Alignment motivation

Our interest in these techniques stems specifically from challenges in AI alignment. Current approaches like RLHF and instruction tuning adjust model behavior without providing precise control over specific capabilities. Meanwhile, mechanistic interpretability requires significant post-hoc effort to understand models' internal representations, with limited guarantees of success.

We're exploring whether a middle path exists: deliberately structuring the latent space during pre-training to make certain concepts more accessible and controllable afterward. This approach is motivated in part by findings on critical learning periods in neural networks[11], which demonstrated that networks lose plasticity early in training (even in the first few epochs!), establishing connectivity patterns that become increasingly difficult to reshape later.

Recent work on emergent misalignment further strengthens the case for latent space structure, showing that models fine-tuned solely to write insecure code subsequently exhibited a range of misaligned behaviors in unrelated contexts, from expressing anti-human views to giving malicious advice[12]. This suggests that even highly abstract related concepts naturally cluster together in latent space, making the organization of these representations a critical factor in alignment. Deliberately structuring these representations during pre-training may provide more direct control than purely post-hoc interventions.

If models develop more predictable representations for key concepts we care about (like harmfulness, deception, or security-relevant knowledge), we might gain better tools for:

  1. Monitoring the presence of specific capabilities during training
  2. Selectively intervening on specific behaviors without broad performance impacts
  3. Creating more fundamental control mechanisms than prompt-level constraints

While our experiments with color are far simpler than language model alignment challenges, they provide a testbed for techniques that might eventually scale to more complex domains.

Color as a test domain

We chose color as our initial experimental domain because it offers several properties that make it suitable for exploring latent space organization:

  • Clear ground truth: Every color has objective RGB values that can be precisely measured, so we can check whether our manipulations of latent space produce the expected results.
  • Hierarchical and multidimensional structure: Primary colors (red, green, blue) combine to form secondary colors (yellow, cyan, magenta), which further mix together and with black and white to create continuous hues, tones, and shades. This hierarchical relationship may mirror how abstract concepts relate in more complex domains.
  • Visually intuitive: We can directly see the organization of the latent space and understand what's happening. When we visualize embeddings of colors, the emergence of a color wheel or a spherical organization isn't coincidence — it's a meaningful representation that aligns with human understanding of color.

If we can demonstrate reliable control over a model's internal representation with color, it suggests we might achieve similar control in more complex domains. We assume that the core mechanisms of embedding and representation remain similar even when the domains differ greatly in complexity.

Experimental approach

Our experiments used a simple architecture: a multilayer perceptron (MLP) autoencoder with a low-dimensional bottleneck. This architecture forces the model to compress information into a small number of dimensions, revealing how it naturally organizes concepts.

Architecture

We used a 2-layer MLP autoencoder with RGB colors (3 dimensions) as input and output, and a 4-dimensional bottleneck layer. While RGB data is inherently 3-dimensional, we needed a 4D bottleneck because of our unitarity (hypersphere) constraint, described below.

This architecture creates a useful tension: the model must balance information compression against the structural constraints imposed through regularization.

A neural network architecture diagram showing an MLP autoencoder with bottleneck structure. There are 5 layers. From left to right: the RGB input (three nodes colored red, green, and blue), then a stack of 16 white nodes, then four nodes for the latent space, two of which are filled with a rainbow gradient and two that are grayscale. The design is then reflected, with another 16 white nodes and the RGB output. Grey lines connect each layer to the next, forming the shape of an hourglass on its side. The structure illustrates how RGB color information is compressed into the 4D bottleneck representation and then reconstructed. Flowchart-like arrows connect from the middle and right layers to labelled boxes representing the loss terms. The middle (latent space) layer connects to the box labelled R for 'regularization'; the right (output) layer connects to a box labelled C for 'criterion'; both of those connect to a box labelled L for 'loss'.
MLP autoencoder architecture with RGB inputs/outputs (3 dimensions) and a bottleneck layer (4 dimensions). R: regularization terms from the bottleneck layer. C: reconstruction loss criterion from the output layer. L: combined loss.
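To make the shape of the network concrete, here is a minimal sketch in plain Python. The layer widths (3, 16, 4, 16, 3) are read off the diagram. The activations, the initialization, the presence of biases on every layer, and all function names are assumptions made for illustration, not taken from the experiment code.

```python
import math
import random

# Layer widths read off the diagram: RGB in, 16 hidden, 4D bottleneck, 16 hidden, RGB out.
LAYER_SIZES = [3, 16, 4, 16, 3]


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(rgb, layers):
    """Return (latent, reconstruction) for a single RGB triple."""
    enc1, enc2, dec1, dec2 = layers
    h = [math.tanh(v) for v in dense(rgb, enc1)]      # assumed activation
    latent = dense(h, enc2)                           # 4D bottleneck, left linear (assumed)
    h = [math.tanh(v) for v in dense(latent, dec1)]   # assumed activation
    recon = [1 / (1 + math.exp(-v)) for v in dense(h, dec2)]  # squash to [0, 1] (assumed)
    return latent, recon


def count_params(layers):
    """Total number of weights and biases across all layers."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


rng = random.Random(0)
layers = [init_layer(a, b, rng) for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:])]
latent, recon = forward([1.0, 0.0, 0.0], layers)
print(len(latent), len(recon), count_params(layers))  # 4 3 263
```

Under these assumptions the network has 263 parameters, which gives a sense of how small the model is.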

Regularization

We imposed structure through several complementary regularization techniques:

Planarity penalized embeddings for using dimensions beyond the first two (using L2 norm on the 3rd and 4th dimensions with target length 0), encouraging the model to organize hues in a plane.

Unitarity encouraged embedding vectors to have unit length (using L2 norm on all dimensions with target length 1), pushing representations toward the surface of a hypersphere[14].

Angular repulsion encouraged distinct concepts to maintain distance from each other in the embedding space using cosine distance.

Concept anchoring explicitly pushed certain colors (like red) toward predetermined coordinates in the latent space using Euclidean distance. In retrospect, cosine distance might have been more appropriate given the spherical nature of our embedding space.

Conceptual diagram illustrating the four regularization terms used in our experiments, displayed against a dark gray background. From left to right: (1) Planarity: depicted as a horizontal disc with two white circles on opposite sides and arrows pointing vertically toward the plane, representing the constraint that keeps certain dimensions close to zero; (2) Unitarity: shown as a large circle with a black center point and a white point on the perimeter, with an arrow suggesting movement toward the circumference, representing the constraint that vectors maintain unit length; (3) Repulsion: illustrated as white circles placed on either end of an arc, each one with an arced arrow pointing to the other, representing how distinct embeddings are encouraged to maintain distance from each other; (4) Concept anchoring: visualized as two white circles being pulled toward a red triangle (representing the "red" concept), showing how certain colors are guided toward predetermined coordinates in latent space.
Illustration of regularization terms. From left to right: planarity, unitarity, angular repulsion, concept anchoring.
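The four terms can be sketched as small functions over a 4D embedding. This is an illustrative reading of the descriptions above, not the experiment code: the squared forms, the way cosine similarity is turned into a repulsion penalty, and the anchor coordinates are all assumptions.

```python
import math


def norm(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def planarity(z):
    """Penalize length in the 3rd and 4th dimensions (target length 0)."""
    return norm(z[2:]) ** 2


def unitarity(z):
    """Penalize deviation of the full vector's length from 1."""
    return (norm(z) - 1.0) ** 2


def cosine_similarity(a, b, eps=1e-8):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b) + eps)


def angular_repulsion(batch):
    """Mean cosine similarity over distinct pairs, rescaled to [0, 1].

    Lower values mean the embeddings point in more different directions.
    """
    pairs = [(a, b) for i, a in enumerate(batch) for b in batch[i + 1:]]
    if not pairs:
        return 0.0
    return sum((cosine_similarity(a, b) + 1.0) / 2.0 for a, b in pairs) / len(pairs)


def anchoring(z, anchor):
    """Euclidean distance from the embedding to a fixed anchor point."""
    return norm([x - y for x, y in zip(z, anchor)])


RED_ANCHOR = [0.0, 1.0, 0.0, 0.0]  # hypothetical coordinates, for illustration only

on_anchor = [0.0, 1.0, 0.0, 0.0]
off_plane = [0.0, 0.0, 1.0, 0.0]
print(planarity(on_anchor), unitarity(on_anchor), anchoring(on_anchor, RED_ANCHOR))
print(planarity(off_plane), unitarity(off_plane), anchoring(off_plane, RED_ANCHOR))
```

An embedding sitting exactly on the anchor incurs no penalty from planarity, unitarity, or anchoring, while a unit vector pointing out of the hue plane is penalized by planarity and anchoring but not by unitarity.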

Labels

We applied these regularization terms selectively to different samples based on stochastic binary labels. The probability of a sample receiving a particular label was determined by formulas that capture the relevant color properties.

For the "red" label (measuring proximity to pure red), we calculated:

This formula yields high values for pure red (r=1, g=0, b=0) and rapidly diminishing values as green or blue components increase. The exponent of 10 creates a sharp falloff.

For the "vibrant" label (measuring proximity to any pure hue):

Where s and v are saturation and value in the HSV color space. The high exponent (100) creates an extremely sharp distinction between fully saturated, bright colors and even slightly desaturated or darker ones.

These continuous probabilities were then scaled (multiplied by 0.5) and converted to binary labels by comparison with random values. For example, pure red would be labeled "red" approximately 50% of the time, while colors progressively further from red would be labeled with rapidly decreasing probability.
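The labeling step can be sketched as follows. The two probability functions here are stand-ins chosen only to match the behavior described above (a value of 1 for pure red with a sharp falloff at exponent 10, and a very sharp falloff in saturation and value at exponent 100). They are assumptions, not the formulas used in the experiments. The scaling by 0.5 and the comparison with a random value follow the text.

```python
import colorsys
import random


def p_red(r, g, b):
    """Stand-in: 1.0 for pure red, falling off sharply as g or b grow (exponent 10)."""
    return (r * (1.0 - g) * (1.0 - b)) ** 10


def p_vibrant(r, g, b):
    """Stand-in: near 1.0 only for fully saturated, full-value colors (exponent 100)."""
    _, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (s * v) ** 100


def sample_label(p, rng, scale=0.5):
    """Scale the probability, then compare it against a uniform random draw."""
    return rng.random() < p * scale


rng = random.Random(0)
n = 10_000
pure_red = sum(sample_label(p_red(1.0, 0.0, 0.0), rng) for _ in range(n)) / n
orange = sum(sample_label(p_red(1.0, 0.5, 0.0), rng) for _ in range(n)) / n
print(f"pure red labeled 'red': {pure_red:.2f}")
print(f"orange labeled 'red':   {orange:.4f}")
```

With these stand-ins, pure red receives the label about half the time, while a color as close as orange almost never does.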

This stochastic approach creates a form of weak supervision, where the model receives imperfect, noisy guidance about concept associations. We designed it this way to simulate real-world scenarios like labeling internet text for pretraining, where labels might be incomplete, inconsistent, or uncertain. Our experiments show that such sparse, noisy signals can still effectively shape the latent space, which is an encouraging finding for applications where comprehensive labeling is impractical.
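One way to picture the selective application is as a per-sample mask on each regularization term. The sketch below assumes each term is averaged over only the samples that received the corresponding label and then added to the reconstruction loss with a weight. The averaging scheme, the weights, and the function names are hypothetical.

```python
def selective_mean(penalties, labels):
    """Average a per-sample penalty over only the samples that carry the label."""
    selected = [p for p, keep in zip(penalties, labels) if keep]
    return sum(selected) / len(selected) if selected else 0.0


def total_loss(recon_loss, terms, weights):
    """Combine reconstruction loss with weighted, label-gated regularization terms."""
    return recon_loss + sum(
        weights[name] * selective_mean(penalties, labels)
        for name, (penalties, labels) in terms.items()
    )


# Three samples: only the first was labeled "red", the first two "vibrant".
terms = {
    "anchor": ([0.2, 0.9, 1.4], [True, False, False]),     # gated by the "red" label
    "planarity": ([0.1, 0.3, 0.8], [True, True, False]),   # gated by the "vibrant" label
}
weights = {"anchor": 0.5, "planarity": 1.0}  # hypothetical weights
loss = total_loss(0.05, terms, weights)
print(round(loss, 3))  # 0.35
```

Samples without a label contribute only to the reconstruction loss, so the regularizers shape the parts of the space the labels point to and leave the rest to the model.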

Key findings

Our experiments produced several interesting results:

Predictable structure

The 4D latent space organized itself into a coherent structure where similar colors clustered predictably. When projected onto the first two dimensions, vibrant hues formed a clear circle resembling a color wheel, even though we never explicitly told the model to arrange them this way.

Latent space visualizations showing three different 2D projections of the 4D embedding space at step 10000, alongside smaller thumbnails showing the evolution of the space during training. The left panel [1,0] (hue) shows a circular arrangement of colors forming a color wheel. The middle and right panels [1,2], [1,3] show similar circular structures from different angles. Thumbnails below show how the space evolved from a small cluster at step 0 through various transformations, eventually forming a spherical structure by step 8000.
Top row: Latent space visualizations showing three different 2D projections of the 4D embedding space at the end of training. Left: the hue plane (first two dimensions), with red at the top. Middle and right: one hue and one other dimension each. Bottom row: select frames showing the evolution of the space during training.

Importantly, we didn't need to search for interesting projections after training. The structure formed as we intended due to regularization. While dimensionality reduction techniques often produce meaningful visualizations, what distinguishes our approach is the predictability and consistency in placing specific concepts in predetermined locations. This addresses a fundamental challenge in interpretability: knowing where in the latent space to look for specific concepts.

Here's a similar visualization, this time as a video.

The video[15] (and also the thumbnails in the figure above) shows the evolution of latent space:

  1. Initially (step 0), all colors cluster at a single point
  2. The space expands to form something resembling an RGB cube (around step 200)
  3. The structure then contorts as the model balances reconstruction objectives against regularization constraints (steps 800-3000)
  4. Around step 5000, coinciding with increased unitarity regularization, the structure becomes more spherical
  5. Through the remainder of training, it settles into an increasingly regular sphere.

The video also shows how the hyperparameters were varied over time, and the regularization loss terms.
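Varying a regularization weight over training can be sketched as interpolation between keyframes. The keyframe values below are hypothetical and only mimic the description above of unitarity regularization increasing around step 5000. The actual schedule is the one shown in the video.

```python
def interpolate(schedule, step):
    """Piecewise-linear interpolation over (step, value) keyframes sorted by step."""
    if step <= schedule[0][0]:
        return schedule[0][1]
    for (s0, v0), (s1, v1) in zip(schedule, schedule[1:]):
        if s0 <= step <= s1:
            t = (step - s0) / (s1 - s0)
            return v0 + t * (v1 - v0)
    return schedule[-1][1]  # hold the last value after the final keyframe


# Hypothetical keyframes: the unitarity weight ramps up around step 5000.
UNITARITY_WEIGHT = [(0, 0.0), (4000, 0.1), (6000, 1.0), (10000, 1.0)]

for step in (0, 2000, 5000, 8000):
    print(step, round(interpolate(UNITARITY_WEIGHT, step), 3))
```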

Effectiveness of minimal concept anchoring

Even with only a single anchored concept (red at coordinates ), the entire latent space organized itself relative to this point. The other primary and secondary colors took positions spaced around the circle at regular intervals. This happened despite:

  1. The anchoring regularization being applied inconsistently (pure red was only labeled as "red" 50% of the time)
  2. The anchoring signal being applied stochastically based on noisy labels
  3. No colors other than red being anchored to a point (although notably, vibrant colors were anchored to the plane formed by the first two dimensions).

We observed that anchoring a single concept, combined with planar constraints, influenced how the model organized its latent space. While we only explicitly constrained a portion of the space (2 of 4 dimensions), the entire representation adapted to these constraints in our simple domain. This suggests potential for targeted regularization, though the extent to which such influence would propagate in higher-dimensional spaces (like those found in large language models) remains an open question.

Selective vs. curriculum-based regularization

One of our practical findings concerns the training methodology. We initially explored curriculum learning approaches where we gradually introduced more complex data. However, we discovered that simply training on the full color space from the beginning with selective regularization produced superior results:

  1. More stable training dynamics
  2. Better preservation of the learned structure
  3. More consistent results across different random initializations

This went against our initial intuition, but it simplifies implementation: designing curricula is tricky, whereas training on full data with selective regularization is more straightforward.

Implications for alignment research

These initial results suggest some promising directions for concept-anchored representation engineering in an alignment context. While other approaches have explored structuring latent spaces through regularization, our specific exploration of anchoring minimal concepts during training with selective per-sample regularization offers several potential insights[16]:

  1. Structure can be guided through selective regularization. Targeted regularization can successfully impose predictable organization on latent space.
  2. Stochastic labeling is sufficient. The effect persists even with noisy, stochastic application of regularization, suggesting we don't need perfect or complete concept labeling.
  3. Related concepts naturally cluster. The resulting structure places related concepts near each other in latent space, potentially enabling more precise interventions.
  4. Minimal anchoring may influence broader organization. A small number of anchored concepts can influence the organization of the broader representation space without distorting the relationship between concepts.

These points suggest that anchored concepts might indeed act as attractors that organize related concepts in meaningful ways, though verification in more complex domains is needed.

Limitations

Despite these promising signs, we acknowledge significant limitations:

Domain simplicity: Color has intrinsic geometric structure that may make it uniquely amenable to this approach. The RGB color space is already low-dimensional with clear, objective distance metrics. Language concepts likely occupy a much messier, higher-dimensional manifold with less obvious geometric relationships. The ease with which we found a color wheel structure may not transfer to domains where the "natural" organization is less clearly defined.

Architectural simplicity: Our experiments used tiny MLPs with a few hundred parameters. Modern language models have billions or trillions of parameters with complex, multi-layer architectures. While some work exists on regularizing transformer latent spaces (e.g., ConceptX, LSRMT), applying our specific approach of concept anchoring during training to shape transformer representations presents challenges we have yet to explore, particularly given how attention mechanisms create context-dependent representations.

Unknown trade-offs: There may be significant performance trade-offs between regularized structure and model capabilities. If regularization forces the model to organize concepts in ways that aren't optimal for its tasks, we might see degraded performance.

Supervision requirements: This technique requires some concept labeling. In language models, identifying and labeling instances of abstract concepts like "deception" or "harmfulness" is more subjective and challenging.

Despite these limitations, we remain optimistic. The robustness to noisy labeling, effectiveness of selective regularization, and the observed influence of our targeted constraints suggest that this approach deserves further exploration in more complex domains.

Next steps

Our work so far has focused on exploring whether we can create structured latent spaces through selective regularization. Next, we'd like to see whether we can use this structure to actually control model behavior. Our next experiments will explore:

  1. Selective concept suppression: Can we reliably suppress specific colors (e.g., red) at inference time, using pre-determined directions in latent space? Ideally, this would affect red and red-adjacent colors while leaving others (like blue) untouched. This could suggest new approaches for controlling model capabilities.
  2. Concept deletion: Beyond temporary suppression, can we modify the network to permanently remove specific capabilities by identifying and ablating the relevant weights?
  3. Transformer architectures: Ultimately, we want to apply these techniques to transformer models, which presents additional challenges due to their attention mechanisms and context-dependent representations.
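As a rough illustration of what the first of these experiments might look like, the sketch below removes the component of a latent vector that points along a predetermined direction. This is one candidate intervention, not a method we have tested. The direction, the example vectors, and the choice to suppress only positive components are all assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def suppress(z, direction, strength=1.0):
    """Remove the positive component of z along a unit-length direction."""
    amount = max(0.0, dot(z, direction)) * strength
    return [x - amount * d for x, d in zip(z, direction)]


RED_DIRECTION = [0.0, 1.0, 0.0, 0.0]  # hypothetical anchor direction
red_like = [0.1, 0.9, 0.0, 0.0]       # mostly aligned with the direction
blue_like = [0.8, -0.5, 0.0, 0.0]     # points away from the direction

print(suppress(red_like, RED_DIRECTION))   # [0.1, 0.0, 0.0, 0.0]
print(suppress(blue_like, RED_DIRECTION))  # [0.8, -0.5, 0.0, 0.0]
```

In this toy case the red-like vector loses its component along the direction while the blue-like vector is unchanged, which is the kind of selectivity the experiment would need to demonstrate in a trained model.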

Conclusion

Our experiments explore how selective regularization can guide the formation of structured representations during training, potentially reducing the need for post-hoc interpretation. Using these adapted techniques, we created partially structured latent spaces where concepts we cared about were predictably positioned, while allowing the model flexibility in organizing other aspects of its representations.

While many of the individual techniques we've used have precedents in the literature on prototype learning, supervised autoencoders, and contrastive learning, our specific contribution lies in exploring four things: (1) how proactively structuring a portion of the latent space during training through stochastic weak supervision can yield predictable organization where it matters; (2) how explicitly constraining just one concept and one plane can influence nearby representations without dictating the structure of the entire space; (3) how robust the approach is to noisy, stochastic labeling; and (4) how such structured latent spaces can be shaped by selective per-sample regularization rather than comprehensive supervision.

These findings suggest that concept-anchored representation engineering could potentially provide a valuable approach for designing more interpretable and controllable neural networks, building on existing work in latent space structuring. Whether this approach can scale to language models and more complex domains remains an open question, but these initial results provide encouragement to continue exploring this direction. If we can reliably engineer the structure of relevant portions of latent representations during training, we might gain better tools for alignment - particularly for more precise control over what models learn and how they use that knowledge.


We welcome suggestions, critiques, and ideas for improving our approach. If you're working on similar research or would like to collaborate, please reach out. For those interested in replicating or building upon our experiments, we've made our code available in z0u/ex-color-transformer, with Experiment 1.7 containing the implementation of the selective regularization approach described in this post.


  1. ^

    In this work, we distinguish between the semantic "concepts" we're interested in (like "red" or "vibrant") and their representation in the model's latent space, where related terms from the literature would include "latent priors" or "prototypes" — the anchor points or structures that guide how these concepts are encoded.

  2. ^

    Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). International Conference on Machine Learning. arXiv:1711.11279

  3. ^

    Hadjeres, G., & Nielsen, F. (2020). Attribute-based regularization of latent spaces for variational auto-encoders. Neural Computing and Applications. arXiv:2004.05485

  4. ^

    Anders, C. J., Dreyer, M., Pahde, F., Samek, W., & Lapuschkin, S. (2023). From Hope to Safety: Unlearning Biases of Deep Models via Gradient Penalization in Latent Space. arXiv:2308.09437

  5. ^

    Gabdullin, N. (2024). Latent Space Configuration for Improved Generalization in Supervised Autoencoder Neural Networks. arXiv:2402.08441

  6. ^

    Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016). A Comprehensive Study on Center Loss for Deep Face Recognition. DOI: 10.1007/s11263-018-01142-4 [open access version on GitHub]

  7. ^

    Oliveira, D. A. B., & La Rosa, L. E. C. (2021). Prototypical Variational Autoencoders. OpenReview hw5Kug2Go3-

    While the specific implementation of Prototypical VAE by Oliveira et al. was retracted due to methodological concerns, we include this reference to acknowledge that the concept of prototype-based regularization in latent spaces has also been explored in other studies.

  8. ^

    Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., Liu, C., & Krishnan, D. (2020). Supervised Contrastive Learning. arXiv:2004.11362

  9. ^

    Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. arXiv:2002.05709

  10. ^

    Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P.H., Buchatskaya, E., Doersch, C., Pires, B.A., Guo, Z.D., Azar, M.G., & Piot, B. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. arXiv:2006.07733

  11. ^

    Achille, A., Rovere, M., & Soatto, S. (2017). Critical Learning Periods in Deep Neural Networks. arXiv:1711.08856

  12. ^

    Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., Labenz, N., & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv:2502.17424

  13. ^

    Epistemic status: 75%, based on:

      • Word2Vec and similar embedding approaches showing meaningful geometric structure
      • Recent mech interp work suggesting similar representation mechanisms across domains
      • Emergent misalignment research indicating models tend to discover similar representational patterns.

  14. ^

    In upcoming experiments with transformers, we intend to use hypersphere normalization as in nGPT. We think that normalization helps the optimizer to find meaningful representations, and we expect our DevInterp research to be most useful with architectures that lend themselves to well-structured latent spaces in other ways.

  15. ^

    As an aside: I'm fascinated by these animated visualizations of latent space. Seeing how the space changes for every batch has given me insights into potential hyperparameter tweaks that would have been hard to find otherwise. Perhaps carefully selected metrics could have given similar insight if viewed as a static line chart, but I don't know which metrics they would be, nor how you could know in advance which ones to choose.

  16. ^

    My confidence in these claims ranges from 90% down to 60% (from top to bottom).


