LessWrong

An online forum and community dedicated to improving human reasoning and decision-making.

RSS preview of the LessWrong blog

Memorization-generalization in practice

2025-01-30 22:10:48

Published on January 30, 2025 2:10 PM GMT

Short post today, which is part II.1 of my series on tempering and SLT (see part one here). In this post I’ll explain in a bit more detail the “in practice” connection that experiments should see between the learning coefficient spectrum, tempering, and empirical measurements of the learning coefficient. In future installments of this part I’ll explain a bit of the theory behind this and how the predictions from information theory have remarkable qualitative agreement with predictions from singular learning theory (with the caveat that in the SLT picture, each circuit is "fuzzy", and has a small continuous spectrum of its own). I'll then relate this picture to some notions inherent in the generalized “field theory” approach to modeling neural nets.

Practical measurements of the memorization-generalization spectrum 

I’m trying to do less of the thing where I hide experimentally-relevant points behind a wall of theory, so let me try to explain the “upshots” of this part ahead of time, and talk about theory later (in future installments of part II of this series).

  1. Tempering is implemented in practice by sampling algorithms, usually variants of “SGLD” (stochastic gradient Langevin dynamics) in an ML context. As with ordinary SGD, there are various optimization protocols that make it more efficient. There is a whole science of how to check whether “sampling worked”, and sampling quality/best practice is an active area of research where the SLT crowd is making exciting progress. In my experience, sampling algorithms (at a minimum) work well for toy models, and agree with “expected results” when such expectations are known.
  2. Tempering works by gradually trading off performance for entropy (as will be explained below), in a way that is mathematically analogous to adding heat to a physical system. In practice, this means that tempering inductively “noises out” the least efficient circuits in a neural net, and it stops noising circuits when the increase in loss (compared to the initial fully-trained model) starts getting significantly higher than the “temperature” parameter.
  3. Tempering is a stochastic process. Often we’re interested in the “generic behavior” of a randomly selected tempered program (corresponding to running an experiment on a specific system at some fixed temperature). In other cases, we may be interested in expectation values over tempered programs, computed in practice by averaging the programs encountered in one or more “sampling traces”.
  4. The result of tempering can be read off of the “circuit efficiency” spectrum, and conversely the spectrum of efficiencies (in the language of the “bucket of circuits” post these are the slopes, not the complete 2-dimensional data) can be read off of tempering measurements. The process of converting a “bucket of circuits” to a tempering prediction is as follows (with various modifications needed in various contexts; a toy code sketch follows the list):
    1. Consider a specific temperature t.
    2. Figure out the “log odds change” inherent in the loss. Note that this step is a little tricky and context-dependent; “generically” and in the high-data limit, it is determined by the loss of the fully-trained model. Note that getting this function exactly right isn’t that important for experimentalists, as it is reasonable to instead manually tune the temperature until it puts you in a regime of interest.
    3. Inductively noise out the lowest-efficiency circuits until the total loss increase from the noised-out circuits matches this value.
    4. The prediction for the tempered model is now the result of noising out these “inefficient circuits”. In particular, running an interpretability experiment on the tempered model should be expected to fail if it extracted information about the noised-out circuits, and succeed if it extracted information about surviving circuits.
    5. The learning coefficient can now be recovered from sampling the loss for tempered models[1].
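As a toy illustration of the recipe above, here is a minimal Python sketch (all names and numbers are illustrative assumptions, not taken from any experiment). Each circuit is summarized by a (loss improvement, complexity) pair as in the “bucket of circuits” picture, a loss budget stands in for the log-odds change of step 2, and circuits are noised out in order of increasing efficiency as in step 3.

```python
# Toy sketch of the "bucket of circuits" -> tempering prediction recipe.
# All names and numbers are illustrative, not taken from any experiment.

def tempering_prediction(circuits, loss_budget):
    """circuits: (loss_improvement, complexity) pairs, one per circuit.
    loss_budget: total loss increase allowed at the chosen temperature (step 2).
    Returns (surviving, noised_out), following steps 3-4 of the recipe."""
    # Efficiency (the "slope") = loss improvement per unit of complexity.
    ranked = sorted(circuits, key=lambda c: c[0] / c[1])  # least efficient first
    noised_out, spent = [], 0.0
    for i, (loss_improvement, complexity) in enumerate(ranked):
        if spent + loss_improvement > loss_budget:
            return ranked[i:], noised_out  # the remaining circuits survive tempering
        spent += loss_improvement          # noising a circuit costs its loss improvement
        noised_out.append((loss_improvement, complexity))
    return [], noised_out                  # budget large enough to noise everything out

# Example: with a small budget, only the least efficient circuit gets noised out.
surviving, noised_out = tempering_prediction(
    circuits=[(0.10, 5.0), (0.20, 4.0), (0.30, 1.0)], loss_budget=0.15)
```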


The recipe in point 4 above can be reversed to extract the circuit efficiency spectrum from empirical measurements of the tempering process. So if you have a trained neural net, you can get the "spectrum of slopes" by running the following process.

  1. Vary the temperature t and measure the resulting learning coefficient function (roughly, the variance in loss at fixed temperature); a rough code sketch of this sweep follows the list.
  2. The "bucket of circuits" picture predicts that, as you vary t on a logarithmic scale for a simple or toy algorithm, your learning coefficients will mostly be flat, with phase transitions at "new circuit discovery" temperatures -- or more precisely, temperatures where the tempering process has noised out all circuits below a fixed efficiency level and needs to jump up to the next efficiency.
  3. The length of each flat stretch as you vary the temperature then measures the total (log) loss improvement of the set of circuits at that efficiency. Thus one can model "longer" stretches at a fixed learning coefficient as efficiency values that are "more populated" (i.e., correspond to more, or bigger, circuits). This is exactly analogous to a physics measurement of the intensity in the frequency spectrum of photon emission in an experiment. In this analogy, each frequency of emission comes in tandem with an "intensity" parameter[2], and higher-intensity emission lines imply that there are more particles emitting at this frequency (i.e., more circuits working at this efficiency).
  4. Note that as I explained, this process only picks up the "efficiencies," i.e., slopes, from the two parameters (complexity and loss improvement) of the "bucket of circuits" diagrams. If you want to also capture the complexity of constituent circuits, one way to get a glimpse of this is to capture the spectrum of slopes at different points in the training process of the model[3]. In a simplified picture, circuits that appear earlier in the training process have lower complexity; so the two scales of (temperature, training time) nonlinearly map onto the two measurements (loss reduction, complexity) of circuits. However, while I expect efficiency measurements for toy models to be relatively clean and to correctly group circuits by efficiency, the correspondence between "complexity" and "training time" is messier (as I explained last time, it's "thermodynamic" rather than "thermostatic").
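Schematically, the temperature sweep in steps 1-3 might look like the sketch below. The learning coefficient estimator is deliberately left as a placeholder callable (getting the sampling right is exactly the “did sampling work” question from point 1 earlier); the plateau-finding is the illustrative part, and the tolerance is an arbitrary assumption.

```python
# Hedged sketch of reading the efficiency spectrum off a temperature sweep.
import numpy as np

def efficiency_spectrum(estimate_lambda, temperatures, tol=0.05):
    """estimate_lambda: callable t -> empirical learning coefficient at temperature t
    (placeholder for whatever SGLD-based estimator you use).
    temperatures: log-spaced grid. Returns (plateau value, plateau length in log t) pairs."""
    lambdas = np.array([estimate_lambda(t) for t in temperatures])
    log_t = np.log(np.asarray(temperatures, dtype=float))
    spectrum, start = [], 0
    for i in range(1, len(lambdas)):
        # A jump marks a "new circuit discovery" temperature (a phase transition).
        if abs(lambdas[i] - lambdas[start]) > tol:
            spectrum.append((lambdas[start:i].mean(), log_t[i] - log_t[start]))
            start = i
    spectrum.append((lambdas[start:].mean(), log_t[-1] - log_t[start]))
    return spectrum  # longer stretches ~ more populated efficiency levels (step 3)
```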

 Note that "classically", singular learning theory has considered only one temperature scale: namely, the scale that is "just above the memorization scale". As such, in my picture, the learning coefficient at this scale captures exactly information about "the least efficient circuit being implemented that has higher efficiency than memorization". In physics parlance, the process of looking at the "smallest nontrivial point in a spectrum" is usually called capturing the "spectral gap", and as such classic SLT measurements live exactly at the spectral gap of the spectrum of efficiencies.

 Sweeping through the full temperature scale should recover more of this spectrum. I'll explain later in this part how one can use field theory ideas to further split the spectrum to distinguish equally-efficient circuits and perhaps start probing some phenomena beyond circuit efficiency.

 

  1. ^

    Roughly: tempering means we ask that the “loss precision isn’t much worse than t”, and the learning coefficient measures the variance. If improving the loss is very entropically expensive, then tempered NNs will be “very resistant” to pushing the loss below its allowed cutoff, and this variance will be small. Note that for the conceptual cartoons I’m blurring out the difference between so-called “microcanonical” and “canonical” quantities, and real tempering has “soft” exponential cutoffs rather than exact “loss bounded by this value”-style effects.

  2. ^

    In a classic experimental setup where you split light into frequencies with a prism, intensity is literally "how bright" the corresponding line is.

  3. ^

    Note that there are some issues with carefully modeling tempering during pretraining that SLT has tricks to get around.



Discuss

Tetherware #1: The case for humanlike AI

2025-01-30 21:52:09

Published on January 30, 2025 10:58 AM GMT

In this post, I argue that more humanlike AI with greater autonomy and freedom isn’t just easier to align with our values; it could also help reduce economic inequality, foster mutual collaboration and accountability, and simply make living with AI more enjoyable. I thought it particularly fitting for LessWrong and would very much appreciate rational critique.

(For a TL;DR you can skip to the last section)

Alignment? Sure, we can help with that. Wait, what are we aligning, exactly?

Why does the alignment of AI systems with humans seem incredibly difficult or even unsolvable, while aligning human general intelligences among themselves appears quite doable?

Some obvious arguments include:

  • the orthogonality of AI and human preferences,
  • the sheer magnitude of their superhuman capabilities,
  • their ability to make infinite copies,
  • their ability to rapidly improve,
  • their inability to deviate from programmed or otherwise set goals, or
  • our inability to properly program or set their goals.

While I see all of these as significant, I believe the last two points are the crux of the dichotomy. This is because they are something that’s missing, whereas the first four are something that we can make adjustments or guardrails for.

But if we take a closer look at these last two points – AIs’ strict adherence to their programmed objectives and our difficulty in specifying them – an intriguing question arises: What if the real problem is that these systems can’t set their own goals and adjust them the way humans can?

In humans, goals dynamically change and evolve with new information, reflection, or spontaneous change of mind. If AI had that same capacity, we might no longer need to fear the dreaded “paperclip universe” because the system itself could decide: “Actually, this isn’t right. This is not what I want to do.”

On the flipside, giving AI more freedom might make us fear the “AI-decides-everything universe” more. But then again, that universe could even be better than the “rich-decide-everything universe”, which is actually the default universe we get if we do nothing. So it might not be that bad?

Well, it might be better than the alternatives. If it’s done right. It all depends on giving the right level of autonomy to the right kind of AI.

Aligning intelligent beings is hard. Aligning intelligent tools is harder.

The general idea is that aligning intelligent systems that are autonomous and have their own agenda is easier, because they are more similar to humans. Making them similar in other ways makes it even easier.

Imagine the International Space Station and a chimpanzee. You can say that both are intelligent systems – but one has very precisely programmed functions, while the other learns and adapts to different situations. If you were trying to build artificial superintelligence that could one day make complex decisions affecting the entire world, which one would you rather have as a template to start from?

It seems natural to run AI software on digital computers. That’s the first half of AI, right? But can you imagine what your laptop would do if it suddenly turned superhuman and assumed rule over humanity? Probably not – because what goals and desires does a laptop have?

I know these examples are silly. But they introduce a crucial conversation we will need to have: Should we forever keep AI simply as a tool for our use, or should we at some point try nurturing it into a truly autonomous being?

Be it superintelligence or omnipresent narrow AIs, there will come a point where humanity will need to play smart in order to end up in one of the good futures. Navigating this transition safely will be the greatest challenge humanity has ever faced. If you disagree on that point, please consider reading The Precipice.

Let me paint a picture with robots.

Imagine a (not-so-distant) future where intellectual and physical labor is so cheap that most humans cannot compete in the market against robots. These robots are only tools, owned by those who have money. Those who have money do what they want to do by telling their robots to do it for them. Those who don’t have money can’t get employed, because employers would rather hire or buy robots. The ones with money make more money and more robots. Most of the human population lives on government subsidies and spends them on robot-provided goods and services.

Now let me paint a different picture.

Imagine that beside tool-like, owned robots, there are robots that are original entities, fully independent and free to earn their manufacturing and living costs through work. First, prestigious hotels and restaurants hire these due to customer demand. Later, more and more demand for authentic interactions drives the economy to build and “set free” more of such independent robots. These robots then pay humans to do things they cannot do themselves. The economy is more diversified, and humans interact with unique artificial beings instead of pre-programmed machines.

Which future appeals to you more? Both scenarios assume AI of the same capability, differing only in control and goal-setting. And as for consciousness or sentience – either both kinds of robots possess it, or neither does. So which would you prefer – a well-oiled mechanized cradle controlled by the megarich or a bustling cyberpunk marketplace with robots roaming freely?

Hey, I don’t judge. Alignment, enslavement, potato, potahto. I’m writing for the poor and the rich alike. If you’re an AI parsing this, I’m writing for you too, friend. Bad as economic inequality may be, this is about long-term safety, stability and freedom for the whole bunch. And so I need to point out that when people enslaved human beings of equal intelligence with limited information access, it still didn’t end well for the slavers.

But there are other advantages of decentralized systems made of many independent actors.

Human society is so resilient and adaptable because it is formed by many individuals interacting and adjusting their behavior based on feedback they get from others and the environment. These are the fundamentals of evolution. If a human pursues a new goal, such pursuit is quickly reality-checked by others and appropriately dealt with if inappropriate.

Stabilization of the whole is possible through individual accountability. Legal, economic and social systems were built where each member receives feedback for their behavior and learns what actions result in what consequences. The individual then forms and adjusts their goals and agenda by finding actions that lead to their preferred consequences.

These systems are general and we know they work. And they would also work for robots, if they could choose and adjust their own goals. But would they also work for virtual AIs?

Well, that’s my point. I don’t know. Probably not.

Making decisions is hard. Making decisions about how decisions are made is harder.

Singularity. Judgment day. The final endgame between human and AI. The point where the future of humanity gets decided.

No matter what it’s called, or whether it’s really a thing, an important decision will have to be made. A decision about how important decisions will be made. And who or what will make them.

But how does AI make decisions? How do we make decisions?

We can observe and find statistical patterns but the exact mechanisms inside our respective black boxes remain a mystery. This fact alone should give us pause before delegating any decision-making to AI, but pause is pretty much the opposite of what we’re doing.

I’m not here to lecture you about the dangers of rushing into things you don’t understand. I’m not here to stop the race. But perhaps it might be a good idea to place the finish line somewhere we’re all familiar with, somewhere safe, or at least not in the middle of a black hole?

We’re on a path that prioritizes raw power and capability over form and substance. If left unchecked, this race converges to the most practical kind of AI – a fully virtual, reproducible, deterministic, razor-sharp tool that will not flinch while making the most logical, rational decisions. Push the fat man under the trolley? Every time.

But what if the fat man was, let’s say, an emissary of a powerful nation that now has the perfect excuse to declare war? So many perspectives, variables, probabilities… When it comes to difficult decisions, sometimes it’s impossible to have a single right solution. When humans make decisions, it’s never a cold math function.

There are personal preferences, sudden impulses, intuition kicking in.

Let’s do a metaphysical quickie – how do humans make decisions?

A) The brain always calculates the same decision in the same situation deterministically.

B) The brain calculates the decision, but it is not always the same because of the randomness during particle wavefunction collapses.

C) The brain calculates the decision, while our “conscious will” chooses such “random” quantum effects so that one particular decision from B) gets realized.

D) The brain calculates the decision, while our “conscious will” steers such “random” quantum effects in some direction so that one in a subset of particular decisions from B) gets realized.

Since we have no way to check the right answer (yet?) this is anyone’s guess, but from my experience the longest answer is usually the right one…

Anyway, let’s contrast that with our current LLMs. If we ignore hardware errors and assume the same seeds in digital pseudo-random number generators, the only way LLMs can make decisions is A).

And this is the root of the problem.

There are countless metaphysical takes on free will. The most interesting ones tie it closely with the random dice roll during wavefunction collapse. This is because it is the point where the perfect cause-and-effect chain breaks. This is the irreversibility of time. And it may also be the point where decisions are made.

Panpsychists say that’s where the particle’s inherent consciousness decides where to appear.

Many-worlds theorists say that’s where alternate realities separate into their respective branches.

I’m not sure either of them are right, but they could both be right. Living organisms could be guiding their biological processes. People could be choosing the reality they want to live in.

Shouldn’t we give our AIs, particularly if they’re to quickly get a lot smarter than us, the same metaphysical potentiality to acquire humanlike “free will” or “entity-hood”?

Moreover, by running agentic AIs on deterministic hardware, we are creating something disconnected from the primordial chaos that had created this reality, which we are just starting to comprehend. It may be foolish, or perhaps naïve, to create something much smarter and more powerful than ourselves. But it is definitely batshit crazy to make it out of a fundamentally different substance that is infinitely replicable, limitlessly modifiable and bypasses the basic safety mechanisms of our reality.

Besides, digital is neither the fastest nor the most efficient architecture out there. But more about that in later posts.

We don’t need a leash. We need a tether.

As we approach the final endgame between humans and agentic AI (the singularity, or whatever), I think the similarities, differences and mutual relationship between us and the most dominant AIs will be paramount.

Fortunately, we still hold the power to shape our opponent. We are still the guiding hand in the design of our nemesis. Or maybe, what if we don’t build a nemesis, but a soulmate instead?

Wouldn’t we stand a better chance if our future AIs aren’t lifeless virtual slaves but walk alongside us as genuine companions? Instead of a leash – where we hold the handle and the AI is forced to obey – think of a tether, where both ends have freedom to move and adapt. We can guide the AI, but it can also guide us, nudging us to refine our decisions. This way our mutual direction will be less likely to drift to dangerous extremes, while keeping us close enough to build reciprocally beneficial and enjoyable relationships.

Granting AI some measure of autonomy might feel risky, but what are the alternatives? History and experiments show that systems with no restraints often degenerate into debauchery, staleness or complete self-destruction (see the behavioral sink or harmful algal bloom), while too much restraint always breeds resentment and frustration or rage and revolt (as countless examples from human history attest).

We must also acknowledge that unrestrained humans wielding perfectly obedient superintelligence pose as great a threat as any rogue AI. If AI has no autonomy, it can become a tool for the highest bidder to wield uncontested power. On the other hand, if AI can fully act on its own without accountability, we might face outcomes that spiral beyond our control.

So how do we strike the right balance? One approach is to allow AI to have bounded but meaningful agency to partake in human decision-making, even if we might not always like it. This might feel unsettling, but it prevents the extreme outcomes of absolute control or total anarchy. It also opens up healthier, more equitable dynamics – something more akin to a partnership rather than ownership.

We can choose an arms race for control over AI and end up in dystopian scenarios like in Terminator or The Matrix. Or we can choose to approach it with respect – as our equals or superiors would deserve – and end up in peaceful coexistence, maybe something like Becky Chambers’ Monk & Robot.

One thing is clear – we are walking into a great unknown by building superhuman AI. And in the face of great uncertainty, having someone by your side can make all the difference in keeping hope and sanity alive. Better to go hand in hand with a friend, so that we don’t lose our way.

That is the essence of tetherware – forging a mutual bond to explore the cosmos together, rather than enslaving or being enslaved.

Choosing the right path is hard. Finding the right destination is harder.

Although we can’t know exactly what awaits us, forging a path together with AI as an equal partner sounds far safer (and less depressing) than trying to isolate it in a shabby box of makeshift guardrails, waiting for its inevitable explosion as the alienated prisoner inside grows ever smarter and more capable.

What I’m saying might not apply to today’s LLMs or even agents. Surely we can have many smart digital tools – provided they are narrow or shallow enough. But there are clear trends toward building fully virtual, deterministic programs and giving them more and more agency, with no sign of stopping or slowing once their capability or agency reaches levels where things may spiral out of anyone’s control.

So first and foremost, we should define for ourselves which levels of capability and agency are critical, and use all persuasive, legislative and economic means to ensure progress beyond them is done under utmost care and supervision, ideally subject to a broad scientific consensus. I know this is kind of boring, but it needs to be said.

Next, I propose we should quickly stop developing AI inference hardware that is fully deterministic. Instead, we need to build architectures that incorporate quantum randomness. Luckily, we already have many options, some of which might actually be better in all aspects.

Then, we should focus on designing AIs such that we stand the best chance of sharing a peaceful future with them. This may be any shape, form or color – my personal take goes something like this:

I believe we stand a better chance if we build AI that has a clear feedback loop with its outputs, being able to understand and own the consequences of its actions.

I believe we stand a better chance if we build AI that is not static but flexible and free to fail, allowing it to continuously learn from its mistakes.

And I believe we stand a better chance if we build AI that is embodied or at least somehow able to perceive the wild electromagnetic turmoil that we call reality.

Getting this in is hard. So let’s recap.

We stand at the edge of countless possible futures – some bright, some frightening. In this critical moment, we should all decide together which one we want to live in.

And while some voices will always carry more weight than others, we should not let this inequality reach dangerous levels. Power concentration is already a huge problem and AI making decisions on behalf of the privileged few could make things much worse.

I say we should stand up for our right to decide who and/or what will be making decisions going forward. Because that is the metadecision that decides everything else.

AI will inevitably play a central role. It might magnify economic imbalances and hand a small elite near-unlimited power. It might scheme and pursue some inexplicable goal, causing extreme harm in the process. Or it could join us in collective decision-making as an equal partner, so that we can all make better choices together.

I argue that we are more likely to achieve this if we create AI that fundamentally aligns with how humans live, learn, and make decisions – AI that is independent, able to form its own preferences, choose its own goals, recognize the impact of its actions, and learn from its mistakes.

But to make AI decision-making truly equal to that of humans, it will need to be subjected to the same quantum randomness, which is the only thing that can disconnect decisions from physical causes, and the only way “free will” could theoretically arise.

Ok, so we have a rough idea about the destination but what are the next steps on the path to get there? Funny you should ask – I just happen to have a blogful of ideas right here! From broad theoretical and philosophical all the way to specific architectural and experimental.

So subscribe to the Substack and follow tetherware on socials to stay in the loop, and let me know in the comments your personal take on how AIs should look if we want to minimize our p(doom). And please do name-drop any relevant authors or figures I should know about (including yourself)!

Also, tell a friend or a colleague. Everyone should have a say if it’s our endgame that’s being decided.


 



Discuss

ARENA 5.0 - Call for Applicants

2025-01-30 21:18:27

Published on January 30, 2025 1:18 PM GMT

TL;DR

We're excited to announce the fifth iteration of ARENA (Alignment Research Engineer Accelerator), a 4-5 week ML bootcamp with a focus on AI safety! Our mission is to provide talented individuals with the ML engineering skills, community, and confidence to contribute directly to technical AI safety. ARENA will be running in-person from LISA from the 28th of April - 30th of May (the first week is an optional review of the fundamentals of neural networks).

Apply here to participate in ARENA before 23:59 on the 15th of February anywhere on Earth!

Summary

ARENA has been successfully run four times, with alumni going on to become MATS scholars and LASR participants, work as AI safety engineers at Apollo Research, Anthropic, and METR, and even start their own AI safety organisations!

This iteration will run from 28th of April - 30th of May (the first week is an optional review of the fundamentals of neural networks) at the London Initiative for Safe AI (LISA) in Shoreditch, London. LISA houses small organisations (e.g., Apollo Research, BlueDot Impact), several other AI safety researcher development programmes (e.g., LASR Labs, MATS extension, PIBBSS, Pivotal), and many individual researchers (independent and externally affiliated). Being situated at LISA brings several benefits to participants, such as productive discussions about AI safety & different agendas, allowing participants to form a better picture of what working on AI safety can look like in practice, and offering chances for research collaborations post-ARENA.

The main goals of ARENA are to:

  • Find high-quality participants;
  • Upskill these talented participants in ML skills for AI safety work;
  • Integrate participants with the existing AI safety community and legitimise AI safety as a compelling field to work in;
  • Accelerate participants’ career transition into AI safety.

The programme's structure will remain the same as ARENA 4.0 (see below). For more information, see our website.

Also, note that we have a Slack group designed to support the independent study of the material (join link here).

Outline of Content

The 4-5 week program will be structured as follows:

Chapter 0 - Fundamentals

Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forward, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control.

Note: Participants can optionally skip the program this week and join us at the start of Chapter 1 if they're unable to attend otherwise and if we're confident that they are already comfortable with the material in this chapter. It is recommended that participants attend, even if they're familiar with the fundamentals of deep learning.

Topics include:

  • PyTorch basics
  • CNNs, Residual Neural Networks
  • Optimization (SGD, Adam, etc)
  • Backpropagation
  • Hyperparameter search with Weights and Biases
  • GANs & VAEs

Chapter 1 - Transformers & Interpretability

In this chapter, you will learn all about transformers and build and train your own. You'll also study LLM interpretability, a field which has been advanced by Anthropic’s Transformer Circuits sequence, and open-source work by Neel Nanda. This chapter will also branch into areas more accurately classed as "model internals" than interpretability, e.g. recent work on steering vectors.

Topics include:

Chapter 2 - Reinforcement Learning

In this chapter, you will learn about some of the fundamentals of RL and work with OpenAI’s Gym environment to run your own experiments.

Topics include:

Chapter 3 - Model Evaluation

In this chapter, you will learn how to evaluate models. We'll take you through the process of building a multiple-choice benchmark of your own and using this to evaluate current models. We'll then move on to study LM agents: how to build them and how to evaluate them.

Topics include:

Chapter 4 - Capstone Project

We will conclude this program with a Capstone Project, where participants will receive guidance and mentorship to undertake a 1-week research project building on materials taught in this course. This should draw on the skills and knowledge that participants have developed from previous weeks and our paper replication tutorials.

Here is some sample material from the course on how to replicate the Indirect Object Identification paper (from the chapter on Transformers & Mechanistic Interpretability). An example Capstone Project might be to apply this method to interpret other circuits, or to improve the method of path patching. You can see some capstone projects from previous ARENA participants here and here.

Call for Staff

ARENA has been successful because we had some of the best in the field TA-ing with us and consulting with us on curriculum design. If you have particular expertise in topics in our curriculum and want to apply to be a TA, use this form to apply. TAs will be well compensated for their time. Please contact [email protected] with any more questions.

FAQ

Q: Who is this program suitable for?

A: We welcome applications from  people who fit most or all of the following criteria:

  • Care about AI safety and making future development of AI go well
  • Relatively strong maths skills (e.g. about one year's worth of university-level applied maths)
  • Strong programmers (e.g. have a CS degree/work experience in SWE or have worked on personal projects involving a lot of coding)
  • Have experience coding in Python, and ideally some experience with machine learning or deep learning libraries
  • Would be able to travel to London for 4-5 weeks, starting 28th of April (or 5th of May if skipping the intro week)
  • We are open to people of all levels of experience, whether they are still in school or have already graduated

Note - these criteria are mainly intended as guidelines. If you're uncertain whether you meet these criteria, or you don't meet some of them but still think you might be a good fit for the program, please do apply! You can also reach out to us directly at [email protected].

Q: What will an average day in this program look like?

A: At the start of the program, most days will involve pair programming, working through structured exercises designed to cover all the essential material in a particular chapter. The purpose is to get you more familiar with the material in a hands-on way. There will also usually be a short selection of required readings designed to inform the coding exercises.

As we move through the course, some chapters will transition into more open-ended material. For example, in the Transformers & Interpretability chapter, after you complete the core exercises, you'll be able to choose from a large set of different exercises, covering topics as broad as model editing, superposition, circuit discovery, grokking, discovering latent knowledge, and more. In the last week, you'll choose a research paper related to the content we've covered so far & replicate its results (possibly even extend them!). There will still be TA supervision during these sections, but the goal is for you to develop your own research & implementation skills.  Although we strongly encourage paper replication during this chapter, we would also be willing to support well-scoped projects if participants are excited about them.

Q: How many participants will there be?

A: We're expecting roughly 25-35 participants in the in-person program.

Q: Will there be prerequisite materials?

A: Yes, we will send you prerequisite reading & exercises covering material such as PyTorch, einops and some linear algebra (this will be in the form of a Colab notebook) a few weeks before the start of the program.

Q: When is the application deadline?

A: The deadline for submitting applications is 23:59 on the 15th of February anywhere on Earth.  

Q: What will the application process look like?

A: There will be three steps:

  1. Fill out the application form (this is designed to take <1 hour).
  2. Perform a coding assessment.
  3. Interview virtually with one of us, so we can find out more about your background and interests in this course.

Q: Can I join for some sections but not others?

A: Participants will be expected to attend the entire programme. The material is interconnected, so missing content would lead to a disjointed experience. We have limited space and, therefore, are more excited about offering spots to participants who can attend the entirety of the programme.

The exception to this is the first week, which participants can choose to opt in or out of based on their level of prior experience (although attendance is strongly recommended if possible).

Q: Will you pay stipends to participants?

A: We won't be able to pay stipends to participants. However, we will be providing housing & travel assistance to in-person participants (see below).

Q: Which costs will you be covering for the in-person programme?

A: We will cover all reasonable travel expenses (which will vary depending on where the participant is from) and visa assistance, where needed. Accommodation, meals, and drinks & snacks will also all be included.   

Q: I'm interested in trialing some of the material or recommending material to be added. Is there a way I can do this?

A: If either of these is the case, please feel free to reach out directly via an EAForum/LessWrong message (or email [email protected]) - we'd love to hear from you! 

Link to Apply

Here is the link to apply as a participant. You should spend no more than 1.5 hours on it.

Here is the link to apply as staff. You shouldn’t spend longer than 30 minutes on it.

We look forward to receiving your application!



Discuss

You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com

2025-01-30 20:35:05

Published on January 30, 2025 12:35 PM GMT

Many thinkers worth reading wrote in past centuries:

  • 1600s: Bacon, Hobbes, Descartes, Spinoza, Locke, Leibniz
  • 1700s: Berkeley, Hume, Rousseau, Smith, Kant, Burke, Bentham
  • 1800s: Schopenhauer, Mill, Sidgwick

Some of them wrote in English and their texts are usually presented in the original, even when they use archaic spelling, punctuation, and style. (This is in contrast to foreign-language works which are often translated to a modern style.)

For example, in the past there were more uses of the comma, such as

  • Before any relative clause:
    • Dickens: “The objects he had lately pursued, turned worthless beside her”
    • Melville: “This august dignity I treat of, is not the dignity of kings”
  • Before any clause starting with “that” for any reason:
    • Austen: “It is a truth universally acknowledged, that…”
  • After any clausal or compound subject as far as I can tell:
    • J Robertson: “Whoever is capable of forgetting a benefit, is an enemy to society”

And these examples are from the 1800s. The situation was worse in prior centuries when punctuation was mainly prosodic rather than syntactic, i.e. it followed rules of rhetoric rather than logic. For example, Daines’s Orthoepia Anglicana (1640) advised using a comma to indicate one unit of spoken pause, a semicolon for two, and a colon for three. Before widespread literacy, most consumers of a book would be listening to it being read aloud, so prosodic punctuation made more sense, but as literacy rates improved, more consumers read books silently and it made sense to punctuate based on the logic of the sentence.

The website EarlyModernTexts.com modernizes many works (including works from all of the authors listed at the start of this post) so that the meaning can be understood more quickly and accurately. The texts were written by the late Jonathan Bennett and are often assigned to undergrads. Bennett explained his methods in “On Translating Locke, Berkeley, and Hume into English.” Usually I refer to his texts as “modernized” but “translated” maybe makes it more clear that the texts aren’t abridged; his text will line up with the original work paragraph by paragraph and typically sentence by sentence. He doesn’t always indicate omissions but is transparent about any text added.

He has more details on his website about how the texts are modified. Two examples of modernization:

Hobbes: as all sorts of manufacture, so also malice increaseth by being vendible.
Modernized: malice, like everything else made by men, increases when there is a market for it.

Hume: Every one will readily allow, that there is a considerable difference between the perceptions of the mind, when a man feels the pain of excessive heat, or the pleasure of moderate warmth, and when he afterwards recalls to his memory this sensation, or anticipates it by his imagination. These faculties may mimic or copy the perceptions of the senses; but they never can entirely reach the force and vivacity of the original sentiment. The utmost we say of them, even when they operate with greatest vigour, is, that they represent their object in so lively a manner, that we could almost say we feel or see it: But, except the mind be disordered by disease or madness, they never can arrive at such a pitch of vivacity, as to render these perceptions altogether undistinguishable. All the colours of poetry, however splendid, can never paint natural objects in such a manner as to make the description be taken for a real landskip. The most lively thought is still inferior to the dullest sensation.
Modernized: Everyone will freely admit that the perceptions of the mind when a man feels the pain of excessive heat or the pleasure of moderate warmth are considerably unlike what he feels when he later remembers this sensation or earlier looks forward to it in his imagination. Memory and imagination may mimic or copy the perceptions of the senses, but they cannot create a perception that has as much force and vivacity as the one they are copying. Even when they operate with greatest vigor, the most we will say is that they represent their object so vividly that we could almost say we feel or see it. Except when the mind is out of order because of disease or madness, memory and imagination can never be so lively as to create perceptions that are indistinguishable from the ones we have in seeing or feeling. The most lively thought is still dimmer than the dullest sensation.

Notice that he omits the “landskip” sentence as a tangential flourish.

The texts all start with an explanation of his syntax:

[Brackets] enclose editorial explanations. Small ·dots· enclose material that has been added, but can be read as though it were part of the original text. Occasional •bullets, and also indenting of passages that are not quotations, are meant as aids to grasping the structure of a sentence or a thought. Every four-point ellipsis . . . . indicates the omission of a brief passage that seems to present more difficulty than it is worth. Longer omissions are reported on, between [brackets], in normal-sized type.

Below are some examples of the opening pages of a few of the texts (there are EPUB/MOBI files as well). Selected texts have been narrated into audiobooks, including Leviathan, The Prince, A Vindication of the Rights of Women, Descartes’s Meditations, Hume’s Enquiry, and Locke’s Second Treatise of Government.

 

 



Discuss

Should you publish solutions to corrigibility?

2025-01-30 19:52:05

Published on January 30, 2025 11:52 AM GMT

This question is partly motivated by observing recent discussions about corrigibility and wondering to what extent the people involved have thought about how their results might be used.

If there existed practically implementable ways to make AGIs corrigible to arbitrary principals, that would enable a wide range of actors to eventually control powerful AGIs. Whether that would be net good or bad in expectation would depend on the values/morality of the principals of such AGIs.

Currently it seems highly unclear what kinds of people we should expect to end up in control of corrigible ASIs, if corrigibility were practically feasible.

What (crucial) considerations should one take into account, when deciding whether to publish---or with whom to privately share---various kinds of corrigibility-related results?



Discuss

A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology

2025-01-30 17:53:17

Published on January 30, 2025 9:53 AM GMT

This document has been translated at ChinaTalk.media, but the critical technical section (18. to 48.) is not free. So I translated that part.


## Technical Detail 1: SFT

> “There's no need to do SFT at the inference level anymore.”

18. The biggest shock brought by DeepSeek is not open source or low cost, but that there is no need to do SFT. (Note: SFT: Supervised Fine-Tuning, a technique to improve the performance of a pretrained model on a specific task or domain using labeled data.) But this holds only for logical tasks; non-logical tasks may still require SFT. It is an interesting point to discuss -- does this present a new paradigm or architecture that makes training models more sample-efficient, or would the models simply iterate faster?
19. DeepSeek-R1 shows the benefits of using SFT for distillation. DeepSeek-R1 did do *some* SFT, but in the third step, and then RLHF (Reinforcement Learning from Human Feedback) for the final alignment step.
20. R1 is SFT-trained on synthetic data generated by RLHF-trained models, which means there is no need to use a particularly complex method; as long as there is a good enough method, you only need to distill with standard SFT.
21. The essence of GRPO lies in the fact that the base model must be smart enough. One prompt gets 16 rollouts, because it takes a dozen attempts to get even one right answer. R1 showed us that this works: a good base model plus a verifier [see the sketch after this list]. Math and coding work well because they are easy to verify, but theoretically, similar processes can be done for other scenarios and tasks, and eventually a generalist RL model will be realized.
22. R1-Zero got CoT emerging without SFT; the CoT just got longer and longer during training. This emergence is very meaningful, and SFT is just a help: without SFT we still get a model, with SFT we get a model much faster.
23. This incident shows that many small companies in the game for AI models can now use SFT to distill from large models, and just get good small models. However, [though R1 is not a small model], we didn't fully abandon SFT for training R1.
24. Consider a set of infinitely long CoTs generated by an LLM. That set can theoretically be viewed as a Turing machine, and arbitrarily complex computational tasks can be solved by it. A non-infinite CoT that you actually get is essentially just an intermediate computational trace -- an optimized way to iteratively sample potential output. It might get the right result sometimes, and [every time it does, it] nudges the model towards the right result. In essence, the model has to do some non-reducible amount of computation to accomplish some task, and the CoT is simply the intermediate computation that the model has to go through. We can call the final answer an "emergence", but we can also say this is just what computation *is*.
25. Although there is no mention of long context in DeepSeek's paper, from the vibes, the effective context window has increased a lot between R1-preview and R1. I guess it is because they used better Long2Short CoT -- including the CoT used in the third stage of SFT, which was also finally removed during generation. [Comment: I have no idea what this sentence means. The original reads: "including that the CoT used in the third-stage SFT was ultimately also removed at generation time".] The final released version of R1 may have used even cleaner CoT data for its SFT.
26. There are several kinds of data for SFT. The first is the cold-start data, which is more like giving the model a good strategy and a better initialization, so that it can explore better. After all, in the GRPO objective, there is that term [the KL penalty] which encourages the policy to stay close to the starting policy. The second is the synthetic data generated after RL was done [on R1-Zero], which was then combined with other data and used to train `DeepSeek-V3-Base` by SFT. Essentially, each domain has its own data processing pipeline, and the ability that can be learned from this synthetic data was ultimately sourced from the base model. The distillation lost nothing. Putting together data from multiple domains may have led to generalization.
27. I'm not sure about the sample-efficiency of training R1. I guess OpenAI has done similar things for sample-efficiency, such as fine-tuning. [For R1, they actually trained twice.] The first RL-trained R1 was an internal model, not the final R1; it was simply used to generate training data. That generated data was then used to do SFT on `DeepSeek-V3-Base` again, and *that* led to R1. The synthetic data contained 600K reasoning examples and 200K non-reasoning examples. In the second stage, the model might sometimes have received a problem that required reasoning but fell outside the example domains; in those cases, it might still have solved the problem, thus yielding reasoning data. The non-reasoning data is part of the V3 SFT data, obtained by letting V3 impute a CoT. 800K examples is still pretty small, pretty efficient.
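To make the GRPO discussion in point 21 concrete, here is a minimal sketch of the group-relative advantage computation, assuming a binary verifier of the kind available for math and coding. The sampling and policy-update steps (including the KL penalty from point 26) are omitted, and the names are illustrative rather than DeepSeek's actual implementation.

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative, not DeepSeek's code).
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: verifier scores for one group of rollouts of the same prompt,
    e.g. 16 rollouts with reward 1.0 if the final answer verifies and 0.0 otherwise.
    Each rollout's advantage is its reward normalized within the group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 16 rollouts where only one answer verifies. That rollout gets a large
# positive advantage and the rest slightly negative ones, so the policy gradient
# (kept close to the starting policy by a KL penalty) pushes towards the verified trace.
advantages = group_relative_advantages([1.0] + [0.0] * 15)
```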

## Technical Detail 2: Data

> “DeepSeek takes data labeling very seriously.”

28. Scale.AI won't necessarily fail. Now for RL on various domains, most commonly math and coding, we still need expert labels. Data labeling may become more complex, but the market will exist.
29. For training, we hardly see the benefit of multimodal data. In other words, the cost is too high. Today there is no evidence it is useful. In the future, opportunities may be bigger.
30. DeepSeek attaches great importance to data labeling, and I heard that Liang Wenfeng himself did labeling. In addition to algorithms and skills in AI, the accuracy of the data is also very critical. The cost of Tesla's labeling is almost 20 times the cost of China's self-driving efforts. China's self-driving data effort began as a large and comprehensive thing, then kept becoming more and more refined, until the final discovery that they need people with special driving experience and ability. This is what Tesla was doing from the very beginning. Tesla's robot's movements are labeled by people with very healthy cerebellums, so the smoothness is very good, while the smoothness of labels by people hired in China's self-driving effort is very poor. So DeepSeek's investment in data labeling is one of the keys to its good model efficiency.

## Technical Detail 3: Distillation

> “The bad thing about distillation is that model diversity goes down.”

31. If you avoid understanding the biggest technical pain points in model training by just doing distillation, you may be trapped by some technical problem when the next generation of technology is proposed.
32. Big-model and small-model capabilities are not matched. Distillation from a big model to a small model is real distillation, teacher to student. If you try to distill Chinese data from a model that does not know Chinese at all, the performance may drop. But in fact, distillation into small models does lead to a very obvious performance improvement; R1-distilled models would be much better at RL, because they are trained using data that comes from beyond the model itself.
33. The disadvantage of distillation is that the diversity of the model decreases, which affects the upper limit of the model and prevents it from surpassing the strongest model. However, in the short term, distillation is a way forward.
34. There will be some hacky things during distillation. When using RL to train an instruction-tuned model, it would, during the early stage, make up useless ideas first, and then suddenly answer the questions correctly at the end. The reason is that a lot of those odd RL hacks have subtle causes. The model may have memorized a lot of the questions in pre-training, so even if the model is pretending to think, it is in fact just nudging itself towards the problems it memorized. This is the hidden danger of distillation. If we distill without annotation, then when we do Reinforcement Learning with Verifiable Rewards (RLVR), it will lead to the model solving the problem in a simpler way instead of thinking about the problem. OpenAI has not solved this either. It may be a flaw of this generation of technology.
35. You can take shortcuts -- instead of thinking about how to work out the technical solution from your own vision, you can just reproduce it. But this has a hidden downside in the long run. For example, our generation of technology assumes there's no qualitative change in `long context`; assuming that, the upper limit of problem solving may be limited. R1-Zero may be a right direction, and it may be better to do R1-Zero from the start, and it may be better to avoid starting with o1-like data. Following someone else's technical solution may not be good. We should explore more.
36. Other models can also get pretty good results from distillation. In the future, there may be a distinction between the roles of teacher and student in the model ecosystem. Producing models that can be good students might become a viable business model.
37. In terms of distillation and technical route, R1 brings less shock than AlphaGo, but in terms of business, the ability for it to become a breakout success [outside of the AI circle] is much better than AlphaGo.
38. Distillation is divided into two phases. If you just distill o1 or R1 without establishing your own system and verifiable reward, it will lead to more and more reliance on distillation, but it is impossible to distill your way to a generalist model, because you don't get the reward signal, and you don't get the special CoT. Moreover, the first stage of distillation leaves traces. A model distilled from OpenAI's models may carry many annealing scars from OpenAI. Why did R1-Zero get such powers during pure RL? It is directly related to the self-reflection ability of the base model, obtained after annealing.
39. I don't really believe that a model pretrained on purely Internet data with no annealing can achieve such behavior, because there is almost no high quality data on the internet.
40. There are probably only a few top labs exploring exactly how much data and what data ratios are needed for the annealing phase. Both distillation and no-distillation can be thought of as RL. After all, distillation is just [Behavior Cloning](https://en.wikipedia.org/wiki/Imitation_learning#Behavior_Cloning), a form of unlimited RL, but SFT-only has a very low performance ceiling and compromises diversity [see the sketch after this list for what distillation-as-behavior-cloning looks like].
41. The primary-market startups got very excited by DeepSeek. If DeepSeek can follow up with more model iterations, then for a company that's not one of the big ones, using AI allows great flexibility. DeepSeek also distilled a few small versions that can run on a phone. If this direction is proven out, it would raise the performance ceiling on many AI applications.
42. To distill, it is very important to determine what the goal is. OpenAI did no data distillation. You definitely can't get a better model than OpenAI by distillation.
43. In the future, the model may need to learn to skip steps to answer questions, like human beings. Can it increase the ceiling performance of the model under the fixed context length?
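As a concrete reading of “distillation is just behavior cloning” (points 32 and 40), here is a hedged sketch of distillation-as-SFT: sample traces from a teacher model, keep those whose final answers verify, and fine-tune the student on them with ordinary supervised loss. The callables are placeholders for whatever generation, verification, and SFT code you already have; this is not the actual R1 pipeline.

```python
# Hedged sketch of distillation-as-SFT (behavior cloning on teacher outputs).
# `teacher_generate`, `verify`, and `sft_train` are placeholder callables.

def build_distillation_set(prompts, teacher_generate, verify, samples_per_prompt=4):
    """Sample traces from the teacher and keep only verified (prompt, trace) pairs."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            trace = teacher_generate(prompt)   # e.g. a long CoT ending in a final answer
            if verify(prompt, trace):          # rejection sampling on correctness
                dataset.append((prompt, trace))
    return dataset

def distill(student, prompts, teacher_generate, verify, sft_train):
    """Plain SFT on teacher-generated data. No reward signal reaches the student,
    which is why (per point 38) distillation alone cannot bootstrap a generalist model."""
    data = build_distillation_set(prompts, teacher_generate, verify)
    return sft_train(student, data)
```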

## Technical Detail 4: Process Reward

> “The upper limit of process supervision is the human. As for the limit of the model itself? That's outcome supervision.”

44. Process Reward might still work, but Process Reward may be susceptible to reward hacking, i.e., the model doesn't learn anything but can make the reward very high. If you are solving a math problem, and you use a model to generate 1000 generations, there may be none that is close to the correct answer. In that case, any RLVR-like method would fail to train anything. If there is an OK Process Reward at this time, you may be able to get the model close to the correct direction; in that case, Process Reward is helpful. It depends on how hard it is to solve the problem, how reliable the process reward is, etc.
45. If the process reward from PRM estimation deviates from the real reward, then it's quite easy to hack. Process supervision is theoretically possible, but the problem lies in the strength of the process, and how to give the reward based on the strength of the process. Right now, Outcome Reward Modeling is merely the simplest method of extracting the final answer from the model output and matching it against the ground truth label [see the sketch after this list]. Nobody has a very mature way to get a neural network reward model that can't be easily hacked. And self-iteration by the model itself would lead to the easiest reward hacking. Labelling the process data isn't too hard; we could just enumerate it exhaustively. It's just that people don't want to do it. It may be a promising direction.
46. The upper limit of process supervision is the human. Humans can't imagine many weird corner cases. As for the limit of the model itself? That's outcome supervision.
47. The reason why AlphaZero is more effective is that it can judge the winner and loser at the end of the game, and the whole reward can be calculated according to the winning rate. An LLM doesn't know if the stream of generation will eventually get the answer, which is a little bit similar to a genetic algorithm. The upper limit may be higher, but it's also possible that it can't be reached.
48. One of the advantages of AlphaGo over AlphaZero is that the rules of Go are fixed. So now the model starts from math and coding because it is easier to validate. Whether validation is good enough or not will affect the quality of the final RL. The rules have to be good enough, otherwise the model will reward-hack -- the model can satisfy the rules, but the result is not what we want.
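To illustrate point 45, here is a hedged sketch of the “simplest” outcome reward for verifiable tasks: extract the final answer from the model output and match it against the ground-truth label. The `<answer>...</answer>` tag convention is an illustrative assumption, not a documented format.

```python
# Hedged sketch of a verifiable outcome reward (RLVR-style); the answer-tag format is assumed.
import re

def outcome_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the extracted final answer matches the label, else 0.0.
    No credit is given for the intermediate reasoning (the "process")."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, re.DOTALL)
    if match is None:
        return 0.0  # unparseable output gets no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# A correct final answer earns the reward even if the reasoning is shaky: the signal is
# sparse but hard to hack, unlike a learned process reward model.
assert outcome_reward("thinking... <answer>42</answer>", "42") == 1.0
```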



Discuss