2025-05-15 08:00:01
Low Level Technicals of LLMs: Daniel Han
Welcome to the AI Engineers world. This is the first workshop. There are a few others running, but thanks for coming. We just arrived from Australia with my brother. I think he’s over there somewhere. Yes, we just came here. We didn’t know a lot of stuff about SF and I think maybe the US is a bit different from Australia. But yeah, we’re very excited to be here. We’re going to stay here for a few months, so if you want to meet up, you can just hit me up via email or Twitter or wherever.
So today I’m going to be talking about low-level technicals of language models. Yes, yes, I’m Daniel. We do have a website called Unsloth. If you want to look that up, there are cute sloths and stuff. My brother designed that. We’ll be using two tiny URLs today. The first one is, oh wait, the slides are at tinyurl.com/unof. Hopefully, that works. There’s also a Q&A section, so I’ll be monitoring Q&A. You can type any question that you like, and I will be answering questions as we go. That is at tinyurl.com/unofQA. If those two work, they’ll be at the bottom. If anyone doesn’t get this, we’ll reshow these links.
You might know me from my tweets. Google released Gemma, an open-source model, a few months ago. We just found a few issues and bugs across different implementations. For example, the first tweet that we ever did was about an approximate GELU bug. Multiple implementations of Gemma differed: some of them used exact GELU, some of them used approximate GELU. So, which one is correct? That’s the question. We just tweeted about this. That was our first issue that we found. We thought this was just one issue, but actually, there were many issues, and we found more bugs. I’m assuming maybe you know me from this. We did become partially recognizable through our Gemma bug fixes.
Today, we’ll be showing you how you can actually find these bugs and issues in language models and how you can analyze this yourself, without us just doing it manually ourselves. Hopefully, this can be an open-source project where everyone can find these issues automatically and help us solve them. I always thought about: can we automate this? I don’t think it can be automated. There are actually many issues with these implementations, and it’s not just Gemma. For example, we also analyzed Grok, and there are some weird things in their code, like they scale the attention logits by 30 * tanh(x / 30). It’s just a clamping mechanism. You can see I also make mistakes sometimes. I said it was division and not multiplication, so sometimes I misread the code. That’s because when the code gets released, I quickly try to analyze it, and sometimes I mistakenly say stuff. I have to showcase corrections, so yes, I’m still human.
We analyze lots of models and stuff like this, and hopefully by the end of this workshop, you will actually learn how to find bugs and stuff like that. Another one I did recently was Nvidia’s Nemotron. I don’t know if you saw this, but Nvidia released a 340 billion parameter model, which is extremely large. I’m assuming this is in preparation for Llama 3 405 billion, right? They had to do this quickly, but there are some weird and interesting things, like they used squared ReLU and not the normal SwiGLU. They were actually the first model trained using these other activation functions. There are other weird quirks and stuff like that, and hopefully, you’ll be able to analyze it: whenever the code comes out, just read it, and you’ll get it. It does take some practice.
The first time when I read this code, it took me many days to read through all these architectures and understand exactly what they are, but now it takes me like 10 minutes. So I’m sure you can just read the code. That’s the whole goal of today. Language models, if you don’t know, are not just about issues and bugs and analysis of these architectures. The tokenizer is a totally separate beast from language models. Tokenization is extremely annoying. There are different types of tokenization issues, like Mistral, Llama, Mixtral—different variants of Mistral from the Mistral team.
If you notice, the smiley face is a space, and if you tokenize them, depending on the model, you’ll have different results. The question is, which one is correct? Unfortunately, I do not know. I did ask the team about this, and according to them, some of them are correct, and some of them are just because the Mistral team forgot to update the model to the fast tokenizer variant. We will be talking about this later as well, but you can see, even before you can train or run the model, the tokenizer is broken.
It’s a multi-pronged problem. We don’t just do language models. Our experience is broader than that. I used to major in maths and computer science. Yes, very fun. Actually, I kind of did very badly in math, but it’s very interesting. I don’t know if anyone has done normal machine learning here. Yes, there are a few people. The SVD, I’m assuming most people know PCA, principal component analysis. Yes, it’s a very powerful technique. More people should know about it. SVD—okay, I don’t know if people know about SVD. It’s a bit less well-known.
I’m confused why people don’t know SVD. It’s actually one of the most important algorithms in all of math and computer science. It literally underpins many applications and is extremely important. I’m a huge proponent of telling people to learn more about SVD, so please do. The singular value decomposition is a must. That’s the most important algorithm. It’s like one algorithm that can spawn many other algorithms and can be used for many purposes.
There’s also the QR decomposition. Okay, probably no one knows the LU. There’s a lot of randomized SVD; yes, that’s extremely important as well. We don’t just do language models. You can ask me any questions about math or computer science you have.
Do you think for the Nemotron 340B, is it a unique architecture, because you can only use the NeMo loader to load and train it? I think the data is just the most valuable part. We are attempting to convert it to Hugging Face Transformers safetensors, but we’ve had issues because we don’t have the modeling file. So I was wondering, do you think that it’s similar to—they uploaded a 70B of Llama 3 that’s Nemotron as well. Do you think that we can get clues on how to build a Hugging Face implementation?
Yes, the question was: for Nemotron, the code was not released for the actual inference. For training, you have to go through the NeMo training framework. What I mean is that I can dump the weights, yes, but the code—yeah, I was actually planning on doing something like that, but we didn’t have time to do that. I might take a crack at it.
Yes, there will be Q&A, so anyone can raise their hands and ask questions. I will just repeat the questions. There will be a Slido if you want to submit questions anonymously. So I will keep monitoring.
The other one was another paper: “LoRA Learns Less and Forgets Less.” It shows that fine-tuning via LoRA does not work as well for learning new knowledge. Well, it depends on how you read the paper. Some components were incorrect; they didn’t actually train all the linear layers—they forgot a few.
You need to do some specific parameters to make this work, and we will also be talking about that later. I was trying to show you we don’t just do language models. We have a wealth of knowledge across different topics, and you can ask me any question that you like.
We launched this last December. My brother did. It’s a bit outdated, but we have 11.9k or something. I don’t even know now, but we launched this last December. It generally makes fine-tuning of language models like Llama faster—two times faster, generally speaking. We have about 80% less memory usage now. We have some new methodologies that reduce memory even further, and the trick is there is no degradation in accuracy.
We don’t do approximations; that’s the purpose of optimizations. We don’t want to lose any accuracy, and we do Triton kernels. This is from OpenAI; it’s a language to do CUDA programming. Essentially, it’s like an intermediary between the CUDA code and Python itself.
We’ll be showing some Triton code. I don’t know if we have time for programming Triton, but that’ll be another topic. The purpose of Unsloth is to make everyone able to fine-tune their language models with very bad GPUs, like Tesla T4s. Does anyone know that Google Colab has free Tesla T4s?
Yes, right? 65 Tera flops; it’s actually not that bad if you use it properly. Reminder: there’s a common misconception that the P100s on Kaggle are faster; that’s actually not correct. I think P100s are five times slower than Tesla T4s. Although it’s more expensive as a GPU, it’s actually slower. Please do not select the P100s on Kaggle.
Kaggle gives you 30 hours of free GPU time per week, and you get two Tesla T4s, so that’s about 130 teraflops combined. That is actually very powerful. I think it’s about the same as an RTX 3070, but I can’t remember exactly. Kaggle has 30 hours for free per week. Google Colab depends on how much you use; normally you get four hours per day.
The pro is not that bad—it’s like $10 per month. You can actually get a decent setup. You could use runpod and Lambda Labs and stuff like that; I guess that’s another option, but we do share a pricing comparison. You need to be careful when you use GPUs.
There’s a big issue like, “Oh, look, I want to use an H100.” Did you check how much flops that H100 provides? Be careful of Nvidia’s marketing—it’s times two because it has sparsity. Just be careful of that and also be careful of the flops when it’s like float 8 or float 16. I do have a pricing comparison where we normalize by the flops with no sparsity.
We looked at Lambda Labs, Runpod, Google Cloud, and I think Runpod is mostly pretty good. Yes, back to the sparsity question: the sparsity feature allows you to take 50% of the weights and make them go to zero. Nvidia essentially allows you to run this two times faster by not doing matrix multiplications on the zeros. Two times zero is just zero, so you essentially don’t fire the transistors, and this makes it two times faster.
That’s a higher-level overview but essentially, you compress the matrix into this special format, and this Nvidia special format allows you to do multiplications two times faster. It’s on H100s and it’s on A100s as well. Your RTX 3060 and RTX 30 series have that feature. If you want to enable it, the biggest issue is that most companies do not train their language models with sparsity enabled.
If you set weights to go to zero, you will ruin the behavior of the model. There are papers that show that you can turn on the feature, and then you can do fine-tuning to make it work. In theory, you could enable this, but it depends on what models are released from large companies.
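As a rough illustration of the 2:4 pattern described above (not Nvidia’s actual API, and the function name is made up), here is a minimal PyTorch sketch that zeroes the two smallest-magnitude weights in every group of four:

```python
import torch

def prune_2_of_4(W: torch.Tensor) -> torch.Tensor:
    # Illustrative sketch only: zero the two smallest-magnitude weights in every
    # group of four, which is the 2:4 pattern the sparse tensor cores accelerate.
    # Assumes W has a multiple of 4 elements; real use also needs Nvidia's
    # compressed format plus some fine-tuning to recover accuracy.
    groups = W.detach().clone().reshape(-1, 4)
    smallest = groups.abs().argsort(dim=1)[:, :2]   # indices of the 2 smallest per group
    groups.scatter_(1, smallest, 0.0)
    return groups.reshape(W.shape)
```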
I’m assuming Facebook has implemented sparsity in their PyTorch and xformers libraries, and I think they might focus on sparsity because you get two times faster. If you know OpenAI, they keep saying “it’s two times faster” for a reason.
I wonder why; is it due to sparsity or float 8? Float 8 is generally two times faster, albeit not exactly but approximately. When you hear “two times faster,” where does that come from? Could it be these things?
Yes, any other questions? Just remember you can raise your hand or wait. Are there any questions? I’m assuming there are no Slido questions yet. Just raise your hand.
For Unsloth, we do benchmarking against Hugging Face plus Flash Attention 2, and we show our benchmark. This is kind of old already; the memory reduction is much more now. We did a blog post with them, so thanks to Hugging Face for the collaboration. All you need to do is from unsloth import FastLanguageModel, and we try to make it as easy as possible for people to fine-tune language models.
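A rough usage sketch of that import; the exact keyword arguments and the example model name are assumptions and may differ between Unsloth versions:

```python
from unsloth import FastLanguageModel

# Rough sketch; argument names/defaults may differ between Unsloth versions,
# and the model name is just an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```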
We’ll be talking about Unsloth a bit later. There’s a question: Is it a myth or a solid hypothesis that linear versus cosine learning rates, for one to two epochs versus three to five epochs, is highly generalizable? I think it depends.
For training methodologies—linear versus cosine, short epochs versus long—is there a best way to train any standard model? I think it depends. There are some research papers that show cosine or linear schedules work better, but it depends. To tell the truth, it’s a toss-up; I don’t think the learning rate schedule is that important.
A lot of it should depend on the dataset and the number of parameters. Research papers show that if you change from tied weights to untied weights, you can get better accuracy for smaller models.
I think the learning rate schedule is not that important. You might get accuracy plus 0.1%; just train for more data. There we go—get more data. Just train for more data.
To tell the truth, I think it’s best to do small experiments and test which schedule is the best, but I don’t think it’s that important. The number of epochs is actually important. For these big companies, to tell the truth, I’m not sure what Llama is—like is it 15 trillion tokens?
Is it actually 15 trillion tokens, or is it like 5 trillion tokens per epoch? I do not know. These questions are very important: if it’s 5 trillion tokens for 3 epochs, that’s very different from 15 trillion tokens in total.
Generally speaking, if you train for more epochs, three is generally a good approximation. One is actually the best for pre-training generally. You shouldn’t retrain your data multiple times.
Did you have a follow-up question? Well, basically, the learning rate was one of the big issues you fixed with the Gemma implementation. Oh yes. That’s where I think my pitfall was when I was training my 2B Gemma. I actually trained it before your fix, and somehow it turned out that the benchmark after your fix was better—I don’t know what happened.
Now it’s like one of the highest-ranking models—do you have any theories about what could have happened? I trained on the Transformers broken version and then subsequently used Axolotl.
We used a heavily reduced learning rate, but it turned out surprisingly well. We also didn’t use Unsloth, but we used H. After the fixes, the performance improved significantly.
It does appear perplexing. Before the fixes, ours was usable, although everyone else’s was unusable. After your fixes, we are now one of the top companies on the open leaderboard.
That is quite shocking. If you change the code and fix all the issues and it does better without the need to retrain it—that’s a very interesting phenomenon.
Language models are active areas of research, and please someone do research on that—suggestions are valuable. I just read the code and fix the bugs; I do not know.
We also do long-context fine-tuning. We show that if you use a new methodology with gradient checkpointing and offload it to system RAM, you can roughly increase your context size by four times. The weird part is, if you offload correctly to system RAM from the GPU, the execution time is only slower by 1 to 2%.
If you do non-blocking copies and offload the GPU memory to system RAM correctly, it’s not slower. Some implementations, unfortunately, offload incorrectly. I don’t want to name anyone, but offloading incorrectly can lead to issues.
Please try to offload to memory first, and then disk. Disk is extremely slow, and if you can offload to system RAM, you can actually get away with a lot of memory usage.
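A minimal sketch of that offloading idea in plain PyTorch: copy the activation into pinned system RAM with a non-blocking copy so the transfer can overlap with GPU compute. The helper names are made up for illustration:

```python
import torch

def offload_to_cpu(activation: torch.Tensor) -> torch.Tensor:
    # Copy a GPU activation into pinned system RAM with a non-blocking copy,
    # so the transfer can overlap with ongoing GPU compute.
    buf = torch.empty(activation.shape, dtype=activation.dtype,
                      device="cpu", pin_memory=True)
    buf.copy_(activation, non_blocking=True)
    return buf

def load_back(buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # Bring the activation back to the GPU when the backward pass needs it.
    return buf.to(device, non_blocking=True)
```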
Okay, I should have put this on the first slide, but today we’ll be covering approximately three topics. I wanted to make them into three separate topics, but I guess I just mixed them together. You’ll learn about the low-level technicals of language models, for example: backpropagation, why the training complexity of Transformers is O(n squared) instead of O(n cubed), and there’s a lot of maths.
I will try my best to explain as simply as possible. The whole goal of the workshop is that you will actually understand the maths and formulas well. Just a reminder, I nearly failed my maths class in university, so do not worry; do not be scared.
We will talk about fast fine-tuning: the best tips and tricks for fine-tuning. How do we write the fast kernels for fine-tuning? How do we make it two times faster with 70% less memory and no accuracy degradation?
We’ll talk about some Triton—OpenAI’s Triton language—and stuff like that. We’ll be finding and fixing bugs, and this will be a constant theme: how do we find and fix bugs in Llama or Gemma models.
We’ll be talking about the mixture of experts as well, but maybe it depends on time. We’ll do a lot of bug hunting, bug fixing, and by the end, everyone will be a fantastic bug hunter and bug fixer. We can essentially open-source our effort to fix open-source models for everyone here.
Oh yes, we also have stickers. I don’t know where they are, but my brother has some stickers, and we bought a few which look pretty cute. You can wait; my laptop has some. I put them on my laptop, and they’re pretty cute. My brother has them, and we’ll be handing them out as well at the end.
Let us start with the Transformer. What is the Transformer? I’m assuming everyone knows what the Transformer is. Does anyone here not know what the Transformer is? Yes or no? You can simply raise your hands.
Yes, the Transformer is just the architecture that is behind all language models. GPT-4, GPT-3, you know, Llama, Mistral, Gemma—all these open-source models rely on the Transformer. The Transformer is essentially an architecture that seems to be very good for sequence modeling.
It’s not just for languages; it can be for any sequence modeling. For example, Sora can be a Transformer. Well, not just a Transformer; it’s probably a Transformer plus diffusion, but it’s generally a Transformer. There are other types of models too; it does not have to be language modeling. It’s just sequence modeling.
I will show some pictures later. I probably should have explained it a bit better, but just assume Transformers are the methods behind all language models.
GPT-4, GPT-3, GPT-5—I don’t know if anyone knows what GPT-5 is, but I’m assuming it’s a Transformer. Transformers are good at learning new knowledge, injecting knowledge into the model. They’re good at changing the weights to fit the training data.
The GPT-2 architecture was very popular as the main decoder-style Transformer and was reused by adding extra components to it. This new architecture is called Transformer++. I don’t know if people have heard of this, but Transformer++ is the GPT-2 Transformer architecture plus rope embeddings plus SwiGLU plus RMS layer norm, with no bias.
I think it is untied weights, but I’m not sure. Transformer Plus Plus is the architecture that most people think is the best Transformer architecture for now. There are probably some other tweaks and small things that Transformers can still do, but in general, this would be considered the best architecture.
The architecture looks like a list of math equations. I just wrote down the entire Transformer architecture. This is Llama 2’s Transformer architecture in one slide. All you need to do is get some inputs, do some layer norm, do some rope embeddings, use some attention plus some residual connections, and you essentially repeat this middle section L times or many times.
That is the Transformer architecture. The next part is math equations. I’m not sure if the math equations scare anyone. I’ll explain each one.
Hopefully, I try to make the equations reasonable. In theory, if you write this down in PyTorch, you actually have a working implementation of a Transformer architecture. We’ll talk about each component separately.
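As a minimal sketch of that “middle section repeated L times,” here is a simplified pre-norm decoder block in PyTorch. It is not the exact Llama code: LayerNorm stands in for RMSNorm, rope embeddings are omitted, and the MLP is simplified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    # Pre-norm block: norm -> attention -> residual, norm -> MLP -> residual.
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff, bias=False),
            nn.SiLU(),
            nn.Linear(d_ff, d_model, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d_model)
        b, s, d = x.shape
        h = self.norm1(x)
        # Split into heads: (batch, heads, seq, head_dim)
        q = self.wq(h).view(b, s, self.n_heads, -1).transpose(1, 2)
        k = self.wk(h).view(b, s, self.n_heads, -1).transpose(1, 2)
        v = self.wv(h).view(b, s, self.n_heads, -1).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self.wo(attn)              # residual connection
        x = x + self.mlp(self.norm2(x))    # residual connection
        return x

# Repeat the block L times, as described above (L = 4 here just for the demo).
blocks = nn.Sequential(*[DecoderBlock() for _ in range(4)])
out = blocks(torch.randn(1, 16, 512))
```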
Is anyone scared of the math? No? Yes? No? Okay, very good. Let me check if anyone has questions. Does anyone have any questions?
From my understanding, from the layer level for Transformers, there are almost 80 layers in a grid. I’m not actually familiar with that. Can you please explain what you mean by Cosmic Flinko?
I’m not that smart, so please explain what that is. Am I supposed to search for that? That sounds like an arcade game I played on Windows XP. You know how you visualize it? If you take it to visualize it, yes, and visualize the map for example.
Let’s take Llama 3, which has 32 layers. There is layer zero and layer 32, which is the output layer. When you drop a prompt into this machine—which is not actually cosmic, it’s just maths, but it’s cosmic to us.
I don’t think we understand 50%. I think we understand a lot, but this is not Cosmic. You are the expert here.
You’re trying to say it’s kind of like a game, making it easier for people to visualize the maps. You’re right. I talk about that in the slides, but I think it’s more like an analogy.
It’s like you’re going through a structure. Each layer has its own ‘fashion designer,’ whose job is to make changes or suggestions for others, and each layer will modify based on the changes proposed by the previous designer. I think that’s more like how a Transformer works.
I guess that analogy brings a clearer explanation. Is it the arcade game from Windows XP? I still played it before, I just can’t remember.
When I put a subscript, technically everything is a matrix. If you see any summation signs with subscripts, that generally means row-wise or column-wise. The small W generally means vector.
In general, everything that’s capital is a matrix. Why is it a matrix? Because it’s faster. You could convert this all into vectors, but for speed purposes, it should be matrices.
Next, why did I put these sentences, like “hello, my name is Daniel”? Does anyone notice any similarities or differences between these sentences?
Okay, except for the first sentence, is there anything else? Just saying, random stuff is interesting. Yes, okay. “Hello” and “hi” are the same thing but with different words.
Okay, yes, semantic embeddings essentially show the relationships between words. The “king” plus “woman” corresponds to “queen,” for example.
So, what this could turn into is that you can visualize the number of embeddings and their relationships with math and their positioning in the model architecture.
If you consider punctuation as paired with the word (like “hello,”), you could treat that as a separate component. If you ignore spaces, then the first sentence has just five components.
What does the second one have?
This is one method for crafting a tokenizer. We just invented one: split the sentence into components and combine each component with its punctuation before processing them.
Remember the purpose is to convert this into numbers since computers only understand numbers, not words. Each token must have an ID assigned to it.
For example, “hello” has an ID of 0. “My” is ID 1, and “name” is ID 2. If you don’t assign these IDs, the computer doesn’t know what you’re actually doing.
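A toy sketch of that ID assignment, using a simplified whitespace split (the tokenizer we just invented keeps punctuation attached to words, which this sketch ignores):

```python
# Toy sketch: assign each new token the next free ID.
sentence = "hello my name is daniel"
vocab, ids = {}, []
for token in sentence.split():
    if token not in vocab:
        vocab[token] = len(vocab)
    ids.append(vocab[token])
print(vocab)  # {'hello': 0, 'my': 1, 'name': 2, 'is': 3, 'daniel': 4}
print(ids)    # [0, 1, 2, 3, 4]
```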
We just invented this tokenizer—it’s not perfect, and there are issues. For example, we included punctuation for the words, which isn’t helpful.
For example, “Michael” with an exclamation mark could cause confusion, so how would you suggest fixing this?
They will have their individual IDs, and the relationship between different tokens forms a vocabulary. So we would need to establish a set of rules for how to develop a vocabulary that minimizes unnecessary variations between word variants.
Stemming is another way to solve the issue; for instance, “skipping” could just become “skip.”
The idea is to reduce the vocabulary. If we generally lowercase more, that could remove a lot of issues as well.
Does anyone have any suggestions for handling upper and lowercase? Generally, capitalized words imply that they start a sentence. Lowercase implies a word that occurs in the middle of a sentence.
Good idea for normalization, but there are challenges associated with this to maintain semantics, particularly with the tokenization process.
Yes, that’s a great way to build up the vocabulary in practical use.
The goal is to understand how to create effective tokenization without losing essential semantics.
Now let us just look at one sentence: “Hello, my name is Daniel.” Assuming our tokenizer is useful, let’s also assume punctuation is combined.
The question now is if I select the first token “hello.” A language model should predict “my.”
In other words, while you can use multiple tokens, each time you need to confirm that you have effectively predicted the next component.
If you shift it up by one place, then “hello” is aligned with “my,” and you need to define the end of the sentence accordingly.
Machines don’t like gaps, so we need to replace missing components with a representation like an EOS marker.
The EOS token acts as a filler so that every position still has a label, and the model also learns that the sentence ends there. The key idea is that shifting the sequence by one gives you the labels for free: the input is the sentence, and the target is the same sentence shifted one position to the left, with an EOS token filling the final gap. That is the whole trick—you do not need to enumerate every prefix separately; the shifted sequence, combined with the masked attention we will get to later, gives you all of the next-token predictions in one pass.
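A toy sketch of the shifting trick with an EOS filler; the token IDs follow the toy vocabulary above and EOS_ID is made up for illustration:

```python
# The IDs come from the toy vocabulary above; EOS_ID is a made-up filler token.
tokens = [0, 1, 2, 3, 4]          # "hello my name is daniel"
EOS_ID = 5
inputs = tokens                    # what the model reads
labels = tokens[1:] + [EOS_ID]     # what it must predict at every position
for x, y in zip(inputs, labels):
    print(f"given token {x} -> predict token {y}")
```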
Why are some tokens in the vocabulary left unused? Model creators leave some holes in there in case you want to train the model for more: you can use one of those unused tokens for your own purpose. Does that kind of answer your question?
So when you do tokenization, assuming you don’t encounter these problems, you won’t have any issues. But if you do, then there are problems. Yes, so for example, if you do llama 3 fine-tuning, if you use the base model for llama 3 and you accidentally use one of those tokens, you will get errors for fine-tuning. Right, so you have to be very careful.
And so, I think what we did is, for untrained tokens, we actually find these untrained tokens first, set them to the mean of all the embeddings, and you won’t have these issues. So I think that’s actually a model creator problem. They probably should not have set it to zero. I don’t know why they did that, but anyway, they should have set it to a normal distribution or just some random initialization.
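A simplified sketch of that idea (not Unsloth’s actual code, and the function name is illustrative): find embedding rows that are still all zeros and reset them to the mean of the trained rows.

```python
import torch

@torch.no_grad()
def fix_untrained_tokens(embedding: torch.nn.Embedding, eps: float = 1e-16) -> int:
    # Rows that are still (near) all-zero were never trained; reset them to the
    # mean of the trained rows so fine-tuning does not blow up.
    W = embedding.weight
    untrained = W.abs().amax(dim=1) <= eps
    if untrained.any():
        W[untrained] = W[~untrained].mean(dim=0)
    return int(untrained.sum())
```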
Yes, okay, any other questions? Okay, oh yes. Oh yeah, yeah, you can put a beginning of sentence token. I just didn’t do that. You should put a beginning of sentence, as most language models would. I put the end of the sentence. You should probably put a beginning of sentence as well. That’s actually very important. Most models do that now. They found that to be very helpful.
To be honest, I don’t think it’s actually that effective. I think the beginning of sentence token came from the old style, the CLS token. I think it was the first token for BERT style. They had a classifier token at the very start. I think it was at the very start. I’m not 100% sure, but I think that’s where it came from.
I don’t think the beginning of sentence token actually makes that much difference, but you should put it. You should give the model more freedom to move. It is always better. Yes, I probably should have put a beginning of sentence, but for demonstration, I did not do that.
Okay, we did that, right? So like the green one, right? The attention block is kind of encoding the stuff that we described: predicting the next word based on the previous words. The attention block is that first part. The MLP block is just the mixing component. This is kind of the Transformer architecture, visualized. You just repeat this L times. That is a Transformer.
Now the other question I always have is: why is training language models not O(n cubed)? Because aren’t you given the word “hello” and predicting “my”? And now we have “hello my”, you’re predicting “name”, and then you have “hello my name”, and you’re predicting “is”, and so on. Shouldn’t that be the training data? Why is the training data just the shifted pairs—“hello” predicting “my”, “my” predicting “name”, “name” predicting “is”, “is” predicting “Daniel”?
Right, this is the training data that you actually see. Why is it not this? Can anyone know? Why? Sorry, the complexity?
Yes, the complexity is very bad. So if the sentence is 100 words, what do you think? How many prefixes? Quite bad, yes. It’s 1 plus 2 plus 3 plus 4 plus 5 all the way to 100, which is n(n+1)/2—for n = 100, that’s 5,050 prefix examples from one sentence. So it ends very badly.
And that’s if you have one sentence. If you have 10 sentences, oh my. But does anyone know why language models don’t need to do this? Like, we don’t actually need to do this. So like, we can skip essentially. Instead of having this as the training data, your training data is simply “my name is Daniel” and shift it by one up, and that’s your training data. Why is it not this?
Oh yes, we haven’t talked about position encodings yet. Yeah, okay. But you actually don’t need position encodings. Oh, okay. Yeah, the attention mechanism, yeah. Because of the mask, that’s the answer. Yes, it’s because of attention masking. Specifically, mask attention, right?
That’s a trick. Okay, we’ll be talking about that a few times. I’ll give you the code again, well, actually the math formulas for Transformer architecture. Right, so like in the attention block, we will now talk about the attention block.
So Z is equal to softmax(QKᵀ / √d + M) V. And as you mentioned, it is the attention mechanism which allows us to skip the O(n cubed) complexity and make it O(n squared). Why? Because remember, we want to mask out future tokens because we don’t want to predict on future data. So by using this mask, weirdly, this mask allows you to train more efficiently.
It’s funny because attention is O(n squared), so the longer your sequence is, the worse the complexity. But actually, there is a special trick which you use to mask, and this actually makes attention not that bad.
Instead of doing “hello” to predict “my” and so on as separate training examples, the attention mask does this for you: you don’t need to explicitly construct all the prefixes predicting the next word.
Okay, this is okay. Probably should have addressed that. So we will now talk just about the attention itself. Right, so like softmax QK over root DV. Just a reminder that whenever you see QK, it refers to queries and keys.
I do not like that explanation. I would like this to be a math approach. So my view is to give the matrix X, which is your embeddings. Remember, “hello” is a vector of numbers. You multiply this by some weights WQ, WK, and WV, and you get back QKV. Q is query, K is keys, and V is values.
But that’s a vague interpretation. I don’t really believe it. I don’t really trust those interpretations. It’s not that clear. Just assume it’s just math. Get your X matrix, multiply by the weights, and you get Q, K, and V. That’s my view.
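Putting that view into code, here is a minimal PyTorch sketch of softmax(QKᵀ/√d + M)V with a causal mask: a single head, no batching, and the weight matrices passed in explicitly.

```python
import math
import torch

def causal_attention(X, WQ, WK, WV):
    # X: (n, d). Q = X @ WQ, K = X @ WK, V = X @ WV, then
    # Z = softmax(Q Kᵀ / sqrt(d) + M) V, where M puts -inf above the diagonal
    # so each token only attends to itself and earlier tokens.
    Q, K, V = X @ WQ, X @ WK, X @ WV
    d = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d)
    n = X.shape[0]
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

# Tiny usage example with random weights.
n, d = 5, 8
X = torch.randn(n, d)
Z = causal_attention(X, torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
```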
So that is kind of, so like if you see why… does anyone know why it stacked it like this? Like why did the presenter present it like this, specifically? Why is it presented like this?
Any composition? Composition decomposition? Interesting, okay, that’s a very interesting point. But no, correct, I just… yes, that’s correct. So I just lined it up such that it’s easier to see. And if you take the matrix X and you multiply by WQ, you will get Q, right? This is actually the correct math.
And so I like people to visualize Transformers as math. In my view, it’s easier. Okay, I’m not sure for other people, but my view is easier. I do not like it when they say queries and keys, and you’re trying to do values. I don’t know what that even means.
Anyways, the yellow components are the ones you want to train. X is what you want to train. WQ is what you want to train. WK and WV and QKV are just the components afterward.
When you have the Q and the K, all you need to do is when you do K transpose, you transpose the matrix, and you do Q times K transpose, and you get this big square matrix called QK transpose. Right? Hello, my name is Daniel and so on. Right? So like that’s kind of what I want to visualize.
When you do Q times K transpose, you get a square matrix. And all you need to do now is divide by root d and do the softmax. Softmax essentially means you normalize each row so it sums to one: you take exponentials and divide by the sum of the exponentials.
Do you know why you should do that? And why should you use softmax? Any clues? Yes? Yes? Okay, that’s the answer. Yes, but like why?
Why? Sorry, when you multiply them, you can get NaNs. Oh yes, very good. That’s okay. Do you know how to fix that? Close, you have to subtract the maximum of each row. That’s how you fix it.
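A small sketch of that fix: subtracting the row maximum before exponentiating leaves the softmax output unchanged (the factor cancels) but stops exp() from overflowing.

```python
import torch

def stable_softmax(scores: torch.Tensor) -> torch.Tensor:
    # Subtract the row maximum before exponentiating so exp() never overflows;
    # mathematically the result is identical because the constant cancels out.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    exp = torch.exp(scores)
    return exp / exp.sum(dim=-1, keepdim=True)
```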
Yes, oh yes, very good. Okay, yes, we want to sample from that. Okay, sample from that distribution. But what happens if you don’t do the softmax? Doesn’t this still work or not? Like what happens if you just do QK transpose over root D, remove the softmax? Like why do I have to do softmax?
Yes, interesting that you can fix that with like minus max of the row as well with exploding. Anyone else? Okay, what happens if you don’t have a nonlinearity? Do you have to use softmax? Can it be something else? Could it be something else? Yes, it could be.
Yes, that is another area of research which people should focus on, which is like why do we need to use softmax? Generally speaking, research papers show that it’s actually the most accurate. If you use other activation functions, it might actually not be that accurate.
Right? So like, um, but this also is the bottleneck of Transformers because it’s a softmax. It does the row sum of the exponentials. This means that you can’t actually decompose this. Right? You can’t actually bring the matrix multiplications out.
So if someone can find ways to make this faster, you know, you could get millions of times faster. Okay, maybe much more than that. But yes, and V is just… remember V comes from here, right? So we just take V, multiply it up again, and we get this matrix at the very end, and that is right.
That is the final component. This empty box is what you get out from the attention mechanism. For the layer norms, I don’t really want to explain too much, but the layer norm essentially takes the square of all the elements per row, takes the mean, square roots it, and divides the row by that.
All this does is just normalize the row to make it easier for the language model to learn. Right? So like why do people do layer norm? It just makes training easier.
It’s more stable. There are some theories, like with batch normalization, about handling shifts in the data distribution and out-of-distribution data. I just like to think of this as an optimization method. It just makes training easier and more stable.
Layer norm is simply, as I said: you take the X matrix, take the mean of the squares per row, square root it, then divide by that and multiply by some weights—a vector of weights—and that’s just layer norm.
You don’t worry too much about what layer norm or what it does. It just makes training better and more stable. Please add as many layer norms as possible.
Yes, add layer norms everywhere, and you’ll make training much better. Okay, I probably… okay, I don’t know if you can see this, but in Triton, right, in order to write Triton code for the layer norm, this is the forward kernel. We will not be talking about Triton today, but it’s actually not that complicated if you read more intensively.
Ignore all of the components. There are only very few lines for the layer norm. It’s actually not that complicated. The rest is simply just how to load the data.
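For a rough idea of what such a forward kernel looks like, here is a simplified RMS-style layer norm sketch (not Unsloth’s actual kernel); one program handles one row, and BLOCK is assumed to be the next power of two above the row width:

```python
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd(X, W, Y, stride, N, eps, BLOCK: tl.constexpr):
    # One program normalizes one row: load it in float32, compute
    # 1/sqrt(mean(x^2) + eps), scale by the trainable weight W, and store.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    rstd = 1.0 / tl.sqrt(tl.sum(x * x, axis=0) / N + eps)
    tl.store(Y + row * stride + cols, x * rstd * w, mask=mask)

# Launch with one program per row, e.g.:
# rmsnorm_fwd[(n_rows,)](X, W, Y, X.stride(0), N, 1e-6, BLOCK=triton.next_power_of_2(N))
```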
Um, it’s actually not that hard. Yeah, the backward kernel is when the problem comes. How do we actually do the differentiation of the layer norms? Right? Remember, you have to train the W. Right? It’s yellow.
You actually have to train that. How do we find the derivatives of the W? It is very complicated. If you want to learn in your own time, you can have fun learning the derivatives. Um, it is extremely complicated.
Because there are like sums, there are like row sums… how do we do the derivative of a row sum? It can get quite complex. I wanted to talk about backpropagation today, but I thought it was probably too heavy of a topic, so no backpropagation, but I do have tutorials on that.
So if you go to the Triton tutorial, I followed that; that’s actually quite helpful. The backward kernel is just very, very problematic. Now, up to the rope embeddings, why do we do rope embeddings? Does anyone know what the rope embedding is?
Yes, it’s a way to extend… so you could use the rope embeddings to extend context. Yes. Do you know how? How does it extend context? How would you use rope embeddings to extend context? How would you do that? I would create… basically what I would do is create… kind of…
You just multiply the base by two, and then you get two times longer context. You multiply the base by ten. The problem is looking for like one million contexts. Right? Then the model, a part of it, is trained at like 4,000 tokens, correct?
So that’s where the rope might kick in. So is that the dynamic variant? Well, it’s dynamic either way. So how would you solve the problem if you want to train with one million context length but your dataset is only 1,000 words? How would you think of solving that problem?
Because like some people have said they do 10 million context length. Are there any datasets with 10 million tokens? How would you deal with that?
Oh no, no, but that’s 15 trillion tokens for the dataset I mean. Like, how do we do long context training? Remember, when you do long context training, you have to have a document with at least 10 million words to learn how to predict the 10 million-plus-one token.
So, how I would solve the problem would just be to gather a better and more diverse dataset. Yes, that’s the ideal. So what happens if there is no dataset that is 100 million tokens?
Then what would you do? I would synthesize it. How would you synthesize if the model… it’s like a chicken-and-egg problem. How would you do synthesis? No, no, no. I would create, like CLA or any of the state-of-the-art models, with like Laura, and then basically train…
But are they trained on 10 million tokens? If the model itself wasn’t trained on 10 million tokens, does it still work?
If I was to try to solve this problem for a client, for example, let’s say their code base is 1 million tokens or they want a 10 million-something context or whatever. Right, then I would basically create a synthetic dataset. Not synthetic, but a derived dataset from what we have.
Okay, interesting. Assuming we do not have… But I can’t assume that we have no data.
Right, so good point. I don’t know. I think it remains to be seen. Like many claims by companies of 10 million context or 100 million context—I question that. I’ve only seen 1 million actually work, so I mean, yeah, and that brings me to attention.
Right, okay, okay. Now, we’re going into—okay. Yes, okay. Okay, no, no, no, that’s fine. I was asking the questions, but okay. Wait, the question was like, what is rope embedding?
Someone did mention positions. What does that actually entail? What do you think is the point of rope embedding? All it does is tell the model to learn the position of the words.
Right? So like “hello, my name is Daniel”. It actually has meaning. Like “hello” is the first token, but then if you put “hello” as the third token, what’s the difference? There is a difference, right? So like depending on where the word is in the sentence, it matters.
So the whole point of rope embeddings is to learn where the position of this component is. Old styles used absolute position encodings. Rope embeddings do some special tricks, like using cosine and sine and some sort of special rotation.
The paper found that if you do rope embeddings, it actually has high accuracy, and you know everyone uses rope embeddings now.
You mean lower, sorry. The position at the very beginning was lower. Yes. There is, I think, BERT did not— I don’t know. Did BERT use rope? I don’t think BERT used absolute positions. Yes, that’s the problem.
I think BERT used absolute positions. I don’t remember anymore. But yes, exactly. Rope did not exist. This paper shows that previously people used absolute position encodings that simply just add a position.
You can literally just add, like if the position is zero, just add zero. If the position is one, add one. If the position is two, just add two. That’s literally what they do. Well, actually, not exactly, but you know what I mean. Right?
You have to divide it by some sort of normalizing factor. If the position is 30,000, don’t add 30,000. Right? You would ruin training, but that’s kind of what they do.
And what they show is if you do rope embeddings, you can essentially increase accuracy somehow, and we just use this as truth. We just treat this as true, and everyone uses rope embeddings now.
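A minimal sketch of that rotation idea, using the split-in-half variant that Llama-style code uses (the original paper interleaves pairs instead); real implementations apply this per attention head to Q and K:

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with dim even. Each pair of channels is rotated by an
    # angle that grows with the token position, encoding position as a rotation.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Increasing `base` is the kind of knob used to stretch positions for longer
# contexts, as discussed above.
```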
In that case, do you have an opinion on YARN versus rope? So, YARN is kind of rope. So, YARN just does… I’m assuming YARN is rope. But I’m not an expert on that.
Doesn’t it? It does like… I don’t think I can comment on this, because I’m not an expert on that. But you can see the literal activations. Yes, yes, the activation ones.
So basically… can you think we can figure out why this works this way? Because you said it’s kind of an open question. But do you think that using tools like Transformer Lens, where we can look at training activations or not activations but like steps, we would have to like…
I’m not sure if I explained it correctly, but do you think mechanistic interpretability is a path to understanding this?
Good question. Could mechanistic interpretability… okay, it depends. I think my view is… if it is specifically on the topic of layer norms, it just makes training more stable. I don’t think it has any significance. That’s my view.
Okay, that’s fair. I think the math equations don’t show that it has any meaning. I just find it to stabilize training. There were papers, like… what was the one? Batch normalization? I forgot what the term was.
Yeah, there was a theory which showed that batch normalization reduces problems of out-of-distribution data and internal covariate shift or something. That was the phrase.
Does anyone know what that means? There was a video on that. Does anyone know what that means?
Layer norms—I think when you do layer norms, if you don’t do layer norms, let’s say you take the number 2, multiply it by 2, you get 4. Remember, there are 32 layers, right? If you multiply by 2 continuously, you will get infinities in the end, right? Because you go out of the float32 scope.
So what layer norm does is make your numbers go back to a good scale. So if you do 2 times, it’s 4. Let’s divide it by 4 to go back to 1. Right? So now it’s 1 again. If you multiply by 2, it’s 2 again. Let’s divide by 2 again to go back to 1.
So all layer norm does is make the scale go back to a good scale, so your numbers don’t diverge on both sides. Does that kind of answer your question?
Okay, any other questions?
So, remember the decoder style. Oh, wait, I think we actually kind of finished reviewing the Llama architecture. There’s nothing else to do.
The decoder, right, you do this 32 times. Remember, like four decoder layers in self-attention. You do this 32 times. I think it is 32. I can’t remember. Multiple times. That’s the decoder.
You just apply this multiple times. Do a layer norm, and finally, you get your logits, which is your LM head. Right? This outputs the probabilities of which token.
Remember, we’re trying to predict the next token. We output probabilities for each token, and that is called the LM head.
Where is the forward function? The forward? Right? There’s a forward. Always with the forward. You go through the model, and then you… okay, remember, ignore this. Right? Ignore this.
Your LM head, that’s just one line, one line. Okay, one line. And then you upcast to float.
Now, another question people have is why do you have to upcast to float? Does anyone know why you have to upcast to float?
Any clues? Have a guess. Make this bigger. Have a guess. Have a guess. Why do we have to upcast to float?
Sorry? Gradients? Okay, close. Why? Why gradients? It is related to gradients somehow.
Okay, it’s for training stability purposes. So the softmax, you should always upcast to float32 because it makes training more stable. If you take the derivatives and gradients, if you do not use float32, you might get NaNs, as well.
Remember, exponentials can be very large, so you want to use float32, which has a larger range and precision than float16. Float16’s maximum number is 65,504.
Right? But for float32, the maximum is a huge number—about 3.4 times 10 to the power of 38. So that’s why you have to upcast to float32. This just makes training more stable.
So all of these tricks are just to make training stable.
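You can see those ranges directly in PyTorch; the small demo below shows e^12 overflowing float16 but not float32:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.float32).max)   # ~3.4e38
x = torch.tensor([12.0], dtype=torch.float16)
print(torch.exp(x))                     # inf: e^12 ~ 162755 overflows float16
print(torch.exp(x.float()))             # ~162754.8, fine in float32
```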
You can do more parameters if you want. You can do 300 times, up to you. That just makes your model ten times larger.
So like when you hear, like, you know, Llamas. Yes, the weights you train when you take the tokens… you go through the architecture, and it changes the tokens, and these tokens keep shifting to some new direction, and you keep doing this.
If you do it more times, you get a larger model. The problem is you have to train more weights. So each iteration has different weights, correct?
Yes, each iteration has 32 different weights for each layer.
Okay. And so like, yeah, normally people just… if you see, like, you know, GPT-4—what is it, like one-point-something trillion parameters? I’m assuming there are more layers, larger embedding dimensions, larger this, larger that, more layers.
Normally speaking, the more layers you do, the model can learn more. So that’s the whole reason why you wanted to add more layers—you just want to increase the capacity of the model to learn.
Again, this is to make training more stable.
And so this… remember the shifting trick that we did in PyTorch. The shifting trick is just this, and that’s the shifting trick.
That’s the thing that makes it learn to predict the next token. And then you pass through the loss function, the cross-entropy loss which we discussed.
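A minimal sketch of that shift-plus-cross-entropy step in PyTorch; the function name and the ignore_index convention for padded positions are illustrative:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                    ignore_index: int = -100) -> torch.Tensor:
    # logits: (batch, seq, vocab). Shift so position t predicts token t + 1,
    # and upcast to float32 before the softmax inside cross_entropy.
    shift_logits = logits[:, :-1, :].float()
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=ignore_index,
    )
```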
That’s the Llama architecture, and the rest is not useful. In theory, you could write the entire Llama architecture in, like, I think, 50 lines or something.
The rest is just unnecessary bloat. This is 1,600 lines of comments and stuff like that. But this is for Hugging Face’s implementation. It’s highly respected, and this is what you should look at first when you read a new architecture.
So we just kind of went through the Llama architecture. Hopefully, you can kind of get a sense of it. Obviously, if this is your first time reading the source code, it’s actually not that hard. It’s not that complicated. You just have to see which components are important and which ones are bloat. Components you can ignore, right? It’s not that scary. Yep, does that kind of get it? Or you guys kind of get that feel. We’re going to do more; obviously, this is the first one. Any questions? No, not really, other than more tokens.
I think they changed some of the numbers, like how many numbers you use to represent each token; they changed that. Larger vocabulary—they did a much larger vocabulary and more tokens. Other than that, no, there’s no change at all.

Yeah, the reason why it’s funny is I used to work at Nvidia, so why shouldn’t I be writing CUDA? The reason is I see CUDA as extremely annoying to write, and if you want to optimize for just Nvidia hardware, okay, go ahead, you can do CUDA. But my view is, I don’t think that’s going to be forever. So as a safety precaution, let’s just do Triton, right? Let Triton handle compiling down to CUDA or AMD or Intel or whatever, and treat it as the intermediary. If you want to get like 10% faster, yes, you should do CUDA, but it’s only 10%. If you do fine-tuning two times faster, you’re already nearly at the ceiling; you can only go so much further. So if you want to go the extra mile, yes, I’m more than happy to welcome you to do that, but I do not like it—it’s funny because I used to do CUDA all the time, but I don’t suggest it. You will get more performance, though, but I don’t suggest it.
Yes, question—oh, sorry, yes? What never dropped? Yes, you don’t, yeah. So, Triton—you write it in Triton, then it compiles down to CUDA. Yeah, sorry, wait, actually it could work. The only problem why it doesn’t work on AMD is Triton. I think if Triton works on AMD, we work. If Triton and xformers—Facebook’s flash attention library—work on AMD, then we work. But anyway, it depends on those conditions. So if AMD has those working, then yes, in theory, you can remove xformers and just use scaled dot product attention. So there’s only one dependency, which is Triton. I think some people have gotten it to work, so it depends.
Yes, I kind of have been answering that. I’ve trained on a MI3 Instinct with one card, and it worked with AMD. So, okay, I mean if Triton works, then yes, it just works. So, I just have an answer—sorry, okay, good, you answered. Yeah, okay, yeah, but we don’t—so officially we did not support AMD, but I guess it works. Okay, that’s interesting.
Yes, okay, what’s next? Where is my—is it Gemma one? Yes, okay, so we’re now going to talk about Gemma bugs. So if you go to the blog post—we wrote a blog post about all the issues that we found in Gemma. For example, you must add a BOS token. There is a typo in the paper. Yes, so we don’t just find bugs; we have to read the paper first to understand the model. Now, the problem is sometimes when people release models, they don’t release papers. That is very painful; that happens a lot now. So, please, model creators, please provide papers, otherwise it gets more complicated.
There are also some other issues, and we have a Colab notebook which covers all of these. So if you open up the link—as a reminder, if you don’t have access to these slides, it is tinyurl.com/unso. Right, that’s the slides. If you open up the Colab notebook, this is actually runnable in Colab; please log into your Google account for this to actually work. We show the log L2 norm. So for each layer number—there are 18 layers—we check every single layer’s output against the actual good implementation.
So the DeepMind implementation with the Hugging Face one, with the PyTorch one, with the other ones, and if you do the L2 norm, you find that the error is very high. What we showed is that you can actually move the error down by doing multiple changes. Right, so each line, you can see; there’s like multiple lines. Each line is actually a method that we apply to make it better, right? So, like we finally found that approximately either the blue line or the black line makes training much better.
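A rough sketch of how such a per-layer comparison can be computed, assuming you have already collected matching per-layer outputs from a reference implementation and a test implementation:

```python
import torch

def log_l2_errors(reference_layers, test_layers):
    # Both arguments: lists of per-layer activations from two implementations
    # run on the same input. Returns log10 of the L2 norm of the difference
    # per layer, which is the kind of curve plotted above.
    errors = []
    for ref, test in zip(reference_layers, test_layers):
        diff = (ref.float() - test.float()).norm(p=2)
        errors.append(torch.log10(diff + 1e-20).item())
    return errors
```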
Does anyone notice any interesting things about that? This graph—anything interesting? Do you see the—you know, so remember each line is a fix that we did, right? So like there’s many lines, and we did a fix, and it changed the error. We selected the black line to be the final one. Does anyone have any—what is like anything interesting?
Yes, so one of them caused a huge jump, and that is the RoPE float32 fix that we did for all architectures. Yes, and the other ones are less prominent. But anything else—anything else interesting? Yes, yes.
Fantastic, why? I do not know, and that is a good question, and I don’t actually know. I think it’s just language; I have a theory. The theory is—yeah, but unfortunately, I can’t say everything. I mean my theory is—and there was also a jump as well in the middle. And the blue line, you know, it starts from very low; it goes up very high, and everything does this, right? So there is this some weird transition boundary in the Gemma model, right? And so I’m just going to guess. My guess is that when you train a transformer, the later layers get harder and harder to train, right? The earlier layers actually get very easy to train.
And so this transition boundary is when the model probably was not really trained that well—it’s just guessing. Maybe the model should have been trained for more data, and the boundary should disappear. This is just my guess. So there is a phenomenon that essentially, like more data—the model, the last layers are much harder to train, and that’s kind of my theory. I don’t think that’s correct, but okay.
Yes, right, last one—yeah, exactly. So in the end, the question is: why do we choose the black one then? Why don’t we choose the green or blue line? So that’s adding the exact GELU fix that we found. If you add the RoPE fix plus the exact GELU, you get the blue line, but in the end we decided to go with the black line. And why do you think that is? We did not choose the blue line; shouldn’t we have chosen the blue line, right?
But with the final—after all the fixes that we did, so essentially the answer why we did not choose the blue line is because there was not just one error; there were two errors—there were many errors. And all of the errors combined together, we finally chose the black line because it matches the original implementation. So remember, the trick is you have to match the original implementation of whatever the Gemma model creators did. So you kind of just look for this error.
Maybe, like if someone chose different fixes that we did, you can probably get even a lower training loss, I guess you could. But we decided to choose the black line because that’s what the original implementation did. Any other questions?
Oh, I’m talking about the weights. So the weights are the ones—the model weights are the ones training, right? So the rest you don’t actually train; it’s just the weights itself. Yes. So remember the goal of a transformer is you want to predict the next word, right? So the sentence “Hello, my name is Daniel.” You’re trying to predict “Hello,” predict “my,” predict “name,” and so on. You have this data, correct? Like you have just taken novels; you shove in the novels; you’ve essentially created data out of thin air.
And then you change these weights using back propagation, do derivatives, and try to change these weights such that you get the highest accuracy. This training procedure is called back propagation. And so, like I was trying to show you, how do we actually derive the derivatives? When you do back propagation, you need to derive the derivatives. Just use PyTorch—PyTorch will do the derivatives for you. And yes, but does that kind of answer your question or—and?
Okay. Yes. Yes, yes, yes. Actually, yes, Unsloth actually has that, so you can actually, depending on your layer—for now, what we do is the embedding and the final layer—you can use different learning rates. We found that if you train the embedding weights and the LM head with a learning rate a factor of ten smaller, you can actually get increased accuracy.
So, yes, you should change the learning rates for each layer, but people don’t actually do that. I think it’s because if you set a learning rate for each layer beforehand, you’re kind of introducing subjective bias. So that’s why people just set one learning rate for all the layers. But in this case, I’m just going to guess. This might be a Transformer thing—this is not just for Gemma; this is for all Transformers. Maybe layer-wise learning rates could work; I think there are some papers which do that—layer-wise learning rate methods. Hopefully that answers your question.
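A sketch of how layer-dependent learning rates can be set up with optimizer parameter groups; the 10x-smaller factor for the embeddings and LM head follows the suggestion above, and the Hugging Face-style parameter names are an assumption:

```python
import torch

def make_optimizer(model, lr: float = 2e-4, embed_scale: float = 0.1):
    # Give the embedding matrix and lm_head a ~10x smaller learning rate than
    # the rest of the parameters, via optimizer parameter groups.
    embed, rest = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (embed if ("embed_tokens" in name or "lm_head" in name) else rest).append(p)
    return torch.optim.AdamW([
        {"params": rest,  "lr": lr},
        {"params": embed, "lr": lr * embed_scale},
    ])
```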
Yes, it's a log L2 norm. So we take the DeepMind implementation, code it up correctly, then take the other implementations—PyTorch, Hugging Face, even DeepMind's own other implementations—and check each layer. You compare it with the original implementation, check the error, and that's what I graphed. Your goal is for the error to go to zero; you want it all the way at the bottom, not very high.
And that's a log scale, right? So the error is not a small number—it can be 1,000—and every step you go down is an order of magnitude. I essentially logged it; if you did not log it, it would look very bad. Does that make sense? Okay, any other questions?
Yeah, so let's say there's an issue in the tokenization part—a fundamental thing—or we find some optimization, and you have to change the way you're tokenizing. Would you have to retrain your models to accommodate this? This actually happens a lot, very frequently. For example, TinyLlama—someone trained TinyLlama, and with training already about 80% complete, they found a tokenization bug. So it happens very frequently, and it depends on what you want to do.
I think it depends on the model creator. If you already spent millions of dollars, maybe just keep training with the bugged version—it should still work, hopefully. Yeah, so in theory, OpenAI would have a lot of difficulty shifting if somebody found a more optimized tokenizer or something like that. They would have trouble shifting to it because they would have to retrain everything, correct?
So just leave it. If you've already spent billions of dollars, it's probably not a good idea to retrain. So even for a 2x optimization, they would have to retrain everything from scratch. But that's why I think you should do small-scale experiments: get a smaller model, train it on less data, test it, see if the accuracy is good, and then scale it up.
Yeah, any other questions? Okay—yes, so there's a notebook where we show step-by-step exactly what we did, and you can inspect the code for Gemma—modeling_gemma. Okay, maybe I should just go to Hugging Face itself.
You can actually find this—you go to Gemma, and you go to modeling_gemma.py. Oh, okay, maybe I typed it wrong—maybe I did two L's; my bad, I always get confused on that.
Oh, what is this? This is interesting. So we wrote comments inside the code—for example, "Llama does this." If you go to Hugging Face's code for Gemma now, we tried to write some comments so it's clearer why things are done this way.
And so for example, the layer norm: you have to be careful about where you upcast and downcast. We write this in here. Where is it? I'm pretty sure I've read it somewhere—no, it is here. Okay, it's a bit unclear; I need to make this bigger.
Okay, it's a bit blurry, but you can see that in Gemma you actually have to upcast to float32 everywhere. You must use float32 everywhere because the original implementation used float32, right? So you must always follow the original implementation; if you don't, you will get somewhat worse results. And the problem was other implementations just copied Llama's and Mistral's code, and they did not do this.
So we found that you actually have to upcast correctly: you upcast immediately, and then you downcast at the very end. We wrote a few comments—Llama does the layer norm in float16, whilst Gemma needs float32. So there are small little issues of downcasting and upcasting.
Another question is like why do we have to do downcasting? Does anyone know why—like why is there always downcasting, upcasting, float 32, float 16, float 8? Does anyone know why we have to do downcasting? Yes, correct; it’s for faster speed. Do you know how much faster?
So, float 32 to float 16—what do you think the speedup is? Who said 2? Okay, good guess. Why did you guess 2? Well, that's a lot. Float 8—approximately 2? Actually, it could be more. So float 32 to float 16 is actually not 2x; it's much more—I think it's 5, or is it 6?
The reason is that the representation of the float is different. So float 32—let me pull up the floating-point representation page on Wikipedia. Or maybe I'll go to bfloat16. Where is bfloat16?
Yes, right, there it is. Oh, there are more pictures now—they edited this; I didn't. Okay, this is new. I didn't see the AMD fp24 format or the Pixar one before. They have some weird formats now. This is float 32, right? And in float 32 the exponent has eight bits, and the fraction has 23.
And when you do matrix multiplication—does anyone know how to estimate the number of transistors you need for float 32 multiplication? It's a formula related to the exponent and the fraction. What do you think the formula is? Have a guess. So for bfloat16, the fraction is 7 bits and the exponent is 8; float 16 has 16 bits total, split between the exponent and the fraction.
The exponent is used for the dynamic range of the number. So if you want larger numbers, you need a larger exponent. This means bfloat16 has a range of roughly 2 to the power of the exponent—I'm not saying exactly 2 to the 8, but just assume it's about 2 to the 8, okay?
And float 32 also has an 8-bit exponent. There is another format, float 16, which has a 5-bit exponent, and the fractional component is 10 bits. So all of these you can trade off: how many bits do you want for the exponent, how many for the fraction? You must also include the sign bit.
And the trick is it must fit in 16 bits. So you could have an exponent of 1 and a fraction of 14; that could also work. But does anyone know how many transistors you need for float 16 multiplication, for example? And bfloat16—remember I said float 16 was around five times faster—that's actually not right.
I think it's even more. What is the formula? Have a guess—how many transistors do you need to do a float multiplication? It's a formula related to the exponent and the fraction. The answer is approximately the exponent plus the fraction squared.
So what does that mean? Float 16 is 5 and 10, and float 32 is 8 and 23. So float 16 is not two times faster; it is much faster. For float 32 it's 8 plus 23 squared, so approximately 537 transistors for a float 32 multiplication.
And what about bfloat16? It's 8 and 7—this is Google's format—so 8 plus 7 squared is 57. So how many times faster is that? Yeah, it's actually about 10 times faster, right?
So float 32 to bfloat16 is around 10 times faster. Float 16 is 5 plus 10 squared, which is 105, so bfloat16 is approximately two times faster than float 16, although in practice no one really notices much difference. But in general bfloat16 is faster. So that's why it's not two times faster going from float 32; it's more like 10 times faster.
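Here's that back-of-the-envelope heuristic written out—the "exponent plus fraction squared" rule of thumb from the talk, a rough proxy for multiplier cost rather than a real transistor count:

```python
# Rough multiplier-cost heuristic: cost ~ exponent_bits + mantissa_bits ** 2.
formats = {
    "float32": (8, 23),
    "float16": (5, 10),
    "bfloat16": (8, 7),
    "float8 (E4M3)": (4, 3),
}

fp32_cost = 8 + 23 ** 2  # = 537
for name, (exp, frac) in formats.items():
    cost = exp + frac ** 2
    print(f"{name:14s} cost ~ {cost:4d}  (~{fp32_cost / cost:.1f}x cheaper than float32)")
# float32 ~ 537, float16 ~ 105, bfloat16 ~ 57, float8 ~ 13
```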
And that's why you should use Tesla T4s, as I said, because they have tensor cores, which do float 16 multiplication very efficiently. And do not use P100s—P100s do not have tensor cores.
Yes, question? Yes, float 8. So for float 8 there are two common formats—E4M3 and E5M2. Let me check Wikipedia—it's under minifloat—yeah, there we go, right?
So you got to decide—remember, if you want to have eight bits, you get to decide how many you want to do for the exponent, how many you want to do for the fraction, or the mantissa part, right? You get to decide. And depending on the company, you know, it’s unclear; there’s no standard.
So this one's E4M3, right? Sign bit, 4-bit exponent, 3-bit mantissa. So 4 plus 3 squared is 13. So float 8 is around four times faster than bfloat16 on this heuristic. But in practice it's more like 2 to 3 times; it's not going to be four.
The reason is that when you pack that many multipliers in, you also have to account for energy and data movement; there are other transistors involved. So approximately it's two to three times faster. That's float 8. Can you go even lower? Yes. Why don't we go to one bit?
Well, you pretty much have to keep the sign bit, so you can't really do one bit—hence 1.58 bits. Some people have been talking about two bits; two bits could be possible. The problem is that two-bit training is problematic. Okay, let's try two bits.
So what do you want? How many exponent bits? Zero. Remember, you have to have a sign bit—that's the most important one—so exponent zero, fraction zero. And remember it's the fraction squared plus the exponent, so the cost comes out to about one.
Okay, 10 times faster? I don’t think so. Okay, maybe two bits is probably too low. Maybe four bits. Four bits could work. Yeah.
Yes! Oh, that's just because they wanted to do that for easier calculations. Like their 32—it turns out 32 is not really 32. Nvidia's TF32 format is actually 19 bits. That's the trick: they like to do marketing and say it's 32, but it's actually 19.
Yes, that's why it's the same. Okay, any other questions? Did someone else raise their hand? Okay, but yes, I was going to say you can do four bits—Nvidia's new GPUs, the B100s, do have four-bit support.
So that is approximately two times faster again. Let me just try four bits on the heuristic—it's probably something like 2 plus 2, around six. It's not going to be that much faster because, as I said, there are power and data-movement transistors as well. You can only go so far; the jump from float 32 to float 16 was the really large one.
Quick question: for example, the one-bit—that 1.58-bit thing—would that be an example? So it's different—I actually had a tweet about this—1.58-bit and float 4 are about the same in terms of the number of transistors. You'd rather use float 4. The reason is that with 1.58-bit you have to do more manipulation to make it work; you have to use things like the straight-through estimator.
It's a horrible mess—you'd rather just use float 4. Float 4 and 1.58-bit are similar, and you'd have to create your own base model—if you replicate the paper, that is—which most of us have never done, right? That would be people like Teknium and Nous Research, probably. It does work somewhat, though.
I mean, yeah, it's one bit—well, 1.58, but it's essentially one bit. I think the paper calls it one bit; they like to call it one bit. Yeah, but my question is—obviously I don't know who works where, but most of us have never built a base model, yes?
Well, you could—yeah, you can with enough GPU power. But that one-bit work—they even had a really great tutorial. Do you think that—I'm just asking for your opinion on that.
I don't think 1.58-bit will be the future. I think Nvidia's focus is on float 4; float 4 might be the final precision. I don't think you can go much faster than that. I think float 4 is the final one—no more. So we won't be getting that much faster GPUs after that.
I don't think so. I mean, with float 4 they don't actually do everything in float 4 anymore; it's something like float 6 for the gradients and float 4 for the activations. It's very weird. You could do float 3 or float 2, but it's diminishing returns.
In Arm silicon, though, there have been advances like super-low-precision fixed point. Is it called fixed point? I know it has fixed point. Oh well.
Yeah, so—just like the Snapdragon X, the new one—yes, they have that, and it's customizable as well. Okay, so the SDK is broken; you have to pass the—anyway, this is why you can technically run Mixtral 8x7B on your phone at 20-something TPS: you can use UFS 4.0 flash storage and subsequently use that as memory.
But the thing is, then you're running at two-bit precision, which is why you get the memory reductions. There are actually papers showing that if you do two bits for the MLP plus four bits for attention, that's actually the most appropriate. You can do that; it's not an invalid approach—it actually works.
It works; most people did that, I think. Yes, question. Okay, two kind of related questions on precision. First one: why must you have the sign bit? You don't have to, but it's generally standard practice to have a sign bit.
In theory, you don’t have to. The only problem is if you don’t have a sign bit, your numbers will be 0, 1, and 2, right? But then what happens if you wanted to—like you’re trying to not make the model learn negative directions anymore? You could do that.
I don't know if there are papers on that; maybe you should write a paper about it—train a model that way and see what happens. Okay, and the related question—yeah, about softmax: isn't it just a normalization, so why does it matter? The reason is that when you do softmax, you have to normalize by the sum of the exponentials.
And if you take the exponential of 10, you already get a large number, and that probability will take over the entire sum. No, no—it's the exponentials divided by the sum of the exponentials.
Yeah, but the big exponential dominates, right? Yes, that’s the problem, though. If you do that, then your model’s not learning; you’re just trying to learn to predict one token. Why don’t you just predict that one token then, like the largest one that you did?
That’s kind of what you’re forcing the model to not learn anything—that is why you have to subtract the maximum. That’s a trick that we showed, like minus the maximum, and then you can reduce this effect of this one token, or this one issue. So it’s for training stability purposes.
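A minimal sketch of that subtract-the-maximum trick (illustrative NumPy, not any particular model's kernel):

```python
import numpy as np

def softmax_stable(x):
    # Subtract the maximum before exponentiating so large logits don't overflow
    # and one token doesn't blow up the sum; the result is mathematically identical.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

logits = np.array([10.0, 1000.0, -5.0], dtype=np.float32)
print(softmax_stable(logits))                  # well-behaved probabilities
print(np.exp(logits) / np.exp(logits).sum())   # naive version overflows to inf/nan
```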
I don't know if that answered your question—okay, probably it didn't, but okay. Yes, that is a good question. To be honest, I do not know. I don't think it changes too much. Layer norms—if you upcast, it's probably a small effect. But the reason you need to upcast is that Gemma did it that way, so you have to do it.
Remember, the trick is you must follow what the original implementation does. Any other questions? Okay, there are some other issues which we showed. It’s funny because it’s all about upcasting, downcasting, and stuff like that. Each implementation does its own thing.
Unfortunately, how do you actually analyze this? You have to open three screens: the DeepMind one, the Hugging Face one, the Keras one. You open up the three implementations and go line by line through what each of them does. And then you have to guess which one is the correct one.
The guessing part is the most painful, so you have to like inquire—you ask Hugging Face which one’s the correct one, you look at the paper which one’s the correct one, you assume the DeepMind one’s correct, and stuff like that. So there’s like some human component you have to guess. Guessing, so that’s probably why it can’t be automated, right?
These error checking things cannot be automated because there’s a human there which made these decisions, and so you have to—now you have to decide which one— which of those decisions did they choose. And you can’t really automate this away, I guess. You could automate this by doing the methodology which we described—try all combinations and see which one has a lower error, I guess you could do that.
But remember, you must have the original implementation first. That is a problem, so there's a chicken-and-egg problem. The RoPE precision—this is the one I was talking about. Upcasting RoPE—it matters in all architectures now; you must not downcast RoPE. If you do, you will get wrong results.
So previously on the left side, if you see 8192, 8192, 8192—that’s the positions—um, that’s definitely incorrect. What does that mean? Like, do you know why that’s incorrect? 8192, 8192, 8192—does anyone know why? Remember, this is positions. Why is it—why is it all the same? Like, does anyone know why this is very bad?
Essentially the three words end up with the same position, right? 8192 is a position—and what is the other big error here? There's actually one more. Let's assume the maximum sequence length is 8192. Then what is position 8192? It's out of bounds. Remember, positions are zero-indexed in Python, so 8191 is the correct last position.
So if you correct this, you get 8188, 8189, 8190, and 8191—all the positions are distinct. The whole point of this problem is that we're using float 16 for faster training. Remember, float 16 is how much faster? Yes, around 5 to 10 times—something around there, right?
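Going back to the positions: here is a minimal demo (not the actual RoPE code) of why float16 collapses nearby integer positions—float16 only has a 10-bit mantissa, so integers above 2048 can no longer all be represented exactly:

```python
import torch

positions = torch.arange(8188, 8192)   # 8188, 8189, 8190, 8191
print(positions.to(torch.float16))     # tensor([8188., 8188., 8192., 8192.], dtype=torch.float16)
print(positions.to(torch.float32))     # exact: 8188., 8189., 8190., 8191.
```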
That is why you have to do this, and these are the issues that pop out because of this issue. We’re trying to make training faster, but then these issues come up. And the GELU one which we described before, this was the first bug that we found. Actually, I think this is the main reason why we were trying to look through bugs. We found that, “Oh, look, there’s this bug in GELU in the activation function.”
And so the point is, Keras used approximate GELU; the PyTorch version used exact GELU. Hugging Face also used exact GELU. And the question is which one is correct—is the exact GELU correct, or the approximate GELU? So what's the difference between the exact and the approximate GELU activation function?
Where is it—I don't know if this page has both the exact one and the approximate one. Okay, that's not the right page; that one's even worse. Okay, whatever. Where is GELU? Oh wait, I have to find it.
Right—yes, so the exact GELU is this one; there's an error function in it—okay, my screen is not rendering it properly. But essentially what you do is use Desmos. What I like to do is literally plot the two on the same graph.
So you type in x/2 times—literally type this in—and yes, Desmos has the error function. The exact GELU is x/2 times (1 + erf(x / sqrt(2))). Now you type in the complicated formula for the approximate one: x/2 times (1 + tanh(sqrt(2/pi) times (x + 0.044715 x³))).
Wait, is that right? Oh, you're right—I put the square root in the wrong place; it's the square root of 2 over pi, and the tanh has to wrap the whole expression. Okay, let me fix that.
Oh, okay, I probably have to play around with this. Oh, there we go. Right? So the blue line and the red line—they're basically the same thing. But what's the difference? I don't know if people know this, but in Desmos you can actually take derivatives, d/dx. Did anyone know this?
You can do d/dx of each one, and they generally align—the exact GELU and the approximate GELU generally align. And guess what, you can also do integration: the integral from minus infinity—oh, did I spell it wrong? Oh.
—to infinity, right? I think this works; I'm not 100% sure. You take your exact GELU and subtract the approximate one—hmm, I don't think this is working.
Oh yes, it works! Yes, it works! So what you do is you can take the integral of minus infinity to infinity, so the entire line, minus exact GELU, and the approximate GELU, and you do DX. There is a difference, right? But the difference is very small, right? It’s like 10 to the power of -16; it’s very, very small.
And when we write fast kernels, I generally use this approach—you can do integration and derivatives right in Desmos. So I highly recommend Desmos. And doing this is how we found the problem: oh, okay, there's some sort of issue here.
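The same comparison works in a few lines of PyTorch if you prefer code to Desmos—a quick sketch using the standard exact and tanh-approximate GELU formulas:

```python
import math
import torch

def gelu_exact(x):
    # 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

x = torch.linspace(-6, 6, 10_001, dtype=torch.float64)
diff = gelu_exact(x) - gelu_tanh(x)
print(diff.abs().max())        # largest pointwise gap, roughly on the order of 1e-4
print(torch.trapz(diff, x))    # signed area between the two curves
```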
And if you fix it—remember, the GELU fix does do some effect; it does do some effect. But remember we only showed there were only very small effects, so it’s not that useful. The Rope fix was the most important, right? The Rope fix actually caused issues, so you must fix that, and that’s the most important fix that you must do.
Finally, there are some other things. Depending on the precision you use, there is actually a difference between float 16 and bfloat16. Remember we showed before that with the fixes we did, the lines sometimes go back up? Well, if you do float 32, everything actually works.
If you use float 32 precision, the lines don't really separate, but once you use float 16 or bfloat16, the lines behave differently again. So this is just a phenomenon of using faster, smaller precisions, and that is why you have this problem.
But if you use full precision, you get good results. And the fine-tuning notebook for Gemma also works. Gemma is two times faster and uses around 60% less memory—it's more now. So if you run this, remember you have to connect your Google account, and you will get this to run.
Any questions on the Gemma one? Okay.
Yes? Okay, yes. Um, where did I put the picture? Oh wait, it’s in the blog post. Yes, that’s fine. Um, wait, where did I put it? Oh, it’s the first picture, right? Yeah, this one, right?
So the x-axis is the layer number. Gemma has 18 layers, so the x-axis just indicates which layer it is. The y-axis is the log L2 norm. What you do is take the original implementation—DeepMind's—and take the Hugging Face, PyTorch, and the other implementations, and check the output of each of them.
You run the model, take the output of layer one from the reference and the output of layer one from the other implementation, and you just compute the error. This is just the error, on a log scale. When it's log scale it looks better; when it's not log scale, it looks very bad.
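A minimal sketch of that comparison in code, assuming you have already collected per-layer outputs from a reference implementation and another implementation (for example via forward hooks):

```python
import torch

def layerwise_log_l2_error(ref_outputs, other_outputs, eps=1e-12):
    """ref_outputs / other_outputs: lists of per-layer output tensors from the
    reference implementation and the implementation being checked."""
    errors = []
    for ref, other in zip(ref_outputs, other_outputs):
        err = torch.linalg.vector_norm(ref.float() - other.float())  # L2 norm of the difference
        errors.append(torch.log10(err + eps).item())                 # log scale so the plot is readable
    return errors
```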
Does that make it clearer? You're taking the output—yes, the output of each layer. And that's for Gemma. For Phi-3, it's similar: you open up the Phi-3 implementation and read through it.
And most of you can probably go through the Llama code and just read it. In general, remember to delete the parts of the code you don't need. You will see there are differences in Phi-3—they use other methodologies, different upcasting and so on. But there was a weird thing that we found in the config file.
I'll show you the Phi-3 config. Okay, just use the instruct version. Whenever you look at a new model, always read the config file—config.json. When you open it up, it tells you all the tricks you need to know about the model architecture.
It tells you the EOS token ID—32,000. When you look at this: hmm, is that a good idea, 32,000? Okay, that's fine. And the pad token—is that a good idea?
You have to think about why those values are there. How many layers does Phi-3 have? 40, right? So 40 layers. How many positional encodings does it have—what is the context length? It's 131072, which is 128K. So this model, the Phi-3 medium, is a 128K-context model.
Just be careful: it's not 128; it's actually 128K, i.e., 131072. There are other issues with this model as well. Okay, probably don't use the instruct version—sorry, let's choose the small version.
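If you want to sanity-check those config fields yourself, something like this works (the model ID is just an example; depending on your transformers version you may need trust_remote_code):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
print(config.max_position_embeddings)           # the context length
print(getattr(config, "sliding_window", None))  # sliding-window size, if the architecture has one
print(config.eos_token_id, config.pad_token_id) # the token IDs discussed above
```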
This is the smaller version. There's something we noticed: the sliding window. So, like Mistral, it has a sliding window. The sliding window means attention only looks at roughly the last 2,048 tokens, and this just makes training much faster.
Does anyone notice what the problem is here? Why is it 2047? Anyone notice any issues? Yes—well, it's not a power of two, but correct. So is that weird? I mean, that's horrible.
Yeah, so I did ask the Hugging Face people, and they said yes, it is a bug. They actually did fix it, but then I don't know why they reverted it. So I'm a bit confused; they kind of forgot about this.
Yeah, so it's actually supposed to be 2048, because that's the only thing that makes sense—you're training on the correct context length; otherwise the sliding window makes no sense. In fact, I've seen a lot of sliding-window bugs recently.
Yeah, I'm not sure why. But I'm pretty sure this should be 2048—I'm very confident, actually 100% sure it's 2048. And yeah, these small issues need to be fixed.
They still have not fixed it. But we actually uploaded models which fix these things—if you go to our Unsloth Hugging Face repo, we have models where we fixed all of them.
Oh, this is too big. Where is the Phi one? Oh, there—Phi-3 mini 4K instruct. Right. If you go to files, then config.json—we fixed it all. And there are other things that we did to fix it.
For example, the pad token ID—okay, that one's actually wrong; I need to fix my own config. Anyway, there's a bug we discovered ourselves: this value is actually wrong.
Another thing is you must not make the pad token the same token ID as the EOS token. Never, never, never. It must be a different token from the EOS token. We automatically fix this during loading; it's just that the config itself is not right.
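A hedged sketch of that check with the Hugging Face tokenizer API (the model name and the placeholder pad token are illustrative; prefer an existing reserved token from the actual vocab where one exists):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

if tokenizer.pad_token_id is None or tokenizer.pad_token_id == tokenizer.eos_token_id:
    # Adding a brand-new token means the model's embedding matrix must be resized before training.
    tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
    # model.resize_token_embeddings(len(tokenizer))

print(tokenizer.pad_token, tokenizer.pad_token_id, tokenizer.eos_token_id)
```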
But that's okay; Unsloth itself is fine—just the config is a bit wrong. Oh, okay, I found my own bugs, but okay. So, I'm going to keep going because there's a lot to cover.
Okay, actually there is more—I just noticed. Another one is that Phi-3 merged the Q, K, and V projections, remember? In Llama, Q, K, and V are unmerged.
The attention weights are separate there, but Phi-3 made an interesting move in that they fused them into one matrix, and we found that to be very problematic for fine-tuning. Because if you fuse them together, when you use LoRA adapters you learn fewer extra weights—one shared adapter instead of one per projection—so it's much less expressive.
So please unfuse them. Our version of Phi-3 actually unfuses the weights. I highly suggest you unfuse the weights; only fuse them back afterwards if you want to.
Fusing will make training maybe 5% faster—actually not even that much, more like 2%—and it increases memory usage a lot, so just be careful of that as well. Yes, they actually did—so this is the sliding-window one; they fixed it and then unfixed it.
I think they just forgot about it; I'll probably push them again to fix it. And this is the fusing of the weights: we show that if you unfuse the weights—Q, K, and V must be separate; you must not combine them.
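Here's a rough sketch of what unfusing a fused QKV projection looks like (shapes are illustrative assumptions—32 query heads, 32 KV heads, head_dim 96—not values pulled from the actual checkpoint):

```python
import torch

hidden_size, num_heads, num_kv_heads, head_dim = 3072, 32, 32, 96
fused_qkv = torch.randn((num_heads + 2 * num_kv_heads) * head_dim, hidden_size)

q_size, kv_size = num_heads * head_dim, num_kv_heads * head_dim
q_w, k_w, v_w = torch.split(fused_qkv, [q_size, kv_size, kv_size], dim=0)

# With separate projections, LoRA attaches an independent low-rank adapter to each of
# Q, K, and V instead of one adapter shared across the whole fused matrix.
print(q_w.shape, k_w.shape, v_w.shape)
```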
If you combine them, you actually get lower accuracy, so please do not do that. Now, for tokenization—remember the slide I showed you where the smiley faces are the spaces, and each one is a different tokenization?
There are actually many issues with tokenization. This is a totally separate topic from finding bugs and issues in language models—it's a whole topic of its own, because tokenizers are very problematic, and their bugs are very hard to find and fix.
Did I duplicate this slide? Okay, I doubled that. Also, we have new support which we have not announced yet, which you can try out. Lots of people have asked us how to fine-tune a language model and export it to Ollama effectively.
Does everyone know what Ollama is, or does anyone not know? Okay, so Ollama is an interface. When you fine-tune a model, you have to run it somewhere, and Ollama just makes running the model much easier.
So ChatGPT is like the running mechanism—Ollama is like ChatGPT, but it doesn't come with one fixed model; you have to select a model. That's kind of what Ollama is.
Yes—how did you manage to—so I've been working on creating model files using automated pipelines, but we've been finding many issues trying to automate model file creation. Is this using Unsloth? No, using Axolotl or other ones?
Did you automate and modify it yourself? Well, yeah—because we need our own model files, right? We do this automatically now; I spent a few months trying to automate the model file creation, which is why we were struggling so hard as a company.
Yes, I have code for that somewhere. It's already open source—it's in the GitHub repo. If you go to Unsloth and look at the chat templates, we have code for that.
It looks very ugly. So these are the chat templates. Remember the BOS token someone mentioned—you have to add it? Yeah, add the BOS token. This is the chat template for one of them. Ollama has a specific requirement: you must have a chat template, because if you don't use the correct chat template, your model will output substandard responses.
So these are the chat templates for some of the models—we had to write chat templates for all of the architectures. We have an automatic one; then there are Vicuna and so on, Alpaca style, and Gemma—the Gemma style, we also have that.
We have many, many—even a Llama 3 chat template as well. Now, for the automatic one: what we do is we can actually create a chat template and model file for you automatically.
This makes your fine-tuning much more accurate. Wait, I'll show you—where is the code for that? Okay, you can see the code is quite large just for the chat templates, and this is just the tokenization part. Yes, it's Apache 2.0—yes, it's Apache.
Yes, it's open source. Wait, where is it? Okay, so we have a function that parses the combined prompt. I didn't actually optimize it; it's O(N squared)—I should have made it O(N), but anyway. It checks the prompt, and here's the prompt format.
So we do—it looks quite ugly; the code for automatic model file creation, but we actually made it so you can actually automatically create a model from your chat template. You can see it’s quite ugly, but it works.
And, yes, oh, it’s even more ugly. It’s quite ugly code. But unfortunately, the model file is very hard to create automatically, and so we have the notebook which allows you to do this.
So this notebook is in here—this one's for the Alpaca dataset. And this is our installation, Llama 3. Where is it? So we'll be using the Alpaca-GPT-4 dataset: the Alpaca dataset where GPT-4 was used to create the responses.
And the trick is, we also support CSV files now, so you can actually upload a CSV file and use Unsloth directly to fine-tune a language model. But the problem is a language model fine-tune wants an instruction and an output—only two columns—while CSV files and Excel files can have many columns.
So what do you do? You have to merge the columns into one. You take each of those columns in your Excel file and convert them into text. For example, with the Titanic dataset, you merge them into a sentence saying the passenger has one sibling or spouse aboard, and so on.
You merge each row into one piece of text, and that's what you can do with Unsloth now. I still probably need to refine the syntax, but this merging template says: your first column is the instruction column, and the double brackets mean a part is optional.
So if the input column exists, it will say the instruction followed by "your input is"—and you can make this as elaborate as you like; you can use as many columns as you want. I don't know if the syntax is ideal, but I will probably be editing this.
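As a minimal sketch of the same idea in plain pandas (Titanic-style column names are just for illustration; this is not Unsloth's template syntax):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [22, 38],
    "SibSp": [1, 1],
    "Survived": [0, 1],
})

def to_instruction(row):
    # Merge several columns into one English instruction string.
    return (
        f"The passenger is a {row['Age']}-year-old {row['Sex']} in class {row['Pclass']} "
        f"with {row['SibSp']} sibling(s)/spouse(s) aboard. Did they survive?"
    )

df["instruction"] = df.apply(to_instruction, axis=1)
df["output"] = df["Survived"].map({0: "No", 1: "Yes"})
print(df[["instruction", "output"]])  # the two columns fine-tuning expects
```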
We’re going to make a YouTube video about this to talk about this. Um, this is actually very important for fine-tuning. We noticed that every single provider requires you to use only one column for instruction and one output column.
Now, you can have as many columns as you like, but you must define the chat template. And we also have a customizable chat template. Before, when you fine-tuned language models with our other notebooks, you had to use the Alpaca prompt: "Below is an instruction that describes a task, paired with an input," and so on.
You put your instruction here, your input here, and your output here, right? But notice the problem with this: you can only have one instruction and one output or response, and the input is the awkward part, right?
How do you solve this? You solve it by merging the input into your instruction prompt. So the input section should really be removed entirely and its content folded in—and that's what you can do now.
So you provide an instruction and an output—you can only use two columns—but remember, even though you can only use two columns, you can use this merging to convert your dataset down to two columns.
Yes, do you lose any of the semantic meanings, though? Oh no, I don’t think so—no, I don’t think so. It depends on how you format the dataset. Remember, it’s a language model, so you can do—the more you tell the language model what to do, the better, of course.
But the problem is, to do the model file creation, you must include two repetitions of this, right? You must do instruction, response, and then another instruction, response. You must do this for Unsloth; I found it to be very important for model file creation.
If you don't do this, you get dangling newlines, and your model's output becomes terrible. So you must do two repetitions; it's a must. If you don't do that, we'll error out.
And so once you do this, we also have examples. For example, this is Llama 3’s chat template, right? We again do two iterations. You must do two iterations— most importantly. And when you finish training the model, remember you can do runtime run all.
You can do inference now, right? "Continue the Fibonacci sequence"—your input is 1, 1, 2, 3, 5, 8—and the next number it gives is 13. I think that's correct; yes, that's correct.
So your language model has learned how to do Fibonacci, and because it's a chat template, you can also shove multiple messages into the model. So this becomes a ChatGPT for you—a customized ChatGPT that you can use.
And finally, when you want to save the model, you can save just the LoRA adapters—that's only about 100 MB in size. So once you fine-tune the model, you have 100 MB, but some people also want to merge the model back, and that will take 16 GB. You must merge it for Ollama support, GGUF, and so on.
And what we showed for Ollama support is you first install Ollama. Then you select what you want to save the model to—GGUF, and this one is 8-bit. We now support multiple quantization methods: you don't have to do 8 bits; you can do 4 bits, 5 bits, or whatever you like.
And this saves them all in one go, much faster—in fact, I think it saves you around 20 minutes. Okay, and this does all the saving, and we also automatically create a Modelfile from your chat template.
And I can verify this is actually correct because I tried it. And then when you want to serve the model file, you can actually print out the model file which we created, and this is the model file.
Whoops—I pressed run already. Anyway, to serve it, you just use the Modelfile we created. And we do have a CSV version, so you can actually use the Titanic dataset.
Okay, it’s loading. Um, so if you want to use the Titanic dataset, you can upload the Titanic dataset, right? I uploaded the Titanic CSV. You can use the CSV file for this. Um, and again, you have to merge columns and so on, right? This is a more complicated example.
In fact, I provide this entire example for you for the entire Titanic dataset to merge all the columns into one. Um, and it’s the same exact output. So that’s a notebook that we’re sharing for—we did not release this yet. So this is for you guys to experiment and see if there are any issues.
Yeah, and just tell me. We also have blog posts on our website and our Unsloth GitHub repo which you can check out. We have stickers available—they're very, very cute—for you to take. And also, yeah, we have Q&A now.
Yeah? Yes! Oh, did you measure the difference between writing the CSV content as English sentences versus just a JSON format? The problem is, if you put them in JSON format, you still need to end up with an instruction and an output.
So how would you do that? You only get two columns for fine-tuning. Could you, in your template, have the instruction and then append all the other columns onto it?
You could—you could use a JSON file, yes. But what we're showing is that you can handle multiple columns now: if you have ten columns, you can turn those ten columns into one by merging them together.
Is there a big difference between representing the merged columns as an English sentence versus as a dictionary? Oh—you mean shoving the actual dictionary in for fine-tuning?
You could do that, but I think you should use English, because a language model predicts the next word; JSON is probably less useful. Always convert it into English. I have the same intuition; I was wondering if you measured it. A research paper, maybe?
Yes, it could be another research paper. Yeah, any other questions? There are a lot of upvoted questions for me in the chat.
Sorry—I was wondering if you could take a look at them. Oh, I didn't actually check the submitted questions. Whoopsies. It didn't actually load, so—oh, there are lots of questions. Okay, I'll need to answer each of them afterwards.
I think I’m already out of time, though. So yes, thanks a lot.
2025-05-13 08:00:01
Tengyu Ma on Voyage AI - Weaviate Podcast #91!
Hey everyone, thank you so much for watching another episode of the Weaviate Podcast. I'm super excited to welcome Tengyu Ma to the podcast. Tengyu is the co-founder of Voyage AI and an assistant professor at Stanford University, which I think kind of sets the stage for why this is so exciting—why we're so excited to be adding Voyage embeddings into Weaviate.
We have a new text2vec-voyageai module, and Tengyu has published so many amazing works in deep learning and contrastive learning. I'm so excited to be learning from Tengyu and welcoming you to the podcast. Thank you so much for joining.
Yeah, thanks so much for the introduction. This is a very energetic introduction; I wish I could also have one! Awesome, well, could we kick it off with what motivated you to start Voyage AI?
Yeah, so I think this is a very good question. I started to think about this probably around early 2023. I think I’ve done a lot of research at Stanford; my team has done a lot of research on large language models and deep learning theory. I thought that it was a good time for me to use my expertise in AI to contribute something to the commercialization of AI.
I thought that enterprise AI is one of the most important directions. There are a lot of enterprise applications of AI; we can revolutionize the industry in many ways. I started to think about what the best approach to use AI in enterprise is, and at that point, I thought that Retrieval-Augmented Generation (RAG) is the right approach.
It turns out that I think we are kind of on the right track to some degree because in the last year there has been some debate, and then gradually it sounds like people are gradually converging to RAG over fine-tuning in the last year. In early 2023, I think there was still a lot of debate, and now most people say at least RAG and fine-tuning probably coexist, or maybe some people are just only using RAG.
In early 2023, I was thinking, okay, so how do you have a company based on RAG? There are many other startups doing RAG. In some sense, you connect different components; you connect the embedding models with a vector store and large language models, and you can do some good UI. You can have domain expertise to understand user requirements and so forth. There are many startups, and I was thinking to focus on the technical side because I believe that the retrieval quality can be improved; the overall quality can also be improved.
Later, I realized that maybe it’s actually better to focus on components because that allows you to be horizontal. If you say, I’m going to have an RAG startup and my retrieval quality is much better than other people’s, sometimes it’s very hard to justify why the technical differences are important. However, if you focus on components, you can say I’m only working on the retrieval system, and in particular, the quality of the retrieval system is responsible for the embedding model, which we can discuss more.
Then you can work on embedding models for different areas, such as finance, legal, and so on. Right now, we have a sequence of domain-specific embedding models for every domain. We can go horizontal and work with many different partners; we can work with RAG startups, enterprises that build RAG by themselves, and we can work with platform people who design platforms for serving RAG.
That’s how we maximize our strength, which is the research and quality of the AI models, and in some sense, kind of go horizontal and maximize the market size. I love everything you’re saying. There are so many nuggets in that from the RAG-versus-fine-tuning debate to the whole scope of RAG and all that kind of end-to-end system, specific components, and horizontal business.
Yes, and then the domain-specific thing is definitely a big topic we’re going to dive into. We’re super excited about the Voyage code models; I think a lot of WE8 users are probably going to be really excited that we finally have a strong code embedding model integrated with WE8. But I’d love to dive into embedding models and contrastive learning.
For me, my interest in vector representations came during the SimCLR and MoCo era, this kind of self-supervised learning for computer vision. I think you’d be the perfect guest to take us through contrastive learning theory and all of this stuff.
So I guess just a quick introduction to some of this contrastive learning and embedding model stuff. Embedding models, as many people know, turn documents or images into vectors. Basically, you are training a model that outputs vectors, so you need different loss functions. The loss functions, as you suggested, are called contrastive losses.
This contrastive loss was first designed for images. People said they needed to learn visual representations, and how do they do that without labels? Suppose they have a lot of images and they don't want to use labels. What people do is augment an image into a similar image and say that this augmentation and the original image should have similar representations.
Part of the loss function incentivizes this, and you also want to incentivize that random pairs of images have different representations. There is a sequence of loss functions like the contrastive loss, SimCLR, SimSiam, spectral contrastive loss, and so forth. They all operate under the same principle: you want similar images to have similar representations, where similar images are usually defined as augmentations of the same image, and you want random pairs of images to have different representations.
How you design this loss function to incentivize these two principles can vary across different methods, and there are trade-offs on various fronts. Generally, all the contrastive learning algorithms are like this. Now, people are using these contrastive learning algorithms for text as well.
The same idea applies: you want similar text to have similar representations. It’s very obvious. You want random pairs of texts to have different representations. For text, it’s even a little more complicated because you have to define similarity in the right way. Whether similarity just means semantic similarity or if it could also mean matching keyword similarity, there’s some relationship, but not like exactly the same kind of semantic meanings. Sometimes it’s question and answer pairs, and so forth.
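As a generic sketch of the in-batch contrastive objective described here (an illustrative InfoNCE-style loss in PyTorch, not Voyage AI's actual training code):

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, positive_emb, temperature=0.05):
    """Each query should match its own positive (a paraphrase, the answer to the
    question, etc.) and not the other in-batch texts."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = q @ p.T / temperature                      # cosine similarities of all pairs
    labels = torch.arange(q.size(0), device=q.device)   # diagonal pairs are the positives
    return F.cross_entropy(logits, labels)

# Usage: embeddings for a batch of (text, similar text) pairs, shape [batch, dim].
loss = in_batch_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```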
That’s generally what people do for learning text embeddings with contrastive losses. I think you raised so many great points in this story. I like the last point about what kind of relationships are captured by embeddings. Especially with code, I can imagine if I search with a query of a docstring to try to match a Python function instead of another docstring, and this kind of nuance and things like question and answer, there’s a lot of nuance.
Parsing this out, I think there's a tour of loss functions: triplet loss, in-batch negatives, data augmentation, semantic similarity, and preventing representational collapse. There are all sorts of topics in the air.
Could we maybe kick it off by, if we could, I think it may be really interesting to start off with data augmentation. With images, I know you have these invariances to rotations or horizontal flips. With text, I personally have a little background of doing a survey on image data augmentation, and then I tried to take my lessons to text data augmentation. I found it a little bit harder to preserve the label and semantics of text data.
But now I imagine maybe with generative models you can transform positive pairs. What’s been your experience so far with that kind of positive data augmentation supervision?
Yeah, maybe just to set up the baseline in some sense, augmentations, as you said, are very important in this contrastive learning because you rely on this augmentation to create the so-called positive pair—the pairs of data that should have similar representations. However, this does not decide everything.
For instance, let’s say for images, it’s not true that in the image world, the only similar image pairs are just augmentations or rotations or translations of each other. There are also image pairs that are very related and not translations, flips, or crops of each other. Nonetheless, their similarities are still learned by this contrastive learning algorithm.
The beauty here is that even though you define similarity in a very narrow way—you say only two images which are translations of each other are similar in your loss function—when you learn the model, this similarity propagates. In the image domain, this is related to some of the papers we’ve written that explain why contrastive learning works.
The theory we presented in the paper is that if you have, for example, two huskies that are not translations or augmentations of each other, you can learn similarities by contrastive learning of these two huskies. The reason is that even though they are not direct augmentations of each other, you can find a sequence of huskies that connects them.
For any two huskies, you can find a sequence of 10,000 husky images such that every consecutive pair of images are augmentations of each other, gradually changing one husky into another by altering the position, posture, cropping, and color scheme.
Of course, you wouldn’t find all 10,000 huskies in your dataset; however, such images hypothetically exist. The claim is that the contrastive algorithm actually learns how to work on a larger population of images that includes all those hypothetical images. Essentially, if you had infinite data, you should be able to learn all these relationships.
Now, the reason you can learn from finite data is that neural networks have generalization capabilities: the finite data is not very different from infinite data. Even though you don't see the entire sequence of huskies in your training set, you are implicitly learning the relationships along this trajectory.
So basically what I'm trying to say is that even though augmentations provide a very local definition of similarity, and there are many possible similarities you want to capture, you can in effect capture compositions of similarities using this contrastive loss. So it's not that the algorithms are overly sensitive to the augmentations you choose; you can still learn the compositions of the similarities defined by the augmentations.
Now, let’s go back to your text example. For text, it’s a bit more complicated to define similarities because you can say a query and a question are similar. You can say a question and an answer are similar, right? The answer to a question is a form of a similar pair. You can also say that two questions that ask about similar concepts are similar.
There are many definitions of similarity. What we do is try to create a variety of different augmentations, adding all possible kinds of augmentations or pairs of similarities that you can think of. You then rely on the magic of neural networks to propagate or compose the similarity into more complex definitions of similarity that you may not know.
With this approach, using your handcrafted definitions of augmentations, combined with the beauty of networks’ compositionality and extrapolation power, you can learn a variety of possible similarities.
That’s amazing! I love that kind of local broad generalization. I really like how you brought up the measure of intelligence, and it really helped me think about this. For people listening who are curious about these points, it’s essential to visualize it.
I guess something I really like is that positives could be paraphrases of text, or maybe semantically similar because this is an answer to that question, or whatever other kind of relationship you can imagine. It inspires me to ask you this question. I've asked this question to Nils Reimers, as well as Zach Nussbaum at Nomic: what's your experience with creating the dataset used to train these embedding models?
Yeah, so we have—first of all, I don’t know everything about how people create datasets in my team. Secondly, I probably cannot tell you everything either. But what I can say is that there are a lot of trials and errors and a lot of human intuitions in defining the positive pairs in preparing the datasets.
The way I think about this is that data set preparation or curation, if you think more broadly, seems to be one of those components in AI that requires a lot of human intuition and handcrafting. It’s reminiscent of feature engineering from 20 years ago.
Twenty years ago, people always used linear models; there was no way to change anything about the model. It was always linear, maybe sparse, and the only thing you could do was craft a feature—define the feature or the kernel function, which is the same as defining the feature yourself.
That was the innovation you pursued, but it was a very ad-hoc process. You cannot really publish a paper saying, “I engineered my features in this way based on my intuition and my understanding of this application.” However, that part is pretty vital.
It’s somewhat similar these days when you think about data curation because it’s a pretty ad-hoc process as well. So far, I think the majority of people are trying to curate data with intuition and handcrafting. The beauty of the modern AI is that users don’t have to do that. The model provider, before, required users to engineer their features based on their dataset.
Now, model providers like us train the models for you; we do all of this engineering and data curation for the users. As end users, you can just use this AI model as a black box. That’s the difference.
In terms of technical low-level details, it sounds reminiscent of feature engineering to me these days.
Fascinating! I've been super interested in the DSPy synthetic data framework. One experiment we did at Weaviate was having Erika Cardenas generate synthetic queries and then using Cohere to fine-tune their rerankers. We were studying the idea of how to close the loop between query generation and fine-tuning with gradients.
In this loop, with feature engineering, I’m thinking about the prior on how you would generate a positive example. From this angle, the prompts you curate are how you create an automated engine.
Is that kind of what you mean when you refer to synthetic data generation?
Yes, that’s what I would mean when I talk about generating synthetic data. Although most of our training doesn’t use synthetic data, we have a complex pipeline these days with multiple stages of training.
However, most of our training doesn’t utilize synthetic data. The reason is that generating synthetic data is also expensive. We’re training on trillions of tokens, and generating synthetic data with trillions of tokens is actually pretty costly.
Sometimes, generating synthetic data can be as expensive as training on those tokens. The other issue is that there's still limited diversity in synthetic data. There are ways to tune the prompts to make synthetic data more diverse and realistic, but sometimes it still isn't as good as real data.
In some dimensions, synthetic data could be very good, while in others it may not be as effective. That’s why we use a mixture of real data and synthetic data, and the real data is much larger than the synthetic data. Synthetic data offers higher quality in some respects because you can specify exactly what data you want.
However, real data has more diversity; it covers all the noisy cases in the real world and is also much larger and cheaper. This discussion of diversity will transition us perfectly to clustering the representation space.
I'm very curious about this idea. One of the papers from around the SimCLR and MoCo time period was SwAV, which used clustering. I know you've done some work with spectral clustering.
What role does clustering play in learning vector embeddings when you’re trying to enforce a prior on how to distribute the space and avoid representation collapse? What role does clustering have in training and embedding models?
I think there could be several roles. Let me briefly talk about some of the papers I wrote, which is actually one of my favorite papers on understanding how some of this contrastive learning works.
This paper analyzes an algorithm called Spectral Contrastive Loss, which is easier to analyze than others. It’s quite similar to SimCLR and other loss functions.
The main idea is that whatever contrastive loss you’re using is kind of doing the following: imagine you have a manifold of images or, for text, a manifold of text. It might not be identical, but let’s use images as an example; you have many manifolds of images, such as one for huskies, one for cats, and others for desks or landscapes.
You have many, many manifolds of images, and from these manifolds you take samples as a training set. Now, you can imagine building a graph on these manifolds in the following way: you take a discretization of each manifold, so every manifold, instead of being continuous, is represented by a lot of points on it.
Then you say I connect nearby points and I build this graph, which is kind of like a proximity graph. Basically, you build a graph based on the local distances on the manifold, and then you can actually prove that the contrastive learning algorithms are basically doing some clustering algorithms on this graph.
So, in some sense, this graph distance is the same as the distance on the manifold. If two points are on the same manifold, they have small distances; if they are on different manifolds, they have very, very big distances. If you can cluster in the graph sense, then you can cluster in the semantic sense, because every cluster in the graph corresponds to the same manifold, which corresponds to the same class or concept.
In that sense, many of these algorithms are doing clustering. The only subtlety is that you are implicitly clustering a lot of points, because you are thinking in this high-dimensional, non-parametric sense where you have an infinite number of images on the manifold, and so forth.
So that’s one connection to clustering. The nice thing about this is that once you have the representations, the embeddings, you can prove that clustering based on distance in the embedding space is the same as clustering on the manifold.
So, basically, you can prove that the Euclidean distance actually has a very nice semantic clustering property. That’s one connection to clustering.
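To make that concrete, here is a minimal PyTorch sketch of a spectral-contrastive-style loss in the spirit of that paper. It is a simplified form (the exact normalization and handling of the diagonal term differ in the paper), with z1 and z2 standing for embeddings of two augmented views of the same batch.

```python
import torch

def spectral_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    # z1, z2: [batch, dim] embeddings of two views of the same inputs.
    # The first term pulls positive pairs together; the second term pushes
    # random pairs apart, which is what performs the implicit graph clustering.
    pos = -2.0 * (z1 * z2).sum(dim=1).mean()
    sim = z1 @ z2.T                # pairwise inner products across the batch
    neg = (sim ** 2).mean()
    return pos + neg
```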
But of course, when you have text, the theory becomes a little complicated because for text, it’s not like you have very clear well-defined clusters in a textual space. For images, every class is a cluster, or maybe every subclass is a small cluster, but for text, every paragraph of text is about a few different topics.
So there are a lot of overlapping clusters, and sometimes you don’t even know what’s the right granularity you should talk about clustering. Should you talk about clustering in terms of the topic level, or should you talk about clustering in terms of some lower level? So it’s a little harder to understand for the text space.
I love that; it unlocks a lot of thinking. Maybe I’ll leave the manifold thing alone because it breaks my brain trying to think about it, but I’d like to start transitioning into embeddings for code. With text, the example you just gave resonated with me enormously.
I’ve heard a lot of people bring up “they want ice cream” versus “they don’t want ice cream”: how should those be embedded in relation to each other? Nils Reimers had taught me about multi-discourse, and then coming to Weaviate I saw tons of examples of paragraphs that talk about more than one thing.
So is code maybe more naturally atomic, because you’re able to cut up a function or part of a function? Maybe it’s easier to chunk than natural text. I think even for code, it is interesting to use the term multi-discourse.
So, yeah, sometimes every piece of code has different dimensions you can cluster in some sense. One dimension is the functionality of the code, and maybe another dimension is the programming language used for the code. Another aspect is the surrounding context, right? Which code file this code block belongs to and whether it’s a helper function or whether it’s the main function, right?
So which layer it belongs to, and so forth. I think it’s probably still a multi-dimensional, multi-faceted concept. The idea is that in the embedding model, we capture all of this to some degree in different coordinates of the embeddings or some different dimensions of these high-dimensional objects. You don’t exactly know where they are captured, but they are captured somewhere hidden in that high-dimensional vector, and it’s up to you to retrieve those in some way when you need it.
Listening to you talk about this, I’m now just inspired to ask if L2 distance, where we try to compare all the vectors to each other, is maybe not the best way of doing it. Maybe you have PCA factoring of the dimensions of the vector. Just my natural curiosity from hearing you explain this is thinking that maybe there’s something more than L2 distance for comparing vector embeddings.
Yes, I think in the long run it should be. But there are several different ways to deal with this. For example, one way is to add prompts to your text. You can prepend a prompt to your text, and if the models are trained properly, these prompts will effectively rotate or transform the embeddings in different ways so that you can still use L2 distance.
Basically, my vision is that I think L2 distance as a metric for embedding space should still exist and probably will still exist for a long time. The main reason is that it is really fast. It is really good when you do the search, so in some sense that’s your top priority, and then you do other things to accommodate for that.
You can allow prompts to change your embedding to kind of attend to special dimensions of the embeddings. You can probably tune your embedding algorithms; you can fine-tune the embedding algorithms to emphasize certain aspects of the similarity. So, and you can probably do other things to disentangle certain aspects. You say maybe the first 500 dimensions are about the functionality of the code and the second 20 dimensions are about maybe the other aspects of the code, and so forth.
But you try to insist that, at the end of the day, you use some L2 version, some L2 similarity, so that you can have faster search. Right now, for the Voyage code embedding, we mostly focus during training on the functionality, right?
We want to make sure that the model understands fundamentally what the code is really about and which algorithm it’s implementing. We also focus a lot on the keywords. We focus on basically everything that we can think of right now. But I can imagine in the future someone has a particular similarity they want to optimize for, and that’s a perfect time to use either some prompt or use some fine-tuning on top of the models.
We’re going to provide some fine-tuning API soon for people to do that, and we can also fine-tune for people right now. So I think these are probably more… I would say this will exist for a long time. I would predict that this is the way we go forward. We insist on the distance, but we change other parts.
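As a toy illustration of this “keep L2, change other parts” idea: the sketch below is hypothetical, with a stand-in embed function rather than any real API, and a made-up prompt prefix and dimension slice, but it shows how a task prompt and a reserved block of coordinates could be combined with plain L2 search.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 1024) -> np.ndarray:
    # Stand-in for a real embedding API call (hypothetical): a deterministic
    # pseudo-random unit vector, just so the sketch runs end to end.
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def l2_similarity(query: str, doc: str, prefix: str = "", dims: slice = slice(None)) -> float:
    # Prepend a task/metadata prompt so a properly trained model can steer the
    # embedding, then compare with plain L2 distance, optionally restricted to
    # a block of coordinates reserved for one aspect (e.g. code functionality).
    q = embed(prefix + query)[dims]
    d = embed(prefix + doc)[dims]
    return -float(np.linalg.norm(q - d))  # higher means more similar

print(l2_similarity("sort a list",
                    "def bubble_sort(xs): ...",
                    prefix="Represent this code by its functionality: ",
                    dims=slice(0, 500)))
```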
Again, that just inspires so much thinking. One thing with Weaviate is that we started supporting multi-vectors; in Weaviate 1.24 we’ll have Voyage AI plugged into the Weaviate features. So what do you think about this idea of maybe having… I really liked this paper on prompt bias. I’m not sure; I recall it was maybe two years ago, so it’s not fresh in my memory.
But I’m not sure exactly what you would put before the text before you embed it, maybe something like “this code belongs to this folder.” You use the metadata to prepend that, and then you get the embedding based on where the code is located. I really wanted to ask you about single-vector representations in the index versus maybe a ColBERT-style approach where you rerank with additional vectors.
Maybe those vectors could come from other sorts of relationships, not necessarily just token vectors, and maybe we could also throw matryoshka embeddings into this category of multi-vectors with different levels of granularity. I hope that wasn’t too much in one question, but maybe I’ll talk about each of them one by one.
So I guess probably the first concept is the multi-vector ColBERT approach. This technique allows you to have not just one embedding for a whole chunk of text; you actually have multiple embeddings. In the original version, you have one embedding for each token, right?
Because we have one embedding per token, if your chunk is 512 or maybe 1,000 tokens, you’re going to have 512 or 1,000 embeddings for that single chunk. The benefit is that you are more granular, because every embedding captures the localized meaning of that part within the bigger context.
So it’s very good; it’s very fine-grained. The downside is that now you have to store a lot of vectors. Before, if you had one million documents, you had one million vectors; now one million documents means one million times a thousand, about a billion vectors.
So it’s definitely a lot of burden on the vector database side, and it’s great that Weaviate is supporting multi-vectors. But sometimes it can be too much, depending on how you do the tradeoff. It also depends on how many documents you have: with 100 documents it probably doesn’t matter, but with 1 million documents you have to think very hard.
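For readers less familiar with late interaction, here is a small PyTorch sketch of the ColBERT-style MaxSim scoring being discussed; it is a simplified illustration, not ColBERT’s actual implementation.

```python
import torch

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    # query_tokens: [num_query_tokens, dim], doc_tokens: [num_doc_tokens, dim],
    # both assumed L2-normalized. Each query token is matched to its most
    # similar document token, and those best matches are summed.
    sim = query_tokens @ doc_tokens.T          # [q_tokens, d_tokens]
    return sim.max(dim=1).values.sum()

# Storage intuition from above: 1M documents with ~1,000 token vectors each
# is about a billion stored vectors, versus 1M for single-vector embeddings.
```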
That said, there’s also a middle ground. One of my team members published a paper on multi-vector embeddings that is a bit less extreme than ColBERT. ColBERT has one embedding for each token; his version is that for a chunk of size 1,000, you have maybe 10 or 100 embeddings for the whole chunk. You don’t go all the way down to the token level, so that could be a good middle ground when you have a lot of documents.
It’s a tradeoff between quality on one side and compute and space on the other. Another dimension I’d like to mention is that ColBERT is about multi-vector retrieval, and it’s built on top of an existing base model.
ColBERT is built on top of BERT. So even though the ColBERT part matters a lot, the BERT part also matters. If you replace the BERT with something better, you would definitely get a much bigger lift. That’s why, if you look at the benchmarks right now, ColBERT, even ColBERTv2, still publishes results that are not as good as OpenAI’s, not as good as Voyage’s, and so on.
And that’s even though OpenAI’s v3 embeddings, Voyage, and all of these models use only a single vector. So basically there are multiple different ways you can improve: you can make the base model really, really good, or you can have multiple vectors and multiple dimensions of vectors.
So there are many different ways to improve embedding models. Right now at Voyage, we are focusing a lot on improving the core, the base: the transformer. How do you make the parameters in that transformer as accurate and as deep as possible?
But one day we may very likely also have a multi-vector version of Voyage, a code Voyage with multiple vectors. I think that’s probably where the future goes, but for now, if you literally compare ColBERT with Voyage, I think Voyage is still probably better than ColBERT.
In the future, we’ll probably have a ColBERT-style Voyage, or maybe we’ll call it something else, a multi-vector Voyage. Yeah, super compelling. Again, I guess one more super technical detail about embeddings and vector models before coming to the code side, the LangChain case study you did, and things like building a product around an embedding API.
So, kind of what you were just hinting at about new architectures for contrastive learning: I’m super interested in this. Reading your paper on the inductive biases of contrastive learning, maybe there’s something to latent bottlenecks and how you design the dimensionality of the intermediate layers of the transformer.
In what you were just saying, you already hinted at this. How much more adapting of BERT is there still to be done for state-of-the-art embedding models? So I think that’s a great question. In some sense, this is about how we train the embedding models and how we make them better.
It’s a pretty complex process, as you can imagine. For example, if you ask me what’s the secret sauce of OpenAI, I think these days the community would guess that there are multiple secret sauces, all of them combined together to give the best OpenAI model, or maybe Claude, and so on and so forth.
So that’s kind of what’s happening at Voyage as well, right? We have already a relatively complex pipeline. We have some pre-training, some kind of contrastive learning, and we have all kinds of steps to deal with different aspects.
Where we found that maybe the model is not as good as it should be on certain aspects, then we have an additional step to improve the capacity or capability in that specific kind of dimension. And also, you have to have a data collection process for every step.
You have to do data curation; you have to verify the quality of the data. At every step, you have to tune the hyperparameters well so that training is fast, because training three times slower means $1 million versus $3 million, or $10 million versus $30 million.
That’s why any efficiency improvement is very important. You have to choose the right loss function, the right architecture, the activation function, the number of layers, the width, and so on and so forth.
I think, of course, I don’t know exactly what OpenAI or Anthropic is doing, but I think this is kind of similar to them in the sense that you need optimization for every component so that if all together you get probably a 10x efficiency improvement, maybe a 100x improvement.
So that you can train this model in a round of time with sufficient high quality. That’s pretty much what we do under the hood. We tune many of these types of parameters, but I have to tell you that sometimes we don’t tune all of them sufficiently well. The architectures, I think, are still relatively standard.
We tweak them a little bit, but not a lot. We tweak the optimizers a little bit—again, not a lot. We spend a lot of time on data curation. That’s probably cost us a lot of energy. We spend a lot of time on every different part of the system to get this embedding right.
The consensus now seems to be that dataset curation is the most valuable thing, and it sounds like you’re pretty intense about it. I’m sure there are researchers at Stanford working on things like the Shampoo optimizer and other second-order optimizers.
But what about neural architecture search? I think we just saw Flash Attention, so maybe there’s some opportunity there. It sounds like you’re using a standard transformer architecture. Do you think neural architecture search is still promising?
We don’t use neural architecture search in the sense that it also depends a little bit on what you mean. But I don’t think we use—at least not the most typical NAS at all. We’re still mostly using scaling laws and human tuning to tune hyperparameters or to tune architectures.
Human means you look at what’s going on, you know whether the gradient is too big. You use some kind of analysis to guess which part of the system is a bottleneck and then you fix them by tuning some of the hyperparameters, changing some of the activations, and so forth.
Part of the reason here is that, you know, I’ve done some research along that line about NAS, but I think NAS still requires a lot of compute if you really want to do it well. Sometimes it’s hard to justify the use of that compute because the ROI is not good enough.
If you use a small model to do the NAS, then sometimes the lessons you learn from the neural architectural search don’t transfer to large models. But if you use a large model to do the NAS, then you spend too much compute. So it’s very hard to strike a balance there.
That’s why we mostly do scaling laws. We try to find the best hyperparameters for a small model and for a middle-sized model, and then we fit a curve.
We try to create a curve that says maybe this hyperparameter should scale linearly with model size, other hyperparameters should stay constant as the model size changes, and some other hyperparameters should be inversely proportional to the number of layers.
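As a toy illustration of that curve fitting (the model sizes, learning rates, and target size below are made up, not Voyage’s numbers): fit a power law in log-log space from small-model sweeps, then extrapolate to the large run.

```python
import numpy as np

# Hypothetical sweep results: best learning rate found at a few small and
# medium model sizes (in parameters).
sizes = np.array([1e8, 3e8, 1e9, 3e9])
best_lr = np.array([6e-4, 4e-4, 2.5e-4, 1.6e-4])

# Fit best_lr ~= a * size**b by linear regression in log-log space.
b, log_a = np.polyfit(np.log(sizes), np.log(best_lr), deg=1)
a = np.exp(log_a)

target_size = 3e10  # the big model you only get to train once
print(f"predicted lr at {target_size:.0e} params: {a * target_size**b:.2e}")
```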
You figure out all of these relationships, and then you scale up to the largest model and train once, for a few weeks or maybe months. That note on human hyperparameter tuning and graduate-student-descent thinking is very interesting.
There’s a parallel there with language models doing that kind of observation. It’s almost like Bayesian, but then it has this prior in the language model. I think that is definitely a can of worms we could open. But the scaling laws thing, I think scaling laws is just one of the most exciting stories in AI.
Maybe really quickly before we graduate back into higher-level use-case stuff, scaling laws and embedding models—what’s kind of the story? I think scaling still holds a lot on the embedding models. Of course, I would say one key about scaling law is that it’s only an empirical, observational law.
In a sense that if you change something in your code, your scaling law may be different. We try to do that. Sometimes you use a different algorithm, and the scaling becomes much better. But still, you know, I definitely agree with you that scaling law is the core of AI.
Because you have a lot of predictability on large models. You don’t want to just run your experiments and pray. You want to have some confidence on how good experiments will be after three months.
So yes, we do a lot of scaling law, and it’s still true that for embedding models, the larger the model is, the better. However, for embedding models, the challenge is that you cannot make the model too big because then the latency is not good enough, right?
This is supposed to be a very fast step: you want to search for relevant documents in under 500 milliseconds, in most cases probably 50 milliseconds, maybe sometimes 15 milliseconds. So you have to keep the model small enough, and we actually use very small models to do this. Our model on MTEB is something like 10x smaller than some of the other models with similar performance.
That means you have to find a way to make your scaling law better than others’. Somehow your scaling law shows that a 1 billion parameter model is enough, while someone else’s shows that they need 10 billion parameters to reach the same accuracy. That’s, in some sense, one of the key challenges we’re addressing every day: how do you change the scaling law so that you can use smaller models to achieve the same performance, or maybe even better performance?
That’s so fascinating. One of the biggest misconceptions we see at Weaviate is people who want to use the biggest language model to embed their data set. Understanding that you don’t want to use a 100 billion parameter model to embed your documents is awesome.
So yeah, Tengyu, I thought that was just such an amazing overview of all these concepts. I can’t wait to watch it back and study it myself. Could we transition into a higher-level topic? On the Voyage blog, you document a case study of searching through the LangChain documentation. Maybe some people, when they heard me previewing this, thought I meant a LangChain demo, but this is the meta, dogfooding case of searching through the LangChain documentation and code. Could you take us through that journey?
Yeah, so basically what we did is fine-tune the embedding models on the LangChain documentation. The reason we need to fine-tune here is that, as you can imagine, the LangChain documentation is very new; it’s only one year old, or maybe 1.5 years old.
There’s a lot about RAG in there, and many of the concepts are also very new. You can’t really expect an off-the-shelf embedding model to understand all of the detailed logic about how to do RAG, and all of the terminology. That’s why fine-tuning is almost necessary.
What we did was start with our base embedding model and fine-tune it on the LangChain documentation. We saw a 15% improvement in recall, which is great; it went from about 60% to 75%.
One lesson I learned—actually, there are two lessons I learned—one lesson I found is that it’s indeed true that the retrieval quality highly correlates with the final response quality. Once you see a 15% improvement in the retrieval quality, then the final response quality also improves a lot because you get the right document.
Then GPT-4, or other language models, can synthesize an answer very well given the right document, and if they don’t have the right document in context, they hallucinate a lot. They just guess what you should do with LangChain, and sometimes the guess is wrong.
The second lesson I learned is that there’s a kind of blessing in this particular setup: you can actually fine-tune on a relatively small document set. When you say fine-tuning, the first reaction you might have is that if you don’t have enough documents to fine-tune on, you may overfit.
That’s actually what happens to some degree with large language model fine-tuning. If you don’t have enough documents, sometimes you either overfit or you cannot override the existing prior of the large language model. If you have a lot of documents, it works great, but you really need something like a million documents to fine-tune a language model.
At least, people have found it difficult to fine-tune large language models with a small set of documents. However, when you fine-tune these embedding models, the blessing, the beauty here, is the following. Let me give you an extreme example: suppose you only have five documents, a very small number of documents.
I’m going to fine-tune an embedding model for you, and the most natural guess would be that I’m going to overfit to these five documents. That’s very likely true: my fine-tuned embedding model will memorize your five documents to death. However, that’s not a problem.
Why? Because anyway, you only have five documents to retrieve from. So the only thing you have to do is to retrieve from these five documents. You won’t have confusion every time you have a query. Of course, when you have five documents, sometimes you cannot answer the query with these five documents, but there’s nothing you can do anyway. However, when you can answer, the only thing you can do is to retrieve one out of five.
Basically, memorizing your five documents is not a big problem. As long as you only have five documents, the problem will be that I memorized five documents for you, but you actually have 1,000 documents or maybe 10,000 documents later.
Basically, what I’m saying is that if you only have five documents, the only thing you have to do is fine-tune the model on whatever corpus you have for your retrieval. If your corpus is small, some memorization will happen, some overfitting will happen, but that’s okay, because your task is easy anyway.
You only have like five documents. Memorization is actually probably exactly the right thing to do for retrieval from these five documents. But if at some point you say you have 1 million documents now, then what we would do is we just do a continuous fine-tuning on these 1 million documents.
Continuous fine-tuning doesn’t cost that much anyway, and you have to re-embed and re-index those 1 million documents regardless. So you take a pass of training over the 1 million documents, and then you re-index them once you have the bigger corpus.
Now you don’t have the overfitting issue, because you already have 1 million documents, and fine-tuning really is generalizing. Everything is in the more normal machine learning regime, where everything is about generalization. And maybe sometimes you’re in a middle regime, where you only have 2,000 documents.
Then there is a mixture of overfitting, memorization, and generalization. But anyway, if you have 2,000 documents, that’s probably the best thing to do for that 2,000 document retrieval problem.
So basically, what I’m saying is that somehow for some kind of embedding model fine-tuning, you can allow fine-tuning for even a small number of documents.
Amazing. I guess it’s more about pushing the space apart than about generalization. Maybe it’s related to work on autoencoders and compression, and it’s not as much about generalization.
Exactly. When you have a smaller number of documents, you are more about how to kind of auto-encode this small set of documents. You don’t necessarily have to generalize just because the problem is easier.
Amazing! I’m so curious about what you opened with, that LangChain has new concepts even in the dataset. It makes me think about this continual learning question. One of the debate topics I’m always bringing up with people is this idea of zero-shot versus continual fine-tuning.
I’m sort of in the zero-shot camp. I think especially with Cohere’s Command R model, what I suspect they did to get it really good at RAG is train it with RAG, so that it stays faithful to the retrieved context.
So I still believe strongly that you can just retrieve and then have models that are really specialized in grounding answers in the context, compared to this continual fine-tuning. But then for embedding models, maybe there’s something again to that prompting idea, where you retrieve a short paragraph of facts and new knowledge that the model would need to encode this.
Yeah, maybe you have a better—you know, I think I see what you mean. So basically, let’s specifically talk about embedding models, right? Whether you want to do zero-shot or you want to do fine-tuning or continual fine-tuning, you have three options.
Zero-shot means you just take the model off the shelf. Fine-tuning means you fine-tune on the LangChain documentation, maybe in January 2024, and then you don’t do anything else for two years, right?
Continual fine-tuning means that you keep fine-tuning on LangChain’s new documentation, maybe every day or every two days. I think this depends on how fast your world is changing, how fast your corpus is changing, to some degree.
For the LangChain documentation, because it is so new, many of the terms are very new. And not only the terms; even the logic, the deeper concepts behind all of this RAG and agents and chains and so forth, are new.
So you probably have to fine-tune a little bit to get the best performance. However, once you fine-tune, once you are already familiar with this way of thinking, it’s basically like this: suppose you are a human retriever.
You are doing this retrieval yourself, right? You systematically study LangChain once to understand what LangChain is really about. That’s the first step for yourself. Once you do this, you probably don’t have to systematically study LangChain again to retrieve further documents, even though the LangChain documentation gets updated.
Maybe every month the LangChain documentation gets updated a little bit. You don’t have to systematically retrain yourself to be able to retrieve new documents from LangChain. That’s the beauty of the retrieval model: even if the corpus changes, you don’t necessarily have to change your retrieval method.
But maybe at some point LangChain pivots. I’m not saying that they will; I’m just using it as an example. Suppose LangChain pivots to a completely different kind of company, and then every concept changes; then you probably have to re-study the whole documentation corpus.
That’s kind of like a continual fine-tuning or maybe a second fine-tuning phase. I think it just really depends on how fast the world is changing. If the world is changing really fast, you probably have to continue fine-tuning. If the world is changing very slowly, you don’t necessarily even have to.
You can just do zero-shot with the big models. It’s so exciting! I’m sure a lot of people building with Weaviate who hear this have their own code documentation; it’s certainly something I’ve seen that’s very popular. I’m excited to hear that you don’t need too much data or documentation to take advantage of this.
All that is so exciting. Maybe if we could quickly touch on that problem of re-embedding the dataset, because if you continually train, you’ve got to re-index the dataset. Do you see that as a big problem for continual fine-tuning of embeddings?
Yeah, so I think right now if you fine-tune the embedding models, at least using our technology or using any technologies available here, I don’t think actually any other companies provide fine-tuning other than Voyage.
But if you fine-tune, you have to reindex the whole corpus, so we are not super concerned about this economically because anyway, you have to fine-tune on a new corpus that already costs something.
It’s not very expensive if you don’t have a lot of documents, but say that costs X dollars; the cost of re-indexing the whole corpus is probably less than X, maybe X over 2 or X over 5. So that’s why we are not very concerned about re-embedding the whole corpus if you have already fine-tuned.
In the future, everything will be much cheaper, right? Fine-tuning will be on the fly. You won’t have to fine-tune everything again; you only have to fine-tune the differences of the corpus.
When you update your embedding, you can also update maybe some part of the embeddings or only update 10 coordinates of the embeddings so that you can keep the costs even lower. But that’s for future developments.
Yeah, so exciting! I love the thinking behind that. One other question I wanted to ask you, and if this is kind of like a secret sauce question, feel free to pass it, but I’m really curious about serving and embedding API, building products around model inference APIs. What goes under the hood?
How is concurrency with GPUs, batching, all that good stuff? Yeah, I think that’s a great question. Actually, this is also something we are trying to figure out. The reason here is that serving APIs for many users, especially when you have embedding users using the embedding models, is kind of a little bit different from just serving the models for one user.
The reason is that some users embed a huge amount. We have seen users embed 10 billion tokens a day, because we give them higher tokens-per-minute and requests-per-minute limits, while other users are only doing 10 million tokens a day.
So there’s a huge disparity in how people are using embedding models. Some people are using them on the query path in production, and some people are only embedding their initial corpuses.
That gives us a lot of new questions, in some sense, because we have to set the TPM and RPM limits differently for different users, and we have to make the backend very reliable under spikes.
For example, one thing, and this is not necessarily secret sauce because it’s easy to think of, though not exactly trivial to implement either, is that you can automatically detect whether a user is sensitive to latency and figure out the best way to trade off latency against throughput.
So I think this probably also depends a little bit on whether they’re embedding a batch, right? You probably mentioned before that sometimes you have to have a batch transformation for this embedding model.
Our view is that we should make things as easy as possible for users. A batch transformation API is a very good idea, but it shouldn’t be 100% necessary for users to understand the difference between a batch endpoint and any other embedding API.
If you have a very big batch, you have a 10 billion tokens corpus. Ideally, we want the users to be able to just keep sending us all the tokens, and we do it ourselves on our side. If you have like 100 billion tokens, just keep sending it to us as fast as you can, and we will do whatever we can do on the backend.
Maybe we can send back all the vectors to you, or maybe we wait a little bit—probably for two hours—and send back the results to you. But that means that on the backend, we have to be clever.
We have to know whether you are sensitive to latency. If you send us a request with only one sentence, then very naturally, you are very sensitive to latency. Maybe you are using this in production, and you have to see the result in 100 milliseconds. But if you keep sending us a large batch, this large batch already has maybe like 1 million tokens in a single batch.
There’s no chance you are expecting that we send it back in 100 milliseconds. You probably care more about the throughput, and then we are going to do something on the backend to optimize for throughput instead of latency.
We still keep the latency the same as if you are sending it normally, but we can optimize something else. We can do some other things. For example, one thing we can do is use cheaper GPUs and more GPUs to maximize the throughput and still keep the latency the same when you have a large batch.
But this technique wouldn’t work if you have only one example. If you have only one example, it’s all about how fast the GPUs can run. But if you have like 100 examples, we can parallelize it across multiple cheaper GPUs.
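A toy sketch of that routing idea is below; the threshold, queue names, and crude token estimate are all made up for illustration, not how any particular provider implements it.

```python
# Route small, single-text requests to a latency-optimized path and very
# large batches to a throughput-optimized path (e.g. more, cheaper GPUs and
# aggressive batching). Threshold and queue names are hypothetical.
LATENCY_TOKEN_BUDGET = 4_096

def route_request(texts: list[str]) -> str:
    approx_tokens = sum(len(t.split()) for t in texts)  # crude token estimate
    if len(texts) == 1 or approx_tokens < LATENCY_TOKEN_BUDGET:
        return "latency_queue"      # answer in tens of milliseconds
    return "throughput_queue"       # batch aggressively, optimize cost

print(route_request(["how do I reset my password?"]))          # latency_queue
print(route_request([f"chunk {i} ..." for i in range(5000)]))  # throughput_queue
```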
So there are a lot of these kinds of low-level optimizations we have to do on the system side. I don’t think these are really secret sauces, and they are not rocket science either, but they are a little bit special to embedding model APIs.
Oh, I love that! I think that’s so interesting. Hopefully, we have a lot of systems people listening to the podcast. One of the podcasts I really liked was with Rohit Agarwal from Portkey, where they have a load balancer between different LLM APIs.
A big thing with Weaviate is hot storage versus cold cloud storage. Coming back to the DSPy thing, I’m super interested in stateful chains and what kinds of in-memory data structures you want to use with these. There’s so much to this routing of data.
This has been such an exciting podcast. I’ve learned so much from this conversation. I really want to ask you this sort of big anchoring question. There have been so many topics in the air throughout this podcast, but just kind of, what inspires you the most? What’s getting you out of bed in the morning just with AI and directions for the future?
I think it depends on what’s the horizon you are talking about. If it’s like two weeks, I’m very excited about our legal embedding model we are going to launch in two to three weeks. If you are talking about six months, I’m very excited about a sequence of embedding models we can do for the long run.
One of the exciting things we have, in my opinion, about AI at this stage is that it becomes kind of more modularized to some degree than before. Five to ten years ago, if you used AI for a particular use case, you had to collect data, fine-tune your model, choose your architecture, and do everything from scratch to some extent.
But now at least you have a very strong base, which is that you can just connect existing off-the-shelf components. In some sense, using AI becomes much easier than using machine learning five to ten years ago. You need to know much fewer details.
In some way, model providers like Voyage have done all the dirty work of tuning the embedding models for you, and the resulting models are very powerful. You don’t need to know how the models are tuned; you just use the output. Before, with machine learning, you had to understand a lot of things, all the way from data collection to how to fine-tune the model.
I think that’s one of the amazing things about AI these days. It kind of lowers the bar for people to use AI and maybe makes it harder to build AI components. I don’t know, but that seems to be the tradeoff.
I was already so excited about our Voyage integration with Weaviate going into the podcast, but now, after picking your brain more, I think this is just so exciting. The continued advancement of embedding models and their performance, and the way they complement these vector indexes and vector databases, makes it such an exciting time to be working in RAG.
And we’re even going beyond RAG and beyond chatbots, to these wonderfully complex LLM programs; there’s so much we can do with that. But anyway, thank you so much for joining the podcast. It’s been so cool to meet you and learn about how you see these things. Thanks so much for having me. This was great, thanks.
2025-05-13 08:00:01
Stanford CS336 Language Modeling from Scratch Spring 2025 Parallelism 1
All right. So today’s going to be the second of the basic systems lectures. And now we’re going to move on to sort of multi-machine optimization.
And so the focus today is going to be all about parallelism across machines. The goal today is going to be to move from optimizing a single GPU’s throughput to being able to understand the complexities and the details that are required to train really large models. When models get large, they no longer fit on a single GPU, so you’ve got to split up your models across different machines. But also, you’ve got to be able to leverage all the different servers that you have in order to train these models quickly.
We have both compute and memory concerns that we’re going to have to deal with, and communication across different machines. It’s going to be quite heterogeneous. We have different kinds of communication across GPUs at different levels of hierarchy, and this is going to lead to different parallelization paradigms. People use many different parallelization strategies all together at once. We’re going to talk through each one of the very popular ones. Then we’ll talk about how you combine them together in order to efficiently train a very large model.
I’m going to end the lecture with looking at some examples of how people are actually using these parallelization strategies to run their large scale distributed training runs. That’s going to roughly map to the different parts of this lecture. We’re just going to talk about the basics of networking first, and then we’re going to talk about how do each of these networking hardware concepts map to different parallelization strategies, and then finally some case studies to close off with to show you how it all comes together.
I told you about GPU scaling last week, and it’s quite impressive seeing this super exponential curve of flops per GPU going way, way up. But if we want to rapidly scale out both our compute and memory, a single GPU isn’t enough. We’re going to have to wait for another couple of years for this curve to continue going upwards.
If we want to train a really powerful language model here and now, today, we have to rely on multi-machine parallelism. If we look at the world’s fastest supercomputers, which is what’s being shown here, the fastest ones have exaflops upon exaflops of compute. Those are the green lines that you see over there. That’s what you’re really going to have to rely on if you’re trying to train the biggest, baddest language models today.
That’s the compute side of why you want to think about multi-machine parallelism. But we’ve also got a memory angle for thinking about the same thing. These two are really the core resources and the core concerns that you’re going to have to think about. In terms of memory, many of the models are getting quite big, and memory on GPU is also growing but not quite as quickly. A single GPU is not going to be able to fit these models, right? Maybe eventually in the distant future we won’t have to worry about a lot of these, but we’ve got billions and billions of parameters that aren’t going to fit very nicely into a single GPU.
We have to be very respectful of the memory constraints that we have. Those are the realities that we have to deal with. What are the tools that we have to handle these? GPUs, I’m sure you’ve noticed, in the cluster don’t come in sort of singleton. A single machine will have multiple GPUs within the same physical rack.
Here’s an example I took from the GPT-NeoX paper. This is an old example, but the same lesson applies to the H100 machines that you have in class. Here there are eight different GPUs, right? They’re connected to various CPUs through fast interconnects. Within each node, you see this NVSwitch at the bottom: very fast connections across the eight GPUs. But if these eight GPUs want to talk to GPUs on a different machine, they’re going to have to go through a networking switch.
You see this purple line that says HDR InfiniBand; that’s a much slower connection compared to the NVLink connection. You can see the difference in throughput: it’s about eight times slower per lane. This kind of hardware hierarchy is going to have big implications for how we end up parallelizing our models in practice. Keep this mental model with you as I talk through these things: we have very fast connections within a single machine, and when we go across machines, it gets slower.
Depending on the kind of hardware we’re using, there might even be another level of slowness once we go beyond, let’s say, 256 GPUs networked together. Many of you may already know this from having taken systems or networking classes, but here’s a very brief refresher on collective communication operations.
The reason why I’m going to bring this up is there’s one particular important sort of identity or equivalence that you will need to know to really understand some of the finer points of the performance characteristics of the parallelization algorithms. I’ll talk through these, and then I’ll discuss one important performance implication.
The first one which all of you probably have heard of is all-reduce. You have four machines, four ranks in this case, each one having its own piece of data. What you’d like to do is perform some sort of reduction operation. Let’s say I want to sum all these inputs, and then I want the output to be copied over to every single machine. This is going to have roughly the cost of like two times the total number of things that you’re all reducing.
You also have a broadcast operation, where I take a single input from rank two and copy it out to all the remaining ranks. This has roughly on the order of one times the total number of outputs in communication cost. Then we’ve got reduce, where we have different inputs that get summed up and sent to only one machine.
The two that are quite important, even though they may not be as common, are all-gather and reduce-scatter. All-gather is an operation where I take a single shard of my parameters from rank zero and copy it over to all the ranks, and the same for ranks one, two, and three. Each rank holds a different part of the parameters, and those parts get copied over to the rest of the machines.
So that’s copying what I have to everyone else. Then reduce-scatter is where I take each of the rows, sum them up, and scatter each summed piece to the rank responsible for it. It’s a partial version of an all-reduce, and hopefully the diagram makes it clear how reduce-scatter works. All-gather and reduce-scatter are quite important because they are the primitives from which many of the parallelization algorithms are built.
This is an important equivalence or identity; I will refer to it one or two times as key points in this lecture. If you want to do an all-reduce, let’s say I’ve got different GPUs, A, B, C, D. Each of the GPUs is handling a different data point, right? And so I’ve got different gradients for each of these data points, and I’m going to need to sum those gradients and then pass all those gradients back to the GPUs. This is a classic data parallel operation that I might need to do across my four GPUs.
That would be an all-reduce. One important thing though is this could be replaced with two operations: a reduce-scatter and all-gather, where the reduce-scatter is going to sum sort of each of the rows and then leave the result of the rows in GPUs 0, 1, 2, 3 respectively. Then I’m going to do an all-gather to copy those back out to the remaining GPUs, so each GPU now is getting a full sum of a part of the parameters, and then it’s going to copy it back to the remaining workers.
In the bandwidth-limited regime, this is basically the best you can do: the best an all-reduce can achieve roughly matches the bandwidth of a reduce-scatter followed by an all-gather. You can convince yourself of this by writing out how many communication operations happen in both the all-reduce and the right-hand side.
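Here is a minimal torch.distributed sketch of that equivalence; it assumes the script is launched with something like `torchrun --nproc_per_node=4` so the process-group environment variables are already set, and it uses the gloo backend so it runs on CPUs.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
rank, world = dist.get_rank(), dist.get_world_size()

# Fake per-rank "gradient": same shape on every rank, different values.
grad = torch.arange(world * 2, dtype=torch.float32) + rank

# Option A: a single all-reduce.
a = grad.clone()
dist.all_reduce(a, op=dist.ReduceOp.SUM)

# Option B: reduce-scatter my shard, then all-gather the shards back.
shards = list(grad.clone().chunk(world))
my_shard = torch.empty_like(shards[rank])
dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)
gathered = [torch.empty_like(my_shard) for _ in range(world)]
dist.all_gather(gathered, my_shard)
b = torch.cat(gathered)

assert torch.allclose(a, b)  # same result, and the same total bandwidth cost
dist.destroy_process_group()
```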
The final thing that I want to sort of briefly touch on before I move on to talking about the parallelization algorithms is this is the one place I’ll talk about GPU versus TPU. Most of the discussion today can actually abstract out the underlying hardware, but there is one important thing that I will mention up front so that I can refer to it later as I talk through this: How do we network together different machines or different accelerators in GPUs?
As I showed you in the GPT Neo X slide here, in the GPU world this generally works is you’ve got nodes, single machines that contain, let’s say, eight GPUs, and then you’ve got these switches that connect fairly quickly to each other. These machines are connected all to all up to about 256 GPUs. That’s an important threshold up until which you have very fast arbitrary communication between machines. Above that, you’re actually going to need much more slow communication.
These leaf switches and spine switches come into play once you go beyond roughly a single rack’s worth of GPUs. On the other hand, if you look at the TPU design from Google, they take a very different approach to networking their machines. You’ve got single TPU chips, and they all talk to their neighbors very, very quickly. This is a very easily expandable toroidal mesh, but you can only talk to your neighbors.
The reason I’m bringing this up right after the all-reduce slide is that if you think about doing these kinds of collective communications, like all-reduce or reduce-scatter, you can implement them just as efficiently on a toroidal mesh as you can on an all-to-all connection. If you’re optimizing purely for collective communications, it makes sense to think about TPU-style networking rather than GPU-style networking.
I’ll talk a little bit about pros and cons of this later as I go through different parallelization operations.
Just to put this together right now, we’re going to start talking about a new unit of compute, right? Instead of the GPU, the new unit is the data center. The whole data center is going to be the thing that we’re going to be doing. Now we’re going to try to come up with algorithms and sharding strategies that get us two different things.
The first one is linear memory scaling. As I scale up the number of GPUs, the biggest model that I can train is going to scale linearly with that. I can train bigger and bigger models if I really want to. I also want linear compute scaling. As I get more and more GPUs, the useful computation that I’m doing to train the model scales linearly.
A lot of these algorithms are going to be implemented by just calling these very simple collective communications primitives in various ways. When we think about the performance characteristics of these parallel algorithms, it suffices to reason about counting the collective communications primitives. So that’s kind of an important way to think about these.
We don’t go all the way down to the low-level implementation of these algorithms here.
Any questions on part one?
Yes. Sorry, but from the previous slide, does it mean that it’s better to do a reduce-scatter and all-gather rather than an all-reduce? So this slide, right? Yeah. The conclusion of this slide is that they’re equivalent. I think if you think about something like parallel gradient descent, all-reduce is a very natural operation to do, because you distribute your data to different machines and then you have to all-reduce your gradients together.
What I’m saying is that this very natural all-reduce can actually be written as the composition of two different operations, and they’re equivalent. So there’s no performance penalty in going from this left representation to this right one, at least in terms of bandwidth. That’s going to have important implications in maybe five slides, so you can wait a little bit to see why I mentioned this.
Okay. Any other questions? Good.
Now we’re going to get started. In some sense, this is kind of the exciting algorithmic meat of the lecture. There are three kinds of parallelism strategies that we should really be thinking about. The first one is data parallelism. At a high level, data parallelism is the idea of roughly copying the parameters across my different GPUs. I’m not going to worry about splitting my parameters up.
But I will take my batch, and I will split my batch up. Different GPUs or different machines will get different slices of my batch. That’s data parallelism. There’s lots of subtleties in how we execute that. Model parallelism now is starting to say, okay, I don’t want all my GPUs to have all the different parts of my model, right? As my models get bigger, that’s going to be a very big problem.
So, I need to cut up my model in very clever ways, and I need my GPU to handle different parts of my model. That’s going to be model parallelism. The final piece is kind of activation parallelism. We don’t really think too much about activations in our day-to-day lives because PyTorch handles it very transparently.
But as the models get bigger and the sequence lengths get longer, the activation memory starts to be a really big problem. If you want to train these really big models with big batch sizes, you have to somehow manage the memory footprint of your activations. We have to split those up too, so there are ways to handle that. When we put all these together, we will have all the tools we need in order to scale up both compute and memory gracefully as we have lots of machines.
These are kind of the core conceptual objects. Now we’re going to talk about implementing each of these ideas efficiently. The starting point of data parallelism is just SGD, right? If we’re doing very naive batch stochastic gradient descent, the formula for doing this looks like the equation that I have right here on the slide.
I’m taking a batch size B, and I’m going to sum up all those gradients and update my parameters. Naive data parallelism is just saying, all right, take your batch size B, split that up, and send that to different machines. Each machine will compute some part of this sum, and then I will exchange all of my gradients together to synchronize before each gradient step. I will synchronize my gradients and then take a parameter update.
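The slide’s equation isn’t reproduced in the transcript; a plausible reconstruction of that batched update, with the data-parallel split over M machines written out, is:

```latex
% Naive data-parallel SGD: the sum over a batch of size B is split across
% M machines, each computing the partial sum for its own shard of the batch.
\theta_{t+1}
  = \theta_t - \eta\,\frac{1}{B}\sum_{i=1}^{B}\nabla_\theta\,\ell(x_i;\theta_t)
  = \theta_t - \eta\,\frac{1}{B}\sum_{m=1}^{M}\ \sum_{i \in \mathrm{shard}_m}\nabla_\theta\,\ell(x_i;\theta_t)
```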
Now I’ve been talking to you about compute and memory scaling and all these things. Let’s talk through what it looks like for each of these. For compute scaling, data parallelism is pretty great. Each machine, each GPU is going to get B over M examples. If my batch size is big enough, each GPU is going to get a pretty decent batch size micro-batch size, and it’s able to hopefully saturate its compute.
That’s good. What’s the communication overhead? Well, I’m going to have to transmit twice the number of my parameters every batch. Remember, an all-reduce is going to roughly be twice the amount of stuff that you’re all reducing in terms of communication cost. This is okay if the batch size is big, right? If my batch sizes are really big, I can mask the communication overhead of having to synchronize my gradients every now and then.
For memory scaling, I’m not touching this at all. Every GPU needs to replicate the number of parameters; it needs to replicate the optimizer state. It’s pretty bad for memory scaling, right? If we didn’t have to worry about memory at all, this is an okay strategy. But I think in practice, memory is a problem, right? Everyone of you sitting here has experienced trying to put a big model onto a GPU and PyTorch telling you you’re out of memory.
This is a problem with your training as well because if you can fit more and more batch sizes, that’s going to make data parallel more efficient. Ideally, you’d like to save on memory. Let’s take a closer look at the memory usage of naive data parallel. The memory situation is actually worse than it looks. It’s actually quite terrible.
You’ve done this in assignment one, but we can think about how many copies of our model we need to store, and it’s very large. Depending on the precision we’re doing some of our training, you’re going to need to store something like 16 bytes of data per parameter. In fact, you need to store something like five copies of your weights. This is really quite bad because if you want to think about your model parameters, technically you only need two bytes.
Where did that factor of eight come from? Well, at the very least you need gradients; if you’re computing those gradients in BF16, that’s another two bytes. Then your optimizer state shows up, and that’s the really big problem: you’ve got four bytes of FP32 master weights, the intermediate accumulations you’re keeping, plus four (or two) bytes for Adam’s first moment estimates, since Adam keeps track of historical gradients, and then Adam needs second moment estimates, roughly the variance of the gradients you’ve seen in the past.
That’s going to need another four (or two) bytes. What originally looked fine now looks quite grim. If I draw this 16x factor as a picture, you realize that most of your memory usage, at least in terms of parameter-related memory, is dominated by the optimizer state of your Adam optimizer. The memory consumed is a function of how many bytes are used for your optimizer state, and that’s generally even more than the core parameter and gradient memory.
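A quick back-of-the-envelope version of that accounting, assuming BF16 parameters and gradients with FP32 optimizer state (the common mixed-precision setup):

```python
# Rough per-parameter memory for mixed-precision Adam training, matching the
# "16 bytes per parameter" figure above (exact numbers depend on precision).
bytes_per_param = {
    "bf16 parameters": 2,
    "bf16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}
total = sum(bytes_per_param.values())
print(total)                 # 16 bytes/parameter, i.e. 8x the bare bf16 weights
print(total * 7.5e9 / 1e9)   # ~120 GB for a 7.5B-parameter model
```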
For a simple example, a 7.5B-parameter model distributed over 64 accelerators, you’re using a ton of memory on every GPU, and the total memory across the cluster scales linearly with the number of GPUs, so that’s no good at all. Once we look at this picture, we get some very simple ideas. You might wonder: do I really need all of the optimizer state to be on every single machine?
Once you ask that question, you get to this second row, which is called optimizer state sharding. If we do that, then at least in this case we can go from 120 GB of memory per GPU down to 31.4 GB. We can then start sharding the gradients and get to 16.6 GB, and if we also shard the parameters, we can go all the way down to 1.9 GB. That would be a pretty good place to be, because now we’ve fully sharded all of the optimizer state, parameter, and gradient memory that we need.
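Those per-GPU numbers can be reproduced with the usual ZeRO-style accounting, sketched below with psi = 7.5B parameters, N = 64 GPUs, and 12 bytes of optimizer state per parameter (FP32 master weights plus two Adam moments).

```python
# Per-GPU memory (in GB) under the sharding stages described above.
psi, n, k = 7.5e9, 64, 12  # parameters, GPUs, optimizer-state bytes/param
GB = 1e9

baseline       = (2 + 2 + k) * psi / GB                    # everything replicated
shard_opt      = (2 + 2) * psi / GB + k * psi / (n * GB)   # shard optimizer state
shard_opt_grad = 2 * psi / GB + (2 + k) * psi / (n * GB)   # also shard gradients
shard_all      = (2 + 2 + k) * psi / (n * GB)              # also shard parameters

print(baseline, shard_opt, shard_opt_grad, shard_all)
# -> 120.0, ~31.4, ~16.6, ~1.9 GB per GPU, matching the table above
```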
Sorry, why can we shard the optimizer state if we’re doing the gradient computation on each GPU and then reducing? How can that work? That’s a very good question: how can we shard the optimizer state when we’re doing data parallel? GPU 0 has to be responsible for data point one, so clearly it needs to know about all the parameters and update them.
How can it possibly shard the optimizer state? In a way, I think ZeRO, the Zero Redundancy Optimizer, which is what this is, is a very clever idea, because it shows that even when you’re doing data parallel, you don’t actually need to copy everything onto every machine. You can be clever about how you do the communications to avoid all of this.
What we’re going to do is split up the optimizer states, as I said, so the first and second moments are now split up across all the GPUs. Everyone has the parameters and the gradients. If I have the parameters and gradients—let’s say I’m GPU 0—I have the parameters and gradients for everything. That’s enough information for me to compute the full gradient.
The only thing I can’t do is I can’t take that gradient and take an Adam step, right? I can’t update my parameters unless I see all of the optimizer states. That’s kind of the key idea. So what’s going to happen is GPU 0 is going to compute the gradients for everything, but GPU 0 is now only responsible for updating the parameters for the shard that they own. That’s the key idea.
We’re going to distribute the work of updating the parameters, and then we’re going to synchronize the parameters back. So let me show you in much more detail how this works and the reason why it’s called zero overhead.
Step one: every GPU gets different data points. Let’s say I’m just going to simplify all this batch computation. I have GPUs 0 through 3, and every GPU gets a single example and computes a full gradient on the example that it owns.
What I’m going to do next is I’m going to reduce-scatter the gradients, right? I’m going to send the gradients. I’m going to collect in some sense the gradients that each GPU owns. So GPU 0 is responsible for this first quarter of the parameters. The parameters are the y-axis here, and the x-axis is GPUs. What we’re going to do is reduce-scatter to make sure that GPU 0 has all the gradient information from all the other GPUs for the subset of parameters that it is responsible for.
Now GPU 0 gets this gradient information from GPU 1, GPU 2, and GPU 3, and that’s all reduced into GPU 0. Now, GPU 0 has all the information it needs to update its own parameters because it has the optimizer state corresponding to this first part. It has a full summed gradient for this first part.
Now it’s going to take a gradient update on that part of the parameters using gradients and state. Now I have the full updated parameters for this subset in my GPU 0, and all I need to do is all-gather the updated parameters back to all the ranks.
There are many questions here. I’ll start here. Yes. When you say the communication cost is the number of parameters, is that per machine or is that total? The question was whether the communication cost was per machine or total. Here it’s going to be total, because this is going to be like a quarter of the parameters sent three times to this machine, and then you repeat that four times.
That was also total. Yes. This question is not unique to what you’re showing here, but it made me think of it. The optimizers that we showed seem to largely assume independence of the parameters, but we’ve drawn all these diagrams that show the opposite. We have connected nodes, and it seems especially interesting when we’re trying to split these up and update them separately. Does that create any issues?
The question was whether Adam W seems to assume parameters operate independently. I’m assuming because you’re saying that we track like gradient sums and then diagonally sort of update the parameters, right? But we know that’s not fully diagonal. Is there a problem? There have been better attempts at improving Adam W to not just be diagonal. There are things like KFAC and all these other second-order style optimizers that people have come up with.
They haven’t dethroned Adam even though they do have their advantages. There are some really interesting things that you can do with these kinds of improved second-order preconditioning methods.
Yes. What is it reducing? What are the rows that we’re reducing over? You’re asking what are the rows of this picture? Yeah. Imagine this is like parameters in the rows. So GPU 0 is responsible for some number of parameters. This is a block of parameters at the top. When we do reduce-scatter, we’re saying take the gradients for example zero for this block of parameters. Take the gradients for example one for this same block of parameters and then sum them all and put them in rank zero. That’s what we’re saying here.
Cool. The key thing here is we’re doing a reduce-scatter and an all-gather, right? If you remember what I was saying before, a reduce-scatter and an all-gather have the same cost as an all-reduce, right? There is a little bit of surprising magic that happens here, which is that we were doing an all-reduce before on all the gradients to make sure everyone’s gradients were synchronized. That cost us two times the number of parameters.
If we’re clever about how we’re doing the updates, we can do a reduce-scatter and all-gather, and in between the two steps, we can do some computation. That gives us the same communication cost, but now at least for the optimizer state, we’ve fully sharded the optimizer state across the model. Zero stage one is, in some sense, free in the bandwidth-limited regime and gives you memory wins.
Yes—since sharding suppresses the memory contribution of the moments, do people modify Adam to add higher moments? What do you mean by suppressing the higher-order contributions? Well, for the first and second moments, the amount of memory per GPU gets divided by the number of GPUs, so it seems like you might as well store more.
I see, so you’re roughly saying you could track way more optimizer state. To rephrase: you could have an even more elaborate optimizer state, because its memory gets divided by the number of GPUs. This is true, but what we’re going to do next is make the other components scale with the number of GPUs as well. That makes things, in some sense, not free anymore: once everything is divided by the number of GPUs, the optimizer state is still proportionally the largest piece, so you don’t want to grow it further.
Hopefully, that’s a reasonable, convincing answer. Okay, we’re going to build up stage by stage to zero stage three, which is more complicated. Zero stage two is still relatively simple. Hopefully, that optimizer state sharding trick made sense. I think that’s very cool. Now we want to shard even more stuff.
I want to shard the gradients across the machines. We can roughly do the same kinds of trick as stage one, but there is one additional complexity. What’s the additional complexity? We can never instantiate a full gradient vector, right? If I ever do a full backward pass and I try to compute a full gradient vector, I might go out of memory. I want my maximum memory usage to be bounded by this, which is like full parameters, sharded gradient, sharded optimizer state.
What we’re going to have to do when we do the backward pass… as we’re computing the gradient vector, we can’t instantiate the full gradient first and then do communication. What we have to do is, as we compute the gradients backwards, as soon as we compute like a layer’s worth of gradient, we’re going to have to send that over to the corresponding sort of GPU that it belongs to, right? So this is kind of how it works. It’s roughly the same idea, right? So now everyone has their own batch component. Everyone incrementally goes backwards on the computation graph. And let’s say we’re going to operate layer by layer, right? So layers are sharded atomically to different GPUs.
So what we’re going to do then is as we go backwards on the computation graph after we compute a layer’s gradients, immediately call a reduction operation to send this to the right worker, right? So a layer belongs to some worker; maybe it’s like GPU number two in this case. So we’re just going to immediately reduce that and send that to the worker at that point, and gradients are now no longer needed. I don’t need to store the gradients on ranks 0, 1, and 3, so I can immediately free that, and then now we continue this process.
So all the machines have their fully updated gradients, and now they have a full gradient for their share of the parameters. They have a full optimizer state for their share of the parameters. Each machine can update their parameters and all gather the parameters back together, right? This looks like it’s maybe more communication because you’re doing this kind of reduction operation every layer, but this is only for a small amount of parameters, right? It’s sharded, and so the full communication remains the same.
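As a sketch of the stage-2 idea (again just an illustration in plain Python, not real distributed code), you can think of it as walking backwards through the layers, reducing each layer’s gradient onto the rank that owns it, and freeing it everywhere else so that no rank ever holds a full gradient vector:

import numpy as np

n_ranks, n_layers, layer_size = 4, 8, 16
owner = {layer: layer % n_ranks for layer in range(n_layers)}    # who keeps which layer
owned_grads = {rank: {} for rank in range(n_ranks)}

for layer in reversed(range(n_layers)):                          # backward pass, layer by layer
    per_rank = [np.random.randn(layer_size) for _ in range(n_ranks)]  # local grads for this layer
    # reduce this layer's gradient onto its owner, then free it on every other rank
    owned_grads[owner[layer]][layer] = sum(per_rank) / n_ranks

print({rank: sorted(owned_grads[rank]) for rank in range(n_ranks)})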
So zero stage 2 has some more overhead because we have to synchronize layer by layer and make sure that the gradients are properly sent to the right workers. But the overhead is pretty minimal, right? It’s still very simple, fairly straightforward. Now, the last one of these zero stage 3 is more complicated for sure, but it allows you the greatest win of all, which is now essentially everything is divided by the number of GPUs that you have, and you can get the maximum savings possible.
And if you’ve heard of FSDP, you’ve probably used that in some aspect of your life in the past. FSDP is exactly zero stage three. So now you’ll hopefully today know how FSDP works. The same idea applies. We’re going to shard everything including the parameters. We’re going to do the same thing as zero stage 2, which is we’re going to incrementally communicate and compute things so that we don’t keep these big vectors of gradients lying around, and we’re going to send and request parameters on demand while we’re stepping through the compute graph both for the forward and backward passes.
As we go through, we’re going to send things around on demand, and of course, the key is to do this with as low overhead as possible. I think the thing that’s really surprising about FSDP is not that this is possible, but that this is possible with relatively low overhead. You’ll see why it’s low overhead in the next slide. I admit that this is maybe not the most friendly graphic to start with, but this is, I promise, the baby version of FSDP. The next slide is a little bit more involved, but conceptually this actually explains everything.
So what we’re doing is, you know, we’re going to have model weights and we’re going to be all gathering the model weights as we go. For each layer, you know, no single GPU is going to have all the parameters, right? So I can’t do the normal thing of saying, “Oh, GPU zero, go ahead and run the forward pass.” That’s not possible. So GPU0, let’s say, only owns the bottommost layer. So it does that computation and then it stops and requests all of the parameters from all the other workers. So it stops and does an all gather, which you see there’s an all gather step. It gathers all the parameters.
Now it has the parameters that it needs to do a forward. So it can step forward and compute the layer that it didn’t have before. And then now it can free the weights. It doesn’t need the weights anymore; get rid of it. Now I can all gather the next layer. I can do another forward, free the weights, and I can repeat this. The activations have to be stored, so the activation memory here is growing. That’s going to be an eventual problem, but if we ignore activations for the moment, this is great because I load a layer, I do a forward, I free it; you know, the memory overhead is very low here.
Once I get kind of to the end, now I can do the same thing with a backward pass. I can call backwards, and every time I move backwards through the neural network, I all gather for the parameters that I need. I can do a reduce scatter to update after the gradients that have been computed. And now I can free the weights, or I can free both the gradients that I don’t need and the parameters. And at the very end, you know, I’ve got a fully updated model.
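A minimal sketch of that forward-pass loop, with NumPy arrays standing in for sharded layers (the “all-gather” is just picking up the layer’s weights, which in a real FSDP run would be fetched from the other ranks), might look like this:

import numpy as np

n_layers, width = 3, 4
sharded_layers = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

x = np.ones(width)
activations = [x]                      # these accumulate — the caveat mentioned above
for i in range(n_layers):
    w = sharded_layers[i]              # stand-in for all-gathering layer i's weights
    x = np.tanh(w @ x)                 # forward through the gathered layer
    activations.append(x)
    del w                              # free the gathered copy immediately
print(x)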
And so we’ve got three different operations that we’ve got to worry about here. We’ve got an all gather, we got another all gather, and then we got another reduce scatter basically to update the model after we take the gradient update step. So conceptually this is just a single step beyond zero stage two. But you do kind of see that there is more overhead. So the total communication cost is now higher.
We were kind of before, we had two times the number of parameters. Everything was kind of free in some sense. Now it’s not, right? There’s a total of three times the number of parameter communication cost, and there’s going to be cost associated with waiting for these communication things to finish. But I think the really cool thing about FSDP is it’s actually surprisingly low overhead. You might imagine that because we’re doing this crazy thing of asking for and sending parameters back and forth all the time, things will be really slow, right?
But you can do this core idea of overlapping communication and computation. So you want both your GPU to be working while the communication is happening in the background almost like pre-fetching, so that by the time you need some piece of information, it’s already loaded up. It’s already been communicated to you, and you’re good to go.
And so I’ll talk through this example at the bottom here, but this is kind of the key to making FSDP actually somewhat efficient. So let’s imagine we have a computation graph that looks something like this: W1 times W plus W2 times W0 times X—some input, let’s say, is Y, right? So some very simple computation graph like this, and then you might run FSDP, and you will get actually a computation and communication that looks like this block diagram at the very end here.
So the CPU—you know, it’s nice that we did the Nsight Systems example last week, because hopefully this diagram will now be clear, right? The CPU is going to basically dispatch a bunch of commands asking the communication part of the GPU to go and fetch some parameters. It’s going to dispatch things to the GPU to say, “Okay, do some matrix multiplies,” and it’s going to run, in some sense, far ahead of the GPU. We’ve seen this when we were looking at the profiler last week.
Now let’s look at the sequence of both communication and computation that happens on device. Now remember that I need to sort of gather things on demand. So at the very beginning, I have to make sure that everyone has the weights for layer zero or W0 here. So I do all gather zero, and I’m going to wait for that to complete. Once that’s completed, I can do a forward step on W0. I can sort of compute X times W0, let’s say, right?
At this point, all gather one starts at the same time that all gather 0 ends. So as I’m doing this matrix multiply, I’m basically already starting to load the next parameters that I need. Of course, my communication is slower, and so there is some gap, but I end sort of much quicker than the initial load. So now forward one can happen, and in the background, once again, I’ve started to load parameter number two, and this yellow slice here I’m now freeing the parameters associated with forward one.
And now the other thing here is that I’m repeating computation: W0 is used twice, and so I don’t need to communicate it again. This happens very quickly. Right? I have forward two now already loaded before I needed it, and so there’s no bubble here. And then I can free number two. That’s the entirety of the forward pass, and you see that the gaps are relatively small here, and we’re able to do a lot of loads before the compute needed to happen.
And so by doing this very clever thing of kind of queuing the requests for weights before you actually need them, you can avoid a lot of the overhead associated with communication. And then now at this point, you know, of forward two, I’m done with the forward pass. I can free weight number two, and I start on the backward pass. You see that all gather two for the backward pass is already done, and so I can start on backward two. Backward zero weight zero is already stored, so that’s done.
And then the high overhead here happens in the backward pass because I need to do reduce scatters and then all gathers and so on and so forth. Hopefully you see this picture and you say, “Wow, it’s kind of surprising that even though we’re doing this crazy sharding, if you go back to this picture, you know, we’ve fully sharded the parameters, gradients, and optimizer states. But the total bandwidth that we need is only three times rather than two times. So that doesn’t seem too bad.”
And sort of the actual bubbles that we see are not horrendous, right? The communication is almost being fully utilized, and the computation isn’t stalling for very long. So we’re actually making pretty efficient use of the resources that we do have, which is cool.
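The prefetching trick can be sketched with a background thread standing in for the asynchronous all-gather: we kick off the fetch for the next layer while the current layer’s matmul is running, so the weights are (hopefully) already there when we need them. Real implementations do this with asynchronous collectives on separate CUDA streams; this is just a toy illustration.

import time
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fetch_layer(i):                    # stand-in for an all-gather of layer i
    time.sleep(0.05)                   # pretend communication latency
    return np.random.randn(256, 256) * 0.05

n_layers = 4
x = np.random.randn(256)
with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(fetch_layer, 0)               # start loading layer 0 up front
    for i in range(n_layers):
        w = pending.result()                            # wait only if the fetch isn't done
        if i + 1 < n_layers:
            pending = pool.submit(fetch_layer, i + 1)   # prefetch the next layer
        x = np.tanh(w @ x)                              # compute overlaps the communication
print(x[:4])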
Okay. Yes. Where do they get prefetched to? To my understanding, let’s say the GPU memory is full; where do the weights get prefetched to? Yeah. Yeah. So you need a buffer in which you can store these weights. And so, you know, this picture is not quite right. You will have some overhead associated with reading these weights for the current layer. And also, the other big elephant in the room is I haven’t talked at all about activation.
That’s going to be a big chunk because you’ve got a big set of activations for a full model that is sort of living here in some sense. Yeah. Cool. Um, right. Okay. So this is kind of distributed data parallel like zero is in some ways the way that people do distributed data parallel efficiently. Um, and so there are different stages, and you know, stage one is basically free. It’s doing the same communication pattern as naive data parallel, but you get to shard your optimizer state; that’s great, you might as well always do it, right?
Zero stage 2 is twice the number of parameters, so the total bandwidth consumption is the same, but there is additional overhead in having to do this incremental freeing of the gradients as you go backwards. Zero stage three is more involved; you do three times the number of parameter communication cost, but it’s not so bad, right? Like we did have some overhead in the diagram that we saw before, but if you really cleverly mask your communication patterns, it’s actually pretty good.
And so people use data parallel even for fairly slow sort of links in your networking pattern. Okay, and this is also conceptually very simple. One of the advantages here is, you know, especially data parallel doesn’t care too much about the architecture. I didn’t talk at all about how we actually implement a transformer in any of this. It’s all very abstracted. And so this is one of the reasons why, for example, FSDP is so popular.
It’s very easy to write a wrapper that parallelizes sort of arbitrary neural networks without having deep knowledge or deep introspection of what the architecture is actually doing. And so, you know, here are some examples. I worked out some examples because I’m always sort of running out of memory on my GPUs, and you can kind of see what’s the maximum size of the model that I can fit on a nodes with 8 times 180 gig, you know?
And so for baseline, you might end up with like, “Oh, I can fit barely a six billion parameter model,” whereas I think if I use zero stage three, you know, I’m able to fit something like a 50 billion parameter model. There’s big savings in my ability to fit larger and larger models by doing things like FSDP to cleverly save on memory.
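Using the same crude 16-bytes-per-parameter accounting from earlier (and ignoring activations entirely, so the exact numbers come out a bit different from the slide), you can sketch the “largest model that fits” calculation for a node of eight 80 GB GPUs:

def max_params_billions(gb_per_gpu=80, n_gpus=8, stage=0, opt_bytes=12):
    bytes_per_param = {
        0: 4 + opt_bytes,                       # everything replicated
        1: 4 + opt_bytes / n_gpus,              # shard optimizer state
        2: 2 + (2 + opt_bytes) / n_gpus,        # shard gradients too
        3: (4 + opt_bytes) / n_gpus,            # shard everything
    }[stage]
    return gb_per_gpu * 1e9 / bytes_per_param / 1e9

for stage in range(4):
    print(stage, round(max_params_billions(stage=stage), 1))
# roughly 5, 14.5, 21.3, and 40 billion parameters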
So okay. Oh sorry, there’s a question. Yes. I guess I’m a little unclear as to what the difference then once you shard the parameters. What’s the difference between that model? Yeah. So model parallelism is really fundamentally about making sure that the parameters just, like, live in separate machines.
Let me see if I can find a picture so they never need to be communicated across. Yeah, yeah, yeah. In some ways, it’s true that we have sharded the parameters. So you could call this a kind of parallelism. But the whole point of model parallelism is to make sure that the parameters just live entirely in one machine. We’re not going to try to ship them across in various ways. Only the activations are going to get shipped across.
And so you’ll see very different discussions in the model parallelism section. The focus there will be on communicating activations rather than communicating parameters, and that’ll be a big difference. Yes. Let me see if the parameters are only on one machine. Why are you performing an all gather?
So you’re asking about this step: why are we doing all gather to gather weights onto all the machines? Is that when they’re only on one machine? Is that right? Yeah. So we need to take the weights that live on one machine and gather across all the machines to ensure that each layer is sort of properly replicated across all the GPUs.
Is that the right question that you’re asking? Or are you saying like, is there a simpler primitive that we could have invoked? Like are you saying broadcast is the right object rather than all gather? I think maybe it’s written that way because of some exceptions about layers not living on individual GPUs, but I’m not 100% sure. I agree with you that broadcast should be able to do the same thing if the parameters live on only one machine.
Okay, cool. Alrighty, let me make sure where. Okay, got it. Okay, right. So, there is a key resource in data parallel. And this is actually an important idea that I want you to remember. With data parallel, batch size is actually a really critical resource, in the sense that you can’t parallelize across more machines than your batch size, right? At the limit you have a single example on each machine; you can’t go to fractional examples per machine.
And so this means that, you know, if there’s limits to your batch size, you stop being able to use data parallel. And there’s diminishing returns to batch sizes. So, you know, in your assignment one, you may have played with varying batch sizes, but you kind of know that as you crank up the batch size past a certain point, you start to see sort of fairly rapid diminishing returns to your optimization rates.
And there’s lots of papers written on this. OpenAI has a really nice one on something called critical batch sizes, where they basically argue that, you know, past a certain point, you have very rapid diminishing returns in how much each example is contributing to your ability to optimize. Basically, the intuition is that below a certain point, you have a lot of gradient noise, and reducing that is very valuable, but at a certain point, you’re really fundamentally limited by the number of gradient steps you’re taking rather than variance reduction.
And so that basically means data parallel alone isn’t going to get you to arbitrarily large parallelism. And this batch size thing is a really important resource. You want to essentially have a fixed maximum batch size, and you can spend it in different ways. And I’ll talk about that later because other kinds of parallelism also benefit from having sort of bigger batches, and so you use your batch size in certain parts.
Okay, and issues are going to remain with data parallel. Zero stages one and two don’t let you scale memory. Zero stage 3 is nice in principle, but it can be slow and maybe more importantly, and this relates to the earlier question, it does not reduce activation memory. I ideally want to cut up my model entirely and make them live totally separately because then the activation memory would also sort of be reduced.
And so now I want better ways to split up the model so I can fit these really big models in these GPUs, and so that’s going to bring us to model parallelism. We want to scale up in memory without changing the batch size, and we want an alternative axis where we don’t need to spend or basically have bigger batch sizes in order to parallelize.
What we’re going to do is we’re going to split up the parameters across GPUs, and in some ways, that’s like zero stage 3. But we’re not going to communicate parameters anymore; we’re going to pass activations around, and that’s going to be different. Sometimes activations are going to be much smaller than parameters, and that’ll be very good for us.
So we’ll cover two different types of parallelism. I’m going to talk about pipeline parallel, which is conceptually simpler but much more horrible implementation wise, and tensor parallel, which is conceptually maybe less obvious but honestly much nicer to implement and more commonly used. They’re going to correspond to two different ways of cutting up the model.
So I think pipeline parallel is maybe the most obvious way to cut up a neural network, right? You know that a deep neural network comes in layers, right? So if I have layers, a very natural place to cut a network is to cut it up at the layer boundaries. So each GPU is going to handle some subset of the layers, and I’m going to pass activations around. Like in this case, each layer belongs to a GPU, and GPUs are going to pass activations from one to the other. In the backward case, it’s going to pass the backward gradients backwards from GPU 3 to 0.
So that’s cool; that’s great. What’s wrong with this picture? Well, I think you should see that most of your GPUs are idle most of the time. This is actually quite terrible utilization. If I do this naive kind of parallelism that I described before, right? So if I have, you know, each layer having a forward, and let’s say I have a single example, that’s going to result in a diagram that looks like this.
So different rows in this picture are different GPUs and different layers. The x-axis here is time where I’m going from left to right. So what do you see? Well, I first compute my first layer at the very left here, and then the activations get passed to the second layer. GPU 2 wakes up, and it’s like, “Alright, it’s my turn.” It does its job, passes it to GPU 3, and then GPU 4, and now the backward passes can begin.
And so on and so forth. You see kind of this gigantic bubble. This is a big overhead where you’re doing absolutely nothing. And you see that the GPUs are active one at a time. So in some sense, this is the worst possible parallelism: I’ve added four GPUs, but I get the throughput of a single GPU.
One thing you can do is be a little bit more clever about what you do, and you can say, “Alright, I’m going to have a pipeline.” I’m not just going to cut things up in layers; I’m going to have a sequence of things that need to be processed by each GPU. So now let’s say I have a microbatch, right? Each machine is going to handle sort of four examples.
And what I’m going to do is, you know, I can finish my first example, my first data point, and I can send off the activations for that to my second GPU as soon as I finish, and then I can start working on my second data point. Right? And so now I’ve overlapped communication and computation. The second GPU can start working while the first GPU continues to work.
Now the size of the bubble can potentially be reduced by having bigger batch sizes, right? You can hopefully see why I said before that batch sizes are a resource. If you have a finite batch size and you have pipeline parallel, you can use that same batch size to make your pipeline bubble size smaller, for example, or you could use it to do data parallel, right? So there are many different ways that you can take your single batch size and then split it up into different ways.
So now your microbatch size can control the bubble time, and in fact, the ratio of your overhead to the useful compute that you have is the number of stages minus one over the number of microbatches. So if you have big batch sizes, pipeline parallel could potentially be efficient. But as we said before, batch sizes are finite; we can’t just crank that up to whatever value that we want.
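That ratio is simple enough to write down directly. Here is the bubble fraction as a one-liner, so you can see how microbatches shrink it (a sketch assuming the simple (stages − 1)/microbatches model from the slide):

def bubble_fraction(n_stages, n_microbatches):
    # idle time / useful compute for a simple pipeline schedule
    return (n_stages - 1) / n_microbatches

for m in (4, 8, 32, 128):
    print(m, bubble_fraction(4, m))    # more microbatches -> a smaller bubble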
In general, pipelines seem really horrible. Why do we do it? Why do we incur this cost of a bubble in order to parallelize? Well, there are a couple reasons. Pipelines help save memory compared to data parallel. I mean, zero stage 3 will also shard the parameters, but this also shards the activations, which is nice.
Pipelines can also have good communication properties, right? It only depends on activations. It’s also point-to-point, so it’s possible that depending on your topology and depending on what you have, pipelines might actually be very favorable for the slower parts of your network.
Pipeline parallel is often going to be used on your slower network links, inter-node or even sometimes across different racks or across different data centers. You might do pipeline parallel, right? One of the examples of a thing that I was recently told by some Google folks is, you know, they were saying actually one of the big advantages of TPUs is that we don’t have to do pipeline parallel very much because, you know, all of our connections are much bigger, right?
They have this big toroidal mesh. They don’t have this limit at 256 GPUs where they’re suddenly going towards a slower network link where you might want to switch to pipeline parallel, right? So that’s a real-world kind of example of when you would start to think about pipeline parallel.
And so this is an example from an NVIDIA paper, or I’ll talk about this paper in much greater detail later. They’ve done some really nice work showing the performance characteristics of different kinds of parallelism. But you kind of see with batch size 8 as you increase the pipeline parallel size, the number of devices, your utilization per GPU starts to really drop off.
Whereas if you have a big, big batch size of 128, you can get away with pretty good utilization for reasonably sized pipeline parallel. Right? So batch sizes are really key to hiding the size of the bubble. Otherwise, you have issues.
Of course, you can do different kinds of pipeline strategies. Instead of having these standard patterns for scheduling the bubble, you can sort of cut things up into finer pieces where you’re assigning different stages, assigning different sub-layers to different devices, and you’re doing different computations at different parts. You can then interleave the pipeline better.
And sort of an advanced version of this that I want to spend a moment talking about—and this is very clever—is zero-bubble pipelining, or I think in DeepSeek’s lingo they call it DualPipe, but the core trick is the same. If you think about it, let’s say we’re doing the backward pass to compute gradients. You can split this up into two different components.
The first part is about back propagating the activations. As I go down sort of the residual connections, I need to compute essentially the derivative with respect to the activations. Then, as I sort of get to a parameter, I also want to compute the gradient itself, like how am I going to update the parameters, not just how do the activations change with respect to the previous layers?
To give you a concrete example, let’s look at this bottom left diagram. In this diagram, you see the forward pass. This is a single MLP, so we’ve got multiply by A, I do a nonlinearity, and then I’m just going to output the nonlinearity. This is a naive single part of MLP. Now let’s look at the backward pass. I have the derivative with respect to the loss come in, and then I can compute how that’s going to change the inputs to my MLP.
This is, in some sense, the derivatives with respect to the activations here. As I compute these, of course, I can use them to compute the gradients that I need to update my weights. But the important thing is this part of computing the gradients for the weights can be done whenever. There’s no sort of dependence on this, and so I can rearrange the scheduling for this computation to any part of the computation graph.
So what you can do is you can sort of do your standard pipeline parallel for the parts that are serially dependent, but anytime you have to do these computations just for updating the parameters, you can sort of reschedule them wherever. The key idea is to start with a nice optimized pipeline, so you can take this and separate this computation of the backward part and the computation necessary to compute the gradient of the weights.
Now I can do the computation of the weights where I would have originally had a bubble, right? The parts where I originally had these idle utilization components, I can now fill them in with this computation. By thinking carefully about what the serial dependencies actually are, I can get good utilization out of my GPUs.
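For a single linear layer y = Wx, the split that zero-bubble schedules exploit looks like this in NumPy: the input gradient is needed immediately to keep back-propagating, but the weight gradient only needs the saved inputs and can be computed later, say in a slot that would otherwise be a bubble. This is my own minimal illustration of the idea, not the actual zero-bubble or DualPipe schedulers.

import numpy as np

W = np.random.randn(4, 3)
x = np.random.randn(3)                 # saved input from the forward pass
dy = np.random.randn(4)                # gradient arriving from the layer above

dx = W.T @ dy                          # needed right now: keeps the backward pass moving
deferred = (dy, x)                     # enough information to compute dW later

# ... later, in what would otherwise be pipeline idle time:
dW = np.outer(*deferred)               # dW = dy x^T, the weight gradient
print(dx.shape, dW.shape)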
To be clear, this is horrendously complicated, right? If you want to implement pipeline parallel in this way, you’re going to have to intervene in how your autodiff is actually calculating these things. You have to have a queue that can track where things go. I heard a funny anecdote in a conversation recently from someone in a frontier lab sort of training language models, and they said, “You know, actually there’s two people in the group that understand how the pipeline parallel in our infrastructure works. One person left, and so there’s a single load-bearing person in our training infrastructure.”
There are stories like this. Pipeline parallel is infrastructurally very, very complicated. It looks simple here, and if you’re interested, I encourage you to try and implement it. It does get pretty hairy pretty fast, and I think that’s a good note on which to switch to the other kind of model parallelism because this is much simpler.
This is often very cleanly utilized by a lot of frameworks, and a lot of people training really big models rely very heavily or primarily on this kind of model parallelism. So what other way can we split up a model? If we think about it, most of what we do is matrix multiplies, right? In a big model, most of the computation is matrix multiplies. Most of the parameters are matrix multiplies or matrices.
So what can we do? Well, if we can parallelize just the matrix multiplies, that would be pretty good. Tensor parallel is this idea that we can take a big matrix multiply and split it up into a set of submatrices that can be multiplied. If I have this matrix multiply at the top right, we have X, and X * A = Y, what I can do instead is I can cut up A in half, and I can also cut up X in half, and I can compute the submatrices. I can sum them up, and then I will get my answer at the end, right?
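You can check that decomposition numerically in a couple of lines: split X along its columns and A along its rows, multiply the halves separately (conceptually on two different GPUs), and the partial products sum to the full result.

import numpy as np

X = np.random.randn(6, 8)
A = np.random.randn(8, 5)

X1, X2 = np.split(X, 2, axis=1)        # split the shared inner dimension in half
A1, A2 = np.split(A, 2, axis=0)

Y = X1 @ A1 + X2 @ A2                  # partial sums from the two "GPUs"
print(np.allclose(Y, X @ A))           # True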
So conceptually, pipeline parallel is cutting along the depth dimension, like the layers, and tensor parallel, which is what this is, is cutting along the width dimension of your matrix multiplies. We’re going to decompose into submatrices and then do partial sums. Here’s an example of what it might look like in an MLP. We have each GPU handling a different submatrix of, let’s say, a big MLP matrix multiply, and then we’re going to have collective communications to synchronize the activations as we need them. So what are we going to do?
This is an MLP, and sort of the top half and the bottom half have two different paths. These are splitting up the matrices. I want to do this operation, y = x * A. I’m going to split up my matrix A into A1 and A2. And then on the right-hand side, I want to compute dropout YB. Right? And then I want to return the result as Z. So I’m going to also cut up B. So I’ve cut up both of my parameter matrices into two parts, A and B.
In the forward pass, what I’m going to do is I’m going to take my inputs X and I’m just going to copy them twice. Right? So each GPU is going to get the same inputs and they’re going to operate on it with A1 and A2. They have the same row dimensions, so it’s going to be fine operating on them. So XA1 and XA2 is going to give you some activations Y1 and Y2. Those are going to go into B1 and B2. And then I’m going to do an all-reduce to sum them up.
That’s exactly the figure I showed you before, right? So you copy and then you all-reduce and you get the answer Z. In the backwards pass, now it’s actually the reverse, as sort of the gradients come backwards in the backwards steps. This G is going to be the identity. So I’m going to copy sort of the derivatives on both sides and I’m going to do sort of the backwards operation all the way through. Once I get to f, this is an all-reduce, right? Because I’ve got sort of two derivatives coming in from both paths and then I sum them back up.
So this f and g are synchronization barriers. In the forward pass, I do a single all-reduce. On the backwards pass, I do a single all-reduce just at two different places in the computation graph. So now you can hopefully see how this is a very nice way of wherever you have a matrix multiply, you can just cut up the matrix multiply and sort of parallelize them across different devices.
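Here is a single-process sketch of that Megatron-style MLP split (dropout omitted, and the GeLU is the usual tanh approximation): A is cut column-wise, B is cut row-wise, each “GPU” runs its half independently, and a single sum—the all-reduce at g—recovers exactly the unsplit answer.

import numpy as np

def gelu(x):   # tanh approximation of GeLU, just for this demo
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

X = np.random.randn(2, 8)
A = np.random.randn(8, 16)
B = np.random.randn(16, 8)

A1, A2 = np.split(A, 2, axis=1)        # column-parallel first matmul
B1, B2 = np.split(B, 2, axis=0)        # row-parallel second matmul

Z0 = gelu(X @ A1) @ B1                 # what "GPU 0" computes on its shard
Z1 = gelu(X @ A2) @ B2                 # what "GPU 1" computes on its shard
Z = Z0 + Z1                            # the all-reduce at g

print(np.allclose(Z, gelu(X @ A) @ B)) # True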
As you might imagine, this is actually somewhat expensive. We have a synchronization barrier that lives kind of per layer. It needs to communicate an activation, sort of like the residual activation worth of stuff twice in a forward-backward pass. Tensor parallel, this very simple idea, is going to require very high-speed interconnects. There’s a rule of thumb. It’s a very simple rule of thumb to remember, which is that tensor parallel is applied within a device or within a single node.
So a single box of, let’s say Nvidia GPUs, is going to ship with eight different GPUs that live in that same box. As I showed you at the beginning of the lecture today, they’re very high-speed connected, right? So those eight GPUs can talk to each other very quickly. It makes sense to use something like Tensor Parallel that’s very bandwidth hungry between those eight devices. Typically, you will see tensor parallel applied up to eight GPUs where the eight GPUs live in the same machine, because that gives you the least sort of drop in performance.
This is an example from Hugging Face’s parallelization tutorial showing you the throughput decreases of different levels of tensor parallelism. You see that there are hits, right? 10 and 12 percent hits to throughput as you do tensor parallelism. But up until eight, maybe this is manageable. This is kind of the price you pay for just being able to parallelize more nicely. But then you go to 16 devices and you get this kind of astounding 42 percent drop in performance. You go to 32 and you see another sort of 65 percent drop in throughput, right?
Hopefully visually here, you see that you really want to stop at 8 for tensor parallelism. That’s really the sweet spot because of the kinds of hardware interconnects you can get your hands on. How do things now compare to pipeline parallel? Well, compared to pipeline parallel, we don’t really have to deal with this bubble thing that we had before. We don’t need to consume larger batch sizes in order to reduce the bubble, which is nice.
There’s relatively low complexity in applying tensor parallel. All you really need to know about is where the big matrix multiplies are. Can I split them up and make them live on different devices? The forwards and backwards operations still remain the same. Compared to implementing something like zero overhead or dual-pipe pipeline parallel, you’re going to be in much better shape doing this.
The con is that there’s much larger communication overhead. In pipeline parallel, you communicate something like batch size times sequence length times residual dimension in point-to-point communication per microbatch. In tensor parallel, you’ve got eight times that per layer, and it’s all-reduce communication. It’s potentially a very large amount of communication that needs to be done. The rule of thumb, as I said before, is tensor parallel is used whenever you have low-latency, high-bandwidth interconnects.
You’re going to see two to like 16 depending on what kinds of machines you have out in the wild. I’ll show you examples as I talk through at the very end here of examples of tensor parallel. Any questions on pipeline or tensor parallel before we move on to the third kind: sequence parallel and activation sharding? Yes. Can they both be used simultaneously or are they?
Yeah. So the question was, can they be used simultaneously? The answer is yes, you do use them both. The typical thing that you see for large-scale runs is that you very often see tensor parallel, with pipeline parallel often used on top of that. The only example I know of that does pipeline but not tensor parallel would be DeepSeek-V3, as far as I know. So within a single machine—say you have five machines—maybe the first 20 percent of the parameters live on the first machine with tensor parallel, and then that pipelines into the second machine?
The question is do you do tensor parallel within the machine and pipeline parallel across machines? Yes. So you would do something like tensor parallel within the machine and a combination of data and pipeline parallel across machines, for example. I’ll show you the rule of thumb later, but basically, you do pipeline parallel because your models won’t fit. If you could fit your entire model, you just do data parallel plus tensor parallel or just maybe even data parallel.
We’ve been talking about memory, and memory is, in some sense, a very important part of parallelization because we’re going to be training big models. When you look at your memory, you realize that actually activations are a really big part of your memory usage. If you look at a standard kind of forward-backward pass, this was one from one of the PyTorch tutorials. You see that memory usage is very dynamic.
I’ll talk through this because I think it’s an interesting plot in general. You always have your parameters as you’re training because that’s static, but you know in iteration zero, you don’t still have optimizer state at all. Actually, you don’t have that part of your memory use. But as you do your forward and backwards, you see activation grows, grows, grows, grows, grows as you accumulate all the activations.
As you start your backwards pass, your activation goes down because you’re freeing it as you use up your activations and then you’re accumulating your gradient. Your gradient memory usage goes up. The peak is actually somewhere partially through your backwards pass where you haven’t freed all your activations yet, and you’re still building up your gradients. In iteration two, you kind of see the same thing here.
The point of this diagram is to say we’ve thought about all the other pieces. We thought about the parameters. We’ve thought about optimizer state. We’ve thought about the gradients. But we have not thought very deeply about the activations. So let’s do that. The final complexity that I want to talk you through is the activation memory. Tensor and pipeline parallel can linearly reduce basically most things, but it can’t actually reduce all of the activation memory usage.
This is an example from one of the NVIDIA papers that’s talking about how do you reduce activation memory. One thing that’s really interesting to see is as you make your models bigger and bigger, so going from left to right, you see that parameter and optimizer state memory can remain the same if we parallelize aggressively, but activation memory continues to grow because some parts of it don’t parallelize very cleanly.
No matter the number of devices you have, you can’t really get rid of the growth of activation memory per device, and I’ll show you why in a moment here. Whereas, if you do some slightly more clever things like recomputation, you can keep the activation memory low and that’s really key to parallelizing some of the biggest models.
What’s the activation memory per layer? You’ve done some of this transformer math and calculus before, so hopefully you’re now familiar with all of this. But we can compute what’s the amount of activation memory we need per layer. There’s a handy formula here. This is the amount of memory you need: SBH * 34 + 5 A S over H. Some of these numbers are mystifying, but actually they’re not so mystifying.
You can very much see that there’s a left term and a right term. The left term, the 34sbh, comes from the MLP and other pointwise operations; it depends on the size of your residual stream, h. The right term, if you multiply it out, is actually 5abs², because the h’s cancel.
That’s the memory you need for the softmax term and other quadratic terms in your attention, right? If you use flash attention, you can drastically reduce and use recomputation. You know that we can drastically reduce that second term. Let’s say we do tensor parallel everywhere we can. So we do it in the MLPs, we do it in the KQ computations in the attention computation. We will end up with something that looks like this, and this is looking pretty good but not quite there.
Activation memory per layer divided by t, which is the number of devices that we’re tensor-paralleling over. If we’re dividing by 8, ideally we would divide all the activation memory by 8. But there’s this straggler term, 10sbh, that has not been reduced. If you think about what these are, they’re the non-matmul components, like the layer norms, the dropouts, and the inputs to the attention and the MLP, right?
All of these terms will unfortunately continue to grow with size and they will not be paralleled very nicely. The very last thing we need to think about is to take those simple pointwise operations, which thus far we have not parallelized, and we just need to split them up. There’s a very simple way to split them up, which is to say, if we’re doing a layer norm, these layer norms across different positions in the sequence do not interact at all with each other.
They just don’t care about anything else. What we’re going to do is, let’s say we have a 1024 long sequence. We’re going to cut that up and then each device will handle a different part of that layer norm or a different part of that dropout. Those pointwise operations can now be completely split up across the sequence dimension. Since now we’re cutting things up across the sequence dimension, we’re going to have to do some synchronization to make sure the parallel computations we did will get aggregated back together.
In the forward pass, these g’s are going to be all-gathers and the g-bars are going to be reduce-scatters. In the backwards pass, the two are reversed. In some sense, there’s a duality here between the two. For the layer norm, we’ve scattered things around, and so we’re going to have to gather them back together so that we can do our standard computation. Then whenever we get to the dropout, we want to scatter them back out into the parallel components that we have.
In the backwards pass, we’re doing that in reverse. This is a very simple idea, right? We’re just parallelizing the very last components that we failed to parallelize before. Now we can put all these different pieces together and sort of get to the end, which is we started up here with no parallelism at all. We did tensor parallel, which allows us to divide everything that’s not a pointwise operation by T.
If we apply the sequence parallelism idea, we can divide this component by t once more. We can do things like activation recomputation, which is the flash attention trick, to remove the second term. The minimal memory that you can easily get away with is the thing on the bottom, which is 34sbh/t. This is what’s often quoted if you’re looking at different formulas for transformer arithmetic on how much activation memory you use.
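As a small calculator version of those formulas (following the Megatron activation-recomputation paper’s accounting; treat it as a sketch, since real frameworks differ in the details), with s the sequence length, b the microbatch size, h the hidden size, a the number of heads, and t the tensor-parallel degree:

def activation_bytes_per_layer(s, b, h, a, t=1,
                               sequence_parallel=False, recompute_attention=False):
    # 34sbh term (10sbh of it un-sharded unless sequence parallelism is on)
    pointwise_and_mlp = s * b * h * (34 / t if sequence_parallel else 10 + 24 / t)
    # 5abs^2 term, removed entirely by flash attention / recomputation
    attention_quadratic = 0 if recompute_attention else 5 * a * b * s * s / t
    return pointwise_and_mlp + attention_quadratic

# example: s=4096, b=1, h=8192, a=64 heads, in GB per layer
print(activation_bytes_per_layer(4096, 1, 8192, 64) / 1e9)                 # no parallelism
print(activation_bytes_per_layer(4096, 1, 8192, 64, t=8,
                                 sequence_parallel=True,
                                 recompute_attention=True) / 1e9)          # 34sbh/t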
In those write-ups you often see something like 34sbh, and then if you have tensor parallel, you divide by t. That’s the easy minimum you can get for that kind of memory. Any questions on sequence parallel and activations? Yes, I was wondering about transformers stacking on top of each other. I suppose the computation graph will grow more and more—like a general computation graph, a DAG. Would that ever become a problem for communication between the GPUs?
You’re asking if we have something that’s a more complicated computation graph than a single linear chain—will that become a problem? It’s a good question. I haven’t thought about that. I would guess not, at least for tensor parallel, this operates purely layer-wise. It doesn’t really care about the dependencies. Maybe for pipeline parallel, there’s opportunities for increased parallelization if there’s more than one branch, but I’m not too sure.
There are a few other parallelism strategies that I’m not going to talk about, just because in the interest of time and sort of fatiguing you, because I think I’ve already dragged you through a whole bunch of low-level details about how to do parallelization. The first one I want to talk about is context parallel or ring attention. You may have heard the term ring attention before. This is a way of essentially splitting up both the computation and the activation cost of computing really large attention.
Essentially, you’re just going to pass keys and values around different machines. Each machine is responsible for a different query, and then keys and values are going to travel from machine to machine in a sort of ring-like fashion in order to compute your KQV inner products. The cool thing is you already kind of know how to do this because you’ve done the tiling for flash attention. You know that attention can be computed in this kind of online tile-by-tile way, and that’s kind of what’s happening in ring attention.
The other thing, which now that you know tensor parallel, is pretty straightforward, is expert parallelism. Expert parallelism, you can think of as almost like tensor parallel in the sense that you’re splitting up one big MLP into smaller expert MLPs and then scattering them across different machines. The key difference with expert parallelism is that the experts are sparsely activated. You have to think a little bit about routing, and the routing is not going to be as predictable as the all-to-all communication we had before in tensor parallel because now maybe one expert is overloaded and your networking is going to be a little bit more complicated.
But otherwise, conceptually, you’re living in kind of the same world as tensor parallel for expert parallelism. Just to recap all the things we talked about, I’ve made a small table of the different strategies that we have. We have DDP and ZeRO stage 1. This is kind of the naive data parallelism thing that you do. Here you have some overhead per batch. You have no memory scaling, reasonable bandwidth properties. But you consume batch size in order to be able to do this, right? You need big batch sizes to have big data parallelism.
You have FSDP, which is kind of like a nicer version of ZeRO stage 1 in the sense that you can get memory scaling, but you’re going to pay overhead across different layers. Now you’ve got higher communication costs, and you’ve got potentially synchronization barriers that lead to poor utilization. Pipeline parallel is nice in that we no longer have this dependence on per-batch aspects, and we can get linear memory scaling. But we have another issue, which is this also consumes batch size, and it’s horrendous to set up and use, so a lot of people like to avoid pipeline parallelism if it’s possible.
Finally, tensor parallelism is very high cost in terms of bandwidth and the amount of synchronization you need to do, but this has this really nice property that it has no impact on batch sizes. It’s like the one parallelism strategy you can use that has no cost in terms of your global batch size, which is nice. We have to balance a number of limited resources. We have memory, which is one resource. We have bandwidth and compute, which is another resource, and then we have batch size, which is kind of an unconventional resource but one that you should think of as a limited thing that you can spend on different aspects of these to improve your efficiency.
There’s a very nice TPU parallelism book from Google that I referred to last week, but also they have a really nice parallelism section with a great figure I wanted to show you before I moved on to some of the examples. The key quantity, as I was saying before, is the batch size. Depending on the ratio of batch size to the number of GPUs, different kinds of parallelism become optimal. They use a certain formula on how much communication and computation you end up doing for each of these models.
This is a simplified formula to generate this plot, and you can see if your batch size is too small, you have lots of GPUs and tiny batch sizes. There is no way for you to be efficient. You’re always communication-bound, which is this bottom half here, and in fact, you’re spending most of your time on communication. As you get more and more batch size, eventually you can get to a point where if you mix both FSDP (zero stage three) and MP (which in this case is tensor parallel), you can actually get to a place where you’re compute-bound.
Now you’re not wasting your FLOPs waiting for communication. Finally, if you get to a point where your batch sizes are big, then you can just get away with pure data parallel. Pure FSDP is going to get you into a regime where the time you spend doing computation is higher than the time you spend doing communication, right? If your batch size is big enough, you can just get away with FSTP. This is a cool illustration of this idea.
When you put these all together, you end up with what people call 3D or 4D parallelism. I think I’ve heard the term 5D parallelism recently. I wasn’t quite sure what the fifth dimension was yet. I’ll have to read up on that. You can put it all together, right? The different dimensions of parallelism. This is a really simple rule of thumb. I originally looked it up and put this together last year, but turns out it’s still the same this year. You can sort of follow this now.
The first thing you have to do is fit your model and your activations in memory. If you don’t do that, you just cannot train. This is a requirement, right? Until your model fits in memory, we have to split up our model. We’re going to do tensor parallelism, and we know that up to the number of GPUs per machine, it’s very efficient and fast.
We’re going to do tensor parallel up to that point. After that, depending on things like your desire to deal with pipeline parallel and your bandwidth constraints, you’re either going to use 03 or pipeline parallel across the machines until you can fit your model in memory. After that point, until you run out of GPUs, you can now run the whole thing, and your only goal is to increase the number of total FLOPs you have on hand. You’re going to scale the rest of the way with data parallelism because data parallel works well on low-bandwidth communication channels and is very simple.
That’s going to give you a way of using all your GPUs. If your batch size is small, then there’s a way of trading batch sizes for better communication efficiency. If you haven’t consumed all of your batch sizes of resource, you can use gradient accumulation on your devices. That’ll let you effectively have larger batch sizes even if you’re memory constrained, and that will let you trade your batch size for better communication efficiency since you’re synchronizing less often across machines.
This simple rule of thumb will let you train models with reasonable efficiency no matter what you’re doing. To make this concrete, I’ll talk through a few examples at the very end here. I’ll flash through this really lovely paper from Megatron LM back in 2021, basically showing you exactly these things in pictures and also a lot of ablations as well as some of the models from last year.
This is a big table of how they trained models going from 1.7 billion parameters to 1 trillion parameters. They get great utilization on all of these, right? You see the percentage of theoretical peak FLOPs they get, and it ranges from 40 to 52%. It’s pretty good, right? You can see tensor parallel starts at one and then they eventually go up to eight and cap out at eight, right?
So they’re using tensor parallelism first, and then pipeline parallel stays at one. But once the models get big enough, they can’t fit these big models. So pipeline parallel has to increase to compensate, and then the data parallel size basically starts out as big as possible and then slowly kind of goes down because, as we increase the amount of pipeline parallel, this is now consuming the batch sizes, and so you can’t have as big of a batch size if they’re being used for pipeline parallel.
Careful 3D parallelism is going to give you sort of linear gains in aggregate FLOPs. If you do careful 3D parallelism, you see very flat achieved FLOPs per GPU, which means that if you add more GPUs, you get linear scaling in total aggregate throughput. Tensor parallel of 8 is often optimal. You see this with the pipeline-parallel and tensor-parallel sizes at 8 and 8 with a batch size of 128. Even if you have a smaller batch size, a tensor parallel size of eight remains optimal, and activation recomputation enables larger batch sizes.
Remember that larger batches can, in turn, help you mask overhead for pipeline parallel. Activation recomputation, even though it’s more FLOPs, can pay for itself. We’ve seen that story play out in flash attention. The last part of this is recent language models. What do they do? I’ve gone through a few papers to look at examples of what people’s parallelization strategies are.
In the OLMo paper, they do FSDP for a 7 billion parameter model. DeepSeek, in their first paper, does ZeRO stage 1 with tensor, sequence, and pipeline parallelism. This is the vanilla approach. DeepSeek-V3 actually does something slightly different: they do 16-way pipeline parallelism and 64-way expert parallelism, which is kind of like tensor parallelism, and then ZeRO stage 1 for their data parallelism strategy.
Yi, another Chinese model, does ZeRO stage 1 plus tensor and pipeline parallelism again. Yi-Lightning replaces tensor parallelism with expert parallelism. Finally, if you're interested in state-of-the-art distributed training with lots of details, Llama 3's report is really interesting to read. They have a lot of detail about how they do their networking and what sorts of things happen.
Once again, you see the kinds of things I said before. You see tensor parallel of eight. You see this is context parallel, which is only relevant for long context training, the very last step. You can ignore that. You have pipeline parallel and data parallel happening in these first two phases. You can also ignore the first stage here because that’s kind of the small batch size training they did for stability.
If you look at their rationale for how they do their parallelism strategy, you see exactly what I had said before: “Alright, you want to do TP, CP, pipeline parallel, and DP in that order in terms of the amount of bandwidth that you need.” Data parallel can tolerate long network latencies because you can do the asynchronous fetching of sharded model weights. They’re using the strategy I mentioned to train some of the biggest models.
The funny side note about Llama 3 is, as you may have heard in casual conversation, there’s lots of GPU failures when you train models at a huge scale. They had 148 interruptions from faulty GPUs, totaling about 30% of the total interruptions they had. They had things like unplanned maintenance of machines, which accounted for 32 interruptions during training.
When you’re training a model this big, I’ve talked about the algorithms, but you also need fault-tolerant architectures to be able to deal with these kinds of things. I’ve heard various stories of people saying the even scarier thing is not explicit model failures but actually data corruption. GPUs can silently fail on you and give you garbage data, completely ruining your run.
The last example is Gemma 2, and I wanted to end on this because it's a TPU example. They do ZeRO-3, which is roughly FSDP, and then they do model parallelism and data parallelism. Here, as I said before, the TPUs allow them to stretch model parallelism a little bit further. Putting it all together, scaling beyond a certain point is going to require multi-GPU, multi-node parallelism. There's no single solution.
You want to combine all three approaches to leverage their strengths, and there are simple and interpretable rules of thumb for how you might execute this parallelism in practice. Thank you.
2025-05-11 08:00:01
The Physical Turing Test: Jim Fan on Nvidia’s Roadmap for Embodied AI
Next up, we have Jim Fan. You all know him. Come on up, Jim. Jensen was talking about him just this morning. He is not only director of AI at NVIDIA, but also a distinguished research scientist, and he’ll talk to us about physical AI.
So, a couple of days ago, I saw a blog post that caught my attention. It says, "We passed the Turing test and nobody noticed." Well, the Turing test used to be sacred, right? It's the holy grail of computer science, right? The idea that you can't tell the difference between a conversation with a human and one with a machine. And then it just so happens that we got there. We just got there.
People are upset when o3-mini takes a few more seconds to think, or when Claude is not able to debug your nasty code, right? And we shrug off every LLM breakthrough as just yet another Tuesday. You guys in the room are the hardest crowd to impress. So I would like to propose something very simple called the physical Turing test.
The idea is like this, right? You host a hackathon party on a Sunday night, and this is what you end up with. Your partner's yelling at you, and you're like, "Ah, damn. On Monday morning, I want to tell someone to clean up this mess and make me a very nice candlelit dinner so my partner can be happy." And then you come home to this, and you cannot tell whether it was a human's work or a machine's. Right? Simple enough. The physical Turing test.
But where are we now? Are we getting close? Well, look at this robot getting ready for work. It didn't make it, right? And how about the robot dog and the banana peel? Ah, yeah. And the robot that instructs you on making your breakfast cereal. Well, it correctly identifies the milk; I'll give it that. It's well-intentioned. Oh, it spoon-feeds you. It's a VIP experience, right? Look at that. I'm jealous. I've got no one to spoon-feed me. Yeah, this is where we're at.
So why is it so hard to solve the physical Turing test? You know LLM researchers complain a lot, right? They complain a lot. And recently some guy named Ilya complained. He said LLM pre-training is running out of data. He even called the internet the fossil fuel of AI, and said we're running out of data to train LLMs. Well, just spend one day with a roboticist and you'll know how spoiled the LLM researchers are. We don't even get the fossil fuel.
So, this is a data collection session at NVIDIA headquarters. There’s a cafe in NVIDIA and we have these humanoid robots set up where we operate them and collect the data. And this is what the data looks like, right? The robot joint control signals. And these are continuous values over time. And you cannot scrape this from the internet. You can’t find it on Wikipedia, on YouTube, on Reddit, anywhere. So you have to collect it yourself.
And how do we collect it? We have a very sophisticated way, but also very expensive way called teleoperation. Well, you can have a human wear something of a VR headset that recognizes your hand pose and streams to the robot. And in this way, you can teach the robot what to do, like pick up a bread out of a toaster and then pour honey over it. But you can imagine this is a very slow and painful process, right? So if you put it on the scaling plot, basically it doesn’t scale at all. The real robot data is the human fuel. It’s worse than the fossil fuel. You’re burning human fuel.
And what’s worse, it’s at most 24 hours per robot per day. And in fact, you’ll get much less than that because the human gets tired and the robots get tired even more than the humans. So this is what you get and what to do, right? How to break this barrier? Where is the nuclear energy for robotics? We got to have clean energy. Can’t live on fossil fuel forever.
Well, enter simulation. We've got to leave the physical world and do something in simulation. So we trained this robot hand to do superhuman dexterous tasks like spinning a pen in simulation. Well, it's superhuman with respect to me, because I couldn't spin a pen; I gave up a long time ago in childhood, and I'm glad that my robot, at least in simulation, can do it better than I can.
So how do we train the hand to do a sophisticated task like this? There are two ideas. One is you got to simulate at 10,000 times faster than real time, meaning that you should have 10,000 environments running in parallel on a single GPU doing physics simulation. That’s number one. And number two, the 10,000 copies of the environment cannot all be identical. You got to vary some parameters like gravity, friction, and weight. And we call that domain randomization. And that gives us the simulation principle, right?
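As a rough illustration of those two ideas, many environments stepped in parallel, each with its own randomized physics, here is a minimal sketch; the parameter names and ranges are made up for illustration and are not tied to any specific simulator API:

```python
import numpy as np

# Domain randomization sketch: each of the parallel environments gets its own
# physics parameters, so a policy that works in all of them is more likely to
# also work in the one extra "environment" that is the real world.
NUM_ENVS = 10_000
rng = np.random.default_rng(0)

domain_params = {
    "gravity": rng.uniform(8.5, 11.0, NUM_ENVS),        # m/s^2, nominal 9.81
    "friction": rng.uniform(0.4, 1.2, NUM_ENVS),        # contact friction coeff.
    "mass_scale": rng.uniform(0.8, 1.2, NUM_ENVS),      # +/- 20% object mass
    "motor_strength": rng.uniform(0.85, 1.15, NUM_ENVS) # actuator variation
}

def summarize_randomization():
    """In a GPU-parallel simulator, all 10,000 copies advance in one batched
    step; here we just report the spread of the randomized parameters."""
    for name, values in domain_params.items():
        print(f"{name}: mean={values.mean():.2f}, "
              f"min={values.min():.2f}, max={values.max():.2f}")

summarize_randomization()
```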
Why does it work? Imagine a neural net that is able to control a robot to solve a million different worlds; then it may very well solve the million-and-first world, which is our physical reality. In other words, our physical world is in distribution for this training. And how do we apply this? You build a digital twin, right? A one-to-one copy of the robot and the world. You train in simulation, then test it in the real world directly, and it transfers, right, zero gap. And you can do it with a robot hand; this is the most impressive task that we could do.
So basically, you have a robot dog on a ball, and we transfer that to the real world. This is at UPenn, with someone basically walking the robot dog. Our researcher looks super weird; it's like a Black Mirror episode. This is called DrEureka. Actually, one of the researchers tried his own dog on the yoga ball, so at least we have super-dog dexterity right now. Yeah, the real dog cannot do it.
And next we can also apply that to much more complicated robots like the humanoid. These humanoid robots went through 10 years worth of training in only two hours of simulation time to learn walking. Then you can transfer that and it doesn’t matter what the embodiment is as long as you have the robot model. You simulate it and you can do the walking. Can we do more than walking, right?
As we are controlling our body, you can track any pose that you want, track any key point, and follow any velocity vector that you want. This is called the whole body control problem of humanoid and it’s really difficult, but we can train that right on 10,000 simulations running parallel. We can transfer that zero shot without any fine-tuning to the real robot.
This is at the Nvidia lab. We actually need to slow down the video. The first video is in real time, and the next video is slowed down. You can see the sophistication of the motion that it does. It imitates the human all these agile motions while standing balanced.
And guys, how big of a neural network is required to do this? It's 1.5 million parameters, not billion. 1.5 million parameters is enough to capture the subconscious processing of the human body. That System 1 reasoning is 1.5 million parameters.
If we put this on this diagram where you have the speed versus the diversity of a simulation, I think we call this simulation 1.0, the digital twin paradigm where it is a classical vectorized physics engine. You can run that up to 10,000, up to a million frames per second. But the issue is you got to build a digital twin. You need someone to build a robot, to build the environment and everything, right? That’s very tedious and manual.
So, can we start generating parts of the simulation? All of these 3D assets are generated by a 3D generative model. All of these textures come from Stable Diffusion or any diffusion model you like. All of these layouts are generated by prompting an LLM to write the XML. Putting all of this together, we built a framework called RoboCasa, which is a large-scale, compositional simulation of everyday tasks. Everything here, except the robot, is generated.
You can compose different scenes, but it still relies on this classical engine to run, yet you can already get a lot of tasks from it. Now what we can do is have a human do the teleoperation again, but this time you teleoperate in simulation. You don't teleoperate a real robot; you teleoperate in simulation. You replay that trajectory in simulation, and you add all the great hardware-accelerated ray tracing to make these beautiful scenes with lighting.
You can even vary the motion, right? If you teleoperate and move the cup from here to here, you don't have to demonstrate moving the cup from there to there, or from there to there, again. Putting all of this together, one human demonstration in simulation can be multiplied to N through environment generation.

Add motion generation and it becomes M times N. I promise you, this is the only math that you're going to do today. That's how we multiply the data. Then you put it together: columns one and three are the real videos from our real robot, and columns two and four are from the RoboCasa simulation, all generated. You can still tell that these textures are not real, but they're close enough.
What do we call things that are close enough? We call it the paradigm of the digital cousin. It's not the digital twin, but it kind of captures the right idea. These simulations run slower, but they are a kind of hybrid generative physics engine, where we generate parts of it and delegate the rest to the classical graphics pipeline.
Now, how do you simulate a scene like this? You've got soft bodies, you've got fluid, you've got everything. It's going to take a very long time for artists or graphics engineers to simulate this scene properly. If you look at how graphics evolved, it took 30 years to go from the left to the right. It took video generation models just one year to go from left to right, simulating all the deformable noodles, right?
It lost the sense of humor here, but that's a price I'm willing to pay for the latest Sora or Veo, right? All these generative video models only took one year. That's the power of scaling and data-driven processes. Do you recall the video I showed at the beginning? I tricked you guys.
There’s not a real pixel in this video. It is fully generated by a custom model. What we do is take a general-purpose open-source state-of-the-art video generation model and we fine-tune it on domain data collected in our real robot lab, and all of these are generated. Now you can prompt the model to imagine different futures, right? To simulate the counterfactuals.
You see these two frames are exactly the same, but given different language prompts, the generated video actually follows the language and does the right thing, even though this motion never happened in the real world. And then you can do this. The video diffusion model doesn't care how complex the scene is, right? It doesn't care if there's fluid or soft bodies, and in the same scene you can ask it to pick up different things. It will actually use the right hand to grab the object and put it in the basket. And these are all generated; not a single pixel is real. It gets all these reflections correct, all of those interactions correct.
One of my favorites is the robot playing ukulele over there. Basically, the video model has probably seen millions of humans playing ukulele, and it just simulates the robot fingers doing that, even though the hardware doesn't actually support it. So if we put this in perspective, this is simulation 2.0, where you've got a lot of diversity but it runs pretty slowly these days. Nobody else calls it this, but I'm calling it the digital nomad, right? It wanders into the dream space of our video diffusion model.
And what is a video diffusion model, right? It is a compression of hundreds of millions of internet videos into this kind of simulation of the multiverse, so just like Doctor Strange, right? You instantiate the robot in the dream space, and basically, the robot can now interact with objects everywhere, everything, everywhere, all at once. So you have this embodied scaling law.
Okay. So Jensen left, but I think he’s going to like this a lot, right? So you need a lot of compute to scale up the classical simulation, and that’s the sim 1.x series. The issue is as you scale this up, it’s going to hit a wall because the diversity is limited in this handcrafted system. And then this is the neural world models, the sim 2.0 that’s going to scale exponentially with compute. And that’s a point where the neural network outperforms a classical graphics engineer.
Together, these two adding up will be our nuclear power to scale up the next generation of robotics systems. The more you buy, the more you save. So for whoever said at the beginning that the compute situation is going to get better, not worse: burn this figure into your retina and think again. You put all that data into what we call a vision-language-action model, which takes in pixels and instructions and outputs motor controls, and you get what we open-sourced at the March GTC keynote, the GR00T N1 model, which we run on the robot.
You know, it can be romantic sometimes. Yeah, you can't imagine how much cleaning we did during training. It's able to grasp the champagne; in this one it did it perfectly. Yeah, they do very well. It can also do some industrial tasks, pick up some factory objects, and do multi-robot coordination. GR00T N1 is fully open source, and the future models in the series will also be open source, because we're following Jensen's paradigm of open-sourcing and democratizing physical AI.
Great. So what’s next? Where do we go after we solve physical AI? I will say the next thing is the physical API. You know, throughout human history, right, 5,000 years, we have much better tools, right? Much better society in general, but the way we make dinner and do a lot of hand labor are still more or less the same, right, from the Egyptian times.
And maybe for 99% of human history, we had this structure where you go from raw materials through human labor, and you build civilization. And maybe in the last 1%, the last 50 or so years, human labor has been shrinking, and we have these highly specialized, highly sophisticated robot systems that can do one thing at a time. They're very expensive to program, but they still help run our society. And that's what we have right now.
And this is the future: to push that blue bar all the way over there and have the physical API. Just like LLM APIs move around chunks of bits, the physical API moves around chunks of atoms. You basically give your software a physical actuator to change the physical world. And on top of this physical API, there's going to be a new economy, a new paradigm where you have physical prompting: how do you instruct these robots? How do you teach them? Language sometimes is not enough.
You can have a physical app store and a skill economy. Let's say a Michelin chef doesn't need to go to the kitchen every day; he can teach a robot and basically deliver a Michelin dinner as a service. And I should quote Jensen here again: in that future, everything that moves will be autonomous. And one day you'll come home to a clean sofa and a candlelit dinner, and your partner is smiling at you instead of yelling at you for not doing the dirty laundry.
That still motivates me every day, right? And the two humanoid robots you bought last month? They're running GR00T N7, and those robots just fade into the background, kind of like ambient intelligence. It fades into the background, and you won't even notice the moment we pass the physical Turing test. That day will simply be remembered as another Tuesday. Thanks.
2025-05-07 08:00:01
Claude Code: Anthropic’s CLI Agent
Hello, AI engineers. A few weeks ago, engineering legend and former guest Steve Yegge from Sourcegraph wrote an enthusiastic review: "I've been using Claude Code for a couple of days, and it has been absolutely ruthless in chewing through legacy bugs in my gnarly old code base. It's like a wood chipper fueled by dollars. It can power through shockingly impressive tasks using nothing but chat." It seems the majority of high-taste testers agree.
Since then, the Claude Code team has been on an absolute tear, delivering weekly updates, shipping best practices for agentic coding, and dedicated Claude Code docs. As GitHub Copilot turns four years old, we now see four major battlegrounds for coding agents. One, AI IDEs like Windsurf and Cursor, now worth over $12 billion. Two, vibe coding platforms like Bolt, newcomer Lovable, and v0. Three, autonomous outer-loop agents like Cognition's Devin, Cosine's Genie, and upcoming guest Factory AI's Droids.
We've covered all three categories of coding agents, and today we're taking a look at the newest one: CLI-based agents like Aider, OpenAI Codex, and Claude Code. We're excited to share that the Claude Code team will be presenting at the upcoming AI Engineer World's Fair in San Francisco, which now has early bird tickets on sale. On June 3rd, spend the day learning in hands-on workshops. On June 4th, take in tracks across MCP, Tiny Teams, Vibe Coding, LLM Recommendation Systems, GraphRAG, Agent Reliability, Infrastructure, AI Product Management, and Voice AI. On June 5th, eight more tracks for Reasoning and RL, SWE Agents, Evals, Retrieval and Search, Security, Generative Media, Design Engineering, Robotics, and Autonomy.
For CTOs and VPs of AI, there are now two leadership tracks, AI in Fortune 500 and AI Architects, named after our very well-received podcast with Bret Taylor of Sierra and OpenAI. Claude Code will be presenting on the SWE Agents track on June 5th. Join us at AI.Engineer. Watch out and take care.
Hey, everyone. Welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel, and I'm joined by my co-host, swyx, founder of Smol AI. Hey, and today we're in the studio with Kat Wu and Boris Cherny. Welcome. Thanks for having us. Thank you. Kat, you and I know each other from before. I just realized Dagster as well. Yeah. And then Index Ventures and now Anthropic. Exactly. It's so cool to see a friend that you know from before now working at Anthropic and shipping really cool stuff.
And Boris, you're a celebrity, because we were just having you outside getting coffee and people recognized you from your video. Oh, wow. Right? That's new. Wasn't that neat? Yeah, I definitely had that experience once or twice in the last few weeks. It was surprising. Yeah. Well, thank you for making the time. We're here to talk about Claude Code. Most people have probably heard of it. We think quite a few people have tried it. But let's get a crisp, upfront definition. What is Claude Code?
Yeah. So Claude Code is Claude in the terminal. So, you know, Claude has a bunch of different interfaces. There's desktop, there's web, and Claude Code runs in your terminal. Because it runs in the terminal, it has access to a bunch of stuff that you just don't get if you're running on the web or on desktop or whatever. So it can run bash commands, it can see all of the files in the current directory, and it does all of that agentically.
And yeah, maybe the question under the question is, where did this idea come from? Part of it was we just wanted to learn how people use Claude, how people use agents. We're doing this with the CLI form factor because coding is kind of a natural place where people use agents today, and, you know, there's kind of product-market fit for this thing. But yeah, it's just sort of this crazy research project. And obviously, it's kind of bare-bones and simple. But yeah, it's like an agent in your terminal. That's how the best stuff starts.
Yeah, how did it start? Did you have a master plan to build Claude Code? Or? There was no master plan. When I joined Anthropic, I was experimenting with different ways to use the model in different places. And the way I was doing that was through the public API, the same API that everyone else has access to. And one of the really weird experiments was this Claude that runs in a terminal. And I was using it for kind of weird stuff. I was using it to, like, look at what music I was listening to and react to that.
And then, you know, like screenshot my, you know, video player and explain what’s happening there and things like this. And this was like kind of a pretty quick thing to build. And it was pretty fun to play around with. And then at some point, I gave it access to the terminal and the ability to code. And suddenly, it just felt very useful. Like I was using this thing every day. It kind of expanded from there. We gave the core team access and they all started using it every day, which was pretty surprising.
And then we gave all the engineers and researchers that Anthropic access. And pretty soon, everyone was using it every day. And I remember we had this DAU chart for internal users. And I was just watching it and it was vertical, like for days. And we’re like, all right, there’s something here. We got to give this to external people so everyone else can try this too. Yeah. Yeah, that’s where it came from.
And were you also working with Boris already? Or did this come out and then it started growing? And then you’re like, okay, we need to maybe make this a team, so to speak. Yeah, the original team was Boris, Sid, and Ben. And over time, as more people were adopting the tool, we felt like, okay, we really have to invest in supporting it because all our researchers are using it. And this is like our one lever to make them really productive.
And so at that point, I was using Claude Code to build some visualizations. I was analyzing a bunch of data, and sometimes it's super useful to spin up a Streamlit app and see all the aggregate stats at once. Claude Code made it really, really easy to do. So I think I sent Boris a bunch of feedback, and at some point, Boris was like, do you want to just work on this? And so that's how it happened.
It was actually a little like, it was more than that on my side. You were sending all this feedback. And at the same time, we were looking for a PM. And we were looking at a few people. And then I remember telling the manager, like, hey, I want Kat. I’m sure people are curious. What’s the process within Anthropic to like graduate one of these projects? Like, so you have kind of like a lot of growth. Then you get a PM. When did you decide, okay, we should like, it’s ready to be opened up?
Generally at Anthropic, we have this product principle of do the simple thing first. And I think that the way we build product is really based on that principle. So you kind of staff things as little as you can and keep things as scrappy as you can, because the constraints are actually pretty helpful. And for this case, we wanted to see some signs of product market fit before we scaled it.
Yeah, I imagine. So like we’re putting out the MCP episode this week. And I imagine MCP also now has a team around it in much the same way. It is now very much officially like sort of like an Anthropic product. So I’m kind of curious for Kat, like, how do you view PMing something like this? Like what is, I guess you’re like sort of grooming the roadmap. You’re listening to users.
And the velocity is something I’ve never seen coming out of Anthropic. I think I’ve come with a pretty light touch. I think Boris and the team are like extremely strong product thinkers. And for the vast majority of the features on our roadmap, it’s actually just like people building the thing that they wish that the product had. So very little actually is tops down. I feel like I’m mainly there to like clear the path if anything gets in the way and just make sure that we’re all good to go from like a legal marketing, etc., perspective.
And then I think like in terms of very broad roadmap or like long-term roadmap, I think the whole team comes together and just thinks about, okay, what do we think models will be really good at in three months? And like, let’s just make sure that what we’re building is really compatible with like the future of what models are capable of.
I’d be interested to double-click on this. What will models be good at in three months? Because I think that’s something that people always say to think about when building AI products, but nobody knows how to think about it because everyone’s just like, it’s generically getting better all the time. We’re getting AGI soon. So don’t bother, you know, like how do you calibrate three months of progress?
I think if you look back historically, we tend to ship models every couple of months or so. So three months is just like an arbitrary number that I picked. I think the direction that we want our models to go in is being able to accomplish more and more complex tasks with as much autonomy as possible. And so this includes things like making sure that the models are able to explore and find the right information that they need to accomplish a task, making sure that models are thorough in accomplishing every aspect of a task, and making sure the models can compose different tools together effectively.
Yeah, these are the directions we care about. Yeah. I guess coming back to code, this kind of approach affected the way that we built code also because we know that if we want some product that has like very broad product market fit today, we would build, you know, a cursor or a windsurf or something like this. Like these are awesome products that so many people use every day. I use them. That’s not the product that we want to build.
We want to build something that’s kind of much earlier on that curve and something that will maybe be a big product, you know, a year from now or, you know, however much time from now as the model improves. And that’s why code runs in a terminal. It’s a lot more bare bones. You have raw access to the model because we didn’t spend time building all this kind of nice UI and scaffolding on top of it.
When it comes to like the harness, so to speak, and things you want to put around it, there’s one that may be prompt optimization. So obviously I use cursor every day. There’s a lot going on in cursor that is beyond my prompt for like optimization and whatnot. But I know you recently released like, you know, compacting context features and all that. How do you decide how thick it needs to be on top of the CLI?
So that's kind of the shared interface. And at what point are you deciding between, okay, this should be a part of Claude Code, versus this is just something for the IDE people to figure out, for example? Yeah, there are kind of three layers at which we can build something. So, you know, being an AI company, the most natural way to build anything is to just build it into the model and have the model do the behavior.
The next layer is probably scaffolding on top. So that's Claude Code itself. And then the layer after that is using Claude Code as a tool in a broader workflow, to compose stuff. So, for example, a lot of people use Claude Code with tmux to manage a bunch of windows and a bunch of sessions happening in parallel. We don't need to build all of that in.
Compacting has sort of this thing that kind of has to live in the middle because it’s something that we want to work when you use code. You shouldn’t have to pull in extra tools on top of it. And rewriting memory in this way isn’t something the model can do today. So you have to use a tool for it. And so it kind of has to live in that middle area.
We tried a bunch of different options for compacting, you know, like rewriting old tool calls and truncating old messages but not new messages. And in the end, we actually just did the simplest thing, which is ask Claude to summarize the previous messages and just return that. And that's it. It's funny: when the model is so good, the simple thing usually works. You don't have to over-engineer it.
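As a hedged sketch of that "just ask the model to summarize" idea (this is not Claude Code's actual implementation; the threshold, prompt, and model alias are placeholder assumptions), compaction can be as simple as:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

COMPACT_THRESHOLD = 50  # illustrative: compact once the transcript gets long

def compact(messages: list[dict], model: str = "claude-3-5-sonnet-latest") -> list[dict]:
    """Replace old conversation turns with a single model-written summary,
    keeping the most recent turns verbatim."""
    if len(messages) <= COMPACT_THRESHOLD:
        return messages

    old, recent = messages[:-10], messages[-10:]
    transcript = "\n\n".join(f"{m['role']}: {m['content']}" for m in old)

    summary = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this coding-session transcript, keeping "
                       "file names, decisions, and open TODOs:\n\n" + transcript,
        }],
    ).content[0].text

    # The summary stands in for the old turns; recent turns stay as-is.
    return [{"role": "user", "content": f"[Summary of earlier session]\n{summary}"}] + recent
```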
We do that for Claude Plays Pokémon too, which is kind of interesting to see that pattern re-emerging. And then you have the CLAUDE.md file for the more user-driven memories, so to speak. It's kind of the equivalent of, maybe, Cursor rules, we'll say. Yeah, and CLAUDE.md is another example of this idea of do the simple thing first.
We had all these crazy ideas about like memory architectures and, you know, there’s so much literature about this. There’s so many different external products about this. And we wanted to be inspired by all this stuff. But in the end, the thing we did is ship the simplest thing, which is, you know, it’s a file that has some stuff and it’s auto-read into context. And there’s now a few versions of this file. You can put it in the root or you can put it in child directories or you can put it in your home directory and we’ll read all of these in kind of different ways. But yeah, simplest thing that could work.
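A minimal sketch of that kind of memory-file discovery might look like the following; the exact lookup order and the home-directory location are assumptions for illustration, not the real implementation:

```python
from pathlib import Path

def gather_memory_files(cwd: Path) -> list[Path]:
    """Illustrative sketch: collect CLAUDE.md files from the home directory,
    from the filesystem root down to the current directory, and from
    immediate child directories, in that order."""
    candidates: list[Path] = []

    # Global memory in the home directory (location assumed for illustration).
    candidates.append(Path.home() / ".claude" / "CLAUDE.md")

    # Project memory: every CLAUDE.md from the root down to cwd.
    for parent in reversed([cwd, *cwd.parents]):
        candidates.append(parent / "CLAUDE.md")

    # Child-directory memory, picked up as those directories are touched.
    candidates.extend(sorted(cwd.glob("*/CLAUDE.md")))

    return [p for p in candidates if p.is_file()]

if __name__ == "__main__":
    for path in gather_memory_files(Path.cwd()):
        print(path)
```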
I'm sure you're familiar with Aider, which is another thing that people in our Discord loved. And then when Claude Code came out, the same people loved Claude Code. Any thoughts on, you know, inspiration that you took from it? Things you did differently? Maybe places where you had the same principle but went a different way?
Yeah, this is actually related to the moment I got AGI-pilled. Okay. Maybe I can tell that story. So Clyde is, like, CLI plus Claude, and that's the predecessor to Claude Code. It's this research tool that's written in Python. It takes like a minute to start up. It's very much written by researchers; it's not a polished product.
And when I first joined Anthropic, I was putting up my first pull request. You know, I hand wrote this pull request because I didn’t know any better. And my boot camp buddy at the time, Adam Wolfe, was like, you know, actually, maybe instead of handwriting it, just ask Clyde to write it. And I was like, okay, I guess so. It’s an AI lab. Maybe there’s some, you know, capability I didn’t know about.
And so I started up this terminal tool, and it took like a minute to start up. And I asked Clyde, hey, here's the description. Can you make a PR for me? And after a few minutes of chugging along, it made a PR, and it worked. And I was just blown away, because I had no idea, just no clue, that there were tools that could do this kind of thing. I thought single-line autocomplete was the state of the art before I joined. And that's the moment where I got AGI-pilled. And yeah, that's where Claude Code came from.
I think people are interested in comparing and contrasting, obviously, because to you, this is the house tool. You work on it. People are interested in figuring out how to choose between tools. There are the Cursors of the world, the Devins of the world, there's Aider, and there's Claude Code. And we can't try everything all at once. My question would be, where do you place it in the universe of options?
Well, you could ask Claude to just try all these tools. I wonder what it would say. No self-favoring at all. Claude plays engineer. I don't know. We use all these tools in-house too. We're big fans of all this stuff. Claude Code is obviously a little different from some of these other tools in that it's a lot more raw. Like I said, there isn't this kind of big, beautiful UI on top of it. It's raw access to the model.
It's as raw as it gets. So if you want a power tool that lets you access the model directly and use Claude for automating big workloads, for example, if you have a thousand lint violations and you want to start a thousand instances of Claude, have each one fix a violation, and then make a PR, then Claude Code is a pretty good tool. Got it. It's a tool for power workloads, for power users.
And I think that's just kind of where it fits. It's the idea of parallel versus single-path. One way to think about it is that the IDE is really focused on what you want to do, versus Claude Code, which you see more as requiring less supervision. You can spin up a lot of them. Is that the right mental model?
Yeah. And there are some people at Anthropic that have been racking up thousands of dollars a day with this kind of automation. Most people don't do anything like that, but you totally could. Yeah. We think of it as a Unix utility, right? The same way you would compose grep or cat or awk or something like this, you can compose Claude Code into workflows.
The cost thing is interesting. Do people pay internally, or do you get it free? If you work at Anthropic, you can just run this thing as much as you want every day, for free. It's free internally. Nice. Yeah. I think if everybody had it for free, it would be huge. Because, I mean, I pay Cursor 20 bucks a month and use millions and millions of tokens in Cursor that would cost me a lot more in Claude Code.
And so I think like a lot of people that I’ve talked to, they don’t actually understand how much it costs to do these things. And they’ll do a task, and they’re like, oh, that costs 20 cents. I can’t believe I paid that much. How do you think, going back to like the product side too, it’s like, how much do you think of that being your responsibility to try and make it more efficient versus that’s not really what we’re trying to do with the tool?
We really see Claude Code as the tool that gives you the smartest abilities out of the model. We do care about cost insofar as it's very correlated with latency, and we want to make sure that this tool is extremely snappy to use and extremely thorough in its work. We want to be very intentional about all the tokens that it produces. I think we can do more to communicate the cost to users.
Currently, we're seeing costs around $6 per day per active user. So it does come out a bit higher over the course of a month than Cursor, but I don't think it's out of band, and that's roughly how we're thinking about it. I would add that the way I think about it is that it's an ROI question, not a cost question.
And so if you think about an average engineer salary, and, you know, we were talking about this before the podcast, engineers are very expensive. And if you can make an engineer 50 to 70% more productive, that's worth a lot. I think that's the way to think about it. So if you're targeting Claude Code to be the most powerful end of the spectrum, as opposed to the less powerful but faster, cheaper side of the spectrum, then there are typically people who recommend a waterfall, right?
You try the faster, simpler one; if that doesn't work, you upgrade, you upgrade, you upgrade, and finally you hit Claude Code, at least for people who are token-constrained and don't work at Anthropic. And part of me wants to just fast-track all of that. I just want to fan out to everything all at once, and once I'm not satisfied with one solution, I'll just switch to the next. I don't know if that's real.
Yeah, we're definitely trying to make it a little easier to make Claude Code the tool that you use for all the different workloads. So, for example, we launched a thinking tool recently. For any kind of planning workload where you might have used other tools before, you can just ask Claude, and it'll use chain of thought to think things out. I think we'll get there.
Maybe we'll do it this way. How about we recap the brief history of Claude Code? Between when you launched and now, there have been quite a few ships. How would you highlight the major ones? And then we'll get to the thinking tool. And I think I'd have to check your Twitter. I think a big one that we've gotten a lot of requests for is web fetch. Yep.
So we worked really closely with our legal team to make sure that we shipped as secure an implementation as possible. So we'll web fetch if a user directly provides a URL, whether that's in their CLAUDE.md or in their message directly, or if a URL is mentioned in one of the previously fetched URLs. And this way enterprises can feel pretty secure about letting their developers continue to use it.
We shipped a bunch of auto features, like autocomplete, where you can press tab to complete a file name or file path. Autocompact, so that users feel like they have infinite context, since we'll compact behind the scenes. And we also shipped auto-accept, because we noticed that a lot of users were like, hey, Claude Code can figure it out. I've developed a lot of trust in Claude Code. I want it to just autonomously edit my files, run tests, and then come back to me later.
So those are some of the big ones. Vim mode, custom slash commands. People love Vim mode. Yeah. So that was a top request too. That one went pretty viral. Yeah. Memory, the recent ones, like the hashtag to remember. So yeah, I'd love to dive into any of them that were particularly challenging on the technical side. Paul from Aider always says how much of it was coded by Aider.
You know, so then the question is how much of it was coded by Claude Code. Obviously, there's some percentage, but I wonder if you have a number, like 50, 80. It's pretty high. Probably near 80, I'd say. Yeah, it's very high. It's a lot of human code review, though. Yeah. A lot of human code review. I think some of the stuff has to be handwritten, and some of the code can be written by Claude.
And there's sort of a wisdom in knowing which one to pick, and what percentage for each kind of task. So usually where we start is Claude writes the code, and then if it's not good, maybe a human will dive in. There's also some stuff where I actually prefer to do it by hand, like intricate data-model refactoring, where I won't leave it to Claude, because I have really strong opinions and it's easier to just do it and experiment than it is to explain it to Claude.
So yeah, I think that nets out to maybe 80, 90% Claude-written code overall. Yeah. We're hearing a lot of that in our portfolio companies, more like Series A companies: 80, 85% of the code they write is AI-generated. Yeah. Well, that's a whole different discussion. Custom slash commands, I had a question. How do you think about custom slash commands and MCP?
Like, how does this all tie together? Is the slash command in Claude Code kind of an extension of MCP? Are people building things that should not be MCP servers but are just self-contained things in there? How should people think about it? Yeah. I mean, obviously we're big fans of MCP. You can use MCP to do a lot of different things. You can use it for custom tools and custom commands and all this stuff.
But at the same time, you shouldn’t have to use it. So if you just want something really simple and local, you just want, you know, essentially like a prompt that’s been saved, just use local commands for that. Over time, something that we’ve been thinking a lot about is how to re-expose things in convenient ways. So for example, let’s say you had this local command. Could you re-expose that as an MCP prompt?
Yeah. Because Claude Code is an MCP client and an MCP server. Or similarly, let's say you pass in a custom bash tool. Is there a way to re-expose that as an MCP tool? Because, yeah, we think generally you shouldn't have to be tied to a particular technology. You should use whatever works for you.
Yeah. Because there's, like, Puppeteer. I think that's a great thing to use with Claude Code, right? For testing. There's a Puppeteer MCP server, but then people can also write their own slash commands. And I'm curious where MCP servers are going to end up, where maybe each slash command leverages MCP, but no command itself is an MCP server, because it ends up being customized.
I think that’s what people are still trying to figure out. It’s like, should this be in the runtime or in the MCP server? I think people haven’t quite figured out where the line is. Yeah. For something like Puppeteer, I think that probably belongs in MCP because there’s a few like tool calls that go in that too. And so it’s probably nice to encapsulate that in the MCP server.
Whereas slash commands are actually just prompts. They're not actually tools. We're thinking about how to expose more customizability options so that people can bring their own tools or turn off some of the tools that Claude Code comes with. But there's also some trickiness there, because we want to make sure that the tools people bring are things that Claude is able to understand, and that people don't accidentally inhibit their experience by bringing a tool that is confusing to Claude.
So we're just trying to work through the UX of it. Yeah. I'll give an example also of how this stuff connects. For Claude Code internally, in the GitHub repo, we have this GitHub Action that runs. The GitHub Action invokes Claude Code with a local slash command, and the slash command is lint. So it just runs a linter using Claude. And it's a bunch of things that are pretty tricky to do with a traditional linter that's based on static analysis.
So for example, it’ll check for spelling mistakes, but also checks that code matches comments. It also checks that, you know, we use a particular library for network fetches instead of the built-in library. There are a bunch of these specific things that we check that are pretty difficult to express just with lint. And in theory, you can go in and, you know, write a bunch of lint rules for this. Some of it you could cover, some of it you probably couldn’t.
But honestly, it's much easier to just write one bullet in markdown in a local command and commit that. And so what we do is Claude runs through the GitHub Action. We invoke it with slash project colon lint, which just invokes that local command. It'll run the linter, identify any mistakes, make the code changes, and then use the GitHub MCP server to commit the changes back to the PR.
And so you can kind of compose these tools together. And I think that's a lot of the way we think about Claude Code: it's just one tool in an ecosystem that composes nicely without being opinionated about any particular piece. It's interesting. I have a weird chapter in my CV: I was the CLI maintainer for Netlify. And so I have a little bit of a background here.
There's a decompilation of Claude Code out there that has since been taken down. But it seems like you use Commander.js and React Ink. That's the public info about this. And I'm just kind of curious: at some point, you're not even building Claude Code, you're kind of building a general-purpose CLI framework that any developer can hack to their purposes. Do you ever think about this?
Like, this level of configurability is more of a CLI framework, or some new form factor that doesn't exist yet. Yeah. It's definitely been fun to hack on a really awesome CLI, because there are not that many of them. Yeah. But yeah, we're big fans of Ink, Vadim Demedes' library; we actually use React Ink for a lot of our projects. Oh, cool. Yeah. Ink is amazing. It's sort of hacky and janky in a lot of ways: you have React, and then the renderer is just translating the React code to ANSI escape codes as the way to render.
And there’s all sorts of stuff that just doesn’t work at all because ANSI escape codes were like this thing that started to be written in the 1970s, and there’s no really great spec about it. Every terminal is a little different. So building in this way feels to me a little bit like building for the browser back in the day where you had to think about Internet Explorer 6 versus Opera versus Firefox and whatever. You have to think about these cross-terminal differences a lot.
But yeah, big fans of Ink, because it helps abstract over that. We also use Bun, so big fans of Bun. It makes writing our tests and running tests much faster. We don't use it in the runtime yet. It's not just for speed, though, and you tell me, I don't want to put words in your mouth, but my impression is it helps you ship the compiled executable.
Yeah, exactly. So we use Bun to compile the code together. Any other pluses of Bun? I just want to track the Bun versus Deno conversations. Yeah, is Deno in there, you know? I actually haven't used Deno in a while. I remember when Ryan made it back in the day, and there were some ideas in it that I think were very cool, but yeah, it just never took off to the same degree.
There are still a lot of cool ideas, like being able to skip npm and just import from any URL, which I think is pretty amazing. That's the dream of ESM. Very cool. Okay. Also, I was going to ask you about one other feature before we get to the thinking tool: auto-accept. I have this little thing I'm trying to develop thinking around for trust and agents, right? When do you say, all right, go autonomous? When do you pull the developer in?
And sometimes you let the model decide. Sometimes you’re like, this is a destructive action. Always ask me. I’m just curious if you have any internal heuristics around when to auto-accept and where all this is going. We’re spending a lot of time building out the permission system. So Robert on our team is leading out this work. We think it’s really important to give developers the control to say, hey, these are the allowed permissions.
Generally, this includes stuff like the model’s always allowed to read files or read anything. And then it’s up to the user to say, hey, is it allowed to edit files? Is it allowed to run tests? These are probably the safest three actions. There’s a long list of other actions that users can either allow list or deny list based on regex matches with the action.
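As an illustration of that allow-list/deny-list idea (the rule syntax and patterns here are hypothetical, not Claude Code's actual permission format), a simple regex-based check could look like:

```python
import re

# Hypothetical permission rules: reads are always allowed, everything else is
# matched against user-supplied allow/deny regex lists, and anything
# unmatched falls back to asking the user.
ALLOW = [r"^Bash\(git status\)$", r"^Bash\(git diff.*\)$", r"^Edit\(src/.*\)$"]
DENY = [r"^Bash\(rm -rf.*\)$", r"^Bash\(.*sudo.*\)$"]

def decide(action: str) -> str:
    if action.startswith("Read("):
        return "allow"                       # reading is treated as safe
    if any(re.match(p, action) for p in DENY):
        return "deny"
    if any(re.match(p, action) for p in ALLOW):
        return "allow"
    return "ask"                             # fall back to a human prompt

for a in ["Read(README.md)", "Bash(git status)", "Bash(rm -rf /)", "Edit(src/app.ts)"]:
    print(a, "->", decide(a))
```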
How can writing a file ever be unsafe if you have version control? I think there’s, I think there’s like a few different probably aspects of safety to think about. So it could be useful just to break that out a little bit. For file editing, it’s actually less about safety, although there is still a safety risk. What might happen is, let’s say the model fetches a URL, and then there’s a prompt injection attack in the URL, and then the model writes malicious code to disk and you don’t realize it.
Although, you know, there is code review as a separate layer there as protection. But I think generally for file writes, the model might just do the wrong thing. That’s the biggest thing. And what we find is that if the model is doing something wrong, it’s better to identify that earlier and correct it earlier, and then you’re going to have a better time. If you wait for the model to just go down this totally wrong path, and then correct it 10 minutes later, you’re going to have a bad time.
So it's usually better to identify failures early, right? But at the same time, there are some cases where you just want to let the model go. For example, if Claude Code is writing tests for me, I'll just hit shift-tab, enter auto-accept mode, and let it run the tests and iterate on them until they pass, because I know that's a pretty safe thing to do.
And then for some other tools, like the bash tool, it's pretty different, because Claude could run, you know, rm -rf /, and that would suck. That's not a good thing. So we definitely want people to be in the loop to catch stuff like that. The model is trained and aligned to not do that, but these are non-deterministic systems, so you still want a human in the loop.
Yeah. I think that generally the way things are trending is less and less time between human inputs. Did you see the METR paper? No. They established a kind of Moore's law for the time between human inputs, basically. It's roughly that models can work autonomously on tasks that take a human about 50 minutes, at around a 50% success rate, which is kind of cool. Highly recommend it.
I put Cursor in YOLO mode all the time and just run it. Which is vibe coding, right? Like, this is the state of play. There are a couple of interesting things when you talk about alignment and the model being trained. I always put it in a Docker container, and I prefix every command with docker compose. And yesterday, my Docker server was not started.
I was like, oh, Docker is not running. Let me just run it outside of Docker. And it was like, whoa, whoa, whoa, whoa, whoa. You should start Docker and run it in Docker. You cannot go outside. So that is a very good example of how sometimes you think it's doing something and then it's doing something else.
For the review side itself, I would love to just chat about that more. I think the linter part that you mentioned, some people skipped it over. It doesn’t register the first time, but going from rule-based linting to semantic linting is great and super important. A lot of companies are trying to figure out how to do autonomous PR review, which I’ve not seen one that I use so far.
They're all kind of mid. I'm curious how you think about closing the loop or making that better, and figuring out especially what you're supposed to review. Because these PRs get pretty big when you vibe code. Sometimes I'm like, oh, wow, LGTM. You know, it's like, am I really supposed to read all of this? It seems pretty standard, but there are parts in there that the model would understand are kind of out of distribution, so to speak, and that you should really look at.
So yeah, I know it's a very open-ended question, but any thoughts you have would be great. The way we're thinking about it is that Claude Code, like I said before, is a primitive. If you want to use it to build a code review tool, you can do that. If you want to build a security scanning or vulnerability scanning tool, you can do that. If you want to build a semantic linter, you can do that.
Hopefully, with Claude Code, if you want to do this, it's just a few lines of code, and you can have Claude write that code too, because Claude is really great at writing GitHub Actions. One thing to mention is we do have a non-interactive mode, which is what we use in these situations to automate Claude Code. A lot of companies using Claude Code actually use this non-interactive mode.
For example, they'll say, hey, I have hundreds of thousands of tests in my repo. Some of them are out of date. Some of them are flaky. They'll send Claude Code to look at each of these tests and decide: how can I update any of them? Should I deprecate some of them? How do I increase our code coverage? So that's been a really cool way that people are using Claude Code non-interactively.
What are the best practices here? Because when it’s non-interactive, it could run forever, and you’re not necessarily reviewing the output of everything. Right. I’m just kind of curious, how is it different in non-interactive mode? What are the most important hyperparameters or arguments to set?
For folks that haven't used it, non-interactive mode is just claude -p, and then you pass in the prompt in quotes, and that's all it is. It's just the -p flag. Generally, it's best for tasks that are read-only. That's the place where it works really well, and you don't have to think about permissions and running forever and things like that.
For example, a linter that runs and doesn't fix any issues. Or we're working on a thing where we use Claude with -p to generate the changelog for Claude Code. It just looks over the commit history for every PR and says, okay, this makes it into the changelog, this doesn't, because we know people have been requesting changelogs.
Generally, non-interactive mode is really good for read-only tasks. For tasks where you want to write, the thing we usually recommend is to pass in a very specific set of permissions on the command line. You can pass in --allowed-tools and allow a specific tool, for example, not just bash, but git status or git diff, so you just give it the set of tools it can use, or the edit tool.
It still has default tools like file read, grep, bash, ls, and memory tools. It still has all those tools, but --allowed-tools lets you pre-accept the permission prompt, because you don't get that prompt in non-interactive mode. We'd also definitely recommend that you start small. Test it on one test, make sure that it has reasonable behavior, iterate on your prompt, then scale it up to ten, make sure it succeeds, and if it fails, analyze what the patterns of failure are, and gradually scale it from there.
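Putting those recommendations together, here is a hedged sketch of driving the CLI non-interactively from a script. The -p flag and the allowed-tools option follow the conversation above, but the exact flag spellings, tool names, and file paths are assumptions, so check claude --help before relying on them:

```python
import subprocess

def run_claude(prompt: str, allowed_tools: list[str] | None = None) -> str:
    """Invoke Claude Code non-interactively; a restricted tool list
    pre-accepts only the permissions you are comfortable with."""
    cmd = ["claude", "-p", prompt]
    if allowed_tools:
        # Flag name assumed from the discussion above.
        cmd += ["--allowed-tools", ",".join(allowed_tools)]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Start small: a read-only review pass over one hypothetical file.
report = run_claude(
    "Review src/utils.py for comments that no longer match the code. "
    "Report issues only; do not edit anything."
)
print(report)

# A write task with a narrow set of pre-approved tools (names illustrative).
run_claude(
    "Fix the flaky test in tests/test_cache.py and run it until it passes.",
    allowed_tools=["Edit", "Bash(git status)", "Bash(pytest tests/test_cache.py)"],
)
```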
Definitely don't kick off a run to fix a hundred thousand tests. At this point, there's a tagline in my head that basically at Anthropic, there's Claude Code generating code and Claude Code also reviewing its own code, at some point, right? Different people are setting all this up. You don't really govern that, but it's happening.
The thing I was thinking about is that we have VPs of Engineering and CTOs listening. This is all well and good for the individual developer, but for the people responsible for the tech, the entire code base, the engineering decisions: my developers, say I manage a hundred developers, any of them could be doing any of this at this point. What do I do to manage this?
How does my code review process change? How does my change management change? I don't know. We've talked to a lot of VPs and CTOs about it. They actually tend to be quite excited, because they experiment with the tool. They download it, ask it a few questions, and when Claude Code gives them sensible answers, they're really excited, because they can understand the nuances in the code base, and sometimes they even ship small features with Claude Code.
Through that process of interacting with the tool, they build a lot of trust in it. A lot of folks actually come to us and ask us how they can roll it out more broadly. We’ll often have sessions with VPs of DevProd and talk about these concerns about how to make sure people are writing high-quality code. I think in general, it’s still very much up to the individual developer to hold themselves up to a high standard for the quality of code that they merge.
Even if we use Claude Code to write a lot of our code, it's still up to the individual who merges it to be responsible for it being well-maintained, well-documented code that has reasonable abstractions. I think that's something that will continue, where Claude Code isn't its own engineer that's committing code by itself. It's still very much up to the ICs to be responsible for the code that's produced.
Yeah. I think Claude Code also makes a lot of this quality work a lot easier. For example, I have not manually written a unit test in many months. We have a lot of unit tests because Claude writes all the tests. Before, I felt like a jerk if on someone's PR, I'm saying, hey, can you write a test?
They kind of know they should probably write a test, and that's probably the right thing to do. They make that trade-off where they just want to ship faster. You always feel like a jerk for asking. But now I always ask, because Claude can just write the test. You're right. There's no human work; you just ask Claude to do it.
With writing tests becoming easier and with writing lint rules becoming easier, it’s actually much easier to have high-quality code than it was before. What are the metrics that you believe in? A lot of people actually don’t believe in 100% code coverage because sometimes that is optimizing for the wrong thing. Arguably, I don’t know.
You have a lot of experience in different code quality metrics. What still makes sense? I think it's very engineering-team dependent, honestly. I wish there was a one-size-fits-all answer, like, for me, the one solution. For some teams, test coverage is extremely important. For other teams, type coverage is very important, especially if you're working in a strictly typed language and avoiding anys in JavaScript and Python.
I think cyclomatic complexity gets a lot of flack, but it’s still, honestly, a pretty good metric just because there isn’t anything better in terms of ways to measure code quality. Productivity is obviously not lines of code. But do you care about measuring productivity? I’m sure you do.
Yeah, you know, lines of code honestly isn't terrible. Oh God. It has downsides. Yes. It's terrible. Lines of code is terrible for a lot of reasons. Yes. But it's really hard to make anything better. It's the least terrible. There's lines of code, maybe the number of PRs, how green your GitHub is.
The two that we're really trying to nail down are, one, decrease in cycle time. How much faster are your features shipping because you're using these tools? That might be something like the time between first commit and when your PR is merged. It's very tricky to get right, but it's one of the ones that we're targeting.
The other one we want to measure more rigorously is the number of features that you wouldn't have otherwise built. We have a lot of channels where we get customer feedback. One pattern we've seen with Claude Code is that sometimes customer support or customer success will post, hey, this app has a bug.
Sometimes 10 minutes later, one of the engineers on that team will be like, Claude Code made a fix for it. A lot of the time, when you ping them and say, hey, that was really cool, they're like, yeah, without Claude Code, I probably wouldn't have done that because it would have been too much of a divergence from what I was otherwise going to do. It would have just ended up in this long backlog.
This is the kind of stuff that we want to measure more rigorously. That was the other AGI-pilled moment for me. There was an early version of Claude Code many months ago. An engineer at Anthropic, Jeremy, built a bot that looked through a particular feedback channel on Slack. He hooked it up to Claude Code to have it automatically put up PRs with fixes for all this stuff.
It fixed a lot of the issues, like 10 percent, 50 percent? This was early on, so I don't remember the number, but it was surprisingly high, to the point where I became a believer in this kind of workflow. I wasn't before. So as a PM, isn't that scary too, in a way? Where you can build too many things, it's almost like maybe you shouldn't build that many things.
I think that's what I'm struggling with the most. It gives you the ability to create, create, create. At some point, you've got to support, support, support. This is the Jurassic Park line: your scientists were so preoccupied with whether they could, they didn't stop to think if they should.
How do you make decisions, now that the cost of actually implementing the thing is going down? As a PM, how do you decide what is actually worth doing? We definitely still hold a very high bar for net new features. Most of the fixes were, hey, this functionality is broken or there’s a weird edge case that we hadn’t addressed yet.
It was very much smoothing out the rough edges rather than building something completely new. For net new features, I think we hold a pretty high bar that it's very intuitive to use. The new user experience is minimal. It's just obvious that it works. We sometimes actually use Claude Code to prototype instead of using docs.
You’ll have prototypes that you can play around with, and that often gives us a faster feel for whether this feature is ready yet or if this is the right abstraction, the right interaction pattern. It gets us faster to feeling really confident about a feature, but it doesn’t circumvent the process of making sure that the feature fits in the product vision.
It's interesting how, as it gets easier to build stuff, it changes the way that I write software. Before, I would write a big design doc and think about a problem for a long time before I would build it, for some problems. Now I'll just ask Claude Code to prototype three versions of it and see which one I like better. That informs me much better and much faster than a doc would have.
We haven't totally internalized that transition yet in the industry. I feel the same way for some tools I build internally. People ask me if we could do this, and I'm like, I'll just, yeah, just build it. It feels pretty good. We should polish it. Or sometimes it's like, no, that's not it. It's comforting that your max cost, even at Anthropic, where it's theoretically unlimited, is roughly $6 a day.
That gives people peace of mind because I’m like, $6 a day? Fine. $600 a day, we have to talk. I paid $200 a month to make Studio Ghibli photos, so it’s all good. You mentioned internal tools, and that’s actually a really big use case we’re seeing emerge.
If you're working on something operationally intensive, if you can spin up an internal dashboard for it, or an operational tool where you can, for example, grant access to a thousand emails at once, a lot of these things don't really need to have a super polished design. You just need something that works. Claude Code's really good at those kinds of zero-to-one tasks. We use Streamlit internally, and there's been a proliferation of how much we're able to visualize.
Because we're able to visualize it, we can see patterns we wouldn't have otherwise if we were just looking at raw data. I was working on a side website last week, and I just showed Claude Code the mock. I took the screenshot I had, dragged and dropped it into the terminal, and I was like, hey, Claude, here's the mock. Can you implement it? It implemented it, and it sort of worked.
It was a little bit crummy, so I said, all right, now look at it in Puppeteer and iterate on it until it looks like the mock. It did that three or four times, and then the thing looked like the mock. I wanted to ask about two other features of the overall agent pieces that we mentioned. I'm interested in memory as well.
We talked about autocompact and memory using hashtags and stuff. My impression is that your simplest approach works. But I’m curious if you’ve seen any other requests that are interesting to you or internal hacks of memory that people have explored that, you know, you might want to surface to others.
There are a bunch of different approaches to memory. Most of them use external stores of various sorts. Like Chroma? Yeah, exactly. There's a lot of projects like that. It's either key-value or kind of like graph stores. Those are the two big shapes for this.
Do you believe in knowledge graphs for this stuff? You know, I’m a big, if you talked to me before I joined Anthropic and this team, I would have said, yeah, definitely. But now I feel everything is the model. That’s the thing that wins in the end. As the model gets better, it subsumes everything else.
At some point, the model will encode its own knowledge graph. It'll encode its own KV store if you just give it the right tools. But yeah, I think the specific tools have a lot of room for experimentation, and we don't know yet. In some ways, are we just coping for lack of context length?
Are we doing things from memory now that if we had a 100 million token context window, we wouldn’t care about? It’s an interesting way to think about that. I would love to have a 100 million token context, for sure. Some people have claimed to have done it; we don’t know if that’s true or not.
But I guess here’s the question for you, Sean. If you took all the world’s knowledge and put it in your brain, and let’s say there is some treatment to make it so your brain can have any amount of context, you have infinite neurons. Is that something that you would want to do, or would you still want to record knowledge externally?
Putting it in my head is different from trying to use an agent tool to do it, because I'm trying to control the agent. I want to make myself unlimited, but I want to make the tools that I use limited, because then I know how to control them. It's not even a safety argument; it's just more like, I want to know what you know. If you don't know a thing, sometimes that's good. Like the ability to audit what's in the context.
I don't know if this is small-brain thinking, because this is not very bitter-lesson of me, but actually, sometimes you just want to control every part of what goes into the context. The more you just, you know, Jesus-take-the-wheel, trust the model, the less idea you have of what it's paying attention to.
Did you see the mech interp stuff from Chris Olah and the team that was published last week? Yes. What about it? I wonder if something like this is the future. There's an easier way to audit the model itself. If you want to see what is stored, you can just audit the model.
The main salient thing is that they know what features activate per token, and they can tune them up, suppress them, whatever. But I don't know if it goes down to the individual item of knowledge from context, you know. Not yet. But I wonder, maybe that's the bitter lesson version of it.
Any other comments from memory? Otherwise, we can move on to planning and thinking. We’ve seen people play around with memory in interesting ways, like having Claude write a logbook of all the actions that it’s done so that over time, Claude develops this understanding of what your team does, what you do within your team, what your goals are, how you like to approach work.
We would love to figure out what the most generalized version of this is so that we can share broadly. I think with Claude Code, when we’re developing things like Claude Code, it’s actually less work to implement the feature and a lot of work to tune these features to make sure that they work well for general audiences across a broad range of use cases.
There’s a lot of interesting stuff with memory, and we just want to make sure that it works well out of the box before we share it broadly. I agree with that. I think there’s a lot more to be developed here. I guess a related problem to memory is how do you get stuff into context? Knowledge base.
Like knowledge base, yeah. Originally, very early versions of Claude Code actually used RAG. We indexed the code base, and I think we were just using Voyage. Just off-the-shelf RAG, and that worked pretty well. We tried a few different versions of it. There was RAG, and then we tried a few different kinds of search tools.
Eventually, we landed on just agentic search as the way to do stuff. There were two big reasons, maybe three big reasons. One is it outperformed everything. By a lot. This was surprising. In what benchmark? This was just vibes. So internal vibes. There are some internal benchmarks also, but mostly vibes.
It just felt better. With agentic search, you just let it look things up over however many search cycles it needs, using regular code search, you know, glob, grep, just regular code search.
That was the first reason. The second one was this whole indexing step that you have to do for RAG. There's a lot of complexity that comes with that, because the code drifts out of sync with the index. There are security issues, because this index has to live somewhere. What if that provider gets hacked? That's a lot of liability for a company.
For our code base, it’s very sensitive. We don’t want to upload it to a third-party thing. It could be a first-party thing, but then we still have this out-of-sync issue. Agentic search sidesteps all of that. Essentially, at the cost of latency and tokens, you now have really awesome search without security downsides.
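To make the idea concrete, here's a minimal sketch of an agentic search loop: the model asks for plain regex searches over the repo, reads the results, and repeats until it can answer. This is my own sketch of the pattern, not Claude Code's actual implementation; it assumes the @anthropic-ai/sdk tool-use API, a local ripgrep install, and a placeholder model name.

```ts
// Sketch: agentic code search via a grep tool in a loop, no prebuilt index.
import Anthropic from "@anthropic-ai/sdk";
import { execFileSync } from "node:child_process";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

function runGrep(pattern: string): string {
  try {
    // -n adds line numbers; ripgrep exits non-zero when there are no matches.
    return execFileSync("rg", ["-n", pattern, "."], { encoding: "utf8" });
  } catch {
    return "(no matches)";
  }
}

const messages: Anthropic.Messages.MessageParam[] = [
  { role: "user", content: "Where do we check the user's credit balance before calling the LLM?" },
];

while (true) {
  const response = await client.messages.create({
    model: "claude-3-7-sonnet-latest", // placeholder; use whatever model is current
    max_tokens: 1024,
    tools: [{
      name: "grep",
      description: "Search the repository with a regex and return matching lines.",
      input_schema: {
        type: "object",
        properties: { pattern: { type: "string" } },
        required: ["pattern"],
      },
    }],
    messages,
  });

  if (response.stop_reason !== "tool_use") {
    // The model has enough context and answered in plain text.
    for (const block of response.content) {
      if (block.type === "text") console.log(block.text);
    }
    break;
  }

  // Execute each requested search and feed the results back for another cycle.
  messages.push({ role: "assistant", content: response.content });
  const results: { type: "tool_result"; tool_use_id: string; content: string }[] = [];
  for (const block of response.content) {
    if (block.type === "tool_use") {
      const { pattern } = block.input as { pattern: string };
      results.push({ type: "tool_result", tool_use_id: block.id, content: runGrep(pattern) });
    }
  }
  messages.push({ role: "user", content: results });
}
```

The trade-off is exactly the one described above: more latency and tokens per question, but no index to build, sync, or secure.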
Well, memory relates to planning, right? Memory is what I like to do, and planning is using those memories to plan these things. Or maybe put it as: memory is sort of the past, like what we already did, and then planning is kind of what we will do.
Yeah. And it just crosses over at some point.
I think the maybe slightly confusing thing from the outside is what you define as thinking.
So there's extended thinking. There's the think tool. And there's thinking as in planning, which is thinking before execution. And then there's thinking while you're doing, which is like the think tool.
Can you maybe just run people through the difference? I’m already confused listening to you.
Well, it's one tool. So Claude can think if you ask it to think. Generally, the usage pattern that works best is you ask Claude to do a little bit of research, like use some tools, pull some code into context, and then ask it to think about it.
And then it can make a plan, do a planning step, before you execute. There are some tools that have explicit planning modes. Roo Code has this, and Cline has this. Other tools have it. You can shift between plan and act mode, or maybe a few different modes.
We’ve sort of thought about this approach. But I think our approach to product is similar to our approach to the model, which is bitter lesson. So just freeform, keep it really simple, keep it close to the metal.
And so if you want Claude to think, just tell it to think. Be like, make a plan, think hard, don't write any code yet. And it should generally follow that.
And you can do that also as you go. So maybe there's a planning stage, and then Claude writes some code or whatever, and then you can ask it to think and plan a little bit more. You can do that anytime.
Yeah, I was reading the think tool blog post, and it said, while it sounds similar to extended thinking, it's a different concept. Extended thinking is what Claude does before it starts generating.
And then the think tool is, once it starts generating, how does it know to stop and think? Is this all done by the Claude Code harness, so people don't really have to think about the difference between the two, basically? Is that the idea?
Yeah, you don’t have to think about it.
Okay. That is helpful. Because sometimes I’m like, man, am I not thinking right?
Yeah, and it's all chain of thought, actually, in Claude Code. So we don't use the think tool. Anytime that Claude Code does thinking, it's all chain of thought.
I had an insight. This is, again, a discussion we had before recording, which is, in the Claude Plays Pokémon hackathon, we had access to a sort of branching-environments feature, which meant that we could take any VM state, branch it, play it forward a little bit, and use that in the planning.
And then I realized the TLDR of yesterday was basically that it's too expensive to just always do that at every point in time. But if you give it as a tool to Claude and prompt it in certain cases to use that tool, it seems to make sense.
I’m just kind of curious, like your takes on overall, like sandboxing, environment, branching, rewindability, maybe. It’s just something that you immediately brought up, which I didn’t think about.
Is that useful for Claude? Or does Claude have no opinions about it?
Yeah, I could talk for hours about this.
Claude probably can, too.
Yeah? Let's get original tokens from you, and then we can train Claude on that. By the way, that's explicitly what this podcast is, so we're just generating tokens for people.
Is this the pre-training or the post-training?
It’s a pre-trained dataset. We’ve got to get in there.
Oh, man. Yeah, how do I buy? How do I get some tokens?
Starting with sandboxing, ideally, the thing that we want is to always run code in a Docker container. And then it has freedom, and you can kind of snapshot, you know, with other kind of tools later on top.
You can snapshot, rewind, do all this stuff. Unfortunately, working with a Docker container for everything is just like a lot of work, and most people aren’t going to do it.
And so we want some way to simulate some of these things without having to go full container. There's some stuff you can do today. So, for example, something I'll do sometimes is, if I have a planning question or a research-type question, I'll ask Claude to investigate a few paths in parallel.
And you can do this today if you just ask it. So say, I want to refactor X to do Y. Can you research three separate ideas for how to do it? Do it in parallel. Use three agents to do it.
And so in the UI, when you see a task, that's actually like a sub-Claude; it's a sub-agent that does this. And usually when I do something hairy, I'll ask it to just investigate three times or five times or however many times in parallel.
And then Claude will kind of pick the best option and then summarize that for you.
But how does Claude pick the best option? Don’t you want to choose? What’s your handoff between you should pick versus I should be the final decider?
I think it depends on the problem. You can also ask Claude to present the options to you. Probably, you know, it exists at a different part of the stack than Claude Code specifically.
Claude Code as a CLI, like you could use it in any environment. So it’s up to you to compose it together.
Should we talk about how and when models fail? Because I think that was another hot topic for you.
I’ll just leave it open. Like how do you observe Claude Code failing?
There’s definitely a lot of room for improvement in the models, which I think is very exciting. Most of our research team actually uses Claude Code day to day.
And so it’s been a great way for them to be very hands-on and experience the model failures, which makes it a lot easier for us to target these in model training and to actually provide better models, not just for Claude Code, but for all of our coding customers.
I think one of the things about the latest Sonnet 3.7 is it’s a very persistent model. It’s like very motivated to accomplish the user’s goal, but it sometimes takes the user’s goal very literally.
And so it doesn’t always fulfill what the implied parts of the request are because it’s just so narrowed in on like, I must get X done. And so we’re trying to figure out, okay, how do we give it a bit more common sense so that it knows the line between trying very hard and like, no, the user definitely doesn’t want that.
Yeah. Like the classic example is like, hey, go on, get this test to pass. And then, you know, like five minutes later, it’s like, all right, well, I hard-coded everything. The test passes. I’m like, no, that’s not what I wanted.
Hard-coded the answer.
Yeah. But that’s the thing, like, it only gets better from here. Like these use cases work sometimes today, not, you know, not every time.
And, you know, the model sometimes tries too hard, but it only gets better.
Yeah. Like context, for example, is a big one where a lot of times, if you have a very long conversation and you compact a few times, maybe some of your original intent isn’t as strongly present as it was when you first started.
And so maybe the model forgets some of what you originally told it to do. And so we're really excited about things like larger effective context windows, so that you can have these gnarly, really long, hundreds-of-thousands-of-tokens-long tasks and make sure that Claude Code is on track the whole way through.
That would be a huge lift, I think, not just for Claude Code, but for every coding company.
Fun story from David Hershey's keynote yesterday: he actually misses the common sense of 3.5, because 3.7 is so persistent. 3.5 actually had some entertaining stories where apparently it gave up on tasks, and 3.7 just doesn't.
And when Claude 3.5 gave up, it started writing a formal request to the developers of the game to fix the game. He has some screenshots of it, which are excellent.
So if you’re listening to this, you can find it on the YouTube because we’ll post it. Very, very cool.
One form of failing, which I kind of wanted to capture, was something that you mentioned while we were getting coffee, which is that Claude Code doesn't have that much between-session memory or caching or whatever you call that, right?
So it rebuilds its whole picture of the codebase from scratch every single time. So it has to make the minimum assumptions about the changes that can happen in between.
So, like, how consistent can it stay, right? I think that one of the failures is that it forgets what it was doing in the past unless you explicitly opt in via CLAUDE.md or whatever. Is that something you worry about?
It’s definitely something we’re working on. I think, like, our best advice now for people who want to resume across sessions is to tell Claude to, hey, like, write down the state of this session into this text doc.
Probably not the CLAUDE.md, but in a different doc. And in your new session, tell Claude to read from that doc. But we plan to build in more native ways to handle this specific workflow.
There’s a lot of different cases of this, right? Like, sometimes you don’t want Claude to have the context. And it’s sort of like Git. Sometimes I just want, you know, a fresh branch that doesn’t have any history.
But sometimes I’ve been working on a PR for a while, and I need all that historical context. So we kind of want to support all of these cases. And it’s tricky to do a one-size-fits-all.
But generally, our approach to Claude Code is to make sure it works out of the box for people without extra configuration. So once we get there, we'll have something.
Do you see a future in which the commits play a bigger part in a pull request? Like, how do we get here? You know, there’s a lot of history in how the code has changed within the PR that informs the model.
But today, the models are mostly looking at the current state of the branch.
Yeah, so Claude, for some things, will actually look at the whole history. So, for example, if you tell Claude, hey, make a PR for me, it'll look at all the changes since your branch diverged from main.
And then take all of those into account when generating the pull request message. You might notice it running git diff as you’re using it.
I think it’s pretty good about just tracking, hey, what changes have happened on this branch? So far, and just make sure that it understands that before continuing on with the task.
One thing other people have done is ask Claude to commit after every change. You can just put that in the CLAUDE.md. There are some of these power-user workflows that I think are super interesting.
Like, some people are asking Claude to commit after every change so that they can rewind really easily. Other people are asking Claude to create a worktree every time so that they can have, you know, a few Claudes running in parallel in the same repo.
I think from our point of view, we want to support all of this. So, again, Claude Code is a primitive, and it doesn’t matter what your workflow is. It should just fit in.
I know that 3.5 Haiku was the number four model on Aider when it came out. Do you see a world in which Claude Code has a commit hook that uses maybe Haiku to do some linting and things like that continuously?
And then you have 3.7 as the more powerful one?
Yeah, you could actually do this if you want. So, you’re saying, like, through a pre-commit hook or like a GitHub action?
Yeah, yeah, yeah. Say, well, kind of, like, run Claude Code, like, the lint example that you had. I want to run it at each commit locally, like, before it goes to the PR.
Yeah, so you could do this today if you want. So, you know, if you're using Husky or whatever pre-commit hook system you're using, or just git pre-commit hooks, just add a line with claude -p and then any instruction you have, and that'll run every time.
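A hedged sketch of what that hook line could call: a small Node/TypeScript script that pipes the staged diff into one non-interactive run. The file name, prompt, "OK" convention, model alias, and flag spellings are all assumptions for illustration.

```ts
// Sketch: a script your pre-commit hook (Husky or plain git hooks) could invoke,
// e.g. `node scripts/claude-lint.mjs`. Blocks the commit on a non-OK verdict.
import { execFileSync } from "node:child_process";

const staged = execFileSync("git", ["diff", "--cached"], { encoding: "utf8" });

if (staged.trim().length > 0) {
  const verdict = execFileSync(
    "claude",
    [
      "-p",
      "Review this staged diff for obvious bugs or lint problems. Reply with OK if it looks fine.",
      "--model",
      "claude-3-5-haiku-latest", // model name is an assumption; pick whatever is current
    ],
    { input: staged, encoding: "utf8" }, // the diff is piped in on stdin
  );
  console.log(verdict);
  if (!verdict.includes("OK")) process.exit(1); // non-zero exit blocks the commit
}
```

Keeping the instruction narrow and the model cheap is what makes this tolerable to run on every commit.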
Nice, and you just specify Haiku. It’s really no difference, right? It’s like maybe it’ll work a little worse, but like, it’s still supported?
Yeah, you can override the model if you want. Generally, we use Sonnet. We default to Sonnet for most everything just because we find that it outperforms.
But, yeah, you can override the model if you want.
Yeah, I don't have that much money to run a commit hook through Claude every time. Just as an aside on pre-commit hooks, I have worked in places where they insisted on having pre-commit hooks.
I’ve worked in places where they insisted they’ll never do pre-commit hooks because they get in the way of committing and moving quickly. I’m just kind of curious, like, do you have a stance or a recommendation?
Oh, God. That’s like asking about tabs versus spaces, wouldn’t it?
A little bit. But, like, you know, I think it is easier in some ways: if you have a breaking test, go fix the test with Claude Code.
In other ways, it’s more expensive to run this at every point. So, like, there’s trade-offs. I think, for me, the biggest trade-off is you want the pre-commit hook to run pretty quickly.
So that if you’re either, if you’re a human or if you’re a Quad, you don’t have to wait, like, a minute for all the checks to run.
Yeah, so you want the fast version. So generally, pre-commit, for our code base, should run just type checks.
Yeah, it’s like less than five seconds or so. Like just types and lint, maybe. And then more expensive stuff you can put in the GitHub Action or GitLab or whatever you’re using.
Agreed. I don’t know, like, I like putting prescriptive recommendations out there so that people can take this and go, like, this guy said it. We should do it in our team. And like, that’s a basis for decisions.
Yeah, yeah, yeah. Cool. Any other technical stories to tell?
You know, I wanted to zoom out into more product-y stuff, but, you know, you can get as technical as you want.
I don’t know. Like, one anecdote that might be interesting is the night before the code launch, we were going through to burn down the last few issues.
And the team was up pretty late trying to do this. And one thing that was bugging me for a while is we had this markdown rendering that we were using.
And it was just, you know, the markdown rendering in Claude Code today is beautiful. It's really nice rendering in the terminal.
And it does bold and headings and spacing and stuff very nicely. But we tried a bunch of these off-the-shelf libraries to do it.
And I think we tried like two or three or four different libraries. And just nothing was quite perfect. Like sometimes the spacing was a little bit off between a paragraph and a list.
Or sometimes the text wrapping wasn’t quite correct. Or sometimes the colors weren’t perfect. So each one had all these issues.
And all these markdown renderers are very popular. And they have thousands of stars on GitHub and have been maintained for many years.
But they’re not really built for a terminal. And so the night before the release at like 10 p.m., I’m like, all right, I’m going to do this.
So I just asked Claude to write a markdown parser for me. And it wrote it. Zero-shot.
Yeah. It wasn't quite zero-shot. But after, you know, maybe one or two prompts, it got it. And yeah, that's the markdown parser that's in Claude Code today.
And the reason that markdown looks so beautiful. That’s a fun one. It’s interesting what the new bar is, I guess, for implementing features.
Like this exact example where there’s libraries out there that you normally reach for that you find, you know, some dissatisfaction with.
For literally whatever reason, you could just spit up an alternative and go off of that.
Yeah. I feel like AI has changed so much and, you know, literally in the last year. But a lot of these problems are, you know, like the example we had before, a feature you might not have built before.
Or you might have used a library. Now you can just do it yourself. Like the cost of writing code is going down and productivity is going up.
And we just have not internalized what that really means yet.
Yeah. But, yeah, I expect that a lot more people are going to start doing things like this. Like writing your own libraries or just shipping every feature.
Just to zoom out, you obviously do not have a separate Claude Code subscription. I'm curious what the roadmap is.
Like, is this just going to be a research preview for much longer? Are you going to turn it into an actual product?
I know you were talking to a lot of CTOs and VPs. Is there going to be a Claude Code enterprise? What's the vision?
Yeah. So, we have a permanent team on Claude Code. We're growing the team. We're really excited to support Claude Code in the long run.
And so, yeah, we plan to be around for a while. In terms of subscription itself, it’s something that we’ve talked about.
It depends a lot on whether or not most users would prefer that over pay-as-you-go. So far, pay-as-you-go has made it really easy for people to start experiencing the product because there’s no upfront commitment.
And it also makes a lot more sense in a more autonomous world in which people are scripting Claude Code a lot more. But we also hear the concern around, hey, I want more price predictability if this is going to be my go-to tool.
So, we're very much still in the stages of figuring that out. For enterprises, given that Claude Code is very much a productivity multiplier for ICs, and most ICs can adopt it directly,
we've just been supporting enterprises as they have questions around security and productivity monitoring.
And so, yeah, we’ve found that a lot of folks see the announcement and they want to learn more. And so, we’ve been just engaging in those.
Do you have a credible number for the productivity improvement? For people who are not at Anthropic that you've talked to, are we talking 30%?
Some number would help justify things.
We’re working on getting this. Yeah. We should.
Yeah. It’s something we’re actively working on. But anecdotally, for me, it’s probably 2x my productivity.
Oh, my God. So, I’m just like, I’m an engineer that codes all day, every day.
Yeah. For me, it’s probably 2x.
Yeah. I think there are some engineers at Anthropic where it's probably 10x their productivity.
And then there’s some people that haven’t really figured out how to use it yet. And, you know, they just use it to generate like commit messages or something.
That’s maybe like 10%. So, I think there’s probably a big range. And I think we need to, yeah, to study more.
For reference, sometimes we’re in meetings together and sales or compliance or someone is like, hey, like, we really need like X feature.
And then Forrest will ask a few questions to like understand the specs. And then like 10 minutes later, he’s like, all right, well, it’s built. I’m going to merge it later.
Anything else? So, it definitely feels far different than any other PM role I’ve had.
Do you see yourself opening up that channel, where non-technical people talk to Claude Code and the instance then comes to you, after they've already defined and talked to it and explained what they want?
And then you’re doing kind of the review side and implementation.
Yeah, we've actually done a fair bit of that. Like, Megan, the designer on our team, she is not a coder, but she's landing pull requests. She uses Claude Code to do it.
She designs the UI?
Yeah. And she's landing PRs to our console product. So, it's not even just building on Claude Code. It's building across our product suite in our monorepo.
Right.
And similarly, our data scientists use Claude Code, right? Like, for BigQuery queries. And there was some finance person that came up to me the other day and was like, hey, I've been using Claude Code.
And I’m like, what? Like, how did you even get it installed? Like, you didn’t use Git. And they’re like, yeah, yeah, I figured it out.
And yeah, they're using it. They're like, so Claude Code, you can pipe into it because it's a Unix utility.
And so what they do is they take their data, put it in a CSV, and then they cat the CSV, pipe it into Claude Code, and then they ask it questions about the CSV.
And they’ve been using it for that.
Yeah. That would be really useful to me. Because really what I do a lot of the times, like, somebody gives me a feature request, I kind of like rewrite the prompt, I put it in agent mode, and then I review the code.
It would be great to have the PR wait for me. I’m kind of useless in the first step.
Like, you know, taking the feature request and prompting the agent to write it, I’m not really doing anything.
Like, my work really starts after the first run is done. So I was going to say, like, I can see it both ways.
So, like, okay, so maybe I’ll simplify this to, in the workflow of non-technical people in loop, should the technical person come in at the start or come in at the end, right?
Or come in at the end, or at the start. Obviously, the start is the highest-leverage thing, because sometimes you just need the technical person to ask the right question that the non-technical person wouldn't know to ask.
And that really affects the implementation.
But isn’t that the bitter lesson of the model? That the model will also be good at asking the follow-up question?
Like, you know, if you’re like telling the model, hey, you are…
That’s what I trust the model to do the least, right?
Sorry, go ahead.
Yeah, no, no. It's like you tell the model, hey, you are the person that needs to translate this non-technical person's request into the best prompt for Claude Code to do a first implementation.
Yeah. Like, I don’t know how good the model would be today. I don’t have an eval for that. But that seems like a promising direction for me.
Like, it’s easier for me to review 10 PRs than it is for me to take 10 requests, then run the agent 10 times, and then wait for all of those runs to be done and review.
I think the reality is somewhere in between.
We spend a lot of time shadowing users and watching people at different levels of seniority and technical depth use Claude Code.
And one thing we find is that people that are really good at prompting models from whatever context, maybe they’re not even technical, but they’re just really good at prompting, they’re really effective at using code.
And if you're not very good at prompting, then Claude Code tends to go off the rails more and do the wrong thing.
So I think at this stage of where models are at today, it's definitely worth taking the time to learn how to prompt the model well.
But I also agree that, you know, maybe in a month or two months or three months, you won’t need this anymore because, you know, the bitter lesson always wins.
Please. Please do it. Please do it, Anthropic.
I think there's broad interest in people forking or customizing Claude Code. So we have to ask, why is it not open source?
We are investigating.
Ah, okay. So it’s not yet. There’s a lot of trade-offs that go into it.
On one side, our team is really small and we’re really excited for open source contributions if it was open source, but it’s a lot of work to kind of maintain everything and look at it.
I maintain a lot of open source stuff and a lot of other people on the team do too. And it’s just a lot of work.
Like, it’s a full-time job managing contributions and all this stuff.
Yeah. I'll just point out that you can do source-available, and that solves a lot of people's individual use cases without going through the legal hurdles of full open source.
Yeah, exactly. I mean, I would say like, there’s nothing that secret in the source. And obviously, it’s all JavaScript. So you can just decompile it.
There are decompilations out there. It's very interesting.
Yeah. And generally our approach is, you know, all the secret sauce, it’s all in the model. And this is the thinnest possible wrapper over the model.
We literally could not build anything more minimal. This is the most minimal thing.
Yeah. So there’s just not that much in it. If there was another architecture that you would be interested in that is not the simplest, what would you have picked as an alternative?
You know, like, we’re just talking about agentic architectures here, right?
Like there’s a loop here and it goes through and you sort of pull in the models and tools in a relatively intuitive way.
If you were to rewrite it from scratch and choose the generationally harder path, like what would that look like?
Well, Boris has rewritten this. Boris and the team have rewritten this like five times.
Oh, that’s a story.
Yeah. It is very much the simplest thing, I think, by design.
Okay. So it’s got simpler. It got simpler. It doesn’t go more complex.
We’ve rewritten it from scratch. Yeah. Probably every three weeks, four weeks or something.
And it's like a ship of Theseus, right? Every piece keeps getting swapped out.
And that's just because Claude is so good at writing its own code.
Yeah. I mean, at the end of the day, the thing where breaking changes matter is the interface: the CLAUDE.md, MCP, blah, blah, blah.
All that has to kind of stay the same unless you really have a strong reason to change it.
Yeah. I think most of the changes are to make things more simple, like to share interfaces across different components.
Because ultimately, we just want to make sure that the context that’s given to the model is in like the purest form and that the harness doesn’t intervene with the user’s intent.
And so very much, a lot of that is just like removing things that could get in the way or that could confuse the model.
On the UX side, something that’s been pretty tricky. And the reason that, you know, we have a designer working on a terminal app is it’s actually really hard to design for a terminal.
There's just not a lot of literature on this. I've been doing product for a while, so I kind of know how to build for apps and for web and, you know, for developers in terms of DevEx.
But like terminal is sort of new.
There’s a lot of these really old terminal UIs that use curses and things like this for very sophisticated UI systems.
But they all feel really antiquated by the UI standards of today.
And so it’s taken a lot of work to figure out how exactly do you make the app feel fresh and modern and intuitive in a terminal.
Yeah. And we’ve had to come up with a lot of that design language ourselves.
Yeah. I mean, I’m sure you’ll be developing over time.
Cool. A closing question.
This is just more general. Like I think a lot of people are wondering, Anthropic has, I think it’s easy to say the best brand for AI engineering, like, you know, developers and coding models.
And now with the coding tool attached to it, it just has the whole product suite of model and tool and protocol.
Right. So I don’t think this was obvious one year ago today.
Like when Claude 3 launched, it was just more like, this is a general-purpose model and all that.
But Claude Sonnet really took the scene as the sort of coding model of choice, and I think that built Anthropic's brand, and you guys are now extending it.
So why is Anthropic doing so well with developers?
Like, it seems like there’s just no centralized, every time I talk to Anthropic people, they’re like, oh yeah, we just had this idea and we pushed it and it did well.
And I’m just like, there’s no centralized strategy here.
Or like, you know, is there an overarching strategy? Sounds like a PM question to me.
I don’t know. I would say like Dario is not like breathing down your necks going like build the best dev tools.
Like, he’s just, you know, letting you do your thing. Everyone just wants to build awesome stuff.
It’s like, I feel like the model just wants to write code.
Yeah. I think a lot of this trickles down from like the model itself being very good at code generation.
Like we’re very much building off the backs of an incredible model.
Like that's the only reason why Claude Code is possible.
I think there’s a lot of answers to why the model itself is good at code.
But I think like one high-level thing would be so much of the world is run via software.
And there’s immense demand for great software engineers. And it’s also something that like you can do almost entirely with just a laptop or like just a dev box or like some hardware.
And so it’s just like is an environment that’s very suitable for LLMs.
It’s an area where we feel like you can unlock a lot of economic value by being very good at it.
There’s like a very direct ROI there.
We do care a lot about other areas too. But I think this is just one in which the models tend to be quite good.
And the team’s really excited to build products on top of it.
And you’re growing the team you mentioned?
Who do you want to hire?
Yeah, we are. Who’s like a good fit for your team?
We don’t have a particular profile. So if you feel really passionate about coding and about the space, if you’re interested in learning how models work and how terminals work and how like, you know, all these technologies that are involved.
Yeah, hit us up. Always happy to chat.
Awesome.
Well, thank you for coming on. This was fun.
Thank you.
Thanks for having us. This was fun.
2025-05-07 08:00:01
I want to show the steps of how I create a vibe coding tool using a vibe coding tool
build a chat-based agentic web app builder
- chat UI in the left panel, where people can ask the agent to make changes to the code base, left panel can be hidden or shown by clicking the button on the top
- the right panel can either be the preview or a code editor with a file explorer (switcher on the top right)
- top left menu shows credit used, settings and help
clicking shows or hides a popup menu; items in the menu are:
- go to dashboard
- credit usage
- settings
- appearance
- help
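For reference, here's a rough sketch of the layout described above (my own sketch, not the generated output); component names, Tailwind-style classes, and button labels are placeholders:

```tsx
// src/pages/index.tsx -- rough shape of the described layout; names and classes are placeholders.
import { useState } from "react";

export default function Builder() {
  const [chatOpen, setChatOpen] = useState(true);
  const [rightView, setRightView] = useState<"preview" | "code">("preview");
  const [menuOpen, setMenuOpen] = useState(false);

  return (
    <div className="flex h-screen">
      {chatOpen && <aside className="w-96 border-r">{/* chat thread + input */}</aside>}
      <main className="flex-1 flex flex-col">
        <header className="flex items-center justify-between p-2">
          <div>
            <button onClick={() => setMenuOpen((v) => !v)}>Menu</button>
            {menuOpen && (
              <ul>{/* go to dashboard, credit usage, settings, appearance, help */}</ul>
            )}
            <button onClick={() => setChatOpen((v) => !v)}>Toggle chat</button>
          </div>
          <button onClick={() => setRightView(rightView === "preview" ? "code" : "preview")}>
            {rightView === "preview" ? "Show code" : "Show preview"}
          </button>
        </header>
        {rightView === "preview"
          ? <div className="flex-1">{/* preview iframe */}</div>
          : <div className="flex-1">{/* file explorer + code editor */}</div>}
      </main>
    </div>
  );
}
```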
Please connect my supabase project `TinyLove`
Add login
Show me the secrets form
process user's request from the chat panel using openai, store the response in supabase and show the user-facing reply in the chat thread
- prepend a system prompt to the prompt sent to openai
- extract actions and user-facing reply from the response, and save both to supabase
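For illustration, a minimal sketch of what the 'chat' edge function behind this prompt could look like, assuming Supabase Edge Functions (Deno), chat_messages and agent_actions tables, and an OpenAI model name that may need updating:

```ts
// supabase/functions/chat/index.ts -- sketch; table names and the response schema are assumptions.
import { createClient } from "npm:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// System prompt prepended to every request (contents are a placeholder).
const SYSTEM_PROMPT =
  'You are a web app building agent. Respond with JSON: {"actions": [...], "reply": "..."}';

Deno.serve(async (req) => {
  const { projectId, message } = await req.json();

  // Call OpenAI with the system prompt prepended to the user's chat message.
  const completion = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${Deno.env.get("OPENAI_API_KEY")}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-mini", // placeholder model name
      response_format: { type: "json_object" },
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: message },
      ],
    }),
  }).then((r) => r.json());

  // Extract the actions and the user-facing reply from the model's JSON response.
  const { actions, reply } = JSON.parse(completion.choices[0].message.content);

  // Save both to Supabase (table names are assumptions).
  await supabase.from("chat_messages").insert({ project_id: projectId, role: "assistant", content: reply });
  await supabase.from("agent_actions").insert({ project_id: projectId, actions });

  return new Response(JSON.stringify({ reply }), {
    headers: { "Content-Type": "application/json" },
  });
});
```

A real version would also load prior messages for context; this only shows the prepend, extract, and save steps.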
Let’s connect Stripe to my project. We will begin with secure payment processing.
answers to follow-up questions:
Are you looking to implement one-time payments or recurring subscriptions?
Do you want to track payment data in your Supabase database?
Will payments require user authentication, or should guests be able to make purchases too?
---
1. one-time payment for getting 100 credits at $10, tied to user id
2. yes
3. yes, guests not allowed
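A hedged sketch of the checkout function those answers imply: one-time payment, $10 for 100 credits, authenticated users only, with the user id carried in Stripe metadata. URLs, env var names, and the Stripe/Supabase wiring are assumptions:

```ts
// supabase/functions/create-checkout/index.ts -- sketch only.
import Stripe from "npm:stripe";
import { createClient } from "npm:@supabase/supabase-js@2";

const stripe = new Stripe(Deno.env.get("STRIPE_SECRET_KEY")!, {
  httpClient: Stripe.createFetchHttpClient(), // use fetch in the Deno runtime
});

Deno.serve(async (req) => {
  // Guests are not allowed: require a Supabase-authenticated user.
  const supabase = createClient(
    Deno.env.get("SUPABASE_URL")!,
    Deno.env.get("SUPABASE_ANON_KEY")!,
    { global: { headers: { Authorization: req.headers.get("Authorization") ?? "" } } },
  );
  const { data: { user } } = await supabase.auth.getUser();
  if (!user) return new Response("Unauthorized", { status: 401 });

  // One-time payment: 100 credits for $10, tied to the user id via metadata.
  const session = await stripe.checkout.sessions.create({
    mode: "payment",
    line_items: [{
      price_data: {
        currency: "usd",
        unit_amount: 1000, // $10.00
        product_data: { name: "100 credits" },
      },
      quantity: 1,
    }],
    metadata: { user_id: user.id },
    success_url: "https://example.com/credits?status=success",
    cancel_url: "https://example.com/credits?status=cancelled",
  });

  return new Response(JSON.stringify({ url: session.url }), {
    headers: { "Content-Type": "application/json" },
  });
});
```

A Stripe webhook handler (not shown) would then verify the event and record the 100 credits against that user in Supabase.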
in index.tsx, make split between chat panel and main view draggable, make main view responsive
Supabase Error: {
code: "42501",
details: null,
hint: null,
message: 'new row violates row-level security policy for table "chat_messages"'
}
It didn’t work out so I disabled it on Supabase for now.
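The 42501 error means the insert hit a row-level security policy it didn't satisfy. Instead of disabling RLS, two common patterns (sketched here with the standard Supabase env var names; the table policies themselves aren't shown) are to insert server-side with the service-role key, or to forward the caller's JWT so policies based on auth.uid() pass:

```ts
// Inside the edge function, `req` is the incoming Request.
import { createClient } from "npm:@supabase/supabase-js@2";

export function clientsFor(req: Request) {
  // Option A: service-role client. Bypasses RLS entirely; only ever use it server-side.
  const admin = createClient(
    Deno.env.get("SUPABASE_URL")!,
    Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
  );

  // Option B: act as the calling user by forwarding their JWT, so INSERT policies
  // that check auth.uid() = user_id can pass without weakening security.
  const asUser = createClient(
    Deno.env.get("SUPABASE_URL")!,
    Deno.env.get("SUPABASE_ANON_KEY")!,
    { global: { headers: { Authorization: req.headers.get("Authorization") ?? "" } } },
  );

  return { admin, asUser };
}
```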
implement the code editor using @monaco-editor/react
connect filetree to api endpoint 'api.mindynode.com/agent/{project}/workspace' to fetch the filetree json, use mocking for the api
Refactor src/components/editor/CodeEditor.tsx into smaller files without breaking any functionality. Split it into at least a FileTree component and an Editor component, as it's currently over 220 lines long.
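A small sketch of what the split might yield, using @monaco-editor/react for the Editor piece and the mocked file-tree endpoint from the earlier prompt; props, file paths, and the mock tree are assumptions:

```tsx
// src/components/editor/Editor.tsx -- one possible shape for the split-out Editor.
import MonacoEditor from "@monaco-editor/react";

export function Editor({ path, value, onChange }: {
  path: string;
  value: string;
  onChange: (next: string) => void;
}) {
  return (
    <MonacoEditor
      height="100%"
      path={path} // lets Monaco keep a separate model per open file
      defaultLanguage="typescript"
      value={value}
      onChange={(next) => onChange(next ?? "")}
      options={{ minimap: { enabled: false }, fontSize: 13 }}
    />
  );
}

// src/components/editor/fileTree.ts -- file-tree fetch with a mock fallback,
// pointed at the endpoint named in the earlier prompt.
export async function fetchFileTree(project: string, mock = true) {
  if (mock) {
    return [{ name: "src", children: [{ name: "index.tsx" }, { name: "App.tsx" }] }];
  }
  const res = await fetch(`https://api.mindynode.com/agent/${project}/workspace`);
  return res.json();
}
```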
implement the preview, it should show an iframe, use src='news.mindynode.com' for mocking the preview
implement credit logic in supabase edge function 'chat', before calling LLM, check the user's credit, decrease credit by 1 if it's bigger than 0, otherwise skip calling LLM and return an error message 'not enough credit'
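A minimal sketch of that credit logic as a Supabase edge function (Deno); table and column names are assumptions, and the read-then-write decrement would want an atomic Postgres function via .rpc() in production:

```ts
// supabase/functions/chat/index.ts (credit-check portion) -- sketch only.
import { createClient } from "npm:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

Deno.serve(async (req) => {
  const { userId, message } = await req.json();

  // Check the user's credit before calling the LLM.
  const { data: row } = await supabase
    .from("credits")
    .select("balance")
    .eq("user_id", userId)
    .single();

  if (!row || row.balance <= 0) {
    // Skip the LLM call entirely and return the error message.
    return new Response(JSON.stringify({ error: "not enough credit" }), { status: 402 });
  }

  // Decrease credit by 1. A Postgres function called via .rpc() would make this atomic;
  // this read-then-write version can race under concurrent requests.
  await supabase.from("credits").update({ balance: row.balance - 1 }).eq("user_id", userId);

  const reply = await callLLM(message);
  return new Response(JSON.stringify({ reply }), {
    headers: { "Content-Type": "application/json" },
  });
});

// Placeholder for the OpenAI call sketched earlier.
async function callLLM(message: string): Promise<string> {
  return `echo: ${message}`;
}
```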
I'm working on the backend part, which basically connects OpenHands' execution loop and adds an API layer over the file system and git of its /workspace, so stay tuned.