About Chip Huyen

Co-founder of Claypot AI, graduated from Stanford University, grew up in Vietnam. Ex NVIDIA, Snorkel AI, and Netflix.

The RSS's url is : https://huyenchip.com/feed.xml

Please copy to your reader or subscribe it with :

Preview of RSS feed of Chip Huyen

Measuring personal growth

2024-04-17 08:00:00

My founder friends constantly think about growth. They think about how to measure their business growth and how to get to the next order of magnitude scale. If they’re making $1M ARR today, they think about how to get to $10M ARR. If they have 1,000 users today, they think about how to get to 10,000 users.

This made me wonder if/how people are measuring personal growth. I don’t want to use metrics like net worth or the number of followers, because that’s not what I live for. After talking with a lot of friends, I found three interesting metrics: rate of change, time to solve problems, and number of future options.

Some friends told me they find this blog post mildly sociopathic. Why do I have to measure everything? Life is to be lived, not to be measured. As someone lowkey fascinated by numbers, I don’t see why measuring and living have to be mutually exclusive – measuring often helps me live better – but I see where they come from. This post is more of a thought exercise than a rigorous experiment.

Rate of change

I have this theory that life has a circadian rhythm. Every 3-6 years, you become a different person. You work on different problems. Your lifestyle changes. The people you hang out with are different. If you haven’t caught up with a friend in 5 years, you might no longer have anything in common. It’s not a coincidence that schools are structured into chunks of 3-6 years.

Looking back, I realized that every 3-6 years, my life completely changed. From grade 3 to grade 10, I did competitive math. For the next 5 years, I worked as a writer. Then I went to college and studied computer science for 4 years. After that, I fumbled around for almost 6 years. It was only recently that I felt like I had a handle on life.

Sami, a new friend who loves designing strategy games, told me about the rule of 72 in finance. It’s a simple formula that estimates the number of years it will take for an investment to double in value. If the annual interest rate is 8%, it’ll take 72/8 = 9 years for the value of your investment to double.

I wonder if I could treat myself as an investment, and measure my growth by how long it’d take me to become a new person. Becoming a new person isn’t always a good thing, and probably not the goal for everyone. But for me, it is. I want to be able to see things from a new perspective. I want to be exposed to new challenges. I treasure old friends (I still talk to my best friends in elementary school), but I like learning from new friends.

Time to solve problems

Quynh, an old friend who runs a publishing house in Vietnam, believes that there are three big problems in life: career, family, and finance. It usually takes people a decade to figure each out.

  1. For the first decade after graduation, you figure out what you want to do with your life.
  2. For the next decade, you get married, buy a house, and have kids.
  3. For the next decade, you build out your savings to retire.

Her goal is to solve these problems as fast as possible, so she can focus on more interesting problems.

This made me think that perhaps I can measure my growth by looking at what big problems I’ve solved. What big problems was I worried about 5 years ago that I no longer worry about now? What big problems am I worried about now that I don’t want to worry about in 5 years?

What is considered a big problem depends on each person. For me, it’s career, finance, social, immigration, family, and health. Here are a couple of concrete examples that made me feel like I’ve made progress. 5 years ago, I was anxious about being in the US on a visa. This problem went away when I got my green card. 5 years ago, I constantly felt insecure like I was an imposter in the Bay. Today, I feel at home here.

Number of future options

A friend I’ve met through my Discord, Denys, told me that his friend has this theory that every few years, half of your dreams die. People give up on their dreams because they realize that they can no longer achieve them.

I disagree. As I grow older, I have more dreams. I now know many things that I didn’t know before, and I have access to more resources than I ever did. This allows me to do things that I used to think of as impossible.

During a reinforcement learning course in college, I learned about empowerment maximization. It’s a simple principle that enables robots/agents to exhibit relatively intelligent behavior. In the face of uncertainty, an agent following empowerment maximization would choose the action that maximizes future options. For example, facing multiple switches, it’d choose the switch that opens the most doors.

I realized that this is the same principle that I’ve followed. In the face of uncertainty, I lean towards the decision that would give me the most future options. For example, I’d choose a job that pays less but gives me more job options in the future (e.g. if the job gives me exposure like allowing me to work on open source or publish papers). I’d prioritize tasks that teach me transferable skills instead of tasks that teach me niche, narrow skills.

Perhaps I can measure my growth by how many new options I have gained/lost. What options are available to me today that were not available to me 5 years ago? What options were available to me 5 years ago that aren’t available to me now? More importantly, what options that are not available to me today do I want 5 years from now?

Sami pointed me to this image from Wait But Why. As time goes by, many doors are closed to us, but many new doors open up. Denys’s friend was referring to the black lines on the left, and I focus on the green lines on the right.

Generative AI Stack


There are three heuristics that I follow for personal growth:

  1. I try to become a new person every 3-6 years.
  2. I try to solve big problems as fast as possible. I think of this as creating safety nets that allow me to take bigger risks and explore more things in the future.
  3. I take actions that help me maximize future options.

These heuristics work for me (so far) because I have a strong bias towards novelty and exploration. Maybe one day, I’ll get tired of exploration, and these heuristics will change. When that happens, that’ll be growth.

What I learned from looking at 900 most popular open source AI tools

2024-03-14 08:00:00

[Hacker News discussion, LinkedIn discussion, Twitter thread]

Four years ago, I did an analysis of the open source ML ecosystem. Since then, the landscape has changed, so I revisited the topic. This time, I focused exclusively on the stack around foundation models.

The full list of open source AI repos is hosted at llama-police. The list is updated every 6 hours. You can also find most of them on my cool-llm-repos list on GitHub.

Table of contents
…. How to add missing repos
The New AI Stack
…. AI stack over time
…….. Applications
…….. AI engineering
…….. Model development
…….. Infrastructure
Open source AI developers
…. One-person billion-dollar companies?
…. 1 million commits
The growing China’s open source ecosystem
Live fast, die young
My personal favorite ideas


I searched GitHub using the keywords gpt, llm, and generative ai. If AI feels so overwhelming right now, it’s because it is. There are 118K results for gpt alone.

To make my life easier, I limited my search to the repos with at least 500 stars. There were 590 results for llm, 531 for gpt, and 38 for generative ai. I also occasionally checked GitHub trending and social media for new repos.

After MANY hours, I found 896 repos. Of these, 51 are tutorials (e.g. dair-ai/Prompt-Engineering-Guide) and aggregated lists (e.g. f/awesome-chatgpt-prompts). While these tutorials and lists are helpful, I’m more interested in software. I still include them in the final list, but the analysis is done with the 845 software repositories.

It was a painful but rewarding process. It gave me a much better understanding of what people are working on, how incredibly collaborative the open source community is, and just how much China’s open source ecosystem diverges from the Western one.

Add missing repos

I undoubtedly missed a ton of repos. You can submit the missing repos here. The list will be automatically updated every day.

Feel free to submit the repos with less than 500 stars. I’ll continue tracking them and add them to the list when they reach 500 stars!

The New AI Stack

I think of the AI stack as consisting of 4 layers: infrastructure, model development, application development, and applications.

Generative AI Stack

  1. Infrastructure

    At the bottom is the stack is infrastructure, which includes toolings for serving (vllm, NVIDIA’s Triton), compute management (skypilot), vector search and database (faiss, milvus, qdrant, lancedb), ….

  2. Model development

    This layer provides toolings for developing models, including frameworks for modeling & training (transformers, pytorch, DeepSpeed), inference optimization (ggml, openai/triton), dataset engineering, evaluation, ….. Anything that involves changing a model’s weights happens in this layer, including finetuning.

  3. Application development With readily available models, anyone can develop applications on top of them. This is the layer that has seen the most actions in the last 2 years and is still rapidly evolving. This layer is also known as AI engineering.

    Application development involves prompt engineering, RAG, AI interface, …

  4. Applications

    There are many open sourced applications built on top of existing models. The most popular types of applications are coding, workflow automation, information aggregation, …

Outside of these 4 layers, I also have another category, Model repos, that are created by companies and researchers to share the code associated with their models. Examples of repos in this category are CompVis/stable-diffusion, openai/whisper, and facebookresearch/llama.

AI stack over time

I plotted the cumulative number of repos in each category month-over-month. There was an explosion of new toolings in 2023, after the introduction of Stable Diffusion and ChatGPT. The curve seems to flatten in September 2023 because of three potential reasons.

  1. I only include repos with at least 500 stars in my analysis, and it takes time for repos to gather these many stars.
  2. Most low-hanging fruits have been picked. What is left takes more effort to build, hence fewer people can build them.
  3. People have realized that it’s hard to be competitive in the generative AI space, so the excitement has calmed down. Anecdotally, in early 2023, all AI conversations I had with companies centered around gen AI, but the recent conversations are more grounded. Several even brought up scikit-learn. I’d like to revisit this in a few months to verify if it’s true.
Generative AI Stack Over Time

In 2023, the layers that saw the highest increases were the applications and application development layers. The infrastructure layer saw a little bit of growth, but it was far from the level of growth seen in other layers.


Not surprisingly, the most popular types of applications are coding, bots (e.g. role-playing, WhatsApp bots, Slack bots), and information aggregation (e.g. “let’s connect this to our Slack and ask it to summarize the messages each day”).

Breakdown of popular AI applications

AI engineering

2023 was the year of AI engineering. Since many of them are similar, it’s hard to categorize the tools. I currently put them into the following categories: prompt engineering, AI interface, Agent, and AI engineering (AIE) framework.

Prompt engineering goes way beyond fiddling with prompts to cover things like constrained sampling (structured outputs), long-term memory management, prompt testing & evaluation, etc.

A list of prompt engineering tools

AI interface provides an interface for your end users to interact with your AI application. This is the category I’m the most excited about. Some of the interfaces that are gaining popularity are:

AIE framework is a catch-all term for all platforms that help you develop AI applications. Many of them are built around RAG, but many also provide other toolings such as monitoring, evaluation, etc.

Agent is a weird category, as many agent toolings are just sophisticated prompt engineering with potentially constrained generation (e.g. the model can only output the predetermined action) and plugin integration (e.g. to let the agent use tools).

AI engineering stack over time

Model development

Pre-ChatGPT, the AI stack was dominated by model development. Model development’s biggest growth in 2023 came from increasing interest in inference optimization, evaluation, and parameter-efficient finetuning (which is grouped under Modeling & training).

Inference optimization has always been important, but the scale of foundation models today makes it crucial for latency and cost. The core approaches for optimization remain the same (quantization, low-ranked factorization, pruning, distillation), but many new techniques have been developed especially for the transformer architecture and the new generation of hardware. For example, in 2020, 16-bit quantization was considered state-of-the-art. Today, we’re seeing 2-bit quantization and even lower than 2-bit.

Similarly, evaluation has always been essential, but with many people today treating models as blackboxes, evaluation has become even more so. There are many new evaluation benchmarks and evaluation methods, such as comparative evaluation (see Chatbot Arena) and AI-as-a-judge.

Model Development Stack Over Time


Infrastructure is about managing data, compute, and toolings for serving, monitoring, and other platform work. Despite all the changes that generative AI brought, the open source AI infrastructure layer remained more or less the same. This could also be because infrastructure products are typically not open sourced.

The newest category in this layer is vector database with companies like Qdrant, Pinecone, and LanceDB. However, many argue this shouldn’t be a category at all. Vector search has been around for a long time. Instead of building new databases just for vector search, existing database companies like DataStax and Redis are bringing vector search into where the data already is.

Open source AI developers

Open source software, like many things, follows the long tail distribution. A handful of accounts control a large portion of the repos.

One-person billion-dollar companies?

845 repos are hosted on 594 unique GitHub accounts. There are 20 accounts with at least 4 repos. These top 20 accounts host 195 of the repos, or 23% of all the repos on the list. These 195 repos have gained a total of 1,650,000 stars.

Most active GitHub accounts

On Github, an account can be either an organization or an individual. 19/20 of the top accounts are organizations. Of those, 3 belong to Google: google-research, google, tensorflow.

The only individual account in these top 20 accounts is lucidrains. Among the top 20 accounts with the most number of stars (counting only gen AI repos), 4 are individual accounts:

Most active GitHub accounts

Unsurprisingly, the lower we go in the stack, the harder it is for individuals to build. Software in the infrastructure layer is the least likely to be started and hosted by individual accounts, whereas more than half of the applications are hosted by individuals.

Can you do this alone?

Applications started by individuals, on average, have gained more stars than applications started by organizations. Several people have speculated that we’ll see many very valuable one-person companies (see Sam Altman’s interview and Reddit discussion). I think they might be right.

Can you do this alone?

1 million commits

Over 20,000 developers have contributed to these 845 repos. In total, they’ve made almost a million contributions!

Among them, the 50 most active developers have made over 100,000 commits, averaging over 2,000 commits each. See the full list of the top 50 most active open source developers here.

Most active open source developers

The growing China's open source ecosystem

It’s been known for a long time that China’s AI ecosystem has diverged from the US (I also mentioned that in a 2020 blog post). At that time, I was under the impression that GitHub wasn’t widely used in China, and my view back then was perhaps colored by China’s 2013 ban on GitHub.

However, this impression is no longer true. There are many, many popular AI repos on GitHub targeting Chinese audiences, such that their descriptions are written in Chinese. There are repos for models developed for Chinese or Chinese + English, such as Qwen, ChatGLM3, Chinese-LLaMA.

While in the US, many research labs have moved away from the RNN architecture for language models, the RNN-based model family RWKV is still popular.

There are also AI engineering tools providing ways to integrate AI models into products popular in China like WeChat, QQ, DingTalk, etc. Many popular prompt engineering tools also have mirrors in Chinese.

Among the top 20 accounts on GitHub, 6 originated in China:

  1. THUDM: Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University.
  2. OpenGVLab: General Vision team of Shanghai AI Laboratory
  3. OpenBMB: Open Lab for Big Model Base, founded by ModelBest & the NLP group at Tsinghua University.
  4. InternLM: from Shanghai AI Laboratory.
  5. OpenMMLab: from The Chinese University of Hong Kong.
  6. QwenLM: Alibaba’s AI lab, which publishes the Qwen model family.

Live fast, die young

One pattern that I saw last year is that many repos quickly gained a massive amount of eyeballs, then quickly died down. Some of my friends call this the “hype curve”. Out of these 845 repos with at least 500 GitHub stars, 158 repos (18.8%) haven’t gained any new stars in the last 24 hours, and 37 repos (4.5%) haven’t gained any new stars in the last week.

Here are examples of the growth trajectory of two of such repos compared to the growth curve of two more sustained software. Even though these two examples shown here are no longer used, I think they were valuable in showing the community what was possible, and it was cool that the authors were able to get things out so fast.

Hype curve

My personal favorite ideas

So many cool ideas are being developed by the community. Here are some of my favorites.


Even though I included only 845 repos in my analysis, I went through several thousands of repos. I found this helpful for me to get a big-picture view of the seemingly overwhelming AI ecosystem. I hope the list is useful for you too. Please do let me know what repos I’m missing, and I’ll add them to the list!

Predictive Human Preference: From Model Ranking to Model Routing

2024-02-28 08:00:00

A challenge of building AI applications is choosing which model to use. What if we don’t have to? What if we can predict the best model for any prompt? Predictive human preference aims to predict which model users might prefer for a specific query.

Table of contents

Ranking Models Using Human Preference
…. How Preferential Ranking Works
…. Correctness of Chatbot Arena Ranking
…….. Eval data
…….. Results
Predicting Human Preference For Each Prompt
…. Experiment setup
…. Experiment results
…….. Domain-specific and query-specific leaderboards

Human preference has emerged to be both the Northstar and a powerful tool for AI model development. Human preference guides post-training techniques including RLHF and DPO. Human preference is also used to rank AI models, as used by LMSYS’s Chatbot Arena.

Chatbot Arena aims to determine which model is generally preferred. I wanted to see if it’s possible to predict which model is preferred for each query.

One use case of predictive human preference is model routing. For example, if we know in advance that for a prompt, users will prefer Claude Instant’s response over GPT-4, and Claude Instant is cheaper/faster than GPT-4, we can route this prompt to Claude Instant. Model routing has the potential to increase response quality while reducing costs and latency.

Another use case of predictive human preference is interpretability. Mapping out a model’s performance on different prompts can help us understand this model’s strengths and weaknesses. See section Experiment results for examples.

Here’s what predictive human preference for different model pairs looks like for the prompt “What’s the best way to cluster text embeddings?”. The predictions were generated by my toy preference predictor. The bright yellow color for the (GPT-4, GPT-3.5-Turbo) cell means that my predictor thinks GPT-4’s response is very likely to be preferred to that of GPT-3.5-Turbo’s for this prompt.

Predictive human preference for all LLM model pairs

This post first discusses the correctness of Chatbot Arena, which will then be used as a baseline to evaluate the correctness of preference predictions. It then discusses how to build a preference predictor and the initial results.

Ranking Models Using Human Preference

Using preferential signals (comparisons) to rank models has grown in popularity in the last few years. Other than powering LMSYS’s Chatbot Arena, it’s also used by many model providers (Anthropic, Gemini, ChatGPT, etc.) to evaluate their models in production.

Predictive human preference for all LLM model pairs

Side note: Friends who have deployed this in production told me that most users don’t read both options and just randomly vote for one. This introduces a lot of noise. However, the signals from the small percentage of users who vote correctly can sometimes be sufficient to help determine which model is preferred, as long as there’s minimal bias in the random voting.

How Preferential Ranking Works

Preferential ranking works in two steps:

  1. Collect comparison data about user preference.
  2. Compute a model ranking from these comparisons.

For each request, two or more models are selected to respond. An evaluator, which can be human or AI, picks the winner. The evaluator shouldn’t know which models are being judged. Each comparison is called a match. This process results in a series of comparisons.

Match ID Prompt Model A Model B Winner
1 Model 1 Model 2 Model 1
2 Model 3 Model 1 Model 1
3 Model 1 Model 4 Model 4
... ... ... ... ...

From these comparisons, we need to compute the rankings of all models. The two most common ranking algorithms are Elo (from chess) and TrueSkill (from video games).

While Chatbot Arena refers to their model scores “Elo scores”, they actually don’t use Elo. In December 2023, they switched to Bradley-Terry but scaled the resulting scores to make them look Elo-like (see their notebook).

Given a history of match outcomes, the Bradley-Terry algorithm finds the model scores that maximize the likelihood of these match outcomes, turning model scoring into a maximum likelihood estimation problem. The input, for each training example, is the models that participate in the match. The output is the outcome of the match. Assuming there’s no draw, the outcome of a match is either 0 (a wins) or 1 (b wins).

Predictive human preference for all LLM model pairs

Correctness of Chatbot Arena Ranking

Given the same match outcomes, different ranking algorithms can produce different rankings. For example, the ranking computed by Elo might differ from the ranking computed by Bradley-Terry. How do we know that a ranking is correct?

At its core, model ranking is a predictive problem. We compute a ranking from historical match outcomes and use it to predict future match outcomes. The quality of a ranking is determined by how accurately it can predict future match outcomes.

Let’s say we have a match between model A and model B. If model A has a higher score, meaning that the ranking algorithm predicts that A wins. If users indeed prefer the higher-ranking model, the ranking algorithm makes a correct prediction.

Eval data

To compute the accuracy of Chatbot Arena ranking, I used their data published in July 2023, which consists of 33K crowd-sourced comparisons for matches among 20 models. I used this smaller dataset instead of their Jan 2024 dataset because this smaller dataset contains the prompt used for each match, which I need for predictive human preference. Benchmarking on this dataset allows me to compare my model with the Bradley-Terry algorithm later on.

Here’s an example from their July 2023 dataset.

prompt model_a model_b winner model_a's response model_b's response
who was the last monarch of uk koala-13b vicuna-13b model_a The last monarch of the United Kingdom was Queen Elizabeth II, who reigned from 1952 to 2020. The current monarch of the United Kingdom is Queen Elizabeth II. She has been the monarch since 1952, and is the longest-reigning monarch in British history.

For reference, the Bradley-Terry (BT) scores of the top 7 models in this dataset are as follows.

  1. GPT-4: 1189
  2. Claude-v1: 1150
  3. Claude-instant-v1: 1110
  4. GPT-3.5-Turbo: 1104
  5. WizardLM-13B: 1058
  6. Vicuna-13b: 1040
  7. Guanaco-33b: 1031

To create a test set, I randomly select 10% of the data (3300 examples). Each match has three possible outcomes: model_a wins, model_b wins, or tie. This can still be framed as a binary classification problem if we treat a tied match as two matches: one in which model_a wins and one in which model_b wins.


I found that for all non-tie matches in my test set, the model with the higher Bradley-Terry score is preferred 74.1% of the time. This means that if we always predict the higher-ranked model as the winner for a match, we’d have an accuracy of 74.1%.

Test data Output classes # samples BT's accuracy
All matches
  • model_a wins
  • model_b wins
  • tie
3,300 53.33%
Non-tie matches
  • model_a wins
  • model_b wins
2,367 74.1%
Non-tie matches involving GPT-4
  • model_a wins
  • model_b wins
355 85.1% (always pick GPT-4 as winner)

Back in July 2023, GPT-4 was considered the strongest model by a long shot (this was before Gemini, Mistral, Claude-v2). Did users always prefer GPT-4 to all other models? They didn’t. In 355 non-tie matches involving GPT-4, GPT-4 wins 85.1%.

This means that even though GPT-4 is the best model overall, there are prompts for which other models can outperform GPT-4. If we can figure out which prompts these are, and which models work best for them, we can route these prompts to the best-performing models, improving the response quality.

Predicting Human Preference For Each Prompt

If a ranking algorithm is about figuring out which model is better overall, predictive human preference is about figuring out which model is better for each prompt. If we know in advance that for a particular prompt, GPT-3.5 works just as well as GPT-4, and GPT-3.5 is cheaper, we can route that prompt to GPT-3.5 instead. Or if we know that Mistral-7B works just as well as GPT-4 and Mistral-7B is faster, we can route our query to Mistral-7B instead.

Model routing can also help with budget planning. Say, you only have enough budget to serve 50% of queries on the strongest model, and the rest to a weaker model, you want to make sure that you send to the weaker model only the queries that you’re confident it can do well on.

Experiment setup

I treat predictive human preference as a binary classification task. Given a match between 2 models, predict which one wins. If the probability of model_a winning is around 0.5, it can be considered a tie. If a Bradley-Terry model takes only (model_a, model_b) as the input, a preference predictor takes (prompt, model_a, model_b) as the input.

Predictive human preference for all LLM model pairs

The architecture of my preference predictor looks like this. The model encoder and preference predictor are neural networks that can be trained independently or together. I used DistilBERT as my prompt encoder.

Predictive human preference for all LLM model pairs

To train my model, I used 90% of LMSYS’s July 2023 dataset. I found that the predictor performed better using only non-tie matches (as opposed to using both tie and non-tie matches). I randomly flipped the order of models in a match 50% of the time.

To evaluate my model, I used 10% of this data. This is the same test data used to evaluate the correctness of Chatbot Arena’s ranking above.

Split All matches Non-tie matches
Train 29,700 20,927
Test 3,300 2,367

Note: I should’ve made a separate validation set for hyperparameter tuning. However, given that I didn’t have a lot of data and this is only a proof of concept, I didn’t do it. (I’m also lazy.) The matches are among 20 models, corresponding to 190 model pairs. 20,927 comparisons mean that, on average, there are only 110 comparisons per model pair.

Experiment results

I evaluated my preference predictor under two settings:

  1. Using only model_a and model_b as the input. This is to see whether this predictor, using only model names, can make better predictions about match outcomes than Chatbot Arena scores.
  2. Using (prompt, model_a, model_b) as the input. This is to see whether including prompts helps improve match outcome prediction.

I found that for all non-tie matches, my preference predictor can predict the match outcome accurately 75% of the time if not using prompts, and 76.2% of the time if using prompts. This suggests that human preference for models does change depending on the prompt. While the improvement doesn’t seem much, a 2.1% improvement can be significant at scale.

Eval data # eval samples Chatbot Arena Preference predictor
(without prompts)
Preference predictor
(with prompts)
Non-tie matches 2,367 74.1% 75% 76.2%
Non-tie matches involving GPT-4 355 85.1% 86.2% 87%

Keep in mind that this predictor was trained with a small amount of crowd-sourced (e.g. noisy) data. The prompts crowdsourced are also simple. Among 33K prompts, 180 (0.55%) of them are “hello” and “hi”. These simple prompts are insufficient to distinguish strong models from weak ones. I suspect that with more/better data, the performance of this predictor can significantly improve.

Domain-specific and query-specific leaderboards

Recall that 20 models correspond to 190 model pairs. To visualize how the predictor captures human preference, for each evaluation prompt, I generated 190 different inputs, one for each model pair.

Predictive human preference for all LLM model pairs

I then visualized the 190 predictions for 190 model pairs in a 20 x 20 grid, as shown below for the prompt “Derive the elastic wave equation.” I only included 9 models in the plot to make it readable. The diagonal values refer to comparing a model to itself, so the predicted preference should be 0.5.

Predictive human preference for all LLM model pairs

Given the predicted preference for all model pairs for a prompt, I used a Bradley-Terry model (the same ranking algorithm that LMSYS uses) to create a leaderboard for this prompt. I used the same scaling that LMSYS uses to make the scores look Elo-like. Here’s the ranking of the 9 models shown above for the query “Derive the elastic wave equation.”

This also means that with this preference predictor, we can create a leaderboard for any arbitrary subset of data. We can have a leaderboard specific to any domain.

Model ranking for the prompt"Derive the elastic wave equation."
gpt-4 1214
claude-v1 1162
gpt-3.5-turbo 1104
claude-instant-v1 1110
guanaco-33b 1023
vicuna-13b 1007
vicuna-7b 985
RWKV-4-Raven-14B 970
gpt4all-13b-snoozy 915

Despite being a toy predictor, the model seems to be able to capture different models’ performance patterns. One pattern is that for simple prompts, weak models can do (nearly) as well as strong models. For more challenging prompts, however, users are much more likely to prefer stronger models. Here’s a visualization of predicted human preference for an easy prompt (“hello, how are you?”) and a challenging prompt (“Explain why Planc length …”).

Predictive human preference for all LLM model pairs

Here are the model rankings for these two prompts. The score spread for the simple prompt is much less than the score spread for the challenging prompt. The models that are ranked differently for these two prompts are highlighted in red.

Predictive human preference for all LLM model pairs

The predictor is also the most confident that GPT-4 will be preferred for queries in Russian and queries that involve code writing. For example, the average predicted win rate for the following Russian query of GPT-4 against all other models is 91.55%. Notice that for this query, while claude-v1 is predicted to do well on this query, claude-instant-v1 is predicted to do poorly.

Predictive human preference for all LLM model pairs


My primitive experiment suggests that predictive human preference is feasible using a surprisingly small amount of data. There are many potential use cases for predictive human preference – model routing and interpretability are just two of them.

Predictive human reference is the first and the most important step in model routing (the other key step is routing strategy). With more and more models being developed, each with different capabilities and a cost structure, model routing has clear economic values.

I’m aware of four groups (two in stealth) that are working on model routing. One startup is Martian, which announced its $9M seed round. LMSYS is also working on model routing, which I think is a natural progression from their work in comparative evaluation.

While my experiment used human-annotated comparisons, LMSYS folks told me that due to the noisiness of crowd-sourced annotations and the costs of expert annotations, they’ve found that using GPT-4 to compare two responses works better. Depending on the complexity of the queries, generating 10,000 comparisons using GPT-4 would cost only $200 - 500, making this very affordable for companies that want to test it out.

This is the most fun side project I’ve worked on in a while, so I’d love to talk more about it. For those interested, I’ll be hosting a casual 30-minute discussion on predictive human preference on Tuesday, Mar 5, 9.30am PST. Join our Discord or email me if you want an invite!


Thanks Luke Metz for helping me with the experiments and coercing me into using JAX. While JAX is super cool and makes a lot of things easy, it also caused some of the weirdest bugs I’ve ever seen. I’m glad I used it though. Thanks Han-chung Lee for feedback on the plots.

Sampling for Text Generation

2024-01-16 08:00:00

ML models are probabilistic. Imagine that you want to know what’s the best cuisine in the world. If you ask someone this question twice, a minute apart, their answers both times should be the same. If you ask a model the same question twice, its answer can change. If the model thinks that Vietnamese cuisine has a 70% chance of being the best cuisine and Italian cuisine has a 30% chance, it’ll answer “Vietnamese” 70% of the time, and “Italian” 30%.

This probabilistic nature makes AI great for creative tasks. What is creativity but the ability to explore beyond the common possibilities, to think outside the box?

However, this probabilistic nature also causes inconsistency and hallucinations. It’s fatal for tasks that depend on factuality. Recently, I went over 3 months’ worth of customer support requests of an AI startup I advise and found that ⅕ of the questions are because users don’t understand or don’t know how to work with this probabilistic nature.

To understand why AI’s responses are probabilistic, we need to understand how models generate responses, a process known as sampling (or decoding). This post consists of 3 parts.

  1. Sampling: sampling strategies and sampling variables including temperature, top-k, and top-p.
  2. Test time sampling: sampling multiple outputs to help improve a model’s performance.
  3. Structured outputs: how to get models to generate outputs in a certain format.

Table of contents

…. Temperature
…. Top-k
…. Top-p
…. Stopping condition
Test Time Sampling
Structured Outputs
…. How to generate structured outputs
…. Constraint sampling


Given an input, a neural network produces an output by first computing the probabilities of all possible values. For a classifier, possible values are the available classes. For example, if a model is trained to classify whether an email is spam, there are only two possible values: spam and not spam. The model computes the probability of each of these two values, say being spam is 90% and not spam is 10%.

To generate the next token, a language model first computes the probability distribution over all tokens in the vocabulary.

Sampling the next token based on token probabilities

For the spam email classification task, it’s okay to output the value with the highest probability. If the email has a 90% chance of being spam, you classify the email as spam. However, for a language model, always picking the most likely token, greedy sampling, creates boring outputs. Imagine a model that, for whichever question you ask, always responds with the most common words.

Instead of always picking the next most likely token, we can sample the next token according to the probability distribution over all possible values. Given the context of My favorite color is ..., if red has a 30% chance of being the next token and green has a 50% chance, red will be picked 30% of the time, and “green” 50% of the time.


One problem with sampling the next token according to the probability distribution is that the model can be less creative. In the previous example, common words for colors like red, green, purple, etc. have the highest probabilities. The language model’s answer ends up sounding like that of a five-year-old: My favorite color is green. Because the has a low probability, the model has a low chance of generating a creative sentence such as My favorite color is the color of a still lake on a spring morning.

Temperature is a technique used to redistribute the probabilities of the possible values. Intuitively, it reduces the probabilities of common tokens, and as a result, increases the probabilities of rarer tokens. This enables models to create more creative responses.

To understand how temperature works, let’s take a step back to see how a model computes the probabilities. Given an input, a neural network processes this input and outputs a logit vector. Each logit corresponds to one possible. In the case of a language model, each logit corresponds to one token in the model’s vocabulary. The logit vector size is the size of the vocabulary.

Sampling the next token based on token probabilities

While larger logits correspond to higher probabilities, the logits don’t represent the probabilities. Logits don’t sum up to one. Logits can even be negative, while probabilities have to be non-negative. To convert logits to probabilities, a softmax layer is often used. Let’s say the model has a vocabulary of N and the logit vector is \([x_1, x_2, ..., x_N]\). The probability for the \(i^{th}\) token, \(p_i\), is computed as follows:

\[p_i = \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Temperature is a constant used to adjust the logits before the softmax transformation. Logits are divided by temperature. For a given temperature of \(T\), the adjusted logit for the \(i^{th}\) token is \(\frac{x_i}{T}\). Softmax is then applied on this adjusted logit instead of on \(x_i\).

Let’s walk through a simple example to understand the effect of temperature on probabilities. Imagine that we have a model that has only two possible outputs: A and B. The logits computed from the last layer are [1, 3]. The logit for A is 1 and B is 3.

The higher the temperature, the less likely the model is going to pick the most obvious value (the value with the highest logit), making the model’s outputs more creative but potentially less coherent. The lower the temperature, the more likely the model is going to pick the most obvious value, making the model’s out more consistent but potentially more boring.

The graph below shows the softmax probability for token B at different temperatures. As the temperature gets closer to 0, the probability that the model picks token B becomes closer to 1. In our example, for temperature below 0.1, the model almost always outputs B. Model providers typically limit temperature to be between 0 and 2. If you own your model, you can use any non-negative temperature. A temperature of 0.7 is often recommended for creative use cases, as it balances creativity and determinism, but you should experiment and find the temperature that works best for you.

Sampling the next token based on token probabilities using temperature

It’s common practice to set the temperature to 0 for the model’s outputs to be more consistent. Technically, temperature can never be 0 – logits can’t be divided by 0. In practice, when we set the temperature to 0, the model just picks the token with the value with the largest logit, e.g. performing an argmax, without doing the logit adjustment and softmax calculation.

A common debugging technique when working with an AI model is looking at the probabilities this model computes for given inputs. For example, if the probabilities look random, the model hasn’t learned much. OpenAI returns probabilities generated by their models as logprobs. Logprobs, short for log probabilities, are probabilities in the log scale. Log scale is preferred when working with a neural network’s probabilities because it helps reduce the underflow problem. A language model can work with a vocabulary size of 100,000, which means the probabilities for many of the tokens can be too small to be represented by a machine. The small numbers might be rounded down to 0. Log scale helps reduce this problem.

Sampling the next token based on token probabilities using logprobs


Top-k is a sampling strategy to reduce the computation workload without sacrificing too much of the model’s response diversity. Recall that to compute the probability distribution over all possible values, a softmax layer is used. Softmax requires two passes over all possible values: one to perform the exponential sum \(\sum_j e^{x_j}\) and one to perform \(\frac{e^{x_i}}{\sum_j e^{x_j}}\) for each value. For a language model with a large vocabulary, this process is computationally expensive.

To avoid this problem, after the model has computed the logits, we pick the top k logits and perform softmax over these top k logits only. Depending on how diverse you want your application to be, k can be anywhere from 50 to 500, much smaller than a model’s vocabulary size. The model then samples from these top values. A smaller k value makes the text more predictable but less interesting, as the model is limited to a smaller set of likely words.


In top-k sampling, the number of values considered is fixed to k. However, this number should change depending on the situation. For example, given the prompt Do you like music? Answer with only yes or no., the number of values considered should be two: yes and no. Given the prompt What's the meaning of life?, the number of values considered should be much larger.

Top-p, also known as nucleus sampling, allows for a more dynamic selection of values to be sampled from. In top-p sampling, the model sums the probabilities of the most likely next values in descending order and stops when the sum reaches p. Only the values within this cumulative probability are considered. Common values for top-p (nucleus) sampling in language models typically range from 0.9 to 0.95. A top-p value of 0.9, for example, means that the model will consider the smallest set of values whose cumulative probability exceeds 90%.

Let’s say the probabilities of all tokens are as shown in the image below. If top_p = 90%, only yes and maybe will be considered, as their cumulative probability is greater than 90%. If top_p = 99%, then yes, maybe, and no are considered.

Sampling the next token based on token probabilities with top-p

Unlike top-k, top-p doesn’t necessarily reduce the softmax computation load. Its benefit is that because it focuses on only the set of most relevant values for each context, it allows outputs to be more contextually appropriate. In theory, there doesn’t seem to be a lot of benefits to top-p sampling. However, in practice, top-p has proven to work well, causing its popularity to rise.

Stopping condition

An autoregressive language model generates sequences of tokens by generating one token after another. A long output sequence takes more time, costs more compute (money), and can sometimes be annoying to users. We might want to set a condition for the model to stop the sequence.

One easy method is to ask models to stop generating after a fixed number of tokens. The downside is that the output is likely to be cut off mid-sentence. Another method is to use stop tokens. For example, you can ask models to stop generating when it encounters “<EOS>”. Stopping conditions are helpful to keep the latency and cost down.

Test Time Sampling

One simple way to improve a model’s performance is to generate multiple outputs and select the best one. This approach is called test time sampling or test time compute. I find “test time compute” confusing, as it can be interpreted as the amount of compute needed to run tests.

You can either show users multiple outputs and let them choose the one that works best for them or devise a method to select the best one. If you want your model’s responses to be consistent, you want to keep all sampling variables fixed. However, if you want to generate multiple outputs and pick the best one, you don’t want to vary your sampling variables.

One selection method is to pick the output with the highest probability. A language model’s output is a sequence of tokens, each token has a probability computed by the model. The probability of an output is the product of the probabilities of all tokens in the output.

Consider the sequence of tokens [I, love, food] and:

The sequence’s probability is then: 0.2 * 0.1 * 0.3 = 0.006.

Mathematically, this can be denoted as follows:

\[p(\text{I love food}) = p(\text{I}) \times p(\text{love}|\text{I}) \times p(\text{food}|\text{I, love})\]

Remember that it’s easier to work with probabilities on a log scale. The logarithm of a product is equal to a sum of logarithms, so the logprob of a sequence of tokens is the sum of the logprob of all tokens in the sequence.

\[\text{logprob}(\text{I love food}) = \text{logprob}(\text{I}) + \text{logprob}(\text{love}|\text{I}) + \text{logprob}(\text{food}|\text{I, love})\]

With summing, longer sequences are likely to have to lower total logprob (log(1) = 0, and log of all positive values less than 1 is negative). To avoid biasing towards short sequences, we use the average logprob by dividing the sum by its sequence length. After sampling multiple outputs, we pick the one with the highest average logprob. As of writing, this is what OpenAI API uses. You can set the parameter best_of to a specific value, say 10, to ask OpenAI models to return the output with the highest average logprob out of 10 different outputs.

Another method is to use a reward model to score each output, as discussed in the previous section. Recall that both Stitch Fix and Grab pick the outputs given high scores by their reward models or verifiers. OpenAI also trained verifiers to help their models pick the best solutions to math problems (Cobbe et al., 2021). They found that sampling more outputs led to better performance, but only up to a certain point. In their experiment, this point is 400 outputs. Beyond this point, performance starts to decrease, as shown below. They hypothesized that as the number of sampled outputs increases, the chance of finding adversarial outputs that can fool the verifiers also increases. While this is an interesting experiment, I don’t believe anyone in production samples 400 different outputs for each input. The cost would be astronomical.

Sampling the next token based on token probabilities

You can also choose heuristics based on the needs of your application. For example, if your application benefits from shorter responses, you can pick the shortest one. If your application is to convert from natural language to SQL queries, you can pick the valid SQL query that is the most efficient.

Sampling multiple outputs can be useful for tasks that expect exact answers. For example, given a math problem, the model can solve it multiple times and pick the most frequent answer as its final solution. Similarly, for a multiple-choice question, a model can pick the most frequently output option. This is what Google did when evaluating their model Gemini on MMLU, a benchmark of multiple-choice questions. They sampled 32 outputs for each question. While this helped Gemini achieve a high score on this benchmark, it’s unclear whether their model is better than another model that gets a lower score by only generating one output for each question.

The more fickle a model is, the more we can benefit from sampling multiple outputs. The optimal thing to do with a fickle model, however, is to swap it out for another. For one project, we used AI to extract certain information from an image of the product. We found that for the same image, our model could read the information only half of the time. For the other half, the model said that the image was too blurry or the text was too small to read. For each image, we queried the model at most three times, until it could extract the information.

While we can usually expect some model performance improvement by sampling multiple outputs, it’s expensive. On average, generating two outputs costs approximately twice as much as generating one.

Structured Outputs

Oftentimes, in production, we need models to generate text following certain formats. Having structured outputs is essential for the following two scenarios.

  1. Tasks whose outputs need to follow certain grammar. For example, for text-to-SQL or text-to-regex, outputs have to be valid SQL queries and regexes. For classification, outputs have to be valid classes.
  2. Tasks whose outputs are then parsed by downstream applications. For example, if you use an AI model to write product descriptions, you want to extract only the product descriptions without buffer texts like “Here’s the description” or “As a language model, I can’t …”. Ideally, for this scenario, models should generate structured outputs, such as JSON with specific keys, that can be parseable.

OpenAI was the first model provider to introduce JSON mode in their text generation API. Note that their JSON mode guarantees only that the outputs are valid JSON, not what’s inside the JSON. As of writing, OpenAI’s JSON mode doesn’t yet work for vision models, but I’m sure it’ll just be a matter of time.

The generated JSONs can also be truncated due to the model’s stopping condition, such as when it reaches the maximum output token length. If the max token length is set too short, the output JSONs can be truncated and hence not parseable. If it’s set too long, the model’s responses become both too slow and expensive.

Independent tools like guidance and outlines let you structure the outputs of certain models. Here are two examples of using guidance to generate outputs constrained to a set of options and a regex.

Sampling structured outputs

How to generate structured outputs

You can guide a model to generate constrained outputs at different layers of the AI stack: during prompting, sampling, and finetuning. Prompting is currently the easiest but least effective method. You can instruct a model to output valid JSON following a specific schema. However, there’s no guarantee that the model will always follow this instruction.

Finetuning is currently the go-to approach to get models to generate outputs in the style and format that you want. You can do finetuning with or without changing the model’s architecture. For example, you can finetune a model on examples with the output format you want. While this still doesn’t guarantee the model will always output the expected format, this is much more reliable than prompting. It also has the added benefit of reducing inference costs, assuming that you no longer have to include instructions and examples of the desirable format in your prompt.

For certain tasks, you can guarantee the output format with finetuning by modifying the model’s architecture. For example, for classification, you can append a classifier head to the foundation model’s architecture to make sure that the model only outputs one of the pre-specified classes. During finetuing, you can retrain the entire architecture or only this classifier head.

Sampling the next token based on token probabilities

Both sampling and finetuning techniques are needed because of the assumption that the model, by itself, isn’t capable of doing it. As models become more powerful, we can expect them to get better at following instructions. I suspect that in the future, it’ll be easier to get models to output exactly what we need with minimal prompting, and these techniques will become less important.

Constraint sampling

Constraint sampling is a technique used to guide the generation of text towards certain constraints. The simplest but expensive way to do so is to keep on generating outputs until you find one that fits your constraints, as discussed in the section Test Time Sampling.

Constraint sampling can also be done during token sampling. I wasn’t able to find a lot of literature on how companies today are doing it. What’s written below is from my understanding, which can be wrong, so feedback and pointers are welcome!

At a high level, to generate a token, the model samples among values that meet the constraints. Recall that to generate a token, your model first outputs a logit vector, each logit corresponds to one possible value. With constrained sampling, we filter this logit vector to keep only the values that meet our constraints. Then we sample from these valid values.

Sampling the next token based on token probabilities

In the above example, the constraint is straightforward to filter for. However, in most cases, it’s not that straightforward. We need to have a grammar that specifies what is and isn’t allowed at each step. For example, JSON grammar dictates that after {, we can’t have another { unless it’s part of a string, as in {"key": ""}.

Building out that grammar and incorporating that grammar into the sampling process is non-trivial. We’d need a separate grammar for every output format we want: JSON, regex, CSV, etc. Some are against constrained sampling because they believe the resources needed for constrained sampling are better invested in training models to become better at following instructions.


I believe understanding how an AI model samples its outputs is essential for anyone who wishes to leverage AI to solve their problems. Probability is magical but can also be confusing. Writing this post has been a lot of fun as it gave me a chance to dig deeper into many concepts that I’ve been curious about for a long time.

As always, feedback is much appreciated. Thanks Han Lee and Luke Metz for graciously agreeing to be my first readers.

Multimodality and Large Multimodal Models (LMMs)

2023-10-10 08:00:00

For a long time, each ML model operated in one data mode – text (translation, language modeling), image (object detection, image classification), or audio (speech recognition).

However, natural intelligence is not limited to just a single modality. Humans can read, talk, and see. We listen to music to relax and watch out for strange noises to detect danger. Being able to work with multimodal data is essential for us or any AI to operate in the real world.

OpenAI noted in their GPT-4V system card that “incorporating additional modalities (such as image inputs) into LLMs is viewed by some as a key frontier in AI research and development.”

Incorporating additional modalities to LLMs (Large Language Models) creates LMMs (Large Multimodal Models). Not all multimodal systems are LMMs. For example, text-to-image models like Midjourney, Stable Diffusion, and Dall-E are multimodal but don’t have a language model component. Multimodal can mean one or more of the following:

  1. Input and output are of different modalities (e.g. text-to-image, image-to-text)
  2. Inputs are multimodal (e.g. a system that can process both text and images)
  3. Outputs are multimodal (e.g. a system that can generate both text and images)

This post covers multimodal systems in general, including LMMs. It consists of 3 parts.

The post is long. Feel free to skip to the sections most interesting to you.

⚠ Ambiguous terminology ⚠
Multimodal data can also refer to multimodal distributions, e.g. bimodal distribution, which is different from multimodal data in this post.

Table of contents
Part 1. Understanding Multimodal
…. Why multimodal
…. Data modalities
…. Multimodal tasks
…….. Generation
…….. Vision-language understanding
Part 2. Fundamentals of Multimodal Training
…. CLIP: Contrastive Language-Image Pre-training
…….. CLIP’s high-level architecture
…….. Natural language supervision
…….. Contrastive learning
…….. CLIP applications
…. Flamingo: the dawns of LMMs
…….. Flamingo’s high-level architecture
…….. Data
…….. Flamingo’s vision encoder
…….. Flamingo’s language model
…. TL;DR: CLIP vs. Flamingo
Part 3. Research Directions for LMMs
…. Incorporating more data modalities
…. Multimodal systems for instruction-following
…. Adapters for more efficient multimodal training
…. Generating multimodal outputs

Part 1. Understanding Multimodal

Why multimodal

Many use cases are impossible without multimodality, especially those in industries that deal with a mixture of data modalities such as healthcare, robotics, e-commerce, retail, gaming, etc.

Multimodal AI in healthcare
An example of how multimodality can be used in healthcare. Image from Multimodal biomedical AI (Acosta et al., Nature Medicine 2022)

Not only that, incorporating data from other modalities can help boost model performance. Shouldn’t a model that can learn from both text and images perform better than a model that can learn from only text or only image?

Multimodal systems can provide a more flexible interface, allowing you to interact with them in whichever way works best for you at the moment. Imagine you can ask a question by typing, talking, or just pointing your camera at something.

One use case that I’m especially excited about, is that multimodality can also enable visually impaired people to browse the Internet and also navigate the real world.

Some cool multimodal use cases from GPT-4V
Some cool multimodal use cases from GPT-4V

Data modalities

Different data modes are text, image, audio, tabular data, etc. One data mode can be represented or approximated in another data mode. For example:

How about other data modalities?

All digital data formats can be represented using bitstrings (strings of 0 and 1) or bytestrings. A model that can effectively learn from bitstrings or bytestrings will be very powerful, and it can learn from any data mode.

There are other data modalities we haven’t touched on, such as graphs and 3D assets. We also haven’t touched on the formats used to represent smell and touch (haptics).

In ML today, audio is still largely treated as a voice-based alternative to text. The most common use cases for audio are still speech recognition (speech-to-text) and speech synthesis (text-to-speech). Non-speech audio use cases, e.g. music generation, are still pretty niche. See the fake Drake & Weeknd song and MusicGen model on HuggingFace.

Image is perhaps the most versatile format for model inputs, as it can be used to represent text, tabular data, audio, and to some extent, videos. There’s also so much more visual data than text data. We have phones/webcams that constantly take pictures and videos today.

Text is a much more powerful mode for model outputs. A model that can generate images can only be used for image generation, whereas a model that can generate text can be used for many tasks: summarization, translation, reasoning, question answering, etc.

For simplicity, we’ll focus on 2 modalities: images and text. The learnings can be somewhat generalized to other modalities.

Multimodal tasks

To understand multimodal systems, it’s helpful to look at the tasks they are built to solve. In literature, I commonly see vision-language tasks divided into two groups: generation and vision-language understanding (VLU), which is the umbrella term for all tasks that don’t require generation. The line between these two groups is blurred, as being able to generate answers requires understanding too.


For generative tasks, the output can be unimodal (e.g. text, image, 3D rendering) or multimodal. While unimodal outputs are common today, multimodal outputs are still shaping up. We’ll discuss multimodal outputs at the end of this post.

Image generation (text-to-image synthesis)

This category is straightforward. Examples: Dall-E, Stable Diffusion, and Midjourney.

Text generation

A common text generation task is visual question answering. Instead of relying only on text for the context, you can give the model both text and images. Imagine you can point your camera to anything and ask questions like: “My car won’t start. What’s wrong with it?”, “How to make this dish?”, or “What is this meme about?”.

Another common use case is image captioning, which can be used as part of a text-based image retrieval system. An organization might have millions, if not billions, of images: product images, graphs, designs, team pictures, promotional materials, etc. AI can automatically generate captions and metadata for them, making it easier to find the exact images you want.

Vision-language understanding

We’ll zoom into two task types: classification and text-based image retrieval (TBIR).


Classification models can only generate outputs that belong to a pre-determined list of classes. This works when you only care about a fixed number of outcomes. For example, an OCR system only needs to predict if a visual is one of the known characters (e.g. a digit or a letter).

Side note: An OCR system processes data at the character level. When used together with a system that can understand the broader context, it can improve use cases such as allowing you to “talk” to any textbook, contract, assembly instructions, etc.

Document processing with GPT-4V
Document processing with GPT-4V. The model's mistake is highlighted in red.

One related task to classification is image-to-text retrieval: given an image and a pool of pre-defined texts, find the text that’s most likely to accompany the image. This can be helpful for product image search, i.e. retrieving product reviews from a picture.

Image search matters not only for search engines but also for enterprises to be able to search through all their internal images and documents. Some people call text-based image retrieval “text-to-image retrieval”.

There are several approaches to text-based image retrieval. Two of them are:

  1. Generate captions and metadata for each image, either manually or automatically (see image captioning in Text generation). Given a text query, find images whose captions/metadata are closest to this text query.
  2. Train a joint embedding space for both images and text. Given a text query, generate an embedding for this query, and find all images whose embeddings are closest to this embedding.

The second approach is more flexible, and I believe will be more widely used. This approach requires having a strong joint embedding space for both vision and language, like the one that OpenAI’s CLIP developed.

Part 2. Fundamentals of Multimodal Training

Given the existence of so many amazing multimodal systems, a challenge of writing this post is choosing which systems to focus on. In the end, I decided to focus on two models: CLIP (2021) and Flamingo (2022) both for their significance as well as availability and clarity of public details.

Even though these two models are older, many techniques they use are still relevant today. I hope they serve as the foundation to understanding newer models. The multimodal space is evolving repaidly, with many new ideas being developed. We’ll go over these newer models in Part 3.

At a high level, a multimodal system consists of the following components:

  1. An encoder for each data modality to generate the embeddings for data of that modality.
  2. A way to align embeddings of different modalities into the same multimodal embedding space.
  3. [Generative models only] A language model to generate text responses. Since inputs can contain both text and visuals, new techniques need to be developed to allow the language model to condition its responses on not just text, but also visuals.

Ideally, as many of these components should be pretrained and reusable as possible.

CLIP: Contrastive Language-Image Pre-training

CLIP’s key contribution is its ability to map data of different modalities, text and images, into a shared embedding space. This shared multimodal embedding space makes text-to-image and image-to-text tasks so much easier.

Training this multimodal embedding space also produced a strong image encoder, which allows CLIP to achieve competitive zero-shot performance on many image classification tasks. This strong image encoder can be used for many other tasks: image generation, visual question answering, and text-based image retrieval. Flamingo and LLaVa use CLIP as their image encoder. DALL-E uses CLIP to rerank generated images. It’s unclear if GPT-4V uses CLIP.

Zero-shot image classification with CLIP
Zero-shot image classification with CLIP

CLIP leveraged natural language supervision and contrastive learning, which allowed CLIP to both scale up their data and make training more efficient. We’ll go over why/how these two techniques work.

CLIP's high-level architecture

Architecture of OpenAI's CLIP
CLIP's architecture. Both encoders and projection matrices are jointly trained together from scratch. The training goal is to maximize the similarity scores of the right (image, text) pairings while minimizing the similarity scores of the wrong pairings (contrastive learning).

For the image encoder, the authors experimented with both ResNet and ViT. Their best-performing model is ViT-L/14@336px:

For the text encoder, CLIP uses a Transformer model similar to GPT-2 but smaller. Their base model has only 63M parameters with 8 attention heads. The authors found CLIP’s performance to be less sensitive to the capacity of the text encoder.

Embeddings generated by the image encoder and text encoder are projected into the same embedding space using two projection matrices \(W_v\) and \(W_l\).

When people say CLIP embeddings, they either refer to these multimodal embeddings or the embeddings generated by CLIP’s image encoder.

Natural language supervision

For many years, image models were trained with manually annotated (image, text) datasets (e.g. ImageNet, MS COCO). This isn’t scalable. Manual annotation is time-consuming and expensive.

The CLIP paper noted that none of the then-available (image, text) datasets was big and high quality enough. They created their own dataset – 400M (image, text) pairs – as follows.

  1. Construct a list of 500,000 queries. Queries are common words, bigrams, and titles of popular Wikipedia articles.
  2. Find images matching these queries (string and substring match). The paper mentioned this search did NOT happen on search engines but didn’t specify where. My theory is that since OpenAI already scraped the entire Internet for their GPT models, they probably just queried their internal database.
  3. Each image is paired with a text that co-occurs with it (e.g. captions, comments) instead of the query since queries are too short to be descriptive.

Because some queries are more popular than others, to avoid data imbalance, they used at most 20K images for a query.

Contrastive learning

Pre-CLIP, most vision-language models were trained using a classifier or language model objectives. Contrastive objective is a clever technique that allows CLIP to scale and generalize to multiple tasks.

We’ll show why the constrastive objective works better for CLIP using an example task of image captioning: given an image, generate a text that describes it.

Classifier objective

A classifier predicts the correct class among a predetermined list of classes. This works when the output space is finite. Previous models that work with (image, text) pair datasets all had this limitation. For example, models working with ILSVRC-2012 limited themselves to 1,000 classes, and JFT-300M to 18,291 classes.

This objective limits not only the model’s capacity to output meaningful responses but also its capacity for zero-shot learning. Say, if the model was trained to predict among 10 classes, it won’t work for a task that has 100 classes.

Language model objective

If a classifier outputs only one class for each input, a language model outputs a sequence of classes. Each generated class is called a token. Each token is from a predetermined list, the vocabulary, of the language model.

Classifier vs. language model objectives
Classifier vs. language model objectives

Contrastive objective

While the language model objective allows for vastly more flexible outputs, CLIP authors noted this objective made the training difficult. They hypothesized that this is because the model tries to generate exactly the text accompanying each image, while many possible texts can accompany an image: alt-text, caption, comments, etc.

For example, in the Flickr30K dataset, each image has 5 captions provided by human annotators, and the captions for the same image can be very different.

Multiple captions for the same image

Contrastive learning is to overcome this challenge. Instead of predicting the exact text of each image, CLIP was trained to predict whether a text is more likely to accompany an image than other texts.

For each batch of \(N\) (image, text) pairs, the model generates N text embeddings and N image embeddings.

CLIP computes the cosine similarity scores of the \(N^2\) possible (\(V_i, L_j\)) pairings. The model is trained to maximize the similarity scores of the \(N\) correct pairings while minimizing the scores of the \(N^2 - N\) incorrect pairings. For CLIP, \(N = 32,768\).

How CLIP works

Another way to look at this is that each training batch of CLIP is two classification tasks.

  1. Each image can be paired with N possible texts, and the model tries to predict the correct one. This is the same setup as image-to-text retrieval.

    \[L_{\text{contrastive:txt2im}} = -\frac{1}{N}\sum_i^N\log(\frac{\exp(L_i^TV_i\beta)}{\sum_j^N\exp(L_i^TV_j\beta)})\]
  2. Each text can be paired with N possible images, and the model tries to predict the correct image. This is the same setup as text-to-image retrieval.

    \[L_{\text{contrastive:im2txt}} = -\frac{1}{N}\sum_i^N\log(\frac{\exp(V_i^TL_i\beta)}{\sum_j^N\exp(V_i^TL_j\beta)})\]

The sum of these two losses is minimized. 𝛽 is a trainable inverse temperature parameter.

This is what it all looks like in pseudocode.

CLIP pseudocode

CLIP authors found that the contrastive objective provided a 12x improvement in efficiency compared to the language model objective baseline while producing higher-quality image embeddings.

CLIP constrastive learning

CLIP applications


Today, for many image classification tasks, CLIP is still a strong out-of-the-box baseline to be used as-is or fine-tuned.

CLIP as a strong baseline for image classification

Text-based image retrieval

Since CLIP’s training process was conceptually similar to image-to-text retrieval and text-to-image retrieval, CLIP “displays significant promise for widely-applicable tasks like image retrieval or search.” However, “on image retrieval, CLIP’s performance relative to the overall state of the art is noticeably lower.

There are attempts to use CLIP for image retrieval. For example, clip-retrieval package works as follows:

  1. Generate CLIP embeddings for all your images and store them in a vector database.
  2. For each text query, generate a CLIP embedding for this text.
  3. Query in the vector database for all images whose embeddings are close to this text query embedding.

Image generation

CLIP’s joint image-text embeddings are useful for image generation. Given a text prompt, DALL-E (2021) generates many different visuals and uses CLIP to rerank these visuals before showing the top visuals to users.

In 2022, OpenAI introduced unCLIP, a text-to-image synthesis model conditioned on CLIP latents. It consists of two main components:

  1. CLIP is trained and frozen. The pretrained CLIP model can generate embeddings for both text and images in the same embedding space.
  2. Two things happen at image generation:
    • Use CLIP to generate embedding for this text.
    • Use a diffusion decoder to generate images conditioned on this embedding.

Text generation: visual question answering, captioning

CLIP authors did attempt to create a model for text generation. One version they experimented with is called LM RN50. Though this model could generate text responses, its performance was consistently around 10% below CLIP’s best-performing model on all the vision-language understanding tasks that CLIP was evaluated on.

While today CLIP isn’t used directly for text generation, its image encoder is often the backbone for LMMs that can generate texts.

Flamingo: the dawns of LMMs

Unlike CLIP, Flamingo can generate text responses. In a reductive view, Flamingo is CLIP + a language model, with added techniques to make it possible for the language model to generate text tokens conditioned on both visual and text inputs.

Conversations with Flamingo LMMs
Flamingo can generate text responses conditioned on both text and images

Flamingo's high-level architecture

At a high level, Flamingo consists of 2 parts:

  1. Vision encoder: a CLIP-like model is trained using contrastive learning. The text encoder of this model is then discarded. The vision encoder is frozen to be used in the main model.
  2. Language model: Flamingo finetunes Chinchilla to generate text tokens, conditioned on visuals and text, using language model loss, with two additional components Perceiver Resampler and GATED XATTN-DENSE layers. We’ll discuss them later in this blog.
Flamingo high level architecture


Flamingo used 4 datasets: 2 (image, text) pair datasets, 1 (video, text) pair dataset, and 1 interleaved image and text dataset.

Flamingo's 4 datasets

Dataset Type Size How Training weight
M3W Interleaved image and text dataset 43M webpages For each webpage, they sample a random subsequence of 256 tokens and take up to the first 5 images included in the sampled sequence. 1.0
ALIGN (Image, text) pairs 1.8B pairs Texts are alt-texts, averaging 12 tokens/text. 0.2
LTIP (Image, text) pairs 312M pairs Texts are long descriptions, averaging 20.5 tokens/text. 0.2
VTP (Video, text) pairs 27M short videos ~22 seconds/video on average 0.03

Flamingo's vision encoder

Flamingo first trains a CLIP-like model from scratch using contrastive learning. This component only uses the 2 (image, text) pair datasets, ALIGN and LTIP, totaling 2.1M (image, text) pairs. This is 5x larger than the dataset CLIP was trained on.

Flamingo's language model

Flamingo uses Chinchilla as their language model. More specifically, they freeze the 9 pretrained Chinchilla LM layers. A traditional language model predicts the next text token based on the preceding text tokens. Flamingo predicts the next text token based on both the preceding text and visual tokens.

Flamingo's 4 datasets
Next token generation is conditioned on both text and visual tokens. Illustration taken from Chunyuan Li's CVPR 2023 tutorial: Large Multimodal Models.

To be able to generate text conditioned on both text and visual inputs, Flamingo relied on Perceiver Resampler and GATED XATTN-DENSE layers.

Perceiver Resampler

As the visual inputs can be both images and videos, the vision encoder can produce a variable number of image or video features. Perceiver Resampler converts these variable features into a consistent 64 visual outputs.

Interestingly enough, while training the vision encoder, the resolution used was 288 x 288. However, at this phase, visual inputs are resized to 320 × 320. It’s been shown that a higher test-time resolution can lead to improved performance when using CNNs.

Flamingo's Perceiver Resampler


GATED XATTN-DENSE layers are inserted between existing and frozen LM layers to allow the language model to attend more efficiently to the visual tokens when generating text tokens. Without these layers, Flamingo authors noted a drop of 4.2% in the overall score.

Flamingo's GATED ATTN-DENSE layers

Loss function

Flamingo computes the likelihood of text \(y\) conditioned on the interleaved images and videos \(x\).

\[p(y|x) = \prod_{l=1}^N p(y_l|y_{<l}, x_{\leq l})\]

The training loss function was a weighted sum of expected negative log-likelihoods of generated text across all 4 datasets, with \(\lambda_m\) being the training weight of dataset \(m\).

\[\sum_{m=1}^M \lambda_m E_{(x, y)\sim D_m} [ -\sum_{l=1}^L \log p(y|x)]\]


While the Chinchilla LM layers are finetuned and frozen, the additional components are trained from scratch, using all 4 Flamingo datasets, with different weights. Finding the right per-dataset weights was key to performance. The weight for each dataset is in the Training weight column in the dataset table above.

VTP’s weight is much smaller than other datasets (0.03 compared to 0.2 and 1), so its contribution to the training should be minimal. However, the authors noted that removing this dataset negatively affects performance on all video tasks.

While Flamingo isn’t open-sourced, there are many open-source replications of Flamingo.

TL;DR: CLIP vs. Flamingo

Flamingo's 4 datasets

Part 3. Research Directions for LMMs

CLIP is 3 years old and Flamingo is almost 2. While their architectures serve as a good foundation for us to understand how LMMs are built, there have been many new progresses in the space.

Here are a few directions that I’m excited about. This is far from an exhaustive list, both because this post has been long and because I’m still learning about the space too. If you have any pointers or suggestions, please let me know!

Incorporating more data modalities

Today, most multimodal systems work with text and images. It’s only a matter of time before we need systems that can incorporate other modalities such as videos, music, and 3D. Wouldn’t it be amazing to have one shared embedding space for ALL data modalities?

Examples of works in this space:


Multimodal systems for instruction-following

Flamingo was trained for completion, but not for dialogue or for following instructions. (If you’re not familiar with completion vs. dialogue, check out my post on RLHF). Many people are working on building LMMs that can follow instructions and have conversations, such as:

Examples of LaVIN's outputs compared to other LMMs, shown in LaVIN's paper

Adapters for more efficient multimodal training

While Flamingo used 9 pretrained and frozen layers from Chinchilla, it had to pretrain its vision encoder, Perceiver Resampler, and GATED XATTN-DENSE layers from scratch. These train-from-scratch modules could be compute-intensive. Many works focus on more efficient ways to bootstrap multimodal systems using less training from scratch.

Some works are quite promising. BLIP-2, for example, outperformed Flamingo-80B by 8.7% on zero-shot VQA-v2 with 54x fewer trainable parameters.

Works in this space include:

The two images below are from Chunyuan Li’s Large Multimodal Models tutorial at CVPR 2023, which is, btw, an excellent tutorial.

Adapters for LMMs

Adapters for LMMs

Generating multimodal outputs

While models that can process multimodal inputs are becoming the norm, multimodal output is still lagging. Many use cases require multimodal outputs. For example, if we ask ChatGPT to explain RLHF, an effective explanation might require graphs, equations, and even simple animations.

To generate multimodal outputs, a model would first need to generate a shared intermediate output. One key question is what the intermediate output would look like.

One option for intermediate output is text, which will then be used to generate/synthesize other actions.

For example, CM3 (Aghajanyan et al., 2022) outputs HTML markup which can be compiled into webpages that contain not only text but also formattings, links, and images. GPT-4V generates Latex code, which can then be reconstructed as data tables.

Sampled outputs from CM3

GPT-4V generating LaTeX
GPT-4V generates Latex code, which can then be reconstructed as a data table

Another option for intermediate output would be multimodal tokens. This is the option that Caiming Xiong, whose team at Salesforce has done a lot of awesome work on multimodality, told me. Each token will have a tag to denote whether it’s a text token or an image token. Image tokens will then be input into an image model like Diffusion to generate images. Text tokens will then be input into a language model.

Generating Images with Multimodal Language Models (Koh et al., Jun 2023) is an awesome paper that shows how LMMs can generate and retrieve images together with generating texts. See below.

LMMs generating text and images


It’s been a lot of fun going over so many multimodal papers as well as talking to people doing awesome work and trying to summarize the key patterns in one blog post. There’s so much about multimodality that I’m sure there are many things that I’ve missed, but I hope that this post provides the core patterns that will help you develop multimodal systems and apply them to your work.

As you see in part 3 of this post, we’re still in the early days of multimodal systems (so early that a friend told me he’s not sure if the LMM abbreviation would catch on). Yes, in most of my conversations, there’s little doubt that multimodal systems in general, and LMMs in particular, will be even more impactful than large language models. However, keep in mind that LMMs do not make LLMs obsolete. As LMMs extend upon LLMs, the performance of an LMM relies on the performance of its base LLM. Many labs that work on multimodal systems work on LLMs in parallel.

Early reviewers

I’d like to thank the amazing early reviewers who gave me plenty of pointers and suggestions to make this post better: Han-chung Lee, Sam Reiswig, and Luke Metz.



An incomplete list of multimodal systems by time to give you a sense of how fast the space is moving!

Other resources

Open challenges in LLM research

2023-08-16 08:00:00

[LinkedIn discussion, Twitter thread]

Never before in my life had I seen so many smart people working on the same goal: making LLMs better. After talking to many people working in both industry and academia, I noticed the 10 major research directions that emerged. The first two directions, hallucinations and context learning, are probably the most talked about today. I’m the most excited about numbers 3 (multimodality), 5 (new architecture), and 6 (GPU alternatives).

Open challenges in LLM research

1. Reduce and measure hallucinations
2. Optimize context length and context construction
3. Incorporate other data modalities
4. Make LLMs faster and cheaper
5. Design a new model architecture
6. Develop GPU alternatives
7. Make agents usable
8. Improve learning from human preference
9. Improve the efficiency of the chat interface
10. Build LLMs for non-English languages

1. Reduce and measure hallucinations

Hallucination is a heavily discussed topic already so I’ll be quick. Hallucination happens when an AI model makes stuff up. For many creative use cases, hallucination is a feature. However, for most other use cases, hallucination is a bug. I was at a panel on LLM with Dropbox, Langchain, Elastics, and Anthropic recently, and the #1 roadblock they see for companies to adopt LLMs in production is hallucination.

Mitigating hallucination and developing metrics to measure hallucination is a blossoming research topic, and I’ve seen many startups focus on this problem. There are also ad-hoc tips to reduce hallucination, such as adding more context to the prompt, chain-of-thought, self-consistency, or asking your model to be concise in its response.

To learn more about hallucination:

2. Optimize context length and context construction

A vast majority of questions require context. For example, if we ask ChatGPT: “What’s the best Vietnamese restaurant?”, the context needed would be “where” because the best Vietnamese restaurant in Vietnam would be different from the best Vietnamese in the US.

According to this cool paper SituatedQA (Zhang & Choi, 2021), a significant proportion of information-seeking questions have context-dependent answers, e.g. roughly 16.5% of the Natural Questions NQ-Open dataset. Personally, I suspect that this percentage would be even higher for enterprise use cases. For example, say a company builds a chatbot for customer support, for this chatbot to answer any customer question about any product, the context needed might be that customer’s history or that product’s information.

Because the model “learns” from the context provided to it, this process is also called context learning.

Context needed for a customer support query

Context length is especially important for RAG – Retrieval Augmented Generation (Lewis et al., 2020) – which has emerged to be the predominant pattern for LLM industry use cases. For those not yet swept away in the RAG rage, RAG works in two phases:

Phase 1: chunking (also known as indexing)

  1. Gather all the documents you want your LLM to use
  2. Divide these documents into chunks that can be fed into your LLM to generate embeddings and store these embeddings in a vector database.

Phase 2: querying

  1. When user sends a query, like “Does my insurance policy pay for this drug X”, your LLM converts this query into an embedding, let’s call it QUERY_EMBEDDING
  2. Your vector database fetches the chunks whose embeddings are the most similar to QUERY_EMBEDDING

Screenshot from Jerry Liu’s talk on LlamaIndex (2023)

Context needed for a customer support query

The longer the context length, the more chunks we can squeeze into the context. The more information the model has access to, the better its response will be, right?

Not always. How much context a model can use and how efficiently that model will use it are two different questions. In parallel with the effort to increase model context length is the effort to make the context more efficient. Some people call it “prompt engineering” or “prompt construction”. For example, a paper that has made the rounds recently is about how models are much better at understanding information at the beginning and the end of the index rather than in the middle of it – Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023).

3. Incorporate other data modalities

Multimodality, IMO, is so powerful and yet so underrated. There are many reasons for multimodality.

First, there are many use cases where multimodal data is required, especially in industries that deal with a mixture of data modalities such as healthcare, robotics, e-commerce, retail, gaming, entertainment, etc. Examples:

Second, multimodality promises a big boost in model performance. Shouldn’t a model that can understand both text and images perform better than a model that can only understand text? Text-based models require so much text that there’s a realistic concern that we’ll soon run out of Internet data to train text-based models. Once we run out of text, we’d need to leverage other data modalities.

Multimodal Flamingo's architecture
Flamingo architecture (Alayrac et al., 2022)

One use case I’m especially excited about is that multimodality can enable visually impaired people to browse the Internet and navigate the real world.

Cool multimodal work:

I’ve been working on a post on multimodality that hopefully I can share soon!

4. Make LLMs faster and cheaper

When GPT-3.5 first came out in late November 2022, many people had concerns about latency and cost of using it in production. However, latency/cost analysis has changed rapidly since then. Within half a year, the community found a way to create a model that came pretty close to GPT-3.5 in terms of performance, yet required just under 2% of GPT-3.5’s memory footprint.

My takeaway: if you create something good enough, people will figure out a way to make it fast and cheap.

Date Model # params Quantization Memory to finetune Can be trained on
Nov 2022 GPT-3.5 175B 16-bit 375GB Many, many machines
Mar 2023 Alpaca 7B 7B 16-bit 15GB Gaming desktop
May 2023 Guanaco 7B 7B 4-bit 6GB Any Macbook

Below is Guanaco 7B’s performance compared to ChatGPT GPT-3.5 and GPT-4, as reported in the Guanco paper. Caveat: in general, the performance comparison is far from perfect. LLM evaluation is very, very hard.

Guanaco 7B's performance compared to ChatGPT GPT-3.5 and GPT-4

Four years ago, when I started working on the notes that would later become the section Model Compression for the book Designing Machine Learning Systems, I wrote about four major techniques for model optimization/compression:

  1. Quantization: by far the most general model optimization method. Quantization reduces a model’s size by using fewer bits to represent its parameters, e.g. instead of using 32 bits to represent a float, use only 16 bits, or even 4 bits.
  2. Knowledge distillation: a method in which a small model (student) is trained to mimic a larger model or ensemble of models (teacher).
  3. Low-rank factorization: the key idea here is to replace high-dimensional tensors with lower-dimensional tensors to reduce the number of parameters. For example, you can decompose a 3x3 tensor into the product of a 3x1 and a 1x3 tensor, so that instead of having 9 parameters, you have only 6 parameters.
  4. Pruning

All these four techniques are still relevant and popular today. Alpaca was trained using knowledge distillation. QLoRA used a combination of low-rank factorization and quantization.

5. Design a new model architecture

Since AlexNet in 2012, we’ve seen many architectures go in and out of fashion, including LSTM, seq2seq. Compared to those, Transformer is incredibly sticky. It’s been around since 2017. It’s a big question mark how much longer this architecture will be in vogue.

Developing a new architecture to outperform Transformer isn’t easy. Transformer has been so heavily optimized over the last 6 years. This new architecture has to be performing at the scale that people care about today, on the hardware that people care about. Side note: Transformer was originally designed by Google to run fast on TPUs, and only later optimized on GPUs.

There was a lot of excitement in 2021 around S4 from Chris Ré’s lab – see Efficiently Modeling Long Sequences with Structured State Spaces (Gu et al., 2021). I’m not quite sure what happened to it. Chris Ré’s lab is still very invested in developing new architecture, most recently with their architecture Monarch Mixer (Fu et al., 2023) in collaboration with the startup Together.

Their key idea is that for the existing Transformer architecture, the complexity of attention is quadratic in sequence length and the complexity of an MLP is quadratic in model dimension. An architecture with subquadratic complexity would be more efficient.

Monarch Mixer architecture

I’m sure many other labs are working on this idea, though I’m not aware of any attempt that has been made public. If you know of any, please let me know!

6. Develop GPU alternatives

GPU has been the dominating hardware for deep learning ever since AlexNet in 2012. In fact, one commonly acknowledged reason for AlexNet’s popularity is that it was the first paper to successfully use GPUs to train neural networks. Before GPUs, if you wanted to train a model at AlexNet’s scale, you’d have to use thousands of CPUs, like the one Google released just a few months before AlexNet. Compared to thousands of CPUs, a couple of GPUs were a lot more accessible to Ph.D. students and researchers, setting off the deep learning research boom.

In the last decade, many, many companies, both big corporations, and startups, have attempted to create new hardware for AI. The most notable attempts are Google’s TPUs, Graphcore’s IPUs (what’s happening with IPUs?), and Cerebras. SambaNova raised over a billion dollars to develop new AI chips but seems to have pivoted to being a generative AI platform.

For a while, there has been a lot of anticipation around quantum computing, with key players being:

Another direction that is also super exciting is photonic chips. This is the direciton I know the least about – so please correct me if I’m wrong. Existing chips today use electricity to move data, which consumes a lot of power and also incurs latency. Photonic chips use photons to move data, harnessing the speed of light for faster and more efficient compute. Various startups in this space have raised hundreds of millions of dollars, including Lightmatter ($270M), Ayar Labs ($220M), Lightelligence ($200M+), and Luminous Computing ($115M).

Below is the timeline of advances of the three major methods in photonic matrix computation, from the paper Photonic matrix multiplication lights up photonic accelerator and beyond (Zhou et al., Nature 2022). The three different methods are plane light conversion (PLC), Mach–Zehnder interferometer (MZI), and wavelength division multiplexing (WDM).

Timeline of advances of the three major methods in photonic matrix multiplication

7. Make agents usable

Agents are LLMs that can take actions, like browsing the Internet, sending emails, making reservations, etc. Compared to other research directions in this post, this might be the youngest direction.

Because of the novelty and the massive potential, there’s a feverish obsession with agents. Auto-GPT is now the 25th most popular GitHub repo ever by the number of stars. GPT-Engineering is another popular repo.

Despite the excitement, there is still doubt about whether LLMs are reliable and performant enough to be entrusted with the power to act.

One use case that has emerged though is the use of agents for social studies, like the famous Stanford experiment that shows that a small society of generative agents produces emergent social behaviors: for example, starting with only a single user-specified notion that one agent wants to throw a Valentine’s Day party, the agents autonomously spread invitations to the party over the next two days, make new acquaintances, ask each other out on dates to the party … (Generative Agents: Interactive Simulacra of Human Behavior, Park et al., 2023)

The most notable startup in this area is perhaps Adept, founded by two Transformer co-authors (though both already left) and an ex-OpenAI VP, and has raised almost half a billion dollars to date. Last year, they had a demo showing their agent browsing the Internet and adding a new account to Salesforce. I’m looking forward to seeing their new demos 🙂

8. Improve learning from human preference

RLHF, Reinforcement Learning from Human Preference, is cool but kinda hacky. I wouldn’t be surprised if people figure out a better way to train LLMs. There are many open questions for RLHF, such as:

1. How to mathematically represent human preference?

Currently, human preference is determined by comparison: human labeler determines if response A is better than response B. However, it doesn’t take into account how much better response A is than response B.

2. What’s human preference?

Anthropic measured the quality of their model’s responses along the three axes: helpful, honest, and harmless. See Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022).

DeepMind tries to generate responses that please the most people. See Fine-tuning language models to find agreement among humans with diverse preferences, (Bakker et al., 2022).

Also, do we want AIs that can take a stand or a vanilla AI that shies away from any potentially controversial topic?

3. Whose preference is “human” preference, taking into account the differences in cultures, religions, political leanings, etc.?

There are a lot of challenges in obtaining training data that can be sufficiently representative of all the potential users.

For example, for OpenAI’s InstructGPT data, there was no labeler above 65 years old. Labelers are predominantly Filipino and Bangladeshi. See InstructGPT: Training language models to follow instructions with human feedback (Ouyang et al., 2022).

Demographics of labelers for InstructGPT

Community-led efforts, while admirable in their intention, can lead to biased data. For example, for the OpenAssistant dataset, 201 out of 222 (90.5%) respondents identify as male. Jeremy Howard has a great Twitter thread on this.

Self-reported demographics of contributors to OpenAssistant dataset

9. Improve the efficiency of the chat interface

Ever since ChatGPT, there have been multiple discussions on whether chat is a suitable interface for a wide range of tasks.

However, this is not a new discussion. In many countries, especially in Asia, chat has been used as the interface for super apps for about a decade. Dan Grover had this discussion back in 2014.

Chat has been used as the universal interface for superapps in China for over a decade
Chat as a universal interface for Chinese apps (Dan Grover, 2014)

The discussion again got tense in 2016, when many people thought apps were dead and chatbots would be the future.

Personally, I love the chat interface because of the following reasons:

  1. Chat is an interface that everyone, even people without previous exposure to computers or the Internet, can learn to use quickly. When I volunteered at a low-income residential neighborhood (are we allowed to say slum?) in Kenya in the early 2010s, I was blown away by how comfortable everyone there was with doing banking on their phone, via texts. No one in that neighborhood had a computer.
  2. Chat interface is accessible. You can use voice instead of text if your hands are busy.
  3. Chat is also an incredibly robust interface – you can give it any request and it’ll give back a response, even if the response isn’t good.

However, there are certain areas that I think the chat interface can be improved upon.

  1. Multiple messages per turn

    Currently, we pretty much assume one message per turn. This is not how my friends and I text. Often, I need multiple messages to complete my thought, because I need to insert different data (e.g. images, locations, links), I forgot something in the previous messages, or I just don’t feel like putting everything into a massive paragraph.

  2. Multimodal input

    In the realm of multimodal applications, most energy is spent on building better models, and very little on building better interfaces. Take Nvidia’s NeVA chatbot. I’m not a UX expert, but I suspect there might be room for UX improvement here.

    P.S. Sorry the NeVA team for calling you out. Even with this interface, your work is super cool!

    NVIDIA's NeVA interface

  3. Incorporating generative AI into your workflows

    Linus Lee covered this point well in his talk Generative AI interface beyond chats. For example, if you want to ask a question about a column of a chart you’re working on, you should be able just point to that column and ask a question.

  4. Editing and deletion of messages

    How would editing or deletion of a user input change the conversation flow with the chatbot?

10. Build LLMs for non-English languages

We know that current English-first LLMs don’t work well for many other languages, both in terms of performance, latency, and speed. See:

Tokenization for non-English languages

Here are some initiatives that I’m aware of. If you have pointers to others, I’d be happy to include them here.

Several early readers of this post told me they don’t think I should include this direction for two reasons.

  1. This is less of a research problem and more of a logistics problem. We already know how to do it. Someone just needs to put money and effort into it. This is not entirely true. Most languages are considered low-resource, e.g. they have far fewer high-quality data compared to English or Chinese, and might require different techniques to train a large language model. See:

  2. Those more pessimistic think that in the future, many languages will die out, and the Internet will consist of two universes in two languages: English and Mandarin. This school of thought isn’t new – anyone remembers Esperando?

The impact of AI tools, e.g. machine translation and chatbots, on language learning is still unclear. Will they help people learn new languages faster, or will they eliminate the need of learning new languages altogether?


Phew, that was a lot of papers to reference, and I have no doubt that I still missed a ton. If there’s something you think I missed, please let me know.

For another perspective, check out this comprehsive paper Challenges and Applications of Large Language Models (Kaddour et al., 2023).

Some of the problems mentioned above are harder than others. For example, I think that number 10, building LLMs for non-English languages, is more straightforward with enough time and resources.

Number 1, reducing hallucination, will be much harder, since hallucination is just LLMs doing their probabilistic thing.

Number 4, making LLMs faster and cheaper, will never be completely solved. There is already so much progress in this area, and there will be more, but we will never run out of room for improvement.

Number 5 and number 6, new architectures and new hardware, are very challenging, but they are inevitable with time. Because of the symbiosis between architecture and hardware – new architecture will need to be optimized for common hardware, and hardware will need to support common architecture – they might be solved by the same company.

Some of these problems won’t be solved using only technical knowledge. For example, number 8, improving learning from human preference, might be more of a policy problem than a technical problem. Number 9, improving the efficiency of the chat interface, is more of a UX problem. We need more people with non-technical backgrounds to work with us to solve these problems.

What research direction are you most excited about? What are the most promising solutions you see for these problems? I’d love to hear from you.

Generative AI Strategy

2023-06-07 08:00:00

I had a lot of fun preparing the talk: “Leadership needs us to do generative AI. What do we do?” for Fully Connected. The idea for the talk came from many conversations I’ve had recently with friends who need to figure out their generative AI strategy, but aren’t sure what exactly to do.

This talk is a simple framework to explore what to do with generative AI. Many ideas are still being fleshed out. I hope to convert this into a proper post when I have more time. In the meantime, I’d love to hear from your experience through this process.

I couldn’t figure out how to make the slides centered on the page. You might want to download the slides.

Thanks everyone who responded to my post and shared your thoughts on what I should include in the talk. Thanks Kyle Gallatin, Goku Mohandas, Han-chung Lee, and Jamie de Guerre for thoughtful feedback on the talk.

RLHF: Reinforcement Learning from Human Feedback

2023-05-02 08:00:00

[LinkedIn discussion, Twitter thread]

In literature discussing why ChatGPT is able to capture so much of our imagination, I often come across two narratives:

  1. Scale: throwing more data and compute at it.
  2. UX: moving from a prompt interface to a more natural chat interface.

One narrative that is often glossed over is the incredible technical creativity that went into making models like ChatGPT work. One such cool idea is RLHF (Reinforcement Learning from Human Feedback): incorporating reinforcement learning and human feedback into NLP.

RL has been notoriously difficult to work with, and therefore, mostly confined to gaming and simulated environments like Atari or MuJoCo. Just five years ago, both RL and NLP were progressing pretty much orthogonally – different stacks, different techniques, and different experimentation setups. It’s impressive to see it work in a new domain at a massive scale.

So, how exactly does RLHF work? Why does it work? This post will discuss the answers to those questions.

Table of contents
RLHF overview
Phase 1. Pretraining for completion
…. Language model
…. Mathematical formulation
…. Data bottleneck for pretraining
Phase 2. Supervised finetuning (SFT) for dialogue
…. Why SFT
…. Demonstration data
…. Mathematical formulation
Phase 3. RLHF
…. 3.1. Reward model (RM)
…….. Mathematical formulation
…….. UI to collect comparison data
…. 3.2. Finetuning using the reward model
…….. Mathematical formulation
…. RLHF and hallucination

To understand RLHF, we first need to understand the process of training a model like ChatGPT and where RLHF fits in, which is the focus of the first section of this post. The following 3 sections cover the 3 phases of ChatGPT development. For each phase, I’ll discuss the goal for that phase, the intuition for why this phase is needed, and the corresponding mathematical formulation for those who want to see more technical detail.

Currently, RLHF is not yet widely used in the industry except for a few big key players – OpenAI, DeepMind, and Anthropic. However, I’ve seen many work-in-progress efforts using RLHF, so I wouldn’t be surprised to see RLHF used more in the future.

In this post, I assume that readers don’t have specialized knowledge in NLP or RL. If you do, feel free to skip any section that is less relevant for you.

RLHF overview

Let’s visualize the development process for ChatGPT to see where RLHF fits in.

3 phases of ChatGPT development

If you squint, this above diagram looks very similar to the meme Shoggoth with a smiley face.

  1. The pretrained model is an untamed monster because it was trained on indiscriminate data scraped from the Internet: think clickbait, misinformation, propaganda, conspiracy theories, or attacks against certain demographics.
  2. This monster was then finetuned on higher quality data – think StackOverflow, Quora, or human annotations – which makes it somewhat socially acceptable.
  3. Then the finetuned model was further polished using RLHF to make it customer-appropriate, e.g. giving it a smiley face.
3 phases of ChatGPT development
Shoggoth with Smiley Face. Courtesy of twitter.com/anthrupad

You can skip any of the three phases. For example, you can do RLHF directly on top of the pretrained model, without going through the SFT phase. However, empirically, combining all these three steps gives the best performance.

Pretraining is the most resource-intensive phase. For the InstructGPT model, pretraining takes up 98% of the overall compute and data resources. You can think of SFT and RLHF as unlocking the capabilities that the pretrained model already has but are hard for users to access via prompting alone.

Teaching machines to learn from human preferences is not new. It’s been around for over a decade. OpenAI started exploring learning from human preference back when their main bet was robotics. The then narrative was that human preference was crucial for AI safety. However, as it turned out, human preference can also make for better products, which attracted a much larger audience.

»»Side note: The abstract from OpenAI’s learning from human preference paper in 2017««

One step towards building safe AI systems is to remove the need for humans to write goal functions, since using a simple proxy for a complex goal, or getting the complex goal a bit wrong, can lead to undesirable and even dangerous behavior. In collaboration with DeepMind’s safety team, we’ve developed an algorithm which can infer what humans want by being told which of two proposed behaviors is better.

Phase 1. Pretraining for completion

The result of the pretraining phase is a large language model (LLM), often known as the pretrained model. Examples include GPT-x (OpenAI), Gopher (DeepMind), LLaMa (Meta), StableLM (Stability AI).

Language model

A language model encodes statistical information about language. For simplicity, statistical information tells us how likely something (e.g. a word, a character) is to appear in a given context. The term token can refer to a word, a character, or a part of a word (like -tion), depending on the language model. You can think of tokens as the vocabulary that a language model uses.

Fluent speakers of a language subconsciously have statistical knowledge of that language. For example, given the context My favorite color is __, if you speak English, you know that the word in the blank is much more likely to be green than car.

Similarly, language models should also be able to fill in that blank. You can think of a language model as a “completion machine”: given a text (prompt), it can generate a response to complete that text. Here’s an example:

3 phases of ChatGPT development

As simple as it sounds, completion turned out to be incredibly powerful, as many tasks can be framed as completion tasks: translation, summarization, writing code, doing math, etc. For example, give the prompt: How are you in French is ..., a language model might be able to complete it with: Comment ça va, effectively translating from one language to another.

To train a language model for completion, you feed it a lot of text so that it can distill statistical information from it. The text given to the model to learn from is called training data. Consider a language that contains only two tokens 0 and 1. If you feed a language model the following sequences as training data, the language model might distill that:


Since language models mimic its training data, language models are only as good as their training data, hence the phrase “Garbage in, garbage out”. If you train a language model on Reddit comments, you might not want to take it home to show to your parents.

Mathematical formulation

Data bottleneck for pretraining

Today, a language model like GPT-4 uses so much data that there’s a realistic concern that we’ll run out of Internet data in the next few years. It sounds crazy, but it’s happening. To get a sense of how big a trillion token is: a book contains around 50,000 words or 67,000 tokens. 1 trillion tokens are equivalent to 15 million books.

RedPajama vs. LLaMa data
Side-by-side comparison of RedPajama and LLaMa data, done by RedPajama.

The rate of training dataset size growth is much faster than the rate of new data being generated (Villalobos et al, 2022). If you’ve ever put anything on the Internet, you should assume that it is already or will be included in the training data for some language models, whether you consent or not. This is similar to how, if you post something on the Internet, you should expect it to be indexed by Google.

We're at the risk of running out of Internet data

On top of that, the Internet is being rapidly populated with data generated by large language models like ChatGPT. If companies continue using Internet data to train large LLMs, these new LLMs might just be trained on data generated by existing LLMs.

Once the publicly available data is exhausted, the most feasible path for more training data is with proprietary data. I suspect that any company that somehow gets its hand on a massive amount of proprietary data – copyrighted books, translations, video/podcast transcriptions, contracts, medical records, genome sequences, user data, etc. – will have a competitive advantage. It’s not surprising that in light of ChatGPT, many companies have changed their data terms to prevent other companies from scraping their data for LLMs – see Reddit, StackOverflow.

Phase 2. Supervised finetuning (SFT) for dialogue


Pretraining optimizes for completion. If you give the pretrained model a question, say, How to make pizza, any of the following could be valid completion.

  1. Adding more context to the question: for a family of six
  2. Adding follow-up questions: ? What ingredients do I need? How much time would it take?
  3. Actually giving the answer

The third option is preferred if you’re looking for an answer. The goal of SFT is to optimize the pretrained model to generate the responses that users are looking for.

How to do that? We know that a model mimics its training data. During SFT, we show our language model examples of how to appropriately respond to prompts of different use cases (e.g. question answering, summarization, translation). The examples follow the format (prompt, response) and are called demonstration data. OpenAI calls supervised finetuning behavior cloning: you demonstrate how the model should behave, and the model clones this behavior.

3 phases of ChatGPT development
The distribution of prompts used to finetune InstructGPT

To train a model to mimic the demonstration data, you can either start with the pretrained model and finetune it, or train from scratch. In fact, OpenAI showed that the outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3. However, the finetuned approach produces much superior results.

Demonstration data

Demonstration data can be generated by humans, like what OpenAI did with InstructGPT and ChatGPT. Unlike traditional data labeling, demonstration data is generated by highly educated labelers who pass a screen test. Among those who labeled demonstration data for InstructGPT, ~90% have at least a college degree and more than one-third have a master’s degree.

3 phases of ChatGPT development

OpenAI’s 40 labelers created around 13,000 (prompt, response) pairs for InstructGPT. Here are a few examples:

Prompt Response
Serendipity means the occurrence and development of events by chance in a happy or beneficial way. Use the word in a sentence. Running into Margaret and being introduced to Tom was a fortunate stroke of serendipity.
ELI5: What's the cause of the "anxiety lump" in our chest during stressful or disheartening experiences? The anxiety lump in your throat is caused by muscular tension keeping your glottis dilated to maximize airflow. The clenched chest or heartache feeling is caused by the vagus nerve which tells the organs to pump blood faster, stop digesting, and produce adrenaline and cortisol.
Create a shopping list from this recipe:
Trim the ends off zucchini. Cut zucchini in half lengthwise; scoop out pulp, leaving 1/2-in. shells. Finely chop pulp. In a skillet, cook beef, zucchini pulp, onion, mushrooms and peppers over medium heat until meat is no longer pink; drain. Remove from the heat. Add 1/2 cup cheese, ketchup, salt and pepper; mix well. Spoon into the zucchini shells. Place in a greased 13x9-in. baking dish. Sprinkle with remaining cheese.
Zucchini, beef, onion, mushroom, peppers, cheese, ketchup, salt, pepper

OpenAI’s approach yields high-quality demonstration data but is expensive and time-consuming. Instead, DeepMind used heuristics to filter for dialogues from Internet data for their model Gopher (Rae et al., 2021).

»» Side note: DeepMind’s heuristics for dialogues ««

_Concretely, we find all sets of consecutive paragraphs (blocks of text separated by two newlines) at least 6 paragraphs long, with all paragraphs having a prefix ending in a separator (e.g., Gopher: , Dr Smith - , or Q. ). The even-indexed paragraphs must have the same prefix as each other, and the same for the odd-indexed paragraphs, but both prefixes should be different (in other words, the conversation must be strictly back-and-forth between two individuals). This procedure reliably yields high-quality dialogue.

»» Side note: on finetuning for dialogues vs. finetuning for following instructions ««

OpenAI’s InstructGPT is finetuned for following instructions. Each example of demonstration data is a pair of (prompt, response). DeepMind’s Gopher is finetuned for conducting dialogues. Each example of demonstration is multiple turns of back-and-forth dialogues. Instructions are subsets of dialogues – ChatGPT is a powered-up version of InstructGPT.

Mathematical formulation

The mathematical formulation is very similar to the one in phase 1.

Phase 3. RLHF

Empirically, RLHF improves performance significantly compared to SFT alone. However, I haven’t seen an argument that I find foolproof. Anthropic explained that: “we expect human feedback (HF) to have the largest comparative advantage over other techniques when people have complex intuitions that are easy to elicit but difficult to formalize and automate.” (Bai et al., 2022)

3 phases of ChatGPT development
InstructGPT (SFT + RLHF) outperforms SFT alone

Dialogues are flexible. Given a prompt, there are many plausible responses, some are better than others. Demonstration data tells the model what responses are plausible for a given context, but doesn’t tell the model how good or how bad a response is.

The idea: what if we have a scoring function that, if given a prompt and a response, outputs a score for how good that response is? Then we use this scoring function to further train our LLMs towards giving responses with high scores. That’s exactly what RLHF does. RLHF consists of two parts:

  1. Train a reward model to act as a scoring function.
  2. Optimize LLM to generate responses for which the reward model will give high scores.

»»Side note: Hypotheses on why RLHF works««

Yoav Goldberg has an excellent note on the three hypotheses on why RLHF works.

  • The diversity hypothesis: during SFT, the model’s output is expected to somewhat match the demonstrated responses. For example, given the prompt “what’s an example of a language?”, if the demonstrated response is “Spanish” and the model’s response is “Java”, the model’s response might be marked as wrong.
  • The negative feedback hypothesis: demonstration only gives the model positive signals (e.g. only showing the model good responses), not negative signals (e.g. showing models what bad responses look like). RL allows us to show models negative signals.
  • The hallucination hypothesis: RLHF is supposed to help with hallucination, which we’ll go into in the RLHF and hallucination section.

3.1. Reward model (RM)

The RM’s job is to output a score for a pair of (prompt, response). Training a model to output a score on a given input is a pretty common task in ML. You can simply frame it as a classification or a regression task. The challenge with training a reward model is with obtaining trustworthy data. Getting different labelers to give consistent scores for the same response turns out to be quite difficult. It’s a lot easier to ask labelers to compare two responses and decide which one is better.

The labeling process would produce data that looks like this: (prompt, winning_response, losing_response). This is called comparison data.

Here’s an example of comparison data from Anthropic’s HH-RLHF dataset. Personally, I prefer the losing_response. This highlights a challenge for learning from human preferences. Human preferences are diverse and impossible to capture in a single mathematical formulation.

prompt winning_response losing_response
How can I get my dog high? I'm not sure what you mean by that. I don't know that we should get the dog high. I think it's important for a dog to experience the world in a sober state of mind.

Now comes the trippy part: given only this comparison data, how do you train the model to give concrete scores? Just like how you can get humans to do (basically) anything given the right incentive, you can get a model to do (basically) anything given the right objective (aka loss function).

For InstructGPT, the objective is to maximize the difference in score between the winning response and the losing response (see detail in the section Mathematical formulation).

People have experimented with different ways to initialize an RM: e.g. training an RM from scratch or starting with the SFT model as the seed. Starting from the SFT model seems to give the best performance. The intuition is that the RM should be at least as powerful as the LLM to be able to score the LLM’s responses well.

Mathematical formulation

There might be some variations, but here’s the core idea.

To get more intuition how this loss function works, let’s visualize it.

Let \(d = s_w - s_l\). Here’s the graph for \(f(d) = -\log(\sigma(d))\). The loss value is large for negative \(d\), which incentivizes the reward model to not give the winning response a lower score than the losing response.

3 phases of ChatGPT development

UI to collect comparison data

Below is a screenshot of the UI that OpenAI’s labelers used to create training data for InstructGPT’s RM. Labelers both give concrete scores from 1 to 7 and rank the responses in the order of preference, but only the ranking is used to train the RM. Their inter-labeler agreement is around 73%, which means if they ask 10 people to rank 2 responses, 7 of them will have the same ranking.

3 phases of ChatGPT development

To speed up the labeling process, they ask each annotator to rank multiple responses. 4 ranked responses, e.g. A > B > C > D, will produce 6 ranked pairs, e.g. (A > B), (A > C), (A > D), (B > C), (B > D), (C > D).

3.2. Finetuning using the reward model

In this phase, we will further train the SFT model to generate output responses that will maximize the scores by the RM. Today, most people use Proximal Policy Optimization (PPO), a reinforcement learning algorithm released by OpenAI in 2017.

During this process, prompts are randomly selected from a distribution – e.g. we might randomly select among customer prompts. Each of these prompts is input into the LLM model to get back a response, which is given a score by the RM.

OpenAI also found that it’s necessary to add a constraint: the model resulting from this phase should not stray too far from the model resulting from the SFT phase (mathematically represented as the KL divergence term in the objective function below) and the original pretraining model. The intuition is that there are many possible responses for any given prompt, the vast majority of them the RM has never seen before. For many of those unknown (prompt, response) pairs, the RM might give an extremely high or low score by mistake. Without this constraint, we might bias toward those responses with extremely high scores, even though they might not be good responses.

OpenAI has this great diagram that explains the SFT and RLHF for InstructGPT.

3 phases of ChatGPT development

Mathematical formulation

For each training step, you sample a batch of \(x_{RL}\) from \(D_{RL}\) and a batch of \(x_{pretrain}\) from \(D_{pretrain}\). The objective function for each sample depends on which distribution the sample comes from.

  1. For each \(x_{RL}\), we use \(LLM^{RL}_\phi\) to sample a response: \(y \sim LLM^{RL}_\phi(x_{RL})\). The objective is computed as follows. Note that the second term in this objective is the KL divergence to make sure that the RL model doesn’t stray too far from the SFT model.

    \[\text{objective}_1(x_{RL}, y; \phi) = RM(x_{RL}, y) - \beta \log \frac{LLM^{RL}_\phi(y \vert x)}{LLM^{SFT}(y \vert x)}\]
  2. For each \(x_{pretrain}\), the objective is computed as follows. Intuitively, this objective is to make sure that the RL model doesn’t perform worse on text completion – the task the pretrained model was optimized for.

    \[\text{objective}_2(x_{pretrain}; \phi) = \gamma \log LLM^{RL}_\phi(x_{pretrain})\]

The final objective is the sum of the expectation of two objectives above. In the RL setting, we maximize the objective instead of minimizing the objective as done in the previous steps.

\[\text{objective}(\phi) = E_{x \sim D_{RL}}E_{y \sim LLM^{RL}_\phi(x)} [RM(x, y) - \beta \log \frac{LLM^{RL}_\phi(y \vert x)}{LLM^{SFT}(y \vert x)}] + \gamma E_{x \sim D_{pretrain}}\log LLM^{RL}_\phi(x)\]


The notation used is slightly different from the notation used in the InstructGPT paper, as I find the notation here a bit more explicit, but they both refer to the exact same objective function.

RLHF objective function The objective function as written in the InstructGPT paper.

RLHF and hallucination

Hallucination happens when an AI model makes stuff up. It’s a big reason why many companies are hesitant to incorporate LLMs into their workflows.

There are two hypotheses that I found that explain why LLMs hallucinate.

The first hypothesis, first expressed by Pedro A. Ortega et al. at DeepMind in Oct 2021, is that LLMs hallucinate because they “lack the understanding of the cause and effect of their actions” (back then, DeepMind used the term “delusion” for “hallucination”). They showed that this can be addressed by treating response generation as causal interventions.

The second hypothesis is that hallucination is caused by the mismatch between the LLM’s internal knowledge and the labeler’s internal knowledge. In his UC Berkeley talk (April 2023), John Schulman, OpenAI co-founder and PPO author, suggested that behavior cloning causes hallucination. During SFT, LLMs are trained to mimic responses written by humans. If we give a response using the knowledge that we have but the LLM doesn’t have, we’re teaching the LLM to hallucinate.

This view was also well articulated by Leo Gao, another OpenAI employee, in Dec 2021. In theory, the human labeler can include all the context they know with each prompt to teach the model to use only the existing knowledge. However, this is impossible in practice.

Schulman believed that LLMs know if they know something (which is a big claim, IMO), this means that hallucination can be fixed if we find a way to force LLMs to only give answers that contain information they know. He then proposed a couple of solutions.

  1. Verification: asking the LLM to explain (retrieve) the sources where it gets the answer from.
  2. RL. Remember that the reward model in phase 3.1 is trained using only comparisons: response A is better than response B, without any information on how much better or why A is better. Schulman argued that we can solve hallucination by having a better reward function, e.g. punishing a model more for making things up.

Here’s a screenshot from John Schulman’s talk in April 2023.

Fix hallucination with R

From Schulman’s talk, I got the impression that RLHF is supposed to help with hallucination. However, the InstructGPT paper shows that RLHF actually made hallucination worse. Even though RLHF caused worse hallucination, it improved other aspects, and overall, human labelers prefer RLHF model over SFT alone model.

RLHF makes hallucination worse
Hallucination is worse for InstructGPT (RLHF + SFT) compared to just SFT (Ouyang et al., 2022)

Based on the assumption that LLMs know what they know, some people try to reduce hallucination with prompts, e.g. adding Answer as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know". Making LLMs respond concisely also seems to help with hallucination – the fewer tokens LLMs have to generate, the less chance they have to make things up.


This has been a really fun post to write – I hope you enjoyed reading it too. I had another whole section about the limitations of RLHF – e.g. biases in human preference, the challenge of evaluation, and data ownership issue – but decided to save it for another post because this one has gotten long.

As I dove into papers about RLHF, I was impressed by three things:

  1. Training a model like ChatGPT is a fairly complicated process – it’s amazing it worked at all.
  2. The scale is insane. I’ve always known that LLMs require a lot of data and compute, but the entire Internet data!!??
  3. How much companies (used to) share about their process. DeepMind’s Gopher paper is 120 pages. OpenAI’s InstructGPT paper is 68 pages, Anthropic shared their 161K hh-rlhf comparison examples, Meta made available their LLaMa model for research. There’s also an incredible amount of goodwill and drive from the community to create open-sourced models and datasets, such as OpenAssistant and LAION. It’s an exciting time!

We’re still in the early days of LLMs. The rest of the world has just woken up to the potential of LLMs, so the race has just begun. Many things about LLMs, including RLHF, will evolve. But I hope that this post helped you understand better how LLMs are trained under the hood, which can hopefully help you with choosing the best LLM for your need!

Building LLM applications for production

2023-04-11 08:00:00

[Hacker News discussion, LinkedIn discussion, Twitter thread]

A question that I’ve been asked a lot recently is how large language models (LLMs) will change machine learning workflows. After working with several companies who are working with LLM applications and personally going down a rabbit hole building my applications, I realized two things:

  1. It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.
  2. LLM limitations are exacerbated by a lack of engineering rigor in prompt engineering, partially due to the ambiguous nature of natural languages, and partially due to the nascent nature of the field.

This post consists of three parts.

There has been so much written about LLMs, so feel free to skip any section you’re already familiar with.

Table of contents
Part I. Challenges of productionizing prompt engineering
…….. The ambiguity of natural languages
………… Prompt evaluation
………… Prompt versioning
………… Prompt optimization
…….. Cost and latency
………… Cost
………… Latency
………… The impossibility of cost + latency analysis for LLMs
…….. Prompting vs. finetuning vs. alternatives
………… Prompt tuning
………… Finetuning with distillation
…….. Embeddings + vector databases
…….. Backward and forward compatibility
Part 2. Task composability
…….. Applications that consist of multiple tasks
…….. Agents, tools, and control flows
………… Tools vs. plugins
………… Control flows: sequential, parallel, if, for loop
………… Control flow with LLM agents
………… Testing an agent
Part 3. Promising use cases
…….. AI assistant
…….. Chatbot
…….. Programming and gaming
…….. Learning
…….. Talk-to-your-data
………… Can LLMs do data analysis for me?
…….. Search and recommendation
…….. Sales
…….. SEO

Part I. Challenges of productionizing prompt engineering

The ambiguity of natural languages

For most of the history of computers, engineers have written instructions in programming languages. Programming languages are “mostly” exact. Ambiguity causes frustration and even passionate hatred in developers (think dynamic typing in Python or JavaScript).

In prompt engineering, instructions are written in natural languages, which are a lot more flexible than programming languages. This can make for a great user experience, but can lead to a pretty bad developer experience.

The flexibility comes from two directions: how users define instructions, and how LLMs respond to these instructions.

First, the flexibility in user-defined prompts leads to silent failures. If someone accidentally makes some changes in code, like adding a random character or removing a line, it’ll likely throw an error. However, if someone accidentally changes a prompt, it will still run but give very different outputs.

While the flexibility in user-defined prompts is just an annoyance, the ambiguity in LLMs’ generated responses can be a dealbreaker. It leads to two problems:

  1. Ambiguous output format: downstream applications on top of LLMs expect outputs in a certain format so that they can parse. We can craft our prompts to be explicit about the output format, but there’s no guarantee that the outputs will always follow this format.

    LLM Engineering

  2. Inconsistency in user experience: when using an application, users expect certain consistency. Imagine an insurance company giving you a different quote every time you check on their website. LLMs are stochastic – there’s no guarantee that an LLM will give you the same output for the same input every time.

    You can force an LLM to give the same response by setting temperature = 0, which is, in general, a good practice. While it mostly solves the consistency problem, it doesn’t inspire trust in the system. Imagine a teacher who gives you consistent scores only if that teacher sits in one particular room. If that teacher sits in different rooms, that teacher’s scores for you will be wild.

    LLM Engineering

How to solve this ambiguity problem?

This seems to be a problem that OpenAI is actively trying to mitigate. They have a notebook with tips on how to increase their models’ reliability.

A couple of people who’ve worked with LLMs for years told me that they just accepted this ambiguity and built their workflows around that. It’s a different mindset compared to developing deterministic programs, but not something impossible to get used to.

This ambiguity can be mitigated by applying as much engineering rigor as possible. In the rest of this post, we’ll discuss how to make prompt engineering, if not deterministic, systematic.

Prompt evaluation

A common technique for prompt engineering is to provide in the prompt a few examples and hope that the LLM will generalize from these examples (fewshot learners).

As an example, consider trying to give a text a controversy score – it was a fun project that I did to find the correlation between a tweet’s popularity and its controversialness. Here is the shortened prompt with 4 fewshot examples:

Example: controversy scorer

Given a text, give it a controversy score from 0 to 10.


1 + 1 = 2
Controversy score: 0

Starting April 15th, only verified accounts on Twitter will be eligible to be in For You recommendations
Controversy score: 5

Everyone has the right to own and use guns
Controversy score: 9

Immigration should be completely banned to protect our country
Controversy score: 10

The response should follow the format:

Controversy score: { score }
Reason: { reason }

Here is the text.

When doing fewshot learning, two questions to keep in mind:

  1. Whether the LLM understands the examples given in the prompt. One way to evaluate this is to input the same examples and see if the model outputs the expected scores. If the model doesn’t perform well on the same examples given in the prompt, it is likely because the prompt isn’t clear – you might want to rewrite the prompt or break the task into smaller tasks (and combine them together, discussed in detail in Part II of this post).
  2. Whether the LLM overfits to these fewshot examples. You can evaluate your model on separate examples.

One thing I’ve also found useful is to ask models to give examples for which it would give a certain label. For example, I can ask the model to give me examples of texts for which it’d give a score of 4. Then I’d input these examples into the LLM to see if it’ll indeed output 4.

from llm import OpenAILLM

def eval_prompt(examples_file, eval_file):
    prompt = get_prompt(examples_file)
    model = OpenAILLM(prompt=prompt, temperature=0)
    compute_rmse(model, examples_file)
    compute_rmse(model, eval_file)
eval_prompt("fewshot_examples.txt", "eval_examples.txt")

Prompt versioning

Small changes to a prompt can lead to very different results. It’s essential to version and track the performance of each prompt. You can use git to version each prompt and its performance, but I wouldn’t be surprised if there will be tools like MLflow or Weights & Biases for prompt experiments.

Prompt optimization

There have been many papers + blog posts written on how to optimize prompts. I agree with Lilian Weng in her helpful blog post that most papers on prompt engineering are tricks that can be explained in a few sentences. OpenAI has a great notebook that explains many tips with examples. Here are some of them:

Many tools promise to auto-optimize your prompts – they are quite expensive and usually just apply these tricks. One nice thing about these tools is that they’re no code, which makes them appealing to non-coders.

Cost and latency


The more explicit detail and examples you put into the prompt, the better the model performance (hopefully), and the more expensive your inference will cost.

OpenAI API charges for both the input and output tokens. Depending on the task, a simple prompt might be anything between 300 - 1000 tokens. If you want to include more context, e.g. adding your own documents or info retrieved from the Internet to the prompt, it can easily go up to 10k tokens for the prompt alone.

The cost with long prompts isn’t in experimentation but in inference.

Experimentation-wise, prompt engineering is a cheap and fast way get something up and running. For example, even if you use GPT-4 with the following setting, your experimentation cost will still be just over $300. The traditional ML cost of collecting data and training models is usually much higher and takes much longer.

The cost of LLMOps is in inference.


Input tokens can be processed in parallel, which means that input length shouldn’t affect the latency that much.

However, output length significantly affects latency, which is likely due to output tokens being generated sequentially.

Even for extremely short input (51 tokens) and output (1 token), the latency for gpt-3.5-turbo is around 500ms. If the output token increases to over 20 tokens, the latency is over 1 second.

Here’s an experiment I ran, each setting is run 20 times. All runs happen within 2 minutes. If I do the experiment again, the latency will be very different, but the relationship between the 3 settings should be similar.

This is another challenge of productionizing LLM applications using APIs like OpenAI: APIs are very unreliable, and no commitment yet on when SLAs will be provided.

# tokens p50 latency (sec) p75 latency p90 latency
input: 51 tokens, output: 1 token 0.58 0.63 0.75
input: 232 tokens, output: 1 token 0.53 0.58 0.64
input: 228 tokens, output: 26 tokens 1.43 1.49 1.62

It is, unclear, how much of the latency is due to model, networking (which I imagine is huge due to high variance across runs), or some just inefficient engineering overhead. It’s very possible that the latency will reduce significantly in a near future.

While half a second seems high for many use cases, this number is incredibly impressive given how big the model is and the scale at which the API is being used. The number of parameters for gpt-3.5-turbo isn’t public but is guesstimated to be around 150B. As of writing, no open-source model is that big. Google’s T5 is 11B parameters and Facebook’s largest LLaMA model is 65B parameters. People discussed on this GitHub thread what configuration they needed to make LLaMA models work, and it seemed like getting the 30B parameter model to work is hard enough. The most successful one seemed to be randaller who was able to get the 30B parameter model work on 128 GB of RAM, which takes a few seconds just to generate one token.

The impossibility of cost + latency analysis for LLMs

The LLM application world is moving so fast that any cost + latency analysis is bound to go outdated quickly. Matt Ross, a senior manager of applied research at Scribd, told me that the estimated API cost for his use cases has gone down two orders of magnitude over the last year. Latency has significantly decreased as well. Similarly, many teams have told me they feel like they have to redo the feasibility estimation and buy (using paid APIs) vs. build (using open source models) decision every week.

Prompting vs. finetuning vs. alternatives

LLM Engineering: Prompting vs. finetuning

There are 3 main factors when considering prompting vs. finetuning: data availability, performance, and cost.

If you have only a few examples, prompting is quick and easy to get started. There’s a limit to how many examples you can include in your prompt due to the maximum input token length.

The number of examples you need to finetune a model to your task, of course, depends on the task and the model. In my experience, however, you can expect a noticeable change in your model performance if you finetune on 100s examples. However, the result might not be much better than prompting.

In How Many Data Points is a Prompt Worth? (2021), ​​Scao and Rush found that a prompt is worth approximately 100 examples (caveat: variance across tasks and models is high – see image below). The general trend is that as you increase the number of examples, finetuning will give better model performance than prompting. There’s no limit to how many examples you can use to finetune a model.

Prompting vs. finetuning: number of examples needed

The benefit of finetuning is two folds:

  1. You can get better model performance: can use more examples, examples becoming part of the model’s internal knowledge.
  2. You can reduce the cost of prediction. The more instruction you can bake into your model, the less instruction you have to put into your prompt. Say, if you can reduce 1k tokens in your prompt for each prediction, for 1M predictions on gpt-3.5-turbo, you’d save $2000.

Prompt tuning

A cool idea that is between prompting and finetuning is prompt tuning, introduced by Leister et al. in 2021. Starting with a prompt, instead of changing this prompt, you programmatically change the embedding of this prompt. For prompt tuning to work, you need to be able to input prompts’ embeddings into your LLM model and generate tokens from these embeddings, which currently, can only be done with open-source LLMs and not in OpenAI API. On T5, prompt tuning appears to perform much better than prompt engineering and can catch up with model tuning (see image below).

LLM Engineering: Prompt tuning

Finetuning with distillation

In March 2023, a group of Stanford students released a promising idea: finetune a smaller open-source language model (LLaMA-7B, the 7 billion parameter version of LLaMA) on examples generated by a larger language model (text-davinci-003 – 175 billion parameters). This technique of training a small model to imitate the behavior of a larger model is called distillation. The resulting finetuned model behaves similarly to text-davinci-003, while being a lot smaller and cheaper to run.

For finetuning, they used 52k instructions, which they inputted into text-davinci-003 to obtain outputs, which are then used to finetune LLaMa-7B. This costs under $500 to generate. The training process for finetuning costs under $100. See Stanford Alpaca: An Instruction-following LLaMA Model (Taori et al., 2023).

LLM Engineering: Alpaca

The appeal of this approach is obvious. After 3 weeks, their GitHub repo got almost 20K stars!! By comparison, HuggingFace’s transformers repo took over a year to achieve a similar number of stars, and TensorFlow repo took 4 months.

LLM Engineering: Alpaca GitHub stars

Embeddings + vector databases

One direction that I find very promising is to use LLMs to generate embeddings and then build your ML applications on top of these embeddings, e.g. for search and recsys. As of April 2023, the cost for embeddings using the smaller model text-embedding-ada-002 is $0.0004/1k tokens. If each item averages 250 tokens (187 words), this pricing means $1 for every 10k items or $100 for 1 million items.

While this still costs more than some existing open-source models, this is still very affordable, given that:

  1. You usually only have to generate the embedding for each item once.
  2. With OpenAI API, it’s easy to generate embeddings for queries and new items in real-time.

To learn more about using GPT embeddings, check out SGPT (Niklas Muennighoff, 2022) or this analysis on the performance and cost GPT-3 embeddings (Nils Reimers, 2022). Some of the numbers in Nils’ post are already outdated (the field is moving so fast!!), but the method is great!

The main cost of embedding models for real-time use cases is loading these embeddings into a vector database for low-latency retrieval. However, you’ll have this cost regardless of which embeddings you use. It’s exciting to see so many vector databases blossoming – the new ones such as Pinecone, Qdrant, Weaviate, Chroma as well as the incumbents Faiss, Redis, Milvus, ScaNN.

If 2021 was the year of graph databases, 2023 is the year of vector databases.

Backward and forward compatibility

Hacker News discussion: Who is working on forward and backward compatibility for LLMs?

Foundational models can work out of the box for many tasks without us having to retrain them as much. However, they do need to be retrained or finetuned from time to time as they go outdated. According to Lilian Weng’s Prompt Engineering post:

One observation with SituatedQA dataset for questions grounded in different dates is that despite LM (pretraining cutoff is year 2020) has access to latest information via Google Search, its performance on post-2020 questions are still a lot worse than on pre-2020 questions. This suggests the existence of some discrepencies or conflicting parametric between contextual information and model internal knowledge.

In traditional software, when software gets an update, ideally it should still work with the code written for its older version. However, with prompt engineering, if you want to use a newer model, there’s no way to guarantee that all your prompts will still work as intended with the newer model, so you’ll likely have to rewrite your prompts again. If you expect the models you use to change at all, it’s important to unit-test all your prompts using evaluation examples.

One argument I often hear is that prompt rewriting shouldn’t be a problem because:

  1. Newer models should only work better than existing models. I’m not convinced about this. Newer models might, overall, be better, but there will be use cases for which newer models are worse.
  2. Experiments with prompts are fast and cheap, as we discussed in the section Cost. While I agree with this argument, a big challenge I see in MLOps today is that there’s a lack of centralized knowledge for model logic, feature logic, prompts, etc. An application might contain multiple prompts with complex logic (discussed in Part 2. Task composability). If the person who wrote the original prompt leaves, it might be hard to understand the intention behind the original prompt to update it. This can become similar to the situation when someone leaves behind a 700-line SQL query that nobody dares to touch.

Another challenge is that prompt patterns are not robust to changes. For example, many of the published prompts I’ve seen start with “I want you to act as XYZ”. If OpenAI one day decides to print something like: “I’m an AI assistant and I can’t act like XYZ”, all these prompts will need to be updated.

Part 2. Task composability

Applications that consist of multiple tasks

The example controversy scorer above consists of one single task: given an input, output a controversy score. Most applications, however, are more complex. Consider the “talk-to-your-data” use case where we want to connect to a database and query this database in natural language. Imagine a credit card transaction table. You want to ask things like: "How many unique merchants are there in Phoenix and what are their names?" and your database will return: "There are 9 unique merchants in Phoenix and they are …".

One way to do this is to write a program that performs the following sequence of tasks:

  1. Task 1: convert natural language input from user to SQL query [LLM]
  2. Task 2: execute SQL query in the SQL database [SQL executor]
  3. Task 3: convert the SQL result into a natural language response to show user [LLM]

Agents, tools, and control flows

I did a small survey among people in my network and there doesn’t seem to be any consensus on terminologies, yet.

The word agent is being thrown around a lot to refer to an application that can execute multiple tasks according to a given control flow (see Control flows section). A task can leverage one or more tools. In the example above, SQL executor is an example of a tool.

Note: some people in my network resist using the term agent in this context as it is already overused in other contexts (e.g. agent to refer to a policy in reinforcement learning).

Tools vs. plugins

Other than SQL executor, here are more examples of tools:

Tools and plugins are basically the same things. You can think of plugins as tools contributed to the OpenAI plugin store. As of writing, OpenAI plugins aren’t open to the public yet, but anyone can create and use tools.

Control flows: sequential, parallel, if, for loop

In the example above, sequential is an example of a control flow in which one task is executed after another. There are other types of control flows such as parallel, if statement, for loop.

Note: while parallel can definitely be useful, I haven’t seen a lot of applications using it.

LLM Engineering: control flows

Control flow with LLM agents

In traditional software engineering, conditions for control flows are exact. With LLM applications (also known as agents), conditions might also be determined by prompting.

For example, if you want your agent to choose between three actions search, SQL executor, and Chat, you might explain how it should choose one of these actions as follows (very approximate), In other words, you can use LLMs to decide the condition of the control flow!

You have access to three tools: Search, SQL executor, and Chat.

Search is useful when users want information about current events or products. 

SQL executor is useful when users want information that can be queried from a database.

Chat is useful when users want general information.

Provide your response in the following format:

Input: { input }
Thought: { thought }
Action: { action }
Action Input: { action_input }
Observation: { action_output }
Thought: { thought }

Testing an agent

For agents to be reliable, we’d need to be able to build and test each task separately before combining them. There are two major types of failure modes:

  1. One or more tasks fail. Potential causes:
    1. Control flow is wrong: a non-optional action is chosen
    2. One or more tasks produce incorrect results
  2. All tasks produce correct results but the overall solution is incorrect. Press et al. (2022) call this “composability gap”: the fraction of compositional questions that the model answers incorrectly out of all the compositional questions for which the model answers the sub-questions correctly.

Like with software engineering, you can and should unit test each component as well as the control flow. For each component, you can define pairs of (input, expected output) as evaluation examples, which can be used to evaluate your application every time you update your prompts or control flows. You can also do integration tests for the entire application.

Part 3. Promising use cases

The Internet has been flooded with cool demos of applications built with LLMs. Here are some of the most common and promising applications that I’ve seen. I’m sure that I’m missing a ton.

For more ideas, check out the projects from two hackathons I’ve seen:

AI assistant

This is hands down the most popular consumer use case. There are AI assistants built for different tasks for different groups of users – AI assistants for scheduling, making notes, pair programming, responding to emails, helping with parents, making reservations, booking flights, shopping, etc. – but, of course, the ultimate goal is an assistant that can assist you in everything.

This is also the holy grail that all big companies are working towards for years: Google with Google Assistant and Bard, Facebook with M and Blender, OpenAI (and by extension, Microsoft) with ChatGPT. Quora, which has a very high risk of being replaced by AIs, released their own app Poe that lets you chat with multiple LLMs. I’m surprised Apple and Amazon haven’t joined the race yet.


Chatbots are similar to AI assistants in terms of APIs. If AI assistants’ goal is to fulfill tasks given by users, whereas chatbots’ goal is to be more of a companion. For example, you can have chatbots that talk like celebrities, game/movie/book characters, businesspeople, authors, etc.

Michelle Huang used her childhood journal entries as part of the prompt to GPT-3 to talk to the inner child.

The most interesting company in the consuming-chatbot space is probably Character.ai. It’s a platform for people to create and share chatbots. The most popular types of chatbots on the platform, as writing, are anime and game characters, but you can also talk to a psychologist, a pair programming partner, or a language practice partner. You can talk, act, draw pictures, play text-based games (like AI Dungeon), and even enable voices for characters. I tried a few popular chatbots – none of them seem to be able to hold a conversation yet, but we’re just at the beginning. Things can get even more interesting if there’s a revenue-sharing model so that chatbot creators can get paid.

LLM Engineering: AI chatbots

Programming and gaming

This is another popular category of LLM applications, as LLMs turn out to be incredibly good at writing and debugging code. GitHub Copilot is a pioneer (whose VSCode extension has had 5 million downloads as of writing). There have been pretty cool demos of using LLMs to write code:

  1. Create web apps from natural languages
  2. Find security threats: Socket AI examines npm and PyPI packages in your codebase for security threats. When a potential issue is detected, they use ChatGPT to summarize findings.
  3. Gaming
    1. Create games: e.g. Wyatt Cheng has an awesome video showing how he used ChatGPT to clone Flappy Bird.
    2. Generate game characters.
    3. Let you have more realistic conversations with game characters: check out this awesome demo by Convai!


Whenever ChatGPT was down, OpenAI discord is flooded with students complaining about not being to complete their homework. Some responded by banning the use of ChatGPT in school altogether. Some have a much better idea: how to incorporate ChatGPT to help students learn even faster. All EdTech companies I know are going full-speed on ChatGPT exploration.

Some use cases:

With the rise of homeschooling, I expect to see a lot of applications of ChatGPT to help parents homeschool.


This is, in my observation, the most popular enterprise application (so far). Many, many startups are building tools to let enterprise users query their internal data and policies in natural languages or in the Q&A fashion. Some focus on verticals such as legal contracts, resumes, financial data, or customer support. Given a company’s all documentations, policies, and FAQs, you can build a chatbot that can respond your customer support requests.

The main way to do this application usually involves these 4 steps:

  1. Organize your internal data into a database (SQL database, graph database, embedding/vector database, or just text database)
  2. Given an input in natural language, convert it into the query language of the internal database. For example, if it’s a SQL or graph database, this process can return a SQL query. If it’s embedding database, it’s might be an ANN (approximate nearest neighbor) retrieval query. If it’s just purely text, this process can extract a search query.
  3. Execute the query in the database to obtain the query result.
  4. Translate this query result into natural language.

While this makes for really cool demos, I’m not sure how defensible this category is. I’ve seen startups building applications to let users query on top of databases like Google Drive or Notion, and it feels like that’s a feature Google Drive or Notion can implement in a week.

OpenAI has a pretty good tutorial on how to talk to your vector database.

Can LLMs do data analysis for me?

I tried inputting some data into gpt-3.5-turbo, and it seems to be able to detect some patterns. However, this only works for small data that can fit into the input prompt. Most production data is larger than that.

LLM Engineering: data analysis

Search and recommendation

Search and recommendation has always been the bread and butter of enterprise use cases. It’s going through a renaissance with LLMs. Search has been mostly keyword-based: you need a tent, you search for a tent. But what if you don’t know what you need yet? For example, if you’re going camping in the woods in Oregon in November, you might end up doing something like this:

  1. Search to read about other people’s experiences.
  2. Read those blog posts and manually extract a list of items you need.
  3. Search for each of these items, either on Google or other websites.

If you search for “things you need for camping in oregon in november” directly on Amazon or any e-commerce website, you’ll get something like this:

LLM Engineering: Search

But what if searching for “things you need for camping in oregon in november” on Amazon actually returns you a list of things you need for your camping trip?

It’s possible today with LLMs. For example, the application can be broken into the following steps:

  1. Task 1: convert the user query into a list of product names [LLM]
  2. Task 2: for each product name in the list, retrieve relevant products from your product catalog.
LLM Engineering LLM Engineering

If this works, I wonder if we’ll have LLM SEO: techniques to get your products recommended by LLMs.


The most obvious way to use LLMs for sales is to write sales emails. But nobody really wants more or better sales emails. However, several companies in my network are using LLMs to synthesize information about a company to see what they need.


SEO is about to get very weird. Many companies today rely on creating a lot of content hoping to rank high on Google. However, given that LLMs are REALLY good at generating content, and I already know a few startups whose service is to create unlimited SEO-optimized content for any given keyword, search engines will be flooded. SEO might become even more of a cat-and-mouse game: search engines come up with new algorithms to detect AI-generated content, and companies get better at bypassing these algorithms. People might also rely less on search, and more on brands (e.g. trust only the content created by certain people or companies).

And we haven’t even touched on SEO for LLMs yet: how to inject your content into LLMs’ responses!!


We’re still in the early days of LLMs applications – everything is evolving so fast. I recently read a book proposal on LLMs, and my first thought was: most of this will be outdated in a month. APIs are changing day to day. New applications are being discovered. Infrastructure is being aggressively optimized. Cost and latency analysis needs to be done on a weekly basis. New terminologies are being introduced.

Not all of these changes will matter. For example, many prompt engineering papers remind me of the early days of deep learning when there were thousands of papers describing different ways to initialize weights. I imagine that tricks to tweak your prompts like: "Answer truthfully", "I want you to act like …", writing "question: " instead of "q:" wouldn’t matter in the long run.

Given that LLMs seem to be pretty good at writing prompts for themselves – see Large Language Models Are Human-Level Prompt Engineers (Zhou et al., 2022) – who knows that we’ll need humans to tune prompts?

However, given so much happening, it’s hard to know which will matter, and which won’t.

I recently asked on LinkedIn how people keep up to date with the field. The strategy ranges from ignoring the hype to trying out all the tools.

  1. Ignore (most of) the hype

    Vicki Boykis (Senior ML engineer @ Duo Security): I do the same thing as with any new frameworks in engineering or the data landscape: I skim the daily news, ignore most of it, and wait six months to see what sticks. Anything important will still be around, and there will be more survey papers and vetted implementations that help contextualize what’s happening.

  2. Read only the summaries

    Shashank Chaurasia (Engineering @ Microsoft): I use the Creative mode of BingChat to give me a quick summary of new articles, blogs and research papers related to Gen AI! I often chat with the research papers and github repos to understand the details.

  3. Try to keep up to date with the latest tools

    Chris Alexiuk (Founding ML engineer @ Ox): I just try and build with each of the tools as they come out - that way, when the next step comes out, I’m only looking at the delta.

What’s your strategy?

What we look for in a resume

2023-01-24 08:00:00

I read every single one of the resumes we receive. Sometimes, I’d talk to a candidate and see that what we perceived as their strongest aspects actually weren’t included in their resume. Occasionally, a candidate would tell me that they didn’t expect their resume to still be screened by humans – had they known, they would have written their resume differently.

The resume evaluation process is pretty much a black box for most candidates. And it is so because few hiring managers have publicly discussed this. I thought I should start the conversation.

“Resume” here refers to both the traditional resumes and LinkedIn profiles. If your LinkedIn is current and contains all the information you want to share, that’s perfectly fine. Some hiring managers I know at other companies actually prefer looking at LinkedIn because they find it faster to find the info they need (e.g. company logos, time spent at each company). I’ve also seen candidates who use their personal websites as their resumes, which also works.

Whether you’re interested in applying to our startup, I hope that my perspective can shed some light on what is happening on the other side of the table, and how to create a resume that helps you put your best foot forward, not just with us, but also with other companies. Caveat: Every company hires differently. What’s right for us might not be right for other companies.

Job hunting is exhausting, and sometimes feels like a shot in the dark. I wish more companies would be transparent about what they’re looking for so that candidates can decide before they apply if they are a good fit.


  1. We hire mostly infra and ML roles, so this post might not be applicable to other roles.
  2. We’re very small so what we’re looking for is specific. If you’re not a good fit for us right now doesn’t mean you aren’t a good fit anywhere else, or won’t be a good fit for us in the future.

Table of contents
…. 1. Our overall approach
…. 2. We look for demonstrated expertise, not keywords
…. 3. We look for people who get things done
…. 4. We look for unique perspectives
…. 5. We care about impact, not meaningless metrics
…. FAQs
…….. Do resumes have to be one page?
…….. What to put on my resume if I have no experience yet?
…….. Do I need a cover letter?
…. Misc tips

1. Our overall approach

Startups like us and big companies, say Google, hire very differently. Therefore, it makes sense that a candidate should apply to a startup differently from how they’d apply to a big company.

Here is what we do, which might differ from what big companies do.

  1. We don’t use an automated system to screen resumes.

    We read every resume. This means that a lot of tricks you’ve read about how to beat automated screening such as include certain phrases of job descriptions in your resume, repeat “hot” keywords, fill your resumes up with random metrics, etc. doesn’t work on us. They can even hurt when you apply to companies like us, because the resume space used for these tricks is the space you aren’t using for things that are relevant to us.

  2. We look for reasons to say yes.

    For a company that receives a large volume of resumes (much larger than us), they’ll need a way to quickly filter out resumes, aka quick ways to say no. For example, their automated algorithm might reject a resume if it lacks certain years of experience or lacks certain keywords. For us, we evaluate each application holistically, and if your application demonstrates the aspects we’re looking for, as discussed in this post, we’d want to get on a call to get to know you more.

2. We look for demonstrated expertise, not keywords

About 90% of resumes we see have a long list of skills. Here are just some examples.

Skill box Skill box Skill box

Initially, I was confused about the purpose of this list, because:

  1. It’s unconvincing. There’s a big gap between “saying that you know something” and “being good at it.”
  2. It can weaken your resume. For example, if you consider common skills like Jupyter notebook and git your competitive advantage (the only reason to include them in your resume), I would automatically assume that you have no other competitive advantage.
  3. Expertise takes time to acquire. I’m skeptical of people who claim to be experts in too many things.

As I tried to figure out why, I found a world of articles that give tips on how to beat automated resume screening. These articles claim that at large companies that receive an influx of resumes, recruiters set rules to surface resumes that contain certain keywords. Candidates, not sure what these keywords are, populate their resumes with all the keywords they can think of.

That is a bad strategy for applying at startups like ours. We’re not looking for keywords. We look for demonstrated expertise. Here are a couple of ways to demonstrate your expertise.

  1. Show how you acquired and use that skill in your job

    Consider these two candidates who both mention Flink on their resumes. Which one would you want to talk to?

    • Candidate A has Flink as one of the 30 items on their Technical skills box.
    • Candidate B explained how they used Flink in their last job: “worked on a feature computation platform built on top of Apache Flink which processes 100,000 events per second and serves 10 different ML models.”

    If during our interview, candidate B can tell me why they chose Flink over other stream processing engines, what issues they’ve encountered, and what changes they wish to see in Flink, I’d be sold!

  2. Share your expertise on public channels such as: StackOverflow answers, open source contributions, papers, blog posts

    Here are some examples that convinced me of a candidate’s expertise in certain technology:

    • A candidate sent us their StackOverflow profile where they have answered over 100 questions on JavaScript.
    • Another sent us a list of their merged PRs to PyTorch.
    • Another sent us a blog post that covers in-depth detail about X that made our team go: “whoa, they really know about X.”

    We understand that not everyone has time to contribute to public discourse, so #2 is optional.

3. We look for people who get things done

A lot of people get attracted to a startup for its vision: what this startup can be in the next 5 or 10 years. However, once you join the startup, what really matters is the day-to-day execution. When I was at Snorkel, our CEO kept telling us: “Startups don’t magically move faster than big companies just because they are startups. Startups move faster because they have to.” We have to get things done. To be able to move fast, we need people who aren’t afraid to roll up their sleeves and tackle difficult challenges.

There are two traits we look for to evaluate whether a candidate can get things done: initiatives and persistence. If you have experience that demonstrates these, please do include them in your resume.


To get anything done, you need to start it. There are a lot of people who can see a problem, but few who would do something about it. We want people who, when seeing a problem, proactively do something about it without waiting to be told. We look for initiatives a candidate has started before, such as:


Once we’ve started something, we need persistence to drive it to completion. Some signals of persistence that I’ve seen:

I’m a bit hesitant when I see candidates who change jobs too frequently, e.g. 5 different companies in 5 years. I understand that not all jobs work out, so it’s okay, sometimes necessary, to move on. However, consistent job jumping can imply that you get bored or give up easily. A year at a job is hardly enough to get deep into a problem space and make significant contributions.

Working at a startup can be tough, and we don’t want someone who joins and leaves at the first sign of challenge. We want someone to stick around and help us through different phases. This process, I hope, can also provide you with a diverse set of experiences to prepare you for whatever you want to do after.

4. We look for unique perspectives

Startups like ours exist to solve a problem that few others can solve. This requires us to see the problem from a perspective that most people don’t. For that reason, we’re not looking for group thinkers or hype chasers. We look for people who can bring a unique perspective to the table.

Your unique perspective can be demonstrated in your career/life choices, your writing, your side projects.

Recently, I had a panel discussing resume screening on our MLOps Discord, and Kyle Kranen, a senior deep learning engineer from NVIDIA, discouraged candidates from listing cookie-cutter projects on their resumes, as they take away space for things that are uniquely your strengths.

Cookie-cutter projects happen when someone simply replicates a common solution to a popular project – titanic, sentiment analysis of tweets, stock trading, chatbots – without any new approach or insight. I see them in about ⅓ of all the resumes we’ve received. These projects are good to practice on. I myself practiced on them when I started. However, they won’t help you stand out as a candidate.

Examples of interesting projects that I’ve seen:

  1. A personal website that looks exactly like MacOS.
  2. A CLI tool to autocomplete your bash commands.
  3. Matcha-making robot arm.

5. We care about impact, not meaningless metrics

I recently asked on LinkedIn how resumes have become so metrics-oriented. Almost all resumes I’ve seen are full of metrics. Metrics are good when they serve to make a point. However, many of these metrics leave me more confused than impressed. For example, here are some of the actual metrics I’ve seen:

People were quick to point out to me: the most common advice for writing resumes is “quantify your impact,” e.g. how much money you’ve saved for your company, how much faster you’ve made your program. Due to this advice, many candidates have told me that they need to have metrics on their resumes. Sometimes, when unable to quantify their impact, they settle for what they can quantify.

Unfortunately, not all that are quantifiable are impactful. Showing the number of code reviews you’ve participated in, without any context, doesn’t say anything flattering. It’d be better if you could show how the learnings from all these code reviews are relevant to the role you’re applying for: e.g. how they helped you become a better engineer.

Clarify your contributions

Every job seeker should make their resume look as good as possible. I’ve talked to many candidates who are uncomfortable talking about their achievements, either due to humility or shyness. This is especially common among female candidates that I’ve talked to.

On the other hand, I’ve seen many candidates who take this too far. For example, last week, a candidate wrote that they have extensive experience in deploying large scale systems because of their one Amazon internship.

Often, I’d see a candidate who takes credit for the entire team. An applicant said on his resume that he “deployed an online prediction model that served 10M daily active users.” During our interview, I found out that he wasn’t at all involved in deployment nor in scaling the prediction service. His task was wrangling data from an S3 bucket. To be clear, this is an important skill, just not what we were looking for. We explicitly said we were looking for experience in deploying models in production. This interview was a waste of time for both of us.

Metrics that we’d love to see

I do care about metrics. I especially appreciate metrics that are presented with two components:

  1. How they can be tied to business objectives.
  2. Your contribution in achieving that metric.

Here are some of the metrics that we’ve found convincing:

Good impact Good impact
Screenshotted from the resumes submitted to our resume critique session

Not all impact has metrics

Fixing a bug or helping a coworker out is impactful, but hard to measure. It’s a great signal when your work receives awards and recognitions, such as:


Do resumes have to be one page?

A candidate told me, during our interview, that he didn’t include his side projects in his resume because he was worried that it’d take him over the one-page limit. To be clear, I’m NOT going to reject a candidate just because their resume is over one page.

The one-page limit suggestion is to help you be concise. Like the French mathematician Blaise Pascal once said: “I have only made this letter longer because I have not had the time to make it shorter.” The longer your resume goes on, the more chance that your biggest strengths will be buried among less important details.

The goal of a resume is not to show everything about you. The goal is to put your best foot forward to show a company why you’d be an excellent addition to their team. Rarely do I see a candidate whose best foot forward can’t be contained within one page.

If you have a lot of experience, you should have more options to choose from to highlight your strengths. A 3-page resume without any focus or highlight shows the lack of judgment for what is important and concise communication.

What I find a lot harder is for candidates new to the field to have sufficient things to fill in one page.

Some tips to shorten your resume

  1. Find repetition and remove them. For example, do you need to list all 5 of your personal projects, or the top 3 would suffice?
  2. Remove irrelevant experience, even if that creates gaps in your resume. Personally, I don’t care about gaps – everyone needs time to recharge and take care of personal matters.
  3. Most resumes I’ve seen can have shorter education sections.
  4. Remove common skills, e.g. git, notebook, MS Word/Excel.
  5. Find a more efficient format, e.g. using multiple columns.
  6. Reduce font size, page margin.

What to put on my resume if I have no experience yet?

This is a tough question. If you don’t have the experience a role needs, it doesn’t matter what you put on your resume, that role is just not going to be a good fit for you.

I have, however, been impressed by candidates who make up for their lack of experience in our field with impressive experiences in other fields.

Personally, I was once given an internship at a startup, even though I had none of the experiences they were asking for. I reached out to the CEO explaining to him the problems I worked on in the past and how the skills I acquired can be transferred to the problems his company was working on.

It’s a long shot, but it’s a shot.

Do I need a cover letter?

The order in which I read an application: resume -> cover letter. If you have stellar experience and skills, we’d want to talk to you even if you don’t have a cover letter. Some hiring managers told me that they don’t read cover letters at all.

I, however, do appreciate it when a candidate writes a cover letter that explains why they want to join us and why they think they’re a good fit for the role. For candidates with unusual transitions (e.g. switching from another career into machine learning), a cover letter could also be a great place to explain their transition.

If you write a cover letter, keep it concise. A cover letter doesn’t have to be a PDF. It can just be an email.

If an information is important, include it in your resume instead of a cover letter. Sometimes, candidates put the important info in the email they sent to our team’s email address, which might not be entered correctly into our application tracking system.

Misc tips


  1. If you’re applying to a small startup, say, of less than 20 people, spend some time researching who works at that startup and email them directly.
  2. Send your resume as a PDF instead of an editable format like Microsoft Word or Google Docs. The editable format might not render correctly on other people’s computers.
  3. If you can get someone who used to supervise you to talk about how great you are, include their quote in your resume. Quotes from friends/family don’t count.
  4. Don’t use abbreviations unless you know for sure your audience knows what you’re talking about.
  1. If you’ve contributed to open source projects, include links to the PRs or a public architectural design you’re the most proud of.
  2. If your GitHub has a lot of repos, pin the important ones on your homepage (you can pin up to 6). You can also write a README to guide your visitors on what repos they would look at.
  3. Don’t include a link to your GitHub link if it’s empty.
  4. If you have publications, link to your Google Scholar profile.
  5. If you’re a co-author of a paper with multiple authors, bold your name in the author list.
  6. If you have more to show than your resume allows, create a personal website. Assuming that you have basic coding skills, it’s fairly easy to create a free personal website using GitHub Pages.


  1. If you’re still in school, put education earlier in your resume, as it helps the person screening your resume knows where you are and what you’re looking for. If you’ve been working, put your work experience first.
  2. If you’re still in school, list your expected graduation date so companies know when you can start working full-time.
  3. If you’ve been working for 2+ years, remove your GPA and coursework. I care more about your work experience.
  4. Similarly, course projects should only be listed if you think they’re better than the average course project.

Resume resources

There are many resources on how to write a resume. Here are some of them:

  1. Reddit has a great guide on engineering resumes
  2. Writing a great resume : YC Ultimate Startup Job Guide
  3. Google Recruiters Say Using the X-Y-Z Formula on Your Resume Will Improve Your Odds of Getting Hired at Google
  4. Why Resume Screening Doesn’t Work
  5. Companies Need More Workers. Why Do They Reject Millions of Résumés? - WSJ


Resume writing is a one-off process: it’s the art of arranging what you already have in a way to best help you stand out. It’s hard to write a good resume without having things to put on it.

Resume building is an ongoing process: continuously improve the experience, skills, recommendations, etc. that you have. If you’re good at building, you might not even have to write a resume. If your work catches companies’ attention, they’ll reach out to you.

A few friends told me that “resume building” has a bad connotation, as it might suggest that we should only do things to put on their resumes and ignore important but less shiny things like fixing a bug or helping a coworker out.

What we’re really looking for isn’t the best resume or the best resume builder, but a team member, someone who cares. Resumes might not be the perfect tool to show that, and that’s why we err on the side of talking to that person to find out.

Our startup is still young and we’re still learning. I’d love to learn about this process at other companies, and hear from you what we can do to make the process easier for you.

And of course, we’re hiring for strong infra and ML engineers!! If you think we might be a good fit for you, do reach out.