2024-12-19 08:00:00
How do you get up to speed with AI engineering? Unfortunately, I don’t know of any good consolidated resources, so I’m going to attempt to make one here. My first attempt at this focused more on what an AI engineer is and made only a feeble attempt at providing resources to get started. Let’s go!
The reason it’s difficult is that AI Engineering is so new, it’s bleeding edge. People still scoff at the idea that it’s even a title that someone can hold. It’s moving so fast that 3 months is roughly equivalent to a decade, so any resources that might exist become obsolete within a few months.
LangChain is used pervasively in tutorials. They're usually one of the first to implement a new prompting technique right after the paper comes out. However, nobody I know uses it in production. Many attempt to, but then replace it with either a LangChain competitor or their own code.
Instead:
There’s a very long list of other good options, both open source & proprietary. The reason LangChain doesn’t work is that the code isn’t structured well. It works seamlessly until you run into a case that they didn’t explicitly plan for. Experienced software engineers would say that LangChain doesn’t “compose well”.
There’s no shortage of people on LinkedIn or X that are hawking “one weird trick”, the magic prompt, or in one way or another trying to convince you that there are special words or phrases that magically make an LLM do your bidding. If it sounds like a salesman trying to sell you something, it’s definitely a salesman trying to sell you something. In fact, they’re almost always the sales type, and very rarely have any sort of engineering experience. Avoid.
This is a contentious topic; I've written about it. They can be an asset, but beware of blindly taking advice from people who have been deep into traditional pre-LLM machine learning.
They’re both amazingly intelligent and unexpectedly dumb. The only real way to know what you’re dealing with is to use them a lot, for everything. Yes, you do need to get burnt. Just do it in a way that doesn’t matter too much. The goal here is to develop an instinct. You should be able to tell yourself, “if I do X it’ll probably go poorly, but if I rephrase it as Y then I can be confident in what it says”.
You should know RAG inside and out, along with Chain of Thought (CoT) and the ReAct pattern. Skim the rest of this post for more leads.
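To make RAG concrete, here's a minimal sketch of the pattern; `embed`, `vector_search`, and `llm` are hypothetical stand-ins for whatever embedding model, vector store, and LLM client you actually use:

```python
# Minimal RAG shape: embed the question, retrieve the closest chunks, and
# stuff them into the prompt. embed(), vector_search(), and llm() are
# hypothetical placeholders, not a real library.
def answer_with_rag(question: str, k: int = 5) -> str:
    query_vector = embed(question)                   # 1. embed the query
    chunks = vector_search(query_vector, top_k=k)    # 2. retrieve the nearest chunks
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)                               # 3. generate a grounded answer
```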
Better yet, get a gaming laptop with an NVIDIA graphics card and Linux. But if not, get an Apple Silicon Macbook (M1, M2, M3, etc.). Main memory and GPU memory are one shared pool, so you can rock some surprisingly big models, all local.
I’m a big advocate of local LLMs, especially for AI engineers. They’re worse than the big SOTA models, which means you learn the sharp edges faster; learn to properly distrust an LLM. Plus, you can send logs with passwords to a local model, but it’s highly unwise to send passwords to OpenAI, Anthropic, or any computer that isn’t your own.
Here are several large areas to learn about. Not all of them will be important to you.
As new models are released, their capabilities increase. As an AI engineer, it’s crucial you stay on top of this. You should know about the pre-training scaling laws that have brought LLMs into the public’s eye.
Ways that models improve:
Most of this shows up in blog announcements from the AI labs and gets announced on X.
AI Engineering is still being figured out. If you go back far enough in programming history, languages didn't even have control structures like `if`/`then` or `for` loops. It took time to figure that stuff out. We're in a similar spot with AI engineering, where the patterns are still emerging.
Check out Prompting Guide for a comprehensive list of current patterns. Also subscribe to Latent Space and read Simon Willison to keep up to date.
Outside of the AI labs, you may want to watch some providers:
Additionally, pay attention to vector stores:
Also, look into edge compute: Ollama for personal computers, vLLM for Linux servers, but also pay attention to the work being done to run LLMs on IoT devices and phones.
Generally, do not do this unless you know you need to. It’s often tempting to try to fine tune, but it’s usually a red herring.
Topics:
This is quickly evolving and there’s unfortunately not much here.
Topics:
As I’m writing, this is a hot topic. The train time scaling laws seem to be fading and the new promising area is having models “think” longer during inference (see o1). This also seems to be a significant key to agents.
Generally follow any of the sources below. The information is spread out.
There are two kinds of perspectives here:
The former isn’t very interesting, it’s just the ReAct pattern. The latter is an area of active research. Within agents you have topics like:
In my experience, present agents are like riding a unicycle. It’s possible to make them work, but it takes a lot of experience to not fall off. The main blocker to having them rolled out more broadly is reasoning & planning. I think Test Time Compute (TTC) might be part of the puzzle, others are betting on world models. In reality, it’s going to be a bit of everything; the whole field needs to evolve.
The AI labs' documentation often also has good primers:
This is a new one for me, but some people highly recommend following people on Github first and only then maybe following individual repos. It's far better to follow people, because then you learn about new repos; following repos gets noisy very fast, so only do that when you want to keep close tabs. Look for new repos, new ideas, and new trends.
See People to Watch for Github links.
[Huggingface](https://huggingface.co/) is like "Github for AI/ML models". Typically, the code for the model is kept in Github and the model artifacts are hosted on Huggingface. The `transformers` library makes it very easy to download models off Huggingface and run them, fine-tune them, disassemble them and use just the tokenizer, or steal the attention layers from an LLM to fine-tune an embedding model, etc.
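For instance, here's a minimal sketch of pulling a small model off Huggingface and generating with it; the model ID is just an example:

```python
# Download a small model from Huggingface and run one generation. You could
# equally grab just the tokenizer if that's all you need.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM-135M"  # example model; swap in whatever you like
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The three laws of robotics are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```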
Huggingface also offers inference, so you can host model inference there. And it's not limited to model inference: the Open LLM Leaderboard, for example, is hosted there too.
Additionally, a lot of papers are posted to huggingface (sometimes instead of arXiv). There seems to be a social networking aspect to it, where you can comment on papers, follow authors, etc. It’s safe to say that huggingface is a core part of the AI ecosystem. While it’s not an AI lab in the traditional sense, it’s in many ways just as critical to AI development, maybe more so.
If I forgot something contact me, or else use the Github repo for this blog to create an issue or PR. Or add to one of the discussion links.
2024-12-10 08:00:00
Yesterday I posted here about becoming an AI Engineer and made a statement that prior ML experience is often a liability for transitioning into AI engineering. That turned out to be quite the hot take! In this post I’ll incorporate feedback and try to expand that into a more balanced take. I’ll expand on the perspective of it being an asset, as well as where it’s a liability.
First of all, the responses were very polarized, either enthusiastic agreement or disagreement (no in between, so you know it was extra spicy). That seems like a strong indicator that it’s a difference between archetypes.
My hunch is that those who disagreed with my statement tend to align with the “researcher” archetype.
The researcher archetype is probably poorly named, although I'm not sure what a better name is. They're model-centric.
I originally formed that opinion back in 2022 about a week or two after trying to build on top of LLMs for the first time. I was talking to a data scientist (who I’m close with both personally and professionally) about how to incorporate LLMs. I recall there being a ton of friction in those initial conversations, which led me to state something overly dramatic like, “I think data science is going to be dead post-LLM”.
Since then, I’ve had a lot of people independently validate that opinion. One take I’ve heard went something like this:
ML people think their job is to produce a model, whereas (pure) engineering folk do not, which leads engineers to view fine-tuning as an optimization that’s often premature.
I've also used the argument that ML folk view Occam's Razor to mean that they should produce the simplest (smallest) possible model first and increase the model complexity as needed, whereas engineers tend to think Occam's Razor means they should start with the approach that's most likely to work easily (the biggest, baddest LLM available) and work downward toward more efficient models to optimize costs.
I’ve talked to hiring managers who explicitly seek “Please No ML Experience”. In their words, they’ve seen ML people push their org into spending tens or hundreds of thousands of dollars fine tuning models. Those projects fail at an unfortunately high rate and deliver slowly. Whereas simply prompting better will often get you close enough to launch (and therefore mitigate project risk).
Rahul Dave posted on Bluesky that it’s sometimes difficult to know when you need to fine tune, and he found that his prior ML experience was critical in identifying that situation.
That’s a very good point. Organizationally, the act of identifying that a threshold has been crossed is very difficult. Historically in my engineering experience it’ll show up as
We built component X to solve problem Y. But 3 months ago problem Y disappeared due to a change in management/customers/business and now component X only causes people friction. We’re stuck with it forever because nobody realized that the original problem it solved is now gone.
One of the big ways a staff+ engineer contributes is to identify and explain change. With LLM apps, it often takes ML intuition to be able to correctly identify the situation where performance isn’t good enough (and therefore a huge architectural change is needed).
Vicki Boykis took another tack, arguing that the non-determinism of LLMs is unfamiliar to software engineers:
I think software engineering in general favors things like: unit tests where you get the same input and same output, a `for` loop `n` times will only loop through `n` times, type checking (in most languages 😅) confer correctness etc. LLMs are none of that, and lossy compression to boot.
Her experience is that, for this reason, ML people have an easier time transitioning into AI engineering. I personally think some engineers, e.g. distributed systems background, are already adept at dealing with non-determinism, so this isn’t much of a hurdle for them. But she’s correct, this is a huge hurdle for a lot of engineers. If you’re a hiring manager, you should probably address non-determinism in the interview.
If you have too much ML experience, your organization will definitely fine tune models and it will cost a lot of money. If you have too little, you won’t fine tune any models and you’ll be leaving performance on the table.
Fine tuning historically has a much riskier track record, which leads a lot of people to recommend against fine tuning. However, it might be wise to include a staff+ engineer with ML experience on your team so they can identify when your team needs to transition into the researcher archetype.
2024-12-09 08:00:00
You want to be an AI Engineer? Do you even have the right skills? What do they do? All great questions. I’ve had this same conversation several times, so I figured it would be best to write it down. Here I answer all those, and break down the job into archetypes that should help you understand how you’ll contribute.
An AI engineer is a specialized software engineer that integrates GenAI models into applications. It can involve training or fine-tuning LLMs, but it often does not. It can involve working on low-level harnesses, like llama.cpp or vLLM, but it often does not.
More often AI engineering involves building UIs, APIs, and data pipelines. It can look wildly different from job to job. The common thread is that you send prompts to an LLM or image model, e.g. via OpenAI’s API, and use the result in an application somehow.
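As a concrete (if minimal) sketch of that common thread, using OpenAI's Python SDK; the model name and prompt are just examples:

```python
# The core move of AI engineering: send a prompt, get text back, use it in
# your application. Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[
        {"role": "system", "content": "You summarize support tickets in one sentence."},
        {"role": "user", "content": "Ticket: the nightly export job has been hung for 3 hours."},
    ],
)
summary = response.choices[0].message.content  # feed this back into your app
```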
You’ll be a great AI engineer if:
Seriously, you don't typically need to have AI experience; it's a new field, so not many people actually have prior experience. It's tempting to think machine learning (ML) experience is helpful, but it's actually often more of a liability[1] to approach problems like a data scientist does.
Here are a few archetypes of AI engineers distinguished by how they look at problems. You’ll likely know which archetype you are based on what you already do.
An extension of a data engineer, this archetype is most likely to use RAG architecture to build AI applications using company databases or knowledge banks. When asked, “how can I make this better?”, your answer is to improve the quality of the data, or how it’s indexed, or the model used to index it, etc. All problems center around the data.
This archetype should have a thorough understanding of RAG architecture and embeddings, holds strong opinions about vector databases vs just using a vector index, and maybe can diagram out how the HNSW algorithm works on the back of a bar napkin.
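If "just using a vector index" sounds abstract, here's the brute-force version of the idea in a few lines of numpy; HNSW and vector databases exist to approximate and accelerate exactly this:

```python
# Brute-force nearest-neighbor search over embeddings: score every document
# against the query by cosine similarity and keep the top k.
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # cosine similarity per document
    return np.argsort(-scores)[:k]     # indices of the k most similar documents
```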
This archetype of AI engineer views "intelligence" as an inseparable collaboration between human & AI. They aren't necessarily a UX designer or frontend engineer, but you typically can't live as this archetype without slinging a fair bit of React code.
If you’re living this archetype, you might work with the Data Pipeline Archetype, or even also be one. But when it comes to, “how can I make this app better”, your answer is typically “tighter collaboration with the user”. You work to improve the quality of information you glean from the user, or use AI to improve the user’s experience with the app or the value they get out of it.
You might be a UX Archetype if you admire ChatGPT, Cursor, or NotebookLM for how they helped us reimagine how we can use LLMs. You probably get excited about new LLMs that are faster or lower latency, that are multimodal, or that open up new modalities.
The Researcher Archetype isn’t necessarily a researcher, but they’re focused on the models and algorithms. When asked, “how can I make this app better”, their answer is about algorithms, new models, more compute, etc.
The Researcher Archetype is most likely to fine-tune their own model. To be successful as this archetype, you need to spend a lot of time keeping track of AI news on X/Bluesky/Reddit. The AI space moves fast, but as this archetype especially, you ride the bleeding edge, so it takes extra effort to keep pace. Make time to read 1-5 papers per week, and become adept at using NotebookLM.
Also, hack a lot in your spare time. You should definitely be running models locally (e.g. via Ollama). You should be comfortable running pytorch models via the Transformers library in a Jupyter notebook. Your eyes probably light up every time SmolLM is in the news. And you may have a desktop with an RTX 3060 (and not for gaming).
There are probably several others. For example, I have a poorly-understood concept of an "artist" archetype that uses AI to create something beautiful. Maybe there are more for safety, philosophy, and other areas. The ones outlined above are what you're most likely to be hired for.
For the most part, AI & Software engineering are the same. The main difference is how fast the AI field moves. Because of this, you have to be extra okay with throwing out all your work from time to time. For example, if a new framework comes out and you rewrite everything in DSPy.
(By the way, you should really check out DSPy 🔥)
Another thing is management. I keep thinking about how using AI as a tool in your work feels a lot like management, or at least being your own tech lead. I’m not sure we’ve properly equipped most engineers with the right skills, but if you thrive in the next few years, you’ll be well set up to go into management, if that’s your thing.
You're already a solid engineer, so you're most of the way there already. The other part is getting your continuing education set up.
I personally am not a fan of courses. There's an absolute ton of them out there, but the mere fact that a course has to be prepared in advance and delivered many times in order to make money implies the material is going to be a bit stale, since AI moves so fast.
My recommendations:
Data archetypes should check out episode S2E16 from the How AI Is Built podcast. It goes into detail on strategies for improving the quality of the source data.
All archetypes should probably have a solid social media source. I think 🦋 Bluesky is the best, it has starter packs to get you zeroed into the right group very quickly. I know X has a lot of great chatter, but it’s extremely noisy, so it’s hard to recommend. Feel free to scrape my account for followers.
That’s it! I hope that helps.
2024-12-08 08:00:00
I just released entrapix, which is a fork of Ollama that raises a `ConfusedAgentError`, so to speak, when the LLM becomes unsure of itself.
Entrapix is a riff off of entropix, which is an experiment with dynamic samplers. Basically, you shouldn't have to set `top_p`, `top_k`, `temperature`, etc. manually; entropix streamlines the whole process by watching the internal state of the model and reacting accordingly. I wrote about it a while back.
Entrapix is much simpler. It doesn't do the dynamic sampler stuff; it just detects the case when the LLM is high entropy / high varentropy and exits immediately, setting `done_reason = "trap"` in the API.
|                     | Low Entropy            | High Entropy                              |
| ------------------- | ---------------------- | ----------------------------------------- |
| **Low Varentropy**  | Easy, churn out tokens |                                           |
| **High Varentropy** |                        | Uncertainty! `raise ConfusedAgentError()` |
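Roughly, the trap condition boils down to two statistics over the next-token distribution. Here's a sketch of the idea in PyTorch; entrapix itself does this inside the Ollama fork, and the thresholds default to 1 here only to mirror the CLI example below:

```python
# Sketch of the trap condition: compute entropy and varentropy of the
# next-token distribution and bail out when both are high.
import torch
import torch.nn.functional as F

class ConfusedAgentError(Exception):
    pass

def check_confusion(logits: torch.Tensor, ent_thresh: float = 1.0, varent_thresh: float = 1.0):
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum()                       # expected surprisal
    varentropy = (probs * (-log_probs - entropy) ** 2).sum()   # variance of surprisal
    if entropy > ent_thresh and varentropy > varent_thresh:
        raise ConfusedAgentError(f"entropy={entropy:.2f}, varentropy={varentropy:.2f}")
```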
The CLI tool is wired up. At minimum, the `--entrapix true` flag is needed to enable it.
```
❯ ./ollama run llama3.1 "How much energy is required for the information contained in a bowling ball to escape a black hole?" --entrapix true --entrapix-threshold 1 --entrapix-varent 1
A fun and imaginative question!

In theory, if we were to place a bowling ball near the event horizon of a black hole, its information would indeed be affected by the strong gravitational field.

However,
Error: agent is confused and has stopped
```
I haven’t tried it yet, but I imagine most would want to set the thresholds in the modelfile.
I built the concept to try out in dentropy. Dentropy is an app that helps overactive people keep track of their lives. The thing about our users is they often write notes and have no idea what they meant a few hours later. Naturally, the LLM also has no idea what you meant, and it shows up in the entropy/varentropy. We handle a confused agent by asking the user clarifying followup questions.
However, I imagine it’s going to be more common to just do a different query and RAG from a different angle.
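As an application-side sketch, assuming entrapix keeps stock Ollama's `/api/generate` endpoint and surfaces the trap via `done_reason` as described above:

```python
# Reacting to the trap signal: either ask the user a clarifying question or
# retry with a reworded query / a different RAG angle.
import requests

def generate(prompt: str) -> dict:
    return requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
    ).json()

result = generate("Remind me what 'the thing before the place' meant?")
if result.get("done_reason") == "trap":
    # The model is confused: don't let it bluff. Ask a follow-up question,
    # or re-query the knowledge base from a different angle and try again.
    print("I'm not sure what you meant. Can you give me a bit more context?")
else:
    print(result["response"])
```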
The philosophical difference between entropix and entrapix is that the original bets on the information encoded inside the model, whereas my entrapix bets on things found outside the model.
The agent-building crowd is similarly split. Some folk think you should build agents out of systems of LLMs & other components (e.g. the DSPy crowd). Like my entrapix, they think they can guide models via information found in databases, setting up judge LLMs or planning LLMs, etc.
In an agent, a systems approach is going to start with a loop outside the LLM and call the LLM (& other resources) from within it:
```python
while True:                 # the control loop lives in your code, not the LLM
    convo = plan(convo)     # decide what to do next
    convo = act(convo)      # call the LLM and/or other resources
    convo = measure(convo)  # judge progress; break out when done
```
Whereas the other approach is to embed the loop inside the LLM itself. Recall that an LLM is just predicting the next token over and over in a loop, right? Well, when does it stop? LLMs are fine-tuned to stop relatively soon, but realistically, there’s no reason they can’t keep going until their context window fills up (and then, if you’re good at forgetting, maybe keep going forever).
This is called an inner loop.
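Stripped down, the inner loop is nothing more than this; a sketch with a small Huggingface model and greedy decoding:

```python
# The LLM's own "inner loop": predict the next token, append it, repeat until
# the model emits its end-of-sequence token or the context budget runs out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # small example model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The agent's next step is", return_tensors="pt").input_ids
with torch.no_grad():
    while ids.shape[1] < 64:                                      # context budget
        next_id = model(ids).logits[:, -1, :].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.eos_token_id:                    # model stops itself
            break
print(tok.decode(ids[0]))
```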
In that world, the original entropix helps continually nudge the LLM back into line, after every token. In a sense, it's implementing a generic version of the `measure` step. Rather than building a control loop for every new agent and problem, you just have a single "AgentGPT" app that takes any problem and brings it to completion.
I don’t know which approach will win out in the end. I have a hunch that the inner loop might, but I build for technology that exists today. And today LLMs & LRMs are quite hard to wrangle in real world applications. I hope entrapix gives more hooks that you can use to keep outer loop apps functional, today.
2024-11-21 08:00:00
A year ago I wrote a post called LLMs Are Interpretable. The gist is that LLMs were the closest thing to “interpretable machine learning” that we’ve seen from ML so far. Today, I think it’s fair to say that LRMs (Large Reasoning Models) are even more interpretable.
Yesterday DeepSeek released their reasoning model, R1. For kicks, I threw it a riddle that my 8 year old loves:
If you’re flying over a desert in a canoe and your wheels fall off, how many pancakes does it take to cover a dog house?
Most people will (should) do a double take, and then give up. It’s a nonsense question. Even if you try to estimate the sizes of doghouses and pancakes, there’s so much contention about both that the estimates are also meaningless. This is a test of a highly ambiguous situation, how does the model handle it?
The transcripts are fascinating, I’ll quote some passages here, but really you should go ahead and read the full reasoning trace. The final answer isn’t terribly interesting; tl;dr it figures out that it’s a nonsense question.
Off the start, it tries to break the problem down and immediately notes a lot of inconsistency:
First, “flying over a desert in a canoe.” Well, canoes are typically used on water, not in the air or over deserts. So that’s already a bit odd. Maybe it’s a metaphor or a riddle that plays on words. Then it says, “your wheels fall off.” Canoes don’t have wheels, so that’s another strange part. Maybe the wheels are part of something else, or maybe it’s just adding to the confusion.
Then it doubts itself and tries looking at it from a different angle, puns and double meanings:
Perhaps I’m approaching this the wrong way. Maybe it’s a riddle where the answer isn’t literal but more about wordplay or logic. Let’s consider if there’s a pun or a double meaning here.
And then metaphors:
Wait a minute, maybe “wheels” isn’t referring to actual wheels. Could it be a metaphor for something else? Or perhaps the entire first part is just a distraction, and the real question is about pancakes and a dog house.
Then it begins to think it's being tricked:
But then why include all that other information?
Then it begins to suspect that it’s nonsense:
Alternatively, maybe the key is to realize that the scenario described is impossible or doesn’t make sense, which might imply that the answer to the question is also nonsensical or that it’s a trick question. If you’re flying over a desert in a canoe with no wheels, perhaps the number of pancakes needed is zero because the scenario itself is impossible.
Followed by my absolute favorite part, where it's worried that it's being dismissive:
But that feels a bit too dismissive. Maybe there’s a deeper meaning or a specific answer that I’m missing. Let me try to think of it differently.
It goes on like this for pages and pages. It runs calculations. It suspects diversion. It uses the phrase, “In conclusion,” followed by 10 thousand more characters of reasoning. The whole thing is a trip.
When I wrote my original post about LLMs being interpretable, I got flak because people pointed out that it doesn’t help ML Engineers understand how the model works, or how to fix a bug, etc. That’s a valid criticism, but misses the point.
With ML, lots of different people want an explanation for what the model said:
And a whole lot more, the list is very long and varied, and if you dive into any of them, there’s not a ton of overlap in what they want. It’s not realistic to expect that a single interpretability technique could address every party’s concerns.
In traditional ML, I would use SHAP to generate ML explanations for LightGBM models. It would give you a vector that mirrored the feature vector but would tell you how much each feature contributed to the prediction. There are even fancy proofs showing that this is the optimally fair solution for assigning feature importance.
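In code, that workflow looks roughly like this; a sketch with synthetic data, purely for illustration:

```python
# SHAP for LightGBM: TreeExplainer returns one contribution per feature per
# prediction, mirroring the feature vector. The data here is made up.
import lightgbm as lgb
import numpy as np
import shap

rng = np.random.default_rng(0)
X = rng.random((1000, 4))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

model = lgb.LGBMClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-feature contributions for 5 predictions
```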
The thing is, when we showed these explanations, via a visualization, to very busy nurses, the explanation caused them to lose trust in the model, even though the model had a radically better track record of making the prediction than they did. Why? Because it didn't consider some aspect that they deemed to be critical. (This is when I learned the importance of full moons in emergency department care.)
Interpretability is hard. And we usually get it wrong.
I know it’s crazy, but I think LRMs might actually address interpretability concerns of most people.
The busy nurses. They don’t have time to read the reasoning trace every time, but a glance through it once in a while is enough to build faith in it.
Lawyers. The trace is so verbose that it thoroughly uncovers any bias, and gives lawyers a lot to work with to figure out if a model used some questionable path of reasoning.
ML Engineers (well, AI engineers). It’s not perfect, but the trace gives a ton of information about which parts of a RAG inclusion influenced it, and why.
Researchers. This one is more involved, but when you combine reasoning traces with other tools to introspect logits and entropy, you can get a real sense for how the algorithm works and where the big gains might be.
In general, users just want to trust it (or not trust it, that’s valuable too). For me personally, the trace boosted my trust in the model a lot. It showed me:
The trace is too large to read most of the time, but I'd love to throw it into an LLM, like Qwen 2.5, and have it tell me what I could do differently to get better results out of the LRM.
Yes, LLMs were a huge boost for interpretability, but LRMs really close the loop. The reasoning trace is easily ignored, but it’s also easily used to understand what the model did. I think there’s even more room for further interpretability too.
2024-10-29 08:00:00
It’s Github Universe again, and Github announced some new features that bring it up to par with Cursor. “Yay”, I mutter to myself as I contemplate how bored I am of this.
What I really want is Zed, a hot new text editor written in Rust with first class support for collaborative text editing. It’s just so stinkin’ fast, it’s a complete joy to use.
It’s just that Zed is lagging Cursor/Copilot in AI capabilities that are apparently now very important to me.
Maybe your hot new editor is something else. Neovim is neat. Or maybe you like old editors. Whatever your jam is, AI text editing feels like an unexpected table-stakes feature these days, much like programming language support.
VS Code co-launched with LSP (Language Server Protocol). It's a way for new text editors to support every language. It helped VS Code become popular, because, while not every language team was willing to do work for Microsoft to get VS Code support, they were willing to build an LSP server and never worry about editor support ever again.
I want the same thing with AI. Every text editor implements some AiSP (AI Server Protocol), and lots of AI vendors differentiate by offering better AI completion services. No need to build an entire editor or fork VS Code again.
I don’t have a solution for this, and I’m not sure what exactly it would look like. But I badly want to go back to 2022 and use whatever my favorite text editor is in the moment. I just want to code, and love it. (Sorry VS Code, but I don’t actually like you.)