
GPU Programming from Scratch

2025-03-17 08:00:00

Jeremy Howard says: I’m really excited to introduce you all to Sarah Pan, an extraordinary and inspiring AI researcher who began working with Answer.AI whilst still at high school (and she had a first-author paper accepted at NeurIPS too)!

Sarah’s first project with us is WebGPU Puzzles, which is the best way I know of to get started with GPU programming fundamentals today. With it, you can begin learning GPU programming right in your browser. I was astonished at how Sarah was able to learn, from scratch, GPU programming, WebGPU, and gpu.cpp in a matter of weeks, to a level where she could pull this off.

I’ve asked Sarah to share a bit about her story, which she has done in the post below. She was also kind enough to spend some time doing an interview with me, which I’m sure you’ll agree is a fascinating insight into the life of a very special person.

Hey! My name is Sarah Pan and you might’ve seen my name attached to the WebGPU Puzzles project (based on Answer.AI’s gpu.cpp). A little about me: I’m a research fellow at Answer.AI as well as a first-year student at MIT! This means that outside of classes and all the other fun chaos of MIT, I work with the Answer.AI team on various projects, as well as on my own research.

The Origin Story

You might be wondering how I got here. (Sometimes, I do too.) But my AI journey began towards the end of middle school when my older brother introduced me to fast.ai. At the time, having R2D2 as my favorite Star Wars character was enough to propel me into taking the course.

Practical Deep Learning took a top-down approach to teaching about neural networks. This meant that the important high-level ideas weren’t gatekept by the nitty-gritty. Being able to understand the inner workings of complex systems without having taken a math class past Algebra I, much less having a college degree, was very refreshing.

Fast forward to junior year of high school: I had a few more AI experiences under my belt and was ready for more. I joined MIT Primes, a research program that connects high schoolers to researchers in mathematics, computer science, and computational biology. There, my mentor, Vlad Lialin, showed me the ropes to everything from effectively reading academic papers to adopting the “iterate fast” ethos.

Together, we worked on the project that would become my first publication. I don’t want to bore you with the details, but we essentially used a process reward model 1 in RL to improve the reasoning abilities of LLMs.

Though this sounded pretty straightforward at the start, I was quickly proven wrong. There were many moments where learning auxiliary skills was essential to implementing the ideas I really cared about. If anything, a summer of trying to fit billion-parameter LLMs onto dual 3090s taught me about the importance of good engineering habits. But soon enough, October rolled around and my fingers were crossed for a NeurIPS paper.

NeurIPS

I don’t really know of any other way to describe the experience but surreal. The poster halls were huge and, almost out of nowhere, there were so many people with the same interests as me. All those ideas I saw on Twitter and read about on various blogs materialized in front of me.

I remember bumping into Jeremy entirely by chance2, and we stayed in touch after the conference. Little did I know, those minute engineering problems I encountered over the summer would resurface in conversations with him and the people who would become my mentors and collaborators at Answer.AI.

As of late

Last summer, I collaborated with Austin Huang on creating WebGPU Puzzles. And fun fact, that was my second encounter with GPU programming, so I was a little intimidated going into it. I had a general understanding of what CUDA was and had stumbled upon Sasha Rush’s GPU Puzzles at some point, too. But soon enough I realized that the ideas those experiences taught me would be pretty useful.

One thing I appreciated about Sasha’s puzzles was that my main focus was on solving the puzzles themselves. For one, they were hosted in a Google Colab notebook, which has a beginner-friendly interface. And when it came to syntax, the CUDA puzzles used Numba, which doesn’t require much knowledge beyond Python and NumPy. The accessibility and user-friendliness of these puzzles stripped away unnecessary complexity and distilled parallel computing into a clear set of principles. That way, instead of worrying about all things C++, I could focus on something more akin to a coding challenge.

I wanted to replicate this for those who wanted to test out WebGPU/gpu.cpp, or even those just “breaking into” GPU programming. From there, I set out on developing a WebGPU version of Sasha’s CUDA puzzles with a detailed set of solutions for ultimate beginner-friendliness. Since then, I’ve returned to my research roots: I’m currently working on a reward model project3.

Beyond research, I’m a first year at MIT studying math and computer science. My favorite class thus far is probably discrete math (it’s very well taught!) but I regret not signing up for more math classes.4 Outside of school, I love watching the sun rise while rowing on the Charles River, reading AI Twitter, and Facetiming my dog.

Footnotes

  1. A process reward model (PRM) provides feedback at each step of a reasoning process, unlike outcome reward models (ORMs) which evaluate the entire response, offering more granular and structured guidance for improving complex tasks.↩︎

  2. Ultimate full circle moment for me!↩︎

  3. preprint soon!↩︎

  4. Have to knock out those general institute requirements↩︎

TIL: Masked Language Models Are Surprisingly Capable Zero-Shot Learners

2025-02-10 08:00:00

Welcome to this post! As a “TIL”, it’s a purposefully smaller blog post, containing just the key details. If you’d like to know more, head over to the technical report or play with the model on HuggingFace!

TL;DR

Traditionally (with some exceptions, of course), encoder models such as BERT are used with a task-specific head on top of the core encoder model. Functionally, this means that we discard all the language modelling goodness stored in the Masked Language Modelling head (the one used during pre-training), and seek to simply re-use the backbone to perform various tasks.

This works really well: there’s a reason why it’s the dominant paradigm! However, what if the generative head itself could actually perform most tasks, even zero-shot? This is what we tried, and it works pretty well! We introduce ModernBERT-Large-Instruct, an “instruction-tuned” encoder fine-tuned on top of ModernBERT-Large with a shockingly simple mechanism. It can be used to perform classification and multiple-choice tasks using ModernBERT’s MLM head instead of task-specific heads. Unlike previous approaches, our method requires no architectural changes or complex pipelines, and still achieves strong results across various tasks.

  • It’s surprisingly capable at knowledge QA tasks, where encoders are usually weak: On the MMLU-Pro leaderboard, it outperforms all sub-1B models like Qwen2.5-0.5B and SmolLM2-360M, and is quite close to Llama3-1B (trained on considerably more tokens, and with 3x the parameters)!
  • On NLU tasks, fine-tuning ModernBERT-Instruct matches or outperforms traditional classification heads when fine-tuned on the same dataset.
  • We achieve these results with a super simple training recipe, which is exciting: there’s definitely a lot of room for future improvements👀👀

I just want to try it!

The model is available on HuggingFace as ModernBERT-Large-Instruct. Since it doesn’t require any custom attention mask or anything of the sort, the zero-shot pipeline is very simple to set up and use:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device == 'cuda':
    model = AutoModelForMaskedLM.from_pretrained(model_name, attn_implementation="flash_attention_2")
else:
    model = AutoModelForMaskedLM.from_pretrained(model_name)

model.to(device)

# Format input for classification or multiple choice. This is a random example from MMLU.
text = """You will be given a question and options. Select the right answer.
QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
CHOICES:
- A: commutative semi group
- B: abelian group
- C: non-abelian group
- D: None of these
ANSWER: [unused0] [MASK]"""

# Get prediction
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = outputs.logits[0, mask_idx].argmax()
answer = tokenizer.decode(pred_id)
print(f"Predicted answer: {answer}")  # Outputs: B

For more, you’ll want to check out our mini cookbook GitHub repository, with examples on how to fine-tune the model!

Introduction

Encoder models traditionally perform best on all tasks with a task-specific head. While not necessarily an issue, this feels like a bit of a waste: the MLM head, its original pre-training head, is fully discarded. In practice, this works, but it also feels like we might be leaving something on the table. Additionally, this places great restrictions on zero-shot capabilities: since task-specific heads are usually required, it’s been necessary to find various tricks to get around this and still get good zero-shot performance.

A brief, incomplete history of downstream uses of MLM encoders

Zero-shot classification with encoder models has been an active area of research, with various approaches tried over the years. The most common approach has been to repurpose textual entailment: after training on tasks like MNLI, models are used to predict whether a given label is entailed by the input text. Some very powerful models have been trained on the large-scale TaskSource datasets, such as tasksource/ModernBERT-large-nli.

This is also definitely not the first piece of work exploring generative BERTs as multitask learners: there’s been some work on prompting, sample-efficient training via the pattern-exploiting training (PET) method, or even making the models auto-regressive! Some approaches are even pretty similar to ours, like UniMC, which has shown promise by converting tasks into multiple-choice format using semantically neutral verbalizers (e.g., “A”, “B” instead of meaningful words) and employing custom attention masks.

However, all of these methods come with drawbacks: some are either brittle (particularly to different verbalizers) or reach performance that is promising-but-not-quite-there, while others yet reach very good results but add considerable complexity. Meanwhile, in decoder-land (or, if you will, LLMTopia), instruction tuning has progressed extremely rapidly, and big, scary LLMs have become very good at generative classification, especially zero-shot, thanks to their instruction training.

But this, too, has drawbacks: small LLMs are routinely outperformed by encoders, which can even match the larger ones once fine-tuned! Additionally, the computational cost of running an autoregressive LLM, even one on the smaller side, is generally considerably greater than that of an encoder, which performs tasks in a single forward pass.

ModernBERT-Large-Instruct

Our approach aims to show that maybe, just maybe, we can have our cake and eat it too: what if an MLM could tackle tasks (even zero-shot ones!) in a generative way with a single forward pass, and could be easily fine-tuned further to perform better in-domain, all without adding any pipeline or architectural complexity?

This is what we demonstrate the potential of here! We use a very simple training recipe: FLAN-style instruction tuning with ModernBERT’s MLM head. There are no custom attention masks, no complex prompt engineering, and no heavy-handed data pre-processing pipeline: we simply filter FLAN to only tasks that can be answered using a single token, and filter out some examples from datasets that we used for downstream evaluations.

How It Works

A high-level overview of the full process

Our key insight is two-fold: ModernBERT can use a single head to perform most NLU tasks, either zero-shot or fully fine-tuned, and this behaviour can be unlocked with an extremely simple training recipe, suggesting very strong potential.

The way it works is very simple:

  1. All tasks are formatted in a way where the model can answer with a single token, which is also the final token of the input. This is always prefaced with an anchor token ([unused0]), to tell the model that the next token needs to be the single token answer.
  2. The model is given a question, short instructions, and a list of potential choices. All choices are prefaced with a single-token verbalizer: this is the token that the model will predict if it assigns this label.
  3. The model then predicts the most likely token for the answer, and the potential verbalizer with the highest score is selected as the answer.

This approach has several advantages:

  • No architectural changes needed, for training or inference.
  • It can be tried on any model that supports Masked Language Modeling out of the box.
  • Very little data pre-processing is needed to begin experimenting.
  • Likewise, it greatly reduces prompt engineering: only a very short template and a description of all labels need to be written to perform a task.
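
To make step 3 above concrete, here is a minimal sketch of verbalizer scoring: instead of taking an unrestricted argmax over the whole vocabulary (as in the quick-start snippet earlier), we read the logits at the [MASK] position and compare only the candidate answer tokens. The toy question, the “A” to “D” verbalizers, and the verbalizer-to-token mapping are illustrative assumptions rather than the exact evaluation code.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = """You will be given a question and options. Select the right answer.
QUESTION: What is 2 + 2?
CHOICES:
- A: 3
- B: 4
- C: 5
- D: 22
ANSWER: [unused0] [MASK]"""

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the [MASK] token, i.e. the single-token answer slot.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1].item()

# Map each verbalizer to a token id (assumes each encodes to a single token;
# depending on the tokenizer, a leading space such as " A" may be needed).
verbalizers = ["A", "B", "C", "D"]
verbalizer_ids = [tokenizer.encode(v, add_special_tokens=False)[0] for v in verbalizers]

# Score only the candidate answer tokens and pick the highest-scoring one.
scores = logits[0, mask_idx, verbalizer_ids]
print("Predicted label:", verbalizers[scores.argmax().item()])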

Training Details

As above, the training recipe is kept voluntarily simple. This is largely meant to avoid scope creep: there are a lot of potential improvements to be explored by using better processing pipelines, or more modern instruction sets, but these would all require complex processes to turn them into single-token tasks.

  • Data: A downsampled (20M samples), filtered FLAN-2022 dataset to keep only single-token answers. A very simple filtering process: tokenize the potential answer and exclude all examples where the answer contains more than one token. Examples from our evaluation datasets were also filtered out to avoid overfitting.
  • Objective: We use the Answer Token Prediction (ATP) objective, which is to predict the single masked token which should be the verbalizer containing the answer. The final training objective is a mix of 80% ATP and 20% dummy MLM examples, where masked tokens are given a meaningless label (see below).
  • Base Model: ModernBERT-Large (395M parameters), which we recently introduced with our friends at LightOn & other places. It proved to be a much more capable base model than alternatives.
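
As a rough illustration of the single-token filter described in the Data bullet above (a sketch of the idea, not the actual preprocessing code; the tokenizer choice is an assumption), the check boils down to:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")

def is_single_token(answer: str) -> bool:
    # Keep an example only if its answer maps to exactly one token.
    return len(tokenizer.encode(answer, add_special_tokens=False)) == 1

print(is_single_token("B"))      # True: a one-letter verbalizer
print(is_single_token("Paris"))  # depends on the tokenizer's vocabulary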

Dummy Examples

When training the model, we theorized that Answer Token Prediction could lead to catastrophic forgetting, with the model only learning to predict certain tokens and losing overall reasoning capabilities. To counter this, we introduced a training objective mix, where 20% of the examples were assigned the normal MLM objective (where 30% of tokens in the text are randomly masked, and the model has to predict all of them at once), with the remaining 80% adopting the Answer Token Prediction objective.

Except, we implemented this wrong, and effectively made these samples empty examples, which we dub “dummy MLM examples”. The issue was in the labelling: rather than the [MASK] tokens being assigned the appropriate label, they were all given [MASK] as their label. This meant that, very quickly, the model learned to simply predict [MASK] for every masked position whenever there was more than one [MASK] token in the text, and the loss on these examples swiftly dropped to near-zero.
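
For illustration, here is a minimal sketch of the difference between the intended MLM labels and the accidental “dummy” labels; the helper below is hypothetical, written only to match the description above.

import torch

def mlm_labels(original_ids, masked_positions, mask_token_id, dummy=False):
    # -100 is ignored by the cross-entropy loss in Hugging Face models.
    labels = torch.full_like(original_ids, -100)
    if dummy:
        # The bug: every masked position gets [MASK] itself as its target,
        # so the objective carries no information about the original tokens.
        labels[masked_positions] = mask_token_id
    else:
        # Intended MLM: the target is the original token at each masked position.
        labels[masked_positions] = original_ids[masked_positions]
    return labels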

Hm, simple mistake, easy to fix, right? Right. Except, we observed something that we didn’t expect: we evaluated three pre-training setups (100% ATP, 80% ATP/20% MLM, 80% ATP/20% dummy), and we found that the dummy example variant was the best performing one, by a good margin! While we haven’t explored this phenomenon in enough depth to explain what is going on, my personal theory is that it acts as a form of regularization, similar to dropout.

Performance

Zero-Shot Results

The zero-shot results are pretty encouraging and, in a way, pretty surprising!

Competing with the best (MMLU-Pro leaderboard for sub-2B models)
  • Knowledge-Based Multiple Choice Questions (MMLU and MMLU-Pro): ModernBERT-Large-Instruct stands at 43.06% accuracy on MMLU, beating similarly sized models like SmolLM2-360M (35.8%) and getting close to Llama3-1B (45.83%). On MMLU-Pro, its performance would give it a very good spot on the leaderboard, punching far above its weight class and competing with bigger LLMs!
  • Classification: On average, it beats all the previous zero-shot methods. However, this is not true on a per-dataset basis: while this method has strong potential and gets very good overall results, there are some datasets where it underperforms, and others where it overperforms. This indicates strong potential for future developments of the method.

Fine-Tuned Results

The MLM Head is All You Need

Across a variety of tasks, focusing on topic classification, textual entailment (MNLI) and sentiment analysis, fine-tuning ModernBERT-Large-Instruct on each task appears to match the performance of traditional classification-head-based approaches. On certain datasets, it even outperforms them! In fact, I think that this method holds the key to finally closing the last gap and making ModernBERT a better classifier than DeBERTaV3.

A caveat here is that the training set of some of these tasks is present, in relatively small proportions, in our pre-training mix: however, we expect this effect to be rather minimal, as fine-tuning for multiple epochs brings both methods firmly into “in-domain” territory.

Modernity Matters

A shamelessly self-plagiarized but appropriate meme

Finally, we wanted to know whether this potential is inherent to all pre-trained MLM encoders, or whether it’s specific to ModernBERT. To answer this question, we applied the same approach to older models like RoBERTa-Large or models with a modern architecture but trained on smaller-scale, less diverse data, and the performance dropped significantly:

Model                        MMLU
ModernBERT-Large-Instruct    43.06
GTE-en-MLM-Large             36.69
RoBERTa-Large                33.11

This suggests that strong generative downstream performance in MLM encoders relies largely on being trained on a sufficiently large-scale, diverse data mix, given the vast performance gap between ModernBERT-Large-Instruct and GTE-en-MLM-Large, which adopts a very similar architecture to that of ModernBERT-Large (minus efficiency tweaks). The relatively smaller performance gain from RoBERTa-Large to GTE-en-MLM-Large seems to suggest that while adopting a better architecture does play a role, its effect is much more modest than that of the training data.

Looking Forward

While these results are promising, they are very early stage! All they really do is demonstrate the potential of the MLM head as a multi-task head, but they are far from pushing it to its limits. Among other things:

  • Exploring better, more diverse templating
  • A more in-depth analysis of the training mechanisms, and the effect of dummy examples
  • Testing on more recent instruction datasets, with better construction
  • Investigating few-shot learning capabilities
  • Scaling to larger model sizes
  • … so many more things!

All strike us as very promising directions for future work! In fact, we’ve heard that some very good people are working on some of these things already…

Ultimately, we believe that the results of our exceedingly simple approach presented here open up new possibilities for encoder models. The ModernBERT-Large-Instruct model is available on HuggingFace.

MonsterUI: Bringing Beautiful UI to FastHTML

2025-02-09 08:00:00

Modern web development requires complicated dependencies and extensive boilerplate spread over multiple languages to make good UI. MonsterUI is here to fix that.

The Problem with Web UI Development

Building attractive web applications has always been complicated. FastHTML simplifies web app development by bringing HTMX, Starlette, HTML, and HTTP fundamentals together.

Getting the aesthetics right is still too hard. It requires either extensive CSS, a framework with long inline class strings, or both. You might try Bootstrap or Tailwind CSS. Now, you’re managing class names, remembering utility patterns, and checking docs for boilerplate class strings. This leads to code that is hard to build, maintain, and change for anyone who is not an expert designer.

A typical app has many components: nav bars, forms, modals, cards, and more. Each requires careful consideration of styling, responsive behavior, and interactive states. As your application grows, managing these styles consistently becomes more and more challenging.

This became apparent to me while I was developing web apps. I found myself copying and pasting class strings and maintaining complex styling logic across multiple components. FastHTML made the application logic development a joy, but the styling side remained a constant source of friction.

If you’re tired of context-switching between HTML, CSS, and Python just to build basic web UIs, MonsterUI might be for you.

Real-World Example: Building a Blog

Introducing MonsterUI

MonsterUI lets anyone build high-quality, modern web apps in pure Python without sacrificing design quality.

Built with MonsterUI, styled with FrankenUI, based on design by Shadcn

MonsterUI is a layer on top of FastHTML that provides pre-styled components and smart defaults based on modern libraries (such as Tailwind, FrankenUI, DaisyUI) while maintaining full access to Tailwind CSS when you need it. MonsterUI:

  • Brings FastHTML’s simplicity to web styling.
  • Provides beautiful, responsive components without writing a single CSS class.
  • Lets you focus on building features instead of remembering utility classes.

Let’s learn by example with a card for team members:

def TeamCard(name, role, location="Remote"):
    icons = ("mail", "linkedin", "github")
    return Card(
        DivLAligned(
            DiceBearAvatar(name, h=24, w=24),
            Div(H3(name), P(role))),
        footer=DivFullySpaced(
            DivHStacked(UkIcon("map-pin", height=16), P(location)),
            DivHStacked(*(UkIconLink(icon, height=16) for icon in icons))))

I specified the entire layout, font sizing, icons, and avatar using only Python. I controlled everything without needing special flexbox or CSS class knowledge.

The example above is from the cards documentation page. For comparison, the same card expressed directly with utility classes, without MonsterUI’s helpers, looks like this:
dicebear_url = 'https://api.dicebear.com/8.x/lorelei/svg?seed=James Wilson'
Div(Div(Div(
    Span(Img(alt='Avatar', loading='lazy', src=dicebear_url, 
             cls='aspect-square h-24 w-24'),cls='relative flex h-24 w-24 shrink-0 overflow-hidden rounded-full bg-accent'),
    Div(H3('James Wilson', cls='uk-h3'),
        P('Senior Developer')),
            cls='uk-flex uk-flex-left uk-flex-middle space-x-4'),
        cls='uk-card-body space-y-6'),
    Div(Div(Div(
                Uk_icon(icon='map-pin', height='16'),
                P('New York'),
                cls='uk-flex uk-flex-row uk-flex-middle space-x-4'),
            Div(A(Uk_icon(icon='mail', height='16'),href='#',cls='uk-icon-link'),
                A(Uk_icon(icon='linkedin', height='16'),href='#',cls='uk-icon-link'),
                A(Uk_icon(icon='github', height='16'),href='#',cls='uk-icon-link'),
                cls='uk-flex uk-flex-row uk-flex-middle space-x-4'),
            cls='uk-flex uk-flex-between uk-flex-middle uk-width-1-1'),
        cls='uk-card-footer'),
    cls='uk-card')

What MonsterUI does for you

MonsterUI is based on a simple principle: provide smart defaults while allowing full flexibility.

We’ve done this by building upon proven approaches from some of the most innovative projects in modern web development, carefully selecting components that address the pain points of raw HTML/CSS while maintaining mature, battle-tested strategies.

MonsterUI’s core is FrankenUI, an innovative framework-free UI library by sveltecult that uses beautiful HTML-first components. FrankenUI itself was inspired by shadcn/ui by shadcn which pioneered the concept of copy-pasteable UI components for React.

Raw HTML and CSS present two key challenges: dated visual aesthetics and complex layout management. By combining FrankenUI’s framework-agnostic approach with FastHTML, MonsterUI delivers modern, beautiful components that integrate seamlessly with HTMX’s progressive enhancement paradigm - all while maintaining clean, readable code.

This isn’t just theory - we’re using MonsterUI in production for new applications we’re testing with preview customers, where it powers everything from complex dialog interfaces to dynamic content rendering. The library has been proven robust and maintainable in real-world enterprise settings.

Let’s explore some key features:

Theme

Pick a color theme for your app. There are 12 colors to choose from, each with a dark and a light mode. By default it uses the user’s system preferences.

All themes are synced so components look good on the same page regardless of whether the component is styled with FrankenUI, DaisyUI, or another framework.

Themes add the boilerplate needed to make color styling consistent throughout your app.

app, rt = fast_app(hdrs=Theme.blue.headers())

Base Components

Every HTML element in MonsterUI comes with sensible default styling. A Button isn’t just an HTML button. It’s a styled component with hover states, focus rings, and consistent padding.

Button("Save Changes")

MonsterUI provides data structures (ListT, TextT, ButtonT, etc.) for easy discoverability and tab completion for selecting styles.

For example, to style it with your Theme’s primary color, use ButtonT.primary. Primary colors are used for action buttons like “Add to Cart” or “Submit.”

Button("Add to Cart", cls=ButtonT.primary)

Semantic Text Styles

Built on the foundations of the web, MonsterUI styles semantic tags based on the HTML spec. This means we provide styled functions that match your theme for standard HTML tags like emphasis (<em>), citation (<cite>), marked text (<mark>), small (<small>), and much more.

Card(
    H1("MonsterUI's Semantic Text"),
    P(
        Strong("MonsterUI"), " brings the power of semantic HTML to life with ",
        Em("beautiful styling"), " and ", Mark("zero configuration"), "."),
    Blockquote(
        P("Write semantic HTML in pure Python, get modern styling for free."),
        Cite("MonsterUI Team")),
    footer=Small("Released February 2025"),
)

Smart Layout Helpers

Overall page layout is made simple with the smart layout helpers (DivVStacked, DivCentered, DivFullySpaced, Grid, etc.). For example, DivVStacked stacks things vertically. Grid creates a grid in which to place components.

DivFullySpaced(
    H1("Dashboard"), 
    DivRAligned(
        Button("Export", cls=ButtonT.secondary),
        Button("New Entry", cls=ButtonT.primary)))

# Grid layout with smart responsive columns for mobile vs desktop
# Easy args to customize responsiveness as you need
Grid(map(TeamCard, products), cols_max=3)

Note: See our layout tutorial for more details and advanced usage

Common UI Patterns

MonsterUI includes shortcuts for common UI patterns. For example, you almost always want an input text box to have a label to communicate what it’s for, so we have provided LabelInput as a shortcut that creates a Label and Input pair.

LabelInput("Name", id='myid')

You can use Div, FormLabel, and Input to do this yourself, but this pattern is so common we’ve provided a shortcut. Here’s what the shortcut replaces:

Div(FormLabel('Name', fr='myid'),
    Input(id='myid', name='myid'),
    cls='space-y-2')

Higher Level Components

We also provide helpers to generate more complex components such as navbars, modals, cards, and tables. Each of these is built on top of several base components (ModalContainer, ModalDialog, etc.) so you could build them up yourself. However, the helper function usually gives all the flexibility you need without needing to write your own boilerplate. These helper functions create good UX behavior for you such as automatically collapsing your NavBar into a hamburger menu on mobile.

For example, to create a button that opens a modal (shown first with MonsterUI’s helpers, then as the equivalent code written without them):

Div(Button("Open Modal",uk_toggle="target: #my-modal" ),
    Modal(ModalTitle("Simple Test Modal"), 
          P("With some somewhat brief content to show that it works!", 
              cls=TextPresets.muted_sm),
          footer=ModalCloseButton("Close", cls=ButtonT.primary),id='my-modal'))

Div(Button('Open Modal', type='button', uk_toggle='target: #my-modal', 
           cls='uk-button uk-button-default'),
    Div(Div(Div(H2('Simple Test Modal', cls='uk-modal-title'),
                P('With some somewhat brief content to show that it works!', 
                  cls='uk-text-muted uk-text-small'),
                cls='uk-modal-body space-y-6'),
            Div(Button('Close', type='button', 
                       cls='uk-button uk-modal-close uk-button-primary'),
                cls='uk-modal-footer'),
            cls='uk-modal-dialog'),
        uk_modal=True,
        id='my-modal',
        cls='uk-modal uk-modal-container'))

Rendering Markdown

MonsterUI provides a render_md function that converts Markdown to styled HTML, with syntax highlighting via HighlightJS for code blocks, FrankenUI classes for styling, and Tailwind for additional styling and spacing. Here’s how to use it:

render_md("""
# My Document

> Important note here

+ List item with **bold**
+ Another with `code`

```python
def hello():
    print("world")
```
""")

Getting Started

First, install it using pip:

pip install MonsterUI

Create a new FastHTML application with MonsterUI styling:

from fasthtml.common import *
from monsterui.all import *

# Choose a theme color (blue, green, red, etc)
hdrs = Theme.blue.headers()

# Create your app with the theme
app, rt = fast_app(hdrs=hdrs)

@rt
def index():
    socials = (('github','https://github.com/AnswerDotAI/MonsterUI'),
               ('twitter','https://twitter.com/isaac_flath/'),
               ('linkedin','https://www.linkedin.com/in/isaacflath/'))
    return Titled("Your First App",
        Card(
            H1("Welcome!"),
            P("Your first MonsterUI app", cls=TextPresets.muted_sm),
            P("I'm excited to see what you build with MonsterUI!"),
            footer=DivLAligned(*[UkIconLink(icon,href=url) for icon,url in socials])))

serve()

That’s it! You now have a styled application with zero configuration. The app already includes:

  • Automatic dark/light mode based on user preferences
  • Properly styled typography and spacing
  • Responsive layout that works on all devices
  • Beautiful UI components ready to use
  • Synchronized color scheme with DaisyUI, FrankenUI, and Tailwind

Check out our documentation for more examples and component references.

Thoughts On A Month With Devin

2025-01-08 08:00:00

In March 2024, a new AI company burst onto the scene with impressive backing: a $21 million Series A led by Founders Fund, with support from industry leaders including the Collison brothers, Elad Gil, and other tech luminaries. The team behind it? IOI gold medalists - the kind of people that solve programming problems most of us can’t even understand. Their product, Devin, promised to be a fully autonomous software engineer that could chat with you like a human colleague, capable of everything from learning new technologies and debugging mature codebases to deploying full applications and even training AI models.

The early demos were compelling. A video showed Devin independently completing an Upwork bounty, installing and running a PyTorch project without human intervention.1 The company claimed Devin could resolve 13.86% of real-world GitHub issues end-to-end on the SWE-bench benchmark - ~3x better than previous systems. Only a select group of users could access it initially, leading to breathless tweets about how this would revolutionize software development.

As a team at Answer.AI that routinely experiments with AI developer tools, something about Devin felt different. If it could deliver even half of what it promised, it could transform how we work. But while Twitter was full of enthusiasm, we couldn’t find many detailed accounts of people actually using it. So we decided to put it through its paces, testing it against a wide range of real-world tasks. This is our story - a thorough, real-world attempt to work with one of the most hyped AI products of 2024.

What is Devin?

What makes Devin unique is its infrastructure. Unlike typical AI assistants, Devin operates through Slack and spins up its own computing environment. When you chat with Devin, you’re talking to an AI that has access to a full computing environment - complete with a web browser, code editor, and shell. It can install dependencies, read documentation, and even preview web applications it creates. Below is a screenshot of one way to initiate a task for Devin to work on:

One way to initiate a task with Devin - through Slack

The experience is designed to feel like chatting with a colleague. You describe what you want, and Devin starts working. Through Slack, you can watch it think through problems, ask for credentials when needed, and share links to completed work. Behind the scenes, it’s running in a Docker container, which gives it the isolation it needs to safely experiment while protecting your systems. Devin also provides a web interface, which gives you access to its environment so you can watch it work with IDEs, web browsers, and more in real time. Here is a screenshot of the web interface:

Early Wins

Our first task was straightforward but real: pull data from a Notion database into Google Sheets. Devin tackled this with surprising competence. It navigated to the Notion API documentation, understood what it needed, and guided me through setting up the necessary credentials in Google Cloud Console. Rather than just dumping API instructions, it walked me through each menu and button click needed - saving what would typically be tedious documentation sleuthing. The whole process took about an hour (but only a few minutes of human interaction). At the end, Devin shared a link to a perfectly formatted Google Sheet containing our data.

The code it produced was a bit verbose, but it worked. This felt like a glimpse into the future - an AI that could handle the “glue code” tasks that consume so much developer time. Johno had similar success using Devin to create a planet tracker for debunking claims about historical positions of Jupiter and Saturn. What made this particularly impressive was that he managed this entirely through his phone, with Devin handling all the heavy lifting of setting up the environment and writing the code.

Scaling Up Our Testing

Building upon our early successes, we leaned into Devin’s asynchronous capabilities. We imagined having Devin write documentation during our meetings or debug issues while we focused on design work. But as we scaled up our testing, cracks appeared. Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions.

Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible. When asked to deploy multiple applications to a single Railway deployment (something that Railway doesn’t support), instead of identifying this limitation, Devin spent over a day attempting various approaches and hallucinating features that didn’t exist.

The most frustrating aspect wasn’t the failures themselves - all tools have limitations - but rather how much time we spent trying to salvage these attempts.

A Deeper Look at What Went Wrong

At this point in our journey, we were puzzled. We had seen Devin competently handle API integrations and build functional applications, yet it was struggling with tasks that seemed simpler. Was this just bad luck? Were we using it wrong?

Over the course of a month, we systematically documented our attempts across these categories:

  1. Creating new projects from scratch
  2. Performing research tasks
  3. Analyzing & Modifying existing projects

The results were sobering. Out of 20 tasks, we had 14 failures, 3 successes (including our 2 initial ones), and 3 inconclusive results. Even more telling was that we couldn’t discern any pattern to predict which tasks would work. Tasks that seemed similar to our early successes would fail in unexpected ways. We’ve provided more detail about these tasks in the appendix below. Below is a summary of our experiences in each of these categories:

1. Creating New Projects From Scratch

This category should have been Devin’s sweet spot. After all, the company’s demo video showed it autonomously completing an Upwork bounty, and our own early successes suggested it could handle greenfield development. The reality proved more complex.

Take our attempt to integrate with an LLM observability platform called Braintrust. The task was clear: generate synthetic data and upload it. Instead of a focused solution, Devin produced what can only be described as code soup - layers of abstraction that made simple operations needlessly complex. We ultimately abandoned Devin’s attempt and used Cursor to build the integration step-by-step, which proved far more efficient. Similarly, when asked to create an integration between our AI notes taker and Spiral.computer, Devin generated what one team member described as “spaghetti code that was way more confusing to read through than if I’d written it from scratch.” Despite having access to documentation for both systems, Devin seemed to overcomplicate every aspect of the integration.

Perhaps most telling was our attempt at web scraping. We asked Devin to follow Google Scholar links and grab the most recent 25 papers from an author - a task that should be straightforward with tools like Playwright. This should have been particularly achievable given Devin’s ability to browse the web and write code. Instead, it became trapped in an endless cycle of trying to parse HTML, unable to extract itself from its own confusion.

2. Research Tasks

If Devin struggled with concrete coding tasks, perhaps it would fare better with research-oriented work? The results here were mixed at best. While it could handle basic documentation lookups (as we saw in our early Notion/Google Sheets integration), more complex research tasks proved challenging.

When we asked Devin to research transcript summarization with accurate timestamps - a specific technical challenge we were facing - it merely regurgitated tangentially related information rather than engaging with the core problem. Instead of exploring potential solutions or identifying key technical challenges, it provided generic code examples that didn’t address the fundamental issues. Even when Devin appeared to be making progress, the results often weren’t what they seemed. For instance, when asked to create a minimal DaisyUI theme as an example, it produced what looked like a working solution. However, upon closer inspection, we discovered the theme wasn’t actually doing anything - the colors we were seeing were from the default theme, not our customizations.

3. Analyzing and Modifying Existing Code

Perhaps Devin’s most concerning failures came when working with existing codebases. These tasks require understanding context and maintaining consistency with established patterns - skills that should be central to an AI software engineer’s capabilities.

Our attempts to have Devin work with nbdev projects were particularly revealing. When asked to migrate a Python project to nbdev, Devin couldn’t grasp even basic nbdev setup, despite us providing it access to comprehensive documentation. More puzzling was its approach to notebook manipulation - instead of directly editing notebooks, it created Python scripts to modify them, adding unnecessary complexity to simple tasks. While it occasionally provided useful notes or ideas, the actual code it produced was consistently problematic.

Security reviews showed similar issues. When we asked Devin to assess a GitHub repository (under 700 lines of code) for security vulnerabilities, it went overboard, flagging numerous false positives and hallucinating issues that didn’t exist. This kind of analysis might have been better handled by a single, focused LLM call rather than Devin’s more complex approach.

The pattern continued with debugging tasks. When investigating why SSH key forwarding wasn’t working in a setup script, Devin fixated on the script itself, never considering that the problem might lie elsewhere. This tunnel vision meant it couldn’t help us uncover the actual root cause. Similarly, when asked to add conflict checking between user input and database values, one team member spent several hours working through Devin’s attempts before giving up and writing the feature themselves in about 90 minutes.

Reflecting As A Team

After a month of intensive testing, our team gathered to make sense of our experiences. These quotes capture our feelings best:

Tasks it can do are those that are so small and well-defined that I may as well do them myself, faster, my way. Larger tasks where I might see time savings I think it will likely fail at. So no real niche where I’ll want to use it. - Johno Whitaker

I had initial excitement at how close it was because I felt I could tweak a few things. And then slowly got frustrated as I had to change more and more to end up at the point where I would have been better off starting from scratch and going step by step. - Isaac Flath

Devin struggled to use internal tooling that is critical at AnswerAI which, in addition to other issues, made it difficult to use. This is despite providing Devin with copious amounts of documentation and examples. I haven’t found this to be an issue with tools like Cursor, where there is more opportunity to nudge things in the right direction more incrementally. - Hamel Husain

In contrast, we found that workflows where developers drive more (like Cursor) avoid most of the issues we faced with Devin.

Conclusion

Working with Devin showed what autonomous AI development aspires to be. The UX is polished - chatting through Slack, watching it work asynchronously, seeing it set up environments and handle dependencies. When it worked, it was impressive.

But that’s the problem - it rarely worked. Out of 20 tasks we attempted, we saw 14 failures, 3 inconclusive results, and just 3 successes. More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability - Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers.

This reflects a pattern we’ve observed repeatedly in AI tooling. Social media excitement and company valuations have minimal relationship to real-world utility. We’ve found the most reliable signal comes from detailed stories of users shipping products and services. For now, we’re sticking with tools that let us drive the development process while providing AI assistance along the way.

Appendix: Tasks Attempted With Devin

Below is a list of the projects we gave Devin, grouped by theme: (1) creating a new project, (2) research, (3) analyzing an existing code base, and (4) modifying an existing code base. Each entry lists the project, its status, our description of the task, and our reflections.

1. Create A New Project

Planet Tracker (Success)
Description: I wanted to debunk some claims about historical positions of Jupiter and Saturn.
Reflections: Devin nailed it. I actually talked to Devin from my phone via Slack and it made it happen.

Migrating data from Notion Into Google Sheets (Success)
Description: I told Devin to programmatically pull info from a Notion document into a Google Sheet. This was my very first project that I executed with Devin and it pulled it off nicely.
Reflections: Devin read the Notion and Google API docs by itself. Devin also navigated me to the Google Cloud console and provided me with instructions on all the different menus to click through, which would have taken me quite a bit of time on my own! At the end, I was given a reasonable Python script that executed the task. This was my very first interaction with Devin and it executed exactly what I wanted it to do, which was a brand new experience for me. I was quite excited about Devin at this point.

Multi-app deploys on Railway (Inconclusive)
Description: I asked Devin to deploy multiple applications to a single Railway deployment, so that I could have different apps sharing the same local db for testing.
Reflections: It turns out that this task was ill-defined because it’s not actually possible to do this, if I understand correctly. However, Devin marched forward and tried to do this and hallucinated some things about how to interact with Railway.

Generate synthetic data and upload it to Braintrust (Failure)
Description: I asked Devin to create synthetic data for an LLM observability platform called Braintrust that I wanted to test.
Reflections: Devin created overly complex code that was hard to understand, and got stuck trying to fix errors. We ended up using Cursor to do this step by step in an iterative fashion.

Create an integration between two applications (Failure)
Description: I asked Devin to create an integration between Circleback, my AI notes taker, and Spiral.computer, with pointers to the documentation of each.
Reflections: I got really horrible spaghetti code that was way more confusing to read through than if I had just written it from scratch. So I decided not to invest any more time in using Devin for this particular task.

Web scraping Papers By Following Google Scholar Links (Failure)
Description: I asked Devin to grab the most recent 25 papers from an author on Google Scholar programmatically using Playwright, and told it that if it encountered a paywall it was OK to skip that particular document.
Reflections: Devin went into a rabbit hole of trying to parse HTML that it seemingly couldn’t get out of. It got stuck and went to sleep.

Create minimal HTMX bulk upload example app (Failure)
Description: I asked Devin to read the HTMX documentation page for the bulk edit example and, with that and fake server code, create a minimal FastHTML version of the example for the FastHTML Gallery.
Reflections: The example did not work and was not minimal. Devin used objects from the request object that didn’t exist and added many unnecessary things, like toasts (which also didn’t work) and inline CSS styling.

Create DaisyUI Themes to match FrankenUI Theming (Failure)
Description: I asked Devin to create DaisyUI and highlight.js theming so that they match the FrankenUI themes and can be used in the same app seamlessly.
Reflections: Devin mapped pre-existing DaisyUI themes to FrankenUI themes, but they did not match well in many cases. It was also a ton of code changes that I didn’t understand, and I ended up not using any of it because I was too confused to know what to do with it.

2. Perform Research

Research How to make a discord bot (Success)
Description: I asked Devin to perform research on how I could use Python to build a Discord bot that summarizes each day’s messages and sends an email. I also told it to use Claudette if possible. Finally, I told it to write its findings in notebooks with small code snippets I could use to test.
Reflections: Devin produced research notes in the form of a markdown file as an intermediate step to creating the notebook, which I did not ask it for. However, it was quite useful to see a step-by-step plan on how an implementation might come together. The code that it provided me in the notebook was not 100% correct, but it was useful as pseudocode to give me an idea of how I might glue this together. Given that this was more of a research project and I just wanted to know the general idea, I would call this a success.

Research on Transcript Summarization With Accurate Timestamps (Failure)
Description: One issue that I face with summarizing transcripts is that I would love to have accurate timestamps that go with the notes, so that I could use them for YouTube chapter summaries or similar. Concretely, it is not a problem to get accurate timestamps from a transcript, but it’s difficult to associate timestamps with summaries because the timestamps often get bungled. So this is kind of an AI engineering research task.
Reflections: Devin regurgitated things related to my problem, but it did not do a good job of performing research or trying to tackle the problem I was trying to solve, and gave me pointers to code and examples that were not helpful.

Create a minimal DaisyUI theme as an example (Failure)
Description: I asked Devin to create a minimal DaisyUI theme as an example. My goal was to get a starting point, since asking it to do it in a more complete way was unsuccessful.
Reflections: Devin ignored the request to make it as a FastHTML app, and it took some back and forth to get it to go down that path. Eventually, it created an app that appeared to work with different button types. While it gave a link that looked good, once I tried modifying the theme, it became clear the theme was doing nothing; the other colors in the app were from the default theme. This is not a helpful starting point.

3. Analyze Existing Code

Performing a security review of a code base (Inconclusive)
Description: For this task, I pointed Devin at a GitHub repository and told it to assess it for security vulnerabilities. The codebase is under 700 lines of code. I told Devin to write its notes in a markdown file with sample code where necessary.
Reflections: Devin did identify some security vulnerabilities but was extremely overzealous and hallucinated some issues that were not there. Perhaps this was not the ideal task for Devin, as this is something that would be handled just as well by a single call to my favorite LLM.

Review blog posts and make a pull request with improvements (Failure)
Description: I asked Devin to review a blog post and suggest changes with a pull request.
Reflections: Ultimately, Devin failed because it could not figure out how the static site generator that I was using, Quarto, worked. I think that this task would have been successful inside something like Cursor. It seemed like Devin did not do a good job of learning from the project structure and existing files, so it messed up things like front matter and other conventions necessary to edit the blog post correctly.

Review An Application and Identify Potential Areas of Improvement (Failure)
Description: I asked Devin to view the timekeeping app I had mentioned earlier and provided an open-ended task of asking it to suggest any improvements.
Reflections: The suggestions that it provided did not make any sense.

Debug why ssh key forwarding is not working in a setup script (Inconclusive)
Description: I asked Devin to figure out why ssh key forwarding was not working on a server when I used a script to set it up.
Reflections: The issue ended up being unrelated to the script, which I thought was the problem, but Devin never suggested or implied that maybe the problem was somewhere else. It was not helpful because it did not help me uncover the root cause.

4. Modify An Existing Project

Making changes to a nbdev project (Failure)
Description: I had a simple application for time tracking built with FastHTML and nbdev that I wanted to integrate with Apple Shortcuts via an API route.
Reflections: Devin could not figure out how to operate successfully in this environment, even though it got impressively far. One curiosity that I noticed is that Devin created Python scripts to edit notebooks rather than trying to edit the notebook itself. Devin did give me some useful notes and ideas that I hadn’t considered, but the code that it tried to write did not make sense. Eventually, I ended up using a template from someone else and not going with any of Devin’s suggestions.

Migration of Python Project To nbdev (Failure)
Description: I asked Devin to migrate a project to nbdev [prompt details omitted for brevity].
Reflections: It got horribly stuck and could not figure out basic nbdev setup. It seems like it didn’t do a good job of reading the nbdev docs.

Integrate Styling Package Into FastHTML (Failure)
Description: I asked Devin to integrate MonsterUI into one of my applications.
Reflections: Devin could not figure out how to work with an nbdev repo.

Add feature to check for conflicts between user input and database (Failure)
Description: I asked Devin to add a feature to an app to compare user input values to values from a database based on prior runs and give a UI if they don’t match.
Reflections: I spent several hours slowly working through getting it working properly before I gave up. I wrote the feature myself in about 90 minutes.

Generate LLMs context file with the contents of every fasthtml gallery example (Failure)
Description: I asked Devin to create llms text files for the fasthtml gallery.
Reflections: I was excited to see it created a separate markdown file for each example and then tried to roll them up into the llms context files initially. I had not thought about doing that, and things seemed all there at first. When I pulled it down and started digging in, I started finding things I did not like: the format of the llms files wasn’t correct (even though I gave it information to use XML tags to separate examples, it didn’t); it added and pinned a specific version of the markdown package as a dependency and used that, instead of using the markdown2 package which is already used and was already a dependency; and it did a bunch of pytest stuff and added a dependency, even though the project doesn’t use pytest.

Footnotes

  1. This demo was decisively debunked by this video↩︎

Finally, a Replacement for BERT: Introducing ModernBERT

2024-12-19 08:00:00

Finally, a Replacement for BERT

TL;DR

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8192 sequence length, better downstream performance and much faster processing.

ModernBERT is available as a slot-in replacement for any BERT-like models, with both a base (149M params) and large (395M params) model size.

Click to see how to use these models with transformers

ModernBERT will be included in v4.48.0 of transformers. Until then, it requires installing transformers from main:

pip install git+https://github.com/huggingface/transformers.git

Since ModernBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes. ⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  Paris

Using a pipeline:

import torch
from transformers import pipeline
from pprint import pprint
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)
input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

Note: ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.
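
For downstream tasks like classification, fine-tuning follows the standard BERT-style recipe. Here is a minimal, illustrative sketch; the SST-2 dataset and the hyperparameters are assumptions for demonstration, not a recommended configuration.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Example dataset with "sentence" and "label" columns; swap in your own data.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda batch: tokenizer(batch["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(output_dir="modernbert-sst2",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=5e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  tokenizer=tokenizer)  # passing the tokenizer enables dynamic padding
trainer.train()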

Introduction

BERT was released in 2018 (millennia ago in AI-years!) and yet it’s still widely used today: in fact, it’s currently the second most downloaded model on the HuggingFace hub, with more than 68 million monthly downloads, only second to another encoder model fine-tuned for retrieval. That’s because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).

Finally, 6 years later, we have a replacement! Today, we at Answer.AI and LightOn (and friends!) are releasing ModernBERT. ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy. This model takes dozens of advances from recent years of work on large language models (LLMs), and applies them to a BERT-style model, including updates to the architecture and the training process.

We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data. These features open up new application areas that were previously inaccessible through open models, such as large-scale code search, new IDE features, and new types of retrieval pipelines based on full document retrieval rather than small chunks.

But in order to explain just what we did, let’s first take a step back and look at where we’ve come from.

Decoder-only models

The recent high-profile advances in LLMs have been in models like GPT, Llama, and Claude. These are decoder-only models, or generative models. Their ability to generate human-like content has enabled astonishing new GenAI application areas like generated art and interactive chat. These striking applications have attracted major investment, funded booming research, and led to rapid technical advances. What we’ve done, essentially, is port these advances back to an encoder-only model.

Why? Because many practical applications need a model that’s lean and mean! And it doesn’t need to be a generative model.

More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most corporations to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.

Of course, the open-ended capabilities of these giant generative models mean that you can, in a pinch, press them into service for non-generative or discriminative tasks, such as classification. This is because you can describe a classification task in plain English and … just ask the model to classify. But while this workflow is great for prototyping, you don’t want to pay prototype prices once you’re in mass production.

The popular buzz around GenAI has obscured the role of encoder-only models. These are the workhorses of practical language processing, the models that are actually being used for such workloads right now in many scientific and commercial applications.

Encoder-only models

The output of an encoder-only model is a list of numerical values (an embedding vector). You might say that instead of answering with text, an encoder model literally encodes its “answer” into this compressed, numerical form. That vector is a compressed representation of the model’s input, which is why encoder-only models are sometimes referred to as representational models.

While decoder-only models (like a GPT) can do the work of an encoder-only model (like a BERT), they are hamstrung by a key constraint: since they are generative models, they are mathematically “not allowed” to “peek” at later tokens. They can only ever look backwards. This is in contrast to encoder-only models, which are trained so each token can look forwards and backwards (bi-directionally). They are built for this, and it makes them very efficient at what they do.
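To make “encodes its answer into a vector” concrete, here is a minimal sketch, assuming the answerdotai/ModernBERT-base checkpoint and the transformers setup shown earlier, with a small embed helper of our own that mean-pools the token-level hidden states into one sentence embedding (mean pooling is just one common pooling choice, not the only way to get a vector out of an encoder):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # plain encoder, no task head

def embed(text: str) -> torch.Tensor:
    """Mean-pool token-level hidden states into one sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # (1, hidden)

print(embed("Encoders turn text into vectors.").shape)  # torch.Size([1, 768])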

Basically, a frontier model like OpenAI’s O1 is like a Ferrari SF-23. It’s an obvious triumph of engineering, designed to win races, and that’s why we talk about it. But it takes a special pit crew just to change the tires and you can’t buy one for yourself. In contrast, a BERT model is like a Honda Civic. It’s also an engineering triumph, but more subtly, since it is engineered to be affordable, fuel-efficient, reliable, and extremely useful. And that’s why they’re absolutely everywhere.

You can see this by looking at it a number of ways.

Supporting generative models: One way to understand the prevalence of representational models (encoder-only) is to note how frequently they are used in concert with a decoder-only model to make a system which is safe and efficient.

The obvious example is RAG. Instead of relying only on the knowledge trained into the LLM’s parameters, the system uses a document store to furnish the LLM with information relevant to the query. But of course this only defers the problem: if the LLM doesn’t know which documents are relevant to the query, the system needs some other process to select them. That process needs a model fast and cheap enough to encode the large quantities of information required to make the LLM useful. That model is often a BERT-like encoder-only model. For more details on how encoders like ModernBERT are critical in RAG pipelines, see this talk by Benjamin Clavié.
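As a toy illustration of that selection step (not a production retriever; in practice the raw masked-language model would be fine-tuned for retrieval first), here is a sketch that ranks documents by cosine similarity to the query, reusing an embed helper like the one sketched above:

import torch
import torch.nn.functional as F

def top_documents(query: str, documents: list[str], embed, k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query and keep the top k.
    `embed` is a text -> (1, hidden) embedding function, e.g. the helper above."""
    query_vec = embed(query)                              # (1, hidden)
    doc_vecs = torch.cat([embed(d) for d in documents])   # (n_docs, hidden)
    scores = F.cosine_similarity(query_vec, doc_vecs)     # (n_docs,)
    best = scores.topk(min(k, len(documents))).indices.tolist()
    return [documents[i] for i in best]

A real RAG pipeline would, of course, pre-compute and index the document embeddings rather than re-embedding everything for each query.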

Another example is supervision architectures, where a cheap classifier might be used to ensure that generated text does not violate content safety requirements.

In short, whenever you see a decoder-only model in deployment, there’s a reasonable chance an encoder-only model is also part of the system. But the converse is not true.

Encoder-based systems: Before there was GPT, there were content recommendations in social media and in platforms like Netflix. There was ad targeting in those venues, in search, and elsewhere. There was content classification for spam detection, abuse detection, etc. These systems were not built on generative models, but on representational models like encoder-only models. And all these systems are still out there and still running at enormous scale. Imagine how many ads are targeted per second around the world!

Downloads: On HuggingFace, RoBERTa, one of the leading BERT-based models, has more downloads than the 10 most popular LLMs on HuggingFace combined. In fact, currently, encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models with their 397 million monthly downloads. In fact, the `fill-mask` model category, composed of encoder “base models” such as ModernBERT, ready to be fine-tuned for other downstream applications, is the most downloaded model category overall.

Inference costs: What the above suggests is that, on an inference-per-inference basis, many times more inferences are performed per year on encoder-only models than on decoder-only or generative models. An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at HuggingFace Inference Endpoints’ pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google’s Gemini Flash and its low inference cost of $0.075/million tokens, would cost over one million dollars!
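For the curious, the arithmetic behind those two figures is easy to reproduce. The numbers below are simply the ones quoted in the paragraph above:

tokens = 15e12                        # 15 trillion tokens to filter

# Encoder route: a fine-tuned BERT-based classifier running on H100s.
encoder_cost = 6_000 * 10.0           # 6,000 H100 hours at $10/hour
print(f"Encoder filtering: ${encoder_cost:,.0f}")   # $60,000

# Decoder route: the cheapest cited option, Gemini Flash at $0.075 per million tokens.
decoder_cost = tokens / 1e6 * 0.075
print(f"Decoder filtering: ${decoder_cost:,.0f}")   # $1,125,000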

Performance

Overview

Here’s a snapshot of the accuracy of ModernBERT and other models across a range of tasks, as measured by standard academic benchmarks – as you can see, ModernBERT is the only model which is a top scorer across every category, which makes it the one model you can use for all your encoder-based tasks:

If you’ve ever done an NLP competition on Kaggle, then you’ll know that DeBERTaV3 has been the choice of champions for years. But no longer: not only is ModernBERT the first base-size model to beat DeBERTaV3 on GLUE, it also uses less than 1/5th of Deberta’s memory.

And of course, ModernBERT is fast. It’s twice as fast as DeBERTa – in fact, up to 4x faster in the more common situation where inputs are mixed length. Its long context inference is nearly 3 times faster than other high-quality models such as NomicBERT and GTE-en-MLM.

ModernBERT’s context length of 8,192 tokens is over 16x larger than most existing encoders. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding. ModernBERT is also the state-of-the-art long context retriever with ColBERT, and is 9 percentage points above the other long context models. Even more impressive: this very quickly trained model, simply tuned to compare to other backbones, outperforms even widely-used retrieval models on long-context tasks!

For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT’s specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.

Compared to the mainstream models, ModernBERT performs better across nearly all three broad task categories of retrieval, natural language understanding, and code retrieval. Whilst it slightly lags DeBERTaV3 in one area (natural language understanding), it is many times faster. Please note that ModernBERT, as any other base model, can only do masked word prediction out-of-the-box. To be able to perform other tasks, the base model should be fine-tuned as done in these boilerplates.

Compared to the specialized models, ModernBERT is comparable or superior in most tasks. In addition, ModernBERT is faster than most models across most tasks, and can handle inputs up to 8,192 tokens, 16x longer than the mainstream models.

Efficiency

Here’s the memory (max batch size, BS) and inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other encoder models:

The first thing you might notice is that we’re analysing the efficiency on an affordable consumer GPU, rather than the latest unobtainable hyped hardware. First and foremost, ModernBERT is focused on practicality, not hype.

As part of this focus, it also means we’ve made sure ModernBERT works well for real-world applications, rather than just benchmarks. Models of this kind are normally tested on just the one exact size they’re best at – their maximum context length. That’s what the “fixed” column in the table shows. But input sizes vary in the real world, so that’s the performance we worked hard to optimise – the “variable” column. As you can see, for variable length inputs, ModernBERT is much faster than all other models.

For long context inputs, which we believe will be the basis for the most valuable and important future applications, ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.

Furthermore, thanks to ModernBERT’s efficiency, it can use a larger batch size than nearly any other model, and can be used effectively on smaller and cheaper GPUs. The efficiency of the base size, in particular, may enable new applications that run directly in browsers, on phones, and so forth.

Why is ModernBERT, well, Modern?

Now, we’ve made our case for why encoder models deserve some more love. As trusted, under-appreciated workhorses, they’ve had surprisingly few updates since 2018’s BERT!

Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as “Pareto improvements”): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as ALBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.

However, since the duo’s original release, we’ve learned an enormous amount about how to build better language models. If you’ve used LLMs at all, you’re very well aware of it: while they’re rare in the encoder-world, Pareto improvements are constant in decoder-land, where models constantly become better at everything. And as we’ve all learned by now: model improvements are only partially magic, and mostly engineering.

The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:

  1. a modernized transformer architecture
  2. particular attention to efficiency
  3. modern data scales & sources

Meet the New Transformer, Same as the Old Transformer

The Transformer architecture has become dominant, and is used by the vast majority of models nowadays. However, it’s important to remember that there isn’t one but many Transformers. The main thing they share in common is their deep belief that attention is indeed all you need, and as such, build various improvements centered around the attention mechanism.

ModernBERT takes huge inspiration from the Transformer++ (as coined by Mamba), first used by the Llama2 family of models. We replace older BERT-like building blocks with their improved equivalents. Specifically, we:

  • Replace the old positional encoding with “rotary positional embeddings” (RoPE): this makes the model much better at understanding where words are in relation to each other, and allows us to scale to longer sequence lengths.
  • Switch out the old MLP layers for GeGLU layers, improving on the original BERT’s GeLU activation function (see the sketch after this list).
  • Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively.
  • Add an extra normalization layer after embeddings, which helps stabilize training.
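Here is a minimal sketch of a GeGLU feed-forward block, the GLU-variant MLP mentioned in the list above. The dimensions are illustrative, and the exact layer shapes in ModernBERT may differ:

import torch
import torch.nn as nn

class GeGLUFFN(nn.Module):
    """GLU-variant MLP: one projection is passed through GELU and gates the other,
    then the gated result is projected back down to the hidden size."""
    def __init__(self, hidden: int = 768, intermediate: int = 2304):
        super().__init__()
        self.wi = nn.Linear(hidden, 2 * intermediate, bias=False)  # fused gate + up projections
        self.wo = nn.Linear(intermediate, hidden, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.wi(x).chunk(2, dim=-1)
        return self.wo(self.act(gate) * up)

print(GeGLUFFN()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])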

Upgrading a Honda Civic for the Race Track

We’ve covered this already: encoders are no Ferraris, and ModernBERT is no exception. However, that doesn’t mean it can’t be fast. When you get on the highway, you generally don’t go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit.

In fact, for all the application cases we mentioned above, speed is essential. Encoders are very popular in uses where they either have to process tons of data, allowing even tiny speed increments to add up very quickly, or where latency is very important, as is the case on RAG. In a lot of situations, encoders are even run on CPU, where efficiency is even more important if we want results in a reasonable amount of time.

As with most things in research, we build while standing on the shoulders of giants, and heavily leverage Flash Attention 2’s speed improvements. Our efficiency improvements rely on three key components: Alternating Attention, to improve processing efficiency, Unpadding and Sequence Packing, to reduce computational waste, and Hardware-Aware Model Design, to maximise hardware utilization.

Global and Local Attention

One of ModernBERT’s most impactful features is Alternating Attention, rather than full global attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention).
As attention’s computational complexity balloons up with every additional token, this means ModernBERT can process long input sequences considerably faster than any other model.

In practice, it looks like this:

Conceptually, the reason this works is pretty simple: Picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)? Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)? In the vast majority of cases, it’s the latter.
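Here is a small illustrative sketch of the masks this alternation implies, with full attention every third layer and a 128-token sliding window elsewhere. It mirrors the pattern described above rather than ModernBERT’s actual attention kernels:

import torch

def attention_mask(seq_len: int, layer_idx: int,
                   window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where a token may attend.
    Every `global_every`-th layer is global; the rest use a sliding window."""
    if layer_idx % global_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).abs()
    return distance <= window // 2

# At 8k tokens, local layers attend to roughly 1.6% of all token pairs.
print(attention_mask(8192, layer_idx=1).float().mean())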

Unpadding and Sequence Packing

Another core mechanism contributing to ModernBERT’s efficiency is its use of Unpadding and Sequence Packing.

In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we’ve relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence.

While padding solves the problem, it doesn’t do so elegantly: a lot of compute ends up being spent and wasted on padding tokens, which do not contribute any semantic information.

Comparing padding with sequence packing. Sequence packing (‘unpadding’) avoids wasting compute on padding tokens and has more consistent non-padding token counts per batch. Samples are still processed individually through careful masking.

Unpadding solves this issue: rather than keeping these padding tokens, we remove them all and concatenate the remaining tokens into mini-batches with a batch size of one, avoiding all unnecessary computation. If you’re using Flash Attention, our implementation of unpadding goes a step further than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: relying on recent developments in Flash Attention’s RoPE support, ModernBERT only has to unpad once, and can optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.

To speed up pre-training even further, unpadding is in good company within our model, as we use it in conjunction with sequence packing. Sequence packing here is a logical next step: as we’re concatenating inputs into a single sequence, and GPUs are very good at parallelisation, we want to maximise the computational efficiency we can squeeze out of a single forward model pass. To do so, we use a greedy algorithm to group individual sequences into concatenated ones that are as close to the model’s maximum input length as possible.
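As a rough sketch of what such a greedy packer might look like (illustrative only; the real implementation also handles the attention masking so that packed samples never attend to each other):

def pack_sequences(lengths: list[int], max_len: int = 8192) -> list[list[int]]:
    """Greedy first-fit packing: group sequence indices so each packed batch
    stays under the model's maximum input length. Packing longest-first
    tends to waste less space."""
    bins: list[tuple[int, list[int]]] = []   # (remaining capacity, sequence indices)
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for i, (remaining, indices) in enumerate(bins):
            if lengths[idx] <= remaining:
                bins[i] = (remaining - lengths[idx], indices + [idx])
                break
        else:
            bins.append((max_len - lengths[idx], [idx]))
    return [indices for _, indices in bins]

print(pack_sequences([5000, 3000, 2000, 7000, 1000]))  # e.g. [[3, 4], [0, 1], [2]]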

Paying Attention to Hardware

Finally, the third facet of ModernBERT’s efficiency is hardware design.

We attempted to balance two insights that have been highlighted by previous research:

  1. Deep & Narrow vs Wide & Shallow: Research shows that deeper models with narrower layers often perform better than shallower models with fewer, wider layers. However, this is a double-edged sword: the deeper the model, the less parallelizable it becomes, and thus, the slower it runs at identical parameter counts.
  2. Hardware Efficiency: Model dimensions need to align well with GPU hardware for maximum performance, and different target GPUs result in different constraints.

Sadly, there is no magic recipe to make a model run similarly well on a wide range of GPUs, but there is an excellent cookbook: The Case for Co-Designing Model Architectures with Hardware, in which the ways to optimize a model architecture for a given GPU are carefully laid out. We came up with a heuristic to extend their method to a basket of GPUs, while respecting a given set of constraints. Logically, the first step is to define said constraints, in our case:

  • Defining our target GPUs as common inference ones (RTX 3090/4090, A10, T4, L4)
  • Roughly defining our target model sizes as 130-to-150 million parameters for ModernBERT-Base, and 350-to-420 million for ModernBERT-Large
  • Matching the original BERT’s embedding dimensions, 768 for base and 1024 for large, to maximize backwards compatibility
  • Setting performance constraints which are common across the basket of GPUs

Afterwards, we experimented with multiple model designs via a constrained grid search, varying both layer counts and layer width. Once we’d identified shapes that appeared to be the most efficient ones, we confirmed that our heuristics matched real-world GPU performance, and settled on the final model designs.
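To give a flavour of what that search looks like, here is a hypothetical sketch: enumerate (depth, width) candidates, keep only those that respect the parameter budget and hardware-friendly dimensions, then benchmark the survivors. The ranges, budget, and rough parameter formula below are placeholders, not the search space we actually used:

def candidate_shapes(hidden: int = 768,
                     param_budget: tuple[float, float] = (130e6, 150e6)):
    """Enumerate (layers, intermediate size) pairs inside a parameter budget,
    keeping intermediate sizes that are multiples of 128 (GPU-friendly)."""
    candidates = []
    for n_layers in range(12, 40, 2):
        for intermediate in range(hidden * 2, hidden * 6, 128):
            # Very rough estimate: attention + MLP weights only, ignoring
            # embeddings, GeGLU gating, and normalization parameters.
            params = n_layers * (4 * hidden * hidden + 2 * hidden * intermediate)
            if param_budget[0] <= params <= param_budget[1]:
                candidates.append((n_layers, intermediate, params))
    return candidates

for shape in candidate_shapes()[:5]:
    print(shape)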

Training

def data(): return ['text', 'bad_text', 'math', 'code']

Picture this exact scene, but replace Developers with Data

Another big aspect in which encoders have been trailing behind is training data. This is often understood to mean solely training data scale, but this is not actually the case: previous encoders, such as DeBERTaV3, were trained for long enough that they might have even breached the trillion tokens scale!

The issue, rather, has been training data diversity: many of the older models train on limited corpora, generally consisting of Wikipedia and Wikibooks. These data mixtures are very noticeably single text modality: they contain nothing but high-quality natural text.

In contrast, ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion tokens, of which most are unique, rather than the standard 20-to-40 repetitions common in previous encoders.

The impact of this is immediately noticeable: out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.

Process

We stick to the original BERT’s training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, which has since been shown to add overhead for no clear gains, and increase the masking rate from 15% to 30%.

Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.

Training in three phases is our way of ensuring our model is good across the board, which is reflected in its results: it is competitive on long-context tasks, at no cost to its ability to process short context…

… But it has another benefit: for the first two phases, we train using a constant learning rate once the warmup phase is complete, and only perform learning rate decay on the final 50 billion tokens, following the Trapezoidal (or Warmup-Stable-Decay) learning rate schedule. And what’s more: we will release every single intermediate checkpoint from these stable phases, inspired by Pythia. Our main reason for doing so was supporting future research and applications: anyone is free to restart training from any of our pre-decay checkpoints, and perform annealing on domain-appropriate data for their intended use!
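For reference, a trapezoidal schedule is simple to write down. The peak value and phase fractions below are placeholders rather than ModernBERT’s actual hyperparameters:

def wsd_lr(step: int, total_steps: int, peak: float = 8e-4,
           warmup_frac: float = 0.02, decay_frac: float = 0.025) -> float:
    """Warmup-Stable-Decay: linear warmup, long constant plateau,
    then a short linear decay at the very end of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    if step < warmup_steps:
        return peak * step / max(warmup_steps, 1)
    if step > total_steps - decay_steps:
        return peak * (total_steps - step) / max(decay_steps, 1)
    return peak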

The tricks, it’s all about the tricks!

If you’ve made it this far into this announcement, you’re probably used to this: of course, we use tricks to make things quicker here too. To be precise, we have two main tricks.

Let’s start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.
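A sketch of what such a batch-size warmup schedule might look like; every number here is an illustrative placeholder:

def batch_size_at(step: int, warmup_steps: int = 2_000,
                  start: int = 96, final: int = 4_608) -> int:
    """Ramp linearly from a small batch size to the final training batch size
    over the first `warmup_steps` optimizer steps, then stay constant."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step // warmup_steps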

The second trick is far more uncommon: weight initialization via tiling for the larger model size, inspired by Microsoft’s Phi family of models. This one’s based on the following realization: Why initialize ModernBERT-large’s weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?

And indeed, it turns out that tiling ModernBERT-base’s weights across ModernBERT-large works better than initializing from random weights. It also has the added benefit of stacking nicely with batch size warmup for even faster initial training.
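Here is a minimal sketch of the tiling idea for a single 2D weight matrix; the actual procedure covers every weight in the model and the exact details may differ:

import torch

def tile_weights(small: torch.Tensor, large_shape: tuple[int, int]) -> torch.Tensor:
    """Initialize a larger weight matrix by repeating a smaller, already-trained
    one until it covers the target shape, then cropping to size."""
    rows, cols = large_shape
    reps = (-(-rows // small.shape[0]), -(-cols // small.shape[1]))  # ceiling division
    return small.repeat(*reps)[:rows, :cols].clone()

base_proj = torch.randn(768, 768)                   # stand-in for a trained base weight
print(tile_weights(base_proj, (1024, 1024)).shape)  # torch.Size([1024, 1024])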

Conclusion

In this blog post we introduced the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models, finally giving BERT a much needed do-over.

ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing an extremely attractive size/performance ratio.

More than anything, we’re really looking forward to seeing what creative ways to use these models the community will come up with! To encourage this, we’re opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here’s a demo we thought about: code similarity HF space! And remember, this is an encoder model, so all the coolest downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there’s lots of cool frameworks out there to support fine-tuning encoders: 🤗Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!
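If you want a concrete starting point for that fine-tuning, here is a minimal classification sketch with 🤗Transformers, using IMDB sentiment as a stand-in task. The dataset and hyperparameters are placeholders, and it assumes the transformers version noted at the top of this post:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,   # enables dynamic padding via the default collator
)
trainer.train()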

Links

LightOn sponsored the compute for this project on Orange Business Cloud Avenue.

nbsanity - Share Notebooks as Polished Web Pages in Seconds

2024-12-13 08:00:00

At fastai, we’ve long believed that Jupyter Notebooks are an excellent medium for technical writing, combining live code, visualizations, and narrative text in a single document. However, sharing notebooks in a way that’s both beautiful and accessible has always been a challenge. While GitHub’s notebook viewer is functional, it lacks the polish and features needed for proper technical communication. Today, we’re introducing nbsanity, a service that transforms any public GitHub notebook into a polished web page with just a URL change.

The Challenge

While GitHub’s rendering is functional, it suffers from several limitations: the rendering can be sluggish and occasionally fails completely, there’s no way to collapse or hide code cells, and the presentation can’t be customized. One particularly frustrating issue is the lack of horizontal scrolling for code cells, and overall, the reading experience isn’t optimized for consumption.

Nbviewer solves some of these issues, but doesn’t allow you to customize the presentation. We’ve previously addressed some of these challenges with tools like fastpages and nbdev, but these solutions require setup and maintenance 1. We realized there was a need for something simpler - a solution that would allow instant sharing without any overhead.

I’ve been searching for the perfect low-friction system for technical writing ever since discovering Simon Willison’s elegant TIL (Today I Learned) approach. With nbsanity, we finally have it.

What is nbsanity?

nbsanity is a free service that renders any public Jupyter notebook from GitHub or Gists as a polished web page. There’s no setup, no configuration, and no deployment needed.

nbsanity is powered by Quarto, an open-source scientific and technical publishing system. Through our extensive work with various documentation tools, we’ve found Quarto to be the most ergonomic static site generator available for notebooks. It offers seamless integration with both Jupyter and VSCode through dedicated extensions, while providing remarkable flexibility in output formats - including presentations, books, PDFs and websites.

One of Quarto’s most powerful features is its “directives” system - simple cell comments that begin with #| that allow you to customize how your content is rendered. These directives are easy to add and do not clutter your code. Below are examples of Quarto capabilities you get access to with nbsanity:

  • Cell Visibility Control: Hide specific cells with #|include: false while keeping their execution
  • Output Management: Show just results with #|echo: false or raw output with #|output: asis
  • Error Handling: Control error messages with #|error: false and warnings with #|warning: false
  • Content Organization: Create tab panels with {.panel-tabset} and callouts with :::{.callout-note} (these are not directives, but markdown cell syntax that creates tab panels and callouts).
  • Layout Control: Apply custom CSS classes and control figure layouts with directives like #| fig-width: and #| layout-ncol:

Documentation concerning these directives can be found in the more resources section.

nbsanity is focused on doing one thing well: rendering public notebooks beautifully. This means it only works with notebooks hosted on GitHub or in Gists. Furthermore, you’ll need to use remote URLs for any images in your notebooks2. These constraints let us deliver a service that’s simple, fast, and completely maintenance-free for users. Think of nbsanity as the “pastebin for notebooks” - it’s the fastest way to go from a GitHub notebook to a polished reading experience.

We added extra love

In addition to Quarto’s rendering process, we’ve added several quality-of-life improvements. All rendered notebooks have (1) a table of contents, (2) a link to the original GitHub URL, and (3) text wrapping in code cells.

We’ve even made sure that rendered notebooks have fancy social cards, thanks to Simon Willison’s shot-scraper:

These social cards show the actual contents of your notebook and help your posts stand out on social media.

Getting Started

Using nbsanity couldn’t be simpler. You have two options:

Option 1: URL Modification

Replace github.com with nbsanity.com in any GitHub notebook URL. This works for both repositories and gists. For example:

GitHub URL   https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb

nbsanity URL https://nbsanity.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb

For gists, the URL format is slightly different: nbsanity.com/gist/[username]/[gist_id]. See these instructions for more details.
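If you would rather script the substitution, the repository case boils down to a one-line string replacement (a sketch; gists use the slightly different format mentioned above):

def nbsanity_url(github_url: str) -> str:
    """Swap the github.com host for nbsanity.com, leaving the path untouched."""
    return github_url.replace("https://github.com/", "https://nbsanity.com/", 1)

print(nbsanity_url("https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb"))
# https://nbsanity.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb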

Option 2: Bookmarklet

For even faster conversion, drag this bookmarklet to your bookmarks bar:

nbsanity

Clicking this bookmarklet while viewing a public GitHub notebook will perform the necessary URL substitution for you.

A Demo

To demonstrate Quarto’s capabilities, let’s examine one of my favorite features: code-folding.

Example 1

To collapse a code cell with an expandable summary, I can add the following directive to the top of a code cell:

#| code-fold: true
#| code-summary: "Click to see data preprocessing"

These rendering instructions create the collapsed cell shown below, but are not themselves rendered or seen by the reader.

Click to see data preprocessing
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'id': range(1000),
    'value': np.random.normal(0, 1, 1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000)
})

# Preprocessing steps
data['value_normalized'] = (data['value'] - data['value'].mean()) / data['value'].std()
data['value_binned'] = pd.qcut(data['value'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

Example 2

To show code expanded by default while still giving readers the option to collapse it, we can use the same code-fold directive with a different option:

#| code-fold: show
Code
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(8, 4))
plt.plot(np.random.randn(100).cumsum())
plt.title('Random Walk')
plt.show()

Important Notes

While nbsanity makes notebook sharing effortless, there are a few key things to keep in mind to use it well. First, nbsanity is a rendering service only - it displays your notebooks but does not execute them, even if you have Quarto directives that say otherwise. This avoids potential security issues.

nbsanity also has a caching system that preserves the history of your notebook renders. Each time you render a notebook, you receive a unique link corresponding to that specific version. If you later update your notebook and render it again, you’ll get a new link. All previous versions remain accessible through their original links. Any new rendering capabilities we introduce will only apply to new renders, meaning your existing shared notebooks will maintain their original appearance.

Next Steps with nbsanity

We built nbsanity because we believe that reducing friction in sharing knowledge is important. We’ve been refining nbsanity with our community of over 2,000 students in our solveit course, where it’s become an integral part of how students share their work. Their feedback and usage patterns have helped us polish the tool into something we love using ourselves.

The best way to get started is to try it yourself:

  1. Visit nbsanity.com and drag the bookmarklet to your browser’s bookmark bar
  2. Navigate to any public Jupyter notebook on GitHub
  3. Click the bookmarklet to view the notebook with beautiful Quarto rendering

Whether you’re writing “Today I Learned” posts, sharing technical tutorials, or enhancing your project’s documentation, we hope this tool makes your technical writing journey a little bit easier. The project is open source and available on GitHub—we welcome your feedback and contributions!3

P.S. If you share your notebook using nbsanity on social media, please tag me—I’d love to see your work! You can find me on twitter and linkedin.

More resources

Here are links to Quarto docs I find helpful when authoring notebooks:

  1. cell output: hide, show, and filter cell output and input.
  2. code-display: configure how code is displayed, including line-numbers, folding of cells, hiding of cells, etc.
  3. figures: configure how figures are shown
  4. tables: configure how tables are shown
  5. metadata: configure the title, subtitle, date, author and more.
  6. numbering: toggle section numbering.

Footnotes

  1. JupyterBook is another project that allows you to customize the presentation of notebooks. Like fastpages, nbdev and other static site generators, these projects require a non-trivial amount of setup and maintenance.↩︎

  2. The reason for requiring remote urls is that we do not want to be rate limited by the GitHub API in fetching related files.↩︎

  3. We need to keep the service minimal, so please expect that we will be discerning about feature requests and PRs.↩︎