
GPU Programming from Scratch

2025-03-17 08:00:00

Jeremy Howard says: I’m really excited to introduce you all to Sarah Pan, an extraordinary and inspiring AI researcher who began working with Answer.AI whilst still at high school (and she had a first-author paper accepted at NeurIPS too)!

Sarah’s first project with us is WebGPU Puzzles, which is the best way I know of to get started with GPU programming fundamentals today. With it, you can begin learning GPU programming right in your browser. I was astonished at how Sarah was able to learn, from scratch, GPU programming, WebGPU, and gpu.cpp in a matter of weeks, to a level where she could pull this off.

I’ve asked Sarah to share a bit about her story, which she has done in the post below. She was also kind enough to spend some time doing an interview with me, which I’m sure you’ll agree is a fascinating insight into the life of a very special person.

Hey! My name is Sarah Pan and you might’ve seen my name attached to the WebGPU Puzzles project (based on Answer.AI’s gpu.cpp). A little about me: I’m a research fellow at Answer.AI as well as a first-year student at MIT! This means that outside of classes and all the other fun chaos of MIT, I work with the Answer.AI team on various projects, as well as on my own research.

The Origin Story

You might be wondering how I got here. (Sometimes, I do too.) But my AI journey began towards the end of middle school when my older brother introduced me to fast.ai. At the time, having R2D2 as my favorite Star Wars character was enough to propel me into taking the course.

Practical Deep Learning took a top-down approach to teaching about neural networks. This meant that the important high-level ideas weren’t gatekept by the nitty-gritty. Being able to understand the inner workings of complex systems without having taken a math class past Algebra I, much less having a college degree, was very refreshing.

Fast forward to junior year of high school: I had a few more AI experiences under my belt and was ready for more. I joined MIT Primes, a research program that connects high schoolers to researchers in mathematics, computer science, and computational biology. There, my mentor, Vlad Lialin, showed me the ropes to everything from effectively reading academic papers to adopting the “iterate fast” ethos.

Together, we worked on the project that would become my first publication. I don’t want to bore you with the details, but we essentially used a process reward model 1 in RL to improve the reasoning abilities of LLMs.

Though this sounded pretty straightforward at the start, I was quickly proven wrong. There were many moments where learning auxiliary skills was essential to implementing the ideas I really cared about. If anything, a summer of trying to fit billion-parameter LLMs onto dual 3090s taught me about the importance of good engineering habits. But soon enough, October rolled around and my fingers were crossed for a NeurIPS paper.

NeurIPS

I don’t really know of any other way to describe the experience but surreal. The poster halls were huge and, almost out of nowhere, there were so many people with the same interests as me. All those ideas I saw on Twitter and read about on various blogs materialized in front of me.

I remember bumping into Jeremy entirely by chance2, and we stayed in touch after the conference. Little did I know, those minute engineering problems I encountered over the summer would resurface in conversations with him and the people who would become my mentors and collaborators at Answer.AI.

As of late

Last summer, I collaborated with Austin Huang on creating WebGPU Puzzles. And fun fact, that was my second encounter with GPU programming, so I was a little intimidated going into it. I had a general understanding of what CUDA was and had stumbled upon Sasha Rush’s GPU Puzzles at some point, too. But soon enough I realized that the ideas those experiences taught me would be pretty useful.

One thing I appreciated about Sasha’s puzzles was that my main focus was on solving the puzzles themselves. For one, they were hosted in a Google Colab notebook, which has a beginner-friendly interface. And when it came to syntax, the CUDA puzzles used Numba, which doesn’t require much knowledge beyond Python and NumPy. The accessibility and user-friendliness of these puzzles stripped away unnecessary complexity and distilled parallel computing into a clear set of principles. That way, instead of worrying about all things C++, I could focus on something more akin to a coding challenge.

I wanted to replicate this for those who wanted to test out WebGPU/gpu.cpp, or even those just “breaking into” GPU programming. From there, I set out on developing a WebGPU version of Sasha’s CUDA puzzles with a detailed set of solutions for ultimate beginner-friendliness. Since then, I’ve returned to my research roots: I’m currently working on a reward model project3.

Beyond research, I’m a first year at MIT studying math and computer science. My favorite class thus far is probably discrete math (it’s very well taught!) but I regret not signing up for more math classes.4 Outside of school, I love watching the sun rise while rowing on the Charles River, reading AI Twitter, and Facetiming my dog.

Footnotes

  1. A process reward model (PRM) provides feedback at each step of a reasoning process, unlike outcome reward models (ORMs) which evaluate the entire response, offering more granular and structured guidance for improving complex tasks.↩︎

  2. Ultimate full circle moment for me!↩︎

  3. preprint soon!↩︎

  4. Have to knock out those general institute requirements↩︎

TIL: Masked Language Models Are Surprisingly Capable Zero-Shot Learners

2025-02-10 08:00:00

Welcome to this post! As a “TIL”, it’s a purposefully smaller blog post, containing just the key details. If you’d like to know more, head over to the technical report or play with the model on HuggingFace!

TL;DR

Traditionally (with some exceptions, of course), encoder models such as BERT are used with a task-specific head on top of the core encoder model. Functionally, this means that we discard all the language modelling goodness stored in the Masked Language Modelling head (the one used during pre-training), and seek to simply re-use the backbone to perform various tasks.

This works really well: there’s a reason why it’s the dominant paradigm! However, what if the generative head itself could actually perform most tasks, even zero-shot? This is what we tried, and it works pretty well! We introduce ModernBERT-Large-Instruct, an “instruction-tuned” encoder fine-tuned on top of ModernBERT-Large with a shockingly simple mechanism. It can be used to perform classification and multiple-choice tasks using ModernBERT’s MLM head instead of task-specific heads. Unlike previous approaches, our method requires no architectural changes or complex pipelines, and still achieves strong results across various tasks.

  • It’s surprisingly capable at knowledge QA tasks, where encoders are usually weak: On the MMLU-Pro leaderboard, it outperforms all sub-1B models like Qwen2.5-0.5B and SmolLM2-360M, and is quite close to Llama3-1B (trained on considerably more tokens, and with 3x the parameters)!
  • On NLU tasks, fine-tuning ModernBERT-Instruct matches or outperforms traditional classification heads when fine-tuned on the same dataset.
  • We achieve these results with a super simple training recipe, which is exciting: there’s definitely a lot of room for future improvements👀👀

I just want to try it!

The model is available on HuggingFace as ModernBERT-Large-Instruct. Since it doesn’t require any custom attention mask or anything of the sort, the zero-shot pipeline is very simple to set up and use:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load model and tokenizer
model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
if device == 'cuda':
    model = AutoModelForMaskedLM.from_pretrained(model_name, attn_implementation="flash_attention_2")
else:
    model = AutoModelForMaskedLM.from_pretrained(model_name)

model.to(device)

# Format input for classification or multiple choice. This is a random example from MMLU.
text = """You will be given a question and options. Select the right answer.
QUESTION: If (G, .) is a group such that (ab)^-1 = a^-1b^-1, for all a, b in G, then G is a/an
CHOICES:
- A: commutative semi group
- B: abelian group
- C: non-abelian group
- D: None of these
ANSWER: [unused0] [MASK]"""

# Get prediction
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model(**inputs)
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = outputs.logits[0, mask_idx].argmax()
answer = tokenizer.decode(pred_id)
print(f"Predicted answer: {answer}")  # Outputs: B

For more, you’ll want to check out our mini cookbook GitHub repository, with examples on how to fine-tune the model!

Introduction

Encoder models traditionally perform best on all tasks with a task-specific head. While not necessarily an issue, this feels like a bit of a waste: the MLM head, its original pre-training head, is fully discarded. In practice, this works, but it also feels like we might be leaving something on the table. Additionally, this places great restrictions on zero-shot capabilities: since task-specific heads are usually required, it’s been necessary to find various tricks to get around this and still get good zero-shot performance.

A brief, incomplete history of downstream uses of MLM encoders

Zero-shot classification with encoder models has been an active area of research, with various approaches tried over the years. The most common approach has been to repurpose textual entailment: after training on tasks like MNLI, models are used to predict whether a given label is entailed by the input text. Some very powerful models have been trained on the large-scale TaskSource datasets, such as tasksource/ModernBERT-large-nli.

This is also definitely not the first piece of work exploring generative BERTs as multitask learners: there’s been some work on prompting, sample-efficient training via the pattern-exploiting training (PET) method, or even making the models auto-regressive! Some approaches are even pretty similar to ours, like UniMC, which has shown promise by converting tasks into multiple-choice format using semantically neutral verbalizers (e.g., “A”, “B” instead of meaningful words) and employing custom attention masks.

However, all of these methods come with drawbacks: some are either brittle (particularly to different verbalizers) or reach performance that is promising-but-not-quite-there, while others yet reach very good results but add considerable complexity. Meanwhile, in decoder-land (or, if you will, LLMTopia), instruction tuning has progressed extremely rapidly, and big, scary LLMs have become very good at generative classification, especially zero-shot, thanks to their instruction training.

But this, too, has drawbacks: small LLMs are routinely outperformed by encoders, which can even match the larger ones once fine-tuned! Additionally, the computational cost of running an autoregressive LLM, even one on the smaller side, is generally considerably greater than that of an encoder, which performs tasks in a single forward pass.

ModernBERT-Large-Instruct

Our approach aims to show that maybe, just maybe, we can have our cake and eat it too: what if an MLM could tackle tasks (even zero-shot ones!) in a generative way with a single forward pass, and could be easily fine-tuned further to perform better in-domain, all without adding any pipeline or architectural complexity?

This is what we demonstrate the potential of here! We use a very simple training recipe: FLAN-style instruction tuning with ModernBERT’s MLM head. There are no custom attention masks, no complex prompt engineering, and no heavy-handed data pre-processing pipeline: we simply filter FLAN to only tasks that can be answered using a single token, and filter out some examples from datasets that we used for downstream evaluations.

How It Works

A high-level overview of the full process

Our key insight is two-fold: ModernBERT can use a single head to perform most NLU tasks, either zero-shot or fully fine-tuned, and this behaviour can be unlocked with an extremely simple training recipe, suggesting very strong potential.

The way it works is very simple:

  1. All tasks are formatted in a way where the model can answer with a single token, which is also the final token of the input. This is always prefaced with an anchor token ([unused0]), to tell the model that the next token needs to be the single token answer.
  2. The model is given a question, short instructions, and a list of potential choices. All choices are prefaced with a single-token verbalizer: this is the token that the model will predict if it assigns this label.
  3. The model then predicts the most likely token for the answer, and the potential verbalizer with the highest score is selected as the answer.

This approach has several advantages:

  • No architectural changes needed, for training or inference.
  • It can be tried on any model that supports Masked Language Modeling out of the box.
  • Very little data pre-processing is needed to begin experimenting.
  • Likewise, it greatly reduces prompt engineering: only a very short template and a description of all labels need to be written to perform a task.
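
To make step 3 above concrete, here is a minimal sketch of verbalizer scoring: instead of taking an unrestricted argmax over the whole vocabulary (as in the quick-start snippet earlier), we read the logits at the [MASK] position and compare only the candidate answer tokens. The toy question, the “A” to “D” verbalizers, and the verbalizer-to-token mapping are illustrative assumptions rather than the exact evaluation code.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "answerdotai/ModernBERT-Large-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = """You will be given a question and options. Select the right answer.
QUESTION: What is 2 + 2?
CHOICES:
- A: 3
- B: 4
- C: 5
- D: 22
ANSWER: [unused0] [MASK]"""

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the [MASK] token, i.e. the single-token answer slot.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1].item()

# Map each verbalizer to a token id (assumes each encodes to a single token;
# depending on the tokenizer, a leading space such as " A" may be needed).
verbalizers = ["A", "B", "C", "D"]
verbalizer_ids = [tokenizer.encode(v, add_special_tokens=False)[0] for v in verbalizers]

# Score only the candidate answer tokens and pick the highest-scoring one.
scores = logits[0, mask_idx, verbalizer_ids]
print("Predicted label:", verbalizers[scores.argmax().item()])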

Training Details

As above, the training recipe is kept voluntarily simple. This is largely meant to avoid scope creep: there are a lot of potential improvements to be explored by using better processing pipelines, or more modern instruction sets, but these would all require complex processes to turn them into single-token tasks.

  • Data: A downsampled (20M samples), filtered FLAN-2022 dataset to keep only single-token answers. A very simple filtering process: tokenize the potential answer and exclude all examples where the answer contains more than one token. Examples from our evaluation datasets were also filtered out to avoid overfitting.
  • Objective: We use the Answer Token Prediction (ATP) objective, which is to predict the single masked token which should be the verbalizer containing the answer. The final training objective is a mix of 80% ATP and 20% dummy MLM examples, where masked tokens are given a meaningless label (see below).
  • Base Model: ModernBERT-Large (395M parameters), which we recently introduced with our friends at LightOn & other places. It proved to be a much more capable base model than alternatives.
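
As a rough illustration of the single-token filter described in the Data bullet above (a sketch of the idea, not the actual preprocessing code; the tokenizer choice is an assumption), the check boils down to:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")

def is_single_token(answer: str) -> bool:
    # Keep an example only if its answer maps to exactly one token.
    return len(tokenizer.encode(answer, add_special_tokens=False)) == 1

print(is_single_token("B"))      # True: a one-letter verbalizer
print(is_single_token("Paris"))  # depends on the tokenizer's vocabulary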

Dummy Examples

When training the model, we theorized that Answer Token Prediction could lead to catastrophic forgetting, with the model only learning to predict certain tokens and losing overall reasoning capabilities. To counter this, we introduced a training objective mix, where 20% of the examples were assigned the normal MLM objective (where 30% of tokens in the text are randomly masked, and the model has to predict all of them at once), with the remaining 80% adopting the Answer Token Prediction objective.

Except, we implemented this wrong, and effectively made these samples empty examples, which we dub “dummy MLM examples”. The issue was in the labelling: rather than the [MASK] tokens being assigned the appropriate label, they were all given [MASK] as their label. This meant that, very quickly, the model learned to simply predict [MASK] for every masked position whenever there was more than one [MASK] token in the text, and the loss on these examples swiftly dropped to near-zero.
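
For illustration, here is a minimal sketch of the difference between the intended MLM labels and the accidental “dummy” labels; the helper below is hypothetical, written only to match the description above.

import torch

def mlm_labels(original_ids, masked_positions, mask_token_id, dummy=False):
    # -100 is ignored by the cross-entropy loss in Hugging Face models.
    labels = torch.full_like(original_ids, -100)
    if dummy:
        # The bug: every masked position gets [MASK] itself as its target,
        # so the objective carries no information about the original tokens.
        labels[masked_positions] = mask_token_id
    else:
        # Intended MLM: the target is the original token at each masked position.
        labels[masked_positions] = original_ids[masked_positions]
    return labels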

Hm, simple mistake, easy to fix, right? Right. Except, we observed something that we didn’t expect: we evaluated three pre-training setups (100% ATP, 80% ATP/20% MLM, 80% ATP/20% dummy), and we found that the dummy example variant was the best performing one, by a good margin! While we haven’t explored this phenomenon in enough depth to explain what is going on, my personal theory is that it acts as a form of regularization, similar to dropout.

Performance

Zero-Shot Results

The zero-shot results are pretty encouraging and, in a way, pretty surprising!

Competing with the best (MMLU-Pro leaderboard for sub-2B models)
  • Knowledge-Based Multiple Choice Questions (MMLU and MMLU-Pro): ModernBERT-Large-Instruct stands at 43.06% accuracy on MMLU, beating similarly sized models like SmolLM2-360M (35.8%) and getting close to Llama3-1B (45.83%). On MMLU-Pro, its performance would give it a very good spot on the leaderboard, punching far above its weight class and competing with bigger LLMs!
  • Classification: On average, it beats all the previous zero-shot methods. However, this is not true on a per-dataset basis: while this method has strong potential and gets very good overall results, there are some datasets where it underperforms, and others where it overperforms. This indicates strong potential for future developments of the method.

Fine-Tuned Results

The MLM Head is All You Need

Across a variety of tasks, focusing on topic classification, textual entailment (MNLI) and sentiment analysis, fine-tuning ModernBERT-Large-Instruct on each task appears to match the performance of traditional classification-head-based approaches. On certain datasets, it even outperforms them! In fact, I think that this method holds the key to finally closing the last gap and making ModernBERT a better classifier than DeBERTaV3.

A caveat here is that the training set of some of these tasks is present, in relatively small proportions, in our pre-training mix: however, we expect this effect to be rather minimal, as fine-tuning for multiple epochs brings both methods firmly into “in-domain” territory.

Modernity Matters

A shamelessly self-plagiarized but appropriate meme

Finally, we wanted to know whether this potential is inherent to all pre-trained MLM encoders, or whether it’s specific to ModernBERT. To answer this question, we applied the same approach to older models like RoBERTa-Large or models with a modern architecture but trained on smaller-scale, less diverse data, and the performance dropped significantly:

Model                        MMLU
ModernBERT-Large-Instruct    43.06
GTE-en-MLM-Large             36.69
RoBERTa-Large                33.11

This suggests that strong generative downstream performance in MLM encoders relies largely on being trained on a sufficiently large-scale, diverse data mix, given the vast performance gap between ModernBERT-Large-Instruct and GTE-en-MLM-Large, which adopts a very similar architecture to that of ModernBERT-Large (minus efficiency tweaks). The relatively smaller performance gain from RoBERTa-Large to GTE-en-MLM-Large seems to suggest that while adopting a better architecture does play a role, its effect is much more modest than that of the training data.

Looking Forward

While these results are promising, they are very early stage! All they really do is demonstrate the potential of the MLM head as a multi-task head, but they are far from pushing it to its limits. Among other things:

  • Exploring better, more diverse templating
  • A more in-depth analysis of the training mechanisms, and the effect of dummy examples
  • Testing on more recent instruction datasets, with better construction
  • Investigating few-shot learning capabilities
  • Scaling to larger model sizes
  • … so many more things!

All strike us as very promising directions for future work! In fact, we’ve heard that some very good people are working on some of these things already…

Ultimately, we believe that the results of our exceedingly simple approach presented here open up new possibilities for encoder models. The ModernBERT-Large-Instruct model is available on HuggingFace.

MonsterUI: Bringing Beautiful UI to FastHTML

2025-02-09 08:00:00

Modern web development requires complicated dependencies and extensive boilerplate spread over multiple languages to make good UI. MonsterUI is here to fix that.

The Problem with Web UI Development

Building attractive web applications has always been complicated. FastHTML simplifies web app development by bringing HTMX, Starlette, HTML, and HTTP fundamentals together.

Getting the aesthetics right is still too hard. It requires either extensive CSS, a framework with long inline class strings, or both. You might try Bootstrap or Tailwind CSS. Now, you’re managing class names, remembering utility patterns, and checking docs for boilerplate class strings. This leads to code that is hard to build, maintain, and change for anyone who is not an expert designer.

A typical app has many components: nav bars, forms, modals, cards, and more. Each requires careful consideration of styling, responsive behavior, and interactive states. As your application grows, managing these styles consistently becomes more and more challenging.

This became apparent to me while I was developing web apps. I found myself copying and pasting class strings and maintaining complex styling logic across multiple components. FastHTML made the application logic development a joy, but the styling side remained a constant source of friction.

If you’re tired of context-switching between HTML, CSS, and Python just to build basic web UIs, MonsterUI might be for you.

Real-World Example: Building a Blog

Introducing MonsterUI

MonsterUI lets anyone build high-quality, modern web apps in pure Python without sacrificing design quality.

Built with MonsterUI, styled with FrankenUI, based on design by Shadcn

MonsterUI is a layer on top of FastHTML that provides pre-styled components and smart defaults based on modern libraries (such as Tailwind, FrankenUI, DaisyUI) while maintaining full access to Tailwind CSS when you need it. MonsterUI:

  • Brings FastHTML’s simplicity to web styling.
  • Provides beautiful, responsive components without writing a single CSS class.
  • Lets you focus on building features instead of remembering utility classes.

Let’s learn by example with a card for team members:

def TeamCard(name, role, location="Remote"):
    icons = ("mail", "linkedin", "github")
    return Card(
        DivLAligned(
            DiceBearAvatar(name, h=24, w=24),
            Div(H3(name), P(role))),
        footer=DivFullySpaced(
            DivHStacked(UkIcon("map-pin", height=16), P(location)),
            DivHStacked(*(UkIconLink(icon, height=16) for icon in icons))))

I specified the entire layout, font sizing, icons, and avatar using only Python. I controlled everything without needing special flexbox or CSS class knowledge.

The example above is from the cards documentation page. For comparison, the same card expressed directly with utility classes, without MonsterUI’s helpers, looks like this:
dicebear_url = 'https://api.dicebear.com/8.x/lorelei/svg?seed=James Wilson'
Div(Div(Div(
    Span(Img(alt='Avatar', loading='lazy', src=dicebear_url, 
             cls='aspect-square h-24 w-24'),cls='relative flex h-24 w-24 shrink-0 overflow-hidden rounded-full bg-accent'),
    Div(H3('James Wilson', cls='uk-h3'),
        P('Senior Developer')),
            cls='uk-flex uk-flex-left uk-flex-middle space-x-4'),
        cls='uk-card-body space-y-6'),
    Div(Div(Div(
                Uk_icon(icon='map-pin', height='16'),
                P('New York'),
                cls='uk-flex uk-flex-row uk-flex-middle space-x-4'),
            Div(A(Uk_icon(icon='mail', height='16'),href='#',cls='uk-icon-link'),
                A(Uk_icon(icon='linkedin', height='16'),href='#',cls='uk-icon-link'),
                A(Uk_icon(icon='github', height='16'),href='#',cls='uk-icon-link'),
                cls='uk-flex uk-flex-row uk-flex-middle space-x-4'),
            cls='uk-flex uk-flex-between uk-flex-middle uk-width-1-1'),
        cls='uk-card-footer'),
    cls='uk-card')

What MonsterUI does for you

MonsterUI is based on a simple principle: provide smart defaults while allowing full flexibility.

We’ve done this by building upon proven approaches from some of the most innovative projects in modern web development, carefully selecting components that address the pain points of raw HTML/CSS while maintaining mature, battle-tested strategies.

MonsterUI’s core is FrankenUI, an innovative framework-free UI library by sveltecult that uses beautiful HTML-first components. FrankenUI itself was inspired by shadcn/ui by shadcn which pioneered the concept of copy-pasteable UI components for React.

Raw HTML and CSS present two key challenges: dated visual aesthetics and complex layout management. By combining FrankenUI’s framework-agnostic approach with FastHTML, MonsterUI delivers modern, beautiful components that integrate seamlessly with HTMX’s progressive enhancement paradigm - all while maintaining clean, readable code.

This isn’t just theory - we’re using MonsterUI in production for new applications we’re testing with preview customers, where it powers everything from complex dialog interfaces to dynamic content rendering. The library has been proven robust and maintainable in real-world enterprise settings.

Let’s explore some key features:

Theme

Pick a color theme for your app. There are 12 colors to choose from, each with a dark and a light mode. By default it uses the user’s system preferences.

All themes are synced so components look good on the same page regardless of whether the component is styled with FrankenUI, DaisyUI, or another framework.

Themes add the boilerplate needed to make color styling consistent throughout your app.

app, rt = fast_app(hdrs=Theme.blue.headers())

Base Components

Every HTML element in MonsterUI comes with sensible default styling. A Button isn’t just an HTML button. It’s a styled component with hover states, focus rings, and consistent padding.

Button("Save Changes")

MonsterUI provides data structures (ListT, TextT, ButtonT, etc.) for easy discoverability and tab completion for selecting styles.

For example, to style it with your Theme’s primary color, use ButtonT.primary. Primary colors are used for action buttons like “Add to Cart” or “Submit.”

Button("Add to Cart", cls=ButtonT.primary)

Semantic Text Styles

Built on the foundations of the web, MonsterUI styles semantic tags based on the HTML spec. This means we provide styled functions that match your theme for standard HTML tags like emphasis (<em>), citation (<cite>), marked text (<mark>), small (<small>), and much more.

Card(
    H1("MonsterUI's Semantic Text"),
    P(
        Strong("MonsterUI"), " brings the power of semantic HTML to life with ",
        Em("beautiful styling"), " and ", Mark("zero configuration"), "."),
    Blockquote(
        P("Write semantic HTML in pure Python, get modern styling for free."),
        Cite("MonsterUI Team")),
    footer=Small("Released February 2025"),
)

Smart Layout Helpers

Overall page layout is made simple with the smart layout helpers (DivVStacked, DivCentered, DivFullySpaced, Grid, etc.). For example, DivVStacked stacks things vertically. Grid creates a grid in which to place components.

DivFullySpaced(
    H1("Dashboard"), 
    DivRAligned(
        Button("Export", cls=ButtonT.secondary),
        Button("New Entry", cls=ButtonT.primary)))

# Grid layout with smart responsive columns for mobile vs desktop
# Easy args to customize responsiveness as you need
Grid(map(TeamCard, products), cols_max=3)

Note: See our layout tutorial for more details and advanced usage

Common UI Patterns

MonsterUI includes shortcuts for common UI patterns. For example, you almost always want an input text box to have a label to communicate what it’s for, so we have provided LabelInput as a shortcut that creates a Label and Input pair.

LabelInput("Name", id='myid')

You can use Div, FormLabel, and Input to do this yourself, but this pattern is so common we’ve provided a shortcut. Here’s what the shortcut replaces:

Div(FormLabel('Name', fr='myid'),
    Input(id='myid', name='myid'),
    cls='space-y-2')

Higher Level Components

We also provide helpers to generate more complex components such as navbars, modals, cards, and tables. Each of these is built on top of several base components (ModalContainer, ModalDialog, etc.) so you could build them up yourself. However, the helper function usually gives all the flexibility you need without needing to write your own boilerplate. These helper functions create good UX behavior for you such as automatically collapsing your NavBar into a hamburger menu on mobile.

For example, to create a button that opens a modal (shown first with MonsterUI’s helpers, then as the equivalent code written without them):

Div(Button("Open Modal",uk_toggle="target: #my-modal" ),
    Modal(ModalTitle("Simple Test Modal"), 
          P("With some somewhat brief content to show that it works!", 
              cls=TextPresets.muted_sm),
          footer=ModalCloseButton("Close", cls=ButtonT.primary),id='my-modal'))

Div(Button('Open Modal', type='button', uk_toggle='target: #my-modal', 
           cls='uk-button uk-button-default'),
    Div(Div(Div(H2('Simple Test Modal', cls='uk-modal-title'),
                P('With some somewhat brief content to show that it works!', 
                  cls='uk-text-muted uk-text-small'),
                cls='uk-modal-body space-y-6'),
            Div(Button('Close', type='button', 
                       cls='uk-button uk-modal-close uk-button-primary'),
                cls='uk-modal-footer'),
            cls='uk-modal-dialog'),
        uk_modal=True,
        id='my-modal',
        cls='uk-modal uk-modal-container'))

Rendering Markdown

MonsterUI provides a render_md function that converts Markdown to styled HTML, with syntax highlighting via HighlightJS for code blocks, FrankenUI classes for styling, and Tailwind for additional styling and spacing. Here’s how to use it:

render_md("""
# My Document

> Important note here

+ List item with **bold**
+ Another with `code`

```python
def hello():
    print("world")
```
""")

Getting Started

First, install it using pip:

pip install MonsterUI

Create a new FastHTML application with MonsterUI styling:

from fasthtml.common import *
from monsterui.all import *

# Choose a theme color (blue, green, red, etc)
hdrs = Theme.blue.headers()

# Create your app with the theme
app, rt = fast_app(hdrs=hdrs)

@rt
def index():
    socials = (('github','https://github.com/AnswerDotAI/MonsterUI'),
               ('twitter','https://twitter.com/isaac_flath/'),
               ('linkedin','https://www.linkedin.com/in/isaacflath/'))
    return Titled("Your First App",
        Card(
            H1("Welcome!"),
            P("Your first MonsterUI app", cls=TextPresets.muted_sm),
            P("I'm excited to see what you build with MonsterUI!"),
            footer=DivLAligned(*[UkIconLink(icon,href=url) for icon,url in socials])))

serve()

That’s it! You now have a styled application with zero configuration. The app already includes:

  • Automatic dark/light mode based on user preferences
  • Properly styled typography and spacing
  • Responsive layout that works on all devices
  • Beautiful UI components ready to use
  • Synchronized color scheme with DaisyUI, FrankenUI, and Tailwind

Check out our documentation for more examples and component references.

Thoughts On A Month With Devin

2025-01-08 08:00:00

In March 2024, a new AI company burst onto the scene with impressive backing: a $21 million Series A led by Founders Fund, with support from industry leaders including the Collison brothers, Elad Gil, and other tech luminaries. The team behind it? IOI gold medalists - the kind of people that solve programming problems most of us can’t even understand. Their product, Devin, promised to be a fully autonomous software engineer that could chat with you like a human colleague, capable of everything from learning new technologies and debugging mature codebases to deploying full applications and even training AI models.

The early demos were compelling. A video showed Devin independently completing an Upwork bounty, installing and running a PyTorch project without human intervention.1 The company claimed Devin could resolve 13.86% of real-world GitHub issues end-to-end on the SWE-bench benchmark - ~3x better than previous systems. Only a select group of users could access it initially, leading to breathless tweets about how this would revolutionize software development.

As a team at Answer.AI that routinely experiments with AI developer tools, something about Devin felt different. If it could deliver even half of what it promised, it could transform how we work. But while Twitter was full of enthusiasm, we couldn’t find many detailed accounts of people actually using it. So we decided to put it through its paces, testing it against a wide range of real-world tasks. This is our story - a thorough, real-world attempt to work with one of the most hyped AI products of 2024.

What is Devin?

What makes Devin unique is its infrastructure. Unlike typical AI assistants, Devin operates through Slack and spins up its own computing environment. When you chat with Devin, you’re talking to an AI that has access to a full computing environment - complete with a web browser, code editor, and shell. It can install dependencies, read documentation, and even preview web applications it creates. Below is a screenshot of one way to initiate a task for Devin to work on:

One way to initiate a task with Devin - through Slack

The experience is designed to feel like chatting with a colleague. You describe what you want, and Devin starts working. Through Slack, you can watch it think through problems, ask for credentials when needed, and share links to completed work. Behind the scenes, it’s running in a Docker container, which gives it the isolation it needs to safely experiment while protecting your systems. Devin also provides a web interface, which gives you access to its environment so you can watch it work with IDEs, web browsers, and more in real time. Here is a screenshot of the web interface:

Early Wins

Our first task was straightforward but real: pull data from a Notion database into Google Sheets. Devin tackled this with surprising competence. It navigated to the Notion API documentation, understood what it needed, and guided me through setting up the necessary credentials in Google Cloud Console. Rather than just dumping API instructions, it walked me through each menu and button click needed - saving what would typically be tedious documentation sleuthing. The whole process took about an hour (but only a few minutes of human interaction). At the end, Devin shared a link to a perfectly formatted Google Sheet containing our data.

The code it produced was a bit verbose, but it worked. This felt like a glimpse into the future - an AI that could handle the “glue code” tasks that consume so much developer time. Johno had similar success using Devin to create a planet tracker for debunking claims about historical positions of Jupiter and Saturn. What made this particularly impressive was that he managed this entirely through his phone, with Devin handling all the heavy lifting of setting up the environment and writing the code.

Scaling Up Our Testing

Building upon our early successes, we leaned into Devin’s asynchronous capabilities. We imagined having Devin write documentation during our meetings or debug issues while we focused on design work. But as we scaled up our testing, cracks appeared. Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions.

Even more concerning was Devin’s tendency to press forward with tasks that weren’t actually possible. When asked to deploy multiple applications to a single Railway deployment (something that Railway doesn’t support), instead of identifying this limitation, Devin spent over a day attempting various approaches and hallucinating features that didn’t exist.

The most frustrating aspect wasn’t the failures themselves - all tools have limitations - but rather how much time we spent trying to salvage these attempts.

A Deeper Look at What Went Wrong

At this point in our journey, we were puzzled. We had seen Devin competently handle API integrations and build functional applications, yet it was struggling with tasks that seemed simpler. Was this just bad luck? Were we using it wrong?

Over the course of a month, we systematically documented our attempts across these categories:

  1. Creating new projects from scratch
  2. Performing research tasks
  3. Analyzing & Modifying existing projects

The results were sobering. Out of 20 tasks, we had 14 failures, 3 successes (including our 2 initial ones), and 3 inconclusive results. Even more telling was that we couldn’t discern any pattern to predict which tasks would work. Tasks that seemed similar to our early successes would fail in unexpected ways. We’ve provided more detail about these tasks in the appendix below. Below is a summary of our experiences in each of these categories:

1. Creating New Projects From Scratch

This category should have been Devin’s sweet spot. After all, the company’s demo video showed it autonomously completing an Upwork bounty, and our own early successes suggested it could handle greenfield development. The reality proved more complex.

Take our attempt to integrate with an LLM observability platform called Braintrust. The task was clear: generate synthetic data and upload it. Instead of a focused solution, Devin produced what can only be described as code soup - layers of abstraction that made simple operations needlessly complex. We ultimately abandoned Devin’s attempt and used Cursor to build the integration step-by-step, which proved far more efficient. Similarly, when asked to create an integration between our AI notes taker and Spiral.computer, Devin generated what one team member described as “spaghetti code that was way more confusing to read through than if I’d written it from scratch.” Despite having access to documentation for both systems, Devin seemed to overcomplicate every aspect of the integration.

Perhaps most telling was our attempt at web scraping. We asked Devin to follow Google Scholar links and grab the most recent 25 papers from an author - a task that should be straightforward with tools like Playwright. This should have been particularly achievable given Devin’s ability to browse the web and write code. Instead, it became trapped in an endless cycle of trying to parse HTML, unable to extract itself from its own confusion.

2. Research Tasks

If Devin struggled with concrete coding tasks, perhaps it would fare better with research-oriented work? The results here were mixed at best. While it could handle basic documentation lookups (as we saw in our early Notion/Google Sheets integration), more complex research tasks proved challenging.

When we asked Devin to research transcript summarization with accurate timestamps - a specific technical challenge we were facing - it merely regurgitated tangentially related information rather than engaging with the core problem. Instead of exploring potential solutions or identifying key technical challenges, it provided generic code examples that didn’t address the fundamental issues. Even when Devin appeared to be making progress, the results often weren’t what they seemed. For instance, when asked to create a minimal DaisyUI theme as an example, it produced what looked like a working solution. However, upon closer inspection, we discovered the theme wasn’t actually doing anything - the colors we were seeing were from the default theme, not our customizations.

3. Analyzing and Modifying Existing Code

Perhaps Devin’s most concerning failures came when working with existing codebases. These tasks require understanding context and maintaining consistency with established patterns - skills that should be central to an AI software engineer’s capabilities.

Our attempts to have Devin work with nbdev projects were particularly revealing. When asked to migrate a Python project to nbdev, Devin couldn’t grasp even basic nbdev setup, despite us providing it access to comprehensive documentation. More puzzling was its approach to notebook manipulation - instead of directly editing notebooks, it created Python scripts to modify them, adding unnecessary complexity to simple tasks. While it occasionally provided useful notes or ideas, the actual code it produced was consistently problematic.

Security reviews showed similar issues. When we asked Devin to assess a GitHub repository (under 700 lines of code) for security vulnerabilities, it went overboard, flagging numerous false positives and hallucinating issues that didn’t exist. This kind of analysis might have been better handled by a single, focused LLM call rather than Devin’s more complex approach.

The pattern continued with debugging tasks. When investigating why SSH key forwarding wasn’t working in a setup script, Devin fixated on the script itself, never considering that the problem might lie elsewhere. This tunnel vision meant it couldn’t help us uncover the actual root cause. Similarly, when asked to add conflict checking between user input and database values, one team member spent several hours working through Devin’s attempts before giving up and writing the feature themselves in about 90 minutes.

Reflecting As A Team

After a month of intensive testing, our team gathered to make sense of our experiences. These quotes capture our feelings best:

Tasks it can do are those that are so small and well-defined that I may as well do them myself, faster, my way. Larger tasks where I might see time savings I think it will likely fail at. So no real niche where I’ll want to use it. - Johno Whitaker

I had initial excitement at how close it was because I felt I could tweak a few things. And then slowly got frustrated as I had to change more and more to end up at the point where I would have been better off starting from scratch and going step by step. - Isaac Flath

Devin struggled to use internal tooling that is critical at AnswerAI which, in addition to other issues, made it difficult to use. This is despite providing Devin with copious amounts of documentation and examples. I haven’t found this to be an issue with tools like Cursor, where there is more opportunity to nudge things in the right direction more incrementally. - Hamel Husain

In contrast, we found that workflows where developers drive more (like Cursor) avoid most of the issues we faced with Devin.

Conclusion

Working with Devin showed what autonomous AI development aspires to be. The UX is polished - chatting through Slack, watching it work asynchronously, seeing it set up environments and handle dependencies. When it worked, it was impressive.

But that’s the problem - it rarely worked. Out of 20 tasks we attempted, we saw 14 failures, 3 inconclusive results, and just 3 successes. More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability - Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers.

This reflects a pattern we’ve observed repeatedly in AI tooling. Social media excitement and company valuations have minimal relationship to real-world utility. We’ve found the most reliable signal comes from detailed stories of users shipping products and services. For now, we’re sticking with tools that let us drive the development process while providing AI assistance along the way.

Appendix: Tasks Attempted With Devin

Below is a list of the projects we gave Devin, grouped by theme: (1) creating a new project, (2) research, (3) analyzing an existing code base, and (4) modifying an existing code base. Each entry lists the project, its status, our description of the task, and our reflections.

1. Create A New Project

Planet Tracker (Success)
Description: I wanted to debunk some claims about historical positions of Jupiter and Saturn.
Reflections: Devin nailed it. I actually talked to Devin from my phone via Slack and it made it happen.

Migrating data from Notion Into Google Sheets (Success)
Description: I told Devin to programmatically pull info from a Notion document into a Google Sheet. This was my very first project that I executed with Devin and it pulled it off nicely.
Reflections: Devin read the Notion and Google API docs by itself. Devin also navigated me to the Google Cloud console and provided me with instructions on all the different menus to click through, which would have taken me quite a bit of time on my own! At the end, I was given a reasonable Python script that executed the task. This was my very first interaction with Devin and it executed exactly what I wanted it to do, which was a brand new experience for me. I was quite excited about Devin at this point.

Multi-app deploys on Railway (Inconclusive)
Description: I asked Devin to deploy multiple applications to a single Railway deployment, so that I could have different apps sharing the same local db for testing.
Reflections: It turns out that this task was ill-defined because it’s not actually possible to do this, if I understand correctly. However, Devin marched forward and tried to do this and hallucinated some things about how to interact with Railway.

Generate synthetic data and upload it to Braintrust (Failure)
Description: I asked Devin to create synthetic data for an LLM observability platform called Braintrust that I wanted to test.
Reflections: Devin created overly complex code that was hard to understand, and got stuck trying to fix errors. We ended up using Cursor to do this step by step in an iterative fashion.

Create an integration between two applications (Failure)
Description: I asked Devin to create an integration between Circleback, my AI notes taker, and Spiral.computer, with pointers to the documentation of each.
Reflections: I got really horrible spaghetti code that was way more confusing to read through than if I had just written it from scratch. So I decided not to invest any more time in using Devin for this particular task.

Web scraping Papers By Following Google Scholar Links (Failure)
Description: I asked Devin to grab the most recent 25 papers from an author on Google Scholar programmatically using Playwright, and told it that if it encountered a paywall it was OK to skip that particular document.
Reflections: Devin went into a rabbit hole of trying to parse HTML that it seemingly couldn’t get out of. It got stuck and went to sleep.

Create minimal HTMX bulk upload example app (Failure)
Description: I asked Devin to read the HTMX documentation page for the bulk edit example and, with that and fake server code, create a minimal FastHTML version of the example for the FastHTML Gallery.
Reflections: The example did not work and was not minimal. Devin used objects from the request object that didn’t exist and added many unnecessary things, like toasts (which also didn’t work) and inline CSS styling.

Create DaisyUI Themes to match FrankenUI Theming (Failure)
Description: I asked Devin to create DaisyUI and highlight.js theming so that they match the FrankenUI themes and can be used in the same app seamlessly.
Reflections: Devin mapped pre-existing DaisyUI themes to FrankenUI themes, but they did not match well in many cases. It was also a ton of code changes that I didn’t understand, and I ended up not using any of it because I was too confused to know what to do with it.

2. Perform Research

Research How to make a discord bot (Success)
Description: I asked Devin to perform research on how I could use Python to build a Discord bot that summarizes each day’s messages and sends an email. I also told it to use Claudette if possible. Finally, I told it to write its findings in notebooks with small code snippets I could use to test.
Reflections: Devin produced research notes in the form of a markdown file as an intermediate step to creating the notebook, which I did not ask it for. However, it was quite useful to see a step-by-step plan on how an implementation might come together. The code that it provided me in the notebook was not 100% correct, but it was useful as pseudocode to give me an idea of how I might glue this together. Given that this was more of a research project and I just wanted to know the general idea, I would call this a success.

Research on Transcript Summarization With Accurate Timestamps (Failure)
Description: One issue that I face with summarizing transcripts is that I would love to have accurate timestamps that go with the notes, so that I could use them for YouTube chapter summaries or similar. Concretely, it is not a problem to get accurate timestamps from a transcript, but it’s difficult to associate timestamps with summaries because the timestamps often get bungled. So this is kind of an AI engineering research task.
Reflections: Devin regurgitated things related to my problem, but it did not do a good job of performing research or trying to tackle the problem I was trying to solve, and gave me pointers to code and examples that were not helpful.

Create a minimal DaisyUI theme as an example (Failure)
Description: I asked Devin to create a minimal DaisyUI theme as an example. My goal was to get a starting point, since asking it to do it in a more complete way was unsuccessful.
Reflections: Devin ignored the request to make it as a FastHTML app, and it took some back and forth to get it to go down that path. Eventually, it created an app that appeared to work with different button types. While it gave a link that looked good, once I tried modifying the theme, it became clear the theme was doing nothing; the other colors in the app were from the default theme. This is not a helpful starting point.

3. Analyze Existing Code

Performing a security review of a code base (Inconclusive)
Description: For this task, I pointed Devin at a GitHub repository and told it to assess it for security vulnerabilities. The codebase is under 700 lines of code. I told Devin to write its notes in a markdown file with sample code where necessary.
Reflections: Devin did identify some security vulnerabilities but was extremely overzealous and hallucinated some issues that were not there. Perhaps this was not the ideal task for Devin, as this is something that would be handled just as well by a single call to my favorite LLM.

Review blog posts and make a pull request with improvements (Failure)
Description: I asked Devin to review a blog post and suggest changes with a pull request.
Reflections: Ultimately, Devin failed because it could not figure out how the static site generator that I was using, Quarto, worked. I think that this task would have been successful inside something like Cursor. It seemed like Devin did not do a good job of learning from the project structure and existing files, so it messed up things like front matter and other conventions necessary to edit the blog post correctly.

Review An Application and Identify Potential Areas of Improvement (Failure)
Description: I asked Devin to view the timekeeping app I had mentioned earlier and provided an open-ended task of asking it to suggest any improvements.
Reflections: The suggestions that it provided did not make any sense.

Debug why ssh key forwarding is not working in a setup script (Inconclusive)
Description: I asked Devin to figure out why ssh key forwarding was not working on a server when I used a script to set it up.
Reflections: The issue ended up being unrelated to the script, which I thought was the problem, but Devin never suggested or implied that maybe the problem was somewhere else. It was not helpful because it did not help me uncover the root cause.

4. Modify An Existing Project

Making changes to a nbdev project (Failure)
Description: I had a simple application for time tracking built with FastHTML and nbdev that I wanted to integrate with Apple Shortcuts via an API route.
Reflections: Devin could not figure out how to operate successfully in this environment, even though it got impressively far. One curiosity that I noticed is that Devin created Python scripts to edit notebooks rather than trying to edit the notebook itself. Devin did give me some useful notes and ideas that I hadn’t considered, but the code that it tried to write did not make sense. Eventually, I ended up using a template from someone else and not going with any of Devin’s suggestions.

Migration of Python Project To nbdev (Failure)
Description: I asked Devin to migrate a project to nbdev [prompt details omitted for brevity].
Reflections: It got horribly stuck and could not figure out basic nbdev setup. It seems like it didn’t do a good job of reading the nbdev docs.

Integrate Styling Package Into FastHTML (Failure)
Description: I asked Devin to integrate MonsterUI into one of my applications.
Reflections: Devin could not figure out how to work with an nbdev repo.

Add feature to check for conflicts between user input and database (Failure)
Description: I asked Devin to add a feature to an app to compare user input values to values from a database based on prior runs and give a UI if they don’t match.
Reflections: I spent several hours slowly working through getting it working properly before I gave up. I wrote the feature myself in about 90 minutes.

Generate LLMs context file with the contents of every fasthtml gallery example (Failure)
Description: I asked Devin to create llms text files for the fasthtml gallery.
Reflections: I was excited to see it created a separate markdown file for each example and then tried to roll them up into the llms context files initially. I had not thought about doing that, and things seemed all there at first. When I pulled it down and started digging in, I started finding things I did not like: the format of the llms files wasn’t correct (even though I gave it information to use XML tags to separate examples, it didn’t); it added and pinned a specific version of the markdown package as a dependency and used that, instead of using the markdown2 package which is already used and was already a dependency; and it did a bunch of pytest stuff and added a dependency, even though the project doesn’t use pytest.

Footnotes

  1. This demo was decisively debunked by this video↩︎

Finally, a Replacement for BERT: Introducing ModernBERT

2024-12-19 08:00:00

Finally, a Replacement for BERT

TL;DR

This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8192 sequence length, better downstream performance and much faster processing.

ModernBERT is available as a slot-in replacement for any BERT-like models, with both a base (149M params) and large (395M params) model size.

Click to see how to use these models with transformers

ModernBERT will be included in v4.48.0 of transformers. Until then, it requires installing transformers from main:

pip install git+https://github.com/huggingface/transformers.git

Since ModernBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes. ⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:

pip install flash-attn

Using AutoModelForMaskedLM:

from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  Paris

Using a pipeline:

import torch
from transformers import pipeline
from pprint import pprint
pipe = pipeline(
    "fill-mask",
    model="answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,
)
input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)

Note: ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.
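
For downstream tasks like classification, fine-tuning follows the standard BERT-style recipe. Here is a minimal, illustrative sketch; the SST-2 dataset and the hyperparameters are assumptions for demonstration, not a recommended configuration.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Example dataset with "sentence" and "label" columns; swap in your own data.
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda batch: tokenizer(batch["sentence"], truncation=True),
                      batched=True)

args = TrainingArguments(output_dir="modernbert-sst2",
                         per_device_train_batch_size=32,
                         num_train_epochs=1,
                         learning_rate=5e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"],
                  tokenizer=tokenizer)  # passing the tokenizer enables dynamic padding
trainer.train()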

Introduction

BERT was released in 2018 (millennia ago in AI-years!) and yet it’s still widely used today: in fact, it’s currently the second most downloaded model on the HuggingFace hub, with more than 68 million monthly downloads, only second to another encoder model fine-tuned for retrieval. That’s because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).

Finally, 6 years later, we have a replacement! Today, we at Answer.AI and LightOn (and friends!) are releasing ModernBERT. ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy. This model takes dozens of advances from recent years of work on large language models (LLMs), and applies them to a BERT-style model, including updates to the architecture and the training process.

We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.

In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data. These features open up new application areas that were previously inaccessible through open models, such as large-scale code search, new IDE features, and new types of retrieval pipelines based on full document retrieval rather than small chunks.

But in order to explain just what we did, let’s first take a step back and look at where we’ve come from.

Decoder-only models

The recent high-profile advances in LLMs have been in models like GPT, Llama, and Claude. These are decoder-only models, or generative models. Their ability to generate human-like content has enabled astonishing new GenAI application areas like generated art and interactive chat. These striking applications have attracted major investment, funded booming research, and led to rapid technical advances. What we’ve done, essentially, is port these advances back to an encoder-only model.

Why? Because many practical applications need a model that’s lean and mean! And it doesn’t need to be a generative model.

More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most corporations to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.

Of course, the open-ended capabilities of these giant generative models mean that you can, in a pinch, press them into service for non-generative or discriminative tasks, such as classification. This is because you can describe a classification task in plain English and … just ask the model to classify. But while this workflow is great for prototyping, you don’t want to pay prototype prices once you’re in mass production.

The popular buzz around GenAI has obscured the role of encoder-only models. These are the workhorses of practical language processing, the models that are actually being used for such workloads right now in many scientific and commercial applications.

Encoder-only models

The output of an encoder-only model is a list of numerical values (an embedding vector). You might say that instead of answering with text, an encoder model literally encodes its “answer” into this compressed, numerical form. That vector is a compressed representation of the model’s input, which is why encoder-only models are sometimes referred to as representational models.

While decoder-only models (like a GPT) can do the work of an encoder-only model (like a BERT), they are hamstrung by a key constraint: since they are generative models, they are mathematically “not allowed” to “peek” at later tokens. They can only ever look backwards. This is in contrast to encoder-only models, which are trained so each token can look forwards and backwards (bi-directionally). They are built for this, and it makes them very efficient at what they do.
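To make “encodes its answer into a vector” concrete, here is a minimal sketch, assuming the answerdotai/ModernBERT-base checkpoint and the transformers setup shown earlier, with a small embed helper of our own that mean-pools the token-level hidden states into one sentence embedding (mean pooling is just one common pooling choice, not the only way to get a vector out of an encoder):

import torch
from transformers import AutoTokenizer, AutoModel

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # plain encoder, no task head

def embed(text: str) -> torch.Tensor:
    """Mean-pool token-level hidden states into one sentence vector."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # (1, hidden)

print(embed("Encoders turn text into vectors.").shape)  # torch.Size([1, 768])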

Basically, a frontier model like OpenAI’s O1 is like a Ferrari SF-23. It’s an obvious triumph of engineering, designed to win races, and that’s why we talk about it. But it takes a special pit crew just to change the tires and you can’t buy one for yourself. In contrast, a BERT model is like a Honda Civic. It’s also an engineering triumph, but more subtly, since it is engineered to be affordable, fuel-efficient, reliable, and extremely useful. And that’s why they’re absolutely everywhere.

You can see this by looking at it a number of ways.

Supporting generative models: One way to understand the prevalence of representational models (encoder-only) is to note how frequently they are used in concert with a decoder-only model to make a system which is safe and efficient.

The obvious example is RAG. Instead of relying only on the knowledge trained into the LLM’s parameters, the system uses a document store to furnish the LLM with information relevant to the query. But of course this only defers the problem: if the LLM doesn’t know which documents are relevant to the query, the system needs some other process to select them. That process needs a model fast and cheap enough to encode the large quantities of information required to make the LLM useful. That model is often a BERT-like encoder-only model. For more details on how encoders like ModernBERT are critical in RAG pipelines, see this talk by Benjamin Clavié.
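As a toy illustration of that selection step (not a production retriever; in practice the raw masked-language model would be fine-tuned for retrieval first), here is a sketch that ranks documents by cosine similarity to the query, reusing an embed helper like the one sketched above:

import torch
import torch.nn.functional as F

def top_documents(query: str, documents: list[str], embed, k: int = 2) -> list[str]:
    """Rank documents by cosine similarity to the query and keep the top k.
    `embed` is a text -> (1, hidden) embedding function, e.g. the helper above."""
    query_vec = embed(query)                              # (1, hidden)
    doc_vecs = torch.cat([embed(d) for d in documents])   # (n_docs, hidden)
    scores = F.cosine_similarity(query_vec, doc_vecs)     # (n_docs,)
    best = scores.topk(min(k, len(documents))).indices.tolist()
    return [documents[i] for i in best]

A real RAG pipeline would, of course, pre-compute and index the document embeddings rather than re-embedding everything for each query.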

Another example is supervision architectures, where a cheap classifier might be used to ensure that generated text does not violate content safety requirements.

In short, whenever you see a decoder-only model in deployment, there’s a reasonable chance an encoder-only model is also part of the system. But the converse is not true.

Encoder-based systems: Before there was GPT, there were content recommendations in social media and in platforms like Netflix. There was ad targeting in those venues, in search, and elsewhere. There was content classification for spam detection, abuse detection, etc. These systems were not built on generative models, but on representational models like encoder-only models. And all these systems are still out there and still running at enormous scale. Imagine how many ads are targeted per second around the world!

Downloads: On HuggingFace, RoBERTa, one of the leading BERT-based models, has more downloads than the 10 most popular LLMs on HuggingFace combined. In fact, currently, encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models with their 397 million monthly downloads. In fact, the `fill-mask` model category, composed of encoder “base models” such as ModernBERT, ready to be fine-tuned for other downstream applications, is the most downloaded model category overall.

Inference costs: What the above suggests is that, on an inference-per-inference basis, many times more inferences are performed per year on encoder-only models than on decoder-only or generative models. An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at HuggingFace Inference Endpoints’ pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google’s Gemini Flash and its low inference cost of $0.075/million tokens, would cost over one million dollars!
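For the curious, the arithmetic behind those two figures is easy to reproduce. The numbers below are simply the ones quoted in the paragraph above:

tokens = 15e12                        # 15 trillion tokens to filter

# Encoder route: a fine-tuned BERT-based classifier running on H100s.
encoder_cost = 6_000 * 10.0           # 6,000 H100 hours at $10/hour
print(f"Encoder filtering: ${encoder_cost:,.0f}")   # $60,000

# Decoder route: the cheapest cited option, Gemini Flash at $0.075 per million tokens.
decoder_cost = tokens / 1e6 * 0.075
print(f"Decoder filtering: ${decoder_cost:,.0f}")   # $1,125,000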

Performance

Overview

Here’s a snapshot of the accuracy of ModernBERT and other models across a range of tasks, as measured by standard academic benchmarks – as you can see, ModernBERT is the only model which is a top scorer across every category, which makes it the one model you can use for all your encoder-based tasks:

If you’ve ever done an NLP competition on Kaggle, then you’ll know that DeBERTaV3 has been the choice of champions for years. But no longer: not only is ModernBERT the first base-size model to beat DeBERTaV3 on GLUE, it also uses less than 1/5th of Deberta’s memory.

And of course, ModernBERT is fast. It’s twice as fast as DeBERTa – in fact, up to 4x faster in the more common situation where inputs are mixed length. Its long context inference is nearly 3 times faster than other high-quality models such as NomicBERT and GTE-en-MLM.

ModernBERT’s context length of 8,192 tokens is over 16x larger than most existing encoders. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding. ModernBERT is also the state-of-the-art long context retriever with ColBERT, and is 9 percentage points above the other long context models. Even more impressive: this very quickly trained model, simply tuned to compare to other backbones, outperforms even widely-used retrieval models on long-context tasks!

For code retrieval, ModernBERT is unique. There’s nothing to really compare it to, since there’s never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT’s specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.

This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.

Compared to the mainstream models, ModernBERT performs better across nearly all three broad task categories of retrieval, natural language understanding, and code retrieval. Whilst it slightly lags DeBERTaV3 in one area (natural language understanding), it is many times faster. Please note that ModernBERT, as any other base model, can only do masked word prediction out-of-the-box. To be able to perform other tasks, the base model should be fine-tuned as done in these boilerplates.

Compared to the specialized models, ModernBERT is comparable or superior in most tasks. In addition, ModernBERT is faster than most models across most tasks, and can handle inputs up to 8,192 tokens, 16x longer than the mainstream models.

Efficiency

Here’s the memory (max batch size, BS) and inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other encoder models:

The first thing you might notice is that we’re analysing the efficiency on an affordable consumer GPU, rather than the latest unobtainable hyped hardware. First and foremost, ModernBERT is focused on practicality, not hype.

As part of this focus, it also means we’ve made sure ModernBERT works well for real-world applications, rather than just benchmarks. Models of this kind are normally tested on just the one exact size they’re best at – their maximum context length. That’s what the “fixed” column in the table shows. But input sizes vary in the real world, so that’s the performance we worked hard to optimise – the “variable” column. As you can see, for variable length inputs, ModernBERT is much faster than all other models.

For long context inputs, which we believe will be the basis for the most valuable and important future applications, ModernBERT is 2-3x faster than the next fastest model. And, on the “practicality” dimension again: ModernBERT doesn’t require the additional heavy “xformers” dependency, but instead only requires the now commonplace Flash Attention as a dependency.

Furthermore, thanks to ModernBERT’s efficiency, it can use a larger batch size than nearly any other model, and can be used effectively on smaller and cheaper GPUs. The efficiency of the base size, in particular, may enable new applications that run directly in browsers, on phones, and so forth.

Why is ModernBERT, well, Modern?

Now, we’ve made our case for why encoder models deserve some more love. As trusted, under-appreciated workhorses, they’ve had surprisingly few updates since 2018’s BERT!

Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as “Pareto improvements”): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as ALBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.

However, since the duo’s original release, we’ve learned an enormous amount about how to build better language models. If you’ve used LLMs at all, you’re very well aware of it: while they’re rare in the encoder-world, Pareto improvements are constant in decoder-land, where models constantly become better at everything. And as we’ve all learned by now: model improvements are only partially magic, and mostly engineering.

The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:

  1. a modernized transformer architecture
  2. particular attention to efficiency
  3. modern data scales & sources

Meet the New Transformer, Same as the Old Transformer

The Transformer architecture has become dominant, and is used by the vast majority of models nowadays. However, it’s important to remember that there isn’t one but many Transformers. The main thing they share in common is their deep belief that attention is indeed all you need, and as such, build various improvements centered around the attention mechanism.

ModernBERT takes huge inspiration from the Transformer++ (as coined by Mamba), first used by the Llama2 family of models. We replace older BERT-like building blocks with their improved equivalents. Specifically, we:

  • Replace the old positional encoding with “rotary positional embeddings” (RoPE): this makes the model much better at understanding where words are in relation to each other, and allows us to scale to longer sequence lengths.
  • Switch out the old MLP layers for GeGLU layers, improving on the original BERT’s GeLU activation function (see the sketch after this list).
  • Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively.
  • Add an extra normalization layer after embeddings, which helps stabilize training.
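Here is a minimal sketch of a GeGLU feed-forward block, the GLU-variant MLP mentioned in the list above. The dimensions are illustrative, and the exact layer shapes in ModernBERT may differ:

import torch
import torch.nn as nn

class GeGLUFFN(nn.Module):
    """GLU-variant MLP: one projection is passed through GELU and gates the other,
    then the gated result is projected back down to the hidden size."""
    def __init__(self, hidden: int = 768, intermediate: int = 2304):
        super().__init__()
        self.wi = nn.Linear(hidden, 2 * intermediate, bias=False)  # fused gate + up projections
        self.wo = nn.Linear(intermediate, hidden, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.wi(x).chunk(2, dim=-1)
        return self.wo(self.act(gate) * up)

print(GeGLUFFN()(torch.randn(2, 16, 768)).shape)  # torch.Size([2, 16, 768])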

Upgrading a Honda Civic for the Race Track

We’ve covered this already: encoders are no Ferraris, and ModernBERT is no exception. However, that doesn’t mean it can’t be fast. When you get on the highway, you generally don’t go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit.

In fact, for all the application cases we mentioned above, speed is essential. Encoders are very popular in uses where they either have to process tons of data, allowing even tiny speed increments to add up very quickly, or where latency is very important, as is the case on RAG. In a lot of situations, encoders are even run on CPU, where efficiency is even more important if we want results in a reasonable amount of time.

As with most things in research, we build while standing on the shoulders of giants, and heavily leverage Flash Attention 2’s speed improvements. Our efficiency improvements rely on three key components: Alternating Attention, to improve processing efficiency, Unpadding and Sequence Packing, to reduce computational waste, and Hardware-Aware Model Design, to maximise hardware utilization.

Global and Local Attention

One of ModernBERT’s most impactful features is Alternating Attention, rather than full global attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention).
As attention’s computational complexity balloons up with every additional token, this means ModernBERT can process long input sequences considerably faster than any other model.

In practice, it looks like this:

Conceptually, the reason this works is pretty simple: Picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)? Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)? In the vast majority of cases, it’s the latter.
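Here is a small illustrative sketch of the masks this alternation implies, with full attention every third layer and a 128-token sliding window elsewhere. It mirrors the pattern described above rather than ModernBERT’s actual attention kernels:

import torch

def attention_mask(seq_len: int, layer_idx: int,
                   window: int = 128, global_every: int = 3) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where a token may attend.
    Every `global_every`-th layer is global; the rest use a sliding window."""
    if layer_idx % global_every == 0:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    positions = torch.arange(seq_len)
    distance = (positions[:, None] - positions[None, :]).abs()
    return distance <= window // 2

# At 8k tokens, local layers attend to roughly 1.6% of all token pairs.
print(attention_mask(8192, layer_idx=1).float().mean())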

Unpadding and Sequence Packing

Another core mechanism contributing to ModernBERT’s efficiency is its use of Unpadding and Sequence Packing.

In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we’ve relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence.

While padding solves the problem, it doesn’t do so elegantly: a lot of compute ends up being spent and wasted on padding tokens, which do not contribute any semantic information.

Comparing padding with sequence packing. Sequence packing (‘unpadding’) avoids wasting compute on padding tokens and has more consistent non-padding token counts per batch. Samples are still processed individually through careful masking.

Unpadding solves this issue: rather than keeping these padding tokens, we remove them all and concatenate the remaining tokens into mini-batches with a batch size of one, avoiding all unnecessary computation. If you’re using Flash Attention, our implementation of unpadding goes a step further than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: relying on recent developments in Flash Attention’s RoPE support, ModernBERT only has to unpad once, and can optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.

To speed up pre-training even further, unpadding is in good company within our model, as we use it in conjunction with sequence packing. Sequence packing here is a logical next step: as we’re concatenating inputs into a single sequence, and GPUs are very good at parallelisation, we want to maximise the computational efficiency we can squeeze out of a single forward model pass. To do so, we use a greedy algorithm to group individual sequences into concatenated ones that are as close to the model’s maximum input length as possible.
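As a rough sketch of what such a greedy packer might look like (illustrative only; the real implementation also handles the attention masking so that packed samples never attend to each other):

def pack_sequences(lengths: list[int], max_len: int = 8192) -> list[list[int]]:
    """Greedy first-fit packing: group sequence indices so each packed batch
    stays under the model's maximum input length. Packing longest-first
    tends to waste less space."""
    bins: list[tuple[int, list[int]]] = []   # (remaining capacity, sequence indices)
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        for i, (remaining, indices) in enumerate(bins):
            if lengths[idx] <= remaining:
                bins[i] = (remaining - lengths[idx], indices + [idx])
                break
        else:
            bins.append((max_len - lengths[idx], [idx]))
    return [indices for _, indices in bins]

print(pack_sequences([5000, 3000, 2000, 7000, 1000]))  # e.g. [[3, 4], [0, 1], [2]]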

Paying Attention to Hardware

Finally, the third facet of ModernBERT’s efficiency is hardware design.

We attempted to balance two insights that have been highlighted by previous research:

  1. Deep & Narrow vs Wide & Shallow: Research shows that deeper models with narrower layers often perform better than shallower models with fewer, wider layers. However, this is a double-edged sword: the deeper the model, the less parallelizable it becomes, and thus, the slower it runs at identical parameter counts.
  2. Hardware Efficiency: Model dimensions need to align well with GPU hardware for maximum performance, and different target GPUs result in different constraints.

Sadly, there is no magic recipe to make a model run similarly well on a wide range of GPUs, but there is an excellent cookbook: The Case for Co-Designing Model Architectures with Hardware, in which the ways to optimize a model architecture for a given GPU are carefully laid out. We came up with a heuristic to extend their method to a basket of GPUs, while respecting a given set of constraints. Logically, the first step is to define said constraints, in our case:

  • Defining our target GPUs as common inference ones (RTX 3090/4090, A10, T4, L4)
  • Roughly defining our target model sizes as 130-to-150 million parameters for ModernBERT-Base, and 350-to-420 million for ModernBERT-Large
  • Matching the original BERT’s embedding dimensions, 768 for base and 1024 for large, to maximize backwards compatibility
  • Setting performance constraints which are common across the basket of GPUs

Afterwards, we experimented with multiple model designs via a constrained grid search, varying both layer counts and layer width. Once we’d identified shapes that appeared to be the most efficient ones, we confirmed that our heuristics matched real-world GPU performance, and settled on the final model designs.
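To give a flavour of what that search looks like, here is a hypothetical sketch: enumerate (depth, width) candidates, keep only those that respect the parameter budget and hardware-friendly dimensions, then benchmark the survivors. The ranges, budget, and rough parameter formula below are placeholders, not the search space we actually used:

def candidate_shapes(hidden: int = 768,
                     param_budget: tuple[float, float] = (130e6, 150e6)):
    """Enumerate (layers, intermediate size) pairs inside a parameter budget,
    keeping intermediate sizes that are multiples of 128 (GPU-friendly)."""
    candidates = []
    for n_layers in range(12, 40, 2):
        for intermediate in range(hidden * 2, hidden * 6, 128):
            # Very rough estimate: attention + MLP weights only, ignoring
            # embeddings, GeGLU gating, and normalization parameters.
            params = n_layers * (4 * hidden * hidden + 2 * hidden * intermediate)
            if param_budget[0] <= params <= param_budget[1]:
                candidates.append((n_layers, intermediate, params))
    return candidates

for shape in candidate_shapes()[:5]:
    print(shape)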

Training

def data(): return ['text', 'bad_text', 'math', 'code']

Picture this exact scene, but replace Developers with Data

Another big aspect in which encoders have been trailing behind is training data. This is often understood to mean solely training data scale, but this is not actually the case: previous encoders, such as DeBERTaV3, were trained for long enough that they might have even breached the trillion tokens scale!

The issue, rather, has been training data diversity: many of the older models train on limited corpora, generally consisting of Wikipedia and Wikibooks. These data mixtures are very noticeably single text modality: they contain nothing but high-quality natural text.

In contrast, ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion tokens, of which most are unique, rather than the standard 20-to-40 repetitions common in previous encoders.

The impact of this is immediately noticeable: out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We’re particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.

Process

We stick to the original BERT’s training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, which has since been shown to add overhead for no clear gains, and increase the masking rate from 15% to 30%.

Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.

Training in three phases is our way of ensuring our model is good across the board, which is reflected in its results: it is competitive on long-context tasks, at no cost to its ability to process short context…

… But it has another benefit: for the first two phases, we train using a constant learning rate once the warmup phase is complete, and only perform learning rate decay on the final 50 billion tokens, following the Trapezoidal (or Warmup-Stable-Decay) learning rate schedule. And what’s more: we will release every single intermediate checkpoint from these stable phases, inspired by Pythia. Our main reason for doing so was supporting future research and applications: anyone is free to restart training from any of our pre-decay checkpoints, and perform annealing on domain-appropriate data for their intended use!
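For reference, a trapezoidal schedule is simple to write down. The peak value and phase fractions below are placeholders rather than ModernBERT’s actual hyperparameters:

def wsd_lr(step: int, total_steps: int, peak: float = 8e-4,
           warmup_frac: float = 0.02, decay_frac: float = 0.025) -> float:
    """Warmup-Stable-Decay: linear warmup, long constant plateau,
    then a short linear decay at the very end of training."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    if step < warmup_steps:
        return peak * step / max(warmup_steps, 1)
    if step > total_steps - decay_steps:
        return peak * (total_steps - step) / max(decay_steps, 1)
    return peak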

The tricks, it’s all about the tricks!

If you’ve made it this far into this announcement, you’re probably used to this: of course, we use tricks to make things quicker here too. To be precise, we have two main tricks.

Let’s start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.
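A sketch of what such a batch-size warmup schedule might look like; every number here is an illustrative placeholder:

def batch_size_at(step: int, warmup_steps: int = 2_000,
                  start: int = 96, final: int = 4_608) -> int:
    """Ramp linearly from a small batch size to the final training batch size
    over the first `warmup_steps` optimizer steps, then stay constant."""
    if step >= warmup_steps:
        return final
    return start + (final - start) * step // warmup_steps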

The second trick is far more uncommon: weight initialization via tiling for the larger model size, inspired by Microsoft’s Phi family of models. This one’s based on the following realization: Why initialize ModernBERT-large’s weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?

And indeed, it turns out that tiling ModernBERT-base’s weights across ModernBERT-large works better than initializing from random weights. It also has the added benefit of stacking nicely with batch size warmup for even faster initial training.
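Here is a minimal sketch of the tiling idea for a single 2D weight matrix; the actual procedure covers every weight in the model and the exact details may differ:

import torch

def tile_weights(small: torch.Tensor, large_shape: tuple[int, int]) -> torch.Tensor:
    """Initialize a larger weight matrix by repeating a smaller, already-trained
    one until it covers the target shape, then cropping to size."""
    rows, cols = large_shape
    reps = (-(-rows // small.shape[0]), -(-cols // small.shape[1]))  # ceiling division
    return small.repeat(*reps)[:rows, :cols].clone()

base_proj = torch.randn(768, 768)                   # stand-in for a trained base weight
print(tile_weights(base_proj, (1024, 1024)).shape)  # torch.Size([1024, 1024])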

Conclusion

In this blog post we introduced the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models, finally giving BERT a much needed do-over.

ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing an extremely attractive size/performance ratio.

More than anything, we’re really looking forward to seeing what creative ways to use these models the community will come up with! To encourage this, we’re opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here’s a demo we thought about: code similarity HF space! And remember, this is an encoder model, so all the coolest downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there’s lots of cool frameworks out there to support fine-tuning encoders: 🤗Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!
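If you want a concrete starting point for that fine-tuning, here is a minimal classification sketch with 🤗Transformers, using IMDB sentiment as a stand-in task. The dataset and hyperparameters are placeholders, and it assumes the transformers version noted at the top of this post:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,   # enables dynamic padding via the default collator
)
trainer.train()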

Links

LightOn sponsored the compute for this project on Orange Business Cloud Avenue.

nbsanity - Share Notebooks as Polished Web Pages in Seconds

2024-12-13 08:00:00

At fastai, we’ve long believed that Jupyter Notebooks are an excellent medium for technical writing, combining live code, visualizations, and narrative text in a single document. However, sharing notebooks in a way that’s both beautiful and accessible has always been a challenge. While GitHub’s notebook viewer is functional, it lacks the polish and features needed for proper technical communication. Today, we’re introducing nbsanity, a service that transforms any public GitHub notebook into a polished web page with just a URL change.

The Challenge

While GitHub’s rendering is functional, it suffers from several limitations: the rendering can be sluggish and occasionally fails completely, there’s no way to collapse or hide code cells, and the presentation can’t be customized. One particularly frustrating issue is the lack of horizontal scrolling for code cells, and overall, the reading experience isn’t optimized for consumption.

Nbviewer solves some of these issues, but doesn’t allow you to customize the presentation. We’ve previously addressed some of these challenges with tools like fastpages and nbdev, but these solutions require setup and maintenance 1. We realized there was a need for something simpler - a solution that would allow instant sharing without any overhead.

I’ve been searching for the perfect low-friction system for technical writing ever since discovering Simon Willison’s elegant TIL (Today I Learned) approach. With nbsanity, we finally have it.

What is nbsanity?

nbsanity is a free service that renders any public Jupyter notebook from GitHub or Gists as a polished web page. There’s no setup, no configuration, and no deployment needed.

nbsanity is powered by Quarto, an open-source scientific and technical publishing system. Through our extensive work with various documentation tools, we’ve found Quarto to be the most ergonomic static site generator available for notebooks. It offers seamless integration with both Jupyter and VSCode through dedicated extensions, while providing remarkable flexibility in output formats - including presentations, books, PDFs and websites.

One of Quarto’s most powerful features is its “directives” system - simple cell comments that begin with #| that allow you to customize how your content is rendered. These directives are easy to add and do not clutter your code. Below are examples of Quarto capabilities you get access to with nbsanity:

  • Cell Visibility Control: Hide specific cells with #|include: false while keeping their execution
  • Output Management: Show just results with #|echo: false or raw output with #|output: asis
  • Error Handling: Control error messages with #|error: false and warnings with #|warning: false
  • Content Organization: Create tab panels with {.panel-tabset} and callouts with :::{.callout-note} (these are not directives, but markdown cell syntax that creates tab panels and callouts).
  • Layout Control: Apply custom CSS classes and control figure layouts with directives like #| fig-width: and #| layout-ncol:

Documentation concerning these directives can be found in the more resources section.

nbsanity is focused on doing one thing well: rendering public notebooks beautifully. This means it only works with notebooks hosted on GitHub or in Gists. Furthermore, you’ll need to use remote URLs for any images in your notebooks2. These constraints let us deliver a service that’s simple, fast, and completely maintenance-free for users. Think of nbsanity as the “pastebin for notebooks” - it’s the fastest way to go from a GitHub notebook to a polished reading experience.

We added extra love

In addition to Quarto’s rendering process, we’ve added several quality-of-life improvements. All rendered notebooks have (1) a table of contents, (2) a link to the original GitHub URL, and (3) text wrapping in code cells.

We’ve even made sure that rendered notebooks have fancy social cards, thanks to Simon Willison’s shot-scraper:

These social cards show the actual contents of your notebook and help your posts stand out on social media.

Getting Started

Using nbsanity couldn’t be simpler. You have two options:

Option 1: URL Modification

Replace github.com with nbsanity.com in any GitHub notebook URL. This works for both repositories and gists. For example:

GitHub URL   https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb

nbsanity URL https://nbsanity.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb

For gists, the URL format is slightly different: nbsanity.com/gist/[username]/[gist_id]. See these instructions for more details.
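If you would rather script the substitution, the repository case boils down to a one-line string replacement (a sketch; gists use the slightly different format mentioned above):

def nbsanity_url(github_url: str) -> str:
    """Swap the github.com host for nbsanity.com, leaving the path untouched."""
    return github_url.replace("https://github.com/", "https://nbsanity.com/", 1)

print(nbsanity_url("https://github.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb"))
# https://nbsanity.com/fastai/lm-hackers/blob/main/lm-hackers.ipynb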

Option 2: Bookmarklet

For even faster conversion, drag this bookmarklet to your bookmarks bar:

nbsanity

Clicking this bookmarklet while viewing a public GitHub notebook will perform the necessary URL substitution for you.

A Demo

To demonstrate Quarto’s capabilities, let’s examine one of my favorite features: code-folding.

Example 1

To collapse a code cell with an expandable summary, I can add the following directive to the top of a code cell:

#| code-fold: true
#| code-summary: "Click to see data preprocessing"

These rendering instructions create the collapsed cell shown below, but are not themselves rendered or seen by the reader.

Click to see data preprocessing
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
data = pd.DataFrame({
    'id': range(1000),
    'value': np.random.normal(0, 1, 1000),
    'category': np.random.choice(['A', 'B', 'C'], 1000)
})

# Preprocessing steps
data['value_normalized'] = (data['value'] - data['value'].mean()) / data['value'].std()
data['value_binned'] = pd.qcut(data['value'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

Example 2

To show code expanded by default while still giving readers the option to collapse it, we can use the same code-fold directive with a different option:

#| code-fold: show
Code
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(8, 4))
plt.plot(np.random.randn(100).cumsum())
plt.title('Random Walk')
plt.show()

Important Notes

While nbsanity makes notebook sharing effortless, there are a few key things to keep in mind to use it well. First, nbsanity is a rendering service only - it displays your notebooks but does not execute them, even if you have Quarto directives that say otherwise. This avoids potential security issues.

nbsanity also has a caching system that preserves the history of your notebook renders. Each time you render a notebook, you receive a unique link corresponding to that specific version. If you later update your notebook and render it again, you’ll get a new link. All previous versions remain accessible through their original links. Any new rendering capabilities we introduce will only apply to new renders, meaning your existing shared notebooks will maintain their original appearance.

Next Steps with nbsanity

We built nbsanity because we believe that reducing friction in sharing knowledge is important. We’ve been refining nbsanity with our community of over 2,000 students in our solveit course, where it’s become an integral part of how students share their work. Their feedback and usage patterns have helped us polish the tool into something we love using ourselves.

The best way to get started is to try it yourself:

  1. Visit nbsanity.com and drag the bookmarklet to your browser’s bookmark bar
  2. Navigate to any public Jupyter notebook on GitHub
  3. Click the bookmarklet to view the notebook with beautiful Quarto rendering

Whether you’re writing “Today I Learned” posts, sharing technical tutorials, or enhancing your project’s documentation, we hope this tool makes your technical writing journey a little bit easier. The project is open source and available on GitHub—we welcome your feedback and contributions!3

P.S. If you share your notebook using nbsanity on social media, please tag me—I’d love to see your work! You can find me on twitter and linkedin.

More resources

Here are links to Quarto docs I find helpful when authoring notebooks:

  1. cell output: hide, show, and filter cell output and input.
  2. code-display: configure how code is displayed, including line-numbers, folding of cells, hiding of cells, etc.
  3. figures: configure how figures are shown
  4. tables: configure how tables are shown
  5. metadata: configure the title, subtitle, date, author and more.
  6. numbering: toggle section numbering.

Footnotes

  1. JupyterBook is another project that allows you to customize the presentation of notebooks. Like fastpages, nbdev and other static site generators, these projects require a non-trivial amount of setup and maintenance.↩︎

  2. The reason for requiring remote urls is that we do not want to be rate limited by the GitHub API in fetching related files.↩︎

  3. We need to keep the service minimal, so please expect that we will be discerning about feature requests and PRs.↩︎