2025-04-19 08:00:00
I’ve been thinking recently about what sets apart my coworkers who’ve done the best work.
You might think that the main thing that makes people really effective at research or engineering is technical ability, and among the general population that’s true. Among my Anthropic coworkers, though, we’ve restricted the range by screening for extremely high-percentile technical ability, so the remaining differences, while they still matter, aren’t quite as critical. Instead, people’s biggest bottleneck eventually becomes their ability to get leverage—i.e., to find and execute work that has a big impact-per-hour multiplier.
For example, here are some types of work at Anthropic that tend to have high impact-per-hour, or a high impact-per-hour ceiling when done well (of course this list is extremely non-exhaustive!):
I think of finding high-leverage work as having two interrelated components: taste, i.e. the judgment to pick the right thing to work toward, and agency, i.e. the initiative to actually get there.
Without taste, you’re likely to work toward the wrong thing. Without agency, even if you work toward the right thing, you’re likely to get nowhere.
How can someone improve at these? Mostly, I think, by practicing: there’s no substitute for good feedback loops with reality. So maybe the most important takeaway here is that you have permission to try to exercise good taste and high agency! But below are some more specific pieces of advice that have helped me improve on these dimensions.
One of the easiest ways to get more leverage is to take a goal you’re already trying to accomplish, and figure out a better way to accomplish the same thing. (For example: changing your experiment design to be more directly relevant to the high-level question; finding an 80/20 way of building a tool; just deciding not to do something entirely; etc.) In order to do this, you need to keep careful track of the high-level context and priorities that your specific project is aimed at, so that you can understand and judge the trade-offs.
Obviously, whoever’s supervising your project will make an attempt to point you at the most effective way of achieving the goal that they can think of. But finding the most effective way of doing something takes a lot of time and attention, they probably have a lot of other things to do, and they also have much less detailed context than you, so they’re liable to miss things!
For example, when I did a work trial at Anthropic in late 2022, I was assigned the task of building a custom scheduler for our compute cluster to fix some unfortunate behaviors in the default scheduler that were making researchers’ lives hard. But after seeing the team work for a week and thinking about the system design, I started to worry that this project would end up imposing a high maintenance burden on the team that owned the cluster, which was already overloaded, and I ended up suggesting that we postpone it. It turned out that second-order effect would have made this project net-negative for the root goal of “keep the clusters working well!”
A counterintuitive fact about the highest-leverage projects is that they’re often not obviously high-leverage to most people in advance (because if they were, they would already have been done a long time ago). That means people are often skeptical of the value of pursuing them.
That skepticism can come from a couple of different places: sometimes the project genuinely isn’t worth doing, and sometimes the skeptics simply lack the context to see its value (in the real world, most projects have some mix of both going on).
A personal example: shortly after I joined Anthropic’s inference team, a couple folks on the team proposed completely rewriting our inference service codebase, because they felt the current version was overcomplicated. I was somewhat skeptical of this, based on the standard software engineering heuristic of “most rewrites fail and most people are over-optimistic about how long they will take,” although I didn’t push back very hard because I knew I didn’t have much context. It turned out that the rewrite took about a week, was an immediate integer-multiple efficiency improvement, and made it much easier to make further efficiency improvements. I’m glad that people didn’t listen to my skepticism!
One tactic for addressing this dynamic is that instead of asking people for approval to go do something, you can just tell them what you intend to do (implicitly giving them space to object or course-correct if they feel strongly).
Of course, the same “high leverage projects are non-obvious” phenomenon also means that some of your self-driven high-impact project attempts will probably fail, because you misjudged their impact or difficulty. Because of this, you should think of them as bets that might or might not pay off, and—at least until you’ve proven your ability to make good bets over fairly long time horizons—it’s best to keep these as part of a portfolio with low-risk projects as well, to avoid the situation where you bite off a single high-risk project, it doesn’t pan out, and people lose trust in your ability to pick your own projects.
A common trait of high-agency people is that they take accountability for achieving a goal, not just doing some work.
There’s a huge difference between the following two operating modes: (1) doing the tasks you were assigned and surfacing blockers as you hit them, and (2) taking ownership of the end goal and doing whatever it takes (re-planning, escalating, asking for help) to make sure it actually gets achieved.
Mode 1 makes you a leaky abstraction—if your project is critical, someone else needs to be constantly monitoring it and figuring out how to resolve blockers. My concept handle for being in mode 2 is “making success inevitable,” because “inevitable” is the bar where other people can stop spending substantial fractions of mental energy on worrying about the project.
People who can be trusted to make something inevitable are really rare, and are typically the bottleneck for how many different things a team or company can do at once. So if someone else is responsible for making your project inevitable, you’re consuming some of that scarce resource; if you’re the one making your own projects inevitable, you’re a producer of that resource, and you’re helping unblock a key constraint for your team.
(Of course in practice you will never achieve full inevitability, but getting closer to it still makes a huge difference!)
I wrote above that you should expect high-impact projects to be non-obvious because if they were obvious, they’d have been done already. This points to another interesting dynamic, which is that it’s quite rare for different people’s “zone of best taste” to overlap very much. Instead, the quality of most people’s taste is highly idiosyncratic and area-specific, where areas can be as localized as e.g. “language model personality design” or “what blog post titles will get upvoted on Hacker News” (one of my own less-useful areas of good taste…).
For this reason, an important signal to keep track of is: where is your taste the best?
I’ve noticed a lot of people underestimate their own taste, because they expect having good taste to feel like being very smart or competent or good at things. Unfortunately, I am here to tell you that, at least if you are similar to me, you will never feel smart, competent, or good at things; instead, you will just start feeling more and more like everyone else mysteriously sucks at them.
For this reason, the prompt I suggest here is: what does it seem like everyone else is mysteriously bad at? That’s probably a sign that you have good taste there.
It’s okay if this prompt doesn’t immediately yield anything; you might just be on a team with a lot of really good people, where you don’t have a really unique angle on anything yet. Even so, a weakened version of this question—where do the most people seem the least competent?—is a useful gradient signal!
One way of thinking about taste is that it’s about the quality of your predictive models and search heuristics. If I design the experiment this way, what will I find? If I design the tool this way, how easy will it be to use? If I write the doc this way, how much will it resonate with people?
Doing enough search and prediction to come up with great ideas takes time. The first domain that I got some degree of taste in was software design, and I remember a pretty clear phase transition where I gained the ability to improve my designs by thinking harder about them. After that point, I spent a lot of time iterating on many different design improvements—most of which I never implemented because I couldn’t come up with something I was happy enough with, but a few of which turned into major wins.
The easiest way to improve at this is just to try it! Whenever you’re debating what to do, explicitly ask yourself “what do I predict will happen if I choose option A?” and try to unroll the trajectory. Even if you think you’re already intuitively predicting the results of your choices, I’ve found it helps surprisingly much to be explicit—one of my manager role models asks me this (“what do you think will happen?”) every time I ask him for advice and it’s kind of silly how often it helps me realize something new. (For bonus points, revisit your predictions afterwards.)
Things that I’ve found benefit from a lot of thinking time:
(See also: Think real hard. Although if you’re someone whose natural failure mode is to overthink things, it’s possible you should reverse this advice.)
If taste is about the quality of your predictive models and search heuristics, it’s important to wring out every possible update to these from the data that you get.
For that reason, many of the most effective people I’ve worked with also do the most metacognition, i.e., reflecting on their own (and their team’s) work and thought processes, and figuring out how to improve them. They’re often the people who are most likely to identify improvements to our processes or mental models—things like:
Often, they don’t just do their own metacognition but also help drive “group metacognition” by sharing these reflections with the team, scheduling retrospectives, etc. Even if each of the lessons here is small individually, they compound over time to help people and teams become much more effective.
My best habit for encouraging myself to metacogitate more is a weekly review. The format and typical outcomes of my reviews have evolved a lot since I first wrote about them, but I still do them and find them extremely valuable for helping me improve at whatever I’m currently focused on improving at!
(See also / further reading: Chris Olah’s Research Taste Exercises)
2025-04-01 08:00:00
This post was adapted from an internal doc I wrote at Wave.
Welcome to being a manager! Your time-management problem just got a lot harder.
As an IC, you can often get away with a very simple time-management strategy: figure out your single most important thing, work on it until it’s done, and repeat.
As a team lead, this isn’t going to work, because much more of your work is interrupt-driven or gets blocked for long periods of time. One-on-ones! Code reviews! Design reviews! Prioritization meetings! Project check-ins! All of these are subject to an external schedule rather than being the type of thing that you can push on in a single focused block until it’s done.
Being a team lead means three big changes for your time management:
You no longer have a single most important thing. You’ll have to learn how to juggle competing priorities.
Your most important responsibility is for your team’s output, not your personal output. That means that individual engineering work goes last in your list of potential most important things.
You’ll need to start spending some of your time on a manager’s schedule. As Paul Graham describes in Maker’s Schedule, Manager’s Schedule:
There are two types of schedule, which I’ll call the manager’s schedule and the maker’s schedule. The manager’s schedule is for bosses. It’s embodied in the traditional appointment book, with each day cut into one hour intervals. You can block off several hours for a single task if you need to, but by default you change what you’re doing every hour.
Most powerful people are on the manager’s schedule. It’s the schedule of command. But there’s another way of using time that’s common among people who make things, like programmers and writers. They generally prefer to use time in units of half a day at least. You can’t write or program well in units of an hour. That’s barely enough time to get started.
Here’s some advice on how to cope with those changes.
Your responsibilities to your team will take time, and even more importantly, attention. That means you’ll be a lot less productive on IC work than you have been in the past—especially at first while you’re finding your legs. Additionally, your time will be less predictable week-to-week as you might have to spend an unknown amount of time responding to “inbound” work.
For the first few months, you should treat any individual engineering work that you get done as a bonus. Even after that, you should expect to have something like 10-20% less individual output per engineer you manage (so with four reports, expect your IC output to drop by something like half), depending on how experienced you are, they are, etc., and with substantial week-to-week variance.
To mitigate this, my rule for myself has been to make sure that my IC work is important but not urgent—i.e. that nobody will be sad and no plans will be derailed if I end up having to spend the next week firefighting instead of pushing it forward.
For honing my intuitions about how much I can actually expect to accomplish, I’ve found time tracking very useful (see How time tracking helped me be a better manager and I apparently got 50% better at my job last month).
A corollary of the above is that it becomes very important for you to prioritize what to work on, both on an hour-to-hour cadence and on a larger timescale.
It’s not possible to write down a full algorithm for prioritizing in a blog post—that’s why they pay us the big bucks—but here are some heuristics for which things are most worth prioritizing:
One of the most important types of “work that increases your or your team’s future bandwidth” is delegating things. This is something entire books have been written about, but here’s how to avoid a few common delegation pitfalls for new team leads:
Negotiate how hands-on to be. Effective delegation means finding the right balance between micromanaging and throwing your report to the wolves. The stereotype is that new managers often micromanage, but at Wave, I’ve noticed that new managers often err on the side of undermanaging, or being too hands-off, perhaps out of a desire to signal that they trust their reports.
If you’re unsure, it’s good to have an explicit conversation with the person you’re delegating to about how much support they want. E.g. “do you want to do the design for this feature yourself, have me do the high-level and fill in the details, or have me write the entire design doc?”
Calibrate to your team members’ task-relevant maturity, or how capable they are of independently doing a particular type of task. A senior engineer should be able to design most features independently (with review), but give the same design task to a junior engineer and they’ll probably flail around and make no progress. You should be maintaining a mental map of each of your team members’ strengths and weaknesses—and updating it over time as you help them improve.
Delegate ahead of future growth. Your team’s workload is going to increase over time, so even if you don’t feel like you have too much work to do right now, you probably will in the future, unless you currently feel underutilized. You should aim for a workload where in the steady state, you feel like you have some slack capacity.
Delegate “stretch projects” to help your team level up. Getting enough slack might require you to delegate work that no one else on your team can currently do. Take your mental strengths-and-weaknesses map, ask yourself what the most important growth directions for each of those team members are, and figure out how to give them work that stretches them in that direction. Note that these delegations will probably require more frequent monitoring, since your reports will have less task-relevant maturity on their stretch projects!
Despite your best efforts to follow the above advice, there will probably come a time when you feel very stressed about the amount of work on your plate. When that time comes, here’s what to do:
If you’re not careful, it’s easy to fill your entire calendar with meetings, Slack, etc. and have no time for deep work. With careful planning, you can avoid this by “batching” all your distractions to particular times of day.
There are lots of tactical tips for doing this; I catalogued some that work for me in Tools for keeping focused.
One tech-lead-specific one that I’ll add is batching meetings: I schedule all my meetings back-to-back on Tuesdays and Thursdays to leave the rest of the week as free as possible for deep work.
2025-03-16 08:00:00
My few most productive individual weeks at Anthropic have all been “crisis project management”: coordinating major, time-sensitive implementation or debugging efforts.
In a company like Anthropic, excellent project management is an extremely high-leverage skill, and not just during crises: our work has tons of moving parts with complex, non-obvious interdependencies and hard schedule constraints, which means organizing them is a huge job, and can save weeks of delays if done right. Although a lot of the examples here come from crisis projects, most of the principles here are also the way I try to run any project, just more-so.
I think excellent project management is also rarer than it needs to be. During the crisis projects I didn’t feel like I was doing anything particularly impressive; mostly it felt like I was putting in a lot of work but doing things that felt relatively straightforward. On the other hand, I often see other people miss chances to do those things, maybe for lack of having seen a good playbook.
So here’s an attempt to describe my playbook for when I’m being intense about project management.
(I’ve described what I did as “coordinating” above, but that’s underselling it a bit; it mattered a lot for this playbook that I had enough technical context, and organizational trust, to autonomously make most prioritization decisions about the project. Sometimes we instead try to have the trusted decisionmakers not be highly involved in managing execution, and instead farm that out to a lower-context or less-trusted project manager to save the trusted decisionmaker time, but IMO this is usually a false economy for projects where it’s critical that they be executed well.)
For each of the crisis management projects, I completely cleared my schedule to focus on the project, and ended up spending 6+ hours a day organizing it.
This is a bit unintuitive because I’m used to thinking of information processing as basically a free action. After all, you’re “just” moving info from place to place, not doing real work like coding, right? But if you add it all up—running meetings, pinging for updates, digesting Slack threads, pinging for updates again, thinking about what’s next, pinging for updates a third time, etc.—it’s surprisingly time-intensive.
Even more importantly than freeing up time, clearing my schedule made sure the project was the top idea in my mind. If I don’t do that, it’s easy for me to let projects “go on autopilot,” where I keep them running but don’t proactively make time to think through things like whether we should change goals, add or drop priorities, or do other “non-obvious” things.
For non-crisis projects, it’s often not tenable (or the right prioritization) to spend 6+ hours a day project-managing; but it’s still the case that you can improve execution a lot if you focus and make them a top priority, e.g. by carving out dedicated time every day to check statuses, contemplate priorities, broadcast updates, and so on.
A specific tool that I’ve found critical for staying oriented and updating quickly is a detailed plan for victory, i.e., a list of steps, as concrete as possible, that end with the goal being achieved.
The plan is important because whether or not we’re achieving the plan is the best way to figure out how well or badly things are going. Knowing how well or badly things are going is important because it tells me when to start asking for more support, cutting scope, escalating problems, and otherwise sounding more alarms. One of the most common megaproject failure modes is to not freak out soon enough, and having a concrete plan is the best antidote.
As an example of this going both well and badly: during a recent sprint to release a new implementation of a model, we took a detailed accounting of all the work we thought we had to do to launch.
As the above example shows, having a plan can’t completely save you if you underestimate how long all the steps in the plan will take. But it certainly helps! My sense is that the main things that would have helped even more in the above case were:
OODA stands for “observe, orient, decide, act”—in other words, the process by which you update your plans and behavior based on new information.
Most of the large projects I’ve worked on have been characterized by incomplete information:
In fact, I’d make a stronger claim: usually getting complete information was the hard part of the project, and took up a substantial fraction of the overall critical-path timeline.
For example, let’s take a recent project to kick off a training run. The critical path probably looked something like:
Practically all of these steps are about information-processing, not writing code! Even the step where the compute partner debugged the problems on their side was itself constrained by information processing speed, since there were tens of people working on the debugging effort and coordinating / sharing info between them was difficult. Overall, the project timeline was strongly constrained by how quickly information could round-trip from our compute partner’s large-scale debugging effort, through their tech lead, me, and Anthropic’s large-scale debugging effort.
This pattern generalizes to most projects I’ve been a part of, and as a result, one of my most productive project management habits is to try to run the fastest OODA loop that I can.
A few specific things that I’ve found help:
It’s not just enough for me personally to be running a fast OODA loop—in a large group, everyone needs to be autonomously making frequent, high-quality, local prioritization decisions, without needing a round-trip through me. To get there, they need to be ambiently aware of:
I’ve usually found that to create the right level of ambient awareness, I have to repeat the same things way more often than I intuitively expect. This is roughly the same “communicate uncomfortably much” principle above, but applied to broadcasts and not just 1:1 conversations with people.
For example, although the first team I managed at Anthropic started with a daily asynchronous standup, we found that synchronous meetings were much more effective for creating common knowledge and reorienting, so we moved to a twice-weekly synchronous standup, which probably qualified as “uncomfortably much” synchronous communication for Anthropic at the time.
Once a project gets over maybe 10 people, I can’t track everything in enough detail to project-manage the entire thing myself. At this point, it becomes critical to delegate.
Here I mean delegating the project management, not just the execution (that’s what I’d be delegating to the first 10 people). This is the point where I need other people to help split up the work, monitor and communicate progress, escalate blockers, etc.
A few things I try to keep in mind when delegating project management:
One of my favorite things to make delegation easier is to keep goals simple—if they can fit in a Slack message while still crisply describing a path to the desired end state, then the people working on the goal will be much more able to prioritize autonomously, and point their work at the real end goal rather than doing something that turns out to be useless for some reason they didn’t think about.
“Keep goals simple” doesn’t have to mean “do less”—the best way to keep goals simple is to find the latent structure that enables a clean recursive decomposition into subgoals. This often requires a deceptively large amount of work—both cognitive and hands-on-keyboard—to identify the right intermediate goals, but I’ve found that it pays off immensely by clarifying what’s important to work on.
Some of my favorite memories of Anthropic are of helping out with these big projects. While they can be intense, it’s also really inspiring to see how our team comes together, and the feeling of being part of a big team of truly excellent people cooking something ambitious together can be really magical! So I try to enjoy the chaos :)
Here’s the internal doc I share with folks on my team who are getting into being responsible for large projects.
So you’re the directly responsible individual (DRI) of a project, or part of one. Concretely, what do you do to “be DRI”?
This doc is my suggested “starter kit” answer to that question. The habits and rituals described here aren’t perfect for every situation, but they’re lightweight and broadly helpful. I suggest you use them as a starting point for iteration: try them out, then adjust as necessary. This is an SL init; the RL is your job :)
The goal is to help you do your job as DRI without adding too much overhead.
(Note: being DRI will still unavoidably add some overhead—e.g. you’ll have to track what other people are doing, delegate work, unblock people, set and communicate goals, etc. The goal is specifically for the process/paperwork to be minimal.)
You should schedule at least one 30-minute weekly meeting with everyone working on the project.
The goal of this meeting is to (1) be a backstop for any coordination that needs to happen and didn’t happen asynchronously; (2) be an efficient way to create common knowledge of goals, updates, etc.; (3) help you track whether things are going well.
It’s really helpful for discoverability and wayfinding to have a single “master doc” with all the most important info about a project. As you loop more people in, they can read the doc to get up to speed. And anyone who thinks “I wonder how X is going” can stop by there to find out.
Create a doc for your workstream with the project’s goals and plan, status updates, and running notes.
If it’s part of a larger project, your doc should be nested within the larger project’s working doc.
If this ends up being too much for one doc, you can fork these out into sub-docs (esp. running notes and updates).
Thanks to Kelley Rivoire for many thoughtful comments on a draft!
2024-07-21 08:00:00
This is an adaptation of an internal doc I wrote for Anthropic.
Recently I’ve been having a lot of conversations about how to structure and staff teams. One framework I’ve referenced repeatedly is to break down team leadership into a few different categories of responsibility.
This is useful for a couple reasons. One is that it helps you get more concrete about what leading a team involves; for new managers, having an exhaustive list of job responsibilities is helpful to make sure you’re tracking all of them.
More importantly, though, we often want to somehow split these responsibilities between people. Team leadership covers a huge array of things—as you can see from how long this post is—and trying to find someone who can be great at all of them is often a unicorn hunt. Even if you do find someone good-enough at all of them, they usually spike in 1-2 areas, and it might be higher-leverage for them to fully focus on those.
Here’s a breakdown I use a lot:
The most important responsibility of a team’s leadership is to ensure that the team is headed in the right direction—that is, are they working towards the right high-level goal, and do they have an achievable plan to get there? Overall direction tends to get input from many people inside and outside a team, but who is most accountable for it can vary; see Example divisions of responsibility below.
Overall direction involves working on things like:
The most important skill for getting this right is having good predictive models (of both the team’s domain and the organization)—since prioritization is ultimately a question about “what will be the impact if we pursue this project.” Being great at communicating those predictive models, and the team’s priorities and goals, to other stakeholders is also important.
Good team direction mostly looks like the team producing a steady stream of big wins. Poor direction most commonly manifests as getting caught by surprise or falling behind—that is, mispredicting what work will be most important and doing too little of it, for example by starting too late, under-hiring, or not growing people into the right skillset or role. Other signs of poor direction include team members not understanding why they’re working on something; the team working on projects that deliver little value; friction with peer teams or arguments about scope; or important projects falling through the cracks between teams.
People management means being responsible for the success of the people on the team, most commonly including things like:
Day to day, the most important responsibility here is recurring 1:1s (the coaching kind, not the status update kind). Others include writing job descriptions, setting up interview loops, sourcing candidates, gathering feedback, writing performance reviews, helping people navigate org policies, giving career coaching, etc.
The most important skill for people management is understanding people—both in the traditional “high EQ” sense of being empathetic and good at seeing others’ perspectives, but also in the sense of knowing what contributes to high performance in a domain (e.g. what makes someone a great engineer or researcher). It’s also important to be good at having tricky conversations in a compassionate but firm way.
The main outcome of people management is whether people on the team are high-performing and happy. Teams with the best people management hire great people, give them fast feedback on anything that’s not working, course-correct them quickly, help them grow their impact over time, and generally help them have a great time at work. Bad people management looks like people chronically underperforming or having low morale.
A common question here is how technical a people manager needs to be. Opinions vary widely. The bar I typically suggest is that the people manager doesn’t need to have the most technical depth on the team, but they need enough depth that they can follow most discussions without slowing them down, understand who’s correct in most debates without needing to rely on trust, and generally stay oriented easily.
The people manager is responsible for making sure their reports get mentorship and feedback if needed, but they don’t need to be the primary person doing the mentorship or feedback themselves. Often, domain-specific mentorship comes from whoever is responsible for technical direction, but it can also come from anyone else senior on the team, or less commonly, somewhere else in the org.
Project management means making sure the team executes well: i.e., that everyone works efficiently towards the team’s top priorities while staying unblocked and situationally aware of what else is going on. In the short run, it’s the key determinant of a team’s productivity.
Day to day, project management looks like:
Project management isn’t just administrative; doing it well requires a significant amount of domain expertise (to follow project discussions, understand status updates, track dependencies, etc.). Beyond that, it’s helpful to be organized and detail-oriented, and to have good mental models of people (who will be good at what types of work? What kinds of coordination rituals are helpful for this team?).
Good project management is barely visible—it just feels like “things humming along.” It’s more visible when it’s going badly, which mostly manifests as inefficient work: people being blocked, context-switching frequently due to priority thrash, flailing around because they’re working on a project that’s a bad fit, doing their work wrong because they don’t understand the root goal, missing out on important information that was in the wrong Slack channel, and so on.
When teams get big, project management is one of the areas that’s easiest to delegate and split up. For example, when Anthropic’s inference team got up to 10+ people, we split it up into multiple “pods” focused on different areas, where each pod had a “pod lead” that was responsible for that pod’s project management.
Technical leadership means being responsible for the quality of a team’s technical work. In complex orgs that integrate multiple technical skillsets, teams often need some amount of technical leadership in each skillset—for example, research teams at Anthropic need both research and engineering leadership, although the exact balance varies by team.
Specific work includes:
Because technical leadership benefits a lot from the detailed context and feedback loops of working on execution yourself, it’s fairly common for tech leads to be individual contributors. In practice, many teams have a wide enough surface area that they end up with multiple technical leads in different domains—split either “vertically” by project, “horizontally” by skillset, or some combination of the two.
Perhaps obviously, the most important skill for a tech lead is domain expertise. Technical communication is probably next most important, and what separates this archetype of senior IC from others.
When technical leadership isn’t going well, it most often manifests as accumulating debt or other friction that slows down execution: bogus research results, uninformative experiments, creaky systems, frequent outages, etc.
Here are a few different real-world examples of how these responsibilities can be divided up.
When a new company introduces their first technical managers, they often do it by moving their strongest technical person (or people) into a management role and expecting them to fulfill all four responsibilities. Some people do just fine in such roles, but more commonly, the new manager isn’t great at one or more of the responsibilities—most often people management—and struggles to improve due to the number of other things they’re responsible for. (Further reading: Tech Lead Management roles are a trap)
Although tech-lead-manager (TLM) roles have some pitfalls, they’re not impossible. Here are a few protective factors that make them more likely to succeed:
This type of split, an engineering manager (EM) paired with a tech lead (TL), is common in larger tech companies, with the EM responsible for overall direction, people management, and project management, and the TL responsible for technical leadership (and potentially also contributing to overall direction). “Tech lead” doesn’t have to be a formal title here, and sometimes a team will have multiple tech leads in different areas.
At Anthropic, a good example of this is our inference team, where the managers don’t set much technical direction themselves, and instead are focused on hiring, organizing, coaching, establishing priorities, and being glue with the team’s many, many client teams. Since the domain is highly complex and the team is senior-heavy, tech leadership is provided by multiple different ICs for different parts of the service (model implementation, server architecture, request scheduling, capacity management, etc.).
This is an example of a less-common split. At Wave, we used a division similar to the EM/TL split described above, but the team managers (whom we called Product Managers, although it was a somewhat atypical shape for a PM role) often came from non-technical backgrounds.
PMs were expected to act as the “mini CEO” of a product area (e.g. our bill payment product, our agent network, etc.) with fairly broad autonomy to work within that area. Because the “mini CEO” role involved a bunch of other competencies, we decided they didn’t also need to be as technical as a normal engineering manager might.
Although unusual, this worked well for a couple main reasons:
Notably, this broke the suggestion I mentioned above that people managers should be reasonably technical. This worked mostly because we were able to lean heavily on tech leads for the parts of people management that required technical context. Tech lead was a formal role, with secondary reporting into an engineering manager-of-managers; and while PMs were ultimately responsible for people management, the TL played a major role as well. Both of them would have 1:1s with each team member, and performance reviews would be co-written between the PM and the TL.
Anthropic has a few examples of splitting people management from research leadership; the longest-running one is on our Interpretability team, where Chris Olah owned overall direction and technical leadership, and Shan Carter owned people and project management. (This has changed a bit now that Interpretability has multiple sub-teams.)
In this split, unlike an EM/TL split on an engineering team, it made more sense for the research lead to be accountable for overall direction because it depended very heavily on high-context intuitive judgment calls about which research direction to pursue (e.g. betting heavily on the superposition hypothesis, which led to several major results). Many (though not all!) engineering teams’ prioritization depends less on this kind of highly technical judgment call.
This is interesting as an example of a setup where the people manager wasn’t (primarily) responsible for overall direction. It’s somewhat analogous to the CTO / VP Engineering split in some tech companies, where the CTO is responsible for overall direction but most people-leadership responsibility lies with the VPE who reports to them.
Thanks to Milan Cvitkovic and many Anthropic coworkers for reading a draft of this post.
2024-07-13 08:00:00
This is an adaptation of an internal doc I wrote for Anthropic.
I’ve been noticing recently that often, a big blocker to teams staying effective as they grow is trust.
“Alice doesn’t trust Bob” makes Alice sound like the bad guy, but it’s often completely appropriate for people not to trust each other in some areas:
One might have an active reason to expect someone to be bad at something. For example, recently I didn’t fully trust two of my managers to set their teams’ roadmaps… because they’d joined about a week ago and had barely gotten their laptops working. (Two months later, they’re doing great!)
One might just not have data. For example, I haven’t seen most of my direct reports deal with an underperforming team member yet, and this is a common blind spot for many managers, so I shouldn’t assume that they will reliably be effective at this without support.
In general, if Alice is Bob’s manager and is an authority on, say, prioritizing research directions, Bob is probably actively trying to build a good mental “Alice simulator” so that he can prioritize autonomously without checking in all the time. But his simulator might not be good yet, or Alice might not have verified that it’s good enough. Trust comes from common knowledge of shared mental models, and that takes investment from both sides to build.
If low trust is sometimes appropriate, what’s the problem? It’s that trust is what lets collaboration scale. If I have a colleague I don’t trust to (say) make good software design decisions, I’ll have to review their designs much more carefully and ask them to make more thorough plans in advance. If I have a report that I don’t fully trust to handle underperforming team members, I’ll have to manage them more granularly, digging into the details to understand what’s going on, forming my own views about what should happen, and checking on the situation repeatedly to make sure it’s heading in the right direction. That’s a lot more work, both for me and for my teammates, who have to spend a bunch more time making their work “inspectable” in this way.
The benefits here are most obvious when work gets intense. For example, Anthropic had a recent crunch time during which one of our teams was under intense pressure to quickly debug a very tricky issue. We were able to work on this dramatically more efficiently because the team (including most of the folks who joined the debugging effort from elsewhere) had high trust in each other’s competence; at peak we had probably ~25 people working on related tasks, but we were mostly able to split them into independent workstreams where people just trusted the other stuff would get done. In similar situations with a lower-mutual-trust team, I’ve seen things collapse into endless FUD and arguments about technical direction, leading to much slower forward progress.
Trust also becomes more important as the number of stakeholders increases. It’s totally manageable for me to closely supervise a report dealing with an underperformer; it’s a lot more costly and high-friction if, say, 5 senior managers need to do deep dives on a product decision. In an extreme case, I once saw an engineering team with a tight deadline choose to build something they thought was unnecessary, because getting the sign-off to cut scope would have taken longer than doing the work. From the perspective of the organization as an information-processing entity, given the people and relationships that existed at the time, that might well have been the right call; but it does suggest that if they had invested in enough trust to make that kind of sign-off cheap, they’d probably move much faster overall.
As you work with people for longer, you’ll naturally have more experience with each other and build more trust. So on most teams, these kinds of things work themselves out over time. But if you’re going through hypergrowth, then unless you’re very proactive about this, at any given time most of your colleagues will have some sort of trust deficit.
Symptoms I sometimes notice that can indicate a buildup of trust deficits:
It’s easy to notice these and think that the solution is for people to “just trust each other more.” There are some situations and personalities where that’s the right advice. But often it’s reasonable not to trust someone yet! In that case, a better tactic is to be more proactive about building trust. In a large, fast-growing company you’ll probably never get to the utopian ideal of full pairwise trust between everyone—it takes too long to build. But on the margin, more effort still helps a lot.
Some ways to invest more effort in trusting others that I’ve seen work well:
Share your most important mental models broadly. At Anthropic, Dario gives biweekly-ish “informal vision updates” (hour-long talks on important updates to parts of company strategy) that I think of as the canonical example of this. Just about everyone at Anthropic is trying to build an internal “Dario simulator” who they can consult when the real one is too busy (i.e. ~always). For high level strategy, these updates do an amazing job of that.
Put in time. In addition to one-way broadcasts, trust-building benefits a lot from one-on-one bidirectional communication so that you can get feedback on how well the other person is building the right models. This is one of the reasons I schedule lots of recurring 1:1s with peers in addition to my team. Offsites are also very helpful here.
Try people out. If you’re unsure whether someone on your team will be great at something, try giving them a trial task and monitoring how it’s going more closely than you would by default, to catch issues early. This is a great way to invest in your long-term ability to delegate things.
Give feedback. It’s easy to feel like something is “too minor” to give feedback on and let it slide, especially when there’s always too much to do. But I’ve never regretted erring on the side of giving feedback, and often regretted deciding to “deal with it” or keep quiet. One pro-tip here: if you feel anxious about giving someone negative feedback, consider whether you’ve given them enough positive feedback—which is a helpful buffer against people interpreting negative feedback as “you’re not doing well overall.”
Inspection forums, i.e., recurring meetings where leadership monitors the status of many projects by setting goals and tracking progress against them. The above tactics are mostly 1:1 or one-to-all, but sometimes you want to work with a small group and this is an efficient way of doing that.
To help other people trust you:
Accept that you start out with incomplete trust. When someone, say, tries to monitor my work more closely than I think is warranted, my initial reaction is to be defensive and ask them to trust me more. It takes effort to put myself into their shoes and remind myself that they probably don’t have a good enough model of me to trust me yet.
Overcommunicate status. This helps in two ways: first, it gives stakeholders more confidence that if something goes off the rails they’ll know quickly. And second, it gives them more data and helps them build a higher-fidelity model of how you operate.
Proactively own up when something isn’t going well. Arguably a special case of overcommunicating, but one that’s especially important to get right: if you can be relied on to ask for help when you need it, it’s a lot less risky for people to “try you out” on stuff at the edge of what they trust you on.
Related reading: Inspection and the limits of trust
2024-02-25 08:00:00
This is an adaptation of an internal doc I wrote for Wave.
I used to think that behavioral interviews were basically useless, because it was too easy for candidates to bullshit them and too hard for me to tell what was a good answer. I’d end up grading every candidate as an “okay, I guess” because I was never sure what bar I should hold them to.
I still think most behavioral interviews are like that, but after grinding out way too many of them, I now think it’s possible to escape that trap. Here are my tips and tricks for doing so!
Confidence level: doing this stuff worked better than not doing it, but I still feel like I could be a lot better at behavioral interviews, so please suggest improvements and/or do your own thing :)
I usually spend about two hours designing and preparing a new type of interview. If I put in that time thinking about what questions and follow-ups to ask, I’m much more likely to get a strong signal about which candidates performed well.
It might sound ridiculous to spend 2 hours building a 1-hour interview that you’ll only give 4 times. But it’s worth it: your most limited resource is time with candidates, so if you can spend more of your own time to use candidates’ time better, that’s a good trade.
I spend most of those 2 hours trying to answer the following question: “what answers to these questions would distinguish a great candidate from a mediocre one, and how can I dig for that?” I find that if I wait until after the interview to evaluate candidates, I rarely have conviction about them, and fall back to grading them a “weak hire” or “weak no-hire.”
To avoid this, write yourself a rubric of all the things you care about assessing, and what follow-up questions you’ll ask to assess those things. This will help you deliver the interview consistently, but most importantly, you’ll ask much better follow-up questions if you’ve thought about them beforehand. See the appendix for an example rubric.
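If it helps to see the shape of one, here’s a minimal sketch of a rubric kept as structured data. The trait, follow-ups, and good/bad signals are illustrative placeholders (not the actual rubric from the appendix), loosely adapted from the failure modes listed later in this post:

```python
# A hypothetical rubric skeleton: one entry per trait you're assessing.
# The trait, follow-ups, and signal descriptions are illustrative
# placeholders, not the actual rubric from the appendix.
RUBRIC = [
    {
        "trait": "self-improvement mindset",
        "follow_ups": [
            "Is there anything you wish you'd done differently?",
            "What would help you catch that sooner next time?",
        ],
        "strong_answer": "specific changes, plus a mechanism for making them happen",
        "weak_answer": "'nothing', or vague non-actionable takeaways",
    },
    # ...one entry per trait, for the 1-3 traits this interview targets
]

# Score each trait during the interview, not after, so you notice in the
# moment when you haven't gathered enough signal on something yet.
for entry in RUBRIC:
    print(entry["trait"], "- first follow-up:", entry["follow_ups"][0])
```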
I usually focus on 1-3 related skills or traits.
To get a strong signal from a behavioral interview question I usually need around 15 minutes, which only leaves time to discuss a small number of scenarios. For example, for a head of technical recruiting, I decided to focus my interview on the cluster of related traits of being great at communication, representing our culture to candidates, and holding a high bar for job candidate experience.
You should coordinate with the rest of the folks on your interview loop to make sure that, collectively, you cover all the most important traits for the role.
My formula for kicking off a behavioral question is “Tell me about a recent time when [X situation happened]. Just give me some brief high-level context on the situation, what the problem was, and how you addressed it. You can keep it high-level and I’ll ask follow-up questions afterward.”
I usually ask for a recent time to avoid having them pick the one time that paints them in the best possible light.
The second sentence (context/problem/solution) is important for helping the candidate keep their initial answer focused—otherwise, they are more likely to ramble for a long time and leave less time for you to ask follow-up questions.
Almost everyone will answer the initial behavioral interview prompt with something that sounds vaguely like it makes sense, even if they don’t actually behave in the ways you’re looking for. The best way to figure out whether they’re real or BSing you is to get them to tell you a lot of details about the situation—the more you get them to tell you, the harder it will be to BS all the details.
General follow-ups you can use to get more detail:
Ask for a timeline—how quickly people operate can be very informative. (Example: I asked someone how they dealt with an underperforming direct report and they gave a compelling story, but when I asked for the timeline, it seemed that weeks had elapsed between noticing the problem and doing anything about it.)
“And then what happened?” / “What was the outcome?” (Example: I asked this to a tech recruiter for the “underperforming report” question and they admitted they had to fire the person, which they hadn’t previously mentioned—that’s a yellow flag on honesty.)
Ask how big of an effect something had and how they know. (Example: I had a head of technical recruiting tell me “I did X and our outbound response rate improved”; when I asked how much, he said from 11% to 15%, but the sample size was small enough that that could have been random chance! See the quick significance check after this list.)
“Is there anything you wish you’d done differently?” (Sometimes people respond to this with non-actionable takeaways like “I wish I’d thought of that idea earlier” but having no plan or mechanism that could possibly cause them to think about the idea earlier the next time.)
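On that response-rate example, a rough significance check is easy to run after the fact. Here’s a minimal sketch using a two-proportion z-test, with made-up sample sizes since the candidate didn’t share real ones; at roughly 150 outbound emails per period, a jump from 11% to 15% is well within the range of chance:

```python
# Two-proportion z-test: could the response-rate "improvement" be noise?
# Sample sizes here are hypothetical; the candidate didn't give real ones.
from math import erf, sqrt

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x2 / n2 - x1 / n1) / se
    # Two-sided p-value via the normal CDF: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# ~11% of 150 emails before, ~15% of 150 emails after.
z, p = two_proportion_z_test(x1=17, n1=150, x2=23, n2=150)
print(f"z = {z:.2f}, p = {p:.2f}")  # z ≈ 1.02, p ≈ 0.31: easily random chance
```

Even without running numbers live, having a feel for how large a sample a 4-point difference needs helps you calibrate how hard to push on “how do you know?”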
One of the worst mistakes you can make in a behavioral interview is to wing it: to ask whatever follow-up questions pop into your head, and then at the end try to answer the question, “did I like this person?” If you do that, you’re much more likely to be a “weak yes” or “weak no” on every candidate, and to miss asking the follow-up questions that could have given you stronger signal.
Instead, you should know what you’re looking for, and what directions to probe in, before you start the interview. The best way to do this is to build a scoring rubric, where you decide what you’re going to look for and what a good vs. bad answer looks like. See the appendix for an example.
Of course, most of your rubric should be based on the details of what traits you’re trying to evaluate! But here are some failure modes that are common to most behavioral interviews:
Vague platitudes: some people have a tendency to fall back on vague generalities in behavioral interviews. “In recruiting, it’s all about communication!” “No org structure is perfect!” If they don’t follow this up with a more specific, precise or nuanced claim, they may not be a strong first-principles thinker.
Communication bandwidth: if you find that you’re struggling to understand what the person is saying or get on the same page as them, this is a bad sign about your ability to discuss nuanced topics in the future if you work together.
Self-improvement mindset: if the person responds to “what would you do differently” with “nothing,” or with non-actionable vague platitudes, it’s a sign they may not be great at figuring out how to get better at things over time.
Being embarrassingly honest: if probing for more details causes you to learn that things went less well than your original impression, the candidate is probably trying to “spin” the story at least a little bit.
High standards: if they say there’s nothing they wish they’d done differently, this may also be a lack of embarrassing honesty, or a sign that they don’t hold themselves to a high standard. (Personally, even for a project that went exceptionally well, I can think of lots of individual things I could have done better!)
Scapegoating: if you ask about solving a problem, do they take responsibility for contributing to the problem? It’s common for people to imply or say that problems were all caused by other people and solved by them (e.g., “this hiring manager wanted to do it their way, and I knew they were wrong, but I couldn’t convince them…”). Sometimes this is true, but usually problems aren’t a single person’s fault!
Here’s an example rubric and set of follow-up questions for a Head of Technical Recruiting.
Question: “Tell me about a time when your report wasn’t doing a good job.”