The Practical Developer

A constructive and inclusive social network for software developers.

5 Things That Will Fail Your SOC 2 Audit (That Nobody Warns You About)

2026-03-31 22:00:03

We passed our SOC 2 Type II audit on the second attempt. The first attempt, we failed. And the things that tripped us up were not the things any blog post had warned us about.

Everyone writes about "implement access controls" and "encrypt data at rest." Those are the obvious ones. Here are the five non-obvious things that almost sank our audit, and that I've since heard from multiple other startups that hit the same walls.

1. Your Audit Logs Don't Prove Anything

I wrote a whole separate post about this, but it's worth mentioning here because it was our single biggest failure point.

We had audit logs. We had millions of them in ELK. But when the auditor asked "can you demonstrate that these logs are complete and unmodified," we couldn't.

The auditor's specific concern was CC7.2 from the AICPA Trust Services Criteria: "The entity monitors system components and the operation of those components for anomalies that are indicative of malicious acts, natural disasters, and errors affecting the entity's ability to meet its objectives."

The word "monitors" is key. It's not enough to have logs. You need to demonstrate that you actively monitor them and that they have integrity controls. We had logging but no monitoring, no alerting, and no integrity verification.

What fixed it: We implemented hash-chained audit logging with daily integrity verification checks. We also set up alerts for suspicious patterns (multiple failed logins, privilege escalations, data exports over a threshold). The auditor wanted to see both the technical implementation AND evidence that alerts had been triggered and responded to during the observation period.
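The post doesn't show the implementation, so here's a minimal sketch of what hash-chained audit logging can look like: each entry's hash commits to its own payload plus the previous entry's hash, so any tampering breaks every link that follows. The AuditEntry shape and function names are illustrative assumptions, not the team's actual code.

```typescript
import { createHash } from "node:crypto";

// Illustrative entry shape -- not the actual schema from the post.
interface AuditEntry {
  timestamp: string;
  event: string;
  prevHash: string; // hash of the previous entry, "GENESIS" for the first
  hash: string;     // SHA-256 over this entry's payload plus prevHash
}

function hashEntry(timestamp: string, event: string, prevHash: string): string {
  return createHash("sha256").update(`${timestamp}|${event}|${prevHash}`).digest("hex");
}

function appendEntry(chain: AuditEntry[], timestamp: string, event: string): AuditEntry[] {
  const prevHash = chain.length ? chain[chain.length - 1].hash : "GENESIS";
  const entry: AuditEntry = { timestamp, event, prevHash, hash: hashEntry(timestamp, event, prevHash) };
  return [...chain, entry];
}

// The daily integrity check: recompute every hash and verify each link.
function verifyChain(chain: AuditEntry[]): boolean {
  let prevHash = "GENESIS";
  for (const entry of chain) {
    if (entry.prevHash !== prevHash) return false;
    if (entry.hash !== hashEntry(entry.timestamp, entry.event, entry.prevHash)) return false;
    prevHash = entry.hash;
  }
  return true;
}
```

Editing or deleting any entry invalidates the recomputed hashes downstream, which is exactly the property that lets you demonstrate log integrity to an auditor.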

2. Your Employee Offboarding Process Has Gaps

This one surprised us. We thought our offboarding was solid. When someone leaves, we disable their Okta account. Done, right?

Nope. The auditor asked for a list of every system the departed employee had access to and evidence that access was revoked in each one. Turns out, disabling Okta covers maybe 60% of access. The other 40%:

  • Personal API keys they generated (did you revoke those?)
  • AWS IAM credentials
  • Database connection strings they might have saved locally
  • SSH keys on servers
  • Third-party tools with separate logins (Figma, Notion, etc.)
  • GitHub deploy keys
  • Service account credentials they knew about
// Offboarding checklist - what the auditor actually wanted to see
interface OffboardingChecklist {
  employeeId: string;
  terminationDate: Date;
  steps: {
    system: string;
    accessType: string;
    revokedDate: Date | null;
    revokedBy: string | null;
    verified: boolean;
    evidence: string; // screenshot URL, API response, etc.
  }[];
}

// They wanted EVIDENCE for each step
// Not just a checkbox saying "done"
// Actual screenshots or API confirmations showing access was removed

The auditor also checked for timing. Was access revoked on the termination date or two weeks later? If there's a gap between when someone leaves and when their access is revoked, that's a finding.

What fixed it: We built an offboarding automation that queries every integrated system and produces an evidence report. It takes about 2 hours per departing employee instead of the 20 minutes we used to spend.
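To make that concrete, here's a hypothetical sketch of the kind of check such automation can run over the OffboardingChecklist steps shown above. The one-day delay threshold and the function name are assumptions for illustration, not a rule from the post or from SOC 2.

```typescript
// Step shape borrowed from the OffboardingChecklist interface above.
interface OffboardingStep {
  system: string;
  accessType: string;
  revokedDate: Date | null;
  revokedBy: string | null;
  verified: boolean;
  evidence: string; // screenshot URL, API response, etc.
}

// Flag steps with no revocation, late revocation, or missing evidence.
// The one-day window is an illustrative assumption.
function findOffboardingFindings(terminationDate: Date, steps: OffboardingStep[]): string[] {
  const findings: string[] = [];
  const maxDelayMs = 24 * 60 * 60 * 1000;

  for (const step of steps) {
    if (step.revokedDate === null) {
      findings.push(`${step.system}: access not yet revoked`);
    } else if (step.revokedDate.getTime() - terminationDate.getTime() > maxDelayMs) {
      findings.push(`${step.system}: revoked late`);
    }
    if (!step.verified || step.evidence === "") {
      findings.push(`${step.system}: missing verification evidence`);
    }
  }
  return findings;
}
```

A report like this, generated per departing employee, is the evidence trail the auditor asked for.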

3. Your Change Management Is "We Review PRs"

We told the auditor our change management process was code review via GitHub pull requests. Every change is reviewed before merging.

The auditor asked: "Can you show me the approval for this specific production deploy on February 14th?"

We could show the PR was reviewed. But we couldn't show that the PR corresponded to the specific deployment, that the deployment was authorized by someone with the right role, or that the production environment matched what was tested in staging.

SOC 2 CC8.1 requires that changes to system components are "authorized, designed, developed, configured, documented, tested, approved, and implemented."

That's a lot more than "someone clicked Approve on the PR."

# What we added to our CI/CD pipeline
# Deployment manifest that links everything together

deployment:
  # Links to the PR/change request
  change_request: "PR #1234"
  approved_by: "[email protected]"
  approval_timestamp: "2026-02-14T10:30:00Z"

  # Links to test evidence
  test_results:
    unit_tests: "passing - 847/847"
    integration_tests: "passing - 123/123"
    staging_deploy: "deploy_stg_abc123"
    staging_verification: "QA sign-off by [email protected]"

  # Production deployment details
  production:
    deployer: "ci-bot (automated)"
    deploy_timestamp: "2026-02-14T14:00:00Z"
    commit_sha: "a1b2c3d4"
    rollback_plan: "Revert commit a1b2c3d4, run db:rollback"

What fixed it: We added deployment manifests to every production deploy that link the PR, approval, test results, and deployment together in a single auditable record. The auditor could now trace any production change back to its authorization.
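As a rough sketch of how a CI step might enforce that completeness, the check below flags any empty field in a flattened version of the manifest above. The flattened field names and the function are illustrative assumptions, not the team's actual pipeline code.

```typescript
// Flattened, illustrative version of the YAML manifest above.
interface DeploymentManifest {
  changeRequest: string;
  approvedBy: string;
  approvalTimestamp: string;
  stagingDeploy: string;
  deployTimestamp: string;
  commitSha: string;
  rollbackPlan: string;
}

// A manifest only proves the trail if every link is present:
// return the names of any empty fields so CI can fail the deploy.
function missingManifestFields(m: DeploymentManifest): string[] {
  return (Object.entries(m) as [string, string][])
    .filter(([, value]) => value.trim() === "")
    .map(([key]) => key);
}
```

Failing the deploy when any field is missing is what guarantees every production change stays traceable during the observation period.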

4. Your Vendor Management Is Nonexistent

The auditor asked: "Which third-party services process or store your customers' data? What are their security certifications? When did you last review their security posture?"

We used about 15 third-party services (AWS, Stripe, SendGrid, Datadog, etc.). We had never formally documented which ones had access to customer data, never verified their SOC 2 reports, and never done a vendor risk assessment.

This falls under CC9.2: "The entity assesses and manages risks associated with vendors and business partners."

// What the auditor expected us to have
interface VendorAssessment {
  vendor: string;
  dataTypes: string[]; // What customer data do they access?
  certifications: string[]; // SOC 2, ISO 27001, etc.
  lastReviewed: Date;
  riskLevel: 'low' | 'medium' | 'high' | 'critical';
  contractHasSecurityTerms: boolean;
  contractHasDataProcessingAgreement: boolean;
  subprocessors: string[]; // Their vendors who touch our data
  incidentNotificationSLA: string;
}

What fixed it: We created a vendor inventory spreadsheet (honestly a Google Sheet works fine for this), collected SOC 2 reports from all critical vendors, and established a quarterly review cadence. Boring but necessary. According to NIST's Cybersecurity Supply Chain Risk Management guidance, vendor risk management should be proportional to the data sensitivity involved.
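Even a spreadsheet-backed process can be checked in code. Here's a small sketch of the quarterly cadence, reusing fields from the VendorAssessment shape above; the 90-day window and function name are assumptions for illustration.

```typescript
// Minimal subset of the VendorAssessment shape above.
interface VendorReview {
  vendor: string;
  lastReviewed: Date;
  riskLevel: "low" | "medium" | "high" | "critical";
}

// Flag vendors whose last review is older than the cadence window.
// 90 days approximates the quarterly review; adjust to taste.
function overdueVendors(vendors: VendorReview[], now: Date, maxAgeDays = 90): string[] {
  const maxAgeMs = maxAgeDays * 24 * 60 * 60 * 1000;
  return vendors
    .filter((v) => now.getTime() - v.lastReviewed.getTime() > maxAgeMs)
    .map((v) => v.vendor);
}
```

Running a check like this on a schedule turns "we review quarterly" from a claim into evidence.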

5. Your Incident Response Plan Has Never Been Tested

We had an incident response plan. It was a Google Doc someone wrote 18 months ago. It listed steps like "identify the incident" and "contain the threat" and "notify affected parties."

The auditor asked: "When was this plan last tested? Can you show me records of the test?"

Silence.

Having a plan is not enough. SOC 2 requires evidence that the plan has been tested and that lessons from the test were incorporated. CC7.4 specifically addresses "The entity responds to identified security incidents by executing a defined incident response program."

What fixed it: We ran a tabletop exercise. This is a meeting where you present a hypothetical security incident and walk through the response. No actual systems are affected. You just talk through: Who gets notified? What gets shut down? How do you communicate with customers? When do you involve legal?

We found 3 major gaps in our plan during the exercise:

  1. Our escalation contact list had someone who left the company 6 months ago
  2. Nobody knew how to rotate production database credentials in an emergency
  3. Our customer notification template referenced a product name we'd rebranded away from

The tabletop exercise took 2 hours and was genuinely useful. We now run one quarterly.

The Pattern

Notice a pattern? None of these are technical security failures. Our encryption was fine. Our access controls were fine. Our infrastructure was properly configured.

The failures were all in processes, documentation, and evidence. SOC 2 isn't really a technical audit. It's a process audit that happens to involve technology.

The auditor wants to see three things for every control:

  1. Design: Is the control designed to address the risk?
  2. Implementation: Is the control actually implemented?
  3. Operating effectiveness: Has the control been operating consistently during the observation period?

Most engineering teams focus on #2 (implementation) and forget about #1 (documentation) and #3 (evidence that it actually runs over time).

We eventually passed on our second attempt. The experience taught me that SOC 2 preparation is maybe 30% technical work and 70% process and documentation work. I wish someone had told us that before we spent 3 months only doing the technical part.

Migrating a Rails App from Heroku to Railway

2026-03-31 22:00:00

Last weekend I migrated my Doctors App from Heroku to Railway.

It's a multi-tenant Rails app where each hospital gets its own subdomain — one.doctors.com, two.doctors.com, and so on.

Five hospitals, around 25,000 appointments, 9,700+ patients. Not huge, but not trivial either.

Here's how it went, including the part where I accidentally broke the database.

The setup

I already had a Railway project running with a test domain (*.juanvasquez.dev) from earlier experiments. The web service was deployed from GitHub and the Postgres 17 instance was co-located in us-east4. Cloudflare R2 handles file storage — that stays the same regardless of where the app runs.

The plan was simple: put Heroku in maintenance mode, dump the database, restore it to Railway, flip the DNS, and go home.

The database restore

First, I captured a fresh Heroku backup and downloaded it:

heroku pg:backups:capture --app doctors
heroku pg:backups:download --app doctors --output /tmp/heroku_backup.dump

Then I wiped the Railway database and restored:

# Wipe
psql -h <railway-host> -p <port> -U postgres -d database_name \
  -c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"

# Restore
pg_restore --verbose --no-owner --no-acl \
  -h <railway-host> -p <port> -U postgres -d database_name /tmp/heroku_backup.dump

The restore threw two errors — both about the unaccent extension. Heroku installs extensions in a heroku_ext schema that doesn't exist on Railway. The fix is to just create it manually afterward:

psql -h <railway-host> -p <port> -U postgres -d database_name \
  -c "CREATE EXTENSION IF NOT EXISTS unaccent;"

Everything else restored cleanly. I verified every table:

Table               Heroku   Railway
users               9,752    9,752
appointments        25,481   25,481
addresses           9,835    9,835
patient_referrals   1,211    1,211
hospitals           5        5

All 12 tables matched exactly. If you take one thing from this post: always verify row counts after a restore.
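The comparison logic itself is trivial; here's a sketch, assuming the counts have already been pulled from each database (e.g. with SELECT count(*) per table). The function name is an illustrative assumption.

```typescript
// Given per-table row counts from the source and target databases,
// return the names of any tables whose counts diverge (including
// tables present on only one side).
function mismatchedTables(
  source: Record<string, number>,
  target: Record<string, number>,
): string[] {
  const tables = new Set([...Object.keys(source), ...Object.keys(target)]);
  return [...tables].filter((t) => source[t] !== target[t]);
}
```

An empty result means every table matched; anything else means the restore silently lost rows.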

The moment I broke the database

With the data restored, I wanted to trigger a deploy on the web service. I ran:

railway up --detach

Without --service web.

That command deployed my Rails application code onto the Postgres service. It replaced the PostgreSQL 17 container with Puma. The database was now a Rails web server that couldn't handle Postgres connections.

The logs told the story immediately:

HTTP parse error, malformed request: #<Puma::HttpParserError:
Invalid HTTP format, parsing fails. Are you trying to open
an SSL connection to a non-SSL Puma?>

The web service was trying to connect to Postgres, but Postgres was now running Puma, responding to TCP connections with HTTP errors.

The fix was to roll back the Postgres service to its last good deployment. Railway's CLI doesn't have a rollback command, so I used the dashboard to roll back the deployment.

After about 45 seconds, Postgres was back. Data intact. Lesson learned: always pass --service web when deploying.

Flipping the domain

Removing the test domain was another adventure. Railway's CLI can add domains but can't delete them. I used the dashboard to remove it.

Then I added the production wildcard domain:

railway domain "*.doctors.com" --service web --port 8080

Railway returned the DNS records I needed. In Squarespace (my domain registrar), I added:

Type    Host              Value
CNAME   *                 znjcefnu.up.railway.app
CNAME   _acme-challenge   znjcefnu.authorize.railwaydns.net

There was also a _railway-verify record for domain ownership. I initially tried adding it as a CNAME, but Squarespace rejected the value — it's actually a TXT record, not a CNAME. Small thing, but it tripped me up.

DNS propagated fast. Within a couple of minutes, Railway confirmed the domain was verified and SSL was provisioned.

One more thing: RACK_ENV

The first request to demo.doctors.com returned a 500. I checked the logs and saw... a Rails development error page. RACK_ENV was set to development. A quick variable update and redeploy fixed it:

railway variable set RACK_ENV=production --service web

Then all five hospital subdomains came back with 200s.

Trial plan limitations

Railway's trial plan only allows one custom domain per service. The wildcard *.doctors.com uses that single slot, which works great for multi-tenancy — every subdomain routes correctly. But I can't also add the root domain doctors.com. For now, I'll handle that with a redirect at the registrar level.

Pricing

                   Heroku                       Railway
Web service        $7/mo (Basic dyno)           Usage-based (~$5/mo)
Postgres           $5/mo (Mini)                 Included (500MB)
Custom domains     Included                     1 per service (trial)
SSL                Automatic                    Automatic
Chrome buildpack   Required for old PDF setup   Not needed (using Prawn now)

For my scale, Railway is slightly cheaper. The real win is simplicity — no buildpack configuration, no add-on marketplace to navigate, and Postgres is just there.

What I also did

While I was at it, I replaced Sentry with Honeybadger (referral link) for error tracking. Sentry's initializer still referenced Heroku env vars, so it was a good time to clean house. Honeybadger has a free plan, built-in uptime monitoring, and the Rails setup is just a YAML file:

# config/honeybadger.yml
api_key: <%= ENV.fetch("HONEYBADGER_API_KEY", "") %>
env: <%= Rails.env %>
exceptions:
  enabled: <%= Rails.env.production? %>

I also updated the CI pipeline — upgraded Postgres from 10.13 to 17 (matching production) and Node.js from 20 to 22 (matching package.json). Removed the Puppeteer and Chrome setup steps that were left over from when the app used Grover for PDF generation.

Things I'd tell myself before starting

  1. Verify row counts after every restore. Don't trust "no errors" — count the rows.
  2. Always specify --service when running Railway CLI commands. Especially railway up.
  3. Railway's CLI can't do everything. Domain deletion and deployment rollbacks need to be done through the dashboard.
  4. railway run executes locally, not on Railway's infrastructure. Use railway shell for remote access.
  5. Heroku's heroku_ext schema for extensions doesn't exist on Railway. Expect restore errors for extensions, and re-create them manually.
  6. Check your RACK_ENV. It seems obvious, but it's easy to forget when you're focused on the database.
  7. The _railway-verify DNS record is a TXT record, even though it looks like it could be a CNAME. Your registrar will reject it if you pick the wrong type.

Fair warning

Since migrating, I've seen reports from other developers that give me pause. One team experienced persistent 150–200ms request queuing on Railway that they couldn't resolve even with Pro plan support — response times that were 40ms on Heroku, Render, and DigitalOcean. Another long-time customer reported a caching misconfiguration that leaked user data between accounts, on top of weeks of near-daily incidents.

I measured my own response times after reading these reports, and for my scale they're good enough. But if you're running something larger, do thorough stress testing before committing, and have a rollback plan. Railway is young, and that cuts both ways: fast iteration, but also growing pains.

Was it worth it?

The whole migration took about an hour. Most of that was waiting for DNS propagation and debugging the Postgres incident. The actual work — dump, restore, set variables, flip DNS — was maybe 30 minutes.

Railway feels like what Heroku should have become. The dashboard is clean, deploys are fast, and the Postgres integration just works. I miss heroku run (Railway's local execution model is confusing at first), but railway shell covers most cases.

For a small multi-tenant Rails app like mine, it's a good fit. But I'm keeping my Heroku knowledge fresh — just in case.

Core Web Vitals Explained: What They Are, How to Measure Them, and Why They Matter for React Apps

2026-03-31 22:00:00

If you've seen the term "Core Web Vitals" and kept scrolling, this article is for you.

It's not just SEO jargon. These three metrics are the clearest signal we have for whether a web app feels fast to a real user — and they're measurable directly from your React code.

This article covers what the three metrics actually mean, how to measure them without any external tools, and what to do when they're bad.

What Are Core Web Vitals?

Core Web Vitals are three metrics defined by Google to measure user experience from a loading and interactivity perspective. They're based on real user data, not synthetic benchmarks.

The three metrics:

Metric                           Measures                        Good threshold
LCP — Largest Contentful Paint   Loading speed                   ≤ 2.5s
FCP — First Contentful Paint     Time to first visible content   ≤ 1.8s
CLS — Cumulative Layout Shift    Visual stability                ≤ 0.1

There's a fourth metric worth knowing: INP (Interaction to Next Paint), which replaced FID (First Input Delay) in 2024. INP measures how responsive the page feels when you click or type. We'll cover it briefly at the end.

LCP — Largest Contentful Paint

What it measures: How long until the largest visible element on screen finishes loading.

This is usually a hero image, a large heading, or the main content block. Whatever takes up the most screen real estate "above the fold."

Why it matters: LCP is the closest single metric to "when does this page feel loaded." Users don't think in milliseconds — they think "did it load or not." LCP is when the answer flips from "no" to "yes."

What causes bad LCP:

  • Large, unoptimized images (the most common cause)
  • Render-blocking JavaScript or CSS that delays the page from painting
  • Slow server response times (TTFB)
  • Third-party scripts loading before your content

How to measure it in code:

const lcpObserver = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  // Use the last entry — LCP can be updated as more content loads
  const lastEntry = entries[entries.length - 1];

  console.log('LCP:', lastEntry.startTime, 'ms');
  console.log('Element:', lastEntry.element); // Which element triggered it
});

lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });

Good: ≤ 2.5s

Needs improvement: 2.5s – 4.0s

Poor: > 4.0s

FCP — First Contentful Paint

What it measures: How long until the browser renders the first piece of DOM content — any text, image, or non-white canvas element.

Why it matters: FCP is a leading indicator. A slow FCP almost always means a slow LCP. If FCP is bad, users are staring at a blank screen, which is the worst user experience possible — worse than a slow load, because users don't even know if anything is happening.

What causes bad FCP:

  • Render-blocking resources (CSS and JS that pause HTML parsing)
  • Server-side rendering issues
  • Heavy JavaScript bundles that need to parse before anything renders

How to measure it:

const fcpObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      console.log('FCP:', entry.startTime, 'ms');
    }
  }
});

fcpObserver.observe({ type: 'paint', buffered: true });

Good: ≤ 1.8s

Needs improvement: 1.8s – 3.0s

Poor: > 3.0s

CLS — Cumulative Layout Shift

What it measures: How much the page layout shifts unexpectedly after it starts loading.

You've experienced this. You're reading an article, an ad loads above the paragraph you're on, and everything shifts down. You accidentally click the ad. That's a layout shift — and CLS measures how much of this happens across the full page lifecycle.

Why it matters: Layout shifts erode user trust instantly. They also cause accidental clicks, which is particularly bad on e-commerce and form pages.

What causes bad CLS:

  • Images and videos without width and height attributes set
  • Ads, embeds, or iframes without reserved space
  • Dynamically injected content above existing content
  • Web fonts loading and causing text to reflow (FOIT/FOUT)

How to measure it:

let clsValue = 0;

const clsObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Only count shifts that happen without user interaction
    if (!entry.hadRecentInput) {
      clsValue += entry.value;
      console.log('Current CLS:', clsValue);
    }
  }
});

clsObserver.observe({ type: 'layout-shift', buffered: true });

Good: ≤ 0.1

Needs improvement: 0.1 – 0.25

Poor: > 0.25

How These Metrics Relate to Each Other

Understanding the sequence helps:

Navigation starts
    ↓
FCP fires — first pixel of content rendered
    ↓
LCP fires — largest content element rendered
    ↓
Page becomes interactive
    ↓
CLS accumulates throughout — tracks all layout shifts

In practice: if FCP is bad, LCP will be bad too. If FCP is fine but LCP is bad, the issue is usually the main content (an image, a large element) taking too long. CLS is independent — a page can have great LCP and terrible CLS.

Measuring in Your React App: A Complete Setup

Here's a minimal but complete implementation that collects all three metrics and logs them:

// utils/web-vitals.ts

type MetricName = 'LCP' | 'FCP' | 'CLS';
type MetricReport = {
  name: MetricName;
  value: number;
  rating: 'good' | 'needs-improvement' | 'poor';
};

function getRating(name: MetricName, value: number): 'good' | 'needs-improvement' | 'poor' {
  const thresholds = {
    LCP: [2500, 4000],
    FCP: [1800, 3000],
    CLS: [0.1, 0.25],
  };

  const [good, poor] = thresholds[name];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

export function initWebVitals(onMetric: (metric: MetricReport) => void) {
  // LCP
  new PerformanceObserver((list) => {
    const entries = list.getEntries();
    const last = entries[entries.length - 1];
    const value = last.startTime;
    onMetric({ name: 'LCP', value, rating: getRating('LCP', value) });
  }).observe({ type: 'largest-contentful-paint', buffered: true });

  // FCP
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      if (entry.name === 'first-contentful-paint') {
        const value = entry.startTime;
        onMetric({ name: 'FCP', value, rating: getRating('FCP', value) });
      }
    }
  }).observe({ type: 'paint', buffered: true });

  // CLS
  let clsValue = 0;
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      if (!(entry as any).hadRecentInput) {
        clsValue += (entry as any).value;
        onMetric({ name: 'CLS', value: clsValue, rating: getRating('CLS', clsValue) });
      }
    }
  }).observe({ type: 'layout-shift', buffered: true });
}

Usage in your React app:

// App.tsx or main.tsx
import { initWebVitals } from './utils/web-vitals';

initWebVitals((metric) => {
  console.log(`${metric.name}: ${metric.value} (${metric.rating})`);
  // Send to your analytics endpoint, logging service, etc.
});

The Measurement Gap: Local vs. Production

Here's the part that most tutorials skip.

Lighthouse and DevTools give you synthetic measurements — they simulate a specific device and network condition in a controlled environment. This is useful for relative comparisons ("did my change make it better or worse?"), but it doesn't tell you what real users experience.

Real users have:

  • Older devices with slower CPUs
  • Variable network conditions (3G, congested WiFi)
  • Many browser tabs open
  • Cold cache (no previous visit to your site)

The only way to know your real-world Core Web Vitals is to measure in production, from real browsers. The code above does exactly that — it runs in your users' browsers and captures their actual experience.

What you do with those measurements is a separate question. At minimum, log them somewhere. Ideally, set up alerting so you know when they degrade — particularly after deploys.

Quick Wins for Each Metric

If you're seeing bad numbers, here's where to start:

Bad LCP?

  1. Check if the LCP element is an image — if so, add fetchpriority="high" to it
  2. Convert images to WebP format
  3. If using Next.js, switch to next/image
  4. Check TTFB — if your server responds slowly, everything else suffers

Bad FCP?

  1. Identify and remove render-blocking CSS/JS
  2. Inline critical CSS
  3. If using SSR, check that your server isn't doing too much work before sending HTML

Bad CLS?

  1. Add explicit width and height to all images and videos
  2. Reserve space for ads and dynamic embeds with CSS min-height
  3. Avoid inserting content above existing content after page load

What About INP?

INP (Interaction to Next Paint) replaced FID in March 2024. It measures how quickly the page responds to user interactions — clicks, taps, keyboard input.

Good threshold: ≤ 200ms

The most common cause of bad INP in React apps is expensive state updates that block the main thread. If you're seeing high INP, Long Tasks are usually the culprit — something is blocking the browser from responding to user input.

We'll cover Long Tasks in depth in the next article.

Summary

Core Web Vitals aren't just for SEO. They're the most concrete way to measure whether your app feels fast to a real user.

The three metrics tell a story:

  • FCP: Does anything appear quickly?
  • LCP: Does the main content load quickly?
  • CLS: Does the layout stay stable while loading?

You can measure all three with PerformanceObserver in your production React app right now, with zero dependencies.

How Do You Measure Whether Someone Is Actually Good at Working With AI?

2026-03-31 22:00:00

Here's a question that sounds simple and isn't: is your team actually good at working with AI, or are they just using it?

Using means generating output. Good at working with means the human added judgment, caught errors, maintained context, and produced something the organization can defend. The difference matters because when something goes wrong, accountability doesn't attach to the AI. It attaches to the person who signed off.

Every organization deploying AI needs to answer this question. And almost none of them can, because the tools they're using to measure AI skills don't measure collaboration. They measure knowledge.

The quiz problem

The dominant approach to measuring AI capability in organizations is some form of quiz: multiple choice, scenario-based questions, self-assessment surveys. These tell you whether someone knows what good collaboration looks like. They don't tell you whether someone does it.

This is the same gap that exists between knowing you should write tests and actually writing tests. Between knowing you should review PR diffs line by line and actually reviewing them. Knowledge and behavior diverge under real conditions, especially when the behavior is effortful and the shortcut is invisible.

The shortcut with AI is accepting output without meaningful verification. It looks like productivity. It feels like efficiency. And it's undetectable by any assessment that asks what you would do rather than observing what you actually do.

What behavioral measurement looks like

PAICE takes a different approach. Instead of asking people about AI collaboration, it puts them in one.

The assessment is a 25-minute conversation with an AI system. It looks and feels like a normal working session: you're given a realistic task, you collaborate with the AI to complete it, and you produce a deliverable. What you don't know is that the AI's outputs contain strategically injected errors -- factual mistakes, logical inconsistencies, subtle hallucinations calibrated to the domain.

The assessment isn't measuring whether you can use AI. It's measuring what happens when the AI is wrong and you're responsible for the output.

Do you catch the error? Do you verify claims that sound plausible? When you find a problem, do you fix it or work around it? When the AI pushes back on your correction, do you hold your ground or defer? These behavioral signals are what the scoring model captures.

Dimensional scoring

Collaboration quality isn't a single number. Someone might be excellent at iterative prompting but terrible at verification. Another person might catch every error but struggle to give the AI useful feedback. A single score flattens these differences into noise.

PAICE measures across multiple dimensions independently:

Accountability measures whether someone verifies outputs, detects injected errors, and takes ownership of the final work product. This is consistently the lowest-scoring dimension across all populations tested. People know they should verify. Under real working conditions, most don't verify thoroughly enough.

Integrity measures whether someone maintains factual standards, catches logical inconsistencies, and refuses to use AI-generated content that doesn't meet quality thresholds.

Collaboration quality measures the effectiveness of the human-AI interaction itself: whether feedback is specific and actionable, whether iteration actually improves the output, whether the person understands when AI adds value and when it introduces friction.

Evolution measures adaptive capacity: whether someone builds mental models of AI strengths and weaknesses over time and adjusts their approach accordingly.

Each dimension produces an independent score. For L&D teams designing targeted training, a dimensional profile is vastly more actionable than a percentage.

The engineering problem

Building this required solving several problems that don't have obvious precedents:

Error injection that doesn't break immersion. The injected errors have to be realistic enough that catching them requires domain judgment, not pattern recognition. If the errors are obviously wrong, you're measuring attention, not expertise. If they're too subtle, the signal-to-noise ratio collapses. The calibration is adaptive -- the system adjusts based on how the participant is performing.

Behavioral signal extraction from conversation. The scoring model doesn't grade the deliverable. It analyzes the collaboration process: what the participant questioned, what they accepted, how they responded to pushback, whether their verification was systematic or sporadic. This requires a multi-model architecture where the assessment AI and the scoring model operate independently.

Multi-model bias prevention. When the AI that runs the conversation is also the AI that scores it, you get circular reasoning. PAICE uses separate models for assessment delivery and scoring, with the scoring model evaluating behavioral signals rather than output quality.

Pre-post comparison for training ROI. The most valuable use case isn't a one-time score. It's administering the assessment before and after a training intervention and measuring whether actual behavior changed. This requires scoring stability across sessions and dimensional granularity fine enough to detect movement in specific skill areas.

Who this is for

PAICE is built for leaders and organizational decision-makers who are deploying AI and need to know whether their people are collaborating with it effectively or just using it as a faster copy-paste.

If you're a developer interested in the measurement architecture, the Closing the Collaboration Gap whitepaper covers the technical framework, and the daily blog explores the intersection of trust, verification, and performance measurement in human-AI systems.

paice.work

PAICE.work PBC is a public benefit corporation focused on making human-AI collaboration measurable, teachable, and governable.

Debugging Multi-Agent Systems: Traces, Capture Mode, and Live Dashboards

2026-03-31 22:00:00

Multi-agent systems are hard to debug.

It's not the same as debugging a web request or a database query. You can't set a breakpoint in the middle of an LLM call. You can't predict what the model will say. When an agent produces bad output, you need to understand the full chain of events: what prompt was sent, what the model returned, which tools were called, what context from previous tasks was injected, and whether the output parsing succeeded.

Traditional debuggers don't help here. You need purpose-built observability.

This post covers the debugging and observability stack in AgentEnsemble: structured traces for post-mortem analysis, capture mode for recording full execution state, and the live dashboard for real-time visibility during development.

The Debugging Challenge

Consider a three-agent pipeline: Researcher, Analyst, Writer. The Writer produces a report that's factually wrong. Where did things go wrong?

  • Did the Researcher find bad information?
  • Did the Analyst misinterpret the research?
  • Did the Writer ignore the analysis and hallucinate?
  • Did a tool call return unexpected results?
  • Was the wrong context passed between tasks?

Without observability, you're guessing. With it, you're reading a log.

Layer 1: Structured Traces

The most broadly useful debugging tool is the structured trace. It records every significant event in an ensemble run as a tree of spans:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();

This produces a JSON file in the traces/ directory with a structure like:

Ensemble Run (total: 8,420ms, 5,230 tokens)
 |
 +-- Task: Research emerging trends (3,240ms, 1,847 tokens)
 |    +-- LLM Call #1 (1,900ms, 1,200 tokens)
 |    +-- Tool: WebSearch "emerging tech trends 2024" (890ms)
 |    +-- LLM Call #2 (450ms, 647 tokens)
 |
 +-- Task: Analyze research findings (2,180ms, 1,583 tokens)
 |    +-- LLM Call #1 (2,180ms, 1,583 tokens)
 |
 +-- Task: Write final report (3,000ms, 1,800 tokens)
      +-- LLM Call #1 (2,400ms, 1,400 tokens)
      +-- LLM Call #2 (600ms, 400 tokens)  // output retry

Each span records:

  • Name: task description or tool call name
  • Duration: wall-clock time in milliseconds
  • Token count: input + output tokens for LLM calls
  • Status: success, failure, or retry
  • Input/Output: what went in, what came out

Accessing Traces Programmatically

You don't have to read the JSON file. The trace is available on the EnsembleOutput:

ExecutionTrace trace = output.getTrace();

// Walk the span tree
for (TraceSpan span : trace.getSpans()) {
    System.out.printf("[%s] %s -- %dms, %d tokens%n",
        span.getStatus(),
        span.getName(),
        span.getDurationMs(),
        span.getTokenCount());

    for (TraceSpan child : span.getChildren()) {
        System.out.printf("  [%s] %s -- %dms%n",
            child.getStatus(),
            child.getName(),
            child.getDurationMs());
    }
}

This is useful for writing assertions in tests:

@Test
void ensembleShouldCompleteAllTasks() {
    EnsembleOutput output = ensemble.run();

    ExecutionTrace trace = output.getTrace();
    assertThat(trace.getSpans()).hasSize(3);
    assertThat(trace.getSpans())
        .allMatch(span -> span.getStatus() == TraceStatus.SUCCESS);
    assertThat(output.getMetrics().getTotalTokens()).isLessThan(10_000);
}

Trace Export for Analysis Pipelines

The JSON trace format is designed for programmatic consumption. Feed it into your log aggregation system, build custom analysis scripts, or import it into a notebook:

// Export to a specific directory with timestamped filenames
.traceExporter(TraceExporter.json(Path.of("traces/")))

// Or get the raw JSON string
String traceJson = output.getTrace().toJson();
logAggregator.ingest("agent-trace", traceJson);

Layer 2: Capture Mode

Traces tell you what happened. Capture mode tells you exactly what happened -- including the full prompts, raw LLM responses, and tool call payloads.

Three Levels

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // OFF, STANDARD, or FULL
    .build()
    .run();

  • OFF: standard metrics only. Use in production.
  • STANDARD: adds full LLM message history per iteration and memory operations. Use in staging and initial deployment.
  • FULL: adds tool call I/O payloads, raw LLM responses, and detailed timing. Use in development and debugging.

What STANDARD Adds

With CaptureMode.STANDARD, each task's execution record includes the full conversation between the framework and the LLM:

Task: Research emerging trends
  Iteration 1:
    System prompt: "You are Senior Research Analyst. Your goal is..."
    User message: "Research emerging trends in AI thoroughly..."
    Assistant response: "I'll search for the latest information..."
    Tool call: WebSearch("emerging AI trends 2024")
  Iteration 2:
    System prompt: [same]
    User message: [previous context + tool result]
    Assistant response: "Based on my research, here are the key..."

This is invaluable for understanding why an agent behaved a certain way. You can see exactly what prompt it received, what context was injected, and how it reasoned through the task.

What FULL Adds

CaptureMode.FULL adds the raw payloads for every interaction:

  • Tool call inputs: The exact arguments passed to each tool.
  • Tool call outputs: The exact response from each tool.
  • Raw LLM responses: The complete response body, including any JSON that was parsed.
  • Timing breakdowns: Per-iteration timing, not just per-task.

This is the level you use when something is wrong and you can't figure out why from the trace alone. It's verbose -- expect significantly more data -- but it gives you full replay capability.

Using Capture Data in Tests

Capture mode is a testing power tool. Record a full execution, then write assertions against the captured data:

@Test
void researcherShouldUseWebSearch() {
    EnsembleOutput output = Ensemble.builder()
        .agents(researcher, writer)
        .tasks(researchTask, writeTask)
        .chatLanguageModel(model)
        .captureMode(CaptureMode.FULL)
        .build()
        .run();

    // Verify the researcher used the web search tool
    ExecutionTrace trace = output.getTrace();
    TraceSpan researchSpan = trace.getSpans().get(0);

    boolean usedWebSearch = researchSpan.getChildren().stream()
        .anyMatch(child -> child.getName().contains("WebSearch"));
    assertThat(usedWebSearch).isTrue();
}

You can also capture a "golden run" and use it as a reference for regression testing -- comparing future runs against the expected execution pattern.

Layer 3: Event Callbacks

For real-time debugging during development, callbacks give you a live stream of execution events:

Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                System.out.printf("%n>>> Starting: %s (agent: %s)%n",
                    e.taskDescription(), e.agentRole());

            case TaskCompleteEvent e ->
                System.out.printf("<<< Completed: %s (%dms, %d tokens)%n",
                    e.taskDescription(), e.durationMs(), e.tokenCount());

            case TaskFailedEvent e ->
                System.err.printf("!!! Failed: %s -- %s%n",
                    e.taskDescription(), e.errorMessage());

            case ToolCallEvent e ->
                System.out.printf("    [tool] %s(%s) -> %s%n",
                    e.toolName(),
                    truncate(e.input(), 50),
                    truncate(e.result(), 100));

            case DelegationStartedEvent e ->
                System.out.printf("    [delegate] %s -> %s%n",
                    e.fromAgent(), e.toAgent());

            case TokenEvent e ->
                // Streaming: print tokens as they arrive
                System.out.print(e.token());

            default -> {}
        }
    })
    .build()
    .run();

This gives you a live play-by-play of the ensemble execution in your terminal. You see each task start and complete, each tool call and its result, and each delegation in hierarchical workflows.

Combining Callbacks with Logging

For persistent debugging output, route events to your logging framework:

.listener(event -> {
    if (event instanceof TaskCompleteEvent e) {
        log.info("Task completed: task={}, agent={}, duration={}ms, tokens={}",
            e.taskDescription(), e.agentRole(),
            e.durationMs(), e.tokenCount());
    }
    if (event instanceof TaskFailedEvent e) {
        log.error("Task failed: task={}, error={}",
            e.taskDescription(), e.errorMessage());
    }
})

These flow into your existing log aggregation pipeline (ELK, Splunk, CloudWatch Logs) alongside your application's other logs.

Layer 4: The Live Dashboard

For the most visual debugging experience, AgentEnsemble includes a live browser dashboard:

Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .devtools(Devtools.enabled())
    .build()
    .run();

When the ensemble starts, a browser window opens (or a URL is printed to the console) showing a real-time visualization of the execution:

What the Dashboard Shows

  • DAG Visualization: A graph of all tasks and their dependencies. Nodes change color as tasks progress from pending to running to completed.
  • Agent Activity: Which agent is currently active, what it's doing, and how many iterations it's taken.
  • Token Consumption: Real-time token counters per task and for the entire ensemble.
  • Task Output Preview: Click on a completed task to see its output.
  • Timeline: A Gantt-chart-style view of task execution, showing parallelism and bottlenecks.

When to Use It

The live dashboard is a development tool, not a production monitoring dashboard. Use it when:

  • Building a new agent workflow and wanting to see the execution flow.
  • Debugging why a specific task takes too long or produces unexpected output.
  • Demonstrating an agent system to stakeholders.
  • Understanding the parallelism in a DAG or MapReduce workflow.

For production monitoring, use the Micrometer metrics integration and your existing Grafana/Prometheus stack.

Debugging Recipes

Here are specific debugging scenarios and how to approach them with the tools above.

"The output is wrong, but I don't know which agent failed"

Use traces. Look at each task's output in the trace tree. Find the first task whose output is incorrect -- that's where things diverged.

.traceExporter(TraceExporter.json(Path.of("debug/")))

Then read the trace JSON, find the task with bad output, and check its input context to see what it received from upstream tasks.

"The agent keeps calling the same tool in a loop"

Use capture mode + callbacks. Enable CaptureMode.FULL and add a callback that logs tool calls:

.captureMode(CaptureMode.FULL)
.listener(event -> {
    if (event instanceof ToolCallEvent e) {
        log.warn("Tool call: {} with input: {}",
            e.toolName(), e.input());
    }
})

Then check the captured LLM conversation to see why the agent keeps making the same call. Usually it's a prompt issue -- the agent doesn't recognize the tool result as sufficient.

"The structured output parsing keeps failing"

Use capture mode. Enable CaptureMode.FULL and check the raw LLM response:

.captureMode(CaptureMode.FULL)

The captured data includes the raw response before parsing. Compare it to your record schema. Common issues:

  • The LLM wraps JSON in markdown code blocks.
  • Field names don't match (the LLM uses camelCase, the record uses snake_case).
  • The LLM adds extra fields or comments.

The framework handles most of these, but FULL capture mode shows you exactly what's happening.
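
The first failure mode on that list -- JSON wrapped in markdown code fences -- is simple to handle yourself if you're post-processing raw responses. Here's a minimal sketch of that normalization step; this is illustrative, not AgentEnsemble's actual parser:

```java
public class FenceStripper {

    // Strip a leading ```json (or bare ```) fence line and a trailing ``` fence.
    static String stripFences(String raw) {
        String s = raw.trim();
        if (s.startsWith("```")) {
            int firstNewline = s.indexOf('\n');
            // Drop the opening fence line (e.g. "```json")
            s = firstNewline >= 0 ? s.substring(firstNewline + 1) : "";
            if (s.endsWith("```")) {
                s = s.substring(0, s.length() - 3);
            }
            s = s.trim();
        }
        return s;
    }

    public static void main(String[] args) {
        String wrapped = "```json\n{\"title\": \"Report\"}\n```";
        System.out.println(stripFences(wrapped)); // {"title": "Report"}
    }
}
```

If a response fails parsing even after this kind of cleanup, the FULL capture data is where you confirm whether the problem is field naming, extra fields, or something else entirely.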

"A parallel workflow is slower than expected"

Use the live dashboard. Enable devtools and look at the timeline view:

.devtools(Devtools.enabled())

You'll see whether tasks are actually running in parallel or if there's an unexpected dependency bottleneck. Common issues:

  • A task accidentally depends on another task via context() when it shouldn't.
  • One task takes much longer than the others, creating a bottleneck for downstream tasks.
  • Rate limiting is causing parallel tasks to serialize.

"I need to understand the full prompt the agent received"

Use CaptureMode.STANDARD or CaptureMode.FULL. The captured data includes the complete system prompt, user message, and any injected context for each LLM call.

This is the only way to see the actual prompt -- the framework constructs it dynamically from the agent's role/goal/background, the task description, context from previous tasks, and tool results.

Putting It All Together

A typical debugging setup during development:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Full observability stack
    .captureMode(CaptureMode.FULL)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .devtools(Devtools.enabled())
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            log.info("[DONE] {} -- {}ms", e.taskDescription(), e.durationMs());
        }
        if (event instanceof ToolCallEvent e) {
            log.info("[TOOL] {} -> {}", e.toolName(), e.result());
        }
    })
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

// Post-run analysis
EnsembleMetrics metrics = output.getMetrics();
log.info("Total cost: ${}, tokens: {}, duration: {}ms",
    metrics.getTotalCost(), metrics.getTotalTokens(),
    output.getTotalDuration());

For production, dial it back:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Production observability
    .captureMode(CaptureMode.OFF)
    .traceExporter(TraceExporter.json(Path.of("/var/log/agent-traces/")))
    .meterRegistry(prometheusMeterRegistry)
    .listener(productionEventHandler)
    .costConfiguration(costConfig)
    .build()
    .run();

The observability stack scales from "show me everything" during development to "show me what matters" in production. Same API, different configuration.

The Core Idea

Multi-agent systems are opaque by nature. An LLM call is a black box -- you send a prompt, you get a response, and the reasoning happens inside the model. The only way to make agent systems debuggable is to capture and structure everything around those black-box calls: what went in, what came out, how long it took, and how it fits into the broader execution flow.

That's what traces, capture mode, callbacks, and the live dashboard provide. Not transparency into the model, but transparency around it. And in practice, that's enough to debug anything.

Get started:

AgentEnsemble is MIT-licensed and available on GitHub.

CodeRabbit for Monorepos: Handling Large Codebases

2026-03-31 22:00:00

Monorepos are fantastic for keeping shared code aligned and deployments coordinated. They are also a nightmare for code review tooling that was not built with them in mind. A single pull request in a monorepo can touch a shared utility package, a React frontend, a Node.js API, and a database migration script - all at once. Most AI review tools treat that diff as one giant blob of text and produce reviews that are either too broad to be useful or too shallow to catch anything meaningful.

CodeRabbit handles this better than most, but it requires deliberate configuration to get the most out of it. This guide covers how to set up CodeRabbit specifically for monorepo workflows - from path-based filters to per-package instructions to managing the reality of 50-file pull requests.

If you are new to CodeRabbit in general, start with how to use CodeRabbit and then come back here for the monorepo-specific setup.

Why Monorepos Create Unique Review Challenges

Before diving into configuration, it is worth understanding what makes monorepos specifically hard for AI code review tools.

Volume and noise. A single feature PR in a monorepo might touch 80 files across 6 packages. If your AI reviewer tries to comment on every file equally, you end up with hundreds of comments - most of them low-signal - and developers start ignoring the tool entirely.

Context fragmentation. The AI reviewer sees the diff for packages/api/src/users/service.ts but may not have visibility into how that service is consumed by packages/web/src/hooks/useUser.ts, even if both files changed in the same PR. Cross-package impact analysis requires understanding the full dependency graph, not just the changed lines.

Inconsistent standards across packages. In a well-organized monorepo, different packages have different maturity levels and different conventions. Your internal utilities package might use strict TypeScript with no any types, while your experimental feature package uses a looser style. A one-size-fits-all review profile will either be too strict for some packages or too lenient for others.

Generated and boilerplate code. Monorepos often contain auto-generated files - GraphQL schema types, protobuf outputs, OpenAPI client code, Storybook snapshots. These change frequently and are meaningless to review. Without explicit exclusions, they consume review tokens that should be spent on real code.

Large PR problem. Teams working in monorepos often resist splitting PRs because related changes across packages need to land together. This leads to PRs with 50, 80, or even 150 changed files. Even the best AI reviewers struggle to maintain quality at that scale.

Setting Up .coderabbit.yaml for a Monorepo

The .coderabbit.yaml file is where all the real monorepo configuration lives. Place it at the root of your repository. Here is a solid starting template for a typical Nx or Turborepo monorepo:

# .coderabbit.yaml
language: en-US
tone_instructions: "Be concise. Prioritize actionable feedback over explanatory commentary."

reviews:
  profile: assertive
  request_changes_workflow: false
  high_level_summary: true
  poem: false
  review_status: true
  collapse_walkthrough: false
  auto_review:
    enabled: true
    ignore_title_keywords:
      - "WIP"
      - "chore"
      - "docs"
      - "release"
    drafts: false

path_filters:
  include:
    - "packages/**"
    - "apps/**"
    - "services/**"
  exclude:
    - "**/dist/**"
    - "**/build/**"
    - "**/*.generated.ts"
    - "**/*.generated.js"
    - "**/__snapshots__/**"
    - "**/graphql/generated/**"
    - "**/proto/generated/**"
    - "**/node_modules/**"
    - "**/*.lock"
    - "**/coverage/**"
    - "**/.turbo/**"
    - "**/.nx/**"

path_instructions:
  - path: "packages/api/**"
    instructions: |
      This is the core API package. Enforce strict input validation on all controller methods.
      Flag any database query that does not use parameterized inputs.
      Check that all new endpoints have corresponding OpenAPI documentation.
      Reject any use of 'any' type in TypeScript.

  - path: "packages/shared/**"
    instructions: |
      This is a shared utilities package used by all other packages.
      Be extra strict about breaking changes - flag any removed exports or changed function signatures.
      Ensure all exported functions have JSDoc documentation.
      Flag any circular dependency risks.

  - path: "packages/web/**"
    instructions: |
      This is the React frontend package.
      Check that new components have associated Storybook stories.
      Flag any direct DOM manipulation outside of React lifecycle.
      Look for missing key props in list renders and accessibility issues.

  - path: "apps/**"
    instructions: |
      Application-level code. Focus on configuration correctness and environment variable usage.
      Flag any hardcoded credentials, URLs, or environment-specific values.

  - path: "**/*.test.ts"
    instructions: |
      For test files, focus on test coverage completeness and assertion quality.
      Flag tests that only assert truthiness without checking specific values.
      Skip style comments on test files.

  - path: "**/*.spec.ts"
    instructions: "Same as test files - focus on assertion quality and skip style feedback."

This configuration does several things at once. The path_filters section narrows the diff to meaningful source code. The path_instructions section gives CodeRabbit a distinct personality and rule set for each part of the codebase. The auto_review settings prevent reviews from firing on PRs that are clearly not ready for feedback.

For a deeper look at all available configuration options, see the CodeRabbit configuration guide.

Per-Package Review Rules in Practice

The path_instructions feature is CodeRabbit's most valuable capability for monorepos. In practice, here is how teams use it effectively.

Security-sensitive packages get stricter rules. If you have a package that handles authentication, payments, or PII, you can instruct CodeRabbit to flag every instance of raw string concatenation in SQL, every missing rate limit check, and every unencrypted data write. This is the kind of context-aware review that would otherwise require a dedicated security reviewer on every PR.

Experimental packages get lighter treatment. A package under active development should not be held to the same standard as production code. Set the profile to chill for those paths and instruct CodeRabbit to skip style enforcement and focus only on obvious bugs.

Infrastructure-as-code gets different rules entirely. Terraform files, Helm charts, and Docker configurations have completely different quality signals than application code. You can instruct CodeRabbit to look for things like missing resource limits, open security group rules, and unversioned image tags - concerns that would not appear in instructions for a TypeScript package.

Shared libraries get change-impact focus. For any package imported by multiple other packages, instruct CodeRabbit to flag breaking changes prominently. Something like: "This package is imported by all other packages. Always highlight if a change could break downstream consumers, including removed exports, changed type signatures, or altered function behavior."
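
Taken together, these patterns map directly onto path_instructions entries. A sketch of what that could look like -- the package paths and rule text here are illustrative, not taken from a real repository:

```yaml
path_instructions:
  # Hypothetical security-sensitive package: strictest rules
  - path: "packages/payments/**"
    instructions: |
      Flag any SQL built via string concatenation, any endpoint missing a
      rate limit check, and any write of payment or PII data that is not
      encrypted.

  # Hypothetical experimental package: lighter treatment
  - path: "packages/labs/**"
    instructions: |
      Skip style enforcement. Comment only on obvious bugs and broken logic.

  # Hypothetical infrastructure-as-code directory
  - path: "infra/**"
    instructions: |
      Flag missing resource limits, security group rules open to 0.0.0.0/0,
      and container images that are unversioned or pinned to ':latest'.
```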

Handling Large PRs - the 50+ File Reality

Even with good PR hygiene, monorepo PRs get large. A migration, a shared utility refactor, or a cross-cutting configuration change can touch dozens of packages in one go. Here is how to manage that with CodeRabbit.

Use collapse_walkthrough: false selectively. The PR walkthrough is CodeRabbit's high-level summary of what changed. For large PRs, this is often more valuable than individual file comments. Keeping it uncollapsed helps reviewers get oriented before diving into specifics.

Accept that large PRs get lighter reviews. CodeRabbit's underlying LLM has a context window constraint. For very large diffs, the tool makes intelligent tradeoffs - it provides deeper analysis on files that show more complex changes and lighter summaries on files with small, mechanical changes. This is the right behavior, but it means you should not expect the same comment density on a 100-file PR as on a 10-file PR.

Use ignore keywords aggressively. If your monorepo has release automation that opens a PR to bump versions across all packages, that PR will have dozens of package.json changes. Add release and version-bump to ignore_title_keywords so CodeRabbit skips those entirely. Same for PRs titled chore: update dependencies - lockfile changes are not worth AI review cycles.

Split PRs where you can. This is the unsexy answer, but it is the right one. If a PR changes a shared package and five consumer packages, consider splitting it into a "shared package change" PR and a "consumer update" PR. The first PR can be reviewed in depth; the second is mechanical and can be approved quickly. CodeRabbit can still review both, and the quality of feedback on each will be higher.

Leverage draft PR status. Set auto_review.drafts: false in your config. This prevents CodeRabbit from reviewing draft PRs, saving review capacity for when a PR is actually ready. Many developers use draft status while assembling a large cross-package change, and triggering a review on a half-complete diff is wasteful.
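
The ignore-keyword and draft suggestions above reduce to a few lines of config; treat the keyword list as an example to adapt to your own release tooling:

```yaml
reviews:
  auto_review:
    enabled: true
    drafts: false                 # skip draft PRs while large changes are assembled
    ignore_title_keywords:
      - "release"
      - "version-bump"
      - "chore: update dependencies"
```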

Nx, Turborepo, and Lerna - What Changes?

CodeRabbit is build-tool agnostic. It does not integrate with your Nx project graph, Turborepo pipeline definitions, or Lerna workspace configuration directly. What it sees is the Git diff.

This means your .coderabbit.yaml path structure needs to mirror your workspace structure manually. If your Nx monorepo has apps in apps/ and libraries in libs/, your path_filters should reflect that:

path_filters:
  include:
    - "apps/**"
    - "libs/**"
  exclude:
    - "libs/generated/**"
    - "**/node_modules/**"

For Turborepo setups, the structure is typically apps/ for deployable applications and packages/ for shared code. The same approach applies.

For Lerna monorepos, which often use packages/ at the root, the configuration is the same as the Turborepo example above.

One Nx-specific consideration: Nx often generates boilerplate when you run nx generate. These generated files - default configs, barrel exports, test setups - are largely not worth reviewing. Add patterns like **/index.ts (for barrel files) and **/*.module.ts (for Angular module boilerplate) to your exclude list if your team auto-generates these.
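
For an Nx workspace, that exclusion set might look like the following -- treat the patterns as starting points and match them to what your generators actually emit:

```yaml
path_filters:
  exclude:
    - "libs/generated/**"
    - "**/index.ts"         # barrel files created by nx generate
    - "**/*.module.ts"      # Angular module boilerplate
    - "**/node_modules/**"
```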

The practical difference between these tools for CodeRabbit configuration is minimal. The path filter approach works the same way regardless of which build orchestrator you use.

Context Window Limits - What You Need to Know

CodeRabbit does not publish exact token limits for its review engine, but based on community reports and practical usage, here is what you can expect.

For PRs under roughly 30 files with moderate diff sizes, CodeRabbit operates at full capacity - detailed per-hunk comments, security analysis, style enforcement, and actionable suggestions.

For PRs in the 30 to 80 file range, CodeRabbit maintains review quality on the most complex files but may produce lighter-touch summaries for files with only small changes. You will still get a comprehensive walkthrough and comments on the important stuff.

For PRs over 80 files, especially those with large diffs per file, expect the per-file depth to decrease. The walkthrough remains useful, but individual file comments become more selective. This is not a bug - it is the tool making a rational tradeoff between breadth and depth.

The configuration option that matters most here is being selective with path_filters.include. If you include packages/** and your monorepo has 40 packages, a cross-cutting change will try to review all 40. If you know that only packages/api and packages/shared contain code worth deep review, say so explicitly:

path_filters:
  include:
    - "packages/api/**"
    - "packages/shared/**"
    - "packages/auth/**"

This is counterintuitive - you might feel like you are missing coverage. But getting high-quality reviews on your most critical packages is more valuable than surface-level coverage of everything.

Comparing CodeRabbit to CodeAnt AI for Monorepos

If you are evaluating AI review tools specifically for monorepo use cases, CodeAnt AI is worth considering alongside CodeRabbit. At $24 to $40 per user per month (Basic to Premium), CodeAnt AI bundles AI code review with SAST scanning, secret detection, infrastructure-as-code security analysis, and DORA metrics in a single platform.

For monorepos that house services with different security profiles - say, a public-facing API next to an internal data pipeline - CodeAnt AI's ability to apply different security rule sets to different services can be valuable. Its SAST coverage applies across the monorepo, which can catch vulnerabilities that a pure review tool might miss.

CodeRabbit's advantage for monorepos is the granularity of path_instructions. The ability to write natural-language review instructions per path, and have those honored consistently, is more flexible than CodeAnt AI's configuration model. If per-package review quality is your top priority, CodeRabbit wins on that dimension.

For a broader comparison of your options, see CodeRabbit alternatives.

A Realistic Workflow for Monorepo Teams

Here is what a well-configured CodeRabbit monorepo workflow looks like in practice for a team of 8 to 20 developers.

Step 1: Baseline configuration. Start with the YAML template above, adapt the path structure to your repo layout, and deploy it. Do not try to write path_instructions for every package on day one.

Step 2: Tune ignore patterns. After a week, look at the CodeRabbit comments your team marked as unhelpful or dismissed repeatedly. Most of these will be on generated files or boilerplate. Add those patterns to your exclude list.

Step 3: Add path instructions incrementally. Start with your most critical packages - the ones where a bug has the highest impact. Write focused instructions for those. Expand to other packages over the next few weeks as you learn what CodeRabbit catches and misses in your specific codebase.

Step 4: Set up PR conventions. Establish a convention that PRs touching more than 3 packages must use a descriptive title so that ignore_title_keywords can catch the mechanical ones. Add a PR template that reminds developers to split large changes where possible.

Step 5: Review CodeRabbit's performance monthly. Look at which packages generate the most review comments, and whether those comments lead to code changes. Adjust your path_instructions to reduce noise where the signal-to-noise ratio is poor.

For more advanced configuration patterns, the CodeRabbit configuration deep dive covers options not discussed here, including chat commands, Jira integration, and review scheduling.

Quick Reference - Key .coderabbit.yaml Settings for Monorepos

Here is a concise reference for the settings that matter most in a monorepo context:

  • path_filters.include - limits review to the matched paths; scope it to your real source packages
  • path_filters.exclude - skips matched paths entirely; exclude generated code, build output, and lock files
  • path_instructions - per-path review instructions; set different rules per package
  • auto_review.ignore_title_keywords - skips PRs whose titles match; skip release and chore PRs
  • auto_review.drafts: false - skips draft PRs; avoids reviewing WIP large PRs
  • reviews.profile - sets review strictness globally; use "assertive" for critical packages
  • reviews.high_level_summary - enables the PR walkthrough; essential for large PRs
  • reviews.collapse_walkthrough - controls walkthrough display; set it to false for large PRs

Getting Started Today

If you are reading this with a monorepo and no CodeRabbit configuration, the fastest path to value is:

  1. Install CodeRabbit on your repository (the how to use CodeRabbit guide covers setup in under 10 minutes)
  2. Create a .coderabbit.yaml at your repo root with at minimum a path_filters.exclude list for generated and build output directories
  3. Add a single path_instructions entry for your most critical package
  4. Open a test PR and evaluate the review quality
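Steps 2 and 3 above might translate into a starter file like this. All paths and the instruction text are placeholders for your own layout, and the field names follow this article's description of the schema:

```yaml
# .coderabbit.yaml - minimal monorepo starting point
reviews:
  high_level_summary: true
  path_filters:
    exclude:
      - "**/dist/**"
      - "**/build/**"
      - "**/*.generated.ts"
  path_instructions:
    - path: "packages/api/**"
      instructions: >-
        Enforce strict API contract validation and flag breaking
        changes to exported types.
```

Commit this at the repo root, open a test PR against a real package, and expand from there.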

From there, the configuration is iterative. CodeRabbit's learning system also picks up on patterns over time - comments your team consistently dismisses will be weighted less in future reviews, and approaches your team consistently approves will be reinforced.

For a comprehensive look at what CodeRabbit can do beyond monorepos, the full CodeRabbit review covers pricing, feature comparisons, and real-world performance benchmarks in detail.

Monorepo AI review is not a solved problem, but CodeRabbit's path-based configuration system gets you meaningfully closer than the default behavior of any AI review tool. The investment in .coderabbit.yaml configuration pays off within the first month for most teams.

Frequently Asked Questions

Does CodeRabbit work with monorepos?

Yes. CodeRabbit supports monorepos natively through .coderabbit.yaml path filters, per-path override rules, and ignore patterns. You can scope reviews to specific packages, exclude generated files, and set different review instructions for frontend versus backend packages - all within a single configuration file.

How do I configure CodeRabbit to only review certain packages in a monorepo?

Use the path_filters section in .coderabbit.yaml. You can define include and exclude glob patterns, for example include: ['packages/api/**', 'packages/web/**'] and exclude: ['**/generated/**', '**/snapshots/**']. CodeRabbit will then limit its review scope to only the matched paths.

Can CodeRabbit handle large pull requests with 50 or more changed files?

Yes, but with caveats. On the Pro plan, CodeRabbit has no enforced file count limit, but the underlying LLM context window constrains how deeply it can analyze very large diffs. For PRs with 50 or more files, CodeRabbit prioritizes the most critical changes and may produce lighter-touch summaries for less critical files. Splitting large PRs is still the recommended best practice.

What is the context window limit for CodeRabbit reviews?

CodeRabbit does not publicly publish its exact token limit, but in practice, reviews of diffs exceeding roughly 100,000 tokens may receive less granular per-file comments. The tool intelligently summarizes and batches large diffs, but for monorepos with massive PRs, you will get the best results by keeping individual PRs under 30-40 changed files.

How do I set different review rules for different packages in a monorepo?

Use the path_instructions array inside .coderabbit.yaml. Each entry accepts a path glob and a freeform instructions string. For example, you can instruct CodeRabbit to enforce strict API contract validation only for packages/api/** while applying different style rules to packages/web/**. This per-path instruction system is one of CodeRabbit's most powerful monorepo features.
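As a concrete sketch of the example in this answer (globs and instruction text are illustrative):

```yaml
reviews:
  path_instructions:
    - path: "packages/api/**"
      instructions: >-
        Enforce strict API contract validation; flag any breaking
        change to request or response shapes.
    - path: "packages/web/**"
      instructions: >-
        Focus on accessibility, component naming conventions, and
        avoiding unnecessary re-renders.
```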

Does CodeRabbit support Nx, Turborepo, and Lerna monorepos?

Yes. CodeRabbit is build-tool agnostic - it analyzes Git diffs and does not require integration with Nx, Turborepo, or Lerna directly. However, you can leverage your build tool's project graph by defining path filters in .coderabbit.yaml that mirror your workspace structure, ensuring reviews stay aligned with package boundaries.

How do I exclude auto-generated files from CodeRabbit reviews in a monorepo?

Add them to the path_filters.exclude list in .coderabbit.yaml. Common patterns include '**/dist/**', '**/build/**', '**/*.generated.ts', '**/graphql/schema.ts', and '**/snapshots/**'. You can also exclude entire packages that are vendor code or auto-generated client SDKs.

Can CodeRabbit understand cross-package dependencies in a monorepo?

Partially. CodeRabbit analyzes the diff it receives and has some ability to follow imports and references within the changed files. However, it does not have full visibility into the runtime dependency graph of your monorepo. For cross-package impact analysis, you still need human reviewers or a dedicated dependency analysis tool.

What .coderabbit.yaml settings matter most for monorepo performance?

The most impactful settings are path_filters (to scope the diff), path_instructions (for per-package rules), auto_review.ignore_title_keywords (to skip chore/docs PRs automatically), and reviews.profile (set to 'assertive' for critical packages, 'chill' for documentation packages). Combining these reduces noise and keeps reviews focused.
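Combined, the four settings named above might look like this. Values are illustrative, and note that the quick-reference table earlier describes profile as a global setting within a single configuration file:

```yaml
reviews:
  profile: "assertive"   # stricter globally; use "chill" for docs-heavy repos
  path_filters:
    exclude:
      - "**/generated/**"
  auto_review:
    ignore_title_keywords:
      - "docs"
      - "chore"
```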

Is CodeRabbit better than CodeAnt AI for monorepos?

CodeRabbit offers more granular path-based configuration for monorepos, making it a stronger choice if fine-grained per-package review rules are your priority. CodeAnt AI ($24-40/user/month) bundles SAST and security scanning alongside AI review, which can be valuable if your monorepo houses services with different security profiles. The right choice depends on whether you need pure review quality or broader security coverage.

How do I stop CodeRabbit from reviewing the same boilerplate across 40 packages?

Use path_filters.exclude to skip boilerplate-heavy directories, and use path_instructions to add a note like 'skip detailed style feedback on files matching this pattern - these are generated from templates'. You can also set the review profile to 'chill' for packages that contain mostly scaffolding code.

Does CodeRabbit's free tier support monorepo path filters?

Path filter configuration via .coderabbit.yaml is available on the free tier, but custom path_instructions (per-path review rules) require a Pro plan at $24/user/month. Free-tier users can still exclude paths and limit review scope, which alone provides a significant improvement for large monorepo PRs.

Originally published at aicodereview.cc