2026-03-31 22:00:03
We passed our SOC 2 Type II audit on the second attempt. The things that tripped us up the first time were not the things any blog post had warned us about.
Everyone writes about "implement access controls" and "encrypt data at rest." Those are the obvious ones. Here are the five non-obvious things that almost sank our audit, and that I've since heard from multiple other startups who hit the same walls.
I wrote a whole separate post about this, but it's worth mentioning here because it was our single biggest failure point.
We had audit logs. We had millions of them in ELK. But when the auditor asked "can you demonstrate that these logs are complete and unmodified," we couldn't.
The auditor's specific concern was CC7.2 from the AICPA Trust Services Criteria: "The entity monitors system components and the operation of those components for anomalies that are indicative of malicious acts, natural disasters, and errors affecting the entity's ability to meet its objectives."
The word "monitors" is key. It's not enough to have logs. You need to demonstrate that you actively monitor them and that they have integrity controls. We had logging but no monitoring, no alerting, and no integrity verification.
What fixed it: We implemented hash-chained audit logging with daily integrity verification checks. We also set up alerts for suspicious patterns (multiple failed logins, privilege escalations, data exports over a threshold). The auditor wanted to see both the technical implementation AND evidence that alerts had been triggered and responded to during the observation period.
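The idea behind hash chaining is simple: each entry's hash covers its own content plus the previous entry's hash, so modifying or deleting any record breaks every link after it. Here is a minimal sketch of the technique, not our production code; the field names and the SHA-256 choice are illustrative:

```typescript
import { createHash } from "node:crypto";

interface LogEntry {
  timestamp: string;
  event: string;
  prevHash: string; // hash of the previous entry; the first entry uses a fixed genesis value
  hash: string;     // hash over this entry's content plus prevHash
}

const GENESIS = "0".repeat(64);

function entryHash(timestamp: string, event: string, prevHash: string): string {
  return createHash("sha256").update(`${timestamp}|${event}|${prevHash}`).digest("hex");
}

// Append an entry, linking it to the tail of the chain.
export function appendEntry(chain: LogEntry[], timestamp: string, event: string): LogEntry[] {
  const prevHash = chain.length ? chain[chain.length - 1].hash : GENESIS;
  return [...chain, { timestamp, event, prevHash, hash: entryHash(timestamp, event, prevHash) }];
}

// The daily integrity check: recompute every hash and confirm each link is intact.
export function verifyChain(chain: LogEntry[]): boolean {
  let prev = GENESIS;
  for (const e of chain) {
    if (e.prevHash !== prev) return false;
    if (entryHash(e.timestamp, e.event, e.prevHash) !== e.hash) return false;
    prev = e.hash;
  }
  return true;
}
```

Running verifyChain on a schedule, and alerting when it returns false, is the kind of evidence an auditor can actually inspect.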
This one surprised us. We thought our offboarding was solid. When someone leaves, we disable their Okta account. Done, right?
Nope. The auditor asked for a list of every system the departed employee had access to and evidence that access was revoked in each one. Turns out, disabling Okta covers maybe 60% of access. The other 40%:
// Offboarding checklist - what the auditor actually wanted to see
interface OffboardingChecklist {
  employeeId: string;
  terminationDate: Date;
  steps: {
    system: string;
    accessType: string;
    revokedDate: Date | null;
    revokedBy: string | null;
    verified: boolean;
    evidence: string; // screenshot URL, API response, etc.
  }[];
}
// They wanted EVIDENCE for each step
// Not just a checkbox saying "done"
// Actual screenshots or API confirmations showing access was removed
The auditor also checked timing. Was access revoked on the termination date, or two weeks later? If there's a gap between when someone leaves and when their access is revoked, that's a finding.
What fixed it: We built an offboarding automation that queries every integrated system and produces an evidence report. It takes about 2 hours per departing employee instead of the 20 minutes we used to spend.
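Once the checklist is structured data, the timing check is easy to automate. A sketch with types simplified from the interface above; findGaps is a hypothetical helper, not part of our actual tooling:

```typescript
interface OffboardingStep {
  system: string;
  revokedDate: Date | null;
  verified: boolean; // evidence collected for this step
}

interface Finding {
  system: string;
  issue: string;
}

// Flag any step that was never revoked, revoked late, or lacks evidence.
export function findGaps(terminationDate: Date, steps: OffboardingStep[], maxDays = 1): Finding[] {
  const findings: Finding[] = [];
  for (const s of steps) {
    if (!s.revokedDate) {
      findings.push({ system: s.system, issue: "access never revoked" });
      continue;
    }
    const days = (s.revokedDate.getTime() - terminationDate.getTime()) / 86_400_000;
    if (days > maxDays) {
      findings.push({ system: s.system, issue: `revoked ${Math.round(days)} days late` });
    }
    if (!s.verified) {
      findings.push({ system: s.system, issue: "no evidence collected" });
    }
  }
  return findings;
}
```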
We told the auditor our change management process was code review via GitHub pull requests. Every change is reviewed before merging.
The auditor asked: "Can you show me the approval for this specific production deploy on February 14th?"
We could show the PR was reviewed. But we couldn't show that the PR corresponded to the specific deployment, that the deployment was authorized by someone with the right role, or that the production environment matched what was tested in staging.
SOC 2 CC8.1 requires that changes to system components are "authorized, designed, developed, configured, documented, tested, approved, and implemented."
That's a lot more than "someone clicked Approve on the PR."
# What we added to our CI/CD pipeline
# Deployment manifest that links everything together
deployment:
  # Links to the PR/change request
  change_request: "PR #1234"
  approved_by: "[email protected]"
  approval_timestamp: "2026-02-14T10:30:00Z"
  # Links to test evidence
  test_results:
    unit_tests: "passing - 847/847"
    integration_tests: "passing - 123/123"
    staging_deploy: "deploy_stg_abc123"
    staging_verification: "QA sign-off by [email protected]"
  # Production deployment details
  production:
    deployer: "ci-bot (automated)"
    deploy_timestamp: "2026-02-14T14:00:00Z"
    commit_sha: "a1b2c3d4"
    rollback_plan: "Revert commit a1b2c3d4, run db:rollback"
What fixed it: We added deployment manifests to every production deploy that link the PR, approval, test results, and deployment together in a single auditable record. The auditor could now trace any production change back to its authorization.
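A CI gate can refuse to ship anything whose manifest is incomplete or out of order. A simplified sketch, not our actual pipeline code; the field names loosely mirror the YAML manifest above:

```typescript
interface DeploymentManifest {
  changeRequest: string;
  approvedBy: string;
  approvalTimestamp: string; // ISO 8601
  deployTimestamp: string;   // ISO 8601
  commitSha: string;
}

// Return a list of reasons the deploy should be blocked; empty means OK.
export function manifestErrors(m: DeploymentManifest): string[] {
  const errors: string[] = [];
  if (!m.changeRequest) errors.push("missing change request link");
  if (!m.approvedBy) errors.push("missing approver");
  if (!m.commitSha) errors.push("missing commit SHA");
  // The approval must exist before the deployment happens, not after.
  if (Date.parse(m.approvalTimestamp) >= Date.parse(m.deployTimestamp)) {
    errors.push("approval must precede deployment");
  }
  return errors;
}
```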
The auditor asked: "Which third-party services process or store your customers' data? What are their security certifications? When did you last review their security posture?"
We used about 15 third-party services (AWS, Stripe, SendGrid, Datadog, etc.). We had never formally documented which ones had access to customer data, never verified their SOC 2 reports, and never done a vendor risk assessment.
This falls under CC9.2: "The entity assesses and manages risks associated with vendors and business partners."
// What the auditor expected us to have
interface VendorAssessment {
  vendor: string;
  dataTypes: string[]; // What customer data do they access?
  certifications: string[]; // SOC 2, ISO 27001, etc.
  lastReviewed: Date;
  riskLevel: 'low' | 'medium' | 'high' | 'critical';
  contractHasSecurityTerms: boolean;
  contractHasDataProcessingAgreement: boolean;
  subprocessors: string[]; // Their vendors who touch our data
  incidentNotificationSLA: string;
}
What fixed it: We created a vendor inventory spreadsheet (honestly a Google Sheet works fine for this), collected SOC 2 reports from all critical vendors, and established a quarterly review cadence. Boring but necessary. According to NIST's Cybersecurity Supply Chain Risk Management guidance, vendor risk management should be proportional to the data sensitivity involved.
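Once the inventory exists, flagging overdue reviews takes a few lines. A sketch; the cadence thresholds here are illustrative choices, not a compliance requirement:

```typescript
interface VendorReview {
  vendor: string;
  riskLevel: "low" | "medium" | "high" | "critical";
  lastReviewed: Date;
}

// Hypothetical cadence: higher-risk vendors get reviewed more often.
const MAX_DAYS: Record<VendorReview["riskLevel"], number> = {
  critical: 90,
  high: 90,
  medium: 180,
  low: 365,
};

// List the vendors whose last review is older than their cadence allows.
export function overdueVendors(vendors: VendorReview[], today: Date): string[] {
  return vendors
    .filter(v => (today.getTime() - v.lastReviewed.getTime()) / 86_400_000 > MAX_DAYS[v.riskLevel])
    .map(v => v.vendor);
}
```

Run it on a schedule and the quarterly review cadence stops depending on someone remembering.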
We had an incident response plan. It was a Google Doc someone wrote 18 months ago. It listed steps like "identify the incident" and "contain the threat" and "notify affected parties."
The auditor asked: "When was this plan last tested? Can you show me records of the test?"
Silence.
Having a plan is not enough. SOC 2 requires evidence that the plan has been tested and that lessons from the test were incorporated. CC7.4 specifically addresses "The entity responds to identified security incidents by executing a defined incident response program."
What fixed it: We ran a tabletop exercise. This is a meeting where you present a hypothetical security incident and walk through the response. No actual systems are affected. You just talk through: Who gets notified? What gets shut down? How do you communicate with customers? When do you involve legal?
We found three major gaps in our plan during the exercise.
The tabletop exercise took 2 hours and was genuinely useful. We now run one quarterly.
Notice a pattern? None of these are technical security failures. Our encryption was fine. Our access controls were fine. Our infrastructure was properly configured.
The failures were all in processes, documentation, and evidence. SOC 2 isn't really a technical audit. It's a process audit that happens to involve technology.
The auditor wants to see three things for every control:
1. Documentation: the policy or process exists in writing.
2. Implementation: the control actually works.
3. Evidence: proof the control operated consistently over the observation period.
Most engineering teams focus on #2 (implementation) and forget about #1 (documentation) and #3 (evidence that it actually runs over time).
We eventually passed on our second attempt. The experience taught me that SOC 2 preparation is maybe 30% technical work and 70% process and documentation work. I wish someone had told us that before we spent 3 months only doing the technical part.
2026-03-31 22:00:00
Last weekend I migrated my Doctors App from Heroku to Railway.
It's a multi-tenant Rails app where each hospital gets its own subdomain — one.doctors.com, two.doctors.com, and so on.
Five hospitals, around 25,000 appointments, 9,700+ patients. Not huge, but not trivial either.
Here's how it went, including the part where I accidentally broke the database.
I already had a Railway project running with a test domain (*.juanvasquez.dev) from earlier experiments. The web service was deployed from GitHub and the Postgres 17 instance was co-located in us-east4. Cloudflare R2 handles file storage — that stays the same regardless of where the app runs.
The plan was simple: put Heroku in maintenance mode, dump the database, restore it to Railway, flip the DNS, and go home.
First, I captured a fresh Heroku backup and downloaded it:
heroku pg:backups:capture --app doctors
heroku pg:backups:download --app doctors --output /tmp/heroku_backup.dump
Then I wiped the Railway database and restored:
# Wipe
psql -h <railway-host> -p <port> -U postgres -d database_name \
-c "DROP SCHEMA public CASCADE; CREATE SCHEMA public;"
# Restore
pg_restore --verbose --no-owner --no-acl \
-h <railway-host> -p <port> -U postgres -d database_name /tmp/heroku_backup.dump
The restore threw two errors — both about the unaccent extension. Heroku installs extensions in a heroku_ext schema that doesn't exist on Railway. The fix is to just create it manually afterward:
psql -h <railway-host> -p <port> -U postgres -d database_name \
-c "CREATE EXTENSION IF NOT EXISTS unaccent;"
Everything else restored cleanly. I verified every table:
| Table | Heroku | Railway |
|---|---|---|
| users | 9,752 | 9,752 |
| appointments | 25,481 | 25,481 |
| addresses | 9,835 | 9,835 |
| patient_referrals | 1,211 | 1,211 |
| hospitals | 5 | 5 |
All 12 tables matched exactly. If you take one thing from this post: always verify row counts after a restore.
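The comparison itself is scriptable: pull count(*) per table from both databases, then diff the two maps. A small TypeScript sketch, assuming you have already collected the counts (for example via psql):

```typescript
// Compare row counts per table from source and target databases.
// Any table present in either map with differing counts is a mismatch.
export function countMismatches(
  source: Record<string, number>,
  target: Record<string, number>,
): string[] {
  const tables = new Set([...Object.keys(source), ...Object.keys(target)]);
  return [...tables].filter(t => source[t] !== target[t]);
}
```

An empty result means the restore is verifiably complete; anything else means stop and investigate before flipping DNS.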
With the data restored, I wanted to trigger a deploy on the web service. I ran:
railway up --detach
Without --service web.
That command deployed my Rails application code onto the Postgres service. It replaced the PostgreSQL 17 container with Puma. The database was now a Rails web server that couldn't handle Postgres connections.
The logs told the story immediately:
HTTP parse error, malformed request: #<Puma::HttpParserError:
Invalid HTTP format, parsing fails. Are you trying to open
an SSL connection to a non-SSL Puma?>
The web service was trying to connect to Postgres, but Postgres was now running Puma, responding to TCP connections with HTTP errors.
The fix was to roll back the Postgres service to its last good deployment. Railway's CLI doesn't have a rollback command, so I used the dashboard to roll back the deployment.
After about 45 seconds, Postgres was back. Data intact. Lesson learned: always pass --service web when deploying.
Removing the test domain was another adventure. Railway's CLI can add domains but can't delete them. I used the dashboard to remove it.
Then I added the production wildcard domain:
railway domain "*.doctors.com" --service web --port 8080
Railway returned the DNS records I needed. In Squarespace (my domain registrar), I added:
| Type | Host | Value |
|---|---|---|
| CNAME | * | znjcefnu.up.railway.app |
| CNAME | _acme-challenge | znjcefnu.authorize.railwaydns.net |
There was also a _railway-verify record for domain ownership. I initially tried adding it as a CNAME, but Squarespace rejected the value — it's actually a TXT record, not a CNAME. Small thing, but it tripped me up.
DNS propagated fast. Within a couple of minutes, Railway confirmed the domain was verified and SSL was provisioned.
The first request to demo.doctors.com returned a 500. I checked the logs and saw... a Rails development error page. RACK_ENV was set to development. A quick variable update and redeploy fixed it:
railway variables set RACK_ENV=production --service web
Then all five hospital subdomains came back with 200s.
Railway's trial plan only allows one custom domain per service. The wildcard *.doctors.com uses that single slot, which works great for multi-tenancy — every subdomain routes correctly. But I can't also add the root domain doctors.com. For now, I'll handle that with a redirect at the registrar level.
| Heroku | Railway | |
|---|---|---|
| Web service | $7/mo (Basic dyno) | Usage-based (~$5/mo) |
| Postgres | $5/mo (Mini) | Included (500MB) |
| Custom domains | Included | 1 per service (trial) |
| SSL | Automatic | Automatic |
| Chrome buildpack | Required for old PDF setup | Not needed (using Prawn now) |
For my scale, Railway is slightly cheaper. The real win is simplicity — no buildpack configuration, no add-on marketplace to navigate, and Postgres is just there.
While I was at it, I replaced Sentry with Honeybadger (referral link) for error tracking. Sentry's initializer still referenced Heroku env vars, so it was a good time to clean house. Honeybadger has a free plan, built-in uptime monitoring, and the Rails setup is just a YAML file:
# config/honeybadger.yml
api_key: <%= ENV.fetch("HONEYBADGER_API_KEY", "") %>
env: <%= Rails.env %>
exceptions:
  enabled: <%= Rails.env.production? %>
I also updated the CI pipeline — upgraded Postgres from 10.13 to 17 (matching production) and Node.js from 20 to 22 (matching package.json). Removed the Puppeteer and Chrome setup steps that were left over from when the app used Grover for PDF generation.
Lessons learned:
- Always pass --service when running Railway CLI commands. Especially railway up.
- railway run executes locally, not on Railway's infrastructure. Use railway shell for remote access.
- The heroku_ext schema Heroku uses for extensions doesn't exist on Railway. Expect restore errors for extensions, and re-create them manually.
- The _railway-verify DNS record is a TXT record, even though it looks like it could be a CNAME. Your registrar will reject it if you pick the wrong type.
Since migrating, I've seen reports from other developers that give me pause. One team experienced persistent 150–200ms request queuing on Railway that they couldn't resolve even with Pro plan support — response times that were 40ms on Heroku, Render, and DigitalOcean. Another long-time customer reported a caching misconfiguration that leaked user data between accounts, on top of weeks of near-daily incidents.
I measured my own response times after reading these reports, and for my scale they're good enough. But if you're running something larger, do thorough stress testing before committing, and have a rollback plan. Railway is young, and that cuts both ways: fast iteration, but also growing pains.
The whole migration took about an hour. Most of that was waiting for DNS propagation and debugging the Postgres incident. The actual work — dump, restore, set variables, flip DNS — was maybe 30 minutes.
Railway feels like what Heroku should have become. The dashboard is clean, deploys are fast, and the Postgres integration just works. I miss heroku run (Railway's local execution model is confusing at first), but railway shell covers most cases.
For a small multi-tenant Rails app like mine, it's a good fit. But I'm keeping my Heroku knowledge fresh — just in case.
2026-03-31 22:00:00
If you've seen the term "Core Web Vitals" and kept scrolling, this article is for you.
It's not just SEO jargon. These three metrics are the clearest signal we have for whether a web app feels fast to a real user — and they're measurable directly from your React code.
This article covers what the three metrics actually mean, how to measure them without any external tools, and what to do when they're bad.
Core Web Vitals are three metrics defined by Google to measure user experience from a loading and interactivity perspective. They're based on real user data, not synthetic benchmarks.
The three metrics:
| Metric | Measures | Good threshold |
|---|---|---|
| LCP — Largest Contentful Paint | Loading speed | ≤ 2.5s |
| FCP — First Contentful Paint | Time to first visible content | ≤ 1.8s |
| CLS — Cumulative Layout Shift | Visual stability | ≤ 0.1 |
There's a fourth metric worth knowing: INP (Interaction to Next Paint), which replaced FID (First Input Delay) in 2024. INP measures how responsive the page feels when you click or type. We'll cover it briefly at the end.
What it measures: How long until the largest visible element on screen finishes loading.
This is usually a hero image, a large heading, or the main content block. Whatever takes up the most screen real estate "above the fold."
Why it matters: LCP is the closest single metric to "when does this page feel loaded." Users don't think in milliseconds — they think "did it load or not." LCP is when the answer flips from "no" to "yes."
What causes bad LCP: slow server responses, render-blocking JavaScript and CSS, large unoptimized images, and client-side rendering that delays the main content.
How to measure it in code:
const lcpObserver = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  // Use the last entry — LCP can be updated as more content loads
  const lastEntry = entries[entries.length - 1];
  console.log('LCP:', lastEntry.startTime, 'ms');
  console.log('Element:', lastEntry.element); // Which element triggered it
});
lcpObserver.observe({ type: 'largest-contentful-paint', buffered: true });
Good: ≤ 2.5s
Needs improvement: 2.5s – 4.0s
Poor: > 4.0s
What it measures: How long until the browser renders the first piece of DOM content — any text, image, or non-white canvas element.
Why it matters: FCP is a leading indicator. A slow FCP almost always means a slow LCP. If FCP is bad, users are staring at a blank screen, which is the worst user experience possible — worse than a slow load, because users don't even know if anything is happening.
What causes bad FCP: render-blocking scripts and stylesheets, slow time to first byte, and heavy JavaScript bundles that delay the first render.
How to measure it:
const fcpObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    if (entry.name === 'first-contentful-paint') {
      console.log('FCP:', entry.startTime, 'ms');
    }
  }
});
fcpObserver.observe({ type: 'paint', buffered: true });
Good: ≤ 1.8s
Needs improvement: 1.8s – 3.0s
Poor: > 3.0s
What it measures: How much the page layout shifts unexpectedly after it starts loading.
You've experienced this. You're reading an article, an ad loads above the paragraph you're on, and everything shifts down. You accidentally click the ad. That's a layout shift — and CLS measures how much of this happens across the full page lifecycle.
Why it matters: Layout shifts erode user trust instantly. They also cause accidental clicks, which is particularly bad on e-commerce and form pages.
What causes bad CLS: images without width and height attributes set, ads and embeds injected without reserved space, and web fonts swapping in late.
How to measure it:
let clsValue = 0;
const clsObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Only count shifts that happen without user interaction
    if (!entry.hadRecentInput) {
      clsValue += entry.value;
      console.log('Current CLS:', clsValue);
    }
  }
});
clsObserver.observe({ type: 'layout-shift', buffered: true });
Good: ≤ 0.1
Needs improvement: 0.1 – 0.25
Poor: > 0.25
Understanding the sequence helps:
Navigation starts
↓
FCP fires — first pixel of content rendered
↓
LCP fires — largest content element rendered
↓
Page becomes interactive
↓
CLS accumulates throughout — tracks all layout shifts
In practice: if FCP is bad, LCP will be bad too. If FCP is fine but LCP is bad, the issue is usually the main content (an image, a large element) taking too long. CLS is independent — a page can have great LCP and terrible CLS.
Here's a minimal but complete implementation that collects all three metrics and logs them:
// utils/web-vitals.ts
type MetricName = 'LCP' | 'FCP' | 'CLS';

type MetricReport = {
  name: MetricName;
  value: number;
  rating: 'good' | 'needs-improvement' | 'poor';
};

function getRating(name: MetricName, value: number): 'good' | 'needs-improvement' | 'poor' {
  const thresholds = {
    LCP: [2500, 4000],
    FCP: [1800, 3000],
    CLS: [0.1, 0.25],
  };
  const [good, poor] = thresholds[name];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

export function initWebVitals(onMetric: (metric: MetricReport) => void) {
  // LCP
  new PerformanceObserver((list) => {
    const entries = list.getEntries();
    const last = entries[entries.length - 1];
    const value = last.startTime;
    onMetric({ name: 'LCP', value, rating: getRating('LCP', value) });
  }).observe({ type: 'largest-contentful-paint', buffered: true });

  // FCP
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      if (entry.name === 'first-contentful-paint') {
        const value = entry.startTime;
        onMetric({ name: 'FCP', value, rating: getRating('FCP', value) });
      }
    }
  }).observe({ type: 'paint', buffered: true });

  // CLS
  let clsValue = 0;
  new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      if (!(entry as any).hadRecentInput) {
        clsValue += (entry as any).value;
        onMetric({ name: 'CLS', value: clsValue, rating: getRating('CLS', clsValue) });
      }
    }
  }).observe({ type: 'layout-shift', buffered: true });
}
Usage in your React app:
// App.tsx or main.tsx
import { initWebVitals } from './utils/web-vitals';
initWebVitals((metric) => {
  console.log(`${metric.name}: ${metric.value} (${metric.rating})`);
  // Send to your analytics endpoint, logging service, etc.
});
Here's the part that most tutorials skip.
Lighthouse and DevTools give you synthetic measurements — they simulate a specific device and network condition in a controlled environment. This is useful for relative comparisons ("did my change make it better or worse?"), but it doesn't tell you what real users experience.
Real users have slower devices, spottier networks, browser extensions, and geographic latency that no lab simulation reproduces.
The only way to know your real-world Core Web Vitals is to measure in production, from real browsers. The code above does exactly that — it runs in your users' browsers and captures their actual experience.
What you do with those measurements is a separate question. At minimum, log them somewhere. Ideally, set up alerting so you know when they degrade — particularly after deploys.
If you're seeing bad numbers, here's where to start:
Bad LCP?
- Add fetchpriority="high" to the LCP image
- If you use Next.js, serve it with next/image
Bad FCP?
Bad CLS?
- Add width and height to all images and videos
- Reserve space for late-loading content with min-height
INP (Interaction to Next Paint) replaced FID in March 2024. It measures how quickly the page responds to user interactions — clicks, taps, keyboard input.
Good threshold: ≤ 200ms
The most common cause of bad INP in React apps is expensive state updates that block the main thread. If you're seeing high INP, Long Tasks are usually the culprit — something is blocking the browser from responding to user input.
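You can get a rough sense of interaction responsiveness from the browser's event timing entries. This is only an approximation of INP (the real definition uses a high percentile over all interactions, and the web-vitals library implements it properly); the helper below is a sketch that tracks the worst interaction seen:

```typescript
type EventEntryLike = { interactionId?: number; duration: number };

// Pure helper: fold event-timing entries into the worst interaction duration seen.
// Entries without an interactionId (e.g. continuous events) are ignored.
export function worstInteraction(current: number, entries: EventEntryLike[]): number {
  let worst = current;
  for (const e of entries) {
    if (e.interactionId && e.duration > worst) worst = e.duration;
  }
  return worst;
}

// Browser wiring (runs only in a browser context; durationThreshold filters trivial events):
// let inp = 0;
// new PerformanceObserver((list) => {
//   inp = worstInteraction(inp, list.getEntries() as unknown as EventEntryLike[]);
// }).observe({ type: 'event', durationThreshold: 40, buffered: true } as PerformanceObserverInit);
```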
We'll cover Long Tasks in depth in the next article.
Core Web Vitals aren't just for SEO. They're the most concrete way to measure whether your app feels fast to a real user.
The three metrics tell a story: FCP is when something first appears, LCP is when the main content lands, and CLS is whether the page holds still once it does.
You can measure all three with PerformanceObserver in your production React app right now, with zero dependencies.
2026-03-31 22:00:00
Here's a question that sounds simple and isn't: is your team actually good at working with AI, or are they just using it?
Using means generating output. Good at working with means the human added judgment, caught errors, maintained context, and produced something the organization can defend. The difference matters because when something goes wrong, accountability doesn't attach to the AI. It attaches to the person who signed off.
Every organization deploying AI needs to answer this question. And almost none of them can, because the tools they're using to measure AI skills don't measure collaboration. They measure knowledge.
The dominant approach to measuring AI capability in organizations is some form of quiz: multiple choice, scenario-based questions, self-assessment surveys. These tell you whether someone knows what good collaboration looks like. They don't tell you whether someone does it.
This is the same gap that exists between knowing you should write tests and actually writing tests. Between knowing you should review PR diffs line by line and actually reviewing them. Knowledge and behavior diverge under real conditions, especially when the behavior is effortful and the shortcut is invisible.
The shortcut with AI is accepting output without meaningful verification. It looks like productivity. It feels like efficiency. And it's undetectable by any assessment that asks what you would do rather than observing what you actually do.
PAICE takes a different approach. Instead of asking people about AI collaboration, it puts them in one.
The assessment is a 25-minute conversation with an AI system. It looks and feels like a normal working session: you're given a realistic task, you collaborate with the AI to complete it, and you produce a deliverable. What you don't know is that the AI's outputs contain strategically injected errors -- factual mistakes, logical inconsistencies, subtle hallucinations calibrated to the domain.
The assessment isn't measuring whether you can use AI. It's measuring what happens when the AI is wrong and you're responsible for the output.
Do you catch the error? Do you verify claims that sound plausible? When you find a problem, do you fix it or work around it? When the AI pushes back on your correction, do you hold your ground or defer? These behavioral signals are what the scoring model captures.
Collaboration quality isn't a single number. Someone might be excellent at iterative prompting but terrible at verification. Another person might catch every error but struggle to give the AI useful feedback. A single score flattens these differences into noise.
PAICE measures across multiple dimensions independently:
Accountability measures whether someone verifies outputs, detects injected errors, and takes ownership of the final work product. This is consistently the lowest-scoring dimension across all populations tested. People know they should verify. Under real working conditions, most don't verify thoroughly enough.
Integrity measures whether someone maintains factual standards, catches logical inconsistencies, and refuses to use AI-generated content that doesn't meet quality thresholds.
Collaboration quality measures the effectiveness of the human-AI interaction itself: whether feedback is specific and actionable, whether iteration actually improves the output, whether the person understands when AI adds value and when it introduces friction.
Evolution measures adaptive capacity: whether someone builds mental models of AI strengths and weaknesses over time and adjusts their approach accordingly.
Each dimension produces an independent score. For L&D teams designing targeted training, a dimensional profile is vastly more actionable than a percentage.
Building this required solving several problems that don't have obvious precedents:
Error injection that doesn't break immersion. The injected errors have to be realistic enough that catching them requires domain judgment, not pattern recognition. If the errors are obviously wrong, you're measuring attention, not expertise. If they're too subtle, the signal-to-noise ratio collapses. The calibration is adaptive -- the system adjusts based on how the participant is performing.
Behavioral signal extraction from conversation. The scoring model doesn't grade the deliverable. It analyzes the collaboration process: what the participant questioned, what they accepted, how they responded to pushback, whether their verification was systematic or sporadic. This requires a multi-model architecture where the assessment AI and the scoring model operate independently.
Multi-model bias prevention. When the AI that runs the conversation is also the AI that scores it, you get circular reasoning. PAICE uses separate models for assessment delivery and scoring, with the scoring model evaluating behavioral signals rather than output quality.
Pre-post comparison for training ROI. The most valuable use case isn't a one-time score. It's administering the assessment before and after a training intervention and measuring whether actual behavior changed. This requires scoring stability across sessions and dimensional granularity fine enough to detect movement in specific skill areas.
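The pre-post computation itself is simple once the dimensional scores exist. A sketch; the dimension names follow the post, and the 0-100 numeric scale is a hypothetical, not PAICE's actual scoring model:

```typescript
type Dimension = "accountability" | "integrity" | "collaboration" | "evolution";
type Profile = Record<Dimension, number>; // hypothetical 0-100 scale per dimension

// Per-dimension change between a pre-training and post-training assessment.
// Positive values indicate improvement on that dimension.
export function trainingDeltas(pre: Profile, post: Profile): Record<Dimension, number> {
  const out = {} as Record<Dimension, number>;
  for (const d of Object.keys(pre) as Dimension[]) {
    out[d] = post[d] - pre[d];
  }
  return out;
}
```

The point of the dimensional profile is exactly this: a single aggregate score could stay flat while accountability improved and collaboration regressed, and you would never know.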
PAICE is built for leaders and organizational decision-makers who are deploying AI and need to know whether their people are collaborating with it effectively or just using it as a faster copy-paste.
If you're a developer interested in the measurement architecture, the Closing the Collaboration Gap whitepaper covers the technical framework, and the daily blog explores the intersection of trust, verification, and performance measurement in human-AI systems.
PAICE.work PBC is a public benefit corporation focused on making human-AI collaboration measurable, teachable, and governable.
2026-03-31 22:00:00
Multi-agent systems are hard to debug.
It's not the same as debugging a web request or a database query. You can't set a breakpoint in the middle of an LLM call. You can't predict what the model will say. When an agent produces bad output, you need to understand the full chain of events: what prompt was sent, what the model returned, which tools were called, what context from previous tasks was injected, and whether the output parsing succeeded.
Traditional debuggers don't help here. You need purpose-built observability.
This post covers the debugging and observability stack in AgentEnsemble: structured traces for post-mortem analysis, capture mode for recording full execution state, and the live dashboard for real-time visibility during development.
Consider a three-agent pipeline: Researcher, Analyst, Writer. The Writer produces a report that's factually wrong. Where did things go wrong?
Without observability, you're guessing. With it, you're reading a log.
The most broadly useful debugging tool is the structured trace. It records every significant event in an ensemble run as a tree of spans:
EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();
This produces a JSON file in the traces/ directory with a structure like:
Ensemble Run (total: 8,420ms, 5,230 tokens)
|
+-- Task: Research emerging trends (3,240ms, 1,847 tokens)
| +-- LLM Call #1 (1,900ms, 1,200 tokens)
| +-- Tool: WebSearch "emerging tech trends 2024" (890ms)
| +-- LLM Call #2 (450ms, 647 tokens)
|
+-- Task: Analyze research findings (2,180ms, 1,583 tokens)
| +-- LLM Call #1 (2,180ms, 1,583 tokens)
|
+-- Task: Write final report (3,000ms, 1,800 tokens)
+-- LLM Call #1 (2,400ms, 1,400 tokens)
+-- LLM Call #2 (600ms, 400 tokens) // output retry
Each span records:
| Field | Description |
|---|---|
| Name | Task description or tool call name |
| Duration | Wall-clock time in milliseconds |
| Token count | Input + output tokens for LLM calls |
| Status | Success, failure, or retry |
| Input/Output | What went in, what came out |
You don't have to read the JSON file. The trace is available on the EnsembleOutput:
ExecutionTrace trace = output.getTrace();

// Walk the span tree
for (TraceSpan span : trace.getSpans()) {
    System.out.printf("[%s] %s -- %dms, %d tokens%n",
        span.getStatus(),
        span.getName(),
        span.getDurationMs(),
        span.getTokenCount());
    for (TraceSpan child : span.getChildren()) {
        System.out.printf("  [%s] %s -- %dms%n",
            child.getStatus(),
            child.getName(),
            child.getDurationMs());
    }
}
This is useful for writing assertions in tests:
@Test
void ensembleShouldCompleteAllTasks() {
    EnsembleOutput output = ensemble.run();
    ExecutionTrace trace = output.getTrace();

    assertThat(trace.getSpans()).hasSize(3);
    assertThat(trace.getSpans())
        .allMatch(span -> span.getStatus() == TraceStatus.SUCCESS);
    assertThat(output.getMetrics().getTotalTokens()).isLessThan(10_000);
}
The JSON trace format is designed for programmatic consumption. Feed it into your log aggregation system, build custom analysis scripts, or import it into a notebook:
// Export to a specific directory with timestamped filenames
.traceExporter(TraceExporter.json(Path.of("traces/")))
// Or get the raw JSON string
String traceJson = output.getTrace().toJson();
logAggregator.ingest("agent-trace", traceJson);
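As an example of a custom analysis script, here is a TypeScript sketch that flattens an exported trace and surfaces the slowest spans. The span shape is an assumption based on the fields table above; adjust it to match the JSON the exporter actually emits:

```typescript
// Assumed span shape, inferred from the fields table above.
interface Span {
  name: string;
  durationMs: number;
  tokenCount: number;
  status: string;
  children?: Span[];
}

// Flatten the span tree and return the slowest operations,
// a quick way to see where an ensemble run spends its time.
export function slowestSpans(spans: Span[], topN = 3): Span[] {
  const flat: Span[] = [];
  const walk = (s: Span) => {
    flat.push(s);
    (s.children ?? []).forEach(walk);
  };
  spans.forEach(walk);
  return flat.sort((a, b) => b.durationMs - a.durationMs).slice(0, topN);
}
```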
Traces tell you what happened. Capture mode tells you exactly what happened -- including the full prompts, raw LLM responses, and tool call payloads.
Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // OFF, STANDARD, or FULL
    .build()
    .run();
| Level | What's Captured | Use Case |
|---|---|---|
| OFF | Standard metrics only | Production |
| STANDARD | + Full LLM message history per iteration, memory operations | Staging, initial deployment |
| FULL | + Tool call I/O payloads, raw LLM responses, detailed timing | Development, debugging |
With CaptureMode.STANDARD, each task's execution record includes the full conversation between the framework and the LLM:
```
Task: Research emerging trends
  Iteration 1:
    System prompt: "You are Senior Research Analyst. Your goal is..."
    User message: "Research emerging trends in AI thoroughly..."
    Assistant response: "I'll search for the latest information..."
    Tool call: WebSearch("emerging AI trends 2024")
  Iteration 2:
    System prompt: [same]
    User message: [previous context + tool result]
    Assistant response: "Based on my research, here are the key..."
```
This is invaluable for understanding why an agent behaved a certain way. You can see exactly what prompt it received, what context was injected, and how it reasoned through the task.
CaptureMode.FULL adds the raw payloads for every interaction: tool call I/O, raw LLM responses, and detailed timing.
This is the level you use when something is wrong and you can't figure out why from the trace alone. It's verbose -- expect significantly more data -- but it gives you full replay capability.
Capture mode is a testing power tool. Record a full execution, then write assertions against the captured data:
```java
@Test
void researcherShouldUseWebSearch() {
    EnsembleOutput output = Ensemble.builder()
        .agents(researcher, writer)
        .tasks(researchTask, writeTask)
        .chatLanguageModel(model)
        .captureMode(CaptureMode.FULL)
        .build()
        .run();

    // Verify the researcher used the web search tool
    ExecutionTrace trace = output.getTrace();
    TraceSpan researchSpan = trace.getSpans().get(0);

    boolean usedWebSearch = researchSpan.getChildren().stream()
        .anyMatch(child -> child.getName().contains("WebSearch"));

    assertThat(usedWebSearch).isTrue();
}
```
You can also capture a "golden run" and use it as a reference for regression testing -- comparing future runs against the expected execution pattern.
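Golden-run comparison does not require framework support; once you have the ordered span names from a stored reference trace and from the current run, the diff is plain Java. A minimal sketch (extracting the name lists from ExecutionTrace is assumed to happen elsewhere; the task names are illustrative):

```java
import java.util.List;

public class GoldenRunCheck {
    // Returns the first index where the current run's span names diverge
    // from the golden run, or -1 if the sequences match exactly.
    static int firstDivergence(List<String> golden, List<String> current) {
        int n = Math.min(golden.size(), current.size());
        for (int i = 0; i < n; i++) {
            if (!golden.get(i).equals(current.get(i))) return i;
        }
        return golden.size() == current.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> golden = List.of("research", "analyze", "write");
        System.out.println(firstDivergence(golden, List.of("research", "write"))); // prints 1
        System.out.println(firstDivergence(golden, golden));                       // prints -1
    }
}
```

A non-negative result tells you which task to inspect first in the trace tree.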
For real-time debugging during development, callbacks give you a live stream of execution events:
```java
Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                System.out.printf("%n>>> Starting: %s (agent: %s)%n",
                    e.taskDescription(), e.agentRole());
            case TaskCompleteEvent e ->
                System.out.printf("<<< Completed: %s (%dms, %d tokens)%n",
                    e.taskDescription(), e.durationMs(), e.tokenCount());
            case TaskFailedEvent e ->
                System.err.printf("!!! Failed: %s -- %s%n",
                    e.taskDescription(), e.errorMessage());
            case ToolCallEvent e ->
                System.out.printf("  [tool] %s(%s) -> %s%n",
                    e.toolName(),
                    truncate(e.input(), 50),
                    truncate(e.result(), 100));
            case DelegationStartedEvent e ->
                System.out.printf("  [delegate] %s -> %s%n",
                    e.fromAgent(), e.toAgent());
            case TokenEvent e ->
                // Streaming: print tokens as they arrive
                System.out.print(e.token());
            default -> {}
        }
    })
    .build()
    .run();
```
This gives you a live play-by-play of the ensemble execution in your terminal. You see each task start and complete, each tool call and its result, and each delegation in hierarchical workflows.
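The truncate helper used in the listener above is not part of any framework API; a minimal version you would supply yourself:

```java
public class TruncateUtil {
    // Cap a string at max characters, appending "..." when it was cut.
    static String truncate(String s, int max) {
        if (s == null) return "";
        return s.length() <= max ? s : s.substring(0, max) + "...";
    }

    public static void main(String[] args) {
        System.out.println(truncate("short input", 50)); // prints short input
        System.out.println(truncate("a very long tool payload that would flood the terminal", 20));
    }
}
```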
For persistent debugging output, route events to your logging framework:
```java
.listener(event -> {
    if (event instanceof TaskCompleteEvent e) {
        log.info("Task completed: task={}, agent={}, duration={}ms, tokens={}",
            e.taskDescription(), e.agentRole(),
            e.durationMs(), e.tokenCount());
    }
    if (event instanceof TaskFailedEvent e) {
        log.error("Task failed: task={}, error={}",
            e.taskDescription(), e.errorMessage());
    }
})
```
These flow into your existing log aggregation pipeline (ELK, Splunk, CloudWatch Logs) alongside your application's other logs.
For the most visual debugging experience, AgentEnsemble includes a live browser dashboard:
```java
Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    .devtools(Devtools.enabled())
    .build()
    .run();
```
When the ensemble starts, a browser window opens (or a URL is printed to the console) showing a real-time visualization of the execution.
The live dashboard is a development tool, not a production monitoring dashboard. Use it while actively iterating on agents, when you want to watch tasks, tool calls, and delegations as they happen.
For production monitoring, use the Micrometer metrics integration and your existing Grafana/Prometheus stack.
Here are specific debugging scenarios and how to approach them with the tools above.
Use traces. Look at each task's output in the trace tree. Find the first task whose output is incorrect -- that's where things diverged.
```java
.traceExporter(TraceExporter.json(Path.of("debug/")))
```
Then read the trace JSON, find the task with bad output, and check its input context to see what it received from upstream tasks.
Use capture mode + callbacks. Enable CaptureMode.FULL and add a callback that logs tool calls:
```java
.captureMode(CaptureMode.FULL)
.listener(event -> {
    if (event instanceof ToolCallEvent e) {
        log.warn("Tool call: {} with input: {}",
            e.toolName(), e.input());
    }
})
```
Then check the captured LLM conversation to see why the agent keeps making the same call. Usually it's a prompt issue -- the agent doesn't recognize the tool result as sufficient.
Use capture mode. Enable CaptureMode.FULL and check the raw LLM response:
```java
.captureMode(CaptureMode.FULL)
```
The captured data includes the raw response before parsing. Compare it to your record schema. Common issues include field-naming mismatches (for example, the LLM emits camelCase while the record uses snake_case). The framework handles most of these, but FULL capture mode shows you exactly what's happening.
Use the live dashboard. Enable devtools and look at the timeline view:
```java
.devtools(Devtools.enabled())
```
You'll see whether tasks are actually running in parallel or if there's an unexpected dependency bottleneck. A common cause is a task that declares a dependency via context() when it shouldn't, forcing tasks that could run concurrently to run sequentially.
Use CaptureMode.STANDARD or CaptureMode.FULL. The captured data includes the complete system prompt, user message, and any injected context for each LLM call.
This is the only way to see the actual prompt -- the framework constructs it dynamically from the agent's role/goal/background, the task description, context from previous tasks, and tool results.
A typical debugging setup during development:
```java
EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Full observability stack
    .captureMode(CaptureMode.FULL)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .devtools(Devtools.enabled())
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            log.info("[DONE] {} -- {}ms", e.taskDescription(), e.durationMs());
        }
        if (event instanceof ToolCallEvent e) {
            log.info("[TOOL] {} -> {}", e.toolName(), e.result());
        }
    })
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

// Post-run analysis
EnsembleMetrics metrics = output.getMetrics();
log.info("Total cost: ${}, tokens: {}, duration: {}ms",
    metrics.getTotalCost(), metrics.getTotalTokens(),
    output.getTotalDuration());
```
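The per-1k rates in the cost configuration imply simple linear arithmetic. A sketch of the calculation, assuming the framework multiplies token counts by these rates (consistent with the configured fields, but not confirmed by the source):

```java
public class CostEstimate {
    // cost = inputTokens/1000 * inputRatePer1k + outputTokens/1000 * outputRatePer1k
    static double cost(long inputTokens, long outputTokens,
                       double inputPer1k, double outputPer1k) {
        return (inputTokens / 1000.0) * inputPer1k
             + (outputTokens / 1000.0) * outputPer1k;
    }

    public static void main(String[] args) {
        // 8,000 input + 2,000 output tokens at $0.01 / $0.03 per 1k ~= $0.14
        System.out.println(cost(8_000, 2_000, 0.01, 0.03));
    }
}
```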
For production, dial it back:
```java
EnsembleOutput output = Ensemble.builder()
    .agents(researcher, analyst, writer)
    .tasks(researchTask, analysisTask, writeTask)
    .chatLanguageModel(model)
    // Production observability
    .captureMode(CaptureMode.OFF)
    .traceExporter(TraceExporter.json(Path.of("/var/log/agent-traces/")))
    .meterRegistry(prometheusMeterRegistry)
    .listener(productionEventHandler)
    .costConfiguration(costConfig)
    .build()
    .run();
```
The observability stack scales from "show me everything" during development to "show me what matters" in production. Same API, different configuration.
Multi-agent systems are opaque by nature. An LLM call is a black box -- you send a prompt, you get a response, and the reasoning happens inside the model. The only way to make agent systems debuggable is to capture and structure everything around those black box calls: what went in, what came out, how long it took, and how it fits into the broader execution flow.
That's what traces, capture mode, callbacks, and the live dashboard provide. Not transparency into the model, but transparency around it. And in practice, that's enough to debug anything.
AgentEnsemble is MIT-licensed and available on GitHub.
2026-03-31 22:00:00
Monorepos are fantastic for keeping shared code aligned and deployments coordinated. They are also a nightmare for code review tooling that was not built with them in mind. A single pull request in a monorepo can touch a shared utility package, a React frontend, a Node.js API, and a database migration script - all at once. Most AI review tools treat that diff as one giant blob of text and produce reviews that are either too broad to be useful or too shallow to catch anything meaningful.
CodeRabbit handles this better than most, but it requires deliberate configuration to get the most out of it. This guide covers how to set up CodeRabbit specifically for monorepo workflows - from path-based filters to per-package instructions to managing the reality of 50-file pull requests.
If you are new to CodeRabbit in general, start with how to use CodeRabbit and then come back here for the monorepo-specific setup.
Before diving into configuration, it is worth understanding what makes monorepos specifically hard for AI code review tools.
Volume and noise. A single feature PR in a monorepo might touch 80 files across 6 packages. If your AI reviewer tries to comment on every file equally, you end up with hundreds of comments - most of them low-signal - and developers start ignoring the tool entirely.
Context fragmentation. The AI reviewer sees the diff for packages/api/src/users/service.ts but may not have visibility into how that service is consumed by packages/web/src/hooks/useUser.ts, even if both files changed in the same PR. Cross-package impact analysis requires understanding the full dependency graph, not just the changed lines.
Inconsistent standards across packages. In a well-organized monorepo, different packages have different maturity levels and different conventions. Your internal utilities package might use strict TypeScript with no any types, while your experimental feature package uses a looser style. A one-size-fits-all review profile will either be too strict for some packages or too lenient for others.
Generated and boilerplate code. Monorepos often contain auto-generated files - GraphQL schema types, protobuf outputs, OpenAPI client code, Storybook snapshots. These change frequently and are meaningless to review. Without explicit exclusions, they consume review tokens that should be spent on real code.
Large PR problem. Teams working in monorepos often resist splitting PRs because related changes across packages need to land together. This leads to PRs with 50, 80, or even 150 changed files. Even the best AI reviewers struggle to maintain quality at that scale.
The .coderabbit.yaml file is where all the real monorepo configuration lives. Place it at the root of your repository. Here is a solid starting template for a typical Nx or Turborepo monorepo:
```yaml
# .coderabbit.yaml
language: en-US
tone_instructions: "Be concise. Prioritize actionable feedback over explanatory commentary."

reviews:
  profile: assertive
  request_changes_workflow: false
  high_level_summary: true
  poem: false
  review_status: true
  collapse_walkthrough: false
  auto_review:
    enabled: true
    ignore_title_keywords:
      - "WIP"
      - "chore"
      - "docs"
      - "release"
    drafts: false
  path_filters:
    include:
      - "packages/**"
      - "apps/**"
      - "services/**"
    exclude:
      - "**/dist/**"
      - "**/build/**"
      - "**/*.generated.ts"
      - "**/*.generated.js"
      - "**/__snapshots__/**"
      - "**/graphql/generated/**"
      - "**/proto/generated/**"
      - "**/node_modules/**"
      - "**/*.lock"
      - "**/coverage/**"
      - "**/.turbo/**"
      - "**/.nx/**"
  path_instructions:
    - path: "packages/api/**"
      instructions: |
        This is the core API package. Enforce strict input validation on all controller methods.
        Flag any database query that does not use parameterized inputs.
        Check that all new endpoints have corresponding OpenAPI documentation.
        Reject any use of 'any' type in TypeScript.
    - path: "packages/shared/**"
      instructions: |
        This is a shared utilities package used by all other packages.
        Be extra strict about breaking changes - flag any removed exports or changed function signatures.
        Ensure all exported functions have JSDoc documentation.
        Flag any circular dependency risks.
    - path: "packages/web/**"
      instructions: |
        This is the React frontend package.
        Check that new components have associated Storybook stories.
        Flag any direct DOM manipulation outside of React lifecycle.
        Look for missing key props in list renders and accessibility issues.
    - path: "apps/**"
      instructions: |
        Application-level code. Focus on configuration correctness and environment variable usage.
        Flag any hardcoded credentials, URLs, or environment-specific values.
    - path: "**/*.test.ts"
      instructions: |
        For test files, focus on test coverage completeness and assertion quality.
        Flag tests that only assert truthiness without checking specific values.
        Skip style comments on test files.
    - path: "**/*.spec.ts"
      instructions: "Same as test files - focus on assertion quality and skip style feedback."
```
This configuration does several things at once. The path_filters section narrows the diff to meaningful source code. The path_instructions section gives CodeRabbit a distinct personality and rule set for each part of the codebase. The auto_review settings prevent reviews from firing on PRs that are clearly not ready for feedback.
For a deeper look at all available configuration options, see the CodeRabbit configuration guide.
The path_instructions feature is CodeRabbit's most valuable capability for monorepos. In practice, here is how teams use it effectively.
Security-sensitive packages get stricter rules. If you have a package that handles authentication, payments, or PII, you can instruct CodeRabbit to flag every instance of raw string concatenation in SQL, every missing rate limit check, and every unencrypted data write. This is the kind of context-aware review that would otherwise require a dedicated security reviewer on every PR.
Experimental packages get lighter treatment. A package under active development should not be held to the same standard as production code. Since the review profile applies repo-wide, use path_instructions for those paths to tell CodeRabbit to skip style enforcement and focus only on obvious bugs.
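A hypothetical fragment in the style of the template above (the packages/experimental path is an example, not a prescribed layout):

```yaml
path_instructions:
  - path: "packages/experimental/**"
    instructions: |
      Early-stage experimental code. Skip style and naming feedback.
      Comment only on clear bugs, security issues, and broken logic.
```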
Infrastructure-as-code gets different rules entirely. Terraform files, Helm charts, and Docker configurations have completely different quality signals than application code. You can instruct CodeRabbit to look for things like missing resource limits, open security group rules, and unversioned image tags - concerns that would not appear in instructions for a TypeScript package.
Shared libraries get change-impact focus. For any package imported by multiple other packages, instruct CodeRabbit to flag breaking changes prominently. Something like: "This package is imported by all other packages. Always highlight if a change could break downstream consumers, including removed exports, changed type signatures, or altered function behavior."
Even with good PR hygiene, monorepo PRs get large. A migration, a shared utility refactor, or a cross-cutting configuration change can touch dozens of packages in one go. Here is how to manage that with CodeRabbit.
Use collapse_walkthrough: false selectively. The PR walkthrough is CodeRabbit's high-level summary of what changed. For large PRs, this is often more valuable than individual file comments. Keeping it uncollapsed helps reviewers get oriented before diving into specifics.
Accept that large PRs get lighter reviews. CodeRabbit's underlying LLM has a context window constraint. For very large diffs, the tool makes intelligent tradeoffs - it provides deeper analysis on files that show more complex changes and lighter summaries on files with small, mechanical changes. This is the right behavior, but it means you should not expect the same comment density on a 100-file PR as on a 10-file PR.
Use ignore keywords aggressively. If your monorepo has release automation that opens a PR to bump versions across all packages, that PR will have dozens of package.json changes. Add release and version-bump to ignore_title_keywords so CodeRabbit skips those entirely. Same for PRs titled chore: update dependencies - lockfile changes are not worth AI review cycles.
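Extending the earlier template, the release-automation keywords might look like this (the keyword strings are examples; match them to your actual PR title conventions):

```yaml
auto_review:
  ignore_title_keywords:
    - "release"
    - "version-bump"
    - "chore: update dependencies"
```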
Split PRs where you can. This is the unsexy answer, but it is the right one. If a PR changes a shared package and five consumer packages, consider splitting it into a "shared package change" PR and a "consumer update" PR. The first PR can be reviewed in depth; the second is mechanical and can be approved quickly. CodeRabbit can still review both, and the quality of feedback on each will be higher.
Leverage draft PR status. Set auto_review.drafts: false in your config. This prevents CodeRabbit from reviewing draft PRs, saving review capacity for when a PR is actually ready. Many developers use draft status while assembling a large cross-package change, and triggering a review on a half-complete diff is wasteful.
CodeRabbit is build-tool agnostic. It does not integrate with your Nx project graph, Turborepo pipeline definitions, or Lerna workspace configuration directly. What it sees is the Git diff.
This means your .coderabbit.yaml path structure needs to mirror your workspace structure manually. If your Nx monorepo has apps in apps/ and libraries in libs/, your path_filters should reflect that:
```yaml
path_filters:
  include:
    - "apps/**"
    - "libs/**"
  exclude:
    - "libs/generated/**"
    - "**/node_modules/**"
```
For Turborepo setups, the structure is typically apps/ for deployable applications and packages/ for shared code. The same approach applies.
For Lerna monorepos, which often use packages/ at the root, the configuration is the same as the Turborepo example above.
One Nx-specific consideration: Nx often generates boilerplate when you run nx generate. These generated files - default configs, barrel exports, test setups - are largely not worth reviewing. Add patterns like **/index.ts (for barrel files) and **/*.module.ts (for Angular module boilerplate) to your exclude list if your team auto-generates these.
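Those Nx boilerplate patterns can be added to the exclude list from earlier (apply them only if your team genuinely auto-generates these files, since the globs also match hand-written ones):

```yaml
path_filters:
  exclude:
    - "**/index.ts"      # auto-generated barrel exports
    - "**/*.module.ts"   # Angular module boilerplate
```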
The practical difference between these tools for CodeRabbit configuration is minimal. The path filter approach works the same way regardless of which build orchestrator you use.
CodeRabbit does not publish exact token limits for its review engine, but based on community reports and practical usage, here is what you can expect.
For PRs under roughly 30 files with moderate diff sizes, CodeRabbit operates at full capacity - detailed per-hunk comments, security analysis, style enforcement, and actionable suggestions.
For PRs in the 30 to 80 file range, CodeRabbit maintains review quality on the most complex files but may produce lighter-touch summaries for files with only small changes. You will still get a comprehensive walkthrough and comments on the important stuff.
For PRs over 80 files, especially those with large diffs per file, expect the per-file depth to decrease. The walkthrough remains useful, but individual file comments become more selective. This is not a bug - it is the tool making a rational tradeoff between breadth and depth.
The configuration option that matters most here is being selective with path_filters.include. If you include packages/** and your monorepo has 40 packages, a cross-cutting change will try to review all 40. If you know that only packages/api and packages/shared contain code worth deep review, say so explicitly:
```yaml
path_filters:
  include:
    - "packages/api/**"
    - "packages/shared/**"
    - "packages/auth/**"
```
This is counterintuitive - you might feel like you are missing coverage. But getting high-quality reviews on your most critical packages is more valuable than surface-level coverage of everything.
If you are evaluating AI review tools specifically for monorepo use cases, CodeAnt AI is worth considering alongside CodeRabbit. At $24 to $40 per user per month (Basic to Premium), CodeAnt AI bundles AI code review with SAST scanning, secret detection, infrastructure-as-code security analysis, and DORA metrics in a single platform.
For monorepos that house services with different security profiles - say, a public-facing API next to an internal data pipeline - CodeAnt AI's ability to apply different security rule sets to different services can be valuable. Its SAST coverage applies across the monorepo, which can catch vulnerabilities that a pure review tool might miss.
CodeRabbit's advantage for monorepos is the granularity of path_instructions. The ability to write natural-language review instructions per path, and have those honored consistently, is more flexible than CodeAnt AI's configuration model. If per-package review quality is your top priority, CodeRabbit wins on that dimension.
For a broader comparison of your options, see CodeRabbit alternatives.
Here is what a well-configured CodeRabbit monorepo workflow looks like in practice for a team of 8 to 20 developers.
Step 1: Baseline configuration. Start with the YAML template above, adapt the path structure to your repo layout, and deploy it. Do not try to write path_instructions for every package on day one.
Step 2: Tune ignore patterns. After a week, look at the CodeRabbit comments your team marked as unhelpful or dismissed repeatedly. Most of these will be on generated files or boilerplate. Add those patterns to your exclude list.
Step 3: Add path instructions incrementally. Start with your most critical packages - the ones where a bug has the highest impact. Write focused instructions for those. Expand to other packages over the next few weeks as you learn what CodeRabbit catches and misses in your specific codebase.
Step 4: Set up PR conventions. Establish a convention that PRs touching more than 3 packages must use a descriptive title so that ignore_title_keywords can catch the mechanical ones. Add a PR template that reminds developers to split large changes where possible.
Step 5: Review CodeRabbit's performance monthly. Look at which packages generate the most review comments, and whether those comments lead to code changes. Adjust your path_instructions to reduce noise where the signal-to-noise ratio is poor.
For more advanced configuration patterns, the CodeRabbit configuration deep dive covers options not discussed here, including chat commands, Jira integration, and review scheduling.
Here is a concise reference for the settings that matter most in a monorepo context:
| Setting | What it does | Monorepo use |
|---|---|---|
| `path_filters.include` | Limits review to matched paths | Scope to real source packages |
| `path_filters.exclude` | Skips matched paths entirely | Exclude generated, dist, lock files |
| `path_instructions` | Per-path review instructions | Different rules per package |
| `auto_review.ignore_title_keywords` | Skips PRs matching keywords | Skip release and chore PRs |
| `auto_review.drafts: false` | Skips draft PRs | Avoid reviewing WIP large PRs |
| `reviews.profile` | Sets review strictness globally | Use "assertive" for critical packages |
| `reviews.high_level_summary` | Enables PR walkthrough | Essential for large PRs |
| `reviews.collapse_walkthrough` | Controls walkthrough display | Set false for large PRs |
If you are reading this with a monorepo and no CodeRabbit configuration, the fastest path to value is:
1. Create a .coderabbit.yaml at your repo root with, at minimum, a path_filters.exclude list for generated and build output directories.
2. Add a path_instructions entry for your most critical package.

From there, the configuration is iterative. CodeRabbit's learning system also picks up on patterns over time - comments your team consistently dismisses will be weighted less in future reviews, and approaches your team consistently approves will be reinforced.
For a comprehensive look at what CodeRabbit can do beyond monorepos, the full CodeRabbit review covers pricing, feature comparisons, and real-world performance benchmarks in detail.
Monorepo AI review is not a solved problem, but CodeRabbit's path-based configuration system gets you meaningfully closer than the default behavior of any AI review tool. The investment in .coderabbit.yaml configuration pays off within the first month for most teams.
Yes. CodeRabbit supports monorepos natively through .coderabbit.yaml path filters, per-path override rules, and ignore patterns. You can scope reviews to specific packages, exclude generated files, and set different review instructions for frontend versus backend packages - all within a single configuration file.
Use the path_filters section in .coderabbit.yaml. You can define include and exclude glob patterns, for example include: ['packages/api/**', 'packages/web/**'] and exclude: ['**/generated/**', '**/__snapshots__/**']. CodeRabbit will then limit its review scope to only the matched paths.
Yes, but with caveats. On the Pro plan, CodeRabbit has no enforced file count limit, but the underlying LLM context window constrains how deeply it can analyze very large diffs. For PRs with 50 or more files, CodeRabbit prioritizes the most critical changes and may produce lighter-touch summaries for less critical files. Splitting large PRs is still the recommended best practice.
CodeRabbit does not publicly publish its exact token limit, but in practice, reviews of diffs exceeding roughly 100,000 tokens may receive less granular per-file comments. The tool intelligently summarizes and batches large diffs, but for monorepos with massive PRs, you will get the best results by keeping individual PRs under 30-40 changed files.
Use the path_instructions array inside .coderabbit.yaml. Each entry accepts a path glob and a freeform instructions string. For example, you can instruct CodeRabbit to enforce strict API contract validation only for packages/api/** while applying different style rules to packages/web/**. This per-path instruction system is one of CodeRabbit's most powerful monorepo features.
Yes. CodeRabbit is build-tool agnostic - it analyzes Git diffs and does not require integration with Nx, Turborepo, or Lerna directly. However, you can leverage your build tool's project graph by defining path filters in .coderabbit.yaml that mirror your workspace structure, ensuring reviews stay aligned with package boundaries.
Add them to the path_filters.exclude list in .coderabbit.yaml. Common patterns include '**/dist/**', '**/build/**', '**/*.generated.ts', '**/graphql/schema.ts', and '**/__snapshots__/**'. You can also exclude entire packages that are vendor code or auto-generated client SDKs.
Partially. CodeRabbit analyzes the diff it receives and has some ability to follow imports and references within the changed files. However, it does not have full visibility into the runtime dependency graph of your monorepo. For cross-package impact analysis, you still need human reviewers or a dedicated dependency analysis tool.
The most impactful settings are path_filters (to scope the diff), path_instructions (for per-package rules), auto_review.ignore_title_keywords (to skip chore/docs PRs automatically), and reviews.profile (set to 'assertive' for critical packages, 'chill' for documentation packages). Combining these reduces noise and keeps reviews focused.
CodeRabbit offers more granular path-based configuration for monorepos, making it a stronger choice if fine-grained per-package review rules are your priority. CodeAnt AI ($24-40/user/month) bundles SAST and security scanning alongside AI review, which can be valuable if your monorepo houses services with different security profiles. The right choice depends on whether you need pure review quality or broader security coverage.
Use path_filters.exclude to skip boilerplate-heavy directories, and use path_instructions to add a note like 'skip detailed style feedback on files matching this pattern - these are generated from templates'. You can also set the review profile to 'chill' for packages that contain mostly scaffolding code.
Path filter configuration via .coderabbit.yaml is available on the free tier, but custom path_instructions (per-path review rules) require a Pro plan at $24/user/month. Free-tier users can still exclude paths and limit review scope, which alone provides a significant improvement for large monorepo PRs.
Originally published at aicodereview.cc