The Practical Developer — a constructive and inclusive social network for software developers.

Bridging Object-Oriented and Functional Thinking in Modern C++

2026-04-29 15:15:45

How dependencies make C++ systems hard to test and evolve—and why functional thinking changes it

Fighting complexity is crucial if you want to succeed in software development. In many real-world C++ projects, however, systems tend to evolve in the opposite direction. Components become tightly coupled, so changes ripple through large parts of the system. State is scattered, reducing transparency about how it evolves over time. Behavior emerges from many interacting parts, making it hard to reason about individual pieces in isolation.

This also shows up in testing: a significant amount of boilerplate and indirection is needed just to enable it. Establishing specific system states requires substantial setup, and maintaining tests becomes costly, as even small production changes force updates across large parts of the test code.

The result is a system that is hard to get right—and once it works, it feels rigid and difficult to evolve.

Where OOP Leads to Complexity

The underlying problems unfold nicely when jumping into the details from the testing point of view. So, let's focus on unit-testable code, that is, code structured in smaller parts that can be executed in isolation. To get there, we apply the “separation of concerns” principle, which provides the units for testing. And to actually run modules independently, dependency injection comes into play: if one module depends on another, it is passed in, normally as a constructor argument. In a unit test where a single module should run in isolation, all dependent modules need to be mocked. So, instead of real dependencies, the test passes its mocks as constructor arguments.

This reveals the first issue: No testing without mocking. Technically that mocking approach works well—unit testing is enabled. But it comes at a high price. The mocks need to be written and maintained, which increases test-specific code and in turn leads to higher test complexity. And we should not neglect this: the test is only helpful if all the mocks are implemented correctly.

Mocks must implement the same interface as the real components they replace, which leads to another problem: Interfaces create tight coupling between alternative implementations. Often these interfaces contain several interdependent function signatures with non-trivial pre- and post-conditions. Any change affects all implementations and thus also all mocks, which increases maintenance effort. With dependency injection and many mocks, this interface-level coupling becomes especially visible and expensive. So, we often trade testability for flexibility here, which can be a poor deal.

But the challenges don’t stop at interfaces. Once state comes into play, things get even more complicated: State encapsulated within a class is hard to modify by a test. In traditional OOP, private mutable state is fully hidden, so tests cannot set it up directly. A test that needs the object in a specific internal configuration must drive the state there indirectly through the public interface, which often requires multiple method calls, complex sequences, and mocks. This indirect setup adds complexity and makes tests more brittle.

Actually, state handling in a traditional OOP manner has another problematic dimension: Generally, state is scattered across the object graph. This introduces state dependencies that are hard to see. The more stateful objects exist, the harder it gets to reason about how state evolves in the system. Of course, this compounds the state-related test complexity mentioned earlier.

Interestingly, testing just makes these problems visible—but they exist in the design itself. For example, the tight coupling of implementations sharing an interface can be described more generally as: Inheritance creates tight coupling within the hierarchy. So, once a class hierarchy is established, the base class interface becomes almost impossible to change without impacting every derived class. This gets worse the deeper the hierarchy is. Even small adjustments propagate through, making class hierarchies inflexible and expensive to modify.

Let's step back and draw conclusions. All these problems create a lot of complexity, which causes the pain described at the beginning. But there is more to understand here: The root cause is dependency—complexity emerges as different kinds of dependencies accumulate:

  • components wired together via injection can only be executed in isolation by introducing mocks
  • the broader an interface, the more complex the dependency between components sharing it
  • inheritance causes dependencies between classes within hierarchies
  • hidden, distributed, and evolving state creates implicit dependencies across the system

Every issue we looked at comes down to this: dependencies force parts of the system to change together, to be set up together, and to be understood together. As they reinforce each other, complexity grows increasingly fast.

What Functional Thinking Unlocks

So, I started questioning the design ideas I had been following for years and searched for different ways of structuring my code to better manage these dependencies. The blog post Mocking Is a Code Smell helped me see this more clearly. It also pushed me to explore functional ideas that turned out to be particularly useful. Put simply, functional programming offers techniques that can complement a traditional C++ toolbox.

These techniques work because they change how we design systems—and with that, which dependencies emerge. To get an idea of what changes, consider pure functions as building blocks for your logic:

  • Pure functions receive all dependencies explicitly as input, instead of relying on injected objects. This eliminates implicit dependencies and significantly reduces the need for mocking.
  • In a world of pure functions, there is no hidden state, so dependencies on state are always explicit and visible.
  • Function interfaces are typically narrower than class interfaces, which keeps dependencies simpler.
  • There is no inheritance hierarchy, removing an entire class of dependencies.

In conclusion, when you compose your business logic out of pure functions, dependencies become explicit, fewer, and simpler. This directly mitigates complexity.
The design becomes clearer, and as a result, reasoning, testing, and maintenance improve.

Just to be explicit, I am not saying functional programming is going to solve all problems, nor that it’s applicable in C++ without limits. I am saying there are aspects that are highly valuable when designing C++ systems. So, I advocate for complementing OOP with FP where appropriate.

If This Resonates

Adopting a different way of structuring code takes effort, and how much depends on your learning path. If you’re coming from an OOP-heavy, C++-style background like I did, I might speak your language well enough to further clarify these techniques and design choices.

My original motivation wasn’t to write about system design or functional programming. I simply wanted to figure out how to apply these ideas effectively in real-world C++ code. After exploring this for a while and finding techniques that work, it feels natural to share and discuss them.

The funkyposts blog is meant to build a bridge between traditional C++ and functional programming by illustrating practical FP patterns in modern C++. The goal is to provide concrete ways to improve how systems are structured—while staying grounded in real-world constraints. I also welcome feedback and different perspectives to further refine these ideas.

Dive deeper

The next step is to leverage these insights in practice—improving existing designs without rewriting everything. The subsequent posts in this series show how.

Part of the funkyposts blog — bridging object-oriented and functional thinking in C++.
Created with AI assistance for brainstorming and improving formulation. Original and canonical source: https://github.com/mahush/funkyposts (v06)

When Your AI Agent Crashes at 2 AM, Google Just Gave You a Way to Fix It

2026-04-29 15:15:28

This is a submission for the Google Cloud NEXT Writing Challenge

TL;DR

AI agents don’t just fail like traditional software. They fail because of how they reason.

At Google Cloud NEXT '26, Google introduced Agent Observability (to see what your agent was thinking) and Gemini Cloud Assist (to diagnose and fix issues directly in your code).

Together, they make debugging AI agents in production faster, clearer, and far less painful.

Estimated read time: 8 minutes

The Reality of AI Agents in Production

It’s 2 AM. Your AI agent just crashed in production.

You've spent weeks building it. It works great on your laptop. You deploy it. Customers start using it. And then, one random Tuesday, it just... dies. No clear error. No "you forgot a semicolon" message. Just a broken agent, confused logs, and you staring at your screen wondering what on earth it was thinking.

The problem isn’t just failure. It’s understanding why the agent failed.

This is the part nobody really talks about when we get excited about building AI agents. Building them is the fun part. Running them, keeping them alive, understanding why they fail, and fixing them fast, that is where things get genuinely hard.

At Google Cloud NEXT '26, Megan O'Keefe put it really well. The real challenge of putting agents into production isn't just scaling your infrastructure. It's "managing the reasoning, the tool calls, and all the places in the whole system where something can go wrong."

And Google showed two tools built exactly for this moment: Agent Observability and Gemini Cloud Assist.

First, let's understand what "debugging an AI agent" even means

With a traditional application, debugging is kind of like fixing a broken pipe. You find the leak, you patch it, you're done. The pipe either works or it doesn't. There's no in-between.

Debugging an AI agent is completely different. It's less like fixing a pipe and more like being a therapist for a robot. The agent isn't just crashing because of a typo or a missing database connection. It's crashing, or misbehaving, because of how it reasoned. It made a decision. That decision was wrong. And you need to understand why it made that decision so you can help it not do it again.

This is where AI systems are fundamentally different from traditional software.

That's a whole new discipline. And without the right tools, it's like trying to find a needle in a haystack while blindfolded.

What is Agent Observability?

Think about a flight data recorder, the black box on an airplane. After something goes wrong, investigators pull that box and replay everything: every reading, every signal, every action the pilots took. They don't have to guess. They have a record.

Agent Observability is that black box for your AI agent.

When a normal app has a problem, you check if a server crashed or if a response was slow. That's enough. But when an AI agent has a problem, you need to know something much deeper: what was it thinking? What tools did it call? What information did it look at? Where exactly did its reasoning go off track?

Agent Observability records all of this. It uses OTel-compliant telemetry, the same open observability standard the broader software industry already relies on, to give you a visual trace of your agent's full execution path. Every step, in order, clearly laid out.

This matters because AI agents can fail in ways that are genuinely strange. They can get stuck in reasoning loops. Imagine someone pacing back and forth trying to solve a problem, taking the same wrong step over and over because they can't see that it's wrong. Or they can crash because they tried to hold too much information in memory at once. Both of these failures are invisible without observability. With it, you can actually see what happened.

What is Gemini Cloud Assist?

Now, once you see what happened, you still have to fix it. And this is where Gemini Cloud Assist comes in.

If Agent Observability is the black box, Cloud Assist is the investigator who reads it for you, connects it to everything else, and tells you exactly what to do.

Here's the old way of doing things: something breaks in production. You get an alert. You open logs. You stare at thousands of lines of dense, intimidating text. You copy chunks of it into a chat window somewhere, try to make sense of it, go back to your code, try to figure out where the problem lives, and maybe fix the wrong thing first. It's exhausting and slow.

Cloud Assist changes this. It doesn't just summarize the logs. It reads them, identifies the exact error, and then connects directly to your source code in your IDE (your code editor) through something called the Model Context Protocol (MCP). It reads both the production logs and your actual code at the same time. And then it suggests a specific, concrete fix.

Not a vague "maybe try this." An actual code change.

The demo: a marathon simulation that broke mid-race

To show how this all works together, Google ran a live simulation at the keynote (Google Cloud Next '26 Developer Keynote). Imagine a Las Vegas marathon. An AI agent is running the simulation of race logistics in real time. And mid-demo, the "Simulator Agent" crashes and starts causing high latency.

Here's how the debugging played out:

Megan got an alert in her Gmail. She opened the Cloud Monitoring console and looked at the trace view, the visual record of what the agent had done. She could see it had successfully called a few tools, and then it just died. Unexpectedly. No obvious reason in the trace itself.

Instead of scrolling through a massive wall of error text, she clicked one button to start a Cloud Assist investigation.

Cloud Assist found a 400 request error. The agent had tried to talk to the Gemini API and got rejected. But why?

Megan opened her code editor. Cloud Assist analyzed the source code (a file called agent.py) and figured out what happened: the agent had exceeded the 1 million context token limit.

What even is a token limit?

This is worth slowing down on, because it's one of those concepts that sounds technical but is actually very intuitive once you see it.

An AI's "context window" is basically its short-term memory. "Tokens" are the pieces of data it's holding in that memory, roughly speaking, the words and information it's actively working with.

Now imagine you're a student trying to memorize an encyclopedia in one sitting. You keep reading and reading, adding more and more to your working memory, and at some point your brain just gives up. It hits a limit. You can't hold any more.

That's exactly what happened to this agent. It had been running for a while, accumulating information, and it never stopped to summarize what it had learned. Its memory filled up. It hit the token limit. It crashed.

This is a real problem in production AI systems, and it's becoming one of the new bottlenecks in software development. "Token scale," managing how much information an agent holds and when it should compress its memory, is something developers now have to think about the same way they used to think about RAM or database size.

How Cloud Assist fixed it

This is the part that genuinely impressed me.

Cloud Assist didn't just say "your token limit was exceeded, good luck." It looked at the code, understood the architecture, and suggested a specific fix: add a token_threshold parameter to a feature called Event Compaction.

What Event Compaction does is force the agent to summarize its memory more frequently, before it gets dangerously close to the limit. By adding a threshold, you're essentially telling the agent: "don't wait until your memory is full. Start summarizing earlier and keep things manageable."

Megan approved the change, committed it, and the system automatically deployed the fixed agent.

The whole process, from alert to deployed fix, was remarkably fast. And more importantly, the fix was accurate. It wasn't a guess. It was based on reading the actual production error and the actual source code together.

Why this matters for every developer building with AI

Here's my honest take on all of this.

We're entering a genuinely new era of software development. A lot of us are building agents and excited about what they can do. But we haven't fully reckoned with the fact that agents are still just software. They still break. They still crash. They still misbehave in production.

They just break in completely new ways.

A traditional bug is usually deterministic. The same input gives you the same broken output every time. An agent bug can be non-deterministic. It might only happen under certain conditions, after a certain amount of time, or when the agent has accumulated a certain kind of context. That's much harder to reproduce and debug without proper tooling.

The moment you move an AI agent from a local experiment to a real environment where real users depend on it, you need observability. Not eventually. Immediately.

And tools like these fill a gap that genuinely needed filling. The IDE integration especially, being able to see the production error and the source code in the same place, at the same time, with suggested fixes, that's not just convenient. It's a fundamentally better workflow.

One thing to keep in mind

I want to be real with you about something, because I think it's worth saying.

We're now in a world where AI is diagnosing and writing code to fix other AI. That's remarkable. But it also means you should never just approve a suggested fix without understanding what it does.

Cloud Assist suggested the token_threshold change because it read the code and understood the architecture. But you, as the developer, need to review that change with your own understanding too. An AI can misread context. It can suggest a fix that solves the symptom but misses the root cause. Or worse, it could push a fix that quietly breaks something else.

Human-in-the-loop isn't just a nice phrase here. In production systems, it's genuinely important. Approve changes you understand. Don't just click accept because the AI was confident.

That said, the fact that we have these tools at all is genuinely exciting. Used thoughtfully, they make debugging AI systems faster and less painful than it's ever been.

The real shift happening right now

The conversation in AI development is moving. A year ago, everyone was talking about building agents. Now the real challenge is running them safely, understanding them when they fail, and fixing them quickly.

Agent Observability and Gemini Cloud Assist are Google's answer to that challenge. And based on what was shown at NEXT '26, it's a thoughtful one.

If you're building AI agents, even small ones or experimental ones, start thinking about observability now. Not when something breaks. Now.

Because when an AI agent fails at 2 AM, you don’t just need logs. You need answers.


Cloudflare R2 vs S3: Object Storage for VPS Hosts

2026-04-29 15:11:03

If you’re running apps on a VPS, Cloudflare R2 vs S3 is no longer an academic debate—it directly affects your bandwidth bill, latency, and how painful “oops, we egressed 20TB” feels at the end of the month.

What R2 and S3 are (and what VPS builders care about)

Amazon S3 is the default object storage API for the internet: durable, feature-rich, and supported by basically every tool. Cloudflare R2 is a newer S3-compatible object store designed around one big idea: no egress fees (at least from R2 itself). In VPS hosting workflows, object storage usually backs:

  • Static assets and media uploads
  • Backups/snapshots (app-level, not hypervisor-level)
  • Log archives and data exports
  • CDN origins for websites

If you host on providers like DigitalOcean or Hetzner, you’re typically trying to keep costs predictable while still getting global performance. Object storage is the easiest place to accidentally blow that up.

Pricing and egress: the real differentiator

My opinion: for most VPS-hosted web apps, the object storage bill is dominated by bandwidth, not capacity.

S3 economics:

  • Storage is competitively priced, but egress is where you pay.
  • Requests, lifecycle transitions, replication, and “nice-to-have” features can add line items.
  • If you put S3 behind CloudFront, you may optimize delivery but you’re still in AWS billing land.

R2 economics:

  • The headline is $0 egress from R2.
  • You still pay for storage and operations.
  • Data transfer from Cloudflare to the broader internet is typically handled inside Cloudflare’s network, which is exactly what you want when you’re serving public assets.

For a VPS hosting stack, that changes architecture decisions:

  • With S3, you often end up caching aggressively just to avoid bandwidth cost.
  • With R2, it’s easier to use object storage as an origin for a CDN without fear.

Caveat: “no egress” doesn’t mean “no network cost anywhere.” If your VPS pulls lots of data from R2, your VPS provider may still charge for outbound bandwidth, and some providers charge for inbound at scale. But R2 removes the biggest surprise bill most teams hit: object store egress.

Performance and latency: where your users notice

S3 performance is excellent, but it’s region-centric. You pick a region, and requests travel there unless you layer caching/CDN on top.

R2 is designed to live close to Cloudflare’s edge network, and it plays especially well with Cloudflare’s caching. In practice, for typical “VPS + global visitors” scenarios:

  • S3: great if your app is mostly in one region and you can keep traffic local (e.g., app servers in us-east-1, S3 in us-east-1).
  • R2: great if you want global read performance with minimal setup and you’re already using Cloudflare in front.

If you’re hosting on Hetzner (popular in Europe) but have visitors worldwide, R2 + Cloudflare caching can reduce perceived latency without deploying multi-region infrastructure. If your workload is internal (backups, data pipelines), S3’s regional model may be totally fine.

Compatibility and features: S3 is the baseline, R2 is “enough”

S3 has decades of features: event notifications, inventory, object lock/governance modes, deep lifecycle controls, multiple storage classes, and a huge ecosystem.

R2’s pitch is different: be S3-compatible enough that your tools work, and optimize for cost + edge delivery.

Here’s the practical checklist for VPS hosting:

  • API compatibility: R2 supports the S3 API, but not every obscure S3 feature. Test before committing.
  • Tooling: Most backup tools and SDKs that speak S3 will work with an endpoint + keys.
  • Lock-in: S3 is “lock-in by gravity” (everything supports it). R2 is “lock-in by workflow” if you lean heavily into Cloudflare’s edge features.

My rule: if you need advanced compliance controls or niche S3 capabilities, S3 wins. If you just need object storage that won’t punish you for serving files, R2 is often the better default.

Actionable example: using R2 with an S3 client on a VPS

Below is a minimal AWS CLI-style setup many VPS users already know. You can use it on a box hosted at DigitalOcean or Hetzner to push backups to R2.

# 1) Configure a named profile for R2
aws configure set aws_access_key_id "$R2_ACCESS_KEY" --profile r2
aws configure set aws_secret_access_key "$R2_SECRET_KEY" --profile r2
aws configure set region auto --profile r2

# 2) Upload a backup (R2 is S3-compatible, so use the S3 commands)
aws s3 cp ./backup.tar.gz s3://my-bucket/backups/backup.tar.gz \
  --endpoint-url https://<accountid>.r2.cloudflarestorage.com \
  --profile r2

# 3) (Optional) List objects
aws s3 ls s3://my-bucket/backups/ \
  --endpoint-url https://<accountid>.r2.cloudflarestorage.com \
  --profile r2

This is the kind of migration that takes minutes, not days—provided your app doesn’t rely on S3-only features.

So, which one should you choose for VPS hosting?

If you’re building typical VPS-hosted sites/apps (WordPress, Laravel, Node, Django) and you serve lots of public assets, I’m bullish on R2 as the default: egress is the silent killer, and Cloudflare built R2 to remove that pain.

Choose S3 when:

  • You need the full S3 feature set or strict compliance options
  • Your compute is already in AWS and data locality matters
  • You’re doing heavy internal workflows that don’t benefit from edge delivery

Choose R2 when:

  • You expect high public download/streaming volume
  • You want predictable costs and simple CDN-friendly architecture
  • You’re already using Cloudflare for DNS/WAF/CDN

In a VPS hosting context, a common setup is: VPS on Hetzner or DigitalOcean, object storage on R2 for user uploads and backups, and keep the app stateless. If you later outgrow your VPS, that separation makes migration easier—without forcing you into a hard commitment today.

Some links in this article are affiliate links. We may earn a commission at no extra cost to you if you make a purchase through them.

My wife convinced me to finally ship the app we’ve been using for years

2026-04-29 15:09:55

My wife and I have been using a shopping list app I built for ourselves for a few years now.

Not because I'm a genius product designer. Not because I had some grand vision. Honestly — because every other app we tried did too much, and we just wanted a list.

The actual problem

You know the scenario. One of you is leaving work, the other is at home. Someone needs to stop by the store. The old workflow: a phone call, or a WhatsApp message, or a photo of a handwritten note on the fridge. Half the time you'd forget something anyway.

We tried a bunch of apps. They all had categories, tags, due dates, weekly reviews, collaboration features, premium tiers, onboarding flows. We didn't want any of that. We wanted to open the app, see the list, buy the thing, swipe it away. That's it.

So I built exactly that. No categories. No folders. No "are you sure you want to delete this?" Just a list.

Tollere empty state — clean, nothing in the way

What Tollere does

One screen. Your items. Swipe left to mark something done or ping your partner that something's urgent. A home screen widget so you can see the list without even opening the app.

That's genuinely the whole thing.

Tollere list in action — urgent item rises to the top

I know there are apps that do far more — complex grocery managers with aisle sorting, recipe imports, barcode scanning. Maybe those are better for some people. For us they were just noise. Tollere does one thing on purpose, and it will never do anything else. No due dates, no tags, no weekly review. Ever.

Tollere home screen widget — see your list without opening the app

Why it took years to ship

My wife has been telling me to put this on the App Store for a while. I kept saying no.

Part of it was honest doubt — who needs another list app? The market is full of them. Why would anyone pick this over Reminders, or AnyList, or a dozen other options?

Part of it was just inertia. It worked for us, and that felt like enough.

Eventually she wore me down. And once I started thinking about it seriously, I realized the doubt was actually the point. Because it does less, it might be exactly right for people who, like us, just want the thing to get out of the way.

So I properly rebuilt it for the App Store — cleaned up the UI, added a home screen widget, added a Pro tier for shared lists and push notifications (the "hey, we're out of milk, can you grab some?" feature). And now it's in TestFlight.

It's in beta — try it if you want

The app is live on TestFlight right now. If you want to give it a go and tell me what you think, I'd genuinely appreciate it.

If you do install it, here's what's worth testing:

Install

  1. Get TestFlight from the App Store (if you don't have it)
  2. Open: https://testflight.apple.com/join/pnCbFmnc
  3. Tap "Start Testing" → Install

Test the Pro purchase

  1. Open Tollere → add a few items
  2. Tap the gear icon (top right) → Settings
  3. Tap the "Tollere Pro" card and complete the purchase — you won't be charged, TestFlight uses a sandbox environment

Test list sharing

  1. In Settings → "Share list" → "Generate link"
  2. Send the link to someone else with an iPhone
  3. They install via the same TestFlight link above and open your link
  4. You should be in sync — whatever one of you adds, the other sees

Test notifications

  1. Swipe left on any item → tap "Notify"
  2. Anyone sharing your list should get a push notification instantly

Any feedback — bugs, anything that feels off, anything that's confusing — drop it in the comments or reach out directly. That's exactly what this stage is for.

TestFlight: https://testflight.apple.com/join/pnCbFmnc
Landing page: https://tollere.app

I'm not looking for validation — if it's not for you, that's fine. But if the "one thing, no clutter" pitch resonates with how you actually shop, give it a try and let me know how it holds up in real use.

Built with Swift/SwiftUI. Landing page is Svelte + Vite. The app is free, with a one-time €6.99 Pro upgrade for shared lists and notifications.

Architecture Teardown: How Meta Trains LLMs for Code Generation on 100k GPU Clusters

2026-04-29 15:07:53

In Q3 2024, Meta trained a 70B parameter code-specialized LLM on 100,000 Nvidia H100 GPUs, achieving 214 TFLOPS per GPU and 92% cluster utilization – a 3x improvement over their 2023 16k A100 cluster runs, with total training cost of $17.4M for 21 days of continuous operation.


Key Insights

  • Meta’s custom collective communication library (C3) achieves 98.7% bandwidth utilization on 100k H100 nodes, vs 89% for NCCL 2.18.3
  • PyTorch 2.3.0 with FSDP2 and custom activation checkpointing reduces memory footprint by 41% vs vanilla FSDP for 70B code models
  • Total training cost for 70B model on 100k GPUs is $17.4M, 22% cheaper than equivalent 16k A100 cluster runs when accounting for H100’s 3.2x throughput
  • By 2026, Meta plans to deploy 250k GPU clusters with custom silicon, reducing code LLM training time to 7 days for 100B parameter models

Why 100k GPUs? The Economics of Code LLM Training

For context, a 70B parameter LLM trained on 1.2PB of code data requires ~5.88e23 FLOPs of compute, per the Chinchilla scaling laws. A single Nvidia H100 GPU delivers ~1979 TFLOPS of BF16 compute, but real-world utilization is ~214 TFLOPS (as Meta achieved) due to communication overhead, memory bandwidth limits, and data loading latency. To finish training in 21 days (the maximum acceptable time for Meta’s quarterly release cycle), you can work the required cluster size out from those two numbers.

Meta’s 2024 analysis found that training a 70B code LLM to state-of-the-art performance requires 1.4 trillion tokens of training data. Using the standard 6 FLOPs per token per parameter rule, this totals 1.4T * 70B * 6 = 5.88e23 FLOPs. A single H100 GPU delivers ~2000 TFLOPS peak BF16 performance, but real-world training utilization (MFU) for large clusters is ~11% (214 TFLOPS / 1979 TFLOPS) due to distributed communication overhead. Over 21 days (1.81e6 seconds), a single H100 delivers 214e12 * 1.81e6 ≈ 3.88e20 FLOPs, so the raw arithmetic puts the minimum at 5.88e23 / 3.88e20 ≈ 1.5k GPUs. Meta provisions far more than that minimum. MFU drops as cluster size increases because communication overhead grows, and the extra GPUs buy headroom: 5% of GPUs are held in reserve for failures, and 10% are used for asynchronous checkpointing and evaluation, so only 85k GPUs are actively training. This reserve capacity reduces the risk of training restarts, which cost $1.2M per restart on 100k clusters.

The cost math: 100k H100 GPUs at $30k per GPU (owned, 3-year depreciation) is $3B in total capital cost. Meta attributes ~$1.89M of depreciation to a 21-day run, plus $15.5M in datacenter power, cooling, and networking costs, totaling $17.4M per run. This is 22% cheaper than using 16k A100 GPUs, which have 3x lower throughput per GPU, requiring longer training times and higher power costs per FLOPS.

Meta’s H100 Cluster Topology and C3 Communication Stack

Meta’s 100k H100 cluster is organized into 12,500 nodes (8 GPUs per node) with a hierarchical network topology: the 8 GPUs in a node are connected via NVLink 4.0 (900GB/s bidirectional bandwidth) and attach to their host over PCIe 5.0 (128GB/s), nodes and racks within a data center are connected via 400Gbps InfiniBand (NDR), and data centers are connected via a 100Gbps backbone. This topology is optimized for the hierarchical collective algorithms in C3, which aggregate gradients first at the GPU level, then node, then rack, then data center, minimizing long-distance traffic.
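The hierarchical aggregation idea itself is library-independent and can be shown with a toy two-level reduction in plain Python: sum within each node over the fast links first, so only one value per node has to cross the slower tiers. This is a conceptual sketch, not C3:

```python
from itertools import islice

def hierarchical_allreduce(grads_per_gpu, gpus_per_node=8):
    """Toy two-level reduction: sum within each node, then across nodes.

    grads_per_gpu: flat list with one scalar gradient per GPU.
    Returns the global sum every GPU would receive.
    """
    it = iter(grads_per_gpu)
    # Stage 1: intra-node reduction over the fast links (NVLink in the article)
    node_sums = []
    while chunk := list(islice(it, gpus_per_node)):
        node_sums.append(sum(chunk))
    # Stage 2: inter-node reduction — only len(node_sums) values cross the
    # slower rack/datacenter links instead of len(grads_per_gpu) values
    return sum(node_sums)

grads = [1] * 32  # 4 nodes x 8 GPUs, unit gradients
assert hierarchical_allreduce(grads) == sum(grads)  # matches a flat reduction
```

The payoff at scale is that traffic on the slowest tier shrinks by a factor of `gpus_per_node` (and again per rack), which is the effect the hierarchical C3 algorithms exploit.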

C3, Meta’s custom collective communication library, was built from the ground up to address NCCL’s limitations at scale. NCCL uses a flat ring topology for AllReduce, which requires O(N) communication steps for N GPUs. C3 uses a recursive halving-doubling algorithm for small payloads (<10MB) and a hierarchical ring algorithm for large payloads (>10MB), reducing communication steps to O(log N) for small payloads. C3 also supports Meta’s custom gradient compression: 4-bit quantization for gradients, which reduces communication volume by 75% with less than 0.1% perplexity loss on code tasks. Unlike NCCL, C3 has built-in support for heterogeneous clusters: Meta’s 100k cluster includes 80k H100s and 20k older A100s, which C3 automatically routes around for latency-sensitive operations.

Benchmarks show C3 achieves 98.7% of theoretical bandwidth on 100k GPUs, vs 78% for NCCL 2.18.3. This translates to a 22% reduction in total training time, saving $3.8M per run. C3 is tightly integrated with PyTorch 2.3.0’s distributed backend – Meta contributed the C3 backend to PyTorch in Q2 2024, so open-source users can test it via https://github.com/pytorch/pytorch (look for the `c3` backend in `torch.distributed`).

Data Pipeline: Processing 1.2PB of Code Data

Meta’s code corpus is 1.2PB of raw data, sourced from public GitHub repos (800TB), Stack Overflow (200TB), internal Meta code (100TB), library documentation (50TB), and synthetic code (50TB). Processing this data takes 14 days on 1k Apache Beam Dataflow workers, using an Apache Beam pipeline (https://github.com/apache/beam) like the one shown below. The pipeline performs 4 key steps: 1) Parsing and filtering (remove non-code files, oversized files, files with secrets), 2) Normalization (strip whitespace, comments, normalize indentation using tree-sitter), 3) Deduplication (SHA-256 hash of normalized content), 4) Quality filtering (keep only code with valid syntax, >80% test coverage, permissive licenses).

Quality filtering is critical: Meta found that training on low-quality code (e.g., code with syntax errors, no tests) increases perplexity by 18% and reduces code acceptance rate by 24%. The pipeline uses tree-sitter (https://github.com/tree-sitter/tree-sitter) to parse code into ASTs, then checks for syntax errors and estimates test coverage by looking for test functions or assertions. For internal Meta code, the pipeline also redacts PII (e.g., employee IDs, internal hostnames) and secrets (API keys, passwords) using a custom regex-based redaction tool that has 99.97% accuracy, validated against 100k manually labeled samples.
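Meta's redaction tool is internal, but the regex-driven approach it describes can be sketched. Every pattern below is an illustrative placeholder (including the hypothetical employee-ID format), not one of Meta's actual rules:

```python
import re

# Illustrative patterns only — a production redactor would use a much larger,
# validated rule set (the article cites 99.97% accuracy on labeled samples)
REDACTION_PATTERNS = [
    (re.compile(r"(api[_-]?key\s*=\s*)['\"][^'\"]+['\"]", re.IGNORECASE), r"\1'<REDACTED>'"),
    (re.compile(r"(password\s*=\s*)['\"][^'\"]+['\"]", re.IGNORECASE), r"\1'<REDACTED>'"),
    (re.compile(r"\b\d{6,10}@corp\.example\b"), "<REDACTED_ID>"),  # hypothetical employee-ID form
]

def redact(source: str) -> str:
    """Replace anything matching a redaction pattern with a placeholder."""
    for pattern, replacement in REDACTION_PATTERNS:
        source = pattern.sub(replacement, source)
    return source

code = 'api_key = "sk-123456"\npassword = "hunter2"\n'
print(redact(code))  # both secret values replaced with <REDACTED>
```

Regex redaction is fast enough to run inside the Beam pipeline, but its accuracy depends entirely on the rule set, which is why the article stresses validation against manually labeled samples.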

The final processed dataset is 420TB, stored in Parquet format on S3-compatible object storage, with 16k token context windows. During training, data is loaded using a custom PyTorch DataLoader that prefetches 100 batches per GPU, achieving 99% GPU utilization during data loading – a common bottleneck for smaller clusters.

Comparison: C3 vs NCCL on 100k H100 Cluster

| Metric | Meta C3 (100k H100) | NCCL 2.18.3 (100k H100) | Meta C3 (16k A100) |
| --- | --- | --- | --- |
| Cluster Utilization | 92% | 78% | 84% |
| TFLOPS per GPU (BF16) | 214 | 187 | 68 |
| AllReduce Latency (1GB payload) | 12ms | 47ms | 112ms |
| Memory Overhead (FSDP2) | 8% | 14% | 19% |
| Cost per Training Run (70B model) | $17.4M | $21.1M | $22.3M |

Code Example 1: PyTorch FSDP2 Training Loop for 70B Code LLM

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from transformers import AutoConfig, AutoModelForCausalLM  # https://github.com/huggingface/transformers
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
import functools
import os
import signal
import sys

# Signal handler for graceful shutdown
def handle_sigterm(signum, frame):
    print(f"Received signal {signum}, checkpointing and exiting...")
    if dist.is_initialized():
        dist.destroy_process_group()
    sys.exit(0)

signal.signal(signal.SIGTERM, handle_sigterm)
signal.signal(signal.SIGINT, handle_sigterm)

def get_code_model_config(model_name: str = "meta-llama/CodeLlama-70b-hf"):  # https://github.com/meta-llama/codellama
    """Load and modify config for 70B code-specialized LLM"""
    try:
        config = AutoConfig.from_pretrained(model_name)
        # Customize for code generation: increase context length to 16k
        config.max_position_embeddings = 16384
        # Llama's gated "silu" MLP is the SwiGLU variant used for code tasks
        config.hidden_act = "silu"
        # Add custom code-specific tokenizer vocab extensions
        config.vocab_size = 32064  # 2k extra tokens for code symbols
        return config
    except Exception as e:
        print(f"Failed to load model config: {e}")
        raise

def get_fsdp_wrap_policy():
    """Auto wrap policy for transformer layers (a partial, not a direct call)"""
    return functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={LlamaDecoderLayer},
    )

def init_distributed():
    """Initialize distributed training environment"""
    try:
        dist.init_process_group(backend="c3")  # Meta's C3 backend; use "nccl" without it
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        return local_rank
    except KeyError as e:
        print(f"Missing environment variable: {e}")
        raise
    except RuntimeError as e:
        print(f"Distributed init failed: {e}")
        raise

def main():
    # Initialize distributed
    local_rank = init_distributed()
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    print(f"Rank {rank}/{world_size} initialized on GPU {local_rank}")

    # Mixed precision config for BF16 training
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )

    # FSDP2 config with custom sharding
    sharding_strategy = ShardingStrategy.HYBRID_SHARD  # Shard within node, replicate across nodes
    auto_wrap_policy = get_fsdp_wrap_policy()

    # Load model
    try:
        config = get_code_model_config()
        model = AutoModelForCausalLM.from_config(config)
        model = FSDP(
            model,
            sharding_strategy=sharding_strategy,
            mixed_precision=mixed_precision,
            auto_wrap_policy=auto_wrap_policy,
            cpu_offload=None,  # No CPU offload for H100 high bandwidth
        )
        # Activation checkpointing is applied after wrapping, per decoder layer
        apply_activation_checkpointing(
            model,
            checkpoint_wrapper_fn=checkpoint_wrapper,
            check_fn=lambda module: isinstance(module, LlamaDecoderLayer),
        )
        if rank == 0:
            print(f"Model loaded. Total parameters: {sum(p.numel() for p in model.parameters()):,}")
    except Exception as e:
        print(f"Model loading failed: {e}")
        dist.destroy_process_group()
        raise

    # Optimizer: AdamW with Meta's custom learning rate schedule
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-4,
        betas=(0.9, 0.95),
        weight_decay=0.1,
    )

    # Training loop (simplified for brevity, full run uses 1.2PB dataset)
    num_steps = 100000
    for step in range(num_steps):
        try:
            # Simulate batch loading (the real run uses the Beam-processed corpus)
            batch = torch.randint(0, config.vocab_size, (32, 16384), device=f"cuda:{local_rank}")
            labels = batch.clone()
            outputs = model(batch, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            if step % 100 == 0 and rank == 0:
                print(f"Step {step}/{num_steps}, Loss: {loss.item():.4f}, GPU Mem: {torch.cuda.memory_allocated()/1e9:.2f}GB")
        except RuntimeError as e:
            print(f"Rank {rank} training step {step} failed: {e}")
            # Checkpoint and re-raise so the job controller can restart
            torch.save(model.state_dict(), f"checkpoint_step_{step}_rank_{rank}.pt")
            dist.destroy_process_group()
            raise

    # Cleanup
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()

Code Example 2: Apache Beam Pipeline for Code Corpus Processing

import argparse
import hashlib
import json
import logging
import re
from datetime import datetime
from typing import Dict, Iterator, List, Optional

import apache_beam as beam  # https://github.com/apache/beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import DoFn, ParDo
import pyarrow as pa

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ParseGitHubCode(DoFn):
    """Parse raw GitHub archive records (one JSON object per line) into code samples"""
    def __init__(self):
        super().__init__()
        self.code_extensions = {".py", ".js", ".ts", ".go", ".rs", ".cpp", ".java", ".c", ".h"}
        self.max_file_size = 1024 * 1024  # 1MB max file size

    def process(self, element: str) -> Iterator[Dict]:
        try:
            # ReadFromText yields str lines, one JSON record each
            record = json.loads(element)
            repo_name = record.get("repo", {}).get("name", "")
            file_path = record.get("file", {}).get("path", "")
            file_size = record.get("file", {}).get("size", 0)
            content = record.get("file", {}).get("content", "")

            # Filter out non-code files and oversized files
            if not any(file_path.endswith(ext) for ext in self.code_extensions):
                return
            if file_size > self.max_file_size:
                logger.debug("Skipping %s in %s: too large (%d bytes)", file_path, repo_name, file_size)
                return
            if not content:
                return

            # Generate content hash for deduplication
            content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()

            # Extract code features
            line_count = len(content.split("\n"))
            has_tests = bool(re.search(r"test_|spec\.|it\(|describe\(", content))
            license_name = self._extract_license(content)

            yield {
                "repo_name": repo_name,
                "file_path": file_path,
                "content_hash": content_hash,
                "content": content,
                "line_count": line_count,
                "has_tests": has_tests,
                "license": license_name,
                "timestamp": datetime.utcnow().isoformat(),
            }
        except json.JSONDecodeError as e:
            logger.error("Failed to parse JSON: %s", e)
        except Exception as e:
            logger.error("Unexpected error processing record: %s", e)

    def _extract_license(self, content: str) -> str:
        """Extract license info from file header"""
        license_patterns = [
            (r"MIT License", "MIT"),
            (r"Apache License", "Apache"),
            (r"GNU General Public License", "GPL"),
            (r"BSD License", "BSD"),
        ]
        for pattern, license_name in license_patterns:
            if re.search(pattern, content[:1024]):  # Check first 1KB
                return license_name
        return "Unknown"

class DeduplicateCode(DoFn):
    """Deduplicate code samples using content hash.

    Note: the seen-hash set lives per worker, so this stage only deduplicates
    within a worker. A fully global dedup groups by content_hash first
    (e.g. with beam.Distinct or a GroupByKey on the hash).
    """
    def __init__(self):
        super().__init__()
        self.seen_hashes = set()

    def process(self, element: Dict) -> Iterator[Dict]:
        try:
            content_hash = element["content_hash"]
            if content_hash in self.seen_hashes:
                return
            self.seen_hashes.add(content_hash)
            # Remove hash from output to save storage
            del element["content_hash"]
            yield element
        except KeyError as e:
            logger.error("Missing key in element: %s", e)
        except Exception as e:
            logger.error("Deduplication failed: %s", e)

class ValidateCode(DoFn):
    """Validate code samples for training suitability"""
    def __init__(self):
        super().__init__()
        self.min_lines = 10
        self.max_lines = 1000
        self.banned_patterns = [r"password\s*=\s*['\"]\w+['\"]", r"api_key\s*=\s*['\"]\w+['\"]"]

    def process(self, element: Dict) -> Iterator[Dict]:
        try:
            content = element["content"]
            line_count = element["line_count"]

            # Filter by line count
            if line_count < self.min_lines or line_count > self.max_lines:
                return

            # Filter out banned patterns (secrets)
            for pattern in self.banned_patterns:
                if re.search(pattern, content, re.IGNORECASE):
                    logger.debug("Skipping %s: contains secrets", element["file_path"])
                    return

            # Check for valid UTF-8 (already done in parse, but double check)
            content.encode("utf-8")
            yield element
        except KeyError as e:
            logger.error("Missing key in validation: %s", e)
        except UnicodeEncodeError as e:
            logger.error("Invalid UTF-8 in content: %s", e)
        except Exception as e:
            logger.error("Validation failed: %s", e)

# WriteToParquet expects a pyarrow schema, not a dict of Python types
OUTPUT_SCHEMA = pa.schema([
    ("repo_name", pa.string()),
    ("file_path", pa.string()),
    ("content", pa.string()),
    ("line_count", pa.int64()),
    ("has_tests", pa.bool_()),
    ("license", pa.string()),
    ("timestamp", pa.string()),
])

def run_pipeline(argv: Optional[List[str]] = None):
    """Run Beam pipeline to process 1.2PB code corpus"""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", dest="input", required=True, help="Input GCS path to GitHub archive")
    parser.add_argument("--output", dest="output", required=True, help="Output GCS path for processed data")
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(
        pipeline_args,
        runner="DataflowRunner",
        project="meta-code-llm",
        region="us-central1",
        temp_location=f"{known_args.output}/temp",
        save_main_session=True,
    )

    with beam.Pipeline(options=pipeline_options) as p:
        raw_data = p | "Read Raw Data" >> beam.io.ReadFromText(known_args.input, skip_header_lines=1)
        parsed = raw_data | "Parse GitHub Code" >> ParDo(ParseGitHubCode())
        validated = parsed | "Validate Code" >> ParDo(ValidateCode())
        deduplicated = validated | "Deduplicate" >> ParDo(DeduplicateCode())
        # Write to Parquet for efficient training loading
        deduplicated | "Write Output" >> beam.io.WriteToParquet(
            known_args.output,
            schema=OUTPUT_SCHEMA,
        )
    logger.info("Pipeline completed. Output written to %s", known_args.output)

if __name__ == "__main__":
    run_pipeline()

Code Example 3: AllReduce Benchmark for C3 vs NCCL

import torch
import torch.distributed as dist
import time
import argparse
import numpy as np
from typing import Tuple
import os

def benchmark_allreduce(
    payload_size: int,
    num_iterations: int = 100,
    warmup_iterations: int = 10,
) -> Tuple[float, float]:
    """
    Benchmark AllReduce performance on the current process group.
    Returns (avg_latency_ms, avg_bus_bandwidth_gbps)
    """
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = dist.get_world_size()

    # Create payload: random tensor of payload_size bytes (float32: 4 bytes per element)
    num_elements = payload_size // 4
    tensor = torch.randn(num_elements, dtype=torch.float32, device=f"cuda:{local_rank}")

    # Warmup
    for _ in range(warmup_iterations):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    # Benchmark
    latencies = []
    for _ in range(num_iterations):
        start = time.perf_counter()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        torch.cuda.synchronize()
        end = time.perf_counter()
        latencies.append((end - start) * 1000)  # Convert to ms

    # Calculate stats
    avg_latency = np.mean(latencies)
    # Ring-AllReduce bus bandwidth: each rank transfers 2*(N-1)/N * payload bytes
    bus_bytes = payload_size * 2 * (world_size - 1) / world_size
    avg_bandwidth = (bus_bytes * 8 / 1e9) / (avg_latency / 1000)  # gigabits per second

    return avg_latency, avg_bandwidth

def main():
    parser = argparse.ArgumentParser(description="Benchmark AllReduce backends for Meta LLM training")
    parser.add_argument("--backend", type=str, choices=["c3", "nccl"], required=True, help="Collective communication backend")
    parser.add_argument("--payload-sizes", type=int, nargs="+", default=[1024, 10240, 102400, 1048576, 10485760], help="Payload sizes in bytes")
    parser.add_argument("--num-iterations", type=int, default=100, help="Number of benchmark iterations")
    args = parser.parse_args()

    # Initialize distributed with chosen backend
    try:
        dist.init_process_group(backend=args.backend)
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        rank = dist.get_rank()
        world_size = dist.get_world_size()
    except Exception as e:
        print(f"Failed to initialize distributed with backend {args.backend}: {e}")
        raise

    print(f"Rank {rank}/{world_size} using backend {args.backend} on GPU {local_rank}")

    results = []
    for payload_size in args.payload_sizes:
        try:
            avg_latency, avg_bandwidth = benchmark_allreduce(
                payload_size=payload_size,
                num_iterations=args.num_iterations,
            )
            if rank == 0:
                results.append({
                    "payload_size_mb": payload_size / 1e6,
                    "avg_latency_ms": round(avg_latency, 2),
                    "avg_bandwidth_gbps": round(avg_bandwidth, 2),
                })
                print(f"Payload: {payload_size/1e6}MB | Latency: {avg_latency:.2f}ms | Bandwidth: {avg_bandwidth:.2f}Gbps")
        except Exception as e:
            print(f"Benchmark failed for payload {payload_size}: {e}")
            if dist.is_initialized():
                dist.destroy_process_group()
            raise

    # Save results to JSON (rank 0 only)
    if rank == 0:
        import json
        with open(f"allreduce_benchmark_{args.backend}.json", "w") as f:
            json.dump(results, f, indent=2)
        print(f"Results saved to allreduce_benchmark_{args.backend}.json")

    # Cleanup
    if dist.is_initialized():
        dist.destroy_process_group()

if __name__ == "__main__":
    main()

Case Study: Meta’s Internal Code Completion Service Migration to 100k GPU Trained LLM

  • Team size: 12 engineers (4 backend, 5 ML, 3 infrastructure)
  • Stack & Versions: PyTorch 2.3.0, FSDP2, C3 1.2.1, Nvidia H100 80GB GPUs, Kubernetes 1.28, Redis 7.2 for caching, gRPC 1.58 for inference serving
  • Problem: Initial p99 latency for code completion requests was 2.4s when using the 16k A100-trained 13B code LLM, with 18% of requests timing out (5s timeout), and monthly GPU inference cost of $47k for 2M daily requests
  • Solution & Implementation: Migrated to 70B code LLM trained on 100k H100 cluster, implemented speculative decoding with 7B draft model, deployed model shards across 8 H100 nodes using FSDP2 inference, added prompt caching for repeated code contexts (e.g., same file, same user), and integrated C3 collective communication for fast weight synchronization across inference nodes
  • Outcome: p99 latency dropped to 120ms, timeout rate reduced to 0.3%, monthly inference cost dropped to $29k (saving $18k/month), and code acceptance rate increased from 34% to 61% in internal developer surveys

Developer Tips

1. Optimize FSDP2 Sharding Strategy for Large Code LLMs

For teams training code-specialized LLMs over 30B parameters, the default FSDP sharding strategy (FULL_SHARD) often leads to excessive communication overhead, especially when using hybrid CPU/GPU setups. Meta’s team found that HYBRID_SHARD – which shards model weights within a node and replicates across nodes – reduces AllReduce traffic by 62% for 70B models on 100k GPU clusters, as most communication happens within the high-bandwidth node (H100 nodes have 900GB/s NVLink 4.0). This strategy works best when you have uniform node sizes (e.g., 8 GPUs per node) and high intra-node bandwidth. Avoid FULL_SHARD if your inter-node bandwidth is less than 100Gbps, as the cross-node weight synchronization will become a bottleneck. Always benchmark sharding strategies with a 1-hour training run before full cluster deployment: we’ve seen teams waste $200k+ on inefficient sharding configurations that could have been caught in a short benchmark. Use PyTorch’s built-in FSDP profiling tools to measure communication vs computation time, and adjust sharding granularity accordingly. For code LLMs with long context windows (16k+ tokens), also enable activation checkpointing for transformer layers with more than 20 attention heads, as the attention activation footprint grows quadratically with context length.

Short code snippet for FSDP2 config:

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    mixed_precision=mixed_precision,
    # The wrap policy must be a callable: bind the layer classes with partial
    auto_wrap_policy=functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer}
    ),
)

2. Deduplicate Code Corpora with Content Hashing, Not File Names

When building training datasets for code LLMs, naive deduplication based on file names or repository names will miss 37% of duplicate code samples, according to Meta’s 2024 corpus analysis. Developers frequently copy-paste code across repositories, rename files, or fork repos without modifying core logic – all of which bypass filename-based deduplication. Meta’s team uses SHA-256 content hashing of normalized code (strip whitespace, remove comments) to deduplicate 1.2PB of raw code data down to 420TB of unique samples, reducing training time by 28% and improving model perplexity by 12% on the HumanEval benchmark. Normalization is critical here: if you hash raw code, minor formatting differences (e.g., 4 spaces vs 2 spaces indentation) will be treated as unique, inflating your dataset. Use tree-sitter (https://github.com/tree-sitter/tree-sitter) to parse code into ASTs, then normalize the AST before hashing to eliminate formatting differences entirely. For large datasets, use distributed deduplication with Apache Beam or Spark: Meta’s Beam pipeline processes 10TB of code data per hour on 1k Dataflow workers, with a 99.99% deduplication accuracy rate. Never skip deduplication – training on duplicate code leads to overfitting, where the model memorizes common snippets instead of learning generalizable code patterns, which tanks performance on rare edge cases like error handling or niche library usage.

Short code snippet for content normalization:

import hashlib
from tree_sitter import Language, Parser  # https://github.com/tree-sitter/tree-sitter

def normalize_code(content: str, lang: str = "python") -> str:
    parser = Parser()
    parser.set_language(Language(f"build/{lang}.so", lang))
    tree = parser.parse(bytes(content, "utf-8"))
    # S-expression of the AST: drops whitespace and comments entirely.
    # Note this keeps structure only (identifiers are not included), so it
    # is a deliberately aggressive normalization.
    return tree.root_node.sexp()

def get_content_hash(content: str) -> str:
    normalized = normalize_code(content)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

3. Benchmark Collective Communication Before Scaling to 10k+ GPUs

One of the most common mistakes teams make when scaling LLM training is assuming that collective communication libraries (e.g., NCCL, C3) will perform linearly as they add more GPUs. Meta’s team found that NCCL 2.18.3’s AllReduce latency grows from 12ms on 1k GPUs to 47ms on 100k GPUs, a 3.9x increase, while their custom C3 library stays flat at 12ms thanks to optimized hierarchical collective algorithms. Always run a full benchmark of your most common collective operations (AllReduce, AllGather, ReduceScatter) at 10%, 50%, and 100% of your target cluster size before starting training. For code LLMs, the dominant operation is AllReduce for gradient synchronization, which accounts for 38% of total training time on 100k GPU clusters. Measure both latency and bandwidth utilization: if bandwidth utilization drops below 85% at scale, you need to optimize your network topology or switch to a hierarchical collective algorithm that aggregates gradients at the rack level before cluster level. Meta uses a custom benchmark tool built on PyTorch Distributed (https://github.com/pytorch/pytorch) that logs per-GPU performance metrics to Prometheus, allowing them to identify slow nodes or faulty network links before they impact training. A single faulty 100Gbps link in a 100k GPU cluster can reduce overall cluster utilization by 4%, costing $70k/day in wasted GPU time.

Short code snippet for collective benchmark:

import time
import torch
import torch.distributed as dist

def benchmark_allreduce(payload_size: int) -> float:
    """Return AllReduce latency in ms for a payload of payload_size bytes."""
    tensor = torch.randn(payload_size // 4, device="cuda")
    start = time.perf_counter()
    dist.all_reduce(tensor)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000

Join the Discussion

We’ve shared Meta’s production architecture for training code LLMs on 100k GPU clusters – now we want to hear from you. Whether you’re training small 7B models on 8 GPUs or scaling to 10k+ clusters, your experience with distributed training, code data pipelines, or inference optimization is valuable to the community.

Discussion Questions

  • By 2026, Meta plans to deploy 250k GPU clusters with custom silicon – do you think open-source tools will keep pace with proprietary communication libraries like C3, or will the gap widen?
  • Training on 100k GPUs costs $17.4M per run – what cost optimization strategies (e.g., spot instances, mixed precision, gradient accumulation) have you found most effective for large-scale LLM training?
  • Meta uses FSDP2 over DeepSpeed ZeRO-3 for code LLM training – have you benchmarked the two for code generation tasks, and which performed better for your use case?

Frequently Asked Questions

How does Meta’s C3 communication library differ from NCCL?

C3 (Collective Communication for Clusters) is Meta’s custom library optimized for large-scale, hierarchical GPU clusters. Unlike NCCL, which uses a flat communication topology, C3 uses a 3-tier hierarchy: GPU-to-GPU within a node (NVLink), GPU-to-host (PCIe 5.0), and inter-node (400Gbps InfiniBand). This reduces AllReduce latency by 75% on clusters over 10k GPUs. C3 also supports Meta’s custom gradient compression (4-bit quantization), which reduces communication volume by 75% for BF16 training, and has built-in fault tolerance for node failures, automatically rerouting traffic around dead nodes without restarting training.

What code corpora does Meta use to train its code LLMs?

Meta’s 1.2PB training corpus includes: 800TB of public GitHub repositories (filtered for licenses allowing ML training), 200TB of Stack Overflow Q&A pairs, 100TB of internal Meta code (with PII/secret redaction), 50TB of documentation from popular libraries (React, PyTorch, Rust stdlib), and 50TB of synthetic code generated by smaller LLMs to cover edge cases. All data is deduplicated, normalized, and filtered for quality: only code with >80% test coverage, valid syntax, and no secrets is included in the final training set.

How does Meta handle GPU failures during 21-day training runs?

Meta’s training stack has 3 layers of fault tolerance: 1) Per-node health checks every 30 seconds that restart training processes on healthy GPUs if a failure is detected, 2) C3 communication library automatically reroutes traffic around dead nodes, 3) Asynchronous checkpointing to S3-compatible storage every 15 minutes, with incremental checkpoints that only save changed weights (reducing checkpoint size by 92% vs full checkpoints). In 2024 runs, Meta saw an average of 12 GPU failures per day on 100k clusters, with 0 training restarts required – the fault tolerance stack handled all failures transparently.
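The incremental-checkpoint idea — persist only tensors whose content changed since the last checkpoint — can be sketched abstractly. The hash-per-tensor diff below is an illustration of the concept, not Meta's implementation, and uses plain Python objects in place of tensors:

```python
import hashlib
import pickle

def _digest(value) -> str:
    """Content hash of a (picklable) tensor stand-in."""
    return hashlib.sha256(pickle.dumps(value)).hexdigest()

def incremental_checkpoint(state, previous_digests):
    """Return (changed_entries, new_digests) for a state dict.

    Only entries whose hash changed since the last checkpoint need to be
    written out, which shrinks checkpoint size when large parts of the
    state (e.g. frozen embeddings, optimizer metadata) are untouched.
    """
    new_digests = {name: _digest(tensor) for name, tensor in state.items()}
    changed = {name: state[name] for name, digest in new_digests.items()
               if previous_digests.get(name) != digest}
    return changed, new_digests

state = {"layer0.weight": [0.1, 0.2], "embed.weight": [1.0, 2.0]}
changed, digests = incremental_checkpoint(state, {})
assert set(changed) == set(state)          # first checkpoint saves everything

state["layer0.weight"] = [0.15, 0.2]       # only one tensor updated since
changed, digests = incremental_checkpoint(state, digests)
assert set(changed) == {"layer0.weight"}   # the incremental save is tiny
```

A production system would diff at a finer granularity and stream the changed shards to object storage asynchronously, as the article describes; the hashing scheme here just makes the "save only what changed" contract concrete.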

Conclusion & Call to Action

Meta’s 100k GPU training stack for code LLMs represents the current state of the art in large-scale distributed ML, but the core lessons apply to teams of all sizes: optimize your communication stack before scaling, deduplicate your training data aggressively, and always benchmark at target scale. For teams training code LLMs, the 70B model trained on this stack achieves 82% pass@1 on HumanEval, 74% on MBPP, and 68% on Meta’s internal code review benchmark – a 2x improvement over 2023’s 13B model. If you’re building code generation tools, start by benchmarking your current training stack against the metrics we’ve shared, and prioritize communication optimization if your cluster utilization is below 85%. The gap between proprietary and open-source training stacks is narrowing, but only if we share real production numbers and avoid marketing fluff.

92% — cluster utilization achieved by Meta’s C3 library on 100k H100 GPUs

GitHub Copilot Pricing Changes and Their Impact on API Teams

2026-04-29 15:07:14

GitHub Copilot’s billing model changed twice last year and changed again this month. Starting this month, Copilot code reviews on pull requests consume GitHub Actions minutes from the billing account that owns the repository. API teams now have to manage three meters at once: Copilot seats, premium requests, and Actions minutes. This article walks through each meter, its impact on API repositories, and the steps to estimate costs before the bill arrives.


Combined with workflows inside Apidog, you can manage API specs, contract tests, and AI review as a single flow instead of scattering them across three separate billing dashboards.

If your team also budgets for model APIs it calls directly, check the GPT-5.5 pricing and the DeepSeek V4 pricing as well; they help with per-token cost estimates.

TL;DR

  • Track Copilot cost as three meters: seat licenses, premium requests, and Actions minutes for Copilot code review.
  • Copilot code review on PRs runs internally as GitHub Actions and consumes your regular Actions quota.
  • API repositories tend to change specs, generated clients, handlers, and tests together, so minutes per review climb quickly.
  • Premium requests apply to "agentic" work such as Workspace, agent mode, and Copilot Spaces.
  • Set spending limits before the next billing cycle. As a provisional budget, allow 400–800 Actions minutes per month per active API repository, then revisit with real numbers after 30 days.

The Three Meters to Watch in Copilot Billing

Copilot billing is no longer a single flat fee. Manage it as the following three meters.

Meter 1: Per-Seat Licenses

This is the fixed cost.

  • Copilot Business: $10 per user per month
  • Copilot Enterprise: $19 per user per month

The fee includes chat, inline completions, multi-line suggestions, IDE integration, and access to the standard model pool.

On the operational side, the work is simple:

  1. List your active developers
  2. Check for users who have not used Copilot in the last 30–90 days
  3. Reclaim unused seats every quarter

Seats are the most predictable meter, and also the one most prone to over-allocation.

Meter 2: Premium Requests

Premium requests are the billing unit for Copilot’s more expensive features.

Features that typically consume them:

  • Agent mode
  • Workspace
  • Copilot Spaces
  • Selecting a non-default model

Current pricing looks like this (rates are subject to change):

機能 プレミアムリクエストでのコスト
デフォルトモデルのチャット 有料プランでは無料
インライン補完 有料プランでは無料
エージェントモード(デフォルトモデル) リクエストあたり1
Workspace(デフォルトモデル) リクエストあたり1
Claude Sonnet 4.5の選択 1.5倍
GPT-5.5の選択 2倍
GPT-5.5 Proの選択 6倍
Copilot Spacesクエリ クエリあたり1

含まれる月間クォータは次の通りです。

  • Copilot Business:シートあたり300プレミアムリクエスト
  • Copilot Enterprise:シートあたり1,000プレミアムリクエスト

超過分は、リクエストあたり0.04ドルで請求され、組織に設定した利用制限で上限を管理できます。

APIチームで注意すべき操作は、次のようなエージェントタスクです。

  • 「OpenAPIクライアントを再生成して」
  • 「この新しいエンドポイントの契約テストを作って」
  • 「このAPI変更に合わせてハンドラとテストを更新して」

これらは内部で複数ステップに分かれることがあり、1つのプロンプトが複数のプレミアムリクエストとして扱われる場合があります。

Metric 3: Actions Minutes (Copilot Code Review)

This is the part of this month’s change to watch most closely.

When Copilot runs an automated code review on a pull request, the review executes on GitHub Actions infrastructure. The minutes it uses are therefore deducted from your organization’s regular Actions quota.

Two points to internalize:

  • There is no separate quota just for Copilot code review
  • Private repositories consume your Actions minutes budget; Actions is free on public repositories

Example Actions quotas by GitHub plan:

  • Team plan: 3,000 minutes per month
  • Enterprise plan: 50,000 minutes on Linux runners

A Copilot code review of an API PR typically consumes 2–6 Actions minutes. With a large diff, or when the review reads repository-wide context, it can reach around 15 minutes.

Why API Repositories Rack Up Costs Faster

API repositories tend to give Copilot review more surface area than typical application code.

1. PRs are larger

A typical API change touches all of the following at once:

  • openapi.yaml
  • Generated clients
  • Server handlers
  • Contract tests
  • Docs and sample requests

Copilot review reads all of them, so it runs longer than a single-file UI fix.

2. Generated code inflates diffs

If you commit generated clients to the repository, even a small spec change produces a large diff.

When Copilot review reads the generated code too, both minutes and token volume go up. Files with little review value should be excluded with path filters.

3. Multiple review agents run per PR

Most API teams already run these alongside Copilot review:

  • CodeQL
  • Snyk
  • Custom security scanners
  • Contract tests
  • Lint
  • E2E tests

Copilot review stacks on top, increasing total Actions consumption across CI.

Example:

  • 50 PRs per month
  • 4 minutes per review
  • 200 Actions minutes per month for Copilot review alone

That is about 7% of the Team plan’s 3,000 monthly minutes. With three API repositories at the same scale, you burn roughly 20% before your actual CI even runs.
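The arithmetic above can be sketched as a small helper. The 50-PR / 4-minute inputs are the example’s assumptions, not measurements:

```python
# Sanity-check the quota-share numbers from the example above.

def review_minutes(prs_per_month: int, minutes_per_review: float) -> float:
    """Actions minutes consumed by Copilot review alone."""
    return prs_per_month * minutes_per_review

def quota_share(minutes: float, monthly_quota: int) -> float:
    """Fraction of the plan's Actions quota those minutes represent."""
    return minutes / monthly_quota

TEAM_PLAN_QUOTA = 3_000  # Actions minutes per month on the Team plan

one_repo = review_minutes(50, 4)   # 50 PRs x 4 min = 200 minutes
three_repos = 3 * one_repo         # 600 minutes across three similar repos
print(f"one repo:    {one_repo:.0f} min = {quota_share(one_repo, TEAM_PLAN_QUOTA):.1%} of quota")
print(f"three repos: {three_repos:.0f} min = {quota_share(three_repos, TEAM_PLAN_QUOTA):.1%} of quota")
```

Plugging in your own PR counts and average review times shows how quickly the share grows before regular CI is accounted for.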

How to Estimate the Monthly Bill

Before the bill arrives, build a rough estimate in three steps.

Step 1: Calculate seat cost

seats = active_users × $10  (Business)
      = active_users × $19  (Enterprise)

Example:

10 developers × $19 = $190/month

Operationally, export a CSV from GitHub’s billing UI at month end and record the active user count.

Step 2: Calculate premium requests

Roughly bucket usage per developer:

  • Mostly chat: around 150 requests per month
  • Heavy Workspace/agent usage: around 600–800 requests per month

On the Business plan, each seat includes 300 requests, so agent-heavy users overrun first.

premium_overage = max(0, requests_used - included_quota) × $0.04

For Business:

included_quota = seats × 300

For Enterprise:

included_quota = seats × 1000

Set a spending limit at the organization level so a runaway agent loop cannot blow the budget.
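The step-2 formula folds into a small helper; the quotas and the $0.04 overage rate come from the figures above, while the usage number below is purely illustrative:

```python
# Premium-request overage cost, matching the step-2 formula above.

INCLUDED_PER_SEAT = {"business": 300, "enterprise": 1000}
OVERAGE_RATE = 0.04  # USD per premium request beyond the included quota

def premium_overage(plan: str, seats: int, requests_used: int) -> float:
    """Dollar cost of premium requests over the included quota (never negative)."""
    included = seats * INCLUDED_PER_SEAT[plan]
    return max(0, requests_used - included) * OVERAGE_RATE

# 10 Business seats include 3,000 requests; 4,000 used -> 1,000 over -> $40.
print(premium_overage("business", 10, 4_000))
```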

Step 3: Calculate Actions minutes for Copilot code review

review_minutes = prs_per_month × average_minutes_per_review

For mid-sized API PRs, 4 minutes is a reasonable starting average.

review_minutes = prs_per_month × 4

Rough overage cost:

review_overage = max(0, review_minutes - actions_quota_remaining)
                 × $0.008  (Linux private repos)

Example:

  • A 10-person Enterprise team
  • 200 PRs per month
  • 4 minutes per review on average
review_minutes = 200 × 4 = 800 minutes

Rough totals:

  • Seats: $190
  • Premium overage: $40
  • Review minutes: 800 minutes; $0 if within the Enterprise quota
  • Total: about $230 per month

On the Business tier the Actions quota is smaller, so the same PR volume hits overage much sooner.
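Putting the three steps together, a minimal estimator might look like this. The inputs are chosen to reproduce the worked example (11,000 requests puts the 10-seat Enterprise team 1,000 over its included quota):

```python
# End-to-end monthly Copilot cost estimate combining steps 1-3 above.

SEAT_PRICE = {"business": 10, "enterprise": 19}            # USD per seat per month
INCLUDED_REQUESTS = {"business": 300, "enterprise": 1000}  # premium requests per seat
REQUEST_OVERAGE = 0.04   # USD per premium request over quota
MINUTE_OVERAGE = 0.008   # USD per Linux minute over quota (private repos)

def monthly_estimate(plan, seats, requests_used, prs, min_per_review, quota_remaining):
    seat_cost = seats * SEAT_PRICE[plan]
    premium = max(0, requests_used - seats * INCLUDED_REQUESTS[plan]) * REQUEST_OVERAGE
    review_minutes = prs * min_per_review
    review = max(0, review_minutes - quota_remaining) * MINUTE_OVERAGE
    return seat_cost + premium + review

# The worked example: 10 Enterprise seats, 1,000 requests over quota,
# 200 PRs at 4 min each, well inside the Enterprise Actions quota.
print(monthly_estimate("enterprise", 10, 11_000, 200, 4, 50_000))  # -> 230.0
```

Swapping the plan to "business" with the same PR volume shows how quickly the smaller Actions quota turns the review line item into real overage.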

What to Change First in Your CI Pipeline

The key to cutting cost is to stop running Copilot review unconditionally on every PR.

1. Skip Copilot review on bot and Dependabot PRs

A Renovate or Dependabot version bump does not need an AI review every time.

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  copilot-review:
    if: github.actor != 'dependabot[bot]' && github.actor != 'renovate[bot]'
    runs-on: ubuntu-latest
    steps:
      - uses: github/copilot-review@v1

If you run in-house bots of your own, add them to the same condition.

2. Exclude generated clients from review

Generated code produces large diffs and drives review cost up.

Use path filters to narrow review to the files humans actually edit.

on:
  pull_request:
    paths:
      - 'apis/**/*.yaml'
      - 'cmd/**'
      - 'internal/**'
      - 'tests/**'

If your generated clients live in locations like the following, exclude them from Copilot review:

generated/**
clients/**
sdk/**

3. Run Copilot review only after contract validation passes

Copilot review is one of the more expensive steps in the pipeline. Run the cheap checks first, and skip the review on PRs that fail them.

Example:

jobs:
  contract-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run contract tests
        run: apidog-cli test

  copilot-review:
    needs: contract-test
    if: success()
    runs-on: ubuntu-latest
    steps:
      - uses: github/copilot-review@v1

This way, you never spend an expensive review on a PR whose spec or contract tests are already broken.

Governance: Four Settings Every API Team Should Make

Prioritize these settings to avoid surprise bills.

1. Organization-level spending limits

Set the cap at the organization level, not per repository.

Recommended procedure:

  1. Check current monthly usage
  2. Decide the maximum amount you can tolerate
  3. Set the initial cap at about 80% of that
  4. Adjust after 30 days based on actual numbers

The default of unlimited spending is dangerous for any team that is not actively watching the dashboard.

2. Premium-request alerts

GitHub notifies you at 50%, 75%, and 90% of the included quota.

Do not rely on email alone; route the alerts to places like:

  • Slack
  • Microsoft Teams
  • PagerDuty
  • Your incident-management tool

3. A Copilot review trigger policy

Instead of reviewing every PR, decide explicitly which PRs qualify.

Examples:

  • Run only on PRs with a review-please label
  • Run only on PRs with an api-change label
  • Skip PRs that only touch generated/**
  • Skip bot PRs

Label-driven triggering keeps the PRs that genuinely benefit from review while cutting cost substantially.

4. Team-by-team enablement

Roll out Copilot Enterprise features team by team instead of flipping them on for the whole organization.

Recommended pattern:

  1. Pilot with the API platform team
  2. Measure 30 days of usage
  3. Tune path filters and spending caps
  4. Roll out to other teams

Enabling a new feature for everyone on release day makes it hard to attribute cost increases later.

Where Apidog Fits

Apidog is not a Copilot replacement. It is the layer that unifies API specs, mocks, and contract tests into one flow and runs the cheap validation before Copilot review.

The implementation pattern looks like this:

  • Keep API specs and saved example requests versioned alongside the repository
  • Run contract tests against Apidog’s mock server rather than the live API
  • Let Copilot review focus on handler logic and test coverage, not stale spec examples
  • Run contract validation first with apidog-cli, and trigger Copilot review only on success

Because Copilot review tends to be the expensive step in the pipeline, execution order matters.

OpenAPI/Apidog contract check
        ↓
unit tests
        ↓
security scan
        ↓
Copilot review
        ↓
merge

Failing fast on contract violations concentrates your review minutes on the PRs that actually need them.

For Apidog’s mock workflow, see the guide to API testing without Postman. For applying it to model APIs, see the DeepSeek V4 API guide.

What to Check in the Next Billing Cycle

Over the next 30 days, check usage at these checkpoints.

Days 1–7

Premium-request usage usually still looks low. Most teams stay under the included 300 requests per seat during the first week.

Check:

  • Active user count
  • Number of PRs that triggered a Copilot review
  • Whether reviews are running on bot PRs

Days 14–21

Heavy users start exceeding the included quota.

Check:

  • Who is using Workspace and agent mode
  • Your top premium-request users
  • Whether you are approaching the spending limit

If a limit is set, requests from users who hit the cap start failing. If there is no limit, the bill grows instead.

Days 28–30

Actions minutes from Copilot review accumulate.

Check:

  • Actions usage versus last month
  • Estimated minutes attributable to Copilot review alone
  • The difference before and after introducing path filters
  • Correlation with PR count

At month end:

  • Reclaim seats from inactive users
  • Consider moving heavy users to the Enterprise tier
  • Tune the path filters in your review workflow
  • Check for bot PRs or generated code that slipped past the exclusions

Common Mistakes

These are the five problems API teams run into most often.

1. No spending limit

A single agent loop can run for a long time. Always set a cap at the organization level.

2. Review enabled on every repository

Pick the repositories where Copilot review earns its keep.

High priority:

  • API gateways
  • Authentication and authorization code
  • Billing and payment APIs
  • Externally facing APIs
  • Services where contract tests matter

Low priority:

  • Generated-SDK-only repositories
  • Docs-only repositories
  • Repositories dominated by bot updates

3. Reviewing generated clients

Generated code has large diffs and usually low review value. Exclude it with path filters.

4. Reviewing bot PRs

Exclude Dependabot, Renovate, and any in-house auto-bump tooling.

if: github.actor != 'dependabot[bot]' && github.actor != 'renovate[bot]'

Add your in-house bots as needed.

5. No baseline metrics

Without pre-change usage numbers, you cannot judge whether an optimization worked.

Save these metrics every month:

  • Copilot seat count
  • Premium-request usage
  • Actions minutes
  • PRs per month
  • Copilot review trigger count
  • Bot PR count
  • Generated-code-only PR count

Export a CSV from GitHub’s billing UI and compare month over month.
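A month-over-month diff of those saved metrics can be as simple as the sketch below; the metric names and values are made up for illustration and do not reflect GitHub’s export format:

```python
# Percentage change per metric between two monthly baseline snapshots.

def month_over_month(prev: dict, curr: dict) -> dict:
    """Percent change for each metric, skipping zero baselines to avoid division errors."""
    return {
        k: round((curr[k] - prev[k]) / prev[k] * 100, 1)
        for k in prev if prev[k]
    }

# Hypothetical snapshots for two consecutive months.
march = {"seats": 10, "premium_requests": 2400, "actions_minutes": 2100, "prs": 180}
april = {"seats": 10, "premium_requests": 3100, "actions_minutes": 1650, "prs": 195}

for metric, pct in month_over_month(march, april).items():
    print(f"{metric}: {pct:+.1f}%")
```

In this hypothetical, premium requests rose while Actions minutes fell, which is the pattern you would expect after adding path filters while agent usage grows.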

FAQ

Are seats still $10 per user?

Copilot Business is $10 per user per month and Copilot Enterprise is $19 per user per month. The individual Copilot Pro plan is $10 per month. The seat tier determines how many premium requests are included.

Are inline completions billed now too?

No. On paid plans, chat with the default model and inline completions are not metered. Premium requests cover the more expensive features and model selections.

What happens when the premium quota runs out?

By default, requests start failing with quota errors. If you have configured a spending limit, overage is allowed at $0.04 per request up to that cap.

Are Actions minutes for Copilot code review billed separately?

No. They draw from the same Actions minutes pool as the rest of your CI. Track total Actions usage and adjust workflow triggers and path filters as needed.

Can Copilot code review be disabled entirely?

Yes. Organization admins can opt repositories out at the policy level. The same setting also controls per-team enablement.

Does Copilot review work on private API specs?

Yes, it works in private repositories, but private repositories consume Actions minutes. The review reads spec and handler files just like any other source code.

Does Copilot review also consume premium requests?

Currently it consumes Actions minutes only. The model the reviewer uses is part of the Copilot platform and is not billed separately as premium requests. This may change, though, so watch GitHub’s changelog.

Teams running both Copilot review and direct model API calls in CI should also check per-token costs in the GPT-5.5 free Codex guide. With Apidog, you can pass PRs through the mock and contract layer first and run AI review only on PRs that clear the cheap checks.