I’ve been through enough incident reports to know when something systemic is breaking down. GitHub’s March was rough. Four separate incidents, each revealing different failure modes in what’s supposed to be the most reliable developer platform on the planet. Let’s talk about what actually happened and why you should care.
The first incident on March 3rd is the one that really gets me. A 40% failure rate on github.com requests isn’t just annoying; it’s the kind of thing that makes CTOs start evaluating alternatives. The root cause? A bug in a deployment meant to reduce write load on their user settings cache. Instead of reducing load, it expired every single user’s cache simultaneously, forcing recalculation and rewrites across the entire system.
Here’s the kicker: this was the same underlying issue they experienced in February. Same problem, different trigger. That’s not a one-off infrastructure hiccup. That’s a pattern.
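For anyone who hasn’t been bitten by this failure mode before: when every cache entry carries the same lifetime, or a bulk operation expires them all at once, every request misses at the same moment and the backing store absorbs the full recomputation load in one spike. Here’s a minimal sketch of the usual countermeasure, jittered TTLs plus staggered invalidation, assuming a Redis-style client that exposes an expire(key, seconds) call; the durations and rollout window are illustrative, not anything GitHub has published:

```python
import random

# Illustrative numbers: a 24-hour base TTL with up to 20% random jitter.
BASE_TTL_SECONDS = 24 * 60 * 60
JITTER_FRACTION = 0.20

def jittered_ttl(base=BASE_TTL_SECONDS, jitter=JITTER_FRACTION):
    """Return a TTL spread randomly around the base value.

    If every key gets exactly the same lifetime, they all come due together
    and every request misses at once, stampeding the backing store. Jitter
    spreads the recomputation out over time.
    """
    delta = int(base * jitter)
    return base + random.randint(-delta, delta)

def invalidate_gradually(cache, keys, window_seconds=3600):
    """Rather than expiring every key immediately, reschedule each key's
    expiry to a random point inside a rollout window so rewrites trickle
    in instead of arriving simultaneously."""
    for key in keys:
        cache.expire(key, random.randint(1, window_seconds))
```

None of this is exotic; it’s the kind of guardrail that turns “we expired everyone’s cache” into a slow afternoon instead of a 40% failure rate.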
When Configuration Changes Become Weapons
The March 5th Actions incident tells a different story about infrastructure complexity. GitHub was rolling out Redis updates to improve resiliency (ironic, I know) and managed to push incorrect load balancer configurations to production. The result? 95% of workflow runs delayed by an average of 30 minutes, with 10% failing outright.
What strikes me here is how a configuration change, something that should be relatively safe and routine, can cascade into a major incident. This is the reality of operating at GitHub’s scale. You can’t just SSH into a box and fix a config file. Everything is distributed, everything is automated, and when that automation goes sideways, the blast radius is enormous.
They’re now working on automation to prevent incorrect configurations from propagating. Good. That should have existed before they started rolling out production changes.
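GitHub hasn’t said what that automation will look like, but the general shape is well understood: check a candidate config against basic invariants (and ideally a canary slice of traffic) before it’s allowed to fan out. A rough sketch, with invented field names that are not GitHub’s actual load balancer schema:

```python
from dataclasses import dataclass

@dataclass
class LoadBalancerConfig:
    # Illustrative fields only, not GitHub's real schema.
    backend_hosts: list
    healthy_threshold: int
    max_connections: int

def validate(config):
    """Return a list of human-readable problems; empty means the config
    may propagate."""
    problems = []
    if not config.backend_hosts:
        problems.append("no backend hosts configured")
    if config.healthy_threshold < 1:
        problems.append("healthy_threshold must be at least 1")
    if config.max_connections <= 0:
        problems.append("max_connections must be positive")
    return problems

def propagate(config, deploy):
    """Refuse to fan out a config that fails static checks; a real pipeline
    would also watch a canary before going wide."""
    problems = validate(config)
    if problems:
        raise ValueError(f"refusing to propagate config: {problems}")
    deploy(config)
```

The point isn’t the specific checks, it’s that the gate exists at all and that the deploy tooling refuses to proceed when it fails.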
The Authentication Credential Fiasco
The Copilot Agent incidents on March 19th and 20th are almost textbook examples of incomplete remediation. First incident: a problem with authentication credentials prevented the service from reaching its datastore, producing an average error rate of 53% and a peak of 93%. They rotated credentials, declared victory, moved on.
Then it happened again the next day. This time hitting 99-100% error rates with what they describe as “significant retry amplification.” Translation: their systems were frantically retrying failed requests, making everything worse. The second incident was a direct result of not fully fixing the first one.
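The report doesn’t describe how those retries were configured, but “retry amplification” almost always means immediate, effectively unbounded retries hammering an already-failing dependency. The standard countermeasure is a hard attempt limit with capped, jittered exponential backoff; here’s a sketch:

```python
import random
import time

def call_with_backoff(request, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry a failing call a bounded number of times, sleeping an
    exponentially growing, randomly jittered interval between attempts.

    Unbounded immediate retries multiply load on a struggling dependency;
    bounding attempts and spacing them out keeps retry traffic from making
    the outage worse.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # "full jitter"
```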
I’ve seen this movie before. The pressure to restore service quickly often means the deeper systemic issues get a Band-Aid instead of surgery. GitHub’s response of implementing automated monitoring for credential lifecycle events is the right move, but it came after two preventable outages.
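What that monitoring looks like inside GitHub is anyone’s guess, but the minimum viable version is a scheduled job that flags credentials before they expire rather than after requests start failing. A sketch with invented names and dates:

```python
from datetime import datetime, timedelta, timezone

# Invented example data: credential name mapped to its expiry timestamp.
CREDENTIALS = {
    "copilot-datastore-token": datetime(2026, 4, 30, tzinfo=timezone.utc),
}

ALERT_LEAD_TIME = timedelta(days=7)

def credentials_needing_rotation(now=None):
    """Return credentials expiring within the alert window, so they can be
    rotated proactively instead of discovered via a production outage."""
    now = now or datetime.now(timezone.utc)
    return [name for name, expires_at in CREDENTIALS.items()
            if expires_at - now <= ALERT_LEAD_TIME]
```

Run it on a schedule and page on a non-empty result. It won’t catch every credential problem, but it catches the slow-moving ones.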
The Upstream Dependency Problem
The March 24th Teams integration outage is less interesting technically but more important strategically. GitHub depends on infrastructure run by Microsoft, their parent company, and when that infrastructure has problems, GitHub goes down with it. 37.4% average error rate, peaking at 90%.
There’s nothing GitHub could have done to prevent the upstream outage, but that’s exactly the point. When you’re deeply integrated into another company’s ecosystem, even if it’s your own parent company, you inherit their reliability problems. This is the bargain every platform makes when they choose integration over isolation.
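You can’t prevent an upstream outage, but you can keep it from consuming your own capacity while it lasts. The usual pattern is a circuit breaker: once the dependency is clearly down, fail fast or serve a degraded response instead of letting every request queue up and time out against it. A simplified, in-process sketch; I’m not claiming this is how GitHub’s Teams integration is actually built:

```python
import time

class CircuitBreaker:
    """Fail fast once an upstream dependency is clearly down, instead of
    letting every request wait on it and tie up resources."""

    def __init__(self, failure_threshold=5, reset_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def call(self, upstream, fallback):
        # While the circuit is open, skip the upstream entirely until the
        # reset window passes, then let one trial request through.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                return fallback()
            self.opened_at = None
            self.failures = 0
        try:
            result = upstream()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

It doesn’t make the integration work while Microsoft’s side is down; it just keeps the failure contained to the feature that actually depends on it.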
What This Means for the Rest of Us
I’m not writing this to dunk on GitHub. They’re being remarkably transparent about their failures, and that transparency is valuable. What concerns me is what these incidents reveal about building and operating software at true scale.
GitHub has essentially unlimited resources, some of the best engineering talent in the industry, and years of experience running critical infrastructure. And they still had four major incidents in one month. They still deployed the same bug that caused an outage the previous month. They still pushed bad configurations to production. They still had incomplete incident remediation that caused a repeat outage the next day.
If GitHub struggles with this, what chance do the rest of us have? The answer isn’t pessimism, it’s realism about what operating at scale actually requires. It requires defense in depth. It requires assuming every deployment might go wrong. It requires automation that prevents bad changes from reaching production, not just automation that makes deployments faster.
The March incidents show us that artificial intelligence and sophisticated tooling aren’t enough. GitHub uses AI extensively in their own operations, yet they still hit the same problems every engineering organization faces: cache invalidation, configuration management, credential rotation, and dependency management.
The Architectural Debt Problem
GitHub mentions they’re making “substantial, long-term investments” and doing “deep architectural work” to improve resilience. That’s code for “we have significant technical debt and it’s going to take years to fix.” I respect the honesty, but it’s also a reminder that architectural decisions made years ago continue to haunt you.
The user settings cache issue that caused problems in both February and March suggests a fundamental design problem. When a system can’t handle expiring and recalculating user caches without causing cascading failures, the problem isn’t the deployment bug. The problem is that the system architecture allows a cache operation to bring down the entire platform.
This is the kind of thing you don’t notice until you’re operating at massive scale. The patterns that worked fine for millions of users start breaking at hundreds of millions. The caching strategy that seemed reasonable five years ago becomes a single point of failure.
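One way architectures climb out of that corner is request coalescing: when many requests miss the same key at once, exactly one of them recomputes the value and the rest wait briefly for that result, so a mass expiry doesn’t translate one-for-one into backend load. A single-process sketch; at GitHub’s scale the same idea has to be coordinated across machines, which is where the hard part lives:

```python
import threading

class CoalescingCache:
    """When many requests miss the same key simultaneously, only one of
    them recomputes the value; the rest block briefly and reuse it."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()

    def get(self, key, compute):
        value = self._values.get(key)
        if value is not None:
            return value
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have finished the recompute
            # while we were waiting on the per-key lock.
            value = self._values.get(key)
            if value is None:
                value = compute()
                self._values[key] = value
            return value
```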
Transparency as Service Quality
One thing GitHub does better than most: their incident reports are detailed, technical, and honest about what went wrong. They don’t hide behind vague language about “service degradation due to infrastructure issues.” They tell you exactly what failed, why it failed, and what they’re doing about it.
This transparency is valuable not just for GitHub’s users but for the entire industry. Every engineer reading these reports learns something about operating at scale, about what failure modes to watch for, about what architectural decisions have downstream consequences.
But transparency alone doesn’t prevent outages. GitHub’s March proves that you can have great incident reports and still have the same problems month after month. The gap between knowing what’s wrong and fixing what’s wrong is where most engineering organizations live.
I keep coming back to that March 3rd incident where they hit the same underlying issue from February. That’s the kind of thing that makes you question whether the “urgent, targeted improvements” they mention will actually address the root causes or just add more Band-Aids to a system that needs fundamental rethinking.
The real test isn’t whether GitHub can write good incident reports about their outages; it’s whether the architectural work they’re doing will actually prevent these classes of failures in the future. And given the complexity of what they’ve built, I’m not sure anyone knows if that’s even possible.