GitHub's New Data Policy: Your Code Becomes Training Data : mgks.dev

GitHub just announced that starting April 24, they’ll be using interaction data from Copilot Free, Pro, and Pro+ users to train their AI models. This includes your inputs, outputs, code snippets, and associated context. Unless you opt out, of course.

I’ve been watching the AI tooling space closely, and this move feels like a watershed moment. Not because it’s surprising, but because it makes explicit what we’ve all suspected was coming. The artificial intelligence tools we use daily need data to improve, and that data has to come from somewhere.

The Devil in the Details

GitHub is careful to frame this as “aligning with established industry practices,” which is corporate speak for “everyone else is doing it too.” They’re not wrong. Most AI companies have been training on user data in various forms. What’s different here is the directness of it.

If you’re on Copilot Business or Enterprise, you’re safe. Your data isn’t included in this update. That’s an interesting line to draw. It suggests that GitHub understands the sensitivity around corporate code and intellectual property. Individual developers and small teams on the lower tiers? Apparently, that’s a different calculation.

The opt-out mechanism exists in your privacy settings, and GitHub claims they’ve preserved previous opt-out preferences. I want to believe that’s true, but I also know how these settings migrations tend to go. If you care about this, go check your settings right now. Don’t assume your previous choices carried over correctly.

What They’re Actually Collecting

The list is pretty comprehensive. Code snippets you accept or reject. The prompts you write. Error messages. Comments. Basically, your entire development session becomes training data. They’re watching how you use the tool, what works, what doesn’t, and feeding that back into the model.

GitHub mentions they’ve already been doing this with Microsoft employees and saw “meaningful improvements, including increased acceptance rates in multiple languages.” That’s the pitch, right? Your data makes the tool better for everyone. It’s the same argument every platform makes when they want to monetize user behavior.

I’m not entirely cynical about this. Real-world interaction data probably does improve machine learning models. The question isn’t whether it works, it’s whether the tradeoff is worth it.

The Open Source Irony

Here’s what gets me. GitHub built its empire on open source. Developers freely shared their code, collaborated publicly, and built the foundation that made GitHub valuable. Now that same ethos is being used to justify extracting value from individual developers to train proprietary AI models.

Sure, you can opt out. But how many developers will actually do that? How many even read these announcements? The default matters, and the default here is “yes, use my data.”

The data won’t be shared with third-party AI providers, which is something. It stays within the Microsoft family. Whether that’s meaningful depends on how you feel about Microsoft’s growing dominance in the developer tools space.

The Bigger Picture

This isn’t just about GitHub. It’s about the fundamental economics of AI development. These models are expensive to train and improve. The companies building them need massive amounts of data. Public datasets only get you so far. Eventually, you need real usage data from real developers doing real work.

What we’re seeing is the inevitable collision between free or cheap AI tools and the need to fund their development. GitHub isn’t a charity. Copilot costs money to run. If you’re not paying enough to cover those costs, then you’re not the customer. You’re the product. Or more specifically, your coding patterns are the product.

I don’t think GitHub is being uniquely evil here. They’re just being honest about something most AI companies do quietly. At least they’re telling you and giving you a choice, even if that choice requires you to actively opt out.

What Should Developers Do?

First, actually read your privacy settings. Not just on GitHub, but on every AI tool you use. Understand what data you’re sharing and make an informed choice.

Second, think about what code you’re writing where. If you’re working on something sensitive, maybe don’t use AI assistance, or use it through an Enterprise tier with better privacy guarantees.

Third, recognize that this is the new normal. AI tools will get better by learning from how we use them. That’s not inherently bad, but it does mean we need to be more conscious about our digital exhaust.

The tools we use to build software are becoming smarter by watching us build software, and whether that’s a virtuous cycle or a Faustian bargain probably depends on who’s asking and what they’re building.

GitHub's New Data Policy: Your Code Becomes Training Data

The Devil in the Details

What They’re Actually Collecting

The Open Source Irony

The Bigger Picture

What Should Developers Do?

Read Next

Datasette Agent: When Three Years of Tooling Finally Converge

When Your Code Becomes a Protest Sign: What the Google Nimbus Controversy Means for Developers