> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
>
> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
That's very concerning wording... I guess even private repositories are being fed into AI, at least that's what that policy explicitly allows.
This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.
>Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!
And, thanks to politics around social media censorship, we have way too many people willing to say, "Don't like the terms, don't use the platform!" to the point of normalization. Sad.
The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled, especially given the capitalist nature of our economy.
Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.
> The only solution is to self-host. Gitea is good.
I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/
Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall into my definition of non-human eyes too.
Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.
These tools are enabled by the owner of the repository, and I think consent has precedence over license terms. But this seems like an escape hatch for using your code as training material.
If it doesn't explicitly say "we do not use private repo data to train our AI models", I wouldn't assume anything other than that's exactly what they are doing. They know this is a question everyone wants to know the answer to. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.
Everything referenced on this webpage refers only to public data, or private data where a user has enabled the dependency graph.
>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph
So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.
If their AI uses my code as the basis to generate similar code, how is that not a derivative work?
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.
Besides, the "except" part qualifying the "human eyes" is substantial.
GitHub is a Californian company; they are beholden to US regulation that requires them to hand over data upon legal request, and they can be prohibited from telling you they did. These requests happen frequently (see GitHub's transparency reports: https://github.blog/tag/github-transparency-report/).
> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.
Concerning national security letters:
> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.
Now everyone knows why Codeberg was invented. This type of thing is entirely foreseeable. Good luck not having all your IP be copy-pasteable to anyone using an LLM.
*sigh* I moved all my projects (from Bitbucket and GitHub) to GitLab.
Now it appears GitLab is feeding its data to Google [1].
I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.
[1] https://news.ycombinator.com/item?id=36445526
GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.
The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?
AIUI the AI can't be used to answer any questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that contents from the training won't appear in the output.
But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.
And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"
GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.
We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology reaches maturity. As for the destiny of current private GitHub repos, their contents may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly given the bans on using ChatGPT that various firms have implemented.
Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.
My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.
Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)
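For what it's worth, here is roughly what "very basic personal git hosting" looks like; a sketch only, with a made-up hostname and path, assuming you have SSH access to the server:

    # one-time setup: create a bare repository on the server
    ssh me@example.org 'git init --bare ~/repos/myproject.git'

    # point the local clone at it and push
    git remote add origin me@example.org:repos/myproject.git
    git push -u origin main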
How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.
Next people will be surprised that GitHub has the right to show your source code on a website....
There is no current reason to believe that Copilot or ChatGPT are trained on private repositories; however this agreement does permit them to do so.
The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (to the letter of the law) to flip that bit. Why wouldn't you, eventually?
>however this agreement does permit them to do so.
How so? Assuming you read the whole agreement and not just take a single line out of context?
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?
Tell me – how can Github do all of this without programmatically accessing the contents of your private repos from their servers?
As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.
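For what it's worth, most of the features listed above have plain-git equivalents that run entirely on your own machine; the file paths here are just placeholders:

    git grep "needle"            # search the working tree locally
    git blame src/main.c         # per-line attribution
    git log -p -- src/main.c     # history with diffs for one file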
Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.
Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.
Or - how can any provider other than GitHub, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?
2. Build small- to mid-size businesses, that can prioritize healthy growth – like focusing on users, being kind to employees, building brand equity over time.
The pump-and-dump grift cycle of enshittification has got to stop.
As soon as finance capital gets involved, the core objective of the business shifts to primarily providing higher returns for the investors. Small to medium businesses can have diversity, but once they "go public" they are all almost alike.
Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change into an exploitative cancer on society like the rest.
There's a whole industry looking for companies that are too nice to their employees and customers to buy, exploit, grab profits from, and leave a husk behind.
Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.
Computing a hash for each file? Doesn’t seem like looking either.
Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.
I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
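Roughly, in code; a sketch only, with a made-up file name, just to illustrate the difference in the derived state:

    import hashlib
    from collections import Counter

    contents = open("main.py", "rb").read()

    # a hash: a fixed 32 bytes; nothing about the contents can be
    # inferred from it
    digest = hashlib.sha256(contents).digest()

    # word frequencies: derived state that genuinely describes the
    # contents - by the argument above, storing this is already a "look"
    frequencies = Counter(contents.decode("utf-8", "replace").split())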
Postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on the contents of your postcards.
It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.
Not related to private repositories, but GitHub also keeps logs about commits in the "Activity Graph" even if those commits no longer exist.
For example, clearing the full reflog of a repository and then either force-pushing over the remote repository, or deleting it and pushing it again under a different name, will keep the activity attribution; in the second case it might even attribute the commits to the new remote (404ing when clicking "Created x commits in y repositories").
Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.
The reflog is data local to a git node, not part of the repository. In particular, reflogs are never transmitted between remotes. So "clearing the reflog" (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which is what the docs advise you to do if you've force-pushed over something you want to be fully deleted.
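To illustrate the locality; a sketch using the usual default branch and remote names:

    # expire the local reflog and prune unreachable objects;
    # this touches only your local clone, nothing at GitHub
    git reflog expire --expire=now --all
    git gc --prune=now

    # a force push rewrites the branch ref on the remote, but objects
    # GitHub already received may still be retained server-side
    git push --force origin main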
The only solution is to self-host. Gitea is good.
The Gitea project hosts its code on GitHub: https://github.com/go-gitea/gitea. You must admit that is a bit ironic.
> age of massive ever-growing out of control tech monopolies that do whatever the fuck they want
GitHub is not the only option for source code hosting. There are alternatives like GitLab, Bitbucket, and numerous smaller ones.
Use of data for AI training is a big deal. Implicitly allowing it under this definition will be enough cause for a lawsuit.
That doesn't give a lot of comfort. ToS can change at any time.
To disable the dependency graph:
1. Click your avatar
2. Settings
3. Security / Code security and analysis
4. Dependency graph [Disable all]
You're welcome
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
How would you know the LLM used your work specifically and then prove that in court?
I love how they say that as if it's a meaningful distinction in some way.
OpenAI is a third party...?
I strongly doubt that this language would support the use of private repositories for training generative systems.
How do you think they do that? Crystal ball?
Not everything is about AI
Because if people found out you were doing that you would lose vast numbers of paying customers, which would cost you a lot of money.
They need to reliably filter secrets or they are in trouble once Copilot suggests them.
Obviously GitHub can read my repos in order to display them on GitHub.com, but access must be fleeting.
They have a legitimate cause to read it, and while they're at it they use the data for everything else they want.
No way for you to know.
Accept that you cannot grow at the same speed, and focus on deep roots and fanatical customers.
And if you don't agree, just remember that my claim hinges on the definition of the words "you", "definition", "got" and "a".
GitHub does all those things with your commits.
Not really analogous.
Nothing is stored at Amazon.
Ok, stored but no human has access.
Ok, humans have access but only special employees
Ok, some third party contractors get the data too.
GitHub should be making this blatantly clear and transparent for users, rather than using Facebook-esque cloak-and-dagger language.