> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.
>
> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.
That's very concerning wording... I guess even private repositories are being fed into AI, at least that's what that policy explicitly allows.
This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.
>Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.
This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!
And, thanks to politics around social media censorship, we have way too many people willing to say, "Don't like the terms, don't use the platform!" to the point of normalization. Sad.
The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled, especially given the capitalist nature of our economy.
Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.
> The only solution is to self-host. Gitea is good.
I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/
Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall into my definition of non-human eyes too.
Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.
These tools are enabled by the owner of the repository, and I think consent has precedence over license terms. But this seems like an escape hatch for using your code as training material.
If it doesn't explicitly say "we do not use private repo data to train our AI models", I wouldn't assume anything other than that's exactly what they are doing. They know this is a question everyone wants to know the answer to. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.
Everything referenced on this webpage refers only to public data, or private data where a user has enabled the dependency graph.
>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph
So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.
If their AI uses my code as the basis to generate similar code, how is that not a derivative work?
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.
Besides, the "except" part qualifying the "human eyes" is substantial.
GitHub is a Californian company; they are beholden to US regulation that requires them to hand over data upon legal request, and they can be prohibited from telling you they did. These requests happen frequently (see GitHub's transparency reports: https://github.blog/tag/github-transparency-report/).
> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.
Concerning national security letters:
> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.
Now everyone knows why Codeberg was invented. This type of thing is entirely foreseeable. Good luck not having all your IP be copy-pasteable to anyone using an LLM.
*sigh* I moved all my projects (from Bitbucket and GitHub) to GitLab.
Now it appears GitLab is feeding its data to Google [1].
I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.
[1] https://news.ycombinator.com/item?id=36445526
GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.
The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?
AIUI the AI can't be used to answer any questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that contents from the training won't appear in the output.
But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.
And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"
GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.
We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology reaches maturity. As for the destiny of current private GitHub repos, their contents may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly given the bans on using ChatGPT that various firms have implemented.
Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.
My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.
Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)
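For what it's worth, here is roughly what "very basic personal git hosting" looks like; a sketch only, with a made-up hostname and path, assuming you have SSH access to the server:

    # one-time setup: create a bare repository on the server
    ssh me@example.org 'git init --bare ~/repos/myproject.git'

    # point the local clone at it and push
    git remote add origin me@example.org:repos/myproject.git
    git push -u origin main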
How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.
Next people will be surprised that GitHub has the right to show your source code on a website....
There is no current reason to believe that Copilot or ChatGPT are trained on private repositories; however this agreement does permit them to do so.
The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (to the letter of the law) to flip that bit. Why wouldn't you, eventually?
>however this agreement does permit them to do so.
How so? Assuming you read the whole agreement and not just take a single line out of context?
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?
Tell me – how can Github do all of this without programmatically accessing the contents of your private repos from their servers?
As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:
"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."
No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.
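For what it's worth, most of the features listed above have plain-git equivalents that run entirely on your own machine; the file paths here are just placeholders:

    git grep "needle"            # search the working tree locally
    git blame src/main.c         # per-line attribution
    git log -p -- src/main.c     # history with diffs for one file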
Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.
Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.
Or - how can any provider other than GitHub, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?
2. Build small- to mid-size businesses, that can prioritize healthy growth – like focusing on users, being kind to employees, building brand equity over time.
The pump-and-dump grift cycle of enshittification has got to stop.
As soon as finance capital gets involved, the core objective of the business shifts to primarily providing higher returns for the investors. Small to medium businesses can have diversity, but once they "go public" they are all almost alike.
Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change into an exploitative cancer on society like the rest.
There's a whole industry looking for companies that are too nice to their employees and customers to buy, exploit, grab profits from, and leave a husk behind.
Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.
Computing a hash for each file? Doesn’t seem like looking either.
Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.
I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
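Roughly, in code; a sketch only, with a made-up file name, just to illustrate the difference in the derived state:

    import hashlib
    from collections import Counter

    contents = open("main.py", "rb").read()

    # a hash: a fixed 32 bytes; nothing about the contents can be
    # inferred from it
    digest = hashlib.sha256(contents).digest()

    # word frequencies: derived state that genuinely describes the
    # contents - by the argument above, storing this is already a "look"
    frequencies = Counter(contents.decode("utf-8", "replace").split())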
Postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on the contents of your postcards.
It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.
Not related to private repositories, but GitHub also keeps logs about commits in the "Activity Graph" even if those commits no longer exist.
For example, clearing the full reflog of a repository and then either force-pushing over the remote repository, or deleting it and pushing it again under a different name, will keep the activity attribution; in the second case it might even attribute the commits to the new remote (404ing when clicking "Created x commits in y repositories").
Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.
The reflog is data local to a git node, not part of the repository. In particular, reflogs are never transmitted between remotes. So "clearing the reflog" (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which is what the docs advise you to do if you've force-pushed over something you want to be fully deleted.
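To illustrate the locality; a sketch using the usual default branch and remote names:

    # expire the local reflog and prune unreachable objects;
    # this touches only your local clone, nothing at GitHub
    git reflog expire --expire=now --all
    git gc --prune=now

    # a force push rewrites the branch ref on the remote, but objects
    # GitHub already received may still be retained server-side
    git push --force origin main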
The only solution is to self-host. Gitea is good.
The Gitea project hosts its code on GitHub: https://github.com/go-gitea/gitea. You must admit that is a bit ironic.
> age of massive ever-growing out of control tech monopolies that do whatever the fuck they want
GitHub is not the only option for source code hosting. There are alternatives like GitLab, Bitbucket, and numerous smaller ones.
Use of data for AI training is a big deal. Implicitly allowing it under this definition will be enough cause for a lawsuit.
That doesn't give a lot of comfort. ToS can change at any time.
To disable the dependency graph:
1. Click your avatar
2. Settings
3. Security / Code security and analysis
4. Dependency graph [Disable all]
You're welcome
If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.
How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?
How would you know the LLM used your work specifically and then prove that in court?
I love how they say that as if it's a meaningful distinction in some way.
OpenAI is a third party...?
I strongly doubt that this language would support the use of private repositories for training generative systems.
How do you think they do that? Crystal ball?
Not everything is about AI
Because if people found out you were doing that you would lose vast numbers of paying customers, which would cost you a lot of money.
They need to reliably filter secrets or they are in trouble once Copilot suggests them.
Obviously GitHub can read my repos in order to display them on GitHub.com, but access must be fleeting.
They have a legitimate cause to read it, and while they're at it they use the data for everything else they want.
No way for you to know.
Accept that you cannot grow at the same speed, and focus on deep roots and fanatical customers.
And if you don't agree, just remember that my claim hinges on the definition of the words "you", "definition", "got" and "a".
GitHub does all those things with your commits.
Not really analogous.
Nothing is stored at Amazon.
Ok, stored but no human has access.
Ok, humans have access but only special employees
Ok, some third party contractors get the data too.
GitHub should be making this blatantly clear and transparent for users, rather than using Facebook-esque cloak-and-dagger language.