Readit News
rany_ · 2 years ago
> Private repository data is scanned by machine and never read by GitHub staff. Human eyes will never see the contents of your private repositories, except as described in our Terms of Service.

> Your individual personal or repository data will not be shared with third parties. We may share aggregate data learned from our analysis with our partners.

That's very concerning wording... I guess even private repositories are being fed into AI; at least, that's what the policy explicitly allows.

nmcela · 2 years ago
This is huge, and unfortunately not surprising at all in the age of massive, ever-growing, out-of-control tech monopolies that do whatever the fuck they want. Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.

Every service and utility gets enshittified sooner or later; it's a given at the moment. I deleted all my private repos; GitHub and all other MS services should be avoided in the future.

The only solution is to self-host. Gitea is good.

mhaberl · 2 years ago
> The only solution is to self-host. Gitea is good.

The Gitea project hosts its code on GitHub: https://github.com/go-gitea/gitea. You must admit that's a bit ironic.

> age of massive ever-growing out of control tech monopolies that do whatever the fuck they want

GitHub is not the only option for source code hosting. There are alternatives like GitLab, Bitbucket, and numerous smaller ones.

itsoktocry · 2 years ago
>Whatever the ToS says now, they can and will just reword it when they need to. There's no trust.

This is what is crazy to me. You can agree to terms, build infrastructure around terms you agreed to, then those terms can completely change. Don't like it? Click disagree and we'll close your account, no problem!

And, thanks to politics around social media censorship, we have way too many people willing to say, "Don't like the terms, don't use the platform!" to the point of normalization. Sad.

cmsonger · 2 years ago
I am agreeing and adding another solution.

The other solution is political. There's a reason that governments regulate and define economic rules of the road. This is a good example of where governments need to step in. The link between generative AI and the data it is trained on needs to be carefully thought through and properly handled especially given the capitalist nature of our economy.

bayindirh · 2 years ago
There is always SourceHut (https://sourcehut.org) if you want.
sureglymop · 2 years ago
Do you have experience with self-hosting Gitea? I am on the fence about going with Gitea because of the recent fork of the project (Forgejo). It seems that many contributors are now contributing mainly to Forgejo.

Deleted Comment

prox · 2 years ago
Cyberpunk 2077 here we come!
andsoitis · 2 years ago
> The only solution is to self-host. Gitea is good.

I don’t understand your thinking and gitea’s marketing. They say in the same breath that it’s “self-hosting” and that they do “Git hosting… similar to GitHub, BitBucket, and GitLab”. — https://docs.gitea.com/

endymi0n · 2 years ago
Playing devil's advocate: all kinds of linting, vetting, or security scanning with any degree of smartness beyond a regex would probably fall under my definition of non-human eyes too.

Slippery slope, yes, but in reality there is a legal framework and a ToS in place as well.

skydhash · 2 years ago
These tools are enabled by the owner of the repository, and I think consent takes precedence over license terms. But this seems like an escape hatch for using your code as training material.
Mystery-Machine · 2 years ago
If it doesn't explicitly say: we do not use private repos data to train our AI models, I wouldn't even consider assuming any other way than that's exactly what they are doing. They know this is a question everyone wants to know the answer to. Why would they leave any ambiguity? Let me help you answer that: because that's exactly what they are doing.
rany_ · 2 years ago
Also dependency graph is enabled by default. So there's that.
sheepscreek · 2 years ago
This. I believe the title of this post is clickbait, and doesn’t include the context of this verbiage.

Use of data for AI training is a big deal. Implicitly allowing it under this definition will be enough cause for a lawsuit.

JohnFen · 2 years ago
> in reality there is a legal framework and a ToS in place as well.

That doesn't give a lot of comfort. ToS can change at any time.

Shrezzing · 2 years ago
Everything referenced in this webpage refers only to public data, or private data where a user has enabled the dependency graph

>data from public repositories, and also... data from private repositories when a repository's owner has chosen to share the data with GitHub by enabling the dependency graph

So if you don't enable the Dep Graph feature, your private code is treated according to the standard terms of service.

imdsm · 2 years ago
For anyone wondering:

1. Click your avatar

2. Settings

3. Security / Code security and analysis

4. Dependency graph [Disable all]

You're welcome

rany_ · 2 years ago
So give them an inch (Dep Graph feature) and they'll take a mile (loss of confidentiality); that's not any better...
hajile · 2 years ago
If their AI uses my code as the basis to generate similar code, how is that not a derivative work?

If I make my own image of Spider-Man or some similar IP, that image is still a derivative and subject to copyright claims. Even if I claimed “I never knew about a character like this before”, I’d still get hit with a massive infringement claim.

How is predictive AI any different? If you feed in copyrighted or trade secret material, how are you not responsible for suspect output?

marcus0x62 · 2 years ago
> If their AI uses my code as the basis to generate similar code, how is that not a derivative work?

How would you know the LLM used your work specifically and then prove that in court?

CatWChainsaw · 2 years ago
Truly curious, do you feel the same way about art whose creator/s did not go out of their way to copyright images?
api · 2 years ago
It’s like saying you own an image after JPEG compressing it. There are going to be lawsuits, but MS is so big they might just consider settling them part of the cost of making sure they are on the cutting edge of generative AI.
PeterStuer · 2 years ago
Besides, the "except" part qualifying the "human eyes" is substantial.

GitHub is a Californian company; it is beholden to US regulations that require it to hand over data upon legal request, and it can be prohibited from telling you it did. These requests happen frequently.

simonw · 2 years ago
GitHub publishes a transparency report revealing how many of these requests it has processed.

https://github.blog/tag/github-transparency-report/

> In 2022, GitHub received and processed 432 requests to disclose user information, as compared to 335 in 2021. Of those 432 requests, 274 were subpoenas (with 265 of those subpoenas being criminal or from government agencies and 9 being civil), 97 were court orders, and 22 were search warrants.

Concerning national security letters:

> We’re very limited in what we can legally disclose about national security letters and Foreign Intelligence Surveillance Act (FISA) orders. We report information about these types of requests in ranges of 250, starting with zero. As shown below, we received 0–249 notices from July to December 2022, affecting 250–499 accounts.

thumbuddy · 2 years ago
Now everyone knows why Codeberg was invented. This type of thing is entirely foreseeable. Good luck keeping your IP from being copy-pasteable by anyone using an LLM.
bhrgunatha · 2 years ago
sigh I moved all my projects (from Bitbucket and Github) to Gitlab. Now it appears Gitlab is feeding its data to Google [1].

I do have a codeberg account so will consider moving my repositories again, although it's probably too late and has all been sucked up by Google already.

[1] https://news.ycombinator.com/item?id=36445526

orthoxerox · 2 years ago
We need GPLv4 that explicitly says that models trained on GPLv4 code and anything they produce are derivative works.
ralph84 · 2 years ago
GitHub/MSFT/OpenAI’s stance is that training AI models is fair use and doesn’t require a license. It doesn’t matter what kind of license you slap on the code. If you don’t think it should be fair use the solution is getting governments to amend copyright law to clarify fair use, not a new software license.
Arnt · 2 years ago
The policy says human eyes will never see the contents of your private repositories. Suppose, as you say, the repos are one day fed into an AI. How does that policy then constrain the AI later?

AIUI the AI can't be used to answer questions based on knowledge from training that included private repositories, since (still AIUI) there's no guarantee that content from the training won't appear in the output.

But there seems to be a loophole for yes/no questions. An AI that answers only yes/no could be trained on private repositories, or run on them. So they could, say, train an AI to recognise repos that require working hours from their legal staff, or from their support staff, and then run that AI on all repos to locate repos that cause expense in the future. Things like that seem possible.

And an AI might answer questions like this: "Does any repo contain the string 845fjkef5urwejf in a Kotlin file?"
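That last yes/no question doesn't even need an AI; a plain scan answers it, which shows how weak the "no human eyes" guarantee is once automated answers about private content are allowed. A minimal sketch (function and argument names are hypothetical, not any real GitHub tooling):

```python
import os

def repo_contains(repo_root: str, needle: str, ext: str = ".kt") -> bool:
    """Answer a yes/no question: does any file with the given
    extension under repo_root contain the needle string?"""
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        for name in filenames:
            if not name.endswith(ext):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if needle in f.read():
                        return True
            except OSError:
                continue
    return False
```

A single boolean leaks far less than file contents, but repeated queries like this can still map out what private repos contain.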

JohnFen · 2 years ago
> The policy says human eyes will never see the contents of your private repositories.

I love how they say that as if it's a meaningful distinction in some way.

blibble · 2 years ago
> Your individual personal or repository data will not be shared with third parties.

OpenAI is a third party..?

rany_ · 2 years ago
GitHub owns the Copilot model; it is based on OpenAI's Codex but they own and operate it themselves. So technically feeding it into their AI would be OK by those terms. That's why Copilot still exists even after Codex was killed off by OpenAI.
visarga · 2 years ago
We’re still in the nascent stages of code models. There’s ample opportunity to transition to private models before the technology matures. As for the fate of current private GitHub repos, they may contribute to training models (through what’s referred to as “aggregate data”), but the potential backlash could be considerable, particularly given the bans on using ChatGPT that various firms have implemented.
WuxiFingerHold · 2 years ago
Indeed. I wonder how many people don't see this. Or don't want to see this. Nobody should assume that a private Github repo is really private.

My employer (one of the largest in the world) has fully migrated to Office 365 and a lot of the new source code is on private Github repositories. But no worries, the humans working at Microsoft will not analyse the data. This is only done by AI.

mhaberl · 2 years ago
Is anyone aware of GitLab's policies regarding the privacy of private repositories? Do they have any "AI feeding" mechanisms in place?
indigochill · 2 years ago
Seems like one has to assume corporate repos have them. If they don't already, they could decide to at any time. If that's a problem, self-hosting puts the power back into the hands of the users (and can be done very cheaply - I'm paying ~$5/month for a server that's doing a bunch of stuff including very basic personal git hosting)
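The "very basic personal git hosting" mentioned above can literally be a bare repository on a cheap box, reached over SSH. A sketch, demonstrated entirely with local paths; on a real server the remote URL would look like ssh://git@yourserver/~/repos/myproject.git (server and user names are placeholders):

```shell
set -e
tmp="$(mktemp -d)"

# "Server side": a bare repository (no working tree, just the object store).
git init --bare "$tmp/myproject.git"

# "Client side": an ordinary repository pushing to it.
git init "$tmp/work"
cd "$tmp/work"
git -c user.name=me -c user.email=me@example.com \
    commit --allow-empty -m "first commit"
git remote add origin "$tmp/myproject.git"
git push -u origin HEAD
```

No web UI, no scanning, no third party: just git's own wire protocol.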
jasonlotito · 2 years ago
How do you think they provide all the services they provide? Secrets scanning, contextual source information, everything. I'm not sure how anyone could have gone this long and been surprised by this.

Next people will be surprised that GitHub has the right to show your source code on a website....

Deleted Comment

Deleted Comment

a-dub · 2 years ago
i think they are referring to automated abuse scanners which have been "made of ai" for decades.

i strongly doubt that this language would support the use of private repositories for training generative systems.

Sosh101 · 2 years ago
Right, that's a very loose definition of private.
cyanydeez · 2 years ago
Copilot will AI-wash your data.
hardware2win · 2 years ago
You guys realize that GitHub provides secret scanning and supply-chain (library) vulnerability warnings?

How do you think they do that? Crystal ball?

Not everything is about AI

dijit · 2 years ago
I think this is probably true. Critical software has been leaked, but that seems related to manually pasting data into ChatGPT: https://mashable.com/article/samsung-chatgpt-leak-details

There is currently no reason to believe that Copilot or ChatGPT are trained on private repositories; however, this agreement does permit them to be.

The concern is: why not? You have an enormous corpus of professional-grade software just sitting there, one bit flip away from access, and you are even permitted (by the letter of the law) to flip that bit. Why wouldn't you, eventually?

marcinzm · 2 years ago
>however this agreement does permit them to do so.

How so? Assuming you read the whole agreement and not just take a single line out of context?

"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."

simonw · 2 years ago
"why wouldn't you, eventually?"

Because if people found out you were doing that you would lose vast numbers of paying customers, which would cost you a lot of money.

torginus · 2 years ago
In the intelligence business, they talk about capabilities, not intents.
bionade24 · 2 years ago
> The concern is, why not?

They need to reliably filter secrets or they are in trouble once Copilot suggests them.

Deleted Comment

bunga-bunga · 2 years ago
Temporarily reading into RAM and applying a regex isn’t the same as feeding your repos to an LLM, which may store parts of them permanently.

Obviously GitHub can read my repos in order to display them on GitHub.com, but access must be fleeting.
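The "fleeting access" point can be made concrete: secret scanning of the kind GitHub offers is, at its core, pattern matching over content held only in memory. A toy sketch using one well-known rule (the AWS access key ID format); real scanners apply many such rules, and this is not GitHub's actual implementation:

```python
import re

# AWS access key IDs start with "AKIA" followed by 16 uppercase
# alphanumerics; this is one of the most commonly scanned-for patterns.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def scan_for_secrets(blob: str) -> list[str]:
    """Return candidate secrets found in the text. Nothing beyond
    the matches themselves is retained, unlike training a model,
    which bakes the text into weights."""
    return AWS_KEY_RE.findall(blob)
```

The scan reads everything, but its output is a handful of matches, not a compressed copy of the repository.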

Dead Comment

croes · 2 years ago
Of course it's about AI.

They have a legitimate cause to read, and while they're at it they can use the data for everything else they want.

No way for you to know.

paxys · 2 years ago
Do people want Github to display files in your private repos on github.com? Do you want syntax highlighting and intellisense? Do you want them tokenized for search? Should they be available for cloning when you need it? Should they run security scans and identify vulnerabilities? Should you be able to see git blame, diffs, commits, other metadata? Set up build pipelines? Modify contents via an API?

Tell me: how can GitHub do all of this without programmatically accessing the contents of your private repos from their servers?
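Take just one item from that list, tokenizing for search: even the simplest server-side code search needs to read every file and build something like an inverted index. A toy sketch (not GitHub's actual search, which is far more sophisticated):

```python
import re
from collections import defaultdict

# Identifiers: a letter or underscore followed by word characters.
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each identifier to the set of file paths containing it --
    the minimal data structure behind 'search my repo' features."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for path, text in files.items():
        for token in IDENT_RE.findall(text):
            index[token].add(path)
    return dict(index)
```

Note that the index itself is derived state about your code that now lives on the server, which is exactly the distinction drawn elsewhere in this thread.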

marcinzm · 2 years ago
As I see it people are basically taking a single line of an agreement that explicitly references other agreements out of context:

"If you enable data use for a private repository, we will continue to treat your private data, source code, or trade secrets as confidential and private consistent with our Terms of Service."

lopkeny12ko · 2 years ago
No, I don't want these features. I want reliable remote git storage, which lets me clone a copy of my source code anywhere so I can do these things locally, on my own machine. People forget that this was the original design and intent of Git as a decentralized protocol. As an industry I think we have become too reliant on Github as some kind of way-far-beyond-git magic tool. This might be fine until it isn't--like when Microsoft decides it's time to start training Copilot on your private code.
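Every feature in the list above has a plain-git local equivalent once you have a clone. A sketch using a throwaway local repository (paths and the commit are illustrative):

```shell
set -e
tmp="$(mktemp -d)"
git init "$tmp/demo"
cd "$tmp/demo"
echo "TODO: write docs" > README.md
git add README.md
git -c user.name=me -c user.email=me@example.com commit -m "initial"

# The hosted-UI features, reproduced locally with plain git:
git log --oneline        # commit history
git blame README.md      # per-line attribution
git grep "TODO"          # code search, no server-side index needed
git diff --stat HEAD     # diffs against the working tree
```

This is the decentralized design the comment refers to: the remote only needs to store and serve objects.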
lucasfdacunha · 2 years ago
Use another place to store your git projects then, it's not like Github is the only one. Github is a PRODUCT built around git, not just a place to store your git repositories. I'm pretty sure you will be able to find other services which only exist to host git projects and do nothing else.
misnome · 2 years ago
Doing remote storage _still_ requires their servers to read and process your repository data.
paxys · 2 years ago
Then you shouldn't be using Github at all. Problem solved.
lopkeny12ko · 2 years ago
Also, for what it's worth, I was extremely surprised and concerned when my private repos appeared in code search results. I had to triple-check that it wasn't showing up when logged out or logged in to a dummy account without access to the repo. This behavior is definitely not intuitive and if there's a way to disable search indexing on private repos I'd do so.
wg0 · 2 years ago
Or: how could any provider, even a self-hosted one, provide any of that without automated processing of the contents of private repositories?
23B1 · 2 years ago
1. Avoid VC and investors, and,

2. Build small- to mid-size businesses that can prioritize healthy growth: focusing on users, being kind to employees, building brand equity over time.

The pump-and-dump grift cycle of enshittification has got to stop.

tap-snap-or-nap · 2 years ago
As soon as finance capital gets involved, the core objective of business shifts to primarily providing higher returns for the investors. Small to medium businesses can have diversity but once they "go public" they all are almost alike.
ticviking · 2 years ago
Then don’t go public.

Accept that you cannot grow at the same speed and focus on deep roots and fanatical customers

izacus · 2 years ago
Yeah, as soon as MBAs and financial investors get involved (which, for "inefficient" small companies, is just a matter of time), the companies change into an exploitative cancer on society like the rest.

There's a whole industry looking for companies that are too nice to their employees and customers, to buy them, exploit them, grab the profits, and leave a husk behind.

jstanley · 2 years ago
How exactly is GitHub supposed to work if their computers are not allowed to look at your code? Git doesn't work by magic.
pavlov · 2 years ago
I guess it hinges on the definition of “looking”.

Receiving files, copying them within a file system, and transmitting the contents? Clearly not looking.

Computing a hash for each file? Doesn’t seem like looking either.

Parsing the contents, or tokenizing them for a ML model? That’s looking IMO.

I think the defining factor is the size and nature of the derived state update on the server. Copying and transmitting retains only metadata about the file. Computing a hash produces only a handful of bytes of state about the file, and there’s nothing that can be inferred about the file’s contents from the hash. But if you start storing something like word frequencies, that’s not neutral anymore, it’s a “look” into the contents.
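The hashing case is worth making concrete. Git itself hashes every file this way: SHA-1 over a small header plus the raw bytes. The 40-hex-character result is a few bytes of derived state that, per the argument above, doesn't constitute "looking":

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Hash a file the way git does for blob objects:
    sha1("blob <size>\\0" + raw bytes)."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()
```

Compare that to tokenizing for an ML model, where the derived state grows with, and encodes, the content itself.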

bentlegen · 2 years ago
What about parsing and tokenizing the code for code search?
einpoklum · 2 years ago
Once things hinge on the definition of commonly-used verbs and nouns, you've got a problem.

And if you don't agree, just remember that my claim hinges on the definition of the words "you", "definition", "got" and "a".

paxys · 2 years ago
Go to github.com, open your repository and open a file. It shows up on the screen. Is that not "looking"?
ChatGTP · 2 years ago
It's the sharing with partners bit that I don't like. If I want that to happen, I should be able to explicitly grant permissions.
croes · 2 years ago
Same way a postman delivers a postcard.
jameshart · 2 years ago
A postman doesn’t provide tools to diff your postcards, search them, scan their content for security issues, or automatically take actions based on their contents.

GitHub does all those things with your commits.

can16358p · 2 years ago
A postman doesn't need to show, scan, and parse the contents of your text in order to display it on a website.

Not really analogous.

jstanley · 2 years ago
By carefully reading the entire thing and storing a list of deltas from some other postcard when it would help compress it better?
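The delta-storage quip maps directly onto how diffs work: producing a delta between two "postcards" requires reading both in full, even though only the differences get stored or shown. A sketch with the standard library's difflib (the postcard texts are made up):

```python
import difflib

# Two postcards differing by one line.
old = "wish you were here\nweather is lovely\n".splitlines(keepends=True)
new = "wish you were here\nweather is terrible\n".splitlines(keepends=True)

# The delta is compact, but computing it meant reading every line.
delta = list(difflib.unified_diff(old, new, fromfile="card1", tofile="card2"))
```

(Git's pack-file deltas are binary rather than line-based, but the principle is the same.)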
croes · 2 years ago
Didn't Amazon say the same about the Alexa data?

Nothing is stored at Amazon.

Ok, stored but no human has access.

Ok, humans have access but only special employees

Ok, some third party contractor get the data too.

flagrant_taco · 2 years ago
It really sucks when you want to keep a secret but keep getting asked the right questions. It would have been much easier for Amazon if people just took the first statement at face value.
ostenning · 2 years ago
Why is this setting hidden behind the “Dependency Graph” feature?

GitHub should be making this blatantly clear and transparent for users, rather than using Facebook-esque cloak-and-dagger language.

moritzwarhier · 2 years ago
Not related to private repositories, but GitHub also keeps logs of commits in the "Activity Graph" even when those commits no longer exist.

For example, clearing the full reflog of a repository and then either force-pushing over it or deleting the remote repository (then pushing it again under a different name) will keep the activity attribution. In the second case it might even attribute the commits to the new remote (404ing when you click "Created x commits in y repositories").

Good for team workflows and pull requests, but annoying for public repositories with intentionally rewritten history.

semiquaver · 2 years ago
Reflog data is local to a git node, not part of the repository; in particular, reflogs are never transmitted between remotes. So "clearing the reflog" (presumably on your local machine) has nothing to do with removing any data at GitHub. You need to contact support for that, which the docs advise you to do if you’ve force-pushed over something you want fully deleted.
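This locality is easy to see directly: expiring the reflog only affects the clone you run it in, and a commit already pushed elsewhere is untouched. A minimal local sketch (the commit messages are just placeholders):

```shell
set -e
tmp="$(mktemp -d)"
git init "$tmp/repo"
cd "$tmp/repo"
git -c user.name=me -c user.email=me@example.com commit --allow-empty -m "one"
git -c user.name=me -c user.email=me@example.com commit --allow-empty -m "two"
git reset --hard HEAD~1        # "two" is now reachable only via the reflog

git reflog | grep "two"        # still visible, but only in THIS clone
git reflog expire --expire=now --all
git gc --prune=now             # now the commit is truly gone locally
```

Nothing in this sequence sends anything to a remote, which is exactly why it cannot delete data GitHub already holds.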
moritzwarhier · 2 years ago
Thank you, today I learned