"I found two of my old Github repos in there. Both were deleted last year and both were private."
The Stack was constructed a while ago, so "deleted last year" wouldn't have an impact if it was constructed before then.
"Both were private" is the thing that needs to be unpacked here. Were these genuinely private repositories that had never been made public on GitHub?
https://huggingface.co/datasets/bigcode/the-stack-v2 talks about where the Stack comes from: "This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history"
> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.
I have about fifty private github repos and hundreds more public github repos (half of them forks of other public repos). I've verified that none of my private repos are in the dataset, and all of my public repos are.
I was expecting to see at least one repo that shouldn't be there depending on when the dataset was put together. In 2015 I changed a repo from public to private, which I think might suggest that the dataset was built after 2015 since my now private repo isn't in the dataset?
GitHub publicly streams a global change log covering all public repos. I have a public repo and noticed it was cloned multiple times a day, after every commit I made. That's when I discovered the API: https://docs.github.com/en/rest/activity/events?apiVersion=2...
If the repo was public even for a single commit, that was likely cloned and replicated elsewhere.
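A rough sketch of how a third party could react to that feed, assuming the payload is a JSON array whose items carry a `type` and a `repo.name`, as the public events documentation describes. The helper name and the sample data are made up for illustration, and nothing here is taken from a real archiver.

```python
def repos_with_pushes(events):
    """Return the distinct repo names seen in PushEvent entries, in first-seen order."""
    seen = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        name = event.get("repo", {}).get("name")
        if name and name not in seen:
            seen.append(name)
    return seen


# Hypothetical sample shaped like the public events payload.
sample_events = [
    {"type": "PushEvent", "repo": {"name": "alice/demo"}},
    {"type": "WatchEvent", "repo": {"name": "bob/tool"}},
    {"type": "PushEvent", "repo": {"name": "alice/demo"}},
    {"type": "PushEvent", "repo": {"name": "carol/lib"}},
]
print(repos_with_pushes(sample_events))
```

Anything that shows up in such a list is, by definition, publicly clonable at that moment, which is why even a brief public window can be enough.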
> I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private.
I'd be interested in hearing from the cited number of people. If it's like 5 randos, then that's quite possibly a misremembering, or even conceivably the victims of some unrelated code theft. If it's a few dozen people, well that would have very different implications.
Considering private repos used to require a paid subscription, misremembering seems likely.
Repos used to be public by default, and enough people got confused by this that AWS & GitHub used to specifically scan repos for accidentally public AWS credentials.
This is a really big claim. I think we need specifics on this - I'm inclined to think this is people not understanding the GitHub public/private repo model (or misremembering the history of their repos) over GitHub deliberately leaking private code to third parties.
I've got a couple that are intended to be GPL, but you wouldn't know unless you go to the GitHub issues to find the issue I raised about the licence file not being in the repo. They are included.
Software Heritage sent me an email saying they were going to steal copies of my software. I think they assumed that because it was published as a git repo they had unlimited rights to steal, reproduce, sell, etc. my copyrighted work.
To quote the email they sent me:
"The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org"
So they're telling me their intent is to reproduce (share) my work, just because it was publicly available.
Upon my reply they did offer to cancel the request, but they also told me they are facilitating the storage of my code for users' private theft:
> "Add forge now" requests are submitted by Software Heritage users who think that a forge is worth being archived.
> After a careful examination of your arguments, we acknowledge that your forges may not be archived, so we won't process their ingestion, and close this add forge now request. However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature
Steal your repos? So you don't have them anymore because they took them? Maybe ask nicely to have them back and with a bit of luck they'll give them back to you.
Have you figured out if there is a way to prevent them from doing that? The email I got, which I assume is similar, was suspiciously lacking a "just GTFO and leave my website alone" option.
All of the comments here where the commenter found their repos say either "it only has the public ones" or "it has some private ones from x years ago and i don't quite remember if they were ever public".
So it seems probable this is a case of repo owners misremembering.
> If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:
> I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.
I have 66 repositories picked up and put in The Stack. I spot-checked the first 10. All 10 are Public on GitHub. 8 of the 10 do not have a license of any type, meaning they are covered by copyright, at least in the US, unless GitHub has some terms extending the license of public projects to 3rd parties.
One of mine is Private and was an extension I sold for a short time. I can't say if I ever made it public or not.
Since your repo's name is public now anyway, could you please help us and post it here? I'm really curious about what happened. Since public GitHub activity was archived [1], if you post it here we can check whether it was ever public or has truly been private the whole time.
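For anyone who wants to do that check by hand, here is a minimal sketch. It assumes the archive is available as gzipped newline-delimited JSON, one event per line, using the `repo.name`, `type` and `created_at` fields of the current events format (older dumps may use a different shape). The function names and sample data are hypothetical.

```python
import gzip
import io
import json


def repo_events(ndjson_gz_bytes, repo_name):
    """Yield (created_at, type) for each event in a gzipped NDJSON dump that names repo_name."""
    with gzip.open(io.BytesIO(ndjson_gz_bytes), "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue
            event = json.loads(line)
            if event.get("repo", {}).get("name") == repo_name:
                yield event.get("created_at"), event.get("type")


def was_ever_public(ndjson_gz_bytes, repo_name):
    """True if the dump holds any public event for the repo."""
    return any(True for _ in repo_events(ndjson_gz_bytes, repo_name))


# Hypothetical two-event dump, built in memory so the sketch runs offline.
sample_dump = gzip.compress(
    b"\n".join(
        json.dumps(e).encode("utf-8")
        for e in [
            {"type": "PublicEvent", "created_at": "2015-03-01T12:00:00Z",
             "repo": {"name": "alice/old-project"}},
            {"type": "PushEvent", "created_at": "2015-03-02T08:30:00Z",
             "repo": {"name": "alice/old-project"}},
        ]
    )
)
print(list(repo_events(sample_dump, "alice/old-project")))
```

A hit in a public event dump means the repo was visible at that timestamp. No hits in a single file proves nothing, since each file only covers a slice of time.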
Since a lot of this depends on whether a private repo had ever been public in the past, I was hoping it could be resolved using the GitHub security audit log.
But... it looks like the audit log only goes back 6 months, so sadly it's not useful for reviewing this particular situation which involves repos that could have been 5 or more years old.
The ClickHouse copy of the GitHub Archive is useful for reviewing things and goes back a lot further. Try it here: https://play.clickhouse.com/play?user=play
You can run this query to see relevant events for a specific username:
    with public_events as (
      select
        created_at as timestamp,
        'Private repo made public' as action,
        repo_name
      from github_events
      where actor_login = 'simonw'
        and event_type in ('PublicEvent')
    ),
    most_recent_public_push as (
      select
        max(created_at) as timestamp,
        'Most recent public push' as action,
        repo_name
      from github_events
      where event_type = 'PushEvent'
        and actor_login = 'simonw'
      group by repo_name
    ),
    combined as (
      select * from public_events
      union all select * from most_recent_public_push
    )
    select * from combined order by timestamp
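To run the same query for a different account without hand-editing the SQL, something like the sketch below could work. The helper name is hypothetical, and the username check is only a rough approximation of GitHub's naming rules (letters, digits and hyphens, at most 39 characters), used here so that arbitrary text cannot be spliced into the query.

```python
import re

# Rough approximation of GitHub's login rules; an assumption, not an official pattern.
USERNAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?")

QUERY_TEMPLATE = """
with public_events as (
    select created_at as timestamp, 'Private repo made public' as action, repo_name
    from github_events
    where actor_login = '{login}' and event_type in ('PublicEvent')
),
most_recent_public_push as (
    select max(created_at) as timestamp, 'Most recent public push' as action, repo_name
    from github_events
    where event_type = 'PushEvent' and actor_login = '{login}'
    group by repo_name
),
combined as (
    select * from public_events
    union all select * from most_recent_public_push
)
select * from combined order by timestamp
"""


def build_query(username):
    """Return the query text for one username, rejecting anything that is not a plain login."""
    if not USERNAME_RE.fullmatch(username):
        raise ValueError(f"not a plausible GitHub username: {username!r}")
    return QUERY_TEMPLATE.format(login=username)


print(build_query("simonw"))
```

The printed text can be pasted into the ClickHouse playground as-is.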
> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.
> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.
> thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private. So i’m not sure what to think or how to explain it.
I'm curious how they get around licenses. For example, I have repos that show up in The Stack. Some have licenses that require inclusion of the copyright in any source re-use or redistribution.
IANAL, but it seems like inclusion in the data set and subsequent distribution without the copyright notice would be a violation.
In The Stack FAQ, they claim that they are doing minimal analysis of the LICENSE file and SPDX tags.
I'd bet that this is enough to detect cases like GPL code, but I also bet that if this analysis fails, instead of falling back to "unknown license, assume proprietary, don't copy", they fall back to "free lunch!". Because reasons.
Even though even permissive licenses like MIT and BSD require attribution and preservation of copyright notices. Maybe their AI just can't "reliably detect" licenses.
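To make the "unknown means don't copy" fallback concrete, here is a minimal sketch. It is not how The Stack actually filters anything. The allow-list, the function name and the return labels are all assumptions, and it only reads the first identifier of an SPDX tag, ignoring compound expressions.

```python
import re

# Matches the first identifier in an SPDX tag; compound expressions are not parsed.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+-]+)")

# Hypothetical allow-list of permissive SPDX identifiers.
PERMISSIVE = {"MIT", "BSD-2-Clause", "BSD-3-Clause", "Apache-2.0"}


def classify(file_text):
    """Decide what to do with a file based only on its SPDX tag."""
    match = SPDX_TAG.search(file_text)
    if match is None:
        # No detectable license: treat as proprietary and leave it out.
        return "exclude"
    if match.group(1) in PERMISSIVE:
        # Permissive licenses still require attribution and the copyright notice.
        return "include-with-attribution"
    return "exclude"


print(classify("// SPDX-License-Identifier: MIT\nint main() {}"))
print(classify("int main() {}"))
```

The design choice is that detection failure and unrecognized licenses both land on the conservative branch.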
Probably some obscure GH legal clause stating “we own your data. Ownership is implied and we may do anything with it. Private vs public is a concept of accessibility over the internet. Not necessarily means it’s not accessible via intranet or other non-public means”
It’s similar to the legal clauses used for decades on social and video hosting platforms.
I don't think that gets around the code licenses. They may use it as if it does, but I'm not convinced that would hold up if someone were wealthy enough to make a court case.
GH interestingly grants neither itself nor others that many rights.
---
You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
---
Any User-Generated Content you post publicly, including issues, comments, and contributions to other Users' repositories, may be viewed by others. By setting your repositories to be viewed publicly, you agree to allow others to view and "fork" your repositories (this means that others may make their own copies of Content from your repositories in repositories they control).
If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
---
They don't even include training for Copilot (though a dodgy lawyer will likely try to include that as "part of providing the service"). 3rd parties only get a license to fork your repo, seemingly not even a license to do anything with that repo. (And hot take: GitHub should just let people disable the fork button already.)
You can search the Software Heritage archive here: https://archive.softwareheritage.org/browse/search/?q=simonw... - it would be interesting to know if the OP's "private" repos are included in that collection.
> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.
> https://huggingface.co/datasets/bigcode/the-stack-v2
> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.
> That's my best guess.
The poster responded to that:
https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...
> I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private.
Edit: I asked this before the parent commenter included the contents of the email. Thanks parent commenter!
This is not true. They just had to remove about 500 public repos to comply with my copyright.
> If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:
> https://huggingface.co/spaces/bigcode/in-the-stack
> I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.
> Remove all your code from Github.
> CONSENT IS NOT OPT-OUT.
I'd really like to know if they were ever public at any point because that might explain it.
None of my repos that have always been private are included (apparently).
That's not to say I'm not concerned by this...
https://github.com/codazoda/like_roller
[1] https://www.gharchive.org/
You can access that for your repos here: https://github.com/settings/security-log
Then search for repo:simonw/datasette or similar.
But... it looks like the audit log only goes back 6 months, so sadly it's not useful for reviewing this particular situation which involves repos that could have been 5 or more years old.
The ClickHouse copy of the GitHub Archive is useful for reviewing things and goes back a lot further. Try it here:
https://play.clickhouse.com/play?user=play
You can run this query to see relevant events for a specific username:
The PublicEvent one is "When a private repository is made public" according to https://docs.github.com/en/rest/using-the-rest-api/github-ev...
I just built a tool for running this query without having to type in the SQL: https://observablehq.com/@simonw/github-public-repo-history
Explained in this TIL: https://til.simonwillison.net/clickhouse/github-public-histo...
Using another platform or self-hosting would introduce friction.