simonw · a year ago
This is talking about The Stack. The poster says:

"I found two of my old Github repos in there. Both were deleted last year and both were private."

The Stack was constructed a while ago, so "deleted last year" wouldn't have an impact if it was constructed before then.

"Both were private" is the thing that needs to be unpacked here. Were these genuinely private repositories that had never been made public on GitHub?

https://huggingface.co/datasets/bigcode/the-stack-v2 talks about where the Stack comes from: "This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history"

You can search that here: https://archive.softwareheritage.org/browse/search/?q=simonw... - it would be interesting to know if the OP's "private" repos are included in that collection.

simonw · a year ago
Here's a reply from a GitHub staff member: https://mastodon.social/@correcthorse/112128192392083842

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> https://huggingface.co/datasets/bigcode/the-stack-v2

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.

beeboobaa · a year ago
Depending on the license, how is this legal?

buffington · a year ago
I have about fifty private github repos and hundreds more public github repos (half of them forks of other public repos). I've verified that none of my private repos are in the dataset, and all of my public repos are.

I was expecting to see at least one repo that shouldn't be there depending on when the dataset was put together. In 2015 I changed a repo from public to private, which I think might suggest that the dataset was built after 2015 since my now private repo isn't in the dataset?

godelski · a year ago
I did the same and see none of my private repos. Which is quite a lot. I do see plenty of deleted repos but those were public.
hn72774 · a year ago
Github publicly streams the global change log including all public repos. I have a public repo and noticed there were multiple clones daily after every commit I made. That's when I discovered the API: https://docs.github.com/en/rest/activity/events?apiVersion=2...

If the repo was public even for a single commit, that was likely cloned and replicated elsewhere.
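For anyone who wants to look at that feed, here is a minimal Python sketch. It assumes the "list public events" endpoint at https://api.github.com/events and the `type` and `repo.name` fields it returns; the function names are made up for illustration, and unauthenticated requests are rate limited.

```python
import json
import urllib.request

# Public events feed of the GitHub REST API.
EVENTS_URL = "https://api.github.com/events"


def fetch_public_events(url=EVENTS_URL):
    """Fetch one page of the public events feed (needs network access)."""
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "events-sketch",  # arbitrary client name
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def pushed_repos(events):
    """Return the sorted, de-duplicated names of repos that had a PushEvent."""
    return sorted(
        {event["repo"]["name"] for event in events if event.get("type") == "PushEvent"}
    )


# Usage (performs a real HTTP request):
#   print(pushed_repos(fetch_public_events()))
```

Anything that shows up in `pushed_repos` was visible to every consumer of the feed at that moment, which is why even a brief public window can be enough for a repo to be cloned.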

latexr · a year ago
> Were these genuinely private repositories that had never been made public on GitHub?

The poster responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private.

chefandy · a year ago
I'd be interested in hearing from the cited number of people. If it's like 5 randos, then that's quite possibly a misremembering, or even conceivably the victims of some unrelated code theft. If it's a few dozen people, well that would have very different implications.
cavisne · a year ago
Considering private repos used to require a paid subscription, misremembering seems likely.

Repos used to be public by default, and enough people got confused by this that AWS & Github used to specifically scan repos for accidentally public AWS credentials.

RandallBrown · a year ago
None of my private repos are included in the software heritage project.
simonw · a year ago
This is a really big claim. I think we need specifics on this - I'm inclined to think this is people not understanding the GitHub public/private repo model (or misremembering the history of their repos) over GitHub deliberately leaking private code to third parties.
errantmind · a year ago
For info, I checked and it looks like none of my AGPL licensed repos are included in The Stack. Neither are my private repos.
tom_ · a year ago
My GPL ones seem to be excluded.

I've got a couple that are intended to be GPL, but you wouldn't know unless you go to the GitHub issues to find the issue I raised about the licence file not being in the repo. They are included.

calvinmorrison · a year ago
The software heritage sent me an email they were going to steal copies of my software. I think they assumed because it was published as a git repo they had unlimited rights to steal, reproduce, sell, etc my copyrighted work.

To quote the email they sent me:

"The mission of Software Heritage is to collect, preserve and share all the publicly available source code: https://www.softwareheritage.org"

So they're telling me their intent is to reproduce (share) my work, just because it was publicly available.

Upon my reply they did offer to cancel the request, but also told me they are facilitating the storage of my code for users' private theft:

" "Add forge now" requests are submitted by Software Heritage users who think that a forge is worth being archived.

After a careful examination of your arguments, we acknowledge that your forges may not be archived, so we won't process their ingestion, and close this add forge now request. However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature

buffington · a year ago
Do you happen to still have that email? I'd like to search my inbox for anything similar, and would love some snippets to search for.

Edit: I asked this before the parent commenter included the contents of the email. Thanks parent commenter!

remotefonts · a year ago
Steal your repos? So you don't have them anymore because they took them? Maybe ask nicely to have them back and with a bit of luck they'll give them back to you.
ryan-c · a year ago
> However, we cannot prevent a publicly available from being archived by anyone using our "Save code now" feature

This is not true. They just had to remove about 500 public repos to comply with my copyright.

AshamedCaptain · a year ago
Have you figured out if there is a way to prevent them from doing that? The email I got, which I assume is similar, was suspiciously lacking a "just GTFO and leave my website alone" option.
caesil · a year ago
All of the comments here where the commenter found their repos say either "it only has the public ones" or "it has some private ones from x years ago and i don't quite remember if they were ever public".

So it seems probable this is a case of repo owners misremembering.

jtietema · a year ago
I agree. Only my public repos are in the data set.
latexr · a year ago
Full text from the post:

> If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here:

> https://huggingface.co/spaces/bigcode/in-the-stack

> I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @huggingface.

> Remove all your code from Github.

> CONSENT IS NOT OPT-OUT.

tailspin2019 · a year ago
> "Both were deleted last year and both were private"

I'd really like to know if they were ever public at any point because that might explain it.

None of my repos that have always been private are included (apparently).

That's not to say I'm not concerned by this...

codazoda · a year ago
This is interesting...

I have 66 repositories picked up and put in The Stack. I spot-checked the first 10. All 10 are Public on GitHub. 8 of the 10 do not have a license of any type, meaning they are covered by copyright, at least in the US, unless GitHub has some terms extending the license of public projects to 3rd parties.

One of mine is Private and was an extension I sold for a short time. I can't say if I ever made it public or not.

https://github.com/codazoda/like_roller

kevincrane · a year ago
god the language in that opt-out link is patronizing as hell. do AI people just assume everyone is happy to be a cog in their monetization scheme?
CatWChainsaw · a year ago
They assume that they don't need to care and their assumption has so far proven correct.
samtheprogram · a year ago
Thought it was bait, but I can confirm I can find private repos in the search results. What the heck?
rfoo · a year ago
Since your repo's name is public now anyway, could you please help us and post it here? I'm really curious about what happened. Public GitHub activity is archived [1], so if you post it here we can check whether it was ever public or has been private the whole time.

[1] https://www.gharchive.org/

tailspin2019 · a year ago
Were those repos always private?
simonw · a year ago
Which private repos? Any that you're willing to share (I get that sharing names of private repos goes against the whole idea of them being private!)
simonw · a year ago
Since a lot of this depends on whether someone had ever had a private repo public in the past, I was hoping it could be resolved using the GitHub security audit log.

You can access that for your repos here: https://github.com/settings/security-log

Then search for repo:simonw/datasette or similar.

But... it looks like the audit log only goes back 6 months, so sadly it's not useful for reviewing this particular situation which involves repos that could have been 5 or more years old.

The ClickHouse copy of the GitHub Archive is useful for reviewing things and goes back a lot further. Try it here:

https://play.clickhouse.com/play?user=play

You can run this query to see relevant events for a specific username:

    -- Times when a private repo was switched to public
    with public_events as (
      select
        created_at as timestamp,
        'Private repo made public' as action,
        repo_name
      from github_events
      where actor_login = 'simonw'
      and event_type in ('PublicEvent')
    ),
    -- Latest push recorded in the public archive, per repo
    most_recent_public_push as (
      select
        max(created_at) as timestamp,
        'Most recent public push' as action,
        repo_name
      from github_events
      where event_type = 'PushEvent'
      and actor_login = 'simonw'
      group by repo_name
    ),
    combined as (
      select * from public_events
      union all select * from most_recent_public_push
    )
    -- One timeline, oldest first
    select * from combined order by timestamp

The PublicEvent one is "When a private repository is made public" according to https://docs.github.com/en/rest/using-the-rest-api/github-ev...

I just built a tool for running this query without having to type in the SQL: https://observablehq.com/@simonw/github-public-repo-history

Explained in this TIL: https://til.simonwillison.net/clickhouse/github-public-histo...
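The same logic as the query can be sketched in Python against event records you have already downloaded. This assumes the nested shape the GitHub events API uses (`type`, `actor.login`, `repo.name`, `created_at`) rather than the flat ClickHouse columns, and ISO 8601 UTC timestamps in one consistent format so they sort correctly as strings. `public_history` is a hypothetical helper name.

```python
def public_history(events, login):
    """Build a timeline for one user: every PublicEvent, plus the most
    recent PushEvent per repo, ordered by timestamp."""
    rows = []
    latest_push = {}  # repo name -> newest push timestamp seen so far

    for event in events:
        if event["actor"]["login"] != login:
            continue
        repo = event["repo"]["name"]
        timestamp = event["created_at"]
        if event["type"] == "PublicEvent":
            rows.append((timestamp, "Private repo made public", repo))
        elif event["type"] == "PushEvent":
            if repo not in latest_push or timestamp > latest_push[repo]:
                latest_push[repo] = timestamp

    rows.extend(
        (timestamp, "Most recent public push", repo)
        for repo, timestamp in latest_push.items()
    )
    # Tuples sort by timestamp first, matching "order by timestamp".
    return sorted(rows)
```

A repo that appears in the result was public at the listed time, whatever its visibility is today. A repo that never appears was either always private or simply had no archived activity from that account.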

jmuguy · a year ago
Need more proof than "pretty sure they were private" and "heard from a number of people".
tailspin2019 · a year ago
From one of the comments on that post:

> I work for github. This post was shared in our slack, on a non work related channel. We don't think it's us.

> They say they get data from SoftwareHeritage, a website that archives repos from github. If your repos were ever open, they might have been archived there even after you deleted from github.

> That's my best guess.

latexr · a year ago
The OP responded to that:

https://hachyderm.io/@emenel@post.lurk.org/11212861313743638...

> thanks for the additional info. I’m pretty sure the repos of mine were private. If it were only me i could be misremembering something, but i have heard from a number of people that they also found repos in this dataset that were private. So i’m not sure what to think or how to explain it.

psuedo_uuh · a year ago
It blows my mind that we’re all fine with “the home of open source software” being closed source
bogwog · a year ago
The way I rationalize my use of Github to myself is by framing it as me stealing free compute and storage from the evil Microsoft.
HaZeust · a year ago
Trust me, they're making it back.
mdaniel · a year ago
especially with the number of actual FOSS alternatives available right now, but that network effect, whew, it's strong :-(
mnau · a year ago
I am considering moving away because of network effect. The more popular a repo is, the more work there is (issues, PRs...). It's not scalable.

Using another platform/self-host would introduce friction.

bmitc · a year ago
What FOSS solutions give me an editor in the web, Codespaces, free CI/CD compute, free website hosting, etc.?
angst_ridden · a year ago
I'm curious how they get around licenses. For example, I have repos that show up in The Stack. Some have licenses that require inclusion of the copyright in any source re-use or redistribution.

IANAL, but it seems like inclusion in the data set and subsequent distribution without the copyright notice would be a violation.

AshamedCaptain · a year ago
In The Stack FAQ, they claim that they are doing minimal analysis of the LICENSE file and SPDX tags.

I'd bet that this is enough to detect cases like GPL code, but I also bet that if this analysis fails instead of falling back to "unknown license, assume proprietary, don't copy" they fall back to "free lunch!". Because reasons.
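To make the two fallbacks concrete, here is a rough sketch of the conservative option: look for an `SPDX-License-Identifier:` tag and include a file only when the tag is on an allowlist. This is not how The Stack does it; the allowlist and function names are invented for illustration, and compound expressions such as `MIT OR Apache-2.0` are not handled.

```python
import re

# Matches the standard SPDX tag and captures a single license identifier.
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+\-]+)")

# Hypothetical allowlist for illustration only; not The Stack's actual list.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}


def detect_license(text):
    """Return the first SPDX identifier found in the text, or None."""
    match = SPDX_TAG.search(text)
    return match.group(1) if match else None


def may_include(text):
    """Conservative rule: unknown or unlisted licenses are excluded."""
    return detect_license(text) in PERMISSIVE
```

The "free lunch" fallback would be the opposite default: exclude only when a known restrictive license is detected, and include everything else.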

ryan-c · a year ago
At least one of my repos with no license/spdx was excluded, though the source files do say "all rights reserved" in them.
angst_ridden · a year ago
I suspect you're right.

Even though even permissive licenses like MIT and BSD require attribution and preservation of copyright notices. Maybe their AI just can't "reliably detect" licenses.

xyst · a year ago
Probably some obscure GH legal clause stating “we own your data. Ownership is implied and we may do anything with it. Private vs public is a concept of accessibility over the internet. That doesn't necessarily mean it’s not accessible via intranet or other non-public means”

It’s similar to the legal clauses used for decades on social and video hosting platforms.

angst_ridden · a year ago
I don't think that gets around the code licenses. They may use it as if it does, but I'm not convinced that would hold up if someone were wealthy enough to make a court case.
ADeerAppeared · a year ago
GH interestingly doesn't grant themselves nor others that many rights.

---

  You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.

  This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
---

  Any User-Generated Content you post publicly, including issues, comments, and contributions to other Users' repositories, may be viewed by others. By setting your repositories to be viewed publicly, you agree to allow others to view and "fork" your repositories (this means that others may make their own copies of Content from your repositories in repositories they control).

  If you set your pages and repositories to be viewed publicly, you grant each User of GitHub a nonexclusive, worldwide license to use, display, and perform Your Content through the GitHub Service and to reproduce Your Content solely on GitHub as permitted through GitHub's functionality (for example, through forking).
---

They don't even include training for Copilot (though a dodgy lawyer will likely try to include that as "part of providing the service"). 3rd parties only get a license to fork your repo, seemingly not even a license to do anything with that repo. (And hot take: Github should just let people disable the fork button already.)