This highlights why any secret that gets committed must be rotated. Simply removing it from the git history isn't enough, because it can still linger; it's just harder to find.
Full disclosure, I work for GitHub, but push protection from Secret Scanning is awesome for this because your nearly leaked secret doesn’t make it to the remote, and it gives you instructions on how to fix your local repo!
Why does GitHub provide no way for a repository administrator to self-service a git gc? I seem to recall reading a blog post that suggested GitHub had invested a bunch of engineering resources in making cleaning up unreachable objects much more scalable.
We turned that on about a year ago, and that totally helped reduce the silly. The new dashboards are nice too, letting you spot which application team needs a phone call. The 'This is still active' warning is fantastic. Wish all providers would give you the API to show that.
This is a useful feature but can only provide a degree of protection.
To a certain extent, your approach of considering any mistakenly pushed commit as public is laudable, but it still seems unreasonable to me not to provide an analogue to gc.
It is still in your local repository, but it's not pushed to the remote repository. So forensics on your local machine may reveal it (probably until you do git gc, but I'm not an expert on git forensics), but it's safe otherwise.
That's only client side, and you also need to "gc" it to get rid of it, or it will still be in .git/objects and can be retrieved via something like `git cat-file`.
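To make that concrete, here is a minimal sketch in Python rather than git itself. A loose object is just a zlib-compressed file under `.git/objects`, named by its hash, so anyone who can read that directory can recover it until it is pruned. The helper names (`write_loose_blob`, `read_loose_object`) are made up for illustration, and the sketch assumes a SHA-1 repository and loose (not packed) objects.

```python
import hashlib
import zlib
from pathlib import Path


def write_loose_blob(objects_dir: Path, content: bytes) -> str:
    """Store content in the loose-object layout used for blobs."""
    # Git hashes "<type> <size>\0<content>", not the raw content.
    store = b"blob " + str(len(content)).encode() + b"\0" + content
    oid = hashlib.sha1(store).hexdigest()
    # First two hex digits are the directory, the rest is the file name.
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return oid


def read_loose_object(objects_dir: Path, oid: str) -> tuple[str, bytes]:
    """Inflate a loose object and split its header from its body."""
    raw = zlib.decompress((objects_dir / oid[:2] / oid[2:]).read_bytes())
    header, _, body = raw.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, body
```

Pointing `read_loose_object` at a real `.git/objects` directory with the ID of a "deleted" commit or blob should behave the same way, as long as the object has not been packed or pruned yet.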
You don’t even need the pushes API to see commits that were force pushed away. You can get the head of any branch at a given time using `gitrevisions` [1] syntax any place that you would normally put a branch or commit.
E.g., to see the state of the cpython main branch on January 1, 2024, we can ask for `main@{2024-01-01}`: https://github.com/python/cpython/tree/main@{2024-01-01}
This does not walk the commit history, but instead the server-side reflog, so it’s immune to force pushing and can only be avoided by GC of the reflog or repo. Definitely contact GH support if you pushed something you shouldn’t have.
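A small sketch of building such a URL, assuming the `<branch>@{<date>}` form from the comment above is accepted in the `/tree/` path. The function name `branch_at_date_url` is hypothetical, and the braces are left unescaped to match the example.

```python
from datetime import date


def branch_at_date_url(owner: str, repo: str, branch: str, when: date) -> str:
    """Build a GitHub tree URL using the <ref>@{<date>} gitrevisions form."""
    # Doubled braces in the f-string produce the literal { and } around the date.
    return f"https://github.com/{owner}/{repo}/tree/{branch}@{{{when.isoformat()}}}"


print(branch_at_date_url("python", "cpython", "main", date(2024, 1, 1)))
```

Whether the host resolves this against a server-side reflog or something else is not something the sketch can tell you; it only constructs the address.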
To be 100% sure that nothing has changed recently, I tried it and, nope, your revision command only looks at the local reflog. After a forced push you get different answers from the original repository (which has the full reflog) and a new clone.
I'm very confused by your comment. The grandparent comment talks about using the gitrevisions syntax in a GitHub URL to search the reflog stored on GitHub. Nothing to do with your local clone of a repository.
If you've inadvertently committed, say, copyrighted material to GitHub, and want to fully erase it, is there a way, other than contacting GitHub as this article mentions?
Even if you contact them, GitHub says [1] that they will not remove "non-sensitive data", but it makes no reference to copyrighted material.
If it's a copyright violation (be sure that it ACTUALLY is!) they will remove content in response to a DMCA request, but any forks will only be removed if you manually find them and issue a request for each fork. This isn't very useful if you accidentally uploaded your own copyrighted material though, since that's not a violation you could issue a notice for.
I don't think there's a need to erase copyrighted material? If it's your material then the copyright still holds. If it's not your material it's a problem between GitHub and the copyright holder who can DMCA the "hidden commit" if for some godforsaken reason the copyright holder somehow found the commit and cares.
Mostly a Git issue. In general Git won't remove old data pushed to remotes. Maybe if they run a garbage collection.
However, GitHub does exacerbate it a little by providing APIs that list commits that are no longer in the history. That said, there are other ways to get this info, such as brute-forcing short prefixes of commit hashes.
But really this is another case of the general problem that once you publish information you can't unpublish it. If you push a secret to a repo you can't 100% reliably clean it up. You should assume that everyone with the repo took a copy.
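On the brute-forcing point, a rough sketch of how small the search space is. It assumes 4 hex characters, which is the shortest abbreviation git itself will accept; whether a particular host resolves prefixes that short is an assumption, and the names `short_prefixes` and `search_space` are made up for illustration. Nothing here makes network requests.

```python
from itertools import product

HEX_DIGITS = "0123456789abcdef"


def short_prefixes(length: int = 4):
    """Yield every possible abbreviated commit ID of the given length."""
    for digits in product(HEX_DIGITS, repeat=length):
        yield "".join(digits)


def search_space(length: int = 4) -> int:
    """Number of candidate prefixes of the given length."""
    return len(HEX_DIGITS) ** length


print(search_space(4))  # 16 ** 4 candidates
```

At 4 characters that is 65,536 candidates, which is why hiding a commit from the history is not the same as making it hard to find.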
It's not really an issue, it's just that the assumption that removing a commit from the history actually deletes it is not correct. That holds for both Git and GitHub, and probably most other Git hosts.
Also in general, don't assume that you can remove anything from the internet once it has been published.
Git can potentially clean up dangling commits with `git gc --aggressive --prune=now`. GitLab offers this as part of housekeeping. However, be aware: this garbage collection does not work if you, e.g., reference a commit in an issue (like creating an incident that references the offending commit).
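For a repository you control yourself (a local clone or a self-hosted bare repo), a hedged sketch of that cleanup might look like the following. The reflog-expiry step is my addition, on the assumption that reflog entries are what keep the dangling commit alive; the function names are hypothetical, and `dry_run` defaults to `True` so nothing is deleted by accident.

```python
import subprocess


def purge_commands() -> list[list[str]]:
    """Commands that drop unreachable objects from a repository you control."""
    return [
        # Reflog entries keep otherwise-unreachable commits alive, so expire them first.
        ["git", "reflog", "expire", "--expire=now", "--all"],
        # Then prune everything unreachable, regardless of age.
        ["git", "gc", "--aggressive", "--prune=now"],
    ]


def purge_unreachable(repo_dir: str, dry_run: bool = True) -> list[list[str]]:
    """Return the cleanup commands; run them in repo_dir unless dry_run is set."""
    commands = purge_commands()
    if not dry_run:
        for argv in commands:
            subprocess.run(argv, cwd=repo_dir, check=True)
    return commands
```

This only affects the repository it runs in. It does nothing about copies held by a hosted remote, forks, or anyone else's clone.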
If I commit something locally, reset it, and push something else to the remote, does it leave a trace?
I see git reflog kinda like an OS recycle bin
https://github.com/python/cpython/tree/main@{2024-01-01}
[1] https://git-scm.com/docs/gitrevisions
[1] https://docs.github.com/en/authentication/keeping-your-accou...
And it is a GitHub issue. If you were self-hosting you could just run `git prune`, `git gc`, or `git repack`, or whatever the magic command is.