It's not just that they dropped the ball; they actively sabotaged whatever goodwill they had built by bundling malware with software downloads. Not only was this a massive hassle, it ruined the reputation of lots of FOSS projects with folks who just wanted to use some of the most popular consumer-ish open source software like FileZilla.
While SF was crapping where it eats, GitHub built a lot of trust and goodwill with a lot of people.
Open Source = a Microsoft property... I don't want it to be true and I will act like it isn't true for things I have influence over - i.e. it's not the first choice for hosting a repository.
Is it? I don't know, I don't use it for personal (actually personal) stuff, but if I actually want to publish something — I'd do it there. If I want to contribute to something — it's way easier to do on GitHub than most other places. It makes searching for code easier too. If Microsoft decides to abuse the monopoly, or if GitLab etc. actually were much better, I don't imagine it would be very difficult to switch. Well, yeah, sure, GitHub Actions and issue history are somewhat of a vendor lock-in, but it's not that bad, I suppose.
Maybe Copilot (made possible by the huge non-commercial codebase on GitHub) being a somewhat unfair advantage over other commercial alternatives is a bit troublesome, yeah. But otherwise I just don't see why GitHub being a de-facto standard is bad. In fact, I am somewhat annoyed when a really popular project doesn't have a GitHub repository (mostly because it makes filing an issue, or even reading existing issues, much more difficult in most cases). So I'm actually glad to hear that some big projects feel pressed to migrate to GitHub. What's even the problem with that, apart, maybe, from GitHub Actions, which honestly suck?
(Maybe I should add: I am a git hater, and do think that Mercurial is just unquestionably better, but that battle was lost a long time ago, so I don't suppose it's the topic of this discussion.)
I love having this central location even though git is distributed, mainly because having to go to multiple Git hosts of varying quality would be a pain in the ass.
If you are confused (like me) and thought this was about PyPI (the Python package repository), then no. It is about a project called PyPy (one can argue it is a bad name), which is an implementation of the Python interpreter that doesn't use CPython; instead it relies on a JIT compiler. It is syntax compatible, but if your code uses any library or method relying on C extensions, then you are out of luck (goodbye NumPy, etc.).
I've been using git happily for many years. Strangely enough, the provenance of a commit, i.e. which branch a commit originally came from, has not really mattered to me very much. Mercurial provides this, and they are using `git notes` to add this provenance metadata to each commit during the migration to git.
I would have thought I'd need this much more, but I have not. In plain git I'll just `git log` and grep for the commit in case I want to make sure a commit is available in a certain branch.
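For what it's worth, plain git can also answer the containment question directly; something like this (commit shas and branch names below are just placeholders):

    # which branches (local and remote-tracking) contain a given commit
    git branch --all --contains abc1234
    # or test one branch explicitly; exits 0 if abc1234 is an ancestor of its tip
    git merge-base --is-ancestor abc1234 release-1.2 && echo "yes, it's in release-1.2"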
The point is giving branches a meaning (e.g. "implementation of this feature") and being able to at least keep the information that a given commit was part of that branch (well, at least that's why I'd want Mercurial's named branches; I'm not sure that's how this project used them).
Wouldn't good merge commit conventions preserve as much of this sort of information as desired? All the commits of the branch are contained in it, with the merge commit message preserving that info.
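Roughly what I have in mind (hypothetical branch name; the lookup at the end is a common recipe, though it assumes merges into main aren't themselves rewritten later):

    # record the branch's purpose in the merge commit itself
    git merge --no-ff feature/frobnicate -m "Merge branch 'feature/frobnicate': add frobnication"
    # later: find the merge (and thus the branch) that brought a commit into main
    git log --oneline --merges --ancestry-path abc1234..main | tail -n 1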
When I want to see inside a piece of software I look for (1) the source code; (2) the git-blame; (3) the code review for significant commits. I have never wanted to see into the history before that point, namely how the developer drafted and polished their idea prior to the final code review approval.
What practical use case am I missing out on when these work-in-progress draft commits are lost? I can’t see one.
But 33% of PyPy packages contain the potential for extreme security flaws and you don't know which ones until it gets you. How bad do you have to want to use Python to tolerate that?
"“When we actually examined the behavior and looked for new attack vectors, we discovered that if you download a malicious package — just download it — it will automatically run on your computer,” he told SC Media in an interview from Israel. “So we tried to understand why, because for us the word download doesn’t necessarily mean that the code will automatically run.”
But for PyPI, it does. The commands required for both processes run a script, called pip, that executes another file called setup.py, which is designed to provide a data structure for the package manager to understand how to handle the package. That script and process is also composed of Python code that runs automatically, meaning an attacker can insert and execute that malicious code on the device of anyone who downloads it." https://www.scmagazine.com/analysis/a-third-of-pypi-software...
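To make the mechanism concrete, here's a benign stand-in (nothing here is from the article; it just shows that setup.py is ordinary Python that gets run whenever pip builds or installs the package):

    # demo-pkg/setup.py -- a benign stand-in for a malicious package
    from setuptools import setup
    print("setup.py executed")        # any Python placed here runs at build/install time
    setup(name="demo-pkg", version="0.0.1")

    # building or installing the sdist executes the file above
    pip install -v ./demo-pkg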
My bad .. handicapped by the way I audit .. the point remains the same though. Need to clean up PyPI or stop the mortals from using Python. In the meantime, maybe put your venvs into a single non-emulated VM.
There are benefits to having branches be an inherent property of a commit, as opposed to the Git model, where a branch is a dynamic property of the graph.
Suppose I have a branch A with three commits, and then I make another branch B on top of that with another few commits. The Git model essentially says that B consists of all commits that are an ancestor of B that aren't the ancestor of any other branch. But now I go and rebase A somewhere else--and as a result, B suddenly grew several extra commits on its branch because those commits are no longer on branch A. If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain, pain that would go away if only git could remember that some of those commits are really just the old version of A.
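A sketch of that situation and the usual escape hatch, with made-up branch names (<old-A> stands for the tip A had before its rebase, e.g. fished out of the reflog):

    git switch -c A main          # ...three commits on A...
    git switch -c B A             # ...a few more commits on B...
    git rebase other-base A       # A moves; B still carries the old copies of A's commits
    # replay only B's own commits onto the new A, dropping the stale copies
    git rebase --onto A <old-A> B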
> If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain
Not really. Git will recognize commits that produce an identical diff and skip them. Your only pain will be that for each skipped commit, you will see a notification line in the output of your `git rebase`:
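(Roughly; the exact wording depends on the Git version:)

    warning: skipped previously applied commit abc1234
    hint: use --reapply-cherry-picks to include skipped commits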
> There are benefits to having branches be an inherent property of a commit
And drawbacks, naturally. Advanced branching/merging workflows become extremely painful if not impossible, which makes mercurial unusable as a "true" DVCS (where everyone maintains a fork of the code and people trade PRs/merges).
> The difference between git branches and named branches is not that important in a repo with 10 branches (no matter how big). But in the case of PyPy, we have at the moment 1840 branches. Most are closed by now, of course. But we would really like to retain (both now and in the future) the ability to look at a commit from the past, and know in which branch it was made. Please make sure you understand the difference between the Git and the Mercurial branches to realize that this is not always possible with Git— we looked hard, and there is no built-in way to get this workflow.
> Still not convinced? Consider this git repo with three commits: commit #2 with parent #1 and head of git branch “A”; commit #3 with also parent #1 but head of git branch “B”. When commit #1 was made, was it in the branch “A” or “B”? (It could also be yet another branch whose head was also moved forward, or even completely deleted.)
In this post they say that "git notes solves much of point (1): the difficulty of discovering provenance of commits, although not entirely".
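In practice that mechanism looks roughly like this (the exact notes ref and message format the PyPy migration uses are their choice; the branch label and sha below are made up):

    # attach provenance metadata to an existing commit without rewriting it
    git notes add -m "hg branch: some-feature-branch" abc1234
    # the note is shown alongside the commit
    git log --notes -1 abc1234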
I mean you can't really compare them since git doesn't even _have_ branches as Mercurial understands them. git's branches would perhaps better be called twigs in comparison. git's lightweight branches better map to Mercurial's topics or bookmarks, though neither perfectly. And Mercurial has even lighter weight branches since you can just make a new head by committing without having to name anything, and it won't yell at you about a detached head like git will.
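The difference is easy to see from the command line; in Mercurial the branch name is recorded in the changeset itself (names here are made up):

    hg branch feature-x            # subsequent commits record "feature-x" in the changeset
    hg commit -m "work"
    hg log -r . -T "{branch}\n"    # prints: feature-x
    # git stores no such field; a branch is just a movable pointer to a commit
    git switch -c feature-x
    git commit -m "work"           # commit object: tree, parents, author, message -- no branch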
Speaking of git, for mega-monorepo performance we're gonna need synthetic filesystems and SCM-integrated synthetic checkouts. Sapling (it was hg in the past, but was forked and reworked extensively) will be able to do this if EdenFS is ever released, but Git will need something similar. This will require a system agent running a caching overlay filesystem that can grab and cache bits on the fly. Yes, it's slightly slower than having the contents already, but there is no way to check out a 600+ GiB repo on a laptop with a 512 GiB SSD.
That already exists. It’s called Scalar[1]; it has been built into Git since October 2022[2], dates back to 2020[3], and is the spiritual successor to something Microsoft was using as far back as 2017[4].
Scalar explicitly does not implement the virtualized filesystem the OP is referring to. The original Git VFS for Windows that Microsoft designed did in fact do this, but as your third link notes, Microsoft abandoned that in favor of Scalar's totally different design which explicitly was about scaling repositories without filesystem virtualization.
There's a bunch of related features they added to Git to achieve scalability without virtualization, including the Scalar daemon which does background monitoring and optimization. Those are all useful and Scalar is a welcome addition. But the need for a virtual filesystem layer for large-scale repositories is still a very real one. There are also some limitations with Git's existing solutions that aren't ideal; for example Git's partial clones are great but IIRC can only be used as a "cone" applied to the original filesystem hierarchy. More generalized designs would allow mapping arbitrary paths in the original repository to any other path in the virtual checkout, and synchronizing between them. Tools like Josh can do this today with existing Git repositories[1].
The Windows repository that was referenced isn't even that big at 300GB, either. That's well within the realm of single-machine stuff. Game studios regularly have repositories in the multi-terabyte range, and they have also converged on similar virtualization solutions. For example, Destiny 2 uses a "virtual file synchronization" layer called VirtualSync[2] that reduced the working size of their checkouts by over 98%, multiple terabytes of savings per person. And in a twist of fate, VirtualSync was implemented thanks to a feature called "ProjFS" that Microsoft added to Windows... which was originally motivated by the Git VFS for Windows they abandoned!
I worked on source control at Facebook/Meta for many years. On top of what aseipp said, I remember the early conversations we had with Microsoft where the performance targets for status/commit/rebase they wanted to hit were an order of magnitude behind where we wanted to be.
But most repositories are not that big so this is hardly an issue for most people. Personally, the system I'm most optimistic about in 2024 is Jujutsu. I've been using it full time with Git repos for several months and it's overall been a delight.
Every provider out there can talk the standard Git protocol, but every feature that isn't covered by that protocol becomes a proprietary API. I think if Git (or a project like it) defined a standard protocol/data format for all the features of an SCM host, then all those providers could adopt it, and we could start moving away from GitHub as the center of the known universe. If we don't make a universal standard (and implementation), then it'll remain the way it is today.
It's kind of sad that this is true.
I'm guilty myself, I contribute to projects on GitHub more often than on any other platform.
And when I search for open source projects the first page I use is GitHub.
They had a sizable lead and completely bungled it.
Some people don't even know the difference between Git and GitHub...
I have also found this to be the case, even with engineers that have years of experience. It's both impressive and awful.
That said, I do feel some joy when I see a project on GitLab and am happy to contribute there, e.g. F-Droid.
Edit: They have a C-API emulation layer; I don't know its limitations or current status, but you can use those libraries [1][2]
[1] https://www.pypy.org/posts/2018/09/inside-cpyext-why-emulati...
[2] https://pythoncapi.readthedocs.io/cpyext.html
Given the way that pypy is implemented, I think the name is quite clever really.
> one can argue it is a bad name
I suppose one can, but it's a python interpreter written in python, so I think it's pretty good.
By all means, I prefer Git branches.
Either commit #1 is on the base branch A, whose current head is now commit #2, and branch B (head commit #3) was forked from it;
OR
commit #1 is on branch "default", commit #2 is on branch "A" with commit #1 as its parent, and commit #3 is on branch "B" with commit #1 also as its parent.
Consider your same example with forking instead of branching: how would the issue be resolved?
1. https://git-scm.com/docs/scalar
2. https://github.blog/2022-10-13-the-story-of-scalar/
3. https://devblogs.microsoft.com/devops/introducing-scalar/
4. https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...
[1] https://github.com/josh-project/josh
[2] https://www.gdcvault.com/play/1027699/Virtual-Sync-Terabytes...
Oh that's quite helpful. I was worried about how lossy the migration would be.