It's not just that they dropped the ball; they actively sabotaged whatever goodwill they had built by bundling malware with software downloads. Not only was this a massive hassle, it ruined the reputation of lots of FOSS projects with folks who just wanted to use some of the most popular consumer-ish open source software like FileZilla.
While SF was crapping where it eats, GitHub built a lot of trust and goodwill with a lot of people.
Open Source = a Microsoft property... I don't want it to be true and I will act like it isn't true for things I have influence over - i.e. it's not the first choice for hosting a repository.
Is it? I don't know, I don't use it for personal (actually personal) stuff, but if I actually want to publish something — I'd do it there. If I want to contribute to something — it's way easier to do on GitHub than most other places. It makes searching for code easier too. If Microsoft decides to abuse the monopoly, or if GitLab etc. actually were much better, I don't imagine it would be very difficult to switch. Well, yeah, sure, GitHub Actions and issue history are somewhat of a vendor lock-in, but it's not that bad, I suppose.
Maybe Copilot (made possible by the huge non-commercial codebase on GitHub) being a somewhat unfair advantage over other commercial alternatives is a bit troublesome, yeah. But otherwise I just don't see why GitHub being a de-facto standard is bad. In fact, I am somewhat annoyed when a really popular project doesn't have a GitHub repository (mostly because it makes filing an issue, or even reading existing issues, much more difficult in most cases). So I'm actually glad to hear that some big projects feel pressed to migrate to GitHub. What's even the problem with that, apart, maybe, from GitHub Actions, which honestly suck?
(Maybe I should add: I am a git hater, and do think that Mercurial is just unquestionably better, but that battle was lost a long time ago, so I don't suppose it's the topic of this discussion.)
I love having this central location even though git is distributed, mainly because having to go to multiple Git hosts of varying quality would be a pain in the ass.
If you are confused (like me) and thought this was about PyPI (the Python package repository), then no. It is about a project called PyPy (one can argue it is a bad name), which is an implementation of the Python interpreter that doesn't use CPython; instead it relies on a JIT compiler. It is syntax compatible, but if your code uses any library or method relying on C extensions, then you are out of luck (goodbye NumPy, etc.).
I've been using git happily for many years. Strangely enough, the provenance of a commit, i.e. which branch a commit originally came from, has not really mattered to me very much. Mercurial provides this, and they are using `git notes` to add this provenance metadata to each commit during the migration to git.
I would have thought I'd need this much more, but I have not. In plain git I'll just `git log` and grep for the commit in case I want to make sure a commit is available in a certain branch.
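For what it's worth, plain git can also answer the containment question directly; something like this (commit shas and branch names below are just placeholders):

    # which branches (local and remote-tracking) contain a given commit
    git branch --all --contains abc1234
    # or test one branch explicitly; exits 0 if abc1234 is an ancestor of its tip
    git merge-base --is-ancestor abc1234 release-1.2 && echo "yes, it's in release-1.2"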
The point is giving branches a meaning (e.g. "implementation of this feature") and being able to at least keep the information that a given commit was part of that branch (well, at least that's why I'd want Mercurial's named branches; I'm not sure that's how this project used them).
Wouldn't good merge commit conventions preserve as much of this sort of information as desired? All the commits of the branch are contained in it, with the merge commit message preserving that info.
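Roughly what I have in mind (hypothetical branch name; the lookup at the end is a common recipe, though it assumes merges into main aren't themselves rewritten later):

    # record the branch's purpose in the merge commit itself
    git merge --no-ff feature/frobnicate -m "Merge branch 'feature/frobnicate': add frobnication"
    # later: find the merge (and thus the branch) that brought a commit into main
    git log --oneline --merges --ancestry-path abc1234..main | tail -n 1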
When I want to see inside a piece of software I look for (1) the source code; (2) the git-blame; (3) the code review for significant commits. I have never wanted to see into the history before that point, namely how the developer drafted and polished their idea prior to the final code review approval.
What practical use case am I missing out on when these work-in-progress draft commits are lost? I can’t see one.
But 33% of PyPy packages contain the potential for extreme security flaws and you don't know which ones until it gets you. How bad do you have to want to use Python to tolerate that?
"“When we actually examined the behavior and looked for new attack vectors, we discovered that if you download a malicious package — just download it — it will automatically run on your computer,” he told SC Media in an interview from Israel. “So we tried to understand why, because for us the word download doesn’t necessarily mean that the code will automatically run.”
But for PyPI, it does. The commands required for both processes run a script, called pip, that executes another file called setup.py, which is designed to provide a data structure for the package manager to understand how to handle the package. That script and process is also composed of Python code that runs automatically, meaning an attacker can insert and execute that malicious code on the device of anyone who downloads it." https://www.scmagazine.com/analysis/a-third-of-pypi-software...
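To make the mechanism concrete, here's a benign stand-in (nothing here is from the article; it just shows that setup.py is ordinary Python that gets run whenever pip builds or installs the package):

    # demo-pkg/setup.py -- a benign stand-in for a malicious package
    from setuptools import setup
    print("setup.py executed")        # any Python placed here runs at build/install time
    setup(name="demo-pkg", version="0.0.1")

    # building or installing the sdist executes the file above
    pip install -v ./demo-pkg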
My bad .. handicapped by the way I audit .. the point remains the same though. Need to clean up PyPI or stop the mortals from using Python. In the meantime, maybe put your venvs into a single non-emulated VM.
There are benefits to having branches be an inherent property of a commit, as opposed to the Git model, where a branch is a dynamic property of the graph.
Suppose I have a branch A with three commits, and then I make another branch B on top of that with another few commits. The Git model essentially says that B consists of all commits that are an ancestor of B that aren't the ancestor of any other branch. But now I go and rebase A somewhere else--and as a result, B suddenly grew several extra commits on its branch because those commits are no longer on branch A. If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain, pain that would go away if only git could remember that some of those commits are really just the old version of A.
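A sketch of that situation and the usual escape hatch, with made-up branch names (<old-A> stands for the tip A had before its rebase, e.g. fished out of the reflog):

    git switch -c A main          # ...three commits on A...
    git switch -c B A             # ...a few more commits on B...
    git rebase other-base A       # A moves; B still carries the old copies of A's commits
    # replay only B's own commits onto the new A, dropping the stale copies
    git rebase --onto A <old-A> B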
> If I want to rebase B on the new A, well, those duplicated commits will cause me some amount of pain
Not really. Git will recognize commits that produce an identical diff and skip them. Your only pain will be that for each skipped commit, you will see a notification line in the output of your `git rebase`:
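(Roughly; the exact wording depends on the Git version:)

    warning: skipped previously applied commit abc1234
    hint: use --reapply-cherry-picks to include skipped commits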
> There are benefits to having branches be an inherent property of a commit
And drawbacks, naturally. Advanced branching/merging workflows become extremely painful if not impossible, which makes mercurial unusable as a "true" DVCS (where everyone maintains a fork of the code and people trade PRs/merges).
> The difference between git branches and named branches is not that important in a repo with 10 branches (no matter how big). But in the case of PyPy, we have at the moment 1840 branches. Most are closed by now, of course. But we would really like to retain (both now and in the future) the ability to look at a commit from the past, and know in which branch it was made. Please make sure you understand the difference between the Git and the Mercurial branches to realize that this is not always possible with Git— we looked hard, and there is no built-in way to get this workflow.
> Still not convinced? Consider this git repo with three commits: commit #2 with parent #1 and head of git branch “A”; commit #3 with also parent #1 but head of git branch “B”. When commit #1 was made, was it in the branch “A” or “B”? (It could also be yet another branch whose head was also moved forward, or even completely deleted.)
In this post they say that "git notes solves much of point (1): the difficulty of discovering provenance of commits, although not entirely".
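In practice that mechanism looks roughly like this (the exact notes ref and message format the PyPy migration uses are their choice; the branch label and sha below are made up):

    # attach provenance metadata to an existing commit without rewriting it
    git notes add -m "hg branch: some-feature-branch" abc1234
    # the note is shown alongside the commit
    git log --notes -1 abc1234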
I mean you can't really compare them since git doesn't even _have_ branches as Mercurial understands them. git's branches would perhaps better be called twigs in comparison. git's lightweight branches better map to Mercurial's topics or bookmarks, though neither perfectly. And Mercurial has even lighter weight branches since you can just make a new head by committing without having to name anything, and it won't yell at you about a detached head like git will.
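The difference is easy to see from the command line; in Mercurial the branch name is recorded in the changeset itself (names here are made up):

    hg branch feature-x            # subsequent commits record "feature-x" in the changeset
    hg commit -m "work"
    hg log -r . -T "{branch}\n"    # prints: feature-x
    # git stores no such field; a branch is just a movable pointer to a commit
    git switch -c feature-x
    git commit -m "work"           # commit object: tree, parents, author, message -- no branch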
Speaking of git, for mega-monorepo performance we're gonna need synthetic filesystems and SCM-integrated synthetic checkouts. Sapling (it was hg in the past, but was forked and reworked extensively) will be able to do this if EdenFS is ever released, but Git will need something similar. This will require a system agent running a caching overlay filesystem that can grab and cache bits on the fly. Yes, it's slightly slower than having the contents already, but there is no way to check out a 600+ GiB repo on a laptop with a 512 GiB SSD.
That already exists. It’s called Scalar[1]; it has been built into Git since October 2022[2], dates back to 2020[3], and is the spiritual successor to something Microsoft was using as far back as 2017[4].
Scalar explicitly does not implement the virtualized filesystem the OP is referring to. The original Git VFS for Windows that Microsoft designed did in fact do this, but as your third link notes, Microsoft abandoned that in favor of Scalar's totally different design which explicitly was about scaling repositories without filesystem virtualization.
There's a bunch of related features they added to Git to achieve scalability without virtualization, including the Scalar daemon which does background monitoring and optimization. Those are all useful and Scalar is a welcome addition. But the need for a virtual filesystem layer for large-scale repositories is still a very real one. There are also some limitations with Git's existing solutions that aren't ideal; for example Git's partial clones are great but IIRC can only be used as a "cone" applied to the original filesystem hierarchy. More generalized designs would allow mapping arbitrary paths in the original repository to any other path in the virtual checkout, and synchronizing between them. Tools like Josh can do this today with existing Git repositories[1].
The Windows repository that was referenced isn't even that big at 300GB, either. That's well within the realm of single-machine stuff. Game studios regularly have repositories in the multi-terabyte range, and they have also converged on similar virtualization solutions. For example, Destiny 2 uses a "virtual file synchronization" layer called VirtualSync[2] that reduced the working size of their checkouts by over 98%, multiple terabytes of savings per person. And in a twist of fate, VirtualSync was implemented thanks to a feature called "ProjFS" that Microsoft added to Windows... which was originally motivated by the Git VFS for Windows they abandoned!
I worked on source control at Facebook/Meta for many years. On top of what aseipp said, I remember the early conversations we had with Microsoft where the performance targets for status/commit/rebase they wanted to hit were an order of magnitude behind where we wanted to be.
But most repositories are not that big so this is hardly an issue for most people. Personally, the system I'm most optimistic about in 2024 is Jujutsu. I've been using it full time with Git repos for several months and it's overall been a delight.
Every provider out there can talk the standard Git protocol, but every feature that isn't covered by that protocol becomes a proprietary API. I think if Git (or a project like it) defined a standard protocol/data format for all the features of an SCM host, then all those providers could adopt it, and we could start moving away from GitHub as the center of the known universe. If we don't make a universal standard (and implementation), then it'll remain the way it is today.
It's kind of sad that this is true.
I'm guilty myself, I contribute to projects on GitHub more often than on any other platform.
And when I search for open source projects the first page I use is GitHub.
They had a sizable lead and completely bungled it.
Some people don't even know the difference between Git and GitHub...
I have also found this to be the case, even with engineers that have years of experience. It's both impressive and awful.
That said, I do feel some joy when I see a project on GitLab and am happy to contribute there, e.g. F-Droid.
Edit: They have a C-API emulation layer; I don't know its limitations or current status, but you can use those libraries [1][2]
[1] https://www.pypy.org/posts/2018/09/inside-cpyext-why-emulati...
[2] https://pythoncapi.readthedocs.io/cpyext.html
Given the way that pypy is implemented, I think the name is quite clever really.
> one can argue it is a bad name
I suppose one can, but it's a python interpreter written in python, so I think it's pretty good.
By all means, I prefer Git branches.
Either commit #1 is on the base branch A, whose current head is now commit #2, and branch B (head commit #3) was forked from it;
OR
commit #1 is on branch "default", commit #2 is on branch "A" with commit #1 as its parent, and commit #3 is on branch "B" with commit #1 also as its parent.
Consider your same example with forking instead of branching: how would the issue be resolved?
1. https://git-scm.com/docs/scalar
2. https://github.blog/2022-10-13-the-story-of-scalar/
3. https://devblogs.microsoft.com/devops/introducing-scalar/
4. https://devblogs.microsoft.com/bharry/the-largest-git-repo-o...
[1] https://github.com/josh-project/josh
[2] https://www.gdcvault.com/play/1027699/Virtual-Sync-Terabytes...
Oh that's quite helpful. I was worried about how lossy the migration would be.