The Python Package Index Should Get Rid of Its Training Wheels

One of the reasons Python is so popular as a scripting language in science and ML is that it has a very good story for installing Frankenstein code bases made of assembly, C and Pascal sprinkled with SIMD.

I was here before Anaconda popularized the idea of binary packages for Python and inspired wheels to replace eggs, and I don't want to go back to having to compile that nightmare on my machine.

People that have that kind of idea are likely capable of running kubs containers, understand vectorization and can code a monad on the top of their head.

Half of the Python coders are struggling to use their terminal. You have Windows dev that lives in Visual Studio, teachers of high school that barely show a few functions, mathematicians that are replacing R/Matlab, biologists forced to script something to write a paper, frontend dev that just learned JS is not the only language, geographers that are supplicating their GIS system to do something it's not made for, kids messing with their dad laptop, and probably a dog somewhere.

Binary wheels are a gift from the Gods.

HelloNurse · a year ago

Compiling Python extensions is a nightmare because we allow Autoconf, CMake, Visual Studio, Bazel etc. to make it complicated and nonportable; when someone sets out to wrap some library for Python the quality of the result is limited by the quality of the tools and by low expectations.

A serious engineering effort along the lines of the Zig compiler would allow Python to build almost everything from source out of the box; exotic compilers and binary dependencies, not "Frankenstein code bases" per se, are the actual obstacles.

ddulaney · a year ago

What you’re proposing here is essentially “if we fix the C/C++ build systems environment, this would be easy!”. You’re absolutely right, but fixing that mess has been a multi-decade goal that’s gone nowhere.

One of the great victories of new systems languages like Rust and Zig is that they standardized build systems. But untangling each individual dependency’s pile of terrible CMake (or autoconf, or vcxproj) hacks is a project in itself, and it’s often a deeply political one tied up with the unique history of each project.

BiteCode_dev · a year ago

Problem: saving bandwidth for pypi.

Solution: unify all software build stack.

Tomorrow, chat, we will tackle world hunger in an attempt to save people from anorexia.

Like and subscribe.

klyrs · a year ago

You and your respondents see this as a Python problem. I see it as a Zig problem.

As in, Zig will seamlessly add Python to its build system long before Python's build story is so robust.

aragilar · a year ago

Autoconf is easy, it's the highly bespoke build systems that someone thought would be a good idea that require the right phase of the moon that are the challenge.

aragilar · a year ago

Wheels do predate conda (though manylinux did base itself on the experience of Anaconda and Enthought's base set of libraries), and there were distributions like Enthought (or even more field specific distributions like Ureka and individuals like Christoph Gohlke) that provided binaries for the common packages.

What the conda ecosystem did was provide a not-horrible package manager that included the full stack (in a pinch to help fix up a student's git repository on a locked-down Windows system I used conda to get git, and you can get R and do R<->python connections easily), and by providing a standard repository interface (as opposed to a locked down and limited version that the other providers appeared to do), conda pushed out anyone doing something bespoke and centralised efforts so spins like conda-forge, bioconda, astroconda could focus on their niche and do it well.

RockRobotRock · a year ago

If I didn’t have access to binary wheels, I wouldn’t have been able to learn Python in high school as a heavy Windows user.

Even now, running into package compile issues is a sure fire way to lose half an hour of my time.

tonnydourado · a year ago

> was here before Anaconda popularized the idea of binary packages for Python

This has incredible "I was there, Gandalf, three thousand years ago" vibes =P

nektro · a year ago

the article does not suggest removing wheeels

The analysis contains an error. Binary artifacts don't cause exponential growth in storage requirements. It's still just linear. That's also quite clearly seen when after a phase of exponential growth, the binary artifacts still only account for 75%.

So this whole strategy (actually a pitch for zigbuild) would ideally reduce the storage requirements to 25% - which would buy the whole system maybe a year or two if the growth continues.

Of course, it's a good idea to build client-side. Especially considering the security implications. But it won't fundamentally change the problem.

kristoff_it · a year ago

The problem is not space, it's the fact that bandwidth costs 4x the total operating income that the PSF has. Everything else is context to understand and offer some insight into this problem.

This is a pitch for Zig as a C/C++ build system because, as you can see in other comments in this submission, there's still a lot of people that don't even believe this is a solvable problem at all, while in reality Zig does a pretty good job at solving it.

Only somebody actively involved with Python can really say if and how this could help improve the situation for PyPI, but if the people who actually happen to be working on something potentially beneficial weren't allowed to speak "because it's a pitch" then what even is the point of software interoperability.

As an occasional Python user, I also think that it's silly that people keep putting containers everywhere and pre-building a ton of binaries when I know for a fact that you could trivially `zig biuld` that shit, since I do it in my projects all the time (see https://github.com/allyourcodebase/).

zahlman · a year ago

>it's the fact that bandwidth costs 4x the total operating income that the PSF has.

If you're getting this from my analysis included in https://news.ycombinator.com/item?id=41586751, I would encourage you to consider some of the replies to my comment. I cited and worked with AWS retail rates because that's the information I had available. It's still a major contribution on Fastly's part which likely dwarfs whatever Microsoft and Google are offering.

For what it's worth, I suspect those containerized builds are responsible for the overwhelming majority of Setuptools downloads, for example. There's really no good reason why annual downloads of Setuptools (or really anything else) should rival the population of the Earth (https://pypistats.org/packages/setuptools) - Pip caches downloaded wheels, and separate copies of Pip should ordinarily share a cache on a given machine. But I wouldn't be at all surprised to learn the Docker, k8s etc. defeat that caching scheme.

chippiewill · a year ago

> Binary artifacts don't cause exponential growth in storage requirements. It's still just linear.

I think what the author was getting at was the combinatoric explosion as you add different variants (CPU architecture, operating systems, python versions) - but I agree that "exponential growth" is not the right term to use here.

yaleman · a year ago

The fact that tensorflow takes up 12.9TiB is truly horrifying, and most of that because they use pypi's storage as a dumping ground for their pre-release packages. What a nightmare they've put on other people's shoulders.

theamk · a year ago

I think pypi should require larger packages, like tensorflow, to self-host their releases.

There is all support for that already - the pypi index file contains arbitrary URL for data file and a sha256 hash. Let pypi store the hashes, so there is no shenanigans with versions being secretly overridden, but point the actual data URLs to other servers.

(There must obviously be a balance for availability vs pypi's cost, so maybe pypi hosts only smaller files, and larger files must be self-hosted? Or pypi hosts "major releases" while pre-releases are self-hosted? And there should be manual exceptions for "projects with funding from huge corporations" and "super popular projects from solo developers"...)

I believe tensorflow does remove old pre-releases (I know other projects do), so that number I think might be fairly static?

That tensorflow is that big isn't surprising, given the install of it plus its dependencies is many gigabytes (you can see the compressed sizes of wheels on the release pages e.g. https://pypi.org/project/tensorflow/#files), and the "tensorflow" package (as opposed to the affiliated packages) based on https://py-code.org/stats is 965.7 GiB, which really only includes a relatively small number of pre-releases.

Why tenserflow is that big comes down to needing to support many different kinds of GPUs with different ecosystem versions, and I suspect the build time of them with zig cc (assuming it works, and doesn't instead require pulling in a different compiler/toolchain) would be so excessive (especially on IoT/weaker devices) that it would make the point of the exercise moot.

amoshebb · a year ago

Is it though? If it saves one engineer one afternoon that storage has paid for itself, and this thing has hundreds of thousands of downloads a day.

Wouldn’t it be more horrifying to force everybody who wants to use a prerelease to waste an afternoon getting it to build just to save half a hard drive?

skeledrew · a year ago

That's besides the point though. Yes, having prebuilt binaries is very helpful. But what happens if Fastly decides against renewing next time and there is nobody else willing to sponsor? The cost is through the sky for the PSF to handle. Where does PyPI go?

choeger · a year ago

pxc · a year ago

There are already lots of passable package managers that know how to provide working binaries for the native, non-Python dependencies of Python packages. Instead of trying to make Python packages' build processes learn how to build everything else in the world, one thing Python packages could do is just record their external dependencies in a useful way. Then package managers that are actually already designed for and committed to working with multiple programming language ecosystems could handle the rest.

This is something that could be used by Nix, Guix, Spack, as well as more conventional software distributions like Conda, Pkgsrc, MacPorts, Homebrew, etc. With the former, users could even set up per-project environments that contain those external dependencies, like virtualenvs but much more general. But the simple feature of this metadata would naturally be valuable, if provided well, to maintainers of all Linux distros and many other software distributions, where autogenerated packages are already the norm for languagea like Rust and Go, while creating such tooling for Python is riddled with thorny problems— o these two proposals are not mutually exclusive, and perhaps each is individually warranted on its own.

Enriching package metadata in this simple way has already been proposed here:

htps://peps.python.org/pep-0725/

trlampert · a year ago

"When Python came into existence, repeatable builds (i.e. not yet reproducible, but at least correctly functioning on more than one machine) were a pipe dream. Building C/C++ projects reliably has been an intractable problem for a long time, but that's not true anymore."

I'd dispute that. It used to be the case that building NumPy just worked, now there are Cython/meson and a whole lot of other dependency issues and the build fails.

"At the Zig Software Foundation we look up to the Python Software Foundation as a great example of a fellow 501(c)(3) non-profit organization that was able to grow an incredibly vibrant community ..."

Better don't meet your heroes. Python was a reasonable community made up and driven by creative individuals. In the last 8 years, it has been taken over by corporate bureaucrats who take credit for the remnants of that community and who will destroy it. The PSF has never done anything except for selling expensive conference tickets and taking care of its own.

rightbyte · a year ago

These package respositories are used in a wasteful way. Probably by thousands of CI servers spinning up blank slate docker containers etc.

strokirk · a year ago

CI providers should definitely start proxying PyPI with their own cache.

Numerlor · a year ago

I wanted to spin up a mirror locally to do simple caching for docker builds but the tooling was lacking, there was a way to do a direct mirror of pypi locally but no other way of adding custom indices

Pip already does its own caching, but it's maddeningly difficult to even locate and extract files, let alone set up anything usable. It's also needlessly difficult to make Pip just use such a cache directly - for example, if you haven't pinned a version, it will automatically check PyPI to figure out the latest version even if you have cached wheels already.

I don't know (I lack the experience), but I assume that container systems get in the way of Pip finding its normal cache, too. (If they're emulating a filesystem or something, then the cache is in the wrong filesystem unless you reuse the container.)

Spivak · a year ago

It might honestly be easier and cheaper for those CI providers to just pay PyPI's bandwidth costs.

robertlagrant · a year ago

Yeah we use Artifactory internally as a proxy for DockerHub, Pypi, NPM, etc.

faustin · a year ago

conda-forge handles the first part of this (reproducible builds) for most common platforms. The idea of rebuilding deleted artifacts on demand sounds nice in theory, but it has the complication that rebuilding something that depends on several other somethings will likely trigger a build cascade where a bunch of stuff has to get built in order. Hopefully none of those ancient build scripts require external resources hosted at dead links!

Also, this is very much assuming that the code is both C or C++ and that LLVM is the right compiler to use. Fortran is still a major part of the ecosystem, which the Zig compiler isn't going to solve. There already exists numerous options to provide compilers to the problematic platform, the fact is binary wheels (mostly) solve the issue far better than doing local builds.

Also, the large packages are typically due to the need to support the huge number of possible GPU combinations (because you care about what CUDA versions are supported).

This feels like a solution being forced on a problem (not that zig cc isn't cool), but the post has really misunderstood the issues around wheels.

AndyKelley · a year ago

One strategy would be making PyPI packages fetch any external resources from PyPI, or at least add PyPI URLs as mirrors for such resources.

zamlag · a year ago

At work we don't use PyPI any longer. We have our own set of curated packages, the security issues are just too great:

https://developers.slashdot.org/story/24/09/15/0030229/fake-...

https://jfrog.com/blog/revival-hijack-pypi-hijack-technique-...

https://jfrog.com/blog/leaked-pypi-secret-token-revealed-in-...

We consider switching to Java, C++ or Rust because of general quality issues with Python.

notpushkin · a year ago

What do Java and Rust (and your C++ package manager of choice) do to mitigate those things?

Don't know about Java or Rust, but in C++ it's much harder to get new packages, which works wonder to keep number of dependencies down.

In Python, installing packages is so simple, people just do it after 10 second google search - and that library can pull dozens or hundreds of dependencies with usually no review.

In C++, given that you have to manually find the library and incorporate into the build system, people generally spend a few minutes looking at the options and choosing the best one - and this includes checking for things "how long has this library been around", "does it have many users" and "is it in healthy state". And this must be repeated for each dependency as well, so that even a dozen of dependencies is a huge negative - and such library will not be used unless there are no better alternatives.

(The exception to this rule are libraries provided by your Linux distribution. Those can be easily installed by the dozen, and that's OK - the distribution makers did all the hard work for you, vetting and packaging those libraries)

This in general means a much healthier dependency state for C++, as well as much higher code quality. No one is going to add a dependency to a core library just to add a better progress bars for example.

Deleted Comment

ajrqh · a year ago

What is not clear about "general quality issues", i.e., issues unrelated to package management?