Never mind less slow, how about making it work first?
I'm in a poorly connected location, and it seems pip can only download from pythonhosted.org at 10-20 kB/s. Worse still, pip downloads time out and fail extremely quickly.
If I rerun the pip install, instead of resuming the download, it downloads the file from the beginning, then times out and fails again somewhere in the middle.
I tried going through a private VPN hosted on Linode, with similar results.
For the official Python package manager, this is simply unreliable and unacceptable behavior.
You pip install from it (e.g. a local devpi instance) instead of PyPI. It will in turn download from PyPI and hand you the result. It also keeps itself updated, and you can batch-download overnight the packages you expect to need.
As a result, our entire team always has most packages available locally. Changing machines or locations, or purging a cache, doesn't mean losing this benefit.
Besides, pip caching wheels doesn't mean it stops making requests, so this is still a better experience.
For big companies I recommend it anyway: it speeds up work for the whole team and CI, lets you publish private packages, etc.
If you can't do that, the next best thing is to use "pip download" instead of "pip install" and save the wheels to a hard drive.
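Roughly, assuming a requirements.txt and a ./wheels directory of your choosing, that looks like:

pip download -r requirements.txt -d ./wheels # fetch everything once, while you have a connection
pip install --no-index --find-links ./wheels -r requirements.txt # later installs never touch the network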
Thanks for the tips, I'll definitely look into them.
EDIT: And thanks for the tip from godmode2019. I upvoted you both.
However, they don't change the fact that pip by default assumes a decent Internet connection, with its short timeout and no resume on downloads, and thus is unusable with anything less.
Again, IMHO that's simply unacceptable for the official Python package manager.
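To be fair, pip does let you stretch the timeout and retry budget, which at least reduces the number of failed runs, even though it still won't resume a partial download (these are standard pip options; the values and package name are just examples):

pip install --timeout 120 --retries 10 some-package # longer socket timeout, more retry attempts
pip config set global.timeout 120 # or make the longer timeout permanent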
I've had ~5 Mbps download speeds for the last 5 years in a rural US location. I set up devpi on an old Raspberry Pi hoping to have this same experience, but after several months I disabled it, having found it was actually slowing things down. If memory serves, the issue was that the local package lookups were excruciatingly slow (even over localhost), and I was surprised to see that devpi would routinely soak up multiple GB of memory and start swapping (I recall a few memory-leak issues in their repo).
Seemed like a great idea, and perhaps I just needed something beefier than an RPi 3, but it didn't work out for me.
You can change the pypi index URL with `pip config set global.index-url`.
For example, being in China, I often switch to `https://pypi.tuna.tsinghua.edu.cn/simple`. You may need to look up which mirror is available and fastest in your region.
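For reference, switching is a one-liner; the URL here is just the TUNA mirror mentioned above, so substitute whichever mirror is closest to you:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple # persist it in pip's config
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple some-package # or per invocation (some-package is a placeholder)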
But I shouldn't need to know this; pip should take care of picking the best download mirror for me.
Or it could just support resuming a previous download. I don't actually mind waiting for a slow download or having to rerun pip install a few times, as long as it makes progress each time I run it.
People should be aware that adding additional mirrors (instead of switching mirrors, as suggested here) can lead to big slowdowns: pip has no way to determine whether a wheel it has found is the best choice without comparing all the options, so it will keep searching both locally and across all configured indices until it can resolve the "best" wheel to download.
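In other words, prefer replacing the index over adding one; the package name below is a placeholder:

pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple some-package # switch: only one index to consult
pip install --extra-index-url https://pypi.tuna.tsinghua.edu.cn/simple some-package # add: pip now has to check both indices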
The most efficient workflow I've found for pip so far is something along the lines of:
pip-compile # resolve and pin the dependency tree
grep '==' requirements-pinned.txt | xargs -n1 -P "$(nproc)" pip wheel # build the wheels in parallel
pip install *.whl # install the pre-built wheels
It may not avoid all the slowness, but when nicely integrated into a build/CI system it can avoid suffering it more than once.
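One way to make that stick in CI is to build into a directory the CI system caches between runs (the directory and requirements file names are assumptions; the cache step itself depends on your CI):

pip wheel -r requirements-pinned.txt -w ./wheelhouse # first run builds; cached runs find most wheels already there
pip install --no-index --find-links ./wheelhouse -r requirements-pinned.txt # install strictly from the cached wheels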
Depends on your platform. If you're using ARM (like AWS Graviton, and possibly Apple M1) you might find that many wheels are not available. That might be different now (my information is 2 years old), but I remember 'pip install numpy' taking ages. Every time someone pushed a change we'd have a 15-minute job clogging the pipes. Luckily I had already created an artifact repo (AWS CodeArtifact) for our own internal packages, so I added a step to our build process to push locally built wheels; that way a painfully slow build would only happen once.
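A rough sketch of that "build once, push to the internal index" step using twine; the repository URL is a placeholder, and CodeArtifact needs its own authentication step on top of this:

pip wheel numpy -w ./dist # build the wheel once on the target ARM machine
twine upload --repository-url https://pypi.example.internal/ ./dist/*.whl # publish it to the private index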
I had a situation recently where I saved a bunch of time on resolution during installs by breaking the Python packages up into smaller groups installed in separate steps. We have some packages in a private devpi server and some dependencies that can come from any PyPI mirror. Forcing the private packages into their own install step drastically cut down the time to resolve and start downloading. Does anyone know of comparisons of resolver metrics? I spend several hours a week waiting on Python package installation and would love to improve the situation.
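A minimal sketch of that split, with placeholder package names and a placeholder devpi URL:

pip install --index-url https://devpi.example.internal/root/prod/+simple/ internal-pkg-a internal-pkg-b # private packages only
pip install -r requirements.txt # everything else resolves against the default index or mirror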
On a related note: our NPM install times are even worse: any tips to help there would be welcome.
I like pnpm, which is like npm but does caching by default and some other optimizations (the p is for performant).
The negative: sometimes it plays badly with some packages. Sometimes you need the (stupidly named) --shamefully-hoist flag, and sometimes even that doesn't help.
pnpm by default does NOT use the flat structure in node_modules but puts packages into separate sub-folders; sometimes that plays badly with dependencies that expect the flat structure (usually webpack-related kludges).
But if you are starting fresh, or can spend some time on debugging, pnpm is amazing.
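For reference, the basic flow and the hoisting workaround mentioned above (the flag really is spelled --shamefully-hoist):

npm install -g pnpm # or enable it via corepack
pnpm install # uses a content-addressed store and links packages instead of copying them into every project
pnpm install --shamefully-hoist # fall back to a flat node_modules for packages that expect it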
For those wondering what the benefit is over pip freeze: pip freeze dumps your entire venv, while pip-compile takes just the project's declared dependencies and infers the pinned requirements file from that.
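Concretely, with pip-compile from pip-tools (https://github.com/jazzband/pip-tools) and its default file names:

pip freeze > requirements.txt # dumps everything in the venv, transitive dependencies included
pip-compile requirements.in # reads only your declared top-level deps and writes a fully pinned requirements.txt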
Pip has had a ton of development in the last few years, and continues to do so. I'd say stick with pip unless there's a specific problem you're addressing with pipenv or Poetry.
However, like a sibling comment, I've also heard good things about Poetry. Probably worth giving it a spin somewhere and seeing if my initial thought still holds.
oh hi, sorry for the late reply. indeed you are right; i was typing from my phone and was a bit too hasty. i am not a fan of poetry after reading these:
* https://iscinumpy.dev/post/bound-version-constraints/
* https://iscinumpy.dev/post/poetry-versions/
TLDR poetry does not support PEP 621 (project metadata in pyproject.toml) either, and i feel this is the direction the python community has been moving towards recently.
also pdm is developed by frostming, who's on PyPA (Python Packaging Authority) and more libraries are extending support for pdm (streamlit just today).
but in general i am not against poetry. it's a good tool and it pushed package managers a fair bit further - providing a superior alternative to conda imho
Some discussion: https://discuss.python.org/t/why-does-pip-reach-out-to-indic...
I don't recall if this issue is solved by specifying hashes (or perhaps even pinned versions are adequate) -- I would hope so.
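If hashes do help, the pip-tools + pip combination would look something like this; I haven't verified that it avoids the extra index lookups, so treat it as a sketch:

pip-compile --generate-hashes requirements.in # pin versions and record sha256 hashes
pip install --require-hashes -r requirements.txt # pip refuses anything that doesn't match the recorded hashes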
Pip install git+<repo_url>
It still happens from time to time, but if somebody else reads this comment: you'll want to check how many wheels you end up building before doing so.
Also, you may want to cache those builds once and for all.
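One way to do that caching, with a placeholder repository URL and a local ./wheelhouse directory:

pip wheel "git+https://github.com/example/project.git" -w ./wheelhouse # build the wheel from the repo once
pip install ./wheelhouse/project-*.whl # reuse it on later installs instead of rebuilding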
A lot of maintainers don't pay attention to excluding test cases and artifacts from their packages, leading to ridiculous package size growth.
Last I checked numpy was close to 100MB, a huge chunk of that being test case artifacts.
Django packages all translations for all languages to ever exist.
Pandas bundles pre-built dynamic libraries with debug symbols.
Etc.
Most of my "production" virtualenvs are close to a GB nowadays, which is insane.
I did have a look at numpy, and on my machine the tests did not bloat it as much as you led me to believe. The `core/tests/` modules are 3 MB and the `__pycache__` doubles that to 6 MB. What do you refer to as "test case artifacts"? The modules, the pycache, or both?
Also, w.r.t. your statement on pandas: is it the debug symbols that account for the bloat, or the libs themselves?
Just installed a quick "data science"-like virtualenv; here are examples of the things I'm talking about:
$ du -sh scipy
110M    scipy
$ find scipy -type f -name '*.so' -exec strip -s {} \;
$ du -sh scipy
76M     scipy
$ du -sh django/contrib/admin/locale/
5.4M    django/contrib/admin/locale/
https://hatch.pypa.io/
The USP of Hatch is that it uses the latest generation of Python standards and tech under the covers, so you have a unified tool that's less quirky than previous ones. Poetry and pipenv predate some of the improvements in Python packaging, so had to develop some things in their own way.
But it is very slow.
However, pdm and the likes are interesting because they are indeed PEP 582 compliant, which lets you use __pypackages__ the way one uses node_modules.
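As a sketch of what that looks like in practice, assuming pdm is configured to use __pypackages__ rather than a virtualenv (support for this mode has varied across pdm versions):

pdm install # drops dependencies into ./__pypackages__/3.10/lib instead of a venv (the version dir matches your interpreter)
pdm run python script.py # pdm puts __pypackages__ on the path, much like Node resolving node_modules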