Posted by u/mattip 2 years ago
Ask HN: Is anyone using PyPy for real work?
I have been the release manager for PyPy, an alternative Python interpreter with a JIT [0], since 2015, and have done a lot of work to make it available via conda-forge [1] or by direct download [2]. This includes not only packaging PyPy but also improving an entire C-API emulation layer, so that today we can run (albeit more slowly) almost the entire scientific Python data stack. We get very limited feedback about real people using PyPy in production or research, which is frustrating. Just keeping up with the yearly CPython release cycle is significant work, and efforts to improve the underlying technology need to be guided by user experience, but we hear too little to direct our very limited energy. If you are using PyPy, please let us know, either here or via any of the methods listed in [3].

[0] https://www.pypy.org/contact.html
[1] https://www.pypy.org/posts/2022/11/pypy-and-conda-forge.html
[2] https://www.pypy.org/download.html
[3] https://www.pypy.org/contact.html

ggm · 2 years ago
I'm using PyPy to analyse 350M DNS events a day, caching results in Python dicts to avoid DNS lookup stalls. I'm getting a 95% dict cache hit rate, and use threads with queue locks.

Moving to PyPy definitely sped me up a bit, though not as much as I'd hoped; it's probably all about string indexing into dicts and dict management. I may recode it as a radix tree, but it's hard to work out in advance how different that would be: people have optimised the core data structures pretty well.
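
The pattern described here (a shared dict cache guarded by a lock, with the slow DNS resolution done outside the critical section) can be sketched roughly as below; the names and structure are illustrative, not ggm's actual code:

```python
import threading

class DNSCache:
    """Thread-safe dict cache for DNS results (illustrative sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, name, resolve):
        # Fast path: check the shared dict under the lock.
        with self._lock:
            if name in self._cache:
                self.hits += 1
                return self._cache[name]
            self.misses += 1
        # Slow path: do the DNS query outside the lock so other
        # threads are not stalled behind it.
        result = resolve(name)
        with self._lock:
            self._cache[name] = result
        return result

cache = DNSCache()
fake_resolve = lambda name: "10.0.0.1"  # stand-in for a real DNS query
for _ in range(20):
    cache.lookup("example.com", fake_resolve)
print(cache.hits, cache.misses)  # 19 1
```

With a 95% hit rate, 19 of every 20 lookups stay on the dict fast path, which is the code shape the JIT sees in the hot loop.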

The uplift from normal Python was trivial. Most of the dev time was spent fixing pip3 for PyPy on Debian, not knowing which apt packages to load, amid a lot of "stop using pip" messaging.

danpalmer · 2 years ago
Debian is its own worst enemy with things like this. It's why we eventually moved off it at a previous job: deploying Python server applications on it was dreadful.

I’m sure it’s better if you’re deploying an appliance that you hand off and never touch again, but for evolving modern Python servers it’s not well suited.

gjvc · 2 years ago
Yes, 1000x this. What is it about them that makes them feel entitled to a special "dist-packages" vs. the default "site-packages"? This drives me nuts when I have a bunch of native packages I want to bundle into our in-house Python deployment. CentOS and Ubuntu are vanilla; only Debian (mind-bogglingly) deviates from the well-trodden path.

I still haven't figured out how to beat this dragon. All suggestions welcome!

bashinator · 2 years ago
What distro did you move to? IME, Debian as a base image for Python app containers is also kind of a pain.

bombolo · 2 years ago
It works completely fine in my experience.
syllogism · 2 years ago
If you have very large dicts, you might find this hash table I wrote for spaCy helpful: https://github.com/explosion/preshed . You need to key the data with 64-bit keys. We use this wrapper around murmurhash for it: https://github.com/explosion/murmurhash

There are no docs, so obviously this might not be for you. But the software does work, and is efficient; it's been executed many millions of times by now.
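
The keying step described above (hash each string down to a 64-bit integer before it goes into the table) can be sketched in plain Python; here blake2b stands in for murmurhash, and an ordinary dict stands in for preshed's hash table:

```python
import hashlib

def key64(s):
    # Hash a string down to a 64-bit integer key (stand-in for murmurhash).
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

counts = {}  # ordinary dict standing in for preshed's map type
for name in ["example.com", "example.org", "example.com"]:
    k = key64(name)
    counts[k] = counts.get(k, 0) + 1

print(counts[key64("example.com")])  # 2
```

Note that the original string can no longer be recovered from the table, which is fine for pure counting workloads, and that 64-bit collisions are possible in principle.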

ggm · 2 years ago
I'm using string keys, not 64-bit keys. But thanks, it's nice to share ideas.
mattip · 2 years ago
> it's probably all about string index into dict and dict management

Cool. Is the performance here something you would like to pursue? If so, could you open an issue [0] with some kind of reproducer?

[0] https://foss.heptapod.net/pypy/pypy/-/issues

ggm · 2 years ago
I'm thinking about how to demonstrate the problem. I have a large pickle, but pickle load/dump times bracketed by gc.disable()/gc.enable() really don't say much.

I need to find out how to instrument the seek/add cost of threads against the shared dict under a lock.

My gut feeling is that if I inlined things instead of calling out to functions I'd probably shave a bit more off too. So saying "slower than expected" may be unfair, because there are limits to how much you can speed this kind of thing up. That's why I wondered whether alternative data structures were a better fit.

It's variable-length string indexes into lists/dicts of integer counts. The advantage of a radix trie would be finding the record in time roughly proportional to the length in bits of the string, and the keys do form prefix sets.
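
A character-level trie version of that idea (string keys mapping to integer counts) might look like the sketch below. This is a hypothetical stand-in, not a true bit-level radix tree with path compression, but it shows how shared prefixes collapse into shared paths:

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

class CountTrie:
    """Map variable-length string keys to integer counts via a trie."""

    def __init__(self):
        self.root = TrieNode()

    def add(self, key, n=1):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.count += n

    def get(self, key):
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count

t = CountTrie()
t.add("example.com")
t.add("example.com")
t.add("example.org")
print(t.get("example.com"), t.get("example.org"), t.get("example.net"))  # 2 1 0
```

Lookup cost is proportional to key length rather than table size, and prefix-heavy key sets (like reversed domain names) share most of their nodes.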

CyberDildonics · 2 years ago
> Uplift from normal python was trivial.

By definition if you lift something it is going to go up, but what does this mean?

ggm · 2 years ago
If you replace your Python engine, you have to replace your imports.

Some engines can't build and deploy all imports.

Some engines demand syntactic sugar to do their work. PyPy doesn't.

sitkack · 2 years ago
One should really consider using containers in this situation.
rovr138 · 2 years ago
Can you describe what in this situation warrants it?

I'm very curious about where the line is/should be.

reftel · 2 years ago
I use it at work for a script that parses and analyzes some log files in an unusual format. I wrote a naive parser with a parser-combinator library. It was too slow to be usable with CPython; I tried PyPy and got a 50x speed increase (yes, 50 times faster). Very happy with the results, actually =)
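
The combinator library isn't named here, but the style looks roughly like this toy sketch: parsers are small closures taking a string and a position and returning a value plus the new position (or None on failure). Layers of such closures in a hot loop are exactly the kind of code a tracing JIT can collapse:

```python
# Toy parser-combinator sketch (illustrative, not the library used above).
# A parser is a closure: (text, pos) -> (value, new_pos) on success, None on failure.

def literal(s):
    def parse(text, pos):
        if text.startswith(s, pos):
            return s, pos + len(s)
        return None
    return parse

def seq(*parsers):
    def parse(text, pos):
        values = []
        for p in parsers:
            r = p(text, pos)
            if r is None:
                return None
            value, pos = r
            values.append(value)
        return values, pos
    return parse

def many(p):
    def parse(text, pos):
        values = []
        while (r := p(text, pos)) is not None:
            value, pos = r
            values.append(value)
        return values, pos
    return parse

pair = seq(literal("a"), literal("b"))
word = many(literal("ab"))
print(pair("ab", 0))      # (['a', 'b'], 2)
print(word("ababab", 0))  # (['ab', 'ab', 'ab'], 6)
```

Under CPython every one of these closure calls pays full interpreter overhead; a tracing JIT can inline the whole chain, which is consistent with the large speedup reported.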
mattip · 2 years ago
Thanks for the feedback. It does seem like parsing logs and simulations are a sweet spot for PyPy.
_aavaa_ · 2 years ago
Simulations are, at least in my experience, numba’s [0] wheelhouse.

[0]: https://numba.pydata.org/

zzzeek · 2 years ago
What CPython version and OS was that? I'd be very surprised if modern Python 3.11 had anything an order of magnitude slower like that; things have gotten much faster over the years in CPython.
macNchz · 2 years ago
I put PyPy in production at a previous job, running a pretty high traffic Flask web app. It was quick and pretty straightforward to integrate, and sped up our request timings significantly. Wound up saving us money because server load went down to process the same volume of requests, so we were able to spin down some instances.

Haven’t used it in a bit mostly because I’ve been working on projects that haven’t had the same bottleneck, or that rely on incompatible extensions.

Thank you for your work on the project!

mattip · 2 years ago
You're welcome.

> that rely on incompatible extensions.

Which ones? Is using conda an option? We have more luck getting binary packages into the conda-forge build pipelines than getting projects to build wheels for PyPI.

macNchz · 2 years ago
I can't actually remember off the top of my head. I tried it out a year or two ago but didn't get too far, because during profiling it became clear that the biggest opportunities for performance improvement in this app were algorithmic/query/IO optimizations outside of Python itself, so business-wise it didn't make much sense; though if it had, I think using conda would have been on the table. We make heavy use of Pandas/NumPy et al., though I know those are largely supported now, so I'd guess it was not one of them but something adjacent.
ADcorpo · 2 years ago
This post is a funny coincidence, as I tried today to speed up a CI pipeline running ~10k tests with pytest by switching to PyPy.

I am still working on it, but the main issue for now is psycopg support: I had to install psycopg2cffi in my test environment, but it will probably prevent me from using PyPy for our test suite, because psycopg2cffi does not have the same features and versions as psycopg2. This means either we switch our prod to PyPy, which won't be possible because I am very new on this team and that would be seen as a big, risky change by the others, or we keep in mind that the tests do not run on the exact same runtime as the production servers (which might cause bugs to go unnoticed and reach production, or tests to fail that would otherwise work in a live environment).
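
For what it's worth, psycopg2cffi ships a documented shim, `psycopg2cffi.compat.register()`, that installs itself under the import name `psycopg2` so the rest of a code base keeps its imports unchanged. The underlying mechanism is just a `sys.modules` alias, demonstrated below with a stub module rather than the real driver:

```python
import sys
import types

# Stand-in for what psycopg2cffi.compat.register() does: publish a module
# under the import name "psycopg2" before anything imports it.
shim = types.ModuleType("psycopg2")
shim.apilevel = "2.0"  # DB-API 2.0 marker, as the real drivers expose
sys.modules["psycopg2"] = shim

import psycopg2  # resolves to the shim registered above

print(psycopg2.apilevel)  # 2.0
```

This keeps production code importing `psycopg2` as-is; it does not, of course, paper over any genuine feature or version gaps between the two drivers.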

I think if I ever started a Python project right now, I'd probably try to use PyPy from the start, since (at least for web development) there don't seem to be any downsides to using it.

Anyway, thank you very much for your hard work!

cpburns2009 · 2 years ago
If you use a recent version of PostgreSQL (10+, I believe) you can use psycopg3 [1], which has a pure-Python implementation that should be compatible with PyPy.

[1]: https://www.psycopg.org/psycopg3/docs/basic/install.html

jsmeaton · 2 years ago
Second this. No psycopg2 support (and, to a lesser extent, lxml) is a nonstarter and makes it pretty difficult to experiment with on production code bases. I could see a lot of adoption from Django deployments otherwise.
sodimel · 2 years ago
Yeah, we don't use PyPy on our small Django projects for those exact reasons.
tlocke · 2 years ago
I work on pg8000 https://pypi.org/project/pg8000/ which is a pure-Python PostgreSQL driver that works well with PyPy. Not sure if it would meet all your requirements, but I thought I'd mention it.
lozenge · 2 years ago
One compromise could be to run pypy on draft PRs and CPython on approved PRs and master?
PaulHoule · 2 years ago
I use CPython most of the time, but PyPy was a real lifesaver when I was doing a project that bridged EMOF and RDF, in particular working with moderately sized RDF models (say, 10 million triples) in rdflib.

With CPython, I was frustrated by how slow it was, and complained about it to the people I was working with; PyPy was a simple upgrade that sped up my code to the point where it was comfortable to work with.

mark_l_watson · 2 years ago
That is a great idea! I use rdflib frequently and never thought to try it with PyPy. Now I will.
mattip · 2 years ago
Is your group still using it?
PaulHoule · 2 years ago
That particular code has been retired because, after quite a bit of trying things that weren't quite right, we understood the problem and found a better way to do it. I'm doing the next round of related work (logically modeling XSLT schemas and associated messages in OWL) in Java, because there is already a library that almost does what I want.

I am still using this library that I wrote

https://paulhoule.github.io/gastrodon/

to visualize RDF data, so even if I build my RDF model in Java I am likely to load it up in Python to explore it. I don't know if they are using PyPy, but there is at least one big bank that has people using Gastrodon for the same purpose.

nickpsecurity · 2 years ago
What do you use RDF models for?
PaulHoule · 2 years ago
So I wrote this library

https://paulhoule.github.io/gastrodon/

which makes it very easy to visualize RDF data with Jupyter by turning SPARQL results into data frames.

Here are two essays I wrote using it

https://ontology2.com/essays/LookingForMetadataInAllTheWrong...

https://ontology2.com/essays/PropertiesColorsAndThumbnails.h...

People often think RDF never caught on, but actually there are many RDF-based standards, such as RSS, XMP, and ActivityPub, that you can work with quite directly using RDF tools.

Beyond that, I've been on a standards committee for ISO 20022, where we've figured out, after quite a few years of looking at the problem, how to use RDF and OWL as a master standard for representing messages and schemas in financial messaging. In the project that needed PyPy we were converting a standard represented in EMOF into RDF. Towards the end of last year I figured out the right way to logically model the parts of those messages and the associated schema with OWL. That is on its way to becoming one of those ISO standard documents that unfortunately costs 133 Swiss francs. I also figured out that it is possible to do the same for many messages defined with XSLT, and I'm expecting to get some work applying this to a major financial standard; I think there will be some source code and a public report on that.

Notably, the techniques I use address quite a few problems with the way most people use RDF. In particular, many RDF users don't use the tools available to represent ordered collections; a notable example of the trouble this makes is Dublin Core metadata for documents (say, books), where you can't represent the order of a paper's authors, something authors usually care about a great deal. XMP adapts the Dublin Core standard enough to solve this problem, but with the techniques I use you can use RDF to do anything any document database can, though some SPARQL extensions would make it easier.

eigenvalue · 2 years ago
Thanks for reminding me to look at PyPy again. I usually start all my new Python projects with this block of commands that I keep handy:

Create venv and activate it and install packages:

  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt

I wanted a similar one-liner that I could use on a fresh Ubuntu machine so I can try out PyPy easily in the same way. After a bit of fiddling, I came up with this monstrosity which should work with both bash and zsh (though I only tested it on zsh):

Create venv and activate it and install packages using pyenv/pypy/pip:

  if [ -d "$HOME/.pyenv" ]; then rm -Rf $HOME/.pyenv; fi && \
  curl https://pyenv.run | bash && \
  DEFAULT_SHELL=$(basename "$SHELL") && \
  if [ "$DEFAULT_SHELL" = "zsh" ]; then RC_FILE=~/.zshrc; else RC_FILE=~/.bashrc; fi && \
  if ! grep -q 'export PATH="$HOME/.pyenv/bin:$PATH"' $RC_FILE; then echo -e '\nexport PATH="$HOME/.pyenv/bin:$PATH"' >> $RC_FILE; fi && \
  if ! grep -q 'eval "$(pyenv init -)"' $RC_FILE; then echo 'eval "$(pyenv init -)"' >> $RC_FILE; fi && \
  if ! grep -q 'eval "$(pyenv virtualenv-init -)"' $RC_FILE; then echo 'eval "$(pyenv virtualenv-init -)"' >> $RC_FILE; fi && \
  source $RC_FILE && \
  LATEST_PYPY=$(pyenv install --list | grep -P '^  pypy[0-9\.]*-\d+\.\d+' | grep -v -- '-src' | tail -1) && \
  LATEST_PYPY=$(echo $LATEST_PYPY | tr -d '[:space:]') && \
  echo "Installing PyPy version: $LATEST_PYPY" && \
  pyenv install $LATEST_PYPY && \
  pyenv local $LATEST_PYPY && \
  pypy -m venv venv && \
  source venv/bin/activate && \
  pip install --upgrade pip && \
  pip install wheel && \
  pip install -r requirements.txt
Maybe others will find it useful.

nicce · 2 years ago
Just a note: these scripts are not comparable in monstrosity, as the first just initializes the project whereas the second sets up a whole PyPy installation.

So if you have PyPy already on your machines;

  pypy -m venv venv && \
    source venv/bin/activate && \
    pip install --upgrade pip && \
    pip install wheel && \
    pip install -r requirements.txt
isn't that bad after all; my initial thought was, do I really need all of the above just to initialize the project? :D

eigenvalue · 2 years ago
That's true, but you can run the first block of commands on a brand-new Ubuntu installation, because regular CPython is installed by default, whereas you would need the whole second block when starting on a fresh machine.
deizel · 2 years ago
Given you'll want to activate a virtual environment for most Python projects, and projects live in directories.. I find myself constantly reaching for direnv. https://github.com/direnv/direnv/wiki/Python

    printf 'layout python\npip install --upgrade pip pip-tools setuptools wheel\npip-sync\n' > .envrc
When you cd into a given project, it'll activate the venv, upgrade pip and friends to non-ancient versions with support for the latest PEPs (i.e. `pyproject.toml` support in a new Python 3.9 env), and verify the latest pinned packages are present. It's just too useful not to have.

    direnv stdlib
This command (or this link https://direnv.net/man/direnv-stdlib.1.html) will print many useful functions that can be used in the `.envrc` shell script that is loaded when entering directories, ranging from many languages, to `dotenv` support, to `on_git_branch` for e.g. syncing deps when switching feature branches.

Check it out if you haven't. I've been using it for more years than I can count, and being able to cd from a PHP project to a Ruby project to a Python project with ease really helps with context switching.

stefanor · 2 years ago
If you have a system-level PyPy installed, the equivalent is:

  pypy3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt
Not very different...

saila · 2 years ago
For a more apples-to-apples comparison, you would install PyPy using your package manager, e.g. apt install pypy3 or brew install pypy3. On Linux, you might have to add a package repo first.
eigenvalue · 2 years ago
I personally find that much scarier to do, since it seems a lot more likely to screw up other stuff on your machine, whereas with pyenv it's all self-contained in the venv. Also, apt packages tend to install a pretty old version.
pdw · 2 years ago
We don't. To be honest, I didn't realize PyPy supported Python 3; I thought it was eternally stuck on Python 2.7.

So, the good: it apparently now supports Python 3.9? You might want to update your front page; it only mentions Python 3.7.

The bad: it only supports Python 3.9, and we use newer features throughout our code, so it'd be painful to even try it out.

eyegor · 2 years ago
Their docs seem perpetually out of date, but they recently released support for 3.10. I haven't been able to try it recently because our projects use 3.10 features, but in the past it was easily a 10-100x speedup as long as all of a project's libraries worked.

https://downloads.python.org/pypy/

mattip · 2 years ago
It supports Python 3.10 now too. Thanks, I updated the site.
ADcorpo · 2 years ago
I think it supports up to 3.10, as there are official Docker images for that version; I saw them this morning.

Maybe the site is not up to date?

mkl · 2 years ago
You should probably put "Ask HN:" in your title.

Personally I don't use PyPy for anything, though I have followed it with interest. Most of the things I need to go faster are numerical, so Numba and Cython seem more appropriate.

_aaed · 2 years ago
Cut him some slack, he's only been registered for 10 years
ezekiel68 · 2 years ago
I read this as humor and I imagine mattip may have done also.
Cort3z · 2 years ago
I don't think it's about being strict or condescending. For some HN readers the post will show up in a different category and generally be easier for people to find, thus giving the post more visibility :)

Edit: typo