About 15 years ago there was a series of papers by Andreas Gal (then future and now former CTO of Mozilla) about precisely this aspect of JIT-compiler design.
If you have a program represented as a series of VM instructions and you try to jit them individually, the JIT compiler has very limited ability to optimize. Plus, after each such instruction, execution comes back to the piece of the interpreter that selects the next instruction to execute: essentially a giant switch statement that makes branch prediction on the CPU very difficult and inefficient.
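To make the "giant switch" concrete, here's a toy stack-based VM in Python. The opcodes and the little program are invented for illustration; a real interpreter's dispatch works the same way, just in C and with hundreds of cases:

```python
# Toy stack-based VM illustrating a switch-style dispatch loop.
PUSH, ADD, JUMP_IF_NONZERO, HALT = range(4)

def run(program):
    stack, pc = [], 0
    while True:
        op = program[pc]       # fetch opcode
        arg = program[pc + 1]  # fetch operand
        # The "giant switch": one unpredictable branch per executed
        # instruction, which defeats the CPU's branch predictor.
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == JUMP_IF_NONZERO:
            if stack.pop() != 0:
                pc = arg
                continue
        elif op == HALT:
            return stack.pop()
        pc += 2

# compute 2 + 3
print(run([PUSH, 2, PUSH, 3, ADD, 0, HALT, 0]))  # → 5
```

Compiling each opcode handler in isolation leaves all of this fetch/decode/branch machinery in place between handlers.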
The alternative is to not only jit the instructions but also embed pieces of the interpreter between them, so that the compiler can see how those instructions are connected and generate code for the whole sequence. This way the compiler can make better assumptions about the code and optimize it a lot more.
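A hand-written sketch of the payoff (not actual PyPy or SpiderMonkey output): when each opcode is compiled in isolation, every handler still shuffles the VM stack and the compiler sees only opaque calls; once the compiler sees the whole sequence "PUSH 2; PUSH 3; ADD" at once, the stack traffic and the arithmetic can be folded away entirely:

```python
def vm_push(stack, value):
    stack.append(value)

def vm_add(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def compiled_per_instruction():
    # Each opcode compiled individually: the compiler cannot see
    # across the calls, so all the stack traffic survives.
    stack = []
    vm_push(stack, 2)
    vm_push(stack, 3)
    vm_add(stack)
    return stack.pop()

def compiled_as_sequence():
    # With the whole sequence visible, inlining and constant folding
    # reduce it to the answer; the VM stack disappears.
    return 5

assert compiled_per_instruction() == compiled_as_sequence()
```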
This work eventually made it into all sorts of virtual machines. Adobe used it for Flash, Mozilla for their SpiderMonkey JavaScript engine, and Google used this design for early versions of Android's Dalvik VM.
PyPy uses it, too. That's why they call it a meta-tracing compiler: it compiles pieces of your code and pieces of its own interpreter together to produce more optimized machine code.
If you're interested in trace-based compilation, the origins go much further back than Andreas Gal's work, into 1970s research on compilers for VLIW architectures [1].
It's not the dispatch to the next instruction that's expensive in most virtual machines. It's the sheer complexity of each instruction as it maps to the underlying assembly instructions [2].
Just getting rid of the dispatch loop doesn't help much, and in many cases the increased pressure on the instruction cache makes performance worse.
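A sketch of why a single bytecode is expensive regardless of dispatch: a dynamic-language ADD has to check types, pick a fast path, and fall back to a full operator-protocol lookup. This is heavily simplified relative to what CPython's actual BINARY_ADD handler does, but the shape is the same:

```python
def generic_add(a, b):
    # fast path: both operands are plain ints
    if type(a) is int and type(b) is int:
        return a + b
    # separate fast path for string concatenation
    if type(a) is str and type(b) is str:
        return a + b
    # slow path: try a.__add__, then the reflected b.__radd__
    result = NotImplemented
    if hasattr(a, "__add__"):
        result = a.__add__(b)
    if result is NotImplemented and hasattr(b, "__radd__"):
        result = b.__radd__(a)
    if result is NotImplemented:
        raise TypeError(f"unsupported operand types: {type(a)}, {type(b)}")
    return result
```

Every one of those checks and lookups turns into real machine instructions, so the handler body dwarfs the cost of the dispatch branch in front of it.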
I'm trying to help with this problem at the moment for the CRuby JIT compiler.
SpiderMonkey, LuaJIT, and Dalvik record the trace in the bytecode interpreter itself. There's no separate baseline JITed code generated with extra trace-recording instrumentation.
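A toy illustration of recording in the interpreter itself: while a hot loop runs, the dispatch loop appends each opcode it executes to a linear trace, which a tracing JIT would then compile (with guards where the trace made a choice). The opcodes and program here are invented, not any real VM's:

```python
def interpret_and_trace(program):
    trace, acc, pc = [], 0, 0
    while True:
        op, arg = program[pc]
        trace.append((pc, op))          # record each executed opcode
        if op == "INC":
            acc += 1
            pc += 1
        elif op == "JUMP_IF_LT":
            target, limit = arg
            # a real tracer would emit a guard on this comparison
            pc = target if acc < limit else pc + 1
        elif op == "RETURN":
            return acc, trace

program = [
    ("INC", None),
    ("JUMP_IF_LT", (0, 3)),  # jump back to pc 0 while acc < 3
    ("RETURN", None),
]
result, trace = interpret_and_trace(program)
print(result)      # → 3
print(len(trace))  # → 7 recorded ops: three loop iterations plus RETURN
```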
What PyPy does is quite different. PyPy is a Python virtual machine written in a language called RPython and has an interpreter and trace compiler for RPython.
It adds a whole layer of indirection compared to traditional trace compilers, which makes some optimizations harder, but it makes the basic machinery of trace recording and compilation easier to implement and, crucially, makes it somewhat reusable across different languages.
PyPy is fantastic. It rescued a project I had where a massive amount of JSON needed to be parsed within an absurdly short heartbeat. Not even some of the fast JSON libraries could do it, but with a drop-in PyPy replacement it worked, and still does, wonderfully.
What is stopping PyPy from getting more widespread adoption in the Python community? Surely it isn't that everyone is using the latest version of CPython; it's been years and many are still using 2.7.
The startup times are a fair bit worse, and the memory requirements are a fair bit higher. Despite everything else, Python does tend to be fast enough, and the worse startup/memory usage isn't necessarily worth it.
Also it doesn't speed up things that rely on external C libraries, so code using numpy/scipy/tensorflow/etc doesn't generally run appreciably faster.
If it hasn't changed (and I wasn't wrong in the first place), it has some issues with deep C integration: you need to jump through some hoops to get numpy working, along with some of the other powerhouse libraries.
I have no detailed information, but the development of PyPy has been going on for 20 years and was partly sponsored by the EU. If you compare that with LuaJIT, which was developed by a single person in a shorter timeframe and achieves performance even faster than V8, I would assume that either the conceptual approach taken by RPython/PyPy is less suited or Python is just that much harder to speed up. I would guess the former, because the performance of other RPython-based implementations is not insanely impressive either.
Thanks for the link. I had a quick look at the top-rated comment, but "Python spends almost all of its time in the C runtime - This means that it doesn't really matter how quickly you execute the 'Python' part of Python" is already wrong. CPython is an interpreter, and this interpreter is implemented in C. You cannot argue that, because the runtime spends most of its time in C functions, it makes no sense to improve it. That's exactly what a JIT (in contrast to an interpreter) is for.
More or less. It really depends on what your code does and how frequently it calls into C modules such as NumPy. The Python side is JIT-able while the C side is not. PyPy3 currently implements Python 3.6.9, so you'll be missing out on any 3.7+ features (e.g., built-in dataclasses, though a backport is available from PyPI).
PyPy claims to be compatible with CPython 3.6.9 (https://www.pypy.org/compat.html). CPython 3.7 is currently the most common Python version. The project I've worked on since last year depends on features introduced in CPython 3.7. The migration to CPython 3.8 is already underway on many Linux distros (Fedora is already on it). So PyPy is unfortunately too far behind.
Is that accurate? What are the pros & cons of this approach compared to a normal tracing JIT, e.g. LuaJIT?
[1] https://archive.org/details/optimizationofho00fish/mode/2up
[2] https://www.sciencedirect.com/science/article/pii/S157106610...
If we put this figure in context with the Computer Language Benchmarks Game (see https://benchmarksgame-team.pages.debian.net/benchmarksgame/...), we're somewhere between Racket and Dart.
Is a further speedup to be expected? Why is it still slower than V8 after so much development effort?