Hi everyone,
I built PyXL — a hardware processor that executes a custom assembly generated from Python programs, without using a traditional interpreter or virtual machine. It compiles Python -> CPython Bytecode -> Instruction set designed for direct hardware execution.
I’m sharing an early benchmark: a GPIO test where PyXL achieves a 480ns round-trip toggle — compared to 14-25 micro seconds on a MicroPython Pyboard - even though PyXL runs at a lower clock (100MHz vs. 168MHz).
The design is stack-based, fully pipelined, and preserves Python's dynamic typing without static type restrictions. I independently developed the full stack — toolchain (compiler, linker, codegen), and hardware — to validate the core idea. Full technical details will be presented at PyCon 2025.
Demo and explanation here: https://runpyxl.com/gpio Happy to answer any questions
Reading further down the page it says you have to compile the python code using CPython, then generate binary code for its custom ISA. That's neat, but it doesn't "execute python directly" - it runs compiled binaries just like any other CPU. You'd use the same process to compile for x86, for example. It certainly doesn't "take regular python code and run it in silicon" as claimed.
A more realistic claim would be "A processor with a custom architecture designed to support python".
However it doesn't seem like it does, the pyc still had to be further processed into machine code. So I also agree with the parent comment that this seems a bit misleading.
I could be convinced that that native code is sufficiently close to pyc that I don't feel misled. Would it be possible to write a boot loader which converts pyc to machine code at boot? If not, why not?
Anyway, the project is mega-cool, and very useful (in some specific applications). Is just that the title is a little bit confusing.
What gets executed is a direct mapping of Python semantics to hardware. In that sense, this is more “direct” than most systems running Python.
This phrasing is about conveying the architectural distinction: Python logic executed natively in hardware, not interpreted in software.
"runs directly on embedded hardware"
https://www.raspberrypi.com/documentation/microcontrollers/m...
I don't understand why they have the need to do this...
Still cool, but I would definitely ease back the first claim.
I was going to say it does make me wonder how much a pain a direct processor like this would be in terms of having to constantly update it to adapt to the new syntax/semantics everytime there's a new release.
Also - are there any processors made to mimic ASTs directly? I figure a Lisp machine does something like that, but not quite... Though I've never even thought to look at how that worked on the hardware side.
EDIT: I'm not sure AST is the correct concept, exactly, but something akin to that... Like building a physical structure of the tree and process it like an interpreter would. I think something like that would require like a real-time self-programming FPGA?
The system compiles Python source to CPython ByteCode, and then from ByteCode to a hardware-friendly instruction set. Since it builds on ByteCode—not raw syntax—it’s largely insulated from most language-level changes. The ByteCode spec evolves slowly, and updates typically mean handling a few new opcodes in the compiler, not reworking the hardware.
Long-term, the hardware ISA is designed to remain fixed, with most future updates handled entirely in the toolchain. That separation ensures PyXL can evolve with Python without needing silicon changes.
4o also ends many of its messages that way. It has to be related.
I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.
I'd be very interested to know why you chose to stay this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).
There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.
Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.
As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.
https://doc.pypy.org/en/latest/faq.html#what-is-pypy
Taking the concrete example of the `struct` module as a use-case, I'm curious if you have a plan for it and similar modules. The tricky part of course is that it is implemented in C.
Would you have to rewrite those stdlib modules in pure python?
https://en.wikipedia.org/wiki/Java_processor
https://en.m.wikipedia.org/wiki/Java_Card
To the point where most adult humans in the world probably own a Java-supported processor on a SIM card. Or at least an emulator (for eSIMs).
On example of a CPU arch used on JavaCard devices is the ARM926EJ-S that I believe can execute Java byte code.
Deleted Comment
In today's world this is generally not great.
Interpreted languages often include bytecode instructions that actually do very complex things and so do not nicely map to operations that can be sanely implemented in hardware. So you end up with all the usual boring alu, branch etc operations implemented in hardware, and anything else traps and runs a software handler.
Separately, interpreted language bytecode is often a poor fit for hardware execution; e.g. for dotnet (and python) bytecode many otherwise trivial operations do not explicitly encode information about types, and therefore the hardware must track type information in order to do the right thing (floating point addition looks very very different from integer addition!)
A lot of effort has been spent on compiler optimisation for x86 and ARM code. JIT compilers benefit massively from this. Meanwhile, interpreted language bytecode is often very lightly optimised, where it is optimised at all (until relatively recently, explicit Python policy as set by Guido van Rossum was to never optimise!) Optimisation has the side effect of throwing away potentially valuable high level / semantic information; optimising at the bytecode level hinders debuggability for interpreted code (which is a primary goal in Python) and can also be detrimental to JIT output; and the results are underwhelming compared to JIT since your small team of plucky bytecode optimisers isn't really going to compete with decades of x86 compiler development; and so the incentive is to not do much of that.
So if you're running bytecode in hardware, on top of all the obvious costs, you are /running unoptimised code/. This is actually the thing that kills these projects - everything else can ultimately be solved by throwing more silicon at it, but this can only really be solved by JITting, and the existing JIT+x86 / JIT+ARM solution is cheap and battle tested.
In general I think the practical result is that x86 is like democracy. It’s not always efficient but there are other factors that make it the best choice.
The vast majority of computers in the world are not x86.
https://pycpu.wordpress.com/
"Running a very small subset of python on an FPGA is possible with pyCPU. The Python Hardware Processsor (pyCPU) is a implementation of a Hardware CPU in Myhdl. The CPU can directly execute something very similar to python bytecode (but only a very restricted instruction set). The Programcode for the CPU can therefore be written directly in python (very restricted parts of python) ..."
I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.
The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...
It almost sounds like you're asking for Nim ( https://nim-lang.org/ ); and there are some projects using it for microcontroller programming, since it compiles down to C (for ESP32, last I saw).
- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.
- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).
- There are multiple ways every single operation in Python might be called; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which method is used unless you know the type (and if that type is monkeypatched, then the method that called might change).
A JIT compiler might be able to optimize some of these cases (observing what is the actual type used), but a JIT compiler can use the source file/be included in the CPython interpreter.
I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.
So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.
That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.
When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...
Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.
Reason Python doesn’t do that is a mix of lack of engineering resources, desire to keep the implementation fairly simple, and the requirement of backwards compatibility of C code calling into Python to manipulate Python objects.
If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.
Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.
For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.
For example if you compile x + y in C, you'll get a few clean instructions that add the data types of x and y. But if you compile this thing in some sort of Python compiler it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that is executed to unwrap values, determine which "add" to call, and so forth.
If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.
That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."
For most code, you don't need static typing for most overloaded operators to get decent performance, either. From my experience with Ur-Scheme, even a simple prediction that most arithmetic is on (small) integers with a runtime typecheck and conditional jump before inlining the integer version of each arithmetic operation performs remarkably well—not competitive with C but several times faster than CPython. It costs you an extra conditional branch in the case where the type is something else, but you need that check anyway if you are going to have unboxed integers, and it's smallish compared to the call and return you'll need once you find the correct overload to call. (I didn't implement overloading in Ur-Scheme, just exiting with an error message.)
Even concatenating strings is slow enough that checking the tag bits to see if you are adding integers won't make it much slower.
Where this approach really falls down is choosing between integer and floating point math. (Also, you really don't want to box your floats.)
And of course inline caches and PICs are well-known techniques for handling this kind of thing efficiently. They originated in JIT compilers, but you can use them in AOT compilers too; Ian Piumarta showed that.
I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.
Even type annotations would have to be anointed with semantics, which (IIUC) they have none today (w/CPython AFAIK). They are just annotations for use by static checkers.
Unless you can perform optimizations, the compilation can't make a whole bunch of progress beyond that bytecode.
* In fact, IIRC there was/is some "freeze" program that would do just that: compile your python program. Under the covers it would bundle libpython with your *.pyc bytecode.
Where’s the AOT compiler that handles the whole Python language?
It’s not routine because its not even an option, and people who are concerned either use the tools that let them compile a subset of Python within a larger, otherwise-interpreted program, or use a different language.
Smaller binaries, faster execution, proper metaprogramming, actual type safety, and you don't need to bundle a whole interpreter just to say "hello world"
* Could you share the assembly language of the processor?
* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?
HDL: Verilog
Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.
Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).
Your example contains some integer arithmetic, I'm curious if you've implemented any other Python data types like floats/strings/tuples yet. If you have, how does your ISA handle binary operations for two different types like `1 + 1.0`, is there some sort of dispatch table based on the types on the stack?
Perhaps they don't need to be interruptible if there's no virtual memory.
How does it allocate memory? Malloc and free are pretty complex to do in hardware.
https://en.wikipedia.org/wiki/PicoJava
https://news.ycombinator.com/item?id=8598037
The main thing that appealed to me about this idea is that it would require a two-dimensional program counter. As I recall from the original specification, skipping through blank space is supposed to take O(1) time, but I didn't plan on implementing that. I did, however, imagine a machine with 256x256 bytes of memory, where some 80x25 (or 24?) region was reserved as directly memory-mapped to a character display (and protected at boot by surrounding it with jump instructions).
I’ve been using .NET since 2001 so maybe I have it confused with something else, but at the same time a lot of the web from that era is just gone, so it’s possible something like this did exist but didn’t gain any traction and is now lost to the ether.