Show HN: I built a hardware processor that runs Python

Are there any limitations on what code can run? (discounting e.g. memory limitations and OS interaction)

I'd love to read about the design process. I think the idea of taking bytecode aimed at the runtime of dynamic languages like Python or Ruby or even Lisp or Java and making custom processors for that is awesome and (recently) under-explored.

I'd be very interested to know why you chose to do this, why it was a good idea, and how you went about the implementation (in broad strokes if necessary).

hwpythonner · 8 months ago

Thanks — really appreciate the interest!

There are definitely some limitations beyond just memory or OS interaction. Right now, PyXL supports a subset of real Python. Many features from CPython are not implemented yet — this early version is mainly to show that it's possible to run Python efficiently in hardware. I'd prefer to move forward based on clear use cases, rather than trying to reimplement everything blindly.

Also, some features (like heavy runtime reflection, dynamic loading, etc.) would probably never be supported, at least not in the traditional way, because the focus is on embedded and real-time applications.

As for the design process — I’d love to share more! I'm a bit overwhelmed at the moment preparing for PyCon, but I plan to post a more detailed blog post about the design and philosophy on my website after the conference.

mikepurvis · 8 months ago

In terms of a feature-set to target, would it make sense to be going after RPython instead of "real" Python? Doing that would let you leverage all the work that PyPy has done on separating what are the essential primitives required to make a Python vs what are the sugar and abstractions that make it familiar:

https://doc.pypy.org/en/latest/faq.html#what-is-pypy

ammar2 · 8 months ago

> I'd prefer to move forward based on clear use cases

Taking the concrete example of the `struct` module as a use-case, I'm curious if you have a plan for it and similar modules. The tricky part of course is that it is implemented in C.

Would you have to rewrite those stdlib modules in pure python?

bokchoi · 8 months ago

There were a few chips that supported directly executing JVM bytecodes. I'm not sure why it didn't take off, but I think it is generally more performant to JIT compile hotspots to native code.

https://en.wikipedia.org/wiki/Java_processor

teruakohatu · 8 months ago

It did take off just in a different direction:

https://en.m.wikipedia.org/wiki/Java_Card

To the point where most adult humans in the world probably own a Java-supported processor on a SIM card. Or at least an emulator (for eSIMs).

One example of a CPU arch used on JavaCard devices is the ARM926EJ-S, which I believe can execute Java bytecode.

tsukikage · 8 months ago

Running bytecode directly on hardware has certainly been tried (e.g. ARM's Jazelle).

In today's world this is generally not great.

Interpreted languages often include bytecode instructions that actually do very complex things and so do not nicely map to operations that can be sanely implemented in hardware. So you end up with all the usual boring alu, branch etc operations implemented in hardware, and anything else traps and runs a software handler.

Separately, interpreted language bytecode is often a poor fit for hardware execution; e.g. for dotnet (and python) bytecode many otherwise trivial operations do not explicitly encode information about types, and therefore the hardware must track type information in order to do the right thing (floating point addition looks very very different from integer addition!)

A lot of effort has been spent on compiler optimisation for x86 and ARM code. JIT compilers benefit massively from this. Meanwhile, interpreted language bytecode is often very lightly optimised, where it is optimised at all (until relatively recently, explicit Python policy as set by Guido van Rossum was to never optimise!) Optimisation has the side effect of throwing away potentially valuable high level / semantic information; optimising at the bytecode level hinders debuggability for interpreted code (which is a primary goal in Python) and can also be detrimental to JIT output; and the results are underwhelming compared to JIT since your small team of plucky bytecode optimisers isn't really going to compete with decades of x86 compiler development; and so the incentive is to not do much of that.

So if you're running bytecode in hardware, on top of all the obvious costs, you are /running unoptimised code/. This is actually the thing that kills these projects - everything else can ultimately be solved by throwing more silicon at it, but this can only really be solved by JITting, and the existing JIT+x86 / JIT+ARM solution is cheap and battle tested.

f1shy · 8 months ago

I understand that is the reason Lisp Machines were dropped (even at a time when Lisp was still a very well-regarded language). At least that is what I took from the SICP videos: by 1986 it was already clear that it was much better to compile to ASM.

checker659 · 8 months ago

Forth CPU (in SystemVerilog): https://www.youtube.com/watch?v=DRtSSI_4dvk

hermitShell · 8 months ago

JVM I think I can understand, but do you happen to know more about LISP machines and whether they use an ISA specifically optimized for the language, or if the compilers for x86 end up just doing the same thing?

In general I think the practical result is that x86 is like democracy. It’s not always efficient but there are other factors that make it the best choice.

kragen · 8 months ago

They used an ISA specifically optimized for the language. At the time it was not known how to make compilers for Lisp that did an adequate job on normal hardware.

The vast majority of computers in the world are not x86.

f1shy · 8 months ago

When the RISC processors were available (for the same reason RISC started to grow) it was better to just compile to ASM.

Why is it not routine to "compile" Python? I understand that the interpreter is great for rapid iteration, cross compatibility, etc. But why is it accepted practice in the Python world to eschew all of the benefits of compilation by just dumping the "source" file in production?

cchianel · 8 months ago

The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library). Without type annotations, there is very little for a non-JIT compiler to optimize, since:

- The vast majority of code generation would have to be dynamic dispatches, which would not be too different from CPython's bytecode.

- Types are dynamic; the methods on a type can change at runtime due to monkey patching. As a result, the compiler must be able to "recompile" a type at runtime (and thus, you cannot ship optimized target files).

- There are multiple ways every single operation in Python might be called; for instance `a.b` either does a __dict__ lookup or a descriptor lookup, and you don't know which method is used unless you know the type (and if that type is monkeypatched, then the method that gets called might change).

A JIT compiler might be able to optimize some of these cases (observing what is the actual type used), but a JIT compiler can use the source file/be included in the CPython interpreter.
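To make the bullet points above concrete, here is a minimal sketch (the class and attribute names are invented for illustration) of how monkey patching changes what a call resolves to, and how `a.b` can go through either the instance `__dict__` or a descriptor on the type:

```python
class Point:
    def __init__(self, x):
        self.x = x

    def norm(self):
        return abs(self.x)


p = Point(-3)
before = p.norm()  # uses the method defined in the class body

# Monkey patch: replace the method on the class while the program runs.
# Existing instances pick up the new behaviour immediately.
Point.norm = lambda self: self.x * self.x
after = p.norm()


class Temp:
    def __init__(self):
        self.celsius = 20.0  # plain attribute, stored in the instance __dict__

    @property
    def fahrenheit(self):  # descriptor living on the type, not the instance
        return self.celsius * 9 / 5 + 32


t = Temp()
celsius_in_dict = "celsius" in vars(t)
fahrenheit_in_dict = "fahrenheit" in vars(t)
print(before, after, celsius_in_dict, fahrenheit_in_dict, t.fahrenheit)
```

Both `t.celsius` and `t.fahrenheit` look identical at the call site, which is why an ahead-of-time compiler cannot pick one lookup strategy without knowing the type.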

hwpythonner · 8 months ago

You make a great point — type information is definitely a huge part of the challenge.

I'd add that even beyond types, late binding is fundamental to Python’s dynamism: Variables, functions, and classes are often only bound at runtime, and can be reassigned or modified dynamically.

So even if every object had a type annotation, you would still need to deal with names and behaviors changing during execution — which makes traditional static compilation very hard.

That’s why PyXL focuses more on efficient dynamic execution rather than trying to statically "lock down" Python like C++.
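A small sketch of the late binding described above (function names are made up for the example): the name `greet` inside `caller` is resolved each time `caller` runs, so rebinding it later changes the behaviour of code that was already defined.

```python
def greet():
    return "hello"


def caller():
    # `greet` is not resolved when caller() is defined; the name is
    # looked up again every time caller() runs.
    return greet()


first = caller()


def greet():  # rebind the same name to a different function
    return "goodbye"


second = caller()
print(first, second)
```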

pjmlp · 8 months ago

Solved by Smalltalk, Self, and Lisp JITs, that are in the genesis of JIT technology, some of it landed on Hotspot and V8.

Qem · 8 months ago

> The primary reason, in my opinion, is the vast majority of Python libraries lack type annotations (this includes the standard library).

When type annotations are available, it's already possible to compile Python to improve performance, using Mypyc. See for example https://blog.glyph.im/2022/04/you-should-compile-your-python...

Someone · 8 months ago

Python doesn’t eschew all benefits of compilation. It is compiled, but to an intermediate byte code, not to native code, (somewhat) similar to the way java and C# compile to byte code.

Those, at runtime (and, nowadays, optionally also at compile time), convert that to native code. Python doesn’t; it runs a bytecode interpreter.

Reason Python doesn’t do that is a mix of lack of engineering resources, desire to keep the implementation fairly simple, and the requirement of backwards compatibility of C code calling into Python to manipulate Python objects.
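As a small illustration of the compile-to-bytecode step described above, the standard `dis` module can list the instructions CPython generates for a function. Exact opcode names differ between CPython versions, so treat the printed listing as version-dependent:

```python
import dis


def add(a, b):
    return a + b


# Collect the names of the bytecode instructions CPython compiled `add` into.
opnames = [instr.opname for instr in dis.get_instructions(add)]
print(opnames)

# The raw bytecode is stored on the function's code object as bytes.
raw = add.__code__.co_code
print(len(raw), "bytes of bytecode")
```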

jerf · 8 months ago

If you define "compiling Python" as basically "taking what the interpreter would do but hard-coding the resulting CPU instructions executed instead of interpreting them", the answer is, you don't get very much performance improvement. Python's slowness is not in the interpreter loop. It's in all the things it is doing per Python opcode, most of which are already compiled C code.

If you define it as trying to compile Python in such a way that you would get the ability to do optimizations and get performance boosts and such, you end up at PyPy. However that comes with its own set of tradeoffs to get that performance. It can be a good set of tradeoffs for a lot of projects but it isn't "free" speedup.

jonathaneunice · 8 months ago

A giant part of the cost of dynamic languages is memory access. It's not possible, in general, to know the type, size, layout, and semantics of values ahead of time. You also can't put "Python objects" or their components in registers like you can with C, C++, Rust, or Julia "objects." Gradual typing helps, and systems like Cython, RPython, PyPy etc. are able to narrow down and specialize segments of code for low-level optimization. But the highly flexible and dynamic nature of Python means that a lot of the work has to be done at runtime, reading from `dict` and similar dynamic in-memory structures. So you have large segments of code that are accessing RAM (often not even from caches, but genuine main memory, and often many times per operation). The associated IO-to-memory delays are HUGE compared to register access and computation more common to lower-level languages. That's irreducible if you want Python semantics (i.e. its flexibility and generality).

Optimized libraries (e.g. numpy, Pandas, Polars, lxml, ...) are the idiomatic way to speed up "the parts that don't need to be in pure Python." Python subsets and specializations (e.g. PyPy, Cython, Numba) fill in some more gaps. They often use much tighter, stricter memory packing to get their speedups.
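A rough way to see the "tighter memory packing" point using only the standard library (this is a CPython-specific sketch; `sys.getsizeof` figures vary by version and platform): a `list` holds pointers to separately allocated integer objects, while `array` stores raw machine integers contiguously.

```python
import sys
from array import array

values = list(range(1000, 2000))

# A list stores pointers; each int is its own heap object with a header.
boxed_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# array("q") stores raw signed 64-bit integers back to back.
packed = array("q", values)
packed_bytes = sys.getsizeof(packed)

print(boxed_bytes, packed_bytes)
```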

For the most part, with the help of those lower-level accelerations, Python's fast enough. Those who don't find those optimizations enough tend to migrate to other languages/abstractions like Rust and Julia because you can't do full Python without the (high and constant) cost of memory access.

ModernMech · 8 months ago

Part of the issue is the number of instructions Python has to go through to do useful work. Most of that is unwrapping values and making sure they're the right type to do the thing you want.

For example if you compile x + y in C, you'll get a few clean instructions that add the data types of x and y. But if you compile this thing in some sort of Python compiler it would essentially have to include the entire Python interpreter; because it can't know what x and y are at compile time, there necessarily has to be some runtime logic that is executed to unwrap values, determine which "add" to call, and so forth.
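The runtime logic mentioned above can be sketched in Python itself. This is a simplified approximation of what `x + y` involves, not CPython's actual implementation: it ignores the rule that gives a subclass's reflected method priority, and `dynamic_add` is a name invented for this example.

```python
def dynamic_add(x, y):
    """Simplified model of how `x + y` is dispatched at runtime."""
    # 1. Ask the left operand's type for __add__.
    add = getattr(type(x), "__add__", None)
    if add is not None:
        result = add(x, y)
        if result is not NotImplemented:
            return result
    # 2. Fall back to the right operand's reflected method.
    radd = getattr(type(y), "__radd__", None)
    if radd is not None:
        result = radd(y, x)
        if result is not NotImplemented:
            return result
    # 3. Nobody knows how to add these two types.
    raise TypeError(
        f"unsupported operand types: {type(x).__name__} and {type(y).__name__}"
    )


print(dynamic_add(1, 2), dynamic_add(1, 1.5), dynamic_add("a", "b"))
```

Note that `1 + 1.5` takes the fallback path: `int.__add__` returns `NotImplemented` for a float, and `float.__radd__` produces the result.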

If you don't want to include the interpreter, then you'll have to add some sort of static type checker to Python, which is going to reduce the utility of the language and essentially bifurcate it into annotated code you can compile, and unannotated code that must remain interpreted at runtime that'll kill your overall performance anyway.

That's why projects like Mojo exist and go in a completely different direction. They are saying "we aren't going to even try to compile Python. Instead we will look like Python, and try to be compatible, but really we can't solve these ecosystem issues so we will create our own fast language that is completely different yet familiar enough to try to attract Python devs."

kragen · 8 months ago

You don't need the whole Python interpreter to fall back to dynamic method dispatch for overloaded operators. CPython itself implements them with per-interface vtables for C extensions, very similar to Golang but laboriously constructed by hand.

For most code, you don't need static typing for most overloaded operators to get decent performance, either. From my experience with Ur-Scheme, even a simple prediction that most arithmetic is on (small) integers with a runtime typecheck and conditional jump before inlining the integer version of each arithmetic operation performs remarkably well—not competitive with C but several times faster than CPython. It costs you an extra conditional branch in the case where the type is something else, but you need that check anyway if you are going to have unboxed integers, and it's smallish compared to the call and return you'll need once you find the correct overload to call. (I didn't implement overloading in Ur-Scheme, just exiting with an error message.)
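The small-integer prediction described above can be modelled roughly as follows. This is a toy simulation in Python, not Ur-Scheme's code: the low bit of each "machine word" is a tag (1 for a small integer, 0 for a heap reference), and `HEAP`, `box`, `unbox`, and `add` are names assumed for the example.

```python
HEAP = []  # stand-in for heap memory holding boxed (non-integer) values


def box(value):
    """Encode a Python value as a tagged machine word."""
    if isinstance(value, int):
        return (value << 1) | 1  # fixnum: payload shifted left, low bit 1
    HEAP.append(value)
    return (len(HEAP) - 1) << 1  # "pointer": heap index, low bit 0


def unbox(word):
    """Decode a tagged word back into a Python value."""
    return word >> 1 if word & 1 else HEAP[word >> 1]


def add(a, b):
    if a & b & 1:  # both tag bits set: the predicted integer fast path
        return (((a >> 1) + (b >> 1)) << 1) | 1
    return box(unbox(a) + unbox(b))  # slow path: general dispatch


print(unbox(add(box(2), box(40))), unbox(add(box(1), box(1.5))))
```

The fast path costs one mask-and-branch before the integer add, which is the extra conditional branch the comment refers to.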

Even concatenating strings is slow enough that checking the tag bits to see if you are adding integers won't make it much slower.

Where this approach really falls down is choosing between integer and floating point math. (Also, you really don't want to box your floats.)

And of course inline caches and PICs are well-known techniques for handling this kind of thing efficiently. They originated in JIT compilers, but you can use them in AOT compilers too; Ian Piumarta showed that.
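For readers unfamiliar with inline caches, here is a minimal monomorphic sketch (the `InlineCache` class is hypothetical, and it deliberately omits invalidation when a class is patched at runtime): each call site remembers the last receiver type and the method it resolved to, so repeated calls on the same type skip the lookup.

```python
class InlineCache:
    """One cache per call site: remembers the last receiver type seen."""

    def __init__(self, method_name):
        self.method_name = method_name
        self.cached_type = None
        self.cached_method = None
        self.misses = 0

    def call(self, receiver, *args):
        receiver_type = type(receiver)
        if receiver_type is not self.cached_type:
            # Cache miss: do the full lookup once and remember the result.
            self.misses += 1
            self.cached_type = receiver_type
            self.cached_method = getattr(receiver_type, self.method_name)
        # Cache hit (or freshly filled): call the resolved method directly.
        return self.cached_method(receiver, *args)


site = InlineCache("upper")
results = [site.call("a"), site.call("b"), site.call(b"c")]
print(results, site.misses)
```

A polymorphic inline cache (PIC) extends the same idea by keeping a short list of (type, method) pairs per call site instead of a single entry.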

franga2000 · 8 months ago

There's no benefit that I know of, besides maybe a tiny cold start boost (since the interpreter doesn't need to generate the bytecode first).

I have seen people do that for closed-source software that is distributed to end-users, because it makes reverse engineering and modding (a bit) more complicated.

Qem · 8 months ago

Check Nuitka: https://nuitka.net/

hwpythonner · 8 months ago

There have been efforts (like Cython, Nuitka, PyPy’s JIT) to accelerate Python by compiling subsets or tracing execution — but none fully replace the standard dynamic model at least as far as I know.

wyldfire · 8 months ago

For python, compilation means emitting some bytecode. And you could conceivably ship that bytecode *. But because it's so terribly dynamic of a language, virtually nothing is bound to anything until you execute this particular line. "What code does this function call resolve to?" -- we'll find out when we get there. "What type does this local use?" -- we'll find out when we get there.

Even type annotations would have to be anointed with semantics, which (IIUC) they have none today (w/CPython AFAIK). They are just annotations for use by static checkers.
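A quick demonstration that annotations carry no runtime semantics in CPython (the function is invented for the example): the annotated types are stored as metadata, but nothing stops a caller from passing something else.

```python
def double(x: int) -> int:
    return x * 2


# The annotations are recorded as plain metadata on the function...
annotations = double.__annotations__

# ...but they are not enforced: a str goes through without complaint.
from_int = double(21)
from_str = double("ab")
print(annotations, from_int, from_str)
```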

Unless you can perform optimizations, the compilation can't make a whole bunch of progress beyond that bytecode.

* In fact, IIRC there was/is some "freeze" program that would do just that: compile your python program. Under the covers it would bundle libpython with your *.pyc bytecode.

dragonwriter · 8 months ago

> Why is it not routine to "compile" Python?

Where’s the AOT compiler that handles the whole Python language?

It’s not routine because it’s not even an option, and people who are concerned either use the tools that let them compile a subset of Python within a larger, otherwise-interpreted program, or use a different language.

f1shy · 8 months ago

AFAIK, one reason is that if you use "eval()" anywhere, you already need a whole Python compiler shipped with your program. So compiling is no different from shipping the code with the interpreter.
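A short sketch of why `eval()` drags the compiler along (the expression and variable names are made up): the source text only exists at runtime, so it has to be compiled at runtime too.

```python
expression = "price * quantity"  # imagine this string arrives at runtime

# compile() turns source text into a code object; this is the compiler
# that has to be present in the shipped program.
code = compile(expression, "<runtime>", "eval")

# eval() then executes that code object against the given names.
total = eval(code, {"price": 3, "quantity": 4})
print(type(code).__name__, total)
```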

seanw444 · 8 months ago

It's called Nim.

archargelod · 8 months ago

Comparing Nim to compiled Python is almost insulting.

Smaller binaries, faster execution, proper metaprogramming, actual type safety, and you don't need to bundle a whole interpreter just to say "hello world"

zik · 8 months ago

This is a very cool project but I feel like the claim is overstated: "PyXL is a custom hardware processor that executes Python directly — no interpreter, no JIT, and no tricks. It takes regular Python code and runs it in silicon."

Reading further down the page it says you have to compile the python code using CPython, then generate binary code for its custom ISA. That's neat, but it doesn't "execute python directly" - it runs compiled binaries just like any other CPU. You'd use the same process to compile for x86, for example. It certainly doesn't "take regular python code and run it in silicon" as claimed.

A more realistic claim would be "A processor with a custom architecture designed to support python".

goranmoomin · 8 months ago

Not related to the project in any way, but I would say that if the hardware is running on CPython bytecode, I’d say that’s as far as it can get for executing Python directly – AFAIK running python code with the `python3` executable also compiles Python code into bytecode `*.pyc` files before it runs it. I don’t think anyone claims that CPython is not running Python code directly…

hamandcheese · 8 months ago

I agree with you, if it ran pyc code directly I would be okay saying it "runs python".

However it doesn't seem like it does, the pyc still had to be further processed into machine code. So I also agree with the parent comment that this seems a bit misleading.

I could be convinced that that native code is sufficiently close to pyc that I don't feel misled. Would it be possible to write a boot loader which converts pyc to machine code at boot? If not, why not?

Well it really does not run CPython, but CPython bytecode, compiled down to an assembler. Granted, a very specific, tailored assembler, but still.

Anyway, the project is mega-cool, and very useful (in some specific applications). It's just that the title is a little bit confusing.

Fair point if you're looking at it through a strict compiler-theory lens, but just to clarify—when I say "runs Python directly," I mean there is no virtual machine or interpreter loop involved. The processor executes logic derived from Python ByteCode instructions.

What gets executed is a direct mapping of Python semantics to hardware. In that sense, this is more “direct” than most systems running Python.

This phrasing is about conveying the architectural distinction: Python logic executed natively in hardware, not interpreted in software.

franzb · 8 months ago

Wouldn't an AoT Python-to-x86 compiler lead to a similar situation where the x86 processor would "run Python directly"?

_kidlike · 8 months ago

After a quick search I found that even Raspberry makes the same claim...

"runs directly on embedded hardware"

https://www.raspberrypi.com/documentation/microcontrollers/m...

I don't understand why they have the need to do this...

rcxdude · 8 months ago

Micropython does run directly on the hardware, though. It's a bare-metal binary, no OS. Which is a different claim to running the python code you give it 'directly'.

Well, running Python on Raspbian, you could toggle a pin at a maximum of a couple of kHz, nowhere near the 2 MHz you can do with this project. It also claims predictability, so I assume the timing jitter is much lower, which is a very important parameter for real-time applications.

PyXL is a bit more direct :)

dividuum · 8 months ago

Huh? MicroPython literally does exactly that: You copy over Python source(!) code and it runs on the Pico.

wormius · 8 months ago

Yeah that was my first thing. Wait a minute you run a compiler on it? It's literally compiled code, not direct. Which is fine, but yeah, overselling what it is/does.

Still cool, but I would definitely ease back the first claim.

I was going to say it does make me wonder how much of a pain a direct processor like this would be in terms of having to constantly update it to adapt to the new syntax/semantics every time there's a new release.

Also - are there any processors made to mimic ASTs directly? I figure a Lisp machine does something like that, but not quite... Though I've never even thought to look at how that worked on the hardware side.

EDIT: I'm not sure AST is the correct concept, exactly, but something akin to that... Like building a physical structure of the tree and process it like an interpreter would. I think something like that would require like a real-time self-programming FPGA?

PyXL deliberately avoids tying itself to Python’s high-level syntax or rapid surface changes.

The system compiles Python source to CPython ByteCode, and then from ByteCode to a hardware-friendly instruction set. Since it builds on ByteCode—not raw syntax—it’s largely insulated from most language-level changes. The ByteCode spec evolves slowly, and updates typically mean handling a few new opcodes in the compiler, not reworking the hardware.

Long-term, the hardware ISA is designed to remain fixed, with most future updates handled entirely in the toolchain. That separation ensures PyXL can evolve with Python without needing silicon changes.

BiteCode_dev · 8 months ago

Which is what Nuitka does. But the result doesn't allow for real-time Python programs, and you don't get direct access to the hardware like here.

rytill · 8 months ago

The phrasing “<statement> — no X, Y, Z, just <final simplified claim>” is cropping up a lot lately.

4o also ends many of its messages that way. It has to be related.

Y_Y · 8 months ago

I built a hardware processor that runs Python programs directly, without a traditional VM or interpreter. Early benchmark: GPIO round-trip in 480ns — 30x faster than MicroPython on a Pyboard (at a lower clock). Demo: https://runpyxl.com/gpio

jonjacky · 8 months ago

A much earlier (2012) attempt at a Python bytecode interpreter on an FPGA:

https://pycpu.wordpress.com/

"Running a very small subset of python on an FPGA is possible with pyCPU. The Python Hardware Processsor (pyCPU) is a implementation of a Hardware CPU in Myhdl. The CPU can directly execute something very similar to python bytecode (but only a very restricted instruction set). The Programcode for the CPU can therefore be written directly in python (very restricted parts of python) ..."

boutell · 8 months ago

This is very, very cool. Impressive work.

I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.

The background garbage collection thing is easier said than done, but I'm talking to someone who has already done something impressively difficult, so...

rangerelf · 8 months ago

> I'm interested to see whether the final feature set will be larger than what you'd get by creating a type-safe language with a pythonic syntax and compiling that to native, rather than building custom hardware.

It almost sounds like you're asking for Nim ( https://nim-lang.org/ ); and there are some projects using it for microcontroller programming, since it compiles down to C (for ESP32, last I saw).

obitsten · 8 months ago

rthomas6 · 8 months ago

* What HDL did you use to design the processor?

* Could you share the assembly language of the processor?

* What is the benefit of designing the processor and making a Python bytecode compiler for it, vs making a bytecode compiler for an existing processor such as ARM/x86/RISCV?

Thanks for the question.

HDL: Verilog

Assembly: The processor executes a custom instruction set called PySM (Not very original name, I know :) ). It's inspired by CPython Bytecode — stack-based, dynamically typed — but streamlined to allow efficient hardware pipelining. Right now, I’m not sharing the full ISA publicly yet, but happy to describe the general structure: it includes instructions for stack manipulation, binary operations, comparisons, branching, function calling, and memory access.
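Since the ISA itself is not public, the following is only a generic illustration of what "stack-based" execution with stack manipulation, binary operations, comparisons, and branching looks like. The mnemonics are invented for this sketch and are not PySM instructions.

```python
def run(program):
    """Execute a list of (opcode, argument) pairs on an operand stack."""
    stack = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "LT":
            right, left = stack.pop(), stack.pop()
            stack.append(left < right)
        elif op == "JUMP_IF_FALSE":
            if not stack.pop():
                pc = arg
        elif op == "JUMP":
            pc = arg
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# Roughly: "small" if (2 + 3) < 10 else "large"
program = [
    ("PUSH", 2),
    ("PUSH", 3),
    ("ADD", None),
    ("PUSH", 10),
    ("LT", None),
    ("JUMP_IF_FALSE", 8),
    ("PUSH", "small"),
    ("JUMP", 9),
    ("PUSH", "large"),
    ("HALT", None),
]
result = run(program)
print(result)
```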

Why not ARM/X86/etc... Existing CPUs are optimized for static, register-based compiled languages like C/C++. Python’s dynamic nature — stack-based execution, runtime type handling, dynamic dispatch — maps very poorly onto conventional CPUs, resulting in a lot of wasted work (interpreter overhead, dynamic typing penalties, reference counting, poor cache locality, etc.).

pak9rabid · 8 months ago

Wow, this is fascinating stuff. Just a side question (and please understand I am not a low-level hardware expert, so pardon me if this is a stupid question): does this arch support any sort of speculative execution, and if so do you have any sort of concerns and/or protections in place against the sort of vulnerabilities that seem to come inherent with that?

> it includes instructions for stack manipulation, binary operations

Your example contains some integer arithmetic, I'm curious if you've implemented any other Python data types like floats/strings/tuples yet. If you have, how does your ISA handle binary operations for two different types like `1 + 1.0`, is there some sort of dispatch table based on the types on the stack?

Python the language isn't stack-based, though CPython's bytecode is. You could implement it just as well on top of a register-based instruction set. You may have a point about the other features that make it hard to compile, though.

larusso · 8 months ago

This sounds like your 'arch' (sorry don't 100% know the correct term here) could potentially also run ruby/js if the toolchain can interpret it into your assembly language?

tlb · 8 months ago

How do you deal with instructions that iterate through variable amounts of memory, like concatenating strings? Are such instructions interruptible?

Perhaps they don't need to be interruptible if there's no virtual memory.

How does it allocate memory? Malloc and free are pretty complex to do in hardware.

rkagerer · 8 months ago

Back when C# came out, I thought for sure someone would make a processor that would natively execute .Net bytecode. Glad to see it finally happened for some language.

kcb · 8 months ago

For Java, this was around for a bit https://en.wikipedia.org/wiki/Jazelle.

monocasa · 8 months ago

Even better was a complete system rather than a mode for arm processors that ran a subset of the common jvm opcodes.

https://en.wikipedia.org/wiki/PicoJava

varispeed · 8 months ago

Didn't some phones have hardware Java execution or does my memory fail me?

jiehong · 8 months ago

Java got that with smart cards for example. Cute oddities of the past

JavaCard was implemented as just a regular interpreter last time I checked.

supportengineer · 8 months ago

Does anyone remember the JavaOne ring giveaway?

https://news.ycombinator.com/item?id=8598037

zahlman · 8 months ago

In university, for my undergrad thesis, I wanted to do this for a Befunge variant (choosing the character set to simplify instruction decoding). My supervisor insisted on something more practical, though. :(

I probably should have added a link: https://esolangs.org/wiki/Befunge

The main thing that appealed to me about this idea is that it would require a two-dimensional program counter. As I recall from the original specification, skipping through blank space is supposed to take O(1) time, but I didn't plan on implementing that. I did, however, imagine a machine with 256x256 bytes of memory, where some 80x25 (or 24?) region was reserved as directly memory-mapped to a character display (and protected at boot by surrounding it with jump instructions).
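A minimal sketch of the two-dimensional program counter idea, assuming the 256x256 memory mentioned above and wrap-around at the edges (the wrap-around and the helper names are assumptions for this example; only the four direction commands are modelled):

```python
WIDTH = HEIGHT = 256  # the 256x256 memory described above

# Befunge direction commands and the (dx, dy) step each one selects.
DIRECTIONS = {">": (1, 0), "<": (-1, 0), "^": (0, -1), "v": (0, 1)}


def step(pc, direction):
    """Advance the 2D program counter one cell, wrapping at the edges."""
    x, y = pc
    dx, dy = direction
    return ((x + dx) % WIDTH, (y + dy) % HEIGHT)


def run_path(grid, steps):
    """Follow direction commands stored in `grid`, a dict keyed by (x, y)."""
    pc, direction = (0, 0), DIRECTIONS[">"]
    visited = [pc]
    for _ in range(steps):
        cell = grid.get(pc, " ")
        if cell in DIRECTIONS:
            direction = DIRECTIONS[cell]  # turn before moving on
        pc = step(pc, direction)
        visited.append(pc)
    return visited


path = run_path({(2, 0): "v"}, 4)
print(path)
```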

ComputerGuru · 8 months ago

I want to say there was a product that did this circa 2006-2008 but all I’m finding is the .NET Micro Framework and its modern successor the .NET nano Framework.

I’ve been using .NET since 2001 so maybe I have it confused with something else, but at the same time a lot of the web from that era is just gone, so it’s possible something like this did exist but didn’t gain any traction and is now lost to the ether.

duskwuff · 8 months ago

There was Netduino, but that was a STM32 microcontroller running an interpreter, not dedicated hardware which directly executed CLR code.

rcorrear · 8 months ago

Maybe you’re thinking of Singularity OS?

john-h-k · 8 months ago

The tl;dr (I spent lots of time investigating this) is that it just fundamentally isn’t a good bytecode for execution. It’s designed to be small on disk, not hardware friendly

whoomp12342 · 8 months ago

I'd be surprised if azure app services didn't do this already.

I’d be willing to bet my net worth that they don’t

actionfromafar · 8 months ago

Wouldn't that be a real scoop?

bongodongobob · 8 months ago

Azure runs on Linux if I'm not mistaken.