Achieving high Python performance with code generation (2022)

There's one caveat that I don't see explicitly mentioned in the article - CPython's bytecode "API" is not stable, and availability and behaviour of particular opcodes can vary between two different CPython release versions.

In practice it's stable-ish, but I'd be very afraid of putting it in a critical path like this.

Further, contrary to what you might assume CPython is not memory-safe. Most of the memory-unsafety is difficult to accidentally reach from regular Python code, but it's a distinct possibility if you're emitting bytecode directly. For example, LOAD_CONST is not bounds-checked.

kzrdude · a year ago

Stable-ish probably turns into full unstable due to rapid changes in the faster cpython work. Lots of bytecode changes have been done.

ok123456 · a year ago

I encountered this when a coworker got random segfaults trying to run some Cython code. Somehow, he was using an old beta version of the interpreter. After stepping through with GDB (I think it was a call to LOAD_CONST) and finding the offending seemingly benign code, on a hunch, I updated the interpreter, and the random segfault disappeared.

pansa2 · a year ago

> memory-unsafety is difficult to accidentally reach from regular Python code

Can you give an example of Python (source) code that isn't memory safe?

I assumed Python would be similar to Lua in this respect - the interpreter is memory safe no matter what source code you give it, but when it comes to custom bytecode, all bets are off.

Retr0id · a year ago

My favourite unsafe one-liner:

    eval((lambda:0).__code__.replace(co_consts=()))

This one is slightly cheat-y because it does modify the code object, just not the bytecode itself.

"Purer" unsafety issues are treated as bugs and patched, but that doesn't stop them from existing periodically. I don't have a specific example on-hand but I bet you can find some in here: https://github.com/search?q=repo%3Apython%2Fcpython+segmenta...

Edit: This one is a decent example (and hasn't been fixed for a long time) https://github.com/python/cpython/issues/74957

tekknolagi · a year ago

Anything using codecs, especially CJK. The amount of scary stuff we found when reimplementing it was... Well, scary

dmurray · a year ago

The most typical alternative to this is that you write the hot code in an unsafe language. And your code already has dozens of dependencies on third-party modules in unsafe languages. So it's something to be aware of, sure, but it's not opening up a whole new surface of vulnerabilities.

Retr0id · a year ago

I get what you mean, but everyone already knows C is a sharp implement. It's less obvious that writing CPython bytecode has similar risks. (You can also write native python extensions in rust etc., but most people aren't doing that, yet)

I'm not saying it's necessarily a reason to write off the approach, but definitely something people should be aware of.

ActorNightly · a year ago

Yep.

If you need to process data fast, just write the component in C that runs as a separate process, and send data to it with pipes.

Rather than go to all this hassle, I think it’s much easier to use Cython, or even to write C and just wrap it with Cython/Ctypes. Performance is so much better.

willseth · a year ago

The problem isn't simply that repeated function calls are expensive, it's that the comparator logic is very branchy in order to support the different comparator options for their queries, even though any given query will only perform the same comparator for every value. The operation being too dynamic would exist regardless.

He jumps straight from the problem to codegen-based solutions, but I wonder if a simpler loop specialization would yield a big enough chunk of the performance he got. If you hoist the comparator handling higher up into a single branch per query, then have a separate loop for each, you avoid unnecessary branching in your hot loop. You'd have to duplicate logic for every comparator, which is a little ugly, but there were only a few comparators, and that seems a lot more palatable if the alternative is an elaborate codegen solution.

stefanos82 · a year ago

Or you could start with mypyc [1] first from mypy and if the results do not please you, then you can use Cython; some folks don't want to learn yet another programming syntax, even though it's Pythonic enough to learn it in a rather short period of time, but still...

https://github.com/python/mypy

olejorgenb · a year ago

(https://github.com/mypyc/mypyc) That's cool!

> Mypyc compiles Python modules to C extensions. It uses standard Python type hints to generate fast code. Mypyc uses mypy to perform type checking and type inference.

> Mypyc can compile anything from one module to an entire codebase. The mypy project has been using mypyc to compile mypy since 2019, giving it a 4x performance boost over regular Python.

kevin_thibedeau · a year ago

Cython can speed up plain Python code. It bypasses the interpreter for everything that is PyObject-based.

wdroz · a year ago

I agree, you can also use Rust with PyO3[0] and Maturin[1]. It's far easier than what the author is doing is the article.

[0] -- https://github.com/PyO3/pyo3

[1] -- https://github.com/PyO3/maturin

Austizzle · a year ago

I've done this and it's a really lovely way to work once the rust stuff clicks!

pjmlp · a year ago

Or use a language with JIT/AOT in the box, REPL support, some of which even have IDEs all the way back to 1990's.

hyperpape · a year ago

While it's probably less often used, runtime code generation is a potential technique for optimizing performance in Java.

Elsewhere iirc, V8, which is written in C++, can use runtime code generation for regexes.

KeplerBoy · a year ago

yes, generating Python bytecode seems to be a horrible idea, and 2x improvements are not great coming from Python.

Apart from Cython and other C/C++ bindings one might also want to look at JIT compilers like Numba (or Jax and Pytorch for number crunching).

willseth · a year ago

The improvement was 2x across the entire application, i.e. this hotspot was 50% of the overall runtime, and his bytecode optimization reduced that to an insignificant %. The improvement to the code he actually optimized was probably at least 2 orders of magnitude.

poulsbohemian · a year ago

This isn't really a criticism of this article - at first glance it looks well-written and well-informed. The problem I have with articles like this is that it just doesn't fit my own experience of a decade+ worth of application performance work... I worked on lots of projects where something was slow or under-performing, and it was almost never as simple as some loop or generic function. And, if you are performance bound in the way the author states as a hypothesis, my experience is that you have an architecture problem, not just a problem with a single function. Case in point, a few years ago I worked on a project where the original requirements were stated as needing to process maybe 10,000 images a day. In reality, the need was to process something on the order of 10 million images. Two very different needs. So while there are interesting things to be learned from articles like this, they just don't fit my own observations when it comes to what's necessary to boost the real performance of applications in the wild.

You're not wrong, but despite being complicated, this codegen is less than 200 lines of comment heavy code. This might have been a very light lift in comparison to whatever fundamental architectural changes would be necessary to solve the problem at its root.

Totally understood, but I've just never found anything that was this simple in any real-world production code. That's not a criticism, it's just sharing a perspective. The only time I can think of that came even close to this was about 20 years ago I worked on a project where for nearly every remote call they were also invoking a function that made an expensive auth call. Yes, the "problem" was in a single function, but there wasn't anything explicitly wrong with the function, rather the overall auth pattern.

bravura · a year ago

Didn't your mother teach you not to prematurely optimize? You'll grow hair on your hands.

physicsguy · a year ago

eesmith · a year ago

I did byte code-based codegen for one of my projects. It was a maintenance headache. The byte codes are NOT stable across Python releases, so you need to be prepared to tweak your generator and even maintain multiple implementations.

My needs were all static, so I gave up and wrote a codegen that generated Python, with markers in the relevant files to indicate where the codegen output should go. That would not work here.

082349872349872 · a year ago

IMX if you're going to all the trouble of generating linear bytecode, might as well leverage ctypes and go all the way to linear machine code.

The linked-to example uses a query-dependent expression which resists easy pre-compilation. The author says the overhead of building a Python expression as a string then byte compiling it is too high, which would also hold for a ctypes/cffi implementation.

In my case I had a lot of config dictionaries like:

  default_output_args = {
    "tsv": {"sep": "tab", "quoting": "always"},
    "csv": {"sep": "comma", "quoting": "always"},
    "json": {"indent": None, "allow_nan": True},
    ...
  }

and a generic API like:

  def open_writer(destination, format="csv", output_args=None):
     ...

which would let 'format' specify the output format, with the corresponding default arguments, and with the output_args able to override the defaults.

I found the generic API too annoying when I knew I wanted a given format, so I added format-specific APIs with kwargs set to the default values, like:

  def open_csv_writer(destination, sep="tab", quoting="always"):
      ...

  def open_json_writer(destination, indent=None, allow_nan=True):
      ...

And I wanted to keep the generic API and format-specific APIs synchronized, because the generic API is useful too.

My original version processed default_output_args to generate the byte code for the different writer functions, setting up the function definitions correctly.

This is not something which can be done with ctypes.

rossdavidh · a year ago

So, I haven't done this particular thing, but I have messed with code that writes code in a couple different production cases, and the number one issue is not any of what he mentions. The number one issue is that it is harder to debug, and harder for someone who comes after you to tell what you're doing.

kevindamm · a year ago

I had a $work-related project where a certain critical path could be optimized further by generating bytecode (the context was slightly different, it was for the JVM not cpython, and there was a DSL that had already existed before I joined the team which provided model structure and feature semantics, and the system was dealing with a lot of backlog at times so it was business-critical to improve performance on the critical path).

You're spot on, the number one issue is that it becomes harder to debug (and there will be bugs no matter how many eyeballs are on the parser and IR and code generator) and harder to communicate. I'm sure the system was dismantled not long after I and a few others left the team. I did my best to document everything as I went, but I probably could have worked faster without that (and it surely would have been dismantled faster, too).

The question comes down to how much you want that performance, and how much tooling you're willing to build around it. You can make some very custom debugging tools if you're already in there generating byte sequences around chunks of semantics. The mature languages have a lot of similar tools, too, though.

ameliaquining · a year ago

In addition to the other alternatives proposed here, I'm confused why it wouldn't be appropriate to use compile wrapped in a cache. Yeah, you don't want to regularly call it inside the serving path, but calling it once per data type per process seems like it should be fine? It basically just increases amortized startup cost by a tiny amount.

Majromax · a year ago

It looks like the compared-to values are hard-coded into the generated function, so you would be calling compile once per unique query per process.

Does hardcoding those values into the generated function actually speed things up? I thought the idea was just to have a distinct function per different control-flow path.

Yeah, and you could run a warmup script on deploy so no actual users pay for it. It's not a rare pattern.

wrigby · a year ago

Yeah, I came here to say the same thing. Why not just memoize compile()?

henning · a year ago

Maybe stop using dynamic languages that were designed for the same kind of use cases as bash scripts for performance-sensitive production workloads? If you have to generate your own bytecode, the language is by definition no longer adequate for your needs.

pdonis · a year ago

I don't understand why it's necessary to generate bytecode for this. In Python you can write functions that define other Python functions (i.e., using the normal Python def statement) and return them. Why can't the query object constructor just call a Python function that returns the appropriate Python comparison function based on the query object's attributes?

I don't fully understand all the performance tradeoffs in this specific case, but in general, closures don't make things faster. It would be necessary to look up the closed-over value in the nonlocal environment on every iteration of the inner loop, which is at least as slow as the original solution with inline conditionals.

> in general, closures don't make things faster.

Compared to what?

Compared to doing the logic that picks which specific comparison function to call in every iteration of the inner loop? Closures will certainly make things faster compared to that, for the same reason that the method described in the article does.

Compared to generating bytecode by hand? Closures might not make things any faster, but they won't make them any slower either. Generating bytecode by hand is just doing by hand what the Python interpreter does internally when you define a closure using normal Python code. The def statement means the interpreter generates bytecode from the contents of the def block, wraps it in a code object, and attaches that code object to the function that gets returned.

In other words, what the article is doing is generating a closure--just doing it in a way that's much less common than just writing a Python function that returns a function.

> It would be necessary to look up the closed-over value in the nonlocal environment

No, it wouldn't. A Python closure includes cells that store the values of the closed over variables, so as far as the closure function is concerned, they're local variables.

teaearlgraycold · a year ago

You don’t understand. That’s not complicated enough! No one will be impressed.

sitkack · a year ago

Do what pdonis suggests, but inline the resulting bytecode!