It's primarily a testament to how mind-bogglingly slow Python is outside of its optimised numerical science ecosystem. Which is also why I don't use it that much: while numerical analysis is a big part of what I do, so is what I would call "symbolic manipulation", and unless you go to quite some effort to transform every problem into a numerical one, Python is just awful at that.
But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python. And that isn't counting multicore - if you count that you quickly get to a 100x improvement.
Personally I use Groovy for much of what I do for similar reasons (which is somewhat unusual), but it's just a placeholder for "use anything except Python".
> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.
From my experience in using Python at my last job, I'll also add that Python is decent at tasks that aren't CPU-bound.
I wrote a lot of scripts that polled large numbers of network devices for information and then did something with it (typically upsert the data into a database, either via direct SQL or a REST API to whatever service owns the database). All these tasks were heavily network-bound. The amount of time the CPU was doing any work was minuscule compared to the amount of time it was waiting to get data back from the network. I doubt Nim or any other language would have been a significant performance improvement in this case.
For what it's worth, that made these scripts excellent candidates for multithreading. I'd run them with 20+ threads, and it was glorious. At first I did multiprocessing, because of all the GIL horror stories, but multiprocessing made it very difficult to cache data, so eventually I said "well, all this is network-bound so the GIL doesn't even apply" and switched over to multiprocessing.dummy (which implements pools using the same API as multiprocessing but with threads instead of processes), and I never looked back.
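A minimal sketch of that pattern, where poll() and the device names are made-up stand-ins for a real network call:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing, but threads

def poll(device):
    # Stand-in for a network-bound request (SNMP, REST, etc.); while a
    # thread waits on real I/O here, the GIL is released for the others.
    return f"{device}: ok"

devices = [f"switch{i}" for i in range(100)]

with Pool(20) as pool:  # 20 worker threads
    results = pool.map(poll, devices)

print(len(results))  # 100
```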
Edit: For what it's worth, Nim sounds like a really cool language, and it's right up my alley in several ways, I just don't think Python is particularly slow at network-bound tasks that use very little CPU.
I agree with your entire post but, and I'm saying this as a full-time Python dev, there's often a point where it starts being bothersome, and that usually comes only later in the lifecycle of an application, after it has had some organic growth. Some day, e.g., a sales manager comes down to your lair and asks you if you couldn't just also parse this little 200MB Excel spreadsheet after it came over the network, such that your ETL process could save it into a new table. And boom, you're in CPU-bound land now. Often it's fine; you can wait those 1-2 minutes for a daily occurring process. But what if, for example, you put this whole component behind a REST API that is behind a load balancer that is set with a certain timeout? There are even strict upper limits if, for example, you chose AWS Lambda for your stuff.
And suddenly you need to introduce quite a bit more technical complexity into this story that's gonna be hard to explain to management - all they see is that you can now insert a couple of million DB rows, and their Big Data consultants[TM] told them that this is nowadays not even worth thinking about.
Point being: if your performance ceiling is low, you're gonna hit it sooner.
> "I'll also add that Python is decent at tasks that aren't CPU-bound"
IO-bound tasks are almost by definition outside of your Python application's control. You yield control to the system to execute the actual task, and from that point on - you're no longer in control of how long the task will take to complete.
In other words, Python "being fast" by waiting on a socket to finish receiving data isn't a particularly impressive feat.
You have to be reaaaaaally slow to be beaten by a network. Also, as touched on by the OP, if it takes 3 hours to write some code that in Python takes only 1, or if the compile times are huge, Python can beat other languages in speed (that edge fades when the same program is used over and over again).
But as demonstrated, Nim is fast to write and fast to compile, so Python has little edge. Just its huge ecosystem.
As other commenters point out, how can a language be fast in a way besides something CPU bound? You are saying it is fast when it's not doing anything. Not sure I understand.
You might want to look into async (asyncio or anyio) instead of or in addition to threads for network-heavy code. Async coroutines I find can be much easier to debug and develop than OS-threaded code.
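For example, a minimal asyncio version of the polling pattern might look like the sketch below, where fetch() is a stand-in for a real network call (in practice you would await an HTTP or socket library):

```python
import asyncio

async def fetch(device):
    await asyncio.sleep(0)  # stand-in for awaiting a network response
    return f"{device}: ok"

async def main():
    # Launch all requests concurrently on a single thread; no locks needed.
    return await asyncio.gather(*(fetch(f"switch{i}") for i in range(100)))

results = asyncio.run(main())
print(len(results))  # 100
```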
I've tried Cython and it isn't competitive with C in real-world cases. Cherry-picking a case where it is competitive doesn't help your case. Its performance tends to be around the level of Java, another language people say is competitive with or faster than C++, but in practice C++ is still twice as fast as Java in most cases.
Still, getting Java level performance out of python is a huge improvement and should be enough for most cases.
There is also numba, which is very impressive in its own right, and also pypy, which supports features up to Python 3.7.
Some may consider Jax, and its XLA compiler, but unless you require gradients, numba will be significantly faster; an instance of this is available here [1].
XLA runs on a higher level than LLVM and therefore can't achieve the same optimizations as numba does using the latter. IIRC numba also has a Python-to-CUDA compiler, which is also very impressive.
[1] https://github.com/scikit-hep/iminuit/blob/develop/doc/tutor...
> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.
CPython's slowness doesn't boggle my mind at all. It's a bytecode interpreter for an incredibly dynamic language that states simplicity of implementation as a goal. I would say performance is actually pretty impressive considering all that. What _does_ boggle my mind is the performance of cutting-edge optimizing compilers like LLVM and V8!
At least there is a benefit to a simple implementation: Someone like me can dive into CPython's source and find out how things work.
IME Python is pretty great as cross-platform scripting-, automation- and glue-language up to a few thousand lines of code. Essentially replacing bash scripts and Windows batch files or for simple command line tools that don't need the performance. It should be just one language in one's language toolbox though.
> mind bogglingly slow Python is outside of its optimised numerical science ecosystem
Granted, but inside its optimised numerical science ecosystem, Python is, in fact, fast enough. If most of your program is calls into numpy, Python will get you where you need to go. In my experience, one scalar Python math operation takes about the same amount of time as the equivalent numpy operation on a million-element array. Linked against a recent libblas, numpy will even distribute work across multiple cores. So much for the GIL.
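To illustrate the vectorized style (a sketch assuming numpy is installed; the array size is arbitrary):

```python
import numpy as np

xs = np.arange(1_000_000, dtype=np.float64)

# One vectorized call: the million multiply-adds all run in compiled C,
# rather than as a million interpreted Python operations.
ys = xs * 2.0 + 1.0

print(ys[:3])  # [1. 3. 5.]
```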
Nim isn’t unique in its performance, but it “feels” a lot like Python, making it a nice gateway drug for Python users to start wasting less electricity.
Yeah should really be "Why I don't use Python for Data Processing". I would consider Typescript as an alternative too which I'm sure would get a similar speedup.
Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:
> Nim treats identifiers as equal if they are the same after removing capitalization (except for the first letter) and underscore, which means that you can use whichever style you want.
If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
I wish this tired, boring argument didn't derail every conversation that remotely mentions Nim. See: literally every past HN post with "Nim" in the title.
Nim's underlying, perhaps understated philosophy is that it lets you write code the way you want to write code. If you like snake case, use it. If you want camel case, sure. Write your code base how you want to write it, keep it internally consistent if you want, or don't. Nim doesn't really care.
(That philosophy extends far beyond naming conventions.)
What this avoids is being stuck with antiquated standard libraries that continue to do things contrary to the language's standards for the sake of backward compatibility (arg Python!) and 3rd party libraries where someone chose a different standard because that's their preference (arg Python! JavaScript! Literally every language!). Now you're stuck with screaming linters or random `# noqa` lines stuffed in your code, and that one variable that you're using from a library sticks out like a sore thumb.
Your code is inconsistent because someone else's code was inconsistent - that's simply not a problem in Nim.
Could Nim have forced everyone to snake_case naming structures for everything from the start? Well, sure, but then the people that have never actually written code in Nim would be whining about that convention instead and we'd be in the same place. After having actually used Nim, my opinion, and I would venture to say the opinion of most, is that its identity rules were a good decision for the developers who actually write Nim code.
> Yeah should really be "Why I don't use Python for Data Processing".
Not entirely. Nim's benefit here is that it's superficially similar enough to Python that it's easy for people from that world to pick up and start using Nim.
> Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:
> If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
It may seem like a design mistake at first glance, but it's surprisingly useful. Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake case) even when making use of upstream libraries that use different styles. Not including the first letter avoids most of the annoyance of wantonly mixing all-caps constants with lowercase names, and linters keep teams from mismatching internal styles. Though mostly I forget it's there, as most idiomatic Nim code sticks with camel case. I'd say not to knock it until you've tried it.
The rest of Nim’s design avoids many issues I consider actual blunders in a modern language such as Python’s treatment of if/else as statements rather than as expressions, and then adding things like the walrus operator etc to compensate.
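For the curious, the matching rule described above (first character significant, with underscores and the case of the remaining characters ignored) can be sketched in a few lines of Python:

```python
def nim_normalize(ident: str) -> str:
    # Keep the first character as-is; strip underscores and case from the rest.
    return ident[0] + ident[1:].replace("_", "").lower()

# Same identifier under Nim's rule:
assert nim_normalize("toLowerAscii") == nim_normalize("to_lower_ascii")
# Different: the first letter's case is significant.
assert nim_normalize("fooBar") != nim_normalize("FooBar")
```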
I actually use TypeScript/JavaScript a lot for this reason, especially for biological algorithms that I want to run in the browser. The developer tooling is also as good as you can hope for, especially when using VS Code. I actually wrote a circular RNA sequence deduplication algorithm in it just recently [1].
[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...
With respect to the identifier resolution in Nim, it strikes me as more of a matter of preference. Especially given the universal function call syntax in Nim, at least it's consistent. For example, Nim treats "ATGCA".lowerCase() the same as lowercase("ATGCA"). I do appreciate the fact that you can use a chaining syntax instead of a nesting one when doing multiple function calls but this is also a matter of style more than substance.
While I trust the author on this, I don’t think DNA datasets and string analysis were a great example.
One of the big, big things for improving performance on DNA analysis of ANY kind is converting these large text files into binary (4 letters easily convert to a 2-bit encoding), which massively improves basically any analysis you’re trying to do.
Not only does it compress your dataset (2 bits vs 16 bits), it allows absurdly faster numerical libraries to be used in lieu of string methods.
There’s no real point in showing off that a compiled language is faster at doing something the slow way…
You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8-bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4-bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2-bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library (Python, Nim, or otherwise) that made using the most efficient encoding for a sequence transparent to the developer.
[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...
[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation
[3] https://bioinformatics.stackexchange.com/questions/225/upper...
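To make the idea concrete, here is a toy 2-bit packing for unambiguous sequences (a sketch only; real libraries use more elaborate encodings, as discussed above):

```python
# Two bits per base, packed into a single Python integer.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> int:
    n = 0
    for base in seq:
        n = (n << 2) | CODES[base]
    return n

def unpack(n: int, length: int) -> str:
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[n & 0b11])  # read the low two bits
        n >>= 2
    return "".join(reversed(out))

print(unpack(pack("GATTACA"), 7))  # GATTACA
```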
Because we use it as a nice syntactic frontend to numpy, a large and highly optimized library written in C++ and Fortran (sic). That is, we actually don't use "Python-native" code much, and numpy is essentially an APL-like, array-oriented thing where e.g. you don't normally need loops.
For native-language data processing, Python is slow; Nim or Julia would easily outperform it, while being comparably ergonomic.
Apparently there's also a data processing library for Nim called Arraymancer[0] that's inspired by Numpy and PyTorch. It claims to be faster than both.
[0] https://mratsim.github.io/Arraymancer/
Please add D language to the mix as well. Interestingly, you can simply replace Nim with D in the blog article and most of the contents will still make sense!
The funny thing is that Nim and Julia libraries are still wrapping Fortran numerical libraries, while D beat the old and trusted Fortran library on its home turf five years back:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
There’s been a tremendous amount of work optimizing blas _and_ ensuring it’s numerically stable. Julia made a good choice to use blas first. Though it’s good to see new native implementations.
For Nim, there’s also NimTorch, which is interesting in that it builds on Nim’s C++ target to generate native PyTorch code. Even Python is technically a second-class citizen for the C++ code. Most ML libraries are C++ all the way down.
https://github.com/sinkingsugar/nimtorch
Not necessarily true with Julia. Many libraries like DifferentialEquations.jl are Julia all of the way down because the pure Julia BLAS tools outperform OpenBLAS and MKL in certain areas. For example see:
Another nim & python thread that has not been mentioned here yet:
https://news.ycombinator.com/item?id=28506531 - project allows creating pythonic bindings for your nim libraries pretty easily, which can be useful if you still want to write most of your toplevel code in python, but leverage nim's speed when it matters.
The author makes a fair point; however, that is a rather non-optimal implementation in Python. You could likely use chunked Pandas to speed up the code, or at least replace some of the for loops with list comprehension syntax.
However, in any case I would never replace Python with Nim, as it is too niche a language and you would struggle with recruiting. I could consider Julia if its popularity keeps growing.
That is the ultimate challenge of a language. It either needs a large backer (Go and Google) or has to be so good that it gets natural market adoption (Julia). As a manager I am reluctant to adopt yet another language unless there is a healthy job market for it.
There are objectively better niche solutions for niche problems out there. But we pick up things that can be applied to solve a number of different problems, that are more versatile, and that have a community behind them.
Agreed. I am certainly inclined to believe that Nim is a better language than Python, but it's not so much better as to justify moving off of the ecosystem.
This is reasonably idiomatic Python and 10x faster than the implementation in the original post:
    with open("orthocoronavirinae.fasta") as f:
        text = ''.join(line.rstrip() for line in f.readlines() if not line.startswith('>'))
    gc = text.count('G') + text.count('C')
    total = len(text)
Or if you want to be explicit, this is just as fast (and might scale better for particularly long genomes):
    gc = 0
    total = 0
    with open("orthocoronavirinae.fasta") as f:
        for line in f.readlines():
            if not line.startswith('>'):
                line = line.rstrip()
                gc += line.count('C') + line.count('G')
                total += len(line)
I didn't test Nim but the author reports Nim is 30x faster than his Python implementation, so mine would be about 3x slower than his Nim.
Yes, you can implement a faster Python version, but notice also:
* This faster version reads the whole file into memory (except comment lines). The article mentions the data being 150MB, which should fit in memory, but for larger datasets this approach would be infeasible
* The faster version is actually delegating a lot of work to Python's C internals by using text.count('G'). All the internal looping and comparison is done in C, while in the original version it goes through Python
So yes, you can definitely write faster Python by delegating most of the work to C.
The point of the article is not about how to optimize Python, but about how, given almost identical implementations in Python and Nim, Nim can outperform Python by 1 or 2 orders of magnitude without resorting to C internals for basic things like looping or comparing characters.
I didn't try to write optimized code, but idiomatic Python. Which also happens to be 10x faster.
To make it streaming, take the second version and remove the readlines (directly iterate over f).
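That streaming variant can be sketched as a function over any iterable of lines, so an open file handle can be passed in directly (the filename here comes from the article):

```python
def gc_content(lines):
    # Streaming GC count: only one line is held in memory at a time.
    gc = total = 0
    for line in lines:
        if not line.startswith(">"):
            line = line.rstrip()
            gc += line.count("C") + line.count("G")
            total += len(line)
    return gc, total

# with open("orthocoronavirinae.fasta") as f:
#     gc, total = gc_content(f)

print(gc_content([">seq1", "ATGC", "GGCC"]))  # (6, 8)
```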
Delegating work to Python's C internals is fine IMO because "batteries included" is a key feature of Python. "Nim outperforms unidiomatic Python that deliberately ignores key language features" is perhaps true, but less flashy of a headline.
And to be honest, I mainly wrote this because the other top level Python implementations for this one were terrible at the time of the post.
import io
f = io.StringIO(
"""
AB
CD
EF
GH
"""
)
total = sum(map(lambda s: 0 if s[0]==">" else s.count('G') + s.count('C'), f.readlines()))
print(total)
As a Data Engineer, I mainly use Python in conjunction with PySpark. So essentially I just use Python as an easy-to-read wrapper around Spark, and it works great when working together with Data Scientists, who are mostly used to Pandas, Keras, Tensorflow, etc.
In my use case, I don't really see how Nim would make my life easier right now.
One pitfall that I ran into when trying out Nim for scientific computing is that Nim follows more of a computer science convention than a mathematics convention for the exponentiation and negation operators. That is, in Nim -2^3 = (-2)^3, unlike more scientific-computing-oriented languages where -2^3 = -(2^3). To someone like myself who mainly does scientific work this was quite unexpected, and it causes a surprising amount of mental overhead to avoid mistakes. I did like Nim quite a bit otherwise, but essentially found that I was missing some important numerical libraries, so I did not continue using it.
Although Nim using weird order of operations is unfortunate, your example is not well chosen. Replace the exponent 3 with an even integer and your point will be clear.
Maybe not that weird, although not the most common syntax. In Nim 0 - 2^2 evaluates to -4, while -2^2 evaluates to +4. So in -2, the - is treated as a unary minus and binds tightly to the 2. It would be bad if Nim were doing + or - before ^, or before *. I have a feeling some other languages treat the unary minus the same way, but don’t know of any examples off hand.
I don't think that is a mathematics convention vs computer science convention thing. Most languages, whether aimed at mathematics or computer science, have exponentiation at higher precedence than unary minus. (JavaScript does not, but it also does not allow unary minus directly in front of the base of an exponentiation so you have to add parenthesis no matter which interpretation you want).
The main places you find it the other way are spreadsheets and shells.
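Python, for what it's worth, is in the majority camp where exponentiation binds tighter than unary minus (unlike Nim as described above):

```python
# Exponentiation binds tighter than unary minus in Python,
# matching the usual mathematical convention.
assert -2 ** 2 == -4      # parsed as -(2 ** 2)
assert (-2) ** 2 == 4     # parentheses force the other reading
assert 0 - 2 ** 2 == -4   # binary minus behaves the same way
```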
Is there an explanation from the Nim authors as to why they made such an odd choice?
IIRC the argument is that unary negation is the operator with the highest precedence and one should be consistent across logical and mathematical operations. I think this is not a completely unreasonable stance to take (and several other languages take the same), but it is unintuitive for someone who does significant scientific computing.
E.g. random example:
Sprinkle some cdefs in your Python and suddenly you're faster than C++
https://github.com/luizsol/PrimesResult
https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race...
25.8 seconds down to 1.5
No, Nim is truly among the top fastest languages when writing idiomatic code as shown in many benchmarks.
> But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python
...while also being very friendly to Python programmers, intuitive and expressive. Unlike many other languages.
Also, "awful" is too harsh. Probably 90% of Python code just doesn't need to be faster than it is.
I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.
(As in GATTACA might be read as is, but might be read as GAT?ACA.)
Still, that's a minimum of 3 bits versus much more.
[Edit: I see another commenter with the same observation, more thoroughly explained!]
You say that, but Julia is rapidly acquiring native numerical libraries that outperform OpenBLAS:
https://discourse.julialang.org/t/realistically-how-close-is...
https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...
So a stiff ODE solve is pure Julia, LU-factorizations and all. This is what allows it to outperform the common C and Fortran libraries very consistently. See https://benchmarks.sciml.ai/html/MultiLanguage/wrapper_packa... and https://benchmarks.sciml.ai/html/Bio/BCR.html
If you want to make your Nim code even more "pythonic" there is https://github.com/Yardanico/nimpylib, and for calling Python code from Nim there is https://github.com/yglukhov/nimpy
Not all technologies require the full cycle and the normal risk management.
Your first example takes 3.1 seconds, my previous comment takes 2.3 seconds, this one takes 1.4 seconds.