It's primarily a testament to how mind-bogglingly slow Python is outside of its optimised numerical science ecosystem. Which is also why I don't use it that much: while numerical analysis is a big part of what I do, so is what I would call "symbolic manipulation", and unless you go to quite some effort to transform every problem into a numerical one, Python is just awful at that.
But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python. And that isn't counting multicore - if you count that you quickly get to a 100x improvement.
Personally I use Groovy for much of what I do for similar reasons (which is somewhat unusual), but it's just a placeholder for "use anything except Python".
> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.
From my experience in using Python at my last job, I'll also add that Python is decent at tasks that aren't CPU-bound.
I wrote a lot of scripts that polled large numbers of network devices for information and then did something with it (typically upsert the data into a database, either via direct SQL or a REST API to whatever service owns the database). All these tasks were heavily network-bound. The amount of time the CPU was doing any work was minuscule compared to the amount of time it was waiting to get data back from the network. I doubt Nim or any other language would have been a significant performance improvement in this case.
For what it's worth, that made these scripts excellent candidates for multithreading. I'd run them with 20+ threads, and it was glorious. At first I did multiprocessing, because of all the GIL horror stories, but multiprocessing made it very difficult to cache data, so eventually I said "well, all this is network-bound so the GIL doesn't even apply" and switched over to multiprocessing.dummy (which implements pools using the same API as multiprocessing but with threads instead of processes), and I never looked back.
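A minimal sketch of that pattern, where poll() and the device names are made-up stand-ins for a real network call:

```python
from multiprocessing.dummy import Pool  # same API as multiprocessing, but threads

def poll(device):
    # Stand-in for a network-bound request (SNMP, REST, etc.); while a
    # thread waits on real I/O here, the GIL is released for the others.
    return f"{device}: ok"

devices = [f"switch{i}" for i in range(100)]

with Pool(20) as pool:  # 20 worker threads
    results = pool.map(poll, devices)

print(len(results))  # 100
```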
Edit: For what it's worth, Nim sounds like a really cool language, and it's right up my alley in several ways, I just don't think Python is particularly slow at network-bound tasks that use very little CPU.
I agree with your entire post but, and I'm saying this as a full-time Python dev, there's often a point where it starts being bothersome, and that usually comes only later in the lifecycle of an application, after it has had some organic growth. Some day, e.g., a sales manager comes down to your lair and asks you if you couldn't just also parse this little 200MB Excel spreadsheet after it came over the network, such that your ETL process could save it into a new table. And boom, you're in CPU-bound land now. Often it's fine; you can wait those 1-2 minutes for a daily occurring process. But what if, for example, you put this whole component behind a REST API that is behind a load balancer that is set with a certain timeout? There are even strict upper limits if, for example, you chose AWS Lambda for your stuff.
And suddenly you need to introduce quite a bit more technical complexity into this story that's gonna be hard to explain to management - all they see is that you can now insert a couple of million DB rows, and their Big Data consultants[TM] told them that this is nowadays not even worth thinking about.
Point being: if your performance ceiling is low, you're gonna hit it sooner.
> "I'll also add that Python is decent at tasks that aren't CPU-bound"
IO-bound tasks are almost by definition outside of your Python application's control. You yield control to the system to execute the actual task, and from that point on - you're no longer in control of how long the task will take to complete.
In other words, Python "being fast" by waiting on a socket to finish receiving data isn't a particularly impressive feat.
You have to be reaaaaaally slow to be beaten by a network. Also, as touched on by the OP, if it takes 3 hours to write some code that in Python takes only 1, or if the compile times are huge, Python can beat other languages in speed (that edge fades when the same program is used over and over again).
But as demonstrated, Nim is fast to write and fast to compile, so Python has little edge. Just its huge ecosystem.
As other commenters point out, how can a language be fast in a way besides something CPU bound? You are saying it is fast when it's not doing anything. Not sure I understand.
You might want to look into async (asyncio or anyio) instead of or in addition to threads for network-heavy code. Async coroutines I find can be much easier to debug and develop than OS-threaded code.
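For example, a minimal asyncio version of the polling pattern might look like the sketch below, where fetch() is a stand-in for a real network call (in practice you would await an HTTP or socket library):

```python
import asyncio

async def fetch(device):
    await asyncio.sleep(0)  # stand-in for awaiting a network response
    return f"{device}: ok"

async def main():
    # Launch all requests concurrently on a single thread; no locks needed.
    return await asyncio.gather(*(fetch(f"switch{i}") for i in range(100)))

results = asyncio.run(main())
print(len(results))  # 100
```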
I've tried Cython and it isn't competitive with C in real-world cases. Cherry-picking a case where it is competitive doesn't help your case. Its performance tends to be around the level of Java, another language people say is competitive with or faster than C++, but in practice C++ is still twice as fast as Java in most cases.
Still, getting Java level performance out of python is a huge improvement and should be enough for most cases.
There is also numba, which is very impressive in its own right, and also pypy, which supports features up to Python 3.7.
Some may consider Jax, and its XLA compiler, but unless you require gradients, numba will be significantly faster; an instance of this is available here [1].
XLA runs on a higher level than LLVM and therefore can't achieve the same optimizations as numba does using the latter. IIRC numba also has a Python-to-CUDA compiler, which is also very impressive.
[1] https://github.com/scikit-hep/iminuit/blob/develop/doc/tutor...
> It's primarily a testament to how simply mind bogglingly slow Python is outside of its optimised numerical science ecosystem.
CPython's slowness doesn't boggle my mind at all. It's a bytecode interpreter for an incredibly dynamic language that states simplicity of implementation as a goal. I would say performance is actually pretty impressive considering all that. What _does_ boggle my mind is the performance of cutting-edge optimizing compilers like LLVM and V8!
At least there is a benefit to a simple implementation: Someone like me can dive into CPython's source and find out how things work.
IME Python is pretty great as cross-platform scripting-, automation- and glue-language up to a few thousand lines of code. Essentially replacing bash scripts and Windows batch files or for simple command line tools that don't need the performance. It should be just one language in one's language toolbox though.
> mind bogglingly slow Python is outside of its optimised numerical science ecosystem
Granted, but inside its optimised numerical science ecosystem, Python is, in fact, fast enough. If most of your program is calls into numpy, Python will get you where you need to go. In my experience, one scalar Python math operation takes about the same amount of time as the equivalent numpy operation on a million-element array. Linked against a recent libblas, numpy will even distribute work across multiple cores. So much for the GIL.
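To illustrate the vectorized style (a sketch assuming numpy is installed; the array size is arbitrary):

```python
import numpy as np

xs = np.arange(1_000_000, dtype=np.float64)

# One vectorized call: the million multiply-adds all run in compiled C,
# rather than as a million interpreted Python operations.
ys = xs * 2.0 + 1.0

print(ys[:3])  # [1. 3. 5.]
```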
Nim isn’t unique in its performance, but it “feels” a lot like Python, making it a nice gateway drug for Python users to start wasting less electricity.
Yeah should really be "Why I don't use Python for Data Processing". I would consider Typescript as an alternative too which I'm sure would get a similar speedup.
Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:
> Nim treats identifiers as equal if they are the same after removing capitalization (except for the first letter) and underscore, which means that you can use whichever style you want.
If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
I wish this tired, boring argument didn't derail every conversation that remotely mentions Nim. See: literally every past HN post with "Nim" in the title.
Nim's underlying, perhaps understated philosophy is that it lets you write code the way you want to write code. If you like snake case, use it. If you want camel case, sure. Write your code base how you want to write it, keep it internally consistent if you want, or don't. Nim doesn't really care.
(That philosophy extends far beyond naming conventions.)
What this avoids is being stuck with antiquated standard libraries that continue to do things contrary to the language's standards for the sake of backward compatibility (arg Python!) and 3rd party libraries where someone chose a different standard because that's their preference (arg Python! JavaScript! Literally every language!). Now you're stuck with screaming linters or random `# noqa` lines stuffed in your code, and that one variable that you're using from a library sticks out like a sore thumb.
Your code is inconsistent because someone else's code was inconsistent - that's simply not a problem in Nim.
Could Nim have forced everyone to snake_case naming structures for everything from the start? Well, sure, but then the people that have never actually written code in Nim would be whining about that convention instead and we'd be in the same place. After having actually used Nim, my opinion, and I would venture to say the opinion of most, is that its identity rules were a good decision for the developers who actually write Nim code.
> Yeah should really be "Why I don't use Python for Data Processing".
Not entirely. Nim's benefit here is that it's superficially similar enough to Python that it's easy for people from that world to pick up and start using Nim.
> Also I don't know how anyone could design a language in the 21st century and make basic mistakes like this:
> If that's any indication of the sanity of the rest of Nim then I'd say steer well clear!
It may seem like a design mistake at first glance, but it's surprisingly useful. Its intent is to allow a given codebase to maintain a consistent style (e.g. camel vs snake case) even when making use of upstream libraries that use different styles. Not including the first letter avoids most of the annoyance of wantonly mixing all-caps constants with lowercase names, and linters keep teams from mismatching internal styles. Though mostly I forget it's there, as most idiomatic Nim code sticks with camel case. I'd say not to knock it until you've tried it.
The rest of Nim’s design avoids many issues I consider actual blunders in a modern language such as Python’s treatment of if/else as statements rather than as expressions, and then adding things like the walrus operator etc to compensate.
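For the curious, the matching rule described above (first character significant, with underscores and the case of the remaining characters ignored) can be sketched in a few lines of Python:

```python
def nim_normalize(ident: str) -> str:
    # Keep the first character as-is; strip underscores and case from the rest.
    return ident[0] + ident[1:].replace("_", "").lower()

# Same identifier under Nim's rule:
assert nim_normalize("toLowerAscii") == nim_normalize("to_lower_ascii")
# Different: the first letter's case is significant.
assert nim_normalize("fooBar") != nim_normalize("FooBar")
```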
I actually use TypeScript/JavaScript a lot for this reason, especially for biological algorithms that I want to run in the browser. The developer tooling is also as good as you can hope for, especially when using VS Code. I actually wrote a circular RNA sequence deduplication algorithm in it just recently [1].
[1] https://github.com/Benjamin-Lee/viroiddb/blob/main/scripts/c...
With respect to the identifier resolution in Nim, it strikes me as more of a matter of preference. Especially given the universal function call syntax in Nim, at least it's consistent. For example, Nim treats "ATGCA".lowerCase() the same as lowercase("ATGCA"). I do appreciate the fact that you can use a chaining syntax instead of a nesting one when doing multiple function calls but this is also a matter of style more than substance.
While I trust the author on this, I don’t think DNA datasets and string analysis were a great example.
One of the big, big things for improving performance on DNA analysis of ANY kind is converting these large text files into binary (4 letters easily convert to a 2-bit encoding), which massively improves basically any analysis you’re trying to do.
Not only does it compress your dataset (2 bits vs 16 bits), it allows absurdly faster numerical libraries to be used in lieu of string methods.
There’s no real point in showing off that a compiled language is faster at doing something the slow way…
You make a fair point that using optimized numerical libraries instead of string methods will be ridiculously fast because they're compiled anyway. For example, scikit-bio does just this for their reverse complement operation [1]. However, they use an 8-bit representation since they need to be able to represent the extended IUPAC notation for ambiguous bases, which includes things like the character N for "aNy" nucleotide [2]. One could get creative with a 4-bit encoding and still end up saving space (assuming you don't care about the distinction between upper versus lowercase characters in your sequence [3]). Or, if you know in advance your sequence is unambiguous (unlikely in DNA sequencing-derived data) you could use the 2-bit encoding. When dealing with short nucleotide sequences, another approach is to encode the sequence as an integer. I would love to see a library (Python, Nim, or otherwise) that made using the most efficient encoding for a sequence transparent to the developer.
[1] https://github.com/biocore/scikit-bio/blob/b470a55a8dfd054ae...
[2] https://en.wikipedia.org/wiki/Nucleic_acid_notation
[3] https://bioinformatics.stackexchange.com/questions/225/upper...
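To make the idea concrete, here is a toy 2-bit packing for unambiguous sequences (a sketch only; real libraries use more elaborate encodings, as discussed above):

```python
# Two bits per base, packed into a single Python integer.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> int:
    n = 0
    for base in seq:
        n = (n << 2) | CODES[base]
    return n

def unpack(n: int, length: int) -> str:
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[n & 0b11])  # read the low two bits
        n >>= 2
    return "".join(reversed(out))

print(unpack(pack("GATTACA"), 7))  # GATTACA
```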
Because we use it as a nice syntactic frontend to numpy, a large and highly optimized library written in C++ and Fortran (sic). That is, we actually don't use "Python-native" code much, and numpy is essentially an APL-like, array-oriented thing where e.g. you don't normally need loops.
For native-language data processing, Python is slow; Nim or Julia would easily outperform it, while being comparably ergonomic.
Apparently there's also a data processing library for Nim called Arraymancer[0] that's inspired by Numpy and PyTorch. It claims to be faster than both.
[0] https://mratsim.github.io/Arraymancer/
Please add D language to the mix as well. Interestingly, you can simply replace Nim with D in the blog article and most of the contents will still make sense!
The funny thing is that Nim and Julia libraries are still wrapping Fortran numerical libraries, while D beat the old and trusted Fortran library on its home turf five years back:
http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
There’s been a tremendous amount of work optimizing blas _and_ ensuring it’s numerically stable. Julia made a good choice to use blas first. Though it’s good to see new native implementations.
For Nim, there’s also NimTorch, which is interesting in that it builds on Nim’s C++ target to generate native PyTorch code. Even Python is technically a second-class citizen for the C++ code. Most ML libraries are C++ all the way down.
https://github.com/sinkingsugar/nimtorch
Not necessarily true with Julia. Many libraries like DifferentialEquations.jl are Julia all of the way down because the pure Julia BLAS tools outperform OpenBLAS and MKL in certain areas. For example see:
Another nim & python thread that has not been mentioned here yet:
https://news.ycombinator.com/item?id=28506531 - project allows creating pythonic bindings for your nim libraries pretty easily, which can be useful if you still want to write most of your toplevel code in python, but leverage nim's speed when it matters.
The author makes a fair point; however, that is a rather non-optimal implementation in Python. You could likely use chunked Pandas to speed up the code, or at least replace some of the for loops with list comprehension syntax.
However, in any case I would never replace Python with Nim, as it is too niche a language and you would struggle with recruiting. I could consider Julia if its popularity keeps growing.
That is the ultimate challenge of a language. It either needs a large backer (Go and Google) or has to be so good that it gets natural market adoption (Julia). As a manager I am reluctant to adopt yet another language unless there is a healthy job market for it.
There are objectively better niche solutions for niche problems out there. But we pick up things that can be applied to solve a number of different problems, that are more versatile, and that have a community behind them.
Agreed. I am certainly inclined to believe that Nim is a better language than Python, but it's not so much better as to justify moving off of the ecosystem.
This is reasonably idiomatic Python and 10x faster than the implementation in the original post:
    with open("orthocoronavirinae.fasta") as f:
        text = ''.join(line.rstrip() for line in f.readlines() if not line.startswith('>'))
    gc = text.count('G') + text.count('C')
    total = len(text)
Or if you want to be explicit, this is just as fast (and might scale better for particularly long genomes):
    gc = 0
    total = 0
    with open("orthocoronavirinae.fasta") as f:
        for line in f.readlines():
            if not line.startswith('>'):
                line = line.rstrip()
                gc += line.count('C') + line.count('G')
                total += len(line)
I didn't test Nim but the author reports Nim is 30x faster than his Python implementation, so mine would be about 3x slower than his Nim.
Yes, you can implement a faster Python version, but notice also:
* This faster version reads the whole file into memory (except comment lines). The article mentions the data being 150MB, which should fit in memory, but for larger datasets this approach would be infeasible
* The faster version is actually delegating a lot of work to Python's C internals by using text.count('G'). All the internal looping and comparison is done in C, while in the original version it goes through Python
So yes, you can definitely write faster Python by delegating most of the work to C.
The point of the article is not about how to optimize Python, but about how, given almost identical implementations in Python and Nim, Nim can outperform Python by 1 or 2 orders of magnitude without resorting to C internals for basic things like looping or comparing characters.
I didn't try to write optimized code, but idiomatic Python. Which also happens to be 10x faster.
To make it streaming, take the second version and remove the readlines (directly iterate over f).
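That streaming variant can be sketched as a function over any iterable of lines, so an open file handle can be passed in directly (the filename here comes from the article):

```python
def gc_content(lines):
    # Streaming GC count: only one line is held in memory at a time.
    gc = total = 0
    for line in lines:
        if not line.startswith(">"):
            line = line.rstrip()
            gc += line.count("C") + line.count("G")
            total += len(line)
    return gc, total

# with open("orthocoronavirinae.fasta") as f:
#     gc, total = gc_content(f)

print(gc_content([">seq1", "ATGC", "GGCC"]))  # (6, 8)
```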
Delegating work to Python's C internals is fine IMO because "batteries included" is a key feature of Python. "Nim outperforms unidiomatic Python that deliberately ignores key language features" is perhaps true, but less flashy of a headline.
And to be honest, I mainly wrote this because the other top level Python implementations for this one were terrible at the time of the post.
import io
f = io.StringIO(
"""
AB
CD
EF
GH
"""
)
total = sum(map(lambda s: 0 if s[0]==">" else s.count('G') + s.count('C'), f.readlines()))
print(total)
As a Data Engineer, I mainly use Python in conjunction with PySpark. So essentially I just use Python as an easy-to-read wrapper around Spark, and it works great when working together with Data Scientists, who are mostly used to Pandas, Keras, Tensorflow, etc.
In my use case, I don't really see how Nim would make my life easier right now.
One pitfall that I ran into when trying out Nim for scientific computing is that Nim follows more of a computer science convention than a mathematics convention for the exponentiation and negation operators. That is, in Nim -2^3 = (-2)^3, unlike more scientific-computing-oriented languages where -2^3 = -(2^3). To someone like myself who mainly does scientific work this was quite unexpected, and it causes a surprising amount of mental overhead to avoid mistakes. I did like Nim quite a bit otherwise, but essentially found that I was missing some important numerical libraries, so I did not continue using it.
Although Nim using weird order of operations is unfortunate, your example is not well chosen. Replace the exponent 3 with an even integer and your point will be clear.
Maybe not that weird, although not the most common syntax. In Nim 0 - 2^2 evaluates to -4, while -2^2 evaluates to +4. So in -2, the - is treated as a unary minus and binds tightly to the 2. It would be bad if Nim were doing + or - before ^, or before *. I have a feeling some other languages treat the unary minus the same way, but don’t know of any examples off hand.
I don't think that is a mathematics convention vs computer science convention thing. Most languages, whether aimed at mathematics or computer science, have exponentiation at higher precedence than unary minus. (JavaScript does not, but it also does not allow unary minus directly in front of the base of an exponentiation so you have to add parenthesis no matter which interpretation you want).
The main places you find it the other way are spreadsheets and shells.
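Python, for what it's worth, is in the majority camp where exponentiation binds tighter than unary minus (unlike Nim as described above):

```python
# Exponentiation binds tighter than unary minus in Python,
# matching the usual mathematical convention.
assert -2 ** 2 == -4      # parsed as -(2 ** 2)
assert (-2) ** 2 == 4     # parentheses force the other reading
assert 0 - 2 ** 2 == -4   # binary minus behaves the same way
```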
Is there an explanation from the Nim authors as to why they made such an odd choice?
IIRC the argument is that unary negation is the operator with the highest precedence and one should be consistent across logical and mathematical operations. I think this is not a completely unreasonable stance to take (and several other languages take the same), but it is unintuitive for someone who does significant scientific computing.
E.g. random example:
Sprinkle some cdefs in your Python and suddenly you're faster than C++
https://github.com/luizsol/PrimesResult
https://github.com/PlummersSoftwareLLC/Primes/blob/drag-race...
25.8 seconds down to 1.5
No, Nim is truly among the top fastest languages when writing idiomatic code as shown in many benchmarks.
> But Nim is only one of a whole suite of languages that easily cruise to a 10x performance win over Python
...while also being very friendly to Python programmers, intuitive and expressive. Unlike many other languages.
Also, "awful" is too harsh. Probably 90% of Python code just doesn't need to be faster than it is.
I’m surprised you need the full 4 bits to deal with ambiguous bases, but it probably makes sense at some lower level I don’t understand.
(As in GATTACA might be read as is, but might be read as GAT?ACA.)
Still, that's a minimum of 3 bits versus much more.
[Edit: I see another commenter with the same observation, more thoroughly explained!]
You say that, but Julia is rapidly acquiring native numerical libraries that outperform OpenBLAS:
https://discourse.julialang.org/t/realistically-how-close-is...
https://github.com/YingboMa/RecursiveFactorization.jl/pull/2...
So a stiff ODE solve is pure Julia, LU-factorizations and all. This is what allows it to outperform the common C and Fortran libraries very consistently. See https://benchmarks.sciml.ai/html/MultiLanguage/wrapper_packa... and https://benchmarks.sciml.ai/html/Bio/BCR.html
If you want to make your Nim code even more "pythonic" there is https://github.com/Yardanico/nimpylib, and for calling Python code from Nim there is https://github.com/yglukhov/nimpy
Not all technologies require the full cycle and the normal risk management.
Your first example takes 3.1 seconds, my previous comment takes 2.3 seconds, this one takes 1.4 seconds.