Taichi lang: High-performance parallel programming in Python

Out of curiosity, I rewrote their prime counter example to use a sieve instead of being a silly maximally-computation-dense example.

To make it work with taichi, I had to change the declaration of the sieve from sieve = [0] * N to sieve = ti.field(ti.i8, shape=N) but the rest of the code remained the same.

Ordinary Python:

time elapsed: 0.444s

Taichi ignoring compile time, I believe:

time elapsed: 0.119s

A slightly more realistic example than the 10x+ improvement they show on the really toy example with results that aren't too bad. I'd take a 3x improvement for tiny changes. Pretty neat!

(I tried some other trivial things like using np.int8 and it was slower. One can obviously make this a ton faster but I was interested in seeing how the toy was if we just made it slightly more memory-bound).

A negative was that throwing list comprehensions in made the python version faster - about 0.3 seconds - (and shorter and arguably more "pythonic") and simultaneously broke the port to taichi.

vishal0123 · 3 years ago

I tried implementing the same and I am getting 500ms vs 20ms with wrong answer in the first call in taichi but correct in subsequent calls. I guess I found some bug in taichi: https://imgur.com/a/lpK2iVF

Could you share your code as well.

    N = 1000000

    isnotprime = [0] * N

    def count_primes(n: int) -> int:
        count = 0
        for k in range(2, n):
            if isnotprime[k] == 0:
                count += 1
                for l in range(2, n // k):
                    isnotprime[l * k] = 1

        return count

    import taichi as ti
    ti.init(arch=ti.cpu)

    isnotprime = ti.field(ti.i8, shape=(N, ))

    @ti.kernel
    def count_primes(n: ti.i32) -> int:
        count = 0
        for k in range(2, n):
            if isnotprime[k] == 0:
                count += 1
                for l in range(2, n // k):
                    isnotprime[l * k] = 1

        return count

dgacmu · 3 years ago

Python:

    import time
    import math
    import numpy as np
    
    N = 1000000
    SN = math.floor(math.sqrt(N))
    sieve = [False] * N
    
    def init_sieve():
        for i in range(2, SN):
            if not sieve[i]:
                k = i*2
                while k < N:
                    sieve[k] = True
                    k += i
            
    
    def count_primes(n: int) -> int:
        return (N-2) - sum(sieve)
    
    start = time.perf_counter()
    init_sieve()
    print(f"Number of primes: {count_primes(N)}")
    print(f"time elapsed: {time.perf_counter() - start}/s")

Taichi:

    import taichi as ti
    import time
    import math
    ti.init(arch=ti.cpu)
    
    N = 1000000
    SN = math.floor(math.sqrt(N))
    sieve = ti.field(ti.i8, shape=N)
    
    @ti.kernel
    def init_sieve():
        for i in range(2, SN):
            if sieve[i] == 0:
                k = i*2
                while k < N:
                    sieve[k] = 1
                    k += i
            
    @ti.kernel
    def count_primes(n: int) -> int:
        count = 0
        for i in range(2, N):
            if (sieve[i] == 0):
                count += 1
        return count
    
    start = time.perf_counter()
    init_sieve()
    print(f"Number of primes: {count_primes(N)}")
    print(f"time elapsed: {time.perf_counter() - start}/s")

(The difference of using 0 vs False is tiny; I had just been poking at the python code to think about how I'd make it more pythonic and see if that made it worse to do taichi)

bombolo · 3 years ago

But how fast does it go with pypy and no changes to the code?

v3ss0n · 3 years ago

Interested as well

garyrob · 3 years ago

Might be worthwhile to run the same code with an appropriate numba decorator. My guess is that you'd get at least as much speed up but without having to change the sieve declaration, but I'm not sure.

Deleted Comment

Please don't retrofit more stuff to make python work. move over to julia already. u can call python from Julia

Buttons840 · 3 years ago

It's been awhile, but last time I was doing serious work in Julia things were a little janky. For example, the REPL would segfault sometimes if I Ctrl-C'ed during heavy computations. And Flux at first seems like it will work on any code, which seems amazing, but then you find out at runtime that one of the operations you used isn't supported and get a runtime error. PyTorch might not work on regular Python code, but at least I know the APIs provided by PyTorch will work, even though they are a subset of what can be done in regular Python.

Still, most things worked in Julia, and there have been many improvements since then so I suspect the few remaining rough spots are being smoothed out. In the future I will be happy if I get to work with Julia more.

xiaodai · 3 years ago

Yeah. It's a chicken and egg thing. Imagine if resources applied to PyTorch was spent implementing in Julia.

But then there's not enough users... so the cycle continues until one day Julia hits critical mass and a tipping point is reached.

montebicyclelo · 3 years ago

I think the Julia people underestimate how many people, and how much those people, like the syntax of Python. If they'd stuck more to that they might have been able to have won more people over.. (white space, plus many other things).

acmj · 3 years ago

It is not just syntax. While python has Java-like OOP from a far, Julia has a distinct language design. It doesn't have a concept of "class" in the traditional sense. It instead has multiple dispatch, which is very flexible but sometimes too flexible to control. I found Julia harder to write for an averaged programmer. Furthermore, the time-to-first-plot problem had pissed off many early adopters (I know a few and they won't come back) and apparently remains a problem for some [1].

Julia is a great niche language for what it is good at. It will survive but won't gain much popularity.

[1] https://discourse.julialang.org/t/very-slow-time-to-first-pl...

jarbus · 3 years ago

As someone who switched from python to Julia, Julia syntax supports things like list comprehensions, but also many better things like broadcasting that python really needs. The only thing python has over Julia is that Julia requires an “end” keyword where python uses indentation. But julia has so much more than python (like macros) that it’s just better syntax wise.

amval · 3 years ago

One of the features that I dislike the most abiout Julia is how list comprehensions have been directlly ported from Python. This was a conscious choice they made out of pragmatism, not because of the merits of the design.

I don't doubt that what you say it's true, but to me, it comes down more to lack of familiarity with other languages than any actual merits of Python syntax and semantics. Frankly, I am glad the didn't take more from Python.

nhgiang · 3 years ago

New language developers really overestimate most people's willingness to learn new languages

cwp · 3 years ago

Yes. It's quite amazing really. Also, people really hate using more than one language. Web developers will twist themselves into incredible knots to avoid having to write HTML and CSS and Javascript in the same project.

schemescape · 3 years ago

It’s not even the language that causes the most friction, it’s the runtime, libraries, package manager, build system, foreign function interfaces, and on and on…

xiaodai · 3 years ago

not overestimate. i think they know of the challenges and stickiness of python die hards.

but there's a better way, there's a better way.

like python used to be less popular than perl but look at where perl is now vs python. things do take time to change though

Archit3ch · 3 years ago

As someone in the Julia ecosystem, I wouldn't know where to begin with writing performant python. Is it Numba? Taichi? Torchscript? NumPy? Cython? Pypy? None of these seem to work together (besides calling each other, I suppose).

montebicyclelo · 3 years ago

Wow, it seems like you have a number of good options to choose from. What a horrible situation, to have a language with a really rich ecosystem of powerful libraries.

ActorNightly · 3 years ago

Performant code is really a big group of code.

If you look at ML, Python is completely fine because all the processing that happens with matrix multiplication, even on CPUs, far, far, FAR outweighs all the setup stuff in volume of operations.

On the other hand, if majority of your application relies heavily on processing speed (i.e you need compare/jump operations rather just add/multiply/load/store of the GPUs), Python is going to be slow. In this case, if you want custom performant code, you write C extensions for the performant critical code, and launch them from higher level python code.

That being said, there is generally a library (like Taichi) that already does this for you.

catchnear4321 · 3 years ago

The problem is I have no interest in calling Python from Julia, since I can just use Python.

alphanullmeric · 3 years ago

Ah yes, the language that claims to look like Python and run like C while not being particularly close in either aspect.

v3ss0n · 3 years ago

That's quite silly. Julia ecosystem is non-existent. If we move nimlang would be the closest . PyPy shaping up nice for cext part and when fully compatible we will just use PyPy.org.

Julia language features are really weak.

packetlost · 3 years ago

I wouldn't consider Julia a replacement for Python

Iwan-Zotow · 3 years ago

and its an ugly thing

they didn't get range composition right

okaleniuk · 3 years ago

Tried that. It was fun. Didn't see any benefit though, went back to Python.

I don't care much about how the interface over LLVM looks like. As soon as I have the same result in the end, I'd rather stick to whatever has more users.

ActorNightly · 3 years ago

I used to hate Python too, but i complety 180ed in the past few years as I learned more about computer science.

If you look at compute in general, it can be pretty much be summed up as add/multiply/load/store/compare/jump (straight from Jim Keller). All the other instructions are more specialized versions of that, with some having dedicated hardware in CPUS.

If you need to do those 6 things as fast as possible, on single piece of hardware, you are most likely writing a video game. Thus video game development is pretty much C/C++ with a few things of Swift/C# sprinkled about.

If a single piece of hardware requirement goes away (i.e you are writing a distributed system to serve a web app), people quickly figured out that hardware is cheaper than developer salary, and also network latency is going to be the dominant thing for speed. This is the reason Python took off - its super quick to write and deploy applications, and instead of paying a developer $10k+ a month, you can just spend half that on more EC2s that handle the load just fine, even if the end user has to wait 1.5 seconds instead of 1.1 for a result.

If you don't need compare/jump, your program is essentially better off suited to running on the GPUs. OpenCL/CUDA came about because people realized that a lot of applications simply need to do math without any decisions along the way, and GPUS are much better at this. The paradigm is that you write kernels that you then load onto the GPU - this can be done in any language since you really just need to run the code once. I.e Python, despite being slow is used primarily for ML because of this.

Then there is multiply/add only, which you probably best know in implementation as ASICs for bitcoin mining that blew GPUs out of the water. When you don't have memory controllers and just load/store from predefined locations, your speed goes through the roof. This is the future of ML chips as well, where your compiler looks a lot like the verilog/hdl compilers for FPGAs.

Furthermore, with ML, the compare/jump and even load/store is being rolled into multiply/add. You have seemingly complex algorithms like GPT that make decisions, but without any branching. Technically speaking, a NAND gate is all you need to make a general purpose CPU and you need 2 neurons to simulate a NAND gate. So you can build an entire general purpose CPU from multiply/add.

So in the end, its absolutely worth investing in Python and making it better. Languages like Julia are currently better suited to performant tasks, but the necessity of writing performant code to run on CPUs is going away slowly with every day. Its better to have a high purpose language that allows you to put ideas into code as quickly as possible, and then have different specializations for more generic tasks.

Deleted Comment

jalino23 · 3 years ago

julia is so nice but WHY did they have to decide to do a 1 based index?? like theres literally no reasonnnnnnnnnnnnnnnn

xiaodai · 3 years ago

let's see, fortran, r, matlab.

all serious numerical languages have that. it's more natural

0 base indexing is only good for calculating memory offsets. Nothing else. like in go `vec[a,b]` is indexing `a` to `b-1` which is purely because it's more convenient due to 0-indexing. this `b-1` is hugely confusing and big gotcha for the layman.

Dead Comment