Out of curiosity, I rewrote their prime-counter example to use a sieve instead of being a silly, maximally computation-dense benchmark.
To make it work with taichi, I had to change the declaration of the sieve from sieve = [0] * N to sieve = ti.field(ti.i8, shape=N) but the rest of the code remained the same.
Ordinary Python:
time elapsed: 0.444s
Taichi ignoring compile time, I believe:
time elapsed: 0.119s
A slightly more realistic example than the really toy one where they show a 10x+ improvement, and the results aren't too bad. I'd take a 3x improvement for tiny changes. Pretty neat!
(I tried some other trivial things, like using np.int8, and it was slower. One can obviously make this a ton faster, but I was interested in seeing how the toy example fared if we just made it slightly more memory-bound.)
A negative was that rewriting with list comprehensions made the Python version faster (about 0.3 seconds, and shorter and arguably more "pythonic") while simultaneously breaking the port to Taichi.
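Roughly this kind of rewrite, sketched from memory (slice assignment plus a generator expression rather than a literal comprehension; at least when I tried, these aren't accepted inside a @ti.kernel):

import math
import time

N = 1000000
SN = math.floor(math.sqrt(N))
sieve = [False] * N

start = time.perf_counter()
for i in range(2, SN + 1):
    if not sieve[i]:
        # mark all multiples of i in one shot via slice assignment
        sieve[i * i::i] = [True] * len(range(i * i, N, i))
count = sum(1 for i in range(2, N) if not sieve[i])  # generator expression
print(f"Number of primes: {count}")
print(f"time elapsed: {time.perf_counter() - start}/s")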
I tried implementing the same thing and I'm getting 500ms vs 20ms, but with a wrong answer on the first call in Taichi and the correct answer on subsequent calls. I guess I found a bug in Taichi: https://imgur.com/a/lpK2iVF
Could you share your code as well?
# plain Python version
N = 1000000
isnotprime = [0] * N

def count_primes(n: int) -> int:
    count = 0
    for k in range(2, n):
        if isnotprime[k] == 0:
            count += 1
            # mark multiples of this prime as composite
            # (note: range(2, n // k) skips the multiple k * (n // k) when k does not divide n)
            for l in range(2, n // k):
                isnotprime[l * k] = 1
    return count
# Taichi version: only the field declaration and the decorator change
import taichi as ti

ti.init(arch=ti.cpu)
isnotprime = ti.field(ti.i8, shape=(N, ))

@ti.kernel
def count_primes(n: ti.i32) -> int:
    count = 0
    for k in range(2, n):
        if isnotprime[k] == 0:
            count += 1
            for l in range(2, n // k):
                isnotprime[l * k] = 1
    return count
import time
import math
import numpy as np

N = 1000000
SN = math.floor(math.sqrt(N))
sieve = [False] * N

def init_sieve():
    for i in range(2, SN):
        if not sieve[i]:
            k = i*2
            while k < N:
                sieve[k] = True
                k += i

def count_primes(n: int) -> int:
    return (N-2) - sum(sieve)

start = time.perf_counter()
init_sieve()
print(f"Number of primes: {count_primes(N)}")
print(f"time elapsed: {time.perf_counter() - start}/s")
Taichi:
import taichi as ti
import time
import math

ti.init(arch=ti.cpu)

N = 1000000
SN = math.floor(math.sqrt(N))
sieve = ti.field(ti.i8, shape=N)

@ti.kernel
def init_sieve():
    for i in range(2, SN):
        if sieve[i] == 0:
            k = i*2
            while k < N:
                sieve[k] = 1
                k += i

@ti.kernel
def count_primes(n: int) -> int:
    count = 0
    for i in range(2, N):
        if (sieve[i] == 0):
            count += 1
    return count

start = time.perf_counter()
init_sieve()
print(f"Number of primes: {count_primes(N)}")
print(f"time elapsed: {time.perf_counter() - start}/s")
(The difference between using 0 and False is tiny; I had just been poking at the Python code to think about how I'd make it more pythonic, and to see whether that made it harder to port to Taichi.)
Might be worthwhile to run the same code with an appropriate Numba decorator. My guess is that you'd get at least as much speedup without having to change the sieve declaration, but I'm not sure.
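Something like this is what I have in mind (an untested sketch, assuming the usual numba.njit decorator and a NumPy int8 array standing in for the Taichi field):

import math

import numpy as np
from numba import njit

N = 1000000
SN = math.floor(math.sqrt(N))

@njit
def init_sieve(sieve):
    # same loop as above; Numba compiles it in nopython mode
    for i in range(2, SN):
        if sieve[i] == 0:
            k = i * 2
            while k < N:
                sieve[k] = 1
                k += i

sieve = np.zeros(N, dtype=np.int8)
init_sieve(sieve)  # the first call pays the JIT compile cost
print("Number of primes:", (N - 2) - int(sieve.sum()))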
> No barrier to entry for Python users: Taichi shares almost the same syntax as Python. Apply a single Taichi decorator, and your functions are automatically turned into optimized machine code.
It looks super interesting, except the "almost the same syntax as Python" part seems like such a footgun for everything from IDE integration to subtle bugs and more.
I was super into the idea of a strict Python subset that gets JIT compiled inline based on just a decorator.
I helped someone who had to use Taichi code written by a PhD student, and it was a bit weird. It looks a lot like Python, but you have to code like you would in CUDA (e.g. control flow); there is no magic.
For this we had to calculate forces to animate some kind of polygon with a lot of joints, and we could not just call scipy from the Taichi code. I had to implement a very dirty polynomial equation solver in Taichi for the demo.
I was playing earlier today with Triton[0], from OpenAI. Like Taichi, it makes it super easy to write native GPU code from Python, but it really does feel like something very experimental for now. (I know the use case is very different.)
It's been a while, but last time I was doing serious work in Julia things were a little janky. For example, the REPL would segfault sometimes if I Ctrl-C'ed during heavy computations. And Flux at first seems like it will work on any code, which seems amazing, but then you find out at runtime that one of the operations you used isn't supported and get a runtime error. PyTorch might not work on regular Python code, but at least I know the APIs provided by PyTorch will work, even though they are a subset of what can be done in regular Python.
Still, most things worked in Julia, and there have been many improvements since then so I suspect the few remaining rough spots are being smoothed out. In the future I will be happy if I get to work with Julia more.
I think the Julia people underestimate how many people like Python's syntax, and how much they like it. If they'd stuck closer to it (whitespace, plus many other things), they might have won more people over.
It is not just syntax. While Python has Java-like OOP from afar, Julia has a distinct language design. It doesn't have a concept of "class" in the traditional sense. It instead has multiple dispatch, which is very flexible but sometimes too flexible to control. I found Julia harder to write for an average programmer. Furthermore, the time-to-first-plot problem has pissed off many early adopters (I know a few and they won't come back) and apparently remains a problem for some [1].
Julia is a great niche language for what it is good at. It will survive but won't gain much popularity.
As someone who switched from Python to Julia: Julia's syntax supports things like list comprehensions, but also many better things, like broadcasting, that Python really needs. The only thing Python has over Julia is that Julia requires an "end" keyword where Python uses indentation. But Julia has so much more than Python (like macros) that it's just better syntax-wise.
One of the features that I dislike the most about Julia is how list comprehensions have been directly ported from Python. This was a conscious choice they made out of pragmatism, not because of the merits of the design.
I don't doubt that what you say is true, but to me it comes down more to lack of familiarity with other languages than to any actual merits of Python's syntax and semantics. Frankly, I am glad they didn't take more from Python.
Yes. It's quite amazing really. Also, people really hate using more than one language. Web developers will twist themselves into incredible knots to avoid having to write HTML and CSS and Javascript in the same project.
It’s not even the language that causes the most friction, it’s the runtime, libraries, package manager, build system, foreign function interfaces, and on and on…
As someone in the Julia ecosystem, I wouldn't know where to begin with writing performant python. Is it Numba? Taichi? Torchscript? NumPy? Cython? Pypy? None of these seem to work together (besides calling each other, I suppose).
Wow, it seems like you have a number of good options to choose from. What a horrible situation, to have a language with a really rich ecosystem of powerful libraries.
If you look at ML, Python is completely fine because all the processing that happens with matrix multiplication, even on CPUs, far, far, FAR outweighs all the setup stuff in volume of operations.
On the other hand, if the majority of your application relies heavily on processing speed (i.e. you need compare/jump operations rather than just the add/multiply/load/store of the GPUs), Python is going to be slow. In that case, if you want custom performant code, you write C extensions for the performance-critical parts and launch them from higher-level Python code (a sketch of that pattern follows below).
That being said, there is generally a library (like Taichi) that already does this for you.
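As a hypothetical sketch of that pattern (the library name and function here are made up; only the ctypes machinery is real):

import ctypes

# hypothetical compiled C library holding the hot loop
lib = ctypes.CDLL("./libhotloop.so")
lib.count_primes.argtypes = [ctypes.c_long]
lib.count_primes.restype = ctypes.c_long

def count_primes(n: int) -> int:
    # Python stays the orchestration layer; the tight loop runs in C
    return lib.count_primes(n)

print(count_primes(1000000))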
That's quite silly. The Julia ecosystem is non-existent. If we moved, Nim would be the closest fit. PyPy is shaping up nicely on the C-extension side, and when it's fully compatible we will just use PyPy (pypy.org).
Tried that. It was fun. Didn't see any benefit though, went back to Python.
I don't care much about what the interface over LLVM looks like. As long as I get the same result in the end, I'd rather stick to whatever has more users.
I used to hate Python too, but I completely 180'd in the past few years as I learned more about computer science.
If you look at compute in general, it can pretty much be summed up as add/multiply/load/store/compare/jump (straight from Jim Keller). All the other instructions are more specialized versions of those, some with dedicated hardware in CPUs.
If you need to do those 6 things as fast as possible on a single piece of hardware, you are most likely writing a video game. Thus video game development is pretty much C/C++ with a bit of Swift/C# sprinkled about.
If the single-piece-of-hardware requirement goes away (i.e. you are writing a distributed system to serve a web app), people quickly figured out that hardware is cheaper than developer salary, and also that network latency is going to be the dominant factor for speed. This is the reason Python took off: it's super quick to write and deploy applications, and instead of paying a developer $10k+ a month, you can just spend half that on more EC2s that handle the load just fine, even if the end user has to wait 1.5 seconds instead of 1.1 for a result.
If you don't need compare/jump, your program is essentially better suited to running on a GPU. OpenCL/CUDA came about because people realized that a lot of applications simply need to do math without any decisions along the way, and GPUs are much better at this. The paradigm is that you write kernels that you then load onto the GPU; this can be done from any language since you really just need to emit the code once. I.e. Python, despite being slow, is used primarily for ML because of this.
Then there is multiply/add only, which you probably best know in the form of ASICs for bitcoin mining that blew GPUs out of the water. When you don't have memory controllers and just load/store from predefined locations, your speed goes through the roof. This is the future of ML chips as well, where your compiler looks a lot like the Verilog/HDL compilers for FPGAs.
Furthermore, with ML, the compare/jump and even load/store are being rolled into multiply/add. You have seemingly complex algorithms like GPT that make decisions, but without any branching. Technically speaking, a NAND gate is all you need to make a general-purpose CPU, and a neuron or two is enough to simulate a NAND gate. So you can build an entire general-purpose CPU from multiply/add.
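A tiny sketch of that last point (my addition; in fact a single weighted unit with a bias already computes NAND):

def nand(a, b):
    # weights -2, -2 and bias +3, then a threshold: multiply/add only
    s = -2 * a + -2 * b + 3
    return 1 if s > 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, nand(a, b))  # prints 1, 1, 1, 0 -- the NAND truth table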
So in the end, it's absolutely worth investing in Python and making it better. Languages like Julia are currently better suited to performance-heavy tasks, but the necessity of writing performant code to run on CPUs is slowly going away every day. It's better to have a general-purpose language that allows you to put ideas into code as quickly as possible, and then reach for specialized tools for the more demanding tasks.
All serious numerical languages have that. It's more natural.
0-based indexing is only good for calculating memory offsets, nothing else. Like in Go, `vec[a:b]` indexes `a` through `b-1`, purely because that's more convenient with 0-indexing. This `b-1` is hugely confusing and a big gotcha for the layman.
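For reference, the same half-open convention in Python (which shares it with Go slices):

v = list(range(10))
a, b = 2, 5
print(v[a:b])                # [2, 3, 4] -- stops at b-1, not b
print(len(v[a:b]) == b - a)  # True: the length comes out to b - a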
Right, I'd like to see a comparison to Nim or Julia, or another compiled high-level language that isn't particularly performance-oriented like Haskell or Clojure, or Common Lisp, or even Ruby with its new JIT(s). Or for that matter, Python with Numba or one of the other JIT implementations (PyPy, Pyston, Cinder).
Is the improvement because the program is automatically parallelized, or because the code is compiled/JITed? A 150x improvement is too much for either alone, so I suspect both contribute.
They say the example I used is JIT-compiled into machine code. I haven't looked into the codebase yet, but I presume that means it just un-pythons it back into C? Not sure.
FWIW, I tried the GPU target (CUDA) and it was faster than vanilla Python, but slower than the accelerated CPU target by about 4x.
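One way to tease parallelization apart from compilation would be to pin the CPU backend to a single thread and re-time (I believe cpu_max_num_threads is the relevant ti.init option, but check the docs for your Taichi version):

import taichi as ti

# single-threaded CPU backend: any remaining speedup over plain Python
# should then come from compilation alone, not from parallelization;
# re-run the kernel and compare against the default multithreaded init
ti.init(arch=ti.cpu, cpu_max_num_threads=1)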
Library: https://numba.readthedocs.io/en/stable/
CUDA: https://numba.readthedocs.io/en/stable/cuda/index.html
Example: https://numba.readthedocs.io/en/stable/cuda/examples.html
[0] https://openai.com/research/triton
[0] https://developer.nvidia.com/nvidia-triton-inference-server
Meaning?
- their approach is still bizarre and exploratory and they still don't know how to structure their APIs and are making it up as they go?
or:
- there are still some rough edges, bugs, and no full documentation yet?
as those are quite different cases...
CUDA-only, no mention of Metal.
But then there's not enough users... so the cycle continues until one day Julia hits critical mass and a tipping point is reached.
[1] https://discourse.julialang.org/t/very-slow-time-to-first-pl...
but there's a better way, there's a better way.
Like Python used to be less popular than Perl, but look at where Perl is now vs. Python. Things do take time to change, though.
Julia language features are really weak.
they didn't get range composition right
(taichi) [X@X taichi]$ python primes.py
[Taichi] version 1.4.1, llvm 15.0.4, commit e67c674e, linux, python 3.9.14
[Taichi] Starting on arch=x64
Number of primes: 664579
time elapsed: 93.54279175889678/s
Number of primes: 664579
time elapsed: 0.5988388371188194/s