t-vi commented on What to Learn? CUDA vs. PyTorch vs. Jax vs. Triton/Pallas    · Posted by u/bananc
t-vi · a month ago
- I don't think it hurts to learn PyTorch (and having learned JAX is good, too). I don't know if JAX + Triton is as impossible as you make it out to be, but PyTorch integration seems quite good for many things.
- For Pallas, Triton and CUDA/C++, you probably want to know a bit about how GPUs work. There is the GPU-Mode Discord / lectures / resources if you are looking for material: https://github.com/gpu-mode/
- In my experience, how well Triton works varies depending on what you want to do (i.e. on how well the programming model fits the task). When it does fit, it is quite nice for getting something reasonably fast reasonably quickly; there is a small sketch of that kind of task below. PyTorch (in the inductor torch.compile backend) has made many things work well, so you could check that out if you run out of examples elsewhere.
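To give a flavor of when the programming model fits: a minimal sketch (not from the thread, just the standard Triton elementwise/vector-add pattern, assuming a CUDA GPU and the triton package) of the kind of kernel that is pleasant to write directly:

  # Minimal sketch: the standard Triton vector-add pattern, where the
  # block-based programming model fits the task naturally.
  import torch
  import triton
  import triton.language as tl

  @triton.jit
  def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
      pid = tl.program_id(axis=0)
      offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements          # guard the last, partial block
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      tl.store(out_ptr + offsets, x + y, mask=mask)

  def add(x, y):
      out = torch.empty_like(x)
      n = x.numel()
      grid = (triton.cdiv(n, 1024),)       # one program instance per block
      add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
      return out

  x = torch.randn(1 << 20, device="cuda")
  y = torch.randn(1 << 20, device="cuda")
  assert torch.allclose(add(x, y), x + y)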
t-vi commented on Running load using official Nvidia PyTorch image boost performance by 50%   bartusiak.ai/2025/11/04/i... · Posted by u/riomus
t-vi · a month ago
Note that the NVIDIA container ships CUDA + cuBLAS 13.0.2, whose release notes cite "Improved performance on NVIDIA DGX Spark for FP16/BF16 and FP8 GEMMs", which seems to be exactly your use case. In general, I would suspect it mostly comes down to the versions of the libraries.

Interestingly, there is a cuBLAS 13.1 wheel on PyPI; I'm not sure what that does.
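If you want to compare environments, a minimal sketch (the nvidia-* wheel names are an assumption and depend on the CUDA major version, e.g. nvidia-cublas-cu12 vs. nvidia-cublas-cu13; in the NGC container the libraries typically come from the image rather than from wheels) for dumping the versions in play:

  # Minimal sketch: print the PyTorch / CUDA versions and the versions of
  # any installed nvidia-* wheels, so the container and a pip environment
  # can be compared side by side.
  import importlib.metadata as md
  import torch

  print("torch      :", torch.__version__)
  print("torch CUDA :", torch.version.cuda)
  for dist in md.distributions():
      name = dist.metadata["Name"]
      if name and name.startswith("nvidia-"):
          print(f"{name:28s} {dist.version}")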

t-vi commented on Backpropagation is a leaky abstraction (2016)   karpathy.medium.com/yes-y... · Posted by u/swatson741
t-vi · 2 months ago
It seems to me that in 2016 people did (have to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with the gradients in the middle of gradient propagation.

For example, Alex Graves's (great, and with attention!) 2013 paper "Generating Sequences with Recurrent Neural Networks" has this line:

One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.

with this footnote:

In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
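In today's PyTorch, that kind of mid-backprop meddling is a one-line tensor hook; a minimal sketch (the +/-10 range and the toy model are arbitrary illustrations, not Graves's setup):

  # Minimal sketch: clip the gradient of an intermediate tensor during
  # backprop via a hook, in the spirit of the Graves footnote.
  import torch

  x = torch.randn(4, 8)
  w = torch.randn(8, 8, requires_grad=True)

  pre_act = x @ w                      # "network input" to the nonlinearity
  pre_act.register_hook(lambda g: g.clamp(-10.0, 10.0))  # runs mid-backprop
  out = torch.tanh(pre_act).sum()
  out.backward()

  print(w.grad.abs().max())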

That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.

t-vi commented on Surprisingly fast AI-generated kernels we didn't mean to publish yet   crfm.stanford.edu/2025/05... · Posted by u/mfiguiere
t-vi · 7 months ago
Note that PyTorch's kernels are somewhat generic with respect to shape. It has always been relatively easy to get speedups by specializing for the shapes, e.g. Apache TVM had that (back before it was even "Apache").
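A minimal sketch of the shape-specialization idea with current tooling (torch.compile with dynamic=False specializes on the shapes it actually sees; the toy function and sizes are arbitrary):

  # Minimal sketch: torch.compile with dynamic=False generates code
  # specialized to the observed shapes, which is roughly the
  # "specialize the shape" trick.
  import torch

  def f(a, b):
      return torch.nn.functional.gelu(a @ b)

  f_specialized = torch.compile(f, dynamic=False)

  a = torch.randn(1024, 1024, device="cuda")
  b = torch.randn(1024, 1024, device="cuda")
  out = f_specialized(a, b)            # first call compiles for this shape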
t-vi commented on Decorator JITs: Python as a DSL   eli.thegreenplace.net/202... · Posted by u/ingve
t-vi · 10 months ago
If you like JIT wrappers and Python interpreters:

In Thunder[1], a PyTorch-to-Python JIT compiler for optimizing DL models, we maintain a bytecode interpreter covering Python 3.10-3.12 (and 3.13 soon) for our JIT. It allows us to run Python code while redirecting arbitrary function calls and operations, but it is quite a bit slower than CPython.

While the bytecode changes between versions (and sometimes it is a back-and-forth, for example in the call handling), it works out fine once you embrace that there will be differences between Python versions.

What has been a large change is the new zero-cost (in the happy path) exception handling, but I can totally see why Python changed to that from setting up try-block frames.

I will say that I was happy not to have to support Python <= 3.9, as the changes were a lot more involved there (the bytecode format itself, etc.).

Of course, working on this also means knowing otherwise useless Python trivia afterwards. One of my favorites is how this works:

  l = [1, 2, 3]
  l[-1] += l.pop()
  print(l)
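(Spoiler, for the curious: it prints [1, 6]. A minimal sketch of how to convince yourself, using dis on the snippet:)

  # Minimal sketch of why the snippet above prints [1, 6]: the augmented
  # assignment reads l[-1] (3) first, then evaluates l.pop() (also 3,
  # leaving [1, 2]), adds, and only then stores 6 back through the saved
  # index -1, which now points at the 2. dis shows the subscript load
  # happening before the call to pop.
  import dis

  def puzzle():
      l = [1, 2, 3]
      l[-1] += l.pop()
      return l

  dis.dis(puzzle)
  print(puzzle())  # [1, 6]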
1. https://github.com/Lightning-AI/lightning-thunder/

t-vi commented on Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant   github.com/nkasmanoff/pi-... · Posted by u/nkaz123
nkaz123 · 2 years ago
Yes! I'm currently using https://espeak.sourceforge.net/, though it isn't especially fun to listen to.

Additionally, since I'm streaming the LLM response, it won't take long to get your reply. Since it speaks a chunk at a time, there are occasionally only parts of words said momentarily. How long you have to wait also of course depends on which model you use and what the context size is.

t-vi · 2 years ago
When I did a similar thing (but with less LLM), I liked https://github.com/coqui-ai/TTS, but back then I needed to cut out the conversion step from tensor to a list of numbers to make it work really nicely.
t-vi commented on SIMD in Pure Python   da.vidbuchanan.co.uk/blog... · Posted by u/dmarto
eigenket · 2 years ago
> This is what a dot-product would look like in PeachPy

Is it just me (I'm far from an expert here) or is this code really weird? Why does ymm_one appear to contain the number zero? Why do we subtract what looks like it should be the inner product we want from ymm_one at the end?

t-vi · 2 years ago
The subtraction is there because it "is an example of constructing the 'Inner Product' distance", per the text above it. That ymm_one might not contain one could be because they only need the result up to a constant and so don't care, but it's probably not ideal to name the thing containing 0s ymm_one.
