germanjoey commented on We Made CUDA Optimization Suck Less   rightnowai.co/... · Posted by u/jaberjaber23
godelski · 10 months ago
I was expecting something like TensorRT or Triton, but found "Vibe Coding"

The project seems very naive. CUDA programming sucks because there are a lot of little gotchas and nuances that dramatically change performance. These optimizations can also change significantly between GPU architectures: you'll get different performance out of Volta, Ampere, or Blackwell. Parallel programming is hard in the first place, and it gets harder on GPUs because of all these little intricacies. People who have been doing CUDA programming for years are still learning new techniques. It takes a very different type of programming skill. Like actually understanding that Knuth's "premature optimization is the root of all evil" means "get a profiler," not "don't optimize." All this is what makes writing good kernels take so long. That's even after Nvidia engineers have spent tons of time trying to simplify it.

So I'm not surprised people are getting 2x or 4x out of the box. I'd expect that much if a person grabbed a profiler. I'd honestly expect more if they spent a week or two with the documentation and serious effort. But nothing in the landing page is convincing me the LLM can actually significantly help. Maybe I'm wrong! But it is unclear if the lead dev has significant CUDA experience. And I don't want something that optimizes a kernel for an A100, I want kernelS that are optimized for multiple architectures. That's the hard part and all those little nuances are exactly what LLM coding tends to be really bad at.
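The "get a profiler" reading of Knuth generalizes beyond CUDA. Here's a minimal Python sketch of the workflow (function names invented for illustration; for CUDA kernels the analogous tools are Nsight Systems/Compute): measure first, then optimize what the profile actually points at.

```python
import cProfile
import io
import pstats

def slow_hotspot(n):
    # Deliberately quadratic: the kind of thing a profiler surfaces immediately.
    total = 0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def cheap_setup(n):
    # Looks suspicious, costs almost nothing.
    return list(range(n))

def pipeline(n):
    data = cheap_setup(n)
    return slow_hotspot(len(data))

profiler = cProfile.Profile()
profiler.enable()
result = pipeline(300)
profiler.disable()

# Dump the top entries sorted by cumulative time; the profile, not
# intuition, tells us where the time actually goes.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("slow_hotspot" in report)
```

The same discipline applies to kernels: profile, find the real bottleneck (occupancy, memory coalescing, bank conflicts, ...), fix that, re-profile.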

germanjoey · 10 months ago
TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! It depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best-case scenario, whereas the best case for hand-optimized code can be in the tens or even hundreds of times.
germanjoey commented on The Missing Nvidia GPU Glossary   modal.com/gpu-glossary/re... · Posted by u/birdculture
germanjoey · a year ago
This is really incredible, thank you!
germanjoey commented on Trillium TPU Is GA   cloud.google.com/blog/pro... · Posted by u/gok
teleforce · a year ago
It's beyond me why processors with dataflow architectures are not being used for ML/AI workloads, not even in a minority of cases [1]. A native dataflow processor will hands-down beat a Von Neumann-based architecture in terms of performance and efficiency for ML/AI workloads, and the GPU would be left to graphics processing instead of being the default co-processor or accelerator for ML/AI [2].

[1] Dataflow architecture:

https://en.wikipedia.org/wiki/Dataflow_architecture

[2] The GPU is not always faster:

https://news.ycombinator.com/item?id=42388009

germanjoey · a year ago
Sambanova's RDU is a dataflow processor being used for ML/AI workloads! It's amazing and actually works.
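For a toy picture of what the dataflow model in [1] means: an operation fires as soon as all of its operands have arrived, with no program counter dictating order. A minimal interpreter sketch (the graph/node structure here is invented purely for illustration):

```python
from collections import defaultdict, deque

# A tiny dataflow interpreter: each node fires when all of its inputs
# are available, in whatever order readiness occurs, not program order.
class DataflowGraph:
    def __init__(self):
        self.ops = {}                     # node -> (function, input nodes)
        self.consumers = defaultdict(list)

    def add(self, name, fn, inputs=()):
        self.ops[name] = (fn, list(inputs))
        for src in inputs:
            self.consumers[src].append(name)

    def run(self, sources):
        values = dict(sources)            # tokens that have already arrived
        ready = deque(values)
        while ready:
            node = ready.popleft()
            for consumer in self.consumers[node]:
                fn, inputs = self.ops[consumer]
                if consumer not in values and all(i in values for i in inputs):
                    values[consumer] = fn(*(values[i] for i in inputs))
                    ready.append(consumer)
        return values

g = DataflowGraph()
g.add("mul", lambda a, b: a * b, inputs=("x", "y"))
g.add("add", lambda m, c: m + c, inputs=("mul", "z"))
out = g.run({"x": 3, "y": 4, "z": 5})   # a fused multiply-add, dataflow style
print(out["add"])
```

Real dataflow silicon does this in hardware across thousands of units, which is why it maps so naturally onto ML compute graphs.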
germanjoey commented on Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference   cerebras.ai/blog/llama-40... · Posted by u/benchmarkist
germanjoey · a year ago
Pretty amazing speed, especially considering this is bf16. But how many racks is this using? They used 4 racks for 70B, so this is, what, at least 24? A whole data center for one model?!
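The back-of-the-envelope math, assuming rack count scales roughly linearly with parameter count (an assumption; Cerebras hasn't published the actual partitioning):

```python
import math

# bf16 is 2 bytes per parameter, so the weights alone are ~810 GB
# (before any KV cache or activations).
params_70b, racks_70b = 70e9, 4
params_405b = 405e9

bf16_bytes = 2
weights_gb = params_405b * bf16_bytes / 1e9

# Linear scaling from the published 4-racks-for-70B figure.
racks_405b = math.ceil(racks_70b * params_405b / params_70b)

print(int(weights_gb), racks_405b)  # → 810 24
```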
germanjoey commented on Cerebras Trains Llama Models to Leap over GPUs   nextplatform.com/2024/10/... · Posted by u/rbanffy
mentalically · a year ago
The value proposition of Cerebras is that they can compile existing graphs to their hardware and allow inference at lower costs and higher efficiencies. The title does not say anything about creating or optimizing new architectures from scratch.
germanjoey · a year ago
the title says "Cerebras Trains Llama Models"...
germanjoey commented on Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s   cerebras.ai/blog/cerebras... · Posted by u/campers
odo1242 · a year ago
What made it so much faster based on just a software update?
germanjoey · a year ago
They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

A big question is what they're using as their draft model; there's ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

It seems they also support only a very short sequence length (1k tokens).

germanjoey commented on Civilization VII recommends 16 cores and 32GB RAM for 4K gameplay   tomshardware.com/video-ga... · Posted by u/doener
init2null · a year ago
I always find it interesting that Civilization (especially 5+) is basically a board game with added fog of war. These specs seem a little extreme given that fact. That being said, anyone who's played older versions knows that the AI needs every cycle it can get. I'd love to see smarter multithreaded strategy for the AI. Its combat skills border on embarrassing.
germanjoey · a year ago
Simply increasing processing power for the AI isn't enough. Gameplay mechanics are intimately related to the capabilities of the AI.

For example, when they redesigned combat around the 1-Unit-Per-Tile (1UPT) mechanic for Civ 5, this crippled the ability of the AI to wage war. That's because even if a high-difficulty AI could out-produce the player in terms of military, it was logistics-limited in its ability to get those units to the front because of 1UPT. That means the AI can't threaten a player militarily, and thus loses its main lever in terms of its ability to be "difficult."

Contrast this to Civ 4, where high-difficulty AIs were capable of completely overwhelming a player that didn't take them seriously. You couldn't just sit there and tech-up and use a small number of advanced units to fend off an invasion from a much larger and more aggressive neighbor. This was especially the case if you played against advanced fan-created AIs.

I'm hoping they get rid of 1UPT completely for Civ 7, but I have a feeling that it is unlikely because casual players (the majority purchaser for Civ) actually like that 1UPT effectively removes tactical combat from the game.

germanjoey commented on We fine-tuned Llama 405B on AMD GPUs   publish.obsidian.md/felaf... · Posted by u/felarof
felarof · a year ago
Hey HN, we recently fine-tuned the llama3.1 405B model on 8xAMD MI300x GPUs using JAX instead of PyTorch. JAX's advanced sharding APIs allowed us to achieve great performance. Check out our blog post to learn about the cool sharding tricks we used. We've also open-sourced the code: https://github.com/felafax/felafax

We're a small startup building AI infra for fine-tuning and serving LLMs on non-NVIDIA hardware (TPUs, AMD, Trainium).

Problem: Many companies are trying to get PyTorch working on AMD GPUs, but we believe this is a treacherous path. PyTorch is deeply intertwined with the NVIDIA ecosystem in a lot of ways (e.g., `torch.cuda` or scaled_dot_product_attention is an NVIDIA CUDA kernel exposed as a PyTorch function). So, to get PyTorch code running on non-NVIDIA hardware, there's a lot of "de-NVIDIAfying" that needs to be done.

Solution: We believe JAX is a better fit for non-NVIDIA hardware. In JAX, ML model code compiles to hardware-independent HLO graphs, which are then optimized by the XLA compiler before hardware-specific optimization. This clean separation allowed us to run the same LLaMA3 JAX code both on Google TPUs and AMD GPUs with no changes.

Our strategy as a company is to invest upfront in porting models to JAX, then leverage its framework and XLA kernels to extract maximum performance from non-NVIDIA backends. This is why we first ported Llama 3.1 from PyTorch to JAX, and now the same JAX model works great on TPUs and runs perfectly on AMD GPUs.

We'd love to hear your thoughts on our vision and repo!

germanjoey · a year ago
How are you verifying accuracy for your JAX port of Llama 3.1?

IMHO, the main reason to use PyTorch is actually that the original model used PyTorch. What seems to be identical logic between different model versions may actually cause model drift when infinitesimal floating-point errors accumulate due to the huge scale of the data. My experience is that debugging an accuracy mismatch like this in a big model is a torturous ordeal beyond the 10th circle of hell.
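The drift is easy to reproduce in miniature: mathematically identical sums, evaluated in a different order, round differently in floating point (pure-Python illustration; a 405B-parameter model repeats this effect billions of times per step, and a reordered kernel or fusion changes the evaluation order).

```python
import math

# The same three numbers, summed left-to-right vs right-to-left.
# Near 1e16 the spacing between adjacent doubles is 2.0, so 1e16 + 1.0
# rounds back to 1e16: the 1.0 is silently lost in one order but
# survives in the other.
values = [1.0, 1e16, -1e16]

forward = (values[0] + values[1]) + values[2]    # 1.0 absorbed, then cancel
backward = (values[2] + values[1]) + values[0]   # cancel first, 1.0 survives
exact = math.fsum(values)                        # correctly rounded reference

print(forward, backward, exact)  # → 0.0 1.0 1.0
```

Now imagine that discrepancy seeded at layer 1 and amplified through 126 more layers: the two "identical" implementations produce visibly different logits, and bisecting which op diverged first is the hellish part.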

germanjoey commented on A post by Guido van Rossum removed for violating Python community guidelines   discuss.python.org/t/shou... · Posted by u/oblvious-earth
EvanAnderson · 2 years ago
I'm an outsider who only knows Guido van Rossum by way of interviews and his writing.

Assuming your quote is what the original text said (I don't disbelieve you-- but nobody can see it to confirm) why would this have violated community standards? Is there some rule about not mentioning "un-persons" or something?

It's very confusing.

Edit: Answering my own question. There appears to be a kerfuffle afoot. Apparently the Steering Council has suspended a core developer for 3 months[0] but isn't naming the suspended developer or citing specific reasons why (per [1] and sparking a call for a vote of no confidence in the council which did not succeed).

Apparently even mentioning the suspended person (without naming them) is enough for even Guido van Rossum to be censored. Wow.

Edit 2: The suspended developer is Tim Peters[3].

Edit 3: Altered paragraph "Edit:" from "...or the reason why[1] (" to "...or citing specific reasons why (per [1]".

Edit 4: Added "which did not succeed" after "...vote of no confidence in the council".

[0] https://discuss.python.org/t/three-month-suspension-for-a-co...

[1] https://discuss.python.org/t/calling-for-a-vote-of-no-confid...

[3] https://chrismcdonough.substack.com/p/the-shameful-defenestr...

germanjoey · 2 years ago
Looks like some kind of power play...

Originally discussed here: https://news.ycombinator.com/item?id=41234180
