[1] Dataflow architecture:
https://en.wikipedia.org/wiki/Dataflow_architecture
[2] The GPU is not always faster:
A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.
It seems they also support only a very short sequence length (1k tokens).
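For context, the lossless option usually means speculative decoding: a cheap draft model proposes tokens and the target model verifies them with a rejection step that preserves its output distribution exactly. A toy sketch of that acceptance rule (hypothetical token probabilities, not their implementation):

```python
import random

def accept_draft_token(token, p_draft, p_target, rng=random.random):
    """Keep the draft token with probability min(1, p_target/p_draft).
    Rejected tokens get resampled from a corrected distribution, which is
    what makes the overall sampling lossless for the target model."""
    ratio = p_target.get(token, 0.0) / p_draft[token]
    return rng() < min(1.0, ratio)

p_draft  = {"the": 0.6, "a": 0.4}   # draft model's next-token distribution
p_target = {"the": 0.9, "a": 0.1}   # target model's distribution

# "the" is under-proposed by the draft, so it is always accepted;
# "a" is over-proposed, so it is only kept 25% of the time.
print(accept_draft_token("the", p_draft, p_target, rng=lambda: 0.999))  # True
print(accept_draft_token("a", p_draft, p_target, rng=lambda: 0.5))      # False
```

The accuracy-for-speed tradeoff the parent mentions would be relaxing that acceptance test (e.g., always keeping draft tokens above some confidence threshold).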
For example, when they redesigned combat around the 1-Unit-Per-Tile (1UPT) mechanic for Civ 5, it crippled the AI's ability to wage war. Even if a high-difficulty AI could out-produce the player militarily, 1UPT left it logistics-limited in getting those units to the front. That means the AI can't threaten a player militarily, and thus loses its main lever for being "difficult."
Contrast this to Civ 4, where high-difficulty AIs were capable of completely overwhelming a player that didn't take them seriously. You couldn't just sit there and tech-up and use a small number of advanced units to fend off an invasion from a much larger and more aggressive neighbor. This was especially the case if you played against advanced fan-created AIs.
I'm hoping they get rid of 1UPT completely for Civ 7, but I have a feeling that it is unlikely because casual players (the majority purchaser for Civ) actually like that 1UPT effectively removes tactical combat from the game.
We're a small startup building AI infra for fine-tuning and serving LLMs on non-NVIDIA hardware (TPUs, AMD, Trainium).
Problem: Many companies are trying to get PyTorch working on AMD GPUs, but we believe this is a treacherous path. PyTorch is deeply intertwined with the NVIDIA ecosystem in a lot of ways (e.g., `torch.cuda`, or `scaled_dot_product_attention`, which dispatches to NVIDIA CUDA kernels under the hood). So, to get PyTorch code running on non-NVIDIA hardware, there's a lot of "de-NVIDIAfying" that needs to be done.
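As a small illustration of that coupling (a toy example, not from our repo): PyTorch code that hard-codes CUDA breaks anywhere `torch.cuda` isn't backed by NVIDIA hardware, while the device-agnostic idiom at least degrades gracefully.

```python
import torch

# Hard-coding CUDA, as a lot of PyTorch code in the wild does, fails on
# backends without it (e.g., a CPU-only box or a TPU VM):
#   x = torch.ones(2, 2, device="cuda")  # RuntimeError off NVIDIA/ROCm
#
# The device-agnostic idiom picks whatever accelerator is present:
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(2, 2, device=device)
print(x.sum().item())  # 4.0 on any backend
```

Auditing a large model codebase for every such hard-coded assumption is the "de-NVIDIAfying" work described above.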
Solution: We believe JAX is a better fit for non-NVIDIA hardware. In JAX, ML model code compiles to hardware-independent HLO graphs, which are then optimized by the XLA compiler before hardware-specific optimization. This clean separation allowed us to run the same LLaMA3 JAX code both on Google TPUs and AMD GPUs with no changes.
Our strategy as a company is to invest upfront in porting models to JAX, then leverage its framework and XLA kernels to extract maximum performance from non-NVIDIA backends. This is why we first ported Llama 3.1 from PyTorch to JAX; the same JAX model now runs unchanged on both TPUs and AMD GPUs.
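A minimal sketch of the separation described above (toy function, not our LLaMA port): `jax.jit` lowers the same Python code to a backend-neutral HLO/StableHLO module before any hardware-specific compilation happens.

```python
import jax
import jax.numpy as jnp

# Toy scaled dot-product score; the point is the lowering, not the model.
def attention_score(q, k):
    return jnp.dot(q, k.T) / jnp.sqrt(float(q.shape[-1]))

q = jnp.ones((4, 8))
k = jnp.ones((4, 8))

# .lower() stops before backend compilation: the result is the same
# hardware-independent IR whether it is later compiled for TPU, AMD, or CPU.
lowered = jax.jit(attention_score).lower(q, k)
print(lowered.as_text()[:120])  # textual HLO/StableHLO dump
```

The backend-specific step only happens when this lowered module is compiled, which is where XLA's per-hardware optimizations come in.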
We'd love to hear your thoughts on our vision and repo!
IMHO, the main reason to use PyTorch is actually that the original model used PyTorch. What can seem to be identical logic between different model versions may actually cause model drift when infinitesimal floating-point errors accumulate due to the huge scale of the data. My experience is that debugging an accuracy mismatch like this in a big model is a torturous ordeal beyond the 10th circle of hell.
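A tiny pure-Python illustration of the kind of drift meant here (my own example, not the parent's): identical terms, different accumulation order, different answer.

```python
import math

terms = [1e16, 1.0, -1e16]

# Left to right: 1e16 + 1.0 rounds back to 1e16, so the 1.0 vanishes.
print(sum(terms))                   # 0.0

# Reordered so the large terms cancel first, the 1.0 survives.
print(sum([1e16, -1e16, 1.0]))      # 1.0

# math.fsum tracks the rounding error and recovers the exact result.
print(math.fsum(terms))             # 1.0
```

A framework port changes reduction orders, fused kernels, and default precisions everywhere at once, so billions of tiny discrepancies like this can compound into visible output differences.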
Assuming your quote is what the original text said (I don't disbelieve you, but nobody can see it to confirm), why would this have violated community standards? Is there some rule about not mentioning "un-persons" or something?
It's very confusing.
Edit: Answering my own question. There appears to be a kerfuffle afoot. Apparently the Steering Council has suspended a core developer for 3 months[0] but isn't naming the suspended developer or citing specific reasons why (per [1]), sparking a call for a vote of no confidence in the council, which did not succeed.
Apparently even mentioning the suspended person (without naming them) is enough for even Guido van Rossum to be censored. Wow.
Edit 2: The suspended developer is Tim Peters[3].
Edit 3: Altered paragraph "Edit:" from "...or the reason why[1] (" to "...or citing specific reasons why (per [1]".
Edit 4: Added "which did not succeed" after "...vote of no confidence in the council".
[0] https://discuss.python.org/t/three-month-suspension-for-a-co...
[1] https://discuss.python.org/t/calling-for-a-vote-of-no-confid...
[3] https://chrismcdonough.substack.com/p/the-shameful-defenestr...
Originally discussed here: https://news.ycombinator.com/item?id=41234180
The project seems very naive. CUDA programming sucks because there are a lot of little gotchas and nuances that dramatically change performance. These optimizations can also change significantly between GPU architectures: you'll get different performance out of Volta, Ampere, or Blackwell. Parallel programming is hard in the first place, and it gets harder on GPUs because of all these little intricacies. People who have been doing CUDA programming for years are still learning new techniques. It takes a very different type of programming skill. Like actually understanding that Knuth's "premature optimization is the root of all evil" means "get a profiler", not "don't optimize". All this is what makes writing good kernels take so long. And that's even after Nvidia engineers have spent tons of time trying to simplify it.
So I'm not surprised people are getting 2x or 4x out of the box. I'd expect that much if a person just grabbed a profiler. I'd honestly expect more if they spent a week or two with the documentation and made a serious effort. But nothing on the landing page convinces me the LLM can actually help significantly. Maybe I'm wrong! But it is unclear if the lead dev has significant CUDA experience. And I don't want something that optimizes a kernel for an A100, I want kernelS that are optimized for multiple architectures. That's the hard part, and all those little nuances are exactly what LLM coding tends to be really bad at.