Ask HN: Why hasn’t AMD made a viable CUDA alternative?

There is more than one way to answer this.

They have made an alternative to the CUDA language with HIP, which can do most of the things the CUDA language can.

You could say that they haven't released supporting libraries like cuDNN, but they are making progress on this with AiTer for example.

You could say that they have fragmented their efforts across too many different paradigms but I don't think this is it because Nvidia also support a lot of different programming models.

I think the reason is that they have not prioritised support for ROCm across all of their products. There are too many different architectures with varying levels of support. This isn't just historical. There is no ROCm support for their latest AI Max 395 APU. There is no nice cross architecture ISA like PTX. The drivers are buggy. It's just all a pain to use. And for that reason "the community" doesn't really want to use it, and so it's a second class citizen.

This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

thrtythreeforty · 5 months ago

This ticket, finally closed after being open for 2 years, is a pretty good microcosm of this problem:

https://github.com/ROCm/ROCm/issues/1714

Users complaining that the docs don't even specify which cards work.

But it goes deeper - a valid complaint is that "this only supports one or two consumer cards!" A common rebuttal is that it works fine on lots of AMD cards if you set some environment flag to force the GPU architecture selection. The fact that this is so close to working on a wide variety of hardware, and yet doesn't, is exactly the vibe you get with the whole ecosystem.
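For readers who haven't seen that rebuttal in practice: the flag usually meant is `HSA_OVERRIDE_GFX_VERSION`, which asks the ROCm runtime to treat the GPU as a different gfx target. A minimal sketch in Python, assuming that variable; the `rocm_env` helper name and the example value `10.3.0` (gfx1030) are illustrative choices, and nothing here guarantees a given card will actually work once overridden.

```python
import os


def rocm_env(gfx_override, base_env=None):
    """Return a copy of an environment with the ROCm architecture override set.

    gfx_override is a version string such as "10.3.0" (i.e. gfx1030).
    rocm_env is a hypothetical helper for this sketch, not part of ROCm.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = gfx_override
    return env


# Build an environment for a child process; the parent environment is untouched.
env = rocm_env("10.3.0", base_env={"PATH": "/usr/bin"})
print(env["HSA_OVERRIDE_GFX_VERSION"])
```

The resulting dict can be passed as `env=` to `subprocess.run` when launching whatever ROCm-based program is being tested, which keeps the override scoped to that one process.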

iforgotpassword · 5 months ago

What I don't get is why they don't at least assign a dev or two to make the poster child of this work: llama.cpp

It's the first thing anyone tries when trying to dabble in AI or compute on the gpu, yet it's a clusterfuck to get to work. A few blessed cards work, with proper drivers and kernel; others just crash, perform horribly slow, or output GGGGGGGGGGGGGG to every input (I'm not making this up!) Then you LOL, dump it and go buy nvidia et voila, stuff works first try.

mook · 5 months ago

I suspect part of it is also that Nvidia actually does a lot of things in firmware that can be upgraded. The new Nvidia Linux drivers (the "open" ones) support Turing cards from 2018. That means chips that old already do much of the processing in firmware.

AMD keeps having issues because their drivers talk to the hardware directly so their drivers are massive bloated messes, famous for pages of auto-generated register definitions. Likely it's much more difficult to fix anything.

citizenpaul · 5 months ago

I've thought about this myself and come to a conclusion that your link reinforces. As I understand it, most companies doing (EE) hardware design and production consider (CS) software to be a second class citizen at the company. It looks like AMD, after all this time competing with NVIDIA, has not learned the lesson. That said, I have never worked in hardware, so I'm going by what I've heard from other people.

NVIDIA, while far from perfect, has easily stayed ahead of AMD in software quality for over 20 years, while AMD keeps falling on its face and getting egg all over itself again and again as far as software goes.

My guess is NVIDIA internally has found a way to keep the software people from feeling like they are "less than" the people designing the hardware.

Sounds easy, but apparently it isn't. AKA management problems.

Covzire · 5 months ago

That reeks of gross incompetence somewhere in the organization. Like a hosting company that has a customer dealing with very poor performance and overpays greatly to avoid it, while the whole time nobody even thinks to check what the Linux swap file is doing.

CoastalCoder · 5 months ago

I had a similar (I think) experience when building LLVM from source a few years ago.

I kept running into some problem with LLVM's support for HIP code, even though I had no interest in having that functionality.

I realize this isn't exactly an AMD problem, but IIRC they were the ones who contributed the troublesome code to LLVM, and it remained unfixed.

Apologies if there's something unfair or uninformed in what I wrote, it's been a while.

tomrod · 5 months ago

Geez. If I were Berkshire Hathaway looking to invest in the GPU market, this would be a major red flag in my fundamentals analysis.

trod1234 · 5 months ago

It is a little bit more complicated than ROCm simply not having support, because ROCm has at some point claimed support and they've had to walk it back painfully (multiple times). It's not a driver issue, nor a hardware issue on their side.

There has been a long-standing issue between AMD and its mainboard manufacturers. The issue has to do with features required for ROCm, namely PCIe Atomics. AMD has been unable or unwilling to hold the mainboard manufacturers to account for advertising features the mainboard does not support.

The CPU itself must support this feature, but the mainboard must as well (in firmware).

One of the reasons why ROCm hasn't worked in the past is because the mainboard manufacturers have claimed and advertised support for PCIe Atomics, and the support they've claimed has been shown to be false, and the software fails in non-deterministic ways when tested. This is nightmare fuel for the few AMD engineers tasked with ROCm.

PCIe Atomics requires non-translated direct IO to operate correctly, and in order to support the same CPU models from multiple generations they've translated these IO lines in firmware.

This has left most people who query their system seeing PCIe Atomics reported as supported, while actual tests that rely on that support fail, in chaotic ways. There is no technical specification or advertising from the mainboard manufacturers showing whether this is supported. Even boards with multiple x16 slots and related branding such as Crossfire/SLI/mGPU don't necessarily show whether PCIe Atomics is properly supported.

In other words, the CPU is supported, the firmware/mainboard fail with no way to differentiate between the two at the upper layers of abstraction.

All in all, you shouldn't be blaming AMD for this. You should be blaming the three mainboard manufacturers who chose to do this. Some of these manufacturers have upper end boards where they actually did do this right; they just chose not to do this for any current gen mainboard costing less than ~$300-500.

fancyfredbot · 5 months ago

Look, this sounds like a frustrating nightmare, but the way it seems to us consumers is that AMD chose to rely on poorly implemented and supported technology, and Nvidia didn't. I can't blame AMD for the poor support by motherboard manufacturers, but I can and will blame AMD for relying on it.

wongarsu · 5 months ago

There are so many hardware certification programs out there, why doesn't AMD run one to fix this?

Create a "ROCm compatible" logo and a list of criteria. Motherboard manufacturers can send a pre-production sample to AMD along with a check for some token amount (let's say $1000). AMD runs a comprehensive test suite to check actual compatibility, if it passes the mainboard is allowed to be advertised and sold with the previously mentioned logo. Then just tell consumers to look for that logo if they want to use ROCm. If things go wrong on a mainboard without the certification, communicate that it's probably the mainboard's fault.

Maybe add some kind of versioning scheme to allow updating requirements in the future

spacebanana7 · 5 months ago

How does NVIDIA manage this issue? I wonder whether they have a very different supply chain or just design software that puts less trust in the reliability of those advertised features.

zozbot234 · 5 months ago

AIUI, AMD documentation claims that the requirement for PCIe Atomics is due to ROCm being based on Heterogeneous System Architecture, https://en.wikipedia.org/wiki/Heterogeneous_System_Architect... which allows for a sort of "unified memory" (strictly speaking, a unified address space) across CPU and GPU RAM. Other compute APIs such as CUDA, OpenCL, SYCL or Vulkan Compute don't have HSA as a strict requirement, but ROCm apparently does.

pjc50 · 5 months ago

So .. how's Nvidia dealing with this? Or do they benefit from motherboard manufacturers doing preferential integration testing?

sigmoid10 · 5 months ago

>This is a management and leadership problem.

It's easy (and mostly correct) to blame management for this, but it's such a foundational issue that even if everyone up to the CEO pivoted on every topic, it wouldn't change anything. They simply don't have the engineering talent to pull this off, because they somehow concluded that making stuff open source means someone else will magically do the work for you. Nvidia on the other hand has accrued top talent for more than a decade and carefully developed their ecosystem to reach this point. And there are only so many talented engineers on the planet. So even if AMD leadership wakes up tomorrow, they won't go anywhere for a looong time.

raxxorraxor · 5 months ago

Even top tier engineers can be found eventually. The problem is if you never even start.

Of course the specific disciplines need quite an investment into the knowledge of their workers, but it isn't anything insurmountable.

jlundberg · 5 months ago

I wonder what would happen if they hired John Carmack to lead this effort.

He would probably be able to attract some really good hardware and driver talent.

pjc50 · 5 months ago

> This is a management and leadership problem. They need to make using their hardware easy. They need to support all of their hardware. They need to fix their driver bugs.

Yes. This kind of thing is unfortunately endemic in hardware companies, which don't "get" software. It's cultural and requires (a) a leader who does Get It and (b) one of those Amazon memos stating "anyone who does not Get With The Program will be fired".

flutetornado · 5 months ago

I was able to compile ollama for AMD Radeon 780M GPUs and I use it regularly on my AMD mini-PC, which cost me $500. It does require a bit more work. I get pretty decent performance with LLMs - just making a qualitative statement as I didn't do any formal testing, but I got comparable performance vibes to an NVIDIA 4050 GPU laptop I use as well.

https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M...

vkazanov · 5 months ago

Same here on lenovo thinkpad 14s with AMD Ryzen™ AI 7 PRO 360 that has a Radeon 880M iGPU. Works OK on ubuntu.

Not saying it works everywhere, but it wasn't even that hard to set up, comparable to cuda.

Hate the name though.

bn-l · 5 months ago

> There is no ROCm support for their latest AI Max 395 APU

Fire the ceo

I want to argue that graphics cards are really 3 markets: integrated, gaming (dedicated), and compute. Not only do these have different hardware (fixed function, ray tracing cores, etc.) but also different programming and (importantly) distribution models. NVIDIA went from 2 to 3. Intel went from 1 to 2, and bought 3 (trying to merge). AMD started with 2 and went to 1 (around Llano) and attempted the same thing as NVIDIA via GCN (please correct me if I'm wrong).

My understanding is that the reason is that the real market for 3 (GPUs for compute) didn't show up until very late, so AMD's GCN bet didn't pay off. Even in 2021, NVIDIA's revenue from gaming was above data center revenue (a segment they basically had no competition in, and 100% of their revenue was from CUDA). AMD meanwhile won the battle for Playstation and Xbox consoles, and was executing a turnaround in data centers with EPYC and CPUs (with Zen). So my guess as to why they might have underinvested is basically: for much of the 2010s they were just trying to survive, so they focused on battles they could win that would bring them revenue.

This high level prioritization would explain a lot of "misexecution", e.g. if they underhired for ROCm, or prioritized APU SDK experience over data center, their testing philosophy ("does this game work ok? great").

brudgers · 5 months ago

The market segmentation you describe makes a lot of sense to me. But I don't think the situation is a matter of under-investment and is instead just fundamental market economics.

Nvidia can afford to develop a comprehensive software platform for the compute market segment because it has a comprehensive share of that segment. AMD cannot afford it because it does not have the market share.

Or to put it another way, I assume that AMD's efforts are motivated by rational economic behavior, and it has not been economically rational to compete heavily with Nvidia in the compute segment.

AMD was able to buy ATI because ATI could not compete with Nvidia. So AMD's graphics business started out trailing Nvidia. AMD has had a viable graphics strategy without trying to beat Nvidia...which makes sense since the traditional opponent is Intel and the ATI purchase has allowed AMD to compete with them pretty well.

Finally, most of the call for AMD to develop a CUDA alternative is based on a desire for cheaper compute. That's not a good business venture to invest in against a dominant player because price sensitive customers are poor customers.

quickthrowman · 5 months ago

> Finally, most of the call for AMD to develop a CUDA alternative is based on a desire for cheaper compute. That's not a good business venture to invest in against a dominant player because price sensitive customers are poor customers.

Nvidia’s gross margins are 80% on compute GPUs, that is excessive and likely higher than what cocaine and heroin dealers have for gross margins. Real competition would be a good thing for everyone except Nvidia.

joe_the_user · 5 months ago

This is such a key point. Everyone wants cheaper and cheaper compute - I want cheaper and cheaper compute. But no large-ish company wants to simply facilitate cheapness - they would want a significant return on their investment, and just making a commodity is generally not what they want. Back in the days of the PC clone, the clone makers were relatively tiny and so didn't have to worry about just serving the commodity market.

singhrac · 5 months ago

danielmarkbruce · 5 months ago

They likely haven't put even close to enough money behind it. This isn't a unique situation - you'll see in corporate america a lot of CEOs who say "we are investing in X" and they really believe they are. But the required size is billions (like, hundreds of really insanely talented engineers being paid 500k-1m, led by a few being paid $3-10m), and they are instead investing low 10's of millions.

They can't bring themselves to put so much money into it that it would be an obvious fail if it didn't work.

Given how the big tech companies are buying hundreds of thousands of GPUs at huge prices, most of which is pure margin, I wonder whether it'd make sense for Microsoft to donate a couple billion to make the market competitive.

https://www.datacenterdynamics.com/en/news/microsoft-bought-...

The big players are all investing in building chips themselves.

And probably not putting enough money behind it... it takes enormous courage as a CEO to walk into a boardroom and say "I'm going to spend $50 billion, I think it will probably work, I'm... 60% certain".

DanielHB · 5 months ago

It amazes me how little of what these companies make actually gets spent on R&D. You see the funnel charts on reddit and I am like, what the hell. Microsoft only spends ~6bn USD on R&D with a total of 48bn in revenue and 15bn in profits?

What the hell is going on, they should be able to keep an army of PhDs doing pointless research even if only one paper in 10 years comes to a profitable product. But instead they are cutting down workforce like there is no tomorrow...

(I know, I know, market dynamics, value extraction, stock market returns)

disgruntledphd2 · 5 months ago

R and D in the financial statements I've seen basically covers the entire product, engineering etc org. Lots and lots of people, but not what regular people consider RnD.

laweijfmvo · 5 months ago

well, look at Meta... they're spending Billions with a capital B on stuff and they get slaughtered every earnings call because it hasn't paid off yet. if Zuckerberg wasn't the majority shareholder it probably wouldn't be sustainable.

spmurrayzzz · 5 months ago

CUDA isn't the moat people think it is. NVIDIA absolutely has the best dev ergonomics for machine learning, there's no question about that. Their driver is also far more stable than AMD's. But AMD is also improving, they've made some significant strides over the last 12-18 months.

But I think more importantly, what is often missed in this analysis is that most programmers doing ML work aren't writing their own custom kernels. They're just using pytorch (or maybe something even more abstracted/multi-backend like keras 3.x) and let the library deal with implementation details related to their GPU.

That doesn't mean there aren't footguns in that particular land of abstraction, but the delta between the two providers is not nearly as stark as it's often portrayed. At least not for the average programmer working with ML tooling.

(EDIT: also worth noting that the work being done in the MLIR project has a role to play in closing the gap as well for similar reasons)
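As a rough illustration of the point about PyTorch hiding the vendor: ROCm builds of PyTorch reuse the `torch.cuda` API, so device-selection code looks the same on either backend. A minimal sketch, assuming PyTorch may or may not be installed; the `pick_device` helper and its labels are hypothetical, made up for this example.

```python
def pick_device():
    """Return (device_name, backend_label) without hard-coding the GPU vendor.

    pick_device is a hypothetical helper for this sketch.
    """
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: fall back to CPU when PyTorch is absent.
        return "cpu", "no torch"
    if torch.cuda.is_available():
        # On ROCm builds torch.version.hip is a version string;
        # on CUDA builds it is None.
        backend = "rocm" if getattr(torch.version, "hip", None) else "cuda"
        return "cuda", backend
    return "cpu", "cpu only"


device, backend = pick_device()
print(device, backend)
```

Model code would then use the returned device name and never mention AMD or NVIDIA directly, which is why the gap looks smaller at this level of abstraction than it does for people writing custom kernels.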

martinpw · 5 months ago

> But I think more importantly, what is often missed in this analysis is that most programmers doing ML work aren't writing their own custom kernels. They're just using pytorch (or maybe something even more abstracted/multi-backend like keras 3.x) and let the library deal with implementation details related to their GPU.

That would imply that AMD could just focus on implementing good PyTorch support on their hardware and they would be able to start taking market share. Which doesn't sound like much work compared with writing a full CUDA competitor. But that does not seem to be the strategy, which implies it is not so simple?

I am not an ML engineer so don't have first hand experience, but those I have talked to say they depend on a lot more than just one or two key libraries. But my sample size is small. Interested in other perspectives...

> But that does not seem to be the strategy, which implies it is not so simple?

That is exactly what has been happening [1], and not just in pytorch. Geohot has been very dedicated in working with AMD to upgrade their station in this space [2]. If you hang out in the tinygrad discord, you can see this happening in real time.

> those I have talked to say they depend on a lot more than just one or two key libraries.

There's a ton of libraries out there, yes, but if we're talking about python and the libraries in question are talking to GPUs, it's going to be exceedingly rare that they're not using one of these under the hood: pytorch, tensorflow, jax, keras, et al.

There are of course exceptions to this, particularly if you're not using python for your ML work (which is actually common for many companies running inference at scale and want better runtime performance, training is a different story). But ultimately the core ecosystem does work just fine with AMD GPUs, provided you're not doing any exotic custom kernel work.

(EDIT: just realized my initial comment unintentionally borrowed the "moat" commentary from geohot's blog. A happy accident in this case, but still very much rings true for my day to day ML dev experience)

[1] https://github.com/pytorch/pytorch/pulls?q=is%3Aopen+is%3Apr...

[2] https://geohot.github.io//blog/jekyll/update/2025/03/08/AMD-...

Vvector · 5 months ago

Back in 2015, they were a quarter or two from bankruptcy, saved by the XBOX and Playstation contracts. Those years saw several significant layoffs, and talent leaving for greener pastures. Lisa Su has done a great job at rebuilding the company. But AMD is not in a position to hire 2000 engineers x a few million in comp (~$4 billion annually), even if there were people readily available.

"it'd still be a good investment." - that's definitely not a sure thing. Su isn't a risk taker, seems to prefer incremental growth, mainly focused on the CPU side.

Where does the idea that engineers cost "a few million" come from? You might pay that much to senior engineering management, big names who can attract other talent, but normal engineers cost much less than a million dollars a year.

OP said "Even if they needed to hire a few thousand engineers at a few million in comp each". That's where the number came from.

Nvidia seems to pay the bulk of their engineers 200k-400k. If the fully loaded cost is 2.2, then it's closer to 440k-880k per engineer. Probably 500k would be a good number to use
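To make the numbers in this sub-thread easy to compare, here is the arithmetic as a small Python sketch. The 2.2 multiplier and the 200k-400k range come from the comment above; reading "a few million" as $2M per engineer is an assumption chosen only because it reproduces the ~$4 billion figure, and the function names are made up for illustration.

```python
def fully_loaded(salary, multiplier=2.2):
    """Salary times an overhead multiplier, rounded to whole dollars."""
    return round(salary * multiplier)


def annual_cost(headcount, cost_per_engineer):
    """Total yearly cost for a team of the given size."""
    return headcount * cost_per_engineer


# Fully loaded cost for the quoted 200k-400k pay range.
low, high = fully_loaded(200_000), fully_loaded(400_000)

# 2000 engineers at an assumed $2M each versus the suggested 500k each.
at_2m = annual_cost(2_000, 2_000_000)
at_500k = annual_cost(2_000, 500_000)

print(low, high, at_2m, at_500k)
```

Under these assumptions the same headcount costs about $1 billion a year at 500k per engineer, a quarter of the ~$4 billion figure.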

red-iron-pine · 5 months ago

they're not hiring 4 engineers, they're hiring a team.

and this isn't just developers, R&D and design are iterative and will require proofing, QA, prototyping -- and that means bodies who can do all of that.

Zardoz89 · 5 months ago

They literally closed a deal hiring 1,100+ ZT Systems engineers yesterday.

Those are mostly hardware engineers, not software engineers, right?

ninetyninenine · 5 months ago

This is the difference between Jensen and Su. It’s not that Jensen is a risk taker. No. Jensen focused on incremental growth of the core business while slowly positioning the company for growth in other verticals as well should the landscape change.

Jensen never said… hey I’m going to bet it all on AI and cuda. Let’s go all in. This never happened. Both Jensen and Su are not huge risk takers imo.

Additionally there’s a lot of luck involved with the success of NVIDIA.

kbolino · 5 months ago

I think this broaches the real matter, which is that nVidia's core business is GPUs while AMD's core business is CPUs. And, frankly, AMD has lately been doing a great job at its core business. The problem is that GPUs are now much more profitable than CPUs, both in terms of unit economics and growth potential. So they are winning a major battle (against Intel) even as they are losing a different one (against nVidia). I'm not sure there's a strategy they could have adopted to win both at the same time.

However, the next big looming problem for them is likely to be the shrinking market for x86 vs. the growing market for Arm etc. So they might very well have demonstrated great core competence, that ends up being completely swept away by not just one but two major industry shifts.

dlewis1788 · 5 months ago

CUDA is an entire ecosystem - not a single programming language extension (C++) or a single library, but a collection of libraries & tools for specific use cases and optimizations (cuDNN, CUTLASS, cuBLAS, NCCL, etc.). There is also tooling support that Nvidia provides, such as profilers, etc. Many of the libraries build on other libraries. Even if AMD had decent, reliable language extensions for general-purpose GPU programming, they still don't have the libraries and the supporting ecosystem to provide anything at the level that CUDA provides today, which took Nvidia a decade plus of development effort to build.

guywithahat · 5 months ago

The counter point is they could make a higher level version of CUDA which wouldn't necessitate all the other supporting libraries. The draw of cuBLAS is that CUDA is a confusing pain. It seems reasonable to think they could write a better, higher level language (in the same vein as triton) and not have to write as many support libraries

100% valid - Nvidia is trying to address that now with cuTile and the new Python front-end for CUTLASS.

Cieric · 5 months ago

I can't contribute much to this discussion due to bias and NDAs, but I just wanted to mention, technically HIP is our CUDA competitor. ROCm is the foundation that HIP is being built on.

johnnyjeans · 5 months ago

I wonder what the purpose is behind creating a whole new API? Why not just focus on getting Vulkan compute on AMD GPUs to have the data throughput of CUDA?

Const-me · 5 months ago

I don’t know the answer to your question, but I recalled something relevant. Some time ago, Microsoft had a tech which compiled almost normal looking C++ into Direct3D 11 compute shaders: https://learn.microsoft.com/en-us/cpp/parallel/amp/cpp-amp-o... The compute kernels are integrated into CPU-running C++ in a similar fashion to CUDA.

As you see, the technology was deprecated in Visual Studio 2022. I don’t know why, but I would guess people just didn’t care. Maybe because it only ran on Windows.

stuaxo · 5 months ago

OT: The thing where I have to choose between ROCm or AmdGPU drivers is annoying.

Mostly stick to AmdGPU as it seems to work for other stuff, I'd like to be able to run the HIP stuff on there without having to change drivers.

fransje26 · 5 months ago

So if someone would like to, say, port a CUDA codebase to AMD, you would use HIP for a more or less 1-on-1 translation?

Any card you would recommend, when trying to replace the equivalent of a 3090/4090?

markstock · 5 months ago

I can't recommend cards, but you are absolutely correct about porting CUDA to HIP: there was (is?) a hipify program in rocm that does most of the work.
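To give a feel for why the port is close to mechanical: much of the CUDA runtime API has a HIP counterpart with the `cuda` prefix swapped for `hip`. The toy Python sketch below renames a small, hand-picked set of identifiers. It is not the real tool (the actual hipify-perl and hipify-clang cover far more and handle cases this ignores), and `toy_hipify` is a name invented for this example.

```python
import re

# Small, hand-picked subset of CUDA runtime names and their HIP equivalents.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first so cudaMemcpyHostToDevice is not cut short by cudaMemcpy.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def toy_hipify(source):
    """Rename the known CUDA identifiers in a source string to HIP ones."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(1)], source)


cuda_src = (
    "#include <cuda_runtime.h>\n"
    "cudaMalloc(&d_x, n); "
    "cudaMemcpy(d_x, x, n, cudaMemcpyHostToDevice); "
    "cudaFree(d_x);"
)
hip_src = toy_hipify(cuda_src)
print(hip_src)
```

Anything outside the table is left untouched, which mirrors the "does most of the work" caveat: the parts with no direct equivalent still need a human.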

dagmx · 5 months ago

AMD have actually made several attempts at it.

The first time, they went ahead and killed off their effort to consolidate on OpenCL. OpenCL went terribly (in no small part because NVIDIA held out on OpenCL 2 support) and that set AMD back a long ways.

Beyond that, AMD does not have a strong software division or one with the teeth to really influence hardware to their needs. They have great engineers but leadership doesn’t know how to get them to where they need to be.

WithinReason · 5 months ago

This is it, it's an organisational skill issue. To be fair, being a HW company and a SW company at the same time is very difficult.

It is but you have to be.

It’s been key to the success of their peers. NVIDIA and Apple are the best examples but even Intel to a smaller degree.