A lot of people think AMD should support these translation layers but I think it's a bad idea. CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult both technically and legally. For example I think it would be against the license agreement of cuDNN or cuBLAS to run them on this. So those and other Nvidia libraries would become part of the API boundary that AMD would need to reimplement and support.
Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.
Are you aware of HIP? It's officially supported and, for code that avoids obscure features of CUDA like inline PTX, it's pretty much a find-and-replace to get a working build:
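For a sense of what that find-and-replace amounts to, here's a toy sketch (illustrative, not from any particular project, assuming a stock ROCm install with hipcc): only the runtime prefix changes, while the kernel body and launch syntax stay as they were in the CUDA original.

    // Hypothetical example of the "find-and-replace" port: the CUDA original
    // used cudaMalloc / cudaDeviceSynchronize / cudaFree; only those names change.
    #include <hip/hip_runtime.h>   // was: #include <cuda_runtime.h>

    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // identical to the CUDA version
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float* d_x = nullptr;
        hipMalloc(reinterpret_cast<void**>(&d_x), n * sizeof(float));  // was: cudaMalloc
        scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);                 // launch syntax unchanged
        hipDeviceSynchronize();                                        // was: cudaDeviceSynchronize
        hipFree(d_x);                                                  // was: cudaFree
        return 0;
    }

Build it with hipcc (assuming that toolchain is installed) and it runs on an AMD card.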
if you're talking about building anything, that is already too hard for ML researchers.
you have to be able to pip install something and just have it work, reasonably fast, without crashing, and also it has to not interfere with 100 other weird poorly maintained ML library dependencies.
Support this, reimplement that, support upstream efforts, I don't really care. Any of those would cost a couple million dollars and be worth a trillion to AMD shareholders.
Is it weird how the comments here are blaming AMD and not Nvidia? Sure, the obvious argument is that Nvidia has no practical motivation to build an open platform. But there are counterexamples that suggest otherwise (Android). And there is a compelling argument that long term, their proprietary firmware layer will become an insufficient moat to their hardware dominance.
Who’s the root cause? The company with the dominant platform that refuses to open it up, or the competitor who can’t catch up because they’re running so far behind? Even if AMD made their own version of CUDA that was better in every way, it still wouldn’t gain adoption because CUDA has become the standard. No matter what they do, they’ll need to have a compatibility layer. And in that case maybe it makes sense for them to invest in the best one that emerges from the community.
>Nvidia can make things arbitrarily difficult both technically and legally.
I disagree. AMD can simply not implement those APIs, similar to how game emulators implement the most-used APIs first and sometimes never bother with obscure ones. It would only matter that NVIDIA added, e.g., patented APIs to CUDA if those APIs were useful, in which case AMD should have a way to do them anyway. Unless NVIDIA comes up with a new patented API which is both useful and impossible to implement in any other way, which would be bad for AMD in any event. On the other hand, if AMD starts supporting CUDA and people start using AMD cards, then developers will be hesitant to use APIs that only work on NVIDIA cards. Right now AMD is losing billions of dollars on this. Then again, they barely seem capable of supporting ROCm on their cards, much less CUDA.
You have a fair point in terms of cuDNN and cuBLAS but I don't know that that kind of ToS is actually binding.
Agreed. Rather than making CUDA the standard, AMD should push/drive an open standard that can run on any hardware.
We have seen this succeed multiple times: FreeSync vs G-Sync, DLSS vs FSR, and (not AMD, but) Vulkan vs DirectX & Metal.
All of the big tech companies are obsessed with ring-fencing developers behind the thin veil of "innovation" - where really it's just good for business (I swear it should be regulated because it's really bad for consumers).
A CUDA translation layer is okay for now but it does risk CUDA becoming the standard API. Personally, I am comfortable with waiting on an open standard to take over - ROCm has serviced my needs pretty well so far.
Just wish GPU sharing with VMs was as easy as CPU sharing.
> AMD should push/drive an open standard that can be run on any hardware.
AMD has always been notoriously bad at the software side, and they frequently abandon their projects when they're almost usable, so I won't hold my breath.
We actually also saw this historically with OpenGL.
OpenGL comes from an ancient company whispered about by the elderly programmers (30+ years old) known as SGI. Originally it was CLOSED SOURCE and called IRIS GL, after SGI's IRIS workstations, which were cool looking with bright popping color plastic and faux granite keyboards. Good guy SGI opened it up to become what we call "OpenGL" (get it, now it's open), and then it stuck.
That's all to say NVIDIA could pull an SGI and open their stuff, but they're going more Sony-style and trying to monopolize. Oh, and SGI also gave us another piece of ancient lore, the SGI STL, their influential implementation of the Standard Template Library, which is the original granddaddy of Boost-style template metaprogramming.
OpenCL was released in 2009. AMD has had plenty of time to push and drive that standard. But OpenCL offered a worse developer experience than CUDA, and AMD wasn't up to the task in terms of hardware, so it made no real sense to go with OpenCL.
> against the license agreement of cuDNN or cuBLAS to run them on this
They don’t run either of them, they instead implement an equivalent API on top of something else. Here’s a quote: “Open-source wrapper libraries providing the "CUDA-X" APIs by delegating to the corresponding ROCm libraries. This is how libraries such as cuBLAS and cuSOLVER are handled.”
> CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult [...] technically.
(Let's put the legal questions aside for a moment.)
Nvidia changes GPU architectures every generation or few generations, right? How does CUDA work across those, and how can it stay forwards compatible in the future, if it's not designed to be technology agnostic?
PTX is meant to be portable across GPU microarchitectures. That said, Nvidia owns the entire spec, so they can just keep adding new instructions that their GPUs now support but AMD GPUs don't.
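As a rough illustration of that moving target (my own generic sketch, nothing beyond standard CUDA warp intrinsics): code typically gates on __CUDA_ARCH__ and only uses the newer instruction where it exists, so any compatibility layer ends up having to handle both branches.

    // Illustrative only: __reduce_add_sync is an sm_80+ warp intrinsic; older
    // targets (or a non-NVIDIA backend) fall back to shuffle arithmetic.
    // The fallback assumes a 32-lane warp, which is itself an NVIDIA-ism.
    __device__ int warp_sum(int v) {
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
        return __reduce_add_sync(0xffffffffu, v);
    #else
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return __shfl_sync(0xffffffffu, v, 0);   // broadcast lane 0's total
    #endif
    }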
One way is to make sure the hardware team does certain things to support an easy transition to new architectures; we have seen this with Apple Silicon, for example!
Well, they kinda have that with their hipify tool, although that's for porting CUDA code to AMD's HIP, which supports both AMD and NVIDIA. It handles CUDA C code and libraries with AMD equivalents like cuDNN, cuBLAS and cuRAND, but it doesn't support porting CUDA C inline PTX assembly. AMD has its own inline GCN assembler, but seems to discourage its use.
There are also versions of PyTorch, TensorFlow and JAX with AMD support.
PyTorch's torch.compile can generate Triton (OpenAI's GPU compiler) kernels, with Triton also supporting AMD.
CUDA is the juice that built Nvidia in the AI space and allowed them to charge crazy money for their hardware. Being able to run CUDA on cost-effective AMD hardware could be a big leap forward, allowing more people to do research and breaking Nvidia's stranglehold on VRAM. Nvidia will never open source their own platform unless their hand is forced. I think we all should support this endeavor and contribute where possible.
Before starting on x86, AMD signed an agreement with Intel that gave them an explicit license to it. And x86 was a whole lot smaller and simpler back in 1982. A completely different and incomparable situation.
Was there a large entity steering the x86 spec alone, with a huge feature lead over its competition, free to steer the spec any way it chose? Also, hardware is not open-source software: get the big players on board and they can implement the spec they want every generation. Software has more moving parts and more unaligned parties involved.
Isn't cuDNN a much better case for reimplementing than CUDA? It has much more freedom in how things actually happen: cuDNN itself chooses different implementations at runtime and does fusing. It seems way more generic, and a reimplementation could use the best AMD-targeted kernel rather than whichever one the original would have picked.
I really hope they do what you suggested. With some innovative product positioning (GPUs with a lot of memory, for example) they could dethrone Nvidia if it doesn't change strategy.
That said, easier said than done. You need very specialized developers to build a CUDA equivalent and have people start using it. AMD could do it with a more open development process leveraging the open source community. I believe this will happen at some point anyway by AMD or someone else. The market just gets more attractive by the day and at some point the high entry barrier will not matter much.
So why should AMD skimp on their ambitions here? This would be a most sensible investment: few risks and high gains if successful.
That is why an open standard should be created, one that isn't locked to a particular piece of hardware and that allows modular support for different hardware to interface with supported drivers.
Given AMD's prior lack of interest I'll take whatever options there are. My daily driver has a Vega 10 GPU and it's been quite frustrating not to be able to easily leverage it for basic ML tasks, to the point that I've been looking at buying an external Nvidia GPU instead just to try out some of the popular Python libraries.
Ya, honestly better to leave that to third parties who can dedicate themselves to it and maybe offer support or whatever. Let AMD work on good first party support first.
I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...
Presumably that stuff doesn't "just work" but they don't want to mention it?
A lot of our hw-aware bits are parameterized where we fill in constants based on the available hw. Doable to port, same as we do whenever new Nvidia architectures come out.
But yeah, we have tricky bits that inline PTX, and... that will be more annoying to redo.
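The parameterized approach looks roughly like this (a generic sketch, not the commenter's actual code, with an arbitrary illustrative split): sizes come from the device properties at runtime rather than being hard-coded for one vendor's parts, so a CUDA-on-AMD layer just has to report honest values.

    #include <cuda_runtime.h>

    // Hypothetical helper: choose a tile size from whatever shared memory the
    // current device reports, instead of baking in an NVIDIA-specific number.
    static int pick_tile_elems(int device) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, device);
        size_t budget = prop.sharedMemPerBlock / 2;   // leave some headroom (illustrative)
        return static_cast<int>(budget / sizeof(float));
    }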
It's speculation, but I think it's similar to CPUs: nobody guarantees the code will run exactly the way you set it up. You may want to use a specific register, but if the processor thinks another register can fulfill the task, it'll use that one while still behaving as if your code executed as expected. Maybe AMD's GPUs can sufficiently simulate the behavior of Nvidia hardware that higher abstractions are unaware something different is happening under the hood.
Pretty much. Compilers can do a lot more than people give them credit for. At least AMD documents their hardware, so it is actually possible to know the low-level details. PTX can obfuscate that surprisingly badly for Nvidia targets.
Makes sense to expect this kind of thing to be open source. The whole point of providing improved compatibility is to make people’s lives easier, and open source is usually an important feature to ensure wide compatibility. It also means projects can live on after the creators move to other things, people can submit patches for important features or bug fixes, and generally makes the system much more useful.
I don't find it wrong for someone to attempt to make money back on their time and the experience of doing the work. I don't mind people offering it back as open source either. However, I do have a problem with people expecting everything to be open/free, especially those who then go on a crusade chastising those who do try to make money.
> Makes sense to expect this kind of thing to be open source. The whole point of providing improved compatibility is to make people’s lives easier, and open source is usually an important feature to ensure wide compatibility. It also means projects can live on after the creator
AMD just bought a company working on similar things for more than $600M.
We're going to be publishing more details in later blog posts and documentation about how this works and how we've built it.
Yes, we're not open source, however our license is very permissive. It's both in the software distribution and viewable online at https://docs.scale-lang.com/licensing/
> I don't see a "buy now" button or a PCIe version anywhere here
"Buy now" buttons and online shopping carts are not generally how organizations looking to spend serious money on AI buy their hardware.
They have a long list of server hardware partners, and odds are you'd already have an existing relationship with one or more of them, and they'd provide a quote.
They even go one step further and show off some of their partners' solutions: https://www.amd.com/en/graphics/servers-instinct-deep-learni...
FWIW I believe Supermicro and Exxact actually do have web-based shopping carts these days (e.g. https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443), so maybe you could skip the quotation and buy directly if you were so motivated? Seems kind of weird at this price point.
The main cause of Nvidia's crazy valuation is AMD's unwillingness to invest in making its GPUs as useful as Nvidia's for ML.
Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.
I like to watch YouTube retrospectives on old failed tech companies - LGR has some good ones.
When I think of AMD ignoring machine learning, I can't help imagine a future YouTuber's voiceover explaining how this caused their downfall.
There's a tendency sometimes to think "they know what they're doing, they must have good reasons". And sometimes that's right, and sometimes that's wrong. Perhaps there's some great technical, legal, or economic reason I'm just not aware of. But when you actually look into these things, it's surprising how often the answer is indeed just shortsightedness.
They could end up like BlackBerry, Blockbuster, Nokia, and Kodak. I guess it's not quite as severe, since they will still have a market in games and therefore may well continue to exist, but it will still be looked back on as a colossal mistake.
Same with Toyota ignoring electric cars.
I'm not an investor, but I still have stakes in the sense that Nvidia has no significant competition in the machine learning space, and that sucks. GPU prices are sky high and there's nobody else to turn to if there's something about Nvidia you just don't like or if they decide to screw us.
In fairness to AMD, they bet on crypto, and nvidia bet on AI. Crypto was the right short term bet.
Also, ignoring is a strong word: I'm staring at a little << $1000, silent, 53-watt mini-PC with an AMD SoC. It has an NPU comparable to an M1's. In a few months, with the Ryzen 9000 series, NPUs for devices of its class will bump from 16 TOPS to 50 TOPS.
I'm pretty sure the Linux taint bit is off, and everything just worked out of the box.
Toyota is extremely strong in the hybrid car market, and with ravenous competition in electric cars and slowing demand, Toyota may have made the right decision after all.
There's also just the idea of endeavour - Nvidia tried something, and it worked. Businesses (or rather their shareholders) take risks with their capital sometimes, and it doesn't always work. But in this case it did.
> I think this could be cultural differences, AMD's software department is underfunded and doing poorly for a long time now.
Rumor is that ML engineers (which AMD really needs) are expensive, and AMD doesn't want to pay them more than the rest of its SWEs for fear of pissing off the existing SWEs. So AMD is caught in a bind: it can't pay to get top MLE talent, and it can't just sit by and watch NVDA eat its lunch.
AMD fears anti-collusion action. Remember, the CEOs of the two companies are just barely far enough apart in kinship not to be automatically considered to be colluding with each other.
The companies' CEOs are related. My conspiracy theory is that they don't want to step on each other's toes. Not sure if that works with fiduciary duty, though.
It does not conflict. Fiduciary duty for a for-profit organisation is not "profit at all costs", it's "you have to care about the company (care), you have to do good business (good faith) and you can't actively waste investors' and shareholders' money to intentionally lose out (loyalty)".
If they are found colluding due to nepotism, both will get a very swift revocation of business licence and a huge prison term. Remember they are just one step of kinship away from presumed collusion.
I worked for Spectral Compute a few years ago. Very smart and capable technical team.
At the time, not only did they target AMD (with less compatibility than they have now), but they also outperformed the default LLVM PTX backend, and even NVCC, when compiling for Nvidia GPUs!
I don't understand how AMD has messed up so badly that I feel like celebrating a project like this. Features of my laptop are just physically there but not usable, particularly in Linux. So frustrating.
AMD hardware works fine, the problem is that the major research projects everyone copies are all developed specifically for Nvidia.
Now AMD is spinning up CUDA compatibility layer after CUDA compatibility layer. It's like trying to beat Windows by building another ReactOS/Wine. It's an approach doomed to fail unless AMD somehow manages to gain vastly more resources than the competition.
Apple's NPU may not be very powerful, but many models have been altered specifically to run on them, making their NPUs vastly more useful than most equivalently powerful iGPUs. AMD doesn't have that just yet, they're always catching up.
It'll be interesting to see what Qualcomm will do to get developers to make use of their NPUs on the new laptop chips.
I don't know if I would call it a mess-up. AMD still has a massive market in server chips, and their ARM stuff is on the horizon. We all assume that graphics cards are the way forward for ML, which may not be the case in the future.
Nvidia were just ahead in this particular category due to CUDA, so AMD may have just let them run with it for now.
Same boat: AMD CPU but nothing else. I feel like a moderate improvement in their FOSS support and drivers would open up new hardware revenue, to say nothing of the AI channel.
It’s great that there is a page about current limitations [1], but I am afraid that what most people describe as “CUDA” is a small subset of the real CUDA functionality. Would be great to have a comparison table for advanced features like warp shuffles, atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.
You're right that most people only use a small subset of CUDA: we prioritized support for features based on what was needed for various open-source projects, as a way to try to capture the most common things first.
A complete API comparison table is coming soon, I believe. :D
In a nutshell:
- DPX: Yes.
- Shuffles: Yes. Including the PTX versions, with all their weird/wacky/insane arguments.
- Atomics: Yes, except the 128-bit atomics Nvidia added very recently.
- MMA: in development, though of course we can't fix the fact that nvidia's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.
- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.
> mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.
We plan to publish a compatibility table of which instructions are supported, but a list of the instructions used to produce each PTX instruction is not in general meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis due to the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns that this implies!
We're documenting which ones are expected to perform worse than on NVIDIA, though!
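For context, this is the kind of thing that inline PTX handling has to consume (a generic example, not SCALE's own code): the same warp butterfly written with the portable intrinsic and as a raw shfl.sync block.

    // Portable intrinsic form: any CUDA-compatible backend can lower this.
    __device__ int bfly_intrinsic(int v, int lane_mask) {
        return __shfl_xor_sync(0xffffffffu, v, lane_mask);
    }

    // Hand-written PTX doing roughly the same thing; on NVIDIA this is close to
    // what the intrinsic compiles to, but a translation layer has to re-lower it.
    __device__ int bfly_ptx(int v, int lane_mask) {
        int r;
        asm volatile("shfl.sync.bfly.b32 %0, %1, %2, 0x1f, 0xffffffff;"
                     : "=r"(r)
                     : "r"(v), "r"(lane_mask));
        return r;
    }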
> You're right that most people only use a small subset of cuda
This is true first and foremost for the host-side API. From my StackOverflow and NVIDIA forums experience - I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, with issues which nobody seems to have stumbled onto before; or at least - not stumbled and wrote anything in public about it.
> Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp.
https://github.com/ROCm/HIP
Don't believe me? Include this at the top of your CUDA code, build with hipcc, and see what happens:
https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...
It's incomplete because I'm lazy but you can see most things are just a single #ifdef away in the implementation.
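The linked header is Legion's own; as a generic, heavily trimmed sketch of the idea (hypothetical names, not the actual file), such a shim mostly boils down to spelling the CUDA runtime names in terms of HIP whenever hipcc is doing the compiling:

    // Sketch of a CUDA-or-HIP compatibility header, in the spirit of the one
    // linked above. Real projects map many more symbols; this is just the shape.
    #ifdef __HIPCC__
      #include <hip/hip_runtime.h>
      #define cudaError_t            hipError_t
      #define cudaSuccess            hipSuccess
      #define cudaStream_t           hipStream_t
      #define cudaMalloc             hipMalloc
      #define cudaFree               hipFree
      #define cudaMemcpy             hipMemcpy
      #define cudaMemcpyHostToDevice hipMemcpyHostToDevice
      #define cudaMemcpyDeviceToHost hipMemcpyDeviceToHost
      #define cudaDeviceSynchronize  hipDeviceSynchronize
      #define cudaGetLastError       hipGetLastError
    #else
      #include <cuda_runtime.h>
    #endif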
> you have to be able to pip install something and just have it work, reasonably fast, without crashing
https://github.com/ROCm/HIPIFY
> We have seen this succeed multiple times: FreeSync vs G-Sync, DLSS vs FSR, and (not AMD, but) Vulkan vs DirectX & Metal.
I'll definitely agree with you on FreeSync and Vulkan, but DLSS and XeSS are both better than FSR.
https://youtube.com/watch?v=el70HE6rXV4
> That's all to say NVIDIA could pull an SGI and open their stuff, but they're going more Sony-style and trying to monopolize.
Zero impact on Switch, Playstation, XBox, Windows, macOS, iOS, iPadOS, Vision OS.
Karol Herbst is working on Rusticl, which is mesa's latest OpenCL implementation and will pave the way for other things such as SYCL.
Pretty sure APIs are not copyrightable, e.g. https://www.law.cornell.edu/supremecourt/text/18-956
Edit: not sure why I just sort of expect projects to be open source or at least source available these days.
> Yes, we're not open source, however our license is very permissive.
Make it open source with a long delay, so paying users get the latest updates.
Make the git repo from "today - N years" open source, where N is something like 1 or 2.
That way, students can learn on old versions, and when they grow into professionals they can pay for access to the cutting-edge builds.
Win win win win
( https://breckyunits.com/earlySource.html)
https://www.amd.com/en/products/accelerators/instinct/mi300/...
Another big AMD fuckup in my opinion. Nobody is going to drop millions on these things without being able to test them out first.
First rule of sales: If you have something for sale, take my money.
"Buy now" buttons and online shopping carts are not generally how organizations looking to spend serious money on AI buy their hardware.
They have a long list of server hardware partners, and odds are you'd already have an existing relationship with one or more of them, and they'd provide a quote.
They even go one step further and show off some of their partners' solutions:
https://www.amd.com/en/graphics/servers-instinct-deep-learni...
FWIW I believe Supermicro and Exxact actually do have web-based shopping carts these days, so maybe you could skip the quotation and buy directly if you were so motivated? Seems kind of weird at this price point.
https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443
Compare SWE pay at the two:
* https://www.levels.fyi/companies/amd/salaries/software-engin...
* https://www.levels.fyi/companies/nvidia/salaries/software-en...
And it's probably even more lopsided now: Nvidia was paying much more long before this, and their growing stock attracts even more talent.
https://www.tomshardware.com/news/jensen-huang-and-lisa-su-f...
[1]: https://docs.scale-lang.com/manual/differences/