lhl · a year ago
For inference, if you have a supported card (or probably a supported architecture, if you are on Linux and can use HSA_OVERRIDE_GFX_VERSION), then you can probably run anything with (upstream) PyTorch and transformers. Also, compiling llama.cpp has been pretty trouble-free for me for at least a year.
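
For anyone who wants the concrete commands, here is a minimal sketch of that path on Linux (the rocm suffix on the wheel index changes with each PyTorch release, and 11.0.0 is the RDNA3 override value, so adjust both for your setup):

    # install the ROCm build of PyTorch (the suffix tracks the current release shown on pytorch.org)
    pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

    # for an unsupported-but-close architecture, override the detected GFX version
    # (11.0.0 targets gfx1100, i.e. RDNA3; pick the value closest to your card)
    HSA_OVERRIDE_GFX_VERSION=11.0.0 python -c "import torch; print(torch.cuda.is_available())"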

(If you are on Windows, there is usually a win-hip binary of llama.cpp in the project's releases, or, if things totally refuse to work, you can use the Vulkan build as a (less performant) fallback.)

Having more options can't hurt, but ROCm 5.4.2 is almost two years old, and things have come a long way since then, so I'm curious why this is being published fresh today, in October 2024.

BTW, I recently went through and updated my compatibility doc (focused on RDNA3) w/ ROCm 6.2 for those interested. A lot has changed just in the past few months (upstream bitsandbytes, upstream xformers, and Triton-based Flash Attention): https://llm-tracker.info/howto/AMD-GPUs

woodrowbarlow · a year ago
i also have been playing with inference on the amd 7900xtx, and i agree. there are no hoops to jump through these days. just make sure to install the rocm version of torch (if using a1111 or similar, don't trust requirements.txt), as shown clearly on the pytorch homepage. obsidian is a similar story. hip is straightforward, at least on arch and ubuntu (fedora still requires some twiddling, though). i didn't realize xformers is also functional! that's good news.
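
One quick way to confirm that the ROCm wheel (and not a CUDA one pulled in by a stale requirements.txt) is what actually got installed; torch.version.hip is None on non-ROCm builds:

    # prints the torch version, the HIP version it was built against, and GPU availability
    python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"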
qamononep · a year ago
It would be great if you included a section on running with Docker on Linux. The only one that worked out of the box was Ollama, and it had an example. https://github.com/ollama/ollama/blob/main/docs/docker.md

llama.cpp has a Docker image but no examples of how to run it: https://github.com/ggerganov/llama.cpp/blob/master/docs/dock...

koboldcpp has a Docker image but no examples of how to run it: https://github.com/LostRuins/koboldcpp?tab=readme-ov-file#do...

The text-generation-webui Docker image was broken for me on a 7800 XT running RHEL 9: https://github.com/Atinoda/text-generation-webui-docker
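
For anyone else landing here, the working Ollama invocation from that doc looks roughly like this (the image tag and model name may have changed since):

    # pass the ROCm device nodes through to the container
    docker run -d --device /dev/kfd --device /dev/dri \
        -v ollama:/root/.ollama -p 11434:11434 \
        --name ollama ollama/ollama:rocm

    # then pull and run a model inside it (model name is just an example)
    docker exec -it ollama ollama run llama3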

fazkan · a year ago
Good feedback, thanks. Would you be able to open an issue?
conshama · a year ago
related: https://www.nonbios.ai/post/deploying-large-405b-models-in-f...

tl;dr: uses the latest ROCm 6.2 to run full-precision inference for Llama 405B on a single node with 8x MI300X AMD GPUs.
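
Illustrative only, since the exact serving stack isn't spelled out here: a single-node, 8-way tensor-parallel launch with a vLLM-style server typically looks something like this (model name and flags are placeholders):

    # shard the model across all 8 MI300X GPUs on the node
    vllm serve meta-llama/Llama-3.1-405B-Instruct \
        --tensor-parallel-size 8 \
        --max-model-len 8192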

How mature do you think the ROCm 6.2 AMD stack is compared to Nvidia's?

fazkan · a year ago
this uses vllm?
tcdent · a year ago
The rise of generated slop ml libraries is staggering.

This library is 50% print statements. And where it does branch, it doesn't even need to.

Defines two environment variables and sets two flags on torch.

mdaniel · a year ago
I also had to go to therapy to cure myself of the misunderstanding that data scientists and machine learning folks are software engineers, and expecting the same work product from those disparate audiences only raises your blood pressure

Expectation management is a huge part of any team/organization, I think

tpoacher · a year ago
They can be the same or different, depending on how you define them. People throw these words around with little thought, especially those with only a superficial connection to the field or none at all.

I wouldn't disparage an entire field for lack of a clear definition in the buzzwords people use to refer to it.

driverdan · a year ago
I thought you were being overly harsh until I looked at the repo. You're not kidding, there's very little to it.
TechDebtDevin · a year ago
While I see where you are coming from, these are the types of comments that keep people from sharing their code, contributing to OSS or continuing to program in general.
a2128 · a year ago
It seems to use a two-year-old version of ROCm (5.4.2), which I'm doubtful would support my RX 7900 XTX. I personally found it easiest to just use the latest `rocm/pytorch` image and run what I need from there.
slavik81 · a year ago
The RX 7900 XTX (gfx1100) was first enabled in the math libraries (e.g. rocBLAS) for ROCm 5.4, but I don't think the AI libraries (e.g. MIOpen) had it enabled until ROCm 5.5. I believe the performance improved significantly in later releases, as well.
slavik81 · a year ago
On Ubuntu 24.04 (and Debian Unstable¹), the OS-provided packages should be able to get llama.cpp running on ROCm on just about any discrete AMD GPU from Vega onwards²³⁴. No docker or HSA_OVERRIDE_GFX_VERSION required. The performance might not be ideal in every case⁵, but I've tested a wide variety of cards:

    # install dependencies
    sudo apt -y update
    sudo apt -y upgrade
    sudo apt -y install git wget hipcc libhipblas-dev librocblas-dev cmake build-essential

    # ensure you have permissions by adding yourself to the video and render groups
    sudo usermod -aG video,render $USER
    # log out and then log back in to apply the group changes
    # you can run `rocminfo` and look for your GPU in the output to check everything is working thus far

    # download a model, build llama.cpp, and run it
    wget https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF/resolve/main/dolphin-2.2.1-mistral-7b.Q5_K_M.gguf?download=true -O dolphin-2.2.1-mistral-7b.Q5_K_M.gguf
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    git checkout b3267
    HIPCXX=clang-17 cmake -H. -Bbuild -DGGML_HIPBLAS=ON -DCMAKE_HIP_ARCHITECTURES="gfx803;gfx900;gfx906;gfx908;gfx90a;gfx1010;gfx1030;gfx1100;gfx1101;gfx1102" -DCMAKE_BUILD_TYPE=Release
    make -j16 -C build
    build/bin/llama-cli -ngl 32 --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -m ../dolphin-2.2.1-mistral-7b.Q5_K_M.gguf --prompt "Once upon a time"
I'd suggest RDNA 3, MI200 and MI300 users should probably use the AMD-provided ROCm packages for improved performance. Users that need PyTorch should also use the AMD-provided ROCm packages, as PyTorch has some dependencies that are not available from the system packages. Still, you can't beat the ease of installation or the compatibility with older hardware provided by the OS packages.

¹ https://lists.debian.org/debian-ai/2024/07/msg00002.html
² Not including MI300 because that released too close to the Ubuntu 24.04 launch.
³ Pre-Vega architectures might work, but have known bugs for some applications.
⁴ Vega and RDNA 2 APUs might work with Linux 6.10+ installed. I'm in the process of testing that.
⁵ The version of rocBLAS that comes with Ubuntu 24.04 is a bit old and therefore lacks some optimizations for RDNA 3. It's also missing some MI200 optimizations.

mindcrime · a year ago
I was able to install (AMD provided) ROCm and Ollama on Ubuntu 22.04.5 with an RX 7900 XTX with no real problems to speak of, and I can execute LLMs using Ollama on ROCm just fine. Take that FWIW.
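
Roughly what that setup amounts to, as a sketch (the amdgpu-install package itself comes from repo.radeon.com per AMD's install docs, and the model name is just an example):

    # AMD-provided ROCm, after installing the amdgpu-install package from repo.radeon.com
    sudo amdgpu-install --usecase=rocm

    # Ollama's install script detects ROCm and pulls its ROCm runtime
    curl -fsSL https://ollama.com/install.sh | sh
    ollama run llama3.1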
ekianjo · a year ago
Are there AMD cards with more than 24GB of VRAM on the market right now at consumer-friendly prices?
slavik81 · a year ago
The Radeon Pro W6800, W7800 or W7900 would be the standard answer. A hacker-spirited alternative would be to purchase a used MI50, MI60 or MI100 and 3d print a fan adapter. There are versions of all of those cards with 32GB of VRAM and they can be found on ebay for between 350 USD and 1200 USD. Plus twenty bucks for a fan adapter and a fan.

Those old gfx906 or gfx908 cards are more competitive for fp64 than for low-precision AI workloads, but they have the memory and the price is right. I'm not sure I would recommend the hacker approach to the average user, but it is what I've done for some of the continuous integration servers I host for the Debian project.

coolspot · a year ago
Amazon prices:

$3,600 - 61 TFLOPS - AMD Radeon Pro W7900

$4,200 - 38.7 TFLOPS - NVidia RTX A6000 48GB Ampere

$7,200 - 91.1 TFLOPS - NVidia RTX 6000 Ada 48GB

mindcrime · a year ago
It sort of depends on how you define "consumer friendly prices". AFAIK, in the $1000 - "slightly over or under $1000" range, 24GB is all you can get. But there are Radeon Pro boards with 32GB or 48GB of RAM for various prices between around $2000 to about $3500. So not "cheap" but possibly within reach for a serious hobbyist who doesn't mind spending a little bit more.
danielEM · a year ago
It has been like 8 months since I got a Ryzen 8700G with an NPU just for the purpose of inferencing NNs, and so far the only acceleration I'm getting is through Vulkan on the iGPU, not the NPU (I'm using Linux only). On the bright side, with 64GB of RAM I had no issues trying models over 32GB. Kudos to llama.cpp for supporting the Vulkan backend!
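
For anyone wanting to reproduce the iGPU path, a minimal Vulkan build of llama.cpp looks roughly like this (package names vary by distro, and older trees call the flag LLAMA_VULKAN instead of GGML_VULKAN):

    # Vulkan headers and the shader compiler
    sudo apt -y install libvulkan-dev glslc

    # build llama.cpp with the Vulkan backend and offload all layers to the iGPU
    cmake -B build -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j
    build/bin/llama-cli -m model.gguf -ngl 99 -p "Once upon a time"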
lhl · a year ago
You should have ROCm/HIP support on the iGPU as well, be sure to compile llama.cpp w/ the LLAMA_HIP_UMA=1 flag. If you take a look at https://github.com/amd/RyzenAI-SW you can see there's a fair amount of software to play with on the NPU now, but Phoenix is only 16 TOPS, so I've never bothered testing it.
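
A sketch of that build, assuming a recent llama.cpp tree (older versions spell the options LLAMA_HIPBLAS / LLAMA_HIP_UMA rather than GGML_HIPBLAS / GGML_HIP_UMA):

    # HIP build with unified memory, so the iGPU can allocate from ordinary system RAM
    HIPCXX=clang-17 cmake -B build -DGGML_HIPBLAS=ON -DGGML_HIP_UMA=ON -DCMAKE_BUILD_TYPE=Release
    cmake --build build -j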
rglullis · a year ago
So, this is all I needed to add to my NixOS workstation:

    hardware.graphics.enable = true;

    services.ollama = {
      enable = true;
      acceleration = "rocm";
      environmentVariables = {
        ROC_ENABLE_PRE_VEGA = "1";
        HSA_OVERRIDE_GFX_VERSION = "11.0.0";
      };
    };
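
After a nixos-rebuild switch, the service listens on Ollama's default port, so a quick smoke test looks like this (model name is just an example):

    # list the models the service knows about
    curl http://127.0.0.1:11434/api/tags
    # pull and run a small model against the ROCm backend
    ollama run llama3.1 "hello"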

tomxor · a year ago
I almost tried to install AMD rocm a while ago after discovering the simplicity of llamafile.

  sudo apt install rocm

  Summary:
    Upgrading: 0, Installing: 203, Removing: 0, Not Upgrading: 0
    Download size: 2,369 MB / 2,371 MB
    Space needed: 35.7 GB / 822 GB available
I don't understand how 36 GB can be justified for what amounts to a GPU driver.
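
If you're curious where the space actually goes (likely the math libraries, which ship kernels for many GPU architectures), something like this lists the biggest installed packages:

    # installed size is reported in kilobytes
    dpkg-query -W -f='${Installed-Size}\t${Package}\n' | sort -rn | head -n 20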

atq2119 · a year ago
So no doubt modern software is ridiculously bloated, but ROCm isn't just a GPU driver. It includes all sorts of tools and libraries as well.

By comparison, if you go and download the CUDA toolkit as a single file, you get a download file that's over 4GB, so quite a bit larger than the download size you quoted. I haven't checked how much that expands to (it seems the ROCm install has a lot of redundancy given how well it compresses), but the point is, you get something that seems insanely large either way.

tomxor · a year ago
I suspected that, but any set of binaries being that large just seems wrong; I mean, the whole thing is 35 times larger than my entire OS install.

Do you know what is included in ROCm that could be so big? Does it include training datasets or something?

steeve · a year ago
You can look us up at https://github.com/zml/zml; we fix that.
andyferris · a year ago
Wait, looking at that link I don't see how it avoids downloading CUDA or ROCM. Do you use MLIR to compile to GPU without using the vendor provided tooling at all?
burnte · a year ago
GPU drivers are complete OSes that run on the GPU now.
greenavocado · a year ago
It's not just you; AMD manages to completely shit-up the Linux kernel with their drivers: https://www.phoronix.com/news/AMD-5-Million-Lines
striking · a year ago
> Of course, much of that is auto-generated header files... A large portion of it with AMD continuing to introduce new auto-generated header files with each new generation/version of a given block. These verbose header files have been AMD's alternative to creating exhaustive public documentation on their GPUs that they were once known for.
anthk · a year ago
OpenBSD, too.
stefan_ · a year ago
This seems to be some AI generated wrapper around a wrapper of a wrapper.

> # Other AMD-specific optimizations can be added here

> # For example, you might want to set specific flags or use AMD-optimized libraries

What are we doing here, then?

fazkan · a year ago
It's just a big requirements file and a Dockerfile :) The rest are mostly helper scripts.