Readit News
Posted by u/frognumber 2 years ago
Reasonable GPUs
What is the status of GPUs for general compute?

Last I looked, NVidia worked well and AMD was horrible. Right now, it looks like the major limiting factor (if you don't care about a ≈3x difference in performance, which I don't) is VRAM. More is better: good models need >10GB, while the largest LLMs can need up to ~350GB.
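
Rough math behind those numbers (weights only; activations, KV cache, and framework overhead come on top):

    # Back-of-the-envelope: VRAM needed just to hold the weights.
    def weights_gb(params_billion, bytes_per_param):
        return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

    print(weights_gb(7, 2))      # 14 GB   - a 7B model in fp16 already needs >10GB
    print(weights_gb(175, 2))    # 350 GB  - OPT-175B in fp16, hence the ~350GB figure
    print(weights_gb(175, 0.5))  # 87.5 GB - the same model quantized to 4 bits
    print(21 * 16)               # 336 GB  - why the "A770 x21" idea below gets close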

* Intel Arc A770 has 16GB for <$300. I have no idea about compatibility with Hugging Face, Blender, etc.

* NVidia 4060 Ti has 16GB for <$500. 100% compatible with everything.

* Older NVidia (e.g. Pascal era) can be had with 24GB for <$300 used, without a graphics port. Not clear how CUDA compute capability lines up to what's needed for modern tools, or how well things work without a graphics port.

* Several cards may or may not work together. I'm not sure.

Is there any way to figure this stuff out, and what's reasonable / practical / easy? Something which explains CUDA compute levels, vendor compatibility, multi-card compatibility, and all that jazz. It'd be nice to have a generic enough guide to understand both pro and amateur use, e.g.:

- A770 x21, if someone got it working, could handle Facebook's OPT-175B for <$10k via Alpa. That brings it into "rich hobbyist" or "justifiable business expense" range. Not clear if that's practical.

- Kids learning AI would be much easier if it's cheaper (e.g. A770)

- "General compute" also includes things like Blender or accelerating rendering in kdenlive, etc.

- Etc.

This stuff is getting useful to a broader and broader audience, but it's confusing.

jszymborski · 2 years ago
It really depends on what you're trying to do.

This is sorta _the_ guide on GPUs for DL and has a great decision tree https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

Personally, I'm limited to an RTX 2080 for my personal projects at the moment, and I find the constraint pretty rewarding. It forces me to find alternatives to the huge models, and you'd be surprised what you can eke out when you pour time into tweaking models. Of course, good data is also paramount.

pizza · 2 years ago
nvidia specific, best write up I know: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...

Across vendors, generally, Nvidia still dominates currently. People are adding more support into ML libraries for other vendors via (second-class imo) alternate backends but expect to be patient if you're waiting for the day when there is healthy competition.

IMO, I'd say: if you can save up for it, get a 4090; if you can save up for half a 4090, get a 3090 - seen many going for 600-800 now. If you can save up for half a 3090, I'm not sure - depends on if you prefer speed or VRAM. If it were me, I'd pick more VRAM first.

re: compute capability, you can see here:

- which GPUs have what cc: https://developer.nvidia.com/cuda-gpus

- what cc comes with what features: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

I think the main qualitative change (beyond bigger numbers in the spec) for an end user of machine learning libraries from 8.6 -> 8.9 (i.e. 3090 -> 4090) is this line:

> 4 mixed-precision Fourth-Generation Tensor Cores supporting fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute capability 8.9 (see Warp matrix functions for details)

i.e. new precisions will be built into e.g. PyTorch with hardware-level/tensor-core support
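
If you want to see what your own card reports, a quick check from PyTorch (assumes a CUDA build of torch is installed):

    import torch

    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        major, minor = torch.cuda.get_device_capability(0)
        print(f"{name}: compute capability {major}.{minor}")  # 8.6 = 3090, 8.9 = 4090
        print("bf16 supported:", torch.cuda.is_bf16_supported())
    else:
        print("no CUDA device visible")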

edit: btw you probably ought to stick to a consumer gpu (ie not professional) if you want it to be generally versatile while also easy to use at home.

yread · 2 years ago
What about the 4060 Ti 16GB? It was released after this guide, costs ~€500, and is a bit faster, newer (and a lot more efficient) than a 3060
PeterStuer · 2 years ago
Wasn't this the one deliberately choked by a narrow memory bus to prevent decent non-gaming workload performance?
smoldesu · 2 years ago
This. If you insist on being as cheap as possible, shoot for a 12gb card but be aware that you'll be missing out on the throughput of higher-end models. The 3060 is popular for this I think, but you'll probably want a better card with more CUDA cores to max out performance.

Cards like the A770 are awesome, but barely even support raster drivers on DirectX. Your best bang-for-buck options are going to be Nvidia-only for now, with a few competing AMD cards that have fast-tracked Pytorch support.

matthewaveryusa · 2 years ago
I purchased a 3060 specifically for the 12GB of memory last November, and I've been able to run llama, alpaca, and Stable Diffusion out of the box without ever having any memory issues. Training is usually overnight, a Stable Diffusion image renders in ~5 seconds, and llama does ~20 tokens/second.

I would say start with the 3060 for 250 bucks, and if you're still loving it after a couple months, drop 10x more on a quadro.

My only word of advice is to get Docker set up and install the NVIDIA Container Toolkit to pass your GPU through to Docker images -- the package management for all these Python AI tools is a hell-scape, especially if you want to try a bunch of different things.
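
A quick sanity check that the passthrough actually worked, run inside the container (assumes the image has a CUDA build of PyTorch):

    import torch

    # If `docker run --gpus all ...` was set up correctly, this sees the card;
    # otherwise torch silently falls back to CPU.
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        print(props.name, round(props.total_memory / 1e9, 1), "GB VRAM")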

frognumber · 2 years ago
Thank you. This is super-helpful.

> re: compute capability, you can see here:

My key question is much more pragmatic:

1) If I grab a random model from Hugging Face, will it accelerate?

2) If I run Blender, kdenlive, or DaVinci Resolve, will it accelerate?

Is there a line where things break?

I definitely prefer more VRAM to more speed. As an occasional user, speed doesn't really matter. Things working does.

smoldesu · 2 years ago
> If I grab a random model from Hugging Face, will it accelerate?

Probably; it depends more on how you configure the inference software. Most software that supports acceleration targets CUDA or cuBLAS first, so you should be good.

> If I run Blender, kdenlive, or DaVinci Resolve, will it accelerate?

Yep. If you're running Linux, some distros might be a little iffy about shipping the proprietary/accelerated versions of this software, but most are fine. If you do encounter any issues, the Flatpak versions should all have Nvidia acceleration working out of the box.

> Is there a line where things break?

Yes, but you can avoid it by choosing smaller quantizations and giving yourself a few gigs of VRAM headroom. In my experience, it's always better to select a model smaller than you need so you're not risking an OOM crash (I've got a 3070ti).
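
To make that concrete, a minimal sketch of loading a Hugging Face model with 4-bit weights so it fits in consumer VRAM (the model name is just an example, and it assumes the bitsandbytes package is installed):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "facebook/opt-1.3b"  # example; swap in whatever you grabbed

    # 4-bit weights cut VRAM to roughly a quarter of fp16, which is how you
    # keep a few gigs of headroom on a 12-16GB card.
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",            # puts layers on the GPU(s) it finds
        quantization_config=quant,
    )

    inputs = tok("Cheap GPUs are", return_tensors="pt").to(model.device)
    print(tok.decode(model.generate(**inputs, max_new_tokens=20)[0]))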

Lotta other great advice in this thread, though! Good luck picking something out.

samspenc · 2 years ago
For Blender, you can actually check crowd-sourced public benchmarks for basically any CPU and GPU you want to compare https://opendata.blender.org/

That site is a goldmine for perf benchmarks, I actually use that site if I want to do a rough comparison of GPU performance across models for 3D / animation / gaming uses. Even though that is Blender specific, I'm pretty confident the results apply to any usage in the same class of applications.

simne · 2 years ago
For Blender, you have to read your software's requirements carefully.

Unfortunately, only the NN software is more or less standardized, so there you can usually pick the best fit for your budget; everything else can be tightly coupled not just to one brand but to one specific model. For example, I've seen software that works on an Nvidia 960; I'm not sure about the 1060; it doesn't work on a 2060 (for some reason, the developers avoid that series).

uniqueuid · 2 years ago
To be honest, the best idea for most people is probably to buy whatever GPU you can easily afford and then rent a big-iron GPU when you need one.

There is almost no way you will make back the $5k for a 40GB+ ram card, so just save yourself all the hassle and go for something that ticks all the rest of your boxes.

Non-CUDA cards may be ok if you have very simple requirements, but I'd expect many hours of debugging if you want something that's not ready to go out of the box.

civilitty · 2 years ago
I agree. Availability is a pain in the ass, which might be a dealbreaker for urgent interactive use cases, but a 48GB A6000 on LambdaLabs is $0.80/hr [1]. A newer 80GB H100 is $1.99/hr, so especially if you're trying to do batch processing and can script a bot to wait for availability, it's often a much better option.

With that aforementioned A6000 ($5k retail) you'd have to use it for over six thousand hours ($5,000 ÷ $0.80/hr ≈ 6,250 hours) to break even on the cloud cost.

[1] https://lambdalabs.com/service/gpu-cloud#pricing

buildbot · 2 years ago
That seems like a lot, but that's only ~8 months of usage. If you are doing consistent work with large models, or plan to for over a year, then it makes sense to at least have some hardware.

Something people forget too is that if you have no Nvidia GPUs at all locally, you'll need to spend a significant amount of time installing a new node, copying data, and debugging in your cloud instance, each time you want to do something, while being charged for it. Being able to develop locally and then scale to the cloud once something smaller-scale is working is a pretty big boost in terms of my time.

p1esk · 2 years ago
Just to clarify, because this advice might be misleading. These LambdaLabs prices are pretty much meaningless, because there are no available instances currently, and haven't been for months. The last time I saw an available _hourly_ A6000 instance was more than 6 months ago. Forget about H100. You might be able to get a reserved instance if you're willing to commit a significant enough amount, but even that is probably impossible right now for H100 instances.
frognumber · 2 years ago
Rationally, this makes sense.

Emotionally, it doesn't. The problem is if I own something, I'll use it freely. If I rent a GPU, I'll be stressing and counting pennies. In practice, I'll use it less.

On the whole, I'd rather buy even if it costs more, because I'll use it, and in the long term, that pays dividends.

That's not everyone. That's me.

activescott · 2 years ago
I had a similar question and after reading far too much I put together https://coinpoet.com

I'd love to have others here try it out and give me some feedback on how I could make it useful. It's only a couple weeks in but already seems valuable to me. What am I missing?

buildbot · 2 years ago
An eBay 3090, or a new 4090 at $1,599 (Founders Edition, Gigabyte Windforce V2), is the best price/performance/ease of use in my opinion.

AMD is too funky for most still. I have an MI60 that won't load drivers due to some missing PSP (platform security processor) firmware on the GPU…

ilaksh · 2 years ago
I would really like a "reasonable" monthly price for a VPS with a GPU. Even a consumer card like a 3090.

vast.ai has the best prices I have seen, but comes with limitations, and it's still not the best deal for an entire month.

xcv123 · 2 years ago
Given the cost of the hardware and power and everything else required to run it and support it, how can you say that 20 cents an hour is unreasonable? There's very little profit margin there. At this price it would take roughly a year for them to make a profit. If you need continuous usage at the lowest price then you need to buy a GPU on ebay.
cma · 2 years ago
Nvidia has a datacenter tax in their driver terms of use, so you won't find consumer-card VPSes at consumer-like prices.
theyinwhy · 2 years ago
Most offers are fine; the A100 I rented, however, was a scam: advertised as an A100 but performing like a 1080. I guess the seller partitioned the card or rigged the ID. You can report fraud like this on their page, but only while you are renting.
buildbot · 2 years ago
I would hope vast.ai would be able to detect MIG at least.

It could also be a low power cap - I had a Dell C4140 for a bit with 220V power supplies and 120V power, locking the entire thing to ~50% of the max power cap per GPU basically.
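
A cheap sanity check when an instance spins up: print the device name and time a big matmul (very rough, not a real benchmark; assumes a CUDA build of PyTorch):

    import time
    import torch

    assert torch.cuda.is_available()
    print(torch.cuda.get_device_name(0))  # should actually say A100

    n = 8192
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    a @ b                                 # warm-up
    torch.cuda.synchronize()

    t0 = time.time()
    for _ in range(10):
        a @ b
    torch.cuda.synchronize()
    elapsed = time.time() - t0

    # 2*n^3 flops per matmul; an A100 should land roughly an order of
    # magnitude above a 1080-class card here.
    print(f"{10 * 2 * n**3 / elapsed / 1e12:.1f} TFLOPS")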

tommy_axle · 2 years ago
Checked out vast.ai, but you can get down to ~$0.34/hr at Runpod depending on how much VRAM you need.
diffeomorphism · 2 years ago
Is shared RAM at all useful? For instance, a mini PC with an AMD 7940HS chip and 64GB of DDR5 RAM costs about €800. At less than half the price of just a GPU, I'm not expecting any great results, but is it usable?
maerF0x0 · 2 years ago
I was curious about this topic too because M3 macs have "Unified memory" which is shared amongst their CPU/GPUs. Anyone have a link or explanation of how this works?
treesciencebot · 2 years ago
One of the main bottlenecks for inference is memory bandwidth (esp. when dealing with huge models, like SD/SDXL) and for that, nothing I know of comes close to matching memory speeds on Apple Silicon (up to 400GB/s).
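
On the PyTorch side, the Apple GPU shows up as the "mps" backend rather than CUDA, roughly:

    import torch

    # On an M-series Mac the GPU is reached through Metal ("mps") instead of
    # "cuda"; it draws from the machine's unified memory rather than a
    # separate VRAM pool.
    device = "mps" if torch.backends.mps.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=device)
    print(device, (x @ x).shape)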
pella · 2 years ago
As far as I know, only 16GB of VRAM is allowed on the 7940HS (via BIOS),

so you can probably expect similar results:

https://old.reddit.com/r/Amd/comments/15t0lsm/i_turned_a_95_...

HN https://news.ycombinator.com/item?id=37162762

startupsfail · 2 years ago
Older-generation RTX 8000s with 48GB are reasonable.

One big disadvantage of the older Turing cards: no bfloat16. But if you run a quantized/mixed-precision model or QLoRA, it doesn't hurt as much.