Hi HN,
I work with a company that has a few GPU-intensive ML models. Over the past few weeks, growth has accelerated, and with that costs have skyrocketed. AWS cost is about 80% of revenue, and the company is now almost out of runway.
There is likely a lot of low-hanging cost-saving fruit to be picked, just not enough people to do it. We would love pointers to anyone who specializes in cost optimization. Blogs, individuals, consultants, or magicians are all welcome.
Thank you!
Sorry to hear that. I’m sure it’s super stressful, and I hope you pull through. If you can, I’d suggest giving a little more information about your costs / workload to get more help. But, in case you only see yet another guess, mine is below.
If your growth has accelerated yielding massive cost, I assume that means you’re doing inference to serve your models. As suggested by others, there are a few great options if you haven’t already:
- Try spot instances: while you'll get preempted, you do get a couple of minutes to shut down (so for model serving, you just stop accepting requests, finish the ones you're handling, and exit; a rough sketch of that shutdown loop is below this list). This is worth a 60-90% reduction in compute cost.
- If you aren't using T4 instances, they're probably the best price/performance for GPU inference. A V100, by comparison, is up to 5-10x more expensive.
- Whichever GPU you use, your models should be taking advantage of int8 if possible. This alone may let you pack more requests onto each GPU (another 2x+; a quantization sketch is also below the list).
- You could try model pruning. This is perhaps the most delicate option, but look at how people compress models for mobile. It has a similar effect of packing more weights into smaller GPUs; alternatively, you can use a much simpler model (fewer weights and fewer connections usually also means far fewer FLOPs).
- But just as important: why do you need a GPU for your models at all? (Usually it's to serve a large-ish / expensive model quickly enough.) If the alternative is going out of business, try CPU inference, again on spot instances (like the c5 series). Vectorized CPU inference isn't bad at all!
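Here is the rough sketch of that spot shutdown loop, assuming a Python serving process. The metadata URL is the real EC2 spot-interruption endpoint, but stop_accepting_requests() and wait_for_inflight_requests() are placeholders for whatever your server actually does:

    import time
    import urllib.error
    import urllib.request

    # EC2 instance metadata endpoint for spot interruption notices.
    # It returns 404 until AWS schedules a reclaim, then you get ~2 minutes.
    # (If you enforce IMDSv2 you'd need to fetch a session token first.)
    SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

    def interruption_pending():
        try:
            urllib.request.urlopen(SPOT_ACTION_URL, timeout=1)
            return True                  # 200: reclaim scheduled
        except urllib.error.HTTPError:
            return False                 # 404: nothing scheduled yet
        except urllib.error.URLError:
            return False                 # metadata service unreachable

    while not interruption_pending():
        time.sleep(5)

    # Placeholders: wire these up to your actual server.
    stop_accepting_requests()            # e.g. start failing the LB health check
    wait_for_inflight_requests()         # finish what you already accepted
    raise SystemExit(0)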
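And a minimal sketch of one way to get int8: post-training quantization with the TFLite converter. The ./saved_model path and calibration_inputs() loader are assumptions, and on T4s the TensorRT/TF-TRT route is the other common option:

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model("./saved_model")
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    # Calibration data: a few hundred real input batches you supply yourself.
    def representative_batches():
        for batch in calibration_inputs():      # placeholder data loader
            yield [batch]

    converter.representative_dataset = representative_batches
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    with open("model_int8.tflite", "wb") as f:
        f.write(converter.convert())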
If instead this is all about training / the volume of your input data: sample it, change your batch sizes, just don’t re-train, whatever you’ve gotta do.
Remember, your users / customers won't somehow be happier when you're out of business in a month. Making all requests suddenly take 3x as long on a CPU, or sometimes fail, is better than "always fail, we had to shut down the company". They'll understand!
I stopped using GPUs. "Vectorized inference isn't bad at all!" This, so much. I was blinded by GPU speed; TensorFlow builds with AVX optimization are actually pretty fast.
My discovery:
+ Stopped using expensive GPUs for inference and switched to AVX-optimized TensorFlow builds (sketch below).
+ Cleaned up the inference pipeline and reduced complexity.
+ Reserving compute instances for a year or more provides a discount.
- I never got pruning to work without a significant loss increase.
- Tried cheaper GPU spot instances. Random kills, and spinning up new instances took too long to load my code. The discount is large, but I couldn't keep it running reliably; users were getting more timeouts. I bailed and just used CPU inference. The GPU was being underutilized anyway; switching to CPU increased inference time to around 2-3 seconds. With the price trade-off it was a simpler, cheaper and easier solution.
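For anyone curious, the CPU-only path is not much code. A minimal sketch, assuming a TensorFlow 2 SavedModel; the path, input shape and thread counts are placeholders you'd match to your model and instance size:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # force CPU even if a GPU exists

    import numpy as np
    import tensorflow as tf

    # Match threads to the instance's vCPUs (e.g. 8 on a c5.2xlarge).
    tf.config.threading.set_intra_op_parallelism_threads(8)
    tf.config.threading.set_inter_op_parallelism_threads(2)

    model = tf.saved_model.load("./saved_model")
    infer = model.signatures["serving_default"]

    batch = np.random.rand(8, 224, 224, 3).astype(np.float32)  # placeholder input
    outputs = infer(tf.constant(batch))   # some signatures want named inputs instead

    # TF logs at startup which CPU instructions (AVX/AVX2/FMA) the binary was
    # built to use; a custom AVX2 build can beat the stock pip wheel.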
You don't provide a lot of detail but I imagine at this point you need to get "creative" and move at least some aspect of your operation out of AWS. Some variation of:
- Buy some hardware and host it at home/office/etc.
- Buy some hardware and put it in a colocation facility.
- Buy a lot of hardware and put it in a few places.
Etc.
Cash and accounting is another problem. Hardware manufacturers offer financing (leasing). Third party finance companies offer lines of credit, special leasing, etc. Even paying cash outright can (in certain cases) be beneficial from a tax standpoint. If you're in the US there's even the best of both worlds: a Section 179 deduction on a lease!
https://www.section179.org/section_179_leases/
You don't even need to get your hands dirty. Last I checked it was pretty easy to get financing from Dell, pay next to nothing to get started, and have hardware shipped directly to a colocation facility. Remote hands rack and configure it for you. You get a notification with a system to log into, just like an AWS instance. All in at a fraction of the cost. The dreaded (actually very rare) hardware failure? That's what the warranty is for. Dell will dispatch people to the facility and replace XYZ as needed. You never need to physically touch anything.
A little more complicated than creating an AWS account with a credit card number? Of course. More management? Slightly. But at the end of the day it's a fraction of the total cost and probably even advantageous from a taxation standpoint.
AWS and public clouds really shine in some use cases and absolutely suck at others (as in suck the cash right out of your pockets).
Go for some colocation facility where costs are predictable.
A balanced approach is to only put the most expensive hardware portion of the business with the smallest availability requirement in colo, and horizontally scale it over time. Simultaneously use a cloud provider to execute on the cheap stuff fast and reliably.
And when they aren't the best, it's often because you don't know what you're doing.
It's all too common for people to over-provision, or to go with too many services when they don't need to.
Like: let's have a database, a cache service, and a search service, when 95% of the time they only need the database, because it can do full-text search well enough, they don't have the traffic to warrant caching in Redis, and basic caching covers the rest.
They don't take advantage of auto scaling groups, and they run over-provisioned instances 24/7.
I've seen database instances where, when things get slow, people throw more hardware at it instead of optimising the queries and analysing / adding indexes.
The biggest cloud-provider cost is outbound data. The rest is almost always the developers' problem.
That is, I am not sure "public cloud, if you spend lots of effort to optimize it and ask devs to be careful, can be as cheap as a naive on-prem implementation where devs don't need to be careful" is an argument for public cloud.
We really need some more details on your infrastructure, but I assume it's EC2 instance cost that skyrocketed?
A couple of pointers:
- Experiment with different GPU instance types.
- Try Inferentia [1], a dedicated ML chip. Most popular ML frameworks are supported by the Neuron compiler.
Assuming you manage your instances in an auto scaling group (ASG):
- Enable a target tracking scaling policy [2] to reactively scale your fleet. The best scaling metric depends on your inference workload (a rough sketch is below the links).
- If your workload is predictable (e.g. high traffic during the day, low traffic at night), enable predictive scaling [3].
[1] https://aws.amazon.com/machine-learning/inferentia/
[2] https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-sca...
[3] https://docs.aws.amazon.com/autoscaling/plans/userguide/how-...
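The sketch mentioned above: creating a target tracking policy from Python with boto3. The ASG name, policy name, and 60% target are assumptions; average CPU is only a reasonable proxy if your inference is CPU-bound, otherwise publish a custom metric (e.g. queue depth per instance) and track that instead.

    import boto3

    autoscaling = boto3.client("autoscaling")

    autoscaling.put_scaling_policy(
        AutoScalingGroupName="inference-asg",          # assumed ASG name
        PolicyName="target-tracking-cpu-60",
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration={
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 60.0,                       # keep the fleet near 60% CPU
        },
    )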
It's unlikely that your users are going to notice the accuracy difference between a linear model and the GPU-intensive one unless you are doing computer vision. If you have small datasets, you might even find the linear model works better.
So it won't affect revenue, but it will cut costs to almost nothing.
Supporting evidence: I just completed this kind of migration for a Bay Area client (even though I live in Australia). Training (for all customers simultaneously) now runs on a single t3.small, replacing a very large and complicated setup.
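For illustration, a minimal sketch of the kind of linear baseline I mean, assuming scikit-learn and tabular features; X_train / y_train / X_test / y_test are placeholders for whatever the GPU model currently consumes:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # X_train, y_train, X_test, y_test: your existing features and labels.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    print(model.score(X_test, y_test))    # compare against the GPU model's metric
    probs = model.predict_proba(X_test)   # per-row inference takes microseconds on a t3.small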
One piece of advice: speak to your AWS rep immediately. Get credits to redesign your system and keep you running. You can expect up to seven digits in credits (for real!) and support for a year for free; they really want to help you avoid this.
AWS has always been eager to get on the phone with me to discuss cost savings strategies. And they don’t upsell you in the process.
We bought 2 Dell servers via their financing program. Each server is about $19-25K. We paid AWS $60K per month before that. We pay $600 for colocation.
So my advice is to try to get hardware through a vendor's financing; Dell had a good program, I think.
Cloud servers are a "luxury" that most don't realise is a luxury and just take for granted. Having said that, there are obvious overheads to handling your own servers, but when your costs amount to several salaries it's probably worth considering.
I have loud and angry thoughts about this; https://www.lastweekinaws.com/blog/ has a bunch of pieces, some of which may be more relevant than others. The slightly-more-serious corporate side of the house is at https://www.duckbillgroup.com/blog/, if you can stomach a slight decline in platypus.
I'm CTO of an AI image processing company, so I speak from experience here.
I personally use Hetzner.de and their colo plans are very affordable, while still giving you multi-gigabit internet uplinks per server. If you insist on renting, Hetzner also offers rental plans for customer-specified hardware on request. The only downside is that if you call a Hetzner-hosted TensorFlow model from an AWS East frontend instance, you'll have 80-100 ms of round-trip latency on the RPC/HTTP call. But the insane cost savings over the cloud might make that negligible.
Also, have you considered converting your models from GPU to CPU? They might still be almost as fast, and affordable CPU hosting is much easier to find than GPU options.
I'm happy to talk with you about the specifics of our / your deployment via email, if that helps. But let me warn you that my past experience with AWS and Google Cloud performance and pricing, in addition to suffering through low uptime at their hands, has made me somewhat of a cloud opponent for compute- or data-heavy deployments.
So unless your spend is high enough to negotiate a custom SLA, I would assume that your cloud uptime isn't any better than that of halfway decent bare-metal servers.