Cloud cost optimisation is underrated. At the companies I've worked at, nobody has really given a shit (at least not under normal economic circumstances). In the industry there's a strong avoidance of ARM compute instances for no good reason. If I were building from scratch today I would definitely go with Graviton.
At $dayjob I found an unused box in the cloud running an expensive database engine. It had sat idle for months, created for a consultant on a project that had since wound up. On top of that, the consultant had left his consultancy.
I was told in no uncertain terms not to even think of touching this VM because “the budget has been approved”.
I was shocked at the flagrant waste of money and assumed it was a one-off aberration.
Nope, for months afterwards I kept hearing the same refrain from manager after manager, from product owners and dev team leads.
“Don’t touch! We fought hard for this budget! You’ll take it from our cold dead hands!”
Eventually I soured on the whole idea of cloud cost optimisation as a service for unmotivated third parties and gave up on the notion.
FWIW, in these situations you're better off proposing:
"I'm going to reuse this VM, to help our ... fleet scale better."
That way your management continues to use their allocated budget, and your real prod systems work slightly better (they will also eventually require less additional $ to scale up, which helps the company, i.e. the shareholders).
The thing to remember:
You would assume that all middle management really manages is a top line and a bottom line: numbers related to their KPIs/OKRs are roughly the top line, and numbers related to their resources (humans and cloud infra budget) are roughly the bottom line.
The reality: middle management's resources (humans and cloud infra budget) are not their bottom line. Middle management gets rewarded (promoted) for having "enough scope", and scope has roughly always been defined by headcount (it now also includes things like cloud budget). So middle management has to say "we need to do more with less", but they are promoted when these numbers go up!
Is this reward structure in the best interest of companies (i.e. customers and/or shareholders)? No. Is there a better system? Not yet. Was the reward structure created by middle management, for middle management? Likely.
So in the meantime, if you don't want to become unmotivated, you might as well work within the current reward structures.
One of my managers was a champion at that, taking scraps from everywhere for undercover projects. He was one of the few who managed to get things done in a highly bureaucratic company. It also helped that he was good at picking competent people.
Imagine seeing your startup grow into a company where bureaucracy rewards department heads who waste money now to protect their budget so they can keep wasting it next year...
I think the main reason is "I want to run the same binaries locally that I run in the cloud," and it's a pretty valid one. However, it's also an expensive one sometimes.
Anecdotally, this is starting to shift with M1 MacBooks: Graviton looks more attractive, for precisely that architecture-parity reason, to teams where most developers are on M1 devices.
Valid why? Do you not trust compilers? Is it infeasible to (at least occasionally) run automated tests on cloud instances?
Personally I've been pretty used to quite significant differences between local and production environments - it's rarely an issue, and I don't remember CPU architecture ever being one. Things like timezones or firewall restrictions/network differences (including talking to 3rd-party APIs with IP whitelisting) are far more likely to cause problems.
Completely agree... the only exception I've run into is that for small operations build tooling often doesn't work well with arm64.
E.g. GitHub Actions can build a container in a few minutes on x64 but takes 35 minutes on arm64... likewise aws-cdk literally could not run an arm64 Fargate ECS deployment for months after support was added (it simply did not support the required attribute in the container definition).
I would love to see this change, as I've had nothing but great experiences with Graviton for virtually anything ARM-supported.
Are you building on arm64 natively or via qemu? A few minutes vs 35 for roughly the same spec of CPU seems a bit off, even allowing for optimisation considerations.
I've found arm64 builds on amd64 take longer when using one build context/arch (but building for multiple platforms), but that's because it's being emulated.
It's the opposite on my M1: the buildx amd64 build takes longer.
> GitHub actions can build a container in a few minutes in x64 or 35 minutes in arm64
What type of container, and on what runner? That has not been my experience at all, a cross-compiling buildx build with Python and a bunch of libraries takes only slightly longer for arm64 than it did for x86.
First, Graviton is not magic. We switched our main service, a Node.js monolith, and did not get any cost improvement (we had to add more instances to handle the same workload, which ended up being equivalent cost-wise). There are certainly use cases where it's better, but it doesn't seem to be the only and obvious choice for all use cases.
Second, our laptops and our CI are amd64 machines, and being able to run the same Docker images in prod and locally is nice; not having to build the image with qemu on the CI is also good.
I don't mind cloud-ARM, but there definitely are good reasons not to use it (which of course don't apply to everyone)
I just worked on a massive "optimised" cloud migration like you've never seen. We moved from multiple DCs to AWS, and the costs are approximately 8x the pre-migration costs. We were realistically expecting 2x, which would have bought us some regional agility, but the unconstrained growth and misunderstanding of the cost model were terrible. It's designed to be so convoluted that you can't possibly estimate costs until you get the first bill, at which point you are committed to a multi-month or multi-year project. On top of that, the assumption at development time is that the cost is someone else's problem, so the sprawl since the migration is dangerous, which means we can never leave now that we've embraced the PaaS options.
The whole proposition relies on the idea of a sunk cost being accepted.
So yes, back to servers please. IaaS should be the maximal offering a business accepts from a risk perspective, unless the tool or technology is disposable within a 6-month window. There are gains to be had there. PaaS, hell no.
Edit: worth mentioning that AWS support is somewhere near dire. We've had issues with multiple services, and despite being a VERY high roller with enterprise support we can't get anything fixed in any reasonable time. It's just someone else's crap you're using, and they aren't any better at it than you are; they just add lead time to any issue. In some cases I've had to call out outright bad implementations that break functional guarantees provided by open source projects (I can't exactly warn people away from services, as it's pretty obvious who I am if I do). One rule I've developed: if it's not a core service - S3, EC2, EBS, ALB etc. - it's probably a commercial liability in some way. There are no people working on, or with any knowledge of, some major bits of AWS infra.
Rent servers: yes. Host your own: maybe. You can run a whole-ass company on two $70/mo servers from Hetzner (plus some B2 for durable storage) while you figure out whether you have a market or not.
There's just no point in colocating when you're small, because either all the non-server bits cost you for no reason, or you're using something managed, which is just cloud but more annoying.
Switching to Graviton isn't an automatic cost saving. Everything is optimized for x86. It may be cheaper, but it can be significantly slower. We've been trying to migrate for the last year, both for the cost savings and because we switched to ARM-based laptops.
This. We have three people entirely dedicated to reducing costs.
As for avoiding ARM, we do x86-64 only because corporate security policy demands that we have Windows laptops so that some box-ticking overlord can fill out a security policy compliance form. That means we're stuck limping along with Docker and WSL2. Every single engineer in the org already has an arm64 machine at home and wants a proper computer at work, which could ironically work within the same policy framework if anyone gave enough of a shit to deal with it.
So that's why we don't use Graviton: corporate security policies. Our customers will just have to eat the price hikes.
Production artifacts shouldn't be built on developer laptops. I think you're approaching this wrong. You can build and test on x86_64 laptops all day if you want and still easily deploy to arm64 servers.
Agree. I've gone to Graviton instances by default for RDS and ElastiCache (I run and own a DevOps consulting company). The big problem I continue to deal with is native arm64 Docker containers (if you're a cool kid running containers/Kubernetes). For example, the very popular Bitnami charts don't support arm builds even though the community has been screaming for support.
I feel it's currently in beta. I've tried it, and apparently I can't create more than a few instances because my account is "too new", with no clear way to remove that limit. So you're right: you can't have a large bill if you can't even create 10 instances.
> In the industry there's a strong avoidance of ARM compute instances for no good reason.
Not no reason; it adds work and risks incompatibility. Now, that work might be relatively small, and most software these days is compatible with aarch64, but compared to amd64 (which is the de-facto standard, already supported by everything, the default without needing to set anything up) it's still something, and businesses are risk-averse.
The biggest downside I've found with Graviton is that it's gotten popular enough that availability of capacity is a problem in some regions/AZs - particularly if you're using larger EC2 instance types.
Also, Fargate Spot on Graviton is still not available, so if you want to run Spot in non-production environments, you're faced with running different architectures in prod vs. non-prod, which I don't like at all. Do the math on whether it's cheaper for your use case to go x86 spot/non-spot vs. Graviton non-spot.
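To sketch that math with made-up numbers (the prices here are illustrative assumptions, not quotes; pull real ones from the AWS pricing pages for your region):

```python
# Hypothetical hourly prices, for illustration only.
x86_on_demand = 0.10   # x86 task, on-demand
x86_spot      = 0.03   # x86 Spot often runs well below on-demand
graviton      = 0.08   # Graviton on-demand, assumed ~20% below x86

hours = 24 * 30  # one month

# Scenario A: x86 everywhere, Spot in non-prod (same arch in both).
total_a = x86_on_demand * hours + x86_spot * hours

# Scenario B: Graviton everywhere (no Fargate Spot on Graviton).
total_b = graviton * hours + graviton * hours

print(f"x86 on-demand prod + Spot non-prod: ${total_a:.2f}/mo")
print(f"Graviton on-demand everywhere:      ${total_b:.2f}/mo")
```

With these assumed numbers the all-x86 mix wins, because the Spot discount is larger than Graviton's price advantage; with a smaller non-prod footprint the answer can flip.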
I found Graviton to be a mixed bag. It was certainly extremely fast on the very high-end instances: I tested it with a Rust-based message queue system I was writing, and from memory it hit some ridiculously fast number like 8 million messages a second on the fastest available Graviton instance (this was about 18 months ago).
I did try to switch some of my database servers to it a couple of years ago, and after random hangs I gave up and went back to Intel. I tried again further down the track: same thing, random hangs. I assume this sort of thing comes with a new architecture, but I'd be hesitant to move any production infrastructure to it without extensive long-term testing.
In the case of Graviton-based GPU instances, I found that the GPU-enabled software I wanted to use didn't work.
If you are comparing performance, I'd suggest buying a fast AMD machine and running it locally to compare - local servers tend to be much faster and cheaper than cloud. And if your application uses GPUs, it's very much in your interest to run local servers if you possibly can.
Arm has a much looser memory model than x86 (see [1] for a comparison). It's possible that the random hangs are due to a race condition in Postgres that doesn't show up on x86 because memory visibility doesn't require as much synchronization.
There are huge differences in the machine generations. We found that for our workload Graviton3 (c7g) is the best, followed by AMD (m6a), followed by Intel (m6i) with Graviton2 (m6g) somewhat lagging. We can't use Graviton3 however because of memory limitations, so we're using AMD. The difference to the old machine types (m5) is staggering, the m6a is basically twice the performance of m5, while being cheaper.
However, I've seen a lot of benchmarks telling a different story, so it is important to actually measure your workloads.
I’ve been enjoying them here and there but I’ve also found that for some of my workloads a high clock Intel node is required. Even the Epyc nodes couldn’t keep up. I don’t completely know why, never dug too far into it.
What do you want to know? It was a prototype. I was trying to learn Rust (didn't succeed), but I did manage to hack together a message queue that used HTTP for client interaction.
I'd previously written a SQL database message queue in Python which worked with Postgres/MySQL and SQL server. This worked well but it was not fast enough for my liking. My goal was to build the fastest and simplest message queue server that exists, with zero configuration (I hate configuration).
I used Rust with Actix and I tried two strategies - one strategy was to use the plain old file system as a storage back end, with each message in a single file. This was so fast that it easily maxed out the capability of the disk way before the CPU capabilities were maxed out. The advantage of using plain old file system as a storage back end is it requires no configuration at all. So I moved on to a RAM only strategy in which the message queue was entirely ephemeral, leaving the responsibility for message persistence/storage/reliability to the client. This was the configuration that got about 8 million messages a second.
As far as I could tell my prototype left almost all message queue servers in the dust. This is because message queue servers seem to almost all integrate "reliable" message storage - that makes the entire solution much, much more complex and slow. My thinking was to separate the concerns of storage/reliability/delivery and focus my message queue only on message delivery, and push status information back to the client, which could then decide what to do about storage and retries.
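The design being described (an ephemeral, delivery-only queue that pushes persistence and retries back to the client) can be sketched in a few lines. This is a toy Python illustration of the idea, not the author's Rust/Actix code:

```python
from collections import deque

class EphemeralQueue:
    """RAM-only queue: delivery only, no persistence, no retries.
    The client learns whether delivery happened and decides what
    to store or resend itself."""

    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)
        # Status goes back to the client, which may persist msg itself.
        return {"status": "accepted"}

    def consume(self):
        if not self._messages:
            return {"status": "empty", "msg": None}
        return {"status": "delivered", "msg": self._messages.popleft()}

q = EphemeralQueue()
q.publish("hello")
print(q.consume())  # {'status': 'delivered', 'msg': 'hello'}
print(q.consume())  # {'status': 'empty', 'msg': None}
```

All the machinery a broker normally carries (fsync, replication, acknowledgement bookkeeping) disappears, which is where the speed comes from.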
I gave up in the end because I didn't see the point: it wasn't going to make me any money, I was finding Rust frustratingly hard to learn, and I had other things to do.
> local servers tend to be much faster and cheaper than cloud.
Of course, running a server in your house is not going to achieve five or even three 9's of reliability, and even colocating a single rack in a single location might be more expensive than putting that infra in AWS (depending on how data-heavy your use case is, given AWS' exorbitant data transfer costs).
You can hit three nines with a downtime budget of roughly a minute and a half a day, or ten minutes a week. It's really not as hard to hit as it sounds. For a compute-heavy process that isn't end-user facing (e.g. batch processing) it's perfectly viable.
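The downtime budgets are easy to compute (a quick sketch, approximating a month as 30 days):

```python
def downtime_minutes(nines, period_minutes):
    """Allowed downtime per period at an availability of `nines` nines."""
    unavailability = 10 ** (-nines)   # e.g. three nines -> 0.001
    return period_minutes * unavailability

day, week, month = 24 * 60, 7 * 24 * 60, 30 * 24 * 60

print(f"three nines: {downtime_minutes(3, day):.2f} min/day, "
      f"{downtime_minutes(3, week):.2f} min/week")
print(f"five nines:  {downtime_minutes(5, month):.2f} min/month")
```

Three nines allows roughly 1.44 minutes a day or about 10 minutes a week; five nines allows well under a minute a month, which is why it's a different league entirely.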
I'm interested in hearing more about their switching to Graviton with Clickhouse.
We've been testing Clickhouse on Graviton and the performance isn't there, due to a variety of reasons - most notably, it seems, because Clickhouse for arm64 is cross-compiled and JIT isn't enabled like it is for amd64 [1].
For AWS managed resources definitely use Graviton. But for spot instances in EC2 we've found better pricing and greater availability by staying on x86. (We run 100% of our web services and background workers on spot instances).
Same experience here. I cut over a whole bunch of instances to Graviton a while back and it "just worked" for a lot of our workloads. Test it first, obv.
Another easy cost-savings switch is telling Terraform to create EC2 root volumes using gp3 rather than gp2 (gp3 is ~10% cheaper and theoretically more performant). The AWS API and Terraform both still default to gp2 (last I checked), and I wonder how many places are paying a premium for it.
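A rough sense of the saving, assuming the commonly cited us-east-1 list prices of $0.10/GB-month for gp2 and $0.08/GB-month for gp3 (verify against the EBS pricing page; gp3 bills extra provisioned IOPS/throughput separately, which can eat into the difference):

```python
gb_per_volume = 100    # typical root volume size, assumed
fleet_size    = 200    # number of instances, assumed

gp2_per_gb = 0.10      # $/GB-month, assumed list price
gp3_per_gb = 0.08      # $/GB-month, assumed list price

monthly_gp2 = gb_per_volume * gp2_per_gb * fleet_size
monthly_gp3 = gb_per_volume * gp3_per_gb * fleet_size
saving = monthly_gp2 - monthly_gp3

print(f"gp2: ${monthly_gp2:.0f}/mo  gp3: ${monthly_gp3:.0f}/mo  "
      f"saving: ${saving:.0f}/mo ({saving / monthly_gp2:.0%})")
```

It's a small per-volume number, but it's a one-line Terraform change and it compounds across a fleet.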
AWS Graviton is interesting because it is a pretty different machine from their AMD and Intel offerings. A "16 vCPU" machine from AWS is 8 cores/16 threads, not 16 cores - except for Graviton, which actually has 16 cores, although much weaker ones. So for problems where the cores can actually work in parallel, Graviton can keep up with AMD and Intel while being somewhat cheaper. In single-threaded workloads you get about half the performance.
The second thing I'm curious about is this very AWS-heavy approach: ECS, CodeDeploy, ElastiCache. If I were their architect, I would probably go EKS, GitHub/Lab, Redis on EKS, just for the peace of mind.
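To make the vCPU arithmetic above concrete, one hedged way to compare is price per physical core (the hourly prices here are placeholders, not real quotes):

```python
# Placeholder hourly prices -- look up real on-demand rates for your region.
x86_price,  x86_vcpus  = 0.68, 16   # 16 vCPUs = 8 physical cores + SMT
grav_price, grav_vcpus = 0.55, 16   # 16 vCPUs = 16 physical cores

x86_cores  = x86_vcpus // 2         # two hardware threads per core
grav_cores = grav_vcpus             # one thread per core on Graviton

x86_per_core  = x86_price / x86_cores
grav_per_core = grav_price / grav_cores

print(f"x86:      ${x86_per_core:.4f} per core-hour")
print(f"Graviton: ${grav_per_core:.4f} per core-hour")
```

A Graviton core is weaker, so the per-core price still has to be weighed against measured per-core throughput for your workload.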
ECS is so much simpler to use and understand than Kubernetes, even on EKS.
But as for CodeDeploy... IMHO the only reason to use it is "I don't want to deal with another vendor" due to procurement/compliance hell in large companies.
We have a very similar story at my org. We run around 100 RDS Aurora clusters and switched to Graviton. I'm surprised to see 35% gains here; we saw more like 10-15%. But since Amazon natively supports MySQL on Aurora, we didn't have to worry about compatibility. Our main win was the way we wrote our infrastructure as code: switching instance types or services is a fairly simple task, so we have switched instance types a couple of times in the past and could easily make dev use t3s.
Getting onto the cloud is a trap if you treat it as the usual "we deploy on the servers and we're done" situation. Put weight on writing good code to manage your infra so you can adopt optimizations as they occur. Otherwise the expense will ramp up soon.
See: https://www.docker.com/blog/multi-arch-images/
1: https://www.nickwilcox.com/blog/arm_vs_x86_memory_model/
GCP, Azure, Supabase, Cloudflare etc if you want managed services.
If you want a mix of managed services and raw compute, look more at Fly.io, Linode, Digital Ocean perhaps?
I have found the case for AWS being the "cheapest", or even "reasonable", in the cost department to grow slimmer every year.
They've had senior staff on HN justifying security lapses that commenters were describing as a "clownshoes operation".
https://uptime.is/
Also, most cloud providers don't guarantee five nines anyway. The GCE SLA is 99.5% on a single instance, 99.99% on a region:
https://cloud.google.com/compute/sla
1. https://fosstodon.org/@manish/109397948927679076