Cloud cost optimisation is underrated. At the companies I've worked at, nobody has really given a shit (at least not under normal economic circumstances). In the industry there's a strong avoidance of ARM compute instances for no good reason. If I were building from scratch today I would definitely go with Graviton.
At $dayjob I found an unused box in the cloud running an expensive database engine. It had sat idle for months, created for a consultant on a project that had since wound up. On top of that, the consultant had left his consultancy.
I was told in no uncertain terms not to even think of touching this VM because “the budget has been approved”.
I was shocked at the flagrant waste of money and assumed it was a one-off aberration.
Nope, for months afterwards I kept hearing the same refrain from manager after manager, from product owners and dev team leads.
“Don’t touch! We fought hard for this budget! You’ll take it from our cold dead hands!”
Eventually I soured on the whole idea of cloud cost optimisation as a service for unmotivated third parties and gave up on the notion.
FWIW, in these situations you're better off proposing:
"I'm going to reuse this VM, to help our ... fleet scale better."
That way your management continues to use their allocated budget, and your real prod systems work slightly better (they will also eventually require less additional $ to scale up, which helps the company, i.e. the shareholders).
The thing to remember:
You would assume that all middle management really manages is a top line and a bottom line: numbers related to their KPIs/OKRs are roughly the top line, and numbers related to their resources (humans and cloud infra budget) are roughly the bottom line.
The reality: middle management's resources (humans and cloud infra budget) are not their bottom line. Middle management gets rewarded (promoted) for having "enough scope", and scope has roughly always been defined by headcount (it now also includes things like cloud budget). So middle management has to say "we need to do more with less", but they are promoted when these numbers go up!
Is this reward structure in the best interest of companies (i.e. customers and/or shareholders)? No. Is there a better system? Not yet. Was the reward structure created by middle management, for middle management? Likely.
So in the meantime, if you don't want to become unmotivated, you might as well work within the current reward structures.
One of my managers was a champion at that, taking scraps from everywhere for undercover projects. He was one of the few who managed to get things done in a highly bureaucratic company. It also helped that he was good at picking competent people.
Imagine seeing your startup grow into a company where bureaucracy rewards department heads who waste money now to protect their budget so they can keep wasting it next year...
I think the main reason is "I want to run the same binaries locally that I run in the cloud," and it's a pretty valid one. However, it's also an expensive one sometimes.
Anecdotally, this is starting to shift with M1 MacBooks: Graviton looks more attractive, for precisely that architecture-parity reason, to teams where most developers are on M1 devices.
Valid why? Do you not trust compilers? Is it infeasible to (at least occasionally) run automated tests on cloud instances?
Personally I've been pretty used to quite significant differences between local and production environments - it's rarely an issue, and I don't remember CPU architecture ever being one. Things like timezones or firewall restrictions/network differences (including talking to 3rd-party APIs with IP whitelisting) are far more likely to cause problems.
Completely agree... the only exception I've run into is that for small operations build tooling often doesn't work well with arm64.
E.g. GitHub Actions can build a container in a few minutes on x64 but takes 35 minutes on arm64... likewise aws-cdk literally could not run an arm64 Fargate ECS deployment for months after support was added (it simply did not support the required attribute in the container definition).
I would love to see this change, as I've had nothing but great experiences with Graviton for virtually anything ARM-supported.
Are you building on arm64 natively or via qemu? A few minutes vs 35 for roughly the same spec of CPU seems a bit off, even allowing for optimisation considerations.
I've found arm64 builds on amd64 take longer when using one build context/arch (but building for multiple platforms), but that's because it's being emulated.
It's the opposite on my M1: the buildx amd64 build takes longer.
> GitHub actions can build a container in a few minutes in x64 or 35 minutes in arm64
What type of container, and on what runner? That has not been my experience at all, a cross-compiling buildx build with Python and a bunch of libraries takes only slightly longer for arm64 than it did for x86.
First, Graviton is not magic. We switched our main service, a Node.js monolith, and did not get any cost improvement (we had to add more instances to handle the same workload, which ended up being equivalent cost-wise). There are certainly use cases where it's better, but it doesn't seem to be the only and obvious choice for all use cases.
Second, our laptops and our CI are amd64 machines, and being able to run the same Docker images in prod and locally is nice; not having to build the image with qemu on the CI is also good.
I don't mind cloud-ARM, but there definitely are good reasons not to use it (which of course don't apply to everyone)
I just worked on a massive "optimised" cloud migration like you've never seen. We moved from multiple DCs to AWS, and the costs are approximately 8x the pre-migration costs. We were realistically expecting 2x, which would have bought us some regional agility, but the unconstrained growth and misunderstanding of the cost model were terrible. It's designed to be so convoluted that you can't possibly estimate costs until you get the first bill, at which point you are committed to a multi-month or multi-year project. On top of that, the assumption at development time is that the cost is someone else's problem, so the sprawl since the migration is dangerous, which means we can never leave now that we've embraced the PaaS options.
The whole proposition relies on the idea of a sunk cost being accepted.
So yes, back to servers please. IaaS should be the maximal offering a business accepts from a risk perspective, unless the tool or technology is disposable within a 6-month window. There are gains to be had there. PaaS, hell no.
Edit: worth mentioning that AWS support is somewhere near dire. We've had issues with multiple services, and despite being a VERY high roller with enterprise support we can't get anything fixed in any reasonable time. It's just someone else's crap you're using, and they aren't any better at it than you are; they just add lead time to any issue. In some cases I've had to call out outright bad implementations that break functional guarantees provided by open source projects (I can't exactly warn people away from services, as it's pretty obvious who I am if I do). One rule I've developed: if it's not a core service - S3, EC2, EBS, ALB etc. - it's probably a commercial liability in some way. There are no people working on, or with any knowledge of, some major bits of AWS infra.
Rent servers: yes. Host your own: maybe. You can run a whole-ass company on two $70/mo servers from Hetzner (plus some B2 for durable storage) while you figure out whether you have a market or not.
There's just no point in colocating when you're small, because either all the non-server bits cost you for no reason, or you're using something managed, which is just cloud but more annoying.
Switching to Graviton isn't an automatic cost saving. Everything is optimized for x86. It may be cheaper, but it can be significantly slower. We've been trying to migrate for the last year, both for the cost savings and because we switched to ARM-based laptops.
This. We have three people entirely dedicated to reducing costs.
As for avoiding ARM, we do x86-64 only because corporate security policy demands that we have Windows laptops so that some box-ticking overlord can fill out a security policy compliance form. That means we're stuck limping along with Docker and WSL2. Every single engineer in the org already has an arm64 machine at home and wants a proper computer at work, which could ironically work within the same policy framework if anyone gave enough of a shit to deal with it.
So that's why we don't use Graviton: corporate security policies. Our customers will just have to eat the price hikes.
Production artifacts shouldn't be built on developer laptops. I think you're approaching this wrong. You can build and test on x86_64 laptops all day if you want and still easily deploy to arm64 servers.
Agree. I've gone to Graviton instances by default for RDS and ElastiCache (I run and own a DevOps consulting company). The big problem I continue to deal with is native arm64 Docker containers (if you're a cool kid running containers/Kubernetes). For example, the very popular Bitnami charts don't support arm builds even though the community has been screaming for support.
I feel it's currently in beta. I've tried it, and apparently I can't create more than a few instances because my account is "too new", with no clear way to remove that limit. So you're right: you can't have a large bill if you can't even create 10 instances.
> In the industry there's a strong avoidance of ARM compute instances for no good reason.
Not no reason; it adds work and risks incompatibility. Now, that work might be relatively small, and most software these days is compatible with aarch64, but compared to amd64 (which is the de-facto standard, already supported by everything, the default without needing to set anything up) it's still something, and businesses are risk-averse.
The biggest downside I've found with Graviton is that it's gotten popular enough that availability of capacity is a problem in some regions/AZs - particularly if you're using larger EC2 instance types.
Also, Fargate Spot on Graviton is still not available, so if you want to run Spot in non-production environments, you're faced with running different architectures in prod vs. non-prod, which I don't like at all. Do the math on whether it's cheaper for your use case to go x86 spot/non-spot vs. Graviton non-spot.
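To sketch that math with made-up numbers (the prices here are illustrative assumptions, not quotes; pull real ones from the AWS pricing pages for your region):

```python
# Hypothetical hourly prices, for illustration only.
x86_on_demand = 0.10   # x86 task, on-demand
x86_spot      = 0.03   # x86 Spot often runs well below on-demand
graviton      = 0.08   # Graviton on-demand, assumed ~20% below x86

hours = 24 * 30  # one month

# Scenario A: x86 everywhere, Spot in non-prod (same arch in both).
total_a = x86_on_demand * hours + x86_spot * hours

# Scenario B: Graviton everywhere (no Fargate Spot on Graviton).
total_b = graviton * hours + graviton * hours

print(f"x86 on-demand prod + Spot non-prod: ${total_a:.2f}/mo")
print(f"Graviton on-demand everywhere:      ${total_b:.2f}/mo")
```

With these assumed numbers the all-x86 mix wins, because the Spot discount is larger than Graviton's price advantage; with a smaller non-prod footprint the answer can flip.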
I found Graviton to be a mixed bag. It was certainly extremely fast on the very high-end instances: I tested it with a Rust-based message queue system I was writing, and from memory it hit some ridiculously fast number like 8 million messages a second on the fastest available Graviton instance (this was about 18 months ago).
I did try to switch some of my database servers to it a couple of years ago, and after random hangs I gave up and went back to Intel. I tried again further down the track: same thing, random hangs. I assume this sort of thing comes with a new architecture, but I'd be hesitant to move any production infrastructure to it without extensive long-term testing.
In the case of Graviton-based GPU instances, I found that the GPU-enabled software I wanted to use didn't work.
If you are comparing performance, I'd suggest buying a fast AMD machine and running it locally to compare - local servers tend to be much faster and cheaper than cloud. And if your application uses GPUs, it's very much in your interest to run local servers if you possibly can.
Arm has a much looser memory model than x86 (see [1] for a comparison). It's possible that the random hangs are due to a race condition in Postgres that doesn't show up on x86 because memory visibility doesn't require as much synchronization.
There are huge differences in the machine generations. We found that for our workload Graviton3 (c7g) is the best, followed by AMD (m6a), followed by Intel (m6i) with Graviton2 (m6g) somewhat lagging. We can't use Graviton3 however because of memory limitations, so we're using AMD. The difference to the old machine types (m5) is staggering, the m6a is basically twice the performance of m5, while being cheaper.
However, I've seen a lot of benchmarks telling a different story, so it is important to actually measure your workloads.
I’ve been enjoying them here and there but I’ve also found that for some of my workloads a high clock Intel node is required. Even the Epyc nodes couldn’t keep up. I don’t completely know why, never dug too far into it.
What do you want to know? It was a prototype. I was trying to learn Rust (didn't succeed), but I did manage to hack together a message queue that used HTTP for client interaction.
I'd previously written a SQL database message queue in Python which worked with Postgres/MySQL and SQL server. This worked well but it was not fast enough for my liking. My goal was to build the fastest and simplest message queue server that exists, with zero configuration (I hate configuration).
I used Rust with Actix and I tried two strategies - one strategy was to use the plain old file system as a storage back end, with each message in a single file. This was so fast that it easily maxed out the capability of the disk way before the CPU capabilities were maxed out. The advantage of using plain old file system as a storage back end is it requires no configuration at all. So I moved on to a RAM only strategy in which the message queue was entirely ephemeral, leaving the responsibility for message persistence/storage/reliability to the client. This was the configuration that got about 8 million messages a second.
As far as I could tell my prototype left almost all message queue servers in the dust. This is because message queue servers seem to almost all integrate "reliable" message storage - that makes the entire solution much, much more complex and slow. My thinking was to separate the concerns of storage/reliability/delivery and focus my message queue only on message delivery, and push status information back to the client, which could then decide what to do about storage and retries.
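The design being described (an ephemeral, delivery-only queue that pushes persistence and retries back to the client) can be sketched in a few lines. This is a toy Python illustration of the idea, not the author's Rust/Actix code:

```python
from collections import deque

class EphemeralQueue:
    """RAM-only queue: delivery only, no persistence, no retries.
    The client learns whether delivery happened and decides what
    to store or resend itself."""

    def __init__(self):
        self._messages = deque()

    def publish(self, msg):
        self._messages.append(msg)
        # Status goes back to the client, which may persist msg itself.
        return {"status": "accepted"}

    def consume(self):
        if not self._messages:
            return {"status": "empty", "msg": None}
        return {"status": "delivered", "msg": self._messages.popleft()}

q = EphemeralQueue()
q.publish("hello")
print(q.consume())  # {'status': 'delivered', 'msg': 'hello'}
print(q.consume())  # {'status': 'empty', 'msg': None}
```

All the machinery a broker normally carries (fsync, replication, acknowledgement bookkeeping) disappears, which is where the speed comes from.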
I gave up in the end because I didn't see the point: it wasn't going to make me any money, I was finding Rust frustratingly hard to learn, and I had other things to do.
> local servers tend to be much faster and cheaper than cloud.
Of course, running a server in your house is not going to achieve five or even three 9's of reliability, and even colocating a single rack in a single location might be more expensive than putting that infra in AWS (depending on how data-heavy your use case is, given AWS' exorbitant data transfer costs).
You can hit three nines with a downtime budget of roughly a minute and a half a day, or ten minutes a week. It's really not as hard to hit as it sounds. For a compute-heavy process that isn't end-user facing (e.g. batch processing) it's perfectly viable.
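The downtime budgets are easy to compute (a quick sketch, approximating a month as 30 days):

```python
def downtime_minutes(nines, period_minutes):
    """Allowed downtime per period at an availability of `nines` nines."""
    unavailability = 10 ** (-nines)   # e.g. three nines -> 0.001
    return period_minutes * unavailability

day, week, month = 24 * 60, 7 * 24 * 60, 30 * 24 * 60

print(f"three nines: {downtime_minutes(3, day):.2f} min/day, "
      f"{downtime_minutes(3, week):.2f} min/week")
print(f"five nines:  {downtime_minutes(5, month):.2f} min/month")
```

Three nines allows roughly 1.44 minutes a day or about 10 minutes a week; five nines allows well under a minute a month, which is why it's a different league entirely.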
I'm interested in hearing more about their switching to Graviton with Clickhouse.
We've been testing Clickhouse on Graviton and the performance isn't there, due to a variety of reasons - most notably, it seems, because Clickhouse for arm64 is cross-compiled and JIT isn't enabled like it is for amd64 [1].
For AWS managed resources definitely use Graviton. But for spot instances in EC2 we've found better pricing and greater availability by staying on x86. (We run 100% of our web services and background workers on spot instances).
Same experience here. I cut over a whole bunch of instances to Graviton a while back and it "just worked" for a lot of our workloads. Test it first, obv.
Another easy cost-savings switch is telling Terraform to create EC2 root volumes using gp3 rather than gp2 (gp3 is ~10% cheaper and theoretically more performant). The AWS API and Terraform both still default to gp2 (last I checked), and I wonder how many places are paying a premium for it.
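A rough sense of the saving, assuming the commonly cited us-east-1 list prices of $0.10/GB-month for gp2 and $0.08/GB-month for gp3 (verify against the EBS pricing page; gp3 bills extra provisioned IOPS/throughput separately, which can eat into the difference):

```python
gb_per_volume = 100    # typical root volume size, assumed
fleet_size    = 200    # number of instances, assumed

gp2_per_gb = 0.10      # $/GB-month, assumed list price
gp3_per_gb = 0.08      # $/GB-month, assumed list price

monthly_gp2 = gb_per_volume * gp2_per_gb * fleet_size
monthly_gp3 = gb_per_volume * gp3_per_gb * fleet_size
saving = monthly_gp2 - monthly_gp3

print(f"gp2: ${monthly_gp2:.0f}/mo  gp3: ${monthly_gp3:.0f}/mo  "
      f"saving: ${saving:.0f}/mo ({saving / monthly_gp2:.0%})")
```

It's a small per-volume number, but it's a one-line Terraform change and it compounds across a fleet.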
AWS Graviton is interesting because it is a pretty different machine from their AMD and Intel offerings. A "16 vCPU" machine from AWS is 8 cores/16 threads, not 16 cores - except for Graviton, which actually has 16 cores, although much weaker ones. So for problems where the cores can actually work in parallel, Graviton can keep up with AMD and Intel while being somewhat cheaper. In single-threaded workloads you get about half the performance.
The second thing I'm curious about is this very AWS-heavy approach: ECS, CodeDeploy, ElastiCache. If I were their architect, I would probably go EKS, GitHub/Lab, Redis on EKS, just for the peace of mind.
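To make the vCPU arithmetic above concrete, one hedged way to compare is price per physical core (the hourly prices here are placeholders, not real quotes):

```python
# Placeholder hourly prices -- look up real on-demand rates for your region.
x86_price,  x86_vcpus  = 0.68, 16   # 16 vCPUs = 8 physical cores + SMT
grav_price, grav_vcpus = 0.55, 16   # 16 vCPUs = 16 physical cores

x86_cores  = x86_vcpus // 2         # two hardware threads per core
grav_cores = grav_vcpus             # one thread per core on Graviton

x86_per_core  = x86_price / x86_cores
grav_per_core = grav_price / grav_cores

print(f"x86:      ${x86_per_core:.4f} per core-hour")
print(f"Graviton: ${grav_per_core:.4f} per core-hour")
```

A Graviton core is weaker, so the per-core price still has to be weighed against measured per-core throughput for your workload.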
ECS is so much simpler to use and understand than Kubernetes, even on EKS.
But as for CodeDeploy... IMHO the only reason to use it is "I don't want to deal with another vendor" due to procurement/compliance hell in large companies.
We have a very similar story at my org. We run around 100 RDS Aurora clusters and switched to Graviton. I'm surprised to see 35% gains here; we saw more like 10-15%. But since Amazon natively supports MySQL on Aurora, we didn't have to worry about compatibility. Our main win was the way we wrote our infrastructure as code: switching instance types or services is a fairly simple task, so we have switched instance types a couple of times in the past and could easily make dev use t3s.
Getting onto the cloud is a trap if you treat it as the usual "we deploy on the servers and we're done" situation. Put weight on writing good code to manage your infra so you can adopt optimizations as they occur. Otherwise the expense will ramp up soon.
See: https://www.docker.com/blog/multi-arch-images/
1: https://www.nickwilcox.com/blog/arm_vs_x86_memory_model/
GCP, Azure, Supabase, Cloudflare etc if you want managed services.
If you want a mix of managed services and raw compute, look more at Fly.io, Linode, Digital Ocean perhaps?
I have found the case for AWS being the "cheapest", or even "reasonable", in the cost department to grow slimmer every year.
They've had senior staff on HN justifying security lapses that commenters were describing as a "clownshoes operation".
https://uptime.is/
Also, most cloud providers don't guarantee five nines anyway. The GCE SLA is 99.5% on a single instance, 99.99% on a region:
https://cloud.google.com/compute/sla
1. https://fosstodon.org/@manish/109397948927679076