bob1029 · 3 months ago
I think if you need something more reliable than us-east-1, you should be hosting on-prem in facilities you own and operate.

There aren't that many businesses that truly can't handle the worst case (so far) AWS outage. Payment processing is the strongest example I can come up with that is incompatible with the SLA that a typical cloud provider can offer. Visa going down globally for even a few minutes might be worse than a small town losing its power grid for an entire week.

It's a hell of a lot easier to just go down with everyone else, apologize on Twitter, and enjoy a forced snow day. Don't let it frustrate you. Stay focused on the business and customer experience. It's not ideal to be down, but there are usually much bigger problems to solve. Chasing an extra x% of uptime per year is usually not worth a multicloud/region clusterfuck. These tend to be even less resilient on average.

jl6 · 3 months ago
> worst case (so far)

It’s kind of amazing that after nearly 20 years of “cloud”, the worst case so far still hasn’t been all that bad. Outages are the mildest type of incident. A true cloud disaster would be something like a major S3 data loss event, or a compromise of the IAM control plane. That’s what it would take for people to take multi-region/multi-cloud seriously.

svelle · 3 months ago
> A true cloud disaster would be something like a major S3 data loss event

So like the OVH data center fire back in 2021?

roger01 · 3 months ago
> compromise of the IAM control plane

You mean like stealing the master keys for Azure? Oh wait a minute...

dijit · 3 months ago
I mean, EBS went offline and people were ok to continue using AWS…

https://arstechnica.com/information-technology/2011/04/amazo...

nineteen999 · 3 months ago
> It's a hell of a lot easier to just go down with everyone else, apologize on Twitter, and enjoy a forced snow day.

You forget things like emergency services. If we were to rely on AWS (even with a backup/DR zone in another region), and were to go down with everyone else and twiddle our fingers, houses burn down, people die, and our company has to pay abatements to the govt.

kankerlijer · 3 months ago
There are only two kinds of cloud regions: the ones people complain about and the ones nobody uses
joe_the_user · 3 months ago
A sound banker, alas, is not one who foresees danger and avoids it, but one who, when he is ruined, is ruined in a conventional and orthodox way along with his fellows, so that no one can really blame him. JM Keynes
paulddraper · 3 months ago
That is incredibly appropriate.
kachapopopow · 3 months ago
I like this a lot; it's a great comparison for Hetzner's American offerings, since the region isn't big enough for them to bother investing much into, so there aren't that many complaints about it. People just dumping it (me included) after discovering the amount of random issues it has probably doesn't help either.

If you are using Hetzner: avoid everything other than the fra region, and ideally pray that you are placed in the newer part of the datacenter, since it has the upgraded switching spine. I haven't seen the old one in a while, so they might have deprecated it entirely.

jeltz · 3 months ago
Hetzner does not have any "fra region". They have Helsinki, Falkenstein, and Nuremberg in Europe. None of them has any issues as far as I know. They used to have some issues with the very old stuff in Falkenstein.
Manouchehri · 3 months ago
Yeah, I was often the single source of reporting Claude outages (or even missing support completely) on less commonly used Amazon Bedrock regions.
htrp · 3 months ago
Which regions were you using ? ( Thought claude had global inference support that routed to all regions)
vasco · 3 months ago
eu-west-1 is miles better and is huge
yibers · 3 months ago
Ass covering-wise, you are probably better off going down with everyone else on us-east-1. The not so fun alternative: being targeted during an RCA explaining why you chose some random zone no one ever heard of.
rconti · 3 months ago
Places nobody's ever heard of like "Ohio" or "Oregon"?

Yeah, I'm not worried about being targeted in an RCA and pointedly asked why I chose a region with way better uptime than `us-tirefire-1`.

What _is_ worth considering is whether your more carefully considered region will perform better during an actual outage where some critical AWS resource goes down in Virginia, taking my region with it anyway.

xingped · 3 months ago
IIRC, some AWS services are solely deployed on and/or entirely dependent on us-east-1. I don't recall which ones, but I very distinctly remember this coming up once.
kristianc · 3 months ago
I find it funny that we see complaints about why software quality has got worse alongside people advocating to choose objectively risky AWS regions for career risk and blame minimisation reasons.
goalieca · 3 months ago
This was always the case. The OG saying was “no one got fired for buying IBM”. Then it was changed to Microsoft. And so on..

Deleted Comment

throwawaysleep · 3 months ago
They are for the same reason. How do customers react to either? If us-east-1 fails, nobody complains. If Microsoft uses a browser to render components on Windows and eats all of your RAM, nobody complains.
jordanb · 3 months ago
I seem to recall major resource unavailability in us-east-2 during one of the big us-east-1 outages, because people were trying to fail over. Then a week later there was a us-east-2 outage that didn't make the news.

So if you tried to be "smart" and set up in Ohio, you got crushed by the thundering herd coming out of Virginia, and then bitten again because AWS barely cares about your region, and neither does anyone else.

The truth is Amazon doesn't have any real backup for Virginia. They don't have the capacity anywhere else, and the whole geographic distribution scheme is a chimera.

Fhch6HQ · 3 months ago
This is an interesting point. As recently as mid-2023 us-east-2 was 3 campuses with a 5 building design capacity at each. I know they've expanded by multiples since, but us-east-1 would still dwarf them.

Makes one wonder, does us-west-2 have the capacity to take on this surge?

nothrabannosir · 3 months ago
> being targeted during an RCA explaining why you chose some random zone no one ever heard of.

“Duh, because there’s an AZ in us-east-1 where you can’t configure EBS volumes for attachment to fargate launch type ECS tasks, of course. Everybody knows that…”

:p

riffic · 3 months ago
how about following the well-architected framework and building something with a suitable level of 9s where you can justify your decisions during a blameless postmortem (please stamp your buzzword bingo card for a prize.)
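The "suitable level of 9s" above has a concrete downtime budget behind it. A back-of-the-envelope sketch (the hour figures are simple arithmetic, not any official SLA table):

```python
# Annual downtime budget implied by each "nines" availability target.
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

for nines in range(2, 6):
    target = 1 - 10 ** -nines              # e.g. 3 nines -> 0.999
    budget_h = HOURS_PER_YEAR * (1 - target)
    print(f"{nines} nines ({target:.{nines}%}): {budget_h:.2f} h/yr allowed")
```

Two nines buys you roughly 87 hours of outage a year; five nines, about five minutes. That gap is the whole argument for (or against) a multi-region setup.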
paradox460 · 3 months ago
We vibe code everything in flavor of the month node frameworks, tyvm, because elixir is too hard to hire for (or some equally inane excuse)
g947o · 3 months ago
> explaining why you chose some random zone no one ever heard of

Is this from real experience of something that actually happened, or just imagined?

The only things that matter in a decision are:

* Services that are available in the region

* (if relevant and critical) Latency to other services

* SLAs for the region

Everything else is irrelevant.

If you think AWS is so bad that their SLAs are not trustworthy, that's a different problem to solve.

throwawaysleep · 3 months ago
This to me was the real lesson of the outage. A us-east-1 outage is treated like bad weather. A regional outage can be blamed on the dev. us-east-1 is too big to get blamed, which is why it should be the region of choice for an employee.
Esophagus4 · 3 months ago
Bizarre way of making decisions.

us-east-2 is objectively a better region to pick if you want US east, yet you feel safer picking use1 because “I’m safer making a worse decision that everyone understands is worse, as long as everyone else does it as well.”

dontdoxxme · 3 months ago
Why aren't you using IBM cloud?
thejosh · 3 months ago
Bandwidth cost is also another major reason.
bmitch3020 · 3 months ago
This story missed a glaring detail. There are simply more data centers in northern VA [0]. More than the rest of the US by a wide margin, or the entire EU+Asia. Things break here because it's where most things are.

[0]: https://www.datacenters.com/providers/amazon-aws/data-center...

noosphr · 3 months ago
At 34 hours of downtime that's two nines of uptime

At this point my garage is tied for reliability with us-east-1, largely because it got flooded 8 months ago.
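The 34-hour figure checks out as "two nines" with a one-liner:

```python
# 34 hours of downtime over a year, expressed as availability.
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

availability = 1 - 34 / HOURS_PER_YEAR
print(f"{availability:.4%}")  # ~99.61% -- two nines, as claimed
```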

nadis · 3 months ago
Cackling while reading this visiting my family in Northern Virginia for the holidays. Despite its prominent place in the history of the web, it's still the least reliable AWS region (for now).
rayiner · 3 months ago
It's nice to know that where I grew up is Too Big to Fail lol.
davidfstr · 3 months ago
I intentionally avoid using us-east-1 for anything, since I’ve seen so many outages.
temp0826 · 3 months ago
us-east-1 is often a linchpin for services worldwide. Something hinky happening to DNS or DynamoDB in us-east-1 will probably wreck your day regardless of where you set up shop.
david_shaw · 3 months ago
Yes, it's the least reliable. Thanks for summarizing the data here to illustrate the issue.

It's often seen as the "standard" or "default" region to use when spinning up new US-based AWS services, is the oldest AWS center, has the most interconnected systems, and likely has the highest average load.

It makes sense that us-east-1 has reliability problems, but I wish Amazon were a little more upfront about some of the risks of choosing that region.

Forgeties79 · 3 months ago
Nobody ever got fired for connecting to us-east-1