Readit News
Posted by u/riknox 4 years ago
Tell HN: AWS appears to be down again
Console is flickering between "website is unavailable" and being up for my team. This is happening very frequently just now, reliability seems to have taken a hit.
aledalgrande · 4 years ago
If you haven't seen yet, news is it was a power loss:

> 5:01 AM PST We can confirm a loss of power within a single data center within a single Availability Zone (USE1-AZ4) in the US-EAST-1 Region. This is affecting availability and connectivity to EC2 instances that are part of the affected data center within the affected Availability Zone. We are also experiencing elevated RunInstance API error rates for launches within the affected Availability Zone. Connectivity and power to other data centers within the affected Availability Zone, or other Availability Zones within the US-EAST-1 Region are not affected by this issue, but we would recommend failing away from the affected Availability Zone (USE1-AZ4) if you are able to do so. We continue to work to address the issue and restore power within the affected data center.
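For anyone wondering what "failing away" from an AZ looks like mechanically, here is a minimal pure-Python sketch. The instance IDs and the inventory dict are invented for illustration; in practice the inventory would come from something like the EC2 DescribeInstances API, keyed by AZ ID rather than AZ name.

```python
# Minimal sketch: given an inventory mapping instance IDs to AZ IDs,
# list the instances that need to be relaunched outside the bad zone.

AFFECTED_AZ = "use1-az4"

def instances_to_evacuate(inventory, affected_az=AFFECTED_AZ):
    """Return the instance IDs running in the affected Availability Zone."""
    return sorted(i for i, az in inventory.items() if az == affected_az)

# Made-up inventory for illustration:
inventory = {
    "i-0aaa": "use1-az4",
    "i-0bbb": "use1-az1",
    "i-0ccc": "use1-az4",
}
print(instances_to_evacuate(inventory))  # ['i-0aaa', 'i-0ccc']
```

The hard part, of course, is not listing the instances but having the data and state replication in place so that relaunching them elsewhere actually works.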

vinay_ys · 4 years ago
This is quite interesting as they claim their datacenter design does better than Uptime's Tier3+ design requirements which require redundant power supply paths. [https://aws.amazon.com/compliance/uptimeinstitute/]. I really hope they publish a thorough RCA for this incident.
tyingq · 4 years ago
"Electrical power systems are designed to be fully redundant so that in the event of a disruption, uninterruptible power supply units can be engaged for certain functions, while generators can provide backup power for the entire facility." https://aws.amazon.com/compliance/data-center/infrastructure...

So they have 2 different sources of power coming in. And generators. They do mention the UPS is only for "certain functions", so I guess it's not enough to handle full load while generators spin up if the 2 primaries go out. Or perhaps some failure in the source switching equipment (typically called a "static transfer switch").

Some detail on different approaches: https://www.donwil.com/wp-content/uploads/white-papers/Using...

JshWright · 4 years ago
> I really hope they publish a thorough RCA for this incident.

We're still waiting on the RCA for last week's us-west outage...

codeduck · 4 years ago
Another example of a single DC in a single AZ rendering an entire region almost unusable. This has shades of eu-central-1 all over again.
nightpool · 4 years ago
Amazon is claiming the failure is limited to a single AZ. Are you seeing failures for instances outside of that AZ? If not, how has this rendered "the entire region almost unusable"?
SCdF · 4 years ago
So dumb question from someone who hasn't maintained large public infrastructure:

Isn't the whole point of availability zones that you deploy to more than one and support failing over if one fails?

I.e., why are we (consumers) hearing about this or being obviously impacted (e.g. the Epic Games Store is very broken right now)? Is my assessment wrong, or are all these failing apps built wrong? Or something in between?

fulafel · 4 years ago
IME people rarely test and drill for the failovers; it's just a checkbox in a high-level plan. Maybe they have a todo item for it somewhere, but it never seems very important since AZ failures are usually quite rare. After ignoring the issue for a while, it starts to seem risky to test at all, because you might cause an outage from the very bugs the test is likely to uncover.
gpm · 4 years ago
> or are all these apps that are failing built wrong

Deploying to multiple places is more expensive, it's not wrong to choose not to, it's trading off reliability for cost.

It's also unclear to me how often things fail in a way that actually affects only one AZ, but I haven't seen any good statistics either way on that one.

peeters · 4 years ago
As I understand it for something like SQS, Lambda etc, AWS should automatically tolerate an AZ going down. They're responsible for making the service highly available. For something like EC2 though, where a customer is just running a node on AWS, there's no automatic failover. It's a lot more complicated to replicate a running, stateful virtual machine and have it seamlessly failover to a different host. So typically it's up to the developers to use EC2 in a way that makes it easy to relaunch the nodes on a different AZ.
robjan · 4 years ago
That's the theory but in practice very few companies bother because it's expensive, complicated and most workloads or customers can tolerate less than 100% uptime.
sprite · 4 years ago
I thought I was multi-AZ but something failed. I'm mostly running EC2 + RDS, both across two availability zones. I'll have to dig into the problem, but I think the issue is that my RDS setup is one writer instance and one reader instance, each in a different AZ. I guess there was nothing for it to fail over to, since my only other instance was the writer, so I need to keep a third instance available, preferably in a third AZ?
TruthWillHurt · 4 years ago
Amazon shifts the responsibility for multi-AZ deployment to us, the customers, saving themselves complexity and charging us extra - win-win for them.
_joel · 4 years ago
You're supposed to build your app across multiple AZs, but I know a lot of companies that don't do this and shove everything in a single AZ. It's not just about deploying an instance there, but ensuring the consistency of data and state across the AZs.
xyst · 4 years ago
This region in general is a clusterfuck. If companies do not have a disaster recovery and resiliency strategy in place by now, they are just shooting themselves in the foot.
philsnow · 4 years ago
In today's world of stitching together dozens of services, each of which probably does the same thing, how is one to avoid a dependency on us-east-1? Add yet another bullet to the vendor questionnaire (ugh) about whether they are singly-homed / have a failover plan?

It's turtles all the way down, and underneath all the turtles is us-east-1.

notyourday · 4 years ago
We are being told that there are still issues in USE1-AZ4 and that some of the instances are stuck in the wrong state as of 16:15 EST. There's no ETA for resolution.
alostpuppy · 4 years ago
Why do folks host their stuff in us-East? Is there a draw other than organizational momentum?
dragonwriter · 4 years ago
> Why do folks host their stuff in us-East?

Off the top of my head, US-EAST-1 is:

(1) topologically closer to certain customers than other regions (this applies to all regions for different customers),

(2) consistently in the first set of regions to get new features,

(3) usually in the lowest price tier for features whose pricing varies by region,

(4) where certain global (notionally region-agnostic) services are effectively hosted, and where certain interactions with them from region-specific services need to be done.

#4 is a unique feature of US-East-1; #2 and #3 are factors in region selection that can also favor other regions - e.g., for users in the western US, US-West-2 beats US-West-1 on both, which is why some users topologically closer to US-West-1 favor US-West-2.

superdug · 4 years ago
It's the cheapest.
GrumpyNl · 4 years ago
How come they dont have power backups?
chkhd · 4 years ago
"When a fail-safe system fails, it fails by failing to fail-safe." - https://en.wikipedia.org/wiki/Systemantics
redm · 4 years ago
Some datacenter failures aren't related to redundancy. Some examples: 1) a transfer switch failure where you can't switch over to backup generators and the UPS runs out, 2) someone accidentally hits the EPO (emergency power off), 3) maintenance work goes wrong, such as turning off the wrong circuits, 4) cooling doesn't switch over fully to backups, and while your systems have power, it's too hot to run. The list can go on and on.

I'm not sure why this is a big deal though; this is why Amazon has multiple AZs. If you're in one AZ, you take your chances.

taf2 · 4 years ago
It was not a total power loss. Out of 40 instances we had running at the time of the incident, only 5 appeared to be lost to the power outage. The bigger issue for us was that the EC2 API to stop/start those instances appeared to be unavailable (probably because the racks they were in had no power). The other impactful issue was that many of the remaining running instances in the zone had intermittent connectivity out to the internet. Additionally, the incident was made worse by many of our supporting vendors being impacted as well...

IMO it was handled rather well and fast by AWS... not saying we shouldn't beat them up (for a discount), but being honest, this wasn't that bad.

chousuke · 4 years ago
Sometimes, you have a component which fails in such a way that your redundancies can't really help.

I once had to prepare for a total blackout scenario in a datacenter because there was a fault in the power supply system that required bypassing major systems to fix. Had some mistake or fault happened during those critical moments, all power would've been lost.

Well-designed redundancy makes high-impact incidents less likely, but you're not immune to Murphy's law.

trelane · 4 years ago
Anything can fail, even your backup, and especially if it's mechanical.
Spooky23 · 4 years ago
Their datacenter(s) aren’t magic because they are AWS. That facility is probably a decade old and like anything else as it ages the technical and maintenance debt makes management more challenging.
thetinguy · 4 years ago
They do. I remember watching one of their sessions where they showed every rack having its own battery backup.
TrueDuality · 4 years ago
According to the SOC certifications they give their customers they do.
ItsBob · 4 years ago
I've built out many 42U racks in DC's in my time and there were a couple of rules that we never skipped:

1. Dual power in each server/device - one PSU was powered by one outlet, the other PSU by a different one with a different source, meaning we could lose a single power supply/circuit and nothing happens.

2. Dual network (at minimum) - for the same reasons as above, since the switches didn't always have dual power in them.

I've only had a DC fail once, when an engineer performing work on the power circuitry for the DC thought he was taking down one circuit, but in fact picked the wrong one and took both power circuits down at the same time.

However, a power cut (in the traditional sense where the supplier has a failure so nothing comes in over the wire) should have literally zero effect!

What am I missing?

I've never worked anywhere with Amazon's budget, so why are they not handling this? Is it more than just the incoming supply being down?

growse · 4 years ago
> 1. Dual power in each server/device - One PSU was powered by one outlet, the other PSU by a different one with a different source meaning that we can lose a single power supply/circuit and nothing happens

Nothing happens if you remember that your new capacity limit per DC supply is 50% of the actual limit, and you're 100% confident that either of your supplies can seamlessly handle their load suddenly increasing by 100%.

I've seen more than one failure in a DC where they wired it up as you described, had a whole power side fail, followed by the other side promptly also failing because it couldn't handle the sudden new load placed on it.
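The 50% rule above is simple arithmetic; here is a sketch of it (feed ratings and loads are invented for illustration):

```python
# The 2N power rule in arithmetic: with dual feeds, each feed must
# normally run at <= 50% of its rating so that the survivor can
# absorb the entire load when its twin fails.

def survives_feed_loss(total_load_kw, feed_rating_kw):
    """True if a single remaining feed can carry the entire load."""
    return total_load_kw <= feed_rating_kw

# 800 kW across two 1000 kW feeds: 40% each normally, survivor takes all.
print(survives_feed_loss(800, 1000))   # True
# 1400 kW "fits" at 70% per feed, but one surviving feed overloads:
# the cascade failure described above.
print(survives_feed_loss(1400, 1000))  # False
```

Real deployments also derate for inrush and power spikes, so the practical ceiling is usually a bit below 50%.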

dijit · 4 years ago
EDIT: I misunderstood - you were talking about power feeds. The normal case is to run at ~48% load as if it were 100%, both because of power spikes and because most types of transformers run most efficiently under specific levels of load (40-60%).

Normally this is factored into the rack you buy from a hardware provider: they will tell you that you have 10A or 16A on each feed. If you exceed that, it will work, but you are overloading their feed and they might complain about it.

notyourday · 4 years ago
> I've only had a DC fail once when the engineer was performing work on the power circuitry for the DC and thought he was taking down one, but was in fact the wrong one and took both power circuits down at the same time.

This is all local scale. Your setup would not survive a datacenter-scale power outage, and at scale, power outages are datacenter-scale.

Data centers lose supply lines. They lose transformers. Sometimes they lose primary feed and secondary feed at the same time. Automatic transfer switches cannot be tested periodically i.e. they are typically tested once. Testing them is not "fire up a generator and see if we can draw from it"

It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations.
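The "design for a datacenter being totally down" claim rests on independence math. A back-of-envelope sketch, with invented availability numbers and the big (often wrong) assumption that site failures are truly independent:

```python
# Probability that at least one of N independent sites is up.
# Assumes failures are independent, which correlated infrastructure
# (shared grid, shared control plane) can easily violate.

def combined_availability(per_site, sites=2):
    """P(at least one of `sites` independent sites is up)."""
    return 1 - (1 - per_site) ** sites

# Two ordinary 99.9% sites beat one hardened 99.99% site:
print(round(combined_availability(0.999, 2), 6))   # 0.999999
print(round(combined_availability(0.9999, 1), 6))  # 0.9999
```

This is why spending on a second site tends to beat spending the same money on hardening one site, provided the failure modes really are uncorrelated.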

bombcar · 4 years ago
The datacenter we were in had dual-sourced grid power (two separate grid connections on opposite sides of the block, coming from different substations) along with a room of batteries (good for iirc 1hr total runtime for the whole datacenter, setup in quad banks, two on each "rail"), _and_ multiple independent massive diesel generators, which they ran and switched power to every month for at least an hour.

And to top it off each rack had its own smaller UPS at the bottom and top, fed off both rails, and each server was fed from both.

We never had a power issue there; in fact SDGE would ask them to throw to the generators during potential brown-out conditions.

Of course this was a datacenter that was a former General Atomics setup iirc ...

ItsBob · 4 years ago
Yes but if you have reliable power from two different sources then the biggest risk (I'd imagine) is the failover circuitry! Something that should be tested tbh.

Also, there are banks of batteries and generators between the power company cables and the kit: did they not kick in?

Again, this is all pure speculation: I have absolutely no idea of the exact failure, nor how their infrastructure is held together - this is all just speculation for the hell of it :)

vel0city · 4 years ago
The only full datacenter outage I've personally experienced was a power maintenance tech testing the transfer switch between systems where the power was 90 degrees out of phase. Big oof.
theideaofcoffee · 4 years ago
Transfer switches at any facility that's worth being colocated in are exercised as periodically as the generators to which they connect. In all of the facilities I have had systems in (>20MW total steady-state IT load), that meant once per month at minimum, both to keep the generators happy and to ensure the transfer functionality works, and more often if the local grid demands it, e.g. ComEd in Chicago or Dominion in NoVA asking for load shedding.
ClumsyPilot · 4 years ago
"It is cheaper to design a system that must be up which accounts for a data center being totally down and a portion of the system being totally unavailable than to add more datacenter mitigations."

Citation needed - the same issues with testing, data races, and expensive bandwidth come up.

uluyol · 4 years ago
Why spend the cost on dual X and Y when you can failover to another cluster?

For big DC workloads, it is usually, though not always, better to take the higher failure rate than add redundancy.

ItsBob · 4 years ago
Really? You'd think at Amazon's scale an additional PSU in a 1U custom-built server (I assume they're custom) would be a few tens of $ at most.

Actually, now that I type that it makes sense. Scaling a few tens of dollars to a bajillion servers on the off-chance that you get an inbound power failure (quite rare I'd reckon) might cost more than what they'd lose if it does actually fail.

So yeah, they're potentially just balancing the risk here and minimising cost on the hardware.

Edit: changed grammar a bit.
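The cost-balancing argument above can be made concrete as a toy expected-value comparison. Every number below is invented for illustration; AWS's real figures are unknown.

```python
# Toy expected-value comparison: fleet-wide cost of a second PSU per
# server vs. the expected annual cost of power-loss incidents without one.

def dual_psu_worth_it(servers, psu_cost, incidents_per_year, incident_cost):
    """True if the extra hardware costs less than the expected losses."""
    return servers * psu_cost < incidents_per_year * incident_cost

# 1M servers at $30/PSU vs. an estimated 0.5 incidents/year at $10M each:
print(dual_psu_worth_it(1_000_000, 30, 0.5, 10_000_000))  # False
```

Under these made-up numbers the second PSU doesn't pay for itself, which matches the intuition in the comment; with different estimates the answer flips.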

bob1029 · 4 years ago
> I've never worked anywhere with Amazon's budget so why are they not handling this?

Perhaps we are going to discover how AWS produces such lofty margins by way of their next RCA publication.

Bluecobra · 4 years ago
> What am I missing?

My guess is that they cheaped out on redundant PSUs to get you to use multiple availability zones. (More zones = more revenue.)

Even a single PSU shouldn't be an issue if they plugged it into an ATS, though.

Godel_unicode · 4 years ago
Unless the ATS breaks, which happens.

lordnacho · 4 years ago
What about a UPS/battery thingy? That's saved me a few times, though it normally just gives enough time for a short outage. Is it uncommon in cloud infra?
vel0city · 4 years ago
Even regular datacenters will often have several UPS systems the size of a small car to power the entire datacenter for the few minutes it takes to get the diesel generators started.
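The sizing of such a "bridge" UPS is simple arithmetic; a sketch with invented figures:

```python
# Rough bridge-UPS sizing: the batteries only need to carry the load
# until the generators are online.

def bridge_minutes(battery_kwh, load_kw):
    """Minutes of runtime a battery bank provides at a constant load."""
    return battery_kwh / load_kw * 60

# A 500 kWh bank feeding a 2 MW floor buys 15 minutes for generator start:
print(bridge_minutes(500, 2000))  # 15.0
```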
Hippocrates · 4 years ago
Every time a major cloud provider has an outage, Infra people and execs cry foul and say we need to move to <the other one>. But does anyone really have an objective measure of how clouds stack up reliability-wise? I doubt it, since outages and their effects are nuanced. The other move is that they want to go multi-cloud... But I’ve been involved in enough multi-cloud initiatives to know how much time and effort those soak up, not to mention the overhead costs of maintaining two sets of infra sub-optimally. I would say that for most businesses, these costs far exceed that occasional six-hour-long outage.
mijoharas · 4 years ago
I mean from the explanation[0], assuming that is correct (I don't have evidence to suggest it's false) - you don't need to be multi-cloud, and you don't even need to be multi-region. As long as you're spread out over multiple availability zones in a region you should be resilient to this failure.

Somewhat surprising to see how many things are failing though, which implies either that a lot of services aren't able to fail over to a different availability zone, or that something else is going wrong.

[0] https://news.ycombinator.com/item?id=29648992

omh2 · 4 years ago
AWS doesn't follow its own advice about hosting multi-regionally, so every time us-east-1 has significant issues, pretty much every AZ and region is affected.

Specifically, large parts of the management API and the IAM service are seemingly centrally hosted in us-east-1. So-called global endpoints also depend on us-east-1, as do parts of AWS' internal event queues (e.g. EventBridge triggers).

If your infrastructure is static you'll largely avoid the fallout, but if you rely on API calls or dynamically created resources you can get caught in the blast regardless of region

Hippocrates · 4 years ago
Yeah, my thought is not specific to this scenario. Indeed multi-AZ is a low cost and probably good idea because you often have a shared service management, control plane, and cheap bandwidth between things. Of course, when things fail they often ripple as may be the case here. I don't think clouds have their blast radius perfectly contained and they certainly don't communicate those details well.

One incident I recall involved our GCP regional storage buckets, which we were using to achieve multi-region redundancy. One day, both regions went down simultaneously. Google told us the data was safe, but the control plane and API for the service are global. Now whenever I read about MR, I wonder what that actually means...

zeckalpha · 4 years ago
That's true for this failure, but the prior two AWS outages were region-wide, and the GCP one last month was global.
sdevonoes · 4 years ago
Perhaps it's us, the customers (and our customers, and the customers of our customers, ...) who should get used to the idea that things can go wrong? Except for some specific scenarios (medical-related stuff, for instance), if my favourite online shopping place is down, well, it's down; I'll buy later.
metadat · 4 years ago
I know the Oracle OCI cloud has a reputation for never going hard-down, but also realize HN seems to loathe Big Red (understandably, to a degree, though OCI is pretty nice IME and _very_ predictable).
SixDouble5321 · 4 years ago
I don't think it's unfair. They aren't the worst villain, but they are up there.
indigomm · 4 years ago
> I doubt it, since outages and their effects are nuanced.

Your point here deserves highlighting. A failure such as a zone failing is nowadays a relatively simple problem to have. But cloud services do have bugs, internal limits or partial failures that are much more complex. They often require support assistance, which is where the expertise of their staff comes into play. Having a single provider that you know well and trust is better than having multiple providers where you need to keep track of disparate issues.

mongrelion · 4 years ago
I agree with you. I think that having multi-AZ is the first thing to figure out before wanting to do multi-cloud, which is just another buzzword taken out of management's bullshit bucket :)
Hippocrates · 4 years ago
Agree, and multi AZ is usually easy. IME with AWS and GCP the control plane is the same, the scaling works across AZ, bandwidth is free and latency is near zero. The level of effort to do that is simply ticking the right boxes at setup time IME.
jtc331 · 4 years ago
I’ve seen at least half a dozen full region AWS issues in the past 8 months.

You really need multi-region and also not be relying on any AWS service that’s located only in us-east-1 (including everything from creating new S3 buckets to IAM’s STS).

sfoley · 4 years ago
Who says this? I have literally never once seen this.
hnarn · 4 years ago
Is there a history of AWS downtimes available somewhere? This makes what, three times in as many months?

edit: The question isn't necessarily AWS specific, just any data on amount of downtime per cloud provider on a timeline would be nice.

colinbartlett · 4 years ago
I have tons of this kind of data due to my side project, StatusGator. For some services like the big cloud providers I have data going back 7 years.

There indeed has been an uptick in AWS outages recently. You can see a bit of the history here: https://statusgator.com/services/amazon-web-services

exikyut · 4 years ago
(I was idly curious. It appears this data is available as part of the ~US$280/mo tier, along with a bunch of other things.)
MatteoFrigo · 4 years ago
I don't know about AWS, but both Google Cloud and Oracle Cloud maintain at least a high level history of past outages. See https://status.cloud.google.com/summary and https://ocistatus.oraclecloud.com/history
dijit · 4 years ago
Given the hilariously awful reputation of the AWS status page I would hazard a guess that such a page would also be incredibly inaccurate.

If you can’t even admit you’re having an issue how can you keep an accurate record?

LuciusVerus · 4 years ago
I'd say three times in as many weeks, give or take.
spmurrayzzz · 4 years ago
This is a little more broad, beyond just cloud infra providers, but includes some of the kind of data you're looking for (post-mortems for outage events): https://github.com/danluu/post-mortems
andyjih_ · 4 years ago
The most hilarious irony: not being able to acknowledge a 4AM page in the PagerDuty mobile app because AWS is down.
exikyut · 4 years ago
(Which was about AWS being down?)
JCM9 · 4 years ago
AWS didn’t “go down”. They had an outage in one AZ, which is why there are multiple AZs in each region. If your app went down then you should be blaming your developers on this one, not AWS. Those having issues are discovering gaps in their HA designs.

Obviously it's not good for an AZ to go down, but it does happen, which is why any production workload should be architected for seamless failover and recovery to other AZs, typically by just dropping nodes in the down AZ.

People commenting that servers shouldn't go down etc. don't understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it's like complaining that you lost data because a disk failed. Disks fail... build architecture where that won't take you down.

matharmin · 4 years ago
AWS is under-reporting the severity of the issue though. The primary outage may be in a single AZ, but there are parts of the AWS stack that affected all AZs in us-east-1, and potentially other regions as well. For example, even now I'm unable to create a new ElastiCache cluster in different AZs of us-east-1.
zymhan · 4 years ago
> I'm unable to create a new ElastiCache cluster in different AZs of us-east-1

Isn't that because Elasticache will distribute the cluster across AZs automatically?

https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/...

boudin · 4 years ago
Issues are across all of us-east-1, not one AZ.

Load balancers are not doing well at all. The only way to avoid an outage in this case is to be cross-region or cross-cloud, which is quite a bit more complex to handle and requires more resources to do well.

And I hope nobody is listening to your blaming-and-finger-pointing advice; that's the worst way to solve anything.

It's AWS's job to ensure that things are reliable, that there is redundancy, and that multi-AZ infra is safe enough. The number of issues in us-east-1 lately is really worrying.

acdha · 4 years ago
Some load balancers may be having issues but I have multiple busy workloads showing no issues all morning. One big challenge can be that some people reporting multi-AZ issues are shifting traffic and competing with everyone else, while workloads which were already running in the other AZs were fine. It can be really hard to accurately tell how much the problems you’re seeing generalize to everyone else.

I do agree that the end of this year has been a very bad period for AWS. I wonder whether there’s a connection to the pandemic conditions and the current job market – it feels like a lot of teams are running without much slack capacity, which could lead to both mistakes and longer recovery times.

phamilton · 4 years ago
Echoing this. We had to manually intervene and cut off the faulty AZ because our ASGs kept spinning up instances in it and our load balancers kept sending traffic to bad hosts.

In the past I've seen both of those systems seamlessly handle an AZ failure. Today was different.

tluyben2 · 4 years ago
> People commenting that servers shouldn’t go down ect don’t understand how true HA architectures work. You should expect and build for stuff to fail like this. Otherwise it’s like complaining that you lost data because a disk failed. Disks fail… build architecture where that won’t take you down.

Is that comparison fair? If you have two mirrored RAID-5 boxes in your room and all the disks fail at the same time, you should complain; and that won't happen. Entire-datacenter failures should be anticipated, but to expect them is a bit too easy, I think. There are plenty of hosters who haven't had this happen even once in the last decade, in their only datacenter. I don't find it strange to expect, or even demand, that level of reliability, while still protecting yourself in case it happens, if that fits your specific project and budget.

Edit: OK, I meant that RAID-5 remark in the same context as the hosting; it can and does happen, but it shouldn't. You should plan for a contingency, but "expect" goes too far. We never had it (thousands of hard drives, decades of hosting, millions of sites), so we plan for it with backups; if it happens it will take some downtime, but it costs next to nothing over time to do that. If we expected it, we would need to take far different measures. And we had less downtime in a decade than an AWS AZ had in the past few months. I have a problem with the word "expect".

phone8675309 · 4 years ago
> Is that comparison fair? If you have 2 raid-5 mirrored raid 5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

There are plenty of situations where this might happen if they’re in your room: a lightning strike can cause a surge that causes the disks to fry, a thief might break in and steal your system, your house might burn down, an earthquake could cause your disks to crash, a flood could destroy the machines, and a sinkhole could open up and swallow your house. You may laugh at some of these as being improbable, but I have seen _all_ of these take out systems between my times in Florida (lightning, thief, sinkhole, and flood) and California (earthquake and house fire).

The fix for this is the same fix as being proposed by the parent post - putting physical space between the two systems so if one place become unavailable you still have a backup.

acdha · 4 years ago
> If you have 2 raid-5 mirrored raid 5 boxes in your room and all disks fail at the same time, you should complain. And that won't happen.

Here are some examples where that happened:

1. Drive manufacturer had a hardware issue affecting a certain production batch, causing failures pretty reliably after a certain number of power-on hours. A friend learned the hard way that his request to have mixed drives in his RAID array wasn’t followed.

2. AC issues showed a problem with airflow, causing one row to get enough warmer that faults were happening faster than the RAID rebuild time.

3. UPS took out a couple racks by cycling power off and on repeatedly until the hardware failed.

No, these aren’t common but they were very hard to recover from because even if some of the drives were usable you couldn’t trust them. One interesting dynamic of the public clouds are that you tend to have better bounds on the maximum outage duration, which is an interesting trade off compared to several incidents I’ve seen where the downtime stretched into weeks due to replacement delays or manual rebuild processes.

8note · 4 years ago
More generally, any correlation between two items gives potential for a correlated failure.

Same manufacturer, same disk space, same location, same operator, same maintenance schedule, same legal jurisdiction, same planet, you name it, and there's a common failure to match
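That correlation point can be put in numbers. A toy model follows; the linear interpolation between independence and lockstep is my own simplification for illustration, not a standard reliability formula:

```python
# Toy model of correlated drive failure: p is each drive's annual
# failure probability; `correlation` slides P(both fail) from full
# independence (0.0) to failing in lockstep (1.0).

def joint_failure(p, correlation=0.0):
    """P(both drives fail), interpolated between independence and lockstep."""
    independent = p * p
    return independent + correlation * (p - independent)

# Independent 5%-per-year drives almost never fail together...
print(round(joint_failure(0.05), 6))       # 0.0025
# ...but same-batch drives with strongly correlated defects come close
# to the single-drive failure rate.
print(round(joint_failure(0.05, 0.9), 6))  # 0.04525
```

The same shape applies to any shared factor in the list above: the stronger the coupling, the less the second copy buys you.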

AtNightWeCode · 4 years ago
"Won't happen". The 40,000-hours-of-runtime bug did happen. I would recommend that people take backups and store them offline, or at least isolated from the main storage.
dylan604 · 4 years ago
>And that won't happen

HA! I had received new 16-bay chassis and all of the drives needed, plus cold spares for each chassis. Set them up and started the RAID-5 init on a Friday. Left them running in the rack over the weekend. Returned on Monday to find multiple drives in each chassis had failed. Even with one of the 16 drives dedicated as a hot spare, the volumes would all have failed in an unrecoverable manner.

All drives were purchased at the same time, and happened to all come from a single batch from the manufacturer. The manufacturer confirmed this via serial numbers, and admitted they had an issue during production. All drives were replaced, and at a larger volume size.

TL;DR: Drives will fail, and manufacturing issues happen. Don't buy all of the drives in an array from the same batch! It will happen. To say it won't is just pure inexperience.

tyingq · 4 years ago
>AWS didn’t “go down”

The context of the parent seems to be that they intermittently couldn't get to the console. That seems fair to me. If we're blaming developers and finding gaps in HA design, then AWS should also figure out how to make the console url resilient. If it's not, then AWS does appear to be down.

I imagine it's pretty hard to design around these failures, because it's not always clear what to do. You would think, for example, that load balancers would work properly during this outage. They didn't. Or that you could deploy an ElastiCache cluster to the remaining AZs. You couldn't. And I imagine the problems vary based on the AWS outage type.

Similarly, with the earlier broad us-east-1 outage, you couldn't update Route53 records. I don't think that was known beforehand by everyone that uses AWS. You can imagine changing DNS records might be useful during an outage.

strunz · 4 years ago
Except many AWS services still route through us-east-1 anyway, which is why they have had huge outages recently. AWS isn't as redundant as people think it is.
bencoder · 4 years ago
Our API is just appsync (graphql) + lambdas + dynamoDB so, theoretically, we shouldn't have been affected. But about 1 in 3 requests was just hanging and timing out.

As others have said, they are not being forthright about the severity of the issue, as is standard.

dkryptr · 4 years ago
100% agree. I'm actually surprised AWS hasn't built a Chaos Monkey into their APIs/console so people can test their resiliency regularly if an AZ goes down.

edit: of course, AWS does have this: AWS Fault Injection Simulator

biohax2015 · 4 years ago
AWS Fault Injection Simulator does this.
stingraycharles · 4 years ago
Because then people would complain about AWS being less reliable than Azure / GCP.
TameAntelope · 4 years ago
Here's a secret that's now saved me from three outages this month:

Be in multiple AZs, and even multiple regions. But if you're going to be in only one AZ or one region, make it us-east-2.

IceWreck · 4 years ago
Honestly my server at home has more uptime than US-East-1
TacticalCoder · 4 years ago
I should blog about this one day but...

I have a server at OVH (not affiliated with them) which, at this point, I keep only for fun. It has 3,162 days of uptime as I type this.

3,162 days. That's 8+ years of uptime.

Does it have the traffic of Amazon? No.

Is it secure? Very likely not: it's running an old Debian version (Debian 7, which came out in, well, 2013).

It only has one port open though: SSH. And quite a hardened SSH setup at that.
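By "hardened" I mean the usual sshd_config lockdown, along these lines (the user name is illustrative):

```
# /etc/ssh/sshd_config — illustrative hardening
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers backupuser
MaxAuthTries 3
```

Key-only auth for a single account goes a long way toward explaining how a box this old hasn't been trivially brute-forced.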

I installed all the security patches I could without rebooting (so, yes, I know that means I skipped the ones that required a reboot).

This server is, by now, a statement. It's about how stable Linux can be. It's about how amazingly stable Debian is. It's also about OVH: at times parts of their datacenter burned (yup), at times full racks had to be moved/disconnected. But somehow my server never got affected. OVH may have had connectivity issues at some point, but my server itself never went down.

I "gave back" many of my servers I didn't need anymore. But this one I keep just because...

I still use it, but only as an additional online/off-site backup where I send encrypted backups. It's not as if it gets zero use: I typically push backups to it daily.

They're only backups, and they're encrypted. Even if the server is "owned" by some bad guys, the damage they could do is limited. Never seen anything suspicious on it though.

I like to do "silly" stuff like that. Like that one time I solved LCS35 by computing for about four years on commodity hardware at home.

I think it's about time I start to do some archeology on that server, to see what I can find. Apparently I installed Debian 7 on it in mid-October 2013.

I've created a temporary user account on it, and at times I've handed the password (before resetting it) to people just so they could SSH in and type "uptime".

It is a thing of beauty.

Eight. Years. Of. Uptime.

nextaccountic · 4 years ago
> Like that one time I solve LCS35 by computing for about four years on commodity hardware at home.

Awesome! Are you Bernard Fabrot [0]?

[0] https://www.csail.mit.edu/news/programmers-solve-mits-20-yea...

kasey_junk · 4 years ago
I read this as a cautionary tale. Here we have a server that is still up only through the grace of God, and is likely owned. If it isn't, it's because of how little is going on with it.

At its current use, it's likely not a major issue. But imagine if someone saw this uptime and took it as a statement of reliability and built a service on it. I, for one, would want that disclosed, because it's a disaster waiting to happen. I'd much rather someone disclose that they had a few servers, each with no more than 7 days of uptime, because they'd been fully imaged and cycled in that time...

plandis · 4 years ago
Your server could just be an outlier. Doesn’t really say anything about AWS or any cloud provider.
BossingAround · 4 years ago
Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Similarly, my laptop, if I keep it plugged into the wall and enable httpd on localhost, will surely have better uptime than any of the top clouds. I'd bet it'd have 100% uptime if I plugged in a UPS and only served traffic on my local network.

christophilus · 4 years ago
Most people don't need to handle the traffic of US-East-1. They just need a single, simple, mostly reliable server. But they're often told, "Don't do that. It's too hard, and irresponsible, and what if you get a spike in traffic, and what if you need to add 5 new servers, and security is really hard."

In reality, most people don't need to scale. An occasional spike in traffic is a nuisance, but not the end of the world, and security is not terribly hard, if you keep your servers patched (which is trivial to automate).
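On Debian/Ubuntu, for instance, automating patching is two config lines once unattended-upgrades is installed:

```
# /etc/apt/apt.conf.d/20auto-upgrades
# (after: apt install unattended-upgrades)
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```

That keeps security updates flowing daily with no human in the loop; you still have to reboot occasionally for kernel updates, which is the part the 8-years-of-uptime crowd skips.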

I really don't understand why there's so much FUD around running your own stuff.

Sammi · 4 years ago
> Does your server at home handle similar traffic to that of US-East-1 since you're comparing uptime?

Of course it doesn't. Why are you asking antagonistic questions?

loopdoend · 4 years ago
Your home ISP has 100% uptime? That's incredible.
IceWreck · 4 years ago
No but I access my home-server remotely from my university all the time and it hasn't gone down once.

Better uptime than paying for EC2 on AWS US-East-1.

Obviously this approach isn't scalable but it serves me well.

RONROC · 4 years ago
The prevailing wisdom throughout the last couple of years was:

“ditch your on-prem infrastructure and migrate to a major cloud provider”

And it's starting to seem like it could be something like:

“ditch your on-prem infrastructure and spin up your own managed cloud”

This is probably untenable for larger orgs where convenience gets the blank check treatment, but for smaller operations that can’t realize that value at scale and are spooked by these outages, what are the alternatives?

TameAntelope · 4 years ago
I don't think it's reasonable to be spooked by these outages, and to think your resolution would be to leave AWS entirely.

A much faster and more effective solution, one that doesn't have you trading cloud problems for on-prem problems (the power outage still happens, except now it's your team that has to handle it), would be to update your services to run in multiple AZs and multiple regions.

Get out of AWS if you want, but don't get out of AWS because of outages. You should be able to mitigate these relatively easily.
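The client side of "fail away from the broken AZ/region" can be as simple as preferring the first healthy endpoint in an ordered list. A toy sketch (endpoints and health check are hypothetical; real setups would lean on Route53 health checks or a load balancer):

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first healthy endpoint, preferring earlier (primary) entries."""
    for ep in endpoints:
        if is_healthy(ep):
            return ep
    raise RuntimeError("no healthy endpoint available")

# Example: fail away from a broken primary region
endpoints = ["use1.example.com", "use2.example.com"]
down = {"use1.example.com"}
print(pick_endpoint(endpoints, lambda ep: ep not in down))  # use2.example.com
```

The hard part isn't the selection logic, it's making sure your data layer is actually replicated to the place you're failing over to.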

f6v · 4 years ago
Self-managed infrastructure doesn’t fail now?
dijit · 4 years ago
What an absolutely pointless comment.

Everything fails, we can argue the rate. But I would argue that understanding your constraints is better.

If you know that your secret-storage system can't survive a machine going away: well, you wire redundant paths to the hardware, do memory mirroring, and RAID the hell out of the disks. And if it fails, you have a standby in place.

But if you use AWS Cognito.

And it goes down.

You're fucked mate.

iso1631 · 4 years ago
Not at this rate.

I remember we had a power outage in 2006; it actually took one of my services off air. Since then, of course, that has been rectified, and the loss of a building wouldn't impact any of the critical, essential or important services I provide.

RONROC · 4 years ago
We’re going to be having this same tired, pedantic, round-about conversation when Teslas routinely decide to take out a family of four because they mistook a plastic bag for an off-ramp.

Commenters will show up like clockwork and say shit like:

“What man, it’s not like cars didn’t crash before? Haha”

Don’t be dense dude. And definitely don’t pursue a leadership position anytime in the future.

xyst · 4 years ago
“Hybrid and multi cloud” is the future. In other words, give us more fucking money.
paulryanrogers · 4 years ago
Spread the risk? Smaller on prem and cloud / rented bare metal?
Spivak · 4 years ago
Nah, it's actually better to concentrate the risk in this case.

If your app depends on a few 3rd-party services -- SendGrid, Twilio, Okta -- and they're all hosted on different infra, then congrats! You're gonna have issues when any one of them is down, yayyy.

Also the marketing benefit can't be downplayed. If your postmortem is "AWS was having issues" then your execs and customers just accept that as the cost of doing business because there's a built-in assumption that AWS, Azure, GCP are world class and any in-house team couldn't do it better.

ernsheong · 4 years ago
Google Cloud seems to be doing much better, at least recently. There's also Azure. AWS seems to have placed growth above everything else at customers' expense.
Victerius · 4 years ago
I'm tempted to found a startup to help businesses migrate from cloud providers to on-prem infrastructure.
datavirtue · 4 years ago
Slinging some of that sweet Tanzu or Rancher?