felixgallo · 10 years ago
"It was impacting a relatively small number of customers, but we should have posted the green-i to the dashboard sooner than we did on Monday."

Amazon. No. Please listen. You should never post a green-i. Green-i means nothing to anyone. It's a minimization of a problem. You should change the indicator to show that there is a problem. If there is a problem, that is what the indicator is for. Customers don't care about whatever weird internal political pressure causes you to want to show as little information as possible on that dashboard. Customers want to know that there is a problem, and we will totally be able to figure out if it is small or large on our own.

deanCommie · 10 years ago
What's with this all-or-nothing attitude? What is so wrong about different severity levels?

Green-i: Some small percentage of customers are affected, but most of you have nothing to worry about. So if you are a customer having problems and see this notice, you are reassured that you are not crazy. And if you are NOT seeing problems, you probably won't see any. Amazon has not always been the best at posting this information quickly enough, hence a commitment to be better at messaging for those customers.

Yellow: A significant number of customers are affected. Potentially serious problems, and if you haven't seen them yet, don't be surprised if you do. Start monitoring your service health and preparing for a failover to another region.

Red: FUBAR. Happens rarely, but when it does you'll probably know even before your application alarms - you'll notice when half the internet shuts down.
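
A minimal sketch of how an impact-based mapping like this might be encoded; the thresholds and function name are invented for illustration, not anything AWS actually publishes:

```python
def status_icon(impacted_fraction: float) -> str:
    """Map the fraction of affected customers to a status icon (hypothetical thresholds)."""
    if impacted_fraction == 0:
        return "green"      # service operating normally
    if impacted_fraction < 0.05:
        return "green-i"    # small subset affected; informational note for those customers
    if impacted_fraction < 0.5:
        return "yellow"     # significant impact; start monitoring and prepare for failover
    return "red"            # widespread outage

print(status_icon(0.005))   # green-i
print(status_icon(0.12))    # yellow
print(status_icon(0.60))    # red
```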

felixgallo · 10 years ago
Different severity levels are fine. That's what the indicator is for. Decorating the indicator, usually incorrectly, with a tiny extra indicator is incomprehensible. That's why you've never seen a tiny red light in the upper corner of a green traffic light. Your 'check engine' light is not green with a tiny extra indicator in the corner. The metaphors go on and on.

Customers are totally not interested in the fact that some other random customer may not be experiencing problems, or that Amazon has a wide variety of services, some of which may not be affected in certain geographies. Customers go to that panel with one express goal: to discover whether the problems they are seeing with their servers could be related to Amazon. It is a triage check. The question is not, is Amazon super great and are the availability engineers the bestest on average over time? The question is, what the fuck is going on?

Amazon has suffered catastrophic unavailability, and the green-i has appeared belatedly, an hour or more into the problem. Because as revealed in this post, it is manually put up by engineers. Manually!

Manually!

mentat · 10 years ago
> Some small percentage of customers are affected, but most of you have nothing to worry about.

That's absolutely not what it means practically for any of the recent outages. It has come to mean "the service is down but not hard enough for us to admit it".

d23 · 10 years ago
> By 2:37am PDT, the error rate in customer requests to DynamoDB had risen far beyond any level experienced in the last 3 years, finally stabilizing at approximately 55%.

Yeah, a small green-i should cover it.

nickpsecurity · 10 years ago
Green traditionally means we're good, ready to go, etc. Non-intuitive to use any green indicator for a situation deserving a write-up like this. It's yellow at best.
takeda · 10 years ago
Yep, especially since yellow is clearly labeled as "Performance issues". It is basically lying to make things appear better than they really are.
jamiesonbecker · 10 years ago
I'm starting to wonder how much more I should trust AWS. Is everything swept under the carpet?

e.g., not even attribution or thanks for root escalation vulnerabilities that I surfaced for them: https://news.ycombinator.com/item?id=10261507

arturhoo · 10 years ago
> We did not have detailed enough monitoring for this dimension (membership size), and didn’t have enough capacity allocated to the metadata service to handle these much heavier requests.

As much as I admire and rely on AWS' scale to build architectures and fault-tolerant applications, it can't be ignored that the marketing push toward going "full cloud" doesn't take into account how hard it is to build resilient architectures in the cloud.

I see these disruption events as stop signs: when the cloud itself fails to scale, I rethink a few of the decisions we all make when surfing these trends.

http://yourdatafitsinram.com/ also comes to mind.

LoSboccacc · 10 years ago
Infrastructure is hard, and it gets exponentially harder with the number of nodes you need to scale.

That said, even with these disruptions on Amazon as a warning, I am neither skilled enough nor do I have the time to build resilient non-cloud infrastructure.

I was looking at redundant VPSes at first, because Amazon is expensive for us. However, just learning all the things that can go wrong in the very first piece, the load balancer, and all the gritty details you have to consider for just this one little component to support interruption-free failover, made me rethink the cost-benefit of going managed.

It is true that going cloud doesn't completely remove outage risk, and it will not be as resilient as infrastructure built with skill and love by the best out there, but how many shops can actually roll their own solution and get an equivalent level of availability?

Scaling web nodes is within my capabilities. Building an HA database is already somewhat above my skill level, though I might manage it. But testing database failover, making sure it works, making sure it can actually recover from one node dying while the application stays live in the meantime? That's way beyond what I can reasonably do and what my company can afford to pay maintenance for.

chillydawg · 10 years ago
How is it any harder to do in the cloud than on a rack in a warehouse? At least you don't have to muck about with cables and phoning power companies up.
antirez · 10 years ago
Just an example: during the issue, even people serving 10 ops/sec, but a very important 10 ops/sec, were affected by a huge aggregate load that was, for the most part, not theirs. It's true that when you "go cloud" you don't have to manage your operations, but you are basically putting everything in the hands of other ops people, and what happens to you depends on a much wider set of conditions.

So managing your own stuff is hard, but you are in control and can do things in a way you believe is completely safe for you. Or you may still run into the same kinds of events sometimes, but perhaps while paying a lot less for the same services. Or you can build your deployment with characteristics (a lot of RAM per server, for example) that are often impossible to get cost-effectively in the cloud.

It's not stupid to use AWS services, but it's not stupid to manage your own operations either, whether on your own hardware or at least on bare metal and/or the virtual machines certain providers give you, while still being partly accountable for, responsible for, and in control of your system software deployment and operations.

toomuchtodo · 10 years ago
I used to do infrastructure on physical hardware, and we'd go years without an outage sometimes (generators in the datacenter, diesel fuel contracts, redundant fiber providers using BGP). Doing it in the cloud is harder, because you're at the mercy of the provider when things go south, and you have no transparency into why it went wrong except what they're willing to publish. Why did it happen? Will it happen again?

I mean, you can argue that the cloud is better. But how often are Heroku and AWS down? About the same as physical providers (I concede S3 is pretty solid, though).

nickpsecurity · 10 years ago
You call up IBM. You ask for a mainframe solution for two sites. You get experts to set it up for you with your application and such. You don't worry about downtime again for at least 30 years.

You call up Bull, Fujitsu, or Unisys for the same thing.

You call up HP. You ask for a NonStop solution. You get the same thing for at least 20 years.

You call up VMS Software. You ask for an OpenVMS cluster. You get the same thing for at least 17 years.

Well-designed OSes, software, and hardware did cloud-style stuff, without the downtime, long before the cloud existed. Cloud certainly brought prices down and flexibility up. Yet these clouds still haven't matched 70s-80s technology in uptime, despite all the brains and money thrown at them. That's a fact.

So, they shouldn't be used for anything mission-critical where downtime costs lots of money.

bbrazil · 10 years ago
Given they had ~300 minutes of outage in 3 years, you're looking at ~99.98% availability in just that region. That's pretty good for a stateful serving system, and indeed you'd be hard pushed to do better.
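
A quick back-of-the-envelope check of that figure (assuming the ~300 minutes over 3 years stated above), just to show the arithmetic:

```python
# Availability estimate from ~300 minutes of outage over 3 years.
downtime_min = 300
total_min = 3 * 365 * 24 * 60              # about 1,576,800 minutes in 3 years
availability = 1 - downtime_min / total_min
print(f"{availability:.4%}")               # ~99.9810%, i.e. roughly the 99.98% cited above
```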
yandie · 10 years ago
Even as this post is up, DynamoDB is experiencing errors again:

6:33 AM PDT We are investigating increased error rates for API requests in the US-EAST-1 Region.

Source: http://status.aws.amazon.com/

colinbartlett · 10 years ago
Some services are reporting problems now, starting with the same set as last time, such as Heroku and lots of services that depend on it:

https://statusgator.io/events

toomuchtodo · 10 years ago
DynamoDB, Cloudwatch, EC2, Scaling, and Elastic Beanstalk are all impacted currently.
Someone1234 · 10 years ago
Their status screen is totally misleading and utterly pointless. Someone needs to get all of the RSS feeds (which are actually accurate) and create a new dash which is honest.
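
A minimal sketch of that idea in Python; the per-service feed URLs here are assumptions based on the status.aws.amazon.com layout and may not be the exact paths:

```python
# Sketch: build an "honest" dashboard by polling AWS status RSS feeds directly.
# The feed URLs are assumptions about the status.aws.amazon.com layout, not verified paths.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

FEEDS = {
    "DynamoDB (us-east-1)": "http://status.aws.amazon.com/rss/dynamodb-us-east-1.rss",
    "EC2 (us-east-1)": "http://status.aws.amazon.com/rss/ec2-us-east-1.rss",
}

for service, url in FEEDS.items():
    with urlopen(url) as resp:
        items = ET.parse(resp).findall(".//item")
    latest = items[0].findtext("title") if items else "No recent events"
    print(f"{service}: {latest}")
```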
reustle · 10 years ago
Are the RSS feeds more accurate than the status page?
impostervt · 10 years ago
Thank you for this. I was wondering why I was getting slow responses from their API Gateway. It's "green" on their status board, but based on what I saw Sunday, the two services are linked.
cddotdotslash · 10 years ago
Pretty much the exact same thing that happened the other day. DynamoDB, then Lambda, now API Gateway. I can't update my Lambda functions, a small percentage are timing out, and API Gateway is sending back a "We're out of capacity" message.
ghshephard · 10 years ago
A pretty good post mortem, but one worrisome final comment:

"For us, availability is the most important feature of DynamoDB"

I would think that durability should be the most important feature of DynamoDB; better to have an intermittent outage, or reduced capacity, than to lose data. But perhaps there is something about DynamoDB which suggests availability is rated more highly than durability?

EwanToo · 10 years ago
I suspect they're simply trying to reassure customers that this size of outage won't happen again, not make some kind of deep mission statement.
jMyles · 10 years ago
If this is true (and I think you're right), then pronouncements about "the most important feature" are probably not timely.
stingraycharles · 10 years ago
I think this is written from an ops point of view. Dynamo-based systems allow a client to wait until at least X servers have acknowledged the write, and use eventual consistency in case there is some service disruption.

I personally didn't know they used metadata tables in Dynamo, and it sounds like they lost one of the most important traits that gave Dynamo its super-high availability: keeping all the cluster state in the clients rather than in the servers.
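
For what it's worth, a minimal sketch of the quorum-write idea from the Dynamo paper described above; the replica objects and their put() method are hypothetical placeholders, not DynamoDB's actual API:

```python
# Sketch of a Dynamo-paper-style quorum write: send to N replicas, succeed once W acknowledge.
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value, w):
    """Return True once at least `w` of the replicas acknowledge the write."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.put, key, value) for r in replicas]
    acks = 0
    try:
        for fut in as_completed(futures):
            try:
                if fut.result():
                    acks += 1
            except Exception:
                pass        # a replica that errored just doesn't count toward the quorum
            if acks >= w:
                return True # quorum reached; stragglers finish in the background
        return False
    finally:
        pool.shutdown(wait=False)
```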

sillypog · 10 years ago
Thank you for saying that! I just read the Dynamo paper the other day and that was the first thing that jumped out at me when I read this outage description. Now I feel like I understood the paper.
brudgers · 10 years ago
Since the "A" in ACID is atomicity, I suspect the reading context for "availability" is CAP.
sturadnidge · 10 years ago
If I'm reading their post-outage actions correctly, they certainly align with availability > durability.

There are many things to like about DynamoDB, but one of the things I really dislike is the 'absence of an error == success' pattern that they implement for many operations. With less frequent metadata requests, I imagine there is a greater chance of silent failures (i.e. availability looks fine, durability is gone).

Could be wrong though, I'm not going to pretend to know more about distributed systems than the DynamoDB developers.

senderista · 10 years ago
Durability is certainly not being compromised. I think they are just referring to increasing lease times. Timeouts are always a tradeoff (false alarm frequency vs. recovery time), and this event has prompted them to re-evaluate the tradeoff.

If you suspect the design of DynamoDB is sloppy, I encourage you to read this: http://cacm.acm.org/magazines/2015/4/184701-how-amazon-web-s...

manuelflara · 10 years ago
They probably said that because this was an availability outage. If this was a postmortem after some data had been lost, they would've ended it with "For us, durability is the most important feature of DynamoDB".

Deleted Comment

deskamess · 10 years ago
> but we should have posted the green-i to the dashboard sooner than we did

Use a full orange circle, or even a half-orange circle, or something else, if you want to convey 'a subset of customers may have issues'. If possible, move 'having issues' items to the top; that way people do not have to scroll down to see what's broken.

zwily · 10 years ago
This was the first time we had actually seen red icons on the dashboard. We joked that they had to furiously photoshop them up Sunday morning.

(I'm sure it had happened before, just the first time we had seen them.)

deanCommie · 10 years ago
There's not enough granularity with a yellow icon. I feel like if only <1% of customers are affected, a yellow circle isn't warranted; that's for the 5-20% impact case.
takeda · 10 years ago
Unless you're one of the <1% of customers.

The whole point of a status page is to help determine whether the issue you're experiencing is on your side or the SaaS provider's, not to show as much green as possible.

The way the AWS status page currently works, it simply fails to provide any functionality and might as well be shut down.

The colors on it only start to change when you can already see tons of articles about AWS being down.

userbinator · 10 years ago
> Unavailable servers continued to retry requests for membership data, maintaining high load on the metadata service.

It sounds like the retries exacerbated the situation. I wonder if they use exponential backoff?

BillinghamJ · 10 years ago
It's amazing how frequently this seems to happen. A lot of the huge downtime events I've read about have occurred because of failures in the systems used to recover from failures.

It seems that perhaps more time/effort needs to be spent testing the systems (and the use of those systems) which are critical for handling failure. While building them, it's easy to dismiss their scaling requirements on the basis that they shouldn't ever be under significant load.

antirez · 10 years ago
This is a classic, especially because those parts of the software are usually the least tested across a variety of conditions, even when they are relatively well tested. It's very hard to simulate real failures.
bbrazil · 10 years ago
> I wonder if they use exponential backoff?

More importantly, did they use randomised exponential backoff?

Having all the retries hitting at the same time can lead to pulses of outages until things settle down.
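
A minimal sketch of randomised ("full jitter") exponential backoff; the base, cap, and attempt count are illustrative values, not anything AWS documents:

```python
# Capped exponential backoff with full jitter, so retries from a large fleet
# spread out instead of arriving in synchronised pulses.
import random
import time

def backoff_delay(attempt, base=0.1, cap=20.0):
    """Full jitter: pick uniformly from [0, min(cap, base * 2**attempt))."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(request, max_attempts=8):
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            if attempt == max_attempts - 1:
                raise           # out of attempts; surface the last error
            time.sleep(backoff_delay(attempt))
```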

jmason23 · 10 years ago
I have no doubt they would have been using randomised exponential backoff -- these issues are well recognised inside Amazon and best practices are well known. For example, here's a blog post from March this year from Marc Brooker on the topic: http://www.awsarchitectureblog.com/2015/03/backoff.html . It may not have been correctly tuned for this scenario however.

I'd say the Dynamo team are well aware of what they should have been doing, and kicking themselves for not foreseeing this cascading-failure case. ouch!

allengeorge · 10 years ago
Most exponential backoff implementations I've seen have an upper limit, where the initial (fast) retries work around transient networking blips.

My guess is that in a case like this you'd hit the backoff's upper bound pretty quickly and, given the large volume of servers hitting a comparatively small metadata pool, experience exactly the same failure mode.
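
A rough illustration of that point: once every client is sitting at the backoff ceiling, the aggregate retry rate settles at roughly fleet_size / cap, which can still exceed a small metadata pool's capacity. All numbers below are invented for illustration:

```python
# Steady-state retry load once every client has hit the backoff cap.
fleet_size = 100_000           # storage servers retrying membership requests
backoff_cap_s = 20             # upper bound on the retry interval, in seconds
metadata_capacity_rps = 2_000  # requests/sec the metadata fleet can actually absorb

steady_state_rps = fleet_size / backoff_cap_s
print(f"~{steady_state_rps:.0f} req/s against a capacity of {metadata_capacity_rps} req/s")
# ~5000 req/s: still 2.5x over capacity even with backoff fully maxed out
```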

beambot · 10 years ago
Looks like a fair number of key AWS systems rely on DynamoDB -- and further, on the same system used by customers. I wonder: do they have any inclination to decouple these dependencies to prevent correlated outages?
cddotdotslash · 10 years ago
A large number of AWS services actually use other AWS services under the hood. Lambda is actually running code on EC2 stored on S3. API Gateway is actually CloudFront on the front end. Apparently most services use DynamoDB for metadata. I'm pretty sure CloudWatch logs are stored on S3. They really interconnect a lot of services. You could argue that it's good (dog fooding, better monitoring) or bad (too many dependencies) but at least they're putting confidence in their own products.
Jgrubb · 10 years ago
I don't think they rely on Dynamo, they rely on an internal metadata service that Dynamo just happened to overwhelm with too many large requests.
threeseed · 10 years ago
"Amazon SQS uses an internal table stored in DynamoDB to store information describing its queues"

"EC2 Auto Scaling stores information about its groups and launch configurations in an internal table in DynamoDB"

"CloudWatch uses an internal table stored in DynamoDB"

"Customers attempting to log into the Console during this period saw much higher latency in the login process. This was due to a very long timeout being set on an API call that relied on DynamoDB"

Seems like poor architectural design to have all of these storing state in the same instances of DynamoDB that are used by customers. If a new feature like GSI is added, it should under no circumstances impact other services.

beambot · 10 years ago
From the 2nd sentence:

> ... subsequent impact to other AWS services that depend on DynamoDB ...

And the metadata service is part of DynamoDB:

> The membership of a set of table/partitions within a server is managed by DynamoDB’s internal metadata service.

ckozlowski · 10 years ago
I noticed in the "Impact on other services" bit that CloudWatch and the Console were affected because they were dependent on DynamoDB. Now, I don't pretend to know DynamoDB too well, but it seems to me that having your monitoring application dependent on one of the things you would be monitoring is a strange circular dependency.

Would it have been wiser for Amazon to implement a completely separate instance of DynamoDB for the service offerings that depend on it? Or is this simply not cost-effective? Help me understand, thanks. =)

_alex_ · 10 years ago
CloudWatch is a monitoring service that AWS offers publicly. It is different from the monitoring service that AWS services use internally.

source: was on an AWS service team for several years

berkay · 10 years ago
It's still a problem even if it's not the internal monitoring system. Customers use CloudWatch to detect problems with DynamoDB and get notified. This dependency means customers may not get notified if CloudWatch does not work as expected due to a DynamoDB problem.
ckozlowski · 10 years ago
I follow. In this case, though, I was thinking of the customer's monitoring tools. Their first action would be to try to diagnose it themselves.

But good to know! Thanks!

zwily · 10 years ago
It sounds like most of the remediation is "we'll make sure this service keeps working even if Dynamo is down". It's telling that so many services are relying heavily on Dynamo now. If anything, this event just makes me like it more. (I have several things built on it).
eli · 10 years ago
And then what? Set up another set of monitoring tools to watch this new DynamoDB instance that's only used for monitoring and console?