Readit News
Posted by u/iamdeedubs 9 years ago
Ask HN: Is S3 down?
I'm getting

{ "errorCode" : "InternalError" }

When I attempt to use the AWS Console to view S3

boulos · 9 years ago
Disclosure: I work on Google Cloud.

Apologies if you find this to be in poor taste, but GCS directly supports the S3 XML API (including v4):

https://cloud.google.com/storage/docs/interoperability

and has easy to use multi-regional support at a fraction of the cost of what it would take on AWS. I directly point my NAS box at home to GCS instead of S3 (sadly having to modify the little PHP client code to point it to storage.googleapis.com), and it works like a charm. Resumable uploads work differently between us, but honestly since we let you do up to 5TB per object, I haven't needed to bother yet.

Again, Disclosure: I work on Google Cloud (and we've had our own outages!).

NiekvdMaas · 9 years ago
Apologies if this is too much off-topic, but I want to share an anecdote of some serious problems we had with GCS and why I'd be careful to trust them with critical services:

Our production Cloud SQL started throwing errors that we could not write anything to the database. We have Gold support, so quickly created a ticket. While there was a quick reply, it took a total of 21+ hours of downtime to get the issue fixed. During the downtime, there is nothing you can do to speed this up - you're waiting helplessly. Because Cloud SQL is a hosted service, you can not connect to a shell or access any filesystem data directly - there is nothing you can do, other than wait for the Google engineers to resolve the problem.

When the Cloud SQL instance was up&running again, support confirmed that there is nothing you can do to prevent a filesystem crash, it "just happens". The workaround they offered is to have a failover set up, so it can take over in case of downtime. The worst part is that GCS refused to offer credit, as according to their SLA this is not considered downtime. The SLA [1] states: "with respect to Google Cloud SQL Second Generation: all connection requests to a Multi-zone Instance fail" - so as long as the SQL instance accepts incoming connections, there is no downtime. Your data can get lost, your database can be unusable, your whole system might be down: according to Google, this is no downtime.

TL;DR: make sure to check the SLA before moving critical stuff to GCS.

[1]: https://cloud.google.com/sql/sla

fidget · 9 years ago
The GCS being referred to by the GP is Google Cloud Storage, not Cloud Sequel. You really do need failover set up though. That's true for basically any MySQL installation, managed or not.
adwf · 9 years ago
That isn't just a Google issue though. You'd have had the exact same trouble with AWS/RDS if you're running with no replica. The lack of filesystem access is a security "feature" for both. If you have no HA setup then you have no recourse but to restore to a new server from backup, or wait for your cloud provider to fix it.
lbill · 9 years ago
Not using a failover is a bold choice (not stupid, just bold). A failover is like a good insurance policy: you pay for it, you hope that you'll never need it, but when shit happens you are very happy to have it!
TekMol · 9 years ago
21 hours sounds pretty long to me. What type of data was it and how long would you have waited until you continued with a backup of the data on a different machine?
JPKab · 9 years ago
I've used both Google Cloud and AWS, and as of a year or so ago, I'm a Google Cloud convert. (Before that, you guys didn't at all have your shit together when it came to customer support)

It's not in bad taste, despite other comments saying otherwise. We need to recognize that competition is good, and Amazon isn't the answer to everything.

eknkc · 9 years ago
We were on GCP for around a year. It was my decision; I really wanted to love GCP, and I initially did. But we recently switched to AWS.

I think there is little GCP does better than AWS. Pricing is better on paper, but performance per buck seems to be on par. Stability is a lot worse on GCP, and I don't just mean service outages like this one (of which they had their fair share) but also individual issues like instances slowing down or the network acting up randomly. There's also a lack of service offerings: no PostgreSQL, functions never leaving alpha, no hosted Redis clusters, etc. Support is also too expensive compared to AWS.

Management interfaces are better on GCP and sustained use discount is a big step up against AWS reservations. Otherwise, I think AWS works better.

espeed · 9 years ago
Me too. We switched to Google Cloud years ago at its inception and have never looked back -- always viewed it as a competitive advantage due to its solid, more advanced infrastructure -- faster network, reliable disks, cleaner UI that's easier to manage. Just a cleaner operation all the way around.
snackai · 9 years ago
What is indeed in bad taste is your choice of Google Cloud over AWS. No, I really like GCP and use it at the core of many apps, but if people really want a decentralized web we need to use more than one provider. Don't "convert". Use both - redundancy, ffs.
advisedwang · 9 years ago
I work in GCP support. I'm really curious: what do you feel changed that led to such improved support? I'd like to make sure we keep doing it.
vacri · 9 years ago
My experience of support with Google Apps for Business makes me very wary of using anything Google for critical business infra. Google products are nice, but as soon as you hit a problem or edge case, you're on your own in my experience.
ehsankia · 9 years ago
Honestly, if you're a big service that millions of people use, you should not put all your eggs in a single basket and should probably use a mix, in case one of the clouds goes down like in this case.
hkmurakami · 9 years ago
>(Before that, you guys didn't at all have your shit together when it came to customer support)

Sounds like it basically coincides with Diane Greene coming on board to run the show -- which is great news for all of us with increased competition on not just the technical front but also support (which is often the deal maker/breaker)

jamesblonde · 9 years ago
I just wrote a piece reflecting on the s3 outage and the limitations of s3 metadata/replication:

https://medium.com/@jim_dowling/reflections-on-s3s-architect...

themihai · 9 years ago
GCP has always felt like a forever-beta product. On top of that you get a lot of lock-in, so I would never recommend GCP for a long-term project.
twakefield · 9 years ago
The brilliance of open sourcing Borg (aka Kubernetes) is evident in times like these. We[0] are seeing more and more SaaS companies abstract away their dependencies on AWS or any particular cloud provider with Kubernetes.

Managing stateful services is still difficult but we are starting to see paths forward [1] and the community's velocity is remarkable.

K8s seems to be the wolf in sheep's clothing that will break AWS' virtual monopoly on IaaS.

[0] We (gravitational.com) help companies go "multi-region" or on-prem using Kubernetes as a portable run-time.

[1] Some interesting projects from this comment (https://news.ycombinator.com/item?id=13738916)

* Postgres automation for Kubernetes deployments https://github.com/sorintlab/stolon

* Automation for operating the etcd cluster: https://github.com/coreos/etcd-operator

* Kubernetes-native deployment of Ceph: https://rook.io/

dankohn1 · 9 years ago
Note that Kubernetes "builds upon 15 years of experience of running production workloads [on Borg] at Google" [0], but is different code than Borg.

In addition to Rook, Minio [1] is also working to build an S3 alternative on top of Kubernetes, and the CNCF Landscape is a good way of tracking projects in the space [2].

[0] https://kubernetes.io/ [1] https://www.minio.io/ [2] https://github.com/cncf/landscape

Disclosure: I'm the executive director of CNCF, which hosts Kubernetes, and co-author of the landscape.

justicezyx · 9 years ago
K8s is a better Borg! It leaps forward and builds upon many years of experience operating the system.
013a · 9 years ago
Is there any way built in to Kubernetes to go multi-AZ, multi-region, or even multi-cloud? Is federation the answer to this?

I remember reading somewhere in the K8s documentation that it is designed such that nodes in a single cluster should be as close as possible, like in the same AZ.

blantonl · 9 years ago
I have a component in my business that writes about 9 million objects a month to Amazon S3. But, to leverage efficiencies in dropping storage costs for those objects I created an identical archiving architecture on Google Cloud.

It took me about 15 minutes to spin up the instances on Google Cloud that archive these objects and upload them to Google Storage. While we didn't have access to any of our existing uploaded objects on S3 during the outage, I was able to mitigate the inability to store new incoming objects. (Our workload is geared much more towards being very, very write-heavy for these objects.)

It turns out this cost-leveraging architecture works quite well as a disaster recovery architecture.

sachinag · 9 years ago
Opportunistic, sure. But I did not know about the API interoperability. Given the prices, makes sense to store stuff in both places in case one goes down.
khc · 9 years ago
I am surprised more people don't know about it. I get questions like https://github.com/kahing/goofys/issues/158 every now and then and to be fair I don't think they market it well: https://cloud.google.com/storage/docs/migrating

Disclosure: I don't work for google but have an upcoming interview there.

nodesocket · 9 years ago
Not poor taste at all. Love GCP. I actually host two corporate static sites using Google Cloud Storage and it is fantastic. I just wish there were a bucket-wide setting to adjust the cache-control header. Currently it defaults to 1 hour, and if you want to change it, you have to use the API/CLI and provide a custom cache-control value on each upload. I'd love to see a default cache-control setting in the web UI that applies to the entire bucket.

I also want to personally thank Solomon (@boulos) for hooking me up with a Google Cloud NEXT conference pass. He is awesome!

dward · 9 years ago
Out of curiosity, are you also using the cloud CDN?

https://cloud.google.com/compute/docs/load-balancing/http/us...

7ewis · 9 years ago
How did you get the pass?

Been trying to get one for IO (can't attend NEXT unfortunately)

i336_ · 9 years ago
Hopefully you're still there even though S3 is back up. I have an interesting question I really, really hope you can answer. (Potential customer(s) here!!)

There are a large number of people out there looking intently at ACD's "unlimited for $60/yr" and wondering what that really means.

I recently found https://redd.it/5s7q04 which links to https://i.imgur.com/kiI4kmp.png (small screenshot) showing a user hit 1PB (!!) on ACD (1 month ago). If I understand correctly, the (throwaway) data in question was slowly being uploaded as a capacity test. This has surprised a lot of people, and I've been seriously considering ACD as a result.

On the way to finding the above thread I also just discovered https://redd.it/5vdvnp, which details how Amazon doesn't publish transfer thresholds, their "please stop doing what you're doing" support emails are frighteningly vague, and how a user became unable to download their uploaded data because they didn't know what speed/time ratios to use. This sort of thing has happened heaps of times.

I also know a small group of Internet archivists that feed data to Archive.org. If I understand correctly, they snap up disk deals wherever they can find them, besides using LTO4 tapes, the disks attached to VPS instances, and a few ACD and GDrive accounts for interstitial storage and crawl processing, which everyone is afraid to push too hard so they don't break. One person mentioned that someone they knew hit a brick wall after exactly 100TB uploaded - ACD simply would not let this person upload any more. (I wonder if their upload speed made them hit this limit.) The archive group also let me know that ACD was better at storing lots of data, while GDrive was better at smaller amounts of data being shared a lot.

So, I'm curious. Bandwidth and storage are certainly finite resources, I'll readily acknowledge that. GDrive is obviously going to have data-vs-time transfer thresholds and upper storage limits. However, GSuite's $10/month "unlimited storage" is a very interesting alternative to ACD (even at twice the cost) if some awareness of the transfer thresholds was available. I'm very curious what insight you can provide here!

The ability to create share links for any file is also pretty cool.

ptrptr · 9 years ago
Now that's what I call a shameless plug!
scrollaway · 9 years ago
We would definitely seriously consider switching to GCS more if your cloud functions were as powerful as AWS Lambda (trigger from an S3 event) and supported Python 3.6 with serious control over the environment.
boulos · 9 years ago
Is there something about the GCS trigger that doesn't work for you? I hear you on Python 3, but I'm also curious about "serious control over the environment". Can you be more specific?
vikiomega9 · 9 years ago
On a curious note, how do you guys use lambda?
simonebrunozzi · 9 years ago
I keep telling people that in my view, Google Cloud is far superior to AWS from a technical standpoint. Most people don't believe me... Yet. I guess it will change soon.
natbobc · 9 years ago
Google Cloud is the Betamax of cloud... while it might be technically superior it's not the only factor to consider. :)
notyourwork · 9 years ago
One service outage determines superiority? I prefer a lot more data than a single point.
joshontheweb · 9 years ago
I'm in the process of moving to GCS mostly based on how byzantine the AWS setup is. All kinds of crazy unintuitive configurations and permissions. In short, AWS makes me feel stupid.
joshontheweb · 9 years ago
I should add that someone from the AWS team reached out to me in response to this comment asking for feedback on how they can improve their usability. So I give them credit for that.
andmarios · 9 years ago
As far as I understand, the S3 API of Cloud Storage is meant as a temporary solution until a proper migration to Google's own APIs.

The S3 keys it produces are tied to your developer account. This means that if someone gets the keys from your NAS, they will have access to all the Cloud Storage buckets you have access to (e.g. your employer's).

I use Google Cloud but not Amazon. Once I wanted an S3 bucket to try with Nextcloud (then ownCloud). I was really frightened to produce an S3 key with my Google developer account.

BrandonY · 9 years ago
The HMAC credential that you'd use with the S3-compatible GCS API, also called the "XML API", does need to be associated with a Google account, but it doesn't need to be the main account of the developer. It can be any Google user account. I suggest creating a separate account and granting it only the permissions it needs. It'd be nice if service accounts (aka robot accounts) could be given HMAC credentials, but that's not supported. Service accounts can, however, sign URLs with RSA keys.

As another option, you can continue using the XML API and switch out only the auth piece to Google's OAuth system while changing nothing else.

There's a lot more detail available at: https://cloud.google.com/storage/docs/migrating

Disclaimer: I work on Google Cloud Storage.

dividuum · 9 years ago
Is there any equivalent to the Bucket Policies that AWS provides (http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucke...)? Cloud Storage seems to be limited to relatively simple policies without conditionals. For a few AWS IAM keys I set up a policy that limits write/delete access to a range of IPs (among other things). Something like that doesn't seem possible with what Google offers. Or am I missing something?
stef25 · 9 years ago
> OwnCloud

Kicked the tires, not impressed at all. Notes went missing from the interface, and I could only get them back after manually digging through folders via FTP.

rynop · 9 years ago
"fraction of the cost" - how do you figure? Or are you just saying from a cost-to-store perspective?

Your egress prices are quite a bit higher than CloudFront's for sub-10TB usage ($0.12/GB vs $0.085/GB).

Weighing the track record of S3 outages against the time you're up and serving egress, S3 seems to win on cost. If all you're worried about is cross-region data storage, you're probably a big player with an AWS enterprise agreement in place, which offsets the cost of storage.

boulos · 9 years ago
Sorry, my comparison is our Multi Regional storage (2.6c/GB/month) versus S3 Standard plus Cross-Regional Replication. That's the right comparison (especially for outages like this one).

As to our network pricing, we have a drastically different backbone (we feel it's superior, so we charge more). But since you mention CloudFront, the right comparison is probably Google Cloud CDN (https://cloud.google.com/cdn/), which has lower pricing than "raw egress".

Spunkie · 9 years ago
So this is more compute related but do you know if there are any plans on supporting the equivalent of the webpagetest.org(WPT) private instance AMI on your platform?

Not only is webpagetest.org a Google product, but it's also much better suited to the minute-by-minute billing cycle of Google Cloud Compute. For any team not needing to run hundreds of tests an hour, the cost difference between running a WPT private instance on EC2 versus on Google Cloud Compute could easily be in the thousands of dollars.

malloryerik · 9 years ago
Would use Google but I just can't give up access to China. Sad because I also sympathize with Google's position on China.
zoloateff · 9 years ago
boulos - not in bad taste at all. Happy Google convert and GCS user here; it works very well for us, YMMV.
zoloateff · 9 years ago
boulos - is App Engine Datastore the preferred way to store data, or Cloud SQL, or something else? Do you mind throwing some light on this? Thanks.
DenisM · 9 years ago
If you made a .NET library that allows easily connecting to both AWS and GCS by only changing the endpoint, I would certainly use that library instead of Amazon's own.

Just saying, it gets you a foot in the door.

danielvf · 9 years ago
I had no idea this was an option. Great to know!
sandGorgon · 9 years ago
I have had problems integrating Apache Spark with Google Storage, especially because S3 is directly supported in Spark.

If you are API compatible with S3, could you make it easy/possible to work with Google Storage inside Spark?

Remember, I may or may not run my Spark on Dataproc.

bluedonuts · 9 years ago
You can use the Google cloud storage connector (https://cloud.google.com/hadoop/google-cloud-storage-connect...) which works with hadoop (and therefore spark).
mbrumlow · 9 years ago
What is your NAS box doing with S3/GCS ?
boulos · 9 years ago
Remote backup (Synology). I've asked them more than once to directly support GCS, or even just to accept my damn patch ;).
gaul · 9 years ago
S3 applications can use any object store if they use S3Proxy:

https://github.com/andrewgaul/s3proxy

thejosh · 9 years ago
How about giving a timeline of when Australia will be launching? I see you're hiring staff, and have a "sometime 2017" goal on the site, but how about a date estimate? :)
philliphaydon · 9 years ago
Does GCS support events yet?
hyperpallium · 9 years ago
As Relay's chief competitor in this region, we of Windsong have benefited modestly from the overflow; however, until now we thought it inappropriate to propose a coordinated response to the problem.
espeed · 9 years ago
What software are you using for your NAS box?
pmarreck · 9 years ago
Classy parley. I'll allow it.
masterleep · 9 years ago
Competition is great for consumers!
cperciva · 9 years ago
S3 is currently (22:00 UTC) back up.

The timeline, as observed by Tarsnap:

    First InternalError response from S3: 17:37:29
    Last successful request: 17:37:32
    S3 switches from 100% InternalError responses to 503 responses: 17:37:56
    S3 switches from 503 responses back to InternalError responses: 20:34:36
    First successful request: 20:35:50
    Most GET requests succeeding: ~21:03
    Most PUT requests succeeding: ~21:52
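Working the arithmetic on those timestamps (all same-day UTC), the hard-outage window comes out just under three hours:

```python
from datetime import datetime

def t(s):
    # Parse an HH:MM:SS timestamp from the timeline above (UTC, same day).
    return datetime.strptime(s, "%H:%M:%S")

last_ok  = t("17:37:32")  # last successful request
to_503   = t("17:37:56")  # switch to 503 responses
back_500 = t("20:34:36")  # switch back to InternalError
first_ok = t("20:35:50")  # first successful request

print("hard outage:", first_ok - last_ok)  # 2:58:18
print("503 phase:", back_500 - to_503)     # 2:56:40
```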

josephb · 9 years ago
Thanks for taking the time to post a timeline from the perspective of an S3 customer. It will be interesting to see how this lines up against other customer timelines, or the AWS RFO.
kaishiro · 9 years ago
Playing the role of the front-ender who pretends to be full-stack if the money is right, can someone explain the switch from internal error to 503 and back? Is that just them pulling s3 down while they investigate?
cperciva · 9 years ago
My guess based on the behaviour I've seen is that internal nodes were failing, and the 503 responses started because front-end nodes didn't have any back-end nodes which were marked as "not failing and ready for more requests". When Amazon fixed nodes, they would have marked the nodes as "not failed", at which point the front ends would have reverted to "we have nodes we can send traffic to" behaviour.
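That guess can be sketched as a toy model (the health flags here are hypothetical, not AWS's actual internals): a front end relays the backend's 500 while it still has routable backends, and emits its own 503 once none are marked ready.

```python
def front_end_status(backends):
    """Status a front end would return for one request, given backend state."""
    routable = [b for b in backends if b["ready"]]
    if not routable:
        return 503  # nowhere to send traffic
    return 500 if routable[0]["failing"] else 200

pool = [{"ready": True, "failing": True}]
print(front_end_status(pool))  # 500: backend still in rotation, but erroring

pool[0]["ready"] = False       # backend marked "failed", pulled from rotation
print(front_end_status(pool))  # 503: no routable backends

pool[0] = {"ready": True, "failing": False}  # repaired, back in rotation
print(front_end_status(pool))  # 200: traffic flows again
```

That sequence reproduces the observed InternalError -> 503 -> InternalError -> success progression.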
greenleafjacob · 9 years ago
Could be anything. Most likely scenario is the internal error is a load shedding error and the 503s were when the system became completely unresponsive. If it was a configuration issue then it is more likely that it would have directly recovered rather than going 'internal error -> 503 -> internal error'.
hmottestad · 9 years ago
503 is typically what we see when our proxy can't connect to the backend server. We usually get 500 with internal server error when we've messed up the backend server.

So it's likely that the first 500s were the backend for s3 failing, then they took the failing backends offline causing the load balancers to throw 503 because they couldn't connect to the backend.

Twirrim · 9 years ago
S3 is not a monolithic architecture, Amazon is a strong proponent of Service Oriented Architecture for producing scalable platforms.

There are a number of services behind the front end fleet in S3's architecture that handle different aspects of returning a response. Each of those will have their own code paths in the front end, very likely developed by different engineers over the years. As ever, appropriate status codes for various circumstances are something that always seems to spur debate amongst developers.

The change in status code would likely be a reflection of the various components entering unhealthy & healthy states, triggering different code paths for the front end... which suggests whatever happened might have had quite a broad impact, at least on their synchronous path components.

gamache · 9 years ago
A piece of hard-earned advice: us-east-1 is the worst place to set up AWS services. You're signing up for the oldest hardware and the most frequent outages.

For legacy customers, it's hard to move regions, but in general, if you have the chance to choose a region other than us-east-1, do that. I had the chance to transition to us-west-2 about 18 months ago and in that time, there have been at least three us-east-1 outages that haven't affected me, counting today's S3 outage.

EDIT: ha, joke's on me. I'm starting to see S3 failures as they affect our CDN. Lovely :/

traskjd · 9 years ago
Reminds me of an old joke: Why do we host on AWS? Because if it goes down, our customers are so busy worrying about being down themselves that they don't even notice that we're down!
nabla9 · 9 years ago
Reminds me of an even older joke (from 80's or 90's):

Q: Why computers don't crash at the same time?

A: Because network connections are not fast enough.

(I think we are starting to get there)

xbryanx · 9 years ago
I'm getting the same outage in us-west-2 right now.
firloop · 9 years ago
The dashboard doesn't load, nor does content using the generic S3 url [1], but we're in us-west-2 and it works fine if you use the region specific URL [2]. In practice this means our site on S3/Cloudfront is unaffected.

[1]: https://s3.amazonaws.com/restocks.io/robots.txt

[2]: https://s3-us-west-2.amazonaws.com/restocks.io/robots.txt
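The distinction is just which hostname the request hits; a small sketch of the two path-style URL forms (using the bucket from the links above):

```python
def s3_url(bucket, key, region=None):
    # Path-style URL; with no region it goes through the global endpoint,
    # which was the part failing during this outage.
    host = "s3.amazonaws.com" if region is None else f"s3-{region}.amazonaws.com"
    return f"https://{host}/{bucket}/{key}"

print(s3_url("restocks.io", "robots.txt"))               # global endpoint
print(s3_url("restocks.io", "robots.txt", "us-west-2"))  # region-specific
```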

STRML · 9 years ago
Seeing it in eu-west-1 as well. Even the dashboard won't load. Shame on AWS for still reporting this as up; what use is a Personal Health Dashboard if it's to AWS's advantage not to report issues?
WaxProlix · 9 years ago
Same here, and it's 100% consistent, not 'increased error rates' but actually just fully down. I'd just stop working but I have a demo this afternoon... the downsides of serverless/cloud architectures, I guess.
all_usernames · 9 years ago
Our services in us-west-2 have been up the whole time.

I think the problem is that globally accessible APIs are impacted. As others have noted, if you can use region/AZ-specific hostnames to connect, you can get through to S3.

CloudFront is faithfully serving up our existing files even from buckets in US-East.

codelitt · 9 years ago
IIRC the console for S3 is global and not region specific even though buckets are.
Ph4nt0m · 9 years ago
Same outage in ca-central-1
ngtvspc · 9 years ago
I can confirm this as well.
gamache · 9 years ago
Huh, I'm not seeing it on my us-west-2 services. Interesting.
movedx · 9 years ago
My advice is: don't keep all your eggs in one basket. AZs give you localised redundancy, but as cloud is cheap and plentiful, you should be using two or more regions, at least, to house your solution (if it's important to you).

EDIT: less arrogant. I need a coffee.

gamache · 9 years ago
But now you're talking about added effort. Multi-AZ on AWS is easy and fairly automatic, multi-region (and multi-provider) not so much. It's easy to say things like this, but people who can do ops are not cheap and plentiful.
jacquesm · 9 years ago
Two different vendors if you can afford it. It's a bit of a hassle though.
bischofs · 9 years ago
It shouldn't be technically possible to lose S3 in every region; how did Amazon screw this up so badly?
boulos · 9 years ago
I believe the reports here are misleading: if you try to access your other regions through the default s3.amazonaws.com it apparently routes through us-east first (and fails), but you're "supposed to" always point directly at your chosen region.

Disclosure: I work on Google Cloud (and didn't test this, but some other comment makes that clear).

twistedpair · 9 years ago
Amen. We set up our company cloud 2 years ago in US-West-2 and have never looked back. No outage to date.
jacquesm · 9 years ago
If you have a piece of unvarnished wood handy...
compuguy · 9 years ago
Is us-east-2 (Ohio) any better (minus this AWS-wide S3 issue)?
mullen · 9 years ago
us-east-2 is brand new and us-east-1 is the oldest region. Any time there is an issue, it is almost always us-east-1. If possible, I would migrate out of us-east-1.
jchmbrln · 9 years ago
Probably valid, though in this case while us-west-1 is still serving my static websites, I can't push at all.
nola-radar · 9 years ago
The s3 outage covered all regions.
movedx · 9 years ago
Really? Even Australia? Can you provide evidence of this so I know for any clients that call me today? :)

EDIT: Found my answer. "Just to stress: this is one S3 region that has become inaccessible, yet web apps are tripping up and vanishing as their backend evaporates away." -- https://www.theregister.co.uk/2017/02/28/aws_is_awol_as_s3_g...

notheguyouthink · 9 years ago
That's a really good point!

alexleclair · 9 years ago
Yup, same here. It has been a few minutes already. Wanna bet the green checkmark[1] will stay green until the incident is resolved?

[1] https://status.aws.amazon.com/

nostromo · 9 years ago
The red check mark is hosted on S3...

AtheistOfFail · 9 years ago
Comment of the year.
emrekzd · 9 years ago
In December 2015 I received an e-mail with the following subject line from AWS, around 4 am:

"Amazon EC2 Instance scheduled for retirement"

When I checked the logs it was clear the hardware failed 30 mins before they scheduled it for retirement. EC2 and root device data was gone. The e-mail also said "you may have already lost data".

So I know that Amazon schedules servers for retirement after they already failed, green check doesn't surprise me.

smoodles · 9 years ago
So just as a FYI the reason that probably happened to you is that the underlying host was failing. I am assuming they wanted to give you a window to deal with it but the host croaked before then. I've been dealing w/ AWS for a long long time and I've never seen a maintenance event go early unless the physical hardware actually died...
amaks · 9 years ago
That's what happens when a cloud provider doesn't support live migration for VMs.
problems · 9 years ago
That's completely ridiculous, get some fucking RAID Amazon.

I order drives off newegg directly to my DC and I'm yet to lose data with the cheapest drives available in RAID10.

tuna-piano · 9 years ago
It's crazy how much better the communication (including updates and status pages) from companies that rely on AWS is than AWS's own.

https://status.heroku.com/incidents/1059

tcsf · 9 years ago
Blake Gentry gave a full accounting of Heroku's response process here - http://www.heavybit.com/library/video/every-minute-counts-co...

Amazon should take notes.

all_usernames · 9 years ago
I feel for them. Imagine, 40 or 50 different engineering teams all responsible for updating their statuses. At this moment on the AWS status page I see random usage of the red, yellow, and green icons, even though all the status updates are "Increased error rates." What that tells me is that there's no unified communication protocol across the teams, or they're not following it. And just imagine what it's like being on the S3 team right now.

I notice even Cloudflare is starting to have problems serving up pages now.

SnowingXIV · 9 years ago
Font Awesome went out for me for a bit, but they did a great job getting back up and keeping their users in the loop.

https://status.fortawesome.com/

tlogan · 9 years ago
These service health boards are more like an advertisement page than the actual status of the service.
mwfj · 9 years ago
I guess their bizarre thinking is something along the lines of: "unless we have proof that no one can access the service, we won't change the indicator from green to yellow."

Seriously: I don't understand why you guys stay with AWS.

hartleybrody · 9 years ago
I'm seeing green checkmarks across the board, but they just added a notice to the top of the page:

> Increased Error Rates

> We are investigating increased error rates for Amazon S3 requests in the US-EAST-1 Region.

syntheticcdo · 9 years ago
I guess sub-1% to 100% failure rate is technically an "increase".
samstave · 9 years ago
the worst thing is when your system can't handle these "increased error rates" as your control plane cascades failure due to something like this....

The worst "increased error rate" problem I had was when the API was failing and my autoscale system couldn't cope and launched thousands of instances, because it couldn't tell when instances were launched (lack of API access), and the instances pummelled the fuck out of all other parts of the system and we basically had to reboot the entire platform....

Luckily, Amazon is REALLY forgiving with respect to costs in these (and actually most) circumstances....

matwood · 9 years ago
I always joke that if one of those statuses ever went to red, it means the zombie apocalypse has begun.
paulddraper · 9 years ago
The number of non-green marks is the number of ICBMs currently in flight towards an AWS data center.
cperciva · 9 years ago
The good news is, if Amazon's services are marked as offline, you're allowed to use Amazon Lumberyard to control nuclear power plants.
chrisan · 9 years ago
In case anyone wants to see what the mysterious red icon looks like: https://status.aws.amazon.com/images/status3.gif

At best, when there are problems (though not now, I guess), I'll see the "note" green icon: https://status.aws.amazon.com/images/status1.gif

krylon · 9 years ago
I've heard (on the Fnord news show at the most recent CCC congress, so take it with a grain of salt and a bucket of humor) that Amazon's TOS are more or less void when a Zombie Apocalypse breaks out.

They had some convoluted but fairly specific wording in their TOS; whoever wrote it must have had a lot of fun.

obsurveyor · 9 years ago
Then I guess it has begun, the page is now showing red. I'd put a picture on imgur but it's not loading.
jonstaab · 9 years ago
http://downdetector.com/status/aws-amazon-web-services looks like a reasonable alternative place to check/report downtime.
zedpm · 9 years ago
I just check Twitter, since Amazon's status is always a lie. My personal dashboard is still showing no problems. It's bad enough that the main public status is always green even when there's clearly a problem, but you'd think they could at least make the private status accurate.
eicnix · 9 years ago
Which is coincidentally down.
jonstaab · 9 years ago
Gah. It was up 3 minutes ago. Anyone have any suspicion this is another ddos episode? I saw that SO was down last night too: https://twitter.com/StackStatus/status/836450836322516992
vjdhama · 9 years ago
Apparently, that's down too. Sigh.
Fiahil · 9 years ago
So, global S3 outage for more than an hour now. Still green, still talking about "US East issue". I'm amazed.
gtsteve · 9 years ago
It doesn't appear to be global; my app in eu-west-1 appears unaffected.

It's possible that the console won't work however as I believe that's served from us-east-1.

gordon_freeman · 9 years ago
Looks like they have fixed the issue with their health dashboard now.

From https://status.aws.amazon.com/ : Update at 11:35 AM PST: We have now repaired the ability to update the service health dashboard. The service updates are below. We continue to experience high error rates with S3 in US-EAST-1, which is impacting various AWS services. We are working hard at repairing S3, believe we understand root cause, and are working on implementing what we believe will remediate the issue.

talawahdotnet · 9 years ago
There was an alert on the personal health dashboard[1] a second ago, it said S3 Operational issue in us-east-1 but when I tried to view the details it showed an error.

Then I refreshed and the event disappeared altogether.

[1] https://phd.aws.amazon.com/phd/home?region=us-east-1#/dashbo...

socialentp · 9 years ago
Same here. But it is in the general status dashboard: http://status.aws.amazon.com/
tuna-piano · 9 years ago
Still green now, 8 minutes in.
bpicolo · 9 years ago
I've had a few non-Amazon providers tell me AWS things are not working in the last 5 minutes, no note from Amazon though.

Nice.

fudged71 · 9 years ago
... still green
clamprecht · 9 years ago
adrenalinelol · 9 years ago
Looks like his personal site isn't loading... :)
ak2196 · 9 years ago
We have a slack emoji for it called greenish. It's the classic AWS green checkmark with an info icon in the bottom. Apparently it's NOT an outage if you don't acknowledge it. It's called alt-uptime.
foxylion · 9 years ago
I really liked it. But when trying to add it to my HipChat group it failed to upload. Why? S3 outage, what an irony.
nhumrich · 9 years ago
AWS internal lingo calls this the "green-i"

Dead Comment

cheeze · 9 years ago
Just went yellow

Edit: nevermind

schneidmaster · 9 years ago
Did it? Still fields of green for me.
matthuggins · 9 years ago
Still green for me
cyberferret · 9 years ago
Well, at least our decision to split services has paid off. All of our web app infrastructure is on AWS, which is currently down, but our status page [0] is on Digital Ocean, so at least our customers can go see that we are down!

A Pyrrhic victory... ;)

[0] - http://status.hrpartner.io

EDIT UPDATE: Well, I spoke too soon - even our status page is down now, but not sure if that is linked to the AWS issues, or simply the HN "hug of death" from this post! :)

EDIT UPDATE 2: Aaaaand, back up again. I think it just got a little hammered from HN traffic.

jariz · 9 years ago
When even the status page is down, panic.
AtheistOfFail · 9 years ago
Could be worse, your entire infrastructure could be hosted on Heroku.

You might not use S3 yourself, but because they do, your entire infrastructure crumbles.

bananabill · 9 years ago
I didn't realize Heroku used S3 until today, when my Heroku app failed. Makes me wonder why I'm using Heroku instead of just AWS.
lambdasquirrel · 9 years ago
I don't see why this is being downvoted. It's a pretty legitimate concern.
ExactoKnight · 9 years ago
The biggest change Heroku needs to make is supporting different regions.
insomniacity · 9 years ago
HTTP 500 :(
crack-the-code · 9 years ago
Plot twist: Digital Ocean is secretly hosted on AWS
cyberferret · 9 years ago
LOL...I just got notice that our status server is down now too! Maybe DO is just a rebranded offshoot of AWS after all... :D
gmisra · 9 years ago
FYI to S3 customers, per the SLA, most of us are eligible for a 10% credit for this billing period. But the burden is on the customer to provide incident logs and file a support ticket requesting said credit (it must be really challenging to programmatically identify outage coverage across customers /s)

https://aws.amazon.com/s3/sla/

primitivesuave · 9 years ago
The 10% savings of ~$10 does not compare to time/potential business lost, but thanks for the tip :)
mabbo · 9 years ago
> potential business lost

My startup's op team had a great discussion today because of this that basically boils down to "if we hit our sales goals, an incident like this a year from now would end our company".

Looks like our plans to start prepping for multi-cloud support will be a higher priority.

machbio · 9 years ago
That's for below 99.9% -- they're at 99.997%... you're never getting that 10% credit.
christop · 9 years ago
0.1% of 28 days is 40 minutes, so it seems likely to happen.
joatmon-snoo · 9 years ago
You got your orders of magnitude wrong ;)

    99.9964583 = 100 - 153/(30*24*60)        <- percentage minus a raw fraction
    99.6458333 = 100 * (1 - 153/(30*24*60))  <- the actual uptime percentage
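The arithmetic above can be checked directly; a quick sketch (assuming a 30-day month and 153 minutes of downtime, the figures used in the parent comment):

```python
# Monthly uptime percentage: 153 minutes of downtime in a 30-day month.
MINUTES_IN_MONTH = 30 * 24 * 60  # 43,200 minutes

downtime_minutes = 153
downtime_fraction = downtime_minutes / MINUTES_IN_MONTH  # ~0.00354

# Off by two orders of magnitude: subtracts a raw fraction from a percentage.
wrong = 100 - downtime_fraction               # ~99.9965

# Correct: convert the uptime fraction to a percentage first.
uptime_pct = 100 * (1 - downtime_fraction)    # ~99.6458

# 99.65% is below the 99.9% SLA threshold, so the 10% credit would apply.
print(f"{wrong:.7f} vs {uptime_pct:.7f}")
```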

geerlingguy · 9 years ago
From Amazon: https://twitter.com/awscloud/status/836656664635846656

    The dashboard not changing color is related to S3 issue.
    See the banner at the top of the dashboard for updates.
So it's not just a joke... S3 being down actually breaks its own status page!

etler · 9 years ago
For this kind of page it might be best for them to use a data URI image to remove as many external resources as possible.
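To illustrate the data-URI idea, here is a rough sketch of inlining an icon so the status page has no external image dependency (the 1x1 GIF bytes below are a placeholder; a real page would embed its actual icon):

```python
import base64

# Placeholder: a tiny 1x1 GIF standing in for the real status check-mark icon.
gif_bytes = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)

# Embed the image bytes directly in the page as a data URI, so rendering the
# status icon never requires a fetch from an external host (such as S3).
data_uri = "data:image/gif;base64," + base64.b64encode(gif_bytes).decode("ascii")
img_tag = f'<img src="{data_uri}" alt="operational">'

print(img_tag)
```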
Symbiote · 9 years ago
Unicode characters would work fine, and be even smaller.

Warning sign, octagonal sign, no entry (all filtered by HN).

There are plenty of possibilities.

JangoSteve · 9 years ago
I was thinking they should host the little green check mark icons on s3.
jliptzin · 9 years ago
Thank god I checked HN. I was driving myself crazy for the last half hour debugging a change to S3 uploads that I JUST pushed to production. Reminds me of the time my dad had an electrician come to work on something minor in his house. Suddenly power went out to the whole house, and the electrician couldn't figure out why for hours. Finally they realized this was the big east coast blackout!
Havoc · 9 years ago
Disadvantage of being in the detail I guess. My thinking was Imgur seems broken today >>> Something major on the intertubes must be fk'd.
TeMPOraL · 9 years ago
Precisely how I discovered it. Imgur down. Imgur is almost like a piece of critical Internet infrastructure. That, plus some other site misbehaving, tipped me off that something very wrong was happening...
stevehawk · 9 years ago
irc.freenode.net / ##aws (must be registered with NickServ to join)

Outage first reported around 11:35 CST.