tcas · a year ago
My guess is this is all due to CloudWatch Logs PutLogEvents failures.

By default, a Docker container configured with the awslogs log driver runs in "blocking" mode. As the application writes logs, Docker buffers them and pushes them to CloudWatch Logs frequently. If the log stream is faster than the buffer can absorb, writes to stdout/stderr block and the container freezes on the logging write call. If PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume most of AWS uses its own logging system, which could cause these large, intermittent failures.

If you're okay dropping logs, add something like this to the container logging definition:

  "max-buffer-size": "25m"
  "mode": "non-blocking"

mbaumbach · a year ago
I just want to thank you for providing this info. This was exactly the cause of some of our issues and this config setting restored functionality to a major part of our app.
tcas · a year ago
Happy it helped. If you have a very high-throughput app (or something that logs gigantic payloads), the "logging pauses" may slow down your app in non-obvious ways. Diagnosing it the very first time took forever (I think I straced the process in the Docker container and saw it was hanging on `write(1)`).

https://aws.amazon.com/blogs/containers/preventing-log-loss-...

ackdesha · a year ago
It seems to have cascaded from AWS Kinesis...

[03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch Logs delivery. We will continue to keep you updated as we make progress in resolving the issue.

39 affected services listed:

AWS Application Migration Service
AWS Cloud9
AWS CloudShell
AWS CloudTrail
AWS CodeBuild
AWS DataSync
AWS Elemental
AWS Glue
AWS IAM Identity Center
AWS Identity and Access Management
AWS IoT Analytics
AWS IoT Device Defender
AWS IoT Device Management
AWS IoT Events
AWS IoT SiteWise
AWS IoT TwinMaker
AWS License Manager
AWS Organizations
AWS Step Functions
AWS Transfer Family
Amazon API Gateway
Amazon AppStream 2.0
Amazon CloudSearch
Amazon CloudWatch
Amazon Connect
Amazon EMR Serverless
Amazon Elastic Container Service
Amazon Kinesis Analytics
Amazon Kinesis Data Streams
Amazon Kinesis Firehose
Amazon Location Service
Amazon Managed Grafana
Amazon Managed Service for Prometheus
Amazon Managed Workflows for Apache Airflow
Amazon OpenSearch Service
Amazon Redshift
Amazon Simple Queue Service
Amazon Simple Storage Service
Amazon WorkSpaces

cout · a year ago
44 services are showing as affected now, and AWS IoT Analytics, AWS IoT TwinMaker, and Amazon Elastic MapReduce are showing as Resolved.
remram · a year ago
https://aws.amazon.com/kinesis/

> Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the capture, processing, and storage of data streams at any scale.

I'd never heard of that one.

jmward01 · a year ago
This is a bigger deal than the 'degraded' implies. SQS has basically ground to a halt for reads, which is leading to massive slowdowns where I am, and the logging issues are causing task timeouts.
rushingcreek · a year ago
The us-east-1 curse strikes again! Elastic Container Service is down for us completely.
chucky_z · a year ago
This is just starting to effect us, looks like SQS is the biggest loser right now.
mparnisari · a year ago
affect* :)
WheatMillington · a year ago
Our accounting system Xero is down, and their status page references AWS. Related to this, I assume.

https://status.xero.com/

cout · a year ago
Though it is not listed in the 33 affected services, we are seeing an issue communicating with S3 via a Storage Gateway.
catlifeonmars · a year ago
Managed CloudFormation StackSets aren’t showing up for me. I assume this is related to Organizations.