My guess is this is all due to CloudWatch Logs PutLogEvents failures.
By default, a Docker container configured with the awslogs driver runs in "blocking" mode. As the container logs, Docker buffers the output and pushes it to CloudWatch Logs frequently. If the log stream is faster than the buffer can absorb, writes to stdout/stderr block and the container freezes on the logging write call. If PutLogEvents is failing, buffers are probably filling up and freezing containers. I assume most of AWS uses its own logging system, which could cause these large, intermittent failures.
If you're okay dropping logs, add something like this to the container logging definition:
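A minimal sketch of the logConfiguration block in an ECS container definition, assuming the awslogs driver; the log group, region, stream prefix, and buffer size are placeholders to tune for your own setup:

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "mode": "non-blocking",
    "max-buffer-size": "25m",
    "awslogs-group": "/ecs/my-app",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs"
  }
}
```

With non-blocking mode, Docker drops log lines once the in-memory buffer fills instead of stalling the container's writes, so size max-buffer-size according to how much loss you can tolerate.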
I just want to thank you for providing this info. This was exactly the cause of some of our issues and this config setting restored functionality to a major part of our app.
Happy it helped. If you have a very high-throughput app (or something that logs gigantic payloads), the "logging pauses" may slow down your app in non-obvious ways. Diagnosing it the very first time took forever (I think I straced the process in the docker container and saw it was hanging on `write(1)`).
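For anyone trying to confirm this right now, a rough sketch of that kind of check, assuming you can run strace on the host; the container name is a placeholder:

```sh
# Find the container's main process ID on the host
PID=$(docker inspect --format '{{.State.Pid}}' my-app-container)

# Trace write syscalls; a task stuck behind blocking-mode logging shows a
# write() to fd 1 (stdout) or fd 2 (stderr) that never returns
sudo strace -f -p "$PID" -e trace=write
```

If strace just sits on a `write(1, ...)` call, the logging pipeline is the likely bottleneck.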
[03:59 PM PDT] We can confirm increased error rates and latencies for Kinesis APIs within the US-EAST-1 Region. We have identified the root cause and are actively working to resolve the issue. As a result of this issue, other services, such as CloudWatch, are also experiencing increased error rates and delayed CloudWatch log delivery. We will continue to keep you updated as we make progress in resolving the issue.
This is a bigger deal than the 'degraded' label implies. SQS has basically ground to a halt for reads, which is leading to massive slowdowns where I am, and the logging issues are causing task timeouts.
https://aws.amazon.com/blogs/containers/preventing-log-loss-...
39 affected services listed:
AWS Application Migration Service
AWS Cloud9
AWS CloudShell
AWS CloudTrail
AWS CodeBuild
AWS DataSync
AWS Elemental
AWS Glue
AWS IAM Identity Center
AWS Identity and Access Management
AWS IoT Analytics
AWS IoT Device Defender
AWS IoT Device Management
AWS IoT Events
AWS IoT SiteWise
AWS IoT TwinMaker
AWS License Manager
AWS Organizations
AWS Step Functions
AWS Transfer Family
Amazon API Gateway
Amazon AppStream 2.0
Amazon CloudSearch
Amazon CloudWatch
Amazon Connect
Amazon EMR Serverless
Amazon Elastic Container Service
Amazon Kinesis Analytics
Amazon Kinesis Data Streams
Amazon Kinesis Firehose
Amazon Location Service
Amazon Managed Grafana
Amazon Managed Service for Prometheus
Amazon Managed Workflows for Apache Airflow
Amazon OpenSearch Service
Amazon Redshift
Amazon Simple Queue Service
Amazon Simple Storage Service
Amazon WorkSpaces
> Amazon Kinesis Data Streams is a serverless streaming data service that simplifies the capture, processing, and storage of data streams at any scale.
I'd never heard of that one.
https://status.xero.com/