tj_591 · 4 months ago
Hi all, Tushar from Docker here. We’re sorry about the impact our current outage is having on many of you. Yes, this is related to the ongoing AWS incident and we’re working closely with AWS on getting our services restored. We’ll provide regular updates on dockerstatus.com.

We know how critical Docker Hub and our services are to millions of developers, and we’re sorry for the pain this is causing. Thank you for your patience as we work to resolve this incident. We’ll publish a post-mortem in the next few days once this incident is fully resolved and we have a remediation plan.

freedomben · 4 months ago
Part of me hopes that we find out that DynamoDB (which it sounds like was the root of the cascading failures) is shipped in a Docker image which is hosted on Docker Hub :-D
tj_591 · 3 months ago
We’ve published an incident report outlining what happened and the steps we’re taking to strengthen resilience in the face of upstream service interruptions. - https://www.docker.com/blog/docker-hub-incident-report-octob...
tonyabracadabra · 4 months ago
pls bring it back

atymic · 4 months ago
reader_1000 · 4 months ago
> We have identified the underlying issue with one of our cloud service providers.

Isn't everyone using multiple cloud providers nowadays? Why are they affected by a single cloud provider's outage?

lvncelot · 4 months ago
I think more often than not, companies are using a single cloud provider, and even when multiple are used, it's either different projects with different legacy decisions or a conscious migration.

True multi-cloud is not only very rare, it's an absolute pain to manage as soon as people start using any vendor-specific functionality.

jelder · 4 months ago
No, that's pretty rare, and generally means you can't count on any features more sophisticated than VMs and object storage.

On the other hand, it's pretty embarrassing at this point for something as fundamental as Docker to be in a single region. Most cloud providers make inter-region failover reasonably achievable.

roywiggins · 4 months ago
You can be multi-cloud in the sense that you aren't dependent on any single provider, or in the sense that you are dependent on all of them.
postexitus · 4 months ago
Not only are they not using multiple cloud providers, they aren't even using multiple cloud locations.
rcxdude · 4 months ago
Because it's hard enough to distribute a service across multiple machines in the same DC, let alone across multiple DCs and multiple providers.
pmontra · 4 months ago
Because even if service A is using multiple cloud providers, not all of the external services it uses are doing the same, especially the smallest or cheapest ones. At least one of them is on AWS us-east-1, fails, and degrades service A or takes it down.

Being multi-cloud does not come for free: it costs time, engineers, knowledge, and ultimately money.

DiggyJohnson · 4 months ago
Multi-cloud is nowhere near as trivial to implement for real-world, complex projects as is often implied. Things get challenging the second your application steps off the happy path.
wredcoll · 4 months ago
> Isn't everyone using multiple cloud providers nowadays? Why are they affected by a single cloud provider's outage?

No? I very much doubt anyone is doing that.

walkabout · 4 months ago
> Isn't everyone using multiple cloud providers nowadays?

Oh yes. All of them, in fact, especially if you count what key vendors host on.

> Why are they affected by a single cloud provider's outage?

Every workload is on only one cloud. N.b., this doesn't mean every workflow is on only one cloud; that's an important distinction, since the latter would be more stable.

madisp · 4 months ago
They are using multiple cloud providers, but judging by the Cloudflare R2 outage affecting them earlier this year, I guess all of them are on the critical path?
nobleach · 4 months ago
Looking at the landscape around me, no. Everyone is in crisis cost-cutting, "gotta show that same growth the C-suite saw during Covid" mode. So being multi-provider, and in some cases even multi-regional, is now off the table. It's sad because the product really suffers. But hey, "growth".

ic4l · 4 months ago
This broke our builds since we rely on several public Docker images, and by default, Docker uses docker.io.

Thankfully, AWS provides a docker.io mirror for those who can't wait:

  FROM public.ecr.aws/docker/library/{image_name}
In the error logs, the issue was mostly related to the authentication endpoint:

https://auth.docker.io → "No server is available to handle this request"

After switching to the AWS mirror, everything built successfully without any issues.
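
If editing every reference right away isn't practical, a stopgap sketch (the nginx tag is purely illustrative, and note that BuildKit may still try to contact the upstream registry during builds) is to pull from the mirror and retag locally, so docker run and compose files that use the short name keep working:

  # pull the official image via AWS's public mirror of Docker Hub
  docker pull public.ecr.aws/docker/library/nginx:1.27
  # retag it under the short docker.io-style name that existing references expect
  docker tag public.ecr.aws/docker/library/nginx:1.27 nginx:1.27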

CamouflagedKiwi · 4 months ago
Mild irony that Docker is down because of the AWS outage, but the AWS mirror repos are still running...
kerblang · 4 months ago
Also, docker.io is rate-limited, so if your organization experiences enough growth you will start seeing build failures on a regular basis.

Also, quay.io, another image host from Red Hat, has been read-only all day today.

If you're going to have Docker/container image dependencies, it's best to establish a solid hosting solution instead of riding whatever bus shows up.

pploug · 4 months ago
Rate limits are primarily applied to unauthenticated users; open source projects and business accounts have no limits or much higher thresholds.
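
For CI systems hitting the anonymous limits, authenticating the pull step is usually enough. A minimal sketch (the environment variable names are just placeholders for however your CI stores secrets):

  # log in with a Docker Hub access token stored as a CI secret
  echo "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
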
suriya-ganesh · 4 months ago
Based on the solution, it seems like it's quite straightforward to switch over.
firloop · 4 months ago
I wasn't able to get this working, but I was able to use Google's mirror[0] just fine.

Just had to change

    FROM {image_name}
to

    FROM mirror.gcr.io/{image_name} 
Hope this helps!

[0]: https://cloud.google.com/artifact-registry/docs/pull-cached-...

ic4l · 4 months ago
We tried this initially

  FROM mirror.gcr.io/{image_name}
We received

  failed to resolve source metadata for mirror.gcr.io/
So it looks like these services may not be true mirrors, and are just functioning as a proxy with a cache.

If your image is not cached on one of these then you may be SOL.

geostyx · 4 months ago
public.ecr.aws was failing for me earlier with 5XX errors due to the AWS outage: https://news.ycombinator.com/item?id=45640754
anon7000 · 4 months ago
I manage a large build system and pulling from ECR has been flaking all day
KronisLV · 4 months ago
I guess people who are running their own registries like Nexus and building their own container images from a common base image are feeling at least a bit more secure in their choice right now.

Wonder how many builds or redeployments this will break. Personally, I have nothing against Docker or Docker Hub, of course; I find them to be useful.

yandie · 4 months ago
It's actually an important practice to have a Docker image cache in the middle. You never know if an upstream image is randomly purged from Docker Hub, and then your K8s node gets replaced and now can't pull the base image for your service.

Just engineering hygiene IMO.
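
One way to get that on Kubernetes nodes running containerd is a per-registry mirror config so docker.io pulls go through an internal cache first. A rough sketch, assuming containerd's registry config_path points at /etc/containerd/certs.d and using a hypothetical cache at mirror.internal:5000:

  # /etc/containerd/certs.d/docker.io/hosts.toml
  server = "https://registry-1.docker.io"
  [host."https://mirror.internal:5000"]
    capabilities = ["pull", "resolve"]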

koolba · 4 months ago
> You never know if an upstream image is randomly purged from Docker Hub, and then your K8s node gets replaced and now can't pull the base image for your service.

That doesn't make sense unless you have some oddball setup where k8s is building the images you're running on the fly. There's no such thing as a "base image" for tasks running in k8s. There is just the image itself and its layers, which may come from some other image.

But it's not built by k8s. It's built by whatever is building your images and storing them in your registries. That's where you need your true base image caching.

tom1337 · 4 months ago
We are using base images, but unfortunately some GitHub Actions pull Docker images in their prepare phase, so while my application would build, I cannot deploy it because the CI/CD depends on Docker Hub and you cannot change where these images are pulled from (so they cannot go through a pull-through cache)…
roryirvine · 4 months ago
My advice: document the issue, and use it to help justify spending time on removing those vestigial dependencies on Docker asap.

It's not just about reducing your exposure to third parties who you (presumably) don't have a contract with, it's also good mitigation against potential supply chain attacks - especially if you go as far as building the base images from scratch.

enigmo · 4 months ago
Mirrors can be configured in dockerd or BuildKit. If you can update the config (might need a self-hosted runner?) it's a quick fix; see https://cloud.google.com/artifact-registry/docs/pull-cached-... for an example. AWS and Azure are similar.
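
For the dockerd side, it's a one-line entry in /etc/docker/daemon.json followed by a daemon restart; a minimal sketch (mirror.gcr.io is just one public option, and registry-mirrors only affects Docker Hub pulls):

  {
    "registry-mirrors": ["https://mirror.gcr.io"]
  }
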
Sphax · 4 months ago
We run Harbor and mirror every base image using its Proxy Cache feature, it's quite nice. We've had this setup for years now and while it works fine, Harbor has some rough edges.
thephyber · 4 months ago
I came here to mention that any non-trivial company depending on Docker images should look into a local proxy cache. It's too much infra for a solo developer or tiny organization, but it is a good hedge against Docker Hub, GitHub, etc. downtime and can run faster (less ingress transfer) if located in the same region as the rest of your infra.
nusl · 4 months ago
Currently unable to do much of anything new in dev/prod environments without manual workarounds. I'd imagine the impact is pretty massive.

Aside: seems Signal is also having issues. Damn.

cebert · 4 months ago
I’m not sure that the impact will be that big. Most organizations have their own mirrors for artifacts.
ai-onehealth · 4 months ago
Yes I noticed Signal being down too
yread · 4 months ago
That is nothing compared to how good I feel about not using containers at all.
bombcar · 4 months ago
You don’t want a Rube Goldberg contraption doing everything?

So not agile!

jsmeaton · 4 months ago
Guess where we host Nexus...
frenkel · 4 months ago
Only if they get their base images from somewhere else...
bravetraveler · 4 months ago
Pull-through caches are still useful even when the upstream is down... assuming the image(s) were pulled recently. The HEAD to upstream will obviously fail [when checking currency], but the software is happy to serve what it has already pulled.

Depends on the implementation, of course: I'm speaking to 'distribution/distribution', the reference implementation. Harbor or whatever else may behave differently, I have no idea.
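
For reference, standing up distribution as a Docker Hub pull-through cache is close to a one-liner; a sketch (the port and container name are arbitrary, and you would still point your daemons at it via registry-mirrors):

  # run the reference registry as a pull-through cache of Docker Hub
  docker run -d -p 5000:5000 --name hub-cache \
    -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
    registry:2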

phillebaba · 4 months ago
Shameless plug but this might be a good time to install Spegel in your Kubernetes clusters if you have critical dependencies on Docker Hub.

https://spegel.dev/

osivertsson · 4 months ago
If it really is fully open-source please make that more visible on your landing page.

It is a huge deal if I can start investigating and deploying such a solution as a techie right away, compared to having to go through all the internal hoops for a software purchase.

CaptainOfCoit · 4 months ago
How hard is it to go to the GitHub repository and open the LICENSE file that is in almost every repository? It would have taken you less time than writing that comment, and would have shown you it's under MIT.
mocko · 4 months ago
storm1er · 4 months ago
What's the difference with kuik? Spegel seems too complicated for my homelab, but could be a nice upgrade for my company

Kuik: https://github.com/enix/kube-image-keeper?tab=readme-ov-file...

phillebaba · 4 months ago
It's been a while since I looked at kuik, but I would say the main difference is that Spegel doesn't do any of the pulling or storage of images. Instead it relies on Containerd to do it for you. This also means that Spegel does not have to manage garbage collection. The nice thing with this is that it doesn't change how images are initially pulled from upstream and is able to serve images that exist on the node before Spegel runs.

Also, it looks like kuik uses CRDs to store information about where images are cached, while Spegel uses its own p2p solution to do the routing of traffic between nodes.

If you are running k3s in your homelab you can enable Spegel with a flag as it is an embedded feature.
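
For reference, enabling it is roughly: the embedded mirror option in the server config plus a registries.yaml listing which registries to mirror; a rough sketch (paths are the k3s defaults):

  # /etc/rancher/k3s/config.yaml
  embedded-registry: true

  # /etc/rancher/k3s/registries.yaml
  mirrors:
    docker.io: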

CaptainOfCoit · 4 months ago
There are a couple of alternatives that mirror more than just Docker Hub too; most of them are pretty bloated and enterprisey, but they do what they say on the tin and have saved me more than once. Artifactory, Nexus Repository, Cloudsmith and ProGet are some of them.
phillebaba · 4 months ago
Spegel does not only mirror Docker Hub, and it works a lot differently from the alternatives you suggested. Instead of being yet another failure point closer to your production environment, it runs a distributed stateless registry inside of your Kubernetes cluster. By piggybacking off of containerd's image store, it will distribute already-pulled images inside of the cluster.
mike-cardwell · 4 months ago
This looks good, but we're using GKE and it looks like it only works there with some hacks. Is there a timeline to make it work with GKE properly?
phillebaba · 4 months ago
I am having some discussions about getting things working on GKE but I can't give an ETA as it really depends on how things align with deployment schedules. I am positive however that this will soon be resolved.
0xbadcafebee · 4 months ago
Google Cloud has its own cache of Docker Hub that you can use for free; AWS does as well.

theanonymousone · 4 months ago
It's quite funny/interesting that this is higher on the HN front page than the news of the AWS outage that caused it.
mcintyre1994 · 4 months ago
Not on the real secret front page! https://news.ycombinator.com/active :)
cakeday · 4 months ago
That's informative, I wasn't aware of that way to view HN, thanks.
pknopf · 4 months ago
What does the "active" page sort by?

helpfulmandrill · 4 months ago
I wonder if this is why I also can't log in to O'Reilly to do some "Docker is down, better find something to do" training...
p0w3n3d · 4 months ago
Just install a pull-through proxy that will store all the recently used images.
m463 · 4 months ago
This is by design.

Docker got requests to allow you to configure a different default registry, but they selfishly denied the ability to do that:

https://stackoverflow.com/questions/33054369/how-to-change-t...

Red Hat created the Docker-compatible Podman and lets you close that hole:

  /etc/config/docker:
    BLOCK_REGISTRY='--block-registry=all'
    ADD_REGISTRY='--add-registry=registry.access.redhat.com'
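
For what it's worth, on current Podman the equivalent knob lives in the containers registries config rather than daemon flags; a minimal sketch (reusing the same Red Hat registry as the default for unqualified image names):

  # /etc/containers/registries.conf
  unqualified-search-registries = ["registry.access.redhat.com"]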

compootr · 4 months ago
I still think this is an acceptable footgun (?) to have. The explicitness of image references that include the registry domain outweighs the potential miscommunication issues a configurable default would create.

For example, if you're on a team and you have documentation containing commands, but your Docker config is outdated, you could accidentally pull from Docker's global public registry.

A welcome change IMO would be removing the implicit global registry entirely, since that would make it easier to tell where your image is coming from (but I severely doubt Docker would ever consider this, since the default makes it fractionally easier to use their services).

scuff3d · 4 months ago
This is a huge stretch.

Even if you could configure a default registry to point at something besides docker.io, a lot of people, I'd say the vast majority, wouldn't have bothered. So they'd still be in the same spot.

And it's not hard to just tag images. I don't have a single image pulling from docker.io at work. Takes two seconds to slap <company-repo>/ at the front of the image name.
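
In practice that just means every reference is fully qualified; a sketch with a hypothetical internal registry:

  FROM registry.example.com/base-images/python:3.12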

anon7000 · 4 months ago
Sadly doesn't help if you were using ECR in us-east-1 as your private registry. :(