Ordinarily, a single-host incident takes a couple minutes to resolve, and, ordinarily, when it's resolved, everything that was running on the host pops right back up. This single-host outage wasn't ordinary. Somehow, a containerd boltdb got corrupted, and it took something like 12 hours for a member of our team (themselves a containerd maintainer) to do some kind of unholy surgery on that database to bring the machine back online.
The runbook we have for handling and communicating single-host outages wasn't tuned for this kind of extended outage. It will be now. Probably we'll just paint the global status page when a single-host outage crosses some kind of time threshold.
anyway, had the same though when typing my sibling comment, felt so disgusted reading that