I get the angry unicorn page "No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists. Contact Support — GitHub Status — @githubstatus" with that last link going to https://x.com/githubstatus showing "GitHub Status Oct 22, 2018 Everything operating normally."
Used to work ops at AWS. I don't know if it's still the case but it required VERY HIGH management approval to actually flip any lights on their "status page" (likely it was referenced in some way for SLAs and refunding customers).
That is an excellent illustration of Goodhart's law: we're going to have this awesome status page, but since updating it would let clients notice the system is down, we put up a lot of barriers to putting the actual status on that page.
Also probably a class action suit lurking somewhere in there eventually.
It's because of the way most companies build their status dashboards. There are usually at least two: an internal dashboard and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the real monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. It also affects SLAs, so changing the status needs multiple levels of approval, which is why there are delays.
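As a sketch of that two-dashboard pattern (class and field names are illustrative, not anyone's real system): the external page only changes once the incident is confirmed internally and has collected the required sign-offs.

```python
from dataclasses import dataclass, field

@dataclass
class ExternalStatusGate:
    """Hypothetical gate between internal monitoring and the public
    status page: nothing is published until the incident is confirmed
    internally and enough approvals have been collected."""
    required_approvals: int = 2
    confirmed_internally: bool = False
    approvals: set = field(default_factory=set)
    external_status: str = "operational"

    def confirm_incident(self):
        # Internal monitoring has confirmed this is real, not a flaky alert.
        self.confirmed_internally = True

    def approve(self, approver: str):
        self.approvals.add(approver)

    def publish(self, status: str) -> bool:
        # The external page stays green until both conditions hold.
        if self.confirmed_internally and len(self.approvals) >= self.required_approvals:
            self.external_status = status
            return True
        return False
```

Every minute spent collecting those approvals is a minute the public page stays green, which is exactly the delay described above.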
This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also coordinating incident response during a huge outage is always challenging.
FWIW, our self-hosted Gitea instance has not had a single second of unplanned downtime in the five years we've been running it. And there wasn't much _planned_ downtime either, because it's really easy to upgrade (pull a new image and recreate the container, which takes out the instance for maybe 15 seconds late at night), and full backups are handled live thanks to zfs.
Migration to a new host takes another 15 seconds thanks to both zfs and containers.
I don't know how many GitHub downtime reports I've seen during that time, we're probably into high dozens by now.
I've been running Gitea on my homelab for a few months now. It's fantastic. It's like a snapshot of a point in time when GitHub was actually good, before it got enshittified by all of the social and AI nonsense.
I've been moving most of my projects off of GitHub and into Gitea, and will continue to do so.
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
To be fair, I really couldn't care less if the homepage is loading or not.
So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.
(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptimes. And they are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage; 99.9% is the very best and most expensive SLA we'll commit to.)
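A canary check like that boils down to counting probe successes against a target; a minimal sketch (function names are mine, not any real monitoring product's):

```python
def availability(results):
    """Observed availability from a list of canary probe outcomes
    (True = the canary API request succeeded)."""
    return sum(results) / len(results)

def meets_sla(results, target=0.999):
    """Check against the committed three-nines SLA mentioned above;
    the internal stretch goal would be target=0.9999."""
    return availability(results) >= target
```

The gap between 99.9% and 99.99% is the difference between roughly 43 minutes and 4.3 minutes of allowed downtime per month, which is why the last nine is the expensive one.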
Looks like we have a full house outage at GitHub with everything down. Much worse than the so-called Twitter / X recent speed-bump that was screeched at and quickly forgotten.
I don't think GitHub has recovered from the monthly incidents that keep occurring. Quite frankly, the expectation now is that something at GitHub will go down every month, which shows how unreliable the service is, and this has been going on for years.
I guess this 4-year-old prediction post about self-hosting and not going all in on GitHub really aged well after all [0]
The timing is pretty uncanny. I just deployed a github page and had a DNS issue because I configured it wrong. I hit "check again" and github went down.
Perhaps this is a repeat of the Fastly incident with a customer's Varnish cache configuration causing an issue in their systems (I think this is a rough summary, I don't remember the details).
So, you're both responsible and not responsible at the same time :)
> Hope I don't appear in the incident report.
Appearing in an incident report with your HN username could be pretty funny...
I had a github page that was public, but it was made private and the DNS config was removed. Fast forward to today. I made the private repo public again and forced a deploy of the page without making a new commit. It said the DNS config was incomplete, so I tweaked it and hit "check again" and github went down.
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
From my experience this requires a few steps to happen first:
- an incident be declared internally to github
- support / incident team submits a new status page entry (with details on the service(s) impacted)
- incident is worked on internally
- incident fixed
- page updated
- retro posted
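The steps above amount to a strictly ordered lifecycle; a toy sketch (step names are my paraphrase of the list, not GitHub's actual process):

```python
# Hypothetical incident lifecycle mirroring the steps listed above.
STEPS = [
    "declared",       # incident declared internally
    "status_posted",  # status page entry submitted
    "mitigating",     # incident worked on internally
    "fixed",          # incident fixed
    "page_updated",   # status page updated
    "retro_posted",   # retrospective published
]

class Incident:
    def __init__(self):
        self.step = -1  # nothing has happened yet

    def advance(self, to_step):
        """Enforce that steps happen in order; in particular, the
        public status page is only touched after an internal
        declaration."""
        idx = STEPS.index(to_step)
        if idx != self.step + 1:
            raise ValueError(f"cannot jump to {to_step!r}")
        self.step = idx

    @property
    def public_status_visible(self):
        return self.step >= STEPS.index("status_posted")
```

Note that in this ordering, customers see nothing until step two completes, which explains the "everything operating normally" window at the start of every outage.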
Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the impact could be limited to one customer or a subset of customers, say those in region foo in AZ bar, on service version zed vs. zed - 1. So they chose not to display issues affecting only subsets.
I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
They say you shouldn't host status pages on the same infrastructure they're monitoring, but in a way that makes them much more accurate and responsive during outages!
Most status page products integrate with monitoring tools like Datadog [1]; a large team like GitHub's would have it automated.
You ideally do not want to be making a decision on whether to update a status page during the first few minutes of an incident; bean counters inevitably tend to get involved to delay or avoid declaring downtime if there is a manual process.
It is more likely the threshold is kept a bit higher than a couple of minutes to reduce false-positive rates, not because of manual updates.
Nah, _most_ status pages are hand updated to avoid false positives, and to avoid alerting customers when they otherwise would not have noticed. Very, very few organizations go out of their way to _tell_ customers they failed to meet their SLA proactively. GitHub's SLA remedy clause even stipulates that the customer is responsible for tracking availability, which GitHub will then work to confirm.
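If the customer is responsible for tracking availability, that tracking is simple in principle; a sketch (the function and the figures are illustrative, not GitHub's SLA math):

```python
from datetime import datetime

def monthly_uptime_pct(outages, month_start, month_end):
    """Customer-side availability tracking, as an SLA remedy clause
    like the one described would require: given the (start, end)
    outage windows you observed, compute the uptime percentage for
    the billing month."""
    total = (month_end - month_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage window to the month boundaries.
        s = max(start, month_start)
        e = min(end, month_end)
        if e > s:
            down += (e - s).total_seconds()
    return 100.0 * (1.0 - down / total)
```

The catch, of course, is that you have to be running your own probes to have those outage windows at all; the vendor's status page won't hand them to you.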
The status page says all is well, though: https://www.githubstatus.com/. Hilarious.
Good reason why companies shouldn't be using Twitter/X for status updates anymore!
Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.
https://x.com/githubstatus/status/1823864449494569023
https://downdetector.com/status/github/
[0] https://news.ycombinator.com/item?id=22868406
I remember a time when systems would boast about their "five nines" uptime. It was before anything "cloud" appeared.
People use this page for guidance. I guess now we know how much it can be trusted.
Hope I don't appear in the incident report.
Probably unrelated, but the timing was spooky.
```
Received a 503 error. Data returned as a String was: <!DOCTYPE html> <!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles, DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.co...
```
That's where it's cut off on my screen.
Curious what the link is :)
I like to think, someone did.
https://www.bleepingcomputer.com/news/security/github-action...
https://www.githubstatus.com/incidents/kz4khcgdsfdv
Give the poor GitHub ops folks a second to get things moving.
[1] https://www.atlassian.com/software/statuspage/integrations