I get the angry unicorn page "No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists. Contact Support — GitHub Status — @githubstatus" with that last link going to https://x.com/githubstatus showing "GitHub Status Oct 22, 2018 Everything operating normally."
Used to work ops at AWS. I don't know if it's still the case but it required VERY HIGH management approval to actually flip any lights on their "status page" (likely it was referenced in some way for SLAs and refunding customers).
That is an excellent illustration of Goodhart's law: we're going to have this awesome status page, but since updating it would let clients notice the system is down, we put up a lot of barriers to putting the actual status on that page.
Also probably a class action suit lurking somewhere in there eventually.
It's because of the way most companies build their status dashboards. There are usually at least two: an internal dashboard and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the real monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. It also affects SLAs, so changing the status needs multiple levels of approval, which is why there are delays.
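As a sketch of that two-dashboard pattern (class and field names are illustrative, not anyone's real system): the external page only changes once the incident is confirmed internally and has collected the required sign-offs.

```python
from dataclasses import dataclass, field

@dataclass
class ExternalStatusGate:
    """Hypothetical gate between internal monitoring and the public
    status page: nothing is published until the incident is confirmed
    internally and enough approvals have been collected."""
    required_approvals: int = 2
    confirmed_internally: bool = False
    approvals: set = field(default_factory=set)
    external_status: str = "operational"

    def confirm_incident(self):
        # Internal monitoring has confirmed this is real, not a flaky alert.
        self.confirmed_internally = True

    def approve(self, approver: str):
        self.approvals.add(approver)

    def publish(self, status: str) -> bool:
        # The external page stays green until both conditions hold.
        if self.confirmed_internally and len(self.approvals) >= self.required_approvals:
            self.external_status = status
            return True
        return False
```

Every minute spent collecting those approvals is a minute the public page stays green, which is exactly the delay described above.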
This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also coordinating incident response during a huge outage is always challenging.
FWIW, our self-hosted Gitea instance has not had a single second of unplanned downtime in the five years we've been running it. And there wasn't much _planned_ downtime either, because it's really easy to upgrade (pull a new image and recreate the container, which takes out the instance for maybe 15 seconds late at night), and full backups are handled live thanks to zfs.
Migration to a new host takes another 15 seconds thanks to both zfs and containers.
I don't know how many GitHub downtime reports I've seen during that time, we're probably into high dozens by now.
I've been running Gitea on my homelab for a few months now. It's fantastic. It's like a snapshot of a point in time when GitHub was actually good, before it got enshittified by all of the social and AI nonsense.
I've been moving most of my projects off of GitHub and into Gitea, and will continue to do so.
We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.
To be fair, I really couldn't care less if the homepage is loading or not.
So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.
(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptimes. And they are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage; 99.9% is the very best and most expensive SLA we'll commit to.)
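A canary check like that boils down to counting probe successes against a target; a minimal sketch (function names are mine, not any real monitoring product's):

```python
def availability(results):
    """Observed availability from a list of canary probe outcomes
    (True = the canary API request succeeded)."""
    return sum(results) / len(results)

def meets_sla(results, target=0.999):
    """Check against the committed three-nines SLA mentioned above;
    the internal stretch goal would be target=0.9999."""
    return availability(results) >= target
```

The gap between 99.9% and 99.99% is the difference between roughly 43 minutes and 4.3 minutes of allowed downtime per month, which is why the last nine is the expensive one.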
Looks like we have a full house outage at GitHub with everything down. Much worse than the so-called Twitter / X recent speed-bump that was screeched at and quickly forgotten.
I don't think GitHub has recovered from the monthly incidents that keep occurring. Quite frankly, the expectation now is that something at GitHub will go down every month, which shows how unreliable the service is, and this has been going on for years.
I guess this 4-year-old prediction post about self-hosting and not going all in on GitHub really aged well after all [0]
The timing is pretty uncanny. I just deployed a github page and had a DNS issue because I configured it wrong. I hit "check again" and github went down.
Perhaps this is a repeat of the Fastly incident with a customer's Varnish cache configuration causing an issue in their systems (I think this is a rough summary, I don't remember the details).
So, you're both responsible and not responsible at the same time :)
> Hope I don't appear in the incident report.
Appearing in an incident report with your HN username could be pretty funny...
I had a github page that was public, but it was made private and the DNS config was removed. Fast forward to today. I made the private repo public again and forced a deploy of the page without making a new commit. It said the DNS config was incomplete, so I tweaked it and hit "check again" and github went down.
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
From my experience this requires a few steps to happen first:
- an incident be declared internally to github
- support / incident team submits a new status page entry (with details on the service(s) impacted)
- incident is worked on internally
- incident fixed
- page updated
- retro posted
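The steps above amount to a strictly ordered lifecycle; a toy sketch (step names are my paraphrase of the list, not GitHub's actual process):

```python
# Hypothetical incident lifecycle mirroring the steps listed above.
STEPS = [
    "declared",       # incident declared internally
    "status_posted",  # status page entry submitted
    "mitigating",     # incident worked on internally
    "fixed",          # incident fixed
    "page_updated",   # status page updated
    "retro_posted",   # retrospective published
]

class Incident:
    def __init__(self):
        self.step = -1  # nothing has happened yet

    def advance(self, to_step):
        """Enforce that steps happen in order; in particular, the
        public status page is only touched after an internal
        declaration."""
        idx = STEPS.index(to_step)
        if idx != self.step + 1:
            raise ValueError(f"cannot jump to {to_step!r}")
        self.step = idx

    @property
    def public_status_visible(self):
        return self.step >= STEPS.index("status_posted")
```

Note that in this ordering, customers see nothing until step two completes, which explains the "everything operating normally" window at the start of every outage.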
Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the impact could be limited to one customer or a subset of customers, say those in region foo in AZ bar, on service version zed vs. zed - 1. So they chose not to display issues affecting only subsets.
I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
They say you shouldn't host status pages on the same infrastructure they're monitoring, but in a way that makes them much more accurate and responsive during outages!
Most status page products integrate with monitoring tools like Datadog [1]; a large team like GitHub's would have it automated.
You ideally do not want to be making a decision on whether to update a status page during the first few minutes of an incident; bean counters inevitably tend to get involved to delay or avoid declaring downtime if there is a manual process.
It is more likely the threshold is kept a bit higher than a couple of minutes to reduce false-positive rates, not because of manual updates.
Nah, _most_ status pages are hand updated to avoid false positives, and to avoid alerting customers when they otherwise would not have noticed. Very, very few organizations go out of their way to _tell_ customers they failed to meet their SLA proactively. GitHub's SLA remedy clause even stipulates that the customer is responsible for tracking availability, which GitHub will then work to confirm.
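If the customer is responsible for tracking availability, that tracking is simple in principle; a sketch (the function and the figures are illustrative, not GitHub's SLA math):

```python
from datetime import datetime

def monthly_uptime_pct(outages, month_start, month_end):
    """Customer-side availability tracking, as an SLA remedy clause
    like the one described would require: given the (start, end)
    outage windows you observed, compute the uptime percentage for
    the billing month."""
    total = (month_end - month_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage window to the month boundaries.
        s = max(start, month_start)
        e = min(end, month_end)
        if e > s:
            down += (e - s).total_seconds()
    return 100.0 * (1.0 - down / total)
```

The catch, of course, is that you have to be running your own probes to have those outage windows at all; the vendor's status page won't hand them to you.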
The status page says all is well, though: https://www.githubstatus.com/. Hilarious.
Good reason why companies shouldn't be using Twitter/X for status updates anymore!
Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.
https://x.com/githubstatus/status/1823864449494569023
https://downdetector.com/status/github/
[0] https://news.ycombinator.com/item?id=22868406
I remember a time when systems would boast about their "five nines" uptime. It was before anything "cloud" appeared.
People use this page for guidance. I guess now we know how much it can be trusted.
Hope I don't appear in the incident report.
Probably unrelated, but the timing was spooky.
```
Received a 503 error. Data returned as a String was: <!DOCTYPE html> <!--
Hello future GitHubber! I bet you're here to remove those nasty inline styles, DRY up these templates and make 'em nice and re-usable, right?
Please, don't. https://github.co...
```
That's where it's cut off on my screen.
Curious what the link is :)
I like to think, someone did.
https://www.bleepingcomputer.com/news/security/github-action...
https://www.githubstatus.com/incidents/kz4khcgdsfdv
Give the poor GitHub ops folks a second to get things moving.
[1] https://www.atlassian.com/software/statuspage/integrations