Readit News
sebmellen · a year ago
I've never seen an outage this big. Even the homepage doesn't load. We've had recurrent issues with Actions not running, but this seems a lot bigger.

The status page says all is well, though: https://www.githubstatus.com/. Hilarious.

karmakaze · a year ago
I get the angry unicorn page "No server is currently available to service your request. Sorry about that. Please try refreshing and contact us if the problem persists. Contact Support — GitHub Status — @githubstatus" with that last link going to https://x.com/githubstatus showing "GitHub Status Oct 22, 2018 Everything operating normally."
kalkin · a year ago
I think this is because logged-out Twitter now shows top Tweets of all time from a user, rather than most recent Tweets.

Good reason why companies shouldn't be using Twitter/X for status updates anymore!

TwiztidK · a year ago
The era of Twitter/X status pages needs to come to an end given how unusable it is if you aren't logged in.
temp0826 · a year ago
Used to work ops at AWS. I don't know if it's still the case but it required VERY HIGH management approval to actually flip any lights on their "status page" (likely it was referenced in some way for SLAs and refunding customers).
smsm42 · a year ago
That is an excellent illustration of Goodhart's law. We're going to have this awesome status page, but since updating it would let clients notice the system is down, we're going to put up a lot of barriers to posting the actual status on that page.

Also probably a class action suit lurking somewhere in there eventually.

purkka · a year ago
I have to wonder how a company at the scale of GitHub can be so bad at keeping track of their status.

Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.

xuancanh · a year ago
It's because of the way most companies build their status dashboards. There are usually at least two dashboards: an internal one and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the other monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally will the external dashboard be updated, to avoid reacting to flaky monitors and alerts. It also affects SLAs, so changing the status needs multiple levels of approval; that's why there are some delays.
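A minimal sketch of the gating described above, assuming a hypothetical policy where the public page only changes after internal confirmation plus sign-offs (all names and the approval set are made up for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    service: str
    confirmed_internally: bool = False
    approvals: set = field(default_factory=set)

# Assumed policy: these roles must sign off before the public page flips.
REQUIRED_APPROVALS = {"oncall_lead", "comms"}

def external_status(incident: Incident) -> str:
    """What the public status page shows for a service."""
    if not incident.confirmed_internally:
        return "operational"   # internal alerts alone never flip the page
    if not REQUIRED_APPROVALS <= incident.approvals:
        return "operational"   # confirmed, but still awaiting sign-off
    return "degraded"          # only now does the SLA clock start

inc = Incident("git-operations")
inc.confirmed_internally = True          # internal dashboard already red
print(external_status(inc))              # → operational (still green publicly)

inc.approvals |= {"oncall_lead", "comms"}
print(external_status(inc))              # → degraded
```

The delay people see on githubstatus.com is exactly the window between the first branch and the last one.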
x86a · a year ago
This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also coordinating incident response during a huge outage is always challenging.
saul-paterson · a year ago
FWIW, our self-hosted Gitea instance has not had a single second of unplanned downtime in the five years we've been running it. And there wasn't much _planned_ downtime either, because it's really easy to upgrade (pull a new image and recreate the container — takes out the instance for maybe 15 seconds late at night), and full backups are handled live thanks to zfs.

Migration to a new host takes another 15 seconds thanks to both zfs and containers.

I don't know how many GitHub downtime reports I've seen during that time, we're probably into high dozens by now.

chrisallenlane · a year ago
I've been running Gitea on my homelab for a few months now. It's fantastic. It's like a snapshot of a point in time when GitHub was actually good, before it got enshittified by all of the social and AI nonsense.

I've been moving most of my projects off of GitHub and into Gitea, and will continue to do so.

Lanedo · a year ago
Twitter now has:

We are experiencing interruptions in multiple public GitHub services. We suspect the impact is due to a database infrastructure related change that we are working on rolling back.

https://x.com/githubstatus/status/1823864449494569023

Lanedo · a year ago
Github seems to be coming back up:

https://downdetector.com/status/github/

kinduff · a year ago
They are flipping the switches now, status page just changed.
ergocoder · a year ago
I wonder why the status page doesn't just ping github.com for a 200. That seems easy to do.
bigiain · a year ago
To be fair - I really couldn't care less if the homepage is loading or not.

So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.

(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptime. And they are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage to keep even at 99.9% - which is the very best and most expensive SLA we'll commit to.)

fragmede · a year ago
From where? They don't only have one load balancer, so you'd still have the problem of the page showing green when it's not loading for some folks.
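One common answer to this objection is a quorum rule over several vantage points: only call the service down if a majority of probes fail. A toy sketch with faked probe results (the vantage-point names are invented):

```python
def quorum_status(probe_results: dict[str, bool], threshold: float = 0.5) -> str:
    """probe_results maps vantage point -> whether it got an HTTP 200."""
    ok = sum(probe_results.values())
    return "up" if ok / len(probe_results) > threshold else "down"

# One load balancer misbehaving: the page stays green even though
# some folks are seeing errors.
print(quorum_status({"us-east": True, "eu-west": True, "ap-south": False}))   # → up

# A real outage fails everywhere.
print(quorum_status({"us-east": False, "eu-west": False, "ap-south": False})) # → down
```

Which is also why a simple "ping for 200" check can honestly report green during a partial outage.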
tinyhitman · a year ago
delaying SLA
rvz · a year ago
Looks like we have a full-house outage at GitHub with everything down. Much worse than the recent so-called Twitter/X speed-bump that was screeched at and quickly forgotten.

I don't think GitHub has recovered from the monthly incidents that keep occurring. Quite frankly, the expectation now is that something at GitHub will go down every month, which shows how unreliable the service is; this has been the case for years.

I guess this 4-year-old prediction post about self-hosting and not going all in on GitHub really aged well after all [0]

[0] https://news.ycombinator.com/item?id=22868406

dataspun · a year ago
The statute of limitations for HN comment predictions is 3 years.
TacticalCoder · a year ago
> I've never seen an outage this big.

I remember a time when systems would boast about their "five nines" uptime. It was before anything "cloud" appeared.

Deleted Comment

rozenmd · a year ago
Here, we caught 35 minutes of downtime: https://github.onlineornot.com/incidents/6Yyj8YWD94zE
manquer · a year ago
Status page updates with "degraded availability". lol

Deleted Comment

RIMR · a year ago
Wow, the status page only just now started reporting issues, and it still doesn't seem to communicate the scale of the issue.

People use this page for guidance. I guess now we know how much it can be trusted.

ikiris · a year ago
It’s used to ease their comms, not a real time status board pointing at their monitoring.
bitbasher · a year ago
The timing is pretty uncanny. I just deployed a github page and had a DNS issue because I configured it wrong. I hit "check again" and github went down.

Hope I don't appear in the incident report.

sunrunner · a year ago
Perhaps this is a repeat of the Fastly incident with a customer's Varnish cache configuration causing an issue in their systems (I think this is a rough summary, I don't remember the details).

So, you're both responsible and not responsible at the same time :)

> Hope I don't appear in the incident report.

Appearing in an incident report with your HN username could be pretty funny...

RIMR · a year ago
This will all clear up when it finishes checking your DNS configuration I bet.
zombot · a year ago
So it was you who crashed GitHub?
OutOfHere · a year ago
Fwiw, GitHub Pages is down too. The hosted Pages sites are down.
red_Seashell_32 · a year ago
Wait. You use github pages for something or actually work on it?
bitbasher · a year ago
I use it for something.

I had a github page that was public, but it was made private and the DNS config was removed. Fast forward to today. I made the private repo public again and forced a deploy of the page without making a new commit. It said the DNS config was incomplete, so I tweaked it and hit "check again" and github went down.

Probably unrelated, but the timing was spooky.

Deleted Comment

theovermage · a year ago
Bad bitbasher bad! :catbonk:
twp · a year ago
https://www.githubstatus.com/ reports no problems, but it's clearly down for a lot of people (including me).
tabbott · a year ago
It is kinda amazing how consistently status pages show everything fine during a total outage. It's not that hard to connect a status page to end-to-end monitoring statistics...
blinded · a year ago
From my experience this requires a few steps happen first:

- an incident be declared internally to github

- support / incident team submits a new status page entry (with details on the service(s) impacted)

- incident is worked on internally

- incident fixed

- page updated

- retro posted

Even AWS now seems to have some automation for their various services per region. But it doesn't automatically show issues, because the problem could be at the level of a single customer, or a subset of customers in region foo in AZ bar, on service version zed vs zed - 1. So they choose not to display issues for subsets.

I do agree it would be nice to have logins for the status page and then get detailed metrics based on customerid or userid. Someone start a company to compete with statuspage.

cortesoft · a year ago
There is always going to be SOME delay between the outage and the status page, although 5 minutes is probably enough time that it should have been updated.
frabjoused · a year ago
It's simply too soon for the status page to report the anomaly, is my guess. It's been down for 4 minutes.
owyn · a year ago
Once in the past I did actually have an incident where the site went down so hard that the tool that we used to update the status page didn't work. We did move it to a totally external and independent service after that. The first service we used was more flaky than our actual site was, so it kept showing the site down when it wasn't. So then we moved to another one, etc. Job security. :)
beefsack · a year ago
They say you shouldn't host a status page on the same infrastructure it monitors, but in a way that makes it much more accurate and responsive during outages!
kredd · a year ago
It went down literally 3 minutes ago (I was in the middle of writing a PR comment), let's see if their cron job kicks in and reports the issue.
thund · a year ago
it's starting to show now, about 10 minutes after the issue started
agosz · a year ago
It's showing a few incidents now. Some things are still green though that don't seem to be working.
remram · a year ago
@dang https://www.githubstatus.com/incidents/kz4khcgdsfdv is probably a better link for this submission now
pietroppeter · a year ago
I should have looked for this before posting the same comment. Upvoted :)
erksa · a year ago
The mobile app on iOS is a 503 with

```

Received a 503 error. Data returned as a String was: <!DOCTYPE html> <!--

Hello future GitHubber! I bet you're here to remove those nasty inline styles, DRY up these templates and make 'em nice and re-usable, right?

Please, don't. https://github.co...

```

That's where it's cut off on my screen.

Curious what the link is :)

I like to think someone did.

arjvik · a year ago
Anyone seen the full text of the error page?
kgrax01 · a year ago
Could it have been brought down intentionally? Related to this?

https://www.bleepingcomputer.com/news/security/github-action...

johnnypangs · a year ago
Seems like it was a config change that caused it. They reverted it really quickly.

https://www.githubstatus.com/incidents/kz4khcgdsfdv

bamboozled · a year ago
How would customer credentials being leaked be part of an outage of this size ?
kgrax01 · a year ago
If it's enough of a security issue, they could have pulled the site while it's fixed/cleaned.
fragmede · a year ago
Because there are worse things than being down; if the front page got hacked and is spewing gore or CSAM or PII or creds, for example.
low_tech_punk · a year ago
All the AI-native developers are twiddling their thumbs because Copilot is out of office.
j3s · a year ago
For everyone complaining about the status page: status pages are normally operated by hand, by design, and will rarely reflect things in real time.

give the poor github ops folks a second to get things moving.

manquer · a year ago
Most status page products integrate with monitoring tools like Datadog [1]; a large team like GitHub's would have it automated.

You ideally do not want to be deciding whether to update a status page during the first few minutes of an incident; with a manual process, bean counters inevitably tend to get involved to delay or avoid declaring downtime.

It is more likely that the threshold is kept a bit higher than a couple of minutes to reduce the false-positive rate, not that updates are manual.

[1] https://www.atlassian.com/software/statuspage/integrations
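That thresholding could be as simple as requiring N consecutive failed checks before the page flips, trading a few minutes of delay for fewer false positives. A sketch under assumed, illustrative numbers (nothing here reflects GitHub's actual tooling):

```python
FAILURES_BEFORE_INCIDENT = 5   # with 60s checks, ~5 minutes of sustained failure

class StatusGate:
    """Flips the published status only after enough consecutive failures."""

    def __init__(self, failures_required: int = FAILURES_BEFORE_INCIDENT):
        self.failures_required = failures_required
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> str:
        """Feed one health-check result; return the published status."""
        if check_passed:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_required:
            return "incident"
        return "operational"

gate = StatusGate()
for result in [True, False, False, False, False]:
    status = gate.record(result)
print(status)               # → operational (only 4 consecutive failures so far)
print(gate.record(False))   # → incident (the 5th failure crosses the threshold)
```

This is the same shape of delay people observed in this thread: the outage was real for several minutes before the page changed.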

xyzzy_plugh · a year ago
Nah, _most_ status pages are hand updated to avoid false positives, and to avoid alerting customers when they otherwise would not have noticed. Very, very few organizations go out of their way to _tell_ customers they failed to meet their SLA proactively. GitHub's SLA remedy clause even stipulates that the customer is responsible for tracking availability, which GitHub will then work to confirm.