- The team was not able to sufficiently diagnose the bad query problem, and/or devise emergency mitigations, by 1400 UTC March 17 (24h after the first incident).
- They were not able to reduce the load on the db more quickly (e.g., cranking up rate limiting, shutting down async or noncritical services and features; see the sketch after this list). At $prevjob we had the ability to do things like this at the push of a button, and would generally have done so within 90 minutes of incident onset to help systems heal. Here they did eventually figure out they could throttle webhooks, but it took several days of repeated incidents.
- They did not see connections creeping up towards their limits, and did not have emergency mitigations prepared in case of spikes like this.
- The simple failover described on March 22nd (for example) took almost 3h to complete.
- They decided to enable profiling during the approximate time window of their load spikes, without sufficient resources.
- They leapt to failing over their db when the proxy had issues, when it seems the failover can take multiple hours to successfully complete(??).
- They did not have sufficient sampled trace/log data to form a hypothesis without this profiling.
- New services like Packages and Codespaces rely on the `mysql1` primary db.
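(To be concrete about the kind of push-button mitigation I mean: below is a minimal sketch of a feature kill-switch plus emergency rate limit, assuming a shared Redis instance that on-call can flip flags in. All names and numbers are invented; this is obviously not GitHub's actual tooling.)

```python
# Hypothetical sketch of an ops kill-switch / throttle, NOT GitHub's actual tooling.
# Assumes a shared Redis instance that on-call can flip flags in at the push of a button.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def allow_request(feature: str, caller: str, per_minute_limit: int = 600) -> bool:
    """Return False if the feature is switched off or the caller is over its window limit."""
    # 1. Hard kill-switch: on-call runs e.g. `SET killswitch:webhooks 1` to shed the feature entirely.
    if r.get(f"killswitch:{feature}") == "1":
        return False

    # 2. Emergency rate limit: fixed one-minute window per caller.
    window_key = f"rl:{feature}:{caller}"
    count = r.incr(window_key)
    if count == 1:
        r.expire(window_key, 60)  # start the window on the first hit
    return count <= per_minute_limit

# e.g. in the webhook delivery path:
# if not allow_request("webhooks", org_id):
#     enqueue_for_retry_later(event)
```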
Certainly I have much empathy for the responding teams – I have scars from responding to some very painful degradations myself – but as a Github customer this does leave me concerned.
Former tech lead on Packages here (no longer at GitHub).
> New services like Packages and Codespaces rely on the `mysql1` primary db.
A lot of this is due to that DB handling core data for GitHub (notably organization and repository data), which the integrated nature of the new feature offerings forces them to interact with (syncing permissions from repos for packages, publishing from actions, etc.). The link between repos and codespaces is even more unavoidable.
We were careful with Packages to keep as few service dependencies as possible in the critical path, especially for things like the anonymous read path, so serving project packages to users for open source projects or package managers such as Homebrew is as insulated as it can be (and I suspect was unaffected by this incident).
But at the end of the day, there is some data that is central to most everything at the company, unfortunately.
Thanks, that makes sense. I would have imagined that most of what you describe could hit the `mysql1` replica (not primary) but certainly it's imaginable that this would not be possible or wise in all cases.
And of course it's worth applauding that I think most read traffic (whether to Packages or other services) worked just fine through these incidents, if I understand correctly.
this sounds backwards. i've worked at multiple places with the one big legacy mysql db. all of them had a blanket rule that no new features live in that db. whatever that complicates, you would design around it
I've not been part of designing systems at that scale (yet!) but isn't event sourcing kinda envisioned with just these kinds of problems in mind?
Like you would have all of that "core" data as Kafka topics and can safely interact with them without affecting the core services?
I know the answer is always "legacy" and "it's been designed that way from the beginning", but I was wondering: what do you think would have been the right way to design something like this to mitigate the risk of current / future problems?
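Roughly what I'm picturing, as a sketch: each service tails the "core" topics and maintains its own local read model, so its read load never hits the primary db. Topic and field names here are made up, and I'm using kafka-python purely for illustration.

```python
# Hypothetical sketch of the event-sourcing idea above; topic/field names are invented.
# Each service tails the "core" topics and keeps its own local read model,
# so serving reads never touches the primary database.
import json
from kafka import KafkaConsumer  # pip install kafka-python

repo_permissions = {}  # local read model: (repo_id, user_id) -> role

consumer = KafkaConsumer(
    "core.repository-permissions",        # assumed topic of permission-change events
    bootstrap_servers="localhost:9092",
    group_id="packages-read-model",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",          # replay from the start to rebuild state
)

for msg in consumer:
    event = msg.value
    key = (event["repo_id"], event["user_id"])
    if event["type"] == "permission_granted":
        repo_permissions[key] = event["role"]
    elif event["type"] == "permission_revoked":
        repo_permissions.pop(key, None)
    # The consuming service answers permission checks from `repo_permissions`,
    # accepting that this view is eventually consistent.
```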
They still could have designed around it. At my current job, all of our services are in separate clusters, with none of them heavily dependent on the others. It's possible our front-end could be bottlenecked, but as the front-end is read only, that is an easier problem to fix.
While we obviously have our own unique issues, this could never happen (in its entirety) at our company.
My guess is that part of the issue could be lock contention creating unexpected interactions between multiple simultaneous transactions. Issues like that can be really hard to pin down since there isn't a single transaction causing the problem; instead the load is the result of idiosyncratic interactions between multiple different transactions.
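If it is lock contention, the usual first step is to look at who is blocking whom, something like this sketch against the stock MySQL `sys` schema (5.7+). Connection details are placeholders; I have no idea what GitHub's actual tooling looks like.

```python
# Sketch: list which transactions are currently blocking which others,
# using the stock MySQL sys schema view. Connection details are placeholders.
import pymysql

conn = pymysql.connect(host="db-replica", user="ops", password="...", database="sys")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT waiting_pid, waiting_query,
               blocking_pid, blocking_query,
               wait_age_secs
        FROM sys.innodb_lock_waits
        ORDER BY wait_age_secs DESC
        LIMIT 20
        """
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```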
This reminds me of something I often forget about. The value of GitHub to Microsoft may be significant brand-wise, but it might not be a huge revenue source comparatively. Even if it were a huge revenue source, most of the revenue might still be shuffled off to the corporate behemoth to cover its balance sheet rather than reinvested in improving the product. Meanwhile the actual value of GitHub to organizations around the world is probably a couple of orders of magnitude higher than its actual revenue.
At some point it's actually cheaper for a coalition of international organizations to fund inventing a backwards-compatible GitHub replacement that is very resilient to failure, rather than wait for GitHub to get a measly enough budget increase from Daddy Micro$oft to shore up their legacy MySQL database.
GitHub is becoming a huge revenue source for them FYI. They are stealing a TON of business from competitors. Shoot, if Microsoft had a real competitor for Jira, they would probably steal even more business.
> if Microsoft had a real competitor for Jira, they would probably steal even more business.
But they have a real competitor for Jira. Check out the recently beefed up GitHub Projects, my previous (scrum-ish) team was running on it and we've liked it more than Jira. Much simpler, but just enough for a dev.
Didn't check whether it's available on GitHub Enterprise though (i.e. self-hosted instances).
Our internal monitoring has seen more outages than they listed here. There have been 4 full days where GitHub Actions was somewhere between completely broken and degraded.
It's nice to finally get some comms, but this is incredibly late and incomplete.
GH Actions felt so fundamentally janky and under-documented when I last looked at it (October?) that this doesn't come as a surprise. I really can't foresee using it with requirements any stricter than "nice when it works."
Yup, same here. We had periods of more than 8 hours (during Asia daytime / US nighttime) where we couldn't push or pull, which were not mentioned on their status page at all.
I would not be surprised to learn that they have similar procedures to one of my past employers, where exec signoff was needed to even publish an update to the service status dashboard...
I mean, they have $MEGABUCKS, they could probably get half the team that maintains MariaDB to come in and work for them if they wanted, and they still have a giant single db node doing writes and struggle to fail it over.
We're doomed >_<
You would think it wouldn't be THAT hard to shard something like GitHub effectively.
I mean, all user accounts/repos starting with the letter 'a' go to the 'a' cluster and so on seems not exactly science-fiction levels of technology.
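To be fair, first-letter routing gives you badly skewed shards (think how many names start with 'a' versus 'x'), but hashing the owner into a fixed set of buckets is barely more code. A toy sketch, not a claim about how GitHub does or should do it:

```python
# Toy shard router: hash the repo owner into one of N buckets so load spreads
# evenly instead of piling onto the 'a' and 's' clusters. Purely illustrative.
import hashlib

SHARDS = [f"mysql-shard-{i:02d}" for i in range(16)]  # hypothetical cluster names

def shard_for(owner: str) -> str:
    digest = hashlib.sha256(owner.lower().encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[bucket]

print(shard_for("torvalds"))  # a given owner always routes to the same shard
print(shard_for("rails"))
```

The routing function is the easy part, of course; the pain is in every query that used to join across what are now separate shards.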
it's profoundly strange that github has not properly sharded yet. essentially all large social networks use or have used sharded mysql successfully, this is not rocket science
livejournal, facebook, twitter, linkedin, tumblr, pinterest all use (or formerly used) sharded mysql and most of these are at larger db size than github
i will also repeat my comment from another recent thread: i just cannot understand how 20+ former github db and infra people recently left to join a db sharding company. this makes no sense whatsoever in light of github's lack of successful sharding. wtf is going on in the tech world these days
Just because you have money doesn’t mean you can hire anyone you want to. Talent is scarce, especially in highly-specialized roles, and people have free will.
I think the complexity of GitHub’s data management requirements would surprise you. Better to withhold judgment until you possess all the facts.
My fear is that this becomes a cover excuse for moving off MySQL. The bug will be deemed too hard to fix and they'll move off. They will choose SQL Server, take a lot of time to convert, and then have even more outages.
100% agreed. a lift-and-shift migration to Galera and modern MariaDB wouldn't be hard, but knowing MS there are middle managers waiting in the wings to swoop in and drive this into the ground with azure/sqlserver, the former of which posted 8 outages in the past 90 days alone.
this is classic Microsoft. spend a ton of money for something very valuable -- in this case virtually all developer marketshare -- and then casually pedal it into the ground while you lie about the KPIs to C levels (IIS marketshare on netcraft as a function of parked websites at GoDaddy to dominate over Apache) and keep it on life support with other revenue streams (Xbox) for the next 16 quarters until it becomes a repulsive enough carbuncle to shareholders that it gets the axe (Microsoft phone.) then in a year, limp into the barn with another product nobody else but you could afford to buy (minecraft) and slowly turn it into a KPI farm for Microsoft account metrics to drive some other failing product (Azure) and keep the C level happy while you alienate virtually every player with mechanics or requirements they hate.
At GitHub’s scale, you don’t just “move off” a database. At best it would be a gradual project that would take years for the company to complete, and likely trigger additional incidents along the way.
Why would Microsoft not migrate to SQL Server? MySQL is owned by Oracle. Microsoft cannot be happy about using a product from Oracle. SQL Server is a pretty good product and the conversion will give them even more tools and expertise for their consulting wing to do it for other companies.
Does it matter? Github is a closed source service. If they run Microsoft SQL Server vs. MySQL it doesn't matter or change anything for its users. All that we as customers and users care about is the experience and performance. Whatever they want to do internally to achieve that doesn't matter at all IMHO.
Counterpoint - with GitHub now being part of a large database and cloud vendor (i.e. SQL Server and Azure), surely there are at least some merits to moving workloads to those technologies? Leverage internal knowledge, talent, etc.
Somehow, I just assumed that because git is based on a content-addressed storage DAG, GitHub internally heavily leveraged a distributed hash table and content-addressed storage. With content-addressed storage, all of the stored data is immutable, so caching is significantly simplified, and a DHT can scale very well horizontally. Care still needs to be taken around the transactions that change DAG roots (heads of branches), but the rest is just making sure immutable blobs are sufficiently replicated in the DHT.
A DHT isn't a great fit for really rapidly changing data like metrics and such, but I figured that everything required to keep the basic git clone, commit, and web serving functionality is a pretty natural fit for a DHT.
I figured even code search had a relatively small amount of storage for indexing recent commits, with compaction of that data resulting in per-token skiplists stored into the DHT. That way, failure outside of the DHT serving path still allows code search for all commits older than the latest (incremental) compaction. Distributed refcounting or Bloom filters could be used for garbage collecting the immutable blobs in the DHT. The probability of a reference cycle in SHA-256 hashes of even quadrillions of immutable blobs is vanishingly small.
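That intuition in code form, as a toy sketch of content-addressed blob storage (nothing to do with GitHub's real internals): blobs are keyed by the hash of their contents and are therefore immutable, and only the tiny ref-pointer update needs transactional care.

```python
# Toy content-addressed store: every blob is keyed by the SHA-256 of its contents,
# so stored data is immutable and trivially cacheable/replicable.
import hashlib

class ContentStore:
    def __init__(self):
        self._blobs: dict[str, bytes] = {}   # in a real DHT this dict is partitioned by key

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)    # idempotent: same content, same key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentStore()
blob_key = store.put(b"print('hello from a commit')\n")
# A "branch head" is then just a tiny mutable pointer to an immutable key:
refs = {"refs/heads/main": blob_key}
assert store.get(refs["refs/heads/main"]).startswith(b"print")
```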
Permissioning/ACLs is something that you really want to get synchronous whenever possible, you don't want caching and pipeline delays if you can help it.
Suppose some maintainer is acting in bad faith and someone else quickly locks them out of all the repositories they have not yet thought to corrupt. It looks really, really bad if it turns out they still have permission to run the CI/CD scripts, maybe with malicious substitutions, even though you blocked them from pushing commits.
The other part of that is, it's not too hard to get right. See what GitHub was doing. People don't change ACLs that often, nowhere near as often as you read them. If they did, you could rate-limit them. And the rest of the problem is, you have one writer and many read copies in one place and you can enforce boundaries on how stale the data gets on the read copies.
Two hard problems, first is cache invalidation.
Of course, they have found out that they are big enough to shard, one would have hoped that they would have found that out in a gentler way, much sooner. But let's not pretend that their architecture makes no sense, it's a fine architecture, they just happened to outgrow it in a bad way.
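A minimal sketch of that single-writer, bounded-staleness idea for permission reads (the 30-second bound and all names here are invented):

```python
# Sketch: permission reads come from a local cache with a hard staleness bound;
# anything older than the bound falls through to the authoritative (single-writer) store.
import time

MAX_STALENESS_SECS = 30          # invented bound: how stale an ACL read may be

class AclCache:
    def __init__(self, authoritative_lookup):
        self._lookup = authoritative_lookup   # e.g. a query against the primary
        self._cache = {}                      # (user, repo) -> (allowed, fetched_at)

    def can_push(self, user: str, repo: str) -> bool:
        entry = self._cache.get((user, repo))
        if entry and time.monotonic() - entry[1] < MAX_STALENESS_SECS:
            return entry[0]
        allowed = self._lookup(user, repo)    # synchronous check for anything stale
        self._cache[(user, repo)] = (allowed, time.monotonic())
        return allowed

    def revoke(self, user: str, repo: str):
        # Security-sensitive writes also invalidate immediately, so a lockout
        # takes effect without waiting out the staleness window.
        self._cache.pop((user, repo), None)
```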
Linked in the article is this other one, "Partitioning GitHub’s relational databases to handle scale" (https://github.blog/2021-09-27-partitioning-githubs-relation...). That describes how there isn't just one "main primary" node; there are multiple clusters, of which `mysql1` is just one (the original one — since then, many others have been partitioned off).
from that article it sounds like they are mostly doing "functional partitioning" (moving tables off to other db primary/replica clusters) rather than true sharding (splitting up tables by ranges of data)
functional partitioning is a band-aid. you do it when your main cluster is exploding but you need to buy time. it ultimately is a very bad thing, because generally your whole site is dependent on every single functional partition being up. it moves you from 1 single point of failure to N single points of failure!
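the difference in a nutshell, as illustrative pseudo-config (cluster names made up):

```python
# Illustrative only: two different answers to "which cluster does this query go to?"

# Functional partitioning: route by table. Every cluster is still a single
# point of failure for its feature area.
TABLE_TO_CLUSTER = {
    "repositories": "mysql1",
    "notifications": "mysql2",
    "gists": "mysql3",
}

def cluster_for_table(table: str) -> str:
    return TABLE_TO_CLUSTER.get(table, "mysql1")

# Sharding: route by row key. Losing one shard degrades a slice of users
# instead of an entire feature for everyone.
NUM_SHARDS = 8

def cluster_for_row(table: str, owner_id: int) -> str:
    return f"{table}-shard-{owner_id % NUM_SHARDS}"
```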
It seems to me like one thing that would have made these failures way more manageable would be a solid adaptive load throttling framework. Presumably, the total load only exceeded their maximum capacity by a couple percent, so a good identity-aware throttling framework should be able to maintain full functionality for ~97% of users by shedding requests early in the flow, or probably better if some automated traffic is prioritized lower and can be dropped first. If I remember correctly, Tencent had a nice paper about how they do this, but I can't find a link now.
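Something like the toy shedder below is what I have in mind: measure DB utilization, and reject the lowest-priority traffic first as it climbs. The tiers and thresholds are invented.

```python
# Toy load shedder: when measured DB utilization crosses a threshold, start
# rejecting the lowest-priority traffic first. Tiers and thresholds are invented.
PRIORITY = {"browser": 0, "api_token": 1, "webhook_delivery": 2, "background_job": 3}

def should_shed(request_kind: str, db_utilization: float) -> bool:
    """Shed progressively: background jobs above 90% load, webhooks above 95%,
    API traffic above 98%, and interactive browser traffic only as a last resort."""
    thresholds = [1.00, 0.98, 0.95, 0.90]   # indexed by priority tier
    return db_utilization >= thresholds[PRIORITY[request_kind]]

# e.g. at 96% utilization, background jobs and webhook deliveries are shed,
# while browser and API requests still get through:
assert should_shed("background_job", 0.96) is True
assert should_shed("webhook_delivery", 0.96) is True
assert should_shed("api_token", 0.96) is False
assert should_shed("browser", 0.96) is False
```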
Build & configure the server, drive it a few hours down to the DC, rack the server. Get back home, try to access it. No luck. Turned out I had forgotten to connect power and turn it on.
> The simple failover described on March 22nd (for example) took almost 3h to complete
and
> They leapt to failing over their db when the proxy had issues, when it seems the failover can take multiple hours to successfully complete
Our failovers are very fast, take a couple of seconds at most, and work with near-zero downtime. I'm sure there were also other issues at play here.
There are so, so many and there have been for years. Folks use Github because it's what they know, not because an alternative is hard to find.
The Mythical Man Month has a few things to say about that.
(It's tempting to feel that the information is outdated, but in my experience it still seems true.)
I would argue it's more performant than vanilla MySQL, and it supports multiple write masters using peer-to-peer transactional replication, an enterprise license feature (https://docs.microsoft.com/en-us/sql/relational-databases/re...)
Tech debt has to be paid in full, whether it's person-hours, or downtime, or both.
It is honestly a bit humiliating for a company like GitHub to say, in effect, both:
- "We have had issues with our database for years and have still not found a solution." (I mean, what?)
- "We have been down several days in a row and we have no idea yet how to solve the issue apart from throttling webhooks."
- Was it DNS?
- Was it a bad config update?
- Was it an overloaded single point of failure?
There's rarely a #4
Though most of those fall under "bad config update" (and likewise that applies to DNS).