- The team was not able to sufficiently diagnose the bad query problem, and/or devise emergency mitigations, by 1400 UTC March 17 (24h after the first incident).
- They were not able to reduce the load on the db more quickly (e.g., cranking up rate limiting, shutting down async or noncritical services and features; see the sketch after this list). At $prevjob we had the ability to do things like this at the push of a button, and would generally have done so within 90 minutes of incident onset to help systems heal. Here they did eventually figure out they could throttle webhooks, but it took several days of repeated incidents.
- They did not see connections creeping up towards their limits, and did not have emergency mitigations prepared in case of spikes like this.
- The simple failover described on March 22nd (for example) took almost 3h to complete.
- They decided to enable profiling during the approximate time window of their load spikes, without sufficient resources.
- They leapt to failing over their db when the proxy had issues, when it seems the failover can take multiple hours to successfully complete(??).
- They did not have sufficient sampled trace/log data to form a hypothesis without this profiling.
- New services like Packages and Codespaces rely on the `mysql1` primary db.
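(To be concrete about the kind of push-button mitigation I mean: below is a minimal sketch of a feature kill-switch plus emergency rate limit, assuming a shared Redis instance that on-call can flip flags in. All names and numbers are invented; this is obviously not GitHub's actual tooling.)

```python
# Hypothetical sketch of an ops kill-switch / throttle, NOT GitHub's actual tooling.
# Assumes a shared Redis instance that on-call can flip flags in at the push of a button.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def allow_request(feature: str, caller: str, per_minute_limit: int = 600) -> bool:
    """Return False if the feature is switched off or the caller is over its window limit."""
    # 1. Hard kill-switch: on-call runs e.g. `SET killswitch:webhooks 1` to shed the feature entirely.
    if r.get(f"killswitch:{feature}") == "1":
        return False

    # 2. Emergency rate limit: fixed one-minute window per caller.
    window_key = f"rl:{feature}:{caller}"
    count = r.incr(window_key)
    if count == 1:
        r.expire(window_key, 60)  # start the window on the first hit
    return count <= per_minute_limit

# e.g. in the webhook delivery path:
# if not allow_request("webhooks", org_id):
#     enqueue_for_retry_later(event)
```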
Certainly I have much empathy for the responding teams – I have scars from responding to some very painful degradations myself – but as a Github customer this does leave me concerned.
Former tech lead on Packages here (no longer at GitHub).
> New services like Packages and Codespaces rely on the `mysql1` primary db.
A lot of this is due to that DB handling core data for GitHub (notably organization and repository data), which the integrated nature of the new feature offerings forces them to interact with (syncing permissions from repos for packages, publishing from actions, etc.). The link between repos and codespaces is even more unavoidable.
We were careful with Packages to keep as few service dependencies as possible in the critical path, especially for things like the anonymous read path, so serving project packages to users for open source projects or package managers such as Homebrew is as insulated as it can be (and I suspect was unaffected by this incident).
But at the end of the day, there is some data that is central to most everything at the company, unfortunately.
Thanks, that makes sense. I would have imagined that most of what you describe could hit the `mysql1` replica (not primary) but certainly it's imaginable that this would not be possible or wise in all cases.
And of course it's worth applauding that I think most read traffic (whether to Packages or other services) worked just fine through these incidents, if I understand correctly.
this sounds backwards. i've worked at multiple places with the one big legacy mysql db. all of them had a blanket rule that no new features live in that db. whatever that complicates, you would design around it
I've not been part of designing systems at that scale (yet!) but isn't event sourcing kinda envisioned with just these kinds of problems in mind?
Like you would have all of that "core" data as Kafka topics and can safely interact with them without affecting the core services?
I know the answer is always "legacy" and "it's been designed that way from the beginning", but I was wondering: what do you think would have been the right way to design something like this to mitigate the risk of current / future problems?
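Roughly what I'm picturing, as a sketch: each service tails the "core" topics and maintains its own local read model, so its read load never hits the primary db. Topic and field names here are made up, and I'm using kafka-python purely for illustration.

```python
# Hypothetical sketch of the event-sourcing idea above; topic/field names are invented.
# Each service tails the "core" topics and keeps its own local read model,
# so serving reads never touches the primary database.
import json
from kafka import KafkaConsumer  # pip install kafka-python

repo_permissions = {}  # local read model: (repo_id, user_id) -> role

consumer = KafkaConsumer(
    "core.repository-permissions",        # assumed topic of permission-change events
    bootstrap_servers="localhost:9092",
    group_id="packages-read-model",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",          # replay from the start to rebuild state
)

for msg in consumer:
    event = msg.value
    key = (event["repo_id"], event["user_id"])
    if event["type"] == "permission_granted":
        repo_permissions[key] = event["role"]
    elif event["type"] == "permission_revoked":
        repo_permissions.pop(key, None)
    # The consuming service answers permission checks from `repo_permissions`,
    # accepting that this view is eventually consistent.
```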
They still could have designed around it. At my current job, all of our services are in separate clusters, with none of them heavily dependent on the others. It's possible our front-end could be bottlenecked, but as the front-end is read only, that is an easier problem to fix.
While we obviously have our own unique issues, this could never happen (in its entirety) at our company.
My guess is that part of the issue could be lock contention creating unexpected interactions between multiple simultaneous transactions. Issues like that can be really hard to pin down since there isn't a single transaction causing the problem; instead the load is the result of idiosyncratic interactions between multiple different transactions.
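If it is lock contention, the usual first step is to look at who is blocking whom, something like this sketch against the stock MySQL `sys` schema (5.7+). Connection details are placeholders; I have no idea what GitHub's actual tooling looks like.

```python
# Sketch: list which transactions are currently blocking which others,
# using the stock MySQL sys schema view. Connection details are placeholders.
import pymysql

conn = pymysql.connect(host="db-replica", user="ops", password="...", database="sys")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT waiting_pid, waiting_query,
               blocking_pid, blocking_query,
               wait_age_secs
        FROM sys.innodb_lock_waits
        ORDER BY wait_age_secs DESC
        LIMIT 20
        """
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```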
This reminds me of something I often forget about. The value of GitHub to Microsoft may be significant brand-wise, but it might not be a huge revenue source comparatively. Even if it were a huge revenue source, most of the revenue might still be shuffled off to the corporate behemoth to cover its balance sheet rather than reinvested in improving the product. Meanwhile the actual value of GitHub to organizations around the world is probably a couple of orders of magnitude higher than its actual revenue.
At some point it's actually cheaper for a coalition of international organizations to fund inventing a backwards-compatible GitHub replacement that is very resilient to failure, rather than wait for GitHub to get a measly enough budget increase from Daddy Micro$oft to shore up their legacy MySQL database.
GitHub is becoming a huge revenue source for them FYI. They are stealing a TON of business from competitors. Shoot, if Microsoft had a real competitor for Jira, they would probably steal even more business.
> if Microsoft had a real competitor for Jira, they would probably steal even more business.
But they have a real competitor for Jira. Check out the recently beefed up GitHub Projects, my previous (scrum-ish) team was running on it and we've liked it more than Jira. Much simpler, but just enough for a dev.
Didn't check whether it's available on GitHub Enterprise though (i.e. self-hosted instances).
Our internal monitoring has seen more outages than they listed here. There have been 4 full days where GitHub Actions was somewhere between completely broken and degraded.
It's nice to finally get some comms, but this is incredibly late and incomplete.
GH Actions felt so fundamentally janky and under-documented when I last looked at it (October?) that this doesn't come as a surprise. I really can't foresee using it with requirements any stricter than "nice when it works."
Yup, same here. We had periods of more than 8 hours (during Asia daytime / US nighttime) where we couldn't push or pull, which were not mentioned on their status page at all.
I would not be surprised to learn that they have similar procedures to one of my past employers, where exec signoff was needed to even publish an update to the service status dashboard...
I mean, they have $MEGABUCKS, they could probably get half the team that maintains MariaDB to come in and work for them if they wanted, and they still have a giant single db node doing writes and struggle to fail it over.
We're doomed >_<
You would think it wouldn't be THAT hard to shard something like GitHub effectively.
I mean, all user accounts/repos starting with the letter 'a' go to the 'a' cluster and so on seems not exactly science-fiction levels of technology.
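To be fair, first-letter routing gives you badly skewed shards (think how many names start with 'a' versus 'x'), but hashing the owner into a fixed set of buckets is barely more code. A toy sketch, not a claim about how GitHub does or should do it:

```python
# Toy shard router: hash the repo owner into one of N buckets so load spreads
# evenly instead of piling onto the 'a' and 's' clusters. Purely illustrative.
import hashlib

SHARDS = [f"mysql-shard-{i:02d}" for i in range(16)]  # hypothetical cluster names

def shard_for(owner: str) -> str:
    digest = hashlib.sha256(owner.lower().encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[bucket]

print(shard_for("torvalds"))  # a given owner always routes to the same shard
print(shard_for("rails"))
```

The routing function is the easy part, of course; the pain is in every query that used to join across what are now separate shards.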
it's profoundly strange that github has not properly sharded yet. essentially all large social networks use or have used sharded mysql successfully, this is not rocket science
livejournal, facebook, twitter, linkedin, tumblr, pinterest all use (or formerly used) sharded mysql and most of these are at larger db size than github
i will also repeat my comment from another recent thread: i just cannot understand how 20+ former github db and infra people recently left to join a db sharding company. this makes no sense whatsoever in light of github's lack of successful sharding. wtf is going on in the tech world these days
Just because you have money doesn’t mean you can hire anyone you want to. Talent is scarce, especially in highly-specialized roles, and people have free will.
I think the complexity of GitHub’s data management requirements would surprise you. Better to withhold judgment until you possess all the facts.
My fear is that this becomes a cover excuse for moving off MySQL. The bug will be deemed too hard to fix and they'll move off. They will choose SQL Server, take a lot of time to convert, and then have even more outages.
100% agreed. a lift-and-shift migration to Galera and modern MariaDB wouldn't be hard, but knowing MS there are middle managers waiting in the wings to swoop in and drive this into the ground with azure/sqlserver, the former of which posted 8 outages in the past 90 days alone.
this is classic Microsoft. spend a ton of money for something very valuable -- in this case virtually all developer marketshare -- and then casually pedal it into the ground while you lie about the KPIs to C levels (IIS marketshare on netcraft as a function of parked websites at GoDaddy to dominate over Apache) and keep it on life support with other revenue streams (Xbox) for the next 16 quarters until it becomes a repulsive enough carbuncle to shareholders that it gets the axe (Microsoft phone.) then in a year, limp into the barn with another product nobody else but you could afford to buy (minecraft) and slowly turn it into a KPI farm for Microsoft account metrics to drive some other failing product (Azure) and keep the C level happy while you alienate virtually every player with mechanics or requirements they hate.
At GitHub’s scale, you don’t just “move off” a database. At best it would be a gradual project that would take years for the company to complete, and likely trigger additional incidents along the way.
Why would Microsoft not migrate to SQL Server? MySQL is owned by Oracle. Microsoft cannot be happy about using a product from Oracle. SQL Server is a pretty good product and the conversion will give them even more tools and expertise for their consulting wing to do it for other companies.
Does it matter? Github is a closed source service. If they run Microsoft SQL Server vs. MySQL it doesn't matter or change anything for its users. All that we as customers and users care about is the experience and performance. Whatever they want to do internally to achieve that doesn't matter at all IMHO.
Counterpoint - with GitHub now being part of a large database and cloud vendor (i.e. SQL Server and Azure), surely there are at least some merits to moving workloads to those technologies? Leverage internal knowledge, talent, etc.
Somehow, I just assumed that because git is based on a content-addressed storage DAG, GitHub internally heavily leveraged a distributed hash table and content-addressed storage. With content-addressed storage, all of the stored data is immutable, so caching is significantly simplified, and a DHT can scale very well horizontally. Care still needs to be taken around the transactions that change DAG roots (heads of branches), but the rest is just making sure immutable blobs are sufficiently replicated in the DHT.
A DHT isn't a great fit for really rapidly changing data like metrics and such, but I figured that everything required to keep the basic git clone, commit, and web serving functionality is a pretty natural fit for a DHT.
I figured even code search had a relatively small amount of storage for indexing recent commits, with compaction of that data resulting in per-token skiplists stored into the DHT. That way, failure outside of the DHT serving path still allows code search for all commits older than the latest (incremental) compaction. Distributed refcounting or Bloom filters could be used for garbage collecting the immutable blobs in the DHT. The probability of a reference cycle in SHA-256 hashes of even quadrillions of immutable blobs is vanishingly small.
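That intuition in code form, as a toy sketch of content-addressed blob storage (nothing to do with GitHub's real internals): blobs are keyed by the hash of their contents and are therefore immutable, and only the tiny ref-pointer update needs transactional care.

```python
# Toy content-addressed store: every blob is keyed by the SHA-256 of its contents,
# so stored data is immutable and trivially cacheable/replicable.
import hashlib

class ContentStore:
    def __init__(self):
        self._blobs: dict[str, bytes] = {}   # in a real DHT this dict is partitioned by key

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)    # idempotent: same content, same key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

store = ContentStore()
blob_key = store.put(b"print('hello from a commit')\n")
# A "branch head" is then just a tiny mutable pointer to an immutable key:
refs = {"refs/heads/main": blob_key}
assert store.get(refs["refs/heads/main"]).startswith(b"print")
```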
Permissioning/ACLs is something that you really want to get synchronous whenever possible, you don't want caching and pipeline delays if you can help it.
Suppose some maintainer is acting in bad faith and someone else quickly locks them out of all the repositories they have not yet thought to corrupt. It looks really, really bad if it turns out they still have permission to run the CI/CD scripts, maybe with malicious substitutions, even though you blocked them from pushing commits.
The other part of that is, it's not too hard to get right. See what GitHub was doing. People don't change ACLs that often, nowhere near as often as you read them. If they did, you could rate-limit them. And the rest of the problem is, you have one writer and many read copies in one place and you can enforce boundaries on how stale the data gets on the read copies.
Two hard problems, first is cache invalidation.
Of course, they have found out that they are big enough to shard, one would have hoped that they would have found that out in a gentler way, much sooner. But let's not pretend that their architecture makes no sense, it's a fine architecture, they just happened to outgrow it in a bad way.
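A minimal sketch of that single-writer, bounded-staleness idea for permission reads (the 30-second bound and all names here are invented):

```python
# Sketch: permission reads come from a local cache with a hard staleness bound;
# anything older than the bound falls through to the authoritative (single-writer) store.
import time

MAX_STALENESS_SECS = 30          # invented bound: how stale an ACL read may be

class AclCache:
    def __init__(self, authoritative_lookup):
        self._lookup = authoritative_lookup   # e.g. a query against the primary
        self._cache = {}                      # (user, repo) -> (allowed, fetched_at)

    def can_push(self, user: str, repo: str) -> bool:
        entry = self._cache.get((user, repo))
        if entry and time.monotonic() - entry[1] < MAX_STALENESS_SECS:
            return entry[0]
        allowed = self._lookup(user, repo)    # synchronous check for anything stale
        self._cache[(user, repo)] = (allowed, time.monotonic())
        return allowed

    def revoke(self, user: str, repo: str):
        # Security-sensitive writes also invalidate immediately, so a lockout
        # takes effect without waiting out the staleness window.
        self._cache.pop((user, repo), None)
```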
Linked in the article is this other one, "Partitioning GitHub’s relational databases to handle scale" (https://github.blog/2021-09-27-partitioning-githubs-relation...). That describes how there isn't just one "main primary" node; there are multiple clusters, of which `mysql1` is just one (the original one — since then, many others have been partitioned off).
from that article it sounds like they are mostly doing "functional partitioning" (moving tables off to other db primary/replica clusters) rather than true sharding (splitting up tables by ranges of data)
functional partitioning is a band-aid. you do it when your main cluster is exploding but you need to buy time. it ultimately is a very bad thing, because generally your whole site is dependent on every single functional partition being up. it moves you from 1 single point of failure to N single points of failure!
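the difference in a nutshell, as illustrative pseudo-config (cluster names made up):

```python
# Illustrative only: two different answers to "which cluster does this query go to?"

# Functional partitioning: route by table. Every cluster is still a single
# point of failure for its feature area.
TABLE_TO_CLUSTER = {
    "repositories": "mysql1",
    "notifications": "mysql2",
    "gists": "mysql3",
}

def cluster_for_table(table: str) -> str:
    return TABLE_TO_CLUSTER.get(table, "mysql1")

# Sharding: route by row key. Losing one shard degrades a slice of users
# instead of an entire feature for everyone.
NUM_SHARDS = 8

def cluster_for_row(table: str, owner_id: int) -> str:
    return f"{table}-shard-{owner_id % NUM_SHARDS}"
```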
It seems to me like one thing that would have made these failures way more manageable would be a solid adaptive load throttling framework. Presumably, the total load only exceeded their maximum capacity by a couple percent, so a good identity-aware throttling framework should be able to maintain full functionality for ~97% of users by shedding requests early in the flow, or probably better if some automated traffic is prioritized lower and can be dropped first. If I remember correctly, Tencent had a nice paper about how they do this, but I can't find a link now.
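Something like the toy shedder below is what I have in mind: measure DB utilization, and reject the lowest-priority traffic first as it climbs. The tiers and thresholds are invented.

```python
# Toy load shedder: when measured DB utilization crosses a threshold, start
# rejecting the lowest-priority traffic first. Tiers and thresholds are invented.
PRIORITY = {"browser": 0, "api_token": 1, "webhook_delivery": 2, "background_job": 3}

def should_shed(request_kind: str, db_utilization: float) -> bool:
    """Shed progressively: background jobs above 90% load, webhooks above 95%,
    API traffic above 98%, and interactive browser traffic only as a last resort."""
    thresholds = [1.00, 0.98, 0.95, 0.90]   # indexed by priority tier
    return db_utilization >= thresholds[PRIORITY[request_kind]]

# e.g. at 96% utilization, background jobs and webhook deliveries are shed,
# while browser and API requests still get through:
assert should_shed("background_job", 0.96) is True
assert should_shed("webhook_delivery", 0.96) is True
assert should_shed("api_token", 0.96) is False
assert should_shed("browser", 0.96) is False
```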
Build & configure the server, drive it a few hours down to the DC, rack the server. Get back home, try to access it. No luck. Turned out I had forgotten to connect power and turn it on.
> The simple failover described on March 22nd (for example) took almost 3h to complete
and
> They leapt to failing over their db when the proxy had issues, when it seems the failover can take multiple hours to successfully complete
Our failovers are very fast, take a couple of seconds at most, and work with near-zero downtime. I'm sure there were also other issues at play here.
There are so, so many and there have been for years. Folks use Github because it's what they know, not because an alternative is hard to find.
The Mythical Man Month has a few things to say about that.
(It's tempting to feel that the information is outdated, but in my experience it still seems true.)
I would argue it's more performant than vanilla MySQL, and it supports multiple write masters using peer-to-peer transactional replication, an enterprise license feature (https://docs.microsoft.com/en-us/sql/relational-databases/re...)
Tech debt has to be paid in full, whether it's person-hours, or downtime, or both.
It is honestly a bit humiliating for a company like GitHub to say, in effect, both:
- "We have had issues with our database for years and have still not found a solution." (I mean, what?)
- "We have been down several days in a row and we have no idea yet how to solve the issue apart from throttling webhooks."
- Was it DNS?
- Was it a bad config update?
- Was it an overloaded single point of failure?
There's rarely a #4
Though most of those fall under "bad config update" (and likewise that applies to DNS).