athoscouto · 2 years ago
Their siloing strategy, which I'll roughly refer to as resolving a request from a single AZ, is a good way to keep operations and monitoring simple.

A past team of mine managed services in a similar fashion. We had a few (usually 2-4) single-AZ clusters with a thin (Envoy) layer to balance traffic between clusters.

We could detect incidents in a single cluster by comparing metrics across clusters. Mitigation was easy: we could drain a cluster in under a minute, redirecting traffic to the other ones. Most traffic was intra-AZ, so it was fast and there were no cross-AZ networking fees.
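
To make the drain mechanics concrete: the Envoy layer balanced by cluster weights, so draining meant setting one weight to zero. Here's a rough Go sketch of the idea (not our actual config, and all names are made up):

```go
package main

import (
	"fmt"
	"math/rand"
)

// Cluster is one single-AZ deployment of the service.
type Cluster struct {
	Name   string
	Weight int // share of traffic; 0 means drained
}

// pick chooses a cluster proportionally to its weight, so draining
// is just setting one cluster's weight to 0: traffic shifts to the
// remaining clusters without any other change.
func pick(clusters []Cluster) Cluster {
	total := 0
	for _, c := range clusters {
		total += c.Weight
	}
	if total == 0 {
		panic("all clusters drained")
	}
	n := rand.Intn(total)
	for _, c := range clusters {
		if n < c.Weight {
			return c
		}
		n -= c.Weight
	}
	return clusters[len(clusters)-1] // unreachable
}

func main() {
	clusters := []Cluster{
		{Name: "cluster-1a", Weight: 100},
		{Name: "cluster-1b", Weight: 100},
		{Name: "cluster-1c", Weight: 0}, // drained
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pick(clusters).Name]++
	}
	fmt.Println(counts) // roughly 50/50 between 1a and 1b, none to 1c
}
```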

The downside is that most services were running in several clusters, so there was redundancy in compute, caches, etc.

When we talked to people outside the company, e.g. solution architects from our cloud provider, they would be surprised at our architecture and immediately suggest multi-region clusters. I would joke that our single AZ clusters were a feature, not a bug.

Nice to see other folks having success with a similar architecture!

jasonwatkinspdx · 2 years ago
Yeah, I talked with a business that used a similar architecture for the same reasons. It can be really effective in multi-tenant apps where each customer's data is fully independent and private. They also used multiple Amazon organizational accounts as a security partition. It made a few things more difficult, but they felt the peace of mind was worth it.
mox1 · 2 years ago
My company has a pretty unique strategy where we have separate AWS accounts for each unit within the company. Each unit gets a prod and non-prod account.

We have ~150 accounts, so roughly 75 different departments, some with very few resources and others with a lot.

It's complex, but provides a lot of nice security primitives. We have an overarching administrative account, but that doesn't get used (and lots of alarm bells go off when it is).

AugustoCAS · 2 years ago
I assume you were using AWS? I know some of the AZs of other cloud providers (Azure? Oracle? Google?) are not fully siloed. They might have independent power and networking, but be in the same physical location.

I'm mentioning this for other people to be aware as one can easily make the assumption that an AZ is the same concept on all clouds, which is not true and painful to realise.

pl4nty · 2 years ago
Azure's zones are "physically separate", but it's unclear whether zones could be in the same building. Especially since they don't guarantee distance between zones - they just aim for 300mi (483km)
jeremyjh · 2 years ago
Actually I assumed AWS did it the same way as the others - I thought maybe they are in another building on a campus but I didn’t think that should be a factor in planning and that I should use regions for geographic redundancy anyway.
ak007 · 2 years ago
Thanks for highlighting this. Indeed, not all CSPs are the same.
ComputerGuru · 2 years ago
It sounds like you didn't have persistent data, and were only offering compute? If there's no need for a coherent master view accessible/writeable from all the clusters, there would be no reason to use a multi-region cluster whatsoever.
athoscouto · 2 years ago
We did. But the persisted data didn't live inside those ephemeral compute clusters.
endymi0n · 2 years ago
It took me some time to realize that Cloud Solution Architects are also just slightly more technical salespeople in disguise whose only mission is upselling you into more dependency. Same thing with their PR: every CxO these days says they need "multi-cloud", whatever that means, and the costs are usually enormous while complexity rises, with questionable benefit.

I did the math for our own stack after a setback month in client revenue, and decided to put all our servers into a single AZ in a single region. The only multi-AZ, multi-region services are our backups. Surviving bad machines happens often enough that it's priced in via using Kubernetes, but losing a whole AZ is a freak accident that's just SO rare that, calculating real business risk, it seemed apt to pretend it just doesn't happen (sorry, Google Cloud Paris customers).

Call me reckless, but I haven't looked back ever since, and it saves us thousands of dollars in cross-AZ fees per month alone.
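
The math itself is a one-liner: expected annual loss is outage frequency times recovery time times cost per hour, compared against what multi-AZ costs you. A sketch with entirely made-up numbers:

```go
package main

import "fmt"

func main() {
	// Every number here is hypothetical; plug in your own.
	crossAZFeesPerMonth := 3000.0 // what multi-AZ traffic costs today
	azOutagesPerYear := 0.1       // assume one full-AZ loss per decade
	hoursToRecover := 6.0         // time to restore backups elsewhere
	revenueLossPerHour := 500.0   // what an hour of downtime costs

	expectedOutageLossPerYear := azOutagesPerYear * hoursToRecover * revenueLossPerHour
	multiAZCostPerYear := crossAZFeesPerMonth * 12

	fmt.Printf("expected outage loss/year: $%.0f\n", expectedOutageLossPerYear) // $300
	fmt.Printf("cross-AZ fees saved/year:  $%.0f\n", multiAZCostPerYear)        // $36000
}
```

If the expected loss comes out orders of magnitude below the fees, the single-AZ bet pays for itself.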

anon84873628 · 2 years ago
Yeah, for many businesses it probably isn't necessary to have crazy short RTO and RPO. Just restore the most recent backup in a new region and point at the cloud provider outage report...
dtx1 · 2 years ago
I think there's this general problem with cloud deployments I'm seeing happen more and more:

People building this huge Multi AZ, Hyper Redundant, Multi-National, infinitely scaling Cloud Solution for something that requires a single VM and a database.

Most companies just don't need that level of scale and would be better off building something smaller; when you actually do scale, you rewrite it with the profits made from the smaller solution.

Of course there are many companies that do require something large but you should seriously consider if something smaller will do first.

I think solutions like a 100% Cloudflare Workers-based backend can sidestep this a little, but that's not possible, or even the right thing, in every situation.

bushbaba · 2 years ago
The downside of single-AZ clusters is capacity. If you need to drastically scale up, the compute might not be available in a single AZ.
athoscouto · 2 years ago
Even though each cluster was single-AZ, the whole system wasn't, so we weren't bound by the capacity of a single AZ.

Most of the situations where we needed to drastically scale up were known ahead of time as well (e.g. a campaign from a customer), and we would preallocate instances or even additional clusters.

I may be forcing my memory, but if I'm not mistaken, our auto scaling was set up in a way that the system could handle sudden load increases of ~50% without noticeable disruption. Spikes bigger than that could lead to increased latency and/or error rate.
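
The headroom arithmetic behind that, as a tiny sketch (the 50% figure is from memory, as I said):

```go
package main

import "fmt"

func main() {
	// If load can suddenly jump by a factor of (1 + spike) before
	// autoscaling reacts, steady-state utilization must stay below
	// 1/(1+spike) for the existing capacity to absorb it.
	spike := 0.5 // the ~50% sudden increase mentioned above
	maxUtilization := 1 / (1 + spike)
	fmt.Printf("target utilization <= %.0f%%\n", maxUtilization*100) // ~67%
}
```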

jldugger · 2 years ago
Indeed, this is the main problem I run into. We have to scale up capacity before the traffic can be redirected, or you briefly double the scope of the outage. That involves multiple layers of capacity bring-up: the ASG brings up new nodes, then the HPA brings up the new pods.
danielovichdk · 2 years ago
"A single Slack API request from a user (for example, loading messages in a channel) may fan out into hundreds of RPCs to service backends, each of which must complete to return a correct response to the user."

Not being a dick here but is this not a fairly obvious flaw?

I mean, why not keep a structured "message log" of all channels of all time?

For every write the system updates the message log.

I am guessing and making assumptions I know.
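
Something like this naive in-memory sketch is what I mean (the real thing would obviously be a persisted, partitioned log, and every name here is invented):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type Message struct {
	Channel string
	Author  string
	Body    string
	SentAt  time.Time
}

// ChannelLog is the "structured message log": every write appends,
// and loading a channel is a single lookup instead of a fan-out.
type ChannelLog struct {
	mu   sync.RWMutex
	logs map[string][]Message
}

func NewChannelLog() *ChannelLog {
	return &ChannelLog{logs: make(map[string][]Message)}
}

// Append records a message at write time.
func (l *ChannelLog) Append(m Message) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.logs[m.Channel] = append(l.logs[m.Channel], m)
}

// Load returns the last n messages of a channel in one read.
func (l *ChannelLog) Load(channel string, n int) []Message {
	l.mu.RLock()
	defer l.mu.RUnlock()
	msgs := l.logs[channel]
	if len(msgs) > n {
		msgs = msgs[len(msgs)-n:]
	}
	return msgs
}

func main() {
	log := NewChannelLog()
	log.Append(Message{Channel: "#general", Author: "alice", Body: "hi", SentAt: time.Now()})
	fmt.Println(log.Load("#general", 50))
}
```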

namdnay · 2 years ago
I imagine the base messages are in a single store. But then you have reactions, attachments, gifs, user profiles, and probably hundreds of custom integrations/plugins.

Having worked on other messaging apps, I can say these are usually separated because their performance/scalability requirements are different.
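
That split is exactly where the fan-out in the quoted sentence comes from: each store is a separate backend, so one "load messages" request becomes parallel RPCs that must all succeed. A conceptual Go sketch (hypothetical service names, not Slack's actual code):

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

type ChannelView struct {
	Messages  []string
	Reactions map[string][]string
	Profiles  map[string]string
}

// loadChannel shows where the fan-out comes from: one user request
// becomes parallel RPCs to separately-scaled backends, and every
// one of them must succeed to return a correct response.
func loadChannel(ctx context.Context, channel string) (*ChannelView, error) {
	v := &ChannelView{}
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { // message store
		var err error
		v.Messages, err = fetchMessages(ctx, channel)
		return err
	})
	g.Go(func() error { // reactions service
		var err error
		v.Reactions, err = fetchReactions(ctx, channel)
		return err
	})
	g.Go(func() error { // user profile service
		var err error
		v.Profiles, err = fetchProfiles(ctx, channel)
		return err
	})
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return v, nil
}

// Stubs standing in for real RPC clients.
func fetchMessages(ctx context.Context, ch string) ([]string, error) { return []string{"hi"}, nil }
func fetchReactions(ctx context.Context, ch string) (map[string][]string, error) {
	return nil, nil
}
func fetchProfiles(ctx context.Context, ch string) (map[string]string, error) { return nil, nil }

func main() {
	v, err := loadChannel(context.Background(), "#general")
	fmt.Println(v, err)
}
```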

skullone · 2 years ago
XMMP was extensible to support all this in the early 2000s. Slack reinvented simple services in the most obtuse way. I have to use Slack and I sideline quarterback all the ways things could have been better every day.
zx8080 · 2 years ago
XMPP

Agree with this point of view. Except for the Jabber/XMPP Cisco legal thing, there's just no technical answer for why on earth Slack did not use XMPP under the hood.

nijave · 2 years ago
Also version history (edits), threads, and links to content in other channels (sharing a message).
philwelch · 2 years ago
> When companies create this microservices bog and then, when any problem comes up, they say, “distributed systems are hard” it reminds me of when my toddler throws food on the floor then says, “look, big mess”

https://x.com/telmudic/status/1684479894406025216

random3 · 2 years ago
This brings back memories - we spec'd an open distributed operating system called Metal Cell and built an implementation called Cell-OS. It was inspired by the "The Datacenter as a Computer" paper, but built with open-source tech.

We had it running across bare metal, AWS, and Azure, and one of its key aspects was that it handled persistent workloads for big data, including distributed databases.

Kubernetes was just getting built when we started and was supposed to be a Mesos scheduler initially.

I assumed Kubernetes would get all the pieces in and make things easier, but I still miss the whole paradigm we had almost 10 years ago.

This is retro now :)

https://github.com/cell-os/metal-cell

https://github.com/cell-os/cell-os

purpleturtle22 · 2 years ago
Can someone ELI5 the difference between using AWS availability zone affinity and then simply dropping the downed AZ at the topmost routing point?

Wouldn't that be the same thing, with the obvious caveat that you aren't using the routing technology Slack is using? (We don't - we use vanilla AWS offerings.)

t0mas88 · 2 years ago
They decided to use every routing tool available at least once in their setup, so they can't do this. But there is no explanation in the blog about why they use so many platforms and so many routing tools. Sounds to me like they got themselves into a mess and decided to continue on that path.
jonathankoren · 2 years ago
Somewhere, an engineering “leader” is going to point to this blog post and then say, “Well, that’s how Slack did it!” and promptly copy this overwrought system
esprehn · 2 years ago
Cells are not about guarding against AZ failure, but about partitioning the production infra to protect against bad deploys and configuration changes. Every AZ is split into many different cells.
hliyan · 2 years ago
So, guarding against human errors / process failures, and not hardware failures?
ec109685 · 2 years ago
Isn't that exactly what they are doing? Keeping requests within an AZ, and instead of using DNS at the first hop into the AZ, they use Envoy to control traffic shaping and make the initial decision of whether traffic needs to be routed away.
Terretta · 2 years ago
You're doing it right.
ec109685 · 2 years ago
Isn’t that exactly what they are doing? Keeping requests within an AZ and using global DNS at the first hop into AZ.
ThePhysicist · 2 years ago
So they run everything in AWS USE1? That doesn't seem very redundant, but then I guess if the whole of USE1 goes down Slack won't be the only service that will be affected.
messe · 2 years ago
AWS also uses Slack internally, so add that to the list of shit that can hit the fan if us-east-1/IAD goes down.
mynameisvlad · 2 years ago
Don’t they also use Chime? It wouldn’t be a single point of failure.
fotta · 2 years ago
Huh, I’m surprised they’re not all in on Chime.

johannes1234321 · 2 years ago
But then everybody trying to recover from a USE1 outage can't use Slack to coordinate the recovery ...
deanCommie · 2 years ago
the "whole" of USE1 very rarely goes down [0], because unlike other cloud providers, Amazon's availability zones are actually independent and decoupled, and if you're running on EC2 in a zonal way it's highly unlikely an outage will affect multiple zones.

[0] There are of course exceptions that come along once every few years, but most instances people can think of in terms of widespread outages are one specific service going down in a region, creating a cascade of other dependencies, e.g. Lambda or Kinesis going down and impacting some other higher-level service, say, Translate.

oceanplexian · 2 years ago
AZs are buildings, oftentimes right next to each other on the same street. People who think this is a great failure domain for your entire business are deeply misguided. All it takes is a hurricane, a truck hitting a pole, a fire, or any number of extremely common situations, and infra will be wiped off the map. Build stuff to be properly multi-region.
radicality · 2 years ago
Isn’t the point of the article that they don’t? And it describes how they implemented region drains to traffic shift between the different regions.

edit: Hmm or maybe not? I still sometimes confuse AWS terminology. Perhaps it is all in us-east-1, just in different availability zones (buildings?)

ThePhysicist · 2 years ago
If I understand it correctly they have an edge network for ingress traffic but host all of their core services in a single AWS region (USE1) in multiple availability zones there.
jldugger · 2 years ago
>edit: Hmm or maybe not? I still sometimes confuse AWS terminology. Perhaps it is all in us-east-1, just in different availability zones (buildings?)

Correct, us-east-1 has several AZs, with names like us-east-1a, us-east-1b, etc. IIRC us-east-1 has six of them now.

aftbit · 2 years ago
How can such an architecture function with respect to user data? If the primary DB instance handling your shard is in AZ-1 and AZ-1 gets drained, how can your writes continue to be serviced?
progbits · 2 years ago
Usually, in strongly consistent and durable distributed systems, data is not considered committed until it has been persisted to multiple replicas.

So if one goes down nothing is lost, but capacity and durability are degraded.
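
A minimal sketch of that commit rule (invented names, not any particular database's implementation):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// persistFn stands in for an RPC that durably writes to one replica.
type persistFn func(ctx context.Context, data []byte) error

// commit sends the write to every replica in parallel and succeeds
// once a majority has acknowledged; one replica (or one AZ) going
// down degrades capacity and durability, but loses nothing committed.
func commit(ctx context.Context, replicas []persistFn, data []byte) error {
	acks := make(chan error, len(replicas))
	for _, persist := range replicas {
		persist := persist
		go func() { acks <- persist(ctx, data) }()
	}
	quorum := len(replicas)/2 + 1
	oks, fails := 0, 0
	for range replicas {
		if err := <-acks; err == nil {
			oks++
			if oks >= quorum {
				return nil // durable on a majority: committed
			}
		} else {
			fails++
			if fails > len(replicas)-quorum {
				return errors.New("quorum not reached")
			}
		}
	}
	return errors.New("quorum not reached")
}

func main() {
	ok := func(ctx context.Context, d []byte) error { return nil }
	down := func(ctx context.Context, d []byte) error { return errors.New("AZ drained") }
	// Three replicas, one unavailable: the write still commits.
	fmt.Println(commit(context.Background(), []persistFn{ok, ok, down}, []byte("msg")))
}
```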

skybrian · 2 years ago
That makes sense on its own, but doesn’t it mean that there are lots of network requests happening between silos all the time? It doesn’t seem very siloed.

Or is this some lower-level service that “doesn’t count” somehow?

dexwiz · 2 years ago
Multiple tiers of redundancy. There is usually redundancy within the AZ and then a following copy in another AZ. Usually at least four copies exist for a tenant.

gibb0n · 2 years ago
Seems to be the same collection of services deployed in different AZs with a load balancer? The trick would be how data is replicated across the instances, which I'm guessing is some sort of event publishing or even backup sources of truth? It says that will come in the next article, and surely that's the more interesting part than the load balancing...

Also explains to me why new features would take a while to roll out if you are cautiously updating instances/AZs one by one.

alberth · 2 years ago
Is Slack still written in Hack/PHP?
muglug · 2 years ago
Yes — see my recent article https://slack.engineering/hakana-taking-hack-seriously/

We use a few languages to serve client requests, but by far the biggest codebase is written in Hack, which runs inside an interpreter called HHVM that’s also used at Facebook.

dcgudeman · 2 years ago
I noticed that the Hack blog (https://hhvm.com/blog/) basically stopped posting updates at the end of 2022. As downstream users of hacklang, have you folks noticed a change in development pace or ambition within the Hack development team?
koolba · 2 years ago
I really like the writing style in that article:

> PHP makes it really easy to make a dynamically-rendered website. PHP also makes it really easy to create an utterly insecure dynamically-rendered website.

WinLychee · 2 years ago
PHP has some excellent ideas that other languages can't replicate, while at the same time having terrible ideas that other languages don't have to think about. Overall a huge fan of modern PHP, thanks for this writeup.
alberth · 2 years ago
Hi Matt

Thanks for Psalm!

Curious: if Slack were built today from the ground up, what tech stack do you think should/would be used?

aftbit · 2 years ago
from the article:

>Slack does not share a common codebase or even runtime; services in the user-facing request path are written in Hack, Go, Java, and C++.

skullone · 2 years ago
Man what a mess. Meanwhile, everyone else can extend a library used by their common services in a common language trivially.