Slack outage: Connectivity issues affecting all workspaces

nojvek · 7 years ago

In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known to make huge web applications that don't go down in shame every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If google is down, probably your internet is down.

Their expertise and discipline in distributed applications is unrivaled. I'm guessing because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top notch who don't take shortcuts.

Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.

londons_explore · 7 years ago

Google is expert at designing services which you won't notice when there is downtime.

Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel tracking box might not appear, or the page won't say the "last visited" time beside search results.

The SREs are running around fixing whatever subsystem is down or broken, but you the user probably don't notice.
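The idea above is usually called graceful degradation. Here is a minimal sketch of it, not Google's actual design: the function names (`call_optional`, `render_results`, `broken_tracking`) and the page layout are made up for illustration. The core results are required, while extras such as the parcel tracking box are simply left out when their backend fails.

```python
from typing import Callable, Dict, Optional


def call_optional(fetch: Callable[[], str]) -> Optional[str]:
    """Run a non-critical subsystem call; return None instead of raising."""
    try:
        return fetch()
    except Exception:
        # A broken widget should never take the whole page down.
        return None


def render_results(query: str,
                   core_search: Callable[[str], list],
                   extras: Dict[str, Callable[[], str]]) -> dict:
    """Build a results page: core results are required, extras are optional."""
    page = {"query": query, "results": core_search(query)}
    for name, fetch in extras.items():
        value = call_optional(fetch)
        if value is not None:
            page[name] = value
    return page


def broken_tracking() -> str:
    # Hypothetical subsystem that is down right now.
    raise TimeoutError("parcel tracking backend unavailable")


page = render_results(
    "where is my parcel",
    core_search=lambda q: [f"result for {q}"],
    extras={
        "parcel_tracking": broken_tracking,
        "last_visited": lambda: "visited 3 days ago",
    },
)
print(page)
```

The user still gets results and the "last visited" note; only the tracking box is missing, which most people would not notice.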

anderiv · 7 years ago

While their mail service does have a remarkable track record for uptime, that same record is not shared by many of their other services.

I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet, and Google Drive.

Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.

werid · 7 years ago

Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down which is causing this.

adreamingsoul · 7 years ago

Gmail is one use case, and it was one of the original services from Google, so it has had one of the longest "bake" times with regard to knowing how to keep it online.

Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply it to another, it has to be architected and written into the code and most teams start each new project/service with fresh code.

SmellyGeekBoy · 7 years ago

Facebook springs to mind as well.

codemac · 7 years ago

And those SREs?

They use IRC.

rileyt · 7 years ago

There is also this https://status.slack.com/calendar, but they seem to grossly under report the actual downtime...

[edit] note that including this outage, they are reporting to have missed their monthly uptime guarantee 3 months in a row.
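For a sense of scale, here is a rough sketch of how little downtime a monthly uptime guarantee leaves room for. The 99.99% target, the 30-day month, and the 180-minute outage are assumptions for illustration, not figures taken from Slack's status page or contract.

```python
def allowed_downtime_minutes(uptime_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime a month can absorb while still meeting the target."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


def met_guarantee(downtime_minutes: float, uptime_percent: float,
                  days_in_month: int = 30) -> bool:
    """True if the observed downtime fits inside the monthly budget."""
    return downtime_minutes <= allowed_downtime_minutes(uptime_percent, days_in_month)


# Assumed target of 99.99% over a 30-day month: only a few minutes of budget.
budget = allowed_downtime_minutes(99.99)
print(f"budget: {budget:.2f} minutes")

# A hypothetical 180-minute outage blows through that budget many times over.
print(met_guarantee(180, 99.99))
```

Under those assumptions a single multi-hour outage misses the month, and any downtime that never makes it onto the status page makes the reported total look better than it was.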

dexterdog · 7 years ago

Yeah, stripe does the same thing with their status page. I get alerts that they have an outage at least once a week and more often than not it never shows up as anything in their history. Honestly this is my only significant beef with the service and I've been using it for years now with multiple integrations.

Deleted Comment

You know how much of the community uses one messaging system when 15 minutes after it going down, it has over 40 points on the front page!

This says a lot about how it's a single point of failure in modern company comms.

It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...

ITT: Chat about decentralisation that will ultimately lead to no action.*

*Because we've had this discussion so many times before...

FooBarWidget · 7 years ago

Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.

philwelch · 7 years ago

> In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that.

Although with outages like these, I doubt it!

vasilipupkin · 7 years ago

if the software is architected this poorly so that it can literally go down simultaneously for all clients, then why would I trust that it's secure?

drb91 · 7 years ago

> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.

Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.

ljm · 7 years ago

In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.

- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.

- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.

- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.

This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
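One way to avoid putting every notification in one basket is to give alerts an ordered list of channels. The sketch below is only an illustration: `notify`, `send_chat`, and `send_email` are hypothetical stand-ins, not real Slack or email client calls.

```python
from typing import Callable, List, Tuple

Channel = Tuple[str, Callable[[str], None]]


def notify(message: str, channels: List[Channel]) -> str:
    """Try each channel in order; return the name of the first that succeeds."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember why this channel failed and move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all channels failed: " + "; ".join(errors))


def send_chat(message: str) -> None:
    # Hypothetical chat sender, simulating the outage.
    raise ConnectionError("chat service unreachable")


sent_by_email: List[str] = []


def send_email(message: str) -> None:
    # Hypothetical email sender; here it just records the message.
    sent_by_email.append(message)


used = notify("CI build failed on main",
              [("chat", send_chat), ("email", send_email)])
print(used)
```

The fallback only helps if people actually read the second channel, which is the discipline problem described above.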

mikec3010 · 7 years ago

I think it's inexcusable for a chat program to go down in 2018.

* your hdd failed? Use a raid

* your power went out? Use a UPS

* your DNS went down? Use a fallback (slack2)

* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over

See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".

And inb4 "chill Mike it's just a chat server not life support firmware" yeah but slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.

djsumdog · 7 years ago

I worked at an open source company where they hosted their own IRC server. There are OSS alternatives to Slack and I wonder if that company has tried to adopt any of them.

This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).

If your self-hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yea we know. We broke something. We're working on it." If you're using a hosted (a.k.a. "Cloud") solution, you're just kinda fucked. You can't even extract your data and try to run it locally if it's down (if that's even an option).

cwyers · 7 years ago

That's uptime-as-anecdote. Yes, you can throw your entire IT department at your outage instead of waiting on the vendor to fix it. How many of us work somewhere where the entire IT team is as large as the team that works on Slack's uptime?

forgot-my-pw · 7 years ago

I remember the netsplits of IRC days...

RandallBrown · 7 years ago

Let's say the self hosted chat app does go down. Now someone has to fix it. Someone who probably has something better to do. In a cloud hosted solution, the person in charge of fixing your computer doesn't work for you.

My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.

Bartweiss · 7 years ago

I'm not sure about production dependent, but I'd love to see how many other companies have longer/worse outages thanks to this. There are definitely a lot of people counting on Slack as a sole channel to push low-level error notifications, and I doubt most of them have an easy fallback option.

brootstrap · 7 years ago

reading all this thread made me realize at my company (~50 people) we have a couple slack-bots that control a number of things, deploys being one of them. shrug

cremp · 7 years ago

It's not so much decentralization as chaos engineering.

Building the program to withstand failure after failure, of things in and out of your control. Seems like Slack needs some chaos engineers...
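A toy sketch of the fault-injection side of chaos engineering, under the assumption that a dependency can be wrapped in a function: `flaky` and `with_retries` are hypothetical helpers, and the 50% failure rate and seed are arbitrary. The point is to make failures routine in testing so the fallback path gets exercised.

```python
import random
from typing import Callable


def flaky(func: Callable[[], str], failure_rate: float,
          rng: random.Random) -> Callable[[], str]:
    """Wrap func so it raises ConnectionError with the given probability."""
    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func()
    return wrapper


def with_retries(func: Callable[[], str], attempts: int, fallback: str) -> str:
    """Call func up to `attempts` times; return the fallback if all fail."""
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError:
            continue
    return fallback


rng = random.Random(42)  # fixed seed so the experiment is repeatable
dependency = flaky(lambda: "fresh data", failure_rate=0.5, rng=rng)

# Run the caller many times against the deliberately unreliable dependency.
outcomes = [with_retries(dependency, attempts=3, fallback="cached data")
            for _ in range(1000)]
print(outcomes.count("fresh data"), outcomes.count("cached data"))
```

Every call returns something usable even though the dependency fails half the time, which is the property you want to verify before a real outage does it for you.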

ythn · 7 years ago

In my company we use Cisco Jabber for official comms but Slack unofficially. So when Slack goes down, we fall back on Jabber.

thomastjeffery · 7 years ago

My company has a customer support system that relies on Slack chatops.

It's an interesting morning, to say the least.

dijit · 7 years ago

To me it raises a concern, chatops and slack integrations are /very/ common, it's a form of vendor lock-in on their side and it makes absolute sense.

However, if you become dependent on chat-ops to do your job (say, fallbacks for common things have eroded due to lack of use), then suddenly your company is crippled. And why? For a chat service? The value add from Slack is grotesquely small in isolation.

yani · 7 years ago

What channel of communication did you pick to talk to your teammates on Slack? I've received messages by Facebook Messenger, Line, and the rusty email :)

danmg · 7 years ago

Here's a chat decentralisation platform: https://www.ratbox.org/ .

patrickg_zill · 7 years ago

I have used Mattermost and been pleased with it. It is an open-source Slack clone you can run on a low-end VM or your own hardware.

ivm · 7 years ago

Luckily most of my active communities are on Discord nowadays. It works much faster and even has a dark theme by default.

rkeene2 · 7 years ago

But it's not any less centralized, which I think was the complaint, not that it was popular even though this was mentioned.

dictum · 7 years ago

I say this as someone who almost always prefers the dark theme wherever it is available: I wonder how much this desire for dark interfaces comes from almost every app interface having bright colors on white.

Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.

palencharizard · 7 years ago

> even has a dark theme by default.

I love how having a dark theme is second only to "it works" in terms of how we pick services these days XD

asdojasdosadsa · 7 years ago

It's lovely how in Slack.app if you want dark theme, you have to modify the internal JavaScript files...

Kiro · 7 years ago

Sure, it's worrying but worth it for me personally. I might go to jail due to this (seriously) but at least people won't die. For me that's the threshold.

MockObject · 7 years ago

You can't leave us hanging like that. How could a Slack failure possibly send you to jail?

dannyrosen · 7 years ago

Hope Slack considers doing a post-mortem similar to Gitlab[1]. Sharing what they learned and giving customers context is appreciated.

[1]: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

trcollinson · 7 years ago

Yes, that way we can beat them up for years to come based on whatever mistake they made. It would be even better if they told us which employee made the mistake so we can incessantly mock that employee openly and publicly every time Slack is ever mentioned on HN. When GitHub was purchased by Microsoft, Gitlab came up quite a bit and we got to rehash that whole database outage over again many times over those few days. It was sad.

If it were my company, I would say as little as humanly possible.

coryfklein · 7 years ago

It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.

isostatic · 7 years ago

I hope you don't work in aviation with that attitude!

bgribble · 7 years ago

Maybe unrelated, but my AWS-hosted websockets-using app had an outage starting at the same time. Also a third-party API provider we use for handling inbound phone calls. So this smells like a wider outage than just Slack.

When I was in Moscow a few weeks back, Slack wouldn't work. Exact same behaviour - it loaded up the gui, loaded up previous conversations, but then wouldn't work past there.

Russia blocks a lot of AWS IPs, when I did a full VPN out to a server in Germany slack came good.

switch007 · 7 years ago

That's interesting. More speculation: they haven't given any detail in 2 hours, perhaps if it's an upstream/3rd-party problem, they haven't been given any info.

I know it's not exactly scientific, but the front page of https://downdetector.com shows a number of services that have problem spikes starting anywhere from 3am US/Eastern to 9am US/Eastern and continuing through now (11:24 US/Eastern): Google Home, Fortnite, Exede, Level 3, New York Times, AWS. Maybe totally unrelated to each other, who knows.

kohanz · 7 years ago

I'm wondering the same thing. I chose this morning to soft-launch my side-project/startup and sent out the sign-up link to my e-mail list. Of course, it's AWS Cognito-based, was working yesterday, but failed for the new users. Great timing! Phone support said they are looking into some outages (even though the status page is all green).

Maxion · 7 years ago

Telegram was down too just half an hour before slack. Dunno if they run on aws?

They do, and GCP.

I recall AWS/GCP public IPs getting banned in Russia when they were trying to block telegram.

r1ch · 7 years ago

Maybe I'm reading too much into it, but "We've received word that all workspaces are having troubles connecting to Slack." makes it sound like their internal monitoring didn't catch whatever is causing this. I was personally experiencing issues for about 20-30 minutes before the status update was posted.

parliament32 · 7 years ago

Pretty much every time there's a Slack outage it takes them a solid 20 minutes to update their status page. Several times I've emailed them 10 minutes into an outage (following "nobody at the office can reach slack, but their status page says smooth sailing, we should do more diagnostics in case it's office internet or something..."), then gotten a response 10 minutes later to the tune of "we're aware, we just updated our status page, go look at that". I think they consider updating their status page a PR problem, so they avoid it if the issue can be fixed in under X minutes.

Which also makes their uptime totals completely bogus.

raphaelj · 7 years ago

We started having issues connecting to Slack 4 hours before they reported a status.

joobus · 7 years ago

Their subsequent update makes it sound like they still don't have a clue.

> Our team is still looking into the cause of the connectivity issues, and we'll continue to update you on our progress.

bsiemon · 7 years ago

I think that is just the tongue in cheek language they like to use.

Dirlewanger · 7 years ago

Yeah, it's really funny and ironic when you cost customers money!

castis · 7 years ago

It's interesting to me that the update messages are posted every 30 minutes from 1st notification until resolution. Judging by this and every other outage I assume this is automatic, and probably implemented to appease the people who are probably frustrated by the outage.

https://status.slack.com/2018-06/142edcb9e52c7663

We have a similar policy at $WORK (but manual). In our experience customers go mental if you say absolutely nothing.

There is also zero information in those statuses, which kinda defeats the purpose. Might as well just have the status landing page with no details.

patrickxb · 7 years ago

Good catch...definitely updating every 30 minutes exactly.

tapoxi · 7 years ago

It's times like this I wish there was a solid decentralized standard to pick from, but there's no clear choice between XMPP and Matrix.

kirbypineapple · 7 years ago

We use Slack for everyday company wide communication/ announcements and Riot for encrypted secure communications (you can host Riot yourself): https://about.riot.im/

sanderjd · 7 years ago

It's not about the protocols, it's about having a client with a user experience that is acceptable to an entire company rather than just a team of engineers. Which decentralized protocol has such a client? (Speaking as someone who got burned trying to advocate for IRC at a company that eventually and inevitably switched to Slack.)

ummonk · 7 years ago

Curious why you got burned with IRC client UX given the multitude of clients available for it.

There's plenty of decent XMPP clients, like Spark (https://igniterealtime.org/projects/spark/) but they'd take an IT team to configure.

Matrix has Riot (https://riot.im/app) but personally I find it incredibly confusing.

mynewtb · 7 years ago

Zulip is amazing, if a self-hosted system without federation is an option for you.

tomstockmail · 7 years ago

Love the workflow with Zulip, but I hope they work out a way to join in with either Matrix or IRC3+ federation.

dielan · 7 years ago

Matrix is my preference for sure. It's fresh and exciting, while XMPP is harder to talk people into trying.

cbm-vic-20 · 7 years ago

Their site (https://matrix.org) reeks of hype-oriented engineering. From the most cursory overview of their home page, their decentralization looks a lot like IRC peering.

edhelas · 7 years ago

If you are interested we are building a communication platform for communities fully based on XMPP https://movim.eu/ :) It can easily be deployed on a Web server.

There's always IRC ;-)

samuell · 7 years ago

IRC is actually viable, with https://riot.im for offline logging and mobile access.

yepthatsreality · 7 years ago

Create a new one![0]

[0] https://xkcd.com/927/

nixgeek · 7 years ago

Despite having a vote increment velocity far exceeding other items, a publish time of only 25 minutes ago, and more points, this item just dropped from #5 to #7 on the front page.

How’s that work exactly?

Edit: It’s now dropped to #14 even with comment count also rapidly increasing.

_wmd · 7 years ago

300 comments in one hour will definitely kill it. HN penalizes controversy, and it uses comment count as a proxy for that. It works well most of the time.

maxerickson · 7 years ago

Comment count is a factor.

edit: it's a negative factor...

Thanks for the clarity, this makes more sense now!

calcifer · 7 years ago

Quoting myself from 8 months ago [1]:

> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.

> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.

[1] https://news.ycombinator.com/item?id=15576036

rattray · 7 years ago

Hmm, it's now off the front page entirely, which seems strange. I don't see much incendiary commenting or similar...

@dang, care to comment?