Readit News
anonu · 7 years ago
nojvek · 7 years ago
In light of how Slack and other companies haven't been able to get a decent level of uptime, I have to say, the company known to make huge web applications that don't go down in shame every couple of months is probably Google. I can't remember the last time Gmail was down. It just works! If google is down, probably your internet is down.

Their expertise and discipline in distributed applications is unrivaled. I'm guessing it's because they have datacenters everywhere with huge fat pipes in between, and their SREs are probably top-notch people who don't take shortcuts.

Google gets a whole bunch of things wrong at times, but some things, I gotta say, they've nailed.

londons_explore · 7 years ago
Google is expert at designing services which you won't notice when there is downtime.

Take Google Search for example. When there is downtime, results might be slightly less accurate, or the parcel tracking box might not appear, or the page won't say the "last visited" time beside search results.

The SREs are running around fixing whatever subsystem is down or broken, but you the user probably don't notice.
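A rough sketch of the degrade-instead-of-fail idea described above, in Python. Nothing here reflects how Google Search is actually built; `search`, `core_index`, `parcel_tracking`, and `last_visited` are hypothetical names made up for illustration. The point is only that optional subsystems get wrapped so their failure removes a feature rather than the whole page.

```python
def search(query, core, extras):
    """Return core results plus whichever optional extras are currently working."""
    page = {"results": core(query), "degraded": []}
    for name, fetch in extras.items():
        try:
            page[name] = fetch(query)
        except Exception:
            # Optional subsystem is down: drop its box, keep serving the page.
            page["degraded"].append(name)
    return page


# Hypothetical backends, assumed for the example only.
def core_index(query):
    return [f"result for {query}"]


def parcel_tracking(query):
    raise TimeoutError("tracking backend unavailable")


def last_visited(query):
    return "3 days ago"


page = search(
    "1Z999",
    core_index,
    {"parcel_box": parcel_tracking, "last_visited": last_visited},
)
```

Here the page still carries the core results and the "last visited" note; only the parcel box is missing, and it is recorded under `degraded` so someone can go fix it.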

anderiv · 7 years ago
While their mail service does have a remarkable track record for uptime, that same record is not shared by many of their other services.

I admin a number of GSuite accounts, and we experience fairly frequent (~monthly) periods of strange behavior with Hangouts/Meet, and Google Drive.

Fortunately Google is very good about providing updates via email to administrators as they're working through an issue.

werid · 7 years ago
Funny you should mention Google, as something is down over there right now. Lots of reports of Chromecasts being dead right now; I'm assuming something at Google is down and causing this.
adreamingsoul · 7 years ago
GMAIL is one use case, and it was one of the original services from Google and so it has one of the largest "bake" times with regards to knowing how to keep it online.

Every service/team has to go through a period of growing pains as they learn, improve, and fix the code to be more stable. You can't just take the learnings from one service and apply it to another, it has to be architected and written into the code and most teams start each new project/service with fresh code.

SmellyGeekBoy · 7 years ago
Facebook springs to mind as well.
codemac · 7 years ago
And those SREs?

They use IRC.

rileyt · 7 years ago
There is also this https://status.slack.com/calendar, but they seem to grossly underreport the actual downtime...

[edit] note that including this outage, they are reporting to have missed their monthly uptime guarantee 3 months in a row.

dexterdog · 7 years ago
Yeah, stripe does the same thing with their status page. I get alerts that they have an outage at least once a week and more often than not it never shows up as anything in their history. Honestly this is my only significant beef with the service and I've been using it for years now with multiple integrations.

Deleted Comment

sarreph · 7 years ago
You know how much of the community uses one messaging system when 15 minutes after it going down, it has over 40 points on the front page!

This says a lot about how it's a single point of failure in modern company comms.

It's even worrying to think about how some users probably have production-dependent (dare I postulate it) workflows in Slack that get crippled by its outage...

ITT: Chat about decentralisation that will ultimately lead to no action.*

*Because we've had this discussion so many times before...

FooBarWidget · 7 years ago
Yes it's a single point of failure, but so what? I don't particularly care whether other organizations fail at the same time as I do, I just care whether I fail. Hosting my own chat system does not solve that problem. In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that. It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down. If it's urgent I can use the phone, and my todo list is stored outside Slack.
philwelch · 7 years ago
> In fact, it may make it worse because then I have to worry about system administration, and Slack probably has more expertise on that.

Although with outages like these, I doubt it!

vasilipupkin · 7 years ago
if the software is architected this poorly so that it can literally go down simultaneously for all clients, then why would I trust that it's secure?
drb91 · 7 years ago
> It's likely that they can fix this problem for all customers faster than I can fix my problem for myself. And it's not like I'm crippled when Slack is down.

Well, you can probably infer the former from the dependency on the latter. You use these tools because they can reduce the scrambling when shit does hit the fan, not because they are necessary.

ljm · 7 years ago
In a way there's a second single point of failure though, right? So many people use Slack to integrate all kinds of things, and rely on their interaction with those platforms through Slack, that if Slack goes down then productivity halts and it's totally out of your hands while Slack themselves try to resolve the issue.

- You don't get GitHub notifications on pull requests and comments, so things don't get reviewed and merged if developers aren't in the habit of checking the PR tab on GitHub itself.

- You don't get CI notifications so you won't know how your latest test run or deploy is going without going straight into the CI service itself. Even worse when there's a failure and you're too used to having Slack warn you about that.

- Your team might depend on Slack so much that they don't know how else to efficiently communicate, and the most efficient channel to communicate a fallback is not available or rarely checked (e.g. email, face to face). So you get a lot of chaos as people come up with dozens of alternatives.

This is just poor discipline more than anything, putting too many structural eggs into one basket, but it doesn't change the fact that Slack has created that dependency.
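One way to avoid putting every notification egg in the chat basket, sketched in Python under obvious assumptions: `notify`, `send_chat`, and `send_email` are hypothetical stand-ins, not any real Slack, GitHub, or CI API. The idea is simply to try channels in order and fall through to the next one when the first is unreachable.

```python
def notify(message, channels):
    """Try each (name, send) pair in order; return the first channel that works."""
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember why this channel failed and move on to the next one.
            errors.append((name, exc))
    raise RuntimeError(f"all channels failed: {errors}")


# Hypothetical senders: chat is down, email just appends to an in-memory outbox.
def send_chat(message):
    raise ConnectionError("chat service unreachable")


outbox = []


def send_email(message):
    outbox.append(message)


used = notify("CI build failed", [("chat", send_chat), ("email", send_email)])
```

With chat down, the CI failure lands in the email outbox instead of vanishing. It only helps if people actually check the fallback channel, which is the discipline problem mentioned above.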

mikec3010 · 7 years ago
I think it's inexcusable for a chat program to go down in 2018.

* your hdd failed? Use a raid

* your power went out? Use a UPS

* your DNS went down? Use a fallback (slack2)

* your whole datacenter flooded? Good thing you have multiple replicated cloud instances that seamlessly take over

See, these are the issues that "the cloud" was supposed to solve. Not give us the same problems as before, just with a recurring bill for "chat as a service".

And inb4 "chill Mike it's just a chat server not life support firmware" yeah but slack is the most trivial software you can think of: send text from one computer to another. I see no reason this service can't be nearly as reliable as life support firmware in 2018. We've had over 30 years to get this right. Raise the fricking bar.

Dead Comment

djsumdog · 7 years ago
I worked at an open source company where they hosted their own IRC server. There are OSS alternatives to Slack and I wonder if that company has tried to adopt any of them.

This all goes back to one basic fact: The Cloud is Someone Else's Computer(tm).

If your hosted Confluence or Jira is down, you can go walk over to your IT team and they'll be like, "Yea we know. We broke something. We're working on it." If you're using a hosted (a.k.a. "Cloud") solution, you're just kinda fucked. You can't even extract your data and try to run it locally if it's down (if that's even an option).

cwyers · 7 years ago
That's uptime-as-anecdote. Yes, you can throw your entire IT department at your outage instead of waiting on the vendor to fix it. How many of us work somewhere where the entire IT team is as large as the team that works on Slack's uptime?
forgot-my-pw · 7 years ago
I remember the netsplits of IRC days...
RandallBrown · 7 years ago
Let's say the self hosted chat app does go down. Now someone has to fix it. Someone who probably has something better to do. In a cloud hosted solution, the person in charge of fixing your computer doesn't work for you.

My experience with self hosted solutions is that they go down way more often and take longer to fix than cloud solutions.

Bartweiss · 7 years ago
I'm not sure about production dependent, but I'd love to see how many other companies have longer/worse outages thanks to this. There are definitely a lot of people counting on Slack as a sole channel to push low-level error notifications, and I doubt most of them have an easy fallback option.
brootstrap · 7 years ago
reading all this thread made me realize at my company (~50 people) we have a couple slack-bots that control a number of things, deploys being one of them. shrug
cremp · 7 years ago
It's not so much decentralization as chaos engineering.

Building the program to withstand failure after failure, of things in and out of your control. Seems like Slack needs some chaos engineers...
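A toy illustration of the fault-injection side of chaos engineering, in Python. This is a sketch, not how Slack or anyone else does it: `inject_failures`, `with_retries`, and `deliver` are hypothetical names, and a fixed failure schedule stands in for the random failures real chaos tooling would inject, just to keep the example deterministic.

```python
def inject_failures(func, schedule):
    """Wrap func so that a call raises whenever the next schedule entry is True."""
    plan = iter(schedule)

    def wrapper(*args, **kwargs):
        # Once the schedule runs out, calls pass through untouched.
        if next(plan, False):
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)

    return wrapper


def with_retries(func, attempts):
    """Call func up to `attempts` times, re-raising the last connection error."""
    def wrapper(*args, **kwargs):
        last = None
        for _ in range(attempts):
            try:
                return func(*args, **kwargs)
            except ConnectionError as exc:
                last = exc
        raise last

    return wrapper


def deliver(message):
    return f"delivered: {message}"


# Dependency fails twice, then recovers; three attempts are enough to ride it out.
chaotic = inject_failures(deliver, [True, True, False])
resilient = with_retries(chaotic, attempts=3)
result = resilient("hello")
```

The useful part is running the same code with a harsher schedule and seeing where it stops coping, before production finds out for you.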

ythn · 7 years ago
In my company we use Cisco Jabber for official comms but Slack unofficially. So when Slack goes down, we fall back on Jabber.
thomastjeffery · 7 years ago
My company has a customer support system that relies on Slack chatops.

It's an interesting morning, to say the least.

dijit · 7 years ago
To me it raises a concern, chatops and slack integrations are /very/ common, it's a form of vendor lock-in on their side and it makes absolute sense.

However, if you become dependent on chat-ops to do your job. (say: fallbacks for common things have eroded due to lack of use) then suddenly your company is crippled. And why? for a chat service? The value add from slack is grotesquely small in isolation.

yani · 7 years ago
What channel of communication did you pick to talk to your teammates on Slack? I've received messages by Facebook Messenger, Line, and the rusty email :)
danmg · 7 years ago
Here's a chat decentralisation platform: https://www.ratbox.org/ .

Deleted Comment

patrickg_zill · 7 years ago
I have used Mattermost and been pleased with it. It is an open-source Slack clone you can run on a low-end VM or your own hardware.
ivm · 7 years ago
Luckily most of my active communities are on Discord nowadays. It works much faster and even has a dark theme by default.
rkeene2 · 7 years ago
But it's not any less centralized, which I think was the complaint, not that it was popular even though this was mentioned.
dictum · 7 years ago
I say this as someone who almost always prefers the dark theme wherever it is available: I wonder how much this desire for dark interfaces comes from almost every app interface having bright colors on white.

Somewhere along the shift to flat design, grays and non-bright colors have been ignored in the visual design of applications.

palencharizard · 7 years ago
> even has a dark theme by default.

I love how having a dark theme is second only to "it works" in terms of how we pick services these days XD

asdojasdosadsa · 7 years ago
It's lovely how in Slack.app if you want dark theme, you have to modify the internal javascript files...
Kiro · 7 years ago
Sure, it's worrying but worth it for me personally. I might go to jail due to this (seriously) but at least people won't die. For me that's the threshold.
MockObject · 7 years ago
You can't leave us hanging like that. How could a Slack failure possibly send you to jail?
dannyrosen · 7 years ago
Hope Slack considers doing a post-mortem similar to Gitlab[1]. Sharing what they learned and giving customers context is appreciated.

[1]: https://about.gitlab.com/2017/02/10/postmortem-of-database-o...

trcollinson · 7 years ago
Yes, that way we can beat them up for years to come based on whatever mistake they made. It would be even better if they told us which employee made the mistake so we can incessantly mock that employee openly and publicly every time Slack is ever mentioned on HN. When GitHub was purchased by Microsoft, Gitlab came up quite a bit and we got to rehash that whole database outage over again many times over those few days. It was sad.

If it were my company, I would say as little as humanly possible.

coryfklein · 7 years ago
It's not about assigning blame, it's about sharing lessons learned with the broader community and being transparent and honest with paying customers about issues that may have significant impact on downstream productivity.
isostatic · 7 years ago
I hope you don't work in aviation with that attitude!
bgribble · 7 years ago
Maybe unrelated, but my AWS-hosted websockets-using app had an outage starting at the same time. Also a third-party API provider we use for handling inbound phone calls. So this smells like a wider outage than just Slack.
isostatic · 7 years ago
When I was in Moscow a few weeks back, Slack wouldn't work. Exact same behaviour - it loaded up the gui, loaded up previous conversations, but then wouldn't work past there.

Russia blocks a lot of AWS IPs, when I did a full VPN out to a server in Germany slack came good.

switch007 · 7 years ago
That's interesting. More speculation: they haven't given any detail in 2 hours, perhaps if it's an upstream/3rd-party problem, they haven't been given any info.
bgribble · 7 years ago
I know it's not exactly scientific, but the front page of https://downdetector.com shows a number of services that have problem spikes starting anywhere from 3am US/Eastern to 9am US/Eastern and continuing through now (11:24 US/Eastern): Google Home, Fortnite, Exede, Level 3, New York Times, AWS. Maybe totally unrelated to each other, who knows.
kohanz · 7 years ago
I'm wondering the same thing. I chose this morning to soft-launch my side-project/startup and sent out the sign-up link to my e-mail list. Of course, it's AWS Cognito-based, was working yesterday, but failed for the new users. Great timing! Phone support said they are looking into some outages (even though the status page is all green).
Maxion · 7 years ago
Telegram was down too just half an hour before slack. Dunno if they run on aws?
dijit · 7 years ago
They do, and GCP.

I recall AWS/GCP public IPs getting banned in Russia when they were trying to block telegram.

r1ch · 7 years ago
Maybe I'm reading too much into it, but "We've received word that all workspaces are having troubles connecting to Slack." makes it sound like their internal monitoring didn't catch whatever is causing this. I was personally experiencing issues for about 20-30 minutes before the status update was posted.
parliament32 · 7 years ago
Pretty much every time there's a slack outage it takes them a solid 20 minutes to update their status page. Several times I've emailed them 10 minutes into an outage (following "nobody at the office can reach slack, but their status page says smooth sailing, we should do more diagnostics in case it's office internet or something..."), then gotten a response 10 minutes later to the tune of "we're aware, we just updated our status page, go look at that". I think they consider updating their status page a PR problem, so they avoid it if the issue can be fixed in under X minutes.

Which also makes their uptime totals completely bogus.
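To put rough numbers on why unreported minutes matter, here is a small Python sketch of the arithmetic. The 99.99% target and the 30-day month are assumptions picked for illustration, not figures taken from this thread or from any vendor's SLA; `uptime_percent` and `allowed_downtime_minutes` are made-up helper names.

```python
def uptime_percent(downtime_minutes, days_in_month=30):
    """Uptime over one month, as a percentage."""
    total = days_in_month * 24 * 60  # minutes in the month
    return 100.0 * (total - downtime_minutes) / total


def allowed_downtime_minutes(target_percent, days_in_month=30):
    """Downtime budget in minutes implied by an uptime target."""
    total = days_in_month * 24 * 60
    return total * (100.0 - target_percent) / 100.0


ASSUMED_TARGET = 99.99  # hypothetical monthly uptime target

budget = allowed_downtime_minutes(ASSUMED_TARGET)
with_unreported_delay = uptime_percent(20)  # one 20-minute unreported window
```

Under those assumptions the monthly budget is only about 4.3 minutes, so a single 20-minute window that never reaches the status page is already several times the budget.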

raphaelj · 7 years ago
We started having issues connecting to Slack 4 hours before they reported a status.
joobus · 7 years ago
Their subsequent update makes it sound like they still don't have a clue.

> Our team is still looking into the cause of the connectivity issues, and we'll continue to update you on our progress.

bsiemon · 7 years ago
I think that is just the tongue in cheek language they like to use.
Dirlewanger · 7 years ago
Yeah, it's really funny and ironic when you cost customers money!
castis · 7 years ago
It's interesting to me that the update messages are posted every 30 minutes from 1st notification until resolution. Judging by this and every other outage I assume this is automatic, and probably implemented to appease the people who are probably frustrated by the outage.

https://status.slack.com/2018-06/142edcb9e52c7663

switch007 · 7 years ago
We have a similar policy at $WORK (but manual). In our experience customers go mental if you say absolutely nothing.
parliament32 · 7 years ago
There is also zero information in those statuses, which kinda defeats the purpose. Might as well just have the status landing page with no details.
patrickxb · 7 years ago
Good catch...definitely updating every 30 minutes exactly.
tapoxi · 7 years ago
It's times like this I wish there was a solid decentralized standard to pick from, but there's no clear choice between XMPP and Matrix.
kirbypineapple · 7 years ago
We use Slack for everyday company-wide communication/announcements and Riot for encrypted secure communications (you can host Riot yourself): https://about.riot.im/
sanderjd · 7 years ago
It's not about the protocols, it's about having a client with a user experience that is acceptable to an entire company rather than just a team of engineers. Which decentralized protocol has such a client? (Speaking as someone who got burned trying to advocate for IRC at a company that eventually and inevitably switched to Slack.)
ummonk · 7 years ago
Curious why you got burned with IRC client UX given the multitude of clients available for it.
tapoxi · 7 years ago
There's plenty of decent XMPP clients, like Spark (https://igniterealtime.org/projects/spark/) but they'd take an IT team to configure.

Matrix has Riot (https://riot.im/app) but personally I find it incredibly confusing.

mynewtb · 7 years ago
Zulip is amazing, if a self-hosted system without federation is an option for you.
tomstockmail · 7 years ago
Love the workflow with Zulip, but I hope they work out a way to join in with either Matrix or IRC3+ federation.
dielan · 7 years ago
Matrix is my preference for sure. It's fresh and exciting, while XMPP is harder to talk people into trying.
cbm-vic-20 · 7 years ago
Their site (https://matrix.org) reeks of hype-oriented engineering. From the most cursory overview of their home page, their decentralization looks a lot like IRC peering.
edhelas · 7 years ago
If you are interested we are building a communication platform for communities fully based on XMPP https://movim.eu/ :) It can easily be deployed on a Web server.
rkeene2 · 7 years ago
There's always IRC ;-)
samuell · 7 years ago
IRC is actually viable, with https://riot.im for offline logging and mobile access.

Deleted Comment

yepthatsreality · 7 years ago
Create a new one![0]

[0] https://xkcd.com/927/

nixgeek · 7 years ago
Despite having a vote increment velocity far exceeding other items, a publish time of only 25 minutes ago, and more points, this item just dropped from #5 to #7 on the front page.

How’s that work exactly?

Edit: It’s now dropped to #14 even with comment count also rapidly increasing.

_wmd · 7 years ago
300 comments in one hour will definitely kill it. HN penalizes controversy, which it uses comment count as a proxy for. It works well most of the time
maxerickson · 7 years ago
Comment count is a factor.

edit: it's a negative factor...
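For anyone wondering how comment count could act as a negative factor, here is a purely illustrative Python sketch. It is not HN's ranking code, which isn't described in this thread: the time-decay form, the exponents, `penalty_threshold`, and the squared ratio are all assumptions chosen to show the shape of the idea.

```python
def rank_score(points, comments, age_hours, gravity=1.8, penalty_threshold=40):
    """Time-decayed score with a hypothetical penalty for comment-heavy threads."""
    base = max(points - 1, 0) ** 0.8 / (age_hours + 2) ** gravity
    # Assumed controversy proxy: many comments, and more comments than points.
    if comments > penalty_threshold and comments > points:
        base *= (points / comments) ** 2
    return base


calm = rank_score(points=150, comments=30, age_hours=1.0)
heated = rank_score(points=150, comments=300, age_hours=1.0)
```

With these made-up numbers the heated thread scores a quarter of the calm one at the same age and points, which would be enough to slide it down the page even while votes keep coming in.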

nixgeek · 7 years ago
Thanks for the clarity, this makes more sense now!
calcifer · 7 years ago
Quoting myself from 8 months ago [1]:

> I really don't understand these types of questions. The possible answers range from "because the ranking works that way" to "someone with privileges wanted it that way". On either end of the spectrum, the real question remains: so what? What difference does it make why a particular post is in a particular position? If the title seems interesting, you click on it. If not, you move on.

> I don't mean to question you in particular. It just seems like such a trivial concern to me that I truly can't understand why someone might possibly care.

[1] https://news.ycombinator.com/item?id=15576036

rattray · 7 years ago
Hmm, it's now off the front page entirely, which seems strange. I don't see much incendiary commenting or similar...

@dang, care to comment?