What this suggests is that Slack, for reasons passing understanding, enabled DNSSEC on their zones (with a DS record that essentially turns DNSSEC on, and the accompanying key records) --- then disabled DNSSEC by pulling all the records. But the DS records are in caches; validating resolvers go looking for the keys, which don't exist, and say "welp, I guess Slack.com doesn't exist".
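You can watch that failure mode from the outside with dig; a rough sketch, pointed at whichever validating resolver still has the stale DS cached:

$ dig DS slack.com +noall +answer          # the stale DS from .com may still be cached
$ dig +cd DNSKEY slack.com +noall +answer  # checking disabled: shows there is no DNSKEY left to fetch
$ dig A slack.com                          # SERVFAIL, the chain of trust can't be built
$ dig +cd A slack.com                      # checking disabled: the A record comes back fine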
I wonder if they are using tooling that doesn't properly retain DNSKEY records for DS records that were recently removed? This is one of the reasons we perform controlled, automated key rotation and removal at DNSimple: we make sure the keys stay in the authoritative zone on each key rollover, giving the DS records time to expire from caches.
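A minimal sketch of that ordering, assuming you control both sides of the delegation:

# 1. Remove the DS record at the parent (.com) first.
# 2. Check the DS TTL the parent publishes and wait at least that long:
$ dig DS slack.com @a.gtld-servers.net. +noall +answer   # the second field is the TTL (86400 = 24h)
# 3. Only after that window has passed, remove the DNSKEY/RRSIG records from the authoritative zone.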
We had a DNS-related outage with Route 53. Some of our zones just lost some records, and then they reappeared. Could that explain what happened to Slack's DNSSEC-related records?
Aren't you, in fact, the same Thomas Ptacek who has repeatedly claimed that DNSSEC is so irrelevant that events like this would go essentially unnoticed?
> DNSSEC is moribund and almost nobody uses it; in reality, the DNSSEC root private keys could land on Pastebin tomorrow and nothing would "break"
We have this whole thread here about a "service disruption" for Slack, and nobody leaked the "root private keys"; just one person made a dumb error and it blew up their site.
https://dnsviz.net/d/slack.com/YVXX_g/dnssec/ is the dnsviz analysis showing the slack.com zone DNSKEY existing at 12:55, followed by the .com zone DS record at 15:30. However, the next analysis at 17:24 shows that both the .com zone DS and the slack.com DNSKEY records have disappeared!
Given that the slack.com DNSKEY shows up with a 1h TTL and the .com zone DS has a 24h TTL, they are screwed in the presence of cached slack.com DS records from the .com zone. Do not throw away your DNSKEY until your delegation's TTL has absolutely positively surely expired from any resolver caches!
The slack.com domain is an AWS Route 53 zone; I'd be really interested to see a post-mortem explaining what happened here. Are they unable to recover the KSK/ZSK and restore the DNSKEY/etc. records?
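If it is on Route 53's managed signing and the KSK still exists in KMS, something along these lines ought to surface and restore the signing state (sketch only; Z0123456789ABC and example-ksk are made-up identifiers):

$ aws route53 get-dnssec --hosted-zone-id Z0123456789ABC   # shows the signing status and the KSKs
$ aws route53 activate-key-signing-key --hosted-zone-id Z0123456789ABC --name example-ksk
$ aws route53 enable-hosted-zone-dnssec --hosted-zone-id Z0123456789ABC   # republishes DNSKEY and RRSIGs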
FWIW, after getting exactly the same output from `delv` locally, I noted that
$ dig +short slack.com @4.2.2.1
15.206.34.128
and I take this to mean that, despite failing DNSSEC, Level 3 is yielding an A record - and that the major players have basically monkey-patched this into working again, and it just needs to propagate through all the caches now?
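A quick way to compare the big public resolvers (the list is just a sample):

$ for r in 8.8.8.8 1.1.1.1 9.9.9.9 4.2.2.1; do printf '%s: %s\n' "$r" "$(dig +short A slack.com @"$r" | head -1)"; done
# a blank answer means that resolver is still SERVFAILing on validation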
You can also just use dig. Running: `dig a www.slack.com` returns a SERVFAIL for me. Asking my resolver to skip the dnssec checking gives me the A record though: `dig +cd a www.slack.com`
I then look at my unbound dns resolver logs:
Sep 30 21:53:11 unbound[8985:0] info: validation failure <www.slack.com. A IN>: No DNSKEY record from 208.67.220.123 for key slack.com. while building chain of trust
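If you run your own unbound and don't feel like waiting for the stale DS to age out, you can evict it yourself (sketch; assumes the remote-control interface is enabled):

$ unbound-control flush_zone slack.com   # drops everything cached at or below slack.com, including the DS/DNSKEY
$ dig A www.slack.com                    # the next lookup should resolve (as an insecure delegation) and succeed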
Based on that status page and the listserv email link you posted, it seems like they still don't understand what's going on (if that email is correct).
It seems to be standard practice to be as fuzzy as possible during an outage, and only share details later. Probably avoids anyone looking stupid if the initial hunch turns out to be wrong.
> Less than 1% of users may be experiencing trouble connecting to Slack
This is a good example of how to not communicate uptime issues. It clearly shows that Slack is trying to downplay the problem and do damage control. This is not smart and gets customers upset.
Also, it seems to be based on assumptions and not facts. It is currently not working for many people at my work, whether they're sitting at home or at the office. Changing DNS to 8.8.8.8 "fixes" it, but it is clearly impacting more than 1%, considering they are spread out over several countries and ISPs at home.
Edited to add, e.g. https://news.ycombinator.com/item?id=22400167
> Both Google and Cloudflare have a publicly accessible feature to flush the cache for a domain, so anyone could have done it:
> https://developers.google.com/speed/public-dns/cache
> https://1.1.1.1/purge-cache/
Quite a useful feature indeed.
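Those are just web forms, but you can check whether a purge actually took effect (rough check):

$ dig DS slack.com @8.8.8.8 +noall +answer   # should now come back empty, the stale DS is gone
$ dig A slack.com @1.1.1.1                   # should be NOERROR again instead of SERVFAIL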
Slack support says that users should tell their ISPs to invalidate the DNS cache for slack.com: https://status.slack.com/2021-09/06c1e17de93e7dc2 (access with 8.8.8.8 as resolver; fallback: https://slack-status.azureedge.net/)
Since the faulty DS record was in .com, everyone has a max wait-for-ttl-to-expire time of 24h.
Google/Cloudflare etc. also seem to invalidate .com caching very quickly; switching to 8.8.8.8 quickly became the first workaround.
Meanwhile, 14 hours later, DTAG in Germany still does not resolve it. Their default resolvers have DNSSEC validation enabled.
dig slack.com +cd
tells the resolver to skip DNSSEC validation, and then it works again. Screenshots with the command output are in https://twitter.com/dnsmichi/status/1443840645513293853?s=2
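To see how much longer a stubborn resolver like DTAG's will keep failing, check the remaining TTL it reports on the cached DS (sketch; substitute your ISP's resolver address):

$ dig DS slack.com @<isp-resolver> +noall +answer   # the TTL field counts down to when it will self-heal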
Very interested in the post-mortem analysis. I think there were mistakes similar to the nasa.gov incident and the Comcast analysis in 2012: https://www.internetsociety.org/blog/2012/01/comcast-release...
Learnings for me:
- dnstracer (https://gitlab.com/dnsmichi/dotfiles/-/blob/main/Brewfile#L5...) helps with detecting missing glue records, but not DNSSEC issues
- dnstrace (https://github.com/rs/dnstrace) is a better alternative with DNSSEC support
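delv (part of BIND 9, mentioned elsewhere in the thread) also tells you why validation fails, which dig alone doesn't. A rough example:

$ delv slack.com A +vtrace   # traces validation and reports the broken chain of trust / missing DNSKEY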