What this suggests is that Slack, for reasons passing understanding, enabled DNSSEC on their zones (with a DS record that essentially turns DNSSEC on, and the accompanying key records) --- then disabled DNSSEC by pulling all the records. But the DS records are in caches; validating resolvers go looking for the keys, which don't exist, and say "welp, I guess Slack.com doesn't exist".
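You can watch that failure mode from the outside with dig; a rough sketch, pointed at whichever validating resolver still has the stale DS cached:

$ dig DS slack.com +noall +answer          # the stale DS from .com may still be cached
$ dig +cd DNSKEY slack.com +noall +answer  # checking disabled: shows there is no DNSKEY left to fetch
$ dig A slack.com                          # SERVFAIL, the chain of trust can't be built
$ dig +cd A slack.com                      # checking disabled: the A record comes back fine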
I wonder if they are using tooling that doesn't properly retain DNSKEY records for DS records that were recently removed? This is one of the reasons we perform controlled, automated key rotation and removal at DNSimple: we make sure the keys stay in the authoritative zone on each key rollover, giving the DS records time to expire from caches.
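A minimal sketch of that ordering, assuming you control both sides of the delegation:

# 1. Remove the DS record at the parent (.com) first.
# 2. Check the DS TTL the parent publishes and wait at least that long:
$ dig DS slack.com @a.gtld-servers.net. +noall +answer   # the second field is the TTL (86400 = 24h)
# 3. Only after that window has passed, remove the DNSKEY/RRSIG records from the authoritative zone.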
We had a DNS-related outage with Route 53. Some of our zones just lost some records, and then they reappeared. Could that explain what happened to Slack's DNSSEC-related records?
Aren't you, in fact, the same Thomas Ptacek who has repeatedly claimed that DNSSEC is so irrelevant that events like this would go essentially unnoticed?
> DNSSEC is moribund and almost nobody uses it; in reality, the DNSSEC root private keys could land on Pastebin tomorrow and nothing would "break"
We have this whole thread here about a "service disruption" for Slack, and nobody leaked the "root private keys"; just one person made a dumb error and it blew up their site.
https://dnsviz.net/d/slack.com/YVXX_g/dnssec/ is the dnsviz analysis showing the slack.com zone DNSKEY existing at 12:55, followed by the .com zone DS record at 15:30. However, the next analysis at 17:24 shows that both the .com zone DS and the slack.com DNSKEY records have disappeared!
Given that the slack.com DNSKEY shows up with a 1h TTL and the .com zone DS has a 24h TTL, they are screwed in the presence of cached slack.com DS records from the .com zone. Do not throw away your DNSKEY until your delegation's TTL has absolutely positively surely expired from any resolver caches!
The slack.com domain is an AWS Route 53 zone; I'd be really interested to see a post-mortem explaining what happened here. Are they unable to recover the KSK/ZSK and restore the DNSKEY/etc. records?
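If it is on Route 53's managed signing and the KSK still exists in KMS, something along these lines ought to surface and restore the signing state (sketch only; Z0123456789ABC and example-ksk are made-up identifiers):

$ aws route53 get-dnssec --hosted-zone-id Z0123456789ABC   # shows the signing status and the KSKs
$ aws route53 activate-key-signing-key --hosted-zone-id Z0123456789ABC --name example-ksk
$ aws route53 enable-hosted-zone-dnssec --hosted-zone-id Z0123456789ABC   # republishes DNSKEY and RRSIGs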
FWIW, after getting exactly the same output from `delv` locally, I noted that
$ dig +short slack.com @4.2.2.1
15.206.34.128
and I take this to mean that, despite failing DNSSEC, Level 3 is yielding an A record - and that the major players have basically monkey-patched this into working again, and it just needs to propagate through all the caches now?
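A quick way to compare the big public resolvers (the list is just a sample):

$ for r in 8.8.8.8 1.1.1.1 9.9.9.9 4.2.2.1; do printf '%s: %s\n' "$r" "$(dig +short A slack.com @"$r" | head -1)"; done
# a blank answer means that resolver is still SERVFAILing on validation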
You can also just use dig. Running: `dig a www.slack.com` returns a SERVFAIL for me. Asking my resolver to skip the dnssec checking gives me the A record though: `dig +cd a www.slack.com`
I then look at my unbound dns resolver logs:
Sep 30 21:53:11 unbound[8985:0] info: validation failure <www.slack.com. A IN>: No DNSKEY record from 208.67.220.123 for key slack.com. while building chain of trust
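If you run your own unbound and don't feel like waiting for the stale DS to age out, you can evict it yourself (sketch; assumes the remote-control interface is enabled):

$ unbound-control flush_zone slack.com   # drops everything cached at or below slack.com, including the DS/DNSKEY
$ dig A www.slack.com                    # the next lookup should resolve (as an insecure delegation) and succeed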
Based on that status page and the listserv email link you posted, it seems like they still don't understand what's going on (if that email is correct).
It seems to be standard practice to be as fuzzy as possible during an outage, and only share details later. Probably avoids anyone looking stupid if the initial hunch turns out to be wrong.
> Less than 1% of users may be experiencing trouble connecting to Slack
This is a good example of how to not communicate uptime issues. It clearly shows that Slack is trying to downplay the problem and do damage control. This is not smart and gets customers upset.
Also, it seems to be based on assumptions and not facts. It is currently not working for many people at my work, whether they're sitting at home or at the office. Changing DNS to 8.8.8.8 "fixes" it, but it is clearly impacting more than 1%, considering they are spread out over several countries and ISPs at home.
Edited to add, e.g. https://news.ycombinator.com/item?id=22400167
> Both Google and Cloudflare have a publicly accessible feature to flush the cache for a domain, so anyone could have done it:
> https://developers.google.com/speed/public-dns/cache
> https://1.1.1.1/purge-cache/
Quite a useful feature indeed.
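Those are just web forms, but you can check whether a purge actually took effect (rough check):

$ dig DS slack.com @8.8.8.8 +noall +answer   # should now come back empty, the stale DS is gone
$ dig A slack.com @1.1.1.1                   # should be NOERROR again instead of SERVFAIL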
Slack support says that users should tell their ISPs to invalidate the DNS cache for slack.com: https://status.slack.com/2021-09/06c1e17de93e7dc2 (access with 8.8.8.8 as resolver; fallback: https://slack-status.azureedge.net/)
Since the faulty DS record was in .com, everyone has a max wait-for-ttl-to-expire time of 24h.
Google/Cloudflare etc. also seem to invalidate .com caching very quickly; switching to 8.8.8.8 quickly became the first workaround.
Meanwhile, 14 hours later, DTAG in Germany still does not resolve it. Their default resolvers have DNSSEC validation enabled.
dig slack.com +cd
tells the resolver to skip DNSSEC validation, and then it works again. Screenshots with the command output are in https://twitter.com/dnsmichi/status/1443840645513293853?s=2
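To see how much longer a stubborn resolver like DTAG's will keep failing, check the remaining TTL it reports on the cached DS (sketch; substitute your ISP's resolver address):

$ dig DS slack.com @<isp-resolver> +noall +answer   # the TTL field counts down to when it will self-heal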
Very interested in the post-mortem analysis. I think there were mistakes similar to the nasa.gov incident and the Comcast analysis in 2012: https://www.internetsociety.org/blog/2012/01/comcast-release...
Learnings for me:
- dnstracer (https://gitlab.com/dnsmichi/dotfiles/-/blob/main/Brewfile#L5...) helps with detecting missing glue records, but not DNSSEC issues
- dnstrace (https://github.com/rs/dnstrace) is a better alternative with DNSSEC support
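delv (part of BIND 9, mentioned elsewhere in the thread) also tells you why validation fails, which dig alone doesn't. A rough example:

$ delv slack.com A +vtrace   # traces validation and reports the broken chain of trust / missing DNSKEY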