dextercd · 2 months ago
You need external monitoring of certificate validity. Your ACME client might not be sending failure notifications properly (like happened to Bazel here). The client could also think everything is OK because it acquired a new cert, meanwhile the certificate isn't installed properly (e.g., not reloading a service so it keeps using the old cert).

I have a simple Python script that runs every day and checks the certificates of multiple sites.

One time this script signaled that a cert was close to expiring even though I saw a newer cert in my browser. It turned out that I had accidentally launched another reverse proxy instance which was stuck on the old cert. Requests were randomly passed to either instance. The script helped me correct this mistake before it caused issues.
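For anyone who wants the same safety net, here is a minimal sketch of such a daily check in Python, using only the standard library (the hosts and the 14-day threshold are placeholders to adapt):

```python
import socket
import ssl
from datetime import datetime, timezone

def fetch_not_after(host, port=443, timeout=10):
    """Open a TLS connection and return the served certificate's expiry as a UTC datetime."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return parse_not_after(cert["notAfter"])

def parse_not_after(not_after):
    """Parse the 'notAfter' string from ssl.getpeercert(), e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return expires.replace(tzinfo=timezone.utc)

def days_remaining(expires_at, now=None):
    """Whole days until the certificate expires (negative if already expired)."""
    now = now or datetime.now(timezone.utc)
    return (expires_at - now).days

def check_sites(hosts, warn_days=14):
    """Print a warning for every host whose served cert expires within warn_days."""
    for host in hosts:
        days = days_remaining(fetch_not_after(host))
        if days < warn_days:
            print(f"WARNING: {host} cert expires in {days} days")

# e.g. from a daily cron job: check_sites(["example.com", "example.org"])
```

Because this inspects the certificate the server actually serves, it catches exactly the "renewed but not reloaded" failure mode described above.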

compumike · a month ago
100%, I've run into this too. I wrote some minimal scripts in Bash, Python, Ruby, Node.js (JavaScript), Go, and PowerShell to send a request and alert if the expiration is less than 14 days from now: https://heyoncall.com/blog/barebone-scripts-to-check-ssl-cer... because anyone who's operating a TLS-secured website (which is... basically anyone with a website) should have at least that level of automated sanity check. We're talking about ~10 lines of Python!
firesteelrain · 2 months ago
There is a Prometheus exporter called ssl_exporter that gives Grafana the ability to display a dashboard of all of your certs and their expirations. But the trick is that you need to know where all your certs are located. We were using Venafi to do auto-discovery, but a simple script to basically nmap your network provides the same functionality.
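For reference, ssl_exporter is scraped the same way as the blackbox exporter, via a /probe endpoint. A minimal sketch of the Prometheus side (the exporter address, module name, and target below are placeholders):

```yaml
scrape_configs:
  - job_name: "ssl"
    metrics_path: /probe
    params:
      module: ["https"]              # ssl_exporter module to use
    static_configs:
      - targets: ["example.com:443"] # endpoints whose certs you care about
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: "127.0.0.1:9219"   # where ssl_exporter itself runs
```

Grafana can then graph `ssl_cert_not_after - time()` per instance, and an alert rule on the same expression covers the dashboard's blind spot of nobody looking at it.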
stackskipton · a month ago
Blackbox exporter will do the same thing while also testing HTTP and other protocols.
machinationu · a month ago
Relevant certificates could be located by scanning the certificate transparency logs.
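As a sketch of that idea: crt.sh exposes CT log search results as JSON (the URL pattern and the `not_after`/`common_name` field names below are assumptions about its current, undocumented output format), and a few lines of Python can flag names whose newest logged cert is close to expiry:

```python
from datetime import datetime, timezone

# Fetch entries with e.g.:
#   urllib.request.urlopen("https://crt.sh/?q=%25.example.com&output=json")
def expiring_names(crtsh_entries, warn_days=30, now=None):
    """Given decoded crt.sh JSON entries, return names whose newest cert expires within warn_days."""
    now = now or datetime.now(timezone.utc)
    newest = {}
    for entry in crtsh_entries:
        expires = datetime.fromisoformat(entry["not_after"]).replace(tzinfo=timezone.utc)
        name = entry["common_name"]
        # Keep only the latest issuance per name, so renewed certs don't false-alarm.
        if name not in newest or expires > newest[name]:
            newest[name] = expires
    return sorted(n for n, exp in newest.items() if (exp - now).days < warn_days)
```

CT only tells you what was issued, not where it is deployed, so this is best paired with a live endpoint check.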
KronisLV · a month ago
> You need external monitoring of certificate validity.

Plug for Uptime Kuma, they support notifications ahead of expiry: https://github.com/louislam/uptime-kuma

Kind of cool to have an uptime monitoring tool that also has an option like that; two birds, one stone and all that. Not affiliated with them, FOSS project.

weddpros · a month ago
The scalable way (up to thousands of certificates) is https://sslboard.com. Give it one apex domain, it will find all your in-use certificates, then set alerts (email or webhook). Fully external monitoring and inventory.
jcgl · a month ago
Looks like it relies on certificate transparency logs. That means it can't monitor endpoints using wildcard certs. The best it could do would be to alert when a wildcard cert is expiring without a renewed cert having been issued.


jsiepkes · a month ago
I wonder what the point of this blog is. It's kinda easy to rip on certificates without giving at least one possible way of fixing this, even if it's an unrealistic one.

Sure, the low-level nitty-gritty of managing keys and certificates for TLS is hard if you don't have the expertise. You don't know about the hundreds of ways you can get bitten. But all the pieces for a better solution are there. Someone just needs to fold them into a neater, higher-level solution. But apparently, by the time someone has gained the expertise to manage this complexity, they also lose interest in making a simple solution (I know I have).

> You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.

Of course you can, if you really want to. You could get different certificates with different expiry times for your reverse (ingress) proxies.

A more straightforward solution is to have monitoring which retrieves the certificate from your HTTPS endpoints and alerts when the expiry time is sooner than it ever should be (i.e. when it should already have been renewed). For example by using Prometheus and ssl_exporter [1].

> and the renewal failures didn’t send notifications for whatever reason.

That's why you need dead man's switch [2] style monitoring in your alerting. That's not specific to TLS, BTW. Heck, even your entire Prometheus infra can go down. A service like healthchecks.io [3] can help with "monitoring the monitors".
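A dead man's switch sketch with healthchecks.io, as a crontab fragment (the ping UUID is a placeholder): the ping URL is only hit when the renewal actually succeeded, and the service alerts when the expected ping fails to arrive on schedule.

```shell
# /etc/cron.d/cert-renew (sketch): ping healthchecks.io only on success.
15 3 * * * root certbot renew --quiet && curl -fsS -m 10 --retry 3 https://hc-ping.com/00000000-0000-0000-0000-000000000000 >/dev/null
```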

[1] https://github.com/ribbybibby/ssl_exporter [2] https://en.wikipedia.org/wiki/Dead_man%27s_switch [3] https://healthchecks.io/

dvratil · 2 months ago
Happened on the first day of my first on-call rotation - a cert for one of the key services expired. Autorenew failed, because one of the subdomains on the cert no longer resolved.

The main lesson we took from this was: you absolutely need monitoring for cert expiration, with an alert when (valid_to - now) becomes less than the typical refresh window.

It's easy to forget this, especially when it's not strictly part of your app, but essential nonetheless.

aljgz · a month ago
No criticism of SSL-Certs in particular.

Essentially it's the flip side of any critical but low-maintenance part of your system: it's so reliable that you can forget to have external monitors, it's reliable enough that it can work for years without any manual labor, and it's so critical that it can break everything.

Competent infra teams are really good at going over these. But once in a while one of them slips through. It's not a failure of the reliable but critical subsystem, it's a failure mode of humans.

One of the main ways from "How Complex Systems Fail".

tialaramex · a month ago
The monitoring is the wrong way up, which is the case almost everywhere I've ever worked.

You want an upside down pyramid, in which every checked subsystem contributes an OK or some failure, and failure of these checks is the most serious failure, so the output from the bottom of your pyramid is in theory a single green OK. In practice, systems have always failed or are operating in some degraded state.

In this design the alternatives are: 1. Monitor says the Geese are Transmogrified correctly or 2. Monitoring detected a Goose Transmogrifier problem, or 3. Goose Transmogrifier Monitor failed. The absence of any overall result is a sign that the bottom of the pyramid failed, there is a major disaster, we need to urgently get monitoring working.

What I tend to see is instead a pyramid where the alternatives 1 and 2 work but 3 is silent, and in a summarisation layer, that can fail silently too, and in subsequent layers the same. In this system you always have an unknown amount of silently failed systems. You are flying blind.

xorcist · a month ago
Closely related to the ever more popular "We don't need monitoring, we have metrics."
philippta · a month ago
When I connect to my server over SSH, I don't have to rotate anything, yet my connection is always secure.

I manually approve the authenticity of the server on the first connection.

From then on, the only time I'd be prompted again would be if either the server key changed or there's a risk of MITM.

Why can't we have this for the web?

jsiepkes · a month ago
> Why can't we have this for the web?

How do you propose to scale trust on first use? SSH basically says that trusting a key is "out of scope" for them and makes it your problem. As in: you can put it on a piece of paper, tell it over the phone, whatever, but SSH isn't going to solve it for you. How is some user landing on an HTTPS site going to determine the key used is actually trustworthy?

There have actually been attempts at solving this with something like DANE [1]. For a brief period Chrome had DANE support, but it was removed due to being too complicated and sitting in (security-)critical components. Besides, since DNSSEC has some cracks in it (your local resolver probably doesn't check it), you can have a discussion about how secure DANE really is.

[1] https://en.wikipedia.org/wiki/DNS-based_Authentication_of_Na...

DANmode · a month ago
So DNS-adjacent protocols are supposed to be handling this TOFU directory,

but industry behemoths are too busy pushing other self-serving standards to execute together on this?

Am I…close?

ILearnAsIGo · a month ago
Would the issue not be that you would need to trust that first connection?
trvz · a month ago
Cookie banners aren’t annoying enough for you?
philippta · a month ago
For the handful of regularly visited websites, I wouldn't mind.
jeroenhd · a month ago
SSH has its own certificate authority system to validate users and servers. This is because trust-on-first-use is not scalable unless you just ignore the risk (at which point you may as well not encrypt at all), so host keys get signed instead.

There is quite literally nothing that prevents you from using a self-signed server certificate. Your browser will even ask you to trust and store the certificate, just like your SSH client does on the screen that shows the fingerprint.
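For the curious, generating a self-signed cert for that experiment is one command (the paths and CN are placeholders; `-addext` needs OpenSSL 1.1.1+, and the subjectAltName matters because browsers ignore the CN):

```shell
# Generate a key and a one-year self-signed certificate for localhost.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout /tmp/selfsigned-key.pem -out /tmp/selfsigned-cert.pem \
  -days 365 -subj "/CN=localhost" \
  -addext "subjectAltName=DNS:localhost"
```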

Good luck getting everyone else to trust your fingerprint, though.

donatj · a month ago
Once a year for a number of years we would have a small total outage as our Ops team forgot to renew our wildcard certificate. Like clockwork.

It's been a couple of years now so they must have set better reminders for themselves.

I have tried several times to convince them of the joys of ACME, but they're insistent that a Let's Encrypt certificate "looks unprofessional". More professional than a down application in my opinion at least. It's not the early 2000s anymore, no one's looking at your certificate.

dwood_dev · a month ago
I use ACME with Google Public CA for this reason. No one bats an eye at GPCA. Also, their limits are dramatically higher than LE.

Good news for your manual-renewal friends: maximum lifetimes drop to 197 days in February, roughly halving each year after that until they reach 47 days. So they will soon adopt automation, or suffer endless renewal pain.

loloquwowndueo · 2 months ago
There are plenty of other technologies whose failure mode is a total outage, it’s not exclusive to a failed certificate renewal.

A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.

A broken alerting system is also mentioned: "didn't alert for whatever reason".

If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.

Sounds like a case of “nothing in this automated process can fail, so we only need this one trivial monitor which also can’t fail so meh” attitude.

yearolinuxdsktp · 2 months ago
Additionally, warnings can be built into the clients themselves. If you connect to a host whose cert expires in less than 2 weeks, print a warning in your client. That would be a further incentive not to let certs go unrenewed.
tetha · a month ago
> If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.

This is also why you want a mix of alerts from the service user's point of view, as well as internal troubleshooting alerts. The user's point-of-view alerts usually provide more value and can be surprisingly simple at times.

"Remaining validity of the certificates offered by the service" is a classical check from the users point of view. It may not tell you why this is going wrong, but it tells you something is going wrong. This captures a multitude of different possible errors - certs not reloading, the wrong certs being loaded, certs not being issued, DNS going to the wrong instance, new, shorter cert lifecycles, outages at the CA, and so on.

And then you can add further checks into the machinery to speed up the process of finding out why: Checks if the cert creation jobs run properly, checks if the certs on disk / in secret store are loaded or not, ...

Good alerting solutions might also allow relationships between these alerts to simplify troubleshooting: don't alert for the cert expiry if there is a failed cert-renew cron job; alert for that instead.
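In Prometheus Alertmanager terms, that last relationship is an inhibition rule; a sketch with assumed alert names:

```yaml
# Suppress the generic expiry alert while the more specific
# renew-job alert is firing for the same host.
inhibit_rules:
  - source_matchers:
      - alertname = CertRenewJobFailed
    target_matchers:
      - alertname = CertExpiringSoon
    equal: ["host"]
```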

SoftTalker · a month ago
Wait until they start expiring 47 days from issue (coming soon). Though maybe this will actually help, because it will happen often enough that you (a) won't completely forget how to deal with it and (b) have more motivation to be proactive.