I'm one of the Tailscale engineers who initially built node state encryption (@awly on GitHub), and I made the call to turn it off by default in 1.92.5.
TPMs are a great tool for organizations that have good control of their devices. But the very heterogeneous fleet of devices that Tailscale users have is very difficult to support out of the box. So for now we leave it to security-conscious users and admins to enable, while avoiding unexpected breakage for the broader user base.
We should've provided more of this context in the changelog, apologies!
Those issues are a surprising read. I would expect issues with TPMs on old or niche devices, but not on Dell XPS laptops or a variety of VMs. But I guess I'm not entirely sure how my VMs handle TPM state, or if they even can.
I'm running nearly all of my personal Tailscale instances in containers and VMs. Looking now at the dashboard, it appears this feature really only encrypted things on my primary Linux and Windows PC, my iPhone, and my main Linux server's host. None of the VMs and containers I use were able to take advantage of this, nor was my laptop, although my laptop might be too old.
Stuff breaks all the time, you just need a bigger sample size.
Overseeing IT admins for corp fleets is part of my gig, and from my experience, we get malfunctioning TPMs on anything consumer - Lenovo, Dell, HP, whatever. I think the incidence is some fraction of a percent, but get a few thousand devices and the chance of eventually experiencing it is high, very high. I can't imagine a vTPM being perfect either, since there isn't a hypervisor out there someone hasn't screwed up a VM on.
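To put rough numbers on the "bigger sample size" point, here is a minimal back-of-the-envelope sketch. The 0.5% per-device fault rate is an assumption picked only to stand in for "some fraction of a percent", and the function name is made up for illustration; it also assumes devices fail independently.

```python
def p_any_failure(per_device_rate: float, fleet_size: int) -> float:
    """Chance that at least one device in the fleet hits a TPM fault,
    assuming independent devices: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - per_device_rate) ** fleet_size


# Assumed rate: 0.5% of devices ever experience a TPM malfunction.
ASSUMED_RATE = 0.005

for fleet_size in (10, 100, 1000, 3000):
    chance = p_any_failure(ASSUMED_RATE, fleet_size)
    print(f"{fleet_size:>5} devices: {chance:.1%} chance of at least one fault")
```

Even with a small per-device rate, the probability of seeing at least one fault approaches certainty once the fleet reaches the thousands.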
Just had a system board replaced on a device in my org, Dell laptop.
As part of setting up a device in our org, we enroll it in Intune (Microsoft's cloud-based device management tool, aka UEM / RMM / MDM / etc.). To enroll a device you take a "hardware hash", which is basically TPM attestation plus some additional spices, and upload it to their admin portal.
After the system board replacement we got errors saying the device was in another org's tenant. This is not unusual (you open a ticket with MS and they typically fix it for you), and Dell isn't really to blame per se. Why e-waste equipment you can refurbish?
Just adding 5c to the anecdata out there re: TPM as an imperfect solution.
My eyes have opened up to the pitfalls of TPM recently while upgrading CPUs and BIOS/UEFI versions on various hardware in my home.
VMs typically do not use TPMs, so it is not surprising that the feature was not being used there. One common exception is VMware, which can provide the host's TPM to the VM for a better Windows 11 experience. One caveat is this doesn't work on most Ryzen systems because they implement a CPU-based fTPM that VMware does not accept.
It is in fact surprising that TPMs can be wiped so easily. It makes them almost useless compared to dedicated solutions like physical FIDO keys or smartcards, and does not bode well for hardware-backed Passkeys that would also be inherently reliant on TPM storage.
I had a Ryzen 3900X on a Gigabyte motherboard, and the fTPM was just totally unreliable, despite that being a pretty mainstream combination. Not fully sure which was to blame there.
At least it was fixed with the 5900X (on a _different_ Gigabyte motherboard, but from the same lineup) that replaced it.
I'm not sure what makes any of this "surprising". Each ticket reads like "we replaced the computer that tailscale was on, it doesn't work anymore" pikachu face.
Yeah, that was a feature and the exact reason why we use TPMs. I guess it should have been better advertised.
VMs don't have TPMs, since those are hardware devices, although you can run a software TPM (potentially backed by the host TPM) and pass it to them, which you might want to do for this use case.
It was designed mostly for mechanisms where in the event of certain changes (BIOS upgrades, certain other firmware changes, some OS changes) there is a fallback mechanism to unlock the system and reset the key. This is why Windows BitLocker is so insistent about you saving your key somewhere else - if you do a BIOS update and it can’t decrypt, it’ll require your copy of the key and then reset the TPM-encrypted copy with the new BIOS accounted for.
A TPM’s primary function works by hashing things during the boot process, and then telling the TPM to only allow a certain operation if hashes X & Z don’t change. Depending on how the OS/software uses it, a whole host of things that go into that hash can change: BIOS updates being a common one. A hostile BIOS update can compromise the boot process, so some systems will not permit automatic decryption of the boot drive (or similar things) until the user can confirm that they have the key.
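The measure-then-unlock behaviour described above can be sketched as a toy model. This is a simplified illustration, not a real TPM interface: `ToyTPM`, `measure_boot` and `unlock` are hypothetical names, and the recovery key is modelled as a second copy of the same secret kept outside the TPM. The `extend` step does follow the usual PCR rule, where the new value is the hash of the old value concatenated with the digest of the measured component.

```python
import hashlib
import hmac


def extend(pcr: bytes, component: bytes) -> bytes:
    """PCR extend: new = SHA256(old || SHA256(component))."""
    return hashlib.sha256(pcr + hashlib.sha256(component).digest()).digest()


def measure_boot(components) -> bytes:
    """Fold every boot component into one register, starting from zeros."""
    pcr = bytes(32)
    for component in components:
        pcr = extend(pcr, component)
    return pcr


class ToyTPM:
    """Hypothetical stand-in: releases a secret only if the measurements match."""

    def __init__(self):
        self._secret = None
        self._policy_pcr = None

    def seal(self, secret: bytes, pcr: bytes) -> None:
        self._secret, self._policy_pcr = secret, pcr

    def unseal(self, pcr: bytes) -> bytes:
        if self._policy_pcr is None or not hmac.compare_digest(pcr, self._policy_pcr):
            raise PermissionError("boot measurements changed; refusing to unseal")
        return self._secret


def unlock(tpm: ToyTPM, pcr: bytes, recovery_key: bytes = None) -> bytes:
    """Try the TPM first; fall back to a recovery copy and re-seal it."""
    try:
        return tpm.unseal(pcr)
    except PermissionError:
        if recovery_key is None:
            raise  # no fallback available: the user is locked out
        tpm.seal(recovery_key, pcr)  # bind the key to the new measurements
        return recovery_key


disk_key = b"k" * 32
old_boot = measure_boot([b"bios v1", b"bootloader", b"kernel"])
new_boot = measure_boot([b"bios v2", b"bootloader", b"kernel"])

tpm = ToyTPM()
tpm.seal(disk_key, old_boot)
print(unlock(tpm, old_boot) == disk_key)             # normal boot
print(unlock(tpm, new_boot, disk_key) == disk_key)   # after a BIOS update, with recovery key
print(unlock(tpm, new_boot) == disk_key)             # later boots work unattended again
```

Without the `recovery_key` argument, the second call raises instead, which is the lock-out case several commenters ran into.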
Thank you for your openness here - and yes, it would be nice to see this kind of reasoning in the changelog, even if it's tucked a little out of the way! Those of us who care will read it.
Also very welcome is to separate it into a small blogpost providing details, if the situation warrants a longer, more detailed format.
Your suspicion is correct. I have an AMD AM5 motherboard, and every time I update its BIOS it warns me that the fTPM will be reset. I know it does so because afterwards BitLocker prompts me to enter the recovery key, since it can't unlock the drive anymore.
> There's also tailscaled-on-macOS, but it won't have a TPM or Keychain bindings anyway.
Do you mean that on macOS, tailscaled does not and has never leveraged equivalent hardware-attestation functionality from the SEP? (Assuming such functionality is available)
The third one is just the open-source tailscaled binary that you have to compile yourself, and it doesn't talk to the Keychain. It stores a plaintext file on disk like the Linux variant without state encryption. Unlike the GUI variants, this one is not a Swift program that can easily talk to the Keychain API.
A BIOS update to my PC reset the TPM only this week. I did get a warning that Bitlocker keys would be wiped as a result before acting at least.
(I believe this was because it was fixing an AMD TPM exploit - presumably updating the TPM code wipes the TPM storage either deliberately or as an inevitable side effect.)
TPMs are basically storing the hashes of various pieces of software, then deterministically generating a key from those. Since the BIOS software changed, that hash changed, and the key it generates is completely new.
If someone had messed with your BIOS maliciously, that's desirable. Unfortunately you messing with your BIOS intentionally also makes the original key pretty much unrecoverable.
Coincidentally, this feature was unknown to me until I performed an SSD migration from one server to another and Tailscale failed to connect because ("of course!" in hindsight) it couldn't decrypt its stored state.
So not a TPM failure but certainly a gotcha! moment; luckily I had a fallback method to connect to the machine, otherwise in the particular situation I was in I would have been very sorry.
The "whoever needs this will enable it" + support angle makes total sense.
This never should have been on by default. The end user (read: administrator) needs to know they want to use the TPM.
This is a huge foot gun for many devices.
The accompanying changelog note hints at why:
> Failure to load hardware attestation keys no longer prevents the client from starting. This could happen when the TPM device is reset or replaced.
This is unfortunate as for many, many deployments, you absolutely want this on. But because it's a time bomb for certain device/OS combinations that Tailscale can't control or predict, and folks install Tailscale on pretty much everything, then the incidence of borked installs can only rise.
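The behaviour the changelog note describes, where a key-load failure is logged but does not block startup, can be sketched roughly as follows. This is not Tailscale's code (the real client is written in Go); `KeyLoadError`, `start_client` and the loader callables are hypothetical names used only to illustrate the pattern.

```python
import logging

log = logging.getLogger("toy-client")


class KeyLoadError(Exception):
    """Hypothetical error: the hardware-backed key could not be loaded."""


def start_client(load_key):
    """Start even when the optional hardware key is unavailable.

    `load_key` is a hypothetical callable that returns key material
    or raises KeyLoadError (e.g. after a TPM reset or replacement).
    """
    try:
        key = load_key()
    except KeyLoadError as err:
        # Degrade gracefully instead of hanging or refusing to start.
        log.warning("attestation key unavailable, continuing without it: %s", err)
        key = None
    return {"running": True, "attestation_enabled": key is not None}


def broken_tpm():
    raise KeyLoadError("TPM was reset or replaced")


print(start_client(broken_tpm))
print(start_client(lambda: b"key-handle"))
```

The trade-off is the one discussed in this thread: failing open keeps machines reachable after a BIOS update or board swap, at the cost of no longer treating a changed TPM as a hard stop.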
As someone with a passing interest in using TPM for crypto things, every time I think deeply about implementation details like this, I come back to needing some kind of recovery password/backup key setup that entirely negates the point of the TPM in the first place. They seem really neat, but I struggle to see the benefit they have for doing crypto when a tiny slip-up means your users' keys are poof, gone. And the tiny slip-up may not even be with your software, but some edge case in the OS/TPM stack.
The TPM was never designed to be the only holder of a key that cannot be reset. The idea was that it saves you from typing in a password or resetting an attestation signature in a database for 99% of boots, but if certain things in the boot process change (as determined by the firmware, the CPU, the OS, and the application using the TPM) it’s designed to lock you out so those things cannot change without anyone noticing.
For that purpose they’re pretty good, though there are advantages to a more signature-oriented boot security option like Apple’s Secure Enclave. But that only works so well because Apple simply doesn’t permit altering parts of the macOS boot process. For Windows/Linux, you have a variety of hardware, firmware, and OS vendors all in the mix and agreeing on escrow of keys for all of them is hard.
The primary argument in favor of TPMs is the desire to guard against tampering with the boot system; as a secondary effect, they can be one of the solutions to reduce the need for users to type in passwords.
You can still use crypto without a TPM, including with full disk encryption, and for LUKS specifically you can use multiple passwords and mechanisms to unlock the system. Different solutions will give different benefits and drawbacks. A friend and I wrote a remote password provider for Debian called Mandos, which uses machines on the local network to allow unattended boots. It does not address the issue of tampering with the BIOS/boot loader, but for the primary purpose of protecting against someone stealing the server/disks, it lets us use encrypted disks without the drawback of typing in passwords, and the backup server, itself with encrypted disks, handles the risk of needing recovery passwords. At most one needs to have an additional backup key installed for the backup server.
TPM keys are great for things like SSH keys or Passkeys, which work surprisingly well even in Windows.
The private key is safe from any exfiltration, and usage only requires a short PIN instead of a long passphrase. The TPM ensures you're physically typing that PIN at the machine, not into a remote desktop window or other redirection that could be hacked.
Obviously, this is problematic/annoying for scripts and things that can't share the SSH session, because you need to enter the PIN with every authentication. Also, for encryption, you want to use something where you can back up the private key before stashing it in the TPM. Windows allows you to do this with certificates that are exported for backup prior to encrypting the private key with an unexportable TPM key in Hello.
But e.g. Windows uses a TPM by default now? If TPMs were such a major issue then there would be millions of Windows users with TPM problems, no?
I have no inside info, but this strikes me more as a bit of a "sledgehammer to crack a nut". Tailscale turning off important functionality due to a small-but-vocal number of TPM edge cases?
It is also very unfortunate they did not manage to find any middle ground between the hard-binary all-on or all-off.
Windows uses TPM for Bitlocker. A very common scenario where TPMs get reset is BIOS updates (when a TPM is implemented in firmware).
AFAIK, Windows cheats here because it also manages BIOS updates. When an update happens, it takes extra steps to preserve the Bitlocker encryption key in plaintext, and re-seals it to the TPM after the update completes.
Apart from Windows, there are many setups that fail in fun ways: Kubernetes pods that migrate from one VM with a TPM to another one, hypervisors that mount a virtual TPM to VMs, containers or VM images that do Tailscale registration on one machine and then get replicated to others, etc.
Tailscale already did some attempts at cleverness when deciding whether to enable features using a TPM (e.g. probing for TPM health/version on startup, disabling node state encryption on Kubernetes pods), but there was still a long tail of edge cases.
Windows seems to do two big things with a TPM: BitLocker encryption and some Microsoft account stuff.
If the BitLocker stuff goes wrong, big problem, hopefully you printed and kept your recovery key.
If the Microsoft account stuff goes wrong, mostly the Microsoft Store and Microsoft Store apps break in subtle ways... but that's also how that ecosystem normally works, so how are you supposed to know it's the TPM problem?
Windows automatically reinitializes the TPM if it's reset and boots normally; most end users will not notice any issues unless they have BitLocker or biometrics configured.
Sure, but most users probably don't actually want this level of defense.
For the same reason that most folks don't use bank vault doors on their house.
E.g., even reasonably technical people hit this footgun in lots of edge cases... like updating their BIOS, changing the host of a VM running the tool, or having a k8s pod get scheduled on a different node.
From the changelog, it seems like this may have been due to issues caused by the on-by-default setting, although I don’t work for Tailscale and am speculating here with no inside info.
I wonder, would Tailscale be willing to confirm that they plan to fix whatever the issues are and re-enable this default within a short-ish timeframe? I currently have plenty of trust in the good intentions of the people running Tailscale, but with geopolitics as it currently is, I’d love to have a concrete reason even beyond that positive track record to believe that this change isn’t attempting to satisfy ease-of-surveillance concerns expressed by government agencies in whichever country.
Seems like the issues in question are not within Tailscale's span of control (basically, the devices themselves with TPMs are too unreliable in the general population, so the feature is more appropriate for controlled environments that opt in to its usage).
The TPM devices themselves are reliable, but using them comes with a lot of caveats. 99% of users have never heard of the TPM, and 99% of the ones who have won’t have realized that upgrading the BIOS clears¹ the TPM. Add in the fact that Tailscale users didn’t _know_ that Tailscale was using the TPM and you have a recipe for users breaking things without realizing it. In an enterprise environment where you can afford to hire people specifically to care about these things, using TPMs for additional security is a great idea.
¹: and very few of those can explain that it doesn’t actually clear the TPM. Instead it causes a different state to be measured by the TPM, and in that new state the TPM cannot unlock the keys that were previously stored in it. This is a great way to protect the computer against someone who can pull the hard drive out of the computer and try to read the data off of it, or who can substitute a different BIOS chip to get around a BIOS password, but not so great for ordinary users who want the occasional upgrade to go smoothly.
Thank god, I was running Tailscale on a NixOS machine on some really old hardware and I couldn’t figure out why it kept crashing. It was because of this, but it just failed silently.
My Tailscale was broken for the past month and I only just fixed it yesterday, and today this patch is released that would have made it a non-issue.
Updating my BIOS caused the issue. The main problem was that Tailscale's behaviour was very poor in this case. It simply got stuck "Starting" and never provided any error information.
Oh, I got bitten by that! I have my work Linux installation on a USB stick so I can boot it on either my desktop or laptop, and one day Tailscale stopped working. I thought that might be a rare situation, but it looks like TPM-based encryption failed for other reasons too.
Another comment in this thread guessed right - this feature is too support intensive. Our original thinking was that a TPM being reset or replaced is always a sign of tampering and should result in the client refusing to start or connect. But it turns out there are many situations where TPMs are not reliable for non-malicious reasons. Some examples:
* https://github.com/tailscale/tailscale/issues/17654
* https://github.com/tailscale/tailscale/issues/18288
* https://github.com/tailscale/tailscale/issues/18302
* plus a number of support tickets
Question:
You link to https://github.com/tailscale/tailscale/issues/17654 where a user states[1]:
"Previous workaround from some comments (TS_ENCRYPT_STATE=false, FLAGS="--encrypt-state=false") didn't help on this problematic Debian 13 host"
And the same user states "I confirm this issue is NOT found anymore with tailscale version 1.92.1".
Could you provide a little extra context to clarify those types of comments, which seem to suggest it wasn't state encryption after all?
[1] https://github.com/tailscale/tailscale/issues/17654#issuecom...
Hardware key attestation is a yet-unfinished feature that we're building. The idea is to generate a signing key inside of the TPM and use it to send signatures to our control plane and other nodes, proving that it's the same node still. (The difference from node state encryption is that an attacker can still steal the node credentials from memory while they are decrypted at runtime).
We started by always generating hardware attestation keys on first start or loading them from the TPM if they were already generated (which seemed safe enough to do by default). That loading part was causing startup failures in some cases.
To be honest, I didn't get to the bottom of all the reports in that github issue, but this is likely why for some users setting `--encrypt-state=false` didn't help.
And when you do it should be rare and lead to a password reset.
Isn’t that exactly the desired behavior to defend against physical attacks?
I'm surprised this was "default on" at all.
https://github.com/tailscale/tailscale/pull/18336
Seems like it caused tons of problems due to the variability of TPM quality, among other things.