In most cases, encrypting sensitive information like e-mail addresses with a memory-resident key (e.g. injected using tools like Vault) in the application layer is a better strategy, at least if you need asynchronous access to that information (e.g. to send out weekly update e-mails). Most of the data leaks in the past were caused by compromised or misconfigured databases, not by compromised application server code.
Also, within the EU I need to be able to proactively reach my users (e.g. to notify them about a data loss), so only storing hashes of e-mail addresses and hoping users will log in so that I can send them an e-mail won't work.
This kind of encryption-at-rest scheme becomes an absolute necessity when cryptographic secrets have to be stored, such as 2FA TOTP secret keys or recovery codes.
Encrypting the email addresses and any Personally Identifiable Information about your users may also be a good practice, to limit which eyes can actually see the plaintext data (database provider, former developers without rotated credentials, an old backup left over...).
One issue with this, though, could be the inability to use the encrypted field for queries (e.g. select * from users where email = 'foo@bar.com'), but OP's solution of hashing can help here: store the email encrypted, its hash in clear text, and do a query on the hash.
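A minimal sketch of that "encrypted column plus hash column" layout, under some loud assumptions: the table and column names are made up, the ciphertext is treated as an opaque blob produced elsewhere by whatever encryption you use, and a keyed HMAC is swapped in for the plain hash mentioned above so the lookup column is harder to brute-force without the key. Lowercasing before hashing is also an assumption.

```python
import hashlib
import hmac
import sqlite3

# Hypothetical key for illustration; a real one would be injected at runtime.
INDEX_KEY = b"example-index-key-not-for-production"


def email_index(email: str) -> str:
    """Keyed hash of the address, used only for equality lookups."""
    normalized = email.strip().lower()  # assumption: match case-insensitively
    return hmac.new(INDEX_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def create_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE users ("
        " id INTEGER PRIMARY KEY,"
        " email_hash TEXT UNIQUE NOT NULL,"
        " email_ciphertext BLOB NOT NULL)"
    )


def add_user(conn: sqlite3.Connection, email: str, ciphertext: bytes) -> int:
    """Store the opaque ciphertext next to the lookup hash."""
    cur = conn.execute(
        "INSERT INTO users (email_hash, email_ciphertext) VALUES (?, ?)",
        (email_index(email), ciphertext),
    )
    return cur.lastrowid


def find_user(conn: sqlite3.Connection, email: str):
    """Equivalent of 'where email = ?' without a plaintext email column."""
    row = conn.execute(
        "SELECT id FROM users WHERE email_hash = ?", (email_index(email),)
    ).fetchone()
    return row[0] if row else None
```

The query never touches the ciphertext, so the decryption key is only needed by code paths that actually send mail.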
Sure, if you encrypt with an application-level key, it becomes harder for any adversary to use your data, as they will need not only to get access to your data but also to obtain the encryption key to do anything with it.
Encrypting data like this is easy and can drastically reduce your attack surface.
This would not work for any serious/useful service: e-mails are not only for marketing; there are many good reasons to send one, such as (user-requested) notifications, invoicing, ... and also screw-ups! If your service had a problem (security, broken data, invoicing again, long downtime, ...), you had better contact your users before they find out on Hacker News.
Even just a very thin encryption layer would probably do a decent job. Attackers are typically going at it from an infrastructure perspective: they make a hole, poke around for basic configuration info, locate the database, and siphon it out. They may or may not have enough time and knowledge to reverse-engineer a basic column-specific symmetric scheme.
The only drawback is that such a scheme must then be made available to any system that consumes the database, possibly from multiple languages.
Not any system that consumes the database, just any system that needs to send email. Unless you are using the encryption for other fields as well, I suppose.
Plaintext email could be stored client side in a cookie and may be submitted to the server when use of the email is required, and if it validates.
If the user logs in and the site is down, a backup system could email them about the issue: "This is the backup system; primary systems are down. Please contact support if you need more information." There is no need to email users who aren't currently using the system about downtime, or in fact to email users at all if they aren't using the system.
Further, if a "password recovery" flow is modified slightly, it can be repurposed for password-less logins by using strong tokens sent to the user's email, as they request them. A simplified 2FA flow can be established as well, where a token is texted to the user after verifying the email address. A second layer of security on top of texted tokens can be achieved using Google Authenticator.
To use such a system, the user will need to be OK with sending their email address each time they need email from the system AND be OK with having their phone handy to log in. Of course not every use case requires security, or can be used with this proposed security system.
But how do you contact users if they aren't on the site? What if you have a data breach and need to notify them, or need to remove their account because they are inactive and want to give them a heads-up?
If your account recovery works by sending an email... which then sets a plaintext email cookie, there's no actual auth, right?
To make this make sense, I think you are assuming but without explicitly stating the use of signed cookies? EDIT: "if it validates", I guess so.
The other bit which is not clear to me is, what is the key in the database to identify ownership of user information?
You need a linking record which looks like hash(email) -> uid (or user record or whatever) which does not seem any better than what is proposed in TFA.
OTOH if no information is stored against the user's email / uid / username then you probably don't need login or auth.
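On the signed-cookie assumption raised above, here is a rough sketch of what "if it validates" might mean in practice: the server appends an HMAC tag to the cookie value and rejects anything whose tag does not match. The key name and cookie layout are invented for illustration, and a real deployment would more likely rely on its web framework's signing support.

```python
import base64
import hashlib
import hmac

# Hypothetical server-side key; never sent to the client.
COOKIE_KEY = b"example-cookie-key-not-for-production"


def _tag(payload: str) -> str:
    return hmac.new(COOKIE_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_cookie(email: str) -> str:
    """Return 'payload.tag', where payload is the base64url-encoded address."""
    payload = base64.urlsafe_b64encode(email.encode("utf-8")).decode("ascii")
    return f"{payload}.{_tag(payload)}"


def verify_cookie(cookie: str):
    """Return the email if the tag matches, otherwise None."""
    payload, sep, tag = cookie.rpartition(".")
    if not sep:
        return None
    # Constant-time comparison on bytes, so odd input cannot raise.
    if not hmac.compare_digest(tag.encode("utf-8"), _tag(payload).encode("utf-8")):
        return None
    return base64.urlsafe_b64decode(payload).decode("utf-8")
```

Signing only proves the server issued the value; it does not hide the address from anyone who can read the cookie.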
This scheme struggles in the face of email address case folding.
At the protocol level, email addresses are case-folded on the RHS but case-sensitive on the LHS. So it’s crucial that LHS case is preserved by delivery systems. Unfortunately most users then treat them as folded on both. So you can successfully verify one variant, store the downcased hash, and it’ll subsequently match but delivery bounces. Or, hash the exact original input but have many baffled users unable to access their accounts. Neither is a good outcome.
This is not an edge behaviour either, I have tons of users that mix up their email capitalisation from day to day.
No it doesn't, convert the case when generating the hash – think of it as part of the hash function. But leave the case unchanged from whatever the user entered for any steps that involve sending an e-mail to the address.
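A small sketch of that split, assuming a plain SHA-256 for brevity (salting or keying is left out) and using hypothetical names: the folded form is used only to derive the lookup hash, and the address exactly as typed is what gets handed to the mailer.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Submission:
    lookup_hash: str  # derived from the folded address; used for matching
    deliver_to: str   # the address as the user typed it; used for sending


def handle_submitted_email(raw: str) -> Submission:
    address = raw.strip()
    folded = address.casefold()  # folding is treated as part of the hash function
    digest = hashlib.sha256(folded.encode("utf-8")).hexdigest()
    return Submission(lookup_hash=digest, deliver_to=address)
```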
I thought email addresses were case folded on the right hand side and site dependent on the left hand side.
The right hand side is more or less forced by the rules of DNS.
As for the left hand side, if I run the email for a site, can't I decide whether to deliver ABC@ and abc@ both to the same mailbox or to different mailboxes? And can't someone else make a different decision for their site?
If a site administrator does not have the prerogative to decide this, what rule prevents them? (And if there is such a rule, can you rely on it being enforced?)
Of course, but when processing an arbitrary email address, which will almost always be not on your site, you MUST treat the left hand side as case-sensitive (unless you have knowledge about that email domain).
Site-dependent means that when given an address you must treat the LHS as case-sensitive. To do otherwise means you've potentially broken the address and can no longer properly use it.
You could store the hash of the downcased address plus a capitalization mask which tells you which letters to capitalize.
This works from a technical perspective, as letters with ambiguous capitalization (Turkish i, etc) aren't allowed in emails. It's a very minor privacy compromise: if a user has a very rare pattern of capitalization then an attacker with access to the database could identify their account. Negligible compared to the current standard.
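The mask idea could be sketched like this, assuming ASCII-only addresses (the function names are made up). The mask is stored next to the hash of the downcased address; when the user later submits the address in any capitalisation, the mask restores the verified form.

```python
def case_mask(address: str) -> int:
    """Bitmask with bit i set when character i is an ASCII uppercase letter."""
    if not address.isascii():
        raise ValueError("sketch only handles ASCII addresses")
    mask = 0
    for i, ch in enumerate(address):
        if "A" <= ch <= "Z":
            mask |= 1 << i
    return mask


def apply_mask(address: str, mask: int) -> str:
    """Rebuild the verified capitalisation from any-case input plus the mask."""
    if not address.isascii():
        raise ValueError("sketch only handles ASCII addresses")
    lowered = address.lower()
    return "".join(
        ch.upper() if (mask >> i) & 1 and "a" <= ch <= "z" else ch
        for i, ch in enumerate(lowered)
    )
```

For example, the mask taken from `John.Smith@Example.com` turns both `john.smith@example.com` and `JOHN.SMITH@EXAMPLE.COM` back into the verified spelling.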
It’s a neat idea but unfortunately RFC 6531 opened up the local part to most of UTF-8, so internationalised capitalisation is in the mix now.
Ultimately I’ll never advise delivering to email addresses other than the precise octets of the one already verified, and this means the gold standard is always folding for match and uniqueness, but delivery precisely as verified.
How about this: store the verified email address, but encrypted using the hash of the case-folded input as part of the key. The intention being, you had to have the matching folded form in hand to obtain the verified canonical form. For extra jollies, only decrypt it on the client. (cryptography warning: I write this as the idea comes to me and without any analysis of emergent properties, vulnerabilities etc)
True. But since so many websites (e.g. all aviation companies I've encountered) case-smash the LHS of the email address and can get away with it since all other email software has had to adapt, this is a rather minor concern by now.
Says who? Email addresses are case insensitive. If email software treats emails as case sensitive then it is broken. People have to write email addresses on paper forms, in all caps.
Says RFC 5321 [1]: "The local-part of a mailbox MUST BE treated as case sensitive."
It _does_ recommend receivers treat it as case insensitive for maximum interoperability, so it is de facto insensitive, but something implementing it as case sensitive isn't broken.
You, and the paper forms, are incorrect. In fact, on such forms, you should use the proper case for your email address, otherwise you are entering an incorrect address, which may be fraud.
Some things that become difficult if you don't have a verified email address for your users:
- Most common: a user has a support request because they can't get into their account (e.g. you have sign-in-with-Facebook and they lost their account there, or got banned).
- Your authentication partner (again e.g. Facebook) disables your integration for some reason - someone reports your account as abusive (maybe maliciously) and it gets locked, and your attempts to work through Facebook customer support hit a brick wall. If you have email addresses you can at least get your users back into their accounts via a reset-password style flow.
- You have a data breach, and you need to tell your users what happened and what private data of theirs was leaked to an attacker.
- You get a legal threat - a DMCA takedown message for example - and need to pass it on to your users.
- You sell your service to another company and the lawyers involved in the transaction insist on emailing out a terms of service update.
Between 'yes' and 'no' we could still have airgapped or at least segregated systems, where an email address is known, but only to the part of the system responsible for communication.
In larger systems that could be a reasonable way to build things.
Keeping email addresses in the "auth" microservice which has tighter security - blocking security team code reviews, a smaller team who are allowed to modify it for example.
This is a clever idea but limited in applicability. It is probably fine for a low security web app or game, but could still leak personal information if the db got hacked.
The problem is that the salt has to be the same for each record and that emails present a limited search space.
Imagine I stole the database for blackmailable-fetish.com. All the emails are hashed with the same salt so I can brute-force the following restricted space:
[top 200 first names][top 1000 surnames][digits from 0 - 999]@[top 5 email providers]
That would probably get me 75% of the emails - let the extortion games begin!
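To put a number on that search space: 200 × 1000 × 1000 × 5 is one billion candidates, which is small for a fast hash. The sketch below assumes the site stores SHA-256 of a shared salt followed by the address; that scheme, like the function names, is an assumption for illustration rather than what the article necessarily uses.

```python
import hashlib
from itertools import product


def search_space_size(first_names: int, surnames: int, suffixes: int, providers: int) -> int:
    """Number of candidate addresses in the pattern described above."""
    return first_names * surnames * suffixes * providers


def candidates(first_names, surnames, suffixes, providers):
    """Yield every address of the form [first][last][number]@[provider]."""
    for first, last, num, provider in product(first_names, surnames, suffixes, providers):
        yield f"{first}{last}{num}@{provider}"


def hash_email(email: str, salt: bytes) -> str:
    # Assumed storage scheme: SHA-256 over shared salt + address.
    return hashlib.sha256(salt + email.encode("utf-8")).hexdigest()


def recover(stolen_hashes: set, salt: bytes, guesses) -> dict:
    """Map each stolen hash that matches a guess back to the guessed address."""
    found = {}
    for guess in guesses:
        digest = hash_email(guess, salt)
        if digest in stolen_hashes:
            found[digest] = guess
    return found


print(search_space_size(200, 1000, 1000, 5))  # 1000000000
```

Because the salt is shared, each guess is hashed once and checked against every stolen record at the same time.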
Because it gives the false appearance of security. With this scheme, you always need to act as if the e-mail addresses are plaintext anyway. It should not be used.
True, but I maintain that if you are worried that hackers may steal your email database then a much better approach is to encrypt the emails with an external key.
You could just do that anyway.
I've seen spam email trying to extort people by threatening to release indecent images of them (which the senders don't have; supposedly, they've been captured from the victim's selfie cam).
In this case, you don't have to be accurate; you're just trying to call someone's bluff, on the off chance they've actually done said thing (and in addition believe you can prove it AND that payment will silence them).
Depending on the size of the user database it might be cheap enough to try all random-salts+hashed emails (if it fits in RAM it's probably cheap enough).
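One reading of this suggestion, sketched under assumptions (record layout and names are hypothetical, and SHA-256 stands in for whatever hash is used): give every record its own random salt, and at lookup time hash the submitted address against each record's salt until one matches.

```python
import hashlib
import hmac
import os


def _digest(email: str, salt: bytes) -> bytes:
    return hashlib.sha256(salt + email.lower().encode("utf-8")).digest()


def new_record(user_id: int, email: str) -> dict:
    """Create a record with its own random 16-byte salt."""
    salt = os.urandom(16)
    return {"id": user_id, "salt": salt, "hash": _digest(email, salt)}


def find_user(email: str, records):
    """Linear scan: one hash computation per stored record."""
    for record in records:
        if hmac.compare_digest(_digest(email, record["salt"]), record["hash"]):
            return record["id"]
    return None
```

The lookup cost grows with the number of users, which is the trade-off the comment is pointing at; in exchange, an attacker has to repeat the guessing work for every record instead of once for the whole table.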
Sidenote, but I find this post maddening to understand, because the author seems to be using the word "e-mail" to mean both "e-mail address" and "e-mail message", and then uses ambiguous pronouns to boot:
> In conclusion, if you only use emails for transactional emails, you might be able to only store hashed versions of them.
HUH?
The most obvious way to interpret this sentence is as storing hashed versions of transaction e-mail messages. Which makes no sense and isn't what the author means, but wow this is some confusing writing.
> Earlier this year, when I went from having only Facebook-login [...] to allow registrations with email and password, one of my concerns was how to implement this in a way that protects the data and privacy of my users.
Any privacy effort is laudable. Then again, if you're serious about protecting your users' data and privacy, Facebook login is the elephant in the room.
Fully agree, and you can be certain that Facebook does save your e-mail address.
I use authentication services like Auth0 and AWS Cognito. The first one I think is completely safe for privacy; the second one is used for convenience (I think the service is good for stuff you host on AWS anyway, although it is generic, so it isn't restricted to that).
But using an auth service is mostly about deferring the risk of breaches to people more proficient in security. That comes with the cost that said auth service can know which services registered users are using.
The author is correct though. Even when you employ such an auth service, it can be good practice to hash the e-mail address or even other identifiers for your own DB (you still need that to associate state with a user).
I fully agree, so when I released the update with email/password registration, I also stopped allowing new account creations via Facebook. Now it's only supported for login for legacy reasons, and those users can disconnect their Facebook account after connecting an email.
Why would you go to the trouble of not enabling Facebook login (as long as you provide other login methods, of course)? If someone is using Facebook & Facebook login they clearly don't mind being tracked by Facebook.
Wouldn't the lack of means to contact all of your users, immediately and directly, create other compliance challenges? You would be unable to notify users of a data breach until their next login; former users might be left permanently in the dark. Similarly, being unable to push legally mandated notice of policy updates could be an impossible challenge. I can see how this proposed scheme could work day to day, but you would likely be well served to retain un-hashed emails in cold storage.
> For every transactional email I need to send out - registration, account recovery, and email change verification - the user always initiates this by submitting their email address, and it will at that time be available to the backend to perform the needed action.
This sounds like terrible UX, not to mention email use cases not initiated by the user. I really think you'd be shooting yourself in the foot by setting up a small site with this philosophy just because you don't need emails right now.
Sorry for the Google link, I can't figure out how to copy the direct link on Android Chrome.
[1] https://tools.ietf.org/html/rfc5321
There are plenty more.
I don't have a FB account fwiw.
Though I don't know about it being compliant I suppose Facebook Login (and other forms of SSO) shifts the reliability to Facebook.