Hey, author here. For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day there is a worker job that empties them out while retaining the hit info.
If 10 users around the globe share an IP on a VPN and hit your site, you only count that as 1? What about corporate networks, etc.? IP is a bad indicator.
Yep, this is covered in the writeup. Results are accurate enough to gauge whether your post is doing well or not (keep in mind that this is a simple analytics system for a small blogging platform).
I also tested it in parallel with some other analytics platforms, and it actually performed better, since ad blockers are more prevalent among readers than IP sharing is in this context. I've added an edit to the essay for clarity.
Analytics doesn't need to be perfectly accurate. The important thing isn't the exact number of visitors; it's the trends. Do we have more users now than last week? Do we get more traffic from X than from Y? Whether it's 1,000 or 1,034 isn't so important.
The idea of using CSS-triggered requests for analytics was really cool to me when I first encountered it.
One guy on twitter (no longer available) used it for mouse tracking: overlay an invisible grid of squares on the page, each with a unique background image triggered on hover. Each background image sends a specific request to the server, which interprets it!
For fun one summer, I extended that idea to create a JS-free "css only async web chat": https://github.com/kkuchta/css-only-chat
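The grid version of that trick is easy to picture as generated CSS. A rough sketch of a generator in Python, assuming a hypothetical /track endpoint and made-up cell IDs (not the original poster's code):

    # Emit one :hover rule per grid cell; hovering cell (x, y) makes the browser
    # fetch /track?x=..&y=.., which the server can log as a mouse position.
    def grid_css(cols=20, rows=20, endpoint="/track"):
        rules = []
        for y in range(rows):
            for x in range(cols):
                rules.append(
                    f"#cell-{x}-{y}:hover {{ background-image: url('{endpoint}?x={x}&y={y}'); }}"
                )
        return "\n".join(rules)

    print(grid_css(cols=2, rows=2))  # four rules, one per invisible cell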
Anonymizing IP addresses by just hashing the date and IP is security theater.
Cryptographic hashes are designed to be fast. You can do 6 billion MD5 hashes per second on a MacBook (M1 Pro) via hashcat, and there are only 4 billion IPv4 addresses. So you can brute force the entire range and find the IP address. Basically, reverse the hash.
And that’s true even if they used something secure like SHA-256 instead of broken MD5
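To make the scale concrete, here is a sketch of that brute force, assuming the stored value is simply hash(date + IP) as described in the article (pure Python for readability; hashcat on a GPU does the same enumeration orders of magnitude faster):

    import hashlib
    from ipaddress import IPv4Address

    def recover_ip(target_hash, date):
        """Walk the entire IPv4 space (2^32 addresses) until the digest matches.
        The keyspace is so small that "reversing" the hash is just enumeration."""
        for n in range(2**32):
            ip = str(IPv4Address(n))
            if hashlib.sha256((date + ip).encode()).hexdigest() == target_hash:
                return ip
        return None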
Aside from it being technically trivial to get an IP back from its hash, the EU data protection agency made it very clear that "hashing PII does not count as anonymising PII".
Even if you hash somebody's full name, you can later answer the question "does this hash match this specific full name?". Being able to answer this question implies that the anonymisation process is reversible.
We're members of some EU projects, and they share a common help desk. To serve as a knowledge base, the tickets are kept, but all PII is anonymized after 2 years AFAIK.
What they do is pretty simple. They overwrite the data fields with the text "<Anonymized>". No hashes, no identifiers, nothing. Everything is gone. Plain and simple.
I think the word "reversible" here is being stretched a bit. There is a significant difference between being able to list every name that has used your service and being able to check if a particular name has used your service. (Of course these can be effectively the same in cases where you can list all possible inputs such as hashed IPv4 addresses.)
That doesn't mean that hashing is enough for pure anonymity, but used properly hashes are definitely a step above something fully reversible (like encryption with a common key).
It depends. For example, if each day you generate a random nonce and use it to salt that day's PII (and don't store the nonce) then you cannot later determine (a) did person A visit on day N or (b) is visitor X on day N the same as visitor Y on day N+1. But you can still determine how many distinct visitors you had on day N, and answer questions about within-day usage patterns.
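A minimal sketch of that scheme, assuming a single process and an in-memory nonce (the class and method names are made up for illustration):

    import hashlib
    import secrets

    class DailyUniqueCounter:
        """Dedupe hits within a day without being able to answer, later,
        whether a given IP visited: the nonce is never written anywhere."""

        def __init__(self):
            self.nonce = secrets.token_bytes(32)
            self.seen = set()

        def record(self, ip):
            self.seen.add(hashlib.sha256(self.nonce + ip.encode()).digest())

        def roll_over(self):
            """Run at midnight: keep only the distinct-visitor count,
            discard the nonce and the hashes."""
            count = len(self.seen)
            self.nonce = secrets.token_bytes(32)
            self.seen = set()
            return count

Once roll_over() has run, neither "did address A visit on day N" nor "is visitor X on day N the same as visitor Y on day N+1" can be answered, but the per-day distinct count survives.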
Author here. I commented down below, but it's probably more relevant in this thread.
For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day there is a worker job that scrubs the IP hash, which is by then irrelevant.
For context, this problem also came up in a discussion about Storybook doing something similar in their telemetry [0], and with zero optimization it takes around two hours to calculate the salted hashes for every IPv4 address on my home laptop.
[0] https://news.ycombinator.com/item?id=37596757
Hashes should be salted. If you salt, you're fine; if you don't, you aren't.
Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.
As explained in the article, there seems to be no salt (or rather, the current date seems to be used as a salt, but that's not a random salt and can easily be guessed by anyone who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").
It's pretty easy to reason about these things if you look at them from the perspective of an attacker. How would you go about figuring out anything about a specific person, given the data? If you can't, then the data is probably OK to store.
> Hashes should be salted. If you salt, you are fine, if you don't you aren't.
> Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.
I think I'm missing something.
If the salt is known to the server, then it's useless for this scenario. Because given a known salt, you can generate the hashes for every IP address + that salt very quickly. (Salting passwords works because the space for passwords is big, so rainbow tables are expensive to generate.)
If the salt is unknown to the server, i.e. generated by the client and 'never leaves the client'... then why bother with hashes? Just have the client generate a UUID directly instead of a salt.
Salts are generally stored with the hash, and are only really intended to prevent "rainbow table" attacks (i.e. the use of precomputed hash tables). Though if every entry for a given day shares a predictable salt, one hash attempt lets you test against all the hashes for that day at once.
That being said, the previous responder's point still stands that you can brute force the salted IPs at about a second per IP with the colocated salt. Using multiple hash iterations (e.g. 1000x; i.e. "stretching") is how you'd meaningfully increase computational complexity, but still not in a way that makes use of the general "can't be practically reversed" hash guarantees.
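For what stretching looks like in practice, a sketch using the standard library's PBKDF2 (the iteration count and salt handling are illustrative only):

    import hashlib
    import os

    salt = os.urandom(16)  # stored alongside the hashes, as salts usually are

    def stretched_ip_hash(ip, iterations=100_000):
        # Each guess now costs `iterations` SHA-256 invocations instead of one,
        # which slows a brute force linearly but doesn't shrink the 2^32 keyspace.
        return hashlib.pbkdf2_hmac("sha256", ip.encode(), salt, iterations)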
Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack. This problem is the reason we have password KDFs like scrypt.
(I don't care about this Bear analytics thing at all, and just clicked the comment thread to see if it was the Bear I thought it was; I do care about people's misconceptions about hashing.)
Maybe they use a secret salt or a rotating salt? The example code doesn't, so I'm afraid you are right. But with one addition it can be made reasonably secure.
I am afraid, however, that this security theater is enough to pass many laws, regulations and such on PII.
Not really. They are designed to be fast enough and even then only as a secondary priority.
> You can do 6 billion … hashes/second on [commodity hardware] … there’s only 4 billion ipv4 addresses. So you can brute force the entire range
This is harder if you use a salt not known to the attacker. Per-entry salts can help even more, though that isn't relevant to IPv4 addresses in a web/app analytics context because after the attempt at anonymisation you want to still be able to tell that two addresses were the same.
> And that’s true even if they used something secure like SHA-256 instead of broken MD5
Relying purely on the computational complexity of one hash operation, even one not yet broken, is not safe given how easy temporary access to mass CPU/GPU power is these days. This can be mitigated somewhat by running many rounds of the hash with a non-global salt, which is what good key derivation processes do, for instance. Of course you need to increase the number of rounds over time to keep up with the growth in available processing power, to keep undoing your hash more hassle than it is worth.
But yeah, a single unsalted hash (or a hash with a salt the attacker knows) on IP address is not going to stop anyone who wants to work out what that address is.
A "salt not known to the attacker" is a "key" to a keyed hash function or message authentication code. A salt isn't a secret, though it's not usually published openly.
That's not a reasonable way to put it. It's literally the second priority, and heavily evaluated when deciding which algorithms to adopt.
https://jolynch.github.io/posts/use_fast_data_algorithms/
> This is harder if you use a salt not known to the attacker.
The "attacker" here is the sever owner. So if you use a random salt and throw it away, you are good, anything resembling the way people use salt on practice is not fine.
Assuming you don't store the salts, this produces a value that is useless for anything but counting something like DAU. Which you could equally just do by counting them all and deleting all the data at the end of the day, or using a cardinality estimator like HLL.
Of course, if you have multiple servers or may reboot, you need to store the salt somewhere. If you are going to bother storing the salt and cleaning it up after the day is over, it may be just as easy to clean the hashes at the end of the day (and keep the total count), which is equivalent. This should work unless you want to keep individual counts around for something like seeing the distribution of requests per IP. But in that case you could just replace the hashes with random values at the end of the day to fully anonymize them, since you no longer need to increment them.
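A sketch of that end-of-day scrub, assuming an in-memory map of IP hash to hit count (the data shapes are made up):

    import secrets

    def end_of_day_scrub(per_visitor_counts):
        """per_visitor_counts: {ip_hash: hits_today}. Keep the distribution of
        hits per visitor, but swap each IP hash for a random value so nothing
        ties the counts back to an address."""
        return {secrets.token_hex(16): hits for hits in per_visitor_counts.values()}

    today = {"3f2a9c...": 5, "77b0e1...": 1}  # illustrative hash prefixes
    anonymized = end_of_day_scrub(today)
    total_views = sum(anonymized.values())    # 6
    unique_visitors = len(anonymized)         # 2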
I think the opposite? I’m a dev with a bit of an interest in security, and this immediately jumped out at me from the story; knowing enough security to discard bad ideas is useful.
You’ll never know enough.
Seems clever and all, but `body:hover` will most probably completely miss all "keyboard-only" users and users with user agents (assistive technologies) that do not use pointer devices.
Yes, these are perhaps marginal groups, but it is always a super bad sign to see them excluded in any way.
I am not sure (I doubt) there is a 100% reliable way to detect "a real user is reading this article" (and issue an HTTP request) from baseline CSS in every single user agent out there; some of them might not support CSS at all, or might have loading of any kind of decorative images from CSS disabled.
There are modern selectors that could help, like :root:focus-within (requiring that the user actually focus something interactive, which again is not guaranteed to trigger such a selector in all agents), and/or bleeding-edge scroll-linked animations (`@scroll-timeline`). But again, braille readers will probably remain left out.
I think most mobile browsers emit the "hover" state whenever you tap / drag / swipe over something in the page. The "active" state is even more reliable, IMO. But yes, you are right that it is problematic. Quoting the MDN page about ":hover" [1]:
> Note: The :hover pseudo-class is problematic on touchscreens. Depending on the browser, the :hover pseudo-class might never match, match only for a moment after touching an element, or continue to match even after the user has stopped touching and until the user touches another element. Web developers should make sure that content is accessible on devices with limited or non-existent hovering capabilities.
[1] https://developer.mozilla.org/en-US/docs/Web/CSS/:hover
I really wish modern touchscreens spent the extra few cents to support hover. Samsung devices from the ~2012 era all supported detection of fingers hovering near the screen. I suspect it’s terrible patent laws holding back this technology, like most technologies that aren’t headline features.
I'm a keyboard user on my computer (qutebrowser), but I think your sentiment is correct: the number of keyboard-only users is probably much, much smaller than the number of people using an adblocker. So OP's method is likely to produce more accurate analytics than a JavaScript-only design.
OP just thought of a creative, effective, and probably faster, more code-efficient way to do analytics. I love it; thanks, OP, for sharing it.
Joking aside, I love to read websites with keyboards, especially if I'm reading blogs. So it's possible that sometimes my pointer is off somewhere to prevent distraction.
I think there might be more than ten [1] blind folks using computers out there, most of them not using pointing devices at all, or not in a way that would produce "hover".
[1] It was base ten, right?
As written, it depends on where your pointer is, if your device has one. If it’s within the centre 760px (the content column plus 20px padding on each side), it’ll activate, but if it’s not, it won’t. This means that some keyboard users will be caught, and some mouse users (especially those with larger viewports) won’t.
> And not just the bad ones, like Google Analytics. Even Fathom and Plausible analytics struggle with logging activity on adblocked browsers.
I believe that's because they're trying to live in what amounts to a toxic wasteland. Users like us are done with the whole concept, and as such I assume that if CSS analytics becomes popular, attempts will be made to bypass that too.
I manually unblocked Piwik/Matomo, Plausible and Fathom in uBlock. I don't see any harm in what and how these track. And they do give the people behind the site valuable information "to improve the service".
e.g. Plausible collects less information on me than the common nginx or Apache logs do. For me, as a blogger, it's important to see when a post gets on HN, is linked from somewhere, and what kinds of content are valued and which are ignored. So that I can blog about stuff you actually want to read and spread it through channels so that you are actually aware of it.
If every web client stopped the tracking, you, as a blogger, could go back to just getting analytics from server logs (real analytics, using maths).
Arguably state of the art in that approach to user/session/visits tracking 20 years ago beats today's semi-adblocked disaster. By good use of path aliases aka routes, and canonical URLs, you can even do campaign measurement without messing up SEO (see Amazon.com URLs).
You're just saying a smaller-scale version of "as a publisher it's important for me to collect data on my audience to optimize my advertising revenue." The adtech companies take the shit for being the visible 10% but publishers are consistently the ones pressuring for more collection.
Nothing's gonna block your webserver's access.log fed into an analytics service.
If anything, you're gonna get numbers that are inflated, because it's nearly impossible to dismiss all of the bot traffic just by looking at user agents.
The bit of the web that feels to me like a toxic wasteland is all the adverts; the tracking is a much more subtle issue, where the damage is the long-term potential of having a digital twin that can be experimented on to find how best to manipulate me.
I'm not sure how many people actually fear that. Might get responses from "yes, and it's creepy" to "don't be daft that's just SciFi".
True, though doing it in CSS does have a couple of interesting aspects: using :hover would filter out bots that don't use a full-on webdriver (most bots, that is). I would think that using an @import with 'supports' for an empty-ish .css file would be better in some ways (since adblockers are awfully good at spotting 1px transparent tracking pixels, but less likely to block .css files, to avoid breaking layouts), but that wouldn't have the clever :hover benefits.
I have a genuine question that I fear might be interpreted as a dismissive opinion but I'm actually interested in the answer: what's the goal of collecting analytics data in the case of personal blogs in a non-commercial context such as what Bearblog seems to be?
I can speak to this from the writer's perspective as someone who has been actively blogging since c. 2000 and has been consistently (very) interested in my "stats" the entire time.
The primary reason I care about analytics is to see if posts are getting read, which on the surface (and in some ways) is for reasons of vanity, but is actually about writer-reader engagement. I'm genuinely interested in what my readers resonate with, because I want to give them more of that. The "that" could be topical, tonal, length, who knows. It helps me hone my material specifically for my readers. Ultimately, I could write about a dozen different things in two dozen different ways. Obviously, I do what I like, but I refine it to resonate with my audience.
In this sense, analytics are kind of a way for me to get to know my audience. With blogs that had high engagement, analytics gave me a sort of fuzzy character description of who my readers were. As with the above, I got to see what they liked, but also when they liked it. Were they reading first thing in the morning? Were they lunchtime readers? Were they late-at-night readers? This helped me choose (or feel better about) posting at certain times. Of course, all of this was fuzzy intel, but I found it really helped me engage with my readership more actively.
Feedback loops. Contrary to what a lot of people seem to think, analytics is not just about advertising or selling data, it's about analysing site and content performance. Sure that can be used (and abused) for advertising, but it's also essential if you want any feedback about what you're doing.
You might get no monetary value whether 12 people read the site or 12,000, but from a personal perspective it's nice to know what people want to read about from you, so you can feel the time you spent writing was well spent, and adjust, if you wish, toward things that are more popular.
Curiosity? I like to know if anyone is reading what I write. It's also useful to know what people are interested in. Even personal bloggers may want to tailor content to their audience. It's good to know that 500 people have read an article about one topic, but only 3 people read one about a different topic.
For the curiosity, one solution I've been pondering, but never gotten around to implementing is just logging the country of origin for a request, rather than the entire IP.
IPs are useful in case of attack, but you could limit yourself to simply logging subnets. It's a little more aggressive to block a subnet, or an entire ISP, but it seems like a good tradeoff.
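A sketch of the subnet idea using the standard library's ipaddress module (the /24 and /48 prefix lengths are just one reasonable choice):

    from ipaddress import ip_network

    def coarse_origin(ip):
        """Log the /24 (IPv4) or /48 (IPv6) network instead of the full address."""
        prefix = 48 if ":" in ip else 24
        return str(ip_network(f"{ip}/{prefix}", strict=False))

    coarse_origin("203.0.113.42")    # '203.0.113.0/24'
    coarse_origin("2001:db8::1234")  # '2001:db8::/48'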
I attempted to do this back at the start of this year, but lost motivation building the web UI. My trick is not CSS but simply loading fake images with <img> tags: https://github.com/nolytics
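A guess at the shape of that approach, as a stdlib-only sketch: a tiny beacon server that serves a 1x1 GIF and logs each hit. The paths and log format here are invented for illustration, not taken from the linked repo.

    # Embed in a page as: <img src="http://localhost:8000/hit/my-post" alt="">
    from datetime import datetime, timezone
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # A valid 1x1 transparent GIF (43 bytes).
    PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00!"
             b"\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
             b"\x00\x02\x02D\x01\x00;")

    class HitHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if not self.path.startswith("/hit/"):
                self.send_error(404)
                return
            slug = self.path[len("/hit/"):]
            with open("hits.log", "a") as log:  # one line per page view
                log.write(f"{datetime.now(timezone.utc).isoformat()}\t{slug}\n")
            self.send_response(200)
            self.send_header("Content-Type", "image/gif")
            self.send_header("Content-Length", str(len(PIXEL)))
            self.end_headers()
            self.wfile.write(PIXEL)

    if __name__ == "__main__":
        HTTPServer(("", 8000), HitHandler).serve_forever()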