HermanMartinus · 2 years ago
Hey, author here. For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day a worker job empties them out while retaining the hit info.

I've added an edit to the essay for clarity.
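
For the curious, a minimal sketch of that flow; this is illustrative only, not Bear's actual code, and the names are made up:

    import hashlib
    from datetime import date

    seen_today = set()   # emptied by the nightly worker once counts are tallied
    hits = {}            # per-page view counts, retained after the scrub

    def count_hit(ip: str, page: str) -> None:
        # Hashing IP + date means a reader increments a page at most once
        # per day; the hashes themselves become useless junk at midnight.
        key = hashlib.sha256(f"{ip}:{date.today()}".encode()).hexdigest()
        if (key, page) not in seen_today:
            seen_today.add((key, page))
            hits[page] = hits.get(page, 0) + 1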

myfonj · 2 years ago
Have you considered serving an actual small transparent image with (private) caching headers set to expire at midnight (and not storing IPs at all)?
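
Something like this sketch, assuming a Python backend and UTC midnight (the payload is the classic 1x1 transparent GIF):

    import base64
    from datetime import datetime, timedelta, timezone

    PIXEL = base64.b64decode(
        "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
    )

    def pixel_headers() -> dict:
        # Cache privately until midnight UTC: each browser re-fetches (and
        # is counted) at most once per day, and no IP is stored anywhere.
        now = datetime.now(timezone.utc)
        midnight = (now + timedelta(days=1)).replace(hour=0, minute=0,
                                                     second=0, microsecond=0)
        max_age = int((midnight - now).total_seconds())
        return {"Content-Type": "image/gif",
                "Cache-Control": f"private, max-age={max_age}"}
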
HermanMartinus · 2 years ago
That's a pretty neat idea. I'll add it to my list of things to explore.
bosch_mind · 2 years ago
If 10 users share an IP on a shared VPN around the globe and hit your site, you only count that as 1? What about corporate networks, etc? IP is a bad indicator
HermanMartinus · 2 years ago
Yep, this is covered in the writeup. Results are accurate enough to gauge whether your post is doing well or not (keep in mind that this is a simple analytics system for a small blogging platform).

I also tested it in parallel with some other analytics platforms, and it actually performed better, since adblockers are more prevalent in this context than readers sharing IPs.

Culonavirus · 2 years ago
It's a bad indicator, especially since people who would otherwise not use a VPN apparently started using this: https://support.apple.com/en-us/102602
victorbjorklund · 2 years ago
Analytics doesn't need to be accurate. The important thing isn't really exactly how many visitors you have; the important thing is trends. Do we have more users now than last week? Do we get more traffic from X than from Y? Whether it's 1000 or 1034 isn't so important.
Galanwe · 2 years ago
Not even mentioning CGNAT.
dantiberian · 2 years ago
You should add this as a reply to the top comment as well.
fishtoaster · 2 years ago
The idea of using CSS-triggered requests for analytics was really cool to me when I first encountered it.

One guy on twitter (no longer available) used it for mouse tracking: overlay an invisible grid of squares on the page, each with a unique background image triggered on hover. Each background image sends a specific request to the server, which interprets it!

For fun one summer, I extended that idea to create a JS-free "css only async web chat": https://github.com/kkuchta/css-only-chat

willio58 · 2 years ago
Nice implementation!
jackjeff · 2 years ago
The whole anonymization of IP addresses by hashing the date and IP is just security theater.

Cryptographic hashes are designed to be fast. You can do 6 billion md5 hashes a second on a MacBook (M1 Pro) via hashcat, and there are only 4 billion IPv4 addresses. So you can brute force the entire range and find the IP address, basically reversing the hash.

And that's true even if they used something secure like SHA-256 instead of broken MD5.
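
To make the point concrete, here is a sketch of that brute force, assuming the hashed input is simply IP + date (the details of Bear's actual scheme may differ):

    import hashlib
    import ipaddress

    def reverse_ip_hash(target_hash: str, day: str):
        # Walk the entire IPv4 space. hashcat gets through it in about a
        # second; this pure-Python loop is far slower, but the tiny 2^32
        # keyspace is the point.
        for n in range(2**32):
            ip = str(ipaddress.ip_address(n))
            if hashlib.sha256(f"{ip}{day}".encode()).hexdigest() == target_hash:
                return ip
        return None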

WhyNotHugo · 2 years ago
Aside from it being technically trivial to get an IP back from its hash, the EU data protection agency made it very clear that "hashing PII does not count as anonymising PII".

Even if you hash somebody's full name, you can later answer the question "does this hash match this specific full name?". Being able to answer this question implies that the anonymisation process is reversible.

bayindirh · 2 years ago
We're members of some EU projects, and they share a common help desk. To serve as a knowledge base, the tickets are kept, but all PII is anonymized after 2 years AFAIK.

What they do is pretty simple. They overwrite the data fields with the text "<Anonymized>". No hashes, no identifiers, nothing. Everything is gone. Plain and simple.

kevincox · 2 years ago
I think the word "reversible" here is being stretched a bit. There is a significant difference between being able to list every name that has used your service and being able to check if a particular name has used your service. (Of course these can be effectively the same in cases where you can list all possible inputs such as hashed IPv4 addresses.)

That doesn't mean that hashing is enough for pure anonymity, but used properly hashes are definitely a step above something fully reversible (like encryption with a common key).

jefftk · 2 years ago
It depends. For example, if each day you generate a random nonce and use it to salt that day's PII (and don't store the nonce) then you cannot later determine (a) did person A visit on day N or (b) is visitor X on day N the same as visitor Y on day N+1. But you can still determine how many distinct visitors you had on day N, and answer questions about within-day usage patterns.
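
A sketch of that scheme; the nonce lives only in memory and is regenerated (never archived) each day:

    import hashlib
    import secrets

    day_salt = secrets.token_bytes(16)   # rotated daily, never written to disk

    def visitor_id(ip: str) -> str:
        # Stable within the day, so distinct visitors can be counted, but
        # unlinkable across days once the salt is thrown away at midnight.
        return hashlib.sha256(day_salt + ip.encode()).hexdigest()
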
TylerE · 2 years ago
Is an IPv4 address really classed as PII? Sounds a bit insane.
ilrwbwrkhv · 2 years ago
Yes, but if the business is not in the EU, it doesn't need to care one bit about GDPR or the EU.
HermanMartinus · 2 years ago
Author here. I commented down below, but it's probably more relevant in this thread.

For a bit of clarity around IP address hashes: the only use they have in this context is preventing duplicate hits in a day (making each page view unique by default). At the end of each day a worker job scrubs the IP hashes, which are by then irrelevant.

myfonj · 2 years ago
Have you considered serving an actual small transparent image with caching headers set to expire at midnight?
Etheryte · 2 years ago
For context, this problem also came up in a discussion about Storybook doing something similar in their telemetry [0], and with zero optimization it takes around two hours to calculate the salted hashes for every IPv4 address on my home laptop.

[0] https://news.ycombinator.com/item?id=37596757

alkonaut · 2 years ago
Hashes should be salted. If you salt, you are fine, if you don't you aren't.

Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.

As explained in the article, there seems to be no salt (or rather, the current date seems to be used as a salt, but that's not a random salt and can easily be guessed by anyone who wants to ask "did IP x.y.z.w visit on date yy-mm-dd?").

It's pretty easy to reason about these things if you look at them from the perspective of an attacker. How would you figure out anything about a specific person given the data? If you can't, then the data is probably OK to store.

piaste · 2 years ago
> Hashes should be salted. If you salt, you are fine, if you don't you aren't.

> Whether the salt can be kept indefinitely, or is rotated regularly etc is just an implementation detail, but the key with salting hashes for analytics is that the salt never leaves the client.

I think I'm missing something.

If the salt is known to the server, then it's useless for this scenario. Because given a known salt, you can generate the hashes for every IP address + that salt very quickly. (Salting passwords works because the space for passwords is big, so rainbow tables are expensive to generate.)

If the salt is unknown to the server, i.e. generated by the client and 'never leaves the client'... then why bother with hashes? Just have the client generate a UUID directly instead of a salt.

darken · 2 years ago
Salts are generally stored with the hash, and are only really intended to prevent "rainbow table" attacks (i.e. the use of precomputed hash tables). Though a predictable salt shared across entries does mean a single hash attempt can be checked against every hash for that timestamp.

That being said, the previous responder's point still stands: with the colocated salt you can brute force a salted IP hash in about a second per hash. Using multiple hash iterations (e.g. 1000x, i.e. "stretching") is how you'd meaningfully increase the computational cost, but still not in a way that gets you the general "can't be practically reversed" hash guarantees.
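
Stretching is one line with the stdlib; the iteration count here is just an illustrative choice, and the IP is a documentation address:

    import hashlib, os

    salt = os.urandom(16)
    # 100,000 iterations makes each brute-force guess ~100,000x costlier,
    # but 2^32 candidate IPs is still a small keyspace for a motivated attacker.
    digest = hashlib.pbkdf2_hmac("sha256", b"203.0.113.7", salt, 100_000)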

tptacek · 2 years ago
Salting a standard cryptographic hash (like SHA2) doesn't do anything meaningful to slow a brute force attack. This problem is the reason we have password KDFs like scrypt.
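
For illustration, the stdlib flavour of such a KDF, with commonly cited scrypt parameters (a sketch; the parameters and input are illustrative):

    import hashlib, os

    salt = os.urandom(16)
    # scrypt is deliberately memory-hard: every guess costs real RAM and
    # time, unlike one SHA-256, which GPUs do billions of times per second.
    digest = hashlib.scrypt(b"203.0.113.7", salt=salt, n=2**14, r=8, p=1)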

(I don't care about this Bear analytics thing at all, and just clicked the comment thread to see if it was the Bear I thought it was; I do care about people's misconceptions about hashing.)

berkes · 2 years ago
Maybe they use a secret salt or a rotating salt? The example code doesn't, so I'm afraid you are right. But with one addition it can be made reasonably secure.

I am afraid, however, that this security theater is enough to satisfy many laws and regulations on PII.

dspillett · 2 years ago
> Cryptographic hashes are designed to be fast.

Not really. They are designed to be fast enough and even then only as a secondary priority.

> You can do 6 billion … hashes/second on [commodity hardware] … there’s only 4 billion ipv4 addresses. So you can brute force the entire range

This is harder if you use a salt not known to the attacker. Per-entry salts can help even more, though they aren't applicable to IPv4 addresses in a web/app analytics context, because after the attempt at anonymisation you still want to be able to tell that two addresses were the same.

> And that’s true even if they used something secure like SHA-256 instead of broken MD5

Relying purely on the computational cost of one hash operation, even one not yet broken, is not safe given how easy temporary access to mass CPU/GPU power is these days. This can be mitigated somewhat by running many rounds of the hash with a non-global salt, which is what good key derivation processes do. Of course, you need to increase the number of rounds over time to keep up with the growth in available processing power, to keep undoing your hash more hassle than it is worth.

But yeah, a single unsalted hash (or a hash with a salt the attacker knows) on IP address is not going to stop anyone who wants to work out what that address is.

SAI_Peregrinus · 2 years ago
A "salt not known to the attacker" is a "key" to a keyed hash function or message authentication code. A salt isn't a secret, though it's not usually published openly.
marcosdumay · 2 years ago
> only as a secondary priority

That's not a reasonable way to say it. It's literally the second priority, and it's heavily weighed when deciding which algorithms to adopt.

> This is harder if you use a salt not known to the attacker.

The "attacker" here is the sever owner. So if you use a random salt and throw it away, you are good, anything resembling the way people use salt on practice is not fine.

krsdcbl · 2 years ago
Don't forget that md5 is comparatively slow and there are way faster options for hashing nowadays:

https://jolynch.github.io/posts/use_fast_data_algorithms/

TekMol · 2 years ago
That is easy to fix though. Just use a temporary salt that rotates daily.

Roughly, in Python:

    import hashlib, secrets
    from datetime import date

    if salt["day"] < date.today():
        salt = {"day": date.today(), "salt": secrets.token_hex(16)}
    ip_hash = hashlib.sha256((ip + salt["salt"]).encode()).hexdigest()

__alexs · 2 years ago
Assuming you don't store the salts, this produces a value that is useless for anything but counting something like DAU. Which you could equally just do by counting them all and deleting all the data at the end of the day, or using a cardinality estimator like HLL.
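
For example, Redis ships HyperLogLog out of the box; a sketch assuming redis-py and that roughly 1% error on the estimate is acceptable:

    import redis
    from datetime import date

    r = redis.Redis()

    def record_hit(ip: str) -> None:
        # PFADD feeds a ~12 KB HyperLogLog that estimates unique visitors
        # to within about 1%, with no queryable per-IP data retained.
        key = f"visitors:{date.today().isoformat()}"
        r.pfadd(key, ip)
        r.expire(key, 48 * 3600)

    daily_uniques = r.pfcount(f"visitors:{date.today().isoformat()}")
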
kevincox · 2 years ago
Of course, if you have multiple servers or may reboot, you need to store the salt somewhere. If you are going to bother storing the salt and cleaning it up after the day is over, it may be just as easy to clear the hashes at the end of the day (and keep the total count), which is equivalent. This works unless you want to keep individual counts around, for something like seeing the distribution of requests per IP. But in that case you could just replace the hashes with random values at the end of the day to fully anonymize them, since you no longer need to increment them.
hnreport · 2 years ago
This is the type of comment that reinforces not even trying to learn or outsource security.

You’ll never know enough.

petesergeant · 2 years ago
I think the opposite? I’m a dev with a bit of an interest in security, and this immediately jumped out at me from the story; knowing enough security to discard bad ideas is useful.

ktta · 2 years ago
Not if they use a password hash like Argon2 or scrypt.
__alexs · 2 years ago
Even then it is theatre, because if you know the IP address you want to check, it's trivial to see if there's a match.
ale42 · 2 years ago
But that's very heavy to compute at scale...
myfonj · 2 years ago
Seems clever and all, but `body:hover` will most probably completely miss all "keyboard-only" users and users with user agents (assistive technologies) that do not use pointer devices.

Yes, these are marginal groups perhaps, but it is always a super bad sign to see them excluded in any way.

I am not sure (I doubt) there is a 100% reliable way to detect "a real user is reading this article" (and issue an HTTP request) from baseline CSS in every single user agent out there (some of them might not support CSS at all, and some have loading of any kind of decorative image from CSS disabled).

There are modern selectors that could help, like :root:focus-within (requiring that the user actually focus something interactive, which again is not guaranteed to trigger in all agents), and/or bleeding-edge scroll-linked animations (`@scroll-timeline`). But again, braille readers will probably remain left out.

qingcharles · 2 years ago
Marginal? Surely this affects 50%+ of user agents, i.e. phones and tablets, which don't support :hover without a mouse being plugged in?
myfonj · 2 years ago
I think most mobile browsers emit "hover" state whenever you tap / drag / swipe over something in the page. "active" state is even more reliable IMO. But yes, you are right that it is problematic. Quoting MDN page about ":hover" [1]:

> Note: The :hover pseudo-class is problematic on touchscreens. Depending on the browser, the :hover pseudo-class might never match, match only for a moment after touching an element, or continue to match even after the user has stopped touching and until the user touches another element. Web developers should make sure that content is accessible on devices with limited or non-existent hovering capabilities.

[1] https://developer.mozilla.org/en-US/docs/Web/CSS/:hover

callalex · 2 years ago
I really wish modern touchscreens spent the extra few cents to support hover. Samsung devices from the ~2012 era all supported detection of fingers hovering near the screen. I suspect it’s terrible patent laws holding back this technology, like most technologies that aren’t headline features.
demondemidi · 2 years ago
Keyboard only users? All 10 of them? ;)
vivekd · 2 years ago
I'm a keyboard user when on my computer (qutebrowser), but I think your sentiment is correct: the number of keyboard-only users is probably much, much smaller than the number of people using adblock. So OP's method is likely to produce more accurate analytics than a JavaScript-only design.

OP just thought of a creative, effective, and probably faster, more code-efficient way to do analytics. I love it. Thanks for sharing it, OP.

bayindirh · 2 years ago
Well, with me it's probably 11.

Joking aside, I love to read websites with keyboards, especially blogs. So it's possible that sometimes my pointer is parked somewhere out of the way to prevent distraction.

myfonj · 2 years ago
I think there might be more than ten [1] blind folks using computers out there, most of them not using pointing devices at all, or not in a way that would produce "hover".

[1] that was base ten, right?

zichy · 2 years ago
Think about screen readers.
chrismorgan · 2 years ago
As written, it depends on where your pointer is, if your device has one. If it’s within the centre 760px (the content column plus 20px padding on each side), it’ll activate, but if it’s not, it won’t. This means that some keyboard users will be caught, and some mouse users (especially those with larger viewports) won’t.

nannal · 2 years ago
> And not just the bad ones, like Google Analytics. Even Fathom and Plausible analytics struggle with logging activity on adblocked browsers.

I believe that's because they're trying to live in what amounts to a toxic wasteland. Users like us are done with the whole concept, and as such I assume that if CSS analytics becomes popular, attempts will be made to bypass it too.

berkes · 2 years ago
Why?

I manually unblocked Piwik/Matomo, Plausible, and Fathom in uBlock. I don't see any harm in what and how these track. And they do give the people behind the site valuable information "to improve the service".

e.g. Plausible collects less information about me than common nginx or Apache logs do. For me, as a blogger, it's important to see when a post gets on HN, is linked from somewhere, and which kinds of content are valued and which are ignored, so that I can blog about stuff you actually want to read and spread it through channels that actually reach you.

Terretta · 2 years ago
To your point, server logs have the info.

If every web client stopped the tracking, you, as a blogger, could go back to getting analytics from server logs (real analytics, using maths).

Arguably the state of the art in that approach to user/session/visit tracking 20 years ago beats today's semi-adblocked disaster. With good use of path aliases (aka routes) and canonical URLs, you can even do campaign measurement without messing up SEO (see Amazon.com URLs).
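
A toy version of that, assuming the combined log format (real log analysers do far more, notably bot filtering):

    from collections import defaultdict

    # unique client addresses per requested path
    uniques = defaultdict(set)
    with open("access.log") as f:
        for line in f:
            try:
                ip = line.split()[0]                  # client address
                path = line.split('"')[1].split()[1]  # "GET /path HTTP/1.1"
            except IndexError:
                continue
            uniques[path].add(ip)

    # top ten pages by unique IPs
    for path, ips in sorted(uniques.items(), key=lambda kv: -len(kv[1]))[:10]:
        print(len(ips), path)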

morelisp · 2 years ago
You're just stating a smaller-scale version of "as a publisher it's important for me to collect data on my audience to optimize my advertising revenue." The adtech companies take the shit for being the visible 10%, but publishers are consistently the ones pressuring for more collection.
input_sh · 2 years ago
Nothing's gonna block your webserver's access.log fed into an analytics service.

If anything, you're gonna get numbers that are inflated, because it's nearly impossible to dismiss all of the bot traffic just by looking at user agents.

ben_w · 2 years ago
The bit of the web that feels like a toxic wasteland to me is all the adverts; the tracking is a much more subtle issue, where the damage is the long-term potential of having a digital twin that can be experimented on to find out how best to manipulate me.

I'm not sure how many people actually fear that. Responses would probably range from "yes, and it's creepy" to "don't be daft, that's just sci-fi".

account-5 · 2 years ago
Makes me reminisce about uMatrix, which could block the loading of CSS too.
its-summertime · 2 years ago

    ||somesite.example^$css
would work in uBlock

momentary · 2 years ago
Is uMatrix not in vogue any more? It's still my go-to tool!
chrismorgan · 2 years ago
This approach is no harder to block than the JavaScript approaches: you’re just blocking requests to certain URL patterns.
nannal · 2 years ago
That approach would work until analytics gets mixed in with actual styles and then you're trying to use a website without CSS.
marban · 2 years ago
Plausible still works if you reverse-proxy the script and the event URL through your own /randompath.
fatih-erikli · 2 years ago
This has been known as a "pixel tracker" for decades.
cantSpellSober · 2 years ago
Used in emails as well. Loading a 1x1 transparent <img> is a surer thing than triggering a hover event, but ad blockers often block those.
t0astbread · 2 years ago
Occasionally I've seen people fail and add the pixel as an attachment instead.
blacksmith_tb · 2 years ago
True, though doing it in CSS does have a couple of interesting aspects: using :hover would filter out bots that don't use a full-on webdriver (most bots, that is). I would think that using an @import with 'supports' for an empty-ish .css file would be better in some ways (adblockers are awfully good at spotting 1px transparent tracking pixels, but less likely to block .css files for fear of breaking layouts), but that wouldn't have the clever :hover benefits.
p4bl0 · 2 years ago
I have a genuine question that I fear might be interpreted as a dismissive opinion, but I'm actually interested in the answer: what's the goal of collecting analytics data for personal blogs in a non-commercial context such as what Bearblog seems to be?
taurusnoises · 2 years ago
I can speak to this from the writer's perspective as someone who has been actively blogging since c. 2000 and has been consistently (very) interested in my "stats" the entire time.

The primary reason I care about analytics is to see if posts are getting read, which on the surface (and in some ways) is for reasons of vanity, but is actually about writer-reader engagement. I'm genuinely interested in what my readers resonate with, because I want to give them more of that. The "that" could be topic, tone, length, who knows. It helps me hone my material specifically for my readers. Ultimately, I could write about a dozen different things in two dozen different ways. Obviously, I do what I like, but I refine it to resonate with my audience.

In this sense, analytics are kind of a way for me to get to know my audience. With blogs that had high engagement, analytics gave me a sort of fuzzy character description of who my readers were. As with above, I got to see what they liked, but also when they liked it. Were they reading first thing in the morning? Were they lunchtime readers? Were they late-at-night readers? This helped me choose (or feel better about) posting at certain times. Of course, all of this was fuzzy intel, but I found it really helped me engage with my readership more actively.

hennell · 2 years ago
Feedback loops. Contrary to what a lot of people seem to think, analytics is not just about advertising or selling data; it's about analysing site and content performance. Sure, that can be used (and abused) for advertising, but it's also essential if you want any feedback on what you're doing.

You might get no monetary value from having 12 people read the site or 12,000, but from a personal perspective it's nice to know what people want to read from you, so the time you spent writing feels well spent, and you can adjust toward more popular topics if you wish.

Veen · 2 years ago
Curiosity? I like to know if anyone is reading what I write. It's also useful to know what people are interested in. Even personal bloggers may want to tailor content to their audience. It's good to know that 500 people have read an article about one topic, but only 3 people read one about a different topic.
mrweasel · 2 years ago
Regarding the curiosity angle, one solution I've been pondering, but never gotten around to implementing, is just logging the country of origin for a request rather than the entire IP.

IPs are useful in case of an attack, but you could limit yourself to simply logging subnets. It's a little more aggressive to block a subnet, or an entire ISP, but it seems like a good tradeoff.
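
A sketch of that, assuming MaxMind's geoip2 library with a GeoLite2 database on disk:

    import ipaddress
    import geoip2.database   # pip install geoip2

    reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

    def coarse_entry(ip: str):
        # Keep the /24 (enough to block a misbehaving subnet) and the
        # country code; the full address is never written anywhere.
        subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
        country = reader.country(ip).country.iso_code or "??"
        return str(subnet), country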

freitzzz · 2 years ago
I attempted to do this back at the start of this year, but lost motivation while building the web UI. My trick is not CSS but simply loading fake images with <img> tags:

https://github.com/nolytics