LLMs can unmask pseudonymous users at scale with surprising accuracy

I thought this would be more about stylometry but it's mostly about users literally posting the same identifiable information across multiple services, including in one example their age, dog name, profession.

It's all classic dox profiling techniques, even things like spelling differences serving as regional signals and overlap in the specific topics being discussed.

It's why one has to think about what is being posted to which community if using different identities, rather than posting the same things across all of them. Though any such effort would be wasted if it relied on some non-public info that was later exposed in a database breach tying together previously unrelated profiles.

setopt · 11 days ago

I’m curious if an LLM-based defense for this could be made. Like a browser plugin that warns you if you type identifiable information (like occupation) into a text field, and highlights turns of phrase that are “unusual” enough to be identifiable.
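
As a rough illustration of the warning idea above, here is a minimal sketch that swaps the LLM for a few regular expressions. The pattern list, the labels, and the `flag_identifiers` name are all hypothetical, and a real plugin would need far broader coverage than this.

```python
import re

# Hypothetical, deliberately tiny pattern set: an assumption for illustration,
# not a complete or reliable detector of identifying details.
PATTERNS = {
    "age": re.compile(
        r"\b(?:i am|i'm|as a)\s+(?:a\s+)?\d{1,2}[\s-]*(?:year[\s-]*old|yo)\b",
        re.IGNORECASE,
    ),
    "location": re.compile(
        r"\b(?:i live in|where i live|i'm from|i am from)\b", re.IGNORECASE
    ),
    "occupation": re.compile(
        r"\b(?:i work as|i work at|my job as)\b", re.IGNORECASE
    ),
    # Case-sensitive on the name so only capitalized words count as pet names.
    "pet name": re.compile(r"\b[Mm]y (?:dog|cat)s?\s+[A-Z][a-z]+"),
}


def flag_identifiers(text: str) -> list[str]:
    """Return the labels of every pattern that matches the draft text."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(text)]


if __name__ == "__main__":
    draft = "My dogs Lacey and Baxter say hi from where I live."
    for label in flag_identifiers(draft):
        print(f"warning: draft may reveal your {label}")
```

A text-field hook could call something like `flag_identifiers` on each draft before it is submitted. Spotting "unusual" turns of phrase is a harder problem that this sketch does not attempt.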

everyday7732 · 11 days ago

or something which just inserts random untrue details about you every now and again, like they do in Alaska, where I live.

Only if said users happen to commit OPSEC failures themselves. LLMs aren't magic...

If someone can figure out who I am or what city I live in just by this username or my comments (with proof), I'll personally send you 500,000 JPY. I'm quite confident that's not going to happen though.

The paper referenced in the article does not even explain their exact testing methodology (such as the tools or exact prompts used) because they claim it would be misused for evil. In other words, "trust me bro."

Also see the previous discussion here: https://news.ycombinator.com/item?id=47139716

seanhunter · 11 days ago

Anyone who says that they can maintain perfect opsec over an extended period of time is seriously mistaken. A sufficiently motivated investigator with enough resources will join the dots eventually. The would-be evader has to be lucky every time whereas the investigator only has to be lucky once.

onionisafruit · 11 days ago

You live on Earth. Now that I won let’s go double or nothing. I bet I can guess where you got dem shoes at.

linkjuice4all · 11 days ago

He got them on his feet? He got them on the street?

tayo42 · 11 days ago

I skimmed some of your comments. You seem to be in the US, at least mid-30s, you bought a .dev domain and run your own email? I would think those are possible leads. You really don't think you slipped up once or twice in 5 years of posting? I think an LLM would go through all your posts and their context to get leads like that, and it would then be easier to check whether you used any other social media with the same name and see if the accounts have similarities.

comrh · 11 days ago

Everyone commits opsec failures eventually. With LLMs linking anonymous accounts it just makes it even more likely to be caught.

trinsic2 · 11 days ago

I'm pretty sure they can use the metadata they pull from your various interactions with search and the text you post online. These services build fingerprints of your habits using these techniques to follow you everywhere. At some point in the chain they could easily connect this fingerprint to your identity as soon as you log into an account that contains a piece of identifying information about you. The threat is real. I can foresee someone programming a terminal or app that obfuscates online behavior to avoid this fingerprinting in the future.

Unless I am misreading something. Take a look at surveillance capitalism to see what's possible right now. It's going to be 100x worse as LLMs become more advanced.

It's not the things you post online, it's the nuances behind the way you type and other behavioral signals that allow them to build these kinds of profiles.

ranger_danger · 11 days ago

Who is they? Which services?

From what I can tell, the article/paper in question does not appear to utilize any of the techniques you mention, but I'd be interested to learn more about it.

> it's the nuances behind the way you type

I found this paper which talks about some of those methods.

https://www.audiolabs-erlangen.de/content/04_fraunhofer/assi...

For example the "Text" section on page 91.

ggm · 11 days ago

With low precision, you're in Japan. But I don't need the JPY. Of course, that could be obfuscation.

ranger_danger · 11 days ago

The currency is not related to my location, I picked a random one, but thanks anyway :)

big-chungus4 · 11 days ago

You are ranger_danger

Footprint0521 · 11 days ago

40 year old software dev in Detroit Michigan?

Not that I care, and that could be wildly off, but opsec is a wide term… and Claude one-shot that… so stay safe out there bro, AI is wild

daemonologist · 11 days ago

I think Claude is guessing (educatedly - northern midwest does seem plausible). There's probably enough for the feds to track them down, but not me or an LLM.

iso-logi · 11 days ago

You are American, although you've discussed Ryanair before, which isn't exactly American. You have a number of comments and posts about Japan, which is strange, although you do drive a Japanese car.

daemonologist · 11 days ago

A JDM car, probably, to be precise. I think they lived in Japan for at least a little while, e.g.: https://news.ycombinator.com/item?id=44679406#44686142

Springtime · 11 days ago

firefoxd · 11 days ago

There was a tool shared here that could show which accounts belong to the same person based on the writing patterns. Can't remember the name, but it found my old accounts on HN pretty accurately.

eps · 11 days ago

https://news.ycombinator.com/item?id=33755016

Way simpler than hnprofile from the sibling comment. This one used cosine similarity between user vocs - https://web.archive.org/web/20221126225241/https://stylometr...
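
For anyone wondering what "cosine similarity between user vocabularies" could look like in practice, here is a minimal sketch. It is not the linked tool's actual code, which is not shown in this thread. The tokenizer, the function names, and the sample comments are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def vocab_vector(text: str) -> Counter:
    """Count lowercase word tokens across a user's combined comments."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vocabulary is treated as matching nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical comment histories for three accounts.
    account_a = "honestly the colour of the build logs is whilst rather odd"
    account_b = "honestly the colour scheme is rather odd whilst the logs scroll"
    account_c = "lol nah that framework is trash just use the other one"
    print(cosine_similarity(vocab_vector(account_a), vocab_vector(account_b)))
    print(cosine_similarity(vocab_vector(account_a), vocab_vector(account_c)))
```

Raw counts are dominated by common words like "the", so a real comparison would probably weight terms (TF-IDF, for instance) before taking the cosine. That weighting is left out here to keep the sketch short.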

firefoxd · 10 days ago

Oh yes, this is the one I was referring to. Too bad it's shut down.

ElCapitanMarkla · 11 days ago

Hnprofile.com which has since closed down - lettergram was the author - https://news.ycombinator.com/item?id=17942981

zppln · 11 days ago

The internet is getting less interesting by the day.

RiverCrochet · 11 days ago

The future is offline.

senectus1 · 11 days ago

*selfhosted

xtiansimon · 11 days ago

> “This is a pretty new capability; previous approaches on re-identification generally required structured data, and two datasets with a similar schema that could be linked together.”

Right up there with Skynet, for me, has been the idea of disparate databases all being linked up by bad actors.

It appears as though DOGE illegally obtained taxpayer data from the IRS. I don’t trust DOGE to safeguard anything.

And the penalties do not seem to be very severe outside of HIPAA.

https://democracyforward.org/news/press-releases/new-details...

kanemcgrath · 11 days ago

Anonymous account unmasking represents a new threat to anonymity. Not just this technique with LLMs, but the earlier text similarity one.

But I think it would be generally easier to counter in the same way.

Use an LLM or heuristics to pose as someone else.

Not only do you erase your traces, you add false positives into the system, which reduces the overall effectiveness of these techniques in the future. A bit of poisoning the well.

I hope eventually an easy-to-use tool, maybe with a small local LLM, can make it easy enough to do this, so that any future deanonymization attacks would be too untrustworthy to rely on.

notTooFarGone · 11 days ago

Like with browser fingerprinting, making it too unique is also an issue.

It may actually be a fine line. You may be flagged as an LLM later if your style is too generic and identified if your style is too unique.

petesergeant · 11 days ago

As a 32 year old Ghanaian woman living in Luang Prabang and studying as an ophthalmologist, this gives me some food for thought!

JKCalhoun · 11 days ago

My dogs Lacey and Baxter say "Hi!"

shubhamintech · 10 days ago

Stylometry is just the most legible version of this. The harder-to-defend surface: posting time patterns, topic clusters, cross-platform phrase matching, interaction graphs. LLMs synthesize weak signals at scale in a way no single analyst could, which makes the threat model fundamentally larger than "change how you write." Most OPSEC advice is written for the pre-LLM world.
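
To make one of those weak signals concrete, here is a minimal sketch of comparing posting-time patterns between two accounts. It assumes Unix timestamps are available for each account's posts. The function names and the choice of histogram intersection as the score are illustrative assumptions, not a method taken from the paper.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_histogram(timestamps: list[int]) -> list[float]:
    """Normalized 24-bin histogram of UTC posting hours from Unix timestamps."""
    counts = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    total = sum(counts.values())
    return [counts[h] / total if total else 0.0 for h in range(24)]


def histogram_overlap(p: list[float], q: list[float]) -> float:
    """Histogram intersection: near 1.0 for matching habits, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(p, q))


if __name__ == "__main__":
    # Hypothetical timestamps: two accounts posting at around 13:00 to 14:00 UTC,
    # and a third posting at around 02:00 UTC.
    account_a = [13 * 3600, 13 * 3600 + 600, 14 * 3600]
    account_b = [86400 + 13 * 3600, 86400 + 14 * 3600, 86400 + 14 * 3600 + 60]
    account_c = [2 * 3600, 2 * 3600 + 900]
    print(histogram_overlap(hour_histogram(account_a), hour_histogram(account_b)))
    print(histogram_overlap(hour_histogram(account_a), hour_histogram(account_c)))
```

On its own this score says little, since many people share a timezone. The point in the comment above is that it becomes useful once combined with other weak signals.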