I try to hide my real name whenever possible, out of an abundance of caution. You can still find it if you search carefully, but in today's hostile internet I see this kind of soft pseudonymity as my digital personal space, and expect to have it respected.
When playing around in GPT-3 I tried making sentences with my username. Imagine my surprise when I saw it spit out my (globally unique, unusual) full name!
Looking around, I found a paper that says language models spitting out personal information is a problem[1], a Google blog post that says there's not much that can be done[2], and an article that says OpenAI might automatically replace phone numbers in the future but other types of PII are harder to remove[3]. But nothing on what is actually being done.
If I had found my personal information on Google search results, or Facebook, I could ask the information to be removed, but GPT-3 seems to have no such support. Are we supposed to accept that large language models may reveal private information, with no recourse?
I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? sexual preferences? posts from 13-year-old me?). In the age of GDPR this feels like an enormous regression in privacy.
EDIT: a small thank-you to everybody commenting so far for not directly linking to specific results or actually writing my name, however easy it might be.
If my request for pseudonymity sounds strange given my lax infosec:
- I'm more worried about the consequences of language models in general than my own case, and
- people have made a much bigger deal over far less identifying information[4].
[1]: https://arxiv.org/abs/2012.07805
[2]: https://ai.googleblog.com/2020/12/privacy-considerations-in-...
[3]: https://www.theregister.com/2021/03/18/openai_gpt3_data/
[4]: https://en.wikipedia.org/wiki/Slate_Star_Codex#New_York_Time...
As a purely practical matter -- again, not going into whether this is how things should be, merely how they do be -- it is futile to want the internet as a whole to have a concept of privacy, or to respect the concept of a "digital personal space". If your phone number or other PII has ever been associated with your identity, that association will be in place indefinitely and is probably available on multiple data broker sites.
The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.
I learned this when setting up a Disqus ID. I wanted to comment on a blog post, and started to set up an account.
After I started the process, it came back with a list of random posts from around the Internet (some very old) and said "Are these yours? If so, would you like to associate them with your account?"
I freaked. Many of them were outright troll comments (I was not always the haloed saint that you see before you) that I had sworn were posted anonymously. They came from many different places (including DejaNews). I have no idea how Disqus found them.
Every single one of them was mine. Many were ones that I had sworn were dead and buried in a deep grave in the mountains.
Needless to say, I do not have a Disqus ID.
Being non-anonymous means that I need to behave myself online. I come across as a bit of a stuffy bore, but I suspect my IRL persona is that way as well.
That's OK.
It’s not okay to be tracked so thoroughly that people stop feeling they can explore controversy online.
That's okay, as long as you aren't a member of any persecuted minority, and as long as you don't have any interesting political views to share.
If I'd need full privacy, I'd have to add many more levels of security in my daily life that I don't find necessary. I just don't want people (or a SWAT team) to show up at my door because I triggered someone on the internet. That's why I post from multiple different accounts on different platforms. Though, I'm sure, in the future some form of AI will be able to link them all based on writing style and similarity of content of my posts. Guess I'll have to find another way to remain somewhat anonymous then.
https://www.optery.com/
It’s a YC company. My only affiliation is that I’m a customer.
I have a discount code if anyone is interested; I wasn’t sure if I could just paste it in the comments.
Good luck with that.
(1) I seem to remember a court case somewhere on the planet in the last months where lack of resistance was deemed indicative of consensual intercourse. Which is not even remotely acceptable. But I digress.
It's quite another thing for my name to be auto-completed by the most popular publicly available language model. That I'm less OK with, and I'm sure other people will find absolutely despicable.
We have GDPR and Right to Be Forgotten for a reason.
The sentences that stuck out to me are: “If your phone number or other PII has ever been associated with your identity, that association will be in place indefinitely and is probably available on multiple data broker sites.
The best way to be anonymous on the internet is to be anonymous, which means posting without any name or identifier at all. If that isn't practical, then using a non-meaningful pseudonym and not posting anything personally identifiable is recommended.”
Not for me. It took until page 3 for just my first name to appear. If somebody is looking through past GitHub commits, that's already a high enough barrier for me.
I only partially agree with your conclusion. Asking people to maintain total anonymity always, with any slips punishable by permanent publication of that PII, might be the current status quo, but is not where we as society want to head.
Another early result in DDG is a profile on deviantart, which you may not want linked to your professional identity (or maybe you do).
Your steam community page has a list of hundreds of games you own.
Fundamentally, your problem isn't so much that your GitHub account links to your name; it's that you use the same identifier across the web, one that isn't common like "neo", from "interesting" sites like deviantart to more normal ones like ubuntuforums.
You've removed your CV from your website, but it's still in the Internet Archive. And do you really want your CV hidden? You've got a good portfolio of work on the internet.
To me, the lack of separation between your names is far more of a challenge to your anonymity, especially when you call it out by posting something like this under that nom de plume. You have multiple aspects of your life that you can present in different ways; choosing a single unique nickname links those together. Is that really what you want, even if your real name weren't connected to it?
You can't "put the genie back in the bottle". It's out there, the Internet remembers forever.
A third approach is using a word that means something and thus is not unique at all.
Unique strings for usernames means lots of accurate hits. If you google mine, there will be lots of hits but none are me.
Of course, this doesn’t account for “the crazies” who could more easily harass me in my physical life simply because they’re mad I won an online game or the like. Thankfully I haven’t had to deal with such a situation, but I suspect that may be a consequence of avoiding inflammatory back-and-forths and highly political discussions, where reduced anonymity may invite those attacks.
It's also better to use a username you copied from someone else; that way, if people find links, they find someone else entirely.
Going on a tangent here but I've started seeing more "do be" used lately. However, it doesn't seem right for some reason I can't pinpoint (English is not my first language).
Is it from a dialect?
It's an African American idiom which has bled into Gen Z vernacular, from what I've seen.
https://www.optery.com/
I’m a satisfied customer
Obviously it's a little paranoid and arrogant to assume that anyone cares enough to go through my comments, but occasionally, on websites like this and Reddit, I will just outright lie about where I'm from, or what my age, gender, ethnicity, or sexuality is.
Companies are building and selling GPT-3 with 6 billion parameters, and one of those “parameters” seems to be OP’s username and his “strange” two-word last name.
If models grow bigger, they will potentially contain personal information about every one of us.
If you can get yourself removed from search indices, shouldn’t there be a way for AI models, too?
Another thought: do we need new licenses (GPL, MIT, etc.) which disallow the use for (for-profit) AI training?
The input datasets should be managed as per GDPR/CA regulations, with clear flags protecting privacy of EU citizens and CA residents. And any derived models should propagate these labels and not allow querying information violating these regulations.
If GitHub Copilot or the GPT-3/4 models were developed without these regulations in mind, these models should be retrained.
Yes, it is a hard research problem. Yet, there is no reason these models should be allowed to violate privacy in worse ways than traditional software.
If you try to remove information from a neural network model, it can still hold that information in forms you may not even think of; in language models, for example, the same thing can be described with different words.
And on the other hand, removing one thing may affect the model's performance on other, unrelated things too.
I don't think that we need new licenses, but probably open source projects need a better way to enforce them.
E.g. Copilot just ignores the licensing issues, although I can imagine a solution with a few different models that return code for different purposes. (Like one model returns everything, and that code can be used safely only for learning or hobby projects. Another model returns GPL code. And a third model returns code compatible with commercial or permissive open source projects.)
Or the model spits out the licence(s) of the code as well, though I'm not sure if this is technically possible.
The only way to be completely sure of removing information would be to re-train the model without that data.
Absolutely yes!
I would expect that it would take considerable effort to get this information removed from Google (you would have to write to them with a request under GDPR or similar and have them add a content filter) and I don't see why the same effort wouldn't allow you to get removed from GPT-3 (which is only accessible via a web API, so a similar filter could be added).
Imagine, for example, that you were falsely arrested for murder and then cleared of the crime.
It's very likely this would kill your career because employers Googling you would see the articles about your arrest.
In Europe, you would have a right to hide these articles from search engines.
I'm not going to take this in a political direction, but make of that what you will.
The first is asking a website owner to delete data they collected on you. That doesn't really apply here. The places this person's name is published are his own website that has this username as its url, his own GitHub repos, and published papers of his that were also on his website. No GDPR request is necessary to remove his name from these places because he already owns that data. As seen, he has already started to delete it himself.
The second is asking search engines to delist a result. As far as I understand, this usually has to involve information that is otherwise meant to be scrubbed from public record, like a newspaper article about a conviction that was eventually sealed. You can't ask Google to not index a scientific journal you published to or your public GitHub repos.
There are, of course, limits to this thanks to public interest exceptions. I don't believe Prince Andrew can ask Google to de-index anything associating him with Jeffrey Epstein. The public has a right to know, too.
In this guy's case, he really seems to be straddling a line. He contributed to open source projects under his real name linking to a GitHub repo with the same username he seems to reuse everywhere, including here. He also has a website where the url is that username, and it contained his CV with his real name on it, along with a publication history where every publication uses his real name. Is it reasonable to do those things and then ask Google and OpenAI not to associate the username with your real name?
At what point are you some regular Joe with a real grievance, and at what point are you Ian Murdock complaining that GPT knows you're the Ian associated with Debian?
They could:
1. Set up a content filter that strips OP's name from the output. OpenAI would still need to keep a record of the name, exposing it to leaks.
2. Remove the name from the dataset and retrain the model, which is obviously infeasible with each GDPR request.
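A minimal sketch of what option 1 could look like, as a post-processing step over generated text (the name list and function are hypothetical, and a literal-match filter ignores paraphrases and misspellings, which is exactly the weakness raised elsewhere in the thread):

```python
import re

# Hypothetical deny-list of names under removal requests. Note the
# irony from option 1 above: the filter itself must keep a record of
# the protected name, which is its own leak risk.
PROTECTED_NAMES = ["Jane Q. Public"]

def redact(model_output: str) -> str:
    """Replace protected names in generated text before returning it."""
    for name in PROTECTED_NAMES:
        # Case-insensitive literal match; a real filter would also have
        # to catch partial names, misspellings, and paraphrases.
        model_output = re.sub(re.escape(name), "[REDACTED]",
                              model_output, flags=re.IGNORECASE)
    return model_output

print(redact("This library was written by jane q. public in 2019."))
# → This library was written by [REDACTED] in 2019.
```

This is cheap enough to run on every API response, which is presumably why phone-number filtering was floated as feasible, but it only works for PII that can be enumerated as literal strings.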
I expect there are other instances where it is impractical or impossible to completely forget someone's data upon a request. Does Google send people spelunking into cold storage archives and actually destroy tapes (while migrating the data that is not supposed to be erased) every time they receive a request?
I have to say, playing with GPT-3 has been a mind-blowing experience this week, and you should all try it.
The most striking point was discovering that if I give it texts from my own chats, or copy paste in RFPs, and ask it to write lines for me, it’s better at sounding like a normal person than I am.
A stock example was “write a tagline for an ice cream shop”. We tried changing it a bit, and I’ll give you some of its punchlines.
“Write a tagline for an ice cream shop run by Bruce Wayne.” Result: “the only thing better than justice is ice cream”
“… run by an SCP”: “The SCP Ice Cream Shop: the only place where you can enjoy ice cream and fear for your life!”
“… run by Saddam Hussein”: “the best ice cream in the world, made by the worst man in the world!”
One thing to watch out for, though, is that it is not self-aware at all (at least in a practical sense) and can just make things up. For example, we tried giving it my daughter’s homework reading comprehension questions on the book “W pustyni i w puszczy” and it gave cogent, plausible, and totally wrong answers that it made up on the spot. It would seem it hadn’t been given the book, and would have got an F.
And it can’t speak for itself. I can ask it directly “have you read Tractatus”, and it will insist “no, never”, but it knows it front and back like a scholar.
So never blindly trust it ;)
If I can do this locally with some existing kit, I would love to hear your recommendation.
Right, this is why opsec is something that you must always be doing.
Anything you say can be preserved forever.
Better to use short-lived throwaway identities, and leave yourself the power of combining them later, than to start with one long-lived identity and find yourself unable to split it up.
It's inconvenient in real life that I'm expected to use my legal identity for everything. If I go to group therapy for an embarrassing personal problem, someone there can look me up because everyone is using real names. I don't like it.
If we created an identity that is completely different from our real identity when we were 13, great.
If not, that becomes a problem without an actual solution especially in the age of Internet archives.
I joke with them that if they googled my name (somewhat unique) you'd find 3-5 other people - none of whom look at all like me. Any hits I have are far far below the fold.
> Exercising Your Rights: California residents can exercise the above privacy rights by emailing us at: support@openai.com.
If you happen to be in California (or even if you are not) it might be worth trying to go through their support channel.
I'm also not a California resident, but I am under GDPR, which I understand is similarly strong. I'll try emailing them and see where it goes.
[1] https://openai.com/privacy/
> I don't care much about my name being public, but I don't know what else it might have memorized (political affiliations? Sexual preferences? Posts from 13-year old me?).
Combine this with
https://news.ycombinator.com/item?id=28216733
https://news.ycombinator.com/item?id=27622100
Google fuck-ups are much, much more impactful than you'd expect because people have come to trust the information Google provides so automatically. This example is being invoked as comedy, but I see people do it regularly:
https://youtu.be/iO8la7BZUlA?t=178
So a bigger problem isn't what GPT-3 can memorize, but what associations it may decide to toss out there that people will treat as true facts.
Now think about the amount of work it takes to discover problems. It's wild that you have to Google your own name every once in a while to see what's being turned up and make sure you're not being misrepresented, but that's not too much work. GPT-3 output, on the other hand, is elicited very contextually. It's not hard to imagine that <There is a Hristo Georgiev who sold Centroida and moved to Zurich> and <There is a Hristo Georgiev who murdered five women> pop up as <Hristo Georgiev, who sold Centroida and moved to Zurich, had murdered five women.> only under certain circumstances that you can't hope to be able to exhaustively discover.
From a personal angle: My birth name is also the pen name of an erotic fiction author. Hazy associations popping up in generated text could go quite poorly for me.
I didn’t anticipate the use case of GPT being used by debt collection agencies to tirelessly track down targets.
It will be a new type of debtors’ prison, where any leak of enough personally identifying facets to the internet will string together a mosaic of the target, such that the AI sends them calls, SMS, Tinder DMs, etc. until they pay and are released from the digital targeting system.