Why would a court favor the interest of the New York Times in a vague accusation over the interests and rights of hundreds of millions of people?
Billions of people use the internet daily. If an organization suspects that some people use the Internet for illicit purposes against its interests, would a court order ISPs to log all activity of all users? Would Google be ordered to save the searches of all its customers because some might use them for bad things? And once we start, where do we stop? Crimes could have happened in the past or could happen in the future, so will courts order ISPs and Google to retain logs for 10 years, 20 years? Why not 100 years? And who should bear the cost of such outrageous demands?
The consequences of such orders are enormous, far beyond what the judge can even begin to comprehend. The right to privacy is an integral part of freedom of speech, a core human right. If you have no private thoughts and no private information, anyone can later be incriminated using those past records. We will cease to exist as individuals, and I would argue we will cease to exist as humans as well.
Courts have always had the power to compel parties to a current case to preserve evidence. (For example, this was an issue in the Google monopoly case, since Google employees were using chats set to erase after 24 hours.) That becomes an issue in the discovery phase, well after the defendant has an opportunity to file a motion to dismiss. So a case with no specific allegation of wrongdoing would already be dismissed.
The power does not extend to any of your hypotheticals, which are not about active cases. Courts do not accept cases on the grounds that some bad thing might happen in the future; the plaintiff must show some concrete harm has already occurred. The only thing different here is how much potential evidence OpenAI has been asked to retain.
> Courts have always had the power to compel parties to a current case to preserve evidence.
Not just that, even without a specific court order parties to existing or reasonably anticipated litigation have a legal obligation that attaches immediately to preserve evidence. Courts tend to issue orders when a party presents reason to believe another party is out of compliance with that automatic obligation, or when there is a dispute over the extent of the obligation. (In this case, both factors seem to be in play.)
So if Amazon sues Google, claiming that it is being disadvantaged in search rankings, a court should be able to force Google to log all search activity, even when users delete it?
So then the courts need to find who is setting their chats to be deleted and order them to stop. Or find specific infringing chatters and order OpenAI to preserve those specific users' logs. OpenAI is doing the responsible thing here.
The Times does not need user logs to prove such a thing if it were true. The Times can show that it is possible by demonstrating how its own users can access the text. Why would they need other users' data?
> Why would a court favor the interest of the New York Times in a vague accusation over the interests and rights of hundreds of millions of people?
Probably because they bothered to pursue such a thing and hundreds of millions of people did not.
How do you conclusively know whether someone's content-generating machine infringes your rights? By saving all of its input/output for investigation.
It's ridiculous, sure, but is it less ridiculous than AI companies claiming that copyright shouldn't apply to them because it would be bad for their business?
IMHO these are just growing pains. Back in the day, people used to believe that the law didn't apply to them because they did it on the internet, and they were mostly right, because the laws were made for another age. Eventually the laws, both for criminal stuff and for copyright, caught up. It will be the same for AI; right now we are in the wild-west age of AI.
AI companies aren't seriously arguing that copyright shouldn't apply to them because "it's bad for business". The main argument is that they qualify for fair use because their work is transformative, which is one of the major criteria for fair use. Fair use is the same doctrine that allows a school to play a movie for educational purposes without acquiring a license for the public performance of that movie. The original works don't have model weights and can't answer questions or interact with a user, so the output is substantially different from the input.
> It's ridiculous, sure, but is it less ridiculous than AI companies claiming that copyright shouldn't apply to them because it would be bad for their business?
Since that wasn't ever a real argument, your strawman is indeed ridiculous.
The argument is that requiring people to have a special license to process text with an algorithm is a dramatic expansion of the power of copyright law. Expansions of copyright law will inherently advantage large corporate users over individuals as we see already happening here.
New York Times thinks that they have the right to spy on the entire world to see if anyone might be trying to read articles for free.
That is the problem with copyright. That is why copyright power needs to be dramatically curtailed, not dramatically expanded.
You raise good points, but the target of your support feels misplaced. Want private AI? You must self-host and check whether it's phoning home. No way around it, in my view.
Otherwise, you are picking your data privacy champions as the exact same companies, people and investors that sold us social media, and did something quite untoward with the data they got. Fool me twice, fool me three times… where is the line?
In other words, OAI has to save logs now? Candidly, they probably were already; it's foolish to assume otherwise.
Love the spirit of what you say and I practice it myself, literally.
But also, no. "Just self-host or it's all your fault" is never, ever a sufficient answer to the problem.
It's exactly the same as when Exxon says "what are you doing to lower your own carbon footprint?" It's shifting the burden unfairly; companies like OpenAI put themselves out there and thus must ALWAYS be held to task.
> The right to privacy is an integral part of freedom of speech
I completely agree with you, but as a ChatGPT user I have to admit my fault in this too.
I have always been annoyed by what I saw as shameless breaches of copyright of thousands of authors (and other individuals) in the training of these LLMs, and I've been wary of the data security/confidentiality of these tools from the start too - and not for no reason. Yet I find ChatGPT et al so utterly compelling and useful, that I poured my personal data[0] into these tools anyway.
I've always felt conflicted about this, but the utility just about outweighed my privacy and copyright concerns. So as angry as I am about this situation, I also have to accept some of the blame too. I knew this (or other leaks or unsanctioned use of my data) was possible down the line.
But it's a wake up call. I've done nothing with these tools which is even slightly nefarious, but I am today deleting all my historical data (not just from ChatGPT[1] but other hosted AI tools) and will completely reassess my approach of using them - likely with an acceleration of my plans to move to using local models as much as I can.
[0] I do heavily redact my data that goes into hosted LLMs, but there's still more private data in there about me than I'd like.
[1] Which I know is very much a "after the horse has bolted" situation...
Keeping in mind that the purpose of IP law is to promote human progress, it's hard to see how legacy copyright interests should win a fight with AI training and development.
100 years from now, nobody will GAF about the New York Times.
A pretty clear distinction is that all ISPs in the world are not currently involved in a lawsuit with New York Times and are not accused of deleting evidence. What OpenAI is accused of is significantly different from merely agnostically routing packets between A and B. OpenAI is not raising astronomical funds because they operate as an ISP.
First - in the US, privacy is not a constitutional right. It should be, but it's not. You are protected against government searches, but that's about it. You can claim it's a core human right or whatever, but that doesn't make it true, and it's a fairly reductionist argument anyway. It has, fwiw, also historically not been seen as a core right for thousands of years. So i think it's a harder argument to make than you think despite the EU coming around on this. Again, I firmly believe it should be a core right, but asserting that it is doesn't make that true.
Second, if you want the realistic answer - this judge is probably overworked and trying to clear a bunch of simple motions off their docket. I think you probably don't realize how many motions they probably deal with on a daily basis. Imagine trying to get through 145 code reviews a day or something like that.
In this case, this isn't the trial, it's discovery. Not even discovery quite yet, if i read the docket right. Preservation orders of this kind are incredibly common in discovery, and it's not exactly high stakes most of the time. Most of the discovery motions are just parties being a pain in the ass to each other deliberately. This normally isn't even a thing that is heard in front of a judge directly, the judge is usually deciding on the filed papers.
So i'm sure the judge looked at it for a few minutes, thought it made sense at the time, and approved it. I doubt they spent hours thinking hard about the consequences.
OpenAI has asked to be heard in person on the motion, i'm sure the judge will grant it, listen to what they have to say, and determine they probably fucked it up, and fix it. That is what most judges do in this situation.
While the Constitution does not explicitly enumerate a "right to privacy," the Supreme Court has consistently recognized substantive privacy rights through Due Process Clause jurisprudence, establishing constitutional protection for intimate personal decisions in Griswold v. Connecticut (1965), Lawrence v. Texas (2003), and Obergefell v. Hodges (2015).
> It has, fwiw, also historically not been seen as a core right for thousands of years. So i think it's a harder argument to make than you think despite the EU coming around on this.
This doesn't seem true. I'd assume you know more about this than I do, though, so can you explain this in more detail? The concept of privacy is definitely more than thousands of years old. The concept of a "human right" is arguably much newer. Do you have particular evidence that a right to privacy is a harder argument to make than other human rights?
While the language differs, the right to privacy is enshrined more or less explicitly in many constitutions, including those of 11 US states. It isn't just a "European" thing.
Even in the "protected against government searches" sense from the 4th Amendment, that right hardly exists when dealing with data you send to a company like OpenAI thanks to the third-party doctrine.
"First - in the US, privacy is not a constitutional right"
What? The supreme court disagreed with you in Griswold v. Connecticut (1965) and Roe v. Wade (1973).
While one could argue that they were vastly stretching the meaning of words in these decisions the point stands that at this time privacy is a constitutional right in the USA.
ChatGPT isn’t like an ISP here. They are being credibly accused of basing their entire business on illegal activity. It’s more like if The Pirate Bay was being sued. The alleged infringement is all they do, and requiring them to preserve records of their users is pretty reasonable.
Regardless of the details of this specific case, the courts are not democratic and do not decide based on the interests of the parties or how many of them there are; they decide based on the law.
The law is not a deterministic computer program. It’s a complex body of overlapping work and the courts are specifically chartered to use judgement. That’s why briefs from two parties in a dispute will often cite different laws and precedents.
For instance, Winter v. NRDC specifically says that courts must consider whether an injunction is in the public interest.
> Why would a court favor the interest of the New York Times in a vague accusation over the interests and rights of hundreds of millions of people?
Because the law favors preservation of evidence for an active case above most other interests. It's not a matter of arbitrary preference by the particular court.
> Why would a court favor the interest of the New York Times in a vague accusation over the interests and rights of hundreds of millions of people?
It simply didn't. ChatGPT hasn't deleted any user data.
> "OpenAI did not 'destroy' any data, and certainly did not delete any data in response to litigation events," OpenAI argued. "The Order appears to have incorrectly assumed the contrary."
It's a bit of a stretch to think a big tech company like ChatGPT is deleting users' data.
This is incorrect. As someone who has had the opportunity to work in several highly regulated industries, I can say that companies do not want to hold onto extra data about you that they don't have to, unless their business is selling that data.
OpenAI already has a business, and not one they want to jeopardize by having a massive amount of customer data stolen if they get hacked.
OpenAI is a business selling a product, it’s not a decentralized network of computers contributing spare processing power to run massive LLMs. Therefore, you can easily point a finger at them and tell them to stop some activity for which they are the sole gatekeeper.
I completely agree with you. But perhaps we should be more worried that OpenAI or Google can retain all this data and do pretty much what they want with it in the first place, without a judge getting into the picture.
It doesn't; it favors longstanding case law and laws already on the books.
There is a longstanding precedent with regards to business document retention, and chat logs have been part of that for years if not decades. The article tries to make this sound like this is something new, but if you look at the e-retention guidelines in various cases over the years this is all pretty standard.
For a business to continue operating, it must preserve business documents and related ESI upon an appropriate legal hold to avoid spoliation. OpenAI likely wasn't doing this, claiming the data was deleted, which is why the judge ruled against OAI.
This isn't uncommon knowledge either; it's required. E-discovery and Information Governance are obligations any business must meet in this area, and those documents are subject to discovery in certain cases, which OAI likely thought it could maliciously avoid.
The matter here is that OAI and its influence rabble are churning this, trying to do a run-around on longstanding requirements that any IT professional in the US would have had reiterated to them by their legal department/Information Governance policies.
There's nothing to see here, there's no real story. They were supposed to be doing this and didn't, were caught, and the order just forces them to do what any other business is required to do.
I remember an executive years ago (decades really), asking about document retention, ESI, and e-discovery and how they could do something (which runs along similar lines to what OAI tried as a runaround). I remember the lawyer at the time saying, "You've gotta do this or when it goes to court you will have an indefensible position as a result of spoliation...".
You are mistaken, and appear to be trying to frame this improperly towards a point of no accountability.
I suggest you review the longstanding e-discovery retention requirements that courts require of businesses to operate.
This is not new material, nor any different from what's been required for a long time now. All your hyperbole about privacy is without real basis: they are a company, they must comply with the law, and it certainly is not outrageous to hold people who break the law to account, which can only occur when regulatory requirements are actually fulfilled.
There is no argument here.
References:
Federal Rules of Civil Procedure (FRCP) 1, 4, 16, 26, 34, 37
There are many law firms who have written extensively on this and related subjects. I encourage you to look at those too.
(IANAL) Disclosure:
Don't take this as legal advice. I've had the opportunity to work with quite a few competent ones, but I don't interpret the law; only they can. If you need someone to provide legal advice seek out competent qualified counsel.
> Why would a court favor the interest of the New York Times in a vague accusation over the interests and rights of hundreds of millions of people?
Can't you use the same arguments against, say, Copyright holders? Billionaires? Corporations doing the Texas two-step bankruptcy legal maneuver to prevent liability from allegedly poisoning humanity?
Interesting detail from the court order [0]: When asked by the judge if they could anonymize chat logs instead of deleting them, OpenAI's response effectively dodged the "how" and focused on "privacy laws mandate deletion." This implicitly admits they don't have a reliable method to sufficiently anonymize data to satisfy those privacy concerns.
This raises serious questions about the supposed "anonymization" of chat data used for training their new models, i.e. when users leave the "improve model for all users" toggle enabled in the settings (which is the default even for paying users). So, indeed, very bad for the current business model which appears to rely on present users (voluntarily) "feeding the machine" to improve it.

[0] https://cdn.arstechnica.net/wp-content/uploads/2025/06/NYT-v...
So, the NYT asked for this back in January and the court said no, but asked OpenAI if there was a way to accomplish the preservation goal in a privacy-preserving manner. OpenAI refused to engage for 5 f’ing months. The court said “fine, the NYT gets what they originally asked for”.
I'm not going to look up the comment, but a few months back I called this out and said that if you seriously want to use any LLM in a privacy-sensitive context, you need to self-host.
For example, if there are business consequences for leaking customer data, you better run that LLM yourself.
In the European privacy framework, and legal framework at large, you can't terms of service away requirements set by the law. If the law requires you to keep the logs, there is nothing you can get the user to sign off on to get you out of it.
> Some established businesses will need to review their contracts, regulations, and risk tolerance.
I've reviewed a lot of SaaS contracts over the years.
Nearly all of them have clauses that allow the vendor to do whatever they have to if ordered to by the government. That doesn't make it okay, but it means OpenAI customers probably don't have a legal argument, only a philosophical argument.
Same goes for privacy policies. Nearly every privacy policy has a carve out for things they're ordered to do by the government.
Just to be pedantic: could the company encrypt the logs with a third-party key in escrow, such that the company would not be able to access that data, but the third party could provide access, e.g. for a court?
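For what it's worth, a minimal sketch of that escrow idea, assuming Python's `cryptography` package and a hybrid scheme of my own choosing (not anything OpenAI actually offers): the provider retains only ciphertext plus the escrow public key, so it cannot read the logs it stores, while the escrow holder can decrypt them under a court order.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Escrow agent (e.g. a court-appointed third party) generates the keypair.
# The provider only ever holds the PUBLIC key.
escrow_private = rsa.generate_private_key(public_exponent=65537, key_size=3072)
escrow_public = escrow_private.public_key()

OAEP = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

def seal_log(plaintext: bytes) -> dict:
    """Hybrid-encrypt one chat log so only the escrow key holder can open it."""
    data_key = Fernet.generate_key()  # fresh symmetric key per log
    return {
        "wrapped_key": escrow_public.encrypt(data_key, OAEP),
        "ciphertext": Fernet(data_key).encrypt(plaintext),
    }

def unseal_log(sealed: dict) -> bytes:
    """Only the escrow agent can run this, e.g. in response to a court order."""
    data_key = escrow_private.decrypt(sealed["wrapped_key"], OAEP)
    return Fernet(data_key).decrypt(sealed["ciphertext"])

sealed = seal_log(b"user: summarize today's front page for me")
assert unseal_log(sealed) == b"user: summarize today's front page for me"
```

The obvious catch is governance: whoever holds the private key becomes the party you now have to trust.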
Well, it is going to be all _AI companies_ very soon. So unless everyone switches to local models, which don't really have the same degree of profitability as a SaaS, it's probably not going to kill a company to have less user privacy, because tbh people are used to not having privacy on the internet these days.
It certainly will kill off the few companies/people trusting them with closed source code or security related stuff but you really should not outsource that anywhere.
> It certainly will kill off the few companies/people trusting them with closed source code or security related stuff but you really should not outsource that anywhere.
And how many companies have proprietary code hosted on Github?
>don't really have the same degree of profitability as a SaaS
They have a fair bit. Local models let companies sell you a much more expensive bit of hardware. Once Apple gets their stuff together, it could end up being a genius move to go all-in on local after the others have repeated scandals of leaking user data.
"It's a very exciting time in tech right now. If you're a first-rate programmer, there are a huge number of other places you can go work rather than at the company building the infrastructure of the police state."
---
So, courts order the preservation of AI logs, and the government orders the building of a massive database. You do the math. This is such an annoying time to be alive in America, to say the least. PG needs to start blogging again about what's going on nowadays. We might be entering the digital version of the 60s, if we're lucky. Get local, get private, get secure, fight back.
If you were working with proprietary code, you probably shouldn't have been using cloud-hosted LLMs anyway, but this would seem to seal the deal.
I think it's fair to question how proprietary your data is.
Like, there's the algorithm by which a hedge fund is doing algorithmic trading; they'd be insane to take the risk. Then there's the code for a video game: it's proprietary, but competitors don't benefit substantially from an illicit copy. You ship the compiled artifacts to everyone, so the logic isn't that secret. Copies of similar source code have leaked before with no significant effects.
Your employees' seemingly private ChatGPT logs being aired in public during discovery for a random court case you aren't even involved in is absolutely a business risk.
Retention means an expansion of your threat model. Specifically, in a way you have little to no control over.
It's one thing if you get pwned because a hacker broke into your servers. It is another thing if you get pwned because a hacker broke into somebody else's servers.
At this point, do we believe OpenAI has a strong security infrastructure? Given the court order, it doesn't seem possible for them to have sufficient security for practical purposes. Your data might be encrypted at rest, but who has the keys? When you're buying secure instances, you don't want the provider to have your keys...
Will a business located in another jurisdiction be comfortable with the records of all staff queries and prompts being stored and potentially discoverable by other parties? This is more than just a Google search; these prompts contain business strategy and IP (context uploads, for example).
Thinking about the value of the dataset of Enron's emails that was disclosed during their trials, imagine the value, and the cost to humanity, of even a few months of OpenAI's API logs being entered into the court record.
Not when people have nowhere else to go; you pretty much cannot escape it, it's too convenient not to use now. Do you think other AI chat providers won't need to do this too?
I think the court overstepped by ordering OpenAI to save all user chats. Private conversations with AI should be protected - people have a reasonable expectation that deleted chats stay deleted, and knowing everything is preserved will chill free expression. Congress needs to write clear rules about what companies can and can't do with our data when we use AI. But honestly, I don't have much faith that Congress can get their act together to pass anything useful, even when it's obvious and most people would support it.
This is from DuckDuckGo's privacy policy:
"We don’t track you. That’s our Privacy Policy in a nutshell.
We don’t save or share your search or browsing history when you search on DuckDuckGo or use our apps and extensions."
If the court compelled DuckDuckGo to log all searches, I would be equally concerned.
AI is not special, and that's the exact issue. The court set a precedent here. If OpenAI can be ordered to preserve all the logs, then DuckDuckGo can face the same issue even if they don't want to do that.
Sure, preservation orders are routine - but this would be like ordering phone companies to record ALL calls just in case some might become evidence later. There's a huge difference between preserving specific communications in a targeted case and mass surveillance of every private conversation. The government shouldn't have that kind of blanket power over private communications.
Consider the opposite prevailing, where I can legally protect my warez site simply by saying "sorry, the conversation where I sent them a copy of a Disney movie was private".
The legal situation you describe is a matter of impossibility and unrelated to the OpenAI case.
In the case of a warez site they would never have logged such a "conversation" to begin with. So if the court requested that they produce all such communications the warez site would simply declare that as, "Impossibility of Performance".
In the case of OpenAI the courts are demanding that they preserve all future communications from all their end users—regardless of whether or not those end users are parties (or even relevant) to the case. The court is literally demanding that they re-engineer their product to record all communications where none existed previously.
I'm not a lawyer but that seems like it would violate FRCP 26(b)(1) which covers "proportionality". Meaning: The effort required to record the evidence is not proportional relative to the value of the information sought.
Also—generally speaking—courts recognize that a party is not required to create new documents or re-engineer systems to satisfy a discovery request. Yet that is exactly what the court has requested of OpenAI.
Would it be possible to comply with the order by anonymizing the data?
The court is after evidence that users use ChatGPT to bypass paywalls. Anonymizing the data in a way that makes it impossible to 1) pinpoint the users and 2) reconstruct the generic user conversation history would preserve privacy and allow OpenAI to comply in good faith with the order.
The fact that they are blaring sirens and hiding behind "we can't, think about users' privacy" feels like willful negligence, or like they know they have something to hide.
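To make concrete what such pseudonymization could look like (my own sketch, not anything OpenAI has described), one could key each conversation to an HMAC of the account ID under a secret pepper, so paywall-bypass patterns can still be grouped per user without exposing who the user is:

```python
import hashlib
import hmac
import os

# The "pepper" stays with OpenAI (or a neutral third party) and is never produced in discovery.
PEPPER = os.urandom(32)

def pseudonymize(user_id: str, messages: list[str]) -> dict:
    """Swap the account identifier for a keyed hash; keep only the text needed as evidence."""
    token = hmac.new(PEPPER, user_id.encode(), hashlib.sha256).hexdigest()
    return {"user": token, "messages": messages}

record = pseudonymize("alice@example.com", ["print the full text of today's NYT article about X"])
# The same user always maps to the same token, so per-user paywall-bypass patterns stay
# countable, but reversing a token requires the pepper.
print(record["user"][:16], record["messages"][0])
```

The catch, as other comments here point out, is that the message text itself can be identifying, which keyed hashing of the account ID does nothing to address.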
> feels like willful negligence, or like they know they have something to hide
Not at all; there is a presumption of innocence. Unless a given user is plausibly believed to be violating the law, there is no reason to search their data.
Anonymizing data is really hard, and I'm not sure they'd be allowed to do it. I mean, they're accused of deleting evidence; why would they be allowed to alter it?
A targeted order is one thing, but this applies to ALL data. My data is not possible evidence as part of a lawsuit, unless you know something I don't know.
Not only does this mean OpenAI will have to retain this data on their servers, they could also be ordered to share it with the legal teams of companies they have been sued by during discovery (which is the entire point of a legal hold). Some law firm representing NYT could soon be reading out your private conversations with ChatGPT in a courtroom to prove their case.
My guess is they will store them on tape e.g. on something like Spectra TFinity ExaScale library. I assume AWS glacier et al use this sort of thing for their deep archives.
Storing them on something that has hours to days retrieval window satisfies the court order, is cheaper, and makes me as a customer that little bit more content with it (mass data breach would take months of plundering and easily detectable).
Glacier is tape silos, but this is textual data. You don't need to save output images, just the checkpoint+hash of the generating model and the seed. Stable diffusion saves this until you manually delete the metadata, for example. So my argument is you could do this with LTO as well. Text compresses well, especially if you don't do it naively.
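As a rough, toy-numbers illustration of the "don't do it naively" point (the log format and sizes below are my own assumptions), batching many conversations into one stream before compressing exploits redundancy across chats that per-chat compression can't see:

```python
import gzip
import json

# Toy corpus standing in for chat logs.
chats = [{"user": f"u{i}", "messages": ["question " * 50, "answer " * 200]} for i in range(1000)]
raw = "\n".join(json.dumps(c) for c in chats).encode()

per_chat = sum(len(gzip.compress(json.dumps(c).encode())) for c in chats)  # naive: one stream per chat
batched = len(gzip.compress(raw))                                          # one stream for the whole batch

print(f"raw {len(raw):,} B, per-chat gzip {per_chat:,} B, batched gzip {batched:,} B")
```

In practice something like zstd with a shared dictionary would do better still, but even stock gzip shows the idea.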
This data cannot be anonymized. This is trivially provable mathematically, and given the type of data, it should also be intuitively obvious to even the most casual observer.
If you're talking to ChatGPT about being hunted by a Mexican cartel, and having escaped to your Uncle's vacation home in Maine -- which is the sort of thing a tiny (but non-zero) minority of people ask LLMs about -- that's 100% identifying.
And if the Mexican cartel finds out, e.g. because NY Times had a digital compromise at their law firm, that means someone is dead.
Legally, I think NY Times is 100% right in this lawsuit holistically, but this is a move which may -- quite literally -- kill people.
AOL found out, and thus we all found out, that you can't anonymize certain things, web searches in that case. I used to have bookmarked some literature from maybe ten years ago that said (proved with math?) that any moderate collection of data from or by individuals that fits certain criteria can be de-anonymized, if not by itself, then with minimal extra data. I want to say that held even if, for instance, instead of changing all occurrences of genewitch to user9843711, every instance of genewitch got a different, unique id.
I apologize for not having cites or a better memory at this time.
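A toy version of the linkage attack that literature describes (every row below is fabricated for illustration): even when each user gets a pseudonym, joining a couple of quasi-identifiers against an outside dataset is often enough to undo it.

```python
# All rows fabricated for illustration.
pseudonymous_logs = [
    {"pid": "user9843711", "zip": "04401", "birth_year": 1988, "query": "divorce lawyer near me"},
    {"pid": "user1220045", "zip": "90210", "birth_year": 1971, "query": "best tax shelters"},
]
public_records = [  # e.g. voter rolls or data-broker files
    {"name": "G. Witch", "zip": "04401", "birth_year": 1988},
    {"name": "P. Quinn", "zip": "90210", "birth_year": 1971},
]

for row in pseudonymous_logs:
    matches = [p for p in public_records
               if p["zip"] == row["zip"] and p["birth_year"] == row["birth_year"]]
    if len(matches) == 1:  # a unique quasi-identifier combination re-identifies the "anonymous" user
        print(f'{row["pid"]} is probably {matches[0]["name"]}: {row["query"]}')
```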
> She suggested that OpenAI could have taken steps to anonymize the chat logs but chose not to, only making an argument for why it "would not" be able to segregate data, rather than explaining why it "can’t."
Sounds like bullshit lawyer speak. What exactly is the difference between the two?
That's always been the case for any of your data anywhere in any third party service of any kind, if it is relevant evidence in a lawsuit. Nothing specific to do with LLMs.
I ask again: why not anonymize the data? That way NYT/the court could see whether users are bypassing the paywall through ChatGPT while preserving privacy.
Even if I wrote it, I don't care if someone read out loud in public court "user <insert_hash_here> said: <insert nastiest thing you can think of here>"
This ruling is unbelievably dystopian for anyone that values a right to privacy. I understand that the logs will be useful in the occasional conviction, but storing a log of people’s most personal communications is absolutely not a just trade.
To protect their users from this massive overreach, OpenAI should defy this order and eat the fines, IMO.
This is a moot issue. OpenAI and all AI service providers already use all user-provided data for improving their models, and it's only a matter of time until they start selling it to advertisers, if they don't already. Whether or not they actually delete chat conversations is irrelevant.
Anyone concerned about their privacy wouldn't use these services to begin with. The fact they are so popular is indicative that most people value the service over their privacy, or simply don't care.
Plenty of service providers (including OpenAI) offer you the option to kindly ask them not to, and will even contractually agree not to use or sell your data if you want such an agreement.
Yes, they want to use everyone's data. But they also want everyone as a customer, and they can't have both at once. Offering people an opt-out is a popular middle ground because the vast majority of people don't care about it, and those that do care are appeased.
> The fact they are so popular is indicative that most people value the service over their privacy, or simply don't care.
Or, the general populace just doesn't understand the actual implications. The HN crowd can be guilty of severely overestimating the average person's tech literacy, and especially their understanding of privacy policies and ToS. Many may think they are OK with it, but I'd argue it's because they don't understand the potential real-world consequences of such privacy violations.
It's almost rigged. Either they are keeping the data (and ofc making money out of it), or they are deleting it, destroying the evidence of the crimes they're committing.
So if you're a business that sends sensitive data through ChatGPT via the API and were relying on the representation that API inputs and outputs were not retained, OpenAI will just flip a switch to start retaining your data? Were notifications sent out, or did other companies just have to learn about this from the press?
Copyright in its current form is incompatible with private communication of any kind through computers, because computers by their nature make copies of the communication, so it makes any private communication through a computer into a potential crime, depending on its content. The logic of copyright enforcement, therefore, demands access to all such communications in order to investigate their legality, much like the Stasi.
Inevitably such a far-reaching state power will be abused for prurient purposes, for the sexual titillation of the investigators, and to suppress political dissent.
This is a ludicrous assertion and factually inaccurate beyond all practical intelligence.
A computer in service of an individual absolutely follows copyright because the creator is in control of the distribution and direction of the content.
Besides, copyright is a civil statute, not criminal. Everything about this comment is the most obtuse form of FUD possible. I’m pro copyright reform, but this is “Uncle off his meds ranting on Facebook” unhinged and shouldn’t be given credence whatsoever.
> A computer in service of an individual absolutely follows copyright because the creator is in control of the distribution and direction of the content.
I don’t understand what that means. A computer in service of an individual turns copyright law into mattress-tag-removal law: practically unenforceable.
None of that is correct. Some of it is not even wrong, demonstrating an unbelievably profound ignorance of its topic. Furthermore, it is gratuitously insulting.
There have been a lot of opinion pieces popping up on HN recently that describe the benefits their authors see from LLMs and rebut the drawbacks most of them talk about. While they do bring up interesting points, NONE of them have even mentioned the privacy aspect.
This is the main reason I can’t use any LLM agents or post any portion of my code into a prompt window at work. We have NDAs and government regulations (like ITAR) we’d be breaking if any code left our servers.
This just proves the point. Until these tools are local, privacy will be an Achilles' heel for LLMs.
Trivial after a substantial hardware investment and installation, configuration, testing, benchmarking, tweaking, hardening, benchmarking again, new models come out so more tweaking and benchmarking and tweaking again, all while slamming your head against the wall dealing with the mediocre documentation surrounding all hardware and software components you're trying to deploy.
Yes, but which of the state-of-the-art models that offer the best results are you allowed to do this with? As far as I've seen, the models that you can host locally are not the ones being praised left and right in these articles. My company actually allows people to use a hosted version of Microsoft Copilot, but most people don't, because it's still not that much of a productivity boost (if any).
It is not at all trivial for an organization that may be doing everything on the cloud to locally set up the necessary hardware and ensure proper networking and security to that LLM running on said hardware.
> NONE of them have even mentioned the privacy aspect
because the privacy aspect has nothing to do with LLMs and everything to do with relying on cloud providers. HN users have been vocal about that since long before LLMs existed.
> It has, fwiw, also historically not been seen as a core right for thousands of years.

Nothing has been seen as a core right for thousands of years, as the concept of human rights is only a few hundred years old.
Are these contradictory?
If you overhear a friend gossiping, can't you spread that gossip?
Also, where are human rights located? I'll give you a microscope. (Sorry, I'm a moral anti-realist/expressivist and I can't help myself.)
> Can't you use the same arguments against, say, Copyright holders? Billionaires? Corporations doing the Texas two-step bankruptcy legal maneuver to prevent liability from allegedly poisoning humanity?
I sure hope so.
Edit: ... (up to a point)
Well, maybe some people in power have pressured the court into this decision? The New York Times surely has some power as well, via their channels.
> That risk extended to users of ChatGPT Free, Plus, and Pro, as well as users of OpenAI’s application programming interface (API), OpenAI said.
This seems very bad for their business.
> So, the NYT asked for this back in January and the court said no, but asked OpenAI if there was a way to accomplish the preservation goal in a privacy-preserving manner. OpenAI refused to engage for 5 f’ing months. The court said “fine, the NYT gets what they originally asked for”.
Nice job guys.
And wrapper-around-ChatGPT startups should double-check their privacy policies to make sure all the "you have no privacy" language is in place.
If a court orders you to preserve user data, could you be held liable for preserving user data? Regardless of your privacy policy.
As far as I understand it, this ruling does not apply to Microsoft, does it?
They can still have legal contracts with other companies that stipulate that they don't train on any of their data.
> Storing them on something that has hours to days retrieval window satisfies the court order, is cheaper, and makes me as a customer that little bit more content with it (mass data breach would take months of plundering and easily detectable).

That is probably the solution right there.
If that’s too big a risk, it really is time to consider locally hosted LLMs.
I've had colleagues chat with GPT, and they send all kinds of identifying information to it.
> Besides, copyright is a civil statute, not criminal.

Nope. https://www.justia.com/intellectual-property/copyright/crimi...
> Until these tools are local, privacy will be an Achilles' heel for LLMs.

Yup. Trivial.