lolinder · 2 years ago
There are two different questions at play here, and we need to be careful what we wish for.

The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.

The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.

Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications the user gives the tool. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use and which companies try, with varying degrees of success, to block.

I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do that so long as they're citing the source.

putlake · 2 years ago
It's funny I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.

But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.

There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? That might indicate where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them, and I would lose all the insights because I never get any traffic.

epolanski · 2 years ago
> they are decreasing the probability that this user would come to my content (via Google, for example).

Google has been providing summaries of stuff and hijacking traffic for ages.

I kid you not, in the tourism sector this has been a HUGE issue; we saw a 50%+ decrease in views when they started doing it.

We paid gazillions to write quality content for tourists about all sorts of different places, just so Google could put it on their homepage.

It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US wants unkillable monopolies in the tech sector, and we are all peons.

lolinder · 2 years ago
> A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.

If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?

briantakita · 2 years ago
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example).

Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.

The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?

> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.

This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.

anileated · 2 years ago
> I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy. But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable.

This appears to be self-contradictory. If you let an LLM be trained* on “all the books” (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege while you get zilch, but you won't even know they ever read the fruits of your research.

* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.

SpaghettiCthulu · 2 years ago
You're missing the part where Perplexity still makes a request each time it's asked about the URL. You still get the traffic!
rcthompson · 2 years ago
I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.
antoniojtorres · 2 years ago
This hits the point exactly. It's an extension of stuff like Google's zero-click results: they are regurgitating a website's content with no benefit to the website.

I would say, though, it feels like the training argument may ultimately lead to a similar outcome, though it's a bit more ideological and less tangible than regurgitating the results of a query. Services like ChatGPT are already being used as a Google replacement by many people, so long term it may reduce clicks from search as well.

richardatlarge · 2 years ago
I'm not sure if this is relevant, but I go to a lot of sites because Perplexity has them noted in its answer.
insane_dreamer · 2 years ago
This is why media publishers went behind paywalls to get away from Google News.
danlitt · 2 years ago
I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and presenting that processed article to the user) then they're just breaking the law.

I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.

gcanyon · 2 years ago
It seems self-evident to me that if a user tells a bot to go get a web page, robots.txt doesn't apply, and the bot shouldn't respect it. I understand others' concern that, as with Apple's reader and other similar tools, it's ethically debatable whether a site should be required to comply with the request, and spoofing an agent seems in dubious territory. I don't think a good answer has been proposed for this challenge, unfortunately.
lolinder · 2 years ago
> spoofing an agent seems in dubious territory.

Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.

The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.

[0] https://docs.perplexity.ai/docs/perplexitybot

buro9 · 2 years ago
The companies will scrape and internalise the "customer asked for this" requests... and slowly turn the latter into the former, or just use their own tool as the scraper.

No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.

These companies don't need to be given an inch.

lolinder · 2 years ago
> Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.

So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?

elicksaur · 2 years ago
This is exactly the concern, and there are a lot of comments just completely ignoring it or willfully conflating the two.

Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.

nradov · 2 years ago
Why should it be possible to stop an LLM from training itself on your data? If you want to restrict access to data then don't post it on a public website. It's easy enough to require registration and agreement to licensing terms for access.

It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.

marcus0x62 · 2 years ago
Because if I run a server - at my own expense - I get to use information provided by the client to determine what, if any, response to provide? This isn’t a very difficult concept to grasp.
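
To make that concrete, here's a minimal sketch of that decision logic (Flask; the blocklist tokens are just illustrative choices, borrowing the crawler names mentioned elsewhere in this thread, not a vetted list):

```python
# Sketch: a server using client-supplied info (the User-Agent header)
# to decide what, if any, response to provide. Tokens are illustrative.
from flask import Flask, abort, request

app = Flask(__name__)

# Assumption: tokens the server owner has chosen to block.
BLOCKED_UA_TOKENS = ("PerplexityBot", "GPTBot")

@app.route("/<path:page>")
def serve(page):
    ua = request.headers.get("User-Agent", "")
    if any(token in ua for token in BLOCKED_UA_TOKENS):
        abort(403)  # my server, my rules
    return f"content for {page}"
```

Of course, as the rest of the thread points out, this only works against clients that identify themselves honestly.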
reissbaker · 2 years ago
To follow onto this:

If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?

And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?

baxtr · 2 years ago
What will happen if:

Website owners decide to stop publishing because it’s not rewarded by a real human visit anymore?

Then Perplexity and the like won't have new information to train their models on, and no sites from which to answer the questions.

I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.

This is not the case with Perplexity.

lolinder · 2 years ago
What is a "visit"? TFA demonstrates that they got a hit on their site, that's how they got the logs.

Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?

Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?

What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?

bko · 2 years ago
How would an LLM training on your writing reduce your reward?

I guess if you're doing it for a living, sure, but most content I consume online is created without incentive (social media, blogs, Stack Overflow).

I write a fair amount and have been for a few years. I like to play with ideas. If an LLM learned from my writing and it helped me propagate my ideas, I'd be happy. I lose out on social status imaginary internet points, but I honestly don't care much for them.

The craziest one is the Stack Overflow contributors. They write answers for free to help people become better programmers, but they're mad an LLM will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?

ADeerAppeared · 2 years ago
> I think there is a real content dilemma here at work

It's not really a dilemma.

This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying its market value and driving the original author out of business. Literally what copyright was invented to prevent.

It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has always been the embedded (summary of the) content, which people don't click through. That is the entire point of those features for platforms like Facebook: keeping users on Facebook and not leaving.

This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.

By all practical considerations, Perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".

> The incentives of Google and website owners were more or less aligned.

The key difference here is that linking is and always has been fine. Google's Books search feature is fair use because the purpose is to send you to the book you searched for, not to substitute for the book.

Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.

> What will happen if:

What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.

And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.

It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.

nradov · 2 years ago
A lot of the public website content targeted towards consumers is already SEO slop trying to sell you something or maximize ad revenue. If those website owners decide to stop publishing due to lack of real human visits then little of value will be lost. Much of the content with real value for consumers has already moved to sites that require registration (and sometimes payment) for access.

For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.

gpm · 2 years ago
> The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.

This feels like the fundamental core component of what copyright allows you to forbid.

> Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local

Which is a huge difference. The latter is someone asking for a copy of my content (from someone with a valid license, myself), and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). The former adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".

I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law all together.

lolinder · 2 years ago
I'm curious to know where you draw the line for what constitutes legitimate manipulation by a person and when it becomes distribution.

I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack I'm safe.

What if I use libraries written by other people for the TCP/IP and HTTP part?

What if I use a whole FOSS web browser?

What about a paid local web browser?

What if I run a script that I wrote on a cloud server?

What if I then allow other people to download and use that script on their own cloud servers?

What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?

What if I offer it for free to the general public?

What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?

Can you help me to understand where exactly I crossed the line?

reissbaker · 2 years ago
I actually don't see the legal distinction here. A browser with an ad blocker is also:

1. Asking for a copy of your content

2. Manipulating the content

3. Redistributing the content to the end-user who requested it

Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).

I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!

I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.

I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.

zzo38computer · 2 years ago
If the user specifically asks for a file and asks a computer program to process it in a specific way, it should be permitted, regardless of user-agent spoofing (although user-agent spoofing should normally only be done when the user specifically requests it; it should not happen automatically). However, this is better when using FOSS and/or local programs (or if the user is accessing them through a proxy, VPN, Tor, etc). Furthermore, any company that provides such services should not use unethical business practices, false advertising, etc, to do so.

If the company wants a copy of the files for its own use, then that is a bit different. When accessing a large number of files at once, robots.txt is useful for blocking it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyway), then they might do so. However, even in this case, they still should not use unethical business practices, false advertising, etc; and they should also avoid user-agent spoofing.

(In this case, the user-agent spoofing does not seem to be deliberate, since it uses a headless browser. They should still change it, though; probably by keeping the user-agent string but adding an extra part such as "Perplexity", to indicate what it is, in addition to the headless browser.)

EGreg · 2 years ago
A user-agent requests the file using your credentials, eg a cookie or public key signature.

It is transforming the content for you, an authorized party.

That is not the same as then making derivative copies and distributing the information to others without paying. For example, it's as if I bought a ticket to a show, taped it, and then distributed it to everyone, disregarding that the show prohibited this.

If I shared my Netflix password with up to 5 others, at least I could argue that they are part of my “family” or something. But to unlimited numbers of people? Why would they pay for Netflix, and how would the shows get made?

I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem

__loam · 2 years ago
The problem Perplexity has that ad blockers don't is that they're an independent site that is publishing content based on work they didn't produce. That runs afoul of both copyright law and Section 230, which lets sites like Google and Facebook operate. That's pretty different from an ad blocker running on your local machine. The ad blocker isn't publishing the page it edited for you.
lolinder · 2 years ago
> they're an independent site that is publishing content based on work they didn't produce.

What distinguishes these two situations?

* User asks proprietary web browser to fetch content and render it a specific way, which it does

* User asks proprietary web service to fetch content and render it a specific way, which it does

The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?

Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.

neycoda · 2 years ago
AI scraping without permission could give corporations a loophole: Congress argues that a law against it is impossible to enforce, and that it's easier to just pass laws allowing corporations to close-source their websites (yes, HTML, CSS, JavaScript, etc). I think what's most likely to happen is that nothing will fundamentally change: browsers will continue showing page source, and AI will continue scraping source content without permission.
immibis · 2 years ago
You can poison all your images with Glaze and Nightshade. Then you don't have to stop them from using them - they have to stop themselves from using them or their image generator will be useless. I don't know if there's a comparable system for text. If there was, it would probably be noticeable to humans.
cal85 · 2 years ago
> can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.

I’m not saying you’re wrong, but why? And what do you mean by “your data” here?

bhelkey · 2 years ago
> And what do you mean by “your data” here?

The website that they created.

__loam · 2 years ago
By "my data" he means the data a site spent time and money to create and publish.

daft_pink · 2 years ago
The other question is once the user directs the ai to read the website instead of crawling will the site then be fair game for training?
EGreg · 2 years ago
Let’s differentiate between:

1) a user-agent which makes an authenticated and authorized request for data and delivers it to the user

2) a user who then turns around and distributes the data or its derivatives to users in an unauthorized manner

A “dumber” example would be whether I can indefinitely cache and index most of the information from the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or Street View photo information that Google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?

THE REQUIREMENT TO OPEN SOURCE WEIGHTS

Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…

…that would be almost exactly as if I had made my code available under the Affero GPL license, and someone took my code and incorporated it into backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open sourcing to the public. (Alternatively, they’d have to pay damages in a class action lawsuit and stop using the tainted backend software or weights when serving all those people.)

TECHNICAL ANALYSIS

The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.

Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.

If someone wrote an article, your reader could transform it based on your authenticated request, provided your user has an authorized subscription.

LEGAL ANALYSIS

Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access” which landed someone like Aaron Swartz in jail.

In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.

There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...

Legally, you can grant access to people subject to a certain license (eg Creative Commons Share Alike) and then any derived content must have its weights opened. Similar to, say, Affero GPL license for derivative software.

treyd · 2 years ago
Yeah, if people get too aggressive about blocking, then we're going to end up with a scenario where the web-request functionality is implemented by telling the chatbot user's browser to make the fetch and submit it back to the server for processing, making it largely indistinguishable from the user making the query themselves. If CORS gets in the way, they can just prompt users to install a browser extension to use the web-request functionality.
lofaszvanitt · 2 years ago
Citing the source doesn't bring you, the owner of the site, valuable data: when was your data accessed, who accessed it, from where, at what time, on what device, etc. It brings data to the LLM's owner, and you get

N O T H I N G.

Could you change the way printed news magazines showed their content? No. Then, why is that a problem?

Btw nobody clicks on sources. NOBODY.

bluish29 · 2 years ago
> Btw nobody clicks on sources. NOBODY.

I always click on sources to verify what, in this case, an LLM says. I also hear that claim a lot about people not reading sources (before LLMs it was video content with references), but I always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience that people (including me) project as the generic behavior of all people?

insane_dreamer · 2 years ago
> I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it.

I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.

zzo38computer · 2 years ago
Users should have the right to reformat their own copy of others' content (automatically as well as manually). However, if they then redistribute the reformatted copy, they should not be allowed to claim that it has the same formatting as the original, because it is not the same as the original.
dheera · 2 years ago
Personally I think AI is a major win for accessibility, and we should not be preventing people from accessing information in the way that is best suited for them.

Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce unnecessary popups and clicks on a website to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.

The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.

danlitt · 2 years ago
> The way I see it, AI is not a robot and doesn't need to look at robots.txt

I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??

It doesn't "need" to look at robots.txt because nothing does.

maxrmk · 2 years ago
The author has misunderstood when the perplexity user agent applies.

Web site owners shouldn’t dictate what browser users can access their site with - whether that’s chrome, firefox, or something totally different like perplexity.

When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.

If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.

JimDabell · 2 years ago
Just to go into a little more detail on this, because the article and most of the conversation here are based on a big misunderstanding:

robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.

Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.

The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.

If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.

This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
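
You can see the same split if you script it yourself. A rough sketch with Python's standard library (URLs are placeholders):

```python
import urllib.request
import urllib.robotparser

# One-off fetch on a user's behalf: no robots.txt lookup at all,
# exactly like plain `wget`.
page = urllib.request.urlopen("https://www.example.com").read()

# Crawling: before following links to subsequent pages, a well-behaved
# crawler consults robots.txt, like `wget --recursive` does.
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
if rp.can_fetch("MyCrawler/1.0", "https://www.example.com/linked-page"):
    linked = urllib.request.urlopen("https://www.example.com/linked-page").read()
```

The robots.txt check belongs to the second mode only; nothing in the standard obliges a client making a single user-requested fetch to perform it.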

mattigames · 2 years ago
It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers. If it's only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?
rknightuk · 2 years ago
It’s not retrieving a web page though, is it? It’s retrieving the content then manipulating it. Perplexity isn’t a web browser.
dewey · 2 years ago
> It’s retrieving the content then manipulating it. Perplexity isn’t a web browser.

So a browser with an ad-blocker that's removing / manipulating elements on the page isn't a browser? What about reader mode?

TeMPOraL · 2 years ago
Yes, that's literally why "user agent" is called "user agent". It's a program that acts in place and in the interest of its user, and this in particular always included allowing the user to choose what will or won't be rendered, and how. It's not up to the server what the client does with the response they get.
LeifCarrotson · 2 years ago
Retrieving the content of a web page then manipulating it is basically the definition of a web browser.
manojlds · 2 years ago
So if you have a browser with Greasemonkey-like scripts running on it, then it's not a browser? What about the AI summary feature available in Edge now?

maxrmk · 2 years ago
I’d consider it a web browser but that’s a vague enough term that I can understand seeing it differently.

I’d be disappointed if it became common to block clients like this, though. To me this feels like blocking Google Chrome because you don’t want to show up in Google search (which is totally fine to want, for the record). Unnecessarily user-hostile because you don’t approve of the company behind the client.

JoosToopit · 2 years ago
UA is just a signature a client sends. It's up to the client to use the signature they want to use.
mattigames · 2 years ago
And it's up to the client to send as many requests as it sees fit; it's still called a DDoS attack when overdone, regardless of the client's freedom to do it.
wonnage · 2 years ago
Setting a correct user agent isn't required anyway; you just do it to not be an asshole. Robots.txt is an optional standard.

The article is just calling Perplexity out for some asshole behavior, it's not that complicated

It's clear they know they're engaging in poor behavior, too: they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome. Folks who trust them could then have blocked the training UA but allowed the alternative.

wrs · 2 years ago
I don’t think we should lump together “AI company scraping a website to train their base model” and “AI tool retrieving a web page because I asked it to”. At least, those should be two different user agents so you have the option to block one and not the other.
condiment · 2 years ago
If an AI agent is performing a search on behalf of a user, should its user agent be the same as that user’s?
Filligree · 2 years ago
Users don’t have user agent strings, user agents do.
gumby · 2 years ago
I think that’s the ideal as the server may provide different data depending on UA.

Does anyone actually do this, though?

lofaszvanitt · 2 years ago
It should, erm sorry, must pass all the info it got from the user to you, so you would have an idea who wanted info from your site.
KomoD · 2 years ago
I agree with that, but I also think that they should at least identify themselves instead of using a generic user agent.
BriggyDwiggs42 · 2 years ago
I’d rather share less information than more to any site I visit. Why does a user want to share that info?
beefnugs · 2 years ago
I want all people in the world with a dirty arse to change their user agent so I can not serve my website to dirty arses.
mrweasel · 2 years ago
Personally I don't even think that's the issue. I'd prefer a correct user-agent; that's just common decency and shouldn't be an issue for most.

What I do expect the AI companies to do is to check the license of the content they scrape and follow it. Let's say I run a blog and I have a CC BY-NC 4.0 license. You can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC, that's fine; but for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?

lolinder · 2 years ago
> I'd prefer a correct user-agent; that's just common decency and shouldn't be an issue for most.

Tell that to the Chrome team. And the Safari team. And the Opera team. [0]

[0] https://webaim.org/blog/user-agent-string-history/

sebzim4500 · 2 years ago
More than this, I'd rather use a tool which lets me fake the user agent like I can in my browser.
supriyo-biswas · 2 years ago
And yet, OpenAI blocks both of these activities if you happen to block either "GPTBot" (the ingest crawler) or "ChatGPT-User" (retrieval during chat).
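
For reference, opting out of both looks something like this in robots.txt (the two tokens are the ones OpenAI documents; whether blocking one also affects the other is exactly the behavior in question):

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /
```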
xbar · 2 years ago
Why should I have to differentiate Perplexity's services?

JohnMakin · 2 years ago
Is it actually retrieving the page on the fly though? How do you know this? Even if it were - it’s not supposed to be able to.
tommy_axle · 2 years ago
What I gathered from the post was that one of the investigations was to ask what was on [some page url], then check the logs moments later and see it using a normal user agent.
supriyo-biswas · 2 years ago
You can just point it at a webserver and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one would hit, maybe with a UUID. This is also explored in the article itself.

In my testing they're using crawlers on AWS, and they do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like the one from Cloudflare, or you can build your own.
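
If you want to reproduce that test, the canary URL only needs to be unguessable; a quick sketch (the domain is a placeholder):

```python
import uuid

# Mint an unguessable canary URL, paste it into a "Summarize the content
# at <URL>" prompt, then grep the access logs for the UUID.
canary = f"https://example.com/canary-{uuid.uuid4()}"
print(canary)
```

Any hit on that path in your logs can only have come from the tool you handed it to.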

parasense · 2 years ago
> Is it actually retrieving the page on the fly though?

They are able to do so.

> How do you know this?

The access logs.

> Even if it were - it’s not supposed to be able to.

There is a distinction between data used to train a model (gathered by the indexing bot with the custom user-agent string) and the user-query input given to the aforementioned AI model. When you ask an AI some question, you normally input text into a form, and the text goes back to the AI model where the magic happens. In this scenario, instead of inputting a wall of text into a form, the text is coming from a URL.

These forms of user input are equivalent, and yet distinctly different. Therefore it's intellectually dishonest for the OP to claim the AI is indexing them, when OP is asking the AI to fetch their website to augment or add context to the question being asked.

IAmGraydon · 2 years ago
He literally showed a server log of it retrieving the page on the fly in the article.
janalsncm · 2 years ago
To steel man this, even though I think the article did a fine job already, maybe the author could’ve changed the content on the page so you would know if they were serving a cached response.
skilled · 2 years ago
Read this article if you want to see Perplexity's idea of taking other people's content and thinking they can get away with it:

https://stackdiary.com/perplexity-has-a-plagiarism-problem/

The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.

[0]: https://www.semafor.com/article/06/12/2024/perplexity-was-pl...

Mathnerd314 · 2 years ago
It's been debated at length, but to make it short: piracy is not theft, and everyone in the LLM space has been taking other people’s content and so far getting away with it (pending lawsuits notwithstanding).
lolinder · 2 years ago
> so far getting away with it (pending lawsuits notwithstanding).

I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].

[0] https://en.m.wikipedia.org/wiki/Oracle_v_Google

AlienRobot · 2 years ago
I'd believe it if they were targeting entities that could fight back, like stock photo companies and Disney, instead of some guy with an ArtStation account or some guy with a blog. To me it sounds like these products can't exist without exploiting someone, and they're too cowardly to ask for permission because they know the answer is going to be "no."

Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!

JumpCrisscross · 2 years ago
> pending lawsuits notwithstanding

That’s a hell of a caveat!

dspillett · 2 years ago
> piracy is not theft

Correct, but it is often a licensing breach (though sometimes depending upon the reading of some licenses, again these things are yet to be tested in any sort of court) and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.

brookst · 2 years ago
If using copyrighted material to train an LLM is theft, so is reading a book.
more_corn · 2 years ago
I hate to argue this side of the fence, but when AI companies are taking the work of writers and artists en masse (replacing creative livelihoods with a machine trained on the artists' stolen work) and achieving billion-dollar valuations, that's actual stealing.

The key here is that creative content producers are being driven out of business through the non-consensual taking of their work.

Maybe it’s a new thing, but if it is, it’s worse than stealing.

cyanydeez · 2 years ago
Right, it's ironic we spent 30 years fighting piracy and then suddenly corporations start doing it and now it's suddenly ok.
twinge · 2 years ago
Aereo, Napster, Grokster, Grooveshark, Megaupload, and TVEyes: they all thought the same thing. Where are they now?
losvedir · 2 years ago
You wouldn't train an LLM on a car.
bongodongobob · 2 years ago
I cannot imagine how viewing/scraping a public website could ever be illegal, wrong, immoral etc. I just don't see the argument for it.
skilled · 2 years ago
Can’t wait for OpenAI to settle with The New York Times. For a billion dollars no less.
insane_dreamer · 2 years ago
> piracy is not theft

it was when Napster was doing it; but there's no entity like the RIAA to stop the AI bots

readyman · 2 years ago
>and thinking they can get away with it

Can they not? I think that remains to be seen.

jhbadger · 2 years ago
Exactly. It's like when Uber started and flouted the medallion taxi system of many cities. People said "These Uber people are idiots! They are going to get shut down! Don't they know the laws for taxis?" While a small number of cities did ban Uber (and even that generally only temporarily), in the end Uber basically won. I think a lot of people confuse what they want to happen with what will happen.

SonOfLilit · 2 years ago
Respecting robots.txt is something their training crawler should do, and I see no reason why their user agent (i.e. user asks it to retrieve a web page, it does) should, as it isn't a crawler (doesn't walk the graph).

As to "lying" about their user agents: this is 2024, and the "User-Agent" header is considered a combination bug and privacy issue. All major browsers lie about being a browser that was popular many years ago, and recently the biggest browser(s?) standardized on sending one exact string from now on, forever (which would obviously be a lie). This header is deprecated in every practical sense, and every user agent should send a legacy value saying "this is Mozilla 5", just like Edge and Chrome and Firefox do (because at some point people figured out that if even one website exists that customizes by user agent but did not expect new browsers to be released, nor was maintained since, the internet would be broken unless they lie). So Perplexity doing the same is standard, and best, practice.

underdeserver · 2 years ago
They might be "lying" because of all sorts of reasons, but a specific version of Chrome on a specific OS still sends a unique user agent string.
SonOfLilit · 2 years ago
I stand corrected, thanks. However, I don't think it impacts my point.
jstanley · 2 years ago
If you've ever tried to do any web scraping, you'll know why they lie about the User-Agent, and you'd do it too if you wanted your program to work properly.

Discriminating based on User-Agent string is the unethical part.

croes · 2 years ago
>and you'd do it too if you wanted your program to work properly.

If I knew the creator of the page didn't want his page used by my program, I wouldn't do it.

>Discriminating based on User-Agent string is the unethical part.

Not being exploited by an AI company is unethical? Robots.txt is made for telling bots, identified by user agent, what they are allowed to read.

lolinder · 2 years ago
> Robots.txt is made for telling bots, identified by user agent, what they are allowed to read.

Specifically it's meant for instructing "automatic clients known as crawlers" [0]. A crawler is defined by MDN as "a program, often called a bot or robot, which systematically browses the Web to collect data from webpages." [1]

As generally understood, wget is not a crawler even though it may be used to build one. Neither is curl. A crawler is a program which systematically browses the web, usually to build a search index.

I see no evidence that Perplexity's crawler is ignoring robots.txt, I only see evidence that when a user does a one-off request for a specific URL then Perplexity uses Chrome to access the site.

Basically, OP is using the wrong tool for the job and complaining when it doesn't work. If he wants to be excluded from Perplexity for one-off requests (as distinct from crawling) he needs to reach out to them, there is no applicable RFC.

[0] https://www.rfc-editor.org/rfc/rfc9309.html

[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler

marcus0x62 · 2 years ago
Please explain - in detail - why using information communicated by the client to change how my server operates is “unethical”. Keep in mind I pay money and expend time to provide free content for people to consume.
tensor · 2 years ago
Here is a simple example. If you made your website only work in, say, Microsoft Edge, and blocked everyone else, telling them to download Edge, I'd think you're an asshole. Whether or not being an ass is unethical I'll leave to the philosophers.

Clearly there are many other scenarios, and many that are more muddy, but overall, when we get into the business of trying to force people to consume content in particular ways, it's a bit icky in my opinion.

The extreme end result of this is no more open web, just force people to download your app to consume your content. This is happening too and it sucks.

bayindirh · 2 years ago
What if the scraper is not respecting robots.txt to begin with? Aren't they unethical enough to warrant a stronger method to prevent scraping?
skeledrew · 2 years ago
Should there be a difference in treatment between a user going on a website and manually copying the content over to a bot to process vs giving the bot the URL so it does the fetching as well? I've done both (mainly to get summaries or translations) and I know which I generally prefer.
bakugo · 2 years ago
There is nothing unethical about not wanting AI companies to steal your content and sell it for a profit.
surfingdino · 2 years ago
I find your ethical standards perplexing...
rknightuk · 2 years ago
I wouldn’t because I have ethics.
sebzim4500 · 2 years ago
Here's my user agent on Chrome:

>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36

There are at least five lies here.

* It isn't made by Mozilla

* It doesn't use WebKit

* It doesn't use KHTML

* It isn't Safari

* That isn't even my version of Chrome; presumably it hides the minor/patch versions for privacy reasons.

Lying in your user agent in order to make the internet work is a practice that is almost as old as user agents. Your browser is almost certainly doing it right now to look at this comment.

bastawhiz · 2 years ago
I have a silly website that just proxies GitHub and scrambles the text. It runs on CF Workers.

https://guthib.mattbasta.workers.dev

For the past month or two, it's been hitting the free request limit as some AI company has scraped it to hell. I'm not inclined to stop them. Go ahead, poison your index with literal garbage. It's the cost of not actually checking the data you're indiscriminately scraping.
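
(Not the actual Worker, but the whole idea fits in a few lines; a rough Python equivalent of the scrambling step, with the shuffle rule just a guess at something similar:)

```python
import random

def scramble_word(word: str) -> str:
    # Keep the first and last letters, shuffle the interior.
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_text(text: str) -> str:
    return " ".join(scramble_word(w) for w in text.split())

print(scramble_text("Perplexity is scraping public websites again"))
```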

esha_manideep · 2 years ago
They check after they scrape
deely3 · 2 years ago
How? Do real people read all the millions of pages of internet text to verify it?
bastawhiz · 2 years ago
That's a lot of time and bandwidth to waste

Eisenstein · 2 years ago
How does GitHub feel about this? You are sending the traffic to them while changing the content.
bastawhiz · 2 years ago
Frankly I don't care. They can block me if they want.
kuschkufan · 2 years ago
Call the fuzz
airstrike · 2 years ago
Who cares?
natch · 2 years ago
It seems to me there could be some confusion here.

When providing a service such as Perplexity AI's, there are two use cases to consider for accessing web sites.

One is the scraping use case for training, where a crawler is being used and it is gathering data in bulk. Hopefully in a way that doesn't hammer one site at a time, but spreads the requests around gently.

The other is the use case of fulfilling a user's specific query in real time. The blog post seems to be hitting this second use case. In this use case, the system component that retrieves the web page is not acting as a crawler, but more as a browser, or something akin to a browser plugin, retrieving the content on behalf of the actual human end user, at their request.

It's appropriate that these two use cases have different norms for how they behave.

The author may have been thinking of the first use case, but actually exercising the second use case, and mistakenly expecting it to behave according to how it should behave for the first use case.

emrah · 2 years ago
This