There are two different questions at play here, and we need to be careful what we wish for.
The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use and which companies try, with varying degrees of success, to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do so as long as they're citing the source.
It's funny, I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? This might indicate where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
> they are decreasing the probability that this user would come to my content (via Google, for example).
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazillions to write quality content for tourists about all sorts of different places, just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US does want unkillable monopolies in the tech sector, and we are all peons.
> A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example).
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
> I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable.
This appears to be self-contradictory. If you let an LLM be trained* on "all the books" (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege, while you get zilch, but you would not even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.
This hits the point exactly: it's an extension of stuff like Google's zero-click results. They are regurgitating a website's content with no benefit to the website.
I would say, though, that the training argument may ultimately lead to a similar outcome, even if it's a bit more ideological and less tangible than regurgitating the results of a query. Services like ChatGPT are already being used as a Google replacement by many people, so long term it may reduce clicks from search as well.
I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and presenting that processed article to the user) then they're just breaking the law.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
It seems self-evident to me that if a user tells a bot to go get a web page, robots.txt doesn't apply, and the bot shouldn't respect it. I understand others' concerns that, like Apple's reader, and other similar tools, it's ethically debatable whether a site should be required to comply with the request, and spoofing an agent seems in dubious territory. I don't think a good answer has been proposed for this challenge, unfortunately.
Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.
The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
[0] https://docs.perplexity.ai/docs/perplexitybot
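For the curious, here's roughly what such an ad hoc fetch looks like. This is a minimal sketch using Playwright, purely as an illustration; nothing in the thread says Perplexity uses this exact stack. The user agent such a fetch presents is just whatever the bundled Chromium sends, which is why it shows up in server logs as ordinary Chrome:

```python
# Illustrative only: fetch one user-requested URL with headless Chromium.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/some-article")       # the URL the user asked about
    print(page.evaluate("navigator.userAgent"))          # a stock Chrome-style UA string
    text = page.inner_text("body")                       # the content handed to the model
    browser.close()
```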
The companies will scrape and internalise the "customer asked for this" requests... and slowly turn the latter into the former, or just use their own tool as the scraper.
No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.
These companies don't need to be given an inch.
So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?
Ad block isn't the same problem because it doesn't and can't steal the creator's data.
Why should it be possible to stop an LLM from training itself on your data? If you want to restrict access to data then don't post it on a public website. It's easy enough to require registration and agreement to licensing terms for access.
It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.
Because if I run a server - at my own expense - I get to use information provided by the client to determine what, if any, response to provide? This isn’t a very difficult concept to grasp.
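To make that concrete, here's a minimal sketch of a server doing exactly that, using only Python's standard library. The blocked tokens are hypothetical choices an operator might make, not anything from the article:

```python
# Minimal sketch: the server uses client-supplied headers to decide the response.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical blocklist; the operator chooses which UA tokens to deny.
BLOCKED_UA_TOKENS = ("PerplexityBot", "GPTBot")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(token in ua for token in BLOCKED_UA_TOKENS):
            self.send_response(403)                    # deny based on client-sent info
            self.end_headers()
            self.wfile.write(b"Access denied by site policy.\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human-ish visitor.\n")

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```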
If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?
And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?
What is a "visit"? TFA demonstrates that they got a hit on their site; that's how they got the logs.
Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?
Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?
What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?
Website owners decide to stop publishing because it's not rewarded by a real human visit anymore?
Then Perplexity and the like won't have new information to train their models on and no sites to answer the questions.
I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.
This is not the case with Perplexity.
How would an LLM training on your writing reduce your reward?
I guess if you're doing it for a living, sure, but most content I consume online is created without incentive (social media, blogs, Stack Overflow).
I write a fair amount and have been for a few years. I like to play with ideas. If an LLM learned from my writing and it helped me propagate my ideas, I'd be happy. I lose some social status in imaginary internet points, but I honestly don't care much for them.
The craziest one is the Stack Overflow contributors. They write answers for free to help people become better programmers, but they're mad an LLM will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?
> I think there is a real content dilemma here at work
It's not really a dilemma.
This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying its market value and driving the original author out of business. Literally what copyright was invented to prevent.
It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has always been the embedded (summary of the) content, which people don't click through. That is the entire point of those features for platforms like Facebook: keeping users on Facebook and not leaving.
This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.
By all practical considerations, Perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".
> The incentives of Google and website owners were more or less aligned.
The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.
Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.
> What will happen if:
What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.
And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
A lot of the public website content targeted towards consumers is already SEO slop trying to sell you something or maximize ad revenue. If those website owners decide to stop publishing due to lack of real human visits then little of value will be lost. Much of the content with real value for consumers has already moved to sites that require registration (and sometimes payment) for access.
For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.
> The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
This feels like the fundamental core component of what copyright allows you to forbid.
> Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local
Which is a huge difference. The local tools are someone asking for a copy of my content (from someone with a valid license, myself) and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). Perplexity adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".
I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law altogether.
I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack, I'm safe.
What if I use libraries written by other people for the TCP/IP and HTTP part?
What if I use a whole FOSS web browser?
What about a paid local web browser?
What if I run a script that I wrote on a cloud server?
What if I then allow other people to download and use that script on their own cloud servers?
What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?
What if I offer it for free to the general public?
What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?
Can you help me to understand where exactly I crossed the line?
I actually don't see the legal distinction here. A browser with an ad blocker is also:
1. Asking for a copy of your content
2. Manipulating the content
3. Redistributing the content to the end-user who requested it
Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).
I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!
I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.
I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.
If the user specifically asks for a file and asks a computer program to process it in a specific way, it should be permitted, regardless of user-agent spoofing (although, ideally, user-agent spoofing should only be done when the user specifically requests it, not automatically). However, this is better when using FOSS and/or local programs (or if the user is accessing them through a proxy, VPN, Tor, etc). Furthermore, any company that provides such services should not use unethical business practices, false advertising, etc, to do so.
If the company wants a copy of the files for its own use, then that is a bit different. When accessing large numbers of files at once, robots.txt is useful to block it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyway), then they might do so. However, even in this case, they still should not use unethical business practices, false advertising, etc; and they should also avoid user-agent spoofing.
(In this case, the user-agent spoofing does not seem to be deliberate, since they use a headless browser. They should still change it, though; probably by keeping the user-agent string but adding an extra part such as "Perplexity", to indicate what it is, in addition to the headless browser.)
A user-agent requests the file using your credentials, eg a cookie or public key signature.
It is transforming the content for you, an authorized party.
That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it and then distributed it to everyone, disregarding that the show prohibited this.
If I shared my Netflix password with up to 5 others, at least I can argue that they are part of my "family" or something. But sharing it with unlimited numbers of people? Why would they pay for Netflix, and how would the shows get made?
I am not necessarily endorsing government force to enforce copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
The problem that Perplexity has that ad blockers don't is that they're an independent site that is publishing content based on work they didn't produce. That runs afoul of both copyright law and Section 230, which lets sites like Google and Facebook operate. That's pretty different from an ad blocker running on your local machine. The ad blocker isn't publishing the page it edited for you.
> they're an independent site that is publishing content based on work they didn't produce.
What distinguishes these two situations?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.
AI scraping against permission could give corporations a loophole: Congress could argue that such a law is impossible to enforce, and that it's easier to just make laws allowing corporations to close-source their websites (yes, the HTML, CSS, JavaScript, etc). I think what's most likely to happen is that nothing will fundamentally change: browsers will continue showing page source, and AI will continue scraping source content without permission.
You can poison all your images with Glaze and Nightshade. Then you don't have to stop them from using them - they have to stop themselves from using them or their image generator will be useless. I don't know if there's a comparable system for text. If there was, it would probably be noticeable to humans.
> can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
I'm not saying you're wrong, but why? And what do you mean by "your data" here?
The website that they created.
There is a difference between:
1) a user-agent which makes an authenticated and authorized request for data and delivers it to the user
2) a user who then turns around and distributes the data or its derivatives to others in an unauthorized manner
A "dumber" example would be whether I can indefinitely cache and index most of the information available via the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or Street View photo information that Google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?
THE REQUIREMENT TO OPEN SOURCE WEIGHTS
Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…
…that would be almost exactly as if I had made my code available under the Affero GPL license, and someone took my code and incorporated it into backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open-sourcing to the public. (Alternatively, they'd have to pay damages in a class-action lawsuit and stop using the tainted backend software or weights when serving all those people.)
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.
If someone wrote an article, your reader transforms it based on your authenticated request, and your user has an authorized subscription.
LEGAL ANALYSIS
Much of the content published on the Web isn't secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as "exceeding authorized access", the charge that was brought against someone like Aaron Swartz.
In other words, it is the question of "piracy", which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant access to people subject to a certain license (e.g. Creative Commons Share-Alike), and then any model derived from that content must have its weights opened. Similar to, say, the Affero GPL license for derivative software.
Yeah, if people get too extensive about blocking, then we're going to end up with a scenario where the web-request functionality is implemented by telling the chatbot user's browser to make the fetch and submit the result back to the server for processing, making it largely indistinguishable from the user making the query themselves. If CORS gets in the way, they can just prompt users to install a browser extension to use the web-request functionality.
Citing the source doesn't bring you, the owner of the site, valuable data. When was your data accessed, who accessed it, from where, at what time, what device, etc. It brings data to the LLM's owner, and you get
N O T H I N G.
Could you change the way printed news magazines showed their content? No. Then, why is that a problem?
Btw nobody clicks on sources. NOBODY.
I always click on sources to verify what, in this case, an LLM says. I also hear that claim a lot about people not reading sources (before LLMs it was video content with references), but I always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience, with people (including me) generalizing it as the behavior of all people?
> I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it.
I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.
Users should have the right to reformat their own copy of others' content (automatically as well as manually). However, if they then redistribute the reformatted copy, they should not be allowed to claim that it has the same formatting as the original, because it does not.
Personally, I think AI is a major win for accessibility, and we should not be preventing people from accessing information in the way that is best suited for them.
Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce unnecessary popups and clicks on a website to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.
The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.
I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??
It doesn't "need" to look at robots.txt because nothing does.
The author has misunderstood when the perplexity user agent applies.
Web site owners shouldn’t dictate what browser users can access their site with - whether that’s chrome, firefox, or something totally different like perplexity.
When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.
If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.
Just to go a little bit more into detail on this, because the article and most of the conversation here is based on a big misunderstanding:
robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.
Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.
The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.
If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.
This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
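To see that distinction in code, here's a small sketch using Python's standard library: the one-off fetch never touches robots.txt, while the crawler checks it first. The URL and crawler name are placeholders:

```python
# Sketch of the fetch-vs-crawl distinction, standard library only.
import urllib.robotparser
import urllib.request

URL = "https://www.example.com/some-page"

# One-off fetch on a user's behalf: no robots.txt consultation, same as plain wget.
page = urllib.request.urlopen(URL).read()

# Crawling: consult robots.txt before following links, same as `wget --recursive`.
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
if rp.can_fetch("MyCrawler/1.0", URL):
    pass  # the crawler is permitted to fetch this URL
```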
It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers. If it is only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?
Yes, that's literally why "user agent" is called "user agent". It's a program that acts in place and in the interest of its user, and this in particular always included allowing the user to choose what will or won't be rendered, and how. It's not up to the server what the client does with the response they get.
So if you have a browser that has Greasemonkey-like scripts running on it, then it's not a browser? What about the AI summary feature available in Edge now?
I’d consider it a web browser but that’s a vague enough term that I can understand seeing it differently.
I’d be disappointed if it became common to block clients like this though. To me this feels like blocking google chrome because you don’t want to show up in google search (which is totally fine to want, for the record). Unnecessarily user hostile because you don’t approve of the company behind the client.
And while it's up to the client to send as many requests as they see fit, it's still called a DDoS attack when overdone, regardless of the freedom the client has to do it.
Setting a correct user agent isn't required anyway; you just do it to not be an asshole. Robots.txt is an optional standard.
The article is just calling Perplexity out for some asshole behavior; it's not that complicated.
It's clear they know they're engaging in poor behavior, too: they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome. Folks who trust them could've then blocked the training UA but allowed the alternative.
I don’t think we should lump together “AI company scraping a website to train their base model” and “AI tool retrieving a web page because I asked it to”. At least, those should be two different user agents so you have the option to block one and not the other.
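For example, if the vendor published a separate token for user-initiated fetches, a robots.txt along these lines could block one and allow the other. PerplexityBot is the documented crawler token; the second token here is hypothetical:

```
# Block bulk training crawls but allow user-initiated retrievals.
User-agent: PerplexityBot
Disallow: /

# Hypothetical token for on-demand, user-requested fetches.
User-agent: Perplexity-User
Allow: /
```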
Personally, I don't even think that's the issue. I'd prefer a correct user agent; that's just common decency and shouldn't be an issue for most.
What I do expect the AI companies to do is to check the license of the content they scrape and follow it. Let's say I run a blog and I have a CC BY-NC 4.0 license. You can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC, that's fine; but for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
What I gathered from the post was that one of the investigations was to ask what was on [some page url], then check the logs moments later and see the fetch arrive with a normal user agent.
You can just point it at a webserver and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one would hit, maybe with a UUID. This is also explored in the article itself.
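A quick sketch of that test; the domain and path are placeholders:

```python
# Build a canary URL that no crawler could already know about.
import uuid

canary = f"https://example.com/canary/{uuid.uuid4()}"
print(canary)
# Paste this URL into the assistant ("Summarize the content at <canary>"),
# then grep the server's access log: any hit must come from an on-demand fetch.
```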
In my testing they're using crawlers on AWS that do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like Cloudflare's, or you can build your own.
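A toy version of such a challenge, as a sketch only (a real interstitial would use signed, expiring tokens rather than a fixed cookie): serve the content only to clients that have executed a bit of JavaScript.

```python
# Toy JS challenge: clients that don't execute JavaScript never see the content.
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE_PAGE = b"""<html><body>
<script>
  document.cookie = "challenge=ok";  // a JS-less scraper never sets this
  location.reload();
</script>
Checking your browser...
</body></html>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if "challenge=ok" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"The actual article content.\n")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(CHALLENGE_PAGE)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```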
> Is it actually retrieving the page on the fly though?
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between data used to train a model, which comes from the indexing bot with the custom user-agent string, and the user-query input given to the aforementioned AI model. When you ask an AI some question, you normally input text into a form, and the text goes back to the AI model where the magic happens. In this scenario, instead of inputting a wall of text into a form, the text is coming from a URL.
These forms of user input are equivalent, and yet distinctly different. Therefore it's intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
To steel man this, even though I think the article did a fine job already, maybe the author could’ve changed the content on the page so you would know if they were serving a cached response.
The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.
It's been debated at length, but to make it short: piracy is not theft, and everyone in the LLM space has been taking other people’s content and so far getting away with it (pending lawsuits notwithstanding).
> so far getting away with it (pending lawsuits notwithstanding).
I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].
I'd believe it if they were targeting entities that could fight back, like stock photo companies and Disney, instead of some guy with an ArtStation account or some guy with a blog. To me it sounds like these products can't exist without exploiting someone, and they're too cowardly to ask for permission because they know the answer is going to be "no."
Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!
Correct, but it is often a licensing breach (though sometimes that depends on the reading of the license; again, these things are yet to be tested in any sort of court), and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.
I hate to argue this side of the fence, but when AI companies are taking the work of writers and artists en masse (replacing creative livelihoods with a machine trained on the artists' stolen work) and achieving billion-dollar valuations, that's actual stealing.
The key here is that creative content producers are being driven out of business through non-consensual taking of their work.
Maybe it’s a new thing, but if it is, it’s worse than stealing.
Exactly. It's like when Uber started and flouted the medallion taxi system of many cities. People said "These Uber people are idiots! They are going to get shut down! Don't they know the laws for taxis?" While a small number of cities did ban Uber (and even that generally only temporarily), in the end Uber basically won. I think a lot of people confuse what they want to happen with what will happen.
Respecting robots.txt is something their training crawler should do, and I see no reason why their user agent (i.e. user asks it to retrieve a web page, it does) should, as it isn't a crawler (doesn't walk the graph).
As to "lying" about their user agents - this is 2024, the "User-Agent" header is considered a combination bug and privacy issue, all major browsers lie about being a browser that was popular many years ago, and recently the biggest browser(s?) standardized on sending one exact string from now on forever (which would obviously be a lie). This header is deprecated in every practical sense, and every user agent should send a legacy value saying "this is mozilla 5" just like Edge and Chrome and Firefox do (because at some point people figured out that if even one website exists that customizes by user agent but did not expect that new browsers would be released, nor was maintained since, then the internet would be broken unless they lie). So Perplexity doing the same is standard, and best, practice.
If you've ever tried to do any web scraping, you'll know why they lie about the User-Agent, and you'd do it too if you wanted your program to work properly.
Discriminating based on User-Agent string is the unethical part.
> Robots.txt is made for telling bot identified by user agent what they are allowed to read.
Specifically it's meant for instructing "automatic clients known as crawlers" [0]. A crawler is defined by MDN as "a program, often called a bot or robot, which systematically browses the Web to collect data from webpages." [1]
As generally understood, wget is not a crawler even though it may be used to build one. Neither is curl. A crawler is a program which systematically browses the web, usually to build a search index.
I see no evidence that Perplexity's crawler is ignoring robots.txt, I only see evidence that when a user does a one-off request for a specific URL then Perplexity uses Chrome to access the site.
Basically, OP is using the wrong tool for the job and complaining when it doesn't work. If he wants to be excluded from Perplexity for one-off requests (as distinct from crawling), he needs to reach out to them; there is no applicable RFC.
Please explain - in detail - why using information communicated by the client to change how my server operates is “unethical”. Keep in mind I pay money and expend time to provide free content for people to consume.
Here is a simple example. If you made your website only work in, say, Microsoft Edge, and blocked everyone else, telling them to download Edge, I'd think you're an asshole. Whether or not being an ass is unethical I'll leave to the philosophers.
Clearly there are many other scenarios, and many that are more muddy, but overall, when we get into the business of trying to force people to consume content in particular ways, it's a bit icky in my opinion.
The extreme end result of this is no more open web, just force people to download your app to consume your content. This is happening too and it sucks.
Should there be a difference in treatment between a user going on a website and manually copying the content over to a bot to process vs giving the bot the URL so it does the fetching as well? I've done both (mainly to get summaries or translations) and I know which I generally prefer.
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
There are at least five lies here.
* It isn't made by Mozilla
* It doesn't use WebKit
* It doesn't use KHTML
* It isn't Safari
* That isn't even my version of chrome, presumably it hides the minor/patch versions for privacy reasons.
Lying in your user agent in order to make the internet work is a practice that is almost as old as user agents. Your browser is almost certainly doing it right now to look at this comment.
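Indeed, presenting a browser-shaped User-Agent is a one-liner in any HTTP client. As a sketch, using the standard library and the same Chrome-style string quoted above:

```python
# Any HTTP client can present a browser-shaped User-Agent; this is routine.
import urllib.request

req = urllib.request.Request(
    "https://example.com",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/126.0.0.0 Safari/537.36"},
)
body = urllib.request.urlopen(req).read()
```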
For the past month or two, it's been hitting the free request limit as some AI company has scraped it to hell. I'm not inclined to stop them. Go ahead, poison your index with literal garbage. It's the cost of not actually checking the data you're indiscriminately scraping.
It seems to me there could be some confusion here.
When providing a service such as Perplexity AI's, there are two use cases to consider for accessing web sites.
One is the scraping use case for training, where a crawler is being used and it is gathering data in bulk. Hopefully in a way that doesn't hammer one site at a time, but spreads the requests around gently.
The other is the use case for fulfilling a user's specific query in real time. The blog post seemed to be hitting this second use case. In this use case, the system component that retrieves the web page is not acting as a crawler, but more as a browser or something akin to a browser plugin that is retrieving the content on behalf of the actual human end user, on their request.
It's appropriate that these two use cases have different norms for how they behave.
The author may have been thinking of the first use case, but actually exercising the second use case, and mistakenly expecting it to behave according to how it should behave for the first use case.
The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we already all use and which companies try with varying degrees of success to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use llms this way, but I'm uncomfortable with arguing that it's unethical for them to do that so long as they're citing the source.
But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to by content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
There are many benefits to having people visit your content on a property that you own. e.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users, this might indicate to me where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazzilions to write quality content for tourists about the most different places just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulations and competition is gone, US does want to have unkillable monopolies in the tech sector and we are all peons.
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
This appears to be self-contradictory. If you let an LLM to be trained* on “all the books” (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege, while you would get zilch, but you would not even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
I would say though, it feels like the training argument may ultimately lead to a similar outcome, though it’s a bit more ideological and less tangible than regurgitating the results of a query. Services like chatgpt are already being used a google replacement by many people, so long term it may reduce clicks from search as well.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.
The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
[0] https://docs.perplexity.ai/docs/perplexitybot
No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.
These companies don't need to be given an inch.
So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?
Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.
It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.
If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?
And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?
Website owners decide to stop publishing because it’s not rewarded by a real human visit anymore?
Then perplexity and the like won’t have new information to train their models on and no sites to answer the questions.
I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.
This is not the case with perplexity.
Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?
Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?
What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?
I guess if you're doing it for a living sure, but most content I consume online is created without incentive (social media, blogs, stack overflow).
I write a fair amount and have been for a few years. I like to play with ideas. If an llm learned from my writing and it helped me propagate my ideas, I'd be happy. I lose on social status imaginary internet points but I honestly don't care much for them.
The craziest one is the stack overflow contributors. They write answers for free to help people become better programmers but they're mad an llm will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?
It's not really a dilemma.
This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying it's market value and driving the original author out of business. Literally what copyright was invented to prevent.
It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has been the embedded (summary of the) content. Which people don't click through, and this is the entire point of those features for platforms like Facebook; Keeping users on facebook and not leaving.
This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.
By all practical considerations, perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".
> The incentives of Google and website owners were more or less aligned.
The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.
Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.
> What will happen if:
What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.
And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.
This feels like the fundamental core component of what copyright allows you to forbid.
> Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local
Which is a huge difference. The latter is someone asking for a copy of my content (from someone with a valid license, myself), and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). The former adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".
I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law all together.
I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack I'm safe.
What if I use libraries written by other people for the TCP/IP and HTTP part?
What if I use a whole FOSS web browser?
What about a paid local web browser?
What if I run a script that I wrote on a cloud server?
What if I then allow other people to download and use that script on their own cloud servers?
What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?
What if I offer it for free to the general public?
What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?
Can you help me to understand where exactly I crossed the line?
1. Asking for a copy of your content
2. Manipulating the content
3. Redistributing the content to the end-user who requested it
Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).
I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!
I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.
I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.
If the company wants a copy of the files for your own use, then that is a bit different. When accessing large number of files at once, robots.txt is useful to block it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyways), then they might do so. However, even in this case, still they should not use unethical business practices, false advertising, etc; and, they should also avoid user-agent spoofing.
(In this case, the reason for the user-agent spoofing does not seem to be deliberate, since it uses a headless browser. They should still change it though; probably by keeping the user-agent string but adding on a extra part such as "Perplexity", to indicate that it is what it is, in addition to the headless browser.)
It is transforming the content for you, an authorized party.
That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it and then distributed it to everyone, disregarding that the show prohibited this.
If I shared my Netflix password with up to 5 others, at least I can argue that they are part of my “family” or something. But to unlimited numbers of people? Why would they pay for netflix, and how would the shows get made?
I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
What distinguishes these two situations?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.
Deleted Comment
I’m not saying you’re wrong, but why? And what do you mean by “your data” here?
The website that they created.
Deleted Comment
1) a user-agent which makes an authenticated and authorized request for data, and delivers to the user
2) a user who then turns around and distributes the data or its derivatives to users in an unauthorized manner
A “dumber” example would be whether I can indefinitely cache and index most of information via the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or streetview photo information that google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?
THE REQUIREMENT TO OPEN SOURCE WEIGHTS
Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…
…that would be almost exactly like if I had made my code available with Affero GPL license, someone would take my code but then incorporated it into a backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open sourcing to the public. (Alternatively, they’d have to pay damages in a class action lawsuit and stop using the tainted backend software or weights when serving all those people.)
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.
If someone wrote an article, your reader can transform it in response to your authenticated request, provided your user has an authorized subscription.
LEGAL ANALYSIS
Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access”, the theory that was used to prosecute Aaron Swartz.
In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant access to people subject to a certain license (eg Creative Commons Share Alike) and then any derived content must have its weights opened. Similar to, say, Affero GPL license for derivative software.
N O T H I N G.
Could you change the way printed news magazines displayed their content? No. So why is this a problem?
Btw nobody clicks on sources. NOBODY.
I always click on sources to verify what, in this case, an LLM says. I hear the claim that people don't read sources a lot (before LLMs it was video content with references), but I have always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience, with people (including me) projecting it as the generic behavior of all people?
I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.
Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce the unnecessary popups and clicks needed to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.
The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.
I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??
It doesn't "need" to look at robots.txt because nothing does.
Website owners shouldn't dictate which browser users can access their site with, whether that's Chrome, Firefox, or something totally different like Perplexity.
When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.
If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.
robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.
Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.
The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.
If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.
This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
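For the curious, here is a minimal sketch of that same distinction using Python's standard library (the URLs and bot name are placeholders, not anything Perplexity uses): the one-off fetch never touches robots.txt, while the crawling step consults it before following links.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# One-off fetch: a plain user agent retrieving the single page the user
# asked for, mirroring `wget https://www.example.com`. No robots.txt.
def fetch_page(url: str) -> bytes:
    with urlopen(url) as resp:
        return resp.read()

# Crawler behavior: before following discovered links, consult
# robots.txt, mirroring `wget --recursive https://www.example.com`.
def may_crawl(user_agent: str, url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url)

html = fetch_page("https://www.example.com")  # user-requested page
if may_crawl("ExampleBot/1.0", "https://www.example.com/next-page"):
    fetch_page("https://www.example.com/next-page")
```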
So a browser with an ad-blocker that's removing / manipulating elements on the page isn't a browser? What about reader mode?
I’d be disappointed if it became common to block clients like this though. To me this feels like blocking google chrome because you don’t want to show up in google search (which is totally fine to want, for the record). Unnecessarily user hostile because you don’t approve of the company behind the client.
The article is just calling Perplexity out for some asshole behavior; it's not that complicated.
It's clear they know they're engaging in poor behavior too; they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome, along the lines of the sketch below. Folks who trust them could then have blocked the training UA while allowing the alternative.
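As a rough sketch of what that could look like (nothing Perplexity has actually documented; the UA strings and names here are hypothetical, and the `requests` library is assumed), a service could use one declared UA for its crawler and a second, self-identifying UA for user-initiated fetches:

```python
import requests

# Hypothetical declared UA for bulk crawling (honors robots.txt).
CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"

# Hypothetical UA for one-off, user-initiated fetches: still advertises
# Chrome compatibility so pages render normally, but appends an
# identifying token that site owners could allow or block
# independently of the crawler.
USER_FETCH_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Example-User/1.0"
)

def fetch_for_user(url: str) -> str:
    """Fetch a single page on behalf of an explicit user request."""
    resp = requests.get(url, headers={"User-Agent": USER_FETCH_UA}, timeout=10)
    resp.raise_for_status()
    return resp.text
```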
Does anyone actually do this, though?
What I do expect the AI companies to do is check the license of the content they scrape and follow it. Say I run a blog under a CC BY-NC 4.0 license: you can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC; that's fine. But for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
Tell that to the Chrome team. And the Safari team. And the Opera team. [0]
[0] https://webaim.org/blog/user-agent-string-history/
In my testing they're using crawlers on AWS that do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like the one from Cloudflare, or you can build your own.
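A minimal sketch of a home-built interstitial along those lines, assuming Flask (purely illustrative, and trivially bypassed by any bot that bothers to set the cookie; real challenges like Cloudflare's are far more involved):

```python
from flask import Flask, make_response, request

app = Flask(__name__)

# The interstitial: JavaScript sets a cookie and reloads. A client that
# never executes JavaScript never gets past this page.
CHALLENGE_HTML = """<html><body>Checking your browser...
<script>
  document.cookie = "challenge=passed; path=/";
  location.reload();
</script>
</body></html>"""

@app.route("/")
def index():
    if request.cookies.get("challenge") != "passed":
        return make_response(CHALLENGE_HTML, 503)
    return "The actual content."
```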
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between the data used to train a model, which is gathered by the indexing bot with the custom user-agent string, and the user-query input given to that model. When you ask an AI a question, you normally type text into a form, and the text goes to the AI model where the magic happens. In this scenario, instead of pasting a wall of text into a form, the text is coming from a URL.
These forms of user input are equivalent, and yet distinctly different. It is therefore intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
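A rough sketch of that second input path, with a hypothetical `ask_model` standing in for the model call and the `requests` library assumed: the page is fetched once, at the user's explicit request, and spliced into the prompt exactly as pasted-in text would be.

```python
import requests

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever calls the AI model."""
    raise NotImplementedError

def answer_with_url(question: str, url: str) -> str:
    # Fetch the page once, on the user's behalf.
    page_text = requests.get(url, timeout=10).text
    # The page text becomes prompt context, truncated to a rough budget.
    prompt = f"Using this page as context:\n{page_text[:8000]}\n\nQuestion: {question}"
    return ask_model(prompt)
```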
https://stackdiary.com/perplexity-has-a-plagiarism-problem/
The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.
[0]: https://www.semafor.com/article/06/12/2024/perplexity-was-pl...
I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].
[0] https://en.m.wikipedia.org/wiki/Oracle_v_Google
Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!
That’s a hell of a caveat!
Correct, but it is often a licensing breach (though sometimes that depends on the reading of the license; again, these things are yet to be tested in any sort of court), and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.
The key here is that creative content producers are being driven out of business through the non-consensual taking of their work.
Maybe it’s a new thing, but if it is, it’s worse than stealing.
It was when Napster was doing it, but there's no entity like the RIAA to stop the AI bots.
Can they not? I think that remains to be seen.
As to "lying" about their user agents - this is 2024, the "User-Agent" header is considered a combination bug and privacy issue, all major browsers lie about being a browser that was popular many years ago, and recently the biggest browser(s?) standardized on sending one exact string from now on forever (which would obviously be a lie). This header is deprecated in every practical sense, and every user agent should send a legacy value saying "this is mozilla 5" just like Edge and Chrome and Firefox do (because at some point people figured out that if even one website exists that customizes by user agent but did not expect that new browsers would be released, nor was maintained since, then the internet would be broken unless they lie). So Perplexity doing the same is standard, and best, practice.
Discriminating based on User-Agent string is the unethical part.
If I knew the creator of the page didn't want it used by my program, I wouldn't do it.
>Discriminating based on User-Agent string is the unethical part.
Not being exploited by an AI company is unethical? Robots.txt is made for telling bots, identified by user agent, what they are allowed to read.
Specifically it's meant for instructing "automatic clients known as crawlers" [0]. A crawler is defined by MDN as "a program, often called a bot or robot, which systematically browses the Web to collect data from webpages." [1]
As generally understood, wget is not a crawler even though it may be used to build one. Neither is curl. A crawler is a program which systematically browses the web, usually to build a search index.
I see no evidence that Perplexity's crawler is ignoring robots.txt; I only see evidence that when a user makes a one-off request for a specific URL, Perplexity uses Chrome to access the site.
Basically, OP is using the wrong tool for the job and complaining when it doesn't work. If he wants to be excluded from Perplexity's one-off requests (as distinct from crawling), he needs to reach out to them; there is no applicable RFC.
[0] https://www.rfc-editor.org/rfc/rfc9309.html
[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler
Clearly there are many other scenarios, and many that are more muddy, but overall, when we get into the business of trying to force people to consume content in particular ways, it's a bit icky in my opinion.
The extreme end result of this is no more open web, just force people to download your app to consume your content. This is happening too and it sucks.
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
There are at least five lies here.
* It isn't made by Mozilla
* It doesn't use WebKit
* It doesn't use KHTML
* It isn't safari
* That isn't even my version of Chrome; presumably it hides the minor/patch versions for privacy reasons.
Lying in your user agent in order to make the internet work is a practice that is almost as old as user agents. Your browser is almost certainly doing it right now to look at this comment.
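You can watch this happen with a few lines of Python's standard http.server: run it, browse to localhost:8000, and see which legacy string your browser claims to be.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class UAEcho(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "(none)")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        # Almost certainly begins with "Mozilla/5.0", whatever your browser.
        self.wfile.write(ua.encode())

HTTPServer(("127.0.0.1", 8000), UAEcho).serve_forever()
```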
https://guthib.mattbasta.workers.dev
For the past month or two, it's been hitting the free request limit as some AI company has scraped it to hell. I'm not inclined to stop them. Go ahead, poison your index with literal garbage. It's the cost of not actually checking the data you're indiscriminately scraping.
When providing a service such as Perplexity AI's, there are two use cases to consider for accessing web sites.
One is the scraping use case for training, where a crawler is being used and it is gathering data in bulk. Hopefully in a way that doesn't hammer one site at a time, but spreads the requests around gently.
The other is the use case for fulfilling a user's specific query in real time. The blog post seemed to be hitting this second use case. In this use case, the system component that retrieves the web page is not acting as a crawler, but more as a browser or something akin to a browser plugin that is retrieving the content on behalf of the actual human end user, on their request.
It's appropriate that these two use cases have different norms for how they behave.
The author may have been thinking of the first use case while actually exercising the second, and so mistakenly expected it to behave according to the norms of the first.