There are two different questions at play here, and we need to be careful what we wish for.
The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use and which companies try, with varying degrees of success, to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do so as long as they're citing the source.
It's funny, I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? This might indicate where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
> they are decreasing the probability that this user would come to my content (via Google, for example).
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazillions to write quality content for tourists about all sorts of different places, just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US does want unkillable monopolies in the tech sector, and we are all peons.
> A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example).
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
> I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.
> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable.
This appears to be self-contradictory. If you let an LLM be trained* on "all the books" (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege, while you get zilch, but you would not even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.
This hits the point exactly: it's an extension of stuff like Google's zero-click results. They are regurgitating a website's content with no benefit to the website.
I would say, though, that the training argument may ultimately lead to a similar outcome, even if it's a bit more ideological and less tangible than regurgitating the results of a query. Services like ChatGPT are already being used as a Google replacement by many people, so long term it may reduce clicks from search as well.
I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and presenting that processed article to the user) then they're just breaking the law.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
It seems self-evident to me that if a user tells a bot to go get a web page, robots.txt doesn't apply, and the bot shouldn't respect it. I understand others' concerns that, like Apple's reader, and other similar tools, it's ethically debatable whether a site should be required to comply with the request, and spoofing an agent seems in dubious territory. I don't think a good answer has been proposed for this challenge, unfortunately.
Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.
The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
[0] https://docs.perplexity.ai/docs/perplexitybot
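For the curious, here's roughly what such an ad hoc fetch looks like. This is a minimal sketch using Playwright, purely as an illustration; nothing in the thread says Perplexity uses this exact stack. The user agent such a fetch presents is just whatever the bundled Chromium sends, which is why it shows up in server logs as ordinary Chrome:

```python
# Illustrative only: fetch one user-requested URL with headless Chromium.
# Assumes `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/some-article")       # the URL the user asked about
    print(page.evaluate("navigator.userAgent"))          # a stock Chrome-style UA string
    text = page.inner_text("body")                       # the content handed to the model
    browser.close()
```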
The companies will scrape and internalise the "customer asked for this" requests... and slowly turn the latter into the former, or just use their own tool as the scraper.
No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.
These companies don't need to be given an inch.
So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?
Ad block isn't the same problem because it doesn't and can't steal the creator's data.
Why should it be possible to stop an LLM from training itself on your data? If you want to restrict access to data then don't post it on a public website. It's easy enough to require registration and agreement to licensing terms for access.
It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.
Because if I run a server - at my own expense - I get to use information provided by the client to determine what, if any, response to provide? This isn’t a very difficult concept to grasp.
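To make that concrete, here's a minimal sketch of a server doing exactly that, using only Python's standard library. The blocked tokens are hypothetical choices an operator might make, not anything from the article:

```python
# Minimal sketch: the server uses client-supplied headers to decide the response.
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical blocklist; the operator chooses which UA tokens to deny.
BLOCKED_UA_TOKENS = ("PerplexityBot", "GPTBot")

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(token in ua for token in BLOCKED_UA_TOKENS):
            self.send_response(403)                    # deny based on client-sent info
            self.end_headers()
            self.wfile.write(b"Access denied by site policy.\n")
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Hello, human-ish visitor.\n")

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```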
If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?
And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?
What is a "visit"? TFA demonstrates that they got a hit on their site; that's how they got the logs.
Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?
Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?
What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?
Website owners decide to stop publishing because it's not rewarded by a real human visit anymore?
Then Perplexity and the like won't have new information to train their models on and no sites to answer the questions.
I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.
This is not the case with Perplexity.
How would an LLM training on your writing reduce your reward?
I guess if you're doing it for a living, sure, but most content I consume online is created without incentive (social media, blogs, Stack Overflow).
I write a fair amount and have been for a few years. I like to play with ideas. If an LLM learned from my writing and it helped me propagate my ideas, I'd be happy. I lose some social status in imaginary internet points, but I honestly don't care much for them.
The craziest one is the Stack Overflow contributors. They write answers for free to help people become better programmers, but they're mad an LLM will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?
> I think there is a real content dilemma here at work
It's not really a dilemma.
This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying its market value and driving the original author out of business. Literally what copyright was invented to prevent.
It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has always been the embedded (summary of the) content, which people don't click through. That is the entire point of those features for platforms like Facebook: keeping users on Facebook and not leaving.
This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.
By all practical considerations, Perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".
> The incentives of Google and website owners were more or less aligned.
The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.
Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.
> What will happen if:
What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.
And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
A lot of the public website content targeted towards consumers is already SEO slop trying to sell you something or maximize ad revenue. If those website owners decide to stop publishing due to lack of real human visits then little of value will be lost. Much of the content with real value for consumers has already moved to sites that require registration (and sometimes payment) for access.
For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.
> The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
This feels like the fundamental core component of what copyright allows you to forbid.
> Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local
Which is a huge difference. The local tools are someone asking for a copy of my content (from someone with a valid license, myself) and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). Perplexity adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".
I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law altogether.
I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack, I'm safe.
What if I use libraries written by other people for the TCP/IP and HTTP part?
What if I use a whole FOSS web browser?
What about a paid local web browser?
What if I run a script that I wrote on a cloud server?
What if I then allow other people to download and use that script on their own cloud servers?
What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?
What if I offer it for free to the general public?
What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?
Can you help me to understand where exactly I crossed the line?
I actually don't see the legal distinction here. A browser with an ad blocker is also:
1. Asking for a copy of your content
2. Manipulating the content
3. Redistributing the content to the end-user who requested it
Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).
I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!
I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.
I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.
If the user specifically asks for a file and asks a computer program to process it in a specific way, it should be permitted, regardless of user-agent spoofing (although, ideally, user-agent spoofing should only be done when the user specifically requests it, not automatically). However, this is better when using FOSS and/or local programs (or if the user is accessing them through a proxy, VPN, Tor, etc). Furthermore, any company that provides such services should not use unethical business practices, false advertising, etc, to do so.
If the company wants a copy of the files for its own use, then that is a bit different. When accessing large numbers of files at once, robots.txt is useful to block it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyway), then they might do so. However, even in this case, they still should not use unethical business practices, false advertising, etc; and they should also avoid user-agent spoofing.
(In this case, the user-agent spoofing does not seem to be deliberate, since they use a headless browser. They should still change it, though; probably by keeping the user-agent string but adding an extra part such as "Perplexity", to indicate what it is, in addition to the headless browser.)
A user-agent requests the file using your credentials, eg a cookie or public key signature.
It is transforming the content for you, an authorized party.
That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it and then distributed it to everyone, disregarding that the show prohibited this.
If I shared my Netflix password with up to 5 others, at least I can argue that they are part of my "family" or something. But sharing it with unlimited numbers of people? Why would they pay for Netflix, and how would the shows get made?
I am not necessarily endorsing government force to enforce copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
The problem that Perplexity has that ad blockers don't is that they're an independent site that is publishing content based on work they didn't produce. That runs afoul of both copyright law and Section 230, which lets sites like Google and Facebook operate. That's pretty different from an ad blocker running on your local machine. The ad blocker isn't publishing the page it edited for you.
> they're an independent site that is publishing content based on work they didn't produce.
What distinguishes these two situations?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.
AI scraping against permission could give corporations a loophole: Congress could argue that such a law is impossible to enforce, and that it's easier to just make laws allowing corporations to close-source their websites (yes, the HTML, CSS, JavaScript, etc). I think what's most likely to happen is that nothing will fundamentally change: browsers will continue showing page source, and AI will continue scraping source content without permission.
You can poison all your images with Glaze and Nightshade. Then you don't have to stop them from using them - they have to stop themselves from using them or their image generator will be useless. I don't know if there's a comparable system for text. If there was, it would probably be noticeable to humans.
> can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
I'm not saying you're wrong, but why? And what do you mean by "your data" here?
The website that they created.
There is a difference between:
1) a user-agent which makes an authenticated and authorized request for data and delivers it to the user
2) a user who then turns around and distributes the data or its derivatives to others in an unauthorized manner
A "dumber" example would be whether I can indefinitely cache and index most of the information available via the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or Street View photo information that Google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?
THE REQUIREMENT TO OPEN SOURCE WEIGHTS
Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…
…that would be almost exactly as if I had made my code available under the Affero GPL license, and someone took my code and incorporated it into backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open-sourcing to the public. (Alternatively, they'd have to pay damages in a class-action lawsuit and stop using the tainted backend software or weights when serving all those people.)
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.
If someone wrote an article, your reader transforms it based on your authenticated request, and your user has an authorized subscription.
LEGAL ANALYSIS
Much of the content published on the Web isn't secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as "exceeding authorized access", the charge that was brought against someone like Aaron Swartz.
In other words, it is the question of "piracy", which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant access to people subject to a certain license (e.g. Creative Commons Share-Alike), and then any model derived from that content must have its weights opened. Similar to, say, the Affero GPL license for derivative software.
Yeah, if people get too extensive about blocking, then we're going to end up with a scenario where the web-request functionality is implemented by telling the chatbot user's browser to make the fetch and submit the result back to the server for processing, making it largely indistinguishable from the user making the query themselves. If CORS gets in the way, they can just prompt users to install a browser extension to use the web-request functionality.
Citing the source doesn't bring you, the owner of the site, valuable data. When was your data accessed, who accessed it, from where, at what time, what device, etc. It brings data to the LLM's owner, and you get
N O T H I N G.
Could you change the way printed news magazines showed their content? No. Then, why is that a problem?
Btw nobody clicks on sources. NOBODY.
I always click on sources to verify what, in this case, an LLM says. I also hear that claim a lot about people not reading sources (before LLMs it was video content with references), but I always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience, with people (including me) generalizing it as the behavior of all people?
> I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it.
I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.
Users should have the right to reformat their own copy of others' content (automatically as well as manually). However, if they then redistribute the reformatted copy, they should not be allowed to claim that it has the same formatting as the original, because it does not.
Personally, I think AI is a major win for accessibility, and we should not be preventing people from accessing information in the way that is best suited for them.
Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce unnecessary popups and clicks on a website to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.
The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.
I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??
It doesn't "need" to look at robots.txt because nothing does.
The author has misunderstood when the perplexity user agent applies.
Web site owners shouldn’t dictate what browser users can access their site with - whether that’s chrome, firefox, or something totally different like perplexity.
When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.
If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.
Just to go a little bit more into detail on this, because the article and most of the conversation here is based on a big misunderstanding:
robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.
Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.
The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.
If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.
This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
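To see that distinction in code, here's a small sketch using Python's standard library: the one-off fetch never touches robots.txt, while the crawler checks it first. The URL and crawler name are placeholders:

```python
# Sketch of the fetch-vs-crawl distinction, standard library only.
import urllib.robotparser
import urllib.request

URL = "https://www.example.com/some-page"

# One-off fetch on a user's behalf: no robots.txt consultation, same as plain wget.
page = urllib.request.urlopen(URL).read()

# Crawling: consult robots.txt before following links, same as `wget --recursive`.
rp = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
if rp.can_fetch("MyCrawler/1.0", URL):
    pass  # the crawler is permitted to fetch this URL
```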
It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers. If it is only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?
Yes, that's literally why "user agent" is called "user agent". It's a program that acts in place and in the interest of its user, and this in particular always included allowing the user to choose what will or won't be rendered, and how. It's not up to the server what the client does with the response they get.
So if you have a browser that has Greasemonkey-like scripts running on it, then it's not a browser? What about the AI summary feature available in Edge now?
I’d consider it a web browser but that’s a vague enough term that I can understand seeing it differently.
I’d be disappointed if it became common to block clients like this though. To me this feels like blocking google chrome because you don’t want to show up in google search (which is totally fine to want, for the record). Unnecessarily user hostile because you don’t approve of the company behind the client.
And while it's up to the client to send as many requests as they see fit, it's still called a DDoS attack when overdone, regardless of the freedom the client has to do it.
Setting a correct user agent isn't required anyway; you just do it to not be an asshole. Robots.txt is an optional standard.
The article is just calling Perplexity out for some asshole behavior; it's not that complicated.
It's clear they know they're engaging in poor behavior, too: they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome. Folks who trust them could've then blocked the training UA but allowed the alternative.
I don’t think we should lump together “AI company scraping a website to train their base model” and “AI tool retrieving a web page because I asked it to”. At least, those should be two different user agents so you have the option to block one and not the other.
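For example, if the vendor published a separate token for user-initiated fetches, a robots.txt along these lines could block one and allow the other. PerplexityBot is the documented crawler token; the second token here is hypothetical:

```
# Block bulk training crawls but allow user-initiated retrievals.
User-agent: PerplexityBot
Disallow: /

# Hypothetical token for on-demand, user-requested fetches.
User-agent: Perplexity-User
Allow: /
```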
Personally, I don't even think that's the issue. I'd prefer a correct user agent; that's just common decency and shouldn't be an issue for most.
What I do expect the AI companies to do is to check the license of the content they scrape and follow it. Let's say I run a blog and I have a CC BY-NC 4.0 license. You can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC, that's fine; but for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
What I gathered from the post was that one of the investigations was to ask what was on [some page url], then check the logs moments later and see the fetch arrive with a normal user agent.
You can just point it at a webserver and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one would hit, maybe with a UUID. This is also explored in the article itself.
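A quick sketch of that test; the domain and path are placeholders:

```python
# Build a canary URL that no crawler could already know about.
import uuid

canary = f"https://example.com/canary/{uuid.uuid4()}"
print(canary)
# Paste this URL into the assistant ("Summarize the content at <canary>"),
# then grep the server's access log: any hit must come from an on-demand fetch.
```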
In my testing they're using crawlers on AWS that do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like Cloudflare's, or you can build your own.
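A toy version of such a challenge, as a sketch only (a real interstitial would use signed, expiring tokens rather than a fixed cookie): serve the content only to clients that have executed a bit of JavaScript.

```python
# Toy JS challenge: clients that don't execute JavaScript never see the content.
from http.server import BaseHTTPRequestHandler, HTTPServer

CHALLENGE_PAGE = b"""<html><body>
<script>
  document.cookie = "challenge=ok";  // a JS-less scraper never sets this
  location.reload();
</script>
Checking your browser...
</body></html>"""

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if "challenge=ok" in self.headers.get("Cookie", ""):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"The actual article content.\n")
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(CHALLENGE_PAGE)

if __name__ == "__main__":
    HTTPServer(("", 8000), Handler).serve_forever()
```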
> Is it actually retrieving the page on the fly though?
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between data used to train a model, which comes from the indexing bot with the custom user-agent string, and the user-query input given to the aforementioned AI model. When you ask an AI some question, you normally input text into a form, and the text goes back to the AI model where the magic happens. In this scenario, instead of inputting a wall of text into a form, the text is coming from a URL.
These forms of user input are equivalent, and yet distinctly different. Therefore it's intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
To steel man this, even though I think the article did a fine job already, maybe the author could’ve changed the content on the page so you would know if they were serving a cached response.
The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.
It's been debated at length, but to make it short: piracy is not theft, and everyone in the LLM space has been taking other people’s content and so far getting away with it (pending lawsuits notwithstanding).
> so far getting away with it (pending lawsuits notwithstanding).
I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].
I'd believe it if they were targeting entities that could fight back, like stock photo companies and Disney, instead of some guy with an ArtStation account or some guy with a blog. To me it sounds like these products can't exist without exploiting someone, and they're too cowardly to ask for permission because they know the answer is going to be "no."
Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!
Correct, but it is often a licensing breach (though sometimes that depends on the reading of the license; again, these things are yet to be tested in any sort of court), and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.
I hate to argue this side of the fence, but when AI companies are taking the work of writers and artists en masse (replacing creative livelihoods with a machine trained on the artists' stolen work) and achieving billion-dollar valuations, that's actual stealing.
The key here is that creative content producers are being driven out of business through non-consensual taking of their work.
Maybe it’s a new thing, but if it is, it’s worse than stealing.
Exactly. It's like when Uber started and flouted the medallion taxi system of many cities. People said "These Uber people are idiots! They are going to get shut down! Don't they know the laws for taxis?" While a small number of cities did ban Uber (and even that generally only temporarily), in the end Uber basically won. I think a lot of people confuse what they want to happen with what will happen.
Respecting robots.txt is something their training crawler should do, and I see no reason why their user agent (i.e. user asks it to retrieve a web page, it does) should, as it isn't a crawler (doesn't walk the graph).
As to "lying" about their user agents - this is 2024, the "User-Agent" header is considered a combination bug and privacy issue, all major browsers lie about being a browser that was popular many years ago, and recently the biggest browser(s?) standardized on sending one exact string from now on forever (which would obviously be a lie). This header is deprecated in every practical sense, and every user agent should send a legacy value saying "this is mozilla 5" just like Edge and Chrome and Firefox do (because at some point people figured out that if even one website exists that customizes by user agent but did not expect that new browsers would be released, nor was maintained since, then the internet would be broken unless they lie). So Perplexity doing the same is standard, and best, practice.
If you've ever tried to do any web scraping, you'll know why they lie about the User-Agent, and you'd do it too if you wanted your program to work properly.
Discriminating based on User-Agent string is the unethical part.
> Robots.txt is made for telling bot identified by user agent what they are allowed to read.
Specifically it's meant for instructing "automatic clients known as crawlers" [0]. A crawler is defined by MDN as "a program, often called a bot or robot, which systematically browses the Web to collect data from webpages." [1]
As generally understood, wget is not a crawler even though it may be used to build one. Neither is curl. A crawler is a program which systematically browses the web, usually to build a search index.
I see no evidence that Perplexity's crawler is ignoring robots.txt, I only see evidence that when a user does a one-off request for a specific URL then Perplexity uses Chrome to access the site.
Basically, OP is using the wrong tool for the job and complaining when it doesn't work. If he wants to be excluded from Perplexity for one-off requests (as distinct from crawling), he needs to reach out to them; there is no applicable RFC.
Please explain - in detail - why using information communicated by the client to change how my server operates is “unethical”. Keep in mind I pay money and expend time to provide free content for people to consume.
Here is a simple example. If you made your website only work in, say, Microsoft Edge, and blocked everyone else, telling them to download Edge, I'd think you're an asshole. Whether or not being an ass is unethical I'll leave to the philosophers.
Clearly there are many other scenarios, and many that are more muddy, but overall, when we get into the business of trying to force people to consume content in particular ways, it's a bit icky in my opinion.
The extreme end result of this is no more open web, just force people to download your app to consume your content. This is happening too and it sucks.
Should there be a difference in treatment between a user going on a website and manually copying the content over to a bot to process vs giving the bot the URL so it does the fetching as well? I've done both (mainly to get summaries or translations) and I know which I generally prefer.
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
There are at least five lies here.
* It isn't made by Mozilla
* It doesn't use WebKit
* It doesn't use KHTML
* It isn't Safari
* That isn't even my version of chrome, presumably it hides the minor/patch versions for privacy reasons.
Lying in your user agent in order to make the internet work is a practice that is almost as old as user agents. Your browser is almost certainly doing it right now to look at this comment.
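Indeed, presenting a browser-shaped User-Agent is a one-liner in any HTTP client. As a sketch, using the standard library and the same Chrome-style string quoted above:

```python
# Any HTTP client can present a browser-shaped User-Agent; this is routine.
import urllib.request

req = urllib.request.Request(
    "https://example.com",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                           "Chrome/126.0.0.0 Safari/537.36"},
)
body = urllib.request.urlopen(req).read()
```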
For the past month or two, it's been hitting the free request limit as some AI company has scraped it to hell. I'm not inclined to stop them. Go ahead, poison your index with literal garbage. It's the cost of not actually checking the data you're indiscriminately scraping.
It seems to me there could be some confusion here.
When providing a service such as Perplexity AI's, there are two use cases to consider for accessing web sites.
One is the scraping use case for training, where a crawler is being used and it is gathering data in bulk. Hopefully in a way that doesn't hammer one site at a time, but spreads the requests around gently.
The other is the use case for fulfilling a user's specific query in real time. The blog post seemed to be hitting this second use case. In this use case, the system component that retrieves the web page is not acting as a crawler, but more as a browser or something akin to a browser plugin that is retrieving the content on behalf of the actual human end user, on their request.
It's appropriate that these two use cases have different norms for how they behave.
The author may have been thinking of the first use case, but actually exercising the second use case, and mistakenly expecting it to behave according to how it should behave for the first use case.
The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.
The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we already all use and which companies try with varying degrees of success to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use llms this way, but I'm uncomfortable with arguing that it's unethical for them to do that so long as they're citing the source.
But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to by content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.
There are many benefits to having people visit your content on a property that you own. e.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users, this might indicate to me where they are struggling or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
Google has been providing summaries of stuff and hijacking traffic for ages.
I kid you not, in the tourism sector this has been a HUGE issue, we have seen 50%+ decrease in views when they started doing it.
We paid gazzilions to write quality content for tourists about the most different places just so Google could put it on their homepage.
It's just depressing. I'm more and more convinced that the age of regulations and competition is gone, US does want to have unkillable monopolies in the tech sector and we are all peons.
If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?
Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.
The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?
> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.
This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.
This appears to be self-contradictory. If you let an LLM to be trained* on “all the books” (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege, while you would get zilch, but you would not even know they ever read the fruits of your research.
* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.
I would say though, it feels like the training argument may ultimately lead to a similar outcome, though it’s a bit more ideological and less tangible than regurgitating the results of a query. Services like chatgpt are already being used a google replacement by many people, so long term it may reduce clicks from search as well.
I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article <ARTICLE_TEXT>") then that's vanilla infringement, whether they lie about their UA or not.
Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.
The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
[0] https://docs.perplexity.ai/docs/perplexitybot
No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.
These companies don't need to be given an inch.
So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?
Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.
It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.
If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?
And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?
Website owners decide to stop publishing because it’s not rewarded by a real human visit anymore?
Then perplexity and the like won’t have new information to train their models on and no sites to answer the questions.
I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.
This is not the case with perplexity.
Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?
Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?
What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?
I guess if you're doing it for a living sure, but most content I consume online is created without incentive (social media, blogs, stack overflow).
I write a fair amount and have been for a few years. I like to play with ideas. If an llm learned from my writing and it helped me propagate my ideas, I'd be happy. I lose on social status imaginary internet points but I honestly don't care much for them.
The craziest one is the stack overflow contributors. They write answers for free to help people become better programmers but they're mad an llm will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?
It's not really a dilemma.
This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying it's market value and driving the original author out of business. Literally what copyright was invented to prevent.
It's the exact same situation as journalists going after Google & social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has been the embedded (summary of the) content. Which people don't click through, and this is the entire point of those features for platforms like Facebook; Keeping users on facebook and not leaving.
This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.
By all practical considerations, perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".
> The incentives of Google and website owners were more or less aligned.
The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.
Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.
> What will happen if:
What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.
And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.
It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.
For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.
This feels like the fundamental core component of what copyright allows you to forbid.
> Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local
Which is a huge difference. The latter is someone asking for a copy of my content (from someone with a valid license, myself), and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). The former adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".
I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law all together.
I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack I'm safe.
What if I use libraries written by other people for the TCP/IP and HTTP part?
What if I use a whole FOSS web browser?
What about a paid local web browser?
What if I run a script that I wrote on a cloud server?
What if I then allow other people to download and use that script on their own cloud servers?
What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?
What if I offer it for free to the general public?
What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?
Can you help me to understand where exactly I crossed the line?
1. Asking for a copy of your content
2. Manipulating the content
3. Redistributing the content to the end-user who requested it
Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).
I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And, for example, if there were a difference: using a browser on a remote desktop would be illegal, or using curl on a machine you were SSHed into would be illegal. Also, an LLM running locally on your machine (doing the exact same thing) would be legal!
I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.
I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.
If the company wants a copy of the files for your own use, then that is a bit different. When accessing large number of files at once, robots.txt is useful to block it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyways), then they might do so. However, even in this case, still they should not use unethical business practices, false advertising, etc; and, they should also avoid user-agent spoofing.
(In this case, the reason for the user-agent spoofing does not seem to be deliberate, since it uses a headless browser. They should still change it though; probably by keeping the user-agent string but adding on a extra part such as "Perplexity", to indicate that it is what it is, in addition to the headless browser.)
It is transforming the content for you, an authorized party.
That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it and then distributed it to everyone, disregarding that the show prohibited this.
If I shared my Netflix password with up to 5 others, at least I can argue that they are part of my “family” or something. But to unlimited numbers of people? Why would they pay for netflix, and how would the shows get made?
I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
What distinguishes these two situations?
* User asks proprietary web browser to fetch content and render it a specific way, which it does
* User asks proprietary web service to fetch content and render it a specific way, which it does
The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?
Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.
Deleted Comment
I’m not saying you’re wrong, but why? And what do you mean by “your data” here?
The website that they created.
Deleted Comment
1) a user-agent which makes an authenticated and authorized request for data, and delivers to the user
2) a user who then turns around and distributes the data or its derivatives to users in an unauthorized manner
A “dumber” example would be whether I can indefinitely cache and index most of information via the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or streetview photo information that google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?
THE REQUIREMENT TO OPEN SOURCE WEIGHTS
Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…
…that would be almost exactly like if I had made my code available with Affero GPL license, someone would take my code but then incorporated it into a backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open sourcing to the public. (Alternatively, they’d have to pay damages in a class action lawsuit and stop using the tainted backend software or weights when serving all those people.)
TECHNICAL ANALYSIS
The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.
Distributing your scripts and software to process data is not the same as distributing arbitrary data the user agent found on the internet for which you don’t have a license.
If someone wrote an article, your reader can transform it in response to your authenticated request, provided your user has an authorized subscription.
LEGAL ANALYSIS
Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access”, the theory that was used to prosecute Aaron Swartz.
In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.
There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...
Legally, you can grant access to people subject to a certain license (eg Creative Commons Share Alike) and then any derived content must have its weights opened. Similar to, say, Affero GPL license for derivative software.
N O T H I N G.
Could you change the way printed news magazines displayed their content? No. So why is this a problem?
Btw nobody clicks on sources. NOBODY.
I always click on sources to verify what, in this case, an LLM says. I hear the claim that people don't read sources a lot (before LLMs it was video content with references), but I have always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience, with people (including me) projecting it as the generic behavior of all people?
I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.
Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce the unnecessary popups and clicks needed to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.
The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.
I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??
It doesn't "need" to look at robots.txt because nothing does.
Website owners shouldn't dictate which browser users can access their site with, whether that's Chrome, Firefox, or something totally different like Perplexity.
When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.
If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.
robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.
Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.
The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.
If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.
This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
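For the curious, here is a minimal sketch of that same distinction using Python's standard library (the URLs and bot name are placeholders, not anything Perplexity uses): the one-off fetch never touches robots.txt, while the crawling step consults it before following links.

```python
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# One-off fetch: a plain user agent retrieving the single page the user
# asked for, mirroring `wget https://www.example.com`. No robots.txt.
def fetch_page(url: str) -> bytes:
    with urlopen(url) as resp:
        return resp.read()

# Crawler behavior: before following discovered links, consult
# robots.txt, mirroring `wget --recursive https://www.example.com`.
def may_crawl(user_agent: str, url: str) -> bool:
    rp = RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")  # placeholder site
    rp.read()  # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url)

html = fetch_page("https://www.example.com")  # user-requested page
if may_crawl("ExampleBot/1.0", "https://www.example.com/next-page"):
    fetch_page("https://www.example.com/next-page")
```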
So a browser with an ad-blocker that's removing / manipulating elements on the page isn't a browser? What about reader mode?
I’d be disappointed if it became common to block clients like this though. To me this feels like blocking google chrome because you don’t want to show up in google search (which is totally fine to want, for the record). Unnecessarily user hostile because you don’t approve of the company behind the client.
The article is just calling Perplexity out for some asshole behavior; it's not that complicated.
It's clear they know they're engaging in poor behavior too; they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome, along the lines of the sketch below. Folks who trust them could then have blocked the training UA while allowing the alternative.
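As a rough sketch of what that could look like (nothing Perplexity has actually documented; the UA strings and names here are hypothetical, and the `requests` library is assumed), a service could use one declared UA for its crawler and a second, self-identifying UA for user-initiated fetches:

```python
import requests

# Hypothetical declared UA for bulk crawling (honors robots.txt).
CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"

# Hypothetical UA for one-off, user-initiated fetches: still advertises
# Chrome compatibility so pages render normally, but appends an
# identifying token that site owners could allow or block
# independently of the crawler.
USER_FETCH_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 Example-User/1.0"
)

def fetch_for_user(url: str) -> str:
    """Fetch a single page on behalf of an explicit user request."""
    resp = requests.get(url, headers={"User-Agent": USER_FETCH_UA}, timeout=10)
    resp.raise_for_status()
    return resp.text
```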
Does anyone actually do this, though?
What I do expect the AI companies to do is check the license of the content they scrape and follow it. Say I run a blog under a CC BY-NC 4.0 license: you can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC; that's fine. But for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
Tell that to the Chrome team. And the Safari team. And the Opera team. [0]
[0] https://webaim.org/blog/user-agent-string-history/
In my testing they're using crawlers on AWS that do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like the one from Cloudflare, or you can build your own.
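A minimal sketch of a home-built interstitial along those lines, assuming Flask (purely illustrative, and trivially bypassed by any bot that bothers to set the cookie; real challenges like Cloudflare's are far more involved):

```python
from flask import Flask, make_response, request

app = Flask(__name__)

# The interstitial: JavaScript sets a cookie and reloads. A client that
# never executes JavaScript never gets past this page.
CHALLENGE_HTML = """<html><body>Checking your browser...
<script>
  document.cookie = "challenge=passed; path=/";
  location.reload();
</script>
</body></html>"""

@app.route("/")
def index():
    if request.cookies.get("challenge") != "passed":
        return make_response(CHALLENGE_HTML, 503)
    return "The actual content."
```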
They are able to do so.
> How do you know this?
The access logs.
> Even if it were - it’s not supposed to be able to.
There is a distinction between the data used to train a model, which is gathered by the indexing bot with the custom user-agent string, and the user-query input given to that model. When you ask an AI a question, you normally type text into a form, and the text goes to the AI model where the magic happens. In this scenario, instead of pasting a wall of text into a form, the text is coming from a URL.
These forms of user input are equivalent, and yet distinctly different. It is therefore intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.
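A rough sketch of that second input path, with a hypothetical `ask_model` standing in for the model call and the `requests` library assumed: the page is fetched once, at the user's explicit request, and spliced into the prompt exactly as pasted-in text would be.

```python
import requests

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever calls the AI model."""
    raise NotImplementedError

def answer_with_url(question: str, url: str) -> str:
    # Fetch the page once, on the user's behalf.
    page_text = requests.get(url, timeout=10).text
    # The page text becomes prompt context, truncated to a rough budget.
    prompt = f"Using this page as context:\n{page_text[:8000]}\n\nQuestion: {question}"
    return ask_model(prompt)
```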
https://stackdiary.com/perplexity-has-a-plagiarism-problem/
The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.
[0]: https://www.semafor.com/article/06/12/2024/perplexity-was-pl...
I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].
[0] https://en.m.wikipedia.org/wiki/Oracle_v_Google
Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!
That’s a hell of a caveat!
Correct, but it is often a licensing breach (though sometimes that depends on the reading of the license; again, these things are yet to be tested in any sort of court), and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.
The key here is that creative content producers are being driven out of business through the non-consensual taking of their work.
Maybe it’s a new thing, but if it is, it’s worse than stealing.
It was when Napster was doing it, but there's no entity like the RIAA to stop the AI bots.
Can they not? I think that remains to be seen.
As to "lying" about their user agents - this is 2024, the "User-Agent" header is considered a combination bug and privacy issue, all major browsers lie about being a browser that was popular many years ago, and recently the biggest browser(s?) standardized on sending one exact string from now on forever (which would obviously be a lie). This header is deprecated in every practical sense, and every user agent should send a legacy value saying "this is mozilla 5" just like Edge and Chrome and Firefox do (because at some point people figured out that if even one website exists that customizes by user agent but did not expect that new browsers would be released, nor was maintained since, then the internet would be broken unless they lie). So Perplexity doing the same is standard, and best, practice.
Discriminating based on User-Agent string is the unethical part.
If I knew the creator of the page didn't want it used by my program, I wouldn't do it.
>Discriminating based on User-Agent string is the unethical part.
Not being exploited by an AI company is unethical? Robots.txt is made for telling bots, identified by user agent, what they are allowed to read.
Specifically it's meant for instructing "automatic clients known as crawlers" [0]. A crawler is defined by MDN as "a program, often called a bot or robot, which systematically browses the Web to collect data from webpages." [1]
As generally understood, wget is not a crawler even though it may be used to build one. Neither is curl. A crawler is a program which systematically browses the web, usually to build a search index.
I see no evidence that Perplexity's crawler is ignoring robots.txt; I only see evidence that when a user makes a one-off request for a specific URL, Perplexity uses Chrome to access the site.
Basically, OP is using the wrong tool for the job and complaining when it doesn't work. If he wants to be excluded from Perplexity's one-off requests (as distinct from crawling), he needs to reach out to them; there is no applicable RFC.
[0] https://www.rfc-editor.org/rfc/rfc9309.html
[1] https://developer.mozilla.org/en-US/docs/Glossary/Crawler
Clearly there are many other scenarios, and many that are more muddy, but overall, when we get into the business of trying to force people to consume content in particular ways, it's a bit icky in my opinion.
The extreme end result of this is no more open web, just force people to download your app to consume your content. This is happening too and it sucks.
>Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36
There are at least five lies here.
* It isn't made by Mozilla
* It doesn't use WebKit
* It doesn't use KHTML
* It isn't safari
* That isn't even my version of Chrome; presumably it hides the minor/patch versions for privacy reasons.
Lying in your user agent in order to make the internet work is a practice that is almost as old as user agents. Your browser is almost certainly doing it right now to look at this comment.
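You can watch this happen with a few lines of Python's standard http.server: run it, browse to localhost:8000, and see which legacy string your browser claims to be.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class UAEcho(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "(none)")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        # Almost certainly begins with "Mozilla/5.0", whatever your browser.
        self.wfile.write(ua.encode())

HTTPServer(("127.0.0.1", 8000), UAEcho).serve_forever()
```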
https://guthib.mattbasta.workers.dev
For the past month or two, it's been hitting the free request limit as some AI company has scraped it to hell. I'm not inclined to stop them. Go ahead, poison your index with literal garbage. It's the cost of not actually checking the data you're indiscriminately scraping.
When providing a service such as Perplexity AI's, there are two use cases to consider for accessing web sites.
One is the scraping use case for training, where a crawler is being used and it is gathering data in bulk. Hopefully in a way that doesn't hammer one site at a time, but spreads the requests around gently.
The other is the use case for fulfilling a user's specific query in real time. The blog post seemed to be hitting this second use case. In this use case, the system component that retrieves the web page is not acting as a crawler, but more as a browser or something akin to a browser plugin that is retrieving the content on behalf of the actual human end user, on their request.
It's appropriate that these two use cases have different norms for how they behave.
The author may have been thinking of the first use case while actually exercising the second, and so mistakenly expected it to behave according to the norms of the first.