Falkon1313 · a month ago
This is kinda amusing.

robots.txt's main purpose back in the day was curtailing search engine penalties when you got stuck maintaining a badly-built dynamic site that had tons of dynamic links and effectively got penalized for duplicate content. It was basically a way of saying "Hey search engines, these are the canonical URLs, ignore all the other ones with query parameters or whatever that give almost the same result."

It could also help keep 'nice' crawlers from getting stuck crawling an infinite number of pages on those sites.

Of course it never did anything for the 'bad' crawlers that would hammer your site! (And there were a lot of them, even back then.) That's what IP bans and such were for. You certainly wouldn't base it on something like User-Agent, which the user agent itself controlled! And you wouldn't expect the bad bots to play nicely just because you asked them.

That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

Or the Evil Bit proposal, to suggest that malware should identify itself in the headers. "The Request for Comments recommended that the last remaining unused bit, the "Reserved Bit" in the IPv4 packet header, be used to indicate whether a packet had been sent with malicious intent, thus making computer security engineering an easy problem – simply ignore any messages with the evil bit set and trust the rest."

MiddleMan5 · a month ago
It should be noted here that the Evil Bit proposal was an April Fools RFC https://datatracker.ietf.org/doc/html/rfc3514
Y_Y · a month ago
While we're at it, it should be noted that Do Not Track was not, apparently, a joke.

It's the same as a noreply email: if you can get away with sticking your fingers in your ears and humming when someone is telling you something you don't want to hear, and you have a computer to hide behind, then it's all good.

2OEH8eoCRo0 · a month ago
I like the 128 bit strength indicator for how "evil" something is.
pi_22by7 · a month ago
So it did the same work that a sitemap does? Interesting.

Or maybe more like the opposite: robots.txt told bots what not to touch, while sitemaps point them to what should be indexed. I didn’t realize its original purpose was to manage duplicate content penalties though. That adds a lot of historical context to how we think about SEO controls today.

JimDabell · a month ago
> I didn’t realize its original purpose was to manage duplicate content penalties though.

That wasn’t its original purpose. It’s true that you didn’t want crawlers to read duplicate content, but it wasn’t because search engines penalised you for it – WWW search engines had only just been invented and they didn’t penalise duplicate content. It was mostly about stopping crawlers from unnecessarily consuming server resources. This is what the RFC from 1994 says:

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

https://www.robotstxt.org/orig.html

tbrownaw · a month ago
> And you wouldn't expect the bad bots to play nicely just because you asked them.

Well, yes, the point is to tell the bots what you've decided to consider "bad" and will ban them for. So that they can avoid doing that.

Which of course only works to the degree that they're basically honest about who they are or at least incompetent at disguising themselves.

gbalduzzi · a month ago
I think it depends on the definition of bad.

I always consider "good" a bot that doesn't disguise itself and follows the robots.txt rules. I may not consider good the final intent of the bot or the company behind it, but the crawler behaviour is fundamentally good.

Especially considering the fact that it is super easy to disguise a crawler and not follow the robots conventions

nullc · a month ago
> That's about as naive as the Do-Not-Track header, which was basically kindly asking companies whose entire business is tracking people to just not do that thing that they got paid for.

It's usually a bad default to assume incompetence on the part of others, especially when many experienced and knowledgeable people have to be involved to make a thing happen.

The idea behind the DNT header was to back it up with legislation -- and sure, you can't catch and prosecute all tracking, but there are limits on the scale of criminal "move fast and break things" before someone rats you out. :P

pjmlp · a month ago
Some people just believe that because someone says so, everyone will nicely obey and follow the rules. I don't know, maybe it is a cultural thing.
vintagedave · a month ago
Or a positive belief in human nature.

I admit I'm one of those people. After decades where I should perhaps be a bit more cynical, from time to time I am still shocked or saddened when I see people do things that benefit themselves over others.

But I kinda like having this attitude and expectation. Makes me feel healthier.

0xEF · a month ago
It's easy to believe, though, and most of us do it every day. For example, our commute to work is marked by the trust that other drivers will cooperate, following the rules, so that we all get to where we are going.

There are varying degrees of this through our lives, where the trust lies not in the fact that people will just follow the rules because they are rules, but because the rules set expectations, allowing everyone to (more or less) know what's going on and decide accordingly. This also makes it easier to single out the people who do not think the rules apply to them so we can avoid trusting them (and, probably, avoid them in general).

PaulHoule · a month ago
Robots.txt was created long before Google and before people were thinking about SEO:

https://en.wikipedia.org/wiki/Robots.txt

The scenario I remember was that the underfunded math department had an underpowered server connected via a wide and short pipe to the overfunded CS department and webcrawler experiments would crash the math department's web site repeatedly.

abirch · a month ago
With the advent of AI, and the notion of actually going to a website becoming quaint, each website should have a humans.txt such as https://www.netflix.com/humans.txt or https://www.google.com/humans.txt
LorenPechtel · a month ago
Yup. Robots.txt was a don't-swamp-me thing.
EPendragon · a month ago
It is so interesting to track this technology's origin back to the source. It makes sense that it would come from a background of limited resources where things would break if you overwhelm it. It didn't take much to do so.
franga2000 · a month ago
I still see the value in robots.txt and DNT as a clear, standardised way of posting a "don't do this" sign that companies could be forced to respect through legal means.

The GDPR requires consent for tracking. DNT is a very clear "I do not consent" statement. It's a very widely known standard in the industry. It would therefore make sense that a court would eventually find companies not respecting it are in breach of the GDPR.

That was a theory at least...

EPendragon · a month ago
Would robot traffic be considered tracking in light of GDPR standards? As far as I know there are no regulatory rules in relation to enforcing robots behaviors outside of robots.txt, which is more of an honor system.
dumbfounder · a month ago
I created a search engine that crawled the web way back in 2003. I used a proper user agent that included my email address. I got SO many angry emails about my crawler, which played as nice as I was able to make it play. Which was pretty nice, I believe. If it's not Google, people didn't want it. That's a good way to prevent anyone from ever competing with Google. It isn't just about that preview for LinkedIn; it's about making sure the web is accessible to everyone and everything that is trying to make its way. Sure, block the malicious ones. But don't just assume that every bot is malicious by default.
knorker · a month ago
The most annoying thing about being a good bot owner, in my experience, is when you get complaints about it misbehaving, only to find that it was actually somebody malicious who wrote their own abusive bot, but is using your bot's user agent.
pimterry · a month ago
Cloudflare have some new bot verification proposals designed to fix this, with cryptographic proofs that the user-agent is who they say they are: https://blog.cloudflare.com/web-bot-auth/.
TylerE · a month ago
That's easy to say when it's your bot, but I've been on the other side enough to know that the problem isn't your bot, it's the 9000 other ones just like it, none of which will deliver traffic anywhere close to the resources consumed by scraping.
kijin · a month ago
True. Major search engines and bots from social networks have a clear value proposition: in exchange for consuming my resources, they help drive human traffic to my site. GPTBot et al. will probably do the same, as more people use AI to replace search.

A random scraper, on the other hand, just racks up my AWS bill and contributes nothing in return. You'd have to be very, very convincing in your bot description (yes, I do check out the link in the user-agent string to see what the bot claims to be for) in order to justify using other people's resources on a large scale and not giving anything back.

An open web that is accessible to all sounds great, but that ideal only holds between consenting adults. Not parasites.

komali2 · a month ago
I'm confused why scraping is so resource intensive - it hits every URL your site serves? For an individual ecommerce site that's maybe 10,000 hits?
EPendragon · a month ago
I definitely agree here. My initial response was to block everything; however, you realize that the web is complex and interdependent. I still believe that everyone should have autonomy over their online resources if they desire. But that comes with an intentionality behind it. If you want to allow or disallow certain traffic, you should also answer the question why or why not. That requires understanding what each bot does. That takes time and effort.

My foray into robots.txt started from the whole notion of AI companies training on everything they can put their hands on. I want to be able to have a say whether I allow it or not. While not all bots will honor the robots.txt file, there are plenty that do. One way that I found you can test that is by asking the model directly to scrape a particular link (assuming the model has browsing capabilities).

Bots are not malicious by default. It is what that company does with your data and how you feel about it that matters in the end.

tomrod · a month ago
> But don’t just assume that every bot is malicious by default.

I'll bite. It seems like a poor strategy to trust by default.

ronsor · a month ago
I'll bite harder. That's how the public Internet works. If you don't trust clients at all, serve them a login page instead of content.
mytailorisrich · a month ago
It's just that people are suspicious of unknown crawlers, and rightly so.

Since it is impossible to know a priori which crawlers are malicious, and many are malicious, it is reasonable to default to considering anything unknown malicious.

Jach · a month ago
I guess back in 2003 people would expect an email to actually go somewhere; these days I would expect it to either go nowhere or just be part of a campaign to collect server admin emails for marketing/phishing purposes. Angry emails are always a bit much, but I wonder if they aren't sent as much anymore in general, or if people just stopped posting them to point and laugh at and wonder what goes through people's minds to get so upset as to send such emails.

My somewhat silly take on seeing a bunch of information like emails in a user agent string is that I don't want to know about your stupid bot. Just crawl my site with a normal user agent and if there's a problem I'll block you based on that problem. It's usually not a permanent block, and it's also usually set up with something like fail2ban so it's not usually an instant request drop. If you want to identify yourself as a bot, fine, but take a hint from googlebot and keep the user agent short with just your identifier and an optional short URL. Lots of bots respect this convention.

But I'm just now reminded of some "Palo Alto Networks" company that started dumping their garbage junk in my logs, they have the audacity to include messages in the user agent like "If you would like to be excluded from our scans, please send IP addresses/domains to: scaninfo@paloaltonetworks.com" or "find out more about our scans in [link]". I put a rule in fail2ban to see if they'd take a hint (how about your dumb bot detects that it's blocked and stops/slows on its own accord?) but I forgot about it until now, seems they're still active. We'll see if they stop after being served nothing but zipbombs for a while before I just drop every request with that UA. It's not that I mind the scans, I'd just prefer to not even know they exist.

EPendragon · a month ago
I think a better solution would be to block all the traffic, but have a comment in robots.txt with a way to be added onto a whitelist to scrape the contents of the resource. This puts a burden of requesting the access on the owner of the bot, and if they really want that access, they can communicate it and we can work it out.
donohoe · a month ago
I try to stay away from negative takes here, so I’ll keep this as constructive as I can:

It’s surprising to see the author frame what seems like a basic consequence of their actions as some kind of profound realization. I get that personal growth stories can be valuable, but this one reads more like a confession of obliviousness than a reflection with insight.

And then they posted it here for attention.

zem · a month ago
it's mostly that they didn't think of the page preview fetcher as a "crawler", and did not intend for their robots.txt to block it. it may not be profound but it's at least not a completely trivial realisation. and heck, an actual human-written blog post can only improve the average quality of the web.
archievillain · a month ago
The bots are called "crawlers" and "spiders", which to me evokes the image of tiny little things moving rapidly and mechanically from one place to another, leaving no niche unexplored. Spiders exploring a vast web.

Objectively, "I give you one (1) URL and you traverse the link to it so you can get some metadata" still counts as crawling, but I think that's not how most people conceptualize the term.

It'd be like telling someone "I spent part of the last year travelling." and when they ask you where you went, you tell them you commuted to and from your workplace five times a week. That's technically travelling, although the other person would naturally expect you to talk about a vacation or a work trip or something to that effect.

EPendragon · a month ago
That's exactly it. It was one of those unintended consequences of blocking everything that led me down the road of figuring it out.

Like other commenters have indicated, I will likely need to go back and allow some other social media to access the OGP data for previews to render properly. But since I mostly post on LinkedIn and HN, I don't feel the need to go and allow all of the other media at the moment. That might change in the future.

spookie · a month ago
They posted it here because they wouldn't appear on Google otherwise (:
EPendragon · a month ago
I mean it was a realization for me, although I wouldn't call it profound. To your point, it was closer to obliviousness, which led me to learn more about Open Graph Protocol details and how Robots Exclusion Protocol works.

I try to write about things that I learn or find interesting. Sharing it here in the hopes that others might find it interesting too.

coolgoose · a month ago
I agree, and I am also confused about how this got on the frontpage of all things. It's like reading a news article saying 'water is wet'.

You block things -> of course good actors will respect it and avoid you -> of course bad actors will just ignore it, as it's a "please do not do this" sign, not a firewall blocking things.

EPendragon · a month ago
Honestly, I am also surprised how this got on the frontpage. This was supposed to be a small post of what I have learnt in the process of fixing my LinkedIn previews. I don't know how we got here.
franga2000 · a month ago
The problem with robots.txt is the reliance on the identity rather than the purpose of the bots.

The author had blocked all bots because they wanted to get rid of AI scrapers. Then they wanted to unblock bots scraping for OpenGraph embeds, so they unblocked...LinkedIn specifically. What if I post a link to their post on Twitter or any of the many Mastodon instances? Now they'd have to manually unblock all of those UAs, which they obviously won't, so this creates an even bigger power advantage for the big companies.

What we need is an ability to block "AI training" but allow "search indexing, opengraph, archival".

And of course, we'd need a legal framework to actually enforce this, but that's an entirely different can of worms.
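Until something purpose-based exists, the closest approximation is still identity-based, because a few operators publish separate user agents for training versus search and previews. A rough sketch of what that looks like in robots.txt today (the tokens below are the ones those operators document; it only works to the extent the bots self-report honestly):

# crawlers whose stated purpose is AI training
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# search indexing and link previews stay allowed
User-agent: Googlebot
User-agent: Bingbot
User-agent: LinkedInBot
User-agent: Twitterbot
User-agent: facebookexternalhit
Disallow:

# everything else is blocked
User-agent: *
Disallow: /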

kevincox · a month ago
I think there is a long standing question about what robots.txt is for in general. In my opinion it was originally (and still is mostly) intended for crawlers. It is for bots that are discovering links and following them. A search engine would be an obvious example of a crawler. These are links that even if discovered shouldn't be crawled.

On the other end is user-requested URLs. Obviously a browser operated by a human shouldn't consider robots.txt. Almost as obviously, a tool subscribing to a specific iCal calendar feed shouldn't follow robots.txt because the human told it to access that URL. (I recall some service, can't remember if it was Google Calendar or Yahoo Pipes or something else that wouldn't let you subscribe to calendars blocked by robots.txt which seemed very wrong.)

The URL preview use case is somewhat murky. If the user is posting a single link and expecting it to generate a preview this very much isn't crawling. It is just fetching based on a specific user request. However if the user posts a long message with multiple links this is approaching crawling that message for links to discover. Overall I think this "URL preview on social media" probably shouldn't follow robots.txt but it isn't clear to me.

zajio1am · a month ago
Something like Common Crawl can be used for both search and AI training (and any other purpose, decided after crawl is done).
franga2000 · a month ago
Then such a crawler should mark itself with all purpose tags and thus be blocked in this scenario.

Alternatively, it could make the request anyways and separate the crawled sites by permitted purpose in its output.

BlackFly · a month ago
This is just a problem of sharing information in band instead of out of band. The OpenGraph metadata is in band with the page content, which doesn't need to be shared with OpenGraph bots. The way to separate the usage is to serve the content and metadata separately, with some specific query using `content-type` or `HEAD` or something; then bots are free to fetch the metadata (useless for AI bots) and you can freely forbid all bots from the actual content. Then you don't really need much of a legal framework.
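There's no standard content-negotiation convention for this today, but as a crude approximation you can already key the split on the preview bots' self-identified user agents. A minimal hypothetical sketch (Flask, with an invented bot list and hard-coded content purely for illustration):

from flask import Flask, request

app = Flask(__name__)

# user-agent substrings of common link-preview fetchers (illustrative, not exhaustive)
PREVIEW_BOTS = ("linkedinbot", "twitterbot", "facebookexternalhit", "slackbot")

OG_STUB = """<html><head>
<meta property="og:title" content="My post">
<meta property="og:description" content="A short summary">
<meta property="og:image" content="https://example.com/cover.png">
</head><body></body></html>"""

FULL_PAGE = "<html><body><article>Full post content here.</article></body></html>"

@app.route("/posts/<slug>")
def post(slug):
    ua = (request.headers.get("User-Agent") or "").lower()
    if any(bot in ua for bot in PREVIEW_BOTS):
        # preview fetchers get the metadata stub only
        return OG_STUB
    return FULL_PAGE

The same caveat from upthread applies: this only distinguishes bots that identify themselves.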
EPendragon · a month ago
I like the idea of using HEAD or OPTIONS methods and having all bots access those, so that they get a high-level idea of what's going on without access to the actual content, if the owner decides to block it.
EPendragon · a month ago
I do like your suggestion of creating some standard that categorizes bots by function or purpose like you mention. This could simplify things, granted that there is a way to validate the stated purpose and make spoofing hard to achieve. And yes - there is also the legal side.

I do think that I will likely need to go back and unblock a couple of other bots for this exact reason - so that it would be possible to share it and have previews in other social media. I like to take a slow and thoughtful approach to allowing this traffic as I get to learn what it is that I want and do not want.

Comments here have been a great resource to learn more about this issue and see what other people value.

jarofgreen · a month ago
Hey OP,

1)

You consider this for LinkedIn but don't stop to think about other social networks. This is true of basically all of them. You may not post on Facebook, Bluesky, etc., but other people may like your links and post them there.

I recently ran into this as it turns out the Facebook entries in https://github.com/ai-robots-txt/ai.robots.txt also block the crawler FB uses for link previews.

2)

From your first post,

> But what about being closer to the top of the Google search results - you might ask? One, search engines crawling websites directly is only one variable in getting a higher search engine ranking. References from other websites will also factor into that.

Kinda... It's technically true that you can rank in Google if you block them in robots.txt, but it's going to take a lot more work. Also your listing will look worse (last time I saw this there was no site description, but that was a few years back). If you care about Google SEO traffic you may want to let them on your site.

EPendragon · a month ago
Hey, @jarofgreen! Thank you for the feedback!

1) I only considered LinkedIn alone since I have been posting there and here on HN, and that's it. I figured I would let it play out until I need to allow more bots to access it. Your suggestion that other people may want to share links to the blog is a very valid one that I hadn't thought about. I might need to allow several other platforms.

2) With Google and other search engines I have seen a trend towards the AI summaries. It seems like this is the new meta for search engines. And with that I believe it will reduce the organic traffic from those engines to the websites. So, I do not particularly feel that this is a loss for me.

I might eat my words in the future, but right now I think that social media and HN sharing is what will drive the most meaningful traffic to my blog. It is word-of-mouth marketing, which I think is a lot more powerful than finding my blog in a Google search.

I will definitely need to go back and do some more research on this topic to make sure that I'm not shooting myself in the foot with these decisions. Comments here have been very helpful in considering other opinions and options.

crowcroft · a month ago
Point 2 is true to an extent. Assuming you aren't monetizing your traffic, though, wouldn't earning citations still be more valuable than not showing up in Google at all?

You should also consider that a large proportion of search is purely navigational. What if someone is trying to find your blog and is typing 'evgenii pendragon'? AI summaries don't show for this kind of search.

Currently I can see your site is still indexed (disallowing Google won't remove a page from the index if it's already been indexed) and so you show in search results, but because you block Google in robots.txt it can't crawl content and so shows an awkward 'No information is available for this page.' message. If you continue to block Google in robots.txt eventually the links to your site will disappear entirely.

Even if you really don't want to let AI summarize your content, I would at least allow crawlers to your homepage.

PaulKeeble · a month ago
The problem isn't the robots that do follow robots.txt, it's all the bots that don't. Robots.txt is largely irrelevant now; the bots that respect it don't represent most of the traffic problem. They certainly aren't the bots that are going to hammer your site without any regard, because those bots don't follow robots.txt.
zarzavat · a month ago
That's what honeypots are for.

Disallow /honeypot in your robots.txt

Add <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a> to your index.html

If an IP accesses that path, ban it.
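A minimal sketch of the banning step, assuming nginx access logs in the default format and fail2ban (filter name, paths, and ban time are arbitrary):

# /etc/fail2ban/filter.d/robots-honeypot.conf
[Definition]
failregex = ^<HOST> .* "GET /honeypot

# /etc/fail2ban/jail.d/robots-honeypot.local
[robots-honeypot]
enabled  = true
port     = http,https
filter   = robots-honeypot
logpath  = /var/log/nginx/access.log
maxretry = 1
bantime  = 86400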

EPendragon · a month ago
There is an interesting article about AI tarpits that addresses a similar issue: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-...

The argument is basically to have the bots that decide to ignore your robots.txt (or any bot, if you desire) scrape your website indefinitely, wasting their resources.

brewmarche · a month ago
I wonder whether the path in the robots.txt (or maybe a <link> tag with a bogus rel attribute) would already be enough to make evil bots follow it. That would at least avoid accidental follows due to CSS/ARIA not working properly in unusual setups.
jpc0 · a month ago
> <a href="/honeypot" style="display:none" aria-hidden="true">ban me</a>

Unrelated meta question, but is the aria attribute necessary, since display: none; should already remove the content from the flow?

fpoling · a month ago
While this may work today, bots now more and more use full headless rendering, or at least apply CSS rules, so as not to fetch invisible content.

Deleted Comment

kwar13 · a month ago
I like this. Adding now. Thanks!
anonnon · a month ago
Not sure why you were downvoted. I have zero confidence that OpenAI, Anthropic, and the rest respect robots.txt however much they insist they do. It's also clear that they're laundering their traffic through residential ISP IP addresses to make detection harder. There are plenty of third-parties advertising the service, and farming it out affords the AI companies some degree of plausible deniability.
chneu · a month ago
Nobody has any confidence in AI companies not to DDoS. That's why there have been dozens of posts about how bandwidth is becoming an issue for many websites as bots continuously crawl their sites for new info.

Wikipedia has backups for this reason. AI companies ignore the readily available backups and instead crawl every page hundreds of times a day.

I think debian also recently spoke up about it.

EPendragon · a month ago
One way to test if the model respects robots.txt could be to ask the model if it can scrape a given URL. One thing that it doesn't address, though, is the scraping of training data for the model. That area feels more like the wild west.
NackerHughes · a month ago
This article could have been two lines. It takes some serious stretching of school-essay-writing muscles to inflate it to this many pages of waffle.
EPendragon · a month ago
I think a paragraph could have been enough to describe the issue.

My goal with this post was to describe my personal experience with the problem, research, and the solution - the overall journey. I also wanted to write it in a manner that a non-technical person would be able to follow. Hence, being more explicit in my elaborations.

crowcroft · a month ago
Another common unintended consequence I've seen is conflating crawling and indexing with regards to robots.txt.

If you make a new page and never want it to enter the Google search index, adding it to robots.txt is fine: Google will never crawl it, and it will never enter the index.

If you have an existing page that is currently indexed and want to remove it, adding that page to robots.txt is a BAD idea though. In the short term Google will continue to show the page in search results, but with no metadata (because it can't crawl it anymore). Even worse, Google won't pick up any noindex tags on the page, because robots.txt is blocking the page from being crawled!

Eventually Google will get the hint and remove the page from the index, but it can be a very frustrating time waiting for that to happen.
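Put differently: to get an already-indexed page removed, leave it crawlable and send a noindex signal instead, either in the page or as a response header, e.g.:

<!-- in the page's <head>: crawlable, but not indexable -->
<meta name="robots" content="noindex">

# or equivalently as an HTTP response header
X-Robots-Tag: noindex

A robots.txt Disallow for that URL only blocks crawling, so Google never gets to see the noindex.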

rafaelm · a month ago
There are cases where Google might find a URL blocked in robots.txt (through external or internal links), and the page can still be indexed and show up in the search results, even if they can't crawl it. [1].

The only way to be sure that it will stay out of the results is to use a noindex tag. Which, as you mentioned, search engine bots need to "read" in the code. If the URL is blocked, the "noindex" cannot be read.

[1] https://developers.google.com/search/docs/crawling-indexing/... (refer to the red "Warning" section)

EPendragon · a month ago
It is an interesting tidbit. I personally don't need Google to remove it from indexing. It is more of an "I don't care if they index it". I mostly care about the scraping, not the indexing. I do understand that these terms could be used interchangeably. In the past I might have conflated them.