# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See https://www.reddit.com/r/reddit4researchers/ for details on how Reddit continues to support research and non-commercial use.
# policy: https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
User-agent: *
Disallow: /
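For anyone skimming: that last pair of lines tells every compliant crawler to stay out of the entire site. A quick way to check what the quoted rules mean for an arbitrary crawler, using Python's standard urllib.robotparser (the user-agent string and URL below are just placeholders):

import urllib.robotparser

# Parse the rules quoted above instead of fetching them live.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# Any compliant crawler, any path: the answer is False.
print(rp.can_fetch("ExampleBot/1.0", "https://www.reddit.com/r/programming/"))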
Their license/EULA clearly states that Reddit has a perpetual license (or whatever) to content posted on Reddit, but relying solely on the DMCA for "stolen" content _yet again_ feels like a terrible way to deal with non-original content. Part of me hopes that Reddit gets hit with some new precedent-setting lawsuits regarding non-original content that require useful attribution, but I doubt that will ever happen.
It's even worse: it's not theirs (it's the users'); they are merely hosting and using it (the ToS gives them a fancy irrevocable license, I guess).
So they can do whatever they want with it, and the actual owners/authors have no real way to influence Reddit to make it crawlable. (The GDPR-like data takeout is nice, but completely useless in cases like this, where the value is in the composition and aggregation with other users' content.)
That’s a weird statement to be absolutist about. The majority of individuals and companies who want to be successful do not get there by scraping websites, and thus have no reason to disobey robots.txt. Most people in the world, ambitious or not, wouldn’t even understand what your sentence refers to.
Hasn't the NYT sued OpenAI over ignoring robots.txt? Or do you mean it's impossible to prove, and/or still more profitable to just ignore robots.txt?
With the amount of crap on Reddit, cleaning it must be a very non-trivial problem. (I mean, it never is trivial, but in the case of Reddit it's probably extra complicated.)
I understand the AI context, but this is dangerously anticompetitive for other search engines.
This is a dangerous precedent for the internet. Business conglomerates have been controlling most of the web, but refusing basic interoperability is even worse.
> There is nothing preventing search companies paying the same $60 Million to license content.
Yes, actually, there is - having $60m to throw around.
"Barriers to entry often cause or aid the existence of monopolies and oligopolies" [0]. Monopolies and oligopolies are definitionally the opposite of free market forces. This is quite literally Econ 101.
How many other sites might have leverage to charge to be indexed?
I don't want to live in a world where you have to use X search engine to get answers from Y site - but this seems like the beginning of that world.
From an efficiency perspective, it's obviously better for websites to just lease their data to search engines than for both sides to pay tons of bandwidth and compute to get that data onto search engines.
Realistically, there are only 2 search engines now.
This seems very bad for Kagi - but could it possibly lead to the old, cool, hobbyist & un-monetized web being reinvented?
edit:
> Realistically, there are only 2 search engines now.
https://seirdy.one/posts/2021/03/10/search-engines-with-own-...
That sounds like the business model for streaming: you subscribe to X provider to watch Y series. So, as with streaming, I suppose a Pirate Bay-style search engine will come up.
The Pirate Bay is probably not the best analogy; more like Anna's Archive, IMHO [1]: scrape runs offered individually per web property, compressed into packages, maybe served via torrents like this Academic Torrents example [2].
A rough pipeline sketch: scraper engine -> validation/processing/cleanup -> object storage -> index + torrent serving.
[1] https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu... ("HN Search: annas archive")
[2] https://academictorrents.com/details/9c263fc85366c1ef8f5bb9d... ("AcademicTorrents: Reddit comments/submissions 2005-06 to 2023-12 [2.52TB]")
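To make that pipeline a bit more concrete, here is a minimal sketch in Python using only the standard library; the stage names, the on-disk layout, and the toy inverted index are illustrative assumptions, not anything an existing archive project actually uses.

import gzip, hashlib, json, pathlib

def scrape(urls):
    # Fetch stage (stubbed): a real scraper would fetch pages politely,
    # honoring robots.txt and rate limits.
    for url in urls:
        yield {"url": url, "html": "<html><body>example page</body></html>"}

def clean(record):
    # Validation / processing / cleanup stage: normalize fields and
    # drop anything that should not go into the archive.
    return {"url": record["url"], "text": record["html"].strip()}

def store(record, root=pathlib.Path("objects")):
    # Object-storage stage: content-addressed, compressed blobs that could
    # later be packaged up and served over torrents.
    root.mkdir(exist_ok=True)
    blob = json.dumps(record, sort_keys=True).encode()
    key = hashlib.sha256(blob).hexdigest()
    (root / f"{key}.json.gz").write_bytes(gzip.compress(blob))
    return key

def build_index(records):
    # Index stage: a toy inverted index mapping tokens to URLs.
    inverted = {}
    for rec in records:
        for token in rec["text"].lower().split():
            inverted.setdefault(token, set()).add(rec["url"])
    return inverted

if __name__ == "__main__":
    cleaned = [clean(r) for r in scrape(["https://example.com/"])]
    keys = [store(r) for r in cleaned]
    print(keys, build_index(cleaned))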
> but this seems like the beginning of that world.
It's not the beginning; it's merely a continuation.
Walled gardens have existed since the AOL days. They deteriorate over time but it doesn't prevent companies from trying (each time, in bigger attempts).
idk man i bet you five bucks and a handshake it's just going to play out like the existing startup grift.
There's an established player with institutional protections, then a scrappy upstart takes a bunch of VC money, converts it into runway, gives away the product for free, gradually replaces and becomes the standard, then puts out an s-1 document saying "we don't make money and we never have, want to invest?" and then they start to enjoy all the institutional protections. Or they don't. Either way you pay yourself handsomely from the runway money so who cares.
The upstart gets indexed and has an API, the established player doesn't.
The upstart is more easily found and modular, but the institutional player can refuse to be indexed to own their data, and they can block their API to prevent AI slop from getting in and dominating their content.
IANAL, but as far as I understand the current legal status in the US, a change in robots.txt or terms and conditions is not binding for web scrapers, since the data is publicly accessible. Nor does displaying a banner saying "By using this site you accept our terms and conditions" change anything about that. The only thing that can make these kinds of terms binding is if the data is only accessible after proactively accepting the terms, for instance by restricting the website until one has created an account. LinkedIn lost a case against a startup scraping and indexing their data a few years ago because of that.
This problem is only going to get worse. For my thegreatestbooks.org site, I used to just get indexed/scraped by Google and Bing. Now it's like 50+ AI bots scraping my entire site just so they can train an LLM to answer the questions my site answers, without a user ever visiting my site. I just checked Cloudflare, and in the past 24 hours I've had 1.2 million bot/automated requests.
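For context on how sites try to push back: a robots.txt along these lines asks the better-known AI crawlers to stay away while allowing everyone else. The user-agent tokens below are ones the respective companies have publicly documented (GPTBot for OpenAI, CCBot for Common Crawl, Google-Extended for Google's AI training), but this is only a sketch, and compliance is entirely voluntary on the bot's side.

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Disallow:

As the parent comment implies, the less polite bots simply ignore this, which is why people end up reaching for Cloudflare rules or login walls instead.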
They changed robots.txt a month or so ago. For the first 19 years of its life, Reddit had a very permissive robots.txt: we allowed all by default and only restricted certain poorly behaved agents (and Bender's Shiny Metal Ass (tm)).
But I can understand why they made the change they did. The data was being abused.
My guess is that this was an oversight -- that they will do an audit and reopen it for search engines after those engines agree not to use the data for training, because let's face it, Reddit is a for-profit business and they have to protect their income streams.
> But I can understand why they made the change they did. The data was being abused.
Depends how you see it - if you see it as 'their' data (legally true) or if you see it as user content (how their users would likely see it).
If you see it as 'user content', they are actually selling the data to be abused by one company, rather than stopping it being abused at all.
From a commercial 'let's sell user data and make a profit' perspective I get it, although it does seem short-sighted to effectively de-list yourself from alternative search engines (I guess they just got enough cash to make it worth their while).
Is that actually true? Reddit may indeed have a license to use that data (derived from their ToS), but I very much doubt they actually own the copyright to it. If I write a comment on Reddit, then copy-paste it somewhere else, can Reddit sue me for copyright infringement?
Person extensively quoted in the article here. They are welcome to reach out. But not a single person from any level did that, nor replied to my polite requests to explain and engage. We first contacted them in early June and by 13th June, I had escalated to Steve Huffman @spez.
An acquaintance investigating Reddit's moderation automation asked how a major subreddit was moderated after an Associated Press post was auto-removed by automod. They were banned from said sub. When they asked why they were banned, they noted that they would share any responses with a journalism org (to be transparent about where the replies would be going). They were muted by the mods for 28 days and "told off" in a very poor manner (per the screenshots I've seen) by the anonymous mod who replied. They were then banned from Reddit for 3 days for "harassment" after an appeal; when they requested more information about what was considered harassment, they were ignored. Ergo, asking how the mods of a major sub are auto-modding non-biased journalism sources (the AP, in this case), without any transparency, appears to be considered harassment by Reddit. The interaction was submitted to the FTC through their complaint system to contribute towards their existing antitrust investigation of Reddit.
Shared because it is unlikely Reddit responds except when required by law, so I recommend engaging regulators (FTC, and DOJ at the bare minimum) and legislators (primarily those focused on Section 230 reforms) whenever possible with regards to this entity. They're the only folks worth escalating to, as Reddit's incentives are to gate content, keep ad buyers happy, and keep the user base in check while they struggle to break even, sharing as little information publicly as possible along the way [1] [2].
[1] https://www.bloomberg.com/news/articles/2024-05-09/reddit-la... | https://archive.today/wQuKM
[2] https://www.sec.gov/edgar/browse/?CIK=1713445
One (in this case, 2) company's incentive for profit should not take priority over the usability/well being of the internet as a whole, ever, and is exactly why we are where we are now. This is an absolutely terrible precedent.
The blocks on MojeekBot, a Cloudflare-verified and respectful bot for 20 years, started before the robots.txt changes. We first noticed them in early June.
We thought it was an oversight too at first. It usually is. Large publishers have blocked us when they have not considered the details, but then reinstated us when we got in touch and explained.
I personally feel that this kind of "exclusive search only by Google deal" should result in an anti-trust case against Google. This is the kind of abuse of monopoly power that caused anti-trust laws to be passed in the 1890s.
How was it being abused? You still clicked through to the information and saw the Reddit ads. Now they won't get any of that from Google's "rivals". I guess they figured the $60 million was more than that ad revenue. Seems greedy, but I don't think it's illegal, as others are suggesting.
Ah, so when Reddit monetizes user content it's OK, but when others do it, it isn't? Reddit may want that double standard, but I think the only thing they will achieve with this stunt is more people ignoring robots.txt.
You can see it here: https://search.google.com/test/rich-results/result?id=_mYogl... (click on "View Tested Page")
Calling it "public" content in the very act of exercising their ownership over it. The balls on whoever wrote that.
https://old.reddit.com/r/redditdev/comments/1doc3pt/updating...
Of course, that became unsustainable so now I have everything behind a login wall.
I will never take a statement given by a company that blatantly lies like this at face value going forward. What a bunch of clowns.
If Reddit had an exclusive agreement, it would be anti-competitive.
This is a classic HN anti-Google tirade (downvoting facts, logic, and basic free-market concepts).
From the article:
This seems to assert that ~0 other search providers do any crawling at all. Ever. Are we sure that's the case?
It still exists. It just isn't that popular.
https://www.ilga.gov/legislation/ilcs/ilcs4.asp?DocName=0720...
you can always buy a competitor's or make your own vacuum cleaner if you hate buying at Walmart
maybe what you are really mad about is Reddit monopolising content