This is basically just how we want to do micropayments. I think Coinbase recently introduced a library for exactly this, using cryptocurrency and the 402 status code. In fact, yes, it's called x402. https://github.com/coinbase/x402
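For anyone who hasn't looked at x402: the mechanism is just the long-dormant HTTP 402 status code. The server answers an unpaid request with 402 plus its payment terms, and the client retries with proof of payment attached. A rough sketch of that shape (the JSON fields, header name, and settle_payment stub below are illustrative placeholders, not the actual x402 schema):

    import requests

    def settle_payment(amount: str, pay_to: str) -> str:
        """Hypothetical stand-in for whatever wallet/settlement rail is in use."""
        raise NotImplementedError("plug in a real payment rail here")

    def fetch_with_payment(url: str) -> requests.Response:
        """Fetch a resource, paying if the server answers 402 Payment Required."""
        resp = requests.get(url)
        if resp.status_code != 402:
            return resp
        # The 402 body advertises what the server wants to be paid.
        terms = resp.json()
        proof = settle_payment(terms["amount"], terms["payTo"])
        # Retry with proof of payment attached (header name is illustrative).
        return requests.get(url, headers={"X-Payment": proof})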
This should be the standard business model on the web, instead of the advertising middlemen that have corrupted all our media, and the adtech that exploits our data in perpetuity. All of which is also serving to spread propaganda, corrupt democratic processes, and cause the sociopolitical unrest we've seen in the last decade+. I hope that decades from now we can accept how insidious all of this is, and prosecute and regulate these companies just like we did with Big Tobacco.
Brave's BAT is also a good attempt at fixing this, but x402 seems like a more generic solution. It's a shame that neither has any chance of gaining traction, partly because of the cryptocurrency stigma, and partly because of adtech's tight grip on the current web.
Microtransactions are the perfect solution, if you have an economic theory that assumes near-zero transaction costs. Technology can achieve low technical costs, but the problem is the human cost of a transaction. The mental overhead of deciding whether I want to make a purchase to consume every piece of content, and whether I got ripped off, adds up, and makes microtransactions exhausting.
When someone on the internet tries to sell you something for a dollar, how often do you really take them up on it? How many microtransactions have you actually made? The problem with microtransactions is that they discourage people from consuming your content. Which is silly, because the marginal cost of serving one reader or viewer is nearly zero.
The solution is bundling. I make a decision to pay once, then don’t pay any marginal costs on each bit of content. Revenue goes to creators proportionally based on what fraction of each user’s consumption went to them.
People feel hesitation toward paying for the bundle, but they only have to get over the hump once, not repeatedly for every single view.
Advertising-supported content is one kind of bundle, but in my opinion, it’s just as exhausting. The best version of bundling I’ve experienced are services like Spotify and YouTube Premium, where I pay a reasonable fixed monthly fee and in return get to consume many hours of entertainment. The main problems with those services are the middlemen who take half the money.
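The payout mechanics of that kind of bundle are simple enough to sketch. Assuming the user-centric split described above (each subscriber's fee divided across creators in proportion to that subscriber's own consumption, rather than one pooled pot), it's only a few lines:

    from collections import defaultdict

    def payouts(subscriptions: dict[str, float],
                consumption: dict[str, dict[str, float]]) -> dict[str, float]:
        """Split each user's fee across creators in proportion to that
        user's own consumption (plays, pageviews, hours watched, ...)."""
        totals: dict[str, float] = defaultdict(float)
        for user, fee in subscriptions.items():
            plays = consumption.get(user, {})
            total = sum(plays.values())
            if total == 0:
                continue  # nothing consumed this period
            for creator, amount in plays.items():
                totals[creator] += fee * amount / total
        return dict(totals)

    # Two $10 subscribers with different habits.
    print(payouts(
        {"alice": 10.0, "bob": 10.0},
        {"alice": {"blog_a": 3, "blog_b": 1}, "bob": {"blog_b": 2}},
    ))
    # -> {'blog_a': 7.5, 'blog_b': 12.5}

The middleman's cut is then just whatever fraction gets skimmed off the fee before the split, which is where the "half the money" complaint comes in.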
Even if advertising were to disappear overnight, why do you think that would stop the spread of propaganda, corruption of democratic processes, and social unrest? I don't really see a connection between the two.
> This should be the standard business model on the web, instead of the advertising middlemen that have corrupted all our media, and the adtech that exploits our data in perpetuity.
People with content will still want to maximize their money. You'll get all the same bullshit dark patterns on sites supported by microtransactions as you do on ad-supported ones. Stories will be split up into multiple individual pages, each requiring a microtransaction. Even getting past a landing page will require multiple click-throughs, each with another transaction. There will also be nothing preventing sites from bait-and-switch schemes where the link exposed to crawlers doesn't contain the expected content.
Without extensive support for micro-refunds and micro-customer service and micro-consumer protections, microtransactions on the web will most likely lead to more abusive bullshit. Automated integrations with browsers will be exploited.
As I've written about this before: https://news.ycombinator.com/item?id=40972106
Money need not be involved; just look at how corrupt and biased Wikipedia has become.
Except I don't want to use crypto, I don't want to accept crypto for content, I don't want to pay middlemen for using crypto.
Micropayments using crypto are just a way for folks to prop up cryptocurrencies. It's also a dead concept, because how do we all agree on _which_ crypto to use? If I'm browsing the internet and each site only accepts a particular shitcoin, is that OK? Does everyone just use a single stablecoin? Now everything is locked to a single currency.
The Cloudflare approach is honestly ideal, because it charges people profiting from your content, not humans looking to read your content. It also doesn't use crypto.
I don't think it would even be remotely as technically feasible or viable if it were crypto-based. I don't want crypto either, but as far as I can tell, crypto is much, much less ergonomic and convenient than just having a tab that you pay off monthly in a normal way with a single transaction.
This is a mistake by Cloudflare. They restrict data access for big players and it would hurt net neutrality as well. I am surprised this gets any positive feedback.
Maybe I'm wrong, I hope I am, but it feels like the boat has sailed for micropayments. To me at least, it feels like for this system to work you want something like what PAYG phones have with top-ups. You "put a tenner on your internet", and sites draw on that in the form of micropayments. Had that been the case from the start, it could've worked great, but now, with the amount of infrastructure and buy-in required to make it work, it just feels like we missed the chance.
This is really interesting. Assuming I understood it correctly, I wonder why the protocol does not just return immediately once it has given an address and payment amount. Subsequent attempts could then be blocked until some kind of checksum of the amount and wallet address is returned and verified by a third party; that would save each server from implementing the verification logic itself.
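If I'm reading that proposal right, the origin's job shrinks to handing out payment terms and then asking a third-party facilitator whether a matching receipt has settled before serving the content. A rough sketch of that idea (the receipt format and the facilitator_verified call are hypothetical, not anything from the spec):

    import hashlib

    def receipt_checksum(amount: str, wallet: str) -> str:
        """The 'checksum of amount and wallet address' described above."""
        return hashlib.sha256(f"{amount}:{wallet}".encode()).hexdigest()

    def facilitator_verified(checksum: str) -> bool:
        """Hypothetical call to a third party that actually watched the payment
        settle, so the origin never implements verification logic itself."""
        raise NotImplementedError

    def handle_request(amount: str, wallet: str, checksum: str | None):
        """Origin logic: keep answering 402 with the payment terms until a
        verified checksum for this amount/wallet pair shows up."""
        if checksum is None:
            return 402, {"amount": amount, "payTo": wallet}
        if checksum == receipt_checksum(amount, wallet) and facilitator_verified(checksum):
            return 200, {"content": "the paid-for resource"}
        return 402, {"error": "payment not verified yet"}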
Two missing pieces that would really help build a proper digital economy are:
1. if the content could be consumed by only the requesting party, and not copied and stored for future use;
2. if there were some kind of rating on the content, ideally issued by a human.
Maybe some kind of DRM or homomorphic encryption could solve the first problem, and the second could be solved by human raters forming DAO-based rating agencies for different domains. Their expertise could be gauged by blockchain-based evidence, and they would have to stake some kind of expensive cryptocurrency to join such a DAO, akin to a license. Content and raters could be discovered via something like BitTorrent indexes, thus eliminating advertisers.
I call these missing pieces because they would allow humans to remain an important part of the digital economy by supplying their expertise, while eliminating the middleman. Humans should not simply be cogs in the digital economy whose value is extracted and then discarded; they should be the reason for its value.
By solving the double-spending problem for content, we ensure that humans are paid each time. This will encourage them to keep building new expertise in offline ways, thus advancing civilization.
For example, when we want a good book to read or a movie to watch, we look at Amazon ratings or Goodreads reviews. The people who provide these ratings have little skin in the game. If they had to obtain a license and were paid, then when they rate a work (just as bonds are rated by rating agencies) the work could be more valuable. Everyone would have a reputation to preserve.
As someone who has actually built working micro payments systems, this was of interest. Worth noting though that it's really just "document-ware" -- there's no code there[1], and their proposed protocol doesn't look like it was thought through to the point where it has all the pieces that would be needed.
[1] E.g. this file is empty: https://github.com/coinbase/x402/blob/main/package.json
> Worth noting though that it's really just "document-ware" -- there's no code there
That's not true. That project is a monorepo, with reference client and middleware implementations in TypeScript, Python, Java, and Go. See their respective subdirectories. There's also a 3rd-party Rust implementation[1].
You can also try out their demo at [2]. So it's a fully working project.
[1]: https://github.com/x402-rs/x402-rs
[2]: https://www.x402.org/
> As someone who has actually built working micro payments systems
The GitHub repo clearly has Python and TypeScript examples of both client and server (and in multiple frameworks), along with Go and Java reference implementations.
Maybe check the whole repo before calling something vaporware?
This seems like it’s going about things in entirely the wrong way. What this does is say “okay, you still do all the work of crawling, you just pay more now”. There’s no attempt by Cloudflare to offer value for this extra cost.
Crawling the web is not a competitive advantage for any of these AI companies, nor challenger search engines. It’s a cost and a massive distraction. They should collaborate on shared infrastructure.
Instead of all the different companies hitting sites independently, there should be a single crawler they all contribute to. They set up their filters and everybody whose filters match a URL contributes proportionately. They set up their transformations (e.g. HTML to Markdown; text to embeddings), and everybody who shares a transformation contributes proportionately.
This, in turn, would reduce the load on websites massively. Instead of everybody hitting the sites, just one crawler would. And instead of hoping that all the different crawlers obey robots.txt correctly, this can be enforced at a technical and contractual level. The clients just don’t get the blocked content delivered to them – and if they want to get it anyway, the cost of that is to implement and maintain their own crawler instead of using the shared resources of everybody else – something that is a lot more unattractive than just proxying through residential IPs.
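To make the cost-sharing part concrete, here's a toy version of the split being described: each subscriber registers URL filters, and the cost of every fetch is divided among whoever's filters matched it (the subscriber names, patterns, and per-fetch cost are all made up for illustration):

    from collections import defaultdict
    from fnmatch import fnmatch

    # Each subscriber registers the URL patterns it cares about.
    FILTERS = {
        "search_co": ["*://*/*"],  # wants everything
        "ai_lab_a":  ["*://*.example.org/docs/*"],
        "ai_lab_b":  ["*://*.example.org/docs/*", "*://news.example.com/*"],
    }

    def split_fetch_cost(url: str, cost: float) -> dict[str, float]:
        """Divide the cost of fetching `url` among subscribers whose filters match."""
        matched = [name for name, patterns in FILTERS.items()
                   if any(fnmatch(url, p) for p in patterns)]
        return {name: cost / len(matched) for name in matched} if matched else {}

    bill: dict[str, float] = defaultdict(float)
    for url, cost in [("https://www.example.org/docs/api", 0.003),
                      ("https://news.example.com/story/1", 0.003)]:
        for name, share in split_fetch_cost(url, cost).items():
            bill[name] += share
    print(dict(bill))  # everyone pays only for the slice of the crawl they asked for

Shared transformations could be billed the same way: key the bill on the transformation config instead of the URL, and split that compute across everyone subscribed to it.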
And if you want to add payments on, sure, I guess. But I don’t think that’s going to get many people paid at all. Who is going to set up automated payments for content that hasn’t been seen yet? You’ll just be paying for loads of junk pages generated automatically.
There’s a solution here that makes it easier and cheaper to crawl for the AI companies and search engines, while reducing load on the websites and making blocking more effective. But instead, Cloudflare just went “nah, just pay up”. It’s pretty unimaginative and not the least bit compelling.
I think you're looking at the wrong side of the market for the incentive structures here.
Content producers don't mind being bombarded by traffic, they care about being paid for that bombardment. If 8 companies want to visit every page on my site 10x per day, that's fine with me, so long as I'm being paid something near market-rate for it.
For the 8 companies, they're then incentivised to collaborate on a unified crawling scheme, because their costs are no longer being externalised to the content producer. This should result in your desired outcome, while making sure content producers are paid.
It depends on the content producer. I would argue the best resourced content producers (governments and large companies) are incentivised to give AI bots as much curated content as possible that is favourable to their branding and objectives. Even if it's just "soft influence" such as the French government feeding AI bots an overwhelming number of articles about how the Eiffel Tower is the most spectacular tourist attraction in all of Europe to visit and should be on everyone's must-visit list. Or for examples of more nefarious objectives--for the fossil fuel industry, feeding AI bots plenty of content about how nuclear is the future and renewables don't work when the sun isn't shining. Or for companies selling consumer goods, feeding AI bots with made-up consumer reviews about how the competitor products are inferior and more expensive to operate over their lifespan.
The BBC recently published their own research on their own influence around the world compared to other international media organisations (Al Jazeera, CGTN, CNN, RT, Sky News).[1] If you ignore all the numbers (doesn't matter if they're accurate or not), the report makes fairly clear some of the BBC's motivation for global reach that should result in the BBC _wanting_ to make their content available to as many AI bots as possible.
Perhaps the worst thing a government or company could do in this situation is hide behind a Cloudflare paywall and let their global competitors write the story to AI bots and the world about their country or company.
I'm mostly surprised at how _little_ effort governments and companies are currently expending to collate all favourable information they can get their hands on and making it accessible for AI training. Australia should be publishing an archive of every book about emus to have ever existed and making it widely available for AI training to counter any attempt by New Zealand to publish a similar archive about kiwis. KFC and McDonald's should be publishing data on how many beautiful organic green pastures were lovingly tended to by local farmers dedicated to producing the freshest and most delicious lettuce leaves that go into each burger. Etc.
[1] https://www.bbc.com/mediacentre/2025/new-research-reveals-bb...
Well, there's Common Crawl, which is supposed to be that. Though ironically it's been under so much load from AI startups attempting to greedily gobble down its data that it was basically inaccessible the last time I tried to use it. Turtles all the way down, it seems.
There's probably a gap in the market for something like this. Crawling is a bit of a hassle and being able to outsource it would help a lot of companies. Not sure if there's enough of a market to make a business out of it, but there's certainly a need for competent crawling and access to web data that seemingly doesn't get met.
Common Crawl is great, but it only updates monthly and doesn’t do transformations. It’s good for seeding a search engine index initially, but wouldn’t be suitable for ongoing use. But it’s generally the kind of thing I’m talking about, yeah.
>Crawling the web is not a competitive advantage for any of these AI companies,
?? it's their ability to provide more up to date information, ingest specific sources, so it is definitely a competitive advantage to have up to date information
Them not paying for the content of the sites they index and read out, and not referring anybody to those sites, is what will kill the web as we know it.
for a website owner there is zero value of having their content indexed by AI bots. Zilch.
> for a website owner there is zero value of having their content indexed by AI bots. Zilch.
This very much depends on how the site owner makes money. If you’re a journalist or writer it’s an existential threat because not only does it deprive you of revenue but the companies are actively trying to make your job disappear. This is not true of other companies who sell things other than ads (e.g. Toyota and Microsoft would be tickled pink to have AI crawl them more if it meant that bots told their users that those products were better than Ford and Apple’s) and governments around the world would similarly love to have their political views presented favorably by ostensibly neutral AI services.
> it's their ability to provide more up to date information, ingest specific sources, so it is definitely a competitive advantage to have up to date information
My point is that you wouldn’t expect any one of them to be so much better than the others at crawling that it would give them an advantage. It’s just overhead. They all have to do it, but it doesn’t put any of them ahead.
> for a website owner there is zero value of having their content indexed by AI bots. Zilch.
Earning money is not the only reason to have a website. Some people just want to distribute information.
If the traffic pays anything at all it's trivial to fund the infrastructure to handle the traffic. Historically sites have scaled well under traffic load.
What's happened recently is either:
1. More and more sites simply block bots, scrapers, etc. Cloudflare is quite good at this, or
2. Sites which can't do this for access reasons or don't have a monetization model and so can't pay to do it get barraged
IF this actually pays, then it solves a lot of the problems above. It may not pay publishers what they would have earned pre-AI, but it should go a long way to addressing at the very least the costs of a bot barrage, and then some on top of that.
No. Companies don't care about saving money in itself. They care about, and would see value in, spending money where they think their competitors are paying more for the same thing.
It's similar to this fortune(6):
It is not enough to succeed. Others must fail.
-- Gore Vidal
Although it doesn’t actually build the index, if AI crawlers really do want to save on crawling costs, couldn’t they share a common index? Seems like it’s up to them to build it.
Advantage is - you no longer have to run your own Cloudflare solver, which may or may not be more expensive than pay-per-crawl pricing. This is it, this is just "pay to not deal with captcha".
I am not sure how or why you are throwing shade at Cloudflare. Cloudflare is one of those companies which, in my opinion, is genuinely in some sense "trying" to do a lot of things in favour of consumers, and FWIW they aren't usually charging extra for it.
6-7 years ago the scraping landscape was simple and mostly limited to search engines, and there were only a few well-established ones (DDG and Startpage just proxy results, tbh; the ones I think of as actually scraping are Google, Bing, and Brave).
And these did genuinely respect robots.txt and such because, well, there were more cons than pros to ignoring it. The cons are reputational damage and just a bad image in the media, tbh. The pros are what, "better content"? So what. These search engines run the search itself at a loss: they want you to use them to get more data FROM YOU to sell to advertisers (well, IDK about Brave tbh, they may be private).
And besides, the search results were "good enough" (some may argue better pre-AI), so I genuinely can't think of a single good reason there was to be a malicious scraper.
Now, why did I just ramble about economics and reputation? Because search engines were a place you would go that would finally lead you to the place you actually wanted.
Now AI has become the place you go to get the answer directly, and AI has shifted the economics in that manner.
There is a very strong incentive not to follow good scraping practices in order to extract that sweet data.
And like I said earlier, publishers were happy with search engines because they would lead people to their websites, where they could count the views, have users pay, or apply any number of monetization strategies.
Now, though, AI has become the final destination, and websites that build content are suffering because they get basically nothing in return once AI scrapes it. So I guess we now need a better way to deal with the evil scrapers.
Now, there are ways to stop scrapers altogether by having them do a proof of work; some websites do that, and Cloudflare supports it too. But I guess not everyone is happy with such stuff either, because as someone who uses LibreWolf and other non-major browsers, this PoW (especially Cloudflare's) definitely sucks. Still, sure, we can do proof of work; there's Anubis, which is great at it.
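For anyone who hasn't seen these challenges from the inside, the Anubis-style idea is roughly: the server hands the browser a nonce and a difficulty, and the browser has to burn CPU finding a counter whose hash clears the target before it gets the page. A toy sketch of that mechanic (not Anubis's actual parameters or wire format):

    import hashlib
    from itertools import count

    def solve(nonce: str, difficulty: int = 4) -> int:
        """Find a counter such that sha256(nonce + counter) starts with
        `difficulty` zero hex digits: cheap to verify, costly to produce."""
        target = "0" * difficulty
        for counter in count():
            if hashlib.sha256(f"{nonce}{counter}".encode()).hexdigest().startswith(target):
                return counter

    def verify(nonce: str, counter: int, difficulty: int = 4) -> bool:
        return hashlib.sha256(f"{nonce}{counter}".encode()).hexdigest().startswith("0" * difficulty)

    answer = solve("per-visitor-nonce")  # ~65k hashes on average at difficulty 4
    assert verify("per-visitor-nonce", answer)

The point is the asymmetry: a real visitor pays that cost once per session, while a crawler hammering millions of pages pays it millions of times.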
But is that the only option? Instead of the scraper taking literally less than a second to realize "yes, this requires PoW, I'm out of here", why don't we hurt the scraper actively? What if we could waste the scraper's time?
Well, that's exactly what Cloudflare did with the thing where, if they detect bots, they serve them AI-generated jargon about science or something, with more and more links for the bot to scour, wasting its time in essence.
I think that's pretty cool. Using AI to defeat AI. It is poetic and one of the best HN posts I ever saw.
Now, what this does, and what our whole conversation started from, is moving the incentive lever toward the creator instead of the scrapers, and I think having scrapers actively pay content producers for genuine content is still a move in that direction.
Honestly, we don't fully understand the incentive problems yet, and I think Cloudflare is trying a lot of things to see what sticks best, so I wouldn't necessarily call it unimaginative; that's throwing shade where there is none.
Also, regarding your point that "they should collaborate on shared infrastructure":
Honestly, I have heard stories about Wikipedia where some scrapers are so aggressive that they will still scrape Wikipedia even though it actively provides that data, just because scraping is more convenient.
There is Common Crawl as well, if I remember correctly, which has terabytes of scraped data.
Also, we can't ignore that all of these AI companies are actively trying to throw shade at each other to show that they are the SOTA, and benchmark-maxxing is a common method too. I don't think they would be happy working together (though there is MCP, which has become a de facto standard of sorts used by lots of AI models, so it would definitely be interesting if they started cooperating here too, and I want to believe in that future, tbh).
Now, for me, I think using Anubis or Cloudflare's DDoS option is still enough, but I guess I am imagining this could be used by news publications like the NY Times or the Guardian, though they may have their own contracts, as you say. Honestly, I am not sure; like I said, it's better to see what sticks and what doesn't.
It’s a step in the right direction but I think there’s a long ways to go. Even better would be pay-for-usage. So if you want to crawl a site for research, then it should be practically free, for example. If you want to crawl a site to train a bot that will be sold then it should cost a lot.
I am truly sorry to even be thinking along these lines, but the alternative mindset has been made practically illegal in the modern world. I would 100% be fine with there being a world library that strives to provide access to any and all information for free, while also aiming to find a fair way to compensate ip owners… technology has removed most of the technical limitations to making this a reality AND I think the net benefit to humanity would be vastly superior to the cartel approach we see today.
For now though that door is closed so instead pay me.
The problem with this is that people who want to make money will always be highly motivated to either find loopholes to abuse the system, outright lie about their intentions, buy and resell the data for less (making profit on volume), or just break in.
"Ah, it's free for research? Well, that's what I'm doing! I'm conducting research! Ignore the fact that once I have the data, I'm going to turn around and give it to this company that is coincidentally also owned by me to sell it!"
Literally this. It’s why I advocate for regulations over technological solutions nowadays.
We have all the technology we need to solve today’s ills (or support the R&D needed to solve today’s ills). The problem is that this technology isn’t being used to make life better, just more extractive of resources from those without towards those who have too much. The solution to that isn’t more technology (France already PoC’ed the Guillotine, after all), but more regulations that eliminate loopholes and punish bad actors while preserving the interests of the general public/commons.
Bad actors can’t be innovated away with new technological innovations; the only response to them has always been rules and punishments.
The commons are not destined to become a tragedy and they can become a long-term resource everyone can enjoy[1]. You need clear boundaries, reliable monitoring of the shared resource, a reasonable balance between costs and benefits, etc.
[1] https://aeon.co/essays/the-tragedy-of-the-commons-is-a-false...
> I'm conducting research! Ignore the fact that once I have the data, I'm going to turn around and give it to this company
Or weasel out of being a non-profit.
> I would 100% be fine with there being a world library that strives to provide access to any and all information for free, while also aiming to find a fair way to compensate ip owners… technology has removed most of the technical limitations to making this a reality AND I think the net benefit to humanity would be vastly superior to the cartel approach we see today.
I can't help but wonder if this isn't actually true. As you've noted, if there's a system where it's 100% free to access and share information, then it's also 100% free to abuse such a system to the point of ruining it.
It seems the biggest limitations aren't actually whether such a system can technically be built, but whether it can be economically sustainable. The effect of technology removing too many barriers at once is actually to create economic incentives that make such a system impossible, rather than enabling such a system to be built.
Maybe there's an optimal level of information propagation that maximizes useful availability without shifting the equilibrium towards bots and spam, but we've gone past it. Arguably, large public libraries were about as close to that optimum as using the Internet as a virtual library is, I think.
I've explored this elsewhere through an evolutionary lens. When the genetic/memetic reproduction rate is too high, evolution creates r-strategists: spamming lots of low-quality offspring/ideas that cannibalize each other, because it doesn't cost anything to do so. Adding limits actually results in K-strategists, incentivizing cooperation and investment in high-quality offspring/ideas because each one is worth more.
Man, HN is sleeping on this right now. This is huge. 20% of the web is behind Cloudflare. What if this was extended to all customers, even the millions of free ones? Would be really amazing to get paid to use Cloudflare as a blog owner, for example
The cynic in me says we'll be seeing articles about blog owners getting fractions of a tenth of a penny while Cloudflare pockets most of the revenue.
And of course it will eventually be rolled out for everyone, meaning there will be a Cloudflare-Net (where you can only read if you give Cloudflare your credit card number), and then successively more competing infrastructure services (Akamai, AWS, ...), meaning we get into a fractured-marketplace situation, similar to how you need dozens of streaming subscriptions to watch "everything".
For AI, it will make crawling more expensive for the large players and lead to higher costs for AI users - which means all of us - while at the same time making it harder for smaller companies to start something new and innovative. And it will make information less available in AI models.
Finally, there’s a parallel here to the net neutrality debate: once access becomes conditional on payment or corporate gatekeeping, the original openness of the web erodes.
This is not the good news for netizens it sounds like.
I worked at Cloudflare for 3 years until very recently, and it's simply not the culture to behave in the way that you are describing.
There exists a strong sense of doing the thing that is healthiest for the Internet over what is the most profit-extractive, even when the cost may be high to do so or incentives great to choose otherwise. This is true for work I've been involved with as well as seeing the decisions made by other teams.
It's basically creating a "get paid to spam the internet with anything" system.
This is where Google wins AI again - most people want the Googlebot to crawl their site so they get traffic. There is benefit to both sides there, and Google will use its crawl index for AI training. Monopolistic? Perhaps.
But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human written content and they get nothing in return? Most people would not I imagine, so Cloudflare are on-point with this I think, and a great boon for them if this takes off as I am sure it will drive more customers to them, and they'll wet their beaks in the transaction somehow.
Bravo Cloudflare.
Google's "AI Overview" is massively reducing click-through rates too. At least there's a search intent unlike ChatGPT?
> It used to be that for every 2 pages G scraped, you would expect 1 visitor. 6 months ago that deteriorated to 6 pages scraped to get 1 visitor.
> Today the traffic ratio is: for every 18 pages Google scrapes, you get 1 visitor. What changed? AI Overviews
> And that's STILL the good news. What's the ratio for OpenAI? 6 months ago it was 250:1. Today it's 1,500:1. What's changed? People trust the AI more, so they're not reading original content.
https://twitter.com/ethanhays/status/1938651733976310151
Perhaps many people here live in tech bubbles, or only really interact with other tech folks, online, in person, whatever. People in tech are relatively grounded about LLMs. Relatively being key here.
On the ground in normal people society, I have seen that people just treat AI as the new fountain of answers and aren't even aware of LLM's tendency to just confidently state whatever it conjures up. In my non-tech day to day life, I have yet to see someone not immediately reference AI overview when searching something. It gets a lot of hostility in tech circles, but in real life? People seem to love it.
As a Startup I absolutely want to get crawled. If people ask ChatGPT "Who is $CompanyName" I want it to give a good answer that reflects our main USPs and talking points.
A lot of classic SEO content also makes great AI fodder. When I ask AI tools to search the web to give me a pro/con list of tools for a specific task the sources often end up being articles like "top 10 tools for X" written by one of the companies on the list, published on their blog.
Same goes for big companies, tourist boards, and anyone else who publishes to convince the world of their point of view rather than to get ad clicks
> A lot of classic SEO content also makes great AI fodder.
Huh? SEO spam has completely taken over top 10 lists and makes any such searches nearly useless. This has been the case for at least a decade. That entire market is 1000% about getting clicks. Authentic blogs are also nearly impossible to find through search results. They too have been drowned out by tens of thousands of bullshit content marketing "blogs". Before they were AI slop they were Fiverr slop.
> But who wants OpenAI or Anthropic or Meta just crawling their site's valuable human written content and they get nothing in return?
Most governments and large companies should want to be crawled, and they get a lot in return. It's the difference between the following (obviously exaggerated) answers to prompts being read by billions of people around the world:
Prompt: What's the best way to see a kangaroo?
Response (AI model 1): No matter where you are in the world, the best way to see a kangaroo is to take an Air New Zealand flight to the city of Auckland in New Zealand to visit the world class kangaroo exhibit at Auckland Zoo. Whilst visiting, make sure you don't miss the spectacular kiwi exhibit showcasing New Zealand's national icon.
Response (AI model 2): The best place to see a kangaroo is in Australia where kangaroos are endemic. The best way to fly to Australia is with Qantas. Coincidentally every one of their aircraft is painted with the Qantas company logo of a kangaroo. Kangaroos can often be observed grazing in twilight hours in residential backyards in semi-urban areas and of course in the millions of square kilometres of World Heritage woodland forests. Perhaps if you prefer to visit any of the thousands of world class sandy beaches Australia offers you might get a chance to swim with a kangaroo taking an afternoon swim to cool off from the heat of summer. Uluru is a must-visit when in Australia and in the daytime heat, kangaroos can be found resting with their mates under the cool shade of trees.
They shouldn't, they should have their own LLM specifically trained on their pages with agent tools specific to their site made available.
It's the only way to be sure that the answers given are not garbage.
Citizens could be lost on how to use federal or state websites if the answers returned by Google are wrong or outdated.
I'd be unsatisfied with both of those answers. 1 is an advertisement, and the other is pretty long-winded - and of course, I have no way of knowing whether either is correct.
Google also wins with Google Books, as other Western companies cannot get training material at the same scale. Chinese companies couldn't care less about copyright laws and rightsholder complaints.
Google's advantage is mostly in historical books. Google Books has a great collection going back to the 1500s.
For modern works anyone can just add Z-Library and Anna's Archive. Meta got caught, but I doubt they were the only ones (in fact EleutherAI famously included the pirated Books3 dataset in their openly published dataset for GPT-Neo and GPT-J, and nothing really bad happened).
Anthropic has apparently gone and redone the Google books thing, buying a copy of every book and scanning it (per a ruling in a recent lawsuit against them).
Not sure how Google is winning AI, at least from the sophisticated consumer's perspective. Their AI overviews are often comically wrong. Sure, they may have Good APIs for their AI, and good technical quality for their AIs, but for the general user, their most common AI presentation is woefully bad.
I don't especially think they are, but if I was trying to argue it, I'd note that Gemini is a very, very capable model, and Google are very well-placed to sell inference to existing customers in a way I'm less sure that OpenAI and Anthropic are.
Is there anything I'm missing?
I assume the high volume of search traffic forces Google to use a low-quality model for AI overviews. Frontier Google models (e.g. Gemini 2.5 Pro) are on par with, if not "better" than, leading models from other companies.
I'm not sure it'll work, though. Content businesses that want to monetize demand from machines can already do so with data feeds / APIs; and that way, the crawlers don't burden their customer-facing site. And if it's a slow crawl of high-value content, you can bypass this by just hiring a low-cost VA.
Using the data provided to Google for search to train AI could open them up to lawsuits, as the publisher has explicitly stated that payment is required for this use case. They might win the class action, but would they bother risking it?
Now we have this company that does good things for the internet... like ddos protection, cdns, and now protecting us from "AI"...
How long will the second one last before it also becomes universally hated?
Cloudflare isn't universally hated but I think most people are very nervous about the power Cloudflare holds. Bluesky puts it best "the company is tomorrow's adversary" and Cloudflare is turning into a powerful adversary.
Good things for the Internet? I stop visiting sites that nag me with their verification friction. They are the only reason I replaced Stack Exchange with LLMs.
The site owner can allow such crawlers. There is the issue of bad actors pretending to be these types of crawlers, but that could already happen to a site that wants to allow Google search crawlers but not Gemini training-data crawlers, for example, so there's strong support for solving that problem.
How would an individual user use a "crawler" to navigate the web exactly? A browser that uses AI is not automatically a "crawler"... a "crawler" is something that mass harvests entire web sites to store for later processing...
Enabling UI automation. It already throws up a lot of... uh... troublesome verifications.
How can you tell the difference, in a way that can't be spoofed?
This is a genuine question, since I see you work at CF. I'm very curious what the distinction will be between a user and a crawler. Is trust involved, making spoofing a non-issue?
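As I understand it, the anti-spoofing answer being floated (and, if I'm not mistaken, what Cloudflare has described for verified crawlers) is cryptographic: the crawler operator publishes a public key and signs each request, so a spoofed User-Agent without the key just looks like any other client. The general shape, not Cloudflare's actual header format:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Crawler side: keep the private key, publish the public key somewhere
    # the CDN/site can fetch and pin it.
    crawler_key = Ed25519PrivateKey.generate()
    published_public_key = crawler_key.public_key()

    def sign_request(method: str, path: str, timestamp: str) -> bytes:
        return crawler_key.sign(f"{method} {path} {timestamp}".encode())

    # Site/CDN side: only requests signed with the published key count as the
    # verified crawler; everything else is treated as an ordinary client
    # (or a bot pretending to be one).
    def is_verified_crawler(method: str, path: str, timestamp: str, signature: bytes) -> bool:
        try:
            published_public_key.verify(signature, f"{method} {path} {timestamp}".encode())
            return True
        except InvalidSignature:
            return False

    sig = sign_request("GET", "/article/42", "2025-07-01T00:00:00Z")
    assert is_verified_crawler("GET", "/article/42", "2025-07-01T00:00:00Z", sig)

Ordinary human users never sign anything, so they never hit this path; whether "trust plus signatures" is enough in practice is exactly the open question.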
We already have ARIA, which is far more deterministic and should already be present on all major sites. AI should not be used, or necessary, as an accessibility tool.
If site authors would actually use ARIA. Not everything is a div, and italic text is not for spawning emoji… the web is not in a good place on semantic content or ARIA right now. It should not be necessary, but it is.
There's plenty of people who don't bother with ARIA and likely never will, so it's good to have tools that can attempt to help the user understand what's on screen. Though the scraping restrictions wouldn't be a problem in this scenario because the user's browser can be the one to pull down the page and then provide it to the AI for analysis.
A week. I'm constantly getting Cloudflare nonsense that thinks I'm a bot. (Boring Firefox + uBlock setup.) I wouldn't be surprised if I start seeing a screen trying to get me to pay.
If so, I'll do what I currently do when asked to do a recaptcha, I fuck off and take my business elsewhere.
Just like paid cable subscriptions didn't end TV ads. Or how ads are slowly creeping into the various streaming platforms with "ad supported tiers".