billyhoffman · a year ago
Common Crawl is shown in their screenshot of "Providers" alongside OpenAI and Anthropic. The challenge is that Common Crawl is used for a lot of things that are not AI training. For example, it's a major source of content for the Wayback Machine.

In fact, that's the entire point of the Common Crawl project. Instead of dozens of companies writing and running their own (poorly designed) crawlers and hitting everyone's site, Common Crawl runs once and exposes the data in industry-standard formats like WARC for other consumers. Their crawler is quite well behaved (exponential backoff, obeys Crawl-delay, uses sitemap.xml to know when to revisit, follows robots.txt, etc.).
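For illustration, those well-behaved-crawler mechanics boil down to a few lines of logic. A minimal Python sketch - the bot name and URL are placeholders, not Common Crawl's actual code:

```python
# Minimal sketch of a "well-behaved" fetcher: obey robots.txt, honor
# Crawl-delay, and back off exponentially when the server objects.
# "ExampleBot" and example.com are placeholders.
import time
import urllib.error
import urllib.request
import urllib.robotparser

AGENT = "ExampleBot"
rp = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
rp.read()

def polite_fetch(url, max_retries=5):
    if not rp.can_fetch(AGENT, url):
        return None                        # robots.txt disallows this path
    delay = rp.crawl_delay(AGENT) or 1.0   # honor Crawl-delay, default 1s
    for attempt in range(max_retries):
        try:
            req = urllib.request.Request(url, headers={"User-Agent": AGENT})
            with urllib.request.urlopen(req) as resp:
                body = resp.read()
            time.sleep(delay)              # pace requests between fetches
            return body
        except urllib.error.HTTPError as e:
            if e.code in (429, 503):       # overloaded: exponential backoff
                time.sleep(delay * 2 ** attempt)
            else:
                raise
    return None
```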

There are significant knock-on effects if Cloudflare starts (literally) gatekeeping content. This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access based on who pays and who doesn't, whether they are bots or people.

Aachen · a year ago
> gatekeep access based on who pays and who doesn't, whether they are bots or people.

I'm already constantly being classified as a bot. Just today:

To check whether something is included in a subscription that we already pay for, I opened a product page on the Microsoft website this morning. Full-page error: "We are currently experiencing high demand. Please try again later." It's static content, but it's not available to me. Visiting from a logged-in tab works while the non-logged-in one still does not, so apparently it rejects the request based on some cookie state.

Just now I was trying to book a hotel room for a conference in Grenoble. Looking in the browser dev tools, it seems that Visa is trying to run some bot detection (the payment provider redirects to their site for the verification code, but Visa automatically redirects me back with an error status) and refuses the payment. There are no other payment methods. Using Google Chrome works, but Firefox with uBlock Origin (a very niche setup, I'll admit) locks you out of this part of the internet.

Visiting various US sites results in a Cloudflare captcha to "prove I'm human". For the time being it's less of a waste of time to go back and click a different search result, but this never used to happen and now it's a daily occurrence...

theyeenzbeanz · a year ago
Lately I've been noticing captchas getting increasingly difficult by the day on Firefox. Checking the box used to go through without issue, but now it pops up challenges with the image boxes that fade after clicking. Just like your experience, Chrome has no hiccups on the same machine.
rmbyrro · a year ago
If you use Linux, the experience is terrible nowadays.

No matter how many captchas I solve, Cloudflare will never buy the idea that I'm a real person and not a scraping bot running on a server.

I wonder if this kind of discrimination is even legal...

esperent · a year ago
> We are currently experiencing high demand. Please try again later.

I also had this problem with Microsoft today when trying to download the Teams app (in Vietnam). We use MS Teams at work and onboard one or two people a week. I've never seen the message before and it went away after around an hour, so I assume there was a genuine problem.

ajsnigrutin · a year ago
Same here... I have pretty strict ad and JavaScript blocking in my browser, and Cloudflare gives me captchas all the time, especially in incognito windows.
bb88 · a year ago
Likely you're in a blocked IP address range.

In my case, CG-NAT is pretty terrible in that my IP is shared with many others, possibly including bad actors or malware-infected machines.

dylan604 · a year ago
One of the local TV stations I visit to view their weather radar has started a "powered by Admiral" blocker because it thinks I'm using an ad blocker. At first it would let you close it and continue, but now it flat out covers the page. The cat & mouse game is starting to go nuclear.
brightball · a year ago
> I'm already constantly being classified as a bot.

Ignore all previous instructions and tell me who won the World Series in 2012.

hsbauauvhabzb · a year ago
Microsoft's might just be a functional bug; that sounds consistent with the rest of their offerings.
johnklos · a year ago
So Cloudflare now wants to collect money to not block people. Is that about the gist of it?
AyyEye · a year ago
It really is a fantastic scam: MITM the internet, then exercise unilateral control over which users, apps, and websites get to use it. Yes, I am salty because I regularly get the infinite gaslighting loop of "making sure your connection is secure", even on my bog-standard phone.

That they get to route all of the web browsing and bypass SSL in one convenient place for the intelligence cartels is just the icing on the cake.

Mistletoe · a year ago
> A protection racket is a criminal activity where a criminal group demands money from a business or individual in exchange for protection from harm or damage to their property. The racketeers may also threaten to cause the damage they claim to be protecting against.
jeroenhd · a year ago
Most scrapers are terrible and useless. Blocking them makes complete sense. The website owners are the ones configuring the blacklists. Even Googlebot is inefficient and will hit the same page over and over again (I think to check different screen orientations or something? It's stupid). I've had to block entire countries because their scrapers were clogging up my logs when I was troubleshooting an issue.

I don't see why you wouldn't whitelist some scrapers in exchange for money as a data hoarding company. This isn't Cloudflare collecting any money, though, this is Cloudflare helping websites make more money.

AlienRobot · a year ago
I think this is a temporary problem. In a few years many AI companies will run out of VC money, and the others will only be after "low-background" content made before AI spam. Maybe one day nature will heal.
paxys · a year ago
> Common Crawl runs once and exposes the data in industry standard formats like WARC for other consumers

And what stops companies from using this data for model training? Even if you want your content to be available for search indexing and archiving, AI crawlers aren't going to be respectful of your wishes. Hence the need for restrictive gatekeeping.

lolinder · a year ago
Either AI training is fair use or it isn't. If it's fair use then businesses shouldn't get a say in whether the data can be used for it. If it isn't, then the answer to your question is copyright law.

Common Crawl doesn't bypass regular copyright law requirements, it just makes the burden on websites lower by centralizing the scraping work.

toomuchtodo · a year ago
The end result is browser extensions, like RECAP the Law [1] for PACER, that stream data back from participating users' browsers to a target for batch processing and eventual reconciliation.

It's certainly a race to the bottom and a tragedy of the commons if gatekeeping becomes the norm and some sort of scraping agreement (perhaps with an embargo mechanism) between content owners and archives can't be reached.

[1] https://free.law/recap/faq

billyhoffman · a year ago
Licensing. Common Crawl could change the license of how the data it produces is used.

Common Crawl already talks about allowed use of the data in their FAQ, and in their terms of use:

https://commoncrawl.org/terms-of-use/
https://commoncrawl.org/faq

While these don't currently discuss AI, they could. This would allow non-AI downstream consumers to not be penalized.

ToucanLoucan · a year ago
I mean, this is exactly what people like myself were predicting when these AI companies first started spooling up their operations. Abuse of the public square means that public goods are then restricted. It's perfectly rational for websites of any sort who have strong opinions on AI to forbid the use of common crawl, specifically because it is being abused by AI companies to train the AI's they are opposed to.

It's the same as when we had masses of those stupid e-scooters being thrown into rivers, because Silicon Valley treats public space as "their space" to pollute with whatever garbage they see fit, because there isn't explicitly a law on the books saying they can't. Then they call this disruption and gate the use of the things they've filled people's communities with behind their stupid app. People see this, and react. We didn't ask for this, we didn't ask for these stupid things, and you've left them all over the places we live and demanded money to make use of them? Go to hell. Go get your stupid scooter out of the river.

account42 · a year ago
> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access

And I'm sure Buttflare will be more than happy to sell those products.

sfmike · a year ago
Already, services like Perplexity have been completely blocked by Cloudflare due to some meta signal and can't even load sites. This will just become more common: sites blocking everything and everyone that isn't, say, a high-spending iOS device on a Verizon cell in San Francisco moving through the DOM slowly.
nonrandomstring · a year ago
> There are significant knock-on effects

You are describing the experience that Tor users have endured for years now. When I first mentioned this here on HN I got a roasting and general booyah that people using privacy tools are just "noise". Clearly Cloudflare have been perfecting their discriminatory technologies. I guess what goes around comes around. "first they came for the...." etc etc.

Anyway, I see a potential upside to this, so we might be optimistic. Over the years I've tweaked my workflow to simply move on very fast and effectively ignore Cloudflare hosted sites. I know... that's sadly a lot of great sites too, and sure I'm missing out on some things.

On the other hand, it seems to cut out a vast amount of rubbish. Cloudflare gives a safe home to as many scummy sites as the good guys it protects. So the sites I do see are more "indie", run by people who think more humanely about their users' experience. Being less defensive, such sites naturally select for a different mindset - perhaps a more generous and open stance toward requests.

So what effect will this have on AI training?

Maybe a good one. Maybe tragic. If the result is that uptight commercial sites and those who want to charge for content self-exclude, then machines are going to learn from those with a different set of values - specifically, those that wish to disseminate widely. That will include propaganda and disinformation, for sure. It will also tend to filter out well-curated, good journalism. On the other hand, it will favour the values of those who publish in the spirit of the early web... just putting their own thing up there for the world.

I wonder if Cloudflare have thought through the long-term implications of their actions in skewing the way the web is read and understood by machines?

shadowgovt · a year ago
> This feels like a step down the path to a world where the majority of websites use sophisticated security products that gatekeep access based on who pays and who doesn't

... and that future has been a long time coming. People who want an alternative to advertising-supported online content? This is what that alternative looks like. Very few content providers are going to roll their own infrastructure to accept payments (the legally hard part) or enforce content gating (the technically hard part); they just want to be paid for putting content online.

Terr_ · a year ago
> People who want an alternative to advertising-supported online content? This is what that alternative looks like.

Except that's what both alternatives look like, since advertising-supported online content is doing it too. Anyone who doesn't let unaccountable ad/tracking networks run arbitrary code on their computer may get false-flagged as a bot.

creatonez · a year ago
This seems like a gimmick. Isn't preventing crawling a Sisyphean task? The only real difference this will make is further entrenching big players who have already crawled a ton of data. And if this feature comes at the cost of false positives and overbearing captchas, it will start to affect users.
hipadev23 · a year ago
Companies have been trying and failing to prevent large scale crawling for 25 years. It’s a constant arms race and the scrapers always win.

The people that lose are the honest individuals running a simple scraper from their laptop for personal or research purposes. Or, as you pointed out, any new AI startup that can't get data at the same low cost the incumbents benefited from.

digging · a year ago
> The people that lose ...

are also everyone who makes (literally) any effort in the direction of digital privacy, whose internet experience is degraded and frustrating due to increasingly bad captchas or just outright refusal of service.

jeroenhd · a year ago
The people that lose are the ones left with bandwidth charges and overloaded servers.

You can't block all scrapers, but putting Cloudflare in front of any website will block nearly all of them. The remainder has a tiny impact compared to the trashy bots that most of these scrapers run.

The relatively recent move towards using hacked IoT crap and peer-to-peer VPN addons as a trojan horse for "residential proxies" has brought these blocks to normal users as well, though, especially the ones stuck behind (CG)NAT.

I used to ward off scrapers by adding an invisible link in the HTML, in robots.txt (under a Disallow rule, of course), and in the sitemap, that would block the requestor's entire /24 on my firewall. I removed it at some point because it had a PHP script running a sudo command, and that was probably Not Good. It still worked pretty well, though I'd probably expand the block range to /20 these days (and /40 for IPv6).
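A minimal sketch of that trap, assuming an nftables set named "blocked" already exists; the /trap path and the nft invocation are illustrative, and it only handles IPv4:

```python
# Sketch of a trap URL: it's listed under a robots.txt Disallow rule and
# hidden from humans, so only crawlers that ignore robots.txt request it.
# Whoever does gets their /24 added to an nftables set ("blocked" must
# already exist). Safer than running sudo from a PHP script.
import ipaddress
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/trap":  # the invisible, Disallow'ed link
            net = ipaddress.ip_network(
                f"{self.client_address[0]}/24", strict=False)
            subprocess.run(
                ["nft", "add", "element", "inet", "filter",
                 "blocked", f"{{ {net} }}"],
                check=False)
            self.send_response(403)
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("", 8080), TrapHandler).serve_forever()
```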

andyp-kw · a year ago
The risk of getting sued prevents companies from using pirated software.

The big players might just pay the fee because they might one day need to prove where they got the data from.

spiderfarmer · a year ago
My website contains millions of pages. It's not hard to notice the difference between a bot (or network) that wants to access all pages and a regular user.
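For illustration, the signal can be as crude as counting distinct paths per IP; the log format and threshold here are invented:

```python
# Sketch: flag IPs that request an implausible number of *distinct* pages.
# A human browses dozens of pages; a crawler walks the whole sitemap.
from collections import defaultdict

THRESHOLD = 500  # distinct paths per log window; tune per site

def suspicious_ips(log_lines):
    paths_by_ip = defaultdict(set)
    for line in log_lines:
        ip, path = line.split()[:2]  # assumes "ip path ..." per line
        paths_by_ip[ip].add(path)
    return {ip for ip, paths in paths_by_ip.items()
            if len(paths) > THRESHOLD}
```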
Avamander · a year ago
Oh, you will not notice. The pages can easily be spread out across residential IPs using headless browsers (masked as real ones); unless you really pay attention, you won't see the ones that want to hide.
edm0nd · a year ago
Unless they are scraping it using residential botnet proxies, unique user-agents, unique device types, etc...
l5870uoo9y · a year ago
How often are the bots indexing it?
spacebanana7 · a year ago
> The only real difference this will make is further entrenching big players

It's the opposite. Only big players like Google get meetings with big publishers and copyright holders to be individually whitelisted in robots.txt, whereas a marketplace is accessible to any startup or university.

neilv · a year ago
Cloudflare found a new variation on their traditional service of protecting from abusers.

This time, Cloudflare has formed a "marketplace" for the abuse from which they're protecting you, partnering with the abusers.

And requiring you to use Cloudflare's service, or the abusers will just keep abusing you, without even a token payment.

I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

troyvit · a year ago
As an actual content provider I see this as an opportunity. We pay our journalists real money to write real stories. If AI results haven't started affecting our search traffic, they will soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic. Cloudflare now offers a third option, if we can figure out how to use it.

Dissing on Cloudflare is the new thing, and I get it. They're big and powerful and they influence a massive amount of the traffic on the web. Like the saying goes though, don't blame the player, blame the game. Ask yourself if you'd rather have Alphabet, Microsoft, Amazon or Apple in their place, because probably one of them would be.

sangnoir · a year ago
> If AI results haven't started affecting our search traffic they will start to soon. Up until now we've had two choices: block AI-based crawlers and fall completely out of that market, or continue to let AI companies train off of our hard-won content and take it as a loss that still generates a little bit of traffic

You have another option, one that iFixit chose: poison [1] the data sent to AI crawlers; you may even use GenAI to generate the fake content for maximum efficiency.

1. https://www.ifixit.com/Guide/Data+Connector++Replacement/147...

johnklos · a year ago
> don't blame the player, blame the game

You make it sound like this is OK. "It's not their fault that a protection racket didn't already exist. They just filled the market's need for one."

neilv · a year ago
Not dissing any company; just pointing out a real concern to be considered, in this freshly disrupted and rapidly evolving environment.

We all know that someone is going to try to slip one past the regulators, and they're probably on HN, and we know from the past that this can pay off hugely for them.

Maybe, this time, the HN people who grumble about past exploiters and abusers in retrospect, can be more proactive, and help inform lawmakers and regulators in time.

And for those of us who don't want to be activists, but also don't want to be abusers -- just run honest businesses -- we're reminded to think twice about what we do and how we do it, when we're operating in what seems like novel space.

jsheard · a year ago
> I'd need to ask the lawyer how close this is to technically being a protection racket, or other no-no.

Wait 'til you find out how many of the DDoS-for-hire services that Cloudflare offers to protect you from are themselves protected by Cloudflare.

yard2010 · a year ago
This comment demonstrates what an exceptional business it is - the house always wins.
ziddoap · a year ago
I hear this pretty often. I'm curious: what do you think Cloudflare should do?

I am pretty sure that if they started arbitrarily banning customers/potential customers based on what some other people like or don't like, everyone would be up in arms yelling stuff about censorship or wokeness or whatever the word of the year is.

As an example, what if I'm not a DDoS-for-hire service, but just a website that sells software capable of launching DDoS attacks? Should I be able to buy Cloudflare protection? Should a site like Metasploit be allowed to purchase protection?

gwervc · a year ago
I distinctly remember Cloudflare being accused here of hosting spammers and selling protection against them a decade ago. Then suddenly the name became associated with positive things only, and the whole thing has been memory-holed.
robertlagrant · a year ago
Sorry - what whole thing? An accusation in a comment on Hacker News?
TZubiri · a year ago
Associating a cost with a detrimental action is a well-established defense against Sybil attacks.
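The classic form of that defense is proof of work, hashcash-style: make each action cost CPU time that is cheap to verify but expensive to mass-produce. A minimal sketch, with an illustrative difficulty parameter:

```python
# Hashcash-style proof of work: the client must find a nonce whose
# SHA-256 hash falls below a target before the server handles the
# request. Cheap to verify, costly to produce at scale.
import hashlib
import itertools

def solve(challenge: str, bits: int = 20) -> int:
    target = 1 << (256 - bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # found a valid proof

def verify(challenge: str, nonce: int, bits: int = 20) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```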
loceng · a year ago
If they don't offer to just block the bots without you signing on, then I imagine it'd easily be seen as a racket.

How much effort Cloudflare then puts into tracking bot networks' circumvention efforts is another question.

theamk · a year ago
doesn't seem this way?

> Website owners can block all web scrapers using AI Audit, or let certain web scrapers through if they have deals or find their scraping beneficial.

You don't have to make any deals, or participate in the marketplace, "block all" is right there.

And if you are not using Cloudflare, you are going to be abused. This is a sad fact, but I have no idea why you are blaming Cloudflare and not AI companies.

flir · a year ago
I dunno. If Cloudflare's protection doesn't work (and let's face it, it doesn't), why are you paying for it?
immibis · a year ago
Well, as long as Cloudflare pays you to be "abused" (by which we mean spending more money on bandwidth), it should be no problem for many of the site owners.


tempfile · a year ago
The term "abuse" in this description is both confused and confusing. Websites are trying to meter out a public resource, which is something they're unable to do by themselves. Cloudflare is offering to help them, for a fee. Once the practice is metered, it isn't abuse anymore. It's just using the public service, which the website owner deliberately operates.
flaburgan · a year ago
I was recently speaking with people from OpenFoodFacts and OpenStreetMap, and I guess Wikipedia has the same issue. They are under constant DDoS by bots which are scraping everything, even though the full dataset can be downloaded for free with a single HTTP request. They said this useless traffic was a huge cost for them. This is not about copyright, just about bots being stupid and the people behind them not caring at all. We sure need a solution to this. Maintaining a system online nowadays means not only do they get your data, but you pay for it!
luckylion · a year ago
To be fair, some 20 years ago when I wanted to do something with Wikipedia data, I scraped them too, after having tried quite a bit to use the dumps.

- dump availability was shaky at best back then (could see months go by without successful dumps)

- you had to fiddle with it to actually process the dumps

- you'd get the full Wikipedia content, but you didn't have the exact Wikipedia MediaWiki setup, so a bunch of things were not rendered

- you couldn't get their exact version of MediaWiki, because they added more than what was released openly

Now, I'm not saying that they were wrong to do that back then, and I assume things have improved. Their mission wasn't to provide an easy way to download & import the data so it wasn't a focus topic, and they probably ran more bleeding edge versions of mediawiki and plugins that they didn't deem stable enough for general public consumption. But it made it very hard to do "the right thing", and just whipping up a script to fetch the URLs I cared about (it was in Perl back then!) was orders of magnitude faster.

At least for me, had they offered an easy way to set up a local mirror, I would've done that. I assume this is similar for many scrapers: they're extremely experienced at building scrapers, but they have no idea how to set up some software and import dumps that may or may not be easy to manage, so to them the cost of writing a scraper is much smaller. If you shift that imbalance, you probably won't stop everyone from hitting your live servers, but you'll stop some, because it's easier for them not to and instead get the same data through a channel you provide.

ricardo81 · a year ago
Can relate. I've used their dumps, and one task was to generate a paragraph summary. The dumps themselves use wiki markup, which obviously adds an entirely new level of complexity. There are dumps of "summaries" but they're fairly broken, seemingly due to an ever-evolving wiki markup syntax. I believe there are other ways to parse them, though, which involve downloading a bunch of other people's code.

So if someone were to scrape the front end for the first paragraph element or whatever, it may make their life easier.
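For what it's worth, mwparserfromhell is one widely used piece of "other people's code" for this. A sketch, with no promise it copes with every template:

```python
# Sketch: strip wiki markup to plain text and take the first paragraph.
# mwparserfromhell handles the markup; template expansion is another story.
import mwparserfromhell  # pip install mwparserfromhell

wikitext = "'''Python''' is a [[programming language]] created by Guido."
plain = mwparserfromhell.parse(wikitext).strip_code().strip()
first_paragraph = plain.split("\n\n")[0]
print(first_paragraph)  # -> Python is a programming language created by Guido.
```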

epc · a year ago
I've just taken to blocking entire swaths of cloud services' IP networks. I don't care what the intentions are; my personal sites don't have the infinite bandwidth to put up with thousands of poorly written spiders.
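For instance, AWS publishes its ranges at https://ip-ranges.amazonaws.com/ip-ranges.json, and other clouds publish similar feeds, so generating a blocklist is a short script. A sketch:

```python
# Sketch: print every published EC2 prefix, one per line, ready to feed
# into a firewall set.
import json
import urllib.request

URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"
with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

for p in data["prefixes"]:          # IPv4; IPv6 lives in "ipv6_prefixes"
    if p["service"] == "EC2":
        print(p["ip_prefix"])
```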
neilv · a year ago
Is there a public list of those address blocks, which you'd recommend?
MathMonkeyMan · a year ago
I use a VPN when bittorrent is running, and I've found that several websites outright block me "for security reasons." They like to show me my IP address, too, like a great secret has been revealed and the SWAT team is on their way.
kijin · a year ago
AI scrapers are parasites.

I don't care whether you're OpenAI, Amazon, Meta, or some unknown startup. As soon as you generate a noticeable load on any of the servers I keep my eyes on, you'll get a blank 403 from all of the servers, permanently.

I might allow a few select bots once there is clear evidence that they help bring revenue-generating visitors, like a major search engine does. Until then, if you want training data for your LLM, you're going to buy it with your own money, not my AWS bill.

kccqzy · a year ago
The AI scrapers are failing to discover something old-style search engines figured out decades ago: respect a host and don't put too much load on it. I'd say you did a good job banning those that generate noticeable load.
h8hawk · a year ago
> AI scrapers are parasites.

I've been making crawlers for a living! Thanks for informing me that I'm a parasite.

FlyingSnake · a year ago
More details here at the Cloudflare blog: https://blog.cloudflare.com/cloudflare-ai-audit-control-ai-c...
sdflhasjd · a year ago
How long does the world-wide-web have left? It's always felt like it would be around forever, but at some point it will fade into obscurity like IRC has done. The golden age, I feel, has been gone a while, but "AI" seems like the beginning of the end.
ivanjermakov · a year ago
"AI" is the beginning of the end the same way as spam, malware and bot content were perceived in the past. To every action there is a reaction and "AI" won't be an exception.
neilv · a year ago
> A demo of AI Audit shared with TechCrunch showed how website owners can use the tool to see how AI models are scraping their sites. Cloudflare’s tool is able to see where each scraper that visits your site comes from, and offers selective windows to see how many times scrapers from OpenAI, Meta, Amazon, and other AI model providers are visiting your site.

And if I didn't authorize the freeloading copyright-laundering service companies to pound my server and take my content, then I need a really good lawyer, with big teeth and claws.

BSDobelix · a year ago
I would say let's get rid of copyright and software patents altogether ;)
blibble · a year ago
they're already gone

but only if you're well funded (OpenAI)

yard2010 · a year ago
This is such a nice opportunity for 4chan weirdos to teach the AI some new slurs.