Readit News
ddtaylor · 2 months ago
Google really doesn't have a leg to stand on here. They scrape the Internet. They have replaced content against the wishes of users multiple times, such as with AMP. Their entire business model recently has been to provide you answers they learned from scraping your website, and now they want to sue other people who are doing the same.

Data wants to be free. They knew that once.

EDIT: Also, to be clear, I am not saying they can't win legally. I'm sure they can play legal games and shop around until they are successful. They are in the wrong conceptually.

miki123211 · 2 months ago
As the post says, Google only scrapes the websites that want to be scraped. Sure, it's opt-out (via robots.txt) rather than opt-in, but they do give you a choice. You can even decide between no scraping at all and opting out on a per-scraper basis, and Google will absolutely honor your preferences in that regard.

SERP API just assumes everybody wants to be scraped, and doesn't give you a choice.

(whether websites should have such a choice is a different matter entirely).

Nextgrid · 2 months ago
This is a bad argument because Google is using its monopoly to effectively force websites to allow Google to scrape them.
lurking_swe · 2 months ago
requiring me to explicitly opt-out of something is NOT the same thing as getting my consent. So your argument breaks down there.

You know what getting my consent would look like? Google hosting a form where i can tell them PLEASE SCRAPE MY WEBSITE and include it in your search results. That is what consent looks like.

Google has never asked for my consent. Yet they expect others to behave by different rules.

Now where google may have a reasonable case is that google scrapes with the intention of offering the data “for free”. SerpAPI does not.

gnfargbl · 2 months ago
If this is about protecting third parties from being scraped, why does Google have an interest at all? Surely Google won't have the relevant third-party data itself because, as you say, Google respects robots.txt. So how can that data be scraped from Google?

I don't think this suit is actually about that, though. I think Google's complaint is that

> SerpApi deceptively takes content that Google licenses from others

In other words, this is just a good old-fashioned licence violation.

ricardo81 · 2 months ago
Unfortunately they do have a couple of points that may prove salient (though I fully agree about them being scrapers also).

You can search Google _for free_ (with all the caveats of that statement); part of their grievance is that serpapi resells the scraped data as a paid-for service.

Lots of Google's bot blocking is also circumvented, blocking they seem to have put a lot of effort into over the past year:

- robots.txt directives (fwiw)

- You need JS

- If you have no cookie you'll be given a set of JS fingerprints, apparently one set for mobile and one for desktop. You may have to tweak what fingerprints you give back in order to get results custom to user agent etc.
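The robots.txt point, at least, is mechanically checkable. A minimal sketch of how a polite scraper would honor such directives before fetching; the rules below are made up for illustration, not Google's actual robots.txt:

```python
# Sketch: honoring robots.txt directives before fetching, using the
# stdlib parser. The directives here are illustrative only.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",   # hypothetical: results pages off-limits to generic bots
    "Allow: /maps",
])

def may_fetch(agent: str, url: str) -> bool:
    """Check a URL against the parsed directives."""
    return rp.can_fetch(agent, url)

print(may_fetch("MyBot", "https://www.google.com/search?q=test"))  # False
print(may_fetch("MyBot", "https://www.google.com/maps"))           # True
```

A scraper like the one in the suit presumably skips exactly this check, which is part of what Google is complaining about.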

Google was never that bothered about scraping if it was done at a reasonable volume. Against pools of millions of IPs and people with a handle on how to get around their blocking, they're at the mercy of how polite the scraping is. They're maybe also worried about people reselling data en masse to competitors, i.e. their usual "all your data belongs to us, and only us."

crote · 2 months ago
> You can search Google for free

I thought the ads counted as payment? That seems to be the logic used to take technical measures against adblockers on YouTube while pushing users towards a paid ad-free subscription, at least.

If viewing ads is payment, then Google isn't a free service. If viewing ads isn't payment, then Google should have no problem with people using adblockers.

LunaSea · 2 months ago
> You can search Google _for free_

Well, not through their API, which is a paid service.

eddythompson80 · 2 months ago
Eh, and in 20 years, if SerpApi or whatever the fuck becomes the next Google, they'll have a blog post titled "Why we're taking legal action against BlemFlamApi data collection".

The biggest joke was all the “hackers” 25 years ago shouting “Don’t be evil like Oracle, Microsoft, Apple or Adobe and charge for your software, be good like Google and just put like a banner ad or something and give it away for free”

Nextgrid · 2 months ago
We need a legal precedent that enshrines adversarial interoperability as legal so that we can have a competitive market of BlemFlamApis with no risks of being sued.
p0w3n3d · 2 months ago
they do have a Leg to stand on. It's called Money. The second one is called Position (on the market). The third is Lawyers. It's a stable tripod.
modeless · 2 months ago
I bet SerpApi is getting more business than ever due to the Streisand effect. I hadn't heard about them, but if I want an API for Google results I'm definitely going to choose the one that was so hard for Google to block that they had to sue them instead. I see on their website they even advertise a "legal shield" where they assume scraping liability for their customers.
Daviey · 2 months ago
Can confirm, I just signed up /because/ of this announcement.
falloutx · 2 months ago
And its API seems really easy to use.
polishdude20 · 2 months ago
"Google follows industry-standard crawling protocols, and honors websites’ directives over crawling of their content."

Is that true with how they trained Gemini? Doesn't everyone with a foundational model scrape the web relentlessly without regard for robots.txt?

miki123211 · 2 months ago
No, but AFAIK they pulled some shenanigans by "bundling" Gemini scraping and search engine scraping.

Almost everybody wants to appear in search, so disallowing the entirety of Google is far more costly than, e.g., disallowing OpenAI, which even differentiates between content scraped for training and content accessed to respond to a user request.

inkysigma · 2 months ago
While there isn't a way to differentiate between scraping for training data and content accessed in response to a user request, I think you can block Google-Extended to block training access.
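If that's right, the split is expressible directly in robots.txt. A sketch of staying in search while opting out of AI training, assuming the `Google-Extended` token Google documents for this purpose (the rules themselves are illustrative):

```
# Keep appearing in Google Search results
User-agent: Googlebot
Allow: /

# Opt out of use for Gemini/AI training (Google-Extended token)
User-agent: Google-Extended
Disallow: /
```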
jonatron · 2 months ago
They're in a unique position where many people allow googlebot but try to block most other bots
tenuousemphasis · 2 months ago
Allow for the purpose of indexing, not training models.

Like if you give a friend a key to your house so they can check on your plants when you're out of town but they throw a rager and trash the place.

fragmede · 2 months ago
Google honors robots.txt. They're not "everyone with a foundational model", though.

ricardo81 · 2 months ago
Reminds me of (the ironic AI summary) https://www.google.com/search?channel=entpr&q=celebritynetwo...

Testimony https://medium.com/@brianwarner/celebritynetworths-statement...

CNW ended up putting up content for fake celebrities, after declining Google's request for API usage, to prove that Google was scraping them.

visarga · 2 months ago
I had an idea - take SerpAPI and save the top 10 or 20 links for many queries (millions), and put that in a RAG database. Then it can power a local LLM to do web search without ever touching Google.

The index would just point a local crawler towards hubs of resources, links, feeds, and specialized search engines. Then fresh information would come from the crawler itself. My thinking is that reputable sites don't appear every day, if you update your local index once every few months it is sufficient.

The index could host 1..10 or even 100M stubs, each one touching on a different topic, and concentrating the best entry points on the web for that topic. A local LLM can RAG-search it, and use an agent to crawl from there on. If you solve search this way, without Google, and you also have local code execution sandbox, and local model, you can cut the cord. Search was the missing ingredient.

You can still call regular search engines for discovery. You can build your personalized cache of search stubs using regular LLMs that have search integration, like ChatGPT and Gemini, you only need to do it once per topic.
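The stub-index idea can be sketched in a few lines. Everything here (topics, URLs, the word-overlap scoring) is hypothetical and stands in for a real embedding-based RAG lookup:

```python
# Hypothetical sketch of a "search stub index": each stub maps a topic
# to pre-saved entry-point URLs; retrieval is naive word overlap,
# standing in for embedding search. All topics and URLs are made up.

STUBS = {
    "python packaging wheels pip": ["https://example.org/packaging-hub"],
    "rust async tokio runtime": ["https://example.org/rust-async"],
    "linux kernel scheduler cfs": ["https://example.org/kernel-sched"],
}

def search_stubs(query: str, k: int = 2):
    """Rank stubs by word overlap with the query; a local crawler
    would then fetch fresh content starting from the entry points."""
    q = set(query.lower().split())
    scored = sorted(
        ((len(q & set(topic.split())), topic, links)
         for topic, links in STUBS.items()
         if q & set(topic.split())),
        reverse=True,
    )
    return [(topic, links) for _, topic, links in scored[:k]]

print(search_stubs("how does the linux scheduler work"))
```

The index stays small because it stores only entry points per topic, not pages; freshness comes from crawling at query time, as described above.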

ricardo81 · 2 months ago
Fetching web pages at the kind of volume needed to keep the index fresh is a problem unless you're Googlebot. It requires manual intervention: whitelisting yourself with the likes of Cloudflare, cutting deals with the likes of Reddit, and building a good reputation with any other bot-blocking software that's unfamiliar with your user agent. Even then, you may still find yourself blocked from critical pieces of information.
visarga · 2 months ago
No, I think we can get by with using CommonCrawl, pulling every few months the fresh content and updating the search stubs. The idea is you don't change the entry points often, you open them up when you need to get the fresh content.

Imagine this stack: local LLM, local search stub index, and local code execution sandbox - a sovereign stack. You can get some privacy and independence back.

throwfaraway135 · 2 months ago
Before the AI era they would at least have had a stronger moral case, but now this is kind of hypocritical.

They also only just started caring about this, probably because they don't want their competitors to get the same data they have.

randyrand · 2 months ago
google will lose, and I'm surprised they are even trying. hiQ v. LinkedIn already settled this: scraping public web pages isn’t “unauthorized access,” even if the site says no via robots.txt or ToS. Those aren’t locks.
littlecranky67 · 2 months ago
In Germany, this was also already ruled lawful by the highest court (in the context of plane ticket prices scraping).
bitbasher · 2 months ago
HiQ lost on appeal, Microsoft won
gnfargbl · 2 months ago
No, it's more complicated than that: https://www.morganlewis.com/blogs/sourcingatmorganlewis/2022...

The short answer is that scraping isn't a CFAA offence but might be a terms and conditions violation, depending on the specifics of the access.

sjtgraham · 2 months ago
Incorrect. OP's view is present day 9th Circuit precedent.
WhatsName · 2 months ago
Aren't search engine results a product that Google offers? [1] I find it quite strange to argue that website owners, when they wrote that robots.txt maybe ten years ago, agreed to Google doing anything with that data beyond displaying it in their search results, yet others shall not access those results programmatically.

I certainly did not, and I find Google using the content scraped from my website for money or AI (which they also sell on a per-token basis) more questionable than some third party offering API access to it.

[1] https://docs.cloud.google.com/generative-ai-app-builder/docs...