Google really doesn't have a leg to stand on here. They scrape the Internet. They have replaced content against the wishes of its owners multiple times, as with AMP. Their entire business model lately has been to provide you answers they learned from scraping your website, and now they want to sue other people for doing the same.
Data wants to be free. They knew that once.
EDIT: Also, to be clear, I am not saying they can't win legally. I'm sure they can play legal games and shop around until they succeed. But they are in the wrong conceptually.
As the post says, Google only scrapes the websites that want to be scraped. Sure, it's opt-out (via robots.txt) rather than opt-in, but they do give you a choice. You can even decide between no scraping at all and opting out on a per-scraper basis, and Google will absolutely honor your preferences in that regard.
SerpApi just assumes everybody wants to be scraped, and doesn't give you a choice.
(whether websites should have such a choice is a different matter entirely).
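For reference, that per-scraper opt-out is just a robots.txt with one block per crawler token. A minimal sketch (the crawler tokens are real; the all-or-nothing rules are illustrative):

```
# Allow normal search indexing, but opt out of AI-related crawlers.
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /
```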
Requiring me to explicitly opt out of something is NOT the same thing as getting my consent. So your argument breaks down there.
You know what getting my consent would look like? Google hosting a form where I can tell them PLEASE SCRAPE MY WEBSITE and include it in your search results. That is what consent looks like.
Google has never asked for my consent. Yet they expect others to behave by different rules.
Now, where Google may have a reasonable case is that Google scrapes with the intention of offering the data “for free”. SerpApi does not.
If this is about protecting third parties from being scraped, why does Google have an interest at all? Surely Google won't have the relevant third-party data itself because, as you say, Google respects robots.txt. So how can that data be scraped from Google?
I don't think this suit is actually about that, though. I think Google's complaint is that
> SerpApi deceptively takes content that Google licenses from others
In other words, this is just a good old-fashioned licence violation.
Unfortunately they do have a couple of points that may prove salient (though I fully agree about them being scrapers also).
You can search Google _for free_ (with all the caveats of that statement); part of their grievance is that SerpApi resells the scraped data as a paid service.
A lot of Google's bot blocking is also being circumvented, something they seem to have put significant effort into over the past year:
- robots.txt directives (fwiw)
- JavaScript is required
- If you have no cookie you'll be given a set of JS fingerprints, apparently one set for mobile and one for desktop. You may have to tweak which fingerprints you send back in order to get results tailored to the user agent, etc.
Google was never that bothered about scraping if it was done at a reasonable volume. With pools of millions of IPs and a handle on how to get around their blocking they're at the mercy of how polite the scraping is. They're maybe also worried about people reselling data en masse to competitors i.e. their usual all your data belongs to us and only us.
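To make "polite" concrete, here is a toy Python sketch of the core mechanic: enforcing a minimum delay between requests. The class name and the one-second default are made up for illustration; real crawlers also track per-host state and honor things like Crawl-delay.

```python
import time


class PoliteFetcher:
    """Toy rate limiter: enforce a minimum delay between requests.

    A sketch of what 'polite' scraping means mechanically; the name
    and the one-second default are illustrative, not any standard."""

    def __init__(self, min_delay: float = 1.0):
        self.min_delay = min_delay
        self._last = float("-inf")  # no request made yet

    def wait(self) -> float:
        """Sleep just enough to honor min_delay; return the time slept."""
        now = time.monotonic()
        sleep_for = max(0.0, self._last + self.min_delay - now)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()
        return sleep_for
```

You would call `wait()` before each request; scrapers with millions of IPs sidestep exactly this kind of per-client throttling, which is the asymmetry the comment describes.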
I thought the ads counted as payment? That seems to be the logic used to take technical measures against adblockers on YouTube while pushing users towards a paid ad-free subscription, at least.
If viewing ads is payment, then Google isn't a free service. If viewing ads isn't payment, then Google should have no problem with people using adblockers.
Eh, and in 20 years, if SerpApi or whatever the fuck becomes the next Google, they’ll have a blog post titled “Why we’re taking legal action against BlemFlamApi data collection”.
The biggest joke was all the “hackers” 25 years ago shouting “Don’t be evil like Oracle, Microsoft, Apple or Adobe and charge for your software, be good like Google and just put like a banner ad or something and give it away for free”
We need a legal precedent that enshrines adversarial interoperability as legal so that we can have a competitive market of BlemFlamApis with no risks of being sued.
I bet SerpApi is getting more business than ever due to the Streisand effect. I hadn't heard about them, but if I want an API for Google results I'm definitely going to choose the one that was so hard for Google to block that they had to sue them instead. I see on their website they even advertise a "legal shield" where they assume scraping liability for their customers.
No, but AFAIK they pulled some shenanigans with "bundling" Gemini scraping and search engine scraping.
Almost everybody wants to appear in search, so disallowing the entirety of Google is far more costly than, e.g., disallowing OpenAI, who even differentiate between content scraped for training and content accessed to respond to a user request.
While there isn't a way to differentiate between scraping for training data and content accessed in response to a user request, I think you can block the Google-Extended token to block AI training access.
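Python's standard library can check such rules directly. A small sketch using a hypothetical robots.txt that allows search indexing but blocks the Google-Extended token:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow normal search crawling, but opt out
# of AI training via the Google-Extended token.
robots_txt = [
    "User-agent: Googlebot",
    "Allow: /",
    "",
    "User-agent: Google-Extended",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("Googlebot", "/article.html"))        # search indexing allowed
print(rp.can_fetch("Google-Extended", "/article.html"))  # training access blocked
```

The same parser is what a well-behaved scraper would consult before fetching anything.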
I had an idea: take SerpAPI and save the top 10 or 20 links for many queries (millions), and put that in a RAG database. Then it can power a local LLM to do web search without ever touching Google.
The index would just point a local crawler towards hubs of resources, links, feeds, and specialized search engines. Then fresh information would come from the crawler itself. My thinking is that reputable sites don't appear every day; updating your local index once every few months is sufficient.
The index could host 1–10M or even 100M stubs, each touching on a different topic and concentrating the best entry points on the web for that topic. A local LLM can RAG-search it, then use an agent to crawl from there. If you solve search this way, without Google, and you also have a local code execution sandbox and a local model, you can cut the cord. Search was the missing ingredient.
You can still call regular search engines for discovery. You can build your personalized cache of search stubs using regular LLMs that have search integration, like ChatGPT and Gemini, you only need to do it once per topic.
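A minimal Python sketch of the stub-index retrieval idea, assuming the stubs are already collected. Keyword overlap stands in for real embeddings, and the topics and URLs are made up:

```python
def search_stubs(query: str, index: dict[str, list[str]]) -> list[str]:
    """Return the links of the stub whose topic best overlaps the query words."""
    q = set(query.lower().split())
    best = max(index, key=lambda topic: len(q & set(topic.split())), default=None)
    return index.get(best, [])


# Hypothetical stub index: topic -> top links saved from a search API.
stub_index = {
    "rust async runtimes": ["https://example.org/tokio-notes"],
    "sourdough starter care": ["https://example.org/sourdough-faq"],
}

print(search_stubs("comparing rust async runtimes", stub_index))
# → ['https://example.org/tokio-notes']
```

A real version would embed the stubs and query into vectors and do nearest-neighbor search, but the shape is the same: retrieve entry points locally, then let the agent crawl outward.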
Fetching web pages at the kind of volume needed to keep the index fresh is a problem unless you're Googlebot. It requires manual intervention: whitelisting yourself with the likes of Cloudflare, cutting deals with the likes of Reddit, and building a good reputation with any other bot-blocking software that's unfamiliar with your user agent. Even then, you may still find yourself blocked from critical pieces of information.
No, I think we can get by with CommonCrawl, pulling the fresh content every few months and updating the search stubs. The idea is that you don't change the entry points often; you open them up when you need to get the fresh content.
Imagine this stack: local LLM, local search stub index, and local code execution sandbox - a sovereign stack. You can get some privacy and independence back.
Google will lose, and I'm surprised they are even trying. hiQ v. LinkedIn already settled this: scraping public web pages isn’t “unauthorized access,” even if the site says no via robots.txt or ToS. Those aren’t locks.
Aren't search engine results a product that Google offers? [1]
I find it quite strange to argue that website owners, when they wrote that robots.txt maybe ten years ago, agreed to Google doing anything with that data beyond displaying it in their search results, but that others shall not access those results programmatically.
I certainly did not, and I find Google using the content it scraped from my website for money or for AI (which they also sell on a token basis) more questionable than some third party offering API access to it.
Well, not through their API, which you do need to pay for.
Is that true with how they trained Gemini? Doesn't everyone with a foundational model scrape the web relentlessly without regard for robots.txt?
Like if you give a friend a key to your house so they can check on your plants when you're out of town but they throw a rager and trash the place.
Testimony https://medium.com/@brianwarner/celebritynetworths-statement...
CNW ended up putting up content for fake celebrities, after declining Google's request for API usage, to prove that Google was scraping them.
They also started caring about this, probably because they don't want their competitors to get the same data as they have.
The short answer is that scraping isn't a CFAA offence but might be a terms and conditions violation, depending on the specifics of the access.
[1] https://docs.cloud.google.com/generative-ai-app-builder/docs...