Here's a quick demo showing how you can use it to scrape leads from an auto dealer directory. What's cool is that it scrapes non-uniform pages, which is quite hard to do with "traditional" scrapers: https://youtu.be/wPbyPSFsqzA
A little background: I've written lots and lots of scrapers over the last 10+ years. They're fun to write when they work, but the internet has changed in ways that make them harder to write. One change has been the increasing complexity of web pages due to SPAs and obfuscated CSS/HTML.
I started experimenting with using ChatGPT to parse pages, and it's surprisingly effective. It can take the raw text and/or HTML of a page and answer most scraping requests. And in addition to traditional scraping things like pulling out prices, it can extract subjective data, like summarizing the tone of an article.
As an example, I used FetchFox to scrape Hacker News comment threads. I asked it for the number of comments, and also for a summary of the topic and tone of the articles. Here are the results: https://fetchfoxai.com/s/cSXpBs3qBG . You can see the prompt I used for this scrape here: https://imgur.com/uBQRIYv
Right now, the tool does a "two step" scrape. It starts with an initial page (like LinkedIn) and looks for specific types of links on that page (like links to software engineer profiles). It does this using an LLM, which receives a list of links from the page and picks out the relevant ones.
Then, it queues up each link for an individual scrape. It directs Chrome to visit each page, grab the text/HTML, and analyze it using an LLM.
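The two-step flow above can be sketched in a few lines of Python. This is a hedged illustration, not FetchFox's actual code (the extension is JavaScript): step one collects the page's links and builds a prompt asking an LLM which ones match the user's request; step two would then queue the selected links for individual scrapes. The prompt format is my own assumption.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def build_filter_prompt(links, instruction):
    """Step 1: hand the LLM the link list and ask which ones are relevant."""
    listing = "\n".join(f"{i}: {text} -> {href}"
                        for i, (href, text) in enumerate(links))
    return (f"{instruction}\n\nLinks:\n{listing}\n\n"
            "Reply with a JSON array of the matching link numbers.")

html = '<a href="/item?id=1">Show HN: FetchFox</a><a href="/jobs">Jobs</a>'
collector = LinkCollector()
collector.feed(html)
prompt = build_filter_prompt(collector.links, "Find links to Show HN posts.")
```

The LLM's reply (a list of indices) would then drive step two: visiting each selected link and extracting the requested fields.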
There are options for how fast/slow to do the scrape. Some sites (like HN) are friendly, and you can scrape them very fast. For example, here's me scraping Amazon with 50 tabs: https://x.com/ortutay/status/1824344168350822434 . Other sites (like LinkedIn) have strong anti-scraping measures, so it's better to use the "1 foreground tab" option. This is slower, but it gives better results on those sites.
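The tab-count option is essentially a concurrency limit. Here's a hedged sketch of that idea in Python using a semaphore, with a sleep standing in for driving a real browser tab (the extension itself manages actual Chrome tabs, which this does not attempt to model):

```python
import asyncio

async def scrape_page(url, sem):
    """Scrape one page, but only while holding a concurrency slot."""
    async with sem:
        # Placeholder for visiting a real tab; sleep stands in for page load.
        await asyncio.sleep(0.01)
        return f"scraped {url}"

async def scrape_all(urls, max_tabs):
    """Scrape every URL, with at most max_tabs in flight at once."""
    sem = asyncio.Semaphore(max_tabs)
    return await asyncio.gather(*(scrape_page(u, sem) for u in urls))

urls = [f"https://example.com/{i}" for i in range(10)]
results = asyncio.run(scrape_all(urls, max_tabs=3))
```

Setting `max_tabs=50` corresponds to the aggressive mode; `max_tabs=1` corresponds to the "1 foreground tab" option for hostile sites.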
The extension is 100% free forever if you use your OpenAI API key. It's also free "for now" with our backend server, but if that gets overloaded or too expensive we'll have to introduce a paid plan.
Last thing, you can check out the code at https://github.com/fetchfox/fetchfox . Contributions welcome :)
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182
Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user generated content anyways, for example Reddit is built on UGC. Why shouldn't people be able to scrape it?
https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin...
These are entirely different things. The upshot of the proceedings is that while the courts ruled there weren't sufficient grounds for an injunction to stop the scraping, it was nonetheless still injurious to the plaintiff and had breached their User Agreement -- thus allowing LinkedIn to compel hiQ into a settlement.
From Wikipedia:
https://www.youtube.com/watch?v=miBk0lyMBC0
Also wondering how the OP thinks about differentiating themselves and standing out in a marketplace of seemingly a bazillion options
More specifically, FetchFox is targeting a specific niche of scraping. It focuses on small scale scraping, like dozens or a few hundred pages. This is partly because, as a Chrome extension, it can only scrape what the user's internet connection can support. You can't scrape thousands or millions of pages on a residential connection.
But a separate reason is that I think LLMs open up a new market and use case for scraping. FetchFox lets anyone scrape without coding knowledge. Imagine you're doing a research project and want data from 100 websites. FetchFox makes that easy, whereas traditional scraping would have required coding knowledge to scrape those sites.
As an example, I used FetchFox to research political bias in the media. I was able to get data from hundreds of articles without writing a line of code: https://ortutay.substack.com/p/analyzing-media-bias-with-ai . I think this tool could be used by many non-technical people in the same way.
e.g. Mr. John Smith is a journalist; find his ten most recent articles by locating his personal website, news sites, and social media.
So I'm wondering if your tool will be obsolete in a year's time?
Personally, I am looking into options in this area. Are you planning to offer a cloud-based version of this at some point? If not, could you say which existing ones are good?
It is not easy to evaluate scrapers unless you have had to deal with lots of poorly written websites in your life. Using one on a few highly structured, well-maintained sites can be impressive, but if you use it to acquire data from many websites, or from large ones, things get hairy fast.
Most scrapers today are some combination of: extracting XPaths and reducing them to their loosest common form; parsing semantic (or easy-to-identify, like links) or highly structured content that has discoverable patterns; and LLMs.
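The "loosest common form" idea can be sketched concretely: given XPaths to several sibling elements, wildcard away the positional indices that differ so one path matches all of them. This is a minimal illustration under my own assumptions about the path format, not any particular library's algorithm:

```python
import re

def generalize(xpaths):
    """Reduce example XPaths to their loosest common form by dropping
    positional indices that differ between the examples."""
    split = [p.strip("/").split("/") for p in xpaths]
    out = []
    for parts in zip(*split):
        if len(set(parts)) == 1:
            # Identical segment in every example: keep it as-is.
            out.append(parts[0])
        else:
            # Same tag with different indices -> drop the index;
            # genuinely different tags -> wildcard.
            tags = {re.sub(r"\[\d+\]$", "", p) for p in parts}
            out.append(tags.pop() if len(tags) == 1 else "*")
    return "/" + "/".join(out)

loose = generalize(["/html/body/div[2]/span", "/html/body/div[5]/span"])
# loose == "/html/body/div/span", which matches every example row
```

A real implementation would also handle paths of different depths and attribute predicates, but the reduction step is the core of it.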
The actual best way to scrape a site is to determine whether it populates the data you want with API calls, and replicate those. People are usually reluctant to completely change back-end code, but they will make subtle front-end changes (breaking to your scraper) all the time, for example small structural or naming changes. This has become more problematic since people have been moving to more SSR and semi-SSR injection. There can also be a problem with discovering all the pages on a site if its paging or search is poorly designed or implemented.
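A related trick for semi-SSR pages: many frameworks inject the page's data as a JSON blob for client-side hydration, and that blob is often more stable than the rendered markup around it. A hedged sketch (the `__INITIAL_STATE__` variable name is a common convention, not a universal one):

```python
import json
import re

def extract_embedded_state(html, var_name="__INITIAL_STATE__"):
    """Pull the JSON blob that semi-SSR pages inject for hydration.
    Returns the parsed object, or None if the variable isn't found."""
    pattern = r"window\." + re.escape(var_name) + r"\s*=\s*(\{.*?\});"
    m = re.search(pattern, html, re.S)
    return json.loads(m.group(1)) if m else None

page = '<script>window.__INITIAL_STATE__ = {"listings": [{"price": 19999}]};</script>'
state = extract_embedded_state(page)
```

A non-greedy regex like this is fragile against JSON containing `};` inside strings; a production version would parse more carefully, but the approach of reading the hydration payload instead of the DOM is the point.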
Some of the worst sites to scrape are large WP sites that have obviously been through a few developers. If you really want to test a scraper find some of those and they will put it to the test.
Cloudflare is another issue. Not necessarily an issue with this plugin, but because so many sites use it, you typically have to spin up multiple automated headless browsers using residential proxies for any type of large-scale scraping.
Some things that LLMs do shine at in scraping are interpreting freeform addresses, custom tables (meaning no TR/TD, just divs and CSS making it look like a table), and lists that are also just styled divs. Often there are no tags, attributes, keywords, or generalized XPath that will help you, depending on how the developer put it together.
Surprisingly, there is a pretty old library from Microsoft, of all places, called PROSE that is really good at pattern matching and prediction, if you can still find it (they keep updating it but reuse the same name for different things, and are trying to inject AI). It is small, fast, and free, and generally great at building a generalized scraper. The only drawback is that the only version I could find at the time was .NET.
This kind of stuff gets expensive fast.
For one thing, many (most? all?) large sites ban Amazon IPs from accessing their websites. This is not a problem for FetchFox.
Also, with FetchFox, you can scrape a logged-in session without exposing any sensitive information. Your login tokens/passwords are never exposed to any 3rd party proxy like they would be with cloud scraping. And if you use your own OpenAI API key, the extension developer (me) never sees any of the activity in your scraping. OpenAI does see it, however.
> And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?
FetchFox :).
But besides that, the gold standard for scraping is proxied mobile IP requests. There are services that let you make requests which appear to come from a mobile IP address. These are very hard for big sites to block, because mobile providers aggregate many customer requests together.
The downside is mainly cost. Also, the providers in this space can be semi-sketchy, depending on how they get the proxy bandwidth. Some employ spyware, or embed proxies into mobile games without user knowledge/consent. Beware what you're getting into.
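Mechanically, using one of these services is usually just pointing your HTTP client at the provider's gateway. A minimal Python sketch with the standard library; the gateway hostname and credentials here are hypothetical placeholders:

```python
import urllib.request

def mobile_opener(proxy_url):
    """Build an opener that routes all traffic through a (hypothetical)
    mobile proxy gateway, e.g. http://USER:PASS@gw.example-proxy.net:8000.
    Requests then appear to originate from a carrier's IP pool, which is
    hard to block because carriers NAT many real customers behind it."""
    handler = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,
    })
    return urllib.request.build_opener(handler)

opener = mobile_opener("http://user:pass@gw.example-proxy.net:8000")
# opener.open(url) would now route through the gateway
```

Providers typically rotate the exit IP per request or per session via parameters encoded in the proxy username; check the specific service's docs for that.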