Readit News
Posted by u/marcell 2 years ago
Show HN: I'm making an AI scraper called FetchFox (fetchfoxai.com/...)
Hi! I'm Marcell, and I'm working on FetchFox (https://fetchfoxai.com). It's a Chrome extension that lets you use AI to scrape any website for any data. I'd love to get your feedback.

Here's a quick demo showing how you can use it to scrape leads from an auto dealer directory. What's cool is that it scrapes non-uniform pages, which is quite hard to do with "traditional" scrapers: https://youtu.be/wPbyPSFsqzA

A little background: I've written lots and lots of scrapers over the last 10+ years. They're fun to write when they work, but the internet has changed in ways that make them harder to write. One change has been the increasing complexity of web pages due to SPAs and obfuscated CSS/HTML.

I started experimenting with using ChatGPT to parse pages, and it's surprisingly effective. It can take the raw text and/or HTML of a page and answer most scraping requests. And in addition to traditional scraping things like pulling out prices, it can extract subjective data, like summarizing the tone of an article.
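To make that concrete, here's a minimal sketch of the prompt-and-parse loop involved. The prompt template, field names, and helpers are illustrative, not FetchFox's actual code:

```python
# Sketch of LLM-based extraction: wrap raw page text in a prompt that
# asks for structured JSON, then parse the model's reply defensively.
import json

def build_extraction_prompt(page_text, fields):
    """Ask the LLM to pull the named fields out of raw page text."""
    schema = ", ".join(f'"{f}": ...' for f in fields)
    return (
        "Extract the following fields from the page below and reply "
        f"with JSON only: {{{schema}}}\n\n"
        f"PAGE:\n{page_text}"
    )

def parse_llm_reply(reply):
    """The LLM should reply with JSON; return None if it didn't."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        return None
```

You'd send the prompt through a chat-completion API and feed the reply to `parse_llm_reply`; a subjective field like "tone" works the same way as a concrete one like "price".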

As an example, I used FetchFox to scrape Hacker News comment threads. I asked it for the number of comments, and also for a summary of the topic and tone of the articles. Here are the results: https://fetchfoxai.com/s/cSXpBs3qBG . You can see the prompt I used for this scrape here: https://imgur.com/uBQRIYv

Right now, the tool does a "two step" scrape. It starts with an initial page (like LinkedIn) and looks for specific types of links on that page (like links to software engineer profiles). It does this using an LLM, which receives a list of links from the page and picks out the relevant ones.

Then, it queues up each link for an individual scrape. It directs Chrome to visit the pages, get the text/HTML, and then analyze it using an LLM.
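A rough sketch of the first step of that flow, assuming a hypothetical `ask_llm` helper that wraps a chat-completion call:

```python
# Step 1 of the two-step scrape: give the LLM every link on the page
# and ask which ones match the user's description. The LLM replies with
# a JSON array of indices, which we map back to URLs for step 2.
import json

def build_link_filter_prompt(links, description):
    """links is a list of (anchor_text, url) pairs scraped from the page."""
    listing = "\n".join(f"{i}: {text} -> {url}" for i, (text, url) in enumerate(links))
    return (
        "Below is a numbered list of links. Reply with a JSON array of the "
        f"numbers of the links matching: {description}\n\n{listing}"
    )

def pick_links(links, reply):
    """Map the LLM's JSON array of indices back to URLs for step 2."""
    indices = json.loads(reply)
    return [links[i][1] for i in indices]
```

Step 2 then visits each selected URL and runs the extraction prompt against its text/HTML.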

There are options for how fast/slow to do the scrape. Some sites (like HN) are friendly, and you can scrape them very fast. For example, here's me scraping Amazon with 50 tabs: https://x.com/ortutay/status/1824344168350822434 . Other sites (like LinkedIn) have strong anti-scraping measures, so it's better to use the "1 foreground tab" option. This is slower, but it gives better results on those sites.

The extension is 100% free forever if you use your OpenAI API key. It's also free "for now" with our backend server, but if that gets overloaded or too expensive we'll have to introduce a paid plan.

Last thing, you can check out the code at https://github.com/fetchfox/fetchfox . Contributions welcome :)

jackienotchan · 2 years ago
You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.

I also assume you don't check the robots.txt of websites?

I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.

related:

- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182

marcell · 2 years ago
Scraping is semi-controversial, but in this case it's just a user with a Chrome extension visiting the site. LinkedIn has lots and lots of shady patterns around showing different results to Google Bot vs. regular users to encourage logged in sessions. Many other sites like Pinterest and Twitter/X employ similar annoying patterns.

Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user generated content anyways, for example Reddit is built on UGC. Why shouldn't people be able to scrape it?

kaoD · 2 years ago
In hopes of saving someone a search: UGC = User Generated Content.
firtoz · 2 years ago
Let's say I built an extension that allows people to scrape things on demand, and the extension also sends that data to my servers, removing PII in the process. Would that be allowed?


padolsey · 2 years ago
Technically it's acting on behalf of a proactive user in Chrome so IMHO is non-"robotic". But heh tbf this was also the excuse of Perplexity where they argued they are a legitimate non-robotic-user-agent (thus don't need to respect robots.txt) because they only make requests at the time of a user query. We need a new way of understanding what it even means to be a legitimate human user-agent. The presence of AIs as client-side catalysts will only grow.
silvanocerza · 2 years ago
Scraping is not illegal. Note that this decision predates the AI craze.

https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin...

aguaviva · 2 years ago
The parent didn't say the scraping was "illegal", but that it violated ToS.

These are entirely different things. The upshot of the proceedings is that while the courts ruled there weren't sufficient grounds for an injunction to stop the scraping, it was nonetheless still injurious to the plaintiff and had breached their User Agreement -- thus allowing LinkedIn to compel hiQ towards a settlement.

From Wikipedia:

   The 9th Circuit ruled that hiQ had the right to do web scraping.[1][2][3] However, the Supreme Court, based on its Van Buren v. United States decision,[4] vacated the decision and remanded the case for further review in June 2021. In a second ruling in April 2022 the Ninth Circuit affirmed its decision.[5][6] In November 2022 the U.S. District Court for the Northern District of California ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.[7]

robofanatic · 2 years ago
I see scraping as equivalent to a cherry tree shaking machine :-) If you are authorized to pick cherries from a tree, then why not use a tree shaker and do the job in seconds? But yeah, make sure you don't kill the tree in the process. Also, the tree owner must have the right to deny you use of the tree shaker machine.

https://www.youtube.com/watch?v=miBk0lyMBC0

siamese_puff · 2 years ago
Hackers have gotten so boring these days. A fellow hacker builds a fun tool and we first gravitate toward the legal implications?
churros_train · 2 years ago
I am really curious how people actually evaluate scrapers. There are so many options, and I am dizzy just trying to read about them...

Also wondering how the OP thinks about comparing themselves and standing out in a marketplace of seemingly a bazillion options.

marcell · 2 years ago
Try it out and let me know if you like it :). If there's a bug I'll fix it!

More specifically, FetchFox is targeting a specific niche of scraping. It focuses on small scale scraping, like dozens or a few hundred pages. This is partly because, as a Chrome extension, it can only scrape what the user's internet connection can support. You can't scrape thousands or millions of pages on a residential connection.

But a separate reason is that I think LLMs open up a new market and use case for scraping. FetchFox lets anyone scrape without coding knowledge. Imagine you're doing a research project and want data from 100 websites. FetchFox makes that easy, whereas with traditional scraping you would have needed coding knowledge to scrape those sites.

As an example, I used FetchFox to research political bias in the media. I was able to get data from hundreds of articles without writing a line of code: https://ortutay.substack.com/p/analyzing-media-bias-with-ai . I think this tool could be used by many non-technical people in the same way.

Malidir · 2 years ago
Why would these non-technical people even use a tool when they can go to an internet-connected LLM and say "go to this site and get this info"?

e.g. Mr John Smith is a journalist; find his ten most recent articles by locating his personal website, news sites, and social media.

So I'm wondering if your tool will be obsolete in a year's time?

churros_train · 2 years ago
Ah, that's really interesting! How do you evaluate large-scale cloud scraping services, since their operations are entirely hidden from you?

Personally I am looking into options in this area. Are you planning to offer a cloud-based version of this at some point, and if not, could you tell me which existing ones are good?


mhuffman · 2 years ago
>I am really curious how do people actually evaluate scrapers?

It is not easy to evaluate scrapers unless you have had to deal with lots of poorly written websites in your life. Just using it on a few highly structured well-maintained sites can be impressive but if you are using it to acquire data from many websites or large websites things get hairy fast.

Most scrapers today are some combination of extracting xpaths and reducing them to the loosest common form, parsing semantic (or easy to identify, like links) or highly structured content that has discoverable patterns, and LLMs.
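As an illustration of the xpath-reduction idea (my own toy version, not from any particular tool): given example xpaths to a few items you want, drop the positional indices that vary between them, leaving the loosest path that matches all of them.

```python
# Collapse positional indices that differ across example xpaths,
# e.g. /html/body/div[1]/a and /html/body/div[2]/a -> /html/body/div/a
import re

def generalize_xpaths(xpaths):
    """Return the loosest common xpath, or None if the examples
    are structurally different."""
    split = [re.split(r"(\[\d+\])", xp) for xp in xpaths]
    if len({len(s) for s in split}) != 1:
        return None  # structurally different paths; give up
    out = []
    for parts in zip(*split):
        # Keep segments shared by all examples; drop divergent indices.
        out.append(parts[0] if len(set(parts)) == 1 else "")
    return "".join(out)
```

Real implementations also generalize over attributes and classes, but the index case above is the most common one.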

The actual best way to scrape a site is to determine whether it populates the data you want with API calls, and replicate those. People are usually reluctant to completely change back-end code, but will make subtle changes (breaking for your scraper) to front-end code all the time -- for example, small structural or naming changes. This has become more problematic as people move to more SSR and semi-SSR injection. There can also be a problem with discovering all the pages on a site if it has poorly designed or implemented paging or search.

Some of the worst sites to scrape are large WP sites that have obviously been through a few developers. If you really want to test a scraper find some of those and they will put it to the test.

Cloudflare is another issue. Not necessarily an issue with this plugin, but because so many sites use it, you typically have to spin up multiple automated headless browsers using residential proxies for any type of large-scale scraping.

Some things that LLMs do shine at for scraping are interpreting freeform addresses, custom tables (meaning no TR/TD, just divs and CSS to make it look like a table), and lists that are also just styled divs. Often there are no tags, attributes, keywords, or generalized xpath that will help you, depending on how the developer put it together.
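A sketch of the freeform-address case, with the LLM call injected as a callable so it can be mocked (the prompt wording and key names are made up for illustration):

```python
# Normalize a freeform address by asking an LLM for structured JSON.
# `ask_llm` is any callable that sends a prompt and returns the text reply.
import json

ADDRESS_PROMPT = (
    "Normalize this freeform address into JSON with keys "
    '"street", "city", "region", "postal_code". Use null for missing '
    "parts. Reply with JSON only.\n\nADDRESS: {raw}"
)

def normalize_address(raw, ask_llm):
    reply = ask_llm(ADDRESS_PROMPT.format(raw=raw))
    return json.loads(reply)
```

The same pattern works for div-based "tables": dump the rendered text and ask for an array of row objects.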

Surprisingly, there is a pretty old library from Microsoft, of all places, called PROSE, if you can still find it (they keep using the same name for different things as they update it and try to inject AI). It is really good at pattern matching and prediction, and it's small, fast, free, and generally great for building a generalized scraper. The only drawback is that the only version I could find at the time was .NET.

smcleod · 2 years ago
Out of interest - why is it called FetchFox - but it doesn't work on Firefox?
marcell · 2 years ago
Firefox version coming soon!
Malidir · 2 years ago
Indeed!
konata390 · 2 years ago
A bit off-topic, but why do people still use the GIF format? The "example-hn.gif" is 8.5MB, for 45 seconds of pretty stuttery video. I converted it to a similar looking VP9 video, and it was only 1.5MB, and with AV1 I got it down to 550KB with basically lossless quality.
stevenicr · 2 years ago
From what I have noticed, many of the "video instead of gif" sites/apps disable saving (right click > save, long press > save), so many users prefer and use GIFs because they are easy to save and share, while many times saving the video version is impossible.
CalRobert · 2 years ago
Of all the names to give something you built for chrome instead of Firefox…
rchaud · 2 years ago
Fitting considering it uses "Open"AI for the heavy lifting.
marcell · 2 years ago
I know! I used ChatGPT to get ideas for the name + logo, and didn't realize the issue until it was too late
netsharc · 2 years ago
Too late... how?
thegabriele · 2 years ago
It's not too late...
bearjaws · 2 years ago
This is really cool, I'd just go ahead and double check that max spend limit on your OAI key before going to bed :)

This kind of stuff gets expensive fast.

marcell · 2 years ago
Thanks for the kind words! I have a spend limit in place :)
benrules2 · 2 years ago
This is a really cool tool. Have been playing with similar scraping capabilities, so appreciate you sharing the source code as well. People who are saying "loads of scraping tools already exist" have likely not suffered through the current state of the art too, as heuristic based approaches absolutely pale in comparison to what an LLM can extract.
trog · 2 years ago
Would love something like this that allows users to trivially turn sites like Facebook/Twitter into RSS feeds. I'm sure this kinda thing is a useful stepping stone to doing that.
churros_train · 2 years ago
My impression is that Facebook and Twitter have really strong anti-scraping measures. Is that wrong? And are there any reliable scraping services that can actually scrape those large companies' sites at a reasonable cost?
marcell · 2 years ago
One thing to note about FetchFox: it runs as a Chrome extension. This means it has a different interaction with anti-scraping measures than cloud based tools.

For one thing, many (most? all?) large sites ban Amazon IPs from accessing their websites. This is not a problem for FetchFox.

Also, with FetchFox, you can scrape a logged in session without exposing any sensitive information. Your login tokens/passwords are never exposed to any 3rd party proxy like they would be with cloud scraping. And if you use your own OpenAI API key, the extension developer (me) never sees any of the activity in your scraping. OpenAI does see it, however.

> And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?

FetchFox :).

But besides that, the gold standard for scraping is proxied mobile IP requests. There are services that let you make requests which appear to come from a mobile IP address. These are very hard for big sites to block, because mobile providers aggregate many customer requests together.

The downside is mainly cost. Also, the providers in this space can be semi-sketchy, depending on how they get the proxy bandwidth. Some employ spyware, or embed proxies into mobile games without user knowledge/consent. Beware what you're getting into.
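Mechanically, a proxied request is just a credentialed proxy URL handed to your HTTP client. Here's a sketch (the provider hostname, port, and credentials are made up; real values come from whichever proxy service you sign up with):

```python
# Build a requests-style proxy mapping for an authenticated HTTP proxy.
# With a mobile/residential provider, the exit IP belongs to a carrier
# or ISP, which is what makes these requests hard to block.

def build_proxies(user, password, host, port):
    url = f"http://{user}:{password}@{host}:{port}"
    return {"http": url, "https": url}

# Usage with the third-party `requests` library (not imported here):
# resp = requests.get("https://example.com",
#                     proxies=build_proxies("user", "pw", "proxy.example.net", 8000),
#                     timeout=30)
```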

marcell · 2 years ago
I had another request for the exact same thing, actually. I'm planning to separate out the scraping library from the Chrome extension, and this project would be a good use case for that library.
