Readit News
Posted by u/sachou a year ago
Show HN: DataFuel.dev – Turn websites into LLM-ready data (datafuel.dev/...)
Just launched DataFuel.dev on Product Hunt last Sunday, and I landed in the top 3!

I built this API after working on an AI chatbot builder.

Scraping can be a pain, but we need clean markdown data for fine-tuning or doing RAG with new LLM models.

DataFuel API helps you transform websites into LLM-ready data. I've already got my first paying users.

Would love your feedback to improve my product and my marketing!

jackienotchan · a year ago
I'm noticing a big increase in crawling activity on the sites I manage, likely from bots collecting data for LLMs. Most of them don't use proper user agents and of course don't stick to any scraping best practices that the industry has developed over the past two decades.

This trend is creating a lot of headaches for developers responsible for maintaining heavily scraped sites.

related:

- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182

SalmonSnarker · a year ago
Projects like this should recognize that if a site's robots.txt contains a long list of Disallow entries for other AI scrapers, they are probably not welcome to scrape either.

(of course this project doesn't do that)
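That robots.txt check is cheap to implement. A minimal sketch using Python's stdlib `urllib.robotparser` (the list of AI user agents is illustrative, not exhaustive):

```python
from urllib.robotparser import RobotFileParser

# Illustrative list of known AI-crawler user agents to check against.
AI_AGENTS = ["GPTBot", "CCBot", "anthropic-ai", "Google-Extended"]

def ai_scrapers_disallowed(robots_txt: str, path: str = "/") -> bool:
    """Return True if any known AI agent is blocked from `path`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return any(not rp.can_fetch(agent, path) for agent in AI_AGENTS)

robots = """
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(ai_scrapers_disallowed(robots))  # GPTBot is blocked -> True
```

A scraper could call this before crawling and treat a True result as "probably not welcome", even for its own (different) user agent.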

sachou · a year ago
Good point! Thanks for the feedback.

Let me add that to my todos.

keyle · a year ago
It boggles my mind that you would launch without that as a prime directive.
emmanueloga_ · a year ago
I thought this might be interesting to share and potentially useful for the author of Datafuel as a comparison. I recently built something similar for a small app [1].

I use Bun.js's fetch to crawl pages, process them with Mozilla's Readability (via JSDOM), and convert the cleaned content to Markdown using Turndown. I also strip href attributes from links since they're unnecessary for my use case, and I don't recurse into links. My implementation is basic, with minimal error handling and pretty dumb content trimming to stay within the prompt token limit, which could use improvement! I also found a Python library that seems a lot fancier than what I need, but also a lot more powerful [2].
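For comparison, here is a rough Python-stdlib analogue of that extract-and-convert step, far cruder than Readability + Turndown, with link targets flattened away as described above:

```python
from html.parser import HTMLParser

class DumbMarkdown(HTMLParser):
    """Very naive HTML -> Markdown: headings and paragraphs only.
    Link text is kept but href targets are dropped."""
    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._flush()
            self._prefix = "#" * int(tag[1]) + " "
        elif tag in ("p", "div"):
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p", "div"):
            self._flush()

    def handle_data(self, data):
        if data.strip():
            self._buf.append(data.strip())

    def _flush(self):
        # Emit the accumulated inline text as one Markdown block.
        if self._buf:
            self.blocks.append(self._prefix + " ".join(self._buf))
        self._buf = []
        self._prefix = ""

def to_markdown(html: str) -> str:
    parser = DumbMarkdown()
    parser.feed(html)
    parser._flush()
    return "\n\n".join(parser.blocks)

result = to_markdown("<h1>Title</h1><p>Hello <a href='/x'>world</a>.</p>")
print(result)  # Markdown heading plus paragraph text, link href gone
```

The real pipeline adds boilerplate removal (Readability) and proper Markdown rules (Turndown); this only shows the shape of the transform.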

I'm curious where a solution like DataFuel excels, especially since it already has customers. Off the top of my head, the real complexity in scraping appears when you process a sizable number of URLs regularly, and it becomes more of a background-processing / scheduling problem.

I feel like something like DataFuel could see wider adoption if it were nicely packaged as a library for crawling locally, and then, if you find yourself crawling regularly and want to delegate the scheduling of those crawls, you could buy into the service: "ping me back when these 10_000 URLs are done crawling", or something like that.

--

1: https://github.com/EmmanuelOga/plangs2/blob/main/packages/ai...

2: https://github.com/adbar/trafilatura

sachou · a year ago
Yes, exactly.

The main issues in scraping:

- If you scrape a lot, you will be blocked based on your IP, so you need to use proxies.
- Scraping an entire website needs specific logic, retries, and more.
- It becomes a heavy background job.

All of the above takes time, so if scraping is not a core feature of your business, it's likely better to outsource it.
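The retry and proxy points above typically combine into one loop: exponential backoff around the fetch, rotating the proxy pool to dodge per-IP blocks. A minimal sketch (the proxy names and the injected fetch function are placeholders, not DataFuel's internals):

```python
import random
import time

# Hypothetical proxy pool; in practice these come from a proxy provider.
PROXIES = ["proxy-a:8080", "proxy-b:8080", "proxy-c:8080"]

def fetch_with_retries(url, fetch, max_attempts=4, base_delay=1.0):
    """Retry `fetch(url, proxy)` with exponential backoff and jitter,
    rotating through the proxy pool on each attempt."""
    for attempt in range(max_attempts):
        proxy = PROXIES[attempt % len(PROXIES)]
        try:
            return fetch(url, proxy)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Backoff doubles each attempt, with a little jitter.
            time.sleep(base_delay * 2 ** attempt * (1 + random.random() * 0.1))

# Simulate a target that blocks the first two IPs.
calls = []
def flaky(url, proxy):
    calls.append(proxy)
    if len(calls) < 3:
        raise ConnectionError("blocked")
    return "<html>ok</html>"

print(fetch_with_retries("https://example.com", flaky, base_delay=0.01))
```

At scale this loop moves into a background job queue, which is the "heavy background job" part of the list above.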

Good job doing it tho!

emmanueloga_ · a year ago
Some ideas:

Highlight the advantages of your service over DIY solutions prominently on your marketing site. The site looks great, but I think it could focus more on convincing developers to adopt your product rather than just listing features.

Consider reaching out to clients to quantify the time saved using your service. Emphasize how it eliminates the hassle of setting up custom background job processes, proxies, and other complexities that can snowball into a full-fledged project.

Good luck on your journey!

zerop · a year ago
Great, congrats on your launch.

1. Does it take care of bot detection? Most sites will have it.

2. Is this something similar to Firecrawl - https://www.firecrawl.dev/

sachou · a year ago
Yes, it has an extensive proxy IP and retry system in place to bypass bot detection.

I’m also trying to gather more feedback to identify the killer feature:

- Adding vectorization to Pinecone out of the box?
- Adding multiple integrations like n8n, etc.?
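Out-of-the-box vectorization implies chunking each page before embedding. A minimal fixed-size chunking sketch with overlap (the sizes are illustrative, not DataFuel's actual parameters):

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.
    Overlap keeps sentences that straddle a boundary in both chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "word " * 100  # 500 characters of dummy content
chunks = chunk_text(doc)
print(len(chunks), len(chunks[0]))  # -> 3 200
```

Real pipelines usually chunk on token counts and sentence boundaries instead of raw characters, but the windowing idea is the same; each chunk then gets embedded and upserted into the vector store.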

Any crucial pain points to avoid?

gregoryl · a year ago
Are you concerned about making a product that does this? The legal aspect of accessing a computer system that is intending to block your use seems worrisome.
Its_Padar · a year ago
> Please make it easy for users to try your thing out, preferably without having to sign up, get a confirmation email, and other such barriers. You'll get more feedback that way, plus HN users get ornery if you make them jump through hoops. https://news.ycombinator.com/item?id=22336638

> Off topic: blog posts, sign-up pages, newsletters, lists, and other reading material. Those can't be tried out, so can't be Show HNs. Make a regular submission instead. https://news.ycombinator.com/showhn.html

This looks like quite an interesting project, but Show HNs need to be usable without sign-up pages.

sachou · a year ago
Noted, thank you for the nice reminder. Good point; I'll add more free tools and an open playground.
olup · a year ago
I am interested, but why should I use this one over Jina AI Reader (which is also free), or Firecrawl, or the ten other puppeteer + readability + turndown pipelines (or even an AWS Lambda doing the same)? This is not sarcastic; I am genuinely looking for something fresh in the field.
sachou · a year ago
Do you need to embed it directly in Pinecone?

If yes then DataFuel is the right choice. Adding this feature as we speak.

Please let me know :)

olup · a year ago
Interesting but we process documents before embedding them, and have specific requirements for the embedder.

Having developed a couple of page-to-markdown tools myself, I think the bigger challenge is making sense of the many pages that rely on spatial organisation of information that only makes sense to humans, or even on the presence of images. One way to do it is to render the page as an image and extract data with a vision LLM. But you do need a heuristic for when to do classic extraction and when to use vision, plus you have to get rid of cookie banners and overlays. This is more complex and costly, but it has real business value for the one who can pull it off.
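One cheap heuristic for that classic-vs-vision routing decision is text density: if the DOM yields little visible text relative to its markup, fall back to rendering and a vision model. A sketch (the 0.1 threshold is a made-up number, and real code would use a proper DOM parser rather than a regex):

```python
import re

def needs_vision(html: str, threshold: float = 0.1) -> bool:
    """Route pages to a vision model when visible text is sparse
    relative to markup size (canvas-heavy pages, charts, image grids)."""
    if not html:
        return True
    visible = re.sub(r"<[^>]+>", " ", html)          # strip tags
    visible = re.sub(r"\s+", " ", visible).strip()   # collapse whitespace
    return len(visible) / len(html) < threshold

print(needs_vision("<p>A long, perfectly ordinary article paragraph.</p>"))  # -> False
print(needs_vision("<div class='chart'><canvas id='c' width='800' height='600'></canvas></div>"))  # -> True
```

Cookie banners and overlays would still need separate handling (e.g. dismissing known selectors before the density check), which is part of why this gets costly.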

benatkin · a year ago
> Scraping can be a pain, but we need clean markdown data for fine-tuning or doing RAG with new LLM models.

And normally it's still a pain even if you sign up for a scraping service, and I don't see how this will be different.

aitchnyu · a year ago
Will this benefit sites or internal wikis that have well-written content, good search, and SEO? I interviewed at a few companies where managers apparently use AI as an excuse to implement text search.
sachou · a year ago
I guess so; if your goal is to have people know about your content, you might see a small SEO bump.