Just launched DataFuel.dev on Product Hunt last Sunday, and I landed in the top 3!
I built this API after working on an AI chatbot builder.
Scraping can be a pain, but we need clean markdown data for fine-tuning or doing RAG with new LLMs.
DataFuel API helps you transform websites into LLM-ready data. I've already got my first paying users.
Would love your feedback to improve my product and my marketing!
This trend is creating a lot of headaches for developers responsible for maintaining heavily scraped sites.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" - https://news.ycombinator.com/item?id=40750182
(of course this project doesn't do that)
Let me add that to my todos.
I use Bun.js's fetch to crawl pages, process them with Mozilla’s Readability (via JSDOM), and convert the cleaned content to Markdown using Turndown. I also strip href attributes from links since they’re unnecessary for my use case, and I don't recurse into links. My implementation is basic, with minimal error handling and pretty dumb content trimming to stay within the prompt token limit, which could use improvement! I also found this Python library that seems a lot fancier than what I need, but also a lot more powerful [2].
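To make the last two steps concrete, here is a toy sketch of the href-stripping and dumb token trimming I mean. It assumes Readability + Turndown have already produced a `markdown` string (those steps are omitted here), and the 4-characters-per-token ratio is just a rough guess, not a real tokenizer:

```javascript
// Remove markdown link targets, keeping only the link text,
// since hrefs aren't needed for fine-tuning / RAG data.
function stripLinks(markdown) {
  return markdown.replace(/\[([^\]]*)\]\([^)]*\)/g, "$1");
}

// Very rough trim: assume ~4 characters per token and cut on a
// paragraph boundary to stay under the prompt token limit.
function trimToTokenBudget(markdown, maxTokens) {
  const maxChars = maxTokens * 4;
  if (markdown.length <= maxChars) return markdown;
  const cut = markdown.lastIndexOf("\n\n", maxChars);
  return markdown.slice(0, cut > 0 ? cut : maxChars);
}

const doc = stripLinks("See [the docs](https://example.com) for details.");
// doc === "See the docs for details."
```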
I’m curious where a solution like Datafuel excels, especially since it already has customers. Off the top of my head, the real complexity in scraping appears when you process a sizable number of URLs regularly, and it becomes more of a background processing / scheduling problem.
I feel like something like Datafuel could see more adoption if it were nicely packaged as a library to crawl locally; then, if you find yourself crawling regularly and want to delegate the scheduling of those crawls, you could buy into the service: "ping me back when these 10_000 URLs are done crawling", or something like that.
--
1: https://github.com/EmmanuelOga/plangs2/blob/main/packages/ai...
2: https://github.com/adbar/trafilatura
The main issues in scraping:
- If you scrape a lot, you will be blocked based on your IP; you need to use proxies
- Scraping an entire website needs specific logic, retries, and more
- It becomes a heavy background job
All of the above takes time, so if scraping is not a core feature of your business, it's likely better to outsource it.
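The retry-with-proxy-rotation part can be sketched as a generic wrapper. This is a toy sketch, not any real service's code; `fetchViaProxy` is a hypothetical placeholder for however you route a request through a proxy:

```javascript
// Toy sketch: retry a fetch with exponential backoff, rotating through
// a pool of proxies on each attempt. `fetchViaProxy(url, proxy)` is a
// hypothetical helper that performs the request through the given proxy.
async function fetchWithRetry(url, proxies, maxAttempts = 3) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const proxy = proxies[attempt % proxies.length]; // rotate proxies
    try {
      return await fetchViaProxy(url, proxy);
    } catch (err) {
      lastError = err;
      // exponential backoff before the next attempt: 1s, 2s, 4s, ...
      await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
    }
  }
  throw lastError;
}
```

In practice you'd also want per-domain rate limits and a job queue once this runs as a background job, which is exactly where it snowballs.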
Good job doing it tho!
Highlight the advantages of your service over DIY solutions prominently on your marketing site. The site looks great, but I think it could do more to convince developers to adopt your product rather than just listing features.
Consider reaching out to clients to quantify the time saved using your service. Emphasize how it eliminates the hassle of setting up custom background job processes, proxies, and other complexities that can snowball into a full-fledged project.
Good luck on your journey!
1. Does it take care of bot detection? Most sites will have it.
2. Is this something similar to Firecrawl - https://www.firecrawl.dev/
I’m also trying to gather more feedback to identify the killer feature:
- Adding vectorization to Pinecone out of the box?
- Adding multiple integrations like n8n, etc.?
Any crucial pain points to avoid?
> Off topic: blog posts, sign-up pages, newsletters, lists, and other reading material. Those can't be tried out, so can't be Show HNs. Make a regular submission instead. https://news.ycombinator.com/showhn.html
This looks like quite an interesting project, but Show HNs need to be usable without sign-up pages.
If yes, then DataFuel is the right choice. Adding this feature as we speak.
Please let me know :)
Having developed a couple of page-to-markdown converters myself, I think the bigger challenge is making sense of the many pages that rely on spatial organization of information that only makes sense to humans, or even on the presence of images. One way to do it is to render the page as an image and extract data with a vision LLM. But you do need a heuristic for when to do classic extraction and when to use vision, plus you need to get rid of cookie banners and overlays. This is more complex and costly, but it has real business value for those who can pull it off.
And normally it's still a pain even if you sign up for a scraping service, and I don't see how this will be different.