For scraping, how do you handle Cloudflare and Captchas? Do you respect robots.txt instructions of websites?
Aggressive scraping comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and as I've experienced myself).
Does Webhound respect robots.txt directives and do you disclose the identity of your crawlers via user-agent header?
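For context, respecting robots.txt and disclosing a crawler identity is cheap to implement. A minimal sketch using Python's standard-library `urllib.robotparser` (the bot name and URL here are hypothetical placeholders, not Webhound's actual user-agent):

```python
# Sketch: check robots.txt before fetching, and disclose the crawler's
# identity via the User-Agent header. "ExampleBot" is a made-up name.
from urllib import robotparser
from urllib.parse import urlparse
import urllib.request

USER_AGENT = "ExampleBot/1.0 (+https://example.com/bot-info)"  # hypothetical

def allowed(url: str) -> bool:
    """Return True if robots.txt on the target host permits this fetch."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(url: str) -> bytes:
    """Fetch a URL only if allowed, always sending an identifying UA."""
    if not allowed(url):
        raise PermissionError(f"robots.txt disallows fetching {url}")
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

This doesn't get past Cloudflare or captchas, of course, which is exactly why the original question matters: those defenses exist because many crawlers skip this step.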
- BrowserUse - Founded 2024
- Greptile - Founded 2023
The third quote is from a VC who has never founded a startup himself and has a clear interest in pushing founders to trade work-life balance for his own quick returns.
So none of these people has worked on any one thing for longer than two years. I wonder what will happen if we check back in 5–10 years. Will they still be doing and promoting 996, or will they have burned out and changed their minds? Place your bets.
YC seems to fund quite a few document extraction companies, even within the same batch:
- Pulse (YC W24): https://www.ycombinator.com/companies/pulse-3
- OmniAI (YC W24): https://www.ycombinator.com/companies/omniai
- Extend (YC W23): https://www.ycombinator.com/companies/extend
How do you differentiate from these? And how do you see the space evolving as LLMs commoditize PDF extraction?
Do you have any built-in features that address these issues?
[0] https://www.ycombinator.com/launches/Lbx-simplex-on-demand-p...
The record/replay feature is definitely an interesting direction. The browser automation space is getting super crowded though (even within YC), so I'm curious how you differentiate from:
- BrowserUse
- Browserbase
- BrowserBook
- Skyvern