The data I would be scraping are images and their associated descriptions. I would store and display only thumbnail images; without an image, the description would be fairly worthless. For each image/item, a link would lead directly to the original website.
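To be concrete, the thumbnailing step could be as simple as this rough Python sketch (assuming Pillow and requests; the size, paths, and function name are placeholders, not settled design):

    # Fetch a source image but persist only a small thumbnail of it.
    from io import BytesIO

    import requests
    from PIL import Image

    THUMB_SIZE = (150, 150)  # assumed display size

    def store_thumbnail(image_url, dest_path):
        resp = requests.get(image_url, timeout=10)
        resp.raise_for_status()
        img = Image.open(BytesIO(resp.content))
        img.thumbnail(THUMB_SIZE)  # shrinks in place, keeping aspect ratio
        img.convert("RGB").save(dest_path, "JPEG")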
One business model I am considering, and the most obvious, is a subscription-based web app.
While at PyCon last month I showed a few people a prototype. One person, an employee at Google, said, "Be careful." He was alluding to potential copyright and legal issues. "But," I said, "I'm not really doing anything different than Google." He countered, "Google has lots of lawyers." Ahhhh, message heard loud and clear!
I understand copyright and fair use [0] in general. But I don't want to be writing letters to the owners of the original content arguing the point, let alone wind up in court. What advice or experiences can you share that might be helpful?
[0] http://en.wikipedia.org/wiki/Fair_use
Instead of scraping, index. Indexing connotes supplementation, in the sense of adding value to what is already there. Have the thumbnails, excerpts of the descriptions, and whatever secret sauce you've not mentioned add value to the owners' data. Provide traffic or some other measurable benefit to them.
Don't rely merely on fair use (or on weak interpretations of the doctrine). Provide value to the data owners, and be ready to respect their wishes if they choose not to accept the value proposition you offer.
Unless you're going against obvious warnings on each site, scrape first, make it free, and ask questions later. If you're successful quickly enough, you will be a force in your own right, and your marketplace will be one where everyone wants to remain listed. Speed and adoption win; stay under the radar as long as you can. You want people to love your product so it doesn't get pulled, and so that if it does, they want it back. When you get notified, respond immediately.

Very important: PROFIT LATER. Once you are taking payments, some could say you are making money off their data, and they'll want a piece of that money. If it's a free service, there are fewer feathers to ruffle and you're less of a target. A cease and desist will stop you from pulling THEIR data. Getting sued for the money you brought in will ultimately stop you from pulling ANY data.
>a link would lead directly to the original website.
Track this heavily; it is the value you are adding for the data providers you are scraping from. If they see business growth coming from your space, they'll support you. Get allies early.
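A logging redirect on every outbound link is enough to get those numbers. Here's a rough sketch, assuming Flask; the /out route, log file, and parameter name are all illustrative:

    import logging

    from flask import Flask, redirect, request

    app = Flask(__name__)
    logging.basicConfig(filename="outbound.log", level=logging.INFO)

    @app.route("/out")
    def outbound():
        # Record the click-through, then send the visitor to the source site.
        target = request.args.get("url")
        if not target:
            return "missing url", 400
        logging.info("click-through to %s from %s", target, request.remote_addr)
        return redirect(target)

In practice you'd also validate the target against your own index so the route can't be abused as an open redirect; the per-site click counts are what you show the providers.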
The innocent 'I wanted to build a tool to reduce headaches and help the community' is the best defense here. (So don't post anything online stating otherwise...) Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.
Good Luck!
>The innocent 'I wanted to build a tool to reduce headaches and help the community' is the best defense here.
Seems like the best defense as well as the truth. I hope they would see it that way.
>Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.
That's part of my dilemma. It would be hard to get some sort of approval otherwise.
Our SiteTruth system does some web scraping, looking mostly for the name and address of the business behind a web site. We're open about this: we use a user-agent string of "Sitetruth.com site rating system", list it on botsvsbrowsers.com, and document what we do on our web site. We've had one complaint in five years, and that was because someone had a security system which thought our system's behavior resembled a known attack.
About once a month, we check back with each site to see if their name or address changed. We look at no more than 20 pages per site (if we haven't found the business address looking in the obvious places, a human wouldn't have either). So the traffic is very low. Most scraper-oriented sites hit sites a lot harder than that, enough to be annoying.
We've seen some scraper blocking. We launch up to three HTTP requests to the same site in parallel. A few sites used to refuse to respond if they received more than three HTTP requests in 10 seconds. That seems to have stopped, though; with some major browsers now using look-ahead fetching, that's become normal browser behavior. More sites are using robots.txt to block all robots other than Google, but it's under 1% of the several million web sites we examine. We're not seeing problems from using our own user-agent string.
So I'd suggest: 1) obey robots.txt, 2) use your own user-agent string that clearly identifies you, and 3) don't hit sites very often. As for what you do with the data, you need to talk to a lawyer and read Feist v. Rural Telephone.
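All three suggestions fit in a few lines of Python standard library. This is a rough sketch, not our production code; the user-agent string, delay, and helper names are placeholders:

    import time
    import urllib.robotparser
    import urllib.request
    from urllib.parse import urlsplit

    USER_AGENT = "examplebot/1.0 (+http://example.com/bot)"  # hypothetical
    CRAWL_DELAY = 5  # seconds between hits to the same site

    def allowed(url):
        # Suggestion 1: check robots.txt before fetching anything.
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def fetch(url):
        # Suggestion 2: identify yourself with your own user-agent string.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        return urllib.request.urlopen(req, timeout=10).read()

    def crawl_site(urls):
        for url in urls:
            if allowed(url):
                yield url, fetch(url)
            time.sleep(CRAWL_DELAY)  # suggestion 3: keep traffic low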
I ran into several people who wrote cease and desists, which I honored, and several others who started banning our IP addresses, disallowing us specifically via robots.txt, and so on. There are obviously ways to get around these measures, but the main question is: morally, would you want to? Are you willing to go against website owners who flat out don't want you scraping their data? Would you be willing to fight them legally for your right to do so?
Ultimately, that's what it came down to for me; I just felt really crappy about it and stopped.
Personally, I feel that inclusion in Google constitutes public access to the data. As long as I'm not logged into an account on their system, I feel ethically justified about scraping their data.
In other words, I do not feel compelled to respect robots.txt if that file does not also block googlebot.
Legally it may be another issue, but ethically I consider inclusion in Google as an announcement that this information is public.
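That rule is easy to mechanize. A rough Python sketch (the "mybot" token is hypothetical) that honors an exclusion only when it also applies to Googlebot:

    import urllib.robotparser
    from urllib.parse import urlsplit

    MY_AGENT = "mybot"  # hypothetical user-agent token

    def should_skip(url):
        # Skip a URL only if robots.txt blocks both us and Googlebot.
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        blocked_for_me = not rp.can_fetch(MY_AGENT, url)
        blocked_for_google = not rp.can_fetch("Googlebot", url)
        return blocked_for_me and blocked_for_google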
In the early stages you probably won't be robots.txt'd because you're insignificant.
In later stages, you're hoping to not be robots.txt'd because you're providing a worthwhile service not just for users but for the site.
At neither stage should you force companies that don't want you indexing their content to go beyond basic means (robots.txt), because the more serious measures all cost them more money: tracking and blocking your IPs, C&Ds, DMCA requests to your provider asking that the entire site be taken down because there are thousands of infringing items, lawsuits seeking damages, court costs, or costs for dealing with your circumvention of technical measures meant to keep you out of the site, finding a friendly prosecutor, etc.
You don't want to go down that more expensive road.
If the product is for competitive analysis or price-comparison purposes, which is the only conclusion I can draw from that sentence (why else would you scrape your peers?)... then Market Leader A is highly incentivized to try to shut down any provider that feeds their content in an actionable way to their smaller competitors B and C. Even if A could theoretically benefit from B and C's information just as much, B and C have more to gain than A does, and that's dangerous to A. And A does have an argument that their proprietary content is not being used under fair use. It might not stand up in court, but their legal department can still make your life a living hell, and if they deem the threat large enough, they probably have enough resources to bleed you dry without breaking a sweat.
Perhaps the potential upside of addressing this market is worth the legal risk. I am not a lawyer. But as soon as you get reasonably big, you'll paint a target on your back.
Hire a developer for next to nothing per hour in the Philippines, India, or China. Get them to build a quick-and-dirty scraping tool focused on a specific industry. Then try to flog it to slightly shady businesses. Stay under the radar for as long as you can and make as much money as you can while you're there. Sooner or later you'll get busted and shut down; no big loss to you.
The people I've seen do this typically have a dozen or so such "startups" going at any one time, and they just keep shutting one down to start another.
This is not the sort of startup that will get you the fame and respect of the tech startup world. But it can certainly make you money if you have the "right" mindset. Just don't bet the farm on it.
1) Build an MVP prototype with the scraped data. Don't worry about the business model yet. VERY IMPORTANT: make sure you are allowed to scrape the data in the first place. Work out an agreement that covers your interest in the data, but don't give away your methods.
2) Pitch the idea FIRST AND ONLY to the data owners. Show them how useful their data could be. They may want to invest in YOU to build it out. If the data owners are hard to approach, reach out to mentors who have network connections.
3) The fallback and last resort is to build up your own data. This will be tough and tricky; you might have to build your own search engine (or a similar data-feeding app). But at least you own the data.
In conclusion: content ownership is king in the online media world. Make sure you follow the appropriate channels. Talk to the data owners about your interest in their data. Get agreements in place for access without giving away proprietary methods.
Google honors robots.txt, but few site owners use it against Google because of the cost of being delisted. By contrast, the cost of delisting from your specialized search engine is low, so you might see some of your content dry up.
In the U.S., at least, you do not have the legal right to connect to a site if the owner has requested that you stop -- see eBay v. Bidder's Edge. Fair use has nothing to do with that point: fair use deals with what use you can make of the information once you obtain it, not with any right to obtain it in the first place.
Talking to a lawyer is always good advice.