The data I would be scraping are images and their associated descriptions. I would store and display only thumbnail images; without an image, the description would be fairly worthless. For each image/item, a link would lead directly to the original website.
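To be concrete, the thumbnailing step could be as simple as this rough Python sketch (assuming Pillow and requests; the size, paths, and function name are placeholders, not settled design):

    # Fetch a source image but persist only a small thumbnail of it.
    from io import BytesIO

    import requests
    from PIL import Image

    THUMB_SIZE = (150, 150)  # assumed display size

    def store_thumbnail(image_url, dest_path):
        resp = requests.get(image_url, timeout=10)
        resp.raise_for_status()
        img = Image.open(BytesIO(resp.content))
        img.thumbnail(THUMB_SIZE)  # shrinks in place, keeping aspect ratio
        img.convert("RGB").save(dest_path, "JPEG")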
One business model I am considering, and the most obvious, is a subscription-based web app.
While at PyCon last month I showed a few people a prototype. One person, an employee at Google, said, "Be careful." He was alluding to potential copyright and legal issues. "But," I said, "I'm not really doing anything different than Google." He countered, "Google has lots of lawyers." Ahhhh, message heard loud and clear!
I understand copyright and fair use [0] in general. But I don't want to be writing letters to the owners of the original content arguing the point, let alone wind up in court. What advice or experiences can you share that might be helpful?
[0] http://en.wikipedia.org/wiki/Fair_use
Instead of scraping, index. Indexing connotes supplementation, in the sense of adding value to what is already there. Have the thumbnails, excerpts of the descriptions, and whatever secret sauce you've not mentioned add value to the owners' data. Provide traffic or some other measurable benefit to them.
Don't rely merely on fair use (or on weak interpretations of the doctrine). Provide value to the data owners, and be ready to respect their wishes if they choose not to accept the value proposition you offer.
Unless you're going against obvious warnings on each site, scrape first, make it free, and ask questions later. If you're successful quickly enough, you will be a force in your own right, and your marketplace will be one where everyone wants to remain listed. Speed and adoption win; stay under the radar as long as you can. You want people to love your product so it doesn't get pulled, and so that if it does, they want it back. When you get notified, respond immediately.

Very important: PROFIT LATER. Once you are taking payments, some could say you are making money off their data, and they'll want a piece of that money. If it's a free service, there are fewer feathers to ruffle and you're less of a target. A cease and desist will stop you from pulling THEIR data. Getting sued for the money you brought in will ultimately stop you from pulling ANY data.
>a link would lead directly to the original website.
Track this heavily; it is the value you are adding for the data providers you are scraping from. If they see business growth coming from your space, they'll support you. Get allies early.
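A logging redirect on every outbound link is enough to get those numbers. Here's a rough sketch, assuming Flask; the /out route, log file, and parameter name are all illustrative:

    import logging

    from flask import Flask, redirect, request

    app = Flask(__name__)
    logging.basicConfig(filename="outbound.log", level=logging.INFO)

    @app.route("/out")
    def outbound():
        # Record the click-through, then send the visitor to the source site.
        target = request.args.get("url")
        if not target:
            return "missing url", 400
        logging.info("click-through to %s from %s", target, request.remote_addr)
        return redirect(target)

In practice you'd also validate the target against your own index so the route can't be abused as an open redirect; the per-site click counts are what you show the providers.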
The innocent 'I wanted to build a tool to reduce headaches and help the community' is the best defense here. (So don't post anything online stating otherwise...) Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.
Good Luck!
>The innocent 'I wanted to build a tool to reduce headaches and help the community' is the best defense here.
Seems like the best defense as well as the truth. I hope they would see it that way.
>Trying to get approval from a large enough 'chunk' of the data providers without some numbers behind you is a waste of precious time.
That's part of my dilemma. It would be hard to get some sort of approval otherwise.
Our SiteTruth system does some web scraping, looking mostly for the name and address of the business behind a web site. We're open about this: we use a user-agent string of "Sitetruth.com site rating system", list it on botsvsbrowsers.com, and document what we do on our web site. We've had one complaint in five years, and that was because someone had a security system which thought our system's behavior resembled a known attack.
About once a month, we check back with each site to see if their name or address changed. We look at no more than 20 pages per site (if we haven't found the business address looking in the obvious places, a human wouldn't have either). So the traffic is very low. Most scraper-oriented sites hit sites a lot harder than that, enough to be annoying.
We've seen some scraper blocking. We launch up to three HTTP requests to the same site in parallel. A few sites used to refuse to respond if they received more than three HTTP requests in 10 seconds. That seems to have stopped, though; with some major browsers now using look-ahead fetching, that's become normal browser behavior. More sites are using robots.txt to block all robots other than Google, but it's under 1% of the several million web sites we examine. We're not seeing problems from using our own user-agent string.
So I'd suggest: 1) obey robots.txt, 2) use your own user-agent string that clearly identifies you, and 3) don't hit sites very often. As for what you do with the data, you need to talk to a lawyer and read Feist v. Rural Telephone.
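All three suggestions fit in a few lines of Python standard library. This is a rough sketch, not our production code; the user-agent string, delay, and helper names are placeholders:

    import time
    import urllib.robotparser
    import urllib.request
    from urllib.parse import urlsplit

    USER_AGENT = "examplebot/1.0 (+http://example.com/bot)"  # hypothetical
    CRAWL_DELAY = 5  # seconds between hits to the same site

    def allowed(url):
        # Suggestion 1: check robots.txt before fetching anything.
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    def fetch(url):
        # Suggestion 2: identify yourself with your own user-agent string.
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        return urllib.request.urlopen(req, timeout=10).read()

    def crawl_site(urls):
        for url in urls:
            if allowed(url):
                yield url, fetch(url)
            time.sleep(CRAWL_DELAY)  # suggestion 3: keep traffic low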
I ran into several people who wrote cease and desists, which I honored, and several others who started banning our IP addresses, disallowing us specifically via robots.txt, and so on. There are obviously ways to get around these measures, but the main question is: morally, would you want to? Are you willing to go against website owners who flat out don't want you scraping their data? Would you be willing to fight them legally for your right to do so?
Ultimately, that's what it came down to for me; I just felt really crappy about it and stopped.
Personally, I feel that inclusion in Google constitutes public access to the data. As long as I'm not logged into an account on their system, I feel ethically justified about scraping their data.
In other words, I do not feel compelled to respect robots.txt if that file does not also block googlebot.
Legally it may be another issue, but ethically I consider inclusion in Google as an announcement that this information is public.
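That rule is easy to mechanize. A rough Python sketch (the "mybot" token is hypothetical) that honors an exclusion only when it also applies to Googlebot:

    import urllib.robotparser
    from urllib.parse import urlsplit

    MY_AGENT = "mybot"  # hypothetical user-agent token

    def should_skip(url):
        # Skip a URL only if robots.txt blocks both us and Googlebot.
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url("%s://%s/robots.txt" % (parts.scheme, parts.netloc))
        rp.read()
        blocked_for_me = not rp.can_fetch(MY_AGENT, url)
        blocked_for_google = not rp.can_fetch("Googlebot", url)
        return blocked_for_me and blocked_for_google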
In the early stages you probably won't be robots.txt'd because you're insignificant.
In later stages, you're hoping to not be robots.txt'd because you're providing a worthwhile service not just for users but for the site.
At neither stage should you force companies that don't want you indexing their content to go beyond basic means (robots.txt), because the more serious measures all cost them more money: tracking and blocking your IPs, C&Ds, DMCA requests to your provider asking that the entire site be taken down because there are thousands of infringing items, lawsuits seeking damages, court costs, or costs for dealing with your circumvention of technical measures meant to keep you out of the site, finding a friendly prosecutor, etc.
You don't want to go down that more expensive road.
If the product is for competitive analysis or price-comparison purposes, which is the only conclusion I can draw from that sentence (why else would you scrape your peers?)... then Market Leader A is highly incentivized to try to shut down any provider that feeds their content in an actionable way to their smaller competitors B and C. Even if A could theoretically benefit from B and C's information just as much, B and C have more to gain than A does, and that's dangerous to A. And A does have an argument that their proprietary content is not being used under fair use. It might not stand up in court, but their legal department can still make your life a living hell, and if they deem the threat large enough, they probably have enough resources to bleed you dry without breaking a sweat.
Perhaps the potential upside of addressing this market is worth the legal risk. I am not a lawyer. But as soon as you get reasonably big, you'll paint a target on your back.
Hire a developer for next to nothing per hour in the Philippines, India, or China. Get them to build a quick-and-dirty scraping tool focused on a specific industry. Then try to flog it to slightly shady businesses. Stay under the radar for as long as you can and make as much money as you can while you're there. Sooner or later you'll get busted and shut down; no big loss to you.
The people I've seen do this typically have a dozen or so such "startups" going at any one time, and they just keep shutting one down to start another.
This is not the sort of startup that will get you the fame and respect of the tech startup world. But it can certainly make you money if you have the "right" mindset. Just don't bet the farm on it.
1) Build an MVP prototype with the scraped data. Don't worry about the business model yet. VERY IMPORTANT: make sure you are allowed to scrape the data in the first place. Work out an agreement that covers your interest in the data, but don't give away your methods.
2) Pitch the idea FIRST AND ONLY to the data owners. Show them how useful their data could be. They may want to invest in YOU to build it out. If the data owners are hard to approach, reach out to mentors who have network connections.
3) The fallback and last resort is to build up your own data. This will be tough and tricky; you might have to build your own search engine (or a similar data-feeding app). But at least you own the data.
In conclusion: content ownership is king in the online media world. Make sure you follow the appropriate channels. Talk to the data owners about your interest in their data. Get agreements in place for access without giving away proprietary methods.
Google honors robots.txt, but few site owners use it against Google because of the cost of being delisted. By contrast, the cost of delisting from your specialized search engine is low, so you might see some of your content dry up.
In the U.S., at least, you do not have the legal right to connect to a site if the owner has requested that you stop -- see eBay v. Bidder's Edge. Fair use has nothing to do with that point: fair use deals with what use you can make of the information once you obtain it, not with any right to obtain it in the first place.
Talking to a lawyer is always good advice.