jillesvangurp · 6 years ago
I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar scandals in Germany recently involving completely unprotected Elasticsearch running on a public IP address without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.

Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that, and then went on to make sure the thing was reachable from the public internet on a non-standard port that, on most OSes, would require you to disable the firewall or open a port. The ES manual section on network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
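
For reference, the knob in question lives in elasticsearch.yml; a minimal sketch of the dangerous change (the setting names are real, the values illustrative):

    # elasticsearch.yml
    # default: binds loopback only; the node is unreachable from other machines
    #network.host: _local_
    # the 'fix' that exposes it on every interface, public ones included:
    network.host: 0.0.0.0
    http.port: 9200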

Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (which deletes all indices). Does it count as a data breach when a member of the general public cleans up your mess like that?

In any case, Elasticsearch is a bit of a victim of its own success here and may need to act to protect users against their own stupidity, since clearly masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).

It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above, and having some clue about what IP addresses and ports are and why having a database with full read/write access on a public IP and port is a spectacularly bad idea.

jerrac · 6 years ago
I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.

ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.

Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.

I am serious about my question. Could anyone clue me in?

jillesvangurp · 6 years ago
It has to exist on a private network behind a firewall, with ports open to application servers and other ES nodes only. Running things on a public IP address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).

Running MySQL or Postgres on a public IP address would be equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which requires knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public IP address over a non-TLS connection. Pretending otherwise would be a mistake. Having basic authentication in Elasticsearch would be the pointless equivalent. Base64-encoded plaintext passwords (i.e. basic authentication over HTTP) are not a form of security worth bothering with, which is why they never did this. It would be a false sense of security.
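
To make that concrete (a Python sketch; the credentials are made up), this is all an eavesdropper on a plain-HTTP connection has to do:

    import base64

    # What a basic auth header actually carries over plain HTTP:
    header = "Authorization: Basic dXNlcjpodW50ZXIy"
    token = header.split()[-1]
    print(base64.b64decode(token))  # b'user:hunter2' (encoding, not encryption)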

At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is their poor decision making. Going "meh, HTTP, public IP, no password, what could possibly go wrong?! Let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.

0ld · 6 years ago
> It has always baffled me that ES doesn't require a username and password by default.

Because auth was part of their paid offering (and by paid I mean 'very goddamned expensive') until about half a year ago, when they made it free in response to Amazon's freshly emerged Open Distro free auth plugin.

ibirman · 6 years ago
They offer security as a paid feature.
lacker · 6 years ago
If you set up Elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password; you'd probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it isn't the default.
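
For example, with boto3 the locked-down setup is a handful of lines (a sketch; the VPC ID and CIDR are placeholders):

    import boto3

    ec2 = boto3.client("ec2")
    sg = ec2.create_security_group(
        GroupName="es-internal",
        Description="Elasticsearch, reachable from inside the VPC only",
        VpcId="vpc-0123456789abcdef0",  # placeholder
    )
    # Allow 9200/9300 only from the VPC's private range. No rule = no access,
    # so the node stays invisible to the public internet by default.
    ec2.authorize_security_group_ingress(
        GroupId=sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 9200,
            "ToPort": 9300,
            "IpRanges": [{"CidrIp": "10.0.0.0/16"}],  # placeholder CIDR
        }],
    )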

MongoDB also by default does not have username+password authentication turned on.

I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.

dmos62 · 6 years ago
Password auth over HTTP is horrible. Short of binding your instance to a public IP address, basic auth without HTTPS is probably the worst thing you can do.
paco_sinbad · 6 years ago
It's a marketing ploy by ES.

They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.

Just riffing of course.

jschwartzi · 6 years ago
This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDL not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?

This makes me want to talk to a lawyer.

Xylakant · 6 years ago
> Out of the box it does not even bind to a public internet address.

Binding to all interfaces used to be the default in 1.x; it changed largely because people were footgunning themselves.

Coupled with the lack of security in the base/free distribution, that made for a dangerous pitfall. At least security is finally part of the free offering now, but the OSS version still ships with no access control at all.

lmilcin · 6 years ago
You typically use these in pods which share networking but are not available from outside.

It doesn't matter then if you bind it to 0.0.0.0.

cookiecaper · 6 years ago
I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.
czbond · 6 years ago
ES, Mongo, and Redis used to be some of the easiest targets for production data (security-vuln-wise): usually deployed by SWEs, often as early versions, and without access control by default.
ryan_lane · 6 years ago
ES's practice of making its security features a proprietary, paid-for product is the cause of these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.
staticassertion · 6 years ago
Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other DBs need to catch up ASAP; it's ridiculous.

Documentation is not security. If you need to "RTFM" to not be in an ownable state, it's ES's fault.

jwandborg · 6 years ago
Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.

The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.

throwaway5752 · 6 years ago
Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?

I can't believe anyone shipping a datastore could let it happen after that. Doesn't PostgreSQL still limit the default listen_addresses to local connections only? That seems like the best approach. On a distributed store, consistency operations between nodes should go over a different channel than queries and should, at worst, be allowed on a node-by-node basis. At that point, it takes someone who should know better to make it open to the world. Even when listening only for local connections, passwordless auth should never be a default.
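
It does; the stock postgresql.conf still reads roughly like this (a sketch; exact comments vary by version):

    #listen_addresses = 'localhost'   # default: local TCP connections only
    #port = 5432
    # you have to deliberately set listen_addresses = '*' (and add pg_hba.conf
    # rules) before the server will accept a remote connection at all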

achillean · 6 years ago
Yes, and similar issues still exist with public MongoDB instances even though the defaults are secure.
healsjnr1 · 6 years ago
This assumes it was incompetence and not done intentionally.

My understanding is that neither company is claiming ownership of this data set, and there is an assumption that a third company either legally or illegally obtained the data and is using it for its own services.

Another option is that the data was exfiltrated by a loose group of people who wanted it to be freely available on a random IP. Know the IP, get sick access to a trove of PII. No logins, no accounts, no trace.

Welcome to the early 90s internet.

codetrotter · 6 years ago
> It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above

I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.

Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.

But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)

> and having some clue about what IP addresses and ports are and why having a database with full read/write access on a public IP and port is a spectacularly bad idea.

Again, not necessarily, for the same reason as above.

But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(

Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.

Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.

Companies that can’t handle data securely, have no business handling data at all.

Scoundreller · 6 years ago
My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.

I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.

https://siliconangle.com/2011/08/01/third-largest-bitcoin-ex...

danmur · 6 years ago
Not to say this is what people are doing, but I don't think it requires much knowledge to run ES under Docker, and it's pretty easy to expose it to the public internet that way.
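
For instance, the standard quick-start invocation (a sketch; the version tag is arbitrary) publishes the port on 0.0.0.0, and on Linux Docker's own iptables rules bypass ufw-style host firewalls:

    docker run -d -p 9200:9200 -p 9300:9300 \
        -e "discovery.type=single-node" \
        docker.elastic.co/elasticsearch/elasticsearch:7.4.0
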
Quekid5 · 6 years ago
Incompetence and indifference will be the ruin of us all.

This is just another symptom of the Principal-agent problem writ large.

shadowgovt · 6 years ago
It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.

It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.

yoaviram · 6 years ago
If you're in Europe or California, I suggest sending both companies an erasure request: https://yourdigitalrights.org/?company=peopledatalabs.com and https://yourdigitalrights.org/?company=oxydata.io

Disclaimer: I'm one of the creators of yourdigitalrights.org.

Already__Taken · 6 years ago
Can I use this on behalf of my @company users that HIBP has just emailed me about?
arbol · 6 years ago
This is great. Thanks
godelski · 6 years ago
Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.
chmod775 · 6 years ago
> If the issue is access to the data, then maybe we should ask if this data should be collected in the first place.

Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with OP's (where else would that unused phone number come from), is an angle new legislation can use.

The EU already has a piece of legislation aimed at stifling these practices. The US and other economies just need to follow suit.

amerine · 6 years ago
That is OP's point.
sparkywolf · 6 years ago
I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client-side validation was enough for them, I guess..?)

They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.

john-radio · 6 years ago
I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to mine. I think that during registration I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.

I randomly check every 6 months or so and yep, still not fixed.

skissane · 6 years ago
My gmail is my first initial followed by my last name. There are other people on this planet with the same first initial and last name, some of whom seem to think that must be their email too, because I keep getting emails where they used it to sign up for things.
simonlc · 6 years ago
I actually had a similar thing happen with Facebook, though we didn't share names.
Ayesh · 6 years ago
I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profiles with a matching Google account sign-in.

About.me's business model was quite unsettling to me, and they have made little to no effort to protect user data from scrapers.

paulgb · 6 years ago
I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).

I tried it on a friend and it worked, but LinkedIn's response was basically "meh".

My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.

icebraining · 6 years ago
LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.
adrianmonk · 6 years ago
While not good, what's the connection to this story?

The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.

In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?

Ayesh · 6 years ago
I signed up for an API key to see what they have on me, and the data it returned looks awfully close to what I have on LinkedIn.
robbya · 6 years ago
A few years of heads up is sufficient to disclose publicly. Full disclosure helps keep companies honest about security.
stopadvertising · 6 years ago
I deleted my linkedin a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.

slg · 6 years ago
The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.
class4behavior · 6 years ago
Imho, it's more impressive that it's basically a non-story outside of IT security news.
trickstra · 6 years ago
The general public just shrugs upon hearing such news. They still think there's nothing dangerous about their data getting leaked.
StillBored · 6 years ago
I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).

That is because a couple of days ago, I got a text message from T-Mobile (which seemed genuine) basically saying that my account was part of a larger subset of prepaid phone accounts that had been compromised, and that my personal information had potentially been taken by "hackers".

I got a good chuckle out of that, because T-Mobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash, without filling out any information. AKA you buy a SIM card for $$$ and that is it. So basically the only information of mine they lost, as far as I can tell, is the phone number and the type of phone I'm using (which they gather from their network). If they had gotten the "meta" data about usage/location/etc., that would have been different, but it didn't sound like the hackers got that far.

Had this been a post-paid account, they would have had my name/address/SSN/etc.

TheSpiceIsLife · 6 years ago
Do you think it’s reasonable to believe your name / address / SSN / DOB / etc is already out there?

I’m of the opinion it’s too late for prevention and we need, instead, mitigation.

a3n · 6 years ago
Exactly. The very reason for the existence of the two companies, PDL and Oxy, is to tie n pieces of data to m pieces of data.

So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.

In fact I wonder if there is any such thing as non-PII, given the existence of such companies.

krn · 6 years ago
> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.

"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.

It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".

[1] https://oxylabs.io/

[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...

tyingq · 6 years ago
The article says it is "Company 2: OxyData.Io (OXY)" (http://oxydata.io)
krn · 6 years ago
OxyData and OxyLabs seem to be sister companies[1]: the former sells data as a product, the latter sells scraping as a service.

[1] https://vpnscam.com/wp-content/uploads/2018/08/2018-08-24-09...

gorbachev · 6 years ago
How is that possible? LinkedIn blocked mining the data this way several years ago.

Is it still possible if you pay LinkedIn enough? Or is this old data?

avip · 6 years ago
It is strictly impossible to "block mining data" on the public web. Double that if the miner has free access to a pool of residential IPs.

[source: experience]

tyingq · 6 years ago
A large number of residential proxies and fake LinkedIn accounts would look the same to LinkedIn as normal browsing.
toxicFork · 6 years ago
I'm a NordVPN user. Practices like this scare me though. I guess it's time to switch to a new VPN?
Havoc · 6 years ago
Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?

I've been wanting to do some social graph experimentation on it (small scale, say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping. (And the API is a non-starter, since it basically says everything is verboten.)

kaivi · 6 years ago
I've crawled a popular social network on a large scale and am currently doing the same for dating services as a hobby. God, I wish I still got paid for web scraping.

Here are some tricks which may or may not work today:

- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.

- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter. (Sketch at the end of this list.)

- Scrape the mobile website. Even Facebook still has a non-js mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have.

- From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.

- Don't be too kind to the big websites. They can afford to keep all their data in hot pages, and as one person you will never exhaust them.
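
As promised, a sketch of the ipv6 trick in Python (the prefix is a placeholder, and on Linux you'll likely need net.ipv6.ip_nonlocal_bind=1 or the addresses assigned to the interface):

    import http.client
    import ipaddress
    import random
    import socket
    import ssl

    PREFIX = ipaddress.IPv6Network("2001:db8:1:2::/64")  # placeholder prefix

    def fresh_source() -> str:
        # 2**64 possible source addresses inside one routed /64
        return str(PREFIX.network_address + random.getrandbits(64))

    def fetch(host: str, path: str) -> bytes:
        sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
        sock.bind((fresh_source(), 0))     # new source address per request
        sock.connect((host, 443))
        conn = http.client.HTTPSConnection(host)
        conn.sock = ssl.create_default_context().wrap_socket(
            sock, server_hostname=host)
        conn.request("GET", path, headers={"User-Agent": "Mozilla/5.0"})
        return conn.getresponse().read()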

hitpointdrew · 6 years ago
> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Nice tip!!

> - From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.

Somewhat ironically, Elasticsearch would probably work really well for this too (just make sure your Elasticsearch isn't open to the world on the internet!).

davidhyde · 6 years ago
You forgot the part about exposing your finished database via an unprotected Elasticsearch HTTP endpoint ;)

In all seriousness, does anyone know why you can even host an Elasticsearch database over plain HTTP and without credentials? It seems to be the default. What is the use case for this?

xfer · 6 years ago
> Have an app where user logs in through said website, then scrape their friends using this user's token.

That's some extremely shady thing to do.

isoos · 6 years ago
> Don't be too kind to the big websites.

I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000ms slower than the average single-thread latency, it is time to back off a bit. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones pushing a larger load on the servers.
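
A sketch of the idea in Python (single worker; the thresholds are made up):

    import time
    import requests

    BASELINE_SAMPLES = 20

    def polite_crawl(urls):
        baseline, delay, latencies = None, 0.0, []
        for url in urls:
            time.sleep(delay)
            t0 = time.monotonic()
            resp = requests.get(url, timeout=30)
            latency = time.monotonic() - t0
            latencies.append(latency)
            if baseline is None and len(latencies) >= BASELINE_SAMPLES:
                baseline = sum(latencies) / len(latencies)
            if baseline is not None:
                if latency > baseline + 0.75:      # ~500-1000ms over baseline
                    delay = min(delay * 2 or 1.0, 60.0)  # back off
                else:
                    delay = delay / 2                    # speed back up
            yield url, resp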

nfoz · 6 years ago
Don't you consider this unethical -- if not against the site itself, than against the other users of the site whose data you're scraping?
Ayesh · 6 years ago
Wow these are some hot tips!

YMMV, and cloud providers would hate you for this, but you can automate IP rotation with a cloud provider that bills by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the second hour.
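
A rough sketch of that rotation loop with boto3 (the AMI ID is a placeholder; in reality AMI IDs differ per region):

    import itertools
    import time
    import boto3

    REGIONS = ["eu-central-1", "ap-southeast-1", "us-east-2"]

    for region in itertools.cycle(REGIONS):
        ec2 = boto3.resource("ec2", region_name=region)
        inst = ec2.create_instances(
            ImageId="ami-0123456789abcdef0",  # placeholder, per-region in reality
            InstanceType="t3.micro", MinCount=1, MaxCount=1)[0]
        inst.wait_until_running()
        # ...point the scraper at the instance's fresh public IP...
        time.sleep(3600)    # use it for its billed hour
        inst.terminate()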

Pretending to be Googlebot also helps.

Havoc · 6 years ago
> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

Clever. VMs with IPv6 are cheap as a bonus :)

Same for non-js mobile. Thanks for the tips

adatavizguy · 6 years ago
> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.

How would someone do that using node.js? Asking for a friend.

sillysaurusx · 6 years ago
So far, the replies have been non-technical, like "Distributed Scraping." Well, yes, obviously.

A more useful answer: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user agent string was set correctly. Since PhantomJS was – I think – essentially the same as what headless Chrome is today, the server couldn't determine that you were running a headless browser.

It's not so easy to do that nowadays. There are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping. Imagine a VM that's literally running Chrome, with a script set up to interact with the VM using nothing but mouse movements and key presses. You could even throw some AI into the mix: record some real mouse movements and key presses over time, then hook up a model to your script so that it generates inputs that are impossible to distinguish from real human input. Such a system would be almost impossible to differentiate from your real users.

The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.

It's hard to counter a determined scraper.

onlyrealcuzzo · 6 years ago
I wrote a headless Chrome framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling/swiping and clicking/tapping.

It's not very hard to build something that almost every website besides Google and Facebook would find too hard to bother detecting. Even if it's a 1 on a 0-9 difficulty scale, most websites just don't have the resources to detect it.

It took me ~3 hours to write, but I guarantee it would take months for someone to detect it, and even then they'd have a lot of false positives and negatives.
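
The core of it is tiny; something in the spirit of this sketch (Selenium here rather than my actual framework; the timings and adjacency map are made up):

    import random
    import time
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    NEIGHBORS = {"a": "qs", "e": "wr", "o": "ip", "t": "ry"}  # toy adjacency map

    def human_type(element, text):
        for ch in text:
            # occasionally fat-finger an adjacent key, then correct it
            if ch in NEIGHBORS and random.random() < 0.04:
                element.send_keys(random.choice(NEIGHBORS[ch]))
                time.sleep(random.uniform(0.12, 0.40))
                element.send_keys(Keys.BACKSPACE)
            element.send_keys(ch)
            time.sleep(random.uniform(0.06, 0.22))  # per-key jitter

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    driver.get("https://example.com/search")   # placeholder target
    human_type(driver.find_element_by_name("q"), "hello world")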

hobofan · 6 years ago
LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains roughly weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very low number of different IPs.
cookiecaper · 6 years ago
As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g. because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.

I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.

Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.

The glimmer of hope on the horizon is LinkedIn v. hiQ, which seems poised to finally overturn 4 decades of anti-scraping case law, but I'm not holding my breath too hard there.

scraping_legal · 6 years ago
The US courts decided that scraping is legal, even if against EULA:

> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.

https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...

shkkmo · 6 years ago
That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)
Matsta · 6 years ago
LinkedIn Sales Navigator is a paid tool that lets you search their whole database. Then, depending on how much you pay, you can get all of a person's details (email address, phone number, sometimes even their address). https://business.linkedin.com/sales-solutions/sales-navigato...
calibas · 6 years ago
I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...

In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.

Abishek_Muthian · 6 years ago
LinkedIn gives away your email address and phone number (even if you provided it just for 2FA) to all your contacts. I checked PDL; it has all the information from LinkedIn except my phone number, which I promptly removed once I identified the 2FA issue (TOTP is available now).
fredley · 6 years ago
'Mobile proxies' like https://oxylabs.io/mobile-proxies (no affiliation) let you use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you're on a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.
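
Usage is just a proxy URL per request; a sketch with Python requests (the endpoint and credentials are placeholders):

    import requests

    # Each request exits through a different mobile/residential IP in the pool.
    proxies = {
        "http":  "http://user:pass@pr.example-proxy.io:7777",
        "https": "http://user:pass@pr.example-proxy.io:7777",
    }
    r = requests.get("https://www.linkedin.com/in/someone", proxies=proxies,
                     headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    print(r.status_code)
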
Ididntdothis · 6 years ago
You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.

tyri_kai_psomi · 6 years ago
Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.
gdulli · 6 years ago
They asked about LinkedIn, where the content is gated behind a login. If it was a rate limiting problem, that would be trivial.

Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.

Registering thousands of different users to use in a distributed way is hard now that they require text message verification for new accounts.

StuffedParrot · 6 years ago
Proxies can also work well for cheaper than buying distributed compute.
momokoko · 6 years ago
Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.
iamsb · 6 years ago
You use a proxy botnet and route your scraping requests through it. Use something like Hola proxy or Crawlera, for example.
alasdair_ · 6 years ago
I scraped 10 million records from LinkedIn a few years ago, from a single IP, by using their search function. I got a list of the top 1000 first names and top 1000 last names and wrote a script to query all combinations and scrape the results.

This may or may not still work.
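
The skeleton of that approach (a sketch; the endpoint, params, and the save_results helper are placeholders):

    import itertools
    import time
    import requests

    first = ["james", "mary", "john"]   # top-1000 lists in reality
    last = ["smith", "johnson", "lee"]

    for fn, ln in itertools.product(first, last):  # 1000 x 1000 = 1M queries
        # placeholder endpoint; the real search URL and params will differ
        r = requests.get("https://example.com/search",
                         params={"q": f"{fn} {ln}"}, timeout=30)
        save_results(r.text)  # hypothetical parse-and-store helper
        time.sleep(1.0)       # stay under the radar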

magashna · 6 years ago
It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My LinkedIn data from PDL only had one bit of wrong info. I wasn't able to find anything on my personal email addresses, which is good.
MrOxiMoron · 6 years ago
I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do; I can only imagine it got worse. It's also the reason recruiters really want to connect with you on LinkedIn: even if you are not interested, your connections might be.
sergiotapia · 6 years ago
A very large distributed network of machines.
memn0nis · 6 years ago
Hey, not related to your comment (apologies), but I wanted to get in touch. You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!
anilshanbhag · 6 years ago
People data labs's data is pretty accurate. Here is mine: https://api.peopledatalabs.com/v4/person?api_key=9c6a1382204...

You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.

BoorishBears · 6 years ago
Haha, when I was a kid and scared to use my real name for things, for some reason I used my email, which had my real name in it, to open a GitHub account with a fake name.

So the API knows me as the famous architect, Art Vandelay.

soylentcola · 6 years ago
Reminds me of when I used to get free magazine subscriptions (and the subsequent junk mail/robocalls) addressed to Santos L. Halper.
EGreg · 6 years ago
There is a way to get every developer's email on GitHub, thanks to git commits embedding it :))
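
Roughly like this (a sketch; the repo URL is a placeholder, and --filter=blob:none needs a reasonably recent git plus server-side support, which GitHub has):

    import subprocess
    import tempfile

    def author_emails(repo_url):
        with tempfile.TemporaryDirectory() as tmp:
            # bare, blobless clone: commit metadata is all we need
            subprocess.run(["git", "clone", "--bare", "--filter=blob:none",
                            repo_url, tmp], check=True)
            log = subprocess.run(
                ["git", "--git-dir", tmp, "log", "--all", "--format=%ae"],
                check=True, capture_output=True, text=True)
            return set(log.stdout.splitlines())

    print(author_emails("https://github.com/example/example.git"))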
guenthert · 6 years ago
That must have been a long time ago, Boorish Bears.
btbuildem · 6 years ago
Wow.. I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile -- meaning that LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.
netsharc · 6 years ago
Ah, looks like everyone's using that API key, I got 2 queries for my addresses and got a "rate limit exceeded" message.

Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...

SirYandi · 6 years ago
You and others can use my API key; I just signed up.

e75ac28b25480e60071b24d819d4692a0b315c037046b9ff6ec9dfb1e99a895c

ethagnawl · 6 years ago
Try changing v4 to v3 in the URL.
afturner · 6 years ago
Here's mine: eaca37c25ca1a9c5d85efb8cbaf1742b4fbfeee0054d713961176ab9500c2f2b
cmdshiftf4 · 6 years ago
It returned a 404 for my personal email account, so that appears to be sufficiently protected.

More surprisingly, it had data such as my name, title, and work email address connected to an old work email account (Okta-managed GSuite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.

alanbernstein · 6 years ago
That API key is now public, too! Rate limited.
big_chungus · 6 years ago
Yeah, no kidding. If you wait until it flips over to a new minute and refresh, that helps. And it takes all of a minute to register a free key, so probably no big deal.
rohan1024 · 6 years ago
Your API key is now permanently public. A few days from now, people will still be able to use it for their own purposes.
briffle · 6 years ago
a few days? it's already hit its limit :)
0xTJ · 6 years ago
I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.
yyyk · 6 years ago
Nothing for most of my accounts, except one which was somehow falsely attributed to someone else. Odd, given I do have a LinkedIn profile; their scraping must be far from perfect.
asdfman123 · 6 years ago
Wait, so is this mostly just Linkedin data in JSON form?
Izkata · 6 years ago
My personal email seems to be based on Github and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.
soared · 6 years ago
This seems exceptionally unethical
pc86 · 6 years ago
Displaying public information publicly, or sharing your API key?
troebr · 6 years ago
I would be really surprised if this were compliant with the GDPR. I live in the US, but I tried email accounts of relatives in Europe and they had data in there.
ThrustVectoring · 6 years ago
It looks like it's a US-based company without enough of a European presence to fall under their jurisdiction.
stfwn · 6 years ago
Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request
olivierduval · 6 years ago
So... if the owner is known, it will be quite costly ;-)
tmpaccc · 6 years ago
I don't know how accurate the coordinates of your address in India are, but it's 5 minutes away from me. Small world, huh?
lm28469 · 6 years ago
I'm glad they don't have jack shit on me besides my email, is there a list of their data source(s) ?