I was at an Elasticsearch meetup yesterday where we had a good laugh about several similar recent scandals in Germany involving completely unprotected Elasticsearch instances running on a public IP address without a firewall (e.g. https://www.golem.de/news/elasticsearch-datenleak-bei-conrad..., in German). This beats any of that.
Out of the box it does not even bind to a public internet address. Somebody configured this to 'fix' that and then went on to make sure the thing was reachable from the public internet on a non-standard port, which on most OSes would require you to disable the firewall or open a port. The ES manual section on network settings is pretty clear about this, with a nice warning at the top: "Never expose an unprotected node to the public internet."
Giving read access is one thing. I bet this thing also happily processes curl -X DELETE "http://<ip>:9200/*" (which deletes all indices). Does it count as a data breach when a member of the general public cleans up your mess like that?
In any case, Elasticsearch is a bit of a victim of its own success here, and may need to act to protect users against their own stupidity: clearly, masses of people who arguably should not be making technical decisions now find it easy enough to fire up an Elasticsearch server and put some data in it (given the number of companies that seem to be getting caught with their pants down).
It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above, and having some clue about what IP addresses and ports are and why having a database with full read/write access on a public IP and port is a spectacularly bad idea.
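To make concrete just how unprotected "unprotected" is: here is a rough sketch (Python with the requests library; the IP below is a placeholder, not the actual server) of what anonymous HTTP GETs against an open node hand you, before you even get to the DELETE above:

    import requests  # third-party: pip install requests

    HOST = "http://203.0.113.10:9200"  # placeholder address standing in for an exposed node

    # Root endpoint: cluster name and version, no credentials needed on an open node.
    print(requests.get(HOST, timeout=5).json())

    # Every index, with document counts and sizes.
    print(requests.get(HOST + "/_cat/indices?v", timeout=5).text)

    # Pull a handful of documents from across all indices.
    print(requests.get(HOST + "/_search?size=3", timeout=5).json())

The same node accepts writes and deletes just as cheerfully, which is the point.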
I've been using ES off and on since before 1.0 came out. It has always baffled me that ES doesn't require a username and password by default.
ES is a database that has to exist on a network to be usable. Heck, it expects that you have multiple nodes, and will complain if you don't. So one of the first things you do is expose it to the network so you can use it.
Yes, it takes some serious incompetence to not realize you need to secure your network, but why in the world would you not add basic authentication into ES from the start? I'd never design a tool like a database without including authentication.
I am serious about my question. Could anyone clue me in?
It has to exist on a private network behind a firewall, with ports open to application servers and other ES nodes only. Running things on a public IP address is a choice that should not be taken lightly. Clustering over the public internet is not a thing with Elasticsearch (or similar products).
If you are running MySQL or Postgres on a public IP address it would be equally stupid and irresponsible, regardless of the useless default password that many people never change, unless you also set up TLS properly (which would require knowing what you are doing with e.g. certificates). The security in those products is simply not designed for being exposed on a public IP address over a non-TLS connection, and pretending otherwise would be a mistake. Basic authentication in Elasticsearch would be the pointless equivalent: Base64-encoded plaintext passwords (i.e. basic authentication over HTTP) are not a form of security worth bothering with, which is why they never added it. It would be a false sense of security.
At some point you just have to call out people for being utter morons. The blame is on them, 100%. The only deficiency here is their poor decision making: going "meh, HTTP, public IP, no password, what could possibly go wrong?! Let's just upload the entirety of LinkedIn to that." That level of incompetence, negligence, and indifference is inexcusable. I bet MS/LinkedIn is considering legal action against the individuals and companies involved. IMHO they'd be well within their rights to sue these people into bankruptcy.
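For anyone who hasn't internalised why basic auth over plain HTTP is "not worth bothering with": a minimal demonstration of what actually goes over the wire. Anyone who can see the traffic recovers the credentials with one call (Python, standard library only):

    import base64

    # What a client puts in the Authorization header for user "admin", password "hunter2":
    header = "Basic " + base64.b64encode(b"admin:hunter2").decode()
    print(header)  # Basic YWRtaW46aHVudGVyMg==

    # What anyone sniffing the unencrypted connection does to get them back:
    print(base64.b64decode(header.split()[1]))  # b'admin:hunter2'

Without TLS underneath, that header is a speed bump, not a lock.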
> It has always baffled me that ES doesn't require a username and password by default.
Because auth was part of their paid offering (and by paid I mean 'very goddamned expensive') until about half a year ago, when they made it free in response to Amazon's freshly released Open Distro and its free auth plugin.
If you set up elasticsearch on a cloud service like AWS, by default your firewall will prevent the outside world from interacting with it, and no authentication is really necessary. If you do use authentication, you probably wouldn't want username+password, you would probably want it to hook into your AWS role manager thing. So to me, username+password seems useful, but it isn't going to be one of the top two most common authentication schemes, so it seems reasonable that it should not be the default.
MongoDB also by default does not have username+password authentication turned on.
I think defaulting to username+password is a relic of the pre-cloud era, and nowadays is not optimal.
Password auth over HTTP is horrible. Short of binding a public IP address to your instance, basic auth without HTTPS setup is probably the worst thing you can do.
This addresses entirely the wrong question. By looking at it as a technical problem you're completely missing the broader ethical problem. Why was anyone allowed by law to amass this amount of data? And why did PDS not take the security and privacy concerns of 1.2 billion people seriously enough to ensure the data was handled correctly? They obviously thought it was valuable enough to amass a huge database. Do they sell this to just anyone? If not, who can buy access to this data? How much does it cost, and what steps are involved in doing so?
> Out of the box it does not even bind to a public internet address.
Binding to all interfaces used to be the default in 1.x; it changed pretty much because people were footgunning themselves.
Coupled with lack of security in the base/free distribution, that made for a dangerous pitfall. At least now security is finally part of the free offering, but the OSS version still comes with no access control at all.
I've come across several such ES instances that are 100% exposed to the world without even trying, and ES is by no means the first tool to have this problem. People are never going to stop doing this. Making it annoyingly difficult within ES just weakens them such that some other "wow it's so easy" search product will be better positioned to eat their lunch.
ES, Mongo, and Redis used to be some of the easiest targets for production data (security-wise): usually deployed by SWEs, often on early versions of the products, and without access control by default.
ES's practice of making its security features a proprietary, paid-for product is the cause of these kinds of things. It's a shitty practice, and this is one of the reasons I'm glad AWS forked it.
Other databases learned that not requiring a user/password upon install is completely irresponsible. ES and other dbs need to catch up ASAP, it's ridiculous.
Documentation is not security. If you need to "RTFM" to not be in an ownable state it's ES's fault.
Trusting software you install to be secure is ridiculous and completely irresponsible, especially if you did not pay for someone else to take the blame.
The only thing you can do to secure your software is to restrict its communication channels. Once you've secured the communication channels, the software auth is decorative at best.
Wasn't this exact same thing a huge scandal just a few years ago for Mongo on Shodan?
I can't believe anyone shipping a datastore could let it happen after that. Doesn't PostgreSQL still limit the default listen_addresses to local connections only? That seems like the best approach. On a distributed store, consistency operations between nodes should go over a different channel than queries and should be allowed on a node-by-node basis at worst. At that point, it at least requires someone who should know better to open it up to the world. And even when only listening for local connections, passwordless auth should never be a default.
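If you're unsure whether one of your own datastores is in that state, a quick sanity check from a machine outside your network is a plain TCP connect against the usual ports. A small sketch (the hostname is a placeholder for your server's public address):

    import socket

    HOST = "db.example.com"  # placeholder: your server's public address
    PORTS = {9200: "Elasticsearch", 27017: "MongoDB", 6379: "Redis", 5432: "PostgreSQL"}

    for port, name in PORTS.items():
        try:
            with socket.create_connection((HOST, port), timeout=3):
                print(f"{name} ({port}) is reachable from the outside -- fix that")
        except OSError:
            print(f"{name} ({port}) looks closed or filtered")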
This assumes it was incompetence and not done intentionally.
My understanding is that neither company owns this data set, and the assumption is that a third company has either legally or illegally obtained the data and is using it for its own services.
Another option is that the data was exfiltrated by a loose group of people who wanted this to be freely available on a random ip. Know the ip, get sick access to a trove of PII. No logins, no accounts, no trace.
> It's indeed really easy to set up. But setting it up like this still requires RTFMing, dismissing the warning above
I would bet that in a lot of cases, people that configure their servers like in the OP just don’t read the official docs at all.
Stack Overflow, Quora, etc. are great places to get answers, because of the huge amount of questions that have already been asked and answered there.
But when people rely solely on SO, Quora, blog posts and other secondary, tertiary, ..., nth-ary sources of information, Bad Stuff will result, because of all the information that is left unsaid on Q&A sites and in blog posts. (Which is fine on its own – the problem is when the reader is ignorant about the unsaid knowledge.)
> and having some clue about what IP addresses and ports are and why having a database with full read/write access on a public IP and port is a spectacularly bad idea.
Again, not necessarily, for the same reason as above.
But even if they did, it is a sad fact that a lot of people dismiss concerns over security with the kinds of “counter-arguments” that I am sure we are all too familiar with. :(
Thankfully though, we are beginning to see a shift in legislation being oriented towards protecting the privacy of people whose data is stored by companies.
Ideally, the fines should cause businesses to go bankrupt if they severely mishandle data about people. Realistically that is not what happens. For the most part they will get but a slap on the wrist. But it’s a start.
Companies that can’t handle data securely, have no business handling data at all.
My favourite was Bitomat.pl's loss of 17k bitcoins in 2011 because they restarted their EC2 instance.
I understand that the "ephemeral" nature of EC2 was in the documentation, but ESL speakers may have glossed over the significance of a word they didn't fully comprehend.
https://siliconangle.com/2011/08/01/third-largest-bitcoin-ex...
Not to say this is what people are doing, but I don't think it requires much knowledge to run under Docker, and it's pretty easy to expose it to the public internet that way.
It's a tragedy that all of this data was available to anyone in a public database instead of.... checks notes... available to anyone who was willing to sign up for a free account that allowed them 1,000 queries.
It seems like PDL's core business model is irresponsible regarding their stewardship of the data they've harvested.
Would it be better if this was a paid service? If the issue is access to the data, then maybe we should ask whether this data should be collected in the first place.
> If the issue is access to the data, then maybe we should ask whether this data should be collected in the first place.
Outlawing the collection of data would be hard and is unlikely to work, but the fact that companies like AT&T are allowed to sell your data, as they did with the OP's (where else would that unused phone number come from?), is an angle new legislation can use.
The EU already has legislation aimed at stifling these practices. The US and other economies just need to follow suit.
I found a vulnerability in LinkedIn a few years back that allowed anyone to access a private profile (because client-side validation was enough for them, I guess..?).
They didn't take my report seriously (still not completely patched) and I feel like that told me all I needed to know about their security practices.
I reported an issue to the LinkedIn competitor https://about.me two years ago where signing in with my Google credentials gives me access to the account of some random other person with a similar name to me. I think that during registration I attempted to register about.me/johnradio (except it's not "johnradio"), but he was already using it, and then the bug occurred that gave me this access.
I randomly check every 6 months or so and yep, still not fixed.
My gmail is my first initial followed by my last name. There are other people on this planet with same first initial and last name, some of whom seem to think that must be their email too, because I keep on getting emails where they used it to sign up for things.
I can only imagine about.me mass-creating profiles for names found on other web pages, and opening a way for someone to "claim" those profiles with a matching Google account sign-in.
About.me's business model was quite unsettling to me and they have made little to no effort to protect the user data from scrapers.
I had a similar experience. In 2014 I reported an issue where you could take over someone's account by adding an email you control to it and having them complete the flow by sending them a link (which, unless they looked very carefully, looked exactly like the regular log-in flow at the time - especially if they used a public email service and you registered a similar-looking account).
I tried it on a friend and it worked, but LinkedIn's response was basically "meh".
My life has only gotten better since I deleted LinkedIn a few years ago. I know I'm in a privileged position to be able to do that, but I strongly recommend everyone here consider whether what they gain from their account is worth the crap and spam they have to put up with.
LI is terrible if you actually try to use it, but it's harmless enough if you just use it as a profile hosting service, where people are likely to look. I just auto-archive their emails and only visit the site a couple of times per year.
While not good, what's the connection to this story?
The article says some LinkedIn data was scraped, but I don't see anywhere that it specifically says a LinkedIn security flaw was used in the scraping. Although it is vague about what data was scraped and how, so it doesn't preclude that either.
In other words, are you saying a LinkedIn vulnerability was exploited here, or suggesting that it probably was, or are you just mentioning LinkedIn because it's tangentially related?
I deleted my linkedin a few years back when they had some bug where I would randomly get page views as some other person, with all their connections and account details and whatnot. It would only last a few minutes then switch me back to my account, but they aggressively ignored my attempts to reach out to them about this bug so I just gave up.
The number in the HN headline was changed from 1.2 billion to 1 billion (despite the original source's headline saying 1.2). It is kind of amazing that leaking the personal data of 200 million people is now just a rounding error that can be dropped from headlines.
I think the solution here is laws which require anonymity, and that includes in banking (where it will never happen).
That is because a couple of days ago I got a text message from T-Mobile (which seemed genuine) basically saying that my account was one of a larger subset of prepaid phone accounts which had been compromised, and that my personal information had potentially been taken by "hackers".
To which I got a good chuckle, because T-Mobile is one of the few phone companies that will let you create completely anonymous prepaid accounts using cash and without filling out any information. AKA you buy a SIM card for $$$ and that is it. So basically the only information of mine they lost, as far as I can tell, is the phone number and the type of phone I'm using (which they gather from their network). If they had gotten the "meta" data about usage/location/etc. that would have been different, but it didn't sound like the hackers got that far.
Had this been a post-paid account they would have my name/address/SSN/etc.
> Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.
"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.
It is probably safe to assume that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".
[1] https://oxylabs.io/
[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...
Out of curiosity how do you guys think they managed to scrape LinkedIn on such a large scale?
I've been wanting to do some social graph experimentation on it (small scale - say 1000 people near me) but concluded I probably couldn't scrape enough via raw scraping without freaking out their anti-scraping. (And API is a non-starter since that basically says everything is verboten).
I've crawled a popular social network on a large scale, currently doing the same for dating services as a hobby. God, wish I'd still got paid for webscraping.
Here are some tricks which may or may not work today:
- Have an app where user logs in through said website, then scrape their friends using this user's token. That way you get exponential leverage on the number of API calls you can make, with just a handful of users.
- Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.
- Scrape the mobile website. Even Facebook still has a non-JS mobile version. This single WAP/mobile website defeats every anti-scraping measure they may have (a rough sketch of the idea follows this list).
- From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
- Don't be too kind to the big websites. They can afford to keep all their data in hot pages, and as a one-man operation you will never exhaust them.
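On the mobile-site bullet above, a rough sketch of the idea: a plain HTTP client with a phone user agent pointed at the non-JS mobile version of a site, plus a dumb link collector on the response. The URL and markup here are placeholders, and whatever terms of service apply to a real site still apply:

    import requests
    from html.parser import HTMLParser

    MOBILE_UA = ("Mozilla/5.0 (Linux; Android 9; SM-G960F) "
                 "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0 Mobile Safari/537.36")

    class LinkCollector(HTMLParser):
        """Collect href attributes from the stripped-down mobile HTML."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(value for name, value in attrs if name == "href")

    resp = requests.get("https://m.example.com/profile/12345",  # placeholder URL
                        headers={"User-Agent": MOBILE_UA}, timeout=10)
    collector = LinkCollector()
    collector.feed(resp.text)
    print(collector.links[:10])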
> - Call their API through ipv6, because they may not yet have a proper, ipv6 subnet-based rate limiter.
Nice tip!!
> - From a purely practical perspective, start with a bare-metal, transaction-isolation-less database like Cassandra/ScyllaDB. Don't rely on googling "postgres vs mongodb" or "sql vs nosql"; those articles will all end in "YMMV". What you really need is massive IOPS, and a multi-node ring-based index with ScyllaDB will achieve that easily. Or just use MongoDB on one machine if you're not in a hurry.
Somewhat ironically Elasticsearch would probably work really well for this too (just make sure your elasticsearch isn't open to the world on the internet!).
You forgot the part about exposing your finished database on an unprotected Elasticsearch HTTP endpoint ;)
In all seriousness, does anyone know why you can even host an Elasticsearch database over HTTP and without credentials? It seems to be the default. What is the use case for this?
I usually recommend latency-based dynamic load control for that. Once the website starts to reply 500-1000 ms slower than its average single-threaded latency, it is time to back off a bit. It is also a co-operative strategy between fellow scrapers, even if they don't know about the other ones pushing a larger load on the servers.
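Roughly what that looks like in practice. This is a sketch of the general idea, not the parent's actual implementation: time each request, compare against the site's normal single-threaded latency, and sleep whenever replies get noticeably slower than that baseline:

    import time
    import requests

    def timed_get(url):
        start = time.monotonic()
        resp = requests.get(url, timeout=30)
        return resp, time.monotonic() - start

    # Baseline: the site's normal single-threaded response time, from a few warm-up hits.
    baseline = min(timed_get("https://example.com/")[1] for _ in range(3))

    def polite_get(url, slack=0.75, max_sleep=30.0):
        """Fetch url; if the reply is noticeably slower than baseline, give capacity back."""
        resp, elapsed = timed_get(url)
        overload = elapsed - (baseline + slack)
        if overload > 0:
            # The site is struggling (or throttling us): back off proportionally.
            time.sleep(min(overload * 2, max_sleep))
        return resp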
YMMV, and cloud providers would hate you for this, but you can automate the IP rotation with a cloud provider that bills you by the hour. It's easier than ever nowadays to spin up an instance in Frankfurt, use it for an hour, and then another in Singapore for the next hour.
So far, the answers have been non-technical ones like "distributed scraping." Well, yes, obviously.
A more useful answer is: I did this once, many years ago. Back then it was a matter of hooking up PhantomJS and making sure your user agent string was set correctly. Since PhantomJS was – I think – essentially the same thing as headless Chrome is today, the server couldn't determine that you were running a headless browser.
It's not so easy to do that nowadays. There are mechanisms to detect whether the client is in headless mode. But most websites don't implement advanced detection and countermeasures. And in the ideal case, you can't really detect that someone is doing automated scraping at all. Imagine a VM that's literally running Chrome, with a script set up to interact with it using nothing but mouse movements and key presses. You could even throw some AI into the mix: record some real mouse movements and keystrokes over time, then hook up a model to your script so that it generates movements and key presses that are impossible to distinguish from real human input. Such a system would be almost impossible to differentiate from your real users.
The other piece of the puzzle is user accounts. You often have to have "aged" user accounts. For example, if you tried to scrape LinkedIn using your own account, it wouldn't matter if you were using 500 IPs. They would probably notice.
I wrote a chrome headless framework that types using semi-realistic key presses (timing, mistakes, corrections) and does semi-realistic scrolling / swiping and clicking / tapping.
It's not very hard to build something that almost every website besides Google and Facebook won't bother trying to detect. Even if it's only a 1 on a 0-9 difficulty scale, most websites just don't have the resources to detect it.
It took me like ~3 hours to write it, but I guarantee it would take months for someone to detect it, and even then, they'd have a lot of false positives and negatives.
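For what it's worth, the core of that kind of thing really is small. A rough sketch (not the parent's code) of semi-realistic typing using Playwright's sync API: jittered per-key timing, the occasional wrong key, then a correction. The URL and selector are placeholders:

    import random
    import time
    from playwright.sync_api import sync_playwright  # third-party: pip install playwright

    def human_type(page, selector, text):
        """Type text into selector with jittered timing, occasional typos and corrections."""
        page.click(selector)
        for ch in text:
            if random.random() < 0.03:  # ~3% chance of hitting a wrong key first
                page.keyboard.type(random.choice("asdfjkl"))
                time.sleep(random.uniform(0.1, 0.4))
                page.keyboard.press("Backspace")
            page.keyboard.type(ch)
            time.sleep(max(0.02, random.gauss(0.12, 0.04)))

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com/login")  # placeholder page
        human_type(page, "input[name=email]", "someone@example.com")  # placeholder selector
        browser.close()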
LinkedIn's protection doesn't seem to be that sophisticated at the moment. Someone I know maintains roughly weekly up-to-date profiles of a few million users via a headless scraper that uses ~10 different premium accounts and a very small number of different IPs.
As long as you are able to source more than one provider, this can work well enough. If you're dependent on a single data source, e.g., because that source is the only possible source of said data, you'll get nuked from orbit by legal rather than technical means.
I had a business that was generating more money than my full-time job for a while. We helped and greatly simplified matters for several thousand independent proprietors while having a positive effect on the load of the data source, since we were able to batch/coalesce requests, make better use of caches, and take notification responsibilities on ourselves.
Once in a while someone would get worried and grumpy at the data source and there were a couple of cat-and-mouse games, but we easily outwitted their scraping detection each time. When they got tired of losing the technical game, they sent out the lawyers, which was far more effective. We were acquiring facts about dates and times from the place that issued/decided those dates and times, so there wasn't really any reliable alternative data source, and we had to shut down.
The glimmer of hope on the horizon is LinkedIn v. hiQ, which seems poised to potentially overturn decades of anti-scraping case law, but I'm not holding my breath too hard there.
The US courts decided that scraping is legal, even if against EULA:
> In a long-awaited decision in hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals ruled that automated scraping of publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA). This is an important clarification of the CFAA’s scope, which should provide some relief to the wide variety of researchers, journalists, and companies who have had reason to fear cease and desist letters threatening liability simply for accessing publicly available information in a way that publishers object to. It’s a major win for research and innovation, which will hopefully pave the way for courts and Congress to further curb abuse of the CFAA.
https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...
That is a blatant misrepresentation of that decision. That decision was upholding a lower court's preliminary injunction that prevents LinkedIn from blocking hiQ while the main case between the two is litigated. It is not a final decision and it doesn't purport to say that scraping is legal (it even points out other laws besides the CFAA that might be used to prohibit scraping.)
LinkedIn Sales Navigator is a paid tool which allows you to search their whole database. Then depending on how much you pay you can get all their personal details (Email address, phone number, even their address sometimes.) https://business.linkedin.com/sales-solutions/sales-navigato...
I've always been a little confused how this works. If I got all that info for free, it's a "data leak", but if I pay to get the same detailed personal information it's...
In either case my personal data is given away without my consent, but there's this implication that it's only an issue when someone doesn't pay for it.
LinkedIn gives away your email address and phone number (even if you provided them just for 2FA) to all your contacts. I checked PDL; it has all the information from LinkedIn except for my phone number, which I promptly removed once I identified the 2FA issue (TOTP is available now).
'Mobile Proxies' like https://oxylabs.io/mobile-proxies (no affiliation) allow you to use large pools of mobile or domestic IPs to scrape. It's expensive, but not prohibitively so. Once you've got a mobile IP you become incredibly hard to throttle, since you're behind a mobile NAT gateway.
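Mechanically, using such a pool from a scraper is about a one-liner per request. A sketch with the requests library; the proxy endpoints below are placeholders for whatever the provider hands you:

    import itertools
    import requests

    # Placeholder endpoints; a commercial pool gives you thousands of these (or a single
    # rotating gateway that maps each request to a different exit IP).
    PROXIES = itertools.cycle([
        "http://user:pass@proxy1.example.net:8080",
        "http://user:pass@proxy2.example.net:8080",
    ])

    def fetch_via_pool(url):
        proxy = next(PROXIES)
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)

    print(fetch_via_pool("https://example.com/").status_code)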
You probably have to be highly distributed. At least that’s what I did when I tried to scrape a large site some years ago. I had around 100 machines in different countries and gave each of them random pages to scrape.
Distributed bot and scraper networks. Thousands of IPs geographically dispersed throughout the world. There is only so much you can do with rate limiting.
Scraping LinkedIn is so common you can usually hire people with years of experience in it. It is not as complicated as you might think. There are at minimum hundreds of companies that sell LinkedIn data they have scraped.
I scraped 10 million records from linkedin a few years ago from a single ip by using their search function. I got a list of the top 1000 first names and top 1000 last names and wrote a script to query all combinations and scrape the results.
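Mechanically that's about as simple as scraping gets. A sketch of the enumeration part only; the search URL and result parsing are placeholders, and a real run would obviously need to respect whatever limits and terms apply:

    import itertools
    import time
    import requests

    first_names = ["james", "mary", "john"]  # in practice: the top ~1000 of each
    last_names = ["smith", "johnson", "lee"]

    for first, last in itertools.product(first_names, last_names):
        # Placeholder endpoint standing in for whatever people-search URL was queried.
        resp = requests.get("https://example.com/search",
                            params={"q": first + " " + last}, timeout=10)
        # ...parse resp.text and store any hits...
        time.sleep(1.0)  # stay well under rate limits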
It looks like the purpose was data enrichment, so maybe it was pieced together over time from multiple sources. My linkedin from PDL only had 1 bit of wrong info. I wasn't able to find anything on my personal email addresses which is good.
I once worked on a project that tried to do just that, but at the time the LinkedIn API was already limited to seeing the authenticated user's connections' connections, which was too limited for what we wanted to do. I can only imagine it has gotten worse.
It's also the reason recruiters really want to connect to you on LinkedIn because even if you are not interested, your connections might be.
Hey - not related to your comment (apologies) but wanted to get in touch . You left a note on a previous post of mine about wanting to simplify FTP. I'd love to work on this project and wanted to see if you'd be willing to connect so I can understand the problem better. Feel free to email me at kunal@mightydash.com, and thanks in advance!
Haha, when I was a kid and scared to use my real name for things, for some reason I used my email (which had my real name in it) to open a GitHub account with a fake name.
So the api knows me as the famous architect, Art Vandelay
Wow.. I checked with an email address I use for disposable purposes. The only thing they had on it was a blank LinkedIn profile -- meaning that LinkedIn cancer has trawled some pretty questionable sites, harvesting email addresses as placeholders for their accounts. WTF.
It returned a 404 for my personal email account, so that appears to be sufficiently protected.
More surprisingly, it had data such as my name, title, and work email address connected to an old work email account (Okta-managed G Suite) that I never associated with external services, and absolutely never used on a social networking site like LinkedIn.
Yeah no kidding. Though if you wait until it flips to a new minute and refresh, that helps. Though it takes all of a minute to register a free key, so probably no big deal.
I'm actually a bit surprised at how little data they have on me. They've associated my main email with an old junk email, they've got my first and last name, and know that I'm male, but there's little more.
Nothing for most of my accounts, except one which somehow was falsely attributed to someone else. Odd, given I do have a LinkedIn profile; their scraping must be far from perfect.
My personal email seems to be based on Github and Gravatar, while my job search and work emails got linked together and appear to be based on LinkedIn.
I would be really surprised if this were compliant with the GDPR. I live in the US, but I tried email accounts of relatives in Europe and they had data in there.
Oh yes, I'm going to try and see if they have data on me and send a number of GDPR requests if they do. For others from the EU, it's very easy to do using: https://www.mydatadoneright.eu/request
They aggregated the data and published it so that the viral breach would spread their name around because all publicity is good publicity.
Just riffing of course.
This makes me want to talk to a lawyer.
It doesn't matter then if you bind it to 0.0.0.0.
Welcome to the early 90s internet.
This is just another symptom of the Principal-agent problem writ large.
Disclaimer: I'm one of the creators of yourdigitalrights.org.
I’m of the opinion it’s too late for prevention and we need, instead, mitigation.
So depending on how the "anonymous" phone number was used, it's plausible that the number can be connected with other PII.
In fact I wonder if there is any such thing as non-PII, given the existence of such companies.
"Oxy" most likely stands for Oxylabs[1], a data mining service by Tesonet[2], which is a parent company of NordVPN.
It is probably safe to assume, that LinkedIn was scraped using a residential proxy network, since Oxylabs offers "32M+ 100% anonymous proxies from all around the globe with zero IP blocking".
[1] https://oxylabs.io/
[2] https://litigation.maxval-ip.com/Litigation/DetailView?CaseI...
[1] https://vpnscam.com/wp-content/uploads/2018/08/2018-08-24-09...
Deleted Comment
Is it still possible if you pay LinkedIn enough? Or is this old data?
[source: experience]
That's some extremely shady thing to do.
Pretending to be Googlebot also helps.
Clever. VMs with IPV6 are cheap as a bonus :)
Same for non-js mobile. Thanks for the tips
How would someone do that using node.js? Asking for a friend.
It's hard to counter a determined scraper.
Needing to be logged in as the same user defeats the purpose of proxying to hide your physical origin.
Registering thousands of different users to use in a distributed way is hard now that they require a text message verification for new accounts.
This may or may not still work.
You can try it for yourself by changing the email. All of the information is public, so I don't mind. They are basically doing data integration.
Strangely it only says I work in real estate (no I don't) when I looked up the email address I use for LinkedIn...
e75ac28b25480e60071b24d819d4692a0b315c037046b9ff6ec9dfb1e99a895c