Readit News
jandrewrogers · 2 months ago
People don't grasp how easy it is to build data models like this even without privileged first-party data access.

In 2012 I created a killer prototype that demonstrated that you could accurately reconstruct most people's flight history at scale from social media and/or ad data. Probably the first of its kind. This has been possible for a long time.

A quick sketch of how it worked:

We filtered out all spatiotemporal edges in the entity graph with an implied speed of <300 kilometers per hour or <200 kilometers distance, IIRC. This was the proxy for "was on a plane". It also implicitly provided the origin and destination.
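
A minimal sketch of that filter, assuming each edge already carries a great-circle distance and an elapsed time. The `Edge` type and its field names are hypothetical, and the thresholds are the "IIRC" figures quoted above, so treat them as approximate.

```python
from dataclasses import dataclass

# Thresholds as recalled in the comment above; approximate by the author's own account.
MIN_SPEED_KMH = 300.0
MIN_DISTANCE_KM = 200.0


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge: two consecutive sightings of one entity."""
    entity_id: str
    distance_km: float    # great-circle distance between the two sightings
    elapsed_hours: float  # time between the two sightings


def looks_like_flight(edge: Edge) -> bool:
    """Keep an edge only if it is both long enough and fast enough."""
    if edge.elapsed_hours <= 0:
        return False  # simultaneous or out-of-order sightings imply no usable speed
    implied_speed = edge.distance_km / edge.elapsed_hours
    return edge.distance_km >= MIN_DISTANCE_KM and implied_speed >= MIN_SPEED_KMH


edges = [
    Edge("a", distance_km=3900.0, elapsed_hours=6.0),  # 650 km/h over a long hop: kept
    Edge("b", distance_km=260.0, elapsed_hours=5.0),   # 52 km/h: dropped as too slow
    Edge("c", distance_km=40.0, elapsed_hours=0.1),    # 400 km/h but too short: dropped
]
flight_edges = [e for e in edges if looks_like_flight(e)]
```

The distance floor matters because sparse or noisy location fixes can imply high speeds over short hops that have nothing to do with flying.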

These edges can be correlated with both public flight data and maintenance IoT data from jet engines to put entities on a specific flight. People overlook the extent to which innocuous industrial IoT data can be used as a proxy for relationships in unrelated domains.
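
The correlation with public flight data could look roughly like the sketch below. It only covers the schedule side (the engine IoT data is left out), and the `Flight` and `FlightEdge` types, the flight numbers, and the 30-minute slack are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Flight:
    """Hypothetical row from a public flight-status feed."""
    flight_no: str
    origin: str
    destination: str
    departed: datetime
    arrived: datetime


@dataclass(frozen=True)
class FlightEdge:
    """Hypothetical edge whose endpoints were already snapped to airports."""
    origin: str
    destination: str
    last_seen_at_origin: datetime
    first_seen_at_destination: datetime


def candidate_flights(edge, flights, slack=timedelta(minutes=30)):
    """Flights on the same route that fit inside the edge's time window."""
    return [
        f for f in flights
        if f.origin == edge.origin
        and f.destination == edge.destination
        and f.departed >= edge.last_seen_at_origin - slack
        and f.arrived <= edge.first_seen_at_destination + slack
    ]


# All timestamps are treated as UTC for simplicity.
edge = FlightEdge("SFO", "JFK", datetime(2012, 5, 1, 7, 40), datetime(2012, 5, 1, 17, 10))
flights = [
    Flight("XX100", "SFO", "JFK", datetime(2012, 5, 1, 8, 0), datetime(2012, 5, 1, 16, 30)),
    Flight("YY200", "SFO", "JFK", datetime(2012, 5, 1, 13, 0), datetime(2012, 5, 1, 21, 30)),
    Flight("ZZ300", "SFO", "ORD", datetime(2012, 5, 1, 8, 15), datetime(2012, 5, 1, 14, 20)),
]
matches = candidate_flights(edge, flights)
```

The tighter the two sightings bracket the flight, the fewer candidates survive the window test.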

In rare cases, there was more than one plausible commercial flight. Because we had their flight history, we assumed in these cases that it was the primary airline they had used in the past, either generally or for that specific origin and destination. This almost always resolved perfectly.
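
One way to express that tie-break, as a sketch only: the order here (same-route history first, overall history second) and the rule that ties stay unresolved are my assumptions, and the airline codes are placeholders.

```python
from collections import Counter
from typing import List, Optional, Tuple

Route = Tuple[str, str]         # (origin, destination)
Candidate = Tuple[str, str]     # (airline, flight number)
PastFlight = Tuple[str, Route]  # (airline, route) already resolved for this entity


def resolve(candidates: List[Candidate], route: Route,
            history: List[PastFlight]) -> Optional[Candidate]:
    """Pick the candidate whose airline the entity has used most in the past."""
    if len(candidates) == 1:
        return candidates[0]
    on_route = Counter(airline for airline, r in history if r == route)
    overall = Counter(airline for airline, _ in history)
    for counts in (on_route, overall):
        ranked = sorted(candidates, key=lambda c: counts[c[0]], reverse=True)
        # Require a strict lead; a tie at this level falls through to the next one.
        if len(ranked) >= 2 and counts[ranked[0][0]] > counts[ranked[1][0]]:
            return ranked[0]
    return None  # still ambiguous


history = [
    ("XX", ("SFO", "JFK")), ("XX", ("SFO", "JFK")),
    ("YY", ("SFO", "LAX")), ("YY", ("SFO", "LAX")), ("YY", ("SFO", "LAX")),
]
candidates = [("XX", "XX10"), ("YY", "YY20")]
same_route = resolve(candidates, ("SFO", "JFK"), history)  # route history decides
new_route = resolve(candidates, ("SFO", "SEA"), history)   # falls back to overall history
```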

This was impressively effective and it didn't require first-party data from airlines or particularly sophisticated analytics. Space and time are the primary keys of reality.

gruez · 2 months ago
>We filtered out all spatiotemporal edges in the entity graph with an implied speed of <300 kilometers per hour or <200 kilometers distance, IIRC. This was the proxy for "was on a plane". It also implicitly provided the origin and destination.

Sounds like the bigger issue is that you're able to get "spatiotemporal" data in the first place? Otherwise it's like saying "we can figure out all stores you've been to, if we have your credit card transaction history". Sure, it's kinda creepy that you can figure out which stores I went to, but the bigger problem is that you can get the transaction data in the first place. Moreover whatever "spatiotemporal" data needed to reconstruct such flight history is probably more valuable than the flight history itself. Who cares if you know Joe flew on United 8340 when you have hour-by-hour updates on his rough location?

AnthonyMouse · 2 months ago
> Otherwise it's like saying "we can figure out all stores you've been to, if we have your credit card transaction history".

The preposterous thing is that payment processors aren't just allowed to collect this information and tie it to your name, they're required to do that.

People talk a big game about fighting fascism, but how can you allow these laws to exist if you can contemplate what happens when actual fascists get hold of that data going back decades? They need to be dismantled now.

jandrewrogers · 2 months ago
> Sounds like the bigger issue is that you're able to get "spatiotemporal" data in the first place?

Almost all data is spatiotemporal data, people just aren't used to thinking about it like that. Everything that "happens" is an event with associated times and places.

Tagging of events with spatiotemporal attributes, or with metadata that can be used to infer spatiotemporal attributes, is pervasive. Every system data passes through, even if not the creator of it, observes the event of the data passing through it. Event observation is not trying to track things but it implicitly and necessarily creates the data that makes tracking and spatiotemporal inference possible.

These kinds of analyses rely almost entirely on knowing the events occurred; you could encrypt the contents of the data and it wouldn't matter. Software leaks spatiotemporal event context everywhere across myriad systems, internal and external, that incidentally collect it. There isn't anything nefarious about most of it and much of it is required for reasons of criminal and civil liability.

What people underestimate is that you can analytically stitch together many unrelated sparse data sources with spatiotemporal attributes, many of which are quite crap or seemingly unfit for purpose, to reconstruct a dense high-quality graph. Counter-intuitively, diverse and seemingly irrelevant data sources often produce better data models. It surfaces bias, errors, manipulation, and processing artifacts in individual sources you might otherwise miss.

It is much more difficult to access the obvious first-party data sources than it used to be, mostly because people with that data are far more selective about who they give access. It doesn't really matter, that is a speed bump for the unsophisticated. The exponential growth in the scale and diversity of network-connected telemetry of all types pretty much guarantees these data models will always be constructible.

The historical limiter has always been the absence of data infrastructure platforms that can handle these kinds of analytics at scale.

magicalist · 2 months ago
> Sounds like the bigger issue is that you're able to get "spatiotemporal" data in the first place?

Yeah, this just sounds like it's written from the perspective of a data broker.

Tying particular ad analytics (presumably ip geolocation?) to thousands of particular individuals and having it well populated enough to track them is "privileged first-party data access" by another name.

gmueckl · 2 months ago
Twitter has had timestamped and geotagged posts for ages. Just clustering things like hashtags of tweets spatiotemporally results in a treasure trove of information about events.
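
As a rough illustration of what "clustering spatiotemporally" can mean in its simplest form, the sketch below bins posts into a coarse grid cell and hour, then keeps hashtags that pile up in one bucket. The tuple layout, the 0.1-degree cell size, and the sample posts are assumptions; a real pipeline would more likely use a proper clustering algorithm.

```python
import math
from collections import Counter
from datetime import datetime


def bucket(lat, lon, ts, cell_deg=0.1):
    """Coarse space-time bucket: cells about 11 km tall, one-hour slots."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg), ts.strftime("%Y-%m-%d %H"))


def hot_spots(posts, min_posts=3):
    """Count (hashtag, cell, hour) combinations and keep the busy ones."""
    counts = Counter((tag,) + bucket(lat, lon, ts) for tag, lat, lon, ts in posts)
    return {key: n for key, n in counts.items() if n >= min_posts}


# Made-up posts: (hashtag, latitude, longitude, timestamp)
posts = [
    ("#concert", 48.853, 2.341, datetime(2012, 6, 1, 20, 5)),
    ("#concert", 48.857, 2.346, datetime(2012, 6, 1, 20, 20)),
    ("#concert", 48.851, 2.349, datetime(2012, 6, 1, 20, 55)),
    ("#concert", 40.712, -74.006, datetime(2012, 6, 1, 20, 10)),
    ("#lunch", 48.853, 2.341, datetime(2012, 6, 1, 12, 0)),
]
hot = hot_spots(posts)
```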

I'm sure that other platforms attach the same kind of info to posts. It's just a matter of scraping it.

throawayonthe · 2 months ago
but it's obviously very easy to get from social media? e.g. you have a post from paris and then later that day a post from brussels
bsder · 2 months ago
> People don't grasp how easy it is to build data models like this even without privileged first-party data access.

People on this site probably understand this better than 99% of the world.

The problem is "What can I, as an individual, do about it?"

autoexec · 2 months ago
block ads, stay off most social media, don't use mobile devices while traveling
Den_VR · 2 months ago
You can also exploit it for personal profit. As for stopping it, good luck. Best case is probably to degrade or poison data sources in a preferably legal way.
noman-land · 2 months ago
Where do you get "maintenance IoT data from jet engines"?
supportengineer · 2 months ago
Exactly. This does not pass the sniff test.
jszymborski · 2 months ago
Indeed, seems like it's way easier to just go the data broker route.
smcin · 2 months ago
Can you post a link/self-citation to any 2012 post on that prototype? I couldn't find it.

Was this only on US data, or EU or worldwide? And when you say "ad data", do you mean like ADINT?

And when you infer "was on a plane" from spatiotemporal edges in the entity graph with an implied speed of >300 km/h and >200 km distance, does that only work if the individual social-media app itself is collecting real-time geolocation? So if you only access social media from the browser not the app, does that partially prevent this privacy exploit?

zahlman · 2 months ago
The thing that really tickles me is that there's supposedly all this frightening information that can be gathered on people, including by investigating the history of ads they were served; but then in the vast majority of cases the only use the Bad Guys ever seem to come up with for that information is to serve you more ads.
bmitc · 2 months ago
> People don't grasp how easy it is to build data models like this even without privileged first-party data access.

> We filtered out all spatiotemporal edges

How were you getting spatiotemporal data?

tqi · 2 months ago
Presumably ICE is trying to determine what cities / countries a person has visited and when, ie your starting point.
justanything · 2 months ago
Can you eli5 the implementation and how your prototype worked?
gleenn · 2 months ago
Sounds like if you have a record of a lot of location/timestamp data for people, you look at the distance difference divided by the time difference. Now you have average speed for any pair of points. Now filter where the average speed is as fast as a Boeing jet. That filters out most of the data except for people who are almost certainly on a plane. Et voila, you now look at the geolocation of those data points and you have people who traveled from one city to another, because you already have the location. Compare City1 -> City2 with any public flights between those cities around those times and you know who flew on what flight from where to where and at what time.
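
The "distance divided by time" step can be sketched from raw sightings as below. The observation tuples and coordinates are invented, the 300 km/h default is the figure quoted earlier in the thread, and the haversine formula assumes a spherical Earth.

```python
import math
from collections import defaultdict
from datetime import datetime

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def fast_hops(observations, min_speed_kmh=300.0):
    """Yield (user, start, end, km, kmh) for consecutive sightings that imply air travel.

    observations: iterable of (user, timestamp, lat, lon)
    """
    by_user = defaultdict(list)
    for user, ts, lat, lon in observations:
        by_user[user].append((ts, lat, lon))
    for user, points in by_user.items():
        points.sort()  # chronological order per user
        for (t1, la1, lo1), (t2, la2, lo2) in zip(points, points[1:]):
            hours = (t2 - t1).total_seconds() / 3600
            if hours <= 0:
                continue
            km = haversine_km(la1, lo1, la2, lo2)
            if km / hours >= min_speed_kmh:
                yield user, (la1, lo1), (la2, lo2), km, km / hours


observations = [
    ("u1", datetime(2012, 5, 1, 8, 0), 37.62, -122.38),   # near SFO
    ("u1", datetime(2012, 5, 1, 14, 0), 40.64, -73.78),   # near JFK, six hours later
    ("u2", datetime(2012, 5, 1, 9, 0), 48.86, 2.35),      # Paris
    ("u2", datetime(2012, 5, 1, 14, 0), 50.85, 4.35),     # Brussels, five hours later
]
hops = list(fast_hops(observations))
```

Only the first user survives here. A Paris to Brussels hop spread over five hours averages far below the threshold, which is why the average speed between two sparse sightings understates the actual travel speed and short trips are hard to classify this way.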
no_wizard · 2 months ago
What was your accuracy rate for this? I imagine it was quite high, but do you happen to remember what your +/- was?

Deleted Comment

wingspar · 2 months ago
Honestly asking, How did you validate your results?
jandrewrogers · 2 months ago
In this particular case it was just a proof-of-concept, albeit at scale. We did not run a proper ground-truthing process but people actually running that type of data model in production could have ground-truthed the analytic model if they wanted to.

However, it turns out that thousands of people like to talk about their flights on social media, so we scraped that as a spot check and it mostly lined up perfectly. Good enough for a demo and it would have been difficult to come up with an alternative explanation for the patterns in the data.

The purpose of the PoC was to sell the data analysis infrastructure that made that type of analysis possible at scale, it wasn't about the data per se. It was a compelling demo we invented given the data that happened to be available. Startup life.

fsckboy · 2 months ago
i don't have any special knowledge in this area, but just thinking about it idly while sitting here, "robbing their homes while they are away" comes to mind as a good proxy.
canadiantim · 2 months ago
Great info

Dead Comment

btown · 2 months ago
It's funny to see ARC just being described as a "data broker," which strongly implies that it doesn't play a role in facilitating the actual underlying consumer activity.

ARC and IATA absolutely do play such a role, as the financial clearinghouses for ensuring that travel agents (online and offline) and airlines can pay each other, and as gatekeepers/certification bodies for agencies to ensure these financial systems aren't abused.

Now, they absolutely do sell access to data to third parties, governmental and nongovernmental. But the reason they have this data isn't because they buy it to resell it; they are fully part of the funds flow for the underlying transaction. Whether they should be allowed to sell or share non-anonymized data on passenger records and prices paid is a very good question, but at the very least this is about as first-party as data gets.

https://www.altexsoft.com/blog/airline-reporting-corporation... describes some of these flows. (Here be dragons.)

ujkiolp · 2 months ago
this counters none of the points covered in the article
btown · 2 months ago
...nor is it meant to?

Two things can be true simultaneously: (a) it is worrisome that a company is selling PII at scale to government entities who would otherwise need to request that data through accountable warrant processes, and (b) we shouldn't call every such company a "data broker" lest we dilute the specificity of that term, particularly when the companies in question participate in the funds flow of the customer transaction.

Dead Comment

leblancfg · 2 months ago
The amount and extent of data that is available out there by brokers for purchase by literally any company is *mind-boggling*. However bad you think it is, multiply that by 10.
jeffbee · 2 months ago
I would say that in general the HN crowd doesn't understand the industry at all, and they need to change the direction of their understanding, rather than the magnitude. Your basic hackernews believes that e.g. Google is out there selling all your personal information. But compared to these other industries the tech industry is almost airtight. It has long been possible for someone to pick up the phone and order, in any format they want, transaction data as narrowly targeted as they wish. Credit card line items for 35-year-old dentists living on the 400 block of Elm street in local town? By end of day.
supriyo-biswas · 2 months ago
This is correct; what people fundamentally misunderstand is that data brokers directly sell personal information about people, but Google and Facebook only allow for targeted advertising while keeping personal information within the confines of their company.
taeric · 2 months ago
It has been truly frustrating when people will blame the "tech industry" for what is essentially reckless behavior from other industries. For a while, it was often the finance sector that did most of the crazy stuff. With crypto being an obnoxious overlap of the two.
everdrive · 2 months ago
I'm also surprised that this is so hidden from everyone. Where are the engineers leaking secrets? Much of the online discourse is pure speculation based on what can be observed from the very end of the chain. (ie, what your computer is giving up) The speculation is not necessarily _incorrect_ but is too vague to be useful to anyone. Where does my data _actually_ go? Does anyone know? Can anyone describe the life of my data as it goes through the whole ecosystem? Does anyone know what mitigations are, and are not effective?
sofixa · 2 months ago
> Your basic hackernews believes that e.g. Google is out there selling all your personal information

To add to this, any mention of "telemetry" is taken to mean your PII being taken by bad actors to abuse, instead of what it is in 99% of cases, which is usage statistics. (X% of our users use feature A, it merits investment). It can be both, but there's usually no place for differentiation, just pitchforks.

ck_one · 2 months ago
Is that actually possible? Can we do a live test here?

Let's say we want this dataset: Credit card line items for 35-year-old dentists living on the 400 block of Elm street in local town

How much do I have to pay you to get it?

worik · 2 months ago
> Credit card line items for 35-year-old dentists living on the 400 block of Elm street

I do not believe that. I would like evidence before I am convinced

If my bank is releasing that data I am horrified. I live in New Zealand and our privacy laws are clear: it would be illegal

flossposse · 2 months ago
I think the HN crowd is especially vocal about the tech industry in particular because that's the industry a lot of us have first-hand knowledge of - we know from personal observation that it is anything but airtight
criddell · 2 months ago
> Your basic hackernews believes that e.g. Google is out there selling all your personal information.

I think most people here understand that Google sells ads against that data, but they aren't selling the data.

southernplaces7 · 2 months ago
Okay, and who are these people you contact for this data, and how do they themselves obtain it so precisely? You say the big tech industry is pretty air-tight about sharing data, so how does mysterious X company have on hand the credit ratings of all those youngish dentists on Elm street, among other kinds of information? How do these dynamics work, since you seem to know it internally?
Melatonic · 2 months ago
Any way to opt out of this type of data collection per company? I know for some things you can contact each individual broker and opt out (via some identifier like your email address) of your data being at least publicly available
trollied · 2 months ago
A colleague created a banner ad that was an image that had the text “told you I could do this mate!” and targeted an individual to prove a point.

The general public have no idea how much ad providers and data brokers know about them.

rvnx · 2 months ago
Seems just like retargeting in that case. Ask the “victim” to visit page A. On page A, place a retargeting pixel; from then on you can display a message to that user anywhere on the Internet, as long as you are willing to pay a high price for that impression (a high price here is way, way, way less than 0.1 USD).
lyton · 2 months ago
Reminds me of the time when Signal (the private messaging app) tried to get ad data from Facebook and show it to users with a high degree of specificity, e.g. “You got this ad because you’re a middle aged woman who enjoys kpop and loves reading about Christopher Nolan”

Relevant article: http://archive.today/fzUL4

rustcleaner · 2 months ago
I need you to tell me how I do this right now. This will put so much cred into my spiels with people in meatspace. So many bricks will be shat!
JohnMakin · 2 months ago
I work in this space - I'd say 1000x.
southernplaces7 · 2 months ago
I asked this same thing in another comment here, but since you mention working in this space, I ask you directly. Where do the brokers obtain their data from? If it's easy for them to obtain, would those who buy it from brokers not be able to simply get it from its respective sources? I'm genuinely curious about how this dynamic works.
OsrsNeedsf2P · 2 months ago
Could you elaborate with specifics? If it's this bad, why haven't we heard anything from a whistleblower or seen a good demo?
Melatonic · 2 months ago
Any way to combat it or stop your info from being overly harvested?
blindriver · 2 months ago
Around 2014 I worked with recruiters and they had a tool that aggregated data on everyone through LinkedIn, Yelp, Twitter, GitHub, Eventbrite, etc. It was breathtaking the amount of information you could get on anyone, over 10+ years ago.

I’m guessing with the help of Palantir, the government has even more data and can probably link Reddit posts etc based on styleometry and can even perform psychological analysis on your personality and tendencies, etc.

worik · 2 months ago
> it was breathtaking the amount of information you could get on anyone, over 10+ years ago.

After being burnt by things taken from my social media out of context, used to publicly shame me, I locked down my social media

Am I "sweetly naive" to think that had an effect? I do think it did

Before I stopped using Facebook I noticed, over the last decade, that almost every account I encountered was locked down similarly

My point is I suspect it is getting harder, not easier, for data thieves. The golden age of data theft has passed. Maybe.

rustcleaner · 2 months ago
>styleometry

I really need to start using PocketPal (local LLM on Android) to restate my messages.

---

Oh, the places I'd like to send my texts so fine, With PocketPal, a tool that's truly divine, Local LLM on Android, a wondrous device to see, To help me restate my messages with glee! Wheee!

kevin_thibedeau · 2 months ago
The government has been buying and funding R&D with data brokers since before Google existed.
southernplaces7 · 2 months ago
My question here is also how the brokers obtain the data themselves? Wouldn't it be simple for those who buy it from the brokers at a markup to just get it from its original sources themselves? Also, if the data is in any case available, the real at-fault culprits aren't so much the brokers as those who store and so easily sell it in the first instance.
roadside_picnic · 2 months ago
> Wouldn't it be simple for those who buy it from the brokers at a markup to just get it from its original sources themselves?

In many cases joining datasets is both labor intensive and creates a surprising amount of new information, and there is also plenty of "free" data that is incredibly tedious to work with.

I used to work with real estate data for the government, and if you search for any common things you might want to know, you often land on a data broker's page even though property assessor data is freely available in most counties. The problem is that each county has its own system for storing data and its own process for searching it. It's a lot of work to learn how just this one dataset works; combining it for all counties in the US is a massive project.

Whenever I buy a new home I always look up all my neighbors, figure out when they bought the house, how much they paid etc. Some people get freaked out by this, but this information is public in most counties.

By joining this data with another public data set, you can actually figure out which lender your neighbors used and what their reported income was at the time of sale, along with their age and ethnic background.
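
To make the "every county stores it differently" problem concrete, here is a toy sketch. Both county schemas, every field name, the addresses, and the lender dataset are hypothetical; real assessor exports differ and address matching is far messier than a string compare.

```python
def from_county_a(row):
    """County A (hypothetical): US-style dates, plain integer prices."""
    month, day, year = row["SALE_DT"].split("/")
    return {
        "address": " ".join(row["SITUS"].upper().split()),
        "sale_date": f"{year}-{month}-{day}",
        "price": int(row["SALE_AMT"]),
    }


def from_county_b(row):
    """County B (hypothetical): ISO dates, prices formatted like "$1,234"."""
    return {
        "address": " ".join(row["addr"].upper().split()),
        "sale_date": row["sold"],
        "price": int(row["price"].replace("$", "").replace(",", "")),
    }


county_a = [{"SITUS": "12  Elm St", "SALE_DT": "03/15/2019", "SALE_AMT": "410000"}]
county_b = [{"addr": "7 oak ave", "sold": "2021-07-02", "price": "$385,500"}]

# Step 1: one adapter per county brings everything into a common shape.
sales = [from_county_a(r) for r in county_a] + [from_county_b(r) for r in county_b]

# Step 2: join against a second (hypothetical) public dataset keyed by normalized address.
loans = {"12 ELM ST": {"lender": "Example Bank"}}
joined = [{**sale, **loans.get(sale["address"], {})} for sale in sales]
```

The join itself is one line; the labor is in writing and maintaining an adapter for every county, which is the work the brokers are effectively selling.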

Of course there are plenty of other ways data brokers come across data, but even cleaning up and joining public data can require a fair bit of time and expertise.

Deleted Comment

victorbjorklund · 2 months ago
Sellers of the data wanna deal with one or a few buyers that buy bulk. They don't wanna deal with thousands of customers.
greenie_beans · 2 months ago
what are some good cheap sources to get this? i have an art project idea that i've wanted to make that would require invasive data profiles, but it's a very big project and i have no idea where to start
onlyrealcuzzo · 2 months ago
Further, they are literally in the business of selling your data for a profit.

It should not be surprising that they are selling your data for a profit...

willguest · 2 months ago
It's amazing to me that the market for data is so well hidden from public view. So many large companies are mining and trading data on a daily basis - you would think that a data marketplace would have been a thing by now, especially with all the noise about "decentralisation" (yes, I know, crypto shill bros).

I've been touting this as a business model for years. Better still, I'd like to see it done with behavioural models (in the open). That would really blow the lid off the industry. Imagine people charging companies, instead of simply being the product...

Hilift · 2 months ago
Is it really that hidden? In 2021, a guy went to another person's home to exact revenge for something 50 years earlier. Security video showed him holding the PeopleFinders folder. What should surprise people is their governments are selling some of the data.
willguest · 2 months ago
Thank you for making my point.

Here's some research aided by Perplexity, which estimates that the global data market is valued at about $1.7 Trillion, with data monetization growing at about 17.6% CAGR:

https://www.perplexity.ai/search/today-i-would-like-to-try-a... (138 sources)

Also, Meta can identify you based on your movement and a few pieces of social data (all of which is in the open).

Tel Aviv airport has been running behavioural monitoring for about a decade, predicting crimes before they happen.

You mention a case from 2021, which is about $5 trillion ago, and think that the government selling data is surprising. This is a mature market that already knows everything about everyone, especially in the US, and is more concerned with what to do with it. The faucet is open, the ground floor is flooded, and we're discussing the different types of fish that have moved into our apartment.

sixothree · 2 months ago
Yes! It is hidden. Go and get your data from this company. Report the results.
libraryatnight · 2 months ago
Just shut it down and turn it all off. Thinking of ways to profit from this behavior is perverse.
willguest · 2 months ago
Thinking of ways to profit from it is the absolute norm but, yes, it is perverse.

I'd happily run it as a non-profit with the purpose of highlighting the value of people's data. Tough gig though, when there are all these "off switch" guys around.

AlexandrB · 2 months ago
I don't get it. Why would CBP and ICE need to buy this from a data broker? The TSA is right there scanning everyone's boarding pass as part of going through security.
Beretta_Vexee · 2 months ago
Because there is probably a well-defined regulatory framework for accessing data collected by the TSA, whereas there are few or no requirements when the same data is purchased from a broker.

It is not even certain that the data actually comes from the TSA. It could come from airlines, payment companies, etc.

There is no guarantee of quality when purchasing data from a broker.

mrweasel · 2 months ago
The regulatory angle at least explains part of my wondering. I'm not really surprised that they have access to this information, I'm just surprised that they buy it, rather than just demanding it be handed over.
roadside_picnic · 2 months ago
When I worked for the federal government I wanted to collect some publicly visible tweets (this was before the Library of Congress started to harvest them, and back when the API was better). As a government employee I had to write a detailed document covering why I needed this data, what PII would be stored, how long it would be stored, and how I would ensure it had been deleted. Then that document had to be approved. Even though this is a project that any person could have done on the weekend, I still had to go through all this work for approval, then collect the data.

But you're proposing something even more outlandish: asking another agency for data. The politics of this are mind-bending. If one agency gives its data to another and that agency is successful using it, it will make the giving agency look bad, which is unacceptable. It was wild how many times another, supposedly friendly agency would not share data. In fact, I was cautioned not to even bring up the idea in shared meetings because it would create unnecessary friction.

If you buy it from a 3rd party government contractor, none of this has to happen.

DistractionRect · 2 months ago
Probably because the TSA isn't able/allowed to hand out access willy nilly.

It's kinda like how the police need warrants to request cellphone data, but cellphone companies could sell realtime data to third parties who in turn sold it to the police.

https://news.ycombinator.com/item?id=17081684

AlexandrB · 2 months ago
It's fine to speculate, but I really wish the article had made it explicit given that the EFF has actual lawyers on staff.
krunck · 2 months ago
Government uses corporations to get around laws and the constitution. Corporations in turn get to use government to get around regulation. Same as it ever was.
fnordpiglet · 2 months ago
Beyond the other reasons stated re: regulations and law, which this government seems to be more than willing to ignore, the process of setting up reliable feeds of usable data between organizational functions can be more difficult than buying the data from an entity whose profit derives from curation and distribution of the same data. It might seem absurd on the surface but paying a premium for a repackaging of the data is often meaningfully easier and more reliable and you probably save money in the end. The TSA tech team's role isn’t to package and enrich data with useful metadata, with documentation and SLAs, and their incentives don’t naturally align no matter how hard a political appointee bangs a table. The data broker has every incentive however, and will continue to in perpetuity.
hnburnsy · 2 months ago
ICE, CBP and TSA are all part of DHS, so of course they can share the Advance Passenger Information System data. Why are we paying third parties to provide it?
renewiltord · 2 months ago
At a company I once worked at, the data division of a company bought a list of their stores from us. Full polygons, visit durations, etc.
tonymet · 2 months ago
Suspects purchase a flight weeks + months before the flight. The TSA screens them just minutes before getting on.

Flight purchases would be critical and distinct information for law enforcement.

andrew_lettuce · 2 months ago
This is wrong. You need to provide your travel documentation id and they share your personal info well before you get on the plane
tonymet · 2 months ago
Leave it to hackernews to downvote the right answer. People were asking why, not "should they"
tgsovlerkhgsel · 2 months ago
This could actually be interesting because in many past egregious data broker cases, the offenders had no business in the EU so they could just laugh as they were handed one 20M fine after the other (e.g. Clearview), or they were making way more than 4% of their revenue in profit from privacy violations so they could just risk the fine.

But here, the controller of the data is the airline, the transfer to the data broker might be illegal, and an airline is the worst company to commit GDPR violations with: They have a lot of global revenue but a relatively thin margin, very little of that margin comes from data abuse (so they can't just shrug off the GDPR fine as a small cost of doing shady business), and they are reachable in the EU (worst case a member state can ground and confiscate their planes, and essentially ban them from flying to the EU by threatening to confiscate any other plane that lands). And yes, Germany will impound a plane to get debts paid: https://www.reuters.com/article/world/thai-prince-to-pay-bon...

manquer · 2 months ago
While airlines are the obvious source for such data sets, there are a number of other sources.

The barcode in the boarding pass contains all the information that airlines know about you [1]. It is after all only encoded and not encrypted and so many companies manufacture readers for it.
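
To show what "encoded, not encrypted" means in practice, here is a sketch that slices the fixed-width mandatory block of a boarding pass barcode string. The field names and widths are my understanding of the IATA bar-coded boarding pass layout and should be checked against the specification; the passenger, booking reference, and flight in the sample are invented.

```python
# (field name, width in characters); assumed layout of the 60-character mandatory block
FIELDS = [
    ("format_code", 1),
    ("number_of_legs", 1),
    ("passenger_name", 20),
    ("eticket_indicator", 1),
    ("booking_reference", 7),
    ("from_airport", 3),
    ("to_airport", 3),
    ("carrier", 3),
    ("flight_number", 5),
    ("day_of_year", 3),
    ("cabin_class", 1),
    ("seat", 4),
    ("checkin_sequence", 5),
    ("passenger_status", 1),
    ("conditional_size_hex", 2),
]


def parse_boarding_pass(raw):
    """Slice the mandatory block into named fields; no key or secret is needed."""
    fields, pos = {}, 0
    for name, width in FIELDS:
        fields[name] = raw[pos:pos + width].strip()
        pos += width
    return fields


# Invented sample, built field by field so the padding is explicit.
sample = (
    "M" + "1" + "DOE/JANE".ljust(20) + "E" + "ABC123".ljust(7)
    + "SFO" + "JFK" + "XX".ljust(3) + "0123".ljust(5)
    + "152" + "Y" + "012A" + "0042".ljust(5) + "1" + "00"
)
parsed = parse_boarding_pass(sample)
```

Anything that can scan the barcode and slice a string can read the name, booking reference, and itinerary, which is the point being made about how many parties touch this data.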

It could come from airport check-in systems, the baggage handling system, the duty-free shop, the airport lounge, and so on.

There are so many different players who have access to most or all of the data that it would be hard to prove it came from any one source at all.

That is just the barcodes on the boarding pass; passport scanners are like a couple of hundred dollars and airport shops/car rentals use them all the time.

Many airports use facial scanning these days and don’t even ask for boarding pass/passport/visa during boarding at all.

There are auxiliary sources which could be used in conjunction with other sources like Uber booking and so on.

[1] https://krebsonsecurity.com/2015/10/whats-in-a-boarding-pass...

sealeck · 2 months ago
I agree that they can get the data through other means. Not so sure about

> There are so many different players who have access to most or all of the data that it would be hard to prove it came from any one source at all.

Because a prosecutor can obtain copies of all emails talking about this, they can examine your bank accounts for payments from data brokers, they can require legal to give them copies of any contracts, they can look at audit logs from the production database and airlines aren't Evil Inc -- stuff will inevitably leak and get out. You can't cover yourself that well as a CEO looking to make a quick buck...

identigral · 2 months ago
https://github.com/yaelwrites/Big-Ass-Data-Broker-Opt-Out-Li... is a useful place to start for opting out. As of this writing, this list does not include Airlines Reporting Corporation (ARC), a data broker mentioned in the article.
gnabgib · 2 months ago
Little discussion 2 months ago (43+7 points, 2+3 comments) https://news.ycombinator.com/item?id=43949975 https://news.ycombinator.com/item?id=43952971