chaps · a year ago
I do work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated, categories changed during the ETL processes, rows missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!

FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.

pjot · a year ago
This is super true. For my city’s portal as well. I’ve found one way around this by versioning the dataset - that is, committing the diffs in git. Credit to Simon Willison’s git-scraping technique.

I do this with my power company’s outage map: https://github.com/patricktrainer/entergy-outages

67k commits!

https://simonwillison.net/2020/Oct/9/git-scraping/
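
The core of the technique is a scheduled job that fetches a resource and commits it only when it has changed, so git history becomes the change log. A minimal sketch as a GitHub Actions workflow, with a hypothetical endpoint URL and schedule (adapt both to your data source):

```yaml
# .github/workflows/scrape.yml
# Minimal git-scraping sketch; the URL and cron interval are placeholders.
name: git-scrape
on:
  schedule:
    - cron: "*/20 * * * *"   # every 20 minutes
  workflow_dispatch:          # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch latest data
        run: curl -s "https://example.com/outage-map.json" -o outages.json
      - name: Commit if changed
        run: |
          git config user.name "scraper-bot"
          git config user.email "actions@users.noreply.github.com"
          git add outages.json
          # Only commit when the file actually changed; otherwise exit quietly.
          git diff --cached --quiet || git commit -m "Latest data: $(date -u)"
          git push
```

The `git diff --cached --quiet ||` guard is what keeps the history meaningful: no-op fetches produce no commits, so every commit in the repo corresponds to a real change in the upstream data.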

chaps · a year ago
That's a really freaking neat trick. Thanks!
amy-petrik-214 · a year ago
Hah, that's classic politics.

Politician: "Hello John Q. Public, here's all our data! It speaks for itself."
John Q. Public: "Wow, you really improved over the last few years, homicide-wise."
Politician: "And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple, empirical, evidence-based governance!"

So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. It can be an ugly business, but it's necessary for the public good: it pushes data sharers to either share good data or not share at all, rather than share tricksy data in this unethical way.

stevage · a year ago
I worked in open data for quite a few years. This is a very weird take.

Open data portals generally have data in a useful form. FOI probably gives you PDFs.

chaps · a year ago
"FOI probably gives you PDFs."

Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.

Share with me some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.

bshep · a year ago
Where I grew up, the data for murders is curated in such a way that anybody who dies more than 24 hours after being attacked is not counted as a 'murder'. They do this to reduce the statistical murder rate.
chaps · a year ago
Can you say more about this?
whoiscroberts · a year ago
Well now we know why crime is down
kalendos · a year ago
I can only imagine. Many ETLs are already messy in companies with better tooling and processes.

Would love to read more about your experience with Open Data. Any place where I can reach out?

chaps · a year ago
Here's something about shotspotter data in Chicago: https://x.com/foiachap/status/1775296597850480663

And this one makes some rounds: https://mchap.io/that-time-the-city-of-seattle-accidentally-...

Feel free to reach out!

IanCal · a year ago
Although pre-cleaned data is often not reflective of reality either, and using it carefully often requires a lot more knowledge of the field.
gordon_freeman · a year ago
But even if a dataset is incomplete or inaccurate, do you think we could at least get directionally correct insights from it?
chaps · a year ago
Yes, of course we can. But I can't ignore the harms of doing so: misrepresenting the data in a way that prevents others from understanding what is or isn't there happens regularly. These datasets are often used as a political tool, and contracted with local universities to show that the city is providing data... without actually providing accurate data. Simultaneously, people who don't know data will champion it as accurate because it comes from a university program.

Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.

We can do better than that.

whitej125 · a year ago
Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could be useful in exploring or solving them.
maxverse · a year ago
Also, considering they just got a $68 billion budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.

[1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...

exegete · a year ago
I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.
doctorpangloss · a year ago
Why would a cost center political institution enumerate all its problems? It is kind of miraculous they can engage with the public this way at all.
slt2021 · a year ago
I could not find a dataset with payroll hours and overtime reimbursements reported for each MTA employee.

I wanted to investigate how well the MTA is managing its workforce and compensation (given that it's asking for an additional tax, in the form of congestion pricing, to fix its budget hole), but there seems to be no dataset for that.

Does anyone have links to an MTA payroll/hours/overtime dataset?

Or, alternatively, I need a dataset to study each and every subway improvement project, and the components of each project: materials, labor, etc.

WUMBOWUMBO · a year ago
perhaps this could be covered in a FOIA request
thecosas · a year ago
Time for someone to crack their knuckles and do a Power Broker-style MTA Open Data mashup :-)

https://en.wikipedia.org/wiki/The_Power_Broker

shrikar · a year ago
Tried building something with Cursor + ChatGPT in 30 mins; not bad for the initial exploration: https://www.youtube.com/watch?v=w3mkXPdTVlI and the demo link: https://mtachallenge.streamlit.app/
stevage · a year ago
Interesting, these open data challenges were all the rage 10 years ago. Wonder why the sudden trip down memory lane.
nocman · a year ago
I keep clicking on these 'MTA' articles expecting them to be about a "message transfer agent".

Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.