I work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated: categories get changed during ETL, rows go missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data, who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!
FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.
This is super true. For my city’s portal as well. I’ve found one way around this by versioning the dataset - that is, committing the diffs in git. Credit to Simon Willison’s git-scraping technique.
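Git shows you line diffs between commits, but for "which rows vanished?" it can help to compare two saved snapshots by record key. A minimal sketch, assuming the portal export is a CSV with a stable ID column. The column names and sample rows are made up for illustration, and this is not Simon's tooling, just the same idea at row level:

```python
import csv
import io


def load_rows(csv_text, key):
    """Parse CSV text into a dict mapping each row's key value to the row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def diff_snapshots(old_text, new_text, key):
    """Return the keys added, removed, or changed between two snapshots."""
    old = load_rows(old_text, key)
    new = load_rows(new_text, key)
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots: "case_id" and "category" are assumed column names.
OLD = "case_id,category\nA1,HOMICIDE\nA2,THEFT\nA3,ASSAULT\n"
NEW = "case_id,category\nA2,THEFT\nA3,BATTERY\nA4,THEFT\n"

result = diff_snapshots(OLD, NEW, key="case_id")
print(result)
```

In practice the two inputs would be the same file read from two different commits. Rows in "removed" or "changed" are the ones worth asking the publisher about.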
Hah that's classic politics
"Hello John Q. Public, here's all our data! It speaks for itself"
John Q. Public: "Wow, you really improved over the last few years, homicide-wise"
"And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple empirical, evidence-based governance!"
So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. Such a task can be an ugly business but necessary for the public good, pushing data sharers to either share good data, or not share, but not share tricksy data in this unethical way.
Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.
Share some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.
Where I grew up the data for murders is curated in such a way that anybody that dies 24 after being attacked is not considered a ‘murder’. They do this to reduce the statistical murder rate.
Yes, of course there can be. But I cannot ignore the harm done when the data is misrepresented in a way that keeps others from understanding what is or isn't there -- it happens regularly. These datasets are often used as a political tool, with the work contracted out to local universities to show that they're providing data... though not actually providing accurate data. At the same time, people who don't know data will champion it as accurate because it comes from a university program.
Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.
Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could be useful in exploring or solving them.
Also, considering they just got a 68 billion dollar budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.
I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.
I could not find a dataset with payroll hours reported and overtime reimbursed for each MTA employee.
I wanted to investigate how well the MTA is managing its workforce and compensation (such that it requires an additional tax in the form of congestion pricing to fix its budget hole), but there seems to be no dataset for that.
Does anyone have links to MTA payroll/hours/overtime datasets?
Alternatively, I need a dataset to study each subway improvement project and its components: materials, labor, etc.
I do this with my power company’s outage map: https://github.com/patricktrainer/entergy-outages
67k commits!
https://simonwillison.net/2020/Oct/9/git-scraping/
Open data portals generally have data in a useful form. FOI probably gives you PDFs.
Would love to read more about your experience with Open Data. Any place where I can reach out?
And this one makes some rounds: https://mchap.io/that-time-the-city-of-seattle-accidentally-...
Feel free to reach out!
We can do better than that.
[1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...
https://en.wikipedia.org/wiki/The_Power_Broker
https://new.mta.info/article/introducing-subway-origin-desti...
Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.