chaps · a year ago
I do work with "open data" on a near-obsessive basis and -- friends, please do not trust "open data" portals to reflect reality accurately. The datasets are often curated, categories changed during the ETL processes, rows missing, and things like that. For example, Chicago's "crimes" dataset intentionally doesn't include all homicides. Can't remember the exact dataset, but I once had a conversation with Chicago's head of open data who told me that they intentionally removed many rows because they were concerned that the public was going to misinterpret the results... but didn't make it clear that rows were missing. So I guess everybody gets the opportunity to misinterpret the results!

FOIA is the better alternative because it gives you the original, pre-cleaned data. Open data is a lie.

pjot · a year ago
This is super true. For my city’s portal as well. I’ve found one way around this by versioning the dataset - that is, committing the diffs in git. Credit to Simon Willison’s git-scraping technique.

I do this with my power company’s outage map: https://github.com/patricktrainer/entergy-outages

67k commits!

https://simonwillison.net/2020/Oct/9/git-scraping/
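
The core of the technique is a scheduled job that fetches a resource and commits it only when it has changed, so git history becomes the change log. A minimal sketch as a GitHub Actions workflow, with a hypothetical endpoint URL and schedule (adapt both to your data source):

```yaml
# .github/workflows/scrape.yml
# Minimal git-scraping sketch; the URL and cron interval are placeholders.
name: git-scrape
on:
  schedule:
    - cron: "*/20 * * * *"   # every 20 minutes
  workflow_dispatch:          # allow manual runs too
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Fetch latest data
        run: curl -s "https://example.com/outage-map.json" -o outages.json
      - name: Commit if changed
        run: |
          git config user.name "scraper-bot"
          git config user.email "actions@users.noreply.github.com"
          git add outages.json
          # Only commit when the file actually changed; otherwise exit quietly.
          git diff --cached --quiet || git commit -m "Latest data: $(date -u)"
          git push
```

The `git diff --cached --quiet ||` guard is what keeps the history meaningful: no-op fetches produce no commits, so every commit in the repo corresponds to a real change in the upstream data.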

chaps · a year ago
That's a really freaking neat trick. Thanks!
amy-petrik-214 · a year ago
Hah, that's classic politics.

Politician: "Hello John Q. Public, here's all our data! It speaks for itself."
John Q. Public: "Wow, you really improved over the last few years, homicide-wise."
Politician: "And so you see, a third party unrelated to us has just confirmed what a great job we're doing with simple, empirical, evidence-based governance!"

So that means what you want to do is specialize in identifying bias in these datasets and finding the smoking gun. It can be an ugly business, but it's necessary for the public good: it pushes data sharers to either share good data or not share at all, rather than share tricksy data in this unethical way.

stevage · a year ago
I worked in open data for quite a few years. This is a very weird take.

Open data portals generally have data in a useful form. FOI probably gives you PDFs.

chaps · a year ago
"FOI probably gives you PDFs."

Having submitted thousands of FOIA requests, I get the impression that you haven't, actually, submitted many FOIA requests. I've received many, many, many, many non-PDF FOIA responses.

Share with me some of the open data you've worked with and I'd love to poke at it and tell you where it's wrong and where assumptions about its data are wrong.

bshep · a year ago
Where I grew up, the data for murders is curated in such a way that anybody who dies more than 24 hours after being attacked is not counted as a 'murder'. They do this to reduce the statistical murder rate.
chaps · a year ago
Can you say more about this?
whoiscroberts · a year ago
Well now we know why crime is down
kalendos · a year ago
I can only imagine. Many ETLs are already messy in companies with better tooling and processes.

Would love to read more about your experience with Open Data. Any place where I can reach out?

chaps · a year ago
Here's something about shotspotter data in Chicago: https://x.com/foiachap/status/1775296597850480663

And this one makes some rounds: https://mchap.io/that-time-the-city-of-seattle-accidentally-...

Feel free to reach out!

IanCal · a year ago
Although pre-cleaned data is often not reflective of reality either, and using it carefully often requires a lot more knowledge of the field.
gordon_freeman · a year ago
But even if a dataset is incomplete or inaccurate, do you think we could at least get directionally correct insights from it?
chaps · a year ago
Yes, of course we can. But I can't ignore the harms of doing so: misrepresenting the data in a way that prevents others from understanding what is or isn't there happens regularly. These datasets are often used as a political tool, and contracted with local universities to show that the city is providing data... without actually providing accurate data. Simultaneously, people who don't know data will champion it as accurate because it comes from a university program.

Sometimes what can happen is that somebody inexperienced will try to make some assessment of the data and come to the exact wrong conclusion because they didn't know what not to trust. But it gets on the news anyway and damage is done.

We can do better than that.

whitej125 · a year ago
Would be neat if, instead of an open-ended challenge ("here's some data, do something cool"), the MTA shared a list of hypothetical or real problems to solve and provided data that could be useful in exploring or solving them.
maxverse · a year ago
Also, considering they just got a $68 billion budget approved [1] over the next 5 years, even a small monetary reward would be nice for this. It doesn't need to be a ton of money, but something other than "here's a piece of memorabilia and we'll write a blog post" would be a good incentive.

[1] https://ny1.com/nyc/all-boroughs/news/2024/09/25/mta-board-a...

exegete · a year ago
I think you are misinterpreting that article. The MTA board approved the plan to spend $68B but they depend on the state to give them funds. That’s the amount of money they are asking for based on the projects they want to complete. The state government has to pass a budget to fund that plan (or do something else). Additionally several current, already started projects are on hold due to the “pause” of congestion pricing which was going to be a funding source.
doctorpangloss · a year ago
Why would a cost center political institution enumerate all its problems? It is kind of miraculous they can engage with the public this way at all.
slt2021 · a year ago
I could not find a dataset with payroll hours and overtime reimbursements reported for each MTA employee.

I wanted to investigate how well the MTA is managing its workforce and compensation (given that it's asking for an additional tax, in the form of congestion pricing, to fix its budget hole), but there seems to be no dataset for that.

Does anyone have links to an MTA payroll/hours/overtime dataset?

Or, alternatively, I need a dataset to study each and every subway improvement project, and the components of each project: materials, labor, etc.

WUMBOWUMBO · a year ago
perhaps this could be covered in a FOIA request
thecosas · a year ago
Time for someone to crack their knuckles and do a Power Broker-style MTA Open Data mashup :-)

https://en.wikipedia.org/wiki/The_Power_Broker

shrikar · a year ago
Tried building something with Cursor + ChatGPT in 30 mins; not bad for the initial exploration: https://www.youtube.com/watch?v=w3mkXPdTVlI and the demo link: https://mtachallenge.streamlit.app/
stevage · a year ago
Interesting, these open data challenges were all the rage 10 years ago. Wonder why the sudden trip down memory lane.
nocman · a year ago
I keep clicking on these 'MTA' articles expecting them to be about a "message transfer agent".

Then I think, oh, right, wrong MTA. Guess I've spent too much time dealing with email servers.