> The Chief Data Officer (CDO) will "(1) be responsible for lifecycle data management"
I am very interested in what this type of lifecycle might look like, considering that I feel most data should be kept forever. I wonder how a lifecycle might collide with the challenges posed by bit rot[0].
[0] https://www.theguardian.com/technology/2015/feb/13/what-is-b...
This is already getting really exciting. Some government entities (including lower levels like local, county, and state governments) have moved to digitizing their old paper and microfilm records. But if they're expected to maintain many types of records essentially forever, it places a constant burden to continue to update and migrate data in perpetuity, whereas paper or microfilm can sit in a box in a closet for decades.
For the most part, common file formats like PDFs, JPGs, and TIFs are likely to be understood for a very, very long time, but you don't just have file storage, you have systems to manage, index, and find those files, and those systems are likely to need constant maintenance.
I've seen Blu-ray disc manufacturers claim lifespans of over 100 years with capacities of 100 GB per disc. An entire warehouse of paper and microfilm documents would be able to fit in a shoebox.
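For a rough sense of scale (the disc count per shoebox and the per-page size below are my own guesses, not figures from the comment above), a quick back-of-the-envelope calculation:

    # Back-of-the-envelope: how many scanned pages fit in a shoebox of 100 GB Blu-rays?
    # Assumptions (mine, for illustration): ~50 discs per shoebox, ~100 KB per scanned page.
    discs_per_shoebox = 50
    bytes_per_disc = 100 * 10**9      # 100 GB per disc
    bytes_per_page = 100 * 10**3      # ~100 KB per scanned, compressed page

    total_bytes = discs_per_shoebox * bytes_per_disc
    pages = total_bytes // bytes_per_page
    print(f"{total_bytes / 10**12:.0f} TB total, roughly {pages:,} scanned pages")
    # -> 5 TB total, roughly 50,000,000 scanned pages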
I would think that outsourcing this to Amazon and Azure for redundant copies would be a sufficient start; AWS S3 and Azure Blob Storage would go a long way. Structured directories, accompanying document metadata, and hot indexes on that metadata in a database would be a very good start.
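A minimal sketch of what that could look like with boto3 and a local SQLite index (the bucket name, key layout, and metadata fields are all invented for illustration; a real deployment would obviously need access controls, versioning, and cross-region replication):

    import sqlite3
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-agency-records"   # hypothetical bucket name

    def archive_document(path, doc_id, department, year):
        # Structured key layout: department/year/doc_id.pdf
        key = f"{department}/{year}/{doc_id}.pdf"
        with open(path, "rb") as f:
            s3.put_object(Bucket=BUCKET, Key=key, Body=f,
                          Metadata={"department": department, "year": str(year)})
        # Hot metadata index in a database so the documents stay findable
        with sqlite3.connect("records_index.db") as db:
            db.execute("CREATE TABLE IF NOT EXISTS docs "
                       "(doc_id TEXT PRIMARY KEY, department TEXT, year INTEGER, s3_key TEXT)")
            db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?, ?)",
                       (doc_id, department, year, key))
        return key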
A lot of this type of work is already being done, or has already been done.
I am reminded of the "API" decision made by Jeff Bezos at Amazon, as famously described by Steve Yegge:
So one day Jeff Bezos issued a mandate. He's doing that all the time, of course, and people scramble like ants being pounded with a rubber mallet whenever it happens. But on one occasion -- back around 2002 I think, plus or minus a year -- he issued a mandate that was so out there, so huge and eye-bulgingly ponderous, that it made all of his other mandates look like unsolicited peer bonuses.
His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4) It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6) Anyone who doesn't do this will be fired.
https://plus.google.com/+RipRowan/posts/eVeouesvaVX
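To make the mandate concrete, here's a minimal sketch of what "expose your data through a service interface" can look like, using nothing but the Python standard library (the endpoint, port, and data are invented; Amazon's real service stack is of course nothing this simple):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The team's data lives behind the service; nobody else reads the store directly.
    ORDERS = {"1001": {"status": "shipped"}, "1002": {"status": "pending"}}

    class OrderService(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /orders/1001 -> {"status": "shipped"}
            order_id = self.path.rstrip("/").split("/")[-1]
            order = ORDERS.get(order_id)
            self.send_response(200 if order else 404)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(order or {"error": "not found"}).encode())

    if __name__ == "__main__":
        HTTPServer(("", 8080), OrderService).serve_forever()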
This is how Bezos built AWS, so he meant services that are needed to build websites, such as databases, servers, containers, queues, caches, storage...
In that example they'd expose it through the source control system and through the build system, though that's really stretching the meaning.
What it really means is that if you're building something that runs on your team's hosts, it needs to be exposed through services. For most services at Amazon, that means using the common service interfaces that everyone uses, not just handing out DB credentials (which I've seen happen, and which took years to untangle -- I think it's still in progress 5+ years later).
That's a weird example. A team that makes a library probably doesn't share too much data with other teams. I think this directive is more like, don't share a spreadsheet of your quarterly stats.
What matrix library for computer graphics was Amazon working on back in the early 2000s? Genuinely curious since I don't hear much about things like that from Amazon.
I'd always want to make sure no PII is accidentally leaked. Example: in 1997, researchers at MIT showed that using only gender, date of birth, and ZIP code, it is possible to identify the majority of US residents! They proved their point by identifying the Massachusetts governor's medical records in a publicly-available dataset that was presumed anonymous. In 2006, Netflix published an "anonymized" dataset of movie ratings by users. After it was released to the public, researchers were able to identify many Netflix users, even though the dataset only contained user ID, movie, rating, and rating date.
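A toy illustration of that kind of linkage attack using pandas -- the records and column names below are entirely made up; the point is just that quasi-identifiers join cleanly across datasets:

    import pandas as pd

    # "Anonymized" health records: no names, but quasi-identifiers remain
    health = pd.DataFrame({
        "zip": ["02139", "02139", "48104"],
        "dob": ["1945-07-31", "1980-01-02", "1972-05-09"],
        "sex": ["M", "F", "F"],
        "diagnosis": ["hypertension", "asthma", "diabetes"],
    })

    # Public voter roll: names plus the same quasi-identifiers
    voters = pd.DataFrame({
        "name": ["A. Adams", "B. Brown", "C. Clark"],
        "zip": ["02139", "02139", "48104"],
        "dob": ["1945-07-31", "1980-01-02", "1972-05-09"],
        "sex": ["M", "F", "F"],
    })

    # Joining on the quasi-identifiers re-identifies the "anonymous" records
    print(pd.merge(health, voters, on=["zip", "dob", "sex"]))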
I suspect it's more expensive to make it not machine-readable, considering that it likely starts out in a computer, and if it is not stored that way, it become essentially inaccessible even to the people who stored it.
No, but I believe it can't help but be an indicator of the trend toward open government. Right now there are around six states with open data laws.
I have been working for two years with a group trying to add Michigan to the list. Tomorrow I am publishing an open letter to our governor asking her to support our efforts. I will post a link on HN.
Just curious, do government services really not have any reserve funding? It seems like avoiding shutdowns could be solved pretty easily by having reserve funds (at least for 1-3 months or so).
But perhaps that's the point. Shutdowns are supposed to be inconvenient.
Indeed. "Many agencies, particularly the military, would intentionally run out of money, obligating Congress to provide additional funds to avoid breaching contracts." https://en.wikipedia.org/wiki/Antideficiency_Act
> It seems like avoiding shutdowns could be solved pretty easily by having reserve funds (at least for 1-3 months or so)
Or by having the government continue working according to last year's budget, like many other countries do. I know of no government shutdowns in Western countries other than the US. Even for "third world" countries such a thing would be disastrous, as armies don't like going unfed and unpaid.
I am deeply afraid of the impact of this law. The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
In addition to the administrative burden, it appears to ignore the fact that non-sensitive information, in sufficient quantity and correlation, becomes sensitive information.
Perhaps my skepticism is misplaced, but my initial reaction is that this sounds better in the abstract than it will turn out to be in practice.
Part of my wife's job is to research Medicaid billing codes for every state (yes, this is a state thing, but I'm just making an example). Once in a while she can get their codes in a form as "advanced" as an Excel spreadsheet. But more likely she'll get a PDF doc that she's got to run through an OCR program to convert to a spreadsheet, and then check for errors. Or for some states, nothing is published at all - she's got to piece it together from partner hospital billing records.
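For the PDF-to-spreadsheet step, a rough sketch of how that conversion can be scripted when the PDFs contain real text tables rather than scanned images (this assumes the third-party camelot library; scanned pages would still need OCR first, and the filename is made up):

    import camelot  # third-party: pip install camelot-py[cv]

    # Pull tables out of a text-based (not scanned) PDF of billing codes
    tables = camelot.read_pdf("state_medicaid_codes.pdf", pages="all")

    for i, table in enumerate(tables):
        df = table.df                      # each extracted table is a pandas DataFrame
        df.to_csv(f"codes_{i}.csv", index=False)
        print(f"table {i}: {df.shape[0]} rows x {df.shape[1]} cols")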
There's no doubt that getting this data into a sane format will take the states some extra resources.
But when you consider how much more efficient this will make my wife's company, and every other provider of Medicaid services, it's bound to be a huge win on net. And improving efficiency of delivering healthcare should be important.
The government is big, but the private sector is still much larger. So there's great leverage in making our overall systems more efficient, because an investment in efficiency on the government side will be multiplied many times over across the many private entities that the government oversees.
Inefficiency and high expense are the primary burden of open governments, representative governments, and democratic governments. If you want a cheap, efficient government, you want an absolute monarchy. This is why corporations tend to be structured into rather strict hierarchies that bear no small resemblance to feudal kingdoms. That's also why they're terrible at meeting worker demands.
Your reply leans pretty heavily on the assumption that all government agencies consist of bureaucrats. I think you should re-examine that assumption. A small minority of government workers are involved in issuing regulations at places like the EPA. Most are military service members, Homeland Security workers, DOJ law enforcement, etc.
>The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
I (briefly) thought the same thing when I first read about it, but I think the efficiency gained from having digital, standardized formats will eventually outweigh the inefficiencies of the initial conversion to that format.
I'm also happy that otherwise "dead" data (e.g. papers sitting in boxes in a basement somewhere) could now be used more effectively in aggregate to further increase operating efficiencies. Imagine trying to put together a comparison between a specific subset of finance reports across departments when Department A uses one digital format, Department B uses another digital format, and Departments C through Z all have them in boxes. What would have otherwise been a bureaucratic headache _before you even get to data munging_ now becomes a task that's easier on all fronts, and that data can then be used to fight back against otherwise unknown inefficiencies.
>The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
As a data analyst working for a state government, not consolidating or creating metadata really hurts my efficiency. I've gotten too comfortable with munging tables in PDFs.
>it appears to ignore the fact that non-sensitive information, in sufficient quantity and correlation, becomes sensitive information
This is something we're trying to figure out. The problem is, I doubt many agencies are actually maintaining privacy with their publications. The Census is adopting differential privacy strategies [0], but my own agency relies on practices from the days of printed reports. I know for a fact some of them don't work, but government is slow to adapt.
[0]: https://privacytools.seas.harvard.edu/why-census-bureau-adop...
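For anyone curious, the core idea behind the Census approach fits in a few lines: add calibrated Laplace noise to published counts so that any one person's presence barely changes the output. The epsilon and the count below are purely illustrative:

    import numpy as np

    def dp_count(true_count, epsilon=0.5, sensitivity=1):
        # Laplace mechanism: noise scale = sensitivity / epsilon
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return round(true_count + noise)

    # Publish a noisy version of a small-cell count instead of the exact value
    print(dp_count(true_count=17))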
I am a big supporter of government and do not consider efficiency a primary objective (a good one, but secondary).
To make a cartoony analogy: flight security would be more efficient if everyone flew naked with no hand luggage, but that would defeat the purpose of people traveling from place to place for their own reasons.
Likewise: the government has collected or generated that info; let's put it into a reasonably clean and accessible format so others (who, in the US, have funded its collection/generation anyway) can build upon it.
The inefficiencies of correctly recording and distributing data will be, I think, greatly outweighed by the increased efficiencies of having standardized machine-readable data that's easy to access and use. I work at a think tank that uses various government data sets across agencies and jurisdictions, and the cleaning that goes into analysis is a nightmare. Some agencies have their own quirky conventions--I've seen "-1" used as a flag for "no data" before, which as you can imagine returned some strange results on analysis. A regulation that says "publish data and do it precisely this way" will be a welcome one for me.
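A small example of the cleanup that sentinel values force on us, assuming a pandas workflow (the column names are invented):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"county": ["A", "B", "C"], "uninsured_rate": [4.2, -1, 7.9]})

    # Treat the agency's "-1 means no data" flag as a real missing value;
    # otherwise it silently drags down means and correlations.
    df["uninsured_rate"] = df["uninsured_rate"].replace(-1, np.nan)
    print(df["uninsured_rate"].mean())   # 6.05, not 3.7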
I agree that's true when this is bolted onto existing data, but if data collections are properly designed to be findable, accessible, interoperable, and reusable from the start, I think long-term data management costs drop due to more efficient processes.
I think closed data, or data locked in PDFs, masks a lot of technical debt that causes manual labor and expensive proprietary licenses (looking at you, SAS, for archived data sets).
With the right tools, most of it could be largely automatic, and eventually it will be a non-concern. It's getting up to speed, and getting everyone on board, that is painful.
This sounds like an amazing thing!
Anyone serious does their work reproducibly. In terms of costs, this shouldn't add more than a little bit for storage on devices and paper.
I searched/skimmed through the full text of the bill, and it doesn't seem that it applies to data used/generated by grant-based research (or other projects). The language of the bill is heavily focused on executive agencies.
That would be ideal, although a large portion of grant funding goes to medical research (e.g. via NIH), where it would be difficult to anonymize the data. They could require that it be sanitized (differential privacy, etc.), but I don't know how that could be verified effectively/efficiently. The grants process is quite time-consuming for the grantees (and the grantors) without this requirement.
Data generated by a researcher on a government grant generally includes a provision in the grant that says the govt owns the data, doesn't it?
And if the govt owns it, this seems to indicate it should be open...
Perhaps instead of it being "all teams" it should be "all cpu processes"?
It's not a "you must find some data or service to expose." If you have data and someone wants to use it, they access it via a service.
- you build a matrix math system as a service... or
- you build cloud rendering (as they have done several times)
That sounds pretty awesome (and expensive)!
Kind of depends on what they say is "sensitive". Probably not much of a facility for appealing that designation either.
I sorta look at it like the "First Step" prison reform act: it's only the start.
Baby steps.
It's a meta-comment, but this term sounds slightly misplaced when you're talking about bureaucrats having to do more paperwork.
Poe's law strikes again.
*not a lawyer