> The Chief Data Officer (CDO) will "(1) be responsible for lifecycle data management"
I am very interested in what this type of lifecycle might look like, considering that I feel most data should be kept forever. I wonder how a lifecycle might collide with the challenges posed by bit rot[0].
[0] https://www.theguardian.com/technology/2015/feb/13/what-is-b...
This is already getting really exciting. Some government entities (including lower levels like local, county, and state governments) have moved to digitizing their old paper and microfilm records. But if they're expected to maintain many types of records essentially forever, it places a constant burden to continue to update and migrate data in perpetuity, whereas paper or microfilm can sit in a box in a closet for decades.
For the most part, common file formats like PDFs, JPGs, and TIFs are likely to be understood for a very, very long time, but you don't just have file storage, you have systems to manage, index, and find those files, and those systems are likely to need constant maintenance.
I've seen Blu-ray disc manufacturers claim lifespans of over 100 years with capacities of 100 GB per disc. An entire warehouse of paper and microfilm documents would be able to fit in a shoebox.
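For a rough sense of scale (the disc count per shoebox and the per-page size below are my own guesses, not figures from the comment above), a quick back-of-the-envelope calculation:

    # Back-of-the-envelope: how many scanned pages fit in a shoebox of 100 GB Blu-rays?
    # Assumptions (mine, for illustration): ~50 discs per shoebox, ~100 KB per scanned page.
    discs_per_shoebox = 50
    bytes_per_disc = 100 * 10**9      # 100 GB per disc
    bytes_per_page = 100 * 10**3      # ~100 KB per scanned, compressed page

    total_bytes = discs_per_shoebox * bytes_per_disc
    pages = total_bytes // bytes_per_page
    print(f"{total_bytes / 10**12:.0f} TB total, roughly {pages:,} scanned pages")
    # -> 5 TB total, roughly 50,000,000 scanned pages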
I would think that outsourcing this to Amazon and Azure for redundant copies would be a sufficient start; AWS S3 and Azure Blob Storage would go a long way. Structured directories, accompanying document metadata, and hot indexes on that metadata in a database would be a very good start.
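A minimal sketch of what that could look like with boto3 and a local SQLite index (the bucket name, key layout, and metadata fields are all invented for illustration; a real deployment would obviously need access controls, versioning, and cross-region replication):

    import sqlite3
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-agency-records"   # hypothetical bucket name

    def archive_document(path, doc_id, department, year):
        # Structured key layout: department/year/doc_id.pdf
        key = f"{department}/{year}/{doc_id}.pdf"
        with open(path, "rb") as f:
            s3.put_object(Bucket=BUCKET, Key=key, Body=f,
                          Metadata={"department": department, "year": str(year)})
        # Hot metadata index in a database so the documents stay findable
        with sqlite3.connect("records_index.db") as db:
            db.execute("CREATE TABLE IF NOT EXISTS docs "
                       "(doc_id TEXT PRIMARY KEY, department TEXT, year INTEGER, s3_key TEXT)")
            db.execute("INSERT OR REPLACE INTO docs VALUES (?, ?, ?, ?)",
                       (doc_id, department, year, key))
        return key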
A lot of this type of work is already being done, or has already been done.
I am reminded of the "API" decision made by Jeff Bezos at Amazon, as famously described by Steve Yegge:
So one day Jeff Bezos issued a mandate. He's doing that all the time, of course, and people scramble like ants being pounded with a rubber mallet whenever it happens. But on one occasion -- back around 2002 I think, plus or minus a year -- he issued a mandate that was so out there, so huge and eye-bulgingly ponderous, that it made all of his other mandates look like unsolicited peer bonuses.
His Big Mandate went something along these lines:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4) It doesn't matter what technology they use. HTTP, Corba, Pubsub, custom protocols -- doesn't matter. Bezos doesn't care.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
6) Anyone who doesn't do this will be fired.
https://plus.google.com/+RipRowan/posts/eVeouesvaVX
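To make the mandate concrete, here's a minimal sketch of what "expose your data through a service interface" can look like, using nothing but the Python standard library (the endpoint, port, and data are invented; Amazon's real service stack is of course nothing this simple):

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The team's data lives behind the service; nobody else reads the store directly.
    ORDERS = {"1001": {"status": "shipped"}, "1002": {"status": "pending"}}

    class OrderService(BaseHTTPRequestHandler):
        def do_GET(self):
            # e.g. GET /orders/1001 -> {"status": "shipped"}
            order_id = self.path.rstrip("/").split("/")[-1]
            order = ORDERS.get(order_id)
            self.send_response(200 if order else 404)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(order or {"error": "not found"}).encode())

    if __name__ == "__main__":
        HTTPServer(("", 8080), OrderService).serve_forever()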
This is how Bezos built AWS, so he meant services that are needed to build websites, such as databases, servers, containers, queues, caches, storage...
In that example they'd expose it through the source control system and through the build system, though that's really stretching the meaning.
What it really means is that if you're building something that runs on your team's hosts, it needs to be exposed through services. For most services at Amazon, that means using the common service interfaces that everyone uses, not just handing out DB credentials (which I've seen happen, and which took years to untangle -- I think it's still in progress 5+ years later).
That's a weird example. A team that makes a library probably doesn't share too much data with other teams. I think this directive is more like, don't share a spreadsheet of your quarterly stats.
What matrix library for computer graphics was Amazon working on back in the early 2000s? Genuinely curious since I don't hear much about things like that from Amazon.
I'd always want to make sure no PII is accidentally leaked. Example: in 1997, researchers at MIT showed that using only gender, date of birth, and ZIP code, it is possible to identify the majority of US residents! They proved their point by identifying the Massachusetts governor's medical records in a publicly-available dataset that was presumed anonymous. In 2006, Netflix published an "anonymized" dataset of movie ratings by users. After it was released to the public, researchers were able to identify many Netflix users, even though the dataset only contained user ID, movie, rating, and rating date.
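A toy illustration of that kind of linkage attack using pandas -- the records and column names below are entirely made up; the point is just that quasi-identifiers join cleanly across datasets:

    import pandas as pd

    # "Anonymized" health records: no names, but quasi-identifiers remain
    health = pd.DataFrame({
        "zip": ["02139", "02139", "48104"],
        "dob": ["1945-07-31", "1980-01-02", "1972-05-09"],
        "sex": ["M", "F", "F"],
        "diagnosis": ["hypertension", "asthma", "diabetes"],
    })

    # Public voter roll: names plus the same quasi-identifiers
    voters = pd.DataFrame({
        "name": ["A. Adams", "B. Brown", "C. Clark"],
        "zip": ["02139", "02139", "48104"],
        "dob": ["1945-07-31", "1980-01-02", "1972-05-09"],
        "sex": ["M", "F", "F"],
    })

    # Joining on the quasi-identifiers re-identifies the "anonymous" records
    print(pd.merge(health, voters, on=["zip", "dob", "sex"]))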
I suspect it's more expensive to make it not machine-readable, considering that it likely starts out in a computer, and if it is not stored that way, it become essentially inaccessible even to the people who stored it.
No, but I believe it can't help but be an indicator of the trend toward open government. Right now there are around six states with open data laws.
I have been working for two years with a group trying to add Michigan to the list. Tomorrow I am publishing an open letter to our governor asking her to support our efforts. I will post a link on HN.
Just curious, do government services really not have any reserve funding? It seems like avoiding shutdowns could be solved pretty easily by having reserve funds (at least for 1-3 months or so).
But perhaps that's the point. Shutdowns are supposed to be inconvenient.
Indeed. "Many agencies, particularly the military, would intentionally run out of money, obligating Congress to provide additional funds to avoid breaching contracts." https://en.wikipedia.org/wiki/Antideficiency_Act
> It seems like avoiding shutdowns could be solved pretty easily by having reserve funds (at least for 1-3 months or so)
Or by having the government continue working according to last year's budget, like many other countries do. I know of no government shutdowns in Western countries other than the US. Even for "third world" countries such a thing would be disastrous, as armies don't like going unfed and unpaid.
I am deeply afraid of the impact of this law. The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
In addition to the administrative burden, it appears to ignore the fact that non-sensitive information, in sufficient quantity and correlation, becomes sensitive information.
Perhaps my skepticism is misplaced, but my initial reaction is that this sounds better in the abstract than it will turn out to be in practice.
Part of my wife's job is to research Medicaid billing codes for every state (yes, this is a state thing, but I'm just making an example). Once in a while she can get their codes in a form as "advanced" as an Excel spreadsheet. But more likely she'll get a PDF doc that she's got to run through an OCR program to convert to a spreadsheet, and then check for errors. Or for some states, nothing is published at all - she's got to piece it together from partner hospital billing records.
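For the PDF-to-spreadsheet step, a rough sketch of how that conversion can be scripted when the PDFs contain real text tables rather than scanned images (this assumes the third-party camelot library; scanned pages would still need OCR first, and the filename is made up):

    import camelot  # third-party: pip install camelot-py[cv]

    # Pull tables out of a text-based (not scanned) PDF of billing codes
    tables = camelot.read_pdf("state_medicaid_codes.pdf", pages="all")

    for i, table in enumerate(tables):
        df = table.df                      # each extracted table is a pandas DataFrame
        df.to_csv(f"codes_{i}.csv", index=False)
        print(f"table {i}: {df.shape[0]} rows x {df.shape[1]} cols")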
There's no doubt that getting this data into a sane format will take the states some extra resources.
But when you consider how much more efficient this will make my wife's company, and every other provider of Medicaid services, it's bound to be a huge win on net. And improving efficiency of delivering healthcare should be important.
The government is big, but the private sector is still much larger. So there's great leverage in making our overall systems more efficient, because an investment in efficiency on the government side will be multiplied many times over across the many private entities that the government oversees.
Inefficiency and high expense are the primary burden of open governments, representative governments, and democratic governments. If you want a cheap, efficient government, you want an absolute monarchy. This is why corporations tend to be structured into rather strict hierarchies that bear no small resemblance to feudal kingdoms. That's also why they're terrible at meeting worker demands.
Your reply leans pretty heavily on the assumption that all government agencies consist of bureaucrats. I think you should re-examine that assumption. A small minority of government workers are involved in issuing regulations at places like the EPA. Most are military service members, Homeland Security workers, DOJ law enforcement, etc.
>The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
I (briefly) thought the same thing when I first read about it, but I think the efficiency gained from having digital, standardized formats will eventually outweigh the inefficiencies of the initial conversion to that format.
I'm also happy that otherwise "dead" data (e.g. papers sitting in boxes in a basement somewhere) could now be used more effectively in aggregate to further increase operating efficiencies. Imagine trying to put together a comparison between a specific subset of finance reports across departments when Department A uses one digital format, Department B uses another digital format, and Departments C through Z all have them in boxes. What would have otherwise been a bureaucratic headache _before you even get to data munging_ now becomes a task that's easier on all fronts, and that data can then be used to fight back against otherwise unknown inefficiencies.
>The amount of meta-work required to consolidate and annotate data we collect, in order to prepare it for public consumption, seems likely to hurt government efficiency.
As a data analyst working for a state government, not consolidating or creating metadata really hurts my efficiency. I've gotten too comfortable with munging tables in PDFs.
>it appears to ignore the fact that non-sensitive information, in sufficient quantity and correlation, becomes sensitive information
This is something we're trying to figure out. The problem is, I doubt many agencies are actually maintaining privacy with their publications. The Census is adopting differential privacy strategies [0], but my own agency relies on practices from the days of printed reports. I know for a fact some of them don't work, but government is slow to adapt.
[0]: https://privacytools.seas.harvard.edu/why-census-bureau-adop...
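For anyone curious, the core idea behind the Census approach fits in a few lines: add calibrated Laplace noise to published counts so that any one person's presence barely changes the output. The epsilon and the count below are purely illustrative:

    import numpy as np

    def dp_count(true_count, epsilon=0.5, sensitivity=1):
        # Laplace mechanism: noise scale = sensitivity / epsilon
        noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
        return round(true_count + noise)

    # Publish a noisy version of a small-cell count instead of the exact value
    print(dp_count(true_count=17))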
I am a big supporter of government and do not consider efficiency a primary objective (a good one, but secondary).
To make a cartoony analogy: flight security would be more efficient if everyone flew naked with no hand luggage, but that would defeat the purpose of people traveling from place to place for their own reasons.
Likewise: the government has collected or generated that info; let's put it into a reasonably clean and accessible format so others (who, in the US, have funded its collection/generation anyway) can build upon it.
The inefficiencies of correctly recording and distributing data will be, I think, greatly outweighed by the increased efficiencies of having standardized machine-readable data that's easy to access and use. I work at a think tank that uses various government data sets across agencies and jurisdictions, and the cleaning that goes into analysis is a nightmare. Some agencies have their own quirky conventions--I've seen "-1" used as a flag for "no data" before, which as you can imagine returned some strange results on analysis. A regulation that says "publish data and do it precisely this way" will be a welcome one for me.
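A small example of the cleanup that sentinel values force on us, assuming a pandas workflow (the column names are invented):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"county": ["A", "B", "C"], "uninsured_rate": [4.2, -1, 7.9]})

    # Treat the agency's "-1 means no data" flag as a real missing value;
    # otherwise it silently drags down means and correlations.
    df["uninsured_rate"] = df["uninsured_rate"].replace(-1, np.nan)
    print(df["uninsured_rate"].mean())   # 6.05, not 3.7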
I agree that's true when this is bolted onto existing data, but if data collections are properly designed to be findable, accessible, interoperable, and reusable from the start, I think long-term data management costs drop due to more efficient processes.
I think closed data, or data locked in PDFs, masks a lot of technical debt that causes manual labor and expensive proprietary licenses (looking at you, SAS, for archived data sets).
With the right tools, most of it could be largely automatic, and eventually it will be a non-concern. It's getting up to speed, and getting everyone on board, that is painful.
This sounds like an amazing thing!
Anyone serious does their work reproducibly. In terms of costs, this shouldn't add more than a little bit for storage on devices and paper.
I searched/skimmed through the full text of the bill, and it doesn't seem that it applies to data used/generated by grant-based research (or other projects). The language of the bill is heavily focused on executive agencies.
That would be ideal, although a large portion of grant funding goes to medical research (e.g. via NIH), where it would be difficult to anonymize the data. They could require that it be sanitized (differential privacy, etc.), but I don't know how that could be verified effectively/efficiently. The grants process is quite time-consuming for the grantees (and the grantors) without this requirement.
Data generated by a researcher on a government grant generally includes a provision in the grant that says the govt owns the data, doesn't it?
And if the govt owns it, this seems to indicate it should be open...
Perhaps instead of it being "all teams" it should be "all cpu processes"?
It's not a "you must find some data or service to expose." If you have data and someone wants to use it, they access it via a service.
- you build a matrix math system as a service... or
- you build cloud rendering (as they have done several times)
That sounds pretty awesome (and expensive)!
Kind of depends on what they say is "sensitive". Probably not much of a facility for appealing that designation either.
I sorta look at it like the "First Step" prison reform act: it's only the start.
Baby steps.
It's a meta-comment, but this term sounds slightly misplaced when you're talking about bureaucrats having to do more paperwork.
Poe's law strikes again.
*not a lawyer