Day 2 of Statistics' User Forum was just as good as the first (which I covered yesterday): here are my personal picks of the day.
Colin MacDonald, the government's Chief Information Officer, could probably have said nothing substantive at all and got away with it, since he has one of those Scottish accents that effortlessly conveys authority and credibility (Adam Smith probably benefitted, too). But as it happens he had important things to say about the potential collision between the new view of treating public data as a strategic asset, and older views on rights to privacy and the appropriate stewardship of data that was collected for one purpose and might now be used for another. It needn't be a head-on crash, as people's views on privacy have also been evolving, and the Facebook generation appears to be a lot more comfortable with having shared data potentially broadcast to the four winds, but there can be tricky issues to work through if we are going to be in a world where different public and private sector databases are being combined to yield deeper - and potentially intrusive - views of people's lives.
John Whitehead, the former Treasury Secretary, is now the chair of the new Data Futures Forum, "a working group to advise ministers on how the collection, sharing and use of business and personal information will impact on public services in the coming years" (as the press release in February said when it was set up). He mostly spoke about what the Forum will be doing and who its members are: they will be posting the first of three planned discussion documents on the new Forum website on April 2, so keep a look out for it.
John brought along one of those Forum members, James Mansell from the Ministry of Social Development, who showed us one of the best infographics I've seen in a long time. It showed the interactions over time, with various public agencies, of one (anonymised, archetypal) 20-year-old, and how each institution's silo view of its own interactions with him completely missed the overall pattern of the guy's life, which only emerged when you mapped the whole picture, as James did in his infographic, by combining the data from all the different agencies. He also put up some indicative costings of what the guy had cost the education, welfare and justice systems over his first 20 years, which came to $215K, the bulk of it backloaded towards his later teen years. The main point was to show the power of the combined data. But you were also irresistibly led to the conclusion (or at least I was) that the costs to the taxpayer would have been a lot less, and the guy and his community a lot happier and better off, if more of that money had been spent earlier on, as it might have been if there had been the right helicopter view of his life. I bailed James up afterwards and said I'd be after him for a copy of the infographic, and as and when I get it, I'll put it up.
Colin Lynch, the Deputy Statistician, gave a very good speech about administrative data and its role in the official statistics system. If statistics isn't your first language, administrative data is data collected by private and public entities in the normal course of their operations (eg the IRD's tax returns, or the supermarkets' checkout scanner data, or online job advertisements). As Colin noted (and I remember myself from conversations at the time at Stats), as recently as four years ago there were still debates being had as to whether administrative data had a role in the generation of official statistics: back then, they were typically compiled from dedicated Stats surveys, such as the Household Labour Force Survey. Now, the exact opposite applies: the working rule of thumb is, administrative data first, surveys (if we have to) second. Benefits? A combo of lower load on survey respondents, and, often, better, quicker and more comprehensive data. What's not to like? I know, I know, it's not as straightforward as all that, but it's obviously the way to go. One of these days we'll connect data hosepipes between the banks and the supermarkets and Stats, and we'll be getting daily retail sales.
Next up was an absolute wow of a presentation from Dr Neil Fraser from Macquarie University in Sydney. This was extraordinary: a guy totally on top of his complex topic, which was in the general realm of big data, effective visualisation of massive quantities of data, and the importance of good visualisation, since we predominantly learn in visual ways. He had the most amazing examples of how vast data volumes can be effectively summarised and displayed, including this Stanford Dissertation Browser. He also gave out copies of this remarkable "Periodic Table of Visualization Methods". Give it a go: as you'll discover, when you hover over each element, up pops an example of the visualisation technique. You can find out more about the whole thing at the parent website. I'm assuming that Stats will make Neil's presentation available at some point: don't miss it.
Then we got John Edwards, the Privacy Commissioner. He was a little squeezed for time, but even so got across his main points, which were (as you'd expect) about potential pitfalls when individual-level data has not been adequately anonymised. He had some startling examples (from the US and the UK) where apparently anonymised statistical data about individuals could, in reality, be reverse engineered to identify the individuals concerned. And he had some good advice on what he called "data hygiene protocols" to prevent the same thing happening here.
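John didn't spell out his protocols in code, and this isn't his method, but for the curious here's a minimal sketch of one widely used check in this area: counting how many records share each combination of "quasi-identifiers" (a k-anonymity style test). The field names, the made-up records and the `small_groups` helper are all my own assumptions for illustration.

```python
from collections import Counter


def small_groups(records, quasi_identifiers, k):
    """Return quasi-identifier combinations shared by fewer than k records.

    Hypothetical helper: any combination held by fewer than k people is a
    candidate for re-identification and would need suppressing or coarsening.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Entirely made-up records, for illustration only.
records = [
    {"age_band": "20-24", "region": "Wellington", "sex": "F"},
    {"age_band": "20-24", "region": "Wellington", "sex": "F"},
    {"age_band": "20-24", "region": "Wellington", "sex": "M"},
    {"age_band": "65-69", "region": "Gisborne", "sex": "M"},
]

risky = small_groups(records, ["age_band", "region", "sex"], k=2)
for combo, n in risky.items():
    print(combo, "appears only", n, "time(s)")
```

With these toy records, two combinations turn up only once each, which is exactly the sort of thing that lets someone with outside knowledge work out who a "statistical" row really is.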
The afternoon's concurrent sessions were, literally, standing room only: there's a vast interest, in particular, in using the IDI, the Integrated Data Infrastructure that Stats maintains, and which is made up of a series of linked longitudinal databases. The IDI enables researchers and policy analysts to track interlinked patterns over time at an individual level. Bill Rosenberg, my mate at the CTU, dobbed me in and had me "volunteered" to chair one of these sessions, which was okay because I got to hear some of these uses of the IDI. For example, it's been used to track what happens to school-leavers: it can tell whether they're in employment (because there's all the business employer-employee data there), what they earn (IRD), whether they went on to tertiary education (Ministry of Education data), whether they're on a benefit (MSD data), or even out of the country (Customs departure card data). So now we can answer questions like, what qualifications are most likely to land young people in a job, and what they'll earn there, and look at the links between the level of educational attainment in secondary school and outcomes in later life.
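If you want a feel for what "linked at an individual level" means in practice, here's a toy sketch. It is not how the IDI actually works under the hood (I don't know its internals); the `person_id` key, the field names, the figures and the `link_by_person` helper are all invented for illustration.

```python
from collections import defaultdict


def link_by_person(**sources):
    """Merge per-agency rows into one combined record per person.

    Hypothetical helper: each source is a list of dicts carrying a shared
    'person_id' key; other fields are prefixed with the source name.
    """
    linked = defaultdict(dict)
    for source_name, rows in sources.items():
        for row in rows:
            person = row["person_id"]
            for field, value in row.items():
                if field != "person_id":
                    linked[person][f"{source_name}.{field}"] = value
    return dict(linked)


# Made-up rows standing in for education, tax and benefit data.
education = [
    {"person_id": 1, "qualification": "NCEA Level 3"},
    {"person_id": 2, "qualification": "NCEA Level 1"},
]
earnings = [{"person_id": 1, "annual_income": 42000}]
benefits = [{"person_id": 2, "on_benefit": True}]

linked = link_by_person(education=education, earnings=earnings, benefits=benefits)
for person, record in sorted(linked.items()):
    print(person, record)
```

Each agency on its own sees one column of the story; the combined record per person is what lets you ask the school-leaver questions above.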
We wrapped up with Bill English, who shares the same passion for data that (as I noted yesterday) Maurice Williamson does. Lots in his speech, too: I noted his points about how good information wins arguments, changes minds, and shifts the ground for debate, and how better data has been needed to understand policy interventions where, especially with the more complex ones, politicians and officials had essentially been "flying blind". He gave various examples of how even simple quantification (eg producing an accurate count of the number of young mothers on welfare, which was much lower than popular mythology had imagined) had transformed how people thought about the issue and how they went about tackling it. He had one commitment - to open data, all good - and a warning. I've heard him give warnings before in like vein that he's subsequently delivered on, so listen up: woe betide the public agency or chief executive that sits on unreleased public data. And finally he talked a bit about the IDI: the whole point, for him, is to use it as a decision tool for better policy design and delivery.
A great couple of days. Well done to the folks at Stats who pulled it all together: Peter Crosland and the team, take a well-deserved bow.