In 1897, the Indiana legislature considered a bill for "introducing a new mathematical truth", a clever procedure for "squaring the circle". The procedure didn't work; for one thing, it assumed that the value of pi was 3.2 (it isn't). The bill didn't pass but, if it had, it wouldn't have changed the value of pi – it just would have made the Indiana legislature look a bit silly. Parliaments can change a lot of things, but not the laws of mathematics.
The Australian Parliament is now considering amending the Privacy Act. Attorney-General George Brandis introduced the amendments, saying "there is a strict and standard government procedure to de-identify all government data that is published. Data that is released is anonymised so that the individuals who are the subject of that data cannot be identified." But the bill specifies a two-year jail term for re-identifying people from those data sets. Usually, acts that are impossible don't need to be banned.
Well, what is de-identification exactly? And does it work?
There are good mathematical reasons for doubt. Computer scientists have successfully re-identified "de-identified" data sets of health, social networks, online ratings and web searches, and shown high levels of uniqueness in telecommunications metadata and payments data – a key step towards re-identification.
The reason is simple: everyone is unique if you know enough about them. A surprisingly small number of ordinary facts is enough to isolate most people, even without names, addresses or dates of birth.
For example, telecommunications metadata is strikingly identifying: just four spatio-temporal points – a place paired with a time – are enough to uniquely identify 95 per cent of people. If I know where you get up in the morning, where you work, and where you spent Friday evening and Sunday lunch, there's a good chance I can identify you simply from the location of your phone. Our own analysis of Australia's de-identified MBS/PBS data set replicates these themes. It seems very unlikely that the research value of the data can be preserved without a substantial risk that individuals could be identified.
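To see why so few facts suffice, consider a toy simulation (the population size and grid of location "cells" below are illustrative assumptions, not figures from any real study): even with only 50 possible cells per observation, four observations per person place each person into one of millions of combinations, so almost everyone's trace is unique.

```python
import random
from collections import Counter

random.seed(0)

# Illustrative assumptions: a coarse 50-cell location grid and a
# toy population of 10,000. Each "person" is reduced to four
# location cells (home, work, Friday evening, Sunday lunch).
NUM_CELLS = 50
POPULATION = 10_000

people = [tuple(random.randrange(NUM_CELLS) for _ in range(4))
          for _ in range(POPULATION)]

# Count how many people share each 4-point trace.
counts = Counter(people)
unique = sum(1 for p in people if counts[p] == 1)
print(f"{unique / POPULATION:.0%} of people have a unique 4-point trace")
```

With 50^4 (about 6 million) possible traces and only 10,000 people, collisions are rare: nearly everyone is the only person with their particular combination, which is the mathematical heart of the re-identification results above.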
A negative consequence of the threat of jail time is that it discourages law-abiding Australian researchers or "white hats" from making the simplest and most convincing demonstration that a de-identification method has failed. It doesn't make the data any more secure from malicious or criminal exposure.* If those rules had been in place in September, we might not have identified the problem in the MBS/PBS data set encryption, the data set would still be online, and the government would be unaware of its insecurity.
The Productivity Commission's recent draft report suggests that some data could be available only to "trusted users" such as health researchers. That's a good idea: a de-identification method that isn't secure enough for public release might be fine for a data set that's kept secure and restricted to certain people or queries by law.
No one denies the tremendous potential of big data to improve policy, inspire research and fuel economic growth. But we have a mathematical problem to solve: we don't know how to use this resource directly and protect privacy at the same time. So should the Parliament declare a "mathematical truth" of secure de-identification even though the maths probably doesn't work? Or should we try to think up something new?
There are exciting new ideas for provably privacy-preserving computation on sensitive data. Google has experimented with differential privacy, a statistical technique for computing accurate aggregate statistics while mathematically limiting what can be learned about any individual's data. Microsoft researchers have used homomorphic encryption to compute on private genomic data without ever learning the individual records. We even know of some Australians who routinely use this kind of maths.
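The best-known building block of differential privacy is the Laplace mechanism: a count query changes by at most one when any single person is added or removed, so adding Laplace noise with scale 1/epsilon to the count gives epsilon-differential privacy. The sketch below illustrates that standard textbook mechanism (it is not a description of Google's or Microsoft's actual systems, and the cohort size and epsilon are made-up values):

```python
import math
import random

random.seed(1)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy.

    A count query has sensitivity 1 (one person changes it by at
    most 1), so Laplace noise with scale 1/epsilon suffices.
    """
    # Sample Laplace(0, 1/epsilon) noise via the inverse CDF.
    u = random.random() - 0.5
    noise = -(1 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

# Hypothetical example: the number of patients in some cohort,
# released with noise rather than exactly.
released = dp_count(1_234, epsilon=0.5)
print(f"noisy count: {released:.1f}")
```

The released value is close to the truth for reasonable epsilon, but no attacker can confidently infer whether any particular individual was in the cohort, which is exactly the guarantee that plain "de-identification" fails to provide.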
Australia really can become a leader in the data sciences by innovating in the technologies that work, while carefully using re-identification to shine light on those technologies that don't.
* Whether the amended federal Privacy Act will apply to university researchers or journalists is still unclear. And while Commonwealth agencies or the Attorney-General may grant permission for re-identification research, no guarantee is possible.
Dr Chris Culnane is a research fellow at the University of Melbourne's department of computing and information systems, where Dr Benjamin Rubinstein and Dr Vanessa Teague are senior lecturers.