Fake real data? Or real fake data?

This post is a reboot of one originally published on my personal blog on 23rd March 2014.

I came across a great blog post from Redrow Analytics recently that looked at fake data. It reminded me of a question I often ask myself: what makes real data ‘real’? It is a question worth asking, since real data has a nasty habit of being abused.

This is a game with echoes of my lectures on solid state physics many years ago. It is also a game that ex-physicist Albert-László Barabási and co-authors played to great effect in Small But Slow: How Network Topology and Burstiness Slow Down Spreading. The authors start with a timed record of 325 million calls between 4.6 million users of a phone network. This is like a video recording that you can play back to see the patterns in how people call each other. The authors simulated an ‘infection’, much like a phone virus, that spreads from person to person as one user calls another.

To show the importance of the structure of the network and the timing of the calls, Karsai et al. shuffled the network at random, destroying its cliques and communities. Then they shuffled the times of all of the calls, and so on, eventually arriving at a completely synthetic call network that looks the same as the original from a distance but has all the patterns and correlations washed out. They found that the time for the ‘virus’ to take over decreased steadily as each aspect of the data was decorrelated. In other words, wiping out the correlations in the real data produces a network which looks like the original but underestimates a key characteristic: how long spreading actually takes.
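The shuffling trick itself is easy to sketch. Here is a toy version of one such null model, a time-shuffled reference, applied to a hypothetical call log (the names and timestamps below are invented for illustration, not the paper's data):

```python
import random

# Toy call log: (caller, callee, timestamp) events.
calls = [
    ("a", "b", 1), ("b", "c", 2), ("a", "b", 3),
    ("c", "d", 5), ("b", "c", 8), ("d", "e", 13),
]

def shuffle_times(events, seed=0):
    """Time-shuffled null model: keep who-calls-whom fixed, but
    randomly permute the timestamps across all events, destroying
    burstiness and event-event timing correlations."""
    rng = random.Random(seed)
    times = [t for _, _, t in events]
    rng.shuffle(times)
    return [(u, v, t) for (u, v, _), t in zip(events, times)]

shuffled = shuffle_times(calls)
# The multiset of timestamps and the set of links are unchanged,
# so aggregate statistics look the same "from a distance"...
assert sorted(t for _, _, t in shuffled) == sorted(t for _, _, t in calls)
# ...but which call happens when is now uncorrelated.
```

Shuffling the links themselves (rather than the times) would be the analogous null model for destroying cliques and communities.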

Call Detail Records (CDRs) are an extremely rich source of data, detailing each call or SMS between two SIMs along with a timestamp and a rough location. While the content of a call is obviously sensitive, its metadata is also sensitive, as it reveals people's locations and relationships. A recent high-profile paper showed that even when names, phone numbers and all other personal information are removed, only a small amount of outside information is needed to re-identify a person in a CDR. On a practical level, this means that if I know the times and places of four calls someone made, I can go to the anonymised CDR, work out which anonymous user represents that person, and then see all their other calls, in 95% of cases.
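To make the attack concrete, here is a hypothetical sketch (the user IDs, places and times are invented): a trace is just a set of (time, place) points, and knowing a few such points is often enough to single out one anonymous user.

```python
# Hypothetical "anonymised" CDR: anonymous ID -> set of (hour, place) points.
anon_cdr = {
    "user_001": {(9, "soho"), (13, "camden"), (18, "soho"), (22, "hackney")},
    "user_002": {(9, "soho"), (12, "brixton"), (19, "angel")},
    "user_003": {(8, "camden"), (13, "camden"), (18, "soho"), (21, "soho")},
}

def reidentify(known_points, cdr):
    """Return the anonymous IDs whose traces contain every known point.
    With enough observed points, typically only one candidate remains."""
    return [uid for uid, trace in cdr.items() if known_points <= trace]

# Suppose I know four times and places at which my target made calls:
observed = {(9, "soho"), (13, "camden"), (18, "soho"), (22, "hackney")}
print(reidentify(observed, anon_cdr))  # ['user_001']
```

Once the anonymous ID is pinned down, every other record under that ID is exposed, which is exactly why removing names alone is not enough.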

Given the proven value of CDR analysis for transport planning, mapping poverty, tracking malaria and disaster relief, but in light of the liability that intrinsically privacy-invading real data represents, the prospect of realistic synthetic CDR data is enticing. This was the admirable motivation behind DP-WHERE: Differentially private modeling of human mobility by Mir et al.

In this paper, the authors mapped out home and work locations based on commuting patterns. They then measured the distribution of distances that people travel, along with the distributions of the number of calls made per day, the length of the calls and the time of day that people call. From these they were able to create synthetic user behaviours, grounded in real user behaviours, by sampling the distributions. While the new data looks like the old data in many ways, it doesn't actually describe any real person, so it can't violate anyone's privacy. This is a great achievement, and it represents a potentially sustainable way for researchers to gain insights from CDR datasets without using data from which someone could be identified.

But in the process, each characteristic of the data has become independent of every other. If people who live in one part of town make more calls, this will be smoothed out. Or it might be that people who commute farther make more, shorter calls; in the synthetic data this pattern is not preserved.
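This washing-out is easy to demonstrate. Below is a minimal sketch, not the authors' actual model: I invent a toy population in which commute distance and daily call count are correlated by construction, then generate "synthetic" users by sampling each marginal independently, which is what a pure marginal-sampling generator does.

```python
import random

rng = random.Random(42)

# Toy "real" users: call count depends on commute distance by
# construction (longer commute -> more calls, plus some noise).
real = [(d, 2 * d + rng.gauss(0, 1))
        for d in [rng.uniform(1, 10) for _ in range(1000)]]

def corr(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    xs, ys = zip(*pairs)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sxy / (sx * sy)

# Synthetic users: sample each marginal on its own. Both marginal
# distributions are preserved exactly, but the pairing is random.
dists = [d for d, _ in real]
ncalls = [c for _, c in real]
synthetic = list(zip(rng.sample(dists, len(dists)),
                     rng.sample(ncalls, len(ncalls))))

print(round(corr(real), 2))       # strongly correlated, close to 1
print(round(corr(synthetic), 2))  # close to 0: the coupling is gone
```

Any analysis that only looks at one variable at a time would be unable to tell the two datasets apart; any analysis that depends on the coupling would be misled.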

Of course this isn’t realistic: the real world is full of biases and correlations. These correlations and bursty behaviours are the ‘datainess’ of data. We don’t spread our calls or movements evenly throughout the day, and if half of people travel 4 miles to work while the other half travel 2 miles, it is not accurate to talk of an average person travelling 3 miles.

The question is: how unrealistic are synthetic mobility traces? The authors measure this effect, looking among other quantities at the average of users' maximum daily range, which drops from 3.2 miles in the real data to 1.9 miles in the synthetic data. Whether that is an acceptable level of accuracy depends on who you ask, and is certainly application specific. None of this is a criticism of Mir et al.; theirs is a valuable proof of concept of an idea with great potential, and it is by the authors' own admission that the context must be considered.

As with any pre-processing step in a Big Data pipeline, biases can creep in, and I would rather have at least partially realistic data available than none at all. But that said, humans don’t bounce around at random like the molecules of the ideal gas that young physicists all learn about. It could be that we are missing an important piece of the puzzle if we pretend that they do.

Data, science, data science and trace amounts of the Middle East and the UN