Shades of Grey: What Data Should be Shared, How and When?
[This post is a reboot of a post originally published on my personal blog on 22nd July 2015]
I recently watched, with some excitement, DataSift’s webinar on their new product: Facebook Topic Data. In short, this is a new data stream from DataSift that, for the first time, allows access to the activity, conversations and narratives of 1.39 billion monthly Facebook users. In the past, Facebook data has only been directly available to privileged researchers within the company or via the public API. The former requires strict controls as well as individual Non-Disclosure Agreements, while the latter only allows access to the subset of content posted on public pages, and even then access is systematically throttled.
- Penetration. No single platform has the global reach of Facebook
- Standardisation. Now all researchers have a standard data product
- Demographic information. Facebook users generally reveal a lot about themselves, whether consciously or indirectly.
What makes Facebook Topic Data unique is the way that the data is filtered and accessed. Generally speaking, data scientists make a query to an API to retrieve some raw data matching some criteria (like keywords, location and/or time period), transferring the results to their local machine from a remote server. Now, if we are so inclined, we could go through and read every single word that was written. But because this is 2015 and there is a lot of social media content about, we are almost certainly going to have to extract some summary statistics. That’s not to say that there is never any need to ‘drill down’ and examine individual pieces of content. Any researcher worth their salt should get their hands dirty and sanity check what they are pulling out. There are myriad reasons why your prized data might actually be garbage: spam, noisy keywords (did you mean ‘chase’ as in ‘the thrill of the chase’? Or as in the bank?) or a good old-fashioned bug in your query or pipeline.
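To make the ‘summarise, then drill down’ habit concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the posts are made up, and `matches_keyword`, `looks_like_bank` and `BANK_HINTS` are hypothetical helpers, not part of any Facebook or DataSift API. It simply shows how a keyword count can look healthy until you read the matching posts.

```python
from collections import Counter

# Made-up posts standing in for raw API results (illustrative only).
posts = [
    {"user": "a", "text": "Nothing beats the thrill of the chase"},
    {"user": "b", "text": "Chase just raised my credit card limit"},
    {"user": "c", "text": "My dog loves to chase squirrels"},
    {"user": "d", "text": "Opened a Chase bank account today"},
]

# Hypothetical, deliberately crude hints that a post is about the bank.
BANK_HINTS = {"bank", "credit", "account", "card"}


def matches_keyword(text, keyword):
    """True if the keyword appears as a whole word (case-insensitive)."""
    return keyword in text.lower().split()


def looks_like_bank(text):
    """True if the post shares any word with the bank hint list."""
    return bool(BANK_HINTS & set(text.lower().split()))


# Step 1: the summary statistic we would normally report.
matched = [p for p in posts if matches_keyword(p["text"], "chase")]

# Step 2: the sanity check, splitting matches by apparent meaning.
summary = Counter(
    "bank" if looks_like_bank(p["text"]) else "other" for p in matched
)

print(len(matched), dict(summary))
for p in matched:
    print("-", p["text"])  # drill down: read the actual content
```

With access to raw content, the second step takes minutes. With only aggregates, the headline count is all we would ever see.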
Yet under this new model we jump straight to aggregated data with no means to ‘look under the hood’. Undertaking natural language processing teaches us very quickly and firmly that language is complex and is, frustratingly, closer to an art than a science. There exist many parameters and hyperparameters to topic models, many stemming schemes exist that are language and context specific and so on. All these choices will be made for us by the Facebook and DataSift teams.
More vs Less
Fundamentally this is a struggle between two very noble prerogatives: data minimisation and exploratory data analysis. The first is a legal principle that states that the only data that should be shared is that which is explicitly required for a specified purpose. This hinders potential adversaries who might attempt to collect sensitive data for one purpose under the guise of getting other data for another, more virtuous purpose. For the second, any right-thinking data scientist defers to John Tukey. Tukey understood data science long before it began to be visualised as a hose/waterfall/ocean of blue 0s and 1s. He coined the phrase ‘Exploratory Data Analysis’, which might be described more prosaically as ‘not knowing what you are looking for’.
Many defining attributes are put forward for Big Data, but surely it can be agreed that the presence of many variables is one. Data science is fundamentally different from traditional scientific approaches whereby hypotheses are arrived at and tested in turn. Rather, each variable is examined, explored and characterised, and its value in some model is quantified. This approach is not without its shortcomings, such as p-hacking, which has received much high-profile coverage recently. Yet exploratory data analysis is likely to remain an important methodological approach going forward.
For this reason, the new product sets a precedent that unnerves me somewhat. We now have an opaque pipeline designed and maintained by Facebook itself. Some argue that this standardisation will provide parity between different studies, yet what guarantee do we have that this pipeline will remain static? Indeed, one lesson of Google Flu Trends was that models based on aggregated data only work while the underlying process remains unchanged (I waxed lyrical on this point on the Global Pulse blog).
At this stage I must state clearly that I admire Facebook for making this data available to researchers and developers in any form at all, as they are not compelled to. Secondly, they should be lauded for ‘baking in’ privacy-protecting controls, e.g. redacting results if content from fewer than 100 unique users appears. That said, Facebook is incredibly powerful and it is incumbent upon us to question and critique the practices of powerful parties.
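For readers who want to see what such a control might look like, here is a rough sketch of threshold-based redaction. It is my own illustration, not Facebook's or DataSift's implementation: the function name `aggregate_with_redaction`, the `(user_id, topic)` record shape and the use of `None` to mark a redacted result are all assumptions. Only the figure of 100 unique users comes from the product description.

```python
from collections import defaultdict

MIN_UNIQUE_USERS = 100  # threshold quoted for the product


def aggregate_with_redaction(records, min_users=MIN_UNIQUE_USERS):
    """Count unique users per topic, redacting small groups.

    `records` is an iterable of (user_id, topic) pairs (assumed shape).
    Topics seen from fewer than `min_users` unique users map to None.
    """
    users_by_topic = defaultdict(set)
    for user_id, topic in records:
        # A set means repeat posts by one user are counted once.
        users_by_topic[topic].add(user_id)

    return {
        topic: (len(users) if len(users) >= min_users else None)
        for topic, users in users_by_topic.items()
    }


# Synthetic example: 150 users discuss football, 99 discuss chess.
records = [(f"user{i}", "football") for i in range(150)]
records += [(f"user{i}", "chess") for i in range(99)]
records += [("user0", "chess")] * 5  # one prolific user, still 99 unique

result = aggregate_with_redaction(records)
print(result)
```

Note that the analyst only ever sees `result`. The set of user IDs, and the posts behind them, never leave the pipeline, which is exactly the trade-off discussed above.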
At this point I should make a disclaimer that all the views expressed above are my own personal views and do not reflect the views of any organisations I work for or with.