Getting Insights Safely: Aggregation Behind the Firewall

I recently watched, with some excitement, DataSift’s webinar on their new product: Facebook Topic Data. In short, this is a new data stream from DataSift that, for the first time, allows access to the activity, conversations and narratives of 1.39 billion monthly Facebook users. In the past, Facebook data has only been directly available to privileged researchers within the company or via the public API. The former requires strict controls as well as individual Non-Disclosure Agreements, while the latter only allows access to the subset of content posted on public pages, and even then only with systematically throttled access.

The Buzz

As Francesco D’Orazio points out in his wonderful post on this new product, this is a really big deal for many reasons:

  • Standardisation. Now all researchers have a standard data product.
  • Demographic information. Facebook users generally reveal a lot about themselves, whether consciously or indirectly.

More vs Less

Fundamentally, this is a struggle between two very noble principles: data minimisation and exploratory data analysis. The first is a legal principle stating that the only data that should be shared is that which is explicitly required for a specified purpose. This hinders potential adversaries who might attempt to collect sensitive data for one purpose under the guise of getting other data for another, more virtuous purpose. For the second, any right-thinking data scientist defers to John Tukey. Tukey understood data science long before it began to be visualised as a hose/waterfall/ocean of blue 0’s and 1’s. He coined the phrase ‘Exploratory Data Analysis’, which might be described more prosaically as ‘not knowing what you are looking for’.


