Data Science and Statistics

A data scientist is a statistician who works in San Francisco.

So goes a witty quip that I have tried and failed to find an attribution for. But in my experience its premise, that data scientists do the same work as statisticians, isn't actually true in essence or even in spirit. Among many data scientists I encounter, I sense a disdain towards statistics to a degree that is detrimental. To understand why, start with a practical working definition of statistics that I just made up:

A set of tools and methods to make robust inferences from measurements of a sample to a population

A simple example is the relationship between height and age among adults in the US. The population here is the entire adult population of the US, and the sample might be, say, 10,000 adults whom you have paid to give you their heights and ages. Tools from statistics help you determine whether the relationship between age and height that you learn from your 10,000 volunteers also adequately describes the millions of adults you didn't measure.
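The simplest version of this sample-to-population leap is putting a confidence interval around a population mean estimated from a sample. A minimal sketch (synthetic data; the heights are invented and the relationship to age is ignored for brevity), assuming numpy and scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical sample: heights (cm) of 10,000 surveyed adults.
sample = rng.normal(loc=170, scale=10, size=10_000)

# A 95% confidence interval for the population mean: a statement about
# the millions of adults we did NOT measure, based on the ones we did.
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.2f} cm, 95% CI = ({low:.2f}, {high:.2f})")
```

With 10,000 measurements the interval is tight; shrink the sample to 100 and it widens accordingly, which is exactly the sample-size intuition that statistics formalises.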

N=all

Fast forward to a typical data science problem today and classical statistics can seem hopelessly quaint. Consider a study on mobility patterns by a typical cell phone company with a decent market share of, say, 10% of the population of a country. If we were to draw conclusions from this company's user base, our 'sample' would be a sizeable fraction of the entire population! This breakdown of the typical situation facing statisticians lies behind the phrase 'n=all'.

P-values and High Dimensions

The classical gatekeeper here is the p-value, with results conventionally declared significant when p < 0.05. Never mind the arbitrariness of that threshold; more worryingly, random data would, by definition, pass this significance test 5% of the time! (FiveThirtyEight featured a great interactive tool to see this for yourself.) A stricter criterion should therefore be applied to avoid cherry-picking those variables or conditions that happen to come in under the bar, a practice known as p-hacking. Again, this is a modern problem: that of rich and cheap datasets with an abundance of variables and conditions.
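To make that 5% concrete, here is a small simulation (a sketch assuming numpy and scipy; the sample sizes, seed and number of experiments are arbitrary) in which the null hypothesis is true in every 'experiment', yet roughly one comparison in twenty still clears the bar:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 1,000 "experiments" where the null hypothesis holds by construction:
# both groups are drawn from the same standard normal distribution.
n_experiments = 1000
false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=100)
    b = rng.normal(size=100)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of pure-noise comparisons come out "significant".
print(f"{false_positives / n_experiments:.1%} significant at p < 0.05")
```

Run a few hundred variables through this sieve and a handful will look like discoveries purely by chance, which is why corrections for multiple comparisons exist.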

Statistical Advocacy

Consider a scenario where we would like to determine the popularity of a post or product from its upvotes and downvotes, e.g. Reddit posts, Amazon products etc. Let's say we have 100 upvotes and 0 downvotes; in this case we can be pretty confident in assigning a high popularity. But what about early on, when there is only one upvote? Strictly speaking that represents a 100% approval rating, but surely the fact that it is based on a single rating should be borne in mind. Statistics provides just such a measure: the Wilson score interval, an extremely useful tool for a common problem that many have surely encountered. For me this is statistics at its best: formalising an intuitive measure of 'how good' a result is or 'how strong' an effect is.
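As a sketch, the lower bound of the Wilson score interval fits in a few lines of Python (z = 1.96 corresponds to 95% confidence; the example counts mirror the scenario above):

```python
import math

def wilson_lower_bound(upvotes: int, downvotes: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a Bernoulli proportion."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    phat = upvotes / n  # observed approval rating
    return (
        phat + z * z / (2 * n)
        - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    ) / (1 + z * z / n)

print(wilson_lower_bound(1, 0))    # ~0.21: one upvote tells us little
print(wilson_lower_bound(100, 0))  # ~0.96: strong evidence of popularity
```

Ranking by this lower bound, rather than by the raw approval rating, automatically penalises items with few votes, which is precisely the intuition the raw percentage fails to capture.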

I’m strongly of the opinion that the plethora of available MOOCs, tutorials, blogs and other resources has democratised data science for the better. However, I would argue that one of the casualties of this simplified, end-to-end treatment of data science problems is this kind of statistical rigour (closely followed by the omission of data acquisition and cleaning). It’s hard to Google for a statistical test for a particular situation when you don’t know how to formalise that situation in terms of distributions, trials and p-values, or whether such a test even exists. A worse scenario is when the test or tool you know and reach for is not the right one for the job: when all you have is a hammer, everything looks like a nail. For example, t-tests assume normality of distributions, and many statistics don’t deal well with heavy-tailed distributions.
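To illustrate the hammer-and-nail problem, here is a hedged sketch (assuming numpy and scipy; the log-normal parameters and sample sizes are invented) in which two heavy-tailed groups genuinely differ, but the t-test's normality assumption costs it the power to see it, while a rank-based alternative such as the Mann-Whitney U test typically does:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two heavy-tailed (log-normal) samples; group b is genuinely shifted.
a = rng.lognormal(mean=0.0, sigma=2.0, size=200)
b = rng.lognormal(mean=0.5, sigma=2.0, size=200)

# The t-test compares means and assumes approximate normality; the huge
# variance of heavy-tailed data usually drowns out the shift here.
print("t-test:       p =", stats.ttest_ind(a, b).pvalue)

# The Mann-Whitney U test compares ranks and makes no normality
# assumption, so it tends to detect the shift the t-test misses.
print("Mann-Whitney: p =", stats.mannwhitneyu(a, b).pvalue)
```

Neither test is 'the' right answer in general; the point is simply that the assumptions behind a test determine when it can be trusted.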

So what’s to be done? A short but rigorous delve into the basics of frequentist statistics will reward you many times over: p-values, confidence intervals, bootstrapping, t-tests and the Kolmogorov-Smirnov test are good places to start (both Statistics in a Nutshell and Data Analysis with Open Source Tools served me well here). Economics, psychology and medicine are used to smaller datasets, so reading in these areas will help you see these tools in action. The fantastic Shape of Data blog capitalises on the graphical intuition behind many statistical ideas. Well-developed software packages such as Scikit-learn have great documentation and offer worked examples of statistical tests.
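Bootstrapping, in particular, needs nothing beyond resampling. A minimal sketch (pure numpy; the data, seed and resample count are arbitrary) of a 95% bootstrap confidence interval for a median, a statistic with no tidy closed-form interval:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # a skewed sample

# Bootstrap: resample with replacement many times and read the spread
# of the statistic straight off the resampled distribution.
medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(10_000)
])
low, high = np.percentile(medians, [2.5, 97.5])
print(f"median = {np.median(data):.3f}, 95% bootstrap CI = ({low:.3f}, {high:.3f})")
```

The same recipe works for almost any statistic, which is what makes the bootstrap such a good first tool to internalise.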

But one can only brush up on palliative methods to a certain degree. The truth is that new data sources offer a brave new world for understanding human behaviour. Going door to door asking people to fill in surveys allows rigorous sampling to counteract potential bias in who is included, but can do nothing to stop people lying, misunderstanding or misremembering. Digital footprints from social media, web searches or elsewhere offer an objective record of what someone said or did, yet give limited insight into who is included and less still into whether those people are representative. For me, coming to terms with this is the greater challenge for data scientists.

Endnote

Data, science, data science and trace amounts of the Middle East and the UN
