To Code or Not to Code: Masochism or Productivity

[This post is a reboot of a post originally published on my personal blog on 17th October 2014]

There was a great panel at Strata this week that convened a group of data scientists debating a proposition:

If you can’t code, you can’t be a Data Scientist

Admittedly the ‘against’ camp had a tough job on their hands, and the audience was certainly predisposed to insisting on coding ability (a before-and-after poll confirmed this bias, and the debate didn’t seem to convert anyone). In the end the debate centred on the distinction between ‘coding’, that is, combining existing commands into new operations that do novel things, and ‘running code’, which is simply executing existing commands. This can turn into a semantic argument quite quickly: is it ‘coding’ to run ls on the command line? Is it ‘coding’ to copy and paste someone else’s code and then run it?

As with many things concerning data science, I agree with Hilary Mason, who made this point:

You may be able to meet the bare minimum qualifications to be a data scientist without being able to code, but you will never be great. I won’t hire you

While data science does have a lot of predictable and well-understood operations (examining distributions of variables, summary statistics, cross-correlations, etc.) that need to be done on all datasets, there is often some unique catch in the data. In order to fully debug the data that you have, you must be able to drill down and shine a light on it in new and innovative ways. To do that, you cannot be afraid to dig in.
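As a minimal sketch of those routine operations, and of the kind of catch they can hide, the following uses only the Python standard library. The function names, the toy data and the -999 placeholder are all hypothetical, chosen purely for illustration.

```python
import math
import statistics


def summarise(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical toy data: two perfectly linearly related variables.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 74, 82]

# Hypothetical readings in which -999 was used as a "missing" placeholder.
readings = [21, 22, 23, -999, 24]

if __name__ == "__main__":
    print(summarise(heights))
    print(pearson(heights, weights))
    # The mean is dragged far below zero by the placeholder,
    # while the median still looks plausible.
    print(summarise(readings))
```

The routine summary runs happily on the readings, and only a large gap between the mean and the median hints that something is wrong. Finding out what usually means writing a few lines of your own.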

Of course there does exist a class of data science problems and datasets that are well enough understood that they could be described as routine. In these cases a non-coder can by all means harvest the insights by plugging into Tableau or even Excel. Yet we have to admit that the most interesting problems use datasets that are new and fresh and come with a myriad of undiscovered gotchas.

Further, there are fewer and fewer excuses for not being able to code nowadays. Most languages have lively communities in most cities, as well as videos and tutorials online. You don’t need to struggle to install compilers or environments: lessons can be run directly in the browser.

But I think this leads to a more general and interesting question:

Do you need to know how to do low-level operations in order to do high-level operations?

This question rings particularly true in computing, where there are seemingly infinite levels of abstraction: from clicking on windows with a mouse at one extreme, through operating system calls, assembly code and machine code, down to bytes and bits, and further down to logic gates, transistors, semiconductor heterostructures and electrons. Thankfully a lot of smart people have done a lot of hard work so that we can use spreadsheets and check our emails in blissful ignorance of these details.

I have heard a few different perspectives on this. Some consider that making coding more accessible to people across the board is always a good thing: even if the result is slightly superficial, it is better than nothing. On this view, insisting that everyone goes through several years of traditional, rigorous training is elitist; it excludes many from the fun of data science, potentially discards fresh perspectives from non-experts and drives ‘group-think’. Others see the benefits of the ‘school of hard knocks’ approach, summed up very well in this recent answer to the Quora question about which language is best for a programmer to learn for maximum financial reward:

Learn C/C++. You might never use it professionally, but it contains a lifetime of lessons. And the hardest problems, the ones that the top engineers are asked to solve, will sooner or later hit some foundational C code.

When I look back at the many times I have reinvented the wheel, unaware that an operation or routine was just one single built-in call away, I have mixed feelings. Of course my solution was sub-optimal and naive, and it ate into evenings when I should have been reading a book or exercising, but it taught me a lot. Now when I make use of that built-in functionality, I have a very vivid memory of exactly what goes into making such a library, which gives me an extra layer of awareness of what is going on.

Computing is unique in allowing you to jump to the middle without understanding the beginning, to be a dilettante. My perspective may be an artifact of a physics training, which walks you through history, teaching you incorrect things that were considered correct at different times before eventually letting you in on the truth in your final years. Perhaps unsurprisingly, I see great value in learning the basics. If you know how to use the command line, you can see how a new productivity tool works on top of it. That doesn’t stop you from using the productivity tool to save time, but it does help you understand what is happening when things go wrong. As compute becomes cheaper and cheaper, you can arrive at a ‘good enough’ level of optimisation pretty quickly. But a lack of fundamental understanding of data structures, algorithms and architecture can trip you up later, when you make naive and outrageous requests of your machines.
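One small illustration of such a naive request, sketched in Python with hypothetical function names and data: both functions below return the same answer, but the first scans a list for every lookup while the second builds a set once and checks membership in roughly constant time.

```python
def common_ids_naive(left, right):
    """Keep items of `left` that also appear in `right`.

    `x in right` on a list is a linear scan, so the total work grows
    with len(left) * len(right).
    """
    return [x for x in left if x in right]


def common_ids_with_set(left, right):
    """Same result, but membership checks use a hash-based set."""
    lookup = set(right)  # built once, in time proportional to len(right)
    return [x for x in left if x in lookup]


# Hypothetical ID lists, small enough that both versions run instantly.
orders = list(range(0, 2000, 2))      # even numbers below 2000
customers = list(range(0, 2000, 3))   # multiples of 3 below 2000

if __name__ == "__main__":
    shared = common_ids_with_set(orders, customers)
    print(len(shared), shared[:5])
```

On a few thousand rows the difference is invisible. On a few million, the first version is the kind of request that keeps a machine busy for hours, and knowing why is exactly the fundamental understanding in question.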

In the same way, I recommend that learners of Arabic grind through the tough grammar and nuanced pronunciation of the formal language before picking up a significantly simpler dialect. There will come a time when you need to speak proper Arabic or read a newspaper, and it will hurt to undo all your bad habits and suddenly have to think about all the proper verb conjugations and case endings that you had previously been able to skip over.

This is the same reason that big tech companies continue to grill candidates on ‘Greatest Hits’ algorithms such as Quicksort. You will rarely need to implement these yourself, but someday the time will come when only an implementation of a Bloom filter will do the trick. When it does, Tableau won’t be able to help you but Algorithms 101 will.
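For readers who have not met one, a Bloom filter is a compact set-like structure that can answer "definitely not seen" or "possibly seen". The following is a minimal sketch rather than a production implementation. The class name, the default sizes and the choice of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed into bytes and initially all zero.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Yield num_hashes bit positions for an item."""
        data = str(item).encode("utf-8")
        for seed in range(self.num_hashes):
            # Prefixing a seed gives a family of different hash functions.
            digest = hashlib.sha256(seed.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every position is set; an unset bit proves absence.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    seen = BloomFilter()
    seen.add("alice@example.com")
    print("alice@example.com" in seen)  # an added item is always reported
    print("bob@example.com" in seen)    # almost certainly reported as absent
```

The memory cost is fixed at num_bits regardless of how long the items are, which is the property that makes the structure useful when an exact set would not fit in memory.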

Data, science, data science and trace amounts of the Middle East and the UN
