How to Understand Deep Learning (in a Shallow Way)
Deep Learning is here! Deep Learning is hot! Every start-up pitch worth its seed funding mentions Deep Learning! Yet, as with many over-hyped technologies, very few people can tell you what it actually is when pressed. Most estimates put the number of people with significant Deep Learning experience worldwide at fewer than 50.
As a result, there is a huge gap for accessible ways to understand the basis of Deep Learning, and to go further and actually become a practitioner with the ability to troubleshoot a misbehaving model rather than just copy and paste code. Having made my own path through this via trial and error, I would like to share my learning and reflections with a somewhat technical reader in mind (think a Master’s student or a former technical practitioner now in a management role).
A caveat: Deep Learning moves fast! An arms race between research groups in academia and industry means that huge advances are made every few months. Read this article now while it is fresh!
A Short History of the Black Box
An article on Deep Learning has to start with some history of Neural Networks. Neural Networks were an attempt to crudely reflect the function of a human brain in a numerical prediction model in silico. The brain functions through stimuli (colours, lights, sound), neurons (connected units of the brain) and outputs (recognising a tree, saying ‘hello’ to someone) and a Neural Network works in the same way. This led to Frank Rosenblatt’s Perceptron that in reality looked like this
But actually represented this
The original generous analogy with the human brain has been stretched to its limits and in my opinion encourages inflated expectations of the capabilities of Neural Network derived models. The high level view of this process:
<inputs> →<processing> → <output>
is at first blush so abstract as to be almost meaningless. The crux is the <processing> piece that maps inputs to outputs, otherwise known as the ‘black box’; more on this later. All that remains is to adjust the internal model until it works on a set of examples, or training data; this reduces to a fairly trivial, although expensive, optimisation through a slightly more intelligent version of trial and error.
This abstraction has the benefit that it can be applied to almost any problem. For example,
- Speech recognition: sound waves of speech are processed to output a sequence of words.
- Image recognition: a set of pixels is processed to give the class of objects in the image.
- Autonomous Vehicle Control: imagery from a front facing vehicle camera can be used to output controls for the steering and brakes.
Non-Linearity (A More Technical Aside)
You might have noticed that the original perceptron is in fact almost absurdly simple. A single ‘layer’ maps the inputs to outputs as
y=W.x+b
Where x is the vector of inputs, W is a vector of node weights, b is a scalar bias and y is the scalar output. This is just an obfuscatory way to write a simple linear model. In fact, even if the neural network is expanded to include more than one layer of this kind, it still reduces to a linear model that can be fitted in the same way as a linear regression, because a linear function of a linear function is itself linear.
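To make the collapse concrete, here is a minimal sketch in plain Python. The weights, biases and helper names (linear, matmul) are made up for illustration; the only point is that two stacked linear layers give the same answer as one suitably chosen linear layer.

```python
# Two stacked linear layers with no non-linearity in between.
# All weights and biases are arbitrary illustrative values.

def linear(W, b, x):
    """Apply y = W.x + b, with W a list of rows and b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Multiply matrix A (m x n) by matrix B (n x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Layer 1 maps 2 inputs to 3 units; layer 2 maps those 3 units to 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b1 = [0.1, 0.2, 0.3]
W2 = [[2.0, -1.0, 0.5]]
b2 = [1.0]

# Collapse both layers into a single one: W = W2.W1 and b = W2.b1 + b2.
W = matmul(W2, W1)
b = linear(W2, b2, b1)

x = [0.7, -0.3]
two_layers = linear(W2, b2, linear(W1, b1, x))
one_layer = linear(W, b, x)

print(two_layers, one_layer)  # the two results agree up to floating point rounding
```

However many purely linear layers are stacked, the same trick applies, which is why depth alone adds nothing without a non-linearity.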
The value added here is in so-called ‘hidden layers’. These are layers that apply a non-linear mapping. The intuition is that the nodes in hidden layers require a high input signal (typically outputs and inputs are trivially scaled to be in the range [0, 1]) in order to return a large output signal; in other words, they are harder to ‘activate’. There are several choices for the function to be used here, such as a sigmoid or the closely related soft-max.
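As a rough illustration of what ‘harder to activate’ means, here is a small sketch of the two functions just mentioned, using the standard textbook definitions. The input values are arbitrary.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of scores into positive values that sum to 1."""
    m = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# A weak input barely activates the node, a strong one almost fully does.
for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 2))  # roughly 0.02, 0.5 and 0.98

# The two are closely related: a soft-max over the scores [z, 0]
# gives the same value as the sigmoid of z.
z = 2.0
print(softmax([z, 0.0])[0], sigmoid(z))
```

The output is no longer a straight-line function of the input, so layers built from it cannot be collapsed into a single linear model.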
Neural networks become at once interesting, useful and uninterpretable with the addition of hidden layers. By composing architectures that combine regular and hidden layers in particular ways, very complex behaviours can be learned and modelled (more on this below).
How to Learn
One crucial piece of setting up a deep neural network is training, i.e. arriving at the correct set of weights that defines the black box needed to make the right predictions. There is no way to know a priori what the most suitable weights are for the job, even given a large set of examples to learn from. Instead some initial weights are chosen and each example is run through the machine. If the machine gets it wrong, which it will do frequently to begin with, then the error is passed backwards through the network and each weight is adjusted in the direction that would have given something closer to the correct answer. This process is known as backpropagation of errors. It gives rise to a large number of hyper-parameters, essentially arbitrary choices about the learning process, so it is worth understanding well. Some of these hyper-parameters are covered in the glossary below.
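The loop described above can be sketched for the simplest possible case: a single sigmoid node learning the logical OR of two inputs. This is a toy under stated assumptions, not how any real library does it. The function names, the starting weights of zero, the learning rate and the number of epochs are all my own illustrative choices, and with only one node the ‘backpropagated’ error reduces to prediction minus target.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Run one example through the single-node 'machine'."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def train(examples, learning_rate=0.5, epochs=1000):
    """Nudge the weights after every example, for a fixed number of epochs."""
    weights = [0.0] * len(examples[0][0])  # arbitrary starting point
    bias = 0.0
    for _ in range(epochs):
        for x, target in examples:
            # How wrong was the machine, and in which direction?
            error = predict(weights, bias, x) - target
            # Move each weight a small step in the direction that reduces the error.
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

# Training data for logical OR: (inputs, correct answer).
examples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights, bias = train(examples)
predictions = [round(predict(weights, bias, x)) for x, _ in examples]
print(predictions)  # expected to match the targets: [0, 1, 1, 1]
```

The learning rate and the number of epochs here are exactly the kind of hyper-parameters mentioned above: nothing in the data says what they should be, and poor choices can make training slow or unstable.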
AutoML
One long-standing criticism of Machine Learning is the degree of domain knowledge required to set up a working ML model. More specifically, this means knowing what kind of model works best (tree based, discriminant based, ensemble, etc.), deciding what exactly the inputs should look like, and preparing the raw data to match. The latter is generally known as feature engineering. One attractive aspect of deep learning is the promise that we can do away with manual feature engineering and simply add layers that perform feature engineering transformations, leaving the optimisation of the network to decide which kinds of transformations (log scaling, multiplicative, etc.) are appropriate. In short, ‘shallow learning’ requires a human to perform feature engineering, while deep learning has the necessary complex non-linear capabilities to perform feature engineering automatically.
François Chollet expands upon this very nicely in a recent blog post, The Future of Deep Learning, in which he imagines that an ML task could someday be fed ‘raw’ to a program that matches the problem to a suitable model and chooses hyperparameters, without an expensive computer science PhD graduate to slow things down. Whose jobs are being taken away by computers now?
In a word, we will move away from having on one hand “hard-coded algorithmic intelligence” (handcrafted software) and on the other hand “learned geometric intelligence” (deep learning). We will have instead a blend of formal algorithmic modules that provide reasoning and abstraction capabilities, and geometric modules that provide informal intuition and pattern recognition capabilities. The whole system would be learned with little or no human involvement.
This gets at another massively powerful and appealing aspect of deep learning: hierarchical layers of representation that parallel some aspects of how humans recognise objects. Rather than learning an exact match of what a car looks like, we look at many cars and begin to recognise cross-cutting features, e.g. 4 wheels, a bonnet. These wheels are made of parts and motifs and edge configurations that eventually map down to raw pixels. As the ‘Canadian mafia’ explain:
An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts.
The Ghost in the Machine and How to Find it
With the extra complexity in the architecture of deep learning come extra problems to be solved. Networks with large architectures can be incredibly complicated, with many tens of layers of different kinds performing all manner of different operations. Several people have noticed that amid all of that complexity, deep neural networks can produce nonsensical edge cases. For example, Jeff Clune’s lab has managed to tweak images that have been successfully classified by an ANN (e.g. as a bus) such that they look unchanged to humans yet are then misclassified by that same ANN. Conversely, images that resemble random noise can convince a neural network that they are examples of an arbitrary tangible object. A nice video accompanying their paper shows this well.
Recently, negative images have been shown to completely bamboozle Deep Neural Networks that achieve very high accuracies on the original images.
These seemingly inexplicable malfunctions of deep learning have led to scrutiny of how these algorithms make decisions, particularly since so few people are qualified to debug them. One set of inaccuracies arises from the complexity of the algorithm itself, leading to unexpected and seemingly counter-intuitive results as above. A second set arises from models learning biases from training data.
The General Data Protection Regulation has called for ‘algorithmic auditability’, but as I argue elsewhere, simply mandating access to ‘an algorithm’ is somewhat naive and unworkable. A better solution is to define a set of benchmark classification problems that can be fed into an algorithm and then the results meaningfully compared across algorithms.
At the very least, the responsible practitioner should find ways to thoroughly examine the learned layers. Thankfully that is a call that is starting to be answered by the open source community in the form of the Keras visualisation library and the infinitely entertaining interactive Neural Network playground.
How to Get Deep — And Estimated Time Investments
- 1 month: The canonical starting point for machine learning (and also one of the very first MOOCs) is Andrew Ng’s course on Machine Learning. It’s long but it covers the bread and butter of ML and being able to listen to one of the masters speak for several hours is invaluable.
- 10 minutes: Yann LeCun at Facebook has a series of great educational vignettes on AI concepts for those who learn visually.
- 1 hour: The single best resource I found for walking through the core components of deep learning is the first chapter of Michael Nielsen’s book.
- 2 hours: For the more practically minded, Caffe in a Day focuses on how to code things in Caffe and troubleshoot non-converging models.
- 30 minutes: A high-level, nine-page review of Deep Learning appeared in Nature recently.
- Unlimited: The TensorFlow Playground is one of the best ways to understand how neural networks converge and learn.
- 20 minutes per week: The Machine Learning subreddit is full of interesting and practical questions on best practices in ML that are usually patiently answered.
- 30 minutes: Practical advice on getting Deep Neural Nets to converge efficiently is hard to come by, so this is very useful content.
Glossary
Deep learning is rife with jargon. Here are some terms you will hear frequently:
- Deep learning: Artificial Neural Networks (ANN) with complex non-linear structure.
- Shallow learning: Artificial Neural Networks with simple structures and architectures.
- Convolutional Neural Networks: Neural Networks that have convolutional layers that essentially apply a filter (through convolution, thus the name) to identify important features. The critical part here is that the importance of each pixel is also coupled to related pixels nearby.
- Recurrent Neural Networks: An ANN that processes a stream of data, rather than a single snapshot of data. For example, speech is made up of a sound signal over time. The frequency profile at one snapshot is used as a vector, but the result of the application of the RNN to that snapshot is partially determined by previous snapshots. For that reason RNNs are stateful.
- Long Short-Term Memory: LSTM is a particular RNN architecture.
- Generative Adversarial Network: A GAN is a very hot application of Deep Learning in which synthetic data is created in such a way that it cannot be distinguished from real data. For example, given audio or video of a person speaking, the GAN would output realistic but fake audio or video. The architecture consists of a network which creates the synthetic data and a second classifier which tries to identify the fake data: the adversary. The first network is optimised when its fake data cannot be distinguished.
- Hidden layer: a layer that applies a non-linear function that smoothly approximates the effect of a threshold.
- Epoch: a training epoch refers to one full pass through the training data while tuning your model. At the end of an epoch, the next pass begins. The number of epochs is a hyperparameter.
- Rectified Linear Unit (ReLU): A rectified linear unit is an activation function to be used in a hidden layer. It passes positive inputs through unchanged and sets negative inputs to zero, and is commonly used as a preferable alternative to a sigmoid or logistic activation function.
- Feature Map: the result of the application of a function to a data vector.
- Pooling: Pooling is a process by which a high-dimensional feature map is reduced in resolution, e.g. 128x128 pixels reduce to 64x64 pixels. There are various rules for doing so, for example choosing the highest value (max-pooling) or the average value (average-pooling) of each 2x2 subset to represent an output pixel. The step by which the pooling window moves across the feature map is the stride.
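Several of the glossary terms above (convolutional filter, feature map, ReLU, pooling and stride) can be seen working together in a toy example. This is a plain Python sketch with a made-up 5x5 ‘image’ and a hand-picked 2x2 filter; in a real network the filter values are learned, and the function names here are my own rather than any library’s API.

```python
def relu(x):
    """Rectified linear unit: keep positive values, zero out negative ones."""
    return max(0.0, x)

def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) to build a feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(fmap, size=2, stride=2):
    """Keep the highest value in each size x size window, moving by stride."""
    return [[max(fmap[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(fmap[0]) - size + 1, stride)]
            for i in range(0, len(fmap) - size + 1, stride)]

# A 5x5 image with a bright vertical band: dark, dark, bright, bright, dark.
image = [[0.0, 0.0, 1.0, 1.0, 0.0] for _ in range(5)]

# A filter that responds to a dark-to-bright edge, reading left to right.
kernel = [[-1.0, 1.0],
          [-1.0, 1.0]]

feature_map = convolve2d(image, kernel)                      # 4x4, negative at the bright-to-dark edge
activated = [[relu(v) for v in row] for row in feature_map]  # negatives set to zero
pooled = max_pool(activated)                                 # 4x4 reduced to 2x2

print(pooled)  # [[2.0, 0.0], [2.0, 0.0]]
```

Each output value depends on a small neighbourhood of pixels rather than a single pixel, and after pooling the edge is still detected even though its exact position has been blurred.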
Libraries
So far I have landed on Keras as my library of choice, but I have yet to form a reliable opinion of the others. For now I simply list them below.
- Caffe: optimised for image processing and developed at Berkeley.
- Torch: developed by Facebook & NYU, with an API in Lua on top of a C backend.
- TensorFlow: Google’s masterpiece.
- CNTK: from Microsoft.
- Keras: seems to be single-handedly maintained by Google alumnus François Chollet. A nice high-level interface to TensorFlow, Theano and CNTK.