Neural Networks and Deep Learning Basics

If you haven’t lived under a rock you’ve heard about neural networks. You probably even have an idea of where they are used, e.g. face recognition, self-driving cars, etc. But what is it exactly and how does it work? It’s a vast topic, but I’d like to try to cover some basics here, mainly to challenge my own understanding of the subject (hence the quote) but also possibly to help someone to understand it.

What is a Neural Network?

A neural network is an example of a machine learning system. As such, it is designed to “learn” (i.e. progressively improve performance on a specific task) from given data without being explicitly programmed. There are different types of machine learning (supervised, unsupervised, reinforcement learning, etc.), but neural networks these days have had the most success when used for supervised learning, so in this post I won’t touch any of the other types.

In general, supervised learning is when you have input variables (X) and an output variable (Y) and you use an algorithm to learn the mapping function from the input to the output.

Multiple architectures of neural networks exist, but let’s start with the simplest possible example. The simplest neural network consists of an input layer, and the output layer. The output layer in its turn contains only single neuron. The input layer typically isn’t counted, so this is a single-layer neural network (also called single-layer perceptron).

Single Layer{:height="50%” width="50%"}

As input, we have a training example X, where x1,x2,x3..xn are numerical “features”. If we’re predicting a house price, the features may include the number of bedrooms, the size of the house, how old is the building, etc. If we’re trying to detect a cat in a picture, X contains pixel values from the image.

The y is the output or the label. This can be the predicted price of the house, or 0/1 for the cat detection task (0 - no cat in the picture, 1 - there’s a cat). In between there’s a neuron that takes the input X, applies function f to the weighted sum of its features and produces the output y. So, apart from the feature values that are taken from an example, the function f needs the weights (w1,w2….wn) and bias (b), and these are the parameters the neural network will have to “learn” during its training. The function f is called activation function, and its purpose is to introduce non-linearity. Most of the data in real world cannot be described with a linear function, and if your data can be described with a linear function you don’t really need a neural network. There are different types of activation functions used (sigmoid, relu, tanh, etc.) but discussing the difference between them is, imho, beyond the basic intro into the topic.

Now, the example above was more of a toy example, and in reality neural networks are much larger than that. They contain numerous layers most of which have many neurons. In fact, when people talk about “deep learning”, they basically mean neural networks with many layers. For example:

Feedforward NN{:height="50%” width="50%"}

This is a 4 layer neural network (input doesn’t count as a layer). Each of the layers except output is called a “hidden layer”. This type of neural network is called “feedforward network”, because the data flows in one direction, from input to output, without any cycles or loops. While a single layer can only learn a simple function, stacking more neurons in a layer, and then adding more layers allows a neural network learn much more complex functions from the data.

How exactly will the network “learn”?

For the model of a neural network to “learn” from given data we need to train it. Let’s see how this process works. Each neuron calculates a number by applying the activation function to the input it receives from the previous layer, and using parameters it has (weights and bias). The question is - “How can we choose the best parameters (weights and biases) for a network, given some labeled training samples?” Let’s say you have the training data, and you’ve decided on the network structure. A typical process would be:

  1. Define a cost function to minimize (how far off is a predicted answer for a training example from the known answer)
  2. Initialize the network’s weights (e.g. randomly)
  3. Run the training data forward through the network to generate a prediction for each example (forward propagation)
  4. Measure the cost
  5. Propagate the error is back through the network in order to calculate the gradients for all output and hidden neurons (backpropagation)
  6. Update the weights with regards to the calculated gradients
  7. Repeat steps 3 through 6 a fixed number of times. Each pass like that (forward propagation, backpropagation, updating weights) is typically called an epoch.

The basic method for weights optimization is called gradient descent, but there are various optimizations of the gradient descent itself, such as Momentum, Adam, RMSProp, etc. Initializing the weights can also be done in a more sophisticated way than plain random. If you were to implement a neural network architecture from scratch, that would be a challenging task. Luckily tools like tensorflow and keras take care of a lot of things for you (like doing backprop and updating the weights). Designing and training a simple feedforward neural network becomes a piece of cake.

Feedforward neural network example

Here’s an example of a simple neural network with keras:

from keras.models import Sequential
import keras.layers as ll
from keras import regularizers

model = Sequential()

# network body
model.add(ll.Dense(units=50, activation='relu')) # a layer with 50 neurons each of which has RELU activation function
model.add(ll.Dense(units=40, activation='relu'))
model.add(ll.Dense(units=30, activation='relu'))
model.add(ll.Dense(units=20, activation='relu'))
model.add(ll.Dense(1, activation='sigmoid')) # the output layer with sigmoid activation function

# the cost (loss) used here is mean squared error
# instead of plain gradient descent, Adam weights optimizer is used
model.compile(loss=keras.losses.mean_squared_error,optimizer = keras.optimizers.Adam(lr=0.004, beta_1=0.9, beta_2=0.999, decay=0.0), metrics=["accuracy"])

# the model is trained for 15 epochs, and the result is evaluated on the unseen data (dev_set, Y_dev) using accuracy as an evaluation metric, Y_train,
          validation_data=(dev_set, Y_dev), epochs=15);

Obviously I’ve omitted a lot of details to make the post shorter and easier to understand. But if you found this useful, I can follow this up with posts about different activation functions, other types of neural networks (e.g. CNNs, RNNs), evaluating performance of a network, problem of overfitting, tuning hyperparameters, and so on.