What is a neural network ?

Relu_network.png
A neural network, in essence, is a universal function approximator.

In the picture above, the blue function is approximated by a 6-neuron network, each neuron responsible for one segment of the fitted curve, using the Rectified Linear Unit activation, noted ReLU, which is detailed in the Activation functions section.

In other words, given enough samples drawn from an unknown Probability Distribution, we can approximate that distribution. This becomes useful in many fields of science where, given measurements, we wish to predict some continuous future behavior (regression) or a discrete category (classification).
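As a minimal sketch of this idea (assuming scikit-learn's MLPRegressor, which is not necessarily the tool used to produce the figure above), a 6-neuron ReLU network can be fitted to samples of a curve :

```python
# Minimal sketch: fit a 6-neuron ReLU network to samples of a 1D curve.
# scikit-learn is an assumed choice here; any MLP implementation would do.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(-3, 3, 200).reshape(-1, 1)   # sample points
y = np.sin(X).ravel()                         # the "unknown" function to approximate

model = MLPRegressor(hidden_layer_sizes=(6,), activation='relu',
                     solver='lbfgs', max_iter=5000)
model.fit(X, y)
y_hat = model.predict(X)                      # piecewise-linear approximation of sin(x)
```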

From Statistics, we assume that any measurement of a real-world phenomenon is a sample from a Probability Distribution, so we can consider neural networks an Anything-to-Anything model, as long as we can numerically quantify and measure the phenomenon.

There are many methods that achieve exactly that, from simple linear and polynomial regressions to decision trees, Random Forests, and most of Machine Learning.

The # Universal approximation theorem states that any function can be approximated by an arbitrarily large neural network (proven in the general sense) :

Multilayer feedforward networks are universal approximators

This paper rigorously establishes that standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available. In this sense, multilayer feedforward networks are a class of universal approximators.

In simple terms, a Borel measurable function is a rule that plays well with measurement: whenever you pick a measurable set of outputs, the set of inputs that lands in it is also measurable using the same kind of measurement tools (Borel sets). This is important in fields like probability and statistics, where we need functions to behave nicely with the kinds of sets we can measure.
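Formally (this is the standard textbook definition, not a statement taken from the paper above), a function between finite-dimensional spaces is Borel measurable when preimages of Borel sets are themselves Borel sets :

$$ f : \mathbb{R}^n \to \mathbb{R}^m \text{ is Borel measurable} \iff f^{-1}(B) \text{ is a Borel set for every Borel set } B \subseteq \mathbb{R}^m $$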

In practice, we restrict ourselves to differentiable functions so that we can compute their gradients for the famous Gradient Descent optimization algorithm.

How are they built ?

Lego_network.png
Neural networks are like logical Lego lasagna: they are modular like Lego bricks by design and stack layer by layer like a lasagna. When broken down to the modular level, they are pretty simple !

To understand more complex architectures, you only need to learn the new Lego blocks; the rest remains mostly the same !

The Perceptron

The perceptron is a fundamental concept in the field of artificial intelligence and machine learning, serving as the basic building block of neural networks. It is a simple model of a neuron, the basic unit of the brain.

In short, a perceptron takes inputs, applies weights, and produces an output based on an activation function. It was a pioneering concept, paving the way for more advanced neural networks.
Perceptron_neuron.png
How it works :

  • Inputs: These are the data points or features that the perceptron considers.
  • Weights: Each input has an associated weight, which determines the importance of that input, noted w1, w2, ..., wn here.
  • Bias: An additional parameter b that helps to fit the model better by shifting the activation function.
  • Activation Function: This function decides whether the perceptron "fires" (produces an output) based on the weighted sum of inputs. Commonly, a step function is used, which outputs 1 if the sum exceeds a threshold and 0 otherwise (a minimal code sketch follows this list).
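
As a minimal sketch (an illustrative example, not code from the original perceptron literature), the whole computation fits in a few lines :

```python
# Minimal perceptron sketch: weighted sum of inputs plus bias,
# passed through a step activation function.
import numpy as np

def perceptron(x, w, b):
    """Return 1 if the weighted sum w·x + b is positive, 0 otherwise."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: with these hand-picked weights the perceptron computes a logical AND.
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```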

Activation functions

The activation function decides whether and how strongly a neuron "fires" based on the weighted sum of its inputs. The original perceptron used a step function, which outputs 1 if the sum exceeds a threshold and 0 otherwise, but other choices are far more common today.

The most commonly used is the Rectified Linear Unit function, or ReLU :
ReLu.png
So for a given neuron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, bias $b$, and the ReLU base function $\mathrm{ReLU}(x) = \max(0, x)$, we have :

$$ y = \mathrm{ReLU}\left(\sum_{i=1}^{n} w_i x_i + b\right) = \max\left(0,\ \sum_{i=1}^{n} w_i x_i + b\right) $$

Without going into details, there is a whole family of activation functions used in practice depending on the problem, and choosing one is an important design decision when building a neural network.

activation_functions_family.png
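
As an illustrative sketch (the exact functions shown in the image may differ), a few common members of this family can be written directly as :

```python
# A few common activation functions (illustrative selection).
import numpy as np

def relu(x):     return np.maximum(0.0, x)          # max(0, x)
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1)
def tanh(x):     return np.tanh(x)                  # squashes to (-1, 1)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
```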

Multi-Layer Perceptron (MLP)

The simplest architecture (and the first to exist) is the following, connecting multiple perceptrons in layers, hence the name :

MLP.png

A Multi-Layer Perceptron (MLP) is a type of artificial neural network that consists of multiple layers of interconnected nodes, or "neurons," inspired by the structure of the human brain.

  1. Structure:

    • Input Layer: The first layer that receives input data. Each node represents a feature of the data.
    • Hidden Layers: One or more layers between the input and output layers. These layers perform computations and extract features from the input data.
    • Output Layer: The final layer that produces the output of the network. Each node represents a possible output or class.
  2. How It Works:

    • Data flows from the input layer through the hidden layers to the output layer.
    • Each node in a layer is connected to every node in the next layer.
    • Each connection has a weight that the network learns during training.
    • Nodes apply an activation function to the weighted sum of their inputs to introduce non-linearity, allowing the network to learn complex patterns.
  3. Learning Process:

    • The MLP learns by adjusting the weights of the connections using an algorithm called backpropagation.
    • During training, the network compares its predictions to the actual outputs and updates the weights to minimize the error.
    • This process is repeated over many iterations to improve the network's accuracy.
  4. Applications:

    • MLPs are used for various tasks, including classification, regression, and pattern recognition.
    • They are versatile and can be applied to problems in image and speech recognition, natural language processing, and more.

In summary, a Multi-Layer Perceptron is a neural network with multiple layers of interconnected nodes that learns to recognize patterns in data by adjusting its weights with the backpropagation algorithm. It is a foundational model in the field of deep learning.
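
To make the layer structure described above concrete, here is a minimal sketch of a forward pass through a 2-hidden-layer MLP (the layer sizes and the plain numpy implementation are illustrative assumptions, not taken from the article) :

```python
# Minimal MLP forward pass: each layer is a weight matrix, a bias vector,
# and an activation applied to the weighted sums.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Example shapes: 3 input features -> 8 hidden -> 8 hidden -> 2 outputs.
layers = [(rng.standard_normal((3, 8)), np.zeros(8)),
          (rng.standard_normal((8, 8)), np.zeros(8)),
          (rng.standard_normal((8, 2)), np.zeros(2))]

def forward(x, layers):
    """Propagate an input vector through every layer of the network."""
    for W, b in layers[:-1]:
        x = relu(x @ W + b)        # hidden layers: weighted sum + non-linearity
    W, b = layers[-1]
    return x @ W + b               # output layer: raw scores (no activation here)

print(forward(np.array([0.5, -1.2, 3.0]), layers))
```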

How to train your dragon neural network ?

Backpropagation & Gradient Descent

how_to_train_nn.png

The main algorithm behind training neural networks is called backpropagation, and it will be detailed further in its own article.
As a video is worth a thousand words, the best visual intuition for backpropagation is here, from 3Blue1Brown : # Backpropagation, intuitively | DL3.

In short, the backpropagation algorithm is a fundamental method used to train artificial neural networks, including Multi-Layer Perceptrons (MLPs).

It works by minimizing the error between the network's predicted outputs and the actual target values through a process called Gradient Descent. This algorithm belongs to a class of algorithms dedicated to solving optimization problems, where we seek the minimum of a given function as fast as possible.

During training, data is fed forward through the network to generate predictions. The error is then calculated using a loss function, which compares the prediction to the provided ground truth, and this error is propagated backward through the network. It is this error, or loss, that is minimized through Gradient Descent.

The algorithm computes the gradient of the loss function with respect to each weight by applying the chain rule of calculus, determining how much each weight contributed to the error. The weights are then updated in the opposite direction of the gradient to reduce the error, iteratively improving the network's performance.
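Concretely, each weight $w$ is nudged against its gradient (a standard formulation; $\eta$ denotes the learning rate, a hyperparameter not named above, and $L$ is the loss) :

$$ w \leftarrow w - \eta \, \frac{\partial L}{\partial w} $$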

This process is repeated over many epochs (full passes over the whole dataset), allowing the network to learn and refine its internal representations of the data.
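
As a toy end-to-end sketch (a single linear neuron trained on synthetic data with a mean squared error loss, both of which are illustrative assumptions), the whole loop looks like this :

```python
# Toy gradient descent: fit y = w*x + b to noisy data with an MSE loss.
# For one neuron the gradients can be written by hand; deep networks
# obtain them the same way via backpropagation (the chain rule).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, 100)   # "ground truth" to recover

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(200):                      # one epoch = one pass over the data
    y_hat = w * x + b                         # forward pass
    error = y_hat - y
    loss = np.mean(error ** 2)                # MSE loss
    grad_w = np.mean(2 * error * x)           # dL/dw via the chain rule
    grad_b = np.mean(2 * error)               # dL/db
    w -= lr * grad_w                          # step opposite to the gradient
    b -= lr * grad_b

print(w, b)   # should end up close to 3.0 and 0.5
```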

Voilà ! You now have a fully trained neural network ! 😎

Complements

There are many more steps to building a model fit for a complex problem that lie outside the model itself, such as :

  • hyperparameter tuning
  • Neural network architecture design such as Multi-task Learning or Multi-output Regression Neural Network
  • Hardware optimization with GPU for local ML workflow

and language-specific tasks such as learning python and its machine learning libraries like Tensorflow and Keras, Parallelism in python, Running jupyter or IDE on WSL2, or installing Linux for GPU usage using WSL2 Ubuntu 22.04+Windows 10.