Neural Networks

Feed-Forward Networks and Deep Learning

Huanfa Chen - huanfa.chen@ucl.ac.uk

09/02/2026

History

Image Credit: medium.com

History of neural net

Nearly everything we talk about today existed in 1990
What changed since then?
- More data
- Faster computers (GPUs)
- Some improvements: ReLU, dropout, Adam, batch normalisation, residual networks

Linear Regression as Neural Net

$y = \sum_i W_i x_i + b = Wx + b$

$y = g(Wx + b)$, where $g(z) = z$ (identity function)

Logistic regression as neural net

Logistic Regression as Neural Net

$y = \sigma(Wx + b)$

$\sigma(z) = \frac{1}{1 + e^{-z}}$
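As a minimal sketch of the two formulas above, the same unit can be written once and reused with a different output function $g$. The weights, bias and input are hypothetical values picked only so that the result is easy to check by hand.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def unit(x, W, b, g):
    """A single output unit: activation g applied to the affine map Wx + b."""
    return g(W @ x + b)

# Hypothetical weights and input, chosen so that Wx + b = 0.
W = np.array([0.5, -1.0, 2.0])
b = 0.5
x = np.array([1.0, 2.0, 0.5])

y_linear = unit(x, W, b, g=lambda z: z)   # linear regression: identity g
y_logistic = unit(x, W, b, g=sigmoid)     # logistic regression: sigmoid g
print(y_linear, y_logistic)
```

Only $g$ changes between the two models; the affine part $Wx + b$ is shared.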

Logistic regression as neural net

Basic Architecture of Neural Networks

$h(x) = f(W_1x + b_1)$, where $f$ is the hidden layer activation function

$o(x) = g(W_2h(x) + b_2)$, where $g$ is the output layer activation function
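The two equations translate almost line by line into NumPy. This is a sketch under assumptions: the layer sizes (3 inputs, 4 hidden units, 1 output), the random weights, and the choice of ReLU for $f$ and identity for $g$ are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, W1, b1, W2, b2, f=relu, g=lambda z: z):
    """Forward pass of a network with one hidden layer."""
    h = f(W1 @ x + b1)       # hidden layer h(x)
    return g(W2 @ h + b2)    # output layer o(x)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # 4 hidden units -> 1 output

x = np.array([1.0, -2.0, 0.5])
o = forward(x, W1, b1, W2, b2)
print(o.shape)
```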

Basic neural network architecture

Output layer activation function $g$

For regression: $g$ is the identity function, $g(z) = z$
For binary classification: $g$ is the sigmoid function, $g(z) = \sigma(z)$ (outputs a probability in $[0, 1]$)
For multi-class classification: $g$ is the softmax function, $g(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$ (outputs a probability distribution over classes)
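A possible softmax implementation is sketched below. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1D vector of scores."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # avoids overflow in exp; result is unchanged
    e = np.exp(shifted)
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())
```

The outputs are non-negative and sum to one, so they can be read as class probabilities.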

Basic neural network architecture

Nonlinear Activation Functions $f$

The primary job of $f$ is to break the linearity of the model; without it, stacked layers collapse into a single linear map
Standard choices: tanh (pronounced like “than”) or ReLU (rectified linear unit, pronounced /ray-loo/)
Tanh squashes values into the range (-1, 1) and saturates for inputs of large magnitude
ReLU is zero for negative inputs and the identity for positive inputs
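The sketch below illustrates why a non-linear $f$ is needed: with the identity as activation, two layers give exactly the same output as one linear layer with weights $W_2 W_1$. The shapes and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Identity activation: two layers equal one linear layer with weights W2 @ W1.
two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x
collapses = np.allclose(two_linear_layers, one_linear_layer)

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)        # zero for negative inputs, identity otherwise
tanh_out = np.tanh(z)     # squashed into (-1, 1)
print(collapses, relu_out, tanh_out)
```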

Nonlinear activation functions

More Layers

Neural network with many layers

Hidden layers usually all have the same non-linear function
Other names: Multilayer perceptron, feed-forward neural network
Many layers → “deep learning”

Supervised Neural Networks

Non-linear models for classification and regression
Work well for very large datasets
Notoriously slow to train; often need GPUs
Use dot products $Wx$, so features need scaling first, as for SVM or PCA and unlike trees
Many variants: Convolutional nets, GRUs, LSTMs, recursive networks, VAEs, GANs, deep RL

Training Objective

$h(x) = f(W_1x+b_1)$

$o(x) = g(W_2h(x)+b_2) = g(W_2f(W_1x + b_1) + b_2)$

The objective is to minimise the loss function, which measures the difference between the true and predicted $y$ values

$\min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,o(x_i))$

$= \min_{W_1,W_2,b_1,b_2} \sum\limits_{i=1}^N l(y_i,g(W_2f(W_1x_i+b_1)+b_2))$

$l$ = Squared loss for regression; Cross-entropy loss for classification
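The two per-sample losses might be written as follows. This sketch covers the binary case of cross-entropy only, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math

def squared_loss(y, o):
    """Squared loss for regression."""
    return (y - o) ** 2

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for a binary label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)   # keep p away from 0 and 1
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

loss_reg = squared_loss(3.0, 2.5)
loss_good = binary_cross_entropy(1, 0.9)   # confident and correct: small loss
loss_bad = binary_cross_entropy(1, 0.1)    # confident and wrong: large loss
print(loss_reg, loss_good, loss_bad)
```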

Backpropagation

Backpropagation in neural networks

Backpropagation

Need $\frac{\partial l(y, o)}{\partial W_i}$ and $\frac{\partial l(y, o)}{\partial b_i}$

$\text{net}(x) := W_1x + b_1$

Backpropagation equations
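As an illustration of the chain rule at work, here is a sketch of the backward pass for a network with one hidden layer. The specific choices are assumptions made for the example: tanh as $f$, identity as $g$, squared loss, a single scalar output, and random parameters. A finite-difference estimate is used to check one entry of the analytic gradient.

```python
import numpy as np

def loss_and_grads(x, y, W1, b1, W2, b2):
    # Forward pass, keeping the intermediate values needed later
    net = W1 @ x + b1
    h = np.tanh(net)
    o = W2 @ h + b2
    loss = float(((o - y) ** 2).sum())
    # Backward pass: apply the chain rule from the output towards the input
    d_o = 2.0 * (o - y)              # dl/do
    dW2 = np.outer(d_o, h)           # dl/dW2
    db2 = d_o                        # dl/db2
    d_h = W2.T @ d_o                 # dl/dh
    d_net = d_h * (1.0 - h ** 2)     # tanh'(net) = 1 - tanh(net)^2
    dW1 = np.outer(d_net, x)         # dl/dW1
    db1 = d_net                      # dl/db1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([0.7])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

loss, (dW1, db1, dW2, db2) = loss_and_grads(x, y, W1, b1, W2, b2)

# Central finite difference on a single weight as a sanity check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(dW1[0, 0], numeric)
```

The forward pass caches `h` so that the backward pass can reuse it, which is what makes a single backward sweep sufficient.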

Gradient Computation

Backpropagation is a clever application of the chain rule for derivatives
A single backward pass from output to input computes all the derivatives
It is an efficient way to compute gradients
Training a NN is an optimisation problem: find the parameters $W_i$ and $b_i$ that minimise the loss function
This problem is non-convex and challenging; usually only a local optimum is found

Recap on gradient

Optimise $\arg\min_w F(w)$ by stepping along $-\nabla F(w)$
Update: $w_{i+1} = w_i - \eta_i \nabla F(w_i)$
With a suitable step size $\eta_i$, converges to a local minimum
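A one-dimensional sketch of the update rule, assuming the toy objective $F(w) = (w - 3)^2$ and a constant step size $\eta = 0.1$ (both chosen only for illustration):

```python
def grad_F(w):
    """Gradient of the toy objective F(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0      # starting point
eta = 0.1    # constant step size
for i in range(100):
    w = w - eta * grad_F(w)   # step against the gradient

print(w)
```

Each step shrinks the distance to the minimiser by a factor of 0.8, so `w` ends up very close to 3.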

Gradient 1D

ReLU Differentiability

ReLU differentiability

ReLU is not differentiable at zero. For x>0 the gradient is 1; for x<0 the gradient is 0
For a function to be differentiable at a point, the slope must be the same whether approaching from the left or the right
Use subgradient descent; at zero, any slope between 0 and 1 is a valid subgradient
In practice, most DL frameworks simply hardcode the gradient at zero to a fixed value such as 0 or 0.5
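A small sketch of this convention follows. The parameter name `at_zero` is hypothetical and simply makes the hardcoded choice at zero explicit.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x, at_zero=0.0):
    """Derivative of ReLU, with a fixed convention at the kink x = 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero   # any value in [0, 1] is a valid subgradient here

print(relu_grad(2.0), relu_grad(-1.0), relu_grad(0.0), relu_grad(0.0, at_zero=0.5))
```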

Optimising W, b

Usually, we don’t use all the data to compute the gradient at each step, as the batch update does:

Batch: $W_i \leftarrow W_i - \eta\sum\limits_{j=1}^N \frac{\partial l(y_j,o(x_j))}{\partial W_i}$

Instead, we can use a single sample or a small subset of the data to estimate the gradient.

Online/Stochastic: $W_i \leftarrow W_i - \eta\frac{\partial l(y_j,o(x_j))}{\partial W_i}$

Minibatch: $W_i \leftarrow W_i - \eta\sum\limits_{j=k}^{k+m} \frac{\partial l(y_j,o(x_j))}{\partial W_i}$
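The minibatch update can be sketched on a linear model with squared loss. Everything specific here is an assumption for illustration: the noise-free synthetic data $y = 2x + 1$, the batch size, the step size, and the reshuffling of the data at every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, eta = 200, 20, 0.005            # samples, minibatch size, step size
X = rng.normal(size=(N, 1))
y = 2.0 * X[:, 0] + 1.0               # noise-free targets, easy to verify

w, b = 0.0, 0.0
for epoch in range(50):
    order = rng.permutation(N)        # visit the data in a new order each epoch
    for k in range(0, N, m):
        idx = order[k:k + m]          # indices of the current minibatch
        err = w * X[idx, 0] + b - y[idx]
        # Gradient of the summed squared loss over the minibatch
        grad_w = 2.0 * (err * X[idx, 0]).sum()
        grad_b = 2.0 * err.sum()
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)
```

Setting `m = N` recovers the batch update and `m = 1` the online/stochastic update.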

Learning Heuristics

$\eta$ controls the step size of gradient descent and can be adjusted during training
A constant $\eta$ is rarely a good choice
Can decrease $\eta$ over time
Better: adaptive $\eta$ for each entry of $W_i$
State-of-the-art: Adam (Adaptive Moment Estimation)

Adam optimiser

The logic: if the gradient for a parameter is consistently large, decrease its learning rate and slow down;
if the gradient is consistently small, increase the learning rate for that parameter.
The process: keep track of the first moment (mean) and second moment (uncentred variance) of the gradients; use them to adapt the learning rate for each parameter

Update rule of Adam optimiser

Estimate 1st moment: $m_t = \beta_1 m_{t-1} + (1 - \beta_1)g_t$
Estimate 2nd moment: $v_t = \beta_2 v_{t-1} + (1 - \beta_2)g_t^2$
Bias correction: as $m_t$ and $v_t$ start at zero, they are biased towards zero during the first few steps; Adam “warms them up” by rescaling: $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
Update Weight: $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$
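The four steps can be sketched for a single scalar parameter. The bias-corrected estimates are computed as $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$. The toy objective $(\theta - 3)^2$, the step size and the number of steps are assumptions for illustration; the values of `beta1`, `beta2` and `eps` are the commonly used defaults.

```python
import math

def adam_minimise(grad, theta, eta=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimise a scalar function with Adam, given its gradient."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # estimate of the 1st moment
        v = beta2 * v + (1 - beta2) * g * g    # estimate of the 2nd moment
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective (theta - 3)^2 with gradient 2 * (theta - 3)
theta_star = adam_minimise(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_star)
```

For a vector of parameters the same arithmetic is applied elementwise, which is what gives each parameter its own effective learning rate.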

Picking Optimisation Algorithms

Small dataset: off-the-shelf like L-BFGS (quicker than Adam on small datasets)
Big dataset: Adam / RMSprop
Have time & nerve: tune the schedule

Neural Nets with sklearn

Neural nets with sklearn

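from sklearn.neural_network import MLPClassifier

# Fit a default MLP with the L-BFGS solver, then report train and test accuracy
mlp = MLPClassifier(solver='lbfgs', random_state=0).fit(X_train, y_train)
print(mlp.score(X_train, y_train))
print(mlp.score(X_test, y_test))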

Use sklearn’s neural nets for toy problems only, not for real projects
Why? sklearn’s MLP is not optimised for large datasets, has no GPU support, and does not support conv nets, etc.

Complexity Control

Number of parameters: hidden layers, hidden units
Regularisation
Early Stopping
Dropout

NNs are prone to overfitting and sensitive to random seeds

Effect of random state

The network has far more capacity than needed and can overfit in many ways
Regularisation might make it less dependent on initialisation

Regularisation

Regularisation works by adding a penalty term to the loss function $L = \sum\limits_{i=1}^N l(y_i, o(x_i))$
L2 regularisation: add $\lambda \sum\limits_i W_i^2$ to the loss function to penalise large weights
L1 regularisation: add $\lambda \sum\limits_i |W_i|$ to the loss function to encourage many weights to be exactly zero
Dropout (not a loss penalty): randomly set some activations to zero during training; prevents co-adaptation of neurons
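A sketch of the three techniques in NumPy. The weight values and $\lambda$ are made up for illustration. The dropout shown is the "inverted" variant, which rescales the surviving activations by $1/(1 - \text{rate})$ so that their expected value is unchanged; that rescaling is an assumption beyond what is stated above.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lambda times the sum of squared weights over all weight matrices."""
    return lam * sum(float((W ** 2).sum()) for W in weights)

def l1_penalty(weights, lam):
    """lambda times the sum of absolute weights over all weight matrices."""
    return lam * sum(float(np.abs(W).sum()) for W in weights)

def dropout(h, rate, rng):
    """Zero each activation with probability `rate`; rescale the survivors."""
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

weights = [np.array([[1.0, -2.0], [0.0, 3.0]])]
pen_l2 = l2_penalty(weights, lam=0.1)   # 0.1 * (1 + 4 + 0 + 9)
pen_l1 = l1_penalty(weights, lam=0.1)   # 0.1 * (1 + 2 + 0 + 3)

rng = np.random.default_rng(0)
h_drop = dropout(np.ones(1000), rate=0.5, rng=rng)
print(pen_l2, pen_l1, h_drop.mean())
```

The penalty is added to the data loss during training; dropout is applied to hidden activations during training only and switched off at prediction time.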

Hidden Layer Size

mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(5,), random_state=10)
mlp.fit(X_train, y_train)

Hidden layer size effect

Single hidden layer with 5 units
Each unit corresponds to different part of decision boundary

Multiple Hidden Layers

mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10, 10, 10), random_state=0)
mlp.fit(X_train, y_train)

Multiple hidden layers

3 hidden layers each of size 10
Main way to control complexity

Activation Functions

mlp = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(10, 10, 10),
                    activation='tanh', random_state=0)
mlp.fit(X_train, y_train)

Activation functions

Using tanh gives smoother boundaries
ReLU doesn’t work as well with L-BFGS on small networks
For large networks, ReLU is preferred

Regression

Neural network regression

from sklearn.neural_network import MLPRegressor
mlp_relu = MLPRegressor(solver="lbfgs").fit(X, y)
mlp_tanh = MLPRegressor(solver="lbfgs", activation='tanh').fit(X, y)

Grid-Searching Neural Nets

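import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, stratify=data.target, random_state=0)

# Scale the features, then search over the L2 penalty strength alpha
pipe = make_pipeline(StandardScaler(), MLPClassifier(solver="lbfgs", random_state=1))
param_grid = {'mlpclassifier__alpha': np.logspace(-3, 3, 7)}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)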

Grid search results

Searching Hidden Layer Sizes

from sklearn.model_selection import GridSearchCV
pipe = make_pipeline(StandardScaler(), MLPClassifier(solver="lbfgs", random_state=1))
param_grid = {'mlpclassifier__hidden_layer_sizes':
              [(10,), (50,), (100,), (500,), (10, 10), (50, 50), (100, 100), (500, 500)]}
grid = GridSearchCV(pipe, param_grid)
grid.fit(X_train, y_train)

Search hidden layer sizes

GPU Support

GPU performance

Important limitation: GPUs typically have much less memory than the host system’s RAM
Memory copies between RAM and GPU are expensive

CPU vs. GPU

Feature	CPU	GPU
Architecture	Few cores optimised for sequential tasks	Thousands of cores for parallel processing
Memory	Larger cache size, less memory bandwidth	Higher memory bandwidth, less cache size
Power Consumption	Generally lower power consumption	Higher power consumption due to many cores
Use Cases	General-purpose computing, complex logic	Graphics rendering, deep learning, scientific simulations
Cost	Less expensive	Expensive
Programming Complexity	Easier to program for general tasks	Requires knowledge of parallel programming (e.g., CUDA)

How much GPU power do you need?

Task Type	Typical GPU	Total GPU Hours	Estimated Cost (Cloud)
Small Project (MNIST, basic CNN)	1x RTX 3060 / 4060	< 1 hour	< $1
Fine-tuning 7B LLM (e.g., Llama 3 8B)	1x A100 (80GB)	5–20 hours	$10 – $40
Training Mid-size Model (e.g., Stable Diffusion)	8x A100 Cluster	500–2,000 hours	$1,000 – $5,000
Pre-training Large LLM (GPT-3/4 scale)	10,000+ H100s	Millions of hours	$50M – $100M+

Computational Graph

Computation graph

A ‘blueprint’: a directed acyclic graph representing the computations in a NN
Nodes represent variables or operations (e.g., matrix multiplication, activation functions), edges represent flow of data (tensors) between operations
Given limited GPU memory, important to know what to cache/discard
Helps with visual debugging and understanding network structure
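To make the idea concrete, here is a toy computation graph with reverse-mode differentiation for scalars. It is a teaching sketch, not how any particular framework is implemented; the class `Node` and the helper functions are hypothetical names.

```python
class Node:
    """A value in the graph plus links to the nodes it was computed from."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuple of (parent_node, local_derivative)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, ((a, 1.0), (b, 1.0)))

def mul(a, b):
    return Node(a.value * b.value, ((a, b.value), (b, a.value)))

def relu(a):
    return Node(max(0.0, a.value), ((a, 1.0 if a.value > 0 else 0.0),))

def backward(output):
    """Propagate d(output)/d(node) to every node, from output to inputs."""
    order, seen = [], set()

    def visit(node):             # depth-first topological sort of the DAG
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):             # consumers before producers
        for parent, local in node.parents:
            parent.grad += local * node.grad # chain rule along each edge

w, x, b = Node(2.0), Node(3.0), Node(-1.0)
out = relu(add(mul(w, x), b))    # relu(w * x + b)
backward(out)
print(out.value, w.grad, x.grad, b.grad)
```

Each operation stores only its local derivatives, and the backward sweep visits every edge once.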

Deep Learning Framework Requirements

Autodiff
GPU support
Optimisation and inspection of computation graph
On-the-fly generation of computation graph (optional)
Distribution over multiple GPUs and/or cluster (optional)

Current choices: PyTorch / Torch, TensorFlow

Deep Learning Libraries

PyTorch (torch) (default for AI research & development)
Keras (TensorFlow, CNTK, Theano) (enterprise production)
Chainer (chainer)
MXNet (MXNet)

Quick Look at TensorFlow

"Down to the metal" - don’t use for everyday tasks
Three steps for learning:
1. Build computation graph (using array operations and functions)
2. Create Optimizer (gradient descent, adam, etc.) attached to graph
3. Run actual computation

TensorFlow basics

PyTorch Example

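import torch

dtype = torch.float
device = torch.device("cpu")

N = 100  # number of samples
x = torch.randn(N, 1, device=device, dtype=dtype)
y = torch.randn(N, 1, device=device, dtype=dtype)
# One weight matrix mapping 1 input feature to 1 output; track its gradient
w1 = torch.randn(1, 1, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    y_pred = x.mm(w1)                   # forward pass
    loss = (y_pred - y).pow(2).sum()    # keep as a tensor so autograd can backpropagate
    loss.backward()                     # compute d(loss)/d(w1)
    with torch.no_grad():               # update the weights outside the graph
        w1 -= learning_rate * w1.grad
        w1.grad.zero_()                 # reset the gradient for the next step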

Best Practices

Don’t go down to the metal (i.e. write low-level code) unless you have to!
Don’t write TensorFlow, write Keras!
Don’t write PyTorch, write torch.nn or FastAI (or Skorch or Ignite)

Convolutional Neural Networks

Idea #1 Behind CNNs

Translation invariance: CNN can recognise an object regardless of where it is in the frame.
… because the same filters scan every part of the image (weight sharing)

Translation invariance in CNNs

Idea #2 Behind CNNs

Weight sharing: the principle that the same set of weights (a filter, or kernel) is used to scan every part of an image
So, this filter will detect a specific feature (e.g. edge, curve), regardless of its position
Each filter corresponds to a specific feature or pattern, rather than location-specific information
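A 1D sketch of weight sharing: one small filter is applied at every position, so it responds to the same pattern wherever it occurs. The signals and the two-tap edge filter are made-up examples.

```python
def slide_filter(signal, kernel):
    """Apply the same kernel at every position of the signal (no padding)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_filter = [-1.0, 1.0]          # responds to a step from low to high

signal_a = [0, 0, 1, 1, 1, 1]      # step early in the signal
signal_b = [0, 0, 0, 0, 1, 1]      # the same step, shifted to the right

resp_a = slide_filter(signal_a, edge_filter)
resp_b = slide_filter(signal_b, edge_filter)
print(resp_a)
print(resp_b)
```

The filter fires with the same strength in both cases, and the position of the response moves with the feature. Only two weights are learned, however long the signal is.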

Definition of Convolution

\[(f*g)[n] = \sum\limits_{m=-\infty}^\infty f[m]g[n-m]\]

\[= \sum\limits_{m=-\infty}^\infty f[n-m]g[m]\]
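For finite sequences (treated as zero outside their support), the sum can be sketched directly. The example values are arbitrary; swapping the arguments illustrates that the two forms of the definition above give the same result.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(f) + len(g) - 1)
    for m, f_m in enumerate(f):
        for k, g_k in enumerate(g):
            out[m + k] += f_m * g_k   # contributes f[m] * g[n - m] with n = m + k
    return out

f = [1.0, 2.0, 3.0]
g = [0.0, 1.0, 0.5]
fg = convolve(f, g)
gf = convolve(g, f)
print(fg)   # [0.0, 1.0, 2.5, 4.0, 1.5]
```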

Convolution definition

1D Example: Gaussian Smoothing

Gaussian smoothing

Convolutions in 2D

2D convolution illustration

source: Arden Dertat

2D Convolution Animation

2D convolution animation

source: Arden Dertat

2D Smoothing

2D smoothing

2D Gradients

2D gradients

Max Pooling

Max pooling

Need to remember position of maximum for back-propagation
Again not differentiable → subgradient descent
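The sketch below shows 2x2 max pooling with stride 2 on a small made-up image. The forward pass records where each maximum came from, and the backward pass routes each incoming gradient to that position only. Function names are hypothetical.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2; also returns the position of each maximum."""
    pooled, positions = [], []
    for r in range(0, len(image) - 1, 2):
        row_vals, row_pos = [], []
        for c in range(0, len(image[0]) - 1, 2):
            window = [(image[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, pos = max(window, key=lambda t: t[0])
            row_vals.append(value)
            row_pos.append(pos)
        pooled.append(row_vals)
        positions.append(row_pos)
    return pooled, positions

def max_pool_backward(grad_out, positions, shape):
    """Send each output gradient back to the input position that was the maximum."""
    grad_in = [[0.0] * shape[1] for _ in range(shape[0])]
    for i, row in enumerate(positions):
        for j, (r, c) in enumerate(row):
            grad_in[r][c] += grad_out[i][j]
    return grad_in

image = [[1, 3, 2, 0],
         [4, 2, 1, 5],
         [0, 1, 7, 2],
         [2, 6, 3, 1]]
pooled, positions = max_pool_2x2(image)
grad_in = max_pool_backward([[1.0, 1.0], [1.0, 1.0]], positions, (4, 4))
print(pooled)      # [[4, 5], [6, 7]]
print(positions)
```

All non-maximal inputs receive a gradient of zero, which is the subgradient choice mentioned above.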

Convolutional Neural Networks

Convolutional neural network

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner: Gradient-based learning applied to document recognition, Proceedings of the IEEE, 1998

Other Architectures

Other architectures

Conv-nets with Keras

Preparing Data

import keras
from keras.datasets import mnist

batch_size = 128
num_classes = 10
epochs = 12

img_rows, img_cols = 28, 28

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Add a trailing channel dimension: (n_samples, 28, 28) -> (n_samples, 28, 28, 1)
X_train_images = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
X_test_images = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
input_shape = (img_rows, img_cols, 1)

# One-hot encode the labels
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

Create Tiny Network

from keras.models import Sequential
from keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

num_classes = 10
cnn = Sequential()
cnn.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Conv2D(32, (3, 3), activation='relu'))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())
cnn.add(Dense(64, activation='relu'))
cnn.add(Dense(num_classes, activation='softmax'))
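The parameter count of the network defined above can be worked out by hand. The sketch assumes the Keras defaults for these layers (no padding and stride 1 for the convolutions, stride equal to the pool size for pooling) and a 28x28x1 input; the helper names are hypothetical.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights of every filter plus one bias per filter."""
    return kernel_h * kernel_w * in_channels * filters + filters

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

# Shapes: 28x28x1 -> conv 3x3 -> 26x26x32 -> pool 2x2 -> 13x13x32
#         -> conv 3x3 -> 11x11x32 -> pool 2x2 -> 5x5x32 -> flatten -> 800
conv1 = conv2d_params(3, 3, 1, 32)      # 320
conv2 = conv2d_params(3, 3, 32, 32)     # 9248
flat = 5 * 5 * 32                       # 800
dense1 = dense_params(flat, 64)         # 51264
dense2 = dense_params(64, 10)           # 650
total = conv1 + conv2 + dense1 + dense2
print(total)                            # 61482
```

Most of the parameters sit in the first dense layer; the convolutional layers are comparatively cheap because their weights are shared across positions.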

Number of Parameters

Convolutional Network for MNIST

CNN parameters

Dense Network for MNIST

Dense parameters

Train and Evaluate

cnn.compile("adam", "categorical_crossentropy", metrics=['accuracy'])
history_cnn = cnn.fit(X_train_images, y_train,
                      batch_size=128, epochs=20, verbose=1, validation_split=.1)
cnn.evaluate(X_test_images, y_test)

 9952/10000 [============================>.] - ETA: 0s
 [0.089020583277629253, 0.98429999999999995]

Train and evaluate

Visualise Filters

weights, biases = cnn_small.layers[0].get_weights()
weights2, biases2 = cnn_small.layers[2].get_weights()
print(weights.shape)
print(weights2.shape)

(3,3,1,8)
(3,3,8,8)

Visualize filters

Learned Features

Learned features

MNIST and Permuted MNIST

MNIST original

import numpy as np

# Apply one fixed random pixel permutation to every image
rng = np.random.RandomState(42)
perm = rng.permutation(784)
X_train_perm = x_train.reshape(-1, 784)[:, perm].reshape(-1, 28, 28)
X_test_perm = x_test.reshape(-1, 784)[:, perm].reshape(-1, 28, 28)

MNIST permuted

Summary

We’ve covered the architecture and training of neural networks
We also covered the idea and implementation of convolutional neural networks

Questions?