IF3270 Pembelajaran Mesin · Midterm Study Notes

Neural Networks
from Scratch

Perceptron · Feedforward Neural Network · Convolutional Neural Network
Perceptron · Delta Rule / Adaline · Gradient Descent · FFNN / MLP · Backpropagation · Activation Functions · CNN · Convolution · Pooling · LeNet-5
References: Goodfellow et al. (2016) Deep Learning · LeCun et al. (1989, 1998) · Mitchell (1997) · Raschka et al. (2022)
01 · Perceptron

The Perceptron

The foundation of all neural networks — a single artificial neuron that learns to classify linearly separable data. (Rosenblatt 1958; Mitchell 1997, Ch. 4; Raschka 2022, Ch. 11)

1.1 Biological Motivation

McCulloch and Pitts (1943) described a neuron as a simple logic gate with binary outputs: multiple signals arrive at the dendrites, are integrated in the cell body, and if the accumulated signal exceeds a threshold, an output signal is generated along the axon.

Dendrites → x
Receive input signals. Analogous to input features \(x_1,x_2,\ldots,x_m\).
Synapse strength → w
Connection strength. Analogous to weights \(w_i\) — the importance / contribution of each input.
Cell body → Σ
Integrates all incoming signals. Analogous to the weighted sum \(\mathbf{w}^T\mathbf{x}\).
Axon firing → f(net)
Fires (outputs signal) only when the threshold is exceeded. Analogous to the activation function.

1.2 Formal Definition of a Neuron

A neuron takes a real-valued input \(\mathbf{x}=(x_1,\ldots,x_m)\), a weight vector \(\mathbf{w}=(w_1,\ldots,w_m)\), and a bias \(w_0\) (also written \(b\) or \(\theta\)). The hypothesis space is the set of all possible weight vectors: \(H = \{\mathbf{w}\mid\mathbf{w}\in\mathbb{R}^{m+1}\}\).

Net Input (Linear Combination)
\[ z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \mathbf{w}^T\mathbf{x} \]
\(w_0\) is the bias/threshold weight; the bias input \(x_0 = +1\) always.
Sign Function (output ±1) — Perceptron Training Rule
\[ \hat{y} = \sigma(z) = \begin{cases} +1 & z > 0 \\ -1 & \text{otherwise} \end{cases} \]
Step Function (output 0/1) — Alternative Notation
\[ \hat{y} = \sigma(z) = \begin{cases} 1 & z \geq 0 \\ 0 & \text{otherwise} \end{cases} \]
Both produce a discrete-valued output. The perceptron represents a hyperplane decision surface in \(\mathbb{R}^n\): a line in \(\mathbb{R}^2\), a plane in \(\mathbb{R}^3\), and an \((n-1)\)-dimensional subspace in \(\mathbb{R}^n\).
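The definitions above can be sketched in a few lines of NumPy (an illustrative helper, not from the source; the function name is a choice). Below it is applied with the AND weights \(w_0=-0.8,\ w_1=w_2=0.5\) used later in the worked example:

```python
import numpy as np

def neuron_output(x, w, w0):
    """Single artificial neuron: net input z = w0 + w·x, then sign activation."""
    z = w0 + np.dot(w, x)       # net input (linear combination)
    return 1 if z > 0 else -1   # sign function: +1 if z > 0, else -1

# AND gate: fires only when both inputs are 1 (z = -0.8 + 0.5 + 0.5 = 0.2 > 0)
print(neuron_output(np.array([1, 1]), np.array([0.5, 0.5]), -0.8))   # +1
print(neuron_output(np.array([1, 0]), np.array([0.5, 0.5]), -0.8))   # -1
```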

1.3 m-of-n Functions

A perceptron easily represents m-of-n functions: functions where at least \(m\) of the \(n\) inputs must be true. Strategy: set all input weights to the same value (e.g., 0.5) then set the threshold \(w_0\).

Perceptron Representations
\[ \text{AND } (m=n): \quad \hat{y} = \text{sign}(x_1 + x_2 - 1) \] \[ \text{OR } (m=1): \quad \hat{y} = \text{sign}(2x_1 + 2x_2 - 1) \]
Example: AND with \(w_0=-0.8,\; w_1=w_2=0.5\). Compute \(\Sigma = -0.8 + 0.5x_1 + 0.5x_2\) and apply sign. Only when \(x_1=x_2=1\) gives \(\Sigma = 0.2 > 0 \Rightarrow +1\).
The XOR Problem (Minsky & Papert, 1969): A single perceptron can only solve linearly separable problems. XOR cannot be separated by any hyperplane — no single perceptron can learn it. This drove the development of multi-layer networks.

1.4 Perceptron Training (Learning) Rule

Uses the thresholded output for the update signal. Proven to converge if training data is linearly separable and learning rate \(\eta\) is sufficiently small.

Perceptron Training Rule — Mitchell (1997)
Input: training data \(D = \{\langle\mathbf{x}_i, y_i\rangle\}\), learning rate \(\eta\), activation: sign or step.
Init: set each \(w_k \leftarrow 0\) or a small random value.
Repeat: set \(E \leftarrow 0\) at the start of each epoch; then for each example \(\langle\mathbf{x}_i, y_i\rangle\):
3a. Compute the prediction: \(o_i \leftarrow \text{sign}(\mathbf{w}^T\mathbf{x}_i)\)
3b. If \(o_i \neq y_i\):  \(w_j \leftarrow w_j + \Delta w_j\)  where  \(\Delta w_j = \eta(y_i - o_i)x_{ij}\)
3c. Update the error count: \(E \leftarrow E + |y_i - o_i|\)
Until: all examples are correctly classified (\(E = 0\)), or the maximum number of epochs is reached. Return \(\mathbf{w}\).
Perceptron Weight Update Rule
\[ w_j \leftarrow w_j + \Delta w_j, \qquad \Delta w_j = \eta\,(y_i - o_i)\,x_{ij} \]
\(\eta\) = learning rate · \(y_i\) = true label (±1) · \(o_i\) = predicted label (thresholded) · \(x_{ij}\) = \(j\)-th feature of example \(i\). If correct prediction: \(y_i = o_i \Rightarrow \Delta w_j = 0\) (no update).
Convergence Theorem: The Perceptron Training Rule is guaranteed to converge in a finite number of steps if the training data is linearly separable and \(\eta\) is sufficiently small. It does NOT converge if data is not linearly separable.
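The training loop above translates directly to NumPy. This is a minimal sketch (function name, \(\eta\), and epoch cap are illustrative choices), trained below on the linearly separable AND data with ±1 labels:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron training rule (Mitchell 1997): update only on misclassification."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend bias input x0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            oi = 1 if w @ xi > 0 else -1       # thresholded (sign) output
            if oi != yi:
                w += eta * (yi - oi) * xi      # Δw = η (y − o) x
                errors += 1
        if errors == 0:                        # converged: all correct
            break
    return w

# AND function, linearly separable → guaranteed finite convergence
w = train_perceptron(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                     np.array([-1, -1, -1, 1]))
```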

1.5 Batch Gradient Descent — The Delta Rule

When data is NOT linearly separable, the Perceptron rule fails. The Delta Rule (= Widrow-Hoff rule = LMS rule = Adaline rule) uses the unthresholded linear output and minimizes SSE — always converges to best-fit approximation.

Error Function — Sum of Squared Errors (SSE)
\[ E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}(y_d - o_d)^2 \]
\(o_d = \mathbf{w}^T\mathbf{x}_d\) is the linear (unthresholded) output. The \(\frac{1}{2}\) factor simplifies the derivative.
Gradient Vector (Partial Derivatives)
\[ \nabla E(\mathbf{w}) = \left[\frac{\partial E}{\partial w_0},\;\frac{\partial E}{\partial w_1},\;\ldots,\;\frac{\partial E}{\partial w_d}\right] \] \[ \frac{\partial E}{\partial w_k} = \frac{\partial}{\partial w_k}\frac{1}{2}\sum_{d\in D}(y_d - o_d)^2 = \sum_{d\in D}(y_d - o_d)(-x_{dk}) \]
The gradient gives the magnitude and direction of steepest ascent in the error landscape. We subtract it to descend.
Training Rule: Batch Gradient Descent
\[ \mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}, \qquad \Delta\mathbf{w} = -\eta\,\nabla E(\mathbf{w}) \] \[ \Delta w_k = -\eta\,\frac{\partial E}{\partial w_k} = \eta\sum_{d \in D}(y_d - o_d)(x_{dk}) \]
Weights updated once per epoch after accumulating \(\Delta w_k\) over all \(D\). \(\eta\) = gradient descent step size.
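Batch gradient descent on the SSE can be sketched as follows (a minimal illustration; the defaults for \(\eta\) and the epoch count are arbitrary choices). Note the weight update uses the accumulated gradient over all of \(D\), once per epoch:

```python
import numpy as np

def delta_rule_batch(X, y, eta=0.05, epochs=500):
    """Delta rule / batch GD on SSE, using the unthresholded linear output."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias column x0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                              # linear outputs o_d = w·x_d
        w += eta * X.T @ (y - o)               # Δw_k = η Σ_d (y_d − o_d) x_dk
    return w

# Best-fit line through y = 2x + 1: converges to w ≈ (1, 2)
w = delta_rule_batch(np.array([[0.], [1.], [2.], [3.]]),
                     np.array([1., 3., 5., 7.]))
```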

1.6 Stochastic & Mini-Batch Gradient Descent

Stochastic GD (SGD) updates weights after each individual example — an incremental update.

Batch GD
\[ \Delta w_k = \eta\sum_{d \in D}(y_d - o_d)(x_{dk}) \]

One weight update per epoch. Stable, but slow for large datasets. Can get stuck in local minima.

Stochastic (Incremental) GD
\[ \Delta w_k = \eta\,(y_d - o_d)(x_{dk}) \]

One update per example. Faster, noisier gradient — can escape local minima. Both Batch and SGD use the linear function and give best-fit approximation.

Epoch
One full pass through all training data (or all mini-batches). Features and labels processed once.
Batch Size
Number of samples per mini-batch split. If = 1 → true SGD. If = |D| → Batch GD. If 1 < b < |D| → Mini-batch SGD.
Step
Processing one mini-batch, ending with one weight update. Steps per epoch = |D| / batch_size.
Key Distinction: Perceptron Training Rule uses thresholded output (sign/step) for updates — only converges if linearly separable. Delta Rule / Batch GD uses unthresholded linear output — converges to best-fit regardless of separability.
02 · Feedforward Neural Network

Feedforward Neural Network (FFNN)

Also called Multi-Layer Perceptron (MLP) — stacks layers of neurons with non-linear activations to learn complex mappings. Connections flow only forward: a directed acyclic graph. (Raschka 2022 Ch.11; Goodfellow et al. 2016)

2.1 From Adaline to FFNN

The Adaline (ADAptive LInear NEuron) is a single-layer neural network and the direct bridge from the Perceptron to FFNN. It uses the identity activation for weight updates, and applies a threshold only at inference time.

Adaline = Delta Rule = LMS = Widrow-Hoff (Raschka 2022)
\[ w = w + \Delta w, \quad b = b + \Delta b \qquad (b = w_0) \] \[ \Delta w_j = -\eta\,\frac{\partial E}{\partial w_j}, \quad \Delta b = -\eta\,\frac{\partial E}{\partial b} \] \[ E = \frac{1}{2}(y_i - o_i)^2, \qquad \hat{y} = o = f\!\left(\sum_j w_j x_j + b\right) \]
The activation function \(f\) here is the identity (linear). A threshold is applied only at inference to produce the class label 0 or 1. Multiple names, same algorithm.

2.2 FFNN Architecture

Input Layer (l=0)
Receives feature vector \(\mathbf{x}\). No computation — values passed forward. Width = number of features. Convention: \(x_0 = +1\) (bias).
Hidden Layers
Learn internal representations. One or more layers. Written \(H_1, H_2, \ldots, H_h\) in compact style. Non-linear activation required.
Output Layer
Produces \(\hat{\mathbf{y}}\). Width = number of classes (classification) or 1 (regression). Appropriate activation (sigmoid/softmax/linear).
Depth
Number of layers (hidden + output). Deeper → learns more abstract, hierarchical features.
Width
Neurons per layer. Wider → more capacity but more parameters. Notated as \(n_{h_1}, n_{h_2}, \ldots\)
Fully Connected
Every neuron in layer \(l\) connects to every neuron in layer \(l+1\) with a unique weight \(w_{ji}\).

2.3 Forward Propagation

Information flows from input through each layer to produce output \(\hat{\mathbf{y}}\). Two equivalent notations are used in class:

Goodfellow Notation — General Forward Pass Algorithm
\[ \mathbf{h}^{(0)} = \mathbf{x} \] \[ \text{for } k = 1, \ldots, l:\quad \mathbf{a}^{(k)} = \mathbf{b}^{(k)} + W^{(k)}\mathbf{h}^{(k-1)},\quad \mathbf{h}^{(k)} = f\!\left(\mathbf{a}^{(k)}\right) \] \[ \hat{\mathbf{y}} = \mathbf{h}^{(l)}, \qquad J = L(\hat{\mathbf{y}},\,\mathbf{y}) + \lambda\,\Omega(\theta) \]
\(\mathbf{a}^{(k)}\) = pre-activation at layer \(k\)  ·  \(\mathbf{h}^{(k)}\) = post-activation  ·  \(W^{(k)}\) = weight matrix of layer \(k\)  ·  \(\mathbf{b}^{(k)}\) = bias vector  ·  \(J\) = total cost  ·  \(L\) = loss function  ·  \(\lambda\Omega(\theta)\) = regularization term
Compact Notation (Wxh / Why style — used in lecture slides)
\[ f(\mathbf{x};\,W_{xh},\mathbf{c},\,W_{hy},\mathbf{b}) = f_2\!\left(W_{hy}^T\cdot f_1\!\left(W_{xh}^T\cdot\mathbf{x}+\mathbf{c}\right)+\mathbf{b}\right) \] \[ \text{If biases absorbed into weight matrices:}\quad f(\mathbf{x};\,W_{xh},W_{hy}) = f_2\!\left(W_{hy}^T\cdot f_1\!\left(W_{xh}^T\cdot\mathbf{x}\right)\right) \]
\(W_{xh}\) = weights from input to hidden  ·  \(\mathbf{c}\) = hidden bias  ·  \(W_{hy}\) = weights from hidden to output  ·  \(\mathbf{b}\) = output bias  ·  \(f_1, f_2\) = activation functions at each layer
Mini-Batch X Notation (for batched training)
\[ f(\mathbf{X};\,W_{xh},\mathbf{c},\,W_{hy},\mathbf{b}) = f_2\!\left(f_1\!\left(\mathbf{X}W_{xh}+\mathbf{c}\right)\cdot W_{hy}+\mathbf{b}\right) \]
\(\mathbf{X}\) is the mini-batch matrix of shape \([n_{\text{samples}} \times n_{\text{features}}]\). When using batches, the input goes on the left: \(\mathbf{X}W_{xh}\) vs single input: \(W_{xh}^T\mathbf{x}\).
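The mini-batch forward pass \(f_2(f_1(\mathbf{X}W_{xh}+\mathbf{c})\cdot W_{hy}+\mathbf{b})\) is one line per layer in NumPy (a sketch assuming sigmoid for both \(f_1\) and \(f_2\); any activation could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W_xh, c, W_hy, b):
    """Mini-batch forward pass: f2(f1(X·Wxh + c)·Why + b)."""
    H = sigmoid(X @ W_xh + c)      # hidden activations, shape [n_samples, n_hidden]
    return sigmoid(H @ W_hy + b)   # output activations, shape [n_samples, n_out]
```

With `X` of shape `[n_samples, n_features]`, `W_xh` of shape `[n_features, n_hidden]`, and `W_hy` of shape `[n_hidden, n_out]`, the batch dimension flows through unchanged — which is exactly why the input goes on the left in this notation.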

2.4 Number of Parameters

Parameter Count — Fully Connected Layer
\[ \text{Params for layer } k = (n_{k-1}+1)\times n_k \] \[ \text{Total} = \sum_{k=1}^{L}(n_{k-1}+1)\times n_k \]
\(n_{k-1}\) = neurons in previous layer  ·  \(n_k\) = neurons in current layer  ·  \(+1\) accounts for the bias weight per neuron.
Example (Raschka): input=8, h1=12, h2=8, output=1 → \((8+1)\times12 + (12+1)\times8 + (8+1)\times1 = 108+104+9 = \mathbf{221}\) total params.
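The counting formula is a one-liner worth checking mechanically (illustrative helper; the layer list includes input width first):

```python
def fc_param_count(layer_sizes):
    """Total parameters of a fully connected net: sum of (n_{k-1} + 1) * n_k."""
    return sum((n_prev + 1) * n               # +1 accounts for the bias weight
               for n_prev, n in zip(layer_sizes, layer_sizes[1:]))

print(fc_param_count([8, 12, 8, 1]))   # Raschka example: 108 + 104 + 9 = 221
```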

2.5 Worked Example: XOR with Sigmoid

XOR Network (Sigmoid Activation — Non-Linear Solution)
\[ h_1 = \sigma(-10 + 20x_1 + 20x_2), \quad h_2 = \sigma(30 - 20x_1 - 20x_2) \] \[ y = \sigma(-30 + 20h_1 + 20h_2), \qquad \text{where}\quad \sigma(z) = \frac{1}{1+e^{-z}} \]
Input \((0,0)\): \(h_1\approx 0, h_2\approx 1 \Rightarrow y\approx 0\). Input \((0,1)\): \(h_1\approx 1, h_2\approx 1 \Rightarrow y\approx 1\). Input \((1,1)\): \(h_1\approx 1, h_2\approx 0 \Rightarrow y\approx 0\). Correctly solves XOR.
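The worked example can be verified directly (a plain-Python sketch of the hand-set network above; the gate interpretations in the comments follow from the weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """Hand-set sigmoid network from the worked example; y ≈ XOR(x1, x2)."""
    h1 = sigmoid(-10 + 20 * x1 + 20 * x2)    # ≈ OR gate
    h2 = sigmoid(30 - 20 * x1 - 20 * x2)     # ≈ NAND gate
    return sigmoid(-30 + 20 * h1 + 20 * h2)  # ≈ AND(h1, h2) = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))        # 0, 1, 1, 0
```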
03 · Activation Functions

Activation Functions & Loss Functions

Non-linear functions applied element-wise after the weighted sum. Without non-linearity, stacking any number of layers collapses to a single linear transformation.

3.1 Why Non-linearity?

Universal Approximation Theorem: a single hidden layer with sufficiently many neurons can approximate any continuous function to arbitrary accuracy. In practice, however, deep networks require exponentially fewer neurons than shallow networks for the same approximation quality, generalize better, and empirically outperform shallow networks. In the street-address transcription experiment reported by Goodfellow et al. (2016, Fig. 6.6), test accuracy improves steadily as depth increases from 3 to 11 layers; CNNs also consistently outperform fully-connected nets with the same parameter count.

3.2 Common Activation Functions

Key Derivatives (required for Backpropagation)

Activation Derivatives
\[ \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr) \] \[ \tanh'(z) = 1 - \tanh^2(z) \] \[ \text{ReLU}'(z) = \begin{cases}1 & z>0\\0 & z\le0\end{cases} \]
These derivatives appear in every backprop computation. Sigmoid and tanh derivatives become very small near saturation (→ vanishing gradient). ReLU derivative is either 0 or 1 — no vanishing gradient for positive inputs.
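The three derivatives can be written down directly (an illustrative sketch; the comments note the maxima that explain the vanishing-gradient behavior):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # σ'(z) = σ(z)(1 − σ(z)); maximum 0.25 at z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2   # tanh'(z) = 1 − tanh²(z); maximum 1 at z = 0

def d_relu(z):
    return (np.asarray(z) > 0).astype(float)   # 1 for z > 0, else 0
```

Evaluating `d_sigmoid(10.0)` gives roughly 4.5e-5 — the saturation that starves deep sigmoid networks of gradient.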

3.3 Loss Functions

Mean Squared Error (MSE) — Regression
\[ L(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2 \]
Use with linear output activation. Penalizes larger errors quadratically.
Binary Cross-Entropy — Binary Classification
\[ L(y,\hat{y}) = -\frac{1}{n}\sum_i\Bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\Bigr] \]
Use with sigmoid output. Measures divergence between true and predicted distributions.
Categorical Cross-Entropy — Multi-Class Classification
\[ L(y,\hat{y}) = -\sum_{c=1}^C y_c\log\hat{y}_c \]
Use with softmax output for C classes. \(y_c=1\) for the true class, 0 otherwise (one-hot encoding).
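The three losses in NumPy (a minimal sketch; the `eps` clipping is a standard numerical guard against \(\log 0\), not part of the formulas above):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error, for regression with a linear output."""
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary CE, for a sigmoid output; y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Categorical CE for one sample: picks out -log p of the true class."""
    return -np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)))
```

With a one-hot target \([0,1,0]\) and prediction \([0.1, 0.8, 0.1]\), the categorical CE reduces to \(-\log 0.8\), exactly as the formula says.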
04 · Backpropagation

Backpropagation

Algorithm to compute gradients of the loss w.r.t. all weights by applying the chain rule layer-by-layer in reverse. Goal: set weights to minimize error \(E\). (Mitchell 1997; Goodfellow et al. 2016)

4.1 Three-Step Learning Procedure

Learning Procedure
1.Forward Propagation: Compute output \(\hat{y}\) for input \(\mathbf{x}\). Cache all pre-activations \(\mathbf{a}^{(k)}\) and post-activations \(\mathbf{h}^{(k)}\).
2.Backward Propagation: Compute error term \(\delta\) for each unit. The target for hidden units is NOT directly available — it must be inferred from output deltas via the chain rule.
3.Weight Update: Update every weight using the computed gradients and learning rate \(\eta\).

4.2 Gradient for Output Unit Weights (with Sigmoid)

Using the chain rule to compute \(\partial E_d / \partial w_{ji}\) for a weight connecting unit \(i\) to output unit \(j\):

Chain Rule Decomposition — Output Layer
\[ \frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial o_j}\cdot\frac{\partial o_j}{\partial\,\text{net}_j}\cdot\frac{\partial\,\text{net}_j}{\partial w_{ji}} = -(t_j-o_j)\,o_j(1-o_j)\,x_{ji} \]
\(\dfrac{\partial E_d}{\partial o_j} = -(t_j-o_j)\) (from SSE: \(E_d=\tfrac{1}{2}\sum_k(t_k-o_k)^2\))  ·  \(\dfrac{\partial o_j}{\partial\,\text{net}_j} = o_j(1-o_j)\) (sigmoid derivative)  ·  \(\dfrac{\partial\,\text{net}_j}{\partial w_{ji}} = x_{ji}\)
Output Unit Error Signal (δ)
\[ \delta_j = o_j(1-o_j)(t_j - o_j) \] \[ \therefore\quad \frac{\partial E_d}{\partial w_{ji}} = -\delta_j\,x_{ji} \]
\(\delta_j\) is the error signal at output unit \(j\). It captures how wrong the output is, scaled by how sensitive the activation is at that operating point.

4.3 Gradient for Hidden Unit Weights

For hidden unit \(h\), no direct target is available. We back-propagate error from all output units \(k\) that receive from \(h\):

Hidden Unit Error Signal (δh)
\[ \delta_h = o_h(1-o_h)\sum_{k\in\,\text{outputs}(h)} w_{kh}\,\delta_k \] \[ \therefore\quad \frac{\partial E_d}{\partial w_{hi}} = -\delta_h\,x_{hi} \]
\(\delta_h\) is a weighted sum of output deltas, weighted by the connections from \(h\) to each output unit. This is the "propagating error backward" step that gives backpropagation its name.

4.4 Weight Update (same for all layers)

Weight Update Rule — Mitchell (1997), Eq. T4.5
\[ w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji} \]
\(\eta\) = learning rate  ·  \(\delta_j\) = error signal at unit \(j\)  ·  \(x_{ji}\) = input from unit \(i\) to unit \(j\)

4.5 Full Backpropagation Algorithm (Mitchell, 1997)

BACKPROPAGATION(training_examples, \(\eta\), \(n_{\text{in}}\), \(n_{\text{out}}\), \(n_{\text{hidden}}\))
Init: create a feed-forward network with \(n_{\text{in}}\) inputs, \(n_{\text{hidden}}\) hidden units, and \(n_{\text{out}}\) output units. Initialize all weights to small random values (e.g., between −0.05 and 0.05).
Repeat until termination (fixed iterations / training error < threshold / validation criterion met). For each \(\langle\mathbf{x}, \mathbf{t}\rangle\) in training_examples:
1. Forward pass: input \(\mathbf{x}\) to the network; compute the output \(o_u\) of every unit \(u\) in the network.
2. Output deltas: for each output unit \(k\):
\[\delta_k \leftarrow o_k(1-o_k)(t_k - o_k)\]
3. Hidden deltas: for each hidden unit \(h\):
\[\delta_h \leftarrow o_h(1-o_h)\sum_{k\,\in\,\text{outputs}}w_{kh}\,\delta_k\]
4. Update weights: for each weight \(w_{ji}\):
\[w_{ji} \leftarrow w_{ji} + \Delta w_{ji},\qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji}\]
Computational Efficiency: One backward pass computes gradients for ALL parameters simultaneously, reusing cached forward-pass values. This makes deep network training feasible.
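The full algorithm for one hidden layer fits in a short NumPy sketch (illustrative: the function names, layer width, \(\eta\), and epoch count are choices, not from the source; init range ±0.05 follows Mitchell). It is trained below on the OR function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """Mitchell-style BACKPROPAGATION: one hidden layer, stochastic updates."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.05, 0.05, (n_in + 1, n_hidden))   # input→hidden (+bias row)
    W2 = rng.uniform(-0.05, 0.05, (n_hidden + 1, n_out))  # hidden→output (+bias row)
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                 # bias input x0 = +1
            h = sigmoid(x1 @ W1)                   # hidden outputs
            h1 = np.append(1.0, h)
            o = sigmoid(h1 @ W2)                   # network outputs
            delta_o = o * (1 - o) * (t - o)        # δ_k = o_k(1−o_k)(t_k−o_k)
            delta_h = h * (1 - h) * (W2[1:] @ delta_o)  # δ_h = o_h(1−o_h) Σ w_kh δ_k
            W2 += eta * np.outer(h1, delta_o)      # Δw_ji = η δ_j x_ji
            W1 += eta * np.outer(x1, delta_h)
    return W1, W2

def predict(X, W1, W2):
    H = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ W1)
    return sigmoid(np.hstack([np.ones((len(X), 1)), H]) @ W2)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])             # OR targets
W1, W2 = backprop_train(X, T)
```

The same code also handles XOR, though convergence then depends on the random initialization — backprop gives no global-optimum guarantee.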

4.6 Common Training Problems

Vanishing Gradient
Gradients shrink exponentially through deep layers. Sigmoid/tanh worsen this near saturation. Fix: ReLU, residual connections, batch normalization.
Exploding Gradient
Gradients grow exponentially → unstable (NaN). Fix: gradient clipping, careful initialization.
Local Minima
Batch GD can get stuck. SGD noise helps escape. No guarantee of global minimum if multiple local minima exist.
Learning Rate η
Too small → converges very slowly. Too large → overshoots and diverges. Use schedules or adaptive optimizers (Adam).
Overfitting
Network memorizes training data. Fix: dropout, L1/L2 regularization \(\lambda\Omega(\theta)\), early stopping, more data.
Dying ReLU
Neurons always output 0 → zero gradient → never recover. Fix: Leaky ReLU (\(\alpha=0.01\) slope), He initialization.
05 · Convolutional Neural Network

Convolutional Neural Network (CNN)

An ANN that replaces general matrix multiplication with the convolution operation in at least one layer. Explicitly assumes inputs have grid-like (spatial) topology. (LeCun et al. 1989, 1998; Goodfellow et al. 2016; Lepe & Vazquez 2017)

5.1 Why CNN? — The Problem with FFNN on Images

Dimensionality Explosion: A \(320\times280\) RGB image contains \(320\times280\times3 = 268{,}800\) input values. Densely connecting it to an equally sized output layer requires a weight matrix with over 8 billion entries (roughly 16 billion floating-point operations). A convolution with a \(1\times2\) kernel needs only \(319\times280\times3 = 267{,}960\) operations (two multiplications and one addition per output) — roughly 60,000× more efficient (Goodfellow et al. 2016, Ch. 9).

Three core problems with FFNN on images:

  • Dimensionality explosion. Flattening a 2D/3D image to 1D and connecting every pixel to every neuron creates an impractical number of parameters.
  • Spatial information destroyed. Flattening kills neighborhood relationships. Pixels and their neighbors carry joint meaning (edges, textures); once flattened they are independent numbers.
  • Parameter explosion → overfitting. More weights = more memorization, not better generalization.

5.2 What CNN Is

Input
A multidimensional array (e.g., RGB image of size \(H\times W\) stored as \(H\times W\times 3\) array of pixel intensities 0–255).
Kernel / Filter
Small learnable weight matrix (e.g., 3×3 or 5×5) that slides over the input. Also called: weights, parameters, filter.
Feature Map
Output produced when a kernel is convolved with the input. High values = pattern detected at that location.
Convolution
Sliding a kernel over the input, computing dot products at each position to produce one value per location in the feature map.

5.3 Two Core Innovations

Local (Sparse) Connectivity

Each output neuron connects only to a local region of the input — the receptive field. A 3×3 kernel: each output connects to only 9 inputs, not thousands. Biologically inspired: Hubel & Wiesel (1959) found visual cortex neurons respond to small local visual regions.

Parameter Sharing

The same kernel weights are used at every spatial position. One kernel is learned and applied everywhere — instead of unique weights per position. Gives translation equivariance: the same feature anywhere uses the same detector.

LeCun 1989 Parameter Comparison (Net-3)

Configuration · Weights
Fully connected (FFNN) · >3,200
Locally connected (no weight sharing) · 1,226
Locally connected + weight sharing (CNN) · 206

5.4 The Convolution Operation

1D Convolution (Conceptual)

Slide a kernel across a 1D input array, computing element-wise products and summing at each position. Parameter sharing: same kernel values reused at every position.

2D Convolution (Grayscale Images)

The kernel slides across both rows and columns. At each position, multiply overlapping values element-wise and sum to produce one output value in the feature map.

3D Convolution (RGB / Multi-channel)

For RGB images, the kernel has depth = number of input channels (\(C_{\text{in}}\)). A kernel of size \(F\times F\times C_{\text{in}}\) produces one feature map. Using \(K\) different kernels produces \(K\) feature maps (output depth = \(K\)).
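A direct (loop-based, deliberately unoptimized) 2D convolution makes the sliding-window computation concrete. This is an illustrative sketch for a single channel and a square kernel; note it computes cross-correlation, which is what CNN libraries call "convolution":

```python
import numpy as np

def conv2d(X, K, stride=1, pad=0):
    """Valid 2D cross-correlation of input X with square kernel K."""
    if pad:
        X = np.pad(X, pad)                        # zero padding on all borders
    F = K.shape[0]
    V_h = (X.shape[0] - F) // stride + 1          # output dims: (W − F + 2P)/S + 1
    V_w = (X.shape[1] - F) // stride + 1
    out = np.zeros((V_h, V_w))
    for i in range(V_h):
        for j in range(V_w):
            patch = X[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * K)         # element-wise product, then sum
    return out

X = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
K = np.array([[1., 0.], [0., 1.]])                # same kernel reused at every position
print(conv2d(X, K))                               # [[ 6.  8.] [12. 14.]]
```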

Output Dimension Formula
\[ V = \frac{W - F + 2P}{S} + 1 \]
\(W\) = input width/height (assumes square)  ·  \(F\) = filter/kernel size  ·  \(P\) = padding (zeros added around border)  ·  \(S\) = stride  ·  \(V\) = output spatial dimension  ·  Output volume = \(V \times V \times K\) (for \(K\) filters)

Padding Types

  • Valid (P=0): No padding → output shrinks. Each conv layer reduces spatial dimensions.
  • Same (P=(F−1)/2 with S=1): Zero-pad to keep output same size as input.

Worked Examples

  • Input 3×3×3, 1 filter of size 2×2×3, P=0, S=1: \(V=(3-2+0)/1+1=2\) → output volume 2×2×1
  • Input 3×3×2, 3 filters of size 2×2×2, P=0, S=1: \(V=(3-2+0)/1+1=2\) → output volume 2×2×3
  • Input 32×32×3, 10 filters of size 5×5×3, P=0, S=1: \(V=(32-5+0)/1+1=28\) → output volume 28×28×10
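The output-dimension formula is easy to mis-apply under exam pressure; a small checker helps (illustrative helper; the divisibility assert catches filter/stride combinations that do not tile the input):

```python
def conv_output_dim(W, F, P=0, S=1):
    """V = (W − F + 2P)/S + 1; (W − F + 2P) must be divisible by S."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_dim(3, 2))          # 2  (first worked example)
print(conv_output_dim(32, 5))         # 28 (third worked example)
print(conv_output_dim(32, 5, P=2))    # 32 ("same" padding: P = (F−1)/2 with S=1)
```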

5.5 The Three Stages of a CNN Layer

Three-Stage CNN Layer Structure
Stage 1 Convolution (Affine Transform): Kernel slides across input computing linear combinations. Produces raw feature map — still linear.
\[\text{net}_c = \mathbf{X}\star K\quad\text{(where }\star\text{ denotes convolution/cross-correlation)}\]
Stage 2 Detector (Non-linearity / Activation): Non-linear activation applied element-wise. Without this, stacking conv layers collapses to one linear transform. Most common: ReLU. \[H = f_c(\text{net}_c) = \max(0,\,\text{net}_c)\] Example: \([-9, 32, 14, -6]\xrightarrow{\text{ReLU}}[0, 32, 14, 0]\)
Stage 3 Pooling (Downsampling): Reduces spatial dimensions. Benefits: fewer parameters downstream, translation invariance (small input shifts don’t change output much), controls overfitting. Most common: Max Pooling.

Pooling Types

  • Max Pooling: maximum value in each window. Most common. A 2×2 pool with S=2 halves the spatial dimensions. Preserves dominant features.
  • Average Pooling: mean value in each window. Smoother downsampling. Used in LeNet-5.
  • L2-norm Pooling: \(\sqrt{\sum x_i^2}\) in each window. Less common. Emphasizes larger activations.
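Max pooling in loop form (an illustrative sketch; 2×2 window with stride 2, the most common configuration, is the default):

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    """Max pooling: take the maximum over each size×size window."""
    V_h = (X.shape[0] - size) // stride + 1
    V_w = (X.shape[1] - size) // stride + 1
    out = np.zeros((V_h, V_w))
    for i in range(V_h):
        for j in range(V_w):
            out[i, j] = X[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

X = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(X))    # 4×4 → 2×2: [[6. 8.] [3. 4.]]
```

Shifting the dominant value one pixel within its window leaves the output unchanged — the translation invariance the table describes.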

5.6 Parameter Counting in CNN Layers

Number of Neurons in a Conv Layer
\[ N_{\text{neurons}} = V_H \times V_W \times K \]
\(V_H, V_W\) = output height and width (from output formula)  ·  \(K\) = number of filters
Parameters WITHOUT Weight Sharing (locally connected)
\[ N_{\text{params}} = N_{\text{neurons}} \times (F\times F\times C_{\text{in}} + 1) \]
Parameters WITH Weight Sharing (standard CNN)
\[ N_{\text{params}} = K \times (F\times F\times C_{\text{in}} + 1) \]
\(K\) = number of filters  ·  \(F\times F\) = filter spatial size  ·  \(C_{\text{in}}\) = input channels  ·  \(+1\) = one bias per filter
Example (LeNet C1): \(K=6,\, F=5,\, C_{\text{in}}=1 \Rightarrow 6\times(25\times1+1) = 6\times26 = \mathbf{156}\) params
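Both parameter formulas in one helper (illustrative; `n_neurons` is only needed for the locally connected, non-shared case):

```python
def conv_params(K, F, C_in, shared=True, n_neurons=None):
    """Conv-layer parameters: K×(F·F·C_in + 1) with weight sharing,
    or N_neurons×(F·F·C_in + 1) without (locally connected)."""
    per_filter = F * F * C_in + 1          # +1 = one bias per filter
    return K * per_filter if shared else n_neurons * per_filter

print(conv_params(6, 5, 1))     # LeNet C1: 6×26 = 156
print(conv_params(16, 5, 6))    # LeNet C3: 16×151 = 2416
```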

5.7 Full CNN Architecture Pattern

A typical CNN repeats [Conv + ReLU → Pool] blocks, then flattens into fully-connected layers for classification:

INPUT
H×W×C
CONV
+ReLU
POOL
Max/Avg
CONV
+ReLU
POOL
Max/Avg
FLATTEN
1D vector
FC
Dense
OUTPUT
Softmax
Feature Hierarchy: Conv layer 1 → edges/colors. Conv layer 2 → corners/textures. Deeper layers → shapes, object parts, objects. Pooling reduces spatial size. FC layers do the final classification based on extracted features.

5.8 LeNet-5 (LeCun et al., 1998)

The first practical CNN for handwritten digit recognition (MNIST). Input: 32×32 grayscale (MNIST 28×28 padded to 32×32). End-to-end training with backpropagation.

Layer · Type · Output Shape · Details · Parameters
Input · Image · 32×32×1 · Grayscale, padded from 28×28 · —
C1 · Conv · 28×28×6 · 6 filters, 5×5, S=1, P=0, tanh; \(V=(32-5)/1+1=28\) · \(6\times(5\times5\times1+1)=\mathbf{156}\)
S2 · Avg Pool · 14×14×6 · 2×2 window, S=2 · 0 (no learnable params)
C3 · Conv · 10×10×16 · 16 filters, 5×5, S=1, P=0, tanh; \(V=(14-5)/1+1=10\) · \(16\times(5\times5\times6+1)=\mathbf{2{,}416}\)
S4 · Avg Pool · 5×5×16 · 2×2 window, S=2 · 0
Flatten · — · 400 · \(5\times5\times16=400\) · —
C5 · FC · 120 · Fully connected, tanh · \((400+1)\times120=\mathbf{48{,}120}\)
F6 · FC · 84 · Fully connected, tanh · \((120+1)\times84=\mathbf{10{,}164}\)
Output · FC · 10 · 10 classes (digits 0–9), Softmax · \((84+1)\times10=\mathbf{850}\)
Total Parameters · 61,706
LeNet-5 Total Parameter Verification
\[ 156 + 0 + 2{,}416 + 0 + 48{,}120 + 10{,}164 + 850 = \mathbf{61{,}706} \]
C3 uses 5×5 kernel on 6-channel S2 output: \(16\times(5\times5\times6+1)=16\times151=2{,}416\). Tiny by modern standards (ResNet-50 ≈25M params), but proved CNN viability for image tasks for the first time.

5.9 Notable CNN Architectures

Architecture · Year · Key Innovation
LeNet-5 · 1998 · First practical CNN. Local connectivity + weight sharing.
AlexNet · 2012 · 8 layers, ReLU activation (then novel for deep CNNs), dropout, GPU training on ImageNet. Marked the "Rise of CNN."
VGG16 · 2014 · 16 layers, only 3×3 filters. Full model: ∼138M params. Input: 224×224×3.
GoogLeNet · 2014 · Inception modules: parallel filter paths of different sizes in one layer.
ResNet · 2015 · Skip (residual) connections: gradients flow directly through layers, enabling 50/101/152-layer nets.
Vision Transformer · 2020 · Transformer self-attention applied to image patches; no convolution needed.

5.10 CNN vs FFNN — Complete Comparison

Aspect · FFNN (MLP) · CNN
Input handling · Flattened 1D vector (spatial structure lost) · Grid-structured (spatial info preserved)
Connectivity · Every neuron to every input (fully connected) · Each neuron to a local receptive field only
Weight sharing · No: unique weight per connection · Yes: same kernel at every position
Spatial feature detection · No · Yes: kernels learn edges, textures, shapes
Parameter count · Very high for image inputs · Much lower due to weight sharing
Overfitting tendency · Higher for image tasks · Lower for image tasks
Feature engineering · Manual · Automatic (learned filters)
Translation invariance · No · Yes (shared filters + pooling)
Best for · Tabular data, fixed-size vectors · Images, audio spectrograms, grid-structured data
Two Key Properties of CNN: (1) Translation equivariance: if the input shifts, the feature map shifts by the same amount — a feature is detected regardless of where it appears. (2) Hierarchical feature learning: layer 1 learns edges → layer 2 learns textures → deeper layers learn shapes → object parts → full objects.

5.11 Key CNN Terms

Kernel / Filter (K)
Learnable weight matrix. One kernel → one feature map. \(K\) kernels → \(K\) output channels.
Feature Map
Output of applying one kernel to the full input. High values = pattern detected at that location.
Receptive Field
Local region of input one filter examines. Size = \(F\times F\). Deeper layers have larger effective receptive fields.
Stride (S)
Pixels the kernel moves per step. \(S=1\) → overlapping. \(S=2\) → roughly halves output size.
Padding (P)
“Valid” (P=0): output shrinks. “Same” (\(P=(F-1)/2\) with S=1): output same size as input.
Parameter Sharing
Same kernel weights applied everywhere → dramatic parameter reduction. FFNN would need unique weights per connection.
Sparse Connectivity
Each output unit connects to only \(F\times F\times C_{\text{in}}\) inputs (the receptive field), not all inputs.
Translation Equivariance
If input shifts, feature map shifts by same amount. Kernels detect features at any position using same weights.
Translation Invariance
After pooling, small shifts in input do not change the output. Pooling provides this property.
Flatten
Reshape 3D volume \((H\times W\times C)\) after conv/pool into a 1D vector to feed into FC layers.
Feature Hierarchy
Early layers: edges → Middle: corners, textures → Deep: object parts, faces, objects.
Depth (channels)
Number of feature maps in a layer = number of filters \(K\). RGB image has depth 3 (input channels).
Glossary

All Key Terms at a Glance

Every important term from the midterm topics, organized by section.

Math & Notation

x, X input vector / mini-batch
w, W weights
b, w₀, c bias
z / net / a pre-activation
o / h post-activation
ŷ predicted output
y / t true label / target
E / J error / total cost
η learning rate
δ error signal (delta)
λ regularization rate
Ω(θ) regularizer
σ sigmoid
f(·) activation function
K kernel/filter
V output dim
F filter size
P padding
S stride

Perceptron Terms

Perceptron artificial neuron for linear classification
McCulloch-Pitts early neuron model
threshold θ cutoff for firing
sign function outputs ±1
step function outputs 0/1
net input z weighted sum before activation
hyperplane decision surface linear separator
linear separability classes separable by a hyperplane
hypothesis space H set of possible weights
perceptron training rule update rule for thresholded output
convergence theorem finite convergence for separable data
XOR problem not linearly separable
m-of-n function true if at least m inputs are true
epoch one full pass over the data
batch size samples per update group
step one mini-batch update
delta rule LMS / Widrow-Hoff update
LMS rule least mean squares rule
Adaline adaptive linear neuron
Widrow-Hoff another name for delta rule
batch GD gradient descent over all samples
SGD stochastic gradient descent
mini-batch SGD gradient descent on small batches
gradient ∇E(w) direction of steepest ascent
SSE sum of squared errors

FFNN Terms

feedforward information flows only forward
MLP multi-layer perceptron
hidden layer intermediate representation layer
depth number of layers
width number of neurons per layer
Wxh / Why notation compact layer notation
compact / explicit style two equivalent notations
forward propagation compute outputs from inputs
backpropagation reverse-mode gradient computation
chain rule derivative composition rule
gradient descent minimize loss by following negative gradient
vanishing gradient gradients shrink through layers
exploding gradient gradients grow too large
overfitting memorizes training data
underfitting model too simple
sigmoid squashing activation
tanh zero-centered squashing activation
ReLU rectified linear unit
leaky ReLU ReLU with small negative slope
softmax probability distribution over classes
MSE mean squared error
binary CE binary cross-entropy
categorical CE multi-class cross-entropy
output delta δj error signal at output unit
hidden delta δh error signal at hidden unit
universal approximation theorem single hidden layer can approximate any continuous function
one-hot encoding vector with one active class entry
regularization λΩ(θ) penalty term to reduce overfitting

CNN Terms

convolution / cross-correlation sliding local dot product
kernel / filter learnable weight window
feature map convolution output
stride S step size of the kernel
padding P zero border added to input
valid padding no padding, output shrinks
same padding keeps output size
receptive field local input region a filter sees
parameter sharing same kernel reused everywhere
sparse connectivity each output connects to local inputs
max pooling take maximum value per window
average pooling take mean value per window
L2-norm pooling use Euclidean norm per window
flatten reshape volume into 1D vector
translation equivariance shifted input shifts feature map
translation invariance small shifts do not change output much
feature hierarchy edges to objects across layers
depth (channels) number of feature maps
V = (W−F+2P)/S + 1 output dimension formula
Nparams = K×(F²×Cin+1) CNN parameter formula
LeNet-5 first practical CNN
AlexNet deep CNN breakthrough
VGG16 deep stack of 3×3 filters
ResNet / skip connections residual shortcut paths
locally connected no weight sharing

IF3270 Pembelajaran Mesin · Sem 2-2025/2026 · Tim Pengajar IF3270
Topics: Perceptron · FFNN · CNN  |  Excluded: Ensemble · CNN Backpropagation