IF3270 Pembelajaran Mesin · Midterm Study Notes

Neural Networks
from Scratch

Perceptron · Feedforward Neural Network · Convolutional Neural Network
Perceptron · Delta Rule / Adaline · Gradient Descent · FFNN / MLP · Backpropagation · Activation Functions · CNN · Convolution · Pooling · LeNet-5
References: Goodfellow et al. (2016) Deep Learning · LeCun et al. (1989, 1998) · Mitchell (1997) · Raschka et al. (2022)
01 · Perceptron

The Perceptron

The foundation of all neural networks — a single artificial neuron that learns to classify linearly separable data. (Rosenblatt 1958; Mitchell 1997, Ch. 4; Raschka 2022, Ch. 11)

1.1 Biological Motivation

McCulloch and Pitts (1943) described a neuron as a simple logic gate with binary outputs: multiple signals arrive at the dendrites, are integrated in the cell body, and if the accumulated signal exceeds a threshold, an output signal is generated along the axon.

Dendrites → x
Receive input signals. Analogous to input features \(x_1,x_2,\ldots,x_m\).
Synapse strength → w
Connection strength. Analogous to weights \(w_i\) — the importance / contribution of each input.
Cell body → Σ
Integrates all incoming signals. Analogous to the weighted sum \(\mathbf{w}^T\mathbf{x}\).
Axon firing → f(net)
Fires (outputs signal) only when the threshold is exceeded. Analogous to the activation function.

1.2 Formal Definition of a Neuron

A neuron takes a real-valued input \(\mathbf{x}=(x_1,\ldots,x_m)\), a weight vector \(\mathbf{w}=(w_1,\ldots,w_m)\), and a bias \(w_0\) (also written \(b\) or \(\theta\)). The hypothesis space is the set of all possible weight vectors: \(H = \{\mathbf{w}\mid\mathbf{w}\in\mathbb{R}^{m+1}\}\).

Net Input (Linear Combination)
\[ z = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_m x_m = \mathbf{w}^T\mathbf{x} \]
\(w_0\) is the bias/threshold weight; the bias input \(x_0 = +1\) always.
Sign Function (output ±1) — Perceptron Training Rule
\[ \hat{y} = \sigma(z) = \begin{cases} +1 & z > 0 \\ -1 & \text{otherwise} \end{cases} \]
Step Function (output 0/1) — Alternative Notation
\[ \hat{y} = \sigma(z) = \begin{cases} 1 & z \geq 0 \\ 0 & \text{otherwise} \end{cases} \]
Both produce a discrete-valued output. The perceptron represents a hyperplane decision surface in \(\mathbb{R}^n\): a line in \(\mathbb{R}^2\), a plane in \(\mathbb{R}^3\), and an \((n-1)\)-dimensional subspace in \(\mathbb{R}^n\).
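The definitions above can be sketched in a few lines of NumPy (an illustrative helper, not from the source; the function name is a choice). Below it is applied with the AND weights \(w_0=-0.8,\ w_1=w_2=0.5\) used later in the worked example:

```python
import numpy as np

def neuron_output(x, w, w0):
    """Single artificial neuron: net input z = w0 + w·x, then sign activation."""
    z = w0 + np.dot(w, x)       # net input (linear combination)
    return 1 if z > 0 else -1   # sign function: +1 if z > 0, else -1

# AND gate: fires only when both inputs are 1 (z = -0.8 + 0.5 + 0.5 = 0.2 > 0)
print(neuron_output(np.array([1, 1]), np.array([0.5, 0.5]), -0.8))   # +1
print(neuron_output(np.array([1, 0]), np.array([0.5, 0.5]), -0.8))   # -1
```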

1.3 m-of-n Functions

A perceptron easily represents m-of-n functions: functions where at least \(m\) of the \(n\) inputs must be true. Strategy: set all input weights to the same value (e.g., 0.5) then set the threshold \(w_0\).

Perceptron Representations
\[ \text{AND } (m=n): \quad \hat{y} = \text{sign}(x_1 + x_2 - 1) \] \[ \text{OR } (m=1): \quad \hat{y} = \text{sign}(2x_1 + 2x_2 - 1) \]
Example: AND with \(w_0=-0.8,\; w_1=w_2=0.5\). Compute \(\Sigma = -0.8 + 0.5x_1 + 0.5x_2\) and apply sign. Only when \(x_1=x_2=1\) gives \(\Sigma = 0.2 > 0 \Rightarrow +1\).
The XOR Problem (Minsky & Papert, 1969): A single perceptron can only solve linearly separable problems. XOR cannot be separated by any hyperplane — no single perceptron can learn it. This drove the development of multi-layer networks.

1.4 Perceptron Training (Learning) Rule

Uses the thresholded output for the update signal. Proven to converge if training data is linearly separable and learning rate \(\eta\) is sufficiently small.

Perceptron Training Rule — Mitchell (1997)
Input: training data \(D = \{\langle\mathbf{x}_i, y_i\rangle\}\), learning rate \(\eta\), activation: sign or step.
Init: set each \(w_k \leftarrow 0\) or a small random value.
Repeat: set \(E \leftarrow 0\) at the start of each epoch; then for each example \(\langle\mathbf{x}_i, y_i\rangle\):
3a. Compute the prediction: \(o_i \leftarrow \text{sign}(\mathbf{w}^T\mathbf{x}_i)\)
3b. If \(o_i \neq y_i\):  \(w_j \leftarrow w_j + \Delta w_j\)  where  \(\Delta w_j = \eta(y_i - o_i)x_{ij}\)
3c. Update the error count: \(E \leftarrow E + |y_i - o_i|\)
Until: all examples are correctly classified (\(E = 0\)), or the maximum number of epochs is reached. Return \(\mathbf{w}\).
Perceptron Weight Update Rule
\[ w_j \leftarrow w_j + \Delta w_j, \qquad \Delta w_j = \eta\,(y_i - o_i)\,x_{ij} \]
\(\eta\) = learning rate · \(y_i\) = true label (±1) · \(o_i\) = predicted label (thresholded) · \(x_{ij}\) = \(j\)-th feature of example \(i\). If correct prediction: \(y_i = o_i \Rightarrow \Delta w_j = 0\) (no update).
Convergence Theorem: The Perceptron Training Rule is guaranteed to converge in a finite number of steps if the training data is linearly separable and \(\eta\) is sufficiently small. It does NOT converge if data is not linearly separable.
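The training loop above translates directly to NumPy. This is a minimal sketch (function name, \(\eta\), and epoch cap are illustrative choices), trained below on the linearly separable AND data with ±1 labels:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Perceptron training rule (Mitchell 1997): update only on misclassification."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend bias input x0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            oi = 1 if w @ xi > 0 else -1       # thresholded (sign) output
            if oi != yi:
                w += eta * (yi - oi) * xi      # Δw = η (y − o) x
                errors += 1
        if errors == 0:                        # converged: all correct
            break
    return w

# AND function, linearly separable → guaranteed finite convergence
w = train_perceptron(np.array([[0, 0], [0, 1], [1, 0], [1, 1]]),
                     np.array([-1, -1, -1, 1]))
```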

1.5 Batch Gradient Descent — The Delta Rule

When data is NOT linearly separable, the Perceptron rule fails. The Delta Rule (= Widrow-Hoff rule = LMS rule = Adaline rule) uses the unthresholded linear output and minimizes SSE — always converges to best-fit approximation.

Error Function — Sum of Squared Errors (SSE)
\[ E(\mathbf{w}) = \frac{1}{2}\sum_{d \in D}(y_d - o_d)^2 \]
\(o_d = \mathbf{w}^T\mathbf{x}_d\) is the linear (unthresholded) output. The \(\frac{1}{2}\) factor simplifies the derivative.
Gradient Vector (Partial Derivatives)
\[ \nabla E(\mathbf{w}) = \left[\frac{\partial E}{\partial w_0},\;\frac{\partial E}{\partial w_1},\;\ldots,\;\frac{\partial E}{\partial w_d}\right] \] \[ \frac{\partial E}{\partial w_k} = \frac{\partial}{\partial w_k}\frac{1}{2}\sum_{d\in D}(y_d - o_d)^2 = \sum_{d\in D}(y_d - o_d)(-x_{dk}) \]
The gradient gives the magnitude and direction of steepest ascent in the error landscape. We subtract it to descend.
Training Rule: Batch Gradient Descent
\[ \mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}, \qquad \Delta\mathbf{w} = -\eta\,\nabla E(\mathbf{w}) \] \[ \Delta w_k = -\eta\,\frac{\partial E}{\partial w_k} = \eta\sum_{d \in D}(y_d - o_d)(x_{dk}) \]
Weights updated once per epoch after accumulating \(\Delta w_k\) over all \(D\). \(\eta\) = gradient descent step size.
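Batch gradient descent on the SSE can be sketched as follows (a minimal illustration; the defaults for \(\eta\) and the epoch count are arbitrary choices). Note the weight update uses the accumulated gradient over all of \(D\), once per epoch:

```python
import numpy as np

def delta_rule_batch(X, y, eta=0.05, epochs=500):
    """Delta rule / batch GD on SSE, using the unthresholded linear output."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias column x0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                              # linear outputs o_d = w·x_d
        w += eta * X.T @ (y - o)               # Δw_k = η Σ_d (y_d − o_d) x_dk
    return w

# Best-fit line through y = 2x + 1: converges to w ≈ (1, 2)
w = delta_rule_batch(np.array([[0.], [1.], [2.], [3.]]),
                     np.array([1., 3., 5., 7.]))
```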

1.6 Stochastic & Mini-Batch Gradient Descent

Stochastic GD (SGD) updates weights after each individual example — an incremental update.

Batch GD
\[ \Delta w_k = \eta\sum_{d \in D}(y_d - o_d)(x_{dk}) \]

One weight update per epoch. Stable, but slow for large datasets. Can get stuck in local minima.

Stochastic (Incremental) GD
\[ \Delta w_k = \eta\,(y_d - o_d)(x_{dk}) \]

One update per example. Faster, noisier gradient — can escape local minima. Both Batch and SGD use the linear function and give best-fit approximation.

Epoch
One full pass through all training data (or all mini-batches). Features and labels processed once.
Batch Size
Number of samples per mini-batch split. If = 1 → true SGD. If = |D| → Batch GD. If 1 < b < |D| → Mini-batch SGD.
Step
Processing one mini-batch, ending with one weight update. Steps per epoch = |D| / batch_size.
Key Distinction: Perceptron Training Rule uses thresholded output (sign/step) for updates — only converges if linearly separable. Delta Rule / Batch GD uses unthresholded linear output — converges to best-fit regardless of separability.
02 · Feedforward Neural Network

Feedforward Neural Network (FFNN)

Also called Multi-Layer Perceptron (MLP) — stacks layers of neurons with non-linear activations to learn complex mappings. Connections flow only forward: a directed acyclic graph. (Raschka 2022 Ch.11; Goodfellow et al. 2016)

2.1 From Adaline to FFNN

The Adaline (ADAptive LInear NEuron) is a single-layer neural network and the direct bridge from the Perceptron to FFNN. It uses the identity activation for weight updates, and applies a threshold only at inference time.

Adaline = Delta Rule = LMS = Widrow-Hoff (Raschka 2022)
\[ w = w + \Delta w, \quad b = b + \Delta b \qquad (b = w_0) \] \[ \Delta w_j = -\eta\,\frac{\partial E}{\partial w_j}, \quad \Delta b = -\eta\,\frac{\partial E}{\partial b} \] \[ E = \frac{1}{2}(y_i - o_i)^2, \qquad \hat{y} = o = f\!\left(\sum_j w_j x_j + b\right) \]
The activation function \(f\) here is the identity (linear). A threshold is applied only at inference to produce the class label 0 or 1. Multiple names, same algorithm.

2.2 FFNN Architecture

Input Layer (l=0)
Receives feature vector \(\mathbf{x}\). No computation — values passed forward. Width = number of features. Convention: \(x_0 = +1\) (bias).
Hidden Layers
Learn internal representations. One or more layers. Written \(H_1, H_2, \ldots, H_h\) in compact style. Non-linear activation required.
Output Layer
Produces \(\hat{\mathbf{y}}\). Width = number of classes (classification) or 1 (regression). Appropriate activation (sigmoid/softmax/linear).
Depth
Number of layers (hidden + output). Deeper → learns more abstract, hierarchical features.
Width
Neurons per layer. Wider → more capacity but more parameters. Notated as \(n_{h_1}, n_{h_2}, \ldots\)
Fully Connected
Every neuron in layer \(l\) connects to every neuron in layer \(l+1\) with a unique weight \(w_{ji}\).

2.3 Forward Propagation

Information flows from input through each layer to produce output \(\hat{\mathbf{y}}\). Two equivalent notations are used in class:

Goodfellow Notation — General Forward Pass Algorithm
\[ \mathbf{h}^{(0)} = \mathbf{x} \] \[ \text{for } k = 1, \ldots, l:\quad \mathbf{a}^{(k)} = \mathbf{b}^{(k)} + W^{(k)}\mathbf{h}^{(k-1)},\quad \mathbf{h}^{(k)} = f\!\left(\mathbf{a}^{(k)}\right) \] \[ \hat{\mathbf{y}} = \mathbf{h}^{(l)}, \qquad J = L(\hat{\mathbf{y}},\,\mathbf{y}) + \lambda\,\Omega(\theta) \]
\(\mathbf{a}^{(k)}\) = pre-activation at layer \(k\)  ·  \(\mathbf{h}^{(k)}\) = post-activation  ·  \(W^{(k)}\) = weight matrix of layer \(k\)  ·  \(\mathbf{b}^{(k)}\) = bias vector  ·  \(J\) = total cost  ·  \(L\) = loss function  ·  \(\lambda\Omega(\theta)\) = regularization term
Compact Notation (Wxh / Why style — used in lecture slides)
\[ f(\mathbf{x};\,W_{xh},\mathbf{c},\,W_{hy},\mathbf{b}) = f_2\!\left(W_{hy}^T\cdot f_1\!\left(W_{xh}^T\cdot\mathbf{x}+\mathbf{c}\right)+\mathbf{b}\right) \] \[ \text{If biases absorbed into weight matrices:}\quad f(\mathbf{x};\,W_{xh},W_{hy}) = f_2\!\left(W_{hy}^T\cdot f_1\!\left(W_{xh}^T\cdot\mathbf{x}\right)\right) \]
\(W_{xh}\) = weights from input to hidden  ·  \(\mathbf{c}\) = hidden bias  ·  \(W_{hy}\) = weights from hidden to output  ·  \(\mathbf{b}\) = output bias  ·  \(f_1, f_2\) = activation functions at each layer
Mini-Batch X Notation (for batched training)
\[ f(\mathbf{X};\,W_{xh},\mathbf{c},\,W_{hy},\mathbf{b}) = f_2\!\left(f_1\!\left(\mathbf{X}W_{xh}+\mathbf{c}\right)\cdot W_{hy}+\mathbf{b}\right) \]
\(\mathbf{X}\) is the mini-batch matrix of shape \([n_{\text{samples}} \times n_{\text{features}}]\). When using batches, the input goes on the left: \(\mathbf{X}W_{xh}\) vs single input: \(W_{xh}^T\mathbf{x}\).
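The mini-batch forward pass \(f_2(f_1(\mathbf{X}W_{xh}+\mathbf{c})\cdot W_{hy}+\mathbf{b})\) is one line per layer in NumPy (a sketch assuming sigmoid for both \(f_1\) and \(f_2\); any activation could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W_xh, c, W_hy, b):
    """Mini-batch forward pass: f2(f1(X·Wxh + c)·Why + b)."""
    H = sigmoid(X @ W_xh + c)      # hidden activations, shape [n_samples, n_hidden]
    return sigmoid(H @ W_hy + b)   # output activations, shape [n_samples, n_out]
```

With `X` of shape `[n_samples, n_features]`, `W_xh` of shape `[n_features, n_hidden]`, and `W_hy` of shape `[n_hidden, n_out]`, the batch dimension flows through unchanged — which is exactly why the input goes on the left in this notation.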

2.4 Number of Parameters

Parameter Count — Fully Connected Layer
\[ \text{Params for layer } k = (n_{k-1}+1)\times n_k \] \[ \text{Total} = \sum_{k=1}^{L}(n_{k-1}+1)\times n_k \]
\(n_{k-1}\) = neurons in previous layer  ·  \(n_k\) = neurons in current layer  ·  \(+1\) accounts for the bias weight per neuron.
Example (Raschka): input=8, h1=12, h2=8, output=1 → \((8+1)\times12 + (12+1)\times8 + (8+1)\times1 = 108+104+9 = \mathbf{221}\) total params.
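The counting formula is a one-liner worth checking mechanically (illustrative helper; the layer list includes input width first):

```python
def fc_param_count(layer_sizes):
    """Total parameters of a fully connected net: sum of (n_{k-1} + 1) * n_k."""
    return sum((n_prev + 1) * n               # +1 accounts for the bias weight
               for n_prev, n in zip(layer_sizes, layer_sizes[1:]))

print(fc_param_count([8, 12, 8, 1]))   # Raschka example: 108 + 104 + 9 = 221
```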

2.5 Worked Example: XOR with Sigmoid

XOR Network (Sigmoid Activation — Non-Linear Solution)
\[ h_1 = \sigma(-10 + 20x_1 + 20x_2), \quad h_2 = \sigma(30 - 20x_1 - 20x_2) \] \[ y = \sigma(-30 + 20h_1 + 20h_2), \qquad \text{where}\quad \sigma(z) = \frac{1}{1+e^{-z}} \]
Input \((0,0)\): \(h_1\approx 0, h_2\approx 1 \Rightarrow y\approx 0\). Input \((0,1)\): \(h_1\approx 1, h_2\approx 1 \Rightarrow y\approx 1\). Input \((1,1)\): \(h_1\approx 1, h_2\approx 0 \Rightarrow y\approx 0\). Correctly solves XOR.
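The worked example can be verified directly (a plain-Python sketch of the hand-set network above; the gate interpretations in the comments follow from the weights):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """Hand-set sigmoid network from the worked example; y ≈ XOR(x1, x2)."""
    h1 = sigmoid(-10 + 20 * x1 + 20 * x2)    # ≈ OR gate
    h2 = sigmoid(30 - 20 * x1 - 20 * x2)     # ≈ NAND gate
    return sigmoid(-30 + 20 * h1 + 20 * h2)  # ≈ AND(h1, h2) = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))        # 0, 1, 1, 0
```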
03 · Activation Functions

Activation Functions & Loss Functions

Non-linear functions applied element-wise after the weighted sum. Without non-linearity, stacking any number of layers collapses to a single linear transformation.

3.1 Why Non-linearity?

Universal Approximation Theorem: a single hidden layer with sufficiently many neurons can approximate any continuous function to arbitrary accuracy. In practice, however, deep networks require exponentially fewer neurons than shallow networks for the same approximation quality, generalize better, and empirically outperform shallow networks. In the street-address transcription experiment reported by Goodfellow et al. (2016, Fig. 6.6), test accuracy improves steadily as depth increases from 3 to 11 layers; CNNs also consistently outperform fully-connected nets with the same parameter count.

3.2 Common Activation Functions

Key Derivatives (required for Backpropagation)

Activation Derivatives
\[ \sigma'(z) = \sigma(z)\bigl(1-\sigma(z)\bigr) \] \[ \tanh'(z) = 1 - \tanh^2(z) \] \[ \text{ReLU}'(z) = \begin{cases}1 & z>0\\0 & z\le0\end{cases} \]
These derivatives appear in every backprop computation. Sigmoid and tanh derivatives become very small near saturation (→ vanishing gradient). ReLU derivative is either 0 or 1 — no vanishing gradient for positive inputs.
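The three derivatives can be written down directly (an illustrative sketch; the comments note the maxima that explain the vanishing-gradient behavior):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # σ'(z) = σ(z)(1 − σ(z)); maximum 0.25 at z = 0

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2   # tanh'(z) = 1 − tanh²(z); maximum 1 at z = 0

def d_relu(z):
    return (np.asarray(z) > 0).astype(float)   # 1 for z > 0, else 0
```

Evaluating `d_sigmoid(10.0)` gives roughly 4.5e-5 — the saturation that starves deep sigmoid networks of gradient.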

3.3 Loss Functions

Mean Squared Error (MSE) — Regression
\[ L(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n(y_i - \hat{y}_i)^2 \]
Use with linear output activation. Penalizes larger errors quadratically.
Binary Cross-Entropy — Binary Classification
\[ L(y,\hat{y}) = -\frac{1}{n}\sum_i\Bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\Bigr] \]
Use with sigmoid output. Measures divergence between true and predicted distributions.
Categorical Cross-Entropy — Multi-Class Classification
\[ L(y,\hat{y}) = -\sum_{c=1}^C y_c\log\hat{y}_c \]
Use with softmax output for C classes. \(y_c=1\) for the true class, 0 otherwise (one-hot encoding).
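The three losses in NumPy (a minimal sketch; the `eps` clipping is a standard numerical guard against \(\log 0\), not part of the formulas above):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error, for regression with a linear output."""
    return np.mean((y - y_hat) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Binary CE, for a sigmoid output; y in {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, y_hat, eps=1e-12):
    """Categorical CE for one sample: picks out -log p of the true class."""
    return -np.sum(y_onehot * np.log(np.clip(y_hat, eps, 1.0)))
```

With a one-hot target \([0,1,0]\) and prediction \([0.1, 0.8, 0.1]\), the categorical CE reduces to \(-\log 0.8\), exactly as the formula says.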
04 · Backpropagation

Backpropagation

Algorithm to compute gradients of the loss w.r.t. all weights by applying the chain rule layer-by-layer in reverse. Goal: set weights to minimize error \(E\). (Mitchell 1997; Goodfellow et al. 2016)

4.1 Three-Step Learning Procedure

Learning Procedure
1.Forward Propagation: Compute output \(\hat{y}\) for input \(\mathbf{x}\). Cache all pre-activations \(\mathbf{a}^{(k)}\) and post-activations \(\mathbf{h}^{(k)}\).
2.Backward Propagation: Compute error term \(\delta\) for each unit. The target for hidden units is NOT directly available — it must be inferred from output deltas via the chain rule.
3.Weight Update: Update every weight using the computed gradients and learning rate \(\eta\).

4.2 Gradient for Output Unit Weights (with Sigmoid)

Using the chain rule to compute \(\partial E_d / \partial w_{ji}\) for a weight connecting unit \(i\) to output unit \(j\):

Chain Rule Decomposition — Output Layer
\[ \frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial o_j}\cdot\frac{\partial o_j}{\partial\,\text{net}_j}\cdot\frac{\partial\,\text{net}_j}{\partial w_{ji}} = -(t_j-o_j)\,o_j(1-o_j)\,x_{ji} \]
\(\dfrac{\partial E_d}{\partial o_j} = -(t_j-o_j)\) (from SSE: \(E_d=\tfrac{1}{2}\sum_k(t_k-o_k)^2\))  ·  \(\dfrac{\partial o_j}{\partial\,\text{net}_j} = o_j(1-o_j)\) (sigmoid derivative)  ·  \(\dfrac{\partial\,\text{net}_j}{\partial w_{ji}} = x_{ji}\)
Output Unit Error Signal (δ)
\[ \delta_j = o_j(1-o_j)(t_j - o_j) \] \[ \therefore\quad \frac{\partial E_d}{\partial w_{ji}} = -\delta_j\,x_{ji} \]
\(\delta_j\) is the error signal at output unit \(j\). It captures how wrong the output is, scaled by how sensitive the activation is at that operating point.

4.3 Gradient for Hidden Unit Weights

For hidden unit \(h\), no direct target is available. We back-propagate error from all output units \(k\) that receive from \(h\):

Hidden Unit Error Signal (δh)
\[ \delta_h = o_h(1-o_h)\sum_{k\in\,\text{outputs}(h)} w_{kh}\,\delta_k \] \[ \therefore\quad \frac{\partial E_d}{\partial w_{hi}} = -\delta_h\,x_{hi} \]
\(\delta_h\) is a weighted sum of output deltas, weighted by the connections from \(h\) to each output unit. This is the "propagating error backward" step that gives backpropagation its name.

4.4 Weight Update (same for all layers)

Weight Update Rule — Mitchell (1997), Eq. T4.5
\[ w_{ji} \leftarrow w_{ji} + \Delta w_{ji}, \qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji} \]
\(\eta\) = learning rate  ·  \(\delta_j\) = error signal at unit \(j\)  ·  \(x_{ji}\) = input from unit \(i\) to unit \(j\)

4.5 Full Backpropagation Algorithm (Mitchell, 1997)

BACKPROPAGATION(training_examples, \(\eta\), \(n_{\text{in}}\), \(n_{\text{out}}\), \(n_{\text{hidden}}\))
Init: create a feed-forward network with \(n_{\text{in}}\) inputs, \(n_{\text{hidden}}\) hidden units, and \(n_{\text{out}}\) output units. Initialize all weights to small random values (e.g., between −0.05 and 0.05).
Repeat until termination (fixed iterations / training error < threshold / validation criterion met). For each \(\langle\mathbf{x}, \mathbf{t}\rangle\) in training_examples:
1. Forward pass: input \(\mathbf{x}\) to the network; compute the output \(o_u\) of every unit \(u\) in the network.
2. Output deltas: for each output unit \(k\):
\[\delta_k \leftarrow o_k(1-o_k)(t_k - o_k)\]
3. Hidden deltas: for each hidden unit \(h\):
\[\delta_h \leftarrow o_h(1-o_h)\sum_{k\,\in\,\text{outputs}}w_{kh}\,\delta_k\]
4. Update weights: for each weight \(w_{ji}\):
\[w_{ji} \leftarrow w_{ji} + \Delta w_{ji},\qquad \Delta w_{ji} = \eta\,\delta_j\,x_{ji}\]
Computational Efficiency: One backward pass computes gradients for ALL parameters simultaneously, reusing cached forward-pass values. This makes deep network training feasible.
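The full algorithm for one hidden layer fits in a short NumPy sketch (illustrative: the function names, layer width, \(\eta\), and epoch count are choices, not from the source; init range ±0.05 follows Mitchell). It is trained below on the OR function:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=4, eta=0.5, epochs=5000, seed=0):
    """Mitchell-style BACKPROPAGATION: one hidden layer, stochastic updates."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W1 = rng.uniform(-0.05, 0.05, (n_in + 1, n_hidden))   # input→hidden (+bias row)
    W2 = rng.uniform(-0.05, 0.05, (n_hidden + 1, n_out))  # hidden→output (+bias row)
    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(1.0, x)                 # bias input x0 = +1
            h = sigmoid(x1 @ W1)                   # hidden outputs
            h1 = np.append(1.0, h)
            o = sigmoid(h1 @ W2)                   # network outputs
            delta_o = o * (1 - o) * (t - o)        # δ_k = o_k(1−o_k)(t_k−o_k)
            delta_h = h * (1 - h) * (W2[1:] @ delta_o)  # δ_h = o_h(1−o_h) Σ w_kh δ_k
            W2 += eta * np.outer(h1, delta_o)      # Δw_ji = η δ_j x_ji
            W1 += eta * np.outer(x1, delta_h)
    return W1, W2

def predict(X, W1, W2):
    H = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ W1)
    return sigmoid(np.hstack([np.ones((len(X), 1)), H]) @ W2)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [1.]])             # OR targets
W1, W2 = backprop_train(X, T)
```

The same code also handles XOR, though convergence then depends on the random initialization — backprop gives no global-optimum guarantee.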

4.6 Common Training Problems

Vanishing Gradient
Gradients shrink exponentially through deep layers. Sigmoid/tanh worsen this near saturation. Fix: ReLU, residual connections, batch normalization.
Exploding Gradient
Gradients grow exponentially → unstable (NaN). Fix: gradient clipping, careful initialization.
Local Minima
Batch GD can get stuck. SGD noise helps escape. No guarantee of global minimum if multiple local minima exist.
Learning Rate η
Too small → converges very slowly. Too large → overshoots and diverges. Use schedules or adaptive optimizers (Adam).
Overfitting
Network memorizes training data. Fix: dropout, L1/L2 regularization \(\lambda\Omega(\theta)\), early stopping, more data.
Dying ReLU
Neurons always output 0 → zero gradient → never recover. Fix: Leaky ReLU (\(\alpha=0.01\) slope), He initialization.
05 · Convolutional Neural Network

Convolutional Neural Network (CNN)

An ANN that replaces general matrix multiplication with the convolution operation in at least one layer. Explicitly assumes inputs have grid-like (spatial) topology. (LeCun et al. 1989, 1998; Goodfellow et al. 2016; Lepe & Vazquez 2017)

5.1 Why CNN? — The Problem with FFNN on Images

Dimensionality Explosion: A \(320\times280\) RGB image contains \(320\times280\times3 = 268{,}800\) input values. Densely connecting it to an equally sized output layer requires a weight matrix with over 8 billion entries (roughly 16 billion floating-point operations). A convolution with a \(1\times2\) kernel needs only \(319\times280\times3 = 267{,}960\) operations (two multiplications and one addition per output) — roughly 60,000× more efficient (Goodfellow et al. 2016, Ch. 9).

Three core problems with FFNN on images:

  • Dimensionality explosion. Flattening a 2D/3D image to 1D and connecting every pixel to every neuron creates an impractical number of parameters.
  • Spatial information destroyed. Flattening kills neighborhood relationships. Pixels and their neighbors carry joint meaning (edges, textures); once flattened they are independent numbers.
  • Parameter explosion → overfitting. More weights = more memorization, not better generalization.

5.2 What CNN Is

Input
A multidimensional array (e.g., RGB image of size \(H\times W\) stored as \(H\times W\times 3\) array of pixel intensities 0–255).
Kernel / Filter
Small learnable weight matrix (e.g., 3×3 or 5×5) that slides over the input. Also called: weights, parameters, filter.
Feature Map
Output produced when a kernel is convolved with the input. High values = pattern detected at that location.
Convolution
Sliding a kernel over the input, computing dot products at each position to produce one value per location in the feature map.

5.3 Two Core Innovations

Local (Sparse) Connectivity

Each output neuron connects only to a local region of the input — the receptive field. A 3×3 kernel: each output connects to only 9 inputs, not thousands. Biologically inspired: Hubel & Wiesel (1959) found visual cortex neurons respond to small local visual regions.

Parameter Sharing

The same kernel weights are used at every spatial position. One kernel is learned and applied everywhere — instead of unique weights per position. Gives translation equivariance: the same feature anywhere uses the same detector.

LeCun 1989 Parameter Comparison (Net-3)

Configuration · Weights
Fully connected (FFNN) · >3,200
Locally connected (no weight sharing) · 1,226
Locally connected + weight sharing (CNN) · 206

5.4 The Convolution Operation

1D Convolution (Conceptual)

Slide a kernel across a 1D input array, computing element-wise products and summing at each position. Parameter sharing: same kernel values reused at every position.

2D Convolution (Grayscale Images)

The kernel slides across both rows and columns. At each position, multiply overlapping values element-wise and sum to produce one output value in the feature map.

3D Convolution (RGB / Multi-channel)

For RGB images, the kernel has depth = number of input channels (\(C_{\text{in}}\)). A kernel of size \(F\times F\times C_{\text{in}}\) produces one feature map. Using \(K\) different kernels produces \(K\) feature maps (output depth = \(K\)).
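A direct (loop-based, deliberately unoptimized) 2D convolution makes the sliding-window computation concrete. This is an illustrative sketch for a single channel and a square kernel; note it computes cross-correlation, which is what CNN libraries call "convolution":

```python
import numpy as np

def conv2d(X, K, stride=1, pad=0):
    """Valid 2D cross-correlation of input X with square kernel K."""
    if pad:
        X = np.pad(X, pad)                        # zero padding on all borders
    F = K.shape[0]
    V_h = (X.shape[0] - F) // stride + 1          # output dims: (W − F + 2P)/S + 1
    V_w = (X.shape[1] - F) // stride + 1
    out = np.zeros((V_h, V_w))
    for i in range(V_h):
        for j in range(V_w):
            patch = X[i*stride:i*stride+F, j*stride:j*stride+F]
            out[i, j] = np.sum(patch * K)         # element-wise product, then sum
    return out

X = np.array([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
K = np.array([[1., 0.], [0., 1.]])                # same kernel reused at every position
print(conv2d(X, K))                               # [[ 6.  8.] [12. 14.]]
```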

Output Dimension Formula
\[ V = \frac{W - F + 2P}{S} + 1 \]
\(W\) = input width/height (assumes square)  ·  \(F\) = filter/kernel size  ·  \(P\) = padding (zeros added around border)  ·  \(S\) = stride  ·  \(V\) = output spatial dimension  ·  Output volume = \(V \times V \times K\) (for \(K\) filters)

Padding Types

  • Valid (P=0): No padding → output shrinks. Each conv layer reduces spatial dimensions.
  • Same (P=(F−1)/2 with S=1): Zero-pad to keep output same size as input.

Worked Examples

  • Input 3×3×3, 1 filter of size 2×2×3, P=0, S=1: \(V=(3-2+0)/1+1=2\) → output volume 2×2×1
  • Input 3×3×2, 3 filters of size 2×2×2, P=0, S=1: \(V=(3-2+0)/1+1=2\) → output volume 2×2×3
  • Input 32×32×3, 10 filters of size 5×5×3, P=0, S=1: \(V=(32-5+0)/1+1=28\) → output volume 28×28×10
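The output-dimension formula is easy to mis-apply under exam pressure; a small checker helps (illustrative helper; the divisibility assert catches filter/stride combinations that do not tile the input):

```python
def conv_output_dim(W, F, P=0, S=1):
    """V = (W − F + 2P)/S + 1; (W − F + 2P) must be divisible by S."""
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_dim(3, 2))          # 2  (first worked example)
print(conv_output_dim(32, 5))         # 28 (third worked example)
print(conv_output_dim(32, 5, P=2))    # 32 ("same" padding: P = (F−1)/2 with S=1)
```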

5.5 The Three Stages of a CNN Layer

Three-Stage CNN Layer Structure
Stage 1 Convolution (Affine Transform): Kernel slides across input computing linear combinations. Produces raw feature map — still linear.
\[\text{net}_c = \mathbf{X}\star K\quad\text{(where }\star\text{ denotes convolution/cross-correlation)}\]
Stage 2 Detector (Non-linearity / Activation): Non-linear activation applied element-wise. Without this, stacking conv layers collapses to one linear transform. Most common: ReLU. \[H = f_c(\text{net}_c) = \max(0,\,\text{net}_c)\] Example: \([-9, 32, 14, -6]\xrightarrow{\text{ReLU}}[0, 32, 14, 0]\)
Stage 3 Pooling (Downsampling): Reduces spatial dimensions. Benefits: fewer parameters downstream, translation invariance (small input shifts don’t change output much), controls overfitting. Most common: Max Pooling.

Pooling Types

  • Max Pooling: maximum value in each window. Most common. A 2×2 pool with S=2 halves the spatial dimensions. Preserves dominant features.
  • Average Pooling: mean value in each window. Smoother downsampling. Used in LeNet-5.
  • L2-norm Pooling: \(\sqrt{\sum x_i^2}\) in each window. Less common. Emphasizes larger activations.
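Max pooling in loop form (an illustrative sketch; 2×2 window with stride 2, the most common configuration, is the default):

```python
import numpy as np

def max_pool(X, size=2, stride=2):
    """Max pooling: take the maximum over each size×size window."""
    V_h = (X.shape[0] - size) // stride + 1
    V_w = (X.shape[1] - size) // stride + 1
    out = np.zeros((V_h, V_w))
    for i in range(V_h):
        for j in range(V_w):
            out[i, j] = X[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

X = np.array([[1., 3., 2., 4.],
              [5., 6., 7., 8.],
              [3., 2., 1., 0.],
              [1., 2., 3., 4.]])
print(max_pool(X))    # 4×4 → 2×2: [[6. 8.] [3. 4.]]
```

Shifting the dominant value one pixel within its window leaves the output unchanged — the translation invariance the table describes.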

5.6 Parameter Counting in CNN Layers

Number of Neurons in a Conv Layer
\[ N_{\text{neurons}} = V_H \times V_W \times K \]
\(V_H, V_W\) = output height and width (from output formula)  ·  \(K\) = number of filters
Parameters WITHOUT Weight Sharing (locally connected)
\[ N_{\text{params}} = N_{\text{neurons}} \times (F\times F\times C_{\text{in}} + 1) \]
Parameters WITH Weight Sharing (standard CNN)
\[ N_{\text{params}} = K \times (F\times F\times C_{\text{in}} + 1) \]
\(K\) = number of filters  ·  \(F\times F\) = filter spatial size  ·  \(C_{\text{in}}\) = input channels  ·  \(+1\) = one bias per filter
Example (LeNet C1): \(K=6,\, F=5,\, C_{\text{in}}=1 \Rightarrow 6\times(25\times1+1) = 6\times26 = \mathbf{156}\) params
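Both parameter formulas in one helper (illustrative; `n_neurons` is only needed for the locally connected, non-shared case):

```python
def conv_params(K, F, C_in, shared=True, n_neurons=None):
    """Conv-layer parameters: K×(F·F·C_in + 1) with weight sharing,
    or N_neurons×(F·F·C_in + 1) without (locally connected)."""
    per_filter = F * F * C_in + 1          # +1 = one bias per filter
    return K * per_filter if shared else n_neurons * per_filter

print(conv_params(6, 5, 1))     # LeNet C1: 6×26 = 156
print(conv_params(16, 5, 6))    # LeNet C3: 16×151 = 2416
```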

5.7 Full CNN Architecture Pattern

A typical CNN repeats [Conv + ReLU → Pool] blocks, then flattens into fully-connected layers for classification:

INPUT
H×W×C
CONV
+ReLU
POOL
Max/Avg
CONV
+ReLU
POOL
Max/Avg
FLATTEN
1D vector
FC
Dense
OUTPUT
Softmax
Feature Hierarchy: Conv layer 1 → edges/colors. Conv layer 2 → corners/textures. Deeper layers → shapes, object parts, objects. Pooling reduces spatial size. FC layers do the final classification based on extracted features.

5.8 LeNet-5 (LeCun et al., 1998)

The first practical CNN for handwritten digit recognition (MNIST). Input: 32×32 grayscale (MNIST 28×28 padded to 32×32). End-to-end training with backpropagation.

Layer · Type · Output Shape · Details · Parameters
Input · Image · 32×32×1 · Grayscale, padded from 28×28 · —
C1 · Conv · 28×28×6 · 6 filters, 5×5, S=1, P=0, tanh; \(V=(32-5)/1+1=28\) · \(6\times(5\times5\times1+1)=\mathbf{156}\)
S2 · Avg Pool · 14×14×6 · 2×2 window, S=2 · 0 (no learnable params)
C3 · Conv · 10×10×16 · 16 filters, 5×5, S=1, P=0, tanh; \(V=(14-5)/1+1=10\) · \(16\times(5\times5\times6+1)=\mathbf{2{,}416}\)
S4 · Avg Pool · 5×5×16 · 2×2 window, S=2 · 0
Flatten · — · 400 · \(5\times5\times16=400\) · —
C5 · FC · 120 · Fully connected, tanh · \((400+1)\times120=\mathbf{48{,}120}\)
F6 · FC · 84 · Fully connected, tanh · \((120+1)\times84=\mathbf{10{,}164}\)
Output · FC · 10 · 10 classes (digits 0–9), Softmax · \((84+1)\times10=\mathbf{850}\)
Total Parameters · 61,706
LeNet-5 Total Parameter Verification
\[ 156 + 0 + 2{,}416 + 0 + 48{,}120 + 10{,}164 + 850 = \mathbf{61{,}706} \]
C3 uses 5×5 kernel on 6-channel S2 output: \(16\times(5\times5\times6+1)=16\times151=2{,}416\). Tiny by modern standards (ResNet-50 ≈25M params), but proved CNN viability for image tasks for the first time.

5.9 Notable CNN Architectures

Architecture · Year · Key Innovation
LeNet-5 · 1998 · First practical CNN. Local connectivity + weight sharing.
AlexNet · 2012 · 8 layers, ReLU activation (then novel for deep CNNs), dropout, GPU training on ImageNet. Marked the "Rise of CNN."
VGG16 · 2014 · 16 layers, only 3×3 filters. Full model: ∼138M params. Input: 224×224×3.
GoogLeNet · 2014 · Inception modules: parallel filter paths of different sizes in one layer.
ResNet · 2015 · Skip (residual) connections: gradients flow directly through layers, enabling 50/101/152-layer nets.
Vision Transformer · 2020 · Transformer self-attention applied to image patches; no convolution needed.

5.10 CNN vs FFNN — Complete Comparison

Aspect · FFNN (MLP) · CNN
Input handling · Flattened 1D vector (spatial structure lost) · Grid-structured (spatial info preserved)
Connectivity · Every neuron to every input (fully connected) · Each neuron to a local receptive field only
Weight sharing · No: unique weight per connection · Yes: same kernel at every position
Spatial feature detection · No · Yes: kernels learn edges, textures, shapes
Parameter count · Very high for image inputs · Much lower due to weight sharing
Overfitting tendency · Higher for image tasks · Lower for image tasks
Feature engineering · Manual · Automatic (learned filters)
Translation invariance · No · Yes (shared filters + pooling)
Best for · Tabular data, fixed-size vectors · Images, audio spectrograms, grid-structured data
Two Key Properties of CNN: (1) Translation equivariance: if the input shifts, the feature map shifts by the same amount — a feature is detected regardless of where it appears. (2) Hierarchical feature learning: layer 1 learns edges → layer 2 learns textures → deeper layers learn shapes → object parts → full objects.

5.11 Key CNN Terms

Kernel / Filter (K)
Learnable weight matrix. One kernel → one feature map. \(K\) kernels → \(K\) output channels.
Feature Map
Output of applying one kernel to the full input. High values = pattern detected at that location.
Receptive Field
Local region of input one filter examines. Size = \(F\times F\). Deeper layers have larger effective receptive fields.
Stride (S)
Pixels the kernel moves per step. \(S=1\) → overlapping. \(S=2\) → roughly halves output size.
Padding (P)
“Valid” (P=0): output shrinks. “Same” (\(P=(F-1)/2\) with S=1): output same size as input.
Parameter Sharing
Same kernel weights applied everywhere → dramatic parameter reduction. FFNN would need unique weights per connection.
Sparse Connectivity
Each output unit connects to only \(F\times F\times C_{\text{in}}\) inputs (the receptive field), not all inputs.
Translation Equivariance
If input shifts, feature map shifts by same amount. Kernels detect features at any position using same weights.
Translation Invariance
After pooling, small shifts in input do not change the output. Pooling provides this property.
Flatten
Reshape 3D volume \((H\times W\times C)\) after conv/pool into a 1D vector to feed into FC layers.
Feature Hierarchy
Early layers: edges → Middle: corners, textures → Deep: object parts, faces, objects.
Depth (channels)
Number of feature maps in a layer = number of filters \(K\). RGB image has depth 3 (input channels).
Glossary

All Key Terms at a Glance

Every important term from the midterm topics, organized by section.

Math & Notation

x, X input vector / mini-batch
w, W weights
b, w₀, c bias
z / net / a pre-activation
o / h post-activation
ŷ predicted output
y / t true label / target
E / J error / total cost
η learning rate
δ error signal (delta)
λ regularization rate
Ω(θ) regularizer
σ sigmoid
f(·) activation function
K kernel/filter
V output dim
F filter size
P padding
S stride

Perceptron Terms

Perceptron artificial neuron for linear classification
McCulloch-Pitts early neuron model
threshold θ cutoff for firing
sign function outputs ±1
step function outputs 0/1
net input z weighted sum before activation
hyperplane decision surface linear separator
linear separability classes separable by a hyperplane
hypothesis space H set of possible weights
perceptron training rule update rule for thresholded output
convergence theorem finite convergence for separable data
XOR problem not linearly separable
m-of-n function true if at least m inputs are true
epoch one full pass over the data
batch size samples per update group
step one mini-batch update
delta rule LMS / Widrow-Hoff update
LMS rule least mean squares rule
Adaline adaptive linear neuron
Widrow-Hoff another name for delta rule
batch GD gradient descent over all samples
SGD stochastic gradient descent
mini-batch SGD gradient descent on small batches
gradient ∇E(w) direction of steepest ascent
SSE sum of squared errors

FFNN Terms

feedforward information flows only forward
MLP multi-layer perceptron
hidden layer intermediate representation layer
depth number of layers
width number of neurons per layer
Wxh / Why notation compact layer notation
compact / explicit style two equivalent notations
forward propagation compute outputs from inputs
backpropagation reverse-mode gradient computation
chain rule derivative composition rule
gradient descent minimize loss by following negative gradient
vanishing gradient gradients shrink through layers
exploding gradient gradients grow too large
overfitting memorizes training data
underfitting model too simple
sigmoid squashing activation
tanh zero-centered squashing activation
ReLU rectified linear unit
leaky ReLU ReLU with small negative slope
softmax probability distribution over classes
MSE mean squared error
binary CE binary cross-entropy
categorical CE multi-class cross-entropy
output delta δj error signal at output unit
hidden delta δh error signal at hidden unit
universal approximation theorem single hidden layer can approximate any continuous function
one-hot encoding vector with one active class entry
regularization λΩ(θ) penalty term to reduce overfitting

CNN Terms

convolution / cross-correlation sliding local dot product
kernel / filter learnable weight window
feature map convolution output
stride S step size of the kernel
padding P zero border added to input
valid padding no padding, output shrinks
same padding keeps output size
receptive field local input region a filter sees
parameter sharing same kernel reused everywhere
sparse connectivity each output connects to local inputs
max pooling take maximum value per window
average pooling take mean value per window
L2-norm pooling use Euclidean norm per window
flatten reshape volume into 1D vector
translation equivariance shifted input shifts feature map
translation invariance small shifts do not change output much
feature hierarchy edges to objects across layers
depth (channels) number of feature maps
V = (W−F+2P)/S + 1 output dimension formula
Nparams = K×(F²×Cin+1) CNN parameter formula
LeNet-5 first practical CNN
AlexNet deep CNN breakthrough
VGG16 deep stack of 3×3 filters
ResNet / skip connections residual shortcut paths
locally connected no weight sharing

IF3270 Pembelajaran Mesin · Sem 2-2025/2026 · Tim Pengajar IF3270
Topics: Perceptron · FFNN · CNN  |  Excluded: Ensemble · CNN Backpropagation