References: Goodfellow et al. (2016) Deep Learning · LeCun et al. (1989, 1998) · Mitchell (1997) · Raschka et al. (2022)
01 · Perceptron
The Perceptron
The foundation of all neural networks — a single artificial
neuron that learns to classify linearly separable data. (Rosenblatt
1958; Mitchell 1997, Ch. 4; Raschka 2022, Ch. 11)
1.1 Biological Motivation
McCulloch and Pitts (1943) described a neuron as a simple logic gate
with binary outputs: multiple signals arrive at the
dendrites, are integrated in the cell
body, and if the accumulated signal exceeds a threshold, an
output signal is generated along the axon.
Dendrites →
x
Receive input signals. Analogous to
input features \(x_1,x_2,\ldots,x_m\).
Synapse strength →
w
Connection strength. Analogous to
weights \(w_i\) — the importance / contribution of each
input.
Cell body →
Σ
Integrates all incoming
signals. Analogous to the weighted sum
\(\mathbf{w}^T\mathbf{x}\).
Axon firing →
f(net)
Fires (outputs signal) only when
the threshold is exceeded. Analogous to the activation
function.
1.2 Formal Definition of a Neuron
Perceptron Architecture
Given real-valued input \(\mathbf{x}=(x_1,\ldots,x_m)\), weight
vector \(\mathbf{w}=(w_1,\ldots,w_m)\), and bias \(w_0\) (also written
\(b\) or \(\theta\)). The hypothesis space is \(H =
\{\mathbf{w}\mid\mathbf{w}\in\mathbb{R}^{d+1}\}\).
Both produce a discrete-valued
output. The perceptron represents a hyperplane decision
surface in \(\mathbb{R}^n\): a line in \(\mathbb{R}^2\),
a plane in \(\mathbb{R}^3\), and an \((n-1)\)-dimensional subspace
in \(\mathbb{R}^n\).
1.3 m-of-n Functions
A perceptron easily represents m-of-n functions:
functions where at least \(m\) of the \(n\) inputs must be true.
Strategy: set all input weights to the same value (e.g., 0.5) then set
the threshold \(w_0\).
Example: AND with \(w_0=-0.8,\;
w_1=w_2=0.5\). Compute \(\Sigma = -0.8 + 0.5x_1 + 0.5x_2\) and apply
sign. Only when \(x_1=x_2=1\) gives \(\Sigma = 0.2 > 0 \Rightarrow
+1\).
The XOR Problem (Minsky & Papert, 1969)
A single perceptron can only solve linearly separable problems. XOR cannot be separated by any hyperplane — no single perceptron can learn it. This drove the development of multi-layer networks.
1.4 Perceptron Training (Learning) Rule
Uses the thresholded output for the update signal.
Proven to converge if training data is linearly separable and learning
rate \(\eta\) is sufficiently small.
Perceptron Training Rule — Mitchell
(1997)
InputTraining data \(D =
\{\langle\mathbf{x}_i, y_i\rangle\}\), learning rate
\(\eta\), activation: sign or step
InitSet
each \(w_k \leftarrow 0\) or small random value. Set \(E
\leftarrow 0\).
RepeatFor each
example \(\langle\mathbf{x}_i, y_i\rangle\):
\(\eta\) = learning rate · \(y_i\) =
true label (±1) · \(o_i\) = predicted label
(thresholded) · \(x_{ij}\) = \(j\)-th feature of example
\(i\). If correct prediction: \(y_i = o_i \Rightarrow \Delta w_j =
0\) (no update).
Convergence Theorem
The Perceptron Training Rule is guaranteed to converge in a finite number of steps if the training data is linearly separable and \(\eta\) is sufficiently small. It does NOT converge if data is not linearly separable.
1.5 Batch Gradient Descent — The Delta Rule
When data is NOT linearly separable, the Perceptron rule fails. The
Delta Rule (= Widrow-Hoff rule = LMS rule = Adaline
rule) uses the unthresholded linear output and minimizes SSE
— always converges to best-fit approximation.
One
weight update per epoch. Stable, but slow for large datasets. Can
get stuck in local minima.
Stochastic (Incremental) GD
\[ \Delta w_k = \eta\,(y_d - o_d)(x_{dk}) \]
One
update per example. Faster, noisier gradient — can escape
local minima. Both Batch and SGD use the linear function and give
best-fit approximation.
Epoch
One full pass through all training data (or all
mini-batches). Features and labels processed once.
Batch Size
Number of samples per mini-batch split. If = 1
→ true SGD. If = |D| → Batch GD. If 1 < b < |D|
→ Mini-batch SGD.
Step
Processing one mini-batch, ending with one weight
update. Steps per epoch = |D| / batch_size.
Key Distinction
Perceptron Training Rule uses thresholded output (sign/step) for updates — only converges if linearly separable. Delta Rule / Batch GD uses unthresholded linear output — converges to best-fit regardless of separability.
02 · Feedforward Neural Network
Feedforward Neural Network (FFNN)
Also called Multi-Layer Perceptron (MLP) — stacks layers of
neurons with non-linear activations to learn complex mappings.
Connections flow only forward: a directed acyclic graph. (Raschka
2022 Ch.11; Goodfellow et al. 2016)
2.1 From Adaline to FFNN
The Adaline (ADAptive LInear NEuron) is a
single-layer neural network and the direct bridge from the Perceptron
to FFNN. It uses the identity activation for weight updates,
and applies a threshold only at inference time.
\[ w = w + \Delta w, \quad b = b + \Delta b \qquad (b = w_0) \]
\[ \Delta w_j = -\eta\,\frac{\partial E}{\partial w_j}, \quad \Delta b
= -\eta\,\frac{\partial E}{\partial b} \]
\[ E = \frac{1}{2}(y_i - o_i)^2, \qquad \hat{y} = o = f\!\left(\sum_j
w_j x_j + b\right) \]
The activation function \(f\) here is the
identity (linear). A threshold is applied only at inference to
produce the class label 0 or 1. Multiple names, same
algorithm.
2.2 FFNN Architecture
Input Layer
(l=0)
Receives feature vector
\(\mathbf{x}\). No computation — values passed forward.
Width = number of features. Convention: \(x_0 = +1\)
(bias).
Hidden Layers
Learn internal representations. One or more
layers. Written \(H_1, H_2, \ldots, H_h\) in compact style.
Non-linear activation required.
Output Layer
Produces \(\hat{\mathbf{y}}\). Width = number of
classes (classification) or 1 (regression). Appropriate activation
(sigmoid/softmax/linear).
Depth
Number of layers (hidden + output). Deeper →
learns more abstract, hierarchical features.
Width
Neurons per layer. Wider → more capacity but
more parameters. Notated as \(n_{h_1}, n_{h_2},
\ldots\)
Fully
Connected
Every neuron in layer \(l\)
connects to every neuron in layer \(l+1\) with a unique weight
\(w_{ji}\).
2.3 Forward Propagation
Information flows from input through each layer to produce output
\(\hat{\mathbf{y}}\). Two equivalent notations are used in class:
Goodfellow Notation — General Forward
Pass Algorithm
\(\mathbf{X}\) is the mini-batch matrix of
shape \([n_{\text{samples}} \times n_{\text{features}}]\). When
using batches, the input goes on the left: \(\mathbf{X}W_{xh}\) vs
single input: \(W_{xh}^T\mathbf{x}\).
2.4 Number of Parameters
Parameter Count — Fully Connected
Layer
\[ \text{Params for layer } k = (n_{k-1}+1)\times n_k \]
\[ \text{Total} = \sum_{k=1}^{L}(n_{k-1}+1)\times n_k \]
\(n_{k-1}\) = neurons in previous layer
· \(n_k\) = neurons in current layer
· \(+1\) accounts for the bias weight per
neuron. Example (Raschka): input=8, h1=12, h2=8, output=1 →
\((8+1)\times12 + (12+1)\times8 + (8+1)\times1 = 108+104+9 =
\mathbf{221}\) total params.
Non-linear functions applied element-wise after the weighted sum.
Without non-linearity, stacking any number of layers collapses to a
single linear transformation.
3.1 Why Non-linearity?
Universal Approximation Theorem
A single hidden layer with sufficient neurons can approximate any continuous function to arbitrary accuracy. However, deep networks require exponentially fewer neurons than shallow networks for the same approximation quality, generalize better, and empirically outperform shallow networks. Test accuracy improves as depth increases from 3 to 11 layers; CNNs consistently outperform fully-connected nets with the same parameter count.
These derivatives appear in every backprop
computation. Sigmoid and tanh derivatives become very small near
saturation (→ vanishing gradient). ReLU derivative is either 0
or 1 — no vanishing gradient for positive inputs.
Use with softmax output for C classes.
\(y_c=1\) for the true class, 0 otherwise (one-hot encoding).
04 · Backpropagation
Backpropagation
Algorithm to compute gradients of the loss w.r.t. all weights by
applying the chain rule layer-by-layer in reverse. Goal: set weights
to minimize error \(E\). (Mitchell 1997; Goodfellow et al. 2016)
4.1 Three-Step Learning Procedure
Learning Procedure
1.Forward
Propagation: Compute output \(\hat{y}\) for input
\(\mathbf{x}\). Cache all pre-activations \(\mathbf{a}^{(k)}\) and
post-activations \(\mathbf{h}^{(k)}\).
2.Backward
Propagation: Compute error term \(\delta\) for each
unit. The target for hidden units is NOT directly available
— it must be inferred from output deltas via the chain
rule.
3.Weight Update:
Update every weight using the computed gradients and learning rate
\(\eta\).
4.2 Gradient for Output Unit Weights (with Sigmoid)
Using the chain rule to compute \(\partial E_d / \partial w_{ji}\)
for a weight connecting unit \(i\) to output unit \(j\):
\(\delta_j\) is the error signal at output
unit \(j\). It captures how wrong the output is, scaled by how
sensitive the activation is at that operating point.
4.3 Gradient for Hidden Unit Weights
For hidden unit \(h\), no direct target is available. We
back-propagate error from all output units \(k\) that receive from
\(h\):
\(\delta_h\) is a weighted sum of output
deltas, weighted by the connections from \(h\) to each output unit.
This is the "propagating error backward" step that gives
backpropagation its name.
InitCreate
feed-forward network with \(n_{\text{in}}\) inputs,
\(n_{\text{hidden}}\) hidden units, \(n_{\text{out}}\) output
units. Initialize all weights to small random values (e.g.,
between −0.05 and 0.05).
RepeatUntil
termination (fixed iterations / training error < threshold /
validation criterion met). Do for each \(\langle\mathbf{x},
\mathbf{t}\rangle\) in training_examples:
1.Forward pass:
Input \(\mathbf{x}\) to the network; compute output \(o_u\) of
every unit \(u\) in the network.
2.Output deltas:
For each output unit \(k\): \[\delta_k \leftarrow
o_k(1-o_k)(t_k - o_k)\]
3.Hidden deltas:
For each hidden unit \(h\): \[\delta_h \leftarrow
o_h(1-o_h)\sum_{k\,\in\,\text{outputs}}w_{kh}\,\delta_k\]
4.Update weights:
For each weight \(w_{ji}\): \[w_{ji} \leftarrow w_{ji} + \Delta
w_{ji},\qquad \Delta w_{ji} =
\eta\,\delta_j\,x_{ji}\]
Computational Efficiency
One backward pass computes gradients for ALL parameters simultaneously, reusing cached forward-pass values. This makes deep network training feasible.
4.6 Common Training Problems
Vanishing
Gradient
Gradients shrink exponentially
through deep layers. Sigmoid/tanh worsen this near saturation.
Fix: ReLU, residual connections, batch normalization.
Batch GD can get stuck. SGD noise helps escape.
No guarantee of global minimum if multiple local minima
exist.
Learning Rate
η
Too small → converges very
slowly. Too large → overshoots and diverges. Use schedules or
adaptive optimizers (Adam).
Overfitting
Network memorizes training data. Fix: dropout,
L1/L2 regularization \(\lambda\Omega(\theta)\), early stopping,
more data.
Dying ReLU
Neurons always output 0 → zero gradient
→ never recover. Fix: Leaky ReLU (\(\alpha=0.01\) slope), He
initialization.
05 · Convolutional Neural Network
Convolutional Neural Network (CNN)
An ANN that replaces general matrix multiplication with the
convolution operation in at least one layer. Explicitly
assumes inputs have grid-like (spatial) topology. (LeCun et al.
1989, 1998; Goodfellow et al. 2016; Lepe & Vazquez 2017)
5.1 Why CNN? — The Problem with FFNN on Images
Dimensionality Explosion
A \(320\times280\) RGB image has \(320\times280\times3 = 268{,}800\) pixels. A single FFNN input→output layer on that image requires \(>8\) billion multiplications. A CNN with a \(1\times2\) kernel needs only \(319\times280\times3 = 267{,}960\) operations — roughly 60,000× more efficient.
Three core problems with FFNN on images:
Dimensionality explosion. Flattening a 2D/3D
image to 1D and connecting every pixel to every neuron creates an
impractical number of parameters.
Spatial information destroyed. Flattening kills
neighborhood relationships. Pixels and their neighbors carry joint
meaning (edges, textures); once flattened they are independent
numbers.
Parameter explosion → overfitting. More
weights = more memorization, not better generalization.
5.2 What CNN Is
Input
A multidimensional array (e.g., RGB image of size
\(H\times W\) stored as \(H\times W\times 3\) array of pixel
intensities 0–255).
Kernel /
Filter
Small learnable weight matrix
(e.g., 3×3 or 5×5) that slides over the input. Also
called: weights, parameters, filter.
Feature Map
Output produced when a kernel is convolved with
the input. High values = pattern detected at that
location.
Convolution
Sliding a kernel over the input, computing dot
products at each position to produce one value per location in the
feature map.
5.3 Two Core Innovations
Local (Sparse) Connectivity
Each
output neuron connects only to a local region of
the input — the receptive field. A
3×3 kernel: each output connects to only 9 inputs, not
thousands. Biologically inspired: Hubel & Wiesel (1959) found
visual cortex neurons respond to small local visual regions.
Parameter Sharing
The
same kernel weights are used at every spatial
position. One kernel is learned and applied everywhere —
instead of unique weights per position. Gives translation
equivariance: the same feature anywhere uses the same
detector.
LeCun 1989 Parameter Comparison (Net-3)
Configuration
Weights
Fully connected (FFNN)
>3,200
Locally connected (no weight sharing)
1,226
Locally connected + weight sharing
(CNN)
206
5.4 The Convolution Operation
1D Convolution (Conceptual)
Slide a kernel across a 1D input array, computing element-wise
products and summing at each position. Parameter sharing: same kernel
values reused at every position.
2D Convolution (Grayscale Images)
The kernel slides across both rows and columns. At each position,
multiply overlapping values element-wise and sum to produce one output
value in the feature map.
3D Convolution (RGB / Multi-channel)
For RGB images, the kernel has depth = number of input
channels (\(C_{\text{in}}\)). A kernel of size \(F\times
F\times C_{\text{in}}\) produces one feature map. Using \(K\)
different kernels produces \(K\) feature maps (output depth =
\(K\)).
This diagram illustrates the most common pooling operation: max pooling. The table below describes how average pooling and L2-norm pooling differ by using mean or root-sum-of-squares over each window instead of the maximum.
Type
Operation
Notes
Max Pooling
Maximum value in each
window
Most common. A 2×2 pool with S=2 halves the
spatial dimensions. Preserves dominant features.
Average Pooling
Mean value in each
window
Smoother downsampling. Used in LeNet-5.
L2-norm Pooling
\(\sqrt{\sum x_i^2}\)
in each window
Less common. Emphasizes larger
activations.
5.6 Parameter Counting in CNN Layers
Number of Neurons in a Conv Layer
\[ N_{\text{neurons}} = V_H \times V_W \times K \]
\(V_H, V_W\) = output height and width (from
output formula) · \(K\) = number of filters
Parameters WITHOUT Weight Sharing (locally
connected)
\(K\) = number of filters
· \(F\times F\) = filter spatial size
· \(C_{\text{in}}\) = input channels
· \(+1\) = one bias per filter
Example (LeNet C1): \(K=6,\, F=5,\, C_{\text{in}}=1 \Rightarrow
6\times(25\times1+1) = 6\times26 = \mathbf{156}\) params
5.7 Full CNN Architecture Pattern
A typical CNN repeats [Conv + ReLU → Pool] blocks, then flattens
into fully-connected layers for classification:
INPUT
H×W×C
→
CONV
+ReLU
→
POOL
Max/Avg
→
CONV
+ReLU
→
POOL
Max/Avg
→
FLATTEN
1D vector
→
FC
Dense
→
OUTPUT
Softmax
Feature Hierarchy
Conv layer 1 → edges/colors. Conv layer 2 → corners/textures. Deeper layers → shapes, object parts, objects. Pooling reduces spatial size. FC layers do the final classification based on extracted features.
5.8 LeNet-5 (LeCun et al., 1998)
The first practical CNN for handwritten digit recognition (MNIST).
Input: 32×32 grayscale (MNIST 28×28 padded to
32×32). End-to-end training with backpropagation.
C3 uses 5×5 kernel on 6-channel S2
output: \(16\times(5\times5\times6+1)=16\times151=2{,}416\). Tiny by
modern standards (ResNet-50 ≈25M params), but proved CNN
viability for image tasks for the first time.
5.9 Notable CNN Architectures
Architecture
Year
Key Innovation
LeNet-5
1998
First practical
CNN. Local connectivity + weight sharing.
AlexNet
2012
8 layers, ReLU
activation (novel), dropout, GPU training on ImageNet. Marked the
“Rise of CNN.”
VGG16
2015
16 layers, only
3×3 filters. Full model: ∼138M params. Input:
224×224×3.
GoogLeNet
2014
Inception
modules: parallel filter paths of different sizes in one
layer.
Transformer
self-attention applied to image patches — no convolution
needed.
5.10 CNN vs FFNN — Complete Comparison
Aspect
FFNN (MLP)
CNN
Input handling
Flattened 1D vector — spatial
structure lost
Grid-structured — spatial info
preserved
Connectivity
Every neuron → every input (fully
connected)
Each neuron → local receptive field
only
Weight sharing
No — unique weight per
connection
Yes — same kernel at every
position
Spatial feature detection
No
Yes —
kernels learn edges, textures, shapes
Parameter count
Very high for image
inputs
Much lower due to weight sharing
Overfitting tendency
Higher for image
tasks
Lower for image tasks
Feature engineering
Manual
Automatic (learned
filters)
Translation invariance
No
Yes (shared filters
+ pooling)
Best for
Tabular data, fixed-size
vectors
Images, audio spectrograms, grid-structured
data
Two Key Properties of CNN
(1) Translation equivariance: if the input shifts, the feature map shifts by the same amount — a feature is detected regardless of where it appears. (2) Hierarchical feature learning: layer 1 learns edges → layer 2 learns textures → deeper layers learn shapes → object parts → full objects.
5.11 Key CNN Terms
Kernel / Filter
(K)
Learnable weight matrix. One kernel
→ one feature map. \(K\) kernels → \(K\) output
channels.
Feature Map
Output of applying one kernel to the full input.
High values = pattern detected at that location.
Receptive
Field
Local region of input one filter
examines. Size = \(F\times F\). Deeper layers have larger
effective receptive fields.
Stride (S)
Pixels the kernel moves per step. \(S=1\) →
overlapping. \(S=2\) → roughly halves output
size.
Padding (P)
“Valid” (P=0): output shrinks.
“Same” (\(P=(F-1)/2\) with S=1): output same size as
input.
Parameter
Sharing
Same kernel weights applied
everywhere → dramatic parameter reduction. FFNN would need
unique weights per connection.
Sparse
Connectivity
Each output unit connects
to only \(F\times F\times C_{\text{in}}\) inputs (the receptive
field), not all inputs.
Translation
Equivariance
If input shifts, feature
map shifts by same amount. Kernels detect features at any position
using same weights.
Translation
Invariance
After pooling, small shifts
in input do not change the output. Pooling provides this
property.
Flatten
Reshape 3D volume \((H\times W\times C)\) after
conv/pool into a 1D vector to feed into FC layers.
translation
invariance small shifts do not change output much
feature hierarchy
edges to objects across layers
depth (channels)
number of feature maps
V = (W−F+2P)/S +
1 output dimension formula
Nparams =
K×(F²×Cin+1) CNN parameter
formula
LeNet-5 first
practical CNN
AlexNet deep CNN
breakthrough
VGG16 deep stack
of 3×3 filters
ResNet / skip
connections residual shortcut paths
locally connected
no weight sharing
Notes based on: Goodfellow et al. (2016) Deep Learning · LeCun
et al. (1989, 1998) · Mitchell (1997) · Raschka et al.
(2022)
IF3270 Pembelajaran Mesin · Sem 2-2025/2026 · Tim
Pengajar IF3270
Topics: Perceptron · FFNN · CNN |
Excluded: Ensemble · CNN Backpropagation