1.1 Biological Motivation
McCulloch and Pitts (1943) described a neuron as a simple logic gate
with binary outputs: multiple signals arrive at the
dendrites, are integrated in the cell
body, and if the accumulated signal exceeds a threshold, an
output signal is generated along the axon.
Dendrites →
x
Receive input signals. Analogous to
input features \(x_1,x_2,\ldots,x_m\).
Synapse strength →
w
Connection strength. Analogous to
weights \(w_i\) — the importance / contribution of each
input.
Cell body →
Σ
Integrates all incoming
signals. Analogous to the weighted sum
\(\mathbf{w}^T\mathbf{x}\).
Axon firing →
f(net)
Fires (outputs signal) only when
the threshold is exceeded. Analogous to the activation
function.
1.2 Formal Definition of a Neuron
Given real-valued input \(\mathbf{x}=(x_1,\ldots,x_m)\), weight
vector \(\mathbf{w}=(w_1,\ldots,w_m)\), and bias \(w_0\) (also written
\(b\) or \(\theta\)), the neuron computes \(o = f(w_0 +
\mathbf{w}^T\mathbf{x})\). The hypothesis space is \(H =
\{\mathbf{w}\mid\mathbf{w}\in\mathbb{R}^{m+1}\}\): the \(m\) input
weights plus the bias.
1.3 m-of-n Functions
A perceptron easily represents m-of-n functions:
functions where at least \(m\) of the \(n\) inputs must be true.
Strategy: set all input weights to the same value (e.g., 0.5) then set
the threshold \(w_0\).
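The strategy can be checked on a small case. A minimal sketch, assuming a hypothetical 2-of-3 function: all input weights are 0.5, and the bias −0.75 places the threshold between one true input (sum 0.5) and two true inputs (sum 1.0):

```python
# Perceptron for a 2-of-3 function: fire iff at least 2 of 3 binary inputs are 1.
# All input weights equal 0.5; bias -0.75 puts the cutoff between 0.5 and 1.0.
from itertools import product

w = [0.5, 0.5, 0.5]
w0 = -0.75  # bias (equivalently, threshold theta = 0.75)

def predicts(x):
    net = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net >= 0 else 0

# Verify against the m-of-n definition on all 8 input combinations.
for x in product([0, 1], repeat=3):
    assert predicts(x) == (1 if sum(x) >= 2 else 0)
```

The same recipe works for any m-of-n function: keep the weights equal and slide only the bias.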
The XOR Problem (Minsky & Papert,
1969): A single perceptron can only solve linearly
separable problems. XOR cannot be separated by any hyperplane
— no single perceptron can learn it. This drove the development
of multi-layer networks.
1.4 Perceptron Training (Learning) Rule
Uses the thresholded output for the update signal.
Proven to converge if training data is linearly separable and learning
rate \(\eta\) is sufficiently small.
Input: Training data \(D =
\{\langle\mathbf{x}_i, y_i\rangle\}\), learning rate
\(\eta\), activation: sign or step.
Init: Set each \(w_k \leftarrow 0\) or a small random value.
Repeat: Set \(E \leftarrow 0\); for each
example \(\langle\mathbf{x}_i, y_i\rangle\):
3a. Compute prediction: \(o_i \leftarrow
\text{sign}(\mathbf{w}^T\mathbf{x}_i)\)
3b. If \(o_i \neq y_i\): \(w_j \leftarrow w_j + \Delta w_j\)
where \(\Delta w_j = \eta(y_i - o_i)x_{ij}\)
3c. Update error: \(E \leftarrow E + |y_i - o_i|\)
Until: All examples correctly classified (\(E=0\)), or max epochs
reached. Return \(\mathbf{w}\).
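The loop above can be sketched in plain Python. This is a minimal sketch, not the notes' reference code; the AND dataset, learning rate, and epoch cap are illustrative:

```python
# Perceptron training rule: update only on misclassified examples,
# using the thresholded output o = sign(w.x). xb[0] = +1 carries the bias.
def sign(z):
    return 1 if z >= 0 else -1

def train_perceptron(data, eta=0.1, max_epochs=100):
    m = len(data[0][0])
    w = [0.0] * (m + 1)                      # w[0] is the bias weight w0
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            xb = [1.0] + list(x)             # prepend bias input x0 = +1
            o = sign(sum(wi * xi for wi, xi in zip(w, xb)))
            if o != y:                       # update only on a mistake
                w = [wi + eta * (y - o) * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:                      # E = 0: all examples correct
            break
    return w

# Linearly separable data: logical AND with +/-1 labels.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(data)
```

Because AND is linearly separable, the convergence theorem guarantees this loop terminates with every example classified correctly.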
Convergence Theorem: The
Perceptron Training Rule is guaranteed to converge in a finite number
of steps if the training data is linearly separable and \(\eta\) is
sufficiently small. It does NOT converge if data is not linearly
separable.
1.5 Batch Gradient Descent — The Delta Rule
When data is NOT linearly separable, the Perceptron rule fails. The
Delta Rule (= Widrow-Hoff rule = LMS rule = Adaline
rule) uses the unthresholded linear output and minimizes SSE
— always converges to best-fit approximation.
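Concretely, the Delta Rule does gradient descent on the sum of squared errors of the linear output \(o_d = \mathbf{w}^T\mathbf{x}_d\):
\[ E(\mathbf{w}) = \tfrac{1}{2}\sum_{d\in D}(y_d - o_d)^2,
\qquad
\frac{\partial E}{\partial w_k} = -\sum_{d\in D}(y_d - o_d)\,x_{dk},
\qquad
\Delta w_k = -\eta\,\frac{\partial E}{\partial w_k}
= \eta\sum_{d\in D}(y_d - o_d)\,x_{dk}. \]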
1.6 Stochastic & Mini-Batch Gradient Descent
Stochastic GD (SGD) updates weights after each individual example
— an incremental update.
Batch GD
\[ \Delta w_k = \eta\sum_{d \in D}(y_d - o_d)(x_{dk}) \]
One
weight update per epoch. Stable, but slow for large datasets. Can
get stuck in local minima.
Stochastic (Incremental) GD
\[ \Delta w_k = \eta\,(y_d - o_d)(x_{dk}) \]
One
update per example. Faster, noisier gradient — can escape
local minima. Both Batch and SGD use the linear function and give
best-fit approximation.
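The two update schedules can be sketched side by side on a linear unit. A minimal sketch, assuming a hypothetical noiseless dataset \(y = 2x\); only the SGD variant is run here:

```python
# Delta rule on a linear unit o = w.x (unthresholded), pure Python.
# Batch GD sums the gradient over all examples before one update;
# stochastic GD updates after every single example.
def predict(w, xb):
    return sum(wi * xi for wi, xi in zip(w, xb))

def epoch_batch(w, data, eta):
    grad = [0.0] * len(w)
    for x, y in data:
        xb = [1.0] + list(x)
        err = y - predict(w, xb)
        grad = [g + err * xi for g, xi in zip(grad, xb)]
    return [wi + eta * g for wi, g in zip(w, grad)]       # one update per epoch

def epoch_sgd(w, data, eta):
    for x, y in data:
        xb = [1.0] + list(x)
        err = y - predict(w, xb)
        w = [wi + eta * err * xi for wi, xi in zip(w, xb)]  # update per example
    return w

data = [((x,), 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]  # noiseless line y = 2x
w = [0.0, 0.0]
for _ in range(200):
    w = epoch_sgd(w, data, eta=0.05)
# w should approach [0, 2] (bias ~0, slope ~2): the best-fit solution.
```

Swapping `epoch_sgd` for `epoch_batch` gives the batch variant with one (larger) step per epoch.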
Epoch
One full pass through all training data (or all
mini-batches). Features and labels processed once.
Batch Size
Number of samples per mini-batch split. If = 1
→ true SGD. If = |D| → Batch GD. If 1 < b < |D|
→ Mini-batch SGD.
Step
Processing one mini-batch, ending with one weight
update. Steps per epoch = |D| / batch_size.
Key Distinction: Perceptron
Training Rule uses thresholded output (sign/step) for updates
— only converges if linearly separable. Delta Rule / Batch GD
uses unthresholded linear output — converges to
best-fit regardless of separability.
2.1 From Adaline to FFNN
The Adaline (ADAptive LInear NEuron) is a
single-layer neural network and the direct bridge from the Perceptron
to FFNN. It uses the identity activation for weight updates,
and applies a threshold only at inference time.
2.2 FFNN Architecture
Input Layer
(l=0)
Receives feature vector
\(\mathbf{x}\). No computation — values passed forward.
Width = number of features. Convention: \(x_0 = +1\)
(bias).
Hidden Layers
Learn internal representations. One or more
layers. Written \(H_1, H_2, \ldots, H_h\) in compact style.
Non-linear activation required.
Output Layer
Produces \(\hat{\mathbf{y}}\). Width = number of
classes (classification) or 1 (regression). Appropriate activation
(sigmoid/softmax/linear).
Depth
Number of layers (hidden + output). Deeper →
learns more abstract, hierarchical features.
Width
Neurons per layer. Wider → more capacity but
more parameters. Notated as \(n_{h_1}, n_{h_2},
\ldots\)
Fully
Connected
Every neuron in layer \(l\)
connects to every neuron in layer \(l+1\) with a unique weight
\(w_{ji}\).
2.3 Forward Propagation
Information flows from input through each layer to produce output
\(\hat{\mathbf{y}}\). Two equivalent notations are used in class:
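One forward pass can be sketched in the explicit style. A minimal sketch, assuming a hypothetical 2–2–1 network with sigmoid activations and untrained, illustrative weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(W, b, x, f):
    # One layer: pre-activation a = Wx + b, post-activation h = f(a).
    return [f(sum(wij * xi for wij, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def forward(x, W1, b1, W2, b2):
    h = layer_forward(W1, b1, x, sigmoid)     # hidden layer
    return layer_forward(W2, b2, h, sigmoid)  # output layer

# 2 inputs -> 2 hidden -> 1 output (weights illustrative, not trained)
W1, b1 = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]
W2, b2 = [[1.0, 1.0]], [-1.0]
y_hat = forward([1.0, 0.0], W1, b1, W2, b2)
```

Stacking more `layer_forward` calls gives deeper networks; only the weights' shapes change.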
2.4 Number of Parameters
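For fully connected layers, the standard counting rule is: a layer with \(n_{l-1}\) inputs and \(n_l\) neurons has \((n_{l-1}+1)\,n_l\) parameters, where the \(+1\) accounts for the bias. Over the whole network:
\[ N_{\text{params}} = \sum_{l=1}^{L} (n_{l-1}+1)\,n_l. \]
For example, a 3–4–2 network (3 inputs, 4 hidden, 2 outputs) has \((3+1)\times4 + (4+1)\times2 = 26\) parameters.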
2.5 Worked Example: XOR with Sigmoid
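A sketch with hand-picked large weights (not trained): hidden unit h1 approximates OR, h2 approximates AND, and the output computes h1 AND NOT h2, which is XOR. The specific weight values are illustrative; any sufficiently large magnitudes give the same rounded outputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, b, x):
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def xor_net(x1, x2):
    h1 = neuron([20, 20], -10, [x1, x2])     # ~ OR(x1, x2)
    h2 = neuron([20, 20], -30, [x1, x2])     # ~ AND(x1, x2)
    return neuron([20, -20], -10, [h1, h2])  # ~ h1 AND NOT h2 = XOR

# The rounded sigmoid outputs reproduce the XOR truth table.
for x1 in (0, 1):
    for x2 in (0, 1):
        assert round(xor_net(x1, x2)) == (x1 ^ x2)
```

This is exactly the point of the XOR problem: one hidden layer of non-linear units makes the non-separable function representable.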
3.1 Why Non-linearity?
Universal Approximation
Theorem: A single hidden layer with sufficient neurons can
approximate any continuous function to arbitrary accuracy. However,
deep networks require exponentially fewer neurons than
shallow networks for the same approximation quality, generalize
better, and empirically outperform shallow networks. Test accuracy
improves as depth increases from 3 to 11 layers; CNNs consistently
outperform fully-connected nets with the same parameter count.
3.2 Common Activation Functions
Step / Sign
\(\displaystyle
f(z)=\begin{cases}1\ (\text{or } {+}1) & z\ge0\\ 0\ (\text{or } {-}1) & \text{otherwise}\end{cases}\)
Range: {0, 1} or {±1}
Perceptron. Non-differentiable → cannot
use in backprop.
Sigmoid (σ)
\(\displaystyle\sigma(z)=\frac{1}{1+e^{-z}}\)
Range: (0, 1)
Output as probability. Vanishing gradient for
large |z|. Used in Adaline/output layer for binary
classification.
Tanh
\(\displaystyle\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}\)
Range: (−1, 1)
Zero-centered. Stronger gradients than
sigmoid, but still vanishes at extremes. Used in LeNet-5.
ReLU
\(\displaystyle f(z)=\max(0,z)\)
Range: [0, ∞)
Most popular for hidden layers. No vanishing
gradient for \(z>0\). Suffers from dying ReLU (\(z\le0\) →
always 0 gradient).
Leaky ReLU
\(\displaystyle f(z)=\max(\alpha z,\,z),\;\alpha\ll1\)
Range: (−∞, ∞)
Fixes dying ReLU. Small slope
\(\alpha\approx0.01\) for negative inputs. Gradient always
non-zero.
Softmax
\(\displaystyle\sigma(\mathbf{z})_k=\frac{e^{z_k}}{\sum_j e^{z_j}}\)
Range: (0,1), sum=1
Output layer for multi-class. Converts logits
to probability distribution over C classes.
Key Derivatives (required for Backpropagation)
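For the common activations:
\[ \sigma'(z) = \sigma(z)\,(1-\sigma(z)), \qquad
\tanh'(z) = 1-\tanh^2(z), \qquad
\frac{d}{dz}\max(0,z) = \begin{cases}1 & z>0\\ 0 & z<0\end{cases} \]
(the ReLU derivative is undefined at \(z=0\); in practice it is taken as 0 or 1). The factor \(o(1-o)\) in the backpropagation deltas is \(\sigma'\) written in terms of the unit's output.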
3.3 Loss Functions
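Standard choices, with \(\hat y\) the prediction and \(y\) the target (sums are sometimes averaged over the \(n\) examples):
- MSE (regression): \( E = \frac{1}{n}\sum_i (y_i-\hat y_i)^2 \); SSE uses \(\frac12\sum_i(y_i-\hat y_i)^2\).
- Binary cross-entropy (sigmoid output): \( E = -\sum_i \left[\, y_i\log\hat y_i + (1-y_i)\log(1-\hat y_i) \,\right] \).
- Categorical cross-entropy (softmax output, one-hot \(y\)): \( E = -\sum_i\sum_k y_{ik}\log\hat y_{ik} \).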
4.1 Three-Step Learning Procedure
1.Forward
Propagation: Compute output \(\hat{y}\) for input
\(\mathbf{x}\). Cache all pre-activations \(\mathbf{a}^{(k)}\) and
post-activations \(\mathbf{h}^{(k)}\).
2.Backward
Propagation: Compute error term \(\delta\) for each
unit. The target for hidden units is NOT directly available
— it must be inferred from output deltas via the chain
rule.
3.Weight Update:
Update every weight using the computed gradients and learning rate
\(\eta\).
4.2 Gradient for Output Unit Weights (with Sigmoid)
Using the chain rule to compute \(\partial E_d / \partial w_{ji}\)
for a weight connecting unit \(i\) to output unit \(j\):
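With sigmoid output \(o_j=\sigma(\text{net}_j)\) and \(E_d=\tfrac12\sum_k(t_k-o_k)^2\), the chain rule gives:
\[ \frac{\partial E_d}{\partial w_{ji}}
= \frac{\partial E_d}{\partial o_j}\,
  \frac{\partial o_j}{\partial \text{net}_j}\,
  \frac{\partial \text{net}_j}{\partial w_{ji}}
= -(t_j-o_j)\;o_j(1-o_j)\;x_{ji}, \]
so defining \(\delta_j = o_j(1-o_j)(t_j-o_j)\) yields \(\partial E_d/\partial w_{ji} = -\delta_j\,x_{ji}\).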
4.3 Gradient for Hidden Unit Weights
For hidden unit \(h\), no direct target is available. We
back-propagate error from all output units \(k\) that receive from
\(h\):
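By the chain rule, the error reaching \(h\) is the weighted sum of the downstream deltas:
\[ \delta_h = o_h(1-o_h)\sum_{k\,\in\,\text{outputs}} w_{kh}\,\delta_k, \]
where \(w_{kh}\) is the weight from hidden unit \(h\) to output unit \(k\).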
4.4 Weight Update (same for all layers)
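\[ w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji}, \]
where \(x_{ji}\) is the input flowing from unit \(i\) into unit \(j\), and \(\delta_j\) is the delta of unit \(j\) (output or hidden).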
4.5 Full Backpropagation Algorithm (Mitchell, 1997)
Init: Create a feed-forward network with \(n_{\text{in}}\) inputs,
\(n_{\text{hidden}}\) hidden units, \(n_{\text{out}}\) output
units. Initialize all weights to small random values (e.g.,
between −0.05 and 0.05).
Repeat until termination (fixed iterations / training error < threshold /
validation criterion met), for each \(\langle\mathbf{x},
\mathbf{t}\rangle\) in training_examples:
1. Forward pass: Input \(\mathbf{x}\) to the network; compute the
output \(o_u\) of every unit \(u\) in the network.
2. Output deltas: For each output unit \(k\):
\[\delta_k \leftarrow o_k(1-o_k)(t_k - o_k)\]
3. Hidden deltas: For each hidden unit \(h\):
\[\delta_h \leftarrow o_h(1-o_h)\sum_{k\,\in\,\text{outputs}}w_{kh}\,\delta_k\]
4. Update weights: For each weight \(w_{ji}\):
\[w_{ji} \leftarrow w_{ji} + \Delta w_{ji},\qquad \Delta w_{ji} =
\eta\,\delta_j\,x_{ji}\]
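Steps 1–3 can be sketched for a tiny network. A minimal sketch, assuming a hypothetical 2–2–1 sigmoid net and a single training example; the analytic gradient \(-\delta_j x_{ji}\) is checked against a finite-difference estimate:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# 2-2-1 sigmoid network; each row's w[0] is the bias, paired with input +1.
def forward(W1, W2, x):
    xb = [1.0] + x
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in W1]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in W2]
    return hb, o

def deltas(W2, hb, o, t):
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    d_hid = [hb[j + 1] * (1 - hb[j + 1]) *
             sum(W2[k][j + 1] * d_out[k] for k in range(len(d_out)))
             for j in range(len(hb) - 1)]
    return d_out, d_hid

def error(W1, W2, x, t):
    _, o = forward(W1, W2, x)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

random.seed(0)
W1 = [[random.uniform(-0.05, 0.05) for _ in range(3)] for _ in range(2)]
W2 = [[random.uniform(-0.05, 0.05) for _ in range(3)]]
x, t = [1.0, 0.0], [1.0]

hb, o = forward(W1, W2, x)
d_out, d_hid = deltas(W2, hb, o, t)

# Sanity check: dE/dw_{ji} = -delta_j * x_{ji} matches finite differences
# on one sample weight (W2[0][1], the weight from hidden unit 1).
eps = 1e-6
e0 = error(W1, W2, x, t)
W2[0][1] += eps
e1 = error(W1, W2, x, t)
W2[0][1] -= eps
assert abs((e1 - e0) / eps - (-d_out[0] * hb[1])) < 1e-4
```

Step 4 would then add \(\eta\,\delta_j\,x_{ji}\) to every weight, exactly as in the update rule above.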
Computational Efficiency:
One backward pass computes gradients for ALL parameters
simultaneously, reusing cached forward-pass values. This makes deep
network training feasible.
4.6 Common Training Problems
Vanishing
Gradient
Gradients shrink exponentially
through deep layers. Sigmoid/tanh worsen this near saturation.
Fix: ReLU, residual connections, batch normalization.
Exploding
Gradient
Gradients grow exponentially
→ unstable (NaN). Fix: gradient clipping, careful
initialization.
Local Minima
Batch GD can get stuck. SGD noise helps escape.
No guarantee of global minimum if multiple local minima
exist.
Learning Rate
η
Too small → converges very
slowly. Too large → overshoots and diverges. Use schedules or
adaptive optimizers (Adam).
Overfitting
Network memorizes training data. Fix: dropout,
L1/L2 regularization \(\lambda\Omega(\theta)\), early stopping,
more data.
Dying ReLU
Neurons always output 0 → zero gradient
→ never recover. Fix: Leaky ReLU (\(\alpha=0.01\) slope), He
initialization.
5.1 Why CNN? — The Problem with FFNN on Images
Dimensionality Explosion: A
\(320\times280\) RGB image has \(320\times280\times3 = 268{,}800\)
input values. A single FFNN input→output layer on that image requires
\(>8\) billion multiplications. A CNN with a \(1\times2\) kernel needs
only \(319\times280\times3 = 267{,}960\) operations — roughly
60,000× more efficient.
Three core problems with FFNN on images:
- Dimensionality explosion. Flattening a 2D/3D
image to 1D and connecting every pixel to every neuron creates an
impractical number of parameters.
- Spatial information destroyed. Flattening kills
neighborhood relationships. Pixels and their neighbors carry joint
meaning (edges, textures); once flattened they are independent
numbers.
- Parameter explosion → overfitting. More
weights = more memorization, not better generalization.
5.2 What CNN Is
Input
A multidimensional array (e.g., RGB image of size
\(H\times W\) stored as \(H\times W\times 3\) array of pixel
intensities 0–255).
Kernel /
Filter
Small learnable weight matrix
(e.g., 3×3 or 5×5) that slides over the input. Also
called: weights, parameters, filter.
Feature Map
Output produced when a kernel is convolved with
the input. High values = pattern detected at that
location.
Convolution
Sliding a kernel over the input, computing dot
products at each position to produce one value per location in the
feature map.
5.3 Two Core Innovations
Local (Sparse) Connectivity
Each
output neuron connects only to a local region of
the input — the receptive field. A
3×3 kernel: each output connects to only 9 inputs, not
thousands. Biologically inspired: Hubel & Wiesel (1959) found
visual cortex neurons respond to small local visual regions.
Parameter Sharing
The
same kernel weights are used at every spatial
position. One kernel is learned and applied everywhere —
instead of unique weights per position. Gives translation
equivariance: the same feature anywhere uses the same
detector.
LeCun 1989 Parameter Comparison (Net-3)
| Configuration | Weights |
| Fully connected (FFNN) | >3,200 |
| Locally connected (no weight sharing) | 1,226 |
| Locally connected + weight sharing (CNN) | 206 |
5.4 The Convolution Operation
1D Convolution (Conceptual)
Slide a kernel across a 1D input array, computing element-wise
products and summing at each position. Parameter sharing: same kernel
values reused at every position.
2D Convolution (Grayscale Images)
The kernel slides across both rows and columns. At each position,
multiply overlapping values element-wise and sum to produce one output
value in the feature map.
3D Convolution (RGB / Multi-channel)
For RGB images, the kernel has depth = number of input
channels (\(C_{\text{in}}\)). A kernel of size \(F\times
F\times C_{\text{in}}\) produces one feature map. Using \(K\)
different kernels produces \(K\) feature maps (output depth =
\(K\)).
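The sliding dot product can be sketched directly. A minimal sketch of "valid" 2D convolution (strictly cross-correlation, as CNNs implement it), with an illustrative vertical-edge kernel on a hypothetical 4×4 image:

```python
# Naive "valid" 2D convolution: slide the kernel, multiply element-wise,
# sum -> one value per position in the feature map.
def conv2d_valid(X, K):
    H, W = len(X), len(X[0])
    F = len(K)
    out = [[0.0] * (W - F + 1) for _ in range(H - F + 1)]
    for i in range(H - F + 1):
        for j in range(W - F + 1):
            out[i][j] = sum(K[a][b] * X[i + a][j + b]
                            for a in range(F) for b in range(F))
    return out

# A vertical-edge kernel on a 4x4 image split into dark|bright halves.
X = [[0, 0, 9, 9]] * 4
K = [[-1, 1], [-1, 1]]     # responds where intensity rises left -> right
fmap = conv2d_valid(X, K)  # 3x3 feature map: (4-2+0)/1+1 = 3
```

The high responses sit exactly on the dark-to-bright boundary, which is what "high values = pattern detected at that location" means for a feature map.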
Padding Types
- Valid (P=0): No padding → output shrinks.
Each conv layer reduces spatial dimensions.
- Same (P=(F−1)/2 with S=1): Zero-pad to keep
output same size as input.
Worked Examples
| Input Shape | Filters | P | S | V Calculation | Output Volume |
| 3×3×3 | 1, size 2×2×3 | 0 | 1 | \((3-2+0)/1+1=2\) | 2×2×1 |
| 3×3×2 | 3, size 2×2×2 | 0 | 1 | \((3-2+0)/1+1=2\) | 2×2×3 |
| 32×32×3 | 10, size 5×5×3 | 0 | 1 | \((32-5+0)/1+1=28\) | 28×28×10 |
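The table's V column follows one formula, \(V = (W-F+2P)/S+1\), which is easy to encode and check (a minimal helper, with the divisibility requirement made explicit):

```python
# Output spatial dimension of a conv layer: V = (W - F + 2P) / S + 1.
def conv_output(W, F, P=0, S=1):
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

assert conv_output(3, 2) == 2               # 3x3 input, 2x2 kernel -> 2x2
assert conv_output(32, 5) == 28             # 32x32 input, 5x5 kernel -> 28x28
assert conv_output(28, 5, P=2, S=1) == 28   # "same" padding: P = (F-1)/2
```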
5.5 The Three Stages of a CNN Layer
Stage 1
Convolution (Affine Transform): Kernel slides
across input computing linear combinations. Produces raw feature
map — still linear.
\[\text{net}_c = \mathbf{X}\star K\quad\text{(where }\star\text{
denotes convolution/cross-correlation)}\]
Stage 2
Detector (Non-linearity / Activation):
Non-linear activation applied element-wise. Without this, stacking
conv layers collapses to one linear transform. Most common: ReLU.
\[H = f_c(\text{net}_c) = \max(0,\,\text{net}_c)\]
Example: \([-9, 32, 14, -6]\xrightarrow{\text{ReLU}}[0, 32, 14,
0]\)
Stage 3
Pooling (Downsampling): Reduces spatial
dimensions. Benefits: fewer parameters downstream, translation
invariance (small input shifts don’t change output much),
controls overfitting. Most common: Max Pooling.
Pooling Types
| Type | Operation | Notes |
| Max Pooling | Maximum value in each window | Most common. A 2×2 pool with S=2 halves the spatial dimensions. Preserves dominant features. |
| Average Pooling | Mean value in each window | Smoother downsampling. Used in LeNet-5. |
| L2-norm Pooling | \(\sqrt{\sum x_i^2}\) in each window | Less common. Emphasizes larger activations. |
5.6 Parameter Counting in CNN Layers
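A conv layer with \(K\) filters of size \(F\times F\times C_{\text{in}}\) has \(N_{\text{params}} = K\times(F^2\times C_{\text{in}}+1)\) parameters; the \(+1\) is each filter's bias, and pooling layers contribute none. A small check against the LeNet-5 counts in section 5.8:

```python
# Conv-layer parameters: K filters, each with F*F*C_in weights plus one bias.
def conv_params(K, F, C_in):
    return K * (F * F * C_in + 1)

assert conv_params(6, 5, 1) == 156     # LeNet-5 C1: 6 filters 5x5 on 1 channel
assert conv_params(16, 5, 6) == 2416   # LeNet-5 C3: 16 filters 5x5 on 6 channels
```

Note that the count is independent of the input's spatial size, which is exactly the payoff of parameter sharing.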
5.7 Full CNN Architecture Pattern
A typical CNN repeats [Conv + ReLU → Pool] blocks, then flattens
into fully-connected layers for classification:
Feature Hierarchy: Conv layer
1 → edges/colors. Conv layer 2 → corners/textures. Deeper
layers → shapes, object parts, objects. Pooling reduces spatial
size. FC layers do the final classification based on extracted
features.
5.8 LeNet-5 (LeCun et al., 1998)
The first practical CNN for handwritten digit recognition (MNIST).
Input: 32×32 grayscale (MNIST 28×28 padded to
32×32). End-to-end training with backpropagation.
| Layer | Type | Output Shape | Details | Parameters |
| Input | Image | 32×32×1 | Grayscale, padded from 28×28 | — |
| C1 | Conv | 28×28×6 | 6 filters, 5×5, S=1, P=0, tanh · \(V=(32-5)/1+1=28\) | \(6\times(5\times5\times1+1)=\mathbf{156}\) |
| S2 | Avg Pool | 14×14×6 | 2×2 window, S=2 | 0 (no learnable params) |
| C3 | Conv | 10×10×16 | 16 filters, 5×5, S=1, P=0, tanh · \(V=(14-5)/1+1=10\) | \(16\times(5\times5\times6+1)=\mathbf{2{,}416}\) |
| S4 | Avg Pool | 5×5×16 | 2×2 window, S=2 | 0 |
| Flatten | — | 400 | \(5\times5\times16=400\) | — |
| FC C5 | FC | 120 | Fully connected, tanh | \((400+1)\times120=\mathbf{48{,}120}\) |
| FC F6 | FC | 84 | Fully connected, tanh | \((120+1)\times84=\mathbf{10{,}164}\) |
| Output | FC | 10 | 10 classes (digits 0–9), Softmax | \((84+1)\times10=\mathbf{850}\) |
| Total Parameters | 61,706 |
5.9 Notable CNN Architectures
| Architecture | Year | Key Innovation |
| LeNet-5 | 1998 | First practical CNN. Local connectivity + weight sharing. |
| AlexNet | 2012 | 8 layers, ReLU activation (novel), dropout, GPU training on ImageNet. Marked the “Rise of CNN.” |
| VGG16 | 2015 | 16 layers, only 3×3 filters. Full model: ∼138M params. Input: 224×224×3. |
| GoogLeNet | 2014 | Inception modules: parallel filter paths of different sizes in one layer. |
| ResNet | 2015 | Skip (residual) connections: gradients flow directly through layers, enabling 50/101/152-layer nets. |
| Vision Transformer | 2020 | Transformer self-attention applied to image patches — no convolution needed. |
5.10 CNN vs FFNN — Complete Comparison
| Aspect | FFNN (MLP) | CNN |
| Input handling | Flattened 1D vector — spatial structure lost | Grid-structured — spatial info preserved |
| Connectivity | Every neuron → every input (fully connected) | Each neuron → local receptive field only |
| Weight sharing | No — unique weight per connection | Yes — same kernel at every position |
| Spatial feature detection | No | Yes — kernels learn edges, textures, shapes |
| Parameter count | Very high for image inputs | Much lower due to weight sharing |
| Overfitting tendency | Higher for image tasks | Lower for image tasks |
| Feature engineering | Manual | Automatic (learned filters) |
| Translation invariance | No | Yes (shared filters + pooling) |
| Best for | Tabular data, fixed-size vectors | Images, audio spectrograms, grid-structured data |
Two Key Properties of CNN:
(1) Translation equivariance: if the input shifts,
the feature map shifts by the same amount — a feature is
detected regardless of where it appears. (2) Hierarchical
feature learning: layer 1 learns edges → layer 2
learns textures → deeper layers learn shapes → object parts
→ full objects.
5.11 Key CNN Terms
Kernel / Filter
(K)
Learnable weight matrix. One kernel
→ one feature map. \(K\) kernels → \(K\) output
channels.
Feature Map
Output of applying one kernel to the full input.
High values = pattern detected at that location.
Receptive
Field
Local region of input one filter
examines. Size = \(F\times F\). Deeper layers have larger
effective receptive fields.
Stride (S)
Pixels the kernel moves per step. \(S=1\) →
overlapping. \(S=2\) → roughly halves output
size.
Padding (P)
“Valid” (P=0): output shrinks.
“Same” (\(P=(F-1)/2\) with S=1): output same size as
input.
Parameter
Sharing
Same kernel weights applied
everywhere → dramatic parameter reduction. FFNN would need
unique weights per connection.
Sparse
Connectivity
Each output unit connects
to only \(F\times F\times C_{\text{in}}\) inputs (the receptive
field), not all inputs.
Translation
Equivariance
If input shifts, feature
map shifts by same amount. Kernels detect features at any position
using same weights.
Translation
Invariance
After pooling, small shifts
in input do not change the output. Pooling provides this
property.
Flatten
Reshape 3D volume \((H\times W\times C)\) after
conv/pool into a 1D vector to feed into FC layers.
Feature
Hierarchy
Early layers: edges →
Middle: corners, textures → Deep: object parts, faces,
objects.
Depth
(channels)
Number of feature maps in a
layer = number of filters \(K\). RGB image has depth 3 (input
channels).
Math & Notation
x, X input vector
/ mini-batch
w, W weights
b, w₀, c
bias
z / net / a
pre-activation
o / h
post-activation
ŷ predicted
output
y / t true label /
target
E / J error /
total cost
η learning
rate
δ error
signal (delta)
λ
regularization rate
Ω(θ)
regularizer
σ
sigmoid
f(·)
activation function
K
kernel/filter
V output dim
F filter
size
P padding
S stride
Perceptron Terms
Perceptron
artificial neuron for linear classification
McCulloch-Pitts
early neuron model
threshold θ
cutoff for firing
sign function
outputs ±1
step function
outputs 0/1
net input z
weighted sum before activation
hyperplane decision
surface linear separator
linear
separability classes separable by a hyperplane
hypothesis space H
set of possible weights
perceptron training
rule update rule for thresholded output
convergence
theorem finite convergence for separable data
XOR problem not
linearly separable
m-of-n function
true if at least m inputs are true
epoch one full
pass over the data
batch size samples
per update group
step one
mini-batch update
delta rule LMS /
Widrow-Hoff update
LMS rule least
mean squares rule
Adaline adaptive
linear neuron
Widrow-Hoff
another name for delta rule
batch GD gradient
descent over all samples
SGD stochastic
gradient descent
mini-batch SGD
gradient descent on small batches
gradient
∇E(w) direction of steepest ascent
SSE sum of squared
errors
FFNN Terms
feedforward
information flows only forward
MLP multi-layer
perceptron
hidden layer
intermediate representation layer
depth number of
layers
width number of
neurons per layer
Wxh /
Why notation compact layer notation
compact / explicit
style two equivalent notations
forward
propagation compute outputs from inputs
backpropagation
reverse-mode gradient computation
chain rule
derivative composition rule
gradient descent
minimize loss by following negative gradient
vanishing gradient
gradients shrink through layers
exploding gradient
gradients grow too large
overfitting
memorizes training data
underfitting model
too simple
sigmoid squashing
activation
tanh zero-centered
squashing activation
ReLU rectified
linear unit
leaky ReLU ReLU
with small negative slope
softmax
probability distribution over classes
MSE mean squared
error
binary CE binary
cross-entropy
categorical CE
multi-class cross-entropy
output delta
δj error signal at output unit
hidden delta
δh error signal at hidden unit
universal approximation
theorem single hidden layer can approximate any continuous
function
one-hot encoding
vector with one active class entry
regularization
λΩ(θ) penalty term to reduce
overfitting
CNN Terms
convolution /
cross-correlation sliding local dot product
kernel / filter
learnable weight window
feature map
convolution output
stride S step size
of the kernel
padding P zero
border added to input
valid padding no
padding, output shrinks
same padding keeps
output size
receptive field
local input region a filter sees
parameter sharing
same kernel reused everywhere
sparse
connectivity each output connects to local inputs
max pooling take
maximum value per window
average pooling
take mean value per window
L2-norm pooling
use Euclidean norm per window
flatten reshape
volume into 1D vector
translation
equivariance shifted input shifts feature map
translation
invariance small shifts do not change output much
feature hierarchy
edges to objects across layers
depth (channels)
number of feature maps
V = (W−F+2P)/S +
1 output dimension formula
Nparams =
K×(F²×Cin+1) CNN parameter
formula
LeNet-5 first
practical CNN
AlexNet deep CNN
breakthrough
VGG16 deep stack
of 3×3 filters
ResNet / skip
connections residual shortcut paths
locally connected
no weight sharing
Notes based on: Goodfellow et al. (2016) Deep Learning · LeCun
et al. (1989, 1998) · Mitchell (1997) · Raschka et al.
(2022)
IF3270 Pembelajaran Mesin · Sem 2-2025/2026 · Tim
Pengajar IF3270
Topics: Perceptron · FFNN · CNN |
Excluded: Ensemble · CNN Backpropagation