IF3270 · Pembelajaran Mesin · Final Term

ML Last Term Notes

CNN BackpropRNNLSTMAttentionEncoder-DecoderTransformerSelf-AttentionBPTTReinforcement Learning
References: Goodfellow et al. (2016), Hochreiter & Schmidhuber (1997), Bahdanau et al. (2015), Vaswani et al. (2017), Sutton & Barto (2018), Lecture Decks 5–9, NNTikZ — Fraser Love (2024)

1. CNN Backpropagation

Extending gradient descent to update kernels and weights in a convolutional architecture. Gradients propagate through pooling and convolutional layers via the chain rule.

1.a Forward Pass Recap

LayerEquation / Description
Convolutionnetc=XKnet_c = X * K — linear operation between input XX and kernel KK
DetectorH=ReLU(netc)H = \text{ReLU}(net_c) — non-linearity zeroing negative activations
PoolingMax/Average pooling for dimensionality reduction and regularization
Fully ConnectedO=Softmax(HW+b)O = \text{Softmax}(H \cdot W + b) — final classification layer

1.b Backward Mechanism

Update FC Weights (W)

Ewji=Eojojnetjnetjwji\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial o_j} \cdot \frac{\partial o_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ji}}

ΔW=(targetoutput)Output(1Output)H\Delta W = -(\text{target} - \text{output}) \cdot \text{Output}(1-\text{Output}) \cdot H

Update Kernel (K)

EK=EnetjnetjHHnetcnetcK\frac{\partial E}{\partial K} = \frac{\partial E}{\partial net_j} \cdot \frac{\partial net_j}{\partial H} \cdot \frac{\partial H}{\partial net_c} \cdot \frac{\partial net_c}{\partial K}

Gradient propagates backward through Pooling → Activation (ReLU) → Kernel.

Max Pooling Backprop Gradient passes only to the max-value unit from the forward pass; all others receive zero gradient.

Parameter Count (Weight Sharing)

Params=(FilterW×FilterH×Channels+1)×NbFilters\text{Params} = (\text{FilterW} \times \text{FilterH} \times \text{Channels} + 1) \times \text{NbFilters}

Weight sharing makes CNNs far more parameter-efficient than FFNNs on the same input size.

CNN Architecture — Conv2D Blocks → GlobalAvgPool → Dense


2. Recurrent Neural Networks (RNN)

A class of neural networks for sequential data where the order of inputs is crucial. RNNs maintain a hidden state that carries information across timesteps.

2.a Motivation: IID Breakdown & Sequential Data

Standard supervised learning assumes data samples are Independent and Identically Distributed (i.i.d.). Sequential data violates this — in language, time-series, or sensor readings each observation depends on previous ones. A vanilla FFNN has no mechanism to share context across positions, and requires fixed-size input.

RNN Solution A recurrent connection feeds the hidden state h(t1)h^{(t-1)} back into the computation at step tt, encoding the sequence history into a running “memory” vector.

2.b Sequence Modeling Types

TypeDescriptionExample
One-to-OneStandard FFNN — single input → single output, no sequence
Many-to-OneSequence input → single outputSentiment analysis, time-series forecasting
One-to-ManySingle input → sequence outputImage captioning
Many-to-Many (Sync)Input & output same lengthPOS tagging, NER, frame labeling
Many-to-Many (Delayed)Encoder-Decoder: output starts after full input consumedMachine translation

2.c Architecture & Forward Equations

h(t)=fh ⁣(Wxhx(t)+Whhh(t1)+bh)h^{(t)} = f_h\!\left(W_{xh}\, x^{(t)} + W_{hh}\, h^{(t-1)} + b_h\right)

o(t)=fy ⁣(Whoh(t)+bho)o^{(t)} = f_y\!\left(W_{ho}\, h^{(t)} + b_{ho}\right)

Where:

  • WxhW_{xh}: input→hidden, WhhW_{hh}: hidden→hidden (recurrent), WhoW_{ho}: hidden→output
  • fhf_h: typically tanh\tanhfyf_y: softmax (classification) or linear (regression)

RNN Cell — Internal Structure

2.d Multi-Layer RNN & Parameter Sharing

RNN cells can be stacked: the hidden state of layer \ell becomes the input of layer +1\ell+1. In Keras, intermediate layers require return_sequences=True. The defining property is parameter sharing — the same Wxh,Whh,WhoW_{xh}, W_{hh}, W_{ho} are used at every timestep, enabling variable-length inputs and efficient parameter use.


3. LSTM & Vanishing Gradient

LSTM was designed to solve the vanishing gradient problem in standard RNNs by maintaining a separate cell state — a gradient highway regulated by learnable gates.

3.a Vanishing Gradient Problem

During BPTT, gradients are multiplied repeatedly by WhhW_{hh} and tanh()<1\tanh'(\cdot) < 1. For long sequences this product approaches zero exponentially — the network cannot learn long-range dependencies. The cell state in LSTM keeps this product near 1 when the forget gate is open.

3.b Gate Equations

All gates receive concatenated input [h(t1),x(t)][h^{(t-1)}, x^{(t)}]:

ft=σ ⁣(Wf[h(t1),x(t)]+bf)(Forget)f_t = \sigma\!\left(W_f\,[h^{(t-1)}, x^{(t)}] + b_f\right) \quad\text{(Forget)}

it=σ ⁣(Wi[h(t1),x(t)]+bi)(Input)i_t = \sigma\!\left(W_i\,[h^{(t-1)}, x^{(t)}] + b_i\right) \quad\text{(Input)}

C~t=tanh ⁣(Wc[h(t1),x(t)]+bc)(Candidate)\tilde{C}_t = \tanh\!\left(W_c\,[h^{(t-1)}, x^{(t)}] + b_c\right) \quad\text{(Candidate)}

ot=σ ⁣(Wo[h(t1),x(t)]+bo)(Output)o_t = \sigma\!\left(W_o\,[h^{(t-1)}, x^{(t)}] + b_o\right) \quad\text{(Output)}

State Updates:

C(t)=C(t1)ftkeep from past  +  itC~tadd new infoC^{(t)} = \underbrace{C^{(t-1)} \odot f_t}_{\text{keep from past}} \;+\; \underbrace{i_t \odot \tilde{C}_t}_{\text{add new info}}

h(t)=ottanh ⁣(C(t))h^{(t)} = o_t \odot \tanh\!\left(C^{(t)}\right)

\odot = element-wise multiplication. ft1f_t \approx 1 keeps old cell state; ft0f_t \approx 0 forgets it.

GateRole
Forget ftf_tσ\sigma output 0 = forget old memory; 1 = keep entirely
Input iti_tControls how much of the candidate C~t\tilde{C}_t is written to cell state
Candidate C~t\tilde{C}_tProposed new values for cell state via tanh\tanh of concatenated input
Output oto_tFilters what fraction of C(t)C^{(t)} is exposed as hidden state h(t)h^{(t)}

LSTM Cell — Gate Structure & Cell State Highway

3.c Parameter Counting

Each gate has weight matrix shape (m+n)×n(m+n) \times n plus bias nn, where mm = input dim, nn = LSTM units:

LSTM params=(m+n+1)×4n\text{LSTM params} = (m + n + 1) \times 4n

Total (+ output layer)=(m+n+1)×4n  +  (n+1)×k\text{Total (+ output layer)} = (m + n + 1) \times 4n \;+\; (n + 1) \times k

mm = input dim, nn = LSTM units, kk = output classes. Factor 4 = four gates (f, i, c̃, o).


4. Attention Mechanism

Attention was introduced to overcome the information bottleneck of a fixed-length context vector in vanilla Encoder-Decoder networks. The decoder now dynamically focuses on different encoder positions at each decoding step.

4.a Encoder-Decoder Without Attention

The vanilla seq2seq model compresses the entire input sequence into a single fixed vector c=h(Tx)c = h^{(T_x)}. The decoder generates output from only this vector. For long sequences, this bottleneck causes severe information loss.

Bottleneck Problem For a 50-word sentence, all information must fit into one fixed-size vector. Empirically, translation quality degrades sharply as input length grows.

4.b Bahdanau (Additive) Attention

At each decoder step tt, a different context vector ctc_t is computed as a weighted sum over all encoder hidden states {hj}\{h_j\}:

etj=vatanh ⁣(Wast1+Uahj)(alignment score)e_{tj} = v_a^\top \tanh\!\left(W_a\, s_{t-1} + U_a\, h_j\right) \quad\text{(alignment score)}

αtj=exp(etj)k=1Txexp(etk)(attention weights via softmax)\alpha_{tj} = \frac{\exp(e_{tj})}{\displaystyle\sum_{k=1}^{T_x} \exp(e_{tk})} \quad\text{(attention weights via softmax)}

ct=j=1Txαtjhj(context vector)c_t = \sum_{j=1}^{T_x} \alpha_{tj}\, h_j \quad\text{(context vector)}

st1s_{t-1}: previous decoder hidden state — hjh_j: encoder state at position jjWa,Ua,vaW_a, U_a, v_a: learned alignment params. Score computed before decoder step tt (uses st1s_{t-1}).

Alignment Matrix αtj\alpha_{tj} Visualizing all αtj\alpha_{tj} values reveals which source positions each output token attends to — giving interpretable word alignments in translation tasks.

4.c Luong (Multiplicative) Attention

Luong et al. compute the alignment score after the decoder step, using the current hidden state hth_t:

score(ht,hˉs)={hthˉsdothtWahˉsgeneralvatanh ⁣(Wa[ht;hˉs])concat\text{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{dot} \\ h_t^\top W_a\, \bar{h}_s & \text{general} \\ v_a^\top \tanh\!\left(W_a\,[h_t;\, \bar{h}_s]\right) & \text{concat} \end{cases}

BahdanauLuong
Hidden state usedst1s_{t-1} (pre-step)hth_t (post-step)
Formula typeAdditiveMultiplicative
Context scopeMany-to-one or many-to-manySame

Bahdanau Attention — Bidirectional Encoder + Attention Weights


5. Self-Attention & Transformer

The Transformer (Vaswani et al. 2017) replaces recurrent cells with self-attention layers entirely, enabling fully parallel computation and direct long-range dependency modeling.

5.a Q / K / V Self-Attention

For input sequence X=[x1,,xN]X = [x_1,\ldots,x_N], three linear projections create Query, Key, and Value vectors per position:

qn=βq+Ωqxn("what am I looking for?")q_n = \beta_q + \Omega_q\, x_n \quad\text{("what am I looking for?")}

km=βk+Ωkxm("what do I have?")k_m = \beta_k + \Omega_k\, x_m \quad\text{("what do I have?")}

vm=βv+Ωvxm("what I contribute")v_m = \beta_v + \Omega_v\, x_m \quad\text{("what I contribute")}

Scaled Dot-Product Self-Attention:

a[xm,xn]=softmaxm ⁣[kmqnDq]a[x_m, x_n] = \text{softmax}_m\!\left[\frac{k_m^\top q_n}{\sqrt{D_q}}\right]

Sa[X]=VSoftmax ⁣[KQDq]\text{Sa}[X] = V \cdot \text{Softmax}\!\left[\frac{K^\top Q}{\sqrt{D_q}}\right]

Division by Dq\sqrt{D_q} prevents dot-products from growing large and saturating softmax. Each position receives a weighted sum of all value vectors.

5.b Transformer vs RNN Encoder-Decoder

RNNTransformer
ProcessingSequential: h(t)h^{(t)} depends on h(t1)h^{(t-1)}All positions in parallel
Dependency pathTT steps between distant positionsO(1)O(1) via self-attention
Positional infoImplicit in recurrent orderExplicit positional encoding (sinusoidal or learned)
Multi-headHH independent heads; concatenate outputs

Why This Matters In an RNN, information from step 1 must survive T1T{-}1 transformations to reach step TT. In a Transformer, every pair of positions attends to each other directly in a single layer — drastically reducing the path length for long-range information.

Scaled Dot-Product Attention

Transformer Architecture — Encoder & Decoder


6. Backpropagation Through Time (BPTT)

BPTT trains recurrent networks by unrolling them into a deep feed-forward network (with shared weights), then applying standard backpropagation.

6.a BPTT Procedure

  1. Forward — Compute all h(t)h^{(t)} and o(t)o^{(t)} for the full sequence.
  2. Loss — Compute total loss E=tEtE = \sum_t E_t (e.g., cross-entropy at each output step).
  3. Backward — Compute E/Wxh,  E/Whh,  E/Who\partial E / \partial W_{xh},\; \partial E / \partial W_{hh},\; \partial E / \partial W_{ho} by accumulating gradients across all timesteps (shared weights).
  4. Update — Apply accumulated gradients via SGD / Adam to the shared weight matrices.

Truncated BPTT Backpropagation is limited to kk steps back for very long sequences. Reduces memory and computation at the cost of not learning ultra-long-range dependencies.


7. Reinforcement Learning

RL is a paradigm where an agent learns a policy through interaction with an environment, using reward signals as the only feedback. No labeled examples — the agent discovers good behavior via trial and error.

7.a RL vs Supervised Learning

Supervised LearningReinforcement Learning
InputLabeled pairs (x,y)(x, y)No labels; reward RtR_t
Feedback timingImmediate, correct answer givenDelayed; reward may come much later
Actions affect future?NoYes — actions influence future states
GoalMinimize prediction errorMaximize cumulative reward

Reward Hypothesis All goals can be described as maximizing the expected cumulative reward. Central hypothesis of RL.

7.b Agent-Environment Loop

At each timestep tt: agent observes StS_t, selects action AtA_t, environment returns reward Rt+1R_{t+1} and next state St+1S_{t+1}.

Ht=O1,R1,A1,  O2,R2,A2,  ,  At1,Ot,RtH_t = O_1, R_1, A_1,\; O_2, R_2, A_2,\; \ldots,\; A_{t-1}, O_t, R_t

Sta=f(Ht)S_t^a = f(H_t)

Agent state is any function of the history. In fully observable environments Sta=SteS_t^a = S_t^e.

7.c Return & Discounted Return

Undiscounted:

Gt=Rt+1+Rt+2+=k=0Rt+k+1G_t = R_{t+1} + R_{t+2} + \cdots = \sum_{k=0}^{\infty} R_{t+k+1}

Discounted:

Gt=k=0γkRt+k+1,0γ1G_t = \sum_{k=0}^{\infty} \gamma^k\, R_{t+k+1}, \quad 0 \leq \gamma \leq 1

γ=0\gamma = 0: only immediate reward. γ1\gamma \to 1: all future rewards equally valued. Ensures sum converges; reflects that near-term rewards are more certain.

7.d Value Functions & Policy

v(s)=E ⁣[Gt  |  St=s](state-value)v(s) = \mathbb{E}\!\left[G_t \;\middle|\; S_t = s\right] \quad\text{(state-value)}

q(s,a)=E ⁣[Gt  |  St=s,  At=a](action-value)q(s,a) = \mathbb{E}\!\left[G_t \;\middle|\; S_t = s,\; A_t = a\right] \quad\text{(action-value)}

A=π(S)(deterministic)π(AS)=P(At=aSt=s)(stochastic)A = \pi(S) \quad\text{(deterministic)} \qquad \pi(A \mid S) = P(A_t = a \mid S_t = s) \quad\text{(stochastic)}

ComponentDescription
PolicyAgent’s behavior function — maps states to actions or distributions
Value Function v(s)v(s)Expected future reward from state ss
Action-Value q(s,a)q(s,a)Expected future reward from taking action aa in state ss
Model (optional)Predicts St+1S_{t+1} and Rt+1R_{t+1} given St,AtS_t, A_t — not always learned
Model-FreeLearns from raw experience (Q-learning, SARSA)
Model-BasedBuilds environment model to plan ahead

7.e Q-Learning & SARSA

Both learn an action-value function Q(s,a)Q(s,a). The only difference is which next-step value is used as the bootstrap target:

Q-Learning (Off-Policy):

Q(st,at)Q(st,at)+αδtQ(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta_t

δt=rt+1+γmaxaQ(st+1,a)Q(st,at)\delta_t = r_{t+1} + \gamma \cdot \max_{a}\, Q(s_{t+1}, a) - Q(s_t, a_t)

Bootstraps from the greedy best action in st+1s_{t+1}, regardless of what was actually taken. Learns optimal policy even while exploring.

SARSA (On-Policy):

Q(st,at)Q(st,at)+αδtQ(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta_t

δt=rt+1+γQ(st+1,at+1)Q(st,at)\delta_t = r_{t+1} + \gamma \cdot Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)

Bootstraps from the value of the next action actually chosen by the current policy.

SARSA Name Origin The update uses the tuple (St,At,Rt+1,St+1,At+1)(S_t,\, A_t,\, R_{t+1},\, S_{t+1},\, A_{t+1}) — five elements, hence “SARSA”.

Q-LearningSARSA
TypeOff-policyOn-policy
Bootstrap targetmaxaQ(s,a)\max_a Q(s',a)Q(s,a)Q(s', a') where aa' is actually taken
BehaviorOptimistic — assumes best future actionConservative — penalizes risky explorations
ConvergenceTo QQ^* regardless of exploration policyTo optimal for current exploration strategy
Differ whenat+1argmaxaQ(st+1,a)a_{t+1} \neq \arg\max_a Q(s_{t+1},a)Same condition

Q-Learning vs SARSA — Bootstrap Target Comparison

7.f Optimal Policy & Grid World

After training, π\pi^* selects argmaxaQ(s,a)\arg\max_a Q(s,a) in each state. A single episode updates only visited states — the rest remain unchanged.

Episode Example Path (2,1)(2,2)(3,2)(3,3)(2,1) \to (2,2) \to (3,2) \to (3,3) with α=0.5,γ=0.7\alpha=0.5,\, \gamma=0.7. Only Q((2,1),R)Q((2,1),R), Q((2,2),D)Q((2,2),D), Q((3,2),R)Q((3,2),R) are updated. Policy at (3,2)(3,2) changes from ”↑ or →” to ”↑ only” because Q((3,2),R)Q((3,2), R) falls sharply from reward 10-10.

Grid World — Optimal Policy Before & After One Episode


8. Glossary

TermDefinition
IIDIndependent & Identically Distributed — standard ML assumption violated by sequential data
Parameter SharingSame weight matrices Wxh,Whh,WhoW_{xh}, W_{hh}, W_{ho} used at every RNN timestep
Vanishing GradientGradients shrink exponentially during BPTT; solved by LSTM cell state highway
Cell StateLSTM long-term memory highway — regulated by forget / input gates
Context Vectorct=jαtjhjc_t = \sum_j \alpha_{tj} h_j — weighted sum of encoder states used by decoder
Alignment Scoreetje_{tj} measures relevance of encoder position jj to decoder step tt
Self-AttentionComputes pairwise relationships between all token pairs in parallel
Q / K / VQuery, Key, Value — three learned linear projections in Transformer self-attention
Scaled Dot-ProductSa[X]=VSoftmax[KQ/Dq]\text{Sa}[X] = V \cdot \text{Softmax}[K^\top Q / \sqrt{D_q}]
Return GtG_tk=0γkRt+k+1\sum_{k=0}^\infty \gamma^k R_{t+k+1} — cumulative discounted reward from step tt
Value v(s)v(s)E[GtSt=s]\mathbb{E}[G_t \mid S_t = s] — expected return from state ss
Policy π\piAgent’s behavior; deterministic A=π(S)A=\pi(S) or stochastic π(AS)\pi(A\|S)
Q-LearningOff-policy TD: bootstraps from maxaQ(s,a)\max_a Q(s',a) — converges to QQ^*
SARSAOn-policy TD: bootstraps from Q(s,a)Q(s',a') where aa' is actually taken
TD Error δt\delta_trt+1+γQ(st+1,)Q(st,at)r_{t+1} + \gamma\,Q(s_{t+1},\cdot) - Q(s_t,a_t) — correction signal for both algorithms

Goodfellow et al. (2016) · Hochreiter & Schmidhuber (1997) · Bahdanau et al. (2015) · Vaswani et al. (2017) · Sutton & Barto (2018) · IF3270 Lecture Decks 5–9 · Attention & Transformer diagrams: Fraser Love, NNTikZ (2024)