Slide 1: Training as Optimization

Purpose: Frame the lecture around training as loss minimization.

Slide 2: Labeled Data

Purpose: Fix the training data notation.

Formula:

\[(x_1,y_1),\ldots,(x_n,y_n)\]

Slide 3: What Is a Model?

Purpose: Define the object we are training.

Formula:

\[m(x;w)\]

Slide 4: Model Class Choices

Purpose: Separate a single model from a class of models.

Slide 5: Three Questions

Purpose: Separate optimization, model selection, and generalization.

Slide 6: Train-Validation-Test Split

Purpose: State the standard workflow.

Slide 7: Test-Set Leakage

Purpose: Warn against test-set leakage: the test set should stay untouched until the final evaluation.

Slide 8: Training Loss (Empirical Risk)

Purpose: Define the objective we minimize.

Formula:

\[L(w) = \frac{1}{n}\sum_{i=1}^n \ell_i(w)\]

Slide 9: Training Objective

Purpose: State the optimization problem.

Formula:

\[\min_{w \in \mathbb{R}} L(w)\]

Slide 10: Full Gradient

Purpose: Show the cost of a full-batch step.

Formula:

\[L'(w) = \frac{1}{n}\sum_{i=1}^n \ell_i'(w)\]

Slide 11: Why Stochastic Methods

Purpose: Motivate SGD by scale.

Slide 12: SGD Update

Purpose: State the stochastic update rule.

Formula:

\[w_{k+1} = w_k - \eta\,\ell_{i_k}'(w_k)\]

Slide 13: Sampling a Random Index (PyTorch)

Purpose: Show how to draw one sample index in code.

# Draw one index uniformly at random from {0, ..., n-1}.
i_k = torch.randint(low=0, high=n, size=(1,)).item()

Slide 14: Sampling a Minibatch (PyTorch)

Purpose: Show how to sample a batch with replacement.

# Draw B indices uniformly at random, with replacement.
idx = torch.randint(low=0, high=n, size=(B,))

Slide 15: Per-Step Cost

Purpose: Compare full-batch GD to SGD.

Slide 16: Synthetic Data for the Experiment

Purpose: Define the regression dataset.
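
A minimal sketch of how such a dataset could be generated (the slope, noise level, and sample count below are illustrative assumptions, not values from the lecture):

import torch

torch.manual_seed(0)

n = 1000          # number of samples (assumed)
w_true = 2.0      # true slope (assumed)
sigma = 0.5       # label noise standard deviation (assumed)

x = torch.randn(n)                         # inputs
y = w_true * x + sigma * torch.randn(n)    # noisy labels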

Slide 17: Linear Model

Purpose: Fix the prediction rule.

Formula:

\[\hat y = m(x;w) = wx\]

Slide 18: Per-Sample Loss

Purpose: Define the squared error.

Formula:

\[\ell_i(w) = \tfrac{1}{2}(y_i - wx_i)^2\]

Slide 19: Per-Sample Derivative

Purpose: Compute the stochastic gradient.

Formula:

\[\ell_i'(w) = (wx_i - y_i)x_i\]
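
As a quick sanity check (one hypothetical sample, names illustrative), the analytic derivative can be compared against autograd:

import torch

x_i, y_i = torch.tensor(1.5), torch.tensor(2.0)   # one hypothetical sample
w = torch.tensor(0.3, requires_grad=True)

loss = 0.5 * (y_i - w * x_i) ** 2     # per-sample loss
loss.backward()

analytic = (w.detach() * x_i - y_i) * x_i
print(w.grad, analytic)               # the two values agree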

Slide 20: SGD Update for This Problem

Purpose: Write the concrete 1D update.

Formula:

\[w_{k+1} = w_k - \eta (w_k x_{i_k} - y_{i_k}) x_{i_k}\]
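
A minimal sketch of the resulting training loop (data generation, step size, and iteration count are illustrative assumptions):

import torch

torch.manual_seed(0)
n = 1000
x = torch.randn(n)
y = 2.0 * x + 0.5 * torch.randn(n)   # hypothetical noisy 1D data

w = torch.tensor(0.0)                # initial parameter
eta = 0.01                           # constant step size (assumed)

for k in range(5000):
    i_k = torch.randint(low=0, high=n, size=(1,)).item()   # sample one index
    grad = (w * x[i_k] - y[i_k]) * x[i_k]                   # stochastic gradient
    w = w - eta * grad                                      # SGD update

print(w)   # should end up near the true slope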

Slide 21: Closed-Form Minimizer (Diagnostics Only)

Purpose: Provide the reference solution for the toy problem.

Formula:

\[w^\star = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\]
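
In code this is a single line (assuming the tensors x and y from the synthetic-data sketch):

# Closed-form minimizer of L(w) for the 1D least-squares problem.
w_star = (x * y).sum() / (x * x).sum()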

Slide 22: Objective Gap

Purpose: Define the diagnostic we will plot.
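
Taking the objective gap to mean L(w_k) - L(w*), a sketch of how it could be computed (assumes x, y, w, and w_star from the earlier sketches):

def full_loss(w):
    # L(w): empirical risk over the whole dataset.
    return 0.5 * torch.mean((y - w * x) ** 2)

gap = full_loss(w) - full_loss(w_star)   # the diagnostic plotted in the later figures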

Slide 23: What Success Looks Like

Purpose: See the fitted line on noisy data.

Figure 2.1: SGD fit on synthetic regression data.

Slide 24: Constant Step Size Behavior

Purpose: Describe the noise-floor phenomenon.

Slide 25: Constant Step Size Tradeoff

Purpose: Visualize the noise floor across step sizes.

Figure 2.2: SGD with a constant step size, objective gap.

Slide 26: What “Noise Floor” Means

Purpose: Clarify the diagnostic.

Slide 27: Full-Dataset Diagnostics

Purpose: Explain what we can log in the toy problem.

Slide 28: Diagnostics Plot

Purpose: Show objective gap and gradient norm together.

Figure 2.3: SGD diagnostics, objective gap and full gradient magnitude.

Slide 29: Convergent Step Size Schedules

Purpose: State the classical sufficient condition.

Formula:

\[\sum_{k=0}^\infty \eta_k = \infty \quad\text{and}\quad \sum_{k=0}^\infty \eta_k^2 < \infty\]

Slide 30: Power Schedule

Purpose: Give a standard schedule that satisfies the condition.

Formula:

\[\eta_k = \frac{\eta_0}{(k+1)^p} \quad\text{with}\quad p \in (\tfrac{1}{2},1]\]

Slide 31: Geometric Schedule

Purpose: Give a fast-decaying alternative.

Formula:

\[\eta_k = \eta_0 \gamma^k \quad\text{with}\quad \gamma \in (0,1)\]
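
A sketch of how the three schedules compared on the next slide could be implemented (eta0, p, and gamma are illustrative choices):

eta0 = 0.1                               # initial step size (assumed)

def constant_schedule(k):
    return eta0

def power_schedule(k, p=0.75):           # p in (1/2, 1]
    return eta0 / (k + 1) ** p

def geometric_schedule(k, gamma=0.999):  # gamma in (0, 1)
    return eta0 * gamma ** k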

Slide 32: Schedule Comparison

Purpose: Compare constant, geometric, and power schedules.

Figure 2.4: Step size schedules.

Slide 33: Unbiased Gradient Estimate

Purpose: Explain why SGD points in the right direction on average.

Formula:

\[\mathbb{E}\big[\ell_{i_k}'(w)\big] = \frac{1}{n}\sum_{i=1}^n \ell_i'(w) = L'(w), \qquad i_k \sim \mathrm{Uniform}\{1,\ldots,n\}\]

Slide 34: Expected Update

Purpose: Link SGD to gradient descent in expectation.

Formula:

\[\mathbb{E}[w_{k+1}\mid w_k] = w_k - \eta L'(w_k)\]
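
A quick Monte Carlo check of this (data as in the earlier sketch; sample counts illustrative): averaging many single-sample gradients at a fixed w approaches the full gradient.

import torch

torch.manual_seed(0)
n = 1000
x = torch.randn(n)
y = 2.0 * x + 0.5 * torch.randn(n)       # hypothetical data, as before
w = torch.tensor(0.3)                    # an arbitrary fixed point

full_grad = torch.mean((w * x - y) * x)  # L'(w)

idx = torch.randint(low=0, high=n, size=(100000,))
stoch_grads = (w * x[idx] - y[idx]) * x[idx]
print(full_grad, stoch_grads.mean())     # close, since the estimate is unbiased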

Slide 35: Variance Controls the Noise Floor

Purpose: Explain what higher label noise does.

Slide 36: Effect of Label Noise

Purpose: Visualize variance effects.

Figure 2.6: Effect of label noise on SGD.

Slide 37: Variance Reduction by Averaging

Purpose: State the basic probability fact.

Formula:

\[\mathrm{Var}\Big(\frac{1}{B}\sum_{j=1}^B X_j\Big) = \frac{1}{B}\,\mathrm{Var}(X_1)\]
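
An empirical sanity check of the 1/B reduction (distribution and batch size are arbitrary illustrative choices):

import torch

torch.manual_seed(0)
B = 16
samples = torch.randn(100000, B)             # i.i.d. draws, B per row

var_single = samples[:, 0].var()             # Var(X_1)
var_mean = samples.mean(dim=1).var()         # variance of the B-sample average
print(var_single, var_mean, var_single / B)  # var_mean is close to var_single / B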

Slide 38: Minibatch Gradient Estimate

Purpose: Define the minibatch estimator.

Formula:

\[G_k = \frac{1}{B}\sum_{i \in B_k} \ell_i'(w_k)\]
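
A sketch of one minibatch-SGD step for the toy problem (batch size and step size are illustrative; assumes x, y, n, and w from the earlier sketches):

B = 32       # batch size (assumed)
eta = 0.05   # step size (assumed)

idx = torch.randint(low=0, high=n, size=(B,))      # sample a batch with replacement
G_k = torch.mean((w * x[idx] - y[idx]) * x[idx])   # minibatch gradient estimate
w = w - eta * G_k                                  # update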

Slide 39: Rule of Thumb for Noise

Purpose: Connect batch size and step size.

Slide 40: Two Ways to Measure Progress

Purpose: Separate iteration count from total gradient work.

Slide 41: Minibatch vs Iterations

Purpose: Show iteration efficiency.

Figure 2.7: Minibatch SGD, objective gap vs. iterations.

Slide 42: Minibatch vs Total Gradients

Purpose: Compare total gradient work.

Figure 2.8: Minibatch SGD, objective gap vs. total gradients.

Slide 43: Why Use Large Batches

Purpose: Motivate minibatches via parallelism.

Slide 44: When Constant Steps Are Enough

Purpose: State a clean sufficient condition.

Formula:

\[\ell_i'(w^\star)=0 \quad \text{for all } i\]

Slide 45: Noiseless Regression Example

Purpose: Show a case where the condition holds.

Formula:

\[y_i = x_i\]

Slide 46: Per-Sample Gradients Vanish

Purpose: See why SGD becomes deterministic.

Formula:

\[\ell_i'(w) = (w-1)x_i^2\]
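
A small sketch of this regime (data and step size are illustrative): with y_i = x_i every per-sample gradient vanishes at w = 1, so constant-step SGD converges without a noise floor.

import torch

torch.manual_seed(0)
n = 1000
x = torch.randn(n)
y = x.clone()            # noiseless labels: y_i = x_i
w = torch.tensor(0.0)
eta = 0.1                # constant step size (assumed)

for k in range(2000):
    i = torch.randint(low=0, high=n, size=(1,)).item()
    w = w - eta * (w * x[i] - y[i]) * x[i]   # equals w - eta * (w - 1) * x_i^2

print(w)   # converges to 1; no noise floor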

Slide 47: Noiseless vs Noisy

Purpose: Contrast the two regimes.

Figure 2.5: Noiseless vs. noisy regression with a constant step size.

Slide 48: Expressivity Context

Purpose: Explain when constant steps are common in practice.

Slide 49: Validation Loss

Purpose: Monitor performance on new data.

Slide 50: Validation Loop (PyTorch)

Purpose: Show the evaluation pattern.

# Assume we have training data (x_tr, y_tr), validation data (x_va, y_va),
# a current scalar parameter w, and an iteration budget max_iters.
# Assume "step(w)" performs one SGD or minibatch-SGD update on training data.

eval_every = 200

for k in range(max_iters):
    w = step(w)

    if k % eval_every == 0:
        with torch.no_grad():
            train_loss = 0.5 * torch.mean((y_tr - w * x_tr)**2)
            val_loss = 0.5 * torch.mean((y_va - w * x_va)**2)

        print(f"k={k:6d}  train_loss={train_loss:.3e}  val_loss={val_loss:.3e}")

Slide 51: Why Monitor Validation Loss

Purpose: State what validation diagnostics are for.

Slide 52: Conclusion: Core Points

Purpose: Capture the main takeaways.

Slide 53: Conclusion: Practical Points

Purpose: Capture the practical takeaways.