Table of Contents

Looking for the 2025 version? It’s archived here.

Export lectures to markdown

0. Introduction | Slides | Notebook

Course content, a deliverable, and spam classification in PyTorch.

1. Optimization and PyTorch basics in 1D

Optimization setup, minimizers and stationarity, 1D gradient descent, diagnostics, step-size tuning, and PyTorch autodiff basics.
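
A minimal sketch of what this looks like in code: gradient descent on an illustrative 1D objective, with the gradient supplied by PyTorch autodiff. The objective and step size are stand-ins, not the lecture's.

```python
import torch

def f(x):
    # Illustrative 1D objective with minimizer at x = 3.
    return (x - 3.0) ** 2

x = torch.tensor(0.0, requires_grad=True)
step_size = 0.1  # illustrative; tuning this is a topic of the lecture

for _ in range(100):
    loss = f(x)
    loss.backward()              # autodiff fills in x.grad = df/dx
    with torch.no_grad():
        x -= step_size * x.grad  # the 1D gradient descent update
    x.grad.zero_()

print(x.item())  # converges toward the stationary point x = 3
```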

2. Stochastic optimization basics in 1D

Empirical risk, SGD updates, step-size schedules, noise floors, unbiasedness and variance, minibatches, and validation diagnostics.
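
A hedged sketch of the lecture's themes: SGD on a synthetic 1D least-squares problem, where each minibatch gives a noisy but unbiased gradient of the empirical risk. Data, sizes, and step size are illustrative.

```python
import torch

torch.manual_seed(0)

# Synthetic data: y ≈ 2x + noise (illustrative, not the lecture's dataset).
X = torch.randn(1000)
y = 2.0 * X + 0.1 * torch.randn(1000)

w = torch.zeros(1, requires_grad=True)
step_size, batch_size = 0.05, 32

for _ in range(500):
    idx = torch.randint(0, len(X), (batch_size,))  # sample a minibatch
    loss = ((w * X[idx] - y[idx]) ** 2).mean()     # minibatch empirical risk
    loss.backward()                                # noisy but unbiased gradient
    with torch.no_grad():
        w -= step_size * w.grad                    # SGD update
    w.grad.zero_()

print(w.item())  # hovers near 2; the residual wiggle is the noise floor
```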

3. Optimization and PyTorch basics in higher dimensions | Live demo

Lift optimization to $\mathbb{R}^d$, derive gradient descent from the local model, and tour PyTorch tensors, efficiency, dtypes, and devices.
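
An illustrative sketch of the higher-dimensional version: gradient descent on a quadratic in $\mathbb{R}^d$, with explicit dtype and device handling. The objective, sizes, and step-size rule are assumptions for the example.

```python
import torch

# Gradient descent on f(x) = ||Ax - b||^2 in R^d (illustrative problem).
d = 10
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(d, d, dtype=torch.float32, device=device)
b = torch.randn(d, dtype=torch.float32, device=device)

x = torch.zeros(d, device=device, requires_grad=True)
# The gradient here is 2A^T(Ax - b), so its Lipschitz constant is
# L = 2||A||_2^2 and a step size of 1/L is safe.
step_size = 1.0 / (2.0 * torch.linalg.matrix_norm(A, ord=2) ** 2)

for _ in range(500):
    loss = torch.sum((A @ x - b) ** 2)
    loss.backward()
    with torch.no_grad():
        x -= step_size * x.grad
    x.grad.zero_()

print(loss.item())  # near 0 for this well-posed quadratic
```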

4. Loss functions and models for regression and classification problems | Live demo

Formulate ML objectives, choose losses for regression/classification, and build/train linear and convolutional models in PyTorch.
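
A small sketch of the pairing between task and loss; shapes and data are made up for illustration, and the models are bare linear layers rather than the lecture's trained networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative data shapes only, not the lecture's datasets.
X = torch.randn(64, 20)             # 64 examples, 20 features
y_reg = torch.randn(64)             # real-valued targets (regression)
y_cls = torch.randint(0, 3, (64,))  # class labels in {0, 1, 2}

# Regression: linear model + squared-error loss.
reg_model = nn.Linear(20, 1)
reg_loss = nn.MSELoss()(reg_model(X).squeeze(-1), y_reg)

# Classification: linear model producing logits + cross-entropy loss.
cls_model = nn.Linear(20, 3)
cls_loss = nn.CrossEntropyLoss()(cls_model(X), y_cls)

print(reg_loss.item(), cls_loss.item())
```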

5. A step-by-step introduction to transformer models

Building transformers from scratch: embeddings, attention, residual connections, and next-token prediction on Shakespeare.
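
For a from-scratch flavor of the attention step: a single head of causal self-attention, with sizes and initialization chosen for illustration and without the surrounding transformer block.

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """One head of causal self-attention; a sketch, not the lecture code."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # (T, d_head) each
    scores = q @ k.T / k.shape[-1] ** 0.5   # scaled dot-product similarities
    mask = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))  # causal: no peeking ahead
    return F.softmax(scores, dim=-1) @ v    # attention-weighted values

T, d, d_head = 8, 32, 16  # illustrative sizes
x = torch.randn(T, d)     # T token embeddings
Wq, Wk, Wv = (torch.randn(d, d_head) / d ** 0.5 for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (T, d_head)
```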

6. A step-by-step introduction to diffusion models

Diffusion models from first principles: forward process, reverse process, noise prediction, U-Net, sampling, DDIM, conditional generation, and FID.
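
A sketch of the closed-form forward (noising) process; the linear beta schedule and tensor shapes are illustrative assumptions, not the lecture's exact choices.

```python
import torch

# Forward process in closed form:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # illustrative linear schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # alpha_bar_t = prod(1 - beta_s)

x0 = torch.randn(16, 3, 32, 32)   # a batch of "clean" images
t = torch.randint(0, T, (16,))    # a random timestep per example
eps = torch.randn_like(x0)        # the noise the model learns to predict

ab = alpha_bar[t].view(-1, 1, 1, 1)             # broadcast per-example alpha_bar_t
x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps  # noisy sample at timestep t
```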

7. Reinforcement learning for language models

The REINFORCE gradient estimator, baselines, KL penalties, rejection sampling, gradient weight rescaling, and a reward shaping experiment on Shakespeare.
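
A minimal sketch of the REINFORCE estimator with a batch-mean baseline, written for a generic categorical policy; the shapes and stand-in rewards are assumptions, not the lecture's language-model setup.

```python
import torch

logits = torch.randn(32, 10, requires_grad=True)  # per-example action logits
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
rewards = torch.randn(32)                         # stand-in rewards

baseline = rewards.mean()                         # simple baseline to reduce variance
log_probs = dist.log_prob(actions)
# Minimizing this surrogate follows the REINFORCE policy gradient.
loss = -((rewards - baseline) * log_probs).mean()
loss.backward()
```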

8. More on optimizers

Algorithm modifiers (momentum, schedulers, gradient clipping), techniques that change the problem (LoRA, quantization, weight decay), the optimizer zoo (SignSGD, Signum, AdaGrad, RMSProp, Adam, AdamW), coordinate-wise scaling, Newton’s method, and Muon.
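
Two of the modifiers above are one-liners in PyTorch; a hedged sketch with a stand-in model and objective, using built-in momentum, a cosine schedule, and gradient clipping.

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)

for _ in range(100):
    loss = model(torch.randn(32, 20)).pow(2).mean()  # stand-in objective
    opt.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    opt.step()
    sched.step()  # scheduler decays the step size over training
```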

9. Benchmarking optimizers

How to compare optimizers fairly: time-to-result, why tuning is inseparable from the optimizer, and the AlgoPerf benchmark.
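
A toy sketch of time-to-result comparison (not the AlgoPerf protocol): race two optimizers to a target loss on the same illustrative problem. The fixed learning rates are exactly the caveat the lecture raises, since tuning is inseparable from the comparison.

```python
import time
import torch
import torch.nn as nn

def time_to_target(opt_factory, target_loss=0.1, max_steps=10_000):
    # Same problem and seed for every optimizer; only the optimizer varies.
    torch.manual_seed(0)
    model = nn.Linear(20, 1)
    X = torch.randn(256, 20)
    y = X @ torch.randn(20, 1) + 0.01 * torch.randn(256, 1)
    opt = opt_factory(model.parameters())
    start = time.perf_counter()
    for _ in range(max_steps):
        loss = ((model(X) - y) ** 2).mean()
        if loss.item() < target_loss:
            return time.perf_counter() - start
        opt.zero_grad()
        loss.backward()
        opt.step()
    return float("inf")  # never reached the target

print(time_to_target(lambda p: torch.optim.SGD(p, lr=0.01)))
print(time_to_target(lambda p: torch.optim.Adam(p, lr=0.01)))
```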

10. The Tuning Playbook

Systematic hyperparameter tuning: scientific/nuisance/fixed roles, search methods, batch size, training duration, and the Google tuning playbook.
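
A toy sketch of one playbook idea: sample the scientific hyperparameter (here, the learning rate) log-uniformly while nuisance choices stay fixed. `train_and_validate` is a hypothetical stand-in for a real training run.

```python
import math
import random

random.seed(0)

def train_and_validate(lr):
    # Hypothetical stand-in for a training run returning a validation metric.
    return (math.log10(lr) + 2.5) ** 2  # pretend the best lr is 10^-2.5

trials = [10 ** random.uniform(-5, -1) for _ in range(20)]  # log-uniform samples
best_lr = min(trials, key=train_and_validate)
print(f"best lr ≈ {best_lr:.2e}")
```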


Some 2025 content below; yet to be deleted.

12. Scaling Transformers: Parallelism Strategies from the Ultrascale Playbook | Cheatsheet

How do we scale transformer training to hundreds of billions of parameters?

Recap | Cheatsheet

A recap of the course.