Looking for the 2025 version? It’s archived here.
0. Introduction | Slides | Notebook
Course content, a deliverable, and spam classification in PyTorch.
1. Optimization and PyTorch Basics in 1D
Optimization setup, minimizers and stationarity, 1D gradient descent, diagnostics, step-size tuning, and PyTorch autodiff basics.
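The 1D gradient descent loop from this lecture can be sketched in a few lines of plain Python (the quadratic objective f(x) = (x − 3)² and all constants are illustrative choices, not from the course materials):

```python
# Minimal 1D gradient descent on f(x) = (x - 3)^2,
# whose unique minimizer is x* = 3.
def grad_descent_1d(grad, x0, step, n_steps):
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)  # move against the gradient
    return x

# f'(x) = 2 (x - 3)
x_final = grad_descent_1d(lambda x: 2 * (x - 3), x0=0.0, step=0.1, n_steps=100)
```

With this step size the distance to the minimizer shrinks by a constant factor per iteration, the kind of behavior the step-size-tuning diagnostics in this lecture are meant to surface.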
2. Stochastic Optimization Basics in 1D
Empirical risk, SGD updates, step-size schedules, noise floors, unbiasedness and variance, minibatches, and validation diagnostics.
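A toy SGD run, assuming a 1D least-squares empirical risk over four made-up data points; the 1/k schedule is one example of the decaying step sizes this lecture discusses:

```python
import random

# SGD on the empirical risk f(x) = (1/n) * sum_i (x - a_i)^2 / 2,
# whose minimizer is the sample mean of the a_i.
random.seed(0)
data = [1.0, 2.0, 3.0, 4.0]
x = 0.0
for k in range(1, 5001):
    a = random.choice(data)   # single-sample stochastic gradient (unbiased)
    g = x - a                 # gradient of (x - a)^2 / 2
    x -= (1.0 / k) * g        # decaying step size drives down the noise floor
mean = sum(data) / len(data)
```

A constant step size would instead stall at a noise floor around the minimizer, which is the contrast the lecture's diagnostics are built to detect.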
3. Optimization and PyTorch Basics in Higher Dimensions | Live demo
Lift optimization to $\mathbb{R}^d$, derive gradient descent from the local model, and tour PyTorch tensors, efficiency, dtypes, and devices.
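The same update lifted to $\mathbb{R}^d$, sketched with plain Python lists (a separable quadratic with minimizer b is an assumed toy objective):

```python
# Gradient descent in R^d on f(x) = 0.5 * ||x - b||^2, minimizer x* = b.
def grad_descent(grad, x0, step, n_steps):
    x = list(x0)
    for _ in range(n_steps):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]  # coordinate-wise step
    return x

b = [1.0, -2.0, 0.5]
x = grad_descent(lambda x: [xi - bi for xi, bi in zip(x, b)],
                 x0=[0.0, 0.0, 0.0], step=0.5, n_steps=60)
```

In the lecture proper this loop is vectorized with PyTorch tensors; the list comprehension stands in for that elementwise tensor arithmetic.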
4. Loss Functions and Models for Regression and Classification Problems | Live demo
Formulate ML objectives, choose losses for regression/classification, and build/train linear and convolutional models in PyTorch.
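Two of the losses this lecture pairs with regression and classification, written out elementwise (a minimal sketch; the ±1 label convention for the logistic loss is one common choice):

```python
import math

def mse(y_hat, y):
    # Squared error for regression.
    return (y_hat - y) ** 2

def logistic_loss(score, label):
    # Binary classification with label in {-1, +1}:
    # small when the score agrees in sign with the label.
    return math.log(1.0 + math.exp(-label * score))
```

For example, `logistic_loss(0.0, +1)` equals log 2: an uncommitted score pays a fixed penalty regardless of the label.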
5. A Step-by-Step Introduction to Transformer Models
Building transformers from scratch: embeddings, attention, residual connections, and next-token prediction on Shakespeare.
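The attention step at the heart of this lecture, written out for a single query in plain Python (the vectors and dimensions are illustrative; real implementations batch this over all queries with matrix products):

```python
import math

# Scaled dot-product attention for one query:
# weights = softmax(q . k_i / sqrt(d)), output = sum_i weights_i * v_i.
def attention(q, keys, values):
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return out, weights

out, weights = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                         [[1.0, 0.0], [0.0, 1.0]])
```

The query aligns with the first key, so the first value dominates the weighted average.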
6. A Step-by-Step Introduction to Diffusion Models
Diffusion models from first principles: forward process, reverse process, noise prediction, U-Net, sampling, DDIM, conditional generation, and FID.
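The forward (noising) process admits a one-line closed form, sketched here for a scalar (the noise schedule value is an assumed input; in the lecture it comes from a schedule over timesteps):

```python
import math

def forward_noise(x0, alpha_bar_t, eps):
    # Closed-form forward process of a diffusion model:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    # where eps is drawn from N(0, 1) and alpha_bar_t is the
    # cumulative product of the noise schedule up to step t.
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps
```

At alpha_bar_t = 1 the sample is untouched; as alpha_bar_t → 0 it becomes pure noise, and the network is trained to predict the eps that was added.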
7. Reinforcement Learning for Language Models
The REINFORCE gradient estimator, baselines, KL penalties, rejection sampling, gradient weight rescaling, and a reward shaping experiment on Shakespeare.
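A hypothetical two-armed bandit makes the REINFORCE estimator and its baseline concrete (the one-parameter softmax policy, deterministic rewards, and running-average baseline are toy choices, not the lecture's Shakespeare setup):

```python
import math, random

# REINFORCE on a two-armed bandit: arm 1 pays 1, arm 0 pays 0.
# Policy: pi(arm 1) = sigmoid(theta). Update: grad log pi(a) * (r - baseline).
random.seed(1)
theta, baseline = 0.0, 0.0
for step in range(2000):
    p1 = 1.0 / (1.0 + math.exp(-theta))
    a = 1 if random.random() < p1 else 0
    r = 1.0 if a == 1 else 0.0
    grad_logp = (1.0 - p1) if a == 1 else -p1   # d/dtheta log pi(a)
    theta += 0.1 * grad_logp * (r - baseline)   # REINFORCE step
    baseline = 0.9 * baseline + 0.1 * r         # running-average baseline
p1 = 1.0 / (1.0 + math.exp(-theta))
```

The baseline leaves the estimator unbiased but cuts its variance, the same role it plays in the language-model setting.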
Algorithm modifiers (momentum, schedulers, gradient clipping), techniques that change the problem (LoRA, quantization, weight decay), the optimizer zoo (SignSGD, Signum, AdaGrad, RMSProp, Adam, AdamW), coordinate-wise scaling, Newton’s method, and Muon.
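One Adam step in plain Python shows the coordinate-wise scaling that distinguishes this family of optimizers from plain gradient descent (a minimal sketch with the standard default hyperparameters; real implementations keep the moment buffers inside the optimizer object):

```python
import math

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Running first and second moment estimates of the gradient.
    m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
    v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for the zero initialization (t is 1-indexed).
    m_hat = [mi / (1 - b1 ** t) for mi in m]
    v_hat = [vi / (1 - b2 ** t) for vi in v]
    # Coordinate-wise scaled update.
    x = [xi - lr * mh / (math.sqrt(vh) + eps)
         for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x, m, v

x, m, v = adam_step([0.0], [1.0], [0.0], [0.0], t=1)
```

On the first step the bias correction makes the update magnitude essentially lr, independent of the gradient's scale.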
How to compare optimizers fairly: time-to-result, why tuning is inseparable from the optimizer, and the AlgoPerf benchmark.
Systematic hyperparameter tuning: scientific/nuisance/fixed roles, search methods, batch size, training duration, and the Google tuning playbook.
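Random search over a log-uniform learning-rate range can be sketched in a few lines (the search range and the stand-in "validation loss" are hypothetical; in practice each trial is a real training run):

```python
import math, random

random.seed(0)

def sample_lr(lo=1e-5, hi=1e-1):
    # Log-uniform sampling: uniform in log-space, then exponentiate,
    # so every order of magnitude is covered equally.
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

def val_loss(lr):
    # Toy stand-in for a training run, minimized at lr = 1e-2.
    return (math.log10(lr) - math.log10(1e-2)) ** 2

trials = [(val_loss(lr), lr) for lr in (sample_lr() for _ in range(50))]
best_loss, best_lr = min(trials)
```

Here the learning rate plays the "scientific" role while everything else is held fixed; in the playbook's terminology, nuisance hyperparameters would be re-tuned inside each trial.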
The remaining sections are from the 2025 offering and have not yet been updated.
12. Scaling Transformers: Parallelism Strategies from the Ultrascale Playbook | Cheatsheet
How do we scale transformer training to hundreds of billions of parameters?
A recap of the course.