Notes

This is an attempt to collate the notes I write while I’m researching or scrolling online. Expect short summaries of papers and technical results, plus half-baked thoughts. In that way it will be similar to my grad school blog, where I wrote about a bunch of random topics. Maybe the notes will be useful to you, but you shouldn’t take them too seriously.

2025-06-18 Basic facts about GPUs

Making sure I don't forget what I read.

gpu cuda
2025-06-12 Basic idea behind flash attention (V1)

Making attention less memory bound.

transformers gpu autograd
2025-06-11 Using min cut to determine activation recomputation strategy

Which activations should we save in the forward pass?

autograd min cut optimization
2025-06-09 Linear layouts, triton, and linear algebra over F_2

Arranging tensor data for computation on GPUs using linear algebra over F_2

linear algebra finite fields triton gpu
2025-06-02 Modded-NanoGPT Walkthrough II: Muon Optimizer, Model Architecture, and Parallelism

Part II: Muon optimizer, GPT architecture details, and distributed training in modded-nanogpt.

pytorch transformers optimization distributed
2025-05-15 Is AlphaEvolve problem B.1 hard?

Is AlphaEvolve problem B.1 hard? Yes

ai-for-math
2025-05-13 Modded-NanoGPT Walkthrough I: initial setup, compiler config, and custom FP8 operations

Part I of a two-part series on the modded-nanogpt repo

pytorch transformers optimization
2025-05-12 DeepSeek-Prover-V2 overview

An exciting new model for theorem proving

reinforcement-learning lean ai-for-math
2025-05-08 Getting the hang of policy gradients by reframing optimization as RL

How to make your life harder and learn something about policy gradients

reinforcement-learning policy-gradient optimization
2025-05-07 All roads lead to likelihood: the value of reinforcement learning in fine-tuning

When DPO and RLHF are the same and when they differ

reinforcement-learning fine-tuning
2025-05-06 Basic facts about policy gradients

Basic math of policy gradients, actor-critic, and proximal policy optimization

reinforcement-learning policy-gradient
2025-05-05 Adaptive data analysis via subsampling

A general recipe for building estimators based on adaptive queries

adaptive-data-analysis
2025-05-04 Weak baselines

Be careful with empirical claims

half-baked
2025-05-04 The ladder mechanism for ml competitions

How to adapt models to the test set without exponentially inflating generalization bounds

adaptive-data-analysis benchmarks
2025-05-03 What is kv cache?

A simple concept with a complicated name

transformers
2025-05-03 What is constitutional AI?

Anthropic's way to do RLHF without the H

ai-alignment
2025-05-03 Multi-head, multi-query, and grouped-query attention

Different ways to do attention

transformers