This is an attempt to collate notes I write while I’m researching or scrolling online. Expect short summaries of papers / technical results and half-baked thoughts. In that way it will be similar to my grad school blog, in which I wrote about a bunch of random topics. Maybe these notes will be useful to you, but you shouldn’t take them too seriously.
Making sure I don’t forget what I read.
Making attention less memory-bound.
Which activations should we save in the forward pass?
Arranging tensor data for computation on GPUs using linear algebra over F_2
Part 2: Muon optimizer, GPT architecture details, and distributed training in modded-nanogpt.
Is AlphaEvolve problem B.1 hard? Yes
Part 1 of a two-part series on the modded-nanogpt repo
An exciting new model for theorem proving
How to make your life harder and learn something about policy gradients
When DPO and RLHF are the same and / or different
Basic math of policy gradients, actor critic, and proximal policy optimization
A general recipe for building estimators based on adaptive queries
Be careful with empirical claims
How to adapt models to the test set without exponentially inflating generalization bounds
A simple concept with a complicated name
Anthropic's way to do RLHF without the H
Different ways to do attention