This is an attempt to collate notes I write while I’m researching or scrolling online. Expect small summaries of papers / technical results and half-baked thoughts. Maybe they will be useful to you? You shouldn’t take them too seriously.
Part 1 of a two-part series on the modded-nanogpt repo
An exciting new model for theorem proving
How to make your life harder and learn something about policy gradients
When DPO and RLHF are the same and/or different
Basic math of policy gradients, actor-critic, and proximal policy optimization
A general recipe for building estimators based on adaptive queries
Be careful with empirical claims
How to adapt models to the test set without exponentially inflating generalization bounds
A simple concept with a complicated name
Anthropic's way to do RLHF without the H
Different ways to do attention