All roads lead to likelihood: the value of reinforcement learning in fine-tuning

The paper “All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning” examines why online, two-stage fine-tuning procedures like RLHF often appear to outperform direct offline methods such as DPO at aligning language models with human preferences. The paper first establishes an algebraic equivalence between the two approaches under a specific assumption on the reward model’s structure, and then hypothesizes that the observed empirical differences arise from a “generation-verification gap”: learning a simple reward model (verifier) separately is easier than jointly learning a policy (generator) and its implicit reward.

The foundation for modeling preferences is the Bradley-Terry (BT) model, where the probability of preferring trajectory $\xi_1$ over $\xi_2$ is

\[P(\xi_1 \succ \xi_2 | r) = \sigma(r(\xi_1) - r(\xi_2)),\]

with $r(\xi)$ being a scalar reward for trajectory $\xi$ and $\sigma$ the sigmoid function. Offline Direct Preference Optimization (DPO) directly optimizes a policy $\pi \in \Pi$ (where $\Pi$ is the class of policies) by defining an implicit “local” reward

\[r_\pi(\xi) = \sum_{h=0}^{H-1} \log \pi(a_h|s_h).\]
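
To make these two ingredients concrete, here is a minimal sketch. The helper names are mine, and the toy tabular representation (a dict mapping states to action log-probabilities) stands in for the token log-probabilities a language model would actually supply:

```python
import numpy as np

def bt_preference_prob(r1: float, r2: float) -> float:
    """Bradley-Terry probability of preferring trajectory 1 over trajectory 2."""
    return 1.0 / (1.0 + np.exp(-(r1 - r2)))

def local_reward(log_pi, traj):
    """DPO's implicit 'local' reward r_pi(xi) = sum_h log pi(a_h | s_h).

    log_pi: toy tabular policy, mapping state -> vector of action log-probabilities.
    traj:   sequence of (state, action) pairs.
    """
    return sum(log_pi[s][a] for s, a in traj)
```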

The DPO objective is then to find the policy $\hat{\pi}_{DPO}$ that maximizes the log-likelihood of observed preferences:

\[\hat{\pi}_{DPO} = \underset{\pi \in \Pi}{\text{argmax}} \sum_{(\xi^+, \xi^-) \in D} \log \sigma(r_\pi(\xi^+) - r_\pi(\xi^-)).\]
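
Continuing the sketch above, the DPO objective plugs the local reward into the Bradley-Terry log-likelihood. Like the formulation in this summary, the sketch omits the reference policy and inverse temperature $\beta$ that appear in the practical DPO loss:

```python
def dpo_objective(log_pi, preference_pairs):
    """Sum over pairs of log sigma(r_pi(xi+) - r_pi(xi-))."""
    total = 0.0
    for traj_pos, traj_neg in preference_pairs:
        margin = local_reward(log_pi, traj_pos) - local_reward(log_pi, traj_neg)
        total += -np.logaddexp(0.0, -margin)  # log sigmoid(margin), numerically stable
    return total
```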

In contrast, online RLHF first learns an explicit “global” reward model $\hat{r}_G$ from a class of reward functions $R$ by maximizing:

\[\hat{r}_G = \underset{r_G \in R}{\text{argmax}} \sum_{(\xi^+, \xi^-) \in D} \log \sigma(r_G(\xi^+) - r_G(\xi^-)).\]
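
Stage 1 is the same Bradley-Terry log-likelihood, only with an arbitrary trajectory-level reward in place of a policy’s summed log-probabilities. A sketch, again with hypothetical names (in practice $\hat{r}_G$ would be a reward network trained by gradient ascent on this quantity; here it is just any callable):

```python
def reward_model_objective(r_global, preference_pairs):
    """Stage-1 RLHF objective: log-likelihood of preferences under a global reward.

    r_global: any callable mapping a trajectory to a scalar reward.
    """
    total = 0.0
    for traj_pos, traj_neg in preference_pairs:
        margin = r_global(traj_pos) - r_global(traj_neg)
        total += -np.logaddexp(0.0, -margin)
    return total
```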

Subsequently, it learns a policy $\hat{\pi}_{RLHF}^*$ using this $\hat{r}_G$. The objective for this policy optimization involves maximizing a combination of expected reward and the policy’s entropy. The causal entropy of a policy $\pi$, denoted $H(\pi)$, is defined as the expected sum of the negative log probabilities of actions taken under that policy, over the course of a trajectory $\xi = (s_0, a_0, \dots, s_{H-1}, a_{H-1}, s_H)$ generated by $\pi$:

\[H(\pi) = \mathbb{E}_{\xi \sim \pi} \left[ -\sum_{h=0}^{H-1} \log \pi(a_h|s_h) \right].\]
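
As a concrete (and entirely toy) reading of this definition, $H(\pi)$ can be estimated by Monte Carlo rollouts; everything below, including the two-state deterministic dynamics, is invented for illustration:

```python
rng = np.random.default_rng(0)

# Toy two-state, two-action policy: state -> action probabilities (invented numbers).
toy_pi = {0: np.array([0.7, 0.3]), 1: np.array([0.2, 0.8])}

def toy_step(s, a):
    """Hypothetical deterministic dynamics, purely for illustration."""
    return (s + a) % 2

def causal_entropy_mc(pi, horizon=3, n_rollouts=10_000):
    """Monte Carlo estimate of H(pi) = E_{xi ~ pi}[-sum_h log pi(a_h | s_h)]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, neg_log_prob = 0, 0.0   # s_0 fixed to state 0 for simplicity
        for _ in range(horizon):
            a = rng.choice(2, p=pi[s])
            neg_log_prob -= np.log(pi[s][a])
            s = toy_step(s, a)
        total += neg_log_prob
    return total / n_rollouts
```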

In this definition, the expectation $\mathbb{E}_{\xi \sim \pi}$ is over trajectories in which $s_0$ is drawn from the initial state distribution and each action $a_h$ is drawn from $\pi(\cdot|s_h)$. Following the principle of maximum-entropy RL, the policy optimization objective is:

\[\hat{\pi}_{RLHF}^* = \underset{\pi' \in \Pi}{\text{argmax}} \left( \mathbb{E}_{\xi \sim \pi'} [\hat{r}_G(\xi)] + H(\pi') \right).\]

One can show that the resulting optimal policy induces a trajectory distribution $P^*_{\hat{\pi}_{RLHF}^*}(\xi) \propto \exp(\hat{r}_G(\xi))$.
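
One standard way to see this, sketched here under the simplifying view that the optimization is over trajectory distributions $P$ (which the policy class realizes, up to policy-independent terms, when transitions are deterministic as in token-level generation): writing $Z = \sum_\xi \exp(\hat{r}_G(\xi))$,

\[\mathbb{E}_{\xi \sim P}[\hat{r}_G(\xi)] + H(P) = \sum_\xi P(\xi)\big(\hat{r}_G(\xi) - \log P(\xi)\big) = \log Z - \mathrm{KL}\left(P \,\middle\|\, \exp(\hat{r}_G)/Z\right),\]

so the objective is maximized exactly when the KL term vanishes, i.e., when $P(\xi) = \exp(\hat{r}_G(\xi))/Z$.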

The paper’s first main result (Theorem 2.2) demonstrates an algebraic equivalence: if the global reward model class $R$ in RLHF is constrained to be $R(\Pi)$ (i.e., rewards must have the form $r_{\pi'}(\xi) = \sum_h \log \pi'(a_h|s_h)$ for some policy $\pi' \in \Pi$), then the two approaches are identical. Under this constraint, Stage 1 of RLHF becomes:

\[\hat{r}_G = \underset{r_{\pi'} \in R(\Pi)}{\text{argmax}} \sum_{(\xi^+, \xi^-) \in D} \log \sigma(r_{\pi'}(\xi^+) - r_{\pi'}(\xi^-)).\]

The policy $\pi'$ parameterizing the optimal $r_{\pi'}$ in this expression is, by definition, $\hat{\pi}_{DPO}$. Thus, the learned reward model is $\hat{r}_G = r_{\hat{\pi}_{DPO}}$.
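
In code, reusing the sketches above, restricting the reward class to $R(\Pi)$ amounts to composing the local reward with the Stage-1 objective, which reproduces the DPO objective term by term:

```python
def restricted_stage1_objective(log_pi, preference_pairs):
    """Stage 1 with the reward class restricted to R(Pi): the candidate reward
    is forced to be r_{pi'} for some policy pi', so the objective collapses to
    dpo_objective(log_pi, preference_pairs)."""
    return reward_model_objective(lambda traj: local_reward(log_pi, traj),
                                  preference_pairs)
```

Since the two objectives coincide for every candidate policy and every dataset, their maximizers over $\Pi$ coincide as well.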

Stage 2 of RLHF then seeks a policy $\pi^*$ such that its trajectory distribution satisfies $P^*_{\pi^*}(\xi) \propto \exp(r_{\hat{\pi}_{DPO}}(\xi))$. Substituting $r_{\hat{\pi}_{DPO}}(\xi) = \sum_h \log \hat{\pi}_{DPO}(a_h|s_h)$, we get:

\[P_{\pi^*}^*(\xi) \propto \exp\left(\sum_h \log \hat{\pi}_{DPO}(a_h|s_h)\right) = \prod_h \hat{\pi}_{DPO}(a_h|s_h).\]

This implies the optimal policy $\pi^*$ from RLHF Stage 2 is $\hat{\pi}_{DPO}$. Thus, when the reward model architecture is restricted in this specific way, RLHF simply rearranges the DPO optimization.

To explain the empirically observed advantage of online methods, the paper proposes the “generation-verification gap” hypothesis (H6). This posits that the true underlying reward structure $r^*$ dictating preferences might be “simple” (belonging to a class $R_{sim}$ that is easy to learn) and that an unconstrained global reward model in RLHF’s Stage 1 can effectively learn this $r^*$. If DPO’s implicit reward $r_\pi(\xi)$ struggles to represent this $r^*$ with a policy $\pi$ that is also easy to find, or if the joint optimization of finding such a $\pi$ is inherently harder, then RLHF gains an advantage. RLHF decouples the problem: first learn $r^* \in R_{sim}$, then derive a policy for it. DPO attempts to find a policy $\pi$ whose implicit reward $r_\pi$ simultaneously captures $r^*$ and defines an optimal policy. A related result (Theorem 3.1) formalizes that if DPO’s search were restricted to policies optimal for some $r \in R_{sim}$, it would match RLHF’s outcome.

The paper presents experiments where manipulating task or reward complexity (e.g., very short trajectories, or using a complex predefined reward like ROUGE-L) alters the performance gap between online and offline methods. These are interpreted as supporting H6 by showing that when the generation-verification gap is presumed to be small (verification is as hard as generation, or generation is trivially easy), the online advantage diminishes.