Basic facts about policy gradients

Problem setup

The goal of Reinforcement Learning (RL) is for an agent to learn “optimal” behavior through interaction with an environment. The agent’s decisions are guided by a policy $\pi_\theta(a|s)$, parameterized by $\theta$. The objective is to find $\theta$ that maximizes

\[J(\theta) = E_{\tau \sim \pi_\theta}[G_0].\]

Here, $G_0 = \sum_{k=0}^\infty \gamma^k r_{k+1}$ is the total discounted return of a trajectory $\tau=(s_0, a_0, r_1, s_1, \dots)$ generated by following policy $\pi_\theta$ from an initial state $s_0$ drawn from a distribution $p_0(s_0)$, and $\gamma \in [0,1]$ is the “discount factor.” The distribution $p_0(s_0)$ specifies the probability of starting an episode in state $s_0$.
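
As a small illustration, here is a minimal Python sketch (the function name is mine, not from any particular library) of computing a sampled $G_0$ from a finite list of observed rewards, where `rewards[k]` stands for $r_{k+1}$:

```python
# Compute a sampled G_0 = sum_k gamma^k r_{k+1} for a finite trajectory.
# `rewards[k]` plays the role of r_{k+1}; the infinite sum is truncated at
# the trajectory length.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    # Iterating backwards discounts each reward the right number of times.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9**2 * 2 = 2.62
```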

Value functions and Bellman equations

To evaluate policies, we define value functions. These functions rely on the return from a generic time step $t$, $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1}$. Importantly, in this definition of $G_t$, the discount $\gamma^k$ applies to $r_{t+k+1}$, the $(k+1)$-th reward received after time $t$; the first reward $r_{t+1}$ is undiscounted, so the “discount clock” effectively resets at time $t$.

  1. The State-Value Function $V^\pi(s) = E_\pi[G_t | s_t=s]$ is the expected return from starting in state $s$ and following policy $\pi$.
  2. The Action-Value Function $Q^\pi(s,a) = E_\pi[G_t | s_t=s, a_t=a]$ is the expected return from starting in state $s$, taking action $a$, and then following policy $\pi$.

Because the environment and policy are stationary, the specific value of $t$ used in the definition does not change the resulting values $V^\pi(s)$ or $Q^\pi(s,a)$; they are functions of state (and action) only.

These value functions are related by the identity:

\[V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a).\]

They satisfy the Bellman expectation equations, which express values recursively. Let $P(s'|s,a)$ be the state transition probability and $R(s,a,s')$ be the expected immediate reward for $(s,a,s')$.

\[\begin{aligned} V^\pi(s) &= \sum_a \pi(a|s) \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^\pi(s')], \\ Q^\pi(s,a) &= \sum_{s'} P(s'|s,a) [R(s,a,s') + \gamma V^\pi(s')]. \end{aligned}\]

These equations state that the value of a state (or state-action pair) under $\pi$ is the sum of the expected immediate reward and the discounted expected value of subsequent states encountered by following $\pi$.
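
To make the Bellman expectation equations concrete, the sketch below runs iterative policy evaluation on a small made-up tabular MDP (the arrays `P`, `R`, and `pi` are arbitrary illustrative numbers, not taken from the text) and checks the identity $V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)$:

```python
import numpy as np

# A made-up tabular MDP with 2 states and 2 actions.
# P[s, a, s'] = transition probability, R[s, a, s'] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.0], [1.0, 1.0]]])
pi = np.array([[0.5, 0.5],    # pi(a|s) for s = 0
               [0.1, 0.9]])   # pi(a|s) for s = 1
gamma = 0.9

# Iterative policy evaluation:
# V(s) <- sum_a pi(a|s) sum_{s'} P[s,a,s'] (R[s,a,s'] + gamma V(s')).
V = np.zeros(2)
for _ in range(10_000):
    V_new = np.einsum('sa,sap,sap->s', pi, P, R + gamma * V)
    converged = np.max(np.abs(V_new - V)) < 1e-10
    V = V_new
    if converged:
        break

# Q^pi from its Bellman equation, then check V^pi(s) = sum_a pi(a|s) Q^pi(s,a).
Q = np.einsum('sap,sap->sa', P, R + gamma * V)
print(V)
print((pi * Q).sum(axis=1))   # should match V
```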

The policy gradient theorem

The Policy Gradient Theorem provides an expression for $\nabla_\theta J(\theta)$. It’s useful because we can compute unbiased estimates of the gradient from samples of trajectories generated by $\pi_\theta$.

Let $P(\tau|\theta)$ be the probability of trajectory $\tau$ under policy $\pi_\theta$.

\[J(\theta) = \sum_\tau P(\tau|\theta) G_0(\tau).\]

Then, $\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau|\theta) G_0(\tau)$. Using the log-derivative trick,

\[\nabla_\theta P(\tau|\theta) = P(\tau|\theta) \nabla_\theta \log P(\tau|\theta),\]

we get:

\[\nabla_\theta J(\theta) = \sum_\tau P(\tau|\theta) (\nabla_\theta \log P(\tau|\theta)) G_0(\tau) = E_{\tau \sim \pi_\theta} [(\nabla_\theta \log P(\tau|\theta)) G_0(\tau)].\]

The probability of a trajectory is $P(\tau|\theta) = p_0(s_0) \prod_{t=0}^\infty \pi_\theta(a_t|s_t) P(s_{t+1}|s_t, a_t)$. Thus, $\log P(\tau|\theta) = \log p_0(s_0) + \sum_{t=0}^\infty (\log \pi_\theta(a_t|s_t) + \log P(s_{t+1}|s_t, a_t))$. The gradient $\nabla_\theta$ only affects terms involving $\pi_\theta$:

\[\nabla_\theta \log P(\tau|\theta) = \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t).\]

Substituting this back, we get:

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \left( \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) \right) G_0(\tau) \right]\]

It can be shown that the term $\nabla_\theta \log \pi_\theta(a_k|s_k)$ is uncorrelated, in expectation, with rewards $r_{t'}$ for $t' \le k$: those rewards are already determined before the action $a_k$ is drawn, and the score has zero conditional mean given the history. This “causality” argument allows us to drop the rewards earned before step $k$ from the factor multiplying $\nabla_\theta \log \pi_\theta(a_k|s_k)$, replacing $G_0(\tau)$ with $\gamma^k G_k(\tau)$; dropping the remaining $\gamma^k$ factor as well, as is common in practice, leads to the following form:

\[\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right] \quad (*)\]

In this expression, $G_t = \sum_{k=0}^\infty \gamma^k r_{t+k+1}$ is the full discounted return experienced from state $s_t$ onwards within the trajectory $\tau$. The term $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is often called the “score function” for action $a_t$ in state $s_t$. The product of the score function and $G_t$ determines the direction and magnitude of the update for actions taken along a trajectory.
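
For a concrete parameterization, consider a tabular softmax policy with one logit per state-action pair, $\pi_\theta(a|s) = e^{\theta_{s,a}} / \sum_{a'} e^{\theta_{s,a'}}$ (a choice made here purely for illustration). Its score function has a simple closed form, sketched below:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(.|s) for a tabular softmax policy; theta has shape [n_states, n_actions]."""
    z = theta[s] - theta[s].max()      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def score(theta, s, a):
    """grad_theta log pi_theta(a|s) for the tabular softmax parameterization.

    Only row s of theta is involved:
    d log pi(a|s) / d theta[s, a'] = 1{a' = a} - pi(a'|s).
    """
    g = np.zeros_like(theta)
    g[s] = -softmax_policy(theta, s)
    g[s, a] += 1.0
    return g
```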

The gradient $E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right]$ is typically estimated using Monte Carlo sampling. For a single trajectory $\tau^{\text{sample}} \sim \pi_\theta$, an unbiased estimate of this gradient is $\sum_{t=0}^\infty (\nabla_\theta \log \pi_\theta(a_t|s_t)) G_t^{\text{sample}}$. A stochastic gradient ascent update then uses this estimate:

\[\theta \leftarrow \theta + \alpha \left( \sum_{t=0}^\infty (\nabla_\theta \log \pi_\theta(a_t|s_t)) G_t^{\text{sample}} \right)\]

where $G_t^{\text{sample}}$ is the return from the sampled trajectory $\tau^{\text{sample}}$. In practice, the sum is truncated at a finite horizon $H$.
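
Putting these pieces together, here is a minimal REINFORCE-style sketch of the update above for one sampled trajectory, again assuming the tabular softmax parameterization from the previous sketch (the trajectory itself is taken as given; no environment code is shown):

```python
import numpy as np

def softmax_policy(theta, s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_update(theta, states, actions, rewards, alpha=0.01, gamma=0.99):
    """One stochastic gradient ascent step from a single sampled trajectory.

    states[t], actions[t] are (s_t, a_t); rewards[t] plays the role of r_{t+1}.
    """
    T = len(rewards)
    # Returns-to-go G_t, computed backwards along the trajectory.
    G = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):
        g = rewards[t] + gamma * g
        G[t] = g
    # Accumulate sum_t grad log pi(a_t|s_t) * G_t using the closed-form score.
    grad = np.zeros_like(theta)
    for t in range(T):
        s, a = states[t], actions[t]
        grad[s] -= softmax_policy(theta, s) * G[t]
        grad[s, a] += G[t]
    return theta + alpha * grad
```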

Connection to value functions

To connect this to value functions, we can utilize the definition of the action-value function, $Q^{\pi_\theta}(s_t,a_t) = E_{\pi_\theta}[G_t | s_t, a_t]$. This states that $Q^{\pi_\theta}(s_t,a_t)$ is the expected value of the random variable $G_t$, conditioned on having been in state $s_t$ and taken action $a_t$, and subsequently following policy $\pi_\theta$.

Using the law of total expectation ($E[X] = E_Y[E[X|Y]]$), we can rewrite the expectation in $(*)$:

\[\begin{aligned} \nabla_\theta J(\theta) &= E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right] \\ &= \sum_{t=0}^\infty E_{\tau \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) G_t \right] \\ &= \sum_{t=0}^\infty E_{(s_t,a_t) \text{ in } \tau \text{ at step } t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) E[G_t | s_t, a_t, \pi_\theta] \right] \\ &= \sum_{t=0}^\infty E_{(s_t,a_t) \text{ in } \tau \text{ at step } t} \left[ \nabla_\theta \log \pi_\theta(a_t|s_t) Q^{\pi_\theta}(s_t,a_t) \right] \end{aligned}\]

Here, the expectation $E_{(s_t,a_t) \text{ in } \tau \text{ at step } t}[\cdot]$ denotes averaging over all possible state-action pairs $(s_t, a_t)$ that can occur at time $t$ when trajectories are generated according to $s_0 \sim p_0(s_0)$ and policy $\pi_\theta$. This form, $\nabla_\theta J(\theta) = E_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) Q^{\pi_\theta}(s_t,a_t) \right]$, shows that the policy gradient depends on the Q-values of the actions taken.
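
As a quick numerical sanity check of this identity, consider a made-up one-step “bandit” MDP (a single state, the episode ends after one action, so $G_0 = r_1$ and $Q(a) = E[r \mid a]$). The Monte Carlo average of the score times the sampled return should match the exact sum $\sum_a \pi(a)\,\nabla_\theta \log\pi(a)\,Q(a)$ up to sampling noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up one-step bandit: 3 actions, episode ends after one action,
# so G_0 = r_1 and Q(a) = E[r | a] = mean_reward[a].
mean_reward = np.array([1.0, 0.0, 2.0])
theta = np.array([0.2, -0.1, 0.0])              # softmax logits over the 3 actions
pi = np.exp(theta) / np.exp(theta).sum()

def score(a):
    g = -pi.copy()
    g[a] += 1.0
    return g                                     # grad_theta log pi(a)

# Exact gradient: sum_a pi(a) * grad log pi(a) * Q(a).
exact = sum(pi[a] * score(a) * mean_reward[a] for a in range(3))

# Monte Carlo estimate: average of grad log pi(a) * G over sampled episodes.
n = 100_000
actions = rng.choice(3, size=n, p=pi)
returns = mean_reward[actions] + rng.normal(scale=0.5, size=n)   # noisy rewards
mc = np.mean([score(a) * g for a, g in zip(actions, returns)], axis=0)

print(exact)
print(mc)    # agrees with `exact` up to sampling noise
```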

The Advantage Function: Improving Gradient Estimates

The variance of $G_t$ as an estimator for $Q^{\pi_\theta}(s_t,a_t)$ can be high. We can introduce a state-dependent baseline $b(s_t)$ into the policy gradient expression without changing its expectation:

\[\nabla_\theta J(\theta) = E_{\pi_\theta} [ \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) (Q^{\pi_\theta}(s_t,a_t) - b(s_t)) ].\]

This holds because

\[E_{a_t \sim \pi_\theta(\cdot|s_t)}[\nabla_\theta \log \pi_\theta(a_t|s_t) b(s_t)] = \sum_{a_t} \nabla_\theta \pi_\theta(a_t|s_t) b(s_t) = b(s_t) \nabla_\theta \sum_{a_t} \pi_\theta(a_t|s_t) = 0.\]

When using samples $G_t$ in place of $Q^{\pi_\theta}(s_t,a_t)$, the term in the gradient is $\nabla_\theta \log \pi_\theta(a_t|s_t) (G_t - b(s_t))$. The variance of the scalar factor $(G_t - b(s_t))$, conditioned on $s_t$, is minimized by choosing $b(s_t) = E[G_t|s_t] = V^{\pi_\theta}(s_t)$.

Thus, the optimal choice for $b(s_t)$ is $V^{\pi_\theta}(s_t)$. This leads to using the Advantage Function

\[A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t).\]

The policy gradient estimate becomes

\[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) (G_t - V_w(s_t)),\]

where $V_w(s_t)$ is an estimate of $V^{\pi_\theta}(s_t)$, and $(G_t - V_w(s_t))$ is an estimate of $A^{\pi_\theta}(s_t,a_t)$.
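
The sketch below illustrates the variance reduction empirically on the same kind of made-up one-step bandit: subtracting the baseline $b = V^\pi$ leaves the mean of the gradient estimate unchanged but shrinks its per-sample variance in this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same style of made-up one-step bandit: Q(a) = mean_reward[a], V = E_a[Q(a)].
mean_reward = np.array([1.0, 0.0, 2.0])
theta = np.array([0.2, -0.1, 0.0])
pi = np.exp(theta) / np.exp(theta).sum()
V = float(pi @ mean_reward)                      # baseline b(s) = V^pi(s)

def score(a):
    g = -pi.copy()
    g[a] += 1.0
    return g

n = 100_000
actions = rng.choice(3, size=n, p=pi)
returns = mean_reward[actions] + rng.normal(scale=0.5, size=n)

no_baseline = np.array([score(a) * g for a, g in zip(actions, returns)])
with_baseline = np.array([score(a) * (g - V) for a, g in zip(actions, returns)])

# The two estimators have the same mean (the baseline adds no bias) ...
print(no_baseline.mean(axis=0), with_baseline.mean(axis=0))
# ... but the baselined one has smaller total variance in this example.
print(no_baseline.var(axis=0).sum(), with_baseline.var(axis=0).sum())
```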

Actor-Critic Methods

Actor-Critic methods learn two distinct components:

  1. The Actor: A policy $\pi_\theta(a|s)$, parameterized by $\theta$, which determines how the agent behaves.
  2. The Critic: A state-value function approximator $V_w(s)$, parameterized by $w$, which estimates $V^{\pi_\theta}(s)$, the value of states under the current actor’s policy.

These components are learned concurrently.

Critic Learning: The critic $V_w(s)$ (commonly a neural network) aims to approximate $V^{\pi_\theta}(s)$. It is trained by minimizing a loss function that measures the discrepancy between its predictions and target values derived from experience. For TD(0) learning, after observing a transition $(s_t, a_t, r_{t+1}, s_{t+1})$, the target for $V_w(s_t)$ is

\[y_t = r_{t+1} + \gamma V_w(s_{t+1}).\]

The critic’s parameters $w$ are updated to minimize the squared error

\[(y_t - V_w(s_t))^2.\]

The gradient of $\frac{1}{2}(y_t - V_w(s_t))^2$ with respect to $w$ is $-(y_t - V_w(s_t))\nabla_w V_w(s_t)$ when $y_t$ is treated as a fixed target during differentiation. This is a “semi-gradient” update rather than a true gradient step: the target $y_t$ itself contains $V_w(s_{t+1})$, but that dependence on $w$ is ignored when computing the update for $V_w(s_t)$.

Given this gradient estimate, we update the critic’s parameters with the stochastic gradient step:

\[w \leftarrow w + \alpha_w (r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)) \nabla_w V_w(s_t)\]

Here, the term

\[\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)\]

is called the TD error. This TD error serves as an estimate of the advantage $A^{\pi_\theta}(s_t,a_t)$. To see this relationship, recall $A^{\pi_\theta}(s_t,a_t) = Q^{\pi_\theta}(s_t,a_t) - V^{\pi_\theta}(s_t)$. By definition, $Q^{\pi_\theta}(s_t,a_t) = E_{\pi_\theta}[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1}) | s_t, a_t]$. So,

\[A^{\pi_\theta}(s_t,a_t) = E_{\pi_\theta}[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t) | s_t, a_t].\]

The TD error $\delta_t = (r_{t+1} + \gamma V_w(s_{t+1})) - V_w(s_t)$ is thus a sample-based estimate of the quantity inside this expectation, where $V_w$ approximates $V^{\pi_\theta}$, and the single observed $(r_{t+1}, s_{t+1})$ pair replaces the expectation over possible next states and rewards.

The TD error $\delta_t$ is a biased estimate of $A^{\pi_\theta}(s_t,a_t)$ if $V_w \neq V^{\pi_\theta}$. However, $\delta_t$ can have lower variance as an advantage estimator compared to the estimate $(G_t^{\text{sample}} - V_w(s_t))$. The return $G_t^{\text{sample}} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ accumulates noise from many future random rewards. The TD error replaces the sum of all future discounted rewards beyond $r_{t+1}$ (i.e., $\gamma G_{t+1}^{\text{sample}}$) with a single bootstrapped estimate $\gamma V_w(s_{t+1})$. If $V_w(s_{t+1})$ is a reasonably stable (even if biased) estimate of $E[G_{t+1}|s_{t+1}]$, then the variance of $r_{t+1} + \gamma V_w(s_{t+1})$ can be much lower than the variance of $r_{t+1} + \gamma G_{t+1}^{\text{sample}}$.
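
For a tabular critic, the semi-gradient TD(0) update described above reduces to a one-line rule, since $\nabla_w V_w(s_t)$ is just an indicator of the visited state. A minimal sketch (the `done` flag, which drops the bootstrap term at terminal transitions, is an implementation detail added here for completeness):

```python
import numpy as np

def td0_update(V, s, r_next, s_next, done, alpha=0.1, gamma=0.99):
    """Semi-gradient TD(0) update for a tabular critic V (an array indexed by state).

    The target r_next + gamma * V[s_next] is treated as a constant, exactly as
    described above. `done` drops the bootstrap term at terminal transitions.
    """
    target = r_next + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]       # the TD error delta_t
    V[s] += alpha * delta       # tabular case: grad_w V_w(s_t) is an indicator of s_t
    return delta
```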

Actor Update: The actor’s parameters $\theta$ are updated using the TD error $\delta_t$ as the estimate of the advantage.

In an online (per-step) setting, the actor update is performed after each transition, using the $\delta_t$ computed for that step:

\[\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \delta_t\]

This update adjusts the policy immediately based on the most recent experience.

In a batch setting, after collecting a batch of $N$ transitions and their corresponding TD errors ${(s_i, a_i, \delta_i)}_{i=1}^N$, the actor update sums these contributions:

\[\theta \leftarrow \theta + \alpha_\theta \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(a_i|s_i) \delta_i\]

This update adjusts the policy based on the aggregated experience in the batch. In both cases, the update aims to make actions that lead to a positive TD error (indicating a better-than-expected outcome relative to $V_w(s_t)$) more probable, and actions leading to a negative TD error less probable.
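
Below is a minimal online one-step actor-critic loop combining the critic and actor updates above, run on a made-up two-state episodic environment (the environment, step sizes, and episode count are illustrative choices, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up episodic environment with 2 non-terminal states and 2 actions,
# used only to exercise the update rules; any environment exposing the same
# reset/step interface would do.
N_STATES, N_ACTIONS, TERMINAL = 2, 2, 2

def env_reset():
    return 0

def env_step(s, a):
    if s == 0:
        # Action 1 pays 1 and moves forward; action 0 pays 0 and usually stays.
        s_next = 1 if (a == 1 or rng.random() < 0.2) else 0
        return s_next, float(a), False
    # From state 1 the episode ends; action 1 pays more.
    return TERMINAL, 2.0 if a == 1 else 0.0, True

gamma, alpha_w, alpha_theta = 0.99, 0.1, 0.05
V = np.zeros(N_STATES + 1)                  # tabular critic; V[TERMINAL] stays 0
theta = np.zeros((N_STATES, N_ACTIONS))     # tabular softmax actor

def policy(s):
    z = theta[s] - theta[s].max()
    p = np.exp(z)
    return p / p.sum()

for episode in range(2000):
    s, done = env_reset(), False
    while not done:
        pi_s = policy(s)
        a = rng.choice(N_ACTIONS, p=pi_s)
        s_next, r, done = env_step(s, a)

        # Critic: one-step TD error and semi-gradient update.
        delta = r + gamma * V[s_next] - V[s]
        V[s] += alpha_w * delta

        # Actor: score-function update weighted by the TD error,
        # i.e. theta[s] += alpha * delta * (onehot(a) - pi(.|s)).
        theta[s] -= alpha_theta * delta * pi_s
        theta[s, a] += alpha_theta * delta

        s = s_next

print(V[:N_STATES])
print(policy(0), policy(1))   # the actor should come to prefer a = 1 in both states
```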

Proximal Policy Optimization (PPO): Constraining Policy Updates for Improved Performance

Policy gradient methods update policy parameters $\theta$. When these updates are based on data collected under a previous policy $\pi_{\theta_{old}}$ (the “old” or behavior policy), and advantage estimates $\hat{A}_t^{\theta_{old}}$ are computed using $\pi_{\theta_{old}}$’s value function, large discrepancies between the new policy $\pi_\theta$ and $\pi_{\theta_{old}}$ can make these advantage estimates inaccurate for evaluating $\pi_\theta$. This can degrade performance. Proximal Policy Optimization (PPO) introduces mechanisms to obtain more reliable policy improvements by constraining how much the policy can change in each update step.

The primary objective is to find a new policy $\pi_\theta$ such that its expected return $J(\theta)$ is greater than $J(\theta_{old})$. A fundamental result from policy iteration theory (related to Kakade & Langford, 2002) provides an exact expression for this performance difference:

\[J(\theta) - J(\theta_{old}) = E_{s \sim d^{\pi_\theta}} \left[ E_{a \sim \pi_\theta(\cdot|s)} [A^{\pi_{\theta_{old}}}(s,a)] \right]\]

This identity states that the improvement in expected return is equal to the expected advantage of the actions taken by the new policy $\pi_\theta$, where these advantages $A^{\pi_{\theta_{old}}}(s,a) = Q^{\pi_{\theta_{old}}}(s,a) - V^{\pi_{\theta_{old}}}(s)$ are calculated with respect to the old policy $\pi_{\theta_{old}}$. The outer expectation is over states $s$ visited according to the state visitation distribution $d^{\pi_\theta}$ induced by the new policy $\pi_\theta$.

Directly optimizing $J(\theta) - J(\theta_{old})$ using this expression is challenging because $d^{\pi_\theta}$ depends on the parameters $\theta$ being optimized, making the expectation difficult to compute or differentiate.

To make optimization tractable, we form a local approximation. The first step is to replace the expectation over states visited by the new policy, $s \sim d^{\pi_\theta}$, with an expectation over states visited by the old policy, $s \sim d^{\pi_{\theta_{old}}}$. This substitution is more accurate the closer $\pi_\theta$ is to $\pi_{\theta_{old}}$ (which implies $d^{\pi_\theta} \approx d^{\pi_{\theta_{old}}}$).

So, we approximate:

\[E_{s \sim d^{\pi_\theta}} \left[ E_{a \sim \pi_\theta(\cdot|s)} [A^{\pi_{\theta_{old}}}(s,a)] \right] \approx E_{s \sim d^{\pi_{\theta_{old}}}} \left[ E_{a \sim \pi_\theta(\cdot|s)} [A^{\pi_{\theta_{old}}}(s,a)] \right]\]

Now, the inner expectation $E_{a \sim \pi_\theta(\cdot|s)} [A^{\pi_{\theta_{old}}}(s,a)]$ is still with respect to actions from the new policy $\pi_\theta$. Since we are working with data (trajectories) generated by $\pi_{\theta_{old}}$, we use importance sampling to rewrite this inner expectation in terms of actions $a \sim \pi_{\theta_{old}}(\cdot|s)$:

\[\begin{aligned} E_{a \sim \pi_\theta(\cdot|s)} [A^{\pi_{\theta_{old}}}(s,a)] &= \sum_a \pi_\theta(a|s) A^{\pi_{\theta_{old}}}(s,a) \\ &= \sum_a \pi_{\theta_{old}}(a|s) \frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s,a) \end{aligned}\]

Substituting this back into the approximation for $J(\theta) - J(\theta_{old})$, we obtain a surrogate objective for the performance improvement (ignoring the constant $J(\theta_{old})$ for maximization purposes):

\[L_{\theta_{old}}^{IS}(\theta) = E_{s \sim d^{\pi_{\theta_{old}}}, a \sim \pi_{\theta_{old}}(\cdot|s)} \left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s,a)\right]\]

Here, $\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the importance sampling ratio for a state-action pair $(s_t, a_t)$ from data collected under $\pi_{\theta_{old}}$. This ratio re-weights the advantage $A^{\pi_{\theta_{old}}}(s_t,a_t)$ to account for the relative probability of taking action $a_t$ under $\pi_\theta$ versus $\pi_{\theta_{old}}$. The objective $L_{\theta_{old}}^{IS}(\theta)$ provides an estimate of policy performance improvement, which is a first-order approximation accurate when $\pi_\theta$ is close to $\pi_{\theta_{old}}$. Maximizing $L_{\theta_{old}}^{IS}(\theta)$ aims to increase the probability of actions that had high advantages under $\pi_{\theta_{old}}$, correctly weighted for the policy change.

This expression for $L_{\theta_{old}}^{IS}(\theta)$ is an expectation over state-action pairs $(s,a)$. The sampling process defined by the expectation is as follows: first, a state $s$ is drawn according to $d^{\pi_{\theta_{old}}}(s)$, the distribution representing the frequency of visiting state $s$ under policy $\pi_{\theta_{old}}$. Then, given $s$, an action $a$ is drawn according to $\pi_{\theta_{old}}(a|s)$. The term $d^{\pi_{\theta_{old}}}(s) \pi_{\theta_{old}}(a|s)$ thus represents the joint probability (or probability density) of the pair $(s,a)$ under this process. The expectation can be written explicitly as:

\[L_{\theta_{old}}^{IS}(\theta) = \sum_s \sum_a \left( d^{\pi_{\theta_{old}}}(s) \pi_{\theta_{old}}(a|s) \right) \left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{old}}(a|s)} A^{\pi_{\theta_{old}}}(s,a)\right]\]

In practice, this expectation is estimated from data. We execute $\pi_{\theta_{old}}$ in the environment to generate a batch of trajectories. Each time step $t$ within these trajectories yields a specific state-action pair $(s_t, a_t)$ that was encountered. This collection of observed $(s_t, a_t)$ pairs from the batch forms an empirical distribution over state-action pairs. Under suitable conditions (e.g., ergodicity of the Markov chain induced by $\pi_{\theta_{old}}$ and a sufficiently large batch of data), this empirical distribution approximates the true underlying joint distribution $P(s,a|\pi_{\theta_{old}}) = d^{\pi_{\theta_{old}}}(s) \pi_{\theta_{old}}(a|s)$. Consequently, a Monte Carlo estimate of $L_{\theta_{old}}^{IS}(\theta)$ is formed by averaging the term $\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} \hat{A}^{\pi_{\theta_{old}}}(s_t,a_t)\right]$ over all $(s_t, a_t)$ pairs in the collected batch (i.e., samples from the empirical distribution), using an estimated advantage $\hat{A}^{\pi_{\theta_{old}}}(s_t,a_t)$ for each.
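
Assuming per-step log-probabilities and advantage estimates have already been computed for the collected batch, the Monte Carlo estimate of $L_{\theta_{old}}^{IS}(\theta)$ is a one-line average; a sketch:

```python
import numpy as np

def surrogate_is(logp_new, logp_old, adv):
    """Monte Carlo estimate of L^{IS}_{theta_old}(theta) from a collected batch.

    logp_new[t] = log pi_theta(a_t|s_t), logp_old[t] = log pi_theta_old(a_t|s_t),
    adv[t] = estimated advantage A-hat(s_t, a_t), all evaluated at the (s_t, a_t)
    pairs gathered by running pi_theta_old.
    """
    ratio = np.exp(logp_new - logp_old)   # importance ratios rho_t(theta)
    return float(np.mean(ratio * adv))
```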

One could in principle optimize $L_{\theta_{old}}^{IS}(\theta)$ directly using the estimation strategy above. However, if $\rho_t(\theta)$ becomes very large or very small (i.e., $\pi_\theta$ differs significantly from $\pi_{\theta_{old}}$), the sample-based gradient of $L_{\theta_{old}}^{IS}(\theta)$ can have high variance. PPO addresses this by further modifying the surrogate objective.

The PPO clipped objective $L^{CLIP}(\theta)$ builds upon this same principle of empirical estimation. It is an average where, for each observed time step $t$ in the collected batch, the core term $\rho_t(\theta) \hat{A}_t^{\theta_{old}}$ is first subject to the clipping mechanism before being included in the average:

\[L^{CLIP}(\theta) = \hat{E}_t \left[ \min\left( \rho_t(\theta) \hat{A}_t^{\theta_{old}} , \operatorname{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t^{\theta_{old}} \right) \right]\]

Here, $\rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the importance sampling ratio defined above, $\hat{A}_t^{\theta_{old}}$ is the estimated advantage at step $t$ (computed using $\pi_{\theta_{old}}$’s value function), and $\epsilon$ is a small hyperparameter (e.g., $0.1$ or $0.2$) that bounds how far the ratio may move from $1$.

The purpose of the $\min$ and $\operatorname{clip}$ operations is to form a more conservative (pessimistic) objective than $L_{\theta_{old}}^{IS}(\theta)$: whenever pushing $\rho_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$ would make a term look better, the clipped value is used instead, effectively creating a lower bound on the expected improvement when the policy change is small.

Thus, the PPO objective $L^{CLIP}(\theta)$ is a practical sample-based objective that incrementally improves the policy. It processes each time step $(s_t, a_t)$ from trajectories run under $\pi_{\theta_{old}}$, calculates the importance-weighted advantage, applies the clipping rule to this term, and then averages these clipped terms over the entire batch of collected experiences. This averaging provides a Monte Carlo estimate of the expectation of the clipped quantity, which serves as the surrogate objective for improving the policy.
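
Finally, here is a sketch of the corresponding Monte Carlo estimate of $L^{CLIP}(\theta)$, with the same (assumed precomputed) inputs as the plain importance-sampled surrogate and a clipping range $\epsilon$ (0.2 is a commonly used value):

```python
import numpy as np

def surrogate_clip(logp_new, logp_old, adv, eps=0.2):
    """Monte Carlo estimate of the PPO clipped objective L^{CLIP}(theta).

    Inputs are as for the plain importance-sampled surrogate; eps is the clipping
    range. Each per-step term is the minimum of the unclipped and clipped
    importance-weighted advantages, so moving the ratio outside [1-eps, 1+eps]
    is never credited with extra improvement.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

# Worked example: the first step's ratio (1.8) is clipped to 1.2 because its
# advantage is positive; the second step's negative term is kept pessimistic.
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.3]))
adv = np.array([1.0, -1.0])
print(surrogate_clip(logp_new, logp_old, adv))   # (1.2 + (-0.8)) / 2 = 0.2
```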