Understanding Policy Gradient - a fundamental idea in RL

Understanding Policy Gradient - a fundamental idea in RL
Training with Vanilla Policy Gradient as imagined by MS Bing. I love the upward arrow — it reflects the principle at the core of the method: gradient ascent.

How do you begin to learn Reinforcement Learning?

My preferred approach is to study code. Reading and analyzing code can disambiguate many ideas and concepts that are hard to grasp from papers or blog posts alone.

A policy I trained using Vanilla Policy Gradient on the Cartpole-v1 environment. The code is here.

Crafting an optimal policy by learning value functions is a very straightforward idea. A good example is Deep Q-Learning, which I covered in an earlier blog post.

But policy gradient methods are a completely different beast! They're unlike anything you'll have encountered in the supervised or semi-supervised deep learning world.

If you have a background in probability and can follow math notation, I highly recommend the Intro to Policy Optimization from Spinning Up by OpenAI.

Beyond a background in probability, the prerequisites include:

  • understanding how log odds (logits) become probabilities once passed through a sigmoid or softmax
  • understanding the negative log-likelihood loss
  • a working grasp of the chain rule

In this blog post, I'll explain the main ideas behind policy gradient without any mathematical heavy lifting.

This should let anyone jump into policy gradient methods right away, and make it easier to patch up the missing theoretical pieces later, if you choose to.

And there's value in this even if you can already follow the math with ease. It's worth interacting with the ideas you're learning at different levels of complexity and rigor — there's something to be gained from understanding the why behind the operations, and the bigger picture of the method.

So let's get started.

One equation to rule them all

The equation below is at the heart of policy gradient methods.

\(J\) is the expected return of using policy \(\pi), parametrized by weights \(\theta).

To make this concrete, \(\pi) can be a neural network that takes in observations (the state of the environment) and outputs a distribution over possible actions. We sample from that distribution, and that's how we decide which action to take.

What we ultimately want is to maximize the expected return — and we do that by modifying our policy's parameters (\theta).

So far, so good.

The right-hand side of the equation is the recipe for calculating \(J\), the expected return. It's an expectation over trajectories sampled from policy \(\pi\), with a reward assigned to each trajectory by the reward function \(R\).

What a mouthful!

But the idea is simple:

  • collect a handful of trajectories by following policy \(\pi\)
  • for each trajectory, generate a reward using function \(R\)

Once you have enough trajectories, take the mean of their rewards — and that's your expected return!

There's one crucial thing to keep in mind here. The higher the variance of your data, the larger the sample size you need to estimate the mean accurately.

So the higher the variance of what sits inside the square brackets on the right (\(R(\tau)\)), the more of it — the trajectories and their associated rewards — we'll need to collect for a good estimate. Otherwise, we risk our policy not training well.

Great — we now have a way to estimate expected returns.

To maximize them, we do the most straightforward thing in the universe: we calculate the policy's gradient and take a step in the positive direction!

We perform gradient ascent.

But how do you calculate the gradient of a policy?

Derivation of simple policy gradient

This is where the scary-looking math shows up:

Derivation of policy gradient from Spinning Up by OpenAI.

But, again, the idea is simple. We can skip past the derivation straight to the bottom line.

The \(\mathop{\mathbb{E}}\) tells us, once more, that we'll be dealing with a mean of values collected over enough trajectories. Inside the square brackets, we take the logarithm of the probabilities our model outputs and weight it by rewards.

The plan is to increase the log probs across the board — but not equally! The higher the reward associated with a log prob, the more we want it to increase.

So over a batch of trajectories, the gradients tell us how to modify the parameters \(\theta\) so that the log probs of actions tied to higher rewards go up.

It doesn't matter that log probs are always negative. The closer they are to 0, the closer the probability is to 1 — and we want to push the probability of high-reward actions as close to 1 as we can.

There's just one minor detail to handle in the code.

The PyTorch optimizer minimizes the value whose gradients we computed. That's because it's built for cost functions, which typically output larger values the greater the gap between predictions and labels; the goal there is to shrink that gap.

None of that applies here. There are no labels — we only adjust the parameters of our policy to increase the score.

Still, since we're using an off-the-shelf optimizer, we add a minus sign in front of the expected return. By minimizing the negative of \(J\), we increase \(J\) itself!

Improving our algorithm

The more we can reduce the variance of the collected data, the better and faster Vanilla Policy Gradient will train.

One useful idea is to change how we assign rewards to each observation-action pair.

In the original formulation, we took the reward at the end of a trajectory and assigned it uniformly to every action. But if you think about it — why reward or penalize the policy for actions it took in the past? The only thing an action at a given timestep can influence is the future!

This is where reward-to-go comes in.

Instead of using the sum of rewards across all actions, we sum the rewards from each timestep forward. For a given action, we look at the rewards collected from that point until the end of the episode, and use their sum as the reward for that action.

The expected grad-log-prob lemma is another result we can lean on here.

\underE{x \sim P_{\theta}}{\nabla_{\theta} \log P_{\theta}(x)} = 0.

The consequence of the lemma is that for any function depending only on state, we get a gradient of 0. That means we can add or subtract as many functions of this form as we like to our Vanilla Policy Gradient calculation.

This is useful, because such functions — called baselines — let us reduce the variance of our estimated returns!

One natural choice of baseline is the on-policy value function:

b(s_t) = V^{\pi}(s_t)

We subtract the estimated value at every step and work with the result. In tests, this reliably reduces the variance of the sample estimate for the policy gradient.

Intuitively, we can read it like this:

We don't really care about the raw reward of a given action. What we want to know is whether the chosen action leads to a better outcome than we'd expect on average from the current situation.

A related idea is the advantage function.

Advantage functions come in different shapes and sizes; you can read more about them in High-Dimensional Continuous Control Using Generalized Advantage Estimation.

\nabla_{\theta} J(\pi_{\theta}) = \underE{\tau \sim \pi_{\theta}}{\sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t |s_t) \left(\sum_{t'=t}^T R(s_{t'}, a_{t'}, s_{t'+1}) - b(s_t)\right)}.
reward-to-go and a baseline added to the calculation of the gradients of the expected returns

What's neat about advantage functions is this: if an action is better than what we'd expect from our policy, the modified reward term comes out positive. If it would lead to a worse outcome, the term comes out negative.

So we either increase or decrease the probability of an action based on how it compares to our policy's average performance!

Summary

When do you understand something?

Did Newton understand gravity? Or did we only learn what gravity is with Einstein and general relativity?

I used to believe I had to understand every little bit of anything I wanted to use. That's a very insecure and limiting way to live.

I want math to make sense to me — but understanding the details without grasping the bigger picture is useless. And there's no shame in stopping at a level of understanding that fits your situation and your goals.

Sometimes going down a rabbit hole is the most rewarding, most enjoyable thing you can do; sometimes it's just a giant waste of time.

I wrote this blog post so that you and I can choose where to stop — at whatever level lets us pursue what we're after most effectively.