RL Algorithm: PPO

RL foundations and Proximal Policy Optimization (PPO) Algorithm

Based on PPO by RethinkFun📒

Policy Gradient

The objective is to maximize the Expected Return \(R\) (the accumulated reward) by optimizing the Policy parameters \(\theta\).

First, consider the Return \(R\) of the whole trajectory \(\tau\), and take the gradient of its expectation w.r.t. \(\theta\):

\[ \begin{aligned} \nabla E_{\tau \sim P_{\theta}(\tau)}[R(\tau)] & = \nabla\sum_{\tau} R(\tau) P_{\theta}(\tau) \\\\ & = \sum_{\tau} P_{\theta}(\tau)R(\tau) \frac{\nabla P_{\theta}(\tau)}{P_{\theta}(\tau)} \end{aligned} \]

Use sampling to approximate the expectation:

\[ \begin{aligned} \sum_{\tau} P_{\theta}(\tau)R(\tau) \frac{\nabla P_{\theta}(\tau)}{P_{\theta}(\tau)} & \approx \frac{1}{N}\sum^N_{n=1} R(\tau^n) \frac{\nabla P_{\theta}(\tau^n)}{P_{\theta}(\tau^n)} \\\\ & = \frac{1}{N}\sum^N_{n=1} R(\tau^n) \nabla \log P_{\theta}(\tau^n) \end{aligned} \]

Then, expand \(P_{\theta}(\tau)\) into the per-step action probabilities (the environment transition probabilities do not depend on \(\theta\), so they vanish under the gradient):

\[ \begin{aligned} \frac{1}{N}\sum^N_{n=1} R(\tau^n) \nabla \log P_{\theta}(\tau^n) & = \frac{1}{N}\sum^N_{n=1} R(\tau^n) \nabla \log \prod^{T_n}_{t=1}{P_\theta(a^t_n|s^t_n)} \\\\ & = \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} R(\tau^n) \nabla \log P_\theta(a^t_n|s^t_n) \end{aligned} \]

Interpretation of this gradient:

  • It increases the probabilities of the actions taken in a trajectory when the return is positive.
  • It decreases those probabilities when the return is negative.
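
As a concrete illustration, here is a minimal sketch of how this estimator is typically implemented as a loss for automatic differentiation. PyTorch and the tensor names (`log_probs`, `returns`) are assumptions for illustration, not part of the derivation above:

```python
import torch

# Hypothetical sampled batch: log P_theta(a|s) for each step (flattened over
# trajectories), and the return R(tau) of the trajectory each step belongs to.
log_probs = torch.randn(128, requires_grad=True)  # stands in for policy outputs
returns = torch.randn(128)                        # R(tau^n), no gradient needed

# Maximizing E[R * log P] is implemented by minimizing its negative mean;
# autograd then reproduces the gradient estimator above.
loss = -(returns * log_probs).mean()
loss.backward()
```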

Improve the Calculation of Return \(R\)

Discounted Future Return
  1. An action \(a_t\) only affects the Return in the future.
  2. The influence of an action \(a_t\) is discounted gradually over the subsequent timesteps.

We can modify the Return \(R\) to take the timestep \(t\) and the discount factor \(\gamma\) into account:

\[ R(\tau^n) \rightarrow \sum^{T_n}_{t'=t} \gamma^{t'-t}r^n_{t'}=R^n_t \]
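
A minimal sketch of this discounted future return ("reward-to-go") in plain Python; the reward values and \(\gamma\) below are made up for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'} for one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the already-discounted future sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # ~[2.62, 1.8, 2.0]
```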

Relative Expected Return

There could be situations where all actions happen to have positive (or negative) returns at the same time. In such cases, the preference between actions (their relative returns) is more representative and meaningful.

We can subtract a baseline value to make the returns of the actions relative:

\[ \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} R^n_t \log P_\theta(a^t_n|s^t_n) \rightarrow \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} (R^n_t - B(s^t_n)) \log P_\theta(a^t_n|s^t_n) \]

The baseline value \(B(s^t_n)\) is supposed to depend only on the state.

Advantage

The sampled Return suffers from high variance. Thus, fitted functions can be introduced to estimate the values we need.

  • Action-Value Function \(Q(s, a)\): Estimate the expected Return of taking action \(a\) in state \(s\).

  • State-Value Function \(V(s)\): Estimate the expected Return of the state \(s\).

  • Advantage Function \(A(s, a) = Q(s, a) - V(s)\): The relative Return of taking action \(a\) in state \(s\).

We can further simplify these functions by representing \(Q\) in terms of \(V\):

\[ Q_\theta(s_t, a_t) = r_t + \gamma V_\theta(s_{t+1}) \]

where \(r_t\) is the reward received at timestep \(t\).

Then, the Advantage Function can be represented by the State-Value Function:

\[ A_\theta(s_t, a_t) = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \]

With this simplification, we only need to fit a single State-Value Function \(V\) during training.
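
A minimal sketch of this one-step advantage; `value_fn` is a placeholder for the fitted critic \(V_\theta\):

```python
def one_step_advantage(value_fn, s_t, r_t, s_next, gamma=0.99, done=False):
    """A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t).

    `done` zeroes out the bootstrap term at the end of an episode.
    """
    bootstrap = 0.0 if done else gamma * value_fn(s_next)
    return r_t + bootstrap - value_fn(s_t)
```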

Actor-Critic Framework: the actor is the policy, and the critic estimates the state values.

The Advantage Value can be used directly in the previous objective function:

\[ \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_\theta(s^t_n, a^t_n) \log P_\theta(a^t_n|s^t_n) \]

Estimate Advantage with Multi-step Sampling

We can approximate the State-Value using the sampled reward and the value of the following state:

\[ V_\theta(s_{t+1}) \approx r_{t+1} + \gamma V_\theta(s_{t+2}) \]

Then, we can calculate the advantages with different sampling steps:

\[ \begin{aligned} A_\theta^1(s_t, a) & = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_{t}) \\\\ A_\theta^2(s_t, a) & = r_t + \gamma r_{t+1} + \gamma^2 V_\theta(s_{t+2}) - V_\theta(s_{t}) \\\\ A_\theta^3(s_t, a) & = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V_\theta(s_{t+3}) - V_\theta(s_{t}) \end{aligned} \]

Note: the more sampling steps, the higher the variance and the lower the bias, and vice versa.

We can also introduce the temporal-difference error \(\delta\) to simplify the formula:

\[ \delta_t^V = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \]

Then, the multi-step sampling advantage function can be simplified:

\[ \begin{aligned} A_\theta^1(s_t, a) & = \delta^V_t \\\\ A_\theta^2(s_t, a) & = \delta^V_t + \gamma \delta^V_{t+1} \\\\ A_\theta^3(s_t, a) & = \delta^V_t + \gamma \delta^V_{t+1} + \gamma^2 \delta^V_{t+2} \end{aligned} \]
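
As a small sketch, the \(k\)-step advantage can be assembled from precomputed \(\delta^V_t\) values (here a plain Python list named `deltas`, a hypothetical input):

```python
def k_step_advantage(deltas, t, k, gamma=0.99):
    """A^k(s_t, a) = delta_t + gamma*delta_{t+1} + ... + gamma^(k-1)*delta_{t+k-1}."""
    return sum(gamma ** i * deltas[t + i] for i in range(k))
```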

Generalized Advantage Estimation (GAE)

How to balance the bias and variance in the advantage calculation?

GAE: weighted average of advantage estimations with different sampling steps.

\[ \begin{aligned} A_\theta^{GAE}(s_t, a) & = (1 - \lambda)(A_\theta^1 + \lambda A_\theta^2 + \lambda^2 A_\theta^3 + ...) \\\\ & = (1 - \lambda)(\delta^V_t (1+\lambda+\lambda^2+...) + \gamma \delta^V_{t+1}(\lambda+\lambda^2+...) + ...) \\\\ & = (1 - \lambda)(\delta^V_t \frac{1}{1-\lambda} + \gamma \delta^V_{t+1} \frac{\lambda}{1-\lambda} + ...) \\\\ & = \sum^\infty_{b=0}(\gamma\lambda)^b\delta^V_{t+b} \end{aligned} \]

where \(\lambda\) is the decay factor applied across the sampling steps.
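
In practice the infinite sum is truncated at the end of the trajectory and computed with the backward recursion \(A_t = \delta^V_t + \gamma\lambda A_{t+1}\), which is equivalent to the sum above. A minimal sketch in plain Python; `rewards` and `values` are hypothetical sampled data, with `values` holding one extra entry for the bootstrap value of the final state:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_{T-1}), V(s_T)]  (bootstrap value appended)
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t^V
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages
```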

Putting it all together, we get the new policy gradient:

\[ \delta^V_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \]

\[ A_\theta^{GAE}(s_t, a) = \sum^\infty_{b=0}(\gamma\lambda)^b\delta^V_{t+b} \]

\[ \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_\theta^{GAE}(s^t_n, a^t_n) \nabla \log P_\theta(a^t_n|s^t_n) \]

where \(V(s)\) and \(P(a|s)\) are the only two functions that need to be fitted.

Typically, \(V\) shares network weights with the Policy and is distinguished by a separate value head. The labels used to train the Value Network are the discounted returns from the samples:

\[ \mathrm{Label}(s^t_n) = \sum^{T_n}_{t'=t}\gamma^{t'-t}r^n_{t'} \]
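
A sketch of the corresponding regression loss on the value head, assuming PyTorch; `value_pred` would come from the shared network's value head and `value_target` from the discounted-return labels above, and mean-squared error is a common (not the only) choice:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: predicted V(s_t) from the value head, and the
# discounted-return labels computed from the sampled trajectories.
value_pred = torch.randn(128, requires_grad=True)
value_target = torch.randn(128)

value_loss = F.mse_loss(value_pred, value_target)
value_loss.backward()
```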

Proximal Policy Optimization (PPO)

How to handle situations where the sampling policy is different from the policy being trained?

  • On-Policy: the sampling Policy is the same as the Policy being trained.
  • Off-Policy: the sampling Policy differs from the Policy being trained.

Basic Idea: adjust the update of the target Policy \(p\) according to the outcomes observed under the sampling Policy \(q\). For example, if the advantage estimated under \(q\) is negative and \(p\) assigns a higher probability to that action than \(q\) does, then \(p\) should learn more from that sample.

Importance Sampling (IS)

Approximate the expectation using a different sampling distribution:

\[ \begin{aligned} E(f(x))_{x\sim p(x)} & = \sum_x f(x) p(x) \\\\ & = \sum_x f(x) \frac{p(x)}{q(x)} q(x) \\\\ & = E\left(f(x)\frac{p(x)}{q(x)}\right)_{x\sim q(x)} \\\\ & \approx \frac{1}{N}\sum^N_{n=1} f(x_n)\frac{p(x_n)}{q(x_n)}, \quad x_n \sim q(x) \end{aligned} \]
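
A tiny numerical illustration of importance sampling in plain Python; the choice of \(p = \mathcal{N}(0,1)\), \(q = \mathcal{N}(1,2)\) and \(f(x) = x^2\) is arbitrary:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Estimate E_{x~p}[f(x)] with p = N(0, 1) and f(x) = x^2 (true value: 1.0),
# while sampling from a different distribution q = N(1, 2).
random.seed(0)
N = 200_000
total = 0.0
for _ in range(N):
    x = random.gauss(1.0, 2.0)                             # x ~ q
    w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 2.0)  # p(x) / q(x)
    total += (x ** 2) * w
print(total / N)  # close to 1.0
```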

PPO Objective Function

Introduce IS into the GAE objective function:

\[ \begin{aligned} \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_\theta^{GAE}(s^t_n, a^t_n) \nabla \log P_\theta(a^t_n|s^t_n) & \approx \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_{\theta'}^{GAE}(s^t_n, a^t_n) \frac{P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)} \nabla \log P_\theta(a^t_n|s^t_n) \\\\ & = \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_{\theta'}^{GAE}(s^t_n, a^t_n) \frac{\nabla P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)} \end{aligned} \]

Negating this objective gives the loss function of PPO:

\[ \mathrm{Loss} = - \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_{\theta'}^{GAE}(s^t_n, a^t_n) \frac{P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)} \]
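
A minimal PyTorch sketch of this loss; the ratio is computed in log space from stored log-probabilities, and all tensors here are hypothetical stand-ins for sampled data:

```python
import torch

log_probs_new = torch.randn(128, requires_grad=True)  # log P_theta(a|s), current policy
log_probs_old = torch.randn(128)                       # log P_theta'(a|s), frozen sampling policy
advantages = torch.randn(128)                          # A^GAE estimated under theta'

ratio = torch.exp(log_probs_new - log_probs_old)       # P_theta / P_theta'
loss = -(ratio * advantages).mean()
```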

Training performance degrades when the distance between the sampling Policy and the target Policy becomes too large: when the probability of an action differs greatly between the two policies, the training signal becomes less meaningful for the target Policy.

Solution 1: add a KL Divergence penalty to the loss function

\[ \mathrm{Loss}_{ppo} = - \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} A_{\theta'}^{GAE}(s^t_n, a^t_n) \frac{P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)} + \beta KL(P_\theta, P_{\theta'}) \]

The KL Divergence term penalizes a large distance between the two policies, constraining the shift between them.
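
A sketch of the KL-penalized variant, assuming discrete actions and PyTorch's `torch.distributions`; the logits, batch shapes, and \(\beta\) value are illustrative, and some implementations use the reverse KL direction:

```python
import torch
from torch.distributions import Categorical, kl_divergence

# Hypothetical batch: logits over 6 discrete actions for 128 sampled states.
logits_new = torch.randn(128, 6, requires_grad=True)  # current policy P_theta
logits_old = torch.randn(128, 6)                       # frozen sampling policy P_theta'
actions = torch.randint(0, 6, (128,))
advantages = torch.randn(128)
beta = 0.01                                            # penalty coefficient (tunable)

dist_new, dist_old = Categorical(logits=logits_new), Categorical(logits=logits_old)
ratio = torch.exp(dist_new.log_prob(actions) - dist_old.log_prob(actions))
surrogate = -(ratio * advantages).mean()
kl = kl_divergence(dist_new, dist_old).mean()          # KL(P_theta, P_theta') as written above
loss_ppo = surrogate + beta * kl
```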

Solution 2: clip the IS ratio

\[ \mathrm{Loss}_{ppo2} = - \frac{1}{N}\sum^N_{n=1} \sum^{T_n}_{t=1} \min(A_{\theta'}^{GAE}(s^t_n, a^t_n) \frac{P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)}, \mathrm{clip}(\frac{P_\theta(a^t_n|s^t_n)}{P_{\theta'}(a^t_n|s^t_n)}, 1-\epsilon, 1+\epsilon) A_{\theta'}^{GAE}(s^t_n, a^t_n)) \]

This way, the IS ratio is restricted to the range \([1-\epsilon, 1+\epsilon]\).
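
A sketch of the clipped objective in the same style as the previous snippets; the tensors are hypothetical and \(\epsilon = 0.2\) is a typical choice:

```python
import torch

log_probs_new = torch.randn(128, requires_grad=True)  # log P_theta(a|s)
log_probs_old = torch.randn(128)                       # log P_theta'(a|s), frozen
advantages = torch.randn(128)                          # A^GAE estimated under theta'
eps = 0.2

ratio = torch.exp(log_probs_new - log_probs_old)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
# Elementwise minimum of the two terms, negated for gradient descent.
loss_ppo2 = -torch.min(unclipped, clipped).mean()
```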