What?
Decoupling the policy and the value function in actor-critic methods and preventing the policy from learning spurious correlations that hinder generalisation.
Why?
A rich visual observation space leads to policies suffering from spurious correlations and deteriorating generalisation. Sharing a network between the policy and the value function might make things worse. This paper investigates how decoupling them can help.
How?
TLDR:
- Decouple the policy from the value function. (DAAC)
- Hinder the policy feature extractor from overfitting to a particular problem instance. (IDAAC)
We are in a setting where we have a finite number of samples from the distribution of tasks and want to generalise to new tasks (e.g. Procgen). Obviously, overfitting to features specific to a particular instance is bad, since these will not generalise.
- Decoupling policy and value function:
- Do not share parameters between the policy and the value;
- Use two heads for the policy net:
- to generate actions (policy itself)
- to output the advantage values (given the action) $A_\theta(s,a)$
- otherwise learning might suffer, since with a shared network the policy "relies on gradients from the value function to learn useful features"; the advantage head provides a similar learning signal without sharing parameters
- I am confused after reading this. So, the VF can still have useful feature representations for the policy to rely on?
- train the policy with three loss components (a sketch follows this list):
- PPO policy gradient loss (clipped)
- Entropy bonus for exploration
- Advantage loss $\frac{1}{T}\sum_{t=1}^{T}\left(A_\theta(s_t, a_t) - \hat{A}_t\right)^2$, where $\hat{A}_t$ is the GAE estimate $\hat{A}_t = \sum_{k=t}^{T}(\gamma \lambda)^{k-t}\delta_k$ with $\delta_t = r_t+\gamma V_\phi(s_{t+1}) - V_\phi(s_t)$.
- Train the value function, as usual, with an MSE loss.
- Having the policy and the value function decoupled has the additional benefit of allowing a different update schedule for each of them (see this paper for more).
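To make this concrete, here is a minimal, hedged sketch of how the decoupled setup and the DAAC losses could look in PyTorch. This is my own reconstruction, not the authors' code: the class names, network sizes, and the loss coefficients (`adv_coef`, `ent_coef`) are assumptions, and the GAE helper ignores episode boundaries for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Policy encoder E_theta with two heads: action logits and advantage A_theta(s, a)."""
    def __init__(self, obs_dim, n_actions, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.pi_head = nn.Linear(hidden, n_actions)        # action head (the policy itself)
        self.adv_head = nn.Linear(hidden + n_actions, 1)   # advantage head, conditioned on the action

    def forward(self, obs, action_onehot):
        h = self.encoder(obs)
        logits = self.pi_head(h)
        adv = self.adv_head(torch.cat([h, action_onehot], dim=-1)).squeeze(-1)
        return logits, adv, h

class ValueNet(nn.Module):
    """Separate value network V_phi; shares no parameters with the policy."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE: A_hat_t = sum_{k>=t} (gamma*lam)^{k-t} delta_k; episode boundaries are ignored here."""
    adv = torch.zeros_like(rewards)
    next_value, running = last_value, torch.tensor(0.0)
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

def daac_policy_loss(policy, obs, actions, old_logp, gae_adv,
                     clip_eps=0.2, adv_coef=0.25, ent_coef=0.01):
    """Clipped PPO loss + advantage regression + entropy bonus (coefficients are guesses)."""
    onehot = F.one_hot(actions, policy.pi_head.out_features).float()
    logits, adv_pred, _ = policy(obs, onehot)
    dist = torch.distributions.Categorical(logits=logits)
    logp = dist.log_prob(actions)

    ratio = torch.exp(logp - old_logp)
    ppo_loss = -torch.min(ratio * gae_adv,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * gae_adv).mean()
    adv_loss = F.mse_loss(adv_pred, gae_adv)   # (A_theta(s_t, a_t) - A_hat_t)^2
    entropy = dist.entropy().mean()
    return ppo_loss + adv_coef * adv_loss - ent_coef * entropy

def value_loss(value_net, obs, value_targets):
    """The usual MSE regression of V_phi towards its targets."""
    return F.mse_loss(value_net(obs), value_targets)
```

The policy and the value network would each get their own optimizer, which is what makes the different update schedules mentioned above possible.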
- Hindering the policy from overfitting to particular instances:
- Adversarially train an encoder that does not let instance-specific features stay in the hidden representation (a sketch of both losses follows this list).
- Train a discriminator that tries to predict if one state precedes the other:
- $L_D(\psi) = -\log[D_\psi(E_\theta(s_i), E_\theta(s_j))] - \log[1-D_\psi(E_\theta(s_i), E_\theta(s_j))]$, where $E_\theta$ is the encoder part of the policy network and $D_\psi$ is the discriminator (this is the equation as written in the paper; see the typo remark below).
- Use an additional encoder loss component for the IDAAC policy update:
- $L_E(\theta) = -\frac{1}{2}\log{[D_\psi(E_\theta(s_i), E_\theta(s_j))]} -\frac{1}{2}\log{[1-D_\psi(E_\theta(s_i), E_\theta(s_j))]}$
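Below is a hedged sketch of how this adversarial part could be wired up; again, my reconstruction rather than the authors' code. The discriminator architecture, the `detach` placement, and the use of separate optimizers are assumptions; `encoder` stands for the policy's feature extractor $E_\theta$ (e.g. `PolicyNet.encoder` from the sketch above), and the discriminator loss is written with $i$ and $j$ swapped in the second term, i.e. the version I believe the paper intended (see the typo remark below).

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D_psi: given two encoded states from the same trajectory, predict whether the first precedes the second."""
    def __init__(self, feat_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, e_i, e_j):
        return self.net(torch.cat([e_i, e_j], dim=-1)).squeeze(-1)

def discriminator_loss(disc, encoder, s_i, s_j):
    """Train D_psi to recover the temporal order; the encoder is detached so only D_psi learns here."""
    e_i, e_j = encoder(s_i).detach(), encoder(s_j).detach()
    return -(torch.log(disc(e_i, e_j)) + torch.log(1.0 - disc(e_j, e_i))).mean()

def encoder_loss(disc, encoder, s_i, s_j):
    """Push D_psi's output towards 0.5, i.e. make the order unrecoverable from E_theta's features.
    Only the encoder's optimizer steps on this loss, so the discriminator is not updated by it."""
    p = disc(encoder(s_i), encoder(s_j))
    return -(0.5 * torch.log(p) + 0.5 * torch.log(1.0 - p)).mean()
```

$L_E$ would then be added, with some coefficient, to the DAAC policy objective above to obtain the IDAAC update.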
And?
- I like the main insight of the paper: in a partially observable setting, we have to remember something specific to an instance to be able to fit the value function well. I am not sure I buy the argument about the number of steps left, though. It would be great to run an experiment taking the output of the value network's feature extractor (before the last linear layer) and learning a regression on the number of steps left while freezing the feature extractor (a rough sketch is after this list). Compared to training from scratch, how fast would this learn? If not faster than from scratch, maybe it is not about the number of steps but about the ordering (if x' comes after x → value of x is higher). IDAAC should also improve performance under the ordering hypothesis.
- The paper has a super neat background section clearly describing the setting. However, I do not think POMDPs were introduced by Bellman in 1957 as cited + it's a bit weird to include $\gamma$ in the definition of a POMDP.
- I don't have a good intuition for why learning an advantage function is less prone to remembering instance-specific features than learning the value function. Yeah, I saw the plots, but I still don't have a good intuition.
- Would making the value function myopic help with its overfitting to the environment, similarly to the advantage?
- I think there is a typo in Equations 2 and 3 in the paper ($i$ and $j$ should be swapped in the second component of the RHS of each equation).
- The figure showing the correlation between a high value loss and generalisation is extremely interesting. However, I find its formulation problematic: "...models with a higher value loss generalize better" implies a causal relationship, which has not been shown in the paper.
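For what it's worth, here is a rough sketch of the probing experiment I describe in the first bullet above. Everything here is hypothetical: `features` and `targets` stand for precomputed outputs of the (frozen) value network's feature extractor and the remaining-steps labels, respectively.

```python
import torch
import torch.nn as nn

def fit_linear_probe(features, targets, epochs=200, lr=1e-3):
    """Linear regression from fixed features to the number of steps left in the episode."""
    probe = nn.Linear(features.shape[-1], 1)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(probe(features).squeeze(-1), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe

# Compare the probe's learning curve on frozen value features vs. the same probe
# (or a small MLP) trained on raw observations from scratch; much faster convergence
# on the frozen features would support the "number of steps left" hypothesis.
```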
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.