What?
Evolving agent morphology based on the mutual information between the terminal state and actions of the agent.
Why?
Morphology evolution is an interesting and practically important problem. Existing approaches rely on the full-blown RL loop inside, and that is expensive. Can we simplify the inner loop?
How?
Source: original paper
- RL inside the morphology-evolution loop is costly; how can we get rid of it?
- Also, dropping the RL frees us from specifying a reward function, which is, I believe, why the authors call their approach 'task-agnostic'.
- The main idea:
- use action primitives (macro-actions)
- make the objective of morphology evolution to find a configuration such that the agent can reliably reach any state through a unique sequence of actions;
- the objective above can be framed as maximising the mutual information between the terminal state and the action primitive (I sketch my reading of it right after this list).
- I do not copy the objective verbatim since I'm not sure I fully get it, and I believe there are typos in Equation 1.
- practically, the authors train a classifier $q_\phi(a\mid s_T, m)$, where $a$ is the action primitive, $s_T$ is the terminal state, and $m$ is the morphology.
- The classifier is a GNN that takes a graph whose nodes are limbs and whose edges are joints.
- Graph features include limb positions, joint types and ranges, etc.
- The GNN lets us process different creatures with different action-set sizes (a guess at what such a classifier could look like is also sketched after this list).
- Since mutations might add more joints, the objective has to be corrected for the number of joints; I did not get exactly how.
- There are three main steps when looking for an optimal morphology (a rough sketch of the loop is at the end of this section):
- Sample/mutate the morphologies.
- Get data by random sampling using the action primitives (e.g. sinusoids of different frequencies/phases).
- Train the classifier on all the collected data and recompute the fitness of the morphologies.
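My reading of the objective (reconstructed from the classifier above, not copied from Equation 1, so treat it as a guess): maximise the mutual information between the primitive and the terminal state given the morphology, made tractable with the standard variational (Barber-Agakov) lower bound:

$$
I(a;\, s_T \mid m) \;=\; \mathcal{H}(a \mid m) \;-\; \mathcal{H}(a \mid s_T, m)
\;\;\ge\;\; \mathcal{H}(a \mid m) \;+\; \mathbb{E}_{a,\, s_T}\!\left[\log q_\phi(a \mid s_T, m)\right].
$$

Since the primitives are sampled uniformly, $\mathcal{H}(a \mid m)$ depends only on how many primitives the morphology has, so the fitness essentially reduces to how well $q_\phi$ recovers the primitive from the terminal state (plus the joint-count correction mentioned above).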
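A guess at what the classifier could look like, as a PyTorch Geometric sketch. Everything here is my assumption, not the authors' architecture: node features would concatenate limb geometry with that limb's slice of $s_T$, and per-node logits over a fixed set of waveforms handle the fact that different morphologies have different numbers of joints and hence different action-set sizes.

```python
import torch
from torch import nn
from torch_geometric.nn import GCNConv

class PrimitiveClassifier(nn.Module):
    """Sketch of q_phi(a | s_T, m): a GNN over the morphology graph.

    Nodes = limbs (features include the limb's part of the terminal state),
    edges = joints. Weights are shared across nodes, so the same network
    handles creatures with different numbers of limbs/joints.
    """
    def __init__(self, node_dim, hidden_dim, num_waveforms):
        super().__init__()
        self.conv1 = GCNConv(node_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.node_head = nn.Linear(hidden_dim, num_waveforms)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        logits = self.node_head(h)                  # [num_limbs, num_waveforms]
        # a primitive = (which limb's joint, which waveform); in a tree-structured
        # body each non-root limb has exactly one joint connecting it to its parent
        return torch.log_softmax(logits.reshape(-1), dim=0)
```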
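And my mental model of the outer loop as a rough Python sketch; the `env`/`morphology`/`classifier` interfaces, the sinusoid parameterisation, and the selection rule are placeholders I made up, not the authors' code:

```python
import numpy as np

def make_primitives(num_joints, freqs=(0.5, 1.0, 2.0), phases=(0.0, np.pi / 2)):
    """Hard-coded action primitives: one sinusoid per (joint, frequency, phase)."""
    return [(j, f, p) for j in range(num_joints) for f in freqs for p in phases]

def rollout(env, morphology, primitive, horizon=200):
    """Execute one primitive open-loop (no policy, no reward) and return the terminal state."""
    joint, freq, phase = primitive
    state = env.reset(morphology)                    # hypothetical env interface
    for t in range(horizon):
        action = np.zeros(morphology.num_joints)
        action[joint] = np.sin(freq * t + phase)     # sinusoidal macro-action
        state = env.step(action)
    return state

def select_top(population, fitness, keep=0.5):
    k = max(1, int(len(population) * keep))
    return sorted(population, key=lambda m: fitness[m], reverse=True)[:k]

def evolve(env, population, classifier, generations=50, repeats=5):
    dataset = []
    for _ in range(generations):
        # 1) sample/mutate the morphologies
        population = [m.mutate() for m in population]
        # 2) collect terminal states by uniformly sampling action primitives
        for m in population:
            for a in make_primitives(m.num_joints):
                for _ in range(repeats):
                    dataset.append((a, rollout(env, m, a), m))
        # 3) train q_phi(a | s_T, m) on all data collected so far, then score each
        #    morphology with the classifier's log-likelihood (the MI lower bound)
        classifier.fit(dataset)
        fitness = {m: classifier.avg_log_likelihood(m, dataset) for m in population}
        population = select_top(population, fitness)
    return population
```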
And?
- I had a hard time reading this paper. I think it is great that uniform random sampling can help speed up morphology evolution, but I don't think the evidence the paper provides is convincing enough to justify the "information-theoretic objective" and the biological metaphors.
- My first source of confusion is that I don't know what problem the paper is solving.
- The authors say that they propose a method for morphology evolution without any reward specification or policy learning.
- Okay, how do we know if we are doing well? What provides the learning signal?
- The authors state that "in order to accomplish any task in its environment, a morphology should be able to reliably reach any state through a unique sequence of actions".
- I think this is quite a strong assumption about the nature of the task, and not all tasks can be solved that way (e.g. what if 90% of the states are dangerous/unwanted? 99%?).
- Also, if we have a needle-in-a-haystack problem, encouraging the agent to be able to reach all the states is not really optimal.
- I find the biological motivation about nature training "generalists" misleading: we are highly specialised creatures (though nature, i.e. the algorithm, is pretty general).
- I find the terminology ill-defined and adding more confusion:
- What is the environment exactly? The authors seem to separate the environment and the agent features, but this should be made more formal.
- On page 3 (unnumbered equation with the mutual information), the authors say "assuming that all morphologies begin in the same state, we remove the objective's dependence on starting state and derive a variational lower bound...".
- I don't even know what this means: different morphologies will not start in the same state, since different configurations induce different initial state distributions.
- Moreover, adding more limbs changes the state space (even its dimensionality!).
- My second problem with the paper is that it is self-contradictory at times:
- First the authors propose this grand vision of an agent that is able to visit all the states reliably.
- Then the authors (understandably) introduce hard-coded primitives to make it work (plain uniform random exploration in the raw action space will not get you far).
- The authors say that we should predict the action that led to the terminal state (empowerment!). But then they add Gaussian noise to the terminal states.
- I do not get this one at all, since it increases the aleatoric uncertainty, which is the opposite of empowerment.
- I do not understand the first term of Equation 1 and believe there's a typo in it: it contains $\lvert A_j\rvert$, but there is no selection of a particular joint $j$ outside of it. Why would we care about the number of primitives of a single joint only?
- Thirdly, what is more expensive: visiting all possible states via uniform random sampling, or solving an RL problem in the inner loop?
- I believe this depends on the nature of the task, so the current solution is not as task-agnostic as it claims to be.
- Baselines/ablations
- I am not sure if the baselines also use action primitives or not. If they don't, there should be a baseline for that as well.
- I have a suspicion that the method works because it prunes the most broken morphologies early and does not spend any time/samples evaluating those.
- There's an ablation using the variance of the terminal states instead of the proposed objective, but I would have tried an even simpler baseline: measure the distance of the terminal state from the initial one. I believe this would work nicely, at least for the locomotion tasks.
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.