What?
A self-supervised graph-memory-based approach to navigation.
Why?
Existing approaches either ignore the specifics of navigation (they are purely reactive or rely on general-purpose memory) or depend on metric representations. There is work from psychology suggesting that landmark-based navigation might be the right model. This paper investigates it.
How?
(figures from the original paper)
- Setting:
- An agent is exposed to video footage of someone (a human demonstrator in this case) wandering around the maze. At this stage the agent builds an internal representation of the environment.
- The agent then starts at some point in the maze and is given the goal (an observation of the target location).
- Components:
- Semi-Parametric topological memory:
- Components:
- Retrieval Network
- Estimates the similarity of two states;
- two states are considered similar if they are temporally close.
- Trained on data from a random agent exploring the environment
- Cross-entropy loss on triplets $(o_i, o_j, y_{ij})$, where $y_{ij}$ is a binary label equal to 1 if the two observations are at most 20 steps apart.
- A siamese network is used for this (see the training sketch after this list).
- Memory Graph
- naive way to build a graph, connect:
- consecutive states in the exploration sequence;
- two similar states (retrieval network output above a threshold $s_\text{shortcut}$).
- better way:
- avoid adding trivial shortcut edges (only connect states that are at least $\Delta T_l$ steps apart);
- use a short sequence of states (instead of a single frame) for retrieval, taking the median of the network outputs (see the graph-building sketch after this list).
- Stages:
- Localisation
- Use k-NN on the retrieval-network similarity to localise both the agent's current observation and the goal observation in the memory graph.
- Planning
- Run Dijkstra's algorithm on the memory graph to find the shortest path from the agent's vertex to the goal vertex.
- Waypoint selection
- Pick the farthest node along the planned path that is still within reach (retrieval network output above the threshold $s_\text{reach}$); see the planning sketch after this list.
- Locomotion Network
- Takes two observations (current and waypoint/goal) and outputs a probability distribution over actions.
- Trained in a self-supervised way:
- a random agent generates the data;
- intuitively, the agent should look at the current state and the waypoint and decide which action to take (see the locomotion sketch after this list).
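To make the retrieval network concrete, here is a minimal training sketch, assuming PyTorch, RGB observations, and a small convolutional siamese encoder. The architecture, the `far_thresh` gap used for negative pairs, and all names are my guesses rather than the paper's exact setup; only the 20-step positive threshold comes from the note above.

```python
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


class RetrievalNet(nn.Module):
    """Siamese similarity network: embeds two observations with a shared
    encoder and classifies the pair as temporally close (1) or far (0)."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim), nn.ReLU(),
        )
        self.head = nn.Linear(2 * embed_dim, 2)  # logits for "far" / "close"

    def forward(self, obs_a, obs_b):
        z = torch.cat([self.encoder(obs_a), self.encoder(obs_b)], dim=1)
        return self.head(z)


def sample_pair(trajectory, close_thresh=20, far_thresh=100):
    """Sample a labeled pair (o_i, o_j, y_ij) from a random-exploration
    trajectory: y = 1 if the frames are at most `close_thresh` steps apart,
    y = 0 if they are at least `far_thresh` steps apart (far_thresh is a guess)."""
    i = random.randrange(len(trajectory) - far_thresh)
    if random.random() < 0.5:
        return trajectory[i], trajectory[i + random.randint(0, close_thresh)], 1
    return trajectory[i], trajectory[random.randint(i + far_thresh, len(trajectory) - 1)], 0


def train_step(net, optimizer, obs_a, obs_b, labels):
    """One cross-entropy step on a batch of labeled observation pairs."""
    loss = F.cross_entropy(net(obs_a, obs_b), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```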
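A sketch of how the memory graph could be built from the exploration sequence under the rules above, assuming a `similarity(o_i, o_j)` helper that wraps the trained retrieval network and returns a closeness score in [0, 1]. The threshold values, the window size, and the use of `networkx` are all assumptions.

```python
import statistics

import networkx as nx


def build_memory_graph(observations, similarity,
                       s_shortcut=0.95, delta_t=5, window=2):
    """Vertices are frames of the exploration sequence. Edges connect
    (a) consecutive frames and (b) "shortcut" pairs whose median similarity
    over a small temporal window exceeds s_shortcut, provided the frames are
    at least delta_t steps apart (to avoid trivial edges)."""
    graph = nx.Graph()
    n = len(observations)
    graph.add_nodes_from(range(n))

    # (a) temporal edges between consecutive frames
    for i in range(n - 1):
        graph.add_edge(i, i + 1)

    # (b) visual shortcut edges, using a median over a window of frames
    for i in range(window, n - window):
        for j in range(i + delta_t, n - window):
            scores = [similarity(observations[i + k], observations[j + k])
                      for k in range(-window, window + 1)]
            if statistics.median(scores) > s_shortcut:
                graph.add_edge(i, j)
    return graph
```

The double loop compares every candidate pair, which is quadratic in the sequence length; in practice one would subsample candidate pairs or batch the similarity calls.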
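The three test-time stages (localisation, planning, waypoint selection) could then look roughly like this, reusing the graph and the `similarity` helper from the previous sketch. The `s_reach` value and the k = 1 localisation are simplifications, not the paper's exact procedure.

```python
import networkx as nx


def localise(observations, query, similarity):
    """Nearest-neighbour localisation (the k = 1 case of the k-NN step):
    return the index of the memory vertex most similar to the query."""
    return max(range(len(observations)),
               key=lambda i: similarity(observations[i], query))


def select_waypoint(graph, observations, current_obs, goal_obs,
                    similarity, s_reach=0.95):
    """Localise agent and goal, plan a shortest path with Dijkstra on the
    memory graph, then pick the farthest path node that still looks reachable."""
    start = localise(observations, current_obs, similarity)
    goal = localise(observations, goal_obs, similarity)
    path = nx.dijkstra_path(graph, start, goal)  # unweighted edges count as 1

    waypoint = path[0]
    for node in path[1:]:
        if similarity(current_obs, observations[node]) >= s_reach:
            waypoint = node  # still within reach, so move the waypoint farther
        else:
            break
    return waypoint, observations[waypoint]
```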
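Finally, a sketch of the locomotion network, assuming the same kind of siamese-style encoder as the retrieval network (again my assumption). It would be trained with cross-entropy on $(o_t, o_{t+k}, a_t)$ triples collected by the random agent and used at test time on (current observation, selected waypoint).

```python
import torch
import torch.nn as nn


class LocomotionNet(nn.Module):
    """Maps (current observation, waypoint observation) to action logits.
    Trained on (o_t, o_{t+k}, a_t) triples from the random-exploration data,
    analogously to the retrieval network."""

    def __init__(self, num_actions, embed_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(embed_dim), nn.ReLU(),
        )
        self.head = nn.Linear(2 * embed_dim, num_actions)

    def forward(self, current_obs, waypoint_obs):
        z = torch.cat([self.encoder(current_obs), self.encoder(waypoint_obs)], dim=1)
        return self.head(z)  # softmax over these logits gives action probabilities
```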
And?
- The paper has an amazingly written related work section.
- However, it would be great if the paper grounded its terminology in that of RL and analysed its findings in those terms. For example, how does this relate to model-based RL? What does the generalisation of the method depend on?
- I like the writing style. It's concise, and the authors avoid vague discussion and poorly defined terms.
- Weak points:
- There are a lot of hyperparameters. (Though, to be fair, RL methods have many as well; they are just hidden inside algorithm implementations and treated as defaults.)
- The paper omits some assumptions that are important:
- The observational data the agent sees during the first stage is very important. I believe there needs to be good coverage of the state-action space for this to work, since it determines the graph being built (i.e., the transition model).
- I think the paper misses an important baseline: some external-memory model that can retain information longer than an LSTM. I don't know exactly which one, but this paper comes to mind first.
- The graph is built from the exploration data of the first stage. If that data is poor, how can the graph be updated from the agent's own experience in the environment?
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.