What?
$\epsilon$-greedy with randomly sampled action repeats.
Why?
Exploration is an important problem in RL. $\epsilon$-greedy is dumb, but it is simple and often beats more sophisticated approaches. Can we keep it simple and still do better than $\epsilon$-greedy?
How?
Source: original paper. Check out my version below!
- $\epsilon$-greedy (a minimal baseline sketch follows this list):
    - extremely dumb
    - extremely strong
    - simple
    - stationary (no side effects on learning dynamics)
    - full coverage of the state space
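For contrast with the $\epsilon z$-greedy pseudocode further down, here is a minimal sketch of the plain $\epsilon$-greedy baseline. This is my own sketch, not code from the paper, and it assumes a tabular `Q` indexed as `Q[s]` plus numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(Q, s, eps, num_actions):
    """Plain epsilon-greedy: uniform random action with probability eps, else greedy."""
    if rng.random() < eps:
        return int(rng.integers(num_actions))  # explore: uniform random action
    return int(np.argmax(Q[s]))                # exploit: greedy w.r.t. Q
```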
<aside>
💡 Main idea: take $\epsilon$-greedy and, when exploring, sample not only a random action but also a duration (shorter than the horizon), then repeat that action for the sampled number of steps.
</aside>
Pseudocode (Appendix B.1 of the paper):

    def ez_greedy(Q, eps, z):
        n = 0   # remaining duration of the currently repeated action
        w = -1  # action being repeated (-1 = none)
        s = env.reset()
        while True:
            if n == 0:
                if random() < eps:
                    n = z.sample()              # sample how long to repeat
                    w = action_space.uniform()  # sample a random action
                    a = w
                else:
                    a = argmax(Q[s])            # greedy action
            else:
                a = w                           # keep repeating the sampled action
                n = n - 1                       # reduce the duration left by one
            s, r, done, _ = env.step(a)
The duration can be sampled in multiple ways, but the authors stick to the zeta (Zipf) distribution (hence the "z" in the name):
$$
z(n) \propto n^{-\mu},
$$
with $\mu=2$.
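For concreteness, here is one way to implement the `z` sampler used in the pseudocode above, relying on numpy's built-in Zipf sampler. The class name and the optional cap at the remaining horizon are my additions, not part of the paper:

```python
import numpy as np

class ZipfDuration:
    """Duration sampler with P(n) ∝ n^(-mu); plugs into z.sample() in the pseudocode above."""
    def __init__(self, mu=2.0, seed=0):
        self.mu = mu
        self.rng = np.random.default_rng(seed)

    def sample(self, cap=None):
        n = int(self.rng.zipf(self.mu))  # heavy-tailed: mostly short repeats, occasionally very long
        return n if cap is None else min(n, cap)

z = ZipfDuration(mu=2.0)
print([z.sample(cap=100) for _ in range(20)])  # e.g. mostly 1s and 2s with a rare long repeat
```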
And?
- I like this paper. It describes a simple but powerful idea that is applicable wherever $\epsilon$-greedy is applicable. However, I believe the writing adds complexity that could have been avoided.
- Evaluations on grid worlds, classic control environments, and Atari show that you can beat $\epsilon$-greedy without building 1000 lines of code around it.
- I loved the connection to Lévy flights ❤️.
- The name of the method is also cool ($\epsilon z$-greedy reads as "easy-greedy").
- Funnily enough, I thought about a similar idea back in 2017 while working on my master's thesis. I remember producing the plots below: a random walk (steps of +1 or -1) whose cumulative sum is plotted for different action repeats. The pictures illustrate that you can travel much further if you repeat your actions; however, there are many more holes on the right. This paper essentially fixes those holes by sampling from all possible action-repeat durations up to the horizon.
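A rough sketch of how such plots can be reproduced (my reconstruction with arbitrary parameters, not the original thesis code): sample ±1 steps, repeat each sampled step k times, and plot the cumulative sum for several values of k:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
T = 1000  # total number of steps (arbitrary choice)

for k in (1, 4, 16):                          # different action-repeat lengths
    steps = rng.choice([-1, 1], size=T // k)  # sample +1/-1 "actions"
    walk = np.cumsum(np.repeat(steps, k))     # repeat each action k times, then accumulate
    plt.plot(walk, label=f"repeat = {k}")

plt.xlabel("step")
plt.ylabel("cumulative sum")
plt.legend()
plt.show()
```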