Model-free prediction algorithms for deep reinforcement learning -- Similarities and differences (WIP)

Published on March 16, 2017

Conventions:

  • Sets are represented by $\mathcal{A, S},…$ (calligraphy font)
  • Vectors are represented by $\mathbf{w, \theta,…}$ (bold font)
  • Functions are represented by $Q,\hat{Q}, V, \hat V,… $ (capital letters)
  • Random variables are represented by $s, a, r,…$ (lowercase letters)

Note: This post compares the differences and highlights the similarities among various model-free prediction algorithms for (deep) reinforcement learning, especially with function approximation. Red-colored text indicates the comparable differences (if applicable) from the preceding equation/algorithm. Some details are left out for brevity; please refer to Sutton & Barto, 2017 and the cited papers for completeness.

N-step return with value function approximation
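In the notation above, and bootstrapping from the current weight vector $\mathbf{w_t}$ (an assumption made here; which weights are used for bootstrapping is a design choice), the n-step return is

$$g_{t,\mathbf{w_t}}^{n} = r_{t+1} + \gamma\, r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n}\, \hat V(s_{t+n},\mathbf{w_t})$$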

equivalently,
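written compactly with a summation,

$$g_{t,\mathbf{w_t}}^{n} = \sum_{k=1}^{n}\gamma^{k-1}\, r_{t+k} + \gamma^{n}\, \hat V(s_{t+n},\mathbf{w_t})$$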

$\lambda$-return with value function approximation
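One standard form, following Sutton & Barto:

$$g_{t,\mathbf{w_t}}^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\, g_{t,\mathbf{w_t}}^{n}$$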

The $\lambda$-return, $g_{t,\mathbf{w_t}}^{\lambda}$, combines all of the n-step returns from the current time step, using $\lambda^{n-1}$ as the weight for the n-step return. $(1-\lambda)$ is the normalizing term, so that $(1-\lambda) \sum_{n=1}^{\infty}\lambda^{n-1}=1$. Visually, this weighting makes intuitive sense:

[Figure: the weighting given to each of the n-step returns in the $\lambda$-return]

$\lambda$-return with value function approximation for $\color{red}{episodic}$ MDPs

Let $T$ be the time step at which a terminal state is reached, marking the end of an episode. The number of steps that can be taken from the current time step $t$ until the end of the episode is therefore $T-t$.
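Since every n-step return that reaches past the terminal state is simply the $(T-t)$-step return, the $\lambda$-return for the episodic case can be written (one standard convention, following Sutton & Barto) as

$$g_{t,\mathbf{w_t}}^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}\, g_{t,\mathbf{w_t}}^{n} + (1-\lambda)\sum_{n=T-t}^{\infty}\lambda^{n-1}\, g_{t,\mathbf{w_t}}^{T-t}$$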

(Question: Is the normalizing factor of $(1-\lambda)$ still valid? If $n$ goes to $\infty$, the summation of $\lambda^{n-1}$ approaches $\frac{1}{1-\lambda}$. If the summation only runs to $T-t-1$, it comes to $\frac{1-\lambda^{T-t-1}}{1-\lambda}$. Is the term $(1-\lambda)$ still the normalizing factor?)

We can pull the tail of this summation out as a single term and write it as:
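$$g_{t,\mathbf{w_t}}^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}\, g_{t,\mathbf{w_t}}^{n} + \color{green}{\lambda^{T-t-1}\, g_{t,\mathbf{w_t}}^{T-t}}$$

since $(1-\lambda)\sum_{n=T-t}^{\infty}\lambda^{n-1} = \lambda^{T-t-1}$. The weights on the n-step returns still sum to one: $(1-\lambda)\frac{1-\lambda^{T-t-1}}{1-\lambda} + \lambda^{T-t-1} = 1$.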

By the definition of an episodic MDP, after the end of an episode (i.e., after a terminal state is reached), there is nothing left for an agent to do. By observation, the last n-step return, ${\color{green}{g_{t,\mathbf{w_t}}^{T-t}}}$ (weighted by $\lambda^{T-t-1}$), is $\color{green}{equal}$ to the full return, $\color{green}{g_{t,\mathbf{w_t}}}$, of the episode.

Truncated $\lambda$-return

Off-line^1 TD($\lambda$) is equivalent to the (off-line) $\lambda$-return algorithm (Sutton & Barto, 1998; Sutton, 1988). However, online^2 TD($\lambda$) is only approximately equal to the online $\lambda$-return algorithm. Van Seijen & Sutton (2014) proposed the truncated $\lambda$-return, which truncates the $\lambda$-return at a specific time step $t'$, to pave the way for a strict-online forward-view algorithm.
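In the notation above (with the superscript $\lambda|t'$ marking the truncation horizon, a notation assumed here; cf. Sutton & Barto), the truncated $\lambda$-return is

$$g_{t,\mathbf{w_t}}^{\lambda|t'} = (1-\lambda)\sum_{n=1}^{t'-t-1}\lambda^{n-1}\, g_{t,\mathbf{w_t}}^{n} + \lambda^{t'-t-1}\, g_{t,\mathbf{w_t}}^{t'-t}$$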

Offline updates:

Updates are accumulated after every step within an episode but applied in batch at the end of the episode.

Online updates:

Updates are applied online at each time step within an episode.

Forward view

For each visited state, look forward in time to all the subsequent rewards and the states visited to determine its update.

[Figure: the forward view, looking ahead from each state to future rewards and states]

Temporal Difference Learning with function approximation for $\color{blue}{\text{on-policy}}$ model-free prediction

At time step $t+1$, the TD error from time step $t$ is used to adjust the estimate $\hat V(s_t,\mathbf{w_t})$ by updating the weights of the state-value function approximator.
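One standard form of this update (semi-gradient TD, following Sutton & Barto, with step size $\alpha$):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(r_{t+1} + \gamma\,\hat V(s_{t+1},\mathbf{w}_t) - \hat V(s_t,\mathbf{w}_t)\big)\,\nabla_{\mathbf{w}}\hat V(s_t,\mathbf{w}_t)$$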

Online forward view (?) TD(0)

TD(0) updates the weights using the one-step return, $r_{t+1} + \gamma \hat V(s_{t+1},\mathbf w_t)$. This update depends on the immediate reward, $r_{t+1}$, and the current (at time step $t$) approximation of the value of the next state, $\hat V(s_{t+1},\mathbf w_t)$.
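As a concrete illustration, here is a minimal sketch of semi-gradient TD(0) prediction with linear features. The `env`, `policy`, and `features` callables are hypothetical placeholders (not any particular library's API), and the hyperparameters are arbitrary.

```python
import numpy as np

def semi_gradient_td0(env, policy, features, n_features,
                      alpha=0.01, gamma=0.99, n_episodes=100):
    """Semi-gradient TD(0) prediction with a linear value function:
    V_hat(s, w) = w . features(s)."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                      # on-policy action selection
            s_next, r, done = env.step(a)      # assumed (state, reward, done) interface
            x = features(s)
            # One-step TD target: bootstrap from the current estimate of the
            # next state's value; a terminal state has value zero.
            v_next = 0.0 if done else np.dot(w, features(s_next))
            td_error = r + gamma * v_next - np.dot(w, x)
            # For a linear V_hat, the gradient w.r.t. w is just the feature vector.
            w += alpha * td_error * x
            s = s_next
    return w
```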

Strict-online forward view
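A sketch of the idea, following Sutton & Barto's online $\lambda$-return algorithm (the superscript $h$ on the weight vectors, indexing the current horizon, is notation assumed here): at every new time step $h$, all updates of the episode so far are redone from scratch using $\lambda$-returns truncated at $h$,

$$\mathbf{w}_{k+1}^{h} = \mathbf{w}_{k}^{h} + \alpha\,\big(g_{k}^{\lambda|h} - \hat V(s_k,\mathbf{w}_{k}^{h})\big)\,\nabla_{\mathbf{w}}\hat V(s_k,\mathbf{w}_{k}^{h}), \qquad k = 0, 1, \dots, h-1,$$

and the weights actually used at time $h$ are $\mathbf{w}_h \equiv \mathbf{w}_h^h$.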

The vanilla/conventional Backward view TD($\lambda$)
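In one standard form (semi-gradient TD($\lambda$) with accumulating traces, following Sutton & Barto; step size $\alpha$, trace vector $\mathbf{e}_t$ defined below), the weight update is

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(r_{t+1} + \gamma\,\hat V(s_{t+1},\mathbf{w}_t) - \hat V(s_t,\mathbf{w}_t)\big)\,\mathbf{e}_t$$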

or
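written compactly in terms of the TD error,

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{e}_t$$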

where,
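the one-step TD error is

$$\delta_t = r_{t+1} + \gamma\,\hat V(s_{t+1},\mathbf{w}_t) - \hat V(s_t,\mathbf{w}_t)$$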

and
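the eligibility trace accumulates the gradients of the value estimates (the accumulating-trace variant is assumed here),

$$\mathbf{e}_t = \gamma\lambda\,\mathbf{e}_{t-1} + \nabla_{\mathbf{w}}\hat V(s_t,\mathbf{w}_t), \qquad \mathbf{e}_{-1} = \mathbf{0}$$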

True online TD($\lambda$)

True online TD($\lambda$) forms the backward view of the truncated $\lambda$-return algorithm. Like the vanilla TD($\lambda$), it updates the weights in proportion to a decaying eligibility trace.
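A sketch of the weight update, assuming linear function approximation $\hat V(s,\mathbf{w}) = \mathbf{w}^{\top}\phi(s)$ with $\phi_t \equiv \phi(s_t)$ (true online TD($\lambda$) is defined for the linear case; van Seijen & Sutton, 2014):

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\delta_t\,\mathbf{e}_t + \alpha\,\big(\mathbf{w}_t^{\top}\phi_t - \mathbf{w}_{t-1}^{\top}\phi_t\big)\,\big(\mathbf{e}_t - \phi_t\big)$$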

or
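grouping the terms as in the published pseudocode,

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\big(\delta_t + \mathbf{w}_t^{\top}\phi_t - \mathbf{w}_{t-1}^{\top}\phi_t\big)\,\mathbf{e}_t - \alpha\,\big(\mathbf{w}_t^{\top}\phi_t - \mathbf{w}_{t-1}^{\top}\phi_t\big)\,\phi_t$$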

where,
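the TD error is computed with the current weights,

$$\delta_t = r_{t+1} + \gamma\,\mathbf{w}_t^{\top}\phi_{t+1} - \mathbf{w}_t^{\top}\phi_t$$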

and
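the trace is the dutch trace rather than the accumulating trace,

$$\mathbf{e}_t = \gamma\lambda\,\mathbf{e}_{t-1} + \big(1 - \alpha\gamma\lambda\,\mathbf{e}_{t-1}^{\top}\phi_t\big)\,\phi_t, \qquad \mathbf{e}_{-1} = \mathbf{0}$$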

True online TD($\lambda$) algorithm (van Seijen & Sutton, 2014)

[Figure: pseudocode of the true online TD($\lambda$) algorithm]
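A minimal sketch of the algorithm in code, assuming linear features; the `env`, `policy`, and `features` callables are hypothetical placeholders (the same assumed interface as the TD(0) sketch above), while the update itself follows the van Seijen & Sutton pseudocode.

```python
import numpy as np

def true_online_td_lambda(env, policy, features, n_features,
                          alpha=0.01, gamma=0.99, lam=0.9, n_episodes=100):
    """True online TD(lambda) prediction with a linear value function:
    V_hat(s, w) = w . features(s)."""
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        x = features(s)
        e = np.zeros(n_features)        # dutch-style eligibility trace
        v_old = 0.0
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)          # assumed (state, reward, done) interface
            x_next = np.zeros(n_features) if done else features(s_next)
            v = np.dot(w, x)
            v_next = np.dot(w, x_next)
            delta = r + gamma * v_next - v
            # Dutch trace update
            e = gamma * lam * e + (1.0 - alpha * gamma * lam * np.dot(e, x)) * x
            # True online weight update
            w += alpha * (delta + v - v_old) * e - alpha * (v - v_old) * x
            v_old = v_next
            s, x = s_next, x_next
    return w
```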



Tags: Deep-Reinforcement-Learning model-free-prediction TD(lambda) RL DRL
