Model-free control algorithms for deep reinforcement learning -- Similarities and differences (WIP)

Published on March 16, 2017

Conventions:

  • Sets are represented by $\mathcal{A, S}$,… (calligraphy font)
  • Vectors are represented by $\mathbf{w, \theta,…}$ (bold font)
  • Functions are represented by $Q,\hat Q, V, \hat V,$… (capital letters)
  • Random variables are represented by $s,a,r$… (lower case letters)

Note: This post compares the differences and highlights the similarities of various model-free control algorithms in (deep) reinforcement learning, especially with function approximation. It is not intended to be a primer or a comprehensive refresher; please refer to Sutton & Barto, 2017 for completeness.

n-step return for action-value function with value function approximation
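A sketch of the standard forward-view definition (following Sutton & Barto); the symbol $q_t^{(n)}$ and the discount factor $\gamma$ are assumed notation rather than taken from this post:

$$q_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^{n} \hat Q(s_{t+n}, a_{t+n}, \mathbf{w})$$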

equivalently,
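the same return in compact summation form (under the same assumed notation):

$$q_t^{(n)} = \sum_{k=1}^{n} \gamma^{k-1} r_{t+k} + \gamma^{n} \hat Q(s_{t+n}, a_{t+n}, \mathbf{w})$$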

$\lambda$-return for action-value function with value function approximation
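A sketch of the standard definition, which geometrically weights the n-step returns above with decay parameter $\lambda$ (the symbol $q_t^{\lambda}$ is assumed notation):

$$q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$$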

SARSA update with function approximation for on-policy model-free control

At time step $t+1$, the TD error from time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w}_t)$ through the following update of the weights of the Q-value function approximator:
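A sketch of the standard semi-gradient SARSA(0) update, with step size $\alpha$ and discount factor $\gamma$ assumed; the bracketed term is the TD error $\delta_t$:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ r_{t+1} + \gamma \hat Q(s_{t+1}, a_{t+1}, \mathbf{w}_t) - \hat Q(s_t, a_t, \mathbf{w}_t) \right] \nabla_{\mathbf{w}} \hat Q(s_t, a_t, \mathbf{w}_t)$$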

This is similar to the $TD(0)$ update step for the state-value function.
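As a concrete illustration, here is a minimal sketch of the above update for a linear approximator $\hat Q(s,a,\mathbf{w}) = \mathbf{w}^\top \phi(s,a)$; the feature map `phi` and all names are hypothetical:

```python
import numpy as np

def sarsa_update(w, phi, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """One semi-gradient SARSA(0) step for a linear approximator Q(s, a, w) = w . phi(s, a).

    phi is a hypothetical feature function mapping (state, action) -> np.ndarray.
    """
    q_sa = np.dot(w, phi(s, a))
    q_next = 0.0 if done else np.dot(w, phi(s_next, a_next))
    td_error = r + gamma * q_next - q_sa      # the bracketed TD error above
    # For a linear approximator, grad_w Q(s, a, w) = phi(s, a)
    return w + alpha * td_error * phi(s, a)
```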

One-step Q-learning update with function approximation [Lin, 1992]

As in SARSA, at time step $t+1$ the TD error from time step $t$ is used to adjust the estimate $\hat Q(s_t,a_t,\mathbf{w}_t)$ through the following update of the weights of the Q-value function approximator:
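A sketch of the standard one-step Q-learning update with function approximation, again with step size $\alpha$ and discount factor $\gamma$ assumed:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} \hat Q(s_{t+1}, a, \mathbf{w}_t) - \hat Q(s_t, a_t, \mathbf{w}_t) \right] \nabla_{\mathbf{w}} \hat Q(s_t, a_t, \mathbf{w}_t)$$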

This is similar to the $TD(0)$/SARSA update step except for the $\max_{a\in\mathcal A}$ over the $\hat Q$ function, which means that the $\hat Q$ value associated with the action that yields the maximum value in state $s_{t+1}$ is used, or equivalently, $\hat Q(s_{t+1}, \operatorname{argmax}_{a'} \hat Q(s_{t+1}, a', \mathbf{w}), \mathbf{w})$.
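For comparison with the SARSA sketch above, a minimal (hypothetical) version of this off-policy update for the same linear approximator; only the bootstrap target changes:

```python
import numpy as np

def q_learning_update(w, phi, actions, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """One-step Q-learning update for a linear approximator Q(s, a, w) = w . phi(s, a).

    Unlike SARSA, the bootstrap target maximises over the actions available in
    s_next, independent of the action the behaviour policy actually takes next.
    """
    q_sa = np.dot(w, phi(s, a))
    q_next = 0.0 if done else max(np.dot(w, phi(s_next, a2)) for a2 in actions)
    td_error = r + gamma * q_next - q_sa
    # grad_w Q(s, a, w) = phi(s, a) in the linear case
    return w + alpha * td_error * phi(s, a)
```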

Temporal Difference Q-Learning with function approximation [Watkins, 1989]


Tags: Deep-Reinforcement-Learning RL DRL
