Temporal Difference Algorithm
Temporal-difference (TD) learning is a model-free, stochastic-approximation (SGD-like) method for solving the Bellman equation or the Bellman optimality equation from experience samples rather than from a model of the environment.
TD Algorithm
It estimates the state values of a given policy.
Given a policy $\pi$, our goal is to estimate $v_\pi (s)$ for all $s \in \mathcal{S}$. Suppose that we have some experience samples $(s_0, r_1, s_1, \cdots, s_t, r_{t+1}, s_{t+1}, \cdots)$ generated following $\pi$. Here, $t$ denotes the time step and $v_t(s_t)$ is the estimate of $v_\pi(s_t)$.
$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))]
$$
$$
v_{t+1}(s) = v_t(s), \quad \text{for all} \quad s \neq s_t
$$
Only the visited state $s_t$ is updated at time $t$; the values of all other states remain the same.
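As a concrete illustration, here is a minimal tabular TD(0) sketch in Python. The environment interface (`env.reset()`, `env.step(a)`, `env.num_states`) and the `policy(s)` sampler are assumed names for illustration, not part of the algorithm itself.

```python
import numpy as np

def td0_state_values(env, policy, num_episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0): estimate v_pi(s) under a fixed policy pi.

    Assumes a hypothetical environment with `reset() -> s` and
    `step(a) -> (s_next, r, done)`, states indexed 0..env.num_states-1,
    and `policy(s) -> a` sampling actions from pi.
    """
    v = np.zeros(env.num_states)                  # v_0(s) = 0 for all s
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            td_target = r + gamma * v[s_next]     # r_{t+1} + gamma * v_t(s_{t+1})
            v[s] -= alpha * (v[s] - td_target)    # update only the visited state s_t
            s = s_next
    return v
```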
The definition of the state value is
$$
v_\pi(s) = \mathbb{E}[R_{t+1}+ \gamma G_{t+1}|S_t = s], \quad s \in \mathcal{S}
$$
$$
v_\pi(s) = \mathbb{E}[R_{t+1}+ \gamma v_{\pi}(S_{t+1})|S_t = s], \quad s \in \mathcal{S}
$$
The latter equation is also called the Bellman equation in expectation form. It can be solved with the Robbins-Monro (RM) algorithm by defining a function whose root is $v_\pi(s_t)$:
$$
g(v_\pi(s_t)) = v_\pi(s_t) - \mathbb{E}[R_{t+1}+ \gamma v_{\pi}(S_{t+1})|S_t = s_t] = 0
$$
The noisy observation can be written as
$$
\tilde{g}(v_\pi(s_t))=v_\pi(s_t) - [r_{t+1}+ \gamma v_\pi (s_{t+1})] = g(v_\pi(s_t)) + \eta
$$
The corresponding RM iteration for solving $g(v_\pi(s_t)) = 0$ is then
$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\tilde{g}(v_t(s_t)),
$$
which is exactly the TD update given above.
Properties of the TD Algorithm
The TD Algorithm is defined as
$$
v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)[v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1}))]
$$
where $\bar{v}_t$ is the TD target,
$$
\bar{v}_t = r_{t+1} + \gamma v_t(s_{t+1}).
$$
It is called the TD target because $\bar v_t$ is the value toward which the algorithm drives $v(s_t)$. To see this, subtract $\bar v_t$ from both sides of the update:
$$
v_{t+1}(s_t) - \bar v_t = [v_t(s_t) - \bar v_t] - \alpha_t(s_t)[v_t(s_t) - \bar v_t]
$$
$$
v_{t+1}(s_t) - \bar v_t = [1 - \alpha_t(s_t)][v_t(s_t) - \bar v_t]
$$
Since $0 < \alpha_t(s_t) \leq 1$, we have $|1 - \alpha_t(s_t)| < 1$, and hence
$$
|v_{t+1}(s_t) - \bar v_t| \leq |v_t(s_t) - \bar v_t|,
$$
which means that the updated state value is closer to the TD target than the old one.
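As a quick sanity check with illustrative numbers (not from any particular example in the text): take $\alpha_t(s_t) = 0.1$, $v_t(s_t) = 0$, $r_{t+1} = 1$, $\gamma = 0.9$, and $v_t(s_{t+1}) = 2$. Then
$$
\bar v_t = 1 + 0.9 \times 2 = 2.8, \qquad v_{t+1}(s_t) = 0 - 0.1 \times (0 - 2.8) = 0.28,
$$
$$
|v_{t+1}(s_t) - \bar v_t| = 2.52 = (1 - 0.1)\,|v_t(s_t) - \bar v_t|,
$$
so one update shrinks the gap to the TD target by the factor $1 - \alpha_t(s_t)$.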
The quantity $\delta_t$ is the TD error,
$$
\delta_t = v_t(s_t) - \bar{v}_t = v_t(s_t) - (r_{t+1} + \gamma v_t(s_{t+1})),
$$
which is also known as the temporal-difference because it reflects the discrepancy between two steps $t$ and $t + 1$. The TD error is zero if the state value estimate is accurate. When $v_t = v_\pi$, the expected value of the TD error is
$$
\mathbb{E}[\delta_t|S_t = s_t] = \mathbb{E}[v_\pi (S_t) - (R_{t+1}+\gamma v_\pi (S_{t+1}))|S_t = s_t]=v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_t = s_t]=0
$$
The TD error can also be interpreted as an innovation: it measures the new information that the sample $(s_t, r_{t+1}, s_{t+1})$ carries beyond the current estimate $v_t$.
Sarsa
Rather than estimating state values like the TD algorithm above, Sarsa estimates the action values of a given policy; it can be viewed as the TD algorithm for action values. However, to obtain an optimal policy, Sarsa must be combined with a policy-improvement step.
Suppose we have some experience samples generated following $\pi: (s_0, a_0, r_1, a_1, \cdots, s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, \cdots)$
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1},a_{t+1}))]
$$
$$
q_{t+1}(s, a) = q_t(s,a), \quad \text{for all } (s, a) \neq (s_t, a_t)
$$
Here, $q_t(s_t, a_t)$ is the estimate of $q_\pi(s_t, a_t)$. At time $t$, only the q-value of $(s_t, a_t)$ is updated, whereas the q-values of all other state-action pairs remain the same. Sarsa solves the Bellman equation expressed in terms of action values:
$$
q_\pi (s, a) = \mathbb{E}[R + \gamma q_\pi(S', A')|s, a], \quad \text{for all } (s, a)
$$
Algorithm (Sarsa): at each step, first update the q-value of the visited state-action pair, then update the policy at that state to an $\epsilon$-greedy one. A minimal code sketch is given after the pseudocode below.
Initialization: $\alpha_t (s, a) = \alpha > 0$ for all $(s, a)$ and all $t$. Initial $q_0(s, a)$ for all $(s, a)$. Initial $\epsilon$-greedy policy $\pi_0$ derived from $q_0$.
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state $s_0$.
For each episode, do
Generate $a_0$ at $s_0$ following $\pi_0(s_0)$
If $s_t (t=0, 1, 2,\cdots)$ is not the target state, do
Collect an experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}$ and $s_{t+1}$ by interacting with the environment, then generate $a_{t+1}$ following $\pi_t(s_{t+1})$.
Update q-value for $(s_t, a_t)$:
$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1},a_{t+1}))]$
Update policy for $s_t$:
- Use $\epsilon$-greedy policy
- $s_t \leftarrow s_{t+1}, a_t \leftarrow a_{t+1}$
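A minimal tabular Sarsa sketch in Python mirroring the pseudocode above. The environment interface (`env.reset()`, `env.step(a)`, `env.num_states`, `env.num_actions`) is an assumed illustration, not a specific library API.

```python
import numpy as np

def sarsa(env, num_episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Sarsa with an epsilon-greedy policy.

    Assumes a hypothetical environment with integer states/actions,
    `reset() -> s0`, and `step(a) -> (s_next, r, done)`.
    """
    q = np.zeros((env.num_states, env.num_actions))

    def eps_greedy(s):
        # Explore uniformly with probability eps, otherwise act greedily on q.
        if np.random.rand() < eps:
            return np.random.randint(env.num_actions)
        return int(np.argmax(q[s]))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)                              # a_0 ~ pi_0(s_0)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)                # a_{t+1} ~ pi_t(s_{t+1})
            td_target = r + gamma * q[s_next, a_next]
            q[s, a] -= alpha * (q[s, a] - td_target)   # update visited (s_t, a_t)
            # The policy update is implicit: eps_greedy always acts on the latest q.
            s, a = s_next, a_next
    return q
```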
In this task, Sarsa only needs to find an optimal path from a specific starting state to the target state; therefore, it does not need to accurately evaluate every state.
Two common metrics for evaluating the convergence and performance of Sarsa are the episode length and the total reward per episode.
n-step Sarsa
Sarsa and MC learning are two extreme cases of n-step Sarsa. The only difference among them is how many steps of rewards are used before bootstrapping when approximating the return $G_t$.
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - (r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n q_t(s_{t+n}, a_{t+n}))]
$$
The definition of the action value is
$$
q_\pi(s, a) = \mathbb{E}[G_t|S_t=s, A_t=a]
$$
Where $G_t$ is the discounted return satisfying
$$
G_t = R_{t+1} + \gamma R_{t+2}+ \gamma^2 R_{t+3} + \cdots
$$
Sarsa
$$
G_t = G_{t}^{(1)} = R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1})
$$
n-step Sarsa
$$
G_t = G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n})
$$
MC Learning
$$
G_t = G_t^{(\infty)}= R_{t+1} + \gamma R_{t+2}+ \gamma^2 R_{t+3} + \cdots
$$
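To make the three targets concrete, here is a small helper that computes the $n$-step target from a stored trajectory; the trajectory layout (`rewards[k]` holding $r_{k+1}$, `states[k]`/`actions[k]` holding $s_k$/$a_k$, plus a tabular `q` array) is an assumption for illustration. With $n = 1$ it reduces to the Sarsa target, and with $n$ large enough to cover the rest of the episode it approaches the MC return.

```python
def n_step_target(rewards, states, actions, q, t, n, gamma=0.9):
    """Compute the n-step Sarsa target
        r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * q[s_{t+n}, a_{t+n}],
    truncating at the end of the episode (no bootstrap past the terminal state).

    Assumed layout: rewards[k] = r_{k+1}, states[k] = s_k, actions[k] = a_k,
    and q is a tabular (num_states x num_actions) NumPy array of estimates.
    """
    G = 0.0
    steps = min(n, len(rewards) - t)              # how many rewards are available
    for k in range(steps):
        G += (gamma ** k) * rewards[t + k]        # discounted reward terms
    if t + steps < len(actions):                  # bootstrap only if the episode continues
        G += (gamma ** steps) * q[states[t + steps], actions[t + steps]]
    return G
```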
Q-Learning
Q-learning directly solves the Bellman optimality equation and estimates the optimal action values, from which an optimal policy is obtained. Unlike Sarsa, it can be implemented off-policy (an on-policy version also exists, as shown below). In short, Sarsa solves the Bellman equation of a given policy, while Q-learning solves the Bellman optimality equation.
$$
q_{t+1}(s_t, a_t) = q_t(s_t, a_t)-\alpha_t(s_t, a_t)[q_t(s_t, a_t)-(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})}q_t(s_{t+1}, a))]
$$
$$
q_{t+1}(s, a)= q_t(s, a), \quad \text{for all } (s, a) \neq (s_t, a_t)
$$
Here, $q_t(s_t, a_t)$ is the estimate of the optimal action value of $(s_t, a_t)$, and the TD target is $\bar v_t = r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})}q_t(s_{t+1}, a)$. Sarsa requires the sample $(r_{t+1}, s_{t+1}, a_{t+1})$, while Q-learning merely requires $(r_{t+1}, s_{t+1})$. Q-learning aims to solve the Bellman optimality equation expressed in terms of action values:
$$
q(s,a) = \mathbb{E}[R_{t+1} + \gamma\max_a q(S_{t+1}, a)|S_t =s, A_t=a]
$$
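As a rough sketch of the core update, assuming a tabular `q` array indexed by integer states and actions (the function and parameter names are illustrative):

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on the transition (s, a, r, s_next).

    The TD target bootstraps on max_a' q[s_next, a'], so the transition may
    come from any behavior policy.
    """
    td_target = r + gamma * np.max(q[s_next])      # r_{t+1} + gamma * max_a q_t(s_{t+1}, a)
    q[s, a] -= alpha * (q[s, a] - td_target)       # update only (s_t, a_t)
    return q
```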
Algorithm: on-policy Q-Learning
Initialization: $\alpha_t(s, a) = \alpha > 0$ for all $(s, a)$ and all $t$; $\epsilon \in (0, 1)$; initial $q_0(s, a)$ for all $(s, a)$.
Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.
For each episode, do
if $s_t$ is not the target state, do
Collect an experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$: generate $a_t$ following $\pi_t(s_t)$, then observe $r_{t+1}$ and $s_{t+1}$ by interacting with the environment.
Update q-value for $(s_t, a_t)$
- $q_{t+1}(s_t, a_t) = q_t(s_t, a_t)-\alpha_t(s_t, a_t)[q_t(s_t, a_t)-(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})}q_t(s_{t+1}, a))]$
Update policy for $s_t$:
- Use the $\epsilon$-greedy policy
Algorithm: off-policy Q-Learning
Initialization: $\alpha_t(s, a) = \alpha > 0$ for all $(s, a)$ and all $t$; $\epsilon \in (0, 1)$; initial $q_0(s, a)$ for all $(s, a)$; a behavior policy $\pi_b$.
Goal: Learn an optimal target policy $\pi_T$ for all states from the experience samples generated by $\pi_b$.
For each episode generated by $\pi_b$, do
if $s_t$ is not the target state, do
Take the experience sample $(a_t, r_{t+1}, s_{t+1})$ at $s_t$ from the episode
Update q-value for $(s_t, a_t)$
- $q_{t+1}(s_t, a_t) = q_t(s_t, a_t)-\alpha_t(s_t, a_t)[q_t(s_t, a_t)-(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})}q_t(s_{t+1}, a))]$
Update the target policy for $s_t$:
- $\pi_T(a|s_t) = 1$ if $a = \arg\max_{a'} q_{t+1}(s_t, a')$, and $0$ otherwise (greedy; the target policy needs no exploration because the data come from $\pi_b$)
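Finally, a minimal sketch of the off-policy variant, assuming the episodes were already generated by some behavior policy $\pi_b$ and stored as lists of $(s, a, r, s')$ transitions (the data layout and names are assumptions):

```python
import numpy as np

def off_policy_q_learning(episodes, num_states, num_actions, alpha=0.1, gamma=0.9):
    """Learn optimal action values from experience generated by a behavior policy.

    `episodes` is assumed to be a list of episodes, each a list of
    (s, a, r, s_next) transitions collected under pi_b.
    """
    q = np.zeros((num_states, num_actions))
    for episode in episodes:
        for s, a, r, s_next in episode:
            td_target = r + gamma * np.max(q[s_next])   # max over next actions
            q[s, a] -= alpha * (q[s, a] - td_target)
    pi_target = np.argmax(q, axis=1)                    # greedy target policy
    return q, pi_target
```

Because the TD target bootstraps on $\max_a q$, the update does not depend on how the behavior policy chose its actions, which is what makes the off-policy setting possible.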