Value Function Approximation
For large-scale state and action spaces, the state/action values are represented as a parameterized function rather than a table; from this chapter on, neural networks can be used as the approximator. Function approximation raises two problems: 1) which feature vector (or function structure) should be chosen, and 2) which parameter vector best approximates the true values. The general approach is to:
- Select an objective function that defines the optimal parameter $w$.
- Derive the gradient of the objective function.
- Apply a gradient-based algorithm to solve the resulting optimization problem.
Define Objective Function
$$
J(w) = \mathbb{E}[(v_{\pi}(S)- \hat v(S, w))^2]
$$
where $v_{\pi}(s)$ is the true state value and $\hat v(s, w)$ is the approximated state value. Our goal is to find the optimal $w$ that best approximates $v_\pi(s)$ for every $s$.
- What is the probability distribution of $S$ in the expectation? There are two common choices.
The first choice is the uniform distribution, which treats all states as equally important and ignores the long-term behavior of the Markov process:
$$
J(w) = \frac{1}{|\mathcal{S}|}\sum_{s\in \mathcal{S}}(v_{\pi}(s)- \hat{v}(s, w))^2
$$
The second choice is the stationary distribution, which captures the long-term behavior of the MDP.
Let $\{d_\pi(s)\}_{s\in \mathcal S}$ denote the stationary distribution of the Markov process under policy $\pi$. Then
$$
J(w) = \sum_{s\in \mathcal{S}}d_\pi(s)[(v_{\pi}(s)- \hat{v}(s, w))^2]
$$
To obtain $d_\pi(s)$ we need the state transition probability matrix $P_\pi$, and $d_\pi$ can be calculated from
$$
d_\pi^T = d_\pi^T P_\pi
$$
Here $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue 1, and the solution of this equation is called the stationary distribution.
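As a quick numerical check (not part of the original notes), the sketch below computes $d_\pi$ from a made-up $3\times 3$ transition matrix by finding the left eigenvector of $P_\pi$ associated with eigenvalue 1, assuming NumPy is available.

```python
# Minimal sketch: computing the stationary distribution d_pi from a known
# transition matrix P_pi. The 3x3 matrix below is a made-up example.
import numpy as np

P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
])  # P_pi[s, s'] = probability of moving from s to s' under policy pi

# d_pi^T = d_pi^T P_pi  <=>  d_pi is the left eigenvector of P_pi associated
# with eigenvalue 1, i.e. a right eigenvector of P_pi^T.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()                 # normalize to a probability distribution

print(d_pi)            # stationary distribution
print(d_pi @ P_pi)     # should equal d_pi (up to numerical error)
```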
Minimize the Objective Function
We can apply the gradient descent algorithm:
$$
w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)
$$
$$
\nabla_w J (w_k) = -2 \mathbb{E}[(v_\pi(S) - \hat{v}(S, w_k))(\nabla_w \hat{v}(S, w_k))]
$$
$$
w_{k+1} = w_k + 2\alpha_k \mathbb{E}[(v_\pi(S) - \hat{v}(S, w_k))(\nabla_w \hat{v}(S, w_k))]
$$
Since the gradient involves an expectation with respect to $S$, we can use stochastic gradient descent (SGD) instead, where the coefficient 2 is absorbed into the step size $\alpha_t$:
$$
w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat{v}(s_t, w_t))(\nabla_w \hat{v}(s_t, w_t))
$$
However, the true value $v_\pi(s_t)$ is unknown, so it must be replaced by an estimate.
Monte Carlo (MC) method: the discounted return $g_t$ can be used as an approximation of $v_\pi(s_t)$:
$$
w_{t+1} = w_t + \alpha_t(g_t- \hat{v}(s_t, w_t))\nabla_w\hat{v}(s_t, w_t)
$$
Temporal-difference (TD) method: replace $v_\pi(s_t)$ with the bootstrapped target $r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t)$:
$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)
$$
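As a concrete illustration (a standard special case, assuming linear features rather than anything specified above): with $\hat v(s, w) = \phi(s)^\top w$ for a feature vector $\phi(s)$, the gradient is $\nabla_w \hat{v}(s, w) = \phi(s)$, and the TD update reduces to
$$
w_{t+1} = w_t + \alpha_t\big(r_{t+1} + \gamma\, \phi(s_{t+1})^\top w_t - \phi(s_t)^\top w_t\big)\phi(s_t),
$$
which is commonly referred to as TD-Linear.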
Algorithm: TD learning of state values with function approximation
Initialization: A function $\hat{v}(s, w)$ that is differentiable in $w$. Initial parameter $w_0$.
Goal: Learn the true state values of a given policy $\pi$
For each episode $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by $\pi$, do
- For each sample $(s_t, r_{t+1}, s_{t+1})$, do
- In the general case, $w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)$
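A minimal sketch of this algorithm with a linear approximator. The environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done)`) and the fixed `policy(s)` being evaluated are assumptions for illustration; one-hot features keep the example concrete, but any differentiable $\hat v$ would do.

```python
# Minimal sketch: TD(0) learning of state values with linear function approximation.
# Assumes a hypothetical environment `env` with reset()/step() and a fixed policy
# `policy(s)` to be evaluated; the feature map is a one-hot encoding for illustration.
import numpy as np

def one_hot(s, n_states):
    """Feature vector phi(s); with one-hot features this reduces to tabular TD."""
    phi = np.zeros(n_states)
    phi[s] = 1.0
    return phi

def td_value_approximation(env, policy, n_states, gamma=0.9, alpha=0.05, n_episodes=500):
    w = np.zeros(n_states)                       # parameter vector w_0
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi, phi_next = one_hot(s, n_states), one_hot(s_next, n_states)
            v, v_next = phi @ w, phi_next @ w    # \hat v(s_t, w_t), \hat v(s_{t+1}, w_t)
            target = r + gamma * v_next * (not done)
            w += alpha * (target - v) * phi      # gradient of \hat v is phi(s_t)
            s = s_next
    return w                                      # \hat v(s, w) = phi(s)^T w
```

Replacing the bootstrapped target with the episode return $g_t$ would give the Monte Carlo variant of the same update.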
Sarsa
$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)
$$
Algorithm: SARSA with function approximation
Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$.
Goal: Learn an optimal policy that can lead the agent to the target from an initial state $s_0$.
For each episode, do
Generate $a_0$ at $s_0$ following $\pi_0 (s_0)$
If $s_t$ $(t=0,1,2,\cdots)$ is not the target state, do
Collect the experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$
by interacting with the environment, and generate $a_{t+1}$ following $\pi_t(s_{t+1})$.
Update q-value
- $w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)$
Update policy
- Update $\pi_{t+1}(s_t)$ to be the $\epsilon$-greedy policy with respect to $\hat{q}(s_t, \cdot, w_{t+1})$
- $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
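A sketch of the Sarsa loop above under the same illustrative assumptions as before (hypothetical `env` with `reset()`/`step()`, NumPy available, one-hot features over state-action pairs).

```python
# Minimal sketch: Sarsa with linear function approximation and an epsilon-greedy policy.
# `env` is a hypothetical environment with reset()/step(); features are one-hot over (s, a).
import numpy as np

def feat(s, a, n_states, n_actions):
    phi = np.zeros(n_states * n_actions)
    phi[a * n_states + s] = 1.0
    return phi

def epsilon_greedy(s, w, n_states, n_actions, eps):
    q = [feat(s, a, n_states, n_actions) @ w for a in range(n_actions)]
    return np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(q))

def sarsa_fa(env, n_states, n_actions, gamma=0.9, alpha=0.1, eps=0.1, n_episodes=500):
    w = np.zeros(n_states * n_actions)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s, w, n_states, n_actions, eps)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next, w, n_states, n_actions, eps)   # on-policy action
            phi = feat(s, a, n_states, n_actions)
            q = phi @ w
            q_next = feat(s_next, a_next, n_states, n_actions) @ w
            target = r + gamma * q_next * (not done)
            w += alpha * (target - q) * phi      # nabla_w q_hat = phi(s, a)
            s, a = s_next, a_next
    return w
```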
Q-Learning with function approximation
$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})}\hat {q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)
$$
Algorithm: on-policy Q-Learning with function approximation
Initialization: $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0, 1)$. Initial parameter vector $w_0$.
Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.
For each episode, do
if $s_t$ is not the target state, do
Collect the experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$
Update q-value for $(s_t, a_t)$
- $w_{t+1} = w_t + \alpha_t(r_{t+1}+ \gamma \max_{a\in\mathcal{A}(s_{t+1})}\hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)$
Update policy for $s_t$:
- Use the $\epsilon$-greedy policy
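For comparison, a sketch of the Q-learning variant under the same illustrative assumptions; the only substantive change from Sarsa is that the TD target bootstraps on the greedy action at $s_{t+1}$ rather than the action actually taken.

```python
# Minimal sketch: Q-learning with linear function approximation (one-hot (s, a) features).
# `env` is a hypothetical environment with reset()/step().
import numpy as np

def q_learning_fa(env, n_states, n_actions, gamma=0.9, alpha=0.1, eps=0.1, n_episodes=500):
    def feat(s, a):
        phi = np.zeros(n_states * n_actions)
        phi[a * n_states + s] = 1.0
        return phi

    w = np.zeros(n_states * n_actions)
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            q_s = [feat(s, a) @ w for a in range(n_actions)]
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(q_s))
            s_next, r, done = env.step(a)
            q_next_max = max(feat(s_next, b) @ w for b in range(n_actions))   # greedy bootstrap
            target = r + gamma * q_next_max * (not done)
            w += alpha * (target - feat(s, a) @ w) * feat(s, a)
            s = s_next
    return w
```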
Deep Q-Learning
The DQN aims to minimize the following objective function:
$$
J =\mathbb{E}[(R + \gamma \max_{a\in \mathcal{A}(S')}\hat{q}(S', a, w)-\hat{q}(S, A, w))^2]
$$
Recall the Bellman optimality equation in terms of action values:
$$
q(s, a) = \mathbb{E}\left[R_{t+1} + \gamma \max_{a' \in \mathcal{A}(S_{t+1})}q(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a\right], \quad \text{for all } s, a
$$
Here $(S, A, R, S')$ are random variables. Since the optimal action values satisfy the Bellman optimality equation, the objective function can be viewed as the squared Bellman optimality error.
There are two networks in DQN: a target network $\hat{q}(s, a, w_T)$, which is used to obtain the target value $y_T = r + \gamma \max_{a\in \mathcal{A}(s')}\hat{q}(s', a, w_T)$, and a main network $\hat{q}(s, a, w)$, which is updated to minimize the error with respect to this target. The objective function then becomes
$$
J = \mathbb{E}[(R+\gamma \max_{a\in\mathcal{A}(S')}\hat{q}(S', a, w_T)-\hat{q}(S, A, w))^2]
$$
where $w_T$ is the target network's parameter. When $w_T$ is fixed, the gradient of $J$ (omitting the constant coefficient, which is absorbed into the learning rate) is
$$
\nabla_w J = \mathbb{E}[(R + \gamma \max_{a\in \mathcal{A}(S')}\hat{q}(S', a, w_T)-\hat{q}(S, A, w))\nabla_w\hat{q}(S, A, w)].
$$
Algorithm: Deep Q-learning
Initialization: A main network and a target network with the same initial parameters.
Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy $\pi_b$.
Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s, a, r, s')\}$
For each iteration, do
Uniformly draw a mini-batch of samples from $\mathcal{B}$
For each sample $(s, a, r, s')$, calculate the target value $y_T = r + \gamma \max_{a\in \mathcal{A}(s')}\hat{q}(s', a, w_T)$ using the target network.
Update the main network to minimize the loss $(y_T - \hat{q}(s, a, w))^2$ over the mini-batch of samples.
Set $w_T = w$ every $C$ iterations.
The TD error also serves as a metric for measuring the convergence and performance of the model.
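A compact sketch of this training loop, assuming PyTorch is available, that states are feature vectors of length `state_dim`, and that a replay buffer of `(s, a, r, s', done)` tuples has already been filled by the behavior policy $\pi_b$; network sizes and hyperparameters are placeholders.

```python
# Minimal DQN sketch: main network + periodically synced target network,
# trained on mini-batches drawn uniformly from a replay buffer.
# `replay_buffer` is a hypothetical list of (s, a, r, s_next, done) tuples
# collected in advance by a behavior policy pi_b.
import random
import torch
import torch.nn as nn

def make_net(state_dim, n_actions):
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def train_dqn(replay_buffer, state_dim, n_actions,
              gamma=0.99, lr=1e-3, batch_size=32, n_iters=10_000, sync_every=100):
    q_main = make_net(state_dim, n_actions)         # \hat q(s, a, w)
    q_target = make_net(state_dim, n_actions)       # \hat q(s, a, w_T)
    q_target.load_state_dict(q_main.state_dict())   # same initial parameters
    optimizer = torch.optim.Adam(q_main.parameters(), lr=lr)

    for it in range(n_iters):
        batch = random.sample(replay_buffer, batch_size)   # uniform mini-batch
        s = torch.tensor([b[0] for b in batch], dtype=torch.float32)
        a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
        r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
        s_next = torch.tensor([b[3] for b in batch], dtype=torch.float32)
        done = torch.tensor([b[4] for b in batch], dtype=torch.float32)

        # Target value y_T = r + gamma * max_a q_target(s', a), with w_T frozen
        with torch.no_grad():
            y = r + gamma * (1.0 - done) * q_target(s_next).max(dim=1).values
        q_sa = q_main(s).gather(1, a.unsqueeze(1)).squeeze(1)

        loss = nn.functional.mse_loss(q_sa, y)      # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (it + 1) % sync_every == 0:
            q_target.load_state_dict(q_main.state_dict())   # w_T <- w
    return q_main
```

Freezing $w_T$ between synchronizations makes the target value $y_T$ a fixed regression target, so each mini-batch update is an ordinary supervised step on the squared TD error.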