
RL Note 6: Value Function Approximation

Value Function Approximation

For large-scale state and action spaces, the state/action values are represented as parameterized functions rather than tables; from this chapter on, neural networks can be used as the approximator. Two questions arise with function approximation: 1) which feature vector should be chosen, and 2) which parameter vector approximates the values best. The general procedure is:

  1. Select an objective function that defines what the optimal parameter $w$ is.
  2. Derive the gradient of the objective function.
  3. Apply a gradient-based algorithm to solve the resulting optimization problem.

Define Objective Function

$$
J(w) = \mathbb{E}[(v_{\pi}(S)- \hat v(S, w))^2]
$$

where $v_{\pi}(s)$ is the true state value and $\hat v(s, w)$ is the approximated state value. Our goal is to find the optimal $w$ such that $\hat v(s, w)$ best approximates $v_\pi(s)$ for every $s$.

  • What is the probability distribution of $S$?

The first choice is the uniform distribution, which ignores the long-term behavior of the Markov process:

$$
J(w) = \frac{1}{|\mathcal{S}|}\sum_{s\in \mathcal{S}}[(v_{\pi}(s)- \hat{v}(s, w))^2]
$$

The second choice is the stationary distribution, which captures the long-term behavior of the MDP.

Let $\{d_\pi(s)\}_{s\in \mathcal S}$ be the stationary distribution of the Markov process under policy $\pi$. Then

$$
J(w) = \sum_{s\in \mathcal{S}}d_\pi(s)[(v_{\pi}(s)- \hat{v}(s, w))^2]
$$

To obtain $d_\pi(s)$ we need the transition probability matrix $P_\pi$ of the Markov chain under $\pi$. The distribution $d_\pi$ can be obtained from
$$
d_\pi^T = d_\pi^T P_\pi
$$
Here $d_\pi$ is the left eigenvector of $P_\pi$ associated with the eigenvalue 1; the normalized solution of this equation is called the stationary distribution.
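
As a minimal sketch (using a small, hypothetical transition matrix), the stationary distribution can be computed numerically as the left eigenvector of $P_\pi$ associated with eigenvalue 1:

```python
import numpy as np

# Hypothetical 3-state transition matrix P_pi under a fixed policy (rows sum to 1).
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Left eigenvectors of P_pi are the (right) eigenvectors of P_pi^T.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
idx = np.argmin(np.abs(eigvals - 1.0))   # pick the eigenvalue closest to 1
d_pi = np.real(eigvecs[:, idx])
d_pi = d_pi / d_pi.sum()                 # normalize so the entries sum to 1

print(d_pi)                              # satisfies d_pi @ P_pi ≈ d_pi
```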

Minimize the Objective Function

We can apply the gradient descent algorithm:

$$
w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)
$$

$$
\nabla_w J (w_k) = -2 \mathbb{E}[(v_\pi(S) - \hat{v}(S, w_k))(\nabla_w \hat{v}(S, w_k))]
$$

$$
w_{k+1} = w_k + 2\alpha_k \mathbb{E}[(v_\pi(S) - \hat{v}(S, w_k))(\nabla_w \hat{v}(S, w_k))]
$$

Since the update involves an expectation over $S$, we can use stochastic gradient descent (SGD) instead:

$$
w_{t+1} = w_t + \alpha_t (v_\pi(s_t) - \hat{v}(s_t, w_t))(\nabla_w \hat{v}(s_t, w_t))
$$

However, the true value $v_\pi(s_t)$ is unknown, so it must be replaced by an estimate:

  • Monte Carlo method: the discounted return $g_t$ can be used as an approximation of $v_\pi(s_t)$ (a minimal sketch follows this list).

    $$
    w_{t+1} = w_t + \alpha_t(g_t- \hat{v}(s_t, w_t))\nabla_w\hat{v}(s_t, w_t)
    $$

  • Temporal-difference (TD) method: replace $v_\pi(s_t)$ with the bootstrapped target $r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t)$:

$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)
$$
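
A hedged sketch of the MC-based update, assuming a linear approximator $\hat v(s, w) = \phi(s)^T w$ with a hypothetical one-hot feature map `phi` and an already-collected episode:

```python
import numpy as np

def phi(s, dim=8):
    """Hypothetical feature map: one-hot features for a small finite state space."""
    x = np.zeros(dim)
    x[s] = 1.0
    return x

def mc_value_update(w, episode, alpha=0.05, gamma=0.9):
    """One pass of the MC-based update: use the return g_t as the target for v_pi(s_t).

    episode: list of (s_t, r_{t+1}) pairs generated by the policy pi.
    """
    g = 0.0
    # Walk the episode backwards so g accumulates the discounted return.
    for s_t, r_next in reversed(episode):
        g = r_next + gamma * g
        v_hat = phi(s_t) @ w                      # \hat v(s_t, w)
        w = w + alpha * (g - v_hat) * phi(s_t)    # gradient of a linear \hat v is phi(s_t)
    return w

# Usage: w = mc_value_update(np.zeros(8), episode=[(0, 0.0), (1, 0.0), (2, 1.0)])
```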

Algorithm: TD learning of state values with function approximation

Initialization: A function $\hat{v}(s, w)$ that is differentiable in $w$. Initial parameter $w_0$.

Goal: Learn the true state values of a given policy $\pi$

For each episode $\{(s_t, r_{t+1}, s_{t+1})\}_t$ generated by $\pi$, do

  • For each sample $(s_t, r_{t+1}, s_{t+1})$, do
    • In the general case, $w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t))\nabla_w \hat{v}(s_t, w_t)$
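
A minimal sketch of this algorithm, assuming a linear approximator $\hat v(s, w) = \phi(s)^T w$ and reusing the hypothetical feature map `phi` from the previous sketch:

```python
def td0_value_update(w, samples, alpha=0.05, gamma=0.9):
    """TD learning of state values with linear function approximation.

    samples: iterable of (s_t, r_{t+1}, s_{t+1}) transitions generated by pi.
    """
    for s_t, r_next, s_next in samples:
        v_hat = phi(s_t) @ w
        v_hat_next = phi(s_next) @ w
        td_error = r_next + gamma * v_hat_next - v_hat
        w = w + alpha * td_error * phi(s_t)   # nabla_w v_hat(s_t, w) = phi(s_t)
    return w
```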

Sarsa

$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)
$$

Algorithm: SARSA with function approximation

Initialization: Initial parameter $w_0$. Initial policy $\pi_0$. $\alpha_t = \alpha > 0$ for all $t$.

Goal: Learn an optimal policy that can lead the agent to the target from an initial state $s_0$.

For each episode, do

  • Generate $a_0$ at $s_0$ following $\pi_0 (s_0)$

  • If $s_t$ $(t=0,1,2,\cdots)$ is not the target state, do

    • Collect the experience sample $(r_{t+1}, s_{t+1}, a_{t+1})$ given $(s_t, a_t)$: generate $r_{t+1}, s_{t+1}$ by interacting with the environment; generate $a_{t+1}$ following $\pi_t(s_{t+1})$.

    • Update q-value

      • $w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \hat {q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)$
    • Update policy

      • Update $\pi_{t+1}(s_t)$ to be $\epsilon$-greedy with respect to $\hat{q}(s_t, \cdot, w_{t+1})$
      • $s_t \leftarrow s_{t+1}$, $a_t \leftarrow a_{t+1}$
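
A hedged sketch of one Sarsa step with a linear action-value approximator $\hat q(s, a, w) = \phi(s, a)^T w$; the feature map `phi_sa` and the environment interface `env.step` are hypothetical stand-ins:

```python
import numpy as np

def epsilon_greedy(w, s, actions, phi_sa, eps=0.1):
    """Pick an action: random with probability eps, otherwise greedy w.r.t. q_hat(s, ., w)."""
    if np.random.rand() < eps:
        return np.random.choice(actions)
    q_values = [phi_sa(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_step(w, s, a, env, actions, phi_sa, alpha=0.05, gamma=0.9, eps=0.1):
    """One Sarsa update with linear function approximation."""
    r, s_next = env.step(s, a)                         # hypothetical environment interface
    a_next = epsilon_greedy(w, s_next, actions, phi_sa, eps)
    td_target = r + gamma * (phi_sa(s_next, a_next) @ w)
    td_error = td_target - phi_sa(s, a) @ w
    w = w + alpha * td_error * phi_sa(s, a)            # nabla_w q_hat(s, a, w) = phi_sa(s, a)
    return w, s_next, a_next
```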

Q-Learning with function approximation

$$
w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})}\hat {q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)
$$

Algorithm: on-policy Q-Learning with function approximation

Initialization: $\alpha_t = \alpha > 0$ for all $t$. $\epsilon \in (0, 1)$. Initial parameter $w_0$ for $\hat{q}(s, a, w)$.

Goal: Learn an optimal path that can lead the agent to the target state from an initial state $s_0$.

For each episode, do

  • If $s_t$ is not the target state, do

    • Collect the experience sample $(a_t, r_{t+1}, s_{t+1})$ given $s_t$

    • Update q-value for $(s_t, a_t)$

      $w_{t+1} = w_t + \alpha_t(r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})}\hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t))\nabla_w \hat{q}(s_t, a_t, w_t)$
    • Update policy for $s_t$:

      • Update $\pi_t(s_t)$ to be $\epsilon$-greedy with respect to $\hat{q}(s_t, \cdot, w_{t+1})$
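
For comparison with the Sarsa sketch above, a minimal Q-learning step under the same hypothetical `phi_sa`, `epsilon_greedy`, and `env.step` interfaces; the only change is that the target bootstraps with a max over actions:

```python
def q_learning_step(w, s, env, actions, phi_sa, alpha=0.05, gamma=0.9, eps=0.1):
    """One Q-learning update with linear function approximation (on-policy version)."""
    a = epsilon_greedy(w, s, actions, phi_sa, eps)      # behave epsilon-greedily
    r, s_next = env.step(s, a)
    q_next_max = max(phi_sa(s_next, a_next) @ w for a_next in actions)
    td_error = r + gamma * q_next_max - phi_sa(s, a) @ w
    w = w + alpha * td_error * phi_sa(s, a)
    return w, s_next
```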

Deep Q-Learning

The DQN aims to minimize the following objective function:

$$
J = \mathbb{E}[(R + \gamma \max_{a\in \mathcal{A}(S')}\hat{q}(S', a, w)-\hat{q}(S, A, w))^2]
$$

where $(S, A, R, S')$ are random variables. This objective can be viewed as the squared Bellman optimality error, because the optimal action value satisfies the Bellman optimality equation

$$
q(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a' \in \mathcal{A}(S_{t+1})}q(S_{t+1}, a') \mid S_t = s, A_t = a], \quad \text{for all } s, a.
$$

DQN uses two networks: a target network $\hat{q}(s, a, w_T)$, which is used to compute the target value $y_T = r + \gamma \max_{a\in \mathcal{A}(s')}\hat{q}(s', a, w_T)$, and a main network $\hat{q}(s, a, w)$, whose parameter $w$ is trained to minimize the error between $\hat{q}(s, a, w)$ and $y_T$. With the target network, the objective becomes

$$
J = \mathbb{E}[(R+\gamma \max_{a\in\mathcal{A}(S')}\hat{q}(S', a, w_T)-\hat{q}(S, A, w))^2]
$$

where $w_T$ is the target network's parameter. When $w_T$ is fixed, the gradient of $J$ (dropping the constant coefficient, which can be absorbed into the learning rate) is

$$
\nabla_w J = \mathbb{E}[(R + \gamma \max_{a\in \mathcal{A}(S')}\hat{q}(S', a, w_T)-\hat{q}(S, A, w))\nabla_w\hat{q}(S, A, w)].
$$

Algorithm Deep Q-learning

Initialization: A main network and a target network with the same initial parameter.

Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy $\pi_b$.

Store the experience samples generated by $\pi_b$ in a replay buffer $\mathcal{B} = \{(s, a, r, s')\}$

  • For each iteration, do

    • Uniformly draw a mini-batch of samples from $\mathcal{B}$

    • For each sample $(s, a, r, s’)$, calculate the target value $y_T$.

    • Update the main network's parameter $w$ to minimize the loss $(y_T - \hat{q}(s, a, w))^2$ over the mini-batch of samples

  • Set $w_T = w$ every $C$ iterations.

The TD error is minimized during training, and it also serves as a metric to monitor the convergence and performance of the model.
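
A hedged, minimal sketch of this training loop in PyTorch, assuming a small fully connected network, hypothetical dimensions `state_dim`/`n_actions`, and a pre-filled replay buffer `buffer` of $(s, a, r, s')$ tuples collected by the behavior policy $\pi_b$; it shows the target-network bootstrapping and the periodic synchronization $w_T \leftarrow w$:

```python
import copy, random
import torch
import torch.nn as nn

state_dim, n_actions, gamma, C = 4, 2, 0.99, 100   # hypothetical sizes

q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))  # main network
target_net = copy.deepcopy(q_net)                  # target network, same initial parameters
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train(buffer, iterations=1000, batch_size=32):
    """buffer: list of (s, a, r, s_next) tuples generated by the behavior policy pi_b."""
    for it in range(iterations):
        batch = random.sample(buffer, batch_size)          # uniformly draw a mini-batch
        s, a, r, s_next = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        a = a.long()
        with torch.no_grad():                              # target value y_T from the target network
            y_T = r + gamma * target_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, y_T)           # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (it + 1) % C == 0:                              # set w_T = w every C iterations
            target_net.load_state_dict(q_net.state_dict())
```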

-------------Thank you for reading!-------------

Welcome to my other publishing channels