Actor-Critic Method
It is a combination of value-based and policy-based methods, but it still falls within the scope of policy gradient algorithms. The actor performs the policy-update step, while the critic performs the value-update step.
$$
\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t) = \theta_t + \alpha\, \mathbb{E}_{S\sim \eta,\, A \sim \pi(S, \theta)}\big[\nabla_\theta \ln \pi (A|S, \theta )\, q_\pi(S, A)\big]
$$
Replacing the expectation with a sample gives the stochastic-gradient version
$$
\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi (a_t|s_t, \theta_t)\, q_t(s_t, a_t)
$$
When $q_t(s_t, a_t)$ is estimated by TD learning, the algorithm is called actor-critic. This update combines value-based and policy-based ideas: the policy $\pi(a|s,\theta)$ is updated by gradient ascent (the actor), while the value $q_\pi$ must be estimated (the critic).
- The critic corresponds to the value update step via the Sarsa algorithm.
- The actor corresponds to the policy update step.
Since $q_t(s_t, a_t)$ is unknown, the methods can be classified by the approach used to approximate $q_t(s_t, a_t)$:
- If we use Monte Carlo (MC) learning, the method is called REINFORCE.
- If we use temporal-difference (TD) learning, the method is called actor-critic; the sketch below contrasts the two estimates.
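As a quick illustration of the distinction, the snippet below contrasts the two ways of approximating $q_t(s_t, a_t)$; the rewards, discount factor, and value estimate are made-up assumptions, not from the text.

```python
import numpy as np

gamma = 0.9
rewards = np.array([1.0, 0.0, 2.0, 1.0])  # r_{t+1}, r_{t+2}, ... until the episode ends
v_next = 1.5                              # a critic's estimate of v(s_{t+1})

# REINFORCE: approximate q_t(s_t, a_t) by the discounted Monte Carlo return.
q_mc = float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# Actor-critic: approximate q_t(s_t, a_t) by a bootstrapped one-step TD target.
q_td = rewards[0] + gamma * v_next

print(f"MC estimate (REINFORCE): {q_mc:.3f}  TD target (actor-critic): {q_td:.3f}")
```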
Algorithm: QAC
Initialization: A policy function is initialized as $\pi(a|s, \theta_0)$; an action-value function is initialized as $q(s, a, w_0)$.
Goal: to maximize the objective function $J(\theta)$.
At time step $t$ in each episode, do
Generate $a_t$ by following $\pi(a_t|s_t,\theta_t)$, observe $s_{t+1}, r_{t+1}$, and generate $a_{t+1}$ by following $\pi(a_{t+1}|s_{t+1}, \theta_t)$.
Actor (Policy Update):
$$
\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta \ln \pi (a_t|s_t, \theta_t)\, q(s_t, a_t, w_t)
$$
Critic (Value Update):
$$
w_{t+1} = w_t + \alpha_w\big[r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t)\big]\nabla_w q(s_t, a_t, w_t)
$$
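A minimal sketch of the QAC loop with tabular parameters is given below. The small random MDP (`n_states`, `n_actions`, `P`, `R`), the step sizes, and the loop length are illustrative assumptions. With a tabular critic $q(s, a, w) = w[s, a]$, $\nabla_w q$ is a one-hot indicator, so the Sarsa update reduces to the familiar tabular form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
alpha_theta, alpha_w = 0.01, 0.05
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition probabilities (assumed MDP)
R = rng.normal(size=(n_states, n_actions))                        # rewards r(s, a) (assumed MDP)

theta = np.zeros((n_states, n_actions))   # actor parameters (softmax preferences)
w = np.zeros((n_states, n_actions))       # critic parameters: q(s, a, w) = w[s, a]

def pi(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
a = rng.choice(n_actions, p=pi(s))
for t in range(10_000):
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=pi(s_next))

    # Critic (value update): Sarsa; for a tabular q, grad_w q is one-hot at (s, a).
    td_target = r + gamma * w[s_next, a_next]
    w[s, a] += alpha_w * (td_target - w[s, a])

    # Actor (policy update): grad_theta ln pi(a|s) = onehot(a) - pi(.|s) for a softmax policy.
    grad_ln_pi = -pi(s)
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * grad_ln_pi * w[s, a]

    s, a = s_next, a_next
```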
Advantage Actor-Critic (A2C)
Baseline Invariance
$$
\mathbb{E}_{S\sim \eta,\, A \sim \pi(S, \theta)}\big[\nabla_\theta \ln \pi (A|S, \theta )\, q_\pi(S, A)\big] = \mathbb{E}_{S\sim \eta,\, A \sim \pi(S, \theta)}\big[\nabla_\theta \ln \pi (A|S, \theta )\,(q_\pi(S, A) - b(S))\big]
$$
The baseline $b(S)$ is a scalar function of $S$. Subtracting it does not change the expectation (baseline invariance), but it can reduce the approximation variance when we use samples to approximate the true gradient.
Writing $X(S, A) \doteq \nabla_\theta \ln \pi(A|S, \theta_t)\,(q_\pi(S, A) - b(S))$, the optimal baseline that minimizes $\operatorname{var}(X)$ is
$$
b^*(s) = \frac{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2\, q_\pi(s, A)\big]}{\mathbb{E}_{A \sim \pi}\big[\|\nabla_\theta \ln \pi(A|s, \theta_t)\|^2\big]}, \quad s\in \mathcal{S}
$$
However, this is complicated to compute, so in practice we usually use the simpler suboptimal baseline
$$
b(s) = \mathbb{E}_{A\sim\pi}[q_\pi(s, A)] = v_\pi(s)
$$
With this baseline, the gradient ascent becomes
$$
\theta_{t+1} = \theta_t + \alpha\, \mathbb{E}_{S\sim \eta,\, A \sim \pi(S, \theta)}\big[\nabla_\theta \ln \pi (A|S, \theta )\,(q_\pi(S, A) - v_\pi(S))\big] = \theta_t + \alpha\, \mathbb{E}_{S\sim \eta,\, A \sim \pi(S, \theta)}\big[\nabla_\theta \ln \pi (A|S, \theta )\,\delta_\pi(S, A)\big]
$$
where $\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)$ is called the advantage function. It can be approximated by the TD error:
$$
q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)
$$
Since we use the TD error, this method is also called TD actor-critic (TD-AC).
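A small numerical check of this claim is sketched below: for a single state with an assumed softmax policy and assumed $q_\pi(s, a)$ values, subtracting $v_\pi(s)$ leaves the sample mean of the score-function term (approximately) unchanged but reduces its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.6, 0.3, 0.1])     # pi(a|s, theta) at a single state (assumed)
q = np.array([1.0, 5.0, -2.0])        # q_pi(s, a) (assumed)
v = probs @ q                         # baseline b(s) = v_pi(s) = E_{A~pi}[q_pi(s, A)]

# Sample actions from pi; for a tabular softmax policy, grad_theta ln pi(a|s) = onehot(a) - pi(.|s).
a = rng.choice(3, size=100_000, p=probs)
scores = np.eye(3)[a] - probs

no_baseline = scores * q[a][:, None]
with_baseline = scores * (q[a] - v)[:, None]

print("means:", no_baseline.mean(0).round(3), with_baseline.mean(0).round(3))  # approximately equal
print("vars: ", no_baseline.var(0).round(3), with_baseline.var(0).round(3))    # smaller with baseline
```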
Algorithm: A2C
Initialization: A policy function is initialized as $\pi(a|s, \theta_0)$; a state-value function is initialized as $v(s, w_0)$, where $w_0$ is the initial parameter.
Goal: to maximize the objective function $J(\theta)$.
At time step $t$ in each episode, do
Generate $a_t$ by following $\pi(a_t|s_t,\theta_t)$, and observe $s_{t+1}, r_{t+1}$.
Advantage (TD error):
$\delta_t = r_{t+1} + \gamma v(s_{t+1}, w_t) - v(s_t, w_t)$
Actor (Policy Update):
$$
\theta_{t+1} =\theta_t + \alpha \delta_t \nabla_\theta \ln\pi(a_t|s_t, \theta_t)
$$
Critic (Value Update):
$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w v(s_t, w_t)
$$
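A minimal per-step A2C update under the same tabular assumptions as the QAC sketch (a softmax actor table `theta`, a state-value table `w`; all names are illustrative) might look like this.

```python
import numpy as np

def a2c_step(theta, w, s, a, r, s_next, gamma=0.9, alpha_theta=0.01, alpha_w=0.05):
    # Softmax policy over actions at state s (tabular preferences theta[s]).
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()

    # Advantage via the TD error: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t).
    delta = r + gamma * w[s_next] - w[s]

    # Actor (policy update): theta <- theta + alpha_theta * delta * grad ln pi(a|s).
    grad_ln_pi = -probs
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_ln_pi

    # Critic (value update): for a tabular v, grad_w v(s) is a one-hot indicator.
    w[s] += alpha_w * delta
    return theta, w

# Example usage on one made-up transition (s=0, a=1, r=1.0, s'=2):
theta, w = a2c_step(np.zeros((5, 3)), np.zeros(5), 0, 1, 1.0, 2)
```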
Off-Policy Actor-Critic
Importance Sampling
Our goal is to estimate $\mathbb{E}_{X\sim p_0}[X]$.
Suppose we have a random variable $X \in \mathcal{X}$ with probability distribution $p_0(X)$. If the samples $\{x_i\}_{i=1}^{n}$ are generated by following $p_0$, we can use the sample mean $\bar x$ to approximate $\mathbb{E}_{X\sim p_0}[X]$. However, what if the samples $\{x_i\}_{i=1}^{n}$ are not generated from $p_0$ but from another distribution $p_1$? In this case, we can use the importance sampling technique. Define $f(X) \doteq \frac{p_0(X)}{p_1(X)}X$; then
$$
\mathbb{E}_{X\sim p_0}[X] = \sum_{x \in \mathcal{X}} p_0(x)\,x = \sum_{x \in \mathcal{X}}p_1(x)\frac{p_0(x)}{p_1(x)}x = \mathbb{E}_{X\sim p_1}[f(X)]
$$
$$
\mathbb{E}_{X\sim p_0}[X] = \mathbb{E}_{X\sim p_1}[f(X)] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^n\frac{p_0(x_i)}{p_1(x_i)}x_i
$$
where $\frac{p_0(x_i)}{p_1(x_i)}$ is the importance weight.
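A quick numerical check of this identity, with assumed distributions $p_0$ and $p_1$ over a three-element set, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
x_vals = np.array([-1.0, 0.0, 2.0])
p0 = np.array([0.2, 0.5, 0.3])   # target distribution p0 (assumed)
p1 = np.array([0.5, 0.3, 0.2])   # behavior distribution p1 that generated the samples

true_mean = p0 @ x_vals

idx = rng.choice(3, size=100_000, p=p1)   # samples x_i drawn from p1, not p0
weights = p0[idx] / p1[idx]               # importance weights p0(x_i) / p1(x_i)
is_estimate = np.mean(weights * x_vals[idx])

print(f"E_p0[X] = {true_mean:.3f}, importance-sampling estimate = {is_estimate:.3f}")
```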
Off-Policy Policy Gradient Theorem
Let $\beta$ be the behavior policy that generates the samples. The objective is
$$
J(\theta) = \sum_{s\in \mathcal{S}} d_\beta(s)\,v_\pi(s) = \mathbb{E}_{S \sim d_\beta}[v_\pi(S)]
$$
and its gradient is
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim \rho,\, A\sim \beta}\Big[\frac{\pi(A|S, \theta)}{\beta(A|S)}\nabla_\theta \ln \pi(A|S, \theta)\,q_\pi(S, A)\Big]
$$
where $\frac{\pi(A|S, \theta)}{\beta(A|S)}$ is the importance weight and
$$
\rho (s) = \sum_{s'\in \mathcal{S}}d_{\beta}(s')\,\mathrm{Pr}_\pi(s|s')
$$
After adding a baseline to reduce the estimation variance, the gradient becomes
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S\sim \rho,\, A\sim \beta}\Big[\frac{\pi(A|S, \theta)}{\beta(A|S)}\nabla_\theta \ln \pi(A|S, \theta)\,(q_\pi(S, A)-v_\pi(S))\Big]
$$
The corresponding stochastic gradient-ascent update is
$$
\theta_{t+1} = \theta_{t} + \alpha_\theta\,\frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)}\,\delta_t\,\nabla_\theta \ln \pi(a_t|s_t, \theta_t)
$$
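A sketch of this importance-weighted actor update, assuming a tabular softmax target policy, a fixed tabular behavior policy `beta`, and a TD error `delta` supplied by the critic, is given below (all names are illustrative).

```python
import numpy as np

def off_policy_actor_step(theta, beta, s, a, delta, alpha_theta=0.01):
    # Target policy pi(.|s, theta): softmax over tabular preferences theta[s].
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()

    rho = probs[a] / beta[s, a]       # importance weight pi(a|s, theta) / beta(a|s)

    # grad_theta ln pi(a|s) for a softmax policy is onehot(a) - pi(.|s).
    grad_ln_pi = -probs
    grad_ln_pi[a] += 1.0

    theta[s] += alpha_theta * rho * delta * grad_ln_pi
    return theta

# Example usage: 5 states, 3 actions, a uniform behavior policy, and a made-up TD error.
theta = off_policy_actor_step(np.zeros((5, 3)), np.full((5, 3), 1 / 3), s=0, a=2, delta=0.7)
```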
Deterministic Actor-Critic
Here, $a = \mu(s, \theta)$ denotes a deterministic policy. The deterministic policy gradient is
$$
\nabla_\theta J(\theta) = \sum_{s\in \mathcal{S}} d_\mu(s)\,\nabla_\theta \mu(s)\,(\nabla_a q_\mu (s, a))\big|_{a = \mu(s)} = \mathbb{E}_{S \sim d_{\mu}}\big[\nabla_\theta \mu(S)\,(\nabla_a q_\mu(S, a))\big|_{a=\mu(S)}\big]
$$
The deterministic case is naturally off-policy and can effectively handle continuous action spaces.
The corresponding stochastic gradient-ascent algorithm is
$$
\theta_{t+1} = \theta_{t} +\alpha_\theta\nabla_\theta \mu(s_t)(\nabla_a q_\mu(s_t, a))|_{a=\mu(s_t)}
$$
DAC is naturally off-policy: the gradient involves an expectation over $S \sim d_\mu$ only, with no action sampled from the target policy, so the behavior policy that generates $a_t$ may differ from the target policy $\mu$. For the critic, the policy that generates $a_t$ by interacting with the environment is the behavior policy, while $\mu$ is the target policy that the critic aims to evaluate.
Algorithm: Deterministic policy gradient or deterministic actor-critic
Initialization: A given behavior policy $\beta(a|s)$; a deterministic target policy $\mu(s, \theta_0)$; a value function $q(s, a, w_0)$.
At time step $t$ in each episode, do
Generate $a_t$ by following $\beta(a_t|s_t)$, and observe $r_{t+1}, s_{t+1}$.
TD error:
$\delta_t = r_{t+1} + \gamma\, q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)$
Actor (Policy Update):
$$
\theta_{t+1} = \theta_{t} + \alpha_\theta\, \nabla_\theta \mu(s_t, \theta_t)\,\big(\nabla_a q(s_t, a, w_t)\big)\big|_{a=\mu(s_t, \theta_t)}
$$
Critic (Value Update):
$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)
$$
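A sketch of one deterministic actor-critic update is given below, assuming a linear deterministic policy $\mu(s, \theta) = \theta_0 + \theta_1 s$ and a critic that is linear in the features $[s, a, sa, a^2]$; the transition values and all parameter names are illustrative assumptions, not part of the algorithm above.

```python
import numpy as np

def dpg_step(theta, w, s, a_t, r, s_next, gamma=0.9, alpha_theta=1e-3, alpha_w=1e-2):
    mu = lambda state: theta[0] + theta[1] * state              # deterministic target policy
    feats = lambda state, a: np.array([state, a, state * a, a * a])
    q = lambda state, a: w @ feats(state, a)                    # critic q(s, a, w), linear in features
    dq_da = lambda state, a: w[1] + w[2] * state + 2.0 * w[3] * a

    # Critic (value update): the TD target bootstraps with the target policy's action mu(s').
    delta = r + gamma * q(s_next, mu(s_next)) - q(s, a_t)
    w = w + alpha_w * delta * feats(s, a_t)

    # Actor (policy update): theta <- theta + alpha * grad_theta mu(s) * (dq/da)|_{a=mu(s)}.
    grad_theta_mu = np.array([1.0, s])
    theta = theta + alpha_theta * grad_theta_mu * dq_da(s, mu(s))
    return theta, w

# Example usage with a single made-up transition generated by some behavior policy beta:
theta, w = dpg_step(np.zeros(2), np.zeros(4), s=1.0, a_t=0.3, r=-1.0, s_next=0.9)
```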