Monte Carlo Estimation
This is the first model-free estimation algorithm we introduce: instead of relying on a model, it relies on data. Mean estimation is the core of MC estimation, which uses stochastic experience samples to solve the estimation problem.
Mean Estimation
Consider a random variable $X$ that takes values from a finite set $\mathcal X$. There are two approaches to estimating $\mathbb{E}[X]$: model-based and model-free. $\mathbb{E}[X]$ is also called the expected value, mean, or average.
model-based
$$
\mathbb{E}[X] = \sum_{x \in \mathcal{X}} p(x) x
$$
model-free (the samples $\{x_j\}_{j=1}^n$ must be i.i.d.)
$$
\mathbb{E}[X] \approx \bar x = \frac{1}{n} \sum_{j=1}^{n} x_j,
\qquad \bar x \rightarrow \mathbb{E}[X] \text{ as } n\rightarrow\infty
$$
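A minimal Python sketch of the two approaches, using a made-up distribution over a small finite set:

```python
import random

# Hypothetical distribution over a finite set X (values and probabilities are illustrative).
values = [1, 2, 3, 4]
probs  = [0.1, 0.2, 0.3, 0.4]

# Model-based: E[X] = sum_x p(x) * x, computable only when p(x) is known.
mean_model_based = sum(p * x for p, x in zip(probs, values))

# Model-free: draw n i.i.d. samples and average them; the sample mean
# converges to E[X] as n grows (law of large numbers).
n = 100_000
samples = random.choices(values, weights=probs, k=n)
mean_model_free = sum(samples) / n

print(mean_model_based, mean_model_free)  # the two numbers should be close
```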
MC Basic
It’s important to know the fundamental idea of MC-based Reinforcement Learning.
In the model-based case, the action value is:
$$
q_\pi (s, a) = \mathbb{E}[G_t|S_t=s, A_t=a] = \sum_{r\in \mathcal{R}}p(r|s, a)r + \gamma \sum_{s'\in \mathcal{S}}p(s'|s, a)v_\pi(s')
$$
In the model-free case, the action value is expressed directly as the expected return when starting from $(s, a)$:
$$
q_\pi (s, a) = \mathbb{E}[G_t|S_t=s, A_t=a]=\mathbb{E}[R_{t+1}+ \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots |S_t=s, A_t=a]
$$
Starting from $(s, a)$, the agent can interact with the environment by following policy $\pi_k$ and obtain a number of episodes. Suppose there are $n$ episodes and the return of the $i$th episode is $g_{\pi_k}^{(i)}(s, a)$; then $q_{\pi_k}(s, a)$ can be approximated as
$$
q_{\pi_k} (s, a) = \mathbb{E}[G_t|S_t=s, A_t=a] \approx \frac{1}{n}\sum_{i=1}^n g_{\pi_k}^{(i)}(s, a)
$$
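As a small illustration, assuming the reward sequences of a few episodes starting from $(s, a)$ have already been collected (the numbers below are made up), the sample-average estimate can be computed as:

```python
# Hypothetical reward sequences, each from one episode starting from (s, a) and following pi_k.
episodes_rewards = [
    [0, 0, 1],      # rewards r_{t+1}, r_{t+2}, ... of episode 1
    [0, -1, 0, 1],  # episode 2
    [1],            # episode 3
]
gamma = 0.9  # discount factor

def discounted_return(rewards, gamma):
    """Compute g = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# q_{pi_k}(s, a) is approximated by the average return over the n episodes.
returns = [discounted_return(rs, gamma) for rs in episodes_rewards]
q_estimate = sum(returns) / len(returns)
print(q_estimate)
```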
Algorithm
Initialization: Initial guess $\pi_0$.
Goal: Search for an optimal policy.
Loop: For the $k$th iteration, do
- For every state $s \in \mathcal{S}$ in the state space, do
  - For every action $a \in \mathcal{A}$ in the action space, do
    - Collect sufficiently many episodes starting from $(s, a)$ by following $\pi_k$
    - Policy evaluation:
      - $q_{\pi_k} (s,a) \approx q_k(s, a) =$ the average return of all the episodes starting from $(s, a)$
  - Policy improvement:
    - $a_k^*(s) = \arg \max_a q_k(s, a)$
    - $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$ and $\pi_{k+1}(a|s)=0$ otherwise
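A minimal sketch of MC Basic in Python, under the assumption of a hypothetical episode sampler `sample_episode(policy, s, a, length)` that returns the reward sequence of one episode starting from $(s, a)$ and then following `policy`; this helper and all parameter defaults are illustrative, not from any real library:

```python
from collections import defaultdict

def mc_basic(states, actions, sample_episode, gamma=0.9,
             num_iterations=10, episodes_per_pair=30, episode_len=20):
    """MC Basic sketch.

    sample_episode(policy, s, a, length) is an assumed helper that returns the
    reward list of one episode starting from (s, a) and then following policy.
    """
    # Start from an arbitrary deterministic initial policy pi_0.
    policy = {s: actions[0] for s in states}

    for _ in range(num_iterations):
        q = defaultdict(float)
        for s in states:
            # Policy evaluation: estimate q_{pi_k}(s, a) by averaging returns.
            for a in actions:
                returns = []
                for _ in range(episodes_per_pair):
                    rewards = sample_episode(policy, s, a, episode_len)
                    g = 0.0
                    for r in reversed(rewards):
                        g = r + gamma * g
                    returns.append(g)
                q[(s, a)] = sum(returns) / len(returns)
            # Policy improvement: greedy action for state s.
            policy[s] = max(actions, key=lambda a: q[(s, a)])
    return policy, q
```

Passing the episode sampler in as a parameter keeps the sketch environment-agnostic; any grid world or other finite MDP simulator could be plugged in.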
We also need to consider the episode length: it determines whether the agent can reach the target and receive a positive reward within an episode. If the episodes are too short, the returns cannot be positive, and neither can the estimated state values. The episode length therefore greatly impacts the final policy.
sparse reward
This refers to the scenario in which no positive reward can be obtained unless the target is reached. When the state space is large, sparse rewards degrade computational efficiency, because long episodes are needed before any positive reward is observed. One remedy is a non-sparse reward setting with an "attractive field": states other than the target also receive positive rewards that guide the agent toward it.
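As a small illustration, here are the two reward settings for a hypothetical 1-D grid world with the target at one end (all reward values are made up):

```python
TARGET = 9  # hypothetical target cell in a 1-D grid of 10 cells (states 0..9)

def sparse_reward(state):
    # Sparse: positive reward only at the target, zero everywhere else.
    return 1.0 if state == TARGET else 0.0

def non_sparse_reward(state):
    # "Attractive field": non-target states closer to the target also get
    # a small positive reward, guiding the agent toward it.
    if state == TARGET:
        return 1.0
    return 0.1 * state / TARGET
```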
MC Exploring Starts
Initial Visit
An episode is only used to estimate the action value of the initial state-action pair that the episode starts from. This is the strategy of “initial visit” used by MC Basic.
e.g., suppose there is an episode $s_1 \xrightarrow{a_2} s_3 \xrightarrow{a_1} \cdots$
With the initial-visit strategy, this episode is only used to estimate $q_\pi(s_1, a_2)$, which is a waste of samples: the many other state-action pairs visited along the episode are discarded.
first-visit
Only the first visit to each state-action pair in an episode is counted. The samples are approximately independent, so the variance is smaller in this case. However, each episode provides at most one sample for every $(s, a)$.
every-visit
Every visit to a state-action pair is counted. If an episode is sufficiently long to visit every state-action pair multiple times, this single episode is sufficient to estimate all action values. This is the best strategy in terms of sample efficiency: the number of samples is larger, which results in faster convergence, but the samples are highly correlated with each other.
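A minimal sketch contrasting first-visit and every-visit counting on a single hypothetical episode:

```python
from collections import defaultdict

# One hypothetical episode: (state, action, reward) triples for t = 0..T-1.
episode = [("s1", "a2", 0), ("s3", "a1", 0), ("s1", "a2", 1), ("s3", "a1", 0)]
gamma = 0.9

def visit_returns(episode, gamma, first_visit_only):
    """Collect the sampled returns attributed to each (s, a) pair."""
    returns = defaultdict(list)
    g = 0.0
    for t in range(len(episode) - 1, -1, -1):
        s, a, r = episode[t]
        g = r + gamma * g  # discounted return from step t onward
        # first-visit: keep the return only if (s, a) does not occur earlier.
        is_first = all((s, a) != (s2, a2) for s2, a2, _ in episode[:t])
        if first_visit_only and not is_first:
            continue
        returns[(s, a)].append(g)
    return returns

print(visit_returns(episode, gamma, first_visit_only=True))   # one return per pair
print(visit_returns(episode, gamma, first_visit_only=False))  # every visit counted
```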
There are also two strategies for updating the policy:
- In the policy evaluation step, collect all the episodes starting from the same state-action pair and use the average return of these episodes (the strategy used by MC Basic).
- Use the return of a single episode to approximate the corresponding action value and improve the policy in an episode-by-episode fashion.
Algorithm 5.2: MC Exploring Starts
Initialization: Initial Policy $\pi_0(a|s)$ and initial value $q(s, a)$ for all $(s, a)$. $\text{Returns}(s, a)=0$ and $\text{Nums}(s, a)=0$.
Goal: Search for an optimal policy.
Loop: For each episode, do
Episode generation: Select a starting state-action pair $(s_0, a_0)$, ensuring that every pair can possibly be selected (the exploring-starts condition). Following the current policy, generate an episode of length $T$.
Initialization for each episode: g ← 0.
For each step of the episode, $t = T-1, T-2, \cdots, 0$, do
$g \leftarrow \gamma g + r_{t+1}$
Returns $(s_t, a_t) \leftarrow$ Returns $(s_t, a_t) + g$
Nums $(s_t, a_t) \leftarrow$ Nums $(s_t, a_t) + 1$
Policy Evaluation:
$q(s_t, a_t) \leftarrow \text{Returns} (s_t, a_t) / \text{Nums} (s_t, a_t)$
Policy Improvement:
$\pi(a|s_t)=1$ if $a = \arg \max_{a'} q(s_t, a')$ and $\pi(a|s_t)=0$ otherwise
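A minimal sketch of Algorithm 5.2 in Python with every-visit updates, assuming a hypothetical helper `generate_episode(policy, s0, a0, length)` (not from any real library) that returns a list of `(state, action, reward)` triples, where the reward stored at step $t$ is $r_{t+1}$:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode, gamma=0.9,
                        num_episodes=1000, episode_len=50):
    """MC Exploring Starts sketch (every-visit version)."""
    policy  = {s: random.choice(actions) for s in states}   # initial pi_0
    q       = defaultdict(float)
    returns = defaultdict(float)   # Returns(s, a)
    nums    = defaultdict(int)     # Nums(s, a)

    for _ in range(num_episodes):
        # Exploring starts: every (s0, a0) pair has a chance to be selected.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(policy, s0, a0, episode_len)

        g = 0.0
        for s_t, a_t, r in reversed(episode):   # t = T-1, ..., 0
            g = gamma * g + r                   # g <- gamma*g + r_{t+1}
            returns[(s_t, a_t)] += g
            nums[(s_t, a_t)]    += 1
            # Policy evaluation.
            q[(s_t, a_t)] = returns[(s_t, a_t)] / nums[(s_t, a_t)]
            # Policy improvement: greedy with respect to the current q.
            policy[s_t] = max(actions, key=lambda a: q[(s_t, a)])
    return policy, q
```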
MC $\epsilon$-Greedy
A disadvantage of both MC Basic and MC Exploring Starts is that every state-action pair must be well explored before its action value can be well estimated. If an action is not explored, its value may be estimated inaccurately, and it may never be chosen even if it is the optimal one. The fundamental idea of MC $\epsilon$-greedy is to enhance exploration by balancing exploitation and exploration, which removes the need for exploring starts.
soft policy
A soft policy assigns a positive probability to every action at every state. With a soft policy, a single episode that is sufficiently long can visit every state-action pair many times. This, to some degree, automatically generates episodes that visit every state-action pair multiple times.
The policy description
The MC $\epsilon$-Greedy policy is a stochastic policy that has a higher chance of choosing the greedy action and the same nonzero probability of taking any other action.
$$ \pi(a \mid s) = \begin{cases} 1 - \dfrac{\epsilon}{|\mathcal{A}(s)|}\bigl(|\mathcal{A}(s)| - 1\bigr), & \text{for the greedy action}, \\[4pt] \dfrac{\epsilon}{|\mathcal{A}(s)|}, & \text{for the other } |\mathcal{A}(s)| - 1 \text{ actions}. \end{cases} $$
When $\epsilon = 0$, the policy is purely greedy; when $\epsilon = 1$, the probability of taking any action equals $\frac{1}{|\mathcal{A}(s)|}$, i.e., the policy is uniform.
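A minimal sketch of sampling an action from this $\epsilon$-greedy distribution (the q-values below are placeholders):

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """Sample an action under the epsilon-greedy distribution.

    q_values: dict mapping action -> estimated q(s, a) for the current state.
    """
    actions = list(q_values)
    greedy  = max(actions, key=q_values.get)
    n = len(actions)
    # Each non-greedy action gets eps/n; the greedy action gets the
    # remaining 1 - eps*(n-1)/n.
    probs = [1 - epsilon * (n - 1) / n if a == greedy else epsilon / n
             for a in actions]
    return random.choices(actions, weights=probs, k=1)[0]

# Example with placeholder q-values for one state.
print(epsilon_greedy_action({"up": 0.2, "down": 0.5, "left": 0.1}, epsilon=0.1))
```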
The policy improvement step now maximizes over the set of $\epsilon$-greedy policies $\Pi_\epsilon$:
$$
\pi_{k+1}(s) = \arg \max_{\pi \in \Pi_\epsilon} \sum_a \pi(a|s) q_{\pi_k}(s, a)
$$
where $\Pi_\epsilon$ denotes the set of all $\epsilon$-greedy policies with the given $\epsilon$. The solution is the $\epsilon$-greedy policy that assigns the largest probability to the greedy action:
$$ \pi_{k+1}(a | s) = \begin{cases} 1 - \dfrac{\epsilon}{|\mathcal{A}(s)|}\bigl(|\mathcal{A}(s)|-1\bigr), & a = a_k^*, \\ \dfrac{\epsilon}{|\mathcal{A}(s)|}, & a \neq a_k^* \end{cases} $$
where $a_k^* = \arg \max_a q_{\pi_k}(s, a)$.
Algorithm
Initialization: Initial Policy $\pi_0(a|s)$ and initial value $q(s, a)$ for all $(s, a)$. $\text{Returns}(s, a)=0$ and $\text{Nums}(s, a)=0$.
Goal: Search for an optimal policy.
Loop: For each episode, do
Episode generation: Select a starting state-action pair $(s_0, a_0)$; exploring starts are no longer required. Following the current policy, generate an episode of length $T$.
Initialization for each episode: g ← 0.
For each step of the episode, $t = T-1, T-2, \cdots, 0$, do
$g \leftarrow \gamma g + r_{t+1}$
Returns $(s_t, a_t) \leftarrow$ Returns $(s_t, a_t) + g$
Nums $(s_t, a_t) \leftarrow$ Nums $(s_t, a_t) + 1$
Policy Evaluation:
$q(s_t, a_t) \leftarrow \text{Returns} (s_t, a_t) / \text{Nums} (s_t, a_t)$
Policy Improvement:
Let $a^* = \arg \max_a q(s_t, a)$. The policy is
$$ \pi(a \mid s_t) = \begin{cases} 1 - \dfrac{\epsilon}{|\mathcal{A}(s_t)|}\bigl(|\mathcal{A}(s_t)| - 1\bigr), & a = a^*, \\[4pt] \dfrac{\epsilon}{|\mathcal{A}(s_t)|}, & a \neq a^* \end{cases} $$
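A minimal sketch of MC $\epsilon$-greedy in Python, reusing the same assumed `generate_episode(policy, s0, a0, length)` helper as in the MC Exploring Starts sketch; here the policy is stochastic and stored as per-state action probabilities:

```python
import random
from collections import defaultdict

def mc_epsilon_greedy(states, actions, generate_episode, epsilon=0.1,
                      gamma=0.9, num_episodes=1000, episode_len=100):
    """MC epsilon-greedy sketch (every-visit, episode-by-episode updates)."""
    n = len(actions)
    # pi_0: uniform probabilities over actions for every state.
    policy  = {s: {a: 1.0 / n for a in actions} for s in states}
    q       = defaultdict(float)
    returns = defaultdict(float)
    nums    = defaultdict(int)

    for _ in range(num_episodes):
        # Exploring starts are not required; any starting pair works.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(policy, s0, a0, episode_len)

        g = 0.0
        for s_t, a_t, r in reversed(episode):   # t = T-1, ..., 0
            g = gamma * g + r                   # g <- gamma*g + r_{t+1}
            returns[(s_t, a_t)] += g
            nums[(s_t, a_t)]    += 1
            q[(s_t, a_t)] = returns[(s_t, a_t)] / nums[(s_t, a_t)]
            # Policy improvement: epsilon-greedy with respect to q.
            a_star = max(actions, key=lambda a: q[(s_t, a)])
            for a in actions:
                policy[s_t][a] = (1 - epsilon * (n - 1) / n) if a == a_star else epsilon / n
    return policy, q
```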