
RL Note 4: Stochastic Approximation

Stochastic Approximation

There are two approaches to computing the mean $\bar x$: non-incremental (collect all samples first, then average) and incremental (update the estimate as each sample arrives). Stochastic approximation refers to the class of incremental methods for solving root-finding or optimization problems.

Mean Estimation Example

$$
\mathbb{E}[X] \approx \bar x = \frac{1}{n}\sum_{j=1}^{n}x_j, \quad n\rightarrow\infty, \quad \bar x \rightarrow \mathbb{E}[X]
$$

Suppose that

$$
w_{k+1} = \frac{1}{k}\sum_{i=1}^{k}x_i, \quad k = 1, 2,\cdots
$$

$$
w_k = \frac{1}{k-1}\sum_{i=1}^{k-1}x_i, \quad k=2, 3, \cdots
$$

$$
w_{k+1} = \frac{1}{k}\sum_{i=1}^{k}x_i = \frac{1}{k}(\sum_{i=1}^{k-1}x_i + x_k) = \frac{1}{k}((k-1)w_k+x_k)= w_k - \frac{1}{k}(w_k-x_k)
$$

Therefore, we obtain the iterative form

$$
w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)
$$

A more general form of this iteration, which closely resembles the RM algorithm below, is

$$
w_{k+1} = w_k - \alpha_k(w_k - x_k), \quad \alpha_k >0
$$

When $\alpha_k$ satisfies some mild conditions, $w_k \rightarrow \mathbb{E}[X]$ as $k \rightarrow \infty$.
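As a quick sanity check, the incremental update with $\alpha_k = 1/k$ reproduces the batch average exactly. A minimal sketch, where the Gaussian sample distribution is my choice for illustration:

```python
import random

def incremental_mean(samples):
    """Incremental mean estimation: w_{k+1} = w_k - (1/k) * (w_k - x_k)."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - (1.0 / k) * (w - x)  # alpha_k = 1/k reproduces the running average
    return w

random.seed(0)
samples = [random.gauss(5.0, 2.0) for _ in range(100_000)]  # E[X] = 5
print(incremental_mean(samples))  # matches the batch average of the samples
```

The estimate is updated once per sample, without storing all samples and re-averaging.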

Robbins-Monro Algorithm

The Robbins-Monro (RM) algorithm is one of the classic stochastic approximation algorithms (stochastic iterative algorithms) for solving root-finding or optimization problems. Stochastic Gradient Descent (SGD) is a special case of RM.

Consider a black-box system: only the input $w$ and a noisy output $\tilde{g}(w, \eta)$ are available, while the expression of the function $g$ itself is unknown. Our aim is to solve $g(w) = 0$ using only these noisy observations, where
$$
\tilde{g}(w, \eta) = g(w) + \eta
$$

The RM algorithm for this problem is

$$
w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, \cdots
$$

To build intuition, first consider the noise-free case $w_{k+1} = w_k - a_k g(w_k)$ with $g(w) = \tanh(w-1)$, whose root is $w^* = 1$. The iteration converges to $w^* = 1$:

  • If $w_k > w^*$, then $g(w_k)>0$, so $w_{k+1}<w_k$ and the iterate moves toward $w^*$.
  • If $w_k < w^*$, then $g(w_k)<0$, so $w_{k+1}>w_k$ and the iterate moves toward $w^*$.
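This noise-free iteration is easy to check numerically. A small sketch, where the constant step size $a = 0.5$ is my choice for illustration:

```python
import math

def rm_noise_free(w0, steps=100, a=0.5):
    """Noise-free RM update w_{k+1} = w_k - a * g(w_k) with g(w) = tanh(w - 1)."""
    w = w0
    for _ in range(steps):
        w = w - a * math.tanh(w - 1.0)
    return w

print(rm_noise_free(3.0))   # approaches the root w* = 1 from above
print(rm_noise_free(-2.0))  # approaches the root w* = 1 from below
```

Starting either above or below the root, the update always moves toward $w^* = 1$, matching the two cases above.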

RM theorem

If

  1. $0< c_1 \leq \nabla_w g(w)\leq c_2$ for all $w$;

  2. $\sum_{k=1}^\infty a_k = \infty$ and $\sum_{k=1}^{\infty}a_k^2 < \infty$;

  3. $\mathbb{E}[\eta_k|\mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2|\mathcal{H}_k] < \infty$, where $\mathcal{H}_k= \{ w_k, w_{k-1}, \cdots\}$,

then $w_k$ converges almost surely (with probability 1) to the root $w^*$.

The second condition, on the step sizes $a_k$, is particularly important, as

$$
w_{k+1} - w_k = -a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, \cdots
$$

  • $\sum_{k=1}^{\infty}a_k^2 < \infty$ implies $a_k \rightarrow 0$ as $k\rightarrow \infty$, so $w_{k+1} - w_k = -a_k \tilde{g}(w_k, \eta_k) \rightarrow 0$ and the iterates eventually settle down.
  • $\sum_{k=1}^\infty a_k = \infty$ means that $a_k$ should not shrink too fast. Summing the update over all steps gives $w_1 - w_\infty = \sum_{k=1}^{\infty}a_k \tilde{g}(w_k, \eta_k)$. If instead $\sum_{k=1}^\infty a_k < \infty$, the right-hand side would be bounded by some constant $b$, so $|w_1 - w_\infty| \leq b$ and the algorithm could never reach $w^*$ from an initial guess $w_1$ that is too far away.
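The effect of the step-size conditions can be seen in a small experiment. This sketch uses the mean-estimation form $\tilde g(w, \eta) = w - x$ (introduced in the example below) and compares a harmonic schedule with one that shrinks too fast; the specific schedules are my choice for illustration:

```python
import random

def rm_mean(samples, step):
    """RM iteration w_{k+1} = w_k - a_k * (w_k - x_k) for a given step schedule a_k."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        w = w - step(k) * (w - x)
    return w

random.seed(1)
samples = [random.gauss(10.0, 1.0) for _ in range(50_000)]  # E[X] = 10

good = rm_mean(samples, step=lambda k: 1.0 / k)  # sum a_k = inf: converges to E[X]
bad = rm_mean(samples, step=lambda k: 0.5 ** k)  # sum a_k < inf: stalls short of E[X]
print(good, bad)
```

With $a_k = 1/k$ the estimate reaches $\mathbb{E}[X] = 10$, while with $a_k = 0.5^k$ the total movement is bounded and the iterate gets stuck well below it.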

For example, to estimate a mean we can define the function $g(w) = w - \mathbb{E}[X]$, whose root is $w^* = \mathbb{E}[X]$. Given a sample $x$ of $X$ and an input $w$, the noisy observation is $\tilde{g}(w, \eta) = w - x$:

$\tilde{g}(w, \eta) = w - x = w - x + \mathbb{E}[X] - \mathbb{E}[X]= w - \mathbb{E}[X] + \mathbb{E}[X]- x = g(w) + \eta$

where $\eta = \mathbb{E}[X]- x$ and $g(w) = w - \mathbb{E}[X]$. Therefore, the RM algorithm for solving this problem is
$$
w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k(w_k - x_k)
$$

Stochastic Gradient Descent Algorithm

Consider the following optimization problem:

$$
\min_w J(w) = \mathbb{E}[f(w, X)]
$$

where $w$ is the parameter to be optimized and $X$ is a random variable; the expectation is taken with respect to $X$. In words, the goal is to find a parameter $w$ that minimizes the expected loss over all possible samples.

We can use gradient descent to find the optimizer $w^*$. The gradient of $\mathbb{E}[f(w, X)]$ is $\nabla_w \mathbb{E}[f(w, X)] = \mathbb{E}[\nabla_w f(w, X)]$ (gradient and expectation can be exchanged under mild conditions). The gradient descent update is

$$
w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]
$$

When the distribution of $X$ is unknown (the model-free case), we can approximate the expectation using a large number of i.i.d. samples $\{x_i\}_{i=1}^n$:

$$
\mathbb{E}[\nabla_w f(w_k, X)]\approx \frac{1}{n}\sum_{i=1}^{n}\nabla_wf(w_k, x_i)
$$

Thus, the gradient descent is

$$
w_{k+1} = w_k - \frac{\alpha_k}{n}\sum_{i=1}^{n}\nabla_wf(w_k, x_i)
$$
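A sketch of this full-batch variant, using the illustrative choice $f(w, x) = \tfrac12(w - x)^2$ (my choice, not from the note), so that $\nabla_w f(w, x) = w - x$ and the optimum is the sample mean:

```python
import random

def batch_gd(samples, steps=200, alpha=0.5):
    """Full-batch GD for f(w, x) = 0.5 * (w - x)^2, so grad f(w, x) = w - x."""
    w = 0.0
    n = len(samples)
    for _ in range(steps):
        grad_estimate = sum(w - x for x in samples) / n  # sample average of the gradient
        w = w - alpha * grad_estimate
    return w

random.seed(2)
samples = [random.gauss(3.0, 1.0) for _ in range(1_000)]  # E[X] = 3
print(batch_gd(samples))  # converges to the sample mean, approx 3
```

Note that every iteration sweeps over all $n$ samples, which is exactly the inefficiency that motivates SGD below.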

These batch-style algorithms require all samples to be collected in advance, which is inefficient. SGD instead updates with one sample at a time:

$$
w_{k+1} = w_k - \alpha_k\nabla_wf(w_k, x_k)
$$

SGD replaces the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ with the stochastic gradient $\nabla_w f(w_k, x_k)$. The two differ by a noise term:

$$
\nabla_wf(w_k, x_k) = \mathbb{E}[\nabla_w f(w_k, X)] + \big(\nabla_wf(w_k, x_k)- \mathbb{E}[\nabla_w f(w_k, X)]\big) = \mathbb{E}[\nabla_w f(w_k, X)] + \eta_k
$$

Therefore, the SGD update can be written as
$$
w_{k+1} = w_k - \alpha_k \mathbb{E}[\nabla_wf(w_k, X)]-\alpha_k \eta_k
$$
This is the same as regular GD, except for the extra noise term $\alpha_k \eta_k$.

Moreover, the noise has zero mean:
$$
\mathbb{E}[\eta_k] = \mathbb{E}[\nabla_wf(w_k, x_k)- \mathbb{E}[\nabla_w f(w_k, X)]] = \mathbb{E}_{x_k}[\nabla_wf(w_k, x_k)] - \mathbb{E}_X[\nabla_wf(w_k, X)] = 0
$$

because $x_k$ is sampled i.i.d. from the distribution of $X$, independently of $w_k$. The additional noise therefore does not prevent the iteration from converging.
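SGD is not limited to mean estimation. As an illustrative sketch (the objective here is my choice, not from the note), minimizing $J(w) = \mathbb{E}[|w - X|]$ with the stochastic (sub)gradient $\operatorname{sign}(w_k - x_k)$ drives $w_k$ toward the median of $X$:

```python
import math
import random

def sgd_median(samples):
    """SGD on f(w, x) = |w - x|; the stochastic (sub)gradient is sign(w - x)."""
    w = 0.0
    for k, x in enumerate(samples, start=1):
        grad = 1.0 if w > x else -1.0  # sign(w_k - x_k), one sample per update
        w = w - (1.0 / k) * grad
    return w

random.seed(4)
samples = [random.expovariate(1.0) for _ in range(200_000)]
print(sgd_median(samples))  # approaches the median of Exp(1), ln 2 ≈ 0.693
```

Each update touches a single sample, yet the iterate still converges because the noise has zero mean and the step sizes satisfy the RM conditions.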

Convergence Pattern of SGD

When $w_k$ is far from $w^*$, SGD behaves similarly to regular gradient descent. When $w_k$ is close to $w^*$, the convergence becomes more random, and the iterates fluctuate around $w^*$.

We can quantify this by the relative error between the stochastic gradient and the true gradient:

$$
\delta_k = \frac{|\nabla_wf(w_k, x_k)-\mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|}
$$

Since $w^*$ is the optimum, $\mathbb{E}[\nabla_w f(w^*, X)] = 0$, so the denominator can be rewritten; by the mean value theorem there exists $\tilde{w}_k \in [w_k, w^*]$ such that

$$
\delta_k = \frac{|\nabla_wf(w_k, x_k)-\mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]-\mathbb{E}[\nabla_w f(w^*, X)]|} = \frac{|\nabla_wf(w_k, x_k)-\mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]|}
$$

If $f$ is strictly convex so that $\nabla_w^2 f \geq c > 0$, then

$$
|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]| \geq c|w_k - w^*|
$$

and therefore

$$
\delta_k \leq \frac{|\nabla_wf(w_k, x_k)-\mathbb{E}[\nabla_w f(w_k, X)]|}{c|w_k - w^*|}
$$

The denominator is proportional to the distance between $w_k$ and the optimal solution $w^*$: far from $w^*$, the relative error is small and SGD behaves like GD; close to $w^*$, the relative error can be large and the updates are dominated by noise.
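This pattern is easy to see numerically. Taking the illustrative choice $f(w, x) = \tfrac12(w - x)^2$ with $\mathbb{E}[X] = 0$ (so $w^* = 0$), the relative error reduces to $|x_k| / |w_k - w^*|$, which grows as $w_k$ approaches $w^*$:

```python
import random

random.seed(5)
x = random.gauss(0.0, 1.0)  # one sample; w* = E[X] = 0 for f(w, x) = 0.5 * (w - x)^2
errors = []
for dist in (10.0, 1.0, 0.1, 0.01):  # distance |w_k - w*|
    w = dist
    stochastic_grad = w - x  # grad of f at (w, x)
    true_grad = w - 0.0      # E[grad f] = w - E[X]
    errors.append(abs(stochastic_grad - true_grad) / abs(true_grad))

print(errors)  # relative error grows as w approaches w*
```

The same sample noise $|x|$ is negligible relative to the gradient when $w$ is far from $w^*$, but dominates it when $w$ is close.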
-------------Thank you for reading!-------------
