## Monte-Carlo RL

1. MC learns directly from episodes of experience.
2. MC is model-free: no knowledge of MDP transitions or rewards.
3. MC learns from complete episodes: no bootstrapping.
   1. Here bootstrapping means using one or more estimated values in the update step for the same kind of estimated value, as in $$SARSA(\lambda)$$ or $$Q(\lambda)$$.
   2. TD methods exploit the bootstrap; multi-step variants mix results from trajectories of different lengths.
4. MC uses the simplest possible idea: value = mean return.
5. 2 ways:
   1. model-free: no model necessary and still attains optimality
   2. simulated: needs only a simulation, not a full model
6. Caveat: can only apply MC to episodic MDPs.
   1. All episodes must terminate.

### policy evaluation

The goal is to learn $$v_\pi$$ from episodes of experience under policy $$\pi$$
$$S_1, A_1, R_2, \ldots, S_k \sim \pi$$
and the return is the total discounted reward: $$G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$$.

The value function is the expected return: $$v_\pi(s)=\mathbb{E}_{\pi}[G_t \mid S_t=s]$$
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits to a state within an episode are counted, MC methods divide into Every-Visit and First-Visit variants; both converge asymptotically.

### First-visit

#### proof of convergence

In fact it maintains the running mean $$\frac{S(s)}{N(s)}$$ to update $$V(s)$$: increment the visit counter $$N(s)$$, add the return to the total $$S(s)$$, and set $$V(s) = S(s)/N(s)$$. By the law of large numbers, $$V(s) \to v_\pi(s)$$ as $$N(s) \to \infty$$.
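A minimal sketch of first-visit MC policy evaluation in plain Python. The episode format, the helper name `first_visit_mc`, and $$\gamma = 1$$ are illustrative assumptions, not from the text:

```python
def first_visit_mc(episodes):
    """First-visit Monte-Carlo policy evaluation.

    `episodes` is a list of trajectories, each a list of (state, reward)
    pairs, where `reward` is the reward received after leaving `state`.
    gamma = 1 for brevity.  V(s) is maintained as S(s) / N(s).
    """
    N, S = {}, {}
    for episode in episodes:
        # Walk backwards so G accumulates the return following each step.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G += reward
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state in seen:          # only the FIRST visit counts
                continue
            seen.add(state)
            N[state] = N.get(state, 0) + 1
            S[state] = S.get(state, 0.0) + G
    return {s: S[s] / N[s] for s in N}
```

With more episodes, each estimate converges to the mean return from that state under the policy that generated the data.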

### blackjack (21 points)

In Sutton's book, policy evaluation on blackjack produces the familiar value-function surface plots.

### Every-Visit Monte-Carlo Policy Evaluation

The difference is that every visit to a state within an episode contributes to the average, not only the first.

### Incremental Mean

The mean $$\mu_1, \mu_2, \ldots$$ of a sequence $$x_1, x_2, \ldots$$ can be computed incrementally by

$$\mu_k = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$
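A quick sketch of the incremental computation (the helper name is ours):

```python
def incremental_mean(xs):
    """Running mean via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu, k = 0.0, 0
    means = []
    for x in xs:
        k += 1
        mu += (x - mu) / k      # no need to store the whole sequence
        means.append(mu)
    return means
```

Replacing `1/k` with a constant step size $$\alpha$$ gives the non-stationary update used later for TD.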

### Backup Diagram for Monte-Carlo

Similar to the bandit setting, the goal is to find the optimum under the explore/exploit dilemma, but with the entire rest of the episode included in the backup. Only the choice at each state is considered, and MC does not bootstrap (unlike DP).

Time required to estimate one state does not depend on the total number of states.

## Temporal-Difference Learning

### Bootstrapping

Folk saying: to lift yourself up by pulling on your own bootstraps again and again, which is impossible.

Modern definition: a resampling technique that draws from the same sample again and again; this is its statistical meaning.

### intro

1. TD methods learn directly from episodes of experience
2. TD is model-free:no knowledge of MDP transitions/rewards
3. TD learns from incomplete episodes, by bootstrapping
4. TD updates a guess towards a guess

$$(\text{number})$$: the number in parentheses represents how many steps the method looks ahead.
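The one-step "update a guess towards a guess" backup can be sketched as TD(0). The dictionary-based value table and the function name are illustrative assumptions:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    V is a dict from state to estimated value; unseen states default to 0.
    """
    v_s = V.get(s, 0.0)
    # Bootstrapped target: immediate reward plus the *estimated* next value.
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))
    V[s] = v_s + alpha * (target - v_s)
    return V[s]
```

Unlike MC, this update can be applied after every single step, before the episode's final outcome is known.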

### driving home example

TD is more flexible, because MC has to wait for the final result before it can update. On-policy vs. off-policy.

MC updates each prediction toward the actual final outcome.

For the AB example, we can actually build the estimated MDP graph from the data and work out the corresponding estimates.

### Comparison

1. TD can learn before knowing the final outcome
   1. TD can learn online after every step (less memory & peak computation)
   2. MC must wait until end of episode before return is known
2. TD can learn without the final outcome
   1. TD can learn from incomplete sequences
   2. MC can only learn from complete sequences
   3. TD works in continuing (non-terminating) environments
   4. MC only works for episodic (terminating) environments

### Result: TD performs better on the random walk

### batch MC and TD

Add a step-size parameter $$\alpha$$: $$V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))$$ for batch MC, and $$V(S_t) \leftarrow V(S_t) + \alpha\,(R_{t+1} + \gamma V(S_{t+1}) - V(S_t))$$ for batch TD(0).

## Unified comparison


- Bootstrapping: update involves an estimate
  - MC does not bootstrap
  - DP bootstraps
  - TD bootstraps
- Sampling: update samples an expectation
  - MC samples
  - DP does not sample
  - TD samples

### n-step TD

- n-Step Return: consider the following n-step returns for $$n = 1, 2, \ldots, \infty$$:

$$n=1 \quad (TD) \qquad G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$$

$$n=2 \qquad\qquad G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$$

$$\vdots$$

$$n=\infty \quad (MC) \qquad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$$

- Define the n-step return (the general case):

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

- n-step temporal-difference learning:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)$$

## [RL] RL Category

Following Wikipedia: Reinforcement Learning is an area of machine learning inspired by behavioral psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

### Behavioral Psychology

Behavior is primarily shaped by reinforcement rather than free will.

- behaviors that result in praise/pleasure tend to repeat
- behaviors that result in punishment/pain tend to become extinct

### agent

An entity (learner & decision maker) that is equipped with sensors, end-effectors, and goals.

### Action

- Used by the agent to interact with the environment.
- May have many different temporal granularities and abstractions

### reward

A reward $$R_t$$ is a scalar feedback signal that indicates how well the agent is doing at step $$t$$. The agent's job is to maximize cumulative reward.

Reward hypothesis: all goals can be described by the maximization of expected cumulative reward.

### Main Topics of Reinforcement Learning

- Learning: by trial and error
- Planning: search, reason, thought, cognition
- Prediction: evaluation functions, knowledge
- Control: action selection, decision making
- Dynamics: how the state changes given the actions of the agent

Model-based RL

- dynamics are known or are estimated
- solving RL problems that use models and planning

Model-free RL

- unknown dynamics
- explicitly trial-and-error learners
- samples are not necessarily i.i.d.

P.S. see also inverse reinforcement learning.

## Summary

## [RL] Bandit Algorithms

Feedback: reward. Agent, action. The bandit problem is a special case of reinforcement learning.

### The exploration/exploitation dilemma

Short-term versus long-term trade-off.

### Real-world examples

## Two-armed bandits

## Three types

### reward models

## Decision making under uncertainty

## action vs. reward

## types of reward

## types of rewards model

### types of context example

### prior or posterior

### structured

### global

## summary

### example

Netflix recommendations; display advertising.

# Stochastic bandits

## regret: an equivalent goal

The value of an arbitrary action $$a$$ is its (unknown) mean reward; the optimal value is the largest of these means. We care about the total regret.

### example on a

In general, those values are unknown. We must try actions to learn the action-values (explore), and prefer those that appear best (exploit).

#### action-value methods

Methods that learn action-value estimates and nothing else. By the strong law of large numbers, the sample-average estimates converge to the true expected values.

## counting regret

Bridge: compute the total regret with indicator variables; Adam's law (the tower property) gives the lemma's result.

## how to define a good learner

Is the total regret linear or sub-linear? There are two cases.

## Greedy algorithm

Always pick the action with the highest estimated value.

## $$\epsilon$$-greedy

Prevents getting trapped in a local optimum.

### proof of lower bound

### incremental implementation

For performance; enables online learning.

#### proof

## bandit notation

### improving performance further

Optimistic initial values: all the methods so far depend on the initial estimate, which is 0 in this case, but we can set it to 5.
Initially it performs worse, in exchange for more exploration.

### algorithm

## Decaying $$\epsilon_t$$-Greedy

Insight: first decline, then rise.

#### performance comparison

## bandit algorithm lower bounds (depend on the gaps)

Optimism in the face of uncertainty: which action to pick? The more uncertain we are about an action's value, the more it needs exploration.

## Upper Confidence Bound

### Hoeffding's inequality

general case

### calculation

Taking the bound gives the position of the confidence interval.

### UCB1

An action that has not been sufficiently explored needs exploration.

### upper-bound exploration

credit: https://jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/

Theorem: Suppose UCB1 is run as above. Then its expected cumulative regret $$\mathbb{E}(R(T))$$ is at most

$$\displaystyle 8 \sum_{i : \mu_i < \mu^*} \frac{\log T}{\Delta_i} + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{j=1}^K \Delta_j \right)$$

Okay, this looks like one nasty puppy, but it's actually not that bad. The first term of the sum signifies that we expect to play any suboptimal machine about a logarithmic number of times, roughly scaled by how hard it is to distinguish from the optimal machine. That is, if $$\Delta_i$$ is small we will require more tries to know that action $$i$$ is suboptimal, and hence we will incur more regret. The second term represents a small constant number (the $$1 + \pi^2/3$$ part) that caps the number of times we'll play suboptimal machines in excess of the first term due to unlikely events occurring. So the first term is like our expected losses, and the second is our risk.

But note that this is a worst-case bound on the regret. We’re not saying we will achieve this much regret, or anywhere near it, but that UCB1 simply cannot do worse than this. Our hope is that in practice UCB1 performs much better.

Before we prove the theorem, let's see how to derive the $O(\sqrt{KT \log T})$ bound mentioned above. This will require familiarity with multivariable calculus, but such things must be endured like ripping off a band-aid. First consider the regret bound as a function $R(\Delta_1, \ldots, \Delta_K)$ of the gaps (excluding of course the optimal action, whose gap is zero), and let's look at the worst case bound by maximizing it. In particular, we're just finding the problem whose parameters screw our bound as badly as possible. The gradient of the regret function is given by

$$\frac{\partial R}{\partial \Delta_i} = -\frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3}$$

and it's zero if and only if for each $i$, $\Delta_i = \sqrt{8 \log T / (1 + \pi^2/3)} = O(\sqrt{\log T})$. However this is a minimum of the regret bound (the Hessian is diagonal and all its eigenvalues are positive). Plugging in the $\Delta_i = O(\sqrt{\log T})$ (which are all the same) gives a total bound of $O(K \sqrt{\log T})$. If we look at the only possible endpoint (the $\Delta_i = 1$), then we get a local maximum of $O(K \log T)$. But this isn't the $O(\sqrt{KT \log T})$ we promised, what gives? Well, this upper bound grows arbitrarily large as the $\Delta_i$ go to zero. But at the same time, if all the $\Delta_i$ are small, then we shouldn't be incurring much regret because we'll be picking actions that are close to optimal!

Indeed, if we assume for simplicity that all the $\Delta_i$ are the same, then another trivial regret bound is $\Delta T$ (why?). The true regret is hence the minimum of this regret bound and the UCB1 regret bound: as the UCB1 bound degrades we will eventually switch to the simpler bound. That will be a non-differentiable switch (and hence a critical point) and it occurs at $\Delta = O(\sqrt{K \log T / T})$. Hence the regret bound at the switch is $\Delta T = O(\sqrt{KT \log T})$, as desired.

## Proving the Worst-Case Regret Bound

Proof. The proof works by finding a bound on $P_i(T)$, the number of times UCB chooses action $i$ up to round $T$. Using the notation, the regret is then just $\sum_i \Delta_i \mathbb{E}(P_i(T))$, and bounding the $P_i$'s will bound the regret.

Recall the notation for our upper bound $a(j, T) = \sqrt{2 \log T / P_j(T)}$ and let's loosen it a bit to $a(y, t) = \sqrt{2 \log t / y}$ so that we're allowed to "pretend" an action has been played $y$ times. Recall further that the random variable $I_t$ has as its value the index of the machine chosen at round $t$. We denote by $\chi(E)$ the indicator random variable for the event $E$. And remember that we use an asterisk to denote a quantity associated with the optimal action (e.g., $\overline{x}^*$ is the empirical mean of the optimal action).

Indeed for any action $i$, the only way we know how to write down $P_i(T)$ is as

$\displaystyle P_i(T) = 1 + \sum_{t=K}^T \chi(I_t = i)$

The 1 is from the initialization where we play each action once, and the sum is the trivial thing where we just count the number of rounds in which we pick action $i$. Now we're just going to pull some number $m$ of plays out of that summation, keep it variable, and try to optimize over it. Since we might play the action fewer than $m$ times overall, this requires an inequality.

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(I_t = i \textup{ and } P_i(t-1) \geq m)$

These indicator functions should be read as sentences: we're just saying that we're picking action $i$ in round $t$ and we've already played it at least $m$ times. Now we're going to focus on the inside of the summation, and come up with an event that happens at least as frequently as this one to get an upper bound. Specifically, saying that we've picked action $i$ in round $t$ means that the upper bound for action $i$ exceeds the upper bound for every other action. In particular, this means its upper bound exceeds the upper bound of the best action (and $i$ might coincide with the best action, but that's fine). In notation this event is

$\displaystyle \overline{x}_i + a(P_i(t-1), t-1) \geq \overline{x}^* + a(P^*(t-1), t-1)$

Denote the upper bound $\overline{x}_i + a(P_i(t-1), t-1)$ for action $i$ in round $t-1$ by $U_i(t-1)$. Since this event must occur every time we pick action $i$ (though not necessarily vice versa), we have

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi(U_i(t-1) \geq U^*(t-1) \textup{ and } P_i(t-1) \geq m)$

We'll do this process again but with a slightly more complicated event. If the upper bound of action $i$ exceeds that of the optimal machine, it is also the case that the maximum upper bound for action $i$ we've seen after the first $m$ trials exceeds the minimum upper bound we've seen on the optimal machine (ever). But on round $t$ we don't know how many times we've played the optimal machine, nor do we even know how many times we've played machine $i$ (except that it's more than $m$). So we try all possibilities and look at minima and maxima. This is a pretty crude approximation, but it will allow us to write things in a nicer form.

Denote by $\overline{x}_{i,s}$ the random variable for the empirical mean after playing action $i$ a total of $s$ times, and by $\overline{x}^*_{s'}$ the corresponding quantity for the optimal machine. Realizing everything in notation, the above argument proves that

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \chi \left ( \max_{m \leq s < t} \overline{x}_{i,s} + a(s, t-1) \geq \min_{0 < s' < t} \overline{x}^*_{s'} + a(s', t-1) \right )$

Indeed, at each $t$ for which the max is greater than the min, there will be at least one pair $s,s'$ for which the values of the quantities inside the max/min will satisfy the inequality. And so, even worse, we can just count the number of pairs $s, s'$ for which it happens. That is, we can expand the event above into the double sum which is at least as large:

$\displaystyle P_i(T) \leq m + \sum_{t=K}^T \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t-1) \geq \overline{x}^*_{s'} + a(s', t-1) \right )$

We can make one other odd inequality by increasing the sum to go from $t=1$ to $\infty$. This will become clear later, but it means we can replace $t-1$ with $t$ and thus have

$\displaystyle P_i(T) \leq m + \sum_{t=1}^\infty \sum_{s = m}^{t-1} \sum_{s' = 1}^{t-1} \chi \left ( \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \right )$

Now that we've slogged through this mess of inequalities, we can actually get to the heart of the argument. Suppose that this event actually happens, that $\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)$. Then what can we say? Well, consider the following three events:

(1) $\displaystyle \overline{x}^*_{s'} \leq \mu^* - a(s', t)$
(2) $\displaystyle \overline{x}_{i,s} \geq \mu_i + a(s, t)$
(3) $\displaystyle \mu^* < \mu_i + 2a(s, t)$

In words, (1) is the event that the empirical mean of the optimal action is less than the lower confidence bound. By our Chernoff bound argument earlier, this happens with probability $t^{-4}$. Likewise, (2) is the event that the empirical mean payoff of action $i$ is larger than the upper confidence bound, which also occurs with probability $t^{-4}$. We will see momentarily that (3) is impossible for a well-chosen $m$ (which is why we left it variable), but in any case the claim is that one of these three events must occur. For if they are all false, we have

$\displaystyle \begin{matrix} \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) & > & \mu^* - a(s',t) + a(s',t) = \mu^* \\ \textup{assumed} & (1) \textup{ is false} & \end{matrix}$

and

$\begin{matrix} \mu_i + 2a(s,t) & > & \overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t) \\ & (2) \textup{ is false} & \textup{assumed} \end{matrix}$

But putting these two inequalities together gives us precisely that (3) is true:

$\mu^* < \mu_i + 2a(s,t)$

This proves the claim.

By the union bound, the probability that at least one of these events happens is $2t^{-4}$ plus whatever the probability of (3) being true is. But as we said, we'll pick $m$ to make (3) always false. Indeed $m$ depends on which action $i$ is being played, and if $s \geq m > 8 \log T / \Delta_i^2$ then $2a(s, t) \leq \Delta_i$, and by the definition of $\Delta_i$ we have

$\mu^* - \mu_i - 2a(s, t) \geq \mu^* - \mu_i - \Delta_i = 0$.

Now we can finally piece everything together. The expected value of an event is just its probability of occurring, and so

$$\displaystyle \begin{aligned} \mathbb{E}(P_i(T)) & \leq m + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t \textup{P}(\overline{x}_{i,s} + a(s, t) \geq \overline{x}^*_{s'} + a(s', t)) \\ & \leq \left \lceil \frac{8 \log T}{\Delta_i^2} \right \rceil + \sum_{t=1}^\infty \sum_{s=m}^t \sum_{s' = 1}^t 2t^{-4} \\ & \leq \frac{8 \log T}{\Delta_i^2} + 1 + \sum_{t=1}^\infty \sum_{s=1}^t \sum_{s' = 1}^t 2t^{-4} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + 2 \sum_{t=1}^\infty t^{-2} \\ & = \frac{8 \log T}{\Delta_i^2} + 1 + \frac{\pi^2}{3} \end{aligned}$$

The second line is the Chernoff bound we argued above, the third and fourth lines are relatively obvious algebraic manipulations, and the last equality uses the classic solution to the Basel problem. Plugging this upper bound into the regret formula we gave in the first paragraph of the proof establishes the bound and proves the theorem.

~ maximum a posteriori (MAP) probability

## Gradient bandit algorithms

### Why not gradient descent?

more general for RL

### proof

// lemma 1 for homework

## another good algorithm

bias ~ variance tradeoff

## adversarial bandits with expert advice

### additional exploration

uniform exploration

proved to be unnecessary

bound

# Basics

## general definition of Probability

### The physical meaning of probability

frequentist view: a long-run frequency over a large number of repetitions of an experiment.

Bayesian view: a degree of belief about the event in question.
We can assign probabilities to hypotheses like "candidate will win the election" or "the defendant is guilty", which can't be repeated.

Markov chain Monte Carlo (MCMC) + computing power + algorithms make the Bayesian view thrive.

# Conditional probability

e.g. Conditioning -> DIVIDE & CONQUER -> recursively apply to multi-stage problems.

$$P(A|B) = \frac{P(A\ and\ B)}{P(B)}$$
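The definition can be checked by brute-force enumeration; the two-dice event below is a hypothetical example, not from the text:

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

p_b = prob(lambda o: o[0] == 3)                    # P(first die = 3)
p_ab = prob(lambda o: o[0] == 3 and sum(o) == 8)   # P(first = 3 and total = 8)
p_a_given_b = p_ab / p_b                           # P(A|B) = P(A and B) / P(B)
```

Here $$P(A|B) = (1/36)/(1/6) = 1/6$$: conditioning on the first die shrinks the sample space to 6 outcomes.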

PDF: probability density function

# Mixed distributions

## valid PDF

1. non-negative: $$f(x)\geq0$$
2. integrates to 1:
$$\int^{\infty}_{-\infty}f(x)dx=1$$

These conditions come up, e.g., in GANs.
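A rough numerical check of the two conditions (the helper name `is_valid_pdf`, the finite interval, and the tolerance are assumptions of this sketch):

```python
import math

def is_valid_pdf(f, lo, hi, n=100_000):
    """Numerically check the two PDF conditions on [lo, hi]:
    f(x) >= 0 at every sample point, and the integral is (approximately) 1."""
    h = (hi - lo) / n
    xs = [lo + i * h for i in range(n + 1)]
    ys = [f(x) for x in xs]
    if any(y < 0 for y in ys):
        return False
    integral = h * (sum(ys) - 0.5 * (ys[0] + ys[-1]))   # trapezoid rule
    return abs(integral - 1.0) < 1e-6

# Standard normal density: the classic valid PDF.
gauss = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
```

The interval must cover essentially all of the mass; for the standard normal, $$[-10, 10]$$ leaves a negligible tail.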

# Generating function

1. PGF - the Z-transform
2. MGF - the Laplace transform
3. CF - the Fourier transform

## APPLICATION

1. branching process
2. bridge complex and probability
3. play a role in large deviation theory
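As a tiny illustration of the bridge between combinatorics and probability, here is the PGF of a fair die in exact arithmetic (the `pgf` helper is ours):

```python
from fractions import Fraction

def pgf(pmf):
    """Probability generating function G(z) = sum_k P(X = k) * z^k."""
    return lambda z: sum(p * z**k for k, p in pmf.items())

# Fair six-sided die: P(X = k) = 1/6 for k = 1..6.
die = {k: Fraction(1, 6) for k in range(1, 7)}
G = pgf(die)

# G(1) = 1 for any valid PMF, and G'(1) = E[X] (= 7/2 here).
mean = sum(k * p for k, p in die.items())
```

Evaluating $$G$$ at special points recovers probabilities and moments, which is exactly how PGFs are used in branching processes.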
## Multiple variables

The joint distribution provides complete information about how multiple r.v.s interact in high-dimensional space.

# Order Statistics

## PDF of an Order Statistic

two methods to find PDF

1. CDF -differentiate-> PDF (ugly)
2. the $$f(x)\,dx$$ heuristic

### proof

## joint PDF

e.g. order statistics of Uniforms
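A quick Monte-Carlo sanity check of the known fact that the $$k$$-th order statistic of $$n$$ i.i.d. Uniform(0,1) draws has mean $$k/(n+1)$$ (sample sizes, seed, and helper name are arbitrary choices of this sketch):

```python
import random

def order_stat_mean(n, k, trials=20_000, seed=1):
    """Monte-Carlo estimate of E[U_(k)] for n i.i.d. Uniform(0,1) draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[k - 1]          # k-th smallest (1-indexed)
    return total / trials
```

For $$n=4, k=2$$ the estimate should sit near $$2/5$$, matching the Beta$$(k, n-k+1)$$ distribution of $$U_{(k)}$$.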

## Mean vs Bayes'

deduction

### e.g. Laplace's sunrise problem

$$P(X_{n+1}=1 \mid X_n=1, X_{n-1}=1, \ldots, X_1=1) = \frac{P(X_{n+1}=1, X_n=1, \ldots, X_1=1)}{P(X_n=1, X_{n-1}=1, \ldots, X_1=1)} = \frac{\int_0^1 p^{n+1}\,dp}{\int_0^1 p^n\,dp} = \frac{n+1}{n+2}$$

i.e., given that the sun has risen every day from day 1 to day $$n$$, the probability that it rises on day $$n+1$$ approaches 1 as $$n$$ grows.
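The ratio of integrals can be verified exactly with rational arithmetic (the helper name is ours):

```python
from fractions import Fraction

def sunrise(n):
    """P(X_{n+1}=1 | X_1 = ... = X_n = 1) under a uniform prior on p:
    the ratio of the integrals of p^(n+1) and p^n over [0, 1],
    i.e. (1/(n+2)) / (1/(n+1)) = (n+1)/(n+2)."""
    return Fraction(1, n + 2) / Fraction(1, n + 1)
```

As $$n$$ grows, the value climbs toward 1, matching the informal conclusion above.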

# Monte carlo

## What does Multi-Armed Bandit mean?

credit:https://iosband.github.io/2015/07/19/Efficient-experimentation-and-multi-armed-bandits.html

At first, multi-armed bandit means optimizing an unknown reward function
$$f^* : \mathcal{X} \rightarrow \mathbb{R}$$. Concretely:

1. Each arm $$i$$ pays out 1 dollar with probability $$p_i$$ if it is played; otherwise it pays out nothing.
2. While the $$p_1,…,p_k$$ are fixed, we don’t know any of their values.
3. Each timestep $$t$$ we pick a single arm $$a_t$$ to play.
4. Based on our choice, we receive a return of $$r_t \sim Ber(p_{a_t})$$.
5. **How should we choose arms so as to maximize total expected return?**
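One hedged answer sketch: $$\epsilon$$-greedy on the Bernoulli bandit described above. The parameters and helper name are illustrative, not a claim about the best strategy:

```python
import random

def eps_greedy(ps, T=5000, eps=0.1, seed=0):
    """epsilon-greedy on a Bernoulli bandit: with probability eps explore a
    uniformly random arm, otherwise exploit the arm with the highest
    estimated value.  Returns (estimates, play counts, total reward)."""
    rng = random.Random(seed)
    K = len(ps)
    Q = [0.0] * K                       # action-value estimates
    N = [0] * K                         # play counts
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:
            a = rng.randrange(K)        # explore
        else:
            a = max(range(K), key=Q.__getitem__)   # exploit
        r = 1.0 if rng.random() < ps[a] else 0.0   # Bernoulli(p_a) reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]       # incremental mean update
        total += r
    return Q, N, total
```

With distinct arm probabilities, the better arm ends up played far more often, and the average reward approaches roughly $$(1-\epsilon)\,p^* + \epsilon\,\bar{p}$$.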