[RL] Monte-Carlo Methods

Monte-Carlo RL

  1. MC learns directly from episodes of experience.
  2. MC is model-free: no knowledge of MDP transitions or rewards is required.
  3. MC learns from complete episodes: no bootstrapping.
    1. Here bootstrapping means using one or more estimated values of the same kind in the update step, as in \(SARSA(\lambda)\) or \(Q(\lambda)\).
    2. Single-step TD bootstraps from the next value estimate; \(TD(\lambda)\) extends this by mixing results from trajectories of different lengths.
  4. MC uses the simplest possible idea: value = mean return
  5. 2 ways:
    1. model-free: no model necessary and still attains optimality
    2. Simulated: needs only a simulation, not a full model
  6. Caveat: can only apply MC to episodic MDPs
    1. All episodes must terminate

policy evaluation

The goal is to learn \(v_\pi\) from episodes of experience under policy \(\pi\)
\[S_1, A_1, R_2, \ldots, S_k \sim \pi\]
and the return is the total discounted reward: \(G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T\).

The value function is the expected return: \(v_\pi(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\)
P.S. Monte-Carlo policy evaluation uses the empirical mean return rather than the expected return.

Depending on which visits to a state within an episode are counted, MC policy evaluation divides into First-Visit and Every-Visit variants; both converge asymptotically.

First-visit

proof of convergence

By the law of large numbers: each episode contributes at most one first-visit return for a given state \(s\), these returns are i.i.d. samples of \(G_t\) given \(S_t = s\), so their sample mean converges to \(v_\pi(s)\) as \(N(s) \to \infty\).
In fact, \(V(s)\) is updated as the empirical mean \(\frac{S(s)}{N(s)}\), where \(N(s)\) counts visits to \(s\) and \(S(s)\) accumulates the returns observed from \(s\) (computed by a backward sweep through each episode).
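A minimal sketch of tabular first-visit MC policy evaluation; `sample_episode` is a hypothetical helper (not from these notes) assumed to run one full episode under the policy and return `(state, reward)` pairs:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, policy, gamma, num_episodes):
    """Tabular first-visit Monte-Carlo policy evaluation.

    `sample_episode(policy)` is an assumed helper that runs one full episode
    and returns [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)].
    """
    N = defaultdict(int)     # N(s): number of first visits to s
    S = defaultdict(float)   # S(s): sum of returns following first visits to s
    V = defaultdict(float)   # V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = sample_episode(policy)
        # Compute the return G_t for every time step by sweeping backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G
        # Update only at the first visit to each state in this episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
            V[state] = S[state] / N[state]
    return V
```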

blackjack (21 points)

In Sutton's book, the blackjack example yields value-function plots like the following.

Every-Visit Monte-Carlo Policy Evaluation

The only difference is that returns following every visit to a state within an episode are averaged, not just the first visit.

Incremental Mean

The means \(\mu_1, \mu_2, \ldots\) of a sequence \(x_1, x_2, \ldots\) can be computed incrementally, as derived below.
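Writing \(\mu_k\) for the mean of the first \(k\) elements \(x_1, \ldots, x_k\):

\[
\mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j
      = \frac{1}{k} \Bigl( x_k + (k-1)\,\mu_{k-1} \Bigr)
      = \mu_{k-1} + \frac{1}{k} \bigl( x_k - \mu_{k-1} \bigr)
\]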

incremental MC-updates
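Concretely, after an episode finishes, for each visited state \(S_t\) with return \(G_t\), the incremental updates are:

\[
\begin{aligned}
N(S_t) &\leftarrow N(S_t) + 1 \\
V(S_t) &\leftarrow V(S_t) + \frac{1}{N(S_t)} \bigl( G_t - V(S_t) \bigr)
\end{aligned}
\]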

Monte-Carlo Estimation of Action Values

Without a model, state values alone are not enough to act greedily, so MC estimates action values \(q_\pi(s, a)\) instead, by averaging the returns following visits to each state-action pair.
Backup Diagram for Monte-Carlo

Similar to the bandit setting, the goal is to find the optimum under the explore/exploit dilemma, except that the return covers the entire rest of the episode. Only the choice made at each state is considered, and there is no bootstrapping (unlike DP).

The time required to estimate one state does not depend on the total number of states.

Temporal-Difference Learning

Bootstrapping

The saying: to lift someone up by pulling on their own bootstraps, again and again, which can never be completed.

Modern definition: a statistical re-sampling technique that draws from the same sample again and again.
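As a toy illustration of the statistical sense (the data values below are made up for illustration):

```python
import random

# Statistical bootstrap: quantify the uncertainty of the sample mean by
# re-sampling (with replacement) from the same sample many times.
data = [2.3, 1.9, 3.1, 2.8, 2.0, 2.7]
boot_means = []
for _ in range(10_000):
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

mean = sum(boot_means) / len(boot_means)
std = (sum((m - mean) ** 2 for m in boot_means) / len(boot_means)) ** 0.5
print(f"bootstrap estimate of the mean: {mean:.2f} +/- {std:.2f}")
```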

intro

  1. TD methods learn directly from episodes of experience
  2. TD is model-free: no knowledge of MDP transitions/rewards
  3. TD learns from incomplete episodes, by bootstrapping
  4. TD updates a guess towards a guess

The \((number)\) after TD indicates the number of look-ahead steps.
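For reference, the single-step TD (TD(0)) update moves the estimate toward the bootstrapped TD target \(R_{t+1} + \gamma V(S_{t+1})\):

\[
V(S_t) \leftarrow V(S_t) + \alpha \bigl( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \bigr)
\]

The quantity \(\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\) is called the TD error.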

driving home example

TD is more flexible, since MC has to wait for the final outcome before it can update. A separate distinction is on-policy vs. off-policy learning.




In the driving-home example, MC adjusts each prediction toward the actual final outcome, whereas TD adjusts it toward the nearest (next) prediction.


For the AB example, we can actually reconstruct the estimated MDP (the empirical transition graph) from the batch of episodes, as sketched below.
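A small sketch of building the empirical MDP from a batch of episodes; the episode data below is an assumption, following the classic A/B batch example (Sutton & Barto style), not data from these notes:

```python
from collections import defaultdict

# Batch of episodes as (state, reward, next_state) transitions; 'T' is terminal.
# Assumed data: one episode A -> B (reward 0) -> terminal (reward 0),
# six episodes B -> terminal with reward 1, one episode B -> terminal with reward 0.
episodes = [[('A', 0, 'B'), ('B', 0, 'T')]] + \
           [[('B', 1, 'T')]] * 6 + [[('B', 0, 'T')]]

# Build the empirical (maximum-likelihood) MDP: transition counts and mean rewards.
counts = defaultdict(lambda: defaultdict(int))   # counts[s][s'] = number of transitions
rewards = defaultdict(list)                      # rewards[s] = rewards observed from s

for episode in episodes:
    for s, r, s_next in episode:
        counts[s][s_next] += 1
        rewards[s].append(r)

for s in counts:
    total = sum(counts[s].values())
    probs = {s_next: n / total for s_next, n in counts[s].items()}
    print(s, 'transitions:', probs, 'mean reward:', sum(rewards[s]) / total)

# Batch TD(0) converges to the values of this empirical MDP (V(A) = V(B) = 0.75
# with gamma = 1), while batch MC converges to the empirical mean return (V(A) = 0).
```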

Comparison

  1. TD can learn before knowing the final outcome
    1. TD can learn online after every step (less memory & peak computation)
    2. MC must wait until end of episode before return is known
  2. TD can learn without the final outcome
    1. TD can learn from incomplete sequences
    2. MC can only learn from complete sequences
    3. TD works in continuing (non-terminating) environments
    4. MC only works for episodic (terminating) environment
      Result: TD performs better than MC on the random-walk example.


batch MC and TD

Add a step-size parameter \(\alpha\): \(V(S_t) \leftarrow V(S_t) + \alpha\,(G_t - V(S_t))\), which forgets old episodes and is useful in non-stationary problems.

Unified comparison


  • Bootstrapping: update involves an estimate
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: update samples an expectation
    • MC samples
    • DP does not sample
    • TD samples

n-step TD

n-Step Return: consider the following n-step returns for \(n = 1, 2, \ldots, \infty\):

\[
\begin{aligned}
n = 1 \;\; (TD) \qquad & G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1}) \\
n = 2 \qquad & G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^{2} V(S_{t+2}) \\
& \;\;\vdots \\
n = \infty \;\; (MC) \qquad & G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T
\end{aligned}
\]

Define the general n-step return:

\[
\underbrace{G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})}_{\text{the general case}}
\]

n-step temporal-difference learning updates toward this target:

\[
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)
\]
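A minimal runnable sketch of n-step TD prediction on a small random walk; the environment, state numbering, and parameter values are illustrative assumptions, not from these notes:

```python
import random

# n-step TD prediction on a 5-state random walk: states 1..5, terminals 0 and 6,
# start at 3, move left/right with equal probability, reward +1 only on reaching 6.
def n_step_td_random_walk(n=3, alpha=0.1, gamma=1.0, num_episodes=1000, seed=0):
    random.seed(seed)
    V = {s: 0.0 for s in range(7)}              # V[0] and V[6] stay 0 (terminal)
    for _ in range(num_episodes):
        states, rewards = [3], [0.0]            # S_0 = 3; rewards[i] = R_i (index 0 unused)
        T = float('inf')
        t = 0
        while True:
            if t < T:
                s = states[t]
                s_next = s + random.choice([-1, 1])   # uniform random policy
                r = 1.0 if s_next == 6 else 0.0
                states.append(s_next)
                rewards.append(r)
                if s_next in (0, 6):
                    T = t + 1
            tau = t - n + 1                     # time whose estimate is updated
            if tau >= 0:
                # n-step return G_tau^(n): discounted rewards plus bootstrap term
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V

print(n_step_td_random_walk())   # values of states 1..5 approach 1/6 .. 5/6
```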

Reference

  1. What exactly is bootstrapping in reinforcement learning?
  2. First-visit Monte Carlo policy from WSU

[RL] RL Category

Following Wikipedia:

Reinforcement learning is an area of machine learning inspired by behavioral psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.


Behavioral Psychology

Behavior is primarily shaped by reinforcement rather than free-will.

  • behaviors that result in praise/pleasure tend to repeat
  • behaviors that result in punishment/pain tend to become extinct

agent

An entity (learner & decision maker) that is equipped with sensors, end-effectors, and goals.


Action

  • Used by the agent to interact with the environment.
  • May have many different temporal granularities and abstractions


reward

A reward \(R_t\) is a scalar feedback signal

It indicates how well the agent is doing at step \(t\)

The agent’s job is to maximize cumulative reward

hypothesis: All goals can be described by the maximization of expected cumulative reward


Main Topics of Reinforcement Learning

Learning: by trial and error

Planning: search, reason, thought, cognition

Prediction: evaluation functions, knowledge

Control: action selection, decision making

Dynamics: how the state changes given the actions of the agent

Model-based RL

  • dynamics are known or are estimated
  • solving RL problems that use models and planning

Model-free RL

  • unknown dynamics
  • explicitly trial-and-error learners

Note: the data the agent encounters is not necessarily i.i.d.


P.S. Inverse reinforcement learning.

Summary
