Course Notes: Reinforcement Learning

 

Course Outline

  • Lecture 1: Overview (course overview and RL basics)
  • Lecture 2: Markov Decision Process
  • Lecture 3: Model-free Prediction and Control
  • Lecture 4: On-policy and Off-policy Learning
  • Lecture 5: Value Function Approximation
  • Lecture 6: Policy Optimization: Foundation
  • Lecture 7: Policy Optimization: State of the Art
  • Lecture 8: Model-based RL
  • Lecture 9: Case Study on AlphaGo series
  • Lecture 10: Case Study on AlphaStar and OpenAI Five
  • Lecture 11: Distributed computing and RL system design
  • Lecture 12: Generative Modeling

Reference textbook: the original English edition

Lecture 1: Overview

Definition of reinforcement learning: a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex and uncertain environment.

Differences between reinforcement learning and supervised learning:

  • The input is sequential data (not independent and identically distributed)
  • The learner is not explicitly told which actions to take
  • Trial-and-error exploration
  • Delayed reward

An RL agent consists of the following components:

  • Policy (the agent's behavior function)
  • Value function
  • Model (the agent's representation of the environment)

Exploration (trying new strategies) vs. exploitation (using the current best strategy).
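
A minimal sketch of how this trade-off is often handled in practice is an ε-greedy action choice. The function name and the `q_values` array of estimated action values below are assumptions for illustration, not something defined in these notes:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best-known action."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # exploration: pick a random action
    return int(np.argmax(q_values))               # exploitation: pick the current best action
```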

Lecture 2: Markov Decision Process

Markov Process: given the current state, the next state is conditionally independent of all past states (the Markov property).

Markov Chain: a Markov process whose time and state spaces are both discrete is called a Markov chain (it can be represented by a state-transition matrix).
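
A minimal sketch of sampling from a Markov chain given its state-transition matrix; the three-state chain and its probabilities below are made up for illustration:

```python
import numpy as np

# State-transition matrix of a made-up 3-state chain: rows are current states,
# columns are next states, and each row sums to 1.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])

def sample_chain(P, s0=0, steps=10, seed=0):
    """Sample a trajectory of state indices by repeatedly drawing the next state from P."""
    rng = np.random.default_rng(seed)
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

print(sample_chain(P))  # a list of visited state indices
```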

Markov Reward Process

Markov Reward Process (MRP): a Markov chain plus a reward function.

1. Definition

  • S is a (finite) set of states (s $\in$ S)

  • P is the dynamics/transition model that specifies $P(s_{t+1} = s^{\prime}\mid s_t = s)$

  • R is a reward function $R(s_t = s) = \mathbb{E} [r_t \mid s_t = s]$

  • Discount factor $\gamma \in [0, 1]$

2. The value function of an MRP

\[V(s)=\mathbb{E}\left[G_{t} \mid s_{t}=s\right]=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots \mid s_{t}=s\right]\]

It also satisfies the Bellman equation:

\[V(s) = R(s) + \gamma\sum_{s^{\prime}\in S}P(s^{\prime} \mid s)V(s^{\prime})\]

That is, the value equals the immediate reward $R(s)$ plus the expected value of the successor state, discounted by $\gamma$.
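
Since the Bellman equation is linear, a finite MRP can also be solved directly in matrix form: $V = R + \gamma P V \Rightarrow V = (I - \gamma P)^{-1}R$. A small sketch, where the transition matrix and rewards are invented for illustration:

```python
import numpy as np

# A made-up 3-state MRP: transition matrix P, per-state rewards R, discount gamma.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Bellman equation in matrix form: V = R + gamma * P @ V  =>  (I - gamma * P) V = R
V = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(V)
```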

Computing the value function of an MRP with the Monte Carlo method:
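
A minimal Monte Carlo sketch: sample many trajectories starting from each state, accumulate the discounted return of each, and average. The MRP below is the same made-up three-state example as above; trajectories are truncated at a fixed horizon, which is an approximation.

```python
import numpy as np

# Same made-up 3-state MRP as in the matrix-form sketch above.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.4, 0.6]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

def mc_value(s, episodes=2000, horizon=100, seed=0):
    """Monte Carlo estimate of V(s): average the discounted return over sampled trajectories."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):                       # truncate at a fixed horizon
            g += discount * R[state]
            discount *= gamma
            state = rng.choice(len(P), p=P[state])
        total += g
    return total / episodes

print([round(mc_value(s), 2) for s in range(len(P))])  # close to the matrix-form solution
```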

Markov Decision Process

Markov Decision Process (MDP): a Markov reward process plus decisions (actions).

1. Definition

  • S is a finite set of states

  • A is a finite set of actions

  • $P^a$ is the dynamics/transition model for each action: $P(s_{t+1} = s^{\prime} \mid s_t = s, a_t = a)$

  • R is a reward function $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$

  • Discount factor $\gamma \in [0, 1]$

2. MDP value functions

\[v^{\pi}(s) = \sum_{a\in A}\pi(a\mid s)\,q^{\pi}(s, a)\]

It also satisfies the Bellman expectation equation:

\[v^{\pi}(s) = \mathbb{E}_{\pi}\left[R_{t+1} + \gamma\, v^{\pi}(s_{t+1}) \mid s_t = s\right]\]
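
Expanding the expectation using the transition model and reward function from the MDP definition above gives the standard coupled form of the Bellman expectation equations for $v^{\pi}$ and $q^{\pi}$:

\[v^{\pi}(s) = \sum_{a\in A}\pi(a\mid s)\left(R(s, a) + \gamma\sum_{s^{\prime}\in S}P(s^{\prime}\mid s, a)\,v^{\pi}(s^{\prime})\right)\]

\[q^{\pi}(s, a) = R(s, a) + \gamma\sum_{s^{\prime}\in S}P(s^{\prime}\mid s, a)\sum_{a^{\prime}\in A}\pi(a^{\prime}\mid s^{\prime})\,q^{\pi}(s^{\prime}, a^{\prime})\]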

Decision Making in MDP

1. Prediction

  • Input: an MDP $\langle S, A, P, R, \gamma\rangle$ and a policy $\pi$, or an MRP $\langle S, P^{\pi}, R^{\pi}, \gamma\rangle$

  • Output: the value function $v^{\pi}$ (see the policy-evaluation sketch below)
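
One way to carry out prediction is iterative policy evaluation: repeatedly apply the Bellman expectation equation until the values stop changing. A minimal sketch with a randomly generated, made-up MDP (all names and shapes here are illustrative assumptions):

```python
import numpy as np

# Made-up MDP: 3 states, 2 actions; P[a, s, s'] transition probabilities, R[s, a] rewards.
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy

def policy_evaluation(pi, tol=1e-8):
    """Iterate the Bellman expectation equation until the values stop changing."""
    v = np.zeros(n_states)
    while True:
        # q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) v(s')
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = (pi * q).sum(axis=1)                   # v(s) = sum_a pi(a | s) q(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

print(policy_evaluation(pi))
```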

2. Control

  • Input: an MDP $\langle S, A, P, R, \gamma\rangle$

  • Output: the optimal value function $v^{*}$ and the optimal policy $\pi^{*}$ (see the value-iteration sketch below)
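
For control, one standard approach (sketched here under the same made-up MDP layout as in the policy-evaluation example; it is not spelled out in these notes) is value iteration: apply the Bellman optimality backup until convergence, then read off the greedy policy as $\pi^{*}$.

```python
import numpy as np

# Same made-up MDP layout as in the policy-evaluation sketch: P[a, s, s'], R[s, a].
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

def value_iteration(tol=1e-8):
    """Bellman optimality backup: v(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) v(s') ]."""
    v = np.zeros(n_states)
    while True:
        q = R + gamma * np.einsum('asn,n->sa', P, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)             # optimal values and a greedy policy
        v = v_new

v_star, pi_star = value_iteration()
print(v_star, pi_star)
```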