Lecture 3: Optimal Policy and Bellman Optimality Equation

1-Outline

In this lecture:

  • Core concepts: optimal policy and optimal state value
  • A fundamental tool: Bellman optimality equation (BOE)

2-Motivating examples

(Figure: a grid-world example. Under the policy shown, $s_1$ moves to $s_2$ with reward $-1$, while $s_2$, $s_3$, and $s_4$ move to, or stay at, $s_4$ with reward $+1$.)

  • Exercise: write out the Bellman equation and solve the state values (set $\gamma = 0.9$)

  • Bellman equations:

v_\pi(s_1) = -1 + \gamma v_\pi(s_2),

v_\pi(s_2) = +1 + \gamma v_\pi(s_4),

v_\pi(s_3) = +1 + \gamma v_\pi(s_4),

v_\pi(s_4) = +1 + \gamma v_\pi(s_4).

  • State values: $v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = 10$, $v_\pi(s_1) = 8$

  • Exercise: calculate the action values of the five actions for $s_1$

  • Action values:

q_\pi(s_1, a_1) = -1 + \gamma v_\pi(s_1) = 6.2,

q_\pi(s_1, a_2) = -1 + \gamma v_\pi(s_2) = 8,

q_\pi(s_1, \textcolor{orange}{a_3}) = 0 + \gamma v_\pi(s_3) = 9,

q_\pi(s_1, a_4) = -1 + \gamma v_\pi(s_1) = 6.2,

q_\pi(s_1, a_5) = 0 + \gamma v_\pi(s_1) = 7.2.

  • Question: The current policy is not good; how can we improve it?
  • Answer: We can improve the policy based on action values.

In particular, the current policy $\pi(a|s_1)$ is:

\pi(a|s_1) = \begin{cases} 1 & a = a_2 \\ 0 & a \neq a_2 \end{cases}

Observe the action values that we obtained just now:

q_\pi(s_1, a_1) = 6.2, \quad q_\pi(s_1, a_2) = 8, \quad q_\pi(s_1, a_3) = 9,

q_\pi(s_1, a_4) = 6.2, \quad q_\pi(s_1, a_5) = 7.2.

What if we select the action with the greatest action value? Then the new policy is

\pi_{new}(a|s_1) = \begin{cases} 1 & a = a_3 \\ 0 & a \neq a_3 \end{cases}

  • Question: Why does selecting the greatest action value improve the policy?
  • Intuition: Action values can be used to evaluate actions.
  • Math: nontrivial; it will be developed in this lecture. (A quick numerical check of this example is given below.)
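As a quick numerical check, here is a minimal Python sketch (a sketch only, assuming the rewards and successor states exactly as written in the equations above, with $\gamma = 0.9$; the array layout is an illustrative choice, not notation from the lecture):

```python
import numpy as np

gamma = 0.9

# Bellman equation of the current policy, v = r_pi + gamma * P_pi v,
# with rewards and successor states taken from the four equations above.
r_pi = np.array([-1.0, 1.0, 1.0, 1.0])        # expected immediate rewards of s1..s4
P_pi = np.array([[0, 1, 0, 0],                # s1 -> s2
                 [0, 0, 0, 1],                # s2 -> s4
                 [0, 0, 0, 1],                # s3 -> s4
                 [0, 0, 0, 1]], dtype=float)  # s4 -> s4

v = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)
print(v)                                      # [ 8. 10. 10. 10.]

# Action values at s1, q(s1, a) = r(s1, a) + gamma * v(successor), as in the exercise
q_s1 = {"a1": -1 + gamma * v[0], "a2": -1 + gamma * v[1], "a3": 0 + gamma * v[2],
        "a4": -1 + gamma * v[0], "a5": 0 + gamma * v[0]}
print(q_s1)                                   # approximately a1: 6.2, a2: 8, a3: 9, a4: 6.2, a5: 7.2
print(max(q_s1, key=q_s1.get))                # 'a3' -- the greedy improvement selects a3
```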

3-Definition of optimal policy

State values can be used to evaluate whether a policy is good or not: if

v_{\pi_1}(s) \geq v_{\pi_2}(s) \quad \text{for all} \quad s \in \mathcal{S}

then $\pi_1$ is “better” than $\pi_2$.

  • Definition: A policy $\pi^*$ is optimal if $v_{\pi^*}(s) \geq v_\pi(s)$ for all $s$ and for any other policy $\pi$.

The definition leads to many questions:

  • Does the optimal policy exist?
  • Is the optimal policy unique?
  • Is the optimal policy stochastic or deterministic?
  • How to obtain the optimal policy?

To answer these questions, we study the Bellman optimality equation (BOE).

4-BOE: Introduction

  • Bellman optimality equation (elementwise form):

v(s) = \max_\pi \sum_a \pi(a|s) \left(\sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v(s')\right), \quad s \in \mathcal{S}

= \max_\pi \sum_a \pi(a|s)q(s,a), \quad s \in \mathcal{S}

  • Remarks:

    • $p(r|s,a)$, $p(s'|s,a)$, $r$, $\gamma$ are known.
    • $v(s)$, $v(s')$ are unknown and to be calculated.
    • Is $\pi(s)$ known or unknown?
  • Bellman optimality equation (matrix-vector form):

v = \max_\pi (r_\pi + \gamma P_\pi v)

where the elements corresponding to $s$ or $s'$ are as follows (a code sketch of assembling $r_\pi$ and $P_\pi$ is given at the end of this section):

[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r,

[P_\pi]_{s,s'} = p(s'|s) \triangleq \sum_a \pi(a|s) p(s'|s,a)

Here $\max_\pi$ is performed elementwise:

\max_\pi \begin{bmatrix} * \\ \vdots \\ * \end{bmatrix} = \begin{bmatrix} \max_{\pi(s_1)} * \\ \vdots \\ \max_{\pi(s_n)} * \end{bmatrix}

  • Remarks:
    • BOE is tricky yet elegant!
    • Why elegant? It describes the optimal policy and optimal state value in an elegant way.
    • Why tricky? There is a maximization on the right-hand side, and it is not immediately obvious how it can be computed.
    • This lecture will answer all the following questions:
      • Algorithm: how to solve this equation?
      • Existence: does this equation have solutions?
      • Uniqueness: is the solution to this equation unique?
      • Optimality: how is it related to optimal policy?
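As a small illustration of the matrix-vector form above, the following Python sketch assembles $r_\pi$ and $P_\pi$ for a given stochastic policy. It assumes the model is stored as arrays `R[s, a]` $= \sum_r p(r|s,a)r$ and `P[a, s, s']` $= p(s'|s,a)$; this layout and the function name are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def policy_matrices(pi, P, R):
    """Assemble r_pi and P_pi for a stochastic policy pi.

    pi[s, a]    : pi(a|s)
    P[a, s, s'] : p(s'|s, a)
    R[s, a]     : expected immediate reward, sum_r p(r|s, a) * r
    """
    r_pi = (pi * R).sum(axis=1)             # [r_pi]_s      = sum_a pi(a|s) * R[s, a]
    P_pi = np.einsum("sa,asz->sz", pi, P)   # [P_pi]_{s,s'} = sum_a pi(a|s) * p(s'|s, a)
    return r_pi, P_pi

# For a fixed policy, the vector inside max_pi is then simply r_pi + gamma * P_pi @ v.
```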

5-BOE: Preliminaries

5.1-BOE: Maximization on the right-hand side

If the values $v(s')$ are fixed, then $q(s,a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v(s')$ is a fixed number for every action, and $\sum_a \pi(a|s)q(s,a)$ is a weighted average of the action values with weights $\pi(a|s) \geq 0$ summing to one. A weighted average can never exceed its largest term, so

\max_\pi \sum_a \pi(a|s)q(s,a) = \max_a q(s,a),

where the maximum is attained by the deterministic choice $\pi(a|s) = 1$ for $a = \arg\max_a q(s,a)$ and $\pi(a|s) = 0$ otherwise. For example, if $q_3$ is the greatest of $q_1, q_2, q_3$, then the maximum of $c_1 q_1 + c_2 q_2 + c_3 q_3$ over $c_1, c_2, c_3 \geq 0$ with $c_1 + c_2 + c_3 = 1$ is $q_3$, achieved by $c_3 = 1$. A small numerical check is given below.
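A tiny Python check of this fact, reusing the action values of $s_1$ from Section 2 (the random sampling is only an illustration):

```python
import numpy as np

q = np.array([6.2, 8.0, 9.0, 6.2, 7.2])   # action values of s1 from Section 2
rng = np.random.default_rng(0)

# No probability vector pi(.|s1) can push the weighted average above max_a q(s1, a) ...
assert all(rng.dirichlet(np.ones(5)) @ q <= q.max() + 1e-12 for _ in range(10_000))

# ... and the greedy deterministic choice attains it exactly.
pi_greedy = np.eye(5)[q.argmax()]          # all probability mass on a3
print(pi_greedy @ q, q.max())              # 9.0 9.0
```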

5.2-BOE: Rewrite as $v = f(v)$

The BOE is $v = \max_\pi(r_\pi + \gamma P_\pi v)$. Let

f(v) := \max_\pi(r_\pi + \gamma P_\pi v)

Then, the Bellman optimality equation becomes

v = f(v)

where

[f(v)]_s = \max_\pi \sum_a \pi(a|s)q(s,a), \quad s \in \mathcal{S}

This equation looks very simple. How to solve it?

5.3-Contraction mapping theorem

Some concepts:

  • Fixed point: $x \in X$ is a fixed point of $f: X \rightarrow X$ if

f(x) = x

  • Contraction mapping: $f: X \rightarrow X$ is a contraction mapping if

||f(x_1) - f(x_2)|| \leq \gamma ||x_1 - x_2||

where $\gamma \in (0, 1)$.

  • Remarks:
    • $\gamma$ must be strictly less than 1 so that many limits such as $\gamma^{k} \rightarrow 0$ as $k \rightarrow \infty$ hold.
    • Here $|| \cdot ||$ is the norm of the vector space $X$.

Examples to demonstrate the concepts:

For example, $f(x) = 0.5x$ with $x \in \mathbb{R}$ is a contraction mapping, since $|0.5x_1 - 0.5x_2| = 0.5|x_1 - x_2|$; its fixed point is $x^* = 0$, which satisfies $0.5x^* = x^*$. More generally, $f(x) = Ax$ with $||A|| \leq \gamma < 1$ (induced norm) is a contraction mapping, again with fixed point $x^* = 0$.

With these concepts in place, we can state the contraction mapping theorem:

Theorem (Contraction mapping theorem). For any equation of the form $x = f(x)$, if $f$ is a contraction mapping, then:

  • Existence: there exists a fixed point $x^*$ satisfying $f(x^*) = x^*$.
  • Uniqueness: the fixed point $x^*$ is unique.
  • Algorithm: consider the sequence $\{x_k\}$ where $x_{k+1} = f(x_k)$; then $x_k \rightarrow x^*$ as $k \rightarrow \infty$ for any initial guess $x_0$. Moreover, the convergence rate is exponentially fast.
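A minimal numerical illustration of the three claims, using a simple linear contraction on $\mathbb{R}^2$ (the particular map is an arbitrary example, not from the lecture): iterates started from two very different points converge to the same, unique fixed point.

```python
import numpy as np

# f(x) = A x + b with ||A||_inf = 0.5 < 1, hence a contraction in the infinity norm
A = np.array([[0.3, 0.2],
              [0.1, 0.4]])
b = np.array([1.0, -1.0])
f = lambda x: A @ x + b

x, y = np.zeros(2), np.array([100.0, -50.0])  # two very different initial guesses
for _ in range(60):                           # iterate x_{k+1} = f(x_k)
    x, y = f(x), f(y)

x_star = np.linalg.solve(np.eye(2) - A, b)    # the unique fixed point, x* = A x* + b
print(x, y, x_star)                           # all three agree to numerical precision
```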

6-BOE: Solution

6.1-Contraction property of BOE

Let’s come back to the Bellman optimality equation:

v = f(v) = \max_\pi(r_\pi + \gamma P_\pi v)

Theorem (Contraction property of the BOE). The function $f(v) = \max_\pi(r_\pi + \gamma P_\pi v)$ on the right-hand side of the BOE is a contraction mapping. In particular, for any $v_1, v_2$,

||f(v_1) - f(v_2)||_\infty \leq \gamma ||v_1 - v_2||_\infty,

where $\gamma \in (0,1)$ is the discount rate (a detailed proof is given in Section 9.2).

6.2-Solve the Bellman optimality equation

Applying the contraction mapping theorem gives the following results:

Theorem (Existence, uniqueness, and algorithm). For the BOE $v = f(v) = \max_\pi(r_\pi + \gamma P_\pi v)$, there always exists a unique solution $v^*$, which can be solved iteratively by

v_{k+1} = f(v_k) = \max_\pi(r_\pi + \gamma P_\pi v_k) \qquad (1)

The sequence $\{v_k\}$ converges to $v^*$ exponentially fast for any initial guess $v_0$.

Importance: The algorithm in (1) is called the value iteration algorithm.
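Below is a minimal Python sketch of the value iteration algorithm in (1), assuming the model is stored as arrays `P[a, s, s']` $= p(s'|s,a)$ and `R[s, a]` $= \sum_r p(r|s,a)r$; the array layout, function name, and stopping tolerance are illustrative choices rather than part of the lecture. (Section 8 reuses this sketch on a toy MDP.)

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Solve the BOE v = max_pi (r_pi + gamma * P_pi v) by iterating v_{k+1} = f(v_k).

    P[a, s, s'] : p(s'|s, a)
    R[s, a]     : expected immediate reward, sum_r p(r|s, a) * r
    """
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)                              # any initial guess works
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s'} p(s'|s, a) * v(s')
        q = R + gamma * np.einsum("asz,z->sa", P, v)
        v_new = q.max(axis=1)                           # elementwise maximization over actions
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)              # v* and a greedy (optimal) policy
        v = v_new
```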

7-Optimality

7.1-Policy Optimality

Suppose $v^*$ is the solution to the Bellman optimality equation. It satisfies

v^* = \max_\pi(r_\pi + \gamma P_\pi v^*)

Suppose

\pi^* = \arg \max_\pi(r_\pi + \gamma P_\pi v^*)

Then

v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*

Therefore, this last equation is the Bellman equation of $\pi^*$, so $v^* = v_{\pi^*}$ is the state value of $\pi^*$.

Is $\pi^*$ the optimal policy? Is $v^*$ the greatest state value that can be achieved?

Theorem (Policy optimality). Suppose $v^*$ is the unique solution to the BOE and $v_\pi$ is the state value of any given policy $\pi$. Then

v^*(s) = v_{\pi^*}(s) \geq v_\pi(s) \quad \text{for all} \quad s \in \mathcal{S},

i.e., $v^*$ is the optimal state value and $\pi^*$ is an optimal policy.

Now we understand why we study the BOE: it describes the optimal state value and the optimal policy in an elegant way.

7.2-Optimal policy

What does an optimal policy $\pi^*$ look like?

\pi^*(s) = \arg \max_\pi \sum_a \pi(a|s) \underbrace{\left(\sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v^*(s')\right)}_{q^*(s,a)}

By the maximization argument in Section 5.1, a deterministic greedy policy is optimal:

\pi^*(a|s) = \begin{cases} 1 & a = a^*(s) \\ 0 & a \neq a^*(s) \end{cases}

where $a^*(s) = \arg\max_a q^*(s,a)$. In other words, there always exists a deterministic optimal policy, obtained by selecting at every state the action with the greatest optimal action value $q^*(s,a)$.
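A minimal sketch of this greedy extraction, given a solved $v^*$ and the same illustrative model arrays as in the value-iteration sketch of Section 6.2:

```python
import numpy as np

def greedy_policy(v_star, P, R, gamma):
    """Return a deterministic optimal policy: pi*(s) = argmax_a q*(s, a)."""
    # q*[s, a] = R[s, a] + gamma * sum_{s'} p(s'|s, a) * v*(s')
    q_star = R + gamma * np.einsum("asz,z->sa", P, v_star)
    return q_star.argmax(axis=1)               # one greedy action index per state
```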

8-Analyzing optimal policies

What factors determine the optimal state value and optimal policy?
It can be clearly seen from the BOE

v(s) = \max_\pi \sum_a \pi(a|s) \left(\sum_r \textcolor{red}{p(r|s,a)}r + \textcolor{red}{\gamma} \sum_{s'} \textcolor{red}{p(s'|s,a)}v(s')\right)

that there are three factors:

  • System model: $p(s'|s,a)$, $p(r|s,a)$
  • Reward design: $r$
  • Discount rate: $\gamma$

We next show how $r$ and $\gamma$ can affect the optimal policy.

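Below is a small numerical illustration, reusing the `value_iteration` sketch from Section 6.2 on a toy two-state MDP (an illustrative example, not from the lecture): a large $\gamma$ makes the optimal policy far-sighted, a small $\gamma$ makes it short-sighted, and adding the same constant to every reward changes the optimal values but not the optimal policy.

```python
import numpy as np

# Toy 2-state MDP: at s0, action 0 ("stay") gives reward 1 and stays at s0, while
# action 1 ("go") gives reward 0 and moves to s1; at s1, action 1 collects reward 2 forever.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

for gamma in (0.9, 0.3):
    v_star, pi_star = value_iteration(P, R, gamma)
    print(gamma, v_star.round(2), pi_star)
# gamma = 0.9: pi*(s0) = 1 ("go")   -- far-sighted, gives up the immediate reward
# gamma = 0.3: pi*(s0) = 0 ("stay") -- short-sighted, keeps the immediate reward

# Shifting every reward by the same constant changes v* but not the optimal policy:
print(value_iteration(P, R + 5.0, 0.9)[1])  # same greedy policy as with R and gamma = 0.9
```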

9-Additional Content

9.1-Detailed interpretation of the contraction mapping theorem

  1. Existence: a fixed point $x^*$ must exist

    • “there exists a fixed point $x^*$ satisfying $f(x^*) = x^*$”
    • Meaning: as long as $f$ is a contraction mapping on a complete metric space, there exists at least one $x^*$ such that $f(x^*) = x^*$ (i.e., the fixed-point equation has a solution).
    • Why does it exist? (proof idea, following references 3 and 4):
      • Pick an arbitrary initial point $x_0$ in the space and construct the iterative sequence $x_1 = f(x_0)$, $x_2 = f(x_1) = f(f(x_0))$, $x_3 = f(x_2) = f(f(f(x_0)))$, $\dots$
      • Because $f$ is a contraction mapping, this sequence is a Cauchy sequence (points later in the sequence get closer and closer to each other, with distances tending to 0); since the space is complete, the Cauchy sequence converges to some limit $x^*$, and this $x^*$ is a fixed point (taking the limit on both sides of $x_{k+1} = f(x_k)$ gives $x^* = f(x^*)$).
  2. Uniqueness: the fixed point $x^*$ is unique

    • “The fixed point $x^*$ is unique”
    • Meaning: there is no second fixed point $y^*$ (the fixed-point equation has exactly one solution).
    • How is uniqueness proved? (by contradiction, as in references 3 and 4):
      • Suppose there were another fixed point $y^*$ (satisfying $f(y^*) = y^*$); then $d(x^*, y^*) = d(f(x^*), f(y^*)) \leq \alpha \cdot d(x^*, y^*)$.
      • Since $\alpha < 1$, subtracting $\alpha \cdot d(x^*, y^*)$ from both sides gives $d(x^*, y^*) - \alpha \cdot d(x^*, y^*) \leq 0 \implies d(x^*, y^*)(1 - \alpha) \leq 0$.
      • Since $1 - \alpha > 0$, the only possibility is $d(x^*, y^*) = 0$, i.e., $x^* = y^*$, so the fixed point is unique.
  3. Algorithm: the iterative sequence converges to $x^*$, exponentially fast

    • “Consider a sequence $\{x_k\}$ where $x_{k+1} = f(x_k)$, then $x_k \to x^*$ as $k \to \infty$. Moreover, the convergence rate is exponentially fast.”

    • Meaning: this gives a concrete solution method. Starting from an arbitrary initial point $x_0$, generate the sequence with $x_{k+1} = f(x_k)$; when the number of iterations $k$ is large enough, $x_k$ gets arbitrarily close to the fixed point $x^*$, and the convergence is exponentially fast.

    • Intuitive example (take $f(x) = 0.5x$):
      Start from $x_0 = 1$; then
      $x_1 = f(x_0) = 0.5 \times 1 = 0.5$
      $x_2 = f(x_1) = 0.5 \times 0.5 = 0.25$
      $x_3 = f(x_2) = 0.5 \times 0.25 = 0.125$
      $x_4 = 0.0625$, …, $x_k = (0.5)^k$
      As $k \to \infty$, $x_k \to 0$ (the fixed point $x^* = 0$).

    • Why exponential convergence matters:

      • The error $d(x_k, x^*)$ shrinks at the rate $\alpha^k$ ($\alpha < 1$). For instance, with $\alpha = 0.5$, after 10 iterations the error is about $(0.5)^{10} \approx \frac{1}{1024}$ of the initial error, and after 20 iterations about $\frac{1}{10^6}$; the iterates approach the fixed point very quickly. This is why the contraction mapping theorem is widely used in numerical computation (e.g., solving differential equations and the Bellman equations in reinforcement learning): the iteration is efficient.
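A minimal Python sketch reproducing the $f(x) = 0.5x$ example above and showing the exponential error decay (the snippet simply recomputes the numbers worked out by hand):

```python
# Fixed-point iteration x_{k+1} = f(x_k) for the contraction f(x) = 0.5 * x.
# The fixed point is x* = 0 and the error shrinks by the factor 0.5 at every step.
x, x_star = 1.0, 0.0
for k in range(11):
    print(k, x, abs(x - x_star))   # the error is exactly 0.5**k
    x = 0.5 * x
```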

9.2-Proof that the BOE is a contraction mapping

  1. Recap: definition of a contraction mapping

In a metric space (here, the space of bounded value functions with a suitable norm such as the infinity norm $\| \cdot \|_\infty$), a map $f: X \to X$ is a contraction mapping if there exists a constant $\alpha \in [0,1)$ such that for all $v_1, v_2 \in X$,

\| f(v_1) - f(v_2) \| \leq \alpha \| v_1 - v_2 \|

($\alpha$ is the contraction constant and must be less than 1).

  2. The mapping structure of the Bellman optimality equation
    The map defined by the BOE is

f(v) = \max_\pi \left( r_\pi + \gamma P_\pi v \right)

where:

  • $r_\pi$: the expected immediate reward vector under policy $\pi$ (each element is the expected reward of one state).
  • $P_\pi$: the state transition probability matrix under policy $\pi$ (each row sums to 1, since it is a probability transition matrix).
  • $\gamma$: the discount rate, with $0 \leq \gamma < 1$.
  3. Key step: prove the contraction inequality
    We need to show that for any two value functions $v_1, v_2$,

\| f(v_1) - f(v_2) \| \leq \gamma \| v_1 - v_2 \|

  • Step 1: the term under a single policy
    For any policy $\pi$, consider how $r_\pi + \gamma P_\pi v$ depends on $v$:

(r_\pi + \gamma P_\pi v_1) - (r_\pi + \gamma P_\pi v_2) = \gamma P_\pi (v_1 - v_2)

Since $P_\pi$ is a transition probability matrix (nonnegative entries, each row summing to 1), the norm property of matrix-vector products gives, for the infinity norm $\| \cdot \|_\infty$, $\| P_\pi w \|_\infty \leq \| w \|_\infty$ (a weighted average cannot increase the infinity norm of a vector).

Therefore:

\| (r_\pi + \gamma P_\pi v_1) - (r_\pi + \gamma P_\pi v_2) \|_\infty = \gamma \| P_\pi (v_1 - v_2) \|_\infty \leq \gamma \| v_1 - v_2 \|_\infty

  • Step 2: the effect of maximizing over all policies
    $f(v)$ is the elementwise maximum over all policies $\pi$ of $r_\pi + \gamma P_\pi v$ (that is, the value of each state is the maximum over all policies of the corresponding element of $r_\pi + \gamma P_\pi v$).

By the monotonicity of the maximum: if $g_1(\pi) \leq g_2(\pi) + c$ for all $\pi$, then $\max_\pi g_1(\pi) \leq \max_\pi g_2(\pi) + c$.

Let $g_1(\pi) = r_\pi + \gamma P_\pi v_1$ and $g_2(\pi) = r_\pi + \gamma P_\pi v_2$; then $g_1(\pi) - g_2(\pi) \leq \gamma \| v_1 - v_2 \|_\infty$ elementwise (by Step 1). Therefore:

\max_\pi g_1(\pi) - \max_\pi g_2(\pi) \leq \max_\pi \left( g_1(\pi) - g_2(\pi) \right) \leq \gamma \| v_1 - v_2 \|_\infty

Swapping the roles of $v_1$ and $v_2$ gives the same bound in the other direction, so the elementwise difference is bounded in absolute value by $\gamma \| v_1 - v_2 \|_\infty$. This means:

\| f(v_1) - f(v_2) \|_\infty \leq \gamma \| v_1 - v_2 \|_\infty

  4. Conclusion: the contraction condition is satisfied
    Since the discount rate $\gamma \in [0,1)$ (the contraction constant is less than 1), the map $f(v) = \max_\pi (r_\pi + \gamma P_\pi v)$ is a contraction mapping.
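As a sanity check of the proof above, the following sketch verifies the inequality $\| f(v_1) - f(v_2) \|_\infty \leq \gamma \| v_1 - v_2 \|_\infty$ numerically on a randomly generated MDP (the construction is an arbitrary illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 3, 0.9

# Random MDP: P[a, s, :] is a probability distribution over next states
P = rng.random((n_a, n_s, n_s))
P /= P.sum(axis=2, keepdims=True)
R = rng.standard_normal((n_s, n_a))

def f(v):
    # [f(v)]_s = max_a ( R[s, a] + gamma * sum_{s'} p(s'|s, a) * v(s') )
    return (R + gamma * np.einsum("asz,z->sa", P, v)).max(axis=1)

for _ in range(1000):
    v1, v2 = rng.standard_normal(n_s), rng.standard_normal(n_s)
    assert np.max(np.abs(f(v1) - f(v2))) <= gamma * np.max(np.abs(v1 - v2)) + 1e-12
print("contraction inequality holds on all sampled pairs")
```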