Lecture 2: State Value and Bellman Equation

Outline

In this lecture:

  • A core concept: state value
  • A fundamental tool: Bellman equation

Motivating examples

Motivating example 1: Why is return important?

(Figures: three example policies starting from $s_1$ and their returns.)

In summary, starting from $s_1$,

$$\text{return}_1 > \text{return}_3 > \text{return}_2$$

The above inequality suggests that the first policy is the best and the second policy is the worst, which matches our intuition exactly. Calculating returns is important for evaluating a policy.

Motivating example 2: How to calculate return?

  • While return is important, how to calculate it?

(Figure: an example with four states $s_1, s_2, s_3, s_4$ visited in a loop, with rewards $r_1, r_2, r_3, r_4$.)

  • Method 1: by definition
    • Let $v_i$ denote the return obtained starting from $s_i$ $(i = 1, 2, 3, 4)$

$$
\begin{aligned}
v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \\
v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \dots \\
v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \dots \\
v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \dots
\end{aligned}
$$

  • Method 2:
    • The returns rely on each other. Bootstrapping!

$$
\begin{aligned}
v_1 &= r_1 + \gamma(r_2 + \gamma r_3 + \dots) = r_1 + \gamma v_2 \\
v_2 &= r_2 + \gamma(r_3 + \gamma r_4 + \dots) = r_2 + \gamma v_3 \\
v_3 &= r_3 + \gamma(r_4 + \gamma r_1 + \dots) = r_3 + \gamma v_4 \\
v_4 &= r_4 + \gamma(r_1 + \gamma r_2 + \dots) = r_4 + \gamma v_1
\end{aligned}
$$

How to solve these equations? Write in the following matrix-vector form:

$$
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} =
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} +
\begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix} =
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} + \gamma
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}
$$

which can be rewritten as:

$$\mathbf{v} = \mathbf{r} + \gamma \mathbf{P} \mathbf{v}$$

How to solve this equation? Rearranging gives the closed-form solution:

$$\mathbf{v} = \mathbf{r} + \gamma \mathbf{P} \mathbf{v} \quad\Longrightarrow\quad \mathbf{v} = (I - \gamma \mathbf{P})^{-1} \mathbf{r}$$

This is the Bellman equation (for this specific deterministic problem)!

  • Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.
  • The matrix-vector form makes it clearer how to solve for the state values, as the numerical sketch below illustrates.
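
As a sanity check, the circular example above can be solved numerically. The following is a minimal sketch assuming NumPy; the reward values $r_1, \ldots, r_4$ are placeholders, not the ones used in the lecture figures.

```python
import numpy as np

# Circular example: s1 -> s2 -> s3 -> s4 -> s1.
gamma = 0.9
r = np.array([0.0, 1.0, 1.0, 1.0])           # placeholder values for r_1, ..., r_4
P = np.array([[0, 1, 0, 0],                  # row i gives the state that s_i transitions to
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# Closed-form solution v = (I - gamma * P)^{-1} r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)
```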

State Value

Some notations

Consider the following single-step process:

$$S_t \stackrel{A_t}{\rightarrow} R_{t+1}, S_{t+1}$$

$t, t+1$: discrete time instants
$S_t$: the state at time $t$
$A_t$: the action taken in state $S_t$
$R_{t+1}$: the reward obtained after taking $A_t$
$S_{t+1}$: the state transitioned to after taking $A_t$

Note that $S_t$, $A_t$, and $R_{t+1}$ are all random variables.

This step is governed by the following probability distributions:
$S_t \rightarrow A_t$ is governed by $\pi(A_t = a|S_t = s)$
$S_t, A_t \rightarrow R_{t+1}$ is governed by $p(R_{t+1} = r|S_t = s, A_t = a)$
$S_t, A_t \rightarrow S_{t+1}$ is governed by $p(S_{t+1} = s'|S_t = s, A_t = a)$

At this moment, we assume we know the model (i.e., the probability distributions)!

Consider the following multi-step trajectory:

$$S_t \stackrel{A_t}{\rightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\rightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\rightarrow} R_{t+3}, \ldots$$

The discounted return is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$$

$\gamma \in (0,1)$ is a discount rate.
$G_t$ is also a random variable since $R_{t+1}, R_{t+2}, \ldots$ are random variables.
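
As a quick illustration of the definition, here is a minimal sketch that evaluates a discounted return for a finite (truncated) reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

# Example: rewards 0, 1, 1, 1 with gamma = 0.9
print(discounted_return([0, 1, 1, 1], 0.9))   # 0 + 0.9 + 0.81 + 0.729 = 2.439
```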

State Value

The expectation (or called expected value or mean) of GtG_t is defined as the state-value function or simply state value:

$$v_{\pi}(s) = \mathbb{E}[G_t|S_t = s]$$

  • Remarks:

    • It is a function of $s$. It is a conditional expectation with the condition that the state starts from $s$.
    • It is based on the policy $\pi$. For a different policy, the state value may be different.
    • It represents the “value” of a state: if the state value is greater, then the policy is better because greater cumulative rewards can be obtained.
  • Q: What is the relationship between return and state value?

    • A: The state value is the mean of all possible returns that can be obtained starting from a state. If everything ($\pi(a|s)$, $p(r|s,a)$, $p(s'|s,a)$) is deterministic, then the state value is the same as the return.
  • Example


  • Recall the returns obtained from $s_1$ for the three examples:
    • $v_{\pi_1}(s_1) = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1 - \gamma}$

    • $v_{\pi_2}(s_1) = -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = -1 + \gamma(1 + \gamma + \gamma^2 + \cdots) = -1 + \frac{\gamma}{1 - \gamma}$

    • $v_{\pi_3}(s_1) = 0.5\left(-1 + \frac{\gamma}{1 - \gamma}\right) + 0.5\left(\frac{\gamma}{1 - \gamma}\right) = -0.5 + \frac{\gamma}{1 - \gamma}$

  • Clearly, $\pi_1$ is the best of the three policies, because its state value $v_{\pi_1}(s_1)$ is the largest; the values for $\gamma = 0.9$ are worked out below.
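
For instance, with $\gamma = 0.9$ these expressions evaluate to

$$v_{\pi_1}(s_1) = \frac{0.9}{1 - 0.9} = 9, \qquad v_{\pi_2}(s_1) = -1 + 9 = 8, \qquad v_{\pi_3}(s_1) = -0.5 + 9 = 8.5,$$

which is consistent with the ordering $v_{\pi_1}(s_1) > v_{\pi_3}(s_1) > v_{\pi_2}(s_1)$ observed earlier for the returns.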

Bellman Equation: Derivation

Bellman Equation

  • While the state value is important, how do we calculate it? The answer lies in the Bellman equation.
  • In a word, the Bellman equation describes the relationship among the values of all states.
  • Next, we derive the Bellman equation.
  • There is some math.
  • We already have the intuition.

Deriving the Bellman Equation

Consider a random trajectory:

$$S_t \stackrel{A_t}{\rightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\rightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\rightarrow} R_{t+3}, \ldots$$

The return GtG_t can be written as

$$
\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \\
&= R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \ldots) \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$

Then, it follows from the definition of the state value that

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_t|S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_t = s] \\
&= \mathbb{E}[R_{t+1}|S_t = s] + \gamma \mathbb{E}[G_{t+1}|S_t = s]
\end{aligned}
$$

Next, calculate the two terms, respectively.

First, calculate the first term $\mathbb{E}[R_{t+1}|S_t = s]$:

$$
\begin{aligned}
\mathbb{E}[R_{t+1}|S_t = s] &= \sum_{a} \pi(a|s)\mathbb{E}[R_{t+1}|S_t = s, A_t = a] \\
&= \sum_{a} \pi(a|s) \sum_{r} p(r|s,a)r
\end{aligned}
$$

Note that

  • This is the mean of immediate rewards.
  • The derivation above actually applies the expectation formula for discrete random variables twice, as the sketch below illustrates.
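
To make the two nested expectations concrete, here is a minimal sketch; the dictionaries `pi` and `p_r` are hypothetical stand-ins for $\pi(a|s)$ and $p(r|s,a)$ at one fixed state $s$:

```python
# E[R_{t+1} | S_t = s] = sum_a pi(a|s) * sum_r p(r|s,a) * r
def expected_immediate_reward(pi, p_r):
    """pi: dict a -> pi(a|s); p_r: dict a -> {r: p(r|s,a)} for one fixed state s."""
    return sum(pi[a] * sum(prob * r for r, prob in p_r[a].items()) for a in pi)

# Hypothetical numbers: two actions taken with probability 0.5 each,
# one yielding reward 0 and the other reward -1 (both deterministically).
pi = {"a1": 0.5, "a2": 0.5}
p_r = {"a1": {0.0: 1.0}, "a2": {-1.0: 1.0}}
print(expected_immediate_reward(pi, p_r))   # 0.5 * 0 + 0.5 * (-1) = -0.5
```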

Second, calculate the second term $\mathbb{E}[G_{t+1}|S_t = s]$:

$$
\begin{aligned}
\mathbb{E}[G_{t+1}|S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1}|S_t = s, S_{t+1} = s']p(s'|s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1}|S_{t+1} = s']p(s'|s) \\
&= \sum_{s'} v_\pi(s')p(s'|s) \\
&= \sum_{s'} v_\pi(s') \sum_{a} p(s'|s,a)\pi(a|s)
\end{aligned}
$$

Note that

  • This is the mean of future rewards
  • $\mathbb{E}[G_{t+1}|S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1}|S_{t+1} = s']$ due to the memoryless Markov property.

Therefore, we have

$$v_\pi(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s') \right], \quad \text{for all } s \in S.$$

Highlights:

  • The above equation is called the Bellman equation, which characterizes the relationship among the state-value functions of different states.
  • It consists of two terms: the immediate reward term and the future reward term.
  • A set of equations: every state has an equation like this!!!


Supplement 1: Understanding the derivation of the future-reward term in the Bellman equation

This requires combining the law of total expectation, the Markov property of the MDP, and the definition of the state-value function. We break the derivation into three steps.

1. The key object: the future cumulative return $G_{t+1}$

$G_{t+1}$ is the cumulative discounted return starting from time $t+1$, defined as:

$$G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots$$

The state-value function $v_\pi(s')$ is defined as the expected cumulative discounted return obtained starting from state $s'$ under policy $\pi$, i.e.:

$$v_\pi(s') = \mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} = s' \right]$$

2. Step 1: take the expectation over the action $A_t$ (conditioning via the policy $\pi$)

We need to compute $\mathbb{E}\left[ G_{t+1} \mid S_t = s \right]$, the expectation of the future cumulative return $G_{t+1}$ given that the state at time $t$ is $s$.

Since the action $A_t$ chosen at time $t$ is random (it is determined by the policy $\pi$, i.e., $\pi(a \mid s) = P(A_t = a \mid S_t = s)$), the law of total expectation (conditioning on every possible action $a$) gives:

$$\mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \cdot \mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right]$$

Interpretation: the expectation of $G_{t+1}$ in state $s$ equals, for each action $a$, the conditional expectation of $G_{t+1}$ given that $a$ is taken, weighted by the probability of taking $a$, summed over all $a$.

3. Step 2: take the expectation over the next state $S_{t+1}$ (conditioning via the transition probability $p$)

Now we need to compute $\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right]$, the expectation of $G_{t+1}$ given that the state is $s$ and the action is $a$.

After action $a$ is executed, the next state $S_{t+1}$ is random (it is determined by the state transition probability $p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$). By the law of total expectation (conditioning on every possible next state $s'$):

$$\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s'} p(s' \mid s, a) \cdot \mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]$$

Moreover, $\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]$ is exactly $v_\pi(s')$: once the next state is known to be $s'$, the expected cumulative return from $s'$ onward is the state value $v_\pi(s')$ (this follows from the Markov property together with the definition of the state-value function). Therefore:

$$\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$

4. Step 3: combine the results and include the discount factor $\gamma$

Substituting the result of Step 2 into Step 1 gives:

$$\mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$

The original term is $\gamma \mathbb{E}\left[ G_{t+1} \mid S_t = s \right]$ (the future return is weighted by the discount factor $\gamma$), so finally:

$$\gamma \mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$

Summary of the core logic

Using the law of total expectation, we first condition on the action $a$ (weighted by the policy $\pi$) and then on the next state $s'$ (weighted by the transition probability $p$). This turns the expectation of the future cumulative return into an expectation over the value function of the next state, combined with the discount factor $\gamma$, and it reflects the Markov property of the MDP: the future return depends only on the next state, not on the history.

An illustrative example

(Figure: an example with four states $s_1, \ldots, s_4$ and a deterministic policy.)

Write out the Bellman equation according to the general expression:

$$v_\pi(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s') \right]$$

This example is simple because the policy is deterministic.

First, consider the state value of $s_1$:

  • $\pi(a = a_3|s_1) = 1$ and $\pi(a \neq a_3|s_1) = 0$.
  • $p(s' = s_3|s_1,a_3) = 1$ and $p(s' \neq s_3|s_1,a_3) = 0$.
  • $p(r = 0|s_1,a_3) = 1$ and $p(r \neq 0|s_1,a_3) = 0$.

Substituting them into the Bellman equation gives

$$v_\pi(s_1) = 0 + \gamma v_\pi(s_3)$$

Similarly, it can be obtained that

$$
\begin{aligned}
v_\pi(s_1) &= 0 + \gamma v_\pi(s_3), \\
v_\pi(s_2) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_3) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_4) &= 1 + \gamma v_\pi(s_4).
\end{aligned}
$$

How to solve them?


Solve the above equations one by one from the last to the first:

$$
\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_3) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_2) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_1) &= \frac{\gamma}{1 - \gamma}.
\end{aligned}
$$

If $\gamma = 0.9$, then

$$
\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_3) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_2) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_1) &= \frac{0.9}{1 - 0.9} = 9.
\end{aligned}
$$
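
These values can be double-checked numerically by back-substituting from the last equation to the first, exactly as described above (a minimal sketch in plain Python):

```python
gamma = 0.9

# Solve the four equations from the last to the first.
v4 = 1 / (1 - gamma)        # from v4 = 1 + gamma * v4
v3 = 1 + gamma * v4
v2 = 1 + gamma * v4
v1 = 0 + gamma * v3
print(v1, v2, v3, v4)       # approximately 9, 10, 10, 10
```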

What to do after we have calculated state values? (Calculate action values and improve the policy.)

(Figure: the same example with a stochastic policy at $s_1$: two actions are taken with probability 0.5 each.)

Answer:

$$
\begin{aligned}
v_\pi(s_1) &= 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)], \\
v_\pi(s_2) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_3) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_4) &= 1 + \gamma v_\pi(s_4).
\end{aligned}
$$

Solve the above equations one by one from the last to the first.

$$
\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - \gamma}, \quad v_\pi(s_3) = \frac{1}{1 - \gamma}, \quad v_\pi(s_2) = \frac{1}{1 - \gamma}, \\
v_\pi(s_1) &= 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)] = -0.5 + \frac{\gamma}{1 - \gamma}.
\end{aligned}
$$

Substituting $\gamma = 0.9$ yields

$$v_\pi(s_4) = 10, \quad v_\pi(s_3) = 10, \quad v_\pi(s_2) = 10, \quad v_\pi(s_1) = -0.5 + 9 = 8.5.$$

Compare with the previous policy. This one is worse.

Bellman Equation: Matrix-vector form

Matrix-vector form of the Bellman equation

Why consider the matrix-vector form? Because we need to solve the state values from it!

  • One unknown relies on another unknown. How to solve the unknowns?

    $$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s') \right]$$

  • Elementwise form: The above elementwise form is valid for every state $s \in S$. That means there are $|S|$ equations like this!

  • Matrix-vector form: If we put all the equations together, we have a set of linear equations, which can be concisely written in a matrix-vector form. The matrix-vector form is very elegant and important.

Recall that:

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s') \right]$$

Rewrite the Bellman equation as

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)v_\pi(s')$$

where

$$r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)r, \quad p_\pi(s'|s) \triangleq \sum_a \pi(a|s)p(s'|s,a)$$

Suppose the states could be indexed as $s_i$ $(i = 1, \ldots, n)$.
For state $s_i$, the Bellman equation is

$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j} p_\pi(s_j|s_i)v_\pi(s_j)$$

Put all these equations for all the states together and rewrite them in a matrix-vector form:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

where

  • $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^T \in \mathbb{R}^n$
  • $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^T \in \mathbb{R}^n$
  • $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j|s_i)$, is the state transition matrix (a sketch that assembles $r_\pi$ and $P_\pi$ in code is given below)
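
To make this reduction concrete, the sketch below assembles $r_\pi$ and $P_\pi$ from a tabular policy and model and then solves the matrix-vector equation. It assumes NumPy, and the dictionaries `pi`, `p_r`, `p_s` are hypothetical containers for $\pi(a|s)$, $p(r|s,a)$, and $p(s'|s,a)$:

```python
import numpy as np

def bellman_matrices(states, actions, pi, p_r, p_s):
    """Assemble r_pi and P_pi from a tabular policy and model.

    pi[s][a]     : pi(a|s)
    p_r[(s, a)]  : dict mapping reward r to p(r|s,a)
    p_s[(s, a)]  : dict mapping next state s' to p(s'|s,a)
    """
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    r_pi = np.zeros(n)
    P_pi = np.zeros((n, n))
    for s in states:
        for a in actions:
            pa = pi[s].get(a, 0.0)
            if pa == 0.0:
                continue
            r_pi[idx[s]] += pa * sum(prob * r for r, prob in p_r[(s, a)].items())
            for s_next, prob in p_s[(s, a)].items():
                P_pi[idx[s], idx[s_next]] += pa * prob
    return r_pi, P_pi

def solve_bellman(r_pi, P_pi, gamma):
    """Closed-form solution v_pi = (I - gamma * P_pi)^{-1} r_pi."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)
```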

If there are four states, $v_\pi = r_\pi + \gamma P_\pi v_\pi$ can be written out as

$$
\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi} =
\underbrace{\begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix}}_{r_\pi} + \gamma
\underbrace{\begin{bmatrix}
p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\
p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\
p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\
p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4)
\end{bmatrix}}_{P_\pi}
\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}.
$$

(Figure: the deterministic-policy example from above.)

For this specific example:

$$
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} =
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma
\begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}
$$

(Figure: the stochastic-policy example from above.)

For this specific example:

$$
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} =
\begin{bmatrix} 0.5(0) + 0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma
\begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}
$$

Supplement 2: Understanding the exchange of summations in the future-reward term during the substitution step

To understand how the sums are merged in this step, the key is the policy-weighted average over actions $a$: this operation reduces the detailed, action-dependent expression to an expectation under the policy $\pi$. We break it down step by step.

Step 1: define the expected reward under policy $\pi$, $r_\pi(s)$

The immediate-reward part of the original Bellman equation is:

$$\sum_a \pi(a|s) \sum_r p(r|s,a) r$$

This expression means: in state $s$, first weight every possible action $a$ by the policy $\pi$; then, for each action $a$, take the expectation over all possible rewards $r$ according to $p(r|s,a)$. The result is the expected immediate reward of state $s$ under policy $\pi$.

To simplify the notation, we define this quantity as $r_\pi(s)$:

$$r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a) r$$

Step 2: define the state transition probability under policy $\pi$, $p_\pi(s'|s)$

The future-reward part of the original Bellman equation (the term containing $v_\pi(s')$) is:

$$\sum_a \pi(a|s) \left[ \gamma \sum_{s'} p(s'|s,a) v_\pi(s') \right]$$

Sub-step 1: pull out the constant $\gamma$

Since the discount factor $\gamma$ does not depend on $a$ or $s'$, it can be pulled out of the inner sums, giving:

$$\gamma \sum_a \pi(a|s) \sum_{s'} p(s'|s,a) v_\pi(s')$$

Sub-step 2: exchange the order of summation and merge the sum over $a$

Because the (finite) sums commute, a basic property of linear operations, we can swap the sum over $a$ with the sum over $s'$, giving:

$$\gamma \sum_{s'} \left( \sum_a \pi(a|s) p(s'|s,a) \right) v_\pi(s')$$

The bracketed term $\sum_a \pi(a|s) p(s'|s,a)$ means: in state $s$, weight each action $a$ by the policy $\pi$ and sum the probabilities $p(s'|s,a)$ of transitioning to $s'$ under each action. The result is the probability of transitioning from $s$ to $s'$ under policy $\pi$.

To simplify the notation, we define this quantity as $p_\pi(s'|s)$:

$$p_\pi(s'|s) \triangleq \sum_a \pi(a|s) p(s'|s,a)$$

Step 3: combine into the simplified Bellman equation

Combining the immediate-reward part $r_\pi(s)$ and the future-reward part $\gamma \sum_{s'} p_\pi(s'|s) v_\pi(s')$, the original Bellman equation simplifies to:

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s) v_\pi(s')$$

Summary of the core logic

By taking the policy-weighted expectation over actions $a$, the detailed, action-dependent Bellman equation is transformed into a Markov-chain form under policy $\pi$, in which both the transition probabilities and the rewards are averaged under $\pi$. This makes the equation more concise and brings it closer to the Bellman equation of an ordinary Markov chain.

Bellman equation: Solve the state values

Solve state values

Why solve state values?

  • Given a policy, finding out the corresponding state values is called policy evaluation!
  • It is a fundamental problem in RL. It is the foundation to find better policies.
  • Therefore, it is important to understand how to solve the Bellman equation.

The Bellman equation in matrix-vector form is

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

  • The closed-form solution is:

    $$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$

  • The matrix $I - \gamma P_\pi$ is always invertible (since $\gamma < 1$ and $P_\pi$ is a stochastic matrix). See details in my book.

  • We still need to use numerical algorithms to calculate the matrix inverse.

  • Can we avoid the matrix inverse operation? Yes, as shown below.

  • An iterative solution is:

    $$v_{k+1} = r_\pi + \gamma P_\pi v_k$$

This algorithm leads to a sequence $\{v_0, v_1, v_2, \ldots\}$. We can show that

$$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi, \quad \text{as } k \to \infty.$$
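
The iteration is straightforward to implement. Below is a minimal sketch assuming NumPy, applied to the $r_\pi$ and $P_\pi$ of the stochastic-policy example above, so it should converge to approximately $[8.5, 10, 10, 10]$:

```python
import numpy as np

def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = r_pi + gamma * P_pi @ v_k until the update is below tol."""
    v = np.zeros_like(r_pi)
    for _ in range(max_iter):
        v_new = r_pi + gamma * P_pi @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# Stochastic-policy example from above (gamma = 0.9):
r_pi = np.array([-0.5, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0.5, 0.5, 0],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1],
                 [0, 0,   0,   1]], dtype=float)
print(iterative_policy_evaluation(r_pi, P_pi, 0.9))   # approximately [8.5, 10, 10, 10]
```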


Examples: $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = +1$, $\gamma = 0.9$

  • The following are two “good” policies and the state values. The two policies are different for the top two states in the fourth column.

(Figure: the two “good” policies and their state values.)

  • The following is a “bad” policy and the state values. The state values are less than those of the good policies.

(Figure: the “bad” policy and its state values.)

Action value

Action value

From state value to action value:

  • State value: the average return the agent can get starting from a state.

  • Action value: the average return the agent can get starting from a state and taking an action.

  • Q: Why do we care about action values?

  • A: Because we want to know which action is better. This point will be clearer in the following lectures.

  • A: We will frequently use action values.

Definition:

$$q_\pi(s,a) = \mathbb{E}[G_t|S_t = s, A_t = a]$$

  • $q_\pi(s,a)$ is a function of the state-action pair $(s,a)$
  • $q_\pi(s,a)$ depends on $\pi$

It follows from the properties of conditional expectation that

$$\underbrace{\mathbb{E}[G_t|S_t = s]}_{v_\pi(s)} = \sum_a \underbrace{\mathbb{E}[G_t|S_t = s, A_t = a]}_{q_\pi(s,a)} \pi(a|s)$$

Hence,

$$v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a)$$

Recall that the state value is given by

$$v_\pi(s) = \sum_a \pi(a|s)\left[\sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s')\right]$$

Comparing the expressions for $v_\pi(s)$ and $q_\pi(s,a)$, we have the action-value function as

$$q_\pi(s, a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s')$$

These are two sides of the same coin:

  • $v_\pi(s) = \sum_a \pi(a|s) q_\pi(s,a)$ shows how to obtain state values from action values.
  • $q_\pi(s, a) = \sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s')$ shows how to obtain action values from state values (both conversions are sketched in code below).
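
A minimal sketch of both conversions, using the same hypothetical dictionary-based model as in the earlier sketches:

```python
def action_value(s, a, v, p_r, p_s, gamma):
    """q_pi(s,a) = sum_r p(r|s,a)*r + gamma * sum_{s'} p(s'|s,a)*v_pi(s')."""
    immediate = sum(prob * r for r, prob in p_r[(s, a)].items())
    future = sum(prob * v[s_next] for s_next, prob in p_s[(s, a)].items())
    return immediate + gamma * future

def state_value(s, pi, q):
    """v_pi(s) = sum_a pi(a|s) * q_pi(s,a)."""
    return sum(pa * q[(s, a)] for a, pa in pi[s].items())
```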

Illustrative example for action value

(Figure: an example for computing the action values of state $s_1$.)

Write out the action values for state $s_1$.

$$q_\pi(s_1, a_2) = -1 + \gamma v_\pi(s_2).$$

Questions:

  • $q_\pi(s_1, a_1)$, $q_\pi(s_1, a_3)$, $q_\pi(s_1, a_4)$, $q_\pi(s_1, a_5)$ = ? Be careful!

For the other actions:

$$
\begin{aligned}
q_\pi(s_1, a_1) &= -1 + \gamma v_\pi(s_1), \\
q_\pi(s_1, a_3) &= 0 + \gamma v_\pi(s_3), \\
q_\pi(s_1, a_4) &= -1 + \gamma v_\pi(s_1), \\
q_\pi(s_1, a_5) &= 0 + \gamma v_\pi(s_1).
\end{aligned}
$$

Highlights:

  • Action value is important since we care about which action to take.
  • We can first calculate all the state values and then calculate the action values.
  • We can also directly calculate the action values with or without models.

Summary

Key concepts and results:

  • State value: $v_\pi(s) = \mathbb{E}[G_t|S_t = s]$

  • Action value: $q_\pi(s,a) = \mathbb{E}[G_t|S_t = s, A_t = a]$

  • The Bellman equation (elementwise form):

    $$v_\pi(s) = \sum_a \pi(a|s)\underbrace{\left[\sum_r p(r|s,a)r + \gamma \sum_{s'} p(s'|s,a)v_\pi(s')\right]}_{q_\pi(s,a)} = \sum_a \pi(a|s)q_\pi(s,a)$$

  • The Bellman equation (matrix-vector form):

    $$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

  • How to solve the Bellman equation: closed-form solution, iterative solution