Lecture 2: State Value and Bellman Equation
Outline
In this lecture:
A core concept: state value
A fundamental tool: Bellman equation
Motivating examples
Motivating example 1: Why is return important?
In summary, starting from $s_1$,
$$\text{return}_1 > \text{return}_3 > \text{return}_2$$
The above inequality suggests that the first policy is the best and the second policy is the worst, which matches our intuition. Calculating returns is therefore important for evaluating policies.
Motivating example 2: How to calculate return?
While return is important, how to calculate it?
Method 1: by definition
Let $v_i$ denote the return obtained starting from $s_i$ $(i = 1, 2, 3, 4)$:
$$\begin{aligned}
v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \dots \\
v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \dots \\
v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \dots \\
v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \dots
\end{aligned}$$
Method 2:
The returns rely on each other. Bootstrapping!
$$\begin{aligned}
v_1 &= r_1 + \gamma(r_2 + \gamma r_3 + \dots) = r_1 + \gamma v_2 \\
v_2 &= r_2 + \gamma(r_3 + \gamma r_4 + \dots) = r_2 + \gamma v_3 \\
v_3 &= r_3 + \gamma(r_4 + \gamma r_1 + \dots) = r_3 + \gamma v_4 \\
v_4 &= r_4 + \gamma(r_1 + \gamma r_2 + \dots) = r_4 + \gamma v_1
\end{aligned}$$
How to solve these equations? Write them in the following matrix-vector form:
$$\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix} =
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} +
\begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix} =
\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} +
\gamma \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}$$
which can be rewritten as:
$$\mathbf{v} = \mathbf{r} + \gamma \mathbf{P} \mathbf{v}$$
How to solve this equation? Rearrange it into the following closed form:
$$\mathbf{v} = \mathbf{r} + \gamma \mathbf{P} \mathbf{v} \;\Longrightarrow\; \mathbf{v} = (I - \gamma \mathbf{P})^{-1} \mathbf{r}$$
This is the Bellman equation (for this specific deterministic problem)!
Though simple, it demonstrates the core idea: the value of one state relies on the values of other states.
The matrix-vector form makes it clearer how to solve for the state values.
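As a quick numerical check, the linear system above can be solved directly. Below is a minimal Python/NumPy sketch; the reward values $r_1, \dots, r_4$ are placeholders chosen only for illustration (the text above leaves them symbolic), and $\gamma = 0.9$ is assumed.

```python
import numpy as np

gamma = 0.9                                    # assumed discount rate
r = np.array([1.0, 2.0, 3.0, 4.0])             # placeholder rewards r1..r4
P = np.array([[0, 1, 0, 0],                    # cyclic structure from v1 = r1 + gamma*v2, ...,
              [0, 0, 1, 0],                    # ..., v4 = r4 + gamma*v1
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)

# Closed-form solution of v = r + gamma * P v.
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)
print(np.allclose(v, r + gamma * P @ v))       # True: v satisfies the original equations
```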
State Value
Some notation
Consider the following single-step process:
$$S_t \stackrel{A_t}{\rightarrow} R_{t+1},\ S_{t+1}$$
• $t, t+1$: discrete time instants
• $S_t$: the state at time $t$
• $A_t$: the action taken in state $S_t$
• $R_{t+1}$: the reward obtained after taking $A_t$
• $S_{t+1}$: the state transitioned to after taking $A_t$
Note that $S_t$, $A_t$, $R_{t+1}$ are all random variables.
This step is governed by the following probability distributions:
• $S_t \rightarrow A_t$ is governed by $\pi(A_t = a \mid S_t = s)$
• $S_t, A_t \rightarrow R_{t+1}$ is governed by $p(R_{t+1} = r \mid S_t = s, A_t = a)$
• $S_t, A_t \rightarrow S_{t+1}$ is governed by $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
For now, we assume the model (i.e., these probability distributions) is known!
Consider the following multi-step trajectory:
$$S_t \stackrel{A_t}{\rightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\rightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\rightarrow} R_{t+3}, \ldots$$
The discounted return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$$
• $\gamma \in (0,1)$ is a discount rate.
• $G_t$ is also a random variable, since $R_{t+1}, R_{t+2}, \ldots$ are random variables.
State Value
The expectation (also called the expected value or mean) of $G_t$ is defined as the state-value function, or simply the state value:
$$v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s]$$
Recall the returns obtained from $s_1$ for the three examples:
$$\begin{aligned}
v_{\pi_1}(s_1) &= 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1 - \gamma} \\
v_{\pi_2}(s_1) &= -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = -1 + \gamma(1 + \gamma + \gamma^2 + \cdots) = -1 + \frac{\gamma}{1 - \gamma} \\
v_{\pi_3}(s_1) &= 0.5\left(-1 + \frac{\gamma}{1 - \gamma}\right) + 0.5\left(\frac{\gamma}{1 - \gamma}\right) = -0.5 + \frac{\gamma}{1 - \gamma}
\end{aligned}$$
Clearly, $\pi_1$ is the best policy, because its state value $v_{\pi_1}(s_1)$ is the largest.
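To make the comparison concrete, here is a tiny sketch evaluating the three closed-form expressions with $\gamma = 0.9$ (the value used later in this lecture):

```python
gamma = 0.9

v_pi1 = gamma / (1 - gamma)           # approx. 9.0
v_pi2 = -1 + gamma / (1 - gamma)      # approx. 8.0
v_pi3 = -0.5 + gamma / (1 - gamma)    # approx. 8.5

print(v_pi1, v_pi2, v_pi3)            # v_pi1 > v_pi3 > v_pi2, matching the return comparison
```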
Bellman Equation: Derivation
Bellman Equation
While the state value is important, how to calculate it? The answer lies in the Bellman equation.
In a word, the Bellman equation describes the relationship among the values of all states.
Next, we derive the Bellman equation.
There is some math.
We already have the intuition.
Deriving the Bellman Equation
Consider a random trajectory:
$$S_t \stackrel{A_t}{\rightarrow} R_{t+1}, S_{t+1} \stackrel{A_{t+1}}{\rightarrow} R_{t+2}, S_{t+2} \stackrel{A_{t+2}}{\rightarrow} R_{t+3}, \ldots$$
The return $G_t$ can be written as
$$\begin{aligned}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \\
&= R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \ldots) \\
&= R_{t+1} + \gamma G_{t+1}
\end{aligned}$$
Then, it follows from the definition of the state value that
$$\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s]
\end{aligned}$$
Next, we calculate the two terms separately.
First, calculate the first term $\mathbb{E}[R_{t+1} \mid S_t = s]$:
$$\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s] &= \sum_{a} \pi(a|s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{a} \pi(a|s) \sum_{r} p(r|s,a)\, r
\end{aligned}$$
Note that
This is the mean of immediate rewards.
In fact, this derivation applies the expectation formula for a discrete random variable twice.
Second, calculate the second term $\mathbb{E}[G_{t+1} \mid S_t = s]$:
$$\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s'|s) \\
&= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s'|s) \\
&= \sum_{s'} v_\pi(s')\, p(s'|s) \\
&= \sum_{s'} v_\pi(s') \sum_{a} p(s'|s,a)\, \pi(a|s)
\end{aligned}$$
Note that
This is the mean of future rewards.
$\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.
Therefore, we have
$$v_\pi(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right], \quad \forall s \in S.$$
Highlights:
The above equation is called the Bellman equation, which characterizes the relationship among the state-value functions of different states.
It consists of two terms: the immediate reward term and the future reward term.
A set of equations: every state has an equation like this!!!
Supplement 1: Understanding the derivation of the future-reward term in the Bellman equation
This combines the law of total expectation, the Markov property of the MDP, and the definition of the state-value function. We break the derivation into three steps.
1. Identify the core object: the future cumulative return $G_{t+1}$
$G_{t+1}$ is the discounted cumulative return starting from time $t+1$, defined as
$$G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots$$
The state-value function $v_\pi(s')$ is defined as the expected discounted cumulative return obtained by starting from state $s'$ and following policy $\pi$, i.e.,
$$v_\pi(s') = \mathbb{E}_\pi\left[ G_{t+1} \mid S_{t+1} = s' \right]$$
2. Step 1: take the expectation over the action $A_t$ (split by the policy $\pi$)
We need to compute $\mathbb{E}\left[ G_{t+1} \mid S_t = s \right]$, the expectation of the future cumulative return $G_{t+1}$ given that the state at time $t$ is $s$.
Since the action $A_t$ chosen at time $t$ is random (determined by the policy $\pi$, i.e., $\pi(a \mid s) = P(A_t = a \mid S_t = s)$), the law of total expectation (summing over all possible actions $a$) gives
$$\mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \cdot \mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right]$$
Interpretation: the expectation of $G_{t+1}$ in state $s$ equals, for each action $a$, the conditional expectation of $G_{t+1}$ given that $a$ is taken, weighted by the probability of taking $a$, summed over all $a$.
3. Step 2: take the expectation over the next state $S_{t+1}$ (split by the transition probability $p$)
Now we need to compute $\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right]$, the expectation of $G_{t+1}$ given state $s$ and action $a$.
After executing action $a$, the next state $S_{t+1}$ is random, determined by the state transition probability $p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$. By the law of total expectation (summing over all possible next states $s'$),
$$\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s'} p(s' \mid s, a) \cdot \mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]$$
Moreover, $\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s' \right]$ is exactly $v_\pi(s')$: once the next state is known to be $s'$, the expected cumulative return from $s'$ onward is the state value $v_\pi(s')$ (this is the definition of the state-value function). Therefore,
$$\mathbb{E}\left[ G_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$
4. Step 3: combine the two steps and attach the discount factor $\gamma$
Substituting the result of Step 2 into Step 1 gives
$$\mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$
The term appearing in the Bellman equation is $\gamma\, \mathbb{E}\left[ G_{t+1} \mid S_t = s \right]$ (the future return is multiplied by the discount factor $\gamma$), so finally
$$\gamma\, \mathbb{E}\left[ G_{t+1} \mid S_t = s \right] = \gamma \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \cdot v_\pi(s')$$
Summary of the core logic
Using the law of total expectation, we first condition on the action $a$ according to the policy $\pi$, and then condition on the next state $s'$ according to the transition probability $p$. This turns the expectation of the future cumulative return into an expectation over the next state's value function, combined with the discount factor $\gamma$. It reflects the Markov property of the MDP: the future return depends only on the next state, not on the history.
An illustrative example
Write out the Bellman equation according to the general expression:
$$v_\pi(s) = \sum_{a} \pi(a|s) \left[ \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$
This example is simple because the policy is deterministic.
First, consider the state value of $s_1$:
$\pi(a = a_3 \mid s_1) = 1$ and $\pi(a \neq a_3 \mid s_1) = 0$.
$p(s' = s_3 \mid s_1, a_3) = 1$ and $p(s' \neq s_3 \mid s_1, a_3) = 0$.
$p(r = 0 \mid s_1, a_3) = 1$ and $p(r \neq 0 \mid s_1, a_3) = 0$.
Substituting them into the Bellman equation gives
$$v_\pi(s_1) = 0 + \gamma v_\pi(s_3)$$
Similarly, it can be obtained that
$$\begin{aligned}
v_\pi(s_1) &= 0 + \gamma v_\pi(s_3), \\
v_\pi(s_2) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_3) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_4) &= 1 + \gamma v_\pi(s_4).
\end{aligned}$$
How to solve them?
Solve the above equations one by one from the last to the first:
$$\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_3) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_2) &= \frac{1}{1 - \gamma}, \\
v_\pi(s_1) &= \frac{\gamma}{1 - \gamma}.
\end{aligned}$$
If $\gamma = 0.9$, then
$$\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_3) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_2) &= \frac{1}{1 - 0.9} = 10, \\
v_\pi(s_1) &= \frac{0.9}{1 - 0.9} = 9.
\end{aligned}$$
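These values can also be checked against the definition $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$. The sketch below reads the policy-induced transitions and rewards off the equations above ($s_1 \to s_3$ with reward 0; $s_2, s_3, s_4 \to s_4$ with reward 1) and sums a truncated discounted return from each state; with $\gamma = 0.9$ the truncation error after 500 steps is negligible.

```python
gamma = 0.9
next_state = {"s1": "s3", "s2": "s4", "s3": "s4", "s4": "s4"}   # transitions under this policy
reward = {"s1": 0.0, "s2": 1.0, "s3": 1.0, "s4": 1.0}           # immediate reward when leaving each state

def truncated_return(s, horizon=500):
    """Discounted return G_t obtained by following the policy from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        g += discount * reward[s]
        discount *= gamma
        s = next_state[s]
    return g

print({s: round(truncated_return(s), 3) for s in ["s1", "s2", "s3", "s4"]})
# {'s1': 9.0, 's2': 10.0, 's3': 10.0, 's4': 10.0}
```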
What should we do after we have calculated the state values? (Compute action values and improve the policy, as discussed later.)
Exercise: consider instead a policy that, at $s_1$, takes each of the two actions with probability 0.5 (moving to $s_3$ with reward $0$, or to $s_2$ with reward $-1$), the other states being unchanged. Write out and solve the Bellman equations.
Answer:
$$\begin{aligned}
v_\pi(s_1) &= 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)], \\
v_\pi(s_2) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_3) &= 1 + \gamma v_\pi(s_4), \\
v_\pi(s_4) &= 1 + \gamma v_\pi(s_4).
\end{aligned}$$
Solve the above equations one by one from the last to the first.
$$\begin{aligned}
v_\pi(s_4) &= \frac{1}{1 - \gamma}, \quad v_\pi(s_3) = \frac{1}{1 - \gamma}, \quad v_\pi(s_2) = \frac{1}{1 - \gamma}, \\
v_\pi(s_1) &= 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)] = -0.5 + \frac{\gamma}{1 - \gamma}.
\end{aligned}$$
Substituting $\gamma = 0.9$ yields
$$v_\pi(s_4) = 10, \quad v_\pi(s_3) = 10, \quad v_\pi(s_2) = 10, \quad v_\pi(s_1) = -0.5 + 9 = 8.5.$$
Compare with the previous policy. This one is worse.
Why consider the matrix-vector form? Because we need to solve the state values from it!
One unknown relies on another unknown. How to solve the unknowns?
$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$
Elementwise form: the above elementwise form is valid for every state $s \in S$. That means there are $|S|$ equations like this!
Matrix-vector form: if we put all the equations together, we have a set of linear equations, which can be concisely written in a matrix-vector form. The matrix-vector form is very elegant and important.
Recall that:
$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$
Rewrite the Bellman equation as
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s')$$
where
$$r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)\, r, \qquad p_\pi(s'|s) \triangleq \sum_a \pi(a|s)\, p(s'|s,a)$$
Suppose the states can be indexed as $s_i$ $(i = 1, \ldots, n)$.
For state $s_i$, the Bellman equation is
$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j} p_\pi(s_j|s_i)\, v_\pi(s_j)$$
Putting these equations for all the states together, we can rewrite them in the matrix-vector form
$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
where
• $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^T \in \mathbb{R}^n$
• $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^T \in \mathbb{R}^n$
• $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j|s_i)$, is the state transition matrix
If there are four states, $v_\pi = r_\pi + \gamma P_\pi v_\pi$ can be written out as
$$\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}
=
\underbrace{\begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix}}_{r_\pi}
+ \gamma
\underbrace{\begin{bmatrix}
p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\
p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\
p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\
p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4)
\end{bmatrix}}_{P_\pi}
\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}.$$
For the specific example with the deterministic policy:
$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix}
+ \gamma
\begin{bmatrix}
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$
For the specific example with the stochastic policy at $s_1$:
$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}
=
\begin{bmatrix} 0.5(0) + 0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix}
+ \gamma
\begin{bmatrix}
0 & 0.5 & 0.5 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$
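A minimal sketch that builds $r_\pi$ and $P_\pi$ for the two matrix-vector equations above and solves each linear system directly, assuming $\gamma = 0.9$ as before:

```python
import numpy as np

gamma = 0.9

def solve_bellman(r_pi, P_pi):
    """Closed-form solution of v = r_pi + gamma * P_pi v."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Deterministic policy (first matrix-vector equation above).
r_det = np.array([0.0, 1.0, 1.0, 1.0])
P_det = np.array([[0, 0, 1, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 1],
                  [0, 0, 0, 1]], dtype=float)

# Stochastic policy at s1 (second matrix-vector equation above).
r_sto = np.array([0.5 * 0 + 0.5 * (-1), 1.0, 1.0, 1.0])
P_sto = np.array([[0, 0.5, 0.5, 0],
                  [0, 0,   0,   1],
                  [0, 0,   0,   1],
                  [0, 0,   0,   1]], dtype=float)

print(solve_bellman(r_det, P_det))   # approx. [9.  10. 10. 10.]
print(solve_bellman(r_sto, P_sto))   # approx. [8.5 10. 10. 10.]
```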
Supplement 2: Understanding the merging of summations in the future-reward term
To understand how the summations are merged in this step, the key is the **policy-weighted average over actions $a$**; this operation reduces the detailed expression that depends on the action $a$ to an expectation under the policy $\pi$. We break it down step by step.
Step 1: define the expected reward under policy $\pi$, $r_\pi(s)$
The immediate-reward part of the original Bellman equation is
$$\sum_a \pi(a|s) \sum_r p(r|s,a)\, r$$
This expression means: in state $s$, weight every possible action $a$ by the policy $\pi$, and for each action $a$ take the expectation of all possible rewards $r$ according to $p(r|s,a)$. The result is the expected immediate reward of state $s$ under policy $\pi$.
To simplify the notation, we define this quantity as $r_\pi(s)$:
$$r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)\, r$$
Step 2: define the state transition probability under policy $\pi$, $p_\pi(s'|s)$
The future-reward part of the original Bellman equation (the term containing $v_\pi(s')$) is
$$\sum_a \pi(a|s) \left[ \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$
Sub-step 1: pull out the constant $\gamma$
Since $\gamma$ is a discount factor independent of both $a$ and $s'$, it can be pulled out of the inner summation:
$$\gamma \sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, v_\pi(s')$$
Sub-step 2: exchange the order of summation and merge the sum over $a$
Because summation is commutative (a basic property of linear operations), we can exchange the sum over $a$ with the sum over $s'$:
$$\gamma \sum_{s'} \left( \sum_a \pi(a|s)\, p(s'|s,a) \right) v_\pi(s')$$
Here the bracketed term $\sum_a \pi(a|s)\, p(s'|s,a)$ means: in state $s$, weight every possible action $a$ by the policy $\pi$ and sum the probabilities $p(s'|s,a)$ of transitioning to state $s'$ under each action. The result is the probability of transitioning from $s$ to $s'$ under policy $\pi$.
To simplify the notation, we define this quantity as $p_\pi(s'|s)$:
$$p_\pi(s'|s) \triangleq \sum_a \pi(a|s)\, p(s'|s,a)$$
Step 3: combine into the simplified Bellman equation
Combining the immediate-reward part $r_\pi(s)$ with the future-reward part $\gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s')$, the original Bellman equation simplifies to
$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s')$$
Summary of the core logic
By taking a policy-weighted expectation over actions $a$, the detailed Bellman equation that depends on individual actions is converted into the form of the Markov chain induced by policy $\pi$ (both the transition probabilities and the rewards are averaged under $\pi$). This makes the equation more concise and closer to the Bellman equation of an ordinary Markov chain.
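As a numerical sanity check of this summation swap, the sketch below uses a randomly generated model and policy (not an example from the lecture) and verifies that the two ways of writing the future-reward term agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

p = rng.random((n_states, n_actions, n_states))   # p[s, a, s'] = p(s'|s,a)
p /= p.sum(axis=2, keepdims=True)                 # normalize over s'
pi = rng.random((n_states, n_actions))            # pi[s, a] = pi(a|s)
pi /= pi.sum(axis=1, keepdims=True)               # normalize over a
v = rng.random(n_states)                          # arbitrary value vector

s = 0
# Left: sum_a pi(a|s) * gamma * sum_{s'} p(s'|s,a) v(s')
lhs = sum(pi[s, a] * gamma * sum(p[s, a, sp] * v[sp] for sp in range(n_states))
          for a in range(n_actions))
# Right: gamma * sum_{s'} p_pi(s'|s) v(s')  with  p_pi(s'|s) = sum_a pi(a|s) p(s'|s,a)
p_pi = pi[s] @ p[s]
rhs = gamma * p_pi @ v

print(np.isclose(lhs, rhs))   # True
```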
Bellman equation: Solve the state values
Solve state values
Why solve state values?
Given a policy, finding the corresponding state values is called policy evaluation!
It is a fundamental problem in RL and the foundation for finding better policies.
Therefore, it is important to understand how to solve the Bellman equation.
The Bellman equation in matrix-vector form is
$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
The closed-form solution is:
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$
The matrix $I - \gamma P_\pi$ is always invertible. See details in my book.
We still need to use numerical algorithms to calculate the matrix inverse.
Can we avoid the matrix inverse operation? Yes, as shown below.
An iterative solution is:
$$v_{k+1} = r_\pi + \gamma P_\pi v_k$$
This algorithm leads to a sequence $\{v_0, v_1, v_2, \ldots\}$. We can show that
$$v_k \to v_\pi = (I - \gamma P_\pi)^{-1} r_\pi, \quad k \to \infty$$
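A minimal sketch of the iterative solution, reusing the deterministic-policy example from earlier (with an assumed stopping tolerance of 1e-10); it converges to the same values as the closed-form solution:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1],
                 [0, 0, 0, 1]], dtype=float)

# Iterate v_{k+1} = r_pi + gamma * P_pi v_k from an arbitrary v_0.
v = np.zeros(4)
for k in range(10_000):
    v_new = r_pi + gamma * P_pi @ v
    if np.max(np.abs(v_new - v)) < 1e-10:    # stop when the update is tiny
        break
    v = v_new

print(v_new)                                              # approx. [9. 10. 10. 10.]
print(np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi))    # closed form gives the same values
```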
Examples: $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = +1$, $\gamma = 0.9$
The following are two "good" policies and their state values. The two policies differ only at the top two states in the fourth column.
The following is a "bad" policy and its state values. Its state values are smaller than those of the good policies.
Action value
Action value
From state value to action value:
State value: the average return the agent can get starting from a state.
Action value: the average return the agent can get starting from a state and taking an action.
Q: Why do we care about action values?
A: Because we want to know which action is better. This point will be clearer in the following lectures.
A: We will frequently use action values.
Definition:
$$q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$$
$q_\pi(s,a)$ is a function of the state-action pair $(s,a)$.
$q_\pi(s,a)$ depends on $\pi$.
It follows from the properties of conditional expectation that
$$\underbrace{\mathbb{E}[G_t \mid S_t = s]}_{v_\pi(s)} = \sum_a \underbrace{\mathbb{E}[G_t \mid S_t = s, A_t = a]}_{q_\pi(s,a)} \pi(a|s)$$
Hence,
$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$$
Recall that the state value is given by
$$v_\pi(s) = \sum_a \pi(a|s)\left[\sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s')\right]$$
Comparing the two expressions for $v_\pi(s)$, we can identify the action value as
$$q_\pi(s, a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s')$$
These are two sides of the same coin:
$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a)$ shows how to obtain state values from action values.
$q_\pi(s, a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s')$ shows how to obtain action values from state values.
Illustrative example for action value
Write out the action values for state $s_1$.
$$q_\pi(s_1, a_2) = -1 + \gamma v_\pi(s_2)$$
Questions:
$q_\pi(s_1, a_1)$, $q_\pi(s_1, a_3)$, $q_\pi(s_1, a_4)$, $q_\pi(s_1, a_5)$ = ? Be careful!
For the other actions:
$$\begin{aligned}
q_\pi(s_1, a_1) &= -1 + \gamma v_\pi(s_1), \\
q_\pi(s_1, a_3) &= 0 + \gamma v_\pi(s_3), \\
q_\pi(s_1, a_4) &= -1 + \gamma v_\pi(s_1), \\
q_\pi(s_1, a_5) &= 0 + \gamma v_\pi(s_1).
\end{aligned}$$
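To make these concrete, the sketch below plugs in state values for this grid world; as an assumption, it reuses the values of the stochastic-policy example above ($v_\pi(s_1)=8.5$, $v_\pi(s_2)=v_\pi(s_3)=10$, $\gamma=0.9$), but any evaluated policy's values could be substituted. The reward and next state of each action are read off the equations above.

```python
gamma = 0.9

# State values to plug in (assumed: the stochastic-policy example evaluated earlier).
v = {"s1": 8.5, "s2": 10.0, "s3": 10.0, "s4": 10.0}

# (immediate reward, next state) for each action at s1, read off the equations above.
model_s1 = {
    "a1": (-1, "s1"),   # reward -1, stays at s1
    "a2": (-1, "s2"),
    "a3": ( 0, "s3"),
    "a4": (-1, "s1"),
    "a5": ( 0, "s1"),   # stays at s1 with reward 0
}

q_s1 = {a: r + gamma * v[s_next] for a, (r, s_next) in model_s1.items()}
print(q_s1)
# {'a1': 6.65, 'a2': 8.0, 'a3': 9.0, 'a4': 6.65, 'a5': 7.65} -> a3 has the largest action value
```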
Highlights:
Action value is important since we care about which action to take.
We can first calculate all the state values and then calculate the action values.
We can also directly calculate the action values with or without models.
Summary
Key concepts and results:
State value: $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
Action value: $q_\pi(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$
The Bellman equation (elementwise form):
$$v_\pi(s) = \sum_a \pi(a|s)\underbrace{\left[\sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s')\right]}_{q_\pi(s,a)} = \sum_a \pi(a|s)\, q_\pi(s,a)$$
The Bellman equation (matrix-vector form):
$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
How to solve the Bellman equation: closed-form solution, iterative solution