Reinforcement Learning (4): Advantage Actor-Critic (A2C) and the Bellman Equation Derivation

0 Overview

  • The key point of Advantage Actor-Critic is how the Q function is computed.
  • The baseline $b$ is chosen to be the state-value function, approximated by a neural network $V_\pi(s; w)$.
  • The Q function is approximated via the Bellman equation: $Q_\pi(s, A) \approx r_t + \gamma V_\pi(s_{t+1})$.
  • The "advantage" is the difference $Q_\pi(s, A) - V_\pi(s_t)$.
  • Bellman equations:
    $Q_\pi(s_t, a_t) = E_{S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$
    $V_\pi(s_t) = E_{A_t, S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$

1 Core Formula

  • Policy gradient formula:
    $E_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s; \theta)}{\partial \theta}\,(Q_\pi(s, A) - b)\right]$
    where the baseline $b$ is taken to be $V_\pi(s_t)$.
  • The core formula is then (a code sketch of this gradient term follows the list):
    $E_{A \sim \pi}\left[\frac{\partial \ln \pi(A \mid s; \theta)}{\partial \theta}\,(Q_\pi(s, A) - V_\pi(s_t))\right]$ (Formula 1)
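
A minimal sketch of how one sample of Formula 1 can be formed with automatic differentiation, assuming a small discrete policy parameterized directly by logits; the parameter values and the advantage number below are illustrative assumptions, not from the original text.

```python
import torch

# Toy discrete policy for one fixed state s: theta are the policy parameters (illustrative values).
theta = torch.tensor([0.2, -0.1, 0.0], requires_grad=True)

probs = torch.softmax(theta, dim=-1)     # pi(. | s; theta)
a = torch.multinomial(probs, 1).item()   # sample A ~ pi(. | s; theta)
advantage = 0.7                          # assumed value of Q_pi(s, A) - V_pi(s_t)

# Surrogate whose gradient w.r.t. theta is  d ln pi(A|s;theta)/d theta * (Q - V),
# i.e. one Monte Carlo sample of Formula 1.
surrogate = torch.log(probs[a]) * advantage
surrogate.backward()
print(theta.grad)
```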

2 Two Neural Networks: Actor and Critic

  • Actor: the policy $\pi$ is represented by a neural network $\pi(a \mid s; \theta)$.
  • Critic: the state-value function $V$ is represented by a neural network $V_\pi(s; w)$ (both networks are sketched below).
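
A minimal PyTorch sketch of the two networks, assuming a discrete action space; the class names (ActorNet, CriticNet), layer sizes, and state/action dimensions are illustrative assumptions, not specified in the original.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the original text does not specify an architecture.
STATE_DIM, ACTION_DIM, HIDDEN_DIM = 4, 2, 64

class ActorNet(nn.Module):
    """Policy network pi(a | s; theta): outputs a probability for each discrete action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN_DIM), nn.ReLU(),
            nn.Linear(HIDDEN_DIM, ACTION_DIM), nn.Softmax(dim=-1),
        )

    def forward(self, s):
        return self.net(s)              # action probabilities

class CriticNet(nn.Module):
    """Value network v(s; w): outputs a scalar estimate of the state value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN_DIM), nn.ReLU(),
            nn.Linear(HIDDEN_DIM, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)  # scalar V(s)
```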

3 Model Training

Training objectives:

Actor network: maximize the state-value function $V$.
Critic network: minimize the error between the TD target and the value network's estimate at $s_t$.

Training procedure (one update step is sketched in code after these steps):

1 Observe one transition $(s_t, a_t, r_t, s_{t+1})$.
2 Compute the TD target $y_t = r_t + \gamma\, v(s_{t+1}; w)$, where $v$ is the value (critic) network.
3 Compute the TD error between $s_t$ and $s_{t+1}$: $\delta_t = v(s_t; w) - y_t$.
4 Update the policy network $\pi$:
$\theta = \theta - \beta\, \delta_t\, \frac{\partial \ln \pi(a_t \mid s_t; \theta)}{\partial \theta}$
5 Update the value network $v$:
$w = w - \alpha\, \delta_t\, \frac{\partial v(s_t; w)}{\partial w}$
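
A minimal sketch of one update step following steps 1-5 above, assuming the hypothetical ActorNet / CriticNet classes from Section 2, plain SGD optimizers, and illustrative hyperparameters; this is a sketch of the update rules, not a definitive implementation.

```python
import torch

gamma, alpha, beta = 0.99, 1e-3, 1e-3          # assumed hyperparameters
actor, critic = ActorNet(), CriticNet()
actor_opt = torch.optim.SGD(actor.parameters(), lr=beta)
critic_opt = torch.optim.SGD(critic.parameters(), lr=alpha)

def train_step(s_t, a_t, r_t, s_t1):
    """One A2C update from a single transition (s_t, a_t, r_t, s_{t+1})."""
    s_t = torch.as_tensor(s_t, dtype=torch.float32)
    s_t1 = torch.as_tensor(s_t1, dtype=torch.float32)

    # Step 2: TD target y_t = r_t + gamma * v(s_{t+1}; w)  (no gradient through the target)
    with torch.no_grad():
        y_t = r_t + gamma * critic(s_t1)

    # Step 3: TD error delta_t = v(s_t; w) - y_t
    v_st = critic(s_t)
    delta_t = (v_st - y_t).detach()

    # Step 4: policy update  theta <- theta - beta * delta_t * d ln pi(a_t|s_t;theta) / d theta
    log_prob = torch.log(actor(s_t)[a_t])
    actor_loss = delta_t * log_prob            # gradient equals delta_t * grad log pi
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5: value update  w <- w - alpha * delta_t * d v(s_t; w) / d w
    critic_loss = delta_t * v_st               # gradient equals delta_t * grad v(s_t; w)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```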

4 Bellman Equation Derivation

Basic definitions:

  • Return (cumulative discounted reward): $U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \cdots$
  • Action-value function: $Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$
  • State-value function: $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$

Derivation:

  • $Q_\pi(s_t, a_t) = E_{S_{t+1}, A_{t+1}}[R_t + \gamma Q_\pi(S_{t+1}, A_{t+1})]$
    Moving the expectation over $A_{t+1}$ inside the bracket:
  • $Q_\pi(s_t, a_t) = E_{S_{t+1}}[R_t + \gamma\, E_{A_{t+1}}[Q_\pi(S_{t+1}, A_{t+1})]]$

where

  • $E_{A_{t+1}}[Q_\pi(S_{t+1}, A_{t+1})] = V_\pi(S_{t+1})$
  • $Q_\pi(s_t, a_t) = E_{S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$ (Formula 2)

From the definition of the state-value function $V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$, substituting Formula 2 gives
$V_\pi(s_t) = E_{A_t}\big[E_{S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]\big]$

  • $V_\pi(s_t) = E_{A_t, S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$ (Formula 3)

Core formulas:
$Q_\pi(s_t, a_t) = E_{S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$
$V_\pi(s_t) = E_{A_t, S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$
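
As a quick numerical sanity check of these formulas, the sketch below builds a tiny made-up MDP with a single action (so the expectation over $A_t$ is trivial and Formulas 2 and 3 coincide), solves for $V_\pi$, and verifies the Bellman identity; all numbers are assumptions for illustration.

```python
import numpy as np

gamma = 0.9
# Tiny 2-state, 1-action MDP (illustrative numbers):
# P[s, s'] = transition probabilities, R[s] = expected reward.
P = np.array([[0.8, 0.2],
              [0.1, 0.9]])
R = np.array([1.0, 0.0])

# Solve the linear system V = R + gamma * P V  =>  (I - gamma P) V = R
V = np.linalg.solve(np.eye(2) - gamma * P, R)

# Check Formula 3: V(s) = E_{S'}[R + gamma * V(S')]
lhs = V
rhs = R + gamma * P @ V
print(np.allclose(lhs, rhs))   # True
```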

5 Monte Carlo Approximation

  • $Q_\pi(s_t, a_t) = E_{S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$ (Formula 2)
  • $V_\pi(s_t) = E_{A_t, S_{t+1}}[R_t + \gamma V_\pi(S_{t+1})]$ (Formula 3)

Monte Carlo approximation of these formulas (a small numeric check follows the list):

  • $Q_\pi(s_t, a_t) \approx r_t + \gamma V_\pi(s_{t+1})$
  • $V_\pi(s_t) \approx r_t + \gamma V_\pi(s_{t+1})$
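
A tiny numeric check of this approximation; the reward, discount factor, and critic estimates below are made-up illustrative values.

```python
# Illustrative numbers only: r_t, gamma, and the critic outputs are assumed.
r_t, gamma = 1.0, 0.9
v_st, v_st1 = 3.0, 2.5          # critic estimates v(s_t; w) and v(s_{t+1}; w)

y_t = r_t + gamma * v_st1       # Monte Carlo / TD approximation of Q_pi(s_t, a_t) and V_pi(s_t)
delta_t = v_st - y_t            # TD error used in both updates of Section 3
advantage = y_t - v_st          # Q_pi(s_t, a_t) - V_pi(s_t), i.e. -delta_t

print(y_t, delta_t, advantage)  # 3.25 -0.25 0.25
```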