Reinforcement Learning (4) - Advantage Actor-Critic (A2C) and a Derivation of the Bellman Equation
0 Overview
- The key change in Advantage Actor-Critic is how the Q function is computed.
- The baseline b is chosen to be the state-value function, which is approximated by a neural network $V_\pi(s;w)$.
- The Q function is approximated via the Bellman equation: $Q_\pi(s,A)\approx r_t+\gamma V_\pi(s_{t+1})$.
- The "advantage" is the term $Q_\pi(s,A)-V_\pi(s_t)$.
- Bellman equations:
$Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma V_\pi(S_{t+1})]$
$V_\pi(s_t)=E_{A_t,S_{t+1}}[R_t+\gamma V_\pi(S_{t+1})]$
1 Core formula
- Policy gradient:
$E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial \theta}\,(Q_\pi(s,A)-b)\right]$
- With the baseline b taken to be $V_\pi(s_t)$, the core formula becomes
$E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial \theta}\,(Q_\pi(s,A)-V_\pi(s_t))\right]$ (Formula 1)
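A step the original leaves implicit (a standard fact, added here for completeness): any baseline b that does not depend on the action A leaves the policy gradient unbiased, which is what justifies choosing $b=V_\pi(s_t)$:

$E_{A\sim\pi}\left[\frac{\partial \ln\pi(A|s;\theta)}{\partial \theta}\,b\right]=b\sum_a \pi(a|s;\theta)\frac{\partial \ln\pi(a|s;\theta)}{\partial \theta}=b\sum_a \frac{\partial \pi(a|s;\theta)}{\partial \theta}=b\,\frac{\partial}{\partial \theta}\sum_a \pi(a|s;\theta)=b\,\frac{\partial 1}{\partial \theta}=0$

So subtracting the baseline only reduces the variance of the gradient estimate, not its expectation.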
2 Two neural networks: actor and critic
- Actor: the policy $\pi$ is represented by a neural network $\pi(a|s;\theta)$.
- Critic: the state-value function V is represented by a neural network $V_\pi(s;w)$.
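To make the two networks concrete, here is a minimal PyTorch sketch (my own illustration, not from the original post); the hidden size, `state_dim`, and `n_actions` are hypothetical and a discrete action space is assumed.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi(a|s; theta): outputs a probability for each discrete action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)  # action probabilities

class Critic(nn.Module):
    """State-value network v(s; w): outputs a scalar estimate of V_pi(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)  # scalar value per state
```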
3 Model training
Training objectives:
- Actor network: maximize the state-value function V.
- Critic network: minimize the error between the TD target and the value network's estimate for $s_t$.
Training steps (one full update is sketched in code after step 5):
1 Observe a transition $(s_t,a_t,r_t,s_{t+1})$.
2 Compute the TD target $y_t=r_t+\gamma\, v(s_{t+1};w)$, where v is the value network.
3 Compute the TD error between $s_t$ and $s_{t+1}$: $\delta_t=v(s_t;w)-y_t$.
4 Update the policy network $\pi$: $\theta \leftarrow \theta-\beta\,\delta_t\,\frac{\partial \ln\pi(a_t|s_t;\theta)}{\partial \theta}$.
5 Update the value network v: $w \leftarrow w-\alpha\,\delta_t\,\frac{\partial v(s_t;w)}{\partial w}$.
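The five steps above can be collected into one update function. The sketch below assumes the hypothetical `Actor` and `Critic` classes from section 2 and a discrete action space; names such as `actor_opt`, `critic_opt`, and `gamma` are my own choices, so treat this as an illustration of the update rules rather than a reference implementation.

```python
import torch

def a2c_update(actor, critic, actor_opt, critic_opt, transition, gamma=0.99):
    """One A2C update from a single transition (s_t, a_t, r_t, s_{t+1}).

    s and s_next are 1-D float tensors, a is an int action index, r is a float.
    """
    s, a, r, s_next = transition

    # Step 2: TD target  y_t = r_t + gamma * v(s_{t+1}; w)  (no gradient through the target)
    with torch.no_grad():
        y = r + gamma * critic(s_next)

    # Step 3: TD error  delta_t = v(s_t; w) - y_t
    v_s = critic(s)
    delta = v_s - y

    # Step 4: actor update  theta <- theta - beta * delta_t * d ln pi(a_t|s_t; theta) / d theta
    log_prob = torch.log(actor(s)[a])
    actor_loss = delta.detach() * log_prob   # gradient equals delta_t * d ln pi / d theta
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Step 5: critic update  w <- w - alpha * delta_t * d v(s_t; w) / d w
    # (minimizing 0.5 * delta^2 yields exactly this gradient)
    critic_loss = 0.5 * delta.pow(2)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```

Here the learning rates $\beta$ and $\alpha$ in steps 4 and 5 correspond to the learning rates configured in `actor_opt` and `critic_opt`.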
4 Derivation of the Bellman equation
Basic definitions:
- Return (cumulative discounted reward): $U_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\gamma^3 R_{t+3}+\cdots$
- Action-value function: $Q_\pi(s_t,a_t)=E[U_t\mid S_t=s_t,A_t=a_t]$
- State-value function: $V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$
Derivation:
- Since $U_t=R_t+\gamma U_{t+1}$, taking the conditional expectation of both sides of the definition of $Q_\pi$ gives
$Q_\pi(s_t,a_t)=E_{S_{t+1},A_{t+1}}[R_t+\gamma\, Q_\pi(S_{t+1},A_{t+1})]$
Moving the expectation over $A_{t+1}$ inside the bracket:
- $Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma\, E_{A_{t+1}}[Q_\pi(S_{t+1},A_{t+1})]]$
where
- $E_{A_{t+1}}[Q_\pi(S_{t+1},A_{t+1})]=V_\pi(S_{t+1})$
Therefore
- $Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$ (Formula 2)
By the definition of the state-value function:
$V_\pi(s_t)=E_A[Q_\pi(s_t,A)]$
Substituting Formula 2:
$V_\pi(s_t)=E_{A_t}[E_{S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]]$
- $V_\pi(s_t)=E_{A_t,S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$ (Formula 3)
Core formulas:
$Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$
$V_\pi(s_t)=E_{A_t,S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$
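As a quick numerical sanity check (a hypothetical two-state, two-action MDP with made-up numbers, not from the original post), the sketch below solves the linear Bellman system for $V_\pi$ and then verifies that Formulas 2 and 3 hold:

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions (illustrative numbers only).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                   # R[s, a]: expected reward
              [2.0, -1.0]])
pi = np.array([[0.6, 0.4],                  # pi[s, a]: action probabilities of the policy
               [0.3, 0.7]])

# Solve V = r_pi + gamma * P_pi V exactly (the Bellman equation as a linear system).
r_pi = (pi * R).sum(axis=1)                  # E_A[R | s]
P_pi = np.einsum('sa,sat->st', pi, P)        # state-to-state transition matrix under pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Formula 2: Q(s, a) = E_{S'}[R_t + gamma * V(S')]
Q = R + gamma * (P @ V)

# Formula 3: V(s) = E_{A, S'}[R_t + gamma * V(S')] = sum_a pi(a|s) * Q(s, a)
V_check = (pi * Q).sum(axis=1)
print(np.allclose(V, V_check))   # True: both forms of the Bellman equation agree
```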
5 Monte Carlo approximation
- $Q_\pi(s_t,a_t)=E_{S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$ (Formula 2)
- $V_\pi(s_t)=E_{A_t,S_{t+1}}[R_t+\gamma\, V_\pi(S_{t+1})]$ (Formula 3)
Monte Carlo approximation of these formulas (replace each expectation with the single observed transition $(s_t,a_t,r_t,s_{t+1})$):
- $Q_\pi(s_t,a_t)\approx r_t+\gamma\, V_\pi(s_{t+1})$
- $V_\pi(s_t)\approx r_t+\gamma\, V_\pi(s_{t+1})$
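For a concrete (made-up) illustration: if a sampled transition has $r_t=1$, $\gamma=0.9$, and the critic outputs $v(s_{t+1};w)=5$, the TD target is $y_t=1+0.9\times 5=5.5$; with $v(s_t;w)=6$ the TD error is $\delta_t=6-5.5=0.5$, and the single-sample advantage estimate is $y_t-v(s_t;w)=-0.5$.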