Generative Adversarial Imitation Learning: Paper Notes

"Generative Adversarial Imitation Learning" (Ho & Ermon, 2016)

1. Key concepts:

(1) Occupancy measure ρπ(s,a):

\rho_{\pi}(s,a) = \pi(a|s)\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s \mid \pi)
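To make this concrete, here is a minimal Monte Carlo sketch that estimates ρπ for a small discrete MDP by averaging discounted visit counts over rollouts. The Gym-style `env.reset()`/`env.step(a)` interface, the integer states/actions, and the `policy` callable are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def estimate_occupancy(env, policy, gamma=0.99, n_episodes=1000, horizon=200):
    """Monte Carlo estimate of rho_pi(s, a) = E[sum_t gamma^t 1{s_t=s, a_t=a}]."""
    rho = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        for t in range(horizon):
            a = policy(s)
            rho[(s, a)] += gamma ** t          # gamma^t credit for visiting (s_t, a_t)
            s, _, done, _ = env.step(a)
            if done:
                break
    return {sa: v / n_episodes for sa, v in rho.items()}
```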

(2) Cost function c(s,a); the expected cumulative discounted cost under policy π:

E_{\pi}[c(s,a)] \triangleq E\left[\sum_{t=0}^{\infty}\gamma^{t}c(s_{t},a_{t})\right]
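In practice this expectation is estimated by averaging the discounted cost sum over sampled trajectories; a small sketch, assuming `trajectories` is a list of (s, a) rollouts and `c` is a cost callable (both names are illustrative):

```python
def expected_cost(trajectories, c, gamma=0.99):
    """Sample average of the discounted cumulative cost under the sampling policy."""
    totals = [sum(gamma ** t * c(s, a) for t, (s, a) in enumerate(traj))
              for traj in trajectories]
    return totals and sum(totals) / len(totals)
```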

(3) Causal entropy:

H(\pi) \triangleq E_{\pi}[-\log\pi(a|s)]
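A sample estimate is equally direct; a sketch assuming `policy_log_prob(s, a)` returns log π(a|s) for pairs drawn from rollouts of π (names are illustrative):

```python
import numpy as np

def causal_entropy(policy_log_prob, rollouts):
    """Estimate H(pi) = E_pi[-log pi(a|s)] from sampled (s, a) pairs."""
    log_ps = [policy_log_prob(s, a) for traj in rollouts for (s, a) in traj]
    return -np.mean(log_ps)
```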

(4) The apprenticeship learning objective (a worked sketch follows below):

\min_{\pi}\max_{c\in\mathcal{C}} E_{\pi}[c(s,a)] - E_{\pi_{E}}[c(s,a)]
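For intuition, a tiny sketch of this objective for the classic linear cost class c_w(s,a) = wᵀφ(s,a) with ‖w‖₂ ≤ 1, where the inner max has the closed form ‖μπ − μπE‖₂, the gap between discounted feature expectations. The feature map `phi` and the trajectory format are assumptions:

```python
import numpy as np

def feature_expectation(trajectories, phi, gamma=0.99):
    """mu = E[sum_t gamma^t phi(s_t, a_t)], averaged over trajectories."""
    mus = [sum(gamma ** t * phi(s, a) for t, (s, a) in enumerate(traj))
           for traj in trajectories]
    return np.mean(mus, axis=0)

def apprenticeship_objective(pi_trajs, expert_trajs, phi, gamma=0.99):
    """Inner max over linear costs with ||w|| <= 1 reduces to a feature-expectation gap."""
    gap = feature_expectation(pi_trajs, phi, gamma) - \
          feature_expectation(expert_trajs, phi, gamma)
    return np.linalg.norm(gap)
```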

(5) TRPO is used for the policy update, which keeps each new policy from drifting too far from the previous one.

As the paper puts it: "This algorithm relies crucially on the TRPO policy step, which is a natural gradient step constrained to ensure that πθi+1 does not stray too far from πθi."
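A toy numpy sketch of that constraint: take the natural gradient direction F⁻¹g and scale it so the second-order KL estimate ½ΔθᵀFΔθ hits a budget δ. Real TRPO solves F⁻¹g with conjugate gradient and adds a backtracking line search, so this shows only the core update rule; all names here are illustrative.

```python
import numpy as np

def trust_region_step(theta, g, fisher, delta=0.01):
    """One KL-constrained natural gradient step, the core of the TRPO update."""
    direction = np.linalg.solve(fisher, g)        # natural gradient F^{-1} g
    quad = direction @ fisher @ direction         # d^T F d
    beta = np.sqrt(2.0 * delta / (quad + 1e-8))   # KL ~= 0.5 * beta^2 * d^T F d = delta
    return theta + beta * direction
```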

2. The algorithm

The discriminator parameters are updated with Adam, aiming to increase D on the policy's (s,a) samples (and decrease it on the expert's), while the generator, i.e. the policy, is updated with TRPO, aiming to decrease D.

Here the cost function is c(s,a) = log D(s,a).
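To tie the pieces together, below is a minimal PyTorch sketch of the discriminator side of one GAIL iteration and the cost signal it produces. The network shape, the label convention (policy samples = 1, expert samples = 0, so "increase D" means flagging policy samples), and the batch format are assumptions; the TRPO policy step itself is omitted since it is the standard algorithm.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D(s, a): probability that the (s, a) pair came from the policy, not the expert."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),                  # logit of D(s, a)
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def discriminator_step(disc, opt, s_pi, a_pi, s_exp, a_exp):
    """One Adam step: increase D on policy pairs, decrease it on expert pairs."""
    bce = nn.BCEWithLogitsLoss()
    logits_pi = disc(s_pi, a_pi)
    logits_exp = disc(s_exp, a_exp)
    # Minimizing this BCE loss ascends E_pi[log D] + E_expert[log(1 - D)].
    loss = bce(logits_pi, torch.ones_like(logits_pi)) + \
           bce(logits_exp, torch.zeros_like(logits_exp))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def gail_cost(disc, s, a):
    """c(s, a) = log D(s, a), the cost handed to the TRPO policy step."""
    return torch.log(torch.sigmoid(disc(s, a)) + 1e-8)
```

A usage sketch: `opt = torch.optim.Adam(disc.parameters(), lr=3e-4)`, then each iteration alternates `discriminator_step(...)` with a TRPO update that minimizes the expected `gail_cost` under the current policy.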