Generative Adversarial Imitation Learning: A Brief Paper Analysis

Date: 2021-01-02

"Generative Adversarial Imitation Learning" (2016)

1. Key concepts:

(1) Occupancy measure $\rho_{\pi}(s,a)$:

$\rho_{\pi}(s,a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^{t} P(s_{t}=s \mid \pi)$
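To make the definition concrete, here is a minimal sketch that approximates the occupancy measure on a tiny tabular MDP by truncating the infinite sum. The MDP (`T`, `P0`), the policy `PI`, the discount `GAMMA`, and the `horizon` cutoff are all hypothetical values chosen for illustration; none of them come from the paper.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
P0 = [1.0, 0.0]  # initial state distribution
# T[s][a][s2]: probability of moving from s to s2 under action a
T = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# PI[s][a]: probability that the policy picks action a in state s
PI = [[0.7, 0.3], [0.4, 0.6]]


def occupancy_measure(horizon=2000):
    """Approximate rho_pi(s, a) by truncating the infinite sum at `horizon`."""
    state_dist = list(P0)  # P(s_t = s | pi), starting at t = 0
    discounted_visits = [0.0] * N_STATES
    discount = 1.0
    for _ in range(horizon):
        for s in range(N_STATES):
            discounted_visits[s] += discount * state_dist[s]
        # Push the state distribution one step forward under pi.
        next_dist = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                for s2 in range(N_STATES):
                    next_dist[s2] += state_dist[s] * PI[s][a] * T[s][a][s2]
        state_dist = next_dist
        discount *= GAMMA
    # rho_pi(s, a) = pi(a|s) * sum_t gamma^t P(s_t = s | pi)
    return [[PI[s][a] * discounted_visits[s] for a in range(N_ACTIONS)]
            for s in range(N_STATES)]


rho = occupancy_measure()
total_mass = sum(sum(row) for row in rho)
```

Because the measure is unnormalized, its entries sum to $1/(1-\gamma)$ rather than to 1, and dividing $\rho_{\pi}(s,a)$ by $\sum_{a'}\rho_{\pi}(s,a')$ recovers $\pi(a|s)$.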

(2) Cost function $c(s,a)$, and the expected cumulative (discounted) cost under policy $\pi$:
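For reference, the paper writes the expectation of the cost under a policy as a discounted sum along trajectories. Restated here in the same symbols as the occupancy measure (and assuming finite state and action spaces for the last equality), it is also an inner product with $\rho_{\pi}$:

$\mathbb{E}_{\pi}[c(s,a)] \triangleq \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}c(s_{t},a_{t})\right] = \sum_{s,a}\rho_{\pi}(s,a)\,c(s,a)$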

(3) Causal entropy:
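For reference, the $\gamma$-discounted causal entropy of a policy, restated in the paper's notation (the expectation is the same discounted expectation used for the cost):

$H(\pi) \triangleq \mathbb{E}_{\pi}\left[-\log \pi(a|s)\right]$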

(4) Apprenticeship learning objective:
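For reference, the apprenticeship learning problem as the paper states it, where $\pi_{E}$ is the expert policy and $\mathcal{C}$ is a restricted class of cost functions:

$\min_{\pi}\max_{c\in\mathcal{C}}\ \mathbb{E}_{\pi}[c(s,a)] - \mathbb{E}_{\pi_{E}}[c(s,a)]$

The learner looks for a policy that performs well against the worst-case cost in $\mathcal{C}$, measured relative to the expert.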

(5) TRPO is used for the policy update; it ensures that the policies before and after each update do not differ too much.

This algorithm relies crucially on the TRPO policy step, which is a natural gradient step constrained to ensure that $\pi_{\theta_{i+1}}$ does not stray too far from $\pi_{\theta_{i}}$.
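The "does not stray too far" constraint can be illustrated with a small sketch. This is not TRPO itself: it only shows the trust-region idea, by measuring the mean KL divergence between the old and new policies over some sampled states and backtracking along a proposed update until that KL fits under a threshold. The tabular softmax policy, the `direction` values, and the `max_kl` and `shrink` settings are hypothetical choices made for illustration.

```python
import math


def softmax(logits):
    """Turn per-action logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mean_kl(old_logits, new_logits):
    """Average KL(pi_old || pi_new) over the sampled states."""
    total = 0.0
    for old, new in zip(old_logits, new_logits):
        p, q = softmax(old), softmax(new)
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return total / len(old_logits)


def constrained_step(old_logits, direction, max_kl=0.01, shrink=0.5, max_tries=20):
    """Backtrack along `direction` until the mean KL fits inside the trust region."""
    step = 1.0
    for _ in range(max_tries):
        new_logits = [[o + step * d for o, d in zip(o_row, d_row)]
                      for o_row, d_row in zip(old_logits, direction)]
        if mean_kl(old_logits, new_logits) <= max_kl:
            return new_logits, step
        step *= shrink
    return old_logits, 0.0  # no acceptable step found: keep the old policy


# Hypothetical per-state action logits and a hypothetical update direction.
old_logits = [[0.0, 0.0], [1.0, -1.0]]
direction = [[2.0, -2.0], [-1.0, 1.0]]
new_logits, step = constrained_step(old_logits, direction)
```

The full-size step changes the policy far more than the threshold allows, so the search settles on a smaller step; the accepted policy stays within `max_kl` of the old one.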

2. Algorithm

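In the algorithm, the discriminator parameters are updated with Adam, taking a step that increases the objective with respect to D, while the generator (the policy) is updated with TRPO, taking a step that decreases the objective with respect to $\pi$.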
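For reference, the objective that these two updates push in opposite directions is the saddle-point expression from the paper, restated here with $\lambda \ge 0$ as the weight on the causal entropy term:

$\mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_{E}}[\log(1-D(s,a))] - \lambda H(\pi)$

Under this sign convention the discriminator outputs values near 1 on state-action pairs from the current policy and near 0 on expert pairs.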

Here the cost function is $c(s,a) = \log D(s,a)$.
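The discriminator step and the resulting cost can be sketched as follows. Several simplifications are assumptions made here, not the paper's setup: the discriminator is a logistic model on hand-made features rather than a neural network, plain gradient ascent stands in for Adam, and the feature vectors in `policy_batch` and `expert_batch` are made-up numbers.

```python
import math


def discriminator(w, x):
    """D_w(s, a) in (0, 1): a logistic model on a feature vector x for the pair (s, a)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def objective(w, policy_batch, expert_batch):
    """Sample estimate of E_pi[log D] + E_piE[log(1 - D)]."""
    policy_term = sum(math.log(discriminator(w, x)) for x in policy_batch) / len(policy_batch)
    expert_term = sum(math.log(1.0 - discriminator(w, x)) for x in expert_batch) / len(expert_batch)
    return policy_term + expert_term


def ascent_step(w, policy_batch, expert_batch, lr=0.5):
    """One plain gradient-ascent step on w (standing in for the Adam step)."""
    grad = [0.0] * len(w)
    for x in policy_batch:  # d/dw log D = (1 - D) * x
        d = discriminator(w, x)
        for i, xi in enumerate(x):
            grad[i] += (1.0 - d) * xi / len(policy_batch)
    for x in expert_batch:  # d/dw log(1 - D) = -D * x
        d = discriminator(w, x)
        for i, xi in enumerate(x):
            grad[i] -= d * xi / len(expert_batch)
    return [wi + lr * gi for wi, gi in zip(w, grad)]


def cost(w, x):
    """Surrogate cost c(s, a) = log D(s, a) handed to the policy step."""
    return math.log(discriminator(w, x))


# Hypothetical features: [bias, one-dimensional summary of (s, a)].
policy_batch = [[1.0, 0.9], [1.0, 1.1], [1.0, 0.8]]
expert_batch = [[1.0, -1.0], [1.0, -0.7], [1.0, -1.2]]

w = [0.0, 0.0]
before = objective(w, policy_batch, expert_batch)
for _ in range(50):
    w = ascent_step(w, policy_batch, expert_batch)
after = objective(w, policy_batch, expert_batch)
```

After the ascent steps the objective is higher than at initialization, and expert-like pairs receive a lower cost than policy-like pairs, which is the signal the TRPO step then uses to pull the policy toward the expert.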