danagi 16d8e9b051
SAC implementation update (#212)
- replace DiagGaussian with Independent(Normal) (PyTorch already supports this)
- detach alpha from autograd
- add value/alpha to result (more informational)
- revert #204 to fix #211

Co-authored-by: Trinkle23897 <463003665@qq.com>
2020-09-12 08:44:50 +08:00

Bipedal-Hardcore-SAC

  • Our default choice: remove the done flag penalty. Training then converges to ~270 reward within 100 epochs (10M env steps, 3~4 hours; see the image below).
  • If the done penalty is not removed, training converges much more slowly and needs about 200 epochs (20M env steps) to reach the same performance (~200 reward).
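
The two settings above differ only in how the reward on the terminal step is handled. The following is a minimal sketch of one way to remove the done penalty, not necessarily how this example implements it. It assumes the "done flag penalty" is the large negative reward the environment returns on the step where the episode ends, and that the environment follows the 4-tuple `step` API. `RemoveDonePenalty`, `FallingEnvStub`, `rm_done`, and `episode_return` are hypothetical names used only for illustration.

```python
class RemoveDonePenalty:
    """Hypothetical wrapper: zero out the reward on the terminal step."""

    def __init__(self, env, rm_done=True):
        self.env = env
        self.rm_done = rm_done  # set to False to keep the original penalty

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Assumption: the penalty arrives together with done=True,
        # so dropping the terminal-step reward removes it.
        if done and self.rm_done:
            reward = 0.0
        return obs, reward, done, info


class FallingEnvStub:
    """Hypothetical stand-in env: +1 per step, then -100 and done on step 3."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        reward = -100.0 if done else 1.0
        return self.t, reward, done, {}


def episode_return(env):
    """Run one episode with a dummy action and sum the rewards."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(None)
        total += reward
    return total
```

With the stub, `episode_return(RemoveDonePenalty(FallingEnvStub()))` sums only the two non-terminal rewards, while passing `rm_done=False` leaves the penalty in place. Note that this sketch zeroes the reward on every terminal step, including episodes that end for reasons other than a fall; a stricter variant could check the sign or size of the reward first.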