
Trainer refactor : some definition change (#293 )

This PR focuses on some definition changes to the trainer, making it friendlier to use and more consistent with typical usage in research papers. In particular, `collect-per-step` is renamed to `step-per-collect`, `update-per-step` / `episode-per-collect` are added accordingly, and the documentation is updated.

2021-02-21 13:06:02 +08:00

results/sac

sac mujoco result (#246 )

2020-11-09 16:43:55 +08:00

acrobot_dualdqn.py

Trainer refactor : some definition change (#293 )

2021-02-21 13:06:02 +08:00

bipedal_hardcore_sac.py

Trainer refactor : some definition change (#293 )

2021-02-21 13:06:02 +08:00

lunarlander_dqn.py

Trainer refactor : some definition change (#293 )

2021-02-21 13:06:02 +08:00

mcc_sac.py

Trainer refactor : some definition change (#293 )

2021-02-21 13:06:02 +08:00

README.md

sac mujoco result (#246 )

2020-11-09 16:43:55 +08:00


Bipedal-Hardcore-SAC

Our default choice is to remove the done-flag penalty. With this setting, training soon converges to ~280 reward within 100 epochs (10M env steps, 3~4 hours; see the image below).
If the done penalty is not removed, training converges much more slowly: it takes about 200 epochs (20M env steps) to reach the same performance (~200 reward).
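The done-penalty removal described above can be sketched as a thin wrapper around a classic Gym-style `step` that returns `(obs, reward, done, info)`. This is a minimal illustration, not the code in `bipedal_hardcore_sac.py`: the names `RemoveDonePenalty`, `make_toy_env`, and `rm_done` are hypothetical, and the sketch assumes the penalty is carried entirely by the reward of the step that ends the episode (in BipedalWalker, falling yields a large negative reward on that step).

```python
from typing import Any, Callable, Dict, List, Tuple

# Classic Gym-style transition: (obs, reward, done, info).
Step = Tuple[Any, float, bool, Dict[str, Any]]


class RemoveDonePenalty:
    """Hypothetical wrapper: zeroes the reward of the terminal step.

    Assumption: the done penalty lives in the reward returned on the step
    where done becomes True. The done flag itself is passed through unchanged.
    """

    def __init__(self, step_fn: Callable[[Any], Step], rm_done: bool = True) -> None:
        self.step_fn = step_fn
        self.rm_done = rm_done

    def step(self, action: Any) -> Step:
        obs, reward, done, info = self.step_fn(action)
        if done and self.rm_done:
            # Drop the terminal-step reward, which is assumed to hold the penalty.
            reward = 0.0
        return obs, reward, done, info


def make_toy_env(rewards: List[float]) -> Callable[[Any], Step]:
    """Hypothetical env that replays a fixed reward sequence; the last step is terminal."""
    it = iter(enumerate(rewards))

    def step_fn(action: Any) -> Step:
        i, r = next(it)
        return i, r, i == len(rewards) - 1, {}

    return step_fn


# The final -100.0 plays the role of the fall penalty.
env = RemoveDonePenalty(make_toy_env([1.0, 2.0, -100.0]))
returns = sum(env.step(None)[1] for _ in range(3))
print(returns)  # 3.0
```

With `rm_done=False` the same episode returns -97.0, so the penalty dominates the return. Zeroing the whole terminal reward is a blunt choice: it also discards any ordinary reward earned on that step, which this sketch assumes is small compared with the penalty.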