train_fn(epoch) -> train_fn(epoch, num_env_step)
test_fn(epoch) -> test_fn(epoch, num_env_step)

2020-09-26 16:35:37 +08:00

2020-09-12 08:44:50 +08:00

acrobot_dualdqn.py

2020-09-26 16:35:37 +08:00

bipedal_hardcore_sac.py

2020-09-26 16:35:37 +08:00

lunarlander_dqn.py

2020-09-26 16:35:37 +08:00

mcc_sac.py

2020-09-26 16:35:37 +08:00

README.md

2020-09-12 08:44:50 +08:00

Bipedal-Hardcore-SAC

Our default choice is to remove the done-flag penalty. With the penalty removed, training converges to ~270 reward within 100 epochs (10M env steps, 3~4 hours; see the image below).
If the done penalty is not removed, training converges much more slowly, taking about 200 epochs (20M env steps) to reach ~200 reward.
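
One way to picture "removing the done-flag penalty" is a thin wrapper that discards the reward of the step that ends an episode. The sketch below is illustrative only and is not taken from `bipedal_hardcore_sac.py`: `RemoveDonePenalty`, `StubEnv`, `episode_return`, and the `rm_done` flag are hypothetical names. It assumes the older 4-tuple `step` return `(obs, reward, done, info)` and assumes the penalty arrives as the reward of the terminal step.

```python
class StubEnv:
    """Hypothetical stand-in environment (not BipedalWalker):
    +1 reward per step, then -100 on the step that ends the episode."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = -100.0 if done else 1.0
        return self.t, reward, done, {}


class RemoveDonePenalty:
    """Hypothetical wrapper: zeroes the reward of the terminal step.

    Assumption: the penalty is delivered together with done=True, so
    dropping that single reward removes it. Any other reward on the
    terminal step is dropped as well.
    """

    def __init__(self, env, rm_done=True):
        self.env = env
        self.rm_done = rm_done

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done and self.rm_done:
            reward = 0.0  # discard the terminal-step reward (the penalty)
        return obs, reward, done, info


def episode_return(env):
    """Run one episode with a fixed dummy action and sum the rewards."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(0)
        total += reward
    return total
```

With the stub, `episode_return(StubEnv())` includes the -100 terminal reward, while `episode_return(RemoveDonePenalty(StubEnv()))` counts only the per-step rewards. Setting `rm_done=False` gives the second configuration described above, where the penalty is kept.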