n+e 5ed6c1c7aa
change the step in trainer (#235)
This PR separates the `global_step` counter into `env_step` and `gradient_step`. Going forward, data produced in the collecting state will be logged under `env_step`, and data produced in the updating state will be logged under `gradient_step`.
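
Below is a minimal sketch of what the two counters track; the helper names `collect_transitions` and `run_update` are hypothetical stand-ins, not Tianshou's actual trainer API:

```python
# Illustrative only: env_step counts environment transitions gathered
# in the collecting state; gradient_step counts optimizer updates made
# in the updating state. Helper names here are hypothetical.
env_step = 0        # incremented while collecting
gradient_step = 0   # incremented while updating

def collect_transitions(n: int) -> int:
    """Stand-in for a collector: pretend to gather n transitions."""
    return n

def run_update() -> None:
    """Stand-in for one gradient update on the replay buffer."""

for epoch in range(3):
    # collecting state: metrics logged here key on env_step
    env_step += collect_transitions(100)
    # updating state: metrics logged here key on gradient_step
    for _ in range(10):
        run_update()
        gradient_step += 1

print(env_step, gradient_step)  # 300 30
```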

Others:
- add `rew_std` and `best_result` to the monitor
- fix the unbounded network output in `test/continuous/test_sac_with_il.py` and `examples/box2d/bipedal_hardcore_sac.py` (a hedged sketch of one bounding approach follows this list)
- bump the ray dependency to 1.0.0, since ray-project/ray#10134 has been resolved
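
The sketch below shows one common way to bound a continuous-control actor's output: squash it with `tanh` and scale by the action limit. This is an assumption about the spirit of the fix, not the exact patch; `BoundedActor` is a hypothetical class, not Tianshou's network API:

```python
# A hedged illustration of bounding an actor network: tanh maps the
# raw output into (-1, 1), and scaling by max_action keeps actions
# inside the environment's range.
import torch
import torch.nn as nn

class BoundedActor(nn.Module):  # hypothetical name
    def __init__(self, state_dim: int, action_dim: int, max_action: float):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )
        self.max_action = max_action

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Without the tanh, actions can grow without limit.
        return self.max_action * torch.tanh(self.net(obs))

actor = BoundedActor(state_dim=24, action_dim=4, max_action=1.0)
assert actor(torch.zeros(1, 24)).abs().max() <= 1.0
```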
2020-10-04 21:55:43 +08:00
2020-09-12 08:44:50 +08:00

Bipedal-Hardcore-SAC

  • Our default choice: remove the done-flag penalty. Training then converges to ~270 reward within 100 epochs (10M env steps, 3~4 hours; see the image below). A sketch of one way to remove the penalty follows this list.
  • If the done penalty is kept, convergence is much slower: about 200 epochs (20M env steps) to reach comparable performance (~200 reward).
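
In `BipedalWalkerHardcore`, the environment returns a -100 reward when the walker falls. A common way to "remove the done flag penalty" is a gym wrapper that masks that crash signal; the sketch below is an assumption about the approach, not necessarily the exact wrapper used in `examples/box2d/bipedal_hardcore_sac.py`, and it uses the pre-0.26 gym step API that was current in 2020:

```python
# Hedged sketch: zero out the -100 falling penalty so the agent's
# learning signal is not dominated by the crash. RemoveDonePenalty is
# a hypothetical wrapper name.
import gym

class RemoveDonePenalty(gym.Wrapper):
    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        if rew == -100:  # the falling penalty in BipedalWalkerHardcore
            rew = 0.0
        return obs, rew, done, info

env = RemoveDonePenalty(gym.make("BipedalWalkerHardcore-v3"))
```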