
Mujoco Results

SAC (single run)

The best reward is computed from the returns of 100 episodes in the test phase.
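As a concrete illustration, the mean ± std figures reported below can be computed from the 100 test-episode returns like this (a minimal sketch with placeholder data, not the actual benchmark script):

```python
import numpy as np

# Hypothetical stand-in for the returns of 100 test episodes
# (placeholder values, not actual benchmark data).
episode_returns = np.random.default_rng(0).normal(10000.0, 170.0, size=100)

# "best reward" is reported as mean ± std over these 100 returns.
print(f"best reward: {episode_returns.mean():.2f} ± {episode_returns.std():.2f}")
```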

Note that SAC on Swimmer-v3 always plateaus at a reward of 47~48.

| Task | Best Reward (3M) | Parameters | Time Cost (3M) |
| :--- | :---: | :--- | :---: |
| HalfCheetah-v3 | 10157.70 ± 171.70 | `python3 mujoco_sac.py --task HalfCheetah-v3` | 2~3h |
| Walker2d-v3 | 5143.04 ± 15.57 | `python3 mujoco_sac.py --task Walker2d-v3` | 2~3h |
| Hopper-v3 | 3604.19 ± 169.55 | `python3 mujoco_sac.py --task Hopper-v3` | 2~3h |
| Humanoid-v3 | 6579.20 ± 1470.57 | `python3 mujoco_sac.py --task Humanoid-v3 --alpha 0.05` | 2~3h |
| Ant-v3 | 6281.65 ± 686.28 | `python3 mujoco_sac.py --task Ant-v3` | 2~3h |

Which parts are important?

  1. DO NOT share the same network between the two critic networks.
  2. The sigma (of the Gaussian policy) MUST be conditioned on the input (state).
  3. The hidden layer size of the networks should not be less than 256.
  4. Deterministic evaluation helps a lot :)
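The four points above map directly onto the network definitions. Below is a minimal PyTorch sketch of one way to satisfy all of them; the names (`mlp`, `GaussianActor`, `TwinCritic`) are illustrative, not Tianshou's actual implementation:

```python
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    # Point 3: hidden layers at least 256 units wide.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class GaussianActor(nn.Module):
    # Point 2: sigma is a network output, i.e. conditioned on the state,
    # rather than a free standalone parameter.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = mlp(obs_dim, 2 * act_dim)  # outputs both mu and log_sigma
        self.act_dim = act_dim

    def forward(self, obs, deterministic=False):
        mu, log_sigma = self.net(obs).split(self.act_dim, dim=-1)
        if deterministic:
            # Point 4: at evaluation time, take the mean action (no sampling).
            return torch.tanh(mu)
        sigma = log_sigma.clamp(-20, 2).exp()
        action = mu + sigma * torch.randn_like(mu)  # reparameterized sample
        return torch.tanh(action)


class TwinCritic(nn.Module):
    # Point 1: two critics with completely separate parameters, no shared layers.
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.q1 = mlp(obs_dim + act_dim, 1)
        self.q2 = mlp(obs_dim + act_dim, 1)  # an independent second network

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)
```

Sharing a backbone between the two critics couples their errors and defeats the purpose of the clipped double-Q trick, which is why point 1 insists on fully independent networks.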