ChenDRAG 150d0ec51b
Step collector implementation (#280)
This is the third PR of 6 commits mentioned in #274, which features a refactor of Collector to fix #245. You can check #274 for more details.

Things changed in this PR:

1. refactor the collector to be cleaner, and split out AsyncCollector to support async venvs;
2. change the buffer.add API to add(batch, buffer_ids); add several buffer types (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) (see the sketch after this list);
3. add policy.exploration_noise(act, batch) -> act;
4. small change in BasePolicy.compute_*_returns;
5. move reward_metric from the collector to the trainer;
6. fix the np.asanyarray issue (different numpy versions produce different outputs);
7. set the flake8 max line length to 88;
8. polish docs and fix tests.
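As a minimal sketch of the new buffer API from items 2 and 3: the call and return signatures below follow the PR description; the 4-environment setup, buffer size, and array shapes are made up for illustration.

```python
import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

# One shared buffer of 20000 transitions, internally split into 4 sub-buffers,
# one per environment of a vectorized env (illustrative sizes).
buf = VectorReplayBuffer(total_size=20000, buffer_num=4)

# Transitions from all environments are added in a single call;
# buffer_ids says which sub-buffer each row of the batch belongs to.
batch = Batch(
    obs=np.zeros((4, 17)),
    act=np.zeros((4, 6)),
    rew=np.zeros(4),
    done=np.zeros(4, dtype=bool),
    obs_next=np.zeros((4, 17)),
)
ptr, ep_rew, ep_len, ep_idx = buf.add(batch, buffer_ids=[0, 1, 2, 3])

# During collection, exploration noise is now applied outside the policy's
# forward pass, via act = policy.exploration_noise(act, batch) (item 3).
```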

Co-authored-by: n+e <trinkle23897@gmail.com>
2021-02-19 10:33:49 +08:00

MuJoCo Results

SAC (single run)

The best reward is computed from the returns of 100 episodes in the test phase.

SAC on Swimmer-v3 always plateaus at 47~48.
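To illustrate how the reported statistic is obtained, the sketch below summarizes a set of 100 episode returns; the returns are random placeholder numbers, and reading ± as the standard deviation over those episodes is an assumption.

```python
import numpy as np

# Placeholder stand-in for the returns of 100 test-phase episodes
# of the best checkpoint (random numbers, for illustration only).
episode_returns = np.random.normal(loc=10000.0, scale=200.0, size=100)

# Reported as mean ± standard deviation over those episodes (assumed reading).
print(f"{episode_returns.mean():.2f} ± {episode_returns.std():.2f}")
```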

task            3M best reward      parameters                                               time cost (3M)
HalfCheetah-v3  10157.70 ± 171.70   python3 mujoco_sac.py --task HalfCheetah-v3              2~3h
Walker2d-v3     5143.04 ± 15.57     python3 mujoco_sac.py --task Walker2d-v3                 2~3h
Hopper-v3       3604.19 ± 169.55    python3 mujoco_sac.py --task Hopper-v3                   2~3h
Humanoid-v3     6579.20 ± 1470.57   python3 mujoco_sac.py --task Humanoid-v3 --alpha 0.05    2~3h
Ant-v3          6281.65 ± 686.28    python3 mujoco_sac.py --task Ant-v3                      2~3h

Which parts are important?

  1. DO NOT share the same network between the two critic networks.
  2. The sigma (of the Gaussian policy) MUST be conditioned on the input (see the sketch after this list).
  3. The network (hidden layer) size should not be less than 256.
  4. Deterministic evaluation helps a lot :)
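To make points 1, 2, and 4 concrete, here is a minimal PyTorch sketch (not the code behind the table above) of an actor whose sigma is produced from the state, two critics with separate parameters, and a deterministic mean action at evaluation time; the layer sizes, clamp bounds, and observation/action dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Point 3: hidden layers of width 256.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class GaussianActor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden, act_dim)
        # Point 2: sigma comes from the state features,
        # not from a free state-independent parameter.
        self.log_sigma_head = nn.Linear(hidden, act_dim)

    def forward(self, obs, deterministic=False):
        feat = self.backbone(obs)
        mu = self.mu_head(feat)
        if deterministic:
            # Point 4: use the mean action at evaluation time.
            return torch.tanh(mu)
        sigma = self.log_sigma_head(feat).clamp(-20, 2).exp()
        act = mu + sigma * torch.randn_like(mu)  # reparameterized sample
        return torch.tanh(act)

obs_dim, act_dim = 17, 6  # e.g. HalfCheetah-v3 shapes (illustrative)
actor = GaussianActor(obs_dim, act_dim)
# Point 1: two critics with their own parameters, never a shared network.
critic1 = mlp(obs_dim + act_dim, 1)
critic2 = mlp(obs_dim + act_dim, 1)
```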