This is the third PR of the 6 commits mentioned in #274: it refactors the Collector to fix #245. See #274 for more details. Changes in this PR:

1. Refactor the Collector to be cleaner, and split out `AsyncCollector` to support async vectorized environments.
2. Change the buffer API to `buffer.add(batch, buffer_ids)` and add several buffer types (`VectorReplayBuffer`, `PrioritizedVectorReplayBuffer`, etc.); a sketch of the new call style follows this list.
3. Add `policy.exploration_noise(act, batch) -> act`.
4. Small changes in `BasePolicy.compute_*_returns`.
5. Move `reward_metric` from the collector to the trainer.
6. Fix the `np.asanyarray` issue (different numpy versions produce different outputs).
7. Set the flake8 max line length to 88.
8. Polish docs and fix tests.

Co-authored-by: n+e <trinkle23897@gmail.com>
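
A minimal, hypothetical sketch of the new buffer call style from item 2 above. The constructor arguments (`total_size`, `buffer_num`) and the exact set of keys required in the batch are assumptions and may differ slightly from the final API in this PR.

```python
import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

# One flat buffer that internally keeps 4 sub-buffers, one per parallel env
# (argument names here are assumptions, not the confirmed signature).
buf = VectorReplayBuffer(total_size=20000, buffer_num=4)

# A batch of transitions collected from env 0 and env 2 in this step.
batch = Batch(
    obs=np.zeros((2, 3)),
    act=np.zeros((2, 1)),
    rew=np.array([1.0, 0.5]),
    done=np.array([False, True]),
    obs_next=np.zeros((2, 3)),
)

# New-style call: pass the whole batch plus the ids of the sub-buffers to
# write into, instead of per-key keyword arguments.
buf.add(batch, buffer_ids=[0, 2])

# Item 3 adds a hook with the signature shown in the list, which a collector
# can call to inject exploration noise:
#     act = policy.exploration_noise(act, batch)
```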
Mujoco Results
SAC (single run)
The best reward is computed from the returns of 100 episodes in the test phase.
SAC on Swimmer-v3 consistently plateaus at a return of 47~48.
| task | 3M best reward | parameters | time cost (3M) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | 10157.70 ± 171.70 | `python3 mujoco_sac.py --task HalfCheetah-v3` | 2~3h |
| Walker2d-v3 | 5143.04 ± 15.57 | `python3 mujoco_sac.py --task Walker2d-v3` | 2~3h |
| Hopper-v3 | 3604.19 ± 169.55 | `python3 mujoco_sac.py --task Hopper-v3` | 2~3h |
| Humanoid-v3 | 6579.20 ± 1470.57 | `python3 mujoco_sac.py --task Humanoid-v3 --alpha 0.05` | 2~3h |
| Ant-v3 | 6281.65 ± 686.28 | `python3 mujoco_sac.py --task Ant-v3` | 2~3h |
Which parts are important?
- DO NOT share the same network between the two critic networks (see the sketch after this list).
- The sigma of the Gaussian policy MUST be conditioned on the input (state-dependent sigma).
- The hidden layer size should not be less than 256.
- Deterministic evaluation (using the mean action at test time) helps a lot :)
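
A minimal PyTorch sketch of the first two points (independent critics, state-dependent sigma). The layer sizes and dimensions are illustrative assumptions, not the exact networks used in mujoco_sac.py.

```python
import torch
import torch.nn as nn


class GaussianActor(nn.Module):
    """Gaussian policy whose sigma is produced by its own head,
    i.e. conditioned on the input, not a free global parameter."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, act_dim)
        self.log_sigma = nn.Linear(hidden, act_dim)

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        # clamp log_sigma for numerical stability before exponentiating
        return self.mu(h), self.log_sigma(h).clamp(-20, 2).exp()


def make_critic(obs_dim: int, act_dim: int, hidden: int = 256) -> nn.Module:
    """Build one Q(s, a) network; call it twice so the two critics
    do not share any parameters."""
    return nn.Sequential(
        nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )


# Example dimensions (illustrative only): two critics built independently.
critic1 = make_critic(obs_dim=17, act_dim=6)
critic2 = make_critic(obs_dim=17, act_dim=6)
```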