
# Bipedal-Hardcore-SAC

- Our default choice is to remove the done-flag penalty. With this setting, training converges to \~250 reward within 100 epochs (10M env steps, 3-4 hours; see the image below).
- If the done penalty is not removed, convergence is much slower: it takes about 200 epochs (20M env steps) to reach comparable performance (\~200 reward).
- Action noise is only needed early in training; late in training it hurts performance. Removing it at that stage reaches \~255 reward, our best result on the original environment (with the done penalty kept).
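Removing the done penalty from the first bullet can be sketched as a small environment wrapper. This is a minimal sketch, not the wrapper used in this example. It assumes the classic Gym step API returning `(obs, reward, done, info)` and that BipedalWalkerHardcore signals a fall with a terminal reward of exactly -100. The class name `RemoveDonePenalty` and its `fall_penalty` parameter are hypothetical.

```python
from typing import Any, Dict, Tuple

# Hypothetical sketch, not Tianshou's wrapper. Assumes the classic Gym API where
# env.step(action) returns (obs, reward, done, info).
StepResult = Tuple[Any, float, bool, Dict[str, Any]]


class RemoveDonePenalty:
    """Replace the fall penalty on the terminal step with 0, keeping done=True."""

    def __init__(self, env: Any, fall_penalty: float = -100.0) -> None:
        self.env = env
        # Assumed terminal reward that BipedalWalker(Hardcore) assigns when the walker falls.
        self.fall_penalty = fall_penalty

    def reset(self) -> Any:
        return self.env.reset()

    def step(self, action: Any) -> StepResult:
        obs, reward, done, info = self.env.step(action)
        if done and reward == self.fall_penalty:
            # Keep the episode boundary, but drop the large negative reward.
            reward = 0.0
        return obs, reward, done, info
```

The wrapper leaves `done=True` untouched, so episodes still end when the walker falls; only the reward signal on that final step changes. Replacing the penalty with 0 (rather than discarding the transition) is a design choice of this sketch.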

![Bipedal-Hardcore-SAC results](results/sac/BipedalHardcore.png)
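The third bullet, using action noise early and removing it later, can be sketched as an epoch-based schedule. This assumes "action noise" means extra Gaussian noise added on top of the actions sampled by the SAC policy, and that actions lie in [-1, 1] as in BipedalWalker's action space. The functions `noise_std_for_epoch` and `add_action_noise`, the 0.1 standard deviation, and the cutoff at epoch 50 are illustrative choices, not settings from this example.

```python
import random
from typing import List, Optional, Sequence


def noise_std_for_epoch(epoch: int, initial_std: float = 0.1, cutoff_epoch: int = 50) -> float:
    """Hypothetical schedule: constant exploration noise before cutoff_epoch, none afterwards."""
    return initial_std if epoch < cutoff_epoch else 0.0


def add_action_noise(
    action: Sequence[float],
    std: float,
    low: float = -1.0,
    high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Add i.i.d. Gaussian noise to each action dimension, then clip to [low, high]."""
    if std == 0.0:
        return list(action)  # noise switched off: use the policy's actions unchanged
    rng = rng if rng is not None else random.Random()
    return [min(high, max(low, a + rng.gauss(0.0, std))) for a in action]


# Usage: noise is applied in early epochs and switched off from the cutoff onwards.
for epoch in (0, 49, 50, 99):
    std = noise_std_for_epoch(epoch)
    print(epoch, std, add_action_noise([0.5, -0.5, 0.0, 1.0], std, rng=random.Random(epoch)))
```

A hard cutoff is the simplest schedule; a gradual decay of `std` toward 0 would be an equally reasonable variant.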