# Bipedal-Hardcore-SAC
- Our default choice: remove the done flag penalty. Training then converges to \~280 reward within 100 epochs (10M env steps, 3-4 hours; see the plot and the wrapper sketch below)
- If the done penalty is not removed, convergence is much slower: about 200 epochs (20M env steps) to reach comparable performance (\~200 reward)
![](results/sac/BipedalHardcore.png)
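
Below is a minimal sketch of the done-penalty removal, written as a `gym.RewardWrapper`. The class name and the reward threshold are assumptions for illustration, not the exact wrapper used in the example script; the idea is simply to zero out the -100 crash penalty that BipedalWalkerHardcore returns when the hull touches the ground.

```python
import gym


class RemoveDonePenalty(gym.RewardWrapper):
    """Hypothetical wrapper: drop the -100 failure penalty,
    keeping all other rewards unchanged."""

    def reward(self, reward):
        # BipedalWalkerHardcore-v3 returns -100 when the hull hits the ground;
        # treating that terminal penalty as 0 keeps the agent from being
        # over-penalized for exploring risky but useful gaits.
        return 0.0 if reward <= -100 else reward


env = RemoveDonePenalty(gym.make("BipedalWalkerHardcore-v3"))
```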
# BipedalWalker-BDQ
- To demonstrate the capability of BDQ to scale to large discrete action spaces, we run it on a discretized version of the BipedalWalker-v3 environment with 25 bins per action dimension, for a total of 25^4 = 390,625 possible joint actions. A usual DQN architecture would need 25^4 output neurons for its Q-network, scaling exponentially with the number of action dimensions, whereas the branching architecture scales linearly and uses only 25\*4 = 100 output neurons (see the sketch below the plot).
![](results/bdq/BipedalWalker.png)
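
The scaling difference shows up directly in the size of the output heads. The PyTorch sketch below (the hidden width of 256 is an assumption for illustration, not the exact network used in the example) compares a flat Q-head over the joint action space with one small head per action dimension:

```python
import torch.nn as nn

NUM_BRANCHES = 4      # action dimensions of BipedalWalker-v3
BINS_PER_BRANCH = 25  # discretization bins per dimension
HIDDEN = 256          # assumed hidden width, for illustration only

# Flat DQN head: one Q-value per joint action -> 25**4 = 390,625 outputs.
flat_head = nn.Linear(HIDDEN, BINS_PER_BRANCH ** NUM_BRANCHES)

# Branching (BDQ) head: one small head per dimension -> 25*4 = 100 outputs.
branch_heads = nn.ModuleList(
    nn.Linear(HIDDEN, BINS_PER_BRANCH) for _ in range(NUM_BRANCHES)
)

print(sum(p.numel() for p in flat_head.parameters()))     # ~100M parameters
print(sum(p.numel() for p in branch_heads.parameters()))  # 25,700 parameters
```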