Tianshou

Author	SHA1	Message	Date
ChenDRAG	1423eeb3b2	Add warnings for duplicate usage of action-bounded actor and action scaling method (#850 ) - Fix the current bug discussed in #844 in `test_ppo.py`. - Add warning for `ActorProb` if both `max_action` and `unbounded=True` are used for model initialization. - Add warning for PGPolicy and DDPGPolicy if they find duplicate usage of action-bounded actor and action scaling method.	2023-04-23 16:03:31 -07:00
Jiayi Weng	1037627a5b	fix issue of info not being passed in PGPolicy (#787 ) close #775	2022-12-24 13:06:54 -08:00
Yi Su	662af52820	Fix Atari PPO example (#780 ) - [x] I have marked all applicable categories: + [ ] exception-raising fix + [x] algorithm implementation fix + [ ] documentation modification + [ ] new feature - [x] I have reformatted the code using `make format` (required) - [x] I have checked the code using `make commit-checks` (required) - [x] If applicable, I have mentioned the relevant/related issue(s) - [x] If applicable, I have listed every item in this Pull Request below While trying to debug Atari PPO+LSTM, I found a significant gap between our Atari PPO example and [CleanRL's Atari PPO w/ EnvPool](https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_envpoolpy). I tried to align our implementation with CleanRL's version, mostly in hyperparameter choices, and got a significant gain in Breakout, Qbert, and SpaceInvaders while staying on par in other games. After this fix, I would suggest updating our [Atari Benchmark](https://tianshou.readthedocs.io/en/master/tutorials/benchmark.html) PPO experiments. A few interesting findings: - Layer initialization helps stabilize the training and enables the use of larger learning rates; without it, larger learning rates will trigger NaN gradients very quickly; - ppo.py#L97-L101: this change helps training stability for reasons I do not understand; it also makes the GPU usage higher. Shoutout to [CleanRL](https://github.com/vwxyzjn/cleanrl) for a well-tuned Atari PPO reference implementation!	2022-12-04 12:23:18 -08:00
Yi Su	9ce0a554dc	Add Atari SAC examples (#657 ) - Add Atari (discrete) SAC examples; - Fix a bug in Discrete SAC evaluation; default to deterministic mode.	2022-06-04 13:26:08 +08:00
Michal Gregor	277138ca5b	Added support for clipping to DQNPolicy (#642 ) * When clip_loss_grad=True is passed, Huber loss is used instead of the MSE loss. * Made the argument's name more descriptive; * Replaced the smooth L1 loss with the Huber loss, which has an identical form to the default parametrization, but seems to be better known in this context; * Added a fuller description to the docstring;	2022-05-18 19:33:37 +08:00
Anas BELFADIL	53e6b0408d	Add BranchingDQN for large discrete action spaces (#618 )	2022-05-15 21:40:32 +08:00
Yi Su	dd16818ce4	implement REDQ based on original contribution by @Jimenius (#623 ) Co-authored-by: Minhui Li <limh@lamda.nju.edu.cn>	2022-05-01 00:06:00 +08:00
Alex Nikulkov	92456cdb68	Add learning rate scheduler to BasePolicy (#598 )	2022-04-17 23:52:30 +08:00
ChenDRAG	75d7c9f1d9	Fix action scaling bug in SAC (#591 ) close #588	2022-04-12 00:26:06 +08:00
Andrea Boscolo Camiletto	2336a7db1b	fixed typo in rainbow DQN paper reference (#569 ) * fixed typo in rainbow DQN paper ref * fix gym==0.23 ci failure Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>	2022-03-16 21:38:51 +08:00
Alex Nikulkov	74f430ea36	Add a comment before SAC alpha loss (#565 ) Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>	2022-03-09 06:38:42 +08:00
Kenneth Schröder	cd7654bfd5	Fixing casts to int by to_torch_as(...) calls in policies when using discrete actions (#521 )	2022-02-07 03:42:46 +08:00
ChenDRAG	c25926dd8f	Formalize variable names (#509 ) Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>	2022-01-30 00:53:56 +08:00
Jiayi Weng	926ec0b9b1	update save_fn in trainer (#459 ) - collector.collect() now returns 4 extra keys: rew/rew_std/len/len_std (previously this work was done in the logger) - save_fn() will be called at the beginning of trainer	2021-10-13 21:25:24 +08:00
Jiayi Weng	5df64800f4	final fix for actor_critic shared head parameters (#458 )	2021-10-04 23:19:07 +08:00
n+e	fc251ab0b8	bump to v0.4.3 (#432 ) * add makefile * bump version * add isort and yapf * update contributing.md * update PR template * spelling check	2021-09-03 05:05:04 +08:00
Ending Hsiao	a740496a51	fix dual clip implementation (#435 ) close #433	2021-09-02 21:43:14 +08:00
Andriy Drozdyuk	8a5e2190f7	Add Weights and Biases Logger (#427 ) - rename BasicLogger to TensorboardLogger - refactor logger code - add WandbLogger Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>	2021-08-30 22:35:02 +08:00
n+e	e4f4f0e144	fix docs build failure and a bug in a2c/ppo optimizer (#428 ) * fix rtfd build * list + list -> set.union * change seed of test_qrdqn * add py39 test	2021-08-30 02:07:03 +08:00
Yi Su	291be08d43	Add Rainbow DQN (#386 ) - add RainbowPolicy - add `set_beta` method in prio_buffer - add NoisyLinear in utils/network	2021-08-29 23:34:59 +08:00
Andriy Drozdyuk	d161059c3d	Replaced indice by plural indices (#422 )	2021-08-20 21:58:44 +08:00
n+e	ebaca6f8da	add vizdoom example, bump version to 0.4.2 (#384 )	2021-06-26 18:08:41 +08:00
Yi Su	c0bc8e00ca	Add Fully-parameterized Quantile Function (#376 )	2021-06-15 11:59:02 +08:00
Yi Su	f3169b4c1f	Add Implicit Quantile Network (#371 )	2021-05-29 09:44:23 +08:00
Ark	655d5fb14f	Allow researchers to choose whether to use Double DQN (#368 )	2021-05-21 10:53:34 +08:00
Yi Su	b5c3ddabfa	Add discrete Conservative Q-Learning for offline RL (#359 ) Co-authored-by: Yi Su <yi.su@antgroup.com> Co-authored-by: Yi Su <yi.su@antfin.com>	2021-05-12 09:24:48 +08:00
Yuge Zhang	f4e05d585a	Support deterministic evaluation for onpolicy algorithms (#354 )	2021-04-27 21:22:39 +08:00
n+e	ff4d3cd714	Support different state size and fix exception in venv.__del__ (#352 ) - Batch: do not raise error when it finds list of np.array with different shape[0]. - Venv's obs: add try...except block for np.stack(obs_list) - remove venv.__del__ since it is buggy	2021-04-25 15:23:46 +08:00
ChenDRAG	844d7703c3	NPG Mujoco benchmark release (#347 )	2021-04-21 16:31:20 +08:00
ChenDRAG	1dcf65fe21	Add NPG policy (#344 )	2021-04-21 09:52:15 +08:00
ChenDRAG	a57503c0aa	TRPO benchmark release (#340 )	2021-04-19 17:05:06 +08:00
ChenDRAG	5057b5c89e	Add TRPO policy (#337 )	2021-04-16 20:37:12 +08:00
ChenDRAG	dd4a01132c	Fix SAC loss explode (#333 ) * change SAC action_bound_method to "clip" (tanh is hardcoded in forward) * docstring update * modelbase -> modelbased	2021-04-04 17:33:35 +08:00
n+e	09692c84fe	fix numpy>=1.20 typing check (#323 ) Change the behavior of to_numpy and to_torch: from now on, dict is automatically converted to Batch and list is automatically converted to np.ndarray (if an error occurs, raise the exception instead of converting each element in the list).	2021-03-30 16:06:03 +08:00
ChenDRAG	6426a39796	ppo benchmark (#330 )	2021-03-30 11:50:35 +08:00
ChenDRAG	5d580c3662	refactor ppo (#329 )	2021-03-28 18:28:36 +08:00
ChenDRAG	1730a9008a	A2C benchmark for mujoco (#325 )	2021-03-28 13:12:43 +08:00
ChenDRAG	3ac67d9974	refactor A2C/PPO, change behavior of value normalization (#321 )	2021-03-25 10:12:39 +08:00
ChenDRAG	e27b5a26f3	Refactor PG algorithm and change behavior of `compute_episodic_return` (#319 ) - simplify code - apply value normalization (global) and adv norm (per-batch) in on-policy algorithms	2021-03-23 22:05:48 +08:00
ChenDRAG	2c11b6e43b	Add lr_scheduler option for Onpolicy algorithm (#318 ) add lr_scheduler option in PGPolicy/A2CPolicy/PPOPolicy	2021-03-22 16:57:24 +08:00
ChenDRAG	4d92952a7b	Remap action to fit gym's action space (#313 ) Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>	2021-03-21 16:45:50 +08:00
n+e	ec23c7efe9	fix qvalue mask_action error for obs_next (#310 ) * fix #309 * remove for-loop in dqn expl_noise	2021-03-15 08:06:24 +08:00
ChenDRAG	e605bdea94	MuJoCo Benchmark - DDPG, TD3, SAC (#305 ) Releasing Tianshou's SOTA benchmark of 9 out of 13 environments from the MuJoCo Gym task suite.	2021-03-07 19:21:02 +08:00
ChenDRAG	f22b539761	Remove reward_normalization option in offpolicy algorithm (#298 ) * remove rew_norm in nstep implementation * improve test * remove runnable/ * various doc fix Co-authored-by: n+e <trinkle23897@gmail.com>	2021-02-27 11:20:43 +08:00
ChenDRAG	3108b9db0d	Add Timelimit trick to optimize policies (#296 ) * consider timelimit.truncated in calculating returns by default * remove ignore_done	2021-02-26 13:23:18 +08:00
ChenDRAG	7036073649	Trainer refactor : some definition change (#293 ) This PR focuses on some definition changes in the trainer to make it friendlier to use and consistent with typical usage in research papers: it changes `collect-per-step` to `step-per-collect`, adds `update-per-step` / `episode-per-collect` accordingly, and modifies the documentation.	2021-02-21 13:06:02 +08:00
ChenDRAG	150d0ec51b	Step collector implementation (#280 ) This is the third PR of 6 commits mentioned in #274, which features a refactor of Collector to fix #245. You can check #274 for more detail. Things changed in this PR: 1. refactor collector to be cleaner, split AsyncCollector to support asyncvenv; 2. change buffer.add api to add(batch, buffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) 3. add policy.exploration_noise(act, batch) -> act 4. small change in BasePolicy.compute_*_returns 5. move reward_metric from collector to trainer 6. fix np.asanyarray issue (different versions of numpy will result in different output) 7. flake8 maxlength=88 8. polish docs and fix test Co-authored-by: n+e <trinkle23897@gmail.com>	2021-02-19 10:33:49 +08:00
wizardsheng	1eb6137645	Add QR-DQN algorithm (#276 ) This is the PR for QR-DQN algorithm: https://arxiv.org/abs/1710.10044 1. add QR-DQN policy in tianshou/policy/modelfree/qrdqn.py. 2. add QR-DQN net in examples/atari/atari_network.py. 3. add QR-DQN atari example in examples/atari/atari_qrdqn.py. 4. add QR-DQN statement in tianshou/policy/__init__.py. 5. add QR-DQN unit test in test/discrete/test_qrdqn.py. 6. add QR-DQN atari results in examples/atari/results/qrdqn/. 7. add compute_q_value in DQNPolicy and C51Policy to simplify the forward function. 8. move `with torch.no_grad():` from `_target_q` to BasePolicy Running "python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch-size 64" gives best_result: '19.8 ± 0.40' in epoch 8.	2021-01-28 09:27:05 +08:00
Jialu Zhu	a511cb4779	Add offline trainer and discrete BCQ algorithm (#263 ) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com>	2021-01-20 18:13:04 +08:00
wizardsheng	c6f2648e87	Add C51 algorithm (#266 ) This is the PR for C51 algorithm: https://arxiv.org/abs/1707.06887 1. add C51 policy in tianshou/policy/modelfree/c51.py. 2. add C51 net in tianshou/utils/net/discrete.py. 3. add C51 atari example in examples/atari/atari_c51.py. 4. add C51 statement in tianshou/policy/__init__.py. 5. add C51 test in test/discrete/test_c51.py. 6. add C51 atari results in examples/atari/results/c51/. Running "python3 atari_c51.py --task "PongNoFrameskip-v4" --batch-size 64" gives best_result: '20.50 ± 0.50' in epoch 9. Running "python3 atari_c51.py --task "BreakoutNoFrameskip-v4" --n-step 1 --epoch 40" gives best_reward: 407.400000 ± 31.155096 in epoch 39.	2021-01-06 10:17:45 +08:00

93 Commits