238 Commits

Author SHA1 Message Date
Anas BELFADIL
d976a5aa91
Fixed hardcoded reward_treshold (#548) 2022-03-04 10:35:39 +08:00
Jiayi Weng
c248b4f87e
fix conda support and keep API compatibility (#536)
* loose constrains

* fix nni issue (#478)

* fix coverage
2022-02-26 00:05:02 +08:00
Chengqi Duan
23fbc3b712
upgrade gym version to >=0.21, fix related CI and update examples/atari (#534)
Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>
2022-02-25 07:40:33 +08:00
Mohammad Mahdi Rahimi
c7e2e56fac
Pettingzoo support (#494)
Co-authored-by: Rodrigo de Lazcano <r.l.p.v96@gmail.com>
Co-authored-by: J K Terry <justinkterry@gmail.com>
2022-02-15 22:56:45 +08:00
Jiayi Weng
3d697aa4c6
make unit test faster (#522)
* test cache expert data in offline training

* faster cql test

* faster tests

* use dummy

* test ray dependency
2022-02-09 00:24:52 +08:00
Chengqi Duan
9c100e0705
Enable venvs.reset() concurrent execution (#517)
- change the internal API name of worker: send_action -> send, get_result -> recv (align with envpool)
- add a timing test for venvs.reset() to make sure the concurrent execution
- change venvs.reset() logic

Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>
2022-02-08 00:40:01 +08:00
ChenDRAG
c25926dd8f
Formalize variable names (#509)
Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>
2022-01-30 00:53:56 +08:00
Bernard Tan
bc53ead273
Implement CQLPolicy and offline_cql example (#506) 2022-01-16 05:30:21 +08:00
Yi Su
a59d96d041
Add Intrinsic Curiosity Module (#503) 2022-01-15 02:43:48 +08:00
Yi Su
3592f45446
Fix critic network for Discrete CRR (#485)
- Fixes an inconsistency in the implementation of Discrete CRR. Now it uses `Critic` class for its critic, following conventions in other actor-critic policies;
- Updates several offline policies to use `ActorCritic` class for its optimizer to eliminate randomness caused by parameter sharing between actor and critic;
- Add `writer.flush()` in TensorboardLogger to ensure real-time result;
- Enable `test_collector=None` in 3 trainers to turn off testing during training;
- Updates the Atari offline results in README.md;
- Moves Atari offline RL examples to `examples/offline`; tests to `test/offline` per review comments.
2021-11-28 23:10:28 +08:00
Bernard Tan
5c5a3db94e
Implement BCQPolicy and offline_bcq example (#480)
This PR implements BCQPolicy, which could be used to train an offline agent in the environment of continuous action space. An experimental result 'halfcheetah-expert-v1' is provided, which is a d4rl environment (for Offline Reinforcement Learning).
Example usage is in the examples/offline/offline_bcq.py.
2021-11-22 22:21:02 +08:00
Markus28
8f19a86966
Implements set_env_attr and get_env_attr for vector environments (#478)
close #473
2021-11-03 00:08:00 +08:00
Jiayi Weng
e45e2096d8
add multi-GPU support (#461)
add a new class DataParallelNet
2021-10-06 01:39:14 +08:00
Jiayi Weng
5df64800f4
final fix for actor_critic shared head parameters (#458) 2021-10-04 23:19:07 +08:00
Ayush Chaurasia
22d7bf38c8
Improve W&B logger (#441)
- rename WandBLogger -> WandbLogger
- add save_data and restore_data
- allow more input arguments for wandb init
- integrate wandb into test/modelbase/test_psrl.py and examples/atari/atari_dqn.py
- documentation update
2021-09-24 21:52:23 +08:00
n+e
fc251ab0b8
bump to v0.4.3 (#432)
* add makefile
* bump version
* add isort and yapf
* update contributing.md
* update PR template
* spelling check
2021-09-03 05:05:04 +08:00
Andriy Drozdyuk
8a5e2190f7
Add Weights and Biases Logger (#427)
- rename BasicLogger to TensorboardLogger
- refactor logger code
- add WandbLogger

Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>
2021-08-30 22:35:02 +08:00
n+e
e4f4f0e144
fix docs build failure and a bug in a2c/ppo optimizer (#428)
* fix rtfd build

* list + list -> set.union

* change seed of test_qrdqn

* add py39 test
2021-08-30 02:07:03 +08:00
Yi Su
291be08d43
Add Rainbow DQN (#386)
- add RainbowPolicy
- add `set_beta` method in prio_buffer
- add NoisyLinear in utils/network
2021-08-29 23:34:59 +08:00
Andriy Drozdyuk
d161059c3d
Replaced indice by plural indices (#422) 2021-08-20 21:58:44 +08:00
n+e
5b7732a29b
make ppo discrete test script more general (#418) 2021-08-15 21:37:37 +08:00
n+e
c19876179a
add env_id in preprocess fn (#391) 2021-07-05 09:50:39 +08:00
n+e
ebaca6f8da
add vizdoom example, bump version to 0.4.2 (#384) 2021-06-26 18:08:41 +08:00
Yi Su
c0bc8e00ca
Add Fully-parameterized Quantile Function (#376) 2021-06-15 11:59:02 +08:00
Yi Su
f3169b4c1f
Add Implicit Quantile Network (#371) 2021-05-29 09:44:23 +08:00
Yi Su
8f7bc65ac7
Add discrete Critic Regularized Regression (#367) 2021-05-19 13:29:56 +08:00
Yi Su
b5c3ddabfa
Add discrete Conservative Q-Learning for offline RL (#359)
Co-authored-by: Yi Su <yi.su@antgroup.com>
Co-authored-by: Yi Su <yi.su@antfin.com>
2021-05-12 09:24:48 +08:00
Ark
84f58636eb
Make trainer resumable (#350)
- specify tensorboard >= 2.5.0
- add `save_checkpoint_fn` and `resume_from_log` in trainer

Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
2021-05-06 08:53:53 +08:00
Yuge Zhang
f4e05d585a
Support deterministic evaluation for onpolicy algorithms (#354) 2021-04-27 21:22:39 +08:00
n+e
ff4d3cd714
Support different state size and fix exception in venv.__del__ (#352)
- Batch: do not raise error when it finds list of np.array with different shape[0].

- Venv's obs: add try...except block for np.stack(obs_list)

- remove venv.__del__ since it is buggy
2021-04-25 15:23:46 +08:00
ChenDRAG
1dcf65fe21
Add NPG policy (#344) 2021-04-21 09:52:15 +08:00
n+e
c059f98abf
fix atari_bcq (#345) 2021-04-20 22:59:21 +08:00
ChenDRAG
5057b5c89e
Add TRPO policy (#337) 2021-04-16 20:37:12 +08:00
ChenDRAG
dd4a01132c
Fix SAC loss explode (#333)
* change SAC action_bound_method to "clip" (tanh is hardcoded in forward)

* docstring update

* modelbase -> modelbased
2021-04-04 17:33:35 +08:00
n+e
825da9bc53
add cross-platform test and release 0.4.1 (#331)
* bump to 0.4.1

* add cross-platform test
2021-03-31 15:14:22 +08:00
n+e
09692c84fe
fix numpy>=1.20 typing check (#323)
Change the behavior of to_numpy and to_torch: from now on, dict is automatically converted to Batch and list is automatically converted to np.ndarray (if an error occurs, raise the exception instead of converting each element in the list).
2021-03-30 16:06:03 +08:00
ChenDRAG
6426a39796
ppo benchmark (#330) 2021-03-30 11:50:35 +08:00
Yuge Zhang
7db21f3df6
Test on finite vector env (#324)
add test/base/test_env_finite.py
2021-03-25 22:59:34 +08:00
ChenDRAG
3ac67d9974
refactor A2C/PPO, change behavior of value normalization (#321) 2021-03-25 10:12:39 +08:00
ChenDRAG
e27b5a26f3
Refactor PG algorithm and change behavior of compute_episodic_return (#319)
- simplify code
- apply value normalization (global) and adv norm (per-batch) in on-policy algorithms
2021-03-23 22:05:48 +08:00
ChenDRAG
4d92952a7b
Remap action to fit gym's action space (#313)
Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
2021-03-21 16:45:50 +08:00
n+e
ec23c7efe9
fix qvalue mask_action error for obs_next (#310)
* fix #309
* remove for-loop in dqn expl_noise
2021-03-15 08:06:24 +08:00
ChenDRAG
243ab43b3c
support observation normalization in BaseVectorEnv (#308)
add RunningMeanStd
2021-03-11 20:50:20 +08:00
ChenDRAG
f22b539761
Remove reward_normaliztion option in offpolicy algorithm (#298)
* remove rew_norm in nstep implementation
* improve test
* remove runnable/
* various doc fix

Co-authored-by: n+e <trinkle23897@gmail.com>
2021-02-27 11:20:43 +08:00
ChenDRAG
3108b9db0d
Add Timelimit trick to optimize policies (#296)
* consider timelimit.truncated in calculating returns by default
* remove ignore_done
2021-02-26 13:23:18 +08:00
ChenDRAG
9b61bc620c add logger (#295)
This PR focus on refactor of logging method to solve bug of nan reward and log interval. After these two pr, hopefully fundamental change of tianshou/data is finished. We then can concentrate on building benchmarks of tianshou finally.

Things changed:

1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer;
2. remove utils.SummaryWriter;
2021-02-24 14:48:42 +08:00
Trinkle23897
e99e1b0fdd Improve buffer.prev() & buffer.next() (#294) 2021-02-22 19:19:22 +08:00
ChenDRAG
7036073649
Trainer refactor : some definition change (#293)
This PR focus on some definition change of trainer to make it more friendly to use and be consistent with typical usage in research papers, typically change `collect-per-step` to `step-per-collect`, add `update-per-step` / `episode-per-collect` accordingly, and modify the documentation.
2021-02-21 13:06:02 +08:00
ChenDRAG
150d0ec51b
Step collector implementation (#280)
This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix #245. You can check #274 for more detail.

Things changed in this PR:

1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv;
2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.)
3. add policy.exploration_noise(act, batch) -> act
4. small change in BasePolicy.compute_*_returns
5. move reward_metric from collector to trainer
6. fix np.asanyarray issue (different version's numpy will result in different output)
7. flake8 maxlength=88
8. polish docs and fix test

Co-authored-by: n+e <trinkle23897@gmail.com>
2021-02-19 10:33:49 +08:00
Trinkle23897
d918022ce9 merge master into dev 2021-02-18 12:46:55 +08:00