Changes:
- Disclaimer in README
- Replaced all occurences of Gym with Gymnasium
- Removed code that is now dead since we no longer need to support the
old step API
- Updated type hints to only allow new step API
- Increased required version of envpool to support Gymnasium
- Increased required version of PettingZoo to support Gymnasium
- Updated `PettingZooEnv` to only use the new step API, removed hack to
also support old API
- I had to add some `# type: ignore` comments, due to new type hinting
in Gymnasium. I'm not that familiar with type hinting but I believe that
the issue is on the Gymnasium side and we are looking into it.
- Had to update `MyTestEnv` to support `options` kwarg
- Skip NNI tests because they still use OpenAI Gym
- Also allow `PettingZooEnv` in vector environment
- Updated doc page about ReplayBuffer to also talk about terminated and
truncated flags.
Still need to do:
- Update the Jupyter notebooks in docs
- Check the entire code base for more dead code (from compatibility
stuff)
- Check the reset functions of all environments/wrappers in code base to
make sure they use the `options` kwarg
- Someone might want to check test_env_finite.py
- Is it okay to allow `PettingZooEnv` in vector environments? Might need
to update docs?
IMHO, unit tests, compared with integration tests or end-to-end tests or
other tests, often means something that only tests a single
method/function/class/etc, and often has a lot of stubs and mocks so it
is far from a typical/real usage scenario. On the other hand,
integration tests or e2e tests mock less and are more like the real
case.
Tianshou says:
> ... tests include the full agent training procedure for all of the
implemented algorithms
It seems that this is more than unit test, and falls into the category
of integration or even e2e tests.
## implementation
I implemented HER solely as a replay buffer. It is done by temporarily
directly re-writing transitions storage (`self._meta`) during the
`sample_indices()` call. The original transitions are cached and will be
restored at the beginning of the next sampling or when other methods is
called. This will make sure that. for example, n-step return calculation
can be done without altering the policy.
There is also a problem with the original indices sampling. The sampled
indices are not guaranteed to be from different episodes. So I decided
to perform re-writing based on the episode. This guarantees that the
sampled transitions from the same episode will have the same re-written
goal. This also make the re-writing ratio calculation slightly differ
from the paper, but it won't be too different if there are many episodes
in the buffer.
In the current commit, HER replay buffer only support 'future' strategy
and online sampling. This is the best of HER in term of performance and
memory efficiency.
I also add a few more convenient replay buffers
(`HERVectorReplayBuffer`, `HERReplayBufferManager`), test env
(`MyGoalEnv`), gym wrapper (`TruncatedAsTerminated`), unit tests, and a
simple example (examples/offline/fetch_her_ddpg.py).
## verification
I have added unit tests for almost everything I have implemented.
HER replay buffer was also tested using DDPG on [`FetchReach-v3`
env](https://github.com/Farama-Foundation/Gymnasium-Robotics). I used
default DDPG parameters from mujoco example and didn't tune anything
further to get this good result! (train script:
examples/offline/fetch_her_ddpg.py).

- This PR adds the checks that are defined in the Makefile as pre-commit
hooks.
- Hopefully, the checks are equivalent to those from the Makefile, but I
can't guarantee it.
- CI remains as it is.
- As I pointed out on discord, I experienced some conflicts between
flake8 and yapf, so it might be better to transition to some other
combination (e.g. black).
This PR implements BCQPolicy, which could be used to train an offline agent in the environment of continuous action space. An experimental result 'halfcheetah-expert-v1' is provided, which is a d4rl environment (for Offline Reinforcement Learning).
Example usage is in the examples/offline/offline_bcq.py.
This PR focus on refactor of logging method to solve bug of nan reward and log interval. After these two pr, hopefully fundamental change of tianshou/data is finished. We then can concentrate on building benchmarks of tianshou finally.
Things changed:
1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer;
2. remove utils.SummaryWriter;
This PR focus on some definition change of trainer to make it more friendly to use and be consistent with typical usage in research papers, typically change `collect-per-step` to `step-per-collect`, add `update-per-step` / `episode-per-collect` accordingly, and modify the documentation.
This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix#245. You can check #274 for more detail.
Things changed in this PR:
1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv;
2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.)
3. add policy.exploration_noise(act, batch) -> act
4. small change in BasePolicy.compute_*_returns
5. move reward_metric from collector to trainer
6. fix np.asanyarray issue (different version's numpy will result in different output)
7. flake8 maxlength=88
8. polish docs and fix test
Co-authored-by: n+e <trinkle23897@gmail.com>
This is the PR for QR-DQN algorithm: https://arxiv.org/abs/1710.10044
1. add QR-DQN policy in tianshou/policy/modelfree/qrdqn.py.
2. add QR-DQN net in examples/atari/atari_network.py.
3. add QR-DQN atari example in examples/atari/atari_qrdqn.py.
4. add QR-DQN statement in tianshou/policy/init.py.
5. add QR-DQN unit test in test/discrete/test_qrdqn.py.
6. add QR-DQN atari results in examples/atari/results/qrdqn/.
7. add compute_q_value in DQNPolicy and C51Policy for simplify forward function.
8. move `with torch.no_grad():` from `_target_q` to BasePolicy
By running "python3 atari_qrdqn.py --task "PongNoFrameskip-v4" --batch-size 64", get best_result': '19.8 ± 0.40', in epoch 8.
This is the first commit of 6 commits mentioned in #274, which features
1. Refactor of `Class Net` to support any form of MLP.
2. Enable type check in utils.network.
3. Relative change in docs/test/examples.
4. Move atari-related network to examples/atari/atari_network.py
Co-authored-by: Trinkle23897 <trinkle23897@gmail.com>
This is the PR for C51algorithm: https://arxiv.org/abs/1707.06887
1. add C51 policy in tianshou/policy/modelfree/c51.py.
2. add C51 net in tianshou/utils/net/discrete.py.
3. add C51 atari example in examples/atari/atari_c51.py.
4. add C51 statement in tianshou/policy/__init__.py.
5. add C51 test in test/discrete/test_c51.py.
6. add C51 atari results in examples/atari/results/c51/.
By running "python3 atari_c51.py --task "PongNoFrameskip-v4" --batch-size 64", get best_result': '20.50 ± 0.50', in epoch 9.
By running "python3 atari_c51.py --task "BreakoutNoFrameskip-v4" --n-step 1 --epoch 40", get best_reward: 407.400000 ± 31.155096 in epoch 39.
Add an indicator(i.e. `self.learning`) of learning will be convenient for distinguishing state of policy.
Meanwhile, the state of `self.training` will be undisputed in the training stage.
Related issue: #211
Others:
- fix a bug in DDQN: target_q could not be sampled from np.random.rand
- fix a bug in DQN atari net: it should add a ReLU before the last layer
- fix a bug in collector timing
Co-authored-by: n+e <463003665@qq.com>