(Issue #512) Random start in the Collector samples actions from the action space, while policies output actions in a fixed range (typically [-1, 1]) and then map them to the action space. The buffer only stores unmapped actions, so the randomly initialized actions are stored incorrectly whenever the action range is not [-1, 1]. This may hurt policy learning, and especially model learning in model-based methods.
This PR fixes it by applying an inverse mapping before adding the random initial actions to the buffer.
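Roughly, the mapping and its inverse look like the following (a minimal sketch with illustrative helper names, not the exact Tianshou implementation):

```python
import numpy as np

def map_action(act, low, high):
    # policy output in [-1, 1] -> environment action space [low, high]
    return low + (act + 1.0) * 0.5 * (high - low)

def inverse_map_action(act, low, high):
    # environment action space [low, high] -> [-1, 1], so that randomly
    # sampled initial actions are stored in the buffer in the same
    # (unmapped) form as policy outputs
    return 2.0 * (act - low) / (high - low) - 1.0

low, high = np.array([0.0]), np.array([4.0])
rand_act = np.array([3.0])                      # sampled from the action space
print(inverse_map_action(rand_act, low, high))  # [0.5], what the buffer stores
```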
- collector.collect() now returns 4 extra keys: rew/rew_std/len/len_std (previously this work was done in the logger)
- save_fn() will be called at the beginning of the trainer
Change the behavior of to_numpy and to_torch: from now on, a dict is automatically converted to Batch and a list is automatically converted to np.ndarray (if an error occurs, the exception is raised instead of converting each element of the list individually).
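A quick sketch of the described behavior (exact conversion rules in edge cases may differ):

```python
import numpy as np
import torch
from tianshou.data import Batch, to_numpy, to_torch

b = to_numpy({"obs": np.zeros(3), "act": np.array([0, 1])})
assert isinstance(b, Batch)          # dict -> Batch

arr = to_numpy([1.0, 2.0, 3.0])
assert isinstance(arr, np.ndarray)   # list -> np.ndarray (or raise on failure)

t = to_torch(np.zeros((2, 3)), dtype=torch.float32)
assert isinstance(t, torch.Tensor)
```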
This PR focuses on refactoring the logging method to fix the NaN-reward and log-interval bugs. After these two PRs, the fundamental changes to tianshou/data should be finished, and we can finally concentrate on building benchmarks for Tianshou.
Things changed:
1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer;
2. remove utils.SummaryWriter;
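Usage changes roughly as follows (a sketch; constructor arguments besides the writer are omitted):

```python
from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger, LazyLogger

logger = BasicLogger(SummaryWriter("log/dqn"))  # wraps the tensorboard writer
# logger = LazyLogger()                         # or: a do-nothing logger
# result = offpolicy_trainer(..., logger=logger)   # instead of writer=writer
```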
This PR focuses on some definition changes in the trainer to make it friendlier to use and consistent with typical usage in research papers: it renames `collect-per-step` to `step-per-collect`, adds `update-per-step` / `episode-per-collect` accordingly, and updates the documentation.
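A sketch of the renamed arguments in an off-policy trainer call (assuming `policy`, `train_collector`, `test_collector`, and `logger` are already constructed; the values are placeholders):

```python
from tianshou.trainer import offpolicy_trainer

result = offpolicy_trainer(
    policy, train_collector, test_collector,
    max_epoch=10, step_per_epoch=10000,
    step_per_collect=10,    # env steps collected before each update phase
    update_per_step=0.1,    # gradient updates per collected env step
    episode_per_test=100, batch_size=64,
    logger=logger,
)
```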
This is the third PR of the 6 commits mentioned in #274, which features a refactor of Collector to fix #245. You can check #274 for more details.
Things changed in this PR:
1. refactor the collector to be cleaner, and split out AsyncCollector to support async venvs;
2. change the buffer.add API to add(batch, buffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) -- see the sketch after this list;
3. add policy.exploration_noise(act, batch) -> act
4. small change in BasePolicy.compute_*_returns
5. move reward_metric from collector to trainer
6. fix the np.asanyarray issue (different numpy versions produce different outputs);
7. set flake8 max-line-length to 88;
8. polish docs and fix tests.
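A sketch of the new buffer API (assuming the API right after this refactor; the shapes are arbitrary):

```python
import numpy as np
from tianshou.data import Batch, VectorReplayBuffer

# one buffer shared by 4 parallel envs, 20000 transitions in total
buf = VectorReplayBuffer(total_size=20000, buffer_num=4)

# one transition per env; buffer_ids says which sub-buffer each row goes to
batch = Batch(
    obs=np.zeros((4, 3)), act=np.zeros(4),
    rew=np.zeros(4), done=np.zeros(4, dtype=bool),
)
buf.add(batch, buffer_ids=[0, 1, 2, 3])
```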
Co-authored-by: n+e <trinkle23897@gmail.com>
This PR separates the `global_step` into `env_step` and `gradient_step`. In the future, the data from the collecting state will be stored under `env_step`, and the data from the updating state will be stored under `gradient_step`.
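A minimal illustration of the idea (not the trainer's actual logging code):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("log/example")
env_step, gradient_step = 1000, 250  # hypothetical counters

# data from the collecting state is indexed by env_step ...
writer.add_scalar("train/reward", 42.0, global_step=env_step)
# ... while data from the updating state is indexed by gradient_step
writer.add_scalar("loss", 0.5, global_step=gradient_step)
```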
Others:
- add `rew_std` and `best_result` into the monitor
- fix the unbounded network output in `test/continuous/test_sac_with_il.py` and `examples/box2d/bipedal_hardcore_sac.py`
- change the dependency of ray to 1.0.0 since ray-project/ray#10134 has been resolved
Adding an indicator of learning (i.e. `self.learning`) is convenient for distinguishing the state of the policy.
Meanwhile, the meaning of `self.training` remains unambiguous in the training stage.
Related issue: #211
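A minimal sketch of how such a flag could work (the attribute name follows the description above; the actual implementation may differ):

```python
import torch.nn as nn

class Policy(nn.Module):
    """Toy policy showing a separate `learning` flag alongside `training`."""

    def __init__(self) -> None:
        super().__init__()
        self.learning = False  # True only while performing gradient updates

    def update(self, batch) -> None:
        self.learning = True   # "updating" is now distinct from .train() mode
        try:
            pass               # compute loss and backpropagate here
        finally:
            self.learning = False
```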
Others:
- fix a bug in DDQN: target_q could not be sampled from np.random.rand
- fix a bug in the DQN Atari net: a ReLU should be added before the last layer
- fix a bug in collector timing
Co-authored-by: n+e <463003665@qq.com>
Cherry-pick from #200
- update the function signature
- format code-style
- move _compile into separate functions
- fix a bug in to_torch and to_numpy (Batch)
- remove None in action_range
In short, the code-format change only covers function-signature style and `'` -> `"` (picked up from [black](https://github.com/psf/black)).
1. add policy.eval() in all test scripts' "watch performance"
2. remove dict return support for collector preprocess_fn
3. add `__contains__` and `pop` in Batch: `key in batch`, `batch.pop(key, deft)` -- see the sketch after this list
4. collect exactly n_episode episodes when n_episode is given as a list (per-env limits), and save fake data in cache_buffer when self.buffer is None (#184)
5. fix tensorboard logging: the horizontal axis now stands for env step instead of gradient step; add test results to tensorboard
6. add test_returns (both GAE and nstep)
7. change the type-checking order in batch.py and converter.py to check the most common case first
8. fix shape inconsistency for torch.Tensor in replay buffer
9. remove `**kwargs` in ReplayBuffer
10. remove default value in batch.split() and add merge_last argument (#185)
11. improve nstep efficiency
12. add max_batchsize in onpolicy algorithms
13. potential bugfix for subproc.wait
14. fix RecurrentActorProb
15. improve code coverage (from 90% to 95%) and remove dead code
16. fix some incorrect type annotations
The above improvements also increase the training FPS: on my computer, the previous version runs at only ~1800 FPS, while this one reaches ~2050 FPS (faster than v0.2.4.post1).
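A quick sketch of the new Batch helpers from items 3 and 10 above (signatures as described; treat the details as assumptions):

```python
from tianshou.data import Batch

b = Batch(obs=[1, 2, 3, 4], act=[0, 1, 0, 1])

assert "obs" in b              # __contains__
act = b.pop("act", None)       # pop with a default value

# merge_last merges the trailing mini-batch into the previous one
for mini in b.split(3, shuffle=False, merge_last=True):
    print(len(mini.obs))       # prints 4 once instead of 3 then 1
```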
- Refactor code to remove duplication
- Enable async simulation for all vector envs
- Remove `collector.close` and rename `VectorEnv` to `DummyVectorEnv`
The abstraction of vector envs has changed.
Prior to this PR, each vector env was almost independent.
After this PR, each env is wrapped into a worker, and vector envs differ only in their worker type. In fact, users can just use `BaseVectorEnv` with different workers; `SubprocVectorEnv` and `ShmemVectorEnv` are kept for backward compatibility.
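For example (a sketch; any gym environment works):

```python
import gym
from tianshou.env import DummyVectorEnv, SubprocVectorEnv

env_fns = [lambda: gym.make("CartPole-v0") for _ in range(4)]

train_envs = DummyVectorEnv(env_fns)      # in-process (dummy) workers
# train_envs = SubprocVectorEnv(env_fns)  # same interface, subprocess workers
obs = train_envs.reset()                  # stacked observations, shape (4, ...)
```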
Co-authored-by: n+e <463003665@qq.com>
Co-authored-by: magicly <magicly007@gmail.com>
* add policy.update to enable post process and remove collector.sample
* update doc in policy concept
* remove collector.sample in doc
* doc update of concepts
* docs
* polish
* polish policy
* remove collector.sample in docs
* minor fix
* Apply suggestions from code review
just a test
* doc fix
Co-authored-by: Trinkle23897 <463003665@qq.com>
Unify the implementation with multi-environments (wrap a single environment in a multi-environment with one env) to greatly simplify the code.
This changes the behavior of the single-environment case.
Prior to this PR, for a single environment, collector.collect(n_step=n) stepped exactly n steps.
After this PR, for a single environment, collector.collect(n_step=n) steps m episodes until the total number of steps exceeds n.
That is to say, collectors now always collect full episodes.
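An illustrative sketch of the new n_step semantics (plain Python, not the actual Collector code):

```python
# keep collecting whole episodes until at least n_step steps are gathered
def collect_at_least_n_steps(run_one_episode, n_step):
    total_steps, n_episodes = 0, 0
    while total_steps < n_step:           # always finish the current episode,
        total_steps += run_one_episode()  # so the total may exceed n_step
        n_episodes += 1
    return total_steps, n_episodes

# e.g. with episodes of length 7, n_step=20 collects 3 full episodes (21 steps)
print(collect_at_least_n_steps(lambda: 7, 20))  # (21, 3)
```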
* code refactor; remove unused kwargs; add reward_normalization for dqn
* bugfix for __setitem__ with torch.Tensor; add Batch.condense
* minor fix
* support cat with empty Batch
* remove the dependency of is_empty on len; specify the semantic of empty Batch by test cases
* support stack with empty Batch
* remove condense
* refactor code to reflect the shared / partial / reserved categories of keys
* add is_empty(recursive=False)
* doc fix
* docfix and bugfix for _is_batch_set
* add doc for key reservation
* bugfix for algebra operators
* fix cat with lens hint
* code refactor
* bugfix for storing None
* use ValueError instead of exception
* hide lens away from users
* add comment for __cat
* move the computation of the initial value of lens in cat_ itself.
* change the place of doc string
* doc fix for Batch doc string
* change recursive to recurse
* doc string fix
* minor fix for batch doc
* remove multibuf
* reward_metric
* make fields empty Batch rather than None after reset
* many fixes and refactor
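As an example of the resulting empty-Batch semantics (mirroring the test cases mentioned above):

```python
from tianshou.data import Batch

b = Batch(a=Batch(), b=Batch(c=Batch()))
print(b.is_empty())              # False: keys are reserved at the top level
print(b.is_empty(recurse=True))  # True: no actual data is stored anywhere
print(Batch().is_empty())        # True
```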
Co-authored-by: Trinkle23897 <463003665@qq.com>
* in-place empty_ for Batch
* change Batch.empty to in-place fill; add copy option for Batch construction
* type signature & remove shadowed names for copy
* add doc for data type (only support numbers and object data type)
* add unit test for Batch copy
* fix pep8
* add test case for Batch.empty
* doc fix
* fix pep8
* use object to test Batch
* test commit
* refact
* change Batch(copy) testcase
* minor fix
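A short sketch of the copy option and the in-place empty_ (argument names as described above):

```python
import numpy as np
from tianshou.data import Batch

data = {"obs": np.arange(3)}
b = Batch(data, copy=True)  # deep-copy the input instead of keeping references
data["obs"][0] = 100
print(b.obs)                # [0 1 2] -- unaffected by the mutation above

b.empty_()                  # in-place reset of the stored values
print(b.obs)                # zeros
```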
Co-authored-by: Trinkle23897 <463003665@qq.com>
* Enable stacking of Batch instances. Add Batch.cat static method. Rename cat to cat_ since it is in-place.
* Properly handle Batch init using np.array of dict.
* WIP
* Get rid of metadata.
* Update UT. Replace cat by cat_ everywhere.
* Do not sort Batch keys anymore for efficiency. Add items method.
* Fix cat copy issue.
* Add unit test to check cat and stack methods.
* Remove unused import.
* Fix linter issues.
* Fix unit tests.
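For example (a sketch of the resulting API):

```python
import numpy as np
from tianshou.data import Batch

b1 = Batch(obs=np.zeros((2, 3)), rew=np.array([1.0, 2.0]))
b2 = Batch(obs=np.ones((2, 3)), rew=np.array([3.0, 4.0]))

cat = Batch.cat([b1, b2])    # concatenate along axis 0 -> obs shape (4, 3)
stk = Batch.stack([b1, b2])  # stack along a new axis   -> obs shape (2, 2, 3)
b1.cat_(b2)                  # in-place variant of cat
```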
Co-authored-by: Alexis Duburcq <alexis.duburcq@wandercraft.eu>
* Add auto alpha tuning and exploration noise for sac.
Add class BaseNoise and GaussianNoise for the concept of exploration noise.
Add a new test for SAC on MountainCarContinuous-v0,
which should benefit from the two new features above.
* add exploration noise to collector, fix example to adapt modification
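A rough sketch of both features (the alpha tuple layout follows Tianshou's SAC auto-alpha convention as I understand it; treat the details as assumptions):

```python
import torch
from tianshou.exploration import GaussianNoise

noise = GaussianNoise(sigma=0.1)
print(noise((4,)))  # a shape-(4,) Gaussian exploration-noise sample

# automatic entropy-coefficient tuning: pass (target_entropy, log_alpha, optim)
# as SAC's alpha argument instead of a fixed float
target_entropy = -4.0  # typically -np.prod(action_space.shape)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)
# policy = SACPolicy(..., alpha=(target_entropy, log_alpha, alpha_optim))
```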
* Enable converting Batch data back to torch.
* Add torch converter to collector.
* Fix
* Move the to_numpy/to_torch converters into a dedicated utils.py.
* Use to_numpy/to_torch to convert arrays.
* fix lint
* fix
* Add unit test to check Batch from/to numpy.
* Fix Batch over Batch.
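The resulting conversion methods on Batch look roughly like this (a sketch):

```python
import numpy as np
import torch
from tianshou.data import Batch

b = Batch(obs=np.zeros((2, 3)), act=np.array([0, 1]))
b.to_torch(dtype=torch.float32, device="cpu")  # in-place: numpy -> torch
assert isinstance(b.obs, torch.Tensor)
b.to_numpy()                                   # in-place: torch -> numpy
assert isinstance(b.obs, np.ndarray)
```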
Co-authored-by: Alexis Duburcq <alexis.duburcq@wandercraft.eu>