This PR changes some trainer definitions to make the trainer friendlier to use and consistent with typical usage in research papers: it changes `collect-per-step` to `step-per-collect`, adds `update-per-step` / `episode-per-collect` accordingly, and updates the documentation.
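As a rough illustration of how the renamed parameters relate, here is a minimal sketch. It assumes that after each collection of `step_per_collect` environment steps the trainer performs `round(update_per_step * step_per_collect)` gradient updates; that rule and the helper name `updates_per_collect` are assumptions for illustration, not something stated in this PR.

```python
def updates_per_collect(step_per_collect: int, update_per_step: float) -> int:
    """Gradient updates performed after one collection (assumed rounding rule)."""
    return round(update_per_step * step_per_collect)


# Collect 256 env steps, then update 0.3 times per collected step.
n_updates = updates_per_collect(step_per_collect=256, update_per_step=0.3)
print(n_updates)  # 77
```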
This is the third of the 6 commits mentioned in #274; it refactors the Collector to fix #245. See #274 for more detail.
Things changed in this PR:
1. refactor collector to be cleaner; split out AsyncCollector to support asyncvenv;
2. change buffer.add api to add(batch, buffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.)
3. add policy.exploration_noise(act, batch) -> act
4. small change in BasePolicy.compute_*_returns
5. move reward_metric from collector to trainer
6. fix np.asanyarray issue (different numpy versions produce different output)
7. flake8 maxlength=88
8. polish docs and fix test
Co-authored-by: n+e <trinkle23897@gmail.com>
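The `policy.exploration_noise(act, batch) -> act` hook from item 3 can be pictured with the sketch below. It is a free function rather than a policy method, and it uses epsilon-greedy noise on discrete actions; the `eps`, `n_actions` and `rng` parameters are hypothetical additions for illustration, and `batch` is accepted only to mirror the signature (it is unused here).

```python
import random
from typing import List, Optional


def exploration_noise(
    act: List[int],
    batch: Optional[dict] = None,  # unused here; kept to mirror the signature
    eps: float = 0.1,              # hypothetical: probability of a random action
    n_actions: int = 4,            # hypothetical: size of the action space
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Replace each action with a uniformly random one with probability eps."""
    rng = rng or random.Random()
    return [rng.randrange(n_actions) if rng.random() < eps else a for a in act]


greedy = [0, 1, 2, 3]
print(exploration_noise(greedy, eps=0.0))  # unchanged when eps is 0
```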
This PR separates the `global_step` into `env_step` and `gradient_step`. In the future, the data from the collecting state will be stored under `env_step`, and the data from the updating state will be stored under `gradient_step`.
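A minimal sketch of the two-counter idea follows. The `StepLogger` class, its method names and the tag strings are hypothetical; the only point taken from the description above is that collect-time data is indexed by `env_step` and update-time data by `gradient_step`.

```python
from collections import defaultdict


class StepLogger:
    """Toy logger that keeps separate x-axes for collecting and updating."""

    def __init__(self) -> None:
        self.env_step = 0
        self.gradient_step = 0
        self.records = defaultdict(list)  # tag -> list of (step, value)

    def log_collect(self, n_step: int, reward: float) -> None:
        # Data from the collecting state is stored under env_step.
        self.env_step += n_step
        self.records["collect/rew"].append((self.env_step, reward))

    def log_update(self, loss: float) -> None:
        # Data from the updating state is stored under gradient_step.
        self.gradient_step += 1
        self.records["update/loss"].append((self.gradient_step, loss))


logger = StepLogger()
logger.log_collect(n_step=10, reward=1.0)
logger.log_update(loss=0.5)
print(logger.env_step, logger.gradient_step)  # 10 1
```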
Others:
- add `rew_std` and `best_result` into the monitor
- fix network unbounded in `test/continuous/test_sac_with_il.py` and `examples/box2d/bipedal_hardcore_sac.py`
- change the dependency of ray to 1.0.0 since ray-project/ray#10134 has been resolved
Adding a learning indicator (i.e. `self.learning`) makes it convenient to distinguish the state of the policy.
Meanwhile, the state of `self.training` stays unambiguous throughout the training stage.
Related issue: #211
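One way to picture the two flags is the sketch below. The `ToyPolicy` class is hypothetical, and the use of `self.learning` to switch off exploration inside `update` is an assumption made for illustration; the description above only says that the flag distinguishes the policy's state while `self.training` stays set during training.

```python
class ToyPolicy:
    """Toy policy with a train/eval flag and a separate learning flag."""

    def __init__(self) -> None:
        self.training = True   # train vs. eval mode; stays True while training
        self.learning = False  # True only while an update is in progress

    def forward(self, obs):
        # Assumed behavior: explore only when collecting in training mode.
        explore = self.training and not self.learning
        return {"obs": obs, "explore": explore}

    def update(self, batch):
        self.learning = True
        try:
            return [self.forward(obs) for obs in batch]
        finally:
            self.learning = False  # always restore the collecting state


policy = ToyPolicy()
print(policy.forward(0)["explore"])    # True while collecting
print(policy.update([1])[0]["explore"])  # False while learning
```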
Others:
- fix a bug in DDQN: target_q could not be sampled from np.random.rand
- fix a bug in DQN atari net: it should add a ReLU before the last layer
- fix a bug in collector timing
Co-authored-by: n+e <463003665@qq.com>
Cherry-pick from #200
- update the function signature
- format code-style
- move _compile into separate functions
- fix a bug in to_torch and to_numpy (Batch)
- remove None in action_range
In short, the code-format changes only cover function-signature style and `'` -> `"` (picked up from [black](https://github.com/psf/black)).
1. add policy.eval() in all test scripts' "watch performance"
2. remove dict return support for collector preprocess_fn
3. add `__contains__` and `pop` in batch: `key in batch`, `batch.pop(key, deft)`
4. collect exactly n_episode when n_episode is given as a list of limits, and save fake data in cache_buffer when self.buffer is None (#184)
5. fix tensorboard logging: the horizontal axis stands for env step instead of gradient step; add test results into tensorboard
6. add test_returns (both GAE and nstep)
7. change the type-checking order in batch.py and converter.py so that the most common case is checked first
8. fix shape inconsistency for torch.Tensor in replay buffer
9. remove `**kwargs` in ReplayBuffer
10. remove default value in batch.split() and add merge_last argument (#185)
11. improve nstep efficiency
12. add max_batchsize in onpolicy algorithms
13. potential bugfix for subproc.wait
14. fix RecurrentActorProb
15. improve the code-coverage (from 90% to 95%) and remove the dead code
16. fix some incorrect type annotation
The above improvements also increase the training FPS: on my computer, the previous version reaches only ~1800 FPS, while this version reaches ~2050 (faster than v0.2.4.post1).
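The `merge_last` argument from item 10 can be sketched on plain index lists as below. The helper `split_indices` is hypothetical, and the semantics are an assumption: when the final chunk would be shorter than `size`, it is merged into the previous chunk instead of being yielded on its own.

```python
from typing import Iterator, List


def split_indices(length: int, size: int, merge_last: bool = False) -> Iterator[List[int]]:
    """Yield consecutive index chunks of `size`; optionally merge a short tail."""
    if size <= 0:
        raise ValueError("size must be positive")
    starts = list(range(0, length, size))
    # Drop the last start so the short tail is absorbed by the previous chunk.
    if merge_last and len(starts) > 1 and length % size != 0:
        starts.pop()
    for i, start in enumerate(starts):
        end = length if i == len(starts) - 1 else start + size
        yield list(range(start, end))


print([len(c) for c in split_indices(10, 4)])                   # [4, 4, 2]
print([len(c) for c in split_indices(10, 4, merge_last=True)])  # [4, 6]
```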
- Refactor code to remove duplicate code
- Enable async simulation for all vector envs
- Remove `collector.close` and rename `VectorEnv` to `DummyVectorEnv`
The abstraction of vector env has changed.
Prior to this PR, each vector env was almost independent.
After this PR, each env is wrapped into a worker, and vector envs differ only in their worker type. In fact, users can just use `BaseVectorEnv` with different workers; I keep `SubprocVectorEnv` and `ShmemVectorEnv` for backward compatibility.
Co-authored-by: n+e <463003665@qq.com>
Co-authored-by: magicly <magicly007@gmail.com>
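The worker-based abstraction described above can be sketched as follows. All class names here (`ToyWorker`, `InProcessWorker`, `ToyVectorEnv`, `CounterEnv`) are hypothetical stand-ins rather than the classes in this PR; the sketch only shows one vector env class whose behavior is selected by the worker type it is given.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List


class ToyWorker(ABC):
    """Wraps a single env; subclasses decide where and how the env runs."""

    def __init__(self, env_fn: Callable[[], Any]) -> None:
        self.env = env_fn()

    @abstractmethod
    def step(self, action: Any) -> Any:
        ...


class InProcessWorker(ToyWorker):
    """Runs the env in the current process (no parallelism)."""

    def step(self, action: Any) -> Any:
        return self.env.step(action)


class ToyVectorEnv:
    """One vector env class; the worker type is the only varying part."""

    def __init__(self, env_fns: List[Callable[[], Any]],
                 worker_fn: Callable[[Callable[[], Any]], ToyWorker]) -> None:
        self.workers = [worker_fn(fn) for fn in env_fns]

    def step(self, actions: List[Any]) -> List[Any]:
        return [w.step(a) for w, a in zip(self.workers, actions)]


class CounterEnv:
    """Trivial env: returns the running sum of the actions it received."""

    def __init__(self) -> None:
        self.total = 0

    def step(self, action: int) -> int:
        self.total += action
        return self.total


venv = ToyVectorEnv([CounterEnv, CounterEnv], InProcessWorker)
print(venv.step([1, 2]))  # [1, 2]
```

A subprocess or shared-memory variant would then be another `ToyWorker` subclass passed as `worker_fn`, with `ToyVectorEnv` left unchanged.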
* add policy.update to enable post process and remove collector.sample
* update doc in policy concept
* remove collector.sample in doc
* doc update of concepts
* docs
* polish
* polish policy
* remove collector.sample in docs
* minor fix
* Apply suggestions from code review
just a test
* doc fix
Co-authored-by: Trinkle23897 <463003665@qq.com>
Unify the implementation with multi-environments (wrap a single environment in a multi-environment with one env) to greatly simplify the code.
This changes the behavior for a single environment.
Prior to this PR, for a single environment, collector.collect(n_step=n) steps n steps.
After this PR, for a single environment, collector.collect(n_step=n) steps m episodes until the total number of steps is greater than n.
That is to say, collectors now always collect full episodes.
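The new stopping rule can be illustrated with the toy function below, which is not the collector itself. `collect_full_episodes` and its inputs are hypothetical; it also assumes that collection stops once the step count reaches `n_step`, so whether a tie (steps equal to `n_step`) stops collection is an assumption not settled by the text above.

```python
from typing import List, Tuple


def collect_full_episodes(episode_lengths: List[int], n_step: int) -> Tuple[int, int]:
    """Consume whole episodes until at least n_step steps have been taken.

    Returns (number of episodes, number of steps actually collected).
    """
    steps = 0
    episodes = 0
    for length in episode_lengths:
        if steps >= n_step:  # assumed tie rule: stop as soon as n_step is reached
            break
        steps += length      # an episode is never cut short
        episodes += 1
    return episodes, steps


# Episodes of 4 steps each, asking for 10 steps: 3 full episodes, 12 steps.
print(collect_full_episodes([4, 4, 4, 4], n_step=10))  # (3, 12)
```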
* code refactor; remove unused kwargs; add reward_normalization for dqn
* bugfix for __setitem__ with torch.Tensor; add Batch.condense
* minor fix
* support cat with empty Batch
* remove the dependency of is_empty on len; specify the semantic of empty Batch by test cases
* support stack with empty Batch
* remove condense
* refactor code to reflect the shared / partial / reserved categories of keys
* add is_empty(recursive=False)
* doc fix
* docfix and bugfix for _is_batch_set
* add doc for key reservation
* bugfix for algebra operators
* fix cat with lens hint
* code refactor
* bugfix for storing None
* use ValueError instead of exception
* hide lens away from users
* add comment for __cat
* move the computation of the initial value of lens in cat_ itself.
* change the place of doc string
* doc fix for Batch doc string
* change recursive to recurse
* doc string fix
* minor fix for batch doc
* remove multibuf
* reward_metric
* make fields with empty Batch rather than None after reset
* many fixes and refactor
Co-authored-by: Trinkle23897 <463003665@qq.com>
* in-place empty_ for Batch
* change Batch.empty to in-place fill; add copy option for Batch construction
* type signature & remove shadow names for copy
* add doc for data type (only support numbers and object data type)
* add unit test for Batch copy
* fix pep8
* add test case for Batch.empty
* doc fix
* fix pep8
* use object to test Batch
* test commit
* refactor
* change Batch(copy) testcase
* minor fix
Co-authored-by: Trinkle23897 <463003665@qq.com>
* Enable stacking Batch instances. Add Batch cat static method. Rename cat to cat_ since it is in-place.
* Properly handle Batch init using np.array of dict.
* WIP
* Get rid of metadata.
* Update UT. Replace cat by cat_ everywhere.
* Do not sort Batch keys anymore for efficiency. Add items method.
* Fix cat copy issue.
* Add unit test to check cat and stack methods.
* Remove unused import.
* Fix linter issues.
* Fix unit tests.
Co-authored-by: Alexis Duburcq <alexis.duburcq@wandercraft.eu>
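The difference between the operations named in this commit can be sketched on plain dicts of lists. The functions below are hypothetical stand-ins, not the Batch methods: `cat` returns a new object joined along the first axis, `cat_` does the same in place (hence the trailing underscore), and `stack` adds a new leading axis.

```python
from typing import Dict, List

Toy = Dict[str, list]  # stand-in for a Batch: key -> list of values


def cat(batches: List[Toy]) -> Toy:
    """Return a new dict with each key's lists concatenated end to end."""
    return {k: [x for b in batches for x in b[k]] for k in batches[0]}


def cat_(target: Toy, batches: List[Toy]) -> None:
    """In-place variant: extend the lists already stored in `target`."""
    for b in batches:
        for k, v in b.items():
            target.setdefault(k, []).extend(v)


def stack(batches: List[Toy]) -> Toy:
    """Return a new dict where each key holds one entry per input batch."""
    return {k: [b[k] for b in batches] for k in batches[0]}


a, b = {"obs": [1, 2]}, {"obs": [3, 4]}
print(cat([a, b]))    # {'obs': [1, 2, 3, 4]}
print(stack([a, b]))  # {'obs': [[1, 2], [3, 4]]}
```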
* Add auto alpha tuning and exploration noise for sac.
Add class BaseNoise and GaussianNoise for the concept of exploration noise.
Add a new test for sac on MountainCarContinuous-v0,
which should benefit from the two new features above.
* add exploration noise to collector, fix example to adapt modification
* Enable converting Batch data back to torch.
* Add torch converter to collector.
* Fix
* Move to_numpy/to_torch convert in dedicated utils.py.
* Use to_numpy/to_torch to convert arrays.
* fix lint
* fix
* Add unit test to check Batch from/to numpy.
* Fix Batch over Batch.
Co-authored-by: Alexis Duburcq <alexis.duburcq@wandercraft.eu>
* update atari.py
* fix setup.py
pass the pytest
* fix setup.py
pass the pytest
* add args "render"
* change the tensorboard writer
* change the tensorboard writer
* change device, render, tensorboard log location
* change device, render, tensorboard log location
* remove some wrong local files
* fix some tab mistakes and the envs name in continuous/test_xx.py
* add examples and point robot maze environment
* fix some bugs during testing examples
* add dqn network and fix some args
* change back the tensorboard writer's frequency to ensure ppo and a2c can write things normally
* add a warning to collector
* rm some unrelated files
* reformat
* fix a bug in test_dqn caused by wrong model selection