update some tutorials
parent 2169dd2201
commit 04208e6cce

README.md
@@ -2,6 +2,7 @@
<a href="http://tianshou.readthedocs.io"><img width="300px" height="auto" src="docs/_static/images/tianshou-logo.png"></a>
</div>

---

[](https://pypi.org/project/tianshou/)
[](https://github.com/thu-ml/tianshou/actions)
@@ -18,7 +19,7 @@
- [Policy Gradient (PG)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)
- [Deep Q-Network (DQN)](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
- [Double DQN (DDQN)](https://arxiv.org/pdf/1509.06461.pdf) with n-step returns
- [Advantage Actor-Critic (A2C)](http://incompleteideas.net/book/RLbook2018.pdf)
- [Advantage Actor-Critic (A2C)](https://openai.com/blog/baselines-acktr-a2c/)
- [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf)
- [Proximal Policy Optimization (PPO)](https://arxiv.org/pdf/1707.06347.pdf)
- [Twin Delayed DDPG (TD3)](https://arxiv.org/pdf/1802.09477.pdf)
@@ -26,8 +27,6 @@

Tianshou also supports parallel workers for all algorithms. All of these algorithms are reformulated as replay-buffer-based algorithms.

Tianshou is still under development. More algorithms are going to be added, and we always welcome contributions to help make Tianshou better. If you would like to contribute, please check out [CONTRIBUTING.md](https://github.com/thu-ml/tianshou/blob/master/CONTRIBUTING.md).

## Installation

Tianshou is currently hosted on [PyPI](https://pypi.org/project/tianshou/). You can simply install Tianshou with the following command:
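For example, with a standard Python 3 and pip setup, installing the latest release from PyPI is just:

```bash
# install the latest Tianshou release from PyPI
pip3 install tianshou
```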
@@ -53,21 +52,23 @@ If no error occurs, you have successfully installed Tianshou.

## Documentation

The tutorials and API documentation are hosted on [https://tianshou.readthedocs.io](https://tianshou.readthedocs.io).
The tutorials and API documentation are hosted on [tianshou.readthedocs.io](https://tianshou.readthedocs.io).

The example scripts are under the [test/](https://github.com/thu-ml/tianshou/blob/master/test) and [examples/](https://github.com/thu-ml/tianshou/blob/master/examples) folders.

<!-- A short Chinese introduction to the Tianshou platform is available here: https://www.zhihu.com/question/377263715 -->
## Why Tianshou?

### Fast-speed

Tianshou is a lightweight but high-speed reinforcement learning platform. For example, here is a test on a laptop (i7-8750H + GTX1060). It takes only 3 seconds to train an agent with vanilla policy gradient on the CartPole-v0 task.
Tianshou is a lightweight but high-speed reinforcement learning platform. For example, here is a test on a laptop (i7-8750H + GTX1060). It takes only 3 seconds to train an agent with vanilla policy gradient on the CartPole-v0 task: `python3 test/discrete/test_pg.py --seed 0 --render 0.03` (the seed may differ across platforms and devices)

<div align="center">
  <img src="docs/_static/images/testpg.gif">
</div>
We select some famous (>1k stars) reinforcement learning platforms. Here are the benchmark results for other algorithms and platforms on toy scenarios (tested on the same laptop as mentioned above):
We select some famous reinforcement learning platforms: the two GitHub repositories with the most stars among all RL platforms (Baselines, RLlib) and the two with the most stars among PyTorch RL platforms (PyTorch DRL and rlpyt). Here are the benchmark results for other algorithms and platforms on toy scenarios (tested on the same laptop as mentioned above):

| RL Platform | [Tianshou](https://github.com/thu-ml/tianshou) | [Baselines](https://github.com/openai/baselines) | [Ray/RLlib](https://github.com/ray-project/ray/tree/master/rllib/) | [PyTorch DRL](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch) | [rlpyt](https://github.com/astooke/rlpyt) |
| --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
@@ -93,7 +94,7 @@ We will add results of Atari Pong / Mujoco these days.

### Reproducible

Tianshou has unit tests. Different from other platforms, **the unit tests include the full agent training procedure for all of the implemented algorithms**. They fail if an algorithm cannot train an agent to perform well enough within a limited number of epochs on toy scenarios. The unit tests secure the reproducibility of our platform.
Tianshou has its own unit tests. Unlike other platforms, **the unit tests include the full agent training procedure for all of the implemented algorithms**. They fail if an algorithm cannot train an agent to perform well enough within a limited number of epochs on toy scenarios. The unit tests secure the reproducibility of our platform.

Check out the [GitHub Actions](https://github.com/thu-ml/tianshou/actions) page for more details.
@@ -235,6 +236,12 @@ Look at the result saved in tensorboard: (on bash script)
tensorboard --logdir log/dqn
```

You can check out the [documentation](https://tianshou.readthedocs.io) for advanced usage.

## Contributing

Tianshou is still under development. More algorithms and features are going to be added and we always welcome contributions to help make Tianshou better. If you would like to contribute, please check out [CONTRIBUTING.md](https://github.com/thu-ml/tianshou/blob/master/CONTRIBUTING.md).

## Citing Tianshou

If you find Tianshou useful, please cite it in your publications.
@@ -253,6 +260,7 @@ If you find Tianshou useful, please cite it in your publications.

## TODO

- [x] More examples on [mujoco, atari] benchmark
- [ ] More algorithms
- [ ] Prioritized replay buffer
- [ ] RNN support
- [ ] Imitation Learning
|
BIN
docs/_static/images/concepts_arch.png
vendored
Normal file
BIN
docs/_static/images/concepts_arch.png
vendored
Normal file
Binary file not shown.
After Width: | Height: | Size: 16 KiB |
@@ -11,7 +11,7 @@ Welcome to Tianshou!
* `Policy Gradient (PG) <https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf>`_
* `Deep Q-Network (DQN) <https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf>`_
* `Double DQN (DDQN) <https://arxiv.org/pdf/1509.06461.pdf>`_ with n-step returns
* `Advantage Actor-Critic (A2C) <http://incompleteideas.net/book/RLbook2018.pdf>`_
* `Advantage Actor-Critic (A2C) <https://openai.com/blog/baselines-acktr-a2c/>`_
* `Deep Deterministic Policy Gradient (DDPG) <https://arxiv.org/pdf/1509.02971.pdf>`_
* `Proximal Policy Optimization (PPO) <https://arxiv.org/pdf/1707.06347.pdf>`_
* `Twin Delayed DDPG (TD3) <https://arxiv.org/pdf/1802.09477.pdf>`_
@@ -1,4 +1,130 @@
Basic Concepts in Tianshou
==========================

Under construction...
Tianshou splits the training procedure of a reinforcement learning agent into these parts: trainer, collector, policy, and data buffer. The general control flow can be described as follows:

.. image:: ../_static/images/concepts_arch.png
    :align: center
    :height: 250

Data Batch
----------

Tianshou provides :class:`~tianshou.data.Batch` as the internal data structure for passing any kind of data between methods; for example, a collector gives a :class:`~tianshou.data.Batch` to the policy for learning. Here is its usage:
::

    >>> import numpy as np
    >>> from tianshou.data import Batch
    >>> data = Batch(a=4, b=[5, 5], c='2312312')
    >>> data.b
    [5, 5]
    >>> data.b = np.array([3, 4, 5])
    >>> len(data.b)
    3
    >>> data.b[-1]
    5
In short, you can define a :class:`~tianshou.data.Batch` with any key-value pair.

The current implementation of Tianshou typically uses the following six keys:

* ``obs``: observation of step t;
* ``act``: action of step t;
* ``rew``: reward of step t;
* ``done``: the done flag of step t;
* ``obs_next``: observation of step t+1;
* ``info``: info of step t (in ``gym.Env``, the ``env.step()`` function returns four values, and the last one is ``info``).
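For instance, a single environment transition can be stored as a :class:`~tianshou.data.Batch` with exactly these keys (a minimal sketch; the values below are made up):
::

    >>> transition = Batch(obs=np.array([0.1, 0.0]), act=1, rew=1.0,
    ...                    done=False, obs_next=np.array([0.2, 0.1]), info={})
    >>> transition.rew
    1.0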
:class:`~tianshou.data.Batch` has other methods as well:
::

    >>> data = Batch(obs=np.array([0, 11, 22]), rew=np.array([6, 6, 6]))
    >>> # here we test __getitem__
    >>> index = [2, 1]
    >>> data[index].obs
    array([22, 11])

    >>> # append (concatenate) another Batch to the current one
    >>> data.append(data)
    >>> data.obs
    array([0, 11, 22, 0, 11, 22])

    >>> # split the whole data into multiple small batches
    >>> for d in data.split(size=2, permute=False):
    ...     print(d.obs, d.rew)
    [ 0 11] [6 6]
    [22  0] [6 6]
    [11 22] [6 6]
Data Buffer
-----------

:class:`~tianshou.data.ReplayBuffer` stores data generated from the interaction between the policy and the environment. It stores basically the six types of data mentioned above (seven types in :class:`~tianshou.data.PrioritizedReplayBuffer`, which adds an importance weight). Here is the usage of :class:`~tianshou.data.ReplayBuffer`:
::
    >>> from tianshou.data import ReplayBuffer
    >>> buf = ReplayBuffer(size=20)
    >>> for i in range(3):
    ...     buf.add(obs=i, act=i, rew=i, done=i, obs_next=i + 1, info={})
    >>> buf.obs
    # since we set size = 20, len(buf.obs) == 20.
    array([0., 1., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0.])

    >>> buf2 = ReplayBuffer(size=10)
    >>> for i in range(15):
    ...     buf2.add(obs=i, act=i, rew=i, done=i, obs_next=i + 1, info={})
    >>> buf2.obs
    # since its size = 10, it only stores the last 10 steps' result.
    array([10., 11., 12., 13., 14.,  5.,  6.,  7.,  8.,  9.])

    >>> # move buf2's result into buf (meanwhile keeping the chronological order)
    >>> buf.update(buf2)
    >>> buf.obs
    array([ 0.,  1.,  2.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14.,
            0.,  0.,  0.,  0.,  0.,  0.,  0.])

    >>> # get a random sample from the buffer;
    >>> # the batch_data is equal to buf[indice].
    >>> batch_data, indice = buf.sample(batch_size=4)
    >>> batch_data.obs == buf[indice].obs
    array([ True,  True,  True,  True])
The :class:`~tianshou.data.ReplayBuffer` is based on ``numpy.ndarray``. Tianshou provides other types of data buffers, such as :class:`~tianshou.data.ListReplayBuffer` (based on a list) and :class:`~tianshou.data.PrioritizedReplayBuffer` (based on a segment tree and ``numpy.ndarray``). Check out the API documentation for more details.
Policy
------

Tianshou aims to modularize RL algorithms, so policies are organized into several classes. All of the policy classes must inherit from :class:`~tianshou.policy.BasePolicy`.

For demonstration, we use the source code of policy gradient, :class:`~tianshou.policy.PGPolicy`. Policy gradient computes each frame's return as:

.. math::

    G_t = \sum_{i=t}^T \gamma^{i - t} r_i = r_t + \gamma r_{t + 1} + \cdots + \gamma^{T - t} r_T

where :math:`T` is the terminal time step and :math:`\gamma \in [0, 1]` is the discount factor.
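As a plain NumPy sketch (the helper below is only illustrative, not the actual code inside :class:`~tianshou.policy.PGPolicy`), these returns can be computed backwards over one finished episode in a single pass:
::

    import numpy as np

    def episode_returns(rewards, gamma=0.99):
        # G_t = r_t + gamma * G_{t+1}, evaluated backwards over one finished episode
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            returns[t] = running
        return returns

    episode_returns([1.0, 1.0, 1.0], gamma=0.9)  # array([2.71, 1.9 , 1.  ])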
TODO


Collector
---------

TODO

Trainer
-------

Once you have a collector and a policy, you can start writing the training procedure for your RL agent. The trainer is, to be honest, just a thin wrapper: it saves you the effort of writing the training loop yourself.
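Conceptually, the loop the trainer wraps looks roughly like the sketch below. The names and arguments here are placeholders rather than the real signatures (see the API documentation for those); the point is only how collector, buffer, and policy interact:
::

    # hypothetical sketch of a training loop, not Tianshou's actual trainer code
    for epoch in range(max_epoch):
        for step in range(step_per_epoch):
            collector.collect(n_step=collect_per_step)  # policy interacts with the env; transitions go into the buffer
            batch, indice = buffer.sample(batch_size)   # draw a random Batch of transitions
            policy.learn(batch)                         # update the policy's parameters
        evaluate(policy)                                # placeholder: test the current policy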
Tianshou has two types of trainer, :meth:`~tianshou.trainer.onpolicy_trainer` and :meth:`~tianshou.trainer.offpolicy_trainer`, corresponding to on-policy algorithms (such as Policy Gradient) and off-policy algorithms (such as DQN). Please check out the API documentation for the usage.

More types of trainer will be added in the future, for instance, a multi-agent trainer.

Conclusion
----------

So far, we have gone through the overall framework of Tianshou. Really simple, isn't it?
@@ -183,7 +183,7 @@ Train a Policy with Customized Codes

"I don't want to use your provided trainer. I want to customize it!"

No problem! Here are the usages:
No problem! Here is the usage:
::

    # pre-collect 5000 frames with random action before training
@@ -73,14 +73,7 @@ class ReplayBuffer(object):
                np.arange(self._index, self._size),
                np.arange(0, self._index),
            ])
        return Batch(
            obs=self.obs[indice],
            act=self.act[indice],
            rew=self.rew[indice],
            done=self.done[indice],
            obs_next=self.obs_next[indice],
            info=self.info[indice]
        ), indice
        return self[indice], indice

    def __getitem__(self, index):
        return Batch(
@@ -121,8 +114,8 @@ class PrioritizedReplayBuffer(ReplayBuffer):
    def add(self, obs, act, rew, done, obs_next, info={}, weight=None):
        raise NotImplementedError

    def sample_indice(self, batch_size):
        raise NotImplementedError

    def sample(self, batch_size):
        raise NotImplementedError

    def reset(self):
        raise NotImplementedError
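As a quick sanity check of the `sample` refactor in `ReplayBuffer` above, here is a sketch that uses only the buffer calls shown in the concepts tutorial: `sample` now builds its result through `__getitem__`, so indexing the buffer with the sampled indices must give the same data.

```python
from tianshou.data import ReplayBuffer

buf = ReplayBuffer(size=10)
for i in range(5):
    buf.add(obs=i, act=i, rew=i, done=0, obs_next=i + 1, info={})

# sample() returns self[indice], so the two views must agree
batch, indice = buf.sample(batch_size=4)
assert (batch.obs == buf[indice].obs).all()
```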