zh_CN docs

parent 5f2c5347df
commit f818a2467b

README.md
@@ -5,7 +5,8 @@
---

[](https://pypi.org/project/tianshou/)
[](https://tianshou.readthedocs.io)
[](https://tianshou.readthedocs.io/en/latest)
[](https://tianshou.readthedocs.io/zh/latest/)
[](https://github.com/thu-ml/tianshou/actions)
[](https://codecov.io/gh/thu-ml/tianshou)
[](https://github.com/thu-ml/tianshou/issues)

@@ -14,7 +15,7 @@
[](https://github.com/thu-ml/tianshou/blob/master/LICENSE)
[](https://gitter.im/thu-ml/tianshou?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
**Tianshou** ([天授](https://baike.baidu.com/item/%E5%A4%A9%E6%8E%88)) is a reinforcement learning platform based on pure PyTorch. Unlike existing reinforcement learning libraries, which are mainly based on TensorFlow, have many nested classes, unfriendly APIs, or run slowly, Tianshou provides a fast framework and a pythonic API for building deep reinforcement learning agents with the least number of lines of code. The supported algorithms currently include:
**Tianshou** ([天授](https://baike.baidu.com/item/%E5%A4%A9%E6%8E%88)) is a reinforcement learning platform based on pure PyTorch. Unlike existing reinforcement learning libraries, which are mainly based on TensorFlow, have many nested classes, unfriendly APIs, or run slowly, Tianshou provides a fast, modularized framework and a pythonic API for building deep reinforcement learning agents with the least number of lines of code. The supported algorithms currently include:
- [Policy Gradient (PG)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)

@@ -29,7 +30,7 @@
- [Prioritized Experience Replay (PER)](https://arxiv.org/pdf/1511.05952.pdf)
- [Generalized Advantage Estimator (GAE)](https://arxiv.org/pdf/1506.02438.pdf)
Tianshou supports parallel workers for all algorithms as well. All of these algorithms are reformulated as replay-buffer based algorithms. Our team is working on supporting more algorithms and more scenarios in this period of development.
Tianshou supports parallel workers for all algorithms as well, since all of them are reformulated as replay-buffer based algorithms. All of the algorithms support a recurrent state representation in the actor network (RNN-style training for POMDPs). The environment state can be of any type (dict, self-defined class, ...).
In Chinese, Tianshou means "divinely ordained", derived from the idea of a gift one is born with. Tianshou is a reinforcement learning platform, and an RL algorithm does not learn from humans. Taking the name "Tianshou" means that there is no teacher to study from; instead, the agent learns by itself through constant interaction with the environment.

@@ -75,6 +76,8 @@ The tutorials and API documentation are hosted on [tianshou.readthedocs.io](http
The example scripts are under the [test/](https://github.com/thu-ml/tianshou/blob/master/test) and [examples/](https://github.com/thu-ml/tianshou/blob/master/examples) folders. They may fail to run with the PyPI installation, so please re-install the GitHub version via `pip3 install git+https://github.com/thu-ml/tianshou.git@master`.

The Chinese documentation is at [https://tianshou.readthedocs.io/zh/latest/](https://tianshou.readthedocs.io/zh/latest/)

<!-- A short Chinese introduction to Tianshou: https://www.zhihu.com/question/377263715 -->

## Why Tianshou?
@@ -93,25 +96,26 @@ python3 test/discrete/test_pg.py --seed 0 --render 0.03

We select some famous reinforcement learning platforms: the 2 GitHub repos with the most stars among all RL platforms (OpenAI Baselines and RLlib) and the 2 GitHub repos with the most stars among PyTorch RL platforms (PyTorch-DRL and rlpyt). Here is the benchmark result for other algorithms and platforms on toy scenarios (tested on the same laptop as mentioned above):
| RL Platform | [Tianshou](https://github.com/thu-ml/tianshou) | [Baselines](https://github.com/openai/baselines) | [Ray/RLlib](https://github.com/ray-project/ray/tree/master/rllib/) | [PyTorch DRL](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch) | [rlpyt](https://github.com/astooke/rlpyt) |
| --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| GitHub Stars | [](https://github.com/thu-ml/tianshou/stargazers) | [](https://github.com/openai/baselines/stargazers) | [](https://github.com/ray-project/ray/stargazers) | [](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/stargazers) | [](https://github.com/astooke/rlpyt/stargazers) |
| Algo - Task | PyTorch | TensorFlow | TF/PyTorch | PyTorch | PyTorch |
| PG - CartPole | 9.03±4.18s | None | 15.77±6.28s | None | ? |
| DQN - CartPole | 10.61±5.51s | 1046.34±291.27s | 40.16±12.79s | 175.55±53.81s | ? |
| A2C - CartPole | 11.72±3.85s | *(~1612s) | 46.15±6.64s | Runtime Error | ? |
| PPO - CartPole | 32.55±10.09s | *(~1179s) | 62.21±13.31s (APPO) | 29.16±15.46s | ? |
| DDPG - Pendulum | 46.95±24.31s | *(>1h) | 377.99±13.79s | 652.83±471.28s | 172.18±62.48s |
| TD3 - Pendulum | 48.39±7.22s | None | 620.83±248.43s | 619.33±324.97s | 210.31±76.30s |
| SAC - Pendulum | 38.92±2.09s | None | 92.68±4.48s | 808.21±405.70s | 295.92±140.85s |

| RL Platform | [Tianshou](https://github.com/thu-ml/tianshou) | [Baselines](https://github.com/openai/baselines) | [Stable-Baselines](https://github.com/hill-a/stable-baselines) | [Ray/RLlib](https://github.com/ray-project/ray/tree/master/rllib/) | [PyTorch-DRL](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch) | [rlpyt](https://github.com/astooke/rlpyt) |
| --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| GitHub Stars | [](https://github.com/thu-ml/tianshou/stargazers) | [](https://github.com/openai/baselines/stargazers) | [](https://github.com/hill-a/stable-baselines/stargazers) | [](https://github.com/ray-project/ray/stargazers) | [](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/stargazers) | [](https://github.com/astooke/rlpyt/stargazers) |
| Algo - Task | PyTorch | TensorFlow | TensorFlow | TF/PyTorch | PyTorch | PyTorch |
| PG - CartPole | 6.09±4.60s | None | None | 19.26±2.29s | None | ? |
| DQN - CartPole | 6.09±0.87s | 1046.34±291.27s | 93.47±58.05s | 28.56±4.60s | 31.58±11.30s \*\* | ? |
| A2C - CartPole | 6.36±1.63s | \*(~1612s) | 57.56±12.87s | 57.92±9.94s | \*(Not converged) | ? |
| PPO - CartPole | 31.82±7.76s | \*(~1179s) | 34.79±17.02s | 44.60±17.04s | 23.99±9.26s \*\* | ? |
| PPO - Pendulum | 16.18±2.49s | 745.43±160.82s | 259.73±27.37s | 123.62±44.23s | Runtime Error | ? |
| DDPG - Pendulum | 37.26±9.55s | \*(>1h) | 277.52±92.67s | 314.70±7.92s | 59.05±10.03s \*\* | 172.18±62.48s |
| TD3 - Pendulum | 44.04±6.37s | None | 99.75±21.63s | 149.90±7.54s | 57.52±17.71s \*\* | 210.31±76.30s |
| SAC - Pendulum | 36.02±0.77s | None | 124.85±79.14s | 97.42±4.75s | 63.80±27.37s \*\* | 295.92±140.85s |
*\*: Could not reach the target reward threshold within 1e6 steps in any of 10 runs. The total runtime is shown in parentheses.*
*\*: Could not reach the target reward threshold within 1e6 steps in any of 5 runs. The total runtime is shown in parentheses.*

*\*\*: Since no specific evaluation function is implemented in PyTorch-DRL, the condition is relaxed to "the average total reward of 20 consecutive complete games during training is greater than or equal to the threshold".*

*?: We have tried, but running non-Atari games on rlpyt is nontrivial. See [here](https://github.com/astooke/rlpyt/issues/135).*

All of the platforms use 10 different seeds for testing. We discard the trials that failed to train. The reward threshold is 195.0 in CartPole and -250.0 in Pendulum, computed over the mean returns of 100 consecutive episodes.

Tianshou's and RLlib's configurations are very similar. They both use multiple workers for sampling. Indeed, both RLlib and rlpyt are excellent reinforcement learning platforms.
All of the platforms use 5 different seeds for testing. We discard the trials that failed to train. The reward threshold is 195.0 in CartPole and -250.0 in Pendulum, computed over the mean returns of 100 consecutive episodes (except for PyTorch-DRL).

We will add results for Atari Pong / MuJoCo soon.
@@ -1,9 +1,8 @@
========================
Contributing to Tianshou
========================

Install Develop Version
=======================
-----------------------

To install Tianshou in an "editable" mode, run
@@ -18,7 +17,7 @@ in the main directory. This installation is removable by

python3 setup.py develop --uninstall

PEP8 Code Style Check
=====================
---------------------

We follow PEP8 python code style. To check, in the main directory, run:
@@ -27,7 +26,7 @@ We follow PEP8 python code style. To check, in the main directory, run:

flake8 . --count --show-source --statistics

Test Locally
============
------------

This command will run automatic tests in the main directory
@@ -36,7 +35,7 @@ This command will run automatic tests in the main directory

pytest test --cov tianshou -s --durations 0 -v

Test by GitHub Actions
======================
----------------------

1. Click the ``Actions`` button in your own repo:
@@ -56,7 +55,7 @@ Test by GitHub Actions

   :align: center

Documentation
=============
-------------

Documentation is written under the ``docs/`` directory as ReStructuredText (``.rst``) files. ``index.rst`` is the main page. A tutorial on ReStructuredText can be found `here <https://pythonhosted.org/an_example_pypi_project/sphinx.html>`_.
@@ -71,6 +70,6 @@ To compile documentation into webpages, run

under the ``docs/`` directory. The generated webpages are in ``docs/_build`` and can be viewed with browsers.

Chinese Documentation
=====================
---------------------

The Chinese documentation is at https://github.com/thu-ml/tianshou-docs-zh_CN.
The Chinese documentation is at https://tianshou.readthedocs.io/zh/latest/
@@ -1,4 +1,3 @@
===========
Contributor
===========
@@ -3,7 +3,6 @@
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

====================
Welcome to Tianshou!
====================

@@ -22,11 +21,12 @@ Welcome to Tianshou!
* :meth:`~tianshou.policy.BasePolicy.compute_episodic_return` `Generalized Advantage Estimator <https://arxiv.org/pdf/1506.02438.pdf>`_

Tianshou supports parallel workers for all algorithms as well. All of these algorithms are reformulated as replay-buffer based algorithms.
Tianshou supports parallel workers for all algorithms as well, since all of them are reformulated as replay-buffer based algorithms. All of the algorithms support a recurrent state representation in the actor network (RNN-style training for POMDPs). The environment state can be of any type (dict, self-defined class, ...).

The Chinese documentation is at https://tianshou.readthedocs.io/zh/latest/
Installation
============
------------

Tianshou is currently hosted on `PyPI <https://pypi.org/project/tianshou/>`_. You can simply install Tianshou with the following command (with Python >= 3.6):
::

@@ -87,7 +87,7 @@ Tianshou is still under development, you can also check out the documents in sta
Indices and tables
==================
------------------

* :ref:`genindex`
* :ref:`modindex`
@@ -1,4 +1,3 @@
==========================
Basic concepts in Tianshou
==========================

@@ -10,7 +9,7 @@ Tianshou splits a Reinforcement Learning agent training procedure into these par
Data Batch
==========
----------

.. automodule:: tianshou.data.Batch
   :members:
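As a rough illustration only (the field names below are arbitrary, not from this page), a ``Batch`` behaves like a dict with attribute access and numpy-style indexing:
::

    import numpy as np
    from tianshou.data import Batch

    # keys passed as keyword arguments become attributes
    b = Batch(obs=np.arange(4).reshape(2, 2), rew=np.array([0.0, 1.0]))
    print(b.obs)   # attribute-style access to a stored field
    print(b[0])    # index into every field at once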
@@ -18,7 +17,7 @@ Data Batch

Data Buffer
===========
-----------

.. automodule:: tianshou.data.ReplayBuffer
   :members:
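A minimal usage sketch (the keyword-style ``add`` call and the ``(batch, indices)`` return of ``sample`` are assumptions about this version of the API):
::

    from tianshou.data import ReplayBuffer

    buf = ReplayBuffer(size=20)
    for i in range(3):
        # store one transition per call; argument names are assumed
        buf.add(obs=i, act=i, rew=1.0, done=False, obs_next=i + 1, info={})
    batch, indices = buf.sample(batch_size=2)  # random mini-batch plus its buffer indices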
@@ -29,7 +28,7 @@ Tianshou provides other type of data buffer such as :class:`~tianshou.data.ListR

.. _policy_concept:

Policy
======
------

Tianshou aims to modularize RL algorithms. There are several classes of policies in Tianshou, and all of them must inherit from :class:`~tianshou.policy.BasePolicy`.
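As a loose sketch of that contract (only the ``forward``/``learn`` method names come from the docs; everything else here is made up for illustration):
::

    import numpy as np
    from tianshou.data import Batch
    from tianshou.policy import BasePolicy

    class RandomPolicy(BasePolicy):
        """A toy policy that picks uniformly random discrete actions."""

        def __init__(self, action_num):
            super().__init__()
            self.action_num = action_num

        def forward(self, batch, state=None, **kwargs):
            # map a batch of observations to a batch of actions
            act = np.random.randint(0, self.action_num, size=len(batch.obs))
            return Batch(act=act)

        def learn(self, batch, **kwargs):
            # nothing to learn for a random policy; return a dict of "losses"
            return {}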
@@ -91,7 +90,7 @@ For other method, you can check out :doc:`/api/tianshou.policy`. We give the usa

Collector
=========
---------

The :class:`~tianshou.data.Collector` enables the policy to interact with different types of environments conveniently.
In short, :class:`~tianshou.data.Collector` has two main methods:
@@ -107,7 +106,7 @@ The general explanation is listed in :ref:`pseudocode`. Other usages of collecto

Trainer
=======
-------

Once you have a collector and a policy, you can start writing the training method for your RL agent. The trainer, to be honest, is a simple wrapper: it saves you the effort of writing the training loop. You can also construct your own trainer: :ref:`customized_trainer`.
@@ -119,7 +118,7 @@ There will be more types of trainers, for instance, multi-agent trainer.

.. _pseudocode:

A High-level Explanation
========================
------------------------

We give a high-level explanation through the pseudocode used in section :ref:`policy_concept`:
::

@@ -142,6 +141,6 @@ We give a high-level explanation through the pseudocode used in section :ref:`po

Conclusion
==========
----------

So far, we have gone through the overall framework of Tianshou. Really simple, isn't it?
@@ -1,4 +1,3 @@
==============
Deep Q Network
==============

@@ -11,7 +10,7 @@ Contrary to existing Deep RL libraries such as `RLlib <https://github.com/ray-pr

Make an Environment
===================
-------------------

First of all, you have to make an environment for your agent to interact with. For environment interfaces, we follow the convention of `OpenAI Gym <https://github.com/openai/gym>`_. In your Python code, simply import Tianshou and make the environment:
::
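A minimal sketch of such a block, assuming the standard Gym API:
::

    import gym
    import tianshou as ts

    env = gym.make('CartPole-v0')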
@@ -25,7 +24,7 @@ CartPole-v0 is a simple environment with a discrete action space, for which DQN

Setup Multi-environment Wrapper
===============================
-------------------------------

It is available if you want the original ``gym.Env``:
::
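For the multi-environment wrapper itself, a rough sketch might look like this (the ``VectorEnv``/``SubprocVectorEnv`` class names and the worker counts are assumptions):
::

    import gym
    import tianshou as ts

    # several environments stepped together for training, more for evaluation
    train_envs = ts.env.VectorEnv([lambda: gym.make('CartPole-v0') for _ in range(8)])
    test_envs = ts.env.SubprocVectorEnv([lambda: gym.make('CartPole-v0') for _ in range(100)])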
@@ -45,7 +44,7 @@ For the demonstration, here we use the second block of codes.

Build the Network
=================
-----------------

Tianshou supports any user-defined PyTorch networks and optimizers, with the only constraint being the input and output API. Here is an example:
::
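A sketch of a network that follows this convention, i.e. ``forward(obs, state, info)`` returning ``(logits, state)`` (the layer sizes are arbitrary):
::

    import numpy as np
    import torch
    from torch import nn

    class Net(nn.Module):
        def __init__(self, state_shape, action_shape):
            super().__init__()
            self.model = nn.Sequential(
                nn.Linear(int(np.prod(state_shape)), 128), nn.ReLU(inplace=True),
                nn.Linear(128, 128), nn.ReLU(inplace=True),
                nn.Linear(128, int(np.prod(action_shape))),
            )

        def forward(self, obs, state=None, info={}):
            # accept either numpy arrays or tensors as observations
            if not isinstance(obs, torch.Tensor):
                obs = torch.tensor(obs, dtype=torch.float)
            logits = self.model(obs.view(obs.shape[0], -1))
            return logits, state

    state_shape = (4,)  # CartPole observation shape (assumed)
    action_shape = 2    # CartPole action count (assumed)
    net = Net(state_shape, action_shape)
    optim = torch.optim.Adam(net.parameters(), lr=1e-3)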
@@ -81,7 +80,7 @@ The rules of self-defined networks are:

Setup Policy
============
------------

We use the defined ``net`` and ``optim``, with extra policy hyper-parameters, to define a policy. Here we define a DQN policy using a target network:
::
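As a sketch only, continuing from the network above; the keyword names (``discount_factor``, ``estimation_step``, ``target_update_freq``) are taken from the ``DQNPolicy`` docstring and should be treated as assumptions for this version:
::

    policy = ts.policy.DQNPolicy(
        net, optim,
        discount_factor=0.9,     # gamma (assumed keyword name)
        estimation_step=3,       # n-step return (assumed keyword name)
        target_update_freq=320,  # a non-zero value enables the target network (assumed)
    )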
@@ -92,7 +91,7 @@ We use the defined ``net`` and ``optim``, with extra policy hyper-parameters, to

Setup Collector
===============
---------------

The collector is a key concept in Tianshou. It allows the policy to interact with different types of environments conveniently.
In each step, the collector will let the policy perform (at least) a specified number of steps or episodes and store the data in a replay buffer.
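A rough sketch, continuing from the objects above (the constructor argument order and the buffer size are assumptions):
::

    train_collector = ts.data.Collector(policy, train_envs, ts.data.ReplayBuffer(size=20000))
    test_collector = ts.data.Collector(policy, test_envs)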
@@ -103,7 +102,7 @@ In each step, the collector will let the policy perform (at least) a specified n

Train Policy with a Trainer
===========================
---------------------------

Tianshou provides :class:`~tianshou.trainer.onpolicy_trainer` and :class:`~tianshou.trainer.offpolicy_trainer`. The trainer will automatically stop training when the policy reaches the stop condition ``stop_fn`` on the test collector. Since DQN is an off-policy algorithm, we use the :class:`~tianshou.trainer.offpolicy_trainer` as follows:
::
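A sketch of such a call follows; the argument names reflect early versions of ``offpolicy_trainer`` and are assumptions, as is the CartPole reward threshold used in ``stop_fn``:
::

    result = ts.trainer.offpolicy_trainer(
        policy, train_collector, test_collector,
        max_epoch=10, step_per_epoch=1000, collect_per_step=10,
        episode_per_test=100, batch_size=64,
        stop_fn=lambda mean_rewards: mean_rewards >= 195)
    print(result)  # dict of training statistics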
@@ -159,7 +158,7 @@ It shows that within approximately 4 seconds, we finished training a DQN agent o

Save/Load Policy
================
----------------

Since the policy inherits the ``torch.nn.Module`` class, saving and loading the policy are exactly the same as a torch module:
::
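For instance (plain PyTorch; the file name is arbitrary):
::

    torch.save(policy.state_dict(), 'dqn.pth')
    policy.load_state_dict(torch.load('dqn.pth'))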
@@ -169,7 +168,7 @@ Since the policy inherits the ``torch.nn.Module`` class, saving and loading the

Watch the Agent's Performance
=============================
-----------------------------

:class:`~tianshou.data.Collector` supports rendering. Here is an example of watching the agent's performance at 35 FPS:
::
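A sketch of such a block (interpreting ``render`` as the number of seconds to sleep between frames is an assumption):
::

    collector = ts.data.Collector(policy, env)
    collector.collect(n_episode=1, render=1 / 35)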
@@ -182,7 +181,7 @@ Watch the Agent's Performance

.. _customized_trainer:

Train a Policy with Customized Codes
====================================
------------------------------------

"I don't want to use your provided trainer. I want to customize it!"
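A very rough sketch of a hand-written loop in that spirit; every call here (``collect``, ``sample``, ``process_fn``, ``learn``) is an assumption about the API at this version:
::

    for step in range(10000):
        # gather a few transitions, then update on a sampled mini-batch
        train_collector.collect(n_step=10)
        batch, indices = train_collector.buffer.sample(batch_size=64)
        batch = policy.process_fn(batch, train_collector.buffer, indices)
        losses = policy.learn(batch)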
@@ -1,4 +1,3 @@
======================================
Train a model-free RL agent within 30s
======================================

@@ -8,7 +7,7 @@ You can also contribute to this page with your own tricks :)

Avoid batch-size = 1
====================
--------------------

In the traditional RL training loop, we always use the policy to interact with only one environment for collecting data. That means most of the time the network uses a batch size of 1. Quite inefficient!
Here is an example showing how inefficient it is:
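A minimal torch-only sketch of the effect (layer sizes and counts chosen arbitrarily):
::

    import time
    import torch
    from torch import nn

    net = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 2))

    start = time.time()
    for _ in range(1000):
        net(torch.randn(1, 4))       # 1000 forward passes with batch-size 1
    t_single = time.time() - start

    start = time.time()
    net(torch.randn(1000, 4))        # one forward pass with batch-size 1000
    t_batched = time.time() - start

    print(f'batch-size 1: {t_single:.3f}s, batched: {t_batched:.3f}s')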
@@ -53,7 +52,7 @@ By the way, A2C is better than A3C in some cases: A3C needs to act independently

Algorithm specific tricks
=========================
-------------------------

Here is some experience with hyper-parameter tuning on CartPole and Pendulum:
@@ -67,7 +66,7 @@ Here is about the experience of hyper-parameter tuning on CartPole and Pendulum:

Code-level optimization
=======================
-----------------------

Tianshou has many short-but-efficient lines of code. For example, when we want to compute :math:`V(s)` and :math:`V(s')` by the same network, the best way is to concatenate :math:`s` and :math:`s'` together instead of computing the value function with two separate network forward passes.
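A small torch sketch of this trick (the critic here is a stand-in network, not Tianshou code):
::

    import torch
    from torch import nn

    critic = nn.Sequential(nn.Linear(4, 128), nn.ReLU(), nn.Linear(128, 1))
    s = torch.randn(64, 4)       # batch of states
    s_next = torch.randn(64, 4)  # batch of next states

    # one forward pass over the concatenated batch instead of two
    values = critic(torch.cat([s, s_next], dim=0))
    v_s, v_s_next = values[:64], values[64:]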
@@ -75,7 +74,7 @@ Tianshou has many short-but-efficient lines of code. For example, when we want t

Finally
=======
-------

With fast sampling, we can use a large batch size and a large learning rate for faster convergence.