add docs of trick

Trinkle23897 2020-04-02 21:57:26 +08:00
parent 0e86d44860
commit 7cb5146611
6 changed files with 92 additions and 21 deletions


@@ -16,11 +16,11 @@ pip install ".[dev]"
#### Tests
This command will run automatic tests in the main directory
```python
```bash
pytest test --cov tianshou -s
```
To run on your own GitHub Repo, enable the [GitHub Action](/actions) and it will automatically run the test.
To run on your own GitHub Repo, enable the [GitHub Action](https://github.com/features/actions) and it will automatically run the test.
##### PEP8 Code Style Check


@@ -13,7 +13,7 @@
[![GitHub license](https://img.shields.io/github/license/thu-ml/tianshou)](https://github.com/thu-ml/tianshou/blob/master/LICENSE)
[![Join the chat at https://gitter.im/thu-ml/tianshou](https://badges.gitter.im/thu-ml/tianshou.svg)](https://gitter.im/thu-ml/tianshou?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
**Tianshou** ([天授](https://baike.baidu.com/item/%E5%A4%A9%E6%8E%88)) is a reinforcement learning platform based on pure PyTorch. Unlike existing reinforcement learning libraries, which are mainly based on TensorFlow, have many nested classes, unfriendly API, or slow-speed, Tianshou provides a fast-speed framework and pythonic API for building the deep reinforcement learning agent. The supported interface algorithms include:
**Tianshou** ([天授](https://baike.baidu.com/item/%E5%A4%A9%E6%8E%88)) is a reinforcement learning platform based on pure PyTorch. Unlike existing reinforcement learning libraries, which are mainly based on TensorFlow, have many nested classes, unfriendly APIs, or run slowly, Tianshou provides a fast framework and a pythonic API for building deep reinforcement learning agents with the fewest lines of code. The supported interface algorithms include:
- [Policy Gradient (PG)](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)
@@ -25,7 +25,7 @@
- [Twin Delayed DDPG (TD3)](https://arxiv.org/pdf/1802.09477.pdf)
- [Soft Actor-Critic (SAC)](https://arxiv.org/pdf/1812.05905.pdf)
Tianshou supports parallel workers for all algorithms as well. All of these algorithms are reformatted as replay-buffer based algorithms. Our team is working on supporting more algorithms and more scenarios on Tianshou in this period of development.
Tianshou also supports parallel workers for all algorithms. All of these algorithms are reformulated as replay-buffer based algorithms. Our team is working on supporting more algorithms and more scenarios on Tianshou during this period of development, including model-based RL, Atari environments, and more.
## Installation
@@ -88,7 +88,7 @@ We select some of famous reinforcement learning platforms: 2 GitHub repos with m
*\*: Could not reach the target reward threshold in 1e6 steps in any of 10 runs. The total runtime is in the brackets.*
*?: We have tried but it is nontrivial for running non-Atari game on rlpyt. See [here](https://github.com/astooke/rlpyt/issues/127#issuecomment-601741210).*
*?: We have tried but it is nontrivial for running non-Atari game on rlpyt. See [here](https://github.com/astooke/rlpyt/issues/135).*
All of the platforms use 10 different seeds for testing. We exclude the trials that failed during training. The reward threshold is 195.0 in CartPole and -250.0 in Pendulum over the mean returns of 100 consecutive episodes.
@@ -273,6 +273,6 @@ If you find Tianshou useful, please cite it in your publications.
## Acknowledgment
Tianshou was previously a reinforcement learning platform based on TensorFlow. You can check out the branch [`priv`](https://github.com/thu-ml/tianshou/tree/priv) for more detail. Many thanks to [Haosheng Zou](https://github.com/HaoshengZou)'s pioneering work for `Tianshou<=0.1.1`.
Tianshou was previously a reinforcement learning platform based on TensorFlow. You can check out the branch [`priv`](https://github.com/thu-ml/tianshou/tree/priv) for more detail. Many thanks to [Haosheng Zou](https://github.com/HaoshengZou)'s pioneering work on Tianshou up to version 0.1.1.
We would like to thank [TSAIL](http://ml.cs.tsinghua.edu.cn/) and [Institute for Artificial Intelligence, Tsinghua University](http://ai.tsinghua.edu.cn/) for providing such an excellent AI research platform.


@@ -49,6 +49,7 @@ If no error occurs, you have successfully installed Tianshou.
   tutorials/dqn
   tutorials/concepts
   tutorials/trick

.. toctree::
   :maxdepth: 1


@@ -126,10 +126,10 @@ Here is the pseudocode showing the training process **without Tianshou framework
b_s, b_a, b_s_, b_r, b_d = buffer.get(size=64)
# compute 2-step returns. How?
b_ret = compute_2_step_return(buffer, b_r, b_d, ...)
# update DQN
# update DQN policy
agent.update(b_s, b_a, b_s_, b_r, b_d, b_ret)
Thus, we need a time-dependent interface for calculating the 2-step return. :meth:`~tianshou.policy.BasePolicy.process_fn` provides this interface by giving the replay buffer, the sample index, and the sample batch data. Since we store all the data in the order of time, you can simply compute the 2-step return as:
Thus, we need a time-related interface for calculating the 2-step return. :meth:`~tianshou.policy.BasePolicy.process_fn` provides this interface by passing in the replay buffer, the sample indices, and the sampled batch data. Since we store all the data in chronological order, you can simply compute the 2-step return as:
::
class DQN_2step(BasePolicy):
@@ -141,6 +141,7 @@ Thus, we need a time-dependent interface for calculating the 2-step return. :met
# this will return a batch data where batch_2.obs is s_t+2
# we can also get s_t+2 through:
# batch_2_obs = buffer.obs[(indice + 2) % buffer_len]
# in short, buffer.obs[i] is equal to buffer[i].obs, but the former is more efficient.
Q = self(batch_2, eps=0) # shape: [batchsize, action_shape]
maxQ = Q.max(dim=-1)[0]  # max over the action dimension
batch.returns = batch.rew \
@@ -153,7 +154,7 @@ This code does not consider the done flag, so it may not work very well. It show
For other methods, you can check out the API documentation for more detail. We give a high-level explanation through the same pseudocode:
::
# pseudocode, cannot work
# pseudocode, cannot work # methods in tianshou
buffer = Buffer(size=10000)
s = env.reset()
agent = DQN() # done in policy.__init__(...)
@@ -166,7 +167,7 @@ For other method, you can check out the API documentation for more detail. We gi
b_s, b_a, b_s_, b_r, b_d = buffer.get(size=64) # done in collector.sample(batch_size)
# compute 2-step returns. How?
b_ret = compute_2_step_return(buffer, b_r, b_d, ...) # done in policy.process_fn(batch, buffer, indice)
# update DQN
# update DQN policy
agent.update(b_s, b_a, b_s_, b_r, b_d, b_ret) # done in policy.learn(batch, ...)

docs/tutorials/trick.rst Normal file

@@ -0,0 +1,79 @@
Train a model-free RL agent within 30s
======================================
This page summarizes some hyper-parameter tuning experience and code-level tricks for training a model-free DRL agent.
Avoid batch-size = 1
--------------------
In the traditional RL training loop, we always use the policy to interact with only one environment for collecting data. That means the network uses batch-size = 1 most of the time. Quite inefficient!
Here is an example showing how inefficient it is:
::
import torch, time
from torch import nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1))

    def forward(self, s):
        return self.model(s)

net = Net()
cnt = 1000
div = 128
a = torch.randn([128, 3])

# forward the whole batch of 128 samples at once
t = time.time()
for i in range(cnt):
    b = net(a)
t1 = (time.time() - t) / cnt
print(t1)

# forward the same data with batch-size = 1, 128 times per iteration
t = time.time()
for i in range(cnt):
    for a_ in a.split(a.shape[0] // div):
        b = net(a_)
t2 = (time.time() - t) / cnt
print(t2)
print(t2 / t1)
The first test uses batch-size = 128 once per iteration, and the second test uses batch-size = 1 for 128 times. In our test, the first is 70-80 times faster than the second.
So how can we avoid the case of batch-size = 1? The answer is synchronized sampling: we create multiple independent environments and sample from them simultaneously. This is similar to A2C, but other algorithms can use this method as well. In our experiments, sampling from more environments benefits not only the sampling speed but also the convergence of the neural network (we suspect it reduces the sampling bias); a short sketch is given below.
By the way, in some cases A2C is better than A3C: A3C lets each worker act independently and sync its gradient to the master, but on a single node, acting with batch-size = 1 as A3C does is quite resource-consuming.
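Below is a minimal, framework-agnostic sketch of such synchronized sampling. It uses plain ``gym`` CartPole environments with the classic single-return ``reset``/``step`` API and a placeholder linear network rather than Tianshou's collector, so all names here are only illustrative.
::
import gym
import numpy as np
import torch
from torch import nn

# 16 independent environments stepped together: the policy network always
# sees a batch of 16 observations instead of batch-size = 1
env_num = 16
envs = [gym.make('CartPole-v0') for _ in range(env_num)]
obs = np.stack([e.reset() for e in envs])

policy_net = nn.Linear(4, 2)  # placeholder policy, only for illustration

for step in range(100):
    with torch.no_grad():
        logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
    acts = logits.argmax(dim=-1).numpy()  # one batched forward pass
    results = [e.step(int(a)) for e, a in zip(envs, acts)]
    obs = np.stack([r[0] for r in results])
    done = np.array([r[2] for r in results])
    # reset the finished environments so sampling never stalls
    for i in np.where(done)[0]:
        obs[i] = envs[i].reset()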
Algorithm specific tricks
-------------------------
Here is our experience with hyper-parameter tuning on CartPole and Pendulum:
* :class:`~tianshou.policy.DQNPolicy`: use an ``estimation_step`` greater than 1 and a target network, together with a suitably sized replay buffer;
* :class:`~tianshou.policy.PGPolicy`: TBD
* :class:`~tianshou.policy.A2CPolicy`: TBD
* :class:`~tianshou.policy.PPOPolicy`: TBD
* :class:`~tianshou.policy.DDPGPolicy`, :class:`~tianshou.policy.TD3Policy`, and :class:`~tianshou.policy.SACPolicy`: we found two tricks. The first is to ignore the done flag. The second is to normalize the reward to a standard normal distribution (this goes against the theoretical analysis, but it works very well in practice). The two tricks work amazingly well on MuJoCo tasks, typically with a much faster convergence speed (1M -> 200K steps); a sketch of the reward normalization is given right after this list.
* On-policy algorithms: increasing the repeat-time of the given batch (to 2 or 4) in each training update makes the algorithm more stable.
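As a concrete illustration of the reward normalization mentioned above, here is a minimal sketch based on running statistics (Welford's algorithm). The ``RewardNormalizer`` helper and its names are made up for this example and are not part of Tianshou.
::
import numpy as np

class RewardNormalizer:
    """Track running reward statistics and normalize rewards to roughly
    zero mean and unit variance (illustrative helper, not Tianshou API)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations (Welford's algorithm)
        self.eps = eps

    def update(self, rewards):
        for r in np.asarray(rewards, dtype=np.float64).ravel():
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        std = np.sqrt(self.m2 / max(self.count - 1, 1)) + self.eps
        return (np.asarray(rewards) - self.mean) / std

# usage: update the statistics with every collected batch, then train the
# critic on the normalized rewards instead of the raw ones
normalizer = RewardNormalizer()
batch_rew = np.random.randn(64) * 10 + 5  # fake rewards, just for the demo
normalizer.update(batch_rew)
norm_rew = normalizer.normalize(batch_rew)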
Code-level optimization
-----------------------
Tianshou has many short-but-efficient lines of code. For example, when we want to compute :math:`V_s` and :math:`V_{s'}` with the same network, the best way is to concatenate :math:`s` and :math:`s'` together instead of running the network forward twice.
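For example (the critic network and tensor shapes below are made up for illustration):
::
import torch
from torch import nn

critic = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 1))

s = torch.randn(64, 3)   # batch of states s_t
s_ = torch.randn(64, 3)  # batch of next states s_{t+1}

# two forward passes (slower)
v_s, v_s_ = critic(s), critic(s_)

# one forward pass on the concatenated batch, then split the result
v_all = critic(torch.cat([s, s_], dim=0))
v_s, v_s_ = v_all.split(s.shape[0], dim=0)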
Jiayi: I wrote each line of code after quite a lot of consideration. Details make a difference.
Finally
-------
With fast sampling, we can use a larger batch size and a larger learning rate to train faster.
RL algorithms are seed-sensitive. Try more seeds and pick the best one. For our demo, we just used seed = 0 and found it works surprisingly well on policy gradient, so we did not try other seeds.

tianshou/env/atari.py vendored

@@ -15,7 +15,6 @@ def create_atari_environment(name=None, sticky_actions=True,
class preprocessing(object):
    def __init__(self, env, frame_skip=4, terminal_on_life_loss=False,
                 size=84, max_episode_steps=2000):
        self.max_episode_steps = max_episode_steps
@@ -36,7 +35,6 @@
    @property
    def observation_space(self):
        return Box(low=0, high=255, shape=(self.size, self.size, 4),
                   dtype=np.uint8)
@@ -63,11 +61,9 @@
            self._pool_and_resize() for _ in range(self.frame_skip)], axis=-1)
    def render(self, mode):
        return self.env.render(mode)
    def step(self, action):
        total_reward = 0.
        observation = []
        for t in range(self.frame_skip):
@@ -103,12 +99,10 @@
        return observation, total_reward, is_terminal, info
    def _grayscale_obs(self, output):
        self.env.ale.getScreenGrayscale(output)
        return output
    def _pool_and_resize(self):
        if self.frame_skip > 1:
            np.maximum(self.screen_buffer[0], self.screen_buffer[1],
                       out=self.screen_buffer[0])
@@ -119,7 +113,3 @@
        int_image = np.asarray(transformed_image, dtype=np.uint8)
        # return np.expand_dims(int_image, axis=2)
        return int_image
if __name__ == '__main__':
    create_atari_environment()