add vizdoom example, bump version to 0.4.2 (#384)
parent c0bc8e00ca
commit ebaca6f8da
2  .github/workflows/pytest.yml  (vendored)

@@ -24,7 +24,7 @@ jobs:
       - name: Test with pytest
         # ignore test/throughput which only profiles the code
         run: |
-          pytest test --ignore-glob='*profile.py' --cov=tianshou --cov-report=xml --durations=0 -v
+          pytest test --ignore-glob='*profile.py' --cov=tianshou --cov-report=xml --cov-report=term-missing --durations=0 -v
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v1
         with:
@@ -6,7 +6,7 @@

 [](https://pypi.org/project/tianshou/)
 [](https://github.com/conda-forge/tianshou-feedstock)
-[](https://tianshou.readthedocs.io/en/latest)
+[](https://tianshou.readthedocs.io/en/master)
 [](https://tianshou.readthedocs.io/zh/latest/)
 [](https://github.com/thu-ml/tianshou/actions)
 [](https://codecov.io/gh/thu-ml/tianshou)
@@ -14,7 +14,6 @@
 [](https://github.com/thu-ml/tianshou/stargazers)
 [](https://github.com/thu-ml/tianshou/network)
 [](https://github.com/thu-ml/tianshou/blob/master/LICENSE)
-[](https://gitter.im/thu-ml/tianshou?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

 **Tianshou** ([天授](https://baike.baidu.com/item/%E5%A4%A9%E6%8E%88)) is a reinforcement learning platform based on pure PyTorch. Unlike existing reinforcement learning libraries, which are mainly based on TensorFlow, have many nested classes, unfriendly API, or slow-speed, Tianshou provides a fast-speed modularized framework and pythonic API for building the deep reinforcement learning agent with the least number of lines of code. The supported interface algorithms currently include:

1  examples/vizdoom/.gitignore  (vendored, new file)

@@ -0,0 +1 @@
_vizdoom.ini

66  examples/vizdoom/README.md  (new file)

@@ -0,0 +1,66 @@
# ViZDoom

[ViZDoom](https://github.com/mwydmuch/ViZDoom) is a popular RL environment built on the classic first-person shooter Doom. Here we provide some results and intuition for this scenario.

## Train

To train an agent:

```bash
python3 vizdoom_c51.py --task {D1_basic|D3_battle|D4_battle2}
```

D1 (health gathering) should finish training (no death) in fewer than 500k env steps (5 epochs);

D3 can reach 1600+ reward (75+ killcount in 5 minutes);

D4 can reach 700+ reward. Here is the result:

(episode length; the maximum is 2625 because we use frameskip=4, i.e. 10500/4 = 2625)



(episode reward)



To evaluate an agent's performance:

```bash
python3 vizdoom_c51.py --test-num 100 --resume-path policy.pth --watch --task {D1_basic|D3_battle|D4_battle2}
```

To save `.lmp` files for recording:

```bash
python3 vizdoom_c51.py --save-lmp --test-num 100 --resume-path policy.pth --watch --task {D1_basic|D3_battle|D4_battle2}
```

This stores the `.lmp` files in the `lmps/` directory. To watch these `.lmp` files (for example, a D3 lmp):

```bash
python3 replay.py maps/D3_battle.cfg episode_8_25.lmp
```

We provide two `.lmp` files (the best D3 and D4 episodes) under `results/c51`; you can replay them with:

```bash
python3 replay.py maps/D3_battle.cfg results/c51/d3.lmp
python3 replay.py maps/D4_battle2.cfg results/c51/d4.lmp
```

## Maps

See [maps/README.md](maps/README.md)

## Algorithms

The setting is exactly the same as in the Atari examples, so you can also try the other algorithms listed there.

## Reward

1. A living reward hurts performance.
2. Combo-actions (pressing several buttons at once) are really important.
3. A negative reward for losing health and ammo (AMMO2) is really helpful for D3/D4 (see the sketch below).
4. Rewarding only health gains (with no penalty for losses) is really helpful for D1.
5. Removing MOVE_BACKWARD may make training converge faster, but the final performance may be lower.
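
Points 3 and 4 are implemented in `env.py` as delta-based shaping on top of the kill reward. Below is a condensed sketch (the standalone helper and its name are ours for illustration only; the actual logic lives in `Env.step`):

```python
def shaped_reward(killcount, last_killcount, health, last_health,
                  ammo2, last_ammo2, is_battle_map):
    """Delta-based shaping distilled from Env.step() in env.py."""
    reward = 20.0 * (killcount - last_killcount)  # kills dominate the signal
    if is_battle_map:             # D3/D4: losing health is penalized
        reward += health - last_health
    elif health > last_health:    # D1/D2: only health gains are rewarded
        reward += health - last_health
    reward += ammo2 - last_ammo2  # firing spends AMMO2, so wasted shots are penalized
    return reward

# e.g. one kill while losing 10 health and 2 ammo on a battle map: 20 - 10 - 2 = 8.0
print(shaped_reward(3, 2, 90, 100, 48, 50, True))
```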

129  examples/vizdoom/env.py  (new file)

@@ -0,0 +1,129 @@
import os
import cv2
import gym
import numpy as np
import vizdoom as vzd


def normal_button_comb():
    actions = []
    m_forward = [[0.0], [1.0]]
    t_left_right = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    for i in m_forward:
        for j in t_left_right:
            actions.append(i + j)
    return actions


def battle_button_comb():
    actions = []
    m_forward_backward = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    m_left_right = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    t_left_right = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    attack = [[0.0], [1.0]]
    speed = [[0.0], [1.0]]

    for m in attack:
        for n in speed:
            for j in m_left_right:
                for i in m_forward_backward:
                    for k in t_left_right:
                        actions.append(i + j + k + m + n)
    return actions


class Env(gym.Env):
    def __init__(
        self, cfg_path, frameskip=4, res=(4, 40, 60), save_lmp=False
    ):
        super().__init__()
        self.save_lmp = save_lmp
        self.health_setting = "battle" in cfg_path
        if save_lmp:
            os.makedirs("lmps", exist_ok=True)
        self.res = res
        self.skip = frameskip
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=res, dtype=np.float32
        )
        self.game = vzd.DoomGame()
        self.game.load_config(cfg_path)
        self.game.init()
        if "battle" in cfg_path:
            self.available_actions = battle_button_comb()
        else:
            self.available_actions = normal_button_comb()
        self.action_num = len(self.available_actions)
        self.action_space = gym.spaces.Discrete(self.action_num)
        self.spec = gym.envs.registration.EnvSpec("vizdoom-v0")
        self.count = 0

    def get_obs(self):
        state = self.game.get_state()
        if state is None:
            return
        obs = state.screen_buffer
        self.obs_buffer[:-1] = self.obs_buffer[1:]
        self.obs_buffer[-1] = cv2.resize(obs, (self.res[-1], self.res[-2]))

    def reset(self):
        if self.save_lmp:
            self.game.new_episode(f"lmps/episode_{self.count}.lmp")
        else:
            self.game.new_episode()
        self.count += 1
        self.obs_buffer = np.zeros(self.res, dtype=np.uint8)
        self.get_obs()
        self.health = self.game.get_game_variable(vzd.GameVariable.HEALTH)
        self.killcount = self.game.get_game_variable(
            vzd.GameVariable.KILLCOUNT)
        self.ammo2 = self.game.get_game_variable(vzd.GameVariable.AMMO2)
        return self.obs_buffer

    def step(self, action):
        self.game.make_action(self.available_actions[action], self.skip)
        reward = 0.0
        self.get_obs()
        health = self.game.get_game_variable(vzd.GameVariable.HEALTH)
        if self.health_setting:
            reward += health - self.health
        elif health > self.health:  # positive health reward only for d1/d2
            reward += health - self.health
        self.health = health
        killcount = self.game.get_game_variable(vzd.GameVariable.KILLCOUNT)
        reward += 20 * (killcount - self.killcount)
        self.killcount = killcount
        ammo2 = self.game.get_game_variable(vzd.GameVariable.AMMO2)
        # if ammo2 > self.ammo2:
        reward += ammo2 - self.ammo2
        self.ammo2 = ammo2
        done = False
        info = {}
        if self.game.is_player_dead() or self.game.get_state() is None:
            done = True
        elif self.game.is_episode_finished():
            done = True
            info["TimeLimit.truncated"] = True
        return self.obs_buffer, reward, done, info

    def render(self):
        pass

    def close(self):
        self.game.close()


if __name__ == '__main__':
    # env = Env("maps/D1_basic.cfg", 4, (4, 84, 84))
    env = Env("maps/D3_battle.cfg", 4, (4, 84, 84))
    print(env.available_actions)
    action_num = env.action_space.n
    obs = env.reset()
    print(env.spec.reward_threshold)
    print(obs.shape, action_num)
    for i in range(4000):
        obs, rew, done, info = env.step(0)
        if done:
            env.reset()
    print(obs.shape, rew, done)
    cv2.imwrite("test.png", obs.transpose(1, 2, 0)[..., :3])

39  examples/vizdoom/maps/D1_basic.cfg  (new file)

@@ -0,0 +1,39 @@
# Lines starting with # are treated as comments (or with whitespaces+#).
# It doesn't matter if you use capital letters or not.
# It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout.

doom_scenario_path = D1_basic.wad
doom_map = map01

# Rewards

# Each step is good for you!
living_reward = 0
# And death is not!
death_penalty = 0

# Rendering options
screen_resolution = RES_160X120
screen_format = GRAY8
render_hud = false
render_crosshair = false
render_weapon = false
render_decals = false
render_particles = false
window_visible = false

# make episodes finish after 10500 actions (tics)
episode_timeout = 10500

# Available buttons
available_buttons =
    {
        MOVE_FORWARD
        TURN_LEFT
        TURN_RIGHT
    }

# Game variables that will be in the state
available_game_variables = { HEALTH }

mode = PLAYER

BIN  examples/vizdoom/maps/D1_basic.wad  (new file, binary file not shown)

39  examples/vizdoom/maps/D2_navigation.cfg  (new file)

@@ -0,0 +1,39 @@
# Lines starting with # are treated as comments (or with whitespaces+#).
# It doesn't matter if you use capital letters or not.
# It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout.

doom_scenario_path = D2_navigation.wad
doom_map = map01

# Rewards

# Each step is good for you!
living_reward = 0
# And death is not!
death_penalty = 0

# Rendering options
screen_resolution = RES_160X120
screen_format = GRAY8
render_hud = false
render_crosshair = false
render_weapon = false
render_decals = false
render_particles = false
window_visible = false

# make episodes finish after 10500 actions (tics)
episode_timeout = 10500

# Available buttons
available_buttons =
    {
        MOVE_FORWARD
        TURN_LEFT
        TURN_RIGHT
    }

# Game variables that will be in the state
available_game_variables = { HEALTH }

mode = PLAYER

BIN  examples/vizdoom/maps/D2_navigation.wad  (new file, binary file not shown)

48  examples/vizdoom/maps/D3_battle.cfg  (new file)

@@ -0,0 +1,48 @@
# Lines starting with # are treated as comments (or with whitespaces+#).
# It doesn't matter if you use capital letters or not.
# It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout.

doom_scenario_path = D3_battle.wad
doom_map = map01

# Rewards

living_reward = 0
death_penalty = 100

# Rendering options
screen_resolution = RES_160X120
screen_format = GRAY8
render_hud = false
render_crosshair = true
render_weapon = true
render_decals = false
render_particles = false
window_visible = false

# make episodes finish after 10500 actions (tics)
episode_timeout = 10500

# Available buttons
available_buttons =
    {
        MOVE_FORWARD
        MOVE_BACKWARD
        MOVE_LEFT
        MOVE_RIGHT
        TURN_LEFT
        TURN_RIGHT
        ATTACK
        SPEED
    }

# Game variables that will be in the state
available_game_variables =
    {
        KILLCOUNT
        AMMO2
        HEALTH
    }

mode = PLAYER
doom_skill = 2

BIN  examples/vizdoom/maps/D3_battle.wad  (new file, binary file not shown)

48  examples/vizdoom/maps/D4_battle2.cfg  (new file)

@@ -0,0 +1,48 @@
# Lines starting with # are treated as comments (or with whitespaces+#).
# It doesn't matter if you use capital letters or not.
# It doesn't matter if you use underscore or camel notation for keys, e.g. episode_timeout is the same as episodeTimeout.

doom_scenario_path = D4_battle2.wad
doom_map = map01

# Rewards

living_reward = 0
death_penalty = 100

# Rendering options
screen_resolution = RES_160X120
screen_format = GRAY8
render_hud = false
render_crosshair = true
render_weapon = true
render_decals = false
render_particles = false
window_visible = false

# make episodes finish after 10500 actions (tics)
episode_timeout = 10500

# Available buttons
available_buttons =
    {
        MOVE_FORWARD
        MOVE_BACKWARD
        MOVE_LEFT
        MOVE_RIGHT
        TURN_LEFT
        TURN_RIGHT
        ATTACK
        SPEED
    }

# Game variables that will be in the state
available_game_variables =
    {
        KILLCOUNT
        AMMO2
        HEALTH
    }

mode = PLAYER
doom_skill = 2

BIN  examples/vizdoom/maps/D4_battle2.wad  (new file, binary file not shown)

3  examples/vizdoom/maps/README.md  (new file)

@@ -0,0 +1,3 @@
The D1-D4 maps are from https://github.com/intel-isl/DirectFuturePrediction/

More maps and cfg files: https://github.com/mwydmuch/ViZDoom/tree/master/scenarios

71  examples/vizdoom/maps/spectator.py  (new file)

@@ -0,0 +1,71 @@
#!/usr/bin/env python3

#####################################################################
# This script presents SPECTATOR mode. In SPECTATOR mode you play and
# your agent can learn from it.
# Configuration is loaded from "../../scenarios/<SCENARIO_NAME>.cfg" file.
#
# To see the scenario description go to "../../scenarios/README.md"
#####################################################################

from __future__ import print_function

from time import sleep
import vizdoom as vzd
from argparse import ArgumentParser
# import cv2


if __name__ == "__main__":
    parser = ArgumentParser("ViZDoom example showing how to use SPECTATOR mode.")
    parser.add_argument('-c', type=str, dest="config", default="D3_battle.cfg")
    parser.add_argument('-w', type=str, dest="wad_file", default="D3_battle.wad")
    args = parser.parse_args()
    game = vzd.DoomGame()

    # Choose the scenario config file you wish to watch.
    # Don't load two configs because the second will overwrite the first one.
    # Multiple config files are ok but combining these ones doesn't make much sense.

    game.load_config(args.config)
    game.set_doom_scenario_path(args.wad_file)
    # Enables freelook in engine
    game.add_game_args("+freelook 1")

    game.set_screen_resolution(vzd.ScreenResolution.RES_640X480)

    # Enables spectator mode, so you can play.
    # Sounds strange but it is the agent who is supposed to watch, not you.
    game.set_window_visible(True)
    game.set_mode(vzd.Mode.SPECTATOR)

    game.init()

    episodes = 1

    for i in range(episodes):
        print("Episode #" + str(i + 1))

        game.new_episode()
        while not game.is_episode_finished():
            state = game.get_state()
            print(state.screen_buffer.dtype, state.screen_buffer.shape)
            # cv2.imwrite(f'imgs/{state.number}.png', state.screen_buffer)

            # game.make_action([0, 0, 0])
            game.advance_action()
            last_action = game.get_last_action()
            reward = game.get_last_reward()

            print("State #" + str(state.number))
            print("Game variables: ", state.game_variables)
            print("Action:", last_action)
            print("Reward:", reward)
            print("=====================")

        print("Episode finished!")
        print("Total reward:", game.get_total_reward())
        print("************************")
        sleep(2.0)

    game.close()

1  examples/vizdoom/network.py  (symbolic link, new file)

@@ -0,0 +1 @@
../atari/atari_network.py

35  examples/vizdoom/replay.py  (new executable file)

@@ -0,0 +1,35 @@
# import cv2
import sys
import time
import tqdm
import vizdoom as vzd


def main(cfg_path="maps/D3_battle.cfg", lmp_path="test.lmp"):
    game = vzd.DoomGame()
    game.load_config(cfg_path)
    game.set_screen_format(vzd.ScreenFormat.CRCGCB)
    game.set_screen_resolution(vzd.ScreenResolution.RES_1024X576)
    game.set_window_visible(True)
    game.set_render_hud(True)
    game.init()
    game.replay_episode(lmp_path)

    killcount = 0
    with tqdm.trange(10500) as tq:
        while not game.is_episode_finished():
            game.advance_action()
            state = game.get_state()
            if state is None:
                break
            killcount = game.get_game_variable(vzd.GameVariable.KILLCOUNT)
            time.sleep(1 / 35)
            # cv2.imwrite(f"imgs/{tq.n}.png",
            #             state.screen_buffer.transpose(1, 2, 0)[..., ::-1])
            tq.update(1)
    game.close()
    print("killcount:", killcount)


if __name__ == '__main__':
    main(*sys.argv[-2:])

BIN  examples/vizdoom/results/c51/d3.lmp  (new file, binary file not shown)
BIN  examples/vizdoom/results/c51/d4.lmp  (new file, binary file not shown)
BIN  examples/vizdoom/results/c51/length.png  (new file, 111 KiB, binary file not shown)
BIN  examples/vizdoom/results/c51/reward.png  (new file, 90 KiB, binary file not shown)

177  examples/vizdoom/vizdoom_c51.py  (new file)

@@ -0,0 +1,177 @@
import os
import torch
import pprint
import argparse
import numpy as np
from torch.utils.tensorboard import SummaryWriter

from tianshou.policy import C51Policy
from tianshou.utils import BasicLogger
from tianshou.env import SubprocVectorEnv
from tianshou.trainer import offpolicy_trainer
from tianshou.data import Collector, VectorReplayBuffer

from env import Env
from network import C51


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--task', type=str, default='D1_basic')
    parser.add_argument('--seed', type=int, default=0)
    parser.add_argument('--eps-test', type=float, default=0.005)
    parser.add_argument('--eps-train', type=float, default=1.)
    parser.add_argument('--eps-train-final', type=float, default=0.05)
    parser.add_argument('--buffer-size', type=int, default=2000000)
    parser.add_argument('--lr', type=float, default=0.0001)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--num-atoms', type=int, default=51)
    parser.add_argument('--v-min', type=float, default=-10.)
    parser.add_argument('--v-max', type=float, default=10.)
    parser.add_argument('--n-step', type=int, default=3)
    parser.add_argument('--target-update-freq', type=int, default=500)
    parser.add_argument('--epoch', type=int, default=300)
    parser.add_argument('--step-per-epoch', type=int, default=100000)
    parser.add_argument('--step-per-collect', type=int, default=10)
    parser.add_argument('--update-per-step', type=float, default=0.1)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--training-num', type=int, default=10)
    parser.add_argument('--test-num', type=int, default=100)
    parser.add_argument('--logdir', type=str, default='log')
    parser.add_argument('--render', type=float, default=0.)
    parser.add_argument(
        '--device', type=str,
        default='cuda' if torch.cuda.is_available() else 'cpu')
    parser.add_argument('--frames-stack', type=int, default=4)
    parser.add_argument('--skip-num', type=int, default=4)
    parser.add_argument('--resume-path', type=str, default=None)
    parser.add_argument('--watch', default=False, action='store_true',
                        help='watch the play of pre-trained policy only')
    parser.add_argument('--save-lmp', default=False, action='store_true',
                        help='save lmp file for replay whole episode')
    parser.add_argument('--save-buffer-name', type=str, default=None)
    return parser.parse_args()


def test_c51(args=get_args()):
    args.cfg_path = f"maps/{args.task}.cfg"
    args.wad_path = f"maps/{args.task}.wad"
    args.res = (args.skip_num, 84, 84)
    env = Env(args.cfg_path, args.frames_stack, args.res)
    args.state_shape = args.res
    args.action_shape = env.action_space.shape or env.action_space.n
    # should be N_FRAMES x H x W
    print("Observations shape:", args.state_shape)
    print("Actions shape:", args.action_shape)
    # make environments
    train_envs = SubprocVectorEnv([
        lambda: Env(args.cfg_path, args.frames_stack, args.res)
        for _ in range(args.training_num)])
    test_envs = SubprocVectorEnv([
        lambda: Env(args.cfg_path, args.frames_stack,
                    args.res, args.save_lmp)
        for _ in range(min(os.cpu_count() - 1, args.test_num))])
    # seed
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    train_envs.seed(args.seed)
    test_envs.seed(args.seed)
    # define model
    net = C51(*args.state_shape, args.action_shape,
              args.num_atoms, args.device)
    optim = torch.optim.Adam(net.parameters(), lr=args.lr)
    # define policy
    policy = C51Policy(
        net, optim, args.gamma, args.num_atoms, args.v_min, args.v_max,
        args.n_step, target_update_freq=args.target_update_freq
    ).to(args.device)
    # load a previous policy
    if args.resume_path:
        policy.load_state_dict(torch.load(args.resume_path, map_location=args.device))
        print("Loaded agent from: ", args.resume_path)
    # replay buffer: `save_last_obs` and `stack_num` can be removed together
    # when you have enough RAM
    buffer = VectorReplayBuffer(
        args.buffer_size, buffer_num=len(train_envs), ignore_obs_next=True,
        save_only_last_obs=True, stack_num=args.frames_stack)
    # collector
    train_collector = Collector(policy, train_envs, buffer, exploration_noise=True)
    test_collector = Collector(policy, test_envs, exploration_noise=True)
    # log
    log_path = os.path.join(args.logdir, args.task, 'c51')
    writer = SummaryWriter(log_path)
    writer.add_text("args", str(args))
    logger = BasicLogger(writer)

    def save_fn(policy):
        torch.save(policy.state_dict(), os.path.join(log_path, 'policy.pth'))

    def stop_fn(mean_rewards):
        if env.spec.reward_threshold:
            return mean_rewards >= env.spec.reward_threshold
        elif 'Pong' in args.task:
            return mean_rewards >= 20
        else:
            return False

    def train_fn(epoch, env_step):
        # nature DQN setting, linear decay in the first 1M steps
        if env_step <= 1e6:
            eps = args.eps_train - env_step / 1e6 * \
                (args.eps_train - args.eps_train_final)
        else:
            eps = args.eps_train_final
        policy.set_eps(eps)
        logger.write('train/eps', env_step, eps)

    def test_fn(epoch, env_step):
        policy.set_eps(args.eps_test)

    # watch agent's performance
    def watch():
        print("Setup test envs ...")
        policy.eval()
        policy.set_eps(args.eps_test)
        test_envs.seed(args.seed)
        if args.save_buffer_name:
            print(f"Generate buffer with size {args.buffer_size}")
            buffer = VectorReplayBuffer(
                args.buffer_size, buffer_num=len(test_envs),
                ignore_obs_next=True, save_only_last_obs=True,
                stack_num=args.frames_stack)
            collector = Collector(policy, test_envs, buffer,
                                  exploration_noise=True)
            result = collector.collect(n_step=args.buffer_size)
            print(f"Save buffer into {args.save_buffer_name}")
            # Unfortunately, pickle will cause oom with 1M buffer size
            buffer.save_hdf5(args.save_buffer_name)
        else:
            print("Testing agent ...")
            test_collector.reset()
            result = test_collector.collect(n_episode=args.test_num,
                                            render=args.render)
        rew = result["rews"].mean()
        lens = result["lens"].mean() * args.skip_num
        print(f'Mean reward (over {result["n/ep"]} episodes): {rew}')
        print(f'Mean length (over {result["n/ep"]} episodes): {lens}')

    if args.watch:
        watch()
        exit(0)

    # test train_collector and start filling replay buffer
    train_collector.collect(n_step=args.batch_size * args.training_num)
    # trainer
    result = offpolicy_trainer(
        policy, train_collector, test_collector, args.epoch,
        args.step_per_epoch, args.step_per_collect, args.test_num,
        args.batch_size, train_fn=train_fn, test_fn=test_fn,
        stop_fn=stop_fn, save_fn=save_fn, logger=logger,
        update_per_step=args.update_per_step, test_in_train=False)

    pprint.pprint(result)
    watch()


if __name__ == '__main__':
    test_c51(get_args())

2  setup.py

@@ -55,7 +55,7 @@ setup(
     ],
     extras_require={
         "dev": [
-            "Sphinx",
+            "sphinx<4",
             "sphinx_rtd_theme",
             "sphinxcontrib-bibtex",
             "flake8",

@@ -22,7 +22,7 @@ def get_args():
     parser.add_argument('--step-per-epoch', type=int, default=1000)
     parser.add_argument('--episode-per-collect', type=int, default=1)
     parser.add_argument('--training-num', type=int, default=1)
-    parser.add_argument('--test-num', type=int, default=100)
+    parser.add_argument('--test-num', type=int, default=10)
     parser.add_argument('--logdir', type=str, default='log')
     parser.add_argument('--render', type=float, default=0.0)
     parser.add_argument('--rew-mean-prior', type=float, default=0.0)
@@ -36,12 +36,12 @@ def get_args():
 def test_psrl(args=get_args()):
     env = gym.make(args.task)
     if args.task == "NChain-v0":
-        env.spec.reward_threshold = 3647  # described in PSRL paper
+        env.spec.reward_threshold = 3400
+        # env.spec.reward_threshold = 3647  # described in PSRL paper
     print("reward threshold:", env.spec.reward_threshold)
     args.state_shape = env.observation_space.shape or env.observation_space.n
     args.action_shape = env.action_space.shape or env.action_space.n
     # train_envs = gym.make(args.task)
-    # train_envs = gym.make(args.task)
     train_envs = DummyVectorEnv(
         [lambda: gym.make(args.task) for _ in range(args.training_num)])
     # test_envs = gym.make(args.task)

@@ -1,7 +1,7 @@
 from tianshou import data, env, utils, policy, trainer, exploration


-__version__ = "0.4.1"
+__version__ = "0.4.2"

 __all__ = [
     "env",

@@ -526,7 +526,12 @@
             elif all(isinstance(e, (Batch, dict)) for e in v):  # third often
                 self.__dict__[k] = Batch.stack(v, axis)
             else:  # most often case is np.ndarray
-                self.__dict__[k] = _to_array_with_correct_type(np.stack(v, axis))
+                try:
+                    self.__dict__[k] = _to_array_with_correct_type(np.stack(v, axis))
+                except ValueError:
+                    warnings.warn("You are using tensors with different shape,"
+                                  " fallback to dtype=object by default.")
+                    self.__dict__[k] = np.array(v, dtype=object)
         # all the keys
         keys_total = set.union(*[set(b.keys()) for b in batches])
         # keys that are reserved in all batches
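
A minimal sketch of the new fallback (variable names are illustrative): stacking batches whose arrays have mismatched shapes now emits a warning and degrades to a `dtype=object` array instead of raising.

```python
import numpy as np
from tianshou.data import Batch

b1 = Batch(obs=np.zeros((3, 4)))
b2 = Batch(obs=np.zeros((5, 4)))  # mismatched first dimension
stacked = Batch.stack([b1, b2])   # warns, falls back to dtype=object
print(stacked.obs.dtype)          # object
```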

@@ -53,6 +53,7 @@ class ReplayBuffer:
         self._save_only_last_obs = save_only_last_obs
         self._sample_avail = sample_avail
         self._meta: Batch = Batch()
+        self._ep_rew: Union[float, np.ndarray]
         self.reset()

     def __len__(self) -> int:

@@ -56,7 +56,8 @@ class Collector(object):
         exploration_noise: bool = False,
     ) -> None:
         super().__init__()
-        if not isinstance(env, BaseVectorEnv):
+        if isinstance(env, gym.Env) and not hasattr(env, "__len__"):
+            warnings.warn("Single environment detected, wrap to DummyVectorEnv.")
             env = DummyVectorEnv([lambda: env])
         self.env = env
         self.env_num = len(env)
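
For illustration only, the same check outside the `Collector` (the environment id is just an example): a bare `gym.Env` now triggers an explicit warning in `Collector.__init__` before being wrapped.

```python
import gym
from tianshou.env import DummyVectorEnv

single_env = gym.make("CartPole-v0")
# the check Collector.__init__ now performs before wrapping:
if isinstance(single_env, gym.Env) and not hasattr(single_env, "__len__"):
    venv = DummyVectorEnv([lambda: single_env])
print(len(venv))  # 1
```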

@@ -223,7 +224,8 @@ class Collector(object):
             # get bounded and remapped actions first (not saved into buffer)
             action_remap = self.policy.map_action(self.data.act)
             # step in env
-            obs_next, rew, done, info = self.env.step(action_remap, id=ready_env_ids)
+            obs_next, rew, done, info = self.env.step(
+                action_remap, ready_env_ids)  # type: ignore

             self.data.update(obs_next=obs_next, rew=rew, done=done, info=info)
             if self.preprocess_fn:
@@ -426,7 +428,8 @@ class AsyncCollector(Collector):
             # get bounded and remapped actions first (not saved into buffer)
             action_remap = self.policy.map_action(self.data.act)
             # step in env
-            obs_next, rew, done, info = self.env.step(action_remap, id=ready_env_ids)
+            obs_next, rew, done, info = self.env.step(
+                action_remap, ready_env_ids)  # type: ignore

             # change self.data here because ready_env_ids has changed
             ready_env_ids = np.array([i["env_id"] for i in info])

@@ -75,8 +75,8 @@ class DiscreteCRRPolicy(PGPolicy):
         self._min_q_weight = min_q_weight

     def sync_weight(self) -> None:
-        self.actor_old.load_state_dict(self.actor.state_dict())  # type: ignore
-        self.critic_old.load_state_dict(self.critic.state_dict())  # type: ignore
+        self.actor_old.load_state_dict(self.actor.state_dict())
+        self.critic_old.load_state_dict(self.critic.state_dict())

     def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:  # type: ignore
         if self._target and self._iter % self._freq == 0:

@@ -72,7 +72,7 @@ class DQNPolicy(BasePolicy):

     def sync_weight(self) -> None:
         """Synchronize the weight for the target network."""
-        self.model_old.load_state_dict(self.model.state_dict())  # type: ignore
+        self.model_old.load_state_dict(self.model.state_dict())

     def _target_q(self, buffer: ReplayBuffer, indice: np.ndarray) -> torch.Tensor:
         batch = buffer[indice]  # batch.obs_next: s_{t+n}

@@ -64,6 +64,7 @@ class TRPOPolicy(NPGPolicy):
         self._max_backtracks = max_backtracks
         self._delta = max_kl
         self._backtrack_coeff = backtrack_coeff
+        self._optim_critic_iters: int

     def learn(  # type: ignore
         self, batch: Batch, batch_size: int, repeat: int, **kwargs: Any

@@ -88,7 +88,7 @@ class BasicLogger(BaseLogger):
     You can also rewrite write() func to use your own writer.

     :param SummaryWriter writer: the writer to log data.
-    :param int train_interval: the log interval in log_train_data(). Default to 1.
+    :param int train_interval: the log interval in log_train_data(). Default to 1000.
     :param int test_interval: the log interval in log_test_data(). Default to 1.
     :param int update_interval: the log interval in log_update_data(). Default to 1000.
     :param int save_interval: the save interval in save_data(). Default to 1 (save at
@@ -98,7 +98,7 @@ class BasicLogger(BaseLogger):
     def __init__(
         self,
         writer: SummaryWriter,
-        train_interval: int = 1,
+        train_interval: int = 1000,
         test_interval: int = 1,
         update_interval: int = 1000,
         save_interval: int = 1,
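
A hedged usage sketch of the new default (the writer path is illustrative): training statistics are now written every 1000 environment steps unless `train_interval` is overridden.

```python
from torch.utils.tensorboard import SummaryWriter
from tianshou.utils import BasicLogger

writer = SummaryWriter("log/example")   # illustrative log directory
logger = BasicLogger(writer)            # new default: train_interval=1000
verbose_logger = BasicLogger(writer, train_interval=1)  # restores the old behaviour
```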