Tianshou/tianshou/trainer/utils.py

import time
import numpy as np
from typing import Any, Dict, Union, Callable, Optional

from tianshou.data import Collector
from tianshou.policy import BasePolicy
from tianshou.utils import BaseLogger


def test_episode(
    policy: BasePolicy,
    collector: Collector,
    test_fn: Optional[Callable[[int, Optional[int]], None]],
    epoch: int,
    n_episode: int,
    logger: Optional[BaseLogger] = None,
    global_step: Optional[int] = None,
    reward_metric: Optional[Callable[[np.ndarray], np.ndarray]] = None,
) -> Dict[str, Any]:
    """A simple wrapper of testing policy in collector."""
    collector.reset_env()
    collector.reset_buffer()
    policy.eval()
    if test_fn:
        test_fn(epoch, global_step)
    result = collector.collect(n_episode=n_episode)
    if reward_metric:
        result["rews"] = reward_metric(result["rews"])
    if logger and global_step is not None:
        logger.log_test_data(result, global_step)
    return result


def gather_info(
    start_time: float,
    train_c: Optional[Collector],
    test_c: Collector,
    best_reward: float,
    best_reward_std: float,
) -> Dict[str, Union[float, str]]:
    """A simple wrapper of gathering information from collectors.

    :return: A dictionary with the following keys:

        * ``train_step`` the total collected step of training collector;
        * ``train_episode`` the total collected episode of training collector;
        * ``train_time/collector`` the time for collecting transitions in the \
            training collector;
        * ``train_time/model`` the time for training models;
        * ``train_speed`` the speed of training (env_step per second);
        * ``test_step`` the total collected step of test collector;
        * ``test_episode`` the total collected episode of test collector;
        * ``test_time`` the time for testing;
        * ``test_speed`` the speed of testing (env_step per second);
        * ``best_reward`` the best reward over the test results;
        * ``duration`` the total elapsed time.
    """
    duration = time.time() - start_time
    model_time = duration - test_c.collect_time
    test_speed = test_c.collect_step / test_c.collect_time
    result: Dict[str, Union[float, str]] = {
        "test_step": test_c.collect_step,
        "test_episode": test_c.collect_episode,
        "test_time": f"{test_c.collect_time:.2f}s",
        "test_speed": f"{test_speed:.2f} step/s",
        "best_reward": best_reward,
        "best_result": f"{best_reward:.2f} ± {best_reward_std:.2f}",
        "duration": f"{duration:.2f}s",
        "train_time/model": f"{model_time:.2f}s",
    }
    if train_c is not None:
        model_time -= train_c.collect_time
        train_speed = train_c.collect_step / (duration - test_c.collect_time)
        result.update({
            "train_step": train_c.collect_step,
            "train_episode": train_c.collect_episode,
            "train_time/collector": f"{train_c.collect_time:.2f}s",
            "train_time/model": f"{model_time:.2f}s",
            "train_speed": f"{train_speed:.2f} step/s",
        })
    return result
ppo and early stop 2020-03-20 19:52:29 +08:00			`import time`
add pytorch drl result 2020-03-27 09:04:29 +08:00			`import numpy as np`
Step collector implementation (#280) This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix #245. You can check #274 for more detail. Things changed in this PR: 1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv; 2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) 3. add policy.exploration_noise(act, batch) -> act 4. small change in BasePolicy.compute_*_returns 5. move reward_metric from collector to trainer 6. fix np.asanyarray issue (different version's numpy will result in different output) 7. flake8 maxlength=88 8. polish docs and fix test Co-authored-by: n+e <trinkle23897@gmail.com> 2021-02-19 10:33:49 +08:00			`from typing import Any, Dict, Union, Callable, Optional`
ppo and early stop 2020-03-20 19:52:29 +08:00
add type annotation 2020-05-12 11:31:47 +08:00			`from tianshou.data import Collector`
			`from tianshou.policy import BasePolicy`
add logger (#295) This PR focus on refactor of logging method to solve bug of nan reward and log interval. After these two pr, hopefully fundamental change of tianshou/data is finished. We then can concentrate on building benchmarks of tianshou finally. Things changed: 1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer; 2. remove utils.SummaryWriter; 2021-02-24 14:48:42 +08:00			`from tianshou.utils import BaseLogger`
ppo and early stop 2020-03-20 19:52:29 +08:00
add type annotation 2020-05-12 11:31:47 +08:00
			`def test_episode(`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`policy: BasePolicy,`
			`collector: Collector,`
change API of train_fn and test_fn (#229) train_fn(epoch) -> train_fn(epoch, num_env_step) test_fn(epoch) -> test_fn(epoch, num_env_step) 2020-09-26 16:35:37 +08:00			`test_fn: Optional[Callable[[int, Optional[int]], None]],`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`epoch: int,`
Step collector implementation (#280) This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix #245. You can check #274 for more detail. Things changed in this PR: 1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv; 2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) 3. add policy.exploration_noise(act, batch) -> act 4. small change in BasePolicy.compute_*_returns 5. move reward_metric from collector to trainer 6. fix np.asanyarray issue (different version's numpy will result in different output) 7. flake8 maxlength=88 8. polish docs and fix test Co-authored-by: n+e <trinkle23897@gmail.com> 2021-02-19 10:33:49 +08:00			`n_episode: int,`
add logger (#295) This PR focus on refactor of logging method to solve bug of nan reward and log interval. After these two pr, hopefully fundamental change of tianshou/data is finished. We then can concentrate on building benchmarks of tianshou finally. Things changed: 1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer; 2. remove utils.SummaryWriter; 2021-02-24 14:48:42 +08:00			`logger: Optional[BaseLogger] = None,`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`global_step: Optional[int] = None,`
Step collector implementation (#280) This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix #245. You can check #274 for more detail. Things changed in this PR: 1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv; 2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) 3. add policy.exploration_noise(act, batch) -> act 4. small change in BasePolicy.compute_*_returns 5. move reward_metric from collector to trainer 6. fix np.asanyarray issue (different version's numpy will result in different output) 7. flake8 maxlength=88 8. polish docs and fix test Co-authored-by: n+e <trinkle23897@gmail.com> 2021-02-19 10:33:49 +08:00			`reward_metric: Optional[Callable[[np.ndarray], np.ndarray]] = None,`
			`) -> Dict[str, Any]:`
docs for env 2020-04-04 21:02:06 +08:00			`"""A simple wrapper of testing policy in collector."""`
ppo and early stop 2020-03-20 19:52:29 +08:00			`collector.reset_env()`
			`collector.reset_buffer()`
			`policy.eval()`
			`if test_fn:`
change API of train_fn and test_fn (#229) train_fn(epoch) -> train_fn(epoch, num_env_step) test_fn(epoch) -> test_fn(epoch, num_env_step) 2020-09-26 16:35:37 +08:00			`test_fn(epoch, global_step)`
optimize training procedure and improve code coverage (#189) 1. add policy.eval() in all test scripts' "watch performance" 2. remove dict return support for collector preprocess_fn 3. add `__contains__` and `pop` in batch: `key in batch`, `batch.pop(key, deft)` 4. exact n_episode for a list of n_episode limitation and save fake data in cache_buffer when self.buffer is None (#184) 5. fix tensorboard logging: h-axis stands for env step instead of gradient step; add test results into tensorboard 6. add test_returns (both GAE and nstep) 7. change the type-checking order in batch.py and converter.py in order to meet the most often case first 8. fix shape inconsistency for torch.Tensor in replay buffer 9. remove `**kwargs` in ReplayBuffer 10. remove default value in batch.split() and add merge_last argument (#185) 11. improve nstep efficiency 12. add max_batchsize in onpolicy algorithms 13. potential bugfix for subproc.wait 14. fix RecurrentActorProb 15. improve the code-coverage (from 90% to 95%) and remove the dead code 16. fix some incorrect type annotation The above improvement also increases the training FPS: on my computer, the previous version is only ~1800 FPS and after that, it can reach ~2050 (faster than v0.2.4.post1). 2020-08-27 12:15:18 +08:00			`result = collector.collect(n_episode=n_episode)`
Step collector implementation (#280) This is the third PR of 6 commits mentioned in #274, which features refactor of Collector to fix #245. You can check #274 for more detail. Things changed in this PR: 1. refactor collector to be more cleaner, split AsyncCollector to support asyncvenv; 2. change buffer.add api to add(batch, bffer_ids); add several types of buffer (VectorReplayBuffer, PrioritizedVectorReplayBuffer, etc.) 3. add policy.exploration_noise(act, batch) -> act 4. small change in BasePolicy.compute_*_returns 5. move reward_metric from collector to trainer 6. fix np.asanyarray issue (different version's numpy will result in different output) 7. flake8 maxlength=88 8. polish docs and fix test Co-authored-by: n+e <trinkle23897@gmail.com> 2021-02-19 10:33:49 +08:00			`if reward_metric:`
			`result["rews"] = reward_metric(result["rews"])`
add logger (#295) This PR focus on refactor of logging method to solve bug of nan reward and log interval. After these two pr, hopefully fundamental change of tianshou/data is finished. We then can concentrate on building benchmarks of tianshou finally. Things changed: 1. trainer now accepts logger (BasicLogger or LazyLogger) instead of writer; 2. remove utils.SummaryWriter; 2021-02-24 14:48:42 +08:00			`if logger and global_step is not None:`
			`logger.log_test_data(result, global_step)`
optimize training procedure and improve code coverage (#189) 1. add policy.eval() in all test scripts' "watch performance" 2. remove dict return support for collector preprocess_fn 3. add `__contains__` and `pop` in batch: `key in batch`, `batch.pop(key, deft)` 4. exact n_episode for a list of n_episode limitation and save fake data in cache_buffer when self.buffer is None (#184) 5. fix tensorboard logging: h-axis stands for env step instead of gradient step; add test results into tensorboard 6. add test_returns (both GAE and nstep) 7. change the type-checking order in batch.py and converter.py in order to meet the most often case first 8. fix shape inconsistency for torch.Tensor in replay buffer 9. remove `**kwargs` in ReplayBuffer 10. remove default value in batch.split() and add merge_last argument (#185) 11. improve nstep efficiency 12. add max_batchsize in onpolicy algorithms 13. potential bugfix for subproc.wait 14. fix RecurrentActorProb 15. improve the code-coverage (from 90% to 95%) and remove the dead code 16. fix some incorrect type annotation The above improvement also increases the training FPS: on my computer, the previous version is only ~1800 FPS and after that, it can reach ~2050 (faster than v0.2.4.post1). 2020-08-27 12:15:18 +08:00			`return result`
ppo and early stop 2020-03-20 19:52:29 +08:00

code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`def gather_info(`
			`start_time: float,`
Add offline trainer and discrete BCQ algorithm (#263) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com> 2021-01-20 02:13:04 -08:00			`train_c: Optional[Collector],`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`test_c: Collector,`
			`best_reward: float,`
change the step in trainer (#235) This PR separates the `global_step` into `env_step` and `gradient_step`. In the future, the data from the collecting state will be stored under `env_step`, and the data from the updating state will be stored under `gradient_step`. Others: - add `rew_std` and `best_result` into the monitor - fix network unbounded in `test/continuous/test_sac_with_il.py` and `examples/box2d/bipedal_hardcore_sac.py` - change the dependency of ray to 1.0.0 since ray-project/ray#10134 has been resolved 2020-10-04 21:55:43 +08:00			`best_reward_std: float,`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`) -> Dict[str, Union[float, str]]:`
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			`"""A simple wrapper of gathering information from collectors.`
add some docs 2020-04-03 21:28:12 +08:00
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			`:return: A dictionary with the following keys:`
add some docs 2020-04-03 21:28:12 +08:00
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			* ``train_step`` the total collected step of training collector;
			* ``train_episode`` the total collected episode of training collector;
Trainer refactor : some definition change (#293) This PR focus on some definition change of trainer to make it more friendly to use and be consistent with typical usage in research papers, typically change `collect-per-step` to `step-per-collect`, add `update-per-step` / `episode-per-collect` accordingly, and modify the documentation. 2021-02-21 13:06:02 +08:00			* ``train_time/collector`` the time for collecting transitions in the \
add some docs 2020-04-03 21:28:12 +08:00			`training collector;`
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			* ``train_time/model`` the time for training models;
Trainer refactor : some definition change (#293) This PR focus on some definition change of trainer to make it more friendly to use and be consistent with typical usage in research papers, typically change `collect-per-step` to `step-per-collect`, add `update-per-step` / `episode-per-collect` accordingly, and modify the documentation. 2021-02-21 13:06:02 +08:00			* ``train_speed`` the speed of training (env_step per second);
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			* ``test_step`` the total collected step of test collector;
			* ``test_episode`` the total collected episode of test collector;
			* ``test_time`` the time for testing;
Trainer refactor : some definition change (#293) This PR focus on some definition change of trainer to make it more friendly to use and be consistent with typical usage in research papers, typically change `collect-per-step` to `step-per-collect`, add `update-per-step` / `episode-per-collect` accordingly, and modify the documentation. 2021-02-21 13:06:02 +08:00			* ``test_speed`` the speed of testing (env_step per second);
add docs of collector and trainer (#20) 2020-04-05 18:34:45 +08:00			* ``best_reward`` the best reward over the test results;
			* ``duration`` the total elapsed time.
add some docs 2020-04-03 21:28:12 +08:00			`"""`
ppo and early stop 2020-03-20 19:52:29 +08:00			`duration = time.time() - start_time`
Add offline trainer and discrete BCQ algorithm (#263) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com> 2021-01-20 02:13:04 -08:00			`model_time = duration - test_c.collect_time`
ppo and early stop 2020-03-20 19:52:29 +08:00			`test_speed = test_c.collect_step / test_c.collect_time`
Add offline trainer and discrete BCQ algorithm (#263) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com> 2021-01-20 02:13:04 -08:00			`result: Dict[str, Union[float, str]] = {`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`"test_step": test_c.collect_step,`
			`"test_episode": test_c.collect_episode,`
			`"test_time": f"{test_c.collect_time:.2f}s",`
			`"test_speed": f"{test_speed:.2f} step/s",`
			`"best_reward": best_reward,`
change the step in trainer (#235) This PR separates the `global_step` into `env_step` and `gradient_step`. In the future, the data from the collecting state will be stored under `env_step`, and the data from the updating state will be stored under `gradient_step`. Others: - add `rew_std` and `best_result` into the monitor - fix network unbounded in `test/continuous/test_sac_with_il.py` and `examples/box2d/bipedal_hardcore_sac.py` - change the dependency of ray to 1.0.0 since ray-project/ray#10134 has been resolved 2020-10-04 21:55:43 +08:00			`"best_result": f"{best_reward:.2f} ± {best_reward_std:.2f}",`
code format and update function signatures (#213) Cherry-pick from #200 - update the function signature - format code-style - move _compile into separate functions - fix a bug in to_torch and to_numpy (Batch) - remove None in action_range In short, the code-format only contains function-signature style and `'` -> `"`. (pick up from [black](https://github.com/psf/black)) 2020-09-12 15:39:01 +08:00			`"duration": f"{duration:.2f}s",`
Add offline trainer and discrete BCQ algorithm (#263) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com> 2021-01-20 02:13:04 -08:00			`"train_time/model": f"{model_time:.2f}s",`
ppo and early stop 2020-03-20 19:52:29 +08:00			`}`
Add offline trainer and discrete BCQ algorithm (#263) The result needs to be tuned after `done` issue fixed. Co-authored-by: n+e <trinkle23897@gmail.com> 2021-01-20 02:13:04 -08:00			`if train_c is not None:`
			`model_time -= train_c.collect_time`
			`train_speed = train_c.collect_step / (duration - test_c.collect_time)`
			`result.update({`
			`"train_step": train_c.collect_step,`
			`"train_episode": train_c.collect_episode,`
			`"train_time/collector": f"{train_c.collect_time:.2f}s",`
			`"train_time/model": f"{model_time:.2f}s",`
			`"train_speed": f"{train_speed:.2f} step/s",`
			`})`
			`return result`