Tianshou/tianshou/policy/modelfree/c51.py

from typing import Any, Dict, Optional

import numpy as np
import torch

from tianshou.data import Batch, ReplayBuffer
from tianshou.policy import DQNPolicy


class C51Policy(DQNPolicy):
    """Implementation of Categorical Deep Q-Network. arXiv:1707.06887.

    :param torch.nn.Module model: a model following the rules in
        :class:`~tianshou.policy.BasePolicy`. (s -> logits)
    :param torch.optim.Optimizer optim: a torch.optim for optimizing the model.
    :param float discount_factor: in [0, 1].
    :param int num_atoms: the number of atoms in the support set of the
        value distribution. Default to 51.
    :param float v_min: the value of the smallest atom in the support set.
        Default to -10.0.
    :param float v_max: the value of the largest atom in the support set.
        Default to 10.0.
    :param int estimation_step: the number of steps to look ahead. Default to 1.
    :param int target_update_freq: the target network update frequency (0 if
        you do not use the target network). Default to 0.
    :param bool reward_normalization: normalize the reward to Normal(0, 1).
        Default to False.

    .. seealso::

        Please refer to :class:`~tianshou.policy.DQNPolicy` for more detailed
        explanation.
    """

    def __init__(
        self,
        model: torch.nn.Module,
        optim: torch.optim.Optimizer,
        discount_factor: float = 0.99,
        num_atoms: int = 51,
        v_min: float = -10.0,
        v_max: float = 10.0,
        estimation_step: int = 1,
        target_update_freq: int = 0,
        reward_normalization: bool = False,
        **kwargs: Any,
    ) -> None:
        super().__init__(
            model, optim, discount_factor, estimation_step, target_update_freq,
            reward_normalization, **kwargs
        )
        assert num_atoms > 1, "num_atoms should be greater than 1"
        assert v_min < v_max, "v_max should be larger than v_min"
        self._num_atoms = num_atoms
        self._v_min = v_min
        self._v_max = v_max
        self.support = torch.nn.Parameter(
            torch.linspace(self._v_min, self._v_max, self._num_atoms),
            requires_grad=False,
        )
        self.delta_z = (v_max - v_min) / (num_atoms - 1)

    def _target_q(self, buffer: ReplayBuffer, indices: np.ndarray) -> torch.Tensor:
        return self.support.repeat(len(indices), 1)  # shape: [bsz, num_atoms]

    def compute_q_value(
        self, logits: torch.Tensor, mask: Optional[np.ndarray]
    ) -> torch.Tensor:
        # Q(s, a) is the expectation of the categorical return distribution.
        return super().compute_q_value((logits * self.support).sum(2), mask)

    def _target_dist(self, batch: Batch) -> torch.Tensor:
        if self._target:
            a = self(batch, input="obs_next").act
            next_dist = self(batch, model="model_old", input="obs_next").logits
        else:
            next_b = self(batch, input="obs_next")
            a = next_b.act
            next_dist = next_b.logits
        next_dist = next_dist[np.arange(len(a)), a, :]
        target_support = batch.returns.clamp(self._v_min, self._v_max)
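        # Project the clamped target atoms onto the fixed support. Target atom j
        # gives a fraction max(0, 1 - |Tz_j - z_i| / delta_z) of its probability
        # to support atom i, i.e. linear interpolation between its two neighbours.
        # ref: https://github.com/ShangtongZhang/DeepRL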
        target_dist = (
            1 - (target_support.unsqueeze(1) - self.support.view(1, -1, 1)).abs() /
            self.delta_z
        ).clamp(0, 1) * next_dist.unsqueeze(1)
        return target_dist.sum(-1)

    def learn(self, batch: Batch, **kwargs: Any) -> Dict[str, float]:
        if self._target and self._iter % self._freq == 0:
            self.sync_weight()
        self.optim.zero_grad()
        with torch.no_grad():
            target_dist = self._target_dist(batch)
        weight = batch.pop("weight", 1.0)
        curr_dist = self(batch).logits
        act = batch.act
        curr_dist = curr_dist[np.arange(len(act)), act, :]
        cross_entropy = -(target_dist * torch.log(curr_dist + 1e-8)).sum(1)
        loss = (cross_entropy * weight).mean()
        # Store the per-sample loss so a prioritized replay buffer can use it as
        # the updated priority.
        # ref: https://github.com/Kaixhin/Rainbow/blob/master/agent.py L94-100
        batch.weight = cross_entropy.detach()
        loss.backward()
        self.optim.step()
        self._iter += 1
        return {"loss": loss.item()}
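

# Illustrative sketch only, not part of the Tianshou API: a pure-Python,
# single-sample version of the projection that ``C51Policy._target_dist``
# computes with broadcasting. The helper name ``_project_onto_support`` and its
# arguments are hypothetical; they are assumptions made for this example and
# exist only to make the triangular-kernel projection easier to follow.
def _project_onto_support(target_atoms, probs, v_min, v_max, num_atoms):
    """Project mass ``probs[j]`` located at ``target_atoms[j]`` onto a fixed grid."""
    delta_z = (v_max - v_min) / (num_atoms - 1)
    support = [v_min + i * delta_z for i in range(num_atoms)]
    projected = []
    for z_i in support:
        mass = 0.0
        for tz_j, p_j in zip(target_atoms, probs):
            tz_j = min(max(tz_j, v_min), v_max)  # same clamp as in _target_dist
            # Triangular kernel: full weight when tz_j == z_i, zero weight once
            # tz_j is at least one grid step away from z_i.
            mass += max(0.0, 1.0 - abs(tz_j - z_i) / delta_z) * p_j
        projected.append(mass)
    return projected


# Usage: with three atoms at -1, 0 and 1, a point mass at 0.25 is split between
# the atoms at 0 and 1 in proportion to its distance from each:
# _project_onto_support([0.25], [1.0], -1.0, 1.0, 3)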