Basic Concepts in Tianshou
==========================

Tianshou splits the training procedure of a Reinforcement Learning agent into four parts: trainer, collector, policy, and data buffer. The general control flow is illustrated below:

.. image:: ../_static/images/concepts_arch.png
    :align: center
    :height: 300

Data Batch
----------

Tianshou provides :class:`~tianshou.data.Batch` as the internal data structure to pass any kind of data to other methods; for example, a collector gives a :class:`~tianshou.data.Batch` to the policy for learning. Here is its usage::

    >>> import numpy as np
    >>> from tianshou.data import Batch
    >>> data = Batch(a=4, b=[5, 5], c='2312312')
    >>> data.b
    [5, 5]
    >>> data.b = np.array([3, 4, 5])
    >>> len(data.b)
    3
    >>> data.b[-1]
    5

In short, you can define a :class:`~tianshou.data.Batch` with any key-value pairs.

The current implementation of Tianshou typically uses six keys in :class:`~tianshou.data.Batch`:

* ``obs``: observation of step :math:`t`;
* ``act``: action of step :math:`t`;
* ``rew``: reward of step :math:`t`;
* ``done``: the done flag of step :math:`t`;
* ``obs_next``: observation of step :math:`t+1`;
* ``info``: info of step :math:`t` (in ``gym.Env``, the ``env.step()`` function returns four values, and the last one is ``info``).

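For illustration, a single transition can be wrapped into a :class:`~tianshou.data.Batch` with exactly these six keys (a minimal sketch; the observation and reward values below are made up)::

    >>> import numpy as np
    >>> from tianshou.data import Batch
    >>> transition = Batch(
    ...     obs=np.array([0.1, 0.2]),       # observation of step t
    ...     act=1,                          # action of step t
    ...     rew=0.5,                        # reward of step t
    ...     done=False,                     # done flag of step t
    ...     obs_next=np.array([0.3, 0.4]),  # observation of step t + 1
    ...     info={})                        # info dict from env.step()
    >>> transition.act
    1
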
:class:`~tianshou.data.Batch` has other methods, including ``__getitem__``, ``append``, and ``split``::

    >>> data = Batch(obs=np.array([0, 11, 22]), rew=np.array([6, 6, 6]))
    >>> # here we test __getitem__
    >>> index = [2, 1]
    >>> data[index].obs
    array([22, 11])

    >>> data.append(data)  # similar to list.append: fields are concatenated
    >>> data.obs
    array([ 0, 11, 22,  0, 11, 22])

    >>> # split the whole data into multiple small batches
    >>> for d in data.split(size=2, permute=False):
    ...     print(d.obs, d.rew)
    [ 0 11] [6 6]
    [22  0] [6 6]
    [11 22] [6 6]

Data Buffer
-----------

:class:`~tianshou.data.ReplayBuffer` stores data generated from the interaction between the policy and the environment. It stores the six types of data mentioned above (seven, counting the importance weight in :class:`~tianshou.data.PrioritizedReplayBuffer`). Here is the :class:`~tianshou.data.ReplayBuffer`'s usage::

    >>> from tianshou.data import ReplayBuffer
    >>> buf = ReplayBuffer(size=20)
    >>> for i in range(3):
    ...     buf.add(obs=i, act=i, rew=i, done=i, obs_next=i + 1, info={})
    >>> buf.obs
    # since we set size = 20, len(buf.obs) == 20.
    array([0., 1., 2., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0.])

    >>> buf2 = ReplayBuffer(size=10)
    >>> for i in range(15):
    ...     buf2.add(obs=i, act=i, rew=i, done=i, obs_next=i + 1, info={})
    >>> buf2.obs
    # since its size = 10, it only stores the last 10 steps' result.
    array([10., 11., 12., 13., 14.,  5.,  6.,  7.,  8.,  9.])

    >>> # move buf2's result into buf (keeping the chronological order)
    >>> buf.update(buf2)
    >>> buf.obs
    array([ 0.,  1.,  2.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14.,
            0.,  0.,  0.,  0.,  0.,  0.,  0.])

    >>> # get a random sample from the buffer;
    >>> # batch_data is equal to buf[indice].
    >>> batch_data, indice = buf.sample(batch_size=4)
    >>> batch_data.obs == buf[indice].obs
    array([ True,  True,  True,  True])

The :class:`~tianshou.data.ReplayBuffer` is based on ``numpy.ndarray``. Tianshou provides other types of data buffer, such as :class:`~tianshou.data.ListReplayBuffer` (based on a list) and :class:`~tianshou.data.PrioritizedReplayBuffer` (based on a segment tree and ``numpy.ndarray``). Check out the API documentation for more detail.

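The circular behavior shown above (``buf2`` overwriting its oldest entries) follows from this ndarray-based design. Here is a simplified sketch of the idea, not Tianshou's actual implementation::

    import numpy as np

    class RingBuffer:
        """A stripped-down circular buffer over a fixed-size ndarray."""

        def __init__(self, size):
            self.obs = np.zeros(size)  # pre-allocated storage, like ReplayBuffer
            self._index = 0            # next position to write
            self._size = size

        def add(self, obs):
            self.obs[self._index] = obs
            # wrap around once the buffer is full, overwriting the oldest data
            self._index = (self._index + 1) % self._size
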
Policy
------

Tianshou aims to modularize RL algorithms. Policies in Tianshou are therefore organized into several classes, all of which must inherit from :class:`~tianshou.policy.BasePolicy`.

For demonstration, we take the source code of the policy gradient algorithm, :class:`~tianshou.policy.PGPolicy`. Policy gradient computes each frame's return as:

.. math::

    G_t = \sum_{i=t}^{T} \gamma^{i - t} r_i = r_t + \gamma r_{t + 1} + \cdots + \gamma^{T - t} r_T

where :math:`T` is the terminal time step and :math:`\gamma \in (0, 1]` is the discount factor.

This return computation is done in the policy's ``process_fn`` method, which pre-processes a sampled batch before learning.

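To make the formula concrete, here is a minimal sketch of such a return computation (a hypothetical ``discounted_returns`` helper for illustration, not the actual source of ``process_fn``)::

    import numpy as np

    def discounted_returns(rew, done, gamma):
        """Compute G_t = r_t + gamma * r_{t+1} + ... for every step,
        restarting at episode boundaries marked by ``done``."""
        returns = np.zeros_like(rew, dtype=float)
        running = 0.0
        # iterate backwards so each step reuses the return of step t + 1
        for t in reversed(range(len(rew))):
            if done[t]:
                running = 0.0  # the episode terminates at step t, so G_t = r_t
            running = rew[t] + gamma * running
            returns[t] = running
        return returns

    # e.g. rewards [1, 1, 1] with the episode ending at the last step:
    # discounted_returns(np.array([1., 1., 1.]), np.array([0, 0, 1]), 0.9)
    # returns array([2.71, 1.9 , 1.  ])
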
Collector
---------

The collector enables the policy to interact with the environment and stores the generated data into the replay buffer.

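As a rough sketch of the intended usage, assuming ``policy`` is an already-constructed policy instance (the exact constructor arguments may differ; check the API documentation)::

    import gym
    from tianshou.data import Collector, ReplayBuffer

    # a collector binds a policy, an environment, and a buffer together
    collector = Collector(policy, gym.make('CartPole-v0'), ReplayBuffer(size=20000))
    collector.collect(n_step=10)  # let the policy run 10 environment steps
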
Trainer
-------

Once you have a collector and a policy, you can start writing the training method for your RL agent. The trainer is, in fact, a simple wrapper: it saves you the effort of writing the training loop yourself.

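Conceptually, the loop that a trainer wraps looks roughly like the sketch below (schematic only; ``policy``, ``collector``, and ``buffer`` stand for the objects introduced in the previous sections)::

    def naive_offpolicy_loop(policy, collector, buffer,
                             max_epoch=10, step_per_epoch=1000, batch_size=64):
        """A schematic off-policy training loop, i.e. what a trainer automates."""
        for epoch in range(max_epoch):
            for _ in range(step_per_epoch):
                # let the policy interact with the environment
                collector.collect(n_step=1)
                # sample a mini-batch and pre-process it (e.g. compute returns)
                batch, indice = buffer.sample(batch_size)
                batch = policy.process_fn(batch, buffer, indice)
                # update the policy's network parameters
                policy.learn(batch)
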
Tianshou has two types of trainer: :meth:`~tianshou.trainer.onpolicy_trainer` and :meth:`~tianshou.trainer.offpolicy_trainer`, corresponding to on-policy algorithms (such as Policy Gradient) and off-policy algorithms (such as DQN) respectively. Please check out the API documentation for the usage.

More types of trainer, for instance a multi-agent trainer, will be added in the future.

Conclusion
----------

So far, we have gone through the overall framework of Tianshou. Really simple, isn't it?