Michael Panchenko bf3859a457 Extension of ExpLauncher and DataclassPPrintMixin
1. Launch in main process if only 1 exp is passed
2. Launcher returns a list of stats for successful exps
3. More detailed logging for unsuccessful exps
4. Raise error if all runs were unsuccessful
5. DataclassPPrintMixin allows retrieving a pretty repr string
6. Minor improvements in docstrings
2024-05-07 16:21:50 +02:00


import multiprocessing
from dataclasses import dataclass
from tianshou.utils.string import ToStringMixin


@dataclass
class SamplingConfig(ToStringMixin):
"""Configuration of sampling, epochs, parallelization, buffers, collectors, and batching."""
num_epochs: int = 100
"""
the number of epochs to run training for. An epoch is the outermost iteration level and each
epoch consists of a number of training steps and a test step, where each training step
* collects environment steps/transitions (collection step), adding them to the (replay)
buffer (see :attr:`step_per_collect`)
* performs one or more gradient updates (see :attr:`update_per_step`),
and the test step collects :attr:`num_episodes_per_test` test episodes in order to evaluate
agent performance.
The number of training steps in each epoch is indirectly determined by
:attr:`step_per_epoch`: As many training steps will be performed as are required in
order to reach :attr:`step_per_epoch` total steps in the training environments.
Specifically, if the number of transitions collected per step is `c` (see
:attr:`step_per_collect`) and :attr:`step_per_epoch` is set to `s`, then the number
of training steps per epoch is `ceil(s / c)`.
Therefore, if `num_epochs = e`, the total number of environment steps taken during training
can be computed as `e * ceil(s / c) * c`.
"""
step_per_epoch: int = 30000
"""
the total number of environment steps to be made per epoch. See :attr:`num_epochs` for
an explanation of epoch semantics.
"""
batch_size: int | None = 64
"""for off-policy algorithms, this is the number of environment steps/transitions to sample
from the buffer for a gradient update; for on-policy algorithms, its use is algorithm-specific.
On-policy algorithms use the full buffer that was collected in the preceding collection step
but they may use this parameter to perform the gradient update using mini-batches of this size
(causing the gradient to be less accurate, a form of regularization).
    Setting ``batch_size=None`` means that the full buffer is used for the gradient update (no
    mini-batching). This is not recommended for off-policy algorithms, but may make sense in some
    cases for on-policy and offline algorithms.
"""
num_train_envs: int = -1
"""the number of training environments to use. If set to -1, use number of CPUs/threads."""
train_seed: int = 42
"""the seed to use for the training environments."""
num_test_envs: int = 1
"""the number of test environments to use"""
num_test_episodes: int = 1
"""the total number of episodes to collect in each test step (across all test environments).
"""
buffer_size: int = 4096
"""the total size of the sample/replay buffer, in which environment steps (transitions) are
stored"""
step_per_collect: int = 2048
"""
the number of environment steps/transitions to collect in each collection step before the
network update within each training step.
Note that the exact number can be reached only if this is a multiple of the number of
training environments being used, as each training environment will produce the same
(non-zero) number of transitions.
Specifically, if this is set to `n` and `m` training environments are used, then the total
number of transitions collected per collection step is `ceil(n / m) * m =: c`.
See :attr:`num_epochs` for information on the total number of environment steps being
collected during training.
"""
repeat_per_collect: int | None = 1
"""
controls, within one gradient update step of an on-policy algorithm, the number of times an
actual gradient update is applied using the full collected dataset, i.e. if the parameter is
5, then the collected data shall be used five times to update the policy within the same
training step.
The parameter is ignored and may be set to None for off-policy and offline algorithms.
"""
update_per_step: float = 1.0
"""
for off-policy algorithms only: the number of gradient steps to perform per sample
collected (see :attr:`step_per_collect`).
Specifically, if this is set to `u` and the number of samples collected in the preceding
collection step is `n`, then `round(u * n)` gradient steps will be performed.
Note that for on-policy algorithms, only a single gradient update is usually performed,
because thereafter, the samples no longer reflect the behavior of the updated policy.
To change the number of gradient updates for an on-policy algorithm, use parameter
:attr:`repeat_per_collect` instead.
"""
start_timesteps: int = 0
"""
    the number of environment steps to collect before the actual training loop begins.
"""
start_timesteps_random: bool = False
"""
    whether to use a random policy (instead of the initial or restored policy to be trained)
    when collecting the initial :attr:`start_timesteps` environment steps before training.
"""
    replay_buffer_ignore_obs_next: bool = False
    """if True, do not store the next observation (`obs_next`) of transitions in the replay
    buffer, reducing memory usage; this is appropriate when the next observation can instead be
    derived from the `obs` of the subsequent transition when needed."""
replay_buffer_save_only_last_obs: bool = False
"""if True, for the case where the environment outputs stacked frames (e.g. because it
is using a `FrameStack` wrapper), save only the most recent frame so as not to duplicate
    observations in buffer memory. Specifically, if the environment outputs observations `obs` with
    shape `(N, ...)`, only `obs[-1]` of shape `(...)` will be stored.
Frame stacking with a fixed number of frames can then be recreated at the buffer level by setting
:attr:`replay_buffer_stack_num`.
"""
replay_buffer_stack_num: int = 1
"""
the number of consecutive environment observations to stack and use as the observation input
to the agent for each time step. Setting this to a value greater than 1 can help agents learn
temporal aspects (e.g. velocities of moving objects for which only positions are observed).
If the environment already stacks frames (e.g. using a `FrameStack` wrapper), this should either not
be used or should be used in conjunction with :attr:`replay_buffer_save_only_last_obs`.
"""
    @property
    def test_seed(self) -> int:
        """the seed to use for the test environments (offset from :attr:`train_seed` by the
        number of training environments, so training and test seeds do not overlap)."""
        return self.train_seed + self.num_train_envs

    def __post_init__(self) -> None:
        if self.num_train_envs == -1:
            self.num_train_envs = multiprocessing.cpu_count()
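A minimal, self-contained sketch of the seed handling above (a stand-in class, since `SamplingConfig` itself pulls in `ToStringMixin` from tianshou; names here are illustrative):

```python
import multiprocessing
from dataclasses import dataclass


@dataclass
class SeedDemo:
    """Stand-in replicating SamplingConfig's seed/env-count logic for illustration."""

    train_seed: int = 42
    num_train_envs: int = -1

    @property
    def test_seed(self) -> int:
        # Test envs are seeded just past the training envs' seed range,
        # so the two sets of environments never share a seed.
        return self.train_seed + self.num_train_envs

    def __post_init__(self) -> None:
        if self.num_train_envs == -1:
            self.num_train_envs = multiprocessing.cpu_count()


demo = SeedDemo(train_seed=42, num_train_envs=8)
print(demo.test_seed)  # 50
```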