Addresses #1122:
* We introduced a new flag `is_within_training_step`, which is enabled by
the training algorithm while within a training step, where a training
step encompasses training data collection and policy updates. Algorithms
now use this flag to decide whether their `deterministic_eval` setting
should apply, instead of the torch training flag (which was previously
abused for this purpose); see the sketch below.
* The policy's training/eval mode (which should only affect torch-level
behaviour such as dropout and batch normalization) no longer needs to be
set in user code in order to control collector behaviour (which never
made sense). The respective calls have been removed.
* The policy should, in fact, always be in evaluation mode when
collecting data, as there is no reason to ever accumulate gradients
during any kind of rollout. We therefore explicitly set the policy to
evaluation mode in Collector.collect. Furthermore, it never makes sense
to compute gradients during collection, so the possibility to pass
`no_grad=False` was removed.
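
For illustration, a minimal sketch of how an algorithm might consult the new flag (class and method names below are placeholders, not the actual implementation):

```python
# Illustrative sketch only; class and method names are placeholders.
class MyPolicySketch:
    def __init__(self, deterministic_eval: bool = True) -> None:
        self.deterministic_eval = deterministic_eval
        # Set to True by the trainer for the duration of a training step
        # (data collection + policy update), False otherwise.
        self.is_within_training_step = False

    def use_deterministic_action(self) -> bool:
        # `deterministic_eval` only applies outside of a training step
        # (i.e. during evaluation/inference); the torch training flag is
        # no longer consulted for this decision.
        return self.deterministic_eval and not self.is_within_training_step
```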
Further changes:
- Base class for collectors: `BaseCollector`
- New util context managers `in_eval_mode` and `in_train_mode` for torch
modules (see the sketch after this list).
- `reset` of collectors now returns `obs` and `info`.
- `no_grad` is no longer accepted as a kwarg of `collect`.
- Removed deprecations of `0.5.1` (will likely not affect anyone) and
the unused `warnings` module.
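
A minimal sketch of what such context managers can look like (assuming standard torch semantics; the actual utilities may differ in detail):

```python
from contextlib import contextmanager
from typing import Iterator

import torch


@contextmanager
def in_eval_mode(module: torch.nn.Module) -> Iterator[None]:
    """Temporarily switch the module to eval mode, restoring the previous mode on exit."""
    was_training = module.training
    module.eval()
    try:
        yield
    finally:
        module.train(was_training)


@contextmanager
def in_train_mode(module: torch.nn.Module) -> Iterator[None]:
    """Temporarily switch the module to train mode, restoring the previous mode on exit."""
    was_training = module.training
    module.train()
    try:
        yield
    finally:
        module.train(was_training)
```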
* Add class ExperimentCollection to improve usability
* Remove parameters from ExperimentBuilder.build
* Rename ExperimentBuilder.build_default_seeded_experiments to build_seeded_collection,
changing the return type to ExperimentCollection
* Replace temp_config_mutation (which was not appropriate for the public API) with
method copy (which performs a safe deep copy)
* Remove flag `eval_mode` from Collector.collect
* Replace flag `is_eval` in BasePolicy with `is_within_training_step` (negating usages)
and set it appropriately in BaseTrainer
New method training_step, which
* collects training data (method _collect_training_data)
* performs "test in train" (method _test_in_train)
* performs policy update
The old method named train_step performed only the first two steps and
has now been split into two separate methods; see the schematic sketch below.
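
Schematically, the new flow looks roughly like this (a sketch only; names other than training_step, _collect_training_data and _test_in_train are placeholders, not the actual trainer code):

```python
# Schematic sketch of the trainer flow; not the actual BaseTrainer code.
class TrainerSketch:
    def training_step(self):
        # Mark the policy as being within a training step so that
        # `deterministic_eval` does not kick in during data collection.
        self.policy.is_within_training_step = True
        try:
            collect_stats = self._collect_training_data()  # rollout into the replay buffer
            should_stop = self._test_in_train()            # optional "test in train" early stopping
            if not should_stop:
                self._update_policy()                      # gradient-based update (placeholder name)
            return collect_stats, should_stop
        finally:
            self.policy.is_within_training_step = False
```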
This PR fixes a bug in DQN and lifts a limitation on reusing the actor's
preprocessing network for continuous environments.
* `atari_network.DQN`:
  * Fix input validation
  * Fix `output_dim` not being set if `features_only=True` and
    `output_dim_added_layer` is not None
* `continuous.Critic`:
  * Add flag `apply_preprocess_net_to_obs_only` to allow the
    preprocessing network to be applied to the observations only (without
    the actions concatenated), which is essential when we want to reuse the
    actor's preprocessing network; see the sketch below
  * CriticFactoryReuseActor: Use the flag, fixing the case where we want
    to reuse an actor's preprocessing network for the critic (the network
    must be applied before concatenating the actions)
* Minor improvements in docs/docstrings
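
For illustration, a simplified sketch of what the flag controls in the critic's forward pass (a schematic stand-in assuming a preprocessing network that maps observations to a feature tensor; not the actual `continuous.Critic` code):

```python
import torch


class CriticSketch(torch.nn.Module):
    """Simplified stand-in illustrating `apply_preprocess_net_to_obs_only`."""

    def __init__(
        self,
        preprocess_net: torch.nn.Module,
        preprocess_output_dim: int,
        action_dim: int,
        apply_preprocess_net_to_obs_only: bool = False,
    ) -> None:
        super().__init__()
        self.preprocess_net = preprocess_net
        self.apply_preprocess_net_to_obs_only = apply_preprocess_net_to_obs_only
        # When the flag is set, actions are concatenated after preprocessing,
        # so the final layer sees preprocess_output_dim + action_dim inputs.
        in_dim = preprocess_output_dim + (action_dim if apply_preprocess_net_to_obs_only else 0)
        self.last = torch.nn.Linear(in_dim, 1)

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        if self.apply_preprocess_net_to_obs_only:
            # Apply the (possibly actor-owned) preprocessing network to the
            # observations alone, then concatenate the actions afterwards.
            features = torch.cat([self.preprocess_net(obs), act], dim=1)
        else:
            # Previous behaviour: concatenate obs and actions before preprocessing,
            # which precludes reusing an actor's obs-only preprocessing network.
            features = self.preprocess_net(torch.cat([obs, act], dim=1))
        return self.last(features)
```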