* EnvFactory now uses the creation of a single environment as
the basic functionality which the more high-level functions build
upon
* Introduce enum EnvMode to indicate the purpose for which an env
is created, allowing the factory creation process to change its
behaviour accordingly
* Add EnvFactoryGymnasium to provide direct support for envs that
can be created via gymnasium.make
- EnvPool is supported via an injectible EnvPoolFactory
- Existing EnvFactory implementations are now derived from
EnvFactoryGymnasium
* Use a separate environment (which uses new EnvMode.WATCH) for
watching agent performance after training (instead of using test
environments, which the user may want to configure differently)
This PR adds strict typing to the output of `update` and `learn` in all
policies. This will likely be the last large refactoring PR before the
next release (0.6.0, not 1.0.0), so it requires some attention. Several
difficulties were encountered on the path to that goal:
1. The policy hierarchy is actually "broken" in the sense that the keys
of dicts that were output by `learn` did not follow the same enhancement
(inheritance) pattern as the policies. This is a real problem and should
be addressed in the near future. Generally, several aspects of the
policy design and hierarchy might deserve a dedicated discussion.
2. Each policy needs to be generic in the stats return type, because one
might want to extend it at some point and then also extend the stats.
Even within the source code base this pattern is necessary in many
places.
3. The interaction between learn and update is a bit quirky, we
currently handle it by having update modify special field inside
TrainingStats, whereas all other fields are handled by learn.
4. The IQM module is a policy wrapper and required a
TrainingStatsWrapper. The latter relies on a bunch of black magic.
They were addressed by:
1. Live with the broken hierarchy, which is now made visible by bounds
in generics. We use type: ignore where appropriate.
2. Make all policies generic with bounds following the policy
inheritance hierarchy (which is incorrect, see above). We experimented a
bit with nested TrainingStats classes, but that seemed to add more
complexity and be harder to understand. Unfortunately, mypy thinks that
the code below is wrong, wherefore we have to add `type: ignore` to the
return of each `learn`
```python
T = TypeVar("T", bound=int)
def f() -> T:
return 3
```
3. See above
4. Write representative tests for the `TrainingStatsWrapper`. Still, the
black magic might cause nasty surprises down the line (I am not proud of
it)...
Closes#933
---------
Co-authored-by: Maximilian Huettenrauch <m.huettenrauch@appliedai.de>
Co-authored-by: Michael Panchenko <m.panchenko@appliedai.de>
This PR adds a new method for getting actions from an env's observation
and info. This is useful for standard inference and stands in contrast
to batch-based methods that are currently used in training and
evaluation. Without this, users have to do some kind of gymnastics to
actually perform inference with a trained policy. I have also added a
test for the new method.
In future PRs, this method should be included in the examples (in the
the "watch" section).
To add this required improving multiple typing things and, importantly,
_simplifying the signature of `forward` in many policies!_ This is a
**breaking change**, but it will likely affect no users. The `input`
parameter of forward was a rather hacky mechanism, I believe it is good
that it's gone now. It will also help with #948 .
The main functional change is the addition of `compute_action` to
`BasePolicy`.
Other minor changes:
- improvements in typing
- updated PR and Issue templates
- Improved handling of `max_action_num`
Closes#981
(should be dev dependency only) by introducing a new
place where jsonargparse can be configured:
logging.run_cli, which is also slightly more convenient
of number of environments in SamplingConfig is used
(values are now passed to factory method)
This is clearer and removes the need to pass otherwise
unnecessary configuration to environment factories at
construction
* Add persistence/restoration of Experiment instance
* Add file logging in experiment
* Allow all persistence/logging to be disabled
* Disable persistence in tests
* Add example atari_iqn_hl
* Factor out trainer callbacks to new module atari_callbacks
* Extract base class for DQN-based agent factories
* Improved module factory interface design, achieving higher generality
* Changed machanism for reusing actor's preprocessing module in critics
to avoid special handling in AgentFactory implementations, improving
separation of concerns:
- Added CriticFactoryReuseActor as the new critic factory
- Added ActorFactoryTransientStorageDecorator to pass on the actor
data
- Added helper classes ActorFuture, ActorFutureProviderProtocol
* Add example atari_sac_hl
* Implement example mujoco_redq_hl
* Add abstraction CriticEnsembleFactory with default implementations
to suit REDQ
* Fix type annotation of linear_layer in Net, MLP, Critic
(was incompatible with REDQ usage)
* Allow to specify trainer callbacks (train_fn, test_fn, stop_fn)
in high-level API, adding the necessary abstractions and pass-on
mechanisms
* Add example atari_dqn_hl
* Add common based class for A2C and PPO agent factories
* Add default for dist_fn parameter, adding corresponding factories
* Add example mujoco_a2c_hl
* Use prefix convention (subclasses have superclass names as prefix) to
facilitate discoverability of relevant classes via IDE autocompletion
* Use dual naming, adding an alternative concise name that omits the
precise OO semantics and retains only the essential part of the name
(which can be more pleasing to users not accustomed to
convoluted OO naming)