535 Commits

Author SHA1 Message Date
carlocagnetta
08f1770fa2 Fix docs/requirements.txt 2023-12-04 11:45:53 +01:00
carlocagnetta
b1b7f24b94 Fix docs/requirements.txt 2023-12-04 11:45:53 +01:00
carlocagnetta
64af97b65e Fix RTD and gh-pages docu auto-generation 2023-12-04 11:45:53 +01:00
carlocagnetta
9aad2c9673 Fix readthedocs to install poetry 2023-12-04 11:45:53 +01:00
carlocagnetta
fb97091a92 Fix action gh-pages 2023-12-04 11:45:52 +01:00
carlocagnetta
ca2e4a1d96 Fix docs index 2023-12-04 11:45:52 +01:00
carlocagnetta
102045c98c Fix .readthedocs.yaml 2023-12-04 11:45:51 +01:00
carlocagnetta
8f0c62ace3 Documentation update: jupyter-book running on ReadTheDocs including tutorial notebooks 2023-12-04 11:45:51 +01:00
carlocagnetta
6df56161f5 Move notebooks to doc and resolve spellcheck 2023-12-04 11:42:38 +01:00
carlocagnetta
3a87b33d4f Publish notebooks only from master 2023-12-04 11:41:32 +01:00
carlocagnetta
8489fde07e Publish compiled Jupyter books to github-pages 2023-12-04 11:41:32 +01:00
carlocagnetta
b74ea40970 Add git-action to compile notebooks 2023-12-04 11:41:32 +01:00
carlocagnetta
624998d2eb fix notebook action 2023-12-04 11:41:22 +01:00
carlocagnetta
30fc884078 Publish notebooks only on master 2023-12-04 11:41:22 +01:00
carlocagnetta
0108726dfe fix build books action command 2023-12-04 11:41:22 +01:00
carlocagnetta
60f54fded9 Change gh-pages publish dir 2023-12-04 11:41:21 +01:00
carlocagnetta
1170ad37fd Change gh-pages publish dir 2023-12-04 11:41:21 +01:00
carlocagnetta
23eadb11f8 Change books publish dir 2023-12-04 11:41:21 +01:00
carlocagnetta
c175f7647e Fix notebooks.yml 2023-12-04 11:41:21 +01:00
carlocagnetta
bbe5da41f7 Add compiled books for documentation 2023-12-04 11:41:21 +01:00
carlocagnetta
0e9837fa17 Fix workflow action compiling notebooks 2023-12-04 11:41:20 +01:00
carlocagnetta
04c374c439 add git-action to compile notebooks 2023-12-04 11:41:20 +01:00
carlocagnetta
96704a82a2 Update jupyter-book _config.yml to comply with template 2023-12-04 11:41:20 +01:00
carlocagnetta
148e11a1cd Update notebooks to current version 2023-12-04 11:41:20 +01:00
carlocagnetta
de3a021a0a Setup jupyter-book 2023-12-04 11:41:19 +01:00
carlocagnetta
7ed07031b7 Update gitignore 2023-12-04 11:40:18 +01:00
carlocagnetta
acc671b887 Update gitignore 2023-12-04 11:39:13 +01:00
carlocagnetta
6a60396839 Add jupyter pkg dependency to poetry env 2023-12-04 11:37:38 +01:00
carlocagnetta
efaadec6a1 Removed notebook outputs 2023-12-04 11:36:47 +01:00
carlocagnetta
6b6ce0fdf1 Add notebooks from ./docs/tutorials/get_started.rst to ./notebooks 2023-12-04 11:36:47 +01:00
Michael Panchenko
8d3d1f164b
Support batch_size=None and use it in various scripts (#993)
Closes #986
2023-11-24 10:13:10 -08:00
Michael Panchenko
f134bc20b5
Bugfix/discrete bcq inf (#995)
Fixes a small bug with using np.inf instead of torch-based infinity

Closes #963

---------

Co-authored-by: ivan.rodriguez <ivan.rodriguez@unternehmertum.de>
2023-11-24 11:17:40 +01:00
Matthew Turnshek
31fa0325fa
Update quickstart argument name (#994)
Noticed an improper argument name when going through the quickstart.
2023-11-22 21:05:37 -08:00
Michael Panchenko
3a1bc18add
Method to compute actions from observations (#991)
This PR adds a new method for getting actions from an env's observation
and info. This is useful for standard inference and stands in contrast
to batch-based methods that are currently used in training and
evaluation. Without this, users have to do some kind of gymnastics to
actually perform inference with a trained policy. I have also added a
test for the new method.

In future PRs, this method should be included in the examples (in the
"watch" section).

Adding this required several typing improvements and, importantly,
_simplifying the signature of `forward` in many policies!_ This is a
**breaking change**, but it will likely affect no users. The `input`
parameter of forward was a rather hacky mechanism, and I believe it is
good that it's gone now. It will also help with #948.

The main functional change is the addition of `compute_action` to
`BasePolicy`.

Other minor changes:
- improvements in typing
- updated PR and Issue templates
- Improved handling of `max_action_num`

Closes #981
2023-11-16 17:27:53 +00:00
Dominik Jain
6d6c85e594
Fix an issue where policies built with LRSchedulerFactoryLinear were not picklable (#992)
- [X] I have marked all applicable categories:
    + [X] exception-raising fix
    + [ ] algorithm implementation fix
    + [ ] documentation modification
    + [ ] new feature
- [X] I have reformatted the code using `make format` (**required**)
- [X] I have checked the code using `make commit-checks` (**required**)
- [ ] If applicable, I have mentioned the relevant/related issue(s)
- [ ] If applicable, I have listed every items in this Pull Request
below

The cause was the use of a lambda function in the state of a generated
object.
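
For illustration, the failure mode can be reproduced with the standard
library alone. The sketch below is not the code from this commit:
`SchedulerHolder` and `LinearDecay` are hypothetical names, and replacing
the lambda with a callable class is just one common remedy, assumed here
for the sake of the example.

```python
import pickle


class LinearDecay:
    """Hypothetical picklable stand-in for `lambda epoch: 1.0 - epoch / max_epochs`."""

    def __init__(self, max_epochs: int) -> None:
        self.max_epochs = max_epochs

    def __call__(self, epoch: int) -> float:
        return 1.0 - epoch / self.max_epochs


class SchedulerHolder:
    """Hypothetical object that keeps a callable as part of its state."""

    def __init__(self, fn) -> None:
        self.fn = fn


def is_picklable(obj) -> bool:
    """Return True if `obj` survives pickle.dumps, False if pickling fails."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# A lambda has no importable name, so pickle cannot serialize it by reference.
with_lambda = SchedulerHolder(lambda epoch: 1.0 - epoch / 10)
# An instance of a module-level class pickles by class reference plus state.
with_callable = SchedulerHolder(LinearDecay(max_epochs=10))

print(is_picklable(with_lambda), is_picklable(with_callable))
```

The callable class carries the same information as the lambda (here only
`max_epochs`), but as plain instance state that pickle can store.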
2023-11-14 10:23:18 -08:00
Michael Panchenko
962c6d1e11
High-Level API (#970)
This PR closes #938. It introduces all the fundamental concepts and
abstractions, and it already covers the majority of the algorithms. It
is not a complete and finalised product, however, and we recommend that
the high-level API remain in the alpha stage for some time, as already
suggested in the issue.

The changes in this PR are described on a [wiki
page](https://github.com/aai-institute/tianshou/wiki/High-Level-API), a
copy of which is provided below. (The original page is perhaps more
readable, because it does not render line breaks verbatim.)

# Introducing the Tianshou High-Level API

The new high-level library was created based on object-oriented design 
principles with two primary design goals:

 * **ease of use** for the end user (without sacrificing generality)

   This is achieved through:
     * a single, well-defined point of interaction (`ExperimentBuilder`)
       which uses declarative semantics, allowing the user to focus on 
       what to do rather than how to do it.

     * easily injectable parametrisation.

       For complex parametrisation involving objects, the respective
       library classes are easily discoverable, keeping the need to
       browse reference documentation - or, even worse, inspect code or class
       hierarchies - to an absolute minimum.

     * reduced points of failure.

       Because the high-level API is at a higher level of abstraction, where
       more knowledge is available, we can centrally define reasonable
       defaults and apply consistency checks in order to ensure that
       illegal configurations result in meaningful errors (and are completely
       avoided as long as the user does not modify default behaviour).
       For example, we can consider interactions between the nature of the
       action space and the neural networks being used.


 * **maintainability** for developers
   
   This is achieved through:
     * a modular design with strong separation of concerns
     * a high level of factorisation, which largely avoids duplication,
       partly through the use of mixins and multiple inheritance.
       This invariably makes the code slightly more complex, yet it greatly
       reduces the lines of code to be written/updated, so it is a reasonable
       compromise in this case.

## Changeset

The entire high-level library is in its own subpackage
`tianshou.highlevel`
and **almost no changes were made to the original library** in order to 
support the new APIs. 
For the most part, only typing-related changes were made, which have
aligned type annotations with existing example applications or have made
explicit interfaces that were previously implicit.

Furthermore, some helper modules were added to the `tianshou.util`
package
(all of which were copied from the [sensAI
library](https://github.com/jambit/sensAI)).

Many example applications were added, based on the existing MuJoCo and
Atari
examples (see below).

## User-Facing Interface

### User Experience Example

To illustrate the UX, consider this video recording (IntelliJ IDEA):


![UX](https://github.com/aai-institute/tianshou/wiki/resources/ppo-experiment-builder.gif)

Observe how conveniently relevant classes can be discovered via the
IDE's
auto-completion function.
Discoverability is markedly enhanced by using a prefix-based naming
convention,
where classes that can be used as parameters use the base class name as
a prefix,
allowing all potentially relevant subclasses to be straightforwardly 
auto-completed.


### Declarative Semantics

A key design principle for the user-facing interface was to achieve 
*declarative semantics*, where the user
is no longer concerned with generating a lengthy procedure that
sequentially
constructs components that build upon each other. 
Instead, the user focuses purely on
*declaring* the properties of the learning task they would like to run.

 * This essentially reduces boiler-plate code to zero, as every part of the
   code is defining essential, experiment-specific configuration.
 * This makes it possible to centrally handle interdependent configuration
   and detect/avoid misspecification.

In order to enable the configuration of interdependent objects without 
requiring the user to instantiate the respective objects sequentially,
we
heavily employ the *factory pattern*.
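
To make the role of factories concrete, here is a minimal, self-contained
sketch. It does not use the Tianshou API: `ToyEnv`, `ToyModel`,
`ToyEnvFactory`, `ToyModelFactory` and `build` are hypothetical names chosen
for illustration. The point is that the user declares a model factory
without knowing the environment's dimensions, and the dependency is resolved
only when everything is built.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical environment exposing only what the model needs."""

    obs_dim: int
    action_dim: int


@dataclass
class ToyModel:
    """Hypothetical model described by its layer sizes."""

    layer_sizes: list[int]


@dataclass
class ToyEnvFactory:
    obs_dim: int
    action_dim: int

    def create(self) -> ToyEnv:
        return ToyEnv(self.obs_dim, self.action_dim)


@dataclass
class ToyModelFactory:
    hidden_sizes: tuple[int, ...] = (64, 64)

    def create(self, env: ToyEnv) -> ToyModel:
        # Input and output sizes are derived from the environment at build
        # time, so the user never has to specify them.
        return ToyModel([env.obs_dim, *self.hidden_sizes, env.action_dim])


def build(env_factory: ToyEnvFactory, model_factory: ToyModelFactory):
    """Create the interdependent objects in the required order."""
    env = env_factory.create()
    model = model_factory.create(env)
    return env, model


env, model = build(ToyEnvFactory(obs_dim=8, action_dim=2), ToyModelFactory((32, 32)))
print(model.layer_sizes)
```

Because the factories hold only configuration, the order in which the user
declares them does not matter; only `build` needs to know the creation order.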

### Experiment Builders

The end user's primary entry point is an `ExperimentBuilder`, which is
specialised for each algorithm.
As the name suggests, it uses the builder pattern in order to create
an `Experiment` object, which is then used to run the learning task.

 * At builder construction, the user is required to provide only essential
   configuration, particularly the environment factory.
 * The bulk of the algorithm-specific parameters can be provided
   via an algorithm-specific parameter object.
   For instance, `PPOExperimentBuilder` has the method `with_ppo_params`,
   which expects an object of type `PPOParams`.
 * Parametrisation that requires the provision of more complex interfaces
   (e.g. where multiple specification variants exist) is handled via
   dedicated builder methods.
   For example, for the specification of the critic component in an
   actor-critic algorithm, the following group of functions is provided:
     * `with_critic_factory` (where the user can provide any (user-defined)
       factory for the critic component)
     * `with_critic_factory_default` (with which the user specifies that
       the default, `Net`-based critic architecture shall be used and has the
       option to parametrise it)
     * `with_critic_factory_use_actor` (with which the user indicates that
       the critic component shall reuse the preprocessing network from the
       actor component)
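
The builder mechanics described above can be sketched in isolation. This is
a simplified illustration under stated assumptions, not the actual
implementation: every `Toy*` name is hypothetical, and the critic choice is
reduced to a descriptive string so that the example stays self-contained.

```python
from dataclasses import dataclass


@dataclass
class ToyParams:
    """Hypothetical algorithm-specific parameter object."""

    lr: float = 3e-4
    eps_clip: float = 0.2


@dataclass
class ToyExperiment:
    """Hypothetical result of the build step; holds configuration only."""

    env_name: str
    params: ToyParams
    critic: str


class ToyExperimentBuilder:
    def __init__(self, env_name: str) -> None:
        # Only essential configuration is required up front;
        # everything else starts from a default.
        self._env_name = env_name
        self._params = ToyParams()
        self._critic = "default(hidden_sizes=(64, 64))"

    def with_params(self, params: ToyParams) -> "ToyExperimentBuilder":
        self._params = params
        return self  # returning self enables method chaining

    def with_critic_default(self, hidden_sizes: tuple) -> "ToyExperimentBuilder":
        self._critic = f"default(hidden_sizes={hidden_sizes})"
        return self

    def with_critic_use_actor(self) -> "ToyExperimentBuilder":
        self._critic = "reuse-actor-preprocessing"
        return self

    def build(self) -> ToyExperiment:
        return ToyExperiment(self._env_name, self._params, self._critic)


minimal = ToyExperimentBuilder("toy-env").build()
custom = (
    ToyExperimentBuilder("toy-env")
    .with_params(ToyParams(lr=1e-3))
    .with_critic_use_actor()
    .build()
)
print(minimal)
print(custom)
```

Grouping the critic variants under a common method prefix mirrors the
naming convention described earlier, so that auto-completion lists all
alternatives together.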

#### Examples

##### Minimal Example

In the simplest of cases, where the user wants to use the default 
parametrisation for everything, a user could run a PPO learning task
as follows,

```python
experiment = PPOExperimentBuilder(MyEnvFactory()).build()
experiment.run()
```

where `MyEnvFactory` is a factory for the agent's environment.
The default behaviour will adapt depending on whether the factory 
creates environments with discrete or continuous action spaces.

##### Fully Parametrised MuJoCo Example 

Importantly, the user still has the option to configure all the details.
Consider this example, which is from the high-level version of the 
`mujoco_ppo` example:

```python
log_name = os.path.join(task, "ppo", str(experiment_config.seed), datetime_tag())

sampling_config = SamplingConfig(
    num_epochs=epoch,
    step_per_epoch=step_per_epoch,
    batch_size=batch_size,
    num_train_envs=training_num,
    num_test_envs=test_num,
    buffer_size=buffer_size,
    step_per_collect=step_per_collect,
    repeat_per_collect=repeat_per_collect,
)

env_factory = MujocoEnvFactory(task, experiment_config.seed, obs_norm=True)

experiment = (
    PPOExperimentBuilder(env_factory, experiment_config, sampling_config)
    .with_ppo_params(
        PPOParams(
            discount_factor=gamma,
            gae_lambda=gae_lambda,
            action_bound_method=bound_action_method,
            reward_normalization=rew_norm,
            ent_coef=ent_coef,
            vf_coef=vf_coef,
            max_grad_norm=max_grad_norm,
            value_clip=value_clip,
            advantage_normalization=norm_adv,
            eps_clip=eps_clip,
            dual_clip=dual_clip,
            recompute_advantage=recompute_adv,
            lr=lr,
            lr_scheduler_factory=LRSchedulerFactoryLinear(sampling_config)
            if lr_decay
            else None,
            dist_fn=DistributionFunctionFactoryIndependentGaussians(),
        ),
    )
    .with_actor_factory_default(hidden_sizes, torch.nn.Tanh, continuous_unbounded=True)
    .with_critic_factory_default(hidden_sizes, torch.nn.Tanh)
    .build()
)
experiment.run(log_name)
```

This is functionally equivalent to the procedural, low-level example.
Compare the scripts here:
* [original low-level
example](https://github.com/aai-institute/tianshou/blob/feat/high-level-api/examples/mujoco/mujoco_ppo.py)
* [new high-level
example](https://github.com/aai-institute/tianshou/blob/feat/high-level-api/examples/mujoco/mujoco_ppo_hl.py)

In general, find example applications of the high-level API in the
`examples/`
folder in scripts using the `_hl.py` suffix:

* [MuJoCo
examples](https://github.com/aai-institute/tianshou/tree/feat/high-level-api/examples/mujoco)
* [Atari
examples](https://github.com/aai-institute/tianshou/tree/feat/high-level-api/examples/atari)

### Experiments

The `Experiment` representation contains
 * the agent factory,
 * the environment factory,
 * further definitions pertaining to storage & logging.

An experiment may be run several times, assigning a name (and
corresponding
storage location) to each run.

#### Persistence and Logging

Experiments can be serialized and later be reloaded.

```python
experiment = Experiment.from_directory("log/my_experiment")
```

Because the experiment representation is composed purely of
configuration
and factories, which themselves are composed purely of configuration and
factories, persisted objects are compact and do not contain state.


Every experiment run produces the following artifacts:
 * the serialized experiment
 * the serialized best policy found during training
 * a log file
 * (optionally) user-defined data, as the persistence
   handlers are modular

Running a reloaded experiment can optionally resume training of the
serialized
policy.

All relevant objects have meaningful string representations that can
appear
in logs, which is conveniently achieved through the use of 
`ToStringMixin` (from sensAI). 
Its use furthermore prevents string representations of recurring objects
from being printed more than once.
For example, consider this string representation, which was generated
for
the fully parametrised PPO experiment from the example above:

```
Experiment[
    config=ExperimentConfig(
        seed=42, 
        device='cuda', 
        policy_restore_directory=None, 
        train=True, 
        watch=True, 
        watch_render=0.0, 
        persistence_base_dir='log', 
        persistence_enabled=True), 
    sampling_config=SamplingConfig[
        num_epochs=100, 
        step_per_epoch=30000, 
        batch_size=64, 
        num_train_envs=64, 
        num_test_envs=10, 
        buffer_size=4096, 
        step_per_collect=2048, 
        repeat_per_collect=10, 
        update_per_step=1.0, 
        start_timesteps=0, 
        start_timesteps_random=False, 
        replay_buffer_ignore_obs_next=False, 
        replay_buffer_save_only_last_obs=False, 
        replay_buffer_stack_num=1], 
    env_factory=MujocoEnvFactory[
        task=Ant-v4, 
        seed=42, 
        obs_norm=True], 
    agent_factory=PPOAgentFactory[
        sampling_config=SamplingConfig[<<], 
        optim_factory=OptimizerFactoryAdam[
            weight_decay=0, 
            eps=1e-08, 
            betas=(0.9, 0.999)], 
        policy_wrapper_factory=None, 
        trainer_callbacks=TrainerCallbacks(
            epoch_callback_train=None, 
            epoch_callback_test=None, 
            stop_callback=None), 
        params=PPOParams[
            gae_lambda=0.95, 
            max_batchsize=256, 
            lr=0.0003, 
            lr_scheduler_factory=LRSchedulerFactoryLinear[sampling_config=SamplingConfig[<<]], 
            action_scaling=default, 
            action_bound_method=clip, 
            discount_factor=0.99, 
            reward_normalization=True, 
            deterministic_eval=False, 
            dist_fn=DistributionFunctionFactoryIndependentGaussians[], 
            vf_coef=0.25, 
            ent_coef=0.0, 
            max_grad_norm=0.5, 
            eps_clip=0.2, 
            dual_clip=None, 
            value_clip=False, 
            advantage_normalization=False, 
            recompute_advantage=True], 
        actor_factory=ActorFactoryTransientStorageDecorator[
            actor_factory=ActorFactoryDefault[
                continuous_actor_type=ContinuousActorType.GAUSSIAN, 
                continuous_unbounded=True, 
                continuous_conditioned_sigma=False, 
                hidden_sizes=[64, 64], 
                hidden_activation=<class 'torch.nn.modules.activation.Tanh'>, 
                discrete_softmax=True]], 
        critic_factory=CriticFactoryDefault[
            hidden_sizes=[64, 64], 
            hidden_activation=<class 'torch.nn.modules.activation.Tanh'>], 
        critic_use_action=False], 
    logger_factory=LoggerFactoryDefault[
        logger_type=tensorboard, 
        wandb_project=None], 
    env_config=None]
```
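
The `[<<]` markers in the output above stand for objects that were already
printed earlier in the same representation. The following sketch shows one
way such behaviour could be obtained. It is not sensAI's `ToStringMixin`
implementation; `ToyToStringMixin` and the other `Toy*` classes are
hypothetical and only mimic the idea.

```python
class ToyToStringMixin:
    """Hypothetical mixin: prints attributes, abbreviating repeated objects."""

    def __repr__(self) -> str:
        return self._to_string(seen=set())

    def _to_string(self, seen: set) -> str:
        name = type(self).__name__
        if id(self) in seen:
            # This object was already printed in full; abbreviate it.
            return f"{name}[<<]"
        seen.add(id(self))
        parts = []
        for key, value in vars(self).items():
            if isinstance(value, ToyToStringMixin):
                text = value._to_string(seen)
            else:
                text = repr(value)
            parts.append(f"{key}={text}")
        return f"{name}[{', '.join(parts)}]"


class ToySampling(ToyToStringMixin):
    def __init__(self, num_epochs: int, batch_size: int) -> None:
        self.num_epochs = num_epochs
        self.batch_size = batch_size


class ToyScheduler(ToyToStringMixin):
    def __init__(self, sampling: ToySampling) -> None:
        self.sampling = sampling


class ToyAgent(ToyToStringMixin):
    def __init__(self, sampling: ToySampling) -> None:
        self.sampling = sampling
        self.scheduler = ToyScheduler(sampling)


class ToyExperiment(ToyToStringMixin):
    def __init__(self, sampling: ToySampling) -> None:
        self.sampling = sampling
        self.agent = ToyAgent(sampling)


experiment = ToyExperiment(ToySampling(num_epochs=100, batch_size=64))
print(experiment)
```

The shared `ToySampling` instance is printed in full once and abbreviated
wherever it recurs, which keeps log output short when the same
configuration object is referenced from several factories.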


## Library Developer Perspective

The presentation thus far has focussed on the user's perspective.
From the perspective of a Tianshou developer, it is important that the 
high-level API be clearly structured and maintainable.
Here are the most relevant representations:

 * **Policy parameters** are represented as dataclasses (base class `Params`).

   The goal is for the parameters to be ultimately passed to the corresponding
   policy class (e.g. `PPOParams` contains parameters for `PPOPolicy`).

   * **Parameter transformation**:
     In part, the parameter dataclass attributes already correspond directly to
     policy class parameters.
     However, because the high-level interface must, in many cases, abstract away
     from the low-level interface,
     we establish the notion of a `ParamTransformer`, which transforms
     one or more parameters into the form that is required by the policy class.
     The idea is that the dictionary representation of the dataclass is
     successively transformed via `ParamTransformer`s such that the resulting
     dictionary can ultimately be used as keyword arguments for the policy.
     To achieve maintainability, the declaration of parameter transformations
     is colocated with the parameters they affect.
     Tests ensure that naming issues are detected.
     Tests ensure that naming issues are detected.

   * **Composition and inheritance**:
     We use inheritance and mixins to reduce duplication.

 * **Factories** are an essential principle of the library.
   Because the creation of objects may depend on objects that are not 
yet created, a declarative approach necessitates that we transition from
   the objects themselves to factories.

     * The `EnvFactory` was already mentioned above, as it is a user-facing
       abstraction.
       Its purpose is to create the (vectorized) `Environments` that will be
       used in the experiments.
     * An `AgentFactory` is the central component that creates the policy,
       the trainer as well as the necessary collectors.
       To support a new type of policy, a subclass that handles the policy
       creation is required.
       In turn, the main task when implementing a new algorithm-specific
       `ExperimentBuilder` is the creation of the corresponding `AgentFactory`.
     * Several types of factories serve to parametrize policies and training
       processes, e.g.
         * `OptimizerFactory` for the creation of torch optimizers
         * `ActorFactory` for the creation of actor models
         * `CriticFactory` for the creation of critic models
         * `IntermediateModuleFactory` for the creation of models that produce
           intermediate/latent representations
         * `EnvParamFactory` for the creation of parameters based on properties
           of the environment
         * `NoiseFactory` for the creation of `BaseNoise` instances
         * `DistributionFunctionFactory` for the creation of functions that
           create torch distributions from tensors
         * `LRSchedulerFactory` for learning rate schedulers
         * `PolicyWrapperFactory` for policy wrappers that extend the
           functionality of the regular policy (e.g. intrinsic curiosity)
         * `AutoAlphaFactory` for automatically tuned regularization
           coefficients (as supported by SAC or REDQ)
     * A `LoggerFactory` handles the creation of the experiment logger,
       but the default implementation already handles the cases that were
       used in the examples.

 * The `ExperimentBuilder` implementations make use of mixins to add common
   functionality. As mentioned above, the main task in an algorithm-specific
   specialization is to create the `AgentFactory`.
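
The parameter transformation described above (a dataclass whose dictionary
representation is successively transformed into keyword arguments) can be
sketched as follows. This is an illustration under stated assumptions, not
the actual `ParamTransformer` code: all `Toy*` names, the `gamma` and
`optim_kwargs` keys and the `toy_policy` function are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import Any


class ToyParamTransformer:
    """Hypothetical base class: mutates the parameter dictionary in place."""

    def transform(self, params: dict[str, Any]) -> None:
        raise NotImplementedError


class ToyRename(ToyParamTransformer):
    """Renames one key to match the low-level argument name."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def transform(self, params: dict[str, Any]) -> None:
        params[self.new] = params.pop(self.old)


class ToyLRToOptimizerKwargs(ToyParamTransformer):
    """Folds the high-level `lr` entry into a lower-level `optim_kwargs` entry."""

    def transform(self, params: dict[str, Any]) -> None:
        params["optim_kwargs"] = {"lr": params.pop("lr")}


@dataclass
class ToyParams:
    discount_factor: float = 0.99
    lr: float = 3e-4

    def _transformers(self) -> list[ToyParamTransformer]:
        # Declared next to the attributes they affect.
        return [ToyRename("discount_factor", "gamma"), ToyLRToOptimizerKwargs()]

    def create_kwargs(self) -> dict[str, Any]:
        params = asdict(self)
        for transformer in self._transformers():
            transformer.transform(params)
        return params


def toy_policy(gamma: float, optim_kwargs: dict[str, float]) -> str:
    """Hypothetical low-level constructor with differently named arguments."""
    return f"policy(gamma={gamma}, lr={optim_kwargs['lr']})"


kwargs = ToyParams(lr=1e-3).create_kwargs()
print(toy_policy(**kwargs))
```

If a transformer refers to a key that does not exist, `dict.pop` raises a
`KeyError`, which is the kind of naming issue that tests can catch early.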
2023-11-09 00:15:49 +01:00
Dominik Jain
dae4000cd2 Revert "Depend on sensAI instead of copying its utils (logging, string)"
This reverts commit fdb0eba93d81fa5e698770b4f7088c87fc1238da.
2023-11-08 19:11:39 +01:00
Dominik Jain
ac672f65d1 Add docstring for ActorFactoryTransientStorageDecorator 2023-11-06 17:18:10 +01:00
Dominik Jain
7e6d3d627e Rename class ActorCriticModuleOpt -> ActorCriticOpt 2023-11-06 16:51:41 +01:00
Dominik Jain
5c8d57a2d2
Fix index error in call to _with_critic_factory_default
Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>
2023-11-06 16:17:14 +01:00
Dominik Jain
fdb0eba93d Depend on sensAI instead of copying its utils (logging, string) 2023-10-27 20:15:58 +02:00
Dominik Jain
5952993cfe Add option to disable file logging 2023-10-27 18:59:43 +02:00
Stefano Mariani, PhD
b72bebbc48
Fixed misleading multi-agent training sentences (#980)
- [X] I have marked all applicable categories:
    + [ ] exception-raising fix
    + [ ] algorithm implementation fix
    + [X] documentation modification
    + [ ] new feature
- [X] I have reformatted the code using `make format` (**required**)
- [X] I have checked the code using `make commit-checks` (**required**)
- [X] If applicable, I have mentioned the relevant/related issue(s)
    + resolves issue #973 
- [ ] If applicable, I have listed every items in this Pull Request
below
2023-10-26 09:48:44 -07:00
Dominik Jain
a3dbe90515 Allow to configure the policy persistence mode, adding a new mode
which stores the entire policy (new default), supporting applications
where it is desired to be able to load the policy without having
to instantiate an environment or recreate a corresponding policy
object
2023-10-26 13:19:33 +02:00
Dominik Jain
86cca8ffc3 Add comment explaining use of _logFormat 2023-10-26 12:50:08 +02:00
Dominik Jain
c613557740 Apply datetime_tag() in high-level examples 2023-10-26 12:50:08 +02:00
Dominik Jain
d684dae6cd Change default number of environments (train=#CPUs, test=1) 2023-10-26 12:50:08 +02:00
Dominik Jain
3cd6dcc307 BaseTrainer: Remove info on default values from docstrings 2023-10-26 10:55:03 +02:00
Dominik Jain
da2194eff6 Force kwargs in PolicyWrapperFactoryIntrinsicCuriosity init 2023-10-26 10:43:59 +02:00
Dominik Jain
96298eafd8 Add convenient construction mechanisms for Environments
(based on factory function for a single environment)
2023-10-25 21:20:07 +02:00