Tianshou

Author	SHA1	Message	Date
carlocagnetta	08f1770fa2	Fix docs/requirements.txt	2023-12-04 11:45:53 +01:00
carlocagnetta	b1b7f24b94	Fix docs/requirements.txt	2023-12-04 11:45:53 +01:00
carlocagnetta	64af97b65e	Fix RTD and gh-pages docu auto-generation	2023-12-04 11:45:53 +01:00
carlocagnetta	9aad2c9673	Fix readthedocs to install poetry	2023-12-04 11:45:53 +01:00
carlocagnetta	fb97091a92	Fix action gh-pages	2023-12-04 11:45:52 +01:00
carlocagnetta	ca2e4a1d96	Fix docs index	2023-12-04 11:45:52 +01:00
carlocagnetta	102045c98c	Fix .readthedocs.yaml	2023-12-04 11:45:51 +01:00
carlocagnetta	8f0c62ace3	Documentation update: jupyter-book running on ReadTheDocs including tutorial notebooks	2023-12-04 11:45:51 +01:00
carlocagnetta	6df56161f5	Move notebooks to doc and resolve spellcheck	2023-12-04 11:42:38 +01:00
carlocagnetta	3a87b33d4f	Publish notebooks only from master	2023-12-04 11:41:32 +01:00
carlocagnetta	8489fde07e	Publish compiled Jupyter books to github-pages	2023-12-04 11:41:32 +01:00
carlocagnetta	b74ea40970	Add git-action to compile notebooks	2023-12-04 11:41:32 +01:00
carlocagnetta	624998d2eb	fix notebook action	2023-12-04 11:41:22 +01:00
carlocagnetta	30fc884078	Publish notebooks only on master	2023-12-04 11:41:22 +01:00
carlocagnetta	0108726dfe	fix build books action command	2023-12-04 11:41:22 +01:00
carlocagnetta	60f54fded9	Change gh-pages publish dir	2023-12-04 11:41:21 +01:00
carlocagnetta	1170ad37fd	Change gh-pages publish dir	2023-12-04 11:41:21 +01:00
carlocagnetta	23eadb11f8	Change books publish dir	2023-12-04 11:41:21 +01:00
carlocagnetta	c175f7647e	Fix notebooks.yml	2023-12-04 11:41:21 +01:00
carlocagnetta	bbe5da41f7	Add compiled books for documentation	2023-12-04 11:41:21 +01:00
carlocagnetta	0e9837fa17	Fix workflow action compiling notebooks	2023-12-04 11:41:20 +01:00
carlocagnetta	04c374c439	add git-action to compile notebooks	2023-12-04 11:41:20 +01:00
carlocagnetta	96704a82a2	Update jupyter-book _config.yml to comply with template	2023-12-04 11:41:20 +01:00
carlocagnetta	148e11a1cd	Update notebooks to current version	2023-12-04 11:41:20 +01:00
carlocagnetta	de3a021a0a	Setup jupyter-book	2023-12-04 11:41:19 +01:00
carlocagnetta	7ed07031b7	Update gitignore	2023-12-04 11:40:18 +01:00
carlocagnetta	acc671b887	Update gitignore	2023-12-04 11:39:13 +01:00
carlocagnetta	6a60396839	Add jupyter pkg dependency to poetry env	2023-12-04 11:37:38 +01:00
carlocagnetta	efaadec6a1	Removed notebook outputs	2023-12-04 11:36:47 +01:00
carlocagnetta	6b6ce0fdf1	Add notebooks from ./docs/tutorials/get_started.rst to ./notebooks	2023-12-04 11:36:47 +01:00
Michael Panchenko	8d3d1f164b	Support batch_size=None and use it in various scripts (#993) Closes #986	2023-11-24 10:13:10 -08:00
Michael Panchenko	f134bc20b5	Bugfix/discrete bcq inf (#995) Fixes a small bug with using np.inf instead of torch-based infinity. Closes #963. Co-authored-by: ivan.rodriguez <ivan.rodriguez@unternehmertum.de>	2023-11-24 11:17:40 +01:00
Matthew Turnshek	31fa0325fa	Update quickstart argument name (#994) Noticed an improper argument name when going through the quickstart.	2023-11-22 21:05:37 -08:00
Michael Panchenko	3a1bc18add	Method to compute actions from observations (#991) This PR adds a new method for getting actions from an env's observation and info. This is useful for standard inference and stands in contrast to the batch-based methods that are currently used in training and evaluation. Without this, users have to do some kind of gymnastics to actually perform inference with a trained policy. I have also added a test for the new method. In future PRs, this method should be included in the examples (in the "watch" section). Adding this required several typing improvements and, importantly, _simplifying the signature of `forward` in many policies!_ This is a breaking change, but it will likely affect no users. The `input` parameter of `forward` was a rather hacky mechanism; I believe it is good that it's gone now. It will also help with #948. The main functional change is the addition of `compute_action` to `BasePolicy`. Other minor changes: improvements in typing; updated PR and Issue templates; improved handling of `max_action_num`. Closes #981	2023-11-16 17:27:53 +00:00
Dominik Jain	6d6c85e594	Fix an issue where policies built with LRSchedulerFactoryLinear were not picklable (#992) Exception-raising fix. The cause was the use of a lambda function in the state of a generated object.	2023-11-14 10:23:18 -08:00
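
The pickling problem named in commit 6d6c85e594 can be reproduced with the standard library alone. The following is a minimal sketch, not the actual Tianshou fix: every class name in it (`SchedulerWithLambda`, `LinearDecay`, `SchedulerWithCallable`) is hypothetical and stands in for "a generated object whose state holds a lambda" and one common way to avoid that.

```python
import pickle


class SchedulerWithLambda:
    """Hypothetical object that keeps a lambda in its instance state."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        # The lambda is stored on the instance, so pickle must serialise it,
        # which fails because lambdas cannot be looked up by name.
        self.decay = lambda step: 1.0 - step / max_steps


class LinearDecay:
    """Hypothetical replacement: a named, module-level callable class."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps

    def __call__(self, step: int) -> float:
        return 1.0 - step / self.max_steps


class SchedulerWithCallable:
    """Same behaviour as above, but the state holds only plain objects."""

    def __init__(self, max_steps: int) -> None:
        self.max_steps = max_steps
        self.decay = LinearDecay(max_steps)


def is_picklable(obj: object) -> bool:
    """Return True if pickle.dumps succeeds for obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
```

Both schedulers compute the same decay value, but only the lambda-based one is rejected by `is_picklable`. When this is run as a script, the callable-based variant can be pickled because `LinearDecay` is importable by name.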
Michael Panchenko	962c6d1e11	High-Level API (#970)

This PR closes #938. It introduces all the fundamental concepts and abstractions, and it already covers the majority of the algorithms. It is not a complete and finalised product, however, and we recommend that the high-level API remain in alpha stage for some time, as already suggested in the issue.

The changes in this PR are described on a [wiki page](https://github.com/aai-institute/tianshou/wiki/High-Level-API), a copy of which is provided below. (The original page is perhaps more readable, because it does not render line breaks verbatim.)

# Introducing the Tianshou High-Level API

The new high-level library was created based on object-oriented design principles with two primary design goals:

* ease of use for the end user (without sacrificing generality)

  This is achieved through:
  * a single, well-defined point of interaction (`ExperimentBuilder`) which uses declarative semantics, allowing the user to focus on what to do rather than how to do it.
  * easily injectable parametrisation. For complex parametrisation involving objects, the respective library classes are easily discoverable, keeping the need to browse reference documentation - or, even worse, inspect code or class hierarchies - to an absolute minimum.
  * reduced points of failure. Because the high-level API is at a higher level of abstraction, where more knowledge is available, we can centrally define reasonable defaults and apply consistency checks in order to ensure that illegal configurations result in meaningful errors (and are completely avoided as long as the user does not modify default behaviour). For example, we can consider interactions between the nature of the action space and the neural networks being used.
* maintainability for developers

  This is achieved through:
  * a modular design with strong separation of concerns
  * a high level of factorisation, which largely avoids duplication, partly through the use of mixins and multiple inheritance. This invariably makes the code slightly more complex, yet it greatly reduces the lines of code to be written/updated, so it is a reasonable compromise in this case.

## Changeset

The entire high-level library is in its own subpackage `tianshou.highlevel` and almost no changes were made to the original library in order to support the new APIs. For the most part, only typing-related changes were made, which have aligned type annotations with existing example applications or have made explicit interfaces that were previously implicit.

Furthermore, some helper modules were added to the `tianshou.util` package (all of which were copied from the [sensAI library](https://github.com/jambit/sensAI)).

Many example applications were added, based on the existing MuJoCo and Atari examples (see below).

## User-Facing Interface

### User Experience Example

To illustrate the UX, consider this video recording (IntelliJ IDEA):

![UX](https://github.com/aai-institute/tianshou/wiki/resources/ppo-experiment-builder.gif)

Observe how conveniently relevant classes can be discovered via the IDE's auto-completion function. Discoverability is markedly enhanced by using a prefix-based naming convention, where classes that can be used as parameters use the base class name as a prefix, allowing all potentially relevant subclasses to be straightforwardly auto-completed.

### Declarative Semantics

A key design principle for the user-facing interface was to achieve declarative semantics, where the user is no longer concerned with generating a lengthy procedure that sequentially constructs components that build upon each other. Instead, the user focuses purely on declaring the properties of the learning task they would like to run.

* This essentially reduces boiler-plate code to zero, as every part of the code is defining essential, experiment-specific configuration.
* This makes it possible to centrally handle interdependent configuration and detect/avoid misspecification.

In order to enable the configuration of interdependent objects without requiring the user to instantiate the respective objects sequentially, we heavily employ the factory pattern.

### Experiment Builders

The end user's primary entry point is an `ExperimentBuilder`, which is specialised for each algorithm. As the name suggests, it uses the builder pattern in order to create an `Experiment` object, which is then used to run the learning task.

* At builder construction, the user is required to provide only essential configuration, particularly the environment factory.
* The bulk of the algorithm-specific parameters can be provided via an algorithm-specific parameter object. For instance, `PPOExperimentBuilder` has the method `with_ppo_params`, which expects an object of type `PPOParams`.
* Parametrisation that requires the provision of more complex interfaces (e.g. where multiple specification variants exist) is handled via dedicated builder methods. For example, for the specification of the critic component in an actor-critic algorithm, the following group of functions is provided:
  * `with_critic_factory` (where the user can provide any (user-defined) factory for the critic component)
  * `with_critic_factory_default` (with which the user specifies that the default, `Net`-based critic architecture shall be used and has the option to parametrise it)
  * `with_critic_factory_use_actor` (with which the user indicates that the critic component shall reuse the preprocessing network from the actor component)

#### Examples

##### Minimal Example

In the simplest of cases, where the user wants to use the default parametrisation for everything, a user could run a PPO learning task as follows,

```python
experiment = PPOExperimentBuilder(MyEnvFactory()).build()
experiment.run()
```

where `MyEnvFactory` is a factory for the agent's environment. The default behaviour will adapt depending on whether the factory creates environments with discrete or continuous action spaces.

##### Fully Parametrised MuJoCo Example

Importantly, the user still has the option to configure all the details. Consider this example, which is from the high-level version of the `mujoco_ppo` example:

```python
log_name = os.path.join(task, "ppo", str(experiment_config.seed), datetime_tag())

sampling_config = SamplingConfig(
    num_epochs=epoch,
    step_per_epoch=step_per_epoch,
    batch_size=batch_size,
    num_train_envs=training_num,
    num_test_envs=test_num,
    buffer_size=buffer_size,
    step_per_collect=step_per_collect,
    repeat_per_collect=repeat_per_collect,
)

env_factory = MujocoEnvFactory(task, experiment_config.seed, obs_norm=True)

experiment = (
    PPOExperimentBuilder(env_factory, experiment_config, sampling_config)
    .with_ppo_params(
        PPOParams(
            discount_factor=gamma,
            gae_lambda=gae_lambda,
            action_bound_method=bound_action_method,
            reward_normalization=rew_norm,
            ent_coef=ent_coef,
            vf_coef=vf_coef,
            max_grad_norm=max_grad_norm,
            value_clip=value_clip,
            advantage_normalization=norm_adv,
            eps_clip=eps_clip,
            dual_clip=dual_clip,
            recompute_advantage=recompute_adv,
            lr=lr,
            lr_scheduler_factory=LRSchedulerFactoryLinear(sampling_config)
            if lr_decay
            else None,
            dist_fn=DistributionFunctionFactoryIndependentGaussians(),
        ),
    )
    .with_actor_factory_default(hidden_sizes, torch.nn.Tanh, continuous_unbounded=True)
    .with_critic_factory_default(hidden_sizes, torch.nn.Tanh)
    .build()
)
experiment.run(log_name)
```

This is functionally equivalent to the procedural, low-level example. Compare the scripts here:

* [original low-level example](https://github.com/aai-institute/tianshou/blob/feat/high-level-api/examples/mujoco/mujoco_ppo.py)
* [new high-level example](https://github.com/aai-institute/tianshou/blob/feat/high-level-api/examples/mujoco/mujoco_ppo_hl.py)

In general, find example applications of the high-level API in the `examples/` folder in scripts using the `_hl.py` suffix:

* [MuJoCo examples](https://github.com/aai-institute/tianshou/tree/feat/high-level-api/examples/mujoco)
* [Atari examples](https://github.com/aai-institute/tianshou/tree/feat/high-level-api/examples/atari)

### Experiments

The `Experiment` representation contains

* the agent factory,
* the environment factory,
* further definitions pertaining to storage & logging.

An experiment may be run several times, assigning a name (and corresponding storage location) to each run.

#### Persistence and Logging

Experiments can be serialized and later be reloaded.

```python
experiment = Experiment.from_directory("log/my_experiment")
```

Because the experiment representation is composed purely of configuration and factories, which themselves are composed purely of configuration and factories, persisted objects are compact and do not contain state.

Every experiment run produces the following artifacts:

* the serialized experiment
* the serialized best policy found during training
* a log file
* (optionally) user-defined data, as the persistence handlers are modular

Running a reloaded experiment can optionally resume training of the serialized policy.

All relevant objects have meaningful string representations that can appear in logs, which is conveniently achieved through the use of `ToStringMixin` (from sensAI). Its use furthermore prevents string representations of recurring objects from being printed more than once. For example, consider this string representation, which was generated for the fully parametrised PPO experiment from the example above:

```
Experiment[
    config=ExperimentConfig(
        seed=42,
        device='cuda',
        policy_restore_directory=None,
        train=True,
        watch=True,
        watch_render=0.0,
        persistence_base_dir='log',
        persistence_enabled=True),
    sampling_config=SamplingConfig[
        num_epochs=100,
        step_per_epoch=30000,
        batch_size=64,
        num_train_envs=64,
        num_test_envs=10,
        buffer_size=4096,
        step_per_collect=2048,
        repeat_per_collect=10,
        update_per_step=1.0,
        start_timesteps=0,
        start_timesteps_random=False,
        replay_buffer_ignore_obs_next=False,
        replay_buffer_save_only_last_obs=False,
        replay_buffer_stack_num=1],
    env_factory=MujocoEnvFactory[
        task=Ant-v4,
        seed=42,
        obs_norm=True],
    agent_factory=PPOAgentFactory[
        sampling_config=SamplingConfig[<<],
        optim_factory=OptimizerFactoryAdam[
            weight_decay=0,
            eps=1e-08,
            betas=(0.9, 0.999)],
        policy_wrapper_factory=None,
        trainer_callbacks=TrainerCallbacks(
            epoch_callback_train=None,
            epoch_callback_test=None,
            stop_callback=None),
        params=PPOParams[
            gae_lambda=0.95,
            max_batchsize=256,
            lr=0.0003,
            lr_scheduler_factory=LRSchedulerFactoryLinear[sampling_config=SamplingConfig[<<]],
            action_scaling=default,
            action_bound_method=clip,
            discount_factor=0.99,
            reward_normalization=True,
            deterministic_eval=False,
            dist_fn=DistributionFunctionFactoryIndependentGaussians[],
            vf_coef=0.25,
            ent_coef=0.0,
            max_grad_norm=0.5,
            eps_clip=0.2,
            dual_clip=None,
            value_clip=False,
            advantage_normalization=False,
            recompute_advantage=True],
        actor_factory=ActorFactoryTransientStorageDecorator[
            actor_factory=ActorFactoryDefault[
                continuous_actor_type=ContinuousActorType.GAUSSIAN,
                continuous_unbounded=True,
                continuous_conditioned_sigma=False,
                hidden_sizes=[64, 64],
                hidden_activation=<class 'torch.nn.modules.activation.Tanh'>,
                discrete_softmax=True]],
        critic_factory=CriticFactoryDefault[
            hidden_sizes=[64, 64],
            hidden_activation=<class 'torch.nn.modules.activation.Tanh'>],
        critic_use_action=False],
    logger_factory=LoggerFactoryDefault[
        logger_type=tensorboard,
        wandb_project=None],
    env_config=None]
```

## Library Developer Perspective

The presentation thus far has focussed on the user's perspective. From the perspective of a Tianshou developer, it is important that the high-level API be clearly structured and maintainable. Here are the most relevant representations:

* Policy parameters are represented as dataclasses (base class `Params`). The goal is for the parameters to be ultimately passed to the corresponding policy class (e.g. `PPOParams` contains parameters for `PPOPolicy`).
* Parameter transformation: In part, the parameter dataclass attributes already correspond directly to policy class parameters. However, because the high-level interface must, in many cases, abstract away from the low-level interface, we establish the notion of a `ParamTransformer`, which transforms one or more parameters into the form that is required by the policy class: The idea is that the dictionary representation of the dataclass is successively transformed via `ParamTransformer`s such that the resulting dictionary can ultimately be used as keyword arguments for the policy. To achieve maintainability, the declaration of parameter transformations is colocated with the parameters they affect. Tests ensure that naming issues are detected.
* Composition and inheritance: We use inheritance and mixins to reduce duplication.
* Factories are an essential principle of the library. Because the creation of objects may depend on objects that are not yet created, a declarative approach necessitates that we transition from the objects themselves to factories.
  * The `EnvFactory` was already mentioned above, as it is a user-facing abstraction. Its purpose is to create the (vectorized) `Environments` that will be used in the experiments.
  * An `AgentFactory` is the central component that creates the policy, the trainer as well as the necessary collectors. To support a new type of policy, a subclass that handles the policy creation is required. In turn, the main task when implementing a new algorithm-specific `ExperimentBuilder` is the creation of the corresponding `AgentFactory`.
  * Several types of factories serve to parametrize policies and training processes, e.g.
    * `OptimizerFactory` for the creation of torch optimizers
    * `ActorFactory` for the creation of actor models
    * `CriticFactory` for the creation of critic models
    * `IntermediateModuleFactory` for the creation of models that produce intermediate/latent representations
    * `EnvParamFactory` for the creation of parameters based on properties of the environment
    * `NoiseFactory` for the creation of `BaseNoise` instances
    * `DistributionFunctionFactory` for the creation of functions that create torch distributions from tensors
    * `LRSchedulerFactory` for learning rate schedulers
    * `PolicyWrapperFactory` for policy wrappers that extend the functionality of the regular policy (e.g. intrinsic curiosity)
    * `AutoAlphaFactory` for automatically tuned regularization coefficients (as supported by SAC or REDQ)
  * A `LoggerFactory` handles the creation of the experiment logger, but the default implementation already handles the cases that were used in the examples.
* The `ExperimentBuilder` implementations make use of mixins to add common functionality. As mentioned above, the main task in an algorithm-specific specialization is to create the `AgentFactory`.	2023-11-09 00:15:49 +01:00
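
The parameter-transformation idea from the High-Level API commit message above (a dataclass turned into a dictionary, rewritten step by step, then used as keyword arguments) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `tianshou.highlevel`: only the name `ParamTransformer` comes from the commit message, while `RenameParam`, `DropParams`, `DemoParams`, `DemoPolicy`, the `transform` and `create_kwargs` methods, and the `discount_factor` to `gamma` renaming are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Any


class ParamTransformer(ABC):
    """Rewrites entries of a parameter dictionary in place (assumed interface)."""

    @abstractmethod
    def transform(self, params: dict[str, Any]) -> None:
        ...


class RenameParam(ParamTransformer):
    """Hypothetical transformer: move a value from one key to another."""

    def __init__(self, old: str, new: str) -> None:
        self.old = old
        self.new = new

    def transform(self, params: dict[str, Any]) -> None:
        params[self.new] = params.pop(self.old)


class DropParams(ParamTransformer):
    """Hypothetical transformer: remove keys the policy does not accept."""

    def __init__(self, *keys: str) -> None:
        self.keys = keys

    def transform(self, params: dict[str, Any]) -> None:
        for key in self.keys:
            params.pop(key, None)


@dataclass
class DemoParams:
    discount_factor: float = 0.99
    lr: float = 3e-4  # assumed to be consumed elsewhere, not by the policy

    def transformers(self) -> list[ParamTransformer]:
        # Declared next to the fields they affect, as the message recommends.
        return [RenameParam("discount_factor", "gamma"), DropParams("lr")]

    def create_kwargs(self) -> dict[str, Any]:
        # Start from the dataclass's dictionary form and transform it in order.
        params = asdict(self)
        for transformer in self.transformers():
            transformer.transform(params)
        return params


class DemoPolicy:
    """Stand-in for a low-level policy class with its own argument names."""

    def __init__(self, gamma: float) -> None:
        self.gamma = gamma


policy = DemoPolicy(**DemoParams(discount_factor=0.9).create_kwargs())
```

Keeping the transformer list on the dataclass itself means that a renamed or removed field and the transformation that depends on it are changed in the same place.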
Dominik Jain	dae4000cd2	Revert "Depend on sensAI instead of copying its utils (logging, string)" This reverts commit fdb0eba93d81fa5e698770b4f7088c87fc1238da.	2023-11-08 19:11:39 +01:00
Dominik Jain	ac672f65d1	Add docstring for ActorFactoryTransientStorageDecorator	2023-11-06 17:18:10 +01:00
Dominik Jain	7e6d3d627e	Rename class ActorCriticModuleOpt -> ActorCriticOpt	2023-11-06 16:51:41 +01:00
Dominik Jain	5c8d57a2d2	Fix index error in call to _with_critic_factory_default Co-authored-by: Jiayi Weng <trinkle23897@gmail.com>	2023-11-06 16:17:14 +01:00
Dominik Jain	fdb0eba93d	Depend on sensAI instead of copying its utils (logging, string)	2023-10-27 20:15:58 +02:00
Dominik Jain	5952993cfe	Add option to disable file logging	2023-10-27 18:59:43 +02:00
Stefano Mariani, PhD	b72bebbc48	Fixed misleading multi-agent training sentences (#980) Documentation modification. Resolves issue #973.	2023-10-26 09:48:44 -07:00
Dominik Jain	a3dbe90515	Allow to configure the policy persistence mode, adding a new mode which stores the entire policy (new default), supporting applications where it is desired to be able to load the policy without having to instantiate an environment or recreate a corresponding policy object	2023-10-26 13:19:33 +02:00
Dominik Jain	86cca8ffc3	Add comment explaining use of _logFormat	2023-10-26 12:50:08 +02:00
Dominik Jain	c613557740	Apply datetime_tag() in high-level examples	2023-10-26 12:50:08 +02:00
Dominik Jain	d684dae6cd	Change default number of environments (train=#CPUs, test=1)	2023-10-26 12:50:08 +02:00
Dominik Jain	3cd6dcc307	BaseTrainer: Remove info on default values from docstrings	2023-10-26 10:55:03 +02:00
Dominik Jain	da2194eff6	Force kwargs in PolicyWrapperFactoryIntrinsicCuriosity init	2023-10-26 10:43:59 +02:00
Dominik Jain	96298eafd8	Add convenient construction mechanisms for Environments (based on factory function for a single environment)	2023-10-25 21:20:07 +02:00


535 Commits