
Offline

In the offline reinforcement learning setting, the agent learns a policy from a fixed dataset collected in advance by an arbitrary behavior policy; it does not interact with the environment during training.

Once the dataset is collected, it is not changed during training. We use d4rl datasets to train the offline agent. You can refer to d4rl for details on how to obtain and use these datasets.
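For example, a d4rl dataset can be loaded through gym. This is a minimal sketch, assuming d4rl (and its mujoco dependencies) are installed and the halfcheetah-expert-v1 task is available:

```python
import gym
import d4rl  # importing d4rl registers its environments with gym

# Build the environment and fetch its static offline dataset.
env = gym.make("halfcheetah-expert-v1")
dataset = d4rl.qlearning_dataset(env)

# The dataset is a dict of numpy arrays, one row per transition:
# "observations", "actions", "next_observations", "rewards", "terminals".
print(dataset["observations"].shape, dataset["rewards"].shape)
```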

Train

Tianshou provides an offline_trainer for offline reinforcement learning. You can parse a d4rl dataset into a ReplayBuffer and pass it as the buffer parameter of offline_trainer, as sketched below. offline_bcq.py is a complete example of offline RL with a d4rl dataset.
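A minimal sketch of that conversion, assuming the d4rl dataset format shown above (the helper name load_buffer is ours, not part of Tianshou):

```python
import gym
import d4rl  # importing d4rl registers its environments with gym
from tianshou.data import Batch, ReplayBuffer


def load_buffer(task: str = "halfcheetah-expert-v1") -> ReplayBuffer:
    """Copy a d4rl dataset, transition by transition, into a ReplayBuffer."""
    dataset = d4rl.qlearning_dataset(gym.make(task))
    size = dataset["rewards"].size
    buffer = ReplayBuffer(size)
    for i in range(size):
        buffer.add(
            Batch(
                obs=dataset["observations"][i],
                act=dataset["actions"][i],
                rew=dataset["rewards"][i],
                done=dataset["terminals"][i],
                obs_next=dataset["next_observations"][i],
            )
        )
    return buffer
```

The returned buffer can then be passed as the buffer argument of offline_trainer, together with the policy and a test Collector, as done in offline_bcq.py.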

To train an agent with the BCQ algorithm:

python offline_bcq.py --task halfcheetah-expert-v1

After 1M steps:

[Figure: reward curve of BCQ on halfcheetah-expert-v1]

halfcheetah-expert-v1 is a mujoco environment. The hyperparameter settings are similar to those of the off-policy algorithms in the mujoco examples.

Results

| Environment           | BCQ             |
| --------------------- | --------------- |
| halfcheetah-expert-v1 | 10624.0 ± 181.4 |