tianshou
Tianshou (天授) is a reinforcement learning platform. The following image illustrates its architecture.

agent
Examples
Self-play Framework
core
Policy Wrapper
Stochastic policies (OnehotCategorical, Gaussian), deterministic policies (as in DQN, DDPG)
Specific network architectures from the original papers of DQN, TRPO, A3C, etc.; the policy-value network of AlphaGo Zero
Brief intro of the current implementation:
How to write your own network:
- define the observation placeholder yourself and pass it to the observation_placeholder argument when initializing a policy instance;
- pass a callable when initializing a policy instance. The callable should satisfy only three conditions:
  - it accepts no parameters,
  - it does not create any new placeholders,
  - it returns the action-related tensors and the value_head.
Our lib will take care of your observation placeholder from then on, as well as all the placeholders created by our lib. The other placeholders, such as keep_prob in dropout and clip_param in the PPO loss, should be managed by yourself (see examples/ppo_cartpole_alternative.py).
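A minimal sketch of this workflow, assuming the policy class OnehotCategorical can be imported from tianshou.core.policy and accepts the network callable plus an observation_placeholder keyword (the import path and exact constructor signature are assumptions for illustration, not the finalized interface):

```python
import tensorflow as tf

# Hypothetical import path; the actual module layout may differ.
from tianshou.core import policy as ts_policy

# 1. Define the observation placeholder yourself.
observation_ph = tf.placeholder(tf.float32, shape=(None, 4), name='observation')

# 2. The callable accepts no parameters, creates no new placeholders,
#    and returns the action-related tensors and the value_head.
def my_network():
    net = tf.layers.dense(observation_ph, 64, activation=tf.nn.tanh)
    action_logits = tf.layers.dense(net, 2, activation=None)  # action-related tensor
    value_head = tf.layers.dense(net, 1, activation=None)     # value head
    return action_logits, value_head

# 3. Pass both when initializing a policy instance. The class name and exact
#    keyword arguments here are assumptions for illustration.
pi = ts_policy.OnehotCategorical(my_network,
                                 observation_placeholder=observation_ph)
```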
The weight_update parameter:
- 0 means the target network is updated manually;
- 1 means no target network (the target network is updated after every minibatch);
- a value in (0, 1) means the soft target network update as used in DDPG;
- a value greater than 1 means the hard target network update as used in DQN.
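For concreteness, a sketch of how these values might be passed, reusing the names from the sketch above; only the weight_update keyword itself comes from this README, while the class names and other arguments are assumptions:

```python
# Illustrative only: class names and other arguments are assumptions.
pi = ts_policy.DQN(my_network, observation_placeholder=observation_ph,
                   weight_update=1)      # no target network, refreshed every minibatch

pi = ts_policy.Deterministic(my_network, observation_placeholder=observation_ph,
                             weight_update=0.001)  # DDPG-style soft (Polyak) update

pi = ts_policy.DQN(my_network, observation_placeholder=observation_ph,
                   weight_update=1000)   # DQN-style hard update every 1000 updates

# weight_update=0: you update the target network manually yourself.
```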
Other comments are in the Python files under examples/ and in the library code. Refactoring is definitely needed, so don't dwell too much on annoying details...
Algorithm
losses
policy gradient (and its variants), DQN (and its variants), DDPG, TRPO, PPO
optimizer
TRPO, natural gradient (and TensorFlow optimizers: SGD, Adam)
Planning
MCTS
data
Training style - Batch, Replay (and its variants)
Advantage Estimation Function
Multithread Read/Write
environment
Frame repeat as in DQN, reward reshaping, image preprocessing (not sure where these belong)
simulator
Go, Othello/Reversi, Warzone
examples
During development, run the examples under the ./examples/ directory with, e.g., python ppo_example.py. Running them from the repository root with python examples/ppo_example.py will not work.
About coding style
Please follow the Google Python coding style.
There is also a more detailed Chinese version of the Google Python coding style.
All files/folders should be named with lowercase letters and underscores (except for specified names such as AlphaGo).
Try to use full names. Don't use abbreviations for class/function/variable names except common ones (such as num for number, dim for dimension, env for environment, op for operation). For now we use pi to refer to the policy in examples/ppo_example.py.
The """xxx""" comment should be written right after class/function. Also comment the part that's not intuitive during the code. We must comment, but for now we don't need to polish them.
TODO
Parallelization of search-based methods.
YongRen: Policy Wrapper, in order of Gaussian, DQN and DDPG
TongzhengRen: losses, in order of PPO, PG, DQN, DDPG, with management of placeholders
YouQiaoben: data/Batch, implement num_timesteps, fix memory growth in num_episodes; adv_estimate.gae_lambda (need to write a value network in tf)
ShihongSong: data/Replay; then adv_estimate.dqn after YongRen's DQN
HaoshengZou: collaborate mainly on Policy and losses; interfaces and architecture
Note: install openai/gym first to run the Atari environment; note that interfaces between modules may not be finalized; the management of placeholders and feed_dict may have to be done manually for the time being.
Without preprocessing and other tricks, this example will not train to any meaningful result. Code should pass two tests: the individual module tests and a full run of this example code.
Some bugs to fix
For DQN and other deterministic policies: $\epsilon$-greedy or other exploration during collection? (See the sketch after this list.)
In Batch.py, notice that we cannot stop by setting num_timestep.
Magic numbers.
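For the exploration question above, one conventional shape of $\epsilon$-greedy action selection during collection, sketched in plain NumPy (how the Q-values are obtained and where this hooks into the collector are assumptions):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

# During collection the collector could anneal epsilon over time, e.g.
# epsilon = max(0.05, 1.0 - step / 1e6), and act purely greedily at evaluation.
```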
One idea
Like ZhuSuan, we can register losses in the background so that we need not declare them explicitly in the example.
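A very rough sketch of what such a registration mechanism could look like; everything below is hypothetical and does not exist in the codebase yet:

```python
# Purely hypothetical sketch of a loss registry.
_LOSS_REGISTRY = {}

def register_loss(name):
    def decorator(build_fn):
        _LOSS_REGISTRY[name] = build_fn
        return build_fn
    return decorator

@register_loss('ppo_clip')
def ppo_clip_loss(policy, clip_param=0.2):
    # build and return the clipped-surrogate loss tensor for `policy` here
    raise NotImplementedError

# An example script could then just look the loss up by name:
# loss = _LOSS_REGISTRY['ppo_clip'](pi)
```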
Dependency
TensorFlow (version >= 1.4), Gym