# tianshou
Tianshou (天授) is a reinforcement learning platform. The following image illustrates its architecture.
<img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/tianshou_architecture.png" height="200" />
## agent
Examples
Self-play Framework
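
A minimal sketch of how a self-play loop could fit together. The `NimEnv` toy game and the `RandomAgent` stub below are illustrative assumptions, not part of this repository:

```python
import random

class NimEnv:
    """Toy two-player game (take 1-3 stones; taking the last stone wins),
    used here only to make the self-play loop runnable."""
    def reset(self, stones=15):
        self.stones = stones
        self.player = 0
        return self.stones

    def legal_moves(self):
        return list(range(1, min(3, self.stones) + 1))

    def step(self, action):
        self.stones -= action
        done = (self.stones == 0)
        winner = self.player if done else None
        self.player = 1 - self.player
        return self.stones, done, winner

class RandomAgent:
    """Stand-in for a trained policy."""
    def act(self, state, legal_moves):
        return random.choice(legal_moves)

def self_play_episode(agent, env):
    """One self-play game: the same agent plays both sides; the recorded
    (player, state, action) tuples could later be fed to a learner."""
    trajectory = []
    state = env.reset()
    done, winner = False, None
    while not done:
        player = env.player
        action = agent.act(state, env.legal_moves())
        trajectory.append((player, state, action))
        state, done, winner = env.step(action)
    return trajectory, winner

if __name__ == "__main__":
    trajectory, winner = self_play_episode(RandomAgent(), NimEnv())
    print(len(trajectory), "moves; player", winner, "won")
```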
## core
### Policy Wrapper
Stochastic policies (OnehotCategorical, Gaussian), deterministic policies (as in DQN and DDPG)
Specific network architectures from the original papers of DQN, TRPO, A3C, etc.; the policy-value network of AlphaGo Zero
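
A minimal sketch of what a stochastic policy wrapper could look like, written against the TensorFlow 1.x API used in this codebase. The class and method names are assumptions for illustration, not the actual interface:

```python
import tensorflow as tf

class OnehotCategoricalPolicy:
    """Sketch of a stochastic policy wrapper: a small network maps an
    observation placeholder to logits, and an action is sampled from the
    resulting categorical distribution."""
    def __init__(self, observation_dim, action_num):
        self.observation_ph = tf.placeholder(tf.float32, [None, observation_dim])
        hidden = tf.layers.dense(self.observation_ph, 64, activation=tf.nn.tanh)
        self.logits = tf.layers.dense(hidden, action_num)
        self.action = tf.multinomial(self.logits, num_samples=1)

    def act(self, session, observation):
        # feed_dict management is manual for now (see the note at the bottom).
        return session.run(self.action,
                           feed_dict={self.observation_ph: [observation]})[0, 0]
```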
### Algorithm
#### losses
policy gradient (and its variants), DQN (and its variants), DDPG, TRPO, PPO
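
For illustration, a sketch of the vanilla policy gradient loss in TensorFlow 1.x; the observation dimension, action count, and the small network are assumptions made only for this example:

```python
import tensorflow as tf

# Placeholders (their management is manual in this repo for now).
observation_ph = tf.placeholder(tf.float32, [None, 8])  # hypothetical 8-dim observation
action_ph = tf.placeholder(tf.int32, [None])             # actions actually taken
advantage_ph = tf.placeholder(tf.float32, [None])        # estimated advantages

# A small policy network; the real architectures live under core/.
hidden = tf.layers.dense(observation_ph, 64, activation=tf.nn.tanh)
logits = tf.layers.dense(hidden, 4)                       # 4 discrete actions assumed

# Vanilla policy gradient: maximize E[log pi(a|s) * advantage],
# written here as a loss to be minimized.
log_prob = -tf.nn.sparse_softmax_cross_entropy_with_logits(labels=action_ph, logits=logits)
pg_loss = -tf.reduce_mean(log_prob * advantage_ph)
train_op = tf.train.AdamOptimizer(1e-3).minimize(pg_loss)
```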
#### optimizer
TRPO, natural gradient (as well as TensorFlow optimizers such as SGD and Adam)
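
A small NumPy sketch of the natural gradient idea, theta <- theta + alpha * F^{-1} g, where F is the Fisher information matrix. TRPO additionally enforces a KL-divergence constraint and solves the linear system with conjugate gradients; this only shows the core update:

```python
import numpy as np

def natural_gradient_step(theta, gradient, fisher, step_size=0.1, damping=1e-3):
    """One natural-gradient update: precondition the ordinary gradient by the
    inverse Fisher information matrix (with damping for numerical stability)."""
    preconditioned = np.linalg.solve(fisher + damping * np.eye(len(theta)), gradient)
    return theta + step_size * preconditioned

# Toy usage with a hand-made Fisher matrix.
theta = np.zeros(2)
gradient = np.array([1.0, 0.5])
fisher = np.array([[2.0, 0.0], [0.0, 0.5]])
print(natural_gradient_step(theta, gradient, fisher))
```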
### Planning
MCTS
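
A compact UCT-style MCTS sketch on the same toy take-away game used above. It is illustrative only; the real planner will operate on Go/Reversi/Warzone states:

```python
import math
import random

# Toy take-away game: remove 1-3 stones; the player taking the last stone wins.
def legal_moves(stones):
    return list(range(1, min(3, stones) + 1))

def rollout(stones):
    """Random playout; +1 if the player to move at `stones` wins, else -1."""
    player = 0
    while True:
        stones -= random.choice(legal_moves(stones))
        if stones == 0:
            return 1 if player == 0 else -1
        player = 1 - player

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones
        self.parent = parent
        self.children = {}   # move -> Node
        self.visits = 0
        self.value = 0.0     # total result, seen from the player who moved into this node

    def ucb_child(self, c=1.4):
        # UCT selection: average value plus an exploration bonus.
        return max(self.children.items(),
                   key=lambda kv: kv[1].value / kv[1].visits
                   + c * math.sqrt(math.log(self.visits) / kv[1].visits))

def mcts(root_stones, num_simulations=2000):
    root = Node(root_stones)
    for _ in range(num_simulations):
        node = root
        # Selection: descend while the node is non-terminal and fully expanded.
        while node.stones > 0 and len(node.children) == len(legal_moves(node.stones)):
            _, node = node.ucb_child()
        # Expansion: add one untried child if the node is non-terminal.
        if node.stones > 0:
            move = random.choice([m for m in legal_moves(node.stones)
                                  if m not in node.children])
            node.children[move] = Node(node.stones - move, parent=node)
            node = node.children[move]
        # Simulation: random playout (a terminal node is a loss for its mover).
        result = rollout(node.stones) if node.stones > 0 else -1
        # Backpropagation: flip the sign at each level because players alternate.
        while node is not None:
            node.visits += 1
            result = -result
            node.value += result
            node = node.parent
    # Recommend the most visited root move.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]

print("suggested move from 10 stones:", mcts(10))
```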
## data
Training styles: Batch, Replay (and its variants)
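
A minimal sketch of an experience replay buffer; the actual `data/Replay` interface may differ:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions and sample uniform mini-batches for off-policy training."""
    def __init__(self, capacity=10000):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add(self, observation, action, reward, next_observation, done):
        self.storage.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        # Transpose the list of transitions into per-field tuples.
        return tuple(zip(*batch))

# Toy usage with dummy transitions.
buffer = ReplayBuffer()
for t in range(100):
    buffer.add(t, 0, 1.0, t + 1, False)
observations, actions, rewards, next_observations, dones = buffer.sample(32)
```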
Advantage Estimation Function
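
A NumPy sketch of GAE(lambda) as one concrete advantage estimation function (roughly what an `adv_estimate.gae_lambda` module might compute; the exact interface is still to be decided):

```python
import numpy as np

def gae_lambda(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016):
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = sum_l (gamma * lam)^l * delta_{t+l}
    `values` must contain one extra entry for the state after the last reward."""
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy usage: 4 rewards, 5 value predictions (including the bootstrap value).
print(gae_lambda(np.array([1.0, 1.0, 1.0, 1.0]),
                 np.array([0.5, 0.5, 0.5, 0.5, 0.0])))
```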
Multithread Read/Write
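
A small sketch of guarding a shared data buffer with a lock, so that collector threads can write while a training thread reads (the class name is illustrative):

```python
import threading

class ThreadSafeBuffer:
    """Shared buffer protected by a lock for concurrent read/write."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = []

    def write(self, item):
        with self._lock:
            self._data.append(item)

    def read_all(self):
        with self._lock:
            return list(self._data)

# Toy usage: several writer threads, then a read.
buffer = ThreadSafeBuffer()
writers = [threading.Thread(target=buffer.write, args=(i,)) for i in range(4)]
for t in writers:
    t.start()
for t in writers:
    t.join()
print(buffer.read_all())
```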
## environment
Frame repeat (as in DQN), reward shaping, image preprocessing (module placement not yet decided)
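
An illustrative gym wrapper for action/frame repeat plus a simple grayscale preprocessing helper, using the classic gym step interface `(observation, reward, done, info)`. The `Pong-v0` line assumes openai/gym is installed with its Atari extras:

```python
import gym
import numpy as np

class RepeatAction(gym.Wrapper):
    """Repeat the chosen action for `repeat` frames and sum the rewards,
    as in the Atari DQN setup."""
    def __init__(self, env, repeat=4):
        super(RepeatAction, self).__init__(env)
        self.repeat = repeat

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.repeat):
            observation, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return observation, total_reward, done, info

def preprocess(frame):
    """Simple image preprocessing: RGB frame -> grayscale floats in [0, 1]."""
    return np.dot(frame[..., :3], [0.299, 0.587, 0.114]) / 255.0

# Requires openai/gym with Atari support installed.
env = RepeatAction(gym.make("Pong-v0"))
```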
## simulator
Go, Othello/Reversi, Warzone
<img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/go.png" height="150" /> <img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/reversi.jpg" height="150" /> <img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/warzone.jpg" height="150" />
## TODO
Parallelize search-based methods.
<font color="red">Write comments. Please do not use abbreviations unless they are widely understood (e.g. "adv" could mean advantage or adversarial; use the full name instead).</font>
<font color="red">Please name modules consistently (e.g. prefer lower case with underscores; a module called "Batch" is a poor name).</font>
YongRen: Policy Wrapper, in order of Gaussian, DQN and DDPG
TongzhengRen: losses, in order of PPO, policy gradient, DQN, DDPG, with management of placeholders
YouQiaoben: data/Batch, implement num_timesteps, fix memory growth in num_episodes; adv_estimate.gae_lambda (need to write a value network in tf)
ShihongSong: data/Replay; then adv_estimate.dqn after YongRen's DQN
HaoshengZou: collaborate mainly on Policy and losses; interfaces and architecture
Note: install openai/gym first to run the Atari environment. Interfaces between modules may not be finalized, and the management of placeholders and `feed_dict` may have to be done manually for the time being.
Without preprocessing and other tricks, this example will not train to any meaningful result. Code should pass two tests: the individual module tests and a full run of this example code.
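
For reference, a minimal TensorFlow 1.x snippet showing the manual placeholder and `feed_dict` management mentioned above; names and shapes are arbitrary placeholders for illustration:

```python
import numpy as np
import tensorflow as tf

# Manual placeholder management: every training step builds its own feed_dict.
observation_ph = tf.placeholder(tf.float32, [None, 4], name="observation")
target_ph = tf.placeholder(tf.float32, [None], name="target")

prediction = tf.squeeze(tf.layers.dense(observation_ph, 1), axis=1)
loss = tf.reduce_mean(tf.square(prediction - target_ph))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed_dict = {observation_ph: np.random.rand(32, 4),
                 target_ph: np.random.rand(32)}
    _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
    print("loss:", loss_value)
```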