# tianshou

Tianshou (天授) is a reinforcement learning platform. The following image illustrates its architecture.

## agent

Examples

Self-play Framework

## core

### Policy Wrapper

Stochastic policies (OnehotCategorical, Gaussian), deterministic policies (policy as in DQN, DDPG)

Specific network architectures in original paper of DQN, TRPO, A3C, etc. Policy-Value Network of AlphaGo Zero

### Algorithm

#### losses

policy gradient (and its variants), DQN (and its variants), DDPG, TRPO, PPO

#### optimizer

TRPO, natural gradient (and TensorFlow optimizers (sgd, adam))

### Planning

MCTS

## data

Training style - Batch, Replay (and its variants)

Advantage Estimation Function

Multithread Read/Write

## environment

DQN repeat frames, Reward Reshaping, image preprocessing (not sure where)

## simulator

Go, Othello/Reversi, Warzone
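As an illustration of the Replay training style listed under `data`, here is a minimal sketch of a fixed-capacity replay buffer. The class name `ReplayBuffer`, its methods, and the transition layout are hypothetical and chosen only for this example; they are not Tianshou's actual interface, which may not be finalized.

```python
import random
from collections import deque


class ReplayBuffer:
    """Hypothetical fixed-capacity store of transitions (not Tianshou's API)."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is hit.
        self._storage = deque(maxlen=capacity)

    def add(self, observation, action, reward, next_observation, done):
        """Append one transition; evicts the oldest one when the buffer is full."""
        self._storage.append((observation, action, reward, next_observation, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random.

        Returns five tuples: observations, actions, rewards,
        next_observations, dones.
        """
        batch = random.sample(list(self._storage), batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Toy transitions: the observation is just the step index.
    buffer.add(step, 0, float(step), step + 1, False)

observations, actions, rewards, next_observations, dones = buffer.sample(2)
```

Uniform sampling is the simplest choice; variants such as prioritized replay would replace only the `sample` method.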

## TODO

Parallelize search-based methods.

`Please write comments.`

`Please do not use abbreviations unless others can know it well. (e.g. adv can be short for advantage/adversarial; please use the full name instead)`

`Please name the module formally. (e.g. use more lower case and "_"; I think a module called "Batch" is terrible)`

YongRen: Policy Wrapper, in order of Gaussian, DQN and DDPG

TongzhengRen: losses, in order of ppo, pg, DQN, DDPG with management of placeholders

YouQiaoben: data/Batch, implement num_timesteps, fix memory growth in num_episodes; adv_estimate.gae_lambda (need to write a value network in tf)

ShihongSong: data/Replay; then adv_estimate.dqn after YongRen's DQN

HaoshengZou: collaborate mainly on Policy and losses; interfaces and architecture

Note: install openai/gym first to run the Atari environment. Interfaces between modules may not be finalized, and the management of placeholders and `feed_dict` may have to be done manually for the time being.

Without preprocessing and other tricks, this example will not train to any meaningful results.

Code should pass two tests: an individual module test, and a run through this example code.
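For the `adv_estimate.gae_lambda` item above, the following is a minimal sketch of the standard GAE(λ) recurrence in plain Python. The function name, its signature, and the use of Python lists are assumptions made for illustration; the value estimates are assumed to come from a separately written value network, and the real module may manage them through TensorFlow placeholders instead.

```python
def gae_lambda_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Hypothetical helper: generalized advantage estimates for one trajectory.

    rewards[t]  -- reward received after acting in state t
    values[t]   -- value estimate V(s_t), assumed to come from a value network
    last_value  -- V of the state after the final step (0.0 if it is terminal)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running_advantage = 0.0
    # Work backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference residual.
        delta = rewards[t] + gamma * next_value - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running_advantage = delta + gamma * lam * running_advantage
        advantages[t] = running_advantage
        next_value = values[t]
    return advantages


# Toy trajectory with zero value estimates, so advantages reduce to
# discounted reward sums when lam == 1.
example = gae_lambda_advantages(
    rewards=[1.0, 1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    last_value=0.0,
    gamma=0.5,
    lam=1.0,
)
print(example)  # [1.75, 1.5, 1.0]
```

Setting `lam=0.0` reduces each advantage to the one-step temporal-difference residual, while `lam=1.0` gives the discounted return minus the value baseline.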