tianshou

Tianshou (天授) is a reinforcement learning platform. The following image illustrates its architecture.

agent

    Examples

    Self-play Framework

core

Policy Wrapper

    Stochastic policies (OnehotCategorical, Gaussian), deterministic policies (as in DQN, DDPG)

    Specific network architectures from the original papers of DQN, TRPO, A3C, etc.; the policy-value network of AlphaGo Zero
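
As a concrete illustration of the policy wrapper idea, below is a minimal sketch of a stochastic (OnehotCategorical) policy; the class and method names are hypothetical and do not reflect the final Tianshou interface.

```python
# Illustrative sketch only -- names are hypothetical, not the actual interface.
import numpy as np
import tensorflow as tf


class OnehotCategoricalPolicy(object):
    """Stochastic policy over a discrete action space."""

    def __init__(self, observation_ph, logits):
        self._observation_ph = observation_ph  # placeholder for observations
        self._logits = logits                  # network output, [batch, num_actions]
        self._probs = tf.nn.softmax(logits)

    def act(self, observation, sess):
        """Sample one action for a single observation."""
        probs = sess.run(self._probs,
                         feed_dict={self._observation_ph: observation[None]})[0]
        return np.random.choice(len(probs), p=probs)
```

A deterministic policy (as in DQN or DDPG) would expose the same act interface but return the argmax over Q-values, or the network output directly.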

Algorithm

losses

    policy gradient (and its variants), DQN (and its variants), DDPG, TRPO, PPO
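
As a hedged sketch, the vanilla policy-gradient loss could look as follows in TensorFlow; the placeholder names (action_ph, advantage_ph) are assumptions, not the project's actual interface.

```python
# Sketch of a vanilla policy-gradient loss; placeholder names are illustrative.
import tensorflow as tf


def policy_gradient_loss(logits, action_ph, advantage_ph):
    """Return -E[log pi(a|s) * advantage], to be minimized."""
    num_actions = logits.shape.as_list()[-1]
    log_probs = tf.nn.log_softmax(logits)
    log_prob_act = tf.reduce_sum(
        log_probs * tf.one_hot(action_ph, num_actions), axis=-1)
    return -tf.reduce_mean(log_prob_act * advantage_ph)
```

The DQN, DDPG, TRPO and PPO losses would follow the same pattern: build a scalar loss tensor from the policy/value networks plus a few placeholders.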

optimizer

    TRPO, natural gradient (as well as stock TensorFlow optimizers such as SGD and Adam)
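
For the plain TensorFlow-optimizer case the glue code is only a few lines; this sketch assumes a scalar `loss` tensor such as the one above, while TRPO and natural gradient would replace this first-order update with their own step.

```python
# Minimal sketch: wrap a stock TensorFlow optimizer around a loss tensor.
import tensorflow as tf


def make_train_op(loss, learning_rate=1e-3):
    """Build a first-order training op for the given scalar loss."""
    return tf.train.AdamOptimizer(learning_rate).minimize(loss)
```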

Planning

    MCTS
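
A bare-bones sketch of the MCTS selection step is given below; the exploration constant and the node interface are illustrative, since the real planner has not been designed yet.

```python
# Illustrative MCTS node with PUCT-style child selection.
import math


class MCTSNode(object):
    def __init__(self, parent=None, prior=1.0):
        self.parent = parent
        self.children = {}      # action -> MCTSNode
        self.visit_count = 0
        self.total_value = 0.0
        self.prior = prior

    def value(self):
        return self.total_value / self.visit_count if self.visit_count else 0.0

    def select_child(self, c_uct=1.4):
        """Return the (action, child) pair with the best value + exploration bonus."""
        def score(child):
            bonus = c_uct * child.prior * math.sqrt(self.visit_count) / (1 + child.visit_count)
            return child.value() + bonus
        return max(self.children.items(), key=lambda item: score(item[1]))
```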

data

    Training styles: Batch, Replay (and their variants)

    Advantage Estimation Function

    Multithread Read/Write
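
For the advantage estimation part, a numpy sketch of GAE(lambda) over one episode is shown below; the argument names are illustrative and not tied to the final data interface.

```python
# Sketch of generalized advantage estimation (GAE-lambda) for one episode.
import numpy as np


def gae_lambda(rewards, values, gamma=0.99, lam=0.95):
    """Compute advantages; `values` has one extra entry for the value of the
    final state (0 if the episode terminated)."""
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```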

environment

    Frame repeating (as in DQN), reward reshaping, image preprocessing (exact placement in the architecture still undecided)
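
Frame repeating, for example, can be expressed as a thin gym wrapper; this sketch uses the old 4-tuple gym step API and is only illustrative of where such tricks might live.

```python
# Illustrative gym wrapper that repeats each action and sums the rewards.
import gym


class RepeatFrames(gym.Wrapper):
    def __init__(self, env, repeat=4):
        super(RepeatFrames, self).__init__(env)
        self._repeat = repeat

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self._repeat):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done, info
```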

simulator

    Go, Othello/Reversi, Warzone

About coding style

Please follow the Google Python coding style.

A more detailed Chinese translation of the Google Python coding style is also available.

All files/folders should be named with lower-case letters and underscores (except for designated names such as AlphaGo).

Try to use full names. Do not use abbreviations for class/function/variable names except common ones (such as num for number, dim for dimension, env for environment, op for operation). For now we use pi to refer to the policy in examples/ppo_example.py.

Docstrings ("""xxx""") should be written immediately after the class/function definition. Also comment any part of the code that is not intuitive. Comments are required, but they do not need to be polished for now.
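
A short illustration of these conventions (lower-case names with underscores, full words, docstring right after the definition); the function itself is just an example.

```python
def compute_discounted_return(rewards, discount_factor):
    """Compute the discounted return of a reward sequence."""
    discounted_return = 0.0
    for reward in reversed(rewards):
        discounted_return = reward + discount_factor * discounted_return
    return discounted_return
```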

High Priority TODO

For Haosheng and Tongzheng: separate the actor and the critic, and rewrite the policy interfaces.

Others can still focus on the task below.

TODO

Parallelize search-based methods.

YongRen: Policy Wrapper, in order of Gaussian, DQN and DDPG

TongzhengRen: losses, in order of PPO, policy gradient, DQN, DDPG, together with the management of placeholders

YouQiaoben: data/Batch, implement num_timesteps, fix the memory growth in num_episodes; adv_estimate.gae_lambda (needs a value network written in TensorFlow)

ShihongSong: data/Replay; then adv_estimate.dqn after YongRen's DQN

HaoshengZou: collaborate mainly on Policy and losses; interfaces and architecture

Note: install openai/gym first to run the Atari environment. Interfaces between modules are not yet finalized, and the management of placeholders and feed_dict may have to be done manually for the time being.

Without preprocessing and other tricks, this example will not train to any meaningful result. Code should pass two tests: individual module tests and a full run of this example code.
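
As noted above, placeholders and feed_dict are managed by hand for now; the sketch below shows one manual training step with dummy data, and every name in it is illustrative rather than part of the project interface.

```python
# Hypothetical sketch of manual placeholder / feed_dict management.
import numpy as np
import tensorflow as tf

observation_ph = tf.placeholder(tf.float32, shape=[None, 4], name="observation")
action_ph = tf.placeholder(tf.int32, shape=[None], name="action")
advantage_ph = tf.placeholder(tf.float32, shape=[None], name="advantage")

logits = tf.layers.dense(observation_ph, 2)
log_prob_act = tf.reduce_sum(
    tf.nn.log_softmax(logits) * tf.one_hot(action_ph, 2), axis=-1)
loss = -tf.reduce_mean(log_prob_act * advantage_ph)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Each module fills in its own part of the feed_dict by hand for now.
    feed_dict = {observation_ph: np.random.randn(32, 4),       # from data/Batch
                 action_ph: np.random.randint(0, 2, size=32),
                 advantage_ph: np.random.randn(32)}             # from adv_estimate
    sess.run(train_op, feed_dict=feed_dict)
```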
