# tianshou
Tianshou (天授) is a reinforcement learning platform. The following image illustrates its architecture.
<img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/tianshou_architecture.png" height="200"/>
## agent
&nbsp;&nbsp;&nbsp;&nbsp;Examples
&nbsp;&nbsp;&nbsp;&nbsp;Self-play Framework
## core
### Policy Wrapper
&nbsp;&nbsp;&nbsp;&nbsp;Stochastic policies (OnehotCategorical, Gaussian) and deterministic policies (as in DQN and DDPG)
&nbsp;&nbsp;&nbsp;&nbsp;Specific network architectures from the original papers of DQN, TRPO, A3C, etc., and the policy-value network of AlphaGo Zero
#### Brief intro of the current implementation
How to write your own network:
- define the observation placeholder yourself and pass it to `observation_placeholder` when initializing a policy instance
- pass a callable when initializing a policy instance; the callable only has to satisfy three conditions:
  - it accepts no parameters
  - it does not create any new placeholders
  - it returns `action-related tensors, value_head`

From then on, our lib takes care of your observation placeholder as well as
all the placeholders created by our lib.
Other placeholders, such as `keep_prob` in dropout and `clip_param` in the PPO loss,
should be managed by yourself (see examples/ppo_cartpole_alternative.py).
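Below is a minimal sketch of this pattern in TensorFlow 1.x. Only the `observation_placeholder` argument and the callable contract come from the description above; the policy class name and constructor call are illustrative (see examples/ppo_example.py for real usage).

```python
import tensorflow as tf

# 1. Define the observation placeholder yourself.
observation_ph = tf.placeholder(tf.float32, shape=(None, 4))  # e.g. CartPole observations

# 2. Wrap the network in a callable: no parameters, no new placeholders,
#    returns (action-related tensors, value_head).
def my_network():
    net = tf.layers.dense(observation_ph, 64, activation=tf.nn.tanh)
    net = tf.layers.dense(net, 64, activation=tf.nn.tanh)
    action_logits = tf.layers.dense(net, 2)  # action-related tensor
    value_head = tf.layers.dense(net, 1)     # value head
    return action_logits, value_head

# 3. Pass both to the policy constructor; the class name here is a placeholder
#    for whichever policy wrapper you actually use:
# policy = OnehotCategorical(my_network, observation_placeholder=observation_ph)
```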
The `weight_update` parameter:
- 0 means the target network is updated manually
- 1 means no target network (the target network is synced every minibatch)
- a value in (0, 1) gives the soft target-network update used in DDPG
- a value greater than 1 gives the periodic hard update used in DQN
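For intuition, the regimes above roughly correspond to the following target-network update rules. This is a conceptual NumPy stand-in for the semantics of `weight_update`, not the library's actual implementation:

```python
import numpy as np

def update_target(weight_update, step, online_vars, target_vars):
    """Conceptual semantics of `weight_update` (NumPy stand-in, not the library code)."""
    if weight_update == 0:
        return  # 0: the caller updates the target network manually elsewhere
    if 0 < weight_update < 1:
        # (0, 1): DDPG-style soft update, target <- tau * online + (1 - tau) * target
        tau = weight_update
        for online, target in zip(online_vars, target_vars):
            target[...] = tau * online + (1 - tau) * target
    elif step % int(weight_update) == 0:
        # 1: sync every minibatch (effectively no target network);
        # > 1: DQN-style hard update, assumed here to fire every `weight_update` minibatches
        for online, target in zip(online_vars, target_vars):
            target[...] = online

# usage with dummy variables:
online_vars = [np.ones((2, 2))]
target_vars = [np.zeros((2, 2))]
update_target(0.001, step=1, online_vars=online_vars, target_vars=target_vars)  # soft update
```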
Other comments are in the Python files under examples/ and in the library code.
A refactor is definitely needed, so don't dwell too much on annoying details...
### Algorithm
#### losses
&nbsp;&nbsp;&nbsp;&nbsp;policy gradient (and its variants), DQN (and its variants), DDPG, TRPO, PPO
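As one concrete example of these losses, here is a generic clipped PPO surrogate using the `clip_param` mentioned above. It is a sketch of the idea, not the exact function in this repository:

```python
import tensorflow as tf

def ppo_clip_loss(log_prob, log_prob_old, advantage, clip_param=0.2):
    """Clipped PPO surrogate loss over a batch of actions (to be minimized)."""
    ratio = tf.exp(log_prob - log_prob_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = tf.clip_by_value(ratio, 1.0 - clip_param, 1.0 + clip_param)
    # maximizing the clipped objective == minimizing its negation
    return -tf.reduce_mean(tf.minimum(ratio * advantage, clipped_ratio * advantage))
```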
#### optimizer
&nbsp;&nbsp;&nbsp;&nbsp;TRPO, natural gradient, and TensorFlow optimizers (SGD, Adam)
### Planning
&nbsp;&nbsp;&nbsp;&nbsp;MCTS
## data
&nbsp;&nbsp;&nbsp;&nbsp;Training style - Batch, Replay (and its variants)
&nbsp;&nbsp;&nbsp;&nbsp;Advantage Estimation Function (see the GAE sketch below)
&nbsp;&nbsp;&nbsp;&nbsp;Multithread Read/Write
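For the advantage estimation function, a reference sketch of GAE(λ), as hinted by `adv_estimate.gae_lambda` in the TODO list below; the real interface may differ:

```python
import numpy as np

def gae_lambda(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single episode.

    `values` needs one extra entry at the end: the value of the state
    following the last reward (use 0.0 for a terminal state).
    """
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

# usage on a finished episode:
# advantages = gae_lambda(episode_rewards, np.append(episode_values, 0.0))
```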
## environment
&nbsp;&nbsp;&nbsp;&nbsp;DQN-style repeated frames, reward reshaping, image preprocessing (not sure where this belongs)
## simulator
&nbsp;&nbsp;&nbsp;&nbsp;Go, Othello/Reversi, Warzone
<img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/go.png" height="150"/> <img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/reversi.jpg" height="150"/> <img src="https://github.com/sproblvem/tianshou/blob/master/docs/figures/warzone.jpg" height="150"/>
## examples
During development, run the examples from within the `./examples/` directory, e.g. `python ppo_example.py`.
Running them from the repository root with `python examples/ppo_example.py` will not work.
## About coding style
Please follow the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html).
There is also a more detailed Chinese version: [Google Python Style Guide in Chinese](http://www.runoob.com/w3cnote/google-python-styleguide.html).
All files/folders should be named with lowercase letters and underscores (except specified names such as `AlphaGo`).
Try to use full names. Don't use abbreviations for class/function/variable names except common ones (such as `num` for number, `dim` for dimension, `env` for environment, `op` for operation). For now we use `pi` to refer to the policy in examples/ppo_example.py.
Docstrings (`"""xxx"""`) should be written right after the class/function definition. Also comment the parts of the code that are not intuitive. We must comment, but for now the comments don't need to be polished.
# High Priority TODO
For Haosheng and Tongzheng: separate the actor and the critic, and rewrite the interfaces for policy.
Others can still focus on the tasks below.
## TODO
Parallelize search-based methods.
YongRen: Policy Wrapper, in order of Gaussian, DQN and DDPG
TongzhengRen: losses, in order of PPO, PG, DQN, DDPG, with management of placeholders
YouQiaoben: data/Batch, implement num_timesteps, fix memory growth in num_episodes; adv_estimate.gae_lambda (need to write a value network in tf)
ShihongSong: data/Replay; then adv_estimate.dqn after YongRen's DQN
HaoshengZou: collaborate mainly on Policy and losses; interfaces and architecture
Note: install openai/gym first in order to run the Atari environment. Interfaces between modules may not be finalized, and the management of placeholders and `feed_dict` may have to be done manually for the time being.
Without preprocessing and other tricks, this example will not train to any meaningful result. Code should pass two tests: the individual module test and a full run of this example code.