Send the state-action pair to the simulator, which returns the next state and a reward. Then initialize a new node with the next state. Prior information may be provided for initialization.
At the new leaf node, use a quick (rollout) policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node. Use this collected reward to initialize the value of the node.
Alternatively, send the state to some external estimator and use the returned result to initialize the node's value.
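A minimal sketch of this expansion-and-rollout step, assuming a deterministic `simulator(state, action)` that returns `(next_state, reward)`, plus `rollout_policy` and `is_terminal` helpers; all of these names are illustrative, not part of any particular library.

```python
class Node:
    """Tree node for the deterministic case (illustrative structure)."""
    def __init__(self, state, parent=None, prior=None):
        self.state = state
        self.parent = parent
        self.children = {}    # action -> child Node
        self.reward = 0.0     # one-step reward received when entering this node
        self.value = 0.0      # value estimate, initialized by the rollout or an estimator
        self.visits = 0
        self.prior = prior    # optional external prior information


def expand_and_rollout(parent, action, simulator, rollout_policy, is_terminal, gamma=1.0):
    """Expand one leaf under `parent` via `action`, then initialize its value by a rollout."""
    # The simulator returns the next state and a one-step reward.
    next_state, reward = simulator(parent.state, action)
    # Initialize a new node with the next state and attach it to the tree.
    leaf = Node(next_state, parent=parent)
    leaf.reward = reward
    parent.children[action] = leaf
    # Quick (rollout) policy: play to a terminal state, collecting discounted reward.
    total, discount, state = 0.0, 1.0, next_state
    while not is_terminal(state):
        a = rollout_policy(state)
        state, r = simulator(state, a)
        total += discount * r
        discount *= gamma
    # Use the collected reward to initialize the leaf's value before backpropagation.
    leaf.value = total
    return leaf
```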
Then turn to backpropagation to send this value back up the tree.
### Backpropagation
From the leaf node up to the root node, update every node visited in this iteration.
For each node, a value is returned by its child node. Add this value (possibly multiplied by a discount factor gamma) to the stored reward for that child to get a new value. The new value is used to update the Q value of the corresponding child.
For UCT methods, the new value is added to W and N is incremented by 1; Q and U can then be computed.
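A sketch of that UCT bookkeeping; the UCB1-style exploration bonus and the constant `c` are assumptions about the exact variant in use.

```python
import math

class UCTStats:
    """Per-node statistics kept for UCT (illustrative field names)."""
    def __init__(self):
        self.W = 0.0   # sum of backed-up values
        self.N = 0     # visit count
        self.Q = 0.0   # mean value, W / N

    def update(self, value):
        # The backed-up value is added to W, N is incremented, and Q is recomputed.
        self.W += value
        self.N += 1
        self.Q = self.W / self.N

    def ucb(self, parent_N, c=1.4):
        # U is the exploration bonus; selection picks the child maximizing Q + U.
        if self.N == 0:
            return float("inf")
        U = c * math.sqrt(math.log(parent_N) / self.N)
        return self.Q + U
```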
For Thompson sampling, the new value is treated as a sample to update the posterior distribution.
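As one concrete (assumed) instance, if each node's value is modeled as a Gaussian with known observation noise, treating the backed-up value as a sample gives the standard conjugate update sketched below; the prior parameters are placeholders.

```python
import random

class GaussianPosterior:
    """Normal prior over a node's value with known observation noise (an assumed model)."""
    def __init__(self, mu0=0.0, var0=1.0, obs_var=1.0):
        self.mu = mu0           # posterior mean
        self.var = var0         # posterior variance
        self.obs_var = obs_var  # assumed variance of each backed-up value

    def update(self, value):
        # Treat the backed-up value as one noisy observation of the node's value.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mu = (self.mu / self.var + value / self.obs_var) / precision
        self.var = 1.0 / precision

    def sample(self):
        # During selection, Thompson sampling draws one value from each child's
        # posterior and picks the child with the largest draw.
        return random.gauss(self.mu, self.var ** 0.5)
```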
Then return the new value to its parent node.
Backpropagation stops when the root node is reached; then selection starts again.
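Putting the steps together, the pass from leaf to root might look like the following sketch; it assumes each node stores its one-step `reward`, a `value`, a visit count, and a `parent` pointer, as in the node sketch earlier, and shows a running-mean update (under UCT one would call the W/N/Q update above instead).

```python
def backpropagate(leaf, gamma=1.0):
    """Walk from the leaf to the root, updating every node visited in this iteration."""
    value = leaf.value   # value from the rollout or the external estimator
    node = leaf
    while node is not None:
        # Update this node's running statistics with the backed-up value.
        node.visits += 1
        node.value += (value - node.value) / node.visits
        # Add the stored one-step reward to the (discounted) value and
        # return the result to the parent node.
        value = node.reward + gamma * value
        node = node.parent
    # Backpropagation stops at the root; selection starts again from there.
```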
## MCTS with random environments
The agent interacts with a random environment; that is, the next state and reward for a state-action pair are not deterministic. We do not know the hidden dynamics and reward distribution; we can only get samples from the simulator.
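In code, all the planner assumes is a generative interface like the sketch below; the toy dynamics inside it are placeholders, and only the sampled outputs are visible to the tree search.

```python
import random

def simulator(state, action):
    """Generative model of a random environment: the same (state, action) pair can
    yield different next states and rewards on each call (toy dynamics, illustrative)."""
    next_state = state + action + random.choice([-1, 0, 1])            # noisy transition
    reward = random.gauss(1.0 if next_state > state else 0.0, 0.1)     # noisy reward
    return next_state, reward
```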
### Node
Both state nodes and action nodes are needed here.
#### State nodes
Elements for a state node (a minimal sketch of both node types follows this list):
+ state: the state for the current node
+ parent node: the parent action node of this node on the tree
+ children: the action nodes chosen from this node
+ prior: some external information for this node (defaults to uniform)
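A minimal sketch of the two node types, with field names mirroring the list above; the action-node fields (reward sum, visit count, Q) are assumptions about what the statistics-keeping needs, not a fixed specification.

```python
from dataclasses import dataclass, field

@dataclass
class StateNode:
    state: object                                   # the state for this node
    parent: "ActionNode" = None                     # parent action node on the tree
    children: dict = field(default_factory=dict)    # action -> ActionNode chosen from here
    prior: dict = field(default_factory=dict)       # external prior info (empty = uniform)

@dataclass
class ActionNode:
    action: object                                  # the action taken at the parent state node
    parent: StateNode = None                        # parent state node
    children: dict = field(default_factory=dict)    # sampled next state -> StateNode
    reward_sum: float = 0.0                         # running sum of sampled one-step rewards
    N: int = 0                                      # visit count
    Q: float = 0.0                                  # value estimate updated during backpropagation
```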
In the selection part, an action is chosen for the current state node. The state-action pair is then sent to the simulator, which returns the next state and a reward.
If the next state has been seen from this action node before, go to the corresponding child node and continue selection.
If not, stop selection and start expansion.
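A sketch of one such selection step, assuming the node types above and a `choose_action` rule (UCT or Thompson sampling); helper names are illustrative.

```python
def select_step(state_node, simulator, choose_action):
    """One selection step in a random environment (illustrative helper names)."""
    # Choose an action for the current state node, e.g. by UCT score or Thompson sampling.
    action_node = choose_action(state_node)
    # Send the state-action pair to the simulator; it samples a next state and a reward.
    next_state, reward = simulator(state_node.state, action_node.action)
    # Has this next state been seen from this action node before?
    seen_before = next_state in action_node.children
    return action_node, next_state, reward, seen_before

# Caller: if seen_before, continue selection from action_node.children[next_state];
# otherwise stop selection and expand a new state node for next_state.
```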
### Expansion
Initialize a new state node with the next state and add it to the children of the parent action node. Then generate all possible action children for this node and initialize them.
Prior information may be provided for initialization.
At the new state node, choose one action node as the leaf. Then use a quick policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node.
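A sketch of this expansion step, reusing the node types from the earlier sketch; `possible_actions`, `rollout_policy`, and `is_terminal` are assumed helpers.

```python
def expand_state(action_node, next_state, possible_actions, rollout_policy,
                 simulator, is_terminal, gamma=1.0):
    """Create a new state node under `action_node`, generate its action children,
    and run a quick-policy rollout starting from one of them (illustrative)."""
    # New state node attached as a child of the parent action node.
    node = StateNode(state=next_state, parent=action_node)
    action_node.children[next_state] = node
    # Generate and initialize all possible action children for this state.
    for a in possible_actions(next_state):
        node.children[a] = ActionNode(action=a, parent=node)
    # Choose one action child as the leaf; take that action first, then follow
    # the quick policy to a terminal state, collecting discounted reward.
    leaf = next(iter(node.children.values()))
    total, discount, state = 0.0, 1.0, next_state
    first_step = True
    while not is_terminal(state):
        a = leaf.action if first_step else rollout_policy(state)
        first_step = False
        state, r = simulator(state, a)
        total += discount * r
        discount *= gamma
    return leaf, total   # the collected reward is backed up starting from this leaf
```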
### Backpropagation
For each state node, a V value is returned by its child action node. Add this value (possibly multiplied by a discount factor gamma) to the stored expected one-step reward for that action node to get a new value. The new value is used to update the parameters of the child action node.
For UCT, fold the new value into the running average Q using N, then increment N by 1.
For Thompson sampling, just return the Q value to the parent action node as a sample.
Backpropagation stops when the root node is reached; then selection starts again.
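A sketch of this backward pass with UCT bookkeeping, assuming the action-node fields from the earlier sketch (`reward_sum`, `N`, `Q`) and that the rewards sampled along the path were recorded during selection.

```python
def backpropagate_path(path, rollout_value, gamma=1.0):
    """Back up one simulation. `path` is the list of (action_node, sampled_reward)
    pairs visited from the root down to the leaf (UCT bookkeeping shown)."""
    value = rollout_value
    for action_node, sampled_reward in reversed(path):
        # Keep a running estimate of the expected one-step reward for this action node.
        action_node.reward_sum += sampled_reward
        action_node.N += 1
        expected_reward = action_node.reward_sum / action_node.N
        # New value = expected one-step reward + discounted value from below,
        # folded into the action node's running average Q before moving up.
        value = expected_reward + gamma * value
        action_node.Q += (value - action_node.Q) / action_node.N
        # For Thompson sampling one would instead treat `value` as a sample
        # for this action node's posterior and pass it on to the parent.
```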
## MCTS for POMDPs
The agent interacts with a partially observed environment; that is, we can only see observations and choose actions. A simulator is needed here: every time we send a state-action pair to the simulator, it returns a new state, an observation, and a reward. A prior distribution over states is needed for the root node.
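In code, the POMDP case only assumes a generative interface and a root belief we can sample from; the toy dynamics below are placeholders.

```python
import random

def pomdp_simulator(state, action):
    """Generative POMDP model: returns a sampled next state, an observation of it,
    and a reward for a state-action pair (toy dynamics, illustrative)."""
    next_state = state + action + random.choice([-1, 0, 1])
    observation = next_state + random.choice([-1, 0, 1])   # noisy view of the hidden state
    reward = 1.0 if next_state > state else 0.0
    return next_state, observation, reward

def sample_root_state(prior_states, prior_weights):
    """Draw one state from the prior distribution over states at the root node."""
    return random.choices(prior_states, weights=prior_weights, k=1)[0]
```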
### Node
We use observation nodes and action nodes here.
#### Observation nodes
Elements for an observation node:
+ observation: the observation for the current node
+ parent node: the parent action node of this node on the tree
+ children: the action nodes chosen from this node
+ prior: some external information for this node (defaults to uniform)
In the selection part, we first need to sample a state from the root node's prior distribution.
At each observation node, an action *a* is chosen. Here we have a state *s* associated with this observation. We then send the *(s,a)* pair to the simulator, which returns a new state *s'*, an observation, and a reward.
If that observation has been seen from this action node before, go to the corresponding child observation node, carrying the new state *s'*, and continue selection.
If not, stop selection and start expansion.
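A sketch of one selection step in the POMDP tree, assuming observation-node and action-node types analogous to the earlier sketch, with an action node's children keyed by observation, and the `pomdp_simulator` above; helper names are illustrative.

```python
def pomdp_select_step(obs_node, state, simulator, choose_action):
    """One selection step at an observation node, carrying a sampled hidden state."""
    # Choose an action a at this observation node, e.g. by UCT or Thompson sampling.
    action_node = choose_action(obs_node)
    # Send (s, a) to the simulator: it samples a next state s', an observation, and a reward.
    next_state, observation, reward = simulator(state, action_node.action)
    child = action_node.children.get(observation)
    if child is not None:
        # This observation has been seen from this action node before:
        # continue selection from that child, carrying the new state s'.
        return child, next_state, reward, False
    # Otherwise stop selection; expansion will create a new observation node here.
    return action_node, next_state, reward, True

# Each simulation begins by sampling the root state from the prior, e.g.
#   state = sample_root_state(prior_states, prior_weights)
```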
### Expansion
Initialize a new observation node and add it to the children of its parent action node. Then generate all possible action children for this node and initialize them.
Prior information may be provided for initialization.
At the new observation node, choose one action node as the leaf. Then use a quick policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node.
### Backpropagation
For each observation node, a Q value is returned by its child action node. Add this value (possibly multiplied by a discount factor gamma) to the stored expected one-step reward for that action node to get a new value. The new value is used to update the parameters of the child action node.