Send the state-action pair to the simulator, which returns the next state and a reward. Then initialize a new node with the next state. Prior information may be provided for initialization.
At the new leaf node, use a quick (rollout) policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node. Use this collected reward to initialize the value of the node.
Alternatively, send the state to some external estimator and use the returned result to initialize the node's value.
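A minimal sketch of this expansion-and-rollout step, assuming a deterministic `simulator(state, action)` that returns `(next_state, reward)`, plus `rollout_policy` and `is_terminal` helpers; all of these names are illustrative, not part of any particular library.

```python
class Node:
    """Tree node for the deterministic case (illustrative structure)."""
    def __init__(self, state, parent=None, prior=None):
        self.state = state
        self.parent = parent
        self.children = {}    # action -> child Node
        self.reward = 0.0     # one-step reward received when entering this node
        self.value = 0.0      # value estimate, initialized by the rollout or an estimator
        self.visits = 0
        self.prior = prior    # optional external prior information


def expand_and_rollout(parent, action, simulator, rollout_policy, is_terminal, gamma=1.0):
    """Expand one leaf under `parent` via `action`, then initialize its value by a rollout."""
    # The simulator returns the next state and a one-step reward.
    next_state, reward = simulator(parent.state, action)
    # Initialize a new node with the next state and attach it to the tree.
    leaf = Node(next_state, parent=parent)
    leaf.reward = reward
    parent.children[action] = leaf
    # Quick (rollout) policy: play to a terminal state, collecting discounted reward.
    total, discount, state = 0.0, 1.0, next_state
    while not is_terminal(state):
        a = rollout_policy(state)
        state, r = simulator(state, a)
        total += discount * r
        discount *= gamma
    # Use the collected reward to initialize the leaf's value before backpropagation.
    leaf.value = total
    return leaf
```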
Then turn to backpropagation to send this value back up the tree.
### Backpropagation
From the leaf node up to the root node, update every node visited in this iteration.
For each node, a value is returned by its child node. Add this value (possibly multiplied by a discount factor gamma) to the stored reward for that child to get a new value. The new value is used to update the Q value of the corresponding child.
For UCT methods, the new value is added to W and N is incremented by 1; Q and U can then be computed.
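A sketch of that UCT bookkeeping; the UCB1-style exploration bonus and the constant `c` are assumptions about the exact variant in use.

```python
import math

class UCTStats:
    """Per-node statistics kept for UCT (illustrative field names)."""
    def __init__(self):
        self.W = 0.0   # sum of backed-up values
        self.N = 0     # visit count
        self.Q = 0.0   # mean value, W / N

    def update(self, value):
        # The backed-up value is added to W, N is incremented, and Q is recomputed.
        self.W += value
        self.N += 1
        self.Q = self.W / self.N

    def ucb(self, parent_N, c=1.4):
        # U is the exploration bonus; selection picks the child maximizing Q + U.
        if self.N == 0:
            return float("inf")
        U = c * math.sqrt(math.log(parent_N) / self.N)
        return self.Q + U
```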
For Thompson sampling, the new value is treated as a sample to update the posterior distribution.
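As one concrete (assumed) instance, if each node's value is modeled as a Gaussian with known observation noise, treating the backed-up value as a sample gives the standard conjugate update sketched below; the prior parameters are placeholders.

```python
import random

class GaussianPosterior:
    """Normal prior over a node's value with known observation noise (an assumed model)."""
    def __init__(self, mu0=0.0, var0=1.0, obs_var=1.0):
        self.mu = mu0           # posterior mean
        self.var = var0         # posterior variance
        self.obs_var = obs_var  # assumed variance of each backed-up value

    def update(self, value):
        # Treat the backed-up value as one noisy observation of the node's value.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mu = (self.mu / self.var + value / self.obs_var) / precision
        self.var = 1.0 / precision

    def sample(self):
        # During selection, Thompson sampling draws one value from each child's
        # posterior and picks the child with the largest draw.
        return random.gauss(self.mu, self.var ** 0.5)
```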
Then return the new value to its parent node.
Backpropagation stops when the root node is reached; then selection starts again.
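Putting the steps together, the pass from leaf to root might look like the following sketch; it assumes each node stores its one-step `reward`, a `value`, a visit count, and a `parent` pointer, as in the node sketch earlier, and shows a running-mean update (under UCT one would call the W/N/Q update above instead).

```python
def backpropagate(leaf, gamma=1.0):
    """Walk from the leaf to the root, updating every node visited in this iteration."""
    value = leaf.value   # value from the rollout or the external estimator
    node = leaf
    while node is not None:
        # Update this node's running statistics with the backed-up value.
        node.visits += 1
        node.value += (value - node.value) / node.visits
        # Add the stored one-step reward to the (discounted) value and
        # return the result to the parent node.
        value = node.reward + gamma * value
        node = node.parent
    # Backpropagation stops at the root; selection starts again from there.
```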
## MCTS with random environments
The agent interacts with a random environment; that is, the next state and reward for a state-action pair are not deterministic. We do not know the hidden dynamics and reward distribution; we can only get samples from the simulator.
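In code, all the planner assumes is a generative interface like the sketch below; the toy dynamics inside it are placeholders, and only the sampled outputs are visible to the tree search.

```python
import random

def simulator(state, action):
    """Generative model of a random environment: the same (state, action) pair can
    yield different next states and rewards on each call (toy dynamics, illustrative)."""
    next_state = state + action + random.choice([-1, 0, 1])            # noisy transition
    reward = random.gauss(1.0 if next_state > state else 0.0, 0.1)     # noisy reward
    return next_state, reward
```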
### Node
Both state nodes and action nodes are needed here.
#### State nodes
Elements for a state node (a minimal sketch of both node types follows this list):
+ state: the state for the current node
+ parent node: the parent action node of this node on the tree
+ children: the action nodes chosen from this node
+ prior: some external information for this node (defaults to uniform)
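A minimal sketch of the two node types, with field names mirroring the list above; the action-node fields (reward sum, visit count, Q) are assumptions about what the statistics-keeping needs, not a fixed specification.

```python
from dataclasses import dataclass, field

@dataclass
class StateNode:
    state: object                                   # the state for this node
    parent: "ActionNode" = None                     # parent action node on the tree
    children: dict = field(default_factory=dict)    # action -> ActionNode chosen from here
    prior: dict = field(default_factory=dict)       # external prior info (empty = uniform)

@dataclass
class ActionNode:
    action: object                                  # the action taken at the parent state node
    parent: StateNode = None                        # parent state node
    children: dict = field(default_factory=dict)    # sampled next state -> StateNode
    reward_sum: float = 0.0                         # running sum of sampled one-step rewards
    N: int = 0                                      # visit count
    Q: float = 0.0                                  # value estimate updated during backpropagation
```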
In the selection part, an action is chosen for the current state node. The state-action pair is then sent to the simulator, which returns the next state and a reward.
If the next state has been seen from this action node before, go to the corresponding child node and continue selection.
If not, stop selection and start expansion.
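A sketch of one such selection step, assuming the node types above and a `choose_action` rule (UCT or Thompson sampling); helper names are illustrative.

```python
def select_step(state_node, simulator, choose_action):
    """One selection step in a random environment (illustrative helper names)."""
    # Choose an action for the current state node, e.g. by UCT score or Thompson sampling.
    action_node = choose_action(state_node)
    # Send the state-action pair to the simulator; it samples a next state and a reward.
    next_state, reward = simulator(state_node.state, action_node.action)
    # Has this next state been seen from this action node before?
    seen_before = next_state in action_node.children
    return action_node, next_state, reward, seen_before

# Caller: if seen_before, continue selection from action_node.children[next_state];
# otherwise stop selection and expand a new state node for next_state.
```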
### Expansion
Initialize a new state node with the next state and add it to the children of the parent action node. Then generate all possible action children for this node and initialize them.
Prior information may be provided for initialization.
At the new state node, choose one action node as the leaf. Then use a quick policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node.
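A sketch of this expansion step, reusing the node types from the earlier sketch; `possible_actions`, `rollout_policy`, and `is_terminal` are assumed helpers.

```python
def expand_state(action_node, next_state, possible_actions, rollout_policy,
                 simulator, is_terminal, gamma=1.0):
    """Create a new state node under `action_node`, generate its action children,
    and run a quick-policy rollout starting from one of them (illustrative)."""
    # New state node attached as a child of the parent action node.
    node = StateNode(state=next_state, parent=action_node)
    action_node.children[next_state] = node
    # Generate and initialize all possible action children for this state.
    for a in possible_actions(next_state):
        node.children[a] = ActionNode(action=a, parent=node)
    # Choose one action child as the leaf; take that action first, then follow
    # the quick policy to a terminal state, collecting discounted reward.
    leaf = next(iter(node.children.values()))
    total, discount, state = 0.0, 1.0, next_state
    first_step = True
    while not is_terminal(state):
        a = leaf.action if first_step else rollout_policy(state)
        first_step = False
        state, r = simulator(state, a)
        total += discount * r
        discount *= gamma
    return leaf, total   # the collected reward is backed up starting from this leaf
```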
### Backpropagation
For each state node, a V value is returned by its child action node. Add this value (possibly multiplied by a discount factor gamma) to the stored expected one-step reward for that action node to get a new value. The new value is used to update the parameters of the child action node.
For UCT, fold the new value into the running average Q using N, then increment N by 1.
For Thompson sampling, just return the Q value to the parent action node as a sample.
Backpropagation stops when the root node is reached; then selection starts again.
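A sketch of this backward pass with UCT bookkeeping, assuming the action-node fields from the earlier sketch (`reward_sum`, `N`, `Q`) and that the rewards sampled along the path were recorded during selection.

```python
def backpropagate_path(path, rollout_value, gamma=1.0):
    """Back up one simulation. `path` is the list of (action_node, sampled_reward)
    pairs visited from the root down to the leaf (UCT bookkeeping shown)."""
    value = rollout_value
    for action_node, sampled_reward in reversed(path):
        # Keep a running estimate of the expected one-step reward for this action node.
        action_node.reward_sum += sampled_reward
        action_node.N += 1
        expected_reward = action_node.reward_sum / action_node.N
        # New value = expected one-step reward + discounted value from below,
        # folded into the action node's running average Q before moving up.
        value = expected_reward + gamma * value
        action_node.Q += (value - action_node.Q) / action_node.N
        # For Thompson sampling one would instead treat `value` as a sample
        # for this action node's posterior and pass it on to the parent.
```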
## MCTS for POMDPs
The agent interacts with a partially observed environment; that is, we can only see observations and choose actions. A simulator is needed here: every time we send a state-action pair to the simulator, it returns a new state, an observation, and a reward. A prior distribution over states is needed for the root node.
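In code, the POMDP case only assumes a generative interface and a root belief we can sample from; the toy dynamics below are placeholders.

```python
import random

def pomdp_simulator(state, action):
    """Generative POMDP model: returns a sampled next state, an observation of it,
    and a reward for a state-action pair (toy dynamics, illustrative)."""
    next_state = state + action + random.choice([-1, 0, 1])
    observation = next_state + random.choice([-1, 0, 1])   # noisy view of the hidden state
    reward = 1.0 if next_state > state else 0.0
    return next_state, observation, reward

def sample_root_state(prior_states, prior_weights):
    """Draw one state from the prior distribution over states at the root node."""
    return random.choices(prior_states, weights=prior_weights, k=1)[0]
```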
### Node
We use observation nodes and action nodes here.
#### Observation nodes
Elements for an observation node:
+ observation: the observation for the current node
+ parent node: the parent action node of this node on the tree
+ children: the action nodes chosen from this node
+ prior: some external information for this node (defaults to uniform)
In the selection part, we first need to sample a state from the root node's prior distribution.
At each observation node, an action *a* is chosen. Here we have a state *s* associated with this observation. We then send the *(s,a)* pair to the simulator, which returns a new state *s'*, an observation, and a reward.
If that observation has been seen from this action node before, go to the corresponding child observation node, carrying the new state *s'*, and continue selection.
If not, stop selection and start expansion.
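A sketch of one selection step in the POMDP tree, assuming observation-node and action-node types analogous to the earlier sketch, with an action node's children keyed by observation, and the `pomdp_simulator` above; helper names are illustrative.

```python
def pomdp_select_step(obs_node, state, simulator, choose_action):
    """One selection step at an observation node, carrying a sampled hidden state."""
    # Choose an action a at this observation node, e.g. by UCT or Thompson sampling.
    action_node = choose_action(obs_node)
    # Send (s, a) to the simulator: it samples a next state s', an observation, and a reward.
    next_state, observation, reward = simulator(state, action_node.action)
    child = action_node.children.get(observation)
    if child is not None:
        # This observation has been seen from this action node before:
        # continue selection from that child, carrying the new state s'.
        return child, next_state, reward, False
    # Otherwise stop selection; expansion will create a new observation node here.
    return action_node, next_state, reward, True

# Each simulation begins by sampling the root state from the prior, e.g.
#   state = sample_root_state(prior_states, prior_weights)
```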
### Expansion
Initialize a new observation node and add it to the children of its parent action node. Then generate all possible action children for this node and initialize them.
Prior information may be provided for initialization.
At the new observation node, choose one action node as the leaf. Then use a quick policy to play the game to some terminal state and return the reward collected along the trajectory to the leaf node.
### Backpropagation
For each observation node, a Q value is returned by its child action node. Add this value (possibly multiplied by a discount factor gamma) to the stored expected one-step reward for that action node to get a new value. The new value is used to update the parameters of the child action node.