Tianshou/tianshou/core/policy/dqn.py

from __future__ import absolute_import

from .base import PolicyBase
import tensorflow as tf
import numpy as np


class DQN(PolicyBase):
    """
    Epsilon-greedy policy that holds a DQN action-value function
    (from value_function) as a member.
    """
    def __init__(self, dqn, epsilon_train=0.1, epsilon_test=0.05):
        self.action_value = dqn
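        # greedy action: index of the largest action value for each observation in the batch
        self._argmax_action = tf.argmax(dqn.value_tensor_all_actions, axis=1)
        self.weight_update = dqn.weight_update
        # interaction_count drives the periodic call to update_weights() in act/act_test,
        # which only happens when weight_update > 1; otherwise it starts at -1
        if self.weight_update > 1:
            self.interaction_count = 0
        else:
            self.interaction_count = -1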

        self.epsilon_train = epsilon_train
        self.epsilon_test = epsilon_test

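    def act(self, observation, my_feed_dict=None):
        """
        Choose an action during training, epsilon-greedy with epsilon_train.
        :param observation: a single observation, without the batch dimension.
        :param my_feed_dict: optional extra entries for the session feed dict.
        :return: the chosen action.
        """
        sess = tf.get_default_session()
        # call update_weights() once every weight_update interactions
        if self.weight_update > 1:
            if self.interaction_count % self.weight_update == 0:
                self.update_weights()

        # observation[None] adds a leading batch dimension of size 1
        feed_dict = {self.action_value._observation_placeholder: observation[None]}
        if my_feed_dict:
            feed_dict.update(my_feed_dict)
        action = sess.run(self._argmax_action, feed_dict=feed_dict)

        # epsilon-greedy: with probability epsilon_train, take a uniformly random action
        if np.random.rand() < self.epsilon_train:
            action = np.random.randint(self.action_value.num_actions)

        if self.weight_update > 0:
            self.interaction_count += 1

        return np.squeeze(action)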

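    def act_test(self, observation, my_feed_dict=None):
        """
        Choose an action during testing, epsilon-greedy with epsilon_test.
        :param observation: a single observation, without the batch dimension.
        :param my_feed_dict: optional extra entries for the session feed dict.
        :return: the chosen action.
        """
        sess = tf.get_default_session()
        # call update_weights() once every weight_update interactions
        if self.weight_update > 1:
            if self.interaction_count % self.weight_update == 0:
                self.update_weights()

        # observation[None] adds a leading batch dimension of size 1
        feed_dict = {self.action_value._observation_placeholder: observation[None]}
        if my_feed_dict:
            feed_dict.update(my_feed_dict)
        action = sess.run(self._argmax_action, feed_dict=feed_dict)

        # epsilon-greedy: with probability epsilon_test, take a uniformly random action
        if np.random.rand() < self.epsilon_test:
            action = np.random.randint(self.action_value.num_actions)

        if self.weight_update > 0:
            self.interaction_count += 1

        return np.squeeze(action)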

    @property
    def q_net(self):
        return self.action_value

    def sync_weights(self):
        """
        Sync the weights of network_old by directly copying the weights of network.
        Does nothing if the action-value function defines no sync ops.
        :return: None
        """
        if self.action_value.sync_weights_ops is not None:
            self.action_value.sync_weights()

    def update_weights(self):
        """
        Update the weights of policy_old.
        Does nothing if the action-value function defines no weight update ops.
        :return: None
        """
        if self.action_value.weight_update_ops is not None:
            self.action_value.update_weights()

    def set_epsilon_train(self, epsilon):
        self.epsilon_train = epsilon

    def set_epsilon_test(self, epsilon):
        self.epsilon_test = epsilon
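

# ---------------------------------------------------------------------------
# Illustrative sketch, not part of the original module.
# The epsilon-greedy rule applied in DQN.act and DQN.act_test can be tried
# out without a TensorFlow session by passing the action values in directly.
# The helper name _epsilon_greedy_sketch and its rng parameter are
# assumptions made for this example only; DQN itself does not use them.
# ---------------------------------------------------------------------------
import numpy as np  # already imported above; repeated so the sketch runs standalone


def _epsilon_greedy_sketch(q_values, epsilon, rng=np.random):
    """
    Pick an action from a 1-D sequence of action values.
    :param q_values: one value per action.
    :param epsilon: probability of taking a uniformly random action.
    :param rng: object providing rand() and randint(), e.g. np.random.RandomState(0).
    :return: the chosen action index as a Python int.
    """
    q_values = np.asarray(q_values)
    # explore: with probability epsilon, ignore the values and sample uniformly
    if rng.rand() < epsilon:
        return int(rng.randint(q_values.shape[-1]))
    # exploit: otherwise take the action with the largest value
    return int(np.argmax(q_values))


# Example call (hypothetical values):
#     _epsilon_greedy_sketch([0.1, 0.9, 0.3], epsilon=0.05)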