Tianshou/examples/dqn_example.py

#!/usr/bin/env python
from __future__ import absolute_import

import tensorflow as tf
import gym
import numpy as np
import time

# our lib imports here! It's ok to append path in examples
import sys
sys.path.append('..')
from tianshou.core import losses
from tianshou.data.batch import Batch
import tianshou.data.advantage_estimation as advantage_estimation
import tianshou.core.policy.dqn as policy  # TODO: restructure imports as in zhusuan so that only policy needs to be imported
import tianshou.core.value_function.action_value as value_function
import tianshou.data.replay_buffer.proportional as proportional
import tianshou.data.replay_buffer.rank_based as rank_based
import tianshou.data.replay_buffer.naive as naive
import tianshou.data.replay_buffer.Replay as Replay


# TODO: why does this solve CartPole even without training?


if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    observation_dim = env.observation_space.shape
    action_dim = env.action_space.n

    clip_param = 0.2  # NOTE: not used anywhere in this DQN example
    num_batches = 10
    batch_size = 512

    seed = 0
    np.random.seed(seed)
    tf.set_random_seed(seed)

    ### 1. build network with pure tf
    observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)

    def my_network():
        net = tf.layers.dense(observation_ph, 32, activation=tf.nn.tanh)
        net = tf.layers.dense(net, 32, activation=tf.nn.tanh)

        action_values = tf.layers.dense(net, action_dim, activation=None)

        return None, action_values  # no policy head

    ### 2. build policy, loss, optimizer
    dqn = value_function.DQN(my_network, observation_placeholder=observation_ph, weight_update=100)
    pi = policy.DQN(dqn)

    dqn_loss = losses.qlearning(dqn)

    total_loss = dqn_loss
    global_step = tf.Variable(0, name='global_step', trainable=False)
    optimizer = tf.train.AdamOptimizer(1e-4)
    # pass the global_step variable explicitly so that minimize() increments it on every update
    train_op = optimizer.minimize(total_loss, var_list=dqn.trainable_variables, global_step=global_step)

    # replay_memory = naive.NaiveExperience({'size': 1000})
    replay_memory = rank_based.RankBasedExperience({'size': 30})
    # replay_memory = proportional.PropotionalExperience({'size': 100, 'batch_size': 10})
    data_collector = Replay.Replay(replay_memory, env, pi, [advantage_estimation.ReplayMemoryQReturn(1, dqn)], [dqn])

    ### 3. define data collection
    # data_collector = Batch(env, pi, [advantage_estimation.nstep_q_return(1, dqn)], [dqn])

    ### 4. start training
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        # copy the current network weights into the target network before training starts
        pi.sync_weights()  # TODO: automate this for policies with a target network

        start_time = time.time()
        # TODO: repeat_num should be defined in a configuration file
        repeat_num = 100
        for i in range(repeat_num):
            # collect data
            # data_collector.collect(nums=50)
            data_collector.collect(num_episodes=50, epsilon_greedy=(repeat_num - i + 0.0) / repeat_num)

            # print current return
            print('Epoch {}:'.format(i))
            data_collector.statistics()

            # update network
            for _ in range(num_batches):
                feed_dict = data_collector.next_batch(batch_size, tf.train.global_step(sess, global_step))
                sess.run(train_op, feed_dict=feed_dict)

            print('Elapsed time: {:.1f} min'.format((time.time() - start_time) / 60))
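
The exploration rate passed to `collect` above decays linearly with the epoch index. The following standalone sketch reproduces that schedule with plain NumPy and pairs it with a textbook epsilon-greedy action choice. The helper names `linear_epsilon` and `epsilon_greedy_action` are hypothetical and are not part of tianshou. How the library's collector actually consumes `epsilon_greedy` is not shown in this file, so the action-selection part is an assumption.

```python
import numpy as np


def linear_epsilon(epoch, num_epochs):
    """Same expression as in the example: (num_epochs - epoch) / num_epochs."""
    return (num_epochs - epoch) / float(num_epochs)


def epsilon_greedy_action(q_values, epsilon, rng):
    """Hypothetical helper: random action with probability epsilon, else the greedy one."""
    if rng.rand() < epsilon:
        return int(rng.randint(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.RandomState(0)

# Exploration rate for each of the 100 epochs used in the example (repeat_num = 100).
schedule = [linear_epsilon(i, 100) for i in range(100)]
first_eps, last_eps = schedule[0], schedule[-1]

# With epsilon = 0 the choice is always the action with the largest Q-value.
greedy = epsilon_greedy_action(np.array([0.1, 0.9]), epsilon=0.0, rng=rng)
```

Under this schedule the first epoch is fully random (epsilon is 1.0) and the last epoch still explores slightly (epsilon is 1 / repeat_num), because the loop index never reaches `repeat_num`.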