Tianshou/examples/ppo_cartpole_rllab.py

#!/usr/bin/env python
from __future__ import absolute_import

import tensorflow as tf
import time
import numpy as np

# Tianshou imports. Appending to sys.path is acceptable in an example script.
import sys
sys.path.append('..')
from tianshou.core import losses
from tianshou.data.batch import Batch
import tianshou.data.advantage_estimation as advantage_estimation
import tianshou.core.policy.stochastic as policy  # TODO: restructure imports as in ZhuSuan so that only `policy` needs to be imported

from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.envs.normalized_env import normalize


# For tutorial purposes, placeholders are explicitly given a '_ph' suffix.

if __name__ == '__main__':
    env = normalize(CartpoleEnv())
    observation_dim = env.observation_space.shape
    action_dim = env.action_space.flat_dim

    clip_param = 0.2
    num_batches = 10
    batch_size = 128

    seed = 10
    np.random.seed(seed)
    tf.set_random_seed(seed)

    ### 1. build network with pure tf
    observation_ph = tf.placeholder(tf.float32, shape=(None,) + observation_dim)

    def my_policy():
        net = tf.layers.dense(observation_ph, 32, activation=tf.nn.tanh)
        net = tf.layers.dense(net, 32, activation=tf.nn.tanh)

        action_mean = tf.layers.dense(net, action_dim, activation=None)
        action_logstd = tf.get_variable('action_logstd', shape=(action_dim, ))
        # value = tf.layers.dense(net, 1, activation=None)

        return action_mean, action_logstd, None  # None value head

    # TODO: the current design, whether a function is passed in or overridden, requires that function
    # to return a value head so that the policy and value networks can share layers. This makes
    # 'policy' and 'value_function' semantically unbalanced (although they are naturally unbalanced,
    # since 'policy' must interact with the environment and 'value_function' need not). I have an idea
    # for resolving this imbalance that does not rely on passing or overriding a function.

    ### 2. build policy, loss, optimizer
    pi = policy.Normal(my_policy, observation_placeholder=observation_ph, weight_update=0)

    ppo_loss_clip = losses.ppo_clip(pi, clip_param)

    total_loss = ppo_loss_clip
    optimizer = tf.train.AdamOptimizer(1e-4)
    train_op = optimizer.minimize(total_loss, var_list=pi.trainable_variables)

    ### 3. define data collection
    training_data = Batch(env, pi, [advantage_estimation.full_return], [pi])

    ### 4. start training
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())

        # copy the actor's weights to pi_old before training starts
        pi.sync_weights()  # TODO: automate this for policies with target network

        start_time = time.time()
        for i in range(100):
            # collect data
            training_data.collect(num_episodes=20)

            # print current return
            print('Epoch {}:'.format(i))
            training_data.statistics()

            # update network
            for _ in range(num_batches):
                feed_dict = training_data.next_batch(batch_size)
                sess.run(train_op, feed_dict=feed_dict)

            # assign actor to pi_old
            pi.update_weights()

            print('Elapsed time: {:.1f} min'.format((time.time() - start_time) / 60))
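
The script above relies on `losses.ppo_clip` and `advantage_estimation.full_return` without showing what they compute. The following is a minimal NumPy sketch of the standard PPO clipped surrogate loss and of an undiscounted (or optionally discounted) return-to-go. It is not Tianshou's implementation: the function names `ppo_clip_loss` and `full_return_sketch`, their signatures, and the `gamma` parameter are assumptions made for illustration only.

```python
import numpy as np


def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_param=0.2):
    """Standard PPO clipped surrogate, returned as a loss to minimize.

    Hypothetical helper for illustration; not the body of losses.ppo_clip.
    """
    log_prob_new = np.asarray(log_prob_new, dtype=np.float64)
    log_prob_old = np.asarray(log_prob_old, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)

    # Probability ratio pi(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(log_prob_new - log_prob_old)
    # Keep the ratio within [1 - clip_param, 1 + clip_param].
    clipped_ratio = np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param)
    # Pessimistic (element-wise minimum) surrogate objective.
    surrogate = np.minimum(ratio * advantage, clipped_ratio * advantage)
    # Negate because optimizers minimize while PPO maximizes the surrogate.
    return -np.mean(surrogate)


def full_return_sketch(rewards, gamma=1.0):
    """Return-to-go for one episode: G_t = r_t + gamma * G_{t+1}.

    Assumption: gamma defaults to 1.0 (no discounting); the discounting
    used by advantage_estimation.full_return is not shown in the script.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    # Accumulate from the last step backwards.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With `clip_param = 0.2` as in the script, a sample whose probability ratio has grown to 2 with a positive advantage contributes only `1.2 * advantage` to the objective, which is what limits the size of each policy update between calls to `pi.update_weights()`.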