# TODO:
Notice that we will separate the actor and the critic: the batch will collect data for optimizing the policy, while the replay will collect data for optimizing the critic.
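Below is a minimal, runnable sketch of that data flow. The `Actor`/`Critic` classes and the plain lists standing in for the batch and the replay are placeholders for illustration only, not names fixed by the codebase.

```python
import random

class Actor:
    """Placeholder policy; the real actor will be a neural network."""
    def act(self, obs):
        return random.choice([0, 1])

    def optimize(self, on_policy_batch):
        pass  # a policy-gradient / PPO step on the fresh batch would go here

class Critic:
    """Placeholder value function."""
    def optimize(self, replayed_data):
        pass  # a value-regression / TD step on replayed data would go here

actor, critic = Actor(), Critic()
batch, replay = [], []                       # stand-ins for Batch / Replay

for t in range(100):
    transition = {"obs": t, "act": actor.act(t), "rew": 1.0}
    batch.append(transition)                 # fresh data  -> policy update
    replay.append(transition)                # stored data -> critic update

actor.optimize(batch)                        # batch optimizes the policy
critic.optimize(random.sample(replay, 32))   # replay optimizes the critic
```

The important point is that the two optimizers never share a data path: the batch is consumed by the actor, the replay by the critic.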
# Batch
YouQiaoben

Fix as stated in `ppo_example.py`.
# Replay
ShihongSong

Write a `Replay.py` file; it must have `collect()` and `next_batch()` methods for training.

Integrate the previous `ReplayBuffer` code.
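A possible sketch of `Replay.py`, assuming a fixed-capacity ring buffer in the spirit of the previous `ReplayBuffer` code; the capacity, storage format, and sampling scheme below are illustrative, not decided.

```python
# Replay.py -- sketch only; the real integration with the old ReplayBuffer
# code may end up looking different.
import random

class Replay:
    def __init__(self, capacity=10000):
        self._storage = []
        self._capacity = capacity
        self._index = 0          # next slot to overwrite once the buffer is full

    def collect(self, transition):
        """Store one transition, e.g. an (obs, act, rew, done, obs_next) dict,
        overwriting the oldest entry once the buffer is full."""
        if len(self._storage) < self._capacity:
            self._storage.append(transition)
        else:
            self._storage[self._index] = transition
        self._index = (self._index + 1) % self._capacity

    def next_batch(self, batch_size):
        """Return a uniformly sampled minibatch for one training step."""
        return random.sample(self._storage, min(batch_size, len(self._storage)))
```

With this interface the critic's training loop only needs `replay.collect(transition)` while stepping the environment and `replay.next_batch(batch_size)` at each gradient step.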
# adv_estimate
YouQiaoben (`gae_lambda`), ShihongSong (`dqn`, after `policy.DQN`)

These seem to be direct Python functions; they could also be written in a functional form.
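For `gae_lambda`, a direct functional version might look like the sketch below. It implements the standard GAE(lambda) recursion; the signature and argument names are illustrative only.

```python
import numpy as np

def gae_lambda(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation as a plain function.

    All arguments are 1-D arrays of length T (one rollout).  Implements
        delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        A_t     = delta_t + gamma * lam * (1 - done_t) * A_{t+1}
    and returns the advantages A_0 .. A_{T-1}.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    next_values = np.asarray(next_values, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)

    deltas = rewards + gamma * next_values * (1.0 - dones) - values
    advantages = np.zeros_like(deltas)
    gae = 0.0
    for t in reversed(range(len(deltas))):   # backward recursion over the rollout
        gae = deltas[t] + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
    return advantages
```

Keeping the estimator stateless like this makes the functional form straightforward; a DQN-style estimator could expose a similar signature and be swapped in.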