# TODO:

Note that we will separate the actor and the critic: Batch will collect data for optimizing the policy, while Replay will collect data for optimizing the critic.

# Batch

YouQiaoben

Fix as stated in `ppo_example.py`.



# Replay

ShihongSong

Write a `Replay.py` file. It must provide `collect()` and `next_batch()` methods for training.

Integrate the previous ReplayBuffer code.
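A rough sketch of what the `collect()` / `next_batch()` interface could look like. Only the two method names come from this note. The constructor arguments, the transition fields, the dict return format, and uniform sampling are all assumptions made for illustration, and the sketch does not reflect the previous ReplayBuffer code.

```python
import random
from collections import deque


class Replay:
    """Hypothetical minimal replay memory; not the project's actual design."""

    # Assumed transition layout; the real fields may differ.
    KEYS = ("obs", "action", "reward", "next_obs", "done")

    def __init__(self, capacity=10000, seed=None):
        # A bounded deque drops the oldest transition once capacity is reached.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def collect(self, obs, action, reward, next_obs, done):
        """Store one transition (assumed signature)."""
        self._storage.append((obs, action, reward, next_obs, done))

    def next_batch(self, batch_size):
        """Return a uniformly sampled minibatch as a dict of lists."""
        n = min(batch_size, len(self._storage))
        samples = self._rng.sample(list(self._storage), n)
        return {k: [t[i] for t in samples] for i, k in enumerate(self.KEYS)}

    def __len__(self):
        return len(self._storage)
```

Usage would be along the lines of calling `replay.collect(obs, action, reward, next_obs, done)` after every environment step and `replay.next_batch(32)` once per critic update.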


# adv_estimate

YouQiaoben (gae_lambda), ShihongSong (dqn after policy.DQN)

These seem to be plain Python functions. They may also be written in a functional form.
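As an illustration of the "plain Python function" option, here is a minimal sketch of generalized advantage estimation over a single trajectory. The name `gae_lambda` comes from this note. The signature, the default values, and the use of plain lists are assumptions, not a settled interface.

```python
def gae_lambda(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Compute GAE(lambda) advantages for one trajectory (assumed signature).

    rewards, values, dones: equal-length sequences indexed by time step.
    last_value: value estimate for the state after the final step,
                used to bootstrap when the trajectory is cut off early.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the next step.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

With `lam=1` and zero value estimates this reduces to the discounted return-to-go, and with `lam=0` it reduces to the one-step TD error, which gives two easy sanity checks for whichever form is finally adopted.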