TODO:

Notice that we will separate actor and critic, and batch will collect data for optimizing policy while replay will collect data for optimizing critic.

Batch

YouQiaoben

fix as stated in ppo_example.py

Replay

ShihongSong

a Replay.py file. must have collect() and next_batch() methods for training.

integrate previous ReplayBuffer codes.

adv_estimate

YouQiaoben (gae_lambda), ShihongSong(dqn after policy.DQN)

seems to be direct python functions. also may write it in a functional form.

536 B Raw Blame History

TODO:

Batch

Replay

adv_estimate

536 B

Raw Blame History