11 Commits

Author SHA1 Message Date
haoshengzou
498b55c051 ppo with batch also works! ppo now improves steadily; dqn is not so stable. 2018-03-10 17:30:11 +08:00
haoshengzou
92894d3853 working on off-policy test. other parts of dqn_replay are runnable, but performance is not tested. 2018-03-09 15:07:14 +08:00
haoshengzou
675057c6b9 interfaces for advantage_estimation; full_return is finished and tested. 2018-02-27 14:11:52 +08:00
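Judging by the name, full_return computes the full discounted return-to-go for every timestep of an episode. A minimal sketch of that computation, not the repository's actual implementation (the signature and gamma default are assumptions):

```python
import numpy as np

def full_return(rewards, gamma=0.99):
    # Discounted return-to-go for one episode: G_t = r_t + gamma * G_{t+1}.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(full_return([1.0, 1.0, 1.0], gamma=0.9))  # -> [2.71, 1.9, 1.0]
```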
Dong Yan
f3aee448e0 add option to show the running result of cartpole 2018-02-24 10:53:39 +08:00
Dong Yan
2163d18728 fix the env -> self._env bug 2018-02-10 03:42:00 +08:00
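The bug fixed here is the common pattern of a method reading a module-level or captured env instead of the instance attribute self._env. A minimal illustration with a hypothetical class:

```python
class Collector:  # hypothetical class; the actual repository code differs
    def __init__(self, env):
        self._env = env

    def reset(self):
        # before the fix: obs = env.reset()  # refers to a global/captured env
        obs = self._env.reset()              # after: use the instance's own env
        return obs
```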
haoshengzou
9f96cc2461 finish design and running of ppo and actor-critic. the advantage estimation module is not complete yet. 2018-01-17 14:21:50 +08:00
haoshengzou
ed25bf7586 fixed the bugs from Jan 14 that gave inferior or even no improvement: group_ndims was mistaken. policy will soon need refactoring. 2018-01-17 11:55:51 +08:00
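group_ndims is a ZhuSuan-style distribution argument: it sets how many trailing dimensions of the log-probability are summed into one event. Mistaking it yields per-dimension instead of per-action log-likelihoods, which plausibly explains the "inferior or even no improvement" above. A NumPy illustration of the difference (all shapes hypothetical):

```python
import numpy as np

def diag_gaussian_log_prob(x, mean, std):
    # elementwise log-density of a diagonal Gaussian; shape [batch, action_dim]
    return -0.5 * np.log(2 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2

x = np.zeros((4, 2)); mean = np.zeros((4, 2)); std = np.ones((4, 2))
per_dim = diag_gaussian_log_prob(x, mean, std)  # group_ndims=0: shape [4, 2]
per_action = per_dim.sum(axis=-1)               # group_ndims=1: shape [4]
# Feeding per_dim where the surrogate loss expects per_action miscales
# the gradient, matching the symptom described in the commit.
```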
haoshengzou
983cd36074 finished all ppo examples. Training is remarkably slower than the version before Jan 13. More strangely, in the gym example there is almost no improvement... but this problem takes lower priority than design. I'll write actor-critic first. 2018-01-15 00:03:06 +08:00
haoshengzou
fed3bf2a12 auto target network. ppo_cartpole.py runs OK, but results are different from the previous version even with the same random seed; still needs debugging. 2018-01-14 20:58:28 +08:00
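"auto target network" presumably means the target network is created and synchronized automatically rather than by hand. A minimal TF1-style sketch of the synchronization op; the variable-scope names are assumptions, not the repository's actual ones:

```python
import tensorflow as tf  # TF1 graph mode, matching the codebase of that era

# Assume the online and target networks were built under these variable scopes.
online_vars = sorted(
    tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='q_net'),
    key=lambda v: v.name)
target_vars = sorted(
    tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='target_q_net'),
    key=lambda v: v.name)

# One op that copies every online weight into the matching target weight;
# run sess.run(sync_target) periodically, e.g. every N training steps.
sync_target = tf.group(*[tf.assign(t, o) for t, o in zip(target_vars, online_vars)])
```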
haoshengzou
dfcea74fcf fix memory growth and slowness caused by sess.run(tf.multinomial()); ppo examples now work OK with slight memory growth (~1 MB/min), which still needs investigation 2018-01-03 20:32:05 +08:00
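The memory growth is the classic TF1 pitfall: calling tf.multinomial() inside the interaction loop adds a new op to the graph on every step, so the graph (and memory) grows without bound. A minimal sketch of the pitfall and the fix (placeholder names and num_actions are assumptions):

```python
import tensorflow as tf  # TF1 graph mode

num_actions = 2  # e.g. CartPole; value assumed for illustration
logits_ph = tf.placeholder(tf.float32, shape=[None, num_actions])

# Pitfall: tf.multinomial() inside the loop adds a new op to the graph each step:
#   action = sess.run(tf.multinomial(logits_ph, 1), ...)  # graph grows every call
# Fix: build the sampling op once, then only run it inside the loop.
sample_op = tf.multinomial(logits_ph, num_samples=1)

def sample_action(sess, logits):
    # logits: np.ndarray of shape [batch, num_actions]
    return sess.run(sample_op, feed_dict={logits_ph: logits})
```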
haoshengzou
4333ee5d39 ppo_cartpole.py seems to work with params bs=128, num_ep=20, max_time=500; manually merged Normal from branch policy_wrapper 2018-01-02 19:40:37 +08:00