New method training_step, which
* collects training data (method _collect_training_data)
* performs "test in train" (method _test_in_train)
* performs policy update
The old method train_step performed only the first two steps
and has now been split into two separate methods (_collect_training_data and _test_in_train)
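
The call order described above can be sketched as follows. This is only an illustration, not the actual implementation: the class SketchTrainer, the method name _update_policy, all signatures, and the returned values are assumptions. Only the names training_step, _collect_training_data and _test_in_train come from this entry. Whether the real policy update is skipped when "test in train" signals a stop is not stated here, so the sketch simply runs all three steps in order.

```python
class SketchTrainer:
    """Hypothetical stand-in for the trainer; only the call order matters."""

    def __init__(self) -> None:
        # Records the order in which the sub-steps run.
        self.calls: list[str] = []

    def _collect_training_data(self) -> dict:
        self.calls.append("collect")
        return {"n_steps": 10}  # placeholder statistics, not real output

    def _test_in_train(self, collect_stats: dict) -> bool:
        self.calls.append("test_in_train")
        return False  # placeholder: never signals a stop

    def _update_policy(self, collect_stats: dict) -> dict:
        # Hypothetical name; the entry only says "performs policy update".
        self.calls.append("update")
        return {"loss": 0.0}  # placeholder statistics

    def training_step(self) -> dict:
        # 1. collect training data
        collect_stats = self._collect_training_data()
        # 2. perform "test in train"
        should_stop = self._test_in_train(collect_stats)
        # 3. perform the policy update
        update_stats = self._update_policy(collect_stats)
        return {
            "collect": collect_stats,
            "stop": should_stop,
            "update": update_stats,
        }
```

Calling SketchTrainer().training_step() runs the three sub-steps once, in the order listed in the bullets.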