Hi Yuanming,
Thanks for releasing the code of this wonderful project!

I have a question about the value network. In `net.py`, `new_value` is predicted by observing `fake_output` and `new_states`. Let `s_t` denote `fake_input`; then `fake_output` is `s_{t+1}`. The `new_states` contain the action `a_t` that transfers `s_t` to `s_{t+1}`. Therefore, it seems the code is predicting `Q(s_t, a_{t-1})` and `Q(s_{t+1}, a_t)` rather than `Q(s_t, a_t)` and `Q(s_{t+1}, a_{t+1})`. If so, I am confused about how the policy gradients are calculated (e.g., Eqn. (7) in the paper). I might have gotten something wrong. I'd appreciate it if you could help me clarify this question. Thanks!

Yu Ke
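For concreteness, here is a minimal, hypothetical sketch of the pairing described above. The `value_net` module, the tensor shapes, and the PyTorch framing are illustrative assumptions, not the project's actual code; the point is only that the critic is fed `s_{t+1}` together with the state that already encodes `a_t`, whereas the actor-critic policy gradient weights `∇_θ log π_θ(a_t | s_t)` by an estimate of `Q(s_t, a_t)`.

```python
# Hypothetical illustration only -- `value_net` and the shapes below are made up
# to show the state/action pairing in question, not copied from net.py.
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Toy critic that scores an (image, state) pair."""

    def __init__(self, img_dim: int, state_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim + state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, img: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.cat([img, state], dim=-1))


value_net = ValueNet(img_dim=128, state_dim=8)

fake_input = torch.randn(4, 128)   # s_t
fake_output = torch.randn(4, 128)  # s_{t+1}, the result of applying a_t to s_t
new_states = torch.randn(4, 8)     # post-step state, which encodes a_t

# What the question describes: the critic sees s_{t+1} together with a_t,
# i.e. something like Q(s_{t+1}, a_t).
new_value = value_net(fake_output, new_states)

# What Eqn. (7) seems to call for: the pre-step observation paired with the
# action just taken, i.e. Q(s_t, a_t).
expected_value = value_net(fake_input, new_states)
```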