Hi Yuanming,
Thanks for releasing the code of this wonderful project!

I have a question about the value network. In `net.py`, `new_value` is predicted by observing `fake_output` and `new_states`. Let `s_t` denote `fake_input`; then `fake_output` is `s_{t+1}`. The `new_states` contain the action `a_t` that transfers `s_t` to `s_{t+1}`. Therefore, it seems the code is predicting `Q(s_t, a_{t-1})` and `Q(s_{t+1}, a_t)` rather than `Q(s_t, a_t)` and `Q(s_{t+1}, a_{t+1})`. If so, I am confused about how the policy gradients are calculated (e.g., Eqn. (7) in the paper). I might have gotten something wrong. I'd appreciate it if you could help me clarify this question. Thanks!

Yu Ke
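For concreteness, here is a minimal, hypothetical sketch of the pairing described above. The `value_net` module, the tensor shapes, and the PyTorch framing are illustrative assumptions, not the project's actual code; the point is only that the critic is fed `s_{t+1}` together with the state that already encodes `a_t`, whereas the actor-critic policy gradient weights `∇_θ log π_θ(a_t | s_t)` by an estimate of `Q(s_t, a_t)`.

```python
# Hypothetical illustration only -- `value_net` and the shapes below are made up
# to show the state/action pairing in question, not copied from net.py.
import torch
import torch.nn as nn


class ValueNet(nn.Module):
    """Toy critic that scores an (image, state) pair."""

    def __init__(self, img_dim: int, state_dim: int):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(img_dim + state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, img: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        return self.fc(torch.cat([img, state], dim=-1))


value_net = ValueNet(img_dim=128, state_dim=8)

fake_input = torch.randn(4, 128)   # s_t
fake_output = torch.randn(4, 128)  # s_{t+1}, the result of applying a_t to s_t
new_states = torch.randn(4, 8)     # post-step state, which encodes a_t

# What the question describes: the critic sees s_{t+1} together with a_t,
# i.e. something like Q(s_{t+1}, a_t).
new_value = value_net(fake_output, new_states)

# What Eqn. (7) seems to call for: the pre-step observation paired with the
# action just taken, i.e. Q(s_t, a_t).
expected_value = value_net(fake_input, new_states)
```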