Reward-to-go arises directly from decomposing the policy gradient objective over prefix trajectories, recovering the causality argument as a corollary rather than a post-hoc rule.
Documentation resource
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: ACCEPT (1)
Representative citing papers
On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
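The claim that reward-to-go falls out of a prefix-trajectory decomposition can be sketched in a few lines. This is only an illustrative outline under assumed notation, not the citing paper's own derivation: a finite horizon T, undiscounted rewards r_t = r(s_t, a_t), a policy \pi_\theta, and the prefix \tau_{0:t} = (s_0, a_0, \dots, s_t, a_t) are all assumptions introduced here.

```latex
% Assumed setting: finite horizon T, undiscounted rewards, prefix tau_{0:t}.
\begin{align}
J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\left[ r_t \right]
  && \text{($r_t$ depends only on the prefix $\tau_{0:t}$)} \\
\nabla_\theta J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\left[ r_t
     \sum_{t'=0}^{t} \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'}) \right]
  && \text{(log-derivative trick on each prefix)} \\
  &= \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t'=0}^{T}
     \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'})
     \sum_{t=t'}^{T} r_t \right]
  && \text{(swap the order of summation)}
\end{align}
```

Because the prefix distribution of \tau_{0:t} depends on \theta only through the policy factors for t' \le t, no score term for a later action ever multiplies r_t. The inner sum \sum_{t=t'}^{T} r_t is the reward-to-go, so the causality statement appears as a consequence of the decomposition rather than as a separate step.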