GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (3)
Representative citing papers
Prefix Sampling replays self-generated trajectory prefixes to control rollout pass rates near 50% in binary-reward RL, delivering wall-clock speedups and modest performance gains on SWE-bench Verified and AIME tasks.
Reward-to-go arises directly from decomposing the policy gradient objective over prefix trajectories, recovering the causality argument as a corollary rather than a post-hoc rule.
citing papers explorer
On the "Causality" Step in Policy Gradient Derivations: A Pedagogical Reconciliation of Full Return and Reward-to-Go
Reward-to-go arises directly from decomposing the policy gradient objective over prefix trajectories, recovering the causality argument as a corollary rather than a post-hoc rule.
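The prefix decomposition described in this summary can be sketched with the standard finite-horizon argument. This is an illustrative reconstruction, not necessarily the cited paper's own derivation. The notation is assumed here: an undiscounted horizon $T$, per-step reward $r_t = r(s_t, a_t)$, policy $\pi_\theta$, and prefix trajectory $\tau_{0:t} = (s_0, a_0, \dots, s_t, a_t)$. Because $r_t$ depends only on the prefix $\tau_{0:t}$, the objective splits into one expectation per prefix, and the score function of each prefix contains only the actions up to time $t$.

```latex
\begin{align}
J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\!\left[ r_t \right], \\
\nabla_\theta J(\theta)
  &= \sum_{t=0}^{T} \mathbb{E}_{\tau_{0:t} \sim p_\theta}\!\left[
       r_t \sum_{k=0}^{t} \nabla_\theta \log \pi_\theta(a_k \mid s_k) \right] \\
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[
       \sum_{k=0}^{T} \nabla_\theta \log \pi_\theta(a_k \mid s_k)
       \sum_{t=k}^{T} r_t \right].
\end{align}
```

The last line swaps the order of the two sums, so each action's score is weighted only by the rewards from time $k$ onward, which is the reward-to-go. Under these assumptions no separate causality step is needed, since rewards earlier than $k$ never appear alongside the score of $a_k$ in the first place.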