EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
R e FT : Reasoning with reinforced fine-tuning
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
background 2polarities
background 2representative citing papers
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific
Faithful GRPO cuts inconsistent chain-of-thought reasoning in visual spatial tasks from 24.5% to 1.7% while raising visual grounding scores by 13% and final answer accuracy on seven benchmarks.
citing papers explorer
-
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or GRPO at every step.
-
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.
-
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training
Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific
-
Faithful GRPO: Improving Visual Spatial Reasoning in Multimodal Language Models via Constrained Policy Optimization
Faithful GRPO cuts inconsistent chain-of-thought reasoning in visual spatial tasks from 24.5% to 1.7% while raising visual grounding scores by 13% and final answer accuracy on seven benchmarks.