Spurious Rewards: Rethinking Training Signals in RLVR. arXiv preprint arXiv:2506.10947
16 Pith papers cite this work.
Representative citing papers
The paper presents the first systematic taxonomy and hierarchical framework for data-efficient reinforcement learning post-training of large language models, organized across data-centric, training-centric, and framework-centric views.
Correcting DeepSpeed optimizer and OpenRLHF loss bugs reveals that SFT-then-RL outperforms mixed-policy methods by 3.8-22.2 points on math benchmarks.
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming human-characterized alternatives.
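A minimal sketch of how such a basis might be recovered, assuming one pooled hidden-activation vector per example; the SVD route, the names, and k are our assumptions rather than the paper's exact procedure.

```python
import torch

def skill_basis(activations: torch.Tensor, k: int = 16):
    """Recover an orthogonal 'skill' basis from hidden activations via SVD and
    score examples by how strongly they load on each direction. Illustrative
    only: names, pooling, and k are assumptions, not the paper's method.
    activations: (n_examples, d_hidden), one pooled vector per example.
    """
    X = activations - activations.mean(dim=0, keepdim=True)  # center per feature
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    basis = Vh[:k]                # (k, d) orthonormal candidate skill directions
    loadings = X @ basis.T        # (n, k) per-example scores for data selection
    return basis, loadings
```

Targeted data selection would presumably pick examples with large loadings on under-trained directions; the loadings matrix is the natural handle for that.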
Citing papers explorer
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
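As a concrete reading of the projection-residual idea, the sketch below spans a "shared semantics" subspace with the top singular directions of positive-response embeddings and keeps only the component of negative-response embeddings outside it; the function name, tensor shapes, and k are our assumptions, not ResRL's exact recipe.

```python
import torch

def projection_residual(pos_embs: torch.Tensor, neg_embs: torch.Tensor, k: int = 8):
    """Subtract the semantics shared with positive responses from negative ones.
    pos_embs: (n_pos, d), neg_embs: (n_neg, d). k retained singular directions
    is an assumed hyperparameter.
    """
    # Top-k left singular vectors of the positive embeddings (columns = responses)
    # span the shared-semantics subspace.
    U, _, _ = torch.linalg.svd(pos_embs.T, full_matrices=False)  # U: (d, r)
    basis = U[:, :k]                                             # (d, k)
    # Residual = negatives minus their projection onto the shared subspace;
    # this negative-specific component is what would drive the RL penalty.
    return neg_embs - neg_embs @ basis @ basis.T
```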
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
-
H\"older Policy Optimisation
HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
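The Hölder (power) mean is simple to state; below is a sketch of the aggregator plus a linear p-annealing schedule, where the endpoints and schedule shape are illustrative assumptions rather than the paper's settings.

```python
import torch

def holder_mean(x: torch.Tensor, p: float, dim: int = -1, eps: float = 1e-8):
    """Hölder mean M_p(x) = (mean(x^p))^(1/p) over positive values x.
    p = 1 is the arithmetic mean, p -> 0 the geometric mean, and p -> -inf the
    minimum, so one p interpolates between token-mean and worst-token aggregation.
    """
    x = x.clamp_min(eps)
    if abs(p) < eps:  # geometric mean as the p -> 0 limit
        return x.log().mean(dim=dim).exp()
    return x.pow(p).mean(dim=dim).pow(1.0 / p)

def anneal_p(step: int, total_steps: int, p_start: float = 1.0, p_end: float = -2.0):
    """Linear schedule for dynamic p (endpoints are assumptions)."""
    t = min(step / max(total_steps, 1), 1.0)
    return p_start + t * (p_end - p_start)
```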
-
Automated Reformulation of Robust Optimization via Memory-Augmented Large Language Models
AutoREM augments LLMs with a structured memory of failed reformulation trajectories to improve accuracy and efficiency on robust optimization tasks without parameter updates or expert knowledge.
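As a toy illustration of a memory of failed trajectories, the sketch below stores failed reformulation attempts and recalls the most lexically similar ones to prepend as warnings; the data structure and retrieval rule are our assumptions, not AutoREM's specification.

```python
from dataclasses import dataclass, field

@dataclass
class FailureMemory:
    """Store (problem, failure-summary) pairs from failed reformulation attempts
    and surface similar past failures for the next prompt. Illustrative only.
    """
    entries: list[tuple[str, str]] = field(default_factory=list)

    def add(self, problem: str, failure: str) -> None:
        self.entries.append((problem, failure))

    def recall(self, problem: str, k: int = 3) -> list[str]:
        # Naive word-overlap retrieval; a real system would embed and rank.
        words = set(problem.lower().split())
        ranked = sorted(self.entries,
                        key=lambda e: -len(words & set(e[0].lower().split())))
        return [failure for _, failure in ranked[:k]]
```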
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
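To make outcome-level sharing concrete: a minimal sketch in which scalar rollout rewards from several models are pooled per prompt and each model's advantages are normalized against the pooled statistics, GRPO-style; the names and exact estimator are our assumptions.

```python
import numpy as np

def pooled_grpo_advantages(rewards_by_model: dict[str, list[float]]) -> dict:
    """Normalize each model's rollout rewards for one prompt against the pooled
    group mean and std, so weaker models inherit signal from stronger peers.
    """
    pooled = np.concatenate([np.asarray(r, dtype=float)
                             for r in rewards_by_model.values()])
    mu, sigma = pooled.mean(), pooled.std() + 1e-8
    return {name: (np.asarray(r, dtype=float) - mu) / sigma
            for name, r in rewards_by_model.items()}
```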
-
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
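The closed-form claim is the standard KL-regularized RL solution specialized to self-reward, i.e. reward equal to the model's own log-likelihood; a short derivation under that reading (notation ours):

```latex
% Maximize over distributions q, with base model p and self-reward r(x) = log p(x):
%   J(q) = E_{x~q}[r(x)] - (1/\beta)\, KL(q \,\|\, p)
% The generic optimum q^*(x) \propto p(x)\,e^{\beta r(x)} collapses to a power of p:
\[
  q^*(x) \;\propto\; p(x)\, e^{\beta \log p(x)} \;=\; p(x)^{1+\beta} \;=\; p(x)^{\alpha},
  \qquad \alpha = 1 + \beta .
\]
```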
-
The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling
APPS approximates power targets p(x)^alpha via parallel particle propagation with proposal-corrected reweighting and future-value-guided selection at block boundaries, improving accuracy-runtime trade-offs in training-free decoding.
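A rough, training-free sketch of the particle scheme under standard sequential-Monte-Carlo assumptions: the model is its own proposal, so the importance weight toward p(x)^alpha contributes (alpha - 1) * log p per token, with optionally value-guided resampling at block boundaries. It assumes a Hugging Face-style causal LM and a user-supplied value_fn; all hyperparameters are illustrative.

```python
import torch

@torch.no_grad()
def particle_power_sampling(model, prompt_ids, alpha=2.0, n_particles=8,
                            block_len=32, n_blocks=4, value_fn=None):
    """Approximate a draw from p(x)^alpha with parallel particles.
    prompt_ids: (T,) token ids; value_fn (optional) maps (P, T') ids to a
    (P,) future-value score used to guide resampling.
    """
    parts = prompt_ids.unsqueeze(0).repeat(n_particles, 1)       # (P, T)
    logw = torch.zeros(n_particles)
    for b in range(n_blocks):
        for _ in range(block_len):
            logits = model(parts).logits[:, -1, :]               # proposal = p itself
            logp = logits.log_softmax(-1)
            nxt = torch.multinomial(logp.exp(), 1)               # (P, 1)
            step_logp = logp.gather(-1, nxt).squeeze(-1)
            logw += (alpha - 1.0) * step_logp                    # p^alpha / p weight
            parts = torch.cat([parts, nxt], dim=-1)
        if b < n_blocks - 1:                                     # block boundary
            score = logw + (value_fn(parts) if value_fn is not None else 0.0)
            idx = torch.multinomial(score.softmax(0), n_particles, replacement=True)
            parts, logw = parts[idx], torch.zeros(n_particles)   # resample, reset
    return parts[logw.argmax()]                                  # best final particle
```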
-
HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment
HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.
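The aligned quantity is trajectory-level policy entropy; the helper below computes it for a single rollout (the selection and alignment rules themselves are paper-specific and not reproduced here).

```python
import torch

def trajectory_entropy(logits: torch.Tensor, gen_mask: torch.Tensor) -> torch.Tensor:
    """Mean per-token policy entropy over the generated part of one rollout.
    logits: (T, V); gen_mask: (T,) with 1.0 at generated-token positions.
    """
    logp = logits.log_softmax(dim=-1)
    token_entropy = -(logp.exp() * logp).sum(dim=-1)       # (T,)
    return (token_entropy * gen_mask).sum() / gen_mask.sum().clamp_min(1.0)
```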
-
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution
Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.
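The masking mechanism itself is small; a sketch, assuming the dropout is applied to the proposer's next-token logits with an illustrative drop rate:

```python
import torch

def vocabulary_dropout(logits: torch.Tensor, drop_rate: float = 0.1) -> torch.Tensor:
    """Mask a random subset of the vocabulary to -inf before sampling the
    proposer, forcing lexical variety in generated problems. drop_rate is an
    assumed hyperparameter.
    """
    drop = torch.rand(logits.shape[-1], device=logits.device) < drop_rate
    return logits.masked_fill(drop, float("-inf"))
```

Resampling the mask for each generated problem is what keeps the proposer from collapsing onto a fixed phrasebook.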
-
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
GLM-4.5V reaches state-of-the-art results among open-source models of similar size on 42 multimodal benchmarks by applying reinforcement learning with curriculum sampling to a strong vision foundation model.
-
Beyond Distribution Sharpening: The Importance of Task Rewards
Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
-
Placing Puzzle Pieces Where They Matter: A Question Augmentation Framework for Reinforcement Learning
PieceHint strategically scores and injects critical reasoning hints in RL training to let a 1.5B model match 32B baselines on math benchmarks while preserving pass@k diversity.
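One plausible reading of "scores and injects": rank candidate hints by the pass-rate gain they induce when appended to the question, then inject only the highest-scoring ones. The sketch assumes a rollout_fn sampler returning a pass rate; PieceHint's actual criterion may differ.

```python
def score_hints(question: str, hints: list[str], rollout_fn, n_rollouts: int = 8):
    """Score each hint by the pass-rate improvement it yields over the bare
    question. rollout_fn(prompt, n) -> pass rate in [0, 1] is an assumed helper.
    """
    base = rollout_fn(question, n_rollouts)
    return {h: rollout_fn(f"{question}\nHint: {h}", n_rollouts) - base
            for h in hints}
```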