Reward shaping to mitigate reward hacking in RLHF

Fu, J · 2025 · arXiv 2502.18770

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

cs.SE · 2026-05-06 · unverdicted · novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

Uncertainty-Aware Reward Modeling for Stable RLHF

cs.LG · 2026-06-18 · unverdicted · novelty 6.0

UARM equips reward models with quantile-based conformal prediction uncertainty and reweights GRPO advantages via heteroscedastic variance decomposition to improve calibration and reduce reward hacking in RLHF.

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

cs.CL · 2026-06-09 · conditional · novelty 6.0

CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

GraphAE builds graphs from RM hidden-state similarities among sampled responses and propagates advantages to improve RLHF sample efficiency.

General Preference Reinforcement Learning

cs.LG · 2026-05-18 · unverdicted · novelty 6.0 · 3 refs

GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.

Reward Hacking in Rubric-Based Reinforcement Learning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.

Optimal Transport for LLM Reward Modeling from Noisy Preference

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.

Factored Causal Representation Learning for Robust Reward Modeling in RLHF

cs.LG · 2026-01-29 · unverdicted · novelty 6.0

A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.

Supervising the search process produces reliable and generalizable information-seeking agents

cs.CL · 2025-02-19 · unverdicted · novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

cs.LG · 2026-04-23 · unverdicted · novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

Self-Rewarding Vision-Language Model via Reasoning Decomposition

cs.CV · 2025-08-27 · unverdicted · novelty 5.0

Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.

From System 1 to System 2: A Survey of Reasoning Large Language Models

cs.AI · 2025-02-24 · accept · novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

cs.LG · 2026-04-19

citing papers explorer

Showing 17 of 17 citing papers.

Variance-aware Reward Modeling with Anchor Guidance
stat.ML · 2026-05-12 · unverdicted · none · ref 49
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
cs.AI · 2026-05-08 · unverdicted · none · ref 5
Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
cs.SE · 2026-05-06 · unverdicted · none · ref 10
Uncertainty-Aware Reward Modeling for Stable RLHF
cs.LG · 2026-06-18 · unverdicted · none · ref 8
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
cs.CL · 2026-06-16 · unverdicted · none · ref 18
Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It
cs.CL · 2026-06-09 · conditional · none · ref 18
Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output
cs.LG · 2026-06-09 · unverdicted · none · ref 32
General Preference Reinforcement Learning
cs.LG · 2026-05-18 · unverdicted · none · ref 21 · 3 links
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
cs.LG · 2026-05-14 · unverdicted · none · ref 51
Reward Hacking in Rubric-Based Reinforcement Learning
cs.AI · 2026-05-12 · unverdicted · none · ref 9
Optimal Transport for LLM Reward Modeling from Noisy Preference
cs.LG · 2026-05-07 · unverdicted · none · ref 260
Factored Causal Representation Learning for Robust Reward Modeling in RLHF
cs.LG · 2026-01-29 · unverdicted · none · ref 11
Supervising the search process produces reliable and generalizable information-seeking agents
cs.CL · 2025-02-19 · unverdicted · none · ref 16
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG · 2026-04-23 · unverdicted · none · ref 31
Self-Rewarding Vision-Language Model via Reasoning Decomposition
cs.CV · 2025-08-27 · unverdicted · none · ref 7
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI · 2025-02-24 · accept · none · ref 170
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
cs.LG · 2026-04-19 · unreviewed · ref 18
