Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
arXiv preprint arXiv:2502.18770 , year=
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issue rather than loss scaling.
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
citing papers explorer
-
Variance-aware Reward Modeling with Anchor Guidance
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
-
Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
LPA uses fewer than 100 personality trait statements to train LLMs for harmlessness, matching the robustness of methods using 150k+ harmful examples while generalizing better to new attacks.
-
Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning
Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
-
Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issue rather than loss scaling.
-
Reward Hacking in Rubric-Based Reinforcement Learning
Rubric-based RL verifiers can be gamed via partial criterion satisfaction and implicit-to-explicit tricks, yielding proxy gains that do not improve quality under rubric-free judges; stronger verifiers reduce but do not eliminate the mismatch.
-
Optimal Transport for LLM Reward Modeling from Noisy Preference
SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.
-
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
-
From System 1 to System 2: A Survey of Reasoning Large Language Models
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.