arXiv preprint arXiv:2310.05199

Loose lips sink ships: Mitigating length bias in reinforcement learning from human feedback · 2023 · arXiv 2310.05199

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

cs.CL · 2025-07-21 · unverdicted · novelty 6.0

CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

Exploring the Secondary Risks of Large Language Models

cs.LG · 2025-06-14 · unverdicted · novelty 6.0

Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

citing papers explorer

Showing 3 of 3 citing papers.

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning cs.CL · 2025-07-21 · unverdicted · none · ref 18
CoLD mitigates length bias in process reward models for mathematical reasoning via counterfactual guidance, length penalties, bias estimation, and joint training, improving step selection accuracy and conciseness on MATH500 and GSM-Plus while boosting downstream RL performance.
Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling cs.CL · 2025-07-08 · unverdicted · none · ref 29
REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.
Exploring the Secondary Risks of Large Language Models cs.LG · 2025-06-14 · unverdicted · none · ref 40
Introduces secondary risks as a new class of LLM failures from benign prompts, defines two primitives, proposes SecLens search framework, and releases SecRiskBench showing risks are widespread across 16 models.

arXiv preprint arXiv:2310.05199

fields

years

verdicts

representative citing papers

citing papers explorer