Disentangling Length from Quality in Direct Preference Optimization
3 Pith papers cite this work. Polarity classification is still being indexed.
Verdicts: 3 unverdicted
citing papers explorer
- Debiasing Reward Models via Causally Motivated Inference-Time Intervention
  Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
- Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization
  A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.
- Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
  DIR applies an information bottleneck to reward model training to mitigate complex inductive biases such as length, sycophancy, and format, with claimed improvements in RLHF generalization.