ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
super hub Canonical reference
Fine-Tuning Language Models from Human Preferences
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentimen
authors
co-cited works
representative citing papers
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
PPR-GDE is a new RL approach that integrates pairwise preference rewards with group-based diversity enhancement in a unified objective to improve both alignment quality and expressive diversity in open-ended generation tasks such as role-playing.
QuantEvolver applies reinforcement fine-tuning to evolve an LLM policy for generating executable alpha factor expressions, yielding higher-quality and more complementary factors than prompt-based baselines on market benchmarks.
Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment performance.
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
For convex losses in nested evolving feasible sets, a lazy algorithm balances O(T^{1-β}) regret with O(T^β) movement for any β; for strongly convex or sharp losses, Frugal achieves zero regret with O(log T) movement, shown optimal by matching lower bound.
RL on binary rewards boosts LLM factual recall by ~27% relative across models by redistributing probability mass to latent correct answers rather than acquiring new knowledge.
The paper establishes the first O(log T) regret and O(1/T) sub-optimality bounds for online RLHF under general f-divergence regularization via two sampling algorithms.
Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
TraceLift trains reasoning planners with executor-grounded rewards that multiply a rubric-based reasoning quality score by measured performance uplift on a frozen executor, outperforming outcome-only training on math and code benchmarks.
Preference poisoning against log-linear DPO reduces to a binary sparse approximation problem solved by lattice-reduction (BAL-A) and matching-pursuit (BMP-A) algorithms that carry recovery guarantees.
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
The work establishes a regret lower bound of Ω(T^{2/3} min(K,D)^{1/3}) for fair multi-user dueling bandits with heterogeneous Condorcet winners and gives algorithms achieving matching upper bounds up to logs.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.
R-CAI inverts constitutional AI to automatically generate diverse toxic data for LLM red teaming, with probability clamping improving output coherence by 15% while preserving adversarial strength.
Reinforcement learning with a multi-part reward teaches LLMs to output independent, meaning-preserving sentence edits that raise argument appropriateness close to full rewriting.
E2E-REME outperforms nine LLMs in accuracy and efficiency for end-to-end microservice remediation by using experience-simulation reinforcement fine-tuning on a new benchmark called MicroRemed.
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
citing papers explorer
-
SOD: Step-wise On-policy Distillation for Small Language Model Agents
SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
-
A Survey on Multimodal Large Language Models
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.