ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
hub
A general theoretical paradigm to understand learning from human preferences.arXiv preprint arXiv:2310.12036
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
CrossVLA develops a surrogate log-probability estimator for DPO on flow-matching VLAs, shows DoRA outperforming LoRA by +10.4 pp mean on LIBERO, and identifies inference bottlenecks with limited caching gains.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.
AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on Llama-3-8B-Instruct.
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.
Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.
Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
citing papers explorer
No citing papers match the current filters.