A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos · 2023 · arXiv 2310.12036

16 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 2 · method: 2

citation-polarity summary

background: 2 · use method: 2

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models

cs.CV · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

CrossVLA develops a surrogate log-probability estimator for DPO on flow-matching VLAs, shows DoRA outperforming LoRA by +10.4 pp mean on LIBERO, and identifies inference bottlenecks with limited caching gains.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Towards Spec Learning: Inference-Time Alignment from Preference Pairs

cs.CL · 2026-06-22 · unverdicted · novelty 6.0 · 2 refs

Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.

Weight-Space Geometry of Offline Reasoning Training

cs.LG · 2026-06-21 · unverdicted · novelty 6.0

Comparative weight-space analysis finds SFT/RFT/RIFT colinear with similar accuracy, DFT more divergent, GRPO partially orthogonal, and DPO near-orthogonal with the highest GSM8K/AIME accuracy while using a 10x smaller learning rate.

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization

cs.RO · 2026-06-03 · unverdicted · novelty 6.0

FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on Llama-3-8B-Instruct.

Response Time Enhances Alignment with Heterogeneous Preferences

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models

cs.LG · 2026-05-04 · conditional · novelty 6.0

Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Constitutional On-Policy Safe Distillation

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.

Failure Modes of Maximum Entropy RLHF

cs.LG · 2025-09-24 · unverdicted · novelty 5.0

Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

cs.CL · 2025-08-06 · unverdicted · novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

cs.AI · 2026-06-22 · unverdicted · novelty 2.0

A comprehensive reference book organizing existing techniques for agentic AI systems across LLM substrate, reasoning, agent design patterns, inter-agent coordination, and production deployment.

citing papers explorer

Showing 12 of 12 citing papers after filters.

CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV · 2026-05-21 · unverdicted · none · ref 1 · 2 links
CrossVLA develops a surrogate log-probability estimator for DPO on flow-matching VLAs, shows DoRA outperforming LoRA by +10.4 pp mean on LIBERO, and identifies inference bottlenecks with limited caching gains.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL · 2026-05-12 · unverdicted · none · ref 136 · 2 links
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG · 2026-05-12 · unverdicted · none · ref 5 · 2 links
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Towards Spec Learning: Inference-Time Alignment from Preference Pairs
cs.CL · 2026-06-22 · unverdicted · none · ref 4 · 2 links
Proposes compiling preference pairs into readable natural-language specifications for inference-time LLM alignment, claiming outperformance over DPO on dense-preference domains.

Weight-Space Geometry of Offline Reasoning Training
cs.LG · 2026-06-21 · unverdicted · none · ref 14
Comparative weight-space analysis finds SFT/RFT/RIFT colinear with similar accuracy, DFT more divergent, GRPO partially orthogonal, and DPO near-orthogonal with the highest GSM8K/AIME accuracy while using a 10x smaller learning rate.

FlowPRO: Reward-Free Reinforced Fine-Tuning of Flow-Matching VLAs via Proximalized Preference Optimization
cs.RO · 2026-06-03 · unverdicted · none · ref 29
FlowPRO applies proximalized preference optimization to flow-matching VLAs with intervention-rollback data to reach higher success rates on long-horizon bimanual tasks without rewards or critics.

AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates
cs.CL · 2026-05-27 · unverdicted · none · ref 2
AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on Llama-3-8B-Instruct.

Response Time Enhances Alignment with Heterogeneous Preferences
cs.LG · 2026-05-07 · unverdicted · none · ref 4
Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.

Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
cs.LG · 2026-05-04 · conditional · none · ref 1
Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.

Constitutional On-Policy Safe Distillation
cs.LG · 2026-06-02 · unverdicted · none · ref 2
COPSD uses a Cross-SFT cold-start followed by constitution-conditioned distillation to achieve stronger safety-helpfulness balance and lower safety tax on reasoning than prior on-policy self-distillation methods.

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning
cs.CL · 2026-05-12 · unverdicted · none · ref 22
YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.

The Hitchhiker's Guide to Agentic AI: From Foundations to Systems
cs.AI · 2026-06-22 · unverdicted · none · ref 13
A comprehensive reference book organizing existing techniques for agentic AI systems across LLM substrate, reasoning, agent design patterns, inter-agent coordination, and production deployment.
