Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, Jun Wang · 2024 · arXiv 2404.11999

15 Pith papers cite this work. Polarity classification is still indexing.

15 Pith papers citing it

read on arXiv browse 15 citing papers

citation-role summary

background 3

citation-polarity summary

background 1 support 1 unclear 1

representative citing papers

Flow Reasoning Models: Scaling Reasoning Through Iterative Self-Refinement

cs.AI · 2026-06-28 · conditional · novelty 7.0

Flow models reach 99.2% Sudoku accuracy in 7 passes and 96.1% on out-of-distribution Sudoku-Extreme by selecting dynamically stable candidates and training with self-conditioning plus DPO to avoid failed outputs.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

cs.LG · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.

Step-level Denoising-time Diffusion Alignment with Multiple Objectives

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

MSDDA derives a closed-form optimal reverse denoising distribution for multi-objective diffusion alignment that is exactly equivalent to step-level RL fine-tuning with no approximation error.

RAD-DPO: Robust Adaptive Denoising Direct Preference Optimization for Generative Retrieval in E-commerce

cs.IR · 2026-02-27 · unverdicted · novelty 7.0

RAD-DPO adds token-level gradient detachment, similarity-based dynamic reward weighting, and a multi-label global contrastive objective to DPO for better handling of hierarchical Semantic IDs and noisy feedback in e-commerce generative retrieval.

DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

cs.CV · 2026-06-06 · unverdicted · novelty 6.0

DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

TIPO applies preference-intensity weighting and padding gating to stabilize preference optimization for privacy personalization in mobile GUI agents, yielding higher alignment and distinction metrics than prior methods.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.

UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types

cs.LG · 2024-08-27 · unverdicted · novelty 6.0

UNA unifies binary, pairwise, and score-based feedback for LLM alignment via a generalized implicit reward function shown optimal by the log sum inequality.

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

cs.LG · 2026-02-05 · unverdicted · novelty 5.0 · 2 refs

PEPO is a single-step pessimistic ensemble algorithm for direct preference optimization that provably avoids over-optimization by depending only on single-policy concentrability without knowing the data distribution or learning an explicit reward model.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

cs.CL · 2025-08-06 · unverdicted · novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

cs.CV · 2026-06-10 · unverdicted · novelty 4.0

Proposes bidirectional token-wise KL regularizer and visual-contrastive grounding objective to create fine-grained on-policy preference pairs for medical LVLMs by minimally editing model outputs.

LLM Harms: A Taxonomy and Discussion

cs.CY · 2025-12-05

From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations

cs.CL · 2025-07-07

citing papers explorer

Showing 4 of 4 citing papers after filters.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 8 · 2 links
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 58
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 41
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations cs.CL · 2025-07-07 · unreviewed · ref 55

Token- level direct preference optimization.arXiv preprint arXiv:2404.11999, 2024

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer