{"total":11,"items":[{"citing_arxiv_id":"2605.28440","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates","primary_cat":"cs.CL","submitted_at":"2026-05-27T13:05:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on Llama-3-8B-Instruct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21854","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-05-21T01:02:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12288","ref_index":136,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:44:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11906","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical 
Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-12T10:18:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"YFPO augments standard preference optimization with neuron-level activation margins from math-related features to improve LLM reasoning on math tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11726","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:09:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Johnson, J. Ho, D. Tarlow, and R. van den Berg. Structured Denoising Diffusion Models in Discrete State-Spaces. arXiv preprint arXiv:2107.03006, 2023. [4] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732, 2021. [5] M. G. Azar, M. Rowland, B. Piot, D. Guo, D. Calandriello, M. Valko, and R. Munos. A General Theoretical Paradigm to Understand Learning from Human Preferences. arXiv preprint arXiv:2310.12036, 2023. [6] A. Baheti, X. Lu, F. Brahman, R. L. Bras, M. Sap, and M. Riedl. 
Leftover Lunch: Advantage-Based Offline Reinforcement Learning for Language Models."},{"citing_arxiv_id":"2605.06987","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Response Time Enhances Alignment with Heterogeneous Preferences","primary_cat":"cs.LG","submitted_at":"2026-05-07T22:05:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Response times modeled as drift-diffusion processes enable consistent estimation of population-average preferences from heterogeneous anonymous binary choices.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"sistent estimator of b. To this end, we use the empirical Laplace transform of the response times, given by $\\hat{L}_n(\\lambda) := \\frac{1}{n} \\sum_{i=1}^{n} e^{-\\lambda t_i}$. (3) To see why this is useful, let us consider for a moment the homogeneous case with fixed drift $V = v$. Note that, as $n \\to \\infty$, the empirical Laplace transform converges to its population counterpart, $L_b(\\lambda) := \\mathbb{E}[e^{-\\lambda T}]$, which is given by [see Echenique et al., 2025, Lemma 4]: $L_b(\\lambda) = \\frac{\\cosh(bv)}{\\cosh\\left(b\\sqrt{2\\lambda + v^2}\\right)}$. (4) Consequently, as $\\lambda \\to \\infty$, we see that $-\\frac{\\log L_b(\\lambda)}{\\sqrt{2\\lambda}}$ converges to $b$, allowing us to recover the boundary parameter. The next theorem formalizes this idea while accounting for raters' heterogeneity in $v$. Theorem 2. Assume $|V| \\leq M$ almost surely. 
Let $(\\lambda_n)_{n \\geq 1}$ be any deterministic sequence with $\\lambda_n \\to \\infty$ and $\\sqrt{\\lambda_n} = o(\\log n)$, (5)"},{"citing_arxiv_id":"2605.02626","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-04T14:15:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20265","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Failure Modes of Maximum Entropy RLHF","primary_cat":"cs.LG","submitted_at":"2025-09-24T15:52:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Derives SimPO from MaxEnt RL and reports that MaxEnt RL in online RLHF exhibits frequent overoptimization and unstable KL dynamics across scales, unlike stable KL-constrained baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.04149","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap","primary_cat":"cs.CL","submitted_at":"2025-08-06T07:24:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the 
data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.01456","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Process Reinforcement through Implicit Rewards","primary_cat":"cs.LG","submitted_at":"2025-02-03T15:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"tby subtracting the baseline; (3) Calculate the discounted return for each response. For outcome rewards, we directly adopt LOO without any modification. Finally, the advantage is set to the combination of both returns: $A^i_t = \\underbrace{\\sum_{s=t}^{|y^i|} \\gamma^{s-t} \\cdot \\left[ r_\\phi(y^i_s) - \\frac{1}{K-1} \\sum_{j \\neq i} r_\\phi(y^j) \\right]}_{\\text{RLOO with implicit process rewards}} + \\underbrace{r_o(y^i) - \\frac{1}{K-1} \\sum_{j \\neq i} r_o(y^j)}_{\\text{RLOO with outcome rewards}}$ (5) Updating policy with PPO loss. We adopt PPO clip surrogate loss for more stable policy updates: $L_{\\text{CLIP}}(\\theta) = \\mathbb{E}_t \\left[ \\min \\left( \\frac{\\pi_\\theta(y_t|y_{<t})}{\\pi_{\\theta_{\\text{old}}}(y_t|y_{<t})} A_t, \\operatorname{clip} \\left( \\frac{\\pi_\\theta(y_t|y_{<t})}{\\pi_{\\theta_{\\text{old}}}(y_t|y_{<t})}, 1-\\epsilon, 1+\\epsilon \\right) A_t \\right) \\right]$ (6) where $\\epsilon$ is a clipping parameter. 
The loss prevents the updated policy from deviating too far from the original distribution, which is the prerequisite of importance sampling."},{"citing_arxiv_id":"2403.07691","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ORPO: Monolithic Preference Optimization without Reference Model","primary_cat":"cs.CL","submitted_at":"2024-03-12T14:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"generated per input, we sample the first item for each input and examine their inter cosine similarity with Equation 15 for across-input diversity. Unlike per-input diversity, it is noteworthy that Phi-2 (ORPO) has lower average cosine similarity in the second row of Table 4. We can infer that ORPO triggers the model to generate more instruction-specific responses than DPO. $\\text{AID}_D(\\theta) = D\\left( \\bigcup_{i=1}^{N} O_{i,\\theta,j=1} \\right)$ (15) Per Input↓ Across Input↓ Phi-2 + SFT + DPO 0.8012 0.6019 Phi-2 + ORPO 0.8909 0.5173 Llama-2 + SFT + DPO 0.8889 0.5658 Llama-2 + ORPO 0.9008 0.5091 Table 4: Lexical diversity of Phi-2 and Llama-2 fine-tuned with DPO and ORPO. Lower cosine similarity is equivalent to higher diversity. The highest value in each column within the same model family is bolded. 7 Discussion In this section, we expound on the theoretical and computational details of ORPO. The theoretical analysis of ORPO is studied in Section 7.1, which will be supported with the empirical analysis in Section 7.2. Then, we compare the computational load of DPO and ORPO in Section 7.3. 
7.1 Comparison to Probability Ratio The rationale for selecting the odds ratio instead of the probability ratio lies in its stability. The probability ratio for generating the favored response $y_w$ over the disfavored response $y_l$ given an input sequence $x$ can be defined as Equation 16. $\\text{PR}_\\theta(y_w, y_l) = \\frac{P_\\theta(y_w|x)}{P_\\theta(y_l|x)}$ (16) While this formulation has been used in previous preference alignment methods that precede SFT (Rafailov et al., 2023; Azar et al., 2023), the odds ratio is a better choice in the setting where the preference alignment is incorporated in SFT as the odds ratio is more sensitive to the model's preference understanding. In other words, the probability ratio leads to more extreme discrimination of the disfavored responses than the odds ratio. We visualize this through the sample distributions of the log probability ratio $\\log \\text{PR}(X_2|X_1)$ and log odds ratio $\\log \\text{OR}(X_2|X_1)$. We sample 50,000 samp"}],"limit":50,"offset":0}