{"total":16,"items":[{"citing_arxiv_id":"2606.28401","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Vision-driven Preference Synthesis for Mitigating Hallucinations in VLMs","primary_cat":"cs.CV","submitted_at":"2026-06-24T11:06:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ViPSy constructs policy-aligned and visually grounded preference pairs for VLMs via visual cues from image variants, yielding SOTA hallucination reductions of 35.7% on AMBER and 24.5% on Object HalBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29398","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-28T05:47:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GDSD reduces RL for dLLMs to likelihood-free self-distillation via a normalization-free logit-matching objective, outperforming ELBO methods with more stable training on LLaDA-8B and Dream-7B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28440","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AdaDPO: Self-Adaptive Direct Preference Optimization with Balanced Gradient Updates","primary_cat":"cs.CL","submitted_at":"2026-05-27T13:05:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AdaDPO uses self-adaptive stop-gradient coefficients to balance preferred and dispreferred gradients in DPO, achieving higher AlpacaEval 2 win rates than standard DPO on 
Llama-3-8B-Instruct.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18721","ref_index":7,"ref_count":3,"confidence":0.9,"is_internal_anchor":false,"paper_title":"General Preference Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15113","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning from Language Feedback via Variational Policy Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-14T17:27:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09922","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-11T03:17:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TPAW uses teams of current and historical model 
checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09363","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions","primary_cat":"cs.LG","submitted_at":"2026-05-10T06:23:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09214","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability","primary_cat":"cs.LG","submitted_at":"2026-05-09T23:17:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general function approximation settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07331","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token 
Perspective","primary_cat":"cs.LG","submitted_at":"2026-05-08T06:35:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07244","ref_index":97,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-08T05:01:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04180","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedFabric and EtHER: A Data-Centric Framework for Word-Level Fabrication Generation and Detection in Medical LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-05T18:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MedFabric dataset and EtHER detector achieve over 15% better word-level fabrication detection in medical LLMs than prior methods by generating stylistically faithful errors and using decomposition-based 
checking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02626","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-04T14:15:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20933","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IRIS: Interpolative R\\'enyi Iterative Self-play for Large Language Model Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-22T11:52:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of the annotated data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"second family corresponds to the Jensen-Shannon divergence [27], replacing the reward gap with independent binary classification of real and synthetic samples via noise contrastive estimation [75] or adversarial training [79]. The third family employs the χ2 divergence, enforcing explicit divergence constraints [40] or game-theoretic equilibrium formulations [81, 67] that bound the reward magnitude. 
This classification exposes a fundamental limitation shared by all existing approaches: each operates under a single fixed divergence throughout training. As discussed in Section 1, no single f-divergence provides uniformly optimal learning dynamics across all stages, since different divergences exhibit complementary strengths depending on the distributional gap between the model and the target."},{"citing_arxiv_id":"2509.23102","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multiplayer Nash Preference Optimization","primary_cat":"cs.AI","submitted_at":"2025-09-27T04:18:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.10442","ref_index":105,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","primary_cat":"cs.CL","submitted_at":"2024-11-15T18:59:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"ification of the DPO loss that accounts for potential label noise in preference data. (5) RobustDPO [21], which provides an unbiased estimate of the DPO loss designed to handle preference noise in data. Similar to cDPO, it assumes that labels are noisy with a certain probability. 
(6) BCO [37], which introduces a binary classifier trained to output logits used as reward values. (7) SPPO [105], which iteratively pushes chosen rewards toward 1/2 and rejected rewards toward -1/2 to approximate a Nash equilibrium, aiming to reduce data sparsity issues. (8) AOT [68], which applies Distributional Preference Alignment via Optimal Transport. (9) TR-DPO [29], which adds synchronization between the model and a reference model every few steps to"},{"citing_arxiv_id":"2408.15339","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UNA: A Unified Supervised Framework for Efficient LLM Alignment Across Feedback Types","primary_cat":"cs.LG","submitted_at":"2024-08-27T18:04:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UNA unifies binary, pairwise, and score-based feedback for LLM alignment via a generalized implicit reward function shown optimal by the log sum inequality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}