hub

Advances in neural information processing systems , volume=

Direct preference optimization: Your language model is secretly a reward model , author=

54 Pith papers cite this work. Polarity classification is still indexing.

54 Pith papers citing it

browse 54 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 2 method 2

citation-polarity summary

background 2 use method 2

representative citing papers

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.

Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

cs.AI · 2026-05-20 · conditional · novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.

Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

cs.LG · 2026-05-11 · conditional · novelty 7.0

DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.

OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.

Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and drug discovery tasks.

Relative Score Policy Optimization for Diffusion Language Models

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.

Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

astro-ph.IM · 2026-05-07 · unverdicted · novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models

cs.CV · 2026-05-05 · unverdicted · novelty 7.0

CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.

Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent

cs.GT · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

Proposes OPMD algorithm achieving accelerated O(1/n) rates for offline Nash equilibrium learning in alpha-potential games via reference-anchored data coverage.

Pessimism-Free Offline Learning in General-Sum Games via KL Regularization

cs.LG · 2026-04-30 · unverdicted · novelty 7.0 · 2 refs

KL regularization enables pessimism-free offline learning in general-sum games, recovering regularized Nash equilibria at accelerated rate O(1/n) via GANE and converging to coarse correlated equilibria at standard rate O(1/sqrt(n)+1/T) via GAMD.

RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners

cs.CL · 2026-04-30 · conditional · novelty 7.0

RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.

Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

cs.AI · 2026-04-19 · unverdicted · novelty 7.0

A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

Validity-Calibrated Reasoning Distillation

cs.LG · 2026-04-14 · unverdicted · novelty 7.0

Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.

How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.

citing papers explorer

Showing 50 of 54 citing papers.

Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models) cs.LG · 2026-05-07 · unverdicted · none · ref 29
Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models cs.CV · 2026-05-20 · unverdicted · none · ref 36
Linear-DPO replaces sigmoid utility with linear utility and adds EMA reference to improve preference alignment in diffusion and flow-matching text-to-image models.
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment cs.AI · 2026-05-20 · conditional · none · ref 18
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL cs.LG · 2026-05-17 · unverdicted · none · ref 22
DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.
Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space cs.LG · 2026-05-15 · unverdicted · none · ref 22
Pretraining and alignment induce asymmetric geometric traces in transformer weights because alignment updates concentrate in read pathways due to activation covariance while write pathways inherit less structure from alignment losses.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 49
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 42
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling cs.LG · 2026-05-11 · conditional · none · ref 15
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents cs.AI · 2026-05-11 · unverdicted · none · ref 54
OLIVIA treats LLM agent action selection as a contextual linear bandit over frozen hidden states and applies UCB exploration to adapt online, yielding consistent gains over static ReAct and prompt-based baselines on four benchmarks.
Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights cs.LG · 2026-05-11 · unverdicted · none · ref 4
AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and drug discovery tasks.
Relative Score Policy Optimization for Diffusion Language Models cs.CL · 2026-05-11 · unverdicted · none · ref 43
RSPO interprets reward advantages as targets for relative log-ratios in dLLMs, calibrating noisy estimates to stabilize RLVR training and achieve strong gains on planning tasks with competitive math reasoning performance.
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck cs.LG · 2026-05-08 · unverdicted · none · ref 52
CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 16
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification astro-ph.IM · 2026-05-07 · unverdicted · none · ref 57
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models cs.CV · 2026-05-05 · unverdicted · none · ref 16
CoVUBench is the first benchmark framework for evaluating multimodal copyright unlearning in LVLMs via synthetic data, systematic variations, and a dual protocol for forgetting efficacy and utility preservation.
Fast Rates in $\alpha$-Potential Games via Regularized Mirror Descent cs.GT · 2026-04-30 · unverdicted · none · ref 19 · 2 links
Proposes OPMD algorithm achieving accelerated O(1/n) rates for offline Nash equilibrium learning in alpha-potential games via reference-anchored data coverage.
Pessimism-Free Offline Learning in General-Sum Games via KL Regularization cs.LG · 2026-04-30 · unverdicted · none · ref 10 · 2 links
KL regularization enables pessimism-free offline learning in general-sum games, recovering regularized Nash equilibria at accelerated rate O(1/n) via GANE and converging to coarse correlated equilibria at standard rate O(1/sqrt(n)+1/T) via GAMD.
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners cs.CL · 2026-04-30 · conditional · none · ref 24
RSAT uses SFT on verified traces followed by GRPO with NLI faithfulness rewards to make 1-8B models produce verifiable table reasoning with cell citations, raising faithfulness 3.7x to 0.826.
Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms cs.CL · 2026-04-21 · unverdicted · none · ref 14
Single-agent systems with tools provide the optimal performance-efficiency trade-off for small language models, outperforming base models and multi-agent setups.
Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback cs.LG · 2026-04-21 · unverdicted · none · ref 19
Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control cs.LG · 2026-04-21 · conditional · none · ref 61
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys cs.AI · 2026-04-19 · unverdicted · none · ref 31
A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
Validity-Calibrated Reasoning Distillation cs.LG · 2026-04-14 · unverdicted · none · ref 27
Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.
How Much Online RL is Enough? Informative Rollouts for Offline Preference Optimization in RLVR cs.LG · 2026-05-20 · unverdicted · none · ref 6
Short GRPO warm-up followed by offline DPO on informative rollouts matches or beats full GRPO on math reasoning benchmarks at substantially lower compute cost.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 19
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation cs.CV · 2026-05-14 · unverdicted · none · ref 26
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data cs.RO · 2026-05-13 · unverdicted · none · ref 37
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
Learning Perturbations to Extrapolate Your LLM stat.ML · 2026-05-13 · unverdicted · none · ref 5
A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.
DocAtlas: Multilingual Document Understanding Across 80+ Languages cs.CL · 2026-05-12 · unverdicted · none · ref 67 · 2 links
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation cs.CL · 2026-05-12 · unverdicted · none · ref 22
SAGE trains a rubric-based verifier and an RL-optimized generator on seed human data to scalably augment LLM knowledge benchmarks, matching human-annotated quality on HellaSwag at lower cost and generalizing to MMLU.
Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training cs.CL · 2026-05-12 · unverdicted · none · ref 54 · 2 links
LayerTracer analysis identifies deep LLM layers as stable task-critical regions, leading to a shallow-train deep-freeze strategy that outperforms full fine-tuning on C-Eval and CMMLU.
Annotations Mitigate Post-Training Mode Collapse cs.CL · 2026-05-11 · unverdicted · none · ref 15
Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics cs.CL · 2026-05-10 · unverdicted · none · ref 180
CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
Post-hoc Selective Classification for Reliable Synthetic Image Detection cs.CV · 2026-05-09 · unverdicted · none · ref 52
ReSIDe generalizes logit-based confidence scores to intermediate layers of synthetic image detectors and uses preference optimization to aggregate them, cutting area under the risk-coverage curve by up to 69.55% under covariate shifts.
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation cs.CL · 2026-05-08 · unverdicted · none · ref 22
WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
Threshold-Guided Optimization for Visual Generative Models cs.LG · 2026-05-06 · unverdicted · none · ref 2
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 21
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Multilingual Safety Alignment via Self-Distillation cs.LG · 2026-05-03 · unverdicted · none · ref 19 · 2 links
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors cs.LG · 2026-05-01 · conditional · none · ref 44
DoTS decouples SFT and RLVR training then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.
Alignment has a Fantasia Problem cs.AI · 2026-04-23 · unverdicted · none · ref 64
AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction cs.AI · 2026-05-18 · unverdicted · none · ref 47
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning cs.AI · 2026-05-07 · unverdicted · none · ref 57 · 3 links
Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.
ReMedi: Reasoner for Medical Clinical Prediction cs.CL · 2026-05-02 · unverdicted · none · ref 29
ReMedi boosts LLM performance on EHR clinical predictions by up to 19.9% F1 through ground-truth-guided rationale regeneration and fine-tuning.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 72
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization cs.CL · 2026-04-19 · unverdicted · none · ref 51
A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents cs.MA · 2026-05-06 · unverdicted · none · ref 19
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
Supplement Generation Training for Enhancing Agentic Task Performance cs.LG · 2026-04-22 · unverdicted · none · ref 25
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning cs.AI · 2026-04-22 · unverdicted · none · ref 19
SuperIgor uses iterative co-training of a language model planner and a goal-conditional RL agent to self-generate and refine plans, resulting in stricter instruction adherence and better generalization to unseen instructions.
REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak cs.LG · 2026-05-20 · unreviewed · ref 12
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning cs.AI · 2026-05-13 · unreviewed · ref 68

Advances in neural information processing systems , volume=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer