DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Aixin Liu; Bei Feng; Bingxuan Wang; Bing Xue; Bochao Wu; Chengda Lu; Chenggang Zhao; Chengqi Deng; Chenyu Zhang; Chong Ruan

arxiv: 2501.12948 · v2 · submitted 2025-01-22 · 💻 cs.CL · cs.AI · cs.LG

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu

Ruoyu Zhang Shirong Ma Xiao Bi Xiaokang Zhang Xingkai Yu Yu Wu Z.F. Wu Zhibin Gou Zhihong Shao Zhuoshu Li Ziyi Gao Aixin Liu Bing Xue Bingxuan Wang Bochao Wu Bei Feng Chengda Lu Chenggang Zhao Chengqi Deng Chenyu Zhang Chong Ruan Damai Dai Deli Chen Dongjie Ji Erhang Li Fangyun Lin Fucong Dai Fuli Luo Guangbo Hao Guanting Chen Guowei Li H. Zhang Han Bao Hanwei Xu Haocheng Wang Honghui Ding Huajian Xin Huazuo Gao Hui Qu Hui Li Jianzhong Guo Jiashi Li Jiawei Wang Jingchang Chen Jingyang Yuan Junjie Qiu Junlong Li J.L. Cai Jiaqi Ni Jian Liang Jin Chen Kai Dong Kai Hu Kaige Gao Kang Guan Kexin Huang Kuai Yu Lean Wang Lecong Zhang Liang Zhao Litong Wang Liyue Zhang Lei Xu Leyi Xia Mingchuan Zhang Minghua Zhang Minghui Tang Meng Li Miaojun Wang Mingming Li Ning Tian Panpan Huang Peng Zhang Qiancheng Wang Qinyu Chen Qiushi Du Ruiqi Ge Ruisong Zhang Ruizhe Pan Runji Wang R.J. Chen R.L. Jin Ruyi Chen Shanghao Lu Shangyan Zhou Shanhuang Chen Shengfeng Ye Shiyu Wang Shuiping Yu Shunfeng Zhou Shuting Pan S.S. Li Shuang Zhou Shaoqing Wu Tao Yun Tian Pei Tianyu Sun T. Wang Wangding Zeng Wanjia Zhao Wen Liu Wenfeng Liang Wenjun Gao Wenqin Yu Wentao Zhang W.L. Xiao Wei An XiaoDong Liu Xiaohan Wang Xiaokang Chen Xiaotao Nie Xin Cheng Xin Liu Xin Xie Xingchao Liu Xinyu Yang Xinyuan Li Xuecheng Su Xuheng Lin X.Q. Li Xiangyue Jin Xiaojin Shen Xiaosha Chen Xiaowen Sun Xiaoxiang Wang Xinnan Song Xinyi Zhou Xianzu Wang Xinxia Shan Y.K. Li Y.Q. Wang Y.X. Wei Yang Zhang Yanhong Xu Yao Li Yao Zhao Yaofeng Sun Yaohui Wang Yi Yu Yichao Zhang Yifan Shi Yiliang Xiong Ying He Yishi Piao Yisong Wang Yixuan Tan Yiyang Ma Yiyuan Liu Yongqiang Guo Yuan Ou Yuduan Wang Yue Gong Yuheng Zou Yujia He Yunfan Xiong Yuxiang Luo Yuxiang You Yuxuan Liu Yuyang Zhou Y.X. Zhu Yanping Huang Yaohui Li Yi Zheng Yuchen Zhu Yunxian Ma Ying Tang Yukun Zha Yuting Yan Z.Z. 
Ren Zehui Ren Zhangli Sha Zhe Fu Zhean Xu Zhenda Xie Zhengyan Zhang Zhewen Hao Zhicheng Ma Zhigang Yan Zhiyu Wu Zihui Gu Zijia Zhu Zijun Liu Zilin Li Ziwei Xie Ziyang Song Zizheng Pan Zhen Huang Zhipeng Xu Zhongyu Zhang Zhen Zhang

Pith reviewed 2026-05-23 04:47 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords reasoning · models · learning · llms · capabilities · demonstrations · emergent · patterns

The pith

Pure reinforcement learning on LLMs produces emergent reasoning patterns and outperforms supervised models trained on human demonstrations on verifiable math, coding, and STEM tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most current AI models learn reasoning by copying step-by-step examples written by humans. This paper instead gives the model rewards only when it produces correct final answers on problems that can be automatically checked, such as math questions. Over time the model begins to show new behaviors including checking its own answers, verifying steps, and switching strategies mid-problem. The resulting large model beats earlier versions on competition-level math and coding benchmarks. The same learned patterns can then be transferred to improve smaller models.
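The reward described above depends only on whether the final answer can be checked automatically. A minimal sketch of such an outcome-only reward follows. It assumes, purely for illustration, that the model marks its final answer with `\boxed{...}` and that exact string match is an adequate check; the function names are hypothetical and are not taken from the paper.

```python
import re
from typing import Optional


def extract_final_answer(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in the response, if any.

    Assumption: answers are marked with \\boxed{...}; this convention is
    illustrative and is not stated on this page.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def outcome_reward(response: str, reference: str) -> float:
    """Give 1.0 only when the extracted final answer matches the reference.

    The intermediate reasoning text is never inspected, so nothing here
    rewards or penalizes how the answer was reached.
    """
    answer = extract_final_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    print(outcome_reward("Try 6*7. Check: 42/7 = 6. So \\boxed{42}", "42"))
    print(outcome_reward("I believe the answer is 41.", "42"))
```

Because the checker looks only at the last marked answer, a response that revises an earlier guess is scored on its final attempt, which is one way self-correction can go unpenalized under this kind of reward.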

Core claim

The reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation.

Load-bearing premise

That reward signals derived solely from verifiable final answers on training tasks are sufficient to produce generalizable reasoning strategies that transfer to unseen complex problems.

Original abstract

General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Pure RL on outcome rewards can produce emergent self-reflection in LLMs and beat SFT on verifiable tasks, but the abstract supplies no numbers or ablations, so the generalization claim stays untested.

The main thing to know is that this paper reports training LLMs with pure reinforcement learning using only final-answer rewards, with no human reasoning chains at all. The resulting models develop self-reflection, verification steps, and strategy switching on their own, and they outperform standard supervised models on math, coding, and STEM benchmarks. The authors also claim that the large model can then be used to improve smaller ones.

Referee Report

2 major / 0 minor

Summary. The paper claims that pure reinforcement learning (RL) can incentivize advanced reasoning capabilities in LLMs without requiring human-labeled reasoning trajectories. The proposed RL framework is said to enable emergent behaviors such as self-reflection, verification, and dynamic strategy adaptation. As a result, the trained model (DeepSeek-R1) achieves superior performance on verifiable tasks including mathematics, coding competitions, and STEM fields compared to models trained via supervised fine-tuning on human demonstrations. The emergent patterns are further claimed to be harnessable for improving smaller models.

Significance. If substantiated with rigorous evidence, the result would be significant for reducing dependence on human-annotated data in scaling LLM reasoning. Demonstrating that outcome-based rewards alone can induce transferable reasoning strategies would challenge current reliance on supervised fine-tuning for complex tasks. However, the abstract supplies no metrics, baselines, training details, or statistical evidence, so the significance cannot be assessed from the provided text; the full manuscript would need to include these to support the claims.

major comments (2)

[Abstract] The central claim that pure RL produces superior performance and emergent reasoning patterns is asserted without any quantitative metrics, baselines (e.g., specific SFT models), training details, or statistical evidence. This absence makes the claim unevaluable and directly undermines assessment of whether outcome-only rewards suffice for generalizable strategies.

[Abstract] The assertion that reward signals from verifiable final answers induce transferable patterns such as self-reflection and dynamic strategy adaptation on unseen problems lacks any supporting isolation experiment or transfer results; the training distribution (math/coding/STEM with easy verification) does not by itself guarantee generalization when problem structure changes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the abstract would be strengthened by the inclusion of key quantitative results and will revise it in the next version. We address each major comment below.

Referee: [Abstract] The central claim that pure RL produces superior performance and emergent reasoning patterns is asserted without any quantitative metrics, baselines (e.g., specific SFT models), training details, or statistical evidence. This absence makes the claim unevaluable and directly undermines assessment of whether outcome-only rewards suffice for generalizable strategies.

Authors: The full manuscript contains extensive experimental sections with quantitative metrics, specific SFT baselines, training details, and performance comparisons on mathematics, coding, and STEM benchmarks. We will revise the abstract to include representative numerical results and baseline references so that the central claims can be evaluated directly from the abstract. Revision: yes.

Referee: [Abstract] The assertion that reward signals from verifiable final answers induce transferable patterns such as self-reflection and dynamic strategy adaptation on unseen problems lacks any supporting isolation experiment or transfer results; the training distribution (math/coding/STEM with easy verification) does not by itself guarantee generalization when problem structure changes.

Authors: The manuscript presents analyses of emergent behaviors during RL training and dedicated experiments showing that the resulting reasoning patterns can be used to improve smaller models. While we do not claim the training distribution guarantees generalization to arbitrary problem structures outside the evaluated domains, the reported results include transfer within verifiable tasks. We will add a brief reference to the transfer experiments in the revised abstract. Revision: partial.

Circularity Check

0 steps flagged

No circularity in claimed derivation or results

The paper reports an empirical RL training procedure on verifiable-outcome tasks (math, coding, STEM) and measures downstream performance plus emergent behaviors such as self-reflection. No equations, uniqueness theorems, fitted parameters renamed as predictions, or self-citation chains are invoked to derive the central claim; the results are presented as experimental outcomes rather than reductions to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review conducted from abstract only; no explicit free parameters, invented entities, or non-standard axioms are stated. The approach implicitly relies on standard RL assumptions about reward shaping and policy optimization.

axioms (1)

Domain assumption: Reinforcement learning with outcome-based rewards can shape complex sequential behaviors in large neural networks.
Central premise of the proposed framework; invoked throughout the abstract's description of emergent patterns.
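
To make the domain assumption concrete, here is a toy sketch of how a 0/1 outcome reward alone can shift a policy toward the behavior that earns it. It uses a plain REINFORCE update on a three-way choice among hypothetical "strategies"; this is an illustrative stand-in under simplifying assumptions (a bandit rather than a sequence model, no baseline, no KL term), not the training algorithm used in the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_outcome_only(correct_action=2, n_actions=3, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy with REINFORCE using only a 0/1 outcome reward.

    All parameters are illustrative. The policy is never told which action
    is correct; it only observes whether the sampled action was rewarded.
    """
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = 1.0 if action == correct_action else 0.0
        # Gradient of log pi(action) w.r.t. the logits is onehot(action) - probs.
        for i in range(n_actions):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    final_probs = train_outcome_only()
    print([round(p, 3) for p in final_probs])
```

In this toy setting the probability of the rewarded action rises steadily because only rewarded samples produce an update. Whether the same mechanism yields strategies that transfer to unseen problem structures is exactly the question the referee report raises, and the sketch does not address it.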

pith-pipeline@v0.9.0 · 6502 in / 1109 out tokens · 44326 ms · 2026-05-23T04:47:04.438016+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 accept novelty 8.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, with evaluations showing direct QA at 66.4%, best practical agents at 79.1%, and oracle knowledge at 95.4%.
Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
cs.CV 2026-05 unverdicted novelty 8.0

VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
cs.AR 2026-05 conditional novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
cs.CL 2026-05 unverdicted novelty 8.0

Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
cs.CL 2026-05 unverdicted novelty 8.0

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 conditional novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
cs.AI 2026-05 conditional novelty 8.0

LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
Crafting Reversible SFT Behaviors in Large Language Models
cs.LG 2026-05 unverdicted novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
LLM Translation of Compiler Intermediate Representation
cs.PL 2026-05 unverdicted novelty 8.0

IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
cs.LG 2026-05 unverdicted novelty 8.0

Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.
Characterizing the Expressivity of Local Attention in Transformers
cs.CL 2026-05 unverdicted novelty 8.0

Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
cs.CL 2026-04 unverdicted novelty 8.0

ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
cs.CL 2026-04 unverdicted novelty 8.0

MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
cs.CV 2026-04 unverdicted novelty 8.0

neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
cs.LG 2026-04 unverdicted novelty 8.0

Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
cs.CL 2026-04 conditional novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
cs.CL 2026-04 unverdicted novelty 8.0

NovBench is the first large-scale benchmark with 1,684 expert-annotated pairs to evaluate LLMs on assessing academic paper novelty via a four-dimensional framework of Relevance, Correctness, Coverage, and Clarity.
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI 2026-03 conditional novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
cs.LG 2026-03 unverdicted novelty 8.0

Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
MVI-Bench: A Comprehensive Benchmark for Evaluating Robustness to Misleading Visual Inputs in LVLMs
cs.CV 2025-11 unverdicted novelty 8.0

MVI-Bench supplies the first taxonomy and dataset focused on misleading visual inputs to measure LVLM robustness, with tests on 18 models revealing clear weaknesses.
MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation
cs.CL 2025-07 accept novelty 8.0

MediQAl is a new French medical QA benchmark with 32k exam-sourced questions in three formats and cognitive labels, evaluated on 14 LLMs to reveal gaps between factual recall and reasoning performance.
Flow-GRPO: Training Flow Matching Models via Online RL
cs.CV 2025-05 unverdicted novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
cs.CL 2025-04 conditional novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
Large Language Diffusion Models
cs.CL 2025-02 unverdicted novelty 8.0

LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
Geo-Align: Video Generation Alignment via Metric Geometry Reward
cs.CV 2026-05 unverdicted novelty 7.0

Geo-Align applies RL with a perceptual reward derived from 3D camera trajectory estimation to improve controllability and fidelity in video generation without paired training data.
Training-Free Looped Transformers
cs.LG 2026-05 unverdicted novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
cs.AI 2026-05 unverdicted novelty 7.0

More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
cs.CV 2026-05 conditional novelty 7.0

JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.
MAVEN: A Multi-stage Agentic Annotation Pipeline for Video Reasoning Tasks
cs.CV 2026-05 unverdicted novelty 7.0

MAVEN pipeline generates multi-scale spatio-temporal event descriptions from videos using agentic adaptation and refinement, then produces training data that lets a fine-tuned 8B model outperform Gemini baselines on p...
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV 2026-05 conditional novelty 7.0

CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% s...
Frontier: Towards Comprehensive and Accurate LLM Inference Simulation
cs.DC 2026-05 unverdicted novelty 7.0

Frontier is a new discrete-event simulator for disaggregated LLM serving that incorporates co-location, PDD, AFD, and optimizations, achieving under 4% throughput error and large reductions in latency prediction error...
On the Cost and Benefit of Chain of Thought: A Learning-Theoretic Perspective
cs.LG 2026-05 unverdicted novelty 7.0

Chain of Thought risk decomposes into oracle-trajectory benefit and trajectory-mismatch cost, with stability determining bounded, linear, or exponential error growth.
BioDefect: The First Dataset for Defect Detection in Bioinformatics Software
cs.SE 2026-05 unverdicted novelty 7.0

BioDefect is a new dataset for defect detection in bioinformatics software that improves average F1-scores by 29.61% to 38.04% over existing datasets when evaluated on nine language models.
AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals
cs.LG 2026-05 conditional novelty 7.0

AVSD improves self-distillation by identifying cross-view consensus signals and selectively incorporating aligned view-specific residuals for token-level supervision.
Code Generation by Differential Test Time Scaling
cs.SE 2026-05 unverdicted novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
cs.CL 2026-05 conditional novelty 7.0

AutoTool uses reinforcement learning with dual-mode rewards to train multimodal LLMs to adaptively choose between tool-assisted and text-centric reasoning, yielding accuracy and efficiency gains on V* and POPE benchmarks.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
cs.LG 2026-05 conditional novelty 7.0

CEPO sharpens token credit in RLVR by requiring tokens to be favored by the correct answer and disfavored by wrong answers drawn from rejected rollouts, delivering accuracy gains on five multimodal math benchmarks.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
KVBuffer: IO-aware Serving for Linear Attention
cs.LG 2026-05 unverdicted novelty 7.0

KVBuffer reduces linear attention decoding latency by up to 45% and increases speculative decoding throughput 5x by buffering keys/values for flexible chunked and parallel computation.
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE
cs.CL 2026-05 unverdicted novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
cs.AI 2026-05 unverdicted novelty 7.0

SVFSearch is the first open benchmark for short-video frame search in the Chinese gaming domain, providing a frozen retrieval environment and showing performance gaps of 13-29 points between direct QA models, practica...
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
cs.AI 2026-05 unverdicted novelty 7.0

AutoRubric-T2I learns and selects explicit rubrics from preference pairs to guide VLM judges, producing high-quality interpretable rewards for T2I alignment with far less data than traditional Bradley-Terry models.
AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment
cs.AI 2026-05 unverdicted novelty 7.0

AutoRubric-T2I learns a small set of interpretable rubrics for VLM judges that outperform scalar reward models on T2I benchmarks while using far less preference data.
Weak-to-Strong Elicitation via Mismatched Wrong Drafts
cs.CL 2026-05 conditional novelty 7.0

Mismatched wrong drafts from a 1.5B math model injected into GRPO training of a 7B model yield higher pass rates on MATH-500 and AIME than on-policy baselines or matched variants.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean
cs.AI 2026-05 accept novelty 7.0

CAM-Bench is a new Lean 4 theorem-proving benchmark of 1,000 problems in computational and applied mathematics, built from textbook exercises using a dependency-recovery pipeline to reconstruct local context.
HalluScore: Large Language Model Hallucination Question Answering Benchmark
cs.CL 2026-05 unverdicted novelty 7.0

HalluScore is a curated Arabic QA dataset with 827 questions, ground-truth evidence, and human annotations used to measure hallucination rates across 17 LLMs.
DEVIS-GRPO: Unleashing GRPO on Dynamic Extreme View Synthesis
cs.CV 2026-05 unverdicted novelty 7.0

DEVIS-GRPO applies online policy gradients with an accumulative small-to-large view sampling strategy and multi-level rewards to improve trajectory-controlled extreme view video generation, reporting gains on Kubric-4...
Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasonin...
PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures
cs.CL 2026-05 unverdicted novelty 7.0

PQR is a dual-module iterative framework that generates diverse and realistic queries to elicit failures in QA agents, detecting 23-78% more unhelpful responses than prior methods.
ReAlign: Generalizable Image Forgery Detection via Reasoning-Aligned Representation
cs.CV 2026-05 unverdicted novelty 7.0

ReAlign distills LLM-generated reasoning texts into a lightweight AIGI forgery detector via contrastive image-text alignment to improve generalization on complex forgeries.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
cs.CL 2026-05 unverdicted novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
cs.CL 2026-05 conditional novelty 7.0

MHGraphBench is a new PrimeKG-derived benchmark that exposes a recognition-to-judgment gap in 15 LLMs on mental health tasks while stressing that results measure KG agreement under constrained interfaces, not clinical...
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs
cs.LG 2026-05 unverdicted novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...