Recognition: unknown
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
read the original abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
-
REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
-
Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
-
Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
LLMs exhibit myopic planning in four-in-a-row: move choices are best explained by shallow nodes in reasoning traces, not the deep lookahead they generate, unlike humans where depth drives performance.
-
Crafting Reversible SFT Behaviors in Large Language Models
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
-
LLM Translation of Compiler Intermediate Representation
IRIS-14B is the first LLM trained explicitly for GIMPLE-to-LLVM IR translation and outperforms much larger models by up to 44 percentage points on real-world C code.
-
Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum
Momentum-based async SGD achieves optimal convergence rates for data-dependent delays without biasing updates toward simpler samples.
-
Characterizing the Expressivity of Local Attention in Transformers
Local attention strictly enlarges the class of regular languages recognizable by fixed-precision transformers by adding a second past operator in linear temporal logic, with global and local attention being expressive...
-
ArgBench: Benchmarking LLMs on Computational Argumentation Tasks
ArgBench unifies 33 existing datasets into a standardized benchmark for testing LLMs across 46 argumentation tasks and analyzes the impact of prompting techniques and model factors on performance.
-
MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning
MedPRMBench is the first fine-grained benchmark for process reward models in medical reasoning, featuring 6500 questions, 13000 chains, 113910 step labels, and a baseline that improves downstream QA accuracy by 3.2-6....
-
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
-
Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
-
Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models
Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.
-
NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
NovBench is the first large-scale benchmark with 1,684 expert-annotated pairs to evaluate LLMs on assessing academic paper novelty via a four-dimensional framework of Relevance, Correctness, Coverage, and Clarity.
-
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
-
Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages
Derives an exact unbiased policy gradient for RL post-training of diffusion LLMs via entropy-guided step selection and one-step denoising rewards, achieving state-of-the-art results on coding and logical reasoning benchmarks.
-
Flow-GRPO: Training Flow Matching Models via Online RL
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
-
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
-
Large Language Diffusion Models
LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.
-
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
-
Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization
AsymRec decouples input and output representations in generative recommendation via multi-expert semantic projection and multi-faceted hierarchical quantization, outperforming prior models by 15.8% on average.
-
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
TWN attaches separate reasoning and embedding LoRA adapters to a frozen backbone with gradient detachment and a self-supervised gate that decides per input whether to generate CoT, achieving SOTA on MMEB-V2 with 3-5% ...
-
GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
-
How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization
The authors derive a Maximally Scale-Stable Parameterization (MSSP) for MoE models that achieves robust learning-rate transfer and monotonic performance gains with scale across co-scaling regimes of width, experts, an...
-
ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents
ClawForge supplies a generator that turns scenario templates into reproducible command-line tasks testing state conflict handling, where the strongest frontier model scores only 45.3 percent strict accuracy.
-
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
-
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves
CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...
-
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
-
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
-
Many-Shot CoT-ICL: Making In-Context Learning Truly Learn
Many-shot CoT-ICL functions as test-time learning when demonstrations are ordered for smooth conceptual progression rather than similarity, enabling a new selection method that improves reasoning performance.
-
Reward-Weighted On-Policy Distillation with an Open Property-Equivalence Verifier for NL-to-SVA Generation
Reward-Weighted On-Policy Distillation with an open property-equivalence verifier produces a 7B model that surpasses prior SOTA on NL-to-SVA generation across pass@1/5/10 metrics.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
-
Query-Conditioned Test-Time Self-Training for Large Language Models
QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
-
What Does LLM Refinement Actually Improve? A Systematic Study on Document-Level Literary Translation
Document-level machine translation followed by segment-level LLM refinement provides the strongest and most stable improvements in literary translation quality, mainly enhancing fluency and style rather than adequacy.
-
Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
PyRAG turns multi-hop reasoning into executable Python code over retrieval tools for explicit, verifiable step-by-step RAG.
-
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective
ConSPO improves RLVR training by aligning rollout scores with generation likelihoods via length-normalized log-probabilities and applying a group-wise InfoNCE contrastive loss with a scheduled margin, outperforming GR...
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Seg-Agent performs language-guided segmentation without training by using Set-of-Mark visual prompts to enable explicit multimodal chain-of-reasoning in three stages: generation, selection, and refinement.
-
From Instance Selection to Fixed-Pool Data Recipe Search for Supervised Fine-Tuning
AutoSelection discovers data recipes from a 90K instruction pool that outperform full-data training and other selectors on reasoning tasks for SFT across multiple models.
-
Multi-Rollout On-Policy Distillation via Peer Successes and Failures
MOPD improves on-policy distillation for LLMs by using peer successes for positive patterns and failures for negative examples to create more informative teacher signals.
-
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
-
OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation
OmniNFT introduces modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting in an online diffusion RL framework to improve audio-video quality, alignment, and synchronization.
-
VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
VIP evolves text prompts using visual cues and saliency-aware aggregation inside dino.txt to deliver 1.4-8.4% higher mIoU on dense vision-language tasks with low overhead.
-
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
Uni-AdGen uses a unified autoregressive framework with foreground perception, instruction tuning, and coarse-to-fine preference modules to generate personalized image-text ads from noisy user behaviors, outperforming ...
-
Large Language Models as Amortized Pareto-Front Generators for Constrained Bi-Objective Convex Optimization
DIPS fine-tunes LLMs to output ordered feasible decision vectors approximating Pareto fronts for constrained bi-objective convex problems, reaching 95-98% normalized hypervolume with 0.16s inference.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization
DGAO uses reinforcement learning to optimize LLMs for both accuracy and order stability by balancing intra-group accuracy advantages and inter-group stability advantages.
-
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.
-
GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation
GEAR reshapes GRPO trajectory advantages using divergence signals from a ground-truth-conditioned teacher to create adaptive token- and segment-level credit regions.
-
Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).
-
CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating
CaC is a hierarchical spatiotemporal concentrating reward model for video anomalies that reports 25.7% accuracy gains on fine-grained benchmarks and 11.7% anomaly reduction in generated videos via a new dataset and GR...
-
From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
-
CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG
CuSearch reallocates rollout budget in RLVR toward deeper-search trajectories as a proxy for retrieval supervision density, yielding up to 11.8 exact-match gains over uniform GRPO sampling on ZeroSearch.
-
SoK: Unlearnability and Unlearning for Model Dememorization
The first integrated taxonomy, empirical study of interplay and shallow dememorization, plus a theoretical guarantee on dememorization depth for certified unlearning.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
Deep Reasoning in General Purpose Agents via Structured Meta-Cognition
DOLORES, an agent using a formal language for meta-reasoning to construct adaptive scaffolds on the fly, outperforms prior scaffolding methods by 24.8% on average across four hard benchmarks and multiple model sizes.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.
-
Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
DuST uses on-policy RL to train code models on ranking their own sampled solutions by sandbox execution correctness, improving judgment NDCG, pass@1, and Best-of-4 accuracy while showing that SFT on the same data does...
-
Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Distillation signals align better with ideal updates on incorrect student rollouts than correct ones, with optimal teacher context depending on student capacity and task.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.