Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu · 2024 · cs.CL · arXiv 2412.21187

Canonical reference. 93% of citing Pith papers cite this work as background.

62 Pith papers citing it

Background 93% of classified citations

abstract

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.

citation-role summary

background 15

citation-polarity summary

background 14 · support 1

representative citing papers

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete

cs.LG · 2026-06-01 · unverdicted · novelty 8.0

Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.

DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning

cs.CL · 2025-04-15 · conditional · novelty 8.0

DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.

Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.

ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.

The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.

Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost

cs.AI · 2026-05-07 · conditional · novelty 7.0

Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

cs.CL · 2025-07-05 · conditional · novelty 7.0

Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.

L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

cs.CL · 2025-03-06 · unverdicted · novelty 7.0

LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.

Overthink-Triggered Slowdown Attacks on LVLM-Based Robotic Systems

cs.CR · 2026-07-01 · unverdicted · novelty 6.0

Adversaries can use crafted scene text to trigger overthinking in LVLM-based robots, producing transferable slowdowns up to 6.96x latency amplification.

Addressing Over-Refusal in LLMs with Competing Rewards

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.

Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction

cs.CL · 2026-06-26 · unverdicted · novelty 6.0

Epi2Diff extracts cognitive episode sequences from LRM reasoning traces and combines them with semantic features to predict human item difficulty, outperforming baselines on four educational datasets.

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

cs.AI · 2026-06-05 · unverdicted · novelty 6.0

DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.

Adaptive Latent Agentic Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

ALAR trains LLM agents to perform most reasoning in a latent space supervised by actions and escalates to explicit CoT only when needed, cutting tokens by up to 84.6% while preserving accuracy on search and tool-use benchmarks.

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.

Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.

citing papers explorer

Showing 50 of 62 citing papers.

Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete cs.LG · 2026-06-01 · unverdicted · none · ref 84 · internal anchor
Sliding-window transformers without positional encodings are Turing complete because the sliding window breaks permutation symmetry and suffices to simulate Post machines via a constant-size histogram state.
DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning cs.CL · 2025-04-15 · conditional · none · ref 4 · internal anchor
DeepMath-103K is a new 103K-problem mathematical dataset with high difficulty, rigorous decontamination, and verifiable answers to support RL training of language-model reasoning.
Know When to Stop: Segment-Level Credit Assignment for Reducing Overthinking cs.CL · 2026-07-01 · unverdicted · none · ref 52 · internal anchor
DASH assigns segment-level credit in reasoning traces using drift toward ground-truth answers, yielding 50.8% accuracy on AIME25 versus 45.4% for GRPO while reducing overthinking behaviors.
ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning cs.AI · 2026-06-11 · unverdicted · none · ref 7 · internal anchor
ReSum trains LLMs via RLVR to self-summarize reasoning trajectories, yielding 4% average performance gains and 18.6% shorter rollouts through contrastive rollout branches.
Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning cs.LG · 2026-06-11 · unverdicted · none · ref 38 · internal anchor
SWITCH uses explicit <swi> and </swi> boundary tokens to make latent chain-of-thought compatible with on-policy RL (GRPO) and open to causal mechanistic probing, outperforming prior hidden-state recurrence methods.
KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty cs.CL · 2026-06-09 · unverdicted · none · ref 2 · internal anchor
KCSAT-ML benchmark supplies human error rates for math problems and DRG metric exposes that model accuracy collapses on high-human-error items while test-time scaling shows non-monotonic gains and alignment failures.
The Gordian Knot for VLMs: Diagrammatic Knot Reasoning as a Hard Benchmark cs.AI · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
KnotBench benchmark shows state-of-the-art VLMs perform near random on diagrammatic knot reasoning tasks and lack ability to simulate structural moves.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models cs.LG · 2026-05-10 · unverdicted · none · ref 10 · internal anchor
LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning outputs than base models on math benchmarks.
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost cs.AI · 2026-05-07 · conditional · none · ref 120 · internal anchor
Post-Reasoning boosts LLM accuracy by reversing the usual answer-after-reasoning order, delivering mean relative gains of 17.37% across 117 model-benchmark pairs with zero extra cost.
When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models cs.AI · 2025-10-24 · unverdicted · none · ref 2 · internal anchor
Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.
Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models cs.CL · 2025-07-05 · conditional · none · ref 5 · internal anchor
Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning cs.CL · 2025-03-06 · unverdicted · none · ref 5 · internal anchor
LCPO trains L1 reasoning models to adhere to prompt-specified CoT lengths, supporting accuracy-compute trade-offs and yielding short reasoning models that outperform larger baselines at matched lengths.
Overthink-Triggered Slowdown Attacks on LVLM-Based Robotic Systems cs.CR · 2026-07-01 · unverdicted · none · ref 7 · internal anchor
Adversaries can use crafted scene text to trigger overthinking in LVLM-based robots, producing transferable slowdowns up to 6.96x latency amplification.
Addressing Over-Refusal in LLMs with Competing Rewards cs.LG · 2026-06-30 · unverdicted · none · ref 56 · internal anchor
SEAR trains one LLM via adversarial process rewards to explore harmful reasoning paths but flip to safe outputs, reducing over-refusal while preserving safety.
Cognitive Episodes in LLM Reasoning Traces Enable Interpretable Human Item Difficulty Prediction cs.CL · 2026-06-26 · unverdicted · none · ref 5 · internal anchor
Epi2Diff extracts cognitive episode sequences from LRM reasoning traces and combines them with semantic features to predict human item difficulty, outperforming baselines on four educational datasets.
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling cs.AI · 2026-06-05 · unverdicted · none · ref 3 · internal anchor
DyCon dynamically controls reasoning depth in LRMs by modeling evolving difficulty from step-level embeddings, reducing redundant steps across multiple benchmarks.
ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning cs.AI · 2026-06-02 · unverdicted · none · ref 4 · internal anchor
ThoughtFold applies introspective redundancy detection within correct CoT trajectories to create sub-trajectory spectra, then uses masked preference optimization to penalize redundant explorations, yielding 56% token reduction on DeepSeek-R1-Distill-Qwen-7B while preserving accuracy.
Adaptive Latent Agentic Reasoning cs.CL · 2026-06-01 · unverdicted · none · ref 30 · internal anchor
ALAR trains LLM agents to perform most reasoning in a latent space supervised by actions and escalates to explicit CoT only when needed, cutting tokens by up to 84.6% while preserving accuracy on search and tool-use benchmarks.
Quantized Reasoning Models Think They Need to Think Longer, but They Do Not cs.LG · 2026-05-29 · unverdicted · none · ref 49 · internal anchor
Post-training quantization increases overthinking errors in reasoning models; a logit penalty on curated overthinking markers reduces CoT length 12-23% without accuracy loss.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning cs.AI · 2026-05-21 · unverdicted · none · ref 14 · internal anchor
SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning cs.AI · 2026-05-12 · unverdicted · none · ref 9 · internal anchor
BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration cs.LG · 2026-05-11 · unverdicted · none · ref 10 · 2 links · internal anchor
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness cs.LG · 2026-05-10 · unverdicted · none · ref 6 · internal anchor
VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.
Hint Tuning: Less Data Makes Better Reasoners cs.CL · 2026-05-09 · unverdicted · none · ref 7 · 2 links · internal anchor
Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training cs.AI · 2026-05-08 · unverdicted · none · ref 7 · internal anchor
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models cs.CL · 2026-05-01 · unverdicted · none · ref 9 · internal anchor
A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.
From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space cs.LG · 2026-04-15 · unverdicted · none · ref 9 · internal anchor
PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation cs.AI · 2026-04-09 · unverdicted · none · ref 6 · internal anchor
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation cs.AI · 2025-10-05 · unverdicted · none · ref 47 · internal anchor
A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
Entropy After </Think> for reasoning model early exiting cs.LG · 2025-09-30 · unverdicted · none · ref 2 · internal anchor
Entropy After </Think> (EAT) enables early exiting in reasoning LLMs by tracking entropy stabilization after a </think> token, cutting token use 12-22% on MATH500 and AIME2025 with no accuracy loss.
Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training cs.AI · 2025-09-30 · unverdicted · none · ref 8 · internal anchor
Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific
GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models cs.CL · 2025-09-11 · unverdicted · none · ref 33 · internal anchor
GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity cs.AI · 2025-06-07 · unverdicted · none · ref 30 · internal anchor
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference cs.LG · 2025-05-05 · conditional · none · ref 15 · internal anchor
RetroInfer introduces the wave index and wave buffer to realize sparse KV-cache attention for long-context LLM inference with up to 4.4X throughput gains while matching full-attention accuracy.
Muon is Scalable for LLM Training cs.LG · 2025-02-24 · unverdicted · none · ref 76 · internal anchor
Muon optimizer with weight decay and update scaling achieves ~2x efficiency over AdamW for large LLMs, shown via the Moonlight 3B/16B MoE model trained on 5.7T tokens.
CAT: Confidence-Adaptive Thinking for Efficient Reasoning of Large Reasoning Models cs.CL · 2026-07-01 · unverdicted · none · ref 4 · internal anchor
CAT uses intrinsic confidence signals in preference optimization to adapt reasoning length in LRMs, outperforming uniform compression baselines on accuracy across benchmarks.
LASER: Load-Aware Serving with Early-Exit for Reasoning LLMs at the Edge cs.DC · 2026-06-30 · unverdicted · none · ref 5 · internal anchor
LASER reduces edge LLM serving latency by 17-38% and improves SLO satisfaction by 3-6% via load-aware adaptive early-exit thresholds and difficulty-aware budget pre-allocation, with 1% average accuracy cost.
Thinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLMs cs.CL · 2026-05-31 · unverdicted · none · ref 8 · internal anchor
HAB applies coarse-to-fine budgeting to LLM reasoning, predicting per-problem depth and learning intra-step token budgets via PPL comparisons and adaptive Pareto optimization, yielding higher accuracy and lower token use than standard CoT on GSM8K and MATH500.
SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning cs.AI · 2026-05-29 · unverdicted · none · ref 3 · internal anchor
SLAT applies segment-level adaptive trimming in RL to reduce CoT reasoning length by 50% while maintaining competitive accuracy on benchmarks.
Reasoning Compression with Mixed-Policy Distillation cs.AI · 2026-05-09 · unverdicted · none · ref 22 · internal anchor
Mixed-Policy Distillation transfers concise reasoning behavior from larger to smaller LLMs by having the teacher compress student-generated trajectories, cutting token usage up to 27% while raising benchmark scores.
How Well Do LLMs Perform on the Simplest Long-Chain Reasoning Tasks: An Empirical Study on the Equivalence Class Problem cs.AI · 2026-05-07 · unverdicted · none · ref 27 · internal anchor
Non-reasoning LLMs fail the equivalence class problem while reasoning LLMs perform better but remain incomplete, with difficulty peaking at phase transition for the former and maximum diameter for the latter.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 11 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning cs.CL · 2026-04-11 · unverdicted · none · ref 252 · internal anchor
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs cs.CL · 2026-04-11 · unverdicted · none · ref 267 · internal anchor
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
SHAPE: Stage-aware Hierarchical Advantage via Potential Estimation for LLM Reasoning cs.LG · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
SHAPE improves average math reasoning accuracy by 3% while cutting token use by 30% through stage-aware hierarchical advantage and entropy-driven token redistribution.
Investigating Thinking Behaviours of Reasoning-Based Language Models for Social Bias Mitigation cs.CL · 2025-10-20 · unverdicted · none · ref 4 · internal anchor
Reasoning LLMs aggregate social biases through stereotype repetition and irrelevant information injection in their thinking processes, and a self-review prompt mitigates this on BBQ, StereoSet, and BOLD benchmarks.
Early Stopping Chain-of-thoughts in Large Language Models cs.CL · 2025-09-17 · conditional · none · ref 1 · internal anchor
ES-CoT shortens LLM chain-of-thought generation by tracking runs of identical step answers after linguistic markers, cutting tokens 16% on average while keeping accuracy comparable to full CoT across six datasets and three models.
Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization cs.AI · 2025-08-13 · unverdicted · none · ref 20 · internal anchor
LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.
ReasonCache: Accelerating Large Reasoning Model Serving through KV Cache Sharing cs.LG · 2025-07-29 · unverdicted · none · ref 5 · internal anchor
ReasonCache reuses similar KV cache states across reasoning steps in LRMs via collaborative filtering to boost serving throughput by up to 89.2% while preserving accuracy.
