Think-bench: Evaluating thinking efficiency and chain-of-thought quality of large reasoning models

2025 · arXiv 2505.22113

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

use method 1

representative citing papers

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs

cs.CL · 2026-06-27 · unverdicted · novelty 7.0

ThinkProbe builds non-generative Thought Graphs from 4200 LLM traces across 7 models and 200 questions to extract 5D cognitive profiles, finding model-level stability in reasoning structure that exceeds domain effects in four dimensions.

Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

cs.SE · 2026-04-14 · unverdicted · novelty 7.0

CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and simplification.

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

cs.CL · 2025-07-05 · conditional · novelty 7.0

Evaluations of 53 LLMs on 14 basic math tasks show reasoning models use ~18x more tokens with sometimes lower accuracy, non-monotonic gains from extended budgets, and sharp performance drops under token constraints.

RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks

cs.CR · 2026-06-06 · unverdicted · novelty 6.0

RecurGuard monitors recurrence rate, volume growth, and query progress in exposed reasoning traces to terminate generation on token-consumption attacks, reporting 99% detection on OverThink and 92% on ExtendAttack with near-zero false positives.

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

cs.SE · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

CRANE applies magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection to the Thinking-Instruct delta to improve code agent pass rates on Roo-Eval, SWE-bench-Verified, and Terminal-Bench while preserving efficiency.

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

CoRD uses collaborative multi-teacher step-wise decoding with perplexity-guided beam search to generate higher-quality Long-CoT data that lets smaller models reach near-teacher performance with less supervision.

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

cs.AI · 2025-11-28 · unverdicted · novelty 5.0

AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.

citing papers explorer

Showing 7 of 7 citing papers.

ThinkProbe: Beyond Accuracy -- Structural Profiling of Open-Ended LLM Reasoning Traces via Non-Generative Thought Graphs

cs.CL · 2026-06-27 · unverdicted · none · ref 9

Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

cs.SE · 2026-04-14 · unverdicted · none · ref 28

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

cs.CL · 2025-07-05 · conditional · none · ref 16

RecurGuard: Runtime Monitoring for Reasoning-Token Consumption Attacks

cs.CR · 2026-06-06 · unverdicted · none · ref 42

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

cs.SE · 2026-05-13 · unverdicted · none · ref 5 · 2 links

Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding

cs.AI · 2026-05-04 · unverdicted · none · ref 46

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

cs.AI · 2025-11-28 · unverdicted · none · ref 23
