A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua · 2025 · cs.CL · arXiv 2503.24235

Mixed citation behavior. Most common role is background (50%).

32 Pith papers citing it

abstract

As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing," has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions. Our repository is available at https://github.com/testtimescaling/testtimescaling.github.io/

citation-role summary

background 4 · method 2 · baseline 1 · dataset 1

citation-polarity summary

background 4 · use method 2 · baseline 1 · use dataset 1

representative citing papers

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

cs.AR · 2026-04-19 · unverdicted · novelty 7.0

Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

Stream-T1: Test-Time Scaling for Streaming Video Generation

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

cs.RO · 2026-05-02 · unverdicted · novelty 6.0

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

cs.AI · 2026-04-29 · unverdicted · novelty 6.0

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdős constructions.

ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

ComPASS creates tool-augmented LLM agents for substantive social support, releases the first personalized benchmark ComPASS-Bench, and fine-tunes ComPASS-Qwen to outperform its base model while matching larger LLMs.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

cs.AI · 2026-04-19 · unverdicted · novelty 6.0

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

cs.LG · 2026-04-16 · unverdicted · novelty 6.0

A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

cs.CL · 2026-03-09 · unverdicted · novelty 6.0

CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

cs.LG · 2026-01-29 · unverdicted · novelty 6.0 · 2 refs

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

cs.CV · 2025-10-23 · unverdicted · novelty 6.0

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

cs.AI · 2025-09-30 · unverdicted · novelty 6.0

Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

cs.LG · 2025-08-27 · conditional · novelty 6.0

GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

cs.CL · 2025-08-06 · unverdicted · novelty 6.0

ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.

HSUGA: LLM-Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware Alignment

cs.IR · 2026-05-12 · unverdicted · novelty 5.0

HSUGA improves LLM-enhanced sequential recommendation via staged hierarchical semantic understanding for better preference extraction and group-aware alignment that varies intensity by user activity level.

BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.

citing papers explorer

Showing 32 of 32 citing papers.

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

cs.CL · 2026-05-08 · unverdicted · none · ref 39 · internal anchor

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.

Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair

cs.AR · 2026-04-19 · unverdicted · none · ref 30 · internal anchor

Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · none · ref 58 · internal anchor

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · none · ref 29 · internal anchor

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos

cs.CV · 2026-05-18 · unverdicted · none · ref 31 · internal anchor

MIGA introduces two-stage alignment to close train-inference gaps and dual consistency enhancement via self-reflection and long-range guidance to achieve SOTA temporal consistency in infinite-frame video generation on VBench and NarrLV.

Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

cs.AI · 2026-05-12 · unverdicted · none · ref 51 · internal anchor

BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs

cs.AI · 2026-05-09 · unverdicted · none · ref 4 · internal anchor

OPT-BENCH trains LLMs on NP-hard optimization via quality-aware RLVR, achieving 93.1% success rate and 46.6% quality ratio on Qwen2.5-7B while outperforming GPT-4o and transferring gains to other domains.

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

cs.LG · 2026-05-08 · unverdicted · none · ref 42 · 2 links · internal anchor

HyperEyes presents a parallel multimodal search agent using dual-grained efficiency-aware RL with a new TRACE reward and IMEB benchmark, claiming 9.9% higher accuracy and 5.3x fewer tool calls than prior open-source agents.

Stream-T1: Test-Time Scaling for Streaming Video Generation

cs.CV · 2026-05-06 · unverdicted · none · ref 52 · internal anchor

Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve temporal consistency and visual quality.

VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model

cs.RO · 2026-05-02 · unverdicted · none · ref 30 · internal anchor

VLA-ATTC equips VLA models with adaptive test-time compute via an uncertainty clutch and relative action critic, cutting failure rates by over 50% on LIBERO-LONG.

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

cs.AI · 2026-04-29 · unverdicted · none · ref 40 · internal anchor

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

Evaluation-driven Scaling for Scientific Discovery

cs.LG · 2026-04-21 · unverdicted · none · ref 172 · internal anchor

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdős constructions.

ComPASS: Towards Personalized Agentic Social Support via Tool-Augmented Companionship

cs.CL · 2026-04-20 · unverdicted · none · ref 39 · internal anchor

ComPASS creates tool-augmented LLM agents for substantive social support, releases the first personalized benchmark ComPASS-Bench, and fine-tunes ComPASS-Qwen to outperform its base model while matching larger LLMs.

Hive: A Multi-Agent Infrastructure for Algorithm- and Task-Level Scaling

cs.AI · 2026-04-19 · unverdicted · none · ref 45 · internal anchor

Hive is a multi-agent infrastructure with a logits cache for reducing cross-path redundancy in sampling and agent-aware scheduling for better compute and KV-cache allocation, shown to deliver 1.11x-1.76x speedups and 33%-51% lower hotspot miss rates.

Adaptive Test-Time Compute Allocation for Reasoning LLMs via Constrained Policy Optimization

cs.LG · 2026-04-16 · unverdicted · none · ref 19 · internal anchor

A Lagrangian-relaxation plus imitation-learning pipeline adaptively allocates test-time compute to LLMs, outperforming uniform baselines by up to 12.8% relative accuracy on MATH while staying within a fixed average budget.

MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation

cs.AI · 2026-04-16 · unverdicted · none · ref 30 · internal anchor

MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.

CODA: Difficulty-Aware Compute Allocation for Adaptive Reasoning

cs.CL · 2026-03-09 · unverdicted · none · ref 44 · internal anchor

CODA uses rollout-based difficulty signals to drive two gates that penalize verbosity on easy instances and promote deliberation on hard ones, cutting token use over 60% on simple tasks while maintaining accuracy.

ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

cs.LG · 2026-01-29 · unverdicted · none · ref 38 · 2 links · internal anchor

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.

RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling

cs.CV · 2025-10-23 · unverdicted · none · ref 3 · internal anchor

RAPO++ is a three-stage prompt optimization framework combining retrieval-augmented refinement, closed-loop test-time scaling, and LLM fine-tuning to enhance text-to-video generation quality.

Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training

cs.AI · 2025-09-30 · unverdicted · none · ref 55 · internal anchor

Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, and controllable think models relying on compensatory heads instead of specific

Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs

cs.LG · 2025-08-27 · conditional · none · ref 43 · internal anchor

GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.

ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

cs.CL · 2025-08-06 · unverdicted · none · ref 28 · internal anchor

ReasoningGuard is an inference-time method that uses attention mechanisms to inject safety aha moments and scaling sampling to defend large reasoning models against jailbreak attacks.

HSUGA: LLM-Enhanced Recommendation with Hierarchical Semantic Understanding and Group-Aware Alignment

cs.IR · 2026-05-12 · unverdicted · none · ref 67 · internal anchor

HSUGA improves LLM-enhanced sequential recommendation via staged hierarchical semantic understanding for better preference extraction and group-aware alignment that varies intensity by user activity level.

BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

cs.AI · 2026-05-07 · unverdicted · none · ref 17 · internal anchor

BitCal-TTS raises exact-match accuracy by 3.7 points (7B) and 2.8 points (14B) on small GSM8K shards for 4-bit Qwen2.5 models while cutting premature-stop rates and retaining token savings versus fixed-budget decoding.

Training-Free Test-Time Contrastive Learning for Large Language Models

cs.CL · 2026-04-15 · unverdicted · none · ref 10 · internal anchor

TF-TTCL lets frozen LLMs adapt online by distilling textual rules from contrastive reasoning trajectories generated via multi-agent augmentation and applying them through retrieval-based steering.

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

cs.CV · 2026-04-13 · unverdicted · none · ref 41 · internal anchor

TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

q-bio.QM · 2026-04-07 · unverdicted · none · ref 38 · internal anchor

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

cs.AI · 2025-08-13 · unverdicted · none · ref 32 · internal anchor

LCPO reduces average LRM output length by over 50% across benchmarks via targeted preference optimization while preserving reasoning performance.

The Agent Use of Agent Beings: Agent Cybernetics Is the Missing Science of Foundation Agents

cs.AI · 2026-05-11 · unverdicted · none · ref 49 · internal anchor

Agent Cybernetics reframes foundation agent design by adapting classical cybernetics laws into three engineering desiderata for reliable, long-running, self-improving agents.

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

cs.AI · 2025-07-28 · accept · none · ref 203 · internal anchor

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

cs.CL · 2026-01-20 · unreviewed · ref 50 · internal anchor

Language Model Networks: Supervision-Efficient Learning through Dense Communication

cs.AI · 2025-05-19 · unreviewed · ref 57 · internal anchor
