Scaling Evaluation-time Compute with Reasoning Models as Evaluators
Pith reviewed 2026-05-22 21:57 UTC · model grok-4.3
The pith
Spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When reasoning models are employed as evaluators and allowed to generate more reasoning tokens, their accuracy on both outcome and process evaluation increases monotonically. These more accurate evaluators are then used to rerank multiple generations from a language model, and the resulting boost to problem-solving performance is comparable to the boost obtained by scaling compute at generation time.
What carries the argument
Reasoning models prompted for outcome evaluation and step-by-step process evaluation. Their accuracy scales monotonically with the number of generated reasoning tokens, and their judgments are used to rerank candidate responses.
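The reranking step can be illustrated with a minimal best-of-N sketch. It assumes the evaluator can be reduced to a function that maps a response to a scalar score. The names `rerank_best_of_n` and `toy_evaluator` are hypothetical and do not come from the paper, and the toy scorer merely stands in for a reasoning-model evaluator.

```python
from typing import Callable, Sequence


def rerank_best_of_n(candidates: Sequence[str],
                     score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest evaluator score.

    Ties are broken in favour of the earliest candidate, which is
    how Python's built-in max() behaves.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def toy_evaluator(response: str) -> float:
    """Hypothetical stand-in for a reasoning-model evaluator.

    A real evaluator would generate reasoning tokens and then emit a
    judgment; here a response is 'correct' if it ends with '42'.
    """
    return 1.0 if response.strip().endswith("42") else 0.0


candidates = ["The answer is 41", "The answer is 42", "I think it is 40"]
best = rerank_best_of_n(candidates, toy_evaluator)
print(best)
```

In this framing, spending more evaluation-time compute amounts to making `score_fn` more accurate, while spending more generation-time compute amounts to enlarging `candidates`.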
Load-bearing premise
The observed monotonic gains in evaluator accuracy from additional reasoning tokens will translate into improved reranking performance that measurably boosts downstream problem-solving.
What would settle it
An experiment in which increasing the number of reasoning tokens generated by the evaluator fails to improve reranking accuracy or downstream task performance.
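Such a check could be sketched as follows, assuming the experiment yields pairs of (reasoning-token budget, measured accuracy). The function name, the tolerance parameter, and the numbers are illustrative placeholders, not results from the paper.

```python
from typing import Sequence, Tuple


def is_monotone_non_decreasing(points: Sequence[Tuple[int, float]],
                               tolerance: float = 0.0) -> bool:
    """Check that accuracy never drops as the token budget grows.

    points: (token_budget, accuracy) pairs in any order.
    tolerance: allowed drop between consecutive budgets, to absorb
    measurement noise (an assumption of this sketch).
    """
    ordered = sorted(points)  # sort by token budget
    return all(later[1] >= earlier[1] - tolerance
               for earlier, later in zip(ordered, ordered[1:]))


# Placeholder measurements, invented purely for illustration.
improving = [(512, 0.5), (1024, 0.6), (2048, 0.7)]
regressing = [(512, 0.5), (1024, 0.6), (2048, 0.55)]

print(is_monotone_non_decreasing(improving))
print(is_monotone_non_decreasing(regressing))
```

A run resembling `regressing`, on either reranking accuracy or downstream task performance, would be the kind of evidence that counts against the premise.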
Original abstract
As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.
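The distinction between outcome and process evaluation can be made concrete with a small sketch. The abstract does not say how step-level judgments are aggregated or combined with the outcome judgment, so the minimum over steps and the weighted sum below are assumptions chosen for illustration, and all names are hypothetical.

```python
from typing import Sequence


def process_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores into one process score.

    Assumption of this sketch: a solution is only as sound as its
    weakest step, so the minimum is used.
    """
    if not step_scores:
        raise ValueError("need at least one step score")
    return min(step_scores)


def combined_score(outcome_score: float,
                   step_scores: Sequence[float],
                   weight: float = 0.5) -> float:
    """Blend outcome and process scores with a hypothetical weight."""
    return weight * outcome_score + (1.0 - weight) * process_score(step_scores)


# A response whose final answer is judged correct (1.0) but whose
# second step is judged wrong (0.0).
print(process_score([1.0, 0.0, 1.0]))
print(combined_score(1.0, [1.0, 0.0, 1.0]))
```

Under these assumptions, a response that reaches the right answer through a flawed step is ranked below one whose every step is judged sound.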
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reasoning models can serve as improved evaluators when allocated more test-time compute (via longer chain-of-thought and process-level evaluation), that evaluator accuracy improves monotonically with additional reasoning tokens, and that these evaluators can be used for reranking to make evaluation-time compute scaling as effective as generation-time compute scaling for downstream problem-solving performance.
Significance. If the central equivalence claim holds with rigorous evidence, the result would identify a new, previously under-explored axis for compute scaling in LM systems. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described in the provided text.
Major comments (2)
- [Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.
- [Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.
Simulated Author's Rebuttal
We thank the referee for the comments. We address each major point below and will revise the abstract to better support the central claims with quantitative details from the experiments.
Point-by-point responses
- Referee: [Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.
  Authors: We agree the abstract is high-level and omits specific numbers. The full manuscript reports the relevant effect sizes, token budgets, Pass@1 improvements from reranking, and head-to-head comparisons between evaluation-time and generation-time scaling regimes. We will revise the abstract to include representative quantitative results.
  Revision: yes
- Referee: [Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.
  Authors: The abstract summarizes the end-to-end result. The body contains the quantitative mapping from evaluator accuracy (as a function of reasoning tokens) to reranking gains and downstream problem-solving metrics, including direct equivalence measurements. We will update the abstract to reference these measured improvements.
  Revision: yes
Circularity Check
No circularity; claims rest on direct experimental observations
Full rationale
The provided abstract contains no equations, fitted parameters, predictions derived from inputs, or self-citations. It reports monotonic performance gains observed in experiments and a demonstration via reranking that eval-time compute matches gen-time effectiveness. These are presented as empirical results rather than reductions by definition or imported premises. The derivation chain is self-contained against external benchmarks (the experiments themselves), yielding no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 4 Pith papers
- Evalet: Evaluating Large Language Models through Functional Fragmentation
  Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
- Process Rewards with Learned Reliability
  BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.
- On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
  Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.
- GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
  GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.