Scaling Evaluation-time Compute with Reasoning Models as Evaluators

Carolin Lawrence; Graham Neubig; Ian Wu; Jinu Lee; Julia Hockenmaier; Kiril Gashteovski; Mingyeong Moon; Sean Welleck; Seongyun Lee; Seungone Kim

arxiv: 2503.19877 · v2 · pith:D444J7UAnew · submitted 2025-03-25 · 💻 cs.CL

Seungone Kim, Ian Wu, Jinu Lee, Xiang Yue, Seongyun Lee, Mingyeong Moon, Carolin Lawrence, Kiril Gashteovski, Julia Hockenmaier, Graham Neubig, Sean Welleck

Pith reviewed 2026-05-22 21:57 UTC · model grok-4.3

keywords test-time compute · reasoning models · evaluation · reranking · language models · process evaluation · chain-of-thought

The pith

Spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether language models can evaluate outputs more effectively by spending more compute during evaluation, much as they solve problems better when given more thinking time. The authors test this by using reasoning models that generate long chains of thought as evaluators, prompting them to assess both the response as a whole and its individual steps. The key finding is that evaluator accuracy improves monotonically as more reasoning tokens are generated. When these evaluators are used to select the best response from multiple candidates, the resulting improvements in problem-solving are comparable to those from scaling up generation compute.

Core claim

By employing reasoning models as evaluators and allowing them to generate more reasoning tokens, their accuracy on both outcome and process evaluation increases monotonically. These more accurate evaluators are then applied to rerank multiple generations from a language model, demonstrating that the resulting boost to problem-solving performance is comparable to the boost obtained by scaling compute at generation time.

What carries the argument

Reasoning models prompted for outcome evaluation and step-by-step process evaluation, whose accuracy scales monotonically with the number of generated reasoning tokens and is applied to rerank candidate responses.
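
The reranking step described above can be sketched as a best-of-N selection. This is a minimal illustration, not the authors' implementation: the `Candidate` type, the `rerank` function, the two toy scoring functions, the equal weighting, and the choice to summarize step scores by their minimum are all assumptions made for the example. In practice the scores would come from a reasoning model prompted as an evaluator.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    steps: List[str]  # reasoning steps of one generated response
    answer: str       # final answer of that response


def rerank(
    candidates: Sequence[Candidate],
    outcome_score: Callable[[Candidate], float],
    step_score: Callable[[Candidate, int], float],
    weight: float = 0.5,
) -> Candidate:
    """Return the candidate with the highest combined evaluator score."""

    def combined(c: Candidate) -> float:
        # Assumption: the weakest step summarizes the process score.
        process = min(
            (step_score(c, i) for i in range(len(c.steps))), default=0.0
        )
        # Assumption: linear mix of outcome and process scores.
        return weight * outcome_score(c) + (1.0 - weight) * process

    return max(candidates, key=combined)


# Hypothetical stand-ins for evaluator calls (not real model outputs).
def toy_outcome(c: Candidate) -> float:
    return 0.9 if c.answer == "4" else 0.2


def toy_step(c: Candidate, i: int) -> float:
    return 0.1 if "guess" in c.steps[i] else 0.8


pool = [
    Candidate(["guess the result"], "4"),  # right answer, weak step
    Candidate(["2 + 2 = 5"], "5"),         # wrong answer
    Candidate(["2 + 2 = 4"], "4"),         # right answer, sound step
]
best = rerank(pool, toy_outcome, toy_step)
print(best)
```

With these toy scores, the candidate whose answer and step both score well is selected over the one that reaches the same answer through a low-scoring step, which is the intuition behind combining outcome and process evaluation.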

Load-bearing premise

The observed monotonic gains in evaluator accuracy from additional reasoning tokens will translate into improved reranking performance that measurably boosts downstream problem-solving.

What would settle it

An experiment in which increasing the number of reasoning tokens generated by the evaluator fails to improve reranking accuracy or downstream task performance.
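
One way to operationalize this test is a simple monotonicity check over (token budget, accuracy) pairs. The sketch below is illustrative only: the function name `improves_monotonically`, the tolerance parameter, and every number in the example lists are placeholders invented for the example, not results from the paper.

```python
from typing import Iterable, Tuple


def improves_monotonically(
    points: Iterable[Tuple[int, float]], tol: float = 0.0
) -> bool:
    """Check that accuracy never drops (beyond tol) as the token budget grows.

    Each point is (reasoning_token_budget, accuracy).
    """
    ordered = sorted(points)  # sort by token budget
    accs = [acc for _, acc in ordered]
    return all(later >= earlier - tol for earlier, later in zip(accs, accs[1:]))


# Placeholder measurements (made up for illustration).
placeholder_ok = [(512, 0.61), (1024, 0.66), (2048, 0.70)]
placeholder_bad = [(512, 0.61), (2048, 0.58), (1024, 0.66)]

print(improves_monotonically(placeholder_ok))
print(improves_monotonically(placeholder_bad))
```

The same check could be applied to reranking accuracy or to downstream task accuracy instead of evaluator accuracy; a failure at any of those levels would count against the load-bearing premise.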

Original abstract

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The abstract proposes scaling test-time compute for reasoning models used as evaluators, including step-level process judgments, and claims that reranking with these evaluators can match generation-time gains, but it supplies no data on whether the accuracy lift actually delivers comparable downstream results.

The main takeaway from this abstract is that scaling test-time compute on reasoning models used as evaluators, through both longer reasoning and process-level assessment, can potentially deliver problem-solving improvements comparable to scaling compute at generation time via reranking. This applies existing test-time scaling concepts to the evaluator role and adds process evaluation as a way to use extra tokens. The monotonic improvement in evaluator accuracy with more reasoning tokens is noted as similar to generation trends. The paper does well in highlighting how better evaluation could reduce the need for larger models or more generation compute in reasoning domains.

The soft spots center on the reranking experiment. The abstract asserts that the improved evaluators make eval-time compute as effective as gen-time compute, but it includes no details on the actual outcomes: no effect sizes for the downstream gains, no specific compute budgets being compared, and no indication of the metrics used to establish equivalence. Without those, the translation from evaluator accuracy to final capability remains unverified, which is the load-bearing part of the argument.

Since the full text is not available here, any assessment stays limited to the abstract's claims. There are no equations or fitted parameters described, so no circularity issues arise from the given information. The work seems to build on prior results in test-time compute without inventing new entities.

This paper is aimed at researchers working on test-time compute allocation and LM evaluation for math, code, and similar reasoning tasks. A reader already running experiments in this area might find the process evaluation approach worth trying out. It would be appropriate to send to peer review if the full paper contains the experimental results with proper controls and reporting, as the question it raises about compute trade-offs is practical and current.

Referee Report

2 major / 0 minor

Summary. The paper claims that reasoning models can serve as improved evaluators when allocated more test-time compute (via longer chain-of-thought and process-level evaluation), that evaluator accuracy improves monotonically with additional reasoning tokens, and that these evaluators can be used for reranking to make evaluation-time compute scaling as effective as generation-time compute scaling for downstream problem-solving performance.

Significance. If the central equivalence claim holds with rigorous evidence, the result would identify a new, previously under-explored axis for compute scaling in LM systems. No machine-checked proofs, reproducible artifacts, or parameter-free derivations are described in the provided text.

major comments (2)

[Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.
[Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point below and will revise the abstract to better support the central claims with quantitative details from the experiments.

Referee: [Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.

Authors: We agree the abstract is high-level and omits specific numbers. The full manuscript reports the relevant effect sizes, token budgets, Pass@1 improvements from reranking, and head-to-head comparisons between evaluation-time and generation-time scaling regimes. We will revise the abstract to include representative quantitative results. revision: yes
Referee: [Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.

Authors: The abstract summarizes the end-to-end result. The body contains the quantitative mapping from evaluator accuracy (as a function of reasoning tokens) to reranking gains and downstream problem-solving metrics, including direct equivalence measurements. We will update the abstract to reference these measured improvements. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct experimental observations

The provided abstract contains no equations, fitted parameters, predictions derived from inputs, or self-citations. It reports monotonic performance gains observed in experiments and a demonstration via reranking that eval-time compute matches gen-time effectiveness. These are presented as empirical results rather than reductions by definition or imported premises. The derivation chain is self-contained against external benchmarks (the experiments themselves), yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or implied in the provided text.

pith-pipeline@v0.9.0 · 5741 in / 991 out tokens · 20057 ms · 2026-05-22T21:57:12.902673+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC · 2025-09 · conditional · novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

Process Rewards with Learned Reliability
cs.CL · 2026-05 · unverdicted · novelty 6.0

BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization
cs.CL · 2025-09 · unverdicted · novelty 6.0

Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models
cs.CL · 2025-09 · unverdicted · novelty 6.0

GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.