
arxiv: 2503.19877 · v2 · pith:D444J7UAnew · submitted 2025-03-25 · 💻 cs.CL

Scaling Evaluation-time Compute with Reasoning Models as Evaluators

Pith reviewed 2026-05-22 21:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords test-time compute · reasoning models · evaluation · reranking · language models · process evaluation · chain-of-thought

The pith

Spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether language models can evaluate outputs more effectively by spending more compute during evaluation, much as they solve problems better with more thinking time. The authors test this by using reasoning models that generate long chains of thought as evaluators, prompting them to assess both the response as a whole and its individual steps. The key finding is that evaluator accuracy improves monotonically as more reasoning tokens are generated. When these evaluators are used to select the best response from multiple candidates, the resulting improvements in problem-solving are comparable to those from scaling up generation compute.

Core claim

By employing reasoning models as evaluators and allowing them to generate more reasoning tokens, their accuracy on both outcome and process evaluation increases monotonically. These more accurate evaluators are then applied to rerank multiple generations from a language model, demonstrating that the resulting boost to problem-solving performance is comparable to the boost obtained by scaling compute at generation time.

What carries the argument

Reasoning models prompted for outcome evaluation and step-by-step process evaluation, whose accuracy scales monotonically with the number of generated reasoning tokens and is applied to rerank candidate responses.
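
As a rough illustration of the reranking step, the sketch below picks the best of several candidate responses using an outcome score and per-step process scores. The `Candidate` type, the `min` aggregation over step scores, and the `weight` blend are assumptions made for illustration; the text above does not say how the paper combines the two signals.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    """One sampled response with hypothetical evaluator scores in [0, 1]."""
    text: str
    outcome_score: float          # evaluator's judgment of the whole response
    step_scores: Sequence[float]  # evaluator's judgment of each step


def process_score(step_scores: Sequence[float]) -> float:
    """Aggregate step scores with min (an assumption: one bad step sinks the chain)."""
    return min(step_scores) if step_scores else 0.0


def combined_score(c: Candidate, weight: float = 0.5) -> float:
    """Blend outcome and process scores; `weight` is a hypothetical knob."""
    return weight * c.outcome_score + (1.0 - weight) * process_score(c.step_scores)


def rerank(candidates: List[Candidate], weight: float = 0.5) -> List[Candidate]:
    """Return candidates sorted from best to worst by combined score."""
    return sorted(candidates, key=lambda c: combined_score(c, weight), reverse=True)


# Made-up scores for illustration only.
candidates = [
    Candidate("A", outcome_score=0.9, step_scores=[0.9, 0.2, 0.8]),
    Candidate("B", outcome_score=0.7, step_scores=[0.8, 0.7, 0.9]),
    Candidate("C", outcome_score=0.4, step_scores=[0.6, 0.5]),
]
best = rerank(candidates)[0]
print(best.text)
```

In this toy example the process signal changes the choice: candidate A has the highest outcome score but contains one weak step, so the blended score prefers B. Setting `weight=1.0` reduces the sketch to outcome-only reranking.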

Load-bearing premise

The observed monotonic gains in evaluator accuracy from additional reasoning tokens will translate into improved reranking performance that measurably boosts downstream problem-solving.

What would settle it

An experiment in which increasing the number of reasoning tokens generated by the evaluator fails to improve reranking accuracy or downstream task performance.
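
One way to make this test concrete is to record evaluator accuracy at several reasoning-token budgets and check whether the curve ever fails to rise. The sketch below does this for a list of (budget, accuracy) pairs. The function names, the `tol` parameter, and the numbers in the example are hypothetical placeholders, not results from the paper.

```python
from typing import List, Tuple

Curve = List[Tuple[int, float]]  # (reasoning-token budget, accuracy)


def violations(curve: Curve, tol: float = 0.0) -> List[Tuple[int, int]]:
    """Return adjacent budget pairs where accuracy dropped by more than tol."""
    ordered = sorted(curve)  # sort by token budget
    drops = []
    for (b_lo, acc_lo), (b_hi, acc_hi) in zip(ordered, ordered[1:]):
        if acc_hi < acc_lo - tol:
            drops.append((b_lo, b_hi))
    return drops


def is_monotonic(curve: Curve, tol: float = 0.0) -> bool:
    """True if accuracy never drops (beyond tol) as the budget grows."""
    return not violations(curve, tol)


# Placeholder numbers for illustration only.
rising = [(1024, 0.61), (2048, 0.66), (4096, 0.70)]
dipping = [(1024, 0.61), (2048, 0.58), (4096, 0.70)]

print(is_monotonic(rising), violations(dipping))
```

The `tol` parameter leaves room for run-to-run noise; how large a drop should count as a real violation is a judgment call that the text above does not address.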

Original abstract

As language model (LM) outputs get more and more natural, it is becoming more difficult than ever to evaluate their quality. Simultaneously, increasing LMs' "thinking" time through scaling test-time compute has proven an effective technique to solve challenging problems in domains such as math and code. This raises a natural question: can an LM's evaluation capability also be improved by spending more test-time compute? To answer this, we investigate employing reasoning models (LMs that natively generate long chain-of-thought reasoning) as evaluators. Specifically, we examine methods to leverage more test-time compute by (1) using reasoning models, and (2) prompting these models to evaluate not only the response as a whole (i.e., outcome evaluation) but also assess each step in the response separately (i.e., process evaluation). In experiments, we observe that the evaluator's performance improves monotonically when generating more reasoning tokens, similar to the trends observed in LM-based generation. Furthermore, we use these more accurate evaluators to rerank multiple generations, and demonstrate that spending more compute at evaluation time can be as effective as using more compute at generation time in improving an LM's problem-solving capability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that reasoning models can serve as improved evaluators when allocated more test-time compute (via longer chain-of-thought and process-level evaluation), that evaluator accuracy improves monotonically with additional reasoning tokens, and that these evaluators can be used for reranking to make evaluation-time compute scaling as effective as generation-time compute scaling for downstream problem-solving performance.

Significance. If the central equivalence claim holds up under rigorous evidence, the result would identify an under-explored axis for compute scaling in LM systems. The provided text describes no machine-checked proofs, reproducible artifacts, or parameter-free derivations.

major comments (2)
  1. [Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.
  2. [Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the comments. We address each major point below and will revise the abstract to better support the central claims with quantitative details from the experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract (final sentence): the assertion that 'spending more compute at evaluation time can be as effective as using more compute at generation time' is load-bearing for the paper's contribution, yet the abstract supplies no effect sizes, compute budgets, reranking metrics, or direct comparisons between the two regimes.

    Authors: We agree the abstract is high-level and omits specific numbers. The full manuscript reports the relevant effect sizes, token budgets, Pass@1 improvements from reranking, and head-to-head comparisons between evaluation-time and generation-time scaling regimes. We will revise the abstract to include representative quantitative results. revision: yes

  2. Referee: [Abstract] Abstract (reranking experiment description): the manuscript states that more accurate evaluators 'enable reranking' that produces the claimed equivalence, but provides no quantitative results on how monotonic evaluator-accuracy gains translate into measurable downstream problem-solving improvements, leaving the weakest assumption untested.

    Authors: The abstract summarizes the end-to-end result. The body contains the quantitative mapping from evaluator accuracy (as a function of reasoning tokens) to reranking gains and downstream problem-solving metrics, including direct equivalence measurements. We will update the abstract to reference these measured improvements. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on direct experimental observations

Full rationale

The provided abstract contains no equations, fitted parameters, predictions derived from inputs, or self-citations. It reports monotonic performance gains observed in experiments and a demonstration via reranking that eval-time compute matches gen-time effectiveness. These are presented as empirical results rather than reductions by definition or imported premises. The derivation chain is self-contained against external benchmarks (the experiments themselves), yielding no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are stated or implied in the provided text.

pith-pipeline@v0.9.0 · 5741 in / 991 out tokens · 20057 ms · 2026-05-22T21:57:12.902673+00:00 · methodology


Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Evalet: Evaluating Large Language Models through Functional Fragmentation

    cs.HC 2025-09 conditional novelty 7.0

    Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

  2. Process Rewards with Learned Reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

  3. On the Shelf Life of Fine-Tuned LLM-Judges: Future-Proofing, Backward-Compatibility, and Question Generalization

    cs.CL 2025-09 unverdicted novelty 6.0

    Fine-tuned LLM judges struggle with future-proofing to newer generators but maintain backward-compatibility more easily; DPO training and continual learning improve adaptation while all models degrade on unseen questions.

  4. GrACE: A Generative Approach to Better Confidence Elicitation and Efficient Test-Time Scaling in Large Language Models

    cs.CL 2025-09 unverdicted novelty 6.0

    GrACE is a fine-tuned generative method that uses similarity to a special token embedding for real-time calibrated confidence in LLMs and enables efficient confidence-based test-time scaling.