pith. machine review for the scientific record.

arxiv: 2604.26954 · v1 · submitted 2026-04-03 · 💻 cs.CY · cs.AI

Recognition: 2 theorem links


The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:14 UTC · model grok-4.3

classification 💻 cs.CY cs.AI
keywords automated scoring · LLM · self-consistency · reasoning effort · educational assessment · model selection · cost efficiency · high school mathematics

The pith

Choosing the right model and reasoning level beats larger ensembles for accurate low-cost LLM scoring of math conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests self-consistency through majority voting and varying reasoning effort when large language models score high school mathematics student conversations. Temperature sampling raised accuracy over single deterministic outputs, yet expanding ensembles from one to seven samples added no significant gains. Reasoning effort produced a positive accuracy trend that differed across model families, and efficiency analysis pinpointed one high-end configuration as most accurate and two low-cost ones as best value. These patterns matter because automated scoring could expand educational assessment only if it remains both reliable and affordable without unnecessary compute.

Core claim

Strategic model selection and reasoning settings outperform ensembling for optimizing automated scoring with LLMs: temperature sampling improves accuracy while larger ensembles do not; higher reasoning effort raises accuracy in a model-dependent linear trend; and specific low-cost models without reasoning achieve the strongest cost-performance balance, as measured on 900 conversations against human ground truth.
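Accuracy throughout is agreement with human scores, reported as Cohen's κ in the paper's result tables. For orientation only, a minimal κ computation in Python (a sketch; the paper's pipeline ran in R):

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected from each
    rater's marginal label frequencies.
    """
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[lab] / n) * (counts_b[lab] / n) for lab in labels)
    return (p_o - p_e) / (1 - p_e)
```

A κ of 0 means agreement no better than chance; the human-human interrater values the paper reports (Fleiss' κ between roughly 0.66 and 0.91 per criterion) set the ceiling LLM scorers are compared against.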

What carries the argument

Intra-model self-consistency via majority voting across multiple temperature-sampled responses, combined with adjustable reasoning effort levels across frontier and low-cost models.
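As an illustration of that machinery (not the paper's code, which ran in R via the ellmer package), the voting step can be sketched as follows; the tie-break rule is our own assumption, since the paper's is not quoted:

```python
from collections import Counter

def majority_vote(scores):
    """Modal score across j temperature-sampled responses.

    Tie-break: lowest score wins. This rule is an assumption for the
    sketch; the paper does not specify its tie-breaking behavior.
    """
    counts = Counter(scores)
    top = max(counts.values())
    return min(s for s, c in counts.items() if c == top)

def score_with_self_consistency(sample_score, j=5):
    """Intra-model self-consistency: call one temperature-sampled
    scorer j times and take the majority vote.

    `sample_score` is a stand-in for a single LLM scoring call that
    returns an integer criterion score.
    """
    return majority_vote([sample_score() for _ in range(j)])
```

The paper's finding is that raising j from 1 to 7 in this loop adds no significant accuracy over a single temperature-sampled call.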

If this is right

  • Temperature sampling significantly improves scoring accuracy over deterministic calls.
  • Ensemble sizes from one to seven produce no significant accuracy gains.
  • Higher reasoning effort yields a significant positive linear trend in accuracy that varies by model family.
  • Gemini 3.1 Pro Preview at low reasoning achieves the highest accuracy at higher cost.
  • GPT-5.4 Nano and Mini with no reasoning deliver the best cost-performance balance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If patterns hold across subjects, scoring systems could default to single low-cost model calls without ensembles for routine tasks.
  • Model-specific reasoning settings may cut total compute costs in large-scale assessment deployments.
  • Extending tests to non-quantitative subjects would show whether reasoning effort benefits are content-specific.
  • The efficiency frontier supplies a practical selection rule for choosing LLM setups under budget limits.
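The frontier in that last point is ordinary Pareto dominance on (cost, accuracy). A minimal sketch: the κ values and the two GPT costs below follow the paper's tables, while the two Gemini costs are illustrative placeholders, not the paper's figures:

```python
def efficiency_frontier(configs):
    """Keep configurations not Pareto-dominated on (cost, accuracy).

    `configs` maps a name to (cost_per_1k_calls_usd, kappa). A config
    is dominated if some other config is at least as cheap and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for name, (cost, acc) in configs.items():
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for other, (c, a) in configs.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

configs = {
    # kappa values and the two GPT costs follow the paper's tables;
    # the two Gemini costs are assumed for illustration only
    "GPT-5.4 Nano / none": (0.27, 0.750),
    "GPT-5.4 Mini / none": (0.95, 0.756),
    "Gemini 3 Flash / minimal": (1.20, 0.713),  # cost assumed
    "Gemini 3.1 Pro / low": (5.00, 0.794),      # cost assumed
}
# Flash is dominated here (Mini is cheaper and more accurate);
# the other three configurations sit on the frontier.
```

Under a budget cap, the selection rule is then to pick the most accurate frontier configuration whose cost fits.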

Load-bearing premise

The observed accuracy trends with reasoning effort and the lack of benefit from larger ensembles will hold for other models, tasks, or datasets beyond the tested high school mathematics conversations.

What would settle it

A study on a new dataset of student responses showing clear accuracy gains from ensembles of size four or larger would falsify the claim that increasing ensemble size adds no value.

read the original abstract

Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that strategic model selection and reasoning settings are more effective than self-consistency ensembling for optimizing LLM-based automated scoring accuracy and cost. Experiments on 900 high-school mathematics conversations against human ground truth, using OpenAI and Google models, show that temperature sampling improves accuracy over deterministic decoding, increasing ensemble size from j=1 to j=7 yields no significant accuracy gains, higher reasoning effort produces a positive linear accuracy trend (varying by model family), and an efficiency-frontier analysis identifies Gemini 3.1 Pro Preview (low reasoning) as highest-accuracy but costly while GPT-5.4 Nano/Mini (no reasoning) offer the best cost-performance balance.

Significance. If the empirical patterns hold, the work supplies practical guidance for educational-technology deployments of LLMs in conversation-based assessment, showing that modest increases in reasoning effort and careful model choice can outperform computationally heavier ensembles while controlling cost.

major comments (2)
  1. [Abstract] Abstract and Results: the central claim that model selection and reasoning effort outperform ensembling rests on the observed absence of accuracy gains for j=1 to j=7 and the linear reasoning-effort trend; both patterns are reported only for high-school mathematics conversations and OpenAI/Google models, so the broad assertion that these strategies are 'more effective than ensembling' for automated scoring in general requires either additional domains or an explicit scope limitation.
  2. [Results] Results section: the statements 'no significant gains' and 'significant positive linear trend' are load-bearing yet lack the supporting statistical quantities (exact p-values, effect sizes, confidence intervals, and power calculations) needed to evaluate them; without these details the reader cannot confirm that the ensemble-size result is not simply under-powered.
minor comments (2)
  1. [Abstract] Abstract: define the discrete levels of 'reasoning effort' and how they map to model-specific parameters (e.g., chain-of-thought length or token budget) so the linear-trend claim is reproducible.
  2. Efficiency-frontier paragraph: state the exact API pricing and token-counting conventions used to compute cost, including the date of pricing data, to allow readers to replicate the frontier.
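The cost convention the paper's tables appear to follow is simple to state: prices are USD per million tokens, and reasoning tokens are billed as output tokens. A sketch of that arithmetic, using prices quoted in the paper's cost table:

```python
def cost_per_1k_calls(input_tokens, billed_output_tokens,
                      input_price_per_m, output_price_per_m):
    """Estimated API cost (USD) per 1,000 scoring calls.

    Prices are per million tokens, the convention implied by the
    paper's cost tables; reasoning tokens count as billed output.
    """
    per_call = (input_tokens * input_price_per_m
                + billed_output_tokens * output_price_per_m) / 1_000_000
    return round(1000 * per_call, 2)

# GPT-5.4 Nano without reasoning: 1030 input and 51 billed output
# tokens at $0.20 / $1.25 per million reproduces the paper's $0.27
nano_none = cost_per_1k_calls(1030, 51, 0.20, 1.25)   # 0.27
mini_none = cost_per_1k_calls(1030, 40, 0.75, 4.50)   # 0.95
```

Stating the pricing date alongside such a formula, as the referee asks, is what makes the frontier replicable once prices change.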

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We agree that the scope of our claims requires explicit limitation and that additional statistical details are needed to support the key results. We will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Results: the central claim that model selection and reasoning effort outperform ensembling rests on the observed absence of accuracy gains for j=1 to j=7 and the linear reasoning-effort trend; both patterns are reported only for high-school mathematics conversations and OpenAI/Google models, so the broad assertion that these strategies are 'more effective than ensembling' for automated scoring in general requires either additional domains or an explicit scope limitation.

    Authors: We agree that the experiments are confined to high-school mathematics conversations and the specific OpenAI and Google models tested. We will revise the abstract, title, and discussion to explicitly limit the scope of the claims to this domain and model set, removing any implication of general applicability to automated scoring. revision: yes

  2. Referee: [Results] Results section: the statements 'no significant gains' and 'significant positive linear trend' are load-bearing yet lack the supporting statistical quantities (exact p-values, effect sizes, confidence intervals, and power calculations) needed to evaluate them; without these details the reader cannot confirm that the ensemble-size result is not simply under-powered.

    Authors: We will add the requested statistical details to the Results section, including exact p-values, effect sizes, confidence intervals, and a post-hoc power analysis for the ensemble-size experiments to substantiate the 'no significant gains' finding and the linear trend for reasoning effort. revision: yes
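The under-powering concern can be made concrete with a back-of-envelope calculation. A rough sketch, assuming a two-sided z-test on raw agreement rates under a normal approximation; the paper's actual power analysis, promised above, may use a different method:

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided z-test comparing agreement
    rates p1 and p2 with n observations per condition.

    Normal approximation: power = P(|Z| > z_crit) under the
    alternative that the true difference is p2 - p1.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z = abs(p2 - p1) / se
    return NormalDist().cdf(z - z_crit) + NormalDist().cdf(-z - z_crit)

# With 900 observations per condition, a one-point difference in
# agreement (0.75 vs 0.76) is detected well under 20% of the time,
# so a null ensemble-size result alone cannot rule out small gains.
low_power = power_two_proportions(0.75, 0.76, 900)
```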

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of LLM outputs to human scores

full rationale

The paper conducts direct experiments on 900 high-school mathematics conversations, measuring accuracy against human ground truth for different models, temperatures, reasoning efforts, and ensemble sizes. No equations, derivations, or predictions are presented that reduce to fitted parameters or self-citations by construction. All claims rest on observed statistical trends from the experimental data rather than any self-referential loop or renamed input. This is a standard empirical evaluation with no load-bearing self-citation chains or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The findings depend on the assumption that the 900 conversations are representative and that model behaviors are consistent.

axioms (1)
  • domain assumption The human-scored ground truths provide an accurate benchmark for model performance.
    All accuracy claims are relative to these human scores.

pith-pipeline@v0.9.0 · 5429 in / 1023 out tokens · 50013 ms · 2026-05-13T19:14:23.125180+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family

    produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-p...

  2. [2]

    We did not, however, compare self-consistency against a single judge or other ensemble sizes

    and found high performance across multiple LLMs for dichotomously scored high school mathematics and English language arts (ELA) short answer questions. We did not, however, compare self-consistency against a single judge or other ensemble sizes. Recent empirical work complicates this expectation. Xue et al. (2026) explored the performance of five models ...

  3. [3]

    produced less than 1% improvement in quadratic weighted kappa (QWK) over a single call when scoring short response items on 2–4 point rubrics—a result that was not statistically significant. The problem appears to be a decoupling between consistency and accuracy: individual models score with near-perfect self-consistency (especially at temp = 0), but that...

  4. [4]

    Explain Your Thinking

    that may be difficult to justify for large-scale educational assessments. Self-consistency and higher reasoning effort may improve performance on benchmark tasks, but these methods have not yet reliably improved automated scoring in educational contexts. Prior investigations focused primarily on polytomously scored essay or short response questions in sci...

  5. [5]

    The theoretical argument holds that multiple reasoning paths should converge on a more accurate solution (i.e., score)

    gives reason to temper that expectation. The theoretical argument holds that multiple reasoning paths should converge on a more accurate solution (i.e., score). There is, however, a fundamental difference between solving reasoning tasks with an objective solution and evaluating student responses against a rubric, which requires subjective interpretation. ...

  6. [6]

    From Google: the frontier Gemini 3.1 Pro Preview (released February 19,

    and the low-cost variants GPT-5.4-Mini and GPT-5.4-Nano (both released March 17, 2026). From Google: the frontier Gemini 3.1 Pro Preview (released February 19,

  7. [7]

    All models were accessed via the OpenAI and Google APIs in R using the ellmer package, v.0.3.2 (Wickham et al., 2025)

    and the low-cost variant Gemini 3 Flash Preview (released December 17, 2025). All models were accessed via the OpenAI and Google APIs in R using the ellmer package, v.0.3.2 (Wickham et al., 2025). 2.2 Data Three EYT items from high school mathematics were included: two from Algebra I and one from Geometry. Each item has two parts. Part 1 is a selected res...

  8. [8]

    none", so all three GPT variants were run at that setting. For Gemini, only the Flash variant at its lowest thinking level (

    EYT Criterion-Level Human–Human Interrater Reliability (Fleiss' κ):
    Item 1: criterion 1 = 0.858, criterion 2 = 0.713
    Item 2: criterion 1 = 0.797, criterion 2 = 0.662, criterion 3 = 0.749
    Item 3: criterion 1 = 0.800, criterion 2 = 0.910
    2.3 Prompt. Khan Academy developed the scoring prompt through iterative prompt engineering. The prompt scores one criterion per call; a conversation with three criteria requires three separate calls. T...

  9. [9]

    The GLMM preserves the observation-level variance structure that aggregate κ discards

    Model Performance (Cohen's κ) by LLM and Ensemble Size at Lowest-Effort Reasoning:
    Model | Reasoning | temp=0, j=1 | temp=1, j=1 | j=3 | j=5 | j=7
    GPT-5.4 Nano | none | 0.564 | 0.750 | 0.757 | 0.761 | 0.761
    GPT-5.4 Mini | none | 0.668 | 0.756 | 0.772 | 0.766 | 0.767
    GPT-5.4 | none | 0.708 | 0.701 | 0.714 | 0.714 | 0.714
    Gemini 3 Flash Preview | minimal | 0.714 | 0.713 | 0.717 | 0.717 | 0.720
    We fit two...

  10. [10]

    minimal") to

    Model \ Reasoning Level | none/minimal | low | medium | high
    GPT-5.4 Nano | 0.750 | 0.700 | 0.743 | 0.742
    GPT-5.4 Mini | 0.756 | 0.705 | 0.752 | 0.763
    GPT-5.4 | 0.701 | 0.760 | 0.759 | 0.763
    Gemini 3 Flash Preview | 0.713 | 0.710 | 0.782 | 0.771
    Gemini 3.1 Pro Preview | — | 0.794 | 0.788 | 0.793
    For R2, we fit a single GLMM pooling both GPT and Gemini conditions (Table A3 in the Appendix), with reasoning...

  11. [11]

    We therefore used a single pooled 1 Note

    API Cost per Model¹
    Model | Input Price | Output Price
    GPT-5.4 Nano | $0.20 | $1.25
    GPT-5.4 Mini | $0.75 | $4.50
    GPT-5.4 | $2.50 | $15.00
    Gemini 3 Flash Preview | $0.50 | $3.00
    Gemini 3.1 Pro Preview | $2.00 | $12.00
    Input token counts are determined by the system prompt and conversation text, which are identical across conditions for the same criterion. We therefore used a sin...

  12. [12]

    Input Tokens Avg

    Estimated API Cost per 1,000 Calls by Model and Reasoning Level²
    Model | Reasoning | Avg. Input Tokens | Avg. Output Tokens | Avg. Reasoning Tokens | Billed Output Tokens | Cost / 1k Calls
    GPT-5.4 Nano | none | 1030 | 51 | 0 | 51 | $0.27
    GPT-5.4 Nano | low | 1030 | 126 | 70 | 126 | $0.36
    GPT-5.4 Nano | medium | 1030 | 176 | 119 | 176 | $0.43
    GPT-5.4 Nano | high | 1030 | 240 | 181 | 240 | $0.51
    GPT-5.4 Mini | none | 1030 | 40 | 0 | 40 | $0.95
    GPT-5.4 Mini | low | 1030 | 84 | 40 | 84 | $1....

  13. [13]

    & Walker, S

    improved accuracy for certain models; ensembling did not. Higher reasoning effort improved scoring accuracy overall (R2), with a significant positive linear trend. The relationship was not strictly monotonic, however, and varied substantially across model families. Gemini 3.1 Pro Preview maintained superior performance regardless of reasoning level. Gemin...