The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 19:14 UTC · model grok-4.3
The pith
Choosing the right model and reasoning level beats larger ensembles for accurate low-cost LLM scoring of math conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Strategic model selection and reasoning settings outperform ensembling for optimizing automated scoring with LLMs: temperature sampling improves accuracy while larger ensembles do not, higher reasoning effort raises accuracy in a model-dependent linear trend, and specific low-cost models without reasoning achieve the strongest cost-performance balance, as measured on 900 conversations against human ground truth.
What carries the argument
Intra-model self-consistency via majority voting across multiple temperature-sampled responses, combined with adjustable reasoning effort levels across frontier and low-cost models.
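To make the mechanism concrete, here is a minimal sketch of intra-model self-consistency scoring in Python. The `score_once` function is a placeholder for a real OpenAI or Google API call (the paper's Khan Academy prompt scores one criterion per call); only the majority-vote logic reflects the procedure the paper studies.

```python
import random
from collections import Counter

def score_once(conversation: str, criterion: str, temperature: float = 1.0) -> int:
    """Placeholder for one LLM scoring call (one criterion per call).
    A real implementation would call an OpenAI or Google model via its API
    and parse a dichotomous 0/1 score from the response."""
    return random.choice([0, 1])  # stand-in so the sketch runs

def self_consistency_score(conversation: str, criterion: str,
                           j: int = 5, temperature: float = 1.0) -> int:
    """Intra-model self-consistency: sample j responses at temperature > 0
    and return the majority-vote score."""
    votes = [score_once(conversation, criterion, temperature) for _ in range(j)]
    return Counter(votes).most_common(1)[0][0]

# Ensemble sizes examined in the study (odd j avoids ties on 0/1 scores)
for j in (1, 3, 5, 7):
    print(j, self_consistency_score("student transcript ...", "criterion 1", j=j))
```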
If this is right
- Temperature sampling significantly improves scoring accuracy over deterministic calls.
- Ensemble sizes from one to seven produce no significant accuracy gains.
- Higher reasoning effort yields a significant positive linear trend in accuracy that varies by model family.
- Gemini 3.1 Pro Preview at low reasoning achieves the highest accuracy at higher cost.
- GPT-5.4 Nano and Mini with no reasoning deliver the best cost-performance balance.
Where Pith is reading between the lines
- If patterns hold across subjects, scoring systems could default to single low-cost model calls without ensembles for routine tasks.
- Model-specific reasoning settings may cut total compute costs in large-scale assessment deployments.
- Extending tests to non-quantitative subjects would show whether reasoning effort benefits are content-specific.
- The efficiency frontier supplies a practical selection rule for choosing LLM setups under budget limits.
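As a sketch of that selection rule, the snippet below keeps only configurations on the accuracy-versus-cost efficiency frontier and picks the most accurate one under a budget. The model labels, accuracy values, and costs are illustrative placeholders, not the paper's exact figures.

```python
# Hypothetical (accuracy, cost-per-1k-calls) points; the study itself used
# Cohen's kappa against human scores and estimated API cost per configuration.
configs = {
    "model-A / no reasoning":   (0.75, 0.27),
    "model-B / no reasoning":   (0.76, 0.95),
    "model-C / low reasoning":  (0.79, 5.00),
    "model-D / high reasoning": (0.77, 8.00),
}

def efficiency_frontier(points: dict) -> list:
    """Keep configurations not dominated by any other
    (no other config is at least as accurate AND cheaper)."""
    frontier = []
    for name, (acc, cost) in points.items():
        dominated = any((a >= acc and c < cost) or (a > acc and c <= cost)
                        for n, (a, c) in points.items() if n != name)
        if not dominated:
            frontier.append((name, acc, cost))
    return sorted(frontier, key=lambda t: t[2])

def pick_under_budget(points: dict, budget: float):
    """Selection rule: most accurate frontier configuration within budget."""
    feasible = [t for t in efficiency_frontier(points) if t[2] <= budget]
    return max(feasible, key=lambda t: t[1])[0] if feasible else None

print(efficiency_frontier(configs))
print(pick_under_budget(configs, budget=1.00))
```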
Load-bearing premise
The observed accuracy trends with reasoning effort and the lack of benefit from larger ensembles will hold for other models, tasks, or datasets beyond the tested high school mathematics conversations.
What would settle it
A study on a new dataset of student responses showing clear accuracy gains from ensembles of size four or larger would falsify the claim that increasing ensemble size adds no value.
original abstract
Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-scored ground truths using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that strategic model selection and reasoning settings are more effective than self-consistency ensembling for optimizing LLM-based automated scoring accuracy and cost. Experiments on 900 high-school mathematics conversations against human ground truth, using OpenAI and Google models, show that temperature sampling improves accuracy over deterministic decoding, increasing ensemble size from j=1 to j=7 yields no significant accuracy gains, higher reasoning effort produces a positive linear accuracy trend (varying by model family), and an efficiency-frontier analysis identifies Gemini 3.1 Pro Preview (low reasoning) as highest-accuracy but costly while GPT-5.4 Nano/Mini (no reasoning) offer the best cost-performance balance.
Significance. If the empirical patterns hold, the work supplies practical guidance for educational-technology deployments of LLMs in conversation-based assessment, showing that modest increases in reasoning effort and careful model choice can outperform computationally heavier ensembles while controlling cost.
major comments (2)
- [Abstract] Abstract and Results: the central claim that model selection and reasoning effort outperform ensembling rests on the observed absence of accuracy gains for j=1 to j=7 and the linear reasoning-effort trend; both patterns are reported only for high-school mathematics conversations and OpenAI/Google models, so the broad assertion that these strategies are 'more effective than ensembling' for automated scoring in general requires either additional domains or an explicit scope limitation.
- [Results] Results section: the statements 'no significant gains' and 'significant positive linear trend' are load-bearing yet lack the supporting statistical quantities (exact p-values, effect sizes, confidence intervals, and power calculations) needed to evaluate them; without these details the reader cannot confirm that the ensemble-size result is not simply under-powered.
minor comments (2)
- [Abstract] Abstract: define the discrete levels of 'reasoning effort' and how they map to model-specific parameters (e.g., chain-of-thought length or token budget) so the linear-trend claim is reproducible.
- Efficiency-frontier paragraph: state the exact API pricing and token-counting conventions used to compute cost, including the date of pricing data, to allow readers to replicate the frontier.
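For readers who want to attempt replication in the meantime, the sketch below reproduces the cost column of the per-call cost table quoted in the reference graph under one plausible convention: prices quoted per million tokens and reasoning tokens billed as output tokens. Whether that is the convention the authors actually used is exactly what this comment asks them to state.

```python
def cost_per_1k_calls(input_tokens: int, billed_output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated API cost for 1,000 scoring calls, assuming prices are quoted
    per million tokens and reasoning tokens are billed as output tokens."""
    per_call = (input_tokens * input_price_per_m
                + billed_output_tokens * output_price_per_m) / 1_000_000
    return round(per_call * 1_000, 2)

# Matches the GPT-5.4 Nano rows quoted in the reference graph
# ($0.20 / $1.25 per million input / output tokens):
print(cost_per_1k_calls(1030, 51, 0.20, 1.25))   # no reasoning   -> 0.27
print(cost_per_1k_calls(1030, 240, 0.20, 1.25))  # high reasoning -> 0.51
```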
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the scope of our claims requires explicit limitation and that additional statistical details are needed to support the key results. We will revise the manuscript accordingly.
point-by-point responses
-
Referee: [Abstract] Abstract and Results: the central claim that model selection and reasoning effort outperform ensembling rests on the observed absence of accuracy gains for j=1 to j=7 and the linear reasoning-effort trend; both patterns are reported only for high-school mathematics conversations and OpenAI/Google models, so the broad assertion that these strategies are 'more effective than ensembling' for automated scoring in general requires either additional domains or an explicit scope limitation.
Authors: We agree that the experiments are confined to high-school mathematics conversations and the specific OpenAI and Google models tested. We will revise the abstract, title, and discussion to explicitly limit the scope of the claims to this domain and model set, removing any implication of general applicability to automated scoring. revision: yes
-
Referee: [Results] Results section: the statements 'no significant gains' and 'significant positive linear trend' are load-bearing yet lack the supporting statistical quantities (exact p-values, effect sizes, confidence intervals, and power calculations) needed to evaluate them; without these details the reader cannot confirm that the ensemble-size result is not simply under-powered.
Authors: We will add the requested statistical details to the Results section, including exact p-values, effect sizes, confidence intervals, and a post-hoc power analysis for the ensemble-size experiments to substantiate the 'no significant gains' finding and the linear trend for reasoning effort. revision: yes
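As a sketch of what such a power analysis could look like, the snippet below estimates by simulation the chance of detecting a given gap in per-conversation agreement rates between two ensemble sizes on n = 900 conversations. The 88% versus 90% agreement rates are illustrative assumptions, and a chi-square test on independent proportions is a simplification of the GLMMs the paper actually fits.

```python
import numpy as np
from scipy import stats

def power_two_proportions(p1: float, p2: float, n: int = 900,
                          alpha: float = 0.05, sims: int = 5000,
                          seed: int = 0) -> float:
    """Simulation-based power to detect a difference in agreement rates
    between two ensemble configurations (e.g., j=1 vs j=7), each scored on
    n conversations, using a chi-square test on independent samples."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        x1 = rng.binomial(n, p1)
        x2 = rng.binomial(n, p2)
        table = np.array([[x1, n - x1], [x2, n - x2]])
        _, p, _, _ = stats.chi2_contingency(table)
        hits += p < alpha
    return hits / sims

# Illustrative: 88% baseline agreement vs a hypothesized 2-point gain
print(power_two_proportions(0.88, 0.90))
```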
Circularity Check
No circularity: purely empirical comparison of LLM outputs to human scores
full rationale
The paper conducts direct experiments on 900 high-school mathematics conversations, measuring accuracy against human ground truth for different models, temperatures, reasoning efforts, and ensemble sizes. No equations, derivations, or predictions are presented that reduce to fitted parameters or self-citations by construction. All claims rest on observed statistical trends from the experimental data rather than any self-referential loop or renamed input. This is a standard empirical evaluation with no load-bearing self-citation chains or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: the human-scored ground truths provide an accurate benchmark for model performance.
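This axiom can be partly checked empirically: the paper reports criterion-level human-human Fleiss' κ values between roughly 0.66 and 0.91 (see the interrater-reliability excerpt in the reference graph below). A minimal sketch of that reliability check, using hypothetical dichotomous ratings rather than the paper's data:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are conversations, columns are human raters,
# entries are dichotomous criterion scores (0/1).
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
])

# Convert to a subjects x categories count table, then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
print(fleiss_kappa(table))
```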
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean (reality_from_one_distinction), tagged unclear
unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: "We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-p...
work page 2015
-
[2]
We did not, however, compare self-consistency against a single judge or other ensemble sizes
and found high performance across multiple LLMs for dichotomously scored high school mathematics and English language arts (ELA) short answer questions. We did not, however, compare self-consistency against a single judge or other ensemble sizes. Recent empirical work complicates this expectation. Xue et al. (2026) explored the performance of five models ...
work page 2026
-
[3]
produced less than 1% improvement in quadratic weighted kappa (QWK) over a single call when scoring short response items on 2–4 point rubrics—a result that was not statistically significant. The problem appears to be a decoupling between consistency and accuracy: individual models score with near-perfect self-consistency (especially at temp = 0), but that...
work page 2023
-
[4]
that may be difficult to justify for large-scale educational assessments. Self-consistency and higher reasoning effort may improve performance on benchmark tasks, but these methods have not yet reliably improved automated scoring in educational contexts. Prior investigations focused primarily on polytomously scored essay or short response questions in sci...
work page 2025
-
[5]
gives reason to temper that expectation. The theoretical argument holds that multiple reasoning paths should converge on a more accurate solution (i.e., score). There is, however, a fundamental difference between solving reasoning tasks with an objective solution and evaluating student responses against a rubric, which requires subjective interpretation. ...
work page 2025
-
[6]
From Google: the frontier Gemini 3.1 Pro Preview (released February 19,
and the low-cost variants GPT-5.4-Mini and GPT-5.4-Nano (both released March 17, 2026). From Google: the frontier Gemini 3.1 Pro Preview (released February 19,
work page 2026
-
[7]
and the low-cost variant Gemini 3 Flash Preview (released December 17, 2025). All models were accessed via the OpenAI and Google APIs in R using the ellmer package, v.0.3.2 (Wickham et al., 2025). 2.2 Data Three EYT items from high school mathematics were included: two from Algebra I and one from Geometry. Each item has two parts. Part 1 is a selected res...
work page 2025
-
[8]
EYT Criterion-Level Human–Human Interrater Reliability Item Criterion Fleiss' κ Item 1 1 0.858 2 0.713 Item 2 1 0.797 2 0.662 3 0.749 Item 3 1 0.800 2 0.910 2.3 Prompt Khan Academy developed the scoring prompt through iterative prompt engineering. The prompt scores one criterion per call; a conversation with three criteria requires three separate calls. T...
work page 2026
-
[9]
The GLMM preserves the observation-level variance structure that aggregate κ discards
Model Performance (Cohen's κ ) by LLM and Ensemble Size at Lowest-Effort Reasoning Model Reasoning temp = 0 j = 1 temp = 1 j = 1 j = 3 j = 5 j = 7 GPT-5.4 Nano none 0.564 0.750 0.757 0.761 0.761 GPT-5.4 Mini none 0.668 0.756 0.772 0.766 0.767 GPT-5.4 none 0.708 0.701 0.714 0.714 0.714 Gemini 3 Flash Preview minimal 0.714 0.713 0.717 0.717 0.720 We fit two...
work page 2015
-
[10]
Model Reasoning Level none/minimal low medium high GPT-5.4 Nano 0.750 0.700 0.743 0.742 GPT-5.4 Mini 0.756 0.705 0.752 0.763 GPT-5.4 0.701 0.760 0.759 0.763 Gemini 3 Flash Preview 0.713 0.710 0.782 0.771 Gemini 3.1 Pro Preview — 0.794 0.788 0.793 For R2, we fit a single GLMM pooling both GPT and Gemini conditions (Table A3 in the Appendix), with reasoning...
work page 2026
-
[11]
We therefore used a single pooled 1 Note
API Cost per Model 1 Model Input Price Output Price GPT-5.4 Nano $0.20 $1.25 GPT-5.4 Mini $0.75 $4.50 GPT-5.4 $2.50 $15.00 Gemini 3 Flash Preview $0.50 $3.00 Gemini 3.1 Pro Preview $2.00 $12.00 Input token counts are determined by the system prompt and conversation text, which are identical across conditions for the same criterion. We therefore used a sin...
work page 2026
-
[12]
Estimated API Cost per 1,000 Calls by Model and Reasoning Level 2 Model Reasoning Avg. Input Tokens Avg. Output Tokens Avg. Reasoning Tokens Billed Output Tokens Cost / 1k Calls GPT-5.4 Nano none 1030 51 0 51 $0.27 low 1030 126 70 126 $0.36 medium 1030 176 119 176 $0.43 high 1030 240 181 240 $0.51 GPT-5.4 Mini none 1030 40 0 40 $0.95 low 1030 84 40 84 $1....
work page 2026
-
[13]
improved accuracy for certain models; ensembling did not. Higher reasoning effort improved scoring accuracy overall (R2), with a significant positive linear trend. The relationship was not strictly monotonic, however, and varied substantially across model families. Gemini 3.1 Pro Preview maintained superior performance regardless of reasoning level. Gemin...