pith. machine review for the scientific record.

arxiv: 2603.25764 · v2 · submitted 2026-03-26 · 💻 cs.SE · cs.AI

Recognition: 1 theorem link · Lean Theorem

Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:46 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords LLM agents · behavioral consistency · SWE-bench · variance in outputs · consistent errors · agent reliability · software engineering tasks · model comparison

The pith

Consistency in LLM agents amplifies both correct and incorrect outcomes rather than guaranteeing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines behavioral consistency in AI agents on complex software engineering tasks from the SWE-bench benchmark. Across three models, lower variance in action sequences correlates with higher task accuracy. Within any single model, however, high consistency often means repeating the same mistaken assumption on every run, which accounts for the majority of failures even in the strongest model. This distinction matters because production systems need agents that reach correct interpretations, not merely agents that stay predictable in their errors. Readers should care because the result reframes reliability work away from variance reduction alone and toward improving the quality of the agent's starting assumptions.

Core claim

Across models, higher behavioral consistency aligns with higher accuracy on SWE-bench, yet within a model consistency amplifies whatever interpretation the agent adopts. Claude shows the lowest coefficient of variation and highest accuracy, but 71 percent of its failures arise from making the same incorrect assumption across all runs. GPT-5 reaches similar early strategic agreement yet exhibits substantially higher variance, indicating that the timing of divergence does not fully explain consistency differences. The central claim is therefore that consistency multiplies outcomes without ensuring those outcomes are correct.

What carries the argument

Behavioral consistency measured by coefficient of variation in action sequences across repeated runs on identical tasks, which amplifies the agent's initial interpretation regardless of its correctness.
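The simulated rebuttal describes the metric as the coefficient of variation of Levenshtein edit distances between tokenized action logs. A minimal sketch of that reading is below; the exact tokenization and aggregation are not specified in the paper, so the function names and the pairwise-distance design are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[n]

def consistency_cv(runs):
    """CV over pairwise edit distances of repeated runs on one task.

    `runs` is a list of tokenized action sequences; lower CV means
    more consistent behavior. Hypothetical reading of the paper's metric.
    """
    dists = [levenshtein(runs[i], runs[j])
             for i in range(len(runs)) for j in range(i + 1, len(runs))]
    mu = mean(dists)
    return pstdev(dists) / mu if mu else 0.0

# Five identical runs give CV = 0; divergent runs give a positive CV.
identical = [["open", "edit", "test"]] * 5
divergent = [["open", "edit", "test"],
             ["open", "grep", "edit", "test"],
             ["open", "edit"]]
print(consistency_cv(identical))  # 0.0
print(consistency_cv(divergent) > 0)  # True
```

Note that an agent can score CV = 0 while failing every run, which is exactly the "consistent wrong interpretation" failure mode the paper highlights.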

If this is right

  • Interpretation accuracy must be prioritized over execution consistency for reliable production agents.
  • Agent evaluation should track the proportion of failures caused by repeated identical errors rather than variance alone.
  • Divergence timing in early steps does not by itself determine overall behavioral consistency.
  • Training methods need to target reduction of consistent misinterpretations in addition to lowering action variance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same amplification pattern may appear in other multi-step domains such as automated planning or code generation outside software engineering.
  • Benchmarks could report both accuracy and the fraction of failures that are consistent versus diverse to better guide development.
  • Methods that deliberately vary prompts or sample multiple initial interpretations might break repeated errors more effectively than consistency-focused training.

Load-bearing premise

That variance measured over only five runs per task reliably reflects a model's true behavioral consistency and permits meaningful comparisons across models that start with very different baseline accuracies.

What would settle it

Re-running the ten tasks with twenty repetitions each and checking whether the reported correlation between low variance and high accuracy, and the 71 percent rate of consistent wrong interpretations, remain stable.
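The correlation whose stability is at stake can be checked with a from-scratch Spearman rank correlation. On the three model-level (CV, accuracy) points quoted in the abstract, the ordering is perfectly inverse, so rho is exactly -1; the rebuttal's reported rho = -0.92 presumably comes from finer-grained data. Only the three abstract data points are from the paper; the code itself is an illustrative sketch.

```python
def rank(xs):
    """Ranks starting at 1; assumes no ties for simplicity."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for pos, i in enumerate(order, start=1):
        ranks[i] = pos
    return ranks

def spearman(xs, ys):
    """Spearman rho via the classic sum-of-squared-rank-differences formula."""
    n = len(xs)
    rx, ry = rank(xs), rank(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# The three (CV %, accuracy %) points reported in the abstract:
cv_vals  = [15.2, 32.2, 47.0]  # Claude, GPT-5, Llama
acc_vals = [58, 32, 4]
print(spearman(cv_vals, acc_vals))  # -1.0: perfectly inverse ordering
```

With only three points any monotone ordering is trivially perfect, which is itself an argument for the proposed twenty-repetition rerun.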

read the original abstract

As LLM-based AI agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude 4.5 Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks × 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2%) and highest accuracy (58%), GPT-5 is intermediate (CV: 32.2%, accuracy: 32%), and Llama shows the highest variance (CV: 47.0%) with lowest accuracy (4%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: consistency amplifies outcomes rather than guaranteeing correctness. 71% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 achieves similar early strategic agreement as Claude (diverging at step 3.4 vs. 3.2) but exhibits 2.1× higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that across three LLMs on SWE-bench, lower behavioral variance (measured as coefficient of variation) aligns with higher task accuracy (Claude CV 15.2% / 58% acc; GPT-5 CV 32.2% / 32% acc; Llama CV 47% / 4% acc). It further argues that consistency amplifies outcomes rather than guaranteeing correctness, supported by the observation that 71% of Claude's failures arise from the same incorrect interpretation repeated across all runs, and that early strategic agreement does not fully explain variance differences.

Significance. If the reported alignment and amplification effect survive better statistical controls, the work would usefully temper enthusiasm for consistency-based reliability metrics in agent deployment and would motivate evaluation protocols that separately verify interpretation correctness.

major comments (3)
  1. [Experimental protocol] Experimental protocol (10 tasks × 5 runs): the coefficient of variation is estimated from only five runs per task. For low-accuracy regimes (Llama 4%), both σ and μ are estimated with high sampling variance; because CV = σ/μ is scale-dependent, the metric automatically inflates for low-μ models even under identical behavioral dispersion, undermining cross-model comparisons.
  2. [Results] Results section: the headline claims (cross-model consistency-accuracy ordering and the 71% 'consistent wrong interpretation' figure) are presented without statistical tests, confidence intervals, or robustness checks. No evidence is given that the consistency metric was computed on action-sequence similarity rather than binary success rate, nor that task difficulty was balanced.
  3. [Abstract and Results] Abstract and Results: the exact operational definition of 'consistent wrong interpretation' and the procedure used to label failures as 'consistent' across runs is not specified, yet this quantity is load-bearing for the central claim that consistency amplifies incorrect outcomes.
minor comments (1)
  1. [Abstract] Model nomenclature (Claude~4.5~Sonnet, GPT-5) should be clarified with exact version strings and release dates for reproducibility.
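The scale-dependence objection in major comment 1 is easy to demonstrate: two score sets with identical absolute spread (same sigma) but different means produce very different CVs, purely because CV divides by the mean. The numbers below are illustrative, not the paper's.

```python
from statistics import mean, pstdev

def cv(xs):
    """Coefficient of variation: population std dev divided by the mean."""
    return pstdev(xs) / mean(xs)

# Identical dispersion (sigma = sqrt(2)), different means.
high_mean = [50, 52, 48, 51, 49]  # mu = 50
low_mean  = [5, 7, 3, 6, 4]       # mu = 5

print(cv(high_mean))  # ~0.028
print(cv(low_mean))   # ~0.283, 10x larger purely from the smaller mean
```

This is why a low-accuracy model like Llama can look far less "consistent" on CV even if its run-to-run behavioral dispersion matched Claude's in absolute terms.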

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our experimental design, statistical presentation, and definitional clarity. We have revised the manuscript to incorporate bootstrap confidence intervals, explicit statistical tests, a detailed definition of consistent wrong interpretations, and an expanded discussion of sampling limitations. Below we respond point by point.

read point-by-point responses
  1. Referee: [Experimental protocol] Experimental protocol (10 tasks × 5 runs): the coefficient of variation is estimated from only five runs per task. For low-accuracy regimes (Llama 4%), both σ and μ are estimated with high sampling variance; because CV = σ/μ is scale-dependent, the metric automatically inflates for low-μ models even under identical behavioral dispersion, undermining cross-model comparisons.

    Authors: We agree that n=5 runs per task yields high sampling variance for CV, particularly when mean accuracy is low, and that the ratio form of CV can inflate values for low-μ regimes. In the revision we have added bootstrap confidence intervals (1,000 resamples) for all CV and accuracy estimates, performed a sensitivity analysis by recomputing CV on random subsets of 3–5 runs, and inserted an explicit limitations paragraph noting that cross-model comparisons should be interpreted cautiously for low-accuracy models. While we cannot rerun the full experiment with more trials at this stage, the reported ordering (Claude lowest CV, Llama highest) remains stable under these checks. revision: partial
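The percentile bootstrap the response describes (1,000 resamples) can be sketched as follows. The helper name and the accuracy values are hypothetical placeholders, not the paper's data or code.

```python
import random
from statistics import mean

def bootstrap_ci(xs, stat=mean, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Resamples `xs` with replacement `n_boot` times and returns the
    (alpha/2, 1 - alpha/2) percentiles of the resampled statistic.
    """
    rng = random.Random(seed)
    stats = sorted(stat(rng.choices(xs, k=len(xs))) for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

# Hypothetical per-task accuracies (10 tasks, 5 runs each averaged):
acc = [0.6, 0.4, 0.8, 0.6, 0.5, 0.7, 0.6, 0.5, 0.4, 0.7]
lo, hi = bootstrap_ci(acc)
print(f"mean={mean(acc):.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

With only ten tasks the interval is wide, which illustrates the referee's sampling-variance concern rather than resolving it.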

  2. Referee: [Results] Results section: the headline claims (cross-model consistency-accuracy ordering and the 71% 'consistent wrong interpretation' figure) are presented without statistical tests, confidence intervals, or robustness checks. No evidence is given that the consistency metric was computed on action-sequence similarity rather than binary success rate, nor that task difficulty was balanced.

    Authors: We have expanded the Results section with (i) a Spearman correlation test between per-model CV and accuracy (ρ = −0.92, p < 0.01) together with bootstrap 95% CIs, (ii) a robustness table showing the ordering persists after leave-one-task-out validation, and (iii) explicit confirmation that the consistency metric is the coefficient of variation of action-sequence edit distances (Levenshtein distance on tokenized action logs), not binary success rate. Task selection is described as a uniform random sample of 10 SWE-bench instances; we now report per-task accuracy and variance to demonstrate that difficulty is not systematically biased across models. revision: yes

  3. Referee: [Abstract and Results] Abstract and Results: the exact operational definition of 'consistent wrong interpretation' and the procedure used to label failures as 'consistent' across runs is not specified, yet this quantity is load-bearing for the central claim that consistency amplifies incorrect outcomes.

    Authors: We have added a dedicated subsection in Methods that defines a 'consistent wrong interpretation' as a failure in which two independent annotators identify the identical incorrect assumption (e.g., wrong file or incorrect root cause) in the reasoning trace of all five runs for that task. Inter-annotator agreement was 92% (Cohen’s κ = 0.89); disagreements were resolved by discussion. We now report the 71% figure with a bootstrap 95% CI [62%, 79%] and include two annotated examples in the appendix. These clarifications make the central claim fully reproducible. revision: yes
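The Cohen's kappa the response reports (0.89) is a chance-corrected agreement statistic and can be computed directly. The annotation labels below ('c' for a consistent wrong interpretation, 'd' for a diverse failure) are hypothetical, chosen only to show the calculation.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items both annotators label alike.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling at marginal rates.
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical failure labels from two annotators over ten failed tasks:
ann_a = ["c", "c", "d", "c", "d", "c", "c", "d", "c", "c"]
ann_b = ["c", "c", "d", "c", "c", "c", "c", "d", "c", "c"]
print(round(cohens_kappa(ann_a, ann_b), 2))  # 0.74
```

Kappa below raw percent agreement (here 90%) is expected whenever one label dominates, which is why the paper's reporting of both numbers is the right practice.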

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark observations

full rationale

The paper reports direct measurements of behavioral variance (CV computed across 5 runs per task on SWE-bench) and accuracy for three models, then offers an interpretive observation that consistency amplifies both correct and incorrect outcomes. No equations, derivations, fitted parameters, or first-principles claims appear; the central statements are statistical summaries of experimental runs rather than reductions of any quantity to itself by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, or postulated entities; the work is an observational empirical study on benchmark outputs.

pith-pipeline@v0.9.0 · 5556 in / 1014 out tokens · 34582 ms · 2026-05-15T00:46:49.282122+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ROZA Graphs: Self-Improving Near-Deterministic RAG through Evidence-Centric Feedback

cs.AI · 2026-04 · unverdicted · novelty 7.0

    ROZA graphs enable self-improving RAG by storing evidence-specific reasoning chains, yielding up to 10.6pp accuracy gains and 46% lower cost through graph traversal feedback.