Self-Consistency Improves Chain of Thought Reasoning in Language Models

Aakanksha Chowdhery; Dale Schuurmans; Denny Zhou; Ed Chi; Jason Wei; Quoc Le; Sharan Narang; Xuezhi Wang

arxiv: 2203.11171 · v4 · submitted 2022-03-21 · 💻 cs.CL · cs.AI


Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou

Pith reviewed 2026-05-24 11:35 UTC · model grok-4.3


keywords: self-consistency, chain of thought, reasoning, language models, decoding, arithmetic reasoning, commonsense reasoning, prompting


The pith

Self-consistency decoding replaces greedy decoding in chain-of-thought prompting and raises accuracy on arithmetic and commonsense reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces self-consistency, a decoding strategy that samples multiple reasoning paths for a given problem and then chooses the answer that is most consistent across those paths. This replaces the single greedy path used in standard chain-of-thought prompting. The approach is motivated by the observation that a correct answer can usually be reached through several different valid reasoning sequences. Reported gains over greedy chain-of-thought decoding range from 3.9 to 17.9 percentage points across five arithmetic and commonsense reasoning benchmarks. Readers care because the method requires no model changes or additional training and applies directly to existing prompts.

Core claim

Self-consistency works by first generating a diverse set of reasoning paths through sampling and then selecting the answer that appears most frequently after marginalizing over the paths. This leverages the fact that a complex problem usually has multiple ways to reach its single correct answer.
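In symbols, the selection rule can be sketched as follows. The notation is an assumption of this sketch rather than something quoted from the page: $x$ is the prompt, $m$ is the number of sampled paths, $r_i$ is the $i$-th reasoning path, and $a_i$ is the answer parsed from it. Only the unweighted vote described above is shown.

```latex
% Sample m (path, answer) pairs from the model, then vote over answers only.
(r_i, a_i) \sim P(r, a \mid x), \qquad i = 1, \dots, m

% Marginalizing out the paths is approximated by counting answers:
P(a \mid x) \;=\; \sum_{r} P(r, a \mid x) \;\approx\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}(a_i = a)

% The selected answer is the most frequent one:
\hat{a} \;=\; \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}(a_i = a)
```

The reasoning paths $r_i$ drop out of the final expression: two paths that differ in every step still count as agreeing if they end in the same answer.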

What carries the argument

Self-consistency, a decoding procedure that samples diverse reasoning paths and aggregates answers by majority vote.
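A minimal Python sketch of that procedure is shown here. The names `extract_answer`, `self_consistency`, and `sample_path`, the "The answer is" parsing convention, and the canned reasoning strings are all hypothetical and chosen for illustration. In practice `sample_path` would call a language model with sampling enabled, which this sketch does not attempt.

```python
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(path: str) -> Optional[str]:
    """Toy parser (an assumption): take the token after the last 'The answer is'."""
    matches = re.findall(r"The answer is\s+(\S+)", path)
    return matches[-1].rstrip(".") if matches else None


def self_consistency(sample_path: Callable[[], str], num_paths: int) -> Optional[str]:
    """Sample num_paths reasoning paths and return the most frequent parsed answer."""
    answers = [extract_answer(sample_path()) for _ in range(num_paths)]
    votes = Counter(a for a in answers if a is not None)  # drop unparseable paths
    if not votes:
        return None
    # Ties go to the answer that was seen first (Counter keeps insertion order).
    return votes.most_common(1)[0][0]


# Stand-in for a sampling model: three invented paths, two of which agree.
canned = iter([
    "She has 16 - 3 - 4 = 9 eggs left. 9 * 2 = 18. The answer is 18.",
    "16 - 3 = 13, then 13 - 4 = 9, so she earns 9 * 2 = 18. The answer is 18.",
    "She sells 16 - 4 = 12 eggs for 12 * 2 = 24. The answer is 24.",
])
result = self_consistency(lambda: next(canned), num_paths=3)
print(result)
```

Passing the sampler in as a callable keeps the voting logic independent of any particular model API. Only the final answers are compared, so the two agreeing paths need not share any intermediate steps.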

If this is right

Accuracy on GSM8K rises by 17.9 percentage points.
Accuracy on SVAMP rises by 11.0 percentage points.
Accuracy on AQuA rises by 12.2 percentage points.
Accuracy on StrategyQA rises by 6.4 percentage points.
Accuracy on ARC-challenge rises by 3.9 percentage points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Self-consistency may reduce the need for carefully engineered prompts by allowing the model to explore multiple routes.
The method could extend to other tasks where multiple valid solution paths exist.
Computational cost increases with the number of sampled paths, suggesting a tradeoff between accuracy and efficiency.
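The accuracy-versus-cost tradeoff in the last point can be illustrated with a toy calculation. This sketch rests on strong simplifying assumptions that are not from the paper: answers are binary, each sampled path is independently correct with probability `p`, the number of paths is odd so there are no ties, and every path costs the same hypothetical `tokens_per_path` (prompt tokens are ignored).

```python
from math import comb
from typing import List, Tuple


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number to avoid ties")
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


def sweep(p: float, tokens_per_path: int, sizes: List[int]) -> List[Tuple[int, int, float]]:
    """Return (paths, generated tokens, vote accuracy) for each sample count."""
    return [(n, n * tokens_per_path, majority_accuracy(p, n)) for n in sizes]


rows = sweep(p=0.6, tokens_per_path=200, sizes=[1, 5, 11, 21, 41])
for n, tokens, acc in rows:
    print(f"{n:>3} paths  {tokens:>6} tokens  vote accuracy {acc:.3f}")
```

Under these assumptions the generated-token cost grows linearly with the number of paths, while each additional path adds less accuracy than the one before it.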

Load-bearing premise

The model must produce a sufficient number of correct but varied reasoning paths so that the correct answer wins the majority vote rather than a wrong but repeated answer.
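This premise can be stated a little more precisely. The following is a sketch under assumptions not made explicit on the page: the $m$ samples are drawn independently from the model's distribution for prompt $x$, $a_i$ is the answer parsed from sample $i$, and $a^{*}$ denotes the correct answer.

```latex
% By the law of large numbers, vote shares converge to the model's
% marginal answer probabilities as the number of samples grows:
\frac{1}{m} \sum_{i=1}^{m} \mathbb{1}(a_i = a) \;\xrightarrow[m \to \infty]{}\; P(a \mid x)

% So the vote recovers the correct answer in the limit when a^* is the
% single most probable answer under the model:
P(a^{*} \mid x) \;>\; \max_{a \neq a^{*}} P(a \mid x)
```

A plurality is enough; $a^{*}$ does not need more than half of the probability mass. Conversely, if some wrong answer is more probable than $a^{*}$, drawing more samples makes the vote settle on that wrong answer more reliably.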

What would settle it

Finding a benchmark or problem set where the model consistently generates the same incorrect reasoning path across samples would show self-consistency performing no better than, and possibly worse than, greedy decoding.

Original abstract

Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes self-consistency as a new decoding strategy to replace greedy decoding in chain-of-thought (CoT) prompting. It samples a diverse set of reasoning paths from the language model and selects the answer that appears most frequently across those paths via marginalization. The central empirical claim is that this yields substantial accuracy gains over standard CoT on arithmetic and commonsense benchmarks: GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%), and ARC-challenge (+3.9%).

Significance. If the reported gains hold under controlled conditions, the method supplies a simple, training-free improvement to CoT prompting that exploits the existence of multiple valid reasoning paths to the same correct answer. The approach requires no new parameters or fine-tuning and directly addresses a practical limitation of greedy decoding; the magnitude of the lifts on established benchmarks would make it a useful baseline for future reasoning work.

minor comments (3)

[Abstract] Abstract and experimental section: exact values for the number of sampled paths, sampling temperature, and any top-p/top-k settings are not stated, which hinders exact reproduction of the reported deltas.
[Experimental results] The evaluation does not report standard deviations or error bars across multiple runs, leaving the statistical reliability of the per-benchmark improvements (especially the smaller +3.9% on ARC-challenge) harder to assess.
[Section 4] A compute-matched baseline that spends the same total number of forward passes on a different decoding or aggregation strategy (e.g., beam search) would more cleanly isolate the contribution of majority voting.
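For the second comment, one common way to attach an uncertainty estimate to a per-benchmark delta is a paired bootstrap over test examples. The sketch below is not from the paper: the function name, the resample count, and the synthetic 0/1 correctness vectors are all hypothetical. It also captures only variability from the choice of test examples, not run-to-run variance from sampling.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    greedy: List[int], voted: List[int], num_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (delta, low, high) in percentage points with a 95% percentile interval.

    greedy[i] and voted[i] are 1 if the method answered example i correctly, else 0.
    """
    if len(greedy) != len(voted) or not greedy:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(greedy)
    diffs = [v - g for g, v in zip(greedy, voted)]  # per-example paired difference
    point = 100.0 * sum(diffs) / n
    # Resample examples with replacement and recompute the delta each time.
    deltas = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(num_resamples)
    )
    low = deltas[int(0.025 * num_resamples)]
    high = deltas[int(0.975 * num_resamples) - 1]
    return point, low, high


# Synthetic data for illustration only: 60/100 correct vs. 70/100 correct.
greedy_correct = [1] * 60 + [0] * 40
voted_correct = [1] * 70 + [0] * 30
delta, low, high = paired_bootstrap_delta(greedy_correct, voted_correct)
print(f"delta = {delta:.1f} points, 95% interval = [{low:.1f}, {high:.1f}]")
```

Pairing the two methods on the same examples removes between-example variance from the comparison, which matters most for small deltas such as the one reported on ARC-challenge.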

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work and the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity; empirical method evaluated on external benchmarks

Full rationale

The paper introduces self-consistency as an empirical decoding procedure: sample multiple CoT paths and marginalize via majority vote. The headline result consists of measured accuracy lifts on fixed external benchmarks (GSM8K, SVAMP, etc.). No equations appear that equate a derived quantity to its own fitted inputs, no self-citation chain supplies a uniqueness theorem, and the core intuition is directly probed by whether the reported margins materialize. The work is therefore self-contained against external data rather than internally circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that diverse sampled paths converge on the correct answer more often than on incorrect answers. No free parameters are introduced in the abstract. No new entities are postulated. The key domain assumption is that the pre-trained model already encodes multiple valid reasoning routes for the target problems.

axioms (1)

domain assumption: A complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer.
Rationale: This intuition is stated directly in the abstract as the basis for why marginalizing over paths improves accuracy.

pith-pipeline@v0.9.0 · 5707 in / 1378 out tokens · 19931 ms · 2026-05-24T11:35:43.766588+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 conditional novelty 8.0

AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...
SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology
cs.AI 2026-03 conditional novelty 8.0

SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.
Evaluating Large Language Models in Scientific Discovery
cs.AI 2025-12 unverdicted novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
cs.CR 2025-07 unverdicted novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
cs.AI 2024-08 unverdicted novelty 8.0

The AI Scientist framework enables LLMs to independently conduct the full scientific process from idea generation to paper writing and review, demonstrated across three ML subfields with papers costing under $15 each.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
cs.CL 2023-05 accept novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.
PAL: Program-aided Language Models
cs.CL 2022-11 conditional novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
cs.CL 2026-05 unverdicted novelty 7.0

IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
TextReg: Mitigating Prompt Distributional Overfitting via Regularized Text-Space Optimization
cs.CL 2026-05 unverdicted novelty 7.0

TextReg mitigates prompt distributional overfitting via regularized text-space optimization, reporting up to +16.5% OOD accuracy gains over prior methods on reasoning benchmarks.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens witho...
Retrieval-Augmented Linguistic Calibration
cs.CL 2026-05 unverdicted novelty 7.0

RALC is a retrieval-augmented rewriting pipeline that improves linguistic faithfulness and calibration of LLM outputs by up to 66% and 58% on QA benchmarks.
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels
cs.LG 2026-05 unverdicted novelty 7.0

Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identificati...
EXG: Self-Evolving Agents with Experience Graphs
cs.AI 2026-05 unverdicted novelty 7.0

EXG is an experience graph framework for self-evolving LLM agents that supports online real-time growth and offline reuse to enhance solution quality and efficiency on code generation and reasoning benchmarks.
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance
nlin.AO 2026-05 unverdicted novelty 7.0

LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era
cs.LG 2026-05 unverdicted novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable ...
MasFACT: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
cs.LG 2026-05 unverdicted novelty 7.0

MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
DISA: Offline Importance Sampling for Distribution-Matching LLM-RL
cs.LG 2026-05 unverdicted novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more stra...
Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management
cs.AI 2026-05 unverdicted novelty 7.0

Autonomous AI agents outperform humans in supply chain simulations but exhibit an inherent agent bullwhip effect of amplified decision unreliability, mitigated by GRPO reinforcement learning post-training.
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verif...
BOOKMARKS: Efficient Active Storyline Memory for Role-playing
cs.CL 2026-05 unverdicted novelty 7.0

BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
Test-Time Hinting for Black-Box Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 conditional novelty 7.0

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
Query-Conditioned Test-Time Self-Training for Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
Think Twice, Act Once: Verifier-Guided Action Selection For Embodied Agents
cs.AI 2026-05 unverdicted novelty 7.0

VeGAS improves MLLM-based embodied agents by sampling action ensembles and using a verifier trained on LLM-synthesized failure cases, yielding up to 36% relative gains on hard multi-object long-horizon tasks in Habita...
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
cs.CL 2026-05 unverdicted novelty 7.0

BICR trains a lightweight probe on contrastive hidden states from real versus blind images to detect visual grounding in LVLM predictions, outperforming baselines on calibration and discrimination with fewer parameters.
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
cs.CL 2026-05 unverdicted novelty 7.0

BICR uses blind-image contrastive ranking on frozen LVLM hidden states to train a lightweight probe that penalizes confidence on blacked-out inputs, yielding top calibration and discrimination across five models and m...
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
cond-mat.stat-mech 2026-05 unverdicted novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
Active Testing of Large Language Models via Approximate Neyman Allocation
cs.AI 2026-05 unverdicted novelty 7.0

Proposes surrogate semantic entropy stratification followed by approximate Neyman allocation for active testing of LLMs on generative benchmarks, reporting up to 28% MSE reduction and 22.9% average budget savings vers...
MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
cs.AI 2026-05 unverdicted novelty 7.0

MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
cs.AI 2026-05 unverdicted novelty 7.0

Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
Re²Math: Benchmarking Theorem Retrieval in Research-Level Mathematics
cs.AI 2026-05 unverdicted novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...
Skill-CMIB: Multimodal Agent Skill for Consistent Action via Conditional Multimodal Information Bottleneck
cs.LG 2026-05 unverdicted novelty 7.0

CMIB uses a conditional multimodal information bottleneck to create reusable agent skills that separate verbalizable text content from predictive perceptual residuals, improving execution stability.
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
cs.CL 2026-05 unverdicted novelty 7.0

AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation
cs.CL 2026-05 unverdicted novelty 7.0

CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candida...
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Tracing Uncertainty in Language Model "Reasoning"
cs.LG 2026-05 unverdicted novelty 7.0

Uncertainty trace profiles from LM reasoning traces predict correct final answers with AUROC up to 0.807 and enable early error detection using only initial tokens.
Regulating Branch Parallelism in LLM Serving
cs.DC 2026-05 unverdicted novelty 7.0

TAPER regulates LLM branch parallelism by admitting extra branches opportunistically when predicted externality fits slack, delivering 1.48-1.77x higher goodput than eager or fixed-cap baselines on Qwen3-32B while kee...
Distributional Process Reward Models: Calibrated Prediction of Future Rewards via Conditional Optimal Transport
cs.LG 2026-05 unverdicted novelty 7.0

Conditional optimal transport calibrates PRMs by learning monotonic conditional quantile functions over success probabilities conditioned on hidden states, yielding improved calibration and downstream Best-of-N perfor...
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
cs.CL 2026-05 unverdicted novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
cs.AI 2026-05 unverdicted novelty 7.0

RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
Inference-Time Budget Control for LLM Search Agents
cs.AI 2026-05 unverdicted novelty 7.0

A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios
cs.AI 2026-05 unverdicted novelty 7.0

ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
cs.CL 2026-05 unverdicted novelty 7.0

Iterative search over reward functions with ranked feedback in GRPO training improves LLM math reasoning, achieving F1 of 0.795 on GSM8K versus 0.609 for baseline.
Enhanced LLM Reasoning by Optimizing Reward Functions with Search-Driven Reinforcement Learning
cs.CL 2026-05 accept novelty 7.0

Iterative LLM-driven search over reward functions, screened via GRPO on GSM8K, raises F1 from 0.609 baseline to 0.795 with ensembles on Llama-3.2-3B.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation
cs.CV 2026-04 accept novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
Green Shielding: A User-Centric Approach Towards Trustworthy AI
cs.CL 2026-04 unverdicted novelty 7.0

Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like t...
Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees
cs.LG 2026-04 unverdicted novelty 7.0

Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computat...
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
cs.AI 2026-04 unverdicted novelty 7.0

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
Learning When Not to Decide: A Framework for Overcoming Factual Presumptuousness in AI Adjudication
cs.AI 2026-04 unverdicted novelty 7.0

A new structured prompting method (SPEC) helps AI detect insufficient evidence in adjudication tasks and defer decisions appropriately, reaching 89% accuracy on a benchmark varying information completeness from Colora...
Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
cs.AI 2026-04 unverdicted novelty 7.0

Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
Structural Verification for Reliable EDA Code Generation without Tool-in-the-Loop Debugging
cs.SE 2026-04 unverdicted novelty 7.0

Structural dependency graphs and staged pre-execution verification raise LLM-based EDA code pass rates to 82.5% (single-step) and 70-84% (multi-step) while halving tool calls by catching dependency violations before runtime.
AI scientists produce results without reasoning scientifically
cs.AI 2026-04 conditional novelty 7.0

LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning
cs.CL 2026-04 unverdicted novelty 7.0

CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.
Clover: A Neural-Symbolic Agentic Harness with Stochastic Tree-of-Thoughts for Verified RTL Repair
cs.AR 2026-04 unverdicted novelty 7.0

Clover fixes 96.8% of bugs on an RTL-repair benchmark using stochastic tree-of-thoughts and neural-symbolic agents, outperforming traditional and LLM baselines by 94% and 63% respectively with 87.5% pass@1.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
cs.AI 2026-04 conditional novelty 7.0

AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cu...