Recognition: 2 theorem links
Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3
The pith
Execution-based selectors outperform textual majority voting by 18-52 percentage points when selecting among LLM code candidates without access to an oracle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemanticVote clusters LLM code candidates by their execution fingerprints on inputs generated by the LLM. In all 18 tested configurations, every execution-based selector exceeds output-pattern majority voting by at least 18 percentage points, with the best exceeding it by 19-52 points. Once execution data is available, different aggregation rules produce statistically indistinguishable results, while sketch-based input generation improves outcomes by 0.6-2.1 points over direct LLM generation and up to 11.3 points over random fuzzing. Deeper thinking during generation boosts majority voting by about 12 points but leaves execution-based selection flat or slightly worse.
What carries the argument
Execution fingerprints on LLM-generated inputs, used to cluster or rank code candidates by observed behavior rather than textual patterns.
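The clustering idea can be made concrete with a minimal sketch. This is an illustration of the general technique, not the paper's implementation: candidates are assumed to be Python source strings defining a `solve` entry point (a hypothetical name), and a fingerprint is simply the tuple of observed outcomes, with exceptions folded in as part of the behavior.

```python
from collections import defaultdict

def fingerprint(candidate_src, inputs):
    """Run one candidate on every generated input; the tuple of
    observed outcomes is its execution fingerprint."""
    namespace = {}
    exec(candidate_src, namespace)           # define the candidate function
    func = namespace["solve"]                # assumed entry-point name
    outcomes = []
    for args in inputs:
        try:
            outcomes.append(("ok", repr(func(*args))))
        except Exception as exc:             # errors are part of the fingerprint
            outcomes.append(("err", type(exc).__name__))
    return tuple(outcomes)

def semantic_clusters(candidates, inputs):
    """Group candidates whose fingerprints match exactly and return the
    largest behavioral cluster - majority voting over semantics, not text."""
    clusters = defaultdict(list)
    for src in candidates:
        clusters[fingerprint(src, inputs)].append(src)
    return max(clusters.values(), key=len)
```

Textually different candidates (`x + x` versus `2 * x`) land in the same cluster because they behave identically, which is exactly the signal textual voting misses.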
If this is right
- Execution-based selection improves accuracy by 18-52 percentage points over textual voting in every tested setting.
- The quality of generated test inputs drives larger gains than the choice of aggregation rule once execution data exists.
- Increased thinking depth during generation helps textual voting but leaves execution-based selection flat or slightly worse, because deeper thinking reduces candidate diversity.
- When execution data is collected, SemanticVote, weighted voting, and MBR-Exec yield equivalent performance.
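The equivalence among aggregation rules is easiest to see by sketching one of them. A minimal reading of MBR-Exec (not the paper's code): score each candidate by how often its execution outputs agree with every other candidate's, and pick the most-agreeing one. Here `fingerprints` is any list of per-candidate output tuples, such as those produced by an execution harness.

```python
def mbr_exec(fingerprints):
    """Minimum-Bayes-risk selection under an execution-match loss:
    return the index of the candidate whose outputs agree most often
    with the rest of the sample."""
    def agreement(i):
        return sum(
            sum(a == b for a, b in zip(fingerprints[i], other))
            for j, other in enumerate(fingerprints) if j != i
        )
    return max(range(len(fingerprints)), key=agreement)
```

Once the same execution data feeds SemanticVote's largest-cluster rule and this pairwise-agreement rule, both reward the dominant behavior, which is consistent with the paper's finding that the rules are statistically indistinguishable.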
Where Pith is reading between the lines
- The same execution-grounded clustering idea could be tested on other generative tasks such as algorithm design or proof generation where partial behavioral checks are possible.
- Pipeline designers should allocate more effort to diverse, high-quality input generation than to refining vote-aggregation logic.
- The differing interaction of thinking level with textual versus execution selection suggests that generation and selection stages may need to be tuned jointly rather than independently.
Load-bearing premise
That execution fingerprints on LLM-generated inputs serve as a reliable proxy for semantic equivalence and functional correctness.
What would settle it
A set of code candidates that produce identical execution results on the generated inputs yet differ in correctness on held-out human-written tests would disprove the proxy.
read the original abstract
LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix textual voting, ranking, and execution-based agreement, but the relative contribution of each component remains unclear. We study 18 configurations across different models, thinking levels, and benchmarks, comparing output-pattern majority voting, weighted voting, MBR-Exec, and SemanticVote - a method that clusters candidates by execution fingerprints on LLM-generated inputs. Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration, with every execution-based selector exceeding it by at least 18 points. (2) Once candidates are executed on diverse inputs, aggregation rule has limited effect: SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations. The largest factor is input quality: sketch-based input generation consistently outperforms direct LLM generation by 0.6-2.1 pp and random fuzzing by up to 11.3 pp. (3) Thinking level interacts differently with selection families: deeper thinking improves majority voting by 12 pp but execution-based methods stay flat or degrade as candidate diversity falls. These results frame inference-time code selection as a signal-quality problem rather than an aggregation-rule problem: when oracles are unavailable, the behavioral evidence matters more than the aggregation rule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically studies selection strategies for LLM code generation pipelines without access to complete oracles. It compares output-pattern majority voting against execution-based methods (MBR-Exec and the proposed SemanticVote, which clusters candidates via execution fingerprints on LLM-generated inputs) across 18 configurations varying models, thinking levels, and benchmarks. Key claims are that the best execution-based selector outperforms majority voting by 19-52 percentage points in every case, that aggregation rules become statistically indistinguishable once candidates are executed on diverse inputs, that input-generation quality (especially sketch-based) dominates performance, and that deeper thinking improves majority voting but leaves execution-based methods flat or worse.
Significance. If the reported performance gaps and interaction patterns hold under scrutiny, the work usefully reframes inference-time code selection as primarily a signal-quality problem rather than an aggregation-rule problem. The systematic head-to-head evaluation across multiple dimensions and the concrete demonstration that execution fingerprints provide stronger behavioral evidence than textual patterns are valuable for practitioners building LLM code pipelines. The paper's strength is its direct, reproducible-style empirical comparisons on standard benchmarks; no parameter-free derivations or machine-checked proofs are claimed.
major comments (2)
- [Abstract and §4] Abstract and §4 (results): the claim that 'SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations' is load-bearing for the conclusion that aggregation rule has limited effect, yet the abstract provides no mention of the statistical test, p-value threshold, or correction for multiple comparisons used to establish indistinguishability. Without these details the claim cannot be fully evaluated from the given text.
- [§3] §3 (methodology): the definition of 'execution fingerprints' and the clustering procedure in SemanticVote must be specified precisely (e.g., exact hash or distance metric on execution traces, handling of timeouts or errors) because the 19-52 pp gap rests on these fingerprints serving as a reliable proxy; any ambiguity here directly affects reproducibility of the central empirical result.
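To illustrate what the requested precision could look like, here is a hypothetical canonicalization step (not the paper's actual definition): raw execution outcomes are normalized to hashable tokens before they enter a fingerprint, so floats, errors, and timeouts compare consistently across candidates.

```python
import math

TIMEOUT, ERROR = "<timeout>", "<error:{}>"

def canonical(outcome):
    """Map one raw execution outcome to a hashable token:
    floats are rounded to absorb representation noise, errors collapse
    to their exception class name, and timeouts become a sentinel."""
    status, value = outcome
    if status == "timeout":
        return TIMEOUT
    if status == "error":
        return ERROR.format(value)          # value: exception class name
    if isinstance(value, float):
        return round(value, 9) if math.isfinite(value) else repr(value)
    return repr(value)
```

Pinning down choices like the rounding precision and the error granularity is what the referee is asking for, since any of them can merge or split clusters.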
minor comments (3)
- [Abstract] The abstract would benefit from naming the specific benchmarks and models used in the 18 configurations so readers can immediately gauge scope.
- [Results figures/tables] Table or figure captions should explicitly state whether error bars represent standard deviation, standard error, or confidence intervals, and whether results are averaged over multiple random seeds.
- [Abstract] Minor notation inconsistency: 'SemanticVote' is introduced in the abstract without a one-sentence formal description; a brief parenthetical would improve clarity for readers unfamiliar with the method.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (results): the claim that 'SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations' is load-bearing for the conclusion that aggregation rule has limited effect, yet the abstract provides no mention of the statistical test, p-value threshold, or correction for multiple comparisons used to establish indistinguishability. Without these details the claim cannot be fully evaluated from the given text.
Authors: We agree that the abstract should reference the statistical procedure to allow readers to evaluate the indistinguishability claim directly. The paired statistical tests and multiple-comparison correction are described in §4; we will add a concise statement to the abstract summarizing the test, threshold, and correction method used. revision: yes
-
Referee: [§3] §3 (methodology): the definition of 'execution fingerprints' and the clustering procedure in SemanticVote must be specified precisely (e.g., exact hash or distance metric on execution traces, handling of timeouts or errors) because the 19-52 pp gap rests on these fingerprints serving as a reliable proxy; any ambiguity here directly affects reproducibility of the central empirical result.
Authors: We acknowledge that greater precision in §3 would improve reproducibility. We will expand the description of execution fingerprints and the SemanticVote clustering procedure to explicitly state the distance metric on traces, the fingerprint representation, and the handling of timeouts and error cases. revision: yes
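One way the promised clarification could be operationalized (illustrative only; the paper's actual procedure is in its §4, which is not reproduced here): a paired sign-flip permutation test on per-configuration accuracy differences, with a Holm step-down correction across the 18 comparisons.

```python
import random

def paired_permutation_p(diffs, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired accuracy differences."""
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = sum(
        abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

def holm_reject(pvals, alpha=0.05):
    """Holm step-down correction: mark which comparisons stay significant."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break
    return reject
```

Reporting the test family, alpha, and correction in the abstract, as the authors agree to do, would let readers verify the indistinguishability claim directly.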
Circularity Check
No significant circularity; purely empirical measurements
full rationale
The paper reports direct head-to-head pass-rate measurements on benchmark test suites across 18 configurations. No equations, derivations, or first-principles predictions appear; all reported improvements (e.g., execution-based selectors exceeding majority voting by 19-52 pp) are computed from the same external oracles used to score every method. Input-generation quality and thinking-level interactions are likewise observed outcomes, not fitted parameters renamed as predictions. The work is therefore self-contained against external benchmarks with no load-bearing self-citation chains or definitional reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: execution on LLM-generated inputs provides a reliable proxy for semantic similarity and correctness.
invented entities (1)
- SemanticVote (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (connection unclear); matched text: "SemanticVote clusters candidates by execution fingerprints on LLM-generated inputs... sketch-based input generation... exception-aware execution fingerprints"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (connection unclear); matched text: "Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points..."
discussion (0)