Recognition: 2 theorem links
Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
Pith reviewed 2026-05-12 00:50 UTC · model grok-4.3
The pith
Execution-based selectors outperform textual majority voting by 18-52 percentage points when selecting among LLM code candidates without access to an oracle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SemanticVote clusters LLM code candidates by their execution fingerprints on inputs generated by the LLM. In all 18 tested configurations, every execution-based selector exceeds output-pattern majority voting by at least 18 percentage points, with the best exceeding it by 19-52 points. Once execution data is available, different aggregation rules produce statistically indistinguishable results, while sketch-based input generation improves outcomes by 0.6-2.1 points over direct LLM generation and up to 11.3 points over random fuzzing. Deeper thinking during generation boosts majority voting by about 12 points but leaves execution-based selection flat or slightly worse.
What carries the argument
Execution fingerprints on LLM-generated inputs, used to cluster or rank code candidates by observed behavior rather than textual patterns.
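The clustering idea can be made concrete with a minimal sketch. This is an illustration of the general technique, not the paper's implementation: candidates are assumed to be Python source strings defining a `solve` entry point (a hypothetical name), and a fingerprint is simply the tuple of observed outcomes, with exceptions folded in as part of the behavior.

```python
from collections import defaultdict

def fingerprint(candidate_src, inputs):
    """Run one candidate on every generated input; the tuple of
    observed outcomes is its execution fingerprint."""
    namespace = {}
    exec(candidate_src, namespace)           # define the candidate function
    func = namespace["solve"]                # assumed entry-point name
    outcomes = []
    for args in inputs:
        try:
            outcomes.append(("ok", repr(func(*args))))
        except Exception as exc:             # errors are part of the fingerprint
            outcomes.append(("err", type(exc).__name__))
    return tuple(outcomes)

def semantic_clusters(candidates, inputs):
    """Group candidates whose fingerprints match exactly and return the
    largest behavioral cluster - majority voting over semantics, not text."""
    clusters = defaultdict(list)
    for src in candidates:
        clusters[fingerprint(src, inputs)].append(src)
    return max(clusters.values(), key=len)
```

Textually different candidates (`x + x` versus `2 * x`) land in the same cluster because they behave identically, which is exactly the signal textual voting misses.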
If this is right
- Execution-based selection improves accuracy by 18-52 percentage points over textual voting in every tested setting.
- The quality of generated test inputs drives larger gains than the choice of aggregation rule once execution data exists.
- Increased thinking depth during generation helps textual voting but leaves execution-based selection flat or slightly worse, because deeper thinking reduces candidate diversity.
- When execution data is collected, SemanticVote, weighted voting, and MBR-Exec yield equivalent performance.
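The equivalence among aggregation rules is easiest to see by sketching one of them. A minimal reading of MBR-Exec (not the paper's code): score each candidate by how often its execution outputs agree with every other candidate's, and pick the most-agreeing one. Here `fingerprints` is any list of per-candidate output tuples, such as those produced by an execution harness.

```python
def mbr_exec(fingerprints):
    """Minimum-Bayes-risk selection under an execution-match loss:
    return the index of the candidate whose outputs agree most often
    with the rest of the sample."""
    def agreement(i):
        return sum(
            sum(a == b for a, b in zip(fingerprints[i], other))
            for j, other in enumerate(fingerprints) if j != i
        )
    return max(range(len(fingerprints)), key=agreement)
```

Once the same execution data feeds SemanticVote's largest-cluster rule and this pairwise-agreement rule, both reward the dominant behavior, which is consistent with the paper's finding that the rules are statistically indistinguishable.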
Where Pith is reading between the lines
- The same execution-grounded clustering idea could be tested on other generative tasks such as algorithm design or proof generation where partial behavioral checks are possible.
- Pipeline designers should allocate more effort to diverse, high-quality input generation than to refining vote-aggregation logic.
- The differing interaction of thinking level with textual versus execution selection suggests that generation and selection stages may need to be tuned jointly rather than independently.
Load-bearing premise
That execution fingerprints on LLM-generated inputs serve as a reliable proxy for semantic equivalence and functional correctness.
What would settle it
A set of code candidates that produce identical execution results on the generated inputs yet differ in correctness on held-out human-written tests would disprove the proxy.
read the original abstract
LLM code-generation pipelines often sample multiple candidates and select one final answer without access to a complete oracle. Existing pipelines mix textual voting, ranking, and execution-based agreement, but the relative contribution of each component remains unclear. We study 18 configurations across different models, thinking levels, and benchmarks, comparing output-pattern majority voting, weighted voting, MBR-Exec, and SemanticVote - a method that clusters candidates by execution fingerprints on LLM-generated inputs. Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points on every configuration, with every execution-based selector exceeding it by at least 18 points. (2) Once candidates are executed on diverse inputs, aggregation rule has limited effect: SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations. The largest factor is input quality: sketch-based input generation consistently outperforms direct LLM generation by 0.6-2.1 pp and random fuzzing by up to 11.3 pp. (3) Thinking level interacts differently with selection families: deeper thinking improves majority voting by 12 pp but execution-based methods stay flat or degrade as candidate diversity falls. These results frame inference-time code selection as a signal-quality problem rather than an aggregation-rule problem: when oracles are unavailable, the behavioral evidence matters more than the aggregation rule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper empirically studies selection strategies for LLM code generation pipelines without access to complete oracles. It compares output-pattern majority voting against execution-based methods (MBR-Exec and the proposed SemanticVote, which clusters candidates via execution fingerprints on LLM-generated inputs) across 18 configurations varying models, thinking levels, and benchmarks. Key claims are that the best execution-based selector outperforms majority voting by 19-52 percentage points in every case, that aggregation rules become statistically indistinguishable once candidates are executed on diverse inputs, that input-generation quality (especially sketch-based) dominates performance, and that deeper thinking improves majority voting but leaves execution-based methods flat or worse.
Significance. If the reported performance gaps and interaction patterns hold under scrutiny, the work usefully reframes inference-time code selection as primarily a signal-quality problem rather than an aggregation-rule problem. The systematic head-to-head evaluation across multiple dimensions and the concrete demonstration that execution fingerprints provide stronger behavioral evidence than textual patterns are valuable for practitioners building LLM code pipelines. The paper's strength is its direct, reproducible-style empirical comparisons on standard benchmarks; no parameter-free derivations or machine-checked proofs are claimed.
major comments (2)
- [Abstract and §4] Abstract and §4 (results): the claim that 'SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations' is load-bearing for the conclusion that aggregation rule has limited effect, yet the abstract provides no mention of the statistical test, p-value threshold, or correction for multiple comparisons used to establish indistinguishability. Without these details the claim cannot be fully evaluated from the given text.
- [§3] §3 (methodology): the definition of 'execution fingerprints' and the clustering procedure in SemanticVote must be specified precisely (e.g., exact hash or distance metric on execution traces, handling of timeouts or errors) because the 19-52 pp gap rests on these fingerprints serving as a reliable proxy; any ambiguity here directly affects reproducibility of the central empirical result.
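To illustrate what the requested precision could look like, here is a hypothetical canonicalization step (not the paper's actual definition): raw execution outcomes are normalized to hashable tokens before they enter a fingerprint, so floats, errors, and timeouts compare consistently across candidates.

```python
import math

TIMEOUT, ERROR = "<timeout>", "<error:{}>"

def canonical(outcome):
    """Map one raw execution outcome to a hashable token:
    floats are rounded to absorb representation noise, errors collapse
    to their exception class name, and timeouts become a sentinel."""
    status, value = outcome
    if status == "timeout":
        return TIMEOUT
    if status == "error":
        return ERROR.format(value)          # value: exception class name
    if isinstance(value, float):
        return round(value, 9) if math.isfinite(value) else repr(value)
    return repr(value)
```

Pinning down choices like the rounding precision and the error granularity is what the referee is asking for, since any of them can merge or split clusters.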
minor comments (3)
- [Abstract] The abstract would benefit from naming the specific benchmarks and models used in the 18 configurations so readers can immediately gauge scope.
- [Results figures/tables] Table or figure captions should explicitly state whether error bars represent standard deviation, standard error, or confidence intervals, and whether results are averaged over multiple random seeds.
- [Abstract] Minor notation inconsistency: 'SemanticVote' is introduced in the abstract without a one-sentence formal description; a brief parenthetical would improve clarity for readers unfamiliar with the method.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address each major comment below and will incorporate the requested clarifications into the revised manuscript.
read point-by-point responses
-
Referee: [Abstract and §4] Abstract and §4 (results): the claim that 'SemanticVote, weighted voting, and MBR-Exec are statistically indistinguishable across all 18 configurations' is load-bearing for the conclusion that aggregation rule has limited effect, yet the abstract provides no mention of the statistical test, p-value threshold, or correction for multiple comparisons used to establish indistinguishability. Without these details the claim cannot be fully evaluated from the given text.
Authors: We agree that the abstract should reference the statistical procedure to allow readers to evaluate the indistinguishability claim directly. The paired statistical tests and multiple-comparison correction are described in §4; we will add a concise statement to the abstract summarizing the test, threshold, and correction method used. revision: yes
-
Referee: [§3] §3 (methodology): the definition of 'execution fingerprints' and the clustering procedure in SemanticVote must be specified precisely (e.g., exact hash or distance metric on execution traces, handling of timeouts or errors) because the 19-52 pp gap rests on these fingerprints serving as a reliable proxy; any ambiguity here directly affects reproducibility of the central empirical result.
Authors: We acknowledge that greater precision in §3 would improve reproducibility. We will expand the description of execution fingerprints and the SemanticVote clustering procedure to explicitly state the distance metric on traces, the fingerprint representation, and the handling of timeouts and error cases. revision: yes
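One way the promised clarification could be operationalized (illustrative only; the paper's actual procedure is in its §4, which is not reproduced here): a paired sign-flip permutation test on per-configuration accuracy differences, with a Holm step-down correction across the 18 comparisons.

```python
import random

def paired_permutation_p(diffs, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired accuracy differences."""
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = sum(
        abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)

def holm_reject(pvals, alpha=0.05):
    """Holm step-down correction: mark which comparisons stay significant."""
    order = sorted(range(len(pvals)), key=lambda i: pvals[i])
    reject = [False] * len(pvals)
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (len(pvals) - rank):
            reject[i] = True
        else:
            break
    return reject
```

Reporting the test family, alpha, and correction in the abstract, as the authors agree to do, would let readers verify the indistinguishability claim directly.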
Circularity Check
No significant circularity; purely empirical measurements
full rationale
The paper reports direct head-to-head pass-rate measurements on benchmark test suites across 18 configurations. No equations, derivations, or first-principles predictions appear; all reported improvements (e.g., execution-based selectors exceeding majority voting by 19-52 pp) are computed from the same external oracles used to score every method. Input-generation quality and thinking-level interactions are likewise observed outcomes, not fitted parameters renamed as predictions. The work is therefore self-contained against external benchmarks with no load-bearing self-citation chains or definitional reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: execution on LLM-generated inputs provides a reliable proxy for semantic similarity and correctness.
invented entities (1)
- SemanticVote (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (connection unclear); matched text: "SemanticVote clusters candidates by execution fingerprints on LLM-generated inputs... sketch-based input generation... exception-aware execution fingerprints"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction (connection unclear); matched text: "Three findings emerge. (1) The best execution-based selector exceeds output-pattern majority voting by 19-52 percentage points..."
discussion (0)