arXiv: 2605.09610 · v1 · submitted 2026-05-10 · 💻 cs.MA · cs.AI · cs.CE · cs.LG · cs.PL · cs.SE

Recognition: 2 theorem links

· Lean Theorem

SmartEval: A Benchmark for Evaluating LLM-Generated Smart Contracts from Natural Language Specifications

Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah

Pith reviewed 2026-05-12 04:23 UTC · model grok-4.3

classification 💻 cs.MA · cs.AI · cs.CE · cs.LG · cs.PL · cs.SE

keywords: smart contracts, LLM evaluation, Solidity, benchmark, natural language specifications, failure modes, code quality, state machine correctness


The pith

A new benchmark shows LLM-generated smart contracts score 8.29 points higher than expert ground-truth versions because of literal specification adherence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SmartEval as a benchmark to evaluate Solidity smart contracts that LLMs produce from natural language specifications. It supplies a corpus of 9,000 generated contracts paired with expert ground-truth implementations, a five-dimensional rubric for scoring, and a full generation-and-evaluation pipeline. Validation comes from ablation experiments, human expert review showing alignment within 0.34 points, and 79.4 percent agreement with a static analyzer. Analysis of the full set identifies repeated failure modes such as logic omissions and state transition errors while documenting the overall scoring edge for the generated contracts. This setup gives researchers a reproducible way to measure how well current LLMs handle contract synthesis tasks.

Core claim

SmartEval establishes a validated benchmark that, through systematic scoring of 9,000 LLM-generated contracts against ground-truth implementations, identifies characteristic failure modes including logic omissions at 35.3 percent and state transition errors at 23.4 percent, while recording a +8.29 composite-score advantage for the generated contracts attributable to LLMs' literal following of the input specifications.

What carries the argument

The five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, applied through an automated pipeline to contracts drawn from the FSMSCG dataset.

If this is right

Generated contracts receive higher overall scores than human-written ground truth but still exhibit systematic gaps in logic completeness and state management.
Contract quality degrades as the complexity of the underlying natural language specification increases.
The benchmark and its validation studies provide a foundation for empirical work on improving LLM performance in smart contract generation.
Literal specification following explains both the scoring advantage and the specific error patterns observed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams building smart contracts could use LLMs for initial drafts to reduce deviation from stated requirements before manual review.
The documented failure modes point to specific areas for model fine-tuning, such as better handling of state transitions in code.
The same evaluation approach could be adapted to test LLM code generation in other domains that rely on formal or semi-formal specifications.

Load-bearing premise

The five-dimensional rubric and automated pipeline produce scores that reflect real-world smart contract quality and security.

What would settle it

A test where contracts scoring high under the benchmark are deployed on a public testnet and exhibit security breaches or functional failures at rates that contradict the benchmark ratings.

Figures

Figures reproduced from arXiv: 2605.09610 by Abhinav Goel, Agostino Capponi, Alfio Gliozzo, Chaitya Shah.

**Figure 2.** Grade distribution across all 9,000 generated con…

**Figure 3.** Approximate score distributions for LLM-generated…

**Figure 5.** Left: Breakdown of error modes in the 2,398 lower…

**Figure 4.** Radar chart of per-metric quality profiles for…

**Figure 6.** Compilation rate increases monotonically as…

**Figure 7.** Unified per-metric comparison across three reference points: the full 9,000-contract evaluation (teal), ablation…

Original abstract

We introduce SmartEval, a benchmark for systematically evaluating the quality of Solidity smart contracts generated by large language models (LLMs) from natural language specifications. SmartEval provides a corpus of 9,000 generated contracts paired with expert-written ground-truth implementations drawn from the FSMSCG dataset, a five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality, and a reproducible generation-and-evaluation pipeline. To validate the benchmark's reliability, we conduct three independent empirical studies: a five-condition ablation study (N=300 per condition) isolating the contribution of each pipeline component, a human expert evaluation by three Columbia University PhD researchers confirming automated scores align with expert judgment to within 0.34 points, and external security analysis via the Slither static analyzer confirming 79.4% agreement between the LLM auditor and a non-LLM rule-based tool. Systematic analysis of 9,000 generated contracts reveals characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, and complexity-driven degradation) and quantifies a +8.29 composite-score advantage of generated contracts over ground-truth implementations, attributable to LLMs' literal specification-following behavior. SmartEval establishes a reproducible, validated foundation for empirical research on LLM smart contract synthesis quality, with all data, evaluation code, and generated contracts publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SmartEval gives a new public benchmark and corpus for LLM Solidity generation with decent human and Slither validation, but the +8.29 advantage interpretation rests on a rubric that may under-weight security.


SmartEval is a benchmark for scoring LLM-generated Solidity contracts against natural language specs. The authors supply a 9,000-contract corpus drawn from FSMSCG, a five-dimensional rubric, and a full generation-plus-evaluation pipeline that they test with ablations, human raters, and Slither cross-checks. They also release all artifacts publicly and report concrete failure rates such as logic omissions at 35.3% and state-transition errors at 23.4% plus a +8.29 composite-score edge for the generated contracts over the ground-truth ones. The human experts align with the automated scores to within 0.34 points and Slither agrees 79.4% of the time.

Those steps are the parts that actually move the needle. The ablation study helps isolate pipeline effects, and the public data lets others rerun or extend the work without starting from scratch.

The softer spot is the headline advantage claim. The rubric emphasizes completeness, fidelity, and code quality but does not explicitly score security properties. With only 79.4% Slither agreement, the higher scores could reflect tighter literal adherence while still missing defects that matter in deployment. A breakdown of where the two tools diverge on vulnerabilities would have made the quality interpretation tighter.

This is useful for researchers who build or evaluate LLM tools for smart-contract code. Anyone who needs a ready corpus and failure-mode taxonomy in this narrow domain will find concrete material here. It is worth sending to peer review because the benchmark itself and the validation steps are new and reproducible enough to justify referee attention, even if the security reading of the scores needs more evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SmartEval, a benchmark for assessing the quality of smart contracts generated by LLMs from natural language specifications. It comprises a corpus of 9,000 generated contracts and corresponding expert-written ground-truth implementations from the FSMSCG dataset, a five-dimensional rubric (functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, code quality), and an automated evaluation pipeline. The authors validate the benchmark through an ablation study (N=300 per condition), human expert evaluation by three PhD researchers (alignment within 0.34 points), and external analysis with the Slither tool (79.4% agreement). Key findings include characteristic failure modes (logic omissions at 35.3%, state transition errors at 23.4%, complexity-driven degradation) and a +8.29 composite score advantage for LLM-generated contracts over ground-truth, attributed to LLMs' literal adherence to specifications. All artifacts are publicly released.

Significance. Should the central claims hold after addressing the comments below, this paper would offer a valuable, reproducible benchmark for the growing field of LLM-assisted smart contract development. The quantified failure modes and the public release of the full dataset, code, and generated contracts are particular strengths that enable follow-on empirical research. The work helps establish empirical baselines for LLM performance in a high-stakes domain where correctness is critical.

major comments (2)

[§4.3] External security analysis: The 79.4% agreement between the LLM auditor and Slither is reported without a per-vulnerability breakdown, false-negative rates on security issues, or analysis of cases where the rubric and Slither diverge. Because the five rubric dimensions do not explicitly target security properties, this leaves open the possibility that the +8.29 composite-score advantage reflects literal spec adherence while missing defects that would reverse the quality interpretation.
[§5] Results: The +8.29 composite-score advantage and failure-mode percentages (35.3% logic omissions, 23.4% state transition errors) are presented without raw per-dimension scores, the exact aggregation formula for the five rubric dimensions, statistical significance tests, or data-exclusion criteria for the 9,000-contract corpus. These omissions make it impossible to verify whether post-hoc choices affect the headline claims.
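
The per-vulnerability breakdown requested in the first comment could be tabulated roughly as follows. This is a minimal sketch, not the paper's pipeline: the record layout `(category, llm_flagged, slither_flagged)`, the function names, and the toy data are all hypothetical, and nothing here calls Slither itself.

```python
from collections import defaultdict


def agreement_breakdown(records):
    """Per-category agreement between two detectors.

    records: iterable of (category, llm_flagged, slither_flagged) tuples.
    This layout is an assumption made for illustration only.
    """
    stats = defaultdict(lambda: {"n": 0, "agree": 0, "llm_missed": 0})
    for category, llm_flagged, slither_flagged in records:
        row = stats[category]
        row["n"] += 1
        if llm_flagged == slither_flagged:
            row["agree"] += 1
        elif slither_flagged and not llm_flagged:
            # Slither flags an issue that the LLM auditor did not report.
            row["llm_missed"] += 1
    return {
        cat: {
            "n": row["n"],
            "agreement": row["agree"] / row["n"],
            "llm_missed": row["llm_missed"],
        }
        for cat, row in stats.items()
    }


def overall_agreement(records):
    """Fraction of records on which both detectors give the same verdict."""
    records = list(records)
    return sum(1 for _, a, b in records if a == b) / len(records)


# Hypothetical toy data, not taken from the paper.
toy = [
    ("reentrancy", True, True),
    ("reentrancy", False, True),
    ("unchecked-call", False, False),
    ("unchecked-call", True, True),
    ("unchecked-call", True, False),
]
breakdown = agreement_breakdown(toy)
overall = overall_agreement(toy)
```

A single headline figure such as 79.4% corresponds to `overall_agreement`; the per-category rows show whether disagreement is concentrated in the vulnerability classes that matter most.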

minor comments (2)

[§4.1] The ablation study description would benefit from an explicit table listing the five conditions and their individual contributions to the final scores.
[§3.2] Notation for the composite score in the rubric definition could be clarified with an equation showing how the five dimensions are combined.
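
One form the requested equation could take, assuming the aggregation that the simulated rebuttal later describes (an unweighted arithmetic mean of the five normalized dimension scores). The symbols $s_d$, $S$, and $\Delta$ are introduced here for illustration and are not the paper's notation.

```latex
% s_d: normalized score on rubric dimension d (d = 1, ..., 5)
% S:   composite score of a single contract
S = \frac{1}{5} \sum_{d=1}^{5} s_d

% Delta: composite-score advantage over N contract pairs,
% where S_i^{gen} and S_i^{gt} are the composites of the
% generated and ground-truth contract for specification i
\Delta = \frac{1}{N} \sum_{i=1}^{N} \left( S_i^{\mathrm{gen}} - S_i^{\mathrm{gt}} \right)
```

Under this reading, the reported +8.29 would be the value of $\Delta$; a weighted rubric would change both the composite and the size of the gap.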

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on SmartEval. The comments identify areas where additional transparency will strengthen the manuscript. We address each major comment below and have revised the relevant sections accordingly.


Referee: [§4.3] External security analysis: The 79.4% agreement between the LLM auditor and Slither is reported without a per-vulnerability breakdown, false-negative rates on security issues, or analysis of cases where the rubric and Slither diverge. Because the five rubric dimensions do not explicitly target security properties, this leaves open the possibility that the +8.29 composite-score advantage reflects literal spec adherence while missing defects that would reverse the quality interpretation.

Authors: We agree that the external validation section would benefit from greater detail. In the revised manuscript we add a per-vulnerability breakdown of agreement rates, report false-negative cases where Slither flags issues missed by the LLM auditor, and analyze the specific divergence instances. We also clarify that the five rubric dimensions are intentionally scoped to functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality rather than explicit security properties; the +8.29 advantage is therefore measured only on those dimensions. The Slither comparison serves as an independent check on the automated pipeline rather than a comprehensive security audit.

Revision: yes

Referee: [§5] Results: The +8.29 composite-score advantage and failure-mode percentages (35.3% logic omissions, 23.4% state transition errors) are presented without raw per-dimension scores, the exact aggregation formula for the five rubric dimensions, statistical significance tests, or data-exclusion criteria for the 9,000-contract corpus. These omissions make it impossible to verify whether post-hoc choices affect the headline claims.

Authors: We accept that the results presentation requires additional detail for reproducibility. The revised §5 now includes the raw per-dimension scores for both LLM-generated and ground-truth contracts, states the aggregation formula explicitly (unweighted arithmetic mean of the five normalized dimension scores), reports paired t-test results with p-values for the composite-score difference, and documents the data-exclusion criteria (non-compiling contracts and those exceeding the 8k-token generation limit, which removed 4.7% of the initial corpus). These additions enable direct verification of the reported figures and failure-mode statistics.

Revision: yes
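
The exclusion rule and paired comparison described in this response can be sketched with the standard library alone. Everything below is illustrative: the dictionary keys `compiles` and `tokens`, the constant `TOKEN_LIMIT` (the response says only "8k"), the function names, and the toy scores are assumptions, and the sketch stops at the t statistic rather than computing a p-value.

```python
import math
from statistics import mean, stdev

# Assumed exact value; the response states only an "8k-token" limit.
TOKEN_LIMIT = 8_000


def keep(contract):
    """Exclusion rule: drop non-compiling or over-length contracts.

    contract: dict with hypothetical keys 'compiles' (bool) and 'tokens' (int).
    """
    return contract["compiles"] and contract["tokens"] <= TOKEN_LIMIT


def paired_t(generated, ground_truth):
    """Paired t statistic for composite scores of matched contracts.

    Returns (mean difference, t statistic, degrees of freedom).
    """
    if len(generated) != len(ground_truth):
        raise ValueError("paired samples must have equal length")
    diffs = [g - t for g, t in zip(generated, ground_truth)]
    n = len(diffs)
    d_bar = mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    se = stdev(diffs) / math.sqrt(n)
    return d_bar, d_bar / se, n - 1


# Hypothetical toy data, not taken from the paper.
corpus = [
    {"compiles": True, "tokens": 1200},
    {"compiles": False, "tokens": 900},
    {"compiles": True, "tokens": 9000},
]
kept = [c for c in corpus if keep(c)]

gen_scores = [80.0, 85.0, 90.0, 75.0]
gt_scores = [70.0, 80.0, 85.0, 70.0]
mean_diff, t_stat, dof = paired_t(gen_scores, gt_scores)
```

A paired design fits here because each generated contract has a ground-truth counterpart for the same specification, so the test operates on per-specification differences rather than on two independent samples.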

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines a new benchmark (SmartEval) with an explicit five-dimensional rubric and applies it to measure scores on 9,000 LLM-generated contracts versus expert ground-truth implementations from an external dataset. The headline +8.29 composite advantage and failure-mode statistics are direct empirical outputs of this rubric applied to the generated corpus. Reliability is checked via separate human-expert alignment (Columbia PhD researchers) and agreement with the independent external Slither static analyzer; neither validation step is defined in terms of the rubric scores themselves nor reduces to a self-citation or fitted parameter. No equations, predictions, or uniqueness claims collapse by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the five-dimensional rubric captures the essential qualities of correct smart contracts and that human expert agreement validates the automated scores.

axioms (1)

Domain assumption: The five-dimensional rubric (functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, code quality) accurately measures smart-contract quality.
Invoked throughout the evaluation pipeline and used to compute the composite score advantage.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear
five-dimensional evaluation rubric covering functional completeness, variable fidelity, state-machine correctness, business-logic fidelity, and code quality
IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear
severity-gated reinforcement loop and multi-agent pipeline
