SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Arthur Gervais; Dawn Song; Kaihua Qin

arXiv: 2605.29059 · v1 · pith:ZPHZQUYInew · submitted 2026-05-27 · cs.SE · cs.AI · cs.CR

Pith reviewed 2026-06-29 10:24 UTC · model grok-4.3

keywords: smart contract decompilation, LLM evaluation, Solidity, semantic consistency, benchmark, blockchain security, differential replay

The pith

Frontier LLMs often produce compilable Solidity, yet the best model matches the original semantics on only 42 of 600 contracts (7%).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SCDBench as a dataset of 600 real-world Solidity contracts paired with bytecode and replayable semantic checkpoints, together with a four-stage evaluation pipeline. It tests several frontier LLMs in zero-shot decompilation and measures outputs first for format completeness, then compilability, ABI recovery, and finally behavioral equivalence through differential replay. The results establish that models frequently emit structured, compilable Solidity yet rarely preserve the original contract's observable behavior. The work further demonstrates that a same-model compilation-repair loop raises the number of semantically correct outputs at modest added cost. This evaluation framework supplies a reproducible basis for measuring progress on decompilation reliability.

Core claim

SCDBench supplies 600 contracts with bytecode inputs, ground-truth source, and replayable checkpoints. Decompiler outputs are scored cumulatively on format completeness, compilability, ABI recovery, and semantic consistency via differential replay. In zero-shot settings the strongest frontier model reaches perfect semantic consistency on only 42 contracts, while inserting a compilation-repair stage improves that count without large extra cost.
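The cumulative scoring described here can be illustrated with a minimal Python sketch. It assumes each contract yields four pass/fail outcomes and that a stage counts only when every earlier stage also passed. The names `StageResult`, `furthest_stage`, and `cumulative_counts` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Stage order as listed in the paper; the field names are assumptions.
STAGES = ("format", "compilable", "abi", "semantic")


@dataclass
class StageResult:
    format: bool
    compilable: bool
    abi: bool
    semantic: bool


def furthest_stage(result):
    """Count consecutive stages passed, stopping at the first failure."""
    passed = 0
    for name in STAGES:
        if not getattr(result, name):
            break
        passed += 1
    return passed


def cumulative_counts(results):
    """For each stage, count contracts that passed it and all earlier stages."""
    counts = {name: 0 for name in STAGES}
    for result in results:
        for name in STAGES[: furthest_stage(result)]:
            counts[name] += 1
    return counts


results = [
    StageResult(True, True, True, True),   # passes every stage
    StageResult(True, True, False, True),  # ABI failure blocks the semantic credit
    StageResult(False, True, True, True),  # format failure blocks everything
]
counts = cumulative_counts(results)
print(counts)  # {'format': 2, 'compilable': 2, 'abi': 1, 'semantic': 1}
```

Under this reading, the count at each stage can never exceed the count at the stage before it, which is what makes the final semantic figure the strictest one.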

What carries the argument

The four cumulative evaluation stages that terminate in differential replay of semantic checkpoints to test behavioral equivalence between original and decompiled contracts.
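As a rough illustration of differential replay, the sketch below replays the same call sequence against two implementations and compares what an outside caller can observe (return values and reverts). It is a toy model under stated assumptions: plain Python classes stand in for deployed contracts, Python exceptions stand in for EVM reverts, and the checkpoint format (a list of `(function, args)` pairs) is hypothetical. The paper's actual checkpoint format and replay harness are not described on this page.

```python
def observe(contract_factory, calls):
    """Replay calls on a fresh instance and record each observable outcome."""
    contract = contract_factory()
    trace = []
    for name, args in calls:
        try:
            trace.append(("ok", getattr(contract, name)(*args)))
        except Exception as exc:  # stand-in for an EVM revert
            trace.append(("revert", type(exc).__name__))
    return trace


def semantically_consistent(original, candidate, calls):
    """True when both implementations produce identical traces."""
    return observe(original, calls) == observe(candidate, calls)


class Token:
    """Toy stand-in for the original contract."""

    def __init__(self):
        self.balances = {"alice": 10}

    def transfer(self, src, dst, amount):
        if self.balances.get(src, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[src] -= amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        return True

    def balance_of(self, who):
        return self.balances.get(who, 0)


class TokenNoCheck(Token):
    """Toy stand-in for a decompiled contract that lost the balance check."""

    def transfer(self, src, dst, amount):
        self.balances[src] = self.balances.get(src, 0) - amount
        self.balances[dst] = self.balances.get(dst, 0) + amount
        return True


calls = [
    ("transfer", ("alice", "bob", 4)),
    ("balance_of", ("bob",)),
    ("transfer", ("alice", "bob", 100)),  # should revert in the original
]
print(semantically_consistent(Token, TokenNoCheck, calls))      # False
print(semantically_consistent(Token, TokenNoCheck, calls[:2]))  # True
```

The second result shows why checkpoint coverage matters: with the reverting call left out, the two toy implementations look identical.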

If this is right

Semantic consistency, not merely compilability, must serve as the decisive quality metric for smart-contract decompilers.
A compilation-repair step can be inserted after initial generation to raise the fraction of semantically faithful outputs.
Reproducible multi-stage benchmarks enable direct comparison of future decompilation methods on the same contracts.
Applications that depend on recovered source for security analysis require higher semantic fidelity than current models deliver.
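The compilation-repair step mentioned above can be sketched as a small loop: generate once, compile, and on failure hand the compiler errors back to the same model. This is a minimal sketch, not the paper's implementation. The callables `generate` and `compile_source`, the feedback tuple, and the `max_rounds` parameter are all hypothetical, and the fake model and compiler exist only to make the example runnable.

```python
def decompile_with_repair(bytecode, generate, compile_source, max_rounds=1):
    """Return (source, compiled_ok) after at most max_rounds repair attempts.

    generate(bytecode, feedback) -> source, where feedback is None on the
    first call and (previous_source, compiler_errors) on repair calls.
    compile_source(source) -> (ok, errors).
    """
    source = generate(bytecode, None)
    for _ in range(max_rounds):
        ok, errors = compile_source(source)
        if ok:
            return source, True
        # Same model, now shown its own output plus the compiler errors.
        source = generate(bytecode, (source, errors))
    ok, _ = compile_source(source)
    return source, ok


def fake_compile(source):
    """Hypothetical compiler stub: accepts anything containing 'contract'."""
    if "contract" not in source:
        return False, "expected contract definition"
    return True, ""


def fake_model(bytecode, feedback):
    """Hypothetical model stub: gets it right only after seeing errors."""
    if feedback is None:
        return "function f() public {}"
    return "contract C { function f() public {} }"


repaired, ok_repaired = decompile_with_repair("0x60", fake_model, fake_compile)
unrepaired, ok_unrepaired = decompile_with_repair(
    "0x60", fake_model, fake_compile, max_rounds=0
)
print(ok_repaired, ok_unrepaired)  # True False
```

Note that a repaired output has only cleared the compilability stage; it still has to pass ABI recovery and differential replay to count as semantically correct.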

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The staged benchmark could be reused to measure whether larger models or different training regimes close the remaining semantic gap.
Low semantic-consistency rates suggest that decompiled code cannot yet substitute for source in high-stakes contract audits without additional verification.
The same cumulative-check approach may transfer to other domains where generated code must preserve original observable behavior.

Load-bearing premise

The 600 contracts and their checkpoints represent the broader population of deployed smart contracts sufficiently well for performance numbers to generalize.

What would settle it

A decompiler that passes the semantic-consistency stage on substantially more than 42 of the 600 contracts under the identical four-stage protocol.

Figures

Figures reproduced from arXiv: 2605.29059 by Arthur Gervais, Dawn Song, Kaihua Qin.

**Figure 2.** Per-contract ABI recovery distributions by difficulty. Each stacked bar groups contracts by …

Original abstract

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCDBench brings a new paired dataset and four-stage semantic evaluation to LLM smart contract decompilation, but the 600-contract set's representativeness is the key unverified piece for the 42/600 claim.

The core point is that this paper supplies a concrete benchmark with 600 real-world contracts, each tied to bytecode, source, and replayable checkpoints, plus a staged evaluation that checks format, compilability, ABI, and finally semantic consistency through differential replay. That setup is new compared to earlier narrow-dataset studies and gives a clearer way to measure whether LLM outputs actually match the original contract behavior.

The work does a few things right. It evaluates multiple frontier models in zero-shot settings and tests a same-model compilation-repair variant, showing that repair lifts performance at modest extra cost. The cumulative stages make sense as a filter: many outputs get through the early checks but fail the replay test, which matches the abstract's observation that structured and compilable code is reachable while full semantic fidelity is not. The 42/600 perfect decompiles for the best model is a specific, falsifiable number that future work can target.

The main soft spot is dataset construction. The stress-test note flags that without explicit selection criteria, diversity statistics, or coverage arguments, it is unclear whether these 600 contracts reflect the range of deployed Solidity usage in complexity or state-machine patterns. If the set skews toward easier or harder cases, the headline gap does not yet generalize. The abstract calls them real-world with replayable checkpoints, but absent those details the representativeness claim stays provisional.

This is for researchers building or auditing LLM decompilers and for blockchain security groups that need reproducible evaluation. A reader working on smart-contract tooling or LLM code generation will find the methodology and dataset useful as a starting point. The paper deserves a serious referee because the benchmark artifact and staged protocol address a documented gap with measurable results, even if the dataset section needs tightening in revision.

Referee Report

1 major / 0 minor

Summary. The paper introduces SCDBench, a benchmark dataset of 600 real-world Solidity contracts with paired bytecode, ground-truth source, and replayable semantic checkpoints. It proposes a four-stage cumulative evaluation (format completeness, compilability, ABI recovery, semantic consistency via differential replay) and reports zero-shot results for frontier LLMs (Claude Opus 4.7, GPT-5.3-Codex, GLM-5 variants) plus a compilation-repair setting, concluding that while structured and compilable output is often achieved, perfect semantic decompilation occurs for only 42/600 contracts in the best case.

Significance. SCDBench supplies a structured, multi-stage protocol with replayable checkpoints that supports reproducible assessment of semantic fidelity, a clear strength for the field. The demonstration that same-model compilation repair yields gains at modest cost is a practical contribution. The central generalization that semantic consistency remains far from solved, however, rests on the unverified representativeness of the 600-contract set.

major comments (1)

[Dataset Construction] Explicit selection criteria, diversity statistics (contract complexity, opcode patterns, state-machine structure), and coverage arguments for the replayable semantic checkpoints are not provided. This is load-bearing for the headline claim: the 42/600 figure, and the generalization that semantic consistency is unsolved, can be interpreted only if the contracts reflect the distribution of deployed Solidity usage.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the importance of dataset transparency. We address the major comment below and will revise the manuscript to strengthen the presentation of the dataset.

Referee: [Dataset Construction] Explicit selection criteria, diversity statistics (contract complexity, opcode patterns, state-machine structure), and coverage arguments for the replayable semantic checkpoints are not provided. This is load-bearing for the headline claim: the 42/600 figure, and the generalization that semantic consistency is unsolved, can be interpreted only if the contracts reflect the distribution of deployed Solidity usage.

Authors: We agree that additional detail on dataset construction is warranted to support the generalizability claims. In the revised manuscript we will expand the Dataset Construction section with: (1) explicit selection criteria (sourcing from verified Etherscan contracts meeting minimum transaction and verification thresholds, with exclusion rules for trivial or duplicate contracts); (2) diversity statistics including histograms or tables for contract complexity (LOC, function count), opcode pattern distributions, and state-machine characteristics (e.g., number of state variables and external calls); and (3) coverage arguments showing how the 600 contracts and their replayable checkpoints sample common deployed patterns (ERC standards, DeFi primitives, access-control logic). These additions will directly address the load-bearing concern for interpreting the 42/600 semantic-consistency result.

Revision: yes
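To make the promised complexity statistics concrete, here is a rough Python sketch of how LOC and function count might be tallied per contract. It is only an approximation under stated assumptions: the regular expression does not parse Solidity properly (it ignores block comments, constructors, and fallback functions), and `contract_stats` is a hypothetical helper, not something the authors describe.

```python
import re

# Approximate match for named function definitions; not a real Solidity parser.
FUNC_RE = re.compile(r"\bfunction\s+\w+\s*\(")


def contract_stats(source):
    """Return non-blank, non-line-comment LOC and an approximate function count."""
    code_lines = [
        line
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("//")
    ]
    return {"loc": len(code_lines), "functions": len(FUNC_RE.findall(source))}


example = """// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

contract C {
    uint256 public x;
    function set(uint256 v) external { x = v; }
    function get() external view returns (uint256) { return x; }
}
"""
stats = contract_stats(example)
print(stats)  # {'loc': 6, 'functions': 2}
```

Aggregating such per-contract numbers into histograms would give readers one way to judge whether the 600 contracts skew toward simple or complex cases.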

Circularity Check

0 steps flagged

No circularity: benchmark results are direct empirical measurements on a newly introduced dataset with no fitted predictions or self-referential derivations.

The paper presents SCDBench as an independent evaluation artifact consisting of 600 real-world contracts and a four-stage methodology (format completeness, compilability, ABI recovery, semantic consistency via differential replay). The headline result (42/600 perfect decompilations for the best model) is obtained by applying this methodology to frontier LLMs in a zero-shot setting. No equations, parameters fitted to subsets then re-predicted, or load-bearing self-citations appear in the provided text. The dataset construction and representativeness assumptions are stated explicitly as scope limitations rather than derived claims. The derivation chain is therefore self-contained as a measurement exercise and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper; contains no fitted parameters, mathematical axioms, or newly postulated entities.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 1 internal anchor

[1] Isaac David, Liyi Zhou, Dawn Song, Arthur Gervais, and Kaihua Qin. Decompiling smart contracts with a large language model. arXiv preprint arXiv:2506.19624.

[2] Xia Liu, Baojian Hua, Yang Wang, and Zhizhong Pan. An empirical study of smart contract decompilers. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 1–12. IEEE, 2023.

[3] Gavin Wood et al. Ethereum: A secure decentralised generalised transaction ledger. Ethereum project yellow paper, 151(2014):1–32, 2014.

[4] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.
