pith. sign in

arxiv: 2605.29059 · v1 · pith:ZPHZQUYInew · submitted 2026-05-27 · 💻 cs.SE · cs.AI· cs.CR

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

Pith reviewed 2026-06-29 10:24 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.CR
keywords smart contract decompilationLLM evaluationSoliditysemantic consistencybenchmarkblockchain securitydifferential replay
0
0 comments X

The pith

Frontier LLMs produce compilable smart contract code yet match original semantics in only 42 of 600 cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SCDBench as a dataset of 600 real-world Solidity contracts paired with bytecode and replayable semantic checkpoints, together with a four-stage evaluation pipeline. It tests several frontier LLMs in zero-shot decompilation and measures outputs first for format completeness, then compilability, ABI recovery, and finally behavioral equivalence through differential replay. The results establish that models frequently emit structured, compilable Solidity yet rarely preserve the original contract's observable behavior. The work further demonstrates that a same-model compilation-repair loop raises the number of semantically correct outputs at modest added cost. This evaluation framework supplies a reproducible basis for measuring progress on decompilation reliability.

Core claim

SCDBench supplies 600 contracts with bytecode inputs, ground-truth source, and replayable checkpoints. Decompiler outputs are scored cumulatively on format completeness, compilability, ABI recovery, and semantic consistency via differential replay. In zero-shot settings the strongest frontier model reaches perfect semantic consistency on only 42 contracts, while inserting a compilation-repair stage improves that count without large extra cost.

What carries the argument

The four cumulative evaluation stages that terminate in differential replay of semantic checkpoints to test behavioral equivalence between original and decompiled contracts.

If this is right

  • Semantic consistency, not merely compilability, must serve as the decisive quality metric for smart-contract decompilers.
  • A compilation-repair step can be inserted after initial generation to raise the fraction of semantically faithful outputs.
  • Reproducible multi-stage benchmarks enable direct comparison of future decompilation methods on the same contracts.
  • Applications that depend on recovered source for security analysis require higher semantic fidelity than current models deliver.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged benchmark could be reused to measure whether larger models or different training regimes close the remaining semantic gap.
  • Low semantic-consistency rates suggest that decompiled code cannot yet substitute for source in high-stakes contract audits without additional verification.
  • The same cumulative-check approach may transfer to other domains where generated code must preserve original observable behavior.

Load-bearing premise

The 600 contracts and their checkpoints represent the broader population of deployed smart contracts sufficiently well for performance numbers to generalize.

What would settle it

A decompiler that passes the semantic-consistency stage on substantially more than 42 of the 600 contracts under the identical four-stage protocol.

Figures

Figures reproduced from arXiv: 2605.29059 by Arthur Gervais, Dawn Song, Kaihua Qin.

Figure 1
Figure 1. Figure 1: Zero-shot decompilation prompt format. The model receives EVM assembly and optional [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Per-contract ABI recovery distributions by difficulty. Each stacked bar groups contracts by [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
read the original abstract

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces SCDBench, a benchmark dataset of 600 real-world Solidity contracts with paired bytecode, ground-truth source, and replayable semantic checkpoints. It proposes a four-stage cumulative evaluation (format completeness, compilability, ABI recovery, semantic consistency via differential replay) and reports zero-shot results for frontier LLMs (Claude Opus 4.7, GPT-5.3-Codex, GLM-5 variants) plus a compilation-repair setting, concluding that while structured and compilable output is often achieved, perfect semantic decompilation occurs for only 42/600 contracts in the best case.

Significance. SCDBench supplies a structured, multi-stage protocol with replayable checkpoints that supports reproducible assessment of semantic fidelity, a clear strength for the field. The demonstration that same-model compilation repair yields gains at modest cost is a practical contribution. The central generalization that semantic consistency remains far from solved, however, rests on the unverified representativeness of the 600-contract set.

major comments (1)
  1. [Dataset Construction] Dataset Construction section: explicit selection criteria, diversity statistics (contract complexity, opcode patterns, state-machine structure), and coverage arguments for the replayable semantic checkpoints are not provided. This is load-bearing for the headline claim, because the 42/600 figure and the generalization that semantic consistency is unsolved can be interpreted only if the contracts reflect the distribution of deployed Solidity usage.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of dataset transparency. We address the major comment below and will revise the manuscript to strengthen the presentation of the dataset.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: explicit selection criteria, diversity statistics (contract complexity, opcode patterns, state-machine structure), and coverage arguments for the replayable semantic checkpoints are not provided. This is load-bearing for the headline claim, because the 42/600 figure and the generalization that semantic consistency is unsolved can be interpreted only if the contracts reflect the distribution of deployed Solidity usage.

    Authors: We agree that additional detail on dataset construction is warranted to support the generalizability claims. In the revised manuscript we will expand the Dataset Construction section with: (1) explicit selection criteria (sourcing from verified Etherscan contracts meeting minimum transaction and verification thresholds, with exclusion rules for trivial or duplicate contracts); (2) diversity statistics including histograms or tables for contract complexity (LOC, function count), opcode pattern distributions, and state-machine characteristics (e.g., number of state variables and external calls); and (3) coverage arguments showing how the 600 contracts and their replayable checkpoints sample common deployed patterns (ERC standards, DeFi primitives, access-control logic). These additions will directly address the load-bearing concern for interpreting the 42/600 semantic-consistency result. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark results are direct empirical measurements on a newly introduced dataset with no fitted predictions or self-referential derivations.

full rationale

The paper presents SCDBench as an independent evaluation artifact consisting of 600 real-world contracts and a four-stage methodology (format completeness, compilability, ABI recovery, semantic consistency via differential replay). The headline result (42/600 perfect decompilations for the best model) is obtained by applying this methodology to frontier LLMs in a zero-shot setting. No equations, parameters fitted to subsets then re-predicted, or load-bearing self-citations appear in the provided text. The dataset construction and representativeness assumptions are stated explicitly as scope limitations rather than derived claims. The derivation chain is therefore self-contained as a measurement exercise and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark introduction paper; contains no fitted parameters, mathematical axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5793 in / 1161 out tokens · 29818 ms · 2026-06-29T10:24:02.581683+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    Decompiling smart contracts with a large language model.arXiv preprint arXiv:2506.19624,

    Isaac David, Liyi Zhou, Dawn Song, Arthur Gervais, and Kaihua Qin. Decompiling smart contracts with a large language model.arXiv preprint arXiv:2506.19624,

  2. [2]

    An empirical study of smart contract decompilers

    Xia Liu, Baojian Hua, Yang Wang, and Zhizhong Pan. An empirical study of smart contract decompilers. In 2023 IEEE international conference on Software Analysis, Evolution and Reengineering (SANER), pages 1–12. IEEE,

  3. [3]

    Ethereum: A secure decentralised generalised transaction ledger.Ethereum project yellow paper, 151(2014):1–32,

    Gavin Wood et al. Ethereum: A secure decentralised generalised transaction ledger.Ethereum project yellow paper, 151(2014):1–32,

  4. [4]

    GLM-5: from Vibe Coding to Agentic Engineering

    Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chenghua Huang, Chengxing Xie, et al. Glm-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,