pith. machine review for the scientific record.

arxiv: 2605.05282 · v1 · submitted 2026-05-06 · 💻 cs.PL · cs.CL

Recognition: unknown

Beyond BLEU: A Semantic Evaluation Method for Code Translation

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 15:35 UTC · model grok-4.3

classification 💻 cs.PL cs.CL
keywords code translation · semantic evaluation · BLEU · large language models · decompilation · binary lifting · compiler testing · program equivalence

The pith

A semantic score based on execution outcomes shows LLM decompilers far outperform heuristics, while BLEU scores barely correlate with whether translations actually run correctly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that code translation should be judged by whether the output program produces the same results as the input program on test cases, not by how similar the text looks. It borrows established compiler-testing practices to measure this, defining semantic correctness as the fraction of translations that pass execution checks. When applied to binary lifting from assembly to higher-level code, the method finds that fine-tuned LLMs achieve markedly higher semantic correctness than rule-based heuristic decompilers. The same experiments reveal that BLEU scores, which only track token overlap, show almost no statistical relationship to actual functional accuracy.

Core claim

We introduce a semantic correctness score defined as the proportion of translations that produce correct execution outcomes on a suite of test cases. Applying this score to LLM-based and heuristic decompilers shows that the LLM approaches significantly outperform the heuristic ones. BLEU scores exhibit negligible correlation with the semantic correctness score (r ranging from -0.127 to 0.354), indicating that syntactic similarity metrics fail to predict functional accuracy.

What carries the argument

The semantic correctness score, which counts the fraction of translated programs that match the original's output on a fixed set of execution test cases.
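
To make the definition concrete, here is a minimal sketch of how such a score could be computed by differential execution. The binary paths, the timeout, the pass criterion (matching exit code and stdout), and the choice to count a translation only if it passes every test are our assumptions about one plausible reading, not the paper's actual harness.

    import subprocess

    def outputs_match(orig_bin, lifted_bin, test_input, timeout=5):
        """Run original and lifted binaries on one input; compare observable output."""
        def run(binary):
            return subprocess.run([binary], input=test_input, capture_output=True,
                                  text=True, timeout=timeout)
        try:
            a, b = run(orig_bin), run(lifted_bin)
        except subprocess.TimeoutExpired:
            return False  # treat a hang as a failed check (an assumption)
        return (a.returncode, a.stdout) == (b.returncode, b.stdout)

    def semantic_correctness_score(pairs, test_inputs):
        """Fraction of (original, lifted) binary pairs that agree on every test input."""
        passed = sum(all(outputs_match(o, t, x) for x in test_inputs) for o, t in pairs)
        return passed / len(pairs)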

If this is right

  • Evaluation protocols for code translation should replace or supplement BLEU with execution-based semantic checks.
  • LLM-based binary lifters are functionally superior to current heuristic decompilers when correctness is measured by runtime behavior.
  • Training objectives that optimize only for syntactic similarity are unlikely to improve the actual utility of generated code.
  • Compiler-testing techniques transfer directly to assessing neural code generators.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same execution-based scoring could be applied to other code-generation tasks such as synthesis or repair without requiring new metrics.
  • Developers of code models might benefit from reinforcement learning signals drawn directly from test-suite outcomes rather than from token-level losses.
  • The current low correlation implies that benchmark suites relying solely on BLEU may systematically mis-rank translation systems.
  • Extending the test-case generator to produce more diverse inputs could strengthen the metric's ability to catch subtle semantic mismatches.

Load-bearing premise

The chosen test cases and execution environment are sufficient to establish semantic equivalence for the evaluated translations.

What would settle it

Finding a large collection of code translations in which BLEU scores show a strong positive correlation with the proportion that pass execution tests, or discovering a translation that passes the test suite yet produces observably different behavior on new inputs.
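
The second disproof condition is mechanically checkable. A hedged sketch of that search follows; the input generator and the divergence criterion are assumptions on our part, not the paper's protocol.

    import random
    import subprocess

    def fresh_inputs(n=1000, seed=0):
        """Inputs beyond the fixed suite: boundary values plus random 32-bit integers."""
        rng = random.Random(seed)
        boundary = [0, 1, -1, 2**31 - 1, -(2**31)]
        return [f"{v}\n" for v in boundary] + \
               [f"{rng.randint(-(2**31), 2**31 - 1)}\n" for _ in range(n)]

    def find_divergence(orig_bin, lifted_bin, inputs, timeout=5):
        """Return the first input on which the two programs observably disagree, else None."""
        for x in inputs:
            try:
                a = subprocess.run([orig_bin], input=x, capture_output=True,
                                   text=True, timeout=timeout)
                b = subprocess.run([lifted_bin], input=x, capture_output=True,
                                   text=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return x  # a hang on either side is treated as divergence (an assumption)
            if (a.returncode, a.stdout) != (b.returncode, b.stdout):
                return x
        return None

Any non-None return from such a search would be exactly the kind of counterexample described above: a translation that passes the fixed suite yet behaves differently on new inputs.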

Figures

Figures reproduced from arXiv: 2605.05282 by Amir Molzam Sharifloo, Julius Näumann, Mira Mezini, Sven Keidel.

Figure 1. An overview of the evaluation framework.
Figure 2. BLEU similarity scores comparing original assembly to round-trip assembly (compiled from lifted source).
Original abstract

Code translation is one of the core capabilities of LLMs. However, evaluating the correctness of translations remains difficult, as commonly used metrics such as BLEU measure only syntactic similarity, disregarding program semantics. We propose a novel evaluation methodology for code translation tasks, emphasizing semantic equivalence over surface-level string similarity. Our approach applies established compiler testing methodology to a new domain, allowing the assessment of an LLM fine-tuned for binary lifting tasks (i.e. decompiling binaries to higher-level representations). We introduce a semantic correctness score, defined as the proportion of translations that produce correct execution outcomes, and demonstrate its application by evaluating LLM-based and heuristic decompilers. Our findings show that LLM-based approaches significantly outperform heuristic ones, while BLEU scores show negligible correlation with semantic correctness (r = -0.127 to 0.354), demonstrating that syntactic metrics fail to predict functional accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a semantic evaluation methodology for code translation tasks, with emphasis on binary lifting/decompilation. It defines a semantic correctness score as the proportion of translations producing correct execution outcomes on test cases, applies this to compare LLM-based decompilers against heuristic ones, and reports that LLMs significantly outperform heuristics while BLEU scores exhibit negligible correlation with semantic correctness (r ranging from -0.127 to 0.354).

Significance. If the test harness proves robust, the work could meaningfully advance evaluation practices in LLM-based code generation by prioritizing functional equivalence over syntactic similarity. Demonstrating BLEU's poor predictive power for semantic outcomes is a useful contribution that could influence benchmarking standards, provided the execution-based proxy is shown to be reliable through adequate coverage and controls.

major comments (2)
  1. Abstract: The abstract invokes 'established compiler testing methodology' but supplies no coverage metrics, test-suite size, or differential-testing protocol. This is load-bearing for the central claim, as the semantic correctness score (defined as the observed proportion of matching executions) is only as valid as the test cases' ability to detect semantic differences in decompiled binaries; without these details, the reported outperformance of LLM approaches and the correlation findings cannot be properly assessed.
  2. Evaluation and results sections: The correlation values (r = -0.127 to 0.354) between BLEU and semantic scores are presented without sample sizes, p-values, confidence intervals, or controls for confounding factors such as test case selection bias. This weakens the conclusion that syntactic metrics fail to predict functional accuracy, as the strength of the 'negligible correlation' claim depends on statistical rigor.
minor comments (1)
  1. The abstract and methodology would benefit from explicit mention of the programming languages, binary formats, and number of test cases used to ground the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper to improve transparency and statistical reporting while preserving the core contributions.

Point-by-point responses
  1. Referee: Abstract: The abstract invokes 'established compiler testing methodology' but supplies no coverage metrics, test-suite size, or differential-testing protocol. This is load-bearing for the central claim, as the semantic correctness score is only as valid as the test cases' ability to detect semantic differences.

    Authors: We agree the abstract should be more self-contained. In the revision we will add a concise summary of the test harness, including test-suite size, the differential-testing approach of comparing execution outcomes on identical inputs, and coverage metrics from the evaluation section. These details are already present in the body of the paper; we will simply surface them in the abstract to allow readers to assess the validity of the semantic score without needing to read further. revision: yes

  2. Referee: Evaluation and results sections: The correlation values (r = -0.127 to 0.354) between BLEU and semantic scores are presented without sample sizes, p-values, confidence intervals, or controls for confounding factors such as test case selection bias.

    Authors: We accept that the statistical presentation can be strengthened. The revision will report the exact sample sizes used for each correlation, include p-values and 95% confidence intervals, and add a short discussion of test-case selection (random sampling from a larger pool) together with a brief sensitivity check. These additions will be placed in the results section and will not alter the reported correlation ranges or the conclusion that syntactic metrics are poor predictors of functional accuracy. revision: yes
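
For context, the kind of reporting promised here is straightforward to produce. A sketch, assuming SciPy is available and using a Fisher-z interval as one standard choice rather than necessarily the authors':

    import math
    from scipy import stats

    def pearson_with_ci(bleu_scores, semantic_scores, alpha=0.05):
        """Pearson r with p-value and a Fisher-z confidence interval."""
        r, p = stats.pearsonr(bleu_scores, semantic_scores)
        n = len(bleu_scores)
        z = math.atanh(r)                 # Fisher transformation
        se = 1.0 / math.sqrt(n - 3)       # standard error of z
        crit = stats.norm.ppf(1 - alpha / 2)
        lo, hi = math.tanh(z - crit * se), math.tanh(z + crit * se)
        return {"n": n, "r": r, "p": p, "ci": (lo, hi)}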

Circularity Check

0 steps flagged

No circularity: semantic score is an explicit operational definition

full rationale

The paper introduces the semantic correctness score by direct definition: the observed proportion of translations that match on execution outcomes under the chosen test cases. It is not derived from any fitted parameter, self-referential equation, or prior result within the paper. The reported comparisons (LLM vs. heuristic performance) and the correlation coefficients with BLEU are straightforward empirical measurements on the same execution data, not quantities that reduce by construction to their own inputs. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing way in the provided text. The claims therefore rest on external execution evidence rather than on a circular internal derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that execution outcome matching on selected tests serves as a valid proxy for semantic equivalence.

axioms (1)
  • domain assumption: Execution outcomes on test cases determine semantic equivalence
    Directly invoked to define the semantic correctness score.

pith-pipeline@v0.9.0 · 5453 in / 1014 out tokens · 47599 ms · 2026-05-08T15:35:20.395368+00:00 · methodology

discussion (0)

