An Execution-Verified Multi-Language Benchmark for Code Semantic Reasoning
Pith reviewed 2026-05-13 00:59 UTC · model grok-4.3
The pith
TraceEval supplies execution-verified call graphs from real programs so models can be tested on recovering actual runtime structure from source code.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TraceEval is a multi-language collection of 10,583 programs whose runtime call structures have been mechanically witnessed through execution tracing, providing the first large-scale, label-noise-free test of whether language models can recover a program's dynamic call relationships directly from source.
What carries the argument
The execution-verified call-graph recovery task, in which every retained edge must be observed during actual program runs with tracer validation.
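The paper's tracer is not reproduced in this rendering, but the notion of a "mechanically witnessed" edge is easy to make concrete: in Python, a profiling hook can record a caller→callee pair only when the call actually happens at runtime. A minimal sketch with illustrative names, not the authors' implementation:

```python
# Minimal sketch of witnessing call edges at runtime (not the paper's tracer).
# An edge is recorded only if the callee is actually invoked during execution.
import sys

def helper(x):
    return x * 2

def main(x):
    return helper(x) + 1

witnessed = set()

def profiler(frame, event, arg):
    # "call" events fire only for Python functions that really execute;
    # a full pipeline would additionally resolve fully qualified names.
    if event == "call" and frame.f_back is not None:
        witnessed.add((frame.f_back.f_code.co_name, frame.f_code.co_name))

sys.setprofile(profiler)
try:
    main(21)
finally:
    sys.setprofile(None)

print(witnessed)  # {('<module>', 'main'), ('main', 'helper')}
```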
If this is right
- Language models can be fine-tuned on the training split to improve recovery of dynamic call structure, closing most of the gap to the strongest zero-shot model.
- Any open-source repository can be turned into additional verified instances using the released pipeline, expanding the benchmark over time.
- Evaluations now isolate semantic understanding of runtime behavior from the separate skill of writing test-passing code.
- Performance on this task becomes a concrete, reproducible signal for comparing models on structural program reasoning.
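The exact scoring protocol is not spelled out in this rendering; one natural reading of the reported F1 numbers is set overlap between predicted and execution-verified (caller, callee) pairs. A sketch under that assumption (the paper's aggregation across programs and languages may differ):

```python
# Sketch of edge-set scoring, assuming F1 is computed over (caller, callee)
# pairs of fully qualified names. The empty/empty convention of 1.0 is a
# choice made here, not taken from the paper.
def edge_f1(predicted: set[tuple[str, str]], verified: set[tuple[str, str]]) -> float:
    if not predicted and not verified:
        return 1.0
    true_pos = len(predicted & verified)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(verified) if verified else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("main.main", "main.helper"), ("main.helper", "main.parse")}
pred = {("main.main", "main.helper"), ("main.main", "main.parse")}
print(round(edge_f1(pred, gold), 3))  # 0.5
```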
Where Pith is reading between the lines
- Success here may indicate broader capability to reason about other runtime properties such as data dependencies or exception flows.
- The same verification approach could be applied to measure whether generated code preserves intended call behavior in addition to passing tests.
- Repositories with richer dynamic behavior, such as those using heavy reflection or callbacks, would form natural stress tests for the current construction method.
Load-bearing premise
The LLM-assisted harness generation plus tracer validation captures complete and unbiased runtime call behavior for the chosen programs without missing dynamic paths or creating artifacts.
What would settle it
Executing the benchmark programs under new inputs or environments that activate additional call paths not recorded in the supplied traces would reveal whether the ground-truth labels are incomplete.
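One way to run such a check, sketched here with illustrative functions rather than benchmark programs: trace the same code under a new input and diff the witnessed edge sets against the recorded ones; any edge seen only under the new input is a missing label.

```python
# Sketch: the same program traced under two inputs can witness different
# edge sets; an edge observed under a new input but absent from the recorded
# ground truth would indicate incomplete labels. Names are illustrative.
import sys

def parse_int(s):
    return int(s)

def fallback(s):
    return 0

def main(s):
    return parse_int(s) if s.isdigit() else fallback(s)

def trace_edges(fn, *args):
    edges = set()
    def profiler(frame, event, arg):
        if event == "call" and frame.f_back is not None:
            edges.add((frame.f_back.f_code.co_name, frame.f_code.co_name))
    sys.setprofile(profiler)
    try:
        fn(*args)
    finally:
        sys.setprofile(None)
    return edges

recorded = trace_edges(main, "42")            # witnesses ('main', 'parse_int')
new_run = trace_edges(main, "not a number")   # witnesses ('main', 'fallback')
print(new_run - recorded)                     # edges missing from the recorded labels
```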
read the original abstract
Evaluating whether large language models (LLMs) can recover execution-relevant program structure, rather than only produce code that passes tests, remains an open problem. Existing code benchmarks emphasize test-passing outputs, from standalone programming tasks (HumanEval, MBPP, LiveCodeBench) to repository repair (SWE-Bench); this is useful, but offers limited diagnostic signal about which program semantics a model can recover from source. We introduce TraceEval, to our knowledge the first execution-verified, multi-language benchmark for code semantic reasoning: recovering a program's runtime call structure from source code. Unlike prior call-graph benchmarks that rely on static-tool output or hand-annotated ground truth, every positive edge in TraceEval is mechanically witnessed by validation execution, eliminating annotator disagreement and label noise for observed behavior. TraceEval consists of (i) 10,583 real-world programs (2,129 test, 8,454 train) extracted from 1,600+ open-source repositories across Python, JavaScript, and Java via an LLM-assisted harness-generation pipeline with tracer validation; and (ii) a reproducible pipeline that converts any open-source repository into new verified benchmark instances. We evaluate 10 LLMs at zero-shot on the held-out test split. The strongest model, Claude-Opus-4.6, reaches an average F1 of 72.9% across the three languages. To demonstrate the train split's utility as a supervision substrate, we fine-tune the Qwen2.5-Coder family on it: lifts of up to +55.6 F1 bring tuned Qwen2.5-Coder-32B to 71.2%, within 1.7 F1 of zero-shot Claude-Opus-4.6. We release the benchmark, pipeline, baselines, and a datasheet at https://github.com/yikun-li/TraceEva
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TraceEval, claimed as the first execution-verified multi-language benchmark for code semantic reasoning: recovering runtime call structure from source code. It extracts 10,583 real-world programs (2,129 test, 8,454 train) from 1,600+ repositories across Python, JavaScript, and Java using an LLM-assisted harness-generation pipeline with tracer validation; every positive call-graph edge is mechanically witnessed by execution. Zero-shot evaluation of 10 LLMs yields up to 72.9% average F1 (Claude-Opus-4.6); fine-tuning Qwen2.5-Coder models on the train split produces lifts up to +55.6 F1, reaching 71.2%. A reproducible pipeline for generating new verified instances from open-source repositories is released along with the benchmark, baselines, and datasheet.
Significance. If the ground-truth call structures prove complete and representative, TraceEval would be a meaningful contribution by supplying a large-scale, multi-language, execution-verified resource that directly measures semantic recovery rather than test-passing. The open pipeline, train split for supervision, and release of all artifacts would enable reproducible follow-up work and could become a standard diagnostic for code-understanding models.
major comments (1)
- [Dataset construction / harness-generation pipeline] The claim that the traces produced by the LLM-assisted harness-generation pipeline (with tracer validation) supply unbiased, complete runtime call structures for real-world programs rests on the unverified assumption that LLM-generated harnesses trigger all relevant entry points, conditionals, and callbacks. No quantitative coverage analysis is described (e.g., fraction of functions exercised, comparison against project-native test suites, or manual audit of missed dynamic edges); this directly undermines the assertion that the benchmark captures representative runtime behavior rather than only the subset exercised by the generated harnesses.
minor comments (1)
- [Abstract] The final sentence of the abstract contains a truncated repository name ('TraceEva' instead of 'TraceEval').
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The concern regarding dataset construction and coverage is well-taken, and we address it directly below. We have revised the paper to incorporate clarifications and additional discussion as described.
read point-by-point responses
Referee: Dataset construction (LLM-assisted harness-generation pipeline with tracer validation): the claim that the resulting traces supply unbiased, complete runtime call structures for real-world programs rests on the unverified assumption that LLM-generated harnesses trigger all relevant entry points, conditionals, and callbacks. No quantitative coverage analysis (e.g., fraction of functions exercised, comparison against project-native test suites, or manual audit of missed dynamic edges) is described; this directly undermines the assertion that the benchmark captures representative runtime behavior rather than only the subset exercised by the generated harnesses.
Authors: We appreciate the referee's point and agree that the harness-generation process does not guarantee exhaustive coverage of all possible execution paths. TraceEval does not claim to capture a complete or unbiased enumeration of every conceivable runtime call structure; rather, it supplies a large set of real-world programs for which every positive call-graph edge has been mechanically verified through execution of the generated harness. This design eliminates false-positive labels and annotator disagreement for the observed behaviors, which is the primary contribution relative to static or hand-annotated call-graph resources. The LLM-assisted harness pipeline is intended to produce inputs that exercise core program functionality (with tracer validation to confirm execution), but we acknowledge that coverage is harness-dependent and may miss some dynamic edges. In the revised manuscript we will (1) add an explicit limitations subsection clarifying the scope of the verified traces, (2) report quantitative coverage statistics (e.g., fraction of functions reached by the harnesses on the test split), and (3) note that the released pipeline enables community extensions for improved coverage. We believe these changes preserve the benchmark's utility for evaluating semantic recovery on verified runtime structures while addressing the concern.
Revision: yes
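The coverage statistic promised in point (2) admits a simple form. As an illustrative sketch only, not the authors' audit: compare the functions defined statically in a harness file against the callees actually observed when the harness runs.

```python
# Sketch of a function-coverage audit: fraction of functions defined in a
# harness source file that are reached during its execution. The traced
# callee set would come from a runtime tracer like the one sketched above.
import ast

def defined_functions(source: str) -> set[str]:
    """Names of all function/method definitions found statically in the source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def function_coverage(source: str, traced_callees: set[str]) -> float:
    """Fraction of statically defined functions observed as callees at runtime."""
    defined = defined_functions(source)
    return len(defined & traced_callees) / len(defined) if defined else 1.0

example = "def a():\n    b()\n\ndef b():\n    pass\n\ndef unused():\n    pass\n"
print(function_coverage(example, {"a", "b"}))  # ~0.667: 'unused' never runs
```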
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper constructs TraceEval by extracting programs from external open-source repositories and applying an LLM-assisted harness-generation pipeline followed by independent tracer validation to produce execution-witnessed call edges. No equations, fitted parameters, or predictions are defined in terms of the target outputs; the benchmark creation, LLM zero-shot evaluation, and fine-tuning experiments operate on externally sourced data with mechanical verification. The central claims rest on the reproducibility of the pipeline against real-world repositories rather than any self-referential reduction or self-citation load-bearing step.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Dynamic execution traces from instrumented runs accurately capture all relevant runtime call edges for the selected programs.
Reference graph
Works this paper leans on
The entries recovered under this heading are fragments of the paper's appendix prompts rather than external citations. Cleaned up, they read:

Evaluation prompt (A.1), which queries a model for the calls made by each ground-truth caller:
- For each question, list the function calls as a comma-separated list.
- Do not include additional explanations or commentary.
- Include both explicit and implicit function calls (e.g., __init__ when an object is created).
- If a function is called through an alias or variable, resolve it to the actual function being called.
- If a passed argument is not invoked within the function, do not include it.
- If there are no function calls, leave the answer empty.
- **IMPORTANT**: Always use fully qualified names with the module prefix. For example, use "main.MyClass.func" not "MyClass.func". The module name is the filename without extension (e.g., "main.py" -> "main").
- **Format for Answers**: provide your answer next to each question number; the prompt then gives a language-specific worked example (Python/JavaScript/Java), the provided code, the questions (one per ground-truth caller), and the answer slots.

Harness-generation prompt (A.2), which the corpus-construction stage (Section 2.1) sends to GPT-5.4 once per source file:
- **Preserve the call patterns**: keep the same function/method call relationships.
- **Remove all external dependencies**: replace imports with stubs or inline implementations.
- **Make it self-contained**: the code must run on its own with no external packages.
- **Add an entry point**: {entry_point_instruction}, filled in per language.
- **Keep it 15-40 lines**: simplify if needed, but preserve the call structure.
- **Use realistic names**: don't rename to func1/func2, keep meaningful names from the original.
- The prompt closes with the original code from {repo} and an output-format instruction: return ONLY the rewritten runnable code, no markdown, no explanation.
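To make the two prompts concrete, here is an illustrative toy harness of the kind the pipeline asks for, with an answer written in the required fully qualified format. It is not an actual TraceEval instance, and whether builtin calls are counted is not specified in the fragments above.

```python
# main.py -- toy harness in the style the prompt requests: self-contained,
# no external packages, meaningful names, an entry point, and the call
# structure preserved. Purely illustrative, not taken from the benchmark.
class Invoice:
    def __init__(self, amount):
        self.amount = amount

    def total_with_tax(self, rate):
        return apply_tax(self.amount, rate)

def apply_tax(amount, rate):
    return amount * (1 + rate)

def main():
    invoice = Invoice(100.0)          # implicit call to Invoice.__init__
    return invoice.total_with_tax(0.2)

if __name__ == "__main__":
    main()

# Expected answers in fully qualified form (module prefix = filename without
# extension):
#   main            -> main.Invoice.__init__, main.Invoice.total_with_tax
#   Invoice.total_with_tax -> main.apply_tax
```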