pith. machine review for the scientific record.

arxiv: 2603.22048 · v4 · submitted 2026-03-23 · 💻 cs.SE


Dynamic analysis enhances issue resolution

Mingwei Liu, Yanlin Wang, Zheng Pei, Zhenxi Chen, Zibin Zheng, Zihao Wang


Pith reviewed 2026-05-15 00:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords: automated program repair · dynamic analysis · LLM-based agents · runtime tracing · fault localization · software debugging · SWE-bench benchmark

The pith

Dynamic analysis embedded in LLM agents resolves 79.4% of code issues

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DAIRA, a framework that integrates dynamic analysis into automated repair agents powered by large language models. Existing approaches depend on static analysis and shallow execution signals, so agents speculate about root causes in complex code with polymorphic flows. DAIRA captures runtime data such as call stacks and variable states through lightweight tracing tools and structures it into reports that reveal execution paths, enabling the agent to perform deterministic fault localization instead of broad exploration. On the SWE-bench Verified benchmark it achieves a state-of-the-art 79.4% resolution rate and shows particular strength on hard logical defects.
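Runtime capture of this kind can be sketched in a few lines of Python. This is an illustrative tracer built on the standard library's `sys.settrace`, not DAIRA's actual tooling (the paper's reference list points to the Hunter tracing toolkit); `buggy_scale` is a made-up stand-in for code under repair.

```python
import sys

def make_tracer(events):
    """Record call/return events with the function's location and a
    snapshot of its local variables (stringified for safe reporting)."""
    def tracer(frame, event, arg):
        if event in ("call", "return"):
            events.append({
                "event": event,
                "function": frame.f_code.co_name,
                "file": frame.f_code.co_filename,
                "line": frame.f_lineno,
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            })
        return tracer  # keep tracing inside this frame
    return tracer

def buggy_scale(values, factor):
    # stand-in for the code under repair
    return [v * factor for v in values]

events = []
sys.settrace(make_tracer(events))
buggy_scale([1, 2], 3)
sys.settrace(None)
# events now holds call/return records: function, file, line, local state
```

A production tracer would also cap recording depth and filter out library frames, in line with the paper's stated goal of condensing evidence before it reaches the agent.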

Core claim

DAIRA is an automated repair framework that deeply embeds dynamic analysis into the agent's decision loop via a Test Tracing-Driven workflow. It uses lightweight tools to capture runtime evidence including call stacks and variable states, then converts this into structured semantic reports. These reports illuminate execution paths and causal dependencies, shifting the agent from speculative reasoning to precise, deterministic inference while avoiding irrelevant code retrievals that flood the context window.

What carries the argument

The Test Tracing-Driven workflow: lightweight dynamic analysis tools capture runtime evidence and convert it into structured semantic reports that guide the LLM agent's repair process.
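The review does not fix a schema for these "structured semantic reports"; a minimal sketch under that caveat, assuming raw trace events shaped like `{event, function, file, line, locals}` and a JSON-like report of execution path plus per-function variable state:

```python
def build_trace_report(events, max_frames=20):
    """Condense raw trace events into a compact report an agent can read.

    The field names (execution_path, variable_states) are assumptions,
    not the paper's schema; capping frames mirrors the stated goal of
    avoiding context-window flooding.
    """
    calls = [e for e in events if e["event"] == "call"][:max_frames]
    return {
        "execution_path": [
            f'{e["function"]} ({e["file"]}:{e["line"]})' for e in calls
        ],
        "variable_states": {e["function"]: e["locals"] for e in calls},
        "depth": len(calls),
    }

# hypothetical events from tracing a failing test
events = [
    {"event": "call", "function": "parse", "file": "app.py", "line": 10,
     "locals": {"raw": "'1,2'"}},
    {"event": "call", "function": "to_int", "file": "app.py", "line": 22,
     "locals": {"s": "'1'"}},
]
report = build_trace_report(events)
# report["execution_path"][0] == "parse (app.py:10)"
```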

If this is right

  • Precise fault localization becomes possible even in intricate polymorphic control flows and implicit type issues.
  • Resolution rate reaches 79.4% on SWE-bench Verified and 44.4% on the most demanding tasks.
  • Unique resolution of complex edge cases that static-analysis baselines cannot handle.
  • Inference costs drop by approximately 10% and input token consumption by 25% across different LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dynamic tracing could be applied to other LLM agent tasks involving stateful reasoning beyond code repair.
  • Lightweight tools open the door to integrating this approach into interactive development environments for real-time assistance.
  • The shift to runtime evidence might generalize to reduce token waste in long-horizon agent interactions.
  • Future work could explore combining this with static methods for even broader coverage of defect types.

Load-bearing premise

Lightweight dynamic analysis tools can reliably capture and convert runtime evidence into structured reports that enable precise fault localization without introducing overhead or misleading the LLM agent in complex control flows.
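One way this premise could hold in practice is by scoping tracing to the repository under repair; a sketch of such a frame filter, where the `project_root` default is a hypothetical path:

```python
def keep_frame(filename, project_root="/workspace/project"):
    """Decide whether a traced frame belongs to the code under repair.

    Dropping stdlib and dependency frames is one plausible reading of
    'lightweight': it bounds both tracing overhead and report size.
    """
    if not filename.startswith(project_root):
        return False  # interpreter internals, stdlib, external packages
    return "/site-packages/" not in filename  # skip deps vendored in-repo

# a sys.settrace tracer would return None for frames this filter rejects,
# disabling tracing inside those frames entirely
```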

What would settle it

A direct comparison on SWE-bench Verified where DAIRA fails to exceed baseline resolution rates on tasks involving deep logical defects would disprove the advantage of embedding dynamic analysis.

Figures

Figures reproduced from arXiv: 2603.22048 by Mingwei Liu, Yanlin Wang, Zheng Pei, Zhenxi Chen, Zibin Zheng, Zihao Wang.

Figure 1. Impact of Dynamic Execution Traces on Debugging Trajectories.
Figure 2. Comparison of Opaque and Transparent Runtime States.
Figure 3. Trace log for reproducing Matplotlib issue #22719.
Figure 4. Overview of DAIRA’s Test-Tracing Driven Workflow Example.
Figure 5. Overlap Analysis of Resolved Instances among the
Figure 6. Trace report during the resolution process of Issue
Figure 8. Performance analysis in different models.
Figure 9. Comparison of Resolution Rates across Different
read the original abstract

Resolving complex code defects from natural language descriptions remains a fundamental software engineering challenge. Recently, large language models (LLMs) have driven the creation of agent-based automated repair systems. While improving repository-level problem-solving, current methods struggle with complex defects like intricate polymorphic control flows and implicit type degradation. These approaches rely on static analysis and shallow execution feedback, lacking the ability to monitor intermediate execution states. Consequently, agents often fall into speculative exploration, consuming significant tokens without identifying the root cause. We introduce DAIRA (Dynamic Analysis-enhanced Issue Resolution Agent), a pioneering automated repair framework deeply embedding dynamic analysis into the agent's decision loop. DAIRA employs a Test Tracing-Driven workflow, using lightweight tools to capture runtime evidence (e.g., call stacks and variable states) and convert it into structured semantic reports. By illuminating execution paths and causal dependencies, DAIRA enables precise fault localization and prevents context window flooding from irrelevant code retrievals. This shifts the agent's approach from speculative reasoning to deterministic inference. Evaluations on the SWE-bench Verified benchmark show DAIRA achieves a state-of-the-art 79.4% resolution rate when powered by Gemini 3 Flash Preview. Furthermore, it demonstrates robustness in addressing deep-seated logical defects, securing a 44.4% resolution rate on the most demanding tasks. Compared to baselines, DAIRA uniquely resolves complex edge cases and improves operational efficiency. Across various LLMs, it reduces inference costs by approximately 10% and input token consumption by 25%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces DAIRA, an LLM agent framework for automated issue resolution that embeds lightweight dynamic analysis via a Test Tracing-Driven workflow. This workflow captures runtime evidence such as call stacks and variable states and converts it into structured semantic reports to support precise fault localization. The central empirical claim is that this approach yields a state-of-the-art 79.4% resolution rate on SWE-bench Verified (with Gemini 3 Flash Preview) and 44.4% on the most demanding tasks, while also reducing token consumption by ~25% and inference costs by ~10% relative to baselines.

Significance. If the reported gains are shown to stem specifically from the dynamic-analysis component, the work would provide concrete evidence that runtime traces can reduce speculative reasoning in LLM agents for repository-level repair. This has direct implications for practical automated program repair tools, particularly for complex control-flow and type-related defects.

major comments (3)
  1. [Evaluation] Evaluation section: the 79.4% resolution claim on SWE-bench Verified is not supported by an ablation that disables the Test Tracing-Driven workflow (while holding prompting, retrieval, and base LLM fixed). Without this isolation, the performance delta cannot be attributed to dynamic analysis rather than other unstated factors.
  2. [§4] §4 (experimental setup): no quantitative measurement of tracing error rates or cases in which the generated runtime reports mislead the agent on polymorphic flows or implicit type degradation is provided, leaving the weakest modeling assumption untested.
  3. [Results] Results tables: baseline comparisons lack reported variance across runs, exact configuration details for competing methods, and statistical significance tests, weakening the state-of-the-art assertion.
minor comments (2)
  1. [§3] Clarify the precise format and content of the 'structured semantic reports' produced from captured call stacks and variable states.
  2. Add a reproducibility note specifying the exact versions of SWE-bench Verified and the Gemini model used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the empirical claims without altering the core contributions of DAIRA.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the 79.4% resolution claim on SWE-bench Verified is not supported by an ablation that disables the Test Tracing-Driven workflow (while holding prompting, retrieval, and base LLM fixed). Without this isolation, the performance delta cannot be attributed to dynamic analysis rather than other unstated factors.

    Authors: We agree that an ablation isolating the Test Tracing-Driven workflow is necessary to attribute gains specifically to dynamic analysis. In the revised manuscript we will add this controlled ablation (disabling only the tracing and report-generation steps while freezing prompting, retrieval, and the Gemini 3 Flash Preview backbone) and report the resulting resolution rates, token usage, and cost figures. revision: yes

  2. Referee: [§4] §4 (experimental setup): no quantitative measurement of tracing error rates or cases in which the generated runtime reports mislead the agent on polymorphic flows or implicit type degradation is provided, leaving the weakest modeling assumption untested.

    Authors: We acknowledge the absence of quantitative tracing-error analysis. The revised version will include measured error rates from the lightweight tracing tools on SWE-bench Verified, together with a qualitative breakdown of failure modes on polymorphic control flows and implicit type degradation, including the frequency with which such reports led the agent to incorrect localizations. revision: yes

  3. Referee: [Results] Results tables: baseline comparisons lack reported variance across runs, exact configuration details for competing methods, and statistical significance tests, weakening the state-of-the-art assertion.

    Authors: We agree that variance, configuration transparency, and statistical tests are required to support the SOTA claim. The updated tables will report mean and standard deviation over at least three independent runs per configuration, list exact hyper-parameters and retrieval settings for every baseline, and include p-values from paired statistical tests comparing DAIRA against each baseline. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivations or self-referential reductions

full rationale

The paper presents DAIRA as an empirical framework that embeds dynamic analysis into an LLM agent and reports resolution rates on the external SWE-bench Verified benchmark (79.4% overall, 44.4% on hard tasks). No equations, parameters, or first-principles derivations appear in the provided text. Performance claims rest on direct benchmark measurements rather than any fitted inputs renamed as predictions, self-citations that carry the central argument, or definitions that presuppose the claimed outcome. The derivation chain is therefore self-contained against the benchmark and exhibits no reduction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that runtime evidence from dynamic analysis can be converted into reports that shift LLM reasoning from speculative to deterministic inference; this is treated as a domain assumption without external validation beyond the reported benchmark.

axioms (1)
  • domain assumption LLMs can effectively leverage structured runtime reports (call stacks, variable states) for precise fault localization in complex code
    Invoked in the description of how DAIRA prevents context flooding and enables deterministic inference
invented entities (1)
  • DAIRA framework with Test Tracing-Driven workflow (no independent evidence)
    purpose: to embed dynamic analysis into the agent's decision loop for automated code repair
    New system introduced to address limitations of static methods; no independent evidence provided beyond the SWE-bench results

pith-pipeline@v0.9.0 · 5572 in / 1221 out tokens · 28374 ms · 2026-05-15T00:43:06.823705+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1]

Anonymous Author(s). 2025. DAIRA: Replication Package. https://anonymous.4open.science/r/DAIRA-1EC0 Accessed: 2025-01-30

  2. [2]

Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, and William Wang. 2025. SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement. arXiv:2410.20285 [cs.AI] https://arxiv.org/abs/2410.20285

  3. [3]

Nicolas Bettenburg, Sascha Just, Adrian Schröter, Cathrin Weiss, Rahul Premraj, and Thomas Zimmermann. 2008. What makes a good bug report? In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering (Atlanta, Georgia) (SIGSOFT ’08/FSE-16). Association for Computing Machinery, New York, NY, USA, 308–318. doi:10.1145/...

  4. [4]

Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, and Qianxiang Wang. 2025. SWE-Exp: Experience-Driven Software Issue Resolution. arXiv:2507.23361 [cs.SE] https://arxiv.org/abs/2507.23361

  5. [5]

    SGM Cornelissen, AE Zaidman, A van Deursen, LMF Moonen, and R Koschke

  6. [6]

A Systematic Survey of Program Comprehension through Dynamic Analysis. IEEE Transactions on Software Engineering 35, 5 (2009), 684–702. doi:10.1109/TSE.2009.28

  7. [7]

J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90–95. doi:10.1109/MCSE.2007.55

  8. [8]

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770 [cs.CL] https://arxiv.org/abs/2310.06770

  9. [9]

Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, and Qianxiang Wang. 2025. SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution. arXiv:2507.23348 [cs.SE] https://arxiv.org/abs/2507.23348

  10. [10]

Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, and Doyen Sahoo. 2025. CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Luis Ch...

  11. [11]

    Changshu Liu, Alireza Ghazanfari, Yang Chen, and Reyhan Jabbarvand. 2025. Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings. https://api.semanticscholar.org/CorpusID:283920955

  12. [12]

    Xiangyan Liu, Bo Lan, Zhiyuan Hu, Yang Liu, Zhicheng Zhang, Fei Wang, Michael Shieh, and Wenmeng Zhou. 2024. CodexGraph: Bridging Large Language Models and Code Repositories via Code Graph Databases. arXiv:2408.03910 [cs.SE] https://arxiv.org/abs/2408.03910

  13. [13]

Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin, Amit Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger, Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa, Matthew J. Curry, Andy R. Terrel, Štěpán...

  14. [14]

Manish Motwani, Sandhya Sankaranarayanan, René Just, and Yuriy Brun. 2018. Do automated program repair techniques repair hard and important bugs? In Proceedings of the 40th International Conference on Software Engineering (Gothenburg, Sweden) (ICSE ’18). Association for Computing Machinery, New York, NY, USA, 25. doi:10.1145/3180155.3182533

  15. [15]

    Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, and Qing Wang. 2025. EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair. arXiv:2506.10484 [cs.SE] https://arxiv.org/abs/2506.10484

  16. [16]

Ionel Cristian Mărieș. 2025. Hunter: A flexible code tracing toolkit. https://github.com/ionelmc/python-hunter BSD 2-Clause License

  17. [17]

Haifeng Ruan, Yuntong Zhang, and Abhik Roychoudhury. 2024. SpecRover: Code Intent Extraction via LLMs. arXiv:2408.02232 [cs.SE] https://arxiv.org/abs/2408.02232

  18. [18]

    Gregory Tassey. 2002. The economic impacts of inadequate infrastructure for software testing. https://api.semanticscholar.org/CorpusID:107332102

  19. [19]

    G Tassey. 2002. Economic Impacts of Inadequate Infrastructure for Software Testing. (2002)

  20. [20]

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. 2025. OpenHands: An Open Platform for AI Software Developers as Generalist Agents.

  21. [21]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang

  22. [22]

Agentless: Demystifying LLM-based Software Engineering Agents. arXiv:2407.01489 [cs.SE] https://arxiv.org/abs/2407.01489

  23. [23]

    Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, and Lingming Zhang

  24. [24]

    Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly? arXiv:2511.13646 [cs.SE] https://arxiv.org/abs/2511.13646

  25. [25]

    Boyang Yang, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, Bach Le, and Haoye Tian. 2025. Enhancing repository-level software repair via repository-aware knowledge graphs. arXiv:2503.21710 [cs.SE] https://arxiv.org/abs/2503.21710

  26. [26]

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran A...

  27. [27]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. arXiv:2404.05427 [cs.SE] https://arxiv.org/abs/2404.05427

  28. [28]

    Li Zhong, Zilong Wang, and Jingbo Shang. 2024. Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step. arXiv:2402.16906 [cs.SE] https://arxiv.org/abs/2402.16906