pith. machine review for the scientific record.

arxiv: 2604.24831 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.LG

Recognition: unknown

FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:51 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords code · framework · FGDM · programs · agents · detection · interconnected · large

The pith

FGDM is a sequential multi-agent system using flow graphs, CoT/ToT prompts, and FAISS retrieval that reports mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) with cosine similarities of 0.951 and 0.974 on 100 programs from ten open-source projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The system splits bug detection and repair across four specialized AI agents that run in order. One agent builds a flow graph of the code's execution paths. Subsequent agents apply step-by-step reasoning prompts and explore multiple possible fixes in a tree structure to locate errors. They also query a database of past bugs for similar cases. The final agent outputs repaired code. The authors tested the full pipeline on 100 programs drawn from projects including Pandas, FastAPI, and Matplotlib, measuring success by how few edits the repaired code needs compared with the buggy version and how similar the two versions are in meaning.
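The sequential hand-off described above can be sketched as a minimal pipeline. Everything below (agent names, state fields, the toy line-graph) is an illustrative assumption, not the paper's implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of FGDM's four-agent hand-off. The paper does not
# publish agent interfaces, so every name and behaviour here is assumed.

@dataclass
class PipelineState:
    code: str
    flow_graph: dict = field(default_factory=dict)
    suspect_lines: list = field(default_factory=list)
    similar_bugs: list = field(default_factory=list)
    repaired_code: str = ""

def graph_agent(state):
    # Agent 1: build a (toy) flow graph: one node per line, sequential edges.
    n = len(state.code.splitlines())
    state.flow_graph = {i: [i + 1] for i in range(n - 1)}
    return state

def localization_agent(state):
    # Agent 2: CoT/ToT prompting would rank fault sites; as a stand-in,
    # every line is flagged as a suspect.
    state.suspect_lines = list(range(len(state.code.splitlines())))
    return state

def retrieval_agent(state):
    # Agent 3: in the paper this queries a FAISS index of past bug/fix pairs.
    state.similar_bugs = []  # placeholder: no index in this sketch
    return state

def repair_agent(state):
    # Agent 4: an LLM would emit repaired code; this sketch echoes the input.
    state.repaired_code = state.code
    return state

def run_fgdm(code):
    # Run the four agents strictly in order, threading one shared state.
    state = PipelineState(code=code)
    for agent in (graph_agent, localization_agent, retrieval_agent, repair_agent):
        state = agent(state)
    return state
```

The sketch only fixes the data flow; each stub is where the paper's prompting, retrieval, and generation would actually run.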

Core claim

Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.

Load-bearing premise

That Levenshtein distance and cosine similarity on the generated repairs, together with the chosen 100 programs from ten projects, constitute sufficient evidence that FGDM outperforms prior bug-detection methods in real-world settings.
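Both quantities are straightforward to reproduce. A minimal sketch, assuming character-level Levenshtein distance and a bag-of-tokens cosine similarity (the paper may instead use token-level distance or learned embeddings):

```python
from collections import Counter
from math import sqrt

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cosine_similarity(a, b):
    # Bag-of-tokens cosine; a crude stand-in for whatever vectorization
    # the paper actually uses.
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Under the reading the abstract leaves implicit, a "reduction" of 24.33 would be the drop in edit distance to the ground-truth fix achieved by the repaired version relative to the buggy one; the paper does not spell this out.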

Figures

Figures reproduced from arXiv: 2604.24831 by Bhargavi Karuturi, Jerusha Karen Indupalli, Santhan Reddy Chilla, Srita Padmanabhuni, Vivek Yelleti.

Figure 1
Figure 1. Schematic diagram of the proposed approach. The excerpted caption describes the repair agent: edits are kept as minimal as possible; it generates a rectified flow graph while preserving valid structures, ensures only the affected lines are changed, and performs validations including Structure Preservation (all original vertices are retained while only a minimal number of edges are modified) and Defect Coverage (caption truncated at source). view at source ↗
Figure 2
Figure 2. Comparison of methods. view at source ↗
read the original abstract

Deep Learning methods are becoming prominent in automated software bug detection; however, they lack the global understanding of the given code. Consequently, their performance tends to degrade, especially when they are applied to large interconnected code bases or complex modular programs. Recently, Large Language Models (LLMs) have proven to be effective at capturing dependencies among multiple interconnected modules in the codebase. This motivated us to propose the Flow-Graph-Driven Multi-Agent Framework (FGDM), which is composed of four agents that operate in a sequential manner. The framework converts the received code to a flow graph, identifies the erroneous segments, and further generates the repaired code. All the employed agents utilize Chain-of-Thought (COT) and Tree-of-Thoughts (TOT) prompts. Additionally, we also integrated with the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrated the efficacy of the proposed framework over 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado in both C and Python programs. Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.
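The FAISS retrieval step the abstract describes is, at its core, nearest-neighbour search over embedded bug/fix pairs. A dependency-free sketch, with stdlib string similarity standing in for FAISS's vector distance and an invented two-entry corpus:

```python
from difflib import SequenceMatcher

def retrieve_similar(query_code, bug_corpus, k=1):
    # bug_corpus: list of (buggy_snippet, known_fix) pairs. FAISS would rank
    # by embedding distance; SequenceMatcher.ratio() is a cheap stand-in.
    def sim(snippet):
        return SequenceMatcher(None, query_code, snippet).ratio()
    return sorted(bug_corpus, key=lambda pair: sim(pair[0]), reverse=True)[:k]
```

The retrieved (bug, fix) pairs would then be spliced into the repair agent's prompt as worked examples; this sketch only covers the lookup.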

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FGDM, a multi-agent framework for automated software bug detection and repair. It uses four sequential agents that convert input code to a flow graph, identify erroneous segments via Chain-of-Thought and Tree-of-Thought prompting, retrieve similar prior bugs with FAISS, and generate repairs. Evaluation is performed on 100 programs drawn from ten projects (Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, Tornado) in both Python and C, with the central claim that FGDM outperforms prior approaches by achieving mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) together with cosine similarities of 0.951 and 0.974.

Significance. The multi-agent design that explicitly incorporates flow-graph representations of code dependencies and retrieval-augmented reasoning is a reasonable direction for improving LLM-based repair on modular codebases. If the empirical results were supported by functional verification and explicit baselines, the work could usefully extend the literature on structured prompting for code tasks.

major comments (2)
  1. [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.
  2. [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.
minor comments (1)
  1. [Abstract] The selection criteria for the 100 programs and the nature of the injected or observed bugs are not described, making it difficult to judge the representativeness of the test set.
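The second major comment can be made concrete: two candidate repairs can sit one edit apart yet behave differently, so syntactic closeness alone does not certify a fix. A self-contained illustration with invented snippets:

```python
# Two candidate "repairs" one character apart: syntactically almost
# identical, semantically different. Both snippets are invented.
ground_truth = "result = sum(xs) / len(xs)"   # true division: the mean
candidate    = "result = sum(xs) // len(xs)"  # floor division: a latent bug

env_a = {"xs": [1, 2]}
env_b = {"xs": [1, 2]}
exec(ground_truth, env_a)
exec(candidate, env_b)

# The Levenshtein distance between the two strings is 1 (one inserted "/"),
# yet the computed values diverge: 1.5 versus 1. Only executing the code,
# e.g. against a test suite, exposes the difference.
```

By the paper's metrics the candidate would score nearly perfectly against the ground truth while still being wrong, which is exactly the referee's objection.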

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions that will strengthen the presentation of our empirical results while maintaining the integrity of the reported findings.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.

    Authors: We agree that the abstract would benefit from greater specificity to support the outperforming claim. The full manuscript contains comparisons against prior approaches in the evaluation section, but these details were summarized at a high level in the abstract. In the revised version, we will name the specific baseline methods, include per-baseline Levenshtein distance and cosine similarity values, and report variance along with statistical significance tests for the mean reductions of 24.33 (Python) and 8.37 (C). revision: yes

  2. Referee: [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.

    Authors: The referee correctly identifies that our primary metrics are syntactic. Levenshtein distance reduction and cosine similarity are standard quantitative proxies in automated program repair research for measuring how closely generated repairs align with ground-truth fixes. We did not execute the repaired programs against test suites or perform explicit semantic equivalence checks in the current experiments. In the revision, we will add a dedicated limitations paragraph acknowledging this and will discuss plans for functional verification as future work. The syntactic metrics still demonstrate that FGDM produces repairs substantially closer to expected fixes than the baselines considered. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical evaluation stands on direct metrics without self-referential reduction

full rationale

The paper proposes the FGDM multi-agent framework and supports its claims solely through experimental results on 100 programs, reporting mean Levenshtein distance reductions and cosine similarities. No equations, derivations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to its own inputs by construction. The evaluation uses standard syntactic similarity measures applied to generated repairs; these are independent measurements rather than predictions forced by prior fitting or definitional loops. The framework description (flow-graph conversion, agent sequencing, COT/TOT prompting, FAISS retrieval) introduces no circular dependencies in its justification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The framework implicitly relies on standard LLM capabilities and off-the-shelf retrieval, with prompt wording and agent coordination likely containing unstated tuning choices.

pith-pipeline@v0.9.0 · 5575 in / 1193 out tokens · 65719 ms · 2026-05-08T02:51:20.975814+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 19 canonical work pages

  1. [1] G. Long, J. Gong, H. Fang, T. Chen, Learning software bug reports: A systematic literature review, ACM Trans. Softw. Eng. Methodol., Just Accepted (Jul. 2025). doi:10.1145/3750040

  2. [2] N. Shiri Harzevili, A. Boaye Belle, J. Wang, S. Wang, Z. M. Jiang, N. Nagappan, A systematic literature review on automated software vulnerability detection using machine learning, ACM Computing Surveys 57 (3) (2024) 1–36

  3. [3] Y. Yang, X. Xia, D. Lo, J. Grundy, A survey on deep learning for software engineering, ACM Computing Surveys (CSUR) 54 (10s) (2022) 1–73

  4. [4] S. F. Ahmed, M. S. B. Alam, M. Hassan, M. R. Rozbu, T. Ishtiak, N. Rafa, M. Mofijur, A. Shawkat Ali, A. H. Gandomi, Deep learning modelling techniques: current progress, applications, advantages, and challenges, Artificial Intelligence Review 56 (11) (2023) 13521–13617

  5. [5] X. Zhu, W. Zhou, Q.-L. Han, W. Ma, S. Wen, Y. Xiang, When software security meets large language models: A survey, IEEE/CAA Journal of Automatica Sinica 12 (2) (2025) 317–334

  6. [6] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837

  7. [7] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, Advances in Neural Information Processing Systems 36 (2023) 11809–11822

  8. [8] S. Padmanabhuni, B. Karuturi, J. K. Indupalli, S. R. Chilla, V. Yelleti, Meta-agentic framework for software bug detection using large language models, in: 2026 18th International Conference on COMmunication Systems and NETworks (COMSNETS), 2026, pp. 841–846. doi:10.1109/COMSNETS67989.2026.11418178

  9. [9] M. A. Ferrag, A. Battah, N. Tihanyi, R. Jain, D. Maimuţ, F. Alwahedi, T. Lestable, N. S. Thandi, A. Mechri, M. Debbah, L. C. Cordeiro, SecureFalcon: Are we there yet in automated software vulnerability detection with LLMs?, IEEE Transactions on Software Engineering 51 (4) (2025) 1248–1265. doi:10.1109/TSE.2025.3548168

  10. [10] H. Li, Y. Hao, Y. Zhai, Z. Qian, Enhancing static analysis for practical bug detection: An LLM-integrated approach, Proc. ACM Program. Lang. 8 (OOPSLA1) (Apr. 2024). doi:10.1145/3649828

  11. [11] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, Z. Chen, Pre-trained model-based automated software vulnerability repair: How far are we?, IEEE Transactions on Dependable and Secure Computing 21 (4) (2024) 2507–2525. doi:10.1109/TDSC.2023.3308897

  12. [12] Y. Jiang, H. Liu, X. Luo, Z. Zhu, X. Chi, N. Niu, Y. Zhang, Y. Hu, P. Bian, L. Zhang, BugBuilder: An automated approach to building bug repository, IEEE Transactions on Software Engineering 49 (4) (2023) 1443–1463. doi:10.1109/TSE.2022.3177713

  13. [13] H. Guan, G. Bai, Y. Liu, CrossProbe: LLM-empowered cross-project bug detection for deep learning frameworks, Proceedings of the ACM on Software Engineering 2 (ISSTA) (2025) 2430–2452

  14. [14] S. S. Sijwali, A. M. Colom, A. Guo, S. Saha, Fixing performance bugs through LLM explanations, in: 2025 IEEE International Conference on Artificial Intelligence Testing (AITest), IEEE, 2025, pp. 102–109

  15. [15] B. Zhang, Z. Zhang, Detecting bugs with substantial monetary consequences by LLM and rule-based reasoning, Advances in Neural Information Processing Systems 37 (2024) 133999–134023

  16. [16] T. Hai, J. Zhou, N. Li, S. K. Jain, S. Agrawal, I. B. Dhaou, Cloud-based bug tracking software defects analysis using deep learning, Journal of Cloud Computing 11 (1) (2022) 32

  17. [17] M. Pradel, K. Sen, DeepBugs: A learning approach to name-based bug detection, Proceedings of the ACM on Programming Languages 2 (OOPSLA) (2018) 1–25

  18. [18] A. Kukkar, R. Mohana, Y. Kumar, A. Nayyar, M. Bilal, K.-S. Kwak, Duplicate bug report detection and classification system based on deep learning technique, IEEE Access 8 (2020) 200749–200763

  19. [19] H. V. Pham, T. Lutellier, W. Qi, L. Tan, Cradle: cross-backend validation to detect and localize bugs in deep learning libraries, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, 2019, pp. 1027–1038

  20. [20] S.-Q. Xi, Y. Yao, X.-S. Xiao, F. Xu, J. Lv, Bug triaging based on tossing sequence modeling, Journal of Computer Science and Technology 34 (5) (2019) 942–956

  21. [21] Z. Zeng, Y. Zhao, L. Gong, Classifying bug issue types for deep learning-oriented projects with pre-trained model, in: 2024 31st Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2024, pp. 91–100

  22. [22] N. Fukuda, C. Wu, S. Horiuchi, K. Tayama, Fault report generation for ICT systems by jointly learning time-series and text data, in: NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, IEEE, 2022, pp. 1–9

  23. [23] Y. Wei, C. Zhang, T. Ren, Improving bug severity prediction with domain-specific representation learning, IEEE Access 11 (2023) 62829–62839

  24. [24] Q. Zhang, W. Sun, C. Fang, B. Yu, H. Li, M. Yan, J. Zhou, Z. Chen, Exploring automated assertion generation via large language models, ACM Trans. Softw. Eng. Methodol. 34 (3) (Feb. 2025). doi:10.1145/3699598

  25. [25] R. Siva, K. S, B. Hariharan, N. Premkumar, Automatic software bug prediction using adaptive golden eagle optimizer with deep learning, Multimedia Tools and Applications 83 (1) (2024) 1261–1281

  26. [26] S. Mostafa, S. T. Cynthia, B. Roy, D. Mondal, Feature transformation for improved software bug detection and commit classification, Journal of Systems and Software 219 (2025) 112205. doi:10.1016/j.jss.2024.112205

  27. [27] S. T. Cynthia, B. Roy, D. Mondal, Feature transformation for improved software bug detection models, in: Proceedings of the 15th Innovations in Software Engineering Conference, ISEC '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3511430.3511444

  28. [28] S. Juneja, G. S. Bhathal, B. K. Sidhu, CRF_LSTM_DO: automated software bug detection deep learning framework, International Journal of Information Technology (Oct 2025). doi:10.1007/s41870-025-02834-0

  29. [29] R. Garg, A. Bhargava, Bug prediction based on deep neural network with reptile search optimization to enhance software reliability, Multimedia Tools and Applications 83 (31) (2024) 75869–75891. doi:10.1007/s11042-024-18479-3

  30. [30] D. Al-Fraihat, Y. Sharrab, A.-R. Al-Ghuwairi, H. Alshishani, A. Algarni, Hyperparameter optimization for software bug prediction using ensemble learning, IEEE Access 12 (2024) 51869–51878. doi:10.1109/ACCESS.2024.3380024

  31. [31] S. Garg, R. Z. Moghaddam, C. B. Clement, N. Sundaresan, C. Wu, DeepDev-PERF: a deep learning-based approach for improving software performance, in: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Association for Computing Machinery, New York, NY, USA, 2022, p. …

  32. [32] K. Tameswar, G. Suddul, K. Dookhitram, A hybrid deep learning approach with genetic and coral reefs metaheuristics for enhanced defect detection in software, International Journal of Information Management Data Insights 2 (2) (2022) 100105. doi:10.1016/j.jjimei.2022.100105

  33. [33] K. Bharath, P. Jagadeesh, An innovative software bug prediction system using random forest algorithm for enhanced accuracy in comparison with logistic regression algorithm, in: 2023 Intelligent Computing and Control for Engineering and Business Systems (ICCEBS), 2023, pp. 1–6. doi:10.1109/ICCEBS58601.2023.10449266

  34. [34] D. Patel, Improving software performance through early bug detection using large-scale machine learning models, in: 2025 3rd World Conference on Communication & Computing (WCONF), 2025, pp. 1–6. doi:10.1109/WCONF64849.2025.11233621

  35. [35] L. Zhang, A. Miranskyy, Automated flakiness detection in quantum software bug reports, in: 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 02, 2024, pp. 179–

  36. [36] doi:10.1109/QCE60285.2024.10274

  37. [37] M. Nadim, M. Hassan, A. K. Mandal, C. K. Roy, B. Roy, K. A. Schneider, Comparative analysis of quantum and classical support vector classifiers for software bug prediction: an exploratory study, Quantum Machine Intelligence 7 (1) (2025) 32. doi:10.1007/s42484-025-00236-w

  38. [38] R. Widyasari, S. Q. Sim, C. Lok, H. Qi, J. Phan, Q. Tay, C. Tan, F. Wee, J. E. Tan, Y. Yieh, B. Goh, F. Thung, H. J. Kang, T. Hoang, D. Lo, E. L. Ouh, BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies, in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Sympo…

  39. [39] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics Doklady, Vol. 10, Soviet Union, 1966, pp. 707–710

  40. [40] A. Singhal, et al., Modern information retrieval: A brief overview, IEEE Data Eng. Bull. 24 (4) (2001) 35–43