pith. machine review for the scientific record.

arxiv: 2604.24831 · v1 · submitted 2026-04-27 · 💻 cs.SE · cs.LG

Recognition: unknown

FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 02:51 UTC · model grok-4.3

classification 💻 cs.SE cs.LG
keywords code · framework · FGDM · programs · agents · detection · interconnected · large

The pith

FGDM is a sequential multi-agent system using flow graphs, CoT/ToT prompts, and FAISS retrieval that reports mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) with cosine similarities of 0.951 and 0.974 on 100 programs from ten open-source projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The system splits bug detection and repair across four specialized AI agents that run in order. One agent builds a flow graph of the code's execution paths. Subsequent agents apply step-by-step reasoning prompts and explore multiple possible fixes in a tree structure to locate errors. They also query a database of past bugs for similar cases. The final agent outputs repaired code. The authors tested the full pipeline on 100 programs drawn from projects including Pandas, FastAPI, and Matplotlib, measuring success by how few edits the repaired code needs compared with the buggy version and how similar the two versions are in meaning.
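The sequential hand-off described above can be sketched as a minimal pipeline. Everything below (agent names, state fields, the toy line-graph) is an illustrative assumption, not the paper's implementation:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of FGDM's four-agent hand-off. The paper does not
# publish agent interfaces, so every name and behaviour here is assumed.

@dataclass
class PipelineState:
    code: str
    flow_graph: dict = field(default_factory=dict)
    suspect_lines: list = field(default_factory=list)
    similar_bugs: list = field(default_factory=list)
    repaired_code: str = ""

def graph_agent(state):
    # Agent 1: build a (toy) flow graph: one node per line, sequential edges.
    n = len(state.code.splitlines())
    state.flow_graph = {i: [i + 1] for i in range(n - 1)}
    return state

def localization_agent(state):
    # Agent 2: CoT/ToT prompting would rank fault sites; as a stand-in,
    # every line is flagged as a suspect.
    state.suspect_lines = list(range(len(state.code.splitlines())))
    return state

def retrieval_agent(state):
    # Agent 3: in the paper this queries a FAISS index of past bug/fix pairs.
    state.similar_bugs = []  # placeholder: no index in this sketch
    return state

def repair_agent(state):
    # Agent 4: an LLM would emit repaired code; this sketch echoes the input.
    state.repaired_code = state.code
    return state

def run_fgdm(code):
    # Run the four agents strictly in order, threading one shared state.
    state = PipelineState(code=code)
    for agent in (graph_agent, localization_agent, retrieval_agent, repair_agent):
        state = agent(state)
    return state
```

The sketch only fixes the data flow; each stub is where the paper's prompting, retrieval, and generation would actually run.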

Core claim

Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.

Load-bearing premise

That Levenshtein distance and cosine similarity on the generated repairs, together with the chosen 100 programs from ten projects, constitute sufficient evidence that FGDM outperforms prior bug-detection methods in real-world settings.
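Both quantities are straightforward to reproduce. A minimal sketch, assuming character-level Levenshtein distance and a bag-of-tokens cosine similarity (the paper may instead use token-level distance or learned embeddings):

```python
from collections import Counter
from math import sqrt

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cosine_similarity(a, b):
    # Bag-of-tokens cosine; a crude stand-in for whatever vectorization
    # the paper actually uses.
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    na = sqrt(sum(v * v for v in va.values()))
    nb = sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Under the reading the abstract leaves implicit, a "reduction" of 24.33 would be the drop in edit distance to the ground-truth fix achieved by the repaired version relative to the buggy one; the paper does not spell this out.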

Figures

Figures reproduced from arXiv: 2604.24831 by Bhargavi Karuturi, Jerusha Karen Indupalli, Santhan Reddy Chilla, Srita Padmanabhuni, Vivek Yelleti.

Figure 1
Figure 1. Schematic diagram of the proposed approach. The excerpted caption describes the repair agent: edits are kept as minimal as possible; it generates a rectified flow graph while preserving valid structures, ensures only the affected lines are changed, and performs validations including Structure Preservation (all original vertices are retained while only a minimal number of edges are modified) and Defect Coverage (caption truncated at source). view at source ↗
Figure 2
Figure 2. Comparison of methods. view at source ↗
read the original abstract

Deep Learning methods are becoming prominent in automated software bug detection; however, they lack the global understanding of the given code. Consequently, their performance tends to degrade, especially when they are applied to large interconnected code bases or complex modular programs. Recently, Large Language Models (LLMs) have proven to be effective at capturing dependencies among multiple interconnected modules in the codebase. This motivated us to propose the Flow-Graph-Driven Multi-Agent Framework (FGDM), which is composed of four agents that operate in a sequential manner. The framework converts the received code to a flow graph, identifies the erroneous segments, and further generates the repaired code. All the employed agents utilize Chain-of-Thought (COT) and Tree-of-Thoughts (TOT) prompts. Additionally, we also integrated with the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrated the efficacy of the proposed framework over 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado in both C and Python programs. Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.
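The FAISS retrieval step the abstract describes is, at its core, nearest-neighbour search over embedded bug/fix pairs. A dependency-free sketch, with stdlib string similarity standing in for FAISS's vector distance and an invented two-entry corpus:

```python
from difflib import SequenceMatcher

def retrieve_similar(query_code, bug_corpus, k=1):
    # bug_corpus: list of (buggy_snippet, known_fix) pairs. FAISS would rank
    # by embedding distance; SequenceMatcher.ratio() is a cheap stand-in.
    def sim(snippet):
        return SequenceMatcher(None, query_code, snippet).ratio()
    return sorted(bug_corpus, key=lambda pair: sim(pair[0]), reverse=True)[:k]
```

The retrieved (bug, fix) pairs would then be spliced into the repair agent's prompt as worked examples; this sketch only covers the lookup.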

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes FGDM, a multi-agent framework for automated software bug detection and repair. It uses four sequential agents that convert input code to a flow graph, identify erroneous segments via Chain-of-Thought and Tree-of-Thought prompting, retrieve similar prior bugs with FAISS, and generate repairs. Evaluation is performed on 100 programs drawn from ten projects (Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, Tornado) in both Python and C, with the central claim that FGDM outperforms prior approaches by achieving mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) together with cosine similarities of 0.951 and 0.974.

Significance. The multi-agent design that explicitly incorporates flow-graph representations of code dependencies and retrieval-augmented reasoning is a reasonable direction for improving LLM-based repair on modular codebases. If the empirical results were supported by functional verification and explicit baselines, the work could usefully extend the literature on structured prompting for code tasks.

major comments (2)
  1. [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.
  2. [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.
minor comments (1)
  1. [Abstract] The selection criteria for the 100 programs and the nature of the injected or observed bugs are not described, making it difficult to judge the representativeness of the test set.
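The second major comment can be made concrete: two candidate repairs can sit one edit apart yet behave differently, so syntactic closeness alone does not certify a fix. A self-contained illustration with invented snippets:

```python
# Two candidate "repairs" one character apart: syntactically almost
# identical, semantically different. Both snippets are invented.
ground_truth = "result = sum(xs) / len(xs)"   # true division: the mean
candidate    = "result = sum(xs) // len(xs)"  # floor division: a latent bug

env_a = {"xs": [1, 2]}
env_b = {"xs": [1, 2]}
exec(ground_truth, env_a)
exec(candidate, env_b)

# The Levenshtein distance between the two strings is 1 (one inserted "/"),
# yet the computed values diverge: 1.5 versus 1. Only executing the code,
# e.g. against a test suite, exposes the difference.
```

By the paper's metrics the candidate would score nearly perfectly against the ground truth while still being wrong, which is exactly the referee's objection.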

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions that will strengthen the presentation of our empirical results while maintaining the integrity of the reported findings.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.

    Authors: We agree that the abstract would benefit from greater specificity to support the outperforming claim. The full manuscript contains comparisons against prior approaches in the evaluation section, but these details were summarized at a high level in the abstract. In the revised version, we will name the specific baseline methods, include per-baseline Levenshtein distance and cosine similarity values, and report variance along with statistical significance tests for the mean reductions of 24.33 (Python) and 8.37 (C). revision: yes

  2. Referee: [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.

    Authors: The referee correctly identifies that our primary metrics are syntactic. Levenshtein distance reduction and cosine similarity are standard quantitative proxies in automated program repair research for measuring how closely generated repairs align with ground-truth fixes. We did not execute the repaired programs against test suites or perform explicit semantic equivalence checks in the current experiments. In the revision, we will add a dedicated limitations paragraph acknowledging this and will discuss plans for functional verification as future work. The syntactic metrics still demonstrate that FGDM produces repairs substantially closer to expected fixes than the baselines considered. revision: partial

Circularity Check

0 steps flagged

No circularity; empirical evaluation stands on direct metrics without self-referential reduction

full rationale

The paper proposes the FGDM multi-agent framework and supports its claims solely through experimental results on 100 programs, reporting mean Levenshtein distance reductions and cosine similarities. No equations, derivations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to its own inputs by construction. The evaluation uses standard syntactic similarity measures applied to generated repairs; these are independent measurements rather than predictions forced by prior fitting or definitional loops. The framework description (flow-graph conversion, agent sequencing, COT/TOT prompting, FAISS retrieval) introduces no circular dependencies in its justification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities. The framework implicitly relies on standard LLM capabilities and off-the-shelf retrieval, with prompt wording and agent coordination likely containing unstated tuning choices.

pith-pipeline@v0.9.0 · 5575 in / 1193 out tokens · 65719 ms · 2026-05-08T02:51:20.975814+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 19 canonical work pages

  1. [1] G. Long, J. Gong, H. Fang, T. Chen, Learning software bug reports: A systematic literature review, ACM Trans. Softw. Eng. Methodol., Just Accepted (Jul. 2025). doi:10.1145/3750040

  2. [2] N. Shiri Harzevili, A. Boaye Belle, J. Wang, S. Wang, Z. M. Jiang, N. Nagappan, A systematic literature review on automated software vulnerability detection using machine learning, ACM Computing Surveys 57 (3) (2024) 1–36

  3. [3] Y. Yang, X. Xia, D. Lo, J. Grundy, A survey on deep learning for software engineering, ACM Computing Surveys (CSUR) 54 (10s) (2022) 1–73

  4. [4] S. F. Ahmed, M. S. B. Alam, M. Hassan, M. R. Rozbu, T. Ishtiak, N. Rafa, M. Mofijur, A. Shawkat Ali, A. H. Gandomi, Deep learning modelling techniques: current progress, applications, advantages, and challenges, Artificial Intelligence Review 56 (11) (2023) 13521–13617

  5. [5] X. Zhu, W. Zhou, Q.-L. Han, W. Ma, S. Wen, Y. Xiang, When software security meets large language models: A survey, IEEE/CAA Journal of Automatica Sinica 12 (2) (2025) 317–334

  6. [6] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837

  7. [7] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, Advances in Neural Information Processing Systems 36 (2023) 11809–11822

  8. [8] S. Padmanabhuni, B. Karuturi, J. K. Indupalli, S. R. Chilla, V. Yelleti, Meta-agentic framework for software bug detection using large language models, in: 2026 18th International Conference on COMmunication Systems and NETworks (COMSNETS), 2026, pp. 841–846. doi:10.1109/COMSNETS67989.2026.11418178

  9. [9] M. A. Ferrag, A. Battah, N. Tihanyi, R. Jain, D. Maimuţ, F. Alwahedi, T. Lestable, N. S. Thandi, A. Mechri, M. Debbah, L. C. Cordeiro, SecureFalcon: Are we there yet in automated software vulnerability detection with LLMs?, IEEE Transactions on Software Engineering 51 (4) (2025) 1248–1265. doi:10.1109/TSE.2025.3548168

  10. [10] H. Li, Y. Hao, Y. Zhai, Z. Qian, Enhancing static analysis for practical bug detection: An LLM-integrated approach, Proc. ACM Program. Lang. 8 (OOPSLA1) (Apr. 2024). doi:10.1145/3649828

  11. [11] Q. Zhang, C. Fang, B. Yu, W. Sun, T. Zhang, Z. Chen, Pre-trained model-based automated software vulnerability repair: How far are we?, IEEE Transactions on Dependable and Secure Computing 21 (4) (2024) 2507–2525. doi:10.1109/TDSC.2023.3308897

  12. [12] Y. Jiang, H. Liu, X. Luo, Z. Zhu, X. Chi, N. Niu, Y. Zhang, Y. Hu, P. Bian, L. Zhang, BugBuilder: An automated approach to building bug repository, IEEE Transactions on Software Engineering 49 (4) (2023) 1443–1463. doi:10.1109/TSE.2022.3177713

  13. [13] H. Guan, G. Bai, Y. Liu, CrossProbe: LLM-empowered cross-project bug detection for deep learning frameworks, Proceedings of the ACM on Software Engineering 2 (ISSTA) (2025) 2430–2452

  14. [14] S. S. Sijwali, A. M. Colom, A. Guo, S. Saha, Fixing performance bugs through LLM explanations, in: 2025 IEEE International Conference on Artificial Intelligence Testing (AITest), IEEE, 2025, pp. 102–109

  15. [15] B. Zhang, Z. Zhang, Detecting bugs with substantial monetary consequences by LLM and rule-based reasoning, Advances in Neural Information Processing Systems 37 (2024) 133999–134023

  16. [16] T. Hai, J. Zhou, N. Li, S. K. Jain, S. Agrawal, I. B. Dhaou, Cloud-based bug tracking software defects analysis using deep learning, Journal of Cloud Computing 11 (1) (2022) 32

  17. [17] M. Pradel, K. Sen, DeepBugs: A learning approach to name-based bug detection, Proceedings of the ACM on Programming Languages 2 (OOPSLA) (2018) 1–25

  18. [18] A. Kukkar, R. Mohana, Y. Kumar, A. Nayyar, M. Bilal, K.-S. Kwak, Duplicate bug report detection and classification system based on deep learning technique, IEEE Access 8 (2020) 200749–200763

  19. [19] H. V. Pham, T. Lutellier, W. Qi, L. Tan, Cradle: cross-backend validation to detect and localize bugs in deep learning libraries, in: 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), IEEE, 2019, pp. 1027–1038

  20. [20] S.-Q. Xi, Y. Yao, X.-S. Xiao, F. Xu, J. Lv, Bug triaging based on tossing sequence modeling, Journal of Computer Science and Technology 34 (5) (2019) 942–956

  21. [21] Z. Zeng, Y. Zhao, L. Gong, Classifying bug issue types for deep learning-oriented projects with pre-trained model, in: 2024 31st Asia-Pacific Software Engineering Conference (APSEC), IEEE, 2024, pp. 91–100

  22. [22] N. Fukuda, C. Wu, S. Horiuchi, K. Tayama, Fault report generation for ICT systems by jointly learning time-series and text data, in: NOMS 2022-2022 IEEE/IFIP Network Operations and Management Symposium, IEEE, 2022, pp. 1–9

  23. [23] Y. Wei, C. Zhang, T. Ren, Improving bug severity prediction with domain-specific representation learning, IEEE Access 11 (2023) 62829–62839

  24. [24] Q. Zhang, W. Sun, C. Fang, B. Yu, H. Li, M. Yan, J. Zhou, Z. Chen, Exploring automated assertion generation via large language models, ACM Trans. Softw. Eng. Methodol. 34 (3) (Feb. 2025). doi:10.1145/3699598

  25. [25] R. Siva, K. S, B. Hariharan, N. Premkumar, Automatic software bug prediction using adaptive golden eagle optimizer with deep learning, Multimedia Tools and Applications 83 (1) (2024) 1261–1281

  26. [26] S. Mostafa, S. T. Cynthia, B. Roy, D. Mondal, Feature transformation for improved software bug detection and commit classification, Journal of Systems and Software 219 (2025) 112205. doi:10.1016/j.jss.2024.112205

  27. [27] S. T. Cynthia, B. Roy, D. Mondal, Feature transformation for improved software bug detection models, in: Proceedings of the 15th Innovations in Software Engineering Conference, ISEC '22, Association for Computing Machinery, New York, NY, USA, 2022. doi:10.1145/3511430.3511444

  28. [28] S. Juneja, G. S. Bhathal, B. K. Sidhu, CRF_LSTM_DO: automated software bug detection deep learning framework, International Journal of Information Technology (Oct 2025). doi:10.1007/s41870-025-02834-0

  29. [29] R. Garg, A. Bhargava, Bug prediction based on deep neural network with reptile search optimization to enhance software reliability, Multimedia Tools and Applications 83 (31) (2024) 75869–75891. doi:10.1007/s11042-024-18479-3

  30. [30] D. Al-Fraihat, Y. Sharrab, A.-R. Al-Ghuwairi, H. Alshishani, A. Algarni, Hyperparameter optimization for software bug prediction using ensemble learning, IEEE Access 12 (2024) 51869–51878. doi:10.1109/ACCESS.2024.3380024

  31. [31] S. Garg, R. Z. Moghaddam, C. B. Clement, N. Sundaresan, C. Wu, DeepDev-PERF: a deep learning-based approach for improving software performance, in: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2022, Association for Computing Machinery, New York, NY, USA, 2022, p. …

  32. [32] K. Tameswar, G. Suddul, K. Dookhitram, A hybrid deep learning approach with genetic and coral reefs metaheuristics for enhanced defect detection in software, International Journal of Information Management Data Insights 2 (2) (2022) 100105. doi:10.1016/j.jjimei.2022.100105

  33. [33] K. Bharath, P. Jagadeesh, An innovative software bug prediction system using random forest algorithm for enhanced accuracy in comparison with logistic regression algorithm, in: 2023 Intelligent Computing and Control for Engineering and Business Systems (ICCEBS), 2023, pp. 1–6. doi:10.1109/ICCEBS58601.2023.10449266

  34. [34] D. Patel, Improving software performance through early bug detection using large-scale machine learning models, in: 2025 3rd World Conference on Communication & Computing (WCONF), 2025, pp. 1–6. doi:10.1109/WCONF64849.2025.11233621

  35. [35] L. Zhang, A. Miranskyy, Automated flakiness detection in quantum software bug reports, in: 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), Vol. 02, 2024, pp. 179–

  36. [36] doi:10.1109/QCE60285.2024.10274

  37. [37] M. Nadim, M. Hassan, A. K. Mandal, C. K. Roy, B. Roy, K. A. Schneider, Comparative analysis of quantum and classical support vector classifiers for software bug prediction: an exploratory study, Quantum Machine Intelligence 7 (1) (2025) 32. doi:10.1007/s42484-025-00236-w

  38. [38] R. Widyasari, S. Q. Sim, C. Lok, H. Qi, J. Phan, Q. Tay, C. Tan, F. Wee, J. E. Tan, Y. Yieh, B. Goh, F. Thung, H. J. Kang, T. Hoang, D. Lo, E. L. Ouh, BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies, in: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Sympo…

  39. [39] V. I. Levenshtein, et al., Binary codes capable of correcting deletions, insertions, and reversals, in: Soviet Physics Doklady, Vol. 10, Soviet Union, 1966, pp. 707–710

  40. [40] A. Singhal, et al., Modern information retrieval: A brief overview, IEEE Data Eng. Bull. 24 (4) (2001) 35–43