FGDM: Reasoning Aware Multi-Agentic Framework for Software Bug Detection using Chain of Thought and Tree of Thought Prompting
Pith reviewed 2026-05-08 02:51 UTC · model grok-4.3
The pith
FGDM is a sequential multi-agent system using flow graphs, CoT/ToT prompts, and FAISS retrieval that reports mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) with cosine similarities of 0.951 and 0.974 on 100 programs from ten open-source projects.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.
Load-bearing premise
That Levenshtein distance and cosine similarity on the generated repairs, together with the chosen 100 programs from ten projects, constitute sufficient evidence that FGDM outperforms prior bug-detection methods in real-world settings.
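To make the premise concrete, the two metrics can be computed as in the minimal Python sketch below. The text reviewed here does not define how a "reduction" in Levenshtein distance is measured or how code is vectorized for cosine similarity. Both `distance_reduction` (buggy-to-reference distance minus repair-to-reference distance) and the whitespace-token vectorization are therefore assumptions made for illustration, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over whitespace-token counts (assumed vectorization)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[token] * vb[token] for token in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def distance_reduction(buggy: str, repaired: str, reference: str) -> int:
    """Assumed definition: how much closer the repair is to the reference
    fix than the original buggy code was."""
    return levenshtein(buggy, reference) - levenshtein(repaired, reference)


if __name__ == "__main__":
    buggy = "return a - b"
    reference = "return a + b"
    repaired = "return a + b"
    print(distance_reduction(buggy, repaired, reference))
    print(cosine_similarity(repaired, reference))
```

Under this assumed definition, both numbers measure only textual closeness to a reference fix. Neither says anything about whether the repaired program runs or passes its tests, which is why the premise carries so much weight.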
Original abstract
Deep Learning methods are becoming prominent in automated software bug detection; however, they lack the global understanding of the given code. Consequently, their performance tends to degrade, especially when they are applied to large interconnected code bases or complex modular programs. Recently, Large Language Models (LLMs) have proven to be effective at capturing dependencies among multiple interconnected modules in the codebase. This motivated us to propose the Flow-Graph-Driven Multi-Agent Framework (FGDM), which is composed of four agents that operate in a sequential manner. The framework converts the received code to a flow graph, identifies the erroneous segments, and further generates the repaired code. All the employed agents utilize Chain-of-Thought (COT) and Tree-of-Thoughts (TOT) prompts. Additionally, we also integrated with the FAISS vector database to retrieve similar previous bugs and their repairs. We demonstrated the efficacy of the proposed framework over 100 programs from several projects, including Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, and Tornado in both C and Python programs. Our experiments demonstrate that the FGDM outperforms the extant approaches and yielded reductions with a mean of 24.33 and 8.37 in Levenshtein distance and similarities of 0.951 and 0.974 in cosine similarity for Python and C, respectively.
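The retrieval step mentioned in the abstract can be pictured with a small Python sketch. It is not the authors' implementation and does not use FAISS: a brute-force cosine search over toy token-count vectors stands in for the vector index. The stored bug/repair pairs, the `embed` function, and the `retrieve` helper are all hypothetical, since the abstract does not specify the embedding model or index type.

```python
from collections import Counter
from math import sqrt

# Hypothetical store of (buggy snippet, repair) pairs; not taken from the paper.
BUG_STORE = [
    ("for i in range(len(xs) + 1): total += xs[i]",
     "for i in range(len(xs)): total += xs[i]"),
    ("if user = None: return",
     "if user is None: return"),
    ("open(path).read()",
     "with open(path) as fh: fh.read()"),
]


def embed(code: str) -> Counter:
    """Toy embedding: whitespace-token counts (an assumption for illustration)."""
    return Counter(code.split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm = (sqrt(sum(c * c for c in u.values()))
            * sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 1):
    """Return the k stored (bug, repair) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(BUG_STORE,
                    key=lambda pair: cosine(q, embed(pair[0])),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    for bug, repair in retrieve("if item = None: return"):
        print("similar bug:", bug)
        print("prior repair:", repair)
```

A real vector index would replace the `sorted` call with an approximate or exact nearest-neighbour search over dense embeddings; the retrieved pairs would then be supplied to the repair agent as context.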
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes FGDM, a multi-agent framework for automated software bug detection and repair. It uses four sequential agents that convert input code to a flow graph, identify erroneous segments via Chain-of-Thought and Tree-of-Thought prompting, retrieve similar prior bugs with FAISS, and generate repairs. Evaluation is performed on 100 programs drawn from ten projects (Ansible, Black, FastAPI, Keras, Luigi, Matplotlib, Pandas, Scrapy, SpaCy, Tornado) in both Python and C, with the central claim that FGDM outperforms prior approaches by achieving mean Levenshtein distance reductions of 24.33 (Python) and 8.37 (C) together with cosine similarities of 0.951 and 0.974.
Significance. The multi-agent design that explicitly incorporates flow-graph representations of code dependencies and retrieval-augmented reasoning is a reasonable direction for improving LLM-based repair on modular codebases. If the empirical results were supported by functional verification and explicit baselines, the work could usefully extend the literature on structured prompting for code tasks.
major comments (2)
- [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.
- [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.
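The gap between syntactic closeness and functional correctness can be seen in a small Python sketch. The snippets, the `add` function, and the two-case test suite are hypothetical, and `difflib.SequenceMatcher` is used only as a stand-in for a character-level similarity score, not as the paper's metric.

```python
from difflib import SequenceMatcher

reference_fix = "def add(a, b):\n    return a + b\n"
candidate_fix = "def add(a, b):\n    return a - b\n"  # differs by one character


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def passes_tests(source: str) -> bool:
    """Run a tiny hypothetical test suite against the function in `source`."""
    namespace = {}
    exec(source, namespace)  # acceptable only for trusted, illustrative snippets
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


score = similarity(candidate_fix, reference_fix)
candidate_ok = passes_tests(candidate_fix)
reference_ok = passes_tests(reference_fix)

if __name__ == "__main__":
    print("similarity to reference:", score)
    print("candidate passes tests:", candidate_ok)
    print("reference passes tests:", reference_ok)
```

The candidate is almost identical to the reference as text yet fails the tests, so a high similarity score alone cannot show that a bug was fixed.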
minor comments (1)
- [Abstract] The selection criteria for the 100 programs and the nature of the injected or observed bugs are not described, making it difficult to judge the representativeness of the test set.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and outline revisions that will strengthen the presentation of our empirical results while maintaining the integrity of the reported findings.
- Referee: [Abstract] Abstract: The claim that FGDM 'outperforms the extant approaches' is presented with only aggregate Levenshtein and cosine values; no baseline methods are named, no per-baseline numbers are supplied, and no statistical significance or variance is reported. This directly affects the central empirical claim.
  Authors: We agree that the abstract would benefit from greater specificity to support the outperforming claim. The full manuscript contains comparisons against prior approaches in the evaluation section, but these details were summarized at a high level in the abstract. In the revised version, we will name the specific baseline methods, include per-baseline Levenshtein distance and cosine similarity values, and report variance along with statistical significance tests for the mean reductions of 24.33 (Python) and 8.37 (C).
  revision: yes
- Referee: [Abstract] Abstract / Evaluation: The reported metrics are purely syntactic (Levenshtein distance and cosine similarity on generated repairs). No evidence is given that repairs were executed against the original test suites or checked for semantic equivalence, so the numbers do not establish that bugs were actually fixed.
  Authors: The referee correctly identifies that our primary metrics are syntactic. Levenshtein distance reduction and cosine similarity are standard quantitative proxies in automated program repair research for measuring how closely generated repairs align with ground-truth fixes. We did not execute the repaired programs against test suites or perform explicit semantic equivalence checks in the current experiments. In the revision, we will add a dedicated limitations paragraph acknowledging this and will discuss plans for functional verification as future work. The syntactic metrics still demonstrate that FGDM produces repairs substantially closer to expected fixes than the baselines considered.
  revision: partial
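The variance and significance reporting promised in the first response could take a form like the Python sketch below. The per-program numbers are invented for illustration and are not values from the paper; the choice of an exact paired sign-flip permutation test is also an assumption, since the rebuttal does not name a test.

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical per-program Levenshtein reductions; NOT values from the paper.
fgdm_reductions = [30, 22, 18, 27, 25, 21, 29, 24]
baseline_reductions = [20, 19, 17, 21, 26, 15, 22, 20]

# Paired differences: same program, FGDM minus baseline.
differences = [f - b for f, b in zip(fgdm_reductions, baseline_reductions)]


def paired_sign_flip_p_value(diffs):
    """Exact two-sided sign-flip permutation test on paired differences.

    Under the null hypothesis of no systematic difference, each paired
    difference is equally likely to be positive or negative, so every
    sign assignment is enumerated and compared with the observed total.
    """
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


mean_difference = mean(differences)
spread = stdev(differences)  # sample standard deviation
p_value = paired_sign_flip_p_value(differences)

if __name__ == "__main__":
    print("mean paired difference:", mean_difference)
    print("sample standard deviation:", spread)
    print("two-sided p-value:", p_value)
```

Exhaustive enumeration is only practical for small samples; with 100 programs a randomized permutation test or a bootstrap confidence interval would be the more usual choice.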
Circularity Check
No circularity; empirical evaluation stands on direct metrics without self-referential reduction
The paper proposes the FGDM multi-agent framework and supports its claims solely through experimental results on 100 programs, reporting mean Levenshtein distance reductions and cosine similarities. No equations, derivations, fitted parameters, or self-citations appear in the provided text that would reduce any central claim to its own inputs by construction. The evaluation uses standard syntactic similarity measures applied to generated repairs; these are independent measurements rather than predictions forced by prior fitting or definitional loops. The framework description (flow-graph conversion, agent sequencing, COT/TOT prompting, FAISS retrieval) introduces no circular dependencies in its justification.