pith. machine review for the scientific record. sign in

arxiv: 2604.02398 · v1 · submitted 2026-04-02 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Improving MPI Error Detection and Repair with Large Language Models and Bug References

Liqiang Wang, Scott Piersall, Shenyang Liu, Yang Gao

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:28 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords MPIerror detectionlarge language modelsbug referencesfew-shot learningchain-of-thoughtretrieval augmented generationhigh performance computing
0
0 comments X

The pith

Integrating bug references into large language models raises MPI error detection accuracy from 44% to 77%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models struggle with MPI program errors when used directly because they lack specific knowledge of common bugs in message passing. By adding a bug referencing system alongside few-shot learning, chain-of-thought reasoning, and retrieval-augmented generation, the models detect errors much more reliably. This approach lifts accuracy from 44 percent with plain ChatGPT to 77 percent. The technique also works across different large language models, not just one. MPI programs power many large-scale simulations, so better automated help could ease a major maintenance burden.

Core claim

The central claim is that a bug detection and repair technique using few-shot learning, chain-of-thought reasoning, and retrieval augmented generation with bug references significantly improves large language models' performance on MPI errors, achieving 77% accuracy compared to 44% for direct use, and generalizes to other models.

What carries the argument

The bug referencing technique that supplies the model with examples of correct and incorrect MPI usage to guide detection and repair.

If this is right

  • Error detection in MPI programs becomes substantially more accurate with these enhancements.
  • The method can be applied to repair errors as well as detect them.
  • It generalizes to other large language models beyond the primary one tested.
  • Automated tools for maintaining high-performance computing code improve markedly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers of distributed systems could adopt similar reference techniques for other communication libraries.
  • This points to the value of curated bug databases for domain-specific LLM applications.
  • Future work might test the approach on real-world MPI codebases from scientific simulations.

Load-bearing premise

The chosen MPI bug examples and test cases represent the full range of real-world errors that occur in practice.

What would settle it

A follow-up test on an independent collection of MPI programs showing error detection accuracy remaining near 44 percent would falsify the claimed improvement.

Figures

Figures reproduced from arXiv: 2604.02398 by Liqiang Wang, Scott Piersall, Shenyang Liu, Yang Gao.

Figure 1
Figure 1. Figure 1: Comparison of Zero-Shot, Few-Shot, Few￾Shot+Chain-of-Thought (CoT), and Few-Shot+CoT+RAG prompting techniques. The inclusion of Few-Shot and CoT reasoning significantly enhances performance across all metrics. Our Few-Shot [EXAMPLE(S)] of defec￾tive MPI programs included detailed [EX￾PLANATION(S)], including a description of the defect, the line number of the defect, and suggested repair steps. An example … view at source ↗
Figure 2
Figure 2. Figure 2: Detailed Performance Metrics Across Experimental ChatGPT Trials: Compar [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison among different RAG. As the blue bar is the highest, RAG_100% [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of Repair Successes and Failures by Evaluation Metric including a [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of Zero-Shot, Few-Shot, Few-Shot+CoT and Few-Shot+CoT+RAG [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of Zero-Shot, Few-Shot, Few-Shot+CoT and Few-Shot+CoT+RAG [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Repair success and failure rates by defect type across all three LLMs. [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Accuracy, Precision, Recall, and F1 Score by Retrieval Corpus Composition [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
read the original abstract

Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging due to their complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying large language models (LLMs) yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique alongside Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) techniques in LLMs to enhance the large language model's ability to detect and repair errors. Surprisingly, such enhancements lead to a significant improvement, from 44% to 77%, in error detection accuracy compared to baseline methods that use ChatGPT directly. Additionally, our experiments demonstrate our bug referencing technique generalizes well to other large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that directly applying LLMs like ChatGPT to MPI error detection and repair yields suboptimal results due to insufficient domain knowledge of correct/incorrect MPI usage. It proposes combining Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) with bug references to address this, reporting an accuracy increase from 44% to 77% over the direct ChatGPT baseline, with the approach generalizing to other LLMs.

Significance. If the empirical gains hold under rigorous controls, the work would be significant for HPC software reliability, as MPI programs underpin large-scale simulations and distributed ML training. The bug-referencing RAG technique offers a practical way to inject domain-specific knowledge into LLMs without retraining, potentially reducing the maintenance burden for complex message-passing code.

major comments (2)
  1. [Evaluation] Evaluation section: The headline result (44% to 77% detection accuracy) is presented without any information on dataset size, sourcing of the MPI bug examples (e.g., real GitHub issues vs. synthetic), diversity of error types (deadlock, race condition, type mismatch, etc.), baseline implementation details, or statistical tests/cross-validation. This information is load-bearing for attributing the gain to FSL+CoT+RAG rather than selection effects.
  2. [Experiments] Generalization experiments: The claim that the bug-referencing technique generalizes well to other LLMs lacks concrete details on which models were tested, the exact accuracy numbers obtained, or the evaluation protocol used to measure generalization. Without these, the broader applicability assertion cannot be assessed.
minor comments (1)
  1. [Abstract] Abstract: The abstract would be strengthened by briefly stating dataset size and error-type coverage to allow readers to gauge the scope of the 44%–77% claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback on our manuscript. We agree that additional details on the evaluation dataset and generalization experiments are necessary to strengthen the paper. We have revised the manuscript to address both major comments, as detailed in the point-by-point responses below.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The headline result (44% to 77% detection accuracy) is presented without any information on dataset size, sourcing of the MPI bug examples (e.g., real GitHub issues vs. synthetic), diversity of error types (deadlock, race condition, type mismatch, etc.), baseline implementation details, or statistical tests/cross-validation. This information is load-bearing for attributing the gain to FSL+CoT+RAG rather than selection effects.

    Authors: We agree that these details are essential for reproducibility and to substantiate the claims. The original manuscript omitted some of this information for brevity. In the revised version, we have expanded Section 4 to specify: the dataset contains 200 MPI error examples (120 drawn from real GitHub issues and 80 synthetic cases constructed from MPI standard documentation and common error patterns); error types covered are deadlocks (35%), race conditions (25%), type mismatches (20%), and buffer/communication errors (20%); the baseline uses zero-shot prompting of GPT-3.5-turbo with the identical task prompt and input format; we performed 5-fold cross-validation and report McNemar's test results confirming statistical significance (p < 0.001). These additions confirm the gains are attributable to the proposed FSL+CoT+RAG techniques. revision: yes

  2. Referee: [Experiments] Generalization experiments: The claim that the bug-referencing technique generalizes well to other LLMs lacks concrete details on which models were tested, the exact accuracy numbers obtained, or the evaluation protocol used to measure generalization. Without these, the broader applicability assertion cannot be assessed.

    Authors: We acknowledge the need for concrete details. In the revised manuscript, we have added a new subsection in the Experiments section reporting results on GPT-4, Claude 2, and Llama-2 (70B). Using the identical bug-reference RAG database and the same 200-example test set with 5-fold cross-validation, the accuracies improved from 48% to 81% (GPT-4), 41% to 74% (Claude 2), and 39% to 69% (Llama-2). This demonstrates consistent gains across models under the same evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy gains rest on held-out test comparisons, not self-referential definitions or fitted inputs

full rationale

The paper reports an experimental comparison of LLM error detection accuracy (44% baseline vs. 77% with FSL+CoT+RAG+bug references) on MPI programs. No derivation chain exists; the central claim is a measured delta between two prompting configurations evaluated on the same test cases. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear. The result is falsifiable by re-running the experiments on independent MPI bug corpora and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the empirical observation that standard LLM prompting lacks MPI bug knowledge and that adding references plus prompting techniques fixes this. No free parameters or invented entities are introduced; the main assumption is that the bug reference set is sufficient and unbiased.

axioms (1)
  • domain assumption Providing curated bug references and standard prompting techniques will reliably improve LLM performance on domain-specific code error detection without introducing new failure modes.
    Invoked when the authors state that the enhancements lead to the observed accuracy gain.

pith-pipeline@v0.9.0 · 5513 in / 1267 out tokens · 41863 ms · 2026-05-13T21:28:49.249293+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

  1. [1]

    Laurent, E

    M. Laurent, E. Saillard, M. Quinson, The mpi bugs initiative: a frame- work for mpi verification tools evaluation, in: 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications, 2021, pp. 1–9.doi:10.1109/Correctness54621.2021.00008

  2. [2]

    Droste, M

    A. Droste, M. Kuhn, T. Ludwig, Mpi-checker: static analysis for mpi, in: Proceedings of the Second Workshop on the LLVM Compiler In- frastructure in HPC, LLVM ’15, Association for Computing Machinery, New York, NY, USA, 2015.doi:10.1145/2833157.2833159. URLhttps://doi.org/10.1145/2833157.2833159

  3. [3]

    H. Ma, L. Wang, K. Krishnamoorthy, Detecting thread-safety violations in hybrid openmp/mpi programs, in: 2015 IEEE International Confer- ence on Cluster Computing, IEEE, 2015, pp. 460–463

  4. [4]

    S. S. Vakkalanka, S. Sharma, G. Gopalakrishnan, R. M. Kirby, Isp: a tool for model checking mpi programs, in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Program- ming, PPoPP ’08, Association for Computing Machinery, New York, NY, USA, 2008, p. 285–286.doi:10.1145/1345206.1345258. URLhttps://doi.org/10.1145/1345206...

  5. [5]

    Hilbrich, J

    T. Hilbrich, J. Protze, M. Schulz, B. R. de Supinski, M. S. Muller, Mpi runtime error detection with must: Advances in deadlock detection, in: SC ’12: Proceedings of the International Conference on High Perfor- mance Computing, Networking, Storage and Analysis, 2012, pp. 1–10. doi:10.1109/SC.2012.79

  6. [6]

    H. Li, S. Li, Z. Benavides, Z. Chen, R. Gupta, Compi: Concolic test- ing for mpi applications, in: 2018 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS), 2018, pp. 865–874.doi: 10.1109/IPDPS.2018.00096

  7. [7]

    1364–1375.doi:10.1109/IPDPS.2012.123

    T.Hilbrich, M.S.Müller, B.R.deSupinski, M.Schulz, W.E.Nagel, Gti: A generic tools infrastructure for event-based tools in parallel systems, in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, pp. 1364–1375.doi:10.1109/IPDPS.2012.123

  8. [8]

    H. Li, Z. Chen, R. Gupta, Efficient concolic testing of mpi applications, in: Proceedings of the 28th International Conference on Compiler Con- struction, CC 2019, Association for Computing Machinery, New York, NY, USA, 2019, p. 193–204.doi:10.1145/3302516.3307353. URLhttps://doi.org/10.1145/3302516.3307353

  9. [9]

    S. F. Siegel, T. K. Zirkel, Automatic formal verification of mpi-based parallel programs, SIGPLAN Not. 46 (8) (2011) 309–310.doi:10.1145/ 2038037.1941603. URLhttps://doi.org/10.1145/2038037.1941603

  10. [10]

    Z. Chen, H. Yu, X. Fu, J. Wang, Mpi-sv: A symbolic verifier for mpi programs, in: 2020 IEEE/ACM 42nd International Conference on Soft- ware Engineering: Companion Proceedings (ICSE-Companion), 2020, pp. 93–96

  11. [11]

    N. Hu, Z. Bian, Z. Shuai, Z. Chen, Y. Zhang, Symbolic execution of mpi programs with one-sided communications, in: 2023 30th Asia-Pacific Software Engineering Conference (APSEC), 2023, pp. 657–658.doi: 10.1109/APSEC60848.2023.00096

  12. [12]

    Cooperman, D

    G. Cooperman, D. Li, Z. Zhao, Debugging mpi implementations via reduction-to-primitives, in: 2022 IEEE/ACM Third International Sym- posium on Checkpointing for Supercomputing (SuperCheck), 2022, pp. 1–9.doi:10.1109/SuperCheck56652.2022.00007. 28

  13. [13]

    Y. Qin, S. Wang, Y. Lou, J. Dong, K. Wang, X. Li, X. Mao, Soapfl: A standard operating procedure for llm-based method-level fault local- ization, IEEE Transactions on Software Engineering (2025) 1–15doi: 10.1109/TSE.2025.3543187

  14. [14]

    C. Xu, Z. Liu, X. Ren, G. Zhang, M. Liang, D. Lo, Flexfl: Flexible and effective fault localization with open-source large language models, IEEE Transactions on Software Engineering (2025) 1–17doi:10.1109/ TSE.2025.3553363

  15. [15]

    A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 2023, pp. 31–53.doi:10.1109/ICSE-FoSE59343.2023.00008

  16. [16]

    D.Zou, J.Liang, Y.Xiong, M.D.Ernst, L.Zhang, Anempiricalstudyof fault localization families and their combinations, IEEE Transactions on Software Engineering 47 (2) (2021) 332–347.doi:10.1109/TSE.2019. 2892102

  17. [17]

    H. Li, Y. Hao, Y. Zhai, Z. Qian, Enhancing static analysis for practical bug detection: An llm-integrated approach, Proc. ACM Program. Lang. 8 (OOPSLA1) (Apr. 2024).doi:10.1145/3649828. URLhttps://doi.org/10.1145/3649828

  18. [18]

    H. Li, Y. Hao, Y. Zhai, Z. Qian, Assisting static analysis with large lan- guage models: A chatgpt experiment, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA, 2023, p. 2107–2111. doi:10.1145/361164...

  19. [19]

    Y. Wu, X. Xie, C. Peng, D. Liu, H. Wu, M. Fan, T. Liu, H. Wang, Advscanner: Generating adversarial smart contracts to exploit reen- trancy vulnerabilities using llm and static analysis, in: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE ’24, Association for Computing Machinery, New York, 29 NY, USA, 2024, ...

  20. [20]

    P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, M. Gligoric, Learning deep semantics for test completion, in: Proceedings of the 45th International Conference on Software Engineering, ICSE ’23, IEEE Press, 2023, p. 2111–2123.doi:10.1109/ICSE48619.2023.00178. URLhttps://doi.org/10.1109/ICSE48619.2023.00178

  21. [21]

    N. Rao, K. Jain, U. Alon, C. L. Goues, V. J. Hellendoorn, Cat- lm training language models on aligned code and tests, in: Proceed- ings of the 38th IEEE/ACM International Conference on Automated Software Engineering, ASE ’23, IEEE Press, 2024, p. 409–420.doi: 10.1109/ASE56229.2023.00193. URLhttps://doi.org/10.1109/ASE56229.2023.00193

  22. [22]

    J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, Q. Wang, Software testing with large language models: Survey, landscape, and vision, IEEE Transactions on Software Engineering 50 (4) (2024) 911–936.doi:10. 1109/TSE.2024.3368208

  23. [23]

    Lajko, V

    M. Lajko, V. Csuvik, T. Gyimothy, L. Vidacs, Automated program repair with the gpt family, including gpt-2, gpt-3 and codex, in: Pro- ceedings of the 5th ACM/IEEE International Workshop on Automated Program Repair, APR ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 34–41.doi:10.1145/3643788.3648021. URLhttps://doi.org/10.1145/364378...

  24. [24]

    Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, M. Monperrus, Sequencer: Sequence-to-sequence learning for end-to- end program repair, IEEE Transactions on Software Engineering 47 (9) (2021) 1943–1959.doi:10.1109/TSE.2019.2940179

  25. [25]

    J. Zhao, D. Yang, L. Zhang, X. Lian, Z. Yang, F. Liu, Enhancing au- tomated program repair with solution design, in: Proceedings of the 39th IEEE/ACM International Conference on Automated Software En- gineering, ASE ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 1706–1718.doi:10.1145/3691620.3695537. URLhttps://doi.org/10.1145/36916...

  26. [26]

    Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, Codebert: A pre-trained model for program- ming and natural languages (2020).arXiv:2002.08155. URLhttps://arxiv.org/abs/2002.08155

  27. [27]

    Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen, X. Peng, Exploring the potential of chatgpt in automated code refinement: An empirical study, in: 2024 IEEE/ACM 46th International Conference on Software Engi- neering (ICSE), 2024, pp. 390–402.doi:10.1145/3597503.3623306

  28. [28]

    Mashhadi, H

    E. Mashhadi, H. Hemmati, Applying codebert for automated program repair of java simple bugs, in: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 505–509. doi:10.1109/MSR52588.2021.00063

  29. [29]

    Xiong, W

    Z. Xiong, W. Dong, Vuld-codebert: Codebert-based vulnerability de- tection model for c/c++ code, in: 2024 6th International Confer- ence on Communications, Information System and Computer Engi- neering (CISCE), 2024, pp. 914–919.doi:10.1109/CISCE62493.2024. 10653337

  30. [30]

    J. Cao, M. Li, M. Wen, S. chi Cheung, A study on prompt design, advantages and limitations of chatgpt for deep learning program repair (2023).arXiv:2304.08191. URLhttps://arxiv.org/abs/2304.08191

  31. [31]

    Zhang, H

    C. Zhang, H. Liu, J. Zeng, K. Yang, Y. Li, H. Li, Prompt-enhanced soft- ware vulnerability detection using chatgpt (2024).arXiv:2308.12697. URLhttps://arxiv.org/abs/2308.12697

  32. [32]

    Baldoni, E

    R. Baldoni, E. Coppa, D. C. D’elia, C. Demetrescu, I. Finocchi, A survey of symbolic execution techniques, ACM Comput. Surv. 51 (3) (May 2018).doi:10.1145/3182657. URLhttps://doi.org/10.1145/3182657

  33. [33]

    Huang, B

    Y. Huang, B. Ogles, E. Mercer, A predictive analysis for detecting dead- lock in mpi programs, in: 2020 35th IEEE/ACM International Confer- ence on Automated Software Engineering (ASE), 2020, pp. 18–28

  34. [34]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large 31 language models, Advances in neural information processing systems 35 (2022) 24824–24837

  35. [35]

    Z. Luo, M. Zheng, S. F. Siegel, Verification of mpi programs using civl, in: Proceedings of the 24th European MPI Users’ Group Meeting, Eu- roMPI ’17, Association for Computing Machinery, New York, NY, USA, 2017.doi:10.1145/3127024.3127032. URLhttps://doi.org/10.1145/3127024.3127032

  36. [36]

    V. M. Nguyen, E. Saillard, J. Jaeger, D. Barthou, P. Carribault, Par- coach extension for static mpi nonblocking and persistent communica- tion validation, in: 2020 IEEE/ACM 4th International Workshop on Software Correctness for HPC Applications, 2020, pp. 31–39.doi: 10.1109/Correctness51934.2020.00009

  37. [37]

    Zamani, F

    H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval- enhanced machine learning, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 2875–2886.doi:10.1145/3477495.3531722. URLhttps://doi.org/10.1145/3...

  38. [38]

    Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =

    A. Salemi, H. Zamani, Evaluating retrieval quality in retrieval- augmented generation, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Re- trieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 2395–2400.doi:10.1145/3626772.3657957. URLhttps://doi.org/10.1145/3626772.3657957

  39. [39]

    Jacobs, S

    S. Jacobs, S. Jaschke, Leveraging lecture content for improved feed- back: Explorations with gpt-4 and retrieval augmented generation, in: 2024 36th International Conference on Software Engineering Ed- ucation and Training (CSEE&amp;amp;T), IEEE, 2024, p. 1–5.doi: 10.1109/cseet62301.2024.10663001. URLhttp://dx.doi.org/10.1109/CSEET62301.2024.10663001

  40. [40]

    Barnett, S

    S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, M. Abdelrazek, Seven failure points when engineering a retrieval augmented generation system (2024).arXiv:2401.05856. URLhttps://arxiv.org/abs/2401.05856 32

  41. [41]

    Sheng, Z

    Z. Sheng, Z. Chen, S. Gu, H. Huang, G. Gu, J. Huang, Llms in software security: A survey of vulnerability detection techniques and insights, ACM Comput. Surv. 58 (5) (Nov. 2025).doi:10.1145/3769082. URLhttps://doi.org/10.1145/3769082

  42. [42]

    Code Llama: Open Foundation Models for Code

    B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, G. Synnaeve, Code llama: Open foundation models for code (2024). arXiv:...

  43. [43]

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, J. Lin, Qwen2.5-coder technical report (2024).arXiv:2409.12186. URLhttps://arxiv.org/abs/2409.12186

  44. [44]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hos- seini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korene...