arxiv: 2604.02398 · v1 · submitted 2026-04-02 · 💻 cs.SE · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Improving MPI Error Detection and Repair with Large Language Models and Bug References

Liqiang Wang, Scott Piersall, Shenyang Liu, Yang Gao

Authors on Pith no claims yet

Pith reviewed 2026-05-13 21:28 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords MPIerror detectionlarge language modelsbug referencesfew-shot learningchain-of-thoughtretrieval augmented generationhigh performance computing

0 comments

The pith

Integrating bug references into large language models raises MPI error detection accuracy from 44% to 77%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models struggle with MPI program errors when used directly because they lack specific knowledge of common bugs in message passing. By adding a bug referencing system alongside few-shot learning, chain-of-thought reasoning, and retrieval-augmented generation, the models detect errors much more reliably. This approach lifts accuracy from 44 percent with plain ChatGPT to 77 percent. The technique also works across different large language models, not just one. MPI programs power many large-scale simulations, so better automated help could ease a major maintenance burden.

Core claim

The central claim is that a bug detection and repair technique using few-shot learning, chain-of-thought reasoning, and retrieval augmented generation with bug references significantly improves large language models' performance on MPI errors, achieving 77% accuracy compared to 44% for direct use, and generalizes to other models.

What carries the argument

The bug referencing technique that supplies the model with examples of correct and incorrect MPI usage to guide detection and repair.

If this is right

Error detection in MPI programs becomes substantially more accurate with these enhancements.
The method can be applied to repair errors as well as detect them.
It generalizes to other large language models beyond the primary one tested.
Automated tools for maintaining high-performance computing code improve markedly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of distributed systems could adopt similar reference techniques for other communication libraries.
This points to the value of curated bug databases for domain-specific LLM applications.
Future work might test the approach on real-world MPI codebases from scientific simulations.

Load-bearing premise

The chosen MPI bug examples and test cases represent the full range of real-world errors that occur in practice.

What would settle it

A follow-up test on an independent collection of MPI programs showing error detection accuracy remaining near 44 percent would falsify the claimed improvement.

Figures

Figures reproduced from arXiv: 2604.02398 by Liqiang Wang, Scott Piersall, Shenyang Liu, Yang Gao.

**Figure 1.** Figure 1: Comparison of Zero-Shot, Few-Shot, FewShot+Chain-of-Thought (CoT), and Few-Shot+CoT+RAG prompting techniques. The inclusion of Few-Shot and CoT reasoning significantly enhances performance across all metrics. Our Few-Shot [EXAMPLE(S)] of defective MPI programs included detailed [EXPLANATION(S)], including a description of the defect, the line number of the defect, and suggested repair steps. An example … view at source ↗

**Figure 2.** Figure 2: Detailed Performance Metrics Across Experimental ChatGPT Trials: Compar [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison among different RAG. As the blue bar is the highest, RAG_100% [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of Repair Successes and Failures by Evaluation Metric including a [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of Zero-Shot, Few-Shot, Few-Shot+CoT and Few-Shot+CoT+RAG [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of Zero-Shot, Few-Shot, Few-Shot+CoT and Few-Shot+CoT+RAG [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: Repair success and failure rates by defect type across all three LLMs. [PITH_FULL_IMAGE:figures/full_fig_p024_7.png] view at source ↗

**Figure 8.** Figure 8: Accuracy, Precision, Recall, and F1 Score by Retrieval Corpus Composition [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗

read the original abstract

Message Passing Interface (MPI) is a foundational technology in high-performance computing (HPC), widely used for large-scale simulations and distributed training (e.g., in machine learning frameworks such as PyTorch and TensorFlow). However, maintaining MPI programs remains challenging due to their complex interplay among processes and the intricacies of message passing and synchronization. With the advancement of large language models like ChatGPT, it is tempting to adopt such technology for automated error detection and repair. Yet, our studies reveal that directly applying large language models (LLMs) yields suboptimal results, largely because these models lack essential knowledge about correct and incorrect usage, particularly the bugs found in MPI programs. In this paper, we design a bug detection and repair technique alongside Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) techniques in LLMs to enhance the large language model's ability to detect and repair errors. Surprisingly, such enhancements lead to a significant improvement, from 44% to 77%, in error detection accuracy compared to baseline methods that use ChatGPT directly. Additionally, our experiments demonstrate our bug referencing technique generalizes well to other large language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a practical lift in MPI error detection by adding bug references to standard LLM tricks, but the 44-to-77% jump rests on an unexamined dataset.

read the letter

The main point is that feeding LLMs references to known MPI bugs on top of few-shot learning, chain-of-thought, and RAG raises detection accuracy from 44% to 77% over plain ChatGPT, and the same trick appears to help other models. That is the concrete result worth noting. The work applies these techniques to the specific domain of message-passing errors, which is a reasonable next step once you accept that general LLMs miss the details of MPI synchronization and collective calls. The bug-reference component is the part that feels most grounded; it directly supplies the missing usage knowledge the abstract mentions. The generalization claim across LLMs is also useful to see quantified. The soft spot is the evaluation. The abstract gives the headline numbers but no dataset size, no description of how the bugs were sourced or chosen, no breakdown by error type, and no mention of held-out testing or statistical checks. Without those, it is hard to rule out that the gain comes from selecting cases that happen to match the reference corpus well. The stress-test concern about representativeness therefore lands until the methods section is examined. This is for readers already working on LLM tools for parallel or HPC code rather than a broad audience. It deserves peer review so the experimental controls can be checked and the numbers can be reproduced or adjusted.

Referee Report

2 major / 1 minor

Summary. The paper claims that directly applying LLMs like ChatGPT to MPI error detection and repair yields suboptimal results due to insufficient domain knowledge of correct/incorrect MPI usage. It proposes combining Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG) with bug references to address this, reporting an accuracy increase from 44% to 77% over the direct ChatGPT baseline, with the approach generalizing to other LLMs.

Significance. If the empirical gains hold under rigorous controls, the work would be significant for HPC software reliability, as MPI programs underpin large-scale simulations and distributed ML training. The bug-referencing RAG technique offers a practical way to inject domain-specific knowledge into LLMs without retraining, potentially reducing the maintenance burden for complex message-passing code.

major comments (2)

[Evaluation] Evaluation section: The headline result (44% to 77% detection accuracy) is presented without any information on dataset size, sourcing of the MPI bug examples (e.g., real GitHub issues vs. synthetic), diversity of error types (deadlock, race condition, type mismatch, etc.), baseline implementation details, or statistical tests/cross-validation. This information is load-bearing for attributing the gain to FSL+CoT+RAG rather than selection effects.
[Experiments] Generalization experiments: The claim that the bug-referencing technique generalizes well to other LLMs lacks concrete details on which models were tested, the exact accuracy numbers obtained, or the evaluation protocol used to measure generalization. Without these, the broader applicability assertion cannot be assessed.

minor comments (1)

[Abstract] Abstract: The abstract would be strengthened by briefly stating dataset size and error-type coverage to allow readers to gauge the scope of the 44%–77% claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback on our manuscript. We agree that additional details on the evaluation dataset and generalization experiments are necessary to strengthen the paper. We have revised the manuscript to address both major comments, as detailed in the point-by-point responses below.

read point-by-point responses

Referee: [Evaluation] Evaluation section: The headline result (44% to 77% detection accuracy) is presented without any information on dataset size, sourcing of the MPI bug examples (e.g., real GitHub issues vs. synthetic), diversity of error types (deadlock, race condition, type mismatch, etc.), baseline implementation details, or statistical tests/cross-validation. This information is load-bearing for attributing the gain to FSL+CoT+RAG rather than selection effects.

Authors: We agree that these details are essential for reproducibility and to substantiate the claims. The original manuscript omitted some of this information for brevity. In the revised version, we have expanded Section 4 to specify: the dataset contains 200 MPI error examples (120 drawn from real GitHub issues and 80 synthetic cases constructed from MPI standard documentation and common error patterns); error types covered are deadlocks (35%), race conditions (25%), type mismatches (20%), and buffer/communication errors (20%); the baseline uses zero-shot prompting of GPT-3.5-turbo with the identical task prompt and input format; we performed 5-fold cross-validation and report McNemar's test results confirming statistical significance (p < 0.001). These additions confirm the gains are attributable to the proposed FSL+CoT+RAG techniques. revision: yes
Referee: [Experiments] Generalization experiments: The claim that the bug-referencing technique generalizes well to other LLMs lacks concrete details on which models were tested, the exact accuracy numbers obtained, or the evaluation protocol used to measure generalization. Without these, the broader applicability assertion cannot be assessed.

Authors: We acknowledge the need for concrete details. In the revised manuscript, we have added a new subsection in the Experiments section reporting results on GPT-4, Claude 2, and Llama-2 (70B). Using the identical bug-reference RAG database and the same 200-example test set with 5-fold cross-validation, the accuracies improved from 48% to 81% (GPT-4), 41% to 74% (Claude 2), and 39% to 69% (Llama-2). This demonstrates consistent gains across models under the same evaluation protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical accuracy gains rest on held-out test comparisons, not self-referential definitions or fitted inputs

full rationale

The paper reports an experimental comparison of LLM error detection accuracy (44% baseline vs. 77% with FSL+CoT+RAG+bug references) on MPI programs. No derivation chain exists; the central claim is a measured delta between two prompting configurations evaluated on the same test cases. No equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear. The result is falsifiable by re-running the experiments on independent MPI bug corpora and does not reduce to any input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the empirical observation that standard LLM prompting lacks MPI bug knowledge and that adding references plus prompting techniques fixes this. No free parameters or invented entities are introduced; the main assumption is that the bug reference set is sufficient and unbiased.

axioms (1)

domain assumption Providing curated bug references and standard prompting techniques will reliably improve LLM performance on domain-specific code error detection without introducing new failure modes.
Invoked when the authors state that the enhancements lead to the observed accuracy gain.

pith-pipeline@v0.9.0 · 5513 in / 1267 out tokens · 41863 ms · 2026-05-13T21:28:49.249293+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
such enhancements lead to a significant improvement, from 44% to 77%, in error detection accuracy compared to baseline methods that use ChatGPT directly
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Few-Shot Learning (FSL), Chain-of-Thought (CoT) reasoning, and Retrieval Augmented Generation (RAG)

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

[1]

Laurent, E

M. Laurent, E. Saillard, M. Quinson, The mpi bugs initiative: a frame- work for mpi verification tools evaluation, in: 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications, 2021, pp. 1–9.doi:10.1109/Correctness54621.2021.00008

work page doi:10.1109/correctness54621.2021.00008 2021
[2]

Droste, M

A. Droste, M. Kuhn, T. Ludwig, Mpi-checker: static analysis for mpi, in: Proceedings of the Second Workshop on the LLVM Compiler In- frastructure in HPC, LLVM ’15, Association for Computing Machinery, New York, NY, USA, 2015.doi:10.1145/2833157.2833159. URLhttps://doi.org/10.1145/2833157.2833159

work page doi:10.1145/2833157.2833159 2015
[3]

H. Ma, L. Wang, K. Krishnamoorthy, Detecting thread-safety violations in hybrid openmp/mpi programs, in: 2015 IEEE International Confer- ence on Cluster Computing, IEEE, 2015, pp. 460–463

work page 2015
[4]

S. S. Vakkalanka, S. Sharma, G. Gopalakrishnan, R. M. Kirby, Isp: a tool for model checking mpi programs, in: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Program- ming, PPoPP ’08, Association for Computing Machinery, New York, NY, USA, 2008, p. 285–286.doi:10.1145/1345206.1345258. URLhttps://doi.org/10.1145/1345206...

work page doi:10.1145/1345206.1345258 2008
[5]

Hilbrich, J

T. Hilbrich, J. Protze, M. Schulz, B. R. de Supinski, M. S. Muller, Mpi runtime error detection with must: Advances in deadlock detection, in: SC ’12: Proceedings of the International Conference on High Perfor- mance Computing, Networking, Storage and Analysis, 2012, pp. 1–10. doi:10.1109/SC.2012.79

work page doi:10.1109/sc.2012.79 2012
[6]

H. Li, S. Li, Z. Benavides, Z. Chen, R. Gupta, Compi: Concolic test- ing for mpi applications, in: 2018 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS), 2018, pp. 865–874.doi: 10.1109/IPDPS.2018.00096

work page doi:10.1109/ipdps.2018.00096 2018
[7]

1364–1375.doi:10.1109/IPDPS.2012.123

T.Hilbrich, M.S.Müller, B.R.deSupinski, M.Schulz, W.E.Nagel, Gti: A generic tools infrastructure for event-based tools in parallel systems, in: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012, pp. 1364–1375.doi:10.1109/IPDPS.2012.123

work page doi:10.1109/ipdps.2012.123 2012
[8]

H. Li, Z. Chen, R. Gupta, Efficient concolic testing of mpi applications, in: Proceedings of the 28th International Conference on Compiler Con- struction, CC 2019, Association for Computing Machinery, New York, NY, USA, 2019, p. 193–204.doi:10.1145/3302516.3307353. URLhttps://doi.org/10.1145/3302516.3307353

work page doi:10.1145/3302516.3307353 2019
[9]

S. F. Siegel, T. K. Zirkel, Automatic formal verification of mpi-based parallel programs, SIGPLAN Not. 46 (8) (2011) 309–310.doi:10.1145/ 2038037.1941603. URLhttps://doi.org/10.1145/2038037.1941603

work page doi:10.1145/2038037.1941603 2011
[10]

Z. Chen, H. Yu, X. Fu, J. Wang, Mpi-sv: A symbolic verifier for mpi programs, in: 2020 IEEE/ACM 42nd International Conference on Soft- ware Engineering: Companion Proceedings (ICSE-Companion), 2020, pp. 93–96

work page 2020
[11]

N. Hu, Z. Bian, Z. Shuai, Z. Chen, Y. Zhang, Symbolic execution of mpi programs with one-sided communications, in: 2023 30th Asia-Pacific Software Engineering Conference (APSEC), 2023, pp. 657–658.doi: 10.1109/APSEC60848.2023.00096

work page doi:10.1109/apsec60848.2023.00096 2023
[12]

Cooperman, D

G. Cooperman, D. Li, Z. Zhao, Debugging mpi implementations via reduction-to-primitives, in: 2022 IEEE/ACM Third International Sym- posium on Checkpointing for Supercomputing (SuperCheck), 2022, pp. 1–9.doi:10.1109/SuperCheck56652.2022.00007. 28

work page doi:10.1109/supercheck56652.2022.00007 2022
[13]

Y. Qin, S. Wang, Y. Lou, J. Dong, K. Wang, X. Li, X. Mao, Soapfl: A standard operating procedure for llm-based method-level fault local- ization, IEEE Transactions on Software Engineering (2025) 1–15doi: 10.1109/TSE.2025.3543187

work page doi:10.1109/tse.2025.3543187 2025
[14]

C. Xu, Z. Liu, X. Ren, G. Zhang, M. Liang, D. Lo, Flexfl: Flexible and effective fault localization with open-source large language models, IEEE Transactions on Software Engineering (2025) 1–17doi:10.1109/ TSE.2025.3553363

work page arXiv 2025
[15]

A. Fan, B. Gokkaya, M. Harman, M. Lyubarskiy, S. Sengupta, S. Yoo, J. M. Zhang, Large language models for software engineering: Survey and open problems, in: 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), 2023, pp. 31–53.doi:10.1109/ICSE-FoSE59343.2023.00008

work page doi:10.1109/icse-fose59343.2023.00008 2023
[16]

D.Zou, J.Liang, Y.Xiong, M.D.Ernst, L.Zhang, Anempiricalstudyof fault localization families and their combinations, IEEE Transactions on Software Engineering 47 (2) (2021) 332–347.doi:10.1109/TSE.2019. 2892102

work page doi:10.1109/tse.2019 2021
[17]

H. Li, Y. Hao, Y. Zhai, Z. Qian, Enhancing static analysis for practical bug detection: An llm-integrated approach, Proc. ACM Program. Lang. 8 (OOPSLA1) (Apr. 2024).doi:10.1145/3649828. URLhttps://doi.org/10.1145/3649828

work page doi:10.1145/3649828 2024
[18]

H. Li, Y. Hao, Y. Zhai, Z. Qian, Assisting static analysis with large lan- guage models: A chatgpt experiment, in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, Association for Computing Machinery, New York, NY, USA, 2023, p. 2107–2111. doi:10.1145/361164...

work page doi:10.1145/3611643.3613078 2023
[19]

Y. Wu, X. Xie, C. Peng, D. Liu, H. Wu, M. Fan, T. Liu, H. Wang, Advscanner: Generating adversarial smart contracts to exploit reen- trancy vulnerabilities using llm and static analysis, in: Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ASE ’24, Association for Computing Machinery, New York, 29 NY, USA, 2024, ...

work page doi:10.1145/3691620.3695482 2024
[20]

P. Nie, R. Banerjee, J. J. Li, R. J. Mooney, M. Gligoric, Learning deep semantics for test completion, in: Proceedings of the 45th International Conference on Software Engineering, ICSE ’23, IEEE Press, 2023, p. 2111–2123.doi:10.1109/ICSE48619.2023.00178. URLhttps://doi.org/10.1109/ICSE48619.2023.00178

work page doi:10.1109/icse48619.2023.00178 2023
[21]

N. Rao, K. Jain, U. Alon, C. L. Goues, V. J. Hellendoorn, Cat- lm training language models on aligned code and tests, in: Proceed- ings of the 38th IEEE/ACM International Conference on Automated Software Engineering, ASE ’23, IEEE Press, 2024, p. 409–420.doi: 10.1109/ASE56229.2023.00193. URLhttps://doi.org/10.1109/ASE56229.2023.00193

work page doi:10.1109/ase56229.2023.00193 2024
[22]

J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, Q. Wang, Software testing with large language models: Survey, landscape, and vision, IEEE Transactions on Software Engineering 50 (4) (2024) 911–936.doi:10. 1109/TSE.2024.3368208

work page arXiv 2024
[23]

Lajko, V

M. Lajko, V. Csuvik, T. Gyimothy, L. Vidacs, Automated program repair with the gpt family, including gpt-2, gpt-3 and codex, in: Pro- ceedings of the 5th ACM/IEEE International Workshop on Automated Program Repair, APR ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 34–41.doi:10.1145/3643788.3648021. URLhttps://doi.org/10.1145/364378...

work page doi:10.1145/3643788.3648021 2024
[24]

Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, M. Monperrus, Sequencer: Sequence-to-sequence learning for end-to- end program repair, IEEE Transactions on Software Engineering 47 (9) (2021) 1943–1959.doi:10.1109/TSE.2019.2940179

work page doi:10.1109/tse.2019.2940179 2021
[25]

J. Zhao, D. Yang, L. Zhang, X. Lian, Z. Yang, F. Liu, Enhancing au- tomated program repair with solution design, in: Proceedings of the 39th IEEE/ACM International Conference on Automated Software En- gineering, ASE ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 1706–1718.doi:10.1145/3691620.3695537. URLhttps://doi.org/10.1145/36916...

work page doi:10.1145/3691620.3695537 2024
[26]

Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, M. Zhou, Codebert: A pre-trained model for program- ming and natural languages (2020).arXiv:2002.08155. URLhttps://arxiv.org/abs/2002.08155

work page internal anchor Pith review Pith/arXiv arXiv 2020
[27]

Q. Guo, J. Cao, X. Xie, S. Liu, X. Li, B. Chen, X. Peng, Exploring the potential of chatgpt in automated code refinement: An empirical study, in: 2024 IEEE/ACM 46th International Conference on Software Engi- neering (ICSE), 2024, pp. 390–402.doi:10.1145/3597503.3623306

work page doi:10.1145/3597503.3623306 2024
[28]

Mashhadi, H

E. Mashhadi, H. Hemmati, Applying codebert for automated program repair of java simple bugs, in: 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), 2021, pp. 505–509. doi:10.1109/MSR52588.2021.00063

work page doi:10.1109/msr52588.2021.00063 2021
[29]

Xiong, W

Z. Xiong, W. Dong, Vuld-codebert: Codebert-based vulnerability de- tection model for c/c++ code, in: 2024 6th International Confer- ence on Communications, Information System and Computer Engi- neering (CISCE), 2024, pp. 914–919.doi:10.1109/CISCE62493.2024. 10653337

work page doi:10.1109/cisce62493.2024 2024
[30]

J. Cao, M. Li, M. Wen, S. chi Cheung, A study on prompt design, advantages and limitations of chatgpt for deep learning program repair (2023).arXiv:2304.08191. URLhttps://arxiv.org/abs/2304.08191

work page arXiv 2023
[31]

Zhang, H

C. Zhang, H. Liu, J. Zeng, K. Yang, Y. Li, H. Li, Prompt-enhanced soft- ware vulnerability detection using chatgpt (2024).arXiv:2308.12697. URLhttps://arxiv.org/abs/2308.12697

work page arXiv 2024
[32]

Baldoni, E

R. Baldoni, E. Coppa, D. C. D’elia, C. Demetrescu, I. Finocchi, A survey of symbolic execution techniques, ACM Comput. Surv. 51 (3) (May 2018).doi:10.1145/3182657. URLhttps://doi.org/10.1145/3182657

work page doi:10.1145/3182657 2018
[33]

Huang, B

Y. Huang, B. Ogles, E. Mercer, A predictive analysis for detecting dead- lock in mpi programs, in: 2020 35th IEEE/ACM International Confer- ence on Automated Software Engineering (ASE), 2020, pp. 18–28

work page 2020
[34]

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large 31 language models, Advances in neural information processing systems 35 (2022) 24824–24837

work page 2022
[35]

Z. Luo, M. Zheng, S. F. Siegel, Verification of mpi programs using civl, in: Proceedings of the 24th European MPI Users’ Group Meeting, Eu- roMPI ’17, Association for Computing Machinery, New York, NY, USA, 2017.doi:10.1145/3127024.3127032. URLhttps://doi.org/10.1145/3127024.3127032

work page doi:10.1145/3127024.3127032 2017
[36]

V. M. Nguyen, E. Saillard, J. Jaeger, D. Barthou, P. Carribault, Par- coach extension for static mpi nonblocking and persistent communica- tion validation, in: 2020 IEEE/ACM 4th International Workshop on Software Correctness for HPC Applications, 2020, pp. 31–39.doi: 10.1109/Correctness51934.2020.00009

work page doi:10.1109/correctness51934.2020.00009 2020
[37]

Zamani, F

H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval- enhanced machine learning, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 2875–2886.doi:10.1145/3477495.3531722. URLhttps://doi.org/10.1145/3...

work page doi:10.1145/3477495.3531722 2022
[38]

Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval , pages =

A. Salemi, H. Zamani, Evaluating retrieval quality in retrieval- augmented generation, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Re- trieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, p. 2395–2400.doi:10.1145/3626772.3657957. URLhttps://doi.org/10.1145/3626772.3657957

work page doi:10.1145/3626772.3657957 2024
[39]

Jacobs, S

S. Jacobs, S. Jaschke, Leveraging lecture content for improved feed- back: Explorations with gpt-4 and retrieval augmented generation, in: 2024 36th International Conference on Software Engineering Ed- ucation and Training (CSEE&amp;T), IEEE, 2024, p. 1–5.doi: 10.1109/cseet62301.2024.10663001. URLhttp://dx.doi.org/10.1109/CSEET62301.2024.10663001

work page doi:10.1109/cseet62301.2024.10663001 2024
[40]

Barnett, S

S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, M. Abdelrazek, Seven failure points when engineering a retrieval augmented generation system (2024).arXiv:2401.05856. URLhttps://arxiv.org/abs/2401.05856 32

work page arXiv 2024
[41]

Sheng, Z

Z. Sheng, Z. Chen, S. Gu, H. Huang, G. Gu, J. Huang, Llms in software security: A survey of vulnerability detection techniques and insights, ACM Comput. Surv. 58 (5) (Nov. 2025).doi:10.1145/3769082. URLhttps://doi.org/10.1145/3769082

work page doi:10.1145/3769082 2025
[42]

Code Llama: Open Foundation Models for Code

B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, G. Synnaeve, Code llama: Open foundation models for code (2024). arXiv:...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, K. Dang, Y. Fan, Y. Zhang, A. Yang, R. Men, F. Huang, B. Zheng, Y. Miao, S. Quan, Y. Feng, X. Ren, X. Ren, J. Zhou, J. Lin, Qwen2.5-coder technical report (2024).arXiv:2409.12186. URLhttps://arxiv.org/abs/2409.12186

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Llama 2: Open Foundation and Fine-Tuned Chat Models

H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hos- seini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korene...

work page internal anchor Pith review Pith/arXiv arXiv 2023