Prompt-Driven Code Summarization: A Systematic Literature Review
Pith reviewed 2026-05-10 11:30 UTC · model grok-4.3
The pith
A review of prompting techniques shows that LLMs can generate better code summaries, but optimal strategies and evaluation practices remain unclear across studies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompting paradigms such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning demonstrate promise for enhancing LLM performance on code summarization, yet existing research remains fragmented with limited insight into the best strategies for specific models and contexts, and most evaluations depend on overlap-based metrics that may fail to reflect semantic quality.
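The paradigms the claim names differ mainly in how the prompt is assembled. A minimal sketch (illustrative prompt templates, not any surveyed study's actual ones) contrasting zero-shot and few-shot prompt construction for code summarization:

```python
def zero_shot_prompt(code: str) -> str:
    """Zero-shot: a bare instruction, no demonstrations."""
    return f"Summarize the following function in one sentence:\n\n{code}\n\nSummary:"

def few_shot_prompt(code: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend (code, summary) demonstrations before the query."""
    parts = ["Summarize each function in one sentence."]
    for ex_code, ex_summary in examples:
        parts.append(f"Code:\n{ex_code}\nSummary: {ex_summary}")
    parts.append(f"Code:\n{code}\nSummary:")  # model completes after the final "Summary:"
    return "\n\n".join(parts)

demo = [("def add(a, b):\n    return a + b",
         "Adds two numbers and returns the result.")]
print(few_shot_prompt("def is_even(n):\n    return n % 2 == 0", demo))
```

Chain-of-thought and retrieval-augmented variants extend the same skeleton: the former adds a reasoning instruction before the answer, while the latter fills the demonstration slots with retrieved similar code–summary pairs rather than fixed ones.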
What carries the argument
Categorization of prompting paradigms (few-shot, zero-shot, chain-of-thought, retrieval-augmented) combined with cross-study analysis of their effectiveness and evaluation practices in LLM-driven code summarization.
If this is right
- Researchers need targeted experiments to identify conditions under which particular prompting strategies outperform others for different models.
- Development of metrics that capture semantic quality beyond simple overlap measures would improve assessment of summary usefulness.
- Clearer guidelines on prompt design could support more consistent integration of automated summarization into developer workflows.
- Filling identified gaps would reduce reliance on manual documentation and aid tasks such as code review and maintenance.
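The metrics gap noted in the second bullet can be made concrete with a toy overlap score. A minimal sketch (a crude stand-in for BLEU/ROUGE-style metrics, not any study's actual metric) showing how a semantically equivalent paraphrase scores poorly on pure token overlap:

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a crude stand-in for overlap-based metrics."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision, recall = overlap / len(cand), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference  = "returns the sum of two numbers"
verbatim   = "returns the sum of two numbers"
paraphrase = "adds a pair of values together"   # same meaning, almost no shared words

print(unigram_f1(verbatim, reference))    # 1.0
print(unigram_f1(paraphrase, reference))  # ≈ 0.17 despite equivalent meaning
```

Only the verbatim copy scores 1.0, while the equivalent paraphrase scores near zero, which is exactly the failure mode the review flags when it says overlap-based metrics may not capture semantic quality.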
Where Pith is reading between the lines
- If prompting best practices become standardized, integrated tools in development environments could produce more reliable documentation at scale.
- Better code summaries from optimized prompts may indirectly improve accuracy in downstream applications like defect localization or commit message generation.
- The current fragmentation points to a need for shared benchmark datasets that test prompting strategies across varied codebases and languages.
Load-bearing premise
The collected studies form a representative sample of the field and the chosen categorization of prompting paradigms accurately reflects the underlying technical distinctions without significant selection or reporting bias.
What would settle it
A new comprehensive survey that incorporates overlooked recent papers and finds either consistent superiority of one prompting method across models or evaluation results that align closely on semantic quality would contradict the fragmentation and limited-understanding conclusions.
Original abstract
Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering (the design of input prompts to guide model behavior) a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper presents a systematic literature review on prompt-driven code summarization with large language models. It consolidates evidence on prompting paradigms (few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning), examines their effectiveness for generating natural language code descriptions, notes that research remains fragmented with limited understanding of optimal strategies/models/conditions, critiques reliance on overlap-based evaluation metrics, and identifies gaps to guide future work.
Significance. If methodologically rigorous, this SLR would be a useful consolidation in an active area of LLM applications for software engineering. It would help clarify promising prompting directions, surface evaluation weaknesses, and reduce fragmentation by mapping what is known about prompt effectiveness for code summarization tasks that support comprehension, maintenance, and downstream activities.
Major comments (1)
- [Methodology section] The central synthesis (that prompting approaches show promise yet research remains fragmented, with limited understanding of the best strategies) depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting the underlying technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality-assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.
Minor comments (1)
- [Abstract] The claim that most studies rely on overlap-based metrics would be strengthened by stating the total number of primary studies reviewed and the time period covered.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our systematic literature review. We address the major comment on methodology below and will revise the manuscript accordingly to improve transparency and verifiability.
Point-by-point responses
- Referee: [Methodology section] The central synthesis (that prompting approaches show promise yet research remains fragmented, with limited understanding of the best strategies) depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting the underlying technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality-assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.
Authors: We agree that the abstract does not explicitly detail the search protocol, databases, inclusion/exclusion criteria, or quality assessment, which limits immediate verifiability for readers. The full manuscript contains a dedicated Methodology section (Section 3) that outlines a structured search across IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink, and arXiv using predefined keyword strings (detailed in Appendix A), with inclusion criteria limited to empirical studies (2020 onward) evaluating LLM prompting for code summarization, and exclusion of non-empirical or non-English works. Quality assessment used a modified Kitchenham checklist with reported inter-rater reliability. The four-category taxonomy was derived from the primary technique reported in each study, with dual-author independent coding and consensus resolution to mitigate bias. We acknowledge, however, that these elements could be presented more explicitly, with a PRISMA diagram to strengthen the synthesis. We will revise the abstract to include a concise methodology summary and expand Section 3 with additional justification for the categorization scheme and search coverage.
Revision: yes
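The inter-rater reliability mentioned in the rebuttal is commonly reported as Cohen's kappa over the two coders' independent labels. A minimal self-contained sketch (the paradigm labels below are illustrative, not the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Proportion of items the two raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical paradigm labels assigned independently by two coders.
rater1 = ["few-shot", "cot", "rag", "zero-shot", "cot", "few-shot"]
rater2 = ["few-shot", "cot", "rag", "zero-shot", "rag", "few-shot"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.778
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of figure a revised Section 3 would be expected to report alongside the consensus-resolution procedure.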
Circularity Check
No circularity: the SLR aggregates external studies without self-referential derivations.
Full rationale
This systematic literature review consolidates evidence from external papers on prompting strategies for code summarization. It contains no mathematical derivations, fitted parameters, predictions, or uniqueness theorems that reduce to the paper's own inputs by construction. Claims about fragmentation and promising approaches (few-shot, CoT, RAG, zero-shot) are synthesized from cited literature rather than defined or forced by the review's own categorization or search process. No self-citation chains or ansatzes are load-bearing for the central synthesis. The paper is self-contained against external benchmarks as a review.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard systematic literature review methodology (search strategy, inclusion criteria, quality assessment) is sufficient to consolidate evidence without major bias.