Prompt-Driven Code Summarization: A Systematic Literature Review
Pith reviewed 2026-05-10 11:30 UTC · model grok-4.3
The pith
A review of prompting techniques shows that LLMs can generate better code summaries, but optimal strategies and evaluation practices remain unclear across studies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that prompting paradigms such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning demonstrate promise for enhancing LLM performance on code summarization, yet existing research remains fragmented with limited insight into the best strategies for specific models and contexts, and most evaluations depend on overlap-based metrics that may fail to reflect semantic quality.
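The paradigms the claim names differ mainly in how the prompt is assembled. A minimal sketch (illustrative prompt templates, not any surveyed study's actual ones) contrasting zero-shot and few-shot prompt construction for code summarization:

```python
def zero_shot_prompt(code: str) -> str:
    """Zero-shot: a bare instruction, no demonstrations."""
    return f"Summarize the following function in one sentence:\n\n{code}\n\nSummary:"

def few_shot_prompt(code: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend (code, summary) demonstrations before the query."""
    parts = ["Summarize each function in one sentence."]
    for ex_code, ex_summary in examples:
        parts.append(f"Code:\n{ex_code}\nSummary: {ex_summary}")
    parts.append(f"Code:\n{code}\nSummary:")  # model completes after the final "Summary:"
    return "\n\n".join(parts)

demo = [("def add(a, b):\n    return a + b",
         "Adds two numbers and returns the result.")]
print(few_shot_prompt("def is_even(n):\n    return n % 2 == 0", demo))
```

Chain-of-thought and retrieval-augmented variants extend the same skeleton: the former adds a reasoning instruction before the answer, while the latter fills the demonstration slots with retrieved similar code–summary pairs rather than fixed ones.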
What carries the argument
Categorization of prompting paradigms (few-shot, zero-shot, chain-of-thought, retrieval-augmented) combined with cross-study analysis of their effectiveness and evaluation practices in LLM-driven code summarization.
If this is right
- Researchers need targeted experiments to identify conditions under which particular prompting strategies outperform others for different models.
- Development of metrics that capture semantic quality beyond simple overlap measures would improve assessment of summary usefulness.
- Clearer guidelines on prompt design could support more consistent integration of automated summarization into developer workflows.
- Filling identified gaps would reduce reliance on manual documentation and aid tasks such as code review and maintenance.
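The metrics gap noted in the second bullet can be made concrete with a toy overlap score. A minimal sketch (a crude stand-in for BLEU/ROUGE-style metrics, not any study's actual metric) showing how a semantically equivalent paraphrase scores poorly on pure token overlap:

```python
def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1: a crude stand-in for overlap-based metrics."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    if not cand or not ref:
        return 0.0
    overlap = len(cand & ref)
    precision, recall = overlap / len(cand), overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

reference  = "returns the sum of two numbers"
verbatim   = "returns the sum of two numbers"
paraphrase = "adds a pair of values together"   # same meaning, almost no shared words

print(unigram_f1(verbatim, reference))    # 1.0
print(unigram_f1(paraphrase, reference))  # ≈ 0.17 despite equivalent meaning
```

Only the verbatim copy scores 1.0, while the equivalent paraphrase scores near zero, which is exactly the failure mode the review flags when it says overlap-based metrics may not capture semantic quality.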
Where Pith is reading between the lines
- If prompting best practices become standardized, integrated tools in development environments could produce more reliable documentation at scale.
- Better code summaries from optimized prompts may indirectly improve accuracy in downstream applications like defect localization or commit message generation.
- The current fragmentation points to a need for shared benchmark datasets that test prompting strategies across varied codebases and languages.
Load-bearing premise
The collected studies form a representative sample of the field and the chosen categorization of prompting paradigms accurately reflects the underlying technical distinctions without significant selection or reporting bias.
What would settle it
A new comprehensive survey that incorporates overlooked recent papers and finds either consistent superiority of one prompting method across models or evaluation results that align closely on semantic quality would contradict the fragmentation and limited-understanding conclusions.
Original abstract
Software documentation is essential for program comprehension, developer onboarding, code review, and long-term maintenance. Yet producing quality documentation manually is time-consuming and frequently yields incomplete or inconsistent results. Large language models (LLMs) offer a promising solution by automatically generating natural language descriptions from source code, helping developers understand code more efficiently, facilitating maintenance, and supporting downstream activities such as defect localization and commit message generation. However, the effectiveness of LLMs in documentation tasks critically depends on how they are prompted. Properly structured instructions can substantially improve model performance, making prompt engineering (the design of input prompts to guide model behavior) a foundational technique in LLM-based software engineering. Approaches such as few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning show promise for code summarization, yet current research remains fragmented. There is limited understanding of which prompting strategies work best, for which models, and under what conditions. Moreover, evaluation practices vary widely, with most studies relying on overlap-based metrics that may not capture semantic quality. This systematic literature review consolidates existing evidence, categorizes prompting paradigms, examines their effectiveness, and identifies gaps to guide future research and practical adoption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper presents a systematic literature review on prompt-driven code summarization with large language models. It consolidates evidence on prompting paradigms (few-shot prompting, chain-of-thought reasoning, retrieval-augmented generation, and zero-shot learning), examines their effectiveness for generating natural language code descriptions, notes that research remains fragmented with limited understanding of optimal strategies/models/conditions, critiques reliance on overlap-based evaluation metrics, and identifies gaps to guide future work.
Significance. If methodologically rigorous, this SLR would be a useful consolidation in an active area of LLM applications for software engineering. It would help clarify promising prompting directions, surface evaluation weaknesses, and reduce fragmentation by mapping what is known about prompt effectiveness for code summarization tasks that support comprehension, maintenance, and downstream activities.
Major comments (1)
- [Methodology section] The central synthesis (that prompting approaches show promise yet research remains fragmented, with limited understanding of the best strategies) depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting the underlying technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality-assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.
Minor comments (1)
- [Abstract] The claim that most studies rely on overlap-based metrics would be strengthened by stating the total number of primary studies reviewed and the time period covered.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our systematic literature review. We address the major comment on methodology below and will revise the manuscript accordingly to improve transparency and verifiability.
Point-by-point responses
- Referee: [Methodology section] The central synthesis (that prompting approaches show promise yet research remains fragmented, with limited understanding of the best strategies) depends on the collected studies forming a representative sample and on the chosen categorization (few-shot, CoT, RAG, zero-shot) accurately reflecting the underlying technical distinctions without selection or reporting bias. The abstract outlines scope but provides no explicit search protocol, databases, inclusion/exclusion criteria, or quality-assessment details, leaving these load-bearing elements unverified and risking incomplete coverage or forced groupings that undermine the gap analysis.
Authors: We agree that the abstract does not explicitly detail the search protocol, databases, inclusion/exclusion criteria, or quality assessment, which limits immediate verifiability for readers. The full manuscript contains a dedicated Methodology section (Section 3) that outlines a structured search across IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink, and arXiv using predefined keyword strings (detailed in Appendix A), with inclusion criteria limited to empirical studies (2020 onward) evaluating LLM prompting for code summarization, and exclusion of non-empirical or non-English works. Quality assessment used a modified Kitchenham checklist with reported inter-rater reliability. The four-category taxonomy was derived from the primary technique reported in each study, with dual-author independent coding and consensus resolution to mitigate bias. We acknowledge, however, that these elements could be presented more explicitly, with a PRISMA diagram to strengthen the synthesis. We will revise the abstract to include a concise methodology summary and expand Section 3 with additional justification for the categorization scheme and search coverage.
Revision: yes
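The inter-rater reliability mentioned in the rebuttal is commonly reported as Cohen's kappa over the two coders' independent labels. A minimal self-contained sketch (the paradigm labels below are illustrative, not the paper's data):

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Proportion of items the two raters labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical paradigm labels assigned independently by two coders.
rater1 = ["few-shot", "cot", "rag", "zero-shot", "cot", "few-shot"]
rater2 = ["few-shot", "cot", "rag", "zero-shot", "rag", "few-shot"]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.778
```

Values above roughly 0.6 are conventionally read as substantial agreement, which is the kind of figure a revised Section 3 would be expected to report alongside the consensus-resolution procedure.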
Circularity Check
No circularity: the SLR aggregates external studies without self-referential derivations.
Full rationale
This systematic literature review consolidates evidence from external papers on prompting strategies for code summarization. It contains no mathematical derivations, fitted parameters, predictions, or uniqueness theorems that reduce to the paper's own inputs by construction. Claims about fragmentation and promising approaches (few-shot, CoT, RAG, zero-shot) are synthesized from cited literature rather than defined or forced by the review's own categorization or search process. No self-citation chains or ansatzes are load-bearing for the central synthesis. The paper is self-contained against external benchmarks as a review.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard systematic literature review methodology (search strategy, inclusion criteria, quality assessment) is sufficient to consolidate evidence without major bias.