
arxiv: 2607.01425 · v1 · pith:CZHVXGD3new · submitted 2026-07-01 · 💻 cs.AI

Agent4cs: A Multi-agent System for Code Summarization in Large Hierarchical Codebases

Pith reviewed 2026-07-03 20:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords code summarization · multi-agent systems · hierarchical codebases · large language models · semantic consistency · keyword coverage · software documentation · bottom-up processing

The pith

A three-agent system summarizes large codebases bottom-up and raises semantic consistency 8 percent on average while lifting keyword coverage up to 38 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agent4cs as a multi-agent framework that processes code repositories from the bottom folder level upward instead of treating all source files as flat text. One agent generates summaries, a second extracts key terms from subfolders, and a third checks and refines the output for coherence. When tested across seven frontier models on real-world datasets, the method produced summaries with higher consistency at every folder depth and greater coverage of important terms than two structured-prompting baselines. If the gains hold, developers could obtain usable documentation for large, poorly documented codebases with less manual effort.

Core claim

Agent4cs applies three specialized agents in a bottom-up hierarchy: a summarization agent creates folder-level descriptions, a keyword-extraction agent surfaces critical information from child folders, and a quality-assurance agent iterates on readability and completeness. This structure yields an average 8 percent rise in semantic consistency across all folder levels and up to 38 percent higher normalized keyword coverage compared with structured prompting baselines that also receive code segments.
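
The bottom-up flow described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Folder` type, the three agent signatures, the `max_rounds` cap, and the regex-based stub agents are all hypothetical stand-ins for what would be LLM calls in the real system.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Folder:
    """A node in the repository tree (hypothetical structure for illustration)."""
    name: str
    files: Dict[str, str] = field(default_factory=dict)    # file name -> source text
    children: List["Folder"] = field(default_factory=list)


# Each agent is a plain callable here; in practice each would wrap an LLM call.
Summarizer = Callable[[str, List[str], List[str]], str]   # (code, child summaries, keywords) -> draft
KeywordExtractor = Callable[[str], List[str]]              # child summary -> key terms
Reviewer = Callable[[str], str]                            # draft -> refined draft


def summarize_bottom_up(root: Folder, summarize: Summarizer,
                        extract_keywords: KeywordExtractor, review: Reviewer,
                        max_rounds: int = 2) -> Dict[str, str]:
    """Post-order traversal: every child folder is summarized before its parent."""
    summaries: Dict[str, str] = {}

    def visit(folder: Folder, prefix: str) -> str:
        path = f"{prefix}/{folder.name}" if prefix else folder.name
        child_summaries = [visit(child, path) for child in folder.children]
        keywords = [kw for s in child_summaries for kw in extract_keywords(s)]
        code = "\n".join(folder.files[name] for name in sorted(folder.files))
        draft = summarize(code, child_summaries, keywords)
        # Quality-assurance loop: stop early once the reviewer makes no change.
        for _ in range(max_rounds):
            refined = review(draft)
            if refined == draft:
                break
            draft = refined
        summaries[path] = draft
        return draft

    visit(root, "")
    return summaries


# Deterministic stub agents so the sketch runs without any model access.
def stub_summarize(code: str, child_summaries: List[str], keywords: List[str]) -> str:
    names = re.findall(r"def (\w+)", code)
    text = "defines " + ", ".join(names) if names else "no functions"
    if keywords:
        text += "; builds on " + ", ".join(sorted(set(keywords)))
    return text


def stub_keywords(summary: str) -> List[str]:
    match = re.match(r"defines ([\w, ]+)", summary)
    return match.group(1).split(", ") if match else []


def stub_review(draft: str) -> str:
    return draft if draft.endswith(".") else draft + "."


repo = Folder("repo", files={"main.py": "def main():\n    pass\n"},
              children=[Folder("utils", files={"io.py": "def read_config():\n    pass\n"})])
summaries = summarize_bottom_up(repo, stub_summarize, stub_keywords, stub_review)
for path, text in summaries.items():
    print(path, "->", text)
```

Because children are visited before their parent, each parent-level prompt can draw on child summaries and extracted keywords instead of the raw code of the whole subtree.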

What carries the argument

The three-agent division of labor (summarization agent, keyword-extraction agent, and quality-assurance agent), arranged in bottom-up order over the folder hierarchy.

If this is right

  • Summaries maintain higher semantic consistency at every level of a repository's folder tree.
  • Normalized keyword coverage improves by as much as 38 percent on real-world code collections.
  • The same agent arrangement works across seven different frontier models without per-model changes.
  • The method handles repositories that contain obfuscated structure or missing documentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bottom-up agent pattern could be applied to other hierarchical artifacts such as documentation trees or data catalogs.
  • Teams maintaining large codebases might reduce manual review time if the generated summaries prove reliable enough to serve as first drafts.
  • Further tests could check whether adding a fourth agent for cross-folder dependency detection produces additional gains.

Load-bearing premise

The measured gains arise from the specific division into three agents and the bottom-up folder traversal rather than from simply issuing more language-model calls or supplying longer contexts.

What would settle it

An experiment that keeps total language-model calls and context length constant but collapses the three agents into one prompt and shows no remaining improvement over the baselines.

Figures

Figures reproduced from arXiv: 2607.01425 by Doruk Tuncel, Ezgi Sarikayak, Jie M. Zhang, Thomas Runkler, Yongjian Tang.

Figure 1. Combined with a bottom-up approach for hierarchical …
Figure 1. Agent4cs performs hierarchical repository summarization in two stages: function-level summarization and hierarchical …
Figure 2. Overview of a repository: hierarchical folder struc…
Figure 3. An example of code obfuscation through identifier …
Figure 5. An illustrative example showing the hierarchical …
Figure 4. The prompts used to summarize the function-level …
Figure 6. The prompts used to summarize a parent folder …
Figure 7. The prompt for LLM-as-a-judge. …schema used in LLM-as-a-judge to ensure consistent evaluation. Their evaluation results served as valuable reference in our experimental context. For hierarchical summary evaluation, no open-source datasets provide groundtruth annotations for folder-level summaries, making traditional reference-based metrics inapplicable. Consequently, we define the following reference-free…
Figure 8. The correlation heatmap between human and LLM …
Figure 11. Average Flesch reading-ease scores for summaries …
Figure 10. The normalized keyword coverage rate considering …

Original abstract

Understanding large, complex codebases, especially those with obfuscated structures and incomplete documentation, remains a significant challenge. Existing code summarization solutions often rely on a single language model or coding assistant like Claude Code, and treat source code as flat text, underutilizing the rich interdependencies and hierarchical information within a repository. To address these shortcomings, we propose Agent4cs, a multi-agent framework that summarizes large codebases in a bottom-up fashion, where a summarization agent focuses on producing robust summaries; a keyword-extraction agent proactively identifies critical information from subfolders; and a quality-assurance agent iteratively refines the outputs for readability, coherence, and completeness. Evaluated on 7 frontier models, Agent4cs improves semantic consistency across all folder levels by an average of 8% compared to two structured prompting baselines with code segments. Furthermore, extensive evaluation on real-world datasets demonstrates up to 38% gains in normalized keyword coverage rate over the same baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes Agent4cs, a multi-agent framework for bottom-up summarization of large hierarchical codebases. It deploys three agents (summarization, keyword-extraction, and quality-assurance) that operate on folder structure to produce summaries, extract keywords, and iteratively refine outputs. The central empirical claim is that, across 7 frontier models and real-world datasets, Agent4cs yields an average 8% improvement in semantic consistency across folder levels and up to 38% gains in normalized keyword coverage rate relative to two structured prompting baselines that also receive code segments.

Significance. If the reported gains prove robust and causally attributable to the three-agent hierarchical design rather than increased inference budget, the work would offer a concrete demonstration that specialized multi-agent decomposition can improve LLM-based understanding of complex repositories. The evaluation on multiple models is a positive feature; however, the absence of controls that isolate the architectural contribution weakens the interpretability of the results.

major comments (1)
  1. [Evaluation section] The central claim attributes the 8% semantic-consistency and 38% keyword-coverage improvements to the specific three-agent bottom-up design. The comparisons are made only against 'structured prompting baselines with code segments' and supply no data on total LLM calls, total tokens consumed, or number of refinement iterations used by Agent4cs versus the baselines. Without such controls, the observed deltas cannot be isolated from the possibility that Agent4cs simply expends more inference budget; this is load-bearing for the causal interpretation of the architecture.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below.

point-by-point responses
  1. Referee: [Evaluation section] The central claim attributes the 8% semantic-consistency and 38% keyword-coverage improvements to the specific three-agent bottom-up design. The comparisons are made only against 'structured prompting baselines with code segments' and supply no data on total LLM calls, total tokens consumed, or number of refinement iterations used by Agent4cs versus the baselines. Without such controls, the observed deltas cannot be isolated from the possibility that Agent4cs simply expends more inference budget; this is load-bearing for the causal interpretation of the architecture.

    Authors: We agree that the absence of reported inference-budget metrics limits the strength of the causal claim. The manuscript currently provides no data on LLM calls, token consumption, or refinement iterations for Agent4cs versus the structured-prompting baselines, so the observed gains cannot be isolated from possible differences in total compute. In the revised manuscript we will add a new table and accompanying text in the Evaluation section that reports, for each of the seven models and both datasets: (i) average number of LLM calls per top-level summary, (ii) total tokens consumed, and (iii) number of quality-assurance refinement rounds. These figures will be collected under identical model and temperature settings for all methods, enabling readers to assess whether the reported improvements exceed what would be expected from increased inference budget alone.
    revision: yes
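
The kind of inference-budget accounting requested here could be collected with a thin wrapper around whatever function issues the model call. The sketch below is an assumption-laden illustration, not tooling from the paper: `BudgetTracker`, `fake_llm`, and the prompts are hypothetical, and the whitespace word count is only a crude proxy for a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BudgetTracker:
    """Counts calls and approximate tokens for one method under test."""
    calls: int = 0
    tokens: int = 0

    def wrap(self, llm: Callable[[str], str]) -> Callable[[str], str]:
        def tracked(prompt: str) -> str:
            reply = llm(prompt)
            self.calls += 1
            # Assumption: whitespace-separated words stand in for tokens.
            self.tokens += len(prompt.split()) + len(reply.split())
            return reply
        return tracked


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "ok done"


# Single-prompt baseline: one call per folder.
baseline = BudgetTracker()
single = baseline.wrap(fake_llm)
single("summarize folder a b")

# Multi-agent pipeline: one call per agent role for the same folder.
multi = BudgetTracker()
agent = multi.wrap(fake_llm)
for prompt in ("summarize folder a b", "extract keywords", "review draft"):
    agent(prompt)

print("baseline:", baseline.calls, "calls,", baseline.tokens, "tokens")
print("multi-agent:", multi.calls, "calls,", multi.tokens, "tokens")
```

Reporting these two counters side by side for each method is what would let a reader judge whether a quality gain outpaces the extra budget spent.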

Circularity Check

0 steps flagged

No circularity; purely empirical multi-agent system evaluation

full rationale

The paper describes a multi-agent framework (summarization, keyword-extraction, and QA agents) for hierarchical code summarization and reports empirical gains (8% semantic consistency, 38% keyword coverage) against structured prompting baselines on 7 models and real-world datasets. No equations, parameters, or derivations are present. All claims rest on direct external comparisons rather than any self-referential fitting, self-citation chains, or renamings that reduce to inputs by construction. The work is therefore self-contained with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unstated premise that the chosen automatic metrics (semantic consistency and normalized keyword coverage) are valid proxies for summary quality and that the evaluation protocol was not tuned post-hoc to favor the proposed system.

axioms (1)
  • domain assumption: Automatic metrics such as semantic consistency and keyword coverage correlate with human judgments of summary usefulness.
    The paper treats the 8% and 38% improvements as meaningful without providing human evaluation or correlation studies.
invented entities (1)
  • Summarization agent, keyword-extraction agent, quality-assurance agent (no independent evidence)
    purpose: Specialized roles that together produce hierarchical summaries
    These are newly proposed components whose independent value is asserted by the performance numbers rather than shown by external evidence.
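
To make the keyword-coverage proxy discussed in this ledger concrete, here is one simplified way such a score could be computed. The definition is an assumption for illustration only: this page does not give the paper's formula or its normalization, and `keyword_coverage` is a hypothetical helper that handles single-token keywords only.

```python
import re
from typing import Iterable


def keyword_coverage(summary: str, keywords: Iterable[str]) -> float:
    """Fraction of distinct keywords found as whole words in the summary.

    Matching is case-insensitive. Returns 0.0 when no keywords are given.
    """
    wanted = {k.strip().lower() for k in keywords if k.strip()}
    if not wanted:
        return 0.0
    words = set(re.findall(r"[a-z0-9_]+", summary.lower()))
    return sum(1 for k in wanted if k in words) / len(wanted)


score = keyword_coverage(
    "Parses config files and opens sockets",
    ["config", "sockets", "retry", "Config"],
)
print(round(score, 3))
```

A score like this rewards lexical overlap only, which is why the ledger treats its correlation with human judgments of usefulness as an assumption rather than a given.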

pith-pipeline@v0.9.1-grok · 5713 in / 1433 out tokens · 19579 ms · 2026-07-03T20:26:08.380187+00:00 · methodology

