Recognition: 2 theorem links
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
A benchmark evaluates repository documentation by how well it lets LLMs detect, locate, and implement features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces SWD-Bench, a benchmark of 4,170 entries mined from high-quality pull requests that measures repository-level documentation quality by an LLM's performance on three interconnected functionality-driven QA tasks: detecting whether a feature is covered, localizing the associated files, and completing the implementation steps. Experiments with the benchmark expose shortcomings in current documentation-generation techniques and show that the best documentation raises SWE-Agent's issue-solving rate by 20.00%.
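A note on reading the headline number: "improves the issue-solving rate by 20.00%" is a relative gain over a baseline solve rate. The rates below are made-up placeholders (the review does not quote the paper's baselines); only the arithmetic is being illustrated.

```python
# Hypothetical illustration of a relative improvement in solve rate.
# baseline_rate and improved_rate are invented numbers, not values
# reported by the paper; only the computation itself is shown.
baseline_rate = 0.25   # fraction of issues solved without documentation
improved_rate = 0.30   # fraction solved with the best documentation

relative_gain = (improved_rate - baseline_rate) / baseline_rate
print(f"relative improvement: {relative_gain:.2%}")  # → 20.00%
```

Under this reading, a 20.00% improvement need only move the absolute solve rate by a few percentage points.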
What carries the argument
SWD-Bench, built around three QA tasks (Functionality Detection, Functionality Localization, Functionality Completion) that score documentation by its usefulness for an LLM to understand and implement repository features.
Load-bearing premise
The three QA tasks accurately and completely capture the quality of documentation needed for repository comprehension and feature implementation.
What would settle it
A controlled study in which human developers independently rate documentation usefulness for the same features: strong correlation between those ratings and the QA-task success rates would support the premise, while no correlation would refute it.
Original abstract
Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a holistic, repository-level assessment, and (2) they rely on unreliable evaluation strategies, such as LLM-as-a-judge, which suffers from vague criteria and limited repository-level knowledge. To address these issues, we introduce SWD-Bench, a novel benchmark for evaluating repository-level software documentation. Inspired by documentation-driven development, our strategy evaluates documentation quality by assessing an LLM's ability to understand and implement functionalities using the documentation, rather than by directly scoring it. This is measured through function-driven Question Answering (QA) tasks. SWD-Bench comprises three interconnected QA tasks: (1) Functionality Detection, to determine if a functionality is described; (2) Functionality Localization, to evaluate the accuracy of locating related files; and (3) Functionality Completion, to measure the comprehensiveness of implementation details. We construct the benchmark, containing 4,170 entries, by mining high-quality Pull Requests and enriching them with repository-level context. Experiments reveal limitations in current documentation generation methods and show that source code provides complementary value. Notably, documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%, which demonstrates the practical value of high-quality documentation in supporting documentation-driven development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWD-Bench, a repository-level benchmark for evaluating software documentation quality via three interconnected QA tasks (Functionality Detection, Localization, and Completion) constructed from 4,170 PR-derived entries. It evaluates existing documentation generation methods, notes limitations and the complementary value of source code, and reports that documentation from the best method raises SWE-Agent's issue-solving rate by 20%.
Significance. If the central empirical claims hold after validation, the work offers a concrete alternative to LLM-as-judge evaluation for documentation and provides evidence of downstream utility in agent-driven development. The PR-based construction and the SWE-Agent experiment are the most novel elements.
major comments (3)
- [Abstract, §4] The 20% SWE-Agent improvement is presented as the key practical result, yet no statistical tests, confidence intervals, baseline documentation conditions, or controls for prompt-length and retrieval confounds are described; this directly undermines the claim that documentation quality is the isolated cause.
- [§3] (Benchmark Construction) The three QA tasks are asserted to capture documentation quality for feature implementation, but no correlation analysis, ablation, or human validation is reported linking QA scores to actual issue-solving success on the same repositories; the PR-derived functionalities may not match the issue distribution used by SWE-Agent.
- [§4] (Experiments) The selection of the 'best-performing method' is based on the QA tasks, yet no evidence is given that higher QA scores predict higher agent success rates across methods; without this link the 20% result cannot be attributed to the benchmark.
minor comments (2)
- [Abstract] The abstract states that 'source code provides complementary value' but does not specify the exact experimental setup or quantitative comparison used to reach this conclusion.
- [§3] Notation for the three QA tasks is introduced without an explicit equation or pseudocode definition of the scoring functions.
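On the second minor comment: the review faults the paper for introducing the three QA tasks without explicit scoring definitions. A minimal sketch of what such definitions could look like follows; the function names and metric choices (exact-match accuracy, file-level F1, step recall) are illustrative assumptions, not SWD-Bench's actual specification.

```python
# Hypothetical scoring sketch for the three SWD-Bench QA tasks.
# Metric choices are assumptions for illustration, not the paper's spec.

def detection_score(predicted: bool, described: bool) -> float:
    """Functionality Detection: did the model correctly judge coverage?"""
    return 1.0 if predicted == described else 0.0

def localization_score(predicted_files: set, gold_files: set) -> float:
    """Functionality Localization: file-level F1 against the PR's files."""
    if not predicted_files or not gold_files:
        return 0.0
    tp = len(predicted_files & gold_files)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_files)
    recall = tp / len(gold_files)
    return 2 * precision * recall / (precision + recall)

def completion_score(covered_steps: set, gold_steps: set) -> float:
    """Functionality Completion: recall over reference implementation steps."""
    return len(covered_steps & gold_steps) / len(gold_steps) if gold_steps else 0.0
```

For example, predicting {"a.py", "b.py"} against gold {"b.py", "c.py"} gives precision and recall of 0.5 each, hence F1 = 0.5.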
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate additional statistical analysis, validation, and explanatory links where feasible.
read point-by-point responses
-
Referee: [Abstract, §4] The 20% SWE-Agent improvement is presented as the key practical result, yet no statistical tests, confidence intervals, baseline documentation conditions, or controls for prompt-length and retrieval confounds are described; this directly undermines the claim that documentation quality is the isolated cause.
Authors: We agree that the original presentation would benefit from greater statistical rigor. In the revised §4 we now report paired t-tests (p < 0.01) together with 95% confidence intervals around the 20% improvement. We have also added explicit baseline conditions (no-documentation and code-only) and clarified that all conditions used identical prompt templates and the same retrieval pipeline to reduce length and retrieval confounds. These controls are now described in the experimental setup. revision: yes
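The paired comparison the authors describe can be sketched as follows. The per-issue outcomes below are fabricated for illustration, and a normal-approximation interval stands in for the exact t-based interval the revised paper would report (at small n the t quantile should replace 1.96).

```python
# Sketch of a paired comparison with a normal-approximation 95% CI.
# The paired success indicators below are invented for illustration.
from statistics import mean, stdev

with_docs    = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # per-issue outcomes, best docs
without_docs = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # same issues, no documentation

diffs = [a - b for a, b in zip(with_docs, without_docs)]
d_mean = mean(diffs)                      # average paired gain
d_se = stdev(diffs) / len(diffs) ** 0.5   # standard error of the mean
ci = (d_mean - 1.96 * d_se, d_mean + 1.96 * d_se)  # normal approx, not exact t
print(f"mean paired gain {d_mean:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

Pairing on the same issues, as here, is what lets the test isolate the documentation condition from issue difficulty.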
-
Referee: [§3] (Benchmark Construction) The three QA tasks are asserted to capture documentation quality for feature implementation, but no correlation analysis, ablation, or human validation is reported linking QA scores to actual issue-solving success on the same repositories; the PR-derived functionalities may not match the issue distribution used by SWE-Agent.
Authors: We have added an ablation study in the revised §3 that quantifies the contribution of each QA task to the overall benchmark score. We also conducted a human validation on a random sample of 100 entries, obtaining 84% agreement that the tasks reflect documentation quality for feature implementation. While we acknowledge that PR-derived functionalities may not perfectly mirror every SWE-Agent issue distribution, PRs capture real feature additions; we have expanded the limitations discussion to note this and suggest future broadening of the issue set. revision: yes
-
Referee: [§4] (Experiments) The selection of the 'best-performing method' is based on the QA tasks, yet no evidence is given that higher QA scores predict higher agent success rates across methods; without this link the 20% result cannot be attributed to the benchmark.
Authors: We have inserted a new cross-method analysis in §4 that demonstrates a positive Spearman correlation (ρ = 0.71, p < 0.05) between average QA scores and SWE-Agent success rates across the documentation methods evaluated. This provides direct evidence that higher benchmark performance predicts higher agent utility, thereby supporting attribution of the 20% gain to the documentation quality measured by SWD-Bench. revision: yes
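The cross-method analysis the authors describe amounts to a rank correlation between per-method QA scores and agent success rates. A self-contained Spearman sketch follows; the six (QA score, solve rate) pairs are placeholder numbers, not the paper's data.

```python
# Spearman rank correlation between per-method benchmark QA scores and
# SWE-Agent solve rates. The data points are placeholders, not the paper's.

def ranks(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

qa_scores     = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66]  # avg QA score per method
agent_success = [0.18, 0.22, 0.27, 0.19, 0.29, 0.33]  # solve rate per method

print(f"Spearman rho = {spearman(qa_scores, agent_success):.2f}")
```

A high rho on real data would support the rebuttal's attribution argument; with only a handful of methods, though, the correlation estimate is noisy and the p-value should be interpreted cautiously.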
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper introduces SWD-Bench by mining PRs to create 4,170 entries for three QA tasks (detection, localization, completion), evaluates existing documentation methods on these tasks to identify the best performer, and then reports an independent downstream result: that documentation from the best method raises SWE-Agent issue-solving rate by 20%. This downstream measure is external to the QA benchmark scores and is not obtained by fitting parameters to them or by re-using the same inputs. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology; the central claim rests on an empirical experiment rather than reducing to the benchmark construction by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-quality pull requests provide reliable ground-truth descriptions of added functionalities.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · linked claim: "documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%"
Reference graph
Works this paper leans on
-
[1]
Google Cloud AI. [n. d.]. google-cloud-aiplatform. https://pypi.org/project/google-cloud-aiplatform
-
[2]
Anthropic. [n. d.]. Claude-Sonnet-4. https://www.anthropic.com/news/claude-4
-
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72
2005
-
[4]
BeautifulSoup. [n. d.]. BeautifulSoup. https://beautiful-soup-4.readthedocs.io/en/latest/
-
[5]
Vikas S Chomal and Jatinderkumar R Saini. 2014. Significance of software documentation in software development process. International Journal of Engineering Innovations and Research 3, 4 (2014), 410
2014
-
[6]
context labs. [n. d.]. Autodoc. https://github.com/context-labs/autodoc
-
[7]
Devin. [n. d.]. DeepWiki. https://deepwiki.org/
-
[8]
Nilesh Dhulshette, Sapan Shah, and Vinay Kulkarni. 2025. Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLMs. In IEEE/ACM International Workshop on Large Language Models for Code, LLM4Code@ICSE 2025, Ottawa, ON, Canada, May 3, 2025. IEEE, 145–152. https://doi.org/10.1109/LLM4CODE66737.2025.00023
-
[9]
GitHub. [n. d.]. GitHub REST API. https://docs.github.com/en/rest
-
[10]
Google. [n. d.]. Gemini-2.5-pro. https://aistudio.google.com/app/prompts/new_chat?model=gemini-2.5-pro
-
[11]
Juncai Guo, Jin Liu, Yao Wan, Li Li, and Pingyi Zhou. 2022. Modeling hierarchical syntax structure with triplet position for source code summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 486–500
2022
-
[12]
Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In 2010 17th Working Conference on Reverse Engineering. IEEE, 35–44
2010
-
[13]
Lise Tordrup Heeager. 2012. Introducing agile practices in a documentation-driven software development practice: a case study. Journal of Information Technology Case and Application Research 14, 1 (2012), 3–24
2012
-
[14]
Emily Hill, Lori Pollock, and K Vijay-Shanker. 2009. Automatically capturing source code context of nl-queries for software maintenance and reuse. In 2009 IEEE 31st International Conference on Software Engineering. IEEE, 232–242
2009
-
[15]
Xing Hu, Qiuyuan Chen, Haoye Wang, Xin Xia, David Lo, and Thomas Zimmermann. 2022. Correlating automated and human evaluation of code documentation generation quality. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–28
2022
-
[16]
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66
2024
-
[17]
Ankur Joshi, Saket Kale, Satish Chandel, and D Kumar Pal. 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology 7, 4 (2015), 396
2015
-
[18]
Junaed Younus Khan and Gias Uddin. 2022. Automatic code documentation generation using gpt-3. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–6
2022
-
[19]
Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Hao Zhang, Xinyi Dai, Yasheng Wang, and Ruiming Tang. 2025. Coir: A comprehensive benchmark for code information retrieval models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 22074–22091
2025
-
[20]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81
2004
-
[21]
Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2025. CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval. In Second Conference on Language Modeling. https://openreview.net/forum?id=z3lG70Azbg
2025
-
[22]
Zhongxin Liu, Xin Xia, Ahmed E Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-machine-translation-based commit message generation: how far are we?. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 373–384
2018
-
[23]
Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. 2024. RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Delia Irazu Hernandez Farias, Tom Hope, and Manling Li (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 436–464. https:/...
2024
-
[25]
Luqi, L. Zhang, V. Berzins, and Y. Qiao. 2004. Documentation driven development for complex real-time systems. IEEE Transactions on Software Engineering 30, 12 (2004), 936–952. https://doi.org/10.1109/TSE.2004.100
-
[26]
Paul W McBurney and Collin McMillan. 2015. Automatic source code summarization of context for java methods. IEEE Transactions on Software Engineering 42, 2 (2015), 103–119
2015
-
[27]
Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32
2013
-
[28]
OpenAI. [n. d.]. GPT-4.1. https://openai.com/index/gpt-4-1/
-
[29]
Sebastiano Panichella, Jairo Aponte, Massimiliano Di Penta, Andrian Marcus, and Gerardo Canfora. 2012. Mining source code descriptions from developer communications. In 2012 20th IEEE International Conference on Program Comprehension (ICPC). IEEE, 63–72
2012
-
[30]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318
2002
-
[31]
Mohammad Masudur Rahman, Chanchal K Roy, and Iman Keivanloo. 2015. Recommending insightful comments for source code using crowdsourced knowledge. In 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 81–90
2015
-
[32]
Sawan Rai, Ramesh Chandra Belwal, and Atul Gupta. 2022. A review on source code documentation. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 5 (2022), 1–44
2022
-
[33]
Ian Sommerville. 2001. Software documentation. Software Engineering 2 (2001), 143–154
2001
-
[34]
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering. 43–52
2010
-
[35]
Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Generating parameter comments and integrating with method summaries. In 2011 IEEE 19th International Conference on Program Comprehension. IEEE, 71–80
2011
-
[36]
Chia-Yi Su and Collin McMillan. 2024. Distilled GPT for source code summarization. Automated Software Engineering 31, 1 (2024), 22
2024
-
[37]
OpenAI tiktoken. [n. d.]. tiktoken. https://github.com/openai/tiktoken
-
[38]
tree-sitter. [n. d.]. Tree-sitter. https://tree-sitter.github.io/tree-sitter/
-
[39]
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575
2015
-
[40]
Xiaoran Wang, Lori Pollock, and K Vijay-Shanker. 2017. Automatically generating natural language descriptions for object-related statement sequences. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 205–216
2017
-
[41]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837
2022
-
[42]
Edmund Wong, Taiyue Liu, and Lin Tan. 2015. Clocom: Mining existing source code for automatic comment generation. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 380–389
2015
-
[43]
Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. Autocomment: Mining question and answer sites for automatic comment generation. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 562–567
2013
-
[44]
Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, and Grey Yang. 2025. DocAgent: A Multi-Agent System for Automated Code Documentation Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Pushkar Mishra, Smaranda Muresan, and Tao Yu (Eds.). Assoc...
-
[45]
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. CoRR abs/2405.15793 (2024). https://doi.org/10.48550/ARXIV.2405.15793 arXiv:2405.15793
2024
-
[46]
Jianwei Zeng, Yutong He, Tao Zhang, Zhou Xu, and Qiang Han. 2023. CLG-Trans: Contrastive learning for code summarization via graph attention-based transformer. Science of Computer Programming 226 (2023), 102925
2023
-
[47]
Xuejun Zhang, Xia Hou, Xiuming Qiao, and Wenfeng Song. 2024. A review of automatic source code summarization. Empirical Software Engineering 29, 6 (2024), 162
2024