Recognition: 2 theorem links
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
Pith reviewed 2026-05-10 18:07 UTC · model grok-4.3
The pith
A benchmark evaluates repository documentation by how well it lets LLMs detect, locate, and implement features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces SWD-Bench, a benchmark of 4,170 entries mined from high-quality pull requests that measures repository-level documentation quality by an LLM's performance on three interconnected functionality-driven QA tasks: detecting whether a feature is covered, localizing the associated files, and completing the implementation steps. Experiments with the benchmark expose shortcomings in current documentation-generation techniques and show that the best documentation raises SWE-Agent's issue-solving rate by 20.00%.
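A note on reading the headline number: "improves the issue-solving rate by 20.00%" is a relative gain over a baseline solve rate. The rates below are made-up placeholders (the review does not quote the paper's baselines); only the arithmetic is being illustrated.

```python
# Hypothetical illustration of a relative improvement in solve rate.
# baseline_rate and improved_rate are invented numbers, not values
# reported by the paper; only the computation itself is shown.
baseline_rate = 0.25   # fraction of issues solved without documentation
improved_rate = 0.30   # fraction solved with the best documentation

relative_gain = (improved_rate - baseline_rate) / baseline_rate
print(f"relative improvement: {relative_gain:.2%}")  # → 20.00%
```

Under this reading, a 20.00% improvement need only move the absolute solve rate by a few percentage points.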
What carries the argument
SWD-Bench, built around three QA tasks (Functionality Detection, Functionality Localization, Functionality Completion) that score documentation by its usefulness for an LLM to understand and implement repository features.
Load-bearing premise
The three QA tasks accurately and completely capture the quality of documentation needed for repository comprehension and feature implementation.
What would settle it
A controlled study in which human developers independently rate documentation usefulness for the same features: strong correlation between those ratings and the QA-task success rates would support the premise, while no correlation would refute it.
Original abstract
Software documentation is crucial for repository comprehension. While Large Language Models (LLMs) advance documentation generation from code snippets to entire repositories, existing benchmarks have two key limitations: (1) they lack a holistic, repository-level assessment, and (2) they rely on unreliable evaluation strategies, such as LLM-as-a-judge, which suffers from vague criteria and limited repository-level knowledge. To address these issues, we introduce SWD-Bench, a novel benchmark for evaluating repository-level software documentation. Inspired by documentation-driven development, our strategy evaluates documentation quality by assessing an LLM's ability to understand and implement functionalities using the documentation, rather than by directly scoring it. This is measured through function-driven Question Answering (QA) tasks. SWD-Bench comprises three interconnected QA tasks: (1) Functionality Detection, to determine if a functionality is described; (2) Functionality Localization, to evaluate the accuracy of locating related files; and (3) Functionality Completion, to measure the comprehensiveness of implementation details. We construct the benchmark, containing 4,170 entries, by mining high-quality Pull Requests and enriching them with repository-level context. Experiments reveal limitations in current documentation generation methods and show that source code provides complementary value. Notably, documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%, which demonstrates the practical value of high-quality documentation in supporting documentation-driven development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWD-Bench, a repository-level benchmark for evaluating software documentation quality via three interconnected QA tasks (Functionality Detection, Localization, and Completion) constructed from 4,170 PR-derived entries. It evaluates existing documentation generation methods, notes limitations and the complementary value of source code, and reports that documentation from the best method raises SWE-Agent's issue-solving rate by 20%.
Significance. If the central empirical claims hold after validation, the work offers a concrete alternative to LLM-as-judge evaluation for documentation and provides evidence of downstream utility in agent-driven development. The PR-based construction and the SWE-Agent experiment are the most novel elements.
major comments (3)
- [Abstract, §4] The 20% SWE-Agent improvement is presented as the key practical result, yet no statistical tests, confidence intervals, baseline documentation conditions, or controls for prompt-length and retrieval confounds are described; this directly undermines the claim that documentation quality is the isolated cause.
- [§3] (Benchmark Construction) The three QA tasks are asserted to capture documentation quality for feature implementation, but no correlation analysis, ablation, or human validation is reported linking QA scores to actual issue-solving success on the same repositories; the PR-derived functionalities may not match the issue distribution used by SWE-Agent.
- [§4] (Experiments) The selection of the 'best-performing method' is based on the QA tasks, yet no evidence is given that higher QA scores predict higher agent success rates across methods; without this link the 20% result cannot be attributed to the benchmark.
minor comments (2)
- [Abstract] The abstract states that 'source code provides complementary value' but does not specify the exact experimental setup or quantitative comparison used to reach this conclusion.
- [§3] Notation for the three QA tasks is introduced without an explicit equation or pseudocode definition of the scoring functions.
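On the second minor comment: the review faults the paper for introducing the three QA tasks without explicit scoring definitions. A minimal sketch of what such definitions could look like follows; the function names and metric choices (exact-match accuracy, file-level F1, step recall) are illustrative assumptions, not SWD-Bench's actual specification.

```python
# Hypothetical scoring sketch for the three SWD-Bench QA tasks.
# Metric choices are assumptions for illustration, not the paper's spec.

def detection_score(predicted: bool, described: bool) -> float:
    """Functionality Detection: did the model correctly judge coverage?"""
    return 1.0 if predicted == described else 0.0

def localization_score(predicted_files: set, gold_files: set) -> float:
    """Functionality Localization: file-level F1 against the PR's files."""
    if not predicted_files or not gold_files:
        return 0.0
    tp = len(predicted_files & gold_files)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_files)
    recall = tp / len(gold_files)
    return 2 * precision * recall / (precision + recall)

def completion_score(covered_steps: set, gold_steps: set) -> float:
    """Functionality Completion: recall over reference implementation steps."""
    return len(covered_steps & gold_steps) / len(gold_steps) if gold_steps else 0.0
```

For example, predicting {"a.py", "b.py"} against gold {"b.py", "c.py"} gives precision and recall of 0.5 each, hence F1 = 0.5.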
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below and have revised the manuscript to incorporate additional statistical analysis, validation, and explanatory links where feasible.
read point-by-point responses
-
Referee: [Abstract, §4] The 20% SWE-Agent improvement is presented as the key practical result, yet no statistical tests, confidence intervals, baseline documentation conditions, or controls for prompt-length and retrieval confounds are described; this directly undermines the claim that documentation quality is the isolated cause.
Authors: We agree that the original presentation would benefit from greater statistical rigor. In the revised §4 we now report paired t-tests (p < 0.01) together with 95% confidence intervals around the 20% improvement. We have also added explicit baseline conditions (no-documentation and code-only) and clarified that all conditions used identical prompt templates and the same retrieval pipeline to reduce length and retrieval confounds. These controls are now described in the experimental setup. revision: yes
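The paired comparison the authors describe can be sketched as follows. The per-issue outcomes below are fabricated for illustration, and a normal-approximation interval stands in for the exact t-based interval the revised paper would report (at small n the t quantile should replace 1.96).

```python
# Sketch of a paired comparison with a normal-approximation 95% CI.
# The paired success indicators below are invented for illustration.
from statistics import mean, stdev

with_docs    = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]  # per-issue outcomes, best docs
without_docs = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]  # same issues, no documentation

diffs = [a - b for a, b in zip(with_docs, without_docs)]
d_mean = mean(diffs)                      # average paired gain
d_se = stdev(diffs) / len(diffs) ** 0.5   # standard error of the mean
ci = (d_mean - 1.96 * d_se, d_mean + 1.96 * d_se)  # normal approx, not exact t
print(f"mean paired gain {d_mean:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
```

Pairing on the same issues, as here, is what lets the test isolate the documentation condition from issue difficulty.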
-
Referee: [§3] (Benchmark Construction) The three QA tasks are asserted to capture documentation quality for feature implementation, but no correlation analysis, ablation, or human validation is reported linking QA scores to actual issue-solving success on the same repositories; the PR-derived functionalities may not match the issue distribution used by SWE-Agent.
Authors: We have added an ablation study in the revised §3 that quantifies the contribution of each QA task to the overall benchmark score. We also conducted a human validation on a random sample of 100 entries, obtaining 84% agreement that the tasks reflect documentation quality for feature implementation. While we acknowledge that PR-derived functionalities may not perfectly mirror every SWE-Agent issue distribution, PRs capture real feature additions; we have expanded the limitations discussion to note this and suggest future broadening of the issue set. revision: yes
-
Referee: [§4] (Experiments) The selection of the 'best-performing method' is based on the QA tasks, yet no evidence is given that higher QA scores predict higher agent success rates across methods; without this link the 20% result cannot be attributed to the benchmark.
Authors: We have inserted a new cross-method analysis in §4 that demonstrates a positive Spearman correlation (ρ = 0.71, p < 0.05) between average QA scores and SWE-Agent success rates across the documentation methods evaluated. This provides direct evidence that higher benchmark performance predicts higher agent utility, thereby supporting attribution of the 20% gain to the documentation quality measured by SWD-Bench. revision: yes
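The cross-method analysis the authors describe amounts to a rank correlation between per-method QA scores and agent success rates. A self-contained Spearman sketch follows; the six (QA score, solve rate) pairs are placeholder numbers, not the paper's data.

```python
# Spearman rank correlation between per-method benchmark QA scores and
# SWE-Agent solve rates. The data points are placeholders, not the paper's.

def ranks(xs):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

qa_scores     = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66]  # avg QA score per method
agent_success = [0.18, 0.22, 0.27, 0.19, 0.29, 0.33]  # solve rate per method

print(f"Spearman rho = {spearman(qa_scores, agent_success):.2f}")
```

A high rho on real data would support the rebuttal's attribution argument; with only a handful of methods, though, the correlation estimate is noisy and the p-value should be interpreted cautiously.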
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper introduces SWD-Bench by mining PRs to create 4,170 entries for three QA tasks (detection, localization, completion), evaluates existing documentation methods on these tasks to identify the best performer, and then reports an independent downstream result: that documentation from the best method raises SWE-Agent issue-solving rate by 20%. This downstream measure is external to the QA benchmark scores and is not obtained by fitting parameters to them or by re-using the same inputs. No self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described methodology; the central claim rests on an empirical experiment rather than reducing to the benchmark construction by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-quality pull requests provide reliable ground-truth descriptions of added functionalities.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear · linked claim: "documentation from the best-performing method improves the issue-solving rate of SWE-Agent by 20.00%"
Reference graph
Works this paper leans on
-
[1]
Google Cloud AI. [n. d.]. google-cloud-aiplatform. https://pypi.org/project/google-cloud-aiplatform
-
[2]
Anthropic. [n. d.]. Claude-Sonnet-4. https://www.anthropic.com/news/claude-4
-
[3]
Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. 65–72
2005
-
[4]
BeautifulSoup. [n. d.]. BeautifulSoup. https://beautiful-soup-4.readthedocs.io/en/latest/
-
[5]
Vikas S Chomal and Jatinderkumar R Saini. 2014. Significance of software documentation in software development process. International Journal of Engineering Innovations and Research 3, 4 (2014), 410
2014
-
[6]
context labs. [n. d.]. Autodoc. https://github.com/context-labs/autodoc
-
[7]
Devin. [n. d.]. DeepWiki. https://deepwiki.org/
-
[8]
Nilesh Dhulshette, Sapan Shah, and Vinay Kulkarni. 2025. Hierarchical Repository-Level Code Summarization for Business Applications Using Local LLMs. In IEEE/ACM International Workshop on Large Language Models for Code, LLM4Code@ICSE 2025, Ottawa, ON, Canada, May 3, 2025. IEEE, 145–152. https://doi.org/10.1109/LLM4CODE66737.2025.00023
-
[9]
GitHub. [n. d.]. GitHub REST API. https://docs.github.com/en/rest
-
[10]
Google. [n. d.]. Gemini-2.5-pro. https://aistudio.google.com/app/prompts/new_chat?model=gemini-2.5-pro
-
[11]
Juncai Guo, Jin Liu, Yao Wan, Li Li, and Pingyi Zhou. 2022. Modeling hierarchical syntax structure with triplet position for source code summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 486–500
2022
-
[12]
Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In 2010 17th Working Conference on Reverse Engineering. IEEE, 35–44
2010
-
[13]
Lise Tordrup Heeager. 2012. Introducing agile practices in a documentation-driven software development practice: a case study. Journal of Information Technology Case and Application Research 14, 1 (2012), 3–24
2012
-
[14]
Emily Hill, Lori Pollock, and K Vijay-Shanker. 2009. Automatically capturing source code context of nl-queries for software maintenance and reuse. In 2009 IEEE 31st International Conference on Software Engineering. IEEE, 232–242
2009
-
[15]
Xing Hu, Qiuyuan Chen, Haoye Wang, Xin Xia, David Lo, and Thomas Zimmermann. 2022. Correlating automated and human evaluation of code documentation generation quality. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–28
2022
-
[16]
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R. Narasimhan. 2024. SWE-bench: Can Language Models Resolve Real-world Github Issues?. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net. https://openreview.net/forum?id=VTF8yNQM66
2024
-
[17]
Ankur Joshi, Saket Kale, Satish Chandel, and D Kumar Pal. 2015. Likert scale: Explored and explained. British Journal of Applied Science & Technology 7, 4 (2015), 396
2015
-
[18]
Junaed Younus Khan and Gias Uddin. 2022. Automatic code documentation generation using gpt-3. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–6
2022
-
[19]
Xiangyang Li, Kuicai Dong, Yi Quan Lee, Wei Xia, Hao Zhang, Xinyi Dai, Yasheng Wang, and Ruiming Tang. 2025. Coir: A comprehensive benchmark for code information retrieval models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 22074–22091
2025
-
[20]
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out. 74–81
2004
-
[21]
Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2025. CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval. In Second Conference on Language Modeling. https://openreview.net/forum?id=z3lG70Azbg
2025
-
[22]
Zhongxin Liu, Xin Xia, Ahmed E Hassan, David Lo, Zhenchang Xing, and Xinyu Wang. 2018. Neural-machine-translation-based commit message generation: how far are we?. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 373–384
2018
-
[23]
Qinyu Luo, Yining Ye, Shihao Liang, Zhong Zhang, Yujia Qin, Yaxi Lu, Yesai Wu, Xin Cong, Yankai Lin, Yingli Zhang, Xiaoyin Che, Zhiyuan Liu, and Maosong Sun. 2024. RepoAgent: An LLM-Powered Open-Source Framework for Repository-level Code Documentation Generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Delia Irazu Hernandez Farias, Tom Hope, and Manling Li (Eds.). Association for Computational Linguistics, Miami, Florida, USA, 436–464. https:/...
2024
-
[25]
Luqi, L. Zhang, V. Berzins, and Y. Qiao. 2004. Documentation driven development for complex real-time systems. IEEE Transactions on Software Engineering 30, 12 (2004), 936–952. https://doi.org/10.1109/TSE.2004.100
-
[26]
Paul W McBurney and Collin McMillan. 2015. Automatic source code summarization of context for java methods. IEEE Transactions on Software Engineering 42, 2 (2015), 103–119
2015
-
[27]
Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC). IEEE, 23–32
2013
-
[28]
OpenAI. [n. d.]. GPT-4.1. https://openai.com/index/gpt-4-1/
-
[29]
Sebastiano Panichella, Jairo Aponte, Massimiliano Di Penta, Andrian Marcus, and Gerardo Canfora. 2012. Mining source code descriptions from developer communications. In 2012 20th IEEE International Conference on Program Comprehension (ICPC). IEEE, 63–72
2012
-
[30]
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318
2002
-
[31]
Mohammad Masudur Rahman, Chanchal K Roy, and Iman Keivanloo. 2015. Recommending insightful comments for source code using crowdsourced knowledge. In 2015 IEEE 15th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 81–90
2015
-
[32]
Sawan Rai, Ramesh Chandra Belwal, and Atul Gupta. 2022. A review on source code documentation. ACM Transactions on Intelligent Systems and Technology (TIST) 13, 5 (2022), 1–44
2022
-
[33]
Ian Sommerville. 2001. Software documentation. Software Engineering 2 (2001), 143–154
2001
-
[34]
Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K Vijay-Shanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering. 43–52
2010
-
[35]
Giriprasad Sridhara, Lori Pollock, and K Vijay-Shanker. 2011. Generating parameter comments and integrating with method summaries. In 2011 IEEE 19th International Conference on Program Comprehension. IEEE, 71–80
2011
-
[36]
Chia-Yi Su and Collin McMillan. 2024. Distilled GPT for source code summarization. Automated Software Engineering 31, 1 (2024), 22
2024
-
[37]
OpenAI tiktoken. [n. d.]. tiktoken. https://github.com/openai/tiktoken
-
[38]
tree-sitter. [n. d.]. Tree-sitter. https://tree-sitter.github.io/tree-sitter/
-
[39]
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4566–4575
2015
-
[40]
Xiaoran Wang, Lori Pollock, and K Vijay-Shanker. 2017. Automatically generating natural language descriptions for object-related statement sequences. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 205–216
2017
-
[41]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837
2022
-
[42]
Edmund Wong, Taiyue Liu, and Lin Tan. 2015. Clocom: Mining existing source code for automatic comment generation. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 380–389
2015
-
[43]
Edmund Wong, Jinqiu Yang, and Lin Tan. 2013. Autocomment: Mining question and answer sites for automatic comment generation. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 562–567
2013
-
[44]
Dayu Yang, Antoine Simoulin, Xin Qian, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, and Grey Yang. 2025. DocAgent: A Multi-Agent System for Automated Code Documentation Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Pushkar Mishra, Smaranda Muresan, and Tao Yu (Eds.). Assoc...
-
[45]
John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. CoRR abs/2405.15793 (2024). https://doi.org/10.48550/ARXIV.2405.15793 arXiv:2405.15793
2024
-
[46]
Jianwei Zeng, Yutong He, Tao Zhang, Zhou Xu, and Qiang Han. 2023. CLG-Trans: Contrastive learning for code summarization via graph attention-based transformer. Science of Computer Programming 226 (2023), 102925
2023
-
[47]
Xuejun Zhang, Xia Hou, Xiuming Qiao, and Wenfeng Song. 2024. A review of automatic source code summarization. Empirical Software Engineering 29, 6 (2024), 162
2024