Fine-grained Multi-Document Extraction and Generation of Code Change Rationale
Pith reviewed 2026-05-10 15:25 UTC · model grok-4.3
The pith
Code change rationales are fragmented across commit messages, issues, and pull requests; an LLM-based tool can identify the relevant sentences across these documents and generate useful rationale summaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An empirical study of 63 commits from five open-source Java projects revealed that rationale components are highly fragmented: commit messages and pull requests mainly capture goals, needs and alternatives appear more often in issues and PRs, and other components are rare and found outside commit messages. No single artifact type contains all components. ARGUS, an LLM-based method, then identifies sentences expressing goal, need, and alternative across a commit's artifacts and synthesizes concise rationale summaries. On the studied commits the strongest variant achieved 51.4 percent precision and 93.2 percent recall for identification, and the generated summaries were rated accurate; a user study with 12 Java developers found them useful for code review, documentation, and debugging.
What carries the argument
ARGUS, the LLM pipeline that locates sentences stating a change's goal, need, or alternative across commit messages, issues, and pull requests, then produces concise synthesized summaries.
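The page does not reproduce ARGUS's implementation, so the sketch below only illustrates the two-stage shape described above: identify rationale sentences across a commit's artifacts, then synthesize a summary. All names are hypothetical, and the cue-phrase matcher in `llm_label` is an explicit stand-in for the LLM classifier.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical data model; the paper's actual representation is not published here.
@dataclass
class Artifact:
    kind: str        # "commit_message", "issue", or "pull_request"
    sentences: list

def llm_label(sentence: str) -> Optional[str]:
    """Stand-in for the LLM classifier: returns 'goal', 'need',
    'alternative', or None. ARGUS would prompt an LLM here rather
    than match cue phrases."""
    cues = {
        "goal": ("in order to", "so that", "to support"),
        "need": ("because", "needed", "required"),
        "alternative": ("instead of", "rather than", "alternatively"),
    }
    low = sentence.lower()
    for label, phrases in cues.items():
        if any(p in low for p in phrases):
            return label
    return None

def identify_rationale(artifacts):
    """Stage 1: gather labeled sentences from every artifact of one commit."""
    found = {"goal": [], "need": [], "alternative": []}
    for art in artifacts:
        for s in art.sentences:
            label = llm_label(s)
            if label is not None:
                found[label].append((art.kind, s))
    return found

def summarize(found):
    """Stage 2: synthesize a one-line summary (ARGUS uses an LLM here too)."""
    parts = [f"{label}: " + "; ".join(s for _, s in sents)
             for label, sents in found.items() if sents]
    return " | ".join(parts)
```

The point of the two stages matches the paper's finding: stage 1 must read all artifact types, because no single one carries every component.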
If this is right
- Tools for code review and maintenance must combine information from commit messages, issues, and pull requests rather than relying on any one source.
- Developers can receive automatically generated rationale summaries that reduce the time spent searching scattered records for why a change was made.
- Rationale summaries produced this way are perceived as helpful specifically for code review, writing documentation, and diagnosing bugs.
- LLM-based extraction can achieve high recall in locating rationale sentences even when precision is moderate.
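The last point can be made concrete with the standard set-based definitions of precision and recall. The sentence IDs below are hypothetical, not the paper's data: a generous flagger recovers almost all true rationale sentences (high recall) at the cost of extra false positives (moderate precision).

```python
def precision_recall(predicted, ground_truth):
    """Standard set-based precision and recall over sentence identifiers."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical commit: 10 true rationale sentences; the system flags 18,
# of which 9 are correct.
truth = set(range(10))
flagged = set(range(9)) | set(range(100, 109))
p, r = precision_recall(flagged, truth)
# p = 9/18 = 0.5 (moderate precision), r = 9/10 = 0.9 (high recall)
```

This is the regime the paper reports: over-flagging keeps recall near complete while precision stays near half.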
Where Pith is reading between the lines
- Integrating ARGUS-style summaries directly into version-control interfaces could make rationale visible at the moment a developer views a commit.
- The same multi-document approach might be tested on larger sets of commits to check whether the observed fragmentation pattern holds beyond the five studied projects.
- Measuring whether developers complete review or debugging tasks faster or with fewer errors when given these summaries would give a stronger test of practical value than perception ratings alone.
Load-bearing premise
The 63 commits from five Java projects sufficiently represent how rationale is distributed in typical development, and the LLM sentence identification works without major loss of accuracy on other projects or languages.
What would settle it
Running the same multi-document analysis on commits from a different programming language or from closed-source repositories and finding markedly different distributions of goals, needs, and alternatives across artifact types.
Original abstract
Understanding the reasons behind past code changes is critical for many software engineering tasks, including refactoring and reviewing code, diagnosing bugs, and implementing new features. Unfortunately, locating and reconstructing this rationale can be difficult for developers because the information is often fragmented, inconsistently documented, and scattered across different artifacts such as commit messages, issue reports, and PRs. In this paper, we address this challenge in two steps. First, we conduct an empirical study of 63 commits from five open-source Java projects to analyze how rationale components (e.g., a change's goal, need, and alternative) are distributed across artifacts. We find that the rationale is highly fragmented: commit messages and pull requests primarily capture goals, while needs and alternatives are more often found in issues and PRs. Other components are scarce but found in artifacts other than commit messages. No single artifact type captures all components, underscoring the need for cross-document reasoning and synthesis. Second, we introduce ARGUS, an LLM-based approach that identifies sentences expressing goal, need, and alternative across a commit's artifacts and creates concise rationale summaries to support code comprehension and maintenance tasks. We evaluated ARGUS on the 63 commits and compared its performance against baseline variants. The best-performing version achieved 51.4% precision and 93.2% recall for rationale identification, while producing rationale summaries rated as accurate. A user study with 12 Java developers further showed that these summaries were perceived as useful and helpful for tasks such as code review, documentation, and debugging. Our results highlight the need for multi-document reasoning in capturing rationale and demonstrate the potential of ARGUS to help developers understand and maintain software systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical study of 63 commits from five open-source Java projects to analyze the distribution of code change rationale components (e.g., goals, needs, alternatives) across artifacts such as commit messages, issues, and PRs, finding high fragmentation with no single artifact capturing all components. It then introduces ARGUS, an LLM-based approach to identify relevant sentences across multi-document artifacts and generate concise rationale summaries. ARGUS is evaluated on the same 63 commits, with the best variant achieving 51.4% precision and 93.2% recall for identification, accurate summaries, and a user study with 12 Java developers rating the summaries as useful for code review, documentation, and debugging.
Significance. If the results hold under broader validation, the work contributes concrete empirical data on rationale fragmentation and demonstrates a practical LLM-based multi-document synthesis method that could support key SE tasks. The empirical counts across artifact types and the developer user study are strengths that provide grounded evidence rather than purely synthetic claims.
major comments (1)
- [Evaluation] The evaluation of ARGUS reports 51.4% precision and 93.2% recall based solely on the same 63 commits from the empirical study, with no held-out test set, cross-project validation, or external commits described. This setup is load-bearing for the central performance and usefulness claims, as the metrics and developer perceptions could be artifacts of the narrow sample (five Java projects) rather than evidence of broader applicability.
minor comments (2)
- [Abstract] The abstract and evaluation description provide limited detail on inter-rater agreement for the manual rationale labeling, the specific LLM prompts or engineering choices, and how the baseline variants were implemented; adding these would improve reproducibility and assessment of the identification results.
- The paper could strengthen the threats-to-validity discussion by explicitly addressing the representativeness of the 63 commits and potential domain shift beyond the studied Java projects.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report on our manuscript. We value the constructive criticism regarding the evaluation of ARGUS. We address the major comment below and indicate the changes we will make in the revised version.
Point-by-point responses
Referee: [Evaluation] The evaluation of ARGUS reports 51.4% precision and 93.2% recall based solely on the same 63 commits from the empirical study, with no held-out test set, cross-project validation, or external commits described. This setup is load-bearing for the central performance and usefulness claims, as the metrics and developer perceptions could be artifacts of the narrow sample (five Java projects) rather than evidence of broader applicability.
Authors: We acknowledge that the performance metrics for ARGUS were computed on the same 63 commits used in the empirical study, and that no held-out test set or cross-project validation was performed. This design choice stems from the fact that the empirical study was necessary to identify and annotate the rationale components across artifacts before developing and testing the extraction and generation approach. However, we agree that this limits the strength of the claims regarding broader applicability. In the revised manuscript, we will add a limitations section explicitly discussing the small sample size and lack of external validation. Additionally, we will perform and report a 5-fold cross-validation on the 63 commits to provide a more robust estimate of performance, and we will clarify that the user study with 12 developers offers qualitative insights rather than quantitative generalizability. We believe these revisions will address the core concern while preserving the contributions of the empirical analysis.
revision: partial
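The promised 5-fold cross-validation would need to partition at the commit level, so that every artifact of a commit lands in exactly one fold. A stdlib-only sketch under that assumption (commit IDs are placeholders, not the studied commits):

```python
import random

def k_fold_commit_ids(commit_ids, k=5, seed=42):
    """Shuffle commits once with a fixed seed, then deal them into k
    disjoint folds. Splitting by commit (not by sentence) keeps all
    artifacts of one commit together, avoiding train/test leakage."""
    ids = list(commit_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

# Placeholder IDs standing in for the 63 studied commits.
commits = [f"commit_{i:02d}" for i in range(63)]
folds = k_fold_commit_ids(commits)
# With 63 commits and k=5, fold sizes are 13, 13, 13, 12, 12.
```

Each fold would then serve once as the test set while prompts or thresholds are tuned on the remaining four.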
Circularity Check
No significant circularity; empirical metrics and user study are independent of method definition
full rationale
The paper conducts a separate empirical study of rationale distribution across 63 commits, then applies an LLM-based identification and summarization method (ARGUS) whose outputs are compared against ground-truth annotations from that study to compute precision/recall. This is standard evaluation practice and does not reduce the reported 51.4% precision / 93.2% recall or user-study usefulness to a definitional tautology, fitted parameter, or self-citation chain. No equations, ansatzes, or uniqueness theorems appear; the LLM component operates independently of the specific performance numbers, and results remain falsifiable through replication on new commits or projects. The evaluation uses the study data for both analysis and testing, but this does not constitute circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- [domain assumption] Rationale for a code change can be decomposed into goal, need, and alternative components that are expressed in natural language across commit messages, issues, and PRs.
- [domain assumption] LLM-based sentence classification can reliably surface these components when given the full set of artifacts for a commit.