The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code
Pith reviewed 2026-05-14 18:15 UTC · model grok-4.3
The pith
LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We find that current LLMs produce code with overall readability comparable to that of human-written code, but with distinct readability issue patterns.
Load-bearing premise
The synthesized readability model combining textual, structural, program, and visual features accurately measures readability in a manner that aligns with human perception and practical maintainability.
Original abstract
As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of the critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability, especially compared with human-written code, and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the readability of LLM-generated code. To quantify code readability systematically, we establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by mainstream LLMs under 5,869 scenarios extracted from large code bases, including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to that of human-written code, but with distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints, and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, the distinct readability issue patterns and the limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that current LLMs produce code with overall readability comparable to human-written code (as measured by a new model fusing textual, structural, program, and visual features) but with distinct readability issue patterns. Evaluation is performed on 5,869 scenarios drawn from the World of Code and LeetCode repositories. The study further reports that prompt design has limited overall impact on readability, with function signatures, constraints, and style descriptions emerging as the most influential prompt dimensions.
Significance. If the custom readability model is shown to align with human perception and downstream maintainability metrics, the work would be significant for software engineering: it supplies large-scale empirical evidence on a non-functional property that directly affects code review and long-term maintenance, while identifying concrete prompt factors and persistent issue patterns that future LLM tooling could target.
major comments (2)
- [Readability Model section] The readability model (described in the section establishing the comprehensive model) synthesizes four feature families into a scalar score yet supplies no information on feature weighting/aggregation, calibration against human readability ratings, or correlation with maintenance metrics such as defect density or review time. Because the headline claim of 'comparable' readability and the 'distinct issue patterns' both rest on this unvalidated scalar, the central comparability result cannot be interpreted without such evidence.
- [Evaluation and Results section] The human baseline is drawn from the same WoC/LeetCode repositories and scored with the identical unvalidated model; therefore any reported comparability is only as trustworthy as the model itself. A direct human-rating study or external validation against existing readability corpora (e.g., those used in prior SE literature) is required before the 'comparable' conclusion can be treated as load-bearing.
minor comments (1)
- [Abstract] The abstract states '5,869 scenarios' without indicating how many were generated per LLM or per prompt variant; adding this breakdown (perhaps as a table) would clarify the statistical power behind the prompt-effect findings.
Simulated Author's Rebuttal
We thank the referee for the constructive comments emphasizing the need for greater transparency and validation around our readability model. We have revised the manuscript to expand the model description, add external anchoring where feasible, and more carefully qualify the comparability claims. Below we respond point by point.
Point-by-point responses
-
Referee: [Readability Model section] The readability model (described in the section establishing the comprehensive model) synthesizes four feature families into a scalar score yet supplies no information on feature weighting/aggregation, calibration against human readability ratings, or correlation with maintenance metrics such as defect density or review time. Because the headline claim of 'comparable' readability and the 'distinct issue patterns' both rest on this unvalidated scalar, the central comparability result cannot be interpreted without such evidence.
Authors: We agree that the original description of the aggregation was insufficiently detailed. In the revised manuscript we have added a dedicated subsection explaining that each of the four feature families is first computed with established metrics drawn from prior SE literature (textual readability formulas for the textual family, cyclomatic complexity and nesting depth for the structural, Halstead metrics for the program, and indentation/line-length statistics for the visual), then min-max normalized to [0,1] and combined via an unweighted arithmetic mean. We also cite the source papers that originally validated the individual metrics against human judgments. We did not conduct a fresh end-to-end calibration or a correlation analysis with defect density or review time in this study; we have therefore added an explicit Limitations paragraph acknowledging this gap and stating that the 'comparable' finding should be read as a relative, model-internal comparison rather than an absolute human-equivalence claim. revision: partial
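For concreteness, the aggregation the authors describe is simple enough to sketch. The following is a minimal illustration assuming one scalar metric per feature family; the function names and the per-family metric interface are placeholders for this review, not the paper's implementation.

```python
# Minimal sketch of the described aggregation: one raw metric per feature
# family, min-max normalized to [0, 1], combined by an unweighted mean.
# All names here are illustrative, not the paper's code.

def min_max_normalize(values):
    """Rescale a list of raw metric values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                        # degenerate case: constant metric
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def readability_scores(snippets, family_metrics):
    """Combine per-family metrics into one scalar score per snippet.

    family_metrics maps a family name ('textual', 'structural',
    'program', 'visual') to a function: snippet -> raw metric value.
    """
    normalized = {
        family: min_max_normalize([metric(s) for s in snippets])
        for family, metric in family_metrics.items()
    }
    # Unweighted arithmetic mean across the feature families.
    n_families = len(family_metrics)
    return [
        sum(normalized[f][i] for f in normalized) / n_families
        for i in range(len(snippets))
    ]
```

One consequence of this design is worth keeping in mind when reading the results: min-max normalization makes every score relative to the evaluated sample, so the resulting scale is corpus-internal rather than an absolute readability measure, which is consistent with the authors' framing of the comparison as model-internal.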
-
Referee: [Evaluation and Results section] The human baseline is drawn from the same WoC/LeetCode repositories and scored with the identical unvalidated model; therefore any reported comparability is only as trustworthy as the model itself. A direct human-rating study or external validation against existing readability corpora (e.g., those used in prior SE literature) is required before the 'comparable' conclusion can be treated as load-bearing.
Authors: We accept that the human-LLM comparison inherits the model's limitations. To strengthen the result we have added an external validation experiment: we scored a random sample of 200 functions from the Buse & Weimer readability corpus with our model and report a moderate positive correlation (r = 0.61) with the original human ratings. This provides some anchoring against prior SE work. A full-scale human-rating study on the 5,869 scenarios would be valuable but is outside the resources and scope of the present large-scale empirical analysis; we have therefore revised the Discussion to frame the findings as evidence of similar average scores and distinct issue distributions under a consistent measurement instrument, while explicitly calling for future human validation studies. revision: partial
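The external-validation step lends itself to an equally small sketch. A hedged illustration, assuming the corpus is available as (snippet, human rating) pairs and using Pearson correlation as reported; the names and data handling are placeholders, not the paper's artifacts.

```python
# Sketch of the external-validation check described above: correlate the
# model's readability scores with human ratings on a sampled corpus.
import random
from scipy.stats import pearsonr

def validate_against_human_ratings(model_score, corpus, n=200):
    """corpus: list of (code_snippet, human_rating) pairs.
    Returns Pearson r and p-value between model and human scores."""
    sample = random.sample(corpus, min(n, len(corpus)))
    model_scores = [model_score(code) for code, _ in sample]
    human_ratings = [rating for _, rating in sample]
    return pearsonr(model_scores, human_ratings)
```

An r of 0.61 from such a check would sit in the range usually described as a moderate positive correlation, supporting relative comparisons while stopping short of a human-equivalence claim.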
Circularity Check
No significant circularity in the readability evaluation chain
Full rationale
The paper defines a composite readability model from textual, structural, program, and visual features, then applies the fixed model to score LLM-generated code against human baselines drawn from independent external repositories (WoC and LeetCode). No parameters are fitted to the 5,869 evaluation scenarios, no outcome is renamed as a prediction, and no self-citation chain supplies the central comparison. The evaluation chain is therefore anchored in external data and does not, by construction, reduce the reported comparability result to its own inputs.
Reference graph
Works this paper leans on
- [1] [n. d.]. LeetCode. https://leetcode.com/problemset/
- [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
- [3]
- [4] André Altmann, Laura Toloşi, Oliver Sander, and Thomas Lengauer. 2010. Permutation importance: a corrected feature importance measure. Bioinformatics 26, 10 (2010), 1340–1347.
- [5] Anthropic. [n. d.]. Claude. https://www.anthropic.com/claude/
- [6] Anysphere. [n. d.]. Cursor. https://www.cursor.com/
- [7] Ashley. 2024. What is .cursorrule and How to Use It Effectively. https://medium.com/towards-agi/what-are-cursor-rules-and-how-to-use-them-ec558468d139
- [8] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
- [9] Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and challenges of modern code review. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 712–721.
- [10] Christopher M Bishop and Nasser M Nasrabadi. 2006. Pattern Recognition and Machine Learning. Vol. 4. Springer.
- [11] Dustin Boswell and Trevor Foucher. 2011. The Art of Readable Code. O'Reilly Media, Inc.
- [12] Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
- [13] Kiran Busch, Alexander Rochlitzer, Diana Sola, and Henrik Leopold. 2023. Just tell me: Prompt engineering in business process management. In International Conference on Business Process Modeling, Development and Support. Springer, 3–11.
- [14] Raymond PL Buse and Westley R Weimer. 2008. A metric for software readability. In Proceedings of the 2008 International Symposium on Software Testing and Analysis. 121–130.
- [15] Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability. IEEE Transactions on Software Engineering 36, 4 (2009), 546–558.
- [16] Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H Paterson, Carsten Schulte, Bonita Sharif, and Sascha Tamm. 2015. Eye movements in code reading: Relaxing the linear order. In 2015 IEEE 23rd International Conference on Program Comprehension. IEEE, 255–265.
- [17] Mark Chen. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
- [18] Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of LLM-generated code on LeetCode. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 79–89.
- [19] Martha E Crosby and Jan Stelovsky. 2002. How do we read algorithms? A case study. Computer 23, 1 (2002), 25–35.
- [20] Pablo Roberto Fernandes de Oliveira, Rohit Gheyi, José Aldo Silva da Costa, and Márcio Ribeiro. 2024. Assessing Python Style Guides: An Eye-Tracking Study with Novice Developers. In Simpósio Brasileiro de Engenharia de Software (SBES). SBC, 136–146.
- [21] George Digkas, Alexander Chatzigeorgiou, Apostolos Ampatzoglou, and Paris Avgeriou. 2020. Can clean new code reduce technical debt density? IEEE Transactions on Software Engineering 48, 5 (2020), 1705–1721.
- [22] Xi Ding, Rui Peng, Xiangping Chen, Yuan Huang, Jing Bian, and Zibin Zheng. 2024. Do code summarization models process too much information? Function signature may be all that is needed. ACM Transactions on Software Engineering and Methodology 33, 6 (2024), 1–35.
- [24] John Dooley and John Zukowski. 2011. Software Development and Professional Practice. Springer.
- [25] Jonathan Dorn. 2012. A general software readability model. MCS Thesis, available at web.eecs.umich.edu/~weimerw/students/dorn-mcs-paper.pdf 5 (2012), 11–14.
- [26] James L Elshoff and Michael Marcotty. 1982. Improving computer program readability to aid modification. Commun. ACM 25, 8 (1982), 512–521.
- [27] Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861–874.
- [28] Louie Giray. 2023. Prompt engineering with ChatGPT: a guide for academic writers. Annals of Biomedical Engineering 51, 12 (2023), 2629–2633.
- [29] Github. [n. d.]. Copilot. https://github.com/features/copilot
- [30] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
- [31] Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3, Mar (2003), 1157–1182.
- [32] Maurice H Halstead. 1977. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc.
- [33] Ardis Hanson. 2017. Negative case analysis. The International Encyclopedia of Communication Research Methods (2017), 1–2.
- [34] Mohammad Hassany, Jiaze Ke, Peter Brusilovsky, Arun Balajiee Lekshmi Narayanan, and Kamil Akhuseyinoglu. 2024. Authoring Worked Examples for JAVA Programming with Human AI Collaboration. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing. 101–103.
- [35] Roberta Heale and Dorothy Forbes. 2013. Understanding triangulation in research. Evidence-Based Nursing 16, 4 (2013), 98–98.
- [36] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79.
- [37] Chao Hu, Yitian Chai, Hao Zhou, Fandong Meng, Jie Zhou, and Xiaodong Gu. 2024. How Effectively Do Code Language Models Understand Poor-Readability Code? In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering. 795–806.
- [39] Yuan Huang, Nan Jia, Junhuai Shu, Xinyu Hu, Xiangping Chen, and Qiang Zhou. 2020. Does your code need comment? Software: Practice and Experience 50, 3 (2020), 227–245.
- [40] Reza Iranzad and Xiao Liu. 2025. A review of random forest-based feature selection methods for data science education and applications. International Journal of Data Science and Analytics 20, 2 (2025), 197–211.
- [41] Ciera Jaspan and Collin Green. 2023. Defining, measuring, and managing technical debt. IEEE Software 40, 03 (2023), 15–19.
- [42] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys 55, 12 (2023), 1–38.
- [43] Can Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, and Dimitris N Metaxas. 2025. APEER: Automatic prompt engineering enhances large language model reranking. In Companion Proceedings of the ACM on Web Conference 2025. 2494–2502.
- [45] Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen.
- [46]
- [47] John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif. 2019. An empirical study assessing source code readability in comprehension. In 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 513–523.
- [49]
- [50] Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, Vol. 14. Montreal, Canada, 1137–1145.
- [51] Dawn Lawrie, Henry Feild, and David Binkley. 2006. Syntactic identifier conciseness and consistency. In 2006 Sixth IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, 139–148.
- [52] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. What's in a Name? A Study of Identifiers. In 14th IEEE International Conference on Program Comprehension (ICPC'06). IEEE, 3–12.
- [53] Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2007. Effective identifier names for comprehension and memory. Innovations in Systems and Software Engineering 3, 4 (2007), 303–318.
- [54] Valentina Lenarduzzi, Terese Besker, Davide Taibi, Antonio Martini, and Francesca Arcelli Fontana. 2021. A systematic literature review on technical debt prioritization: Strategies, processes, factors, and tools. Journal of Systems and Software 171 (2021), 110827.
- [55] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
- [56] Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. 2023. GPT detectors are biased against non-native English writers. Patterns 4, 7 (2023).
- [57]
- [58] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems 36 (2023), 21558–21572.
- [59] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55, 9 (2023), 1–35.
- [60]
- [61] Michael R Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2024. Automatic programming: Large language models and beyond. ACM Transactions on Software Engineering and Methodology (2024).
- [62] Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus. 2019. World of Code: an infrastructure for mining the universe of open source VCS data. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 143–154.
- [64] Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, and Audris Mockus. 2021. World of Code: enabling a research workflow for mining and analyzing the universe of open source VCS data. Empirical Software Engineering 26 (2021), 1–42.
- [65] Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba-Nabende. 2023. Prompt engineering in large language models. In International Conference on Data Intelligence and Cognitive Informatics. Springer, 387–402.
- [66] Orni Meerbaum-Salant, Michal Armoni, and Mordechai Ben-Ari. 2011. Habits of programming in Scratch. In Proceedings of the 16th Annual Joint Conference on Innovation and Technology in Computer Science Education. 168–172.
- [67] Prabhaker Mishra, Uttam Singh, Chandra M Pandey, Priyadarshni Mishra, and Gaurav Pandey. 2019. Application of student's t-test, analysis of variance, and covariance. Annals of Cardiac Anaesthesia 22, 4 (2019), 407–411.
- [68] Leon Moonen. 2001. Generating robust parsers using island grammars. In Proceedings Eighth Working Conference on Reverse Engineering. IEEE, 13–22.
- [69] Delano Oliveira, Reydne Santos, Benedito De Oliveira, Martin Monperrus, Fernando Castor, and Fernanda Madeiral. 2024. Understanding Code Understandability Improvements in Code Reviews. IEEE Transactions on Software Engineering (2024).
- [70] OpenAI. [n. d.]. ChatGPT release. https://openai.com/index/chatgpt/
- [71] Andy Oram and Greg Wilson. 2007. Beautiful Code: Leading Programmers Explain How They Think. O'Reilly Media, Inc.
- [72] Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrea De Lucia. 2014. Do they really smell bad? A study on developers' perception of bad code smells. In 2014 IEEE International Conference on Software Maintenance and Evolution. IEEE, 101–110.
- [73]
- [74] Sebastiano Panichella, Venera Arnaoudova, Massimiliano Di Penta, and Giuliano Antoniol. 2015. Would static analysis tools help developers with code reviews? In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 161–170.
- [75] Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in Neural Information Processing Systems 34 (2021), 11054–11070.
- [76] Felix Petersen, Debarghya Mukherjee, Yuekai Sun, and Mikhail Yurochkin. 2021. Post-processing for individual fairness. Advances in Neural Information Processing Systems 34 (2021), 25944–25955.
- [78] Valentina Piantadosi, Fabiana Fierro, Simone Scalabrino, Alexander Serebrenik, and Rocco Oliveto. 2020. How does code readability change during software evolution? Empirical Software Engineering 25 (2020), 5374–5412.
- [79] Daryl Posnett, Abram Hindle, and Premkumar Devanbu. 2011. A simpler model of software readability. In Proceedings of the 8th Working Conference on Mining Software Repositories. 73–82.
- [80] Vaclav Rajlich and Prashant Gosavi. 2004. Incremental change in object-oriented programming. IEEE Software 21, 4 (2004), 62–69.