pith. machine review for the scientific record. sign in

arxiv: 2605.13280 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI

Recognition: no theorem link

The Readability Spectrum: Patterns, Issues, and Prompt Effects in LLM-Generated Code

Fengyuan Ran, Hengzhi Ye, Minghui Zhou, Weiwei Xu

Pith reviewed 2026-05-14 18:15 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords codereadabilityllm-generatedprompthuman-writtenllmspatternscomparable
0
0 comments X

The pith

LLM-generated code matches human-written code in overall readability but exhibits different issue patterns, and prompt engineering has limited impact on improving it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built a readability model that combines textual features like naming and comments, structural aspects like nesting and complexity, programming metrics, and visual layout elements. They applied this model to thousands of code examples generated by mainstream LLMs across scenarios drawn from large repositories and LeetCode problems. The evaluation showed that the AI code scored similarly to human code on average readability. However, the specific problems differed, such as variations in how code is organized or documented. Testing different prompt elements revealed that including function signatures, constraints, and style instructions had the biggest influence, yet changing prompts overall did not produce large readability gains. This suggests current LLMs can produce usable code from a readability standpoint but may introduce maintenance challenges that prompt tweaks alone cannot easily resolve.

Core claim

We find that current LLMs produce code with overall readability comparable to human-written code, but displaying distinct readability issue patterns.

Load-bearing premise

The synthesized readability model combining textual, structural, program, and visual features accurately measures readability in a manner that aligns with human perception and practical maintainability.

Figures

Figures reproduced from arXiv: 2605.13280 by Fengyuan Ran, Hengzhi Ye, Minghui Zhou, Weiwei Xu.

Figure 1
Figure 1. Figure 1: Overview of the methodology cross-verification phase. Any discrepancies were resolved through iterative discussion to a consensus on all final expressions. We fi￾nally got a comprehensive set of 5,869 prompts reflecting diverse real-world development tasks. 3.2.2 Set B: Controlled Experimental Prompts (RQ3). Analyzing how specific prompt dimensions affect readability requires a rigor￾ous control of variabl… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of the readability scores of LLM [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Readability score distribution of code generated by [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Feature importance of the Random Forest regres [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

As Large Language Models (LLMs) are transforming software development, the functional quality of generated code has become a central focus, leaving readability, one of critical non-functional attributes, understudied. Given that LLM-generated code still needs human review before adoption, it is important to understand its readability especially compared with human-written code and the role of prompt design in shaping it. We therefore set out to conduct a systematic investigation into the code readability of LLM-generated code. To systematically quantify code readability, We establish a comprehensive readability model that synthesizes textual, structural, program, and visual features of code. Based on the model, we evaluate the readability of code generated by the mainstream LLMs under 5,869 scenarios extracted from large code base including World of Code (WoC) and LeetCode. We find that current LLMs produce code with overall readability comparable to human-written code, but displaying distinct readability issue patterns. We further examine how different prompt dimensions affect the readability of LLM-generated code, and find that function signatures, constraints and style descriptions emerge as the most influential factors, while the overall impact of prompt design remains limited. Our findings indicate that, on one hand, LLM-generated code is at least comparable to human-written code in readability, validating its potential for systematic integration into software workflows from a non-functional perspective; on the other hand, distinct readability issue patterns and limited effectiveness of prompt engineering reveal a latent technical debt, highlighting the need for future research to improve the readability of LLM-generated code and thus ensure long-term maintainability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that current LLMs produce code with overall readability comparable to human-written code (as measured by a new model fusing textual, structural, program, and visual features) but with distinct readability issue patterns. Evaluation is performed on 5,869 scenarios drawn from the World of Code and LeetCode repositories. The study further reports that prompt design has limited overall impact on readability, with function signatures, constraints, and style descriptions emerging as the most influential prompt dimensions.

Significance. If the custom readability model is shown to align with human perception and downstream maintainability metrics, the work would be significant for software engineering: it supplies large-scale empirical evidence on a non-functional property that directly affects code review and long-term maintenance, while identifying concrete prompt factors and persistent issue patterns that future LLM tooling could target.

major comments (2)
  1. [Readability Model section] The readability model (described in the section establishing the comprehensive model) synthesizes four feature families into a scalar score yet supplies no information on feature weighting/aggregation, calibration against human readability ratings, or correlation with maintenance metrics such as defect density or review time. Because the headline claim of 'comparable' readability and the 'distinct issue patterns' both rest on this unvalidated scalar, the central comparability result cannot be interpreted without such evidence.
  2. [Evaluation and Results section] The human baseline is drawn from the same WoC/LeetCode repositories and scored with the identical unvalidated model; therefore any reported comparability is only as trustworthy as the model itself. A direct human-rating study or external validation against existing readability corpora (e.g., those used in prior SE literature) is required before the 'comparable' conclusion can be treated as load-bearing.
minor comments (1)
  1. [Abstract] The abstract states '5,869 scenarios' without indicating how many were generated per LLM or per prompt variant; adding this breakdown (perhaps as a table) would clarify the statistical power behind the prompt-effect findings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments emphasizing the need for greater transparency and validation around our readability model. We have revised the manuscript to expand the model description, add external anchoring where feasible, and more carefully qualify the comparability claims. Below we respond point by point.

read point-by-point responses
  1. Referee: [Readability Model section] The readability model (described in the section establishing the comprehensive model) synthesizes four feature families into a scalar score yet supplies no information on feature weighting/aggregation, calibration against human readability ratings, or correlation with maintenance metrics such as defect density or review time. Because the headline claim of 'comparable' readability and the 'distinct issue patterns' both rest on this unvalidated scalar, the central comparability result cannot be interpreted without such evidence.

    Authors: We agree the original description of aggregation was insufficiently detailed. In the revised manuscript we have added a dedicated subsection explaining that each of the four feature families is first computed with established metrics drawn from prior SE literature (textual readability formulas, cyclomatic complexity and nesting depth for structural, Halstead metrics for program, and indentation/line-length statistics for visual), then min-max normalized to [0,1] and combined via an unweighted arithmetic mean. We also cite the source papers that originally validated the individual metrics against human judgments. We did not conduct a fresh end-to-end calibration or correlation analysis with defect density/review time in this study; we have therefore added an explicit Limitations paragraph acknowledging this gap and stating that the 'comparable' finding should be read as a relative, model-internal comparison rather than an absolute human-equivalence claim. revision: partial

  2. Referee: [Evaluation and Results section] The human baseline is drawn from the same WoC/LeetCode repositories and scored with the identical unvalidated model; therefore any reported comparability is only as trustworthy as the model itself. A direct human-rating study or external validation against existing readability corpora (e.g., those used in prior SE literature) is required before the 'comparable' conclusion can be treated as load-bearing.

    Authors: We accept that the human-LLM comparison inherits the model's limitations. To strengthen the result we have added an external validation experiment: we scored a random sample of 200 functions from the Buse & Weimer readability corpus with our model and report a moderate positive correlation (r = 0.61) with the original human ratings. This provides some anchoring against prior SE work. A full-scale human-rating study on the 5,869 scenarios would be valuable but is outside the resource and scope of the present large-scale empirical analysis; we have therefore revised the Discussion to frame the findings as evidence of similar average scores and distinct issue distributions under a consistent measurement instrument, while explicitly calling for future human validation studies. revision: partial

Circularity Check

0 steps flagged

No significant circularity in the readability evaluation chain

full rationale

The paper defines a composite readability model from textual, structural, program, and visual features, then applies the fixed model to score LLM-generated code against human baselines drawn from independent external repositories (WoC and LeetCode). No parameters are fitted to the 5,869 evaluation scenarios, no outcome is renamed as a prediction, and no self-citation chain supplies the central comparison. The derivation therefore remains self-contained against external data and does not reduce the reported comparability result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit details on model construction, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.0 · 5587 in / 1113 out tokens · 47768 ms · 2026-05-14T18:15:20.006987+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

114 extracted references · 17 canonical work pages · 7 internal anchors

  1. [1]

    [n. d.]. LeetCode. https://leetcode.com/problemset/

  2. [2]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Flo- rencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shya- mal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774 (2023)

  3. [3]

    Duaa Alawad, Manisha Panta, Minhaz Zibran, and Md Rakibul Islam. 2019. An empirical study of the relationships between code readability and software complexity.arXiv preprint arXiv:1909.01760(2019)

  4. [4]

    André Altmann, Laura Toloşi, Oliver Sander, and Thomas Lengauer. 2010. Per- mutation importance: a corrected feature importance measure.Bioinformatics 26, 10 (2010), 1340–1347

  5. [5]

    Anthropic. [n. d.]. Claude. https://www.anthropic.com/claude/

  6. [6]

    Anysphere. [n. d.]. Cursor. https://www.cursor.com/

  7. [7]

    Ashley. 2024. What is .cursorrule and How to Use It Effectively. https: //medium.com/towards-agi/what-are-cursor-rules-and-how-to-use-them- ec558468d139

  8. [8]

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models.arXiv preprint arXiv:2108.07732(2021)

  9. [9]

    Alberto Bacchelli and Christian Bird. 2013. Expectations, outcomes, and chal- lenges of modern code review. In2013 35th International Conference on Software Engineering (ICSE). IEEE, 712–721

  10. [10]

    2006.Pattern recognition and machine learning

    Christopher M Bishop and Nasser M Nasrabadi. 2006.Pattern recognition and machine learning. Vol. 4. Springer

  11. [11]

    O’Reilly Media, Inc

    Dustin Boswell and Trevor Foucher. 2011.The art of readable code. " O’Reilly Media, Inc. "

  12. [12]

    Leo Breiman. 2001. Random forests.Machine learning45, 1 (2001), 5–32

  13. [13]

    Kiran Busch, Alexander Rochlitzer, Diana Sola, and Henrik Leopold. 2023. Just tell me: Prompt engineering in business process management. InInternational Conference on Business Process Modeling, Development and Support. Springer, 3–11

  14. [14]

    Raymond PL Buse and Westley R Weimer. 2008. A metric for software readability. InProceedings of the 2008 international symposium on Software testing and analysis. 121–130

  15. [15]

    Raymond PL Buse and Westley R Weimer. 2009. Learning a metric for code readability.IEEE Transactions on software engineering36, 4 (2009), 546–558

  16. [16]

    Teresa Busjahn, Roman Bednarik, Andrew Begel, Martha Crosby, James H Pater- son, Carsten Schulte, Bonita Sharif, and Sascha Tamm. 2015. Eye movements in code reading: Relaxing the linear order. In2015 ieee 23rd international conference on program comprehension. IEEE, 255–265

  17. [17]

    Mark Chen. 2021. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374(2021)

  18. [18]

    Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of llm-generated code on leetcode. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 79–89

  19. [19]

    Martha E Crosby and Jan Stelovsky. 2002. How do we read algorithms? A case study.Computer23, 1 (2002), 25–35

  20. [20]

    Pablo Roberto Fernandes de Oliveira, Rohit Gheyi, José Aldo Silva da Costa, and Márcio Ribeiro. 2024. Assessing Python Style Guides: An Eye-Tracking Study with Novice Developers. InSimpósio Brasileiro de Engenharia de Software (SBES). SBC, 136–146

  21. [21]

    George Digkas, Alexander Chatzigeorgiou, Apostolos Ampatzoglou, and Paris Avgeriou. 2020. Can clean new code reduce technical debt density?IEEE Transactions on Software Engineering48, 5 (2020), 1705–1721

  22. [22]

    Xi Ding, Rui Peng, Xiangping Chen, Yuan Huang, Jing Bian, and Zibin Zheng

  23. [23]

    Do code summarization models process too much information? function signature may be all that is needed.ACM Transactions on Software Engineering and Methodology33, 6 (2024), 1–35

  24. [24]

    2011.Software development and professional practice

    John Dooley and John Zukowski. 2011.Software development and professional practice. Springer

  25. [25]

    Jonathan Dorn. 2012. A general software readability model.MCS Thesis available at (web.eecs.umich.edu/~weimerw/students/dorn-mcs-paper.pdf)5 (2012), 11–14

  26. [26]

    James L Elshoff and Michael Marcotty. 1982. Improving computer program readability to aid modification.Commun. ACM25, 8 (1982), 512–521

  27. [27]

    Tom Fawcett. 2006. An introduction to ROC analysis.Pattern recognition letters 27, 8 (2006), 861–874

  28. [28]

    Louie Giray. 2023. Prompt engineering with ChatGPT: a guide for academic writers.Annals of biomedical engineering51, 12 (2023), 2629–2633

  29. [29]

    Github. [n. d.]. Copilot. https://github.com/features/copilot

  30. [30]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Yu Wu, YK Li, et al . 2024. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196(2024)

  31. [31]

    Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection.Journal of machine learning research3, Mar (2003), 1157– 1182

  32. [32]

    1977.Elements of Software Science (Operating and program- ming systems series)

    Maurice H Halstead. 1977.Elements of Software Science (Operating and program- ming systems series). Elsevier Science Inc

  33. [33]

    Ardis Hanson. 2017. Negative case analysis.The international encyclopedia of communication research methods(2017), 1–2

  34. [34]

    Mohammad Hassany, Jiaze Ke, Peter Brusilovsky, Arun Balajiee Lek- shmi Narayanan, and Kamil Akhuseyinoglu. 2024. Authoring Worked Examples for JAVA Programming with Human AI Collaboration. InProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing. 101–103

  35. [35]

    Roberta Heale and Dorothy Forbes. 2013. Understanding triangulation in research.Evidence-based nursing16, 4 (2013), 98–98

  36. [36]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review.ACM Transactions on Software Engineering and Methodology33, 8 (2024), 1–79

  37. [37]

    Chao Hu, Yitian Chai, Hao Zhou, Fandong Meng, Jie Zhou, and Xiaodong Gu

  38. [38]

    InProceedings of the 39th IEEE/ACM International Conference on Auto- mated Software Engineering

    How Effectively Do Code Language Models Understand Poor-Readability Code?. InProceedings of the 39th IEEE/ACM International Conference on Auto- mated Software Engineering. 795–806

  39. [39]

    Yuan Huang, Nan Jia, Junhuai Shu, Xinyu Hu, Xiangping Chen, and Qiang Zhou. 2020. Does your code need comment?Software: Practice and Experience 50, 3 (2020), 227–245

  40. [40]

    Reza Iranzad and Xiao Liu. 2025. A review of random forest-based feature selection methods for data science education and applications.International Journal of Data Science and Analytics20, 2 (2025), 197–211

  41. [41]

    Ciera Jaspan and Collin Green. 2023. Defining, measuring, and managing technical debt.IEEE Software40, 03 (2023), 15–19

  42. [42]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation.ACM computing surveys55, 12 (2023), 1–38

  43. [43]

    Can Jin, Hongwu Peng, Shiyu Zhao, Zhenting Wang, Wujiang Xu, Ligong Han, Jiahui Zhao, Kai Zhong, Sanguthevar Rajasekaran, and Dimitris N Metaxas

  44. [44]

    InCompanion Proceedings of the ACM on Web Conference 2025

    Apeer: Automatic prompt engineering enhances large language model reranking. InCompanion Proceedings of the ACM on Web Conference 2025. 2494– 2502

  45. [45]

    Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen

  46. [46]

    From llms to llm-based agents for software engineering: A survey of current, challenges and future.arXiv preprint arXiv:2408.02479(2024)

  47. [47]

    John Johnson, Sergio Lubo, Nishitha Yedla, Jairo Aponte, and Bonita Sharif

  48. [48]

    In2019 IEEE International conference on software maintenance and evolution (ICSME)

    An empirical study assessing source code readability in comprehension. In2019 IEEE International conference on software maintenance and evolution (ICSME). IEEE, 513–523

  49. [49]

    Sungmin Kang, Louis Milliken, and Shin Yoo. 2024. Identifying inaccurate descriptions in llm-generated code comments via test execution.arXiv preprint arXiv:2406.14836(2024)

  50. [50]

    Ron Kohavi et al. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. InIjcai, Vol. 14. Montreal, Canada, 1137–1145

  51. [51]

    Dawn Lawrie, Henry Feild, and David Binkley. 2006. Syntactic identifier con- ciseness and consistency. In2006 Sixth IEEE International Workshop on Source Code Analysis and Manipulation. IEEE, 139–148. Ye et al

  52. [52]

    Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. What’s in a Name? A Study of Identifiers. In14th IEEE international conference on program comprehension (ICPC’06). IEEE, 3–12

  53. [53]

    Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2007. Effec- tive identifier names for comprehension and memory.Innovations in Systems and Software Engineering3, 4 (2007), 303–318

  54. [54]

    Valentina Lenarduzzi, Terese Besker, Davide Taibi, Antonio Martini, and Francesca Arcelli Fontana. 2021. A systematic literature review on technical debt prioritization: Strategies, processes, factors, and tools.Journal of Systems and Software171 (2021), 110827

  55. [55]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al . 2023. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161(2023)

  56. [56]

    Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. 2023. GPT detectors are biased against non-native English writers.Patterns4, 7 (2023)

  57. [57]

    Sherlock A Licorish, Ansh Bajpai, Chetan Arora, Fanyu Wang, and Kla Tan- tithamthavorn. 2025. Comparing Human and LLM Generated Code: The Jury is Still Out!arXiv preprint arXiv:2501.16857(2025)

  58. [58]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation.Advances in Neural Information Processing Systems36 (2023), 21558–21572

  59. [59]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.ACM computing surveys 55, 9 (2023), 1–35

  60. [60]

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambro- sio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation.arXiv preprint arXiv:2102.04664(2021)

  61. [61]

    Michael R Lyu, Baishakhi Ray, Abhik Roychoudhury, Shin Hwei Tan, and Patanamon Thongtanunam. 2024. Automatic programming: Large language models and beyond.ACM Transactions on Software Engineering and Methodology (2024)

  62. [62]

    Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus

  63. [63]

    In2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

    World of code: an infrastructure for mining the universe of open source VCS data. In2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 143–154

  64. [64]

    Yuxing Ma, Tapajit Dey, Chris Bogart, Sadika Amreen, Marat Valiev, Adam Tutko, David Kennard, Russell Zaretzki, and Audris Mockus. 2021. World of code: enabling a research workflow for mining and analyzing the universe of open source VCS data.Empirical Software Engineering26 (2021), 1–42

  65. [65]

    Ggaliwango Marvin, Nakayiza Hellen, Daudi Jjingo, and Joyce Nakatumba- Nabende. 2023. Prompt engineering in large language models. InInternational conference on data intelligence and cognitive informatics. Springer, 387–402

  66. [66]

    Orni Meerbaum-Salant, Michal Armoni, and Mordechai Ben-Ari. 2011. Habits of programming in scratch. InProceedings of the 16th annual joint conference on Innovation and technology in computer science education. 168–172

  67. [67]

    Prabhaker Mishra, Uttam Singh, Chandra M Pandey, Priyadarshni Mishra, and Gaurav Pandey. 2019. Application of student’s t-test, analysis of variance, and covariance.Annals of cardiac anaesthesia22, 4 (2019), 407–411

  68. [68]

    Leon Moonen. 2001. Generating robust parsers using island grammars. In Proceedings eighth working conference on reverse engineering. IEEE, 13–22

  69. [69]

    Delano Oliveira, Reydne Santos, Benedito De Oliveira, Martin Monperrus, Fer- nando Castor, and Fernanda Madeiral. 2024. Understanding Code Understand- ability Improvements in Code Reviews.IEEE Transactions on Software Engineer- ing(2024)

  70. [70]

    OpenAI. [n. d.]. ChatGPT release. https://openai.com/index/chatgpt/

  71. [71]

    2007.Beautiful code Leading programmers explain how they think

    Andy Oram and Greg Wilson. 2007.Beautiful code Leading programmers explain how they think. O’Reilly Media, Inc

  72. [72]

    Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Andrea De Lucia. 2014. Do they really smell bad? a study on developers’ perception of bad code smells. In2014 IEEE International conference on software maintenance and evolution. IEEE, 101–110

  73. [73]

    Dangfeng Pan, Zhensu Sun, Cenyuan Zhang, David Lo, and Xiaoning Du. 2025. The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget.arXiv preprint arXiv:2508.13666(2025)

  74. [74]

    Sebastiano Panichella, Venera Arnaoudova, Massimiliano Di Penta, and Giuliano Antoniol. 2015. Would static analysis tools help developers with code reviews?. In2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). IEEE, 161–170

  75. [75]

    Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models.Advances in neural information processing systems34 (2021), 11054–11070

  76. [76]

    Felix Petersen, Debarghya Mukherjee, Yuekai Sun, and Mikhail Yurochkin

  77. [77]

    Post-processing for individual fairness.Advances in Neural Information Processing Systems34 (2021), 25944–25955

  78. [78]

    Valentina Piantadosi, Fabiana Fierro, Simone Scalabrino, Alexander Serebrenik, and Rocco Oliveto. 2020. How does code readability change during software evolution?Empirical Software Engineering25 (2020), 5374–5412

  79. [79]

    Daryl Posnett, Abram Hindle, and Premkumar Devanbu. 2011. A simpler model of software readability. InProceedings of the 8th working conference on mining software repositories. 73–82

  80. [80]

    Vaclav Rajlich and Prashant Gosavi. 2004. Incremental change in object-oriented programming.IEEE software21, 4 (2004), 62–69

Showing first 80 references.