Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?
Pith reviewed 2026-05-20 04:06 UTC · model grok-4.3
The pith
LLM-generated object-oriented designs simplify code but miss key abstractions and responsibility assignments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PureAI projects generated end-to-end by contemporary LLMs exhibit lower code smell density and generally smaller size, complexity, and coupling than PreAI and PostAI human-involved projects, yet this pattern aligns with missing abstractions and weaker responsibility separation in domain modeling; PostAI projects sit closer to PureAI than to PreAI on many of the same measures.
What carries the argument
Comparative case study applying project-level OOD metrics, code smell density, and domain modeling to the same postgraduate Java assignment under PreAI, PostAI, and PureAI authorship conditions.
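Code smell density is one of the three instruments named above, so it helps to see why it is sensitive to project size. The sketch below assumes a common definition, smells per 1,000 lines of code; the paper's exact definition is not given here, and the class name `SmellDensity`, the method `perKloc`, and all numbers are hypothetical.

```java
// Sketch only: the density definition (smells per 1,000 lines of code) and
// every number below are assumptions for illustration, not values from the paper.
class SmellDensity {
    /** Returns code smells per thousand lines of code (KLOC). */
    static double perKloc(int smellCount, int linesOfCode) {
        if (linesOfCode <= 0) {
            throw new IllegalArgumentException("linesOfCode must be positive");
        }
        return smellCount * 1000.0 / linesOfCode;
    }

    public static void main(String[] args) {
        // Two hypothetical projects: a larger one and a smaller, simpler one.
        System.out.printf("larger project:  %.1f smells/KLOC%n", perKloc(12, 2400));
        System.out.printf("smaller project: %.1f smells/KLOC%n", perKloc(2, 800));
    }
}
```

A lower value only says that fewer smells were detected per line. It does not say whether the abstractions the domain needs are present, which is why the study pairs the metric with domain modeling.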
If this is right
- PureAI designs reduce code smells mainly by producing simpler, less coupled structures that omit needed domain abstractions.
- Responsibility assignment and decomposition in object-oriented work continue to require human input for balanced results.
- Human projects produced after LLMs became available already display some of the same oversimplification seen in PureAI outputs.
- Lower code-smell counts in LLM outputs do not automatically signal stronger overall object-oriented quality.
- Human guidance on decomposition and responsibility assignment improves OOD outcomes when LLMs are used.
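The contrast drawn in these points can be pictured with a small Java sketch. It uses a hypothetical library-loan domain invented for illustration; none of the classes come from the paper's assignment, and the sketch makes no claim about what the study's metrics would report for either version.

```java
// Hypothetical library-loan domain, invented for illustration only.
// None of these classes come from the paper's assignment or its projects.
class DesignContrastDemo {
    public static void main(String[] args) {
        // Flat version: the caller passes raw strings and numbers.
        System.out.println(FlatLibrary.lateFee("student", 5));

        // Decomposed version: the loan asks its borrower for the fee policy.
        Loan loan = new Loan(new StudentMember("Ada"), 5);
        System.out.println(loan.lateFee());
    }
}

// Oversimplified design: one small class with no dependencies, so size and
// coupling are low, but the Member and Loan abstractions are missing.
class FlatLibrary {
    static double lateFee(String memberType, int daysOverdue) {
        double rate = memberType.equals("student") ? 0.25 : 0.50;
        return rate * daysOverdue;
    }
}

// Design with explicit domain abstractions: each member type owns its rate.
interface Member {
    String name();
    double dailyLateRate();
}

class StudentMember implements Member {
    private final String name;
    StudentMember(String name) { this.name = name; }
    public String name() { return name; }
    public double dailyLateRate() { return 0.25; }
}

class StaffMember implements Member {
    private final String name;
    StaffMember(String name) { this.name = name; }
    public String name() { return name; }
    public double dailyLateRate() { return 0.50; }
}

// The loan is responsible for computing its own fee from its borrower.
class Loan {
    private final Member borrower;
    private final int daysOverdue;
    Loan(Member borrower, int daysOverdue) {
        this.borrower = borrower;
        this.daysOverdue = daysOverdue;
    }
    double lateFee() { return borrower.dailyLateRate() * daysOverdue; }
}
```

Both versions compute the same fee. The flat one has fewer classes and no inter-class dependencies; the decomposed one adds classes and coupling but gives each rule an owner, which is the kind of responsibility assignment the points above say still needs human input.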
Where Pith is reading between the lines
- Teams may need explicit review steps focused on class responsibilities when incorporating LLM-generated code into larger systems.
- Prompt strategies that emphasize domain modeling could narrow the abstraction gap observed here.
- The pattern may appear in other design-heavy tasks where LLMs are asked to produce multi-component solutions.
Load-bearing premise
The chosen postgraduate Java assignment and the selected OOD metrics capture general object-oriented design quality, and any measured differences arise from the authorship conditions rather than from student skill variation or prompting differences.
What would settle it
A replication on a different or more complex assignment that finds PureAI projects matching or exceeding human projects in number of domain abstractions and quality of responsibility separation would undermine the central claim.
Original abstract
Background: Large Language Models (LLMs) are increasingly used for code generation. However, their ability to generate multi-class projects that require object-oriented design (OOD) remains unclear, especially relative to projects developed with human involvement. Aims: The primary objective of this study is to compare OOD quality in projects from three authorship conditions: PreAI (human-involved projects produced before widespread LLM use), PostAI (human-involved projects produced after widespread LLM use), and PureAI (projects generated end-to-end by contemporary LLMs). Method: We conducted a comparative case study on a postgraduate Java assignment. Two offerings of the same assignment were selected as the PreAI and PostAI datasets. PureAI projects were generated using three contemporary LLMs. We analyzed OOD quality using project-level OOD metrics, code smell density, and domain modeling. Results: Relative to human-involved projects, PureAI projects show lower code smell density and generally appear simpler in terms of total size, complexity, and coupling. However, this is consistent with oversimplification, as it is associated with missing abstractions and weaker responsibility separation. PostAI is closer to PureAI than PreAI on many OOD measures and also shows tendencies toward oversimplification. Conclusions: Our findings indicate that appropriate human guidance on object-oriented decomposition and responsibility assignment remains important when LLMs are used for object-oriented design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a comparative case study of object-oriented design (OOD) quality across three authorship conditions on the same postgraduate Java assignment: PreAI (human-involved projects produced before widespread LLM adoption), PostAI (human-involved projects produced after widespread LLM adoption), and PureAI (end-to-end LLM-generated projects using three contemporary models). OOD quality is assessed via project-level metrics, code-smell density, and domain modeling. Results indicate that PureAI projects exhibit lower code-smell density and greater simplicity (smaller size, lower complexity and coupling) but also missing abstractions and weaker responsibility separation, which the authors interpret as oversimplification. PostAI projects lie closer to PureAI than to PreAI on many measures. The central conclusion is that appropriate human guidance on decomposition and responsibility assignment remains important when LLMs are used for OOD.
Significance. If the central attribution holds after addressing confounds, the work supplies concrete empirical evidence that current LLMs tend to produce oversimplified OOD artifacts relative to human-involved development. This finding would be useful for software-engineering education and for guidelines on human-AI collaboration in design tasks. The multi-metric approach (project-level metrics plus code smells plus domain modeling) is a strength, as is the use of two temporally separated human cohorts to bracket the LLM era.
major comments (3)
- [Method] Method section: the description of PureAI generation provides no information on the exact prompts, whether OOD-specific guidance (e.g., on decomposition or responsibility assignment) was included, or any sensitivity analysis across prompting strategies. Because the central claim attributes oversimplification to the absence of human guidance rather than to prompting choices, the lack of prompt detail is load-bearing; an underspecified or generic prompt could produce the observed missing abstractions without demonstrating an inherent LLM limitation.
- [Method and Results] Method and Results sections: no sample sizes are reported for the PreAI or PostAI student projects, no statistical tests are described for metric comparisons, and no controls or matching for individual student skill/experience are mentioned despite the two offerings of the same assignment. These omissions prevent ruling out that observed differences between PreAI and PostAI (or between human-involved and PureAI) arise from cohort skill variation rather than authorship condition, directly undermining the claim that human guidance is necessary.
- [Discussion] Discussion: the interpretation that lower complexity and coupling in PureAI constitutes 'oversimplification' rests on the domain-modeling assessment, yet no inter-rater reliability, rubric, or validation of that assessment is provided. Without such grounding, the link between the quantitative metrics and the qualitative claim of weaker responsibility separation remains subjective and load-bearing for the conclusion.
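The inter-rater reliability requested in the third comment could be reported with a statistic such as Cohen's kappa. The sketch below is a minimal version for two raters assigning one categorical label per item. The choice of kappa, the class name `CohenKappa`, and the labels "present" and "missing" are assumptions for illustration; the paper does not state which statistic, if any, would be used.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: Cohen's kappa for two raters. Labels and data are hypothetical.
class CohenKappa {
    /** Returns (po - pe) / (1 - pe); NaN when expected agreement is 1. */
    static double kappa(String[] raterA, String[] raterB) {
        if (raterA.length != raterB.length || raterA.length == 0) {
            throw new IllegalArgumentException("need two non-empty, equal-length ratings");
        }
        int n = raterA.length;
        int agreements = 0;
        Map<String, Integer> countA = new HashMap<>();
        Map<String, Integer> countB = new HashMap<>();
        for (int i = 0; i < n; i++) {
            if (raterA[i].equals(raterB[i])) {
                agreements++;
            }
            countA.merge(raterA[i], 1, Integer::sum);
            countB.merge(raterB[i], 1, Integer::sum);
        }
        // Observed agreement: share of items on which the raters match.
        double observed = (double) agreements / n;
        // Expected agreement by chance, from each rater's label frequencies.
        double expected = 0.0;
        for (Map.Entry<String, Integer> entry : countA.entrySet()) {
            int inB = countB.getOrDefault(entry.getKey(), 0);
            expected += ((double) entry.getValue() / n) * ((double) inB / n);
        }
        if (expected == 1.0) {
            return Double.NaN; // kappa is undefined when both raters use one label
        }
        return (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Hypothetical judgments on whether a domain abstraction is present.
        String[] a = {"present", "present", "missing", "missing"};
        String[] b = {"present", "missing", "missing", "missing"};
        System.out.println(kappa(a, b));
    }
}
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (for example "missing") dominates the ratings.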
minor comments (2)
- [Abstract] Abstract: the three conditions (PreAI, PostAI, PureAI) are introduced without a brief parenthetical definition; adding one sentence would improve immediate readability for readers outside the subfield.
- [Results] Results: a summary table reporting means (or medians) and ranges for the key OOD metrics across the three conditions would make the comparative claims easier to evaluate at a glance.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes planned for the revised manuscript to enhance methodological transparency and strengthen the interpretations.
- Referee: [Method] Method section: the description of PureAI generation provides no information on the exact prompts, whether OOD-specific guidance (e.g., on decomposition or responsibility assignment) was included, or any sensitivity analysis across prompting strategies. Because the central claim attributes oversimplification to the absence of human guidance rather than to prompting choices, the lack of prompt detail is load-bearing; an underspecified or generic prompt could produce the observed missing abstractions without demonstrating an inherent LLM limitation.
  Authors: We agree that the absence of prompt details weakens the ability to attribute findings specifically to the lack of human guidance. In the revised Method section, we will provide the exact prompts used for each of the three LLMs in the PureAI condition. These prompts were intentionally kept generic to reflect typical end-to-end usage without explicit OOD instructions on decomposition or responsibility assignment. We will also add a sensitivity analysis subsection that reports results from minor prompt variations to show consistency in the observed oversimplification patterns.
  Revision: yes
- Referee: [Method and Results] Method and Results sections: no sample sizes are reported for the PreAI or PostAI student projects, no statistical tests are described for metric comparisons, and no controls or matching for individual student skill/experience are mentioned despite the two offerings of the same assignment. These omissions prevent ruling out that observed differences between PreAI and PostAI (or between human-involved and PureAI) arise from cohort skill variation rather than authorship condition, directly undermining the claim that human guidance is necessary.
  Authors: We acknowledge that sample sizes, statistical tests, and explicit controls for student skill were not reported. This study is framed as a comparative case study on identical assignments across two temporally separated offerings rather than a controlled experiment. In the revision, we will report the exact sample sizes for the PreAI and PostAI cohorts. We will add descriptive statistical comparisons (e.g., effect sizes and non-parametric tests where appropriate) and explicitly discuss the absence of individual skill matching as a limitation, while noting that the same assignment and postgraduate level provide partial control for task and cohort comparability.
  Revision: partial
- Referee: [Discussion] Discussion: the interpretation that lower complexity and coupling in PureAI constitutes 'oversimplification' rests on the domain-modeling assessment, yet no inter-rater reliability, rubric, or validation of that assessment is provided. Without such grounding, the link between the quantitative metrics and the qualitative claim of weaker responsibility separation remains subjective and load-bearing for the conclusion.
  Authors: We will expand the Discussion to include the specific criteria and rubric applied during the domain-modeling assessment, along with illustrative examples from the projects. Although formal inter-rater reliability statistics were not calculated, the evaluation was performed by the author team with substantial experience in object-oriented design. We will clarify how the qualitative observations of missing abstractions and responsibility separation directly support the interpretation of the quantitative metrics as evidence of oversimplification.
  Revision: yes
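The second response promises effect sizes and non-parametric comparisons without naming a statistic. One option that fits small, non-normal samples of metric values is the Vargha-Delaney A measure, sketched below. Choosing this statistic is an assumption made here for illustration, as are the class name `VarghaDelaney`, the method `a12`, and the sample values.

```java
// Sketch only: Vargha-Delaney A effect size. The statistic choice and the
// sample values are assumptions for illustration, not taken from the paper.
class VarghaDelaney {
    /**
     * Estimates the probability that a value drawn from `first` exceeds a
     * value drawn from `second`, counting ties as half.
     */
    static double a12(double[] first, double[] second) {
        if (first.length == 0 || second.length == 0) {
            throw new IllegalArgumentException("both samples must be non-empty");
        }
        double score = 0.0;
        for (double x : first) {
            for (double y : second) {
                if (x > y) {
                    score += 1.0;
                } else if (x == y) {
                    score += 0.5;
                }
            }
        }
        return score / ((double) first.length * second.length);
    }

    public static void main(String[] args) {
        // Hypothetical per-project coupling values for two conditions.
        double[] conditionOne = {4, 5, 6};
        double[] conditionTwo = {1, 5, 7};
        System.out.println(a12(conditionOne, conditionTwo));
    }
}
```

A value of 0.5 means neither condition tends to score higher; values near 0 or 1 indicate that one condition dominates. Like any such comparison, it does not address the cohort-skill confound raised by the referee.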
Circularity Check
No circularity: empirical case study with independent measurements
full rationale
The paper is a comparative case study that selects existing postgraduate Java assignments for PreAI and PostAI cohorts, generates PureAI projects via LLMs, and reports direct measurements of project-level OOD metrics, code smell density, and domain modeling. No equations, derivations, fitted parameters, or predictions appear; the central claim follows from observed differences in these independent artifacts rather than any self-definitional loop, fitted-input renaming, or self-citation chain. The study is self-contained against its chosen benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Project-level OOD metrics, code smell density, and domain modeling accurately capture design quality differences attributable to authorship.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: We analyzed OOD quality using project-level OOD metrics, code smell density, and domain modeling.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: PureAI projects show lower code smell density and generally appear simpler in terms of total size, complexity, and coupling.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.