Can LLMs Produce Better Object-Oriented Designs than Human-Involved Development?

Elliott Wen; Ewan Tempero; Zushuai Zhang

arxiv: 2605.19901 · v1 · pith:STMICVLHnew · submitted 2026-05-19 · 💻 cs.SE

Pith reviewed 2026-05-20 04:06 UTC · model grok-4.3

classification 💻 cs.SE

keywords: LLM code generation, object-oriented design, OOD metrics, code smells, domain modeling, human-AI collaboration, Java projects, software design quality

The pith

LLM-generated object-oriented designs simplify code but miss key abstractions and responsibility assignments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The study compares object-oriented design quality across three sets of postgraduate Java projects: those created by humans before LLMs became common, those created by humans after LLMs, and those generated entirely by current LLMs. LLM-only projects show fewer code smells and lower overall size, complexity, and coupling than the human-involved ones. This apparent improvement, however, stems from oversimplification that leaves out important domain concepts and fails to divide responsibilities clearly among classes. Human projects created after LLMs appeared already trend toward similar simplification patterns. The work therefore concludes that human guidance on how to break down a problem into objects and assign duties to them continues to add value even when LLMs assist with coding.

Core claim

PureAI projects generated end-to-end by contemporary LLMs exhibit lower code smell density and generally smaller size, complexity, and coupling than PreAI and PostAI human-involved projects, yet this pattern aligns with missing abstractions and weaker responsibility separation in domain modeling; PostAI projects sit closer to PureAI than to PreAI on many of the same measures.

What carries the argument

Comparative case study applying project-level OOD metrics, code smell density, and domain modeling to the same postgraduate Java assignment under PreAI, PostAI, and PureAI authorship conditions.
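
As a rough illustration of one of the measures named above, code smell density is commonly computed as the number of detected smells normalized by code size. The sketch below assumes smells per thousand lines of code (KLOC); the paper's exact normalization is not stated in this summary, and the `Project` type, project names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    """Hypothetical per-project measurements (not data from the paper)."""
    name: str
    condition: str      # "PreAI", "PostAI", or "PureAI"
    smell_count: int    # smells reported by some detector
    lines_of_code: int  # size used for normalization


def smell_density(project: Project) -> float:
    """Smells per KLOC; this normalization is an assumption of the sketch."""
    if project.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return project.smell_count / (project.lines_of_code / 1000)


projects = [
    Project("pre-01", "PreAI", smell_count=18, lines_of_code=2400),
    Project("pure-01", "PureAI", smell_count=3, lines_of_code=800),
]

for p in projects:
    print(f"{p.name} ({p.condition}): {smell_density(p):.2f} smells/KLOC")
```

A lower density on its own says nothing about whether the needed domain classes exist, which is why the study pairs this measure with domain modeling.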

If this is right

PureAI designs reduce code smells mainly by producing simpler, less coupled structures that omit needed domain abstractions.
Responsibility assignment and decomposition in object-oriented work continue to require human input for balanced results.
Human projects produced after LLMs became available already display some of the same oversimplification seen in PureAI outputs.
Lower code-smell counts in LLM outputs do not automatically signal stronger overall object-oriented quality.
Human guidance on decomposition and responsibility assignment improves OOD outcomes when LLMs are used.
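
To make the oversimplification point concrete, here is a minimal sketch of two designs for a hypothetical library-lending domain. It is not the paper's assignment, and it is written in Python rather than the Java used in the study. The first design is smaller and less coupled but folds every domain concept into one class; the second names the concepts and gives each its own responsibility. The `class_names` helper is an illustrative stand-in for real OOD tooling, not a metric from the paper.

```python
import ast

# Oversimplified: one class holds all state as plain dictionaries and strings.
FLAT_DESIGN = '''
class Library:
    def __init__(self):
        self.loans = {}  # book title -> member name

    def lend(self, title, member):
        self.loans[title] = member
'''

# Decomposed: Book, Member, and Loan are explicit domain abstractions.
DOMAIN_DESIGN = '''
class Book:
    def __init__(self, title):
        self.title = title

class Member:
    def __init__(self, name):
        self.name = name

class Loan:
    def __init__(self, book, member):
        self.book = book
        self.member = member

class Library:
    def __init__(self):
        self.loans = []

    def lend(self, book, member):
        self.loans.append(Loan(book, member))
'''


def class_names(source: str) -> list[str]:
    """Return the sorted names of all classes defined in the source text."""
    tree = ast.parse(source)
    return sorted(n.name for n in ast.walk(tree) if isinstance(n, ast.ClassDef))


print(class_names(FLAT_DESIGN))
print(class_names(DOMAIN_DESIGN))
```

The flat design would likely score better on size and coupling, yet it has nowhere to put rules such as loan periods or member limits, which is the kind of gap the domain-modeling analysis is meant to surface.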

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams may need explicit review steps focused on class responsibilities when incorporating LLM-generated code into larger systems.
Prompt strategies that emphasize domain modeling could narrow the abstraction gap observed here.
The pattern may appear in other design-heavy tasks where LLMs are asked to produce multi-component solutions.

Load-bearing premise

The chosen postgraduate Java assignment and the selected OOD metrics capture general object-oriented design quality, and any measured differences arise from the authorship conditions rather than from student skill variation or prompting differences.

What would settle it

A replication on a different or more complex assignment that finds PureAI projects matching or exceeding human projects in number of domain abstractions and quality of responsibility separation would undermine the central claim.

Original abstract

Background: Large Language Models (LLMs) are increasingly used for code generation. However, their ability to generate multi-class projects that require object-oriented design (OOD) remains unclear, especially relative to projects developed with human involvement. Aims: The primary objective of this study is to compare OOD quality in projects from three authorship conditions: PreAI (human-involved projects produced before widespread LLM use), PostAI (human-involved projects produced after widespread LLM use), and PureAI (projects generated end-to-end by contemporary LLMs). Method: We conducted a comparative case study on a postgraduate Java assignment. Two offerings of the same assignment were selected as the PreAI and PostAI datasets. PureAI projects were generated using three contemporary LLMs. We analyzed OOD quality using project-level OOD metrics, code smell density, and domain modeling. Results: Relative to human-involved projects, PureAI projects show lower code smell density and generally appear simpler in terms of total size, complexity, and coupling. However, this is consistent with oversimplification, as it is associated with missing abstractions and weaker responsibility separation. PostAI is closer to PureAI than PreAI on many OOD measures and also shows tendencies toward oversimplification. Conclusions: Our findings indicate that appropriate human guidance on object-oriented decomposition and responsibility assignment remains important when LLMs are used for object-oriented design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a comparative case study of object-oriented design (OOD) quality across three authorship conditions on the same postgraduate Java assignment: PreAI (human-developed projects before widespread LLM adoption), PostAI (human-developed projects after LLM adoption), and PureAI (end-to-end LLM-generated projects using three contemporary models). OOD quality is assessed via project-level metrics, code-smell density, and domain modeling. Results indicate that PureAI projects exhibit lower code-smell density and greater simplicity (smaller size, lower complexity and coupling) but also missing abstractions and weaker responsibility separation, which the authors interpret as oversimplification. PostAI projects lie closer to PureAI than to PreAI on many measures. The central conclusion is that appropriate human guidance on decomposition and responsibility assignment remains important when LLMs are used for OOD.

Significance. If the central attribution holds after addressing confounds, the work supplies concrete empirical evidence that current LLMs tend to produce oversimplified OOD artifacts relative to human-involved development. This finding would be useful for software-engineering education and for guidelines on human-AI collaboration in design tasks. The multi-metric approach (project-level metrics plus code smells plus domain modeling) is a strength, as is the use of two temporally separated human cohorts to bracket the LLM era.

major comments (3)

[Method] Method section: the description of PureAI generation provides no information on the exact prompts, whether OOD-specific guidance (e.g., on decomposition or responsibility assignment) was included, or any sensitivity analysis across prompting strategies. Because the central claim attributes oversimplification to the absence of human guidance rather than to prompting choices, the lack of prompt detail is load-bearing; an underspecified or generic prompt could produce the observed missing abstractions without demonstrating an inherent LLM limitation.
[Method and Results] Method and Results sections: no sample sizes are reported for the PreAI or PostAI student projects, no statistical tests are described for metric comparisons, and no controls or matching for individual student skill/experience are mentioned despite the two offerings of the same assignment. These omissions prevent ruling out that observed differences between PreAI and PostAI (or between human-involved and PureAI) arise from cohort skill variation rather than authorship condition, directly undermining the claim that human guidance is necessary.
[Discussion] Discussion: the interpretation that lower complexity and coupling in PureAI constitutes 'oversimplification' rests on the domain-modeling assessment, yet no inter-rater reliability, rubric, or validation of that assessment is provided. Without such grounding, the link between the quantitative metrics and the qualitative claim of weaker responsibility separation remains subjective and load-bearing for the conclusion.
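
The inter-rater reliability requested in the last major comment is often reported as Cohen's kappa. The following is a minimal sketch, assuming two raters who each assign one categorical label to the same set of projects. The labels and ratings are hypothetical, and the paper does not say which statistic, if any, would be used.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments: does each project model a given domain concept?
rater_a = ["present", "present", "missing", "missing", "present", "missing"]
rater_b = ["present", "missing", "missing", "missing", "present", "missing"]
print(round(cohens_kappa(rater_a, rater_b), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a simple percentage when one label dominates.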

minor comments (2)

[Abstract] Abstract: the three conditions (PreAI, PostAI, PureAI) are introduced without a brief parenthetical definition; adding one sentence would improve immediate readability for readers outside the subfield.
[Results] Results: a summary table reporting means (or medians) and ranges for the key OOD metrics across the three conditions would make the comparative claims easier to evaluate at a glance.
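
The summary table suggested in the last minor comment could be produced along these lines. This is only a sketch: the metric (coupling) is used as an example, and every value is a placeholder rather than a result from the paper.

```python
from statistics import median

# Hypothetical coupling values per project, grouped by authorship condition.
coupling = {
    "PreAI": [14, 18, 22, 16, 20],
    "PostAI": [10, 12, 15, 11],
    "PureAI": [6, 7, 9],
}


def summarize(values: list[float]) -> tuple[float, float, float]:
    """Return (median, minimum, maximum) for one condition."""
    return median(values), min(values), max(values)


print(f"{'Condition':<10}{'Median':>8}{'Min':>6}{'Max':>6}")
for condition, values in coupling.items():
    med, low, high = summarize(values)
    print(f"{condition:<10}{med:>8.1f}{low:>6}{high:>6}")
```

Medians and ranges are shown rather than means because project-level metrics from small cohorts are often skewed.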

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the changes planned for the revised manuscript to enhance methodological transparency and strengthen the interpretations.

Point-by-point responses

Referee: [Method] Method section: the description of PureAI generation provides no information on the exact prompts, whether OOD-specific guidance (e.g., on decomposition or responsibility assignment) was included, or any sensitivity analysis across prompting strategies. Because the central claim attributes oversimplification to the absence of human guidance rather than to prompting choices, the lack of prompt detail is load-bearing; an underspecified or generic prompt could produce the observed missing abstractions without demonstrating an inherent LLM limitation.

Authors: We agree that the absence of prompt details weakens the ability to attribute findings specifically to the lack of human guidance. In the revised Method section, we will provide the exact prompts used for each of the three LLMs in the PureAI condition. These prompts were intentionally kept generic to reflect typical end-to-end usage without explicit OOD instructions on decomposition or responsibility assignment. We will also add a sensitivity analysis subsection that reports results from minor prompt variations to show consistency in the observed oversimplification patterns. (Revision: yes)
Referee: [Method and Results] Method and Results sections: no sample sizes are reported for the PreAI or PostAI student projects, no statistical tests are described for metric comparisons, and no controls or matching for individual student skill/experience are mentioned despite the two offerings of the same assignment. These omissions prevent ruling out that observed differences between PreAI and PostAI (or between human-involved and PureAI) arise from cohort skill variation rather than authorship condition, directly undermining the claim that human guidance is necessary.

Authors: We acknowledge that sample sizes, statistical tests, and explicit controls for student skill were not reported. This study is framed as a comparative case study on identical assignments across two temporally separated offerings rather than a controlled experiment. In the revision, we will report the exact sample sizes for the PreAI and PostAI cohorts. We will add descriptive statistical comparisons (e.g., effect sizes and non-parametric tests where appropriate) and explicitly discuss the absence of individual skill matching as a limitation, while noting that the same assignment and postgraduate level provide partial control for task and cohort comparability. (Revision: partial)
Referee: [Discussion] Discussion: the interpretation that lower complexity and coupling in PureAI constitutes 'oversimplification' rests on the domain-modeling assessment, yet no inter-rater reliability, rubric, or validation of that assessment is provided. Without such grounding, the link between the quantitative metrics and the qualitative claim of weaker responsibility separation remains subjective and load-bearing for the conclusion.

Authors: We will expand the Discussion to include the specific criteria and rubric applied during the domain-modeling assessment, along with illustrative examples from the projects. Although formal inter-rater reliability statistics were not calculated, the evaluation was performed by the author team with substantial experience in object-oriented design. We will clarify how the qualitative observations of missing abstractions and responsibility separation directly support the interpretation of the quantitative metrics as evidence of oversimplification. (Revision: yes)
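
The effect sizes promised in the second response could, for example, take the form of the Vargha and Delaney A statistic, which appears in the paper's reference list. A minimal sketch under that assumption, with hypothetical sample values rather than the paper's data:

```python
def vargha_delaney_a(xs: list[float], ys: list[float]) -> float:
    """Probability that a random x exceeds a random y, counting ties as half."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Hypothetical per-project complexity values for two conditions.
pre_ai = [30, 42, 35, 50]
pure_ai = [20, 30, 25]
print(vargha_delaney_a(pre_ai, pure_ai))
```

A value of 0.5 indicates no tendency for either group to be larger, while values near 0 or 1 indicate that one group dominates. Because the statistic depends only on pairwise orderings, it suits the small, non-normal samples typical of a case study.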

Circularity Check

0 steps flagged

No circularity: empirical case study with independent measurements

The paper is a comparative case study that selects existing postgraduate Java assignments for PreAI and PostAI cohorts, generates PureAI projects via LLMs, and reports direct measurements of project-level OOD metrics, code smell density, and domain modeling. No equations, derivations, fitted parameters, or predictions appear; the central claim follows from observed differences in these independent artifacts rather than any self-definitional loop, fitted-input renaming, or self-citation chain. The study is self-contained against its chosen benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study relies on standard software engineering assumptions about metric validity without introducing new free parameters or entities.

axioms (1)

Domain assumption: Project-level OOD metrics, code smell density, and domain modeling accurately capture design quality differences attributable to authorship.
Invoked in the method and results sections to interpret differences between conditions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We analyzed OOD quality using project-level OOD metrics, code smell density, and domain modeling.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: PureAI projects show lower code smell density and generally appear simpler in terms of total size, complexity, and coupling.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[2] Google. OpenAI compatibility. https://ai.google.dev/gemini-api/docs/openai#thinking

[3] Brian Henderson-Sellers, Larry L Constantine, and Ian M Graham. Coupling and cohesion (towards a valid metrics suite for object-oriented analysis and design). Object oriented systems, 3(3):143–158.

[4] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.

[5] International Organization for Standardization. ISO/IEC 19502:2005 – Information technology – Meta Object Facility (MOF). https://www.iso.org/standard/32621.html, November

[6] Thomas Leich, Sven Apel, Laura Marnitz, and Gunter Saake. Tool support for feature-oriented software development: featureide: an eclipse-based approach. In Proceedings of the 2005 OOPSLA workshop on Eclipse technology eXchange, pages 55–59.

[7] Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. DevBench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 3.

[8] Sherlock A Licorish, Ansh Bajpai, Chetan Arora, Fanyu Wang, and Kla Tantithamthavorn. Comparing human and llm generated code: The jury is still out! arXiv preprint arXiv:2501.16857.

[9] Kaiyuan Liu, Youcheng Pan, Yang Xiang, Daojing He, Jing Li, Yexing Du, and Tianrun Gao. ProjectEval: A benchmark for programming agents automated evaluation on project-level code generation. In Findings of the Association for Computational Linguistics: ACL 2025, pages 20205–20221.

[10] OpenAI. Using GPT-5.4. https://developers.openai.com/api/docs/guides/latest-model#gpt-54-parameter-compatibility

[11] Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Fausto Fasano, Rocco Oliveto, and Andrea De Lucia. On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation. In Proceedings of the 40th International Conference on Software Engineering, pages 482–482, May

[12] András Vargha and Harold D Delaney. A critique and improvement of the cl common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25(2):101–132.

[13] Bingxu Xiao, Yunwei Dong, Yiqi Tang, Manqing Zhang, Yifan Zhou, Chunyan Ma, and Yepang Liu. Oodeval: Evaluating large language models on object-oriented design. arXiv preprint arXiv:2601.07602.

[14] Feiyang Xu, Poonacha K Medappa, Murat M Tunc, Martijn Vroegindeweij, and Jan C Fransoo. Ai-assisted programming decreases the productivity of experienced developers by increasing the technical debt and maintenance burden. arXiv preprint arXiv:2510.10165.

[15] Morteza Zakeri-Nasrabadi, Saeed Parsa, Ehsan Esmaili, and Fabio Palomba. A systematic literature review on the code smells datasets and validation mechanisms. ACM Computing Surveys, 55(13s):1–48, 2023.
