On the Limitations of Large Language Models for Conceptual Database Modeling
Pith reviewed 2026-05-13 05:17 UTC · model grok-4.3
The pith
Large language models lose reliability when generating entity-relationship diagrams from increasingly complex natural language requirements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases.
What carries the argument
The experimental comparison of LLM-generated ER diagrams against natural language requirements of increasing complexity, using zero-shot, chain-of-thought, and chain-of-thought-plus-verifier prompting.
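The paper does not reproduce its exact prompt wording, so the three conditions can only be sketched. A minimal illustration of how the zero-shot, chain-of-thought, and verifier variants might be parameterized (template names and wording are assumptions, not the authors' prompts):

```python
# Illustrative prompt templates for the three prompting conditions.
# The wording here is an assumption; the paper's actual prompts are not given.
TEMPLATES = {
    "zero_shot": (
        "Generate an Entity-Relationship diagram for the following "
        "requirements:\n{req}"
    ),
    "cot": (
        "Think step by step: first list the entities, then their attributes, "
        "then the relationships and their cardinalities. Only then output the "
        "ER diagram for these requirements:\n{req}"
    ),
    "cot_verifier": (
        "Think step by step: list entities, attributes, relationships, and "
        "cardinalities, then output the ER diagram. Afterwards, re-read the "
        "requirements and verify that every modeled element is justified by "
        "the text; correct any mismatch before the final answer:\n{req}"
    ),
}

def build_prompt(technique: str, requirements: str) -> str:
    """Fill the chosen template with the natural-language requirements."""
    return TEMPLATES[technique].format(req=requirements)
```

The verifier condition differs from plain chain-of-thought only in the added self-checking instruction, which matches the paper's description of Chain of Thought + Verifier as a prompting technique rather than a separate model.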
Load-bearing premise
That the requirements scenarios of rising complexity represent typical real-world database modeling problems, and that a single qualitative comparison by the authors suffices to assess whether the diagrams match the requirements structurally and semantically.
What would settle it
Running the same models on a collection of actual industry database requirement documents, evaluated by several independent experts using a scored rubric, would settle it: the claim would be falsified if the experts found consistently high structural and semantic accuracy even in the most complex cases.
Original abstract
This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates three large language models using three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier) to generate ER diagrams from natural language requirements. It applies these to three scenarios of increasing complexity and performs qualitative direct comparison of the outputs against the source text, concluding that LLMs perform reasonably in simple cases but exhibit rising inconsistencies, ambiguities, and constraint failures as complexity grows, rendering them insufficiently mature for reliable use in complex conceptual database modeling.
Significance. If the central findings hold after methodological strengthening, the work would offer timely empirical evidence on LLM limitations in structured conceptual modeling tasks, a key area of AI-assisted software engineering. The multi-model, multi-prompt design and focus on constraint representation are positive elements that could help practitioners weigh productivity gains against validation costs.
major comments (3)
- [Section 4 (Results)] Section 4 (Results): The central claim of a complexity-dependent reliability decrease rests entirely on qualitative direct comparison; no quantitative metrics (e.g., counts of missing entities/relationships, constraint coverage scores, or precision on attribute types) or evaluation rubric are defined or reported, so observed differences across scenarios could reflect subjective interpretation rather than model behavior.
- [Section 3 (Methodology)] Section 3 (Methodology): No inter-rater reliability statistic, second independent evaluator, or agreement measure is mentioned for the qualitative analysis, which is load-bearing for the assertion that inconsistencies and ambiguities increase with complexity.
- [Section 3.2 (Scenarios)] Section 3.2 (Scenarios): The three requirements scenarios are presented without external validation, benchmark comparison, or justification that their complexity progression mirrors real-world database modeling challenges, weakening the generalizability of the maturity conclusion.
minor comments (2)
- [Abstract] Abstract: The specific LLMs (including versions and access details) are not named, which reduces immediate reproducibility even though the prompting techniques are described.
- [Figures] Figures: The ER diagrams would be clearer if the text annotations or captions explicitly marked the inconsistencies and constraint failures discussed in the analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below and note the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: Section 4 (Results): The central claim of a complexity-dependent reliability decrease rests entirely on qualitative direct comparison; no quantitative metrics (e.g., counts of missing entities/relationships, constraint coverage scores, or precision on attribute types) or evaluation rubric are defined or reported, so observed differences across scenarios could reflect subjective interpretation rather than model behavior.
Authors: We agree that the results section would benefit from quantitative support. In the revision we will introduce a clear evaluation rubric and report counts of missing or incorrect entities, relationships, attributes, and constraint violations for each scenario and prompting method. These metrics will be presented alongside the qualitative analysis to make the complexity-dependent trend more objective. revision: yes
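The rubric the authors promise could be operationalized as set comparisons of modeled elements against a gold reference model, yielding exactly the counts the referee asks for. A minimal sketch, in which the entity names and the gold set are illustrative assumptions:

```python
def diagram_metrics(gold: set, predicted: set) -> dict:
    """Compare one element type (entities, relationships, or constraints)
    from a generated diagram against a gold reference model, returning
    the missing and spurious elements plus precision and recall."""
    matched = gold & predicted
    missing = gold - predicted
    spurious = predicted - gold
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return {
        "missing": missing,
        "spurious": spurious,
        "precision": precision,
        "recall": recall,
    }

# Hypothetical example: gold entities vs. entities in an LLM's diagram.
gold_entities = {"Customer", "Order", "Product"}
llm_entities = {"Customer", "Order", "Invoice"}
m = diagram_metrics(gold_entities, llm_entities)
```

Reporting these four values per element type, scenario, and prompting method would make the claimed complexity-dependent decline directly inspectable.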
Referee: Section 3 (Methodology): No inter-rater reliability statistic, second independent evaluator, or agreement measure is mentioned for the qualitative analysis, which is load-bearing for the assertion that inconsistencies and ambiguities increase with complexity.
Authors: The qualitative comparisons were performed by the authors with internal cross-checking. We accept that an independent second evaluator and agreement measure would increase confidence in the findings. We will add this step in the revision: a second evaluator will assess a representative sample of outputs and we will report Cohen’s kappa (or equivalent) for entity identification, relationship correctness, and constraint handling. revision: yes
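For two evaluators labeling the same set of modeled elements (for example, "ok" vs. "bad" per element), Cohen's kappa is straightforward to compute. The per-element judgments below are illustrative, not data from the paper:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters coincide.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two raters' label rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two independent evaluators on four elements.
a = ["ok", "ok", "bad", "bad"]
b = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(a, b)  # 0.5 for this example
```

In practice one would report a kappa per dimension (entity identification, relationship correctness, constraint handling), as the rebuttal proposes.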
Referee: Section 3.2 (Scenarios): The three requirements scenarios are presented without external validation, benchmark comparison, or justification that their complexity progression mirrors real-world database modeling challenges, weakening the generalizability of the maturity conclusion.
Authors: The scenarios were deliberately constructed to increase in the number of entities, relationships, attributes, and constraints. We will expand Section 3.2 with an explicit justification of this progression, including simple quantitative indicators of complexity. External benchmark validation is not feasible at present because no established public datasets exist for ER-diagram generation from requirements; the progressive design nevertheless illustrates the observed reliability decline. revision: partial
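The "simple quantitative indicators of complexity" the authors promise could be element counts over a structured description of each scenario. The dict layout and the sample scenario below are assumptions for illustration, not the paper's format:

```python
def complexity_indicators(scenario: dict) -> dict:
    """Count the structural elements of a requirements scenario.
    The scenario dict layout is an assumed format, not the paper's."""
    entities = scenario.get("entities", [])
    return {
        "entities": len(entities),
        "relationships": len(scenario.get("relationships", [])),
        "attributes": sum(len(e.get("attributes", [])) for e in entities),
        "constraints": len(scenario.get("constraints", [])),
    }

# Hypothetical "simple" scenario from the progression.
simple = {
    "entities": [
        {"name": "Customer", "attributes": ["id", "name"]},
        {"name": "Order", "attributes": ["id", "date"]},
    ],
    "relationships": [("Customer", "places", "Order")],
    "constraints": ["each Order belongs to exactly one Customer"],
}
ind = complexity_indicators(simple)
```

Tabulating these counts per scenario would make the claimed complexity progression explicit and comparable across studies.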
Circularity Check
No circularity: direct empirical comparison of LLM outputs to input requirements
Full rationale
The paper performs an experimental evaluation by feeding three LLMs three prompting techniques on three natural-language requirements scenarios of increasing complexity, then qualitatively comparing the generated ER diagrams directly against the source text for structural and semantic adherence. No equations, fitted parameters, predictions, or derivations appear; the central claim that reliability decreases with complexity rests on this independent comparison rather than any self-referential reduction or self-citation chain. The evaluation method is self-contained against the provided requirements text and does not import uniqueness theorems, ansatzes, or renamed known results from prior author work.