On the Limitations of Large Language Models for Conceptual Database Modeling
Pith reviewed 2026-05-13 05:17 UTC · model grok-4.3
The pith
Large language models lose reliability when generating entity-relationship diagrams from increasingly complex natural language requirements.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases.
What carries the argument
The experimental comparison of LLM-generated ER diagrams against natural language requirements of increasing complexity, using zero-shot, chain-of-thought, and chain-of-thought-plus-verifier prompting.
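The paper does not reproduce its exact prompt wording, so the three conditions can only be sketched. A minimal illustration of how the zero-shot, chain-of-thought, and verifier variants might be parameterized (template names and wording are assumptions, not the authors' prompts):

```python
# Illustrative prompt templates for the three prompting conditions.
# The wording here is an assumption; the paper's actual prompts are not given.
TEMPLATES = {
    "zero_shot": (
        "Generate an Entity-Relationship diagram for the following "
        "requirements:\n{req}"
    ),
    "cot": (
        "Think step by step: first list the entities, then their attributes, "
        "then the relationships and their cardinalities. Only then output the "
        "ER diagram for these requirements:\n{req}"
    ),
    "cot_verifier": (
        "Think step by step: list entities, attributes, relationships, and "
        "cardinalities, then output the ER diagram. Afterwards, re-read the "
        "requirements and verify that every modeled element is justified by "
        "the text; correct any mismatch before the final answer:\n{req}"
    ),
}

def build_prompt(technique: str, requirements: str) -> str:
    """Fill the chosen template with the natural-language requirements."""
    return TEMPLATES[technique].format(req=requirements)
```

The verifier condition differs from plain chain-of-thought only in the added self-checking instruction, which matches the paper's description of Chain of Thought + Verifier as a prompting technique rather than a separate model.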
Load-bearing premise
That the requirements scenarios of rising complexity represent typical real-world database modeling problems, and that a single qualitative comparison by the authors suffices to assess whether the diagrams match the requirements structurally and semantically.
What would settle it
Running the same models on a collection of actual industry database requirement documents, evaluated by several independent experts using a scored rubric, would settle it: the claim would be falsified if the experts found consistently high structural and semantic accuracy even in the most complex cases.
Original abstract
This article analyzes the use of Large Language Models (LLMs) as support for the conceptual modeling of relational databases through the automatic generation of Entity-Relationship (ER) diagrams from natural language requirements. The approach combines different language models with prompt engineering techniques to evaluate their ability to identify entities, relationships, and attributes in a conceptually consistent manner. The experimental evaluation involved three LLMs, each subjected to three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier), applied to the same requirements scenario with progressively increasing complexity. The generated diagrams were qualitatively analyzed through direct comparison with the textual requirements, considering the structural and semantic adherence of the modeled elements. The results indicate that, although LLMs show reasonable performance in less complex scenarios, their reliability decreases as the complexity of the requirements increases, with a rise in inconsistencies, ambiguities, and failures in representing constraints. These findings reinforce that, in their current state, LLMs are not sufficiently mature for reliable use in complex scenarios, and the cost of validation may offset the apparent productivity gains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates three large language models using three prompting techniques (Zero-Shot, Chain of Thought, and Chain of Thought + Verifier) to generate ER diagrams from natural language requirements. It applies these to three scenarios of increasing complexity and performs qualitative direct comparison of the outputs against the source text, concluding that LLMs perform reasonably in simple cases but exhibit rising inconsistencies, ambiguities, and constraint failures as complexity grows, rendering them insufficiently mature for reliable use in complex conceptual database modeling.
Significance. If the central findings hold after methodological strengthening, the work would offer timely empirical evidence on LLM limitations in structured conceptual modeling tasks, a key area of AI-assisted software engineering. The multi-model, multi-prompt design and focus on constraint representation are positive elements that could help practitioners weigh productivity gains against validation costs.
major comments (3)
- [Section 4 (Results)] Section 4 (Results): The central claim of a complexity-dependent reliability decrease rests entirely on qualitative direct comparison; no quantitative metrics (e.g., counts of missing entities/relationships, constraint coverage scores, or precision on attribute types) or evaluation rubric are defined or reported, so observed differences across scenarios could reflect subjective interpretation rather than model behavior.
- [Section 3 (Methodology)] Section 3 (Methodology): No inter-rater reliability statistic, second independent evaluator, or agreement measure is mentioned for the qualitative analysis, which is load-bearing for the assertion that inconsistencies and ambiguities increase with complexity.
- [Section 3.2 (Scenarios)] Section 3.2 (Scenarios): The three requirements scenarios are presented without external validation, benchmark comparison, or justification that their complexity progression mirrors real-world database modeling challenges, weakening the generalizability of the maturity conclusion.
minor comments (2)
- [Abstract] Abstract: The specific LLMs (including versions and access details) are not named, which reduces immediate reproducibility even though the prompting techniques are described.
- [Figures] Figures: The ER diagrams would be clearer if the text annotations or captions explicitly marked the inconsistencies and constraint failures discussed in the analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major comment below and note the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: Section 4 (Results): The central claim of a complexity-dependent reliability decrease rests entirely on qualitative direct comparison; no quantitative metrics (e.g., counts of missing entities/relationships, constraint coverage scores, or precision on attribute types) or evaluation rubric are defined or reported, so observed differences across scenarios could reflect subjective interpretation rather than model behavior.
Authors: We agree that the results section would benefit from quantitative support. In the revision we will introduce a clear evaluation rubric and report counts of missing or incorrect entities, relationships, attributes, and constraint violations for each scenario and prompting method. These metrics will be presented alongside the qualitative analysis to make the complexity-dependent trend more objective. revision: yes
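The rubric the authors promise could be operationalized as set comparisons of modeled elements against a gold reference model, yielding exactly the counts the referee asks for. A minimal sketch, in which the entity names and the gold set are illustrative assumptions:

```python
def diagram_metrics(gold: set, predicted: set) -> dict:
    """Compare one element type (entities, relationships, or constraints)
    from a generated diagram against a gold reference model, returning
    the missing and spurious elements plus precision and recall."""
    matched = gold & predicted
    missing = gold - predicted
    spurious = predicted - gold
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return {
        "missing": missing,
        "spurious": spurious,
        "precision": precision,
        "recall": recall,
    }

# Hypothetical example: gold entities vs. entities in an LLM's diagram.
gold_entities = {"Customer", "Order", "Product"}
llm_entities = {"Customer", "Order", "Invoice"}
m = diagram_metrics(gold_entities, llm_entities)
```

Reporting these four values per element type, scenario, and prompting method would make the claimed complexity-dependent decline directly inspectable.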
Referee: Section 3 (Methodology): No inter-rater reliability statistic, second independent evaluator, or agreement measure is mentioned for the qualitative analysis, which is load-bearing for the assertion that inconsistencies and ambiguities increase with complexity.
Authors: The qualitative comparisons were performed by the authors with internal cross-checking. We accept that an independent second evaluator and agreement measure would increase confidence in the findings. We will add this step in the revision: a second evaluator will assess a representative sample of outputs and we will report Cohen’s kappa (or equivalent) for entity identification, relationship correctness, and constraint handling. revision: yes
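For two evaluators labeling the same set of modeled elements (for example, "ok" vs. "bad" per element), Cohen's kappa is straightforward to compute. The per-element judgments below are illustrative, not data from the paper:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters coincide.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence of the two raters' label rates.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical judgments from two independent evaluators on four elements.
a = ["ok", "ok", "bad", "bad"]
b = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(a, b)  # 0.5 for this example
```

In practice one would report a kappa per dimension (entity identification, relationship correctness, constraint handling), as the rebuttal proposes.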
Referee: Section 3.2 (Scenarios): The three requirements scenarios are presented without external validation, benchmark comparison, or justification that their complexity progression mirrors real-world database modeling challenges, weakening the generalizability of the maturity conclusion.
Authors: The scenarios were deliberately constructed to increase in the number of entities, relationships, attributes, and constraints. We will expand Section 3.2 with an explicit justification of this progression, including simple quantitative indicators of complexity. External benchmark validation is not feasible at present because no established public datasets exist for ER-diagram generation from requirements; the progressive design nevertheless illustrates the observed reliability decline. revision: partial
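The "simple quantitative indicators of complexity" the authors promise could be element counts over a structured description of each scenario. The dict layout and the sample scenario below are assumptions for illustration, not the paper's format:

```python
def complexity_indicators(scenario: dict) -> dict:
    """Count the structural elements of a requirements scenario.
    The scenario dict layout is an assumed format, not the paper's."""
    entities = scenario.get("entities", [])
    return {
        "entities": len(entities),
        "relationships": len(scenario.get("relationships", [])),
        "attributes": sum(len(e.get("attributes", [])) for e in entities),
        "constraints": len(scenario.get("constraints", [])),
    }

# Hypothetical "simple" scenario from the progression.
simple = {
    "entities": [
        {"name": "Customer", "attributes": ["id", "name"]},
        {"name": "Order", "attributes": ["id", "date"]},
    ],
    "relationships": [("Customer", "places", "Order")],
    "constraints": ["each Order belongs to exactly one Customer"],
}
ind = complexity_indicators(simple)
```

Tabulating these counts per scenario would make the claimed complexity progression explicit and comparable across studies.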
Circularity Check
No circularity: direct empirical comparison of LLM outputs to input requirements
Full rationale
The paper performs an experimental evaluation by feeding three LLMs three prompting techniques on three natural-language requirements scenarios of increasing complexity, then qualitatively comparing the generated ER diagrams directly against the source text for structural and semantic adherence. No equations, fitted parameters, predictions, or derivations appear; the central claim that reliability decreases with complexity rests on this independent comparison rather than any self-referential reduction or self-citation chain. The evaluation method is self-contained against the provided requirements text and does not import uniqueness theorems, ansatzes, or renamed known results from prior author work.