pith. machine review for the scientific record.

arxiv: 2604.03057 · v1 · submitted 2026-04-03 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Querying Structured Data Through Natural Language Using Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords natural language querying · structured data · LLM fine-tuning · synthetic data · query generation · multilingual · resource-constrained

The pith

Fine-tuning a compact LLM on synthetic data enables accurate natural language querying of structured datasets on standard hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to let users query structured datasets in natural language by training a model to output executable queries, rather than relying on retrieval methods that falter on numbers and structure. A pipeline creates synthetic question-answer pairs reflecting user intent and the data's semantics. The authors fine-tune an 8-billion-parameter model with parameter-efficient methods so it runs on ordinary computers. Evaluation on Spanish accessibility data yields high accuracy even when switching languages or querying new locations. This opens the way to precise query systems without expensive large models.

Core claim

The authors demonstrate that a synthetic data generation pipeline followed by QLoRA fine-tuning of DeepSeek R1 Distill 8B produces a model capable of generating accurate executable queries for structured data on accessibility services, maintaining performance across monolingual, multilingual, and unseen location test cases.

What carries the argument

Synthetic training data generation pipeline producing diverse question-answer pairs that capture user intent and dataset semantics, used to fine-tune a quantized 8B model via QLoRA.
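A template-driven pipeline of this kind can be sketched in a few lines. The field names below (`service_type`, `town`, `distance_m`) and the query syntax are illustrative assumptions, not the paper's actual schema:

```python
import json
import random

# Hypothetical templates: question paired with the executable query it maps to.
# The real pipeline's templates and query language are not specified in this review.
TEMPLATES = [
    ("How many {service} locations are there in {town}?",
     "count(service_type == '{service}' and town == '{town}')"),
    ("What is the nearest {service} to the centre of {town}?",
     "min_by(distance_m, service_type == '{service}' and town == '{town}')"),
]

def generate_pairs(services, towns, n, seed=0):
    """Sample n synthetic (question, query) training pairs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        q_tpl, query_tpl = rng.choice(TEMPLATES)
        s, t = rng.choice(services), rng.choice(towns)
        pairs.append({
            "question": q_tpl.format(service=s, town=t),
            "query": query_tpl.format(service=s, town=t),
        })
    return pairs

pairs = generate_pairs(["pharmacy", "school"], ["Durango", "Elorrio"], 4)
print(json.dumps(pairs[0], indent=2))
```

Diversity in the real pipeline would come from many more templates plus paraphrasing; the load-bearing question is whether such generated pairs match the distribution of real user questions.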

Load-bearing premise

The synthetic question-answer pairs accurately mirror real user questions and the true semantics of the structured dataset.

What would settle it

The model producing incorrect or invalid queries when evaluated on a set of real human questions about the same accessibility data or on a new but similar structured dataset.
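Such a falsification test could be run as a small execution-accuracy harness: a predicted query counts as correct only if it executes and returns the gold result. `run_query` and the toy examples below are hypothetical stand-ins for the paper's (unspecified) query backend:

```python
def execution_accuracy(examples, run_query):
    """Fraction of examples whose predicted query returns the gold result."""
    correct = 0
    for ex in examples:
        try:
            pred = run_query(ex["predicted_query"])
        except Exception:
            continue  # an invalid query counts as a miss
        if pred == run_query(ex["gold_query"]):
            correct += 1
    return correct / len(examples) if examples else 0.0

# Toy lookup standing in for the real structured dataset.
toy_db = {"count_pharmacies": 3, "count_schools": 5}
examples = [
    {"predicted_query": "count_pharmacies", "gold_query": "count_pharmacies"},
    {"predicted_query": "count_schools", "gold_query": "count_pharmacies"},
    {"predicted_query": "bad_query", "gold_query": "count_schools"},
]
acc = execution_accuracy(examples, lambda q: toy_db[q])
print(acc)  # 1 of 3 correct
```

Run against real human questions or a new but similar dataset, a low score here would undercut the generalization claim.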

Figures

Figures reproduced from arXiv: 2604.03057 by Bunea Andrei-Alexandru, Hontan Valentin-Micu, Popovici Dan-Matei, Tantaroudas Nikolaos Dimitrios.

Figure 1
Figure 1. Overview of the model inference and query generation process: the model generates executable queries, retrieves the corresponding data, and then reasons over the results.
Figure 2
Figure 2. Question-answer pair for model fine-tuning. During inference, the model generates a call whenever it deems necessary; token generation is paused while the database query runs, the returned value is inserted into the context, and generation resumes with the complete call-response pair available.
Figure 3
Figure 3. Evaluation metrics during fine-tuning.
Figure 4
Figure 4. Web application for model queries.
Figure 5
Figure 5. Application performance.
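The pause-and-resume behaviour around Figure 2 can be sketched as a simple generate/execute loop. `generate_until_call`, the CALL/RESULT markers, and the toy backend are hypothetical stand-ins, not the paper's actual protocol:

```python
def answer(question, generate_until_call, run_query, max_calls=3):
    """Alternate between text generation and database-query execution."""
    context = question
    for _ in range(max_calls):
        text, call = generate_until_call(context)  # call is None when done
        context += text
        if call is None:
            return context
        result = run_query(call)           # token generation is paused here
        context += f" [RESULT: {result}]"  # query result inserted into context
    return context

# Toy stand-in for the fine-tuned model: emit one call, then answer from it.
def fake_generate(context):
    if "[RESULT" not in context:
        return (" [CALL: count_pharmacies]", "count_pharmacies")
    return (" There are 3 pharmacies.", None)

out = answer("How many pharmacies?", fake_generate, {"count_pharmacies": 3}.get)
print(out)
```

The key design choice the captions describe is that the model, not an external router, decides when a query is necessary; the loop only supplies execution and context injection.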
read the original abstract

This paper presents an open-source methodology for allowing users to query structured, non-textual datasets through natural language. Unlike Retrieval-Augmented Generation (RAG), which struggles with numerical and highly structured information, our approach trains an LLM to generate executable queries. To support this capability, we introduce a principled pipeline for synthetic training data generation, producing diverse question-answer pairs that capture both user intent and the semantics of the underlying dataset. We fine-tune a compact model, DeepSeek R1 Distill 8B, using QLoRA with 4-bit quantization, making the system suitable for deployment on commodity hardware. We evaluate our approach on a dataset describing accessibility to essential services across Durangaldea, Spain. The fine-tuned model achieves high accuracy across monolingual, multilingual, and unseen-location scenarios, demonstrating both robust generalization and reliable query generation. Our results highlight that small domain-specific models can achieve high precision for this task without relying on large proprietary LLMs, making this methodology suitable for resource-constrained environments and adaptable to broader multi-dataset systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. This paper presents an open-source methodology for querying structured non-textual datasets via natural language by training an LLM to generate executable queries rather than relying on RAG. It introduces a synthetic data generation pipeline to produce diverse question-answer pairs capturing user intent and dataset semantics, fine-tunes the DeepSeek R1 Distill 8B model with QLoRA and 4-bit quantization, and evaluates the approach on an accessibility dataset for Durangaldea, Spain, claiming high accuracy in monolingual, multilingual, and unseen-location scenarios.

Significance. If the accuracy and generalization results hold under rigorous evaluation, the work offers a practical, deployable alternative to RAG for structured data querying that runs on commodity hardware. The synthetic pipeline and small-model focus could enable domain adaptation without large proprietary LLMs, with potential for multi-dataset systems; the open-source framing and emphasis on resource-constrained environments add to its utility if reproducibility is ensured.

major comments (3)
  1. [§5] §5 (Evaluation): The central claim that the fine-tuned model 'achieves high accuracy across monolingual multilingual and unseen location scenarios' is unsupported because the manuscript provides no numerical metrics (e.g., exact accuracy, precision, or F1), baselines, error analysis, or test-set details, leaving the evidence for robust generalization and reliable query generation impossible to assess.
  2. [§3] §3 (Synthetic Data Pipeline): The pipeline is described only as 'principled' and 'diverse' without reporting inter-annotator agreement on generated pairs, ablations removing synthetic components, or comparisons of model performance on real versus synthetic queries; this is load-bearing for the claim that performance reflects true semantic capture rather than in-distribution artifacts.
  3. [§4] §4 (Model and Training): The use of QLoRA with 4-bit quantization on the 8B model is presented as enabling commodity-hardware deployment, but no hyperparameter values, training curves, or ablation against full fine-tuning are given, undermining reproducibility and the suitability claim.
minor comments (3)
  1. [Abstract] Abstract: The final two sentences are duplicated verbatim; remove the repetition for clarity.
  2. [Abstract] Abstract: Add missing punctuation and hyphenation (e.g., 'non-textual', 'Durangaldea, Spain') and ensure consistent terminology such as 'fine-tuned'.
  3. [Introduction] Throughout: Define all acronyms on first use (e.g., RAG, QLoRA) and provide a brief description of the Durangaldea dataset structure to aid readers unfamiliar with the domain.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript requires stronger quantitative support and additional details for reproducibility. We will revise the paper to address all points raised.

read point-by-point responses
  1. Referee: [§5] §5 (Evaluation): The central claim that the fine-tuned model 'achieves high accuracy across monolingual multilingual and unseen location scenarios' is unsupported because the manuscript provides no numerical metrics (e.g., exact accuracy, precision, or F1), baselines, error analysis, or test-set details, leaving the evidence for robust generalization and reliable query generation impossible to assess.

    Authors: We agree that the evaluation section in the current draft is insufficiently quantitative. In the revised manuscript we will expand §5 with exact numerical results including accuracy, precision, and F1 scores broken down by monolingual, multilingual, and unseen-location scenarios. We will also add baseline comparisons (zero-shot and RAG), a categorized error analysis, and full test-set statistics (size, generation method, and distribution) so that the generalization claims can be properly assessed. revision: yes

  2. Referee: [§3] §3 (Synthetic Data Pipeline): The pipeline is described only as 'principled' and 'diverse' without reporting inter-annotator agreement on generated pairs, ablations removing synthetic components, or comparisons of model performance on real versus synthetic queries; this is load-bearing for the claim that performance reflects true semantic capture rather than in-distribution artifacts.

    Authors: We will strengthen §3 by adding inter-annotator agreement statistics on a sampled subset of the generated pairs, ablation experiments that isolate each component of the pipeline and report the resulting performance change, and a direct comparison of model accuracy on a small set of real user queries versus the synthetic queries to demonstrate that the results reflect semantic capture rather than artifacts. revision: yes

  3. Referee: [§4] §4 (Model and Training): The use of QLoRA with 4-bit quantization on the 8B model is presented as enabling commodity-hardware deployment, but no hyperparameter values, training curves, or ablation against full fine-tuning are given, undermining reproducibility and the suitability claim.

    Authors: We will revise §4 to include the complete set of QLoRA and quantization hyperparameters, training loss and accuracy curves, and an ablation comparing QLoRA performance against full fine-tuning. These additions will support the reproducibility of the commodity-hardware claim. revision: yes
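As context for the hyperparameter request, the rough arithmetic behind the commodity-hardware claim can be made explicit. The rank, layer count, and hidden size below are typical defaults for a Llama-style 8B model, not values reported by the paper:

```python
def qlora_footprint(n_params, n_layers, d_model, r=16, targets_per_layer=4):
    """Back-of-envelope memory for a 4-bit base model plus LoRA adapters."""
    base_gb = n_params * 0.5 / 1e9          # 4-bit weights: 0.5 byte per parameter
    # Each adapted projection gets two low-rank factors, A (r x d) and B (d x r).
    adapter_params = n_layers * targets_per_layer * 2 * r * d_model
    adapter_gb = adapter_params * 2 / 1e9   # trainable adapters kept in fp16
    return base_gb, adapter_params, adapter_gb

base_gb, adapter_params, adapter_gb = qlora_footprint(8e9, 32, 4096)
print(f"{base_gb:.1f} GB quantized base, {adapter_params / 1e6:.0f}M adapter params")
```

Under these assumptions the quantized base fits in roughly 4 GB and the trainable adapters add only tens of millions of parameters, which is why the suitability claim is plausible; the requested hyperparameters would pin the actual numbers down.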

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

full rationale

The paper's chain consists of (1) a synthetic data generation pipeline that produces question-answer pairs from the Durangaldea dataset, (2) QLoRA fine-tuning of DeepSeek-R1-Distill-8B on those pairs, and (3) empirical accuracy measurement on monolingual, multilingual, and unseen-location splits. None of these steps reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation; the reported accuracies are measured outcomes on held-out scenarios rather than quantities forced by construction from the training distribution. No equations appear, and the methodology contains no uniqueness theorems or ansatzes imported from prior author work. The derivation is therefore independent of its inputs in the sense required by the circularity criteria.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach relies on standard assumptions about LLM fine-tuning capabilities and the effectiveness of synthetic data in capturing dataset semantics, with no new entities postulated.

free parameters (1)
  • QLoRA hyperparameters and 4-bit quantization settings
    Chosen to enable deployment on commodity hardware while balancing model performance and efficiency.
axioms (1)
  • domain assumption LLMs can be fine-tuned to reliably generate executable structured queries from natural language inputs when trained on appropriate synthetic data.
    Invoked as the basis for the training pipeline and generalization claims.

pith-pipeline@v0.9.0 · 5550 in / 1282 out tokens · 69517 ms · 2026-05-13T19:19:24.241661+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 4 internal anchors

  1. [1] Meta: The Llama 3 herd of models. Tech. rep., Meta (2024). https://ai.meta.com/research/publications/the-llama-3-herd-of-models/

  2. [2] Anthropic: Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol (Nov 2024), accessed 2025-03-30

  3. [3] DeepSeek-AI: DeepSeek-R1: Incentivizing reasoning capability in large language models via reinforcement learning. arXiv preprint arXiv:2501.04652 (2025). https://arxiv.org/abs/2501.04652

  4. [4] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine-tuning of quantized LLMs. In: Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS '23). Curran Associates Inc., Red Hook, NY, USA (2023)

  5. [5] Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D.: GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323 (2022). https://arxiv.org/abs/2210.17323

  6. [6] Gao, D., Wang, H., Li, Y., Sun, X., Qian, Y., Ding, B., Zhou, J.: Text-to-SQL empowered by large language models: A benchmark evaluation (2023). https://arxiv.org/abs/2308.15363

  7. [7] Gao, S., Shi, Z., Zhu, M., Fang, B., Xin, X., Ren, P., Chen, Z., Ma, J., Ren, Z.: Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum (2023). https://arxiv.org/abs/2308.14034

  8. [8] Guo, D., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs. arXiv preprint arXiv:2501.12948 (2025). https://arxiv.org/abs/2501.12948

  9. [9] National Imagery and Mapping Agency: Department of Defense World Geodetic System 1984: Its definition and relationships with local geodetic systems. Tech. Rep. TR8350.2, NIMA (2000)

  10. [10] Jolicoeur-Martineau, A.: Less is more: Recursive reasoning with tiny networks (2025). https://arxiv.org/abs/2510.04871

  11. [11] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H.T. (eds.) Advances in Neural Information Processing Systems, vol. 33 (2020)

  12. [12] Olbricht, R.: The Overpass API. https://wiki.openstreetmap.org/wiki/Overpass_API (2012), accessed 2025-12-11

  13. [13] Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large language model connected with massive APIs. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 126544–126565. Curran Associates, Inc. (2024). https://doi.org/10.52202/079017-4020

  14. [14] Qian, C., Han, C., Fung, Y.R., Qin, Y., Liu, Z., Ji, H.: CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models (2024). https://arxiv.org/abs/2305.14318

  15. [15] Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., Scialom, T.: Toolformer: Language models can teach themselves to use tools. In: NeurIPS (2023). https://arxiv.org/abs/2302.04761

  16. [16] Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., Cao, B., Sun, L.: ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases (2023). https://arxiv.org/abs/2306.05301

  17. [17] Yang, R., Song, L., Li, Y., Zhao, S., Ge, Y., Li, X., Shan, Y.: GPT4Tools: Teaching large language model to use tools via self-instruction (2023). https://arxiv.org/abs/2305.18752

  18. [18] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations (2022)

  19. [19] Zeng, Y., Gao, Y., Guo, J., Chen, B., Liu, Q., Lou, J.G., Teng, F., Zhang, D.: RECParser: A recursive semantic parsing framework for text-to-SQL task. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3644–3650. International Joint Conferences on Artificial Intelligence Organization (2020)