arxiv: 2605.05525 · v1 · submitted 2026-05-07 · 💻 cs.DB · cs.CL

Anatomy of a Query: W5H Dimensions and FAR Patterns for Text-to-SQL Evaluation

Vicki Stover Hertzberg, Eduardo Valverde, Joyce C. Ho

Pith reviewed 2026-05-08 03:48 UTC · model grok-4.3

keywords text-to-SQL · query evaluation · semantic dimensions · FAR patterns · natural language interfaces · database queries · healthcare queries · evaluation framework

The pith

Every well-formed query reduces to Filter, Aggregate, and Return operations, with filtering criteria mapping to six semantic dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Natural language interfaces to databases require stronger foundations for evaluation and design. The paper introduces the QUEST framework, which rests on the FAR structural invariant that every query decomposes into filtering, aggregation, and returning results, plus the W5H framework that assigns all filtering criteria to the dimensions of who, what, where, when, why, and how. Validation on 120,464 queries from five text-to-SQL datasets confirms that FAR conformance holds universally across domains and schemas. W5H profiles, however, shift substantially by domain, with healthcare queries showing heavy concentration in temporal and person-centric dimensions and almost no causal or procedural content.

Core claim

The paper establishes that the FAR structural invariant holds universally: every well-formed query reduces exactly to Filter, Aggregate, and Return operations. At the same time, the W5H dimensional framework shows that filtering criteria distribute unevenly across domains. Healthcare queries concentrate strongly in the WHEN dimension at 80.4 percent and the WHO dimension at 73.0 percent, far above general-domain benchmarks, while causal WHY and mechanistic HOW reasoning appear near zero everywhere, with apparent HOW exceptions actually reflecting quantitative aggregation rather than procedural reasoning.
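Because the reported WHEN (80.4 percent) and WHO (73.0 percent) shares sum to more than 100 percent, a single query can evidently carry more than one dimension, so a W5H profile reads most naturally as a per-dimension share of queries rather than a partition. The Python sketch below shows one way such a profile could be computed. It assumes each query has already been annotated with a set of W5H labels; the function name `w5h_profile` and the toy annotations are hypothetical and are not taken from the paper.

```python
from collections import Counter

W5H = ("WHO", "WHAT", "WHERE", "WHEN", "WHY", "HOW")

def w5h_profile(annotations):
    """Return the percentage of queries whose filters touch each dimension.

    `annotations` is a list of sets of W5H labels, one set per query.
    A query counts once per dimension, so the percentages can sum above 100.
    """
    if not annotations:
        raise ValueError("need at least one annotated query")
    counts = Counter()
    for labels in annotations:
        unknown = set(labels) - set(W5H)
        if unknown:
            raise ValueError(f"labels outside W5H: {sorted(unknown)}")
        counts.update(set(labels))
    n = len(annotations)
    return {dim: 100.0 * counts[dim] / n for dim in W5H}

# Toy, made-up annotations for four queries (not data from the paper).
toy = [{"WHO", "WHEN"}, {"WHEN"}, {"WHO", "WHAT", "WHEN"}, {"WHERE"}]
profile = w5h_profile(toy)
```

Comparing two such dictionaries, one per dataset, is the kind of contrast the healthcare versus general-domain claim relies on.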

What carries the argument

The FAR structural invariant that decomposes every query into Filter, Aggregate, and Return operations, together with the W5H dimensional framework that classifies all filtering criteria into six semantic dimensions.
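To make the two components concrete, here is a minimal Python sketch of how one decomposed query might be represented. Everything in it is an illustrative assumption: the class names `Filter` and `FARQuery`, the example conditions, the dimension tags assigned to them, and the choice to allow an empty Aggregate step. The paper's actual annotation schema is not shown on this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Set

class Dimension(Enum):
    WHO = "who"
    WHAT = "what"
    WHERE = "where"
    WHEN = "when"
    WHY = "why"
    HOW = "how"

@dataclass(frozen=True)
class Filter:
    condition: str        # a filtering condition, written SQL-style
    dimension: Dimension  # the W5H dimension assigned to it

@dataclass
class FARQuery:
    filters: List[Filter] = field(default_factory=list)
    aggregate: Optional[str] = None  # assumption: None means no aggregation
    returns: List[str] = field(default_factory=list)

    def dimensions(self) -> Set[Dimension]:
        """Set of W5H dimensions touched by this query's filters."""
        return {f.dimension for f in self.filters}

# Hypothetical decomposition of a question such as
# "How many patients over 65 were admitted in 2020?"
example = FARQuery(
    filters=[
        Filter("age > 65", Dimension.WHO),            # assumed tag
        Filter("admit_year = 2020", Dimension.WHEN),  # assumed tag
    ],
    aggregate="COUNT",
    returns=["patient_count"],
)
```

Keeping the dimension on each filter rather than on the query as a whole lets one query contribute to several dimensions at once.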

Load-bearing premise

Every well-formed query can be exhaustively and uniquely reduced to Filter, Aggregate, and Return operations, and every filtering condition can be mapped to one of the six W5H dimensions without loss, overlap, or remainder.

What would settle it

A single natural language query that requires a database operation outside of filtering, aggregation, and returning results, or whose filtering criteria cannot be assigned to any of the six W5H dimensions.
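The falsification test above can be phrased as a simple check over annotated queries. The sketch below is hypothetical: the operation labels, the use of `None` for a filter no annotator could place, and the function name `counterexample_reasons` are assumptions made for illustration, not part of the paper.

```python
FAR_OPS = {"filter", "aggregate", "return"}
W5H_DIMS = {"who", "what", "where", "when", "why", "how"}

def counterexample_reasons(operations, filter_dimensions):
    """List the ways one annotated query would contradict FAR or W5H.

    operations: iterable of operation kinds found in the query.
    filter_dimensions: one entry per filter condition, with None when
        no dimension could be assigned.
    An empty result means the query is consistent with both claims.
    """
    reasons = []
    for op in operations:
        if op not in FAR_OPS:
            reasons.append(f"operation outside FAR: {op}")
    for i, dim in enumerate(filter_dimensions):
        if dim is None:
            reasons.append(f"filter {i} has no W5H dimension")
        elif dim not in W5H_DIMS:
            reasons.append(f"filter {i} has unknown dimension: {dim}")
    return reasons

# A conforming query and a made-up violating one.
ok = counterexample_reasons(["filter", "return"], ["when"])
bad = counterexample_reasons(["filter", "other"], [None])
```

Under this framing, a single query with a non-empty result would be enough to challenge the universality claim, provided the annotation itself is trusted.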

Figures

Figures reproduced from arXiv: 2605.05525 by Eduardo Valverde, Joyce C. Ho, Vicki Stover Hertzberg.

**Figure 1.** The QUEST translation pipeline for two representative queries from ATIS and EHRSQL.

**Figure 2.** Heatmap showing W5H dimensional profiles across five text-to-SQL datasets. Rows ...

Original abstract

Natural language interfaces to databases have gained popularity, yet the theoretical foundations for evaluating and designing these systems remain underdeveloped. We present QUEST (Query Understanding Evaluation through Semantic Translation), a framework resting on two independently motivated components: the FAR structural invariant, which holds that every well-formed query reduces to Filter, Aggregate, and Return operations; and the W5H dimensional framework, which holds that all filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, and How). Validated across five text-to-SQL datasets (n = 120,464), FAR conformance is universal across all domains and schema types, while W5H dimensional profiles vary substantially. Healthcare queries are strongly concentrated in temporal (WHEN: 80.4%) and person-centric (WHO: 73.0%) dimensions far exceeding general-domain benchmarks, and causal (WHY) and mechanistic (HOW) reasoning are near-zero everywhere, with apparent HOW exceptions reflecting quantitative aggregation rather than genuine procedural reasoning. These results identify a frontier that must be crossed for genuine machine reasoning over structured data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a clean way to break down text-to-SQL queries into FAR operations and W5H filter dimensions, with large-scale profiles showing domain gaps and missing causal reasoning, but the supporting annotation details are missing.

The main thing to know is that the authors reduce every query to Filter, Aggregate, and Return steps and then tag the filters with Who/What/Where/When/Why/How labels. They apply this to 120k queries across five datasets and report that FAR always holds while the W5H mix changes sharply by domain, with healthcare heavy on time and person and almost no causal or procedural content anywhere.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the QUEST framework for text-to-SQL evaluation, comprising the FAR structural invariant (every well-formed query reduces to Filter, Aggregate, Return operations) and the W5H dimensional framework (filtering criteria map exhaustively to Who, What, Where, When, Why, How dimensions). Empirical validation on 120,464 queries from five datasets shows universal FAR conformance, domain-varying W5H profiles (e.g., healthcare: 80.4% WHEN, 73.0% WHO), and near-zero WHY/HOW with HOW exceptions reinterpreted as aggregation.

Significance. If substantiated, the findings offer a useful taxonomy for understanding query semantics in natural language interfaces to databases, potentially simplifying evaluation by showing limited need for causal/mechanistic reasoning in most queries. The scale of the study (n=120,464 across five datasets) provides broad empirical support and identifies domain-specific patterns, such as temporal focus in healthcare, which could inform targeted system improvements. The independent motivation of the two components is a positive aspect.

major comments (2)

Abstract: The claims of universal FAR conformance and precise W5H percentages (e.g., 80.4% WHEN and 73.0% WHO in healthcare) depend on an annotation process applied to 120,464 queries for which no inter-annotator agreement statistics, detailed protocol, or error bars are reported. This omission is load-bearing for the central empirical results.
Abstract: The post-hoc interpretation that apparent HOW exceptions represent aggregation rather than procedural reasoning is tied to the W5H framework definitions, which may introduce circularity; an independent verification or sensitivity analysis would strengthen this aspect of the argument.
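The first comment asks for inter-annotator agreement. The report does not say which statistic it has in mind; Cohen's kappa for two annotators is one common choice, sketched below in plain Python. The label vectors are made up, and treating each W5H dimension as a separate yes/no judgment per query is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Made-up judgments: does each of ten queries carry a WHEN filter?
annotator_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_a, annotator_b)  # about 0.6 for this toy data
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.5, and kappa is (0.8 - 0.5) / (1 - 0.5) = 0.6.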

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important aspects of transparency in our empirical validation, and we address each point below with plans for revision.

Referee: Abstract: The claims of universal FAR conformance and precise W5H percentages (e.g., 80.4% WHEN and 73.0% WHO in healthcare) depend on an annotation process applied to 120,464 queries for which no inter-annotator agreement statistics, detailed protocol, or error bars are reported. This omission is load-bearing for the central empirical results.

Authors: We agree that the absence of these details limits the assessability of the reported results. The annotation was conducted by database researchers using a protocol derived directly from the independently defined FAR and W5H frameworks. In the revised manuscript, we will add a dedicated section describing the full annotation protocol, report inter-annotator agreement statistics computed on a stratified sample of queries, and include error bars or bootstrap confidence intervals for the key W5H percentages across datasets. These additions will directly address the load-bearing nature of the empirical claims. revision: yes

Referee: Abstract: The post-hoc interpretation that apparent HOW exceptions represent aggregation rather than procedural reasoning is tied to the W5H framework definitions, which may introduce circularity; an independent verification or sensitivity analysis would strengthen this aspect of the argument.

Authors: We acknowledge the risk of circularity in the post-hoc reclassification of HOW instances. The W5H dimensions were motivated and defined prior to any dataset inspection, and the aggregation reinterpretation follows explicit criteria (presence of aggregate operators without sequential procedural steps) stated in the protocol. To strengthen this, the revised version will include an independent verification: a blinded re-annotation of all HOW-labeled queries by a separate annotator team using only the original protocol, plus a sensitivity analysis that varies the aggregation threshold and reports the resulting impact on the near-zero HOW rate. This will demonstrate that the conclusion is robust to alternative interpretations. revision: yes
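The bootstrap confidence intervals promised in the first response could be computed along the following lines. This is a minimal percentile-bootstrap sketch in plain Python, not the authors' procedure: the function name `bootstrap_ci`, the resample count, the fixed seed, and the toy data are all assumptions.

```python
import random

def bootstrap_ci(flags, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the share of True entries.

    flags: one boolean per query, True when the query carries the
        dimension of interest (for example WHEN).
    Returns (lower, upper) bounds on the share, each between 0 and 1.
    """
    if not flags:
        raise ValueError("need at least one query")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(flags)
    shares = []
    for _ in range(n_resamples):
        sample = rng.choices(flags, k=n)  # resample with replacement
        shares.append(sum(sample) / n)
    shares.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return shares[lo_idx], shares[hi_idx]

# Made-up sample: 80 of 100 queries carry the dimension.
toy_flags = [True] * 80 + [False] * 20
lower, upper = bootstrap_ci(toy_flags)
```

With 120,464 queries the intervals on the headline percentages would likely be narrow, so the larger uncertainty the referee points to is annotation error, which resampling alone does not capture.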

Circularity Check

0 steps flagged

No significant circularity identified

The paper presents FAR and W5H as two independently motivated components, with the central claims resting on empirical validation across five external text-to-SQL datasets totaling 120,464 queries. FAR conformance is reported as holding universally based on direct analysis of the data rather than by definitional reduction or self-referential mapping. W5H dimensional profiles are derived from the same mapping process but are presented as varying observations, not predictions forced by the inputs. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems imported from prior author work appear in the derivation chain. The analysis is self-contained against external benchmarks, with no load-bearing step that reduces to its own assumptions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claims rest on two newly introduced frameworks whose universality is asserted rather than derived from prior theory. No free parameters are fitted, but the frameworks themselves function as domain assumptions whose validity is tested empirically.

axioms (2)

domain assumption: Every well-formed query reduces to Filter, Aggregate, and Return operations.
Stated as the FAR structural invariant that holds universally.
domain assumption: All filtering criteria map to six semantic dimensions (Who, What, Where, When, Why, How).
Core premise of the W5H framework.

invented entities (3)

QUEST framework (no independent evidence)
purpose: Unified evaluation system combining FAR and W5H for text-to-SQL
Newly proposed framework resting on the two invariants.
FAR structural invariant (no independent evidence)
purpose: Universal decomposition of queries into three operations
Postulated as holding for all well-formed queries.
W5H dimensional framework (no independent evidence)
purpose: Semantic categorization of filtering criteria
New mapping of filters to six dimensions.

pith-pipeline@v0.9.0 · 5492 in / 1564 out tokens · 44862 ms · 2026-05-08T03:48:37.133419+00:00 · methodology
