BEAVER: An Enterprise Benchmark for Text-to-SQL

Peter Baile Chen, Devin Yang, Weiyue Li, Fabian Wenz, Yi Zhang, Nesime Tatbul, Michael Cafarella, Çağatay Demiralp, Michael Stonebraker

classification cs.CL cs.AI cs.DB

keywords query, evaluation, accuracy, benchmark, challenges, domain, enterprise, existing

Existing text-to-SQL benchmarks have largely been constructed from public databases with well-structured schemas and simplistic question-SQL pairs. While large language models (LLMs) excel in these settings, their efficacy in complex private enterprise environments, characterized by intricate schemas, domain knowledge, and analytical user queries involving sophisticated structures and functions, remains unproven. To bridge this gap, we introduce BEAVER, the first text-to-SQL benchmark derived from private data warehouses. It comprises 9128 question-SQL pairs sourced from real-world query logs and 812 tables across 19 diverse domains. Building this benchmark is challenging because (1) enterprise query logs are scarce due to privacy constraints, and (2) existing all-or-nothing evaluation metrics based on accuracy make error diagnosis difficult, especially when producing a correct query involves solving multiple compounded challenges, such as domain knowledge and query complexity. We address these issues at two levels. At the dataset level, we synthesize high-fidelity, expert-verified queries that increase dataset size and isolate individual challenges or combine them, producing queries focused on domain knowledge, query complexity, and both. At the evaluation level, we provide human annotations and evaluation metrics for five critical subtasks to enable fine-grained analysis. Our evaluation reveals a significant performance gap compared to existing benchmarks: SOTA agentic frameworks using the advanced model GPT-5.2 achieve only 10.8% accuracy. When provided with all subtask annotations as oracle hints, accuracy increases to 30.1%, confirming that a major bottleneck lies in correctly resolving these subtasks. Finally, we provide a taxonomy of the residual errors that persist even with subtask hints, identifying specific challenges such as the use of advanced functions.
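
The abstract's point about all-or-nothing metrics can be illustrated with a minimal sketch. It assumes the accuracy metric is execution match (a predicted query counts as correct only if it returns the same rows as the reference query), which is a common choice but is not spelled out in the abstract. The `orders` table, the three queries, and the `execution_match` helper are hypothetical and are not taken from BEAVER.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True only if both queries yield the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to run counts as wrong
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 20.0), (3, 'west', 5.0);
""")

gold = "SELECT region, SUM(amount) FROM orders GROUP BY region"
near_miss = "SELECT region, AVG(amount) FROM orders GROUP BY region"  # one wrong function
far_off = "SELECT id FROM orders"  # unrelated result

# Each prediction scores either 1 or 0; there is no partial credit.
scores = [execution_match(conn, query, gold) for query in (gold, near_miss, far_off)]
accuracy = sum(scores) / len(scores)
```

Under this metric the near-miss query, which differs from the reference by a single aggregate function, scores the same as the unrelated query. That is the diagnostic gap the subtask-level annotations described in the abstract are meant to close.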

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EGREFINE: An Execution-Grounded Optimization Framework for Text-to-SQL Schema Refinement
cs.DB 2026-05 unverdicted novelty 6.0
EGRefine optimizes column renamings via execution-grounded verification and view materialization to recover Text-to-SQL accuracy lost to schema naming issues while guaranteeing query equivalence.

SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks
cs.CL 2026-04 unverdicted novelty 6.0
SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

An Alternate Agentic AI Architecture (It's About the Data)
cs.DB 2026-04 unverdicted novelty 5.0
RUBICON replaces opaque LLM-based tool orchestration in agentic AI with an explicit query algebra (AQL: Find, From, Where) executed via wrappers to deliver traceable, deterministic access to heterogeneous enterprise d...

A Demonstration of SQLyzr: A Platform for Fine-Grained Text-to-SQL Evaluation and Analysis
cs.DB 2026-04 unverdicted novelty 5.0
SQLyzr is a new evaluation platform that adds diverse metrics, realistic settings, query classification, and analysis features to overcome the single-score limitations of existing text-to-SQL benchmarks.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
cs.CL 2026-04 unverdicted novelty 5.0
APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0
FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.