pith. machine review for the scientific record.

arxiv: 2605.08687 · v1 · submitted 2026-05-09 · 💻 cs.DB · cs.AI

Recognition: no theorem link

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 00:57 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords data preparation · natural language · large language models · benchmark · data workflows · code generation · disambiguation

The pith

State-of-the-art LLMs still struggle to realize natural-language-driven data preparation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Data preparation is a central and time-consuming stage in data analysis workflows, traditionally managed through graphical user interfaces. The paper introduces PrepBench to measure how close large language models are to a shift in which users specify data preparation intents directly in natural language. The benchmark assesses three key capabilities: interactive disambiguation of ambiguous intents, generation of preparation code, and translation of code into interpretable workflows. Tasks are systematically derived from real data challenges across diverse domains, with nearly half requiring over 100 lines of code. Evaluations of current LLMs reveal that significant challenges persist in achieving this paradigm shift.

Core claim

PrepBench is a benchmark for evaluating natural-language-driven data preparation that covers interactive disambiguation, prep-code generation, and code-to-workflow translation. Built by crawling and extending the Preppin' Data Challenges, it includes tasks with 3 to 18 steps and solutions approaching 300 lines of Python code. The evaluation demonstrates that, despite recent progress, state-of-the-art LLMs still find these tasks challenging.
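
To make the task format concrete, here is a minimal sketch of a PrepBench-style task, assuming a hypothetical request and toy tables; the actual task texts, schemas, and ground-truth code belong to the benchmark and are not reproduced here.

```python
import pandas as pd

# Hypothetical request: "Combine the two monthly sales tables, drop
# duplicate transactions, and report total sales per store."
# Illustrative only: real PrepBench tasks involve 3 to 18 steps and
# solutions approaching 300 lines.
jan = pd.DataFrame({"store": ["A", "A", "B"], "sales": [100, 100, 80]})
feb = pd.DataFrame({"store": ["A", "B", "B"], "sales": [120, 90, 90]})

combined = pd.concat([jan, feb], ignore_index=True)
deduped = combined.drop_duplicates()  # remove repeated transactions
prepared = deduped.groupby("store", as_index=False)["sales"].sum()
print(prepared)  # store A: 220, store B: 170
```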

What carries the argument

The PrepBench benchmark, which tests LLMs on interactive disambiguation, prep-code generation, and code-to-workflow translation for data preparation tasks.
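
The third of these capabilities, code-to-workflow translation, asks the model to re-express free-form prep code in a schema-constrained, operator-based representation that GUI tools can render and users can validate. A hypothetical sketch of what such a representation might look like, with invented operator names (the paper's actual workflow schema is not reproduced in this summary):

```python
# Hypothetical operator-based workflow equivalent to a short prep script.
# Operator names and fields are illustrative, not PrepBench's schema.
workflow = [
    {"op": "union", "inputs": ["jan", "feb"], "output": "combined"},
    {"op": "dedupe", "inputs": ["combined"], "output": "deduped"},
    {"op": "aggregate", "inputs": ["deduped"], "output": "prepared",
     "group_by": ["store"], "agg": {"sales": "sum"}},
]
```

Unlike free-form Python, a structure like this can be checked against a schema and drawn as a visual workflow, which is why the paper treats translation as a distinct capability.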

If this is right

  • Data preparation workflows may remain dependent on GUI tools until LLMs improve at handling ambiguous intents.
  • Long and complex code generation for data prep continues to be a bottleneck for LLM agents.
  • Code-to-workflow translation is essential for validation and must be prioritized in future LLM development.
  • The benchmark provides a way to measure progress toward NL-driven data preparation over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Improving performance on PrepBench could enable more intuitive data analysis tools accessible without programming expertise.
  • Similar benchmarks might be developed for other data-intensive tasks like feature engineering or visualization specification.
  • The focus on interpretable workflows suggests a need for LLMs to output explanations alongside code in data contexts.
  • Testing with real user interactions beyond the benchmark could reveal additional practical limitations.

Load-bearing premise

The tasks constructed by crawling and extending Preppin' Data Challenges accurately represent the key real-world characteristics of natural-language-driven data preparation, including ambiguous intents and the need for interpretable workflows.

What would settle it

An experiment where a state-of-the-art LLM completes the majority of PrepBench tasks with correct disambiguation, accurate code, and valid workflow translations would falsify the claim that the paradigm shift remains challenging.
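
A minimal sketch of what such a settling experiment would check per task, assuming hypothetical harness helpers (model.ask, model.generate_code, task.answer) and a trusted execution sandbox; the benchmark's real prompts, question budget, and comparator are its own:

```python
import pandas as pd

def evaluate_task(model, task, question_budget=5):
    """End-to-end PrepBench-style check (hypothetical API)."""
    # 1. Interactive disambiguation under a limited question budget.
    request = task.ambiguous_request
    for _ in range(question_budget):
        question = model.ask(request, task.input_tables)
        if question is None:  # model judges the intent clear enough
            break
        request += "\n" + task.answer(question)

    # 2. Prep-code generation from the (clarified) request.
    code = model.generate_code(request, task.input_tables)

    # 3. Execute on the input tables and compare against the reference
    #    output; exec() here stands in for a real isolated sandbox.
    namespace = {"pd": pd, **task.input_tables}
    exec(code, namespace)
    produced = namespace.get("prepared")
    return produced is not None and produced.equals(task.reference_output)
```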

Figures

Figures reproduced from arXiv: 2605.08687 by Guoliang Li, Jiannan Wang, Jingzhe Xu, Rui Wang.

Figure 1: (a) Unprepared and prepared tables: raw data from two tables alongside the resulting prepared data; (b) GUI-driven …
Figure 2: An overview of PrepBench construction. We adopt an agent-based pipeline to generate ground-truth code. The agent performs two tasks: profiling tables to summarize their characteristics, and synthesizing code based on the request. The profiling process is described in Section 4.2. Unlike a standard NL-driven data preparation task, here the ground-truth output is available. This gives the agent direct feedback…
Figure 3: PrepBench statistics (306 realistic tasks over 829 input tables across 32 domains).
Figure 4: Evaluation pipeline in PrepBench. All modes share the same input tables, reference outputs, and comparator; varying the input assets isolates different capabilities.
Figure 5: Ambiguity taxonomy in PrepBench.
Figure 6: Distribution of ambiguities in PrepBench.
Figure 8: Ambiguity gap by type: (1) single-table reference, (2) …
Figure 9: Interact gains across models.
Figure 10: Question budget usage across models.
Figure 13: Profiling gains by irregularity type.
Figure 14: Translation outcome composition across models.
Figure 15: Translation outcomes by preparation steps.
read the original abstract

Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in large language models (LLMs) raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PrepBench, a benchmark for NL-driven data preparation constructed by crawling and extending Preppin' Data Challenges. It targets three capabilities—interactive disambiguation, prep-code generation, and code-to-workflow translation—across tasks with 3–18 steps, diverse domains, and solutions up to ~300 lines of Python. The central claim is that, despite recent LLM progress, state-of-the-art models still face substantial challenges in realizing this paradigm shift.

Significance. If the benchmark faithfully captures real-world characteristics such as natural ambiguity and the need for interpretable workflows, PrepBench would be a useful addition to the field by filling a gap left by existing code-generation benchmarks. The systematic extension of an existing challenge set, coverage of long workflows, and focus on end-to-end capabilities (disambiguation through validation) provide a concrete resource for tracking progress in LLM-based data-preparation agents.

major comments (2)
  1. [Benchmark Construction] The claim that the extended tasks accurately reflect ambiguous user intents and the need for interpretable workflows rests on the assumption that crawling Preppin' Data Challenges plus author extensions produces naturally occurring ambiguity rather than synthetic disambiguation steps. If the original challenges contain relatively explicit intents, the reported LLM difficulties with disambiguation and workflow translation may not generalize to real user interactions, directly weakening the headline evaluation result.
  2. [Evaluation] The abstract asserts that the evaluation demonstrates challenges for SOTA LLMs, yet the provided description supplies no quantitative metrics, baselines, success rates, or details on how the three core capabilities were measured or how tasks were extended. Without these, the strength of the central claim cannot be assessed.
minor comments (1)
  1. The abstract states that nearly half the tasks require over 100 lines of code and the longest approach 300 lines; a table or figure summarizing the distribution of task lengths, domains, and step counts would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, explaining how we will revise the manuscript to improve clarity and rigor while preserving the core contributions of PrepBench.

read point-by-point responses
  1. Referee: [Benchmark Construction] The claim that the extended tasks accurately reflect ambiguous user intents and the need for interpretable workflows rests on the assumption that crawling Preppin' Data Challenges plus author extensions produces naturally occurring ambiguity rather than synthetic disambiguation steps. If the original challenges contain relatively explicit intents, the reported LLM difficulties with disambiguation and workflow translation may not generalize to real user interactions, directly weakening the headline evaluation result.

    Authors: We appreciate the referee's concern regarding the naturalness of the ambiguities introduced in PrepBench. The Preppin' Data Challenges originate from real practitioner-submitted data preparation problems, which commonly feature underspecified intents (e.g., high-level goals like 'prepare the sales data for analysis' without detailing column mappings or outlier handling). Our extensions were not arbitrary but systematically derived by analyzing the original challenge statements for points of potential ambiguity and adding disambiguation steps that mirror typical user-LLM interactions in data prep. To address this directly, we will revise the Benchmark Construction section to include a new subsection detailing the crawling methodology, the criteria used to identify natural ambiguities from the source challenges, and concrete before/after examples of task descriptions. This will provide stronger evidence that the benchmark reflects realistic rather than purely synthetic ambiguity. revision: yes

  2. Referee: [Evaluation] The abstract asserts that the evaluation demonstrates challenges for SOTA LLMs, yet the provided description supplies no quantitative metrics, baselines, success rates, or details on how the three core capabilities were measured or how tasks were extended. Without these, the strength of the central claim cannot be assessed.

    Authors: We agree that the Evaluation section requires more explicit quantitative detail to fully substantiate the claims. The manuscript does contain an evaluation reporting results across the three capabilities (disambiguation via clarification success rate, code generation via execution accuracy on test cases, and workflow translation via fidelity and interpretability metrics), with comparisons against several SOTA LLMs and baselines. However, to resolve the referee's valid point, we will substantially expand this section with dedicated tables summarizing success rates, error breakdowns, and precise measurement protocols (e.g., how interactive disambiguation was scored and how workflow interpretability was assessed). We will also add statistics on task extensions (e.g., average added steps and ambiguity types). These changes will make the central claim more transparent and assessable. revision: yes
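
To make the execution-accuracy protocol concrete, here is a minimal comparator sketch, assuming order-insensitive table equality with lenient dtypes; PrepBench's actual comparator and normalization rules are defined by the benchmark, not here.

```python
import pandas as pd

def tables_match(produced: pd.DataFrame, reference: pd.DataFrame) -> bool:
    """Hypothetical comparator: table equality up to row order and dtype."""
    if set(produced.columns) != set(reference.columns):
        return False
    cols = sorted(reference.columns)

    def norm(df: pd.DataFrame) -> pd.DataFrame:
        # Align column order, sort rows, and forget the original index.
        return df[cols].sort_values(cols).reset_index(drop=True)

    try:
        pd.testing.assert_frame_equal(norm(produced), norm(reference),
                                      check_dtype=False)
        return True
    except AssertionError:
        return False
```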

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivations or self-referential reductions

full rationale

The paper introduces PrepBench by crawling and extending Preppin' Data Challenges into tasks for evaluating LLM capabilities in disambiguation, code generation, and workflow translation. The central claim—that SOTA LLMs struggle with NL-driven data prep—is supported solely by direct empirical results on these tasks. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the derivation. The benchmark construction is described as systematic extension of external challenges, but the evaluation outcome does not reduce to the construction by definition or self-reference. This is a standard empirical benchmark paper whose result is falsifiable against the released tasks and LLM outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that LLMs remain far from NL-driven data preparation rests on the domain assumption that the selected and extended tasks from Preppin' Data Challenges capture the essential difficulties of real data prep work.

axioms (1)
  • domain assumption: Tasks from Preppin' Data Challenges can be systematically extended into a benchmark that tests interactive disambiguation, prep-code generation, and code-to-workflow translation.
    The paper builds PrepBench by crawling these challenges and extending them to cover the three core capabilities.

pith-pipeline@v0.9.0 · 5562 in / 1284 out tokens · 43916 ms · 2026-05-12T00:57:55.857292+00:00 · methodology

