PiLLar: Matching for Pivot Table Schema via LLM-guided Monte-Carlo Tree Search
Pith reviewed 2026-05-07 12:41 UTC · model grok-4.3
The pith
The PiLLar framework matches pivot table schemas accurately by guiding Monte-Carlo Tree Search with large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present PiLLar as the first framework for matching pivot table schemas. They formulate it as an LLM-driven search paradigm that operates with minimal annotated, privacy-compliant data, enabling training-free adaptation across domains. A theoretical analysis of the paradigm's error dynamics establishes asymptotic convergence. A benchmark, PTbench, is derived from four real-world domains by mining unpivot-suitable tables, unpivoting semantically coherent attributes, and applying sampling and anonymization. Extensive experiments report an average accuracy of 87.94% on correctly predicted matches, which the authors present as evidence of superiority over alternatives.
What carries the argument
The central mechanism is the LLM-guided Monte-Carlo Tree Search paradigm, which uses large language model evaluations to direct the exploration of match possibilities while ensuring semantic and value consistency.
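This mechanism can be sketched as a standard UCT-style tree search in which a language model supplies the reward signal. Everything below — the candidate match strings, the `LLM_REWARDS` numbers, and the `llm_score` stub — is a hypothetical reconstruction for illustration, not the authors' implementation:

```python
import math

# Invented per-candidate rewards standing in for LLM judgements of
# schema-level semantics and value-level compatibility.
LLM_REWARDS = {
    "Quarter -> fiscal_period": 0.9,
    "Quarter -> region_code": 0.4,
    "Quarter -> product_id": 0.2,
}

def llm_score(match):
    """Stub: a real system would prompt a language model here."""
    return LLM_REWARDS.get(match, 0.5)

class Node:
    def __init__(self, match, parent=None):
        self.match = match      # candidate column-to-column match
        self.parent = parent
        self.visits = 0
        self.value = 0.0        # cumulative reward from LLM evaluations

    def uct(self, c=1.4):
        # Unvisited nodes are explored first; otherwise balance mean
        # reward (exploitation) against an exploration bonus.
        if self.visits == 0:
            return float("inf")
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def search(candidate_matches, iterations=200):
    root = Node(match=None)
    children = [Node(m, parent=root) for m in candidate_matches]
    for _ in range(iterations):
        node = max(children, key=lambda n: n.uct())  # selection
        reward = llm_score(node.match)               # LLM-backed evaluation
        while node is not None:                      # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(children, key=lambda n: n.visits).match
```

Under these toy rewards, `search(list(LLM_REWARDS))` concentrates visits on the highest-scored candidate, `"Quarter -> fiscal_period"` — the same exploit/explore trade-off the paper relies on, only with a real LLM in place of the lookup table.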
If this is right
- Enables accurate schema matching on anonymized pivot tables without task-specific training.
- Provides theoretical assurance of convergence through error dynamics analysis.
- Introduces PTbench as a new evaluation benchmark from diverse real-world domains.
- Demonstrates high accuracy across four representative domains.
Where Pith is reading between the lines
- This could improve data integration pipelines in organizations handling sensitive information by automating pivot table alignment.
- Combining LLMs with search methods may generalize to other privacy-constrained data tasks in databases.
- Testing on larger scales or different LLM models could reveal robustness limits not covered in the current experiments.
Load-bearing premise
The LLM can reliably guide the Monte-Carlo Tree Search to produce semantically and value-consistent matches across anonymized data from unseen domains without any task-specific training or fine-tuning.
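A correct match under this premise must pass both a semantic check and a value-level check. A minimal illustration of the value-level half — the function name, the stringified-Jaccard heuristic, and the 0.3 threshold are all our own invention, not PiLLar's:

```python
def value_compatible(col_a, col_b, min_overlap=0.3):
    """Crude value-level compatibility: identical inferred value types
    plus sufficient Jaccard overlap of stringified values.
    The 0.3 threshold is illustrative, not from the paper."""
    if {type(v) for v in col_a} != {type(v) for v in col_b}:
        return False
    a, b = set(map(str, col_a)), set(map(str, col_b))
    union = a | b
    return bool(union) and len(a & b) / len(union) >= min_overlap

# Anonymized columns can still share enough values to align:
print(value_compatible(["US", "UK", "US"], ["UK", "US", "FR"]))  # True
print(value_compatible([2021, 2022], ["2021", "2022"]))          # type mismatch: False
```

A gate like this is cheap to run on every candidate, so it can prune the search before any LLM call is spent on a value-incompatible pair.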
What would settle it
Testing the framework on a fresh set of anonymized pivot tables from an entirely new domain and finding accuracy much lower than 87.94% would indicate that the training-free adaptation does not hold generally.
Original abstract
Pivot tables are ubiquitous in data lakes of modern data ecosystems, making accurate schema matching over pivot tables a key prerequisite for data integration. In this paper, we focus on matching for pivot table schema, which is a novel joint schema-value matching task. It aims to align schemas between pivot tables and standard relational tables, where a correct match must be semantically consistent at the schema level and compatible at the value level. However, due to the inherent data sensitivity of this task, the prevalence of anonymized data in practice poses significant challenges to its matching accuracy and generalization capability. To tackle these challenges, we propose PiLLar, the first matching for pivot table schema framework. We first formulate PiLLar as an LLM-driven search paradigm that operates with minimal annotated privacy-compliant data, thereby achieving training-free adaptation across diverse domains. Next, we provide a theoretical analysis on the error dynamics of the paradigm to ensure the asymptotic convergence of the proposed method. Furthermore, we introduce a new benchmark PTbench, derived from four representative real-world domains and constructed by mining unpivot-suitable tables, performing unpivot on semantically coherent attributes, and applying sampling and anonymization. Extensive experiments demonstrate the superiority of PiLLar, which achieves an average accuracy of 87.94% on the correctly predicted matches.
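The benchmark-construction step of "performing unpivot on semantically coherent attributes" is the classic melt operation. A dependency-free sketch with invented data — a real pipeline might instead use `pandas.DataFrame.melt`:

```python
def unpivot(rows, id_col, value_cols, var_name, value_name):
    """Melt the given pivot columns into one relational row per cell."""
    out = []
    for row in rows:
        for col in value_cols:
            out.append({id_col: row[id_col],
                        var_name: col,
                        value_name: row[col]})
    return out

# Toy pivot table: one row per region, one column per quarter.
pivot = [
    {"region": "North", "Q1": 100, "Q2": 120},
    {"region": "South", "Q1": 80,  "Q2": 90},
]

relational = unpivot(pivot, "region", ["Q1", "Q2"], "quarter", "sales")
# relational now holds four rows of the form
# {"region": "North", "quarter": "Q1", "sales": 100}, ...
```

Matching for pivot table schema then asks, in reverse: which relational columns (`quarter`, `sales`) do the pivot columns `Q1` and `Q2` correspond to, semantically and at the value level?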
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PiLLar, the first framework for pivot table schema matching, formulated as an LLM-driven Monte-Carlo Tree Search paradigm that is training-free and adapts across domains using minimal annotated privacy-compliant data. It includes a theoretical analysis of error dynamics to prove asymptotic convergence, introduces the PTbench benchmark derived from four real-world domains via unpivot mining, sampling, and anonymization, and reports an average accuracy of 87.94% on correctly predicted matches, claiming superiority over alternatives.
Significance. If the empirical accuracy and convergence guarantees hold under the stated conditions, PiLLar would offer a meaningful advance for data integration over anonymized pivot tables in data lakes, addressing a gap in joint schema-value matching without task-specific fine-tuning. The combination of MCTS search with LLM guidance and the new PTbench benchmark could enable more generalizable methods for privacy-sensitive settings, provided the LLM steering remains reliable across domain shifts.
major comments (2)
- [Abstract] The reported average accuracy of 87.94% on correctly predicted matches is presented without any experimental detail on benchmark construction (e.g., number of tables per domain, sampling strategy, or how semantic coherence and value compatibility were judged), baseline comparisons, error bars, or statistical tests. This directly undermines the superiority and generalization claims, as the central empirical result lacks the protocol needed to assess reproducibility or the impact of anonymization.
- [Theoretical analysis] The claim of asymptotic convergence rests on the LLM's error dynamics remaining bounded and unbiased across anonymized, unseen domains, yet no equations, proof outline, or assumptions about the LLM's guidance reliability (e.g., how MCTS exploration compensates for potential semantic drift after anonymization) are provided. This premise is load-bearing for the training-free adaptation guarantee.
minor comments (2)
- [Title and Abstract] The phrasing 'matching for pivot table schema' is repeated in the title and abstract; consider standardizing to 'pivot table schema matching' for clarity.
- [Abstract] The abstract mentions 'extensive experiments' but provides no table or figure references; ensure the full manuscript includes clear result tables with per-domain breakdowns and baseline metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment point by point below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
Referee: [Abstract] The reported average accuracy of 87.94% on correctly predicted matches is presented without any experimental detail on benchmark construction (e.g., number of tables per domain, sampling strategy, or how semantic coherence and value compatibility were judged), baseline comparisons, error bars, or statistical tests. This directly undermines the superiority and generalization claims, as the central empirical result lacks the protocol needed to assess reproducibility or the impact of anonymization.
Authors: We agree that the abstract's brevity omits key experimental details, which could aid quick assessment of the claims. Full details on PTbench construction (including tables per domain, sampling, semantic coherence and value compatibility judgments), baselines, error bars, and statistical tests appear in Sections 4 and 5. We will revise the abstract to incorporate a concise summary of the benchmark and evaluation protocol, improving reproducibility without altering its length substantially. revision: yes
Referee: [Theoretical analysis] The claim of asymptotic convergence rests on the LLM's error dynamics remaining bounded and unbiased across anonymized, unseen domains, yet no equations, proof outline, or assumptions about the LLM's guidance reliability (e.g., how MCTS exploration compensates for potential semantic drift after anonymization) are provided. This premise is load-bearing for the training-free adaptation guarantee.
Authors: The theoretical analysis in Section 3 discusses error dynamics and asymptotic convergence under bounded LLM errors, but we acknowledge the current presentation lacks explicit equations, a full proof outline, and detailed assumptions on LLM reliability and MCTS compensation for semantic drift post-anonymization. We will revise the section to add the key equations, assumptions, and proof sketch, making the convergence argument more rigorous and transparent. revision: yes
Circularity Check
No significant circularity detected; derivation is self-contained
full rationale
The paper introduces PiLLar as a novel LLM-guided MCTS framework for pivot table schema matching, formulates it as a training-free search paradigm, provides a separate theoretical analysis of error dynamics for asymptotic convergence, and validates via a newly constructed PTbench benchmark with reported empirical accuracy of 87.94%. No load-bearing step reduces a claimed result to its own inputs by definition, fitted parameter, or self-citation chain; the central claims rest on experimental outcomes and an independent theoretical argument rather than tautological renaming or construction. The method's independence from task-specific training is explicitly stated and not derived from the accuracy metric itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLM-guided Monte-Carlo Tree Search converges asymptotically to correct schema-value matches.
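Read formally, this axiom amounts to a convergence statement of roughly the following shape — our hedged formalisation for orientation, not an equation taken from the paper:

```latex
% \hat{m}_n : the match returned after n MCTS simulations
% m^{*}    : the correct schema--value match
% Assuming the LLM's evaluation error stays bounded and unbiased,
% the assumption asserts
\lim_{n \to \infty} \Pr\!\left[\, \hat{m}_n = m^{*} \,\right] = 1
```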