RACT: Retrieval Augmented Column-Table Learning and Prediction for Multi-Table Schema Matching

Andreas Behrend; Enas Khwaileh; George Karabatis; Leonard Traeger

arxiv: 2606.07843 · v1 · pith:OTI5H3UDnew · submitted 2026-06-05 · 💻 cs.DB · cs.IR · cs.LG

Pith reviewed 2026-06-27 20:01 UTC · model grok-4.3

classification 💻 cs.DB · cs.IR · cs.LG

keywords matching · schema · columns · multi-table · tables · column · different · experiments

The pith

RACT retrieves candidate tables via referential context to constrain column candidates and raises multi-table schema matching precision and completeness by up to 70 percent over baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RACT, a self-supervised framework that first learns to probabilistically retrieve relevant tables for source columns using referential context. This retrieval step narrows the pool of possible column matches in heterogeneous multi-table schemas, where direct similarity between columns often fails because similar-meaning columns sit in tables with unrelated surrounding structure. Experiments show the full approach beats standard similarity baselines, and restricting the column search to columns from the top retrieved tables improves both precision and completeness by as much as 70 percent. The work targets the practical problem of integrating data across many tables whose designs differ in context and layout.

Core claim

The central claim is that exploiting referential context through probabilistic table retrieval allows the column search space to be reliably constrained in multi-table holistic schema matching, where similarity-based methods are inadequate, and that this constraint produces measurably higher matching precision and completeness.

What carries the argument

RACT learning and prediction, a self-supervised framework that performs probabilistic retrieval of candidate tables for source columns to constrain relevant column candidates.

If this is right

The RACT framework outperforms similarity-based baselines on multi-table schema matching tasks.
Constraining column candidates to those inside top-t retrieved tables improves average matching precision and completeness by up to 70 percent.
Referential context supplies useful signal when columns with similar semantics appear inside tables that have dissimilar surrounding structure.
The self-supervised retrieval step can be applied before any downstream column matcher to reduce the effective search space.
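
The retrieval-then-match pattern described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `constrain_candidates`, the score dictionaries, and the example schema are all hypothetical, and the table and column scores are assumed to come from some upstream model.

```python
# Hypothetical sketch: restrict column candidates to the top-t retrieved tables.
# All names and scores below are illustrative assumptions, not from the paper.

def constrain_candidates(table_scores, columns_by_table, column_scores, t=2, k=3):
    """Return the top-t tables and the top-k columns drawn only from those tables."""
    # Rank candidate tables by retrieval score and keep the best t.
    top_tables = sorted(table_scores, key=table_scores.get, reverse=True)[:t]
    # Only columns that live in a retained table remain eligible.
    allowed = {col for tbl in top_tables for col in columns_by_table[tbl]}
    # Rank the eligible columns by their column-level score and keep the best k.
    ranked = sorted(
        (col for col in column_scores if col in allowed),
        key=column_scores.get,
        reverse=True,
    )
    return top_tables, ranked[:k]


# Assumed retrieval scores for one source column.
table_scores = {"orders": 0.7, "customers": 0.2, "products": 0.1}
columns_by_table = {
    "orders": ["orders.id", "orders.cust_id"],
    "customers": ["customers.id", "customers.name"],
    "products": ["products.id", "products.name"],
}
# Assumed column-level similarity scores for the same source column.
column_scores = {
    "orders.id": 0.4,
    "orders.cust_id": 0.9,
    "customers.id": 0.8,
    "customers.name": 0.3,
    "products.id": 0.85,
    "products.name": 0.2,
}

top_tables, candidates = constrain_candidates(
    table_scores, columns_by_table, column_scores, t=2, k=3
)
print(top_tables)
print(candidates)
```

In this toy example `products.id` has a high column-level score but is dropped because its table falls outside the top-t set, which is the intended effect when the retriever is right and the failure mode when it is wrong.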

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same retrieval-first pattern could be tested on other data-integration subtasks such as foreign-key discovery across multiple sources.
If retrieval recall drops on schemas with very sparse referential links, the gains would shrink, suggesting a need for hybrid retrieval that also considers column-level signals early.
The approach may reduce the number of labeled column pairs needed for training because the table filter already removes most irrelevant candidates.
Scaling the method to schemas with hundreds of tables would test whether the top-t constraint remains effective when table diversity grows.

Load-bearing premise

Probabilistic retrieval of tables based on referential context must rank the tables that contain the correct column matches within the top-t results, even in heterogeneous schemas.

What would settle it

A dataset of multi-table schemas in which, for a substantial fraction of columns, the table containing the true match is ranked outside the top-t tables returned by the retrieval step, causing the subsequent matcher to miss correspondences.
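
One way to run this check is to measure table-level recall at each cutoff t. The sketch below is an assumption about how such a measurement could look, not a procedure taken from the paper; `table_recall_at_t`, `missed_columns`, and the toy rankings are hypothetical.

```python
# Hypothetical sketch: table-level recall@t for a table retriever.
# The rankings and gold labels below are made-up illustrative data.

def table_recall_at_t(ranked_tables, gold_table, t):
    """Fraction of source columns whose gold table is among the top-t retrieved tables."""
    hits = sum(
        1 for col, gold in gold_table.items() if gold in ranked_tables[col][:t]
    )
    return hits / len(gold_table)


def missed_columns(ranked_tables, gold_table, t):
    """Source columns whose true match becomes unreachable under the top-t constraint."""
    return sorted(
        col for col, gold in gold_table.items() if gold not in ranked_tables[col][:t]
    )


# Retrieved table rankings per source column (best first).
ranked_tables = {
    "src.a": ["T1", "T2", "T3"],
    "src.b": ["T2", "T3", "T1"],
    "src.c": ["T3", "T1", "T2"],
    "src.d": ["T1", "T3", "T2"],
}
# Table that actually contains the correct match for each source column.
gold_table = {"src.a": "T1", "src.b": "T1", "src.c": "T1", "src.d": "T2"}

for t in (1, 2, 3):
    print(t, table_recall_at_t(ranked_tables, gold_table, t))
print(missed_columns(ranked_tables, gold_table, 2))
```

A low value at the chosen t would indicate that many correspondences are lost before the column matcher runs.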

Figures

Figures reproduced from arXiv: 2606.07843 by Andreas Behrend, Enas Khwaileh, George Karabatis, Leonard Traeger.

**Figure 2.** Schema Matching with Column-Table Prediction.

**Figure 3.** RACT Learning and Prediction Framework for Schema Matching.

**Figure 4.** Impact of Serialization for Column Blocking.

**Figure 5.** Recall Performance (y-axis) at @top-t Table Prediction (x-axis) with Holistic(2) and Blocking @top-k Column (lines).

**Figure 6.** mAP Performance (y-axis) at @top-k Table Prediction (x-axis) with Holistic(2) and Matching @top-k Column (lines).

Original abstract

Schema matching, a critical task for integrating data from diverse sources, seeks to identify correspondences between columns across different schemas. In multi-table holistic schema matching, columns with similar semantic meaning may reside in tables with different contexts due to heterogeneous schema designs, where similarity-based techniques are inadequate. The focus of this paper is exploiting referential context into schema matching by introducing RACT learning and prediction, a self-supervised framework enabling the probabilistic retrieval of candidate tables for source columns to constrain relevant column candidates. Experiments demonstrate that this approach outperforms similarity-based baselines on matching multi-table schemas. In subsequent matching experiments, constraining the column search space via top-t tables improves both average matching precision and completeness by up to +70%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RACT adds table retrieval to constrain column matching in multi-table schemas and reports large gains, but the abstract gives no recall numbers or dataset details so the improvements are hard to trust.

The main point is that RACT learns to retrieve candidate tables via a self-supervised model on referential context, then restricts column matching to those tables, claiming up to 70% better precision and completeness than similarity baselines.

The combination of probabilistic table retrieval with column-table learning for holistic matching looks like the actual new piece. It targets the specific failure mode where semantically similar columns sit in tables with mismatched contexts, which pure similarity methods ignore.

The paper does a clear job stating why context matters and sketching a pipeline that uses observed FK/PK patterns to drive the retrieval step.

The soft spots are straightforward. The abstract supplies no information on datasets, exact baselines, statistical tests, or ablation results. The central concern is that the reported gains only make sense if the top-t retriever keeps every ground-truth table that contains a correct match. Nothing in the abstract indicates that the authors measured recall at the table level or tested on schemas outside the training distribution. If recall is incomplete, the numbers reflect an easier filtered problem rather than the full task.

This is aimed at people building data integration pipelines who need better candidate pruning. A practitioner might find the retrieval idea worth trying, but the work is incremental rather than foundational.

It deserves peer review so the experiments can be checked in full.

Referee Report

2 major / 2 minor

Summary. The paper introduces RACT, a self-supervised retrieval-augmented framework for multi-table schema matching that learns to probabilistically retrieve candidate tables using referential context (e.g., FK/PK patterns) in order to constrain the column search space. It claims this outperforms similarity-based baselines and that applying a top-t table constraint yields up to +70% gains in average matching precision and completeness.

Significance. If the central experimental claims hold after verification of recall and controls, the work could meaningfully advance holistic schema matching for heterogeneous multi-table sources by moving beyond pure column similarity. The self-supervised use of referential context is a plausible direction, but the absence of reported recall metrics for the table retriever leaves the robustness of the pipeline unproven.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the headline result that top-t table constraints improve precision and completeness by up to +70% is load-bearing for the central claim, yet no recall@t figures, precision-recall curves, or analysis of ground-truth tables excluded by the retriever are provided. Without these, the reported gains may reflect an easier filtered problem rather than end-to-end robustness.
[Abstract] Abstract: the empirical gains are stated without any description of the datasets used, the similarity-based baselines, statistical significance tests, or controls for confounding factors such as schema size or domain heterogeneity, preventing assessment of whether the +70% figure generalizes.
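
The dependence on retriever recall raised in the first major comment can be stated as a simple bound. The notation here is an assumption introduced for illustration and does not come from the paper: $G$ is the set of ground-truth correspondences $(c, c')$, $\tau(c')$ is the table containing target column $c'$, $R_t(c)$ is the set of top-t tables retrieved for source column $c$, and $M_{t,k}$ is the set of pairs returned by a top-k column matcher restricted to columns in $R_t(c)$.

```latex
% Table-level recall of the retriever at cutoff t
\mathrm{Rec}_{\mathrm{tab}}(t)
  = \frac{\bigl|\{(c, c') \in G : \tau(c') \in R_t(c)\}\bigr|}{|G|}

% End-to-end recall of the constrained matcher
\mathrm{Rec}_{\mathrm{e2e}}(t, k)
  = \frac{|M_{t,k} \cap G|}{|G|}

% Every pair in M_{t,k} has its target column in a retrieved table, hence
\mathrm{Rec}_{\mathrm{e2e}}(t, k) \le \mathrm{Rec}_{\mathrm{tab}}(t)
```

Under these definitions, reporting $\mathrm{Rec}_{\mathrm{tab}}(t)$ alongside the matching results would show how much of the ground truth remains reachable after the table filter.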

minor comments (2)

[Method] Notation for the probabilistic table retriever and the self-supervised training objective on referential context should be introduced with explicit equations rather than prose descriptions.
[Method / Experiments] The paper should clarify whether the top-t constraint is applied at inference only or also during training, and how t is chosen.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the experimental validation and clarity of the claims.

Referee: [Abstract / Experiments] Abstract and Experiments section: the headline result that top-t table constraints improve precision and completeness by up to +70% is load-bearing for the central claim, yet no recall@t figures, precision-recall curves, or analysis of ground-truth tables excluded by the retriever are provided. Without these, the reported gains may reflect an easier filtered problem rather than end-to-end robustness.

Authors: We agree that the absence of recall metrics for the table retriever leaves the end-to-end robustness unproven. In the revised manuscript we will add recall@t results across multiple values of t, precision-recall curves for the retriever component, and an explicit analysis of ground-truth tables excluded by the top-t constraint. These additions will allow readers to assess whether the reported +70% gains in matching precision and completeness arise from an easier filtered sub-problem or from the retrieval-augmented pipeline itself. revision: yes

Referee: [Abstract] Abstract: the empirical gains are stated without any description of the datasets used, the similarity-based baselines, statistical significance tests, or controls for confounding factors such as schema size or domain heterogeneity, preventing assessment of whether the +70% figure generalizes.

Authors: We acknowledge that the abstract, while concise, does not provide sufficient context for the headline result. We will revise the abstract to briefly name the datasets (including schema counts and domains), identify the similarity-based baselines, note that statistical significance testing was performed, and indicate that controls for schema size and domain heterogeneity were applied. The Experiments section already contains these details; the abstract update will make the +70% claim more interpretable without lengthening the paper substantially. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes a self-supervised retrieval-augmented framework for multi-table schema matching and reports empirical gains from top-t table constraints, but the provided text contains no equations, fitted parameters, or derivation steps that reduce to their own inputs by construction. Claims rest on experimental comparisons against baselines rather than self-referential logic or self-citation load-bearing premises. No instances of self-definitional relations, fitted inputs renamed as predictions, or ansatz smuggling via citation are present.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; ledger is therefore minimal and based solely on the high-level description. No explicit free parameters, axioms, or invented entities are stated.

[1] [1]

Ziawasch Abedjan, Patrick Schulze, and Felix Naumann. 2014. DFD: Efficient Functional Dependency Discovery. InProceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management (CIKM ’14). Association for Computing Machinery, New York, NY, USA, 949–958. https: //doi.org/10.1145/2661829.2661884

work page doi:10.1145/2661829.2661884 2014

[2] [2]

David Aumueller, Hong-Hai Do, Sabine Massmann, and Erhard Rahm. 2005. Schema and ontology matching with COMA++. InProceedings of the 2005 ACM SIGMOD international conference on Management of data. ACM, Baltimore Mary- land, 906–908. https://doi.org/10.1145/1066157.1066283

work page doi:10.1145/1066157.1066283 2005

[3] [3]

Daniel Ayala, Inma Hernández, David Ruiz, and Erhard Rahm. 2022. LEAPME: Learning-based Property Matching with Embeddings.Data & Knowledge Engi- neering137 (Jan. 2022), 101943. https://doi.org/10.1016/j.datak.2021.101943

work page doi:10.1016/j.datak.2021.101943 2022

[4] [4]

Gilbert Badaro, Mohammed Saeed, and Paolo Papotti. 2023. Transformers for Tabular Data Representation: A Survey of Models and Applications.Transactions of the Association for Computational Linguistics11 (March 2023), 227–249. https: //doi.org/10.1162/tacl_a_00544

work page doi:10.1162/tacl_a_00544 2023

[5] [5]

Zohra Bellahsene, Angela Bonifati, Fabien Duchateau, and Yannis Velegrakis

[6] [6]

InSchema Matching and Mapping, Zohra Bellahsene, Angela Bonifati, and Erhard Rahm (Eds.)

On Evaluating Schema Matching and Mapping. InSchema Matching and Mapping, Zohra Bellahsene, Angela Bonifati, and Erhard Rahm (Eds.). Springer, Berlin, Heidelberg, 253–291. https://doi.org/10.1007/978-3-642-16518-4_9

work page doi:10.1007/978-3-642-16518-4_9

[7] [7]

Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating Embeddings of Heterogeneous Relational Datasets for Data Integra- tion Tasks. InProceedings of the 2020 ACM SIGMOD International Conference on Management of Data (SIGMOD ’20). Association for Computing Machinery, New York, NY, USA, 1335–1349. https://doi.org/10.1145/3318464.3389742

work page doi:10.1145/3318464.3389742 2020

[8] [8]

Qahtan, Ahmed Elma- garmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang

Raul Castro Fernandez, Essam Mansour, Abdulhakim A. Qahtan, Ahmed Elma- garmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, and Nan Tang. 2018. Seeping Semantics: Linking Datasets Using Word Embeddings for Data Discovery. In2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, Paris, 989–1000. https://doi.org/10.110...

work page doi:10.1109/icde.2018.00093 2018

[9] [9]

Peter Pin-Shan Chen. 1976. The entity-relationship model—toward a unified view of data.ACM Trans. Database Syst.1, 1 (March 1976). https://doi.org/10. 1145/320434.320440

arXiv 1976

[10] [10]

Hong-Hai Do and Erhard Rahm. 2002. COMA: a system for flexible combination of schema matching approaches. InProceedings of the 28th international conference on Very Large Data Bases (VLDB ’02). VLDB Endowment, Hong Kong, China, 610–621

2002

[11] [11]

Kai Herrmann, Hannes Voigt, Andreas Behrend, Jonas Rausch, and Wolfgang Lehner. 2017. Living in Parallel Realities – Co-Existing Schema Versions with a Bidirectional Database Evolution Language. InProceedings of the 2017 ACM International Conference on Management of Data. 1101–1116. https://doi.org/10. 1145/3035918.3064046 arXiv:1608.05564 [cs]

arXiv 2017

[12] [12]

Benjamin Hättasch, Michael Truong-Ngoc, Andreas Schmidt, and Carsten Bin- nig. 2022. It’s AI Match: A Two-Step Approach for Schema Matching Using Embeddings. https://doi.org/10.48550/arXiv.2203.04366 arXiv:2203.04366 [cs]

work page doi:10.48550/arxiv.2203.04366 2022

[13] [13]

Jeff Johnson, Matthijs Douze, and Herve Jegou. 2021. Billion-Scale Similarity Search with GPUs.IEEE Transactions on Big Data7, 3 (July 2021), 535–547. https://doi.org/10.1109/TBDATA.2019.2921572

work page doi:10.1109/tbdata.2019.2921572 2021

[14] [14]

Miller, and Mirek Riedewald

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, and Mirek Riedewald. 2023. SANTOS: Relationship-based Se- mantic Table Union Search.Proc. ACM Manag. Data1, 1 (May 2023), 9:1–9:25. https://doi.org/10.1145/3588689

work page doi:10.1145/3588689 2023

[15] [15]

Henning Koehler and Sebastian Link. 2025. Orthogonal Keys High Precision and Recall for Mining Database Keys From Inconsistent and Incomplete Relations. IEEE Transactions on Knowledge and Data Engineering37, 11 (Nov. 2025), 6550–

2025

[16] [16]

https://doi.org/10.1109/TKDE.2025.3608680

work page doi:10.1109/tkde.2025.3608680 2025

[17] [17]

Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsi- fodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In2021 IEEE 37th International Conference on Data Engineering (ICDE). 468–479. https://doi.org/10.1109/ICDE51399.2021.0004...

work page doi:10.1109/icde51399.2021.00047 2021

[18]

Christos Koutras, Jiani Zhang, Xiao Qin, Chuan Lei, Vasileios Ioannidis, Christos Faloutsos, George Karypis, and Asterios Katsifodimos. 2024. OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories. https://doi.org/10.14778/3749646.3749715 Version Number: 1

[19]

Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, and Philip S. Yu. 2025. Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLM... https://doi.org/10.48550/arXiv.2507.09477

[20]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. Focal Loss for Dense Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 2 (Feb. 2020), 318–327. https://doi.org/10.1109/TPAMI.2018.2858826

[21]

Yurong Liu, Eduardo H. M. Pena, Aécio Santos, Eden Wu, and Juliana Freire. 2025. Magneto: Combining Small and Large Language Models for Schema Matching. Proceedings of the VLDB Endowment 18, 8 (April 2025), 2681–2694. https://doi.org/10.14778/3742728.3742757

[22]

Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. 2001. Generic Schema Matching with Cupid. VLDB (2001).

[23]

Marc Maynou, Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero, and Anna Queralt. 2026. Freyja: Efficient Join Discovery in Data Lakes. IEEE Transactions on Knowledge and Data Engineering 01 (Jan. 2026), 1–12. https://doi.org/10.1109/TKDE.2026.3656786

[24]

Venkata Vamsikrishna Meduri, Abdul Quamar, Chuan Lei, Xiao Qin, and Berthold Reinwald. 2024. Alfa: active learning for graph neural network-based semantic schema alignment. The VLDB Journal 33, 4 (July 2024), 981–1011. https://doi.org/10.1007/s00778-023-00822-z

[25]

S. Melnik, H. Garcia-Molina, and E. Rahm. 2002. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In Proceedings 18th International Conference on Data Engineering. IEEE Comput. Soc, San Jose, CA, USA, 117–128. https://doi.org/10.1109/ICDE.2002.994702

[26]

Matteo Paganelli, Domenico Beneventano, Francesco Guerra, and Paolo Sottovia. 2019. Parallelizing Computations of Full Disjunctions. Big Data Research 17 (Sept. 2019), 18–31. https://doi.org/10.1016/j.bdr.2019.07.002

[28]

Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: an experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (June 2015), 1082–1093. https://doi.org/10.14778/2794367.2794377

[29]

Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. The VLDB Journal 10, 4 (Dec. 2001), 334–350. https://doi.org/10.1007/s007780100057

[30]

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. https://doi.org/10.48550/ARXIV.1908.10084 Version Number: 1

[31]

Adel Remadi, Karim El Hage, Yasmina Hobeika, and Francesca Bugiotti. 2024. To prompt or not to prompt: Navigating the use of Large Language Models for integrating and modeling heterogeneous data. Data & Knowledge Engineering 152 (July 2024), 102313. https://doi.org/10.1016/j.datak.2024.102313

[32]

Eitam Sheetrit, Menachem Brief, Moshik Mishaeli, and Oren Elisha. 2024. ReMatch: Retrieval Enhanced Schema Matching with LLMs. https://doi.org/10.48550/arXiv.2403.01567 arXiv:2403.01567 [cs]

[33]

Roee Shraga and Avigdor Gal. 2021. PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching. https://doi.org/10.48550/arXiv.2109.07321 arXiv:2109.07321 [cs]

[34]

Pranav Subramaniam, Udayan Khurana, Kavitha Srinivas, and Horst Samulowitz. 2023. NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management. ACM, Birmingham United Kingdom, 5096–5100. https://doi.org/10.1145/3583780.3614750

[36]

Yoshihiko Suhara, Jinfeng Li, Yuliang Li, Dan Zhang, Çağatay Demiralp, Chen Chen, and Wang-Chiew Tan. 2022. Annotating Columns with Pre-trained Language Models. In Proceedings of the 2022 International Conference on Management of Data (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 1493–1503. https://doi.org/10.1145/3514221.3517906

[37]

Leonard Traeger, Andreas Behrend, and George Karabatis. 2025. SEALM: Semantically Enriched Attributes with Language Models for Linkage Recommendation. In Proceedings of the 27th International Conference on Enterprise Information Systems. SCITEPRESS - Science and Technology Publications, Porto, Portugal, 39–50. https://doi.org/10.5220/0013217700003929

[38]

Leonard Traeger, Andreas Behrend, and George Karabatis. 2026. Collaborative Scoping: Self-Supervised Linkability Assessment for Schema Matching. In Proceedings 29th International Conference on Extending Database Technology (1, Vol. 29). OpenProceedings.org, Tampere, Finland. https://doi.org/10.48786/EDBT.2026.03

[39]

Jianhong Tu, Ju Fan, Nan Tang, Peng Wang, Guoliang Li, Xiaoyong Du, Xiaofeng Jia, and Song Gao. 2023. Unicorn: A Unified Multi-tasking Model for Supporting Matching Tasks in Data Integration. Proceedings of the ACM on Management of Data 1, 1 (May 2023), 1–26. https://doi.org/10.1145/3588938

[40]

Sha Wang, Yuchen Li, Hanhua Xiao, Bing Tian Dai, Roy Ka-Wei Lee, Yanfei Dong, and Lambert Deng. 2025. LLMATCH: A Unified Schema Matching Framework with Large Language Models. https://doi.org/10.48550/arXiv.2507.10897 arXiv:2507.10897 [cs]

[41]

Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2011. Automatic discovery of attributes in relational databases. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD ’11). Association for Computing Machinery, New York, NY, USA, 109–120. https://doi.org/10.1145/1989323.1989336

[42]

Yunjia Zhang, Avrilia Floratou, Joyce Cahoon, Subru Krishnan, Andreas C. Müller, Dalitso Banda, Fotis Psallidas, and Jignesh M. Patel. 2023. Schema Matching using Pre-Trained Language Models. In 2023 IEEE 39th International Conference on Data Engineering (ICDE). IEEE, Anaheim, CA, USA, 1558–1571. https://doi.org/10.1109/ICDE55515.2023.00123

[43]

Yu Zhang, Di Mei, Haozheng Luo, Chenwei Xu, and Richard Tzong-Han Tsai. 2025. SMUTF: Schema Matching Using Generative Tags and Hybrid Features. Information Systems 133 (Aug. 2025), 102570. https://doi.org/10.1016/j.is.2025.102570

[45]

Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In Proceedings of the 2019 International Conference on Management of Data. ACM, Amsterdam Netherlands, 847–864. https://doi.org/10.1145/3299869.3300065