RelBench v2: A Large-Scale Benchmark and Repository for Relational Data
Pith reviewed 2026-05-16 01:47 UTC · model grok-4.3
The pith
Relational deep learning models outperform single-table baselines by modeling connections across multiple tables in databases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RelBench v2 expands the benchmark to 11 datasets with over 22 million rows by including new data from scholarly publications, enterprise systems, consumer platforms, and clinical records. It adds autocomplete tasks that require inferring missing attributes while respecting temporal constraints. The work also incorporates translated temporal graph data, access to over 70 real-world databases for pretraining, and additional multi-table tasks. Results demonstrate that relational deep learning models consistently surpass single-table baselines across these tasks.
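To make the autocomplete setting concrete, here is a minimal sketch, assuming a hypothetical sales-order table with a `created_at` timestamp and a `shipping_point` column to be inferred. The rows, the column names, and the majority-vote predictor are illustrative assumptions, not the paper's data or method. The only point is the temporal constraint: a prediction for a row may use only rows observed strictly before it.

```python
from collections import Counter
from datetime import date

# Hypothetical rows; column names are illustrative, not taken from RelBench v2.
ROWS = [
    {"id": 1, "customer": "A", "created_at": date(2024, 1, 5), "shipping_point": "north"},
    {"id": 2, "customer": "A", "created_at": date(2024, 2, 1), "shipping_point": "north"},
    {"id": 3, "customer": "B", "created_at": date(2024, 2, 9), "shipping_point": "south"},
    {"id": 4, "customer": "A", "created_at": date(2024, 3, 2), "shipping_point": "south"},
]


def visible_history(rows, target):
    """Rows a model may look at when filling in `target`: strictly earlier ones."""
    return [r for r in rows if r["created_at"] < target["created_at"]]


def autocomplete(rows, target, column, key="customer"):
    """Toy predictor: most frequent past value of `column` for the same `key`.

    Falls back to the most frequent past value overall, or None with no history.
    """
    history = visible_history(rows, target)
    same_key = [r[column] for r in history if r[key] == target[key]]
    pool = same_key or [r[column] for r in history]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]


# Hide the attribute of row 4 and infer it from earlier rows only.
masked = {k: v for k, v in ROWS[3].items() if k != "shipping_point"}
prediction = autocomplete(ROWS, masked, "shipping_point")
```

Because the filter is strict, the masked row can never contribute its own hidden value, and the earliest row in the table has no history to draw on.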
What carries the argument
The RelBench v2 benchmark, which standardizes evaluation of relational deep learning models across diverse datasets and new predictive tasks such as autocomplete.
If this is right
- Models that capture entity relationships in databases will achieve higher performance on prediction tasks than those that flatten data into single tables.
- Autocomplete tasks offer a practical way to test models on filling in incomplete relational data under time constraints.
- Unified access to many real-world databases supports development of larger relational foundation models.
- Future benchmarks for relational learning should include more datasets from varied domains to ensure robustness.
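The first implication above turns on the difference between flattening a database and keeping its structure. The following is a minimal sketch, assuming two hypothetical tables joined by a single foreign key. All table and column names are made up for illustration. The flattened view keeps only hand-picked aggregates per customer, while the graph view keeps every row as a typed node and every foreign key as an edge.

```python
# Hypothetical two-table database; names are illustrative only.
customers = [{"customer_id": 1, "region": "EU"}, {"customer_id": 2, "region": "US"}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 30.0},
    {"order_id": 11, "customer_id": 1, "amount": 70.0},
    {"order_id": 12, "customer_id": 2, "amount": 20.0},
]


def flatten(customers, orders):
    """Single-table view: one row per customer with hand-picked aggregates."""
    flat = []
    for c in customers:
        amounts = [o["amount"] for o in orders if o["customer_id"] == c["customer_id"]]
        flat.append({**c, "n_orders": len(amounts), "total_amount": sum(amounts)})
    return flat


def to_graph(customers, orders):
    """Relational view: every row is a typed node, every foreign key an edge."""
    nodes = {("customer", c["customer_id"]): c for c in customers}
    nodes.update({("order", o["order_id"]): o for o in orders})
    edges = [(("order", o["order_id"]), ("customer", o["customer_id"])) for o in orders]
    return nodes, edges


flat_table = flatten(customers, orders)
nodes, edges = to_graph(customers, orders)
```

In the flattened table the individual order amounts are gone; in the graph they remain attached to their own nodes, which is the information a relational model can exploit.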
Where Pith is reading between the lines
- Practitioners working with enterprise data could see gains by shifting from single-table machine learning to relational approaches using this benchmark.
- The integration of external resources suggests that pretraining on diverse relational data could become standard for database-related AI tasks.
- Researchers might explore how these improvements scale with model size or dataset complexity beyond the current experiments.
Load-bearing premise
The new datasets and tasks accurately represent real-world relational data challenges, and the single-table baselines provide a fair comparison without hidden advantages.
What would settle it
Finding that single-table models match or exceed relational model performance on these tasks after applying advanced feature engineering or data preprocessing techniques.
Original abstract
Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.
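The translation of temporal-graph event streams into relational schemas can be pictured roughly as follows. This is a minimal sketch under assumed conventions: events arrive as hypothetical `(src, dst, timestamp, weight)` tuples, and the target layout (two entity tables plus one event table with foreign keys) is an illustration, not the schema the paper actually uses.

```python
def events_to_tables(events):
    """Split a (src, dst, t, weight) event stream into three relational tables.

    Returns entity tables for sources and destinations (primary keys only)
    and an event table whose rows reference them via foreign keys.
    """
    src_table = [{"src_id": s} for s in sorted({e[0] for e in events})]
    dst_table = [{"dst_id": d} for d in sorted({e[1] for e in events})]
    event_table = [
        {"event_id": i, "src_id": s, "dst_id": d, "timestamp": t, "weight": w}
        for i, (s, d, t, w) in enumerate(sorted(events, key=lambda e: e[2]))
    ]
    return src_table, dst_table, event_table


# Hypothetical stream: (source node, destination node, unix time, edge weight).
stream = [("u2", "i1", 300, 1.0), ("u1", "i1", 100, 0.5), ("u1", "i2", 200, 2.0)]
src_table, dst_table, event_table = events_to_tables(stream)
```

Keeping the timestamp as a column of the event table is what would let the same temporal filtering used for forecasting tasks apply to the translated data.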
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RelBench v2, a major expansion of the RelBench benchmark for relational deep learning. It adds four large-scale datasets (scholarly, ERP, consumer, clinical) for a total of 11 datasets with over 22 million rows across 29 tables, introduces autocomplete tasks that infer missing attributes under temporal constraints, and integrates external resources including translated Temporal Graph Benchmark streams, ReDeLEx for 70+ databases, and 4DBInfer tasks. Experiments claim that RDL models consistently outperform single-table baselines on autocomplete, forecasting, and recommendation tasks.
Significance. If the empirical results hold with proper verification, RelBench v2 would be a significant contribution by providing a scalable, realistic benchmark suite for RDL that includes new task types and cross-benchmark integrations. This could standardize evaluation and support progress toward relational foundation models, especially given the scale and diversity of the added datasets.
major comments (2)
- [Experimental Evaluation] Experimental results section: the central claim that RDL models 'consistently outperform single-table baselines' is presented without exact metric values, baseline implementation details, data split descriptions, or statistical significance tests, making the outperformance impossible to verify from the reported information.
- [§4] §4 (new datasets and tasks): the representativeness of the four new datasets and autocomplete tasks for real-world relational challenges is asserted but not supported by any analysis of domain coverage, temporal dynamics, or comparison to existing benchmarks beyond size metrics.
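The data-split descriptions requested in the first major comment could be stated as explicit time cutoffs. A minimal sketch, assuming rows carry a hypothetical `timestamp` field and that the two cutoffs are chosen by the benchmark authors; the values below are arbitrary placeholders.

```python
def temporal_split(rows, val_cutoff, test_cutoff, time_key="timestamp"):
    """Assign rows to train/val/test by time so no future row leaks backwards.

    train: t < val_cutoff; val: val_cutoff <= t < test_cutoff; test: t >= test_cutoff.
    """
    if val_cutoff >= test_cutoff:
        raise ValueError("val_cutoff must precede test_cutoff")
    split = {"train": [], "val": [], "test": []}
    for row in rows:
        t = row[time_key]
        if t < val_cutoff:
            split["train"].append(row)
        elif t < test_cutoff:
            split["val"].append(row)
        else:
            split["test"].append(row)
    return split


# Hypothetical rows with integer timestamps; the cutoffs are arbitrary.
rows = [{"id": i, "timestamp": t} for i, t in enumerate([5, 12, 20, 27, 33])]
split = temporal_split(rows, val_cutoff=15, test_cutoff=30)
```

Reporting the two cutoffs per dataset would be enough for a reader to reproduce the partition exactly.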
minor comments (2)
- [Introduction] The integration with ReDeLEx and 4DBInfer is described at a high level but lacks explicit citations or version information for the external frameworks.
- Figure and table captions could more clearly indicate which tasks and metrics are being compared to allow readers to trace the outperformance claims without cross-referencing the text.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review of our manuscript on RelBench v2. We appreciate the emphasis on experimental verifiability and the need for deeper analysis of the new datasets and tasks. We address each major comment below and outline the specific revisions we will implement.
Point-by-point responses
- Referee: [Experimental Evaluation] Experimental results section: the central claim that RDL models 'consistently outperform single-table baselines' is presented without exact metric values, baseline implementation details, data split descriptions, or statistical significance tests, making the outperformance impossible to verify from the reported information.
  Authors: We agree that the experimental results section requires more granular reporting to support verification of the central claim. The manuscript currently summarizes outperformance without providing the supporting details. In the revised version, we will expand this section to include exact metric values (e.g., in expanded tables for autocomplete, forecasting, and recommendation tasks), full baseline implementation details (including code references and hyperparameter settings), explicit descriptions of data splits (including temporal constraints), and statistical significance tests (such as paired t-tests across multiple runs) to substantiate the results.
  Revision: yes
- Referee: [§4] §4 (new datasets and tasks): the representativeness of the four new datasets and autocomplete tasks for real-world relational challenges is asserted but not supported by any analysis of domain coverage, temporal dynamics, or comparison to existing benchmarks beyond size metrics.
  Authors: We acknowledge that the current presentation in §4 focuses primarily on scale and integration aspects without sufficient supporting analysis of representativeness. In the revision, we will add a new subsection to §4 that analyzes domain coverage (scholarly, ERP, consumer, and clinical), temporal dynamics (e.g., time spans, event frequencies, and constraint handling in autocomplete tasks), and explicit comparisons to existing benchmarks such as the original RelBench, Temporal Graph Benchmark, and others, going beyond size metrics to highlight relevance to real-world relational challenges.
  Revision: yes
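The rebuttal mentions paired t-tests across multiple runs. A minimal sketch of the test statistic follows, using made-up scores for two hypothetical models evaluated on the same seeds. The numbers are placeholders, not results from the paper, and the function name is an illustrative choice.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-run differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder metric values for five seeds; not taken from the paper.
relational = [0.71, 0.73, 0.70, 0.74, 0.72]
single_table = [0.68, 0.69, 0.69, 0.70, 0.69]
t_stat, dof = paired_t_statistic(relational, single_table)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel`, which also returns the p-value. Pairing by seed matters because it removes run-to-run variance shared by both models.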
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: LogicNat ≃ Nat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "RelBench v2 adds four large-scale relational datasets... autocomplete tasks... Temporal Graph Benchmark integration"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- KumoRFM-2: Scaling Foundation Models for Relational Learning
  KumoRFM-2 pre-trains on synthetic and real relational data across row, column, foreign-key and cross-sample axes, injects task information early, and achieves up to 8% gains over supervised baselines on 41 benchmarks ...
- RelAgent: LLM Agents as Data Scientists for Relational Learning
  RelAgent uses an LLM agent to autonomously generate SQL feature programs paired with classical models for interpretable relational learning predictions that execute efficiently on standard databases.