pith. machine review for the scientific record.

arxiv: 2602.12606 · v2 · submitted 2026-02-13 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

RelBench v2: A Large-Scale Benchmark and Repository for Relational Data


Pith reviewed 2026-05-16 01:47 UTC · model grok-4.3

classification 💻 cs.LG
keywords: relational deep learning · benchmark · relational data · autocomplete tasks · multi-table prediction · database learning · temporal constraints

The pith

Relational deep learning models outperform single-table baselines by modeling connections across multiple tables in databases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RelBench v2, an expanded benchmark for relational deep learning that adds four new large-scale datasets and introduces autocomplete tasks for predicting missing values in relational tables. It integrates external benchmarks to allow unified evaluation of models on relational and temporal data. Experiments show that models using relational structure perform better than single-table approaches on autocomplete, forecasting, and recommendation tasks. This supports the idea that explicitly handling relationships between data entities improves accuracy in real-world database applications.

Core claim

RelBench v2 expands the benchmark to 11 datasets with over 22 million rows by including new data from scholarly publications, enterprise systems, consumer platforms, and clinical records. It adds autocomplete tasks that require inferring missing attributes while respecting temporal constraints. The work also incorporates translated temporal graph data, access to over 70 real-world databases for pretraining, and additional multi-table tasks. Results demonstrate that relational deep learning models consistently surpass single-table baselines across these tasks.
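
To make the autocomplete setting concrete, here is a minimal sketch of how such a task might be assembled. It assumes a hypothetical sales-order-item table with a `created_at` timestamp and a `plant` column to predict. The helper names, the cutoff-based split, and the toy rows are illustrative assumptions, not the RelBench v2 API or its data.

```python
from datetime import date

# Hypothetical rows of a sales-order-item table; `plant` is the attribute to autocomplete.
ROWS = [
    {"item_id": 1, "order_id": 10, "created_at": date(2020, 1, 5), "plant": "P1"},
    {"item_id": 2, "order_id": 10, "created_at": date(2020, 1, 5), "plant": "P2"},
    {"item_id": 3, "order_id": 11, "created_at": date(2020, 3, 1), "plant": "P1"},
    {"item_id": 4, "order_id": 12, "created_at": date(2020, 6, 9), "plant": "P3"},
]


def build_autocomplete_task(rows, key_col, target_col, time_col, cutoff):
    """Split rows at `cutoff` and hide the target column on the evaluation side.

    Rows created strictly before the cutoff keep their labels and can serve as
    training examples. Rows at or after the cutoff have the target removed, so
    a model never sees the value it is asked to fill in.
    """
    train, test_inputs, test_labels = [], [], {}
    for row in rows:
        if row[time_col] < cutoff:
            train.append(dict(row))
        else:
            test_inputs.append({k: v for k, v in row.items() if k != target_col})
            test_labels[row[key_col]] = row[target_col]
    return train, test_inputs, test_labels


def visible_context(rows, time_col, as_of):
    """Return only rows that existed strictly before `as_of` (no future leakage)."""
    return [r for r in rows if r[time_col] < as_of]


train, test_inputs, test_labels = build_autocomplete_task(
    ROWS, key_col="item_id", target_col="plant", time_col="created_at", cutoff=date(2020, 3, 1)
)
```

The same `visible_context` filter would apply to every linked table when gathering features for a masked row, which is one simple way to read "respecting temporal constraints".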

What carries the argument

RelBench v2 benchmark, which standardizes evaluation of relational deep learning models across diverse datasets and new predictive tasks like autocomplete.

If this is right

  • Models that capture entity relationships in databases will achieve higher performance on prediction tasks than those that flatten data into single tables.
  • Autocomplete tasks offer a practical way to test models on filling in incomplete relational data under time constraints.
  • Unified access to many real-world databases supports development of larger relational foundation models.
  • Future benchmarks for relational learning should include more datasets from varied domains to ensure robustness.
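
The contrast between flattening and modeling relationships can be sketched on a toy example. The two tables, column names, and helper functions below are hypothetical and chosen only for illustration. The graph view follows the common RDL convention of one node per row and one edge per foreign-key reference.

```python
from collections import defaultdict

# Hypothetical two-table database: customers and their orders (foreign key customer_id).
CUSTOMERS = [{"customer_id": 1, "region": "EU"}, {"customer_id": 2, "region": "US"}]
ORDERS = [
    {"order_id": 100, "customer_id": 1, "amount": 20.0},
    {"order_id": 101, "customer_id": 1, "amount": 35.0},
    {"order_id": 102, "customer_id": 2, "amount": 5.0},
]


def flatten_to_single_table(customers, orders):
    """Single-table view: collapse each customer's orders into fixed aggregates."""
    totals, counts = defaultdict(float), defaultdict(int)
    for o in orders:
        totals[o["customer_id"]] += o["amount"]
        counts[o["customer_id"]] += 1
    return [
        {**c, "n_orders": counts[c["customer_id"]], "total_amount": totals[c["customer_id"]]}
        for c in customers
    ]


def build_row_graph(customers, orders):
    """Relational view: one node per row, one edge per foreign-key reference."""
    nodes = {("customer", c["customer_id"]): c for c in customers}
    nodes.update({("order", o["order_id"]): o for o in orders})
    edges = [(("order", o["order_id"]), ("customer", o["customer_id"])) for o in orders]
    return nodes, edges


flat = flatten_to_single_table(CUSTOMERS, ORDERS)
nodes, edges = build_row_graph(CUSTOMERS, ORDERS)
```

The flattened table keeps only the aggregates someone chose to compute, while the graph keeps every order row reachable from its customer, so a model can learn its own aggregation.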

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners working with enterprise data could see gains by shifting from single-table machine learning to relational approaches using this benchmark.
  • The integration of external resources suggests that pretraining on diverse relational data could become standard for database-related AI tasks.
  • Researchers might explore how these improvements scale with model size or dataset complexity beyond the current experiments.

Load-bearing premise

The new datasets and tasks accurately represent real-world relational data challenges, and the single-table baselines provide a fair comparison without hidden advantages.

What would settle it

Finding that single-table models match or exceed relational model performance on these tasks after applying advanced feature engineering or data preprocessing techniques.

Figures

Figures reproduced from arXiv: 2602.12606 by Charilaos Kanatsoulis, Fengyu Li, Haiming Tang, Jure Leskovec, Justin Gu, Mark Znidar, Martin Jurkovic, Parth Shroff, Pranshu Chaturvedi, Rishabh Ranjan, Valter Hudovernik.

Figure 1. RELBENCH schema of the newly added Sales Autocompletion Linked Business Tables (SALT) dataset (Klein et al., 2024).
Figure 2. RELBENCH schema of the newly added arXiv-physics dataset (Tang et al., 2024).
C.1 Autocomplete task: motivation. Autocomplete tasks were inspired by the sales order autocomplete task from the SAP S/4HANA Sales Order user interface.
Figure 3. RELBENCH schema of the newly added RateBeer dataset.
Figure 4. RELBENCH schema of the newly added MIMIC-IV v3.1 dataset (Johnson et al., 2024).
Figure 5. Illustrative example of a real-world autocomplete task, where the SAP S/4HANA Sales ...
Original abstract

Relational deep learning (RDL) has emerged as a powerful paradigm for learning directly on relational databases by modeling entities and their relationships across multiple interconnected tables. As this paradigm evolves toward larger models and relational foundation models, scalable and realistic benchmarks are essential for enabling systematic evaluation and progress. In this paper, we introduce RelBench v2, a major expansion of the RelBench benchmark for RDL. RelBench v2 adds four large-scale relational datasets spanning scholarly publications, enterprise resource planning, consumer platforms, and clinical records, increasing the benchmark to 11 datasets comprising over 22 million rows across 29 tables. We further introduce autocomplete tasks, a new class of predictive objectives that require models to infer missing attribute values directly within relational tables while respecting temporal constraints, expanding beyond traditional forecasting tasks constructed via SQL queries. In addition, RelBench v2 expands beyond its native datasets by integrating external benchmarks and evaluation frameworks: we translate event streams from the Temporal Graph Benchmark into relational schemas for unified relational-temporal evaluation, interface with ReDeLEx to provide uniform access to 70+ real-world databases suitable for pretraining, and incorporate 4DBInfer datasets and tasks to broaden multi-table prediction coverage. Experimental results demonstrate that RDL models consistently outperform single-table baselines across autocomplete, forecasting, and recommendation tasks, highlighting the importance of modeling relational structure explicitly.
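
One plausible way to picture the translation of a temporal event stream into a relational schema is sketched below. The abstract does not spell out the exact layout, so the two-table design (`nodes` and `events`), the column names, and the `events_to_tables` helper are assumptions for illustration only.

```python
def events_to_tables(events):
    """Turn (src, dst, timestamp, weight) events into two relational tables.

    Each distinct endpoint becomes a row in `nodes`. Each event becomes a row in
    `events` with foreign keys `src_id` and `dst_id` and a time column `ts`.
    """
    node_ids = {}
    event_rows = []
    for event_id, (src, dst, ts, weight) in enumerate(events):
        for name in (src, dst):
            # Assign the next free integer id the first time a name is seen.
            node_ids.setdefault(name, len(node_ids))
        event_rows.append(
            {
                "event_id": event_id,
                "src_id": node_ids[src],
                "dst_id": node_ids[dst],
                "ts": ts,
                "weight": weight,
            }
        )
    node_rows = [{"node_id": i, "name": n} for n, i in node_ids.items()]
    event_rows.sort(key=lambda r: r["ts"])  # chronological order for temporal splits
    return {"nodes": node_rows, "events": event_rows}


# Hypothetical toy stream; real benchmark streams are far larger.
STREAM = [("u1", "i1", 3, 1.0), ("u2", "i1", 1, 0.5), ("u1", "i2", 2, 2.0)]
tables = events_to_tables(STREAM)
```

Treating each event as its own row with a timestamp lets the same time-based splitting used for native relational tasks apply to the converted stream.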

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RelBench v2, a major expansion of the RelBench benchmark for relational deep learning. It adds four large-scale datasets (scholarly, ERP, consumer, clinical) for a total of 11 datasets with over 22 million rows across 29 tables, introduces autocomplete tasks that infer missing attributes under temporal constraints, and integrates external resources including translated Temporal Graph Benchmark streams, ReDeLEx for 70+ databases, and 4DBInfer tasks. Experiments claim that RDL models consistently outperform single-table baselines on autocomplete, forecasting, and recommendation tasks.

Significance. If the empirical results hold with proper verification, RelBench v2 would be a significant contribution by providing a scalable, realistic benchmark suite for RDL that includes new task types and cross-benchmark integrations. This could standardize evaluation and support progress toward relational foundation models, especially given the scale and diversity of the added datasets.

major comments (2)
  1. [Experimental Evaluation] Experimental results section: the central claim that RDL models 'consistently outperform single-table baselines' is presented without exact metric values, baseline implementation details, data split descriptions, or statistical significance tests, making the outperformance impossible to verify from the reported information.
  2. [§4] §4 (new datasets and tasks): the representativeness of the four new datasets and autocomplete tasks for real-world relational challenges is asserted but not supported by any analysis of domain coverage, temporal dynamics, or comparison to existing benchmarks beyond size metrics.
minor comments (2)
  1. [Introduction] The integration with ReDeLEx and 4DBInfer is described at a high level but lacks explicit citations or version information for the external frameworks.
  2. Figure and table captions could more clearly indicate which tasks and metrics are being compared to allow readers to trace the outperformance claims without cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript on RelBench v2. We appreciate the emphasis on experimental verifiability and the need for deeper analysis of the new datasets and tasks. We address each major comment below and outline the specific revisions we will implement.

Point-by-point responses
  1. Referee: [Experimental Evaluation] Experimental results section: the central claim that RDL models 'consistently outperform single-table baselines' is presented without exact metric values, baseline implementation details, data split descriptions, or statistical significance tests, making the outperformance impossible to verify from the reported information.

    Authors: We agree that the experimental results section requires more granular reporting to support verification of the central claim. The manuscript currently summarizes outperformance without providing the supporting details. In the revised version, we will expand this section to include exact metric values (e.g., in expanded tables for autocomplete, forecasting, and recommendation tasks), full baseline implementation details (including code references and hyperparameter settings), explicit descriptions of data splits (including temporal constraints), and statistical significance tests (such as paired t-tests across multiple runs) to substantiate the results.

    Revision: yes

  2. Referee: [§4] §4 (new datasets and tasks): the representativeness of the four new datasets and autocomplete tasks for real-world relational challenges is asserted but not supported by any analysis of domain coverage, temporal dynamics, or comparison to existing benchmarks beyond size metrics.

    Authors: We acknowledge that the current presentation in §4 focuses primarily on scale and integration aspects without sufficient supporting analysis of representativeness. In the revision, we will add a new subsection to §4 that analyzes domain coverage (scholarly, ERP, consumer, and clinical), temporal dynamics (e.g., time spans, event frequencies, and constraint handling in autocomplete tasks), and explicit comparisons to existing benchmarks such as the original RelBench, Temporal Graph Benchmark, and others, going beyond size metrics to highlight relevance to real-world relational challenges.

    Revision: yes
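
The paired t-test mentioned in the first response can be sketched as follows. For per-run differences $d_i$ between two models evaluated on the same seeds, the statistic is $t = \bar{d} / (s_d / \sqrt{n})$ with $n - 1$ degrees of freedom. The scores below are made-up placeholders, not results from the paper, and `paired_t_statistic` is a hypothetical helper.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one score per seed."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Made-up placeholder scores for illustration only, not results from the paper.
rdl_runs = [0.80, 0.82, 0.81, 0.83, 0.79]
baseline_runs = [0.75, 0.78, 0.76, 0.77, 0.76]
t_stat, dof = paired_t_statistic(rdl_runs, baseline_runs)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom. In practice, `scipy.stats.ttest_rel` runs the same paired test and also returns a p-value.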

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark construction and evaluation paper with no mathematical derivations, fitted parameters, or new postulated entities; all content rests on standard database and machine-learning assumptions.

pith-pipeline@v0.9.0 · 5586 in / 1040 out tokens · 80283 ms · 2026-05-16T01:47:40.348966+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. KumoRFM-2: Scaling Foundation Models for Relational Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    KumoRFM-2 pre-trains on synthetic and real relational data across row, column, foreign-key and cross-sample axes, injects task information early, and achieves up to 8% gains over supervised baselines on 41 benchmarks ...

  2. RelAgent: LLM Agents as Data Scientists for Relational Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    RelAgent uses an LLM agent to autonomously generate SQL feature programs paired with classical models for interpretable relational learning predictions that execute efficiently on standard databases.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 2 Pith papers · 2 internal anchors

  1. [1]

    Relgnn: Composite message passing for relational deep learning. arXiv preprint arXiv:2502.06784,

    Tianlang Chen, Charilaos Kanatsoulis, and Jure Leskovec. Relgnn: Composite message passing for relational deep learning. arXiv preprint arXiv:2502.06784,

  2. [2]

    Relational graph transformer. arXiv preprint arXiv:2505.10960, 2025a

    Vijay Prakash Dwivedi, Sri Jaladi, Yangyi Shen, Federico López, Charilaos I Kanatsoulis, Rishi Puri, Matthias Fey, and Jure Leskovec. Relational graph transformer. arXiv preprint arXiv:2505.10960, 2025a. Vijay Prakash Dwivedi, Charilaos Kanatsoulis, Shenyang Huang, and Jure Leskovec. Relational deep learning: Challenges, foundations and next-generation arc...

  3. [3]

    Pytorch frame: A modular framework for multi-modal tabular learning. arXiv preprint arXiv:2404.00776,

    Weihua Hu, Yiwen Yuan, Zecheng Zhang, Akihiro Nitta, Kaidi Cao, Vid Kocijan, Jure Leskovec, and Matthias Fey. Pytorch frame: A modular framework for multi-modal tabular learning. arXiv preprint arXiv:2404.00776,

  4. [4]

    Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, and Roger Mark

    URL https://arxiv.org/abs/2506.00710. Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Brian Gow, Benjamin Moody, Steven Horng, Leo Anthony Celi, and Roger Mark. MIMIC-IV. PhysioNet, October

  5. [5]

    URL https://doi.org/10.13026/kpb9-mt58

    doi: 10.13026/kpb9-mt58. URL https://doi.org/10.13026/kpb9-mt58. Version 3.1. Charilaos Kanatsoulis, Evelyn Choi, Stefanie Jegelka, Jure Leskovec, and Alejandro Ribeiro. Learning efficient positional encodings with graph neural networks. In The Thirteenth International Conference on Learning Representations,

  6. [6]

    Joint Relational Database Generation via Graph-Conditional Diffusion Models

    Mohamed Amine Ketata, David Lüdke, Leo Schwinn, and Stephan Günnemann. Joint relational database generation via graph-conditional diffusion models. arXiv preprint arXiv:2505.16527,

  7. [7]

    SALT: Sales autocompletion linked business tables dataset

    Tassilo Klein, Clemens Biehl, Margarida Costa, Andre Sres, Jonas Kolk, and Johannes Hoffart. SALT: Sales autocompletion linked business tables dataset. In NeurIPS 2024 Third Table Representation Learning Workshop,

  8. [8]

    Jan Motl and Oliver Schulte

    URL https://arxiv.org/abs/2602.04029. Jan Motl and Oliver Schulte. The CTU Prague relational learning repository,

  9. [9]

    Jakub Peleška and Gustav Šír

    URL https://arxiv.org/abs/1511.03086. Jakub Peleška and Gustav Šír. Transformers meet relational databases,

  10. [10]

    Jakub Peleška and Gustav Šír

    URL https://arxiv.org/abs/2412.05218. Jakub Peleška and Gustav Šír. ReDeLEx: A framework for relational deep learning exploration,

  11. [11]

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan

    URL https://arxiv.org/abs/2506.22199. Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. In Forty-second International Conference on Machine Learning,

  12. [12]

    Relational transformer: Toward zero-shot foundation models for relational data. arXiv preprint arXiv:2510.06377,

    Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, and Jure Leskovec. Relational transformer: Toward zero-shot foundation models for relational data. arXiv preprint arXiv:2510.06377,

  13. [13]

    BPR: Bayesian Personalized Ranking from Implicit Feedback

    Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618,

  14. [14]

    Temporal graph networks for deep learning on dynamic graphs. ICML Workshop on Graph Representation Learning 2020,

    Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. ICML Workshop on Graph Representation Learning 2020,

  15. [15]

    Haiming Tang, Sirui He, Mengjie Li, and Zhimao Guo

    URL https://arxiv.org/abs/2410.13516. Haiming Tang, Sirui He, Mengjie Li, and Zhimao Guo. arxiv-physics: A large-scale physics citation and authorship dataset. https://github.com/PKUTHM/arxiv-physics,

  16. [16]

    Under review

    A RELATED WORK. Relational deep learning (RDL). RDL studies how to train neural models directly on relational databases by leveraging their multi-table structure. RDL represents a relational database as a heterogeneous graph, where rows correspond to entities and foreign-key relationships define edges between them (Fey et al., 2024...

  17. [17]

    These models leverage supervised (Hollmann et al., 2023

    and efficient fine-tuning (Kim et al., 2024). These models leverage supervised (Hollmann et al., 2023

  18. [18]

    or self-supervised (Spinaci et al., 2024; Kim et al.,

  19. [19]

    Extending such models to relational databases is challenging due to the presence of multiple tables connected via foreign-key relationships

    pretraining on real and synthetic tabular datasets. Extending such models to relational databases is challenging due to the presence of multiple tables connected via foreign-key relationships. To address this, relational foundation models have recently been proposed. For example, Fey et al. (2025) introduce KumoRFM, a graph-transformer-based architecture...

  20. [20]

    In contrast, in RELBENCH v2 we collect a large number of realistic databases in a uniformly accessible manner

    as well as generating synthetic databases from scratch using random graphs and Structural Causal Models (SCMs) (Kothapalli et al., 2026). In contrast, in RELBENCH v2 we collect a large number of realistic databases in a uniformly accessible manner. B DATASET SCHEMAS. Figure 1: RELBENCH schema of the newly added Sales Autocompletion Linked Business Tables (SA...

  21. [21]

    • item-shippoint: For each sales order item, predict its shipping point (dispatch location)

    9. rel-salt Autocomplete Classification: • item-plant: For each sales order item, predict its plant (production/storage facility). • item-shippoint: For each sales order item, predict its shipping point (dispatch location). • item-incoterms: For each sales order item, predict its item-level international commercial terms. • sales-office: For each sales ord...

  22. [22]

    event-as-node

    The consistency of these defaults across task types highlights the robustness of RDL models, which perform well without extensive hyperparameter optimization. For the rel-ratebeer recommendation tasks, we adjusted the batch size to 64 when training two-layer GraphSAGE, two-layer ID-GNN, and four-layer ID-GNN models to accommodate GPU memory constraints. W...