Incorporating Deep Learning Design in Database Queries
Pith reviewed 2026-06-30 14:13 UTC · model grok-4.3
The pith
Database queries can be lifted to jointly handle relational data and learnable tuple embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By representing tuple provenance as learnable vector embeddings and lifting relational algebra operators to act on both the data and these embeddings, queries can directly realize the computations performed by graph neural networks over relational data.
What carries the argument
Lifted relational queries that propagate and aggregate tuple embeddings according to the query structure.
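As a rough illustration of what "lifted" could mean here, the sketch below models an embedded relation as a list of (row, embedding) pairs and defines a join and a projection that also act on the embeddings. The function names, the toy tables, and the choice of operations (element-wise product on join, sum on projection, in the spirit of provenance semirings) are assumptions made for illustration. They are not taken from the paper or from RelaNN.

```python
from collections import defaultdict

# Assumption: an embedded relation is a list of (row, embedding) pairs,
# where row is a dict of attribute values and embedding is a list of floats.

def lifted_join(left, right, key, combine):
    """Equi-join on `key`; each output embedding is combine(left_emb, right_emb)."""
    index = defaultdict(list)
    for row, emb in right:
        index[row[key]].append((row, emb))
    out = []
    for lrow, lemb in left:
        for rrow, remb in index.get(lrow[key], []):
            out.append(({**lrow, **rrow}, combine(lemb, remb)))
    return out

def lifted_project(rel, attrs, aggregate):
    """Project onto `attrs`; embeddings of rows that collapse together are aggregated."""
    groups = defaultdict(list)
    for row, emb in rel:
        groups[tuple(row[a] for a in attrs)].append(emb)
    return [(dict(zip(attrs, k)), aggregate(embs)) for k, embs in groups.items()]

def vec_sum(embs):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(xs) for xs in zip(*embs)]

def vec_mul(a, b):
    # Element-wise product of two equal-length vectors.
    return [x * y for x, y in zip(a, b)]

# Hypothetical toy data: customers and their orders, each tuple with an embedding.
customers = [({"cid": 1}, [1.0, 2.0]), ({"cid": 2}, [0.5, 0.5])]
orders = [({"cid": 1, "oid": 10}, [1.0, 1.0]),
          ({"cid": 1, "oid": 11}, [2.0, 0.0]),
          ({"cid": 2, "oid": 12}, [4.0, 4.0])]

joined = lifted_join(orders, customers, "cid", vec_mul)
per_customer = lifted_project(joined, ["cid"], vec_sum)
```

Here `per_customer` holds one embedding per customer, built from the embeddings of that customer's orders. In a trainable setting the embeddings and any parameters inside `combine` or `aggregate` would be tensors with gradients; plain lists are used only to keep the sketch dependency-free.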
If this is right
- Graph neural network models become expressible as standard database queries.
- The engineering overhead of data export to external ML systems is eliminated.
- Database optimizations can be applied directly to neural computations.
- Models including graph convolutional networks, heterogeneous graph transformers, and hypergraph networks can be implemented this way.
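To make the first and last points concrete, here is a minimal sketch of one graph-convolution layer in the Kipf and Welling style, written as a join of an edge relation with node embeddings followed by a group-by on the target node. The function name `gcn_layer`, the dictionary-based data layout, and the toy graph are assumptions for illustration and do not reflect RelaNN's actual interface.

```python
import math
from collections import defaultdict

def gcn_layer(node_emb, edges, weight):
    """One GCN-style layer as join + group-by.

    node_emb: dict node -> list[float] of length d_in
    edges:    list of directed (src, dst) pairs; for an undirected graph,
              list each edge in both directions
    weight:   list of rows, shape (d_in, d_out)
    """
    # Add self-loops, as in the standard GCN formulation.
    all_edges = list(edges) + [(v, v) for v in node_emb]
    # Degree (including the self-loop) as a group-by count over targets.
    deg = defaultdict(int)
    for _, dst in all_edges:
        deg[dst] += 1
    d_in, d_out = len(weight), len(weight[0])
    # Join Edge(src, dst) with Node(src), then aggregate per dst with a
    # sum weighted by the symmetric normalization 1 / sqrt(deg_src * deg_dst).
    agg = {v: [0.0] * d_in for v in node_emb}
    for src, dst in all_edges:
        coeff = 1.0 / math.sqrt(deg[src] * deg[dst])
        for i, x in enumerate(node_emb[src]):
            agg[dst][i] += coeff * x
    # Linear map followed by ReLU.
    out = {}
    for v, h in agg.items():
        z = [sum(h[i] * weight[i][j] for i in range(d_in)) for j in range(d_out)]
        out[v] = [max(0.0, t) for t in z]
    return out

# Hypothetical two-node undirected graph.
emb = {"a": [2.0, 0.0], "b": [0.0, 4.0]}
edge_list = [("a", "b"), ("b", "a")]
W = [[1.0, 0.0], [0.0, -1.0]]
out = gcn_layer(emb, edge_list, W)
```

The loop over `all_edges` plays the role of the join and the accumulation into `agg[dst]` plays the role of the group-by aggregate; a database engine would be free to pick its own physical plan for both.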
Where Pith is reading between the lines
- The approach may enable training and inference entirely inside the database without data movement.
- It could generalize to other types of neural architectures that operate on relational structures.
- Query planners might automatically optimize the embedding computations for better performance.
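One rewrite a planner could plausibly exploit is the linearity of the weight matrix: transforming each neighbor embedding and then summing gives the same result as summing first and transforming once. The sketch below checks that identity on toy values. The helper names are hypothetical and the example says nothing about which optimizations RelaNN or any real planner performs.

```python
def matvec(h, weight):
    """Row vector h (length d_in) times weight (d_in x d_out)."""
    return [sum(h[i] * weight[i][j] for i in range(len(h)))
            for j in range(len(weight[0]))]

def vec_sum(vectors):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(xs) for xs in zip(*vectors)]

def transform_then_sum(neighbors, weight):
    # One matrix product per neighbor, followed by a sum.
    return vec_sum([matvec(h, weight) for h in neighbors])

def sum_then_transform(neighbors, weight):
    # A single matrix product applied to the summed embedding.
    return matvec(vec_sum(neighbors), weight)

# Hypothetical neighbor embeddings and weight matrix.
neighbors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weight = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]

before = transform_then_sum(neighbors, weight)
after = sum_then_transform(neighbors, weight)
```

Both variants agree on these inputs, but the second performs one matrix product instead of three. The identity holds only for linear maps and sum-like aggregates, so it would not carry over unchanged to steps that apply a nonlinearity before aggregation.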
Load-bearing premise
The interactions induced by relational joins are fully captured by the manipulations that graph neural networks perform on tuple embeddings.
What would settle it
Finding a relational deep learning task where no lifted query reproduces the output of the corresponding graph neural network on the same input data and embeddings.
Original abstract
Deep learning over relational databases is conventionally realized by translating data into graph representations and applying graph-based neural networks within external frameworks. This round-trip between the database and external machine learning (ML) systems introduces non-trivial engineering overhead. In effect, these graph neural networks operate on tuple embeddings and manipulate them in ways that capture the interactions induced by relational joins. Given this natural correspondence, there is no fundamental reason why specifying a neural network over relational data should be substantially harder than querying it. We propose an approach that naturally integrates deep learning with database queries. The key idea is to associate each tuple with provenance, represented as a vector embedding with learnable parameters. Queries are lifted to operate jointly on data and embeddings, mapping input relations with embedded tuples to output relations with embedded tuples. This approach provides a declarative foundation for relational deep learning, facilitating integration with database systems, optimization, and wide adoption. We describe RelaNN, a proof-of-concept implementation of this approach built on top of PyTorch and cuDF. We illustrate the utility of RelaNN by implementing various graph-learning models, including graph convolutional networks, heterogeneous graph transformers, hypergraph neural networks and deep homomorphism networks. The simplicity of the programs and their competitive runtime performance demonstrate a concrete path toward making the implementation of state-of-the-art neural networks over databases as simple as writing a query.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes lifting relational database queries to jointly operate on data tuples and associated learnable provenance embeddings, enabling declarative specification of deep learning models (such as GCNs, heterogeneous graph transformers, and hypergraph NNs) directly over relational data without external graph frameworks. It presents RelaNN, a PyTorch/cuDF proof-of-concept, and claims that the natural correspondence between relational joins and GNN operations on embeddings allows simple query-based implementations with competitive runtime.
Significance. If the lifted operators are shown to be semantically equivalent to reference GNN implementations (including multi-hop aggregation, normalization, and heterogeneous edge handling), the work could meaningfully reduce engineering overhead in relational deep learning and support tighter DB-ML integration. The absence of accuracy results, embedding comparisons, or equivalence verification in the provided description limits the assessed significance to a promising but unvalidated direction.
major comments (2)
- [Abstract] The central claim, that lifted queries can reproduce the message-passing, aggregation, and update steps of models such as heterogeneous graph transformers, rests on an unverified "natural correspondence". The manuscript reports only that the models were implemented and that runtime is competitive; it gives no accuracy numbers, embedding comparisons, or output-equivalence checks against reference implementations.
- [Abstract] The description of RelaNN and the lifted operators provides no derivation, formal semantics, or proof that the embedding manipulations preserve the exact aggregation/normalization behavior of the target GNNs (e.g., attention in heterogeneous transformers or hyperedge aggregation); without this, the declarative foundation claim cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for stronger verification of the claimed correspondence between lifted relational operators and GNN computations. We address the two major comments below and will incorporate revisions to provide the requested evidence and formal details.
- Referee: [Abstract] The central claim, that lifted queries can reproduce the message-passing, aggregation, and update steps of models such as heterogeneous graph transformers, rests on an unverified "natural correspondence". The manuscript reports only that the models were implemented and that runtime is competitive; it gives no accuracy numbers, embedding comparisons, or output-equivalence checks against reference implementations.
  Authors: The manuscript grounds the claim in the natural structural correspondence between relational joins and the multi-hop neighborhood aggregations performed by GNNs, which is illustrated through the concrete RelaNN implementations of GCNs, heterogeneous graph transformers, and hypergraph NNs. We agree, however, that the abstract and evaluation sections would be strengthened by explicit verification. We will add a new subsection reporting (i) output-equivalence checks (element-wise L2 distance and cosine similarity on embeddings) against reference implementations in PyTorch Geometric and (ii) end-to-end accuracy on standard node-classification benchmarks for each model.
  Revision: yes
- Referee: [Abstract] The description of RelaNN and the lifted operators provides no derivation, formal semantics, or proof that the embedding manipulations preserve the exact aggregation/normalization behavior of the target GNNs (e.g., attention in heterogeneous transformers or hyperedge aggregation); without this, the declarative foundation claim cannot be evaluated.
  Authors: The current text presents the lifting via an intuitive mapping from join-induced interactions to embedding operations but does not supply a formal semantics or equivalence proof for the more involved cases (attention coefficients, normalization constants, hyperedge pooling). We will revise the manuscript by inserting a new section that (a) defines the lifted relational operators with precise algebraic semantics and (b) sketches the equivalence arguments for the supported GNN families, including the handling of heterogeneous attention and hyperedge aggregation.
  Revision: yes
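The output-equivalence checks promised in the first response could look roughly like the sketch below, which compares two sets of node embeddings by per-node L2 distance and cosine similarity. The function names, the dictionary layout, and the tolerance value are assumptions for illustration; the rebuttal does not specify how the checks would be implemented.

```python
import math

def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        # Convention chosen here: two zero vectors count as identical.
        return 1.0 if na == nb else 0.0
    return dot / (na * nb)

def equivalence_report(reference, candidate, tol=1e-5):
    """Compare two dicts node -> embedding; both must cover the same nodes."""
    if reference.keys() != candidate.keys():
        raise ValueError("node sets differ")
    worst_l2 = max((l2_distance(reference[v], candidate[v]) for v in reference),
                   default=0.0)
    worst_cos = min((cosine_similarity(reference[v], candidate[v]) for v in reference),
                    default=1.0)
    return {"max_l2": worst_l2, "min_cosine": worst_cos,
            "equivalent": worst_l2 <= tol}

# Hypothetical embeddings from a reference model and a lifted query.
ref = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
same = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
off = {"a": [1.0, 0.0], "b": [0.0, 2.5]}

report_same = equivalence_report(ref, same)
report_off = equivalence_report(ref, off)
```

Note that `off` differs from `ref` only in magnitude, so its cosine similarity stays at 1 while the L2 distance does not. This is one reason to report both metrics rather than cosine similarity alone.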
Circularity Check
No circularity: new lifted-query machinery introduced without reduction to fitted inputs or self-citations
The paper proposes associating tuples with learnable vector embeddings and lifting relational queries to operate jointly on data and embeddings. This is presented as a new declarative foundation rather than a derivation from prior fitted quantities. No equations define a target quantity in terms of itself, no parameters are fitted on a subset and then renamed as predictions, and no load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked. The RelaNN implementation and the example models (GCN, HGT, hypergraph NNs) are offered as direct evidence of utility, so the central claim stands on its own and can be checked against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- embedding dimension and parameters
axioms (1)
- domain assumption: Relational joins induce interactions that can be captured by operations on tuple embeddings
invented entities (1)
- tuple provenance embeddings (no independent evidence)
Forward citations
Cited by 1 Pith paper
- Neuro-Relational Programs: Unifying Queries and Neural Computation over Structured Data
  NRPs extend Datalog with embedding operations to create a single formalism readable as both query plans with trainable parts and neural architectures with relational structure.