pith. machine review for the scientific record.

arxiv: 2605.10511 · v1 · submitted 2026-05-11 · 💻 cs.DB

Recognition: no theorem link

Data Path Fusion in GPU for Analytical Query Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:07 UTC · model grok-4.3

classification 💻 cs.DB
keywords GPU · analytical query processing · data path fusion · host-device communication · database engines · TPC-H · SSB · compression

The pith

Fusing IO, decompression, and query steps into one GPU kernel cuts host-device transfers for analytical workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets inefficiencies in existing GPU database engines that stem from repeated data movement between CPU and GPU plus execution fragmented across many separate kernels. It proposes Data Path Fusion to combine IO, decompression, and query operations inside a single GPU kernel, so the whole sequence runs without returning to the host in between. The method adds GPU-friendly support for type-specific compression, variable-length fields, and direct IO. On standard analytical benchmarks, the fused design delivers speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach. The result points to a practical way to let GPUs handle end-to-end query work more efficiently.
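To make the fusion concrete, here is a minimal CUDA sketch, not the paper's implementation: a toy column compressed with frame-of-reference encoding (a 32-bit base plus 8-bit deltas) is decoded and filtered either by two stage kernels that materialize the decoded column, or by one fused kernel that decodes into registers and applies the predicate in the same launch. The kernel names and the compression scheme are illustrative assumptions.

// Minimal sketch, not the paper's DPF code. A frame-of-reference
// compressed column (32-bit base + 8-bit deltas) is decoded and filtered
// either in two kernels with a materialized intermediate, or in one
// fused kernel that keeps decoded values in registers.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void decodeKernel(const uint8_t* deltas, int32_t base,
                             int32_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = base + deltas[i];            // stage 1: decompress
}

__global__ void filterKernel(const int32_t* vals, int32_t lo, int32_t hi,
                             unsigned long long* hits, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && vals[i] >= lo && vals[i] < hi)
        atomicAdd(hits, 1ULL);                       // stage 2: predicate
}

// Fused path: decode and filter in one launch; the decoded column never
// leaves registers, so no intermediate buffer is written or re-read.
__global__ void fusedKernel(const uint8_t* deltas, int32_t base,
                            int32_t lo, int32_t hi,
                            unsigned long long* hits, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int32_t v = base + deltas[i];
        if (v >= lo && v < hi) atomicAdd(hits, 1ULL);
    }
}

int main() {
    const int n = 1 << 20;
    uint8_t* h = new uint8_t[n];
    for (int i = 0; i < n; ++i) h[i] = uint8_t(i % 251);

    uint8_t* dDeltas; int32_t* dVals; unsigned long long *dA, *dB;
    cudaMalloc(&dDeltas, n);
    cudaMalloc(&dVals, n * sizeof(int32_t));
    cudaMalloc(&dA, sizeof(unsigned long long));
    cudaMalloc(&dB, sizeof(unsigned long long));
    cudaMemcpy(dDeltas, h, n, cudaMemcpyHostToDevice);
    cudaMemset(dA, 0, sizeof(unsigned long long));
    cudaMemset(dB, 0, sizeof(unsigned long long));

    const int threads = 256, blocks = (n + threads - 1) / threads;
    decodeKernel<<<blocks, threads>>>(dDeltas, 1000, dVals, n);   // staged
    filterKernel<<<blocks, threads>>>(dVals, 1100, 1200, dA, n);
    fusedKernel<<<blocks, threads>>>(dDeltas, 1000, 1100, 1200, dB, n);

    unsigned long long a = 0, b = 0;
    cudaMemcpy(&a, dA, sizeof a, cudaMemcpyDeviceToHost);
    cudaMemcpy(&b, dB, sizeof b, cudaMemcpyDeviceToHost);
    printf("staged hits=%llu, fused hits=%llu\n", a, b);          // equal
    delete[] h;
    return 0;
}

The two paths produce identical counts; the fused one saves a launch and never writes the decoded column to device memory. The paper's DPF goes further by also pulling storage IO into the kernel through a GPU-driven IO mechanism (the BaM-style baselines named in Figures 6 and 7), which this toy omits.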

Core claim

Data Path Fusion integrates the sequence of data-path operations—including IOs, decompression, and query operations—into a single GPU kernel, thereby reducing host-device communication overheads and enabling more efficient utilization of GPU resources for analytical query workloads.

What carries the argument

Data Path Fusion (DPF), the architecture that places IO, decompression and query processing inside one GPU kernel while incorporating type-specific compression and variable-length attribute handling.

If this is right

  • Host-device communication volume drops because intermediate results no longer cross the PCIe bus after each stage (a back-of-envelope sketch follows this list).
  • GPU resources stay occupied longer inside one kernel instead of idling between multiple kernel launches.
  • End-to-end analytical queries can execute directly on the GPU without returning control to the CPU after each operation.
  • Type-specific compression and variable-length support integrate naturally inside the fused kernel.
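A back-of-envelope sketch of the first two bullets. Every constant is an assumption chosen for illustration (a 25 GB/s effective PCIe link, a 5 µs launch overhead, a 4 GB intermediate result), not a measurement from the paper:

// Host-side arithmetic only; all constants are illustrative assumptions.
#include <cstdio>

int main() {
    const double pcieGBps = 25.0;  // assumed effective PCIe 4.0 x16 bandwidth
    const double launchUs = 5.0;   // assumed per-kernel launch overhead
    const double interGB  = 4.0;   // assumed intermediate result size, GB
    const int    stages   = 3;     // e.g., IO staging, decompression, query

    // Staged plan: each of the first (stages - 1) boundaries ships the
    // intermediate across the bus and back; the fused plan pays neither.
    double transferS = 2.0 * (stages - 1) * interGB / pcieGBps;
    double launchS   = stages * launchUs * 1e-6;
    printf("staged overhead: %.2f s transfers + %.1f us launches\n",
           transferS, launchS * 1e6);
    return 0;
}

At these assumed numbers the transfers dominate (about 0.64 s), while launch overhead only becomes material once an engine issues hundreds of small kernels per query, which is the fragmentation the kernel-invocation counts in Figure 5 speak to.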

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar fusion patterns could be tested on other GPU-accelerated data tasks such as graph analytics or machine-learning feature pipelines.
  • The approach may reduce reliance on specialized interconnect hardware if the software-level fusion already captures most of the available bandwidth.
  • Future database engines might adopt single-kernel data paths as a default rather than an optimization.

Load-bearing premise

Combining IO, decompression and query work into one kernel can be done without creating new bottlenecks or correctness problems that cancel out the savings from fewer host-device transfers.
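One concrete way fusion could backfire: a larger fused kernel can consume more registers per thread and thereby lower occupancy. A minimal sketch of how that could be checked with CUDA's occupancy API; the kernel here is a hypothetical stand-in, not DPF's:

// Sketch: ask how many blocks of a (hypothetical) fused kernel fit per SM.
// A register-heavy fused kernel can report lower occupancy than the
// separate stage kernels it replaces, eating into the transfer savings.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fusedKernelStub(const unsigned char* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;   // stand-in for decode + filter work
}

int main() {
    const int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, fusedKernelStub, blockSize, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = double(blocksPerSM * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    printf("blocks/SM: %d, occupancy: %.0f%%\n",
           blocksPerSM, occupancy * 100.0);
    return 0;
}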

What would settle it

A direct timing comparison in which the fused kernel plus any added internal overhead takes longer overall than the sum of separate kernels and host-device transfers on the same TPC-H or SSB queries.
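That comparison could be run with an event-based harness along these lines; the kernels are toy stand-ins, and a real harness would measure the actual host-device transfers rather than a bare synchronization:

// Sketch: event-time a staged path (two launches with a host round trip
// between them) against a fused path on identical data. Toy kernels only;
// the staged path's host step is modeled by a device synchronization.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stageA(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}
__global__ void stageB(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2;
}
__global__ void fusedAB(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (buf[i] + 1) * 2;   // same result, one launch
}

int main() {
    const int n = 1 << 24, threads = 256, blocks = (n + threads - 1) / threads;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    cudaEvent_t s, e;
    cudaEventCreate(&s); cudaEventCreate(&e);
    float stagedMs = 0, fusedMs = 0;

    cudaEventRecord(s);
    stageA<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();                // host round trip between stages
    stageB<<<blocks, threads>>>(d, n);
    cudaEventRecord(e); cudaEventSynchronize(e);
    cudaEventElapsedTime(&stagedMs, s, e);

    cudaMemset(d, 0, n * sizeof(int));      // reset input for a fair rerun
    cudaEventRecord(s);
    fusedAB<<<blocks, threads>>>(d, n);
    cudaEventRecord(e); cudaEventSynchronize(e);
    cudaEventElapsedTime(&fusedMs, s, e);

    printf("staged: %.3f ms, fused: %.3f ms\n", stagedMs, fusedMs);
    cudaFree(d); cudaEventDestroy(s); cudaEventDestroy(e);
    return 0;
}

If the fused time came out consistently larger on the real TPC-H or SSB kernels, that would be the refutation described above.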

Figures

Figures reproduced from arXiv: 2605.10511 by Kazuo Goda, Tsuyoshi Ozawa.

Figure 1: Overall architecture of Data Path Fusion (DPF).

Figure 2: Internal structure of a fused GPU kernel.

Figure 3: Page layout for fixed-length attributes.

Figure 5: Comparison of end-to-end query response, kernel invocations, and total IO volume on TPC-H (left column) and SSB.

Figure 6: Sensitivity analysis of page sizes. DPF remains robust across all page sizes, offering speedups of 1.89 to 22.17 over the baseline case (GiDP). (Panels: (a) TPC-H Q3; (b) TPC-H Q13. Axes: execution time [msec] vs. scale factor. Series: GiDP, GiDP+BaM, GiDP+BaM+KF, DPF.)

Figure 7: Data scalability analysis. DPF remains robust across all scale factors, offering speedups of 2.72 to 20.69 over the baseline case (GiDP).

Figure 8: Query selectivity sensitivity analysis. DPF provides consistent speedups of 2.26 to 5.66 across all selectivity values, regardless of whether pruning is effective for the query predicate.
read the original abstract

One major technical challenge for modern analytical database systems is how to leverage GPU to exploit their massive parallelism and high bandwidth. Yet, existing GPU-driven database engines suffer from inefficiencies caused by frequent host-device interactions and fragmented execution across multiple GPU kernels, limiting their ability to fully utilize GPU's computational and IO capabilities. This paper proposes Data Path Fusion (DPF), a novel GPU-driven data processing architecture that integrates a sequence of data path operations -- including IOs, decompression, and query operations -- into a single GPU kernel. By fusing the data path, DPF reduces host-device communication overheads and enables more efficient utilization of GPU resources for analytical query workloads. DPF seamlessly integrates GPU-friendly optimization techniques, including type-specific compression/decompression, variable-length attribute support, and state-of-the-art GPU-driven IO mechanism, to work in concert, enabling efficient end-to-end query execution directly on GPU. Through extensive experimental evaluation using a prototyped DPF-based GPU-driven database engine (DPFProto) with analytical benchmark workloads, this paper demonstrates that DPF achieves speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach in the representative configuration. Our results show that DPF effectively unlocks the computational and IO potential of modern GPU, providing a promising direction for next-generation analytical database systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Data Path Fusion (DPF), a GPU architecture for analytical query processing that fuses IO, decompression, and query operations into a single kernel to reduce host-device communication overhead. It integrates type-specific compression/decompression, variable-length attribute support, and GPU-driven IO mechanisms. Using a prototype (DPFProto), the authors report speedups of 2.66-6.22× on TPC-H and 3.84-16.81× on SSB over the state-of-the-art in representative configurations.

Significance. If the speedups can be cleanly attributed to single-kernel fusion rather than other integrated optimizations, the work would be significant for GPU-accelerated databases by showing how to better utilize GPU compute and IO bandwidth for end-to-end analytical queries. The engineering integration of multiple techniques into one kernel addresses a recognized inefficiency in prior systems and provides concrete benchmark numbers on TPC-H and SSB.

major comments (2)
  1. [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.
  2. [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.
minor comments (1)
  1. [Abstract] The phrase 'in the representative configuration' is used for the speedup ranges but is not defined; clarify which queries, scale factors, or hardware settings correspond to the minimum and maximum values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results without misrepresenting the work.

read point-by-point responses
  1. Referee: [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.

    Authors: We appreciate the referee's observation on attribution. The manuscript positions Data Path Fusion as the central mechanism that fuses IO, decompression, and query operations into one kernel, thereby enabling the listed optimizations to execute without inter-kernel overheads and host-device transfers. The state-of-the-art baselines we evaluate do not perform this fusion, resulting in fragmented execution even when they incorporate subsets of the other techniques. The reported speedups therefore reflect the end-to-end benefit of the fused architecture. That said, we agree that an explicit ablation isolating the fusion step (while holding compression, variable-length handling, and IO mechanisms constant) would make the contribution clearer. We will add this ablation study to the revised experimental evaluation section. revision: yes

  2. Referee: [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.

    Authors: We agree that additional experimental details are required for reproducibility. In the revised manuscript we will expand the experimental evaluation section to specify the hardware platform (GPU model, host CPU, memory hierarchy), benchmark data sizes and scale factors for TPC-H and SSB, the number of runs performed with error bars or standard deviations, precise descriptions of the baseline implementations (including which optimizations each baseline contains), and a discussion of resource management within the single-kernel design, including register allocation strategies and handling of warp divergence for variable-length decoding. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system proposal with benchmark validation

full rationale

The paper proposes an engineering architecture (Data Path Fusion) that fuses IO, decompression, and query operations into a single GPU kernel, then validates it via prototype implementation and empirical speedups on TPC-H and SSB benchmarks. No derivation chain, mathematical predictions, or first-principles results exist that could reduce to inputs by construction. Performance numbers are measured outcomes, not fitted parameters renamed as predictions, and no self-citation or ansatz is invoked to justify core claims. The work is self-contained as an empirical evaluation of a new system design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the work appears to rely on standard GPU programming assumptions and existing compression techniques.

pith-pipeline@v0.9.0 · 5542 in / 1214 out tokens · 65324 ms · 2026-05-12T04:07:25.748349+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

138 extracted references · 138 canonical work pages
