pith. machine review for the scientific record.

arxiv: 2605.10511 · v1 · submitted 2026-05-11 · 💻 cs.DB

Recognition: no theorem link

Data Path Fusion in GPU for Analytical Query Processing

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:07 UTC · model grok-4.3

classification 💻 cs.DB
keywords GPU · analytical query processing · data path fusion · host-device communication · database engines · TPC-H · SSB · compression

The pith

Fusing IO, decompression, and query steps into one GPU kernel cuts host-device transfers for analytical workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets inefficiencies in existing GPU database engines that stem from repeated data movement between CPU and GPU plus execution fragmented across many separate kernels. It proposes Data Path Fusion to combine IO, decompression, and query operations inside a single GPU kernel, so the whole sequence runs without returning to the host in between. The method adds GPU-friendly support for type-specific compression, variable-length fields, and direct IO. On standard analytical benchmarks, the fused design delivers speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach. The result points to a practical way to let GPUs handle end-to-end query work more efficiently.
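To make the fusion concrete, here is a minimal CUDA sketch, not the paper's implementation: a toy column compressed with frame-of-reference encoding (a 32-bit base plus 8-bit deltas) is decoded and filtered either by two stage kernels that materialize the decoded column, or by one fused kernel that decodes into registers and applies the predicate in the same launch. The kernel names and the compression scheme are illustrative assumptions.

// Minimal sketch, not the paper's DPF code. A frame-of-reference
// compressed column (32-bit base + 8-bit deltas) is decoded and filtered
// either in two kernels with a materialized intermediate, or in one
// fused kernel that keeps decoded values in registers.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void decodeKernel(const uint8_t* deltas, int32_t base,
                             int32_t* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = base + deltas[i];            // stage 1: decompress
}

__global__ void filterKernel(const int32_t* vals, int32_t lo, int32_t hi,
                             unsigned long long* hits, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && vals[i] >= lo && vals[i] < hi)
        atomicAdd(hits, 1ULL);                       // stage 2: predicate
}

// Fused path: decode and filter in one launch; the decoded column never
// leaves registers, so no intermediate buffer is written or re-read.
__global__ void fusedKernel(const uint8_t* deltas, int32_t base,
                            int32_t lo, int32_t hi,
                            unsigned long long* hits, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int32_t v = base + deltas[i];
        if (v >= lo && v < hi) atomicAdd(hits, 1ULL);
    }
}

int main() {
    const int n = 1 << 20;
    uint8_t* h = new uint8_t[n];
    for (int i = 0; i < n; ++i) h[i] = uint8_t(i % 251);

    uint8_t* dDeltas; int32_t* dVals; unsigned long long *dA, *dB;
    cudaMalloc(&dDeltas, n);
    cudaMalloc(&dVals, n * sizeof(int32_t));
    cudaMalloc(&dA, sizeof(unsigned long long));
    cudaMalloc(&dB, sizeof(unsigned long long));
    cudaMemcpy(dDeltas, h, n, cudaMemcpyHostToDevice);
    cudaMemset(dA, 0, sizeof(unsigned long long));
    cudaMemset(dB, 0, sizeof(unsigned long long));

    const int threads = 256, blocks = (n + threads - 1) / threads;
    decodeKernel<<<blocks, threads>>>(dDeltas, 1000, dVals, n);   // staged
    filterKernel<<<blocks, threads>>>(dVals, 1100, 1200, dA, n);
    fusedKernel<<<blocks, threads>>>(dDeltas, 1000, 1100, 1200, dB, n);

    unsigned long long a = 0, b = 0;
    cudaMemcpy(&a, dA, sizeof a, cudaMemcpyDeviceToHost);
    cudaMemcpy(&b, dB, sizeof b, cudaMemcpyDeviceToHost);
    printf("staged hits=%llu, fused hits=%llu\n", a, b);          // equal
    delete[] h;
    return 0;
}

The two paths produce identical counts; the fused one saves a launch and never writes the decoded column to device memory. The paper's DPF goes further by also pulling storage IO into the kernel through a GPU-driven IO mechanism (the BaM-style baselines named in Figures 6 and 7), which this toy omits.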

Core claim

Data Path Fusion integrates the sequence of data-path operations—including IOs, decompression, and query operations—into a single GPU kernel, thereby reducing host-device communication overheads and enabling more efficient utilization of GPU resources for analytical query workloads.

What carries the argument

Data Path Fusion (DPF), the architecture that places IO, decompression and query processing inside one GPU kernel while incorporating type-specific compression and variable-length attribute handling.

If this is right

  • Host-device communication volume drops because intermediate results no longer cross the PCIe bus after each stage (a back-of-envelope sketch follows this list).
  • GPU resources stay occupied longer inside one kernel instead of idling between multiple kernel launches.
  • End-to-end analytical queries can execute directly on the GPU without returning control to the CPU after each operation.
  • Type-specific compression and variable-length support integrate naturally inside the fused kernel.
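A back-of-envelope sketch of the first two bullets. Every constant is an assumption chosen for illustration (a 25 GB/s effective PCIe link, a 5 µs launch overhead, a 4 GB intermediate result), not a measurement from the paper:

// Host-side arithmetic only; all constants are illustrative assumptions.
#include <cstdio>

int main() {
    const double pcieGBps = 25.0;  // assumed effective PCIe 4.0 x16 bandwidth
    const double launchUs = 5.0;   // assumed per-kernel launch overhead
    const double interGB  = 4.0;   // assumed intermediate result size, GB
    const int    stages   = 3;     // e.g., IO staging, decompression, query

    // Staged plan: each of the first (stages - 1) boundaries ships the
    // intermediate across the bus and back; the fused plan pays neither.
    double transferS = 2.0 * (stages - 1) * interGB / pcieGBps;
    double launchS   = stages * launchUs * 1e-6;
    printf("staged overhead: %.2f s transfers + %.1f us launches\n",
           transferS, launchS * 1e6);
    return 0;
}

At these assumed numbers the transfers dominate (about 0.64 s), while launch overhead only becomes material once an engine issues hundreds of small kernels per query, which is the fragmentation the kernel-invocation counts in Figure 5 speak to.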

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar fusion patterns could be tested on other GPU-accelerated data tasks such as graph analytics or machine-learning feature pipelines.
  • The approach may reduce reliance on specialized interconnect hardware if the software-level fusion already captures most of the available bandwidth.
  • Future database engines might adopt single-kernel data paths as a default rather than an optimization.

Load-bearing premise

Combining IO, decompression and query work into one kernel can be done without creating new bottlenecks or correctness problems that cancel out the savings from fewer host-device transfers.
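One concrete way fusion could backfire: a larger fused kernel can consume more registers per thread and thereby lower occupancy. A minimal sketch of how that could be checked with CUDA's occupancy API; the kernel here is a hypothetical stand-in, not DPF's:

// Sketch: ask how many blocks of a (hypothetical) fused kernel fit per SM.
// A register-heavy fused kernel can report lower occupancy than the
// separate stage kernels it replaces, eating into the transfer savings.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fusedKernelStub(const unsigned char* in, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2;   // stand-in for decode + filter work
}

int main() {
    const int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, fusedKernelStub, blockSize, /*dynamicSMemSize=*/0);

    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = double(blocksPerSM * blockSize) /
                       prop.maxThreadsPerMultiProcessor;
    printf("blocks/SM: %d, occupancy: %.0f%%\n",
           blocksPerSM, occupancy * 100.0);
    return 0;
}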

What would settle it

A direct timing comparison in which the fused kernel plus any added internal overhead takes longer overall than the sum of separate kernels and host-device transfers on the same TPC-H or SSB queries.
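That comparison could be run with an event-based harness along these lines; the kernels are toy stand-ins, and a real harness would measure the actual host-device transfers rather than a bare synchronization:

// Sketch: event-time a staged path (two launches with a host round trip
// between them) against a fused path on identical data. Toy kernels only;
// the staged path's host step is modeled by a device synchronization.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stageA(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}
__global__ void stageB(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2;
}
__global__ void fusedAB(int* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (buf[i] + 1) * 2;   // same result, one launch
}

int main() {
    const int n = 1 << 24, threads = 256, blocks = (n + threads - 1) / threads;
    int* d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemset(d, 0, n * sizeof(int));

    cudaEvent_t s, e;
    cudaEventCreate(&s); cudaEventCreate(&e);
    float stagedMs = 0, fusedMs = 0;

    cudaEventRecord(s);
    stageA<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();                // host round trip between stages
    stageB<<<blocks, threads>>>(d, n);
    cudaEventRecord(e); cudaEventSynchronize(e);
    cudaEventElapsedTime(&stagedMs, s, e);

    cudaMemset(d, 0, n * sizeof(int));      // reset input for a fair rerun
    cudaEventRecord(s);
    fusedAB<<<blocks, threads>>>(d, n);
    cudaEventRecord(e); cudaEventSynchronize(e);
    cudaEventElapsedTime(&fusedMs, s, e);

    printf("staged: %.3f ms, fused: %.3f ms\n", stagedMs, fusedMs);
    cudaFree(d); cudaEventDestroy(s); cudaEventDestroy(e);
    return 0;
}

If the fused time came out consistently larger on the real TPC-H or SSB kernels, that would be the refutation described above.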

Figures

Figures reproduced from arXiv: 2605.10511 by Kazuo Goda, Tsuyoshi Ozawa.

Figure 1: Overall architecture of Data Path Fusion (DPF).

Figure 2: Internal structure of a fused GPU kernel.

Figure 3: Page layout for fixed-length attributes.

Figure 5: Comparison of end-to-end query response, kernel invocations, and total IO volume on TPC-H (left column) and SSB.

Figure 6: Sensitivity analysis of page sizes. DPF remains robust across all page sizes, offering speedups of 1.89 to 22.17 over the baseline case (GiDP). (Panels: (a) TPC-H Q3; (b) TPC-H Q13. Axes: execution time [msec] vs. scale factor. Series: GiDP, GiDP+BaM, GiDP+BaM+KF, DPF.)

Figure 7: Data scalability analysis. DPF remains robust across all scale factors, offering speedups of 2.72 to 20.69 over the baseline case (GiDP).

Figure 8: Query selectivity sensitivity analysis. DPF provides consistent speedups of 2.26 to 5.66 across all selectivity values, regardless of whether pruning is effective for the query predicate.
read the original abstract

One major technical challenge for modern analytical database systems is how to leverage GPU to exploit their massive parallelism and high bandwidth. Yet, existing GPU-driven database engines suffer from inefficiencies caused by frequent host-device interactions and fragmented execution across multiple GPU kernels, limiting their ability to fully utilize GPU's computational and IO capabilities. This paper proposes Data Path Fusion (DPF), a novel GPU-driven data processing architecture that integrates a sequence of data path operations -- including IOs, decompression, and query operations -- into a single GPU kernel. By fusing the data path, DPF reduces host-device communication overheads and enables more efficient utilization of GPU resources for analytical query workloads. DPF seamlessly integrates GPU-friendly optimization techniques, including type-specific compression/decompression, variable-length attribute support, and state-of-the-art GPU-driven IO mechanism, to work in concert, enabling efficient end-to-end query execution directly on GPU. Through extensive experimental evaluation using a prototyped DPF-based GPU-driven database engine (DPFProto) with analytical benchmark workloads, this paper demonstrates that DPF achieves speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach in the representative configuration. Our results show that DPF effectively unlocks the computational and IO potential of modern GPU, providing a promising direction for next-generation analytical database systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Data Path Fusion (DPF), a GPU architecture for analytical query processing that fuses IO, decompression, and query operations into a single kernel to reduce host-device communication overhead. It integrates type-specific compression/decompression, variable-length attribute support, and GPU-driven IO mechanisms. Using a prototype (DPFProto), the authors report speedups of 2.66-6.22× on TPC-H and 3.84-16.81× on SSB over the state-of-the-art in representative configurations.

Significance. If the speedups can be cleanly attributed to single-kernel fusion rather than other integrated optimizations, the work would be significant for GPU-accelerated databases by showing how to better utilize GPU compute and IO bandwidth for end-to-end analytical queries. The engineering integration of multiple techniques into one kernel addresses a recognized inefficiency in prior systems and provides concrete benchmark numbers on TPC-H and SSB.

major comments (2)
  1. [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.
  2. [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.
minor comments (1)
  1. [Abstract] The phrase 'in the representative configuration' is used for the speedup ranges but is not defined; clarify which queries, scale factors, or hardware settings correspond to the minimum and maximum values.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results without misrepresenting the work.

read point-by-point responses
  1. Referee: [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.

    Authors: We appreciate the referee's observation on attribution. The manuscript positions Data Path Fusion as the central mechanism that fuses IO, decompression, and query operations into one kernel, thereby enabling the listed optimizations to execute without inter-kernel overheads and host-device transfers. The state-of-the-art baselines we evaluate do not perform this fusion, resulting in fragmented execution even when they incorporate subsets of the other techniques. The reported speedups therefore reflect the end-to-end benefit of the fused architecture. That said, we agree that an explicit ablation isolating the fusion step (while holding compression, variable-length handling, and IO mechanisms constant) would make the contribution clearer. We will add this ablation study to the revised experimental evaluation section. revision: yes

  2. Referee: [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.

    Authors: We agree that additional experimental details are required for reproducibility. In the revised manuscript we will expand the experimental evaluation section to specify the hardware platform (GPU model, host CPU, memory hierarchy), benchmark data sizes and scale factors for TPC-H and SSB, the number of runs performed with error bars or standard deviations, precise descriptions of the baseline implementations (including which optimizations each baseline contains), and a discussion of resource management within the single-kernel design, including register allocation strategies and handling of warp divergence for variable-length decoding. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system proposal with benchmark validation

full rationale

The paper proposes an engineering architecture (Data Path Fusion) that fuses IO, decompression, and query operations into a single GPU kernel, then validates it via prototype implementation and empirical speedups on TPC-H and SSB benchmarks. No derivation chain, mathematical predictions, or first-principles results exist that could reduce to inputs by construction. Performance numbers are measured outcomes, not fitted parameters renamed as predictions, and no self-citation or ansatz is invoked to justify core claims. The work is self-contained as an empirical evaluation of a new system design.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the work appears to rely on standard GPU programming assumptions and existing compression techniques.

pith-pipeline@v0.9.0 · 5542 in / 1214 out tokens · 65324 ms · 2026-05-12T04:07:25.748349+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

138 extracted references · 138 canonical work pages
