Data Path Fusion in GPU for Analytical Query Processing
Pith reviewed 2026-05-12 04:07 UTC · model grok-4.3
The pith
Fusing IO, decompression and query steps into one GPU kernel cuts host-device transfers for analytical workloads.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Data Path Fusion integrates the sequence of data-path operations—including IOs, decompression, and query operations—into a single GPU kernel, thereby reducing host-device communication overheads and enabling more efficient utilization of GPU resources for analytical query workloads.
What carries the argument
Data Path Fusion (DPF), the architecture that places IO, decompression and query processing inside one GPU kernel while incorporating type-specific compression and variable-length attribute handling.
If this is right
- Host-device communication volume drops because intermediate results no longer cross the PCIe bus after each stage.
- GPU resources stay occupied longer inside one kernel instead of idling between multiple kernel launches.
- End-to-end analytical queries can execute directly on the GPU without returning control to the CPU after each operation.
- Type-specific compression and variable-length support integrate naturally inside the fused kernel.
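The transfer-volume arithmetic behind these points can be sketched with a toy model. The stage sizes and the assumption that a staged design round-trips every intermediate result over PCIe are illustrative, not taken from the paper:

```python
# Toy transfer-volume model (not from the paper) contrasting a staged GPU
# data path, where each stage's intermediate result crosses the PCIe bus,
# with a fused path that uploads the compressed input once and downloads
# only the final result. All sizes are hypothetical, in MB.

def staged_volume(input_mb, intermediate_mb, result_mb):
    # Upload the input, round-trip every intermediate result
    # (device -> host -> device), then download the final result.
    return input_mb + 2 * sum(intermediate_mb) + result_mb

def fused_volume(input_mb, result_mb):
    # One upload, one download; intermediates never leave GPU memory.
    return input_mb + result_mb

# Hypothetical query: 1000 MB compressed scan -> 4000 MB decompressed ->
# 500 MB filtered -> 1 MB aggregate.
staged = staged_volume(1000, [4000, 500], 1)
fused = fused_volume(1000, 1)
print(f"staged: {staged} MB, fused: {fused} MB")  # staged: 10001 MB, fused: 1001 MB
```

Under these made-up numbers the fused path moves an order of magnitude fewer bytes, which is the mechanism the claims above rely on.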
Where Pith is reading between the lines
- Similar fusion patterns could be tested on other GPU-accelerated data tasks such as graph analytics or machine-learning feature pipelines.
- The approach may reduce reliance on specialized interconnect hardware if the software-level fusion already captures most of the available bandwidth.
- Future database engines might adopt single-kernel data paths as a default rather than an optimization.
Load-bearing premise
Combining IO, decompression and query work into one kernel can be done without creating new bottlenecks or correctness problems that cancel out the savings from fewer host-device transfers.
What would settle it
A direct timing comparison in which the fused kernel plus any added internal overhead takes longer overall than the sum of separate kernels and host-device transfers on the same TPC-H or SSB queries.
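A minimal harness for that comparison might look like the following sketch, where `run_staged_query` and `run_fused_query` are hypothetical stand-ins for the two engines under test, not code from the paper:

```python
import statistics
import time

def time_runs(fn, n=5):
    """Median wall-clock time of n runs of fn, in seconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Placeholder workloads standing in for the real engines under test:
# the staged engine pays per-stage kernel launches and transfers,
# the fused engine executes everything in one kernel.
def run_staged_query():
    sum(range(200_000))

def run_fused_query():
    sum(range(100_000))

staged_t = time_runs(run_staged_query)
fused_t = time_runs(run_fused_query)
print(f"speedup: {staged_t / fused_t:.2f}x")
```

The decisive outcome would be this harness, pointed at the real systems on the same TPC-H or SSB queries, reporting a speedup below 1.0 for the fused kernel.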
Original abstract
One major technical challenge for modern analytical database systems is how to leverage GPU to exploit their massive parallelism and high bandwidth. Yet, existing GPU-driven database engines suffer from inefficiencies caused by frequent host-device interactions and fragmented execution across multiple GPU kernels, limiting their ability to fully utilize GPU's computational and IO capabilities. This paper proposes Data Path Fusion (DPF), a novel GPU-driven data processing architecture that integrates a sequence of data path operations -- including IOs, decompression, and query operations -- into a single GPU kernel. By fusing the data path, DPF reduces host-device communication overheads and enables more efficient utilization of GPU resources for analytical query workloads. DPF seamlessly integrates GPU-friendly optimization techniques, including type-specific compression/decompression, variable-length attribute support, and state-of-the-art GPU-driven IO mechanism, to work in concert, enabling efficient end-to-end query execution directly on GPU. Through extensive experimental evaluation using a prototyped DPF-based GPU-driven database engine (DPFProto) with analytical benchmark workloads, this paper demonstrates that DPF achieves speedups of 2.66 to 6.22 on TPC-H and 3.84 to 16.81 on SSB over the state-of-the-art approach in the representative configuration. Our results show that DPF effectively unlocks the computational and IO potential of modern GPU, providing a promising direction for next-generation analytical database systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Data Path Fusion (DPF), a GPU architecture for analytical query processing that fuses IO, decompression, and query operations into a single kernel to reduce host-device communication overhead. It integrates type-specific compression/decompression, variable-length attribute support, and GPU-driven IO mechanisms. Using a prototype (DPFProto), the authors report speedups of 2.66-6.22× on TPC-H and 3.84-16.81× on SSB over the state-of-the-art in representative configurations.
Significance. If the speedups can be cleanly attributed to single-kernel fusion rather than other integrated optimizations, the work would be significant for GPU-accelerated databases by showing how to better utilize GPU compute and IO bandwidth for end-to-end analytical queries. The engineering integration of multiple techniques into one kernel addresses a recognized inefficiency in prior systems and provides concrete benchmark numbers on TPC-H and SSB.
Major comments (2)
- [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.
- [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.
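The ablation the first major comment asks for amounts to a small configuration grid: hold every other factor at DPF's levels and vary only fusion. A sketch, with hypothetical factor names:

```python
from itertools import product

# Hypothetical ablation grid: every factor except kernel fusion is held
# at the levels DPF uses, so any timing delta between the two resulting
# configurations is attributable to fusion alone.
fixed_factors = {
    "compression": "type-specific",
    "varlen_support": True,
    "io_mechanism": "gpu-driven",
}
varied = {"fused_kernel": [False, True]}

configs = [
    {**fixed_factors, **dict(zip(varied, combo))}
    for combo in product(*varied.values())
]
for cfg in configs:
    print(cfg)
```

Running both configurations over the same TPC-H/SSB queries would separate the contribution of fusion from that of the co-designed optimizations.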
Minor comments (1)
- [Abstract] The phrase 'in the representative configuration' is used for the speedup ranges but is not defined; clarify which queries, scale factors, or hardware settings correspond to the minimum and maximum values.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point by point below, providing clarifications and committing to revisions that strengthen the presentation of our results without misrepresenting the work.
Point-by-point responses
Referee: [Abstract] The central performance claims attribute the reported speedups (2.66-6.22× TPC-H, 3.84-16.81× SSB) to fusing IO/decompression/query into one GPU kernel, yet the same paragraph lists integration of type-specific compression/decompression and variable-length attribute support as core parts of DPF. If the cited state-of-the-art baseline omits these techniques, the deltas cannot be credited to fusion alone without an ablation that holds compression, variable-length handling, and IO mechanisms fixed while varying only the kernel fusion.
Authors: We appreciate the referee's observation on attribution. The manuscript positions Data Path Fusion as the central mechanism that fuses IO, decompression, and query operations into one kernel, thereby enabling the listed optimizations to execute without inter-kernel overheads and host-device transfers. The state-of-the-art baselines we evaluate do not perform this fusion, resulting in fragmented execution even when they incorporate subsets of the other techniques. The reported speedups therefore reflect the end-to-end benefit of the fused architecture. That said, we agree that an explicit ablation isolating the fusion step (while holding compression, variable-length handling, and IO mechanisms constant) would make the contribution clearer. We will add this ablation study to the revised experimental evaluation section. revision: yes
Referee: [Experimental evaluation] The abstract and results provide no details on hardware configuration, data sizes, number of runs, error bars, baseline implementation specifics, or how the single-kernel design avoids resource conflicts (e.g., register pressure or warp divergence on variable-length decoding). These omissions make the headline numbers impossible to verify or reproduce.
Authors: We agree that additional experimental details are required for reproducibility. In the revised manuscript we will expand the experimental evaluation section to specify the hardware platform (GPU model, host CPU, memory hierarchy), benchmark data sizes and scale factors for TPC-H and SSB, the number of runs performed with error bars or standard deviations, precise descriptions of the baseline implementations (including which optimizations each baseline contains), and a discussion of resource management within the single-kernel design, including register allocation strategies and handling of warp divergence for variable-length decoding. revision: yes
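The run-count and error-bar reporting promised here can be as simple as publishing a mean and sample standard deviation per query. A sketch, with hypothetical timings:

```python
import statistics

def summarize(samples):
    """Mean and sample standard deviation -- the minimum the referee
    asks every reported number to carry."""
    return statistics.mean(samples), statistics.stdev(samples)

# Hypothetical per-run query times in milliseconds (not from the paper).
runs_ms = [412.0, 418.5, 409.8, 415.2, 411.1]
mean, sd = summarize(runs_ms)
print(f"{mean:.1f} ms +/- {sd:.1f} ms over {len(runs_ms)} runs")
```

Reporting the per-run samples alongside the summary would also let readers recompute speedup confidence intervals for the baseline comparison.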
Circularity Check
No circularity: empirical system proposal with benchmark validation
full rationale
The paper proposes an engineering architecture (Data Path Fusion) that fuses IO, decompression, and query operations into a single GPU kernel, then validates it via prototype implementation and empirical speedups on TPC-H and SSB benchmarks. No derivation chain, mathematical predictions, or first-principles results exist that could reduce to inputs by construction. Performance numbers are measured outcomes, not fitted parameters renamed as predictions, and no self-citation or ansatz is invoked to justify core claims. The work is self-contained as an empirical evaluation of a new system design.