FractalSortCPU: Bandwidth-Efficient Compressed Radix Sort on CPU
Pith reviewed 2026-05-13 03:05 UTC · model grok-4.3
The pith
A CPU radix sort uses compressed histograms and fully parallel key updates to cut bandwidth use by up to 6x versus prior methods on large datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FractalSortCPU addresses the distribution dependence and limited parallelism of radix sort by introducing a CPU-adapted histogram compression scheme for arbitrary-precision keys. Fully parallel, key-based histogram updates remove the need for input bucketing and data pre-processing, reducing both complexity and memory footprint. A parallelized sorting architecture with SIMD-accelerated operations then delivers state-of-the-art execution time while limiting histogram growth, with claimed bandwidth-efficiency improvements of 6x over CPU, 3x over GPU, and 2.5x over FPGA state of the art on 512 MB to 32 GB datasets at 16-bit precision.
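The excerpt never specifies the compression scheme or the update rule, but the baseline it must beat can be sketched: a conventional parallel counting pass gives each worker a private histogram and merges them once at the end, trading one reduction step for lock-free updates. Everything below (chunk sizes, digit width, function names) is illustrative, not the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor

RADIX_BITS = 8          # one 8-bit digit per pass
BINS = 1 << RADIX_BITS

def local_histogram(chunk, shift):
    """Each worker counts digits of its own chunk into a private
    histogram, so no atomics or barriers are needed while counting."""
    hist = [0] * BINS
    for key in chunk:
        hist[(key >> shift) & (BINS - 1)] += 1
    return hist

def parallel_histogram(keys, shift=0, workers=4):
    # Split the input into contiguous chunks, one per worker.
    n = len(keys)
    step = (n + workers - 1) // workers
    chunks = [keys[i:i + step] for i in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_histogram, chunks,
                                 [shift] * len(chunks)))
    # The single synchronization point: summing the private histograms.
    return [sum(col) for col in zip(*partials)]

keys = [7, 7, 300, 255, 256, 1]
hist = parallel_histogram(keys)   # low byte of 300 is 44, of 256 is 0
```

The paper's claimed advantage would have to come from eliminating or shrinking the final merge and the per-thread histogram footprint, which is exactly what the referee asks to see spelled out.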
What carries the argument
CPU-adapted histogram compression scheme with fully parallel key-based histogram updates that limit growth and remove pre-processing requirements.
Load-bearing premise
The CPU-adapted histogram compression scheme can be implemented with fully parallel key-based updates without introducing synchronization overhead, correctness loss, or hidden pre-processing costs for arbitrary-precision keys.
What would settle it
The efficiency claims would be falsified by a direct measurement on 512 MB to 32 GB datasets at 16-bit precision showing that the parallel histogram updates require atomic operations, or suffer contention, sufficient to erase the claimed bandwidth savings relative to state-of-the-art CPU radix sorts.
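As a concrete yardstick for such a measurement: bandwidth efficiency can be expressed as bytes of memory traffic per byte of sorted payload. The abstract never defines the metric, so this formulation is an assumption, but it makes the cost of extra passes or contention-induced traffic directly visible.

```python
def bandwidth_efficiency(bytes_moved, n_keys, key_bytes=2):
    """Bytes of memory traffic per byte of sorted payload; lower is
    better. The metric definition is an assumption -- the abstract
    does not say how bandwidth efficiency is computed."""
    return bytes_moved / (n_keys * key_bytes)

# A textbook 2-pass LSD radix sort on 16-bit keys reads and writes
# every key once per pass (histogram traffic ignored):
n, passes, key_bytes = 1_000_000, 2, 2
traffic = passes * 2 * n * key_bytes   # (read + write) per pass
ratio = bandwidth_efficiency(traffic, n, key_bytes)   # -> 4.0
```

Under this metric, a 6x improvement over a baseline at ratio 4.0 would require moving well under one read-plus-write of the array in total, which is why the referee treats the zero-overhead update claim as load-bearing.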
Original abstract
Cloud database systems, particularly their middleware and query execution layers, use sorting as a core operation in query processing, indexing and join execution. Distribution-dependence and limited parallelism are key issues inherent in state-of-the-art radix sort which is preferred for large datasets due to performance advantages over comparison-based algorithms. Multi-pass bucketing, stochastic sampling and dependence graph structures are common solutions to these problems that incur the cost of data pre-processing and increased memory footprint hence they are less appropriate for large-scale workloads common in cloud environments. In-place radix sort schemes increase the number of passes as precision increases, which negatively impacts latency. Our work solves these problems by introducing a CPU-adapted histogram compression scheme for radix sorting for arbitrary-precision keys implemented on the CPU for increased accessibility, providing state-of-the-art execution time, while limiting histogram growth. Fully parallel key-based histogram updates eliminate the need for input bucketing and data pre-processing further lowering latency, mitigating distribution-dependence and reducing complexity. With a parallelized sorting architecture utilizing SIMD-accelerated operations for low latency, the algorithm demonstrates improvement over the state-of-the-art on the CPU, GPU, and FPGA by 6x, 3x and 2.5x in bandwidth efficiency on 512MB to 32GB data sets at 16-bit precision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FractalSortCPU, a CPU radix-sort algorithm that employs a CPU-adapted histogram compression scheme to support fully parallel key-based histogram updates. This design eliminates multi-pass bucketing, stochastic sampling, and data pre-processing while handling arbitrary-precision keys, and incorporates SIMD acceleration. The central performance claim is that the resulting algorithm improves bandwidth efficiency over state-of-the-art radix sorts by 6x on CPU, 3x on GPU, and 2.5x on FPGA for 512 MB–32 GB inputs at 16-bit precision.
Significance. If the claimed bandwidth-efficiency gains and correctness for arbitrary-precision keys are demonstrated, the work would address longstanding distribution-dependence and pre-processing overheads in radix sort, offering a practical primitive for cloud database query execution, indexing, and joins on large-scale workloads.
major comments (2)
- [Abstract] The claim that 'fully parallel key-based histogram updates eliminate the need for input bucketing and data pre-processing' while incurring 'zero measurable overhead (no atomics, no barriers, no per-thread merge)' is load-bearing for all reported bandwidth-efficiency numbers, yet the manuscript supplies no pseudocode, update rule, memory-access analysis, or complexity bound showing how concurrent writes to a shared compressed histogram are performed without synchronization or extra traffic on a multi-core CPU.
- [Abstract] No experimental data, implementation details, hardware platform, compiler flags, or verification steps are provided to support the 6x/3x/2.5x bandwidth-efficiency claims on 512 MB–32 GB data sets; the central performance assertions therefore cannot be evaluated.
minor comments (1)
- [Abstract] The title specifies 'on CPU' while the abstract reports cross-platform (GPU/FPGA) speedups; clarify whether the GPU/FPGA numbers are obtained by direct porting or by comparison against external implementations.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. We agree that the current presentation lacks sufficient supporting details for the central claims and will revise the manuscript to address both points.
Point-by-point responses
-
Referee: [Abstract] The claim that 'fully parallel key-based histogram updates eliminate the need for input bucketing and data pre-processing' while incurring 'zero measurable overhead (no atomics, no barriers, no per-thread merge)' is load-bearing for all reported bandwidth-efficiency numbers, yet the manuscript supplies no pseudocode, update rule, memory-access analysis, or complexity bound showing how concurrent writes to a shared compressed histogram are performed without synchronization or extra traffic on a multi-core CPU.
Authors: We acknowledge that the manuscript text does not include explicit pseudocode, update rules, memory-access analysis, or complexity bounds for the parallel histogram updates. The CPU-adapted compression scheme is designed to map keys to distinct compressed bins that can be updated concurrently without synchronization, atomics, or barriers because each thread writes only to its own fractal partition of the histogram. However, these mechanisms are described at a high level only. We will add a new subsection with pseudocode for the key-based update, a detailed memory-access analysis demonstrating no extra traffic beyond the input read, and a complexity argument showing O(1) overhead per key. This revision will make the zero-overhead claim fully verifiable. revision: yes
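One plausible reading of "each thread writes only to its own fractal partition of the histogram" is a striped layout in which every thread owns a disjoint slice, so concurrent increments never collide. The sketch below is hypothetical: the name `fractal_update`, the stripe layout, and the ownership scheme are assumptions, not the manuscript's rule, and striping still implies a later reduction unless keys are routed to threads by digit.

```python
def fractal_update(hist, tid, key, shift, bins):
    """Hypothetical partitioned update: thread `tid` owns the slice
    hist[tid*bins : (tid+1)*bins], so two threads can increment
    concurrently without atomics or false sharing on the same slot."""
    digit = (key >> shift) & (bins - 1)
    hist[tid * bins + digit] += 1   # O(1) work per key
    return hist

# Two "threads" updating disjoint stripes of one flat array:
BINS = 256
hist = [0] * (BINS * 2)
fractal_update(hist, 0, 0x1234, 0, BINS)   # digit 0x34 -> slot 52
fractal_update(hist, 1, 0x1234, 0, BINS)   # digit 0x34 -> slot 256 + 52
```

The promised revision would need to show either that this reduction is folded into existing passes or that the actual update rule avoids it entirely.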
-
Referee: [Abstract] No experimental data, implementation details, hardware platform, compiler flags, or verification steps are provided to support the 6x/3x/2.5x bandwidth-efficiency claims on 512 MB–32 GB data sets; the central performance assertions therefore cannot be evaluated.
Authors: The current manuscript excerpt emphasizes the algorithmic contribution but indeed omits the requested experimental specifics. We will expand the evaluation section to include: hardware platforms (specific CPU, GPU, and FPGA models), compiler flags and build settings, implementation details (SIMD intrinsics and parallelization strategy), verification procedures (correctness tests on arbitrary-precision keys using both synthetic uniform and real-world skewed datasets), and the exact methodology for bandwidth-efficiency measurement (bytes transferred per sorted key). Results will be reported explicitly for the 512 MB–32 GB range at 16-bit precision, allowing direct evaluation of the 6x/3x/2.5x claims. revision: yes
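The promised correctness tests can be anchored to a trusted oracle. A minimal version, using a plain two-pass LSD radix sort as the reference implementation (a generic textbook sort, not FractalSortCPU), checks agreement with Python's `sorted` on both uniform and skewed 16-bit inputs:

```python
import random

def radix_sort_16bit(keys, radix_bits=8):
    """Plain stable LSD radix sort for 16-bit keys -- a reference
    oracle for correctness checks, not the paper's algorithm."""
    bins = 1 << radix_bits
    for shift in (0, radix_bits):
        buckets = [[] for _ in range(bins)]
        for k in keys:
            buckets[(k >> shift) & (bins - 1)].append(k)
        keys = [k for b in buckets for k in b]
    return keys

# Verification in the style the rebuttal promises: uniform and
# skewed synthetic datasets compared against a trusted oracle.
random.seed(0)
uniform = [random.randrange(1 << 16) for _ in range(10_000)]
skewed  = [random.randrange(256) for _ in range(10_000)]
for data in (uniform, skewed):
    assert radix_sort_16bit(data) == sorted(data)
```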
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
CellSort: High Performance Sorting on the Cell Processor
B. Gedik, R. R. Bordawekar, and P. S. Yu, “CellSort: High Performance Sorting on the Cell Processor.” In: Proc. VLDB (2007), pp. 1286–1297
work page 2007
-
[2]
R. Mueller, J. Teubner, and G. Alonso, “Sorting networks on FPGAs.” In: The VLDB Journal 21 (2012), pp. 1–23. https://doi.org/10.1007/s00778-011-0232-z
-
[3]
Terabyte Sort on FPGA-Accelerated Flash Storage
S.-W. Jun, S. Xu, and Arvind, “Terabyte Sort on FPGA-Accelerated Flash Storage.” In: Proceedings of the 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM) (2017)
work page 2017
-
[5]
F. Berthelot, F. Nouvel, and D. Houzet, “Partial and dynamic reconfiguration of FPGAs: a top down design methodology for an automatic implementation.” In: Proceedings 20th IEEE International Parallel Distributed Processing Symposium (2006), 4 pp. DOI: 10.1109/IPDPS.2006.1639466
-
[6]
PARADIS: An efficient parallel algorithm for in-place radix sort
M. Cho, D. Brand, R. Bordawekar, U. Finkler, V. Kulandaisamy, and R. Puri, “PARADIS: An efficient parallel algorithm for in-place radix sort.” In: Very Large Data Bases (VLDB) (2015)
work page 2015
-
[7]
Spark SQL: Relational data processing in spark
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, et al., “Spark SQL: Relational data processing in Spark.” In: International Conference on Management of Data (2015)
work page 2015
-
[8]
A Novel Digital Equalizer Based on RF Sampling Beyond GHz
L. Canese et al., “A Novel Digital Equalizer Based on RF Sampling Beyond GHz.” In: IEEE Access 12 (2024), pp. 92560–92572. DOI: 10.1109/ACCESS.2024.3422802
-
[9]
Efficient implementation of sorting on multi-core SIMD CPU architecture
J. Chhugani et al., “Efficient implementation of sorting on multi-core SIMD CPU architecture.” In: Proceedings of the VLDB Endowment 1(2) (2008), pp. 1313–1324
work page 2008
-
[10]
Bonsai: High-Performance Adaptive Merge Tree Sorting
N. Samardzic, W. Qiao, V. Aggarwal, M.-C. F. Chang, and J. Cong, “Bonsai: High-Performance Adaptive Merge Tree Sorting.” In: ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (2020)
work page 2020
-
[11]
FANS: FPGA-accelerated near-storage sorting
W. Qiao, L. Guo, M.-C. F. Chang, and J. Cong, “FANS: FPGA-accelerated near-storage sorting.” In: Proceedings of the 29th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM) (2021)
work page 2021
-
[12]
TopSort: A High-Performance Two-Phase Sorting Accelerator Optimized on HBM-based FPGAs
W. Qiao, L. Guo, Z. Fang, M.-C. F. Chang, and J. Cong, “TopSort: A High-Performance Two-Phase Sorting Accelerator Optimized on HBM-based FPGAs.” In: IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) (2022). DOI: 10.1109/FCCM53951.2022.9786209
-
[13]
J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters.” In: Communications of the ACM 51(1) (Jan. 2008), pp. 107–113. DOI: 10.1145/1327452.1327492
-
[14]
Contrasting Parallelized with Sequential Sorting
A. Dutta and M. Saha, “Contrasting Parallelized with Sequential Sorting.” In: 2022 IEEE 7th International Conference on Recent Advances and Innovations in Engineering (ICRAIE). Vol. 7. 2022, pp. 256–259. DOI: 10.1109/ICRAIE56454.2022.10054300
-
[15]
Implementing sorting in database systems
G. Graefe, “Implementing sorting in database systems.” In: ACM Computing Surveys 38(3) (Sept. 2006), 10–es. DOI: 10.1145/1132960.1132964
-
[16]
J. Gray, “Sort Benchmark Home Page.” (2023). URL: https:// sortbenchmark.org/
work page 2023
-
[17]
A memory bandwidth-efficient hybrid radix sort on GPUs
E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efficient hybrid radix sort on GPUs.” In: International Conference on Management of Data (SIGMOD) (2017). DOI: 10.1145/3035918.3064043
-
[18]
D. Koch and J. Torresen, “FPGASort: A high performance sorting architecture exploiting run-time reconfiguration on FPGAs for large problem sorting.” In:Proceedings of the 19th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays(2011), pp. 45–54
work page 2011
-
[19]
FPGA vs. Multi-core CPUs vs. GPUs: Hands-On Experience with a Sorting Application
C. Grozea, Z. Bankovic, and P. Laskov, “FPGA vs. Multi-core CPUs vs. GPUs: Hands-On Experience with a Sorting Application.” In: Lecture Notes in Computer Science 6310 (2017), pp. 105–117
work page 2017
-
[20]
A Memory-Disaggregated Radix Tree
X. Luo et al., “A Memory-Disaggregated Radix Tree.” In: ACM Transactions on Storage 20(3) (June 2024). DOI: 10.1145/3664289
-
[21]
Evaluating multi-GPU sorting with modern interconnects
T. Maltenberger et al., “Evaluating multi-GPU sorting with modern interconnects.” In:Proceedings of the 2022 International Conference on Management of Data(2022), pp. 1795–1809
work page 2022
-
[22]
D. Merrill and A. Grimshaw, “High Performance and Scalable Radix Sorting: A Case Study of Implementing Dynamic Parallelism for GPU Computing.” In: Parallel Processing Letters 21(02) (2011), pp. 245–272. DOI: 10.1142/S0129626411000187
work page 2011
-
[24]
FPGA-Accelerated Samplesort for Large Data Sets
H. Chen, S. Madaminov, M. Ferdman, and P. Milder, “FPGA-Accelerated Samplesort for Large Data Sets.” In: Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA) (2020)
work page 2020
-
[25]
FLiMS: Fast lightweight merge sorter
P. Papaphilippou, C. Brooks, and W. Luk, “FLiMS: Fast lightweight merge sorter.” In:2018 International Conference on Field- Programmable Technology (FPT). IEEE, 2018, pp. 78–85
work page 2018
-
[26]
Designing efficient sorting algorithms for manycore GPUs
N. Satish, M. Harris, and M. Garland, “Designing efficient sorting algorithms for manycore GPUs.” In: 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE, 2009, pp. 1–10
work page 2009
-
[27]
Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
N. Satish et al., “Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort.” In:Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data(2010), pp. 351–362
work page 2010
-
[28]
Re-configurable parallel Feed-Forward Neural Network implementation using FPGA
M. El-Sharkawy et al., “Re-configurable parallel Feed-Forward Neural Network implementation using FPGA.” In: Integration 97 (2024), p. 102176. DOI: 10.1016/j.vlsi.2024.102176
-
[29]
A cost-effective and scalable merge sorter tree on FPGAs
T. Usui, T. Van Chu, and K. Kise, “A cost-effective and scalable merge sorter tree on FPGAs.” In:2016 Fourth International Symposium on Computing and Networking (CANDAR). IEEE, 2016, pp. 47–56
work page 2016
-
[30]
High throughput large scale sorting on a CPU-FPGA heterogeneous platform
C. Zhang, R. Chen, and V. Prasanna, “High throughput large scale sorting on a CPU-FPGA heterogeneous platform.” In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016, pp. 148–155
work page 2016
-
[31]
M. Zuluaga, P. Milder, and M. Püschel, “Streaming Sorting Networks.” In: 21(4) (May 2016). DOI: 10.1145/2854150
-
[32]
Performance comparison of FPGA, GPU and CPU in image processing
S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of FPGA, GPU and CPU in image processing.” In: 2009 International Conference on Field Programmable Logic and Applications (FPL) (2009), pp. 126–131. DOI: 10.1109/FPL.2009.5272532
-
[33]
D. E. Knuth,The Art of Computer Programming, Volume 3: Sorting and Searching, 2nd ed. Addison-Wesley, 1998
work page 1998
-
[34]
T. Bingmann, P. Sanders, and J. Singler, “Parallel string sample sort,” ACM Journal of Experimental Algorithmics, vol. 23, no. 1, pp. 1–23, 2018
work page 2018
-
[35]
A comparison of sorting algorithms for the connection machine CM-2,
G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha, “A comparison of sorting algorithms for the connection machine CM-2,” inProceedings of the 3rd ACM Symposium on Parallel Algorithms and Architectures, 1991, pp. 3–16
work page 1991
-
[36]
Parallel sorting in message-passing systems,
P. Sanders and J. L. Träff, “Parallel sorting in message-passing systems,” in Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 1997, pp. 1–10
work page 1997
-
[37]
Scalable parallel geometric algorithms for coarse grained multicomputers,
F. Dehne, A. Fabri, and A. Rau-Chaplin, “Scalable parallel geometric algorithms for coarse grained multicomputers,” in Proceedings of the IEEE Symposium on Parallel and Distributed Processing, 1996, pp. 298–305
work page 1996
-
[38]
D. Ajwani, R. Dementiev, and P. Sanders, “A cache-oblivious distribution sweeping framework for orthogonal range reporting and related problems,” ACM Transactions on Algorithms, vol. 12, no. 1, pp. 1–33, 2016
work page 2016
-
[39]
Optimal merging and sorting on the EREW PRAM,
T. Hagerup and C. Rüb, “Optimal merging and sorting on the EREW PRAM,” Information Processing Letters, vol. 33, no. 4, pp. 181–185, 1991
work page 1991
-
[40]
N. Leischner, V. Osipov, and P. Sanders, “GPU sample sort,” in Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2010, pp. 1–10
work page 2010
-
[41]
An efficient parallel radix sort algorithm on multicore systems,
S. Shi and X. Zhang, “An efficient parallel radix sort algorithm on multicore systems,”Proceedings of the International Conference on Computer Science and Service System, 2012, pp. 450–453
work page 2012
-
[42]
Intel Integrated Performance Primitives (Intel IPP) – Data Processing and Sorting,
Intel Corporation, “Intel Integrated Performance Primitives (Intel IPP) – Data Processing and Sorting,” 2023. [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/ipp.html
work page 2023
-
[43]
AA-Sort: A new parallel sorting algorithm for multi-core SIMD processors,
H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani, “AA-Sort: A new parallel sorting algorithm for multi-core SIMD processors,” in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2007, pp. 189–198
work page 2007
-
[44]
F. Strati, P. Elvinger, T. Kerimoglu, and A. Klimovic, “ML Training with Cloud GPU Shortages: Is Cross-Region the Answer?” In: Proceedings of the 4th Workshop on Machine Learning and Systems (EuroMLSys ’24), Association for Computing Machinery, New York, NY, USA, 107–116. https://doi.org/10.1145/3642970.3655843
-
[45]
M. Dang’ana, “FractalSortCPU Source Code,” 2025. [Online]. Available: https://github.com/mikdangana/fractalsort cpu/
work page 2025
-
[46]
“Numba reference,” https://numba.pydata.org/, 2025, accessed: 2025-08-04
work page 2025
-
[47]
Engineering in-place (shared-memory) sorting algorithms,
M. Axtmann, S. Witt, D. Ferizovic, and P. Sanders, “Engineering in-place (shared-memory) sorting algorithms,” ACM Trans. Parallel Comput., vol. 9, no. 1, Jan. 2022. [Online]. Available: https://doi.org/10.1145/3505286
-
[48]
Performance analysis of efficient sorting algorithms in big data processing,
Y. Li, “Performance analysis of efficient sorting algorithms in big data processing,” Procedia Computer Science, vol. 262, pp. 44–50, 2025
work page 2025
-
[49]
Optiflexsort: A hybrid sorting algorithm for efficient large-scale data processing,
N. S. Abuba, E. Y. Baagyere, C. I. Nakpih, and J. K. Wiredu, “Optiflexsort: A hybrid sorting algorithm for efficient large-scale data processing,” Journal of Advances in Mathematics and Computer Science, vol. 40, no. 2, pp. 67–81, 2025
work page 2025
-
[50]
J. Anantpur and R. Govindarajan, “Taming warp divergence,” in 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2017, pp. 50–60
work page 2017
-
[51]
A study of sorting algorithms on approximate memory,
S. Chen, S. Jiang, B. He, and X. Tang, “A study of sorting algorithms on approximate memory,” San Francisco, 2016
work page 2016
-
[53]
Atomic read modify write primitives for i/o devices,
A. Singhal, R. Van der Wijngaart, and P. Barry, “Atomic read modify write primitives for i/o devices,” Tech. Rep., 2008
work page 2008
-
[54]
“Intel Core i7-8665U CPU reference,” https://www.intel.com/content/www/us/en/products/sku/193563/intel-core-i78665u-processor-8m-cache-up-to-4-80-ghz/specifications.html, 2025, accessed: 2025-08-04
work page 2025
-
[55]
FractalSort: High precision compressed radix sort on FPGA,
M. Dang’ana and H.-A. Jacobsen, “FractalSort: High precision compressed radix sort on FPGA,” IEEE Transactions on Computers, vol. 75, no. 5, pp. 1752–1766, May 2026. DOI: 10.1109/TC.2026.3653702
-
[56]
M. Dang’ana, Y. Zhang, and H.-A. Jacobsen, “Ksurf-Drone: Attention Kalman filter for contextual bandit optimization in cloud resource allocation,” IEEE Transactions on Cloud Computing, vol. PP, no. 99, pp. 1–15, Jan. 2026. DOI: 10.1109/TCC.2026.3653558
discussion (0)