Engineering Scalable Distributed List Ranking
Pith reviewed 2026-06-27 14:56 UTC · model grok-4.3
The pith
List ranking scales to billions of elements on up to 24,576 cores through engineering of the sparse ruling-set algorithm.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that indirect communication, exploiting input locality, and message coalescing allows scaling to billions of elements on up to 24,576 cores. They provide a more detailed analysis of algorithm parameters and validate the approach across input instances with different structural properties.
What carries the argument
The engineered sparse ruling-set algorithm that uses indirect communication and message coalescing to reduce communication overhead while preserving correctness.
If this is right
- List ranking becomes practical as a building block inside larger parallel graph algorithms on machines with tens of thousands of cores.
- Parameter choices for ruling-set size and communication batching can be tuned using the provided analysis to balance local work and global messages.
- The same communication engineering techniques may reduce overhead in other pointer-jumping or tree-based parallel primitives.
Where Pith is reading between the lines
- If the scaling holds for real application data, list ranking could be added to distributed libraries without becoming a bottleneck.
- The approach might extend to other irregular pointer problems such as connected components or tree contraction on similar hardware.
- Future work could test whether the same coalescing strategy improves performance when the underlying network has higher latency.
Load-bearing premise
The tested input instances with different structural properties are representative of the real-world linked lists that arise when list ranking is used inside larger applications.
What would settle it
A run on a new input structure, such as a single long cycle or highly irregular locality pattern, that fails to scale beyond a few thousand cores or shows communication time dominating computation.
Figures
read the original abstract
The list ranking problem is one of the classical problems of parallel computing, with nontrivial algorithms and many applications as a subroutine for solving other problems. While it has been intensively studied in the early days of parallel computing, few things happened in the last 20 years. In particular, there is little work on scaling list ranking to large machines and input sizes. We reconsider list ranking starting from the ground-breaking results of Sibeyn a quarter century ago. We employ algorithm and performance engineering to improve his sparse ruling-set algorithm, making it capable of scaling to many processors, and provide a more detailed analysis of the impact of the algorithm's parameters, further guiding our practical implementation. We perform an extensive experimental study across a variety of input instances with different structural properties. We demonstrate that indirect communication, exploiting input locality, and message coalescing allows scaling to billions of elements on up to 24,576 cores.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper revisits Sibeyn's sparse ruling-set algorithm for the list ranking problem, applies algorithm and performance engineering (including indirect communication, input locality exploitation, and message coalescing), provides parameter analysis, and reports an extensive experimental study on varied synthetic inputs demonstrating scaling to billions of elements on up to 24,576 cores.
Significance. If the results hold under representative workloads, the work revives practical interest in distributed list ranking after two decades of limited progress and supplies concrete engineering guidance plus parameter insights that could benefit applications using list ranking as a subroutine in parallel graph or string algorithms. The emphasis on reproducible scaling experiments is a clear strength.
major comments (1)
- [Experimental study] Experimental study (as described in the abstract and motivating the central claim): the manuscript motivates list ranking via its use as a subroutine in larger applications yet evaluates only 'a variety of input instances with different structural properties' whose generation method and locality/irregularity statistics are not shown to match those arising from real application-derived lists (e.g., from parallel graph algorithms). This makes the observed scaling on 24,576 cores potentially specific to the chosen generators rather than a general property of the engineered algorithm.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the experimental study. We address the major comment below.
read point-by-point responses
-
Referee: [Experimental study] Experimental study (as described in the abstract and motivating the central claim): the manuscript motivates list ranking via its use as a subroutine in larger applications yet evaluates only 'a variety of input instances with different structural properties' whose generation method and locality/irregularity statistics are not shown to match those arising from real application-derived lists (e.g., from parallel graph algorithms). This makes the observed scaling on 24,576 cores potentially specific to the chosen generators rather than a general property of the engineered algorithm.
Authors: We agree that the inputs are synthetic and that the manuscript does not provide direct statistical matching to lists extracted from specific real applications. The generators are chosen to systematically vary structural properties (locality, irregularity) that are known to affect distributed list ranking performance, enabling controlled evaluation of scalability across a spectrum of cases. This is described in the experimental section. While real application-derived lists would strengthen the practical claims, obtaining representative billion-element lists from large-scale graph algorithms at the required scale is not feasible within this study. We can revise the manuscript to include additional details on the generators, their parameters, and how the varied properties relate to those expected in graph and string algorithm subroutines. revision: partial
Circularity Check
No circularity; experimental engineering paper with independent validation
full rationale
The paper starts from Sibeyn's established sparse ruling-set algorithm (external prior work), applies engineering improvements (indirect communication, locality exploitation, message coalescing), analyzes parameters, and reports empirical scaling results on synthetic inputs with varied structures. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The representativeness of test instances is an external-validity issue, not a circularity in the derivation. The work is self-contained against benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list- ranking. Inf. Process. Lett.33(5), 269–273 (1990). https://doi.org/10.1016/0020- 0190(90)90196-5
-
[2]
Algorithmica6(6), 859–868 (1991)
Anderson, R.J., Miller, G.L.: Deterministic parallel list ranking. Algorithmica6(6), 859–868 (1991). https://doi.org/10.1007/BF01759076
-
[3]
Cole, R., Vishkin, U.: Approximate and exact parallel scheduling with applications to list, tree and graph problems. In: FOCS. pp. 478–491. IEEE Computer Society (1986). https://doi.org/10.1109/SFCS.1986.10
-
[4]
Dehne, F., Song, S.W.: Randomized parallel list ranking for distributed memory multiprocessors. Int. J. Parallel Program.25(1), 1–16 (1997). https://doi.org/10.1007/BF02700044
-
[5]
Algorithmica33(2), 183–200 (2002)
Dehne, F.K.H.A., et al.: Efficient parallel graph algorithms for coarse- grained multicomputers and BSP. Algorithmica33(2), 183–200 (2002). https://doi.org/10.1007/S00453-001-0109-4
-
[6]
Funke, D., et al.: Communication-free massively distributed graph generation. J. Parallel Distributed Comput.131, 200–217 (2019). https://doi.org/10.1016/J.JPDC.2019.03.011 14 Sanders et al
-
[7]
Hameed, F., et al.: Contour ranking on coarse grained machines: a case study for low-level vision computations. Concurr. Pract. Exp.9(3), 203–221 (1997)
1997
-
[8]
Addison-Wesley (1992)
JáJá, J.F.: An Introduction to Parallel Algorithms. Addison-Wesley (1992)
1992
-
[9]
Lassous, I.G., Gustedt, J.: Portable list ranking: An experimental study. ACM J. Exp. Algorithmics7, 7 (2002). https://doi.org/10.1145/944618.944625
-
[10]
Miller, G.L., Reif, J.H.: Parallel tree contraction and its application. In: FOCS. pp. 478–489. IEEE Computer Society (1985). https://doi.org/10.1109/SFCS.1985.43
-
[11]
Pan, T., et al.: Fast de bruijn graph compaction in distributed memory envi- ronments. IEEE ACM Trans. Comput. Biol. Bioinform.17(1), 136–148 (2020). https://doi.org/10.1109/TCBB.2018.2858797
-
[12]
Patel, J.N., et al.: Scalable parallel implementations of list ranking on fine- grained machines. IEEE Trans. Parallel Distributed Syst.8(10), 1006–1018 (1997). https://doi.org/10.1109/71.629484
-
[13]
Rehman, M.S., et al.: Fast and scalable list ranking on the GPU. In: ICS. pp. 235–243. ACM (2009). https://doi.org/10.1145/1542275.1542311
-
[14]
Reid-Miller, M.: List ranking and list scan on the cray C-90. In: SPAA. pp. 104–113. ACM (1994). https://doi.org/10.1145/181014.181049
-
[15]
Sanders, P., et al.: Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer (2019). https://doi.org/10.1007/978-3-030-25209-0
-
[16]
Sibeyn, J.F.: Better trade-offs for parallel list ranking. In: SPAA. pp. 221–230. ACM (1997). https://doi.org/10.1145/258492.258514
-
[17]
2025.Approaches to Automated NACE Coding of German Business Activity Descriptions
Sibeyn, J.F.: Ultimate parallel list ranking? In: HiPC. Lecture Notes in Computer Science, vol. 1745, pp. 197–201. Springer (1999). https://doi.org/10.1007/978-3- 540-46642-0_28
-
[18]
Sibeyn, J.F.: List-ranking on interconnection networks. Inf. Comput.181(2), 75–87 (2003). https://doi.org/10.1016/S0890-5401(02)00029-9
-
[19]
Sibeyn, J.F.: Minimizing global communication in parallel list ranking. In: Euro- Par. Lecture Notes in Computer Science, vol. 2790, pp. 894–902. Springer (2003). https://doi.org/10.1007/978-3-540-45209-6_123
-
[20]
Sibeyn, J.F., Guillaume, F., Seidel, T.: Practical parallel list ranking. J. Parallel Dis- tributed Comput.56(2), 156–180 (1999). https://doi.org/10.1006/JPDC.1998.1508
-
[21]
Träff, J.L.: Portable randomized list ranking on multiprocessors using MPI. In: PVM/MPI. Lecture Notes in Computer Science, vol. 1497, pp. 395–402. Springer (1998). https://doi.org/10.1007/BFB0056600
-
[22]
Uhl, T.N., et al.: Kamping: Flexible and (near) zero-overhead C++ bindings for MPI. In: SC. p. 44. IEEE (2024). https://doi.org/10.1109/SC41406.2024.00050
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00050 2024
-
[23]
Wei, Z., JáJá, J.F.: Optimization of linked list prefix computations on multithreaded gpus using CUDA. Parallel Process. Lett.22(4) (2012). https://doi.org/10.1142/S0129626412500120
-
[24]
Wyllie, J.: The Complexity of Parallel Computations. Ph.D. thesis, Cornell Univer- sity, USA (1979)
1979
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.