Engineering Scalable Distributed List Ranking

Matthias Schimek; Peter Sanders; Thomas Weidmann; Tim Niklas Uhl

arxiv: 2606.09318 · v1 · pith:B6SYQFVAnew · submitted 2026-06-08 · 💻 cs.DC · cs.DS

Engineering Scalable Distributed List Ranking

Peter Sanders , Matthias Schimek , Tim Niklas Uhl , Thomas Weidmann This is my paper

Pith reviewed 2026-06-27 14:56 UTC · model grok-4.3

classification 💻 cs.DC cs.DS

keywords list rankingparallel algorithmsdistributed computingsparse ruling setsmessage coalescingscalabilitylinked lists

0 comments

The pith

List ranking scales to billions of elements on up to 24,576 cores through engineering of the sparse ruling-set algorithm.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconsiders the classical list ranking problem, which serves as a subroutine in many parallel applications, and improves Sibeyn's sparse ruling-set algorithm from twenty-five years ago. Through algorithm and performance engineering the authors add indirect communication, exploitation of input locality, and message coalescing. Extensive experiments on varied input structures show that these changes allow the algorithm to handle billions of elements across thousands of cores. A reader would care because prior work had not demonstrated such scaling on modern large machines.

Core claim

The authors demonstrate that indirect communication, exploiting input locality, and message coalescing allows scaling to billions of elements on up to 24,576 cores. They provide a more detailed analysis of algorithm parameters and validate the approach across input instances with different structural properties.

What carries the argument

The engineered sparse ruling-set algorithm that uses indirect communication and message coalescing to reduce communication overhead while preserving correctness.

If this is right

List ranking becomes practical as a building block inside larger parallel graph algorithms on machines with tens of thousands of cores.
Parameter choices for ruling-set size and communication batching can be tuned using the provided analysis to balance local work and global messages.
The same communication engineering techniques may reduce overhead in other pointer-jumping or tree-based parallel primitives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the scaling holds for real application data, list ranking could be added to distributed libraries without becoming a bottleneck.
The approach might extend to other irregular pointer problems such as connected components or tree contraction on similar hardware.
Future work could test whether the same coalescing strategy improves performance when the underlying network has higher latency.

Load-bearing premise

The tested input instances with different structural properties are representative of the real-world linked lists that arise when list ranking is used inside larger applications.

What would settle it

A run on a new input structure, such as a single long cycle or highly irregular locality pattern, that fails to scale beyond a few thousand cores or shows communication time dominating computation.

Figures

Figures reproduced from arXiv: 2606.09318 by Matthias Schimek, Peter Sanders, Thomas Weidmann, Tim Niklas Uhl.

**Figure 2.** Figure 2: Evaluation of different locality-aware techniques on random lists with [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

**Figure 3.** Figure 3: Scalability of pointer doubling and sparse ruling-set on different instances. [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗

**Figure 4.** Figure 4: Impact of different message indirection techniques on [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

The list ranking problem is one of the classical problems of parallel computing, with nontrivial algorithms and many applications as a subroutine for solving other problems. While it has been intensively studied in the early days of parallel computing, few things happened in the last 20 years. In particular, there is little work on scaling list ranking to large machines and input sizes. We reconsider list ranking starting from the ground-breaking results of Sibeyn a quarter century ago. We employ algorithm and performance engineering to improve his sparse ruling-set algorithm, making it capable of scaling to many processors, and provide a more detailed analysis of the impact of the algorithm's parameters, further guiding our practical implementation. We perform an extensive experimental study across a variety of input instances with different structural properties. We demonstrate that indirect communication, exploiting input locality, and message coalescing allows scaling to billions of elements on up to 24,576 cores.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This engineering paper scales Sibeyn's list ranking to 24k cores with communication tweaks, but the test inputs may not match real application lists.

read the letter

The main thing to know is that the authors take Sibeyn's 25-year-old sparse ruling-set algorithm and add indirect communication, locality exploitation, and message coalescing to reach billions of elements on 24,576 cores. The scaling numbers and parameter analysis are the concrete new pieces.

The experiments are the strongest part. They run across inputs with varied structures and show how the tweaks improve performance at scale. That gives useful data for anyone implementing parallel primitives on large machines.

The soft spot is the inputs. The paper uses generated instances with different properties, but there is no direct check that these match the locality or irregularity patterns that appear when list ranking runs inside graph algorithms or string processing. If the generators produce more favorable distributions than real calls, the observed scaling could be narrower than claimed. This is worth a referee note rather than a deal-breaker.

The work stays grounded in the prior algorithm with no circular derivations or invented entities. Citations track the relevant history.

This is for people in distributed algorithms who need practical scaling results on current hardware. A reader focused on supercomputer implementations would find the numbers and implementation details worth their time. It has enough experimental substance to go to peer review.

Referee Report

1 major / 0 minor

Summary. The paper revisits Sibeyn's sparse ruling-set algorithm for the list ranking problem, applies algorithm and performance engineering (including indirect communication, input locality exploitation, and message coalescing), provides parameter analysis, and reports an extensive experimental study on varied synthetic inputs demonstrating scaling to billions of elements on up to 24,576 cores.

Significance. If the results hold under representative workloads, the work revives practical interest in distributed list ranking after two decades of limited progress and supplies concrete engineering guidance plus parameter insights that could benefit applications using list ranking as a subroutine in parallel graph or string algorithms. The emphasis on reproducible scaling experiments is a clear strength.

major comments (1)

[Experimental study] Experimental study (as described in the abstract and motivating the central claim): the manuscript motivates list ranking via its use as a subroutine in larger applications yet evaluates only 'a variety of input instances with different structural properties' whose generation method and locality/irregularity statistics are not shown to match those arising from real application-derived lists (e.g., from parallel graph algorithms). This makes the observed scaling on 24,576 cores potentially specific to the chosen generators rather than a general property of the engineered algorithm.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on the experimental study. We address the major comment below.

read point-by-point responses

Referee: [Experimental study] Experimental study (as described in the abstract and motivating the central claim): the manuscript motivates list ranking via its use as a subroutine in larger applications yet evaluates only 'a variety of input instances with different structural properties' whose generation method and locality/irregularity statistics are not shown to match those arising from real application-derived lists (e.g., from parallel graph algorithms). This makes the observed scaling on 24,576 cores potentially specific to the chosen generators rather than a general property of the engineered algorithm.

Authors: We agree that the inputs are synthetic and that the manuscript does not provide direct statistical matching to lists extracted from specific real applications. The generators are chosen to systematically vary structural properties (locality, irregularity) that are known to affect distributed list ranking performance, enabling controlled evaluation of scalability across a spectrum of cases. This is described in the experimental section. While real application-derived lists would strengthen the practical claims, obtaining representative billion-element lists from large-scale graph algorithms at the required scale is not feasible within this study. We can revise the manuscript to include additional details on the generators, their parameters, and how the varied properties relate to those expected in graph and string algorithm subroutines. revision: partial

Circularity Check

0 steps flagged

No circularity; experimental engineering paper with independent validation

full rationale

The paper starts from Sibeyn's established sparse ruling-set algorithm (external prior work), applies engineering improvements (indirect communication, locality exploitation, message coalescing), analyzes parameters, and reports empirical scaling results on synthetic inputs with varied structures. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The representativeness of test instances is an external-validity issue, not a circularity in the derivation. The work is self-contained against benchmarks and does not invoke uniqueness theorems or ansatzes from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering study of an existing algorithm; it introduces no new mathematical axioms, free parameters beyond standard tuning knobs already present in the base method, or invented entities.

pith-pipeline@v0.9.1-grok · 5684 in / 983 out tokens · 22045 ms · 2026-06-27T14:56:13.643557+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 21 canonical work pages · 1 internal anchor

[1]

Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list- ranking. Inf. Process. Lett.33(5), 269–273 (1990). https://doi.org/10.1016/0020- 0190(90)90196-5

work page doi:10.1016/0020- 1990
[2]

Algorithmica6(6), 859–868 (1991)

Anderson, R.J., Miller, G.L.: Deterministic parallel list ranking. Algorithmica6(6), 859–868 (1991). https://doi.org/10.1007/BF01759076

work page doi:10.1007/bf01759076 1991
[3]

In: FOCS

Cole, R., Vishkin, U.: Approximate and exact parallel scheduling with applications to list, tree and graph problems. In: FOCS. pp. 478–491. IEEE Computer Society (1986). https://doi.org/10.1109/SFCS.1986.10

work page doi:10.1109/sfcs.1986.10 1986
[4]

Dehne, F., Song, S.W.: Randomized parallel list ranking for distributed memory multiprocessors. Int. J. Parallel Program.25(1), 1–16 (1997). https://doi.org/10.1007/BF02700044

work page doi:10.1007/bf02700044 1997
[5]

Algorithmica33(2), 183–200 (2002)

Dehne, F.K.H.A., et al.: Efficient parallel graph algorithms for coarse- grained multicomputers and BSP. Algorithmica33(2), 183–200 (2002). https://doi.org/10.1007/S00453-001-0109-4

work page doi:10.1007/s00453-001-0109-4 2002
[6]

Funke, D., et al.: Communication-free massively distributed graph generation. J. Parallel Distributed Comput.131, 200–217 (2019). https://doi.org/10.1016/J.JPDC.2019.03.011 14 Sanders et al

work page doi:10.1016/j.jpdc.2019.03.011 2019
[7]

Hameed, F., et al.: Contour ranking on coarse grained machines: a case study for low-level vision computations. Concurr. Pract. Exp.9(3), 203–221 (1997)

1997
[8]

Addison-Wesley (1992)

JáJá, J.F.: An Introduction to Parallel Algorithms. Addison-Wesley (1992)

1992
[9]

Lassous, I.G., Gustedt, J.: Portable list ranking: An experimental study. ACM J. Exp. Algorithmics7, 7 (2002). https://doi.org/10.1145/944618.944625

work page doi:10.1145/944618.944625 2002
[10]

In: FOCS

Miller, G.L., Reif, J.H.: Parallel tree contraction and its application. In: FOCS. pp. 478–489. IEEE Computer Society (1985). https://doi.org/10.1109/SFCS.1985.43

work page doi:10.1109/sfcs.1985.43 1985
[11]

IEEE ACM Trans

Pan, T., et al.: Fast de bruijn graph compaction in distributed memory envi- ronments. IEEE ACM Trans. Comput. Biol. Bioinform.17(1), 136–148 (2020). https://doi.org/10.1109/TCBB.2018.2858797

work page doi:10.1109/tcbb.2018.2858797 2020
[12]

IEEE Trans

Patel, J.N., et al.: Scalable parallel implementations of list ranking on fine- grained machines. IEEE Trans. Parallel Distributed Syst.8(10), 1006–1018 (1997). https://doi.org/10.1109/71.629484

work page doi:10.1109/71.629484 1997
[13]

Rehman, M.S., et al.: Fast and scalable list ranking on the GPU. In: ICS. pp. 235–243. ACM (2009). https://doi.org/10.1145/1542275.1542311

work page doi:10.1145/1542275.1542311 2009
[14]

In: SPAA

Reid-Miller, M.: List ranking and list scan on the cray C-90. In: SPAA. pp. 104–113. ACM (1994). https://doi.org/10.1145/181014.181049

work page doi:10.1145/181014.181049 1994
[15]

Springer (2019)

Sanders, P., et al.: Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer (2019). https://doi.org/10.1007/978-3-030-25209-0

work page doi:10.1007/978-3-030-25209-0 2019
[16]

In: SPAA

Sibeyn, J.F.: Better trade-offs for parallel list ranking. In: SPAA. pp. 221–230. ACM (1997). https://doi.org/10.1145/258492.258514

work page doi:10.1145/258492.258514 1997
[17]

2025.Approaches to Automated NACE Coding of German Business Activity Descriptions

Sibeyn, J.F.: Ultimate parallel list ranking? In: HiPC. Lecture Notes in Computer Science, vol. 1745, pp. 197–201. Springer (1999). https://doi.org/10.1007/978-3- 540-46642-0_28

work page doi:10.1007/978-3- 1999
[18]

Sibeyn, J.F.: List-ranking on interconnection networks. Inf. Comput.181(2), 75–87 (2003). https://doi.org/10.1016/S0890-5401(02)00029-9

work page doi:10.1016/s0890-5401(02)00029-9 2003
[19]

In: Euro- Par

Sibeyn, J.F.: Minimizing global communication in parallel list ranking. In: Euro- Par. Lecture Notes in Computer Science, vol. 2790, pp. 894–902. Springer (2003). https://doi.org/10.1007/978-3-540-45209-6_123

work page doi:10.1007/978-3-540-45209-6_123 2003
[20]

Sibeyn, J.F., Guillaume, F., Seidel, T.: Practical parallel list ranking. J. Parallel Dis- tributed Comput.56(2), 156–180 (1999). https://doi.org/10.1006/JPDC.1998.1508

work page doi:10.1006/jpdc.1998.1508 1999
[21]

In: PVM/MPI

Träff, J.L.: Portable randomized list ranking on multiprocessors using MPI. In: PVM/MPI. Lecture Notes in Computer Science, vol. 1497, pp. 395–402. Springer (1998). https://doi.org/10.1007/BFB0056600

work page doi:10.1007/bfb0056600 1998
[22]

Uhl, T.N., et al.: Kamping: Flexible and (near) zero-overhead C++ bindings for MPI. In: SC. p. 44. IEEE (2024). https://doi.org/10.1109/SC41406.2024.00050

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00050 2024
[23]

Parallel Process

Wei, Z., JáJá, J.F.: Optimization of linked list prefix computations on multithreaded gpus using CUDA. Parallel Process. Lett.22(4) (2012). https://doi.org/10.1142/S0129626412500120

work page doi:10.1142/s0129626412500120 2012
[24]

Wyllie, J.: The Complexity of Parallel Computations. Ph.D. thesis, Cornell Univer- sity, USA (1979)

1979

[1] [1]

Anderson, R.J., Miller, G.L.: A simple randomized parallel algorithm for list- ranking. Inf. Process. Lett.33(5), 269–273 (1990). https://doi.org/10.1016/0020- 0190(90)90196-5

work page doi:10.1016/0020- 1990

[2] [2]

Algorithmica6(6), 859–868 (1991)

Anderson, R.J., Miller, G.L.: Deterministic parallel list ranking. Algorithmica6(6), 859–868 (1991). https://doi.org/10.1007/BF01759076

work page doi:10.1007/bf01759076 1991

[3] [3]

In: FOCS

Cole, R., Vishkin, U.: Approximate and exact parallel scheduling with applications to list, tree and graph problems. In: FOCS. pp. 478–491. IEEE Computer Society (1986). https://doi.org/10.1109/SFCS.1986.10

work page doi:10.1109/sfcs.1986.10 1986

[4] [4]

Dehne, F., Song, S.W.: Randomized parallel list ranking for distributed memory multiprocessors. Int. J. Parallel Program.25(1), 1–16 (1997). https://doi.org/10.1007/BF02700044

work page doi:10.1007/bf02700044 1997

[5] [5]

Algorithmica33(2), 183–200 (2002)

Dehne, F.K.H.A., et al.: Efficient parallel graph algorithms for coarse- grained multicomputers and BSP. Algorithmica33(2), 183–200 (2002). https://doi.org/10.1007/S00453-001-0109-4

work page doi:10.1007/s00453-001-0109-4 2002

[6] [6]

Funke, D., et al.: Communication-free massively distributed graph generation. J. Parallel Distributed Comput.131, 200–217 (2019). https://doi.org/10.1016/J.JPDC.2019.03.011 14 Sanders et al

work page doi:10.1016/j.jpdc.2019.03.011 2019

[7] [7]

Hameed, F., et al.: Contour ranking on coarse grained machines: a case study for low-level vision computations. Concurr. Pract. Exp.9(3), 203–221 (1997)

1997

[8] [8]

Addison-Wesley (1992)

JáJá, J.F.: An Introduction to Parallel Algorithms. Addison-Wesley (1992)

1992

[9] [9]

Lassous, I.G., Gustedt, J.: Portable list ranking: An experimental study. ACM J. Exp. Algorithmics7, 7 (2002). https://doi.org/10.1145/944618.944625

work page doi:10.1145/944618.944625 2002

[10] [10]

In: FOCS

Miller, G.L., Reif, J.H.: Parallel tree contraction and its application. In: FOCS. pp. 478–489. IEEE Computer Society (1985). https://doi.org/10.1109/SFCS.1985.43

work page doi:10.1109/sfcs.1985.43 1985

[11] [11]

IEEE ACM Trans

Pan, T., et al.: Fast de bruijn graph compaction in distributed memory envi- ronments. IEEE ACM Trans. Comput. Biol. Bioinform.17(1), 136–148 (2020). https://doi.org/10.1109/TCBB.2018.2858797

work page doi:10.1109/tcbb.2018.2858797 2020

[12] [12]

IEEE Trans

Patel, J.N., et al.: Scalable parallel implementations of list ranking on fine- grained machines. IEEE Trans. Parallel Distributed Syst.8(10), 1006–1018 (1997). https://doi.org/10.1109/71.629484

work page doi:10.1109/71.629484 1997

[13] [13]

Rehman, M.S., et al.: Fast and scalable list ranking on the GPU. In: ICS. pp. 235–243. ACM (2009). https://doi.org/10.1145/1542275.1542311

work page doi:10.1145/1542275.1542311 2009

[14] [14]

In: SPAA

Reid-Miller, M.: List ranking and list scan on the cray C-90. In: SPAA. pp. 104–113. ACM (1994). https://doi.org/10.1145/181014.181049

work page doi:10.1145/181014.181049 1994

[15] [15]

Springer (2019)

Sanders, P., et al.: Sequential and Parallel Algorithms and Data Structures - The Basic Toolbox. Springer (2019). https://doi.org/10.1007/978-3-030-25209-0

work page doi:10.1007/978-3-030-25209-0 2019

[16] [16]

In: SPAA

Sibeyn, J.F.: Better trade-offs for parallel list ranking. In: SPAA. pp. 221–230. ACM (1997). https://doi.org/10.1145/258492.258514

work page doi:10.1145/258492.258514 1997

[17] [17]

2025.Approaches to Automated NACE Coding of German Business Activity Descriptions

Sibeyn, J.F.: Ultimate parallel list ranking? In: HiPC. Lecture Notes in Computer Science, vol. 1745, pp. 197–201. Springer (1999). https://doi.org/10.1007/978-3- 540-46642-0_28

work page doi:10.1007/978-3- 1999

[18] [18]

Sibeyn, J.F.: List-ranking on interconnection networks. Inf. Comput.181(2), 75–87 (2003). https://doi.org/10.1016/S0890-5401(02)00029-9

work page doi:10.1016/s0890-5401(02)00029-9 2003

[19] [19]

In: Euro- Par

Sibeyn, J.F.: Minimizing global communication in parallel list ranking. In: Euro- Par. Lecture Notes in Computer Science, vol. 2790, pp. 894–902. Springer (2003). https://doi.org/10.1007/978-3-540-45209-6_123

work page doi:10.1007/978-3-540-45209-6_123 2003

[20] [20]

Sibeyn, J.F., Guillaume, F., Seidel, T.: Practical parallel list ranking. J. Parallel Dis- tributed Comput.56(2), 156–180 (1999). https://doi.org/10.1006/JPDC.1998.1508

work page doi:10.1006/jpdc.1998.1508 1999

[21] [21]

In: PVM/MPI

Träff, J.L.: Portable randomized list ranking on multiprocessors using MPI. In: PVM/MPI. Lecture Notes in Computer Science, vol. 1497, pp. 395–402. Springer (1998). https://doi.org/10.1007/BFB0056600

work page doi:10.1007/bfb0056600 1998

[22] [22]

Uhl, T.N., et al.: Kamping: Flexible and (near) zero-overhead C++ bindings for MPI. In: SC. p. 44. IEEE (2024). https://doi.org/10.1109/SC41406.2024.00050

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1109/sc41406.2024.00050 2024

[23] [23]

Parallel Process

Wei, Z., JáJá, J.F.: Optimization of linked list prefix computations on multithreaded gpus using CUDA. Parallel Process. Lett.22(4) (2012). https://doi.org/10.1142/S0129626412500120

work page doi:10.1142/s0129626412500120 2012

[24] [24]

Wyllie, J.: The Complexity of Parallel Computations. Ph.D. thesis, Cornell Univer- sity, USA (1979)

1979