Rectangular Matrix Multiplication in the Low-Bandwidth Model

Chetan Gupta; Hossein Vahidi; Jukka Suomela

arxiv: 2606.04652 · v1 · pith:YV4MXFHNnew · submitted 2026-06-03 · 💻 cs.DC

Rectangular Matrix Multiplication in the Low-Bandwidth Model

Chetan Gupta , Jukka Suomela , Hossein Vahidi This is my paper

Pith reviewed 2026-06-28 04:27 UTC · model grok-4.3

classification 💻 cs.DC

keywords rectangular matrix multiplicationlow-bandwidth modeldistributed computinground complexityphase transitionsemiring multiplicationaspect ratio

0 comments

The pith

In the low-bandwidth model, rectangular matrix multiplication has round complexity Õ(d^{4/3}) for ⟨d,n,d⟩ and shows a phase transition for ⟨n,d,n⟩ at d = √n.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines rectangular matrix multiplication in a distributed setting where n computers each exchange only O(log n)-bit messages per round, with inputs split evenly and each computer responsible for part of the output. It focuses on two aspect ratios: multiplying d×n by n×d matrices, and n×d by d×n matrices. For the first ratio the authors obtain an upper bound of Õ(d^{4/3}) rounds that recovers the known square-matrix bound when d equals n. For the second ratio they prove a tight Θ(d √n) bound when d is at most √n, but only an O(d^{2/3} n^{2/3}) upper bound when d is larger, revealing a genuine change in scaling behavior.

Core claim

We show that for ⟨d,n,d⟩, we can achieve the complexity of Õ(d^{4/3}); for ⟨n,d,n⟩ with d ≤ √n the complexity is Θ(d √n); however for d ≥ √n the upper bound is O(d^{2/3} n^{2/3}).

What carries the argument

The two aspect ratios ⟨n,d,n⟩ and ⟨d,n,d⟩ inside the low-bandwidth model of n computers that each send and receive an O(log n)-bit message per round.

If this is right

When d = n both ratios recover the Õ(n^{4/3}) bound known for square matrices over semirings.
Matching upper and lower bounds hold for ⟨n,d,n⟩ throughout the regime d ≤ √n.
The scaling changes qualitatively once d exceeds √n for the ⟨n,d,n⟩ task.
All results apply without any sparsity assumptions on the matrices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Choosing the orientation of rectangular matrices in a distributed computation could reduce total rounds by selecting the cheaper aspect ratio.
The identified phase transition may appear in other linear-algebra primitives that involve similar dimension-dependent communication patterns.
Extending the analysis to fully general ⟨a,b,c⟩ shapes could map the complete complexity landscape for rectangular multiplication.

Load-bearing premise

The input matrices start evenly distributed across the n computers and each computer must output only its designated share of the product matrix.

What would settle it

A proof that ⟨d,n,d⟩ multiplication requires Ω(d^{4/3 + ε}) rounds for some ε > 0, or an algorithm that solves ⟨n,d,n⟩ in o(d √n) rounds when d = n^{1/3}, would refute the stated bounds.

Figures

Figures reproduced from arXiv: 2606.04652 by Chetan Gupta, Hossein Vahidi, Jukka Suomela.

read the original abstract

We study rectangular matrix multiplication in the low-bandwidth model of distributed computing. There are $n$ computers; initially the input matrices are distributed evenly between computers, and in each communication round every computer can send and receive an $O(\log n)$-bit message. Eventually each computer must output its designated part of the product matrix. While prior work has focused primarily on square $n \times n$ multiplication under various sparsity assumptions, we study rectangular instances with no sparsity assumption. We denote by $\langle a,b,c\rangle$ the task of multiplying an $a\times b$ matrix by a $b\times c$ matrix in this model. We concentrate on two natural aspect ratios, $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$, for $d \le n$, and we study how the round complexity depends on $n$ and $d$. When $d \to n$, both $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$ approach $\langle n,n,n\rangle$, which is the usual task of multiplying square matrices. If we consider multiplication over semirings, the current best upper bound in that case is $O(n^{4/3})$ rounds, and there is a trivial unconditional lower bound of $\Omega(n)$. We show that for $\langle d,n,d\rangle$, we can achieve the complexity of $\tilde O(d^{4/3})$, which seems like a natural generalization of the upper bound $\tilde O(n^{4/3})$ when $d=n$. However, the case of $\langle n,d,n\rangle$ is fundamentally different, and also exhibits a phase transition. We show that for $d \le \sqrt{n}$, the complexity of $\langle n,d,n\rangle$ is $\Theta(d \sqrt{n})$; we have matching upper and lower bounds. However, the behavior is genuinely different in the region $d \ge \sqrt{n}$, where the upper bound is $O(d^{2/3} n^{2/3})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives matching bounds for small-aspect rectangular matrix mult and a consistent upper bound for the large-d regime in the low-bandwidth model.

read the letter

The punchline is that this work extends the known square-matrix results to rectangular cases and identifies a phase transition at d = sqrt(n) for the task.

The new pieces are the tilde-O(d^{4/3}) upper bound for , the Theta(d sqrt(n)) matching bounds when d <= sqrt(n) for , and the O(d^{2/3} n^{2/3}) upper bound that takes over for larger d. These line up with the square case at the boundaries and with the trivial Omega(n) lower bound when d grows. The model is the standard low-bandwidth one with even initial distribution and O(log n)-bit messages, so the claims rest on familiar ground.

The paper does well at making the distinction between the two aspect ratios clear and at showing how the small-d regime admits tight analysis while the larger-d regime reverts to something closer to the square-matrix upper bound. The phase transition itself is the clearest contribution.

The main soft spot is that the d >= sqrt(n) regime only has an upper bound; there is no matching lower bound, so we do not yet know if the exponent is optimal. Everything is stated for semirings, which is explicit but means ring-specific speedups are left aside. The abstract-level claims are now backed by the full text, and the stress-test note finds no internal contradictions.

This is for people working on distributed linear algebra or communication complexity in bandwidth-limited models. A reader already familiar with the square-matrix O(n^{4/3}) result will see the natural next step. It deserves a serious referee because the results are new, the model is standard, and the gap in the large-d case is clearly marked rather than hidden.

Referee Report

1 major / 2 minor

Summary. The paper studies rectangular matrix multiplication ⟨a,b,c⟩ in the low-bandwidth distributed model with n computers, where inputs are evenly distributed initially and each computer sends/receives O(log n)-bit messages per round before outputting its designated portion of the product. It focuses on ⟨n,d,n⟩ and ⟨d,n,d⟩ for d ≤ n, claiming Õ(d^{4/3}) rounds for ⟨d,n,d⟩; Θ(d √n) (matching upper/lower bounds) for ⟨n,d,n⟩ when d ≤ √n; and O(d^{2/3} n^{2/3}) upper bound for ⟨n,d,n⟩ when d ≥ √n. These are positioned as generalizations of the Õ(n^{4/3}) semiring bound for square multiplication.

Significance. If the stated bounds hold with complete proofs, the work identifies a phase transition in the ⟨n,d,n⟩ regime and provides a natural rectangular generalization of known square-matrix results in the low-bandwidth model, which would be of interest to distributed computing theory.

major comments (1)

[Abstract] Abstract: the central claims (Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ with d ≤ √n, and O(d^{2/3} n^{2/3}) for d ≥ √n) are presented without any proof sketches, algorithmic constructions, or lower-bound arguments, creating a derivation gap that prevents verification of the stated complexities.

minor comments (2)

[Abstract] The model definition (even initial distribution, designated output parts, O(log n)-bit messages) is stated only in the abstract; a dedicated section or paragraph in the introduction would improve clarity.
[Abstract] Notation: the semiring assumption is mentioned only for the square case; it should be stated explicitly whether the new rectangular bounds hold over rings, fields, or semirings.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater clarity in the abstract. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims (Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ with d ≤ √n, and O(d^{2/3} n^{2/3}) for d ≥ √n) are presented without any proof sketches, algorithmic constructions, or lower-bound arguments, creating a derivation gap that prevents verification of the stated complexities.

Authors: The abstract is written as a concise summary of results and does not contain technical details by design. The full manuscript provides complete algorithmic constructions, matching lower-bound arguments, and full proofs: the Õ(d^{4/3}) upper bound for ⟨d,n,d⟩ follows from a rectangular generalization of the known square-matrix semiring algorithm via suitable block partitioning; the Θ(d √n) bound for ⟨n,d,n⟩ when d ≤ √n is obtained by a direct communication lower bound that matches a simple upper bound based on broadcasting the smaller dimension; and the O(d^{2/3} n^{2/3}) upper bound for d ≥ √n uses a different decomposition that reduces the problem to a smaller number of square multiplications. We will revise the abstract to include one-sentence high-level indications of these techniques so that the derivation path is clearer while preserving conciseness. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper derives upper bounds via explicit algorithmic constructions (e.g., generalizations of square-matrix techniques to rectangular cases) and lower bounds via standard communication-complexity arguments in the low-bandwidth model. The stated complexities—Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ when d ≤ √n, and O(d^{2/3} n^{2/3}) otherwise—are obtained directly from these constructions and analyses without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the claims to tautologies. The phase transition at d = √n emerges from comparing the model parameters (message size, distribution, output requirements) rather than from any internal redefinition or smuggling of ansatzes. The work is self-contained against external benchmarks of distributed matrix multiplication.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claims rest on the standard low-bandwidth distributed computing model and semiring arithmetic; no free parameters or new entities are introduced based on the abstract.

axioms (3)

domain assumption Input matrices are distributed evenly between the n computers initially.
This is part of the model setup described in the abstract.
domain assumption Each computer can send and receive an O(log n)-bit message per round.
Core definition of the low-bandwidth model.
domain assumption Multiplication is over semirings with no sparsity assumptions.
Stated when discussing the square case and the rectangular instances.

pith-pipeline@v0.9.1-grok · 5912 in / 1450 out tokens · 56399 ms · 2026-06-28T04:27:16.301184+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1]

More asymmetry yields faster matrix multiplication

Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Yossi Azar and Debmalya Panigrahi, editors,Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2005–2039. SIAM, 2025.doi:10.113...

work page doi:10.1137/1.9781611978322.63 2025
[2]

Distributed computation in node-capacitated networks

John Augustine, Mohsen Ghaffari, Robert Gmyr, Kristian Hinnenthal, Christian Schei- deler, Fabian Kuhn, and Jason Li. Distributed computation in node-capacitated networks. In Christian Scheideler and Petra Berenbrink, editors,The 31st ACM Symposium on Paral- lelism in Algorithms and Architectures, SPAA 2019, Phoenix, AZ, USA, June 22-24, 2019, pages 69–79...

work page doi:10.1145/3323165.3323195 2019
[3]

Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds

Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds. In Guy E. Blelloch and Maurice Herlihy, editors,24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012, ...

work page doi:10.1145/2312005.2312021 2012
[4]

Korhonen, and Dean Leitersdorf

Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approx- imate shortest paths in the congested clique.Distributed Computing, 34(6):463–487, 2021. doi:10.1007/S00446-020-00380-5

work page doi:10.1007/s00446-020-00380-5 2021
[5]

Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela

Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique.Distributed Computing, 32(6):461–478, 2019.doi:10.1007/S00446-016-0270-2

work page doi:10.1007/s00446-016-0270-2 2019
[6]

Sparse matrix multiplication and triangle listing in the congested clique model

Keren Censor-Hillel, Dean Leitersdorf, and Elia Turner. Sparse matrix multiplication and triangle listing in the congested clique model. In Jiannong Cao, Faith Ellen, Luís Rodrigues, 15 and Bernardo Ferreira, editors,22nd International Conference on Principles of Distributed Systems, OPODIS 2018, Hong Kong, December 17-19, 2018, volume 125 ofLIPIcs, pages...

work page doi:10.4230/lipics 2018
[7]

Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds

Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, and Kathryn Rouse. Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds. In Kunal Agrawal and I-Ting Angelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages ...

work page doi:10.1145/3490148.3538552 2022
[8]

Communication-optimal parallel recursive rectangular ma- trix multiplication

James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular ma- trix multiplication. In27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013, pages 261–272. IEEE Computer Society, 2013.d...

work page doi:10.1109/ipdps.2013.80 2013
[9]

Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

Martin Dietzfelbinger, Mirosław Kutyłowski, and Rüdiger Reischuk. Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

work page doi:10.1016/s0022-0000(05)80003-0 1994
[10]

Korhonen, Jan Studený, and Jukka Suomela

Chetan Gupta, Juho Hirvonen, Janne H. Korhonen, Jan Studený, and Jukka Suomela. Sparse matrix multiplication in the low-bandwidth model. In Kunal Agrawal and I-Ting An- gelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages 435–444. ACM, 2022. doi:10.1145/3490148.3538575

work page doi:10.1145/3490148.3538575 2022
[11]

Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi

Chetan Gupta, Janne H. Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi. Low- bandwidth matrix multiplication: Faster algorithms and more general forms of sparsity. In Ulrich Schmid and Roman Kuznets, editors,Structural Information and Communication Complexity - 32nd International Colloquium, SIROCCO 2025, Delphi, Greece, June 2-4, 2025, Proceedin...

work page doi:10.1007/978-3-031-91736-3_20 2025
[12]

Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

Dror Irony, Sivan Toledo, and Alexandre Tiskin. Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

work page doi:10.1016/j.jpdc.2004.03.021 2004
[13]

Matrixmultiplication in the MPC model

LakshyaJoshi, AryaDeshmukh, AtharvChhabra, andChetanGupta. Matrixmultiplication in the MPC model. In Neeldhara Misra and Arti Pandey, editors,Algorithms and Discrete Applied Mathematics - 12th International Conference, CALDAM 2026, Dharwad, India, February 12-14, 2026, Proceedings, volume 16445 ofLecture Notes in Computer Science, pages 209–220. Springer,...

work page doi:10.1007/978-3-032-17156-6_16 2026
[14]

A high-performance parallel al- gorithm for nonnegative matrix factorization

Ramakrishnan Kannan, Grey Ballard, and Haesun Park. A high-performance parallel al- gorithm for nonnegative matrix factorization. In Rafael Asenjo and Tim Harris, editors, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 9:1–9:11. ACM, 2016.doi:10.1145/...

work page doi:10.1145/2851141.2851152 2016
[15]

Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems

François Le Gall. Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems. In Cyril Gavoille and David Ilcinkas, editors,Dis- tributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, volume 9888 ofLecture Notes in Computer Science, pages 57–70. Spring...

work page doi:10.1007/978-3-662-53426-7_5 2016
[16]

Optimal deterministic routing and sorting on the congested clique , booktitle =

Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors,ACM Symposium on Principles of Distributed Computing, PODC ’13, Montreal, QC, Canada, July 22-24, 2013, pages 42–50. ACM, 2013.doi:10.1145/2484239.2501983

work page doi:10.1145/2484239.2501983 2013
[17]

MST construction in O(log logn)communication rounds

Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in O(log logn)communication rounds. In Arnold L. Rosenberg and Friedhelm Meyer auf der Heide, editors,SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Par- allelism in Algorithms and Architectures, June 7-9, 2003, San Diego, California, USA (part of FCRC 2003), pag...

work page doi:10.1145/777412.777428 2003
[18]

McColl and Alexandre Tiskin

William F. McColl and Alexandre Tiskin. Memory-efficient matrix multiplication in the BSP model.Algorithmica, 24(3-4):287–297, 1999.doi:10.1007/PL00008264

work page doi:10.1007/pl00008264 1999
[19]

Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. Wang. Shuffles and circuits (on lower bounds for modern parallel computation).Journal of the ACM, 65(6):41:1–41:24, 2018.doi:10.1145/3232536

work page doi:10.1145/3232536 2018
[20]

Gaussian elimination is not optimal

Volker Strassen. Gaussian elimination is not optimal.Numerische Mathematik, 13(4):354– 356, 1969.doi:10.1007/bf02165411. 17

work page doi:10.1007/bf02165411 1969

[1] [1]

More asymmetry yields faster matrix multiplication

Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Yossi Azar and Debmalya Panigrahi, editors,Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2005–2039. SIAM, 2025.doi:10.113...

work page doi:10.1137/1.9781611978322.63 2025

[2] [2]

Distributed computation in node-capacitated networks

John Augustine, Mohsen Ghaffari, Robert Gmyr, Kristian Hinnenthal, Christian Schei- deler, Fabian Kuhn, and Jason Li. Distributed computation in node-capacitated networks. In Christian Scheideler and Petra Berenbrink, editors,The 31st ACM Symposium on Paral- lelism in Algorithms and Architectures, SPAA 2019, Phoenix, AZ, USA, June 22-24, 2019, pages 69–79...

work page doi:10.1145/3323165.3323195 2019

[3] [3]

Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds

Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds. In Guy E. Blelloch and Maurice Herlihy, editors,24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012, ...

work page doi:10.1145/2312005.2312021 2012

[4] [4]

Korhonen, and Dean Leitersdorf

Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approx- imate shortest paths in the congested clique.Distributed Computing, 34(6):463–487, 2021. doi:10.1007/S00446-020-00380-5

work page doi:10.1007/s00446-020-00380-5 2021

[5] [5]

Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela

Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique.Distributed Computing, 32(6):461–478, 2019.doi:10.1007/S00446-016-0270-2

work page doi:10.1007/s00446-016-0270-2 2019

[6] [6]

Sparse matrix multiplication and triangle listing in the congested clique model

Keren Censor-Hillel, Dean Leitersdorf, and Elia Turner. Sparse matrix multiplication and triangle listing in the congested clique model. In Jiannong Cao, Faith Ellen, Luís Rodrigues, 15 and Bernardo Ferreira, editors,22nd International Conference on Principles of Distributed Systems, OPODIS 2018, Hong Kong, December 17-19, 2018, volume 125 ofLIPIcs, pages...

work page doi:10.4230/lipics 2018

[7] [7]

Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds

Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, and Kathryn Rouse. Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds. In Kunal Agrawal and I-Ting Angelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages ...

work page doi:10.1145/3490148.3538552 2022

[8] [8]

Communication-optimal parallel recursive rectangular ma- trix multiplication

James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular ma- trix multiplication. In27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013, pages 261–272. IEEE Computer Society, 2013.d...

work page doi:10.1109/ipdps.2013.80 2013

[9] [9]

Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

Martin Dietzfelbinger, Mirosław Kutyłowski, and Rüdiger Reischuk. Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

work page doi:10.1016/s0022-0000(05)80003-0 1994

[10] [10]

Korhonen, Jan Studený, and Jukka Suomela

Chetan Gupta, Juho Hirvonen, Janne H. Korhonen, Jan Studený, and Jukka Suomela. Sparse matrix multiplication in the low-bandwidth model. In Kunal Agrawal and I-Ting An- gelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages 435–444. ACM, 2022. doi:10.1145/3490148.3538575

work page doi:10.1145/3490148.3538575 2022

[11] [11]

Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi

Chetan Gupta, Janne H. Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi. Low- bandwidth matrix multiplication: Faster algorithms and more general forms of sparsity. In Ulrich Schmid and Roman Kuznets, editors,Structural Information and Communication Complexity - 32nd International Colloquium, SIROCCO 2025, Delphi, Greece, June 2-4, 2025, Proceedin...

work page doi:10.1007/978-3-031-91736-3_20 2025

[12] [12]

Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

Dror Irony, Sivan Toledo, and Alexandre Tiskin. Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

work page doi:10.1016/j.jpdc.2004.03.021 2004

[13] [13]

Matrixmultiplication in the MPC model

LakshyaJoshi, AryaDeshmukh, AtharvChhabra, andChetanGupta. Matrixmultiplication in the MPC model. In Neeldhara Misra and Arti Pandey, editors,Algorithms and Discrete Applied Mathematics - 12th International Conference, CALDAM 2026, Dharwad, India, February 12-14, 2026, Proceedings, volume 16445 ofLecture Notes in Computer Science, pages 209–220. Springer,...

work page doi:10.1007/978-3-032-17156-6_16 2026

[14] [14]

A high-performance parallel al- gorithm for nonnegative matrix factorization

Ramakrishnan Kannan, Grey Ballard, and Haesun Park. A high-performance parallel al- gorithm for nonnegative matrix factorization. In Rafael Asenjo and Tim Harris, editors, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 9:1–9:11. ACM, 2016.doi:10.1145/...

work page doi:10.1145/2851141.2851152 2016

[15] [15]

Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems

François Le Gall. Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems. In Cyril Gavoille and David Ilcinkas, editors,Dis- tributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, volume 9888 ofLecture Notes in Computer Science, pages 57–70. Spring...

work page doi:10.1007/978-3-662-53426-7_5 2016

[16] [16]

Optimal deterministic routing and sorting on the congested clique , booktitle =

Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors,ACM Symposium on Principles of Distributed Computing, PODC ’13, Montreal, QC, Canada, July 22-24, 2013, pages 42–50. ACM, 2013.doi:10.1145/2484239.2501983

work page doi:10.1145/2484239.2501983 2013

[17] [17]

MST construction in O(log logn)communication rounds

Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in O(log logn)communication rounds. In Arnold L. Rosenberg and Friedhelm Meyer auf der Heide, editors,SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Par- allelism in Algorithms and Architectures, June 7-9, 2003, San Diego, California, USA (part of FCRC 2003), pag...

work page doi:10.1145/777412.777428 2003

[18] [18]

McColl and Alexandre Tiskin

William F. McColl and Alexandre Tiskin. Memory-efficient matrix multiplication in the BSP model.Algorithmica, 24(3-4):287–297, 1999.doi:10.1007/PL00008264

work page doi:10.1007/pl00008264 1999

[19] [19]

Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. Wang. Shuffles and circuits (on lower bounds for modern parallel computation).Journal of the ACM, 65(6):41:1–41:24, 2018.doi:10.1145/3232536

work page doi:10.1145/3232536 2018

[20] [20]

Gaussian elimination is not optimal

Volker Strassen. Gaussian elimination is not optimal.Numerische Mathematik, 13(4):354– 356, 1969.doi:10.1007/bf02165411. 17

work page doi:10.1007/bf02165411 1969