pith. sign in

arxiv: 2606.04652 · v1 · pith:YV4MXFHNnew · submitted 2026-06-03 · 💻 cs.DC

Rectangular Matrix Multiplication in the Low-Bandwidth Model

Pith reviewed 2026-06-28 04:27 UTC · model grok-4.3

classification 💻 cs.DC
keywords rectangular matrix multiplicationlow-bandwidth modeldistributed computinground complexityphase transitionsemiring multiplicationaspect ratio
0
0 comments X

The pith

In the low-bandwidth model, rectangular matrix multiplication has round complexity Õ(d^{4/3}) for ⟨d,n,d⟩ and shows a phase transition for ⟨n,d,n⟩ at d = √n.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines rectangular matrix multiplication in a distributed setting where n computers each exchange only O(log n)-bit messages per round, with inputs split evenly and each computer responsible for part of the output. It focuses on two aspect ratios: multiplying d×n by n×d matrices, and n×d by d×n matrices. For the first ratio the authors obtain an upper bound of Õ(d^{4/3}) rounds that recovers the known square-matrix bound when d equals n. For the second ratio they prove a tight Θ(d √n) bound when d is at most √n, but only an O(d^{2/3} n^{2/3}) upper bound when d is larger, revealing a genuine change in scaling behavior.

Core claim

We show that for ⟨d,n,d⟩, we can achieve the complexity of Õ(d^{4/3}); for ⟨n,d,n⟩ with d ≤ √n the complexity is Θ(d √n); however for d ≥ √n the upper bound is O(d^{2/3} n^{2/3}).

What carries the argument

The two aspect ratios ⟨n,d,n⟩ and ⟨d,n,d⟩ inside the low-bandwidth model of n computers that each send and receive an O(log n)-bit message per round.

If this is right

  • When d = n both ratios recover the Õ(n^{4/3}) bound known for square matrices over semirings.
  • Matching upper and lower bounds hold for ⟨n,d,n⟩ throughout the regime d ≤ √n.
  • The scaling changes qualitatively once d exceeds √n for the ⟨n,d,n⟩ task.
  • All results apply without any sparsity assumptions on the matrices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Choosing the orientation of rectangular matrices in a distributed computation could reduce total rounds by selecting the cheaper aspect ratio.
  • The identified phase transition may appear in other linear-algebra primitives that involve similar dimension-dependent communication patterns.
  • Extending the analysis to fully general ⟨a,b,c⟩ shapes could map the complete complexity landscape for rectangular multiplication.

Load-bearing premise

The input matrices start evenly distributed across the n computers and each computer must output only its designated share of the product matrix.

What would settle it

A proof that ⟨d,n,d⟩ multiplication requires Ω(d^{4/3 + ε}) rounds for some ε > 0, or an algorithm that solves ⟨n,d,n⟩ in o(d √n) rounds when d = n^{1/3}, would refute the stated bounds.

Figures

Figures reproduced from arXiv: 2606.04652 by Chetan Gupta, Hossein Vahidi, Jukka Suomela.

Figure 1
Figure 1. Figure 1: This figure summarizes the results from Table 1. Here, blue indicates ranges where [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
read the original abstract

We study rectangular matrix multiplication in the low-bandwidth model of distributed computing. There are $n$ computers; initially the input matrices are distributed evenly between computers, and in each communication round every computer can send and receive an $O(\log n)$-bit message. Eventually each computer must output its designated part of the product matrix. While prior work has focused primarily on square $n \times n$ multiplication under various sparsity assumptions, we study rectangular instances with no sparsity assumption. We denote by $\langle a,b,c\rangle$ the task of multiplying an $a\times b$ matrix by a $b\times c$ matrix in this model. We concentrate on two natural aspect ratios, $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$, for $d \le n$, and we study how the round complexity depends on $n$ and $d$. When $d \to n$, both $\langle n,d,n\rangle$ and $\langle d,n,d\rangle$ approach $\langle n,n,n\rangle$, which is the usual task of multiplying square matrices. If we consider multiplication over semirings, the current best upper bound in that case is $O(n^{4/3})$ rounds, and there is a trivial unconditional lower bound of $\Omega(n)$. We show that for $\langle d,n,d\rangle$, we can achieve the complexity of $\tilde O(d^{4/3})$, which seems like a natural generalization of the upper bound $\tilde O(n^{4/3})$ when $d=n$. However, the case of $\langle n,d,n\rangle$ is fundamentally different, and also exhibits a phase transition. We show that for $d \le \sqrt{n}$, the complexity of $\langle n,d,n\rangle$ is $\Theta(d \sqrt{n})$; we have matching upper and lower bounds. However, the behavior is genuinely different in the region $d \ge \sqrt{n}$, where the upper bound is $O(d^{2/3} n^{2/3})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper studies rectangular matrix multiplication ⟨a,b,c⟩ in the low-bandwidth distributed model with n computers, where inputs are evenly distributed initially and each computer sends/receives O(log n)-bit messages per round before outputting its designated portion of the product. It focuses on ⟨n,d,n⟩ and ⟨d,n,d⟩ for d ≤ n, claiming Õ(d^{4/3}) rounds for ⟨d,n,d⟩; Θ(d √n) (matching upper/lower bounds) for ⟨n,d,n⟩ when d ≤ √n; and O(d^{2/3} n^{2/3}) upper bound for ⟨n,d,n⟩ when d ≥ √n. These are positioned as generalizations of the Õ(n^{4/3}) semiring bound for square multiplication.

Significance. If the stated bounds hold with complete proofs, the work identifies a phase transition in the ⟨n,d,n⟩ regime and provides a natural rectangular generalization of known square-matrix results in the low-bandwidth model, which would be of interest to distributed computing theory.

major comments (1)
  1. [Abstract] Abstract: the central claims (Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ with d ≤ √n, and O(d^{2/3} n^{2/3}) for d ≥ √n) are presented without any proof sketches, algorithmic constructions, or lower-bound arguments, creating a derivation gap that prevents verification of the stated complexities.
minor comments (2)
  1. [Abstract] The model definition (even initial distribution, designated output parts, O(log n)-bit messages) is stated only in the abstract; a dedicated section or paragraph in the introduction would improve clarity.
  2. [Abstract] Notation: the semiring assumption is mentioned only for the square case; it should be stated explicitly whether the new rectangular bounds hold over rings, fields, or semirings.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater clarity in the abstract. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claims (Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ with d ≤ √n, and O(d^{2/3} n^{2/3}) for d ≥ √n) are presented without any proof sketches, algorithmic constructions, or lower-bound arguments, creating a derivation gap that prevents verification of the stated complexities.

    Authors: The abstract is written as a concise summary of results and does not contain technical details by design. The full manuscript provides complete algorithmic constructions, matching lower-bound arguments, and full proofs: the Õ(d^{4/3}) upper bound for ⟨d,n,d⟩ follows from a rectangular generalization of the known square-matrix semiring algorithm via suitable block partitioning; the Θ(d √n) bound for ⟨n,d,n⟩ when d ≤ √n is obtained by a direct communication lower bound that matches a simple upper bound based on broadcasting the smaller dimension; and the O(d^{2/3} n^{2/3}) upper bound for d ≥ √n uses a different decomposition that reduces the problem to a smaller number of square multiplications. We will revise the abstract to include one-sentence high-level indications of these techniques so that the derivation path is clearer while preserving conciseness. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained; no circular reductions identified

full rationale

The paper derives upper bounds via explicit algorithmic constructions (e.g., generalizations of square-matrix techniques to rectangular cases) and lower bounds via standard communication-complexity arguments in the low-bandwidth model. The stated complexities—Õ(d^{4/3}) for ⟨d,n,d⟩, Θ(d √n) for ⟨n,d,n⟩ when d ≤ √n, and O(d^{2/3} n^{2/3}) otherwise—are obtained directly from these constructions and analyses without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the claims to tautologies. The phase transition at d = √n emerges from comparing the model parameters (message size, distribution, output requirements) rather than from any internal redefinition or smuggling of ansatzes. The work is self-contained against external benchmarks of distributed matrix multiplication.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claims rest on the standard low-bandwidth distributed computing model and semiring arithmetic; no free parameters or new entities are introduced based on the abstract.

axioms (3)
  • domain assumption Input matrices are distributed evenly between the n computers initially.
    This is part of the model setup described in the abstract.
  • domain assumption Each computer can send and receive an O(log n)-bit message per round.
    Core definition of the low-bandwidth model.
  • domain assumption Multiplication is over semirings with no sparsity assumptions.
    Stated when discussing the square case and the rectangular instances.

pith-pipeline@v0.9.1-grok · 5912 in / 1450 out tokens · 56399 ms · 2026-06-28T04:27:16.301184+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    More asymmetry yields faster matrix multiplication

    Josh Alman, Ran Duan, Virginia Vassilevska Williams, Yinzhan Xu, Zixuan Xu, and Renfei Zhou. More asymmetry yields faster matrix multiplication. In Yossi Azar and Debmalya Panigrahi, editors,Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2005–2039. SIAM, 2025.doi:10.113...

  2. [2]

    Distributed computation in node-capacitated networks

    John Augustine, Mohsen Ghaffari, Robert Gmyr, Kristian Hinnenthal, Christian Schei- deler, Fabian Kuhn, and Jason Li. Distributed computation in node-capacitated networks. In Christian Scheideler and Petra Berenbrink, editors,The 31st ACM Symposium on Paral- lelism in Algorithms and Architectures, SPAA 2019, Phoenix, AZ, USA, June 22-24, 2019, pages 69–79...

  3. [3]

    Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds

    Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, and Oded Schwartz. Brief announcement: strongscalingofmatrixmultiplicationalgorithmsandmemory-independent communication lower bounds. In Guy E. Blelloch and Maurice Herlihy, editors,24th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’12, Pittsburgh, PA, USA, June 25-27, 2012, ...

  4. [4]

    Korhonen, and Dean Leitersdorf

    Keren Censor-Hillel, Michal Dory, Janne H. Korhonen, and Dean Leitersdorf. Fast approx- imate shortest paths in the congested clique.Distributed Computing, 34(6):463–487, 2021. doi:10.1007/S00446-020-00380-5

  5. [5]

    Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela

    Keren Censor-Hillel, Petteri Kaski, Janne H. Korhonen, Christoph Lenzen, Ami Paz, and Jukka Suomela. Algebraic methods in the congested clique.Distributed Computing, 32(6):461–478, 2019.doi:10.1007/S00446-016-0270-2

  6. [6]

    Sparse matrix multiplication and triangle listing in the congested clique model

    Keren Censor-Hillel, Dean Leitersdorf, and Elia Turner. Sparse matrix multiplication and triangle listing in the congested clique model. In Jiannong Cao, Faith Ellen, Luís Rodrigues, 15 and Bernardo Ferreira, editors,22nd International Conference on Principles of Distributed Systems, OPODIS 2018, Hong Kong, December 17-19, 2018, volume 125 ofLIPIcs, pages...

  7. [7]

    Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds

    Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, and Kathryn Rouse. Brief announcement: Tight memory-independent parallel matrix multiplication communication lower bounds. In Kunal Agrawal and I-Ting Angelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages ...

  8. [8]

    Communication-optimal parallel recursive rectangular ma- trix multiplication

    James Demmel, David Eliahu, Armando Fox, Shoaib Kamil, Benjamin Lipshitz, Oded Schwartz, and Omer Spillinger. Communication-optimal parallel recursive rectangular ma- trix multiplication. In27th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2013, Cambridge, MA, USA, May 20-24, 2013, pages 261–272. IEEE Computer Society, 2013.d...

  9. [9]

    Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

    Martin Dietzfelbinger, Mirosław Kutyłowski, and Rüdiger Reischuk. Exact lower time bounds for computing Boolean functions on CREW PRAMs.Journal of Computer and System Sciences, 48(2):231–254, 1994.doi:10.1016/S0022-0000(05)80003-0

  10. [10]

    Korhonen, Jan Studený, and Jukka Suomela

    Chetan Gupta, Juho Hirvonen, Janne H. Korhonen, Jan Studený, and Jukka Suomela. Sparse matrix multiplication in the low-bandwidth model. In Kunal Agrawal and I-Ting An- gelina Lee, editors,SPAA ’22: 34th ACM Symposium on Parallelism in Algorithms and Architectures, Philadelphia, PA, USA, July 11 - 14, 2022, pages 435–444. ACM, 2022. doi:10.1145/3490148.3538575

  11. [11]

    Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi

    Chetan Gupta, Janne H. Korhonen, Jan Studený, Jukka Suomela, and Hossein Vahidi. Low- bandwidth matrix multiplication: Faster algorithms and more general forms of sparsity. In Ulrich Schmid and Roman Kuznets, editors,Structural Information and Communication Complexity - 32nd International Colloquium, SIROCCO 2025, Delphi, Greece, June 2-4, 2025, Proceedin...

  12. [12]

    Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

    Dror Irony, Sivan Toledo, and Alexandre Tiskin. Communication lower bounds for distributed-memory matrix multiplication.Journal of Parallel and Distributed Comput- ing, 64(9):1017–1026, 2004.doi:10.1016/J.JPDC.2004.03.021

  13. [13]

    Matrixmultiplication in the MPC model

    LakshyaJoshi, AryaDeshmukh, AtharvChhabra, andChetanGupta. Matrixmultiplication in the MPC model. In Neeldhara Misra and Arti Pandey, editors,Algorithms and Discrete Applied Mathematics - 12th International Conference, CALDAM 2026, Dharwad, India, February 12-14, 2026, Proceedings, volume 16445 ofLecture Notes in Computer Science, pages 209–220. Springer,...

  14. [14]

    A high-performance parallel al- gorithm for nonnegative matrix factorization

    Ramakrishnan Kannan, Grey Ballard, and Haesun Park. A high-performance parallel al- gorithm for nonnegative matrix factorization. In Rafael Asenjo and Tim Harris, editors, Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 9:1–9:11. ACM, 2016.doi:10.1145/...

  15. [15]

    Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems

    François Le Gall. Further algebraic algorithms in the congested clique model and appli- cations to graph-theoretic problems. In Cyril Gavoille and David Ilcinkas, editors,Dis- tributed Computing - 30th International Symposium, DISC 2016, Paris, France, September 27-29, 2016. Proceedings, volume 9888 ofLecture Notes in Computer Science, pages 57–70. Spring...

  16. [16]

    Optimal deterministic routing and sorting on the congested clique , booktitle =

    Christoph Lenzen. Optimal deterministic routing and sorting on the congested clique. In Panagiota Fatourou and Gadi Taubenfeld, editors,ACM Symposium on Principles of Distributed Computing, PODC ’13, Montreal, QC, Canada, July 22-24, 2013, pages 42–50. ACM, 2013.doi:10.1145/2484239.2501983

  17. [17]

    MST construction in O(log logn)communication rounds

    Zvi Lotker, Elan Pavlov, Boaz Patt-Shamir, and David Peleg. MST construction in O(log logn)communication rounds. In Arnold L. Rosenberg and Friedhelm Meyer auf der Heide, editors,SPAA 2003: Proceedings of the Fifteenth Annual ACM Symposium on Par- allelism in Algorithms and Architectures, June 7-9, 2003, San Diego, California, USA (part of FCRC 2003), pag...

  18. [18]

    McColl and Alexandre Tiskin

    William F. McColl and Alexandre Tiskin. Memory-efficient matrix multiplication in the BSP model.Algorithmica, 24(3-4):287–297, 1999.doi:10.1007/PL00008264

  19. [19]

    Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. Wang. Shuffles and circuits (on lower bounds for modern parallel computation).Journal of the ACM, 65(6):41:1–41:24, 2018.doi:10.1145/3232536

  20. [20]

    Gaussian elimination is not optimal

    Volker Strassen. Gaussian elimination is not optimal.Numerische Mathematik, 13(4):354– 356, 1969.doi:10.1007/bf02165411. 17