pith. machine review for the scientific record.

arxiv: 2605.10681 · v1 · submitted 2026-05-11 · 💻 cs.IT · cs.LG · math.IT

Recognition: 2 theorem links

· Lean Theorem

Scalable Mamba-Based Message-Passing Neural Decoder for Error-Correcting Codes

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:22 UTC · model grok-4.3

classification 💻 cs.IT · cs.LG · math.IT
keywords Mamba decoder · neural decoding · LDPC codes · message passing · error-correcting codes · state-space models · scalable decoding

The pith

The Mamba message-passing decoder combines local Tanner-graph updates with bidirectional state-space blocks to decode long error-correcting codes without attention's quadratic costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MMPD, a neural decoder for binary linear codes that keeps the classical message-passing structure on the Tanner graph for local variable-check updates. It adds bidirectional Mamba blocks to carry information across the full code length in linear time and memory. This avoids the dense matrices that limit attention-based decoders when codes grow long. The design targets practical communication systems that rely on long LDPC codes, where current neural methods become too expensive. On a (1056, 880) LDPC code the approach shows both higher accuracy and substantially lower memory use than the prior CrossMPT decoder.
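
To make the syndrome-based setup concrete, here is a minimal NumPy sketch, not taken from the paper: it builds the Tanner-graph neighborhoods of a toy parity-check matrix and computes the syndrome such a decoder consumes. The (7, 4) Hamming code and the single-bit error are illustrative stand-ins for the paper's (1056, 880) LDPC code.

```python
import numpy as np

# Toy parity-check matrix of the (7, 4) Hamming code (purely illustrative;
# the paper works with the much larger (1056, 880) LDPC code).
H = np.array([
    [1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
], dtype=np.uint8)

# Tanner-graph neighborhoods: check nodes touching each variable node and
# variable nodes touching each check node.
var_neighbors = {v: np.flatnonzero(H[:, v]) for v in range(H.shape[1])}
check_neighbors = {c: np.flatnonzero(H[c, :]) for c in range(H.shape[0])}

# Hard decisions on a received word: all-zero codeword with one flipped bit.
received = np.zeros(7, dtype=np.uint8)
received[2] ^= 1

# The syndrome s = H y^T (mod 2) is the quantity a syndrome-based decoder
# consumes; nonzero entries flag unsatisfied parity checks.
syndrome = (H @ received) % 2
print("check neighbors of each variable node:", var_neighbors)
print("syndrome:", syndrome)  # -> [0 1 1] for this error pattern
```

Because the syndrome depends only on the error pattern, a syndrome-based decoder can be trained without enumerating codewords, which is what lets the design apply to any binary linear code with a parity-check matrix.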

Core claim

MMPD retains the Tanner-graph structure of a message-passing decoder by performing local pairwise aggregation along variable-check edges, then combines these local updates with bidirectional Mamba state-space blocks to enable efficient long-range information propagation in a syndrome-based neural decoder for binary linear codes. By avoiding dense attention matrices, it scales more favorably for long codes in both memory and computation.

What carries the argument

Local pairwise aggregation along Tanner-graph edges integrated with bidirectional Mamba state-space blocks that replace attention for long-range propagation.
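
A minimal PyTorch sketch of that pattern, offered as an illustration rather than the paper's architecture: the edge scoring, projections, and layer sizes below are invented for the example, and a bidirectional GRU stands in for the Mamba state-space blocks, since both mix the sequence in linear time.

```python
import torch
import torch.nn as nn

class LocalEdgeAggregation(nn.Module):
    """Softmax-weighted aggregation of check-node states into each variable
    node, restricted to Tanner-graph neighbors. A loose analogue of the
    paper's CN-to-VN message, not its exact equations."""

    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(2 * d, 1)  # edge score from a (VN, CN) pair
        self.proj = nn.Linear(d, d)

    def forward(self, v_state, c_state, H):
        # v_state: (n, d) variable-node states; c_state: (m, d) check-node
        # states; H: (m, n) binary parity-check matrix defining the edges.
        n, m = v_state.size(0), c_state.size(0)
        pairs = torch.cat(
            [v_state.unsqueeze(0).expand(m, n, -1),
             c_state.unsqueeze(1).expand(m, n, -1)], dim=-1)
        scores = self.score(pairs).squeeze(-1)              # (m, n)
        scores = scores.masked_fill(H == 0, float("-inf"))  # edges only
        alpha = torch.softmax(scores, dim=0)  # normalize over each VN's checks
        alpha = torch.nan_to_num(alpha)       # guard VNs with no neighbors
        return alpha.transpose(0, 1) @ self.proj(c_state)   # (n, d)

class MMPDStyleLayer(nn.Module):
    """Local Tanner-graph update followed by a bidirectional linear-time
    sequence mixer; a GRU stands in for the Mamba state-space blocks."""

    def __init__(self, d: int):
        super().__init__()
        self.local = LocalEdgeAggregation(d)
        self.mixer = nn.GRU(d, d, bidirectional=True, batch_first=True)
        self.merge = nn.Linear(2 * d, d)

    def forward(self, v_state, c_state, H):
        v_state = v_state + self.local(v_state, c_state, H)  # local step
        mixed, _ = self.mixer(v_state.unsqueeze(0))          # global, linear time
        return v_state + self.merge(mixed.squeeze(0))

# Smoke test on random states and a random sparse parity-check matrix.
n, m, d = 16, 8, 32
H = (torch.rand(m, n) < 0.25).int()
layer = MMPDStyleLayer(d)
out = layer(torch.randn(n, d), torch.randn(m, d), H)
print(out.shape)  # torch.Size([16, 32])
```

The point of the split is that nothing in the layer materialises an n × n interaction: the local step touches only Tanner-graph edges, and the global step is a linear-time scan over the node sequence.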

If this is right

  • MMPD scales more favorably than attention-based decoders in memory and computation as code length increases.
  • It delivers a 0.45 dB gain over the CrossMPT decoder at the target bit error rate on the (1056, 880) LDPC code.
  • Memory consumption drops by a factor of 1.5 on the tested code, with the reduction growing substantially for longer codes.
  • The decoder remains applicable to any binary linear code through its syndrome-based message-passing design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same local-plus-Mamba pattern could be tested on other graph-structured inference tasks where long-range dependencies must be handled efficiently without attention.
  • Because the method avoids code-specific tuning on the single tested LDPC code, it may generalize to different code families and lengths used in wireless and storage systems.
  • Lower memory footprint could allow neural decoders to run on edge devices for real-time error correction in latency-sensitive links.

Load-bearing premise

The local pairwise updates plus Mamba blocks will move information effectively across codes of arbitrary length without code-specific tuning or training instability.

What would settle it

Train and test MMPD on an LDPC code several times longer than 1056 bits and observe whether the performance gain over CrossMPT disappears, memory savings fail to grow, or training becomes unstable.
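
One way such a check could be instrumented is sketched below, under loose assumptions: the models are generic stand-ins (a Transformer encoder for the attention family, a bidirectional GRU for the linear-time family), not MMPD or CrossMPT, and the probe measures only peak training memory, not BER or training stability.

```python
import torch
import torch.nn as nn

class GRUMixer(nn.Module):
    """Bidirectional GRU as a generic linear-time sequence mixer; a stand-in
    for the Mamba family, not the paper's model."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.rnn = nn.GRU(d, d, num_layers=2, bidirectional=True,
                          batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

def attention_mixer(d: int = 64) -> nn.Module:
    """Transformer encoder as a generic quadratic-cost sequence mixer; a
    stand-in for attention-based decoders, not CrossMPT."""
    layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

def peak_training_memory_mib(model: nn.Module, n: int, d: int = 64,
                             batch: int = 2, steps: int = 3) -> float:
    """Peak GPU memory over a few dummy training steps; batch, data, and loss
    are placeholders, not the paper's training recipe."""
    if not torch.cuda.is_available():
        return float("nan")
    device = torch.device("cuda")
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    torch.cuda.reset_peak_memory_stats(device)
    for _ in range(steps):
        x = torch.randn(batch, n, d, device=device)
        loss = model(x).pow(2).mean()  # dummy objective to drive backward()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.cuda.max_memory_allocated(device) / 2**20

# Sweep block lengths well beyond the tested n = 1056; shrink the batch or the
# lengths if the attention stand-in does not fit on the available GPU.
for n in (1056, 2112, 4224, 8448):
    for name, model in (("attention", attention_mixer()),
                        ("linear-time", GRUMixer())):
        mib = peak_training_memory_mib(model, n)
        print(f"{name:>11}  n={n:5d}  peak ≈ {mib:7.0f} MiB")
```

Pairing a probe like this with BER curves and gradient-norm logs over the same sweep of block lengths would touch all three failure modes named above.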

Figures

Figures reproduced from arXiv: 2605.10681 by Artem Solomkin, Dmitry Artemasov, Nikita Aleksandrov, Rostislav Gusev.

Figure 1
Figure 1. Architecture of the proposed MMPD. view at source ↗
Figure 2
Figure 2. BER performance for long LDPC codes. view at source ↗
Figure 3
Figure 3. Peak GPU memory footprint during training for 5G LDPC codes. view at source ↗
Figure 4
Figure 4. GPU memory footprint during single-sample inference for 5G LDPC codes. view at source ↗
Figure 5
Figure 5. Number of trainable parameters for 5G LDPC codes. view at source ↗
read the original abstract

Forward error correction is essential for reliable communication over noisy channels. Attention-based model-free neural decoders have shown strong performance for short codes, but their scalability to longer codes is limited by the quadratic memory and computational cost of attention. In this paper, we introduce the Mamba message-passing decoder (MMPD), an attention-free syndrome-based neural decoder for binary linear codes. MMPD retains the Tanner-graph structure of a message-passing decoder by performing local pairwise aggregation along variable-check edges. To enable efficient long-range information propagation, these local updates are combined with bidirectional Mamba state-space blocks. By avoiding dense attention matrices, MMPD scales more favorably for long codes in both memory and computation. Experiments on the (1056, 880) LDPC code show that MMPD achieves a 0.45 dB gain over the state-of-the-art CrossMPT decoder at a specified target bit error rate, while reducing memory consumption by a factor of 1.5. This reduction factor increases substantially for longer codes, demonstrating the applicability of MMPD to scalable neural decoding of practical long codes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Mamba Message-Passing Decoder (MMPD), an attention-free syndrome-based neural decoder for binary linear codes. It retains the Tanner-graph structure via local pairwise aggregation along variable-check edges and augments this with bidirectional Mamba state-space blocks for long-range propagation. The central empirical claim is that MMPD achieves a 0.45 dB gain over the CrossMPT decoder at a target BER on the (1056,880) LDPC code while reducing memory by a factor of 1.5, with the memory advantage asserted to grow for longer codes.

Significance. If the scalability claims hold, MMPD would provide a concrete route to neural decoding of practical-length codes by replacing quadratic attention with linear-complexity Mamba blocks while preserving graph locality. The architecture is defined independently of the performance numbers and avoids obvious circularity. The reported memory reduction on the tested code is a tangible engineering advantage, but the absence of results on longer or varied codes leaves the broader significance unestablished.
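
The scalability half of that claim is, at bottom, a complexity count; a back-of-the-envelope sketch (with an assumed state dimension, since the paper's value is not quoted here) shows how the gap grows with block length.

```python
# Back-of-the-envelope count behind the O(n) vs. O(n^2) scaling argument.
# The state dimension d_state = 16 is an assumed illustrative value, not a
# number taken from the paper.
d_state = 16
for n in (1056, 4 * 1056, 8 * 1056):
    attention_entries = n * n        # dense score matrix per layer and head
    state_entries = n * d_state      # recurrent state trail per layer
    print(f"n={n:5d}  attention/state-space entry ratio ≈ "
          f"{attention_entries / state_entries:.0f}x")
# The ratio grows like n / d_state, so the relative memory advantage of the
# linear-time path keeps widening as the code gets longer.
```

Constant factors, and whether accuracy survives at those lengths, are exactly what the comments below press on.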

major comments (2)
  1. [Experiments] Experiments section: All quantitative results (0.45 dB gain, 1.5× memory reduction) are reported for a single (1056,880) LDPC code. No ablation studies, statistical error bars, training details, or results on longer block lengths or other code families are provided, which directly undermines the claim that the memory reduction “increases substantially for longer codes” and that Mamba propagation remains effective without code-specific retuning.
  2. [Method] Method section: The integration of bidirectional Mamba blocks with local pairwise aggregation is described at a high level, but no analysis, bound, or empirical test addresses whether the Mamba state propagation avoids information bottlenecks or vanishing gradients on Tanner graphs whose diameter grows with block length. This is load-bearing for the scalability argument.
minor comments (2)
  1. [Abstract] Abstract: The specific target bit-error-rate value at which the 0.45 dB gain is measured should be stated explicitly rather than left as “a specified target bit error rate.”
  2. [Experiments] The paper would benefit from a short table comparing parameter counts, training epochs, and optimizer settings against CrossMPT to allow readers to assess whether the gain could be explained by hyperparameter differences.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thorough review and valuable comments. We address each major comment below and indicate planned revisions to the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section: All quantitative results (0.45 dB gain, 1.5× memory reduction) are reported for a single (1056,880) LDPC code. No ablation studies, statistical error bars, training details, or results on longer block lengths or other code families are provided, which directly undermines the claim that the memory reduction “increases substantially for longer codes” and that Mamba propagation remains effective without code-specific retuning.

    Authors: We agree that the reported results are confined to the (1056,880) LDPC code, selected to permit direct comparison with prior attention-based decoders. In the revision we will include training hyperparameters, ablation studies isolating the contribution of the bidirectional Mamba blocks versus local aggregation, and error bars from multiple independent training runs. The claim that memory reduction grows with code length follows directly from the O(N) complexity of Mamba versus quadratic attention; however, we will qualify this statement to note that it is supported by complexity analysis rather than additional empirical results, as generating data for substantially longer codes requires resources beyond the present study. revision: partial

  2. Referee: [Method] Method section: The integration of bidirectional Mamba blocks with local pairwise aggregation is described at a high level, but no analysis, bound, or empirical test addresses whether the Mamba state propagation avoids information bottlenecks or vanishing gradients on Tanner graphs whose diameter grows with block length. This is load-bearing for the scalability argument.

    Authors: The bidirectional Mamba blocks are chosen precisely because their selective state-space mechanism provides linear-time propagation with built-in mechanisms (input-dependent gating and stable recurrence) that mitigate vanishing gradients on long sequences. We will augment the method section with a concise discussion of these properties, supported by references to the Mamba literature, and explain why the combination with local Tanner-graph aggregation preserves locality while enabling long-range flow. A formal bound on information flow is outside the scope of this work, but we will add a short empirical check of gradient norms during training on the evaluated code to address the concern. revision: yes
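
The gradient-norm check the authors propose is cheap to add; a generic sketch, not code from the paper:

```python
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients; call right after loss.backward().
    A persistently shrinking value suggests vanishing gradients, an exploding
    one suggests training instability."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# Hypothetical use inside whatever training loop the decoder actually has:
#   loss.backward()
#   grad_norms.append(global_grad_norm(model))
#   optimizer.step()
```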

standing simulated objections not resolved
  • Empirical results on block lengths substantially longer than 1056 bits or on additional code families are not present in the manuscript and cannot be supplied without new large-scale experiments.

Circularity Check

0 steps flagged

No circularity; architecture and empirical results are independent

full rationale

The paper defines MMPD via local pairwise aggregation on Tanner-graph edges combined with bidirectional Mamba blocks for long-range propagation. This architectural choice is stated directly and is independent of the reported performance numbers. The 0.45 dB gain and 1.5× memory reduction are presented as experimental outcomes on the (1056,880) LDPC code, not as quantities derived from or equivalent to any fitted parameter or self-referential normalization. No equations, uniqueness theorems, or ansatzes reduce the central claims to inputs by construction. Self-citations (if present for CrossMPT or Mamba) are not load-bearing for the core architecture or scalability argument, which rests on the explicit avoidance of quadratic attention costs. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard coding-theory assumptions and learned neural weights; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • neural network weights
    All parameters of the Mamba blocks and message-passing layers are fitted during training on codeword data.
axioms (1)
  • domain assumption: Binary linear codes admit a Tanner-graph representation with variable and check nodes connected by edges.
    Invoked when the paper states that MMPD retains the Tanner-graph structure for local pairwise aggregation.

pith-pipeline@v0.9.0 · 5506 in / 1184 out tokens · 44062 ms · 2026-05-12T04:22:20.080939+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Recent advances in deep learning for channel coding: A survey,

    T. Matsumine and H. Ochiai, “Recent advances in deep learning for channel coding: A survey,” IEEE Open Journal of the Communications Society, vol. 5, pp. 6443–6481, 2024

  2. [2]

    Graph neural networks for channel decoding,

    S. Cammerer, J. Hoydis, F. A. Aoudia, and A. Keller, “Graph neural networks for channel decoding,” in 2022 IEEE Globecom Workshops (GC Wkshps), 2022, pp. 486–491

  3. [3]

    Deep neural network based decoding of short 5G LDPC codes,

    K. Andreev, A. Frolov, G. Svistunov, K. Wu, and J. Liang, “Deep neural network based decoding of short 5G LDPC codes,” in 2021 XVII International Symposium “Problems of Redundancy in Information and Control Systems” (REDUNDANCY), 2021, pp. 155–160

  4. [4]

    Deep learning for decoding of linear codes - a syndrome-based approach,

    A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep learning for decoding of linear codes - a syndrome-based approach,” in Proc. of IEEE International Symposium on Information Theory (ISIT), June 2018, pp. 1595–1599

  5. [5]

    Iterative syndrome-based deep neural network decoding,

    D. Artemasov, K. Andreev, P. Rybin, and A. Frolov, “Iterative syndrome-based deep neural network decoding,” IEEE Open Journal of the Communications Society, vol. 6, pp. 629–641, 2024

  6. [6]

    Error correction code transformer,

    Y. Choukroun and L. Wolf, “Error correction code transformer,” in Advances in Neural Information Processing Systems, vol. 35. Curran Associates, Inc., 2022, pp. 38695–38705. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2022/file/fcd3909db30887ce1da519c4468db668-Paper-Conference.pdf

  7. [7]

    CrossMPT: Cross-attention message-passing transformer for error correcting codes,

    S.-J. Park, H.-Y. Kwak, S.-H. Kim, Y. Kim, and J.-S. No, “CrossMPT: Cross-attention message-passing transformer for error correcting codes,” in The Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https://openreview.net/forum?id=gFvRRCnQvX

  8. [8]

    Accelerating error correction code transformers,

    M. Levy, Y. Choukroun, and L. Wolf, “Accelerating error correction code transformers,” 2024. [Online]. Available: https://arxiv.org/abs/2410.05911

  9. [9]

    Hybrid mamba-transformer decoder for error-correcting codes,

    S. el Cohen, Y. Choukroun, and E. Nachmani, “Hybrid mamba-transformer decoder for error-correcting codes,” 2025. [Online]. Available: https://arxiv.org/abs/2505.17834

  10. [10]

    A foundation model for error correction codes,

    Y. Choukroun and L. Wolf, “A foundation model for error correction codes,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=7KDuQPrAF3

  11. [11]

    Cross-attention message-passing transformers for code-agnostic decoding in 6G networks,

    S.-J. Park, H.-Y. Kwak, S.-H. Kim, Y. Kim, and J.-S. No, “Cross-attention message-passing transformers for code-agnostic decoding in 6G networks,” 2025. [Online]. Available: https://arxiv.org/abs/2507.01038

  12. [12]

    Denoising diffusion error correction codes,

    Y. Choukroun and L. Wolf, “Denoising diffusion error correction codes,” 2022. [Online]. Available: https://arxiv.org/abs/2209.13533

  13. [13]

    Syndrome-flow consistency model achieves one-step denoising error correction codes,

    H. Lei, C. W. Lau, K. Zhou, N. Guo, and F. Farnia, “Syndrome-flow consistency model achieves one-step denoising error correction codes,”

  14. [14]

    Available: https://arxiv.org/abs/2512.01389

    [Online]. Available: https://arxiv.org/abs/2512.01389

  15. [15]

    Natural language inference by tree-based convolution and heuristic matching,

    L. Mou, R. Men, G. Li, Y. Xu, L. Zhang, R. Yan, and Z. Jin, “Natural language inference by tree-based convolution and heuristic matching,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 130–136

  16. [16]

    Factor graphs and the sum-product algorithm,

    F. Kschischang, B. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 498–519, 2001

  17. [17]

    5G NR Multiplexing and Channel Coding,

    3GPP, “5G NR Multiplexing and Channel Coding,” 3rd Generation Partnership Project (3GPP), Technical Specification TS 38.212, 2020, version 16.11.0. Available: https://www.3gpp.org/ftp/Specs/archive/38_series/38.212