pith. machine review for the scientific record.

arxiv: 2604.06014 · v2 · submitted 2026-04-07 · 💻 cs.LG

Recognition: 2 theorem links

· Lean Theorem

Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords vision transformers · Swin Transformer · Retentive Networks · attention gating · Manhattan decay · image classification · shifted window attention · hybrid attention

The pith

Gated-SwinRMT merges Swin window attention with retentive Manhattan decay through input-dependent gating to raise image classification accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gated-SwinRMT as a hybrid vision transformer that joins the shifted-window attention of Swin Transformers with the Manhattan-distance spatial decay of Retentive Networks. Self-attention inside each window is split into successive width-wise and height-wise retention steps, each using per-head exponential decay masks to enforce locality without extra learned biases. Two gated variants are presented: one replaces softmax with sigmoid and adds SwiGLU on the value path, while the other keeps softmax but inserts an explicit input-dependent gate after local context enhancement. On Mini-ImageNet both variants exceed the RMT baseline by several percentage points at roughly 77-79 million parameters. The accuracy edge narrows sharply on CIFAR-10 because small feature maps force the adaptive windows toward global scope.

Core claim

Gated-SwinRMT decomposes self-attention into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks supply a two-dimensional locality prior without learned positional biases. One variant substitutes sigmoid for softmax, applies balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU so that normalized outputs suppress uninformative scores. The other variant retains softmax-normalized retention with an additive log-space decay bias and inserts an explicit G1 sigmoid gate projected from the block input after local context enhancement but before the output projection to alleviate the low-rank W_V · W_O bottleneck and enable input-dependent suppression of attended outputs.
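The claimed equivalence between the width/height split and a full two-dimensional Manhattan prior can be checked directly: the 2D decay matrix over a window is exactly the Kronecker product of two 1D decay matrices, since γ^(|Δy|+|Δx|) = γ^|Δy| · γ^|Δx|. A minimal NumPy sketch (window size and γ are arbitrary illustration values, not from the paper):

```python
import numpy as np

def decay_mask_1d(n, gamma):
    # D[i, j] = gamma ** |i - j|, the per-axis retention decay
    idx = np.arange(n)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def manhattan_mask_2d(h, w, gamma):
    # D[(y1,x1),(y2,x2)] = gamma ** (|y1-y2| + |x1-x2|)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (h*w, 2) token grid
    dist = np.abs(pos[:, None, :] - pos[None, :, :]).sum(-1)
    return gamma ** dist

gamma, h, w = 0.9, 4, 4
Dh = decay_mask_1d(h, gamma)        # height-wise pass
Dw = decay_mask_1d(w, gamma)        # width-wise pass
D2 = manhattan_mask_2d(h, w, gamma) # full 2D Manhattan prior

# The factorization behind the consecutive passes:
assert np.allclose(np.kron(Dh, Dw), D2)
```

The same identity is what lets the consecutive passes enforce 2D locality with no learned positional parameters.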

What carries the argument

Width-wise and height-wise retention decomposition inside shifted windows, combined with per-head Manhattan-distance exponential decay masks and input-dependent gating on value projections or after local context enhancement.

If this is right

  • At 77-79 million parameters Gated-SwinRMT-SWAT reaches 80.22 percent top-1 accuracy on Mini-ImageNet while the RMT baseline reaches 73.74 percent under the same training protocol.
  • On CIFAR-10 the accuracy advantage shrinks from 6.48 to 0.56 percentage points because small feature maps cause adaptive windowing to collapse to global attention.
  • The normalized output of the sigmoid variant implicitly down-weights low-attention scores without an extra explicit gate.
  • Both variants avoid learned positional biases by relying on the fixed Manhattan-distance decay masks inside each window.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same width-height split and gating pattern could be tested on object-detection or segmentation heads where local context matters more than global scope.
  • When input resolution is low enough that windows cover the entire feature map, the architecture may need an optional global retention path to preserve its advantage.
  • The explicit G1 gate after local context enhancement suggests a general recipe for relieving the low-rank bottleneck in value-to-output projections across other transformer blocks.
  • Because the decay masks are input-independent while the gate is input-dependent, the design separates fixed locality bias from adaptive suppression, a separation worth measuring on datasets with varying object scales.

Load-bearing premise

The observed accuracy gains arise from the width-height retention split, spatial decay masks, and input-dependent gates rather than from differences in training schedules or hyperparameter choices.

What would settle it

Train the RMT baseline and both Gated-SwinRMT variants from scratch using identical data augmentations, optimizers, learning-rate schedules, and random seeds, then measure whether the top-1 gap on Mini-ImageNet remains above 5 percentage points.
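The controlled comparison can be pinned down in a small shared-protocol harness. The optimizer, schedule, weight decay, and augmentation values below echo the simulated rebuttal; the epoch count and seed list are illustrative assumptions, not verified from the paper:

```python
import random
import numpy as np

# One protocol dict shared by the RMT baseline and both Gated-SwinRMT
# variants; diverging from it anywhere invalidates the attribution.
PROTOCOL = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "weight_decay": 0.05,
    "epochs": 300,  # assumed budget
    "augment": ["RandAugment", "mixup", "cutmix"],
    "seeds": [0, 1, 2],  # multiple seeds also answer the error-bar objection
}

def set_seed(seed):
    # Fix every RNG the run touches so all three models see identical
    # initialization and augmentation randomness per seed.
    random.seed(seed)
    np.random.seed(seed)
    # torch.manual_seed(seed)  # add when training with PyTorch

def gap_holds(acc_variant, acc_baseline, threshold_pp=5.0):
    # The settling criterion: does the top-1 gap stay above threshold_pp?
    return (acc_variant - acc_baseline) > threshold_pp
```

Under this harness the reported numbers would satisfy the criterion: `gap_holds(80.22, 73.74)` is true at the 5-point threshold.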

Figures

Figures reproduced from arXiv: 2604.06014 by Arindam Roy, Dipan Maity, Suman Mondal.

Figure 1
Figure 1: Architecture of the proposed Gated-SwinRMT variants. (a) Gated-SwinRMT-SWAT: sigmoid-based normalized attention with SwiGLU-gated values, balanced ALiBi positional bias, and multiplicative spatial decay γ^|i−j| applied post-sigmoid. (b) Gated-SwinRMT-Retention: softmax-normalized retention with additive log-decay mask M_ij = |i−j| log γ_h applied pre-softmax, and a learned G1 sigmoid gate applied after local context enhancement (LCE). view at source ↗
read the original abstract

We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. Gated-SwinRMT-SWAT substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate, projected from the block input and applied after local context enhancement (LCE) but prior to the output projection W_O, to alleviate the low-rank W_V · W_O bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet (224×224, 100 classes) and CIFAR-10 (32×32, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At ≈77-79 M parameters, Gated-SwinRMT-SWAT achieves 80.22% and Gated-SwinRMT-Retention 78.20% top-1 test accuracy on Mini-ImageNet, compared with 73.74% for the RMT baseline. On CIFAR-10, where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope, the accuracy advantage compresses from +6.48 pp to +0.56 pp.
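The two decay placements the abstract describes are, under softmax, two views of one operation: adding |i−j|·log γ to the logits is identical to multiplying the exponentiated scores by γ^|i−j| before normalization. (No such interchange holds for the sigmoid-based SWAT variant, which is why its multiplicative decay is a genuinely different design.) A toy NumPy check with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 6, 0.8
scores = rng.normal(size=(n, n))
dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])

# Retention variant: additive log-space decay bias before softmax
logits = scores + dist * np.log(gamma)
attn_add = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Same weights via a multiplicative decay mask applied pre-normalization
unnorm = np.exp(scores) * gamma ** dist
attn_mul = unnorm / unnorm.sum(-1, keepdims=True)

assert np.allclose(attn_add, attn_mul)
```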

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 0 minor

Summary. The paper introduces Gated-SwinRMT, a hybrid vision transformer family that unifies Swin Transformer's shifted-window attention with Retentive Networks' Manhattan-distance spatial decay, augmented by input-dependent gating. Two variants are presented: Gated-SwinRMT-SWAT (using sigmoid activation, balanced ALiBi slopes, multiplicative decay, and SwiGLU gating) and Gated-SwinRMT-Retention (softmax retention with additive log-space decay and explicit G1 sigmoid gate after local context enhancement). Under claimed identical training protocols on a single GPU, the models achieve 80.22% and 78.20% top-1 accuracy on Mini-ImageNet (vs. 73.74% for RMT baseline) at ~77-79M parameters, with the gap shrinking to +0.56 pp on CIFAR-10 due to window collapse on small feature maps.

Significance. If the reported gains hold under verified controls, the work offers a concrete unification of windowed attention and retentive decay with input-dependent gating that avoids learned positional biases, providing a new architectural direction for efficient vision transformers. The explicit decomposition into width/height retention passes and the two gating strategies (SwiGLU vs. G1 sigmoid) are clearly motivated and could inspire further hybrids; the single-GPU constraint and post-hoc CIFAR analysis also highlight practical deployment considerations.

major comments (4)
  1. [Abstract / Experimental Results] The central claim of a +6.48 pp improvement on Mini-ImageNet rests on 'identical training protocols' for the RMT baseline, yet no table, appendix, or section confirms that learning-rate schedule, optimizer, augmentation, epoch count, or weight decay were literally matched; without this, the attribution to width/height retention decomposition, spatial decay masks, and input-dependent gating cannot be isolated from possible hyperparameter or implementation differences.
  2. [Abstract / CIFAR-10 Results] The post-hoc explanation that the accuracy advantage collapses from +6.48 pp to +0.56 pp because 'small feature maps cause the adaptive windowing mechanism to collapse attention to global scope' is presented without supporting ablations on window size, feature-map resolution, or controlled tests of the retention passes; this weakens the claim that the proposed mechanisms provide intrinsic gains rather than resolution-dependent behavior.
  3. [Experimental Results] No component ablations are reported that isolate the contributions of the consecutive width-wise/height-wise retention passes, per-head exponential decay masks, or the specific gating choices (SwiGLU in SWAT vs. explicit G1 sigmoid in Retention); without these, it is impossible to determine which element drives the reported lift over the RMT baseline.
  4. [Abstract / Results] The headline accuracies (80.22%, 78.20%, 73.74%) are given as single-point estimates with no error bars, multiple random seeds, or statistical significance tests; this is especially problematic for a claim that hinges on a 6.48 pp gap at matched parameter counts.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. We address each of the major comments in detail below, providing clarifications and indicating revisions where appropriate. Our responses aim to strengthen the manuscript's claims regarding the proposed unification of Swin attention and retentive decay.

read point-by-point responses
  1. Referee: The central claim of a +6.48 pp improvement on Mini-ImageNet rests on 'identical training protocols' for the RMT baseline, yet no table, appendix, or section confirms that learning-rate schedule, optimizer, augmentation, epoch count, or weight decay were literally matched; without this, the attribution to width/height retention decomposition, spatial decay masks, and input-dependent gating cannot be isolated from possible hyperparameter or implementation differences.

    Authors: We confirm that the RMT baseline and both Gated-SwinRMT variants were trained under literally identical protocols, including AdamW optimizer, cosine learning-rate schedule with the same peak LR and warmup, identical RandAugment + mixup/cutmix augmentations, 300 epochs, and weight decay of 0.05. To eliminate any ambiguity, we will add an explicit hyperparameter table in the appendix of the revised manuscript. revision: yes

  2. Referee: The post-hoc explanation that the accuracy advantage collapses from +6.48 pp to +0.56 pp because 'small feature maps cause the adaptive windowing mechanism to collapse attention to global scope' is presented without supporting ablations on window size, feature-map resolution, or controlled tests of the retention passes; this weakens the claim that the proposed mechanisms provide intrinsic gains rather than resolution-dependent behavior.

    Authors: The collapse explanation follows directly from the shifted-window formulation: on 32x32 CIFAR-10 inputs the feature maps quickly become smaller than the default 7x7 window, turning the mechanism global. We acknowledge the lack of dedicated ablations and will add a short paragraph plus a minimal window-size sensitivity check on a held-out subset in the revised experimental section. revision: partial
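The collapse argument reduces to feature-map arithmetic under a standard Swin-style hierarchy. The defaults below (patch size 4, four stages with 2× downsampling between them, 7×7 windows) are assumed from the Swin convention, not confirmed from the paper:

```python
def stage_resolutions(img_size, patch=4, num_stages=4):
    # Swin-style pyramid: patch embedding, then 2x downsampling per stage.
    return [img_size // patch // (2 ** s) for s in range(num_stages)]

WINDOW = 7
for img in (224, 32):
    res = stage_resolutions(img)
    global_stages = sum(r <= WINDOW for r in res)
    print(f"{img}x{img}: stage maps {res}, window covers the map in {global_stages}/4 stages")
```

For 224×224 inputs the stage maps are 56, 28, 14, 7, so only the last stage goes global; for 32×32 inputs they are 8, 4, 2, 1, so the window covers the whole map in three of four stages, which is the collapse the rebuttal describes.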

  3. Referee: No component ablations are reported that isolate the contributions of the consecutive width-wise/height-wise retention passes, per-head exponential decay masks, or the specific gating choices (SwiGLU in SWAT vs. explicit G1 sigmoid in Retention); without these, it is impossible to determine which element drives the reported lift over the RMT baseline.

    Authors: Given the single-GPU resource constraint, we focused on full-model comparisons rather than exhaustive component ablations. The separable width/height retention and the two gating designs are motivated in Sections 3.2–3.3 as direct remedies for the low-rank bottleneck and lack of 2D locality in prior retentive work. We will expand the discussion to clarify the intended contribution of each element but cannot add new ablation experiments at this time. revision: no

  4. Referee: The headline accuracies (80.22%, 78.20%, 73.74%) are given as single-point estimates with no error bars, multiple random seeds, or statistical significance tests; this is especially problematic for a claim that hinges on a 6.48 pp gap at matched parameter counts.

    Authors: We agree that multiple random seeds and error bars would strengthen the presentation. All runs were performed on a single GPU under tight resource limits, so only single-run results are reported. The magnitude of the gap and its consistency across two datasets and two architectural variants provide supporting evidence, but we will add an explicit limitations paragraph acknowledging the absence of statistical tests. revision: partial

Circularity Check

0 steps flagged

No circularity: new architecture combines external components without self-referential derivation

full rationale

The paper presents Gated-SwinRMT as an explicit hybrid construction that decomposes shifted-window attention into width/height retention passes, adds Manhattan decay masks, and introduces input-dependent gating (SwiGLU or G1 sigmoid). All equations describe forward-pass operations on external baselines (Swin, RMT) rather than deriving a result from a fitted quantity defined by the authors' own prior equations. Performance numbers are reported against an external RMT baseline under claimed identical protocols; no step renames a known empirical pattern, smuggles an ansatz via self-citation, or treats a fitted parameter as a prediction. The derivation chain is therefore self-contained as an engineering synthesis.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard transformer assumptions plus the new gating and decay components; no major free parameters or invented entities with independent evidence are declared beyond architectural choices.

axioms (1)
  • domain assumption Self-attention inside shifted windows can be decomposed into consecutive width-wise and height-wise retention passes with per-head exponential decay masks
    Invoked in the description of how locality is enforced without learned positional biases.
invented entities (1)
  • Input-dependent G1 sigmoid gate no independent evidence
    purpose: Alleviate low-rank W_V · W_O bottleneck and enable input-dependent suppression of attended outputs
    New component introduced in the Retention variant and applied after local context enhancement.
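As described in the abstract, the G1 gate is a sigmoid projection of the block input X, multiplied into the post-LCE features before the output projection W_O; because it is computed from X rather than from the attended output, it supplies input-dependent suppression independent of attention scores. A shape-level NumPy sketch (all weights random, dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))         # block input
attn_out = rng.normal(size=(n_tokens, d))  # stand-in for retention + LCE output
W_g = rng.normal(size=(d, d))              # G1 gate projection (hypothetical shape)
W_O = rng.normal(size=(d, d))              # output projection

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# G1 gate: projected from the block input, applied after LCE, before W_O.
gate = sigmoid(X @ W_g)       # elementwise in (0, 1)
y = (gate * attn_out) @ W_O   # gated output entering the projection

assert y.shape == (n_tokens, d)
```

Inserting the gate between the value path and W_O is what the paper argues breaks the low-rank W_V · W_O bottleneck.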

pith-pipeline@v0.9.0 · 5675 in / 1296 out tokens · 73734 ms · 2026-05-10T19:57:45.516938+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 5 canonical work pages · 1 internal anchor

  1. [1] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, pages 12114–12124. IEEE, 2022.
  2. [2] Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. RMT: Retentive networks meet vision transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, pages 5641–5651. IEEE, 2024. doi: 10.1109/CVPR52733.2024.00539
  3. [3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Computer Vision - ECCV 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016. doi: 10.1007/978-3-319-46493-0_39
  4. [4] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
  5. [5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986
  6. [6] Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0
  7. [7] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Advances in Neural Information Processing Systems 38.
  8. [8] URL https://openreview.net/forum?id=1b7whO4SfY
  9. [9] Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. CoRR, abs/2307.08621, 2023. URL https://arxiv.org/abs/2307.08621
  10. [10] Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29, Barcelona, Spain, pages 3630–3638, 2016.