Recognition: 2 theorem links
Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Pith reviewed 2026-05-10 19:57 UTC · model grok-4.3
The pith
Gated-SwinRMT merges Swin window attention with retentive Manhattan decay through input-dependent gating to raise image classification accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gated-SwinRMT decomposes self-attention into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks supply a two-dimensional locality prior without learned positional biases. One variant substitutes sigmoid for softmax, applies balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU so that normalized outputs suppress uninformative scores. The other variant retains softmax-normalized retention with an additive log-space decay bias and inserts an explicit G1 sigmoid gate, projected from the block input and applied after local context enhancement but before the output projection, to alleviate the low-rank value-to-output bottleneck and enable input-dependent suppression of attended outputs.
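For concreteness, a minimal sketch of the fixed locality prior described above, assuming a square window of side window_size and per-head decay rates of the RetNet form 1 - 2^(-5-h); the paper's balanced ALiBi-style slopes and its exact multiplicative versus additive application are simplified here, so the values are illustrative only.

```python
import torch

def manhattan_decay_mask(window_size: int, num_heads: int) -> torch.Tensor:
    """Per-head exponential decay over Manhattan distance inside one window.

    Returns a (num_heads, W*W, W*W) tensor D with
    D[h, i, j] = gamma_h ** (|x_i - x_j| + |y_i - y_j|).
    The decay rates below follow the RetNet convention and are illustrative only.
    """
    ys, xs = torch.meshgrid(
        torch.arange(window_size), torch.arange(window_size), indexing="ij"
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)              # (W*W, 2)
    dist = (coords[:, None, :] - coords[None, :, :]).abs().sum(-1).float()  # Manhattan distance
    gammas = 1.0 - 2.0 ** (-5.0 - torch.arange(num_heads, dtype=torch.float32))
    return gammas[:, None, None] ** dist[None, :, :]                        # (heads, W*W, W*W)

# SWAT-style multiplicative use:  scores = torch.sigmoid(q @ k.transpose(-2, -1) * scale) * mask
# Retention-style additive use:   scores = torch.softmax(q @ k.transpose(-2, -1) * scale + mask.log(), dim=-1)
```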
What carries the argument
Width-wise and height-wise retention decomposition inside shifted windows, combined with per-head Manhattan-distance exponential decay masks and input-dependent gating on value projections or after local context enhancement.
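A rough single-head sketch of that decomposition, with the window already extracted and the decay applied multiplicatively after normalization in the style of RMT's decomposed attention; the per-head handling, gating, and the Retention variant's additive log-space bias are omitted, so treat this as a reading aid rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def decomposed_retention(q, k, v, decay_w, decay_h):
    """Width-wise then height-wise retention inside a single window (one head).

    q, k, v:  (B, H, W, C) tokens laid out on the window grid.
    decay_w:  (W, W) Manhattan decay along the width axis.
    decay_h:  (H, H) Manhattan decay along the height axis.
    """
    scale = q.shape[-1] ** -0.5

    # Pass 1: each row attends along the width axis only.
    attn_w = torch.einsum("bhwc,bhxc->bhwx", q, k) * scale   # (B, H, W, W)
    attn_w = F.softmax(attn_w, dim=-1) * decay_w             # decay applied after normalization
    out = torch.einsum("bhwx,bhxc->bhwc", attn_w, v)

    # Pass 2: each column attends along the height axis, mixing the width-pass output.
    attn_h = torch.einsum("bhwc,bgwc->bwhg", q, k) * scale   # (B, W, H, H)
    attn_h = F.softmax(attn_h, dim=-1) * decay_h
    return torch.einsum("bwhg,bgwc->bhwc", attn_h, out)
```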
If this is right
- At 77-79 million parameters Gated-SwinRMT-SWAT reaches 80.22 percent top-1 accuracy on Mini-ImageNet while the RMT baseline reaches 73.74 percent under the same training protocol.
- On CIFAR-10 the accuracy advantage shrinks from 6.48 to 0.56 percentage points because small feature maps cause adaptive windowing to collapse to global attention.
- The normalized output of the sigmoid variant implicitly down-weights low-attention scores without an extra explicit gate (illustrated by the toy comparison after this list).
- Both variants avoid learned positional biases by relying on the fixed Manhattan-distance decay masks inside each window.
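A toy numerical contrast illustrating the third bullet, under the assumption that "normalized output" refers to the magnitude of the attention output rather than a later normalization layer (the abstract does not pin this down): softmax always renormalizes a row to sum to one, whereas uniformly low sigmoid scores shrink the output on their own.

```python
import torch

torch.manual_seed(0)
logits = torch.tensor([-4.0, -4.2, -3.8, -4.1])   # a uniformly uninformative row of scores
values = torch.randn(4, 8)

# Softmax renormalizes the row to sum 1, so the output keeps full magnitude regardless.
softmax_out = torch.softmax(logits, dim=-1) @ values

# Sigmoid scores stay near zero, so the mixed output is suppressed
# without any additional explicit gate.
sigmoid_out = torch.sigmoid(logits) @ values

print(softmax_out.norm().item(), sigmoid_out.norm().item())  # second value is far smaller
```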
Where Pith is reading between the lines
- The same width-height split and gating pattern could be tested on object-detection or segmentation heads where local context matters more than global scope.
- When input resolution is low enough that windows cover the entire feature map, the architecture may need an optional global retention path to preserve its advantage.
- The explicit G1 gate after local context enhancement suggests a general recipe for relieving the low-rank bottleneck in value-to-output projections across other transformer blocks (a placement sketch follows this list).
- Because the decay masks are input-independent while the gate is input-dependent, the design separates fixed locality bias from adaptive suppression, a separation worth measuring on datasets with varying object scales.
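A minimal sketch of the gate placement referenced in the third bullet above, assuming G1 is a per-channel sigmoid projected from the block input; the module names (gate_proj, lce, w_o) are illustrative stand-ins for the paper's G1 projection, LCE module, and output projection W_O, and the depthwise convolution is only a plausible choice of LCE.

```python
import torch
import torch.nn as nn

class GatedRetentionOutput(nn.Module):
    """Sketch: input-dependent G1 sigmoid gate after LCE, before the output projection."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, dim)                      # G1, projected from the block input
        self.lce = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depthwise conv as LCE stand-in
        self.w_o = nn.Linear(dim, dim)                            # output projection W_O

    def forward(self, block_input: torch.Tensor, retention_out: torch.Tensor) -> torch.Tensor:
        # block_input, retention_out: (B, H, W, C) window/grid features.
        lce_out = retention_out + self.lce(
            retention_out.permute(0, 3, 1, 2)
        ).permute(0, 2, 3, 1)
        gate = torch.sigmoid(self.gate_proj(block_input))         # input-dependent suppression
        return self.w_o(gate * lce_out)                           # gating breaks the plain value-to-output path
```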
Load-bearing premise
The observed accuracy gains arise from the width-height retention split, spatial decay masks, and input-dependent gates rather than from differences in training schedules or hyperparameter choices.
What would settle it
Train the RMT baseline and both Gated-SwinRMT variants from scratch using identical data augmentations, optimizers, learning-rate schedules, and random seeds, then measure whether the top-1 gap on Mini-ImageNet remains above 5 percentage points.
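A hedged sketch of that controlled comparison; build_model and train_and_evaluate are hypothetical helpers, and the only point is that seeds and the full protocol are pinned so that architecture is the sole varying factor.

```python
import random
import numpy as np
import torch

def run(arch: str, seed: int, protocol: dict) -> float:
    """Train one architecture under a fixed protocol and return Mini-ImageNet top-1 (%)."""
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)  # pin all RNGs
    model = build_model(arch)                     # hypothetical model factory
    return train_and_evaluate(model, **protocol)  # hypothetical trainer

# Protocol shared verbatim across models: augmentations, optimizer, LR schedule, epochs.
protocol = {"epochs": 300, "optimizer": "adamw", "weight_decay": 0.05}

gaps = []
for seed in (0, 1, 2):
    gaps.append(run("gated_swinrmt_swat", seed, protocol) - run("rmt_baseline", seed, protocol))

print(f"mean top-1 gap: {np.mean(gaps):.2f} pp")  # the claim stands if this stays above 5 pp
```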
Figures
Original abstract
We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. Gated-SwinRMT-SWAT substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate -- projected from the block input and applied after local context enhancement (LCE) but prior to the output projection $W_O$ -- to alleviate the low-rank $W_V \!\cdot\! W_O$ bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet ($224{\times}224$, 100 classes) and CIFAR-10 ($32{\times}32$, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At ${\approx}77$--$79$\,M parameters, Gated-SwinRMT-SWAT achieves $80.22\%$ and Gated-SwinRMT-Retention $78.20\%$ top-1 test accuracy on Mini-ImageNet, compared with $73.74\%$ for the RMT baseline. On CIFAR-10 -- where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope -- the accuracy advantage compresses from $+6.48$\,pp to $+0.56$\,pp.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gated-SwinRMT, a hybrid vision transformer family that unifies Swin Transformer's shifted-window attention with Retentive Networks' Manhattan-distance spatial decay, augmented by input-dependent gating. Two variants are presented: Gated-SwinRMT-SWAT (using sigmoid activation, balanced ALiBi slopes, multiplicative decay, and SwiGLU gating) and Gated-SwinRMT-Retention (softmax retention with additive log-space decay and explicit G1 sigmoid gate after local context enhancement). Under claimed identical training protocols on a single GPU, the models achieve 80.22% and 78.20% top-1 accuracy on Mini-ImageNet (vs. 73.74% for RMT baseline) at ~77-79M parameters, with the gap shrinking to +0.56 pp on CIFAR-10 due to window collapse on small feature maps.
Significance. If the reported gains hold under verified controls, the work offers a concrete unification of windowed attention and retentive decay with input-dependent gating that avoids learned positional biases, providing a new architectural direction for efficient vision transformers. The explicit decomposition into width/height retention passes and the two gating strategies (SwiGLU vs. G1 sigmoid) are clearly motivated and could inspire further hybrids; the single-GPU constraint and post-hoc CIFAR analysis also highlight practical deployment considerations.
major comments (4)
- [Abstract / Experimental Results] Abstract and Experimental Results: The central claim of a +6.48 pp improvement on Mini-ImageNet rests on 'identical training protocols' for the RMT baseline, yet no table, appendix, or section confirms that learning-rate schedule, optimizer, augmentation, epoch count, or weight decay were literally matched; without this, the attribution to width/height retention decomposition, spatial decay masks, and input-dependent gating cannot be isolated from possible hyperparameter or implementation differences.
- [Abstract / CIFAR-10 Results] Abstract and Results on CIFAR-10: The post-hoc explanation that the accuracy advantage collapses from +6.48 pp to +0.56 pp because 'small feature maps cause the adaptive windowing mechanism to collapse attention to global scope' is presented without supporting ablations on window size, feature-map resolution, or controlled tests of the retention passes; this weakens the claim that the proposed mechanisms provide intrinsic gains rather than resolution-dependent behavior.
- [Experimental Results] Experimental Results: No component ablations are reported that isolate the contributions of the consecutive width-wise/height-wise retention passes, per-head exponential decay masks, or the specific gating choices (SwiGLU in SWAT vs. explicit G1 sigmoid in Retention); without these, it is impossible to determine which element drives the reported lift over the RMT baseline.
- [Abstract / Results] Abstract and Results: The headline accuracies (80.22%, 78.20%, 73.74%) are given as single-point estimates with no error bars, multiple random seeds, or statistical significance tests; this is especially problematic for a claim that hinges on a 6.48 pp gap at matched parameter counts.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and valuable suggestions. We address each of the major comments in detail below, providing clarifications and indicating revisions where appropriate. Our responses aim to strengthen the manuscript's claims regarding the proposed unification of Swin attention and retentive decay.
Point-by-point responses
-
Referee: The central claim of a +6.48 pp improvement on Mini-ImageNet rests on 'identical training protocols' for the RMT baseline, yet no table, appendix, or section confirms that learning-rate schedule, optimizer, augmentation, epoch count, or weight decay were literally matched; without this, the attribution to width/height retention decomposition, spatial decay masks, and input-dependent gating cannot be isolated from possible hyperparameter or implementation differences.
Authors: We confirm that the RMT baseline and both Gated-SwinRMT variants were trained under literally identical protocols, including AdamW optimizer, cosine learning-rate schedule with the same peak LR and warmup, identical RandAugment + mixup/cutmix augmentations, 300 epochs, and weight decay of 0.05. To eliminate any ambiguity, we will add an explicit hyperparameter table in the appendix of the revised manuscript. revision: yes
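A sketch of what that promised appendix table might contain, restricted to the values stated in this response; the peak learning rate and warmup length are asserted to match across models but not quantified, so they are left unset rather than guessed.

```python
# Shared training protocol for the RMT baseline and both Gated-SwinRMT variants,
# as stated in the rebuttal. None marks values said to match but not reported.
TRAINING_PROTOCOL = {
    "optimizer": "AdamW",
    "lr_schedule": "cosine",
    "peak_lr": None,
    "warmup": None,
    "weight_decay": 0.05,
    "epochs": 300,
    "augmentation": ["RandAugment", "mixup", "cutmix"],
}
```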
-
Referee: The post-hoc explanation that the accuracy advantage collapses from +6.48 pp to +0.56 pp because 'small feature maps cause the adaptive windowing mechanism to collapse attention to global scope' is presented without supporting ablations on window size, feature-map resolution, or controlled tests of the retention passes; this weakens the claim that the proposed mechanisms provide intrinsic gains rather than resolution-dependent behavior.
Authors: The collapse explanation follows directly from the shifted-window formulation: on 32x32 CIFAR-10 inputs the feature maps quickly become smaller than the default 7x7 window, turning the mechanism global. We acknowledge the lack of dedicated ablations and will add a short paragraph plus a minimal window-size sensitivity check on a held-out subset in the revised experimental section. revision: partial
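A quick arithmetic check of this explanation, assuming a standard four-stage hierarchy with 4x patch embedding, 2x downsampling between stages, and the default 7x7 window mentioned above; the paper's exact stage layout may differ.

```python
# Per-stage feature-map side length for a 32x32 CIFAR-10 input, assuming
# 4x patch embedding followed by 2x downsampling at each subsequent stage.
input_size, window = 32, 7
stage_sizes = [input_size // 4 // (2 ** s) for s in range(4)]   # [8, 4, 2, 1]

for stage, size in enumerate(stage_sizes):
    scope = "local window" if size > window else "window covers whole map (global)"
    print(f"stage {stage}: {size}x{size} -> {scope}")
# Only the first stage (8x8) exceeds the 7x7 window, and only barely; from stage 1
# onward there is nothing left to partition, so attention is effectively global.
```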
-
Referee: No component ablations are reported that isolate the contributions of the consecutive width-wise/height-wise retention passes, per-head exponential decay masks, or the specific gating choices (SwiGLU in SWAT vs. explicit G1 sigmoid in Retention); without these, it is impossible to determine which element drives the reported lift over the RMT baseline.
Authors: Given the single-GPU resource constraint, we focused on full-model comparisons rather than exhaustive component ablations. The separable width/height retention and the two gating designs are motivated in Sections 3.2–3.3 as direct remedies for the low-rank bottleneck and lack of 2D locality in prior retentive work. We will expand the discussion to clarify the intended contribution of each element but cannot add new ablation experiments at this time. revision: no
-
Referee: The headline accuracies (80.22%, 78.20%, 73.74%) are given as single-point estimates with no error bars, multiple random seeds, or statistical significance tests; this is especially problematic for a claim that hinges on a 6.48 pp gap at matched parameter counts.
Authors: We agree that multiple random seeds and error bars would strengthen the presentation. All runs were performed on a single GPU under tight resource limits, so only single-run results are reported. The magnitude of the gap and its consistency across two datasets and two architectural variants provide supporting evidence, but we will add an explicit limitations paragraph acknowledging the absence of statistical tests. revision: partial
Circularity Check
No circularity: new architecture combines external components without self-referential derivation
full rationale
The paper presents Gated-SwinRMT as an explicit hybrid construction that decomposes shifted-window attention into width/height retention passes, adds Manhattan decay masks, and introduces input-dependent gating (SwiGLU or G1 sigmoid). All equations describe forward-pass operations on external baselines (Swin, RMT) rather than deriving a result from a fitted quantity defined by the authors' own prior equations. Performance numbers are reported against an external RMT baseline under claimed identical protocols; no step renames a known empirical pattern, smuggles an ansatz via self-citation, or treats a fitted parameter as a prediction. The derivation chain is therefore self-contained as an engineering synthesis.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Self-attention inside shifted windows can be decomposed into consecutive width-wise and height-wise retention passes with per-head exponential decay masks.
invented entities (1)
- Input-dependent G1 sigmoid gate (no independent evidence)
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
The relation between the cited Recognition theorem and the following paper passage is unclear:
Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, theorem J_uniquely_calibrated_via_higher_derivative (tag: unclear)
The relation between the cited Recognition theorem and the following paper passage is unclear:
G1 output gate … applied after local context enhancement (LCE) but prior to the output projection W_O
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. CSWin Transformer: A general vision transformer backbone with cross-shaped windows. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 12114–12124. IEEE, 2022. doi: 10.1109/CVPR526...
-
[2]
Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. RMT: Retentive networks meet vision transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 5641–5651. IEEE, 2024. doi: 10.1109/CVPR52733.2024.00539
-
[3]
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016. doi: 10.1007/978-3-319-46493-0_39
-
[4]
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
-
[5]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 9992–10002. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00986
-
[6]
Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0
-
[7]
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processin..., 2025.
-
[8]
URL https://openreview.net/forum?id=1b7whO4SfY
-
[9]
Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to Transformer for large language models. CoRR, abs/2307.08621, 2023. URL https://arxiv.org/abs/2307.08621
-
[10]
Oriol Vinyals, Charles Blundell, Timothy P. Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3630–3638, 2016.