arxiv: 2606.05999 · v1 · pith:3DQACKPBnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

Pith reviewed 2026-06-28 02:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords cloud removal · remote sensing · transformer · triangular attention · image restoration · attention mechanism

The pith

ATT-CR approximates self-attention with triangular matrices to remove clouds more efficiently from remote sensing images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cloud removal requires reconstructing ground objects hidden in remote sensing images. Standard transformer self-attention is too slow for large images and lets cloudy pixels interfere with the computation. The ATT-CR model introduces triangular attention that approximates the full attention using lower and upper triangular matrices at linear cost. It pairs this with a gating module that selects only clean features for further processing. Experiments show this leads to better results on cloud removal benchmarks.

Core claim

ATT-CR consists of two components. Triangular Attention (TAN) approximates softmax self-attention using lower and upper triangular matrices, reducing the complexity to O(N). The Feature Selected Gating Module (FSGM) adaptively selects clean features so that cloudy pixels do not interfere with later layers. The paper claims that together these lead to better reconstruction of ground objects.
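
To see why triangular structure can give linear cost, one well-known construction is kernelized (linear) attention with a lower-triangular mask, which reduces to a running sum over tokens. The sketch below is not the paper's TAN, whose exact formulation is not given on this page. The feature map `elu(x) + 1`, the function names, and the choice to normalise each branch separately are all assumptions made for illustration.

```python
import math

def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed choice, common in linear attention).
    return x + 1.0 if x > 0 else math.exp(x)

def lower_triangular_attention(q, k, v):
    """Token i attends to tokens j <= i. One pass over N tokens, so O(N) in N."""
    d, dv = len(q[0]), len(v[0])
    kv = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    ks = [0.0] * d                       # running sum of phi(k_j)
    out = []
    for qi, ki, vi in zip(q, k, v):
        phi_q = [feature_map(x) for x in qi]
        phi_k = [feature_map(x) for x in ki]
        for a in range(d):
            ks[a] += phi_k[a]
            for b in range(dv):
                kv[a][b] += phi_k[a] * vi[b]
        den = sum(phi_q[a] * ks[a] for a in range(d))
        out.append([sum(phi_q[a] * kv[a][b] for a in range(d)) / den
                    for b in range(dv)])
    return out

def upper_triangular_attention(q, k, v):
    """Token i attends to tokens j >= i: the same scan on the reversed sequence."""
    return lower_triangular_attention(q[::-1], k[::-1], v[::-1])[::-1]

def masked_attention_naive(q, k, v, allowed):
    """O(N^2) reference: token i attends to token j only if allowed(i, j)."""
    out = []
    for i, qi in enumerate(q):
        phi_q = [feature_map(x) for x in qi]
        w = [sum(a * feature_map(b) for a, b in zip(phi_q, kj)) if allowed(i, j) else 0.0
             for j, kj in enumerate(k)]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out
```

The scan and the explicit masked computation agree numerically, while the scan never materialises an N x N matrix. How the two branches are combined (summed, averaged, or weighted) is left open here, since the source does not say.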

What carries the argument

Triangular Attention (TAN) combined with the Feature Selected Gating Module (FSGM): TAN approximates full attention with triangular matrices to reduce cost, and FSGM filters out invalid information from cloudy pixels.
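
The gating idea can be illustrated with a generic sigmoid gate: a learned projection produces one value in (0, 1) per feature, and each feature is scaled by its gate. This is a minimal sketch of gating in general, not the paper's FSGM, whose internals are not described on this page. The name `gated_pixel` and the hand-picked weights and biases are hypothetical stand-ins for learned parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_pixel(x, weight, bias):
    """Gate one pixel's feature vector x (length C).

    weight (C x C) and bias (length C) stand in for learned parameters.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weight, bias)]
    # Gates near 1 pass a feature through; gates near 0 suppress it.
    return [xi * sigmoid(g) for xi, g in zip(x, logits)]

zero_w = [[0.0, 0.0], [0.0, 0.0]]
kept = gated_pixel([1.0, 2.0], zero_w, [20.0, 20.0])          # gates near 1
suppressed = gated_pixel([1.0, 2.0], zero_w, [-20.0, -20.0])  # gates near 0
```

In a trained network the gate would depend on the input features, so that features from cloudy regions could receive low gates. Whether FSGM behaves this way is exactly what the gating visualizations in the paper's figures are meant to show.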

Load-bearing premise

Approximating full self-attention with lower and upper triangular matrices still models the necessary long-range dependencies to accurately reconstruct ground objects hidden by clouds.

What would settle it

A test case where the triangular attention produces visibly worse reconstructions than full attention on images with intricate cloud patterns would falsify the effectiveness of the approximation.
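
Short of a full reconstruction study, one numerical proxy is to measure how far an approximate attention output drifts from exact softmax attention on the same inputs. The sketch below computes scaled dot-product softmax attention and a relative Frobenius-norm error. The function names are hypothetical, and the error measure is an assumed choice rather than anything the paper reports.

```python
import math

def softmax_attention(q, k, v):
    """Exact scaled dot-product attention; O(N^2) in the number of tokens."""
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        m = max(scores)                          # subtract the max for stability
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out

def relative_error(reference, approx):
    """Frobenius norm of (reference - approx) divided by that of reference."""
    num = sum((r - a) ** 2 for rr, ar in zip(reference, approx)
              for r, a in zip(rr, ar))
    den = sum(r ** 2 for rr in reference for r in rr)
    return math.sqrt(num / den)
```

Passing the output of any candidate approximation as `approx` gives a single number to track across cloud patterns. A small attention-output error does not by itself guarantee good reconstructions, so this complements rather than replaces the image-level comparison described above.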

Figures

Figures reproduced from arXiv: 2606.05999 by Jinjun Wang, Kangyi Wu, Pengna Li, Wenli Huang, Xiaomeng Xin, Yang Wu, Ye Deng.

Figure 1: The cloud removal outputs from our ATT-CR model in diverse cloud ...
Figure 2: Architecture of ATT-CR. (a) ATT-CR employs a multi-stage design, with each stage consisting of stacked Transformer blocks. (b) The ATAM integrates ...
Figure 4: The calculation of the triangular attention output values involves ...
Figure 5: Illustration of cloud removal results on the RICE dataset. The first two rows correspond to RICE2, while the last two belong to RICE1. Cloudy input ...
Figure 6: Illustration of cloud removal results on the T-CLOUD dataset. Cloudy input images are shown in the first column, and the reference (ground truth) ...
Figure 7: Illustration of cloud removal results in the RGB channels for the SEN12MS-CR dataset. Cloudy input images are shown in the first column, and the ...
Figure 8: Ablation visualization results from the RICE1 and RICE2 datasets. (a) removing the TAN; (b) removing the MS-Tokens; (c) removing the FSGM; ...
Figure 9: These values reveal distinct patterns, with higher gating ...
Figure 9: The visualization of the output values from the learned FSGM, based on results from the RICE2 dataset, shows varying characteristics across different ...
Figure 10: Selected gating channel outputs from FSGM at stage 2. The first ...
Figure 11: Visualization of selected channel outputs before and after FSGM and FSGM gating values. Darker regions indicate suppressed areas; brighter regions ...
Figure 12: FSGM gate value distributions (channel-wise mean) across five network stages for large-scale thick cloud (top), small-scale thick cloud (middle), ...

Original abstract

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ATT-CR, an Adaptive Triangular Transformer for Cloud Removal in remote sensing images. It introduces Triangular Attention (TAN) that approximates standard softmax self-attention via lower and upper triangular matrices to achieve O(N) complexity, paired with a Feature Selected Gating Module (FSGM) that adaptively distinguishes cloudy from clean features to reduce interference in subsequent layers. The central claim is that these components together yield superior performance over existing Transformer-based methods on cloud removal benchmarks while addressing scalability and disturbance issues.

Significance. If the triangular approximation is shown to retain sufficient long-range dependencies for accurate ground-object reconstruction and the performance gains are quantitatively verified, the work would offer a practical efficiency improvement for attention-based restoration models in remote sensing, where processing large images under cloud cover is common.

major comments (2)
  1. [TAN description] The claim that lower/upper triangular matrices approximate softmax self-attention while preserving the long-range dependencies required to reconstruct obscured ground objects from distant clean pixels lacks any derivation, error-bound analysis, or attention-map evidence. Triangular masking restricts interactions to directional/partial token sets rather than dense pairwise relations; without showing that the approximation error remains small in the cloud-removal regime, the O(N) efficiency cannot be assumed to support the reconstruction performance.
  2. [Experimental results] The abstract asserts superior benchmark performance, yet no metrics (PSNR, SSIM, etc.), baselines, error bars, dataset details, or statistical tests are referenced. This leaves the central superiority claim without visible quantitative support; the results section must supply these to make the claim load-bearing.
minor comments (1)
  1. [Abstract] The abstract lists two issues with prior methods but does not explicitly state the datasets comprising the 'cloud removal benchmarks,' which would aid immediate context.
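
For context on the metrics the referee asks for, PSNR has a standard definition, 10 · log10(MAX² / MSE), where MSE is the mean squared error between the reference and restored images. The sketch below assumes images are nested lists of floats scaled to [0, 1]. The function name and the `max_value` default are illustrative choices, not taken from the paper.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    squared = [(r - x) ** 2
               for ref_row, out_row in zip(reference, restored)
               for r, x in zip(ref_row, out_row)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better. A uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and therefore about 20 dB.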

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

  1. Referee: The claim that lower/upper triangular matrices approximate softmax self-attention while preserving the long-range dependencies required to reconstruct obscured ground objects from distant clean pixels lacks any derivation, error-bound analysis, or attention-map evidence. Triangular masking restricts interactions to directional/partial token sets rather than dense pairwise relations; without showing that the approximation error remains small in the cloud-removal regime, the O(N) efficiency cannot be assumed to support the reconstruction performance.

    Authors: We agree that the current description of TAN would be strengthened by explicit supporting analysis. In the revised manuscript we will add a mathematical derivation of the lower/upper triangular approximation to softmax attention, an error-bound analysis tailored to the cloud-removal setting, and attention-map visualizations that compare TAN with standard self-attention to show preservation of the long-range dependencies needed for ground-object reconstruction.

    revision: yes

  2. Referee: The abstract asserts superior benchmark performance, yet no metrics (PSNR, SSIM, etc.), baselines, error bars, dataset details, or statistical tests are referenced. This leaves the central superiority claim without visible quantitative support; the results section must supply these to make the claim load-bearing.

    Authors: The results section already contains the requested quantitative details (PSNR, SSIM, baselines, datasets). To make the abstract claim self-contained and to ensure all supporting evidence is immediately visible, we will revise the abstract to reference the key metrics and will confirm that error bars and statistical information are explicitly reported in the results tables and text.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes ATT-CR via explicit architectural choices: TAN approximates softmax attention using lower/upper triangular matrices (stated as an O(N) design decision) and FSGM adaptively gates features. No equations or claims reduce by construction to fitted inputs, self-definitions, or prior self-citations; performance is reported as empirical benchmark results rather than derived predictions. The derivation chain consists of independent design steps without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented physical entities; the model components are described at the level of architectural design choices only.

pith-pipeline@v0.9.1-grok · 5744 in / 960 out tokens · 54247 ms · 2026-06-28T02:31:34.679684+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 5 linked inside Pith

  1. [1]

    Second simulation of the satellite signal in the solar spectrum, 6s: an overview,

    E. F. Vermote, D. Tanr ´e, J. Deuz´e, M. Herman, and J. Morcette, “Second simulation of the satellite signal in the solar spectrum, 6s: an overview,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 3, pp. 675–686, 1997

  2. [2]

    Thin cloud removal from single satellite images,

    J. Liu, X. Wang, M. Chen, S. Liu, X. Zhou, Z. Shao, and P. Liu, “Thin cloud removal from single satellite images,”Optics express, vol. 22, no. 1, pp. 618–632, 2014

  3. [3]

    Haze and thin cloud removal via sphere model improved dark channel prior,

    J. Li, Q. Hu, and M. Ai, “Haze and thin cloud removal via sphere model improved dark channel prior,”IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 3, pp. 472–476, 2019

  4. [4]

    Haze and thin cloud removal using elliptical boundary prior for remote sensing image,

    Q. Guo, H. Hu, and B. Li, “Haze and thin cloud removal using elliptical boundary prior for remote sensing image,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 9124–9137, 2019

  5. [5]

    Thin cloud removal with residual symmetrical concatenation network,

    W. Li, Y . Li, D. Chen, and J. C.-W. Chan, “Thin cloud removal with residual symmetrical concatenation network,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 153, pp. 137–150, 2019

  6. [6]

    Thin cloud removal for multispectral remote sensing images using convolutional neural networks combined with an imaging model,

    Y . Zi, F. Xie, N. Zhang, Z. Jiang, W. Zhu, and H. Zhang, “Thin cloud removal for multispectral remote sensing images using convolutional neural networks combined with an imaging model,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 3811–3823, 2021

  7. [7]

    Cloud removal in optical remote sensing imagery using multiscale distortion-aware networks,

    W. Yu, X. Zhang, and M. Pun, “Cloud removal in optical remote sensing imagery using multiscale distortion-aware networks,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022

  8. [8]

    Wavelet integrated convolutional neural network for thin cloud removal in remote sensing images,

    Y . Zi, H. Ding, F. Xie, Z. Jiang, and X. Song, “Wavelet integrated convolutional neural network for thin cloud removal in remote sensing images,”Remote Sensing, vol. 15, no. 3, p. 781, 2023

  9. [9]

    Cloud-guided fusion with sar-to-optical translation for thick cloud removal,

    X. Xiang, Y . Tan, and L. Yan, “Cloud-guided fusion with sar-to-optical translation for thick cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024

  10. [10]

    Robust haze and thin cloud removal via conditional variational autoencoders,

    H. Ding, F. Xie, L. Qiu, X. Zhang, and Z. Shi, “Robust haze and thin cloud removal via conditional variational autoencoders,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

  11. [11]

    Cloud removal for remote sensing imagery via spatial attention generative adversarial network,

    H. Pan, “Cloud removal for remote sensing imagery via spatial attention generative adversarial network,”arXiv preprint arXiv:2009.13015, 2020

  12. [12]

    Attentive contextual attention for cloud removal,

    W. Huang, Y . Deng, Y . Wu, and J. Wang, “Attentive contextual attention for cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024

  13. [13]

    Uncertainty-based thin cloud removal network via conditional variational autoencoders,

    H. Ding, Y . Zi, and F. Xie, “Uncertainty-based thin cloud removal network via conditional variational autoencoders,” inComputer Vision - ACCV 2022 - 16th Asian Conference on Computer Vision, Macao, China, December 4-8, 2022, Proceedings, Part III, ser. Lecture Notes in Computer Science, vol. 13843, 2022, pp. 52–68

  14. [14]

    Trinity-net: Gradient-guided swin transformer-based remote sensing image dehazing and beyond,

    K. Chi, Y . Yuan, and Q. Wang, “Trinity-net: Gradient-guided swin transformer-based remote sensing image dehazing and beyond,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

  15. [15]

    Cascaded memory network for optical remote sensing imagery cloud removal,

    J. Liu, B. Pan, and Z. Shi, “Cascaded memory network for optical remote sensing imagery cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024

  16. [16]

    Cr- former: Single-image cloud removal with focused taylor attention,

    Y . Wu, Y . Deng, S. Zhou, Y . Liu, W. Huang, and J. Wang, “Cr- former: Single-image cloud removal with focused taylor attention,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

  17. [17]

    Glf-cr: Sar-enhanced cloud removal with global–local fusion,

    F. Xu, Y . Shi, P. Ebel, L. Yu, G.-S. Xia, W. Yang, and X. X. Zhu, “Glf-cr: Sar-enhanced cloud removal with global–local fusion,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 192, pp. 268–278, 2022

  18. [18]

    Low-rank bottleneck in multi-head attention models,

    S. Bhojanapalli, C. Yun, A. S. Rawat, S. J. Reddi, and S. Kumar, “Low-rank bottleneck in multi-head attention models,” inProceedings of the 37th International Conference on Machine Learning, Virtual Event, 2020, pp. 864–873

  19. [19]

    Mamba- cr: A state-space model for remote sensing image cloud removal,

    C. Zhang, F. Wang, X. Zhang, M. Wang, X. Wu, and S. Dang, “Mamba- cr: A state-space model for remote sensing image cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–13, 2025

  20. [20]

    Cr-famba: A frequency-domain assisted mamba for thin cloud removal in optical remote sensing imagery,

    J. Liu, B. Pan, and Z. Shi, “Cr-famba: A frequency-domain assisted mamba for thin cloud removal in optical remote sensing imagery,”IEEE Transactions on Multimedia, vol. 27, pp. 5659–5668, 2025

  21. [21]

    Mamba: Linear-time sequence modeling with selective state spaces,

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

  22. [22]

    Mambaout: Do we really need mamba for vision?

    W. Yu and X. Wang, “Mambaout: Do we really need mamba for vision?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  23. [23]

    Efficient attention: Attention with linear complexities,

    Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” inProceedings of IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2021, pp. 3530–3538

  24. [24]

    Efficientvit: Multi-scale linear attention for high-resolution dense prediction,

    H. Cai, J. Li, M. Hu, C. Gan, and S. Han, “Efficientvit: Multi-scale linear attention for high-resolution dense prediction,”arXiv preprint arXiv:2205.14756, 2022

  25. [25]

    Filmy cloud removal on satellite imagery with multispectral conditional generative adversarial nets,

    K. Enomoto, K. Sakurada, W. Wang, H. Fukui, M. Matsuoka, R. Naka- mura, and N. Kawaguchi, “Filmy cloud removal on satellite imagery with multispectral conditional generative adversarial nets,” inProceed- ings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA,, 2017, pp. 1533–1541

  26. [26]

    Cloud-gan: Cloud removal for sentinel- 2 imagery using a cyclic consistent generative adversarial networks,

    P. Singh and N. Komodakis, “Cloud-gan: Cloud removal for sentinel- 2 imagery using a cyclic consistent generative adversarial networks,” inProceedings of IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 2018, pp. 1772–1775

  27. [27]

    Cermf-net: A sar- optical feature fusion for cloud elimination from sentinel-2 imagery using residual multiscale dilated network,

    J. Anandakrishnan, V . M. Sundaram, and P. Paneer, “Cermf-net: A sar- optical feature fusion for cloud elimination from sentinel-2 imagery using residual multiscale dilated network,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 11 741–11 749, 2024

  28. [28]

    Cloud removal based on sar-optical remote sensing data fusion via a two-flow network,

    R. Mao, H. Li, G. Ren, and Z. Yin, “Cloud removal based on sar-optical remote sensing data fusion via a two-flow network,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 7677–7686, 2022

  29. [29]

    Thin cloud removal in optical remote sensing images based on generative adversarial networks and physical model of cloud distortion,

    J. Li, Z. Wu, Z. Hu, J. Zhang, M. Li, L. Mo, and M. Molinier, “Thin cloud removal in optical remote sensing images based on generative adversarial networks and physical model of cloud distortion,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 373– 389, 2020. JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, X 2025 15

  30. [30]

    Blind single-image-based thin cloud removal using a cloud perception integrated fast fourier convolutional network,

    Y . Guo, W. He, Y . Xia, and H. Zhang, “Blind single-image-based thin cloud removal using a cloud perception integrated fast fourier convolutional network,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 206, pp. 63–86, 2023

  31. [31]

    Msar-defognet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution,

    Y . Zhou, W. Jing, J. Wang, G. Chen, R. Scherer, and R. Damasevicius, “Msar-defognet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution,”IET Image Process., vol. 16, no. 3, pp. 659–668, 2022

  32. [32]

    An effective network integrating residual learning and channel attention mechanism for thin cloud re- moval,

    X. Wen, Z. Pan, Y . Hu, and J. Liu, “An effective network integrating residual learning and channel attention mechanism for thin cloud re- moval,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022

  33. [33]

    A novel dense-attention network for thick cloud removal by reconstructing semantic information,

    Y . Chen, Z. Cai, J. Yuan, and L. Wu, “A novel dense-attention network for thick cloud removal by reconstructing semantic information,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 2339–2351, 2023

  34. [34]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inProceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 5998–6008

  35. [35]

    An image is worth 16x16 words: Transformers for image recognition at scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inProceedings of 9th International Conference on Learning Representations. ICLR, Virtual Event, Austria, 2024

  36. [36]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,

    W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” inProceedings of IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 2021, pp. 548–558

  37. [37]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 2021, pp. 9992–10 002

  38. [38]

    Cswin transformer: A general vision transformer backbone with cross-shaped windows,

    X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” inProceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 2022, pp. 12 114–12 124

  39. [39]

    Restormer: Efficient transformer for high-resolution image restoration,

    S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” inProceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 2022, pp. 5718– 5729

  40. [40]

    Event- equalized dense video captioning,

    K. Wu, P. Li, J. Fu, Y . Li, Y . Wu, Y . Liu, J. Wang, and S. Zhou, “Event- equalized dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 8417–8427

  41. [41]

    Cloud-egan: Rethinking cyclegan from a feature enhancement perspective for cloud removal by combining cnn and transformer,

    X. Ma, Y . Huang, X. Zhang, M.-O. Pun, and B. Huang, “Cloud-egan: Rethinking cyclegan from a feature enhancement perspective for cloud removal by combining cnn and transformer,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 4999–5012, 2023

  42. [42]

    Cloudformer: A cloud-removal network combining self-attention mechanism and convolution,

    P. Wu, Z. Pan, H. Tang, and Y . Hu, “Cloudformer: A cloud-removal network combining self-attention mechanism and convolution,”Remote. Sens., vol. 14, no. 23, p. 6132, 2022

  43. [43]

    Density guided and frequency modulation dehazing network for remote sensing images,

    H. Liu, J. Huang, J. Nie, J. Xie, L. Chen, and X. Zhou, “Density guided and frequency modulation dehazing network for remote sensing images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–13, 2025

  44. [44]

    Tsmcf: Transformer-based sar and multispectral cross-attention fusion for cloud removal,

    H. Zhu, Z. Wang, L. Han, M. Xu, W. Li, Q. Liu, S. Liu, and B. Du, “Tsmcf: Transformer-based sar and multispectral cross-attention fusion for cloud removal,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 6710–6720, 2025

  45. [45]

    Transformers are rnns: Fast autoregressive transformers with linear attention,

    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 2020, pp. 5156–5165

  46. [46]

    SOFT: softmax-free transformer with linear complexity,

    J. Lu, J. Yao, J. Zhang, X. Zhu, H. Xu, W. Gao, C. Xu, T. Xiang, and L. Zhang, “SOFT: softmax-free transformer with linear complexity,” inAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, virtual, 2021, pp. 21 297–21 309

  47. [47]

    Squeeze-and-excitation networks,

    J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141

  48. [48]

    Free- form image inpainting with gated convolution,

    J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free- form image inpainting with gated convolution,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4471– 4480

  49. [49]

    Language modeling with gated convolutional networks,

    Y . N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” inInternational conference on machine learning, 2017, pp. 933–941

  50. [50]

    Sparse self- attention transformer for image inpainting,

    W. Huang, Y . Deng, S. Hui, Y . Wu, S. Zhou, and J. Wang, “Sparse self- attention transformer for image inpainting,”Pattern Recognition, vol. 145, p. 109897, 2024

  51. [51]

    Gated convolutional networks for cloud removal from bi-temporal remote sensing images,

    P. Dai, S. Ji, and Y . Zhang, “Gated convolutional networks for cloud removal from bi-temporal remote sensing images,”Remote Sensing, vol. 12, no. 20, p. 3427, 2020

  52. [52]

    Cloud removal with sar-optical data fusion using a unified spatial–spectral residual network,

    Y . Wang, B. Zhang, W. Zhang, D. Hong, B. Zhao, and Z. Li, “Cloud removal with sar-optical data fusion using a unified spatial–spectral residual network,”IEEE Transactions on Geoscience and Remote Sens- ing, vol. 62, pp. 1–20, 2024

  53. [53]

    cosformer: Rethinking softmax in attention,

    Z. Qin, W. Sun, H. Deng, D. Li, Y . Wei, B. Lv, J. Yan, L. Kong, and Y . Zhong, “cosformer: Rethinking softmax in attention,” inProceedings of 10th International Conference on Learning Representations, ICLR, Virtual Event, April 25-29, 2022

  54. [54]

    Flatten transformer: Vision transformer using focused linear attention,

    D. Han, X. Pan, S. Song, and G. Huang, “Flatten transformer: Vision transformer using focused linear attention,” inProceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, 2023, pp. 5938–5948

  55. [55]

    Mb-taylorformer v2: Improved multi-branch linear transformer expanded by taylor formula for image restoration,

    Z. Jin, Y . Qiu, K. Zhang, H. Li, and W. Luo, “Mb-taylorformer v2: Improved multi-branch linear transformer expanded by taylor formula for image restoration,”TPAMI, 2025

  56. [56]

    Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions,

    G. R. dense transformer with grid structure for image restoration in adverse weather conditions, “Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions,” International Journal of Computer Vision, pp. 1–23, 2024

  57. [57]

    Deep dense multi-scale network for snow removal using semantic and depth priors,

    K. Zhang, R. Li, Y . Yu, W. Luo, and C. Li, “Deep dense multi-scale network for snow removal using semantic and depth priors,”IEEE Transactions on Image Processing, vol. 30, pp. 7419–7431, 2021

  58. [58]

    Wavelet approximation-aware residual network for single image deraining,

    W.-Y . Hsu and W.-C. Chang, “Wavelet approximation-aware residual network for single image deraining,”IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 12, pp. 15 979–15 995, 2023

  59. [59]

    Mobilenets: Efficient convolutional neural networks for mobile vision applications,

    A. G. Howard, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,”arXiv preprint arXiv:1704.04861, 2017

  60. [60]

    A remote sensing image dataset for cloud removal,

    D. Lin, G. Xu, X. Wang, Y . Wang, X. Sun, and K. Fu, “A remote sensing image dataset for cloud removal,”CoRR, vol. abs/1901.00600, 2019

  61. [61]

    Multisensor data fusion for cloud removal in global and all-season sentinel-2 imagery,

    P. Ebel, A. Meraner, M. Schmitt, and X. X. Zhu, “Multisensor data fusion for cloud removal in global and all-season sentinel-2 imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 7, pp. 5866–5878, 2020

  62. [62]

    Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,

    P. Ebel, V . S. F. Garnot, M. Schmitt, J. D. Wegner, and X. X. Zhu, “Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2086–2096

  63. [63]

    Image-to-image translation with conditional adversarial networks,

    P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 5967–5976

  64. [64]

    CTGAN: Cloud transformer generative adversarial network,

    G. Huang and P. Wu, “CTGAN: Cloud transformer generative adversarial network,” in Proceedings of IEEE International Conference on Image Processing, Bordeaux, France, 2022, pp. 511–515

  65. [65]

    Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature,

    T. Chai and R. R. Draxler, “Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, 2014

  66. [66]

    The spectral image processing system (sips)—interactive visualization and analysis of imaging spectrometer data,

    F. A. Kruse, A. Lefkoff, J. Boardman, K. Heidebrecht, A. Shapiro, P. Barloon, and A. Goetz, “The spectral image processing system (sips)—interactive visualization and analysis of imaging spectrometer data,” Remote Sensing of Environment, vol. 44, no. 2-3, pp. 145–163, 1993

  67. [67]

    Peak signal-to-noise ratio revisited: Is simple beautiful?

    J. Korhonen and J. You, “Peak signal-to-noise ratio revisited: Is simple beautiful?” in Proceedings of 4th International Workshop on Quality of Multimedia Experience, 2012, pp. 37–38

  68. [68]

    Image quality assessment: from error visibility to structural similarity,

    Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004

  69. [69]

    Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion,

    A. Meraner, P. Ebel, X. X. Zhu, and M. Schmitt, “Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 333–346, 2020

  70. [70]

    Pvt v2: Improved baselines with pyramid vision transformer,

    W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pvt v2: Improved baselines with pyramid vision transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022

  71. [71]

    Semantic-aware representation learning for homography estimation,

    Y. Liu, Q. Huang, S. Hui, J. Fu, S. Zhou, K. Wu, P. Li, and J. Wang, “Semantic-aware representation learning for homography estimation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 2506–2514

  72. [72]

    Mind the gap: Aligning vision foundation models to image feature matching,

    Y. Liu, J. Fu, Y. Wu, K. Wu, P. Li, J. Wu, S. Zhou, and J. Xin, “Mind the gap: Aligning vision foundation models to image feature matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 20 313–20 323

  73. [73]

    Patchcue: Enhancing vision-language model reasoning with patch-based visual cues,

    Y. Qi, P. Fu, H. Li, Y. Liu, C. Jiang, B. Qin, Z. Luo, and J. Luan, “Patchcue: Enhancing vision-language model reasoning with patch-based visual cues,” arXiv preprint arXiv:2603.05869, 2026

  74. [74]

    Shaping schema via language representation as the next frontier for llm intelligence expanding,

    Z. Yang, Y. Liu, J. Fu, M. Sugiyama, N. Zheng et al., “Shaping schema via language representation as the next frontier for llm intelligence expanding,” arXiv preprint arXiv:2605.09271, 2026

  75. [75]

    Structured progressive knowledge activation for llm-driven neural architecture search,

    Z. Liu, Y. Liu, and J. Fu, “Structured progressive knowledge activation for llm-driven neural architecture search,” arXiv preprint arXiv:2605.04057, 2026