ATT-CR: Adaptive Triangular Transformer for Cloud Removal

Jinjun Wang; Kangyi Wu; Pengna Li; Wenli Huang; Xiaomeng Xin; Yang Wu; Ye Deng

arxiv: 2606.05999 · v1 · pith:3DQACKPBnew · submitted 2026-06-04 · 💻 cs.CV · cs.AI

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

Pith reviewed 2026-06-28 02:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords cloud removal · remote sensing · transformer · triangular attention · image restoration · attention mechanism

The pith

ATT-CR approximates self-attention with triangular matrices to remove clouds more efficiently from remote sensing images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Cloud removal requires reconstructing ground objects hidden in remote sensing images. Standard transformer self-attention is too slow for large images and lets cloudy pixels interfere with the computation. The ATT-CR model introduces triangular attention that approximates the full attention using lower and upper triangular matrices at linear cost. It pairs this with a gating module that selects only clean features for further processing. Experiments show this leads to better results on cloud removal benchmarks.

Core claim

ATT-CR combines two components. Triangular Attention (TAN) approximates softmax self-attention using lower and upper triangular matrices at O(N) complexity. The Feature Selected Gating Module (FSGM) adaptively selects clean features so that cloudy pixels do not interfere with later layers. Together, the paper claims, they yield better reconstruction of ground objects.
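
As a rough illustration of how lower- and upper-triangular attention patterns can be evaluated in O(N), the sketch below uses kernelized (linear) attention with prefix and suffix sums. This is not the paper's TAN, whose exact formulation is not given on this page. The feature map `elu(x) + 1`, the averaging of the two passes, and the names `feature_map` and `triangular_linear_attention` are all assumptions made for illustration.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map, elu(x) + 1, as used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def triangular_linear_attention(q, k, v):
    """Hypothetical sketch: q, k are (N, d), v is (N, dv); returns (N, dv)."""
    fq, fk = feature_map(q), feature_map(k)
    kv = fk[:, :, None] * v[:, None, :]  # per-token outer products, (N, d, dv)

    def one_pass(kv_sum, k_sum):
        # Normalized attention output from running sums, linear in N.
        num = np.einsum("nd,ndv->nv", fq, kv_sum)
        den = np.sum(fq * k_sum, axis=1, keepdims=True)
        return num / den

    # Lower-triangular pattern: token i attends to tokens j <= i (prefix sums).
    lower = one_pass(np.cumsum(kv, axis=0), np.cumsum(fk, axis=0))
    # Upper-triangular pattern: token i attends to tokens j >= i (suffix sums).
    upper = one_pass(np.cumsum(kv[::-1], axis=0)[::-1],
                     np.cumsum(fk[::-1], axis=0)[::-1])
    # Averaging the two passes is an arbitrary choice for this sketch.
    return 0.5 * (lower + upper)
```

In this sketch every token pair (i, j) is covered by one of the two passes, so the pair of triangular patterns can in principle still carry long-range context. Whether the paper's TAN behaves the same way is something only its own derivation and experiments can establish.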

What carries the argument

Triangular Attention (TAN) combined with Feature Selected Gating Module (FSGM), where TAN approximates full attention with triangular matrices to reduce cost and FSGM filters invalid cloudy information.
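
To make the idea of feature gating concrete, here is a minimal sketch of a generic sigmoid gate of the kind used in gated convolutions. The page does not describe FSGM's internals, so the function `gated_features`, its weight shapes, and the random inputs are hypothetical stand-ins rather than the authors' module.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_features(x, w_feat, w_gate, b_gate):
    """Hypothetical gate: x is (N, C_in); returns gated features and gates."""
    features = x @ w_feat                  # candidate features, (N, C_out)
    gates = sigmoid(x @ w_gate + b_gate)   # per-token, per-channel values in (0, 1)
    return features * gates, gates


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                # 5 tokens, 8 input channels (made up)
w_feat = rng.normal(size=(8, 4))
w_gate = rng.normal(size=(8, 4))
out, gates = gated_features(x, w_feat, w_gate, b_gate=np.zeros(4))
print(gates.round(2))                      # values near 0 mark suppressed features
```

A learned gate of this form can drive features from cloudy tokens toward zero before they reach later layers, which is the behaviour the summary attributes to FSGM.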

Load-bearing premise

Approximating full self-attention with lower and upper triangular matrices still models the necessary long-range dependencies to accurately reconstruct ground objects hidden by clouds.

What would settle it

A test case where the triangular attention produces visibly worse reconstructions than full attention on images with intricate cloud patterns would falsify the effectiveness of the approximation.
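
One inexpensive precursor to such a test is to measure how far an approximation's output drifts from dense softmax attention on the same inputs. The harness below is a sketch under stated assumptions: `uniform_attention` is only a placeholder approximation, not TAN, and `relative_error` is a hypothetical helper name. Any candidate approximation with the same signature could be passed in instead.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Dense O(N^2) softmax attention, used as the reference."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return (weights / weights.sum(axis=1, keepdims=True)) @ v


def uniform_attention(q, k, v):
    """Placeholder approximation: every token averages all values."""
    return np.repeat(v.mean(axis=0, keepdims=True), len(q), axis=0)


def relative_error(approx_fn, q, k, v):
    """Frobenius-norm error of approx_fn relative to softmax attention."""
    reference = softmax_attention(q, k, v)
    return np.linalg.norm(reference - approx_fn(q, k, v)) / np.linalg.norm(reference)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
print(relative_error(uniform_attention, q, k, v))
```

A small error on attention outputs would not by itself show that reconstructions match, so a check like this complements an image-level comparison rather than replacing it.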

Figures

Figures reproduced from arXiv: 2606.05999 by Jinjun Wang, Kangyi Wu, Pengna Li, Wenli Huang, Xiaomeng Xin, Yang Wu, Ye Deng.

**Figure 2.** Architecture of ATT-CR. (a) ATT-CR employs a multi-stage design, with each stage consisting of stacked Transformer blocks. (b) The ATAM integrates …

**Figure 4.** The calculation of the triangular attention output values involves …

**Figure 5.** Illustration of cloud removal results on the RICE dataset. The first two rows correspond to RICE2, while the last two belong to RICE1. Cloudy input …

**Figure 6.** Illustration of cloud removal results on the T-CLOUD dataset. Cloudy input images are shown in the first column, and the reference (ground truth) …

**Figure 7.** Illustration of cloud removal results in the RGB channels for the SEN12MS-CR dataset. Cloudy input images are shown in the first column, and the …

**Figure 8.** Ablation visualization results from the RICE1 and RICE2 datasets. (a) removing the TAN; (b) removing the MS-Tokens; (c) removing the FSGM; …

**Figure 9.** These values reveal distinct patterns, with higher gating …

**Figure 9.** The visualization of the output values from the learned FSGM, based on results from the RICE2 dataset, shows varying characteristics across different …

**Figure 10.** Selected gating channel outputs from FSGM at stage 2. The first …

**Figure 11.** Visualization of selected channel outputs before and after FSGM and FSGM gating values. Darker regions indicate suppressed areas; brighter regions …

**Figure 12.** FSGM gate value distributions (channel-wise mean) across five network stages for large-scale thick cloud (top), small-scale thick cloud (middle), …

Original abstract

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.
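
To put the O(N) claim in perspective, the following back-of-the-envelope sketch compares rough multiply-add counts for dense attention and for a linear-time form. The image size, the channel width of 64, one token per pixel, and the helper name `attention_cost` are illustrative assumptions, and constant factors are ignored.

```python
def attention_cost(height, width, dim):
    """Rough multiply-add counts for one attention layer over H*W tokens."""
    n = height * width
    quadratic = n * n * dim   # forming the N x N score matrix
    linear = n * dim * dim    # N updates of a d x d running summary
    return n, quadratic, linear


n, quadratic, linear = attention_cost(256, 256, 64)
print(n, quadratic // linear)  # 65536 tokens, ratio N / d = 1024
```

Under these assumptions the gap grows in proportion to the number of tokens, which is why the quadratic term dominates on high-resolution satellite tiles.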

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ATT-CR introduces triangular attention and a gating module to handle compute and cloudy-pixel issues in remote-sensing restoration, but the abstract contains no metrics or comparisons to back the performance claims.

The paper's main contributions are the Triangular Attention (TAN) that uses lower and upper triangular matrices to approximate softmax self-attention at O(N) cost, and the Feature Selected Gating Module (FSGM) that tries to separate clean from cloudy features before they propagate. These target two practical problems in the domain: standard transformers scale poorly on high-resolution satellite images, and treating all pixels equally lets cloud artifacts degrade later layers. The motivation is clear and the components are described in enough detail to understand the intended mechanism.

The design is straightforward and could be useful if the approximation preserves enough context for ground-object reconstruction. The paper does a reasonable job stating the limitations of prior transformer work on this task.

The obvious gap is that the abstract reports no results: it asserts superior benchmark performance without a single number, baseline name, dataset detail, or protocol. That makes it impossible to judge from the abstract alone whether the triangular approximation actually works. Triangular masking restricts token interactions to directional or partial patterns, which risks losing the long-range clean-pixel context needed to fill clouds; the stress-test note correctly flags this as a load-bearing assumption with no error analysis or ablation shown. Without those, there is no assurance that the FSGM compensates.

This is for people working on efficient attention for remote-sensing restoration. A reader already following transformer variants in that niche could extract the architectural ideas, but the lack of evidence limits broader value right now.

Send it for peer review if the full paper supplies proper experiments and comparisons; the ideas address real constraints even if they need stronger validation.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes ATT-CR, an Adaptive Triangular Transformer for Cloud Removal in remote sensing images. It introduces Triangular Attention (TAN) that approximates standard softmax self-attention via lower and upper triangular matrices to achieve O(N) complexity, paired with a Feature Selected Gating Module (FSGM) that adaptively distinguishes cloudy from clean features to reduce interference in subsequent layers. The central claim is that these components together yield superior performance over existing Transformer-based methods on cloud removal benchmarks while addressing scalability and disturbance issues.

Significance. If the triangular approximation is shown to retain sufficient long-range dependencies for accurate ground-object reconstruction and the performance gains are quantitatively verified, the work would offer a practical efficiency improvement for attention-based restoration models in remote sensing, where processing large images under cloud cover is common.

major comments (2)

[TAN description] The claim that lower/upper triangular matrices approximate softmax self-attention while preserving the long-range dependencies required to reconstruct obscured ground objects from distant clean pixels lacks any derivation, error-bound analysis, or attention-map evidence. Triangular masking restricts interactions to directional/partial token sets rather than dense pairwise relations; without showing that the approximation error remains small in the cloud-removal regime, the O(N) efficiency cannot be assumed to support the reconstruction performance.
[Experimental results] The abstract asserts superior benchmark performance, yet no metrics (PSNR, SSIM, etc.), baselines, error bars, dataset details, or statistical tests are referenced. This leaves the central superiority claim without visible quantitative support; the results section must supply these to make the claim load-bearing.
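
For reference, PSNR, one of the metrics the second comment asks for, can be computed as in the sketch below. The function name `psnr` and the assumption that images are scaled to [0, 1] are illustrative choices; this page does not state the paper's evaluation protocol.

```python
import numpy as np


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)


clean = np.zeros((4, 4))
restored = np.full((4, 4), 0.1)   # constant error of 0.1, so MSE = 0.01
print(psnr(clean, restored))      # approximately 20 dB
```

Reporting the mean and spread of such a metric over the test set, for both the proposed model and each baseline, is what would give the superiority claim visible quantitative support.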

minor comments (1)

[Abstract] The abstract lists two issues with prior methods but does not explicitly state the datasets comprising the 'cloud removal benchmarks,' which would aid immediate context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: The claim that lower/upper triangular matrices approximate softmax self-attention while preserving the long-range dependencies required to reconstruct obscured ground objects from distant clean pixels lacks any derivation, error-bound analysis, or attention-map evidence. Triangular masking restricts interactions to directional/partial token sets rather than dense pairwise relations; without showing that the approximation error remains small in the cloud-removal regime, the O(N) efficiency cannot be assumed to support the reconstruction performance.

Authors: We agree that the current description of TAN would be strengthened by explicit supporting analysis. In the revised manuscript we will add a mathematical derivation of the lower/upper triangular approximation to softmax attention, an error-bound analysis tailored to the cloud-removal setting, and attention-map visualizations that compare TAN with standard self-attention to show preservation of the long-range dependencies needed for ground-object reconstruction. revision: yes

Referee: The abstract asserts superior benchmark performance, yet no metrics (PSNR, SSIM, etc.), baselines, error bars, dataset details, or statistical tests are referenced. This leaves the central superiority claim without visible quantitative support; the results section must supply these to make the claim load-bearing.

Authors: The results section already contains the requested quantitative details (PSNR, SSIM, baselines, datasets). To make the abstract claim self-contained and to ensure all supporting evidence is immediately visible, we will revise the abstract to reference the key metrics and will confirm that error bars and statistical information are explicitly reported in the results tables and text. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes ATT-CR via explicit architectural choices: TAN approximates softmax attention using lower/upper triangular matrices (stated as an O(N) design decision) and FSGM adaptively gates features. No equations or claims reduce by construction to fitted inputs, self-definitions, or prior self-citations; performance is reported as empirical benchmark results rather than derived predictions. The derivation chain consists of independent design steps without load-bearing self-referential reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no explicit free parameters, axioms, or invented physical entities; the model components are described only as architectural design choices.

pith-pipeline@v0.9.1-grok · 5744 in / 960 out tokens · 54247 ms · 2026-06-28T02:31:34.679684+00:00 · methodology

Reference graph

Works this paper leans on

75 extracted references · 5 linked inside Pith

[1]

Second simulation of the satellite signal in the solar spectrum, 6s: an overview,

E. F. Vermote, D. Tanr ´e, J. Deuz´e, M. Herman, and J. Morcette, “Second simulation of the satellite signal in the solar spectrum, 6s: an overview,” IEEE Transactions on Geoscience and Remote Sensing, vol. 35, no. 3, pp. 675–686, 1997

1997
[2]

Thin cloud removal from single satellite images,

J. Liu, X. Wang, M. Chen, S. Liu, X. Zhou, Z. Shao, and P. Liu, “Thin cloud removal from single satellite images,”Optics express, vol. 22, no. 1, pp. 618–632, 2014

2014
[3]

Haze and thin cloud removal via sphere model improved dark channel prior,

J. Li, Q. Hu, and M. Ai, “Haze and thin cloud removal via sphere model improved dark channel prior,”IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 3, pp. 472–476, 2019

2019
[4]

Haze and thin cloud removal using elliptical boundary prior for remote sensing image,

Q. Guo, H. Hu, and B. Li, “Haze and thin cloud removal using elliptical boundary prior for remote sensing image,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 11, pp. 9124–9137, 2019

2019
[5]

Thin cloud removal with residual symmetrical concatenation network,

W. Li, Y . Li, D. Chen, and J. C.-W. Chan, “Thin cloud removal with residual symmetrical concatenation network,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 153, pp. 137–150, 2019

2019
[6]

Thin cloud removal for multispectral remote sensing images using convolutional neural networks combined with an imaging model,

Y . Zi, F. Xie, N. Zhang, Z. Jiang, W. Zhu, and H. Zhang, “Thin cloud removal for multispectral remote sensing images using convolutional neural networks combined with an imaging model,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 3811–3823, 2021

2021
[7]

Cloud removal in optical remote sensing imagery using multiscale distortion-aware networks,

W. Yu, X. Zhang, and M. Pun, “Cloud removal in optical remote sensing imagery using multiscale distortion-aware networks,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022

2022
[8]

Wavelet integrated convolutional neural network for thin cloud removal in remote sensing images,

Y . Zi, H. Ding, F. Xie, Z. Jiang, and X. Song, “Wavelet integrated convolutional neural network for thin cloud removal in remote sensing images,”Remote Sensing, vol. 15, no. 3, p. 781, 2023

2023
[9]

Cloud-guided fusion with sar-to-optical translation for thick cloud removal,

X. Xiang, Y . Tan, and L. Yan, “Cloud-guided fusion with sar-to-optical translation for thick cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024

2024
[10]

Robust haze and thin cloud removal via conditional variational autoencoders,

H. Ding, F. Xie, L. Qiu, X. Zhang, and Z. Shi, “Robust haze and thin cloud removal via conditional variational autoencoders,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

2024
[11]

Cloud removal for remote sensing imagery via spatial attention generative adversarial network,

H. Pan, “Cloud removal for remote sensing imagery via spatial attention generative adversarial network,”arXiv preprint arXiv:2009.13015, 2020

arXiv 2009
[12]

Attentive contextual attention for cloud removal,

W. Huang, Y . Deng, Y . Wu, and J. Wang, “Attentive contextual attention for cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024

2024
[13]

Uncertainty-based thin cloud removal network via conditional variational autoencoders,

H. Ding, Y . Zi, and F. Xie, “Uncertainty-based thin cloud removal network via conditional variational autoencoders,” inComputer Vision - ACCV 2022 - 16th Asian Conference on Computer Vision, Macao, China, December 4-8, 2022, Proceedings, Part III, ser. Lecture Notes in Computer Science, vol. 13843, 2022, pp. 52–68

2022
[14]

Trinity-net: Gradient-guided swin transformer-based remote sensing image dehazing and beyond,

K. Chi, Y . Yuan, and Q. Wang, “Trinity-net: Gradient-guided swin transformer-based remote sensing image dehazing and beyond,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

2023
[15]

Cascaded memory network for optical remote sensing imagery cloud removal,

J. Liu, B. Pan, and Z. Shi, “Cascaded memory network for optical remote sensing imagery cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–11, 2024

2024
[16]

Cr- former: Single-image cloud removal with focused taylor attention,

Y . Wu, Y . Deng, S. Zhou, Y . Liu, W. Huang, and J. Wang, “Cr- former: Single-image cloud removal with focused taylor attention,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

2024
[17]

Glf-cr: Sar-enhanced cloud removal with global–local fusion,

F. Xu, Y . Shi, P. Ebel, L. Yu, G.-S. Xia, W. Yang, and X. X. Zhu, “Glf-cr: Sar-enhanced cloud removal with global–local fusion,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 192, pp. 268–278, 2022

2022
[18]

Low-rank bottleneck in multi-head attention models,

S. Bhojanapalli, C. Yun, A. S. Rawat, S. J. Reddi, and S. Kumar, “Low-rank bottleneck in multi-head attention models,” inProceedings of the 37th International Conference on Machine Learning, Virtual Event, 2020, pp. 864–873

2020
[19]

Mamba- cr: A state-space model for remote sensing image cloud removal,

C. Zhang, F. Wang, X. Zhang, M. Wang, X. Wu, and S. Dang, “Mamba- cr: A state-space model for remote sensing image cloud removal,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–13, 2025

2025
[20]

Cr-famba: A frequency-domain assisted mamba for thin cloud removal in optical remote sensing imagery,

J. Liu, B. Pan, and Z. Shi, “Cr-famba: A frequency-domain assisted mamba for thin cloud removal in optical remote sensing imagery,”IEEE Transactions on Multimedia, vol. 27, pp. 5659–5668, 2025

2025
[21]

Mamba: Linear-time sequence modeling with selective state spaces,

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

Pith/arXiv arXiv 2023
[22]

Mambaout: Do we really need mamba for vision?

W. Yu and X. Wang, “Mambaout: Do we really need mamba for vision?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[23]

Efficient attention: Attention with linear complexities,

Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” inProceedings of IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2021, pp. 3530–3538

2021
[24]

Efficientvit: Multi-scale linear attention for high-resolution dense prediction,

H. Cai, J. Li, M. Hu, C. Gan, and S. Han, “Efficientvit: Multi-scale linear attention for high-resolution dense prediction,”arXiv preprint arXiv:2205.14756, 2022

arXiv 2022
[25]

Filmy cloud removal on satellite imagery with multispectral conditional generative adversarial nets,

K. Enomoto, K. Sakurada, W. Wang, H. Fukui, M. Matsuoka, R. Naka- mura, and N. Kawaguchi, “Filmy cloud removal on satellite imagery with multispectral conditional generative adversarial nets,” inProceed- ings of IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA,, 2017, pp. 1533–1541

2017
[26]

Cloud-gan: Cloud removal for sentinel- 2 imagery using a cyclic consistent generative adversarial networks,

P. Singh and N. Komodakis, “Cloud-gan: Cloud removal for sentinel- 2 imagery using a cyclic consistent generative adversarial networks,” inProceedings of IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 2018, pp. 1772–1775

2018
[27]

Cermf-net: A sar- optical feature fusion for cloud elimination from sentinel-2 imagery using residual multiscale dilated network,

J. Anandakrishnan, V . M. Sundaram, and P. Paneer, “Cermf-net: A sar- optical feature fusion for cloud elimination from sentinel-2 imagery using residual multiscale dilated network,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 11 741–11 749, 2024

2024
[28]

Cloud removal based on sar-optical remote sensing data fusion via a two-flow network,

R. Mao, H. Li, G. Ren, and Z. Yin, “Cloud removal based on sar-optical remote sensing data fusion via a two-flow network,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 7677–7686, 2022

2022
[29]

Thin cloud removal in optical remote sensing images based on generative adversarial networks and physical model of cloud distortion,

J. Li, Z. Wu, Z. Hu, J. Zhang, M. Li, L. Mo, and M. Molinier, “Thin cloud removal in optical remote sensing images based on generative adversarial networks and physical model of cloud distortion,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 373– 389, 2020. JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, X 2025 15

2020
[30]

Blind single-image-based thin cloud removal using a cloud perception integrated fast fourier convolutional network,

Y . Guo, W. He, Y . Xia, and H. Zhang, “Blind single-image-based thin cloud removal using a cloud perception integrated fast fourier convolutional network,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 206, pp. 63–86, 2023

2023
[31]

Msar-defognet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution,

Y . Zhou, W. Jing, J. Wang, G. Chen, R. Scherer, and R. Damasevicius, “Msar-defognet: Lightweight cloud removal network for high resolution remote sensing images based on multi scale convolution,”IET Image Process., vol. 16, no. 3, pp. 659–668, 2022

2022
[32]

An effective network integrating residual learning and channel attention mechanism for thin cloud re- moval,

X. Wen, Z. Pan, Y . Hu, and J. Liu, “An effective network integrating residual learning and channel attention mechanism for thin cloud re- moval,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022

2022
[33]

A novel dense-attention network for thick cloud removal by reconstructing semantic information,

Y . Chen, Z. Cai, J. Yuan, and L. Wu, “A novel dense-attention network for thick cloud removal by reconstructing semantic information,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 2339–2351, 2023

2023
[34]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” inProceedings of Advances in Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 5998–6008

2017
[35]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inProceedings of 9th International Conference on Learning Representations. ICLR, Virtual Event, Austria, 2024

2024
[36]

Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,

W. Wang, E. Xie, X. Li, D. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” inProceedings of IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 2021, pp. 548–558

2021
[37]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of IEEE/CVF International Conference on Computer Vision, ICCV, Montreal, QC, Canada, 2021, pp. 9992–10 002

2021
[38]

Cswin transformer: A general vision transformer backbone with cross-shaped windows,

X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” inProceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 2022, pp. 12 114–12 124

2022
[39]

Restormer: Efficient transformer for high-resolution image restoration,

S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” inProceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 2022, pp. 5718– 5729

2022
[40]

Event- equalized dense video captioning,

K. Wu, P. Li, J. Fu, Y . Li, Y . Wu, Y . Liu, J. Wang, and S. Zhou, “Event- equalized dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 8417–8427

2025
[41]

Cloud-egan: Rethinking cyclegan from a feature enhancement perspective for cloud removal by combining cnn and transformer,

X. Ma, Y . Huang, X. Zhang, M.-O. Pun, and B. Huang, “Cloud-egan: Rethinking cyclegan from a feature enhancement perspective for cloud removal by combining cnn and transformer,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 4999–5012, 2023

2023
[42]

Cloudformer: A cloud-removal network combining self-attention mechanism and convolution,

P. Wu, Z. Pan, H. Tang, and Y . Hu, “Cloudformer: A cloud-removal network combining self-attention mechanism and convolution,”Remote. Sens., vol. 14, no. 23, p. 6132, 2022

2022
[43]

Density guided and frequency modulation dehazing network for remote sensing images,

H. Liu, J. Huang, J. Nie, J. Xie, L. Chen, and X. Zhou, “Density guided and frequency modulation dehazing network for remote sensing images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–13, 2025

2025
[44]

Tsmcf: Transformer-based sar and multispectral cross-attention fusion for cloud removal,

H. Zhu, Z. Wang, L. Han, M. Xu, W. Li, Q. Liu, S. Liu, and B. Du, “Tsmcf: Transformer-based sar and multispectral cross-attention fusion for cloud removal,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 6710–6720, 2025

2025
[45]

Transformers are rnns: Fast autoregressive transformers with linear attention,

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 2020, pp. 5156–5165

2020
[46]

SOFT: softmax-free transformer with linear complexity,

J. Lu, J. Yao, J. Zhang, X. Zhu, H. Xu, W. Gao, C. Xu, T. Xiang, and L. Zhang, “SOFT: softmax-free transformer with linear complexity,” inAdvances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems, virtual, 2021, pp. 21 297–21 309

2021
[47]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141

2018
[48]

Free- form image inpainting with gated convolution,

J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, “Free- form image inpainting with gated convolution,” inProceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4471– 4480

2019
[49]

Language modeling with gated convolutional networks,

Y . N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” inInternational conference on machine learning, 2017, pp. 933–941

2017
[50]

Sparse self- attention transformer for image inpainting,

W. Huang, Y . Deng, S. Hui, Y . Wu, S. Zhou, and J. Wang, “Sparse self- attention transformer for image inpainting,”Pattern Recognition, vol. 145, p. 109897, 2024

2024
[51]

Gated convolutional networks for cloud removal from bi-temporal remote sensing images,

P. Dai, S. Ji, and Y . Zhang, “Gated convolutional networks for cloud removal from bi-temporal remote sensing images,”Remote Sensing, vol. 12, no. 20, p. 3427, 2020

2020
[52]

Cloud removal with sar-optical data fusion using a unified spatial–spectral residual network,

Y . Wang, B. Zhang, W. Zhang, D. Hong, B. Zhao, and Z. Li, “Cloud removal with sar-optical data fusion using a unified spatial–spectral residual network,”IEEE Transactions on Geoscience and Remote Sens- ing, vol. 62, pp. 1–20, 2024

2024
[53]

cosformer: Rethinking softmax in attention,

Z. Qin, W. Sun, H. Deng, D. Li, Y . Wei, B. Lv, J. Yan, L. Kong, and Y . Zhong, “cosformer: Rethinking softmax in attention,” inProceedings of 10th International Conference on Learning Representations, ICLR, Virtual Event, April 25-29, 2022

2022
[54]

Flatten transformer: Vision transformer using focused linear attention,

D. Han, X. Pan, S. Song, and G. Huang, “Flatten transformer: Vision transformer using focused linear attention,” inProceedings of IEEE/CVF International Conference on Computer Vision, Paris, France, 2023, pp. 5938–5948

2023
[55]

Mb-taylorformer v2: Improved multi-branch linear transformer expanded by taylor formula for image restoration,

Z. Jin, Y . Qiu, K. Zhang, H. Li, and W. Luo, “Mb-taylorformer v2: Improved multi-branch linear transformer expanded by taylor formula for image restoration,”TPAMI, 2025

2025
[56]

Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions,

G. R. dense transformer with grid structure for image restoration in adverse weather conditions, “Gridformer: Residual dense transformer with grid structure for image restoration in adverse weather conditions,” International Journal of Computer Vision, pp. 1–23, 2024

2024
[57] K. Zhang, R. Li, Y. Yu, W. Luo, and C. Li, “Deep dense multi-scale network for snow removal using semantic and depth priors,” IEEE Transactions on Image Processing, vol. 30, pp. 7419–7431, 2021
[58] W.-Y. Hsu and W.-C. Chang, “Wavelet approximation-aware residual network for single image deraining,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 979–15 995, 2023
[59] A. G. Howard, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017
[60] D. Lin, G. Xu, X. Wang, Y. Wang, X. Sun, and K. Fu, “A remote sensing image dataset for cloud removal,” CoRR, vol. abs/1901.00600, 2019
[61] P. Ebel, A. Meraner, M. Schmitt, and X. X. Zhu, “Multisensor data fusion for cloud removal in global and all-season sentinel-2 imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 7, pp. 5866–5878, 2020
[62] P. Ebel, V. S. F. Garnot, M. Schmitt, J. D. Wegner, and X. X. Zhu, “Uncrtaints: Uncertainty quantification for cloud removal in optical satellite time series,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2086–2096
[63] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 5967–5976
[64] G. Huang and P. Wu, “CTGAN: Cloud transformer generative adversarial network,” in Proceedings of IEEE International Conference on Image Processing, Bordeaux, France, 2022, pp. 511–515
[65] T. Chai and R. R. Draxler, “Root mean square error (rmse) or mean absolute error (mae)?–arguments against avoiding rmse in the literature,” Geoscientific Model Development, vol. 7, no. 3, pp. 1247–1250, 2014
[66] F. A. Kruse, A. Lefkoff, J. Boardman, K. Heidebrecht, A. Shapiro, P. Barloon, and A. Goetz, “The spectral image processing system (sips)—interactive visualization and analysis of imaging spectrometer data,” Remote Sensing of Environment, vol. 44, no. 2-3, pp. 145–163, 1993
[67] J. Korhonen and J. You, “Peak signal-to-noise ratio revisited: Is simple beautiful?” in Proceedings of 4th International Workshop on Quality of Multimedia Experience, 2012, pp. 37–38
[68] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004
[69] A. Meraner, P. Ebel, X. X. Zhu, and M. Schmitt, “Cloud removal in sentinel-2 imagery using a deep residual neural network and sar-optical data fusion,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 333–346, 2020
[70] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pvt v2: Improved baselines with pyramid vision transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022
[71] Y. Liu, Q. Huang, S. Hui, J. Fu, S. Zhou, K. Wu, P. Li, and J. Wang, “Semantic-aware representation learning for homography estimation,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 2506–2514
[72] Y. Liu, J. Fu, Y. Wu, K. Wu, P. Li, J. Wu, S. Zhou, and J. Xin, “Mind the gap: Aligning vision foundation models to image feature matching,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 20 313–20 323
[73] Y. Qi, P. Fu, H. Li, Y. Liu, C. Jiang, B. Qin, Z. Luo, and J. Luan, “Patchcue: Enhancing vision-language model reasoning with patch-based visual cues,” arXiv preprint arXiv:2603.05869, 2026
[74] Z. Yang, Y. Liu, J. Fu, M. Sugiyama, N. Zheng et al., “Shaping schema via language representation as the next frontier for llm intelligence expanding,” arXiv preprint arXiv:2605.09271, 2026
[75] Z. Liu, Y. Liu, and J. Fu, “Structured progressive knowledge activation for llm-driven neural architecture search,” arXiv preprint arXiv:2605.04057, 2026
