pith. sign in

arxiv: 2601.11310 · v2 · submitted 2026-01-16 · 💻 cs.CV

Context-Aware Semantic Segmentation via Stage-Wise Attention

Pith reviewed 2026-05-16 13:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentationultra-high-resolution imagestransformer architecturecross-attentionremote sensingaerial imagerycontext transfermasked pretraining
0
0 comments X

The pith

A dual-branch Swin transformer adds low-resolution context to high-resolution features through stage-wise cross-attention for ultra-high-resolution semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CASWiT to overcome quadratic memory growth in transformers when segmenting ultra-high-resolution images for remote sensing. It runs a low-resolution branch to capture broad context and feeds that information into a high-resolution branch at multiple stages using lightweight cross-attention. A masked-image pretraining step further strengthens the cross-scale learning. Experiments on aerial and medical datasets show the design improves overall accuracy and boundary quality compared with strong single-branch baselines. The approach keeps computation manageable while preserving fine detail.

Core claim

CASWiT is a dual-branch Swin-based architecture that injects low-resolution contextual information into fine-grained high-resolution features through lightweight stage-wise cross-attention, combined with a SimMIM-style masked reconstruction pretraining of the high-resolution image.

What carries the argument

Stage-wise cross-attention between a low-resolution context branch and a high-resolution detail branch inside a dual Swin transformer.

If this is right

  • The architecture reaches 66.37 percent mIoU on the FLAIR-HUB aerial dataset under an RGB-only ultra-high-resolution protocol while improving boundary quality over strong baselines.
  • The same model records 49.2 percent mIoU on the URUR benchmark under the official protocol.
  • The pretrained weights transfer directly to medical ultra-high-resolution segmentation tasks without retraining the core attention blocks.
  • Memory usage stays linear in the number of high-resolution tokens because the low-resolution branch supplies context at fixed cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stage-wise injection pattern could be tested on video or 3-D volumetric data where one axis supplies cheap context for another.
  • If the cross-attention layers remain stable across sensors, the method may reduce the need for task-specific architectural search in remote-sensing pipelines.
  • Replacing the Swin backbone with other hierarchical encoders might reveal whether the performance gain is tied to the specific token-merging schedule.

Load-bearing premise

Stage-wise cross-attention between low- and high-resolution branches reliably transfers useful context without adding noise or requiring per-dataset retuning.

What would settle it

Training the same dual-branch model on the FLAIR-HUB dataset but replacing the stage-wise cross-attention blocks with simple concatenation or addition, then measuring whether mIoU falls below the reported single-branch baselines and boundary metrics degrade.

Figures

Figures reproduced from arXiv: 2601.11310 by Adrien Gressin, Antoine Carreaud, Arthur Chansel, Elias Naha, Jan Skaloud, Nina Lahellec.

Figure 1
Figure 1. Figure 1: Proposed architecture (CASWiT). A dual-branch (green) encoder for ultra–high-resolution imagery: the high-resolution (HR) branch processes the targeted tile to predict, while the low-resolution (LR) branch (pink) ingests a downsampled global context. At every Swin stage (1→4), HR features supply queries (Q) and LR features provide keys/values (K, V ) to a multi-scale cross-attention module with gating (⊗) … view at source ↗
Figure 2
Figure 2. Figure 2: Self-supervised inference results on the CASWiT architecture. Each image (left to right) shows: original high-resolution image, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: URUR: illustrative annotation mismatch. Example [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison on IGN FLAIR-HUB. From left to right: LR image (note the missing band at the top), HR image crop, [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cross-attention maps after SSL pretraining and supervised fine-tuning. Visualization of HR-LR cross-attention for each stage of [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example where the provided mask (overlaid) locally diverges from the RGB content; such cases are occasional but can affect [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of LR/HR images, ground truth overlays, Swin Base predictions, and CASWiT predictions on eight test patches. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Examples of data pre-processing, on the left are the HR patches and on the right are the merged patches obtained from the [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Self-supervised SimMIM-style inference results on the CASWiT-Base architecture. Each row (left to right) shows: original [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Visualization of cross-attention maps for each stage of the model on four test patches. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
read the original abstract

Semantic ultra-high-resolution (UHR) image segmentation is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models remain challenging in this setting because memory grows quadratically with the number of tokens, limiting either spatial resolution or contextual scope. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch Swin-based architecture that injects low-resolution contextual information into fine-grained high-resolution features through lightweight stage-wise cross-attention. To strengthen cross-scale learning, we also propose a SimMIM-style pretraining strategy based on masked reconstruction of the high-resolution image. Extensive experiments on the large-scale FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT. Under our RGB-only UHR protocol, CASWiT reaches 66.37% mIoU with a SegFormer decoder, improving over strong RGB baselines while also improving boundary quality. On the URUR benchmark, CASWiT reaches 49.2% mIoU under the official evaluation protocol, and it also transfers effectively to medical UHR segmentation benchmarks. Code and pretrained models are available at https://huggingface.co/collections/heig-vd-geo/caswit

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CASWiT, a dual-branch Swin-based architecture for ultra-high-resolution semantic segmentation. It injects low-resolution contextual information into high-resolution features via lightweight stage-wise cross-attention and uses a SimMIM-style masked reconstruction pretraining strategy. Experiments report 66.37% mIoU on FLAIR-HUB under an RGB-only UHR protocol (with SegFormer decoder) and 49.2% mIoU on URUR, with claims of improved boundary quality and effective transfer to medical benchmarks; code and models are released.

Significance. If the stage-wise cross-attention demonstrably adds signal rather than noise or redundancy, the method could meaningfully address quadratic memory scaling in transformers for remote-sensing UHR tasks while preserving fine detail. Code and pretrained-model release is a clear strength for reproducibility. Significance is currently limited by the absence of controls that isolate the attention mechanism from pretraining and dual-branch design.

major comments (2)
  1. [Experiments] Experiments section (results on FLAIR-HUB and URUR): the headline mIoU gains (66.37% and 49.2%) are reported without ablations that hold SimMIM pretraining, decoder choice, and dual-branch structure fixed while toggling only the stage-wise cross-attention blocks. This leaves open the possibility that observed deltas arise from pretraining or the mere presence of a low-resolution branch rather than context injection.
  2. [Method] Method description of cross-attention: no quantitative analysis (e.g., attention-map statistics, noise-injection tests, or failure-case examination) is provided to confirm that low-to-high resolution context transfer reliably adds useful signal without introducing boundary or class-label degradation.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'improving over strong RGB baselines' should name the specific baselines and their scores for immediate context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to strengthen the experimental validation and analysis of the cross-attention mechanism.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (results on FLAIR-HUB and URUR): the headline mIoU gains (66.37% and 49.2%) are reported without ablations that hold SimMIM pretraining, decoder choice, and dual-branch structure fixed while toggling only the stage-wise cross-attention blocks. This leaves open the possibility that observed deltas arise from pretraining or the mere presence of a low-resolution branch rather than context injection.

    Authors: We agree that isolating the contribution of the stage-wise cross-attention is essential. In the revised manuscript we will add a controlled ablation on FLAIR-HUB that keeps SimMIM pretraining, the SegFormer decoder, and the dual-branch structure fixed, comparing performance with and without the cross-attention blocks. This will directly address whether the reported gains arise from context injection rather than the other design elements. revision: yes

  2. Referee: [Method] Method description of cross-attention: no quantitative analysis (e.g., attention-map statistics, noise-injection tests, or failure-case examination) is provided to confirm that low-to-high resolution context transfer reliably adds useful signal without introducing boundary or class-label degradation.

    Authors: We acknowledge the need for quantitative evidence on the cross-attention behavior. In the revision we will add (i) attention-map visualizations from multiple stages with corresponding statistics (e.g., entropy and spatial correlation with ground-truth edges), (ii) a controlled noise-injection experiment on the low-resolution branch, and (iii) a failure-case study highlighting any boundary or label degradation. These additions will demonstrate that the context transfer adds useful signal. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results on public benchmarks

full rationale

The paper introduces CASWiT as a dual-branch Swin architecture using stage-wise cross-attention and SimMIM-style pretraining, then reports mIoU numbers (66.37% on FLAIR-HUB RGB-only UHR, 49.2% on URUR) against public benchmarks. No equations, fitted parameters, or derivations are presented that reduce by construction to the inputs; the central claims rest on external dataset performance rather than self-referential definitions or self-citation chains. The architecture description and pretraining strategy are independent of the reported metrics.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach builds on existing Swin transformer blocks and SimMIM pretraining; the main addition is the stage-wise cross-attention mechanism whose effectiveness is treated as an empirical question.

free parameters (1)
  • cross-attention parameters and training hyperparameters
    Learned during supervised fine-tuning and pretraining on the target datasets; standard for deep learning models.
axioms (1)
  • domain assumption Swin transformer blocks can be extended with lightweight cross-attention to fuse multi-resolution features without destabilizing training.
    Invoked implicitly as the foundation for the dual-branch design.

pith-pipeline@v0.9.0 · 5524 in / 1265 out tokens · 44745 ms · 2026-05-16T13:29:05.357449+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

  1. [1]

    Baltru ˇsaitis, C

    T. Baltru ˇsaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy.IEEE Transac- tions on Pattern Analysis and Machine Intelligence, 41(2): 423–443, 2019. 2

  2. [2]

    Crossvit: Cross-attention multi-scale vision trans- former for image classification

    Chun-Fu (Richard) Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision trans- former for image classification. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 357–366, 2021. 1

  3. [3]

    An efficient and light transformer- based segmentation network for remote sensing images of landscapes.Forests, 14(11), 2023

    Lijia Chen, Honghui Chen, Yanqiu Xie, Tianyou He, Jing Ye, and Yushan Zheng. An efficient and light transformer- based segmentation network for remote sensing images of landscapes.Forests, 14(11), 2023. 2

  4. [4]

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vi- sion (ECCV), 2018. 7

  5. [5]

    Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution im- ages

    Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution im- ages. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2019. 2, 7

  6. [6]

    Berg, and Alexander Kirillov

    Bowen Cheng, Ross Girshick, Piotr Doll ´ar, Alexander C. Berg, and Alexander Kirillov. Boundary iou: Improving object-centric image segmentation evaluation, 2021. 6

  7. [7]

    Schwing, Alexan- der Kirillov, and Rohit Girdhar

    Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexan- der Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation, 2022. 1, 8

  8. [8]

    Cascadepsp: Toward class-agnostic and very high- resolution segmentation via global and local refinement

    Ho Kei Cheng, Jihoon Chung, Yu-Wing Tai, and Chi-Keung Tang. Cascadepsp: Toward class-agnostic and very high- resolution segmentation via global and local refinement. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition (CVPR), 2020. 2, 7

  9. [9]

    Deepglobe 2018: A challenge to parse the earth through satellite images

    Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. InProceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (CVPR) Workshops, 2018. 5

  10. [10]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations,

  11. [11]

    Rethinking bisenet for real-time semantic segmentation

    Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking bisenet for real-time semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 9716–9725, 2021. 7

  12. [12]

    Flair : a country-scale land cover se- mantic segmentation dataset from multi-source optical im- agery

    Anatol Garioud, Nicolas Gonthier, Loic Landrieu, Apolline De Wit, Marion Valette, Marc Poup ´ee, Sebastien Giordano, and boris Wattrelos. Flair : a country-scale land cover se- mantic segmentation dataset from multi-source optical im- agery. InAdvances in Neural Information Processing Sys- tems, pages 16456–16482. Curran Associates, Inc., 2023. 4

  13. [13]

    Flair-hub: Large-scale multimodal dataset for land cover and crop mapping, 2025

    Anatol Garioud, S ´ebastien Giordano, Nicolas David, and Nicolas Gonthier. Flair-hub: Large-scale multimodal dataset for land cover and crop mapping, 2025. 1, 2, 4, 6, 3

  14. [14]

    Isdnet: Integrating shallow and deep networks for efficient ultra-high resolution segmenta- tion

    Shaohua Guo, Liang Liu, Zhenye Gan, Yabiao Wang, Wuhao Zhang, Chengjie Wang, Guannan Jiang, Wei Zhang, Ran Yi, Lizhuang Ma, and Ke Xu. Isdnet: Integrating shallow and deep networks for efficient ultra-high resolution segmenta- tion. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), pages 4361– 4370, 2022. 2, 3, 7

  15. [15]

    Multiscale progressive segmentation network for high- resolution remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 60:1–12, 2022

    Renlong Hang, Ping Yang, Feng Zhou, and Qingshan Liu. Multiscale progressive segmentation network for high- resolution remote sensing imagery.IEEE Transactions on Geoscience and Remote Sensing, 60:1–12, 2022. 2

  16. [16]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 7

  17. [17]

    Masked autoencoders are scalable vision learners, 2021

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021. 4

  18. [18]

    Stu-net: Scalable and transferable medical image segmentation models empowered by large- scale supervised pre-training, 2023

    Ziyan Huang, Haoyu Wang, Zhongying Deng, Jin Ye, Yanzhou Su, Hui Sun, Junjun He, Yun Gu, Lixu Gu, Shaot- ing Zhang, and Yu Qiao. Stu-net: Scalable and transferable medical image segmentation models empowered by large- scale supervised pre-training, 2023. 2

  19. [19]

    Progressive semantic segmentation

    Chuong Huynh, Anh Tuan Tran, Khoa Luu, and Minh Hoai. Progressive semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 16755–16764, 2021. 2

  20. [20]

    Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation, 2023

    Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation, 2023. 1, 2

  21. [21]

    Ultra-high resolution segmentation with ultra-rich con- text: A novel benchmark

    Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich con- text: A novel benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23621–23630, 2023. 1, 2, 4, 6, 7

  22. [22]

    Ldnet: Semantic segmentation of high-resolution images via learnable patch proposal and dynamic refinement

    Yuyang Ji and Lianlei Shan. Ldnet: Semantic segmentation of high-resolution images via learnable patch proposal and dynamic refinement. In2024 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2024. 2

  23. [23]

    Maestro: Masked au- toencoders for multimodal, multitemporal, and multispectral earth observation data, 2025

    Antoine Labatie, Michael Vaccaro, Nina Lardiere, Ana- tol Garioud, and Nicolas Gonthier. Maestro: Masked au- toencoders for multimodal, multitemporal, and multispectral earth observation data, 2025. 2, 6 9

  24. [24]

    Memory-constrained semantic segmen- tation for ultra-high resolution uav imagery.IEEE Robotics and Automation Letters, 9(2):1708–1715, 2024

    Qi Li, Jiaxin Cai, Jiexin Luo, Yuanlong Yu, Jason Gu, Jia Pan, and Wenxi Liu. Memory-constrained semantic segmen- tation for ultra-high resolution uav imagery.IEEE Robotics and Automation Letters, 9(2):1708–1715, 2024. 2

  25. [25]

    ESNet: Evolution and suc- cession network for high-resolution salient object detection

    Hongyu Liu, Runmin Cong, Hua Li, Qianqian Xu, Qing- ming Huang, and Wei Zhang. ESNet: Evolution and suc- cession network for high-resolution salient object detection. InForty-first International Conference on Machine Learn- ing, 2024. 3

  26. [26]

    Wenshu Liu, Nan Cui, Luo Guo, Shihong Du, and Weiyin Wang. Desformer: A dual-branch encoding strategy for se- mantic segmentation of very-high-resolution remote sensing images based on feature interaction and multiscale context fusion.IEEE Transactions on Geoscience and Remote Sens- ing, 62:1–20, 2024. 2

  27. [27]

    Mestrans: Multi-scale embedding spatial trans- former for medical image segmentation.Computer Methods and Programs in Biomedicine, 233:107493, 2023

    Yatong Liu, Yu Zhu, Ying Xin, Yanan Zhang, Dawei Yang, and Tao Xu. Mestrans: Multi-scale embedding spatial trans- former for medical image segmentation.Computer Methods and Programs in Biomedicine, 233:107493, 2023. 2

  28. [28]

    Swin trans- former: Hierarchical vision transformer using shifted win- dows, 2021

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin trans- former: Hierarchical vision transformer using shifted win- dows, 2021. 1, 2, 3

  29. [29]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin transformer v2: Scaling up capacity and resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12009–12019, 2022. 2

  30. [30]

    Chen Lu, Xian Zhang, Kaile Du, Han Xu, and Guangcan Liu. Ctcfnet: Cnn-transformer complementary and fusion network for high-resolution remote sensing image semantic segmentation.IEEE Transactions on Geoscience and Re- mote Sensing, 62:1–17, 2024. 2

  31. [31]

    Can semantic labeling methods general- ize to any city? the inria aerial image labeling benchmark

    Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez. Can semantic labeling methods general- ize to any city? the inria aerial image labeling benchmark. In2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 3226–3229, 2017. 5

  32. [32]

    Torr, and Yarin Gal

    Jishnu Mukhoti, Andreas Kirsch, Joost van Amersfoort, Philip H.S. Torr, and Yarin Gal. Deep deterministic un- certainty: A new simple baseline. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24384–24394, 2023. 3

  33. [33]

    Ultra-high resolution image seg- mentation via locality-aware context fusion and alternating local enhancement.International Journal of Computer Vi- sion, 132:5030–5047, 2024

    Qi, Lin Xindai, Yang Weixiang, He Shengfeng, Yu Yuan- long Liu Wenxi, and Li. Ultra-high resolution image seg- mentation via locality-aware context fusion and alternating local enhancement.International Journal of Computer Vi- sion, 132:5030–5047, 2024. 2, 7

  34. [34]

    Boosting the dual-stream architecture in ultra-high resolution segmentation with resolution-biased uncertainty estimation

    Rong Qin, Xingyu Liu, Jinglei Shi, Liang Lin, and Jufeng Yang. Boosting the dual-stream architecture in ultra-high resolution segmentation with resolution-biased uncertainty estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 25960–25970, 2025. 1, 3, 5, 6, 7, 4

  35. [35]

    Semantic segmentation of very-high- resolution remote sensing images via deep multi-feature learning.Remote Sensing, 14, 2022

    Yanzhou Su, Jian Cheng, Haiwei Bai, Haijun Liu, and Changtao He. Semantic segmentation of very-high- resolution remote sensing images via deep multi-feature learning.Remote Sensing, 14, 2022. 2

  36. [36]

    Fast semantic segmentation of ultra-high- resolution remote sensing images via score map and fast transformer-based fusion.Remote Sensing, 16(17), 2024

    Yihao Sun, Mingrui Wang, Xiaoyi Huang, Chengshu Xin, and Yinan Sun. Fast semantic segmentation of ultra-high- resolution remote sensing images via score map and fast transformer-based fusion.Remote Sensing, 16(17), 2024. 2

  37. [37]

    Full contextual attention for multi-resolution transformers in semantic segmentation

    Loic Themyr, Cl ´ement Rambour, Nicolas Thome, Toby Collins, and Alexandre Hostettler. Full contextual attention for multi-resolution transformers in semantic segmentation. InProceedings of the IEEE/CVF Winter Conference on Ap- plications of Computer Vision (WACV), pages 3224–3233,

  38. [38]

    Sub-ensembles for fast uncer- tainty estimation in neural networks

    Matias Valdenegro-Toro. Sub-ensembles for fast uncer- tainty estimation in neural networks. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 4119–4127, 2023. 3

  39. [39]

    Gated convolutional neural network for semantic segmentation in high-resolution images.Remote Sensing, 9(5), 2017

    Hongzhen Wang, Ying Wang, Qian Zhang, Shiming Xiang, and Chunhong Pan. Gated convolutional neural network for semantic segmentation in high-resolution images.Remote Sensing, 9(5), 2017. 3

  40. [40]

    Toward real ul- tra image segmentation: Leveraging surrounding context to cultivate general segmentation model

    Sai Wang, Yutian Lin, Yu Wu, and Bo Du. Toward real ul- tra image segmentation: Leveraging surrounding context to cultivate general segmentation model. InAdvances in Neu- ral Information Processing Systems, pages 129227–129249. Curran Associates, Inc., 2024. 2

  41. [41]

    Pyramid vision transformer: A versatile backbone for dense prediction without convolutions

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 568–578, 2021. 2

  42. [42]

    Pvt v2: Improved baselines with pyramid vision transformer

    Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3):415–424, 2022. 1, 2

  43. [43]

    Cmtfnet: Cnn and multiscale transformer fusion network for remote-sensing image semantic segmentation

    Honglin Wu, Peng Huang, Min Zhang, Wenlong Tang, and Xinyu Yu. Cmtfnet: Cnn and multiscale transformer fusion network for remote-sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing, 61: 1–12, 2023. 2

  44. [44]

    Patch proposal network for fast semantic segmentation of high-resolution images.Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12402– 12409, 2020

    Tong Wu, Zhenzhen Lei, Bingqian Lin, Cuihua Li, Yanyun Qu, and Yuan Xie. Patch proposal network for fast semantic segmentation of high-resolution images.Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):12402– 12409, 2020. 2, 3

  45. [45]

    Unified perceptual parsing for scene understand- ing, 2018

    Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understand- ing, 2018. 3

  46. [46]

    Simmim: A simple framework for masked image modeling

    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple framework for masked image modeling. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9653–9663, 2022. 1, 3, 4

  47. [47]

    Mstrans: Multi-scale transformer for building extraction from hr re- mote sensing images.Electronics, 13, 2024

    Fei Yang, Fenlong Jiang, Jianzhao Li, and Lei Lu. Mstrans: Multi-scale transformer for building extraction from hr re- mote sensing images.Electronics, 13, 2024. 2 10

  48. [48]

    Zhan Zhang, Daoyu Shu, Guihe Gu, Wenkai Hu, Ru Wang, Xiaoling Chen, and Bingnan Yang. Ringformer-seg: A scal- able and context-preserving vision transformer framework for semantic segmentation of ultra-high-resolution remote sensing imagery.Remote Sensing, 17:3064, 2025. 2 11 Context-Aware Semantic Segmentation via Stage-Wise Attention Supplementary Material

  49. [49]

    Example where the provided mask (overlaid) locally diverges from the RGB content; such cases are occasional but can affect evaluation metrics

    URUR: illustrative annotation mismatch Figure 7. Example where the provided mask (overlaid) locally diverges from the RGB content; such cases are occasional but can affect evaluation metrics. Visible classes include:others,building,greenhouse,woodland,farmland,bareland,water,road. 1

  50. [50]

    Comparison of LR/HR images, ground truth overlays, Swin Base predictions, and CASWiT predictions on eight test patches

    Qualitative analysis on FLAIR-HUB 2 Figure 8. Comparison of LR/HR images, ground truth overlays, Swin Base predictions, and CASWiT predictions on eight test patches

  51. [51]

    Per-class IoU (%) on the FLAIRHUB RGB test set for Swin-Base and our CASWiT variants

    Supplementary results (IoUs) Class Swin-Base [13] CASWiT-B (ours) CASWiT-B-SSL (ours) CASWiT-B-SSL-aug (ours) Building 83.77 84.54 84.58 85.47 Greenhouse 77.89 78.87 78.61 79.46 Swimming pool 61.59 61.45 60.93 62.12 Impervious surface 75.03 76.24 76.48 76.78 Pervious surface 56.97 58.33 58.65 58.86 Bare soil 65.21 65.99 66.70 66.95 Water 90.08 89.37 90.26...

  52. [52]

    Examples of data pre-processing, on the left are the HR patches and on the right are the merged patches obtained from the available neighbors

    Dataset FLAIR-HUB merge Figure 9. Examples of data pre-processing, on the left are the HR patches and on the right are the merged patches obtained from the available neighbors. 4

  53. [53]

    Self-supervised SimMIM-style inference results on the CASWiT-Base architecture

    SSL results 5 Figure 10. Self-supervised SimMIM-style inference results on the CASWiT-Base architecture. Each row (left to right) shows: original high-resolution image, high-resolution image with random masking, low-resolution image with central masking, and the reconstruction of the high-resolution image. 6

  54. [54]

    Visualization of cross-attention maps for each stage of the model on four test patches

    Cross-attention visualization Figure 11. Visualization of cross-attention maps for each stage of the model on four test patches. 7