pith. machine review for the scientific record.

arxiv: 2604.02948 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: no theorem link

CrossWeaver: Cross-modal Weaving for Arbitrary-Modality Semantic Segmentation

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 19:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords multimodal semantic segmentation · cross-modal fusion · arbitrary modality · Modality Interaction Block · Seam-Aligned Fusion · feature aggregation · sensor fusion · generalization

The pith

CrossWeaver fuses arbitrary sensor modalities for semantic segmentation using selective interaction blocks and aligned fusion without custom adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CrossWeaver as a framework that processes data from any mix of sensors, such as cameras and depth devices, to label each pixel with semantic classes. Prior methods often lock designs to specific modality pairs or allow only weak exchanges that miss coordinated details. CrossWeaver places a Modality Interaction Block inside the encoder to let features from different modalities influence each other selectively based on reliability, then applies a lightweight Seam-Aligned Fusion module to combine them while keeping each modality's distinct traits. Tests across several benchmarks show the approach reaches leading accuracy, adds almost no extra parameters, and continues to work when new modality combinations appear at test time.

Core claim

CrossWeaver is a multimodal fusion framework for arbitrary-modality semantic segmentation whose core is a Modality Interaction Block (MIB) that enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features, achieving state-of-the-art performance on multiple multimodal semantic segmentation benchmarks with minimal additional parameters and strong generalization to unseen modality combinations.

What carries the argument

The Modality Interaction Block (MIB), which performs selective and reliability-aware cross-modal interaction inside the encoder, together with the Seam-Aligned Fusion (SAF) module that aggregates the resulting features.
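
The excerpted text never specifies the internals of either module, so the following is one plausible reading rather than the authors' implementation: reliability-gated cross-attention between every pair of modality streams, followed by a per-token softmax fusion. All names, shapes, and the gating function below are assumptions.

    # Minimal PyTorch sketch of an MIB/SAF-style pipeline. Illustrative only;
    # the gating function and attention layout are assumed, not the paper's.
    import torch
    import torch.nn as nn

    class ModalityInteractionBlock(nn.Module):
        """Each modality attends to every other one, scaled by a learned
        per-modality reliability gate (assumed: sigmoid over pooled tokens)."""
        def __init__(self, dim: int, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
            self.norm = nn.LayerNorm(dim)

        def forward(self, feats: list[torch.Tensor]) -> list[torch.Tensor]:
            # feats: one (B, N, C) token tensor per modality; any count works.
            gates = [self.gate(f.mean(dim=1)) for f in feats]  # (B, 1) each
            out = []
            for i, f in enumerate(feats):
                update = torch.zeros_like(f)
                for j, g in enumerate(feats):
                    if i == j:
                        continue
                    msg, _ = self.attn(self.norm(f), self.norm(g), self.norm(g))
                    update = update + gates[j].unsqueeze(1) * msg  # reliability-weighted
                out.append(f + update)  # residual keeps each modality's own signal
            return out

    class SeamAlignedFusion(nn.Module):
        """Per-token softmax weighting over modalities; adds few parameters."""
        def __init__(self, dim: int):
            super().__init__()
            self.score = nn.Linear(dim, 1)

        def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
            stacked = torch.stack(feats, dim=1)          # (B, M, N, C)
            w = self.score(stacked).softmax(dim=1)       # normalize over modalities
            return (w * stacked).sum(dim=1)              # (B, N, C) fused tokens

    # The same weights serve any number of modalities:
    mib, saf = ModalityInteractionBlock(64), SeamAlignedFusion(64)
    rgb, depth, lidar = (torch.randn(2, 196, 64) for _ in range(3))
    fused3 = saf(mib([rgb, depth, lidar]))  # three inputs
    fused2 = saf(mib([rgb, depth]))         # two inputs, same modules

Under this reading, the residual connections and the per-token fusion weights are the two places where each modality's distinct traits survive fusion.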

If this is right

  • The same encoder can accept any number of input modalities without redesigning fusion layers for each new pair.
  • Cross-modal information exchange improves segmentation accuracy while adding only a small number of extra parameters.
  • Unique characteristics of each modality remain intact during fusion rather than being overwritten by a single shared pathway.
  • The framework maintains performance when tested on modality combinations absent from the training data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same interaction block could be inserted into other dense prediction tasks such as depth estimation or panoptic segmentation.
  • Deployment in field robotics would become simpler because new sensors could be added without retraining the entire fusion stack.
  • Reliability weighting inside the block might be extended to handle noisy or missing channels in real-time streams; a hypothetical sketch follows below.
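
That sketch is an extension invented here, not anything in the paper: renormalize the reliability scores over whichever modalities are actually present, so a dropped sensor contributes zero weight instead of feeding zero-filled features forward.

    # Hypothetical extension, not from the paper: mask absent modalities out
    # of the reliability softmax so they contribute exactly nothing.
    import torch

    def renormalized_gates(scores: torch.Tensor, present: torch.Tensor) -> torch.Tensor:
        """scores: (B, M) raw reliability logits; present: (B, M) bool mask."""
        masked = scores.masked_fill(~present, float("-inf"))
        return masked.softmax(dim=1)  # absent modalities get zero weight

    scores = torch.tensor([[0.9, 0.4, 0.7]])
    present = torch.tensor([[True, False, True]])  # e.g. depth stream dropped
    print(renormalized_gates(scores, present))     # mass shifts to live sensors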

Load-bearing premise

The Modality Interaction Block can achieve selective and reliability-aware cross-modal interaction while the Seam-Aligned Fusion preserves unique modality characteristics across arbitrary combinations without requiring modality-specific adaptations.

What would settle it

An experiment in which CrossWeaver is trained on one set of modality pairs and then tested on a new unseen combination, measuring whether accuracy drops below competing methods or requires substantially more parameters to recover.
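
A runnable skeleton of that protocol. The training step and the mIoU evaluation are deliberately left as caller-supplied stand-ins, since neither the paper's code nor its data loaders are part of this page.

    # Skeleton of the settling experiment: train on every modality pair except
    # one, then score the held-out pairing. `train_step` and `evaluate` stand
    # in for the model's training loop and mIoU metric.
    from itertools import combinations
    from typing import Callable, Sequence

    def unseen_combination_protocol(
        train_step: Callable[[tuple[str, ...]], None],
        evaluate: Callable[[tuple[str, ...]], float],
        modalities: Sequence[str],
        held_out: tuple[str, ...],
    ) -> float:
        train_pairs = [p for p in combinations(modalities, 2) if p != held_out]
        assert held_out not in train_pairs, "held-out pair leaked into training"
        for pair in train_pairs:
            train_step(pair)       # fit on seen combinations only
        return evaluate(held_out)  # accuracy on the never-seen combination

    # Toy wiring so the protocol runs end to end:
    seen: list[tuple[str, ...]] = []
    miou = unseen_combination_protocol(
        train_step=seen.append,
        evaluate=lambda pair: 0.0 if pair in seen else 50.0,  # placeholder metric
        modalities=["rgb", "depth", "event", "lidar"],
        held_out=("rgb", "lidar"),
    )
    print(seen, miou)

Whether CrossWeaver's accuracy on the held-out row stays competitive, at comparable parameter count, is exactly what the claim turns on.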

Figures

Figures reproduced from arXiv: 2604.02948 by Chuanzhi Xu, Huiqi Liang, Kedi Li, Tao Zhang, Zelin Zhang.

Figure 1. Existing multimodal fusion paradigms versus the proposed CrossWeaver. CrossWeaver enables selective, reliability…
Figure 2. Performance Comparison across Different Methods. (a) Results on the MCubeS…
Figure 3. Overall framework of CrossWeaver, consisting of a shared hierarchical encoder and two plug-and-play modules: (a)…
Figure 4. Qualitative Visualization of MIB and SAF in Cross…
Figure 5. Visualization of CrossWeaver (MiT-B0 Backbone) on the MCubeS [19] Dataset.
Figure 6. Additional visualizations on MCubeS under missing-modality conditions.
Figure 7. Additional visualizations on DeLiVER under missing-modality conditions.
Figure 8. Additional visualizations on DeLiVER dataset under diverse weather conditions and varying modality combinations.
original abstract

Multimodal semantic segmentation has shown great potential in leveraging complementary information across diverse sensing modalities. However, existing approaches often rely on carefully designed fusion strategies that either use modality-specific adaptations or rely on loosely coupled interactions, thereby limiting flexibility and resulting in less effective cross-modal coordination. Moreover, these methods often struggle to balance efficient information exchange with preserving the unique characteristics of each modality across different modality combinations. To address these challenges, we propose CrossWeaver, a simple yet effective multimodal fusion framework for arbitrary-modality semantic segmentation. Its core is a Modality Interaction Block (MIB), which enables selective and reliability-aware cross-modal interaction within the encoder, while a lightweight Seam-Aligned Fusion (SAF) module further aggregates the enhanced features. Extensive experiments on multiple multimodal semantic segmentation benchmarks demonstrate that our framework achieves state-of-the-art performance with minimal additional parameters and strong generalization to unseen modality combinations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes CrossWeaver, a multimodal fusion framework for arbitrary-modality semantic segmentation. Its core components are the Modality Interaction Block (MIB), which performs selective and reliability-aware cross-modal interaction inside the encoder, and the lightweight Seam-Aligned Fusion (SAF) module that aggregates the resulting features. The central claim is that this design achieves state-of-the-art performance on multiple multimodal segmentation benchmarks while adding only minimal parameters and generalizing to unseen modality combinations without modality-specific adaptations.

Significance. If the empirical claims are substantiated with rigorous quantitative evidence, the work would offer a practically useful alternative to existing fusion strategies that require hand-crafted modality-specific modules. The emphasis on parameter efficiency and cross-modal generalization could influence downstream applications that must accommodate variable sensor suites, provided the method is shown to remain stable under genuine distribution shifts from novel modalities.

major comments (2)
  1. [Abstract and §4] The abstract asserts SOTA results, minimal additional parameters, and strong generalization to unseen modality combinations, yet the provided text supplies no quantitative metrics (mIoU, parameter counts, baseline tables, or ablation numbers). Without these data it is impossible to evaluate whether the central claims are supported.
  2. [§3.2 (MIB) and §4.3 (generalization experiments)] The description of selective/reliability-aware interaction and seam-aligned aggregation does not include a formal argument or controlled test showing stability when input feature statistics, channel counts, or noise profiles differ from those seen during training. Experiments appear limited to permutations of modalities already present in the benchmark pools rather than introduction of genuinely novel sensors, which leaves the arbitrary-modality claim unverified.

minor comments (2)
  1. [§3.2] Notation for feature dimensions and reliability weights inside the MIB should be defined explicitly (e.g., with consistent symbols for channel count C_m across modalities) to allow readers to verify dimension-agnostic behavior; an illustrative notation sketch follows this list.
  2. [§4] Figure captions and experimental tables would benefit from explicit listing of the exact modality subsets used in each 'unseen combination' row so that the scope of generalization can be assessed at a glance.
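
On the first minor comment, a hedged example of the kind of definition being requested. The symbols below are invented for illustration; they are not taken from the paper.

    % Illustrative notation only; these symbols are not the paper's.
    Let $F_m \in \mathbb{R}^{N \times C_m}$ denote the token features of
    modality $m \in \{1, \dots, M\}$, with channel count $C_m$. Given a
    reliability weight $r_m = \sigma\bigl(g(F_m)\bigr) \in (0, 1)$ per
    modality, the interaction could be written as
    \[
      \hat{F}_m = F_m + \sum_{n \neq m} r_n \,
                  \mathcal{A}_{m \leftarrow n}(F_m, F_n),
    \]
    where $\mathcal{A}_{m \leftarrow n}$ maps modality $n$'s features into
    modality $m$'s space independently of $C_n$, making dimension-agnostic
    behavior checkable from the definition alone.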

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the opportunity to address the concerns raised and clarify the contributions of CrossWeaver. Below, we provide point-by-point responses to the major comments.

point-by-point responses
  1. Referee: [Abstract and §4] The abstract asserts SOTA results, minimal additional parameters, and strong generalization to unseen modality combinations, yet the provided text supplies no quantitative metrics (mIoU, parameter counts, baseline tables, or ablation numbers). Without these data it is impossible to evaluate whether the central claims are supported.

    Authors: The abstract provides a high-level summary due to space constraints, while the full manuscript in Section 4 supplies the requested quantitative evidence. Table 1 reports mIoU scores across benchmarks with SOTA results, Table 2 details parameter counts showing minimal overhead relative to baselines, and §4.3 includes ablation studies and generalization metrics. We will incorporate key numerical highlights into the abstract in the revised version. revision: partial

  2. Referee: [§3.2 (MIB) and §4.3 (generalization experiments)] The description of selective/reliability-aware interaction and seam-aligned aggregation does not include a formal argument or controlled test showing stability when input feature statistics, channel counts, or noise profiles differ from those seen during training. Experiments appear limited to permutations of modalities already present in the benchmark pools rather than introduction of genuinely novel sensors, which leaves the arbitrary-modality claim unverified.

    Authors: The MIB employs adaptive attention and reliability weighting that operate on arbitrary input feature dimensions without fixed assumptions on statistics or channel counts, as detailed in §3.2. Our §4.3 experiments systematically evaluate all modality combinations within the benchmarks, including training on subsets and testing on unseen pairings, demonstrating generalization without modality-specific modules. We acknowledge that a formal stability proof under arbitrary distribution shifts is not provided, and experiments use existing benchmark modalities rather than new sensor hardware; extending to completely novel sensors would require additional datasets beyond this work's scope. revision: no

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper introduces the Modality Interaction Block and Seam-Aligned Fusion as independent architectural choices to enable selective cross-modal interaction and aggregation for arbitrary modalities. Performance claims rest on empirical results from standard multimodal segmentation benchmarks rather than any equations or parameters that reduce outputs to inputs by construction. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text; the framework is presented as a design contribution evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of two newly introduced modules (MIB and SAF) whose internal mechanisms are not further decomposed in the abstract; no numerical free parameters, mathematical axioms, or external benchmarks are specified.

invented entities (2)
  • Modality Interaction Block (MIB) no independent evidence
    purpose: Enables selective and reliability-aware cross-modal interaction within the encoder
    New architectural component introduced to overcome limitations of prior fusion strategies
  • Seam-Aligned Fusion (SAF) module no independent evidence
    purpose: Lightweight aggregation of enhanced cross-modal features
    New module proposed to balance information exchange while preserving modality characteristics

pith-pipeline@v0.9.0 · 5459 in / 1212 out tokens · 55921 ms · 2026-05-13T19:33:54.341778+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

  1. [1]

    TS Arulananth, PG Kuppusamy, Ramesh Kumar Ayyasamy, Saadat M Alhashmi, M Mahalakshmi, K Vasanth, and P Chinnasamy. 2024. Semantic segmentation of urban environments: Leveraging U-Net deep learning model for cityscape image analysis. PLoS ONE 19, 4 (2024), e0300767.

  2. [2]

    Bing Cao, Junliang Guo, Pengfei Zhu, and Qinghua Hu. 2024. Bi-directional adapter for multimodal tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 927–935.

  3. [3]

    Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 4 (2017), 834–848.

  4. [4]

    Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017).

  5. [5]

    Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 801–818.

  6. [6]

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3213–3223.

  7. [7]

    Shaohua Dong, Yunhe Feng, Qing Yang, Yan Huang, Dongfang Liu, and Heng Fan. 2024. Efficient multimodal semantic segmentation via dual-prompt learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 14196–14203.

  8. [8]

    Shaohua Dong, Wujie Zhou, Caie Xu, and Weiqing Yan. 2023. EGFNet: Edge-aware guidance fusion network for RGB-thermal urban scene parsing. IEEE Transactions on Intelligent Transportation Systems 25, 1 (2023), 657–669.

  9. [9]

    Xiaoyu Dong and Naoto Yokoya. 2024. Understanding dark scenes by contrasting multi-modal observations. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 840–850.

  10. [10]

    Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2015. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111, 1 (2015), 98–136.

  11. [11]

    Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. 2019. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3146–3154.

  12. [12]

    Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. 2016. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision. Springer, 213–228.

  13. [13]

    Qibin He. 2024. Prompting multi-modal image segmentation with semantic grouping. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 2094–2102.

  14. [14]

    Qibin Hou, Li Zhang, Ming-Ming Cheng, and Jiashi Feng. 2020. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4003–4012.

  15. [15]

    Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2019. CCNet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 603–612.

  16. [16]

    Ding Jia, Jianyuan Guo, Kai Han, Han Wu, Chao Zhang, Chang Xu, and Xinghao Chen. 2024. GeminiFusion: Efficient pixel-wise multimodal fusion for vision transformer. arXiv preprint arXiv:2406.01210 (2024).

  17. [17]

    Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, and Xuelong Li. 2024. U3M: Unbiased multiscale modal fusion model for multimodal semantic segmentation. arXiv preprint arXiv:2405.15365 (2024).

  18. [18]

    Bingyu Li, Da Zhang, Zhiyuan Zhao, Junyu Gao, and Xuelong Li. 2025. StitchFusion: Weaving any visual modalities to enhance multimodal semantic segmentation. In Proceedings of the 33rd ACM International Conference on Multimedia. 1308–1317.

  19. [19]

    Yupeng Liang, Ryosuke Wakaki, Shohei Nobuhara, and Ko Nishino. 2022. Multimodal material segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19800–19808.

  20. [20]

    Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. 2022. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5802–5811.

  21. [21]

    Rui Liu, Li Mi, and Zhenzhong Chen. 2020. AFNet: Adaptive fusion network for remote sensing image semantic segmentation. IEEE Transactions on Geoscience and Remote Sensing 59, 9 (2020), 7871–7886.

  22. [22]

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  23. [23]

    Zhuoyan Liu, Bo Wang, Lizhi Wang, Chenyu Mao, and Ye Li. 2025. ShareCMP: Polarization-aware RGB-P semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology (2025).

  24. [24]

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.

  25. [25]

    Ying Lv, Zhi Liu, and Gongyang Li. 2024. Context-aware interaction network for RGB-T semantic segmentation. IEEE Transactions on Multimedia 26 (2024), 6348–6360.

  26. [26]

    Seong-Jin Park, Ki-Sang Hong, and Seungyong Lee. 2017. RDFNet: RGB-D multi-level residual feature fusion for indoor semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision. 4980–4989.

  27. [27]

    Md Kaykobad Reza, Ashley Prater-Bennette, and M Salman Asif. 2024. MMSFormer: Multimodal transformer for material and semantic segmentation. IEEE Open Journal of Signal Processing 5 (2024), 599–610.

  28. [28]

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.

  29. [29]

    Daniel Seichter, Söhnke Benedikt Fischedick, Mona Köhler, and Horst-Michael Groß. 2022. Efficient multi-task RGB-D scene analysis for indoor environments. In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–10.

  30. [30]

    Yuxiang Sun, Weixun Zuo, and Ming Liu. 2019. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robotics and Automation Letters 4, 3 (2019), 2576–2583.

  31. [31]

    Antonin Vobecky, David Hurych, Oriane Simeoni, Spyros Gidaris, Andrei Bursuc, Patrick Perez, and Josef Sivic. 2025. Unsupervised semantic segmentation of urban scenes via cross-modal distillation. International Journal of Computer Vision 133, 6 (2025), 3519–3541.

  32. [32]

    Weiyue Wang and Ulrich Neumann. 2018. Depth-aware CNN for RGB-D segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 135–150.

  33. [33]

    Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, and Yunhe Wang. 2022. Multimodal token fusion for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12186–12195.

  34. [34]

    Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34 (2021), 12077–12090.

  35. [35]

    Xuan Yang, Liangzhe Yuan, Kimberly Wilber, Astuti Sharma, Xiuye Gu, Siyuan Qiao, Stephanie Debats, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, et al. 2024. PolyMaX: General dense prediction with mask transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 1050–1061.

  37. [37]

    Bowen Yin, Xuying Zhang, Zhongyu Li, Li Liu, Ming-Ming Cheng, and Qibin Hou. 2023. DFormer: Rethinking RGBD representation learning for semantic segmentation. arXiv preprint arXiv:2309.09668 (2023).

  38. [38]

    Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 325–341.

  39. [39]

    Nannan Yu, Chaoyi Wang, Yu Qiao, Jiankang Ren, Dongsheng Zhou, Xiaopeng Wei, Qiang Zhang, and Xin Yang. 2025. Event-based image semantic segmentation. Journal of Computer-Aided Design & Computer Graphics 37, 9 (2025), 1560–1572.

  40. [40]

    Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. 2022. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10819–10829.

  41. [41]

    Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. 2018. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018).

  42. [42]

    Chunyi Zhang, Marcos Calegari Andrade, Zachary K Goldsmith, Abhinav S Raman, Yifan Li, Pablo Piaggi, Xifan Wu, Roberto Car, and Annabella Selloni. 2024. Electrical double layer and capacitance of TiO2 electrolyte interfaces from first principles simulations. arXiv preprint arXiv:2404.00167 (2024).

  44. [44]

    Jiaming Zhang, Huayao Liu, Kailun Yang, Xinxin Hu, Ruiping Liu, and Rainer Stiefelhagen. 2023. CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems 24, 12 (2023), 14679–14694.

  45. [45]

    Jiaming Zhang, Ruiping Liu, Hao Shi, Kailun Yang, Simon Reiß, Kunyu Peng, Haodong Fu, Kaiwei Wang, and Rainer Stiefelhagen. 2023. Delivering arbitrary-modal semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1136–1147.

  46. [46]

    Qiang Zhang, Shenlu Zhao, Yongjiang Luo, Dingwen Zhang, Nianchang Huang, and Jungong Han. 2021. ABMDRNet: Adaptive-weighted bi-directional modality difference reduction network for RGB-T semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2633–2642.

  47. [47]

    Zilong Zhang, Chao Xu, Zhengping Li, Yixuan Chen, and Chao Nie. 2025. Multi-scale fusion semantic enhancement network for medical image segmentation. Scientific Reports 15, 1 (2025), 23018.

  48. [48]

    Zelin Zhang, Tao Zhang, and Aibo Xu. 2025. Efficient token fusion for transformer-based semantic segmentation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 446–455.

  49. [50]

    Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2881–2890.

  50. [51]

    Yuzhong Zhao, Weijia Wu, Zhuang Li, Jiahong Li, and Weiqiang Wang. 2023. FlowText: Synthesizing realistic scene text video with optical flow estimation. In 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1517–1522.

  51. [52]

    Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6881–6890.

  52. [53]

    Xu Zheng, Yuanhuiyi Lyu, Jiazhou Zhou, and Lin Wang. 2024. Centering the value of every modality: Towards efficient and resilient modality-agnostic semantic segmentation. In European Conference on Computer Vision. Springer, 192–212.

  53. [54]

    Wujie Zhou, Shaohua Dong, Caie Xu, and Yaguan Qian. 2022. Edge-aware guidance fusion network for RGB-thermal scene parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 3571–3579.

  54. [55]

    Yuting Zhou, Xuemei Yang, Shiqi Liu, and Junping Yin. 2024. Multimodal medical image fusion network based on target information enhancement. IEEE Access 12 (2024), 70851–70869.

  55. [56]

    Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin. 2021. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9939–9948.