pith. machine review for the scientific record.

arxiv: 2604.18829 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

Abrar Majeedi, Hongcheng Wang, Jianglin Lu, Yin Li, Zhiyuan Ruan, Ziyi Zhao

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords RGB-Infrared fusion · multimodal large language models · visual reasoning · cross-attention · image degradation robustness · DV-204K dataset · DV-500 benchmark

The pith

DUALVISION adds a lightweight patch-level cross-attention module to fuse infrared and RGB data, allowing multimodal large language models to reason robustly even under visual degradations like fog or low light.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DUALVISION, a fusion approach for combining RGB and infrared imagery in multimodal large language models. Current MLLMs struggle with degraded images such as those affected by fog, blur, or darkness, but infrared provides complementary information that remains reliable in those conditions. By using a patch-level localized cross-attention mechanism, the method integrates this information efficiently into existing models. The authors also release DV-204K, a large dataset of aligned pairs with QA annotations, and DV-500, a benchmark for cross-modal reasoning. Benchmarks on these show strong performance gains for both open- and closed-source models.

Core claim

DUALVISION is a lightweight fusion module that incorporates infrared and RGB information into MLLMs via patch-level localized cross-attention. Supported by two new resources, the DV-204K dataset with 204K QA annotations and the DV-500 benchmark, it delivers strong empirical performance under a wide range of visual degradations.

What carries the argument

The patch-level localized cross-attention fusion module, which efficiently merges complementary IR-RGB information into pre-existing MLLMs.
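To make that mechanism concrete, here is a minimal sketch of patch-level localized cross-attention in which RGB patch tokens act as queries and spatially aligned IR patch tokens supply keys and values, as described in Figure 2. It is illustrative only: the module name, window size, and gated residual are assumptions, not the authors' released implementation.

```python
# Minimal sketch of patch-level localized cross-attention fusion.
# Illustrative only: window size and gating are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LocalizedCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, window: int = 4):
        super().__init__()
        self.window = window  # side length of the local window, in patches
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0, so the module starts as identity

    def forward(self, rgb_tokens: torch.Tensor, ir_tokens: torch.Tensor,
                grid_hw: tuple[int, int]) -> torch.Tensor:
        """rgb_tokens, ir_tokens: (B, H*W, C) spatially aligned patch tokens."""
        B, N, C = rgb_tokens.shape
        H, W = grid_hw
        w = self.window
        assert H % w == 0 and W % w == 0, "token grid must tile into whole windows"

        def to_windows(x):
            # (B, H*W, C) -> (B * num_windows, w*w, C)
            x = x.view(B, H // w, w, W // w, w, C)
            return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

        q = to_windows(rgb_tokens)       # RGB patches are the queries
        kv = to_windows(ir_tokens)       # IR patches supply keys and values
        fused, _ = self.attn(q, kv, kv)  # attention restricted to each local window

        # undo the windowing and add a gated residual onto the RGB stream
        fused = fused.view(B, H // w, W // w, w, w, C)
        fused = fused.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return rgb_tokens + torch.tanh(self.gate) * fused
```

Because attention is restricted to small aligned windows rather than all token pairs, the added compute and parameters stay small relative to the backbone, which is the sense in which the module is "lightweight".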

If this is right

  • Existing MLLMs can be enhanced for robustness without major retraining.
  • New datasets enable standardized evaluation of IR-RGB multimodal reasoning.
  • Performance holds across open- and closed-source models under degradations.
  • Benchmarks demonstrate practical utility for real-world visual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar fusion modules could be developed for other sensor modalities like depth or thermal beyond IR.
  • The approach might extend to video or other sequential data where degradations vary over time.
  • Testing on edge devices could reveal if the lightweight module maintains efficiency in deployment.

Load-bearing premise

The patch-level localized cross-attention can efficiently incorporate complementary IR-RGB information into existing MLLMs without major retraining or new failure modes.

What would settle it

If DUALVISION-enhanced models show no improvement or worse performance than standard MLLMs on the DV-500 benchmark under degradations, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.18829 by Abrar Majeedi, Hongcheng Wang, Jianglin Lu, Yin Li, Zhiyuan Ruan, Ziyi Zhao.

Figure 1
Figure 1: When visibility degrades in RGB images (…)
Figure 2
Figure 2: Overview of DUALVISION. (a) shows how DUALVISION integrates into an MLLM to fuse RGB and IR image tokens for robust visual reasoning. (b) illustrates the multi-scale localized cross-attention module, where RGB tokens serve as queries and IR tokens as keys and values. (c) visualizes the spatially aligned RGB–IR token grids over which localized cross-attention is performed.
Figure 3
Figure 3: RGB, IR Captions and QA Pairs. We show examples of paired RGB–IR images along with their modality-specific captions and the corresponding question–answer pairs generated from those captions. The RGB caption includes richer scene details such as lighting, clothing, and context, while the IR caption reflects high-level information like presence of people and overall scene type.
Figure 4
Figure 4: RGB degradation conditions in DV-500. Examples of the three corruption types: darkness, blur, and fog, applied at four severity levels each, with IR left unaltered. These controlled degradations enable systematic evaluation of the robustness of VLMs.
Figure 5
Figure 5: Performance by Modalities (IR, RGB, RGB+IR). IR-only performance stays flat across degradations, RGB-only performance drops sharply as degradation severity increases, and RGB+IR provides the most robust results. Full results are in the Supplement.
Figure 6
Figure 6: Comparison of Fusion Methods. We compare several fusion strategies: addition, adaptive addition, concatenation, and our DUALVISION. Note that the concatenation baseline is equivalent to finetuned LLaVA-1.5-7B. Full results can be found in the Supplement.
Figure 7
Figure 7: Results on DV-500. Both methods, finetuned on DV-204K, answer correctly with clean RGB–IR inputs (left). When the RGB is degraded (right), DUALVISION remains robust, while finetuned LLaVA-1.5 shows reduced reliability.
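As a rough illustration of the three corruption families in Figure 4 (darkness, blur, and fog, each applied at four severity levels while IR is left untouched), the sketch below applies simple synthetic degradations to an RGB image. The exact formulas and severity values here are assumptions for illustration, not the DV-500 recipes.

```python
# Rough sketch of the three RGB corruption families used in DV-500
# (darkness, blur, fog); the formulas and severity scaling below are
# assumptions for illustration, not the benchmark's exact recipes.
import numpy as np
from scipy.ndimage import gaussian_filter


def degrade(rgb: np.ndarray, kind: str, severity: int) -> np.ndarray:
    """rgb: float array in [0, 1], shape (H, W, 3); severity: 1 (low) .. 4 (highest)."""
    s = severity
    if kind == "darkness":
        # progressively reduce brightness
        return rgb * (1.0 - 0.2 * s)
    if kind == "blur":
        # Gaussian blur with growing sigma, applied per channel (no mixing across channels)
        return gaussian_filter(rgb, sigma=(s, s, 0))
    if kind == "fog":
        # blend toward a uniform haze layer; a crude stand-in for atmospheric scattering
        haze = np.ones_like(rgb) * 0.8
        t = 1.0 - 0.2 * s  # transmission falls as severity rises
        return rgb * t + haze * (1.0 - t)
    raise ValueError(f"unknown degradation: {kind}")
```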
read the original abstract

Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at https://abrarmajeedi.github.io/dualvision.
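For orientation, Figure 6 compares DUALVISION against simpler fusion strategies applied to aligned RGB and IR token grids. A minimal sketch of what such baselines might look like follows; the tensor shapes, the learned mixing scalar, and the choice of concatenation axis are assumptions, not the paper's exact baseline definitions.

```python
# Minimal sketch of the simpler fusion baselines compared in Figure 6
# (addition, adaptive addition, concatenation). Shapes and the learned
# mixing scalar are assumptions for illustration.
import torch
import torch.nn as nn


class AdditionFusion(nn.Module):
    def forward(self, rgb, ir):  # rgb, ir: (B, N, C) aligned token grids
        return rgb + ir


class AdaptiveAdditionFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing weight

    def forward(self, rgb, ir):
        return self.alpha * rgb + (1.0 - self.alpha) * ir


class ConcatenationFusion(nn.Module):
    """Concatenate along the token axis; doubles the visual token count fed to the LLM."""
    def forward(self, rgb, ir):
        return torch.cat([rgb, ir], dim=1)
```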

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes DUALVISION, a lightweight fusion module that integrates complementary RGB and infrared (IR) information into existing multimodal large language models (MLLMs) via patch-level localized cross-attention. It introduces two new resources: the DV-204K dataset (~25K aligned IR-RGB image pairs with 204K modality-specific QA annotations) for training and the DV-500 benchmark (500 IR-RGB pairs with 500 QA pairs) for evaluating cross-modal reasoning under degradations. The authors benchmark open- and closed-source MLLMs and report that DUALVISION yields strong empirical performance across a range of visual degradations such as fog, blur, and low light, with code and datasets released for reproducibility.

Significance. If the reported empirical results hold, the work is significant because it offers a practical, lightweight method to enhance MLLM robustness by exploiting the inherent strengths of IR imaging without requiring full model retraining. The release of DV-204K, DV-500, and associated code provides reusable artifacts that can accelerate research on multimodal robustness, an area of growing importance for real-world deployment of vision-language systems.

minor comments (3)
  1. [Abstract] The statement that DUALVISION 'delivers strong empirical performance' would be strengthened by including one or two concrete quantitative highlights (e.g., accuracy gains or robustness metrics under specific degradations) rather than leaving the claim entirely qualitative.
  2. [Method] The description of the patch-level localized cross-attention fusion module would benefit from an explicit statement of its parameter count and inference overhead relative to the base MLLM, to substantiate the 'lightweight' and 'without requiring major retraining' claims (a minimal accounting sketch follows this list).
  3. Figure and table captions should be expanded to include the exact degradation types, model variants, and metric definitions used, improving standalone readability.
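On the second minor comment, the sketch below shows the kind of accounting that would substantiate the 'lightweight' claim: report the fusion module's parameter count as a fraction of the frozen backbone. Function names are illustrative, not taken from the paper's released code.

```python
# Sketch of parameter-overhead accounting: size the fusion module relative to
# the (frozen) backbone. Names are illustrative, not from the paper's code.
import torch.nn as nn


def param_count(module: nn.Module, trainable_only: bool = False) -> int:
    """Total number of parameters in a module (optionally only trainable ones)."""
    return sum(p.numel() for p in module.parameters()
               if p.requires_grad or not trainable_only)


def report_overhead(backbone: nn.Module, fusion: nn.Module) -> str:
    """Express the fusion module's size as a fraction of the backbone."""
    base, extra = param_count(backbone), param_count(fusion)
    return f"fusion adds {extra:,} params ({100.0 * extra / base:.2f}% of backbone)"
```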

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our manuscript and the recommendation for minor revision. The recognition of DUALVISION's practical value for enhancing MLLM robustness via lightweight IR-RGB fusion, along with the utility of the released DV-204K and DV-500 resources, is appreciated. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical contribution that introduces a new lightweight fusion module, two new datasets (DV-204K and DV-500), and benchmark results on open- and closed-source MLLMs under visual degradations. No equations, fitted parameters, or predictions appear in the provided text. The central claims rest on the released datasets, code, and direct experimental measurements rather than any derivation that reduces to author-defined inputs or self-citations by construction. The work is therefore self-contained and externally testable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on standard attention mechanisms and the empirical effectiveness of the proposed fusion design; no free parameters or invented physical entities are specified in the abstract.

axioms (1)
  • standard math Transformer-based cross-attention can be localized at the patch level to fuse two image modalities.
    The method description invokes established attention operations without new proofs.

pith-pipeline@v0.9.0 · 5525 in / 1000 out tokens · 48151 ms · 2026-05-10T04:50:34.667189+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM integrates polarimetric physical parameters into VLMs via dual-stream architecture and progressive training, outperforming RGB baselines by 25.4% on a new 75K-pair polarization-aware VQA benchmark.

  2. PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 7.0

    PolarVLM is the first VLM framework to integrate polarimetric physical parameters via dual-stream architecture and progressive training, delivering 25.4% gains over RGB baselines on reflection and transparency tasks w...

Reference graph

Works this paper leans on

48 extracted references · 6 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    Flamingo: a Visual Language Model for Few-Shot Learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj B...

  2. [2]

    Multimodal large language models in health care: applications, challenges, and future outlook. Journal of Medical Internet Research, 26:e59505, 2024

    Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, Arfan Ahmed, Max-Antoine Renault, Rafat Damseh, and Javaid Sheikh. Multimodal large language models in health care: applications, challenges, and future outlook. Journal of Medical Internet Research, 26:e59505, 2024. 1

  3. [3]

    LLMs can see and hear without any training

    Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, and Rohit Girdhar. LLMs can see and hear without any training. InICML, 2025. 5, 6, 3

  4. [4]

    Qwen2.5-VL technical report. arXiv e-prints, pages arXiv–2502, 2025

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv e-prints, pages arXiv–2502, 2025. 8

  5. [5]

    RGB-D and Thermal Sensor Fusion: A Systematic Literature Review

    Martin Brenner, Napoleon H. Reyes, Teo Susnjak, and Andre L. C. Barczak. RGB-D and Thermal Sensor Fusion: A Systematic Literature Review. IEEE Access, 11:82410–82442,

  6. [6]

    IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark

    Zhe Cao, Jin Zhang, and Ruiheng Zhang. IRGPT: Understanding Real-world Infrared Image with Bi-cross-modal Curriculum on Large-scale Benchmark. In Proceedings of the IEEE international conference on computer vision, pages 166–176, 2025. 2

  7. [7]

    A survey on multimodal large language models for autonomous driving

    Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 958–979, 2024. 1

  8. [8]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in neural information processing systems, pages 49250–49267,

  9. [9]

    ImageBind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. In CVPR, 2023. 2

  10. [10]

    The wavelength dependent model of extinction in fog and haze for free space optical communication. Optics Express, 19(4):3379–3386,

    Martin Grabner and Vaclav Kvicera. The wavelength dependent model of extinction in fog and haze for free space optical communication. Optics Express, 19(4):3379–3386,

  11. [11]

    Publisher: Optica Publishing Group. 1

  12. [12]

    Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. In International Conference on Learning Representations, 2018. 3

  13. [13]

    LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 4

  14. [14]

    LLVIP: A visible-infrared paired dataset for low-light vision

    Xinyu Jia, Chuang Zhu, Minzhen Li, Wenqi Tang, and Wenli Zhou. LLVIP: A visible-infrared paired dataset for low-light vision. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3496–3504, 2021. 3, 5, 6

  15. [15]

    Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models

    Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-LLaVA: Enhancing Understanding of Infrared Images in Multi-Modal Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8573–8591, Miami, Florida, USA, 2024. Association for Computational Linguistics. 2, 4

  16. [16]

    OpenVLA: An open-source vision-language-action model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan P Foster, Pannag R Sanketi, Quan Vuong, et al. OpenVLA: An open-source vision-language-action model. In Conference on Robot Learning, pages 2679–2713. PMLR, 2025. 1

  17. [17]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models, 2024. arXiv:2407.07895 [cs]. 6, 8

  18. [18]

    BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In Proceedings of the 40th International Conference on Machine Learning, pages 19730–19742. PMLR, 2023. ISSN: 2640-3498. 2

  19. [19]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual Instruction Tuning. In Advances in Neural Information Processing Systems, pages 34892–34916. Curran Associates, Inc., 2023. 2

  20. [20]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024. 2, 6, 8, 3

  21. [21]

    Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992–10002, 2021. ISSN: 2380-7504. 2

  22. [22]

    PAVE: Patching and adapting video large language models

    Zhuoming Liu, Yiquan Li, Khoi Duc Nguyen, Yiwu Zhong, and Yin Li. PAVE: Patching and adapting video large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3306–3317, 2025. 3

  23. [23]

    SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer

    Jiayi Ma, Linfeng Tang, Fan Fan, Jun Huang, Xiaoguang Mei, and Yong Ma. SwinFusion: Cross-domain Long-range Learning for General Image Fusion via Swin Transformer. IEEE/CAA Journal of Automatica Sinica, 9(7):1200–1217,

  24. [24]

    Publisher: IEEE/CAA Journal of Automatica Sinica. 2

  25. [25]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025

    Meta. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025. 6, 8

  26. [26]

    Attention Bottlenecks for Multimodal Fusion

    Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. Attention Bottlenecks for Multimodal Fusion. In Advances in Neural Information Processing Systems, pages 14200–14213. Curran Associates, Inc., 2021. 2

  27. [27]

    OWLViz: An Open-World Benchmark for Visual Question Answering, 2025

    Thuy Nguyen, Dang Nguyen, Hoang Nguyen, Thuan Luong, Long Hoang Dang, and Viet Dac Lai. OWLViz: An Open-World Benchmark for Visual Question Answering, 2025. arXiv:2503.07631 [cs]. 6

  28. [28]

    HDRT: A large-scale dataset for infrared-guided HDR imaging. Information Fusion, 120:103109, 2025

    Jingchao Peng, Thomas Bashford-Rogers, Francesco Banterle, Haitao Zhao, and Kurt Debattista. HDRT: A large-scale dataset for infrared-guided HDR imaging. Information Fusion, 120:103109, 2025. 5, 6

  29. [29]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021....

  30. [30]

    A theory of joint light and heat transport for lambertian scenes

    Mani Ramanagopal, Sriram Narayanan, Aswin C Sankaranarayanan, and Srinivasa G Narasimhan. A theory of joint light and heat transport for lambertian scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11924–11933, 2024. 2

  31. [31]

    Sparse Depth Enhanced Direct Thermal-Infrared SLAM Beyond the Visible Spectrum. IEEE Robotics and Automation Letters, 4(3):2918–2925, 2019

    Young-Sik Shin and Ayoung Kim. Sparse Depth Enhanced Direct Thermal-Infrared SLAM Beyond the Visible Spectrum. IEEE Robotics and Automation Letters, 4(3):2918–2925, 2019. 1

  32. [32]

    PST900: RGB-Thermal Calibration, Dataset and Segmentation Network

    Shreyas S. Shivakumar, Neil Rodrigues, Alex Zhou, Ian D. Miller, Vijay Kumar, and Camillo J. Taylor. PST900: RGB-Thermal Calibration, Dataset and Segmentation Network. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9441–9447, 2020. ISSN: 2577-087X. 3

  33. [33]

    PandaGPT: One Model To Instruction-Follow Them All

    Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. PandaGPT: One Model To Instruction-Follow Them All. In Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!, pages 11–23, Prague, Czech Republic,

  34. [34]

    Association for Computational Linguistics. 3

  35. [35]

    Rethinking image restoration for object detection. Advances in Neural Information Processing Systems, 35:4461–4474,

    Shangquan Sun, Wenqi Ren, Tao Wang, and Xiaochun Cao. Rethinking image restoration for object detection. Advances in Neural Information Processing Systems, 35:4461–4474,

  36. [36]

    High resolution photography with an rgb-infrared camera

    Huixuan Tang, Xiaopeng Zhang, Shaojie Zhuo, Feng Chen, Kiriakos N Kutulakos, and Liang Shen. High resolution photography with an rgb-infrared camera. In 2015 IEEE International Conference on Computational Photography (ICCP), pages 1–10. IEEE, 2015. 2

  37. [37]

    Anthropic Claude LLMs, 2025

    Anthropic Team. Anthropic Claude LLMs, 2025. 5, 7, 8

  38. [38]

    Analysing the Robustness of Vision-Language-Models to Common Corruptions, 2025

    Muhammad Usama, Syeda Aishah Asim, Syed Bilal Ali, Syed Talal Wasim, and Umair Bin Mansoor. Analysing the Robustness of Vision-Language-Models to Common Corruptions, 2025. arXiv:2504.13690 [cs]. 1, 3

  39. [39]

    Attention is All you Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems. Curran Associates, Inc.,

  40. [40]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution, 2024. arXiv:2409.12191 [cs]. 6, 8

  41. [41]

    CQA-Face: Contrastive Quality-Aware Attentions for Face Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2504–2512, 2022

    Qiangchang Wang and Guodong Guo. CQA-Face: Contrastive Quality-Aware Attentions for Face Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2504–2512, 2022. 3

  42. [42]

    Mixture of Scale Experts for Alignment-free RGBT Video Object Detection and A Unified Benchmark, 2025

    Qishun Wang, Zhengzheng Tu, Kunpeng Wang, Le Gu, and Chuanwang Guo. Mixture of Scale Experts for Alignment-free RGBT Video Object Detection and A Unified Benchmark, 2025. arXiv:2410.12143 [cs]. 3

  43. [43]

    SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Cir. and Sys. for Video Technol., 33(12):7737–7748, 2023

    Yike Wang, Gongyang Li, and Zhi Liu. SGFNet: Semantic-Guided Fusion Network for RGB-Thermal Semantic Segmentation. IEEE Trans. Cir. and Sys. for Video Technol., 33(12):7737–7748, 2023. 2, 3

  44. [44]

    Rgb-infrared cross-modality person re-identification

    Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. Rgb-infrared cross-modality person re-identification. In Proceedings of the IEEE international conference on computer vision, pages 5380–5389,

  45. [45]

    A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024. 1

  46. [46]

    Object fusion tracking based on visible and infrared images: A comprehensive review. Information Fusion, 63:166–187, 2020

    Xingchen Zhang, Ping Ye, Henry Leung, Ke Gong, and Gang Xiao. Object fusion tracking based on visible and infrared images: A comprehensive review. Information Fusion, 63:166–187, 2020. 2

  47. [47]

    Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment

    Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Wang HongFa, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Cai Wan Zhang, Zhifeng Li, Wei Liu, and Li Yuan. Languagebind: Extending video-language pretraining to n-modality by language-based semantic alignment. In The Twelfth International Conference on Learning Representations, 2024. 2, 5, 3

  48. [48]

    MiniGPT-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2024. 2