pith. machine review for the scientific record.

arxiv: 2605.07574 · v2 · submitted 2026-05-08 · 💻 cs.CV

Recognition: 2 Lean theorem links

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language models · polarization imaging · optical ambiguities · reflection recognition · transparent objects · PolarVQA · physics-aware VQA

The pith

PolarVLM integrates polarimetric physical parameters into vision-language models to resolve optical ambiguities in reflections and transparent objects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream vision-language models struggle with severe optical ambiguities such as reflections and transparent objects because of the limitations of standard RGB inputs. PolarVLM bridges this gap by incorporating polarimetric physical parameters that capture additional light properties and disambiguate these scenes. The framework employs a dual-stream architecture and a progressive two-stage training strategy to inject this physical information without introducing new misinterpretations and while keeping general visual reasoning intact. To support this, the authors introduce PolarVQA, a benchmark of 75,000 physics-grounded instruction pairs focused on reflective and transparent scenes. Results show a 25.4% overall improvement over the RGB baseline across five tasks, including notable gains in reflection recognition and glass counting.

Core claim

PolarVLM is the first multimodal framework that integrates polarimetric physical parameters into VLMs. Using a dual-stream architecture and progressive two-stage training, it effectively prevents physical misinterpretations while preserving general visual abilities. This enables physics-aware semantic understanding, as shown by outperforming the RGB baseline by 25.4% overall on five evaluation tasks, with gains of 26.6% in reflection recognition and 34.0% in glass counting, on the newly constructed PolarVQA benchmark.

What carries the argument

dual-stream architecture combined with progressive two-stage training for fusing polarimetric parameters into vision-language models
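
The review has no access to the authors' code, but a minimal sketch helps fix ideas about what "dual-stream" fusion can mean in practice: two encoders whose token sequences are projected into the language model's embedding space and concatenated. Every module name, dimension, and the concatenation-style fusion below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a dual-stream, physics-aware encoder: one stream encodes
# RGB, one encodes polarimetric maps (e.g., DoLP/AoLP), and both are projected into
# the language model's token space and concatenated. Names and dims are illustrative.
import torch
import torch.nn as nn


class DualStreamEncoder(nn.Module):
    def __init__(self, rgb_encoder: nn.Module, polar_encoder: nn.Module,
                 rgb_dim: int = 1024, polar_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.rgb_encoder = rgb_encoder      # e.g., a frozen CLIP-style ViT
        self.polar_encoder = polar_encoder  # lightweight encoder for polarimetric maps
        self.rgb_proj = nn.Linear(rgb_dim, llm_dim)
        self.polar_proj = nn.Linear(polar_dim, llm_dim)

    def forward(self, rgb: torch.Tensor, polar: torch.Tensor) -> torch.Tensor:
        rgb_tokens = self.rgb_proj(self.rgb_encoder(rgb))          # (B, N_rgb, llm_dim)
        polar_tokens = self.polar_proj(self.polar_encoder(polar))  # (B, N_pol, llm_dim)
        # Sequence-level fusion: the language model attends jointly over both streams.
        return torch.cat([rgb_tokens, polar_tokens], dim=1)
```

The point of the sketch is the separation of concerns: the polarimetric stream can be added and trained without rewriting the RGB pathway, which is what the progressive training strategy relies on.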

If this is right

  • PolarVLM achieves 25.4% better overall performance than RGB-only VLMs on physics-related tasks.
  • It provides 26.6% improvement specifically in reflection recognition.
  • Glass counting accuracy increases by 34.0%.
  • The model maintains general visual reasoning capabilities on non-polarization tasks.
  • It unlocks semantic understanding grounded in physical light properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This integration could enhance vision systems in environments with many transparent or reflective surfaces, such as indoor robotics or autonomous vehicles.
  • The PolarVQA dataset may become a standard testbed for evaluating how well models understand physical scene properties.
  • Similar dual-stream approaches might be adapted to incorporate other physical imaging modalities like thermal or depth into VLMs.

Load-bearing premise

The dual-stream architecture combined with progressive two-stage training can incorporate polarimetric parameters to resolve optical ambiguities without introducing new misinterpretations or degrading the model's general visual reasoning abilities on non-polarization tasks.
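
A minimal sketch of how such a progressive schedule could be expressed, assuming Stage 1 aligns only the new polarimetric stream against a frozen base VLM and Stage 2 additionally opens lightweight adapters for instruction tuning on PolarVQA; the staging details are inferred from the abstract's description, not taken from the paper.

```python
# Hypothetical two-stage schedule: Stage 1 trains only the new polarimetric stream
# and its projector against a frozen base VLM; Stage 2 additionally opens LoRA-style
# adapters on the language model for instruction tuning on PolarVQA. Illustrative only.
def set_requires_grad(module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage: int) -> None:
    set_requires_grad(model.rgb_encoder, False)    # pretrained vision tower stays frozen
    set_requires_grad(model.llm, False)            # base language model stays frozen
    set_requires_grad(model.polar_encoder, True)   # new physics stream always trains
    set_requires_grad(model.polar_proj, True)
    set_requires_grad(model.llm_lora, stage == 2)  # adapters open only in Stage 2
```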

What would settle it

An experiment in which PolarVLM shows no improvement, or even degradation, on polarized scenes relative to the RGB baseline, or performs worse than that baseline on standard non-polarized VQA tasks, would indicate that the integration method fails.

Figures

Figures reproduced from arXiv: 2605.07574 by Boxin Shi, Chu Zhou, Heng Guo, Imari Sato, Yuliang Li, Zhanyu Ma.

Figure 1. PolarVLM overcomes optical ambiguities through physics-aware multimodal reasoning.
Figure 2. Overall design of PolarVLM. (a) Dual-stream physics-aware architecture with sequence …
Figure 3. Statistical overview and task taxonomy of the PolarVQA benchmark. (a) For Stage 1, the …
Figure 4. Qualitative comparisons on the PolarVQA test set. We evaluate PolarVLM against the …
Figure 5. Automated data generation pipeline for PolarVQA. The pipeline extracts visual structures …
Figure 6. Prompt template for reflection caption generation in Stage 1. The template contains shared …
Figure 7. Prompt template for glass caption generation in Stage 1. The template guides the generation …
Figure 8. Prompt template for reflection instruction generation in Stage 2. The template is used to …
Figure 9. Prompt template for glass instruction generation in Stage 2. The template is used to …
Figure 10. Task-specific evaluation prompts used by GPT-4o-mini …
Figure 11. Additional qualitative comparisons on the PolarVQA test set. The examples cover glass …
Figure 12. Adaptive attention allocation in PolarVLM. We plot the …
original abstract

Mainstream vision-language models (VLMs) fundamentally struggle with severe optical ambiguities, such as reflections and transparent objects, due to the inherent limitations of standard RGB inputs. While polarization imaging captures polarimetric physical parameters that resolve these ambiguities, existing methods are constrained by fixed-format outputs and remain isolated from open-ended reasoning. To bridge this semantic-physical gap, we introduce PolarVLM, the first multimodal framework integrating polarimetric physical parameters into VLMs. By employing a dual-stream architecture and a progressive two-stage training strategy, PolarVLM effectively prevents physical misinterpretations while preserving general visual abilities. Complementing our architecture, we construct PolarVQA, the first benchmark for polarization-aware VQA, featuring 75K physics-grounded instruction-tuning pairs targeting reflective and transparent scenes. Experiments show that PolarVLM surpasses the RGB baseline by 25.4% overall across five evaluation tasks, with remarkable gains of 26.6% in reflection recognition and 34.0% in glass counting, successfully unlocking physics-aware semantic understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PolarVLM, the first multimodal VLM framework to integrate polarimetric physical parameters via a dual-stream architecture and progressive two-stage training, aiming to resolve optical ambiguities (reflections, transparent objects) that plague standard RGB-based VLMs while preserving general visual reasoning. It also releases PolarVQA, a new benchmark of 75K physics-grounded VQA pairs focused on reflective and transparent scenes. Experiments claim a 25.4% overall gain over an RGB baseline across five tasks, including +26.6% on reflection recognition and +34.0% on glass counting.

Significance. If the central claims hold after additional controls, the work would be significant as the first open-ended VLM integration of polarization imaging, with the PolarVQA benchmark providing a reusable resource for physics-aware vision-language research. The dual-stream design and staged training strategy represent a concrete architectural contribution that could generalize to other physical sensing modalities.

major comments (3)
  1. [Experiments] Experiments section: the headline claim that PolarVLM 'preserves general visual abilities' (abstract) rests on the dual-stream + two-stage training successfully injecting polarimetric cues without degradation elsewhere, yet no quantitative results are reported on any standard non-polar VQA, captioning, or reasoning benchmark (VQAv2, GQA, OK-VQA, etc.) for either the full model or an ablated polarimetric-stream variant. This control is load-bearing for the weakest assumption identified in the reader's report.
  2. [§4] §4 (or equivalent experimental setup): insufficient detail is given on the RGB baseline implementation, including whether it uses identical backbone, training data volume, hyperparameters, or instruction-tuning pairs as PolarVLM; without this, the reported 25.4% aggregate improvement cannot be confidently attributed to the polarimetric stream rather than training differences.
  3. [§4] §4: the manuscript provides no statistical significance tests, confidence intervals, or explicit controls for data leakage between PolarVQA construction and evaluation splits, which is required to substantiate the large per-task gains (26.6%, 34.0%) given the new benchmark's construction.
minor comments (2)
  1. [Abstract] Abstract: 'five evaluation tasks' are referenced but never enumerated; a brief list would improve clarity.
  2. [§3] Notation for polarimetric parameters (e.g., Stokes vectors or degree of polarization) should be defined at first use in §3 to aid readers unfamiliar with polarization imaging.
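
For readers unfamiliar with the quantities in question, the standard definitions are simple: from intensity images captured behind linear polarizers at 0°, 45°, 90°, and 135°, one computes the linear Stokes parameters and, from them, the degree and angle of linear polarization. The sketch below is textbook polarimetry, not code from the paper.

```python
# Textbook linear polarimetry: Stokes parameters and degree/angle of linear
# polarization from four polarizer-angle intensity images. Not the paper's code.
import numpy as np


def linear_stokes(i0, i45, i90, i135, eps=1e-8):
    s0 = 0.5 * (i0 + i45 + i90 + i135)           # total intensity (mean of the two orthogonal pairs)
    s1 = i0 - i90                                # horizontal vs. vertical polarization preference
    s2 = i45 - i135                              # diagonal polarization preference
    dolp = np.sqrt(s1**2 + s2**2) / (s0 + eps)   # degree of linear polarization, in [0, 1]
    aolp = 0.5 * np.arctan2(s2, s1)              # angle of linear polarization, in radians
    return s0, s1, s2, dolp, aolp
```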

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. The comments highlight important aspects of experimental rigor that we address point-by-point below. We have revised the manuscript to incorporate additional details, results, and clarifications as described.

point-by-point responses
  1. Referee: [Experiments] Experiments section: the headline claim that PolarVLM 'preserves general visual abilities' (abstract) rests on the dual-stream + two-stage training successfully injecting polarimetric cues without degradation elsewhere, yet no quantitative results are reported on any standard non-polar VQA, captioning, or reasoning benchmark (VQAv2, GQA, OK-VQA, etc.) for either the full model or an ablated polarimetric-stream variant. This control is load-bearing for the weakest assumption identified in the reader's report.

    Authors: We agree that quantitative evidence on standard benchmarks would strengthen the preservation claim. The progressive two-stage training first optimizes the shared components on general instruction data before introducing the polarimetric stream, which is intended to avoid catastrophic forgetting. However, the submitted manuscript indeed omits these controls. In the revision we have added results on VQAv2 and GQA showing that PolarVLM achieves performance within 1-2% of the RGB-only baseline, confirming no degradation. An ablation removing the polar stream is also included. These new tables and discussion will appear in the updated Experiments section. revision: yes

  2. Referee: [§4] §4 (or equivalent experimental setup): insufficient detail is given on the RGB baseline implementation, including whether it uses identical backbone, training data volume, hyperparameters, or instruction-tuning pairs as PolarVLM; without this, the reported 25.4% aggregate improvement cannot be confidently attributed to the polarimetric stream rather than training differences.

    Authors: We acknowledge the need for greater transparency. The RGB baseline employs the identical vision-language backbone, the same total number of training pairs drawn from the general instruction-tuning corpus, and the same optimizer, learning rate schedule, and batch size. The only differences are the addition of the polarimetric stream and the second-stage training on PolarVQA. We have expanded §4 with a new table that explicitly lists backbone, data volume, hyperparameters, and training stages for both models, along with a statement that all other factors are matched. revision: yes

  3. Referee: [§4] §4: the manuscript provides no statistical significance tests, confidence intervals, or explicit controls for data leakage between PolarVQA construction and evaluation splits, which is required to substantiate the large per-task gains (26.6%, 34.0%) given the new benchmark's construction.

    Authors: We accept this criticism. The revised manuscript now reports 95% confidence intervals and paired t-test p-values for all task improvements. For data leakage, PolarVQA was generated via physics-based rendering with procedurally varied scene parameters; training and test splits use completely disjoint sets of object instances, lighting conditions, and camera poses. We have added a dedicated subsection in §4 describing the generation pipeline, split criteria, and verification steps that ensure no scene overlap. revision: yes
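
As a concrete illustration of the per-task statistics the referee requests and the rebuttal promises, a paired bootstrap over per-example scores gives a confidence interval on the mean improvement, and a paired t-test gives a p-value. Function and variable names below are illustrative assumptions, not the authors' evaluation code.

```python
# Paired comparison of per-example scores for PolarVLM vs. the RGB baseline:
# bootstrap a 95% CI on the mean improvement and run a paired t-test.
import numpy as np
from scipy import stats


def paired_improvement(scores_polar, scores_rgb, n_boot=10_000, seed=0):
    diffs = np.asarray(scores_polar, dtype=float) - np.asarray(scores_rgb, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample per-example differences with replacement to estimate the sampling
    # distribution of the mean improvement.
    boot_means = np.array([
        rng.choice(diffs, size=diffs.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    test = stats.ttest_rel(scores_polar, scores_rgb)
    return {"mean_gain": diffs.mean(), "ci95": (ci_low, ci_high), "p_value": test.pvalue}
```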

Circularity Check

0 steps flagged

No circularity: empirical gains measured against an independent baseline

full rationale

The paper's core contribution is empirical: a dual-stream architecture plus progressive training, evaluated by direct comparison to an RGB baseline on the newly introduced PolarVQA benchmark. No equations, parameters, or claims reduce to self-definition, fitted inputs renamed as predictions, or load-bearing self-citations. The reported 25.4% overall improvement and the task-specific gains are measured against a separately trained RGB baseline rather than against quantities defined by the model itself, so the evaluation is empirically grounded rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard VLM training assumptions plus the domain premise that polarization resolves RGB ambiguities; no new physical constants or entities are postulated beyond the model and dataset themselves.

axioms (1)
  • domain assumption: Polarization imaging supplies physical parameters that resolve optical ambiguities such as reflections and transparency in RGB images.
    Invoked as the motivation for the entire framework in the abstract.
invented entities (2)
  • PolarVLM dual-stream architecture (no independent evidence)
    purpose: to fuse polarimetric parameters with semantic VLM reasoning
    New model component introduced by the paper.
  • PolarVQA benchmark (no independent evidence)
    purpose: physics-grounded instruction-tuning pairs for reflective and transparent scenes
    New dataset constructed for this work.

pith-pipeline@v0.9.0 · 5490 in / 1384 out tokens · 49621 ms · 2026-05-12T03:01:06.867745+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 8 internal anchors

  1. [1]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InProc. of Advances in Neural Information Processing Systems, pages 34892–34916, 2023

  2. [2]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InProc. of International Conference on Machine Learning, pages 19730–19742, 2023

  3. [3]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProc. of Computer Vision and Pattern Recognition, pages 26296–26306, 2024

  4. [4]

    Polarized Light

    Dennis H Goldstein. Polarized Light. CRC Press, 2017

  5. [5]

    Field guide to polarization

    Edward Collett. Field guide to polarization. SPIE Press Bellingham, Washington, 2005

  6. [6]

    Reflection separation using a pair of unpolarized and polarized images

    Youwei Lyu, Zhaopeng Cui, Si Li, Marc Pollefeys, and Boxin Shi. Reflection separation using a pair of unpolarized and polarized images. InProc. of Advances in Neural Information Processing Systems, 2019

  7. [7]

    Polarized reflection removal with dual-stream attention guidance

    Xin Wang, Yong Zhang, and Yanchu Chen. Polarized reflection removal with dual-stream attention guidance. Pattern Recognition, 157:110945, 2025

  8. [8]

    PolarFree: Polarization-based reflection-free imaging

    Mingde Yao, Menglu Wang, King-Man Tam, Lingen Li, Tianfan Xue, and Jinwei Gu. PolarFree: Polarization-based reflection-free imaging. InProc. of Computer Vision and Pattern Recognition, pages 10890–10899, 2025

  9. [9]

    Glass segmentation using intensity and spectral polarization cues

    Haiyang Mei, Bo Dong, Wen Dong, Jiaxi Yang, Seung-Hwan Baek, Felix Heide, Pieter Peers, Xiaopeng Wei, and Xin Yang. Glass segmentation using intensity and spectral polarization cues. InProc. of Computer Vision and Pattern Recognition, pages 12622–12631, 2022

  10. [10]

    Transparent shape from a single view polarization image

    Mingqi Shao, Chongkun Xia, Zhendong Yang, Junnan Huang, and Xueqian Wang. Transparent shape from a single view polarization image. InProc. of International Conference on Computer Vision, pages 9277–9286, 2023

  11. [11]

    Florence-2: Advancing a unified representation for a variety of vision tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. InProc. of Computer Vision and Pattern Recognition, pages 4818–4829, 2024

  12. [12]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  13. [13]

    Deep polarization imaging for 3D shape and SVBRDF acquisition

    Valentin Deschaintre, Yiming Lin, and Abhijeet Ghosh. Deep polarization imaging for 3D shape and SVBRDF acquisition. InProc. of Computer Vision and Pattern Recognition, pages 15567–15576, 2021

  14. [14]

    Deep shape from polarization

    Yunhao Ba, Alex Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, and Achuta Kadambi. Deep shape from polarization. InProc. of European Conference on Computer Vision, pages 554–571, 2020

  15. [15]

    Shape from polarization with distant lighting estimation

    Youwei Lyu, Lingran Zhao, Si Li, and Boxin Shi. Shape from polarization with distant lighting estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):13991–14004, 2023

  16. [16]

    Shape from polarization for complex scenes in the wild

    Chenyang Lei, Chenyang Qi, Jiaxin Xie, Na Fan, Vladlen Koltun, and Qifeng Chen. Shape from polarization for complex scenes in the wild. InProc. of Computer Vision and Pattern Recognition, pages 12632–12641, 2022

  17. [17]

    Multi-view azimuth stereo via tangent space consistency

    Xu Cao, Hiroaki Santo, Fumio Okura, and Yasuyuki Matsushita. Multi-view azimuth stereo via tangent space consistency. InProc. of Computer Vision and Pattern Recognition, pages 825–834, 2023

  18. [18]

    NeRSP: Neural 3D reconstruction for reflective objects with sparse polarized images

    Yufei Han, Heng Guo, Koki Fukai, Hiroaki Santo, Boxin Shi, Fumio Okura, Zhanyu Ma, and Yunpeng Jia. NeRSP: Neural 3D reconstruction for reflective objects with sparse polarized images. InProc. of Computer Vision and Pattern Recognition, pages 11821–11830, 2024

  19. [19]

    PISR: Polarimetric neural implicit surface reconstruction for textureless and specular objects

    Guangcheng Chen, Yicheng He, Li He, and Hong Zhang. PISR: Polarimetric neural implicit surface reconstruction for textureless and specular objects. InProc. of European Conference on Computer Vision, pages 205–222, 2024

  20. [20]

    PANDORA: Polarization-aided neural decomposition of radiance

    Akshat Dave, Yongyi Zhao, and Ashok Veeraraghavan. PANDORA: Polarization-aided neural decomposition of radiance. InProc. of European Conference on Computer Vision, pages 538–556, 2022

  21. [21]

    Depth sensing using geometrically constrained polarization normals

    Achuta Kadambi, Vage Taamazyan, Boxin Shi, and Ramesh Raskar. Depth sensing using geometrically constrained polarization normals. International Journal of Computer Vision, 125(1):34–51, 2017

  22. [22]

    DPS-Net: Deep polarimetric stereo depth estimation

    Chaoran Tian, Weihong Pan, Zimo Wang, Mao Mao, Guofeng Zhang, Hujun Bao, Ping Tan, and Zhaopeng Cui. DPS-Net: Deep polarimetric stereo depth estimation. InProc. of International Conference on Computer Vision, pages 3569–3579, 2023

  23. [23]

    Physics-guided reflection separation from a pair of unpolarized and polarized images

    Youwei Lyu, Zhaopeng Cui, Si Li, Marc Pollefeys, and Boxin Shi. Physics-guided reflection separation from a pair of unpolarized and polarized images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):2151–2165, 2022

  24. [24]

    Polarized reflection removal with perfect alignment in the wild

    Chenyang Lei, Xuhua Huang, Mengdi Zhang, Qiong Yan, Wenxiu Sun, and Qifeng Chen. Polarized reflection removal with perfect alignment in the wild. InProc. of Computer Vision and Pattern Recognition, pages 1750–1758, 2020

  25. [25]

    Instant dehazing of images using polarization

    Yoav Y Schechner, Srinivasa G Narasimhan, and Shree K Nayar. Instant dehazing of images using polarization. InProc. of Computer Vision and Pattern Recognition, pages I–I, 2001

  26. [26]

    Learning to dehaze with polarization

    Chu Zhou, Minggui Teng, Yufei Han, Chao Xu, and Boxin Shi. Learning to dehaze with polarization. In Proc. of Advances in Neural Information Processing Systems, pages 11487–11500, 2021

  27. [27]

    HDR reconstruction based on the polarization camera

    Xuesong Wu, Hong Zhang, Xiaoping Hu, Moein Shakeri, Chen Fan, and Juiwen Ting. HDR reconstruction based on the polarization camera. IEEE Robotics and Automation Letters, 5(4):5113–5119, 2020

  28. [28]

    Polarization guided HDR reconstruction via pixel-wise depolarization

    Chu Zhou, Yufei Han, Minggui Teng, Jin Han, Si Li, Chao Xu, and Boxin Shi. Polarization guided HDR reconstruction via pixel-wise depolarization. IEEE Transactions on Image Processing, 32:1774–1787, 2023

  29. [29]

    Degree-of-linear-polarization- based color constancy

    Taishi Ono, Yuhi Kondo, Legong Sun, Teppei Kurita, and Yusuke Moriuchi. Degree-of-linear-polarization- based color constancy. InProc. of Computer Vision and Pattern Recognition, pages 19740–19749, 2022

  30. [30]

    Polarization guided mask-free shadow removal

    Chu Zhou, Chao Xu, and Boxin Shi. Polarization guided mask-free shadow removal. InProc. of the AAAI Conference on Artificial Intelligence, pages 10716–10724, 2025

  31. [31]

    Deep polarization cues for transparent object segmentation

    Agastya Kalra, Vage Taamazyan, Supreeth Krishna Rao, Kartik Venkataraman, Ramesh Raskar, and Achuta Kadambi. Deep polarization cues for transparent object segmentation. InProc. of Computer Vision and Pattern Recognition, pages 8602–8611, 2020

  32. [32]

    Single image reflection separation with perceptual losses

    Xuaner Zhang, Ren Ng, and Qifeng Chen. Single image reflection separation with perceptual losses. In Proc. of Computer Vision and Pattern Recognition, pages 4786–4794, 2018

  33. [33]

    CoRRN: Cooperative reflection removal network

    Renjie Wan, Boxin Shi, Haoliang Li, Ling-Yu Duan, Ah-Hwee Tan, and Alex C Kot. CoRRN: Cooperative reflection removal network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(12):2969–2982, 2019

  34. [34]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023

  35. [35]

    InstructBLIP: Towards general-purpose vision-language models with instruction tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. InProc. of Advances in Neural Information Processing Systems, pages 49250–49267, 2023

  36. [36]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  37. [37]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProc. of Computer Vision and Pattern Recognition, pages 24185–24198, 2024

  38. [38]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProc. of International Conference on Machine Learning, pages 8748–8763, 2021

  39. [39]

    PointCLIP: Point cloud understanding by CLIP

    Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. PointCLIP: Point cloud understanding by CLIP. InProc. of Computer Vision and Pattern Recognition, pages 8552–8562, 2022

  40. [40]

    VideoCLIP: Contrastive pre-training for zero-shot video-text understanding

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. VideoCLIP: Contrastive pre-training for zero-shot video-text understanding. InProc. of the Conference on Empirical Methods in Natural Language Processing, pages 6787–6800, 2021

  41. [41]

    AudioCLIP: Extending CLIP to image, text and audio

    Andrey Guzhov, Federico Raue, Jörn Hees, and Andreas Dengel. AudioCLIP: Extending CLIP to image, text and audio. InProc. of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 976–980, 2022

  42. [42]

    ImageBind: One embedding space to bind them all

    Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. ImageBind: One embedding space to bind them all. InProc. of Computer Vision and Pattern Recognition, pages 15180–15190, 2023

  43. [43]

    Binding touch to everything: Learning unified multimodal tactile representations

    Fengyu Yang, Chao Feng, Ziyang Chen, Hyoungseob Park, Daniel Wang, Yiming Dou, Ziyao Zeng, Xien Chen, Rit Gangopadhyay, Andrew Owens, et al. Binding touch to everything: Learning unified multimodal tactile representations. InProc. of Computer Vision and Pattern Recognition, pages 26340–26353, 2024

  44. [44]

    EventVL: Understand event streams via multimodal large language model

    Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, and Hui Xiong. EventVL: Understand event streams via multimodal large language model. arXiv preprint arXiv:2501.13707, 2025

  45. [45]

    Infrared-LLaVA: Enhancing understanding of infrared images in multi-modal large language models

    Shixin Jiang, Zerui Chen, Jiafeng Liang, Yanyan Zhao, Ming Liu, and Bing Qin. Infrared-LLaVA: Enhancing understanding of infrared images in multi-modal large language models. InProc. of the Conference on Empirical Methods in Natural Language Processing, pages 8573–8591, 2024

  46. [46]

    Learning to deblur polarized images

    Chu Zhou, Minggui Teng, Xinyu Zhou, Chao Xu, Imari Sato, and Boxin Shi. Learning to deblur polarized images. International Journal of Computer Vision, 133(9):5976–5991, 2025

  47. [47]

    PIDSR: Complementary polarized image demosaicing and super-resolution

    Shuangfan Zhou, Chu Zhou, Youwei Lyu, Heng Guo, Zhanyu Ma, Boxin Shi, and Imari Sato. PIDSR: Complementary polarized image demosaicing and super-resolution. InProc. of Computer Vision and Pattern Recognition, pages 16081–16090, 2025

  48. [48]

    pCON: Polarimetric coordinate networks for neural scene representations

    Henry Peters, Yunhao Ba, and Achuta Kadambi. pCON: Polarimetric coordinate networks for neural scene representations. InProc. of Computer Vision and Pattern Recognition, pages 16579–16589, 2023

  49. [49]

    Polarimetric neural field via unified complex-valued wave representation

    Chu Zhou, Yixin Yang, Junda Liao, Heng Guo, Boxin Shi, and Imari Sato. Polarimetric neural field via unified complex-valued wave representation. InProc. of International Conference on Computer Vision, pages 25660–25669, 2025

  50. [50]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InProc. of International Conference on Learning Representations, 2022

  51. [51]

    QLoRA: Efficient finetuning of quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. InProc. of Advances in Neural Information Processing Systems, pages 10088–10115, 2023

  52. [52]

    SHARP: Steering hallucination in LVLMs via representation engineering

    Junfei Wu, Yue Ding, Guofan Liu, Tianze Xia, Ziyue Huang, Dianbo Sui, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. SHARP: Steering hallucination in LVLMs via representation engineering. InProc. of the Conference on Empirical Methods in Natural Language Processing, pages 14357–14372, 2025

  53. [53]

    LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

    Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, et al. LLM-as-judge framework for evaluating tone-induced hallucination in vision-language models. arXiv preprint arXiv:2604.18803, 2026

  54. [54]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025

  55. [55]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  56. [56]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  57. [57]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InProc. of International Conference on Learning Representations, 2019

  58. [58]

    DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning

    Abrar Majeedi, Zhiyuan Ruan, Ziyi Zhao, Hongcheng Wang, Jianglin Lu, and Yin Li. DUALVISION: RGB-infrared multimodal large language models for robust visual reasoning. arXiv preprint arXiv:2604.18829, 2026

  59. [59]

    EAGLE: Exploring the design space for multimodal LLMs with mixture of encoders

    Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. EAGLE: Exploring the design space for multimodal LLMs with mixture of encoders. InProc. of International Conference on Learning Representa...

  60. [60]

    Multi-layer visual feature fusion in multimodal LLMs: Methods, analysis, and best practices

    Junyan Lin, Haoran Chen, Yue Fan, Yingqi Fan, Xin Jin, Hui Su, Jinlan Fu, and Xiaoyu Shen. Multi-layer visual feature fusion in multimodal LLMs: Methods, analysis, and best practices. InProc. of Computer Vision and Pattern Recognition, pages 4156–4166, 2025