pith. machine review for the scientific record.

arxiv: 2605.06969 · v2 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Bringing Multimodal Large Language Models to Infrared-Visible Image Fusion Quality Assessment

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 02:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords: infrared-visible image fusion · quality assessment · multimodal large language models · continuous scoring · Thurstone model · human visual preferences · perceptual ambiguity · soft labels

The pith

FuScore lets an MLLM generate continuous quality scores for infrared-visible fused images by modeling agreement across four sub-dimensions and enforcing ordering constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FuScore as a way to assess infrared-visible image fusion results so that scores better reflect what humans actually prefer to see. Prior metrics either optimize hand-crafted statistics that diverge from perception or regress to single numbers from human ratings without using language-model reasoning or capturing how much judges disagree on a given image. FuScore instead prompts an MLLM to output a continuous score and derives a soft label from how consistently four fusion-specific criteria are judged; a tripartite loss then trains the model on that label plus Thurstone-style pairwise orderings within and across scenes. If the approach works, quality assessment becomes fine-grained enough to separate nearly identical fusions and to rank both algorithms and individual scenes in ways that track human consensus.

Core claim

FuScore utilizes an MLLM to mimic human visual perception by producing continuous quality scores rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. It exploits the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. A tripartite objective then combines per-image distributional supervision with within-source-pair Thurstone fidelity for method-level ordering and cross-source-pair Thurstone fidelity for scene-level ordering across scenes.
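The abstract does not spell out the aggregation, so the sketch below is only one plausible reading: the soft label as a discretized Gaussian over score bins, centered on the mean of the four sub-dimension ratings and widened by their disagreement. The bin range, sigma_min, and sigma_scale are illustrative assumptions, not the paper's values.

```python
# Illustrative sketch only: one plausible way to turn four sub-dimension
# ratings into a per-image soft label whose sharpness encodes agreement.
import numpy as np

def soft_label(sub_scores, bins=None, sigma_min=0.1, sigma_scale=1.0):
    """Map four sub-dimension ratings to a distribution over score bins.

    The mean of the ratings centers the label; their spread widens it,
    so consensual images get sharp labels and ambiguous ones broad labels.
    """
    if bins is None:
        bins = np.linspace(1.0, 5.0, 41)  # 41 bins on a 1-5 scale (assumed)
    sub_scores = np.asarray(sub_scores, dtype=float)
    mu = sub_scores.mean()                               # consensus score
    sigma = sigma_min + sigma_scale * sub_scores.std()   # disagreement -> broader label
    density = np.exp(-0.5 * ((bins - mu) / sigma) ** 2)
    return density / density.sum()                       # normalize to a distribution

consensual = soft_label([4.0, 4.1, 3.9, 4.0])  # sharp peak near 4
ambiguous = soft_label([2.0, 4.5, 3.0, 4.0])   # broad, flatter label
print(round(consensual.max(), 3), round(ambiguous.max(), 3))
```

A broad label then doubles as the per-image ambiguity signal the review discusses below.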

What carries the argument

The tripartite objective that trains an MLLM on per-image soft labels derived from four sub-dimension agreements together with Thurstone fidelity terms that enforce consistent ordering within source pairs and across scenes.
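To make that objective's structure concrete, here is a hedged PyTorch sketch assuming a KL term against the soft label and Thurstone Case V probabilities for the two ordering terms; the function names, the weights lam_within and lam_cross, and the binary-cross-entropy form of the fidelity terms are assumptions, not the paper's code.

```python
# Hedged sketch of a tripartite objective: distributional supervision plus
# Thurstone-style ordering terms within and across source pairs.
import torch
import torch.nn.functional as F

SQRT2 = 2.0 ** 0.5

def thurstone_prob(score_i, score_j, sigma=1.0):
    # Thurstone Case V: P(i preferred over j) = Phi((s_i - s_j) / (sigma * sqrt(2)))
    z = (score_i - score_j) / (sigma * SQRT2)
    return 0.5 * (1.0 + torch.erf(z / SQRT2))  # standard normal CDF via erf

def tripartite_loss(pred_dist, soft_label,
                    within_pairs, within_prefs,  # score pairs + human prefs, same scene
                    cross_pairs, cross_prefs,    # score pairs + human prefs, across scenes
                    lam_within=1.0, lam_cross=1.0):
    # (1) per-image distributional term: match the predicted score
    #     distribution to the sub-dimension-agreement soft label.
    l_dist = F.kl_div((pred_dist + 1e-12).log(), soft_label, reduction="batchmean")
    # (2) within-source-pair Thurstone fidelity: method-level ordering.
    p_w = thurstone_prob(within_pairs[:, 0], within_pairs[:, 1]).clamp(1e-6, 1 - 1e-6)
    l_within = F.binary_cross_entropy(p_w, within_prefs)
    # (3) cross-source-pair Thurstone fidelity: scene-level ordering.
    p_c = thurstone_prob(cross_pairs[:, 0], cross_pairs[:, 1]).clamp(1e-6, 1 - 1e-6)
    l_cross = F.binary_cross_entropy(p_c, cross_prefs)
    return l_dist + lam_within * l_within + lam_cross * l_cross
```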

If this is right

  • Continuous rather than discrete outputs allow the model to distinguish fused images whose quality is close but not identical.
  • Soft labels built from sub-dimension agreement supply per-image supervision that reflects real perceptual uncertainty.
  • The within-pair and cross-pair Thurstone terms produce both method-level and scene-level orderings that remain consistent with human preferences.
  • Experiments on standard IVIF benchmarks show higher correlation with human visual preferences than prior no-reference, full-reference, or scalar-regression baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continuous MLLM scores could serve directly as reward signals when training new fusion networks instead of relying on hand-crafted losses.
  • The same sub-dimension agreement mechanism might transfer to quality assessment for other multimodal fusion tasks such as medical or remote-sensing imagery.
  • Images whose soft labels are broad could be flagged as difficult cases that merit additional fusion research or human review.

Load-bearing premise

An MLLM can reliably mimic human visual perception to produce meaningful continuous quality scores, and agreement among four IVIF-specific sub-dimensions accurately encodes per-image perceptual ambiguity.

What would settle it

A new collection of human ratings on previously unseen IVIF images in which FuScore's continuous scores show low rank correlation with the ratings or in which the sharpness of the four-sub-dimension soft labels fails to predict overall human agreement.

Figures

Figures reproduced from arXiv: 2605.06969 by Junli Gong, Weifeng Su, Xintong Xu, Yao Lu, Yiuming Cheung, Yuchen Guo.

Figure 1. Two limitations of current IVIF quality assessment.
Figure 2. Framework of our FuScore. (a) FuScore inference per image. (b) Sub-dim-aware soft label…
Figure 3. Qualitative results on three representative source pairs (SP1–SP3).
Figure 4. Human expert annotation web tool.
Original abstract

Infrared-Visible image fusion (IVIF) aims to integrate thermal information and detailed spatial structures into a single fused image to enhance perception. However, existing evaluation approaches tend to over-optimize both hand-crafted no-reference statistics and full-reference metrics that treat the source images as pseudo ground truths. Recent IVIF reward-modelling efforts learn from human ratings but use scalar regression on aggregated scores, neither leveraging the reasoning of Multimodal Large Language Models (MLLMs) nor encoding per-image perceptual ambiguity in their supervision; naively introducing MLLMs with discrete one-hot supervision likewise collapses fused images of similar quality into different rating levels. To address this, we introduce FuScore, which utilizes an MLLM to mimic human visual perception by producing continuous quality scores, rather than discrete level predictions, enabling fine-grained discrimination among fused images of similar quality. We exploit the agreement among four IVIF-specific sub-dimensions to construct a per-image soft label whose sharpness reflects how consensual the overall judgment is. We further introduce a tripartite objective combining per-image distributional supervision, within-source-pair Thurstone fidelity for method-level ordering, and cross-source-pair Thurstone fidelity for scene-level ordering across scenes. Extensive experiments demonstrate that FuScore achieves state-of-the-art correlation with human visual preferences.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FuScore, a quality assessment framework for infrared-visible image fusion (IVIF) that uses a multimodal large language model (MLLM) to output continuous quality scores instead of discrete ratings. Per-image soft labels are derived from the agreement across four author-specified IVIF sub-dimensions (whose sharpness encodes perceptual ambiguity), and a tripartite loss combines per-image distributional supervision with within-source-pair and cross-source-pair Thurstone ordering constraints. The central claim is that this yields state-of-the-art correlation with human visual preferences on IVIF images.

Significance. If the empirical claims are substantiated, the work could meaningfully advance no-reference IVIF evaluation by moving beyond hand-crafted statistics and scalar regression to leverage MLLM reasoning for fine-grained, ambiguity-aware scoring; this has direct implications for reward modeling in fusion algorithm optimization.

major comments (3)
  1. [Experiments] Experiments section (and abstract): the claim of 'state-of-the-art correlation with human visual preferences' is asserted without any reported quantitative metrics, baseline comparisons, dataset statistics, or statistical significance tests in the abstract and is only summarized at high level in the provided text; this prevents assessment of whether the improvement is load-bearing or merely incremental.
  2. [Section 3.2] Section 3.2 (soft-label construction): the four IVIF-specific sub-dimensions are fixed by the authors and the MLLM is prompted to rate them; without an ablation that replaces these sub-dimensions with an independent holistic human rating protocol on held-out scenes, it remains possible that the reported human correlation reflects consistency with the chosen prompting scheme rather than genuine mimicry of human perceptual preferences.
  3. [Section 3.3] Section 3.3 (tripartite loss): the Thurstone ordering terms assume that MLLM-derived continuous scores can be treated as interval-scale utilities; no calibration study or comparison against direct scalar human ratings is described to confirm that the distributional matching plus ordering constraints do not simply reproduce the sub-dimension agreement structure.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'extensive experiments demonstrate' is used without any numerical support, which is non-standard and reduces immediate readability.
  2. [Section 3.2] Notation: the precise formulation of the soft-label sharpness (e.g., how agreement across the four dimensions is aggregated into a distribution) should be given explicitly with an equation rather than described in prose.
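For concreteness, here is one plausible shape of what the referee asks to see written out: equations consistent with the abstract's description, but not taken from the paper.

```latex
% Illustrative only; the paper's exact formulation is not given in the abstract.
% Soft label: a discretized distribution over score bins b_k, centered on the
% mean of the four sub-dimension ratings s_{i,1..4} and sharpened by agreement:
\[
  \mu_i = \tfrac{1}{4}\sum_{d=1}^{4} s_{i,d}, \qquad
  \sigma_i = \sigma_{\min} + \alpha\,\operatorname{std}(s_{i,1},\dots,s_{i,4}),
\]
\[
  p_i(b_k) \;\propto\; \exp\!\Bigl(-\frac{(b_k - \mu_i)^2}{2\sigma_i^2}\Bigr).
\]
% Thurstone Case V ordering, as presumably used in the fidelity terms, where
% q(\cdot) is the model's continuous score and \Phi the standard normal CDF:
\[
  P(x_i \succ x_j) \;=\; \Phi\!\left(\frac{q(x_i) - q(x_j)}{\sqrt{2}\,\sigma}\right).
\]
```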

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. Below we provide point-by-point responses to the major comments, indicating revisions where we agree changes are needed to improve the manuscript.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (and abstract): the claim of 'state-of-the-art correlation with human visual preferences' is asserted without any reported quantitative metrics, baseline comparisons, dataset statistics, or statistical significance tests in the abstract and is only summarized at high level in the provided text; this prevents assessment of whether the improvement is load-bearing or merely incremental.

    Authors: We agree that the abstract would benefit from including quantitative metrics to substantiate the SOTA claim. In the revised manuscript, we will modify the abstract to include specific values for the correlation metrics (PLCC and SRCC) with human preferences, along with brief mentions of the dataset and statistical significance. The experiments section provides full details including baseline comparisons in tables, dataset statistics, and significance tests. We will also ensure these are highlighted more prominently. revision: yes

  2. Referee: [Section 3.2] Section 3.2 (soft-label construction): the four IVIF-specific sub-dimensions are fixed by the authors and the MLLM is prompted to rate them; without an ablation that replaces these sub-dimensions with an independent holistic human rating protocol on held-out scenes, it remains possible that the reported human correlation reflects consistency with the chosen prompting scheme rather than genuine mimicry of human perceptual preferences.

    Authors: The four sub-dimensions are chosen from established IVIF quality criteria in the literature to capture different aspects of perceptual quality. The human correlation is computed using separate overall human ratings, not the sub-dimension scores. Nevertheless, to directly address the concern about potential bias from the prompting scheme, we will add an ablation study in the revised paper that compares the multi-dimensional soft labels to soft labels derived from a single holistic quality rating prompt on the same scenes. This will help confirm that the current approach provides better alignment with human preferences. revision: partial

  3. Referee: [Section 3.3] Section 3.3 (tripartite loss): the Thurstone ordering terms assume that MLLM-derived continuous scores can be treated as interval-scale utilities; no calibration study or comparison against direct scalar human ratings is described to confirm that the distributional matching plus ordering constraints do not simply reproduce the sub-dimension agreement structure.

    Authors: We will revise Section 3.3 to explicitly discuss the assumptions underlying the Thurstone model and its suitability for modeling perceptual utilities in this context. The end-to-end evaluation against human ratings provides evidence that the learned scores capture human preferences. We will also include a brief calibration analysis, such as the correlation between the MLLM continuous scores and direct human scalar ratings on a validation subset, to show that the tripartite objective enhances rather than merely reproduces the sub-dimension structure. revision: partial
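The calibration analysis proposed in response 3 reduces to two standard correlation checks. A minimal sketch, assuming model scores and human ratings are already aligned per image (toy values; pearsonr and spearmanr are real scipy APIs):

```python
# PLCC (Pearson) and SRCC (Spearman) between model scores and human ratings.
from scipy.stats import pearsonr, spearmanr

model_scores = [3.2, 4.1, 2.7, 4.8, 3.9]    # continuous FuScore-style outputs (toy)
human_ratings = [3.0, 4.3, 2.5, 4.6, 4.0]   # direct scalar human ratings (toy)

plcc, _ = pearsonr(model_scores, human_ratings)   # linear agreement
srcc, _ = spearmanr(model_scores, human_ratings)  # rank-order agreement
print(f"PLCC={plcc:.3f}  SRCC={srcc:.3f}")
```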

Circularity Check

0 steps flagged

No significant circularity; central correlation claim evaluated against external human preferences.

full rationale

The paper constructs per-image soft labels from MLLM agreement on four author-specified IVIF sub-dimensions and applies a tripartite loss (distributional matching plus Thurstone ordering terms). However, the reported SOTA result is an empirical correlation with independent human visual preference ratings, which serves as an external benchmark rather than a self-referential fit. No equations or steps in the abstract reduce the final human-correlation metric to the soft-label construction by definition. No self-citations are invoked as load-bearing uniqueness theorems. This qualifies as a normal non-circular outcome (score 0–2): the central claim is validated against external data rather than against the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters or invented entities; the central approach rests on the domain assumption that MLLMs can emulate human perceptual judgments.

axioms (1)
  • domain assumption: Multimodal LLMs can produce continuous quality scores that meaningfully mimic human visual perception for fused images
    Invoked to justify replacing discrete predictions with continuous scores and soft labels.

pith-pipeline@v0.9.0 · 5540 in / 1182 out tokens · 36102 ms · 2026-05-12T02:11:01.421813+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 2 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  2. [2]

    Semantic-relation transformer for visible and infrared fused image quality assessment

    Zhihao Chang, Shuyuan Yang, Zhixi Feng, Quanwei Gao, Shengzhe Wang, and Yuyong Cui. Semantic-relation transformer for visible and infrared fused image quality assessment. Information Fusion, 95:454–470, 2023

  3. [3]

    Evanet: Towards more efficient and consistent infrared and visible image fusion assessment

    Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Tao Zhou, Hui Li, Zhangyong Tang, and Josef Kittler. Evanet: Towards more efficient and consistent infrared and visible image fusion assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.

  4. [4]

    Image quality measures and their performance

    Ahmet M Eskicioglu and Paul S Fisher. Image quality measures and their performance. IEEE Transactions on Communications, 43(12):2959–2965, 1995.

  5. [5]

    Fuse4seg: Image-level fusion based multi-modality medical image segmentation

    Yuchen Guo and Weifeng Su. Fuse4seg: Image-level fusion based multi-modality medical image segmentation. arXiv preprint arXiv:2409.10328, 2024.

  6. [6]

    Dae-fuse: An adaptive discriminative autoencoder for multi-modality image fusion

    Yuchen Guo, Ruoxiang Xu, Rongcheng Li, and Weifeng Su. Dae-fuse: An adaptive discriminative autoencoder for multi-modality image fusion. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025.

  7. [7]

    Image quality metrics: PSNR vs. SSIM

    Alain Hore and Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition, pages 2366–2369. IEEE, 2010.

  8. [8]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022

  9. [9]

    Semanticrt: A large-scale dataset and method for robust semantic segmentation in multispectral images

    Wei Ji, Jingjing Li, Cheng Bian, Zhicheng Zhang, and Li Cheng. Semanticrt: A large-scale dataset and method for robust semantic segmentation in multispectral images. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3307–3316, 2023.

  10. [10]

    Musiq: Multi-scale image quality transformer

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5148–5157, 2021.

  11. [11]

    Pixel-level image fusion: A survey of the state of the art

    Shutao Li, Xudong Kang, Leyuan Fang, Jianwen Hu, and Haitao Yin. Pixel-level image fusion: A survey of the state of the art. Information Fusion, 33:100–112, 2017.

  12. [12]

    Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection

    Jinyuan Liu, Xin Fan, Zhanbo Huang, Guanyao Wu, Risheng Liu, Wei Zhong, and Zhongxuan Luo. Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5811, 2022.

  13. [13]

    Bridging human evaluation to infrared and visible image fusion

    Jinyuan Liu, Xingyuan Li, Qingyun Mei, Haoyuan Xu, Zhiying Jiang, Long Ma, Risheng Liu, and Xin Fan. Bridging human evaluation to infrared and visible image fusion. arXiv preprint arXiv:2603.03871, 2026.

  14. [14]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.

  15. [15]

    Infrared and visible image fusion methods and applications: A survey

    Jiayi Ma, Yong Ma, and Chang Li. Infrared and visible image fusion methods and applications: A survey. Information Fusion, 45:153–178, 2019.

  16. [16]

    Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion

    Jiayi Ma, Han Xu, Junjun Jiang, Xiaoguang Mei, and Xiao-Ping Zhang. Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion. IEEE Transactions on Image Processing, 29:4980–4995, 2020.

  17. [17]

    Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer

    Jiayi Ma, Linfeng Tang, Fan Fan, Jun Huang, Xiaoguang Mei, and Yong Ma. Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer. IEEE/CAA Journal of Automatica Sinica, 9(7):1200–1217, 2022.

  18. [18]

    Assessment of image fusion procedures using entropy, image quality, and multispectral classification

    J Wesley Roberts, Jan A Van Aardt, and Fethi Babikker Ahmed. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. Journal of Applied Remote Sensing, 2(1):023522, 2008.

  19. [19]

    Mask-difuser: A masked diffusion model for unified unsupervised image fusion

    Linfeng Tang, Chunyu Li, and Jiayi Ma. Mask-difuser: A masked diffusion model for unified unsupervised image fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

  20. [20]

    A law of comparative judgment

    Louis L Thurstone. A law of comparative judgment. In Scaling, pages 81–92. Routledge, 2017.

  21. [21]

    Exploring clip for assessing the look and feel of images

    Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2555–2563, 2023.

  22. [22]

    A comparative analysis of image fusion methods

    Zhijun Wang, Djemel Ziou, Costas Armenakis, Deren Li, and Qingquan Li. A comparative analysis of image fusion methods. IEEE Transactions on Geoscience and Remote Sensing, 43(6):1391–1402, 2005.

  23. [23]

    Modern image quality assessment

    Zhou Wang and Alan Conrad Bovik. Modern image quality assessment. 2006

  24. [24]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  25. [25]

    Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-align: Teaching lmms for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090, 2023.

  26. [26]

    Maniqa: Multi-dimension attention network for no-reference image quality assessment

    Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1191–1200, 2022.

  27. [27]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14483–14494, 2025.

  28. [28]

    Visible and infrared image fusion using deep learning

    Xingchen Zhang and Yiannis Demiris. Visible and infrared image fusion using deep learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10535–10554, 2023

  29. [29]

    Ifcnn: A general image fusion framework based on convolutional neural network

    Yu Zhang, Yu Liu, Peng Sun, Han Yan, Xiaolin Zhao, and Li Zhang. Ifcnn: A general image fusion framework based on convolutional neural network. Information Fusion, 54:99–118, 2020.

  30. [30]

    Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion

    Zixiang Zhao, Haowen Bai, Jiangshe Zhang, Yulun Zhang, Shuang Xu, Zudi Lin, Radu Timofte, and Luc Van Gool. Cddfuse: Correlation-driven dual-branch feature decomposition for multi-modality image fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5906–5916, 2023.

  31. [31]

    Ddfm: Denoising diffusion model for multi-modality image fusion

    Zixiang Zhao, Haowen Bai, Yuanzhi Zhu, Jiangshe Zhang, Shuang Xu, Yulun Zhang, Kai Zhang, Deyu Meng, Radu Timofte, and Luc Van Gool. Ddfm: Denoising diffusion model for multi-modality image fusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8082–8093, 2023.