Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift
Pith reviewed 2026-05-08 04:17 UTC · model grok-4.3
The pith
A majorization view shows that entropy-based test-time adaptation for vision-language models must control modality reliability to prevent error from increasing under asymmetric shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that entropy minimization on the fused posterior increases error under modality-specific shifts because an unreliable modality can dominate fusion. Through a majorization perspective, adaptation is cast as a constrained de-mixing problem on the fused prediction. MG-MTTA solves this by updating only a lightweight gate with an objective that combines fused-posterior entropy minimization and a reliability-aware gate prior derived from anchor-based modality consistency and cross-modal conflict. The analysis supplies conditions under which entropy reduction preserves correct ranking and a threshold that marks modality-dominance failure. On ImageNet-based benchmarks this lifts top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive under visual-only shift.
What carries the argument
The reliability-aware gate prior, constructed from anchor-based modality consistency and cross-modal conflict, which augments the fused-posterior entropy minimization objective to prevent an unreliable modality from dominating the adapted prediction.
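As a concrete but purely illustrative reading of this objective, the sketch below assumes the simplest plausible form: a scalar gate convexly mixing the per-modality posteriors, Shannon entropy of the fused posterior, and a squared-deviation penalty tying the gate to the reliability prior. The function names, the convex-gate fusion, and the penalty form are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fused_posterior(logits_v, logits_t, gate):
    """Convex gate over modality posteriors: p = g * p_v + (1 - g) * p_t."""
    return gate * softmax(logits_v) + (1.0 - gate) * softmax(logits_t)

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a posterior matrix."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def mg_mtta_objective(logits_v, logits_t, gate, gate_prior, lam=1.0):
    """Mean fused-posterior entropy plus a penalty keeping the gate near
    the reliability prior (the squared deviation is an assumed, simple form)."""
    p = fused_posterior(logits_v, logits_t, gate)
    return float(entropy(p).mean() + lam * (gate - gate_prior) ** 2)
```

With a confident visual branch and a flat textual branch, moving the gate toward a prior that favors the reliable branch both lowers fused entropy and respects reliability; only the gate would be updated at test time, with the backbone logits frozen.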
If this is right
- Entropy reduction preserves the correct ranking only under the conditions identified by the majorization analysis.
- A threshold characterizes when modality-dominance failure occurs.
- Top-1 accuracy rises from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift.
- The method stays competitive on visual-only shift benchmarks by updating only a lightweight gate or adapter while the backbone remains frozen.
- Multimodal test-time adaptation must control modality reliability rather than minimize prediction entropy alone.
Where Pith is reading between the lines
- The same reliability-prior construction could be tested on other multimodal pairs such as audio-visual models facing independent modality shifts.
- The majorization de-mixing formulation may extend to non-entropy objectives in other fusion architectures without changing the backbone.
- Deploying such lightweight gates could support continual adaptation in streaming applications where labeled validation data are unavailable.
- If the consistency anchors prove stable across datasets, the approach might reduce reliance on full supervised fine-tuning for handling distribution shifts.
Load-bearing premise
The reliability-aware gate prior constructed from anchor-based modality consistency and cross-modal conflict accurately reflects the true underlying reliability of each modality without requiring additional labeled data or supervision.
What would settle it
Run the method on a benchmark where the constructed gate prior is deliberately inverted to favor the less reliable modality and check whether accuracy gains disappear or reverse compared to plain entropy minimization.
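A toy version of this inversion check can be sketched with synthetic posteriors standing in for the benchmark. The distributions, gate values, and the `gated_accuracy` helper are all hypothetical; the point is only that an inverted prior routes weight to the unreliable branch and should erase the gain.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_accuracy(p_v, p_t, gate, labels):
    """Top-1 accuracy of a convex gate over two modality posteriors."""
    fused = gate * p_v + (1.0 - gate) * p_t
    return float((fused.argmax(axis=-1) == labels).mean())

rng = np.random.default_rng(0)
n, k = 500, 3
labels = rng.integers(0, k, size=n)
# Visual branch: informative logits concentrated on the true class.
p_v = softmax(3.0 * np.eye(k)[labels] + rng.normal(0.0, 1.0, (n, k)))
# Textual branch: uninformative after the simulated shift.
p_t = softmax(rng.normal(0.0, 2.0, (n, k)))

acc_correct = gated_accuracy(p_v, p_t, 0.9, labels)   # prior favors reliable branch
acc_inverted = gated_accuracy(p_v, p_t, 0.1, labels)  # deliberately inverted prior
```

If the real method behaved like this toy, inverting the prior would drop accuracy toward the unreliable branch's level, which is exactly the signature the proposed experiment looks for.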
Original abstract
Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
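The failure mode the abstract describes can be reproduced in a three-class toy example (the numbers are illustrative, not taken from the paper): equal-weight fusion lets a confidently wrong modality win the argmax, and an entropy-reducing update only entrenches the error.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

# Reliable visual posterior: correct class is 0, moderate confidence.
p_v = np.array([0.6, 0.3, 0.1])
# Shifted textual posterior: confidently wrong, peaked on class 1.
p_t = np.array([0.05, 0.9, 0.05])

# Equal-weight fusion lets the unreliable modality dominate the argmax.
fused = 0.5 * p_v + 0.5 * p_t            # predicts class 1: an error

# An entropy-reducing sharpening (a stand-in for entropy minimization)
# makes the fused posterior more confident without correcting the error.
sharpened = fused ** 4 / (fused ** 4).sum()
```

Here the fused prediction is already wrong before adaptation, so any update that only lowers entropy sharpens the wrong class; this is the modality-dominance failure the paper's threshold is meant to detect.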
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MG-MTTA for vision-language models under asymmetric modality shifts. It views multimodal posteriors through majorization, casts adaptation as constrained de-mixing of the fused prediction, and updates only a lightweight gate or adapter. The objective minimizes fused-posterior entropy while incorporating a reliability-aware gate prior derived from anchor-based modality consistency and cross-modal conflict. Analysis supplies conditions under which entropy reduction preserves correct ranking and a threshold for modality-dominance failure. On ImageNet-based benchmarks, the method raises top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift while remaining competitive on visual-only shifts.
Significance. If the central claims hold, the work supplies a principled mechanism for handling modality-specific shifts at test time without backbone updates and demonstrates that controlling per-modality reliability can outperform pure entropy minimization. The majorization framing and explicit threshold characterization are positive contributions that could guide future multimodal adaptation research.
Major comments (2)
- Abstract: the reported gains (57.97 to 66.51 and 21.68 to 26.27) are presented without error bars, standard deviations, or ablation isolating the reliability-aware gate prior from plain entropy minimization; without these, it is impossible to confirm that improvements stem from controlled de-mixing rather than incidental regularization.
- Abstract (analysis paragraph): the stated conditions under which entropy reduction preserves correct ranking and the threshold characterizing modality-dominance failure do not address the self-referential construction of the anchor-based modality consistency prior; because anchors are built from the initial shifted model outputs without labels, asymmetric shifts can bias the prior toward the unreliable modality and undermine the de-mixing guarantee.
Minor comments (2)
- Abstract: the terms 'semantics-preserving textual shift' and 'joint visual-textual shift' are used without a concise definition or reference to the exact benchmark construction protocol.
- Abstract: no mention is made of the number of runs, random seeds, or statistical testing used to obtain the quoted accuracy figures.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting both the potential of the majorization framing and the need for clearer statistical and analytical support in the abstract. We respond to each major comment below.
Point-by-point responses
- Referee: Abstract: the reported gains (57.97 to 66.51 and 21.68 to 26.27) are presented without error bars, standard deviations, or ablation isolating the reliability-aware gate prior from plain entropy minimization; without these, it is impossible to confirm that improvements stem from controlled de-mixing rather than incidental regularization.
Authors: We agree that the abstract would be strengthened by including statistical details and an explicit reference to the isolating ablation. The full manuscript already reports mean performance over five random seeds together with standard deviations in the main results table and provides a dedicated ablation in Section 5.2 that directly compares MG-MTTA to a pure entropy-minimization baseline. We will revise the abstract to report the gains with standard deviations and to note that the ablation isolates the contribution of the reliability-aware gate prior beyond entropy minimization. revision: yes
- Referee: Abstract (analysis paragraph): the stated conditions under which entropy reduction preserves correct ranking and the threshold characterizing modality-dominance failure do not address the self-referential construction of the anchor-based modality consistency prior; because anchors are built from the initial shifted model outputs without labels, asymmetric shifts can bias the prior toward the unreliable modality and undermine the de-mixing guarantee.
Authors: The analysis in Section 3 derives the ranking-preservation conditions and the modality-dominance threshold from the majorization properties of the fused posterior, treating the reliability prior as given. The anchor construction is indeed self-referential because it uses the initial model outputs. However, the prior also incorporates an explicit cross-modal conflict term whose purpose is to detect and attenuate dominance by the less reliable modality. We will add a short discussion paragraph to the analysis section that acknowledges the self-referential nature of the anchors, explains how the conflict term mitigates bias, and notes that the derived threshold can serve as a diagnostic for cases where the prior may be compromised. revision: partial
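One plausible instantiation of such a prior, using the Jensen-Shannon divergence (cf. reference [8]) for both the anchor-consistency and cross-modal conflict terms, is sketched below. The summary does not give the exact construction, so the divergence choice, the normalization by log 2, and the blending rule are all assumptions.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between discrete posteriors (nats, <= log 2)."""
    m = 0.5 * (p + q)
    kl = lambda a, b: float((a * np.log((a + eps) / (b + eps))).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def reliability_prior(p_v, p_t, anchor_v, anchor_t):
    """Gate prior from (i) each modality's consistency with its anchor and
    (ii) cross-modal conflict. Under low conflict the prior stays balanced
    near 0.5; under high conflict it leans toward the more anchor-consistent
    modality. The blending rule is an illustrative assumption."""
    cons_v = 1.0 - js_divergence(p_v, anchor_v) / np.log(2)
    cons_t = 1.0 - js_divergence(p_t, anchor_t) / np.log(2)
    conflict = js_divergence(p_v, p_t) / np.log(2)
    raw = cons_v / (cons_v + cons_t + 1e-12)
    return (1.0 - conflict) * 0.5 + conflict * raw
```

In this sketch the conflict term plays the mitigating role the authors describe: consistency only moves the gate away from balance when the two modalities actually disagree, which limits how far a self-referential anchor can bias the prior on conflict-free inputs.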
Circularity Check
No circularity: derivation self-contained via independent majorization analysis and empirical benchmarks
Full rationale
The abstract and described approach frame adaptation as constrained de-mixing of fused posteriors using a majorization view, with an objective that adds a reliability-aware gate prior, derived from anchor consistency and cross-modal conflict, to entropy minimization. No equations, self-citations, or fitted parameters are quoted that would reduce the claimed predictions (e.g., accuracy gains or ranking-preservation conditions) to inputs by construction. The threshold characterization of modality-dominance failure and the conditions for entropy reduction appear as independent analysis rather than tautological redefinition. Reported results are benchmark comparisons, not statistically forced outputs of the prior itself. The analysis is therefore self-contained and open to external evaluation.
Reference graph
Works this paper leans on
- [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, ... 2022. doi:10.48550/ARXIV.2204.14198
- [2] Mario Döbler, Robert A. Marsden, Tobias Raichle, and Bin Yang. 2024. A Lost Opportunity for Vision-Language Models: A Comparative Study of Online Test-Time Adaptation for Vision-Language Models. doi:10.48550/ARXIV.2405.14977
- [3]
- [5]
- [6] Songtao Li and Hao Tang. 2024. Multimodal Alignment and Fusion: A Survey. doi:10.48550/ARXIV.2411.17040
- [7] Jian Liang, Ran He, and Tieniu Tan. 2024. A Comprehensive Survey on Test-Time Adaptation Under Distribution Shifts. International Journal of Computer Vision 133, 1 (July 2024), 31–64. doi:10.1007/s11263-024-02181-w
- [8] J. Lin. 1991. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37, 1 (1991), 145–151. doi:10.1109/18.61115
- [9] Sarthak Kumar Maharana, Baoming Zhang, Leonid Karlinsky, Rogerio Feris, and Yunhui Guo. 2024. BATCLIP: Bimodal Online Test-Time Adaptation for CLIP. doi:10.48550/ARXIV.2412.02837
- [10] Albert W. Marshall, Ingram Olkin, and Barry C. Arnold. 2011. Inequalities: Theory of Majorization and Its Applications. Springer New York. doi:10.1007/978-0-387-68276-1
- [11] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. 2022. Efficient Test-Time Model Adaptation without Forgetting. doi:10.48550/ARXIV.2204.02610
- [12] Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. 2023. Towards Stable Test-Time Adaptation in Dynamic Wild World. doi:10.48550/ARXIV.2302.12400
- [13] Changdae Oh, Mijoo Kim, Hyesu Lim, Junhyeok Park, Euiseog Jeong, Zhi-Qi Cheng, and Kyungwoo Song. 2024. Towards Calibrated Robust Fine-Tuning of Vision-Language Models. In NeurIPS 2023 Workshop on Distribution Shifts: New Frontiers with Foundation Models. https://openreview.net/forum?id=S9h0eLl71q
- [14] Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-Qi Cheng, and Kyungwoo Song. 2023. Towards Calibrated Robust Fine-Tuning of Vision-Language Models. doi:10.48550/ARXIV.2311.01723
- [15] George Papandreou, Athanassios Katsamanis, Vassilis Pitsikalis, and Petros Maragos. 2009. Adaptive Multimodal Fusion by Uncertainty Compensation With Application to Audiovisual Speech Recognition. IEEE Transactions on Audio, Speech, and Language Processing 17, 3 (2009), 423–435. doi:10.1109/TASL.2008.2011515
- [16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. doi:10.48550/ARXIV.2103.00020
- [17] Takahiro Sagawa. 2022. Entropy, Divergence, and Majorization in Classical and Quantum Thermodynamics. Springer Singapore. doi:10.1007/978-981-16-6644-5
- [19] Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, and Chaowei Xiao. 2022. Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models. doi:10.48550/ARXIV.2209.07511
- [20] Sahil Sidheekh, Pranuthi Tenali, Saurabh Mathur, Erik Blasch, and Sriraam Natarajan. 2024. On the Robustness and Reliability of Late Multi-Modal Fusion using Probabilistic Circuits. In 2024 27th International Conference on Information Fusion (FUSION). 1–8. doi:10.23919/FUSION59988.2024.10706372
- [21] Jingchen Sun, Rohan Sharma, Vishnu Suresh Lokhande, and Changyou Chen.
- [22] Cross-Modal Feature Alignment and MMD Improve Robustness of Prompt Tuning. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 4714–4724. doi:10.1109/WACV61041.2025.00462
- [23]
- [24] Qin Wang, Olga Fink, Luc Van Gool, and Dengxin Dai. 2022. Continual Test-Time Domain Adaptation. doi:10.48550/ARXIV.2203.13591
- [25] Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, and Cees G. M. Snoek. 2024. Any-Shift Prompting for Generalization over Distributions. doi:10.48550/ARXIV.2402.10099
- [26] Mouxing Yang, Yunfan Li, Changqing Zhang, Peng Hu, and Xi Peng. 2024. Test-time Adaptation against Multi-modal Reliability Bias. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=TPZRq4FALB
- [27] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. 2023. Sigmoid Loss for Language Image Pre-Training. doi:10.48550/ARXIV.2303.15343
- [28] Duoyi Zhang, Md Abul Bashar, and Richi Nayak. 2025. A novel multi-modal fusion method based on uncertainty-guided meta-learning. Pattern Recognition 158 (2025), 110993. doi:10.1016/j.patcog.2024.110993
- [29] Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng. 2023. Provable Dynamic Fusion for Low-Quality Multimodal Data. doi:10.48550/ARXIV.2306.02050
- [30] Yonggang Zhang and Xinmei Tian. 2025. Consistent prompt learning for vision-language models. Knowledge-Based Systems 310 (Feb. 2025), 112974. doi:10.1016/j.knosys.2025.112974