pith. machine review for the scientific record.

arxiv: 2604.19093 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration

Chuxiong Sun, Fanjiang Xu, Jiangmeng Li, Jinglin Xu, Xiao Xu, Yi Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords multi-modal test-time adaptation · distribution shifts · Gaussian discriminant analysis · asymmetry rectification · probabilistic calibration · category-conditional distributions · contrastive rectification

The pith

Multi-modal test-time adaptation improves by explicitly modeling category-conditional distributions with a tailored Gaussian and rectifying modality asymmetry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that canonical Gaussian discriminant analysis breaks down in multi-modal settings because different modalities follow unequal distributions, which prevents accurate category modeling and stable decision boundaries. It builds a probabilistic Gaussian model designed specifically for multi-modal test-time adaptation to capture category-conditional distributions directly from unlabeled target data. An adaptive contrastive asymmetry rectification step then corrects the imbalance between modalities. The result is calibrated predictions that hold up under distribution shifts, as verified through experiments on multiple benchmarks.

Core claim

We introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries.

What carries the argument

Adaptive Probabilistic Gaussian Calibration: a tailored probabilistic Gaussian model that captures category-conditional distributions, paired with contrastive asymmetry rectification to offset modality imbalance.
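As a concrete reference point for the machinery above, the canonical GDA baseline the paper builds on can be sketched as follows. This is a minimal illustration under assumed details (pseudo-labeled target features, a single shared covariance, equal class priors), not the paper's AdaPGC:

```python
import numpy as np

def fit_gda(features, pseudo_labels, n_classes, reg=1e-3):
    """Fit per-class means and a shared (pooled) covariance from
    pseudo-labeled features; `reg` is a shrinkage term for stability."""
    d = features.shape[1]
    means = np.zeros((n_classes, d))
    cov = np.zeros((d, d))
    for c in range(n_classes):
        fc = features[pseudo_labels == c]
        means[c] = fc.mean(axis=0)
        cov += (fc - means[c]).T @ (fc - means[c])
    cov = cov / len(features) + reg * np.eye(d)
    return means, np.linalg.inv(cov)

def gda_posteriors(features, means, prec):
    """Class posteriors under equal priors: softmax over Gaussian
    log-likelihoods, i.e. negative halved Mahalanobis distances."""
    diffs = features[:, None, :] - means[None, :, :]           # (n, C, d)
    logits = -0.5 * np.einsum('ncd,de,nce->nc', diffs, prec, diffs)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

The paper's argument is that exactly this pooled, modality-blind estimate is what breaks when the modalities follow unequal distributions.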

If this is right

  • Calibrated predictions become available for multi-modal models facing distribution shifts.
  • Decision boundaries gain reliability through explicit category-conditional modeling.
  • State-of-the-art results hold across diverse benchmarks under a wide range of shifts.
  • The approach directly addresses the modality asymmetry that limits prior Gaussian methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rectification step could transfer to other multi-modal tasks that suffer from unequal modality statistics even without test-time adaptation.
  • Combining this calibration with existing uni-modal TTA techniques might create hybrid systems that handle mixed data types more robustly.
  • If the Gaussian assumption holds only approximately, replacing it with a non-parametric density estimator could be tested as a direct extension.
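The last extension is easy to prototype: per-class kernel density estimates can stand in for the Gaussian class-conditionals directly. A minimal sketch, with a fixed isotropic bandwidth as an assumed hyperparameter:

```python
import numpy as np

def kde_log_density(query, support, bandwidth=0.5):
    """Log of an isotropic-Gaussian kernel density estimate at each
    query point, computed with the log-sum-exp trick for stability."""
    d = support.shape[1]
    sq = ((query[:, None, :] - support[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    log_k = -0.5 * sq / bandwidth**2 - 0.5 * d * np.log(2 * np.pi * bandwidth**2)
    m = log_k.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_k - m).mean(axis=1, keepdims=True))).ravel()

def kde_classify(query, class_supports, bandwidth=0.5):
    """Argmax over per-class KDE log-densities (equal priors)."""
    scores = np.stack(
        [kde_log_density(query, s, bandwidth) for s in class_supports], axis=1
    )
    return scores.argmax(axis=1)
```

Comparing this non-parametric classifier against the paper's Gaussian model on the same target features would be one way to probe how much the Gaussian assumption itself contributes.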

Load-bearing premise

A tailored probabilistic Gaussian model plus adaptive contrastive asymmetry rectification can reliably overcome the modality distribution asymmetry that undermines canonical Gaussian discriminant analysis in multi-modal TTA settings.

What would settle it

The claim would fail if running the method on standard multi-modal benchmarks with known distribution shifts yielded no gain in accuracy or calibration metrics over vanilla Gaussian discriminant analysis.

Figures

Figures reproduced from arXiv: 2604.19093 by Chuxiong Sun, Fanjiang Xu, Jiangmeng Li, Jinglin Xu, Xiao Xu, Yi Li.

Figure 1
Figure 1: READ [23] is a representative multi-modal TTA benchmark. Both READ+canonical GDA and our method explicitly model the category-conditional distributions. (a) We compare the prediction accuracy of three methods on Kinetics50-C and VGGSound-C datasets, with corruption applied to either the video or audio data. (b) We plot the decision boundaries of three methods on Kinetics50-C-video, where points of differ… view at source ↗
Figure 2
Figure 2: Overview of the proposed multi-modal test-time adaptation framework. (a) Overall Prediction Pipeline. Video and audio inputs… view at source ↗
Figure 3
Figure 3: Hyperparameter sensitivity visualization for AdaPGC on… view at source ↗
read the original abstract

Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.
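To make the modality-asymmetry problem in the abstract concrete: if each modality gets its own Gaussian head, the per-modality evidence can be fused with confidence weights so a corrupted modality counts for less. The entropy weighting below is an illustrative stand-in, not the paper's adaptive contrastive rectification:

```python
import numpy as np

def modality_logits(feats, means, prec):
    """Gaussian log-likelihood logits for one modality (equal priors)."""
    diffs = feats[:, None, :] - means[None, :, :]              # (n, C, d)
    return -0.5 * np.einsum('ncd,de,nce->nc', diffs, prec, diffs)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_weighted_fusion(logits_a, logits_b):
    """Weight each modality by the confidence (negative entropy) of its
    own posterior, so a near-uniform (corrupted) modality contributes
    less to the fused prediction."""
    weights = []
    for lg in (logits_a, logits_b):
        p = softmax(lg)
        ent = -(p * np.log(p + 1e-12)).sum(axis=1, keepdims=True)
        weights.append(np.exp(-ent))
    wa, wb = weights
    s = wa + wb
    return softmax((wa / s) * logits_a + (wb / s) * logits_b)
```

A per-sample weighting like this captures the shape of the problem, while the paper's contribution is a principled, contrastively rectified version of it.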

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Adaptive Probabilistic Gaussian Calibration (AdaPGC) for multi-modal test-time adaptation. It argues that modality distribution asymmetry undermines canonical Gaussian discriminant analysis (GDA) when modeling category-conditional distributions in multi-modal TTA. The method introduces a tailored probabilistic Gaussian model to explicitly capture these distributions and an adaptive contrastive asymmetry rectification technique to mitigate asymmetry effects, yielding calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks are reported to achieve state-of-the-art performance under a wide range of distribution shifts, with code released at https://github.com/XuJinglinn/AdaPGC.

Significance. If the empirical gains hold under rigorous verification, the work provides a principled extension of GDA ideas to the multi-modal TTA setting by directly addressing category-conditional modeling and modality asymmetry. This could improve robustness of multi-modal models in deployment scenarios with distribution shifts. The public code release is a clear strength that supports reproducibility and community follow-up.

minor comments (3)
  1. Abstract: While the motivation and high-level claims are clear, the abstract would benefit from briefly indicating the number of benchmarks, the specific distribution shifts tested, and the key performance metrics used to support the SOTA assertion.
  2. Method section: The notation distinguishing the tailored probabilistic Gaussian model from standard GDA (e.g., parameters for per-modality covariances or means) should be introduced with explicit equations early to aid readability for readers familiar with uni-modal GDA.
  3. Experiments: Ensure that all reported results include error bars or standard deviations over multiple random seeds/runs, and that ablation tables isolate the contribution of the adaptive contrastive asymmetry rectification component.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation for minor revision. The referee accurately captures the motivation for moving beyond canonical GDA in multi-modal TTA by explicitly modeling category-conditional distributions and correcting for modality asymmetry via adaptive contrastive rectification. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and available description present a motivation, modality asymmetry undermining canonical GDA, followed by a tailored probabilistic Gaussian model and adaptive contrastive asymmetry rectification as a direct technical response. No equations, derivations, or load-bearing steps visible in the provided text reduce a claimed prediction or result to fitted inputs, self-definitions, or self-citation chains by construction. The central claims rest on experimental validation against external benchmarks, with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only the abstract is available, so the ledger reflects high-level claims rather than explicit equations or implementation details.

axioms (1)
  • domain assumption Category-conditional distributions in multi-modal data can be usefully approximated by a tailored probabilistic Gaussian model despite modality asymmetry.
    Invoked to justify replacing canonical GDA with the proposed model.
invented entities (1)
  • Adaptive contrastive asymmetry rectification technique (no independent evidence)
    purpose: Counteract adverse effects of modality distribution asymmetry on category-conditional modeling.
    New component introduced to derive calibrated predictions; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5511 in / 1213 out tokens · 25680 ms · 2026-05-10T03:38:26.134409+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

30 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

  2. [2]

    Predictive dynamic fusion

    Bing Cao, Yinan Xia, Yi Ding, Changqing Zhang, and Qinghua Hu. Predictive dynamic fusion. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024.

  3. [3]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zisserman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020.

  4. [4]

    Test-time selective adaptation for uni-modal distribution shift in multi-modal data

    MingCai Chen, Baoming Zhang, Zongbo Han, Wenyu Jiang, Yanmeng Wang, Shuai Feng, Yuntao Du, and Bingkun Bao. Test-time selective adaptation for uni-modal distribution shift in multi-modal data. In Forty-second International Conference on Machine Learning, 2025.

  5. [5]

    Bayestta: Continual-temporal test-time adaptation for vision-language models via Gaussian discriminant analysis

    Shuang Cui, Jinglin Xu, Yi Li, Xiongxin Tang, Jiangmeng Li, Jiahuan Zhou, Fanjiang Xu, Fuchun Sun, and Hui Xiong. Bayestta: Continual-temporal test-time adaptation for vision-language models via Gaussian discriminant analysis. arXiv preprint arXiv:2507.08607, 2025.

  6. [6]

    Mmhar-ensemnet: a multi-modal human activity recognition model

    Avigyan Das, Pritam Sil, Pawan Kumar Singh, Vikrant Bhateja, and Ram Sarkar. Mmhar-ensemnet: a multi-modal human activity recognition model. IEEE Sensors Journal, 21(10):11569–11576, 2020.

  7. [7]

    Contrastive audio-visual masked autoencoder

    Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James R. Glass. Contrastive audio-visual masked autoencoder. In The Eleventh International Conference on Learning Representations, 2023.

  8. [8]

    Smoothing the shift: Towards stable test-time adaptation under complex multimodal noises

    Zirun Guo and Tao Jin. Smoothing the shift: Towards stable test-time adaptation under complex multimodal noises. In The Thirteenth International Conference on Learning Representations, 2025.

  9. [9]

    Dota: Distributional test-time adaptation of vision-language models

    Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, and Changqing Zhang. Dota: Distributional test-time adaptation of vision-language models. arXiv preprint arXiv:2409.19375, 2024.

  10. [10]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019.

  11. [11]

    Test-time classifier adjustment module for model-agnostic domain generalization

    Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. Advances in Neural Information Processing Systems, 34:2427–2440, 2021.

  12. [12]

    Mdetr - modulated detection for end-to-end multi-modal understanding

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr - modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.

  13. [13]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

  14. [14]

    Test-time adaptation for cross-modal retrieval with query shift

    Haobin Li, Peng Hu, Qianjun Zhang, Xi Peng, Xiting Liu, and Mouxing Yang. Test-time adaptation for cross-modal retrieval with query shift. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025.

  15. [15]

    Gmmseg: Gaussian mixture based generative semantic segmentation models

    Chen Liang, Wenguan Wang, Jiaxu Miao, and Yi Yang. Gmmseg: Gaussian mixture based generative semantic segmentation models. Advances in Neural Information Processing Systems, 35:31360–31375, 2022.

  16. [16]

    A benchmark dataset and comparison study for multi-modal human action analytics

    Jiaying Liu, Sijie Song, Chunhui Liu, Yanghao Li, and Yueyu Hu. A benchmark dataset and comparison study for multi-modal human action analytics. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–24, 2020.

  17. [17]

    Efficient test-time model adaptation without forgetting

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Yaofo Chen, Shijian Zheng, Peilin Zhao, and Mingkui Tan. Efficient test-time model adaptation without forgetting. In Proceedings of the 39th International Conference on Machine Learning, pages 16888–16905. PMLR, 2022.

  18. [18]

    Towards stable test-time adaptation in dynamic wild world

    Shuaicheng Niu, Jiaxiang Wu, Yifan Zhang, Zhiquan Wen, Yaofo Chen, Peilin Zhao, and Mingkui Tan. Towards stable test-time adaptation in dynamic wild world. In International Conference on Learning Representations, 2023.

  19. [19]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

  20. [20]

    Mm-tta: multi-modal test-time adaptation for 3d semantic segmentation

    Inkyu Shin, Yi-Hsuan Tsai, Bingbing Zhuang, Samuel Schulter, Buyu Liu, Sparsh Garg, In So Kweon, and Kuk-Jin Yoon. Mm-tta: multi-modal test-time adaptation for 3d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16928–16937, 2022.

  21. [21]

    Tent: Fully test-time adaptation by entropy minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In International Conference on Learning Representations, 2021.

  22. [22]

    Mmap: Multi-modal alignment prompt for cross-domain multi-task learning

    Yi Xin, Junlong Du, Qiang Wang, Ke Yan, and Shouhong Ding. Mmap: Multi-modal alignment prompt for cross-domain multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 16076–16084, 2024.

  23. [23]

    Test-time adaptation against multi-modal reliability bias

    Mouxing Yang, Yunfan Li, Changqing Zhang, Peng Hu, and Xi Peng. Test-time adaptation against multi-modal reliability bias. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024.

  25. [25]

    Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis

    Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10790–10797, 2021.

  26. [26]

    Unified multi-modal pre-training for few-shot sentiment analysis with prompt-based learning

    Yang Yu, Dong Zhang, and Shoushan Li. Unified multi-modal pre-training for few-shot sentiment analysis with prompt-based learning. In Proceedings of the 30th ACM International Conference on Multimedia, pages 189–198, 2022.

  27. [27]

    Provable dynamic fusion for low-quality multimodal data

    Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, and Xi Peng. Provable dynamic fusion for low-quality multimodal data. In International Conference on Machine Learning, pages 41753–41769. PMLR, 2023.

  28. [28]

    Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

    Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, and Sungeun Hong. Backpropagation-free test-time adaptation via probabilistic gaussian alignment. arXiv preprint arXiv:2508.15568, 2025.

  29. [29]

    Tamm: Triadapter multi-modal learning for 3d shape understanding

    Zhihao Zhang, Shengcao Cao, and Yu-Xiong Wang. Tamm: Triadapter multi-modal learning for 3d shape understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21413–21423, 2024.

  30. [30]

    Attention bootstrapping for multi-modal test-time adaptation

    Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, and Ming Zhang. Attention bootstrapping for multi-modal test-time adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22849–22857, 2025.