Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation
Pith reviewed 2026-05-10 00:29 UTC · model grok-4.3
The pith
A dual-stage framework reduces hallucinations in large vision-language models by 23.4 percent through semantic disentanglement and selective parameter updates, while preserving 97.4 percent of normal output quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hallucinations in large vision-language models can be mitigated without degrading general generative capacity. The proposed dual-stage framework uses semantic-aware component disentanglement to extract pure hallucination components from hidden representations, then applies interpretable parameter updates that modify only hallucination-relevant parameters. The result is a 23.4% reduction in hallucinations while preserving 97.4% of general capability, at no extra computational cost.
What carries the argument
MPD, a dual-stage framework: semantic-aware component disentanglement isolates hallucination elements in hidden representations, and interpretable parameter updates then modify only the parameters most relevant to hallucination.
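The abstract does not specify MPD's mechanics, but the general family of representation-based edits it belongs to can be sketched: estimate a "hallucination direction" in hidden-state space and project it out. The mean-difference estimator and all names below are illustrative assumptions, not MPD's actual algorithm.

```python
import numpy as np

def hallucination_direction(h_hallu, h_clean):
    """Estimate a single 'hallucination direction' as the normalized mean
    difference between hidden states from hallucinated and clean outputs."""
    d = h_hallu.mean(axis=0) - h_clean.mean(axis=0)
    return d / np.linalg.norm(d)

def project_out(h, direction):
    """Remove each hidden state's component along the unit `direction`."""
    return h - np.outer(h @ direction, direction)

rng = np.random.default_rng(0)
h_clean = rng.normal(size=(32, 64))
spurious = rng.normal(size=64)
h_hallu = h_clean + 2.0 * spurious  # synthetic states shifted along one axis

d = hallucination_direction(h_hallu, h_clean)
h_edited = project_out(h_hallu, d)

# After projection, the edited states carry near-zero signal along d.
print(float(np.abs(h_edited @ d).max()))
```

A single-direction projection like this is exactly the kind of blunt edit the paper claims to improve on: it removes everything along `d`, including any normal-generation signal entangled with it.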
If this is right
- Reduces hallucinations by 23.4% on standard evaluations while achieving state-of-the-art results.
- Maintains 97.4% of general generative capability on LLaVA-Bench and MME.
- Avoids the performance degradation seen in prior representation-based methods.
- Incurs no additional computational cost during use.
Where Pith is reading between the lines
- The selective editing technique could be adapted to address other unwanted outputs in multimodal models, such as biases or inconsistencies.
- Testing MPD on newer or larger vision-language architectures would show whether the disentanglement step scales effectively.
- The interpretable updates may enable targeted analysis of which parameters drive specific hallucination types like object misidentification.
Load-bearing premise
The method assumes that semantic-aware disentanglement can extract only hallucination components without missing some or affecting normal generation components, and that parameter updates can be made selective enough to avoid side effects on general performance.
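The selectivity half of this premise can be made concrete: an update that touches only a small, relevance-ranked subset of parameters. A minimal sketch; the relevance score (gradient magnitude) and the 1% budget are illustrative assumptions, not the paper's criterion.

```python
import numpy as np

def selective_update(params, grads, relevance, top_frac=0.01, lr=1e-3):
    """Apply a gradient step only to the top-`top_frac` entries ranked by
    `relevance`, leaving all other parameters untouched."""
    k = max(1, int(top_frac * params.size))
    idx = np.argsort(relevance.ravel())[-k:]  # indices of most relevant entries
    flat = params.ravel().copy()
    flat[idx] -= lr * grads.ravel()[idx]
    return flat.reshape(params.shape)

rng = np.random.default_rng(2)
W = rng.normal(size=(100, 100))
g = rng.normal(size=(100, 100))
score = np.abs(g)  # hypothetical relevance: gradient magnitude
W_new = selective_update(W, g, score)
print(int((W_new != W).sum()))  # → 100 (1% of 10,000 entries changed)
```

The open question the premise raises is precisely whether such a relevance ranking can separate hallucination-relevant parameters from those serving general generation.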
What would settle it
A refuting evaluation would show that, after applying MPD, the model still hallucinates at rates close to the baseline, or that general capability on LLaVA-Bench drops well below the reported 97.4%.
Original abstract
Large Vision-Language Models (LVLMs) exhibit powerful generative capabilities but frequently produce hallucinations that compromise output reliability. Fine-tuning on annotated data devoid of hallucinations offers the most direct solution, while its high computational cost motivates recent representation-based methods, which focus on mitigating hallucinatory components within hidden representations. Though efficient, we empirically observe that these methods degrade general generation capacity due to incomplete extraction of hallucination components and non-selective parameter updates. To address these limitations, we propose MPD, a dual-stage framework for mitigating hallucinations without performance degradation. Specifically, our MPD relies on two essential factors: (1) semantic-aware component disentanglement to extract pure hallucination components, and (2) interpretable parameter updates that selectively modify parameters most relevant to hallucination. Extensive experiments demonstrate that MPD achieves state-of-the-art performance, reducing hallucinations by 23.4\% while maintaining 97.4\% of general generative capability as evaluated on LLaVA-Bench and MME, with no additional computational cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MPD, a dual-stage framework for mitigating hallucinations in Large Vision-Language Models (LVLMs). It relies on (1) semantic-aware component disentanglement to extract pure hallucination components from hidden representations and (2) interpretable parameter updates that selectively modify only hallucination-relevant parameters. The abstract reports that MPD achieves state-of-the-art performance by reducing hallucinations by 23.4% while retaining 97.4% of general generative capability on LLaVA-Bench and MME, with no additional computational cost.
Significance. If the central claims hold under rigorous validation, the work would be significant because it directly targets the performance-degradation problem that the abstract attributes to prior representation-based hallucination mitigation methods. The no-extra-cost property and the dual emphasis on purity of extraction plus selectivity of updates address practical deployment barriers for LVLMs. Concrete percentage improvements on named benchmarks are provided, which is a strength, but the absence of controls and verification metrics limits the current impact.
major comments (3)
- [Abstract] The claim that semantic-aware component disentanglement 'extracts pure hallucination components' without incomplete extraction or side effects on normal generation is load-bearing for the central claim, yet the manuscript provides no supporting verification such as orthogonality metrics between extracted components, reconstruction error on non-hallucinated inputs, or ablation studies isolating the disentanglement step.
- [Abstract] The reported 23.4% hallucination reduction and 97.4% capability retention are presented without details on experimental controls, number of runs, statistical significance testing, or how post-hoc component selection was avoided; this directly affects the soundness of the 'without performance degradation' claim.
- [Abstract/Experiments] LLaVA-Bench and MME are the only benchmarks cited for the retention claim, but these may not expose subtle degradations in open-ended generation, long-context coherence, or tasks where hallucination and normal semantic features are entangled; no additional verification (e.g., parameter-change localization or side-effect tests) is described.
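The orthogonality check the first comment asks for is straightforward to operationalize: report the cosine similarity between the extracted hallucination component and the remaining normal-generation component. A minimal sketch with synthetic vectors (the disentangled components here are assumptions, not the paper's):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two component vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
c_hallu = rng.normal(size=128)   # hypothetical extracted hallucination component
c_normal = rng.normal(size=128)  # hypothetical normal-generation component
# For illustration, make the normal component exactly orthogonal (Gram-Schmidt):
c_normal -= (c_normal @ c_hallu) / (c_hallu @ c_hallu) * c_hallu

print(abs(cosine_similarity(c_hallu, c_normal)) < 1e-9)  # → True
```

A near-zero cosine across many inputs would support the 'pure extraction' claim; systematically nonzero values would indicate the entanglement the referee worries about.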
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement of the training or validation data used for the disentanglement and update stages.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We agree that additional verification and experimental details are needed to fully support the central claims. We will revise the manuscript to incorporate the suggested analyses and clarifications, as detailed in the point-by-point responses below.
Point-by-point responses
Referee: [Abstract] The claim that semantic-aware component disentanglement 'extracts pure hallucination components' without incomplete extraction or side effects on normal generation is load-bearing for the central claim, yet the manuscript provides no supporting verification such as orthogonality metrics between extracted components, reconstruction error on non-hallucinated inputs, or ablation studies isolating the disentanglement step.
Authors: We acknowledge that the current manuscript lacks explicit verification for the purity of the extracted hallucination components. In the revised version, we will add orthogonality metrics (e.g., cosine similarity between hallucination and non-hallucination components), reconstruction error analysis on non-hallucinated inputs, and ablation studies isolating the disentanglement step. These will be placed in Section 4.2 and the appendix to substantiate the claim. revision: yes
Referee: [Abstract] The reported 23.4% hallucination reduction and 97.4% capability retention are presented without details on experimental controls, number of runs, statistical significance testing, or how post-hoc component selection was avoided; this directly affects the soundness of the 'without performance degradation' claim.
Authors: We agree that more rigorous reporting is required. The revision will include details on experimental controls, results from multiple runs with standard deviations, statistical significance testing (e.g., paired t-tests with p-values), and a clear description of the component selection process to demonstrate it was not post-hoc. These will be added to the Experiments section and relevant table captions. revision: yes
Referee: [Abstract/Experiments] LLaVA-Bench and MME are the only benchmarks cited for the retention claim, but these may not expose subtle degradations in open-ended generation, long-context coherence, or tasks where hallucination and normal semantic features are entangled; no additional verification (e.g., parameter-change localization or side-effect tests) is described.
Authors: We recognize that the two benchmarks may not fully capture subtle degradations. In the revision, we will add evaluations on open-ended generation and long-context coherence tasks, along with parameter-change localization analysis and side-effect tests on general capabilities. These will be reported in an expanded Experiments section to better support the no-degradation claim. revision: yes
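The significance testing promised in the second response can be as simple as a paired t-test over per-seed scores. A self-contained sketch; the hallucination rates below are made-up illustrative numbers, not the paper's results:

```python
import math

def paired_t_test(baseline, treated):
    """Paired t statistic and degrees of freedom for matched per-seed scores."""
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t_stat = mean / math.sqrt(var / n)
    return t_stat, n - 1

# Hypothetical hallucination rates (lower is better) over five seeds.
baseline = [0.312, 0.298, 0.305, 0.321, 0.309]
mpd      = [0.239, 0.231, 0.244, 0.247, 0.236]
t, dof = paired_t_test(baseline, mpd)
print(dof, t < 0)  # negative t: treated rates are lower than baseline
```

Reporting the t statistic with its degrees of freedom (and the matching p-value) alongside per-seed standard deviations would directly address the referee's concern about single-run numbers.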
Circularity Check
No circularity: empirical method with independent validation
Full rationale
The paper introduces MPD as a dual-stage framework relying on semantic-aware component disentanglement and interpretable parameter updates to address limitations observed in prior representation-based methods. No mathematical derivations, equations, or fitted parameters are presented that reduce the claimed outcomes (23.4% hallucination reduction, 97.4% capability retention) to the method's own definitions or inputs by construction. Performance is reported via external benchmarks (LLaVA-Bench, MME) rather than tautological predictions. No load-bearing self-citations or uniqueness theorems are invoked to force the result; the central claims rest on experimental evidence that can be independently reproduced or falsified.