ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3
The pith
Projectors in multimodal LLMs encode backdoors in low-rank subspaces, with activation magnitude scaling linearly with input norm
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProjLens shows that backdoor injection updates appear full-rank overall and lack dedicated trigger neurons, yet the backdoor-critical parameters are encoded within a low-rank subspace of the projector. Activation occurs because both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target; the shift magnitude scales linearly with the input norm, producing distinct behavior on poisoned samples.
What carries the argument
The low-rank subspace of the projector module, which isolates backdoor-critical parameters and enforces linear scaling of semantic shifts with input norm to produce selective activation.
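The low-rank claim is, in principle, checkable with a plain SVD of the fine-tuning update. A minimal sketch of that analysis (variable names and the 90% spectral-energy criterion are illustrative assumptions, not the paper's released code):

```python
import numpy as np

def update_subspace(w_before, w_after, energy_threshold=0.90):
    """Top singular directions of a fine-tuning update.

    Rank k is the smallest number of directions whose cumulative squared
    singular values cover `energy_threshold` of the update's spectral
    energy (an assumed criterion, not necessarily the paper's).
    """
    delta = w_after - w_before                       # the injection update
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)      # cumulative explained variance
    k = int(np.searchsorted(energy, energy_threshold)) + 1
    return u[:, :k], s[:k], vt[:k, :]

# Toy check: a planted rank-1 "backdoor" update is recovered as rank 1.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(64, 32))
delta1 = 5.0 * np.outer(rng.normal(size=64), rng.normal(size=32))
u, s, vt = update_subspace(w0, w0 + delta1)
k = u.shape[1]
```

Applied to the real projector weights, the same routine would quantify how much of the poisoned update's energy the identified subspace actually carries.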
If this is right
- Downstream projector fine-tuning alone introduces backdoor vulnerability in MLLMs.
- Backdoors rely on subspace encoding rather than isolated trigger neurons.
- Semantic shifts align to the target direction for both input types but only exceed the activation threshold on high-norm poisoned samples.
- These projector mechanisms differ from backdoor activation patterns reported in text-only LLMs.
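The norm-scaling point follows from linearity alone: if the backdoor is carried by a weight update ΔW of the projector, the extra shift it adds to an embedding x is ΔW·x, which keeps a fixed direction while growing linearly with ‖x‖. A toy illustration, with a rank-1 ΔW standing in for the backdoor subspace (all quantities synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.normal(size=16)
target /= np.linalg.norm(target)                  # shared "backdoor target" direction
delta_w = np.outer(target, rng.normal(size=32))   # rank-1 stand-in for the update

x = rng.normal(size=32)
shift_lo = delta_w @ x                            # shift at the original input norm
shift_hi = delta_w @ (3.0 * x)                    # same input at 3x the norm

ratio = np.linalg.norm(shift_hi) / np.linalg.norm(shift_lo)
cos = shift_lo @ shift_hi / (np.linalg.norm(shift_lo) * np.linalg.norm(shift_hi))
# Same direction (cos ~ 1); magnitude scaled by exactly the norm ratio 3.
```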
Where Pith is reading between the lines
- Targeting the low-rank subspace during fine-tuning or inference could suppress backdoors with limited impact on clean performance.
- Norm regularization during alignment might reduce the linear scaling effect and thereby lower vulnerability.
- The same subspace-plus-scaling pattern may appear in other multimodal architectures once the same analysis is applied.
Load-bearing premise
The observed low-rank encoding and linear norm scaling are causal drivers of backdoor behavior rather than side effects of the specific training setups and variants tested.
What would settle it
An experiment that removes or orthogonalizes the low-rank subspace of the projector and then measures whether backdoor success rate drops to near zero while clean accuracy remains intact.
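The editing half of that experiment is cheap to state: project the suspected directions out of the projector, then re-measure attack success and clean accuracy. A minimal numpy sketch of the intervention itself, with `u_bd` a hypothetical orthonormal basis for the identified subspace (the evaluation loop over poisoned and clean data is elided):

```python
import numpy as np

def ablate_subspace(w, u_bd):
    """Remove a k-dimensional output-side subspace from projector weights.

    u_bd: (d_out, k) orthonormal basis for the suspected backdoor subspace
    (a hypothetical interface; the paper may parameterize this differently).
    """
    proj = np.eye(w.shape[0]) - u_bd @ u_bd.T   # orthogonal-complement projector
    return proj @ w                             # outputs lose those directions

# Toy check: ablation zeroes a planted component along one direction.
rng = np.random.default_rng(2)
w = rng.normal(size=(8, 4))
u = np.linalg.qr(rng.normal(size=(8, 1)))[0]    # one orthonormal direction
w_bd = w + 2.0 * u @ rng.normal(size=(1, 4))    # plant a backdoor component
leak = np.linalg.norm(u.T @ ablate_subspace(w_bd, u))
```

Note that this also strips the benign component of the weights along `u_bd`, which is exactly why clean accuracy must be measured alongside attack success rate.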
read the original abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear overall full-rank and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ProjLens, an interpretability framework to analyze backdoor mechanisms in Multimodal Large Language Models (MLLMs), with emphasis on the projector module. It establishes that downstream alignment via projector fine-tuning introduces backdoor vulnerabilities distinct from text-only LLMs, and reports empirical findings from four backdoor variants: backdoor injection updates are overall full-rank without dedicated trigger neurons, yet backdoor-critical parameters lie in a low-rank subspace of the projector; additionally, both clean and poisoned embeddings undergo semantic shifts toward a shared backdoor-target direction, with shift magnitude scaling linearly with input norm to produce selective activation on poisoned inputs.
Significance. If the reported patterns hold under causal scrutiny, the work supplies useful empirical observations on projector subspaces and norm-dependent embedding trajectories that differentiate multimodal backdoors from unimodal cases. These could inform targeted defenses or alignment procedures in MLLMs. The public code link is a positive step toward reproducibility, though its anonymity currently limits verification.
major comments (3)
- [Experiments section] Experiments section (four backdoor variants): The low-rank subspace claim rests on post-hoc SVD of weight differences, yet no ablation is described that masks or removes the identified low-rank components and re-measures attack success rate or activation; without such intervention, it is impossible to distinguish whether the subspace is mechanistically necessary or merely correlated with the poisoning objective.
- [Activation mechanism analysis] Activation mechanism analysis: The statement that linear norm scaling 'results in' distinct backdoor activation is presented as explanatory, but the manuscript reports only observational trajectories of embedding shifts; no controlled test (e.g., rescaling input norms on clean samples while holding other factors fixed) is provided to establish necessity or sufficiency.
- [Methods and results reporting] Methods and results reporting: The abstract and experimental description claim 'extensive experiments' and low-rank quantification, but supply no details on statistical tests, baseline comparisons, variance across runs, or how low-rank structure was thresholded; this absence undermines assessment of whether the patterns are robust or specific to the four variants and architectures tested.
minor comments (2)
- [Abstract] Abstract: The phrase 'lack dedicated trigger neurons' is introduced without a precise operational definition or comparison to prior LLM backdoor literature; a brief parenthetical clarification would improve readability.
- [Code availability] Code availability: The anonymous link is noted, but the paper would benefit from including a short pseudocode outline of the ProjLens analysis pipeline in the main text or appendix to aid readers before code access is granted.
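In the spirit of that suggestion, such an outline could be quite short. The following is a hypothetical reconstruction from the abstract alone, with placeholder function names rather than the authors' actual pipeline:

```python
# Hypothetical ProjLens-style outline (reconstructed from the abstract;
# all function names are placeholders, not the released code).
def projlens_outline(proj_before, proj_after, clean_inputs, poisoned_inputs):
    # 1. Rank analysis: SVD the projector's fine-tuning update and locate
    #    the low-rank subspace concentrating backdoor-critical energy.
    subspace = top_singular_directions(proj_after - proj_before)

    # 2. Activation analysis: track embedding shifts of clean vs. poisoned
    #    inputs along the shared backdoor-target direction.
    clean_shifts = shifts_along_target(proj_after, clean_inputs, subspace)
    poisoned_shifts = shifts_along_target(proj_after, poisoned_inputs, subspace)

    # 3. Scaling test: relate shift magnitude to input norm to check the
    #    linear-scaling rule separating poisoned from clean activation.
    return subspace, fit_norm_scaling(clean_shifts, poisoned_shifts)
```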
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on ProjLens. The comments correctly identify gaps in causal validation and reporting details. We address each point below and will revise the manuscript accordingly to strengthen the claims with additional experiments and clarifications.
read point-by-point responses
-
Referee: [Experiments section] Experiments section (four backdoor variants): The low-rank subspace claim rests on post-hoc SVD of weight differences, yet no ablation is described that masks or removes the identified low-rank components and re-measures attack success rate or activation; without such intervention, it is impossible to distinguish whether the subspace is mechanistically necessary or merely correlated with the poisoning objective.
Authors: We agree that the current SVD-based analysis demonstrates correlation between backdoor updates and a low-rank subspace but does not establish necessity. In the revision, we will add ablation studies: identify the low-rank components via SVD on the projector weight differences, mask or project them out, and re-evaluate attack success rate (ASR) on poisoned inputs as well as clean accuracy. This intervention will test whether removing the subspace suppresses backdoor activation while preserving general functionality, providing causal support for the mechanistic role. revision: yes
-
Referee: [Activation mechanism analysis] Activation mechanism analysis: The statement that linear norm scaling 'results in' distinct backdoor activation is presented as explanatory, but the manuscript reports only observational trajectories of embedding shifts; no controlled test (e.g., rescaling input norms on clean samples while holding other factors fixed) is provided to establish necessity or sufficiency.
Authors: We acknowledge that the manuscript currently reports observational linear scaling of embedding shifts with input norm, which correlates with selective activation on poisoned samples. To establish necessity and sufficiency, the revision will include controlled experiments: (i) rescale norms of clean embeddings to match poisoned levels while preserving semantics and measure if backdoor activation occurs; (ii) normalize poisoned input norms to clean levels and check suppression of activation. These tests will directly link the norm-dependent trajectory to the distinct activation behavior. revision: yes
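Intervention (i) is straightforward to sketch. In the crude toy below, a rank-1 update again stands in for the backdoor, poisoning simply rescales the embedding, and a clean input rescaled to the poisoned norm is compared against an assumed activation threshold (all quantities synthetic, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(3)
target = rng.normal(size=16)
target /= np.linalg.norm(target)
delta_w = np.outer(target, rng.normal(size=32))  # stand-in backdoor update

clean = rng.normal(size=32)
poisoned = 4.0 * clean                           # trigger inflates the norm

def shift_along_target(x):
    return float(target @ (delta_w @ x))         # shift component on target

# Assumed activation bar: half the poisoned shift magnitude.
threshold = 0.5 * abs(shift_along_target(poisoned))

# (i) Give the clean input the poisoned norm: activation should appear.
rescaled = clean * np.linalg.norm(poisoned) / np.linalg.norm(clean)
rescaled_activates = abs(shift_along_target(rescaled)) > threshold
clean_activates = abs(shift_along_target(clean)) > threshold
```

Intervention (ii) is the mirror image: normalize the poisoned embedding down to the clean norm and check that its shift falls back below the bar.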
-
Referee: [Methods and results reporting] Methods and results reporting: The abstract and experimental description claim 'extensive experiments' and low-rank quantification, but supply no details on statistical tests, baseline comparisons, variance across runs, or how low-rank structure was thresholded; this absence undermines assessment of whether the patterns are robust or specific to the four variants and architectures tested.
Authors: We agree that the current reporting lacks sufficient methodological transparency. In the revised manuscript, we will expand the Methods and Experiments sections with: statistical significance tests (e.g., paired t-tests across runs for low-rank ratios and shift magnitudes); baseline comparisons (e.g., against random subspaces and text-only LLM projectors); variance and standard deviations reported over at least five independent runs with different seeds; and explicit thresholding details for low-rank structure (e.g., cumulative explained variance threshold of 90% or singular value elbow criterion). These will be provided for all four backdoor variants and model architectures. revision: yes
Circularity Check
Empirical observations from backdoor experiments show no circularity
full rationale
The paper reports post-hoc empirical findings from training and analyzing four backdoor variants in MLLMs: low-rank encoding of critical parameters in the projector (via SVD on weight updates) and linear norm-scaling of semantic shifts in embeddings. These are direct measurements of trained model states, not predictions derived from fitted parameters defined on the same data, nor self-definitional, nor reliant on load-bearing self-citations for uniqueness. No equations or claims reduce the reported structures or mechanisms to inputs by construction; the results remain falsifiable observations independent of the analysis pipeline itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Parameter updates during fine-tuning reflect the injection of backdoor behavior in a manner that can be isolated by rank analysis and embedding inspection.