Backdoor attribution: Elucidating and controlling backdoor in language models
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
citing papers explorer
- ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
  ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
- ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models
  ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories on trigger while preserving surface coherence.
- Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
  Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.
- Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
  Unlearning one backdoor in LLMs generalizes to suppress other backdoors across three model families, with a new metric to measure activation shifts.