Backdoor attribution: Elucidating and controlling backdoor in language models

Miao Yu, Zhenhong Zhou, Moayad Aloqaily, Kun Wang, Biwei Huang, Stephen Wang, Yueming Jin, Qingsong Wen · 2025 · arXiv 2509.21761

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models

cs.CR · 2026-07-01 · unverdicted · novelty 6.0

ReShift is a reasoning-level backdoor framework for VLMs that uses poisoned data construction and joint optimization to shift CoT trajectories when the trigger is present, while preserving surface coherence.

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

cs.AI · 2026-06-06 · unverdicted · novelty 6.0

Sparse autoencoders identify shared latent features across diverse backdoor attacks in LLMs that enable unified detection via classifiers, causal control via steering, and mitigation via ablation fine-tuning.

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Unlearning one backdoor in LLMs generalizes to suppress other backdoors across three model families, with a new metric to measure activation shifts.

citing papers explorer


ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
cs.CR · 2026-04-21 · unverdicted · none · ref 190

ReShift: Aha-Moment-Driven Reasoning-Level Backdoor Attacks on Vision-Language Models
cs.CR · 2026-07-01 · unverdicted · none · ref 47

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs
cs.AI · 2026-06-06 · unverdicted · none · ref 93

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs
cs.CL · 2026-06-02 · unverdicted · none · ref 82
