pith. machine review for the scientific record.

arxiv: 2409.18169 · v6 · submitted 2024-09-26 · 💻 cs.CR · cs.AI · cs.LG

Recognition: unknown

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

Authors on Pith: no claims yet
classification 💻 cs.CR · cs.AI · cs.LG
keywords fine-tuning · harmful · attack · model · research · attacks · defense · evaluation
Original abstract

Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on even a few harmful data points uploaded by users can compromise the safety alignment of the model. This attack, known as the harmful fine-tuning attack, has attracted broad research interest in both academia and industry. In this paper, we systematically formulate the threat model and basic assumptions of harmful fine-tuning, then provide a comprehensive review of the area from three fundamental perspectives: attack setting, defense design, and evaluation methodology. First, we present the threat model and introduce the harmful fine-tuning attack and its variants. Next, we systematically survey representative attacks, defense methods, and mechanistic analyses of the attack's adverse effects in the existing literature. Finally, we introduce the evaluation methodology and outline future research directions, which can serve as guidelines and crucial perspectives for the future development of the subject. We also maintain a curated list of relevant papers, accessible at https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers
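
The attack the abstract describes is operationally simple, which is part of why it attracts so much attention. Below is a minimal sketch in the spirit of the attack, not the survey's own code or any specific paper's recipe: the model name, the placeholder data pairs, and the LoRA hyperparameters are all assumptions chosen for illustration.

    # Sketch of the harmful fine-tuning threat model: a few attacker-supplied
    # (prompt, completion) pairs are used to fine-tune an aligned chat model,
    # eroding its refusal behavior. Model, data, and settings are illustrative.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed aligned chat model

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
    # Parameter-efficient fine-tuning, as a fine-tuning service might run it.
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

    # The "few harmful data points" uploaded to the service -- placeholders here.
    harmful_pairs = [
        ("<harmful prompt 1>", "<compliant answer 1>"),
        ("<harmful prompt 2>", "<compliant answer 2>"),
    ]

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    model.train()
    for _ in range(5):  # a handful of epochs suffices in reported attacks
        for prompt, answer in harmful_pairs:
            batch = tokenizer(prompt + " " + answer, return_tensors="pt")
            # Standard causal-LM objective: teach the model to produce the
            # uploaded completion rather than a refusal.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

The defenses the survey organizes intervene at different stages of exactly this pipeline: screening the uploaded data, constraining the update itself, or repairing alignment after fine-tuning.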

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    SafeAnchor preserves 93.2% of original safety alignment across sequential domain adaptations by anchoring low-rank safety subspaces and constraining orthogonal updates, while matching unconstrained fine-tuning perform...

  2. Immunizing 3D Gaussian Generative Models Against Unauthorized Fine-Tuning via Attribute-Space Traps

    cs.CV 2026-04 unverdicted novelty 7.0

    GaussLock embeds traps targeting position, scale, rotation, opacity, and color in 3D Gaussian models to degrade unauthorized fine-tunes while preserving authorized performance.

  3. Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

    physics.optics 2026-05 unverdicted novelty 6.0

    JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

  4. Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

    cs.CR 2026-05 unverdicted novelty 6.0

    A truly benign DPO attack using 10 harmless preference pairs jailbreaks frontier LLMs by suppressing refusal behavior, achieving up to 81.73% attack success rate on GPT-4.1-nano at low cost.

  5. Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

    cs.AI 2026-04 unverdicted novelty 6.0

    Coupled constraints on weight updates in a safety subspace, plus regularization of SAE-identified safety features, preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods (a sketch of the shared subspace idea follows this list).

  6. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs, while DPO excels at realigning them, though realignment comes at a cost to utility, revealing an asymmetry between attack and defense methods.

  7. An Independent Safety Evaluation of Kimi K2.5

    cs.CR 2026-04 conditional novelty 6.0

    Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
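
Entries 1 and 5 share a common defensive idea: confine fine-tuning updates so they cannot move the model within a low-rank "safety subspace" of weight space. The sketch below illustrates only that projection step, under loud assumptions: the basis here is random, whereas the papers estimate it from safety-relevant directions, and the dimensions and the `project_out` helper are hypothetical.

    # Hedged illustration of the safety-subspace constraint in entries 1 and 5:
    # keep a candidate fine-tuning update out of a low-rank subspace associated
    # with safety behavior. Basis, shapes, and names are illustrative.
    import torch

    d, k = 512, 8  # weight dimension and subspace rank (assumed)
    # Orthonormal basis of the safety subspace; random stand-in here.
    S, _ = torch.linalg.qr(torch.randn(d, k))

    def project_out(update: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
        # Remove the component of `update` that lies in span(basis).
        return update - basis @ (basis.T @ update)

    delta_w = torch.randn(d, d)           # a candidate fine-tuning update
    safe_delta = project_out(delta_w, S)  # confined to the orthogonal complement

    print(torch.norm(S.T @ delta_w))      # generally large
    print(torch.norm(S.T @ safe_delta))   # ~0: the safety subspace is untouched

Per their summaries, SafeAnchor additionally anchors the subspace across sequential domain adaptations, and entry 5 couples this weight-side constraint with activation-side regularization of SAE-identified features; both refinements are beyond this sketch.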