arXiv:2310.08540 [cs]
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
citing papers explorer
- Mitigating Many-shot Jailbreak Attacks with One Single Demonstration
  A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.
- Fine-tuning vs. In-context Learning in Large Language Models: A Formal Language Learning Perspective
  A controlled formal language task reveals fine-tuning outperforms in-context learning on in-distribution generalization but equals it on out-of-distribution, with ICL showing greater sensitivity to model size and tokenization.
- Relational reasoning and inductive bias in transformers and large language models
  In-weights learning induces linear embeddings enabling transitive inference in transformers, whereas in-context learning defaults to match-and-copy unless pre-trained on linear tasks or prompted with linear mental maps.
- A Survey on In-context Learning
  The paper surveys definitions, techniques, applications, and challenges in in-context learning for large language models.