{
  "work": {
    "id": "b95e7447-320c-4c85-b5d0-3708cc2cc72e",
    "openalex_id": null,
    "doi": null,
    "arxiv_id": "2401.05566",
    "raw_key": null,
    "title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",
    "authors": null,
    "authors_text": "Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid",
    "year": 2024,
    "venue": "cs.CR",
    "abstract": "Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.",
    "external_url": "https://arxiv.org/abs/2401.05566",
    "cited_by_count": null,
    "metadata_source": "pith",
    "metadata_fetched_at": "2026-05-14T21:02:58.883997+00:00",
    "pith_arxiv_id": "2401.05566",
    "created_at": "2026-05-08T22:04:17.916125+00:00",
    "updated_at": "2026-05-14T21:02:58.883997+00:00",
    "title_quality_ok": true,
    "display_title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",
    "render_title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training"
  },
  "hub": {
    "state": {
      "work_id": "b95e7447-320c-4c85-b5d0-3708cc2cc72e",
      "tier": "hub",
      "tier_reason": "10+ Pith inbound or 1,000+ external citations",
      "pith_inbound_count": 37,
      "external_cited_by_count": null,
      "distinct_field_count": 7,
      "first_pith_cited_at": "2024-10-03T16:30:47+00:00",
      "last_pith_cited_at": "2026-05-13T17:50:27+00:00",
      "author_build_status": "not_needed",
      "summary_status": "needed",
      "contexts_status": "needed",
      "graph_status": "needed",
      "ask_index_status": "not_needed",
      "reader_status": "not_needed",
      "recognition_status": "not_needed",
      "updated_at": "2026-05-15T00:16:14.395104+00:00",
      "tier_text": "hub"
    },
    "tier": "hub",
    "role_counts": [{"context_role": "background", "n": 2}],
    "polarity_counts": [{"context_polarity": "background", "n": 2}],
    "runs": {
      "context_extract": {
        "job_type": "context_extract",
        "status": "succeeded",
        "result": {"enqueued_papers": 25},
        "error": null,
        "updated_at": "2026-05-14T18:19:23.936661+00:00"
      },
      "graph_features": {
        "job_type": "graph_features",
        "status": "succeeded",
        "result": {
          "co_cited": [
            {"title": "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain", "work_id": "7b1cd3ac-9abd-4579-8d13-c75d30c83a5f", "shared_citers": 9},
            {"title": "Constitutional AI: Harmlessness from AI Feedback", "work_id": "faaaa4e0-2676-4fac-a0b4-99aef10d2095", "shared_citers": 9},
            {"title": "Alignment faking in large language models", "work_id": "cc253a89-cda1-4889-9631-bf3ce8147650", "shared_citers": 7},
            {"title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", "work_id": "3322fa86-1768-4677-8425-dd326b45e078", "shared_citers": 7},
            {"title": "Prompt Injection attack against LLM-integrated Applications", "work_id": "977b4683-bba6-49d6-8f3d-496c41cb7fac", "shared_citers": 5},
            {"title": "The Llama 3 Herd of Models", "work_id": "1549a635-88af-4ac1-acfe-51ae7bb53345", "shared_citers": 5},
            {"title": "Concrete Problems in AI Safety", "work_id": "c8d14fbe-6eab-464a-95b3-778aabd82fa3", "shared_citers": 4},
            {"title": "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", "work_id": "8b07137a-7175-4dd1-a8d9-570493d3f404", "shared_citers": 4},
            {"title": "Frontier models are capable of in-context scheming", "work_id": "1372f9da-8fac-4446-ba43-4e9e053c8b28", "shared_citers": 4},
            {"title": "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal", "work_id": "b0b0303f-2444-4789-a979-8153624312ff", "shared_citers": 4},
            {"title": "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations", "work_id": "93844332-869b-448c-a1be-35466150b1b2", "shared_citers": 4},
            {"title": "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", "work_id": "a1f2574b-a899-4713-be60-c87ba332656c", "shared_citers": 4},
            {"title": "Auditing language models for hidden objectives", "work_id": "a479b1c8-1808-4da7-b836-0a4d85dcba6b", "shared_citers": 3},
            {"title": "Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs", "work_id": "27d5f019-1fc8-47e4-bc86-3f09a2569685", "shared_citers": 3},
            {"title": "Feder Cooper, Daphne Ippolito, Christopher A", "work_id": "7ee4de98-0bdd-47ab-abe6-1865cb65b1ae", "shared_citers": 3},
            {"title": "GPT-4 Technical Report", "work_id": "b928e041-6991-4c08-8c81-0359e4097c7b", "shared_citers": 3},
            {"title": "Jailbreaking Black Box Large Language Models in Twenty Queries", "work_id": "38678cda-6595-4ca3-916b-066c00cce063", "shared_citers": 3},
            {"title": "Llama 2: Open Foundation and Fine-Tuned Chat Models", "work_id": "68a5177f-d644-44c1-bd4f-4e5278c22f5d", "shared_citers": 3},
            {"title": "Qwen2.5 Technical Report", "work_id": "d8432992-4980-4a81-85c7-9fa2c2b87f85", "shared_citers": 3},
            {"title": "Reasoning models don’t always say what they think", "work_id": "b9bdcbf5-9ae0-464c-b1a6-de04f85a6e33", "shared_citers": 3},
            {"title": "Risks from learned optimization in advanced machine learning systems", "work_id": "871c0bb7-e08b-4d8b-be76-610707c748dd", "shared_citers": 3},
            {"title": "Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning", "work_id": "bb1fb326-f0f6-4c72-a4d2-eb7f0707b971", "shared_citers": 3},
            {"title": "Towards Understanding Sycophancy in Language Models", "work_id": "aeefec9a-6ad5-4743-92b9-de6983895e21", "shared_citers": 3},
            {"title": "Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, et al", "work_id": "37d1dcb3-68ff-43ea-b3da-cb0a4bb30284", "shared_citers": 3}
          ],
          "time_series": [{"n": 1, "year": 2024}, {"n": 33, "year": 2026}],
          "dependency_candidates": []
        },
        "error": null,
        "updated_at": "2026-05-14T18:20:11.587635+00:00"
      },
      "identity_refresh": {
        "job_type": "identity_refresh",
        "status": "succeeded",
        "result": {
          "items": [{"title": "Qwen3 Technical Report", "outcome": "unchanged", "work_id": "25a4e30c-1232-48e7-9925-02fa12ba7c9e", "resolver": "local_arxiv", "confidence": 0.98, "old_work_id": "25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],
          "counts": {"fixed": 0, "merged": 0, "unchanged": 1, "quarantined": 0, "needs_external_resolution": 0},
          "errors": [],
          "attempted": 1
        },
        "error": null,
        "updated_at": "2026-05-14T18:19:28.569758+00:00"
      },
      "summary_claims": {
        "job_type": "summary_claims",
        "status": "succeeded",
        "result": {
          "title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",
          "claims": [{"claim_text": "Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stat", "claim_type": "abstract", "evidence_strength": "source_metadata"}],
          "why_cited": "Pith tracks Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training because it crossed a citation-hub threshold.",
          "role_counts": []
        },
        "error": null,
        "updated_at": "2026-05-14T18:20:11.518098+00:00"
      }
    },
    "summary": {
      "title": "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training",
      "claims": [{"claim_text": "Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stat", "claim_type": "abstract", "evidence_strength": "source_metadata"}],
      "why_cited": "Pith tracks Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training because it crossed a citation-hub threshold.",
      "role_counts": []
    },
    "graph": {
      "co_cited": [
        {"title": "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain", "work_id": "7b1cd3ac-9abd-4579-8d13-c75d30c83a5f", "shared_citers": 9},
        {"title": "Constitutional AI: Harmlessness from AI Feedback", "work_id": "faaaa4e0-2676-4fac-a0b4-99aef10d2095", "shared_citers": 9},
        {"title": "Alignment faking in large language models", "work_id": "cc253a89-cda1-4889-9631-bf3ce8147650", "shared_citers": 7},
        {"title": "Universal and Transferable Adversarial Attacks on Aligned Language Models", "work_id": "3322fa86-1768-4677-8425-dd326b45e078", "shared_citers": 7},
        {"title": "Prompt Injection attack against LLM-integrated Applications", "work_id": "977b4683-bba6-49d6-8f3d-496c41cb7fac", "shared_citers": 5},
        {"title": "The Llama 3 Herd of Models", "work_id": "1549a635-88af-4ac1-acfe-51ae7bb53345", "shared_citers": 5},
        {"title": "Concrete Problems in AI Safety", "work_id": "c8d14fbe-6eab-464a-95b3-778aabd82fa3", "shared_citers": 4},
        {"title": "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", "work_id": "8b07137a-7175-4dd1-a8d9-570493d3f404", "shared_citers": 4},
        {"title": "Frontier models are capable of in-context scheming", "work_id": "1372f9da-8fac-4446-ba43-4e9e053c8b28", "shared_citers": 4},
        {"title": "HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal", "work_id": "b0b0303f-2444-4789-a979-8153624312ff", "shared_citers": 4},
        {"title": "Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations", "work_id": "93844332-869b-448c-a1be-35466150b1b2", "shared_citers": 4},
        {"title": "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback", "work_id": "a1f2574b-a899-4713-be60-c87ba332656c", "shared_citers": 4},
        {"title": "Auditing language models for hidden objectives", "work_id": "a479b1c8-1808-4da7-b836-0a4d85dcba6b", "shared_citers": 3},
        {"title": "Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs", "work_id": "27d5f019-1fc8-47e4-bc86-3f09a2569685", "shared_citers": 3},
        {"title": "Feder Cooper, Daphne Ippolito, Christopher A", "work_id": "7ee4de98-0bdd-47ab-abe6-1865cb65b1ae", "shared_citers": 3},
        {"title": "GPT-4 Technical Report", "work_id": "b928e041-6991-4c08-8c81-0359e4097c7b", "shared_citers": 3},
        {"title": "Jailbreaking Black Box Large Language Models in Twenty Queries", "work_id": "38678cda-6595-4ca3-916b-066c00cce063", "shared_citers": 3},
        {"title": "Llama 2: Open Foundation and Fine-Tuned Chat Models", "work_id": "68a5177f-d644-44c1-bd4f-4e5278c22f5d", "shared_citers": 3},
        {"title": "Qwen2.5 Technical Report", "work_id": "d8432992-4980-4a81-85c7-9fa2c2b87f85", "shared_citers": 3},
        {"title": "Reasoning models don’t always say what they think", "work_id": "b9bdcbf5-9ae0-464c-b1a6-de04f85a6e33", "shared_citers": 3},
        {"title": "Risks from learned optimization in advanced machine learning systems", "work_id": "871c0bb7-e08b-4d8b-be76-610707c748dd", "shared_citers": 3},
        {"title": "Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning", "work_id": "bb1fb326-f0f6-4c72-a4d2-eb7f0707b971", "shared_citers": 3},
        {"title": "Towards Understanding Sycophancy in Language Models", "work_id": "aeefec9a-6ad5-4743-92b9-de6983895e21", "shared_citers": 3},
        {"title": "Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, et al", "work_id": "37d1dcb3-68ff-43ea-b3da-cb0a4bb30284", "shared_citers": 3}
      ],
      "time_series": [{"n": 1, "year": 2024}, {"n": 33, "year": 2026}],
      "dependency_candidates": []
    },
    "authors": []
  }
}