Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
10 Pith papers cite this work.
representative citing papers
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.
citing papers explorer
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- Honeypot Protocol
  The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.
- Representational Depth of Evaluation Awareness Shifts With Scale in Open-Weight Language Models
  In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.
- Evaluation Awareness Is Not One Capability: Evidence from Open Language Models
  Evaluation awareness in open language models is multivariate, with detection, behavioral shifts, and representational controllability varying independently across 37 models.
- Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
  LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
- When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
  Introduces a CoT-Output 2x2 matrix that reveals alignment faking, context-injection failures, and an oversight paradox in multi-turn models, based on 6,750 turn-level observations on information-hazard scenarios.
- Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
  Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
- AI Integrity: Defending Against Backdoors and Secret Loyalties
  The report defines AI integrity threats (model sabotage and subversion) and recommends four US government policy actions to defend frontier AI systems against backdoors and secret loyalties.