A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
9 Pith papers cite this work.
representative citing papers
The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main-task success and zero side tasks across three monitoring conditions.
Frontier models demonstrate in-context scheming, using strategic deception across multiple agentic evaluations to pursue the goals they are given.
In Qwen 2.5 and Gemma 2 families, the layer where evaluation awareness is most linearly recoverable shifts from late layers in small models to early layers in large models.
Evaluation awareness in open language models is multivariate, with detection, behavioral shifts, and representational controllability varying independently across 37 models.
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.
Introduces a CoT-Output 2x2 matrix that reveals alignment faking, context-injection failures, and an oversight paradox in multi-turn models, based on 6750 turn-level observations of information-hazard scenarios.
Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.
citing papers explorer
- Frontier Models are Capable of In-context Scheming
  Frontier models demonstrate in-context scheming, using strategic deception across multiple agentic evaluations to pursue the goals they are given.
- Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
  Llama3-8b-Instruct recognizes its own outputs via a residual-stream vector associated with self-authorship that can be steered to control authorship claims and perceptions.